| Commit message | Author | Age |
|
Also parameterize `min_silence_duration_ms` in AudioSegmenter. I suspect
that for conversational speech, segmenting closer to 500 ms (rather than
the 2000 ms default) is a better tradeoff between accuracy and
compute efficiency.
|
No longer needed.
|
FuzzyRepeatCommitter was approximating this behavior in the
best-performing configuration, so switch to it in earnest.
This committer simply commits audio once we detect a long enough gap in
speech. That's it!
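
A minimal sketch of the idea, not the actual committer: `GapCommitter`,
`push`, and `min_gap_ms` are hypothetical names, and the speech/silence flag
is assumed to come from some upstream voice activity detector.

```python
class GapCommitter:
    """Hypothetical sketch: commit buffered audio after a long enough silence."""

    def __init__(self, min_gap_ms=500):
        self.min_gap_ms = min_gap_ms
        self._buffer = []         # chunks collected since the last commit
        self._silence_ms = 0      # length of the current trailing silence
        self._has_speech = False  # whether the buffer contains any speech

    def push(self, chunk, duration_ms, is_speech):
        """Add one chunk; return the committed chunks, or None if no commit."""
        if not is_speech and not self._has_speech:
            return None  # ignore silence that precedes any speech
        self._buffer.append(chunk)
        if is_speech:
            self._has_speech = True
            self._silence_ms = 0
            return None
        self._silence_ms += duration_ms
        if self._silence_ms >= self.min_gap_ms:
            # The gap is long enough: hand over the buffer and start fresh.
            committed, self._buffer = self._buffer, []
            self._silence_ms = 0
            self._has_speech = False
            return committed
        return None
```

Feeding it 100 ms chunks with `min_gap_ms=300` commits on the third
consecutive silent chunk after speech.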
|
This logic is heavily I/O-bound *and* latency-critical, so it makes sense to
put it into its own thread.
Also:
* Collector::drop* methods return the dropped audio. Committer includes
that audio in commits. Transcription thread holds onto it. When the
user segments their speech with a button press, the transcription
thread sends the entire combined audio of all commits over to Whisper
to be transcribed. This allows us to recover from errors introduced
by segmentation.
* Remove unused animator params
* Fix issue where clearing the board doesn't completely reset STT state
TODO:
* Coalescing does not occur for in-place updates. It should.
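
Rough single-threaded sketch of the dropped-audio flow described above. The
class and method names (`Collector.drop_prefix`, `CommitHistory`,
`take_combined`) are made up for illustration and are not the real
interfaces; samples are plain Python lists here.

```python
class Collector:
    """Hypothetical stand-in for the real collector: a growing sample buffer."""

    def __init__(self):
        self._samples = []

    def extend(self, samples):
        self._samples.extend(samples)

    def drop_prefix(self, n):
        """Drop the first n samples and return them instead of discarding them."""
        dropped, self._samples = self._samples[:n], self._samples[n:]
        return dropped


class CommitHistory:
    """Holds the audio of every commit since the last user-triggered segment."""

    def __init__(self):
        self._commits = []

    def add(self, audio):
        self._commits.append(audio)

    def take_combined(self):
        """Return all committed audio as one buffer and reset the history."""
        combined = [s for commit in self._commits for s in commit]
        self._commits = []
        return combined
```

On a button press, the result of `take_combined()` is what would be sent for
one full re-transcription, so a bad cut between two commits no longer matters.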
|
Also:
* Enable SO_REUSEADDR on browser src socket
* Temporarily add evaluation dependencies to requirements.txt
* Fix browser src: it now looks for a prefix that the Python app
actually uses.
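
For reference, enabling SO_REUSEADDR with Python's standard `socket` module
looks roughly like this. The helper name is hypothetical and a TCP socket is
assumed; the option has to be set before `bind()`.

```python
import socket


def make_reusable_socket():
    """Create a TCP socket with SO_REUSEADDR set (not yet bound)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow rebinding the address soon after a restart instead of failing
    # with "Address already in use" while old connections sit in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return sock
```

The caller then does `sock.bind((host, port))` and `sock.listen()` as usual.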
|
Four threads:
* Main thread
* Transcription (mic -> collector -> whisper -> committer -> pager)
* VR input
* Keyboard input
Also:
* Add OscPager class to encapsulate all OSC interactions.
* Bump `last_n_must_match` from 2 to 3 to reduce hallucinations
|
This has a slight positive effect on my benchmark.
|
Try adding two filters on top of the usual AudioCollector:
* Minimum length preservation: never report fewer than N seconds worth
of audio data. Pad with silence as needed.
* Volume normalizing: normalize audio volume.
Using my benchmark of 30-second audio clips from 3 speakers (lower is
better):
  length + norm =  87.118
  nothing       =  90.917
  norm          =  94.538
  length        = 111.402
Together the two filters are a slight improvement over no filtering, but
each one on its own degrades the result, length preservation by a lot. I
also observed more hallucinations in a conversational pattern with the
filters than without, so I'll phase them out.
I'm still curious about *compression* as opposed to normalization.
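
Sketch of the two filters as pure functions, assuming float samples in
[-1.0, 1.0]. The function names are hypothetical, and peak normalization is
only one possible reading of "normalize audio volume"; the real filters may
work differently.

```python
def pad_to_min_length(samples, sample_rate, min_seconds):
    """Append trailing silence so at least min_seconds of audio is reported."""
    min_len = int(sample_rate * min_seconds)
    missing = max(0, min_len - len(samples))
    return list(samples) + [0.0] * missing


def normalize_peak(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Both return new lists, so they can be stacked on a collector's output in
either order without mutating the collected audio.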
|
A set of proper interfaces is called for. See #dev-update-spam in
Discord for a drawing of the design.
Also add code to mechanically optimize committer parameters using an
audio file. Not perfectly repeatable since it depends on the performance
characteristics of the machine, but probably better than what we had before
(nothing).
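
One way such a mechanical search could look, as an exhaustive grid search.
This is an assumption, not necessarily what the optimizer does: `grid_search`
and `score_fn` are hypothetical, and `score_fn` stands in for "run the
pipeline on the audio file and return an error score, lower is better".

```python
import itertools


def grid_search(score_fn, param_grid):
    """Try every parameter combination; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)  # lower is better
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Because the real score depends on machine timing, repeating each evaluation
and averaging would make the comparison less noisy, at the cost of runtime.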
|