| Commit message (Collapse) | Author | Age |
| ... | |
|
Indeed it is. Bumped up the default max segment length to decrease
error.
Also add mic presets for beyond (the VR headset) and motu (my mic
interface).
|
This reverts commit 921b92a69f36502dc5eefd14ba3487c1bb49bb9d.
|
Seems much faster than faster-whisper.
There are two issues:
* Requires NVIDIA 3000 series or higher.
* Incompatible with faster-whisper dependencies.
So it seems like we'll either need to toggle between two sets of
dependencies at runtime or have two environments.
|
Paging is now slower but more reliable.
|
I converted distil-whisper-medium.en to CTranslate2 format and uploaded
it to Hugging Face. This model is exceptionally fast and light compared
to the non-distilled version, at the cost of some accuracy.
|
When hot-miking into the built-in chatbox, there are sometimes long
pauses in conversation. After these pauses, it's undesirable to show the
transcript generated before the pause. This feature allows those
transcripts to be dropped.
Also:
* Limit the number of segments sent to the browser source. The list may
grow to 10 segments, at which point the first 5 segments are dropped.
* Silence warnings generated by `install_in_venv`, used by e.g.
translation codepath.
* Enable audio normalization to improve accuracy when speaking softly,
at the cost of some accuracy when speaking normally.
Credit: user endo0269 on Discord suggested this feature.
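A minimal sketch of the browser-source segment cap from the list above. It assumes the trim happens when the list would exceed 10 segments; `trim_segments` and its parameters are hypothetical names, not the project's actual code.

```python
def trim_segments(segments, max_len=10, drop=5):
    """Drop the oldest `drop` segments once the list exceeds `max_len`.

    Assumption: the trim triggers when the list grows past the limit.
    """
    if len(segments) > max_len:
        return segments[drop:]
    return segments


# Feed 12 segments through the cap, one at a time.
segments = []
for i in range(12):
    segments = trim_segments(segments + [f"segment {i}"])
```

Dropping several segments at once, rather than one per update, means the list is only rewritten occasionally instead of on every new segment.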
|
BrowserSource now fades text out continuously over time.
TODO
* Delete C++ webserver, browsersource, transcript code
* Add UI for text age fading
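One way continuous fading by text age could be computed, shown as a sketch only. The linear ramp, the timing defaults, and the name `opacity_for_age` are all assumptions; the commit does not say how the fade is implemented.

```python
def opacity_for_age(age_s, fade_start_s=5.0, fade_duration_s=10.0):
    """Map the age of a piece of text (seconds) to an opacity in [0, 1].

    Assumption: text stays fully opaque for `fade_start_s`, then fades
    linearly to transparent over `fade_duration_s`.
    """
    if age_s <= fade_start_s:
        return 1.0
    faded = (age_s - fade_start_s) / fade_duration_s
    return max(0.0, 1.0 - faded)
```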
|
Default is normal prio.
|
* uwu filter no longer adds extra whitespace before/after segments. The
extra whitespace would defeat the commit logic.
* disabling phonemes works again - path to prefab was being quoted
twice, breaking the codepath.
|
Remove unused proxy code, curl, and images.
|
Oops :)
|
0.17.x are breaking faster_whisper's ability to download models.
Also:
* Start using frozen requirements.txt.
* Conditionally install torch & legacy whisper only when doing
mechanical optimization.
|
... and restructure RemoveTrailingPeriod as a filter instead of as a
plugin.
Plugins have the power to change transcription data as it comes along,
but don't have access to the entire transcript. Filters have access to
the entire transcript but can't durably change it.
TODO
* This does not work with data passed through OSC
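The plugin/filter distinction above can be sketched as follows. The class names, method names, and the lowercase example are hypothetical; only the split in responsibilities comes from the commit message.

```python
from typing import List


class LowercasePlugin:
    """Hypothetical plugin: sees one segment at a time and may rewrite it."""

    def on_segment(self, segment: str) -> str:
        return segment.lower()


class RemoveTrailingPeriodFilter:
    """Hypothetical filter: sees the whole transcript, returns a modified copy."""

    def apply(self, transcript: str) -> str:
        return transcript[:-1] if transcript.endswith(".") else transcript


stored: List[str] = []
plugin = LowercasePlugin()
for segment in ["Hello there.", "How are you."]:
    # The plugin's change is durable: the rewritten segment is what gets stored.
    stored.append(plugin.on_segment(segment))

# The filter only changes what is shown; `stored` keeps its trailing period.
shown = RemoveTrailingPeriodFilter().apply(" ".join(stored))
```

Removing a trailing period only makes sense with the full transcript in view, since a per-segment plugin cannot know which segment is last.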
|
OSC was paging using incorrect board resolution. Use cfg to provide this
data.
|
Because the custom chatbox doesn't necessarily have an even multiple of
`sync_params` character slots, some layers in the animator write N
character slots while others write N-1. The layers with only N-1 slots
need something to do while slot N is being selected. This patch creates
a return-home transition for that case.
|
Oops, I meant to check this in a while back.
Since transcribe_v2.py now has feature parity with transcribe.py, delete
the old code.
|
... and use it to implement translation and text filters.
Also fix display of non-English characters in browser src.
|
Actually retain the whole transcript to avoid breaking the OSC pager.
Also constrain the UI buffer size by characters instead of lines. Since
some lines can be massive and others short, characters are a better way
of consistently keeping the UI memory in check.
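A sketch of bounding a line buffer by total characters rather than line count. `CharBoundedBuffer` is a hypothetical name, and keeping the newest line even when it alone exceeds the budget is an assumption.

```python
from collections import deque


class CharBoundedBuffer:
    """Keep the most recent lines whose combined length fits a character budget."""

    def __init__(self, max_chars):
        self.max_chars = max_chars
        self.lines = deque()
        self.num_chars = 0

    def append(self, line):
        self.lines.append(line)
        self.num_chars += len(line)
        # Evict the oldest lines until under budget, always keeping the newest.
        while self.num_chars > self.max_chars and len(self.lines) > 1:
            self.num_chars -= len(self.lines.popleft())


buf = CharBoundedBuffer(max_chars=10)
for line in ["aaaa", "bbbb", "cccc"]:
    buf.append(line)
```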
|
Allows users to directly modulate the performance-latency tradeoff.
Also:
* Bump up UI buffer to 1k lines.
* Fix browser source reset. It now also resets preview text.
|
Improves viewer experience.
|
Also fix bug when not using previews. Audio buffer no longer grows
without bound while there's no speech.
|
Log file is constrained to 1 MB and UI to 100-200 lines. 1k lines is too
high to keep the UI from lagging.
Transcript is constrained to 4k characters.
Also put a 5 ms sleep in the transcription hot path.
|
This keeps memory usage from growing without bound.
|
I find it kind of annoying when people wave around a big chatbox so I
added the option to have the chatbox be locked in worldspace whenever
it's visible. This defaults to on and can be disabled.
|
It now waits up to 10 seconds for a graceful exit and falls back on
the equivalent of a SIGKILL. The caller is assumed to have signaled to the
process through `in_cb` that an exit is desired.
Also:
* Fix graceful exit path of transcribe_v2.py.
* Add toggle to enable/disable preview text. It is enabled by default.
* Constrain transcription temperature to 0.0. This keeps latency more
predictable at the cost of some accuracy.
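The wait-then-kill behavior described at the top of this message can be sketched with the standard `subprocess` module. `stop_process` is a hypothetical helper. The `in_cb` signaling is not modeled; the sketch assumes the caller has already asked the process to exit.

```python
import subprocess
import sys


def stop_process(proc, timeout_s=10.0):
    """Wait for `proc` to exit on its own; kill it if the timeout expires.

    Assumes the caller has already signaled the process to exit.
    Returns True if the process exited within the timeout.
    """
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; TerminateProcess on Windows
        proc.wait()  # reap the process so its return code is set
        return False


# Example: a child that ignores the (imaginary) exit request gets killed.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exited_gracefully = stop_process(child, timeout_s=0.2)
```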
|
This makes them more reliable.
|
The non-text OSC messages were paging in too close to the text OSC
messages, breaking the whole system. Now the non-text OSC messages bump
back the time at which text OSC messages can begin being sent.
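A sketch of the bump-back rule described above. The class name, method names, and the 0.5 s gap are hypothetical; the commit only states that non-text messages delay the earliest time at which text messages may be sent.

```python
class OscSendSchedule:
    """Track the earliest time (seconds) at which a text message may be sent."""

    def __init__(self, min_gap_s=0.5):
        self.min_gap_s = min_gap_s  # assumed spacing, not the project's value
        self.next_text_time = 0.0

    def on_non_text_sent(self, now):
        # Each non-text message pushes back the earliest allowed text send.
        self.next_text_time = max(self.next_text_time, now + self.min_gap_s)

    def can_send_text(self, now):
        return now >= self.next_text_time


schedule = OscSendSchedule(min_gap_s=0.5)
schedule.on_non_text_sent(now=1.0)
```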
|
Also:
* DiskStream starts returning silence when out of data instead of just
stopping.
* Filter out Whisper segments with high `no_speech_prob` and low
`avg_logprob`.
* Add `saveAudio` function, useful for debugging.
* Tune VAD silence cutoff to 250 ms. This is pretty accurate in
benchmarks.
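A sketch of the segment filter mentioned in the list above. The `Segment` dataclass here is a stand-in with the same two field names. The threshold values are placeholders, and requiring both conditions before dropping a segment is one reading of "high `no_speech_prob` and low `avg_logprob`".

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Stand-in for a Whisper segment; only the fields used here."""
    text: str
    no_speech_prob: float
    avg_logprob: float


def keep_segment(seg, max_no_speech_prob=0.6, min_avg_logprob=-1.0):
    """Drop a segment only when it looks like silence AND has low confidence."""
    likely_silence = seg.no_speech_prob > max_no_speech_prob
    low_confidence = seg.avg_logprob < min_avg_logprob
    return not (likely_silence and low_confidence)


segments = [
    Segment("hello", no_speech_prob=0.05, avg_logprob=-0.2),
    Segment("Thanks for watching!", no_speech_prob=0.9, avg_logprob=-1.5),
]
kept = [s.text for s in segments if keep_segment(s)]
```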
|
Also parameterize `min_silence_duration_ms` in AudioSegmenter. I suspect
that for conversational speech, segmenting closer to 500 ms (rather than
the 2000 ms default) is a better tradeoff between accuracy and
compute efficiency.
|
No longer needed.
|
FuzzyRepeatCommitter was approximating this behavior in the
best-performing configuration, so switch to it in earnest.
This committer simply commits audio once we detect a long enough gap in
speech. That's it!
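A sketch of gap-based committing over a stream of per-frame voice-activity flags. The function name, the frame size, and the gap threshold are assumptions; the real committer operates on audio, not on precomputed booleans.

```python
def find_commit_points(is_speech, frame_ms=50, min_gap_ms=250):
    """Return the frame indices at which buffered speech would be committed.

    A commit fires once silence following speech has lasted `min_gap_ms`.
    """
    commit_points = []
    silence_ms = 0
    have_speech = False
    for i, speech in enumerate(is_speech):
        if speech:
            have_speech = True
            silence_ms = 0
        else:
            silence_ms += frame_ms
            if have_speech and silence_ms >= min_gap_ms:
                commit_points.append(i)
                have_speech = False  # nothing left to commit until more speech
    return commit_points


# 4 frames of speech, 6 of silence, 2 of speech, 2 of silence.
frames = [True] * 4 + [False] * 6 + [True] * 2 + [False] * 2
points = find_commit_points(frames)
```

In this example only the first utterance is committed; the second is followed by too short a gap.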
|
This logic is highly IO bound *and* latency critical so it makes sense to put
it into its own thread.
Also:
* Collector::drop* methods return the dropped audio. Committer includes
that audio in commits. Transcription thread holds onto it. When the
user segments their speech with a button press, the transcription
thread sends the entire combined audio of all commits over to Whisper
to be transcribed. This allows us to recover from errors introduced
by segmentation.
* Remove unused animator params
* Fix issue where clearing the board doesn't completely reset STT state
TODO:
* Coalescing does not occur for in-place updates. It should.
|
Also:
* Enable SO_REUSEADDR on browser src socket
* Temporarily add evaluation dependencies to requirements.txt
* Fix browser src. It's now looking for a prefix that the Python app
actually uses.
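For reference, enabling `SO_REUSEADDR` on a listening socket looks like this with Python's standard `socket` module. The helper name, host, and port are placeholders, and the option's exact semantics are platform-dependent.

```python
import socket


def make_listen_socket(host="127.0.0.1", port=0):
    """Create a TCP listening socket with SO_REUSEADDR enabled.

    port=0 asks the OS for any free port; a real server would use a fixed one.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() to affect the bind.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock
```

Setting the option lets a restarted server rebind its port without waiting for old connections to leave `TIME_WAIT`.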
|
Four threads:
* Main thread
* Transcription (mic -> collector -> whisper -> committer -> pager)
* VR input
* Keyboard input
Also:
* add OscPager class to encapsulate all OSC interactions.
* bump `last_n_must_match` from 2 to 3 to reduce hallucinations
|
This has a slight positive effect on my benchmark.
|
Try adding two filters on top of the usual AudioCollector:
* Minimum length preservation: never report fewer than N seconds worth
of audio data. Pad with silence as needed.
* Volume normalizing: normalize audio volume.
Using my benchmark of 30-second audio clips from 3 speakers (lower is
better):
  length enf + norm =  87.118
  nothing           =  90.917
  norm              =  94.538
  length            = 111.402
Both together are a slight improvement, but independently degrade the
result by a lot. I also observed more hallucinations in a conversational
pattern when using them vs. not. So I'll phase them out.
I'm still curious about *compression* as opposed to normalization.
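A sketch of the two filters tried above, on plain lists of float samples. The function names are hypothetical, and peak normalization is an assumption; the commit does not say which kind of normalization was used.

```python
def pad_to_min_length(samples, sample_rate, min_seconds):
    """Pad with trailing silence so at least `min_seconds` of audio is returned."""
    min_samples = int(sample_rate * min_seconds)
    return samples + [0.0] * max(0, min_samples - len(samples))


def normalize_peak(samples, target_peak=1.0):
    """Scale samples so the loudest one has magnitude `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    return [s * target_peak / peak for s in samples]


padded = pad_to_min_length([0.5], sample_rate=4, min_seconds=1.0)
normalized = normalize_peak([0.25, -0.5])
```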
|
A set of proper interfaces is called for. See #dev-update-spam on
Discord for a drawing of the design.
Also add code to mechanically optimize committer parameters using an
audio file. Not perfectly repeatable since it depends on the performance
characteristics of the machine, but probably better than what we had
before (nothing).
|
Fix how the OnExit callback is wired into the GUI. Also make it exit the
Unity process, if one is running.
|
Oops, I meant to check these in earlier!
|
Also:
* Fully scrub AudioSource references from prefab when not using
phonemes.
* Disable net sync on phoneme params when not using them. When not
synced, they don't count against the total memory limit.
* Use config file in generate_params.py
|
If not set, the prefab will have its audio sources removed.
|