| Commit message | Author | Age |
| | |
I converted distil-whisper-medium.en to CTranslate2 format and uploaded
it to Hugging Face. This model is exceptionally fast and light compared
to the non-distilled version, at the cost of some accuracy.
|
When hot-miking into the built-in chatbox, there are sometimes long
pauses in conversation. After these pauses, it's undesirable to show the
transcript generated before the pause. This feature makes it so that
those transcripts can be dropped.
Also:
* Limit number of segments sent to browser source to 10. Allow this to
grow up to 10 segments before dropping the first 5 segments.
* Silence warnings generated by `install_in_venv`, used by e.g.
translation codepath.
* Enable audio normalization to improve accuracy when speaking softly,
at the cost of some accuracy when speaking normally.
Credit: user endo0269 on Discord suggested this feature.
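
A minimal sketch of the segment cap described above. `cap_segments` and its
parameters are hypothetical names, and dropping only once the list exceeds the
cap is an assumption about the exact trigger.

```python
def cap_segments(segments, max_segments=10, drop_count=5):
    """Return the segment list, dropping the oldest entries once it overflows.

    Assumption: the list may hold up to max_segments entries; when it grows
    past that, the first drop_count (oldest) segments are discarded.
    """
    if len(segments) > max_segments:
        return segments[drop_count:]
    return segments
```

Dropping in batches of 5 rather than one at a time means the browser source
text only jumps occasionally instead of on every new segment.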
|
BrowserSource now fades text out continuously over time.
TODO
* Delete C++ webserver, browsersource, transcript code
* Add UI for text age fading
|
Default is normal prio.
|
* uwu filter no longer adds extra whitespace before/after segments. This
would defeat commit logic.
* disabling phonemes works again - path to prefab was being quoted
twice, breaking the codepath.
|
Remove unused proxy code, curl, and images.
|
0.17.x are breaking faster_whisper's ability to download models.
Also:
* Start using frozen requirements.txt.
* Conditionally install torch & legacy whisper only when doing
mechanical optimization.
|
Actually retain the whole transcript to avoid breaking the OSC pager.
Also constrain the UI buffer size by characters instead of lines. Since
some lines can be massive and others short, characters are a better way
of consistently keeping the UI memory in check.
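
To illustrate the character-based limit, here is a rough sketch. The function
name and the choice to always keep the newest line are assumptions, and the
real UI code may be structured differently.

```python
def trim_to_char_budget(lines, max_chars):
    """Drop the oldest lines until the total character count fits the budget.

    The newest line is always kept, even if it alone exceeds max_chars
    (an assumption made for this sketch).
    """
    total = sum(len(line) for line in lines)
    start = 0
    while start < len(lines) - 1 and total > max_chars:
        total -= len(lines[start])
        start += 1
    return lines[start:]
```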
|
Allows users to directly modulate the performance-latency tradeoff.
Also:
* Bump up UI buffer to 1k lines.
* Fix browser source reset. It now also resets preview text.
|
Improves viewer experience.
|
Also fix bug when not using previews. Audio buffer no longer grows
without bound while there's no speech.
|
Log file is constrained to 1 MB and UI to 100-200 lines. 1k lines is too
high to keep the UI from lagging.
Transcript is constrained to 4k characters.
Also put a 5 ms sleep in the transcription hot path.
|
This keeps memory usage from growing without bound.
|
I find it kind of annoying when people wave around a big chatbox so I
added the option to have the chatbox be locked in worldspace whenever
it's visible. This defaults to on and can be disabled.
|
It now waits up to 10 seconds for a graceful exit and falls back on
the equivalent of a SIGKILL. The caller is assumed to have signaled to the
process through `in_cb` that an exit is desired.
Also:
* Fix graceful exit path of transcribe_v2.py.
* Add toggle to enable/disable preview text. It is enabled by default.
* Constrain transcription temperature to 0.0. This keeps latency more
predictable at the cost of some accuracy.
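
The wait-then-kill behavior can be sketched with Python's `subprocess` module.
`stop_process` is a hypothetical helper, not the function in the codebase, and
it assumes the caller has already asked the process to exit.

```python
import subprocess


def stop_process(proc, timeout_s=10.0):
    """Wait for proc to exit on its own, force-killing it on timeout.

    Returns True if the process exited gracefully, False if it was killed.
    """
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
        proc.wait()  # reap the process so it does not linger
        return False
```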
|
FuzzyRepeatCommitter was approximating this behavior in the
best-performing configuration, so switch to it in earnest.
This committer simply commits audio once we detect a long enough gap in
speech. That's it!
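
As a sketch of the idea (not the actual committer): assume a voice activity
flag per audio chunk and commit once the trailing run of silence is long
enough. All names and the per-chunk flag representation are assumptions.

```python
def trailing_silence_s(is_speech, chunk_s):
    """Seconds of silence at the end of a per-chunk speech/silence flag list."""
    silent_chunks = 0
    for flag in reversed(is_speech):
        if flag:
            break
        silent_chunks += 1
    return silent_chunks * chunk_s


def should_commit(is_speech, chunk_s, min_gap_s):
    """Commit only if some speech was heard and the trailing gap is long enough."""
    return any(is_speech) and trailing_silence_s(is_speech, chunk_s) >= min_gap_s
```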
|
Also:
* Enable SO_REUSEADDR on browser src socket
* Temporarily add evaluation dependencies to requirements.txt
* Fix browser src. It's now looking for a prefix that the python app
actually uses.
|
Fix how the OnExit callback is wired into the GUI. Also make it terminate
the Unity process, if one is running.
|
Also:
* Fully scrub AudioSource references from prefab when not using
phonemes.
* Disable net sync on phoneme params when not using them. When not
synced, they don't count against the total memory limit.
* Use config file in generate_params.py
|
If not set, the prefab will have its audio sources removed.
|
Duplicating config between args and config is a huge pain in the ass to
maintain. Now we just launch using the config generated by the UI. ezpz.
|
wxWidgets encodes text inputs & multiple-choice inputs as strings. I
frequently have to convert these into ints & apply a range check.
Encapsulate that in a function and use a shitty little ASSIGN_OR_RETURN
macro to make the parsing as concise as possible.
Also delete unused WhisperCPP config settings.
|
* Temporarily restore normal process priority. Working on adding a UI
option to set STT prio.
* Give audio indicator phonemes a 1/3 chance to do nothing. Makes result
sound a little better imo.
* Quiet down SteamVR thread when SteamVR isn't running
* Fix use of `button_id` and `hand_id` in steamvr.py
* Increase amount of silence allowed before transcript from 1 to 5
seconds. You want enough buffer to allow for a few full transcripts,
else you risk spuriously dropping audio.
* Enable background loading in audio metadata (required by vrc sdk)
|
This is now dynamically set inside transcribe.py.
As the buffer grows long, the threshold grows exponentially, keeping the
buffer short. The threshold starts small so that transcription starts
strict (accurate, slow) and gets looser (inaccurate, fast) as needed.
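
One way such a threshold could be computed, purely as an illustration. The
function name, the base value, and the doubling interval are all assumptions;
transcribe.py may use a different formula and different constants.

```python
def dynamic_threshold(buffer_s, base=1.0, doubling_s=5.0):
    """Threshold that starts at `base` and doubles every `doubling_s` seconds
    of buffered audio (hypothetical constants)."""
    return base * 2.0 ** (buffer_s / doubling_s)
```

With these made-up constants, an empty buffer gets the strict starting
threshold and every additional 5 seconds of buffered audio doubles it.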
|
Also fix prefab default size (no longer colossal).
TODO
* Add runtime & unity-time toggles
|
pyopenvr is both deprecated and buggy, so switch to pyopenxr.
|
Not yet done:
* Animator toggle
* OSC integration
|
Fix up .mat to point to correct textures/shader. Also delete templates
after copying shaders.
|
Specify file encoding when generating shaders.
|
* Fix mirror behavior for ray-marched chatbox
|
* Refactor shader code to make development easier. Templates are now
as small as possible.
* Update scaling code. Use Unity scaling instead of a blendshape.
* Check in a fuckton of shader FOSS. Mostly unused.
* Update TaSTT.fbx. Now has 6 faces instead of 2.
|
GUI was not correctly managing .meta files, causing two textures to use
the same GUID. Unity would notice and regenerate GUIDs, breaking the
custom chatbox material's texture references.
|
Add two buttons: start auto re-generation of Unity assets, and stop.
These start/stop a thread which periodically (every 3 seconds) hashes
the user-provided animator, menu and parameters. When any one of these
changes, it invokes the function to generate Unity assets.
The hash is non-cryptographic, so it's light. The only hit is that we
have to read the entire file contents every few seconds, and compute a
sum across that entire memory region. This is extremely light unless
you're on a spinning platter hard drive with a small cache.
Still seeing the bug where the material drops ref to the font bitmaps.
Probably need to update the .mat using the guids in the bitmap .meta
files.
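
The change detection could be sketched like this in Python. Adler-32 from
`zlib` stands in for whatever non-cryptographic hash the app actually uses,
and the helper names are hypothetical.

```python
import zlib


def file_checksum(path):
    """Non-cryptographic checksum (Adler-32) over a file's entire contents."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum


def snapshot(paths):
    """Map each path to its current checksum."""
    return {path: file_checksum(path) for path in paths}


def changed_paths(previous, current):
    """Paths whose checksum differs between two snapshots."""
    return [p for p in current if previous.get(p) != current[p]]
```

A watcher thread would take a snapshot every 3 seconds and regenerate the
Unity assets whenever `changed_paths` returns a non-empty list.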
|
Avoid deleting bitmap .meta files so that once the user sets up their
shader, it doesn't break.
|
Useful for projects with multiple avatars with different animators.
|
The paths you enter in the Unity panel (animator, menu, params, and
assets folder) are saved in the app config, but were not populated
correctly on app restart or pane redraw. Now they are.
|
Create a simple server with 3 endpoints:
* /create_session: Create a session and return its identifier.
* /set_transcript: Update a session's transcript.
* /get_transcript: Fetch a session's transcript.
Right now the session ID provides authentication *and* authorization.
There is no public/private ID so you have to trust whoever you share
your ID with.
IDs are long and generated by the server, so it should be somewhat
secure against low-effort hacking.
Other updates:
* Drop whisper_requirements.txt - no longer needed.
* Vendor curl to make it easier to interact with the server.
TODO:
* Fuzz test the server.
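
The session model behind the three endpoints can be sketched as an in-memory
store. This is not the server's code: the class name, the 32-byte token
length, and the use of Python's `secrets` module are assumptions, and the HTTP
layer is omitted.

```python
import secrets


class SessionStore:
    """In-memory transcripts keyed by a long, server-generated session ID."""

    def __init__(self):
        self._transcripts = {}

    def create_session(self):
        # The ID is the only credential, so make it long and unguessable.
        session_id = secrets.token_urlsafe(32)
        self._transcripts[session_id] = ""
        return session_id

    def set_transcript(self, session_id, text):
        # Unknown IDs are rejected; knowing the ID is what grants access.
        if session_id not in self._transcripts:
            return False
        self._transcripts[session_id] = text
        return True

    def get_transcript(self, session_id):
        # Returns None for unknown IDs.
        return self._transcripts.get(session_id)
```

Because one ID both reads and writes, anyone it is shared with can also
overwrite the transcript, which is the trust tradeoff noted above.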
|
Circle goes red when speaking, grey when done. Ideally it would be in
the top right portion of the browser source, but this is a good start.
Also, hard-cap transcripts to 4096 chars. This prevents the STT from
lagging during long sessions.
|
Audio data is stored in chunks of frames, not in individual frames.
When I commit a transcript, I want to get rid of the portion of the
audio data responsible for that particular transcript. I have code that
does this, but it was dropping a slice of the list assuming that each
sample is stored individually.
Extra fun: Because we have to decimate mic frames, we have to convert
between whisper frames and mic frames to drop the correct amount of
audio data.
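
A sketch of the corrected drop logic, under stated assumptions: chunks are
lists of mic frames, and the mic and whisper sample rates (48000 and 16000
here) are illustrative defaults, not values confirmed by the code.

```python
def drop_committed_audio(chunks, whisper_frames, mic_rate=48000, whisper_rate=16000):
    """Drop the audio behind a committed transcript from a list of chunks.

    `whisper_frames` is counted at the whisper sample rate, so it is first
    converted to mic frames; whole chunks are then dropped and the chunk
    straddling the boundary is split.
    """
    to_drop = whisper_frames * mic_rate // whisper_rate
    remaining = []
    for chunk in chunks:
        if to_drop >= len(chunk):
            to_drop -= len(chunk)  # the whole chunk was committed
        elif to_drop > 0:
            remaining.append(chunk[to_drop:])  # keep the uncommitted tail
            to_drop = 0
        else:
            remaining.append(chunk)
    return remaining
```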
|
Add toggle to UI to enable a profanity filter. It replaces vowels in bad
words with asterisks.
Bugfix: filters now apply to OBS
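
The vowel-masking idea could look like this. The function name, the
whole-word matching, and the case-insensitive comparison are assumptions; the
word list is supplied by the caller.

```python
import re


def censor(text, bad_words):
    """Replace the vowels of each listed word with asterisks."""
    if not bad_words:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in bad_words) + r")\b",
        re.IGNORECASE,
    )
    # Mask vowels only inside the matched word, leaving other text untouched.
    return pattern.sub(lambda m: re.sub(r"[aeiouAEIOU]", "*", m.group(0)), text)
```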
|
| |
|
|
|
|
|
|
| |
Most transcription output is now gone by default. Users can enable a
more verbose output by toggling `Enable debug mode`.
Bugfix: Toggling off transcription would reset audio state, frequently
resulting in the loss of the last few words spoken.
|
Recap: In the STT there's an algorithm that tries to determine when a
transcript is "stable" enough to commit. If that is too loose, then
accuracy suffers; if too strict, then the audio buffer eventually fills.
To mitigate the problem, I check whether the last N transcripts are
within some edit distance (Levenshtein edit distance) of each other. The
fuzzy matching lets us forgive small instabilities, like differences in
uppercase/lowercase or punctuation, while rejecting large instabilities.
The default value of 8 seems to be in the sweet spot of accuracy &
performance, but it will likely be tuned in the future.
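
A minimal sketch of the fuzzy stability check. It assumes the default of 8 is
the edit-distance threshold, that N is 3, and that each recent transcript is
compared against the oldest one in the window; the real code may compare them
differently.

```python
def levenshtein(a, b):
    """Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def is_stable(transcripts, n=3, max_distance=8):
    """True if the last n transcripts are within max_distance of each other."""
    if len(transcripts) < n:
        return False
    recent = transcripts[-n:]
    return all(levenshtein(recent[0], t) <= max_distance for t in recent[1:])
```

Differences in case or punctuation cost only a few edits, so they stay under
the threshold, while a substantially different transcript does not.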
|
... instead of simple equality.
TODO: add UI for threshold.
Bugfix: Frame::onAppStop() joins the OBS app thread.
|
This is useful when streaming. Occasionally the STT can get into
a bad state, and manually segmenting clears it up. However, doing so
would clear your accumulated transcript, which isn't always desired. Add
ability to preserve the transcript.
A small wrinkle: the new commit logic requires N consecutive identical
windows before committing. To make this feature play nicely with it, I
had to forcibly commit any preview text that hasn't yet been committed.
Failing to do this would usually cause short utterances / the most
recently said stuff to get wiped out.
|