Commit message | Author | Age |
|
This is overwhelmingly more common than the custom chatbox.
|
Deprecated in the Python release of CTranslate2 as of 4.4.0:
https://github.com/OpenNMT/CTranslate2/blob/master/CHANGELOG.md#v440-2024-09-09
|
Also:
* Double the number of audio device slots.
* Fetch cuDNN from NVIDIA at runtime instead of vendoring it.
|
* Add a checkbox to disable this feature if desired.
* Delete old optimization code; can get it back from git if needed.
* Enforce that there's at least one space character ' ' between
committed segments.
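A minimal sketch of the spacing rule, assuming committed segments are plain strings joined in order. `join_committed` is a hypothetical name, not the app's actual function:

```python
def join_committed(segments: list[str]) -> str:
    """Concatenate committed segments with at least one ' ' between them."""
    out = ""
    for seg in segments:
        if not seg:
            continue  # empty segments contribute nothing
        # Only add a space when neither side already supplies one,
        # so segments that carry their own padding don't get doubled spaces.
        if out and not out.endswith(" ") and not seg.startswith(" "):
            out += " "
        out += seg
    return out
```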
|
Also disable flash-attention when CPU mode is selected.
|
Pre-3000 series GPUs don't support it. Oops!
|
There's a modular avatar prefab for the custom chatbox on my gumroad.
Update the default settings to work with that prefab.
|
This should be significantly more efficient than prior versions.
* Add large-v3 and its distilled variant.
* Simplify model acquisition code now that distilled models are part of
faster-whisper.
|
These were broken by logic errors in the codepath that acquires models
from Hugging Face.
Distilled large-v2 seems promising as a new default model.
|
cuDNN is now downloaded from Dropbox instead of Google Drive. As a bonus,
this is about 10-20x faster (assuming you have fast internet).
|
Google Drive intentionally broke CLI downloads ("don't be evil") and
UwwwuPP went away. Begin work on rehosting both files.
|
Should enable compatibility with older GPUs.
|
The paper recommends filtering out segments with no_speech_prob > 0.6
and avg_logprob < -1. This is too loose a bound for short-form audio,
which is not guaranteed to contain speech.
I already have a tighter bound:
no_speech_prob > 0.6 and avg_logprob < -0.5
While listening to instrumental music, I find that a lot of
hallucinations sneak past that bound, so I added a second bound:
no_speech_prob > 0.15 and avg_logprob < -0.7
Basically, we filter out things that look like speech but have a worse
avg_logprob. This seems not to produce false negatives, but it requires
testing.
Also: dial back the default max segment length from 15 seconds to 10
seconds, based on performance observations on desktop.
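A rough sketch of the two bounds as a segment filter. The `Segment` dataclass is a stand-in; its `no_speech_prob` and `avg_logprob` fields mirror the attributes faster-whisper puts on its transcription segments, but `is_hallucination` and `filter_segments` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    no_speech_prob: float
    avg_logprob: float


# (no_speech_prob threshold, avg_logprob threshold) pairs from the message above.
# A segment is dropped if it exceeds BOTH thresholds of ANY pair.
BOUNDS = [
    (0.6, -0.5),   # the original, tighter-than-the-paper bound
    (0.15, -0.7),  # second bound aimed at hallucinations during music
]


def is_hallucination(seg: Segment) -> bool:
    return any(seg.no_speech_prob > ns and seg.avg_logprob < lp for ns, lp in BOUNDS)


def filter_segments(segments: list[Segment]) -> list[Segment]:
    return [s for s in segments if not is_hallucination(s)]
```

The second pair catches segments the model is only mildly unsure contain speech but that still decode with a poor average log-probability.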
|
Indeed it is. Bumped up the default max segment length to decrease
error.
Also add mic presets for the Beyond (the VR headset) and MOTU (my mic
interface).
|
I converted distil-whisper-medium.en to CTranslate2 format and uploaded
it to Hugging Face. This model is exceptionally fast and light compared
to the non-distilled version, at the cost of some accuracy.
|
When hot-miking into the built-in chatbox, there are sometimes long
pauses in conversation. After these pauses, it's undesirable to show the
transcript generated before the pause. This feature allows those
transcripts to be dropped.
Also:
* Limit the number of segments sent to the browser source to 10. The
list may grow to 10 segments; past that, the oldest 5 are dropped.
* Silence warnings generated by `install_in_venv`, used by e.g. the
translation codepath.
* Enable audio normalization to improve accuracy when speaking softly,
at the cost of some accuracy when speaking normally.
Credit: user endo0269 on Discord suggested this feature.
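A sketch of the grow-then-trim segment cap, assuming segments live in a plain list. `append_segment` and the constant names are hypothetical:

```python
MAX_SEGMENTS = 10  # never send more than this many segments
DROP_COUNT = 5     # how many of the oldest segments to drop at once


def append_segment(segments: list[str], seg: str) -> None:
    """Append seg; once the list exceeds MAX_SEGMENTS, drop the oldest DROP_COUNT."""
    segments.append(seg)
    if len(segments) > MAX_SEGMENTS:
        del segments[:DROP_COUNT]
```

Trimming 5 at a time instead of 1 means the browser source sees an occasional bulk trim rather than the whole list shifting on every new segment.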
|
BrowserSource now fades text out continuously over time.
TODO
* Delete C++ webserver, browsersource, transcript code
* Add UI for text age fading
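A sketch of one possible age-based fade curve. The commit doesn't say what shape the fade has, so the linear curve and the `hold_s`/`fade_s` parameters are assumptions; they're the kind of knobs a text age fading UI might expose:

```python
def text_opacity(age_s: float, hold_s: float = 0.0, fade_s: float = 10.0) -> float:
    """Opacity in [0, 1] for text that appeared age_s seconds ago.

    Hypothetical linear curve: fully opaque for hold_s seconds, then fading
    to fully transparent over the next fade_s seconds.
    """
    if age_s <= hold_s:
        return 1.0
    return max(0.0, 1.0 - (age_s - hold_s) / fade_s)
```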
|
The default is normal priority.
|
* The uwu filter no longer adds extra whitespace before/after segments.
This defeated the commit logic.
* Disabling phonemes works again: the path to the prefab was being
quoted twice, breaking the codepath.
|
Remove unused proxy code, curl, and images.
|
0.17.x are breaking faster_whisper's ability to download models.
Also:
* Start using a frozen requirements.txt.
* Conditionally install torch & legacy whisper only when doing
mechanical optimization.
|
Actually retain the whole transcript to avoid breaking the OSC pager.
Also constrain the UI buffer size by characters instead of lines. Since
some lines can be massive and others short, characters are a better way
of consistently keeping the UI memory in check.
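A sketch of a UI buffer bounded by characters rather than lines, assuming the UI holds a sequence of text lines. `CharBoundedBuffer` is a hypothetical class, and for simplicity it doesn't count the newlines added when joining:

```python
from collections import deque


class CharBoundedBuffer:
    """Keep the most recent lines whose combined length fits in max_chars."""

    def __init__(self, max_chars: int):
        self.max_chars = max_chars
        self.lines = deque()
        self.total = 0  # running character count, so appends stay O(1) amortized

    def append(self, line: str) -> None:
        self.lines.append(line)
        self.total += len(line)
        # Evict the oldest lines until we're under budget, but always keep
        # the newest line even if it alone exceeds the budget.
        while self.total > self.max_chars and len(self.lines) > 1:
            self.total -= len(self.lines.popleft())

    def text(self) -> str:
        return "\n".join(self.lines)
```

One massive line evicts many short ones, which is exactly the behavior a line count can't give you.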
|
Allows users to directly modulate the performance-latency tradeoff.
Also:
* Bump up UI buffer to 1k lines.
* Fix browser source reset. It now also resets preview text.
|
Improves viewer experience.
|
Also fix a bug when previews are disabled: the audio buffer no longer
grows without bound while there's no speech.
|
The log file is constrained to 1 MB and the UI to 100-200 lines. 1k lines
is too many to keep the UI from lagging.
The transcript is constrained to 4k characters.
Also put a 5 ms sleep in the transcription hot path.
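One way to cap a log file at 1 MB with only the standard library is `logging.handlers.RotatingFileHandler`. This is a sketch, not necessarily how the app does it; the logger name and format are made up. With `backupCount=1`, up to two files (each under the cap) can exist on disk:

```python
import logging
import logging.handlers


def make_logger(path: str, max_bytes: int = 1_000_000, name: str = "stt") -> logging.Logger:
    """Return a logger whose file never grows past max_bytes (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # When the next record would push the file to max_bytes, the handler renames
    # it to path + ".1" (replacing any older backup) and starts a fresh file.
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=1, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```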
|
This keeps memory usage from growing without bound.
|
I find it kind of annoying when people wave around a big chatbox, so I
added the option to lock the chatbox in worldspace whenever it's
visible. This defaults to on and can be disabled.
|
It now waits up to 10 seconds for a graceful exit and falls back on
the equivalent of a SIGKILL. The caller is assumed to have signaled to the
process through `in_cb` that an exit is desired.
Also:
* Fix the graceful exit path of transcribe_v2.py.
* Add toggle to enable/disable preview text. It is enabled by default.
* Constrain transcription temperature to 0.0. This keeps latency more
predictable at the cost of some accuracy.
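A sketch of the wait-then-kill pattern using `subprocess.Popen`, assuming the child was started with `Popen` and has already been asked to exit (the commit does that through `in_cb`; here that step is left to the caller). `stop_process` is a hypothetical name:

```python
import subprocess


def stop_process(proc: subprocess.Popen, timeout_s: float = 10.0) -> int:
    """Wait for a child that was already told to exit; kill it if it hangs.

    Returns the child's exit code.
    """
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Popen.kill() sends SIGKILL on POSIX and calls TerminateProcess on Windows.
        proc.kill()
        return proc.wait()  # reap the child so it doesn't linger as a zombie
```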
|
FuzzyRepeatCommitter was approximating this behavior in the
best-performing configuration, so switch to it in earnest.
This committer simply commits audio once we detect a long enough gap in
speech. That's it!
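A minimal sketch of a gap-based committer, assuming audio arrives as fixed-length frames that a voice activity detector (VAD) has already labeled as speech or silence. `GapCommitter` and its parameters are hypothetical, not the app's actual class:

```python
class GapCommitter:
    """Commit buffered audio once a long enough run of silence follows speech."""

    def __init__(self, frame_s: float, min_gap_s: float):
        # Number of consecutive silent frames that counts as a "long enough gap".
        self.frames_needed = max(1, round(min_gap_s / frame_s))
        self.silent_run = 0
        self.heard_speech = False
        self.buffer = []

    def push(self, frame, is_speech: bool):
        """Add one frame; return the list of committed frames, or None."""
        if not is_speech and not self.heard_speech:
            # Drop leading silence so the buffer can't grow without bound.
            return None
        self.buffer.append(frame)
        if is_speech:
            self.heard_speech = True
            self.silent_run = 0
            return None
        self.silent_run += 1
        if self.silent_run < self.frames_needed:
            return None
        # Gap detected: hand off everything since the first speech frame.
        committed, self.buffer = self.buffer, []
        self.silent_run = 0
        self.heard_speech = False
        return committed
```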
|
Also:
* Enable SO_REUSEADDR on the browser source socket.
* Temporarily add evaluation dependencies to requirements.txt.
* Fix the browser source. It now looks for a prefix that the Python app
actually uses.
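A sketch of enabling `SO_REUSEADDR` on a listening socket with Python's `socket` module. The host and the ephemeral port default are placeholders, since the actual browser source address isn't given here. On POSIX systems this mainly lets a restarted app rebind while the old socket sits in TIME_WAIT; Windows gives the option looser semantics:

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """TCP listener that can rebind its port right after a restart."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(); without it, a quick restart can fail with
    # "Address already in use" while the previous socket lingers in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock
```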
|
Fix how the OnExit callback is wired into the GUI. Also make it exit the
Unity process, if one is running.
|
Also:
* Fully scrub AudioSource references from prefab when not using
phonemes.
* Disable net sync on phoneme params when not using them. When not
synced, they don't count against the total memory limit.
* Use the config file in generate_params.py.
|
If not set, the prefab will have its audio sources removed.
|
Duplicating config between command-line args and the config file is a
huge pain in the ass to maintain. Now we just launch using the config
generated by the UI. ezpz.
|
wxWidgets encodes text inputs & multiple-choice inputs as strings. I
frequently have to convert these into ints & apply a range check.
Encapsulate that in a function and use a shitty little ASSIGN_OR_RETURN
macro to make the parsing as concise as possible.
Also delete unused WhisperCPP config settings.
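The commit's helper and `ASSIGN_OR_RETURN` macro are C++; this is only a Python analog of the same idea, parsing a string-encoded widget value as an int, range-checking it, and bailing out early on error. Every name here, including the setting keys, is hypothetical:

```python
from typing import Optional, Tuple


def parse_int_in_range(text: str, lo: int, hi: int, name: str) -> Tuple[Optional[int], Optional[str]]:
    """Parse a widget's string value as an int in [lo, hi].

    Returns (value, None) on success or (None, error_message) on failure.
    """
    try:
        value = int(text.strip())
    except ValueError:
        return None, f"{name}: expected an integer, got {text!r}"
    if not lo <= value <= hi:
        return None, f"{name}: {value} is outside [{lo}, {hi}]"
    return value, None


def parse_settings(raw: dict) -> Tuple[Optional[dict], Optional[str]]:
    # Hypothetical setting names. Each `if err: return` is the early return
    # that ASSIGN_OR_RETURN hides behind a macro in the C++ code.
    beam_size, err = parse_int_in_range(raw["beam_size"], 1, 10, "beam_size")
    if err:
        return None, err
    max_seg_s, err = parse_int_in_range(raw["max_segment_s"], 1, 30, "max_segment_s")
    if err:
        return None, err
    return {"beam_size": beam_size, "max_segment_s": max_seg_s}, None
```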
|
* Temporarily restore normal process priority. Working on adding a UI
option to set STT priority.
* Give audio indicator phonemes a 1/3 chance to do nothing. Makes the
result sound a little better imo.
* Quiet down the SteamVR thread when SteamVR isn't running.
* Fix use of `button_id` and `hand_id` in steamvr.py.
* Increase the amount of silence allowed before the transcript from 1 to
5 seconds. You want enough buffer to allow for a few full transcripts,
else you risk spuriously dropping audio.
* Enable background loading in audio metadata (required by the VRC SDK).
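A sketch of the 1/3 "do nothing" chance, assuming indicator phonemes are picked at random from a list. `pick_indicator_phoneme` is hypothetical, and `None` stands in for "play nothing":

```python
import random


def pick_indicator_phoneme(phonemes, rng=random):
    """Pick a random phoneme, or None (play nothing) with probability 1/3."""
    if rng.random() < 1 / 3:
        return None
    return rng.choice(phonemes)
```

Passing a seeded `random.Random` as `rng` makes the behavior reproducible in tests.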
|
This is now dynamically set inside transcribe.py.
As the buffer grows long, the threshold grows exponentially, keeping the
buffer short. The threshold starts small so that transcription starts
strict (accurate, slow) and gets looser (inaccurate, fast) as needed.
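What exactly the threshold gates isn't spelled out here, so this is only a sketch of the growth curve: a threshold that starts small and doubles for every fixed chunk of buffered audio. `commit_threshold`, `start`, and `doubling_s` are made-up names and values; a larger result means looser (faster, less accurate) transcription:

```python
def commit_threshold(buffer_s: float, start: float = 0.1, doubling_s: float = 2.0) -> float:
    """Threshold that starts at `start` and doubles every `doubling_s` seconds of buffered audio."""
    return start * 2.0 ** (buffer_s / doubling_s)
```

Because growth is exponential, a short buffer stays strict for a while, but a runaway buffer quickly loosens the threshold enough to be flushed.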
|
Also fix prefab default size (no longer colossal).
TODO
* Add runtime & Unity-time toggles
|
pyopenvr is both deprecated and buggy, so switch to pyopenxr.
|
Not yet done:
* Animator toggle
* OSC integration
|