TaSTT.git - Free self-hosted STT for VRChat.

	Commit message	Author	Age
*	Delete unused files [v1.0.0-beta00]	yum	2025-07-23
*	Remove flash_attention toggle	yum	2024-11-16
	Deprecated in the Python release of CTranslate2 as of 4.4.0: https://github.com/OpenNMT/CTranslate2/blob/master/CHANGELOG.md#v440-2024-09-09
*	Add support for whisper large v3 turbo	yum	2024-11-16
	Also:
	* Double # of audio device slots
	* Fetch CuDNN from NVIDIA at runtime instead of vendoring
*	Support as few as 1 char per sync in custom chatbox	yum	2024-07-30
*	Another edge case: first commit should not get a leading space	yum	2024-07-12
*	Edge case: initial preview should not have a space added in front of it	yum	2024-07-12
	God this code is a fucking nightmare
*	Translation shows original language by default	yum	2024-07-12
	* Add checkbox to disable this feature if so desired.
	* Delete old optimization code; can get it back from git if needed.
	* Enforce that there's at least one space character ' ' between committed segments.
*	Fix translation plugin	yum	2024-07-12
	Translation needs torch to convert the nllb model, but the latest version (2.3.1) has an embedded OMP dll which clashes with ctranslate2's dll. Using the last minor version instead (2.2.2) doesn't clash.
	Also propagate the device, quantization, and flash attention settings to the translator. If you're using GPU, this is a HUUUUGE performance uplift. Translation is basically instant. The bigger models are now feasible to use.
*	Bump CUDNN to v8.9.7 [v0.19.1]	yum	2024-06-09
	Also disable flash-attention when CPU mode is selected
*	Add checkbox for flash-attention	yum	2024-06-09
	Pre-3000 series GPUs don't support it. Oops!
*	Upgrade faster-whisper with flash-attention2	yum	2024-06-05
	This should be significantly more efficient than prior versions.
	* add large-v3 & distilled variant
	* simplify model acquisition code now that distilled models are part of faster-whisper.
*	Fix distilled models	yum	2024-03-14
	These were broken due to some logic errors in the codepath which acquires models from huggingface.
	Distilled large-v2 seems promising as a new default model.
*	Add "simple" text-to-text demo for the modular avatar chatbox	yum	2024-03-08
	To use it:
	  $ python3 -m pip install python-osc pillow
	  $ cd Scripts
	  $ python3 ./text_to_text_demo.py
*	Finish plumbing GPU compute type [v0.18.1]	yum	2024-02-09
*	Add dropdown for GPU compute type [v0.18.0]	yum	2024-02-09
	Should enable compatibility with older GPUs.
*	Add another threshold to filter out common hallucinations	yum	2024-02-05
	The paper recommends filtering out segments with no_speech_prob > 0.6 and avg_logprob < -1. This is too loose of a bound for short-form audio which is not guaranteed to contain speech. I already have a tighter bound:
	  no_speech > 0.6 and avg_logprob < -0.5
	While listening to instrumental music I find that a lot of hallucinations sneak past that bound. So I added a second bound:
	  no_speech > 0.15 and avg_logprob < -0.7
	Basically we filter out things that look like speech but have a worse avg_logprob. Seems to not have false negatives. Requires testing.
	Also: dial back the default max segment length from 15 seconds to 10 seconds. This is done based on performance observations in desktop.
*	Verify that audio is clean after VAD segmentation	yum	2024-02-05
	Indeed it is. Bumped up the default max segment length to decrease error.
	Also add mic presets for beyond (the vr headset) and motu (my mic interface).
*	Revert "Begin experimenting with flash-attention"	yum	2024-01-08
	This reverts commit 921b92a69f36502dc5eefd14ba3487c1bb49bb9d.
*	Begin experimenting with flash-attention	yum	2023-12-13
	Seems much faster than faster-whisper. There are two issues:
	* Requires NVIDIA 3000 series or higher.
	* Incompatible with faster-whisper dependencies.
	So it seems like we'll either need to toggle between two sets of dependencies at runtime or have two environments.
*	Decrease OSC sync rate from 5 Hz to 3 Hz [v0.17.0]	yum	2023-12-08
	Paging is now slower but more reliable.
*	Add distilled whisper large-v2 model	yum	2023-12-08
*	Add distilled whisper-medium model	yum	2023-11-07
	I converted distil-whisper-medium.en to CTranslate2 format and uploaded it to huggingface.
	This model is exceptionally fast and light compared to the non-distilled version, at the cost of some accuracy.
*	Transcripts preceding long pauses now drop [v0.16.0]	yum	2023-10-05
	When hot-miking into the built-in chatbox, there are sometimes long pauses in conversation. After these pauses, it's undesirable to show the transcript generated before the pause. This feature makes it so that those transcripts can be dropped.
	Also:
	* Limit number of segments sent to browser source to 10. Allow this to grow up to 10 segments before dropping the first 5 segments.
	* Silence warnings generated by `install_in_venv`, used by e.g. translation codepath.
	* Enable audio normalization to improve accuracy when speaking softly, at the cost of some accuracy when speaking normally.
	Credit: user endo0269 on Discord suggested this feature.
*	Reimplement BrowserSource as a StreamingPlugin	yum	2023-09-18
	BrowserSource now fades text out continuously over time.
	TODO
	* Delete C++ webserver, browsersource, transcript code
	* Add UI for text age fading
*	Bugfixes [v0.15.4]	yum	2023-09-16
	* uwu filter no longer adds extra whitespace before/after segments. This would defeat commit logic.
	* disabling phonemes works again - path to prefab was being quoted twice, breaking the codepath.
*	General cleanup [v0.15.3]	yum	2023-09-13
	Remove unused proxy code, curl, and images.
*	Bugfix: list input devices works again	yum	2023-09-12
	Oops :)
*	Pin huggingface_hub to 0.16.4 [v0.15.2]	yum	2023-09-11
	0.17.x are breaking faster_whisper's ability to download models.
	Also:
	* Start using frozen requirements.txt.
	* Conditionally install torch & legacy whisper only when doing mechanical optimization.
*	Introduce notion of PresentationFilter	yum	2023-09-10
	... and restructure RemoveTrailingPeriod as a filter instead of as a plugin.
	Plugins have the power to change transcription data as it comes along, but don't have access to the entire transcript. Filters have access to the entire transcript but can't durably change it.
	TODO
	* This does not work with data passed through OSC
*	Fix paging bug [v0.15.1]	yum	2023-09-10
	OSC was paging using incorrect board resolution. Use cfg to provide this data.
*	Bugfix: eliminate dead-end in certain animator layers	yum	2023-09-10
	Because the custom chatbox doesn't necessarily have an even multiple of `sync_params` character slots, some layers in the animator write N character slots while others write N-1. In the layers with only N-1 slots, they need something to do while slot N is being selected. This patch creates a return-home transition in that case.
*	Users can now choose custom chatbox texture size in UI	yum	2023-09-10
*	Fix local audio indicators [v0.15.0]	yum	2023-09-10
*	Check in vad.py and delete transcribe.py	yum	2023-09-10
	Oops, I meant to check this in a while back.
	Since transcribe_v2.py now has feature parity with transcribe.py, delete the old code.
*	Add plugin interface	yum	2023-09-10
	... and use it to implement translation and text filters.
	Also fix display of non-English characters in browser src.
*	Bugfix: only cap display of transcript at 4K chars	yum	2023-09-10
	Actually retain the whole transcript to avoid breaking the OSC pager.
	Also constrain the UI buffer size by characters instead of lines. Since some lines can be massive and others short, characters are a better way of consistently keeping the UI memory in check.
*	Add UI for transcription loop delay	yum	2023-09-10
	Allows users to directly modulate the performance-latency tradeoff.
	Also:
	* Bump up UI buffer to 1k lines.
	* Fix browser source reset. It now also resets preview text.
*	Browser source now shows preview text as slightly transparent	yum	2023-09-09
	Improves viewer experience.
*	Add UI for max speech duration	yum	2023-09-09
	Also fix bug when not using previews. Audio buffer no longer grows without bound while there's no speech.
*	Constrain log file, UI text field, and transcript sizes	yum	2023-09-09
	Log file is constrained to 1 MB and UI to 100-200 lines. 1k lines is too high to keep the UI from lagging. Transcript is constrained to 4k characters.
	Also put a 5 ms sleep in the transcription hot path.
*	Make min silence duration configurable in UI	yum	2023-09-09
*	Add `lock at spawn` option	yum	2023-09-09
	I find it kind of annoying when people wave around a big chatbox so I added the option to have the chatbox be locked in worldspace whenever it's visible. This defaults to on and can be disabled.
*	Bugfix: fix preview text enable/disable in browser source	yum	2023-09-09
*	Bugfix: fix process leak in PythonWrapper::InvokeCommandWithArgs	yum	2023-09-09
	It now waits up to 10 seconds for a graceful exit and falls back on the equivalent of a SIGKILL. The caller is assumed to have signaled to the process through `in_cb` that an exit is desired.
	Also:
	* Fix graceful exit path of transcribe_v2.py.
	* Add toggle to enable/disable preview text. It is enabled by default.
	* Constrain transcription temperature to 0.0. This keeps latency more predictable at the cost of some accuracy.
*	Bugfix: non-text OSC messages wait for sync window	yum	2023-09-08
	This makes them more reliable.
*	Bugfix: text data now pages correctly	yum	2023-09-08
	The non-text OSC messages were paging in too close to the text OSC messages, breaking the whole system. Now the non-text OSC messages bump back the time at which text OSC messages can begin being sent.
*	Only transcribe if VAD detects something	yum	2023-09-08
	Also:
	* DiskStream starts returning silence when out of data instead of just stopping.
	* Filter out Whisper segments with high `no_speech_prob` and low `avg_logprob`.
	* Add `saveAudio` function, useful for debugging.
	* Tune vad silence cutoff to 250 ms. This is pretty accurate in benchmarks.
*	Add keyboard controls to transcribe_v2.py	yum	2023-09-08
	Also parameterize `min_silence_duration_ms` in AudioSegmenter. I suspect that for conversational speech, segmenting closer to 500 ms (rather than the 2000 ms default) is a better tradeoff between accuracy and compute efficiency.
*	Drop transcription queue	yum	2023-09-07
	No longer needed.
*	Switch to VadCommitter	yum	2023-09-07
	FuzzyRepeatCommitter was approximating this behavior in the best-performing configuration, so switch to it in earnest. This committer simply commits audio once we detect a long enough gap in speech. That's it!
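
Note on the 2024-02-05 commit "Add another threshold to filter out common hallucinations": the two bounds quoted in that message can be sketched roughly as below. This is only an illustration of the stated thresholds, not the repository's actual code. The `Segment` dataclass and the function names `is_likely_hallucination` and `filter_segments` are hypothetical; only the field names `no_speech_prob` and `avg_logprob` and the four numbers come from the commit messages.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Hypothetical stand-in for one Whisper output segment."""
    text: str
    no_speech_prob: float
    avg_logprob: float


# (no_speech_prob lower bound, avg_logprob upper bound) pairs taken from
# the commit message. A segment is dropped if it violates either pair.
HALLUCINATION_BOUNDS = (
    (0.6, -0.5),   # the earlier, tighter-than-paper bound
    (0.15, -0.7),  # the second bound added in this commit
)


def is_likely_hallucination(seg: Segment) -> bool:
    """Return True if the segment falls inside any of the reject regions."""
    return any(
        seg.no_speech_prob > min_no_speech and seg.avg_logprob < max_logprob
        for min_no_speech, max_logprob in HALLUCINATION_BOUNDS
    )


def filter_segments(segments: Iterable[Segment]) -> List[Segment]:
    """Keep only the segments that pass both bounds."""
    return [s for s in segments if not is_likely_hallucination(s)]


if __name__ == "__main__":
    demo = [
        Segment("hello there", no_speech_prob=0.05, avg_logprob=-0.3),
        Segment("Thanks for watching!", no_speech_prob=0.4, avg_logprob=-0.9),
    ]
    for seg in filter_segments(demo):
        print(seg.text)
```

Keeping the bounds in a tuple of pairs makes it easy to add or tune a threshold without touching the predicate, which matches the message's note that the values still require testing.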