TaSTT.git - Free self-hosted STT for VRChat.

	Commit message	Author	Age
*	Add keyboard controls to transcribe_v2.py	yum	2023-09-08
	Also parameterize `min_silence_duration_ms` in AudioSegmenter. I suspect that for conversational speech, segmenting closer to 500 ms (rather than the 2000 ms default) is a better tradeoff between accuracy and compute efficiency.
*	Drop transcription queue	yum	2023-09-07
	No longer needed.
*	Switch to VadCommitter	yum	2023-09-07
	FuzzyRepeatCommitter was approximating this behavior in the best-performing configuration, so switch to it in earnest. This committer simply commits audio once we detect a long enough gap in speech. That's it!
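A minimal sketch of the "commit after a long enough gap" idea described above. It is not TaSTT's VadCommitter: the class name `GapCommitter`, the `push` method, the `frame_ms` parameter, and the per-frame `is_speech` flag (which a real VAD would supply) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GapCommitter:
    """Hypothetical stand-in for a VAD-driven committer (not TaSTT's class)."""

    min_silence_ms: int = 500  # gap length that triggers a commit (assumed)
    frame_ms: int = 100        # duration of each pushed frame (assumed)
    _pending: List[bytes] = field(default_factory=list, init=False)
    _silent_ms: int = field(default=0, init=False)

    def push(self, frame: bytes, is_speech: bool) -> Optional[bytes]:
        """Feed one frame; return the buffered speech once a long gap follows it."""
        if is_speech:
            self._pending.append(frame)
            self._silent_ms = 0
            return None
        if not self._pending:
            return None  # leading silence: nothing to commit yet
        # Silent frames are counted but not stored in this sketch.
        self._silent_ms += self.frame_ms
        if self._silent_ms >= self.min_silence_ms:
            committed = b"".join(self._pending)
            self._pending = []
            self._silent_ms = 0
            return committed
        return None
```

A caller would push frames as they arrive and hand any non-`None` return value to the transcriber. Short pauses (below `min_silence_ms`) do not trigger a commit.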
*	Put OSC logic into its own thread	yum	2023-09-05
	This logic is highly IO bound and latency critical so it makes sense to put it into its own thread.
	Also:
	* Collector::drop* methods return the dropped audio. Committer includes that audio in commits. Transcription thread holds onto it. When the user segments their speech with a button press, the transcription thread sends the entire combined audio of all commits over to Whisper to be transcribed. This allows us to recover from errors introduced by segmentation.
	* Remove unused animator params
	* Fix issue where clearing the board doesn't completely reset STT state
	TODO:
	* Coalescing does not occur for in-place updates. It should.
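The general pattern of moving IO-bound sends onto a dedicated thread can be sketched as follows. This is an illustration only: `OscWorker`, `submit`, and `close` are hypothetical names, and the `send` callable stands in for whatever actually emits OSC messages in the real code.

```python
import queue
import threading
from typing import Callable, Optional


class OscWorker:
    """Hypothetical sketch: run a blocking `send` callable on its own thread."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._queue: "queue.Queue[Optional[str]]" = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, text: str) -> None:
        """Enqueue text without blocking the caller on network IO."""
        self._queue.put(text)

    def close(self) -> None:
        """Drain already-queued items, then stop the worker thread."""
        self._queue.put(None)  # sentinel: tells the worker to exit
        self._thread.join()

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:
                return
            self._send(item)
```

With this shape, the transcription thread only pays the cost of a queue `put`, so slow sends do not stall transcription.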
*	Wire transcribe_v2.py into GUI	yum	2023-09-03
	Also:
	* Enable SO_REUSEADDR on browser src socket
	* Temporarily add evaluation dependencies to requirements.txt
	* Fix browser src. It's now looking for a prefix that the python app actually uses.
*	Add threads to transcribe_v2.py	yum	2023-09-03
	Four threads:
	* Main thread
	* Transcription (mic -> collector -> whisper -> committer -> pager)
	* VR input
	* Keyboard input
	Also:
	* add OscPager class to encapsulate all OSC interactions.
	* bump `last_n_must_match` from 2 to 3 to reduce hallucinations
*	Apply subtle compression to audio before transcribing	yum	2023-09-03
	This has a slight positive effect on my benchmark.
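The commit does not say which kind of compression is applied. As one possible reading, here is a sketch of a simple static dynamic-range compressor over float samples in [-1, 1]. The function name `compress` and the `threshold` and `ratio` defaults are assumptions, not values from the repository.

```python
from typing import List, Sequence


def compress(samples: Sequence[float], threshold: float = 0.5,
             ratio: float = 2.0) -> List[float]:
    """Reduce the part of each sample's magnitude above `threshold` by `ratio`.

    Hypothetical illustration; the real code may use a different method.
    """
    out: List[float] = []
    for s in samples:
        mag = abs(s)
        if mag > threshold:
            # Only the overshoot above the threshold is scaled down.
            mag = threshold + (mag - threshold) / ratio
        out.append(mag if s >= 0 else -mag)
    return out
```

A ratio close to 1 gives the "subtle" end of the range: loud peaks are pulled in slightly while quiet samples pass through unchanged.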
*	Experiment with Collector filters	yum	2023-09-03
	Try adding two filters on top of the usual AudioCollector:
	* Minimum length preservation: never report fewer than N seconds worth of audio data. Pad with silence as needed.
	* Volume normalizing: normalize audio volume.
	Using my benchmark of 30-second audio clips from 3 speakers (lower is better):
	  length enf + norm = 87.118
	  nothing           = 90.917
	  norm              = 94.538
	  length            = 111.402
	Both together are a slight improvement, but independently degrade the result by a lot. I also observed more hallucinations in a conversational pattern when using them vs. not. So I'll phase them out.
	I'm still curious about compression as opposed to normalization.
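The two filters above can be sketched as plain functions over float samples. This is a hedged illustration rather than the repository's code: the function names are hypothetical, padding at the end of the clip is an assumption (the commit does not say where silence is added), and peak normalization is only one of several ways to normalize volume.

```python
from typing import List, Sequence


def pad_to_min_length(samples: Sequence[float], min_seconds: float,
                      sample_rate: int) -> List[float]:
    """Append silence so the clip is at least `min_seconds` long (assumed: pad at end)."""
    min_len = int(min_seconds * sample_rate)
    missing = max(0, min_len - len(samples))
    return list(samples) + [0.0] * missing


def normalize_peak(samples: Sequence[float], target_peak: float = 1.0) -> List[float]:
    """Scale the clip so its largest absolute sample equals `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Chaining them as `normalize_peak(pad_to_min_length(clip, n, rate))` corresponds to the combined "length enf + norm" configuration in the benchmark, under the assumptions stated here.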
*	Begin rewriting transcribe.py	yum	2023-09-02
	A set of proper interfaces is called for. See #dev-update-spam in discord for drawing of design. Also add code to mechanically optimize committer parameters using an audio file. Not perfectly repeatable since it depends on the performance characteristics of the machine, but prob better than what we had before (nothing).
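The commit does not describe how the parameter optimization works. One simple way to do it mechanically is an exhaustive grid search, sketched below. The `grid_search` function and the `score` callable are hypothetical. In practice `score` would run the committer over the audio file and return an error metric where lower is better.

```python
import itertools
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def grid_search(grid: Dict[str, Iterable[Any]],
                score: Callable[[Dict[str, Any]], float]
                ) -> Tuple[Optional[Dict[str, Any]], float]:
    """Try every parameter combination and keep the lowest-scoring one."""
    names = list(grid)
    best_params: Optional[Dict[str, Any]] = None
    best_score = float("inf")
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        s = score(params)
        if s < best_score:
            best_params, best_score = params, s
    return best_params, best_score
```

Because transcription timing depends on the machine, repeated runs of a real `score` may rank nearby configurations differently, which matches the repeatability caveat above.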