TaSTT.git - Free self-hosted STT for VRChat.

	Commit message	Author	Age
*	Fix race condition in commit logic	yum	2023-08-01
\| \| \| \| \| \| \| \|	Transcription thread now blocks until microphone thread deletes samples as requested. (This is hacky design, it should use a work queue or something, but I don't feel like doing that right now)
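The commit body above mentions a work queue as the cleaner design. A minimal sketch of that idea follows; it is not what the commit implements, and `DropRequest`, `MicBuffer`, and `commit_and_wait` are hypothetical names. The transcription thread enqueues a drop request and blocks on an event that the microphone thread sets once the samples are gone.

```python
import queue
import threading


class DropRequest:
    """Hypothetical request: drop the first n_samples, then signal completion."""

    def __init__(self, n_samples):
        self.n_samples = n_samples
        self.done = threading.Event()


class MicBuffer:
    """Hypothetical sample buffer owned by the microphone thread."""

    def __init__(self):
        self.samples = []
        self.requests = queue.Queue()

    def on_audio(self, new_samples):
        # Runs on the microphone thread: service pending drops, then append.
        while True:
            try:
                req = self.requests.get_nowait()
            except queue.Empty:
                break
            del self.samples[:req.n_samples]
            req.done.set()
        self.samples.extend(new_samples)


def commit_and_wait(buf, n_samples, timeout_s=1.0):
    # Runs on the transcription thread: blocks until the drop has happened.
    # Returns False if the microphone thread did not respond in time.
    req = DropRequest(n_samples)
    buf.requests.put(req)
    return req.done.wait(timeout_s)
```

Only the microphone thread touches `samples`, so no extra lock is needed in this sketch.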
*	Only back off transcription loop when not transcribing	yum	2023-08-01
\| \| \| \| \| \| \| \| \| \|	It's possible that the user has toggled off transcription while the algorithm is still working. In this case we should not begin exponential backoff since there's still work to do. Also: * Shorten the hot-path sleep from 50ms to 5ms. * Remove unused variable in SleepInterruptible
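A minimal sketch of the backoff rule described above, assuming a doubling factor (the commit does not state one). The 5ms hot-path sleep is from the commit; `next_sleep_s` and its parameters are hypothetical names.

```python
def next_sleep_s(current_s, transcribing, has_pending_work,
                 hot_path_s=0.005, factor=2.0):
    """Return how long the transcription loop should sleep next.

    Backoff only grows when the user has paused transcription AND there is
    no audio left to process. The factor of 2.0 is an assumption.
    """
    if transcribing or has_pending_work:
        return hot_path_s
    return max(current_s, hot_path_s) * factor
```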
*	Preserve audio chunk length when dropping samples	yum	2023-07-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we commit a transcription, we drop the corresponding audio data. Audio data is represented as a list of chunks. Each chunk contains a few hundred samples of audio data, representing O(10ms) of audio. If we want to drop a few seconds of data, this means simply deleting many chunks of audio. There's usually a chunk where we want to drop some portion of audio data. Instead of slicing away that part of the chunk, which would change its length, this change zeroes it out. This preserves the assumption that each chunk has the same temporal length.
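A minimal sketch of the chunk handling described above, assuming each chunk is a `bytes` object of 16-bit samples. `drop_samples` is a hypothetical name, not the function in transcribe.py. Whole chunks are deleted; the partially dropped chunk is zeroed rather than sliced, so every chunk keeps its length.

```python
SAMPLE_WIDTH = 2  # bytes per 16-bit sample


def drop_samples(chunks, n_samples):
    """Drop the first n_samples from a list of equally sized byte chunks."""
    out = list(chunks)
    remaining = n_samples * SAMPLE_WIDTH  # bytes still to drop
    # Delete every chunk that falls entirely inside the dropped region.
    while out and remaining >= len(out[0]):
        remaining -= len(out[0])
        out.pop(0)
    # Zero the dropped prefix of the boundary chunk instead of slicing it.
    if out and remaining > 0:
        out[0] = bytes(remaining) + out[0][remaining:]
    return out
```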
*	Commit logic now drops parts of frames	yum	2023-07-08
\| \| \| \| \| \|	We used to drop entire frames only, leading to situations where more audio is dropped than desired. Now we drop frames down to the precision of the individual audio sample requested.
*	Update README	yum	2023-07-07
\| \| \| \| \| \| \| \| \|	Mostly updating roadmap stuff. Non-VRC use cases are "complete" since I was mostly targeting streaming. The ability to type into arbitrary text fields is still somewhat nascent & could be improved. Also update some other random stuff to be more up to date. KillFrenzy Avatar Text is now MIT, pog!
*	Enforce a stricter avg_logprob than default (v0.13.1)	yum	2023-07-07
\| \| \| \| \| \| \| \|	Common hallucinations sneak in around -0.9 avg_logprob. Also: * Limit temperatures to just 0.0. Multiple values cause latency to occasionally spike.
*	Filter out segments based on avg_logprob & no_speech_prob	yum	2023-07-07
\| \| \| \| \| \| \|	Surprisingly, these args do not cause transcribe() to omit those segments from the result, so we have to manually filter them out. Hallucinated phrases generally have one or both of these params set high.
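A minimal sketch of the manual filtering described above. `Segment` is a hypothetical stand-in carrying only the two fields the commit names, and both threshold values are placeholders rather than the values the project uses.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one transcribed segment."""
    text: str
    avg_logprob: float
    no_speech_prob: float


def keep_confident(segments, min_avg_logprob=-0.8, max_no_speech_prob=0.6):
    """Keep segments that look like real speech.

    Both defaults are assumptions chosen for illustration only.
    """
    return [
        s for s in segments
        if s.avg_logprob >= min_avg_logprob
        and s.no_speech_prob <= max_no_speech_prob
    ]
```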
*	Use 16-bit ints with generated silence	yum	2023-07-07
\| \| \| \|	Each sample of audio data is a 16-bit int, not an 8-bit int.
*	Fix performance regression	yum	2023-07-07
\| \| \| \| \|	Each chunk of audio samples should be encoded as a binary string, not as a list.
*	Enforce minimum 5.0 second duration on audio buffer	yum	2023-07-06
\| \| \| \| \| \| \| \| \| \| \| \|	New commit logic would reduce buffer to a size smaller than this, causing it to hallucinate things like: * "See you next time!" * "Thanks for watching!" * "Bye!" The hope is that by keeping the buffer at least 5.0 seconds long, as described in the paper, this will cut down on these events.
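One way to enforce the minimum duration is to pad the buffer with silence. The sketch below assumes that approach, 16-bit mono samples at 16 kHz, and padding at the end of the buffer; the commit does not say how the minimum is enforced, and `pad_to_min_duration` is a hypothetical name.

```python
SAMPLE_RATE_HZ = 16000  # assumed sample rate of the audio fed to the model
SAMPLE_WIDTH = 2        # bytes per 16-bit sample


def pad_to_min_duration(audio, min_seconds=5.0):
    """Append zeroed 16-bit samples until audio is at least min_seconds long."""
    min_bytes = int(min_seconds * SAMPLE_RATE_HZ) * SAMPLE_WIDTH
    if len(audio) >= min_bytes:
        return audio
    return audio + bytes(min_bytes - len(audio))
```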
*	Add visual commit indicator to OBS browser source	yum	2023-06-30
\| \| \| \| \| \| \| \|	Circle goes red when speaking, grey when done. Ideally it would be in the top right portion of the browser source, but this is a good start. Also, hard-cap transcripts to 4096 chars. This prevents the STT from lagging during long sessions.
*	Bugfix: trailing period filter ignores ellipses	yum	2023-06-30
\| \| \| \|	... also print out "Ready!" when the STT is done loading.
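A minimal sketch of a trailing period filter that leaves ellipses alone. `remove_trailing_period` is a hypothetical name, and treating any run of two or more dots as an ellipsis is an assumption.

```python
def remove_trailing_period(text):
    """Strip one trailing '.', unless the text ends in a run of dots."""
    if text.endswith(".") and not text.endswith(".."):
        return text[:-1]
    return text
```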
*	fix: set gpu device index in whisper model	jsopn	2023-06-30
\|
*	Fix race condition around audio frames dropping	yum	2023-06-28
\| \| \| \| \| \| \| \| \| \| \|	onAudioFramesAvailable would bail out if audio_state.audio_paused is set, preventing frames from being dropped. This would cause transcriptions to get repeated sometimes. Now the frame-dropping code always runs. Also adjust the code structure of the keyboard/VR input handlers to be more similar.
*	Bugfix: commit no longer wipes out audio buffer	yum	2023-06-28
\| \| \| \| \| \| \| \| \| \| \| \|	Audio data is stored in chunks of frames, not in individual frames. When I commit a transcript, I want to get rid of the portion of the audio data responsible for that particular transcript. I have code that does this, but it was dropping a slice of the list assuming that each sample is stored individually. Extra fun: Because we have to decimate mic frames, we have to convert between whisper frames and mic frames to drop the correct amount of audio data.
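A minimal sketch of the frame conversion described above, assuming the mic rate is an integer multiple of the Whisper rate. The 48 kHz mic rate and both function names are assumptions. The second helper shows why dropping by sample count must account for chunking: the result is a number of whole chunks plus a remainder inside the next chunk.

```python
def whisper_to_mic_samples(n_whisper_samples, mic_rate_hz=48000,
                           whisper_rate_hz=16000):
    """Convert a count of decimated (Whisper) samples to mic samples."""
    if mic_rate_hz % whisper_rate_hz != 0:
        raise ValueError("mic rate must be an integer multiple of whisper rate")
    return n_whisper_samples * (mic_rate_hz // whisper_rate_hz)


def split_into_chunks(n_mic_samples, samples_per_chunk):
    """Return (whole chunks to drop, samples to drop from the next chunk)."""
    return divmod(n_mic_samples, samples_per_chunk)
```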
*	Add profanity filter	yum	2023-06-28
\| \| \| \| \| \| \|	Add toggle to UI to enable a profanity filter. It replaces vowels in bad words with asterisks. Bugfix: filters now apply to OBS
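A minimal sketch of the vowel-masking rule described above. The word list, the `censor` name, and the word-matching regex are assumptions; the project's actual list and matching rules are not shown in this log.

```python
import re
from typing import Iterable


def censor(text: str, bad_words: Iterable[str]) -> str:
    """Replace vowels with '*' in every word found in bad_words."""
    blocked = {w.lower() for w in bad_words}

    def mask(match):
        word = match.group(0)
        if word.lower() in blocked:
            return re.sub(r"[aeiouAEIOU]", "*", word)
        return word

    # Treat runs of letters and apostrophes as words; keep everything else.
    return re.sub(r"[A-Za-z']+", mask, text)
```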
*	Add toggle for debug mode	yum	2023-06-28
\| \| \| \| \| \| \| \|	Most transcription output is now gone by default. Users can enable a more verbose output by toggling `Enable debug mode`. Bugfix: Toggling off transcription would reset audio state, frequently resulting in the loss of the last few words spoken.
*	Add UI for fuzzy commit threshold	yum	2023-06-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recap: In the STT there's an algorithm that tries to determine when a transcript is "stable" enough to commit. If that is too loose, then accuracy suffers; if too strict, then the audio buffer eventually fills. To mitigate the problem, I check whether the last N transcripts are within some edit distance (Levenshtein edit distance) of each other. The fuzzy matching lets us forgive small instabilities, like differences in uppercase/lowercase or punctuation, while rejecting large instabilities. The default value of 8 seems to be in the sweet spot of accuracy & performance, but it will likely be tuned in the future.
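A minimal sketch of the fuzzy stability check described above. The threshold of 8 is from the commit; comparing each recent transcript against the newest one, the window of 3, and both function names are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


def is_stable(history, n=3, threshold=8):
    """True if the last n transcripts are within threshold edits of the newest."""
    if len(history) < n:
        return False
    recent = history[-n:]
    return all(levenshtein(recent[-1], t) <= threshold for t in recent[:-1])
```

With this shape, a threshold of 0 reduces to the earlier exact-equality rule.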
*	Adjust commit logic to use fuzzy string match threshold	yum	2023-06-27
\| \| \| \| \| \| \| \|	... instead of simple equality. TODO: add UI for threshold. Bugfix: Frame::onAppStop() joins the OBS app thread.
*	Add ability to preserve transcript while using push to talk	yum	2023-06-27
\| \| \| \| \| \| \| \| \| \| \| \| \|	This is useful when streaming. Occasionally the STT can get into a bad state, and manually segmenting clears it up. However doing so would clear your accumulated transcript, which isn't always desired. Add ability to preserve the transcript. A small wrinkle: the new commit logic requires N consecutive identical windows before committing. To make this feature play nicely with it, I had to forcibly commit any preview text that hasn't yet been committed. Failing to do this would usually cause short utterances / the most recently said stuff to get wiped out.
*	Add UI for browser src	yum	2023-06-26
\| \| \| \|	Add ability to toggle on/off browser src & configure port.
*	Bugfix: Transcript no longer repeats when paused in desktop	yum	2023-06-26
\| \| \| \| \|	Hitting the desktop keybinding to stop transcription would sometimes cause the last transcript to repeat itself.
*	Add browser source, hardcoded to port 8097	yum	2023-06-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Transcription output now streams to localhost:8097. In OBS: * Create a browser source. * url: localhost:8097 * width: 2200 * height: 400 TODO: * Put behind toggle. * Create input field for port. Misc cleanup: * transcribe.py: Drop frames from audio capture thread instead of the transcription thread. Doing it the other way would result in occasional data loss.
*	Rework transcription commit logic	yum	2023-06-24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At the core of the STT, there's a loop which uses Whisper to convert audio into a transcript. As you say something, whisper sees growing fragments of your sentence: t0: "Hell" t1: "Hello" t2: "Hello, world!" So we need some algorithm which takes these fragments and accumulates them into an ever-growing transcript. Previously I did this with fuzzy string matching. I'd find the region where the two transcripts overlap and edit the two together to produce a longer transcript. The big problem is that if there's no overlap, it's not clear whether whisper radically changed its mind as to what was said, or whether the user paused for a long time before saying something new. So I'd have to reset the growing transcript. Now I get the timestamps from Whisper and wait for it to give me the same 3 transcripts for the last utterance. Once the transcript stabilizes like this, I commit the text. This enables a temporally stable, ever-growing transcript that's also quite accurate. To prevent a latency regression, I also introduce the notion of "preview text", which is a preview of an utterance that has not yet stabilized. These previews do not contribute to the ever-growing transcript, but do get fed through the rest of the app, so they show up in-game / in OBS. Once they eventually stabilize, they get committed to the ever-growing transcript. This change is lightly tested!
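A heavily simplified sketch of the commit and preview behaviour described above. It ignores Whisper timestamps and audio dropping, and treats each transcription window as one string; `Committer` and its methods are hypothetical names. Text is committed once the same window text has been seen 3 times in a row; until then it is shown as a preview appended to the committed transcript.

```python
class Committer:
    """Hypothetical accumulator for committed text plus unstable preview."""

    def __init__(self, n_required=3):
        self.n_required = n_required
        self.committed = ""
        self.recent = []

    def update(self, window_text):
        """Feed one window transcript; return the text to display."""
        self.recent = (self.recent + [window_text])[-self.n_required:]
        stable = (len(self.recent) == self.n_required
                  and len(set(self.recent)) == 1)
        if stable and window_text:
            # Stable: fold the text into the ever-growing transcript.
            # The real code would also drop the matching audio here.
            self.committed = (self.committed + " " + window_text).strip()
            self.recent = []
            return self.committed
        # Unstable: show it as preview text without committing.
        return (self.committed + " " + window_text).strip()
```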
*	Finish translation for Western European language speakers (v0.12.0)	yum	2023-05-30
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	NLLB needs its input to be split up into sentences. I use the sentence_splitter Python package to do this. It supports ~20 Western European languages, but notably, no Asian languages. * Sort spoken language list. English is still at the top. * Remove 'Translation source' dropdown. Infer this from the spoken language. * Add lang_compat.py to map language codes between the various libraries (whisper, nllb, sentence_splitter). * Fix bug where old text would appear in textbox when you first bring it up.
*	Add ability to translate into 200 languages	yum	2023-05-25
\| \| \| \| \| \| \| \| \|	Use Meta's No Language Left Behind (NLLB) algorithm to provide translation capabilities into 200 languages. Obviously most are very untested. This requires either 4.1 or 7.1 GB of RAM and significantly increases transcription latency.
*	Add more text filters	yum	2023-05-24
\| \| \| \| \| \| \| \| \| \|	Add 3 filters: * Remove trailing period * Convert to uppercase * Convert to lowercase All may be composed. Upper/lower just overwrite each other so just use one.
*	Add UI toggle for uwu filter	yum	2023-05-24
\| \| \| \| \|	UI now has a checkbox for the uwu filter. Does not materially affect resource usage or latency when enabled.
*	Begin work on uwu filter	yum	2023-05-24
\| \| \| \| \| \|	Use UwwwuPP to translate your boring old speech into uwu-ified version. Still need to add a UI toggle for this.
*	Add ability to type using STT	yum	2023-05-23
\| \| \| \| \| \| \| \| \| \|	To use it, do a medium hold + long hold. Keep the long hold depressed until you're done speaking. The transcription will be typed into the currently selected input field. * Add more audio feedback * Make audio feedback play asynchronously so it doesn't slow down the controller input state machine as much.
*	Add ability to update textbox in place	yum	2023-05-22
\| \| \| \| \| \| \|	By holding the button while talking for at least 1.5 seconds, you can update the contents of the textbox without unlocking it from worldspace. So now you can carefully position your textbox once, then continually speak into it without having to reposition it every time.
*	Add keyboard toggle (v0.11.4)	yum	2023-05-22
\| \| \| \| \| \|	Users can now configure a keybind to start/stop/dismiss the STT when in desktop mode. The default keybind is ctrl+x, since by default VRC doesn't use 'x' for anything.
*	Fix accidental semicolon typo	faker	2023-05-22
\|
*	Enable selecting specific GPU when transcribing	yum	2023-05-21
\| \| \| \| \| \|	Useful on devices with multiple GPUs, such as gaming laptops. * Update GUI/README.md.
*	Restore string matching, remove affinity mask (v0.11.1)	yum	2023-04-25
\| \| \| \| \| \|	Affinity mask no longer affects performance. String matching is still needed for temporal stability in fast-paced long-form transcription tasks.
*	~Finish integrating faster-whisper	yum	2023-04-24
\| \| \| \|	I'm able to use the new code to show text in game. Not yet play-tested.
*	Begin integrating faster-whisper (v0.11.0)	yum	2023-04-23
\| \| \| \| \| \|	This is a much faster, lower-VRAM reimplementation of Whisper in Python. Early testing is extremely promising: fast transcription speed, extremely low resource usage (CPU/RAM/VRAM), high accuracy.
*	Set PYTHONPATH in synchronous multiprocessing layer	yum	2023-03-08
\| \| \| \| \| \| \| \| \|	A user saw an error like `ModuleNotFoundError: No module named _socket`. StackOverflow blames this on PYTHONPATH, so let's try setting it. * Fix latent bug in Scripts/transcribe.py. PyAudio.open() positional parameters must be specified in correct order, even when telling it which parameter is which. shrug
*	Revert "Apply previous window conditioning to decoding layer"	yum	2023-02-22
\| \| \| \| \| \| \| \|	This reverts commit cece1ee8f1b985c2a89adb661dd02c6d44787f67. This does not in fact result in improved temporal stability. It makes things so unstable that even single-sentence messages fail to ever stabilize.
*	Apply previous window conditioning to decoding layer	yum	2023-02-22
\| \| \| \| \|	Per the Whisper source code, this should result in better temporal stability.
*	Remove exponential backoff cap (v0.7.0)	yum	2023-02-19
\| \| \| \| \| \| \| \| \| \|	Allows sustained exponential backoff when not transcribing. Used to cap out at 1s. * Add more items to README TODO list * Adjust emote metadata * Emotes bugfix: Non-existent emote map doesn't cause transcription engine to bail out.
*	Finish emotes	yum	2023-02-13
\| \| \| \| \| \| \| \| \| \| \|	Emotes require 2 bytes per char. They're encoded into the region [0xE000, infinity). The texture is 4k, and uses 1k vertical pixels per emote segment, for a maximum of 32 segments. * Reduce volume of noise indicator by 90%. Quiet is probably better. Might want to add a volume slider idk. * Bugfix: emotes without a transparency channel now work * Address a couple Unity performance complaints about the shader
*	Begin work adding emotes	yum	2023-02-13
\| \| \| \| \| \| \| \| \| \| \| \|	Done: * Users can add images to Fonts/Emotes/ * The basename of that image ('clueless.png' becomes 'clueless') is the keyword to make the image show up in game. * Fix a bug in the shader where letters on the 2nd texture and later would have UV outside of [0.0, 1.0] Not yet implemented: * transcribed words are encoded using emotes mapping
*	Built-in chatbox no longer shows empty messages	yum	2023-02-04
\| \| \| \|	* Reduce noise on/off indicator volume by 50%
*	GUI: Add ability to choose button	yum	2023-01-25
\| \| \| \| \| \|	We use a button to start/stop transcription. Previously this was hardcoded to left joystick. Now users can pick from {left, right} x {joystick, a, b}.
*	Enable using built-in chatbox (v0.3)	yum	2023-01-22
\| \| \| \| \| \| \| \| \| \| \| \| \|	VRChat exposes a built-in chatbox which can be seen by anyone who has it enabled. This was not the case when I started this project: the chatbox would only be visible to friends. Since this is clearly useful, enabling the STT on public models, let's enable sending data to it. Caveats: * The built-in chatbox has anti-spam tech which limits us to updating about once every 2 seconds. The custom chatbox has no such limitation and is thus typically much faster.
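A minimal sketch of living with the roughly 2-second update limit described above. `ChatboxThrottle` is a hypothetical name, the clock is passed in by the caller so the logic stays testable, and the actual send call is left out.

```python
class ChatboxThrottle:
    """Hypothetical rate limiter: at most one update per min_interval_s."""

    def __init__(self, min_interval_s=2.0):
        self.min_interval_s = min_interval_s
        self.last_sent_s = None
        self.pending = None

    def offer(self, text, now_s):
        """Return the text to send now, or None if it is too soon.

        Text that arrives too early is kept in self.pending so the caller
        can retry with the newest version later.
        """
        self.pending = text
        too_soon = (self.last_sent_s is not None
                    and now_s - self.last_sent_s < self.min_interval_s)
        if too_soon:
            return None
        self.last_sent_s = now_s
        out, self.pending = self.pending, None
        return out
```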
*	Bugfix: user-provided paths may now contain spaces	yum	2023-01-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, paths containing spaces would be interpreted by python's argument parser as multiple separate arguments, causing it to fail. Now we escape paths inside PythonWrapper using std::quoted(). * Improve PII filtering. Python output would contain multiple path separators (like C:\\Users\\foo\\), defeating the PII regex. * Silence compiler warning in PII filter. * Document usability improvements. * Transcription layer exponential backoff goes to ~infinity when paused. This is a hack, since we really don't need to transcribe at all when paused, but it lets us keep the code simple. Good enough until the next rewrite. * Shader only samples background when necessary. * Limit matchStrings() print()s to DEBUG mode
*	Portability bugfixes	yum	2023-01-01
\| \| \| \| \|	* Expose option to run transcription engine on CPU instead of GPU * Use embedded git when setting up the Python virtual environment
*	Bugfix: regions truncate correctly at page boundaries	yum	2022-12-30
\| \| \| \| \| \| \| \|	Boards whose size is an even multiple of CHARS_PER_SYNC would lose the entire last region. * Attempt to fix runaway memory usage of GUI text frames, but this needs more work
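A minimal sketch of splitting a board into sync regions so that a size which is an exact multiple of `CHARS_PER_SYNC` still yields a full final region. `split_regions` is a hypothetical name and the numbers in the usage are illustrative; this is not the project's code.

```python
def split_regions(n_chars, chars_per_sync):
    """Return (start, end) index pairs covering n_chars in sync-sized regions."""
    if chars_per_sync <= 0:
        raise ValueError("chars_per_sync must be positive")
    return [
        (start, min(start + chars_per_sync, n_chars))
        for start in range(0, n_chars, chars_per_sync)
    ]
```

For example, `split_regions(8, 4)` gives two full regions, and `split_regions(10, 4)` gives two full regions plus a truncated third.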
*	GUI: Expose transcription window duration	yum	2022-12-30
\| \| \| \| \|	Users can pick longer transcription durations for accuracy-critical tasks, or shorter durations for latency-critical tasks.