TaSTT.git - Free self-hosted STT for VRChat.

	Commit message	Author	Age
*	Finish python virtual env	yum	2022-12-17
	GUI can now download all TaSTT dependencies and install them into a virtual environment.
	* Add buttons to check embedded python version & install dependencies
	* Add class to wrap interacting with embedded Python
	* Put all TaSTT python scripts into a folder
*	Optimize transcription latency	yum	2022-12-14
	Shave off ~500ms of latency caused by locking. Acquiring a threading.Lock takes hundreds of milliseconds and the global interpreter lock already takes care of most crashy race conditions, so just remove the locks.
	Avoid writing audio to disk, saving more time (and disk wear / IOPS).
	Add basic profiling to transcribe().
	Omit timestamps, since we don't use them (maybe we should!)
	Shorten noise indicators to 350ms.
	The whisper behavior where it repeats tokens causes certain transcriptions to take many seconds. I haven't thought about how to fix this yet.
*	Add on/off sound indicator (local)	yum	2022-11-25
	Now we have a visual and auditory indicator for transcription. The auditory indicator is only heard by the user, and can be used to reset the state of the board prior to displaying.
*	Tweak speech indicator	yum	2022-11-23
	Use a single indicator with 3 states:
	1. green: actively speaking
	2. orange: waiting for paging
	3. red: up-to-date
	Use slightly nicer colors.
*	Shorten audio window to 10 seconds	yum	2022-11-22
	This helps with temporal stability in long-running transcriptions, and lets us get rid of that hack where we refuse to update old pages.
*	Fix audio bug	yum	2022-11-22
	Coarse locking was causing audio frames to drop, severely degrading transcription quality. We really need a spoken word integration test.
*	Rework input controls	yum	2022-11-22
	Press joystick once to start recording, again to stop. When you start recording, any previous text on the board is cleared.
	Add 2 visual indicators: one to indicate speech, another to indicate that audio is paging.
*	Tweak transcription again	yum	2022-11-16
	Works a little better on longer transcriptions while maintaining the same improved performance on short transcriptions. We really need a benchmark to evaluate performance mechanically.
*	Another transcription rework	yum	2022-11-14
	After re-reading the paper, I noticed that they apply a couple of optimizations I wasn't using.
	Use the top-level `whisper.transcribe` method, which is a little slower but more accurate than the one I was using. Although this method is slower, it has better temporal stability due to the increased quality, which I think should make for an overall more responsive UX. Lower transcription quality means the paging layer has to waste time updating earlier cells.
	Also, drop the auto-commit stuff and go back to string stitching. I think it's better to let the user manually commit. A rework of the hand controls is probably coming soon.
	Finally, update README.
*	Fix reset button	yum	2022-11-12
	Board would lock up if you reset after the first page. osc_ctrl.clear() was assigning the wrong member :)
	Tweak continuous transcription logic: now we only commit if the transcription remains identical for N seconds.
*	Clicking the left joystick resets the board.	yum	2022-11-12
	* Increase no speech probability threshold. This is what was preventing short transcriptions from working. We rely more on the avg logprob filter now.
	* Remove string matching logic from transcribe. Now when we get 2 consecutive identical transcriptions, we commit the transcription. This could cause words to get cut off but in practice it doesn't seem to happen.
	* Fix steamvr joystick click detection. Moving the joystick would also fire the event, which is not correct.
	* Combine locks in transcribe.py.
	* Remove "clear" vocal control.
	* osc_ctrl.clear() resets last_message_encoded
	* Remove osc_ctrl.sendMessage (unused)
*	License scrub	yum	2022-11-10
	Begin auditing dependencies' licenses.
*	Add language flag to transcription CLI	yum	2022-11-06
*	String matching no longer relies on spaces	yum	2022-11-06
	Add `matchStrings`, which does basically the same thing as `matchStringList` except it doesn't split the input at space boundaries. I think this should work better for Japanese and Chinese, since they don't use spaces. Doesn't seem to cause any accuracy regressions for English.
	Also update the README.
*	Expand character set from 80 to 64K characters	yum	2022-11-05
	Each character is now addressed with 2 bytes instead of 1. The number of bytes per character is configured in (I think) exactly one spot, so increasing or decreasing this is trivial. English speakers can just set it to 1.
	The animator seems a little unstable; if I leave my character in a public for a while, the board becomes unresponsive. Oh well.
	* Check in fonts. Did this so users don't have to remember to set the resolution or to disable mipmaps.
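	A minimal sketch of addressing each character with a configurable number of bytes. It is not taken from the TaSTT sources: `BYTES_PER_CHAR`, `encode_char`, the use of the Unicode code point as the character index, and the big-endian byte order are all assumptions made for illustration.

```python
from typing import List

# Hypothetical single configuration point: 2 bytes address 65,536 characters,
# 1 byte addresses 256.
BYTES_PER_CHAR = 2


def encode_char(index: int, bytes_per_char: int = BYTES_PER_CHAR) -> List[int]:
    """Split a character index into `bytes_per_char` bytes, high byte first.

    Raises OverflowError if the index does not fit in that many bytes.
    """
    return list(index.to_bytes(bytes_per_char, "big"))


def encode_message(text: str, bytes_per_char: int = BYTES_PER_CHAR) -> List[List[int]]:
    """Encode every character of `text`, using its code point as the index (an assumption)."""
    return [encode_char(ord(ch), bytes_per_char) for ch in text]
```

	With this layout, switching between 1 and 2 bytes per character only means changing `BYTES_PER_CHAR`.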
*	Improve transcription quality	yum	2022-11-01
	Apply heuristics described in whisper paper. Dramatically improve silence detection as well as overall transcription quality. I was able to read the entire demo script at speed without any serious transcription inaccuracies. Field testing is TODO.
*	Fix bug where some text would show up after saying 'Clear'	yum	2022-11-01
*	Tweak continuous transcription	yum	2022-10-27
	Stitching now uses a 6-word sliding window instead of a 4-word one. Seems to dramatically improve transcription quality.
*	Add 'over' keyword	yum	2022-10-27
	When the user says 'over', the board will stop displaying new transcriptions until the user says 'clear'.
	* Remove the control thread from transcribe.py
*	Add fast clear animation	yum	2022-10-27
	The old clear mechanism would write an empty cell in every layer, which would take (0.3 seconds) * (11 layers) == about 3 seconds. The new mechanism drives an animation which overwrites every character slot simultaneously, taking only 0.1 seconds. A nice ~30x speedup.
	* Fix the transcription exponential backoff logic. Saying new things will reset the delay to the minimum again.
	* Clearing the board will also reset the transcription delay back to the minimum.
	* Tune the noise detection minimum to 0.2 instead of 0.1. Speaking softly into the mic seems to fail to exceed the 0.1 threshold pretty often.
*	De-scuff continuous transcription	yum	2022-10-25
	Transcription stitching now occurs in word space, rather than in text space. This avoids problems where we accidentally duplicate or delete letters in the middle of words.
	Factor out stitching into its own module and add a small handful of test cases. Hopefully if we hit problems in production, we can just grow this list and avoid regressions if we reimplement.
*	Tweak transcription heuristics	yum	2022-10-25
	The heuristics now occur in the filtered word space, so punctuation and casing changes won't confound them.
*	Add exponentially longer sleeps to transcribe loop	yum	2022-10-25
	When the user pauses their speech for an extended period of time, the transcription engine will sleep for progressively longer intervals, up to 1.5 seconds between transcriptions. This allows us to reduce idle resource consumption.
	To enable responsive transcription while the user is speaking actively, we reset the sleep duration to the minimum whenever a change is detected.
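	A minimal sketch of the backoff described above. Only the 1.5 second ceiling comes from the commit message; the `Backoff` class, the 0.1 second minimum, and the doubling factor are assumptions made for illustration.

```python
class Backoff:
    """Sleep interval that grows while nothing changes and resets on change."""

    def __init__(self, minimum: float = 0.1, maximum: float = 1.5, factor: float = 2.0):
        self.minimum = minimum  # assumed value
        self.maximum = maximum  # ceiling named in the commit message
        self.factor = factor    # assumed growth rate
        self.delay = minimum

    def next_delay(self, changed: bool) -> float:
        """Return how long to sleep before the next transcription attempt."""
        if changed:
            # The user is speaking: go back to the most responsive setting.
            self.delay = self.minimum
        else:
            # Idle: grow the interval, but never past the ceiling.
            self.delay = min(self.delay * self.factor, self.maximum)
        return self.delay
```

	A transcribe loop would call `time.sleep(backoff.next_delay(changed))` after each pass, where `changed` says whether the latest transcription differs from the previous one.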
*	Saying the word "clear" clears the board	yum	2022-10-24
	While the board is clearing, you can keep talking, and it will be rendered when the board finishes clearing.
	* bugfix: STT only beeps when it's out
*	Quiet down transcribe.py	yum	2022-10-20
	Also adjust continuous transcription algorithm to use leftmost minimum instead of rightmost. This prevents some cases where we generate longer and longer text.
*	Add continuous transcription mode	yum	2022-10-17
	Algorithm:
	* look at last 20 chars of last committed transcription
	* scan new transcription using 10-char sliding window
	* find spot where distance is minimized
	* stitch two messages together
	Thus we're able to maintain a continuously growing transcription without having to feed the AI more than 30 seconds of data at a time. Seems to work reasonably well in bench tests.
	Also fix silence detection. AI exposes a probability that nothing was said. Hand-pick a probability of 0.1. Sometimes the AI still goes sicko mode with this setting but going higher occasionally results in no transcription.
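	A minimal sketch of the stitching steps above. The commit message does not name the distance metric or say exactly how the 20-character tail and the 10-character window are compared, so this sketch assumes Hamming distance and compares every 10-character slice of the tail against every 10-character slice of the new text. The names `hamming` and `stitch` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def stitch(committed: str, new: str, tail_len: int = 20, window: int = 10) -> str:
    """Join `new` onto `committed` at the best-matching overlap."""
    if len(committed) < window or len(new) < window:
        # Too little text to align; this sketch simply concatenates.
        return committed + new
    tail_start = max(0, len(committed) - tail_len)
    best = None  # (distance, end of window in committed, end of window in new)
    for i in range(tail_start, len(committed) - window + 1):
        for j in range(len(new) - window + 1):
            d = hamming(committed[i:i + window], new[j:j + window])
            # Strict comparison keeps the first minimum found.
            if best is None or d < best[0]:
                best = (d, i + window, j + window)
    _, committed_end, new_end = best
    # Keep the committed text up to the matched window, then continue with
    # whatever follows the matched window in the new transcription.
    return committed[:committed_end] + new[new_end:]
```

	For example, stitching "the quick brown fox jumps over" with "fox jumps over the lazy dog" aligns on "fox jumps " and yields "the quick brown fox jumps over the lazy dog".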
*	Add libunity.addTransition	yum	2022-10-15
	* Implement basic board toggle using new transition logic
	* Metadata can now restore from file
*	Transcribe.py now pages	yum	2022-10-15
	Messages longer than a board will automatically write over the top.
	TODO
	* Real cell-based message diffing
	* Cumulative transcription
	  * this would completely mitigate the effects of trim events
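	A minimal sketch of the paging arithmetic, assuming the board is a fixed grid of character slots. The function name `board_position` and the 4x16 board size are hypothetical; the real board dimensions are not stated here.

```python
from typing import Tuple


def board_position(char_index: int, rows: int = 4, cols: int = 16) -> Tuple[int, int, int]:
    """Map a character's index in the message to (page, row, column).

    Once a page is full, the next character wraps to the top-left slot,
    overwriting what the previous page displayed there.
    """
    capacity = rows * cols
    page, offset = divmod(char_index, capacity)
    row, col = divmod(offset, cols)
    return page, row, col
```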
*	Further improve transcribe.py responsiveness	yum	2022-10-15
	Add a third heuristic. If the transcription is relatively long and the first bit differs from the previous transcription, immediately overwrite. Because the transcription is long, it's a bit less likely to be a complete mistranscription.
*	Tweak transcribe.py	yum	2022-10-15
	Slightly improve temporal stability and responsiveness at the cost of limiting to a 30 second recording.
	Before committing to a transcription, wait for two consecutive transcriptions such that they are identical, or the former is a prefix of the latter. This helps with temporal stability by eliminating most one-off wildly inaccurate transcriptions.
	Also make osc_ctrl.sendMessageLazy a little lazier, limiting it to 2 consecutive non-empty cells per call. This allows us to recover from mistranscriptions faster.
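	A minimal sketch of the commit rule described above. The commit message does not say which of the two transcriptions gets committed; this sketch assumes the latter. The name `stable_transcriptions` is hypothetical.

```python
from typing import Iterable, Iterator, Optional


def stable_transcriptions(stream: Iterable[str]) -> Iterator[str]:
    """Yield a transcription only when the previous one is identical to it
    or is a prefix of it."""
    previous: Optional[str] = None
    for current in stream:
        # A non-empty previous result that the current one extends (or
        # repeats) counts as stable; one-off outliers never satisfy this.
        if previous and current.startswith(previous):
            yield current
        previous = current
```

	A one-off mistranscription such as "yellow sub" between "hello" and "hello wor" is never yielded, because neither neighbor is a prefix of the other.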
*	Fix animations: renamed prefab from CustomSTT to TaSTT	yum	2022-10-15
	Also:
	* Check in toggle on/off animations
	* Add toggle parameter
	* libunity bug: getUniqueId() was calling allocateId() incorrectly
	* Remove osc_ctrl `client` global
	* Fix transcribe.py text encoding
*	Add ability to leave board in world	yum	2022-10-11
	* Add VRLabs' World Constraint as a submodule
	* Add animations for world constraint
	* Add toggles for board
	* Add libunity.py (no content yet)
	* Support >30s transcription
	* Add board FBX
*	Introduce STT proof-of-concept	yum	2022-10-03
	Using OpenAI's whisper neural network, we can do local STT. Transcription quality is good, system resource usage is minimal (1 GB VRAM), latency is much lower than cloud-based transcription.
	* Add transcribe.py
	  * Creates 3 threads:
	    * One saves mic audio to a buffer
	    * One passes mic audio to the STT
	    * One sends the transcribed text to the board
	  * Main thread listens for input. Press enter to start a new message.
	* Add osc_ctrl.sendMessageLazy, a simple diff-based message sending utility.
	  * A little complexity: it only sends 1 empty cell per call, allowing us to quickly say new things without having to wait for the whole buffer to clear.
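	A minimal sketch of a diff-based lazy sender in the spirit of osc_ctrl.sendMessageLazy. It is not the actual implementation: `split_cells`, `lazy_updates`, and the 4-character cell size are hypothetical, and the sketch returns the cells to send instead of sending OSC messages.

```python
from typing import List, Tuple


def split_cells(message: str, cell_size: int, num_cells: int) -> List[str]:
    """Pad or truncate `message` to the board size and cut it into cells."""
    total = cell_size * num_cells
    padded = message.ljust(total)[:total]
    return [padded[i:i + cell_size] for i in range(0, total, cell_size)]


def lazy_updates(last_sent: List[str], message: str,
                 cell_size: int = 4, max_empty: int = 1) -> List[Tuple[int, str]]:
    """Return (cell index, text) pairs that differ from what was last sent.

    At most `max_empty` blank cells are included per call, so clearing old
    text is spread over several calls and new text is not held up behind it.
    `last_sent` is updated in place for the cells that are returned.
    """
    target = split_cells(message, cell_size, len(last_sent))
    updates: List[Tuple[int, str]] = []
    empty_sent = 0
    for index, (old, new) in enumerate(zip(last_sent, target)):
        if old == new:
            continue  # unchanged cell: nothing to send
        if not new.strip():
            if empty_sent >= max_empty:
                continue  # defer further blank cells to a later call
            empty_sent += 1
        updates.append((index, new))
        last_sent[index] = new
    return updates
```

	Calling `lazy_updates` repeatedly with the same message converges on the target board state, returning an empty list once nothing is left to send.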