TaSTT.git - Free self-hosted STT for VRChat.

	Commit message	Author	Age
*	Finish python virtual env	yum	2022-12-17
	GUI can now download all TaSTT dependencies and install them into a virtual environment.
	* Add buttons to check embedded python version & install dependencies
	* Add class to wrap interacting with embedded Python
	* Put all TaSTT python scripts into a folder
*	Add emotes	yum	2022-11-26
	Add emotes.py. It accepts a list of images and creates a texture with 64 total embedded images. The shader knows how to draw these into fixed 6-character-wide slots. Each slot must be aligned to a 6-character boundary. osc_ctrl has to pad with spaces to make this work.
	This whole patch is a little more complicated than it has any right to be, but my brain feels fuzzy and I don't know where to start fixing it, so I'm going to leave it shitty-but-functional for now. There's also some bug where writing a character into the 11th slot causes it to show up at the end of the board. I'll figure that out later, idk.
	I didn't include any of the emotes I use since I couldn't find any info on their licenses. I'm just banking on having a good workflow later on so people can add their own.
*	Tweak speech indicator	yum	2022-11-23
	Use a single indicator with 3 states:
	1. green: actively speaking
	2. orange: waiting for paging
	3. red: up-to-date
	Use slightly nicer colors.
*	Shorten audio window to 10 seconds	yum	2022-11-22
	This helps with temporal stability in long-running transcriptions, and lets us get rid of that hack where we refuse to update old pages.
*	Rework input controls	yum	2022-11-22
	Press joystick once to start recording, again to stop. When you start recording, any previous text on the board is cleared.
	Add 2 visual indicators: one to indicate speech, another to indicate that audio is paging.
*	Fix reset button	yum	2022-11-12
	Board would lock up if you reset after the first page. osc_ctrl.clear() was assigning the wrong member :)
	Tweak continuous transcription logic: now we only commit if the transcription remains identical for N seconds.
*	Clicking the left joystick resets the board.	yum	2022-11-12
	* Increase no speech probability threshold. This is what was preventing short transcriptions from working. We rely more on the avg logprob filter now.
	* Remove string matching logic from transcribe. Now when we get 2 consecutive identical transcriptions, we commit the transcription. This could cause words to get cut off but in practice it doesn't seem to happen.
	* Fix steamvr joystick click detection. Moving the joystick would also fire the event, which is not correct.
	* Combine locks in transcribe.py.
	* Remove "clear" vocal control.
	* osc_ctrl.clear() resets last_message_encoded
	* Remove osc_ctrl.sendMessage (unused)
*	License scrub	yum	2022-11-10
	Begin auditing dependencies' licenses.
*	Fix osc_ctrl diffing	yum	2022-11-07
	Now we actually maintain an ongoing buffer with what we think is on the display. When we send a new cell, we update only that cell instead of overwriting the entire prefix up to that cell.
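The cell-level diffing described above can be sketched roughly as follows. This is a minimal illustration, not the actual osc_ctrl code: the `BoardMirror` class, the `send` callback, and the cell width are all hypothetical names and values chosen for the example.

```python
class BoardMirror:
    """Tracks what we believe is currently shown on the board (hypothetical helper)."""

    def __init__(self, num_cells: int, cell_width: int):
        self.cell_width = cell_width
        # Assume the board starts out blank.
        self.cells = [" " * cell_width for _ in range(num_cells)]

    def _split(self, message: str) -> list:
        # Pad or truncate the message to the board size, then cut it into cells.
        total = len(self.cells) * self.cell_width
        padded = message.ljust(total)[:total]
        return [padded[i:i + self.cell_width] for i in range(0, total, self.cell_width)]

    def sync(self, message: str, send) -> int:
        """Send only the cells that differ from the mirror; return how many were sent."""
        sent = 0
        for idx, cell in enumerate(self._split(message)):
            if cell != self.cells[idx]:
                send(idx, cell)         # stand-in for the real OSC update
                self.cells[idx] = cell  # update only this cell in the mirror
                sent += 1
        return sent
```

With a mirror like this, changing "hello world" to "hello there" on a board of 4-character cells resends only the cells that actually changed, not the whole prefix.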
*	Update README	yum	2022-11-06
*	String matching no longer relies on spaces	yum	2022-11-06
	Add a `matchStrings` which does basically the same thing as `matchStringList` except it doesn't split the input at space boundaries. I think this should work better for Japanese and Chinese, since they don't use spaces. Doesn't seem to cause any accuracy regressions for English.
	Also update the README.
*	Expand character set from 80 to 64K characters	yum	2022-11-05
	Each character is now addressed with 2 bytes instead of 1. The number of bytes per character is configured in (I think) exactly one spot, so increasing or decreasing this is trivial. English speakers can just set it to 1.
	The animator seems a little unstable; if I leave my character in a public for a while, the board becomes unresponsive. Oh well.
	* Check in fonts. Did this so users don't have to remember to set the resolution or to disable mipmaps.
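A small sketch of what addressing a character with a configurable number of bytes might look like. The constant name, the function names, and the little-endian byte order are assumptions made for illustration; the commit does not say how the bytes are ordered.

```python
# Hypothetical single configuration point; 2 bytes gives 256**2 == 65536 characters.
BYTES_PER_CHAR = 2


def encode_char_index(index: int) -> list:
    """Split a character index into BYTES_PER_CHAR byte-sized values (low byte first)."""
    if not 0 <= index < 256 ** BYTES_PER_CHAR:
        raise ValueError("index does not fit in the configured byte count")
    return list(index.to_bytes(BYTES_PER_CHAR, "little"))


def decode_char_index(parts: list) -> int:
    """Inverse of encode_char_index."""
    return int.from_bytes(bytes(parts), "little")
```

Setting `BYTES_PER_CHAR = 1` in this sketch would shrink the addressable set to 256 entries, which is the kind of one-line change the message describes.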
*	OSC controller uses one sync per region instead of 2	yum	2022-11-05
	My theory as to why this seems to work: VRChat batches parameter updates together and applies them simultaneously. Thus we don't run into the expected failure mode where we update the prior region before paging over to the next region.
	* Fix beep feature
	* update the STT demo
*	Reduce dimensionality of animator by factor of 80	yum	2022-11-05
	Instead of generating one animation for every single character in our character set, we just generate 2: the lowest and the highest. We use blend trees to interpolate between these two extremes.
	This reduces the number of animations we have to generate by a factor of 80. It also clears the way for multi-language support (coming soon). It also means we don't have to reopen unity every time we generate a new animator.
*	Add speech-to-text demo	yum	2022-11-04
*	Reduce sync rate from 10 Hz to 5 Hz	yum	2022-11-03
	Also reduce the number of syncs per cell from 3 to 2. Thus the effective sync rate went from (10 / 3 == 3.33 Hz) to (5 / 2 == 2.5 Hz).
	This comes at the cost of a degraded UX: updating a cell temporarily shows the contents of the previous cell.
*	Combine 4 boolean select parameters into one	yum	2022-11-01
	Should further improve reliability, especially in laggy environments. We'll see!
*	Fix bug where some text would show up after saying 'Clear'	yum	2022-11-01
*	Reduce total # of select bits from 44 to 4	yum	2022-10-30
	The board is divided into 16 regions. We select the region to be updated by updating 4 boolean parameters. We used to define 4 parameters per layer. Now we just have 4 params total, which affect every layer.
	Total param memory: 142 bits -> 102 bits
	Params updated per region update: 56 -> 16
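Since 16 regions fit exactly in 4 bits, the region selection can be pictured as a plain binary encoding. The sketch below assumes that encoding and a least-significant-bit-first order; the function and constant names are illustrative, not taken from the repository.

```python
NUM_REGIONS = 16   # from the commit message
SELECT_BITS = 4    # 2**4 == 16, so 4 booleans can address every region


def region_to_select_bits(region: int) -> list:
    """Return the 4 boolean select values for a region (bit 0 first, by assumption)."""
    if not 0 <= region < NUM_REGIONS:
        raise ValueError("region out of range")
    return [bool((region >> bit) & 1) for bit in range(SELECT_BITS)]
```

The savings quoted above are consistent with sharing these 4 booleans across all layers: 44 select bits become 4, which removes 40 bits (142 -> 102) and 40 per-update parameter writes (56 -> 16).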
*	Add 'over' keyword	yum	2022-10-27
	When the user says 'over', the board will stop displaying new transcriptions until the user says 'clear'.
	* Remove the control thread from transcribe.py
*	Add fast clear animation	yum	2022-10-27
	The old clear mechanism would write an empty cell in every layer, which would take (0.3 seconds) * (11 layers) == about 3 seconds. The new mechanism drives an animation which overwrites every character slot simultaneously, taking only 0.1 seconds. A nice ~30x speedup.
	* Fix the transcription exponential backoff logic. Saying new things will reset the delay to the minimum again.
	* Clearing the board will also reset the transcription delay back to the minimum.
	* Tune the noise detection minimum to 0.2 instead of 0.1. Speaking softly into the mic seems to fail to exceed the 0.1 threshold pretty often.
*	Add exponentially longer sleeps to transcribe loop	yum	2022-10-25
	When the user pauses their speech for an extended period of time, the transcription engine will sleep for progressively longer intervals, up to 1.5 seconds between transcriptions. This allows us to reduce idle resource consumption.
	To enable responsive transcription while the user is speaking actively, we reset the sleep duration to the minimum whenever a change is detected.
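A rough sketch of the backoff behaviour described above. Only the 1.5 second cap comes from the commit message; the 0.1 second minimum, the doubling factor, and the `Backoff` class itself are assumptions for illustration.

```python
class Backoff:
    """Sleep-interval tracker: grows while idle, resets when the transcription changes."""

    def __init__(self, min_s: float = 0.1, max_s: float = 1.5, factor: float = 2.0):
        self.min_s = min_s      # assumed minimum delay
        self.max_s = max_s      # cap stated in the commit message
        self.factor = factor    # assumed growth factor
        self.delay = min_s

    def next_delay(self, changed: bool) -> float:
        """Return how long to sleep before the next transcription attempt."""
        if changed:
            # Activity detected: go back to the most responsive setting.
            self.delay = self.min_s
        else:
            # Nothing new: wait longer next time, but never beyond the cap.
            self.delay = min(self.delay * self.factor, self.max_s)
        return self.delay
```

In a transcribe loop this would be used as something like `time.sleep(backoff.next_delay(changed=new_text != old_text))`.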
*	Saying the word "clear" clears the board	yum	2022-10-24
	While the board is clearing, you can keep talking, and it will be rendered when the board finishes clearing.
	* bugfix: STT only beeps when it's out
*	STT now beeps when it shows text, and can be locked to world	yum	2022-10-24
	Empty cells are excluded from the beeping behavior.
	Note: I have not checked in the prefab with the audio source yet.
	* libtastt gen_fx now adds 3 toggles to FX layer: toggle board, toggle world lock, toggle beep sound
	* libunity guid_map can now append instead of replacing
	* TaSTT_Toggle_{On,Off}.anim now use the prefab path, as do generated animations
*	Rewrite FX and animation generators	yum	2022-10-23
	* Fix bug where facial animations cause already-written letters to change (!!!)
	* Add libtastt.py to hold abstractions layered over libunity
	* Fix
	* libunity: Fix bug where integer equality state transition conditions ignored the threshold
	* libunity: Support placing animator states at different positions
*	Quiet down transcribe.py	yum	2022-10-20
	Also adjust continuous transcription algorithm to use leftmost minimum instead of rightmost. This prevents some cases where we generate longer and longer text.
*	Add continuous transcription mode	yum	2022-10-17
	Algorithm:
	* look at last 20 chars of last committed transcription
	* scan new transcription using 10-char sliding window
	* find spot where distance is minimized
	* stitch two messages together
	Thus we're able to maintain a continuously growing transcription without having to feed the AI more than 30 seconds of data at a time. Seems to work reasonably well in bench tests.
	Also fix silence detection. AI exposes a probability that nothing was said. Hand-pick a probability of 0.1. Sometimes the AI still goes sicko mode with this setting but going higher occasionally results in no transcription.
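The stitching idea above can be sketched as below. This is a simplified illustration under several assumptions: it uses a single window size for both the committed tail and the scan (the message mentions 20 and 10 characters), it uses Levenshtein distance because the message does not name the metric, and it takes the leftmost minimum. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming over one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def stitch(committed: str, new: str, window: int = 10) -> str:
    """Append the part of `new` that follows the best match for the committed tail."""
    tail = committed[-window:]
    if not tail:
        return new
    if len(new) < len(tail):
        return committed  # not enough new text to align against
    best_idx, best_dist = 0, None
    for i in range(len(new) - len(tail) + 1):
        dist = levenshtein(tail, new[i:i + len(tail)])
        if best_dist is None or dist < best_dist:  # strict '<' keeps the leftmost minimum
            best_idx, best_dist = i, dist
    return committed + new[best_idx + len(tail):]
```

For example, stitching "the quick brown fox jumps" with a new transcription "brown fox jumps over the lazy dog" aligns on " fox jumps" and appends only " over the lazy dog".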
*	Transcribe.py now pages	yum	2022-10-15
	Messages longer than a board will automatically write over the top.
	TODO
	* Real cell-based message diffing
	* Cumulative transcription
	  * this would completely mitigate the effects of trim events
*	Tweak transcribe.py	yum	2022-10-15
	Slightly improve temporal stability and responsiveness at the cost of limiting to a 30 second recording.
	Before committing to a transcription, wait for two consecutive transcriptions such that they are identical, or the former is a prefix of the latter. This helps with temporal stability by eliminating most one-off wildly inaccurate transcriptions.
	Also make osc_ctrl.sendMessageLazy a little lazier, limiting it to 2 consecutive non-empty cells per call. This allows us to recover from mistranscriptions faster.
*	Fix animations: renamed prefab from CustomSTT to TaSTT	yum	2022-10-15
	Also:
	* Check in toggle on/off animations
	* Add toggle parameter
	* libunity bug: getUniqueId() was calling allocateId() incorrectly
	* Remove osc_ctrl `client` global
	* Fix transcribe.py text encoding
*	Add ability to leave board in world	yum	2022-10-11
	* Add VRLabs' World Constraint as a submodule
	* Add animations for world constraint
	* Add toggles for board
	* Add libunity.py (no content yet)
	* Support >30s transcription
	* Add board FBX
*	Add osc_ctrl.ResizeBoard	yum	2022-10-04
	It's a little buggy; it likes to overwrite cells on the board. No idea why.
*	Introduce STT proof-of-concept	yum	2022-10-03
	Using OpenAI's whisper neural network, we can do local STT. Translation quality is good, system resource usage is minimal (1 GB VRAM), latency is much lower than cloud-based translation.
	* Add transcribe.py
	  * Creates 3 threads:
	    * One saves mic audio to a buffer
	    * One passes mic audio to the STT
	    * One sends the transcribed text to the board
	  * Main thread listens for input. Press enter to start a new message.
	* Add osc_ctrl.sendMessageLazy, a simple diff-based message sending utility.
	  * A little complexity: it only sends 1 empty cell per call, allowing us to quickly say new things without having to wait for the whole buffer to clear.
*	Add 4th layer of indexing	yum	2022-10-02
	* Double board size from 6x16 to 8x22
	* Reduce parameter bits used (thanks to extra layer of indexing)
	* Rename template.anim to template.anim.txt to prevent Unity from constantly rewriting it
	* osc_ctrl.encodeMessage now pads the message so that all empty space is overwritten
	* Delete osc_ctrl.sendMessageCellContinuous. Now that we use a single 'Enable' bit, this idea is sidelined.
	  * We can probably achieve the same effect by making TaSTT.shader a little more clever. For example, if we pass it the current cell number, it could render a time-based 'fade-in' effect which simulates smooth streaming.
*	Use a single Enable parameter instead of one per layer	yum	2022-10-02
	Even more reliable now.
*	Paging now works for other players at 40 characters per second	yum	2022-10-02
	* Shorten animations to 1 frame
	* Eliminate fx internal transition delays
	  * These were causing the shader parameters to interpolate, causing the inconsistent / flickering letters I was seeing
*	Add 'Do Nothing' animation	yum	2022-10-02
	Per the VRC docs, state behaviors may not execute if the total length of time in the state is < 0.02 seconds. Adding a 2-frame 'Do Nothing' animation to the top of every layer seems to help with stability. shrug
	More cleanup:
	* Generate a unique return-home transition for each terminal state instead of reusing the same one.
	* Use globally unique state names in animator.
	* All animations are at least 2 frames long.
*	Add parameters to resize board (likely broken)	yum	2022-10-02
	... and a bunch of bugfixes:
	* Shader is now transparent
	* Simplify shader row/column calculation
	* Add punctuation to texture
	* Fix generate.sh
	* Add lorum_ipsum.txt
	* Fix how long text is scrolled
	* Simplify encoding logic in osc_ctrl.py
*	Add line wrapping and support for arbitrarily long messages	yum	2022-09-30
	Add trivial line wrapping algorithm. Words are only added to a line if they don't put it over the column limit, and only broken if they alone exceed the column limit.
	Extend board size to 16x6, using 145 bits of parameter memory.
	Add simple generate.sh script, which generates everything needed to use the text-to-text board.
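One way to read the wrapping rule above is the greedy sketch below. It is an illustration only: the `wrap` function is a hypothetical name, and details such as splitting on whitespace and flushing the current line before breaking an oversized word are assumptions.

```python
def wrap(text: str, cols: int) -> list:
    """Greedy word wrap: break a word only if it alone exceeds the column limit."""
    lines = []
    cur = ""
    for word in text.split():
        # A word longer than a whole line has to be broken into column-sized pieces.
        while len(word) > cols:
            if cur:
                lines.append(cur)
                cur = ""
            lines.append(word[:cols])
            word = word[cols:]
        candidate = word if not cur else cur + " " + word
        if len(candidate) <= cols:
            cur = candidate      # the word fits on the current line
        else:
            lines.append(cur)    # the word would overflow: start a new line
            cur = word
    if cur:
        lines.append(cur)
    return lines
```

With a 10-column limit, "the quick brown fox" wraps into "the quick" and "brown fox"; no word is split because each fits on a line by itself.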
*	Redo FX layer	yum	2022-09-30
	Apparently the same avatar parameter can only be updated so quickly before VRChat starts dropping messages. So now we divide the board into "groups" of 8 characters. Each group can be updated relatively slowly, but all groups can be updated in parallel. Thus we can update the board group-by-group, pausing between each group.
	* Fix shader bugs - now there are Row05 parameters, and row00 refers to the topmost row instead of the bottom-most.
	* Remove outdated layer/group names files
	* Extend osc_ctrl.py to support encoding & sending messages
	* Add generate_params.py to handle creating TaSTT_params.asset
	* Add generate_utils.py for common code generation facilities & parameters.
*	begin unity-native port	yum	2022-09-30
*	FIRST WORKING PROTOTYPE!	yum	2022-09-29
	Can't get much faster than 0.1 seconds per character with the current design. Still, a good first step!
	* Simplify parameters: only use 3 8-bit ints + 1 boolean.
	* Rewrite FX generator according to new params.
	* Rewrite osc_ctrl.py to test in-game display.
*	delete SetLetters.cs	yum	2022-09-29
	Doesn't work in game.
	Also change # of characters per slot to 80, down from 128.
	Also realize that VRChat supports 256 BITS of parameter, not 256 BYTES.
	Next design idea:
	* 3 8-bit parameters: letter, row, col
	* 1 boolean parameter: active
	* one animation for each slot/letter combo, as usual
	* one fx layer like this:
	    if !active:
	      do nothing
	    if row == 0:
	      if col == 0:
	        if letter == 0:
	          play row00_col00_letter00 animation
	* because write defaults are off, we should be able to "save" letters by simply setting active = false
	* thus we don't need to simultaneously address the entire board, saving memory
*	add 'hello world' osc controller	yum	2022-09-29
	Simply sends numbers to a parameter's OSC address. Of course, nothing is showing up in game. More debugging is needed.