From 9fff496394dcd94c4084694ca96a5e07ab836274 Mon Sep 17 00:00:00 2001
From: yum
Date: Mon, 23 Jan 2023 14:28:53 -0800
Subject: package.ps1 now fetches all dependencies

Don't literally check in Python since it looks dodgy (rightfully so).
Instead the build script just fetches it.

* Update README, simplifying language and documenting other projects
---
 README.md | 96 ++++++++++++++++++++++++++------------------------------------
 1 file changed, 39 insertions(+), 57 deletions(-)

(limited to 'README.md')

diff --git a/README.md b/README.md
index 4a48d2e..bc6f6b0 100644
--- a/README.md
+++ b/README.md
@@ -2,8 +2,7 @@
 
 TaSTT (pronounced "tasty") is a free speech-to-text tool for VRChat. It uses
 local machine transcription to turn your voice into text, then sends it into
-VRChat via OSC. A few parameters, a machine-generated FX layer, and a
-custom shader display the text in game.
+VRChat via OSC.
 
 ![Speech-to-text demo](Images/speech_to_text_demo.gif)
 
@@ -20,9 +19,7 @@ Made with love by yum\_food.
 
 ## Usage and setup
 
-To use a prebuilt package, go to the releases tab and download the latest
-release. Follow the guide associated with that release. To give you a taste,
-[here's the v0.0 setup guide](https://www.youtube.com/watch?v=0qjxkdVTqcs).
+Get the latest package from [the releases page](https://github.com/yum-food/TaSTT/releases/latest).
 
 Please [join the discord](https://discord.gg/YWmCvbCRyn) to share feedback and
 get technical help.
@@ -30,35 +27,28 @@ get technical help.
 To build your own package from source, see GUI/README.md.
 
 Basic controls:
-* Short click the left joystick to make it show up & start transcribing.
-* Short click the left joystick to make it lock in place & stop transcribing.
-* Long click the left joystick to make it go away & stop transcribing.
+* Short click the left joystick to toggle transcription.
+* Long click the left joystick to hide the text box.
 * Scale it up/down in the radial menu.
 
 ## Features
 
-* 4x48 grid, 256 or 65536 characters per slot.
-* Text-to-text interface.
-* Speech-to-text interface.
+* Customizable board resolution, [up to ridiculous sizes](https://www.youtube.com/watch?v=u5h-ivkwS0M).
+* 8-bit and 16-bit character encodings.
+* Japanese, Korean, and Chinese glyphs included.
 * Multiple language support.
-  * Transcription within the same language works for many languages.
-  * Translation from N languages to English is supported.
-  * Translation from English into other languages is added case by case. This
-    is a limitation of the state of the art in machine translation: fine-tuned
-    English->other language models far outperform English->many language models.
-* Start/stop transcription by clicking left joystick.
-* Resizable: talk to friends close up or far away.
+* Resizable.
 * Audio feedback: hear distinct beeps when transcription starts and stops.
   * May also enable in-game noise indicator, to grab others' attention.
-* Visual transcription indicator. Green == talking, orange == waiting for sync,
-  red == done talking.
-* May be attached to hand or left in world space.
-* Free as in beer.
-* Free as in freedom.
+* Visual transcription indicator.
+* Locks to world space when done speaking.
+* Can use built-in chatbox (usable with public avatars!)
 * Privacy-respecting: transcription is done on your GPU, not in the cloud.
 * Hackable.
-* 100% from-scratch implementation.
-* Permissive MIT license.
+* From-scratch implementation.
+* Free as in beer.
+* Free as in freedom.
+* MIT license.
 
 ### Motivation
 
@@ -72,32 +62,37 @@ reason or another:
 1. RabidCrab's STT costs money and relies on cloud-based transcription. I have
    struggled with latency, quality, and reliability issues. It's also
    closed-source.
-2. The in-game text box is only visible to your friends list, making it
-   useless for those who like to make new friends.
-
-Thus I believe that a free alternative is both needed and justified.
-
-I hope that this codebase aids and motivates the creation of better, more
-expressive communication tools for mutes.
+2. The in-game text box is not visible in streamer mode, and limits you to one
+   update every ~2 seconds, making it a poor choice for latency-sensitive
+   communication.
+3. [KillFrenzy's AvatarText](https://github.com/killfrenzy96/KillFrenzyAvatarText)
+   only supports text-to-text, and is GPL, making it legally risky for people
+   who want to sell closed-source software.
+4. [I5UCC's VRCTextboxSTT](https://github.com/I5UCC/VRCTextboxSTT) makes
+   KillFrenzy's AvatarText and Whisper kiss. It's the closest spiritual cousin
+   to this repository. There are two crucial differences: it's GPL not MIT, and
+   it doesn't abstract away the command line.
 
 ### Design overview
 
-There are currently 5 important pieces:
+These are the important bits:
 
-1. `TaSTT.shader`. A simple unlit shader. Has one parameter per cell in the
-   display.
-2. `libunity.py`. Contains the logic required to generate and manipulate Unity
+1. `TaSTT_template.shader`. A simple unlit shader template. Contains the
+   business logic for the shader that shows text in game.
+2. `generate_shader.py`. Adds parameters and an accessor function to the
+   shader template.
+3. `libunity.py`. Contains the logic required to generate and manipulate Unity
    YAML files. Works well enough on YAMLs up to ~40k documents, 1M lines.
-3. `libtastt.py`. Contains the logic to generate TaSTT-specific Unity files,
+4. `libtastt.py`. Contains the logic to generate TaSTT-specific Unity files,
    namely the animations and the animator.
-4. `osc_ctrl.py`. Sends OSC messages to VRChat, which it dutifully passes along
+5. `osc_ctrl.py`. Sends OSC messages to VRChat, which it dutifully passes along
    to the generated FX layer.
-5. `transcribe.py`. Uses OpenAI's whisper neural network to transcribe audio
+6. `transcribe.py`. Uses OpenAI's whisper neural network to transcribe audio
    and sends it to the board using osc_ctrl.
 
 #### Parameters & board indexing
 
-I divide the board into 16 regions and use a single int parameter,
+I divide the board into several regions and use a single int parameter,
 `TaSTT_Select`, to select the active region. For each byte of data in the
 active region, I use a float parameter to blend between two
 animations: one with value 0, and one with value 255.
@@ -105,24 +100,11 @@ animations: one with value 0, and one with value 255.
 To support wide character sets, I support 2 bytes per character. This can be
 configured down to 1 byte per character to save parameter bits.
 
-The the total amount of parameter memory used is dictated by this equation:
-
-```
-ROWS = 4
-COLS = 44
-CELLS = 16
-MEMORY = ROWS * COLS * (N bits per character) / CELLS + 1 + log2(CELLS)
-```
-
-This is currently 93 bits for 1-byte characters and 181 bits for 2-byte
-characters.
-
 #### FX controller design
 
 The FX controller (AKA animator) is pretty simple. There is one layer for each
-character in a cell. The layer has to work out which cell it's in, then
-work out which letter we want to write in that cell, then run an animation for
-that letter.
+sync parameter (i.e. each character byte). The layer has to work out which
+region it's in, then write a byte to the correct shader parameter.
 
 ![One FX layer with 16 cells](Images/tastt_anim.png)
 
@@ -172,8 +154,8 @@ Contributions welcome. Send a pull request to this repository.
      checking transcriptions without having to see the board in game.
   6. TTS. Multiple people have requested this. See if there are open source
      algorithms available; or, figure out how to integrate with
-  7. Save UI input fields to config file. Persist across process exit. It's
-     annoying having to re-enter the config every time I use the STT.
+  7. ~~Save UI input fields to config file. Persist across process exit. It's
+     annoying having to re-enter the config every time I use the STT.~~ DONE
   8. Customizable controller bindings. Someone mentioned they use left click to
      unmute.
      Let's work around users, not make them change their existing
      keybinds.
-- 
cgit v1.2.3
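
The `Parameters & board indexing` context lines in the patch describe each character as one or two bytes, with each byte driving a float parameter that blends between a value-0 animation and a value-255 animation. A minimal sketch of that mapping follows. It is an illustration only, not code from the repository: the helper names (`char_to_bytes`, `byte_to_blend`, `blend_to_byte`) are hypothetical, and it assumes a linear blend and a big-endian byte order for 2-byte characters, neither of which the README states.

```python
from typing import List


def char_to_bytes(ch: str, bytes_per_char: int = 2) -> List[int]:
    """Split a character's code point into bytes.

    Big-endian order is an assumption made for this sketch.
    """
    code = ord(ch)
    if code >= 256 ** bytes_per_char:
        raise ValueError(f"{ch!r} does not fit in {bytes_per_char} byte(s)")
    return list(code.to_bytes(bytes_per_char, "big"))


def byte_to_blend(value: int) -> float:
    """Map a byte (0..255) to a blend weight in [0, 1].

    0.0 selects the value-0 animation, 1.0 the value-255 animation
    (linear blend assumed).
    """
    if not 0 <= value <= 255:
        raise ValueError("byte out of range")
    return value / 255.0


def blend_to_byte(blend: float) -> int:
    """Recover the byte that a blend weight encodes."""
    return round(blend * 255.0)


if __name__ == "__main__":
    for ch in "Aあ":
        data = char_to_bytes(ch)
        print(ch, data, [byte_to_blend(b) for b in data])
```

Dropping `bytes_per_char` to 1 halves the number of float parameters per character, which is the parameter-bit saving the README mentions, at the cost of rejecting any code point above 255.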