Diffstat (limited to 'README.md')
-rw-r--r-- README.md 96
1 files changed, 39 insertions, 57 deletions
diff --git a/README.md b/README.md
index 4a48d2e..bc6f6b0 100644
--- a/README.md
+++ b/README.md
@@ -2,8 +2,7 @@
TaSTT (pronounced "tasty") is a free speech-to-text tool for VRChat. It uses
local machine transcription to turn your voice into text, then sends it into
-VRChat via OSC. A few parameters, a machine-generated FX layer, and a
-custom shader display the text in game.
+VRChat via OSC.
![Speech-to-text demo](Images/speech_to_text_demo.gif)
@@ -20,9 +19,7 @@ Made with love by yum\_food.
## Usage and setup
-To use a prebuilt package, go to the releases tab and download the latest
-release. Follow the guide associated with that release. To give you a taste,
-[here's the v0.0 setup guide](https://www.youtube.com/watch?v=0qjxkdVTqcs).
+Get the latest package from [the releases page](https://github.com/yum-food/TaSTT/releases/latest).
Please [join the discord](https://discord.gg/YWmCvbCRyn) to share feedback and
get technical help.
@@ -30,35 +27,28 @@ get technical help.
To build your own package from source, see GUI/README.md.
Basic controls:
-* Short click the left joystick to make it show up & start transcribing.
-* Short click the left joystick to make it lock in place & stop transcribing.
-* Long click the left joystick to make it go away & stop transcribing.
+* Short click the left joystick to toggle transcription.
+* Long click the left joystick to hide the text box.
* Scale it up/down in the radial menu.
## Features
-* 4x48 grid, 256 or 65536 characters per slot.
-* Text-to-text interface.
-* Speech-to-text interface.
+* Customizable board resolution, [up to ridiculous sizes](https://www.youtube.com/watch?v=u5h-ivkwS0M).
+* 8-bit and 16-bit character encodings.
+* Japanese, Korean, and Chinese glyphs included.
* Multiple language support.
- * Transcription within the same language works for many languages.
- * Translation from N languages to English is supported.
- * Translation from English into other languages is added case by case. This
- is a limitation of the state of the art in machine translation: fine-tuned
- English->other language models far outperform English->many language models.
-* Start/stop transcription by clicking left joystick.
-* Resizable: talk to friends close up or far away.
+* Resizable.
* Audio feedback: hear distinct beeps when transcription starts and stops.
* May also enable in-game noise indicator, to grab others' attention.
-* Visual transcription indicator. Green == talking, orange == waiting for sync,
- red == done talking.
-* May be attached to hand or left in world space.
-* Free as in beer.
-* Free as in freedom.
+* Visual transcription indicator.
+* Locks to world space when done speaking.
+* Can use built-in chatbox (usable with public avatars!)
* Privacy-respecting: transcription is done on your GPU, not in the cloud.
* Hackable.
-* 100% from-scratch implementation.
-* Permissive MIT license.
+* From-scratch implementation.
+* Free as in beer.
+* Free as in freedom.
+* MIT license.
### Motivation
@@ -72,32 +62,37 @@ reason or another:
1. RabidCrab's STT costs money and relies on cloud-based transcription. I have
struggled with latency, quality, and reliability issues. It's also
closed-source.
-2. The in-game text box is only visible to your friends list, making it
- useless for those who like to make new friends.
-
-Thus I believe that a free alternative is both needed and justified.
-
-I hope that this codebase aids and motivates the creation of better, more
-expressive communication tools for mutes.
+2. The in-game text box is not visible in streamer mode, and limits you to one
+ update every ~2 seconds, making it a poor choice for latency-sensitive
+ communication.
+3. [KillFrenzy's AvatarText](https://github.com/killfrenzy96/KillFrenzyAvatarText)
+ only supports text-to-text, and is GPL, making it legally risky for people
+ who want to sell closed-source software.
+4. [I5UCC's VRCTextboxSTT](https://github.com/I5UCC/VRCTextboxSTT) makes
+ KillFrenzy's AvatarText and Whisper kiss. It's the closest spiritual cousin
+ to this repository. There are two crucial differences: it's GPL not MIT, and
+ it doesn't abstract away the command line.
### Design overview
-There are currently 5 important pieces:
+These are the important bits:
-1. `TaSTT.shader`. A simple unlit shader. Has one parameter per cell in the
- display.
-2. `libunity.py`. Contains the logic required to generate and manipulate Unity
+1. `TaSTT_template.shader`. A simple unlit shader template. Contains the
+ business logic for the shader that shows text in game.
+2. `generate_shader.py`. Adds parameters and an accessor function to the
+ shader template.
+3. `libunity.py`. Contains the logic required to generate and manipulate Unity
YAML files. Works well enough on YAMLs up to ~40k documents, 1M lines.
-3. `libtastt.py`. Contains the logic to generate TaSTT-specific Unity files,
+4. `libtastt.py`. Contains the logic to generate TaSTT-specific Unity files,
namely the animations and the animator.
-4. `osc_ctrl.py`. Sends OSC messages to VRChat, which it dutifully passes along
+5. `osc_ctrl.py`. Sends OSC messages to VRChat, which it dutifully passes along
to the generated FX layer.
-5. `transcribe.py`. Uses OpenAI's whisper neural network to transcribe audio
+6. `transcribe.py`. Uses OpenAI's whisper neural network to transcribe audio
and sends it to the board using osc_ctrl.
#### Parameters & board indexing
-I divide the board into 16 regions and use a single int parameter,
+I divide the board into several regions and use a single int parameter,
`TaSTT_Select`, to select the active region. For each byte of data
in the active region, I use a float parameter to blend between two
animations: one with value 0, and one with value 255.
@@ -105,24 +100,11 @@ animations: one with value 0, and one with value 255.
To support wide character sets, I support 2 bytes per character. This
can be configured down to 1 byte per character to save parameter bits.
-The the total amount of parameter memory used is dictated by this equation:
-
-```
-ROWS = 4
-COLS = 44
-CELLS = 16
-MEMORY = ROWS * COLS * (N bits per character) / CELLS + 1 + log2(CELLS)
-```
-
-This is currently 93 bits for 1-byte characters and 181 bits for 2-byte
-characters.
-
#### FX controller design
The FX controller (AKA animator) is pretty simple. There is one layer for each
-character in a cell. The layer has to work out which cell it's in, then
-work out which letter we want to write in that cell, then run an animation for
-that letter.
+sync parameter (i.e. each character byte). The layer has to work out which
+region it's in, then write a byte to the correct shader parameter.
![One FX layer with 16 cells](Images/tastt_anim.png)
@@ -172,8 +154,8 @@ Contributions welcome. Send a pull request to this repository.
checking transcriptions without having to see the board in game.
6. TTS. Multiple people have requested this. See if there are open source
algorithms available; or, figure out how to integrate with
- 7. Save UI input fields to config file. Persist across process exit. It's
- annoying having to re-enter the config every time I use the STT.
+ 7. ~~Save UI input fields to config file. Persist across process exit. It's
+ annoying having to re-enter the config every time I use the STT.~~ DONE
8. Customizable controller bindings. Someone mentioned they use left click
to unmute. Let's work around users, not make them change their existing
keybinds.
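
Note on the hunk at `@@ -105,24 +100,11 @@`: the parameter-memory equation removed there can be checked with a short Python sketch. The constants and the formula are copied from the removed lines; the function name `parameter_bits` is hypothetical and not part of the repository. The `+ 1` term is kept as written, since the README does not say what it accounts for. With the old fixed 4x44 board and 16 regions, this reproduces the 93-bit and 181-bit figures quoted in the removed text.

```python
import math

# Constants taken from the removed README lines (old fixed-size board).
ROWS = 4
COLS = 44
CELLS = 16  # number of regions addressable through TaSTT_Select


def parameter_bits(bits_per_char, rows=ROWS, cols=COLS, cells=CELLS):
    """Evaluate the README formula:

    MEMORY = ROWS * COLS * (N bits per character) / CELLS + 1 + log2(CELLS)

    Hypothetical helper for illustration only. It assumes the character
    bits divide evenly across the regions and that `cells` is a power of two.
    """
    total_char_bits = rows * cols * bits_per_char
    if total_char_bits % cells != 0:
        raise ValueError("character bits must divide evenly across regions")
    data_bits = total_char_bits // cells
    select_bits = int(math.log2(cells))
    return data_bits + 1 + select_bits


if __name__ == "__main__":
    for bits in (8, 16):
        print(f"{bits}-bit characters: {parameter_bits(bits)} parameter bits")
```

Since this change makes the board resolution and region count configurable, the same helper could be called with other `rows`, `cols`, and `cells` values, assuming the formula still holds for the new layout.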