-rw-r--r-- README.md 81
1 files changed, 57 insertions, 24 deletions
diff --git a/README.md b/README.md
index 79a0afa..6900aea 100644
--- a/README.md
+++ b/README.md
@@ -53,29 +53,32 @@ Basic controls:
* Works with the built-in chatbox (usable with public avatars!)
* Customizable board resolution, [up to ridiculous sizes](https://www.youtube.com/watch?v=u5h-ivkwS0M).
* Lightweight design:
+ * Works with VRC native chatbox - works with any avatar without modification
* Custom textbox requires as few as 65 parameter bits
- * Transcription doesn't affect VRChat framerate much, since VRC is heavily
- CPU-bound. Performance impact when not speaking is negligible.
+ * Transcription doesn't destroy your frames in game since VRChat is heavily
+ CPU bound. Performance impact when not speaking is negligible.
+* Performant: uses CTranslate2 inference engine with GPU support and
+ flash-attention
* Browser source. Use with OBS!
* Multi-language support.
- * Japanese, Korean, and Chinese glyphs included, among many other languages.
- * Full list of Unicode blocks is defined in
- [generate_fonts.py](https://github.com/yum-food/TaSTT/blob/master/Scripts/generate_fonts.py#L43-L109).
* Whisper natively supports transcription in [100 languages](
https://github.com/openai/whisper/blob/main/whisper/tokenizer.py#L10).
+ * A local translation algorithm (Meta's NLLB) enables translating into 200
+ other languages with good-ish accuracy (BLEU scores typically around 20-35)
+ and low latency.
* Customizable:
* Control button may be set to left/right a/b/joystick.
- * Text color, background color, and border color are customizable in the shader.
- * Text background may be customized with PBR textures: base color, normal,
- metallic, roughness, and emission are all implemented.
- * Border width and rounding are customizable.
- * Shader supports physically based shading: smoothness, metallic, and emissive.
+ * Text filters: lowercase, uppercase, uwu, remove trailing period, profanity
+ censoring.
* Many optional quality-of-life features:
* Audio feedback: hear distinct beeps when transcription starts and stops.
* May also enable in-game noise indicator, to grab others' attention.
- * Visual transcription indicator.
- * Resize with a blendtree in your radial menu.
-* Locks to world space when done speaking.
+* Custom chatbox features:
+ * Free modular avatar prefab available [here](https://yumfood.gumroad.com/l/tastt_modular).
+ * Resizable with a blendtree in your radial menu.
+ * Locks to world space either when summoned (default) or when done speaking.
+ * Unicode variant (supporting e.g. Chinese and Japanese) is available
+ through the app's Unity panel.
* Privacy-respecting: transcription is done on your GPU, not in the cloud.
* Hackable.
* From-scratch implementation.
@@ -83,6 +86,34 @@ Basic controls:
* Free as in freedom.
* MIT license.
+## Bad parts
+
+I think that any ethical software project should disclose what sucks about it.
+Here's what sucks about this project:
+
+* The app UI looks like trash. Only you will see it, so I don't think this
+ really matters. (Electron rewrite when?)
+* The app is HUGE. This mostly stems from the bundled NVIDIA CUDNN .dll's
+ (~1.0GB) and portable git (~500 MB).
+ * NVIDIA's DLLs should be statically linked into ctranslate2. That probably
+ means doing our own build of ctranslate2... yuck.
+ * Portable git can probably be stripped down. It includes a full mingw
+ environment responsible for the vast majority of the size, which we almost
+ certainly don't need.
+* The app doesn't start automatically with steamvr (TODO do this)
+* The app starts in a weird state where it's transcribing and doesn't really
+ back off correctly. Press the controller keybind once to stop transcription
+ then again to put it into a normal state.
+* The backend Unity code is pretty gory. (This is largely irrelevant to end
+ users, since end users mostly use the VRC-native chatbox or the modular
+ avatar prefab.) I have a burning disdain for C# so I wrote a scuffed
+ "animator as code" library (libunity.py) in Python. This includes a lot of
+ crazy shit like a multiprocess YAML parser and a ton of macro-like string
+ manipulation/concatenation. We should just use the upstream C# animator as
+ code library.
+* The app doesn't include any version numbers, so debugging version-specific
+ issues can be tough (TODO fix this)
+
## Requirements
System requirements:
@@ -93,11 +124,10 @@ System requirements:
lot more, so I wouldn't recommend it.
* I've tested on a 1080 Ti and a 3090 and saw comparable latency.
* SteamVR.
-* No write defaults on your avatar if you're using the custom text box.
-Avatar resources used:
+Avatar resources used by custom chatbox:
-* Tris: 4
+* Tris: 12
* Material slots: 1
* Texture memory: 340 KB (English), 130 MB (international)
* Parameter bits: 65-217 (configurable; more bits == faster paging)
@@ -114,7 +144,10 @@ reason or another:
1. RabidCrab's STT costs money and relies on cloud-based transcription.
Because of the reliance on cloud-based transcription services, it's
- typically slower and less reliable than local transcription.
+ typically slower and less reliable than local transcription. However, the
+ accuracy and speed of cloud AI models has improved radically since late
+ 2022, so this is probably the best option if money and privacy don't matter
+ to you.
2. The in-game text box is not visible in streamer mode, and limits you to one
update every ~2 seconds, making it a poor choice for latency-sensitive
communication.
@@ -129,14 +162,14 @@ reason or another:
also uses Whisper, but they rely on the C# interface to Const-Me's
CUDA-enabled Whisper implementation. This implementation does not support
beam search decoding and waits for pauses to segment your voice. Thus it's
- less accurate and higher latency than this project's Python-based
- transcription engine, but it's more performant. It supports more feature
+ less accurate and higher latency than this project's
+ transcription engine. It supports more features
(like cloud-based TTS), so you might want to check it out.
-Why should you pick this project over the alternatives? This project has
-the lowest latency (measured <500ms end-to-end on mid-range hardware), most
-reliable transcriptions of any STT in VRChat, period. There is no network hop
-to worry about and no subscription to manage. Just download and go.
+Why should you pick this project over the alternatives? This project is mature,
+low-latency (typically 500-1000 ms end-to-end in game under load), reliable, and
+accurate. There is no network hop to worry about and no subscription to manage.
+Just download and go.
## Design overview
@@ -152,7 +185,7 @@ These are the important bits:
namely the animations and the animator.
5. `osc_ctrl.py`. Sends OSC messages to VRChat, which it dutifully passes along
to the generated FX layer.
-6. `transcribe.py`. Uses OpenAI's whisper neural network to transcribe audio
+6. `transcribe_v2.py`. Uses OpenAI's whisper neural network to transcribe audio
and sends it to the board using osc_ctrl.
#### Parameters & board indexing