author: yum <yum.food.vr@gmail.com> 2025-05-17 23:41:20 -0700
committer: yum <yum.food.vr@gmail.com> 2025-05-17 23:54:56 -0700
commit: f8e95c0b85288a10f435e0edabf43defa0c303ac
tree: c0fd2d499cd7ee6e51947f1df62e7cad05b67816 /README.md
parent: 0c54e1fc74fe7677a0d4fef1c147c6e886d182db
Add STT code
Diffstat (limited to 'README.md')
-rw-r--r-- README.md | 45
1 files changed, 42 insertions, 3 deletions
diff --git a/README.md b/README.md
index eaeceea..abb0576 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,40 @@
# Optimized text paging for VRChat
+This repo provides code to help you send English text into VRChat. It includes:
+
+1. Training code to produce an English-language tokenizer of any vocabulary
+ size.
+2. Code to turn your tokenizer into a lookup table for GPU decoding.
+3. Unity code to generate an animator to shuttle data from OSC to material
+ properties.
+4. OSC code to talk to your Unity animator.
+
+To get started, see Quick Start.
+
+## Quick start
+
+1. Clone this repo.
+2. Clone my toon shader, [2ner](https://github.com/yum-food/2ner).
+3. Install Lyuma's av3emulator.
+4. Drag STT.prefab onto your avatar's root.
+5. Enter play mode.
+6. Open PowerShell.
+
+```bash
+$ cd ~
+$ mkdir tmp
+$ cd tmp
+$ python.exe -m venv venv
+$ ./venv/Scripts/Activate.ps1
+$ pushd /path/to/FastTextPaging/
+$ pip3 install -r requirements.txt
+$ python3 ./hi.py
+```
+
+7. Start typing.
+
+## Design overview
+
It is sometimes useful to send text data into VRChat, for example for
speech-to-text (STT). This is typically done naively, with a "block" of
n 8-bit characters\* sent in along with an 8-bit pointer. Since avatars can only
@@ -19,7 +54,7 @@ used. Thus to reach a typical reading speed, you need to use (260/4.7) = 55.5
OSC bits. The goal of this module is to get more out of these bits by
compressing text over the wire.
-## Unigram tokenizer
+### Unigram tokenizer
Byte pair encoding (BPE) is an encoding scheme frequently used in natural
language processing (NLP) contexts. For any language with a fixed character set
@@ -127,7 +162,7 @@ bits naive rate bpe rate speedup factor
I reserve 39 token slots for sequences of whitespace characters of length 2-40. This helps simplify formatting. To end a line or position text, you can just send in the exact right number of spaces, and a fixed-width font renderer will position things as intended.
-## Paging data into shader
+### Paging data into shader
Sending this data to a shader is pretty simple:
@@ -224,7 +259,7 @@ void GetTokens(uint screen_ptr, out uint block_ptr, out uint tokens[BLOCK_WIDTH]
}
```
-## GPU decoding
+### GPU decoding
Now we have to translate the tokens into text. I do this with a texture laid out as follows:
@@ -236,6 +271,10 @@ My tokenizer's vocabulary is 65,536 tokens. If we add up the lengths of every to
So, the entire vocabulary - length+offset head and content - requires a 32-bit RGBA texture with 232,419 slots. We'll just jam this into a 512x512 texture, at an occupancy ratio of 88.66% (11.34% waste). The total VRAM usage of that lookup table (LUT) is 1 MiB.
+![Unigram tokenizer texture](Images/unigram_lut_for_visualization.png)
+
+*A 64K vocabulary tokenizer I trained on Wikipedia and OpenSubtitles.*
+
We want to implement this API:
```c