| author | yum <yum.food.vr@gmail.com> | 2025-05-17 23:41:20 -0700 |
|---|---|---|
| committer | yum <yum.food.vr@gmail.com> | 2025-05-17 23:54:56 -0700 |
| commit | f8e95c0b85288a10f435e0edabf43defa0c303ac (patch) | |
| tree | c0fd2d499cd7ee6e51947f1df62e7cad05b67816 /README.md | |
| parent | 0c54e1fc74fe7677a0d4fef1c147c6e886d182db (diff) | |
Add STT code
Diffstat (limited to 'README.md')
| -rw-r--r-- | README.md | 45 |
1 files changed, 42 insertions, 3 deletions
````diff
@@ -1,5 +1,40 @@
 # Optimized text paging for VRChat

+This repo provides code to help you send English text into VRChat. It includes:
+
+1. Training code to produce an English-language tokenizer of any vocabulary
+   size.
+2. Code to turn your tokenizer into a lookup table for GPU decoding.
+3. Unity code to generate an animator to shuttle data from OSC to material
+   properties.
+4. OSC code to talk to your Unity animator.
+
+To get started, see Quick Start.
+
+## Quick start
+
+1. Clone this repo.
+2. Clone my toon shader, [2ner](https://github.com/yum-food/2ner).
+3. Install Lyuma's av3emulator.
+4. Drag STT.prefab onto your avatar's root.
+5. Enter play mode.
+6. Open PowerShell.
+
+```bash
+$ cd ~
+$ mkdir tmp
+$ cd tmp
+$ python.exe -m venv venv
+$ ./venv/Scripts/Activate.ps1
+$ pushd /path/to/FastTextPaging/
+$ pip3 install -r requirements.txt
+$ python3 ./hi.py
+```
+
+7. Start typing.
+
+## Design overview
+
 It is sometimes useful to send text data into VRChat, for example for
 speech-to-text (STT). This is typically done naively, with a "block" of n 8-bit
 characters\* sent in along with an 8-bit pointer. Since avatars can only
@@ -19,7 +54,7 @@ used. Thus to reach a typical reading speed, you need to use (260/4.7) = 55.5
 OSC bits. The goal of this module is to get more out of these bits by
 compressing text over the wire.

-## Unigram tokenizer
+### Unigram tokenizer

 Byte pair encoding (BPE) is an encoding scheme frequently used in natural
 language processing (NLP) contexts. For any language with a fixed character set
@@ -127,7 +162,7 @@ bits naive rate bpe rate speedup factor

 I reserve 39 token slots for sequences of whitespace characters of length 2-40. This helps simplify formatting. To end a line or position text, you can just send in the exact right number of spaces, and a fixed-width font renderer will position things as intended.

-## Paging data into shader
+### Paging data into shader

 Sending this data to a shader is pretty simple:

@@ -224,7 +259,7 @@ void GetTokens(uint screen_ptr, out uint block_ptr, out uint tokens[BLOCK_WIDTH]
 }
 ```

-## GPU decoding
+### GPU decoding

 Now we have to translate the tokens into text. I do this with a texture laid out as follows:

@@ -236,6 +271,10 @@ My tokenizer's vocabulary is 65,536 tokens. If we add up the lengths of every to

 So, the entire vocabulary - length+offset head and content - requires a 32-bit RGBA texture with 232,419 slots. We'll just jam this into a 512x512 texture, at an occupancy ratio of 88.66% (11.34% waste). The total VRAM usage of that lookup table (LUT) is 1 MiB.

+
+
+*A 64K vocabulary tokenizer I trained on Wikipedia and OpenSubtitles.*
+
 We want to implement this API:

 ```c
````
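The lookup-table sizing quoted in the last hunk can be checked with a few lines of Python. This is a minimal sketch, not code from the repo: `smallest_pow2_side` and `lut_stats` are hypothetical helper names, and the only inputs are the figures stated in the README text (232,419 slots, a 512x512 texture, 4 bytes per 32-bit RGBA texel).

```python
import math

# Figures quoted in the README text; everything else here is illustrative.
LUT_SLOTS = 232_419        # texels needed for length+offset head plus content
BYTES_PER_TEXEL = 4        # one 32-bit RGBA texel


def smallest_pow2_side(slots: int) -> int:
    """Side of the smallest square power-of-two texture that holds `slots` texels."""
    min_side = math.isqrt(slots - 1) + 1       # ceil(sqrt(slots))
    return 1 << (min_side - 1).bit_length()    # round up to a power of two


def lut_stats(slots: int, side: int, bytes_per_texel: int = BYTES_PER_TEXEL):
    """Return (capacity in texels, occupancy ratio, size in bytes) for a square texture."""
    capacity = side * side
    if slots > capacity:
        raise ValueError("texture is too small for the lookup table")
    return capacity, slots / capacity, capacity * bytes_per_texel


side = smallest_pow2_side(LUT_SLOTS)
capacity, occupancy, size_bytes = lut_stats(LUT_SLOTS, side)
print(f"{side}x{side} texture, {occupancy:.2%} occupied, "
      f"{1 - occupancy:.2%} waste, {size_bytes / 2**20:g} MiB")
```

Under these assumptions the sketch reproduces the numbers in the hunk: a 512x512 texture, 88.66% occupancy, 11.34% waste, and 1 MiB of VRAM.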
