Import FastTextPager repo

author: yum <yum.food.vr@gmail.com> 2025-07-23 22:39:45 -0700
committer: yum <yum.food.vr@gmail.com> 2025-07-23 22:39:45 -0700
commit: f6b93a20d754579008076e85f5c0a97e1bcbc258 (patch)
tree: 7288699d6f22e76c4f30636a37e94265b3ef7708 /Designs
parent: f3782c200c9a2ec2b77708da67b4127a38465ad1 (diff)
parent: 043a447133695bfd2285a534b941db972873a692 (diff)
1 files changed, 321 insertions, 0 deletions
diff --git a/Designs/fast_text_paging.md b/Designs/fast_text_paging.md
new file mode 100644
index 0000000..abb0576
--- /dev/null
+++ b/Designs/fast_text_paging.md
@@ -0,0 +1,321 @@
+# Optimized text paging for VRChat
+
+This repo provides code to help you send English text into VRChat. It includes:
+
+1. Training code to produce an English-language tokenizer of any vocabulary
+   size.
+2. Code to turn your tokenizer into a lookup table for GPU decoding.
+3. Unity code to generate an animator to shuttle data from OSC to material
+   properties.
+4. OSC code to talk to your Unity animator.
+
+To get started, see Quick Start.
+
+## Quick start
+
+1. Clone this repo.
+2. Clone my toon shader, [2ner](https://github.com/yum-food/2ner).
+3. Install Lyuma's av3emulator.
+4. Drag STT.prefab onto your avatar's root.
+5. Enter play mode.
+6. Open PowerShell.
+
+```bash
+$ cd ~
+$ mkdir tmp
+$ cd tmp
+$ python.exe -m venv venv
+$ ./venv/Scripts/Activate.ps1
+$ pushd /path/to/FastTextPaging/
+$ pip3 install -r requirements.txt
+$ python3 ./hi.py
+```
+
+7. Start typing.
+
+## Design overview
+
+It is sometimes useful to send text data into VRChat, for example for
+speech-to-text (STT). This is typically done naively, with a "block" of
+n 8-bit characters\* sent in along with an 8-bit pointer. Since avatars can only
+send 256 bits at 3 Hz\*\* with OSC, this means you can only send (256 - 8) / 8 =
+31 characters per sync. The average English word is 4.79 characters long, so
+if we naively send in 1 character per byte, then we get a speed limit of 6.47
+words per sync. The works out to ~20 words per second or ~1200 words per minute
+(wpm). Adults typically read at about 238-260 wpm.
+
+\* Typically ASCII encoding is used.
+
+\*\* Experimentally, 3 Hz is the fastest you can reliably page data with OSC in
+busy instances.
+
+Sending in one character per byte gives you (1200/256) ~= 4.7 wpm per OSC bit
+used. Thus to reach a typical reading speed, you need to use (260/4.7) = 55.5
+OSC bits. The goal of this module is to get more out of these bits by
+compressing text over the wire.
+
+### Unigram tokenizer
+
+Byte pair encoding (BPE) is an encoding scheme frequently used in natural
+language processing (NLP) contexts. For any language with a fixed character set
+(e.g. an alphabet), you can count up how often each character is used in a
+large corpus of text. Then you can repeatedly join together characters and
+assign a unique token to joined characters in the order of the most frequently
+occurring sequences. You wind up with a lookup table like this:
+
+```
+0: [UNK]
+1: <s>
+2: </s>
+3: [CLS]
+4: [SEP]
+...
+100:  from
+101:  but
+102:  he
+103: e
+104:  now
+...
+10000: eo
+10001:  is currently
+10002:  dish
+10003: Mi
+10004: 6)
+```
+
+The above example is from a unigram\* sentencepiece organizer that I trained.
+
+\* A unigram tokenizer is a variant of the byte-pair tokenizer. Where a byte
+pair tokenizer views inputs as arbitrary sequences of bytes, a unigram
+tokenizer views it as a sequence of letters.
+
+The tokenizer has a vocabulary size of 65,536 tokens. It was trained on
+opensubtitles and 5% of wikipedia, with `unidecode` normalization applied to
+limit training data to ASCII. Subword lengths are distributed as follows:
+
+Subword length histogram:
+1: 95
+2: 1032
+3: 3350
+4: 5439
+5: 5445
+6: 5082
+7: 5334
+8: 5866
+9: 6172
+10: 5934
+11: 5329
+12: 4698
+13: 4016
+14: 3191
+15: 2434
+16: 2095
+
+In the test set - 25% wikipedia 75% conversational English -
+this tokenizer yields 4.896 characters per token. (Recall that
+the average English word is 4.8 characters long. Not bad!)
+
+If we naively send these tokens into the game with a 16-bit number, we can send
+floor((256-8)/16) = 15 tokens per sync. This gives us an average of 73.44
+characters per sync - more than 2x higher than the naive approach's 31.
+
+Here is how the unigram-tokenized scheme fairs against the naive scheme in
+every possible configuration of bits used:
+
+```
+bits    naive rate  bpe rate    speedup factor
+8       n/a         n/a         n/a
+16      1           n/a         0.000
+24      2           4.896       2.448
+32      3           4.896       1.632
+40      4           9.792       2.448
+48      5           9.792       1.958
+56      6           14.688      2.448
+64      7           14.688      2.098
+72      8           19.584      2.448
+80      9           19.584      2.176
+88      10          24.480      2.448
+96      11          24.480      2.225
+104     12          29.376      2.448
+112     13          29.376      2.260
+120     14          34.272      2.448
+128     15          34.272      2.285
+136     16          39.168      2.448
+144     17          39.168      2.304
+152     18          44.064      2.448
+160     19          44.064      2.319
+168     20          48.960      2.448
+176     21          48.960      2.331
+184     22          53.856      2.448
+192     23          53.856      2.342
+200     24          58.752      2.448
+208     25          58.752      2.350
+216     26          63.648      2.448
+224     27          63.648      2.357
+232     28          68.544      2.448
+240     29          68.544      2.364
+248     30          73.440      2.448
+256     31          73.440      2.369
+```
+
+([Spreadsheet](https://docs.google.com/spreadsheets/d/1d9SEZvo3Q-6U_Wf9nuGRKXxndUhKn2V3Q0Ox0nOB4T4/edit?usp=sharing))
+
+I reserve 39 token slots for sequences of whitespace characters of length 2-40. This helps simplify formatting. To end a line or position text, you can just send in the exact right number of spaces, and a fixed-width font renderer will position things as intended.
+
+### Paging data into shader
+
+Sending this data to a shader is pretty simple:
+
+- An OSC app encodes a string into tokens and pages it into the game with OSC.
+  - The app sends a pointer of *where* the tokens should be rendered along with them. Since the tokens can encode a variable length string, the pointer must be able to point to any spot in the rendering window. Thus we are limited to a 256-bit display with an 8-bit pointer, or a 64K display with a 16-bit pointer. We call this the visual pointer.
+  - A second pointer tells the animator which shader parameters the tokens and visual pointer should be written to. This can be 8-bit. We call this the logical pointer.
+- Animator uses the logical pointer to decide which shader parameters to send the visual pointer and tokens to.
+
+Here is the expected speedup in every possible configuration with a 1-byte
+overhead (1BO) or a 2-byte overhead (2BO):
+
+```
+bits	naive rate	bpe rate (1BO)	speedup	bpe rate (2BO)	speedup
+8		n/a			n/a				n/a		n/a				n/a
+16		1			0.000			0.000	0.000			0.000
+24		2			0.000			0.000	0.000			0.000
+32		3			4.896			1.632	0.000			0.000
+40		4			4.896			1.224	4.896			1.224
+48		5			9.792			1.958	4.896			0.979
+56		6			9.792			1.632	9.792			1.632
+64		7			14.688			2.098	9.792			1.399
+72		8			14.688			1.836	14.688			1.836
+80		9			19.584			2.176	14.688			1.632
+88		10			19.584			1.958	19.584			1.958
+96		11			24.480			2.225	19.584			1.780
+104		12			24.480			2.040	24.480			2.040
+112		13			29.376			2.260	24.480			1.883
+120		14			29.376			2.098	29.376			2.098
+128		15			34.272			2.285	29.376			1.958
+136		16			34.272			2.142	34.272			2.142
+144		17			39.168			2.304	34.272			2.016
+152		18			39.168			2.176	39.168			2.176
+160		19			44.064			2.319	39.168			2.061
+168		20			44.064			2.203	44.064			2.203
+176		21			48.960			2.331	44.064			2.098
+184		22			48.960			2.225	48.960			2.225
+192		23			53.856			2.342	48.960			2.129
+200		24			53.856			2.244	53.856			2.244
+208		25			58.752			2.350	53.856			2.154
+216		26			58.752			2.260	58.752			2.260
+224		27			63.648			2.357	58.752			2.176
+232		28			63.648			2.273	63.648			2.273
+240		29			68.544			2.364	63.648			2.195
+248		30			68.544			2.285	68.544			2.285
+256		31			73.440			2.369	68.544			2.211
+```
+
+([Spreadsheet](https://docs.google.com/spreadsheets/d/1d9SEZvo3Q-6U_Wf9nuGRKXxndUhKn2V3Q0Ox0nOB4T4/edit?usp=sharing))
+
+As you can see, a 2-byte visual pointer is very damaging to the speedup at low bit budgets. So in bit-constrained setups we should definitely use a smaller display.
+
+Notably, *there is only one crossover point*. If all configurations except the 2-byte overhead 48-bit configuration, BPE-based paging is always\* faster.
+
+\* Always, going off of the *expected* rate. If you get unlucky and your tokens all decode to 1 character, then BPE-based paging is about 50% *slower* than naive encoding.
+
+Because the Unity animator sucks shit, we're going to decode tokens on the GPU. As a refresher, the GPU sees data like this:
+
+```
+_Text_Block00_Visual_Ptr: 0
+_Text_Block00_Token00: 13,766
+_Text_Block00_Token01: 84
+_Text_Block01_Visual_Ptr: 13
+_Text_Block01_Token00: 599
+_Text_Block01_Token01: 8,301
+...
+```
+
+I.e. it sees "blocks" of data with tokens and visual pointers. The visual pointer just says where on a grid it should draw the subwords represented by the tokens.
+
+The pixel shader can trivially work out what grid location the current pixel belongs to. By scanning through the visual pointers, it can work out which block it has to draw.
+
+We can generate a function like this:
+
+```c
+#define BLOCK_WIDTH 2
+void GetBlock(uint which_block, out float data[BLOCK_WIDTH]) {
+  [loop]
+  for (uint i = 0; i < BLOCK_WIDTH; i++) {
+    data[i] = _Text_Blocks[which_block][i];
+  }
+}
+
+// Get the tokens that cover `screen_ptr`. Also returns `block_ptr`, the
+// location where this block of tokens begins.
+void GetTokens(uint screen_ptr, out uint block_ptr, out uint tokens[BLOCK_WIDTH]) {
+  uint which_block;
+  [loop]
+  for (uint i = 0; i < BLOCK_WIDTH; i++) {
+    if (screen_ptr >= _Text_Block_Visual_Ptrs[i]) {
+      which_block = i;
+    }
+  }
+  GetBlock(which_block, tokens);
+}
+```
+
+### GPU decoding
+
+Now we have to translate the tokens into text. I do this with a texture laid out as follows:
+
+1. A fixed-length array of (offset, length) pairs. Offset is 24 bits, giving us an address space of about 16 million slots. Length is 8 bits, but as established above, the longest token is only 16 characters long. So we're wasting about 4 bits. This tells us we should use an RGBA texture.
+
+2. A variable length array of ASCII-encoded strings. Each slot is RGBA, so it can hold 4 characters.
+
+My tokenizer's vocabulary is 65,536 tokens. If we add up the lengths of every token, rounding them up to the nearest multiple of 4, we get the length 667,532 This means that we need 166,883 slots to fit the actual content of the vocabulary.
+
+So, the entire vocabulary - length+offset head and content - requires a 32-bit RGBA texture with 232,419 slots. We'll just jam this into a 512x512 texture, at an occupancy ratio of 88.66% (11.34% waste). The total VRAM usage of that lookup table (LUT) is 1 MiB.
+
+![Unigram tokenizer texture](Images/unigram_lut_for_visualization.png)
+
+*A 64K vocabulary tokenizer I trained on Wikipedia and OpenSubtitles.*
+
+We want to implement this API:
+
+```c
+uint GetChar(uint screen_ptr);
+```
+
+Internally, it must:
+
+1. Get tokens. (GetTokens - already done)
+2. Figure out which token covers the screen\_ptr.
+3. Figure out which character in the token covers the screen\_ptr.
+
+Let's break down [2]. We can get the length of a token with a single texture tap. So, naively, we can just scan through the tokens in the current block, add up their lengths, and stop once we find a token covering the current slot. The scan incurs a worst-case cost of `BLOCK_WIDTH` texture taps. The character lookup is then a single tap.
+
+```c
+// Gets the length of the subword encoded by the token. Performs one texture
+// tap.
+void TokenLengthOffset(uint token, out uint length, out uint offset);
+// Gets the nth character of the token stored at `token_offset`.
+uint GetTokenChar(uint token_offset, uint nth);
+
+uint GetChar(uint screen_ptr) {
+  uint block_ptr;
+  uint tokens[BLOCK_WIDTH];
+  GetTokens(screen_ptr, tokens, block_ptr);
+  uint start = block_ptr;
+  uint covering_token = 0;
+  uint token_ptr = block_ptr;
+  uint token_offset;
+  for (uint ii = 0; ii < BLOCK_WIDTH; ii++) {
+    uint token_length;
+    TokenLengthOffset(tokens[ii], token_length, token_offset);
+    if (token_ptr + token_length >= screen_ptr) {
+      covering_token = tokens[ii];
+    }
+    token_ptr += token_length;
+  }
+  // covering_token covers screen_ptr. It starts at token_ptr.
+  return GetTokenChar(token_offset, screen_ptr - token_ptr);
+}
+```
+
+That's actually it for the GPU decoding. Once you have the character, you can use standard fixed-width font rendering techniques to display it (e.g. [disinfo](https://github.com/yum-food/disinfo) and [msdf](https://github.com/Chlumsky/msdfgen)).
+
author	yum <yum.food.vr@gmail.com>	2025-07-23 22:39:45 -0700
committer	yum <yum.food.vr@gmail.com>	2025-07-23 22:39:45 -0700
commit	f6b93a20d754579008076e85f5c0a97e1bcbc258 (patch)
tree	7288699d6f22e76c4f30636a37e94265b3ef7708 /Designs
parent	f3782c200c9a2ec2b77708da67b4127a38465ad1 (diff)
parent	043a447133695bfd2285a534b941db972873a692 (diff)