README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321

# Optimized text paging for VRChat

This repo provides code to help you send English text into VRChat. It includes:

1. Training code to produce an English-language tokenizer of any vocabulary
   size.
2. Code to turn your tokenizer into a lookup table for GPU decoding.
3. Unity code to generate an animator to shuttle data from OSC to material
   properties.
4. OSC code to talk to your Unity animator.

To get started, see Quick Start.

## Quick start

1. Clone this repo.
2. Clone my toon shader, [2ner](https://github.com/yum-food/2ner).
3. Install Lyuma's av3emulator.
4. Drag STT.prefab onto your avatar's root.
5. Enter play mode.
6. Open PowerShell.

```bash
$ cd ~
$ mkdir tmp
$ cd tmp
$ python.exe -m venv venv
$ ./venv/Scripts/Activate.ps1
$ pushd /path/to/FastTextPaging/
$ pip3 install -r requirements.txt
$ python3 ./hi.py
```

7. Start typing.

## Design overview

It is sometimes useful to send text data into VRChat, for example for
speech-to-text (STT). This is typically done naively, with a "block" of
n 8-bit characters\* sent in along with an 8-bit pointer. Since avatars can only
send 256 bits at 3 Hz\*\* with OSC, this means you can only send (256 - 8) / 8 =
31 characters per sync. The average English word is 4.79 characters long, so
if we naively send in 1 character per byte, then we get a speed limit of 6.47
words per sync. The works out to ~20 words per second or ~1200 words per minute
(wpm). Adults typically read at about 238-260 wpm.

\* Typically ASCII encoding is used.

\*\* Experimentally, 3 Hz is the fastest you can reliably page data with OSC in
busy instances.

Sending in one character per byte gives you (1200/256) ~= 4.7 wpm per OSC bit
used. Thus to reach a typical reading speed, you need to use (260/4.7) = 55.5
OSC bits. The goal of this module is to get more out of these bits by
compressing text over the wire.

### Unigram tokenizer

Byte pair encoding (BPE) is an encoding scheme frequently used in natural
language processing (NLP) contexts. For any language with a fixed character set
(e.g. an alphabet), you can count up how often each character is used in a
large corpus of text. Then you can repeatedly join together characters and
assign a unique token to joined characters in the order of the most frequently
occurring sequences. You wind up with a lookup table like this:

```
0: [UNK]
1: <s>
2: </s>
3: [CLS]
4: [SEP]
...
100:  from
101:  but
102:  he
103: e
104:  now
...
10000: eo
10001:  is currently
10002:  dish
10003: Mi
10004: 6)
```

The above example is from a unigram\* sentencepiece organizer that I trained.

\* A unigram tokenizer is a variant of the byte-pair tokenizer. Where a byte
pair tokenizer views inputs as arbitrary sequences of bytes, a unigram
tokenizer views it as a sequence of letters.

The tokenizer has a vocabulary size of 65,536 tokens. It was trained on
opensubtitles and 5% of wikipedia, with `unidecode` normalization applied to
limit training data to ASCII. Subword lengths are distributed as follows:

Subword length histogram:
1: 95
2: 1032
3: 3350
4: 5439
5: 5445
6: 5082
7: 5334
8: 5866
9: 6172
10: 5934
11: 5329
12: 4698
13: 4016
14: 3191
15: 2434
16: 2095

In the test set - 25% wikipedia 75% conversational English -
this tokenizer yields 4.896 characters per token. (Recall that
the average English word is 4.8 characters long. Not bad!)

If we naively send these tokens into the game with a 16-bit number, we can send
floor((256-8)/16) = 15 tokens per sync. This gives us an average of 73.44
characters per sync - more than 2x higher than the naive approach's 31.

Here is how the unigram-tokenized scheme fairs against the naive scheme in
every possible configuration of bits used:

```
bits    naive rate  bpe rate    speedup factor
8       n/a         n/a         n/a
16      1           n/a         0.000
24      2           4.896       2.448
32      3           4.896       1.632
40      4           9.792       2.448
48      5           9.792       1.958
56      6           14.688      2.448
64      7           14.688      2.098
72      8           19.584      2.448
80      9           19.584      2.176
88      10          24.480      2.448
96      11          24.480      2.225
104     12          29.376      2.448
112     13          29.376      2.260
120     14          34.272      2.448
128     15          34.272      2.285
136     16          39.168      2.448
144     17          39.168      2.304
152     18          44.064      2.448
160     19          44.064      2.319
168     20          48.960      2.448
176     21          48.960      2.331
184     22          53.856      2.448
192     23          53.856      2.342
200     24          58.752      2.448
208     25          58.752      2.350
216     26          63.648      2.448
224     27          63.648      2.357
232     28          68.544      2.448
240     29          68.544      2.364
248     30          73.440      2.448
256     31          73.440      2.369
```

([Spreadsheet](https://docs.google.com/spreadsheets/d/1d9SEZvo3Q-6U_Wf9nuGRKXxndUhKn2V3Q0Ox0nOB4T4/edit?usp=sharing))

I reserve 39 token slots for sequences of whitespace characters of length 2-40. This helps simplify formatting. To end a line or position text, you can just send in the exact right number of spaces, and a fixed-width font renderer will position things as intended.

### Paging data into shader

Sending this data to a shader is pretty simple:

- An OSC app encodes a string into tokens and pages it into the game with OSC.
  - The app sends a pointer of *where* the tokens should be rendered along with them. Since the tokens can encode a variable length string, the pointer must be able to point to any spot in the rendering window. Thus we are limited to a 256-bit display with an 8-bit pointer, or a 64K display with a 16-bit pointer. We call this the visual pointer.
  - A second pointer tells the animator which shader parameters the tokens and visual pointer should be written to. This can be 8-bit. We call this the logical pointer.
- Animator uses the logical pointer to decide which shader parameters to send the visual pointer and tokens to.

Here is the expected speedup in every possible configuration with a 1-byte
overhead (1BO) or a 2-byte overhead (2BO):

```
bits	naive rate	bpe rate (1BO)	speedup	bpe rate (2BO)	speedup
8		n/a			n/a				n/a		n/a				n/a
16		1			0.000			0.000	0.000			0.000
24		2			0.000			0.000	0.000			0.000
32		3			4.896			1.632	0.000			0.000
40		4			4.896			1.224	4.896			1.224
48		5			9.792			1.958	4.896			0.979
56		6			9.792			1.632	9.792			1.632
64		7			14.688			2.098	9.792			1.399
72		8			14.688			1.836	14.688			1.836
80		9			19.584			2.176	14.688			1.632
88		10			19.584			1.958	19.584			1.958
96		11			24.480			2.225	19.584			1.780
104		12			24.480			2.040	24.480			2.040
112		13			29.376			2.260	24.480			1.883
120		14			29.376			2.098	29.376			2.098
128		15			34.272			2.285	29.376			1.958
136		16			34.272			2.142	34.272			2.142
144		17			39.168			2.304	34.272			2.016
152		18			39.168			2.176	39.168			2.176
160		19			44.064			2.319	39.168			2.061
168		20			44.064			2.203	44.064			2.203
176		21			48.960			2.331	44.064			2.098
184		22			48.960			2.225	48.960			2.225
192		23			53.856			2.342	48.960			2.129
200		24			53.856			2.244	53.856			2.244
208		25			58.752			2.350	53.856			2.154
216		26			58.752			2.260	58.752			2.260
224		27			63.648			2.357	58.752			2.176
232		28			63.648			2.273	63.648			2.273
240		29			68.544			2.364	63.648			2.195
248		30			68.544			2.285	68.544			2.285
256		31			73.440			2.369	68.544			2.211
```

([Spreadsheet](https://docs.google.com/spreadsheets/d/1d9SEZvo3Q-6U_Wf9nuGRKXxndUhKn2V3Q0Ox0nOB4T4/edit?usp=sharing))

As you can see, a 2-byte visual pointer is very damaging to the speedup at low bit budgets. So in bit-constrained setups we should definitely use a smaller display.

Notably, *there is only one crossover point*. If all configurations except the 2-byte overhead 48-bit configuration, BPE-based paging is always\* faster.

\* Always, going off of the *expected* rate. If you get unlucky and your tokens all decode to 1 character, then BPE-based paging is about 50% *slower* than naive encoding.

Because the Unity animator sucks shit, we're going to decode tokens on the GPU. As a refresher, the GPU sees data like this:

```
_Text_Block00_Visual_Ptr: 0
_Text_Block00_Token00: 13,766
_Text_Block00_Token01: 84
_Text_Block01_Visual_Ptr: 13
_Text_Block01_Token00: 599
_Text_Block01_Token01: 8,301
...
```

I.e. it sees "blocks" of data with tokens and visual pointers. The visual pointer just says where on a grid it should draw the subwords represented by the tokens.

The pixel shader can trivially work out what grid location the current pixel belongs to. By scanning through the visual pointers, it can work out which block it has to draw.

We can generate a function like this:

```c
#define BLOCK_WIDTH 2
void GetBlock(uint which_block, out float data[BLOCK_WIDTH]) {
  [loop]
  for (uint i = 0; i < BLOCK_WIDTH; i++) {
    data[i] = _Text_Blocks[which_block][i];
  }
}

// Get the tokens that cover `screen_ptr`. Also returns `block_ptr`, the
// location where this block of tokens begins.
void GetTokens(uint screen_ptr, out uint block_ptr, out uint tokens[BLOCK_WIDTH]) {
  uint which_block;
  [loop]
  for (uint i = 0; i < BLOCK_WIDTH; i++) {
    if (screen_ptr >= _Text_Block_Visual_Ptrs[i]) {
      which_block = i;
    }
  }
  GetBlock(which_block, tokens);
}
```

### GPU decoding

Now we have to translate the tokens into text. I do this with a texture laid out as follows:

1. A fixed-length array of (offset, length) pairs. Offset is 24 bits, giving us an address space of about 16 million slots. Length is 8 bits, but as established above, the longest token is only 16 characters long. So we're wasting about 4 bits. This tells us we should use an RGBA texture.

2. A variable length array of ASCII-encoded strings. Each slot is RGBA, so it can hold 4 characters.

My tokenizer's vocabulary is 65,536 tokens. If we add up the lengths of every token, rounding them up to the nearest multiple of 4, we get the length 667,532 This means that we need 166,883 slots to fit the actual content of the vocabulary.

So, the entire vocabulary - length+offset head and content - requires a 32-bit RGBA texture with 232,419 slots. We'll just jam this into a 512x512 texture, at an occupancy ratio of 88.66% (11.34% waste). The total VRAM usage of that lookup table (LUT) is 1 MiB.

![Unigram tokenizer texture](Images/unigram_lut_for_visualization.png)

*A 64K vocabulary tokenizer I trained on Wikipedia and OpenSubtitles.*

We want to implement this API:

```c
uint GetChar(uint screen_ptr);
```

Internally, it must:

1. Get tokens. (GetTokens - already done)
2. Figure out which token covers the screen\_ptr.
3. Figure out which character in the token covers the screen\_ptr.

Let's break down [2]. We can get the length of a token with a single texture tap. So, naively, we can just scan through the tokens in the current block, add up their lengths, and stop once we find a token covering the current slot. The scan incurs a worst-case cost of `BLOCK_WIDTH` texture taps. The character lookup is then a single tap.

```c
// Gets the length of the subword encoded by the token. Performs one texture
// tap.
void TokenLengthOffset(uint token, out uint length, out uint offset);
// Gets the nth character of the token stored at `token_offset`.
uint GetTokenChar(uint token_offset, uint nth);

uint GetChar(uint screen_ptr) {
  uint block_ptr;
  uint tokens[BLOCK_WIDTH];
  GetTokens(screen_ptr, tokens, block_ptr);
  uint start = block_ptr;
  uint covering_token = 0;
  uint token_ptr = block_ptr;
  uint token_offset;
  for (uint ii = 0; ii < BLOCK_WIDTH; ii++) {
    uint token_length;
    TokenLengthOffset(tokens[ii], token_length, token_offset);
    if (token_ptr + token_length >= screen_ptr) {
      covering_token = tokens[ii];
    }
    token_ptr += token_length;
  }
  // covering_token covers screen_ptr. It starts at token_ptr.
  return GetTokenChar(token_offset, screen_ptr - token_ptr);
}
```

That's actually it for the GPU decoding. Once you have the character, you can use standard fixed-width font rendering techniques to display it (e.g. [disinfo](https://github.com/yum-food/disinfo) and [msdf](https://github.com/Chlumsky/msdfgen)).