1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
|
# Optimized text paging for VRChat
This repo provides code to help you send English text into VRChat. It includes:
1. Training code to produce an English-language tokenizer of any vocabulary
size.
2. Code to turn your tokenizer into a lookup table for GPU decoding.
3. Unity code to generate an animator to shuttle data from OSC to material
properties.
4. OSC code to talk to your Unity animator.
To get started, see Quick Start.
## Quick start
1. Clone this repo.
2. Clone my toon shader, [2ner](https://github.com/yum-food/2ner).
3. Install Lyuma's av3emulator.
4. Drag STT.prefab onto your avatar's root.
5. Enter play mode.
6. Open PowerShell.
```bash
$ cd ~
$ mkdir tmp
$ cd tmp
$ python.exe -m venv venv
$ ./venv/Scripts/Activate.ps1
$ pushd /path/to/FastTextPaging/
$ pip3 install -r requirements.txt
$ python3 ./hi.py
```
7. Start typing.
## Design overview
It is sometimes useful to send text data into VRChat, for example for
speech-to-text (STT). This is typically done naively, with a "block" of
n 8-bit characters\* sent in along with an 8-bit pointer. Since avatars can only
send 256 bits at 3 Hz\*\* with OSC, this means you can only send (256 - 8) / 8 =
31 characters per sync. The average English word is 4.79 characters long, so
if we naively send in 1 character per byte, then we get a speed limit of 6.47
words per sync. The works out to ~20 words per second or ~1200 words per minute
(wpm). Adults typically read at about 238-260 wpm.
\* Typically ASCII encoding is used.
\*\* Experimentally, 3 Hz is the fastest you can reliably page data with OSC in
busy instances.
Sending in one character per byte gives you (1200/256) ~= 4.7 wpm per OSC bit
used. Thus to reach a typical reading speed, you need to use (260/4.7) = 55.5
OSC bits. The goal of this module is to get more out of these bits by
compressing text over the wire.
### Unigram tokenizer
Byte pair encoding (BPE) is an encoding scheme frequently used in natural
language processing (NLP) contexts. For any language with a fixed character set
(e.g. an alphabet), you can count up how often each character is used in a
large corpus of text. Then you can repeatedly join together characters and
assign a unique token to joined characters in the order of the most frequently
occurring sequences. You wind up with a lookup table like this:
```
0: [UNK]
1: <s>
2: </s>
3: [CLS]
4: [SEP]
...
100: from
101: but
102: he
103: e
104: now
...
10000: eo
10001: is currently
10002: dish
10003: Mi
10004: 6)
```
The above example is from a unigram\* sentencepiece organizer that I trained.
\* A unigram tokenizer is a variant of the byte-pair tokenizer. Where a byte
pair tokenizer views inputs as arbitrary sequences of bytes, a unigram
tokenizer views it as a sequence of letters.
The tokenizer has a vocabulary size of 65,536 tokens. It was trained on
opensubtitles and 5% of wikipedia, with `unidecode` normalization applied to
limit training data to ASCII. Subword lengths are distributed as follows:
Subword length histogram:
1: 95
2: 1032
3: 3350
4: 5439
5: 5445
6: 5082
7: 5334
8: 5866
9: 6172
10: 5934
11: 5329
12: 4698
13: 4016
14: 3191
15: 2434
16: 2095
In the test set - 25% wikipedia 75% conversational English -
this tokenizer yields 4.896 characters per token. (Recall that
the average English word is 4.8 characters long. Not bad!)
If we naively send these tokens into the game with a 16-bit number, we can send
floor((256-8)/16) = 15 tokens per sync. This gives us an average of 73.44
characters per sync - more than 2x higher than the naive approach's 31.
Here is how the unigram-tokenized scheme fairs against the naive scheme in
every possible configuration of bits used:
```
bits naive rate bpe rate speedup factor
8 n/a n/a n/a
16 1 n/a 0.000
24 2 4.896 2.448
32 3 4.896 1.632
40 4 9.792 2.448
48 5 9.792 1.958
56 6 14.688 2.448
64 7 14.688 2.098
72 8 19.584 2.448
80 9 19.584 2.176
88 10 24.480 2.448
96 11 24.480 2.225
104 12 29.376 2.448
112 13 29.376 2.260
120 14 34.272 2.448
128 15 34.272 2.285
136 16 39.168 2.448
144 17 39.168 2.304
152 18 44.064 2.448
160 19 44.064 2.319
168 20 48.960 2.448
176 21 48.960 2.331
184 22 53.856 2.448
192 23 53.856 2.342
200 24 58.752 2.448
208 25 58.752 2.350
216 26 63.648 2.448
224 27 63.648 2.357
232 28 68.544 2.448
240 29 68.544 2.364
248 30 73.440 2.448
256 31 73.440 2.369
```
([Spreadsheet](https://docs.google.com/spreadsheets/d/1d9SEZvo3Q-6U_Wf9nuGRKXxndUhKn2V3Q0Ox0nOB4T4/edit?usp=sharing))
I reserve 39 token slots for sequences of whitespace characters of length 2-40. This helps simplify formatting. To end a line or position text, you can just send in the exact right number of spaces, and a fixed-width font renderer will position things as intended.
### Paging data into shader
Sending this data to a shader is pretty simple:
- An OSC app encodes a string into tokens and pages it into the game with OSC.
- The app sends a pointer of *where* the tokens should be rendered along with them. Since the tokens can encode a variable length string, the pointer must be able to point to any spot in the rendering window. Thus we are limited to a 256-bit display with an 8-bit pointer, or a 64K display with a 16-bit pointer. We call this the visual pointer.
- A second pointer tells the animator which shader parameters the tokens and visual pointer should be written to. This can be 8-bit. We call this the logical pointer.
- Animator uses the logical pointer to decide which shader parameters to send the visual pointer and tokens to.
Here is the expected speedup in every possible configuration with a 1-byte
overhead (1BO) or a 2-byte overhead (2BO):
```
bits naive rate bpe rate (1BO) speedup bpe rate (2BO) speedup
8 n/a n/a n/a n/a n/a
16 1 0.000 0.000 0.000 0.000
24 2 0.000 0.000 0.000 0.000
32 3 4.896 1.632 0.000 0.000
40 4 4.896 1.224 4.896 1.224
48 5 9.792 1.958 4.896 0.979
56 6 9.792 1.632 9.792 1.632
64 7 14.688 2.098 9.792 1.399
72 8 14.688 1.836 14.688 1.836
80 9 19.584 2.176 14.688 1.632
88 10 19.584 1.958 19.584 1.958
96 11 24.480 2.225 19.584 1.780
104 12 24.480 2.040 24.480 2.040
112 13 29.376 2.260 24.480 1.883
120 14 29.376 2.098 29.376 2.098
128 15 34.272 2.285 29.376 1.958
136 16 34.272 2.142 34.272 2.142
144 17 39.168 2.304 34.272 2.016
152 18 39.168 2.176 39.168 2.176
160 19 44.064 2.319 39.168 2.061
168 20 44.064 2.203 44.064 2.203
176 21 48.960 2.331 44.064 2.098
184 22 48.960 2.225 48.960 2.225
192 23 53.856 2.342 48.960 2.129
200 24 53.856 2.244 53.856 2.244
208 25 58.752 2.350 53.856 2.154
216 26 58.752 2.260 58.752 2.260
224 27 63.648 2.357 58.752 2.176
232 28 63.648 2.273 63.648 2.273
240 29 68.544 2.364 63.648 2.195
248 30 68.544 2.285 68.544 2.285
256 31 73.440 2.369 68.544 2.211
```
([Spreadsheet](https://docs.google.com/spreadsheets/d/1d9SEZvo3Q-6U_Wf9nuGRKXxndUhKn2V3Q0Ox0nOB4T4/edit?usp=sharing))
As you can see, a 2-byte visual pointer is very damaging to the speedup at low bit budgets. So in bit-constrained setups we should definitely use a smaller display.
Notably, *there is only one crossover point*. If all configurations except the 2-byte overhead 48-bit configuration, BPE-based paging is always\* faster.
\* Always, going off of the *expected* rate. If you get unlucky and your tokens all decode to 1 character, then BPE-based paging is about 50% *slower* than naive encoding.
Because the Unity animator sucks shit, we're going to decode tokens on the GPU. As a refresher, the GPU sees data like this:
```
_Text_Block00_Visual_Ptr: 0
_Text_Block00_Token00: 13,766
_Text_Block00_Token01: 84
_Text_Block01_Visual_Ptr: 13
_Text_Block01_Token00: 599
_Text_Block01_Token01: 8,301
...
```
I.e. it sees "blocks" of data with tokens and visual pointers. The visual pointer just says where on a grid it should draw the subwords represented by the tokens.
The pixel shader can trivially work out what grid location the current pixel belongs to. By scanning through the visual pointers, it can work out which block it has to draw.
We can generate a function like this:
```c
#define BLOCK_WIDTH 2
void GetBlock(uint which_block, out float data[BLOCK_WIDTH]) {
[loop]
for (uint i = 0; i < BLOCK_WIDTH; i++) {
data[i] = _Text_Blocks[which_block][i];
}
}
// Get the tokens that cover `screen_ptr`. Also returns `block_ptr`, the
// location where this block of tokens begins.
void GetTokens(uint screen_ptr, out uint block_ptr, out uint tokens[BLOCK_WIDTH]) {
uint which_block;
[loop]
for (uint i = 0; i < BLOCK_WIDTH; i++) {
if (screen_ptr >= _Text_Block_Visual_Ptrs[i]) {
which_block = i;
}
}
GetBlock(which_block, tokens);
}
```
### GPU decoding
Now we have to translate the tokens into text. I do this with a texture laid out as follows:
1. A fixed-length array of (offset, length) pairs. Offset is 24 bits, giving us an address space of about 16 million slots. Length is 8 bits, but as established above, the longest token is only 16 characters long. So we're wasting about 4 bits. This tells us we should use an RGBA texture.
2. A variable length array of ASCII-encoded strings. Each slot is RGBA, so it can hold 4 characters.
My tokenizer's vocabulary is 65,536 tokens. If we add up the lengths of every token, rounding them up to the nearest multiple of 4, we get the length 667,532 This means that we need 166,883 slots to fit the actual content of the vocabulary.
So, the entire vocabulary - length+offset head and content - requires a 32-bit RGBA texture with 232,419 slots. We'll just jam this into a 512x512 texture, at an occupancy ratio of 88.66% (11.34% waste). The total VRAM usage of that lookup table (LUT) is 1 MiB.

*A 64K vocabulary tokenizer I trained on Wikipedia and OpenSubtitles.*
We want to implement this API:
```c
uint GetChar(uint screen_ptr);
```
Internally, it must:
1. Get tokens. (GetTokens - already done)
2. Figure out which token covers the screen\_ptr.
3. Figure out which character in the token covers the screen\_ptr.
Let's break down [2]. We can get the length of a token with a single texture tap. So, naively, we can just scan through the tokens in the current block, add up their lengths, and stop once we find a token covering the current slot. The scan incurs a worst-case cost of `BLOCK_WIDTH` texture taps. The character lookup is then a single tap.
```c
// Gets the length of the subword encoded by the token. Performs one texture
// tap.
void TokenLengthOffset(uint token, out uint length, out uint offset);
// Gets the nth character of the token stored at `token_offset`.
uint GetTokenChar(uint token_offset, uint nth);
uint GetChar(uint screen_ptr) {
uint block_ptr;
uint tokens[BLOCK_WIDTH];
GetTokens(screen_ptr, tokens, block_ptr);
uint start = block_ptr;
uint covering_token = 0;
uint token_ptr = block_ptr;
uint token_offset;
for (uint ii = 0; ii < BLOCK_WIDTH; ii++) {
uint token_length;
TokenLengthOffset(tokens[ii], token_length, token_offset);
if (token_ptr + token_length >= screen_ptr) {
covering_token = tokens[ii];
}
token_ptr += token_length;
}
// covering_token covers screen_ptr. It starts at token_ptr.
return GetTokenChar(token_offset, screen_ptr - token_ptr);
}
```
That's actually it for the GPU decoding. Once you have the character, you can use standard fixed-width font rendering techniques to display it (e.g. [disinfo](https://github.com/yum-food/disinfo) and [msdf](https://github.com/Chlumsky/msdfgen)).
|