| author | yum <yum.food.vr@gmail.com> | 2022-11-06 12:50:38 -0800 |
|---|---|---|
| committer | yum <yum.food.vr@gmail.com> | 2022-11-06 12:50:38 -0800 |
| commit | 7146acb9d4ad751fc5ced411a2990d0aad17d08f | |
| tree | 30d5f9f9a7f47bc4272fa9e9fff5c0226c376686 | |
| parent | 3a123fb5cabdbdef4f1b98031ec90c42e1d6e911 | |
String matching no longer relies on spaces
Add `matchStrings`, which does basically the same thing as
`matchStringList` except that it doesn't split the input at space boundaries.
I think this should work better for Japanese and Chinese, since they
don't use spaces.
Doesn't seem to cause any accuracy regressions for English.
Also update the README.
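The character-level matching described in this message can be sketched roughly as follows. This is a simplified illustration, not the committed code: `stitch_transcripts` is a hypothetical name, and it compares characters with plain inequality instead of calling the `Levenshtein` package (for single characters the two give the same 0-or-1 distance), so that the sketch has no third-party dependency.

```python
def stitch_transcripts(old_text: str, new_text: str, window_size: int = 3) -> str:
    """Stitch two overlapping transcriptions at their best-matching window.

    Hypothetical helper for illustration; no splitting on spaces is done,
    so it also applies to text without word delimiters.
    """
    if old_text == new_text:
        return old_text
    if len(old_text) < window_size or len(new_text) < window_size:
        # Too short to compare windows: trust the newer transcription.
        return new_text

    best_d = None
    best_i = 0
    best_j = 0
    for i in range(len(old_text) - window_size + 1):
        old_slice = old_text[i:i + window_size]
        for j in range(len(new_text) - window_size + 1):
            new_slice = new_text[j:j + window_size]
            # Count mismatched characters at the same offset in both windows.
            d = sum(a != b for a, b in zip(old_slice, new_slice))
            # `<=` means ties go to the last window pair visited.
            if best_d is None or d <= best_d:
                best_d, best_i, best_j = d, i, j

    # Keep the old text before the window, then the new text from the window on.
    return old_text[:best_i] + new_text[best_j:]


if __name__ == "__main__":
    print(stitch_transcripts("foo bar", "bar baz"))  # foo bar baz
```

Because the comparison runs over raw characters, the cost grows with the product of the two string lengths, which is presumably acceptable for short transcription buffers.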
| -rw-r--r-- | Images/tastt_anim.png | bin | 0 -> 42579 bytes |
| -rw-r--r-- | Images/tastt_blend.png | bin | 0 -> 14070 bytes |
| -rw-r--r-- | README.md | 79 |
| -rw-r--r-- | osc_ctrl.py | 2 |
| -rw-r--r-- | string_matcher.py | 80 |
| -rw-r--r-- | transcribe.py | 23 |
6 files changed, 120 insertions, 64 deletions
diff --git a/Images/tastt_anim.png b/Images/tastt_anim.png
Binary files differ
new file mode 100644
index 0000000..2cd8612
--- /dev/null
+++ b/Images/tastt_anim.png
diff --git a/Images/tastt_blend.png b/Images/tastt_blend.png
Binary files differ
new file mode 100644
index 0000000..7373dfd
--- /dev/null
+++ b/Images/tastt_blend.png
@@ -8,7 +8,7 @@ custom shader display the text in game.

 Features:

-* 8x22 display grid, 80 characters per slot.
+* 4x44 grid, 256 or 65536 characters per slot.
 * Text-to-text interface.
 * Speech-to-text interface.
 * Free as in beer.
@@ -52,10 +52,10 @@ There are currently 5 important pieces:

 1. `TaSTT.shader`. A simple unlit shader. Has one parameter per cell in the
    display.
-2. `generate_animations.sh`. Generates one animation per (row, column, letter).
-   These animations allow us to write the shader's parameters from an FX layer.
-3. `generate_fx.py`. Generates a colossal FX layer which maps (row, column,
-   letter, active) to exactly one of TaSTT.shader's parameters.
+2. `libunity.py`. Contains the logic required to generate and manipulate Unity
+   YAML files. Works well enough on YAMLs up to ~40k documents, 1M lines.
+3. `libtastt.py`. Contains the logic to generate TaSTT-specific Unity files,
+   namely the animations and the animator.
 4. `osc_ctrl.py`. Sends OSC messages to VRChat, which it dutifully passes along
    to the generated FX layer.
 5. `transcribe.py`. Uses OpenAI's whisper neural network to transcribe audio
@@ -63,61 +63,52 @@ There are currently 5 important pieces:

 #### Parameters & board indexing

-There are 2 obvious ways to tell the board how to display a message:
+I divide the board into 16 regions and use a single int parameter,
+`TaSTT_Select`, to select the active region. For each byte of data
+in the active region, I use a float parameter to blend between two
+animations: one with value 0, and one with value 255.

-1. Independently parameterize every character slot. If we want to display
-   a 140-character tweet, this means using (140 characters) * (8 bits
-   per character) == 1120 bits of parameter memory. VRChat only gives us 256!
-2. Parameterize one character slot. We could have an 8-bit letter, an 8-bit row
-   select, and an 8-bit column select. To avoid overwriting cells while we seek,
-   we could include a 1-bit enable. This approach works, and uses very few
-   parameter bits, but it requires us to update the same parameter very quickly.
-   Experimental results with this were not promising; remote viewers would see
-   the wrong letters pretty often.
+To support wide character sets, I support 2 bytes per character. This
+can be configured down to 1 byte per character to save parameter bits.

-Thus I settled on a hybrid approach: we divide the board into `cells`,
-inside of which we can independently address each character slot. There
-are currently 16 cells.
-
-Since the board has (22 columns) * (8 rows) == 176 character slots, each cell
-contains (176 characters) / (16 cells) = 11 characters.
-
-To update a cell, we do this:
-
-1. Select the cell. Since there are 16 cells, this requires 4 bits.
-2. For each letter in the cell, select the letter. Since we support 256 letters
-   per cell, this requires 8 bits.
-
-To avoid overwriting cells while we seek around, we also have a single boolean
-which enables/disables updating any cells.
-
-Thus the total amount of parameter memory used is dictated by this equation:
+The the total amount of parameter memory used is dictated by this equation:

 ```
-ROWS * COLS * 8 / CELLS + 1 + log2(CELLS)
+ROWS = 4
+COLS = 44
+CELLS = 16
+MEMORY = ROWS * COLS * (N bits per character) / CELLS + 1 + log2(CELLS)
 ```

-This is currently 93 bits.
+This is currently 93 bits for 1-byte characters and 181 bits for 2-byte
+characters.

 #### FX controller design

 The FX controller (AKA animator) is pretty simple. There is one layer for each
-character in a cell. Thus the layer has to work out which cell it's in, then
+character in a cell. The layer has to work out which cell it's in, then
 work out which letter we want to write in that cell, then run an animation for
 that letter.

-Here's a layer where I manually moved things around to show the structure of
-the decision tree:
-
-
+

 From top down, we first check if updating the board is enabled. If no, we stay
-in the first state. Then we check which cell we're in. This is divided into 4
-binary checks, each looking at a boolean parameter. Finally, we fire one of 80
-animations based on the value of the current layer's Letter parameter.
+in the first state. Then we check which cell we're in. Finally, we drive a
+shader parameter to one of 256 possible values using a blendtree.
+
+
+
+The blendtree trick lets us represent wide character sets efficiently. The
+number of animations required increases logarithmically with the size of the
+character set:

-In the pictured FX layer, there are 16 cells each controlling 80 animations,
-for a total of 1280 animations. There are 11 such layers.
+```
+(N bytes per character) = ceil(log2(size of character set))
+(total animations) =
+  (2 animations per byte) *
+  (N bytes per character) *
+  (M chars per cell)
+```

 ### Contributing

diff --git a/osc_ctrl.py b/osc_ctrl.py
index e5a2608..bb6dd87 100644
--- a/osc_ctrl.py
+++ b/osc_ctrl.py
@@ -119,7 +119,7 @@ def splitMessage(msg):
                 line = ""
             word_prefix = word[0:BOARD_COLS-1] + "-"
             word_suffix = word[BOARD_COLS-1:]
-            print("append prefix {}".format(word_prefix))
+            #print("append prefix {}".format(word_prefix))
             lines.append(word_prefix)
             word = word_suffix

diff --git a/string_matcher.py b/string_matcher.py
index 17bfaac..f529e0c 100644
--- a/string_matcher.py
+++ b/string_matcher.py
@@ -5,6 +5,8 @@ from Levenshtein import distance as levenshtein_distance

 import typing

+DEBUG = False
+
 # Find the window where the distance between these two transcriptions is
 # minimized and use it to stitch them together.
 def matchStringList(old_words: typing.List[str],
@@ -42,25 +44,89 @@ def matchStringList(old_words: typing.List[str],
     else:
         return " ".join(new_words)

-def matchStrings(old_text: str, new_text: str, window_size = 4) -> str:
+def matchSpaceDelimitedStrings(old_text: str, new_text: str, window_size = 4) -> str:
     old_words = old_text.split()
     new_words = new_text.split()
     return matchStringList(old_words, new_words, window_size)

+def matchStrings(old_text: str, new_text: str, window_size = 3) -> str:
+    if old_text == new_text:
+        return old_text
+    elif len(old_text) >= window_size and len(new_text) >= window_size:
+        # Find the window where the cumulative string distance
+        # between the text in that window in the old/new transcription
+        # is minimized.
+
+        best_match_i = None
+        best_match_j = None
+        best_match_d = window_size * 1000
+
+        for i in range(0, 1 + len(old_text) - window_size):
+            old_slice = old_text[i:i + window_size]
+
+            for j in range(0, 1 + len(new_text) - window_size):
+                new_slice = new_text[j:j + window_size]
+                cur_d = 0
+                for k in range(0, window_size):
+                    cur_d += levenshtein_distance(old_slice[k], new_slice[k])
+                if cur_d <= best_match_d:
+                    best_match_i = i
+                    best_match_j = j
+                    best_match_d = cur_d
+
+                    if DEBUG:
+                        print("optimum at old '{}'/{} new '{}'/{} d={}".format(
+                            old_slice, i, new_slice, j, cur_d))
+
+        old_prefix = old_text[0:best_match_i]
+        overlap = new_text[best_match_j:best_match_j + window_size]
+        new_suffix = new_text[best_match_j + window_size:]
+
+        if DEBUG:
+            print("Best match i: {}".format(best_match_i))
+            print("Best match j: {}".format(best_match_j))
+            print("Window size: {}".format(window_size))
+            print("Old prefix: {}".format(old_prefix))
+            print("Overlap: {}".format(overlap))
+            print("New suffix: {}".format(new_suffix))
+            print("Input 1: {}".format(old_text))
+            print("Input 2: {}".format(new_text))
+            print("Output: {}".format(old_prefix +
+                new_text[best_match_j:]))
+        return old_prefix + new_text[best_match_j:]
+    else:
+        return new_text
+
 if __name__ == "__main__":
     # Identical transcriptions should not be changed.
-    assert(matchStrings("This is a test case.", "This is a test case.", window_size = 3) == "This is a test case.")
+    assert(matchSpaceDelimitedStrings("This is a test case.", "This is a test case.", window_size = 3) == "This is a test case.")
     # A suffix should be detected and ignored.
-    assert(matchStrings("This is a test case.", "is a test case.", window_size = 3) == "This is a test case.")
+    assert(matchSpaceDelimitedStrings("This is a test case.", "is a test case.", window_size = 3) == "This is a test case.")
     # A lengthening suffix should be correctly appended.
-    assert(matchStrings("This is a test", "is a test case.", window_size = 3) == "This is a test case.")
+    assert(matchSpaceDelimitedStrings("This is a test", "is a test case.", window_size = 3) == "This is a test case.")
     # A strictly longer transcription should override the old prefix.
-    assert(matchStrings("This is a test", "This is a test case.", window_size = 3) == "This is a test case.")
+    assert(matchSpaceDelimitedStrings("This is a test", "This is a test case.", window_size = 3) == "This is a test case.")
     # Paranoia: repetitive text broke the older implementation, so I included
     # some test cases without fully understanding what the old problem was.
-    assert(matchStrings("test test test", "test test test test test test", window_size
+    assert(matchSpaceDelimitedStrings("test test test", "test test test test test test", window_size
         = 3) == "test test test test test test")
-    assert(matchStrings("test test test test test test", "test test test", window_size
+    assert(matchSpaceDelimitedStrings("test test test test test test", "test test test", window_size
         = 3) == "test test test test test test")

+
+    print(matchStrings("foo bar", "bar baz"))
+    print(matchStrings("alpha beta", "beta gamma"))
+
+    in1 = "Okay, what about now? Looks like it sort of works. Key word being sort of."
+    in2 = "okay what about now looks like it sort of works key word being sort of looks"
+    bad_out = "Okay, what about now? Looks like it sort of works. Key word being sort of works key word being sort of looks"
+    good_out = "Okay, what about now? Looks like it sort of works. Key word being sort of looks"
+    print(matchStrings(in1, in2))
+
+    in1 = "This repository can take"
+    in2 = "This repository contains the code for"
+    bad_out = "This repository can tode for"
+    good_out = "This repository contains the code for"
+    print(matchStrings(in1, in2))
+
     print("Tests passed.")
diff --git a/transcribe.py b/transcribe.py
index 5d2897c..4014dc8 100644
--- a/transcribe.py
+++ b/transcribe.py
@@ -43,9 +43,6 @@ class AudioState:
     frames_lock = threading.Lock()

     text = ""
-    # To improve temporal stability, we require two consecutive identical
-    # transcriptions before "committing" to a transcription.
-    text_candidate = ""
     text_lock = threading.Lock()

     record_audio = True
@@ -56,6 +53,9 @@ class AudioState:
     transcribe_sleep_duration_max_s = 1.50
     transcribe_no_change_count = 0
     transcribe_sleep_duration = transcribe_sleep_duration_min_s
+    # The language the user is speaking in.
+    language = whisper.tokenizer.TO_LANGUAGE_CODE["japanese"]
+
+
     # When the user says `over`, we stop displaying new transcriptions until
     # they clear the board again.
     display_paused = False
@@ -162,7 +162,6 @@ def resetAudioLocked(audio_state):
     resetDiskAudioLocked(audio_state, audio_state.VOICE_AUDIO_FILENAME)

     audio_state.text = ""
-    audio_state.text_candidate = ""
     osc_ctrl.clear(audio_state.osc_client)

 def resetAudio(audio_state):
@@ -171,7 +170,7 @@ def resetAudio(audio_state):
     audio_state.frames_lock.release()

 # Transcribe the audio recorded in a file.
-def transcribe(model, filename):
+def transcribe(audio_state, model, filename):
     audio_state.frames_lock.acquire()

     audio = whisper.load_audio(filename)
@@ -179,7 +178,8 @@ def transcribe(model, filename):
     audio = whisper.pad_or_trim(audio)
     mel = whisper.log_mel_spectrogram(audio).to(model.device)

-    options = whisper.DecodingOptions(language = "en",
+    #options = whisper.DecodingOptions(language = "en",
+    options = whisper.DecodingOptions(language = audio_state.language,
             beam_size = 5)
     result = whisper.decode(model, mel, options)

@@ -220,7 +220,7 @@ def transcribeAudio(audio_state, model):
             time.sleep(0.1)
             continue

-        text = transcribe(model, audio_state.VOICE_AUDIO_FILENAME)
+        text = transcribe(audio_state, model, audio_state.VOICE_AUDIO_FILENAME)
         if not text:
             continue

@@ -241,18 +241,17 @@ def transcribeAudio(audio_state, model):
         print("Transcription: {}".format(audio_state.text))

         old_text = audio_state.text
-        old_words = audio_state.text.split()
-        new_words = text.split()
+        #old_words = audio_state.text.split()
+        #new_words = text.split()

-        audio_state.text = string_matcher.matchStringList(old_words, new_words)
+        audio_state.text = string_matcher.matchStrings(audio_state.text,
+                text, window_size = 5)
         if old_text != audio_state.text:
             # We think the user said something, so reset the amount of
             # time we sleep between transcriptions to the minimum.
             audio_state.transcribe_no_change_count = 0
             audio_state.transcribe_sleep_duration = audio_state.transcribe_sleep_duration_min_s

-        audio_state.text_candidate = text
-
         audio_state.text_lock.release()

 def sendAudio(audio_state):
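The parameter-memory equation in the README hunk of this diff can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, the `+ 1` term is assumed to be the single enable bit that the removed README text describes, and `M chars per cell` is taken to be `ROWS * COLS / CELLS`.

```python
import math

# Board geometry as stated in the updated README.
ROWS = 4
COLS = 44
CELLS = 16


def chars_per_cell() -> int:
    """Character slots in one cell (assumed to be ROWS * COLS / CELLS)."""
    return ROWS * COLS // CELLS


def parameter_bits(bytes_per_char: int) -> int:
    """Evaluate ROWS * COLS * (bits per char) / CELLS + 1 + log2(CELLS)."""
    data_bits = chars_per_cell() * 8 * bytes_per_char
    enable_bits = 1  # assumption: the single enable boolean
    select_bits = math.ceil(math.log2(CELLS))  # bits needed to pick a cell
    return data_bits + enable_bits + select_bits


def total_animations(bytes_per_char: int) -> int:
    """2 animations per byte * N bytes per character * M chars per cell."""
    return 2 * bytes_per_char * chars_per_cell()


if __name__ == "__main__":
    for n in (1, 2):
        print(n, parameter_bits(n), total_animations(n))
```

With these assumptions the sketch gives 93 bits for 1-byte characters and 181 bits for 2-byte characters, matching the figures quoted in the README, and 22 or 44 animations respectively.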
