String matching no longer relies on spaces

Add `matchStrings`, which does essentially the same thing as `matchStringList` except that it compares windows of characters instead of splitting the input at spaces. I think this should work better for Japanese and Chinese, since they don't put spaces between words. It doesn't seem to cause any accuracy regressions for English. Also rename the old space-splitting wrapper to `matchSpaceDelimitedStrings` and update the README.
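
The character-window idea in this message can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: the name `stitch_by_char_window` is hypothetical, and it counts mismatched characters directly, on the assumption that the Levenshtein distance between two single characters is 0 when they are equal and 1 otherwise.

```python
def stitch_by_char_window(old_text: str, new_text: str, window_size: int = 3) -> str:
    """Stitch two overlapping transcriptions without splitting on spaces.

    Hypothetical helper that mirrors the shape of the new matchStrings.
    """
    if old_text == new_text:
        return old_text
    if len(old_text) < window_size or len(new_text) < window_size:
        # Too short to align; trust the newer transcription.
        return new_text

    best_i, best_j, best_d = 0, 0, window_size + 1
    for i in range(len(old_text) - window_size + 1):
        old_slice = old_text[i:i + window_size]
        for j in range(len(new_text) - window_size + 1):
            new_slice = new_text[j:j + window_size]
            # Count positions where the two windows disagree.
            d = sum(a != b for a, b in zip(old_slice, new_slice))
            # "<=" keeps the latest window among ties, as the patch does.
            if d <= best_d:
                best_i, best_j, best_d = i, j, d

    # Keep the old text before the window, then the new text from the window on.
    return old_text[:best_i] + new_text[best_j:]


print(stitch_by_char_window("foo bar", "bar baz"))
print(stitch_by_char_window("今日は天気が", "天気がいいですね"))
```

Because the windows are made of characters rather than words, the second call can align two Japanese fragments that contain no spaces at all.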
author: yum <yum.food.vr@gmail.com> 2022-11-06 12:50:38 -0800
committer: yum <yum.food.vr@gmail.com> 2022-11-06 12:50:38 -0800
commit: 7146acb9d4ad751fc5ced411a2990d0aad17d08f (patch)
tree: 30d5f9f9a7f47bc4272fa9e9fff5c0226c376686
parent: 3a123fb5cabdbdef4f1b98031ec90c42e1d6e911 (diff)
6 files changed, 120 insertions, 64 deletions
diff --git a/Images/tastt_anim.png b/Images/tastt_anim.png
new file mode 100644
index 0000000..2cd8612
--- /dev/null
+++ b/Images/tastt_anim.png
diff --git a/Images/tastt_blend.png b/Images/tastt_blend.png
new file mode 100644
index 0000000..7373dfd
--- /dev/null
+++ b/Images/tastt_blend.png
diff --git a/README.md b/README.md
index 9ee090e..f0cce3d 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ custom shader display the text in game.
 ![Speech-to-text demo](Images/speech_to_text_demo.gif)
 
 Features:
-* 8x22 display grid, 80 characters per slot.
+* 4x44 grid, 256 or 65536 characters per slot.
 * Text-to-text interface.
 * Speech-to-text interface.
 * Free as in beer.
@@ -52,10 +52,10 @@ There are currently 5 important pieces:
 
 1. `TaSTT.shader`. A simple unlit shader. Has one parameter per cell in the
    display.
-2. `generate_animations.sh`. Generates one animation per (row, column, letter).
-   These animations allow us to write the shader's parameters from an FX layer.
-3. `generate_fx.py`. Generates a colossal FX layer which maps (row, column,
-   letter, active) to exactly one of TaSTT.shader's parameters.
+2. `libunity.py`. Contains the logic required to generate and manipulate Unity
+   YAML files. Works well enough on YAMLs up to ~40k documents, 1M lines.
+3. `libtastt.py`. Contains the logic to generate TaSTT-specific Unity files,
+   namely the animations and the animator.
 4. `osc_ctrl.py`. Sends OSC messages to VRChat, which it dutifully passes along
    to the generated FX layer.
 5. `transcribe.py`. Uses OpenAI's whisper neural network to transcribe audio
@@ -63,61 +63,52 @@ There are currently 5 important pieces:
 
 #### Parameters & board indexing
 
-There are 2 obvious ways to tell the board how to display a message:
+I divide the board into 16 regions and use a single int parameter,
+`TaSTT_Select`, to select the active region. For each byte of data
+in the active region, I use a float parameter to blend between two
+animations: one with value 0, and one with value 255.
 
-1. Independently parameterize every character slot. If we want to display
-   a 140-character tweet, this means using (140 characters) * (8 bits
-   per character) == 1120 bits of parameter memory. VRChat only gives us 256!
-2. Parameterize one character slot. We could have an 8-bit letter, an 8-bit row
-   select, and an 8-bit column select. To avoid overwriting cells while we seek,
-   we could include a 1-bit enable. This approach works, and uses very few
-   parameter bits, but it requires us to update the same parameter very quickly.
-   Experimental results with this were not promising; remote viewers would see
-   the wrong letters pretty often.
+To support wide character sets, I support 2 bytes per character. This
+can be configured down to 1 byte per character to save parameter bits.
 
-Thus I settled on a hybrid approach: we divide the board into `cells`,
-inside of which we can independently address each character slot. There
-are currently 16 cells.
-
-Since the board has (22 columns) * (8 rows) == 176 character slots, each cell
-contains (176 characters) / (16 cells) = 11 characters.
-
-To update a cell, we do this:
-
-1. Select the cell. Since there are 16 cells, this requires 4 bits.
-2. For each letter in the cell, select the letter. Since we support 256 letters
-   per cell, this requires 8 bits.
-
-To avoid overwriting cells while we seek around, we also have a single boolean
-which enables/disables updating any cells.
-
-Thus the total amount of parameter memory used is dictated by this equation:
+The total amount of parameter memory used is dictated by this equation:
 
 ```
-ROWS * COLS * 8 / CELLS + 1 + log2(CELLS)
+ROWS = 4
+COLS = 44
+CELLS = 16
+MEMORY = ROWS * COLS * (N bits per character) / CELLS + 1 + log2(CELLS)
 ```
 
-This is currently 93 bits.
+This is currently 93 bits for 1-byte characters and 181 bits for 2-byte
+characters.
 
 #### FX controller design
 
 The FX controller (AKA animator) is pretty simple. There is one layer for each
-character in a cell. Thus the layer has to work out which cell it's in, then
+character in a cell. The layer has to work out which cell it's in, then
 work out which letter we want to write in that cell, then run an animation for
 that letter.
 
-Here's a layer where I manually moved things around to show the structure of
-the decision tree:
-
-![One FX layer with 4-bit indexing](Images/four_bit_indexing.png)
+![One FX layer with 16 cells](Images/tastt_anim.png)
 
 From top down, we first check if updating the board is enabled. If no, we stay
-in the first state. Then we check which cell we're in. This is divided into 4
-binary checks, each looking at a boolean parameter. Finally, we fire one of 80
-animations based on the value of the current layer's Letter parameter.
+in the first state. Then we check which cell we're in. Finally, we drive a
+shader parameter to one of 256 possible values using a blendtree.
+
+![An 8-bit blendtree](Images/tastt_blend.png)
+
+The blendtree trick lets us represent wide character sets efficiently. The
+number of animations required increases logarithmically with the size of the
+character set:
 
-In the pictured FX layer, there are 16 cells each controlling 80 animations,
-for a total of 1280 animations. There are 11 such layers.
+```
+(N bytes per character) = ceil(log2(size of character set) / 8)
+(total animations) =
+    (2 animations per byte) *
+    (N bytes per character) *
+    (M chars per cell)
+```
 
 ### Contributing
 
diff --git a/osc_ctrl.py b/osc_ctrl.py
index e5a2608..bb6dd87 100644
--- a/osc_ctrl.py
+++ b/osc_ctrl.py
@@ -119,7 +119,7 @@ def splitMessage(msg):
                 line = ""
             word_prefix = word[0:BOARD_COLS-1] + "-"
             word_suffix = word[BOARD_COLS-1:]
-            print("append prefix {}".format(word_prefix))
+            #print("append prefix {}".format(word_prefix))
             lines.append(word_prefix)
             word = word_suffix
 
diff --git a/string_matcher.py b/string_matcher.py
index 17bfaac..f529e0c 100644
--- a/string_matcher.py
+++ b/string_matcher.py
@@ -5,6 +5,8 @@ from Levenshtein import distance as levenshtein_distance
 
 import typing
 
+DEBUG = False
+
 # Find the window where the distance between these two transcriptions is
 # minimized and use it to stitch them together.
 def matchStringList(old_words: typing.List[str],
@@ -42,25 +44,89 @@ def matchStringList(old_words: typing.List[str],
     else:
         return " ".join(new_words)
 
-def matchStrings(old_text: str, new_text: str, window_size = 4) -> str:
+def matchSpaceDelimitedStrings(old_text: str, new_text: str, window_size = 4) -> str:
     old_words = old_text.split()
     new_words = new_text.split()
     return matchStringList(old_words, new_words, window_size)
 
+def matchStrings(old_text: str, new_text: str, window_size = 3) -> str:
+    if old_text == new_text:
+        return old_text
+    elif len(old_text) >= window_size and len(new_text) >= window_size:
+        # Find the window where the cumulative string distance
+        # between the text in that window in the old/new transcription
+        # is minimized.
+
+        best_match_i = None
+        best_match_j = None
+        best_match_d = window_size * 1000
+
+        for i in range(0, 1 + len(old_text) - window_size):
+            old_slice = old_text[i:i + window_size]
+
+            for j in range(0, 1 + len(new_text) - window_size):
+                new_slice = new_text[j:j + window_size]
+                cur_d = 0
+                for k in range(0, window_size):
+                    cur_d += levenshtein_distance(old_slice[k], new_slice[k])
+                if cur_d <= best_match_d:
+                    best_match_i = i
+                    best_match_j = j
+                    best_match_d = cur_d
+
+                    if DEBUG:
+                        print("optimum at old '{}'/{} new '{}'/{} d={}".format(
+                            old_slice, i, new_slice, j, cur_d))
+
+        old_prefix = old_text[0:best_match_i]
+        overlap = new_text[best_match_j:best_match_j + window_size]
+        new_suffix = new_text[best_match_j + window_size:]
+
+        if DEBUG:
+            print("Best match i:    {}".format(best_match_i))
+            print("Best match j:    {}".format(best_match_j))
+            print("Window size:     {}".format(window_size))
+            print("Old prefix:      {}".format(old_prefix))
+            print("Overlap:         {}".format(overlap))
+            print("New suffix:      {}".format(new_suffix))
+            print("Input 1:         {}".format(old_text))
+            print("Input 2:         {}".format(new_text))
+            print("Output:          {}".format(old_prefix +
+                new_text[best_match_j:]))
+        return old_prefix + new_text[best_match_j:]
+    else:
+        return new_text
+
 if __name__ == "__main__":
     # Identical transcriptions should not be changed.
-    assert(matchStrings("This is a test case.", "This is a test case.", window_size = 3) == "This is a test case.")
+    assert(matchSpaceDelimitedStrings("This is a test case.", "This is a test case.", window_size = 3) == "This is a test case.")
     # A suffix should be detected and ignored.
-    assert(matchStrings("This is a test case.", "is a test case.", window_size = 3) == "This is a test case.")
+    assert(matchSpaceDelimitedStrings("This is a test case.", "is a test case.", window_size = 3) == "This is a test case.")
     # A lengthening suffix should be correctly appended.
-    assert(matchStrings("This is a test", "is a test case.", window_size = 3) == "This is a test case.")
+    assert(matchSpaceDelimitedStrings("This is a test", "is a test case.", window_size = 3) == "This is a test case.")
     # A strictly longer transcription should override the old prefix.
-    assert(matchStrings("This is a test", "This is a test case.", window_size = 3) == "This is a test case.")
+    assert(matchSpaceDelimitedStrings("This is a test", "This is a test case.", window_size = 3) == "This is a test case.")
     # Paranoia: repetitive text broke the older implementation, so I included
     # some test cases without fully understanding what the old problem was.
-    assert(matchStrings("test test test", "test test test test test test", window_size
+    assert(matchSpaceDelimitedStrings("test test test", "test test test test test test", window_size
         = 3) == "test test test test test test")
-    assert(matchStrings("test test test test test test", "test test test", window_size
+    assert(matchSpaceDelimitedStrings("test test test test test test", "test test test", window_size
         = 3) == "test test test test test test")
+
+    print(matchStrings("foo bar", "bar baz"))
+    print(matchStrings("alpha beta", "beta gamma"))
+
+    in1 = "Okay, what about now? Looks like it sort of works. Key word being sort of."
+    in2 = "okay what about now looks like it sort of works key word being sort of looks"
+    bad_out = "Okay, what about now? Looks like it sort of works. Key word being sort of works key word being sort of looks"
+    good_out = "Okay, what about now? Looks like it sort of works. Key word being sort of looks"
+    print(matchStrings(in1, in2))
+
+    in1 = "This repository can take"
+    in2 = "This repository contains the code for"
+    bad_out  = "This repository can tode for"
+    good_out = "This repository contains the code for"
+    print(matchStrings(in1, in2))
+
     print("Tests passed.")
 
diff --git a/transcribe.py b/transcribe.py
index 5d2897c..4014dc8 100644
--- a/transcribe.py
+++ b/transcribe.py
@@ -43,9 +43,6 @@ class AudioState:
     frames_lock = threading.Lock()
 
     text = ""
-    # To improve temporal stability, we require two consecutive identical
-    # transcriptions before "committing" to a transcription.
-    text_candidate = ""
     text_lock = threading.Lock()
 
     record_audio = True
@@ -56,6 +53,9 @@ class AudioState:
     transcribe_sleep_duration_max_s = 1.50
     transcribe_no_change_count = 0
     transcribe_sleep_duration = transcribe_sleep_duration_min_s
+    # The language the user is speaking in.
+    language = whisper.tokenizer.TO_LANGUAGE_CODE["japanese"]
+
     # When the user says `over`, we stop displaying new transcriptions until
     # they clear the board again.
     display_paused = False
@@ -162,7 +162,6 @@ def resetAudioLocked(audio_state):
     resetDiskAudioLocked(audio_state, audio_state.VOICE_AUDIO_FILENAME)
 
     audio_state.text = ""
-    audio_state.text_candidate = ""
     osc_ctrl.clear(audio_state.osc_client)
 
 def resetAudio(audio_state):
@@ -171,7 +170,7 @@ def resetAudio(audio_state):
     audio_state.frames_lock.release()
 
 # Transcribe the audio recorded in a file.
-def transcribe(model, filename):
+def transcribe(audio_state, model, filename):
 
     audio_state.frames_lock.acquire()
     audio = whisper.load_audio(filename)
@@ -179,7 +178,8 @@ def transcribe(model, filename):
 
     audio = whisper.pad_or_trim(audio)
     mel = whisper.log_mel_spectrogram(audio).to(model.device)
-    options = whisper.DecodingOptions(language = "en",
+    #options = whisper.DecodingOptions(language = "en",
+    options = whisper.DecodingOptions(language = audio_state.language,
             beam_size = 5)
     result = whisper.decode(model, mel, options)
 
@@ -220,7 +220,7 @@ def transcribeAudio(audio_state, model):
             time.sleep(0.1)
             continue
 
-        text = transcribe(model, audio_state.VOICE_AUDIO_FILENAME)
+        text = transcribe(audio_state, model, audio_state.VOICE_AUDIO_FILENAME)
         if not text:
             continue
 
@@ -241,18 +241,17 @@ def transcribeAudio(audio_state, model):
         print("Transcription: {}".format(audio_state.text))
 
         old_text = audio_state.text
-        old_words = audio_state.text.split()
-        new_words = text.split()
+        #old_words = audio_state.text.split()
+        #new_words = text.split()
 
-        audio_state.text = string_matcher.matchStringList(old_words, new_words)
+        audio_state.text = string_matcher.matchStrings(audio_state.text,
+                text, window_size = 5)
         if old_text != audio_state.text:
             # We think the user said something, so  reset the amount of
             # time we sleep between transcriptions to the minimum.
             audio_state.transcribe_no_change_count = 0
             audio_state.transcribe_sleep_duration = audio_state.transcribe_sleep_duration_min_s
 
-        audio_state.text_candidate = text
-
         audio_state.text_lock.release()
 
 def sendAudio(audio_state):
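
As a sanity check on the README hunk in this patch, the parameter-memory equation and the blendtree animation count can be worked through in a few lines. This is only a sketch: the function names are hypothetical, and reading the `+ 1` term as the single enable bit and `log2(CELLS)` as the region-select bits is an assumption taken from the README text this commit removes.

```python
import math


def parameter_bits(rows: int, cols: int, cells: int, bytes_per_char: int) -> int:
    """Evaluate ROWS * COLS * (bits per char) / CELLS + 1 + log2(CELLS)."""
    slots_per_cell = rows * cols // cells          # 4 * 44 / 16 = 11
    data_bits = slots_per_cell * 8 * bytes_per_char
    enable_bits = 1                                # assumed: the enable flag
    select_bits = math.ceil(math.log2(cells))      # assumed: region select
    return data_bits + enable_bits + select_bits


def blendtree_animations(bytes_per_char: int, chars_per_cell: int) -> int:
    """Evaluate (2 animations per byte) * (N bytes per char) * (M chars per cell)."""
    return 2 * bytes_per_char * chars_per_cell


for n in (1, 2):
    print(n, parameter_bits(4, 44, 16, n), blendtree_animations(n, 11))
```

With the README's values (4 rows, 44 columns, 16 cells) this reproduces the stated 93 bits for 1-byte characters and 181 bits for 2-byte characters.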