-rw-r--r--  Images/tastt_anim.png   | bin 0 -> 42579 bytes
-rw-r--r--  Images/tastt_blend.png  | bin 0 -> 14070 bytes
-rw-r--r--  README.md               | 79
-rw-r--r--  osc_ctrl.py             | 2
-rw-r--r--  string_matcher.py       | 80
-rw-r--r--  transcribe.py           | 23
6 files changed, 120 insertions, 64 deletions
diff --git a/Images/tastt_anim.png b/Images/tastt_anim.png
new file mode 100644
index 0000000..2cd8612
--- /dev/null
+++ b/Images/tastt_anim.png
Binary files differ
diff --git a/Images/tastt_blend.png b/Images/tastt_blend.png
new file mode 100644
index 0000000..7373dfd
--- /dev/null
+++ b/Images/tastt_blend.png
Binary files differ
diff --git a/README.md b/README.md
index 9ee090e..f0cce3d 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ custom shader display the text in game.
![Speech-to-text demo](Images/speech_to_text_demo.gif)
Features:
-* 8x22 display grid, 80 characters per slot.
+* 4x44 grid, 256 or 65536 characters per slot.
* Text-to-text interface.
* Speech-to-text interface.
* Free as in beer.
@@ -52,10 +52,10 @@ There are currently 5 important pieces:
1. `TaSTT.shader`. A simple unlit shader. Has one parameter per cell in the
display.
-2. `generate_animations.sh`. Generates one animation per (row, column, letter).
- These animations allow us to write the shader's parameters from an FX layer.
-3. `generate_fx.py`. Generates a colossal FX layer which maps (row, column,
- letter, active) to exactly one of TaSTT.shader's parameters.
+2. `libunity.py`. Contains the logic required to generate and manipulate Unity
+ YAML files. Works well enough on YAMLs up to ~40k documents, 1M lines.
+3. `libtastt.py`. Contains the logic to generate TaSTT-specific Unity files,
+ namely the animations and the animator.
4. `osc_ctrl.py`. Sends OSC messages to VRChat, which it dutifully passes along
to the generated FX layer.
5. `transcribe.py`. Uses OpenAI's whisper neural network to transcribe audio
@@ -63,61 +63,52 @@ There are currently 5 important pieces:
#### Parameters & board indexing
-There are 2 obvious ways to tell the board how to display a message:
+I divide the board into 16 regions and use a single int parameter,
+`TaSTT_Select`, to select the active region. For each byte of data
+in the active region, I use a float parameter to blend between two
+animations: one with value 0, and one with value 255.
-1. Independently parameterize every character slot. If we want to display
- a 140-character tweet, this means using (140 characters) * (8 bits
- per character) == 1120 bits of parameter memory. VRChat only gives us 256!
-2. Parameterize one character slot. We could have an 8-bit letter, an 8-bit row
- select, and an 8-bit column select. To avoid overwriting cells while we seek,
- we could include a 1-bit enable. This approach works, and uses very few
- parameter bits, but it requires us to update the same parameter very quickly.
- Experimental results with this were not promising; remote viewers would see
- the wrong letters pretty often.
+To support wide character sets, I support 2 bytes per character. This
+can be configured down to 1 byte per character to save parameter bits.
-Thus I settled on a hybrid approach: we divide the board into `cells`,
-inside of which we can independently address each character slot. There
-are currently 16 cells.
-
-Since the board has (22 columns) * (8 rows) == 176 character slots, each cell
-contains (176 characters) / (16 cells) = 11 characters.
-
-To update a cell, we do this:
-
-1. Select the cell. Since there are 16 cells, this requires 4 bits.
-2. For each letter in the cell, select the letter. Since we support 256 letters
- per cell, this requires 8 bits.
-
-To avoid overwriting cells while we seek around, we also have a single boolean
-which enables/disables updating any cells.
-
-Thus the total amount of parameter memory used is dictated by this equation:
+The total amount of parameter memory used is dictated by this equation:
```
-ROWS * COLS * 8 / CELLS + 1 + log2(CELLS)
+ROWS = 4
+COLS = 44
+CELLS = 16
+MEMORY = ROWS * COLS * (N bits per character) / CELLS + 1 + log2(CELLS)
```
-This is currently 93 bits.
+This is currently 93 bits for 1-byte characters and 181 bits for 2-byte
+characters.
#### FX controller design
The FX controller (AKA animator) is pretty simple. There is one layer for each
-character in a cell. Thus the layer has to work out which cell it's in, then
+character in a cell. The layer has to work out which cell it's in, then
work out which letter we want to write in that cell, then run an animation for
that letter.
-Here's a layer where I manually moved things around to show the structure of
-the decision tree:
-
-![One FX layer with 4-bit indexing](Images/four_bit_indexing.png)
+![One FX layer with 16 cells](Images/tastt_anim.png)
From top down, we first check if updating the board is enabled. If no, we stay
-in the first state. Then we check which cell we're in. This is divided into 4
-binary checks, each looking at a boolean parameter. Finally, we fire one of 80
-animations based on the value of the current layer's Letter parameter.
+in the first state. Then we check which cell we're in. Finally, we drive a
+shader parameter to one of 256 possible values using a blendtree.
+
+![An 8-bit blendtree](Images/tastt_blend.png)
+
+The blendtree trick lets us represent wide character sets efficiently. The
+number of animations required increases logarithmically with the size of the
+character set:
-In the pictured FX layer, there are 16 cells each controlling 80 animations,
-for a total of 1280 animations. There are 11 such layers.
+```
+(N bytes per character) = ceil(log2(size of character set) / 8)
+(total animations) =
+ (2 animations per byte) *
+ (N bytes per character) *
+ (M chars per cell)
+```
### Contributing
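
The two README formulas in this hunk (parameter memory and animation count) can be checked with a few lines of Python. This is only a sketch of the arithmetic as the README states it; the function names `parameter_bits` and `total_animations` are hypothetical and do not appear in the repository, and the assumption that each float parameter costs 8 bits is taken from the equation itself.

```python
import math


def parameter_bits(rows: int, cols: int, bytes_per_char: int, cells: int) -> int:
    """Parameter memory in bits, following the README equation:
    ROWS * COLS * (bits per character) / CELLS + 1 + log2(CELLS).
    """
    chars_per_cell = rows * cols // cells
    data_bits = chars_per_cell * 8 * bytes_per_char
    enable_bits = 1  # the single "update enabled" boolean
    select_bits = math.ceil(math.log2(cells))  # bits needed to pick a cell
    return data_bits + enable_bits + select_bits


def total_animations(rows: int, cols: int, bytes_per_char: int, cells: int) -> int:
    """Animation count as the README formula is written:
    2 animations per byte * N bytes per character * M chars per cell.
    """
    chars_per_cell = rows * cols // cells
    return 2 * bytes_per_char * chars_per_cell


if __name__ == "__main__":
    for n in (1, 2):
        print(n, parameter_bits(4, 44, n, 16), total_animations(4, 44, n, 16))
```

With the 4x44 board and 16 cells, each cell holds 11 characters, which reproduces the 93-bit and 181-bit figures quoted in the README.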
diff --git a/osc_ctrl.py b/osc_ctrl.py
index e5a2608..bb6dd87 100644
--- a/osc_ctrl.py
+++ b/osc_ctrl.py
@@ -119,7 +119,7 @@ def splitMessage(msg):
line = ""
word_prefix = word[0:BOARD_COLS-1] + "-"
word_suffix = word[BOARD_COLS-1:]
- print("append prefix {}".format(word_prefix))
+ #print("append prefix {}".format(word_prefix))
lines.append(word_prefix)
word = word_suffix
diff --git a/string_matcher.py b/string_matcher.py
index 17bfaac..f529e0c 100644
--- a/string_matcher.py
+++ b/string_matcher.py
@@ -5,6 +5,8 @@ from Levenshtein import distance as levenshtein_distance
import typing
+DEBUG = False
+
# Find the window where the distance between these two transcriptions is
# minimized and use it to stitch them together.
def matchStringList(old_words: typing.List[str],
@@ -42,25 +44,89 @@ def matchStringList(old_words: typing.List[str],
else:
return " ".join(new_words)
-def matchStrings(old_text: str, new_text: str, window_size = 4) -> str:
+def matchSpaceDelimitedStrings(old_text: str, new_text: str, window_size = 4) -> str:
old_words = old_text.split()
new_words = new_text.split()
return matchStringList(old_words, new_words, window_size)
+def matchStrings(old_text: str, new_text: str, window_size = 3) -> str:
+    if old_text == new_text:
+        return old_text
+    elif len(old_text) >= window_size and len(new_text) >= window_size:
+        # Find the window where the cumulative string distance
+        # between the text in that window in the old/new transcription
+        # is minimized.
+
+        best_match_i = None
+        best_match_j = None
+        best_match_d = window_size * 1000
+
+        for i in range(0, 1 + len(old_text) - window_size):
+            old_slice = old_text[i:i + window_size]
+
+            for j in range(0, 1 + len(new_text) - window_size):
+                new_slice = new_text[j:j + window_size]
+                cur_d = 0
+                for k in range(0, window_size):
+                    cur_d += levenshtein_distance(old_slice[k], new_slice[k])
+                if cur_d <= best_match_d:
+                    best_match_i = i
+                    best_match_j = j
+                    best_match_d = cur_d
+
+                    if DEBUG:
+                        print("optimum at old '{}'/{} new '{}'/{} d={}".format(
+                            old_slice, i, new_slice, j, cur_d))
+
+        old_prefix = old_text[0:best_match_i]
+        overlap = new_text[best_match_j:best_match_j + window_size]
+        new_suffix = new_text[best_match_j + window_size:]
+
+        if DEBUG:
+            print("Best match i: {}".format(best_match_i))
+            print("Best match j: {}".format(best_match_j))
+            print("Window size: {}".format(window_size))
+            print("Old prefix: {}".format(old_prefix))
+            print("Overlap: {}".format(overlap))
+            print("New suffix: {}".format(new_suffix))
+            print("Input 1: {}".format(old_text))
+            print("Input 2: {}".format(new_text))
+            print("Output: {}".format(old_prefix +
+                new_text[best_match_j:]))
+        return old_prefix + new_text[best_match_j:]
+    else:
+        return new_text
+
if __name__ == "__main__":
# Identical transcriptions should not be changed.
- assert(matchStrings("This is a test case.", "This is a test case.", window_size = 3) == "This is a test case.")
+ assert(matchSpaceDelimitedStrings("This is a test case.", "This is a test case.", window_size = 3) == "This is a test case.")
# A suffix should be detected and ignored.
- assert(matchStrings("This is a test case.", "is a test case.", window_size = 3) == "This is a test case.")
+ assert(matchSpaceDelimitedStrings("This is a test case.", "is a test case.", window_size = 3) == "This is a test case.")
# A lengthening suffix should be correctly appended.
- assert(matchStrings("This is a test", "is a test case.", window_size = 3) == "This is a test case.")
+ assert(matchSpaceDelimitedStrings("This is a test", "is a test case.", window_size = 3) == "This is a test case.")
# A strictly longer transcription should override the old prefix.
- assert(matchStrings("This is a test", "This is a test case.", window_size = 3) == "This is a test case.")
+ assert(matchSpaceDelimitedStrings("This is a test", "This is a test case.", window_size = 3) == "This is a test case.")
# Paranoia: repetitive text broke the older implementation, so I included
# some test cases without fully understanding what the old problem was.
- assert(matchStrings("test test test", "test test test test test test", window_size
+ assert(matchSpaceDelimitedStrings("test test test", "test test test test test test", window_size
= 3) == "test test test test test test")
- assert(matchStrings("test test test test test test", "test test test", window_size
+ assert(matchSpaceDelimitedStrings("test test test test test test", "test test test", window_size
= 3) == "test test test test test test")
+
+ print(matchStrings("foo bar", "bar baz"))
+ print(matchStrings("alpha beta", "beta gamma"))
+
+ in1 = "Okay, what about now? Looks like it sort of works. Key word being sort of."
+ in2 = "okay what about now looks like it sort of works key word being sort of looks"
+ bad_out = "Okay, what about now? Looks like it sort of works. Key word being sort of works key word being sort of looks"
+ good_out = "Okay, what about now? Looks like it sort of works. Key word being sort of looks"
+ print(matchStrings(in1, in2))
+
+ in1 = "This repository can take"
+ in2 = "This repository contains the code for"
+ bad_out = "This repository can tode for"
+ good_out = "This repository contains the code for"
+ print(matchStrings(in1, in2))
+
print("Tests passed.")
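
For review purposes, the character-window stitching that the new `matchStrings` performs can be restated without the `Levenshtein` dependency. The sketch below is an assumption-labelled rewrite, not the committed code: the name `stitch` is hypothetical, and it counts per-position mismatches directly, which gives the same value as summing the Levenshtein distance of single characters (0 if equal, 1 otherwise). Like the committed loop, it uses `<=`, so the last minimal-distance window pair wins.

```python
def stitch(old_text: str, new_text: str, window_size: int = 3) -> str:
    """Join two overlapping transcriptions at the best-matching window.

    Hypothetical restatement of the sliding-window search in this diff.
    """
    if old_text == new_text:
        return old_text
    if len(old_text) < window_size or len(new_text) < window_size:
        # Too short to compare windows; trust the newer transcription.
        return new_text

    best_i = best_j = 0
    best_d = window_size + 1  # larger than any possible mismatch count

    for i in range(len(old_text) - window_size + 1):
        old_slice = old_text[i:i + window_size]
        for j in range(len(new_text) - window_size + 1):
            new_slice = new_text[j:j + window_size]
            # Count positions where the two windows disagree.
            d = sum(a != b for a, b in zip(old_slice, new_slice))
            if d <= best_d:  # ties go to the latest window pair
                best_i, best_j, best_d = i, j, d

    # Keep the old text before the matched window, then the new text from
    # the matched window onward.
    return old_text[:best_i] + new_text[best_j:]


if __name__ == "__main__":
    print(stitch("foo bar", "bar baz"))
    print(stitch("alpha beta", "beta gamma"))
```

The search compares every window of the old text against every window of the new text, so its cost grows with the product of the two lengths times the window size. That is acceptable for short transcriptions, but worth keeping in mind if the text buffer grows.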
diff --git a/transcribe.py b/transcribe.py
index 5d2897c..4014dc8 100644
--- a/transcribe.py
+++ b/transcribe.py
@@ -43,9 +43,6 @@ class AudioState:
frames_lock = threading.Lock()
text = ""
- # To improve temporal stability, we require two consecutive identical
- # transcriptions before "committing" to a transcription.
- text_candidate = ""
text_lock = threading.Lock()
record_audio = True
@@ -56,6 +53,9 @@ class AudioState:
transcribe_sleep_duration_max_s = 1.50
transcribe_no_change_count = 0
transcribe_sleep_duration = transcribe_sleep_duration_min_s
+ # The language the user is speaking in.
+ language = whisper.tokenizer.TO_LANGUAGE_CODE["japanese"]
+
# When the user says `over`, we stop displaying new transcriptions until
# they clear the board again.
display_paused = False
@@ -162,7 +162,6 @@ def resetAudioLocked(audio_state):
resetDiskAudioLocked(audio_state, audio_state.VOICE_AUDIO_FILENAME)
audio_state.text = ""
- audio_state.text_candidate = ""
osc_ctrl.clear(audio_state.osc_client)
def resetAudio(audio_state):
@@ -171,7 +170,7 @@ def resetAudio(audio_state):
audio_state.frames_lock.release()
# Transcribe the audio recorded in a file.
-def transcribe(model, filename):
+def transcribe(audio_state, model, filename):
audio_state.frames_lock.acquire()
audio = whisper.load_audio(filename)
@@ -179,7 +178,8 @@ def transcribe(model, filename):
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
- options = whisper.DecodingOptions(language = "en",
+ #options = whisper.DecodingOptions(language = "en",
+ options = whisper.DecodingOptions(language = audio_state.language,
beam_size = 5)
result = whisper.decode(model, mel, options)
@@ -220,7 +220,7 @@ def transcribeAudio(audio_state, model):
time.sleep(0.1)
continue
- text = transcribe(model, audio_state.VOICE_AUDIO_FILENAME)
+ text = transcribe(audio_state, model, audio_state.VOICE_AUDIO_FILENAME)
if not text:
continue
@@ -241,18 +241,17 @@ def transcribeAudio(audio_state, model):
print("Transcription: {}".format(audio_state.text))
old_text = audio_state.text
- old_words = audio_state.text.split()
- new_words = text.split()
+ #old_words = audio_state.text.split()
+ #new_words = text.split()
- audio_state.text = string_matcher.matchStringList(old_words, new_words)
+ audio_state.text = string_matcher.matchStrings(audio_state.text,
+ text, window_size = 5)
if old_text != audio_state.text:
# We think the user said something, so reset the amount of
# time we sleep between transcriptions to the minimum.
audio_state.transcribe_no_change_count = 0
audio_state.transcribe_sleep_duration = audio_state.transcribe_sleep_duration_min_s
- audio_state.text_candidate = text
-
audio_state.text_lock.release()
def sendAudio(audio_state):