| Commit message (Collapse) | Author | Age |
| ... | |
Transcription output now streams to localhost:8097.
In OBS:
* Create a browser source.
* url: localhost:8097
* width: 2200
* height: 400
TODO:
* Put behind toggle.
* Create input field for port.
Misc cleanup:
* transcribe.py: Drop frames from audio capture thread instead of the
transcription thread. Doing it the other way would result in
occasional data loss.
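
A minimal sketch of dropping frames on the capture side, assuming a bounded `queue.Queue` sits between the two threads. The name `push_frame` and the choice to discard the oldest frame are assumptions for illustration, not what transcribe.py actually does.

```python
import queue


def push_frame(frames: queue.Queue, frame: bytes) -> bool:
    """Called only from the audio capture thread.

    Returns True if the frame was stored without dropping anything,
    False if an older frame had to be discarded to make room.
    """
    try:
        frames.put_nowait(frame)
        return True
    except queue.Full:
        # Assumption: prefer fresh audio, so discard the oldest frame.
        try:
            frames.get_nowait()
        except queue.Empty:
            # The transcription thread drained the queue in the meantime.
            pass
        # Safe with a single producer: nobody else can refill the queue.
        frames.put_nowait(frame)
        return False
```

Because the drop decision lives in the producer, the transcription thread only ever reads frames and never has to decide what to discard.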
|
At the core of the STT, there's a loop which uses Whisper to convert
audio into a transcript. As you say something, Whisper sees growing
fragments of your sentence:
t0: "Hell"
t1: "Hello"
t2: "Hello, world!"
So we need some algorithm which takes these fragments and
accumulates them into an ever-growing transcript.
Previously I did this with fuzzy string matching. I'd find the region
where the two transcripts overlap and edit the two together to produce a
longer transcript. The big problem is that if there's no overlap, it's
not clear whether Whisper radically changed its mind as to what was
said, or whether the user paused for a long time before saying
something new. So I'd have to reset the growing transcript.
Now I get the timestamps from Whisper and wait for it to give me the
same transcript for the last utterance 3 times. Once the transcript
stabilizes like this, I commit the text. This enables a temporally
stable, ever-growing transcript that's also quite accurate.
To prevent a latency regression, I also introduce the notion of "preview
text", which is a preview of an utterance that has not yet stabilized.
These previews do not contribute to the ever-growing transcript, but do
get fed through the rest of the app, so they show up in-game / in OBS.
Once they eventually stabilize, they get committed to the ever-growing
transcript.
This change is lightly tested!
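
A rough sketch of the commit/preview split described above. It is a simplification under stated assumptions: it ignores Whisper's timestamps entirely, treats "stable" as the same text seen 3 times in a row, and the `Committer` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Committer:
    stable_count: int = 3  # repeats required before committing
    committed: str = ""    # the ever-growing transcript
    preview: str = ""      # latest text that has not stabilized yet
    _last: str = ""
    _streak: int = 0

    def update(self, utterance: str) -> str:
        """Feed the newest transcript of the last utterance.

        Returns the text to display: committed text plus any preview.
        """
        if utterance == self._last:
            self._streak += 1
        else:
            self._last, self._streak = utterance, 1

        if utterance and self._streak >= self.stable_count:
            # Stable: move the utterance into the committed transcript.
            self.committed = (self.committed + " " + utterance).strip()
            self.preview = ""
            self._last, self._streak = "", 0
        else:
            # Not stable yet: show it, but don't commit it.
            self.preview = utterance
        return (self.committed + " " + self.preview).strip()
```

The preview is returned to the caller on every update, so display latency stays low even though committing waits for stability.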
|
pyopenvr is deprecated and is causing a user issue
(https://github.com/yum-food/TaSTT/issues/2).
That user was kind enough to experiment with different configs and
didn't find a simple fix. So let's close this tech debt issue the right
way.
|
NLLB needs its input to be split up into sentences. I use the
sentence_splitter Python package to do this. It supports ~20 Western
European languages, but notably, no Asian languages.
* Sort spoken language list. English is still at the top.
* Remove 'Translation source' dropdown. Infer this from the spoken
language.
* Add lang_compat.py to map language codes between the various libraries
(whisper, nllb, sentence_splitter).
* Fix bug where old text would appear in textbox when you first bring it
up.
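
A sketch of what a code-mapping table like this could look like. The three entries and the `code_for` helper are illustrative assumptions, not the contents of lang_compat.py; the NLLB codes follow the FLORES-200 naming that NLLB uses.

```python
from typing import Optional

# Illustrative subset only. None means the library has no support.
LANG_CODES = {
    "english": {"whisper": "en", "nllb": "eng_Latn", "sentence_splitter": "en"},
    "french": {"whisper": "fr", "nllb": "fra_Latn", "sentence_splitter": "fr"},
    "japanese": {"whisper": "ja", "nllb": "jpn_Jpan", "sentence_splitter": None},
}


def code_for(language: str, library: str) -> Optional[str]:
    """Return the code `library` uses for `language`, or None if unsupported."""
    return LANG_CODES[language.lower()][library]


def can_split_sentences(language: str) -> bool:
    """True if sentence_splitter can pre-process this spoken language."""
    return code_for(language, "sentence_splitter") is not None
```

Keeping one row per spoken language is what lets the translation source be inferred from the spoken language instead of being picked separately.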
|
Use Meta's No Language Left Behind (NLLB) algorithm to provide
translation capabilities into 200 languages. Obviously most are very
untested.
This requires either 4.1 or 7.1 GB of RAM and significantly increases
transcription latency.
|
Add 3 filters:
* Remove trailing period
* Convert to uppercase
* Convert to lowercase
All may be composed. Upper/lower just overwrite each other so just use
one.
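
A small sketch of how composable filters behave, assuming each filter is a plain `str -> str` function applied in order. The helper names are hypothetical.

```python
from typing import Callable, Iterable


def remove_trailing_period(text: str) -> str:
    """Drop a single trailing '.' if present."""
    return text[:-1] if text.endswith(".") else text


def apply_filters(text: str, filters: Iterable[Callable[[str], str]]) -> str:
    """Apply each filter in order, feeding each output into the next."""
    for f in filters:
        text = f(text)
    return text


# Composing period removal with uppercase works as expected.
shouted = apply_filters("Hello world.", [remove_trailing_period, str.upper])

# Upper followed by lower: the last one wins, so enabling both is pointless.
overwritten = apply_filters("Hello world.", [str.upper, str.lower])
```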
|
UI now has a checkbox for the uwu filter. Does not materially affect
resource usage or latency when enabled.
|
Use UwwwuPP to translate your boring old speech into an uwu-ified version.
Still need to add a UI toggle for this.
|
To use it, do a medium hold + long hold. Keep the long hold depressed
until you're done speaking. The transcription will be typed into the
currently selected input field.
* Add more audio feedback
* Make audio feedback play asynchronously so it doesn't slow down the
controller input state machine as much.
|
By holding the button while talking for at least 1.5 seconds, you can
update the contents of the textbox without unlocking it from worldspace.
So now you can carefully position your textbox once, then continually
speak into it without having to reposition it every time.
|
Users can now configure a keybind to start/stop/dismiss the STT when in
desktop mode. The default keybind is ctrl+x, since by default VRC
doesn't use 'x' for anything.
|
Useful on devices with multiple GPUs, such as gaming laptops.
* Update GUI/README.md.
|
See comment for details.
* Update README
|
faster-whisper doesn't need it. This reduces install size from 6.00GB
with base.en model to 1.70GB.
* Use a single sampler in shader (enables using more than 16 textures)
* Minor legibility regression - need to improve AA.
* Enable backface culling in shader (minor performance win)
|
Affinity mask no longer affects performance. String matching is still
needed for temporal stability in fast-paced long-form transcription
tasks.
|
I'm able to use the new code to show text in game. Not yet play-tested.
|
This is a much faster, lower-VRAM reimplementation of Whisper in Python.
Early testing is extremely promising: fast transcription speed,
extremely low resource usage (CPU/RAM/VRAM), high accuracy.
|
We used to populate 7 4k textures + 1 2k texture for all users.
Now if the user has configured `bytes_per_char=1` in the Unity
panel, we just populate a single 512x512 texture containing the
first 128 ASCII characters.
This reduces texture memory usage by 99.74%, from 134.67 MB to
340 KB.
|
Need python310._pth, specifically the 'import site' line, for
embedded python + pip to get along.
|
A user saw an error like `ModuleNotFoundError: No module named _socket`.
StackOverflow blames this on PYTHONPATH, so let's try setting it.
* Fix latent bug in Scripts/transcribe.py. PyAudio.open() positional
parameters must be specified in correct order, even when telling it
which parameter is which. *shrug*
|
Not ready yet.
|
Sort of a misnomer. The idea is to use C++ for transcription and Python
for steamvr and OSC.
Having issues getting output from multithreaded Python code. Not in the
mood to figure this out today.
* Hide unimplemented parts of C++ panel.
|
This reverts commit cece1ee8f1b985c2a89adb661dd02c6d44787f67.
This does *not* in fact result in improved temporal stability. It makes
things so unstable that even single-sentence messages fail to
ever stabilize.
|
Use Const-me/Whisper to perform transcription. This implementation is
vastly more efficient: CPU usage, memory usage, and VRAM usage are all
dramatically reduced. It's slightly less accurate when comparing the
same model (due to the lack of beam search decoding), but since you can
use larger models, the impact is largely a wash.
|
Per the Whisper source code, this should result in better temporal
stability.
|
Allows sustained exponential backoff when not transcribing. Used to cap
out at 1s.
* Add more items to README TODO list
* Adjust emote metadata
* Emotes bugfix: Non-existent emote map doesn't cause transcription
engine to bail out.
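
A sketch of the backoff idea, assuming the transcription loop sleeps for a delay that doubles while idle and resets when there is work. The function name, the 0.05s base, and the `cap` parameter are assumptions; passing `cap=1.0` mimics the old 1s ceiling.

```python
def next_delay(current: float, active: bool,
               base: float = 0.05, cap: float = float("inf")) -> float:
    """Return how long the loop should sleep before its next pass.

    While transcribing, poll at the base rate. While idle, keep doubling
    the delay. With the default cap of infinity the growth is sustained.
    """
    if active:
        return base
    return min(current * 2, cap)
```

With no cap, an idle loop wakes up less and less often, and a single active pass brings the delay straight back to the base rate.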
|
Don't render any part of an emote with alpha < 0.5. Improves visual
clarity in the common case at the cost of generality.
* Emotes now use physically-based shading.
* Use round() to denoise shader parameters instead of floor()
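
To illustrate why round() is the safer choice for denoising, here is the arithmetic in Python rather than shader code. The value 64.9997 is a made-up example of a parameter that was meant to carry the integer 65 but picked up a little floating-point noise in transit.

```python
import math

# A parameter that should be exactly 65 but arrives slightly low.
noisy = 64.9997

# floor() snaps to the integer below, so small negative noise costs a
# whole unit.
floored = math.floor(noisy)

# round() snaps to the nearest integer, tolerating noise in either
# direction as long as it stays under 0.5.
rounded = round(noisy)
```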
|
Emotes require 2 bytes per char. They're encoded into the region
[0xE000, infinity). The texture is 4k, and uses 1k vertical pixels
per emote segment, for a maximum of 32 segments.
* Reduce volume of noise indicator by 90%. Quiet is probably better.
Might want to add a volume slider idk.
* Bugfix: emotes without a transparency channel now work
* Address a couple Unity performance complaints about the shader
|
Done:
* Users can add images to Fonts/Emotes/
* The basename of that image ('clueless.png' becomes 'clueless') is the
keyword to make the image show up in game.
* Fix a bug in the shader where letters on the 2nd texture and later
would have UV outside of [0.0, 1.0]
Not yet implemented:
* transcribed words are encoded using emotes mapping
|
* Reduce noise on/off indicator volume by 50%
|
Looks more legible. Thanks Noppers for the feedback!
|
Ruling out possibilities for a user reported bug.
|
* Fix prefab: bounding box & position are now set to 0
* Fix shader: text is no longer upside down
* Update README
|
TaSTT shader now uses physically based rendering (PBR). Users can pick
smoothness, metallic, and emissive.
This implementation borrows heavily from catlikecoding.com's excellent
tutorials, which are released under MIT No Attribution (MIT-0).
https://catlikecoding.com/unity/tutorials/license/
To retain what little clarity remains in the shader, I have chosen not
to attribute the code in the source itself.
|
The --extra-index-url must appear *before* the dependency in this file.
|
We use a button to start/stop transcription. Previously this was
hardcoded to left joystick. Now users can pick from {left, right} x
{joystick, a, b}.
|
This seems to be the canonical way of listing a Python app's
dependencies.
* Installing dependencies no longer hangs the GUI
|
VRChat exposes a built-in chatbox which can be seen by anyone who has
it enabled. This was not the case when I started this project: the
chatbox would only be visible to friends. Since this is clearly useful
(it enables the STT on public models), let's enable sending data to it.
Caveats:
* The built-in chatbox has anti-spam tech which limits us to updating
about once every 2 seconds. The custom chatbox has no such limitation
and is thus typically much faster.
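
A sketch of how a sender could respect that limit, assuming a simple minimum-interval throttle. The `Throttle` class is hypothetical, the 2.0s default comes from the "about once every 2 seconds" figure above, and the actual OSC send is left as a caller-supplied function.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Lets a message through at most once per `min_interval_s`."""

    def __init__(self, min_interval_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic):
        self._min = min_interval_s
        self._clock = clock
        self._last: Optional[float] = None

    def try_send(self, send: Callable[[str], None], text: str) -> bool:
        """Call send(text) if enough time has passed; report whether it ran."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min:
            return False  # too soon; caller can retry with newer text
        send(text)
        self._last = now
        return True
```

Skipped updates are not queued here: the caller is expected to retry later with whatever the latest transcript is, which suits a chatbox that only shows the newest text.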
|
Previously, paths containing spaces would be interpreted by Python's argument
parser as multiple separate arguments, causing it to fail. Now we escape paths
inside PythonWrapper using std::quoted().
* Improve PII filtering. Python output would contain multiple path separators
(like C:\\Users\\foo\\), defeating the PII regex.
* Silence compiler warning in PII filter.
* Document usability improvements.
* Transcription layer exponential backoff goes to ~infinity when paused.
This is a hack, since we really don't need to transcribe at all when paused,
but it lets us keep the code simple. Good enough until the next rewrite.
* Shader only samples background when necessary.
* Limit matchStrings() print()s to DEBUG mode
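
The doubled-separator problem from the PII bullet above can be illustrated in Python. This is only a sketch of the idea: the real filter lives in the GUI, and the `scrub_username` helper, its regexes, and the `*****` mask are assumptions.

```python
import re


def scrub_username(line: str) -> str:
    """Mask the user name in Windows paths like C:\\Users\\<name>\\..."""
    # Collapse runs of backslashes first, so escaped paths such as
    # C:\\Users\\foo\\ look the same as C:\Users\foo\ to the next regex.
    line = re.sub(r"\\+", r"\\", line)
    # Replace the path component right after <drive>:\Users\ with a mask.
    return re.sub(r"(?i)([A-Z]:\\Users\\)[^\\]+", r"\1*****", line)
```

Normalizing before matching keeps the PII pattern itself simple. One caveat of this sketch: when nothing follows the user name, the mask extends to the end of the line.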
|
* Expose option to run transcription engine on CPU instead of GPU
* Use embedded git when setting up the Python virtual environment
|
Re-paging anything on screen N causes screens N+1...infinity to
completely re-page. This fixes cases where we go back and draw something
at the bottom of the board, and it never gets overwritten.
|
Boards whose size is an even multiple of CHARS_PER_SYNC would lose the
entire last region.
* Attempt to fix runaway memory usage of GUI text frames, but this needs
more work
|
Users can pick longer transcription durations for accuracy-critical
tasks, or shorter durations for latency-critical tasks.
|
VRChat won't update the FX layer associated with an avatar unless its
GUID changes. Delete the GUID file when overwriting our generated FX
layer to work around this.
* Change paging behavior: when a region is updated, we re-page everything
that comes after it. This fixes the issue where we go back to update
something, then jump back to the current screen, leaving some random
chunk of text somewhere on the board.
* Reduce transcription time from 28s to 10s. I'm going to expose this to
the user since there's a fundamental latency/stability tradeoff here.
|
Bump up recording window to 28 seconds. This helps a lot with long-form
transcription tasks, such as transcribing an audiobook.
We should expose this as a parameter, since at 10s the transcription delay is
typically 300ms, while at 28s it's typically 1.1-1.2s.
|
Users can now control how many letters wide and tall the board is.
Tested at 4x48, 5x60, 10x120, and 20x240. At 20x240, Unity freezes and
does not make forward progress. Perhaps creating 4800 float parameters
isn't a truly scalable interface.
|
Now it's possible to generate shaders with a custom number of rows, columns,
and bytes per character.
All edits to the shader should go through TaSTT_template.shader. To generate
a new shader from the template:
$ ./Scripts/generate_shader.py \
--bytes_per_char 2 \
--rows 1 \
--cols 12 \
--shader_template $(pwd)/Shaders/TaSTT_template.shader \
--shader_path $(pwd)/Shaders/TaSTT.shader
|
Users can now see the number of avatar parameter bits they'll use
prior to committing.
|
An off-by-one issue in numRegions() would result in one extra layer
trying to drive a letter in the last region, which would wrap back
around to the 0th character slot (cell).
* GUI explicitly logs when it's done generating avatar stuff
* OSC layer no longer tries to update cells which don't exist
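
A sketch of region counting that avoids both the extra region and out-of-range cells, assuming the board is split into consecutive regions of CHARS_PER_SYNC cells each. The function bodies here are assumptions for illustration, not the project's numRegions().

```python
def num_regions(rows: int, cols: int, chars_per_sync: int) -> int:
    """Number of regions needed to cover every cell (ceiling division)."""
    cells = rows * cols
    return -(-cells // chars_per_sync)


def region_cells(region: int, rows: int, cols: int,
                 chars_per_sync: int) -> range:
    """Cell indices driven by `region`, clamped to cells that exist."""
    start = region * chars_per_sync
    return range(start, min(start + chars_per_sync, rows * cols))
```

Ceiling division yields exactly `cells / chars_per_sync` regions when the board size is an exact multiple, and the clamp in `region_cells` keeps a partially filled last region from wrapping around to cell 0.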
|