| Commit message | Author | Age |
| |
|
|
|
| |
Fix up .mat to point to correct textures/shader. Also delete templates
after copying shaders.
|
| |
|
|
| |
Specify file encoding when generating shaders.
|
| |
|
|
| |
* Fix mirror behavior for ray-marched chatbox
|
| | |
|
| |
|
|
|
|
|
|
| |
* Refactor shader code to make development easier. Templates are now
as small as possible.
* Update scaling code. Use Unity scaling instead of a blendshape.
* Check in a fuckton of shader FOSS. Mostly unused.
* Update TaSTT.fbx. Now has 6 faces instead of 2.
|
| |
|
|
|
|
| |
GUI was not correctly managing .meta files, causing two textures to use
the same GUID. Unity would notice and regenerate GUIDs, breaking the
custom chatbox material's texture references.
|
| |
|
|
|
|
|
|
| |
Transcription thread now blocks until microphone thread deletes samples
as requested.
(This is a hacky design; it should use a work queue or something, but I
don't feel like doing that right now.)
|
| |
|
|
|
|
|
|
|
|
| |
It's possible that the user has toggled off transcription while the
algorithm is still working. In this case we should *not* begin
exponential backoff since there's still work to do.
Also:
* Shorten the hot-path sleep from 50ms to 5ms.
* Remove unused variable in SleepInterruptible
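A minimal sketch of the sleep policy described above, under stated assumptions: the function name `next_sleep`, the `work_pending` flag, and the 1-second cap are hypothetical and are not taken from the repository. Only the 5 ms hot-path sleep comes from this commit.

```python
HOT_PATH_SLEEP_S = 0.005  # 5 ms hot-path sleep, per this commit
MAX_SLEEP_S = 1.0         # hypothetical upper bound on the backoff


def next_sleep(paused: bool, work_pending: bool, current_sleep_s: float) -> float:
    """Return how long the transcription loop should sleep next.

    Backoff only starts once transcription is toggled off AND there is no
    audio left to process. Toggling off with work still pending keeps the
    loop on the hot path.
    """
    if not paused or work_pending:
        return HOT_PATH_SLEEP_S
    # Idle: double the previous sleep, up to the cap.
    return min(max(current_sleep_s, HOT_PATH_SLEEP_S) * 2, MAX_SLEEP_S)
```

The point of the sketch is the first branch: "paused" alone is not enough to begin backing off.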
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add two buttons: start auto re-generation of Unity assets, and stop.
These start/stop a thread which periodically (every 3 seconds) hashes
the user-provided animator, menu and parameters. When any one of these
changes, the thread invokes the function that generates Unity assets.
The hash is non-cryptographic, so it's light. The only hit is that we
have to read the entire file contents every few seconds, and compute a
sum across that entire memory region. This is extremely light unless
you're on a spinning platter hard drive with a small cache.
Still seeing the bug where the material drops ref to the font bitmaps.
Probably need to update the .mat using the guids in the bitmap .meta
files.
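The polling scheme described in this commit could be sketched roughly as follows. This is an illustration, not the app's code: `hash_files` and `AssetWatcher` are hypothetical names, and Adler-32 from `zlib` stands in for whichever non-cryptographic hash the app actually uses.

```python
import threading
import zlib
from pathlib import Path
from typing import Callable, Iterable, Optional


def hash_files(paths: Iterable[str]) -> int:
    """Non-cryptographic checksum over the full contents of every file."""
    digest = 0
    for path in paths:
        # Fold the path in too, so content moving between files is noticed.
        digest = zlib.adler32(path.encode("utf-8"), digest)
        digest = zlib.adler32(Path(path).read_bytes(), digest)
    return digest


class AssetWatcher:
    """Polls the watched files and calls `regenerate` when their hash changes."""

    def __init__(self, paths: Iterable[str], regenerate: Callable[[], None],
                 period_s: float = 3.0) -> None:
        self.paths = list(paths)
        self.regenerate = regenerate
        self.period_s = period_s
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self._last_hash = hash_files(self.paths)

    def poll_once(self) -> bool:
        """Re-hash the files; regenerate and return True if anything changed."""
        current = hash_files(self.paths)
        if current == self._last_hash:
            return False
        self._last_hash = current
        self.regenerate()
        return True

    def _run(self) -> None:
        # Event.wait doubles as an interruptible sleep: it returns True as
        # soon as stop() is called, which ends the loop.
        while not self._stop.wait(self.period_s):
            self.poll_once()

    def start(self) -> None:
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

The "start" and "stop" buttons would map onto `start()` and `stop()`. Every poll reads each file in full, which is the cost noted above.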
|
| |
|
|
|
| |
Avoid deleting bitmap .meta files so that once the user sets up their
shader, it doesn't break.
|
| |
|
|
| |
Useful for projects with multiple avatars with different animators.
|
| |
|
|
|
|
| |
The paths you enter in the Unity panel (animator, menu, params, and
assets folder) are saved in the app config, but were not populated
correctly on app restart or pane redraw. Now they are.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we commit a transcription, we drop the corresponding audio data.
Audio data is represented as a list of chunks. Each chunk contains a few
hundred samples of audio data, representing O(10ms) of audio.
If we want to drop a few seconds of data, this means simply deleting
many chunks of audio. There's usually a chunk where we want to drop some
portion of audio data.
Instead of slicing away that part of the chunk, which would change its
length, this change zeroes it out. This preserves the assumption that
each chunk has the same temporal length.
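A sketch of the zero-fill approach, assuming each chunk is a `bytes` object of 16-bit samples. The helper name `drop_leading_samples` is hypothetical, as is the choice to drop from the front of the buffer.

```python
from typing import List

BYTES_PER_SAMPLE = 2  # assumes 16-bit PCM samples


def drop_leading_samples(chunks: List[bytes], n_samples: int) -> List[bytes]:
    """Drop `n_samples` from the front of `chunks` without resizing any chunk.

    Chunks that fall entirely inside the dropped region are deleted. The
    chunk straddling the boundary keeps its length; the dropped prefix is
    overwritten with zeros (silence).
    """
    remaining = n_samples * BYTES_PER_SAMPLE
    kept = []
    for chunk in chunks:
        if remaining >= len(chunk):
            remaining -= len(chunk)  # the whole chunk is dropped
        elif remaining > 0:
            kept.append(bytes(remaining) + chunk[remaining:])
            remaining = 0
        else:
            kept.append(chunk)
    return kept
```

Because the straddling chunk is zeroed instead of sliced, every surviving chunk still covers the same amount of time.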
|
| |
|
|
|
|
| |
We used to drop entire frames only, leading to situations where more
audio is dropped than desired. Now we drop frames down to the precision
of the individual audio sample requested.
|
| |
|
|
|
|
|
|
|
| |
Mostly updating roadmap stuff. Non-VRC use cases are "complete" since I
was mostly targeting streaming. The ability to type into arbitrary text
fields is still somewhat nascent & could be improved.
Also update some other random stuff to be more up to date. KillFrenzy
Avatar Text is now MIT, pog!
|
| |
|
|
|
|
|
|
| |
Common hallucinations sneak in around -0.9 avg_logprob.
Also:
* Limit temperatures to just 0.0. Multiple values cause latency to
occasionally spike.
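A sketch of filtering on `avg_logprob`, assuming segments are dicts carrying an `avg_logprob` key, as in the result of openai-whisper's `transcribe()`. The helper name and the exact comparison are assumptions; only the -0.9 figure comes from this commit.

```python
from typing import Dict, List

MIN_AVG_LOGPROB = -0.9  # hallucinations show up around this value


def filter_segments(segments: List[Dict],
                    min_avg_logprob: float = MIN_AVG_LOGPROB) -> List[Dict]:
    """Keep only segments whose average log-probability beats the threshold."""
    return [s for s in segments if s["avg_logprob"] > min_avg_logprob]
```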
|
| |
|
|
|
|
|
| |
Surprisingly, these args do not cause transcribe() to omit those
segments from the result, so we have to manually filter them out.
Hallucinated phrases generally have one or both of these params set
high.
|
| |
|
|
| |
Each sample of audio data is a 16-bit int, not an 8-bit int.
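To illustrate the difference, here is a small sketch that decodes a raw chunk as signed 16-bit samples. It assumes little-endian PCM, and the helper name `decode_pcm16` is hypothetical.

```python
import array
import sys


def decode_pcm16(chunk: bytes) -> array.array:
    """Interpret raw bytes as signed 16-bit little-endian samples."""
    samples = array.array("h")  # "h" is a signed 16-bit integer
    samples.frombytes(chunk)
    if sys.byteorder == "big":
        samples.byteswap()  # the input is assumed to be little-endian
    return samples
```

Four bytes decode to two samples, not four, so every length and duration computed from the raw buffer changes by a factor of two.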
|
| |
|
|
|
| |
Each chunk of audio samples should be encoded as a binary string, not as
a list.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
New commit logic would reduce buffer to a size smaller than this,
causing it to hallucinate things like:
* "See you next time!"
* "Thanks for watching!"
* "Bye!"
The hope is that by keeping the buffer at least 5.0 seconds long, as
described in the paper, this will cut down on these events.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Create a simple server with 3 endpoints:
* /create_session: Create a session and return its identifier.
* /set_transcript: Update a session's transcript.
* /get_transcript: Fetch a session's transcript.
Right now the session ID provides authentication *and* authorization.
There is no public/private ID so you have to trust whoever you share
your ID with.
IDs are long and generated by the server, so it should be somewhat
secure against low-effort hacking.
Other updates:
* Drop whisper_requirements.txt - no longer needed.
* Vendor curl to make it easier to interact with the server.
TODO:
* Fuzz test the server.
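The session model behind the three endpoints can be sketched as an in-memory store. This is an assumption-laden illustration: `SessionStore` is a hypothetical name, the HTTP layer is omitted, and `secrets.token_urlsafe` stands in for however the server actually generates IDs.

```python
import secrets
from typing import Dict, Optional


class SessionStore:
    """In-memory transcripts keyed by an unguessable session ID."""

    def __init__(self) -> None:
        self._transcripts: Dict[str, str] = {}

    def create_session(self) -> str:
        # The ID is the only credential: whoever holds it can read and write.
        session_id = secrets.token_urlsafe(32)
        self._transcripts[session_id] = ""
        return session_id

    def set_transcript(self, session_id: str, transcript: str) -> bool:
        if session_id not in self._transcripts:
            return False  # unknown session: reject the update
        self._transcripts[session_id] = transcript
        return True

    def get_transcript(self, session_id: str) -> Optional[str]:
        return self._transcripts.get(session_id)
```

Since a single ID grants both read and write access, sharing it with a viewer also lets that viewer overwrite the transcript.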
|
| |
|
|
| |
Forgot to check this in, oops!
|
| |
|
|
|
|
|
|
| |
Circle goes red when speaking, grey when done. Ideally it would be in
the top right portion of the browser source, but this is a good start.
Also, hard-cap transcripts to 4096 chars. This prevents the STT from
lagging during long sessions.
|
| |
|
|
| |
... also print out "Ready!" when the STT is done loading.
|
| |\
| |
| | |
Set GPU device index in whisper model
|
| |/ |
|
| |
|
|
|
|
|
|
|
|
|
| |
onAudioFramesAvailable would bail out if audio_state.audio_paused is
set, preventing frames from being dropped. This would cause
transcriptions to get repeated sometimes.
Now the frame-dropping code always runs.
Also adjust the code structure of the keyboard/VR input handlers to be
more similar.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Audio data is stored in chunks of frames, not in individual frames.
When I commit a transcript, I want to get rid of the portion of the
audio data responsible for that particular transcript. I have code that
does this, but it was dropping a slice of the list assuming that each
sample is stored individually.
Extra fun: Because we have to decimate mic frames, we have to convert
between whisper frames and mic frames to drop the correct amount of
audio data.
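A sketch of the conversion, assuming Whisper consumes 16 kHz audio and the microphone captures at 48 kHz. The mic rate, the chunk size, and both helper names are hypothetical.

```python
from typing import Tuple

WHISPER_RATE_HZ = 16_000  # Whisper consumes 16 kHz audio
MIC_RATE_HZ = 48_000      # assumed capture rate, for illustration only


def whisper_to_mic_frames(n_whisper_frames: int,
                          mic_rate_hz: int = MIC_RATE_HZ) -> int:
    """Convert a count of decimated (Whisper) frames to raw mic frames."""
    return n_whisper_frames * mic_rate_hz // WHISPER_RATE_HZ


def split_into_chunks(n_mic_frames: int, frames_per_chunk: int) -> Tuple[int, int]:
    """Return (whole chunks to drop, leftover frames in the next chunk)."""
    return divmod(n_mic_frames, frames_per_chunk)
```

For example, dropping one second of Whisper audio (16,000 frames) means dropping 48,000 mic frames, which then has to be expressed in whole chunks plus a partial chunk.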
|
| |
|
|
|
|
|
| |
Add toggle to UI to enable a profanity filter. It replaces vowels in bad
words with asterisks.
Bugfix: filters now apply to OBS
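A minimal sketch of a vowel-masking filter. The function name `censor` and the caller-supplied word list are assumptions; the app's real word list and matching rules are not shown in this log.

```python
import re
from typing import Iterable


def censor(text: str, bad_words: Iterable[str]) -> str:
    """Replace the vowels of each listed word with asterisks."""
    words = [re.escape(w) for w in bad_words]
    if not words:
        return text
    # Whole-word, case-insensitive match on any listed word.
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    return pattern.sub(
        lambda m: re.sub(r"[aeiouAEIOU]", "*", m.group(0)), text
    )
```

Matching on word boundaries avoids masking innocent words that merely contain a listed word.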
|
| |
|
|
|
|
|
|
| |
Most transcription output is now gone by default. Users can enable a
more verbose output by toggling `Enable debug mode`.
Bugfix: Toggling off transcription would reset audio state, frequently
resulting in the loss of the last few words spoken.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Recap: In the STT there's an algorithm that tries to determine when a
transcript is "stable" enough to commit. If that is too loose, then
accuracy suffers; if too strict, then the audio buffer eventually fills.
To mitigate the problem, I check whether the last N transcripts are
within some edit distance (Levenshtein edit distance) of each other. The
fuzzy matching lets us forgive small instabilities, like differences in
uppercase/lowercase or punctuation, while rejecting large instabilities.
The default value of 8 seems to be in the sweet spot of accuracy &
performance, but it will likely be tuned in the future.
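The stability check could look something like the sketch below. Several details are assumptions: the window of 3 transcripts, reading the default of 8 as the maximum edit distance, and comparing each transcript against the oldest one in the window instead of checking every pair.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def is_stable(transcripts: Sequence[str], window: int = 3,
              max_distance: int = 8) -> bool:
    """True if the last `window` transcripts stay within `max_distance` edits."""
    if len(transcripts) < window:
        return False
    recent = transcripts[-window:]
    return all(levenshtein(recent[0], t) <= max_distance for t in recent[1:])
```

With this check, "Hello world" and "hello, world" count as the same utterance, while a complete rewrite of the sentence does not.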
|
| |
|
|
|
|
|
|
| |
... instead of simple equality.
TODO: add UI for threshold.
Bugfix: Frame::onAppStop() joins the OBS app thread.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This is useful when streaming. Occasionally the STT can get into
a bad state, and manually segmenting clears it up. However, doing so
would clear your accumulated transcript, which isn't always desired. Add
ability to preserve the transcript.
A small wrinkle: the new commit logic requires N consecutive identical
windows before committing. To make this feature play nicely with it, I
had to forcibly commit any preview text that hasn't yet been committed.
Failing to do this would usually cause short utterances / the most
recently said stuff to get wiped out.
|
| |
|
|
|
|
| |
Should improve legibility.
* Update README
|
| |
|
|
| |
Seems to help reduce impact on time-sensitive apps like OBS.
|
| |
|
|
| |
No longer used.
|
| |
|
|
| |
Add ability to toggle on/off browser src & configure port.
|
| |
|
|
|
| |
Hitting the desktop keybinding to stop transcription would sometimes
cause the last transcript to repeat itself.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Transcription output now streams to localhost:8097.
In OBS:
* Create a browser source.
* url: localhost:8097
* width: 2200
* height: 400
TODO:
* Put behind toggle.
* Create input field for port.
Misc cleanup:
* transcribe.py: Drop frames from audio capture thread instead of the
transcription thread. Doing it the other way would result in
occasional data loss.
|
| |
|
|
|
|
|
|
| |
No longer needed with new commit logic (8d0add86f66db532). Assign it to
5 minutes.
Assuming 4 bytes per sample @ 16 kHz, this buffer maxes out at 19.2
megabytes of memory usage.
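The arithmetic can be checked directly. The helper name below is hypothetical; the inputs are the ones stated in this commit.

```python
def max_buffer_bytes(seconds: float, sample_rate_hz: int = 16_000,
                     bytes_per_sample: int = 4) -> int:
    """Upper bound on buffer size: duration x sample rate x sample width."""
    return int(seconds * sample_rate_hz * bytes_per_sample)


five_minutes = 5 * 60  # 300 seconds
print(max_buffer_bytes(five_minutes) / 1e6)  # 19.2 (megabytes)
```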
|
| |
|
|
|
|
|
| |
This was slowing down app startup to an unacceptable degree. Now it just
runs once ever.
Add a button to the debug panel to manually re-setup venv if needed.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
At the core of the STT, there's a loop which uses Whisper to convert
audio into a transcript. As you say something, Whisper sees growing
fragments of your sentence:
t0: "Hell"
t1: "Hello"
t2: "Hello, world!"
So we need some algorithm which takes these fragments and
accumulates them into an ever-growing transcript.
Previously I did this with fuzzy string matching. I'd find the region
where the two transcripts overlap and edit the two together to produce a
longer transcript. The big problem is that if there's no overlap, it's
not clear whether Whisper radically changed its mind as to what was
said, or whether the user paused for a long time before saying
something new. So I'd have to reset the growing transcript.
Now I get the timestamps from Whisper and wait for it to give me the
same 3 transcripts for the last utterance. Once the transcript
stabilizes like this, I commit the text. This enables a temporally
stable, ever-growing transcript that's also quite accurate.
To prevent a latency regression, I also introduce the notion of "preview
text", which is a preview of an utterance that has not yet stabilized.
These previews do not contribute to the ever-growing transcript, but do
get fed through the rest of the app, so they show up in-game / in OBS.
Once they eventually stabilize, they get committed to the ever-growing
transcript.
This change is lightly tested!
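A simplified sketch of the commit-and-preview idea. It leaves out timestamps and the audio buffer entirely, and the class and method names are hypothetical. It only shows how an utterance moves from preview to committed text after 3 identical results.

```python
from collections import deque
from typing import Deque


class TranscriptCommitter:
    """Commits an utterance once Whisper repeats it `required_repeats` times."""

    def __init__(self, required_repeats: int = 3) -> None:
        self.committed = ""  # the ever-growing transcript
        self.preview = ""    # latest unstable text for the current utterance
        self._recent: Deque[str] = deque(maxlen=required_repeats)

    def observe(self, utterance: str) -> bool:
        """Feed the latest text for the uncommitted audio; True on commit."""
        self.preview = utterance
        self._recent.append(utterance)
        stable = (
            len(self._recent) == self._recent.maxlen
            and len(set(self._recent)) == 1
            and utterance != ""
        )
        if not stable:
            return False
        self.committed = (self.committed + " " + utterance).strip()
        self.preview = ""
        self._recent.clear()
        return True

    @property
    def display_text(self) -> str:
        """What the rest of the app shows: committed text plus the preview."""
        return (self.committed + " " + self.preview).strip()
```

Feeding it "Hell", "Hello", and then "Hello, world!" three times commits only on the last call, while `display_text` shows each preview in the meantime.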
|
| |
|
|
|
|
|
|
|
| |
pyopenvr is deprecated and is causing a user issue
(https://github.com/yum-food/TaSTT/issues/2).
That user was kind enough to experiment with different configs and
didn't find a simple fix. So let's close this tech debt issue the right
way.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
NLLB needs its input to be split up into sentences. I use the
sentence_splitter Python package to do this. It supports ~20 Western
European languages, but notably, no Asian languages.
* Sort spoken language list. English is still at the top.
* Remove 'Translation source' dropdown. Infer this from the spoken
language.
* Add lang_compat.py to map language codes between the various libraries
(whisper, nllb, sentence_splitter).
* Fix bug where old text would appear in textbox when you first bring it
up.
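A sketch of what a language-code mapping in the spirit of `lang_compat.py` might look like. The table layout and function names are hypothetical, and only a handful of languages are shown. The NLLB codes follow the FLORES-200 style (for example `eng_Latn`).

```python
from typing import Dict, Optional

# Whisper code -> (NLLB code, sentence_splitter code or None if unsupported).
_LANGS: Dict[str, tuple] = {
    "en": ("eng_Latn", "en"),
    "fr": ("fra_Latn", "fr"),
    "de": ("deu_Latn", "de"),
    "es": ("spa_Latn", "es"),
    "ja": ("jpn_Jpan", None),  # no sentence splitter available
}


def whisper_to_nllb(code: str) -> Optional[str]:
    """NLLB language code for a Whisper language code, if known."""
    entry = _LANGS.get(code)
    return entry[0] if entry else None


def whisper_to_splitter(code: str) -> Optional[str]:
    """sentence_splitter language code, or None if splitting is unsupported."""
    entry = _LANGS.get(code)
    return entry[1] if entry else None
```

Inferring the translation source from the spoken language then reduces to one lookup: `whisper_to_nllb(spoken_language)`.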
|
| |
|
|
|
|
|
|
|
| |
Use Meta's No Language Left Behind (NLLB) algorithm to provide
translation capabilities into 200 languages. Obviously most are very
untested.
This requires either 4.1 or 7.1 GB of RAM and significantly increases
transcription latency.
|
| |
|
|
|
|
|
|
|
|
| |
Add 3 filters:
* Remove trailing period
* Convert to uppercase
* Convert to lowercase
All may be composed. Upper/lower just overwrite each other so just use
one.
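A sketch of how such filters compose, with hypothetical function names. Filters are applied in order, so with both case filters enabled the last one wins.

```python
from functools import reduce
from typing import Callable, Iterable

Filter = Callable[[str], str]


def remove_trailing_period(text: str) -> str:
    """Strip a single trailing period, if present."""
    return text[:-1] if text.endswith(".") else text


def apply_filters(text: str, filters: Iterable[Filter]) -> str:
    """Run `text` through each filter in order."""
    return reduce(lambda acc, f: f(acc), filters, text)


print(apply_filters("Hello world.", [remove_trailing_period, str.upper]))
# HELLO WORLD
```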
|
| |
|
|
|
|
|
|
|
|
| |
I forgor to put them into ApplyConfigToInputFields.
The reason this is necessary: we need to create the text field where we
log things before we can deserialize the config. To keep the code
structure "clean" I just wrote another function to apply the config
(ApplyConfigToInputFields). However I have to remember to update it when
I add new fields.
|
| |
|
|
|
| |
UI now has a checkbox for the uwu filter. Does not materially affect
resource usage or latency when enabled.
|
| |
|
|
|
|
| |
Use UwwwuPP to translate your boring old speech into an uwu-ified version.
Still need to add a UI toggle for this.
|
| |
|
|
|
|
|
|
|
|
| |
To use it, do a medium hold + long hold. Keep the long hold depressed
until you're done speaking. The transcription will be typed into the
currently selected input field.
* Add more audio feedback
* Make audio feedback play asynchronously so it doesn't slow down the
controller input state machine as much.
|