| Commit message | Author | Age |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This logic is highly IO bound *and* latency critical so it makes sense to put
it into its own thread.
Also:
* Collector::drop* methods return the dropped audio. Committer includes
that audio in commits. Transcription thread holds onto it. When the
user segments their speech with a button press, the transcription
thread sends the entire combined audio of all commits over to Whisper
to be transcribed. This allows us to recover from errors introduced
by segmentation.
* Remove unused animator params
* Fix issue where clearing the board doesn't completely reset STT state
TODO:
* Coalescing does not occur for in-place updates. It should.
|
| |
|
|
|
|
|
|
| |
Also:
* Enable SO_REUSEADDR on browser src socket
* Temporarily add evaluation dependencies to requirements.txt
* Fix browser src. It's now looking for a prefix that the python app
actually uses.
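A minimal sketch of enabling SO_REUSEADDR on a listening socket, assuming a plain TCP server socket. The function name, host, and port handling are illustrative and not taken from the browser source code.

```python
import socket


def make_listening_socket(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Create a TCP listening socket with SO_REUSEADDR enabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the option before bind() so that restarting the app shortly after
    # a shutdown is less likely to fail with "address already in use".
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock
```

Passing port 0 asks the OS for any free port, which is handy when trying this out.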
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Four threads:
* Main thread
* Transcription (mic -> collector -> whisper -> committer -> pager)
* VR input
* Keyboard input
Also:
* add OscPager class to encapsulate all OSC interactions.
* bump `last_n_must_match` from 2 to 3 to reduce hallucinations
|
| |
|
|
| |
This has a slight positive effect on my benchmark.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Try adding two filters on top of the usual AudioCollector:
* Minimum length preservation: never report fewer than N seconds worth
of audio data. Pad with silence as needed.
* Volume normalizing: normalize audio volume.
Using my benchmark of 30-second audio clips from 3 speakers (lower is
better):
length enf + norm = 87.118
nothing = 90.917
norm = 94.538
length = 111.402
Both together are a slight improvement, but each on its own degrades the
result by a lot. I also observed more hallucinations in a conversational
pattern when using them vs. not. So I'll phase them out.
I'm still curious about *compression* as opposed to normalization.
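For reference, the two filters can be sketched roughly as below. This assumes float samples in a plain list, padding at the end of the clip, and peak normalization; the commit does not say which normalization was used or where the silence goes, and both function names are made up for illustration.

```python
def pad_to_min_length(samples, sample_rate, min_seconds):
    """Append silence (zeros) so the clip is at least min_seconds long."""
    min_samples = int(sample_rate * min_seconds)
    missing = max(0, min_samples - len(samples))
    return list(samples) + [0.0] * missing


def peak_normalize(samples, target_peak=1.0):
    """Scale the clip so its loudest sample has magnitude target_peak.

    Peak normalization is an assumption; RMS normalization would be an
    equally plausible reading of "normalize audio volume".
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```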
|
| |
|
|
|
|
|
|
|
|
| |
A set of proper interfaces is called for. See #dev-update-spam in
discord for drawing of design.
Also add code to mechanically optimize committer parameters using an
audio file. Not perfectly repeatable since it depends on the performance
characteristics of the machine, but prob better than what we had before
(nothing).
|
| | |
|
| |
|
|
| |
Oops, I meant to check these in earlier!
|
| |
|
|
|
|
|
|
|
|
| |
Also:
* Fully scrub AudioSource references from prefab when not using
phonemes.
* Disable net sync on phoneme params when not using them. When not
synced, they don't count against the total memory limit.
* Use config file in generate_params.py
|
| |
|
|
| |
If not set, the prefab will have its audio sources removed.
|
| | |
|
| |
|
|
|
| |
Duplicating config between args and config is a huge pain in the ass to
maintain. Now we just launch using the config generated by the UI. ezpz.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
* Temporarily restore normal process priority. Working on adding a UI
option to set STT prio.
* Give audio indicator phonemes a 1/3 chance to do nothing. Makes result
sound a little better imo.
* Quiet down steamVR thread when steamVR isn't running
* Fix use of `button_id` and `hand_id` in steamvr.py
* Increase amount of silence allowed before transcript from 1 to 5
seconds. You want enough buffer to allow for a few full transcripts,
else you risk spuriously dropping audio.
* Enable background loading in audio metadata (required by vrc sdk)
|
| |
|
|
|
|
|
|
| |
This is now dynamically set inside transcribe.py.
As the buffer grows long, the threshold grows exponentially, keeping the
buffer short. The threshold starts small so that transcription starts
strict (accurate, slow) and gets looser (inaccurate, fast) as needed.
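A rough sketch of a threshold that grows exponentially with buffer length. The base value, doubling period, and cap are hypothetical numbers chosen for illustration, not the values used in transcribe.py.

```python
import math


def dynamic_threshold(buffer_seconds, base=1.0, doubling_seconds=5.0, cap=64.0):
    """Return a match threshold that doubles every doubling_seconds of buffered audio.

    All parameters are illustrative assumptions. The exponent is clamped so
    the result never exceeds cap and never overflows for very long buffers.
    """
    exponent = max(0.0, buffer_seconds) / doubling_seconds
    max_exponent = math.log2(cap / base)
    return base * 2.0 ** min(exponent, max_exponent)
```

With these example numbers, an empty buffer gives a threshold of 1.0 (strict) and a 10-second buffer gives 4.0 (looser).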
|
| |
|
|
|
| |
We now play arpeggiated *chords* of vowels instead of one, allowing for
a denser audio feedback mechanism.
|
| |
|
|
|
|
|
| |
Also fix prefab default size (no longer colossal).
TODO:
* Add runtime & unity-time toggles
|
| |
|
|
|
| |
openxr doesn't have any notion of background process, making it unusable
trash :)
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I think this improves the code structure of the controller input thread and
leads to some deduplication, so I'm going to keep it. However, the
intended purpose was to decrease lag when pressing buttons, and in that
regard it failed.
The lag goes all the way down to the input layer, implying that the
input thread is not able to consistently run at its intended 100 Hz
sample rate. I suspect that the Python global interpreter lock (GIL) is
at fault.
Since we can't realistically move all our functionality into one thread
in a non-blocking model, I think multiprocessing is the logical choice
going forward. Each thread in transcribe.py would become its own
process, and the processes would communicate via pub/sub through an
intermediary process sitting in the middle.
|
| |
|
|
| |
pyopenvr is both deprecated and buggy, so switch to pyopenxr.
|
| |
|
|
| |
Text box now shows an animated ellipsis prior to first speech.
|
| |
|
|
|
| |
Deprecate the visual and auditory speech indicators, saving 4 bits
across the board. Fixed overhead is now 21 bits.
|
| |
|
|
|
| |
No UVs for raymarched geometry yet, so drop textures. Also drop most
old shader settings.
|
| |
|
|
| |
Specify file encoding when generating shaders.
|
| |
|
|
| |
* Fix mirror behavior for ray-marched chatbox
|
| | |
|
| |
|
|
|
|
|
|
| |
* Refactor shader code to make development easier. Templates are now
as small as possible.
* Update scaling code. Use Unity scaling instead of a blendshape.
* Check in a fuckton of shader FOSS. Mostly unused.
* Update TaSTT.fbx. Now has 6 faces instead of 2.
|
| |
|
|
|
|
|
|
| |
Transcription thread now blocks until microphone thread deletes samples
as requested.
(This is hacky design, it should use a work queue or something, but I
don't feel like doing that right now)
|
| |
|
|
|
|
|
|
|
|
| |
It's possible that the user has toggled off transcription while the
algorithm is still working. In this case we should *not* begin
exponential backoff since there's still work to do.
Also:
* Shorten the hot-path sleep from 50ms to 5ms.
* Remove unused variable in SleepInterruptible
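The backoff rule described above could look something like this sketch. The 5 ms floor comes from this commit; the doubling factor, the 0.5 s ceiling, and the function name are assumptions.

```python
def next_sleep(current, had_work, min_sleep=0.005, max_sleep=0.5):
    """Pick the next sleep interval for the polling loop (hypothetical helper).

    While there is still work to do, stay on the short hot-path sleep, even
    if the user has toggled transcription off. Only back off when idle.
    """
    if had_work:
        return min_sleep
    return min(current * 2, max_sleep)
```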
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we commit a transcription, we drop the corresponding audio data.
Audio data is represented as a list of chunks. Each chunk contains a few
hundred samples of audio data, representing O(10ms) of audio.
If we want to drop a few seconds of data, this mostly means deleting
many whole chunks of audio. There's usually one chunk at the boundary
where we want to drop only a portion of the audio data.
Instead of slicing away that part of the chunk, which would change its
length, this change zeroes it out. This preserves the assumption that
each chunk has the same temporal length.
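A simplified sketch of the idea, assuming chunks are equal-length lists of integer samples and that audio is dropped from the front of the buffer. The real code works on the mic buffer and is not reproduced here; the function name is hypothetical.

```python
def drop_leading_samples(chunks, n_samples):
    """Drop n_samples from the front of a list of equal-length chunks.

    Whole chunks are deleted. The boundary chunk keeps its length: the
    dropped portion is overwritten with zeros instead of being sliced away.
    """
    if not chunks or n_samples <= 0:
        return chunks
    chunk_len = len(chunks[0])
    whole, partial = divmod(n_samples, chunk_len)
    remaining = chunks[whole:]
    if partial and remaining:
        boundary = remaining[0]
        remaining[0] = [0] * partial + boundary[partial:]
    return remaining
```

For example, dropping 6 samples from three 4-sample chunks removes the first chunk and zeroes the first 2 samples of the second.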
|
| |
|
|
|
|
| |
We used to drop entire frames only, leading to situations where more
audio is dropped than desired. Now we drop frames down to the precision
of the individual audio sample requested.
|
| |
|
|
|
|
|
|
|
| |
Mostly updating roadmap stuff. Non-VRC use cases are "complete" since I
was mostly targeting streaming. The ability to type into arbitrary text
fields is still somewhat nascent & could be improved.
Also update some other random stuff to be more up to date. KillFrenzy
Avatar Text is now MIT, pog!
|
| |
|
|
|
|
|
|
| |
Common hallucinations sneak in around -0.9 avg_logprob.
Also:
* Limit temperatures to just 0.0. Multiple values cause latency to
occasionally spike.
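A minimal sketch of filtering on avg_logprob, assuming segments are dicts shaped like the `segments` entries in openai-whisper's `transcribe()` result. The -0.9 cutoff is the value mentioned above; the function name is hypothetical.

```python
def filter_segments(segments, min_avg_logprob=-0.9):
    """Keep only segments whose avg_logprob is at or above the cutoff."""
    return [s for s in segments if s["avg_logprob"] >= min_avg_logprob]
```

Segments are passed in as plain data, so the filter can be tried without loading a model.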
|
| |
|
|
|
|
|
| |
Surprisingly, these args do not cause transcribe() to omit those
segments from the result, so we have to manually filter them out.
Hallucinated phrases generally have one or both of these params set
high.
|
| |
|
|
| |
Each sample of audio data is a 16-bit int, not an 8-bit int.
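To illustrate the difference, here is a small sketch that decodes raw PCM bytes as signed 16-bit samples. Little-endian byte order is an assumption, and the helper name is made up.

```python
import struct


def decode_int16(raw: bytes):
    """Interpret raw little-endian PCM bytes as signed 16-bit samples."""
    if len(raw) % 2:
        raise ValueError("16-bit PCM data must have an even number of bytes")
    count = len(raw) // 2
    # Two bytes per sample: reading the same buffer as 8-bit ints would
    # yield twice as many "samples" with meaningless values.
    return list(struct.unpack(f"<{count}h", raw))
```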
|
| |
|
|
|
| |
Each chunk of audio samples should be encoded as a binary string, not as
a list.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
New commit logic would reduce buffer to a size smaller than this,
causing it to hallucinate things like:
* "See you next time!"
* "Thanks for watching!"
* "Bye!"
The hope is that by keeping the buffer at least 5.0 seconds long, as
described in the paper, this will cut down on these events.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Create a simple server with 3 endpoints:
* /create_session: Create a session and return its identifier.
* /set_transcript: Update a session's transcript.
* /get_transcript: Fetch a session's transcript.
Right now the session ID provides authentication *and* authorization.
There is no public/private ID so you have to trust whoever you share
your ID with.
IDs are long and generated by the server, so it should be somewhat
secure against low-effort hacking.
Other updates:
* Drop whisper_requirements.txt - no longer needed.
* Vendor curl to make it easier to interact with the server.
TODO:
* Fuzz test the server.
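The session model behind the three endpoints can be sketched as an in-memory store. This is an illustration only: the class name is hypothetical, HTTP handling is omitted, and using `secrets.token_urlsafe` for ID generation is an assumption about how "long, server-generated" IDs might be produced.

```python
import secrets


class SessionStore:
    """In-memory sketch of create_session / set_transcript / get_transcript."""

    def __init__(self):
        self._transcripts = {}

    def create_session(self) -> str:
        # The ID is the only credential, so make it long and unguessable.
        session_id = secrets.token_urlsafe(32)
        self._transcripts[session_id] = ""
        return session_id

    def set_transcript(self, session_id: str, transcript: str) -> None:
        if session_id not in self._transcripts:
            raise KeyError("unknown session")
        self._transcripts[session_id] = transcript

    def get_transcript(self, session_id: str) -> str:
        return self._transcripts[session_id]
```

Anyone holding the ID can both read and write the transcript, which matches the trust model described above.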
|
| |
|
|
| |
Forgot to check this in, oops!
|
| |
|
|
|
|
|
|
| |
Circle goes red when speaking, grey when done. Ideally it would be in
the top right portion of the browser source, but this is a good start.
Also, hard-cap transcripts to 4096 chars. This prevents the STT from
lagging during long sessions.
|
| |
|
|
| |
... also print out "Ready!" when the STT is done loading.
|
| | |
|
| |
|
|
|
|
|
|
|
|
|
| |
onAudioFramesAvailable would bail out if audio_state.audio_paused is
set, preventing frames from being dropped. This would cause
transcriptions to get repeated sometimes.
Now the frame-dropping code always runs.
Also adjust the code structure of the keyboard/VR input handlers to be
more similar.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Audio data is stored in chunks of frames, not in individual frames.
When I commit a transcript, I want to get rid of the portion of the
audio data responsible for that particular transcript. I have code that
does this, but it was dropping a slice of the list assuming that each
sample is stored individually.
Extra fun: Because we have to decimate mic frames, we have to convert
between whisper frames and mic frames to drop the correct amount of
audio data.
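The conversion works out roughly as in this sketch. Whisper consumes 16 kHz audio; the mic rate and chunk size in the usage note are example values, and both function names are hypothetical.

```python
WHISPER_RATE = 16000  # Whisper models consume 16 kHz audio


def whisper_to_mic_frames(whisper_frames, mic_rate, whisper_rate=WHISPER_RATE):
    """Convert a frame count at Whisper's rate to the mic's native rate."""
    return whisper_frames * mic_rate // whisper_rate


def chunks_to_drop(whisper_frames, mic_rate, frames_per_chunk):
    """Number of whole mic chunks covered by whisper_frames of audio."""
    mic_frames = whisper_to_mic_frames(whisper_frames, mic_rate)
    return mic_frames // frames_per_chunk
```

For example, with a 48 kHz mic and 1024-frame chunks, one second of Whisper audio (16000 frames) corresponds to 48000 mic frames, which covers 46 whole chunks.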
|
| |
|
|
|
|
|
| |
Add toggle to UI to enable a profanity filter. It replaces vowels in bad
words with asterisks.
Bugfix: filters now apply to OBS
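A small sketch of the vowel-masking idea, assuming whole-word, case-insensitive matching against a caller-supplied word list. The function name and matching rules are assumptions, not the app's actual filter.

```python
import re


def censor(text, bad_words):
    """Replace vowels with asterisks inside any listed word."""
    if not bad_words:
        return text

    def mask(match):
        return re.sub(r"[aeiouAEIOU]", "*", match.group(0))

    alternatives = "|".join(re.escape(w) for w in bad_words)
    # \b keeps the filter from touching listed words embedded in longer words.
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(mask, text)
```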
|
| |
|
|
|
|
|
|
| |
Most transcription output is now gone by default. Users can enable a
more verbose output by toggling `Enable debug mode`.
Bugfix: Toggling off transcription would reset audio state, frequently
resulting in the loss of the last few words spoken.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Recap: In the STT there's an algorithm that tries to determine when a
transcript is "stable" enough to commit. If that is too loose, then
accuracy suffers; if too strict, then the audio buffer eventually fills.
To mitigate the problem, I check whether the last N transcripts are
within some edit distance (Levenshtein edit distance) of each other. The
fuzzy matching lets us forgive small instabilities, like differences in
uppercase/lowercase or punctuation, while rejecting large instabilities.
The default value of 8 seems to be in the sweet spot of accuracy &
performance, but it will likely be tuned in the future.
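The stability check can be sketched as follows. This is an illustrative reimplementation, not the project's code: `is_stable` and its parameter names are hypothetical, and comparing every pair in the window is one reading of "within some edit distance of each other".

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def is_stable(transcripts, last_n, max_distance=8):
    """True if the last_n transcripts are pairwise within max_distance edits."""
    if len(transcripts) < last_n:
        return False
    recent = transcripts[-last_n:]
    return all(levenshtein(x, y) <= max_distance
               for x, y in combinations(recent, 2))
```

Small differences in case or punctuation cost only a few edits, so they pass the check, while a substantially different transcript does not.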
|
| |
|
|
|
|
|
|
| |
... instead of simple equality.
TODO: add UI for threshold.
Bugfix: Frame::onAppStop() joins the OBS app thread.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This is useful when streaming. Occasionally the STT can get into
a bad state, and manually segmenting clears it up. However doing so
would clear your accumulated transcript, which isn't always desired. Add
ability to preserve the transcript.
A small wrinkle: the new commit logic requires N consecutive identical
windows before committing. To make this feature play nicely with it, I
had to forcibly commit any preview text that hasn't yet been committed.
Failing to do this would usually cause short utterances / the most
recently said stuff to get wiped out.
|
| |
|
|
| |
Add ability to toggle on/off browser src & configure port.
|
| |
|
|
|
| |
Hitting the desktop keybinding to stop transcription would sometimes
cause the last transcript to repeat itself.
|