| Commit message | Author | Age |
| |
|
|
|
|
|
|
|
|
|
|
| |
Audio data is stored in chunks of frames, not in individual frames.
When I commit a transcript, I want to get rid of the portion of the
audio data responsible for that particular transcript. I have code that
does this, but it was dropping a slice of the list assuming that each
sample is stored individually.
Extra fun: Because we have to decimate mic frames, we have to convert
between whisper frames and mic frames to drop the correct amount of
audio data.
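
A minimal sketch of the chunk-aware drop, not the actual code in the
repo. It assumes the buffer is a list of chunks (each chunk a list of
mic frames) and that decimation is by an integer factor; the names
`drop_leading_frames`, `whisper_to_mic_frames`, and `decimation_factor`
are hypothetical.

```python
def whisper_to_mic_frames(n_whisper_frames, decimation_factor):
    """Convert a count of whisper frames to mic frames.

    Assumes mic audio is decimated by an integer factor, e.g. 3 for a
    48 kHz mic feeding a 16 kHz model (an assumption, not from the repo).
    """
    return n_whisper_frames * decimation_factor


def drop_leading_frames(chunks, n_frames):
    """Drop the first n_frames frames from a list of chunks.

    Whole chunks are discarded while they fit inside the budget; the
    chunk that straddles the boundary is trimmed instead of dropped.
    """
    remaining = n_frames
    kept = []
    for chunk in chunks:
        if remaining >= len(chunk):
            # The entire chunk belongs to the committed transcript.
            remaining -= len(chunk)
            continue
        kept.append(chunk[remaining:])
        remaining = 0
    return kept
```

Slicing the outer list directly (`chunks[n:]`) would drop `n` chunks
rather than `n` frames, which is the kind of bug described above.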
|
| |
|
|
|
|
|
| |
Add toggle to UI to enable a profanity filter. It replaces vowels in bad
words with asterisks.
Bugfix: filters now apply to OBS
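
A rough sketch of the vowel-masking idea, assuming a simple word list
and ASCII vowels. `BAD_WORDS` and `censor` are placeholder names, and
the real word list is not shown here.

```python
import re

# Placeholder word list for illustration only.
BAD_WORDS = {"darn", "heck"}


def censor(text, bad_words=BAD_WORDS):
    """Replace the vowels of any listed word with asterisks."""

    def mask(match):
        word = match.group(0)
        if word.lower() in bad_words:
            return re.sub(r"[aeiouAEIOU]", "*", word)
        return word

    # Visit each run of letters so punctuation and spacing are untouched.
    return re.sub(r"[A-Za-z']+", mask, text)
```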
|
| |
|
|
|
|
|
|
| |
Most transcription output is now hidden by default. Users can enable
more verbose output by toggling `Enable debug mode`.
Bugfix: Toggling off transcription would reset audio state, frequently
resulting in the loss of the last few words spoken.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Recap: In the STT there's an algorithm that tries to determine when a
transcript is "stable" enough to commit. If that is too loose, then
accuracy suffers; if too strict, then the audio buffer eventually fills.
To mitigate the problem, I check whether the last N transcripts are
within some edit distance (Levenshtein edit distance) of each other. The
fuzzy matching lets us forgive small instabilities, like differences in
uppercase/lowercase or punctuation, while rejecting large instabilities.
The default value of 8 seems to be in the sweet spot of accuracy &
performance, but it will likely be tuned in the future.
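
A sketch of the fuzzy stability check, under a few assumptions: each of
the last N transcripts is compared against the most recent one, and the
default of 8 is read as the maximum allowed edit distance. `is_stable`,
`n`, and `max_distance` are hypothetical names.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance, two rows at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def is_stable(transcripts, n=3, max_distance=8):
    """True if the last n transcripts are all close to the newest one."""
    if len(transcripts) < n:
        return False
    recent = transcripts[-n:]
    latest = recent[-1]
    return all(levenshtein(t, latest) <= max_distance for t in recent[:-1])
```

With this, "hello world" and "Hello, world." are 3 edits apart and count
as the same transcript, while an unrelated sentence does not.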
|
| |
|
|
|
|
|
|
| |
... instead of simple equality.
TODO: add UI for threshold.
Bugfix: Frame::onAppStop() joins the OBS app thread.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This is useful when streaming. Occasionally the STT can get into
a bad state, and manually segmenting clears it up. However, doing so
would clear your accumulated transcript, which isn't always desired. Add
the ability to preserve the transcript.
A small wrinkle: the new commit logic requires N consecutive identical
windows before committing. To make this feature play nicely with it, I
had to forcibly commit any preview text that hasn't yet been committed.
Failing to do this would usually cause short utterances / the most
recently said stuff to get wiped out.
|
| |
|
|
| |
Seems to help reduce impact on time-sensitive apps like OBS.
|
| |
|
|
| |
No longer used.
|
| |
|
|
| |
Add ability to toggle on/off browser src & configure port.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Transcription output now streams to localhost:8097.
In OBS:
* Create a browser source.
* url: localhost:8097
* width: 2200
* height: 400
TODO:
* Put behind toggle.
* Create input field for port.
Misc cleanup:
* transcribe.py: Drop frames from audio capture thread instead of the
transcription thread. Doing it the other way would result in
occasional data loss.
|
| |
|
|
|
|
|
|
| |
No longer needed with new commit logic (8d0add86f66db532). Assign it to
5 minutes.
Assuming 4 bytes per sample @ 16 kHz, this buffer maxes out at 19.2
megabytes of memory usage.
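
The arithmetic, spelled out (assuming a single mono channel and decimal
megabytes; the constant names are just for illustration):

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz
BYTES_PER_SAMPLE = 4      # 4 bytes per sample
BUFFER_SECONDS = 5 * 60   # 5 minutes

max_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * BUFFER_SECONDS
print(max_bytes / 1_000_000)  # 19.2
```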
|
| |
|
|
|
|
|
| |
This was slowing down app startup to an unacceptable degree. Now it
runs only once.
Add a button to the debug panel to manually re-setup venv if needed.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
NLLB needs its input to be split up into sentences. I use the
sentence_splitter Python package to do this. It supports ~20 Western
European languages, but notably, no Asian languages.
* Sort spoken language list. English is still at the top.
* Remove 'Translation source' dropdown. Infer this from the spoken
language.
* Add lang_compat.py to map language codes between the various libraries
(whisper, nllb, sentence_splitter).
* Fix bug where old text would appear in textbox when you first bring it
up.
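
A sketch of what a mapping like lang_compat.py could look like. The
table layout and the `convert` helper are assumptions, not the actual
file; the codes themselves are the standard ISO 639-1 codes used by
whisper and sentence_splitter and the FLORES-200 codes used by NLLB.

```python
# One row per language, one column per library. None means the library
# has no support for that language.
LANG_TABLE = {
    "english":  {"whisper": "en", "nllb": "eng_Latn", "sentence_splitter": "en"},
    "french":   {"whisper": "fr", "nllb": "fra_Latn", "sentence_splitter": "fr"},
    "german":   {"whisper": "de", "nllb": "deu_Latn", "sentence_splitter": "de"},
    "spanish":  {"whisper": "es", "nllb": "spa_Latn", "sentence_splitter": "es"},
    "japanese": {"whisper": "ja", "nllb": "jpn_Jpan", "sentence_splitter": None},
}


def convert(code, src, dst):
    """Translate a language code from library `src` to library `dst`.

    Returns None if the code is unknown or `dst` lacks that language.
    """
    for row in LANG_TABLE.values():
        if row.get(src) == code:
            return row.get(dst)
    return None
```

For example, `convert("en", "whisper", "nllb")` gives `"eng_Latn"`, and
`convert("ja", "whisper", "sentence_splitter")` gives `None`, which is
where the caller would have to fall back to not splitting at all.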
|
| |
|
|
|
|
|
|
|
| |
Use Meta's No Language Left Behind (NLLB) model to provide
translation capabilities into 200 languages. Obviously, most are
untested.
This requires either 4.1 or 7.1 GB of RAM and significantly increases
transcription latency.
|
| |
|
|
|
|
|
|
|
|
| |
Add 3 filters:
* Remove trailing period
* Convert to uppercase
* Convert to lowercase
All may be composed. Uppercase and lowercase overwrite each other, so
use only one.
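
A small sketch of how composable text filters can work, assuming each
filter is a plain `str -> str` function applied in order.
`remove_trailing_period` and `apply_filters` are hypothetical names.

```python
def remove_trailing_period(text):
    """Drop a single trailing period, if present."""
    return text[:-1] if text.endswith(".") else text


def apply_filters(text, filters):
    """Run the text through each filter in order."""
    for f in filters:
        text = f(text)
    return text
```

Here `apply_filters("Hello there.", [remove_trailing_period, str.upper])`
gives `"HELLO THERE"`. Chaining `str.upper` and `str.lower` just leaves
whichever ran last, which is why enabling both is pointless.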
|
| |
|
|
|
|
|
|
|
|
| |
I forgot to put them into ApplyConfigToInputFields.
The reason this is necessary: we need to create the text field where we
log things before we can deserialize the config. To keep the code
structure "clean" I just wrote another function to apply the config
(ApplyConfigToInputFields). However I have to remember to update it when
I add new fields.
|
| |
|
|
|
| |
UI now has a checkbox for the uwu filter. Does not materially affect
resource usage or latency when enabled.
|
| |
|
|
|
|
| |
Use UwwwuPP to translate your boring old speech into an uwu-ified version.
Still need to add a UI toggle for this.
|
| |
|
|
|
|
|
|
|
| |
Remove the button. This is a big source of confusion for new users. Now
it happens automatically upon starting any task that needs it.
* Begin removing CPP implementation of Whisper. faster-whisper is a much
easier/better solution.
* Flip default of `clear OSC configs` from false to true.
|
| |
|
|
|
|
| |
Users can now configure a keybind to start/stop/dismiss the STT when in
desktop mode. The default keybind is ctrl+x, since by default VRC
doesn't use 'x' for anything.
|
| |
|
|
|
|
| |
Useful on devices with multiple GPUs, such as gaming laptops.
* Update GUI/README.md.
|
| |
|
|
|
|
| |
Affinity mask no longer affects performance. String matching is still
needed for temporal stability in fast-paced long-form transcription
tasks.
|
| |
|
|
|
|
|
| |
Depth was being calculated wrong, causing text box to render behind
objects it's in front of.
* Fix package.ps1 compression. 7z was increasing file size, somehow.
|
| |
|
|
| |
I'm able to use the new code to show text in game. Not yet play-tested.
|
| |
|
|
| |
Intended to avoid accidentally releasing dirty environments.
|
| |
|
|
|
|
| |
Use `pip freeze` and `pip uninstall` to reset the venv to a near-default
state. Filter out `future` since we need to vendor it. If it ever gets
removed, the installation is borked.
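
A sketch of the filtering step, assuming the usual `pip freeze` line
formats (`name==version` and `name @ url`). `packages_to_uninstall` is a
hypothetical helper; it only parses text and does not invoke pip.

```python
import re


def packages_to_uninstall(freeze_output, keep=("future",)):
    """Return package names from `pip freeze` output, minus the ones to keep."""
    names = []
    for line in freeze_output.splitlines():
        line = line.strip()
        # Skip blanks, comments, and editable installs.
        if not line or line.startswith("#") or line.startswith("-e "):
            continue
        # "name==1.2.3" or "name @ file:///..." -> "name"
        name = re.split(r"==|\s*@\s*", line, maxsplit=1)[0].strip()
        if name.lower() not in keep:
            names.append(name)
    return names

# The resulting names could then be handed to something like:
#   python -m pip uninstall -y <names...>
```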
|
| |
|
|
|
|
|
|
| |
This dependency fails to install with the embedded python, so now it's
vendored.
Installing pip after wheel would result in wheel reinstalling, so we
also vendor pip.
|
| |
|
|
|
|
|
|
|
|
| |
We used to populate 7 4k textures + 1 2k texture for all users.
Now if the user has configured `bytes_per_char=1` in the Unity
panel, we just populate a single 512x512 texture containing the
first 128 ASCII characters.
This reduces texture memory usage by 99.74%, from 134.67 MB to
340 KB.
|
| |
|
|
|
| |
Need python310._pth, specifically 'import site' line, for
embedded python + pip to get along.
|
| |
|
|
|
| |
If you don't have Python installed, venv setup will fail. Begin work
fixing environment config so `pip install` uses vendored Python.
|
| |
|
|
|
|
|
|
|
| |
A user saw an error like `ModuleNotFoundError: No module named _socket`.
StackOverflow blames this on PYTHONPATH, so let's try setting it.
* Fix latent bug in Scripts/transcribe.py. PyAudio.open() positional
parameters must be specified in correct order, even when telling it
which parameter is which. *shrug*
|
| |
|
|
|
|
|
|
| |
Expose decode method, beam search parameters, and voice activity
detection parameters in GUI.
* Remove WhisperCPP::Init(), do it on launch instead.
* Add float support to ConfigMarshal
|
| |
|
|
|
|
| |
Twofold approach:
* All spawned processes have the desired path (new codepath)
* Setup command silences the warning (old codepath)
|
| |
|
|
|
|
| |
Do these in a std::future.
* SetAffinityMask() now returns a value on all control paths
|
| |
|
|
|
|
|
|
|
| |
A user pointed out that constraining the Python implementation to a
single core does not affect visible latency. This seems true on my
PC as well.
* Reimplement Python transcription wxProcess as a std::async.
App shutdown is much faster now.
|
| |
|
|
|
| |
* Plumb beam search params into whisper cpp implementation
(currently broken)
|
| |
|
|
|
| |
Things like " (static)" and " *explosions*" were showing up a lot with
ggml-medium.bin. Filter them out.
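
One way such a filter could look, as a sketch only: it assumes the
unwanted segments are always wrapped in parentheses or asterisks, which
also means legitimate parentheticals would be removed. The actual filter
may simply match known strings. `strip_non_speech` is a hypothetical
name.

```python
import re

# Matches "(...)" or "*...*" together with any whitespace before it.
NON_SPEECH = re.compile(r"\s*(\([^()]*\)|\*[^*]*\*)")


def strip_non_speech(text):
    """Remove bracketed sound descriptions like " (static)" or " *explosions*"."""
    return NON_SPEECH.sub("", text).strip()
```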
|
| |
|
|
|
|
|
| |
Use forked Whisper implementation which has tweaks to reduce dropped
words around the beginning of VAD segments.
* Retain audio after VAD segmentation events
|
| |
|
|
|
|
|
|
| |
* Pip install, dependency install, and model download can be gracefully
interrupted and resumed later.
* Mic list was pointing at freed memory. Fix this by copying onto the
heap with std::unique_ptrs. Mic list in CPP panel is much more
reliable now.
|
| |
|
|
| |
Not ready yet.
|
| | |
|
| |
|
|
|
|
|
|
|
|
| |
Sort of a misnomer. The idea is to use C++ for transcription and Python
for SteamVR and OSC.
Having issues getting output from multithreaded Python code. Not in the
mood to figure this out today.
* Hide unimplemented parts of C++ panel.
|
| |
|
|
| |
Simplifies debugging process.
|
| |
|
|
|
|
|
|
|
|
|
| |
Rapidyaml started refusing to parse config files so I dropped it.
* Add ConfigMarshal class to support very simple config marshalling
* No versioning, no type indicators, nothing.
* Supports int, bool, and string.
* Bools are serialized as ints.
* Log no longer segfaults if given nullptr wxTextCtrl*.
* Fix how whisper CPP GUI fields restore from config
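
To illustrate the idea (not the real C++ class), here is a Python sketch
of an untyped key/value marshaller. The `key=value` line format and all
method names are assumptions; since nothing in the file records a type,
the reader has to ask for the type it expects.

```python
class ConfigMarshal:
    """Toy key/value store: no versioning, no type indicators."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        # Bools are stored as ints (True -> "1", False -> "0").
        if isinstance(value, bool):
            value = int(value)
        self._values[key] = str(value)

    def get_str(self, key, default=""):
        return self._values.get(key, default)

    def get_int(self, key, default=0):
        return int(self._values[key]) if key in self._values else default

    def get_bool(self, key, default=False):
        return bool(self.get_int(key, int(default)))

    def dumps(self):
        # One "key=value" pair per line; values containing newlines
        # are not supported in this sketch.
        return "\n".join(f"{k}={v}" for k, v in self._values.items())

    @classmethod
    def loads(cls, text):
        cfg = cls()
        for line in text.splitlines():
            key, sep, value = line.partition("=")
            if sep:
                cfg._values[key] = value
        return cfg
```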
|
| |
|
|
|
| |
* Implement HTTPMapper classes
* Browser source respects user-configured source port
|
| |
|
|
|
|
| |
Server needs to parse incoming HTTP.
* Server spawns a thread for each incoming connection
|
| |
|
|
|
|
|
| |
oatpp was a crashy mess. Begin making a simple web server from scratch.
* Add Designs/ folder to document nontrivial things like the webserver
design
|
| |
|
|
|
|
| |
It's a crashy mess, but it sort of works.
* Add Transcript class to send transcription segments between layers
|
| |
|
|
|
| |
Browser source queries /api/transcript at 10 Hz via jQuery and renders
the response.
|
| |
|
|
|
|
| |
Documented in BrowserSource::Run().
* Parameterize Release/Debug in build scripts
|