TaSTT.git/Scripts/string_matcher.py, branch master

Free self-hosted STT for VRChat.
https://git.yummers.dev/TaSTT.git/atom?h=master
2023-06-25T01:02:37+00:00

Rework transcription commit logic
2023-06-25T01:02:37+00:00 yum yum.food.vr@gmail.com 2023-06-25T01:02:37+00:00 urn:sha1:8d0add86f66db5324f8b965b832aea7cc1361498

At the core of the STT, there's a loop which uses Whisper to convert audio into a transcript. As you say something, whisper sees growing fragments of your sentence:

  t0: "Hell"
  t1: "Hello"
  t2: "Hello, world!"

So we need some algorithm which takes these fragments and accumulates them into an ever-growing transcript.

Previously I did this with fuzzy string matching. I'd find the region where the two transcripts overlap and edit the two together to produce a longer transcript.

The big problem is that if there's no overlap, it's not clear whether whisper radically changed its mind as to what was said, or whether the user paused for a long time before saying something new. So I'd have to reset the growing transcript.

Now I get the timestamps from Whisper and wait for it to give me the same 3 transcripts for the last utterance. Once the transcript stabilizes like this, I commit the text. This enables a temporally stable, ever-growing transcript that's also quite accurate.

To prevent a latency regression, I also introduce the notion of "preview text", which is a preview of an utterance that has not yet stabilized. These previews do not contribute to the ever-growing transcript, but do get fed through the rest of the app, so they show up in-game / in OBS. Once they eventually stabilize, they get committed to the ever-growing transcript.

This change is lightly tested!

Begin work on C++ implementation
2023-02-23T05:49:29+00:00 yum yum.food.vr@gmail.com 2023-02-21T21:19:43+00:00 urn:sha1:9a97fbc3c583ccd518d838faaaa36ed9aa5558e1

Use Const-me/Whisper to perform transcription.
This implementation is vastly more efficient: CPU usage, memory usage, and VRAM usage are all dramatically reduced. It's slightly less accurate when comparing the same model (due to the lack of beam search decoding), but since you can use larger models, the impact is largely a wash.

Bugfix: user-provided paths may now contain spaces
2023-01-04T18:03:39+00:00 yum yum.food.vr@gmail.com 2023-01-04T17:52:02+00:00 urn:sha1:66d311b3267620995e5c35b16f3fba18ed0c48f3

Previously, paths containing spaces would be interpreted by Python's argument parser as multiple separate arguments, causing it to fail. Now we escape paths inside PythonWrapper using std::quoted().

* Improve PII filtering. Python output would contain multiple path separators (like C:\\Users\\foo\\), defeating the PII regex.
* Silence compiler warning in PII filter.
* Document usability improvements.
* Transcription layer exponential backoff goes to ~infinity when paused. This is a hack, since we really don't need to transcribe at all when paused, but it lets us keep the code simple. Good enough until the next rewrite.
* Shader only samples background when necessary.
* Limit matchStrings() print()s to DEBUG mode

Fine-tune transcription
2022-12-30T08:01:28+00:00 yum yum.food.vr@gmail.com 2022-12-30T08:01:28+00:00 urn:sha1:abdaa7ce215086bf1070d6093731cd35df866cbb

Bump up recording window to 28 seconds. This helps a lot with long-form transcription tasks, such as transcribing an audiobook. We should expose this as a parameter, since at 10s the transcription delay is typically 300ms, while at 28s it's typically 1.1-1.2s.

Finish python virtual env
2022-12-18T01:51:12+00:00 yum yum.food.vr@gmail.com 2022-12-18T01:51:12+00:00 urn:sha1:ee8213d1d2c2008d2d996929500c9e87dac325a3

GUI can now download all TaSTT dependencies and install them into a virtual environment.
* Add buttons to check embedded python version & install dependencies
* Add class to wrap interacting with embedded Python
* Put all TaSTT python scripts into a folder
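
The stabilization rule from "Rework transcription commit logic" (commit once Whisper returns the same transcript 3 times for the last utterance, otherwise show it as preview text) can be sketched roughly as follows. This is a minimal illustration, not the code from the TaSTT repository: the class name StableCommitter, the feed() method and the required parameter are hypothetical, and the sketch assumes the caller only passes in the text of the utterance that has not been committed yet (in the real app that split would come from Whisper's timestamps).

```python
from collections import deque
from typing import Deque, List, Tuple


class StableCommitter:
    """Hypothetical helper: commit an utterance once it stops changing."""

    def __init__(self, required: int = 3) -> None:
        # Number of identical consecutive transcripts needed to commit.
        self.required = required
        # Sliding window over the most recent transcripts of the utterance.
        self.recent: Deque[str] = deque(maxlen=required)
        # Committed utterances; this list only ever grows.
        self.committed: List[str] = []

    def feed(self, utterance: str) -> Tuple[str, str]:
        """Take the latest transcript of the uncommitted utterance.

        Returns (committed transcript, preview text).
        """
        self.recent.append(utterance)
        window_full = len(self.recent) == self.required
        all_equal = len(set(self.recent)) == 1
        if utterance and window_full and all_equal:
            # Stable: move the text into the ever-growing transcript.
            self.committed.append(utterance)
            self.recent.clear()
            return " ".join(self.committed), ""
        # Not stable yet: expose it as preview text only.
        return " ".join(self.committed), utterance


committer = StableCommitter()
for fragment in ["Hell", "Hello", "Hello, world!", "Hello, world!", "Hello, world!"]:
    transcript, preview = committer.feed(fragment)
    print(repr(transcript), repr(preview))
```

Until the window holds three identical transcripts, the text is only returned as a preview, so it can still be shown in-game without ever being written into the committed transcript.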