Rework transcription commit logic - TaSTT.git - Free self-hosted STT for VRChat.

author	yum <yum.food.vr@gmail.com>	2023-06-24 18:02:37 -0700
committer	yum <yum.food.vr@gmail.com>	2023-06-24 18:02:37 -0700
commit	8d0add86f66db5324f8b965b832aea7cc1361498 (patch)
tree	8d82ba69ce9c381aa5fe8594a8232315d360435f /GUI
parent	e689105f8ad480eaf82eaed12e82a139df0b772b (diff)

Rework transcription commit logic

At the core of the STT, there's a loop which uses Whisper to convert audio into a transcript. As you say something, Whisper sees growing fragments of your sentence:

  t0: "Hell"
  t1: "Hello"
  t2: "Hello, world!"

So we need some algorithm which takes these fragments and accumulates them into an ever-growing transcript.

Previously I did this with fuzzy string matching. I'd find the region where the two transcripts overlap and merge them to produce a longer transcript. The big problem is that if there's no overlap, it's not clear whether Whisper radically changed its mind about what was said, or whether the user paused for a long time before saying something new. So I'd have to reset the growing transcript.

Now I get the timestamps from Whisper and wait for it to give me the same transcript 3 times for the last utterance. Once the transcript stabilizes like this, I commit the text. This enables a temporally stable, ever-growing transcript that's also quite accurate.

To prevent a latency regression, I also introduce the notion of "preview text", which is a preview of an utterance that has not yet stabilized. Previews do not contribute to the ever-growing transcript, but they do get fed through the rest of the app, so they show up in-game / in OBS. Once an utterance stabilizes, it gets committed to the ever-growing transcript.

This change is lightly tested!
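
For illustration only, here is a minimal sketch of the stabilize-then-commit idea described above. It is not the code from this commit: the class name `TranscriptCommitter`, the `update` method, and the `stable_count` parameter are all hypothetical, and the sketch leaves out the Whisper timestamp handling entirely. It assumes the caller passes in Whisper's latest text for the utterance that has not been committed yet.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TranscriptCommitter:
    """Hypothetical helper: commit an utterance once its text stops changing."""

    # Assumed threshold: identical results needed in a row before committing.
    stable_count: int = 3
    committed: List[str] = field(default_factory=list)
    _candidate: Optional[str] = None
    _seen: int = 0

    def update(self, utterance_text: str) -> Tuple[str, str]:
        """Feed the latest text for the uncommitted utterance.

        Returns (committed_transcript, preview_text).
        """
        # Count how many times in a row we have seen exactly this text.
        if utterance_text == self._candidate:
            self._seen += 1
        else:
            self._candidate = utterance_text
            self._seen = 1

        preview = utterance_text
        if utterance_text and self._seen >= self.stable_count:
            # The text has stabilized: move it into the growing transcript.
            self.committed.append(utterance_text)
            self._candidate = None
            self._seen = 0
            preview = ""

        return " ".join(self.committed), preview


committer = TranscriptCommitter()
for text in ["Hell", "Hello", "Hello, world!", "Hello, world!", "Hello, world!"]:
    transcript, preview = committer.update(text)
    print(repr(transcript), repr(preview))
```

Until the third identical result arrives, the text is only returned as a preview, so it can be displayed right away without entering the committed transcript. A real loop would presumably also use the timestamps to work out which audio belongs to the committed utterance, which this sketch does not attempt.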

Diffstat (limited to 'GUI')

0 files changed, 0 insertions, 0 deletions

