TaSTT.git/Scripts/transcribe.py, branch v0.13.1

Free self-hosted STT for VRChat.
https://git.yummers.dev/TaSTT.git/atom?h=v0.13.1
2023-07-07T09:35:51+00:00

Enforce a stricter avg_logprob than default
2023-07-07T09:35:51+00:00 yum <yum.food.vr@gmail.com> 2023-07-07T09:30:18+00:00 urn:sha1:7a576bcac1c37c3c5a59fadf172aa70b15ff83c8

Common hallucinations sneak in around -0.9 avg_logprob.

Also:
* Limit temperatures to just 0.0. Multiple values cause latency to occasionally spike.

Filter out segments based on avg_log_prob & no_speech_prob
2023-07-07T08:58:45+00:00 yum <yum.food.vr@gmail.com> 2023-07-07T08:57:56+00:00 urn:sha1:2793ac9dd31059f2fc29f7978bcb688a7de664ed

Surprisingly, these args do not cause transcribe() to omit those segments from the result, so we have to manually filter them out. Hallucinated phrases generally have one or both of these params set high.

Use 16-bit ints with generated silence
2023-07-07T08:44:28+00:00 yum <yum.food.vr@gmail.com> 2023-07-07T08:44:28+00:00 urn:sha1:742eb86d652d7689bbf3ae8b286bf0a6b1c2380d

Each sample of audio data is a 16-bit int, not an 8-bit int.

Fix performance regression
2023-07-07T08:27:02+00:00 yum <yum.food.vr@gmail.com> 2023-07-07T08:27:02+00:00 urn:sha1:cdc4889cb5e752d00f7f8933a5486f4f3441f6e9

Each chunk of audio samples should be encoded as a binary string, not as a list.

Enforce minimum 5.0 second duration on audio buffer
2023-07-07T00:36:14+00:00 yum <yum.food.vr@gmail.com> 2023-07-07T00:36:14+00:00 urn:sha1:d0d3b18ad0a859e5e7a1cc5b8a569349b505c924

New commit logic would reduce buffer to a size smaller than this, causing it to hallucinate things like:
* "See you next time!"
* "Thanks for watching!"
* "Bye!"

The hope is that by keeping the buffer at least 5.0 seconds long, as described in the paper, this will cut down on these events.
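
The manual filtering described in the "Filter out segments based on avg_log_prob & no_speech_prob" entry could look roughly like the sketch below. It is not the code from transcribe.py: the `Segment` dataclass is a hypothetical stand-in for whatever segment objects transcribe() returns, and both threshold values are assumptions chosen for illustration (the log only says hallucinations appear around -0.9 avg_logprob).

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed cutoffs for illustration only; the commit log does not state
# the values the script actually uses.
MIN_AVG_LOGPROB = -0.8
MAX_NO_SPEECH_PROB = 0.6


@dataclass
class Segment:
    """Hypothetical stand-in for one transcription segment."""
    text: str
    avg_logprob: float
    no_speech_prob: float


def filter_segments(
    segments: Iterable[Segment],
    min_avg_logprob: float = MIN_AVG_LOGPROB,
    max_no_speech_prob: float = MAX_NO_SPEECH_PROB,
) -> List[Segment]:
    """Keep only segments that pass both confidence checks."""
    kept = []
    for seg in segments:
        # Low average log-probability: the model was unsure of its own text.
        if seg.avg_logprob < min_avg_logprob:
            continue
        # High no-speech probability: the audio was probably silence or noise.
        if seg.no_speech_prob > max_no_speech_prob:
            continue
        kept.append(seg)
    return kept
```

Filtering after the fact like this is independent of whatever thresholds are passed to transcribe(), which matches the log's observation that those arguments alone did not remove the segments.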
Add visual commit indicator to OBS browser source
2023-07-01T02:46:17+00:00 yum <yum.food.vr@gmail.com> 2023-07-01T02:44:27+00:00 urn:sha1:4f3131b4a36d8e1557edb31d3754a431717dab7b

Circle goes red when speaking, grey when done. Ideally it would be in the top right portion of the browser source, but this is a good start.

Also, hard-cap transcripts to 4096 chars. This prevents the STT from lagging during long sessions.

Bugfix: trailing period filter ignores ellipses
2023-07-01T01:55:12+00:00 yum <yum.food.vr@gmail.com> 2023-07-01T01:55:12+00:00 urn:sha1:9ab500036bdfa87215e9a05fc167c4d9dea8e437

... also print out "Ready!" when the STT is done loading.

fix: set gpu device index in whisper model
2023-06-30T00:15:30+00:00 jsopn <github@jsopn.com> 2023-06-30T00:15:30+00:00 urn:sha1:95927745b3d3b2ac566f1cfd634a40e760ed29cb

Fix race condition around audio frames dropping
2023-06-29T06:50:34+00:00 yum <yum.food.vr@gmail.com> 2023-06-29T06:50:34+00:00 urn:sha1:3b10d3ab3073af2ed716d1607bb92394bb8817fc

onAudioFramesAvailable would bail out if audio_state.audio_paused is set, preventing frames from being dropped. This would cause transcriptions to get repeated sometimes. Now that frame dropping code always runs.

Also adjust the code structure of the keyboard/VR input handlers to be more similar.

Bugfix: commit no longer wipes out audio buffer
2023-06-29T05:18:18+00:00 yum <yum.food.vr@gmail.com> 2023-06-29T05:11:46+00:00 urn:sha1:b1efbf5ce1ebd584796d4a57cf9c7b6517f91fac

Audio data is stored in chunks of frames, not in individual frames. When I commit a transcript, I want to get rid of the portion of the audio data responsible for that particular transcript. I have code that does this, but it was dropping a slice of the list assuming that each sample is stored individually.

Extra fun: Because we have to decimate mic frames, we have to convert between whisper frames and mic frames to drop the correct amount of audio data.
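
The chunk-aware dropping described in the "commit no longer wipes out audio buffer" entry can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: the function name, the mono 16-bit layout, and the integer `decimation` factor (for example 3 for a 48 kHz mic feeding a 16 kHz model) are all hypothetical.

```python
from typing import List

BYTES_PER_SAMPLE = 2  # assumption: mono audio, one 16-bit int per mic frame


def drop_committed_audio(
    chunks: List[bytes], whisper_frames: int, decimation: int
) -> List[bytes]:
    """Return `chunks` with the audio behind a committed transcript removed.

    `whisper_frames` counts frames at the model's sample rate; multiplying
    by `decimation` converts that to mic frames, which is the unit the
    buffered chunks are actually stored in.
    """
    bytes_to_drop = whisper_frames * decimation * BYTES_PER_SAMPLE
    remaining = []
    for chunk in chunks:
        if bytes_to_drop >= len(chunk):
            # The whole chunk belongs to the committed transcript.
            bytes_to_drop -= len(chunk)
            continue
        # Keep the tail of a partially consumed chunk (or the whole chunk
        # once nothing is left to drop).
        remaining.append(chunk[bytes_to_drop:])
        bytes_to_drop = 0
    return remaining
```

Slicing the list of chunks directly by a frame count, as the log describes the earlier bug, would remove one whole chunk per frame and so discard far more audio than intended; counting in bytes across chunk boundaries avoids that.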