TaSTT.git/Scripts/transcribe.py, branch v0.14.0

Free self-hosted STT for VRChat.
https://git.yummers.dev/TaSTT.git/atom?h=v0.14.0
2023-08-02T06:42:24+00:00

Fix race condition in commit logic
2023-08-02T06:42:24+00:00 yum yum.food.vr@gmail.com 2023-08-02T06:42:24+00:00
urn:sha1:7b5cbfd76ede7522555dcc87b014239b4f6fbe8c

Transcription thread now blocks until microphone thread deletes samples as requested.

(This is hacky design, it should use a work queue or something, but I don't feel like doing that right now)

Only back off transcription loop when not transcribing
2023-08-02T06:30:42+00:00 yum yum.food.vr@gmail.com 2023-08-02T06:25:26+00:00
urn:sha1:fa7cb7220029fcc506476bf7b32aab90a0077a14

It's possible that the user has toggled off transcription while the algorithm is still working. In this case we should *not* begin exponential backoff since there's still work to do.

Also:
* Shorten the hot-path sleep from 50ms to 5ms.
* Remove unused variable in SleepInterruptible

Preserve audio chunk length when dropping samples
2023-07-09T02:06:45+00:00 yum yum.food.vr@gmail.com 2023-07-09T01:55:41+00:00
urn:sha1:a602bfb95665697b15a2de58694c6ac064af2916

When we commit a transcription, we drop the corresponding audio data. Audio data is represented as a list of chunks. Each chunk contains a few hundred samples of audio data, representing O(10ms) of audio. If we want to drop a few seconds of data, this means simply deleting many chunks of audio.

There's usually a chunk where we want to drop some portion of audio data. Instead of slicing away that part of the chunk, which would change its length, this change zeroes it out. This preserves the assumption that each chunk has the same temporal length.

Commit logic now drops parts of frames
2023-07-08T22:57:39+00:00 yum yum.food.vr@gmail.com 2023-07-08T22:57:39+00:00
urn:sha1:80f46a7a346e73c94a3bb8ae01099743020ef2a4

We used to drop entire frames only, leading to situations where more audio is dropped than desired. Now we drop frames down to the precision of the individual audio sample requested.

Update README
2023-07-08T00:54:40+00:00 yum yum.food.vr@gmail.com 2023-07-08T00:54:40+00:00
urn:sha1:5db7426bb14b7e51275c14d8173bd67e8addc4ce

Mostly updating roadmap stuff. Non-VRC use cases are "complete" since I was mostly targeting streaming. The ability to type into arbitrary text fields is still somewhat nascent & could be improved.

Also update some other random stuff to be more up to date. KillFrenzy Avatar Text is now MIT, pog!

Enforce a stricter avg_logbprob than default
2023-07-07T09:35:51+00:00 yum yum.food.vr@gmail.com 2023-07-07T09:30:18+00:00
urn:sha1:7a576bcac1c37c3c5a59fadf172aa70b15ff83c8

Common hallucinations sneak in around -0.9 avg_logprob.

Also:
* Limit temperatures to just 0.0. Multiple values cause latency to occasionally spike.

Filter out segments based on avg_log_prob & no_speech_prob
2023-07-07T08:58:45+00:00 yum yum.food.vr@gmail.com 2023-07-07T08:57:56+00:00
urn:sha1:2793ac9dd31059f2fc29f7978bcb688a7de664ed

Surprisingly, these args do not cause transcribe() to omit those segments from the result, so we have to manually filter them out. Hallucinated phrases generally have one or both of these params set high.

Use 16-bit ints with generated silence
2023-07-07T08:44:28+00:00 yum yum.food.vr@gmail.com 2023-07-07T08:44:28+00:00
urn:sha1:742eb86d652d7689bbf3ae8b286bf0a6b1c2380d

Each sample of audio data is a 16-bit int, not an 8-bit int.

Fix performance regression
2023-07-07T08:27:02+00:00 yum yum.food.vr@gmail.com 2023-07-07T08:27:02+00:00
urn:sha1:cdc4889cb5e752d00f7f8933a5486f4f3441f6e9

Each chunk of audio samples should be encoded as a binary string, not as a list.
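
The manual segment filtering mentioned in the "Filter out segments" and "Enforce a stricter avg_logbprob" entries above could look roughly like the sketch below. It is not the code from transcribe.py: the `Segment` class is a hypothetical stand-in for whatever the transcription library returns, `filter_segments` is an invented helper name, and both thresholds are illustrative values. The only number taken from the log is that common hallucinations appear around -0.9 avg_logprob, so the cutoff is placed above that.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for one segment of a transcription result."""
    text: str
    avg_logprob: float     # mean token log-probability, closer to 0 is more confident
    no_speech_prob: float  # model's estimate that the audio contains no speech


# Illustrative thresholds only; the values used by the project are not shown in the log.
MIN_AVG_LOGPROB = -0.8
MAX_NO_SPEECH_PROB = 0.6


def filter_segments(segments: List[Segment],
                    min_avg_logprob: float = MIN_AVG_LOGPROB,
                    max_no_speech_prob: float = MAX_NO_SPEECH_PROB) -> List[Segment]:
    """Keep only segments that pass both confidence checks."""
    return [
        s for s in segments
        if s.avg_logprob >= min_avg_logprob
        and s.no_speech_prob <= max_no_speech_prob
    ]


segments = [
    Segment("hello there", avg_logprob=-0.3, no_speech_prob=0.1),
    Segment("Thanks for watching!", avg_logprob=-0.9, no_speech_prob=0.1),
    Segment("Bye!", avg_logprob=-0.2, no_speech_prob=0.9),
]
kept_text = [s.text for s in filter_segments(segments)]
```

Filtering after the call, rather than relying on arguments to transcribe(), matches the observation in the log that those arguments alone do not remove the segments from the result.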
Enforce minimum 5.0 second duration on audio buffer
2023-07-07T00:36:14+00:00 yum yum.food.vr@gmail.com 2023-07-07T00:36:14+00:00
urn:sha1:d0d3b18ad0a859e5e7a1cc5b8a569349b505c924

New commit logic would reduce buffer to a size smaller than this, causing it to hallucinate things like:
* "See you next time!"
* "Thanks for watching!"
* "Bye!"

The hope is that by keeping the buffer at least 5.0 seconds long, as described in the paper, this will cut down on these events.
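
The sample-precise dropping described in the "Preserve audio chunk length" and "Commit logic now drops parts of frames" entries, combined with a minimum buffer length, might be sketched as follows. This is an illustration under several assumptions, not the project's implementation: chunks are taken to be binary strings of 16-bit samples, committed audio is assumed to sit at the front of the buffer, the minimum is assumed to be enforced by clamping the drop count, and the names `drop_leading_samples`, `chunk_samples`, and `min_keep_samples` are invented here.

```python
from typing import List

SAMPLE_WIDTH = 2  # bytes per sample, since each sample is a 16-bit int


def drop_leading_samples(chunks: List[bytes],
                         n_drop: int,
                         chunk_samples: int,
                         min_keep_samples: int = 0) -> List[bytes]:
    """Drop n_drop samples from the front of the buffer.

    Whole chunks are deleted. The chunk containing the cut point keeps its
    length: the dropped part is overwritten with zeros (silence) instead of
    being sliced away. The drop is clamped so that at least
    min_keep_samples samples stay in the buffer (assumed policy).
    """
    total = len(chunks) * chunk_samples
    n_drop = max(0, min(n_drop, total - min_keep_samples))

    whole, partial = divmod(n_drop, chunk_samples)
    kept = list(chunks[whole:])
    if partial:
        head = kept[0]
        cut = partial * SAMPLE_WIDTH
        # bytes(cut) is cut zero bytes, i.e. `partial` silent 16-bit samples.
        kept[0] = bytes(cut) + head[cut:]
    return kept


# Three chunks of four samples each, every sample equal to 1.
chunk = b"\x01\x00" * 4
buffer = [chunk, chunk, chunk]

# Drop 6 samples: one whole chunk goes, half of the next chunk is zeroed.
after_drop = drop_leading_samples(buffer, n_drop=6, chunk_samples=4)

# Same request, but at least 8 samples must remain: only one chunk is dropped.
after_clamped = drop_leading_samples(buffer, n_drop=6, chunk_samples=4,
                                     min_keep_samples=8)
```

For a 5.0 second minimum, `min_keep_samples` would be 5.0 times the buffer's sample rate, for example 80000 samples at 16 kHz.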