TaSTT.git, branch v0.18.0

Add dropdown for GPU compute type

2024-02-10T01:21:46+00:00

Should enable compatibility with older GPUs.

Add another threshold to filter out common hallucinations

2024-02-06T01:40:37+00:00

The paper recommends filtering out segments with no_speech_prob > 0.6 and avg_logprob < -1. This bound is too loose for short-form audio, which is not guaranteed to contain speech. I already have a tighter bound:
  no_speech > 0.6 and avg_logprob < -0.5
While listening to instrumental music, I find that a lot of hallucinations sneak past that bound, so I added a second bound:
  no_speech > 0.15 and avg_logprob < -0.7
Basically we filter out things that look like speech but have a worse avg_logprob. Seems to not have false negatives. Requires testing.
Also: dial back the default max segment length from 15 seconds to 10 seconds. This is based on performance observations on desktop.
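A minimal sketch of the two-bound filter described above, not the actual TaSTT code. The `Segment` class, the `HALLUCINATION_BOUNDS` name, and both function names are hypothetical; only the threshold values come from this commit message.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a transcription segment."""
    text: str
    no_speech_prob: float
    avg_logprob: float


# Each pair is (no_speech_prob lower bound, avg_logprob upper bound).
# Values are taken from the commit message; the structure is an assumption.
HALLUCINATION_BOUNDS = [
    (0.6, -0.5),   # the existing, tighter-than-paper bound
    (0.15, -0.7),  # the second bound added in this commit
]


def is_probable_hallucination(seg, bounds=HALLUCINATION_BOUNDS):
    """True if the segment falls inside any of the rejection regions."""
    return any(
        seg.no_speech_prob > no_speech and seg.avg_logprob < logprob
        for no_speech, logprob in bounds
    )


def filter_segments(segments):
    """Keep only the segments that no bound rejects."""
    return [s for s in segments if not is_probable_hallucination(s)]
```

A segment is dropped as soon as either bound matches, so the second bound only widens the rejection region; it never rescues a segment the first bound rejects.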

Verify that audio is clean after VAD segmentation

2024-02-06T01:02:23+00:00

Indeed it is. Bumped up the default max segment length to decrease error. Also add mic presets for beyond (the VR headset) and motu (my mic interface).

Revert "Begin experimenting with flash-attention"

2024-01-09T02:59:27+00:00

This reverts commit 921b92a69f36502dc5eefd14ba3487c1bb49bb9d.

Fix font rendering ddx/ddy logic

2024-01-09T02:59:21+00:00

Begin experimenting with flash-attention

2023-12-13T21:54:57+00:00

Seems much faster than faster-whisper. There are two issues:
* Requires NVIDIA 3000 series or higher.
* Incompatible with faster-whisper dependencies.
So it seems like we'll either need to toggle between two sets of dependencies at runtime or have two environments.

Decrease OSC sync rate from 5 Hz to 3 Hz

2023-12-09T02:15:03+00:00

Paging is now slower but more reliable.

Add distilled whisper large-v2 model

2023-12-09T02:13:56+00:00

Add distilled whisper-medium model

2023-11-07T23:05:29+00:00

I converted distil-whisper-medium.en to CTranslate2 format and uploaded it to huggingface. This model is exceptionally fast and light compared to the non-distilled version, at the cost of some accuracy.

Transcripts preceding long pauses now drop

2023-10-06T01:28:42+00:00

When hot-miking into the built-in chatbox, there are sometimes long pauses in conversation. After these pauses, it's undesirable to show the transcript generated before the pause. This feature makes it so that those transcripts can be dropped.
Also:
* Limit number of segments sent to browser source to 10. Allow this to grow up to 10 segments before dropping the first 5 segments.
* Silence warnings generated by `install_in_venv`, used by e.g. translation codepath.
* Enable audio normalization to improve accuracy when speaking softly, at the cost of some accuracy when speaking normally.
Credit: user endo0269 on Discord suggested this feature.
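The segment limit for the browser source could look roughly like the sketch below. This is an illustration, not the actual implementation: `push_segment` and its parameters are hypothetical names, and it assumes that trimming happens once the list would exceed 10 entries.

```python
def push_segment(segments, new_segment, max_segments=10, drop_count=5):
    """Append a segment; once the list exceeds max_segments,
    drop the oldest drop_count entries in one step.

    Assumption: the trim triggers when the list grows past the limit,
    not when it first reaches it.
    """
    segments.append(new_segment)
    if len(segments) > max_segments:
        # Drop in a batch so the displayed text does not shift
        # by one segment on every update.
        del segments[:drop_count]
    return segments
```

Dropping five segments at once rather than one at a time means the list is trimmed only occasionally instead of on every new segment.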