| Commit message | Author | Age |
|
- update cursorignore
- add hallucination filter training & inference code
- put logging into a central module
- segment metadata logging occurs before filtering
- segment metadata logging is on by default
- check in embedded python setup script
- include trained hallucination filter model
|
Segment metadata can now be logged to a json as the app runs. The goal
is to identify the params that heavily correlate with hallucinations.
Also:
* use 7zip for compression in build, speeding things up
* log dll download progress every few seconds
* shrink package
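
A minimal sketch of how per-segment metadata could be appended to a JSON log while the app runs. The one-object-per-line layout, the function name `log_segment_metadata`, and the choice of fields are assumptions for illustration, not the app's actual format.

```python
import json
import time


def log_segment_metadata(path, segment):
    """Append one segment's metadata to `path` as a single JSON line.

    `segment` is a plain dict here (hypothetical shape); missing fields
    are logged as null so every record has the same keys.
    """
    record = {
        "timestamp": time.time(),
        "text": segment.get("text", ""),
        "no_speech_prob": segment.get("no_speech_prob"),
        "avg_logprob": segment.get("avg_logprob"),
        "compression_ratio": segment.get("compression_ratio"),
    }
    # Append mode keeps earlier records intact if the app is restarted.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Writing one record per line means the log can be read back incrementally and correlated against known hallucinations later.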
|
* fix model acquisition
* fix local beepsnd
* fix volume control
|
- add desktop and vr input threads
- add audio feedback for input
- add volume control for audio feedback
- add UI for custom chatbox/built in chatbox
- add ability to dismiss built in chatbox (sync empty messages)
- limit lines in python console
- limit length of each transcript
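
The two limits above could be sketched roughly as follows. The class and function names, the default limits, and the choice to keep the most recent text are all assumptions, not the app's actual code.

```python
from collections import deque

MAX_CONSOLE_LINES = 500       # assumed default, not taken from the app
MAX_TRANSCRIPT_CHARS = 2000   # assumed default, not taken from the app


class BoundedConsole:
    """Keeps only the most recent `max_lines` lines of console output."""

    def __init__(self, max_lines=MAX_CONSOLE_LINES):
        # A deque with maxlen silently drops the oldest line when full.
        self._lines = deque(maxlen=max_lines)

    def write(self, text):
        for line in text.splitlines():
            self._lines.append(line)

    def render(self):
        return "\n".join(self._lines)


def clip_transcript(transcript, max_chars=MAX_TRANSCRIPT_CHARS):
    """Keep only the last `max_chars` characters of a transcript."""
    return transcript[-max_chars:] if max_chars > 0 else ""
```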
|
- fix unicode output from python terminal
- fix cpu inference
- add filters
- add beam search params to UI
- DRY up config definition in UI
|
- Filters actually get applied now, huge accuracy boost
- Use silero-vad python library instead of rolling our own
- Expose prompt parameter
- Auto setup venv on launch
- Clean up python output
- Auto acquire all dependencies on launch
- Add icon
|
1. main STT app works in new project structure
2. UI dumps mics on startup to populate mic list
3. add missing deps (hf-xet, wave)
4. normalize audio volume when transcribing. Probably still wrong tbqh.
5. add checkbox to save audio segments & improve logic so only segments
with speech get saved.
6. add default config settings
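
For item 4, one common way to normalize volume is peak normalization. The sketch below assumes float samples in [-1.0, 1.0] and a hypothetical `target_peak` parameter. It illustrates the general technique and is not necessarily what the app does.

```python
def normalize_peak(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude `target_peak`.

    Silent or empty input is returned unchanged to avoid dividing by zero.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Peak normalization amplifies background noise in quiet segments along with everything else, which is one reason a simple approach like this may still need tuning.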
|
HEAVILY VIBE CODED!
|
This is overwhelmingly more common than custom chatbox.
|
Deprecated in the Python release of CTranslate2 as of 4.4.0:
https://github.com/OpenNMT/CTranslate2/blob/master/CHANGELOG.md#v440-2024-09-09
|
Also:
* Double # of audio device slots
* Fetch CuDNN from NVIDIA at runtime instead of vendoring
|
Deduce the hostname in js instead of hardcoding localhost. If you
loaded the browser source under 127.0.0.1, the ajax calls to localhost
counted as cross-origin and the browser blocked them. This fixes that.
|
God this code is a fucking nightmare
|
* Add checkbox to disable this feature if so desired.
* Delete old optimization code; can get it back from git if needed.
* Enforce that there's at least one space character ' ' between
committed segments.
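
The spacing rule in the last bullet could look roughly like this. The function name `append_segment` is hypothetical and this is only a sketch of the stated rule.

```python
def append_segment(committed, new_segment):
    """Append `new_segment` to `committed` with at least one ' ' between them."""
    if not committed:
        return new_segment
    if not new_segment:
        return committed
    # Only insert a space if neither side already supplies one.
    if committed.endswith(" ") or new_segment.startswith(" "):
        return committed + new_segment
    return committed + " " + new_segment
```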
|
Translation needs torch to convert the nllb model, but the latest
version (2.3.1) has an embedded OMP dll which clashes with ctranslate2's
dll. The previous minor version (2.2.2) doesn't clash, so use that.
Also propagate the device, quantization, and flash attention settings
to the translator. If you're using GPU, this is a huge performance
uplift. Translation is basically instant. The bigger models are now
feasible to use.
|
Also disable flash-attention when CPU mode is selected
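
A minimal sketch of that rule, assuming settings are held in a plain dict with hypothetical keys `device` and `flash_attention`:

```python
def effective_settings(device, flash_attention):
    """Return the settings to use, forcing flash attention off on CPU."""
    settings = {"device": device, "flash_attention": bool(flash_attention)}
    if device.lower() == "cpu":
        # The user's checkbox is ignored when CPU mode is selected.
        settings["flash_attention"] = False
    return settings
```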
|
Pre-3000 series GPUs don't support it. Oops!
|
There's a modular avatar prefab for the custom chatbox on my gumroad.
Update the default settings to work with that prefab.
|
This should be significantly more efficient than prior versions.
* add large-v3 & distilled variant
* simplify model acquisition code now that distilled models are part of
faster-whisper.
|
These were broken due to some logic errors in the codepath which
acquires models from huggingface.
Distilled large-v2 seems promising as a new default model.
|
To use it:
$ python3 -m pip install python-osc pillow
$ cd Scripts
$ python3 ./text_to_text_demo.py
|
cuDNN is now pulled from Dropbox instead of Google Drive. This has the
added benefit of being about 10-20x faster (assuming you have fast
internet).
|
Google drive intentionally broke CLI downloads ("don't be evil") and
UwwwuPP went away. Begin work rehosting both files.
|
Should enable compatibility with older GPUs.
|
The paper recommends filtering out segments with no_speech_prob > 0.6
and avg_logprob < -1. This is too loose a bound for short-form audio
that is not guaranteed to contain speech.
I already have a tighter bound:
  no_speech_prob > 0.6 and avg_logprob < -0.5
While listening to instrumental music I find that a lot of
hallucinations sneak past that bound. So I added a second bound:
  no_speech_prob > 0.15 and avg_logprob < -0.7
Basically we filter out things that look like speech but have a worse
avg_logprob. Seems not to have false negatives. Requires testing.
Also: dial back the default max segment length from 15 seconds to 10
seconds. This is based on performance observations on desktop.
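
The two bounds described above can be sketched as a single predicate. The `Segment` dataclass and the function names are hypothetical. Only the threshold values come from this commit message.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for the per-segment statistics used below."""
    text: str
    no_speech_prob: float
    avg_logprob: float


# (no_speech_prob threshold, avg_logprob threshold) pairs from the message above.
BOUNDS = [
    (0.6, -0.5),    # existing bound, tighter than the paper's (0.6, -1)
    (0.15, -0.7),   # second bound added in this commit
]


def is_probable_hallucination(seg):
    """True if the segment falls inside any of the rejection bounds."""
    return any(
        seg.no_speech_prob > ns and seg.avg_logprob < lp
        for ns, lp in BOUNDS
    )


def filter_segments(segments):
    """Keep only segments that no bound rejects."""
    return [s for s in segments if not is_probable_hallucination(s)]
```

For example, a segment with no_speech_prob 0.2 and avg_logprob -0.8 passes the first bound but is rejected by the second.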
|