Commit message (Author, Age)
* Fix local audio indicators [v0.15.0] (yum, 2023-09-10)
|
* Check in vad.py and delete transcribe.py (yum, 2023-09-10)
| Oops, I meant to check this in a while back. Since transcribe_v2.py now has feature parity with transcribe.py, delete the old code.
* Add plugin interface (yum, 2023-09-10)
| ... and use it to implement translation and text filters.
|
| Also fix display of non-English characters in browser src.
* Bugfix: only cap display of transcript at 4K chars (yum, 2023-09-10)
| Actually retain the whole transcript to avoid breaking the OSC pager.
|
| Also constrain the UI buffer size by characters instead of lines. Since some lines can be massive and others short, characters are a better way of consistently keeping the UI memory in check.
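A minimal sketch of the idea in this commit, assuming a hypothetical `CharBoundedBuffer` class (the name and the 4096 default are illustrative, not taken from the project): the full transcript is retained, and only the text handed to the display is capped by character count.

```python
class CharBoundedBuffer:
    """Retains the whole transcript but caps what is shown, by characters."""

    def __init__(self, max_chars=4096):
        if max_chars <= 0:
            raise ValueError("max_chars must be positive")
        self.max_chars = max_chars
        self.full_text = ""  # never truncated, so a pager still sees everything

    def append(self, text):
        self.full_text += text

    def display_text(self):
        # Slice by characters, not lines: one very long line cannot blow
        # past the cap the way it could with a line-count limit.
        return self.full_text[-self.max_chars:]
```

Capping by characters gives a fixed upper bound on display memory regardless of how the transcript happens to be split into lines.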
* Add UI for transcription loop delay (yum, 2023-09-10)
| Allows users to directly modulate the performance-latency tradeoff.
|
| Also:
| * Bump up UI buffer to 1k lines.
| * Fix browser source reset. It now also resets preview text.
* Browser source now shows preview text as slightly transparent (yum, 2023-09-09)
| Improves viewer experience.
* Add UI for max speech duration (yum, 2023-09-09)
| Also fix bug when not using previews. Audio buffer no longer grows without bound while there's no speech.
* Constrain log file, UI text field, and transcript sizes (yum, 2023-09-09)
| Log file is constrained to 1 MB and UI to 100-200 lines. 1k lines is too high to keep the UI from lagging. Transcript is constrained to 4k characters.
|
| Also put a 5 ms sleep in the transcription hot path.
* Constrain UI text buffers to 1000 lines (yum, 2023-09-09)
| This keeps memory usage from growing without bound.
* Make min silence duration configurable in UI (yum, 2023-09-09)
|
* Add `lock at spawn` option (yum, 2023-09-09)
| I find it kind of annoying when people wave around a big chatbox so I added the option to have the chatbox be locked in worldspace whenever it's visible. This defaults to on and can be disabled.
* Bugfix: fix preview text enable/disable in browser source (yum, 2023-09-09)
|
* Bugfix: fix process leak in PythonWrapper::InvokeCommandWithArgs (yum, 2023-09-09)
| It now waits up to 10 seconds for a graceful exit and falls back on the equivalent of a SIGKILL. The caller is assumed to have signaled to the process through `in_cb` that an exit is desired.
|
| Also:
| * Fix graceful exit path of transcribe_v2.py.
| * Add toggle to enable/disable preview text. It is enabled by default.
| * Constrain transcription temperature to 0.0. This keeps latency more predictable at the cost of some accuracy.
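The wait-then-kill pattern described here can be sketched in Python. This is an analogue only: the actual fix lives in `PythonWrapper::InvokeCommandWithArgs`, whose code is not shown in this log, and `stop_process` is a hypothetical helper name. As in the commit, the caller is assumed to have already asked the child to exit.

```python
import subprocess


def stop_process(proc, timeout_s=10.0):
    """Wait for `proc` to exit on its own; force-kill it if it does not.

    Returns True for a graceful exit, False if the process had to be killed.
    """
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
        proc.wait()  # reap the child so it does not linger
        return False
```

The final `wait()` after `kill()` matters: without it the killed child is never reaped, which is its own kind of process leak.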
* Bugfix: non-text OSC messages wait for sync window (yum, 2023-09-08)
| This makes them more reliable.
* Bugfix: text data now pages correctly (yum, 2023-09-08)
| The non-text OSC messages were paging in too close to the text OSC messages, breaking the whole system. Now the non-text OSC messages bump back the time at which text OSC messages can begin being sent.
* Only transcribe if VAD detects something (yum, 2023-09-08)
| Also:
| * DiskStream starts returning silence when out of data instead of just stopping.
| * Filter out Whisper segments with high `no_speech_prob` and low `avg_logprob`.
| * Add `saveAudio` function, useful for debugging.
| * Tune vad silence cutoff to 250 ms. This is pretty accurate in benchmarks.
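A rough sketch of the segment filter mentioned above. It assumes segments are dicts carrying `no_speech_prob` and `avg_logprob` keys; the thresholds are illustrative rather than the project's values, and requiring both conditions together is also an assumption, since the commit message does not say how they are combined.

```python
def keep_segment(seg, max_no_speech_prob=0.6, min_avg_logprob=-1.0):
    """Return False for segments that look like silence and are low confidence."""
    likely_silence = seg["no_speech_prob"] > max_no_speech_prob
    low_confidence = seg["avg_logprob"] < min_avg_logprob
    return not (likely_silence and low_confidence)


def filter_segments(segments, **thresholds):
    """Keep only the segments that pass `keep_segment`."""
    return [s for s in segments if keep_segment(s, **thresholds)]
```

For example, a segment with `no_speech_prob` of 0.9 and `avg_logprob` of -1.5 is dropped, while one with the same `no_speech_prob` but an `avg_logprob` of -0.2 is kept.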
* Add keyboard controls to transcribe_v2.py (yum, 2023-09-08)
| Also parameterize `min_silence_duration_ms` in AudioSegmenter. I suspect that for conversational speech, segmenting closer to 500 ms (rather than the 2000 ms default) is a better tradeoff between accuracy and compute efficiency.
* Drop transcription queue (yum, 2023-09-07)
| No longer needed.
* Switch to VadCommitter (yum, 2023-09-07)
| FuzzyRepeatCommitter was approximating this behavior in the best-performing configuration, so switch to it in earnest.
|
| This committer simply commits audio once we detect a long enough gap in speech. That's it!
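The commit-on-gap rule can be sketched as follows. `GapCommitter` is a hypothetical stand-in, not the project's `VadCommitter`; it assumes the caller already has a per-frame speech/no-speech decision from some VAD, and the 250 ms default simply echoes the silence cutoff mentioned elsewhere in this log.

```python
class GapCommitter:
    """Commits buffered audio frames once a long enough silence follows speech."""

    def __init__(self, min_silence_ms=250):
        self.min_silence_ms = min_silence_ms
        self.pending = []          # frames since the last commit
        self.silence_ms = 0        # length of the current run of silence
        self.heard_speech = False  # whether pending contains any speech

    def push(self, frame, is_speech, frame_ms):
        """Feed one frame; returns the committed frames, or None."""
        if is_speech:
            self.heard_speech = True
            self.silence_ms = 0
            self.pending.append(frame)
            return None
        if not self.heard_speech:
            return None  # leading silence: nothing to commit, do not buffer it
        self.pending.append(frame)
        self.silence_ms += frame_ms
        if self.silence_ms >= self.min_silence_ms:
            committed, self.pending = self.pending, []
            self.silence_ms = 0
            self.heard_speech = False
            return committed
        return None
```

Dropping leading silence keeps the pending buffer from growing while nobody is speaking.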
* Put OSC logic into its own thread (yum, 2023-09-05)
| This logic is highly IO bound *and* latency critical so it makes sense to put it into its own thread.
|
| Also:
| * Collector::drop* methods return the dropped audio. Committer includes that audio in commits. Transcription thread holds onto it. When the user segments their speech with a button press, the transcription thread sends the entire combined audio of all commits over to Whisper to be transcribed. This allows us to recover from errors introduced by segmentation.
| * Remove unused animator params
| * Fix issue where clearing the board doesn't completely reset STT state
|
| TODO:
| * Coalescing does not occur for in-place updates. It should.
* Wire transcribe_v2.py into GUI (yum, 2023-09-03)
| Also:
| * Enable SO_REUSEADDR on browser src socket
| * Temporarily add evaluation dependencies to requirements.txt
| * Fix browser src. It's now looking for a prefix that the python app actually uses.
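For reference, enabling `SO_REUSEADDR` in Python looks roughly like this. The sketch assumes a TCP listener on loopback; `make_listener` and its defaults are hypothetical, and the log does not say how the browser src socket is actually created.

```python
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket with SO_REUSEADDR set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(); lets a restarted app rebind the same port
    # while old connections are still in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))  # port 0 asks the OS for any free port
    sock.listen()
    return sock
```

This mostly matters when the app is stopped and restarted quickly, which is common during development.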
* Add threads to transcribe_v2.py (yum, 2023-09-03)
| Four threads:
| * Main thread
| * Transcription (mic -> collector -> whisper -> committer -> pager)
| * VR input
| * Keyboard input
|
| Also:
| * add OscPager class to encapsulate all OSC interactions.
| * bump `last_n_must_match` from 2 to 3 to reduce hallucinations
* Apply subtle compression to audio before transcribing (yum, 2023-09-03)
| This has a slight positive effect on my benchmark.
* Experiment with Collector filters (yum, 2023-09-03)
| Try adding two filters on top of the usual AudioCollector:
| * Minimum length preservation: never report fewer than N seconds worth of audio data. Pad with silence as needed.
| * Volume normalizing: normalize audio volume.
|
| Using my benchmark of 30-second audio clips from 3 speakers (lower is better):
|   length enf + norm = 87.118
|   nothing = 90.917
|   norm = 94.538
|   length = 111.402
|
| Both together are a slight improvement, but independently degrade the result by a lot. I also observed more hallucinations in a conversational pattern when using them vs. not. So I'll phase them out.
|
| I'm still curious about *compression* as opposed to normalization.
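The two filters could look something like the sketch below, written over plain lists of float samples. Both function names are hypothetical, padding at the end of the clip is an assumption (the message does not say where the silence goes), and peak normalization is only one possible reading of "normalize audio volume".

```python
def pad_to_min_length(samples, sample_rate, min_seconds):
    """Append silence so the clip is at least `min_seconds` long."""
    min_len = int(sample_rate * min_seconds)
    missing = max(0, min_len - len(samples))
    return list(samples) + [0.0] * missing


def normalize_peak(samples, target_peak=0.9):
    """Scale the clip so its loudest sample has magnitude `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Note that peak normalization amplifies background noise in quiet clips as much as it amplifies speech.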
* Begin rewriting transcribe.py (yum, 2023-09-02)
| A set of proper interfaces is called for. See #dev-update-spam in discord for drawing of design.
|
| Also add code to mechanically optimize committer parameters using an audio file. Not perfectly repeatable since it depends on the performance characteristics of the machine, but prob better than what we had before (nothing).
* Fix reference to deprecated symbol [v0.14.1] (yum, 2023-09-01)
|
* Bugfix: app no longer hangs if closed while transcribing (yum, 2023-09-01)
| Fix how OnExit callback is wired into GUI. Also make it exit Unity process, if that's going on.
* Various cleanup (yum, 2023-09-01)
|
* Check in app_config.py, remove_audio_source.py (yum, 2023-09-01)
| Oops, I meant to check these in earlier!
* Add `Enable phonemes` toggle to radial menu (yum, 2023-09-01)
| Also:
| * Fully scrub AudioSource references from prefab when not using phonemes.
| * Disable net sync on phoneme params when not using them. When not synced, they don't count against the total memory limit.
| * Use config file in generate_params.py
* Add Unity panel toggle for phonemes (in-game audio indicator) (yum, 2023-09-01)
| If not set, the prefab will have its audio sources removed.
* libtastt.py now uses config file where appropriate (yum, 2023-08-31)
|
* transcribe.py now just reads from config file (yum, 2023-08-31)
| Duplicating config between args and config is a huge pain in the ass to maintain. Now we just launch using the config generated by the UI. ezpz.
* Clean up UI stoi pattern (yum, 2023-08-31)
| wxWidgets encodes text inputs & multiple-choice inputs as strings. I frequently have to convert these into ints & apply a range check. Encapsulate that in a function and use a shitty little ASSIGN_OR_RETURN macro to make the parsing as concise as possible.
|
| Also delete unused WhisperCPP config settings.
* Bugfixes and tweaks (yum, 2023-08-31)
| * Temporarily restore normal process priority. Working on adding a UI option to set STT prio.
| * Give audio indicator phonemes a 1/3 chance to do nothing. Makes result sound a little better imo.
| * Quiet down steamVR thread when steamVR isn't running
| * Fix use of `button_id` and `hand_id` in steamvr.py
| * Increase amount of silence allowed before transcript from 1 to 5 seconds. You want enough buffer to allow for a few full transcripts, else you risk spuriously dropping audio.
| * Enable background loading in audio metadata (required by vrc sdk)
* Deprecate commit similarity threshold (yum, 2023-08-30)
| This is now dynamically set inside transcribe.py. As the buffer grows long, the threshold grows exponentially, keeping the buffer short. The threshold starts small so that transcription starts strict (accurate, slow) and gets looser (inaccurate, fast) as needed.
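One way to read "grows exponentially with buffer length" is the sketch below. The function name, the starting value, the doubling period, and the cap are all hypothetical; the log gives no constants and does not show the real formula in transcribe.py.

```python
def commit_threshold(buffer_seconds, base=0.01, doubling_seconds=5.0, cap=1.0):
    """Tolerance that starts strict and doubles every `doubling_seconds`.

    All constants here are illustrative assumptions.
    """
    # Clamp the exponent so a very long buffer cannot overflow the float.
    exponent = min(max(buffer_seconds, 0.0) / doubling_seconds, 64.0)
    return min(cap, base * 2.0 ** exponent)
```

With these made-up constants the threshold is 0.01 on an empty buffer, about 0.02 after 5 seconds, and saturates at the cap once the buffer is long, which forces a commit and keeps the buffer short.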
* Continue work on in-game audio, revert steamvr.py (yum, 2023-08-30)
| We now play arpeggiated *chords* of vowels instead of one, allowing for a denser audio feedback mechanism.
* Fix in-game audio indicator (yum, 2023-08-29)
| Also fix prefab default size (no longer colossal).
|
| TODO:
| * Add runtime & unity-time toggles
* Switch back to openvr (yum, 2023-08-28)
| openxr doesn't have any notion of background process, making it unusable trash :)
* Put audio feedback into its own thread (yum, 2023-08-25)
| I think this improves the code structure of the controller input thread and leads to some deduplication, so I'm going to keep it.
|
| However, the intended purpose was to decrease lag when pressing buttons, and in that regard it failed. The lag goes all the way down to the input layer, implying that the input thread is not able to consistently run at its intended 100 Hz sample rate.
|
| I suspect that the Python global interpreter lock (GIL) is at fault. Since we can't realistically move all our functionality into one thread in a non-blocking model, I think multiprocessing is the logical choice going forward. Each thread in transcribe.py would become its own process, and pub/sub through some intermediary process sitting in the middle.
* Finish pyopenvr -> pyopenxr migration (yum, 2023-08-25)
| pyopenvr is both deprecated and buggy, so switch to pyopenxr.
* Various shader cleanup (yum, 2023-08-21)
| * remove unused variables, functions, keywords
| * rename fixedN to floatN
| * move min/max raymarch distance to top of ray_march.cginc
| * fix frame emission
* Workaround: use STT mesh to write depth buffer (yum, 2023-08-12)
| There's some bug around using the raymarched world position to write the depth buffer. I wasn't able to find it quickly, so for now, use the original world position to write the depth buffer.
* Improve numerical stability in raymarcher (yum, 2023-08-12)
| Increase units by a factor of 100 to avoid running into numerical instability on 32-bit floats. This comes at zero measured performance cost. This makes a visible difference in quality.
|
| Other minor changes:
| * Raymarching loop tries to get up to 4x closer than MINIMUM_HIT_DISTANCE before bailing out. This comes at no measured performance cost.
| * Convert `fixed` types to `float` in STT_text.cginc.
* Bugfix: Shader no shows text mirrored (yum, 2023-08-12)
|
* Bugfix: ellipsis waits for board (yum, 2023-08-12)
| Regression created while optimizing shader. Performance still around 730 microseconds on my computer with this change.
* Optimize skew-pyramid frame SDF (yum, 2023-08-12)
| Use symmetry to reduce # of distance calculations by 50%. Because the pyramid can be skewed, we can't reduce this by another factor of 2.
* Small raymarching optimizations (yum, 2023-08-11)
| Using PIX to quantify changes, reduce raymarcher runtime from ~1.0 ms to ~850 us.
|
| In order of impact:
| * Tighten raymarch min/max distances
| * Make `in_mirror` check truly branchless
| * Gate ellipsis animation with non-divergent if statement
|
| Everything else is < 10 microseconds of improvement.
* Animate pre-speech ellipsis (yum, 2023-08-11)
| Text box now shows an animated ellipsis prior to first speech.
* Deprecate old parameters (yum, 2023-08-11)
| Deprecate the visual and auditory speech indicators, saving 4 bits across the board. Fixed overhead is now 21 bits.