| Commit message (Collapse) | Author | Age |
| |
GUI can now download all TaSTT dependencies and install them into a
virtual environment.
* Add buttons to check embedded python version & install dependencies
* Add class to wrap interacting with embedded Python
* Put all TaSTT python scripts into a folder
|
Shave off ~500ms due to locking. Acquiring a threading.Lock takes
hundreds of milliseconds and the global interpreter lock already takes
care of most crashy race conditions, so just remove the locks.
Avoid writing audio to disk, saving more time (and disk wear / IOPS).
Add basic profiling to transcribe().
Omit timestamps, since we don't use them (maybe we should!)
Shorten noise indicators to 350ms
The whisper behavior where it repeats tokens causes certain
transcriptions to take many seconds. I haven't thought about how to fix
this, yet.
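
A minimal sketch of the kind of basic profiling described above, assuming a per-step timing dictionary. The names `timed` and `transcribe_profiled` are hypothetical, and the model call is replaced by a placeholder; the audio chunks are joined in memory rather than written to disk.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def transcribe_profiled(frames):
    """Stand-in for transcribe(): returns (text, per-step timings in seconds)."""
    timings = {}
    with timed("join_frames", timings):
        # Keep the audio in memory instead of round-tripping through a file.
        audio = b"".join(frames)
    with timed("model", timings):
        # Placeholder for the speech-to-text call (hypothetical).
        text = f"<{len(audio)} bytes>"
    return text, timings
```

Printing `timings` after each call is enough to see which step dominates before reaching for a real profiler.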
|
Now we have a visual and auditory indicator for transcription. The
auditory indicator is only heard by the user, and can be used to reset
the state of the board prior to displaying.
|
Use a single indicator with 3 states:
1. green: actively speaking
2. orange: waiting for paging
3. red: up-to-date
Use slightly nicer colors.
|
This helps with temporal stability in long-running transcriptions, and
lets us get rid of that hack where we refuse to update old pages.
|
Coarse locking was causing audio frames to drop, severely degrading
transcription quality.
We really need a spoken word integration test.
|
Press joystick once to start recording, again to stop. When you start
recording, any previous text on the board is cleared.
Add 2 visual indicators: one to indicate speech, another to indicate
that audio is paging.
|
Works a little better on longer transcriptions while maintaining the
same improved performance on short transcriptions.
We really need a benchmark to evaluate performance mechanically.
|
After re-reading the paper, I noticed that they apply a couple of
optimizations I wasn't using. Use the top-level `whisper.transcribe`
method, which is a little slower, but more accurate than the one I was
using.
Although this method is slower, it has better temporal stability due to
the increased quality, which I think should make for an overall more
responsive UX. Lower transcription quality means the paging layer has to
waste time updating earlier cells.
Also, drop the auto-commit stuff and go back to string stitching. I
think it's better to let the user manually commit. A rework of the hand
controls is probably coming soon.
Finally, update README.
|
Board would lock up if you reset after the first page. osc_ctrl.clear()
was assigning the wrong member :)
Tweak continuous transcription logic: now we only commit if the
transcription remains identical for N seconds.
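
A rough sketch of the "commit only if identical for N seconds" rule, under the assumption that the caller passes in a timestamp with each new transcription. The class name `StabilityGate` and its fields are hypothetical, not taken from transcribe.py.

```python
class StabilityGate:
    """Signals a commit once the text has stayed unchanged for hold_seconds."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.candidate = None  # most recent transcription seen
        self.since = None      # time at which `candidate` first appeared

    def update(self, text, now):
        """Feed the latest transcription; return True when it should commit."""
        if text != self.candidate:
            # Any change restarts the clock.
            self.candidate = text
            self.since = now
            return False
        return now - self.since >= self.hold_seconds
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the rule easy to test without sleeping.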
|
* Increase no speech probability threshold. This is what was preventing
short transcriptions from working. We rely more on the avg logprob
filter now.
* Remove string matching logic from transcribe. Now when we get 2
consecutive identical transcriptions, we commit the transcription.
This *could* cause words to get cut off but in practice it doesn't seem to
happen.
* Fix SteamVR joystick click detection. Moving the joystick would also
fire the event, which is not correct.
* Combine locks in transcribe.py.
* Remove "clear" vocal control.
* osc_ctrl.clear() resets last_message_encoded
* Remove osc_ctrl.sendMessage (unused)
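
A hedged sketch of the two filters mentioned in the first bullet. The field names `text`, `avg_logprob` and `no_speech_prob` mirror the segment dictionaries returned by `whisper.transcribe`; the helper names and the threshold values are placeholders, not the values used in transcribe.py.

```python
def keep_segment(segment, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Return True if a segment looks like real speech.

    Both thresholds are illustrative assumptions.
    """
    if segment["avg_logprob"] < logprob_threshold:
        # Low average token log-probability: likely a garbage transcription.
        return False
    # A permissive no-speech threshold lets short utterances through.
    return segment["no_speech_prob"] <= no_speech_threshold


def filter_text(segments, **thresholds):
    """Join the text of every segment that passes keep_segment()."""
    kept = [s["text"] for s in segments if keep_segment(s, **thresholds)]
    return "".join(kept).strip()
```

Raising `no_speech_threshold` shifts most of the rejection work onto the log-probability check, which is the trade-off the bullet describes.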
|
Begin auditing dependencies' licenses.
|
Add `matchStrings`, which does basically the same thing as
`matchStringList` except it doesn't split the input at space boundaries.
I think this should work better for Japanese and Chinese, since they
don't use spaces.
Doesn't seem to cause any accuracy regressions for English.
Also update the README.
|
Each character is now addressed with 2 bytes instead of 1. The number of
bytes per character is configured in (I think) exactly one spot, so
increasing or decreasing this is trivial. English speakers can just set
it to 1.
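
A sketch of what a configurable bytes-per-character encoding could look like. This assumes each character is addressed by its Unicode code point split into big-endian bytes; the real board may use a different character table, and `BYTES_PER_CHAR`, `encode_char` and `encode_message` are hypothetical names.

```python
BYTES_PER_CHAR = 2  # hypothetical single point of configuration


def encode_char(ch, bytes_per_char=BYTES_PER_CHAR):
    """Split a character's code point into `bytes_per_char` big-endian bytes."""
    code = ord(ch)
    if code >= 256 ** bytes_per_char:
        raise ValueError(f"{ch!r} does not fit in {bytes_per_char} byte(s)")
    return list(code.to_bytes(bytes_per_char, "big"))


def encode_message(text, bytes_per_char=BYTES_PER_CHAR):
    """Encode every character of `text` as a list of byte values."""
    return [encode_char(ch, bytes_per_char) for ch in text]
```

With one byte per character only 256 glyphs are addressable, which is enough for English; two bytes cover 65536 slots.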
The animator seems a little unstable; if I leave my character in a
public for a while, the board becomes unresponsive. Oh well.
* Check in fonts. Did this so users don't have to remember to set the
resolution or to disable mipmaps.
|
Apply the heuristics described in the Whisper paper. This dramatically
improves silence detection as well as overall transcription quality.
I was able to read the entire demo script at speed without any serious
transcription inaccuracies.
Field testing is TODO.
|
Stitching now uses a 6-word sliding window instead of a 4-word one. Seems
to dramatically improve transcription quality.
|
When the user says 'over', the board will stop displaying new
transcriptions until the user says 'clear'.
* Remove the control thread from transcribe.py
|
The old clear mechanism would write an empty cell in every layer,
which would take (0.3 seconds) * (11 layers) == about 3 seconds.
The new mechanism drives an animation which overwrites every character
slot simultaneously, taking only 0.1 seconds. A nice ~30x speedup.
* Fix the transcription exponential backoff logic. Saying new things
will reset the delay to the minimum again.
* Clearing the board will also reset the transcription delay back to
the minimum.
* Tune the noise detection minimum to 0.2 instead of 0.1. Speaking
softly into the mic seems to fail to exceed the 0.1 threshold pretty
often.
|
Transcription stitching now occurs in word space, rather than in text
space. This avoids problems where we accidentally duplicate or delete
letters in the middle of words.
Factor out stitching into its own module and add a small handful of
test cases. Hopefully if we hit problems in production, we can just
grow this list and avoid regressions if we reimplement.
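
A simplified sketch of stitching in word space, not the module's actual algorithm. It assumes whitespace-separated words and an exact match on the tail of the old transcription; `stitch_words` and `max_window` are hypothetical names.

```python
def stitch_words(old, new, max_window=6):
    """Append to `old` the part of `new` that follows their overlap.

    Tries the longest tail of `old` (up to `max_window` words) first and
    uses the leftmost place it occurs in `new`. Falls back to plain
    concatenation when no overlap is found.
    """
    old_w, new_w = old.split(), new.split()
    for k in range(min(max_window, len(old_w), len(new_w)), 0, -1):
        tail = old_w[-k:]
        for i in range(len(new_w) - k + 1):
            if new_w[i:i + k] == tail:
                return " ".join(old_w + new_w[i + k:])
    return " ".join(old_w + new_w)
```

Because whole words are compared, "cat" never matches the start of "category", so letters cannot be duplicated or dropped mid-word. Cases like these are natural candidates for the test list.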
|
The heuristics now occur in the filtered word space, so punctuation
and casing changes won't confound them.
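
One way to build such a filtered word space, as a sketch: lowercase the text and strip ASCII punctuation before splitting. The name `filter_words` is hypothetical, and `string.punctuation` covers ASCII only, so other scripts would need their own rules.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def filter_words(text):
    """Return lowercase words with ASCII punctuation removed."""
    return text.lower().translate(_STRIP_PUNCTUATION).split()
```

Two transcriptions that differ only in casing or punctuation then compare equal, so the heuristics see them as unchanged.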
|
When the user pauses their speech for an extended period of time, the
transcription engine will sleep for progressively longer intervals,
up to 1.5 seconds between transcriptions. This allows us to reduce
idle resource consumption.
To enable responsive transcription while the user is speaking actively,
we reset the sleep duration to the minimum whenever a change is
detected.
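
The backoff described above can be sketched as follows. Only the 1.5 second ceiling comes from this message; the minimum delay, the doubling factor and the `Backoff` class name are assumptions.

```python
class Backoff:
    """Sleep-interval policy: grow while idle, reset on any change."""

    def __init__(self, minimum=0.1, maximum=1.5, factor=2.0):
        self.minimum = minimum
        self.maximum = maximum
        self.factor = factor
        self.delay = minimum

    def next_delay(self, changed):
        """Return how long to sleep before the next transcription."""
        if changed:
            # The user is speaking: go back to the most responsive setting.
            self.delay = self.minimum
        else:
            self.delay = min(self.delay * self.factor, self.maximum)
        return self.delay
```

A transcription loop would call `time.sleep(backoff.next_delay(changed))` once per iteration.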
|
While the board is clearing, you can keep talking, and it will be
rendered when the board finishes clearing.
* bugfix: STT only beeps when it's out
|
Also adjust continuous transcription algorithm to use leftmost minimum
instead of rightmost. This prevents some cases where we generate longer
and longer text.
|
Algorithm:
* look at last 20 chars of last committed transcription
* scan new transcription using 10-char sliding window
* find spot where distance is minimized
* stitch two messages together
Thus we're able to maintain a continuously growing transcription
without having to feed the AI more than 30 seconds of data at a
time. Seems to work reasonably well in bench tests.
Also fix silence detection. AI exposes a probability that nothing
was said. Hand-pick a probability of 0.1. Sometimes the AI still
goes sicko mode with this setting but going higher occasionally
results in no transcription.
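
One possible reading of the stitching steps above, as a sketch. It assumes Hamming distance between equal-length windows (the message does not name the metric) and compares every 10-char window in the last 20 chars of the old text against every 10-char window of the new text. `stitch` and `hamming` are hypothetical names.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def stitch(old, new, tail=20, window=10):
    """Join `old` and `new` at the window pair with the smallest distance."""
    if len(old) < window or len(new) < window:
        return old + new  # too short to align; plain concatenation
    start = max(0, len(old) - tail)
    best = None  # (distance, index in old, index in new)
    for i in range(start, len(old) - window + 1):
        for j in range(len(new) - window + 1):
            d = hamming(old[i:i + window], new[j:j + window])
            # Strict '<' keeps the first (leftmost) minimum found.
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    return old[:i + window] + new[j + window:]
```

Note that this always stitches at the best-matching spot, even when the two texts do not really overlap, so a distance cutoff would be a sensible addition.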
|
* Implement basic board toggle using new transition logic
* Metadata can now restore from file
|
Messages longer than a board will automatically write over the top.
TODO:
* Real cell-based message diffing
* Cumulative transcription
  * this would completely mitigate the effects of trim events
|
Add a third heuristic. If the transcription is relatively long and the
first bit differs from the previous transcription, immediately
overwrite. Because the transcription is long, it's a bit less likely to
be a complete mistranscription.
|
Slightly improve temporal stability and responsiveness at the cost of
limiting to a 30 second recording.
Before committing to a transcription, wait for two consecutive
transcriptions such that they are identical, or the former is a
prefix of the latter. This helps with temporal stability by eliminating
most one-off wildly inaccurate transcriptions.
Also make osc_ctrl.sendMessageLazy a little lazier, limiting it to 2
consecutive non-empty cells per call. This allows us to recover from
mistranscriptions faster.
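
The commit condition in the second paragraph can be sketched as a small predicate. The name `should_commit` is hypothetical; treating an empty or missing previous transcription as "do not commit" is an assumption.

```python
def should_commit(previous, current):
    """True if `previous` equals `current` or is a prefix of it."""
    if not previous or not current:
        # Nothing to compare against yet (assumed behavior).
        return False
    # str.startswith covers both the identical and the prefix case.
    return current.startswith(previous)
```

A one-off wild transcription rarely shares a prefix with its neighbors, so it never satisfies the check and is simply skipped.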
|
Also:
* Check in toggle on/off animations
* Add toggle parameter
* libunity bug: getUniqueId() was calling allocateId() incorrectly
* Remove osc_ctrl `client` global
* Fix transcribe.py text encoding
|
* Add VRLabs' World Constraint as a submodule
* Add animations for world constraint
* Add toggles for board
* Add libunity.py (no content yet)
* Support >30s transcription
* Add board FBX
|
Using OpenAI's whisper neural network, we can do local STT. Transcription
quality is good, system resource usage is minimal (1 GB VRAM), and latency
is much lower than cloud-based transcription.
* Add transcribe.py
  * Creates 3 threads:
    * One saves mic audio to a buffer
    * One passes mic audio to the STT
    * One sends the transcribed text to the board
  * Main thread listens for input. Press enter to start a new message.
* Add osc_ctrl.sendMessageLazy, a simple diff-based message sending utility.
  * A little complexity: it only sends 1 empty cell per call, allowing us to
    quickly say new things without having to wait for the whole buffer to
    clear.
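
A sketch of the lazy diff idea, assuming the board is a list of string cells and an empty string is an empty cell. It only computes which cells to send and does no OSC I/O; `lazy_diff` and `max_empty` are hypothetical names, not the signature of `osc_ctrl.sendMessageLazy`.

```python
def lazy_diff(previous, target, max_empty=1):
    """Return (updates, new_state) for one lazy send.

    `updates` is a list of (cell_index, cell_text) pairs. Changed cells
    with content are always sent; at most `max_empty` cells are cleared
    per call, so stale text is erased gradually.
    """
    n = max(len(previous), len(target))
    state = list(previous) + [""] * (n - len(previous))
    goal = list(target) + [""] * (n - len(target))
    updates = []
    empties = 0
    for i, (old, new) in enumerate(zip(state, goal)):
        if old == new:
            continue
        if new == "":
            if empties >= max_empty:
                continue  # leave this stale cell for a later call
            empties += 1
        updates.append((i, new))
        state[i] = new
    return updates, state
```

Calling it repeatedly with the same target eventually clears every leftover cell, while new text at the front of the board shows up on the first call.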
|