| Commit message (Collapse) | Author | Age |
| ... | |
| | |
|
| |
|
|
|
| |
Duplicating config between args and config is a huge pain in the ass to
maintain. Now we just launch using the config generated by the UI. ezpz.
|
| |
|
|
|
|
|
|
|
| |
wxWidgets encodes text inputs & multiple-choice inputs as strings. I
frequently have to convert these into ints & apply a range check.
Encapsulate that in a function and use a shitty little ASSIGN_OR_RETURN
macro to make the parsing as concise as possible.
Also delete unused WhisperCPP config settings.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
* Temporarily restore normal process priority. Working on adding a UI
option to set STT prio.
* Give audio indicator phonemes a 1/3 chance to do nothing. Makes result
sound a little better imo.
* Quiet down steamVR thread when steamVR isn't running
* Fix use of `button_id` and `hand_id` in steamvr.py
* Increase amount of silence allowed before transcript from 1 to 5
seconds. You want enough buffer to allow for a few full transcripts,
else you risk spuriously dropping audio.
* Enable background loading in audio metadata (required by vrc sdk)
|
| |
|
|
|
|
|
|
| |
This is now dynamically set inside transcribe.py.
As the buffer grows long, the threshold grows exponentially, keeping the
buffer short. The threshold starts small so that transcription starts
strict (accurate, slow) and gets looser (inaccurate, fast) as needed.
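A minimal sketch of that kind of schedule, not the code in transcribe.py. The function name, the starting value, the doubling period and the cap are all hypothetical and only illustrate "starts small, grows exponentially with buffer length":

```python
import math


def dynamic_threshold(buffer_seconds, base=0.05, doubling_seconds=2.0, cap=1.0):
    """Return a threshold that grows exponentially with buffer length.

    All constants are illustrative assumptions, not values from the project.
    """
    if buffer_seconds <= 0:
        return base
    exponent = buffer_seconds / doubling_seconds
    # Exponent at which the curve would reach the cap. Returning the cap
    # directly past this point also avoids float overflow on huge inputs.
    max_exponent = math.log2(cap / base)
    if exponent >= max_exponent:
        return cap
    return base * 2.0 ** exponent
```

With these placeholder numbers a short buffer gets a strict threshold, and each extra 2 seconds of buffered audio doubles it until it hits the cap.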
|
| |
|
|
|
| |
We now play arpeggiated *chords* of vowels instead of one, allowing for
a denser audio feedback mechanism.
|
| |
|
|
|
|
|
| |
Also fix prefab default size (no longer colossal).
TODO
* Add runtime & unity-time toggles
|
| |
|
|
|
| |
openxr doesn't have any notion of background process, making it unusable
trash :)
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I think this improves the code structure of the controller input thread and
leads to some deduplication, so I'm going to keep it. However, the
intended purpose was to decrease lag when pressing buttons, and in that
regard it failed.
The lag goes all the way down to the input layer, implying that the
input thread is not able to consistently run at its intended 100 Hz
sample rate. I suspect that the Python global interpreter lock (GIL) is
at fault.
Since we can't realistically move all our functionality into one thread
in a non-blocking model, I think multiprocessing is the logical choice
going forward. Each thread in transcribe.py would become its own
process, and pub/sub through some intermediary process sitting in the
middle.
|
| |
|
|
| |
pyopenvr is both deprecated and buggy, so switch to pyopenxr.
|
| |
|
|
|
|
|
| |
* remove unused variables, functions, keywords
* rename fixedN to floatN
* move min/max raymarch distance to top of ray_march.cginc
* fix frame emission
|
| |
|
|
|
|
| |
There's some bug around using the raymarched world position to write the
depth buffer. I wasn't able to find it quickly, so for now, use the
original world position to write the depth buffer.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Increase units by a factor of 100 to avoid running into numerical
instability on 32-bit floats. This comes at zero measured performance
cost. This makes a visible difference in quality.
Other minor changes:
* Raymarching loop tries to get up to 4x closer than
MINIMUM_HIT_DISTANCE before bailing out. This comes at no measured
performance cost.
* Convert `fixed` types to `float` in STT_text.cginc.
|
| | |
|
| |
|
|
|
| |
Regression created while optimizing shader. Performance still around 730
microseconds on my computer with this change.
|
| |
|
|
|
| |
Use symmetry to reduce # of distance calculations by 50%. Because the
pyramid can be skewed, we can't reduce this by another factor of 2.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Using PIX to quantify changes, reduce raymarcher runtime from ~1.0 ms to
~850 us.
In order of impact:
* Tighten raymarch min/max distances
* Make `in_mirror` check truly branchless
* Gate ellipsis animation with non-divergent if statement
Everything else is < 10 microseconds of improvement.
|
| |
|
|
| |
Text box now shows an animated ellipsis prior to first speech.
|
| |
|
|
|
| |
Deprecate the visual and auditory speech indicators, saving 4 bits
across the board. Fixed overhead is now 21 bits.
|
| |
|
|
|
|
|
| |
Not yet done:
* Animator toggle
* OSC integration
|
| |
|
|
|
| |
The chatbox background would not completely disappear at _Emerge = 0, so
two dots would show up when the chatbox spawns.
|
| |
|
|
| |
Performance impact remains to be seen.
|
| |
|
|
|
| |
Remove unused code & cruft. Ray marcher now updates i.worldPos before
executing PBR shading, which fixes some artifacts.
|
| |
|
|
|
| |
No UVs for raymarched geometry yet, so drop textures. Also drop most
old shader settings.
|
| |
|
|
| |
Looks more like a comic-style speech bubble now.
|
| |
|
|
|
| |
Fix up .mat to point to correct textures/shader. Also delete templates
after copying shaders.
|
| |
|
|
| |
Specify file encoding when generating shaders.
|
| |
|
|
| |
* Fix mirror behavior for ray-marched chatbox
|
| | |
|
| |
|
|
|
|
|
|
| |
* Refactor shader code to make development easier. Templates are now
as small as possible.
* Update scaling code. Use Unity scaling instead of a blendshape.
* Check in a fuckton of shader FOSS. Mostly unused.
* Update TaSTT.fbx. Now has 6 faces instead of 2.
|
| |
|
|
|
|
| |
GUI was not correctly managing .meta files, causing two textures to use
the same GUID. Unity would notice and regenerate GUIDs, breaking the
custom chatbox material's texture references.
|
| |
|
|
|
|
|
|
| |
Transcription thread now blocks until microphone thread deletes samples
as requested.
(This is a hacky design; it should use a work queue or something, but I
don't feel like doing that right now.)
|
| |
|
|
|
|
|
|
|
|
| |
It's possible that the user has toggled off transcription while the
algorithm is still working. In this case we should *not* begin
exponential backoff since there's still work to do.
Also:
* Shorten the hot-path sleep from 50ms to 5ms.
* Remove unused variable in SleepInterruptible
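Roughly, the sleep decision described above could look like the sketch below. The function name, the doubling factor and the 1-second ceiling are assumptions; only the 5 ms hot-path sleep comes from this commit:

```python
def next_sleep(current_sleep, paused, work_pending, min_sleep=0.005, max_sleep=1.0):
    """Pick the next sleep duration (seconds) for the transcription loop.

    Backoff only begins when transcription is toggled off AND nothing is
    left to process; otherwise stay on the short hot-path sleep.
    """
    if work_pending or not paused:
        return min_sleep
    # Idle and paused: back off exponentially, up to a (hypothetical) ceiling.
    return min(max_sleep, current_sleep * 2)
```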
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add two buttons: start auto re-generation of Unity assets, and stop.
These start/stop a thread which periodically (every 3 seconds) hashes
the user-provided animator, menu and parameters. When any one of these
changes, it invokes the function to generate Unity assets.
The hash is non-cryptographic, so it's light. The only hit is that we
have to read the entire file contents every few seconds, and compute a
sum across that entire memory region. This is extremely light unless
you're on a spinning platter hard drive with a small cache.
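For illustration, a sketch of the polling idea in Python. This is not the GUI's actual implementation: `zlib.crc32` stands in for whatever non-cryptographic sum is really used, and `inputs_changed`, `watch` and `regenerate` are hypothetical names:

```python
import zlib
from pathlib import Path


def file_checksum(path):
    """Read the whole file and return a cheap, non-cryptographic checksum."""
    return zlib.crc32(Path(path).read_bytes())


def inputs_changed(paths, last_checksums):
    """Return (changed, current_checksums) for the watched files."""
    current = {str(p): file_checksum(p) for p in paths}
    return current != last_checksums, current


def watch(paths, regenerate, stop_event, period_s=3.0):
    """Call regenerate() whenever any watched file's checksum changes.

    stop_event is expected to be a threading.Event; setting it ends the
    loop (this is what a "stop" button would do).
    """
    _, last = inputs_changed(paths, {})
    # Event.wait returns False on timeout, True once the event is set.
    while not stop_event.wait(period_s):
        changed, last = inputs_changed(paths, last)
        if changed:
            regenerate()
```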
Still seeing the bug where the material drops ref to the font bitmaps.
Probably need to update the .mat using the guids in the bitmap .meta
files.
|
| |
|
|
|
| |
Avoid deleting bitmap .meta files so that once the user sets up their
shader, it doesn't break.
|
| |
|
|
| |
Useful for projects with multiple avatars with different animators.
|
| |
|
|
|
|
| |
The paths you enter in the Unity panel (animator, menu, params, and
assets folder) are saved in the app config, but were not populated
correctly on app restart or pane redraw. Now they are.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we commit a transcription, we drop the corresponding audio data.
Audio data is represented as a list of chunks. Each chunk contains a few
hundred samples of audio data, representing O(10ms) of audio.
If we want to drop a few seconds of data, this mostly means deleting
many whole chunks of audio. There's usually one chunk at the boundary
where we want to drop only a portion of the audio data.
Instead of slicing away that part of the chunk, which would change its
length, this change zeroes it out. This preserves the assumption that
each chunk has the same temporal length.
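A small sketch of the zero-out approach, assuming chunks are equal-length `bytes` objects and that audio is dropped from the front of the buffer. `drop_leading_audio` is a hypothetical name, not a function from the repo:

```python
def drop_leading_audio(chunks, n_bytes):
    """Drop the first n_bytes of audio from a list of equal-length chunks.

    Whole chunks are deleted. The chunk straddling the cut keeps its
    length: the dropped part is overwritten with zeros (silence) instead
    of being sliced away. For 16-bit samples n_bytes should be even.
    """
    out = list(chunks)
    while out and n_bytes >= len(out[0]):
        n_bytes -= len(out[0])
        out.pop(0)
    if out and n_bytes > 0:
        first = out[0]
        # bytes(n) is n zero bytes, so the chunk length is unchanged.
        out[0] = bytes(n_bytes) + first[n_bytes:]
    return out
```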
|
| |
|
|
|
|
| |
We used to drop entire frames only, leading to situations where more
audio is dropped than desired. Now we drop frames down to the precision
of the individual audio sample requested.
|
| |
|
|
|
|
|
|
|
| |
Mostly updating roadmap stuff. Non-VRC use cases are "complete" since I
was mostly targeting streaming. The ability to type into arbitrary text
fields is still somewhat nascent & could be improved.
Also update some other random stuff to be more up to date. KillFrenzy
Avatar Text is now MIT, pog!
|
| |
|
|
|
|
|
|
| |
Common hallucinations sneak in around -0.9 avg_logprob.
Also:
* Limit temperatures to just 0.0. Multiple values cause latency to
occasionally spike.
|
| |
|
|
|
|
|
| |
Surprisingly, these args do not cause transcribe() to omit those
segments from the result, so we have to manually filter them out.
Hallucinated phrases generally have one or both of these params set
high.
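A hedged sketch of what the manual filtering might look like. The commit body does not name the two parameters, so the per-segment fields (`no_speech_prob`, `compression_ratio`) and the limits here are assumptions chosen only for illustration:

```python
# Hypothetical upper limits; a segment is dropped if any field exceeds its limit.
DEFAULT_LIMITS = {"no_speech_prob": 0.6, "compression_ratio": 2.4}


def filter_segments(segments, limits=None):
    """Return only the segments whose scores stay at or below every limit.

    `segments` is assumed to be a list of dicts, one per transcribed segment.
    """
    limits = DEFAULT_LIMITS if limits is None else limits
    kept = []
    for seg in segments:
        if any(seg.get(field, 0.0) > limit for field, limit in limits.items()):
            continue  # likely hallucination: at least one score is too high
        kept.append(seg)
    return kept
```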
|
| |
|
|
| |
Each sample of audio data is a 16-bit int, not an 8-bit int.
|
| |
|
|
|
| |
Each chunk of audio samples should be encoded as a binary string, not as
a list.
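As an illustration of the encoding, assuming signed 16-bit samples (as the neighbouring fix in this log states) in little-endian byte order, which is an assumption. The helper names are hypothetical:

```python
import struct


def samples_to_bytes(samples):
    """Pack a list of signed 16-bit ints into one binary string."""
    return struct.pack(f"<{len(samples)}h", *samples)


def bytes_to_samples(data):
    """Unpack a binary string back into a list of signed 16-bit ints."""
    count = len(data) // 2  # two bytes per sample; a trailing odd byte is ignored
    return list(struct.unpack(f"<{count}h", data[: count * 2]))
```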
|
| |
|
|
|
|
|
|
|
|
|
|
| |
New commit logic would reduce buffer to a size smaller than this,
causing it to hallucinate things like:
* "See you next time!"
* "Thanks for watching!"
* "Bye!"
The hope is that by keeping the buffer at least 5.0 seconds long, as
described in the paper, this will cut down on these events.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Create a simple server with 3 endpoints:
* /create_session: Create a session and return its identifier.
* /set_transcript: Update a session's transcript.
* /get_transcript: Fetch a session's transcript.
Right now the session ID provides authentication *and* authorization.
There is no public/private ID so you have to trust whoever you share
your ID with.
IDs are long and generated by the server, so it should be somewhat
secure against low-effort hacking.
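The session model can be sketched as an in-memory store like the one below. This is not the server's code: the class, its method names (mirroring the three endpoints) and the 32-byte ID length are assumptions, and the HTTP layer is left out:

```python
import secrets


class SessionStore:
    """In-memory transcripts keyed by a long, server-generated session ID."""

    def __init__(self):
        self._transcripts = {}

    def create_session(self):
        # The ID is the only credential: anyone holding it can read and write.
        session_id = secrets.token_urlsafe(32)
        self._transcripts[session_id] = ""
        return session_id

    def set_transcript(self, session_id, transcript):
        if session_id not in self._transcripts:
            raise KeyError("unknown session")
        self._transcripts[session_id] = transcript

    def get_transcript(self, session_id):
        # Raises KeyError for an unknown session ID.
        return self._transcripts[session_id]
```

Because the ID grants both read and write access, separating a public (read-only) ID from a private one would need a second key per session.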
Other updates:
* Drop whisper_requirements.txt - no longer needed.
* Vendor curl to make it easier to interact with the server.
TODO:
* Fuzz test the server.
|
| |
|
|
| |
Forgot to check this in, oops!
|
| |
|
|
|
|
|
|
| |
Circle goes red when speaking, grey when done. Ideally it would be in
the top right portion of the browser source, but this is a good start.
Also, hard-cap transcripts to 4096 chars. This prevents the STT from
lagging during long sessions.
|
| |
|
|
| |
... also print out "Ready!" when the STT is done loading.
|
| |\
| |
| | |
Set GPU device index in whisper model
|