| author | yum <yum.food.vr@gmail.com> | 2023-02-26 18:57:07 -0800 |
|---|---|---|
| committer | yum <yum.food.vr@gmail.com> | 2023-02-26 19:49:40 -0800 |
| commit | 00a0350a0218cf4b03d14acac84110bc1e882bee (patch) | |
| tree | 433f8907ab398c9a53f79060a7f248f5352bbe51 | |
| parent | a6de8f9654c90c51713e77791ff7155f34d27c21 (diff) | |
Frames with no VAD are shortened, not dropped
Previously, when the PCM buffer held at least
captureParams.dropStartSilence samples and VAD returned a "no voice"
verdict, the entire PCM buffer was cleared. As a result, when VAD
segmented speech, words right after the segmentation window were
frequently dropped. This change removes only a fixed 1024-sample prefix
from the PCM buffer (the VAD buffers are still cleared), so the
transcription algorithm keeps the "leading" frames that precede the
frames which triggered VAD. This reduces cases where words are omitted
in the middle of long statements.
| -rw-r--r-- | Whisper/MF/AudioBuffer.h | 9 |
| -rw-r--r-- | Whisper/Whisper/ContextImpl.capture.cpp | 2 |
2 files changed, 10 insertions, 1 deletion
diff --git a/Whisper/MF/AudioBuffer.h b/Whisper/MF/AudioBuffer.h
index 93a6ace..b12dff4 100644
--- a/Whisper/MF/AudioBuffer.h
+++ b/Whisper/MF/AudioBuffer.h
@@ -43,5 +43,14 @@ namespace Whisper
 			if( !stereo.empty() )
 				stereo.resize( len * 2 );
 		}
+
+		void dropFirst(size_t len)
+		{
+			assert(len <= mono.size());
+			size_t remainder = mono.size() - len;
+			auto tmp = std::vector<float>(remainder);
+			memcpy(tmp.data(), mono.data() + len, remainder);
+			mono = std::move(tmp);
+		}
 	};
 }
\ No newline at end of file
diff --git a/Whisper/Whisper/ContextImpl.capture.cpp b/Whisper/Whisper/ContextImpl.capture.cpp
index 0062c2a..0b213b8 100644
--- a/Whisper/Whisper/ContextImpl.capture.cpp
+++ b/Whisper/Whisper/ContextImpl.capture.cpp
@@ -242,7 +242,7 @@ namespace
 			if( newSamples < captureParams.dropStartSilence )
 				return S_OK;
 
-			pcm.clear();
+			pcm.dropFirst(1024);
 			vad.clear();
 			pcmStartTime = nextSampleTime;
 			return S_OK;
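
A note on `dropFirst` as committed: the third argument of `memcpy` is a byte count, so passing `remainder` (an element count) copies only `remainder` bytes. Where `sizeof(float)` is 4, that is roughly a quarter of the retained samples, and the rest of `tmp` stays zero-initialized. The method also leaves the `stereo` vector untouched. The following is a minimal sketch of a variant that addresses both points, not the project's implementation. `SketchAudioBuffer` is a hypothetical stand-in for the real class, the interleaved two-floats-per-sample stereo layout is an assumption inferred from `stereo.resize( len * 2 )`, and clamping in place of the `assert` is a design choice made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Whisper::AudioBuffer; only the two sample vectors are modeled.
struct SketchAudioBuffer
{
	std::vector<float> mono;
	std::vector<float> stereo; // assumed interleaved, 2 floats per sample; may be empty

	// Remove up to `len` samples from the front of the buffer, keeping the rest.
	void dropFirst( size_t len )
	{
		// Clamp rather than assert, so a short buffer is simply emptied.
		len = std::min( len, mono.size() );
		// vector::erase works in elements, which avoids the bytes-vs-elements pitfall of memcpy.
		mono.erase( mono.begin(), mono.begin() + static_cast<std::ptrdiff_t>( len ) );
		if( !stereo.empty() )
		{
			const size_t n = std::min( len * 2, stereo.size() );
			stereo.erase( stereo.begin(), stereo.begin() + static_cast<std::ptrdiff_t>( n ) );
		}
	}
};
```

If keeping the `memcpy` form is preferred, multiplying the count by `sizeof(float)` would fix the byte-count issue on its own.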
