Frames with no VAD are shortened, not dropped

Previously, once the PCM buffer reached a length >= captureParams.dropStartSilence, a "no voice" VAD verdict cleared the PCM buffer entirely. As a result, when VAD segmented speech, words right after the segmentation window were frequently dropped. This change instead removes only a fixed-size prefix (the first 1024 samples) from the PCM buffer, while still clearing the VAD buffers, so the transcription algorithm keeps the "leading" frames that precede the frames which trigger VAD. This reduces cases where words are omitted in the middle of long statements.
author: yum <yum.food.vr@gmail.com> 2023-02-26 18:57:07 -0800
committer: yum <yum.food.vr@gmail.com> 2023-02-26 19:49:40 -0800
commit: 00a0350a0218cf4b03d14acac84110bc1e882bee (patch)
tree: 433f8907ab398c9a53f79060a7f248f5352bbe51
parent: a6de8f9654c90c51713e77791ff7155f34d27c21 (diff)
2 files changed, 10 insertions, 1 deletions
diff --git a/Whisper/MF/AudioBuffer.h b/Whisper/MF/AudioBuffer.h
index 93a6ace..b12dff4 100644
--- a/Whisper/MF/AudioBuffer.h
+++ b/Whisper/MF/AudioBuffer.h
@@ -43,5 +43,14 @@ namespace Whisper
 			if( !stereo.empty() )
 				stereo.resize( len * 2 );
 		}
+
+		void dropFirst(size_t len)
+		{
+			assert(len <= mono.size());
+			size_t remainder = mono.size() - len;
+			auto tmp = std::vector<float>(remainder);
+			memcpy(tmp.data(), mono.data() + len, remainder * sizeof(float));
+			mono = std::move(tmp);
+		}
 	};
 }
 \ No newline at end of file
diff --git a/Whisper/Whisper/ContextImpl.capture.cpp b/Whisper/Whisper/ContextImpl.capture.cpp
index 0062c2a..0b213b8 100644
--- a/Whisper/Whisper/ContextImpl.capture.cpp
+++ b/Whisper/Whisper/ContextImpl.capture.cpp
@@ -242,7 +242,7 @@ namespace
 			if( newSamples < captureParams.dropStartSilence )
 				return S_OK;
 
-			pcm.clear();
+			pcm.dropFirst(1024);
 			vad.clear();
 			pcmStartTime = nextSampleTime;
 			return S_OK;
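
For reference, the prefix-drop operation can also be written without a temporary vector or a manual byte count. The following is a minimal, self-contained sketch, not the project's code: MonoBuffer is a hypothetical stand-in for the mono part of the audio buffer, and clamping `len` to the buffer size (instead of asserting) is an assumption made here so that a short buffer is simply emptied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the mono PCM buffer; not the project's AudioBuffer.
struct MonoBuffer
{
	std::vector<float> mono;

	// Remove up to `len` samples from the front, keeping the tail in order.
	// Assumption of this sketch: clamp rather than assert on oversized `len`.
	void dropFirst( size_t len )
	{
		const size_t n = std::min( len, mono.size() );
		mono.erase( mono.begin(), mono.begin() + static_cast<std::ptrdiff_t>( n ) );
	}
};
```

Because std::vector::erase works in elements rather than bytes, there is no element-size multiplication to get wrong. The sketch handles only mono samples; a buffer that also stores interleaved stereo data would need the matching 2 * len samples removed as well.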