Maker here. A few technical notes that might be interesting to this crowd:
The Voice Agent chains three models in the browser: Whisper for STT → a local LLM → Kokoro/SpeechT5 for TTS. All inference runs on-device via WebGPU. The latency isn't amazing yet, but the fact that it works at all with zero backend is kind of wild.
The landing page has an auto-playing demo that generates speech locally as soon as you visit — you'll hear it typewrite and speak three sentences. That was important to me because "runs in your browser" sounds like marketing until you actually hear it happen.
Happy to go deep on the WebGPU inference pipeline, model conversion process, or anything else.
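For anyone curious what that chain looks like in code, here's a rough sketch of the three-stage pipeline. This is my guess at the shape, not the actual implementation — it assumes Transformers.js (`@huggingface/transformers`) with its `device: 'webgpu'` option, and the model IDs and the `runVoiceAgent` name are placeholders:

```javascript
// Hypothetical sketch of the STT -> LLM -> TTS chain, assuming Transformers.js.
// Model IDs below are illustrative placeholders, not the app's actual models.
async function runVoiceAgent(audio) {
  // Load the library lazily so this file can be imported without WebGPU present.
  const { pipeline } = await import('@huggingface/transformers');

  // 1. Speech-to-text: transcribe the user's audio with a Whisper variant.
  const stt = await pipeline(
    'automatic-speech-recognition',
    'onnx-community/whisper-tiny.en',       // placeholder model ID
    { device: 'webgpu' }
  );
  const { text: userText } = await stt(audio);

  // 2. LLM: generate a reply from the transcript.
  const llm = await pipeline(
    'text-generation',
    'onnx-community/Qwen2.5-0.5B-Instruct', // placeholder model ID
    { device: 'webgpu' }
  );
  const [{ generated_text: reply }] = await llm(userText, { max_new_tokens: 128 });

  // 3. Text-to-speech: synthesize the reply (SpeechT5-style pipeline shown here).
  const tts = await pipeline(
    'text-to-speech',
    'Xenova/speecht5_tts',                  // placeholder model ID
    { device: 'webgpu' }
  );
  return tts(reply); // audio samples, ready for the Web Audio API
}
```

In a real app you'd load the three pipelines once up front rather than per call, since model initialization dominates latency.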
Looks really dope. Does it use VAD like Silero locally as well?
Performance is really good on my M2 Pro. The model download still takes a while even on my fiber connection, but it's fine.