Sprint 1 — Model Selection

Evaluated Whisper and Deepgram for real-time transcription. Whisper won for accuracy on technical audio, but latency needs optimization.

The first real decision for the speech-to-text app: which model to use. The two main contenders were OpenAI’s Whisper (self-hosted) and Deepgram’s API. Both handle English well, but the use case here is transcribing technical conversations — code reviews, architecture discussions, debugging sessions — where accuracy on jargon matters more than raw speed.

Testing was straightforward. Recorded a few sample clips of me talking through code, ran them through both systems, and compared the output against a hand-corrected transcript. Whisper nailed the technical terms almost every time. Deepgram was faster but stumbled on things like “FastAPI,” “WebSocket,” and variable names. For this project, accuracy wins.
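The comparison can be made quantitative with word error rate (WER) against a hand-corrected reference. A minimal sketch, with illustrative transcripts rather than the actual test clips:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via token-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical outputs, mimicking the kind of jargon splitting observed
reference    = "the fastapi websocket handler closes the connection"
whisper_out  = "the fastapi websocket handler closes the connection"
deepgram_out = "the fast api web socket handler closes the connection"

print(wer(reference, whisper_out))   # 0.0
print(round(wer(reference, deepgram_out), 2))
```

Splitting “FastAPI” into “fast api” costs a substitution plus an insertion, so jargon mistakes hit WER disproportionately hard, which matches what the ear notices.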

The trade-off is latency. Whisper running locally on a decent GPU gives about 2-3 seconds of delay for a 10-second audio chunk. That’s workable but not great for real-time use. Next sprint focuses on chunked streaming and buffer optimization to get that number down. The goal is sub-second perceived latency by breaking audio into smaller overlapping windows.
