Local Speech Recognition

Why Build This

I wanted voice-to-text without:

Cloud API dependencies
Monthly subscription fees
Privacy concerns about audio being sent elsewhere
Internet connectivity requirements

The goal: completely offline, always available, auto-paste anywhere.

The Solution

OpenAI’s Whisper model running locally via faster-whisper (optimized for CPU inference).

Implementation

Push-to-talk system in Python:

Hold Alt key → recording starts
Speak → audio captured locally
Release Alt → transcription runs, auto-pastes via clipboard

Works across all Windows applications. No configuration needed.

Model Choice

Using Whisper small (484MB):

Good accuracy for general use
Fast transcription (~2-3 seconds for 10-second audio)
Runs comfortably on CPU (no GPU required)

Larger models available (medium, large-v3) if accuracy matters more than speed.

Technical Details

Components:

faster-whisper - Optimized Whisper inference (CTranslate2-based)
sounddevice - Audio capture
keyboard - Hotkey detection (Alt key)
pyperclip - Clipboard integration for auto-paste

Windows encoding fix:

Added UTF-8 wrapper for console output (Windows console uses cp1252 by default, can’t handle emojis in output)

Desktop integration:

Created .lnk shortcut for taskbar pinning
Runs in background, minimal resource usage when idle

What I Learned

Local AI is practical: No need to reach for cloud APIs for everything. Whisper models are small enough to run locally, accurate enough for daily use.

Python ecosystem is mature: Finding the right libraries (faster-whisper vs openai-whisper) makes a huge difference in performance.

Windows quirks: Console encoding, hotkey detection, clipboard access—each has its own edge cases. Testing reveals them quickly.

Current Status

Fully functional. Fixed UTF-8 encoding bug. Ready for daily use.

Possible enhancements:

Voice activity detection (auto-start recording on speech)
Integration with LM Studio (voice → LLM → voice response)
Custom wake word detection

But the core use case works: speak, get text, move on.

Technologies: Python, Whisper (OpenAI), faster-whisper, sounddevice, keyboard Model: Whisper small (484MB) Status: Operational Privacy: 100% local, zero cloud calls