Local Speech Recognition

Why Build This

I wanted voice-to-text without:

  • Cloud API dependencies
  • Monthly subscription fees
  • Privacy concerns about audio being sent elsewhere
  • Internet connectivity requirements

The goal: completely offline, always available, auto-paste anywhere.

The Solution

OpenAI’s Whisper model running locally via faster-whisper, a CTranslate2-based reimplementation that is fast enough for CPU-only inference.

Implementation

Push-to-talk system in Python:

  • Hold Alt key → recording starts
  • Speak → audio captured locally
  • Release Alt → transcription runs, auto-pastes via clipboard

Works in any Windows application that accepts pasted text. No per-application configuration needed.
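The hold, speak, release flow above can be sketched as a small loop. This is an illustration rather than the project's actual code: `handle_utterance`, `run_forever`, and the three callables passed in (`record`, `transcribe`, `paste`) are hypothetical names, and the real recorder, model, and clipboard helpers are assumed to be supplied by the caller.

```python
from typing import Callable, Sequence


def handle_utterance(
    record: Callable[[], Sequence[float]],
    transcribe: Callable[[Sequence[float]], str],
    paste: Callable[[str], None],
) -> str:
    """Run one push-to-talk cycle and return the pasted text ("" if none)."""
    audio = record()                   # blocks while the hotkey is held
    text = transcribe(audio).strip()   # runs once the key is released
    if text:                           # skip the paste when nothing was recognized
        paste(text)
    return text


def run_forever(record, transcribe, paste) -> None:
    """Repeat the cycle until the process is interrupted (Ctrl+C)."""
    while True:
        handle_utterance(record, transcribe, paste)
```

Passing the three steps in as functions keeps the loop independent of any particular audio or clipboard library, which also makes it easy to try out with stand-in functions before a microphone is involved.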

Model Choice

Using Whisper small (484MB):

  • Good accuracy for general use
  • Fast transcription (~2-3 seconds for 10 seconds of audio)
  • Runs comfortably on CPU (no GPU required)

Larger models available (medium, large-v3) if accuracy matters more than speed.
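Loading the model and transcribing with faster-whisper might look roughly like this. The `compute_type="int8"` and `beam_size=5` settings are assumptions chosen for CPU use, not values taken from this project, and `load_model` / `transcribe` are hypothetical helper names.

```python
def load_model(size: str = "small"):
    """Load a Whisper model for CPU inference (weights download on first use)."""
    # Imported here so the helpers below can be imported without the package.
    from faster_whisper import WhisperModel

    # int8 quantization is an assumed setting that trades a little accuracy
    # for lower memory use and faster CPU inference.
    return WhisperModel(size, device="cpu", compute_type="int8")


def transcribe(model, audio) -> str:
    """Return the full transcript for a file path or 16 kHz float32 array."""
    segments, _info = model.transcribe(audio, beam_size=5)
    # `segments` is a lazy generator; decoding happens while iterating over it.
    return " ".join(segment.text.strip() for segment in segments)
```

Switching to a larger model should only require a different size string, for example `load_model("medium")` or `load_model("large-v3")`.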

Technical Details

Components:

  • faster-whisper - Optimized Whisper inference (CTranslate2-based)
  • sounddevice - Audio capture
  • keyboard - Hotkey detection (Alt key)
  • pyperclip - Clipboard integration for auto-paste
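A minimal sketch of how the capture, hotkey, and clipboard libraries could fit together. The function names, the 16 kHz sample rate, and the Ctrl+V paste step are assumptions for illustration; the original script may be organized differently.

```python
import time

SAMPLE_RATE = 16_000  # Whisper models work on 16 kHz mono audio


def record_while_held(hotkey: str = "alt"):
    """Wait for `hotkey`, record while it is held, return mono float32 samples."""
    # Third-party imports stay local so this module imports without audio hardware.
    import keyboard
    import numpy as np
    import sounddevice as sd

    chunks = []

    def on_audio(indata, frames, time_info, status):
        # Runs on sounddevice's audio thread; copy because the buffer is reused.
        chunks.append(indata.copy())

    keyboard.wait(hotkey)  # blocks until the key goes down
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        dtype="float32", callback=on_audio):
        while keyboard.is_pressed(hotkey):
            time.sleep(0.01)  # keep the stream open while the key is held

    if not chunks:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(chunks).reshape(-1)  # (N, 1) blocks -> flat (N,) array


def paste_text(text: str) -> None:
    """Copy `text` to the clipboard and paste it into the focused window."""
    import keyboard
    import pyperclip

    pyperclip.copy(text)
    keyboard.send("ctrl+v")
```

Note that this approach overwrites whatever was on the clipboard; saving the previous value with `pyperclip.paste()` and restoring it afterwards is one way to avoid that.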

Windows encoding fix:

  • Added UTF-8 wrapper for console output (on Windows, Python’s stdout can fall back to a legacy code page such as cp1252, which can’t encode emojis)
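One way to write such a wrapper, shown as a sketch. `force_utf8` is a hypothetical helper name, and `errors="replace"` is an assumed choice so that an unencodable character degrades to a placeholder instead of raising.

```python
import io


def force_utf8(stream):
    """Return a text stream that encodes output as UTF-8."""
    if hasattr(stream, "reconfigure"):
        # Python 3.7+: switch the encoding of the existing stream in place.
        stream.reconfigure(encoding="utf-8", errors="replace")
        return stream
    # Fallback: wrap the underlying byte buffer in a new UTF-8 text layer.
    return io.TextIOWrapper(stream.buffer, encoding="utf-8", errors="replace")
```

Calling `sys.stdout = force_utf8(sys.stdout)` (and the same for `sys.stderr`) once at startup, before anything is printed, should be enough for emoji in status messages to print without a `UnicodeEncodeError`.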

Desktop integration:

  • Created .lnk shortcut for taskbar pinning
  • Runs in background, minimal resource usage when idle

What I Learned

Local AI is practical: No need to reach for cloud APIs for everything. Whisper models are small enough to run locally and accurate enough for daily use.

Python ecosystem is mature: Finding the right libraries (faster-whisper vs openai-whisper) makes a huge difference in performance.

Windows quirks: Console encoding, hotkey detection, and clipboard access each have their own edge cases. Testing reveals them quickly.

Current Status

Fully functional. Fixed UTF-8 encoding bug. Ready for daily use.

Possible enhancements:

  • Voice activity detection (auto-start recording on speech)
  • Integration with LM Studio (voice → LLM → voice response)
  • Custom wake word detection

But the core use case works: speak, get text, move on.


Technologies: Python, Whisper (OpenAI), faster-whisper, sounddevice, keyboard
Model: Whisper small (484MB)
Status: Operational
Privacy: 100% local, zero cloud calls