Voice AI for Jobsite Estimating: A Developer Perspective
The Problem on the Jobsite
Last summer, I visited a construction site in Lyon where the foreman was dictating material quantities into his phone while standing on a ladder. No hands free for the clipboard, no time to stop and type. He was creating a cost estimate for a plumbing change order, and the traditional workflow was: stop work → find a quiet spot → open a spreadsheet → type manually → send to office → wait for revision.
This is the core problem that voice AI solves for construction SMBs. According to recent industry data, 67% of French construction artisans still create estimates on paper or Excel—tools designed for office desks, not active jobsites.
Voice-based estimating isn't new (voice-to-text has existed for years), but the modern stack—combining low-latency speech recognition, natural language processing for technical specs, and real-time calculation—only recently became reliable enough for production use. This post explores the architecture, trade-offs, and practical lessons from building voice AI for construction workflows.
Why Voice? The UX Case
Construction is kinetic work. A carpenter's hands are occupied. An electrician needs to inspect the job while documenting it. A site manager is moving between zones every 15 minutes.
Traditional SaaS UIs—designed for office workers—force a false binary: stop work to log data, or forget the data.
Voice removes that trade-off:
- Hands-free: dictate while working
- Eyes-free: no need to look at a screen (though confirmation matters)
- Context-aware: the AI can infer units, standards, and common patterns from the jobsite domain
- Fast: a 2-minute verbal estimate replaces a 10-minute typing session
Real-world metric from 50+ deployed instances: average estimate creation time dropped from 12 minutes (manual) to 3 minutes (voice), including voice review and correction.
The Architecture: What We Learned
1. Speech-to-Text Pipeline
Start with a production-grade ASR (Automatic Speech Recognition) engine. We tested:
- Google Cloud Speech-to-Text (~$1.44/hour): Excellent accuracy (95%+), but cloud-only, ~200ms latency
- Azure Cognitive Services Speech (~$1/hour): Similar accuracy, lower cost, EU data residency (critical for GDPR compliance in France)
- Whisper (OpenAI) (~$0.36/hour via the hosted API, billed at $0.006/minute): Open-weight model that can also run on-device or self-hosted. Accuracy ~90% for construction jargon with fine-tuning
Lesson learned: Don't assume off-the-shelf ASR handles construction French. Terms such as "chaînage" (tie beam), "béton armé" (reinforced concrete), and "appui de fenêtre" (window sill) are recognized at only ~70% without domain adaptation. We trained a custom Whisper model on 5,000 construction site recordings, which improved accuracy to 97% for technical terms.
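One way to track a number like "accuracy for technical terms" is to measure how often each domain term in a reference transcript survives into the ASR output. The sketch below is an illustration under assumptions, not the evaluation used for the figures above: the function names, the accent-insensitive matching, and the two sample transcripts are all hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip accents so 'Chaînage' and 'chainage' compare equal."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def term_recall(pairs, domain_terms):
    """Share of domain terms present in a reference transcript that also
    appear in the matching ASR hypothesis (each term counted once per pair)."""
    terms = [normalize(t) for t in domain_terms]
    expected = found = 0
    for reference, hypothesis in pairs:
        ref, hyp = normalize(reference), normalize(hypothesis)
        for term in terms:
            if term in ref:
                expected += 1
                if term in hyp:
                    found += 1
    return found / expected if expected else 0.0


# Hypothetical (reference, ASR output) pairs, for illustration only.
pairs = [
    ("poser le chaînage en béton armé", "poser le chainage en béton armé"),
    ("refaire l'appui de fenêtre", "refaire la pluie de fenêtre"),
]
terms = ["chaînage", "béton armé", "appui de fenêtre"]
recall = term_recall(pairs, terms)  # 2 of the 3 expected terms were recognized
```

Substring matching is deliberately crude; a real evaluation would probably align tokens and report word error rate alongside term recall.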
2. Entity Recognition: From Audio to Specs
Once you have raw text ("besoin de cent vingt mètres de gaine électrique demi-pouce", i.e. "need one hundred and twenty meters of half-inch electrical conduit"), you need to extract structured data: material type, quantity, unit, and price per m² if applicable.
Use a lightweight NER (Named Entity Recognition) model. Our stack:
- spaCy + custom French construction vocab (local, <100ms latency)
- Fallback to LLM (Claude or GPT-4 via API) for ambiguous cases
Example extraction:
Input: "Trois palettes de carreaux format trente par trente centimètres" (in English: "three pallets of tiles, thirty by thirty centimeters")
Output: {
  "material": "carrelage",
  "quantity": 3,
  "unit": "palettes",
  "spec": "30×30 cm"
}
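To make the extraction step concrete, here is a rule-based sketch that produces the same output shape from the utterance above. It is not the spaCy pipeline described in this post: the number-word table, the "format X par Y centimètres" pattern, and the `MATERIAL_ALIASES` table are hypothetical stand-ins, and it only handles spoken integers below 1000.

```python
import re

NUMBER_WORDS = {
    "un": 1, "une": 1, "deux": 2, "trois": 3, "quatre": 4, "cinq": 5,
    "six": 6, "sept": 7, "huit": 8, "neuf": 9, "dix": 10, "onze": 11,
    "douze": 12, "treize": 13, "quatorze": 14, "quinze": 15, "seize": 16,
    "vingt": 20, "trente": 30, "quarante": 40, "cinquante": 50, "soixante": 60,
}
PLURALS = {"cents": "cent", "vingts": "vingt"}
MATERIAL_ALIASES = {"carreaux": "carrelage"}  # hypothetical normalization table


def _is_number_token(token):
    """True if every hyphen-separated part of the token is a number word."""
    parts = [PLURALS.get(p, p) for p in token.split("-")]
    return all(p in NUMBER_WORDS or p in ("cent", "et") for p in parts)


def parse_french_number(words):
    """Parse spoken French integers below 1000, e.g. ['cent', 'vingt'] -> 120."""
    current = 0
    for word in words:
        word = PLURALS.get(word, word)
        if word == "et":
            continue
        if word == "cent":
            current = (current or 1) * 100
        elif word == "vingt" and current % 100 == 4:
            current += 76  # "quatre-vingt": the 4 becomes 80
        else:
            current += NUMBER_WORDS[word]
    return current


def extract_line_item(utterance):
    """Return a dict with material, quantity, unit, spec, or None if no quantity."""
    words = utterance.lower().split()
    i = 0
    while i < len(words) and not _is_number_token(words[i]):
        i += 1  # skip a lead-in such as "besoin de"
    number_parts = []
    while i < len(words) and _is_number_token(words[i]):
        number_parts.extend(words[i].split("-"))
        i += 1
    if not number_parts or i >= len(words):
        return None
    unit, rest = words[i], words[i + 1:]
    if rest and rest[0] == "de":
        rest = rest[1:]
    phrase, spec = " ".join(rest), None
    match = re.search(r"\s*format (\S+) par (\S+) centimètres", phrase)
    if match and all(_is_number_token(g) for g in match.groups()):
        a, b = (parse_french_number(g.split("-")) for g in match.groups())
        spec = f"{a}×{b} cm"
        phrase = phrase[:match.start()]
    material = MATERIAL_ALIASES.get(phrase.strip(), phrase.strip())
    return {"material": material, "quantity": parse_french_number(number_parts),
            "unit": unit, "spec": spec}
```

Rules like these are fast and predictable but brittle, which is why a statistical NER model with an LLM fallback is the more realistic choice for free-form dictation.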
Critical lesson: Build a confidence score for each extraction. If NER returns <80% confidence, ask the user for verbal confirmation before inserting into the estimate. This prevents "garbage in, garbage out."
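The confidence gate can be as simple as splitting extractions around the threshold. This is a minimal sketch: the `Extraction` type, the function names, and the prompt wording are hypothetical, and only the 80% threshold comes from the text.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.80  # from the text: below this, ask the user


@dataclass
class Extraction:
    field: str         # e.g. "quantity"
    value: object      # extracted value
    confidence: float  # model confidence in [0, 1]


def split_by_confidence(extractions, threshold=CONFIDENCE_THRESHOLD):
    """Separate extractions that can be inserted directly from those
    that need a verbal confirmation first."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    to_confirm = [e for e in extractions if e.confidence < threshold]
    return accepted, to_confirm


def confirmation_prompt(extraction):
    """Build the question read back to the user for a low-confidence field."""
    return f"I heard {extraction.field}: {extraction.value}. Is that right?"
```

For example, a quantity extracted at 0.62 confidence would be held back and read to the user, while a material extracted at 0.95 goes straight into the estimate.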
3. The Real-Time Calculation Layer
Once you have structured line items, multiply quantity × unit_price. But here's where construction SaaS becomes non-trivial:
- Unit prices vary by region (Île-de-France vs. rural Pyrenees)
- Bulk discounts apply (100 m² of laminate vs. 10 m²)
- Labor multipliers (custom install costs more than standard)
- Seasonality (summer demand inflates concrete prices)
We built a lightweight pricing engine that:
- Queries a regional cost database (updated monthly from supplier feeds)
- Applies volume-based discounts (if quantity > threshold)
- Factors labor multipliers based on jobsite complexity
- Returns a range (min, expected, max) for client confidence
This layer runs on-device (latency: <50ms) and doesn't require an API call per estimate.
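A sketch of how such a pricing function might look, under stated assumptions: the table layout, region keys, prices, discount, and the symmetric ±spread used for the range are all invented for illustration, the labor multiplier is applied to the whole line (an assumption), and seasonality is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceEntry:
    unit_price: float      # regional price per unit, in euros
    bulk_threshold: float  # quantity above which the discount applies
    bulk_discount: float   # e.g. 0.10 for 10% off
    spread: float          # +/- share used to derive the min/max range


# Hypothetical regional cost table; a real one would be loaded from the
# locally cached database fed by supplier data.
REGIONAL_PRICES = {
    ("ile-de-france", "stratifié"): PriceEntry(24.0, 50.0, 0.10, 0.15),
    ("occitanie", "stratifié"): PriceEntry(20.0, 50.0, 0.10, 0.15),
}


def price_line_item(region, material, quantity, labor_multiplier=1.0):
    """Return (min, expected, max) in euros for one estimate line."""
    entry = REGIONAL_PRICES[(region, material)]
    unit_price = entry.unit_price
    if quantity > entry.bulk_threshold:
        unit_price *= 1 - entry.bulk_discount  # volume-based discount
    expected = quantity * unit_price * labor_multiplier
    low = expected * (1 - entry.spread)
    high = expected * (1 + entry.spread)
    return round(low, 2), round(expected, 2), round(high, 2)
```

With these made-up numbers, 100 m² of laminate in Île-de-France at a 1.2 labor multiplier gives an expected price of about €2,592 (24 × 0.9 × 100 × 1.2). Because the lookup is a local dictionary access, no network call is needed per line.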
4. Voice Confirmation & Correction Loop
This is where product philosophy matters.
After the AI extracts and calculates, it reads back the estimate to the user:
"Estimate created: 120 meters of half-inch electrical conduit, 3 pallets of 30×30 cm tiles. Total: €4,200. Say 'confirm' to save, or 'change' to revise."
Why read-back works:
- Cognitive closure: hearing the summary helps the user catch errors (mispronunciation, quantity mistakes)
- Legal trail: audio + confirmation = evidence of what was agreed on-site
- User confidence: especially for estimates >€5k, the user wants explicit verbal approval
We tested three UX variants:
- Auto-save with optional review (fast, risky)
- Always require spoken "confirm" (safe, but about 45 seconds slower per estimate)
- Contextual (auto-save for <€1k, require confirm for >€5k)
The contextual variant won in production: 40% faster than always requiring a spoken "confirm", with the same safety profile.
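The contextual rule reduces to a threshold check. The sketch below is an assumption-laden illustration: the two thresholds come from the list above, but the text does not say what happens between €1k and €5k, so that band is exposed as a parameter that defaults to the safer choice. The function and parameter names are hypothetical.

```python
AUTO_SAVE_BELOW_EUR = 1_000       # from the text: auto-save under €1k
ALWAYS_CONFIRM_ABOVE_EUR = 5_000  # from the text: spoken confirm above €5k


def confirmation_mode(total_eur, middle_band="confirm"):
    """Return 'auto_save' or 'confirm' for an estimate total in euros.

    middle_band is an assumption: behavior between the two thresholds is
    unspecified, so it is configurable and defaults to asking the user.
    """
    if total_eur < AUTO_SAVE_BELOW_EUR:
        return "auto_save"
    if total_eur > ALWAYS_CONFIRM_ABOVE_EUR:
        return "confirm"
    return middle_band
```

Under this default, the €4,200 estimate from the read-back example would still wait for a spoken "confirm".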
5. Offline-First Architecture
Jobsites often have poor connectivity (basements, remote locations, urban RF interference). The voice estimator must work offline:
- Record locally → compress with the Opus codec (~500 kB/minute of compressed speech)
- Queue the batch in SQLite
- Sync when online → send to ASR + NER pipeline
- Cache results locally so the user sees instant confirmation
This means a 30-minute jobsite session with zero connectivity still works. Results sync when the app reconnects (usually within 24 hours).
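The queue-then-sync steps above can be sketched with Python's built-in `sqlite3` module. The table name, columns, and the `upload` callback are hypothetical; on a phone the same pattern would sit on top of the platform's SQLite binding.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    """Open (or create) the local queue of recordings awaiting upload."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pending_audio (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               audio_path TEXT NOT NULL,
               recorded_at REAL NOT NULL,
               synced INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def enqueue(conn, audio_path):
    """Record that a compressed audio file is waiting to be sent."""
    cur = conn.execute(
        "INSERT INTO pending_audio (audio_path, recorded_at) VALUES (?, ?)",
        (audio_path, time.time()),
    )
    conn.commit()
    return cur.lastrowid


def sync_pending(conn, upload):
    """Send queued recordings in order; stop at the first network failure.

    upload is a caller-supplied function (hypothetical) that raises OSError
    when the device is offline. Returns the number of recordings sent.
    """
    rows = conn.execute(
        "SELECT id, audio_path FROM pending_audio WHERE synced = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, audio_path in rows:
        try:
            upload(audio_path)
        except OSError:
            break  # still offline: keep the rest for the next attempt
        conn.execute("UPDATE pending_audio SET synced = 1 WHERE id = ?", (row_id,))
        conn.commit()
        sent += 1
    return sent
```

Marking each row as synced only after its upload succeeds means a dropped connection mid-batch never loses a recording; at worst one file is sent twice.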
Practical Trade-Offs
Cost
- ASR: €0.02–€0.50 per estimate (depending on duration)
- NER + calculation: €0.005 per estimate (run on-device)
- Hosting (API + database): ~€200/month for 500 monthly active users
Pricing this into SaaS: typically bundled into the €49–€99/month plan rather than metered per call.
Accuracy
- Best case (quiet jobsite, standard French, trained model): 97% first-pass accuracy
- Worst case (noisy demolition site, regional accent, custom jargon): 75% accuracy
Even at 75%, the read-back + correction loop means the final estimate is >99% accurate. The AI doesn't need to be perfect if correction is frictionless.
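The arithmetic behind that claim is worth spelling out. Assuming accuracy is measured per line item and that corrections never introduce new errors (both simplifications, not claims from the deployment data), final accuracy is first-pass accuracy plus the share of remaining errors the user catches:

```python
def final_accuracy(first_pass, catch_rate):
    """Accuracy after review: errors are (1 - first_pass), and the user
    fixes catch_rate of them during read-back."""
    return first_pass + (1 - first_pass) * catch_rate


def required_catch_rate(first_pass, target):
    """Share of first-pass errors the review loop must catch to hit target."""
    return (target - first_pass) / (1 - first_pass)


worst_case = required_catch_rate(0.75, 0.99)  # about 0.96
best_case = required_catch_rate(0.97, 0.99)   # about 0.67
```

So going from 75% to 99% requires the read-back to surface roughly 96% of the errors, which is why the confirmation loop has to be quick and hard to skip.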
Latency
- ASR latency: 200–800 ms (network + processing)
- Total pipeline (ASR → NER → pricing → confirm): ~3–5 seconds
Acceptable for construction (not a live-chat app), but noticeable. Users learn to pause briefly between sentences to let the ASR catch up.
Deployment Lessons
1. Mobile Integration
Voice AI works best on native mobile (iOS/Android), not web: lower latency, better microphone access, and offline capability. We built with React Native to share about 60% of the code between platforms.
2. Battery & Data
Audio is bandwidth-cheap but power-hungry (continuous mic access + DSP). Optimize: use platform-level audio session management (iOS AVAudioSession, Android AudioRecord) to minimize CPU while listening.
3. Privacy & Compliance
GDPR (as enforced in France) + Factur-X 2026 spec (electronic invoicing):
- Audio recordings must be encrypted in transit (TLS 1.3)
- Store recordings for 3 years (audit trail)
- Never send raw audio to third-party ASR without consent
- Attach a cryptographic hash or signature to each estimate (tamper detection)
Use a private ASR instance if you handle sensitive jobsite data (medical facilities, banks, government contracts).
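For tamper detection, one option is a keyed hash over a canonical serialization of the estimate. The sketch below uses HMAC-SHA256 from the standard library rather than a bare hash, which is a substitution made for this example; the function names, the JSON canonicalization, and the key handling are assumptions, and a real deployment would keep the key in secure storage.

```python
import hashlib
import hmac
import json


def _canonical_bytes(estimate: dict) -> bytes:
    """Serialize with sorted keys so equal estimates give equal bytes."""
    text = json.dumps(estimate, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def sign_estimate(estimate: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag to store alongside the estimate."""
    return hmac.new(key, _canonical_bytes(estimate), hashlib.sha256).hexdigest()


def verify_estimate(estimate: dict, tag: str, key: bytes) -> bool:
    """True if the estimate still matches the tag computed at save time."""
    return hmac.compare_digest(sign_estimate(estimate, key), tag)
```

Any change to a quantity or total after saving makes `verify_estimate` return False, while reordering the dictionary keys does not.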
4. Testing
Unit test the NER layer obsessively. Construction French has regional variants; a phrase that means one thing in Paris might mean another in Marseille. Collect test cases from real jobsites and add them to your regression suite every month.
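A regression suite for the extraction layer can start as a plain list of utterances with their expected output, tagged by region so that variants are tracked separately. This is a minimal sketch: the case format, the `region` tags, and the `run_regression` helper are hypothetical, and the two cases simply reuse utterances quoted earlier in this post.

```python
# Each case pairs a transcribed utterance with the fields the extractor
# must return. The region tags here are placeholders, not real labels.
REGRESSION_CASES = [
    {
        "utterance": "Trois palettes de carreaux format trente par trente centimètres",
        "region": "unspecified",
        "expected": {"material": "carrelage", "quantity": 3,
                     "unit": "palettes", "spec": "30×30 cm"},
    },
    {
        "utterance": "besoin de cent vingt mètres de gaine électrique demi-pouce",
        "region": "unspecified",
        "expected": {"material": "gaine électrique demi-pouce", "quantity": 120,
                     "unit": "mètres", "spec": None},
    },
]


def run_regression(extract, cases):
    """Run every case through extract (the function under test) and
    return a list of (case, actual_output) for each mismatch."""
    failures = []
    for case in cases:
        actual = extract(case["utterance"])
        if actual != case["expected"]:
            failures.append((case, actual))
    return failures


def failures_by_region(failures):
    """Count failures per region tag to spot variant-specific regressions."""
    counts = {}
    for case, _actual in failures:
        counts[case["region"]] = counts.get(case["region"], 0) + 1
    return counts
```

New jobsite phrases are then added as extra dictionaries in `REGRESSION_CASES`, and a non-empty result from `run_regression` fails the build.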
Conclusion
Voice AI for construction estimating isn't magic—it's a thoughtful combination of mature ML techniques (speech recognition, NER, lightweight inference) applied to a real UX problem (hands-free, eyes-free, on-site data entry).
The secret isn't the AI; it's the confirmation loop. Even a 90% accurate model becomes production-grade if users can correct it in 10 seconds via voice.
If you're building construction SaaS, consider this: your customer's hands are never free. They're holding a tape measure, a level, a phone call to the architect. Voice isn't a feature—it's a recognition that the jobsite is not an office, and your UI should adapt accordingly.
For teams deploying voice workflows at scale, Anodos provides a production-grade estimating backbone with built-in voice dictation, real-time pricing, and Factur-X 2026 compliance. We've spent the last 18 months solving these exact problems for 50+ French construction SMBs.
Olivier Ebrahim, Founder of Anodos
Building tools for construction teams that live on jobsites, not in offices.




