Building Sakhi: Hindi Voice-to-Form for India's ASHA Workers, Solo in Six Weeks

Dev.to / 5/19/2026


Key Points

The project “Sakhi” builds a Hindi voice-to-form pipeline to help India’s ASHA community health workers capture maternal and child health visits more reliably than handwritten paper forms.
It uses two deployment modes: a workstation path with Whisper for Hindi transcription and Gemma 4 E4B (function calling) for structured form extraction, and a fully offline Android path using Gemma 4 E2B INT4 via the Cactus SDK.
The solution targets core constraints in rural healthcare—Hindi often appears in dialects and cloud STT can have high word error rates, and connectivity is intermittent so offline operation must be the default.
The workstation mode runs voice-to-form end-to-end in about 15–25 seconds on an RTX 5070 Ti, while the offline on-device mode completes the full text-to-form and danger-sign detection workflow without any network connection (reported end-to-end time: 320.7 s on a OnePlus 11R).
The work was submitted to Kaggle’s “Gemma 4 Good Hackathon”; the source code is available on GitHub and the fine-tuned model is published on the Ollama registry.

TL;DR — Six-week solo build of a Hindi voice-to-form pipeline for India's ~1 million community health workers. Two deployment modes: a workstation path with Whisper + Gemma 4 E4B on Ollama, and a fully offline on-device path running Gemma 4 E2B INT4 on the Cactus SDK on Android. Submitted to Kaggle's Gemma 4 Good Hackathon. Source on GitHub, fine-tune on Ollama.

The problem

India's roughly 1 million Accredited Social Health Activists (ASHAs) handle the last clinical mile for maternal and child health. They conduct 50+ million home visits a year, covering vitals, symptoms, counselling, and danger-sign assessment. Every visit still ends with a paper form filled from memory and physically carried to the Primary Health Center on the next clinic day.

Danger signs that were observed — preeclampsia, postpartum hemorrhage, neonatal distress — sometimes never reach the clinical system in time for intervention.

Two compounding constraints make this hard to fix with conventional tooling:

Hindi voice, often in regional dialects. Cloud STT is unreliable on rural-clinical Hindi (published benchmarks: 27–70%+ WER, deletion-dominant — numbers and symptoms silently drop).
Connectivity is intermittent. Airplane-mode operation cannot be a fallback. It must be the default.

Architecture

Two deployment modes for how ASHAs actually work — a workstation in the health center, and the phone in the field:

Workstation path (PHC, GPU):
[Hindi Audio] → Whisper-Large CT2 → Hindi Normalization → Gemma 4 E4B (function calling)
                                                            ├── extract_form()
                                                            ├── flag_danger_sign()
                                                            └── issue_referral()

On-device path (Android, no network):
[Hindi Text] → Hindi Normalization → Visit-type detect → Gemma 4 E2B INT4 on Cactus
                                                          ├── extract_form
                                                          └── detect_danger

Workstation mode handles voice: a phone uploads audio to a shared PC at the sub-centre, where Whisper-Large-V2 Hindi (run through CTranslate2) transcribes it and Gemma 4 E4B Q4_K_M on Ollama extracts the structured form with native function calling. End-to-end latency is 15–25 seconds on an RTX 5070 Ti.
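As a rough illustration of the function-calling step, here is a minimal sketch of how the three tools from the diagram might be declared and dispatched. Only the three function names come from the article. The parameter names (`visit_type`, `fields`, `sign`, `evidence`, `facility`, `reason`), the descriptions, and the `make_tool` / `dispatch` helpers are assumptions made for illustration, and the sketch assumes a tool call arrives as a dict with a `function` entry holding `name` and `arguments`.

```python
# Hypothetical tool declarations in the JSON-schema style used by
# function-calling chat APIs. Parameter names are illustrative only.
def make_tool(name, description, properties, required):
    """Build one tool entry with an object-typed parameter schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


TOOLS = [
    make_tool(
        "extract_form",
        "Record only the values that were spoken during the visit.",
        {"visit_type": {"type": "string"}, "fields": {"type": "object"}},
        ["visit_type", "fields"],
    ),
    make_tool(
        "flag_danger_sign",
        "Report a danger sign together with the verbatim utterance.",
        {"sign": {"type": "string"}, "evidence": {"type": "string"}},
        ["sign", "evidence"],
    ),
    make_tool(
        "issue_referral",
        "Propose a referral with a reason.",
        {"facility": {"type": "string"}, "reason": {"type": "string"}},
        ["reason"],
    ),
]


def dispatch(tool_call, handlers):
    """Route one model tool call to a plain Python handler."""
    function = tool_call["function"]
    name = function["name"]
    if name not in handlers:
        # The model asked for a tool that was never declared: refuse.
        raise ValueError(f"model requested unknown tool: {name}")
    return handlers[name](**function.get("arguments", {}))
```

Keeping the handlers as ordinary Python functions means every model proposal passes through code that can reject it, which is the same separation the validation section relies on.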

Field mode runs the full pipeline (normalize → detect visit type → extract form → flag danger signs) entirely on-device. End-to-end 320.7s on a OnePlus 11R (Snapdragon 8+ Gen 1), zero network. The on-device LLM does Hindi text → form; voice routes to the workstation when WiFi returns (more on why below).
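The article does not spell out what the Hindi normalization step does. As one plausible ingredient, the sketch below canonicalizes Unicode, maps Devanagari digits to ASCII digits so that vitals compare numerically, and collapses whitespace. The function name `normalize_hindi` and the choice of these three steps are assumptions, not Sakhi's actual implementation.

```python
import unicodedata

# Devanagari digits (U+0966 to U+096F) mapped to ASCII 0 to 9.
_DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")


def normalize_hindi(text: str) -> str:
    """Hypothetical normalizer: NFC form, ASCII digits, single spaces."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_DEVANAGARI_DIGITS)
    # split() with no argument also trims leading and trailing whitespace.
    return " ".join(text.split())
```

Normalizing before extraction means later range checks and verbatim-grounding checks compare against one consistent form of the transcript.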

The hardest engineering call: leaving on-device voice OUT

I wanted on-device voice-to-form. A phone, no laptop, no network — that's the cleanest pitch. I pulled it from the build instead.

Cactus SDK ships multilingual Whisper INT4 for transcription — no Hindi-specific checkpoint. The published numbers are bad:

27% WER best-case on rural Hindi
70%+ on clinical content
Error profile is deletion-dominant — numbers and symptoms silently drop while filler words survive

A missed BP reading is a missed referral. A demo where Sakhi says "BP normal" because the actual 155/100 was deleted during transcription is exactly the failure mode an ASHA cannot catch in the field.
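To make "deletion-dominant" concrete: word error rate is WER = (S + D + I) / N, where S, D, and I count substituted, deleted, and inserted words and N is the number of reference words. The sketch below computes that breakdown with a standard word-level edit-distance table. The function name and the example sentence are made up for illustration and are not from the published benchmarks.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Word-level edit distance, split into substitutions/deletions/insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # cell = (total_cost, substitutions, deletions, insertions)
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match costs nothing
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitute = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            delete = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insert = (c + 1, s, d, n + 1)
            dp[i][j] = min(substitute, delete, insert)
    cost, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {
        "wer": cost / len(ref) if ref else 0.0,
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }


# Made-up example: the two numbers vanish, the filler words survive.
result = wer_breakdown("हाँ बीपी 155 100 है", "हाँ बीपी है")
```

In this example the transcript still reads fluently, yet both errors are deletions and both deleted words are the blood-pressure values, which is the failure pattern described above.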

So voice routes to the workstation where Whisper-Large-V2 Hindi runs. The on-device LLM handles Hindi text → form for the case where an ASHA types a quick note offline. Field mode also captures raw audio offline and syncs to the workstation when WiFi returns.
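The capture-offline, sync-later behaviour can be sketched as a small first-in, first-out queue. This is an in-memory illustration only: the class name `AudioSyncQueue` and the injected `upload` callable are hypothetical, and a real app would persist the queue to storage so recordings survive a restart.

```python
from collections import deque
from typing import Callable, Deque, List


class AudioSyncQueue:
    """Hypothetical FIFO of recorded audio paths awaiting upload."""

    def __init__(self, upload: Callable[[str], bool]):
        # upload(path) should return True on success; it is supplied by the caller.
        self._upload = upload
        self._pending: Deque[str] = deque()

    def capture(self, audio_path: str) -> None:
        """Record a visit offline: just remember the file for later."""
        self._pending.append(audio_path)

    def sync(self) -> int:
        """Upload in order; stop at the first failure and keep the rest queued."""
        sent = 0
        while self._pending:
            path = self._pending[0]
            try:
                ok = self._upload(path)
            except OSError:
                ok = False  # treat a dropped connection as "try again later"
            if not ok:
                break
            self._pending.popleft()
            sent += 1
        return sent

    def pending(self) -> List[str]:
        return list(self._pending)
```

An item is removed only after its upload succeeds, so a connection that drops mid-sync never loses a recording.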

This was the most uncomfortable call of the build. The submission video shows raw on-device JSON output from text input instead of faking voice.

Anti-hallucination: model extracts, Python decides

The hardest problem isn't getting Gemma to talk about a transcript. It's getting it to stop inventing. Early prototypes:

Hallucinated patient names from generic forms of address (दीदी / बहन — Hindi for "elder sister" / "sister", used informally for any woman regardless of relation).
Invented BP readings on routine visits that never mentioned vitals.
Turned counselling utterances ("eat iron-rich food, drink plenty of water") into "danger signs."

The pattern that stuck: Gemma proposes evidence; Python decides what counts. The LLM extracts only what was said — verbatim utterances, structured under the schema. Validation, range-checks, deduplication, blocklist filtering: none of that runs inside the prompt. It runs in code, against the transcript, after extraction.

Six layers of validation:

Evidence length filter — danger signs with under 10-character evidence are dropped.
Generic ASHA phrase blocklist — boilerplate (कोई तकलीफ़ हो तो फ़ोन कर दीजिए / "call me if there's any problem") filtered.
Normal-value filter — signs citing benign values (110/70, बिल्कुल ठीक / "totally fine", सामान्य / "normal") stripped.
Transcript grounding — evidence must appear verbatim in the transcript.
Deduplication across overlapping danger signs.
Form validation — strips invented patient names (दीदी/बहन patterns), default ages, phantom lab results; range checks on BP (60–250 / 30–150), Hb (3–20), weight (1–200), gestational weeks (1–45).
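The range checks in the last layer can be sketched as a lookup table applied after extraction. The numeric bounds are the ones listed above. The field names (`bp_systolic`, `bp_diastolic`, `hb`, `weight`, `gestational_weeks`) and the choice to null out a bad value rather than reject the whole form are assumptions for illustration.

```python
# Bounds from the article; field names are hypothetical.
RANGES = {
    "bp_systolic": (60, 250),
    "bp_diastolic": (30, 150),
    "hb": (3, 20),
    "weight": (1, 200),
    "gestational_weeks": (1, 45),
}


def range_check(form: dict) -> dict:
    """Return a copy of the form with implausible vitals set to None."""
    cleaned = dict(form)
    for field, (low, high) in RANGES.items():
        value = cleaned.get(field)
        if value is None:
            continue  # field absent or already empty
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not (low <= value <= high):
            cleaned[field] = None
    return cleaned
```

For example, `range_check({"bp_systolic": 155, "bp_diastolic": 100, "hb": 45})` keeps both blood-pressure values and clears the haemoglobin reading, since 45 falls outside 3–20.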

False-alarm rate on routine visits: 0.
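The first five layers all operate on the danger signs the model proposes, so they can be sketched as one filter pass. The threshold and the example phrases are taken from the list above. The dict shape (`sign`, `evidence`), the function name, and the rule that two signs citing the same utterance collapse to the first one are assumptions, and the real blocklist and normal-value lists are presumably longer.

```python
MIN_EVIDENCE_CHARS = 10
BLOCKLIST = ["कोई तकलीफ़ हो तो फ़ोन कर दीजिए"]        # generic ASHA boilerplate
NORMAL_MARKERS = ["110/70", "बिल्कुल ठीक", "सामान्य"]  # benign values


def filter_danger_signs(signs: list, transcript: str) -> list:
    """Keep only danger signs whose evidence survives every check."""
    kept, seen_evidence = [], set()
    for sign in signs:
        evidence = sign.get("evidence", "").strip()
        if len(evidence) < MIN_EVIDENCE_CHARS:
            continue  # layer 1: evidence too short to mean anything
        if any(phrase in evidence for phrase in BLOCKLIST):
            continue  # layer 2: boilerplate, not a symptom
        if any(marker in evidence for marker in NORMAL_MARKERS):
            continue  # layer 3: cites a normal value
        if evidence not in transcript:
            continue  # layer 4: not said verbatim, so not grounded
        if evidence in seen_evidence:
            continue  # layer 5: same utterance already flagged
        seen_evidence.add(evidence)
        kept.append(sign)
    return kept
```

Every check is a plain string operation against the transcript, so a rejected sign can be traced to the exact rule that dropped it without re-running the model.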

Demographics never go through the LLM

Early prototypes asked Gemma to extract patient name, age, and household composition from the audio. It hallucinated names from दीदी and बहन, defaulted ages on under-specified utterances, invented household members.

The fix wasn't prompt-tuning. It was structural: demographics enter as a typed header — the way every clinical EMR works. The LLM never sees the question. It only extracts what was said during the visit.

This pattern generalizes — any LLM-based structured extraction where the field is known-and-typed should not be in the prompt at all.
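A minimal sketch of the typed-header idea: demographics live in a typed record filled in by the user, and anything the model returns under those same keys is discarded before the two are combined. The `PatientHeader` fields, the `build_record` helper, and the sample values are hypothetical.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class PatientHeader:
    """Typed demographics entered by the ASHA, never by the model."""
    name: str
    age: int
    household_size: int


# Keys the model is never allowed to supply.
DEMOGRAPHIC_FIELDS = frozenset(f.name for f in fields(PatientHeader))


def build_record(header: PatientHeader, extracted: dict) -> dict:
    """Combine the typed header with model output, dropping model demographics."""
    visit = {k: v for k, v in extracted.items() if k not in DEMOGRAPHIC_FIELDS}
    return {"patient": asdict(header), "visit": visit}
```

Even if the model does emit a `name` or `age`, the typed header wins by construction, so a hallucinated "दीदी" can never overwrite what was entered.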

The Blackwell + Windows + Unsloth dead end

Unsloth's bundled save_pretrained_gguf mmap-fails on Blackwell + Windows:

RuntimeError: unable to mmap ... [WinError 8] Not enough memory resources

WSL was out (CUDA passthrough for Whisper was already finicky in this setup). Linux dual-boot would have eaten two days I didn't have.

I wrote scripts/export_merge.py — manual LoRA-into-base delta-merge in PyTorch — then handed the merged FP16 model to llama.cpp/convert_hf_to_gguf.py + llama-quantize Q4_K_M. The fine-tune ships on the Ollama registry through that workaround:

ollama pull tusharbrisingr9802/sakhi

A/B vs base on the eval rubric: 14/15 fine-tune vs 15/15 base. Base is the production path. The fine-tune is published for deployments that prefer English schema-label normalization (दस्त → Diarrhea, चक्कर → dizziness).
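For reference, the standard LoRA formulation that a manual delta-merge relies on is shown below. This assumes scripts/export_merge.py follows the usual scaling convention; the article does not show the script's contents.

```latex
% Standard LoRA merge for one adapted weight matrix (assumed convention).
% W_0: frozen base weight, A and B: trained low-rank factors,
% r: LoRA rank, \alpha: LoRA scaling hyperparameter.
\[
W_{\text{merged}} = W_0 + \Delta W,
\qquad
\Delta W = \frac{\alpha}{r}\, B A,
\qquad
W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k}
\]
```

Because the merged matrix has the same shape as the base weight, the result is an ordinary FP16 checkpoint that llama.cpp's converter can read without knowing an adapter was ever involved.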

Reproduce it locally

The workstation stack is the primary path:

git clone https://github.com/Tushar-9802/Sakhi
cd Sakhi
pip install -r requirements-runtime.txt
ollama pull gemma4:e4b-it-q4_K_M
cd frontend && npm install && npm run build && cd ..
python api.py
# Browser: http://localhost:8000

Requires ~10 GB VRAM (E4B Q4_K_M is roughly 9 GB resident). Running this stack exercises function calling, normalization, the 6-layer validation, and schema correctness end-to-end. Voice-to-form, text-to-form, and queue-and-sync all run on it.

For the on-device Android path see the GitHub Release — prebuilt APK plus in-app SAF zip-import of the Cactus model. Cactus's gemma-4-E2B-it INT4 build is gated on HuggingFace, so it isn't redistributed; the import flow keeps the no-adb path open for reviewers.

What's not in this submission

Full root-cause walkthroughs live in FAILURES.md in the repo:

No on-device voice — covered above. On-device LLM does Hindi text → form; voice routes to the workstation.
No real ASHA endorsement. Outreach didn't land inside the deadline. Real-voice testing came from family help in Bareilly — Hindi-native readers on a real phone mic, three of four role-play scripts. Not a corpus.
Synthetic training data. 1,154 fine-tune examples and the 15-case automated eval are LLM-generated Hindi with gTTS audio.
Regional dialect coverage. Tested on standard Hindi from Bareilly + role-play scripts. Bhojpuri, Awadhi, Magahi, code-switched Marwari/Bhili are not validated.

What's next

Partner with an ASHA training institute to collect 100+ hours of real ASHA home-visit audio under field conditions.
Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path that is not in this submission.
Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.
Pilot with 10–20 ASHA workers in one rural block with before/after time-and-accuracy measurement.

Links

3-min demo video — https://youtu.be/n-u7J1lljUg
GitHub repository — https://github.com/Tushar-9802/Sakhi
Ollama fine-tune — ollama pull tusharbrisingr9802/sakhi
Kaggle writeup — https://www.kaggle.com/competitions/gemma-4-good-hackathon/writeups/sakhi-voice-to-form-for-asha-workers

If any of the patterns above are useful in your own LLM extraction pipelines (the model-extracts/Python-decides separation, demographics-as-typed-header, or using the published Whisper INT4 WER numbers as the case against shipping fake on-device voice), drop a note in the comments. I'm @Tushar-9802 on GitHub.
