TL;DR — Six-week solo build of a Hindi voice-to-form pipeline for India's ~1 million community health workers. Two deployment modes: a workstation path with Whisper + Gemma 4 E4B on Ollama, and a fully offline on-device path running Gemma 4 E2B INT4 on the Cactus SDK on Android. Submitted to Kaggle's Gemma 4 Good Hackathon. Source on GitHub, fine-tune on Ollama.
The problem
India's 1 million Accredited Social Health Activists (ASHAs) handle the last clinical mile for maternal and child health. They conduct 50+ million home visits a year — vitals, symptoms, counselling, danger-sign assessment. Every visit still ends with a paper form filled from memory and physically carried to the Primary Health Center on the next clinic day.
Danger signs that were observed — preeclampsia, postpartum hemorrhage, neonatal distress — sometimes never reach the clinical system in time for intervention.
Two compounding constraints make this hard to fix with conventional tooling:
- Hindi voice, often in regional dialects. Cloud STT is unreliable on rural-clinical Hindi (published benchmarks: 27–70%+ WER, deletion-dominant — numbers and symptoms silently drop).
- Connectivity is intermittent. Airplane-mode operation cannot be a fallback. It must be the default.
Architecture
Two deployment modes for how ASHAs actually work — a workstation in the health center, and the phone in the field:
Workstation path (PHC, GPU):
[Hindi Audio] → Whisper-Large CT2 → Hindi Normalization → Gemma 4 E4B (function calling)
├── extract_form()
├── flag_danger_sign()
└── issue_referral()
On-device path (Android, no network):
[Hindi Text] → Hindi Normalization → Visit-type detect → Gemma 4 E2B INT4 on Cactus
├── extract_form
└── detect_danger
Workstation mode handles voice: a phone uploads audio to a shared PC at the sub-centre, Whisper-Large-V2 Hindi via CTranslate2 transcribes, Gemma 4 E4B Q4_K_M on Ollama extracts the structured form with native function calling. End-to-end 15–25 seconds on an RTX 5070 Ti.
Field mode runs the full pipeline (normalize → detect visit type → extract form → flag danger signs) entirely on-device. End-to-end 320.7s on a OnePlus 11R (Snapdragon 8+ Gen 1), zero network. The on-device LLM does Hindi text → form; voice routes to the workstation when WiFi returns (more on why below).
The hardest engineering call: leaving on-device voice OUT
I wanted on-device voice-to-form. A phone, no laptop, no network — that's the cleanest pitch. I pulled it from the build instead.
Cactus SDK ships multilingual Whisper INT4 for transcription — no Hindi-specific checkpoint. The published numbers are bad:
- 27% WER best-case on rural Hindi
- 70%+ on clinical content
- Error profile is deletion-dominant — numbers and symptoms silently drop while filler words survive
A missed BP reading is a missed referral. A demo where Sakhi says "BP normal" because the actual 155/100 was deleted during transcription is exactly the failure mode an ASHA cannot catch in the field.
So voice routes to the workstation where Whisper-Large-V2 Hindi runs. The on-device LLM handles Hindi text → form for the case where an ASHA types a quick note offline. Field mode also captures raw audio offline and syncs to the workstation when WiFi returns.
This was the most uncomfortable call of the build. The submission video shows raw on-device JSON output from text input instead of faking voice.
Anti-hallucination: model extracts, Python decides
The hardest problem isn't getting Gemma to talk about a transcript. It's getting it to stop inventing. Early prototypes:
- Hallucinated patient names from generic forms of address (
दीदी/बहन— Hindi for "elder sister" / "sister", used informally for any woman regardless of relation). - Invented BP readings on routine visits that never mentioned vitals.
- Turned counselling utterances ("eat iron-rich food, drink plenty of water") into "danger signs."
The pattern that stuck: Gemma proposes evidence; Python decides what counts. The LLM extracts only what was said — verbatim utterances, structured under the schema. Validation, range-checks, deduplication, blocklist filtering: none of that runs inside the prompt. It runs in code, against the transcript, after extraction.
Six layers of validation:
- Evidence length filter — danger signs with under 10-character evidence are dropped.
-
Generic ASHA phrase blocklist — boilerplate (
कोई तकलीफ़ हो तो फ़ोन कर दीजिए/ "call me if there's any problem") filtered. -
Normal-value filter — signs citing benign values (
110/70,बिल्कुल ठीक/ "totally fine",सामान्य/ "normal") stripped. - Transcript grounding — evidence must appear verbatim in the transcript.
- Deduplication across overlapping danger signs.
- Form validation — strips invented patient names (दीदी/बहन patterns), default ages, phantom lab results; range checks on BP (60–250 / 30–150), Hb (3–20), weight (1–200), gestational weeks (1–45).
False-alarm rate on routine visits: 0.
Demographics never go through the LLM
Early prototypes asked Gemma to extract patient name, age, and household composition from the audio. It hallucinated names from दीदी and बहन, defaulted ages on under-specified utterances, invented household members.
The fix wasn't prompt-tuning. It was structural: demographics enter as a typed header — the way every clinical EMR works. The LLM never sees the question. It only extracts what was said during the visit.
This pattern generalizes — any LLM-based structured extraction where the field is known-and-typed should not be in the prompt at all.
The Blackwell + Windows + Unsloth dead end
Unsloth's bundled save_pretrained_gguf mmap-fails on Blackwell + Windows:
RuntimeError: unable to mmap ... [WinError 8] Not enough memory resources
WSL was out (CUDA passthrough for Whisper was already finicky in this setup). Linux dual-boot would have eaten two days I didn't have.
I wrote scripts/export_merge.py — manual LoRA-into-base delta-merge in PyTorch — then handed the merged FP16 model to llama.cpp/convert_hf_to_gguf.py + llama-quantize Q4_K_M. The fine-tune ships on the Ollama registry through that workaround:
ollama pull tusharbrisingr9802/sakhi
A/B vs base on the eval rubric: 14/15 fine-tune vs 15/15 base. Base is the production path. The fine-tune is published for deployments that prefer English schema-label normalization (दस्त → Diarrhea, चक्कर → dizziness).
Reproduce it locally
The workstation stack is the primary path:
git clone https://github.com/Tushar-9802/Sakhi
cd Sakhi
pip install -r requirements-runtime.txt
ollama pull gemma4:e4b-it-q4_K_M
cd frontend && npm install && npm run build && cd ..
python api.py
# Browser: http://localhost:8000
Requires ~10 GB VRAM (E4B Q4_K_M is roughly 9 GB resident). Verifies function calling, normalization, the 6-layer validation, and schema correctness end-to-end. Voice-to-form, text-to-form, and queue-and-sync all run on this stack.
For the on-device Android path see the GitHub Release — prebuilt APK plus in-app SAF zip-import of the Cactus model. Cactus's gemma-4-E2B-it INT4 build is gated on HuggingFace, so it isn't redistributed; the import flow keeps the no-adb path open for reviewers.
What's not in this submission
Full root-cause walkthroughs live in FAILURES.md in the repo:
- No on-device voice — covered above. On-device LLM does Hindi text → form; voice routes to the workstation.
- No real ASHA endorsement. Outreach didn't land inside the deadline. Real-voice testing came from family help in Bareilly — Hindi-native readers on a real phone mic, three of four role-play scripts. Not a corpus.
- Synthetic training data. 1,154 fine-tune examples and the 15-case automated eval are LLM-generated Hindi with gTTS audio.
- Regional dialect coverage. Tested on standard Hindi from Bareilly + role-play scripts. Bhojpuri, Awadhi, Magahi, code-switched Marwari/Bhili are not validated.
What's next
- Partner with an ASHA training institute to collect 100+ hours of real ASHA home-visit audio under field conditions.
- Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path that is not in this submission.
- Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.
- Pilot with 10–20 ASHA workers in one rural block with before/after time-and-accuracy measurement.
Links
- 3-min demo video — https://youtu.be/n-u7J1lljUg
- GitHub repository — https://github.com/Tushar-9802/Sakhi
-
Ollama fine-tune —
ollama pull tusharbrisingr9802/sakhi - Kaggle writeup — https://www.kaggle.com/competitions/gemma-4-good-hackathon/writeups/sakhi-voice-to-form-for-asha-workers
If any of the patterns above are useful in your own LLM extraction pipelines — the model-extracts/Python-decides separation, demographics-as-typed-header, or the Whisper-INT4-WER receipts argument for not shipping fake on-device voice — drop a note in the comments. I'm @Tushar-9802 on GitHub.




