Build Your Own JARVIS: A Deep Dive into Memo AI - The Privacy-First Local Voice Agent

Dev.to / 4/12/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The article explains how to build “Memo AI,” a privacy-first, local voice agent that transcribes speech, detects intent, and executes system tasks without sending audio or prompts to the cloud.
  • It presents a five-layer pipeline: UI/audio ingestion (Streamlit), transcription (OpenAI Whisper with runtime FFmpeg via static-ffmpeg), local reasoning (Ollama with Llama-family models plus JSON-only prompting and Pydantic validation), a tool-dispatch action layer, and a security/persistence layer.
  • To improve reliability despite LLM non-determinism, the reasoning layer constrains outputs to JSON and validates them into structured intent objects using Pydantic.
  • For safety, the article introduces security controls such as path scoping so file operations are restricted to a defined root directory (e.g., /output), reducing the risk of harmful filesystem access.
  • The system is designed to be modular so additional capabilities (e.g., web search or email sending) can be added by implementing new tools and dispatch rules.


How I built a voice-controlled AI that executes system tasks, generates code, and takes notes—all without sending a single byte to the cloud.

The Motivation: Why Go Local?

In an era where every voice command to Alexa, Siri, or ChatGPT is logged on a distant server, privacy has become a premium feature. I wanted to build an agent that was "Privacy-by-Design." My goal was Memo AI: a robust, full-stack application that handles voice transcription, intent detection, and system-level execution entirely on local hardware.

The Architecture: A 5-Layer Pipeline

Building a local agent isn't just about plugging in an LLM. It’s about building a reliable pipeline that can handle messy human speech and turn it into precise system commands.

1. Ingestion Layer (Streamlit and Audio Buffers)

The application starts with a reactive UI built using Streamlit. I used streamlit-mic-recorder to capture audio directly from the browser. The raw audio is buffered as bytes, which allows for immediate processing without needing to manage temporary file cleanup manually in the initial stage.
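In outline, the capture step looks like the sketch below. The helper name record_and_get_bytes is my own, and the exact fields of the dict that streamlit-mic-recorder returns may differ between versions; check the library's README. Imports are deferred so the sketch reads standalone:

```python
def record_and_get_bytes():
    """Render a mic widget and return the recorded audio as raw bytes.

    Imports are deferred in this sketch so it can be read without
    Streamlit installed; in a real app you would import at module level.
    """
    import streamlit as st
    from streamlit_mic_recorder import mic_recorder

    st.title("Memo AI")
    # mic_recorder returns None until the user stops a recording,
    # then a dict whose "bytes" entry holds the encoded audio buffer.
    audio = mic_recorder(start_prompt="🎙️ Record", stop_prompt="⏹️ Stop", key="mic")
    if audio is None:
        return None
    return audio["bytes"]
```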

2. Transcription Layer (OpenAI Whisper)

To convert speech to text, I integrated OpenAI’s Whisper (using the base model for a balance of speed and accuracy).

  • The Problem: Whisper depends on FFmpeg, which is often a "path nightmare" for users to install.
  • The Solution: I used static-ffmpeg to add the required binaries to the PATH at runtime. No system-level installation required.
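A minimal transcription helper along these lines might look like this (the function name is mine, not the repo's; static_ffmpeg.add_paths() is the library call that puts the bundled ffmpeg/ffprobe binaries on PATH):

```python
import tempfile

def transcribe(audio_bytes: bytes) -> str:
    """Sketch: write browser audio to a temp file and run Whisper on it.

    static_ffmpeg and whisper are imported lazily so the module loads
    even before the dependencies are installed.
    """
    import static_ffmpeg
    import whisper

    # Prepends the bundled ffmpeg/ffprobe binaries to PATH at runtime,
    # so no system-wide FFmpeg install is needed.
    static_ffmpeg.add_paths()

    # Whisper's high-level API reads from a file path, so persist the buffer.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio_bytes)
        path = f.name

    model = whisper.load_model("base")  # small enough for CPU use
    result = model.transcribe(path)
    return result["text"].strip()
```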

3. Reasoning Layer (Ollama and Intent Detection)

The transcribed text is sent to a local Ollama instance (running Llama 3.2 or Phi-3).
This is where most agents fail because LLMs are non-deterministic. I solved this by:

  • System Prompting: Forcing the model to act as a "JSON-only Engine."
  • Pydantic Validation: Using Python’s Pydantic library to parse the LLM's raw string into a structured object with specific intents (create_file, write_code, etc.).
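Put together, the reasoning step might look like the sketch below. The Intent fields and the classify_intent helper are illustrative, not the project's exact schema; the Ollama REST API at localhost:11434 accepts a `format: "json"` option that constrains the reply to valid JSON:

```python
import json
import urllib.request

from pydantic import BaseModel

class Intent(BaseModel):
    # Illustrative schema; the real project defines its own intents.
    action: str              # e.g. "create_file", "write_code"
    filename: str = "note.txt"
    content: str = ""

SYSTEM = (
    "You are a JSON-only engine. Reply with exactly one JSON object "
    'like {"action": ..., "filename": ..., "content": ...} and nothing else.'
)

def classify_intent(text: str, model: str = "llama3.2") -> Intent:
    """Ask a local Ollama instance to map transcribed speech to an Intent."""
    payload = json.dumps({
        "model": model,
        "system": SYSTEM,
        "prompt": text,
        "format": "json",   # Ollama constrains the reply to valid JSON
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        raw = json.load(resp)["response"]
    # Pydantic parses the raw string into a validated object (or raises).
    return Intent.model_validate_json(raw)
```

The key move is validating immediately: a malformed reply fails loudly at the boundary instead of corrupting a downstream tool call.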

4. Action Layer (The Tool Dispatcher)

Once the intent is classified, a dispatcher maps the command to a Python "Tool."

  • Need a script? The write_code tool is triggered.
  • Need to remember something? The create_file tool handles it.

This modular approach makes it trivial to add new features like "Search Web" or "Send Email" in the future.
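A dispatcher of this kind can be as simple as a dict from intent name to handler. The handler signatures below are my guess at the shape, not the repo's actual code:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def create_file(filename: str, content: str) -> str:
    """Persist a note (or any text) under the output directory."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    (OUTPUT_DIR / filename).write_text(content)
    return f"Saved {filename}"

def write_code(filename: str, content: str) -> str:
    # Same mechanics today; kept separate so each tool can grow its own
    # logic (templates, linting, etc.) without touching the others.
    return create_file(filename, content)

# Adding a new capability = one function + one dict entry.
TOOLS = {
    "create_file": create_file,
    "write_code": write_code,
}

def dispatch(action: str, **kwargs) -> str:
    tool = TOOLS.get(action)
    if tool is None:
        raise ValueError(f"Unknown intent: {action}")
    return tool(**kwargs)
```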

5. Security and Persistence Layer

Executing code based on AI instructions is dangerous. I implemented Path Scoping. Every file operation is validated against a strict root directory (/output). If the LLM tries to write to C:/Windows/, the system raises a ValueError and blocks the action.
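Sketched with pathlib (the function name safe_path is mine), the check reduces to resolving the candidate path and confirming the sandbox is one of its parents:

```python
from pathlib import Path

SANDBOX = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Resolve filename inside the sandbox, rejecting any escape attempt."""
    candidate = (SANDBOX / filename).resolve()
    # resolve() collapses "..", symlinks, etc., so a traversal like
    # "../../startup_script.py" lands outside SANDBOX and fails this check.
    if candidate != SANDBOX and SANDBOX not in candidate.parents:
        raise ValueError(f"Blocked write outside sandbox: {candidate}")
    return candidate
```

Every tool calls safe_path() before touching the filesystem, so the sandbox rule is enforced in one place rather than in each tool.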

The Technical Stack

  • Frontend: Streamlit
  • Speech-to-Text: OpenAI Whisper
  • Large Language Model: Ollama (Llama 3.2 1B / 3B)
  • Validation: Pydantic V2
  • System Glue: Python 3.10+

The Biggest Challenges (And How I Overcame Them)

Challenge 1: The "FP16" CPU Warning

When running Whisper on a standard laptop without a dedicated NVIDIA GPU, it often throws warnings about 16-bit floating point (FP16) not being supported.

  • Fix: I implemented a check in the stt.py module to default to FP32 when CUDA is not detected, ensuring a smooth experience for all users.
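The guard is a one-liner once you check for CUDA. In this sketch the torch import is wrapped so the check degrades to CPU mode when PyTorch is absent; Whisper's transcribe() accepts an fp16= keyword:

```python
def use_fp16() -> bool:
    """Return True only when a CUDA device is available.

    Passing fp16=False to Whisper on CPU-only machines avoids the
    "FP16 is not supported on CPU" warning and falls back to FP32.
    """
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        # No PyTorch at all: definitely no CUDA, so stick with FP32.
        return False

# Usage inside stt.py (sketch): model.transcribe(path, fp16=use_fp16())
```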

Challenge 2: LLM Hallucinations in JSON

Sometimes an LLM adds conversational filler around the JSON ("Sure! Here is your JSON: ...").

  • Fix: I wrote a robust extraction utility in intent.py that uses string splitting and fence markers (```json) to strip away the fluff before parsing. This makes the system nearly "un-crashable," however chatty the LLM's personality.
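A defensive extractor along those lines (hypothetical helper name; the real intent.py may differ) first looks for a fenced ```json block, then falls back to slicing between the outermost braces:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Pull the first JSON object out of a chatty LLM reply."""
    # 1) Prefer the contents of a fenced block like ```json ... ```
    #    (non-greedy match; assumes a flat object inside the fence).
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # 2) Otherwise slice from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in LLM output")
    return json.loads(raw[start:end + 1])
```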

Challenge 3: Path Traversal Security

Allowing an AI to generate filenames is a security risk. A malicious prompt could trick the AI into creating a file called ../../startup_script.py.

  • Fix: I used os.path.abspath and Path.resolve() to ensure the final destination of any file is physically located within my project's /output sandbox before the write() command is ever called.

Future Scope

Memo AI is just the beginning. The next steps involve:

  1. RAG (Retrieval-Augmented Generation): Allowing the agent to "read" your local PDFs and answer questions.
  2. Voice Feedback: Adding Text-to-Speech (TTS) using a local model like Piper or Coqui so the agent can talk back.
  3. Active Tooling: Integrating with system APIs to control volume, brightness, or open applications.

Conclusion

Building this project taught me that you don't need a massive cloud budget to build powerful AI. With the right orchestration of local models, you can build a personal assistant that is fast, free to run, and—most importantly—completely yours.

Link to Project Repository: https://github.com/priyanshsingh11/Memo-AI

Found this useful? Shares are appreciated!