AI Navigate

AFM MLX on macOS - new version released! Great new features (macOS)

Reddit r/LocalLLaMA / 3/18/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • AFM MLX on macOS releases version 0.9.7 as a 100% Swift wrapper around MLX with more advanced inference features and no Python required.
  • The release adds support for more models than the baseline Swift MLX, broadening model availability on macOS.
  • Installation is straightforward via pip (pip install macafm) or Homebrew (brew install scouzi1966/afm/afm).
  • Telegram integration lets users chat with a local model through a Telegram bot, enabling remote interaction.
  • It ships an experimental tool parser (afm_adaptive_xml) and runtime options such as --enable-prefix-caching, --enable-grammar-constraints, --no-think, --concurrent, --guided-json, and --vlm for multimodal models, with notes on compatibility and defaults.

Visit the repo. 100% open source. Vibe-coded PRs accepted! It's a wrapper around MLX with more advanced inference features, and it supports more models than the baseline Swift MLX. It is 100% Swift; no Python is required. You can install it with pip, but that's the extent of Python's involvement.

New in 0.9.7
https://github.com/scouzi1966/maclocal-api

pip install macafm or brew install scouzi1966/afm/afm

Telegram integration: give it a bot ID and chat with your local model from anywhere via a Telegram client. The first phase is basic.

Experimental tool parser: afm_adaptive_xml. Lower-quant and lower-parameter-count models are not the best at producing tool calls that conform to the client schema.

--enable-prefix-caching: Enable radix tree prefix caching for KV cache reuse across requests
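Conceptually, radix-tree prefix caching remembers the KV cache for prompts already processed and, for each new request, reuses the cache for the longest shared token prefix. A minimal language-agnostic sketch (Python here for brevity; the class and method names are hypothetical, not the project's API):

```python
# Conceptual sketch of prefix caching: remember token sequences that
# have already been processed and, for a new request, find the longest
# cached prefix so its KV-cache entries can be reused instead of
# recomputed. A real radix tree finds this in O(prefix length);
# the linear scan below just keeps the idea visible.

class PrefixCache:
    def __init__(self):
        self.cached = []  # previously processed token sequences

    def insert(self, tokens):
        self.cached.append(list(tokens))

    def longest_cached_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        best = 0
        for entry in self.cached:
            n = 0
            while n < min(len(entry), len(tokens)) and entry[n] == tokens[n]:
                n += 1
            best = max(best, n)
        return best
```

In a real server, the returned prefix length tells the engine how many tokens of attention state can be reused across requests that share a system prompt or conversation history.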

--enable-grammar-constraints: Enable EBNF grammar-constrained decoding for tool calls (requires --tool-call-parser afm_adaptive_xml). Forces valid XML tool-call structure at generation time, preventing JSON-inside-XML and missing parameters. Integrates with xGrammar.
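The idea behind grammar-constrained decoding can be sketched as a small automaton that only admits tokens keeping the output inside the expected XML tool-call structure; a real engine such as xGrammar compiles EBNF rules into such an automaton and masks invalid tokens during sampling. The states and token strings below are invented for illustration:

```python
# Toy state machine for constrained decoding of an XML tool call:
# only token sequences of the form <tool_call> ... </tool_call> are
# accepted; anything else is rejected at the step where it deviates.

OPEN, BODY, DONE = "open", "body", "done"

def step(state, token):
    """Return the next state, or None if the grammar rejects `token`."""
    if state == OPEN and token == "<tool_call>":
        return BODY
    if state == BODY and token == "</tool_call>":
        return DONE
    if state == BODY:
        return BODY  # free-form arguments inside the call
    return None

def accepts(tokens):
    state = OPEN
    for t in tokens:
        state = step(state, t)
        if state is None:
            return False
    return state == DONE
```

Applied at generation time rather than after the fact, the same automaton prevents malformed structures (like JSON-inside-XML) from ever being emitted.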

--no-think: Disable thinking/reasoning. Useful for Qwen 3.5 models, which have some tendency to overthink.

--concurrent: Maximum concurrent requests (enables batch mode; 0 or 1 reverts to serial). For batch inference: get more throughput with parallel requests than with serialized requests.
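The throughput gain from batching can be estimated with back-of-envelope arithmetic: n requests at latency t each take roughly n·t serially, but about ceil(n/b)·t when the server processes b requests per step. The model below is illustrative only; real per-batch latency grows somewhat with batch size:

```python
# Illustrative serial-vs-batched timing model for concurrent requests.

import math

def serial_time(n_requests, latency):
    # One request at a time: total time scales linearly.
    return n_requests * latency

def batched_time(n_requests, batch_size, latency):
    # Batched: requests are processed batch_size at a time.
    return math.ceil(n_requests / batch_size) * latency
```

For example, 8 requests at 2 s each take about 16 s serially but about 4 s with a batch size of 4, under the (optimistic) assumption that a batch costs the same as a single request.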

--guided-json: Force schema-conforming output.

--vlm: Load multimodal models as a VLM. Text-only mode is on by default, which lets users bypass VLM processing for better pure-text output.

submitted by /u/scousi