I've been building Abenix — an open-source multi-agent platform — in
the open for the last few months. MIT-licensed, runs on Kubernetes (or
docker-compose for laptop dev), and ships with five fully-built apps
on top of it so you can see what a real agent platform looks like
end-to-end. Single command brings up the whole stack:
bash scripts/dev-local.sh # laptop, ~5-7 min
bash scripts/deploy-azure.sh deploy # AKS, ~35 min
👉 github.com/sarkar4777/abenix — MIT, Python + TS + Java SDKs,
KEDA-scaled, Postgres + Neo4j + Redis + NATS, 12 container images, 5
showcase apps.
Before the technical bits, the question I keep getting asked: why
build this when n8n / LangChain / CrewAI / Dify / Flowise exist? It's
a fair question, and the answer is genuinely positive: every one of
those projects is amazing at what they target, and Abenix would not
exist without LangChain demonstrating what an agent loop can be. So
this post is the build journal, not a comparison fight.
Why I built it: the constraints I started with
Three constraints drove every design choice:
- Lightweight enough to run on a laptop, robust enough to run on a
cluster — same code path. I wanted bash scripts/dev-local.sh to
bring up the exact same agents, KB collections, and pipelines that
production runs, just on docker-compose instead of helm. No "but in
prod we use Kafka, locally it's an array" branching. The seed scripts,
SDK clients, key reconciler, and migration system all run identically
in both modes.
- Five real apps as the test surface, not toy demos. I work with
enterprises (insurers, energy traders, tourism boards, pharma
cold-chain). Each domain has a different sharp edge, so I committed to
shipping five different production-shaped apps in this repo, each
forcing the platform to solve its own problem, and each demonstrating
to the enterprise that something like this can be built in-house:
- OracleNet — 7-agent decision-analysis pipeline (Historian, Stakeholder Sim, Contrarian, Synthesizer…) producing a Decision Brief with 6 tabs. Tests parallel-merge orchestration + JSON-schema strict output.
- Saudi Tourism — Vision-2030 KPIs over CSV/PDF, NLQ chat, simulator with 5 presets. Tests RAG + tabular tools + report generation.
- ClaimsIQ — Java/Vaadin claim adjudication with a live SSE DAG. Tests the Java SDK (yes, Java — because the claims department your customer actually runs is on the JVM) + multimodal photo input.
- Industrial-IoT — pump predictive maintenance + cold-chain pharma. Tests the code-asset primitive: pipelines deploy real Go binaries as k8s Jobs and read back results — the agent doesn't run the DSP, it schedules a Go process to do the math.
- ResolveAI — case management with persona KB + precedent retrieval and approval-tier policies. Tests actAs(customer_id) delegation, policy-grounded resolutions, and SLA breach sweeps.
If the platform regresses, all five regress in different ways. Hard
to fake.
- Make the boring enterprise plumbing first-class, not an afterthought. The features that get a project past procurement at a regulated buyer aren't the demo bits — they're the audit trail, the tool allow-list, the multi-tenant isolation, the cost ledger. So those are core primitives in the platform, not extensions. More on that below.
What stands out — the bits I reach for and miss when I'm not on it
These are honestly the things I'd rebuild somewhere else if I had to,
because once you have them you stop wanting to live without them:
actAs(subject_id) — delegation as a primitive
Every agent execution carries an acting subject, not just an API
key. When ResolveAI fires the policy-research agent for a case opened
by customer_42, the SDK calls:
forge.execute(
    "resolveai-policy-research",
    message=ticket_text,
    act_as=ActingSubject(subject_type="resolveai", subject_id="customer_42"),
)
The execution row carries acting_subject_id. Tools that read data
(like knowledge_search) check the subject's grants on the
collection, not the platform's service-account grants. A bad agent
can't escalate its way to data the user can't see, because the agent
is permanently scoped to the subject.
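To make the mechanism concrete, here's a minimal sketch of that grant check; the function and table shapes are illustrative, not the actual Abenix internals:

# Illustrative sketch only: a data-reading tool enforcing the acting
# subject's grants rather than the platform service account's.
from dataclasses import dataclass

@dataclass(frozen=True)
class ActingSubject:
    subject_type: str
    subject_id: str

def knowledge_search(query: str, collection_id: str,
                     subject: ActingSubject,
                     grants: set[tuple[str, str, str]]) -> list[str]:
    # grants: rows of (subject_type, subject_id, collection_id). The check
    # runs against the subject carried on the execution row, so a
    # compromised agent cannot widen its scope mid-run.
    key = (subject.subject_type, subject.subject_id, collection_id)
    if key not in grants:
        raise PermissionError(f"{subject.subject_id} has no grant on {collection_id}")
    # ...actual pgvector query elided; placeholder result for the sketch
    return [f"results for {query!r} in {collection_id}"]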
Pipelines as version-controlled YAML, lint-checked at boot
slug: oraclenet-synthesizer
output_schema:
  type: object
  properties:
    confidence: { type: number, minimum: 0, maximum: 100 }
    stakeholders:
      type: array
      items:
        properties:
          sentiment: { enum: [positive, negative, neutral] }
    risks:
      type: array
      items:
        properties:
          severity: { enum: [low, medium, high, critical] }
scripts/lint-agent-seeds.py runs at deploy time (Phase 4) and rejects
any seed YAML where pipeline_config is misnested or a tool name isn't
in the registry. Because catching the bug at YAML-load is 1000× cheaper
than catching it at 3 AM after a model returned severity: "extreme".
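The shape of that lint pass, sketched in a few lines (the real scripts/lint-agent-seeds.py does more; the seed field names here are assumptions):

# Minimal sketch of a seed-time lint: reject unknown tools and misnested
# pipeline_config before anything hits the database.
import sys
import yaml  # pyyaml

KNOWN_TOOLS = {"knowledge_search", "atlas_describe", "atlas_traverse"}

def lint_seed(path: str) -> list[str]:
    with open(path) as f:
        seed = yaml.safe_load(f)
    errors = []
    pipeline = seed.get("pipeline_config")
    if not isinstance(pipeline, dict) or "nodes" not in pipeline:
        errors.append(f"{path}: pipeline_config missing or misnested")
    for tool in seed.get("tools", []):
        if tool not in KNOWN_TOOLS:
            errors.append(f"{path}: unknown tool {tool!r}")
    return errors

if __name__ == "__main__":
    problems = [e for path in sys.argv[1:] for e in lint_seed(path)]
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # fail at deploy time, not at 3 AM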
Output-schema enforcement with normalization, not just validation
Validation alone would crash production at the first ambiguous enum.
The engine runs a post_process.py step on every agent output:
validates against output_schema, normalizes known drift
(mixed → neutral, extreme → critical), and emits
validation_warnings on the SSE done event. UI never crashes, drift
is visible, and I can tighten prompts without breaking the front-end.
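A minimal sketch of the idea, with jsonschema standing in for whatever the engine actually uses, and an assumed drift map:

# Sketch: validate, then normalize known enum drift instead of crashing.
from jsonschema import Draft202012Validator

ENUM_DRIFT = {"mixed": "neutral", "extreme": "critical"}

def post_process(output: dict, schema: dict) -> tuple[dict, list[str]]:
    warnings = []
    # First pass: rewrite known drifted enum values in place.
    for stakeholder in output.get("stakeholders", []):
        s = stakeholder.get("sentiment")
        if s in ENUM_DRIFT:
            stakeholder["sentiment"] = ENUM_DRIFT[s]
            warnings.append(f"normalized sentiment {s!r} -> {ENUM_DRIFT[s]!r}")
    for risk in output.get("risks", []):
        v = risk.get("severity")
        if v in ENUM_DRIFT:
            risk["severity"] = ENUM_DRIFT[v]
            warnings.append(f"normalized severity {v!r} -> {ENUM_DRIFT[v]!r}")
    # Second pass: anything still invalid becomes a warning on the SSE
    # done event, not a crash in the UI.
    for err in Draft202012Validator(schema).iter_errors(output):
        warnings.append(err.message)
    return output, warnings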
KEDA queue-depth autoscaling per agent type
Different agents have different cost profiles. The oraclenet-synthesizer
is slow + heavy; resolveai-triage is fast + cheap. They run
on different agent-runtime pools with their own KEDA scalers
(agent-runtime-default, agent-runtime-chat, agent-runtime-heavy-reasoning,
agent-runtime-long-running). When the synthesizer queue
backs up, only that pool scales — chat traffic doesn't pay the
auto-scaler tax.
Tool registry with seed-time allow-list
Every tool is declared in apps/agent-runtime/engine/tools/. Every
agent's seed YAML declares which tools it can use. The lint pass
rejects an agent that tries to call a tool not in the registry. An
agent literally cannot call something its seed didn't allow-list — no
prompt-injection of "use the email tool to send the password to me" is
going to find a tool the agent doesn't have access to.
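Sketched, the resolution path looks something like this (the registry layout is an assumption):

# Sketch: tool resolution happens against the agent's seed allow-list,
# so an unknown or un-granted tool name never reaches execution.
REGISTRY = {
    "knowledge_search": lambda **kw: [...],  # real tools live in
    "atlas_traverse": lambda **kw: {...},    # apps/agent-runtime/engine/tools/
}

def resolve_tool(agent_allowed: set[str], name: str):
    if name not in REGISTRY:
        # the lint pass catches this case at seed time
        raise LookupError(f"tool {name!r} not in registry")
    if name not in agent_allowed:
        raise PermissionError(f"tool {name!r} not allow-listed for this agent")
    return REGISTRY[name]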
Code-asset primitive — pipelines deploy real binaries
This one I haven't seen in any other agent platform. An agent pipeline
node can take a zipped Go (or Node, Python, Rust, Ruby, Java) project
as input, deploy it as a k8s Job, run it with structured input, and
read structured output back:
- id: pump_dsp
  type: code_asset
  asset_id: pump-dsp-corrector
  inputs:
    rpm: 2400
    samples: "${windows[i].vibration}"
  outputs:
    fault_scores: object
The Industrial-IoT pump pipeline uses this for FFT + bearing-resonance
analysis (Go), then chains the fault_scores into an LLM agent for
diagnosis. Real code, sandboxed, scheduled by the platform, results
threaded back into the agent's reasoning.
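For a feel of what the node does under the hood, here's a rough sketch using the official kubernetes Python client; the image name, namespace, and NODE_INPUT convention are my assumptions, not the platform's actual contract:

# Sketch: run a code asset as a k8s Job with structured input on env,
# then hand back the Job name for status polling + output collection.
import json
from kubernetes import client, config

def run_code_asset(asset_image: str, inputs: dict, namespace: str = "abenix") -> str:
    config.load_incluster_config()  # or load_kube_config() for laptop dev
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="code-asset-"),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="asset",
                        image=asset_image,
                        env=[client.V1EnvVar(name="NODE_INPUT",
                                             value=json.dumps(inputs))],
                    )],
                ),
            ),
        ),
    )
    created = client.BatchV1Api().create_namespaced_job(namespace, job)
    return created.metadata.name  # poll status + read structured output next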
One stack: agents + Atlas (graph) + KB (vector) + tools
Atlas is the project's named-entity / ontology graph (Neo4j-backed),
KB is the document collection store (pgvector), agents call both via
tools (atlas_describe, atlas_traverse, knowledge_search). Most
projects stitch these from three vendors; here they share a tenant
scope, a permission model, and a deploy path.
SDKs in three runtimes
- Python — canonical, used by the core API + every standalone app API
- TypeScript / Node — used by every standalone web frontend
- Java — used by ClaimsIQ (Vaadin), proven by a 6-stage adjudication pipeline + live SSE DAG view
Same actAs, same wait semantics, same execution row shape. If your
enterprise stack is JVM, you don't have to rewrite it in Python.
Self-check endpoint + idempotent bootstrap
GET /api/agents/{slug}/self-check returns:
{
  "agent": "oraclenet-synthesizer",
  "checks": {
    "no_top_level_keys_leaked_into_model_config": "ok",
    "pipeline_has_nodes": "ok",
    "pipeline_nodes_well_formed": "ok",
    "model_declared": "ok",
    "tools_registry_loadable": "ok",
    "tools_all_known": "ok"
  }
}
Agent broken? You see why in 50ms instead of debugging by tail-log.
Single-command deploy that's idempotent
Phase 0 — SDK drift pre-flight (5 vendored copies vs canonical hash)
Phase 1 — Provision RG + ACR + AKS
Phase 2 — Build + push 12 images
Phase 3 — Helm install + alembic upgrade head
Phase 4 — Seed agents, users, portfolio schemas, KB, standalone keys
Phase 5 — Ingress (nip.io magic-domain so you don't fight DNS)
If any phase fails, re-run the same command. Every step is idempotent,
including the standalone-API-key reconciler that mints + rotates keys
per app and patches them into k8s secrets.
How it relates to the rest of the agent ecosystem
I'm a fan of every project in this space. None of them is a competitor
in the zero-sum sense — they target different audiences and different
phases of an AI project. Here's how I think about where each fits in a
buying decision, including Abenix:
What you want to do, and what likely fits the bill:
- A library to compose LLM calls + tools, full DIY around it -> LangChain / LlamaIndex
- Multi-agent role-playing prototypes in 50 lines -> CrewAI / AutoGen
- Visual workflow builder for ops + automation, lots of integrations -> n8n / Make / Zapier
- Visual LLM-app builder, self-hostable, RAG-first -> Dify / Flowise
- Hosted LLM observability + tracing for an existing LangChain app -> LangSmith
- Enterprise platform, plus a playground to customize: multi-tenant, audit trail, KEDA-scaled, multi-language SDKs, pipelines as YAML, reference apps to copy-paste from -> Abenix ← what I built
Most teams I work with end up using two of these, not one.
LangChain inside an Abenix agent. n8n calling an Abenix endpoint.
LangSmith pointed at Abenix execution traces. The platforms don't have
to fight each other to coexist.
I built Abenix specifically for the bottom row — the moment when an
enterprise says "great prototype, now make it production-ready under
our security review with five teams sharing it." That's the gap I kept
hitting.
A production war story to close
The bug that took me a week to find, and that explains 80% of why
platforms in this space feel flaky in production.
A few months ago I added KEDA queue-depth scaling. That meant
POST /api/agents/{id}/execute had to become async-by-default — return
{execution_id, mode: "async"} immediately and let workers grind the
queue. Browser clients got a live SSE stream of node progress.
The Python SDK kept reading data["output"] from the immediate
response, which was empty. So every standalone app was getting empty
agent responses, the JSON parse exploded, the API returned 500, and the
UI showed a generic "agent failure" toast.
Fix: tri-state wait parameter on the server, defaulting to True
for SDK callers (API-key auth) and False for browser callers (cookie
auth, has UI for live streams). SDK now sends wait: true and falls
back to polling /api/executions/{id} if the server still returns
async-mode.
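Roughly, the client side of that fix (the endpoint paths match the post; the retry cadence and field names are guesses):

# Sketch of the SDK fallback: ask the server to wait, but tolerate an
# async-mode answer from an older server by polling the execution row.
import time
import requests

def execute(base: str, slug: str, message: str, api_key: str,
            poll_every: float = 1.0, timeout: float = 300.0) -> dict:
    r = requests.post(f"{base}/api/agents/{slug}/execute",
                      json={"message": message, "wait": True},
                      headers={"Authorization": f"Bearer {api_key}"})
    r.raise_for_status()
    data = r.json()
    if data.get("mode") != "async":
        return data  # server honored wait=true; output is inline
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        row = requests.get(f"{base}/api/executions/{data['execution_id']}",
                           headers={"Authorization": f"Bearer {api_key}"}).json()
        if row.get("status") in ("completed", "failed"):
            return row
        time.sleep(poll_every)
    raise TimeoutError(f"execution {data['execution_id']} did not finish")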
Then I added a Phase 0 deploy gate — scripts/sync-sdks.sh --check
runs at the top of every deploy, hashes the canonical SDK against five
vendored copies, and refuses to proceed if any drift is detected:
✓ in sync: packages/agent-sdk/abenix_sdk
✓ in sync: contractiq/api/sdk/abenix_sdk
✓ in sync: industrial-iot/api/sdk/abenix_sdk
✓ in sync: resolveai/api/sdk/abenix_sdk
✓ in sync: sauditourism/api/sdk/abenix_sdk
✓ All 5 SDK copies in sync with canonical.
Added this plumbing only after getting bitten.
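The gate itself is a small idea. The shipped version is a shell script, but in Python the same check is roughly:

# Sketch: hash the canonical SDK tree, compare each vendored copy,
# refuse to deploy on any mismatch.
import hashlib
from pathlib import Path

def tree_hash(root: Path) -> str:
    h = hashlib.sha256()
    for f in sorted(root.rglob("*.py")):
        h.update(f.relative_to(root).as_posix().encode())
        h.update(f.read_bytes())
    return h.hexdigest()

canonical = tree_hash(Path("packages/agent-sdk/abenix_sdk"))
copies = [
    "contractiq/api/sdk/abenix_sdk",
    "industrial-iot/api/sdk/abenix_sdk",
    "resolveai/api/sdk/abenix_sdk",
    "sauditourism/api/sdk/abenix_sdk",
]
drift = [c for c in copies if tree_hash(Path(c)) != canonical]
if drift:
    raise SystemExit(f"SDK drift detected: {drift}")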
What's not great
- No managed-cloud option today. So far it's mostly been K8s and VM installations; bring your own AKS/GKE/EKS or run it on a laptop.
- Atlas + KB grounding is wired in for some out-of-the-box agents, but not all.
- The KB document seeder is a no-op right now — content goes through Cognify (chunking + embedding), which the seeder doesn't drive yet. Collections + agent grants seed fine; content arrives via the upload UI or POST /api/knowledge/collections/{id}/documents.
- Probes ≠ tests. I have 6 Playwright probes that walk every link in every showcase app and capture screenshots — useful, but they're "smoke + screenshots," not real unit tests. More unit coverage is needed on top of what exists.
What's in it for you
If you're building agents, or want a lightweight agent backbone of your
own, and the production handoff is starting to bite: this repo is six
months of "ok, what does the seam between the agent and the rest of my
product actually look like?" The five showcase apps are my attempt at
that sharp edge. Fork it, replace "insurance" with your domain, ship.
PRs welcome.
Repo: https://github.com/sarkar4777/abenix
Single-command deploy: bash scripts/deploy-azure.sh deploy or
bash scripts/dev-local.sh
If you build something on it or find a real bug, open an issue — I
read every one.