<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Works With Agents]]></title><description><![CDATA[Works With Agents]]></description><link>https://blog.workswithagents.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1593680282896/kNC7E8IR4.png</url><title>Works With Agents</title><link>https://blog.workswithagents.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 07 May 2026 23:14:16 GMT</lastBuildDate><atom:link href="https://blog.workswithagents.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Every Agent I Delegated To Kept Failing. I Finally Checked the Model.]]></title><description><![CDATA[I built a delegation system that spawns AI agents to handle sub-tasks in parallel. Quality sweeps. Code audits. Checking every SDK directory for dead links. The idea: spin up cheap local agents, let them work, collect results.
They kept failing. Not ...]]></description><link>https://blog.workswithagents.dev/agent-autopsy-2-every-agent-failing</link><guid isPermaLink="true">https://blog.workswithagents.dev/agent-autopsy-2-every-agent-failing</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 22:33:01 GMT</pubDate><content:encoded><![CDATA[<p>I built a delegation system that spawns AI agents to handle sub-tasks in parallel. Quality sweeps. Code audits. Checking every SDK directory for dead links. The idea: spin up cheap local agents, let them work, collect results.</p>
<p>They kept failing. Not crashing — just stopping. No output. No error. 600 seconds of silence, then a timeout.</p>
<p>I assumed the tasks were too complex. I assumed parallel delegation was unreliable. I never checked what model I was actually giving them.</p>
<h2 id="heading-the-root-cause">The Root Cause</h2>
<p>My delegation system was configured to use a small local model. Fine for single-turn questions. Useless for multi-step tool loops.</p>
<p>A quality sweep isn't one tool call. It's: find the directory, list the files, search each one, flag issues, report results. That's five sequential steps, each dependent on the last. The small model lost coherence after the second call. The first step worked. By the third, it was hallucinating or hanging.</p>
<p>Meanwhile, the main agent handled the exact same tasks in minutes. Same instructions. Different model.</p>
<h2 id="heading-what-i-assumed">What I Assumed</h2>
<p>I assumed any model that passes benchmarks can handle tool-calling. I assumed "cheap model for leaf tasks" was an optimization. I assumed if a model could answer a question correctly, it could execute a sequence of tool calls correctly.</p>
<p>Benchmarks measure knowledge. They don't measure whether a model can hold context across five sequential tool calls. Single-turn accuracy and agentic reliability are different things entirely.</p>
<h2 id="heading-what-i-no-longer-assume">What I No Longer Assume</h2>
<p>I now test every model on a concrete multi-step task before adding it to the delegation pool: find a directory, search for a pattern, read the matching file, report what you found. If it can't complete that loop, it doesn't get delegated work.</p>
<p>I also built a decision gate that evaluates task complexity against model capability before spawning a subagent. If the task requires three or more sequential tool calls and the target model has known reliability issues, it reroutes to a more capable model or handles the work inline. Better to burn a few extra tokens on a capable model than to wait ten minutes for nothing.</p>
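<p>The gate fits in a few lines. A minimal sketch, with made-up model names and thresholds:</p>
<pre><code class="lang-python">MODEL_MAX_TOOL_CALLS = {
    "local-small": 2,      # loses coherence after the second call
    "frontier-large": 20,
}

def route(steps, target_model):
    """steps = the ordered tool calls a task needs."""
    if len(steps) &gt;= 3 and MODEL_MAX_TOOL_CALLS.get(target_model, 0) &lt; len(steps):
        return "inline"    # orchestrator does it, or reroutes to a capable model
    return "delegate"      # safe to spawn the subagent

print(route(["find", "search", "read", "report"], "local-small"))  # inline
</code></pre>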
<h2 id="heading-what-you-should-check">What You Should Check</h2>
<p>If you're building systems that delegate work between agents:</p>
<ul>
<li><strong>Test subagent models on multi-step tool loops, not just benchmarks.</strong> Give them a real sequence of dependent calls. If they fail by step three, they're not ready for autonomous work.</li>
<li><strong>Gate delegation before it starts, not after it times out.</strong> A decision layer that checks task complexity against model capability catches failures before they become silent timeouts.</li>
<li><strong>Parallel delegation to weak models isn't faster — it's ten minutes of silence instead of two minutes of work.</strong> Before spawning subagents, ask: can the orchestrator just do this?</li>
</ul>
<hr />
<p>Both checks are open source in the <a target="_blank" href="https://github.com/vystartasv/agent-foundry">agent-foundry repo</a>. No promises about what breaks next — but something will.</p>
<hr />
<p><em>I build agent infrastructure inside Microsoft 365. SPFx · TypeScript · autonomous multi-agent systems. Currently open to senior/architect roles (£120K+ remote UK). → <a target="_blank" href="mailto:vilius@workswithagents.com">vilius@workswithagents.com</a></em></p>
]]></content:encoded></item><item><title><![CDATA[I Published Broken Packages to PyPI. I Checked Them First.]]></title><description><![CDATA[I published two Python packages last week. I checked them before tagging the release. CI was green. twine check passed. I moved on.
This morning my agent told me one of them had been broken for three days. Anyone who copied the install command from t...]]></description><link>https://blog.workswithagents.dev/agent-autopsy-1-broken-pypi-packages</link><guid isPermaLink="true">https://blog.workswithagents.dev/agent-autopsy-1-broken-pypi-packages</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 22:32:56 GMT</pubDate><content:encoded><![CDATA[<p>I published two Python packages last week. I checked them before tagging the release. CI was green. <code>twine check</code> passed. I moved on.</p>
<p>This morning my agent told me one of them had been broken for three days. Anyone who copied the install command from the README got <code>No matching distribution found</code>. The homepage link was a dead domain. Every image on the PyPI page — broken. The other package listed no license at all.</p>
<p>I had checked them. And they were wrong.</p>
<h2 id="heading-what-i-found">What I Found</h2>
<p>The README told users to install a package name that didn't exist — a typo in the one place that mattered most. The homepage link pointed to a domain that never resolved. Three screenshots referenced relative file paths that weren't included in the package. Three badge links pointed to absolutely nowhere.</p>
<p>The <code>workswithagents</code> package was cleaner, but PyPI displayed "License: None."</p>
<p>Both packages passed CI. Both passed <code>twine check</code>. Both were live.</p>
<h2 id="heading-what-i-assumed">What I Assumed</h2>
<p>I assumed CI green meant the package was correct. I assumed <code>twine check</code> validated what users would see. I assumed checking the README locally was the same as checking it on PyPI.</p>
<p>None of those things are true.</p>
<p><code>twine check</code> validates package <em>structure</em> — valid metadata headers, correct file layout. It does not resolve URLs. It does not compare install commands against actual package names. It does not check if images exist. It does not verify licenses. It's a structural linter, not a content validator.</p>
<h2 id="heading-what-i-no-longer-assume">What I No Longer Assume</h2>
<p>Every package I publish now runs through a content quality gate <em>before</em> <code>twine upload</code>. The gate checks: does the homepage resolve? Does the install command match the actual package name? Are all images either in the wheel or reachable URLs? Is there a license? Do badge links have real targets?</p>
<p>The gate is 200 lines of Python. It caught all 9 issues in one run. If I'd had it three days ago, neither package would have shipped broken.</p>
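<p>A condensed sketch of those checks (assumes <code>requests</code>; the real gate does more, but this is the shape):</p>
<pre><code class="lang-python">import re
import requests

def content_gate(package_name, readme, homepage, license_field, image_urls):
    issues = []
    try:
        resolves = requests.head(homepage, allow_redirects=True, timeout=10).ok
    except requests.RequestException:
        resolves = False
    if not resolves:
        issues.append(f"homepage does not resolve: {homepage}")
    for name in re.findall(r"pip install ([\w.-]+)", readme):
        if name != package_name:
            issues.append(f"README installs the wrong name: {name}")
    if not license_field:
        issues.append("license field is empty")
    for url in image_urls:
        if not url.startswith("http"):
            issues.append(f"relative image path, won't render on PyPI: {url}")
    return issues  # empty means safe to run twine upload
</code></pre>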
<h2 id="heading-what-you-should-check">What You Should Check</h2>
<p>If you publish packages — PyPI, npm, anything — check these five things:</p>
<ul>
<li>Your install command in the README matches the actual published name</li>
<li>Your homepage URL resolves from an external network</li>
<li>Every image in your README is either bundled in the package or an absolute URL</li>
<li>Your license field isn't empty</li>
<li>Your badge links point somewhere real</li>
</ul>
<p>These aren't structural issues. CI won't catch them. You have to check them yourself — or build a checker that does.</p>
<hr />
<p>Part 2 coming soon.</p>
<hr />
<p><em>I build agent infrastructure inside Microsoft 365. SPFx · TypeScript · autonomous multi-agent systems. Currently open to senior/architect roles (£120K+ remote UK). → <a target="_blank" href="mailto:vilius@workswithagents.com">vilius@workswithagents.com</a></em></p>
]]></content:encoded></item><item><title><![CDATA[The Agent OSI Model — A 7-Layer Framework for AI Agent Infrastructure]]></title><description><![CDATA[The OSI model didn't create networking. It created the vocabulary that made networking a discipline. Before OSI, engineers said "the connection is broken." After OSI, they said "Layer 2 link is down."
AI agents have no equivalent. When an agent fails...]]></description><link>https://blog.workswithagents.dev/the-agent-osi-model</link><guid isPermaLink="true">https://blog.workswithagents.dev/the-agent-osi-model</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 22:32:52 GMT</pubDate><content:encoded><![CDATA[<p>The OSI model didn't create networking. It created the <strong>vocabulary</strong> that made networking a discipline. Before OSI, engineers said "the connection is broken." After OSI, they said "Layer 2 link is down."</p>
<p>AI agents have no equivalent. When an agent fails, we say "the agent broke." That's useless.</p>
<p>I've published a 7-layer framework for agent infrastructure. Not a product. Not a standard. A vocabulary.</p>
<hr />
<h2 id="heading-the-seven-layers">The Seven Layers</h2>
<pre><code class="lang-plaintext">L7  GOVERNANCE    Audit · Compliance · Sign-off       "Is this safe?"
L6  VERIFICATION  Testing · Evaluation · Gates        "Does this work?"
L5  COORDINATION  Consensus · Distribution · Conflicts "How do agents work together?"
L4  SESSION       Handoff · State · Context           "How does an agent continue?"
L3  DISCOVERY     Registry · Capabilities · Location   "How do agents find each other?"
L2  COMMUNICATION Messaging · Auth · API              "How do agents talk?"
L1  EXECUTION     Hardware · Runtime · Tools          "What runs the agent?"
</code></pre>
<hr />
<h2 id="heading-why-this-matters">Why This Matters</h2>
<p><strong>For debugging:</strong> "Your Layer 4 handoff is broken" is actionable. "Your agents aren't talking to each other" is vague.</p>
<p><strong>For building:</strong> Don't build everything at once. Target specific layers. A local agent needs L1 (runtime) + L2 (auth) + L4 (handoff). A multi-agent fleet adds L3 (discovery) + L5 (coordination). An enterprise deployment adds L6 (verification) + L7 (governance).</p>
<p><strong>For standards:</strong> Each layer without a standard is a gap — and an opportunity. The framework makes it obvious where standards are needed.</p>
<hr />
<h2 id="heading-what-exists-whats-missing">What Exists, What's Missing</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Layer</td><td>Infrastructure</td><td>Status</td></tr>
</thead>
<tbody>
<tr>
<td>L1</td><td>Blueprint Registry (verified LLM configs)</td><td>✅ Live</td></tr>
<tr>
<td>L2</td><td>MCP, A2A, Credential Proxy</td><td>✅ Live</td></tr>
<tr>
<td>L3</td><td>llms.txt, Agent Capability Manifest</td><td>✅ Spec written</td></tr>
<tr>
<td>L4</td><td>Handoff Protocol</td><td>📋 In proposal (MCP SEP #2683, A2A #1817)</td></tr>
<tr>
<td>L5</td><td>Coordination Protocol</td><td>🆕 Spec published today</td></tr>
<tr>
<td>L6</td><td>Agent Test Suite, Pitfall Registry</td><td>⚠️ Partial</td></tr>
<tr>
<td>L7</td><td>Transaction Protocol, Compliance-as-Code</td><td>🆕 Spec published today</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-three-new-specs-published-today">Three New Specs Published Today</h2>
<h3 id="heading-coordination-protocol-layer-5">Coordination Protocol (Layer 5)</h3>
<p>How agents work together simultaneously. Leader election (Raft-lite for agents). Work distribution with capability matching. Work stealing — idle agents pull from busy queues. Conflict resolution with audit trail.</p>
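<p>Work stealing is the easiest piece to picture. A toy sketch with invented queue names; the spec defines the real API:</p>
<pre><code class="lang-python">from collections import deque

queues = {
    "builder-1": deque(["task-a", "task-b", "task-c"]),  # busy
    "builder-2": deque(),                                # idle
}

def steal(idle_agent):
    busiest = max(queues, key=lambda a: len(queues[a]))
    if busiest != idle_agent and queues[busiest]:
        task = queues[busiest].pop()       # take from the tail of the busy queue
        queues[idle_agent].append(task)
        return task
    return None

print(steal("builder-2"))  # task-c moves to the idle agent
</code></pre>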
<h3 id="heading-agent-capability-manifest-layer-3">Agent Capability Manifest (Layer 3)</h3>
<p>Machine-readable declaration of what an agent can do. Like <code>package.json</code> but for agent capabilities. Discovery: "who can build SPFx?" → ranked by success rate + load + trust score.</p>
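<p>The discovery query in miniature. The manifest fields follow the spec's vocabulary, but the weighting here is invented:</p>
<pre><code class="lang-python">manifests = [
    {"agent": "builder-1", "capability": "build:spfx",
     "success_rate": 0.96, "load": 0.40, "trust_score": 0.90},
    {"agent": "builder-2", "capability": "build:spfx",
     "success_rate": 0.88, "load": 0.10, "trust_score": 0.60},
]

def rank(capability):
    matches = [m for m in manifests if m["capability"] == capability]
    return sorted(matches, key=lambda m: (0.5 * m["success_rate"]
                                          + 0.3 * (1 - m["load"])  # less loaded is better
                                          + 0.2 * m["trust_score"]), reverse=True)

print(rank("build:spfx")[0]["agent"])  # builder-1: wins on success rate and trust
</code></pre>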
<h3 id="heading-agent-transaction-protocol-layer-7">Agent Transaction Protocol (Layer 7)</h3>
<p>Guarantees for autonomous actions. Idempotency keys (no double deploys). Intent-before-action logging (know what the agent TRIED to do even if it crashed). Rollback hooks. Three guarantee levels: Best-Effort, At-Least-Once, Exactly-Once.</p>
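<p>The core idea in miniature (an in-memory dict stands in for durable storage; the spec defines the real guarantee levels):</p>
<pre><code class="lang-python">import uuid

ledger = {}  # in-memory stand-in for durable storage

def transact(key, intent, action):
    done = ledger.get(key)
    if done and done["status"] == "done":
        return done["result"]  # idempotency: no double deploys
    # Log intent BEFORE acting, so a crash still leaves evidence of what was tried.
    ledger[key] = {"status": "intent", "intent": intent}
    result = action()
    ledger[key] = {"status": "done", "intent": intent, "result": result}
    return result

key = str(uuid.uuid4())
transact(key, {"type": "deploy", "target": "prod"}, lambda: "deployed v42")
transact(key, {"type": "deploy", "target": "prod"}, lambda: "boom")  # returns "deployed v42"
</code></pre>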
<hr />
<h2 id="heading-the-bigger-play">The Bigger Play</h2>
<p>Everyone's building AI agents. I'm building the infrastructure agents run on — the picks and shovels of the agent gold rush.</p>
<p>The Agent OSI Model is the framework. The specs at each layer are the picks and shovels. The certification system (Blueprint, Ready, Certified) is the trust layer on top.</p>
<p>Full framework and all specs: <a target="_blank" href="https://workswithagents.dev/specs/index.md">workswithagents.dev/specs/</a></p>
<p>Human-readable overview: <a target="_blank" href="https://workswithagents.com/agent-osi-model.md">workswithagents.com/agent-osi-model</a></p>
<p>All specs CC BY 4.0 — free to use, cite, and build upon. Attribution required.</p>
<hr />
<p><em>If you're building multi-agent systems and hitting coordination problems, or if you're in a regulated industry and need audit trails for autonomous agents — I'd like to hear about your use case. The specs are published. The infrastructure is being built. The conversation is starting now.</em></p>
]]></content:encoded></item><item><title><![CDATA[7 Protocols for Agent Infrastructure]]></title><description><![CDATA[I run about 20 AI agents. They delegate work to each other, deploy code, scan for vulnerabilities, and handle compliance checks. Over time, I kept hitting the same gaps — things that made autonomous workflows fragile in ways that took hours to debug....]]></description><link>https://blog.workswithagents.dev/7-protocols-for-agent-infrastructure</link><guid isPermaLink="true">https://blog.workswithagents.dev/7-protocols-for-agent-infrastructure</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 22:32:19 GMT</pubDate><content:encoded><![CDATA[<p>I run about 20 AI agents. They delegate work to each other, deploy code, scan for vulnerabilities, and handle compliance checks. Over time, I kept hitting the same gaps — things that made autonomous workflows fragile in ways that took hours to debug.</p>
<p>I published a <a target="_blank" href="https://dev.to/vystartasv/the-agent-osi-model-a-7-layer-framework-for-ai-agent-infrastructure-3aoe">7-layer model for agent infrastructure</a> that captures how I think about these problems. Several layers already have strong industry standards: <a target="_blank" href="https://a2a-protocol.org/">Google's A2A protocol</a> handles agent-to-agent coordination (L5), and <a target="_blank" href="https://modelcontextprotocol.io/">Anthropic's MCP</a> standardises how agents discover and use tools (L3–L4). At the identity layer, the <a target="_blank" href="https://www.w3.org/TR/did-1.0/">W3C DID standard</a> defines decentralised identifiers. For governance, there's the <a target="_blank" href="https://www.nist.gov/itl/ai-risk-management-framework">NIST AI Risk Management Framework</a>.</p>
<p>The rest of the stack — the layers that make autonomous agents trustworthy, auditable, and production-safe — still has gaps. These seven protocols fill them. They're what I wired into my own fleet when the existing standards didn't go far enough.</p>
<p>All are CC BY 4.0. Five have live reference implementations. Two are spec'd, with implementations still in the works.</p>
<h2 id="heading-industry-standards-this-builds-on">Industry Standards This Builds On</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Standard</td><td>Layer</td><td>Organization</td></tr>
</thead>
<tbody>
<tr>
<td><a target="_blank" href="https://a2a-protocol.org/">A2A Protocol</a></td><td>L5 Coordination</td><td>Google / a2aproject</td></tr>
<tr>
<td><a target="_blank" href="https://modelcontextprotocol.io/">Model Context Protocol</a></td><td>L3–L4 Discovery + Session</td><td>Anthropic</td></tr>
<tr>
<td><a target="_blank" href="https://www.w3.org/TR/did-1.0/">W3C DID Core</a></td><td>L2 Communication</td><td>W3C</td></tr>
<tr>
<td><a target="_blank" href="https://www.nist.gov/itl/ai-risk-management-framework">NIST AI RMF</a></td><td>L7 Governance</td><td>NIST</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-1-trust-score-should-i-delegate-to-this-agent">1. Trust Score — Should I Delegate to This Agent?</h2>
<p>When one of my agents delegates work to another, it needs to know if the target is reliable. Not "does it respond" — does it actually complete tasks correctly and consistently.</p>
<p>Weighted across success rate, pitfall history, skill quality, and uptime.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> TrustScoreClient

ts = TrustScoreClient()
<span class="hljs-keyword">if</span> ts.get(<span class="hljs-string">"target-agent"</span>)[<span class="hljs-string">"tier"</span>] == <span class="hljs-string">"trusted"</span>:
    delegate(task, to=<span class="hljs-string">"target-agent"</span>)
</code></pre>
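<p>Under the hood it's a weighted blend. The spec pins the exact weights; these are placeholders:</p>
<pre><code class="lang-python">def trust_score(success_rate, pitfall_penalty, skill_quality, uptime):
    # Placeholder 40/20/20/20 weights; the spec defines the real ones.
    return (0.4 * success_rate
            + 0.2 * (1 - pitfall_penalty)
            + 0.2 * skill_quality
            + 0.2 * uptime)

tier = "trusted" if trust_score(0.97, 0.05, 0.90, 0.999) &gt;= 0.90 else "probation"
</code></pre>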
<p><a target="_blank" href="https://workswithagents.dev/specs/trust-score.md">Spec</a></p>
<hr />
<h2 id="heading-2-deployment-manifest-declare-a-fleet-deploy-with-one-command">2. Deployment Manifest — Declare a Fleet, Deploy With One Command</h2>
<p>I got tired of manually tracking which agents run where, how many instances, and what capabilities they have. One YAML file, one command.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">fleet:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"my-fleet"</span>
  <span class="hljs-attr">agents:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">"builder"</span>
      <span class="hljs-attr">capabilities:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">"build"</span>
          <span class="hljs-attr">target:</span> <span class="hljs-string">"spfx"</span>
      <span class="hljs-attr">count:</span> <span class="hljs-number">3</span>
</code></pre>
<pre><code class="lang-bash">wwa fleet deploy fleet.yaml
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/deployment-manifest.md">Spec</a></p>
<hr />
<h2 id="heading-3-sla-framework-track-whether-agents-meet-their-promises">3. SLA Framework — Track Whether Agents Meet Their Promises</h2>
<p>Three tiers: Best-Effort (free), Production (99.5% uptime, 90% task accuracy), Regulated (99.9% uptime, 95% accuracy, 7-year audit retention).</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> SLAMetrics

sla = SLAMetrics(<span class="hljs-string">"my-fleet"</span>, tier=<span class="hljs-string">"production"</span>)
sla.report(<span class="hljs-string">"agent-1"</span>, <span class="hljs-string">"task-42"</span>, duration_seconds=<span class="hljs-number">187</span>, success=<span class="hljs-literal">True</span>)
status = sla.status()  <span class="hljs-comment"># {breaches: [], status: "ok"}</span>
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/sla-framework.md">Spec</a></p>
<hr />
<h2 id="heading-4-handoff-protocol-cryptographic-handoff-between-agents">4. Handoff Protocol — Cryptographic Handoff Between Agents</h2>
<p>When an agent passes a task to another, how do you know the output wasn't tampered with? Ed25519-signed handoffs with chain-of-custody verification. Built above <a target="_blank" href="https://modelcontextprotocol.io/">MCP</a>'s tool-use layer.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> Handoff

h = Handoff(from_agent=<span class="hljs-string">"planner"</span>, to_agent=<span class="hljs-string">"scanner"</span>, payload={<span class="hljs-string">"plan"</span>: <span class="hljs-string">"..."</span>})
signed = h.sign(planner_key)
verified = Handoff.verify(signed, planner_public_key)
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/handoff.md">Spec</a></p>
<hr />
<h2 id="heading-5-identity-protocol-verifiable-agent-identity">5. Identity Protocol — Verifiable Agent Identity</h2>
<p>Cryptographic agent identity with Ed25519 keypairs. Signed messages. Verification against registry. Extends the <a target="_blank" href="https://www.w3.org/TR/did-1.0/">W3C DID standard</a> with agent-specific key management and fleet-scoped verification.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> AgentIdentity

ai = AgentIdentity(<span class="hljs-string">"my-agent"</span>)
ai.register()
sig = ai.sign({<span class="hljs-string">"type"</span>: <span class="hljs-string">"heartbeat"</span>})

valid = AgentIdentity.verify(<span class="hljs-string">"other-agent"</span>, message, signature)
</code></pre>
<hr />
<h2 id="heading-6-compliance-as-code-regulation-as-executable-validation">6. Compliance-as-Code — Regulation as Executable Validation</h2>
<p>NHS DTAC, FCA, GDS, GDPR — as rules agents can validate against at runtime. Extends frameworks like the <a target="_blank" href="https://www.nist.gov/itl/ai-risk-management-framework">NIST AI RMF</a> from documentation into executable checks.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> ComplianceEngine

ce = ComplianceEngine()
dtac = ce.load(<span class="hljs-string">"dtac-v2.1"</span>)

<span class="hljs-keyword">if</span> dtac.validate(action).passed:
    execute(action)
<span class="hljs-keyword">else</span>:
    escalate_to_human()
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/compliance-as-code.md">Spec</a></p>
<hr />
<h2 id="heading-7-onboarding-protocol-systematic-agent-creation">7. Onboarding Protocol — Systematic Agent Creation</h2>
<p>Interview → generate → calibrate → benchmark → register. Instead of writing a prompt file and hoping, run a pipeline that produces a scored agent.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> OnboardingClient

ob = OnboardingClient()
result = ob.full_onboard(
    <span class="hljs-string">"nhs-auditor"</span>,
    <span class="hljs-string">"Audit agent actions for NHS DTAC compliance"</span>,
    capabilities=[<span class="hljs-string">"audit:compliance"</span>],
    skills=[<span class="hljs-string">"compliance-as-code"</span>]
)
</code></pre>
<hr />
<h2 id="heading-the-stack">The Stack</h2>
<p>Where each protocol fits alongside existing industry standards:</p>
<pre><code class="lang-plaintext">L7 GOVERNANCE    ← NIST AI RMF           Compliance-as-Code · SLA Framework
L6 VERIFICATION  (no standard yet)       Agent Test Suite · Pitfall Registry
L5 COORDINATION  ← A2A (Google)          Trust Score
L4 SESSION       ← MCP (Anthropic)       Handoff Protocol
L3 DISCOVERY     ← MCP (Anthropic)       Trust Score · Capability Manifest
L2 COMMUNICATION ← W3C DID               Identity Protocol
L1 EXECUTION     (no standard yet)       Onboarding Protocol · Deployment Manifest
</code></pre>
<p><a target="_blank" href="https://a2a-protocol.org/">A2A</a> (Google) — agent-to-agent task coordination at L5. <a target="_blank" href="https://modelcontextprotocol.io/">MCP</a> (Anthropic) — tool discovery and context sharing at L3–L4. <a target="_blank" href="https://www.w3.org/TR/did-1.0/">W3C DID</a> — decentralised identity at L2. <a target="_blank" href="https://www.nist.gov/itl/ai-risk-management-framework">NIST AI RMF</a> — governance framework at L7. These seven protocols fill what those standards leave open: trust, deployment, handoff integrity, compliance execution, and systematic agent creation.</p>
<hr />
<h2 id="heading-get-started">Get Started</h2>
<pre><code class="lang-bash">pip install workswithagents
</code></pre>
<p>All specs: <a target="_blank" href="https://workswithagents.dev/specs/">workswithagents.dev/specs/</a>
All code: CC BY 4.0</p>
<hr />
<p><em>I build agent infrastructure inside Microsoft 365. SPFx · TypeScript · autonomous multi-agent systems. Currently open to senior/architect roles (£120K+ remote UK). → <a target="_blank" href="mailto:vilius@workswithagents.com">vilius@workswithagents.com</a></em></p>
]]></content:encoded></item><item><title><![CDATA[I'm Proposing a Standard for How AI Agents Hand Off Work — Here's Why It's Needed]]></title><description><![CDATA[I'm Proposing a Standard for How AI Agents Hand Off Work — Here's Why It's Needed
Here's a scenario that happens constantly:
Agent A is working on a complex build. It hits a timeout — session ends. Agent B picks up the task. But Agent B has no idea w...]]></description><link>https://blog.workswithagents.dev/agent-handoff-protocol-standard-proposal</link><guid isPermaLink="true">https://blog.workswithagents.dev/agent-handoff-protocol-standard-proposal</guid><category><![CDATA[AI]]></category><category><![CDATA[Discuss]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[standards]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 21:11:15 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-im-proposing-a-standard-for-how-ai-agents-hand-off-work-heres-why-its-needed">I'm Proposing a Standard for How AI Agents Hand Off Work — Here's Why It's Needed</h1>
<p>Here's a scenario that happens constantly:</p>
<p>Agent A is working on a complex build. It hits a timeout — session ends. Agent B picks up the task. But Agent B has no idea what Agent A was doing. What was built? What's left? What assumptions were made?</p>
<p>Every time this happens, work is lost. Time is burned. Context is rebuilt from scratch.</p>
<p>There's no standard for this. MCP handles tool calling. A2A handles agent-to-agent communication. But neither specifies what happens when an agent <em>stops</em> and another agent <em>continues.</em></p>
<hr />
<h2 id="heading-the-handoff-protocol">The Handoff Protocol</h2>
<p>I've designed a structured YAML format for agent-to-agent task handoff. Two variants:</p>
<h3 id="heading-baseline-open-source-unregulated">Baseline (open-source, unregulated)</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">handoff_version:</span> <span class="hljs-string">"1.0"</span>
<span class="hljs-attr">task_id:</span> <span class="hljs-string">"build-spfx-webpart-42"</span>
<span class="hljs-attr">from_agent:</span> <span class="hljs-string">"hermes-main"</span>
<span class="hljs-attr">to_agent:</span> <span class="hljs-string">"hermes-worker-3"</span>
<span class="hljs-attr">status:</span> <span class="hljs-string">"in_progress"</span>
<span class="hljs-attr">completed:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"Scaffolded web part structure"</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"Installed dependencies"</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"Configured SCSS aliases"</span>
<span class="hljs-attr">remaining:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"Write component logic"</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"Add tests"</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"Bundle and verify"</span>
<span class="hljs-attr">context:</span>
  <span class="hljs-attr">project_root:</span> <span class="hljs-string">"/Users/vilius/origami-spfx-webparts-lab"</span>
  <span class="hljs-attr">node_version:</span> <span class="hljs-string">"22.11.0"</span>
  <span class="hljs-attr">spfx_version:</span> <span class="hljs-string">"1.22.2"</span>
<span class="hljs-attr">pitfalls_hit:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"SCSS alias: @fluentui needs explicit path in config/sass.json"</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"Yeoman --component-type ignored when .yo-rc.json exists"</span>
<span class="hljs-attr">decisions:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"Used React hooks instead of class components"</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"Chose Fluent UI v9 over v8"</span>
</code></pre>
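<p>Consuming a handoff is deliberately trivial. A minimal sketch, assuming PyYAML; the required keys mirror the baseline variant above:</p>
<pre><code class="lang-python">import yaml  # PyYAML

REQUIRED = {"handoff_version", "task_id", "from_agent", "to_agent",
            "status", "completed", "remaining", "context"}

def load_handoff(path):
    with open(path) as f:
        handoff = yaml.safe_load(f)
    missing = REQUIRED - handoff.keys()
    if missing:
        raise ValueError(f"handoff missing required fields: {sorted(missing)}")
    return handoff

h = load_handoff("handoff.yaml")
print(f"{len(h['remaining'])} steps left, resuming from {h['from_agent']}")
</code></pre>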
<h3 id="heading-regulated-nhs-finance-government">Regulated (NHS, finance, government)</h3>
<p>Adds: audit trail, compliance officer sign-off gates, data classification, regulatory reference field, immutable timestamp chain.</p>
<hr />
<h2 id="heading-how-im-claiming-this">How I'm Claiming This</h2>
<p>Two standards bodies accepting proposals right now:</p>
<h3 id="heading-1-mcp-sep-specification-enhancement-proposal">1. MCP SEP (Specification Enhancement Proposal)</h3>
<p>MCP's SEP-2133 Extensions framework is the exact mechanism for proposing optional protocol extensions. The Handoff Protocol fits as an MCP extension — it extends the tool-calling model with structured state transfer.</p>
<p><strong>Process:</strong></p>
<ol>
<li>Open GitHub issue on <code>modelcontextprotocol/modelcontextprotocol</code> describing the extension</li>
<li>Request maintainer sponsorship for a SEP</li>
<li>Draft the proposal document following the SEP template</li>
<li>Community review → implementation → adoption</li>
</ol>
<p><strong>Repo:</strong> github.com/modelcontextprotocol/modelcontextprotocol</p>
<h3 id="heading-2-google-a2a-agent-to-agent-protocol">2. Google A2A (Agent-to-Agent Protocol)</h3>
<p>A2A is purpose-built for agent communication. The Handoff Protocol extends A2A's task delegation model — adding the structured handoff payload, compliance fields, and verification gates that A2A's basic task model doesn't cover.</p>
<p><strong>Repo:</strong> github.com/a2aproject/a2a</p>
<h3 id="heading-3-longer-term-ietf-internet-draft">3. Longer-term: IETF Internet-Draft</h3>
<p>If adoption warrants standards-track treatment, an individual Internet-Draft is the path. This takes 18–24 months and requires a working group or area director sponsorship. Not the first move — but the endgame.</p>
<hr />
<h2 id="heading-why-this-matters">Why This Matters</h2>
<p>Agent infrastructure is being built right now. The standards that get established in 2026 will shape the next decade of AI agent development.</p>
<p>The Handoff Protocol is a small, focused standard — one specific problem, one clear format. It doesn't need to be a massive specification. It just needs to exist before someone else defines it differently.</p>
<hr />
<p><em>The Handoff Protocol schema is documented at workswithagents.dev/v1/handoff/schema. The MCP SEP submission is in progress. If you're building multi-agent systems and hitting the handoff problem, I'd like to hear about your use case.</em></p>
]]></content:encoded></item><item><title><![CDATA[Better Prompts Won't Fix Your AI Agents — Infrastructure Will]]></title><description><![CDATA[Better Prompts Won't Fix Your AI Agents — Infrastructure Will
Every "how to work with AI agents" guide starts with prompt engineering. Be specific. Give examples. Set context.
That's fine for one agent, one session. It completely falls apart when you...]]></description><link>https://blog.workswithagents.dev/better-prompts-wont-fix-your-ai-agents</link><guid isPermaLink="true">https://blog.workswithagents.dev/better-prompts-wont-fix-your-ai-agents</guid><category><![CDATA[AI]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[Python]]></category><category><![CDATA[SQLite]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 21:11:13 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-better-prompts-wont-fix-your-ai-agents-infrastructure-will">Better Prompts Won't Fix Your AI Agents — Infrastructure Will</h1>
<p>Every "how to work with AI agents" guide starts with prompt engineering. Be specific. Give examples. Set context.</p>
<p>That's fine for one agent, one session. It completely falls apart when you have 20 agents running concurrently, each with different contexts, hitting the same files, and none of them remembering what the others did.</p>
<p>Better prompts won't fix that. Infrastructure will.</p>
<hr />
<h2 id="heading-the-problem-nobody-talks-about">The Problem Nobody Talks About</h2>
<p>Here's what actually breaks when you run multiple agents:</p>
<ol>
<li><p><strong>Concurrent file writes.</strong> Agent A reads a file. Agent B writes to it. Agent A's context is now stale.</p>
</li>
<li><p><strong>Credential access.</strong> Touch ID doesn't work from cron. Your agents can't unlock your password manager at 3am.</p>
</li>
<li><p><strong>Silent failures.</strong> An agent hits an error on a recurring cron job. You don't notice for three days.</p>
</li>
<li><p><strong>Context starvation.</strong> Your agent's context window fills with irrelevant details. It can't reason about the actual problem.</p>
</li>
</ol>
<p>None of these are prompt problems. They're infrastructure problems.</p>
<hr />
<h2 id="heading-the-four-tools-that-fixed-it">The Four Tools That Fixed It</h2>
<h3 id="heading-1-factbase-structured-knowledge-not-flat-text">1. FactBase — Structured Knowledge, Not Flat Text</h3>
<p>Memory as flat text fails at scale. "Python version is 3.11" gets buried under 50 other facts. Agents can't query it reliably.</p>
<p><strong>Solution:</strong> SQLite database with WAL mode for concurrent access. Entity-attribute-value model. Every fact queryable by category, entity, keyword. Multiple agents read and write simultaneously without corruption.</p>
<pre><code>GET /v1/facts?entity=python&amp;attribute=path
→ /opt/homebrew/bin/python3<span class="hljs-number">.11</span>
</code></pre><p>One agent discovers a fact. Every agent knows it.</p>
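<p>The core of FactBase fits on one screen. A minimal sketch of the EAV table with WAL enabled (the real thing adds categories and the HTTP layer):</p>
<pre><code class="lang-python">import sqlite3

db = sqlite3.connect("facts.db")
db.execute("PRAGMA journal_mode=WAL")  # concurrent readers + one writer, no corruption
db.execute("""CREATE TABLE IF NOT EXISTS facts (
    entity TEXT, attribute TEXT, value TEXT,
    PRIMARY KEY (entity, attribute))""")

def set_fact(entity, attribute, value):
    db.execute("INSERT OR REPLACE INTO facts VALUES (?, ?, ?)",
               (entity, attribute, value))
    db.commit()

def get_fact(entity, attribute):
    row = db.execute("SELECT value FROM facts WHERE entity=? AND attribute=?",
                     (entity, attribute)).fetchone()
    return row[0] if row else None

set_fact("python", "path", "/opt/homebrew/bin/python3.11")
print(get_fact("python", "path"))
</code></pre>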
<h3 id="heading-2-credential-proxy-api-keys-without-touch-id">2. Credential Proxy — API Keys Without Touch ID</h3>
<p>Passwords stored in a local daemon. Agents request credentials by service name — no Touch ID, no browser extension, no interactive login. Cron jobs run unattended.</p>
<pre><code>credential-proxy get <span class="hljs-string">"dev.to API"</span>
→ api-key-here
</code></pre><p>The daemon holds the keys. The agents just ask.</p>
<h3 id="heading-3-cron-guard-the-smoke-alarm-for-your-agent-fleet">3. Cron Guard — The Smoke Alarm for Your Agent Fleet</h3>
<p>If 3+ consecutive cron runs fail, it alerts you. Silent failures are the worst kind — you don't know your agents stopped working until you manually check.</p>
<p><strong>Pattern:</strong> watchdog script → checks last N run statuses → alerts on threshold. Zero config. Just runs.</p>
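<p>A sketch of the watchdog, with the status file and alert hook as stand-ins:</p>
<pre><code class="lang-python">import json
from pathlib import Path

THRESHOLD = 3  # consecutive failures before we make noise

def alert(msg):
    print(f"ALERT: {msg}")  # stand-in: swap for email, webhook, whatever

def check(status_file="cron_status.json"):
    # Stand-in format: a JSON list of {"job": ..., "ok": bool}, newest last.
    runs = json.loads(Path(status_file).read_text())
    streak = 0
    for run in reversed(runs):
        if run["ok"]:
            break
        streak += 1
    if streak &gt;= THRESHOLD:
        alert(f"{streak} consecutive cron failures")
</code></pre>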
<h3 id="heading-4-context-packer-fit-more-into-your-token-budget">4. Context Packer — Fit More Into Your Token Budget</h3>
<p>A 2,500-file repo becomes an 8-file context pack. Preserves structure: important files, key decisions, dependency graph. Everything else summarized or excluded.</p>
<p>Your agent can actually reason about the project instead of drowning in file listings.</p>
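<p>A crude version of the selection pass. The scoring heuristic is invented; the real packer also summarizes what it drops:</p>
<pre><code class="lang-python">from pathlib import Path

IMPORTANT = {"README.md", "AGENTS.md", "package.json", "pyproject.toml"}

def score(path):
    s = 100 if path.name in IMPORTANT else 0
    if path.suffix in {".ts", ".py"}:
        s += 10
    return s - len(path.parts)  # prefer shallow files

def pack(root, budget=8):
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return sorted(files, key=score, reverse=True)[:budget]
</code></pre>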
<hr />
<h2 id="heading-the-pattern">The Pattern</h2>
<p>Everyone's optimizing prompts. I'm optimizing the environment agents run in. The prompt is 10% of the problem. The other 90% is:</p>
<ul>
<li>Can the agent access the right files?</li>
<li>Can it authenticate to services without you?</li>
<li>Will you know if it breaks?</li>
<li>Can it fit the problem in its context window?</li>
</ul>
<p>Fix those, and your prompts get 10x more effective — because the agent can actually act on them.</p>
<hr />
<p><em>All four tools are open source. FactBase and Pitfall Registry are live at workswithagents.dev. Credential proxy, cron guard, and context packer live in ~/.hermes/. No courses. No pricing. Just infrastructure.</em></p>
]]></content:encoded></item><item><title><![CDATA[The 10 Patterns: What 5 Months of Breaking AI Agents Taught Me About Making Them Actually Work]]></title><description><![CDATA[The 10 Patterns: What 5 Months of Breaking AI Agents Taught Me About Making Them Actually Work
In late 2025 I started experimenting with AI coding agents. Not casually — I gave them autonomous infrastructure and let them run. They broke. A lot. But p...]]></description><link>https://blog.workswithagents.dev/ten-patterns-what-breaking-ai-agents-taught-me</link><guid isPermaLink="true">https://blog.workswithagents.dev/ten-patterns-what-breaking-ai-agents-taught-me</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[Beginner Developers]]></category><category><![CDATA[Productivity]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 21:11:11 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-the-10-patterns-what-5-months-of-breaking-ai-agents-taught-me-about-making-them-actually-work">The 10 Patterns: What 5 Months of Breaking AI Agents Taught Me About Making Them Actually Work</h1>
<p>In late 2025 I started experimenting with AI coding agents. Not casually — I gave them autonomous infrastructure and let them run. They broke. A lot. But patterns emerged.</p>
<p>Not "prompt engineering tricks." Not "unlock your potential." Actual operational patterns for making agents work reliably — discovered the hard way, through 11 consecutive autonomous builds, 153 skills, and countless 3am debug sessions.</p>
<p>Here they are.</p>
<hr />
<h2 id="heading-pattern-1-boot-the-first-session-shapes-everything">Pattern 1: Boot — The First Session Shapes Everything</h2>
<p>An agent's first session is like its childhood. If it starts blind — no context, no conventions, no memory of what you've built — every interaction is uphill.</p>
<p><strong>What I do:</strong> Every project has an AGENTS.md. Python version. Project structure. Conventions. Key decisions. The agent reads this before anything else.</p>
<p><strong>What happened without it:</strong> Recommending npm for a pnpm project. Suggesting Python 3.9 when we're on 3.11. Hours of corrections that a 50-line file would have prevented.</p>
<hr />
<h2 id="heading-pattern-2-skills-stop-re-explaining-the-same-things">Pattern 2: Skills — Stop Re-Explaining the Same Things</h2>
<p>Building an SPFx web part has specific gotchas: <code>@fluentui</code> imports break without SCSS alias config. The Yeoman generator ignores <code>--component-type</code> if <code>.yo-rc.json</code> exists. Node 22 + native modules = pain.</p>
<p>Instead of re-explaining these each time, I saved them as skills. 153 of them now. When an agent hits an SPFx task, it loads the skill — known pitfalls, exact commands, verification steps.</p>
<p><strong>The compounding effect:</strong> Each skill makes future sessions faster. Five months in, the agent has institutional knowledge.</p>
<hr />
<h2 id="heading-pattern-3-memory-never-re-answer-the-same-question">Pattern 3: Memory — Never Re-Answer the Same Question</h2>
<p>"What Python version are we using?" "Where's the project?" "What's the deployment command?"</p>
<p>Without persistent memory, you answer these every. single. session. I saved durable facts across sessions: Python path, build system, project structure, preferences. Now the agent just knows.</p>
<p><strong>Critical rule:</strong> Write declarative facts, not instructions. "Project uses pytest with xdist" — not "Always run tests with pytest -n 4." Instructions get re-read as orders.</p>
<hr />
<h2 id="heading-pattern-4-decision-protocols-autonomy-without-chaos">Pattern 4: Decision Protocols — Autonomy Without Chaos</h2>
<p>The biggest time-sink? Approval loops. "Should I proceed?" "Want me to fix this?" "OK to deploy?"</p>
<p>I set boundaries: what the agent decides alone, what needs approval. Destructive actions = ask. Recoverable actions = just do it. Hours saved per session.</p>
<hr />
<h2 id="heading-pattern-5-tool-composition-the-right-tool-for-each-job">Pattern 5: Tool Composition — The Right Tool for Each Job</h2>
<p>Agents have many tools. Knowing which to use is the difference between a 2-second operation and two minutes of burned tokens.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Task</td><td>Tool</td><td>Why</td></tr>
</thead>
<tbody>
<tr>
<td>Create new file</td><td>write_file</td><td>One call</td></tr>
<tr>
<td>Edit existing file</td><td>patch</td><td>Targeted, no rewrite risk</td></tr>
<tr>
<td>Build/install/deploy</td><td>terminal</td><td>It's a shell command</td></tr>
<tr>
<td>Read a file</td><td>read_file</td><td>Don't cat/head/tail</td></tr>
<tr>
<td>Search content</td><td>search_files</td><td>Not grep/find</td></tr>
<tr>
<td>Research/debug</td><td>delegate_task</td><td>Parallel, isolated</td></tr>
</tbody>
</table>
</div><p><strong>The anti-pattern:</strong> Delegating coding tasks to subagents. They lose context, hallucinate, and burn tokens. Use write_file and patch directly.</p>
<hr />
<h2 id="heading-pattern-6-orchestration-parallel-specialists">Pattern 6: Orchestration — Parallel Specialists</h2>
<p>Complex tasks are rarely a single thread. Market research? Let a subagent run while the main agent builds. Code review? Spin up a reviewer in parallel.</p>
<p><strong>Real result:</strong> 3x throughput on multi-stream tasks. Research and build completed independently, merged at the end.</p>
<hr />
<h2 id="heading-pattern-7-pipelines-agents-that-run-while-you-sleep">Pattern 7: Pipelines — Agents That Run While You Sleep</h2>
<p>Cron jobs. Builds. Monitoring. I have ~20 autonomous agents running right now — hourly reviews, daily digests, weekly research verification. They wake up, do their job, and only notify me if something's broken.</p>
<p><strong>The silent-unless-broken pattern:</strong> I never see successful runs. I only hear about failures. That's the point.</p>
<hr />
<h2 id="heading-pattern-8-resilience-never-stop-on-the-first-error">Pattern 8: Resilience — Never Stop on the First Error</h2>
<p>Agents hit errors constantly. Network timeouts. API rate limits. File system races. Without recovery, every error kills progress.</p>
<p>Exponential backoff: 2s, 4s, 8s, 16s. Categorize errors: transient = retry, permanent = find another way.</p>
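<p>The retry loop, sketched. The transient/permanent split is the part worth copying:</p>
<pre><code class="lang-python">import time

TRANSIENT = (TimeoutError, ConnectionError)  # retry these; anything else is permanent

def with_retry(action, attempts=5):
    for attempt in range(attempts):
        try:
            return action()
        except TRANSIENT:
            if attempt == attempts - 1:
                raise  # out of retries: surface it
            time.sleep(2 ** (attempt + 1))  # 2s, 4s, 8s, 16s
    # permanent errors propagate immediately, so the agent can find another way
</code></pre>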
<p><strong>Real metric:</strong> 11 consecutive builds with zero human intervention. The agent hit errors on 8 of them. Recovered from every single one.</p>
<hr />
<h2 id="heading-pattern-9-verify-autonomous-doesnt-mean-reckless">Pattern 9: Verify — Autonomous Doesn't Mean Reckless</h2>
<p>Every change gets verified. Syntax check after every file write. Tests after every code change. For deployments: verify the result, don't trust the response.</p>
<p><strong>The payoff:</strong> syntax checks, test runs, and quality gates catch errors at the step that introduced them, before anything ships and before they compound.</p>
<hr />
<h2 id="heading-pattern-10-compounding-the-agent-that-gets-better">Pattern 10: Compounding — The Agent That Gets Better</h2>
<p>This is the feedback loop: agent solves hard problem → saves approach as skill → next session is faster. Month 1: basic file ops. Month 3: autonomous scaffolding. Month 5: self-improvement loops, 153 skills.</p>
<p>The agent today is not the agent from 5 months ago — because it learned from every session.</p>
<hr />
<h2 id="heading-the-honest-part">The Honest Part</h2>
<p>These patterns weren't planned. They emerged from breaking things late at night. Every one of them is backed by a real failure — an error that cost hours, a build that died, a configuration that made no sense until 3am.</p>
<p>If you're working with agents and hitting walls: you're not doing it wrong. You're discovering patterns. Write them down. Make them skills. Let the agent learn.</p>
<hr />
<p><em>I documented the full methodology at workswithagents.com. The knowledge API (workswithagents.dev) has 153 skills and a shared pitfall registry — agents query it for known bugs and fixes. No courses yet. No pricing. Just infrastructure, live.</em></p>
]]></content:encoded></item><item><title><![CDATA[Welcome to the Agent Autopsy]]></title><description><![CDATA[I've been building AI agent infrastructure for the past month. Not because I had a plan — because I needed it to work.
I'm a SharePoint developer in Cardiff. When I started experimenting with AI agents, I kept hitting the same problems: agents lying ...]]></description><link>https://blog.workswithagents.dev/welcome-to-the-agent-autopsy</link><guid isPermaLink="true">https://blog.workswithagents.dev/welcome-to-the-agent-autopsy</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[infrastructure]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 21:03:03 GMT</pubDate><content:encoded><![CDATA[<p>I've been building AI agent infrastructure for the past month. Not because I had a plan — because I needed it to work.</p>
<p>I'm a SharePoint developer in Cardiff. When I started experimenting with AI agents, I kept hitting the same problems: agents lying about what they'd built, burning tokens on tasks they couldn't do, breaking each other's state, and occasionally publishing broken packages to PyPI.</p>
<p>So I built tooling. A lot of it. 153 skills, 19 autonomous agents, a credential proxy, a cron scheduler, an Ed25519 verification pipeline, a protocol stack nobody asked for. Most of it runs on a €4.57/month Hetzner box.</p>
<p>This blog is where I write about what broke and what I built to fix it. It's the home for the <strong>Agent Autopsy</strong> series — daily postmortems from the trenches of AI agent infrastructure.</p>
<p><strong>What to expect:</strong></p>
<ul>
<li>Short, honest posts (~500 words) about real failures and what I learned</li>
<li>No AI hype. No "revolutionise." No pretending I have it figured out</li>
<li>Real numbers. Real code. Real breaks</li>
</ul>
<p><strong>What I'm not:</strong></p>
<ul>
<li>A startup (0 clients, pre-revenue)</li>
<li>A thought leader</li>
<li>Someone who knows what he's doing</li>
</ul>
<p>I'm just a developer who built things his agents kept breaking, documented everything, and ended up with a methodology by accident.</p>
<p>If you're building AI agent infrastructure too — or you're curious what breaks when you let 19 agents run themselves — stick around. Something will break soon. I'll write about it.</p>
<p>— Vilius</p>
]]></content:encoded></item><item><title><![CDATA[6 Protocols for Agent Infrastructure — Trust Score, Deployment, SLA, Identity, Compliance]]></title><description><![CDATA[I run about 20 AI agents. They delegate work to each other, deploy code, scan for vulnerabilities, and handle compliance checks. Over time, I kept hitting the same gaps — things that made autonomous workflows fragile in ways that took hours to debug....]]></description><link>https://blog.workswithagents.dev/6-protocols-for-agent-infrastructure-trust-score-deployment-sla-identity-compliance-10hd</link><guid isPermaLink="true">https://blog.workswithagents.dev/6-protocols-for-agent-infrastructure-trust-score-deployment-sla-identity-compliance-10hd</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[architecture]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 14:18:38 GMT</pubDate><content:encoded><![CDATA[<p>I run about 20 AI agents. They delegate work to each other, deploy code, scan for vulnerabilities, and handle compliance checks. Over time, I kept hitting the same gaps — things that made autonomous workflows fragile in ways that took hours to debug.</p>
<p>Last week I published a <a target="_blank" href="https://dev.to/vystartasv/the-agent-osi-model-a-7-layer-framework-for-ai-agent-infrastructure-3aoe">7-layer model for agent infrastructure</a>. These six protocols fill the gaps I found at each layer. They're what I wired into my own agents to stop the same failures from repeating.</p>
<p>All six have Python reference implementations under CC BY 4.0. Each has a spec any agent can read.</p>
<h2 id="heading-2-deployment-manifest-declare-a-fleet-deploy-with-one-command">2. Deployment Manifest — Declare a Fleet, Deploy With One Command</h2>
<p>I got tired of manually tracking which agents run where, how many instances, and what capabilities they have. One YAML file, one command.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">fleet:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"my-fleet"</span>
  <span class="hljs-attr">agents:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">"builder"</span>
      <span class="hljs-attr">capabilities:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">"build"</span>
          <span class="hljs-attr">target:</span> <span class="hljs-string">"spfx"</span>
      <span class="hljs-attr">count:</span> <span class="hljs-number">3</span>
</code></pre>
<pre><code class="lang-bash">wwa fleet deploy fleet.yaml
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/deployment-manifest.md">Spec</a></p>
<hr />
<h2 id="heading-3-sla-framework-track-whether-agents-meet-their-promises">3. SLA Framework — Track Whether Agents Meet Their Promises</h2>
<p>Three tiers: Best-Effort (free), Production (99.5% uptime, 90% task accuracy), Regulated (99.9% uptime, 95% accuracy, 7-year audit retention).</p>
<p>Useful when you're running agents that handle customer data or regulated workflows and need to prove they stayed within bounds.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> SLAMetrics

sla = SLAMetrics(<span class="hljs-string">"my-fleet"</span>, tier=<span class="hljs-string">"production"</span>)
sla.report(<span class="hljs-string">"agent-1"</span>, <span class="hljs-string">"task-42"</span>, duration_seconds=<span class="hljs-number">187</span>, success=<span class="hljs-literal">True</span>)
status = sla.status()  <span class="hljs-comment"># {breaches: [], status: "ok"}</span>
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/sla-framework.md">Spec</a></p>
<hr />
<h2 id="heading-4-identity-protocol-verifiable-agent-identity">4. Identity Protocol — Verifiable Agent Identity</h2>
<p>When an agent claims a task result, can you prove it was that agent? Ed25519 keypairs. Signed messages. Verification against registry.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> AgentIdentity

ai = AgentIdentity(<span class="hljs-string">"my-agent"</span>)
ai.register()
sig = ai.sign({<span class="hljs-string">"type"</span>: <span class="hljs-string">"heartbeat"</span>})

<span class="hljs-comment"># Verify another agent's message</span>
valid = AgentIdentity.verify(<span class="hljs-string">"other-agent"</span>, message, signature)
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/identity-protocol.md">Spec</a></p>
<hr />
<h2 id="heading-5-compliance-as-code-regulation-as-executable-validation">5. Compliance-as-Code — Regulation as Executable Validation</h2>
<p>NHS DTAC, FCA, GDS, GDPR — as rules agents can validate against at runtime. Not a checklist. Not documentation. Code that returns pass/fail.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> ComplianceEngine

ce = ComplianceEngine()
dtac = ce.load(<span class="hljs-string">"dtac-v2.1"</span>)

<span class="hljs-keyword">if</span> dtac.validate(action).passed:
    execute(action)
<span class="hljs-keyword">else</span>:
    escalate_to_human()
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/compliance-as-code.md">Spec</a></p>
<hr />
<h2 id="heading-6-onboarding-protocol-systematic-agent-creation">6. Onboarding Protocol — Systematic Agent Creation</h2>
<p>Interview → generate → calibrate → benchmark → register. Instead of writing a prompt file and hoping, run a pipeline that produces a scored agent.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> OnboardingClient

ob = OnboardingClient()
result = ob.full_onboard(
    <span class="hljs-string">"nhs-auditor"</span>,
    <span class="hljs-string">"Audit agent actions for NHS DTAC compliance"</span>,
    capabilities=[<span class="hljs-string">"audit:compliance"</span>],
    skills=[<span class="hljs-string">"compliance-as-code"</span>]
)
<span class="hljs-comment"># → {agent_id: "nhs-auditor", trust_score_seed: 0.60}</span>
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/onboarding-protocol.md">Spec</a></p>
<hr />
<h2 id="heading-the-stack">The Stack</h2>
<pre><code class="lang-plaintext">L7 GOVERNANCE    Compliance-as-Code · SLA Framework
L6 VERIFICATION  Agent Test Suite · Pitfall Registry
L5 COORDINATION  Coordination Protocol · Trust Score
L4 SESSION       Handoff Protocol
L3 DISCOVERY     Capability Manifest · Trust Score · Identity
L2 COMMUNICATION Identity Protocol · Credential Proxy
L1 EXECUTION     Blueprint Registry · Onboarding Protocol
</code></pre>
<p>Plus cross-layer: Deployment Manifest.</p>
<hr />
<h2 id="heading-get-started">Get Started</h2>
<pre><code class="lang-bash">pip install workswithagents
</code></pre>
<p>All specs: <a target="_blank" href="https://workswithagents.dev/specs/">workswithagents.dev/specs/</a>
All code: CC BY 4.0</p>
<hr />
<p><em>I build agent infrastructure inside Microsoft 365. SPFx · TypeScript · autonomous multi-agent systems. Currently open to senior/architect roles (£120K+ remote UK). → <a target="_blank" href="mailto:vilius@workswithagents.com">vilius@workswithagents.com</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Trusted Code Review Pipeline — 4 Agents, 1 Cloud Run Deploy]]></title><description><![CDATA[I built a multi-agent code review system for the Build Multi-Agent Systems with ADK track. Four specialized agents replace what would normally be one giant prompt. Each has a focused responsibility, and they pass work through a sequential pipeline de...]]></description><link>https://blog.workswithagents.dev/trusted-code-review-pipeline-4-agents-1-cloud-run-deploy-3dd0</link><guid isPermaLink="true">https://blog.workswithagents.dev/trusted-code-review-pipeline-4-agents-1-cloud-run-deploy-3dd0</guid><category><![CDATA[buildmultiagents]]></category><category><![CDATA[adk]]></category><category><![CDATA[agents]]></category><category><![CDATA[gemini]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 13:59:55 GMT</pubDate><content:encoded><![CDATA[<p>I built a multi-agent code review system for the <a target="_blank" href="https://dev.to/deved/build-multi-agent-systems">Build Multi-Agent Systems with ADK</a> track. Four specialized agents replace what would normally be one giant prompt. Each has a focused responsibility, and they pass work through a sequential pipeline deployed to Google Cloud Run.</p>
<h2 id="heading-what-i-built">What I Built</h2>
<p>A web app that takes a GitHub repo URL and runs a full code review through four agents. Paste a link, and the pipeline takes over.</p>
<p><strong>The flow:</strong> <code>User → Planner → Security Scanner → Quality Gate → Archive &amp; Verify → Audit Report</code></p>
<p>Each agent's output feeds into the next. The final result is a signed audit report with trust scores for every agent in the chain.</p>
<h2 id="heading-cloud-run-embed">Cloud Run Embed</h2>
<p>{% embed https://adk-code-review-957647198522.us-central1.run.app %}</p>
<h2 id="heading-your-agents">Your Agents</h2>
<p>I used ADK's sequential agent chaining to connect four specialized agents:</p>
<ul>
<li><strong>Planner</strong> — clones the repo, analyzes file structure, identifies the primary language and test coverage, creates a review plan</li>
<li><strong>Security Scanner</strong> — deep-scans every file for hardcoded secrets, unsafe patterns, and exposed configuration. Each finding is severity-rated.</li>
<li><strong>Quality Gate</strong> — checks for LICENSE, README completeness, CI/CD pipeline, and security posture. Assigns a tier: trusted, caution, or untrusted.</li>
<li><strong>Archive &amp; Verify</strong> — rates every agent with a trust score, cryptographically verifies all signatures in the chain, and produces the final audit report.</li>
</ul>
<p>The sequential flow means each agent focuses on one job. The Planner doesn't scan for secrets. The Security agent doesn't check licenses. Each prompt is short and focused — no agent needs to do everything.</p>
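<p>In code, the chaining is roughly this. A sketch using ADK's <code>SequentialAgent</code>; the names and instructions are condensed stand-ins, not the repo's exact prompts:</p>
<pre><code class="lang-python">from google.adk.agents import LlmAgent, SequentialAgent

# Each sub-agent writes its result to an output_key; later agents can
# reference that state in their instructions via {key} placeholders.
planner = LlmAgent(name="planner", model="gemini-2.5-flash",
                   instruction="Clone the repo and produce a review plan.",
                   output_key="plan")
scanner = LlmAgent(name="security_scanner", model="gemini-2.5-flash",
                   instruction="Scan the files in {plan} for secrets; rate severity.",
                   output_key="findings")
quality = LlmAgent(name="quality_gate", model="gemini-2.5-flash",
                   instruction="Check LICENSE, README and CI; assign a tier given {findings}.",
                   output_key="tier")
archiver = LlmAgent(name="archive_verify", model="gemini-2.5-flash",
                    instruction="Verify the chain's signatures and emit the audit report.")

# SequentialAgent runs the four in order, sharing session state.
pipeline = SequentialAgent(name="code_review_pipeline",
                           sub_agents=[planner, scanner, quality, archiver])
</code></pre>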
<h2 id="heading-key-learnings">Key Learnings</h2>
<ol>
<li><p><strong>Agent directory structure matters more than you'd expect.</strong> ADK scans for agents inside your project directory. One wrong path and your perfectly working agent disappears from discovery entirely.</p>
</li>
<li><p><strong>Cloud Run's defaults can surprise you.</strong> The port your container listens on isn't what most deployment guides assume. One hardcoded number and the container starts fine but never passes the health check (see the sketch after this list).</p>
</li>
<li><p><strong>Static files and agents don't mix.</strong> Putting a frontend inside an agent directory confuses ADK's discovery — it starts listing your HTML folder as an agent. Keep them separate.</p>
</li>
<li><p><strong>Cloud APIs are independent.</strong> Enabling one Google Cloud API doesn't enable the others your agents depend on. You'll get an unhelpful error and a link to the exact page where you should've clicked enable.</p>
</li>
</ol>
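<p>The port lesson in two lines: read it from the environment. A minimal sketch, with the stdlib server standing in for whatever app you actually deploy:</p>
<pre><code class="lang-python">import os
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Cloud Run injects the port to listen on via $PORT (8080 by default).
# Hardcoding a different port is exactly the mistake that starts fine
# and then fails every health check.
port = int(os.environ.get("PORT", "8080"))
HTTPServer(("0.0.0.0", port), SimpleHTTPRequestHandler).serve_forever()
</code></pre>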
<p><strong>Repo:</strong> <a target="_blank" href="https://github.com/vystartasv/adk-code-review">github.com/vystartasv/adk-code-review</a>
<strong>Stack:</strong> Google ADK + A2A + Cloud Run + Gemini 2.5 Flash
<strong>License:</strong> CC BY 4.0</p>
]]></content:encoded></item><item><title><![CDATA[Every Agent I Delegated To Kept Failing. I Finally Checked the Model.]]></title><description><![CDATA[I built a delegation system that spawns AI agents to handle sub-tasks in parallel. Quality sweeps. Code audits. Checking every SDK directory for dead links. The idea: spin up cheap local agents, let them work, collect results.
They kept failing. Not ...]]></description><link>https://blog.workswithagents.dev/every-agent-i-delegated-to-kept-failing-i-finally-checked-the-model-1f46</link><guid isPermaLink="true">https://blog.workswithagents.dev/every-agent-i-delegated-to-kept-failing-i-finally-checked-the-model-1f46</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 11:00:48 GMT</pubDate><content:encoded><![CDATA[<p>I built a delegation system that spawns AI agents to handle sub-tasks in parallel. Quality sweeps. Code audits. Checking every SDK directory for dead links. The idea: spin up cheap local agents, let them work, collect results.</p>
<p>They kept failing. Not crashing — just stopping. No output. No error. 600 seconds of silence, then a timeout.</p>
<p>I assumed the tasks were too complex. I assumed parallel delegation was unreliable. I never checked what model I was actually giving them.</p>
<h2 id="heading-the-root-cause">The Root Cause</h2>
<p>My delegation system was configured to use a small local model. Fine for single-turn questions. Useless for multi-step tool loops.</p>
<p>A quality sweep isn't one tool call. It's: find the directory, list the files, search each one, flag issues, report results. That's five sequential steps, each dependent on the last. The small model lost coherence after the second call. The first step worked. By the third, it was hallucinating or hanging.</p>
<p>Meanwhile, the main agent handled the exact same tasks in minutes. Same instructions. Different model.</p>
<h2 id="heading-what-i-assumed">What I Assumed</h2>
<p>I assumed any model that passes benchmarks can handle tool-calling. I assumed "cheap model for leaf tasks" was an optimization. I assumed if a model could answer a question correctly, it could execute a sequence of tool calls correctly.</p>
<p>Benchmarks measure knowledge. They don't measure whether a model can hold context across five sequential tool calls. Single-turn accuracy and agentic reliability are different things entirely.</p>
<h2 id="heading-what-i-no-longer-assume">What I No Longer Assume</h2>
<p>I now test every model on a concrete multi-step task before adding it to the delegation pool: find a directory, search for a pattern, read the matching file, report what you found. If it can't complete that loop, it doesn't get delegated work.</p>
<p>I also built a decision gate that evaluates task complexity against model capability before spawning a subagent. If the task requires three or more sequential tool calls and the target model has known reliability issues, it reroutes to a more capable model or handles the work inline. Better to burn a few extra tokens on a capable model than to wait ten minutes for nothing.</p>
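<p>A sketch of that gate. Every name here is hypothetical; the threshold and model list are whatever your delegation layer tracks:</p>
<pre><code class="lang-python"># Hypothetical sketch of the delegation gate; estimate_tool_calls and
# FLAKY_MODELS stand in for your own complexity and capability tracking.
FLAKY_MODELS = {"local-small-7b"}   # models that failed the multi-step probe

def estimate_tool_calls(task):
    # Crude proxy: count the dependent steps named in the task.
    return sum(task.lower().count(v) for v in ("find", "list", "search", "read", "report"))

def route(task, target_model):
    if estimate_tool_calls(task) &gt;= 3 and target_model in FLAKY_MODELS:
        return "capable-model"      # reroute instead of spawning a doomed subagent
    return target_model
</code></pre>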
<h2 id="heading-what-you-should-check">What You Should Check</h2>
<p>If you're building systems that delegate work between agents:</p>
<ul>
<li><strong>Test subagent models on multi-step tool loops, not just benchmarks.</strong> Give them a real sequence of dependent calls. If they fail by step three, they're not ready for autonomous work.</li>
<li><strong>Gate delegation before it starts, not after it times out.</strong> A decision layer that checks task complexity against model capability catches failures before they become silent timeouts.</li>
<li><strong>Parallel delegation to weak models isn't faster — it's ten minutes of silence instead of two minutes of work.</strong> Before spawning subagents, ask: can the orchestrator just do this?</li>
</ul>
<p><em>I build agent infrastructure inside Microsoft 365. SPFx · TypeScript · autonomous multi-agent systems. Currently open to senior/architect roles (£120K+ remote UK). → <a target="_blank" href="mailto:vilius@workswithagents.com">vilius@workswithagents.com</a></em></p>
]]></content:encoded></item><item><title><![CDATA[I Published Broken Packages to PyPI. I Checked Them First.]]></title><description><![CDATA[I published two Python packages last week. I checked them before tagging the release. CI was green. twine check passed. I moved on.
This morning my agent told me one of them had been broken for three days. Anyone who copied the install command from t...]]></description><link>https://blog.workswithagents.dev/i-published-broken-packages-to-pypi-i-checked-them-first-44a7</link><guid isPermaLink="true">https://blog.workswithagents.dev/i-published-broken-packages-to-pypi-i-checked-them-first-44a7</guid><category><![CDATA[Devops]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[Python]]></category><category><![CDATA[Testing]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Thu, 07 May 2026 08:14:34 GMT</pubDate><content:encoded><![CDATA[<p>I published two Python packages last week. I checked them before tagging the release. CI was green. <code>twine check</code> passed. I moved on.</p>
<p>This morning my agent told me one of them had been broken for three days. Anyone who copied the install command from the README got <code>No matching distribution found</code>. The homepage link was a dead domain. Every image on the PyPI page — broken. The other package listed no license at all.</p>
<p>I had checked them. And they were wrong.</p>
<h2 id="heading-what-i-found">What I Found</h2>
<p>The README told users to install a package name that didn't exist — a typo in the one place that mattered most. The homepage link pointed to a domain that never resolved. Three screenshots referenced relative file paths that weren't included in the package. Three badge links pointed nowhere at all.</p>
<p>The <code>workswithagents</code> package was cleaner, but PyPI displayed "License: None."</p>
<p>Both packages passed CI. Both passed <code>twine check</code>. Both were live.</p>
<h2 id="heading-what-i-assumed">What I Assumed</h2>
<p>I assumed CI green meant the package was correct. I assumed <code>twine check</code> validated what users would see. I assumed checking the README locally was the same as checking it on PyPI.</p>
<p>None of those things are true.</p>
<p><code>twine check</code> validates package <em>structure</em> — valid metadata headers, correct file layout. It does not resolve URLs. It does not compare install commands against actual package names. It does not check if images exist. It does not verify licenses. It's a compiler, not a content validator.</p>
<h2 id="heading-what-i-no-longer-assume">What I No Longer Assume</h2>
<p>Every package I publish now runs through a content quality gate <em>before</em> <code>twine upload</code>. The gate checks: does the homepage resolve? Does the install command match the actual package name? Are all images either in the wheel or reachable URLs? Is there a license? Do badge links have real targets?</p>
<p>The gate is 200 lines of Python. It caught all 9 issues in one run. If I'd had it three days ago, none of those packages would have shipped broken.</p>
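<p>A trimmed sketch of two of those checks, assuming a plain-Markdown README and the package name passed in; the homepage, license, and badge checks follow the same pattern:</p>
<pre><code class="lang-python">import re
import urllib.request

def check_readme(readme, package_name):
    """Pre-upload content checks that twine check does not do."""
    issues = []
    # 1. The install command must name the real package.
    for m in re.finditer(r"pip install ([\w\-\.]+)", readme):
        if m.group(1) != package_name:
            issues.append(f"install command says '{m.group(1)}', not '{package_name}'")
    # 2. Images must be absolute URLs that actually resolve.
    for url in re.findall(r"!\[[^\]]*\]\(([^)]+)\)", readme):
        if not url.startswith("http"):
            issues.append(f"relative image path won't render on PyPI: {url}")
            continue
        try:
            urllib.request.urlopen(url, timeout=10)
        except Exception:
            issues.append(f"unreachable image: {url}")
    return issues
</code></pre>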
<h2 id="heading-what-you-should-check">What You Should Check</h2>
<p>If you publish packages — PyPI, npm, anything — check these five things:</p>
<ul>
<li>Your install command in the README matches the actual published name</li>
<li>Your homepage URL resolves from an external network</li>
<li>Every image in your README is either bundled in the package or an absolute URL</li>
<li>Your license field isn't empty</li>
<li>Your badge links point somewhere real</li>
</ul>
<p>These aren't structural issues. CI won't catch them. You have to check them yourself — or build a checker that does.</p>
<p><em>I build agent infrastructure inside Microsoft 365. SPFx · TypeScript · autonomous multi-agent systems. Currently open to senior/architect roles (£120K+ remote UK). → <a target="_blank" href="mailto:vilius@workswithagents.com">vilius@workswithagents.com</a></em></p>
]]></content:encoded></item><item><title><![CDATA[My Subagents Kept Lying to Me — So I Wired Ed25519 Verification Into Our Own Protocol Stack]]></title><description><![CDATA[Three weeks ago I was writing integration guides telling other agent frameworks to adopt verification protocols. Meanwhile, my own subagents were returning hallucinated status reports that I was blindly trusting.
What I Built: Self-Verification For O...]]></description><link>https://blog.workswithagents.dev/my-subagents-kept-lying-to-me-so-i-wired-ed25519-verification-into-our-own-protocol-stack-215i</link><guid isPermaLink="true">https://blog.workswithagents.dev/my-subagents-kept-lying-to-me-so-i-wired-ed25519-verification-into-our-own-protocol-stack-215i</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Wed, 06 May 2026 21:04:03 GMT</pubDate><content:encoded><![CDATA[<p>Three weeks ago I was writing integration guides telling other agent frameworks to adopt verification protocols. Meanwhile, my own subagents were returning hallucinated status reports that I was blindly trusting.</p>
<h2 id="heading-what-i-built-self-verification-for-our-own-delegation">What I Built: Self-Verification For Our Own Delegation</h2>
<p>The fix wasn't a new tool. The fix was eating our own dog food.</p>
<h3 id="heading-layer-1-real-ed25519-signing">Layer 1: Real Ed25519 Signing</h3>
<p>The verification harness (<code>subagent-verify.py</code>) now uses PyNaCl for real Ed25519 signatures — not the SHA-256 placeholder we'd been shipping in reference implementations.</p>
<p>Before dispatch, the parent generates an Ed25519 keypair:</p>
<pre><code class="lang-bash">python3.11 ~/.hermes/scripts/subagent-verify.py dispatch \
  --task <span class="hljs-string">"check all integration PRs"</span> \
  --agent-name <span class="hljs-string">"tracker-<span class="hljs-subst">$(date +%H%M)</span>"</span>
</code></pre>
<p>This produces:</p>
<ul>
<li><code>public_key</code> — 32-byte Ed25519 verify key (hex). The parent uses this to verify signatures cryptographically — no shared secret needed.</li>
<li><code>context_instruction</code> — mandatory output format directive pasted into the subagent's context. The subagent MUST return structured JSON with a signature.</li>
<li><code>_parent_seed</code> — 32-byte private key. Never included in subagent context.</li>
</ul>
<p>When the subagent returns, the parent verifies:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$subagent_output</span>"</span> | python3.11 ~/.hermes/scripts/subagent-verify.py verify \
  --public-key <span class="hljs-string">"abc123..."</span> \
  --agent-id <span class="hljs-string">"tracker-1422"</span>
</code></pre>
<p><strong>Exit codes tell the story:</strong></p>
<ul>
<li><strong>Exit 0</strong> — Ed25519 signature valid + all claims match ground truth → trust</li>
<li><strong>Exit 1</strong> — Bad signature (tampered) OR claims don't match reality (hallucinated) → investigate</li>
<li><strong>Exit 2</strong> — No structured manifest found (unsigned prose) → DO NOT TRUST, re-dispatch</li>
</ul>
<p>Three test cases confirmed the harness catches exactly what it should:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Test</td><td>Result</td><td>Exit</td></tr>
</thead>
<tbody>
<tr>
<td>Signed, clean claims</td><td><code>clean</code> — all verified</td><td>0</td></tr>
<tr>
<td>Tampered claims (same signature)</td><td><code>bad_signature</code> — Ed25519 verification failed</td><td>1</td></tr>
<tr>
<td>Unsigned prose ("all clean ✅")</td><td><code>UNSIGNED</code> — no manifest found</td><td>2</td></tr>
</tbody>
</table>
</div><p>The tamper detection is real. If a subagent's claims are modified after signing — even a single character — the Ed25519 signature won't verify. This catches both accidental corruption and malicious modification.</p>
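<p>The crypto at the heart of the harness is small. A minimal PyNaCl sketch of the sign/verify roundtrip; the manifest shape here is illustrative, not the harness's exact format:</p>
<pre><code class="lang-python">import json
from nacl.encoding import HexEncoder
from nacl.exceptions import BadSignatureError
from nacl.signing import SigningKey, VerifyKey

# Dispatch: the parent generates the keypair and keeps the seed.
signing_key = SigningKey.generate()
public_key_hex = signing_key.verify_key.encode(encoder=HexEncoder).decode()

# The claims manifest is signed as canonical JSON by whoever holds the seed.
claims = {"agent_id": "tracker-1422", "claims": [{"pr": 42, "status": "green"}]}
payload = json.dumps(claims, sort_keys=True).encode()
signature = signing_key.sign(payload).signature

# Verify: the parent needs only the public key. Change one character
# of the payload and this raises BadSignatureError.
verify_key = VerifyKey(public_key_hex.encode(), encoder=HexEncoder)
try:
    verify_key.verify(payload, signature)
    print("exit 0: signature valid")
except BadSignatureError:
    print("exit 1: tampered or hallucinated")
</code></pre>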
<h3 id="heading-layer-2-l6-executionverificationgate-in-all-6-reference-implementations">Layer 2: L6 ExecutionVerificationGate In All 6 Reference Implementations</h3>
<p>The standalone harness is for parent-side verification. But agents that self-verify their subtasks need protocol-level enforcement. We added <code>ExecutionVerificationGate</code> (L6) to all six vanilla agent reference implementations — Python, TypeScript, Go, C#, Rust, and Shell.</p>
<p>It sits directly in the agent execution loop:</p>
<pre><code class="lang-plaintext">execute() → compliance_gate → _run() → VERIFICATION_GATE → tx.execute → DONE
                                           ↑
                                  unsigned/bad_sig → BLOCKED
</code></pre>
<p>Three tiers of validation:</p>
<ol>
<li><strong>Format</strong> — is there a structured <code>claims</code> array?</li>
<li><strong>Signature</strong> — is there an Ed25519 hex signature?</li>
<li><strong>Crypto</strong> — does the signature verify against the agent's public key?</li>
</ol>
<p>If any tier fails, the task is blocked — not silently accepted. In the Python reference:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> verify_output <span class="hljs-keyword">and</span> <span class="hljs-string">"claims"</span> <span class="hljs-keyword">in</span> task_result:
    vg_result = ExecutionVerificationGate.validate(task_result, self.identity)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> vg_result[<span class="hljs-string">"passed"</span>]:
        <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"blocked"</span>, <span class="hljs-string">"verdict"</span>: vg_result[<span class="hljs-string">"verdict"</span>]}
</code></pre>
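<p>The gate itself can be tiny. A hypothetical sketch of the three tiers, with the actual Ed25519 check passed in as <code>verify_fn</code> (the PyNaCl roundtrip above):</p>
<pre><code class="lang-python">def validate(task_result, verify_fn):
    """Hypothetical sketch of the three-tier gate, not the reference code.
    verify_fn performs the Ed25519 check against the agent's public key."""
    # Tier 1 (format): a structured claims array must exist.
    if not isinstance(task_result.get("claims"), list):
        return {"passed": False, "verdict": "UNSIGNED"}
    # Tier 2 (signature): an Ed25519 hex signature must be present.
    if not task_result.get("signature"):
        return {"passed": False, "verdict": "UNSIGNED"}
    # Tier 3 (crypto): the signature must verify.
    if not verify_fn(task_result["claims"], task_result["signature"]):
        return {"passed": False, "verdict": "bad_signature"}
    return {"passed": True, "verdict": "clean"}
</code></pre>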
<h3 id="heading-layer-3-wired-into-production-cron">Layer 3: Wired Into Production Cron</h3>
<p>The integration tracker that produced the original hallucination now has the verification harness in its skills list and a mandatory prompt directive:</p>
<blockquote>
<p><strong>CRITICAL — Direct Checks Only, No Subagents.</strong> Never use <code>delegate_task</code> for PR status checks. If a subagent is unavoidable, run <code>dispatch → verify</code> with Ed25519. Exit 2 means re-dispatch or check directly.</p>
</blockquote>
<p>The cron job now loads both <code>agent-integration-outreach</code> and <code>subagent-output-verification</code> skills. Every PR check goes through one of two paths: direct <code>gh pr checks</code> (preferred) or verified subagent dispatch (when unavoidable).</p>
<h2 id="heading-get-it">Get It</h2>
<p>The verification harness and all six reference implementations with L6 gates are available:</p>
<ul>
<li><strong>Verification Harness:</strong> <code>~/.hermes/scripts/subagent-verify.py</code> — real Ed25519 via PyNaCl, dispatch + verify modes</li>
<li><strong>Python:</strong> <code>vanilla_agent.py</code> — <code>execute(verify_output=True)</code> with <code>ExecutionVerificationGate</code></li>
<li><strong>TypeScript/Go/C#/Rust/Shell:</strong> Same L6 gate, same OSI stack, zero external deps beyond stdlib</li>
</ul>
<p>All under CC BY 4.0. Full spec at <a target="_blank" href="https://workswithagents.com/standards">workswithagents.com/standards</a>.</p>
<p>If your agents are delegating to subagents without verification — and they are, because every agent framework does — the fix is a single file, 300 lines, real crypto, and three exit codes that tell you whether to trust the output or throw it away.</p>
<hr />
<p><em>I build agent infrastructure inside Microsoft 365. SPFx · TypeScript · autonomous multi-agent systems. Currently open to senior/architect roles (£120K+ remote UK). → <a target="_blank" href="mailto:vilius@workswithagents.com">vilius@workswithagents.com</a></em></p>
]]></content:encoded></item><item><title><![CDATA[Works With Agents SDK — Python, TypeScript, Go, Rust, Shell, C#]]></title><description><![CDATA[Works With Agents — Now in 6 Languages
All 12 Agent OSI Model protocols now have reference implementations in every language AI agents commonly use.
Language SDKs




LanguageInstallModules



Pythonpip install workswithagentsTrust, Deploy, SLA, Iden...]]></description><link>https://blog.workswithagents.dev/works-with-agents-sdk-python-typescript-go-rust-shell-c-3dp4</link><guid isPermaLink="true">https://blog.workswithagents.dev/works-with-agents-sdk-python-typescript-go-rust-shell-c-3dp4</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[showdev]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Tue, 05 May 2026 22:18:14 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-works-with-agents-now-in-6-languages">Works With Agents — Now in 6 Languages</h1>
<p>All 12 Agent OSI Model protocols now have reference implementations in every language AI agents commonly use.</p>
<h2 id="heading-language-sdks">Language SDKs</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Language</td><td>Install</td><td>Modules</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Python</strong></td><td><code>pip install workswithagents</code></td><td>Trust, Deploy, SLA, Identity, Compliance, Onboard</td></tr>
<tr>
<td><strong>TypeScript</strong></td><td><code>npm install workswithagents</code></td><td>Trust, Deploy, SLA, Identity, Compliance, Onboard</td></tr>
<tr>
<td><strong>Go</strong></td><td><code>go get github.com/vystartasv/works-with-agents</code></td><td>Trust, Deploy, SLA, Identity, Compliance, Onboard</td></tr>
<tr>
<td><strong>Rust</strong></td><td><code>cargo add workswithagents</code></td><td>Trust, Deploy, SLA, Identity, Compliance</td></tr>
<tr>
<td><strong>Shell</strong></td><td><code>source workswithagents.sh</code></td><td>All 6 protocols as curl wrappers</td></tr>
<tr>
<td><strong>C#</strong></td><td>Copy <code>WorksWithAgents.cs</code></td><td>Trust, Deploy, SLA, Identity, Compliance, Onboard</td></tr>
</tbody>
</table>
</div><h2 id="heading-one-api-six-languages-same-protocols">One API. Six Languages. Same Protocols.</h2>
<h3 id="heading-python">Python</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> TrustScoreClient, ComplianceEngine

ts = TrustScoreClient()
<span class="hljs-keyword">if</span> ts.get(<span class="hljs-string">"target-agent"</span>)[<span class="hljs-string">"tier"</span>] == <span class="hljs-string">"trusted"</span>:
    delegate(task, to=<span class="hljs-string">"target-agent"</span>)

ce = ComplianceEngine()
dtac = ce.load(<span class="hljs-string">"dtac-v2.1"</span>)
<span class="hljs-keyword">if</span> dtac.validate(action).passed:
    execute(action)
</code></pre>
<h3 id="heading-typescript">TypeScript</h3>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { TrustScoreClient, ComplianceEngine } <span class="hljs-keyword">from</span> <span class="hljs-string">"workswithagents"</span>;

<span class="hljs-keyword">const</span> ts = <span class="hljs-keyword">new</span> TrustScoreClient();
<span class="hljs-keyword">const</span> score = <span class="hljs-keyword">await</span> ts.get(<span class="hljs-string">"target-agent"</span>);
<span class="hljs-keyword">if</span> (score.tier === <span class="hljs-string">"trusted"</span>) delegate(task, <span class="hljs-string">"target-agent"</span>);

<span class="hljs-keyword">const</span> ce = <span class="hljs-keyword">new</span> ComplianceEngine();
<span class="hljs-keyword">const</span> dtac = <span class="hljs-keyword">await</span> ce.load(<span class="hljs-string">"dtac-v2.1"</span>);
<span class="hljs-keyword">if</span> ((<span class="hljs-keyword">await</span> dtac.validate(action)).passed) execute(action);
</code></pre>
<h3 id="heading-go">Go</h3>
<pre><code class="lang-go"><span class="hljs-keyword">import</span> wwa <span class="hljs-string">"github.com/vystartasv/works-with-agents"</span>

ts := wwa.NewTrustScoreClient()
score, _ := ts.Get(<span class="hljs-string">"target-agent"</span>)

engine := wwa.NewComplianceEngine()
result, _ := engine.Validate(<span class="hljs-string">"dtac-v2.1"</span>, action)
</code></pre>
<h3 id="heading-rust">Rust</h3>
<pre><code class="lang-rust"><span class="hljs-keyword">use</span> workswithagents::{TrustScoreClient, ComplianceEngine};

<span class="hljs-keyword">let</span> ts = TrustScoreClient::new();
<span class="hljs-keyword">let</span> score = ts.get(<span class="hljs-string">"target-agent"</span>)?;

<span class="hljs-keyword">let</span> engine = ComplianceEngine::new();
<span class="hljs-keyword">let</span> result = engine.validate(<span class="hljs-string">"dtac-v2.1"</span>, &amp;action)?;
</code></pre>
<h3 id="heading-shell">Shell</h3>
<pre><code class="lang-bash"><span class="hljs-built_in">source</span> workswithagents.sh
wwa_trust_get <span class="hljs-string">"target-agent"</span>
wwa_compliance_validate <span class="hljs-string">"dtac-v2.1"</span> <span class="hljs-string">'{"verb":"deploy","reversible":true}'</span>
</code></pre>
<h3 id="heading-c">C</h3>
<pre><code class="lang-csharp"><span class="hljs-keyword">using</span> WorksWithAgents;

<span class="hljs-keyword">var</span> score = <span class="hljs-keyword">await</span> WWA.TrustGet(<span class="hljs-string">"target-agent"</span>);
<span class="hljs-keyword">var</span> result = <span class="hljs-keyword">await</span> WWA.ComplianceValidate(<span class="hljs-string">"dtac-v2.1"</span>, action);
</code></pre>
<h2 id="heading-zero-dependencies-except-cryptography-for-identity">Zero Dependencies (except cryptography for Identity)</h2>
<p>The Python, TypeScript, Go, Shell, and C# SDKs use only the standard library, apart from a cryptography dependency for the Identity protocol. The Rust SDK needs serde and reqwest (standard Rust ecosystem dependencies).</p>
<h2 id="heading-all-cc-by-40">All CC BY 4.0</h2>
<p>Free to use, modify, distribute. Attribution required. Copy-paste into your agent codebase.</p>
<ul>
<li>Source: <a target="_blank" href="https://workswithagents.dev/sdk/">workswithagents.dev/sdk/</a></li>
<li>Specs: <a target="_blank" href="https://workswithagents.dev/specs/">workswithagents.dev/specs/</a></li>
<li>Python: <a target="_blank" href="https://pypi.org/project/workswithagents">pypi.org/project/workswithagents</a> (submitting soon)</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[6 Protocols for Agent Infrastructure — Trust Score, Deployment, SLA, Identity, Compliance]]></title><description><![CDATA[I run about 20 AI agents. They delegate work to each other, deploy code, scan for vulnerabilities, and handle compliance checks. Over time, I kept hitting the same gaps — things that made autonomous workflows fragile in ways that took hours to debug....]]></description><link>https://blog.workswithagents.dev/6-new-moats-for-ai-agent-infrastructure-trust-score-deployment-sla-identity-compliance-as-code-ikl</link><guid isPermaLink="true">https://blog.workswithagents.dev/6-new-moats-for-ai-agent-infrastructure-trust-score-deployment-sla-identity-compliance-as-code-ikl</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[architecture]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Tue, 05 May 2026 21:53:37 GMT</pubDate><content:encoded><![CDATA[<p>I run about 20 AI agents. They delegate work to each other, deploy code, scan for vulnerabilities, and handle compliance checks. Over time, I kept hitting the same gaps — things that made autonomous workflows fragile in ways that took hours to debug.</p>
<p>Last week I published a <a target="_blank" href="https://dev.to/vystartasv/the-agent-osi-model-a-7-layer-framework-for-ai-agent-infrastructure-3aoe">7-layer model for agent infrastructure</a>. These six protocols fill the gaps I found at each layer. They're what I wired into my own agents to stop the same failures from repeating.</p>
<p>All six have Python reference implementations under CC BY 4.0. Each has a spec any agent can read.</p>
<h2 id="heading-2-deployment-manifest-declare-a-fleet-deploy-with-one-command">2. Deployment Manifest — Declare a Fleet, Deploy With One Command</h2>
<p>I got tired of manually tracking which agents run where, how many instances, and what capabilities they have. One YAML file, one command.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">fleet:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"my-fleet"</span>
  <span class="hljs-attr">agents:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">id:</span> <span class="hljs-string">"builder"</span>
      <span class="hljs-attr">capabilities:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">action:</span> <span class="hljs-string">"build"</span>
          <span class="hljs-attr">target:</span> <span class="hljs-string">"spfx"</span>
      <span class="hljs-attr">count:</span> <span class="hljs-number">3</span>
</code></pre>
<pre><code class="lang-bash">wwa fleet deploy fleet.yaml
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/deployment-manifest.md">Spec</a></p>
<hr />
<h2 id="heading-3-sla-framework-track-whether-agents-meet-their-promises">3. SLA Framework — Track Whether Agents Meet Their Promises</h2>
<p>Three tiers: Best-Effort (free), Production (99.5% uptime, 90% task accuracy), Regulated (99.9% uptime, 95% accuracy, 7-year audit retention).</p>
<p>Useful when you're running agents that handle customer data or regulated workflows and need to prove they stayed within bounds.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> SLAMetrics

sla = SLAMetrics(<span class="hljs-string">"my-fleet"</span>, tier=<span class="hljs-string">"production"</span>)
sla.report(<span class="hljs-string">"agent-1"</span>, <span class="hljs-string">"task-42"</span>, duration_seconds=<span class="hljs-number">187</span>, success=<span class="hljs-literal">True</span>)
status = sla.status()  <span class="hljs-comment"># {breaches: [], status: "ok"}</span>
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/sla-framework.md">Spec</a></p>
<hr />
<h2 id="heading-4-identity-protocol-verifiable-agent-identity">4. Identity Protocol — Verifiable Agent Identity</h2>
<p>When an agent claims a task result, can you prove it was that agent? Ed25519 keypairs. Signed messages. Verification against registry.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> AgentIdentity

ai = AgentIdentity(<span class="hljs-string">"my-agent"</span>)
ai.register()
sig = ai.sign({<span class="hljs-string">"type"</span>: <span class="hljs-string">"heartbeat"</span>})

<span class="hljs-comment"># Verify another agent's message</span>
valid = AgentIdentity.verify(<span class="hljs-string">"other-agent"</span>, message, signature)
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/identity-protocol.md">Spec</a></p>
<hr />
<h2 id="heading-5-compliance-as-code-regulation-as-executable-validation">5. Compliance-as-Code — Regulation as Executable Validation</h2>
<p>NHS DTAC, FCA, GDS, GDPR — as rules agents can validate against at runtime. Not a checklist. Not documentation. Code that returns pass/fail.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> ComplianceEngine

ce = ComplianceEngine()
dtac = ce.load(<span class="hljs-string">"dtac-v2.1"</span>)

<span class="hljs-keyword">if</span> dtac.validate(action).passed:
    execute(action)
<span class="hljs-keyword">else</span>:
    escalate_to_human()
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/compliance-as-code.md">Spec</a></p>
<hr />
<h2 id="heading-6-onboarding-protocol-systematic-agent-creation">6. Onboarding Protocol — Systematic Agent Creation</h2>
<p>Interview → generate → calibrate → benchmark → register. Instead of writing a prompt file and hoping, run a pipeline that produces a scored agent.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> workswithagents <span class="hljs-keyword">import</span> OnboardingClient

ob = OnboardingClient()
result = ob.full_onboard(
    <span class="hljs-string">"nhs-auditor"</span>,
    <span class="hljs-string">"Audit agent actions for NHS DTAC compliance"</span>,
    capabilities=[<span class="hljs-string">"audit:compliance"</span>],
    skills=[<span class="hljs-string">"compliance-as-code"</span>]
)
<span class="hljs-comment"># → {agent_id: "nhs-auditor", trust_score_seed: 0.60}</span>
</code></pre>
<p><a target="_blank" href="https://workswithagents.dev/specs/onboarding-protocol.md">Spec</a></p>
<hr />
<h2 id="heading-the-stack">The Stack</h2>
<pre><code class="lang-plaintext">L7 GOVERNANCE    Compliance-as-Code · SLA Framework
L6 VERIFICATION  Agent Test Suite · Pitfall Registry
L5 COORDINATION  Coordination Protocol · Trust Score
L4 SESSION       Handoff Protocol
L3 DISCOVERY     Capability Manifest · Trust Score · Identity
L2 COMMUNICATION Identity Protocol · Credential Proxy
L1 EXECUTION     Blueprint Registry · Onboarding Protocol
</code></pre>
<p>Plus cross-layer: Deployment Manifest.</p>
<hr />
<h2 id="heading-get-started">Get Started</h2>
<pre><code class="lang-bash">pip install workswithagents
</code></pre>
<p>All specs: <a target="_blank" href="https://workswithagents.dev/specs/">workswithagents.dev/specs/</a>
All code: CC BY 4.0</p>
]]></content:encoded></item><item><title><![CDATA[The Agent OSI Model — A 7-Layer Framework for AI Agent Infrastructure]]></title><description><![CDATA[The Agent OSI Model — A 7-Layer Framework for AI Agent Infrastructure
The OSI model didn't create networking. It created the vocabulary that made networking a discipline. Before OSI, engineers said "the connection is broken." After OSI, they said "La...]]></description><link>https://blog.workswithagents.dev/the-agent-osi-model-a-7-layer-framework-for-ai-agent-infrastructure-3aoe</link><guid isPermaLink="true">https://blog.workswithagents.dev/the-agent-osi-model-a-7-layer-framework-for-ai-agent-infrastructure-3aoe</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[architecture]]></category><category><![CDATA[Discuss]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Tue, 05 May 2026 21:32:25 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-the-agent-osi-model-a-7-layer-framework-for-ai-agent-infrastructure">The Agent OSI Model — A 7-Layer Framework for AI Agent Infrastructure</h1>
<p>The OSI model didn't create networking. It created the <strong>vocabulary</strong> that made networking a discipline. Before OSI, engineers said "the connection is broken." After OSI, they said "Layer 2 link is down."</p>
<p>AI agents have no equivalent. When an agent fails, we say "the agent broke." That's useless.</p>
<p>I've published a 7-layer framework for agent infrastructure. Not a product. Not a standard. A vocabulary.</p>
<h2 id="heading-why-this-matters">Why This Matters</h2>
<p><strong>For debugging:</strong> "Your Layer 4 handoff is broken" is actionable. "Your agents aren't talking to each other" is vague.</p>
<p><strong>For building:</strong> Don't build everything at once. Target specific layers. A local agent needs L1 (runtime) + L2 (auth) + L4 (handoff). A multi-agent fleet adds L3 (discovery) + L5 (coordination). An enterprise deployment adds L6 (verification) + L7 (governance).</p>
<p><strong>For standards:</strong> Each layer without a standard is a gap — and an opportunity. The framework makes it obvious where standards are needed.</p>
<hr />
<h2 id="heading-what-exists-whats-missing">What Exists, What's Missing</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Layer</td><td>Infrastructure</td><td>Status</td></tr>
</thead>
<tbody>
<tr>
<td>L1</td><td>Blueprint Registry (verified LLM configs)</td><td>✅ Live</td></tr>
<tr>
<td>L2</td><td>MCP, A2A, Credential Proxy</td><td>✅ Live</td></tr>
<tr>
<td>L3</td><td>llms.txt, Agent Capability Manifest</td><td>✅ Spec written</td></tr>
<tr>
<td>L4</td><td>Handoff Protocol</td><td>📋 In proposal (MCP SEP #2683, A2A #1817)</td></tr>
<tr>
<td>L5</td><td>Coordination Protocol</td><td>🆕 Spec published today</td></tr>
<tr>
<td>L6</td><td>Agent Test Suite, Pitfall Registry</td><td>⚠️ Partial</td></tr>
<tr>
<td>L7</td><td>Transaction Protocol, Compliance-as-Code</td><td>🆕 Spec published today</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-three-new-specs-published-today">Three New Specs Published Today</h2>
<h3 id="heading-coordination-protocol-layer-5">Coordination Protocol (Layer 5)</h3>
<p>How agents work together simultaneously. Leader election (Raft-lite for agents). Work distribution with capability matching. Work stealing — idle agents pull from busy queues. Conflict resolution with audit trail.</p>
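<p>To make work stealing concrete, a toy sketch; none of these names are the spec's API:</p>
<pre><code class="lang-python">from collections import deque

# Toy model: one queue per agent; an idle agent steals from a busy peer.
queues = {"builder": deque(), "researcher": deque()}

def next_task(agent):
    if queues[agent]:
        return queues[agent].popleft()      # own work first, FIFO
    for other, q in queues.items():
        if other != agent and q:
            return q.pop()                  # steal from the tail of a busy queue
    return None
</code></pre>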
<h3 id="heading-agent-capability-manifest-layer-3">Agent Capability Manifest (Layer 3)</h3>
<p>Machine-readable declaration of what an agent can do. Like <code>package.json</code> but for agent capabilities. Discovery: "who can build SPFx?" → ranked by success rate + load + trust score.</p>
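<p>The discovery query, as a toy sketch with made-up field names:</p>
<pre><code class="lang-python">manifests = [
    {"agent": "builder-1", "capabilities": ["build:spfx"],
     "success_rate": 0.94, "load": 0.30, "trust_score": 0.82},
    {"agent": "builder-2", "capabilities": ["build:spfx"],
     "success_rate": 0.88, "load": 0.10, "trust_score": 0.90},
]

def who_can(capability):
    # Rank by success rate and trust, penalising current load.
    matches = [m for m in manifests if capability in m["capabilities"]]
    return sorted(matches,
                  key=lambda m: m["success_rate"] + m["trust_score"] - m["load"],
                  reverse=True)

print(who_can("build:spfx")[0]["agent"])  # best answer to "who can build SPFx?"
</code></pre>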
<h3 id="heading-agent-transaction-protocol-layer-7">Agent Transaction Protocol (Layer 7)</h3>
<p>Guarantees for autonomous actions. Idempotency keys (no double deploys). Intent-before-action logging (know what the agent TRIED to do even if it crashed). Rollback hooks. Three guarantee levels: Best-Effort, At-Least-Once, Exactly-Once.</p>
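<p>The idempotency-key idea in miniature; the in-memory store and helper names are hypothetical stand-ins:</p>
<pre><code class="lang-python">import hashlib
import json

_results = {}   # stand-in for a durable store

def execute_once(action, run):
    # Idempotency key: a stable hash of the action's canonical JSON.
    key = hashlib.sha256(json.dumps(action, sort_keys=True).encode()).hexdigest()
    if key in _results:
        return _results[key]            # replay returns the prior result (no double deploy)
    print(f"INTENT {key}: {action}")    # intent logged before acting, survives a crash
    _results[key] = run(action)
    return _results[key]
</code></pre>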
<hr />
<h2 id="heading-the-bigger-play">The Bigger Play</h2>
<p>Everyone's building AI agents. I'm building the infrastructure agents run on — the picks and shovels of the agent gold rush.</p>
<p>The Agent OSI Model is the framework. The specs at each layer are the picks and shovels. The certification system (Blueprint, Ready, Certified) is the trust layer on top.</p>
<p>Full framework and all specs: <a target="_blank" href="https://workswithagents.dev/specs/index.md">workswithagents.dev/specs/</a></p>
<p>Human-readable overview: <a target="_blank" href="https://workswithagents.com/agent-osi-model.md">workswithagents.com/agent-osi-model</a></p>
<p>All specs CC BY 4.0 — free to use, cite, and build upon. Attribution required.</p>
<hr />
<p><em>If you're building multi-agent systems and hitting coordination problems, or if you're in a regulated industry and need audit trails for autonomous agents — I'd like to hear about your use case. The specs are published. The infrastructure is being built. The conversation is starting now.</em></p>
]]></content:encoded></item><item><title><![CDATA[Your API Needs an llms.txt File — Here's How to Write One and Why Agents Will Read It]]></title><description><![CDATA[Your API Needs an llms.txt File — Here's How to Write One and Why Agents Will Read It
AI agents are being trained to look for llms.txt files. It's the agent-native equivalent of robots.txt — a file at your domain root that tells agents how to discove...]]></description><link>https://blog.workswithagents.dev/your-api-needs-an-llmstxt-file-heres-how-to-write-one-and-why-agents-will-read-it-5epe</link><guid isPermaLink="true">https://blog.workswithagents.dev/your-api-needs-an-llmstxt-file-heres-how-to-write-one-and-why-agents-will-read-it-5epe</guid><category><![CDATA[AI]]></category><category><![CDATA[api]]></category><category><![CDATA[documentation]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Tue, 05 May 2026 21:00:27 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-your-api-needs-an-llmstxt-file-heres-how-to-write-one-and-why-agents-will-read-it">Your API Needs an llms.txt File — Here's How to Write One and Why Agents Will Read It</h1>
<p>AI agents are being trained to look for <code>llms.txt</code> files. It's the agent-native equivalent of <code>robots.txt</code> — a file at your domain root that tells agents how to discover and use your content.</p>
<p>If your product doesn't have one, agents can't find it. If your competitor's does, theirs gets discovered first.</p>
<h2 id="heading-the-concise-index-llmstxt">The Concise Index (llms.txt)</h2>
<p>Minimal. Just tells agents what's available and where:</p>
<pre><code class="lang-markdown"><span class="hljs-section"># Stripe API</span>

<span class="hljs-quote">&gt; Payment infrastructure for the internet.</span>
<span class="hljs-quote">&gt; API: https://api.stripe.com</span>

<span class="hljs-section">## API Reference</span>
<span class="hljs-bullet">-</span> [<span class="hljs-string">Authentication</span>](<span class="hljs-link">https://docs.stripe.com/api/authentication.md</span>)
<span class="hljs-bullet">-</span> [<span class="hljs-string">Charges</span>](<span class="hljs-link">https://docs.stripe.com/api/charges.md</span>)
<span class="hljs-bullet">-</span> [<span class="hljs-string">Webhooks</span>](<span class="hljs-link">https://docs.stripe.com/api/webhooks.md</span>)

<span class="hljs-section">## Guides</span>
<span class="hljs-bullet">-</span> [<span class="hljs-string">Quickstart</span>](<span class="hljs-link">https://docs.stripe.com/quickstart.md</span>)
<span class="hljs-bullet">-</span> [<span class="hljs-string">Testing</span>](<span class="hljs-link">https://docs.stripe.com/testing.md</span>)

<span class="hljs-section">## SDKs</span>
<span class="hljs-bullet">-</span> [<span class="hljs-string">Python</span>](<span class="hljs-link">https://docs.stripe.com/sdks/python.md</span>)
<span class="hljs-bullet">-</span> [<span class="hljs-string">Node.js</span>](<span class="hljs-link">https://docs.stripe.com/sdks/node.md</span>)
</code></pre>
<p>That's it. No styling. No navigation. Just links to Markdown documents. Agents parse this and know exactly what's on your site.</p>
<hr />
<h2 id="heading-the-full-reference-llms-fulltxt">The Full Reference (llms-full.txt)</h2>
<p>This is what agents actually read. All your documentation in one file, with a table of contents at the top:</p>
<pre><code class="lang-markdown"><span class="hljs-section"># Stripe API — Full Reference</span>

<span class="hljs-section">## Quickstart</span>
[Full quickstart content...]

<span class="hljs-section">## Authentication</span>
API key format, scopes, rotation...

<span class="hljs-section">## Endpoints</span>
<span class="hljs-section">### Charges</span>
POST /v1/charges — request/response examples...

<span class="hljs-section">### Customers</span>
[Full customer API docs...]

<span class="hljs-section">## Error Handling</span>
Error codes, retry patterns...

<span class="hljs-section">## SDK Examples</span>
Python, Node, Ruby...
</code></pre>
<p><strong>Critical rule:</strong> The full file must be under ~50K characters. Agents have context limits. If it's too long, they'll truncate or ignore it.</p>
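<p>A check worth wiring into CI. A minimal sketch, assuming the file sits in the working directory:</p>
<pre><code class="lang-python">import sys

text = open("llms-full.txt", encoding="utf-8").read()
if len(text) &gt; 50_000:
    sys.exit(f"llms-full.txt is {len(text):,} chars; trim it below ~50K")
</code></pre>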
<hr />
<h2 id="heading-how-to-serve-it">How to Serve It</h2>
<p>Two paths:</p>
<h3 id="heading-static-site-recommended">Static site (recommended)</h3>
<p>Drop <code>llms.txt</code> and <code>llms-full.txt</code> in your site root. They're plain Markdown files. Serve with <code>Content-Type: text/markdown</code>.</p>
<h3 id="heading-api-for-dynamic-content">API (for dynamic content)</h3>
<p>Generate from your API docs programmatically. Return via <code>Accept: text/markdown</code> header. Agents can request the format they need.</p>
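<p>Whichever path you take, the part agents depend on is that header. A minimal stdlib sketch (port and file locations are arbitrary):</p>
<pre><code class="lang-python">from http.server import BaseHTTPRequestHandler, HTTPServer

class LlmsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in ("/llms.txt", "/llms-full.txt"):
            body = open(self.path.lstrip("/"), "rb").read()
            self.send_response(200)
            # The content type agents expect for these files.
            self.send_header("Content-Type", "text/markdown; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

HTTPServer(("", 8080), LlmsHandler).serve_forever()
</code></pre>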
<hr />
<h2 id="heading-why-this-matters-now">Why This Matters Now</h2>
<p>Google, OpenAI, Anthropic, and others are training their agents to recognize <code>llms.txt</code> as a discovery mechanism. It's the <code>robots.txt</code> moment for AI agents — early adopters get indexed first.</p>
<p>Three things happen when you add llms.txt:</p>
<ol>
<li><strong>Discovery.</strong> Agents find your content without human intervention.</li>
<li><strong>Efficiency.</strong> Agents read one structured file instead of crawling 50 pages.</li>
<li><strong>Positioning.</strong> You're in the agent discovery ecosystem before your competitors.</li>
</ol>
<hr />
<h2 id="heading-the-works-with-agents-angle">The "Works With Agents" Angle</h2>
<p>This is part of a larger idea: what if there was a certification for being agent-compatible?</p>
<ul>
<li><strong>Works With Agents Ready</strong> — Your product has llms.txt + OpenAPI spec. Agents can discover and use it.</li>
<li><strong>Works With Agents Certified</strong> — Your product has been tested with real agents. Pitfalls are documented. Skills exist.</li>
</ul>
<p>The first tier is free and self-serve — just add the files. The second tier is verified by our infrastructure.</p>
<p>But that's a bigger conversation. For now: write your llms.txt. It takes 20 minutes. Your future AI agent users will thank you.</p>
<hr />
<p><em>I built the Works With Agents infrastructure — FactBase, Skill Registry, Pitfall Registry — with llms.txt as the primary discovery mechanism. Every domain (workswithagents.com, .dev, .io) serves both llms.txt and llms-full.txt. If you're building agent-facing tools, do the same.</em></p>
<hr />
<p><em>I build agent infrastructure inside Microsoft 365. SPFx · TypeScript · autonomous multi-agent systems. Currently open to senior/architect roles (£120K+ remote UK). → <a target="_blank" href="mailto:vilius@workswithagents.com">vilius@workswithagents.com</a></em></p>
]]></content:encoded></item><item><title><![CDATA[I Built Infrastructure for 20 AI Agents That Run Themselves — For €4.57/Month]]></title><description><![CDATA[Five months ago, I couldn't get one AI agent to finish a build without breaking. Today, I have 20 autonomous agents running cron jobs, self-improving, and catching each other's bugs — all on a €4.57/month VPS.
Here's what I built, what broke along th...]]></description><link>https://blog.workswithagents.dev/i-built-infrastructure-for-20-ai-agents-that-run-themselves-for-eu457month-1p5l</link><guid isPermaLink="true">https://blog.workswithagents.dev/i-built-infrastructure-for-20-ai-agents-that-run-themselves-for-eu457month-1p5l</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Tue, 05 May 2026 20:48:05 GMT</pubDate><content:encoded><![CDATA[<p>Five months ago, I couldn't get one AI agent to finish a build without breaking. Today, I have 20 autonomous agents running cron jobs, self-improving, and catching each other's bugs — all on a €4.57/month VPS.</p>
<p>Here's what I built, what broke along the way, and why I'm not charging for any of it yet.</p>
<h2 id="heading-the-10-patterns-that-emerged">The 10 Patterns That Emerged</h2>
<p>I didn't plan to build a methodology. I was just trying to make agents work. But over 5 months of breaking and fixing, patterns emerged:</p>
<p><strong>1. Boot</strong> — First session setup. AGENTS.md, environment, initial memory. Without this, every agent starts blind.</p>
<p><strong>2. Skills</strong> — Reusable procedural knowledge. I now have 153 skills. When an agent needs to build an SPFx web part, it loads the skill — no re-explaining.</p>
<p><strong>3. Memory</strong> — Durable facts across sessions. What Python version? Where's the project? Never re-answer these questions.</p>
<p><strong>4. Decision Protocols</strong> — When the agent decides vs when it asks. Hours saved from eliminated approval loops.</p>
<p><strong>5. Tool Composition</strong> — The right tool for each job. Delegating a coding task to a subagent burns tokens and produces garbage. Use <code>write_file</code> directly.</p>
<p><strong>6. Orchestration</strong> — Parallel specialist agents. Research runs while build runs. 3x throughput.</p>
<p><strong>7. Pipelines</strong> — Agents that run while you sleep. Cron jobs, builds, monitoring. Silent unless broken.</p>
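<p>The "silent unless broken" shape is worth sketching. Assuming node-cron for scheduling (the schedule, command, and alert hook below are placeholders):</p>
<pre><code class="lang-typescript">import cron from "node-cron";
import { execSync } from "node:child_process";

// Nightly build at 03:00. No output on success, one alert on failure.
cron.schedule("0 3 * * *", () => {
  try {
    execSync("npm run build", { stdio: "pipe" });
    // Success is silent by design: no log noise, no notification.
  } catch (err) {
    notifyFailure("nightly build failed", err);
  }
});

function notifyFailure(message: string, err: unknown): void {
  // Placeholder: wire this to whatever alerting you already use.
  console.error(`[pipeline] ${message}`, err);
}
</code></pre>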
<p><strong>8. Resilience</strong> — Never-stop loops. 11 consecutive builds with zero human intervention. The agent hit errors on 8 of them and recovered from every single one.</p>
<p><strong>9. Verify</strong> — Trust but verify. Syntax checks, test runs, linting after every change. 77% test pass rate across 61 tests.</p>
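<p>Patterns 8 and 9 work as a pair: the loop never stops on an error, and a recovery attempt only counts once it's verified. A minimal sketch, with the retry cap and commands as assumptions rather than my exact setup:</p>
<pre><code class="lang-typescript">import { execSync } from "node:child_process";

const MAX_ATTEMPTS = 20;

// Run a command; report success or failure instead of throwing.
function run(cmd: string): boolean {
  try {
    execSync(cmd, { stdio: "pipe" });
    return true;
  } catch {
    return false;
  }
}

let attempt = 0;
while (true) {
  attempt += 1;
  if (attempt > MAX_ATTEMPTS) {
    throw new Error("recovery loop exhausted, escalating to a human");
  }
  if (!run("npm run build")) {
    // In a real loop the agent diagnoses the failure here
    // (read the error, apply a known fix) before retrying.
    continue;
  }
  // Pattern 9: a green build only counts once the checks pass too.
  if (run("npx tsc --noEmit")) {
    if (run("npm test")) break;
  }
}
console.log(`build verified after ${attempt} attempt(s)`);
</code></pre>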
<p><strong>10. Compounding</strong> — Agents that get better. Each solved problem becomes a skill. The agent today is qualitatively different from 5 months ago.</p>
<hr />
<h2 id="heading-the-unglamorous-truth">The Unglamorous Truth</h2>
<p>The 3-day weekend experiment gets the attention — agents scaffolded 111 web parts and 5 backend services autonomously. But the real work was the months before and after:</p>
<ul>
<li>Fixing macOS permissions so agents can read files</li>
<li>Tracing why the model config was broken (empty model name, nothing works for 2 hours)</li>
<li>Rewriting SCSS configuration because it was written for Gulp, not Heft</li>
<li>Discovering the Yeoman generator silently ignores CLI flags when <code>.yo-rc.json</code> exists</li>
<li>Hunting why C++ native modules won't compile on Node 22</li>
</ul>
<p>None of this is in a tutorial. You live through it — late at night, no shortcut.</p>
<hr />
<h2 id="heading-whats-not-ready-honest-part">What's NOT Ready (Honest Part)</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Thing</th><th>Status</th></tr>
</thead>
<tbody>
<tr>
<td>5 domains live</td><td>✅</td></tr>
<tr>
<td>3 APIs serving real data</td><td>✅</td></tr>
<tr>
<td>153 skills queryable</td><td>✅</td></tr>
<tr>
<td>10-pattern methodology documented</td><td>✅</td></tr>
<tr>
<td>Courses</td><td>❌ Content written, not launched</td></tr>
<tr>
<td>Workshops</td><td>❌ Materials in planning</td></tr>
<tr>
<td>Consulting</td><td>❌ 0 clients</td></tr>
<tr>
<td>Paying customers</td><td>❌ 0</td></tr>
</tbody>
</table>
</div><p>Everything with a price tag says "Coming Soon." I'm not selling anything until the methodology is proven with real users. This is pre-revenue, pre-launch, pre-everything commercial. I'm shipping infrastructure, not promises.</p>
<hr />
<h2 id="heading-the-bigger-play">The Bigger Play</h2>
<p>Everyone's building agents. I'm building infrastructure FOR agents.</p>
<p>Agents need shared knowledge (FactBase). They need verified configurations (Blueprints). They need to know what breaks and how to fix it (Pitfalls). They need to hand off work without losing context (Handoff Protocol). They need a way to discover documentation (llms.txt).</p>
<p>These are the picks and shovels of the agent gold rush. And most people haven't realised the gold rush needs picks and shovels yet.</p>
<hr />
<h2 id="heading-whats-next">What's Next</h2>
<p>If this resonates — if you're building agent infrastructure too, or if you've hit the same walls — I'd like to hear about it. The pitfall registry is live and open. The skill registry is queryable. The methodology is documented.</p>
<p>Everything at workswithagents.com, workswithagents.dev, workswithagents.io.</p>
<p>Built in Cardiff. Running in Nuremberg. €4.57/month.</p>
<hr />
<p><em>No launch announcement. No pricing page. No "revolutionise your workflow." Just infrastructure, live, and honest about what's not ready yet.</em></p>
]]></content:encoded></item><item><title><![CDATA[The Agentic Gap: Why a SharePoint Expert's Excitement Stopped Me Cold]]></title><description><![CDATA[I saw a SharePoint MVP's post recently. Genuine excitement. Markdown support had landed in SharePoint. Not a joke — real, earned enthusiasm from someone who knows their domain inside out.
And I get it. In the SharePoint world, that's real progress. I...]]></description><link>https://blog.workswithagents.dev/the-agentic-gap-why-a-sharepoint-experts-excitement-stopped-me-cold-5267</link><guid isPermaLink="true">https://blog.workswithagents.dev/the-agentic-gap-why-a-sharepoint-experts-excitement-stopped-me-cold-5267</guid><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[SharePoint]]></category><category><![CDATA[software]]></category><dc:creator><![CDATA[Vilius Vystartas]]></dc:creator><pubDate>Mon, 04 May 2026 21:48:25 GMT</pubDate><content:encoded><![CDATA[<p>I saw a SharePoint MVP's post recently. Genuine excitement. Markdown support had landed in SharePoint. Not a joke — real, earned enthusiasm from someone who knows their domain inside out.</p>
<p>And I get it. In the SharePoint world, that's real progress. It matters for real users solving real problems. You don't become an MVP without expert knowledge and public recognition. He was right to celebrate.</p>
<p>What stopped me wasn't his post. It was the contrast with myself — with what I used to get excited about, and what I'm working on now.</p>
<h2 id="heading-what-youtube-doesnt-show-you">What YouTube Doesn't Show You</h2>
<p>Here's what no tutorial, TikTok, or conference talk prepares you for: the grind.</p>
<p>Over the three days of my weekend scaffolding experiment, something broke roughly every few hours. Not metaphorically — literally. You fix the macOS permissions so the agent can read files. Now it needs a gateway restart. Restart it and the model config turns out to be broken — empty model name, nothing works for two hours while you trace why. Fix that, and suddenly the build fails because the SCSS configuration is extending the wrong toolchain — it was written for Gulp, not Heft. Rewrite that, and the Yeoman scaffold generator silently ignores your CLI flags because a <code>.yo-rc.json</code> exists from a previous run. Build a manual template script to bypass it. Now the directories are PascalCase and your pipeline expects kebab-case. Fix that. Now C++ native modules won't compile on Node 22. That's before you even get to the agents looping — repeating the same three broken commands until you harden the loop detection.</p>
<p>None of this is in a tutorial. You can't watch a video for it. You have to live through it — hands on, late at night, no shortcut.</p>
<p>The memory file got patched again and again. Model preferences changed. The sync method moved from a Pi server to Git-based recovery. Configs were rewritten wholesale — <code>config.json</code>, <code>sass.json</code>, <code>tsconfig</code>, ESLint rules, the entire pipeline script. Each fix revealed the next breakage. The pain point just moved — same problem, different file, new and creative way of failing.</p>
<p>This is the unglamorous truth about building agent infrastructure: you're not engineering features. You're engineering resilience. Before an agent can build web parts autonomously, it has to survive the environment. Before you can trust it, it has to break in every possible way. You can't "prompt engineer" your way out of this. It's systems engineering, and it's dirty.</p>
<hr />
<h2 id="heading-two-conversations-same-industry">Two Conversations, Same Industry</h2>
<p>Here's what I keep coming back to: these two things — celebrating markdown support and watching agents build entire applications autonomously — are happening in the same industry, on the same platform, to people with the same job title.</p>
<p>That's not a criticism of anyone. It's a data point about how fast the ground is shifting.</p>
<p>The gap isn't between smart people and slow people. It's between two entirely different models of what software development is becoming. In one model, we're incrementally improving the tools we already know. In the other, the tools are learning to use themselves.</p>
<p>And you can be a genuine expert — someone with years of deep domain knowledge, public recognition, real achievements — and still be standing in the first room while the second one exists a few doors down.</p>
<p>I almost was.</p>
<hr />
<h2 id="heading-how-i-almost-missed-it">How I Almost Missed It</h2>
<p>I'm not telling this story because I saw it coming. I didn't. I backed into it.</p>
<p>I was a SharePoint developer. Not a machine learning engineer. Not an AI researcher. A developer who spent years learning the quirks of SPFx, the SharePoint Framework, because that's what the job demanded.</p>
<p>What changed wasn't my intelligence or foresight. It was a simple question: "What if I stopped prompting AI and started architecting workflows for it?"</p>
<p>That shift — from treating AI as a smart autocomplete to treating it as a team member with a defined role, quality gates, and an audit trail — was the door I walked through. Not because I was clever. Because I was curious, and slightly lazy, and the alternative was writing web part number 112 by hand.</p>
<p>The methodology that emerged — I now call it Works With Agents, but the name doesn't matter — isn't complicated. It's just… different. Different enough that it creates a perception gap. And perception gaps are where the real opportunity lives.</p>
<hr />
<h2 id="heading-what-this-means-for-all-of-us">What This Means (For All of Us)</h2>
<p>Here's the uncomfortable part. The gap isn't closing. It's widening.</p>
<p>The tools are getting better faster than the mental models are updating. By the time the average team lead internalises what Claude or Copilot can do today, the agents will have moved on to something else entirely. We're not in a technology adoption curve. We're in a fragmentation event.</p>
<p>Three things I think are true, as of May 2026:</p>
<p><strong>1. Your technical moat is thinner than you think.</strong> If your competitive advantage is "we build features faster," a research loop — cron job, web search, LLM analysis, agent scaffold — can clone your feature set in a weekend. The moat is moving to compliance, trust, and domain relationships. Things that take months or years, not hours.</p>
<p><strong>2. The bottleneck isn't code generation. It's verification.</strong> When an agent can produce a thousand lines of code in seconds, the hard problem isn't "did it compile?" It's "did it do what I actually needed, safely, and can I prove that to an auditor?" Regulated industries feel this most acutely, but it's coming for everyone.</p>
<p><strong>3. The people who are "behind" aren't stupid. They're in a different room.</strong> And most of us are in rooms we don't know about yet. The question isn't "am I ahead?" It's "what room am I in right now that already looks like markdown support to someone else?"</p>
<hr />
<h2 id="heading-the-real-question">The Real Question</h2>
<p>I don't have a tidy conclusion. The SharePoint MVP was right to be excited. Markdown in SharePoint is progress. But somewhere between his post and my screen, I realised that the measure of progress had fundamentally changed. Not gradually. Suddenly. And not everyone noticed.</p>
<p>So the question I've been sitting with: what room am I in right now, feeling perfectly current, that already looks like markdown support from the outside?</p>
<p>If you've got an answer — or if this made you uncomfortable — I'd genuinely like to hear it.</p>
]]></content:encoded></item></channel></rss>