You shipped an LLM agent. It worked great in the demo. Now it’s running 24/7 and you have new problems your web-app playbook doesn’t cover: an agent that loops on a flaky tool and burns 50,000 tokens on one prompt, a model that hallucinates confident JSON and silently fails, a prompt tweak that ships in seconds and craters tool-call accuracy by 30%. Your APM dashboard stays green the whole time because none of it is a 500.
Web-app SLOs (uptime, p95 latency, Http5xx) are necessary but not sufficient. Agents need agent-shaped SLOs — task success rate, $ per task, tool success rate, repair-retry rate — and middleware that can act on them before the bill arrives.
This post walks through that middleware end-to-end on a deployable App Service sample: an OpenTelemetry-instrumented agent loop on Azure OpenAI (with managed identity, no keys), eleven custom metrics flowing into App Insights, a per-tenant budget circuit breaker that downshifts gpt-4o to gpt-4o-mini at 80% spend (a 16× cost reduction), retry-with-prompt-repair for malformed tool calls, a chaos CLI for practicing failures, and the bit that makes the whole thing self-healing — a Logic App “healer” that calls slotsswap automatically when a metric alert fires on Http5xx > 5 in 5 minutes. From alert-fire to rolled-back-slot is about 4 minutes, with zero application code in the healer.
👉 Read the full article on Tech Community and grab the sample repo on GitHub.
Comments