The team had built an AI assistant that analyzes legal contracts for German SMEs — extracting clauses, flagging risks, and summarizing obligations. The core product worked. But operating it in production was another story.
A prompt change on a Friday afternoon caused a 22% drop in output quality. The team found out three days later — via support tickets from angry customers. By then, hundreds of contracts had been processed with degraded outputs.
A bug in their multi-step agent caused an infinite retry loop. It burned €1,200 in GPT-4o API costs over a single weekend before anyone noticed. There were no alerts, no spend limits, no visibility.
They evaluated LangSmith and Helicone. Both required sending prompts and outputs — containing sensitive contract data — to US servers. Their legal team immediately said no. That left them running in production with no monitoring at all.
The team deployed AgentLens on their own infrastructure — a single Railway instance in the EU region. No data leaves their environment. The integration with their existing OpenAI setup took two lines of code:
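The exact snippet isn't reproduced in this excerpt. As a hedged sketch — assuming AgentLens exposes an OpenAI-compatible proxy endpoint (the base URL, port, and header name below are illustrative, not the documented API) — the two-line change amounts to pointing the existing client at the self-hosted instance:

```python
import os

# Hypothetical integration: route OpenAI traffic through the self-hosted
# AgentLens proxy so every request and response is logged without leaving
# the EU environment. Endpoint and header names are assumptions, not the
# documented AgentLens API.
client_kwargs = {
    "api_key": os.environ.get("OPENAI_API_KEY", "sk-test"),
    # The two added lines:
    "base_url": "http://agentlens.internal:8080/v1",  # self-hosted, EU region
    "default_headers": {"X-AgentLens-Project": "contract-analysis"},
}
# client = OpenAI(**client_kwargs)  # call sites elsewhere stay unchanged
```

Because the change is confined to client construction, the rest of the codebase keeps calling the OpenAI client exactly as before.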
For their multi-step contract analysis agent, they added trace instrumentation to get a full waterfall view of every step:
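The instrumentation code itself isn't shown in this excerpt. The sketch below illustrates the kind of data step-level tracing collects for a waterfall view — one span per step, with name, duration, and cost. The decorator and field names are illustrative, not AgentLens's real API.

```python
import time
from dataclasses import dataclass

@dataclass
class Span:
    name: str          # which step ran
    duration_ms: float # how long it took
    cost_eur: float    # what it cost

class Trace:
    """Collects one Span per instrumented step (illustrative sketch)."""
    def __init__(self, name):
        self.name = name
        self.spans = []

    def step(self, name, cost_eur=0.0):
        def decorator(fn):
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                result = fn(*args, **kwargs)
                elapsed_ms = (time.perf_counter() - start) * 1000
                self.spans.append(Span(name, elapsed_ms, cost_eur))
                return result
            return wrapper
        return decorator

trace = Trace("contract-analysis")

@trace.step("extract_clauses", cost_eur=0.02)
def extract_clauses(text):
    return ["liability clause", "termination clause"]

@trace.step("flag_risks", cost_eur=0.01)
def flag_risks(clauses):
    return [c for c in clauses if "liability" in c]

risks = flag_risks(extract_clauses("…contract text…"))
# trace.spans now holds name, duration, and cost per step —
# the raw data behind a waterfall view.
```

With every step recorded this way, locating the slow or failing span in a multi-step agent is a lookup rather than a log search.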
They also configured a budget alert at €50/day and enabled automatic quality scoring on every response — no manual setup, no evaluation pipeline to maintain.
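Mechanically, a daily budget alert reduces to tracking cumulative spend and firing once when a threshold is crossed. The sketch below shows that logic; the class name, callback shape, and per-call cost are assumptions, not AgentLens's actual configuration surface.

```python
class BudgetGuard:
    """Illustrative daily spend tracker that fires an alert once per breach."""
    def __init__(self, daily_limit_eur, on_breach):
        self.daily_limit_eur = daily_limit_eur
        self.on_breach = on_breach      # callback invoked with current spend
        self.spent_today = 0.0
        self.alerted = False

    def record(self, cost_eur):
        self.spent_today += cost_eur
        if self.spent_today > self.daily_limit_eur and not self.alerted:
            self.alerted = True
            self.on_breach(self.spent_today)

alerts = []
guard = BudgetGuard(daily_limit_eur=50.0, on_breach=alerts.append)

# Simulate a runaway retry loop at an assumed ~€0.09 per GPT-4o call:
for _ in range(600):
    guard.record(0.09)
# The alert fires as soon as cumulative spend crosses €50 —
# minutes into the loop, not after a weekend.
```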
When a later prompt change caused a quality drop, the automatic scoring fired an alert within 87 minutes, and the team rolled back before any customers were affected.
A budget alert at €50/day caught an agent loop within 40 minutes, and the team fixed the retry logic the same day. The same bug had previously gone undetected for an entire weekend and cost €1,200.
The waterfall debugger shows every span — which step ran, how long it took, what it cost, and what it returned. Finding the failing step in a 5-step agent went from log-diving to a single glance.
Because AgentLens runs entirely within their own infrastructure, contract data never leaves their EU environment. The auditor reviewed the setup and closed the audit without a single corrective action.
"We had a prompt change tank quality by 22% on a Friday afternoon. We found out on Monday morning via support tickets. With AgentLens, we would have caught it in under 90 minutes — automatically, without any manual evaluation setup. And since everything runs on our own server, our legal team actually approved it."
20-minute demo. We connect AgentLens to your LLM app and show you live what it finds.
No data leaves your infrastructure. No long-term commitment.
Or start free: github.com/Soufianeazz/agentlens