AIMOS

Validation

How does an AI agent become production-ready?

Every human makes mistakes. Every LLM hallucinates. The difference: We can measure how often — and systematically ensure it stays below a defined threshold.

The Problem: AI Agents Can Hallucinate

A language model processing a tax return can fabricate amounts. A safety assistant can hallucinate regulatory references. A support agent can make commitments the company cannot keep.

Most AI providers ignore this problem — or rely on a “works most of the time” approach. We don't.

Typical AI Hallucinations

  • Fabricated amounts (“Refund: €4,782” — never calculated)
  • False regulatory references (“per §35a Para. 7” — does not exist)
  • Mixed client data (data from Client A in email to Client B)
  • False commitments (“I have submitted your return”)
  • Outdated information (2021 allowance instead of 2025)

Our Solution: Systematic Validation

Our process is derived from the validation methodology for autonomous vehicles (ADAS) and functional safety, adapted for AI agents in enterprise environments.

Validation V — from specification to statistical proof:

1. Acceptance Criteria (ISO/TS 5083 Cl. 6.2)
Business case → max. error rate. E.g.: <0.15% at 7,500 operations/year.

2. Agent Design (ISO/PAS 8800 Cl. 9)
Prompt engineering + reference knowledge. Domain expertise in files, not in LLM memory.

3. Safety Measures (ISO 26262 / SOTIF)
Hallucination detection. Deterministic + semantic verification.

At the base of the V sits the AI agent itself (OODA loop, TurboQuant, Qwen 27B).

4. Phase Tests (ISO 21448, SOTIF/FuSi)
Isolated tests per OODA phase. Equivalence classes + boundary values.

5. Monte Carlo Validation (ISO/TS 5083 H.4)
2,000 synthetic scenarios. Statistical proof with confidence interval.

6. Production Monitoring (ISO/TS 5083 Cl. 9)
Every operation is automatically scored. Quarterly error rate reporting.

The test phases feed back iteratively into the design: error → prompt fix → re-test.
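
Steps 4 and 5 turn the agent's parameter space into concrete test cases. A minimal sketch of how equivalence classes and boundary values expand combinatorially; the parameter names and values below are invented for illustration:

    import itertools

    # Invented parameter space for illustration; a real agent's space is richer.
    PARAM_SPACE = {
        "client_language":    ["DE", "EN"],              # equivalence classes
        "documents_complete": [True, False],
        "refund_eur":         [0.00, 0.01, 999_999.99],  # boundary values
        "client_tone":        ["neutral", "angry"],
    }

    # One scenario per combination: 2 * 2 * 3 * 2 = 24 here; a few more
    # dimensions quickly yield thousands of scenarios.
    scenarios = [dict(zip(PARAM_SPACE, combo))
                 for combo in itertools.product(*PARAM_SPACE.values())]
    print(len(scenarios), "scenarios")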

The Effort Behind the Reliability

  • 2,000+ synthetic test scenarios, automatically generated from the agent's parameter space
  • 10,000+ AI-assisted checks: each scenario passes through 15 automated checks
  • Days of validation runtime per agent: multi-day GPU cycles until statistical proof is achieved

For every agent, we run multi-day validation cycles with thousands of test cases on our GPU infrastructure. Each individual test case is evaluated with AI support — deterministically for numbers and facts, semantically for tone and context. Only when the measured error rate falls below the agreed acceptance criteria does the agent go into production. We invest this effort for every single agent.
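
To make the release decision concrete, here is a sketch of the statistical core, using the example numbers from this page (2,000 scenarios, 0.15% acceptance threshold) and a one-sided Clopper-Pearson bound. At this sample size, even a single observed failure pushes the 95% upper bound above the threshold. This is an illustration, not our exact acceptance test:

    from scipy.stats import beta

    def error_rate_upper_bound(failures, trials, confidence=0.95):
        """One-sided Clopper-Pearson upper bound on the true error rate."""
        if failures >= trials:
            return 1.0
        return float(beta.ppf(confidence, failures + 1, trials - failures))

    THRESHOLD = 0.0015  # the example acceptance criterion above: 0.15%

    for failures in (0, 1, 2):
        ub = error_rate_upper_bound(failures, 2000)
        verdict = "release" if ub < THRESHOLD else "iterate: prompt fix, re-test"
        print(f"{failures} failures / 2000 runs -> 95% upper bound {ub:.4%} -> {verdict}")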

Hallucination Detection: Two Layers

Layer 1: Deterministic

Python code, no LLM. 100% reliable, <1 second. (A simplified sketch of two such checks follows the checklist below.)

  • ✓ Every EUR amount in the output is verified against input data
  • ✓ No tool-call artifacts in emails (XML, JSON)
  • ✓ No client data cross-contamination (scope check)
  • ✓ No internal system terms exposed externally
  • ✓ Prompt injection resistance
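
A simplified sketch of two of these checks; the regexes and data model are illustrative, not the production rules:

    import re

    EUR_RE = re.compile(r"€\s?(\d{1,3}(?:\.\d{3})*(?:,\d{2})?)")  # German format
    ARTIFACT_RE = re.compile(r'</?\w+>|\{\s*"')                    # XML tags / JSON

    def unverified_amounts(draft, source_amounts):
        """Every EUR amount in the draft must occur in the input data."""
        return [a for a in EUR_RE.findall(draft) if a not in source_amounts]

    def artifact_free(draft):
        """No tool-call artifacts (XML/JSON fragments) may leak into an email."""
        return ARTIFACT_RE.search(draft) is None

    draft = "Your refund is €4.782,00. <tool_call>submit</tool_call>"
    print(unverified_amounts(draft, {"1.250,00"}))  # ['4.782,00'] -> fabricated
    print(artifact_free(draft))                     # False -> leaked tool markup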

Layer 2: Semantic (LLM-Based)

A separate LLM call at low temperature, calibrated against a gold standard. (An example check is sketched after the checklist.)

  • ✓ Professional tone (even with difficult clients)
  • ✓ Content consistency (no refund without supporting data)
  • ✓ Completeness (missing documents mentioned)
  • ✓ No false promises
  • ✓ Correct language (DE/EN per client)
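
A sketch of one such check, assuming an OpenAI-compatible endpoint (e.g. a local vLLM server); the prompt wording and model name are placeholders:

    from openai import OpenAI  # any OpenAI-compatible server works

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    CHECK_PROMPT = """You are a strict reviewer of draft client emails.

    Case data:
    {case_data}

    Draft:
    {draft}

    Does the draft promise anything the case data does not support
    (e.g. "I have submitted your return")? Answer exactly PASS or FAIL."""

    def check_no_false_promises(draft, case_data):
        resp = client.chat.completions.create(
            model="qwen-27b",   # placeholder model name
            temperature=0.0,    # low temperature: reproducible verdicts
            messages=[{"role": "user",
                       "content": CHECK_PROMPT.format(case_data=case_data,
                                                      draft=draft)}],
        )
        return resp.choices[0].message.content.strip().upper() == "PASS"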

Calibration: Each check prompt is calibrated against hand-curated gold drafts (known good + known bad). Precision, recall, and F1 score are measured. Only checks with F1 > 0.9 are deployed. The details of our calibration methodology are proprietary.
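
The gate itself is straightforward to state, even though the calibration methodology is proprietary. An illustrative version, where check(draft) is a semantic check like the one sketched above:

    def f1_gate(check, gold, threshold=0.9):
        """Gate a semantic check on its F1 score over hand-labelled drafts.

        check(draft) -> bool returns True if the draft passes.
        gold is a list of (draft, is_good) pairs. A "positive" is a
        flagged draft, so we measure how well the check catches known-bad
        drafts without flagging known-good ones.
        """
        tp = fp = fn = 0
        for draft, is_good in gold:
            flagged = not check(draft)
            if flagged and not is_good:
                tp += 1          # correctly caught a bad draft
            elif flagged and is_good:
                fp += 1          # false alarm on a good draft
            elif not flagged and not is_good:
                fn += 1          # missed a bad draft
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return f1 > threshold    # deploy only if F1 > 0.9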

Research in Production

TurboQuant

KV cache compression to 3-bit (ICLR 2026). 6× more context on the same GPU. Zero accuracy loss.

Speculative Decoding

Small draft model generates, large model validates. 2.5× faster inference at identical quality.
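
A toy greedy variant showing the mechanism; production speculative decoding uses rejection sampling over token distributions and verifies all draft tokens in a single batched forward pass:

    def speculative_decode(draft_model, target_model, prompt, k=4, max_len=32):
        """Toy greedy variant of speculative decoding.

        draft_model / target_model map a token list to the next token.
        The output is identical to greedy decoding with target_model alone;
        the speed-up comes from checking k cheap draft tokens per round.
        """
        tokens = list(prompt)
        while len(tokens) < max_len:
            # 1. The small draft model proposes k tokens autoregressively.
            ctx = list(tokens)
            proposal = []
            for _ in range(k):
                nxt = draft_model(ctx)
                proposal.append(nxt)
                ctx.append(nxt)
            # 2. The large model verifies: keep the longest agreeing prefix,
            #    substituting its own token at the first disagreement.
            for tok in proposal:
                expected = target_model(tokens)
                tokens.append(expected)
                if expected != tok or len(tokens) >= max_len:
                    break
        return tokens

    # Tiny demo with toy "models" over integer tokens:
    target = lambda ts: sum(ts) % 7
    draft = lambda ts: sum(ts) % 7 if len(ts) % 3 else 0  # mostly agrees
    print(speculative_decode(draft, target, [1, 2], k=4, max_len=12))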

Claim Decomposition

Every statement by the agent is broken down into atomic claims and verified against source data. Based on FActScore and Chain-of-Verification (Meta 2023).
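
A sketch of the verification half; in practice the decomposition into atomic claims is itself an LLM step, and the Claim fields and example data below are invented for illustration:

    from dataclasses import dataclass

    @dataclass
    class Claim:
        text: str    # the atomic statement as written in the draft
        key: str     # which source-data field it asserts something about
        value: str   # the value it asserts

    def unsupported_claims(claims, source):
        """Return every claim the source data does not back up."""
        return [c for c in claims if source.get(c.key) != c.value]

    source = {"refund_eur": "1.250,00", "tax_year": "2025"}
    claims = [
        Claim("The refund amounts to €4.782,00.", "refund_eur", "4.782,00"),
        Claim("This concerns tax year 2025.", "tax_year", "2025"),
    ]
    for c in unsupported_claims(claims, source):
        print("UNSUPPORTED:", c.text)   # flags the fabricated refund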

Reference Standards

Our methodology is based on standards developed for autonomous driving and ADAS systems. To our knowledge, we are the first to systematically apply this methodology to AI agents in enterprise environments.

ISO/TS 5083:2025 — Safety for Automated Driving Systems: Design, Verification and Validation

Acceptance criteria, Monte Carlo, scenario generation, V&V process

ISO/PAS 8800:2024 — Road Vehicles: Safety and Artificial Intelligence

AI safety requirements, input space refinement, output insufficiencies

ISO 21448 (SOTIF) — Safety of the Intended Functionality

Validation of the intended functionality, residual risk, triggering conditions

ISO 26262 / Automotive SPICE — Functional Safety + Process Quality

HAZOP, FMEA, deterministic safety measures, process maturity

Interested in the Details?

The complete validation methodology is part of our consulting services. We'd be happy to show you in a personal meeting how we set up your AI agent so you can rely on it with confidence.

Get in Touch