Technical Documentation

Technical Architecture

From the challenge to the LLM call — the architecture principles of AIMOS.

The Challenge

32 GB VRAM — That Is All You Need

Cloud LLMs operate with huge context windows on specialized server clusters. AIMOS runs on a single graphics card in your office — and through architectural means achieves performance that is not only sufficient for enterprise tasks, but often better than what oversized cloud models deliver.

// Context Window Comparison (Tokens)
Gemini 2.5      1,000,000   Cloud • $$$
Claude 4          200,000   Cloud • $$
GPT-4o            128,000   Cloud • $$
AIMOS — Local, Your GPU, Your Data — TurboQuant KV Compression (ICLR 2026):
  Starter        20K    RTX 4060 Ti 16 GB, Qwen 14B
  Business       33K    RTX 3090 24 GB, Qwen 27B
  Business+      52K    RTX 5090 32 GB + Speculative Decoding + SGLang
  Professional  100K+   2× RTX 3090 NVLink 48 GB / A100 80 GB
TurboQuant: 3-bit KV → 6× more context

Smaller context window than the cloud — but TurboQuant and the architecture compensate for it. And: your data stays with you.

The right model for the right task

Models with 200 billion parameters and 1 million tokens of context are impressive, but often oversized for structured enterprise tasks. At the same time, external customer contact calls for a larger model than internal data queries do. AIMOS scales with the task.

14B — Starter

Simple cases, document categorization, status emails. Runs on an RTX 4060 Ti (16 GB, from €400). With Multi-Pass Self-Refinement it reaches ~80% of 27B quality.

27B — Business

Full AI assistant, FuSa Safety Manager, complex analyses. Precise tool calls (~86% BFCL), 33K context with TurboQuant KV compression. On RTX 3090 (24 GB) or RTX 5090 with Speculative Decoding (~7× faster).

Same Software

Both model sizes run on the same AIMOS platform. An upgrade from 27B to 70B is possible at any time — by swapping hardware, without reconfiguring the agents.

Seven Architecture Principles Instead of Raw Computing Power

AIMOS compensates for the smaller context window not through bigger hardware — but through architecture that ensures the agent has exactly what it needs in context for the current task.

Each of the seven principles is explained in detail below.

Data Flow

System Overview

Messages arrive through various channels, are centrally distributed, and processed by the appropriate agent — on a shared GPU.

INPUTS:          Telegram • E-Mail • Voice • Dashboard
Shared Listener: receives all channels, buffers messages
PostgreSQL:      message queue
Orchestrator:    VRAM Guard • Process Manager — distributes sequentially
Agents:          Finance (Memory • DATEV • ETA)
                 Engineering (Memory • FEM • DXF)
                 Logistics (Memory • SAP • REST)
                 Your Agent (Memory • Your Skills)
GPU:             Local LLM inference — Qwen 3.5:27B • RTX 3090 • 24 GB + TurboQuant

Architecture Principles

Seven Principles for Local AI Performance

Each principle addresses a specific limitation of local operation — together they enable enterprise-grade capability on a single GPU.

1

Hybrid Long-term Memory

Unlimited facts instead of finite context tokens

Each agent has its own memory with two search mechanisms: FTS5 (full-text search) and MiniLM-L6-v2 (384-dimensional vector embeddings). Results are combined via Reciprocal Rank Fusion — relevant memories are found even with imprecise search terms.

Instead of storing 200,000 tokens of history, the agent remembers the relevant facts — and retrieves them instantly with the right query. The number of stored memories is unlimited.

// Hybrid Search in Action
FTS5:  "steel profile supplier" → 12 hits
Vector: "Who supplies beams?" → 8 hits
RRF:   Fusion → Top 20, sorted by relevance
Stored in: SQLite (per agent)
Embedding model: local, no cloud call
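
How the fusion works, as a minimal Python sketch: both searches return ranked lists of memory IDs, and each ID scores 1/(k + rank) per list. The function name and the damping constant k=60 (the common RRF default) are illustrative, not AIMOS internals.

// Reciprocal Rank Fusion (sketch)
def rrf_fuse(fts_hits: list[str], vector_hits: list[str],
             k: int = 60, top_n: int = 20) -> list[str]:
    # Each list contributes 1 / (k + rank); IDs found by both
    # mechanisms accumulate score and rise to the top.
    scores: dict[str, float] = {}
    for hits in (fts_hits, vector_hits):
        for rank, memory_id in enumerate(hits, start=1):
            scores[memory_id] = scores.get(memory_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

A memory that ranks mid-list in both searches thus beats one that tops only a single list, which is why imprecise search terms still surface the right facts.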
2

Dreaming (Memory Consolidation)

Securing knowledge before the context fills up

Trigger

Not time-triggered, but by context pressure: When the conversation history exceeds the threshold (12/18/25 messages, depending on the agent), the Orchestrator starts a dreaming cycle.

Process

The LLM analyzes the history and extracts facts as MEM: lines into long-term memory. Simultaneously, workspace files (notes, to-do lists) are updated via FILE: lines.

Result

The history is then cleared — without information loss. Weekly reports (Phase 5) additionally summarize the overall status.
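
A minimal sketch of the trigger and extraction logic described above. The per-agent threshold mapping, the FILE: line format ("FILE: <path> <content>"), and the callback names are assumptions for illustration, not AIMOS internals.

// Dreaming cycle (sketch)
DREAM_THRESHOLDS = {"finance": 12, "engineering": 18, "logistics": 25}  # assumed mapping

def maybe_dream(agent: str, history: list[str], llm, store_memory, update_file) -> None:
    if len(history) <= DREAM_THRESHOLDS.get(agent, 18):
        return  # no context pressure yet
    dream = llm("\n".join(history))  # the LLM emits MEM: and FILE: lines
    for line in dream.splitlines():
        line = line.strip()
        if line.startswith("MEM:"):
            store_memory(line[4:].strip())            # fact into long-term memory
        elif line.startswith("FILE:"):
            path, _, content = line[5:].strip().partition(" ")
            update_file(path, content)                # refresh workspace file
    history.clear()  # safe to clear: the facts now live in memory and files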

3

Agent-Splitting

Specialists instead of generalists

Instead of overloading one agent with a huge system prompt, AIMOS distributes tasks across multiple specialists with short, focused prompts. Each agent uses only 17–22% of its context window for the system prompt — the rest remains for memory, conversation, and response.

99%   One agent, 11K prompt      → timeout, no space
17%   Specialist A, 1.5K prompt  → 83% free for work
19%   Specialist B, 2.8K prompt  → 81% free for work
4

Context Budget Guard

Automatic token management before every LLM call

// VRAM budget per hardware tier
Starter        RTX 4060 Ti                   14B — 8.5 GB weights       5.5 GB KV           ~20K tok
Business       RTX 3090                      27B — 16 GB weights        5.5 GB KV           ~33K tok (measured)
Business+      RTX 5090 + SGLang             27B + 4B draft = 18.5 GB   11 GB KV (turbo3)   ~88K tok + Spec. Decoding
Professional   2× 3090 NVLink / A100 80 GB   70B — 35 GB weights        11 GB KV            ~22K tok
Legend: model weights (fixed) • KV cache with TurboQuant (3-bit compression) • reserve

Same software, different capacity. Starter: efficient. Business: gold standard. Professional: maximum quality. TurboQuant compresses the KV cache to 3 bits — 6× more context on the same GPU. Speculative Decoding: up to 2.5× faster.

KV cache (Key-Value Cache) = the working memory of the language model during a conversation. It holds the system prompt, memories, conversation history, and the reserved tokens for the response. The more VRAM remains for the KV cache, the longer and deeper conversations are possible.

// Context Window Composition (14,336 Tokens)
Core prompt            ~2,000
Agent prompt           ~400-700
Tools                  ~400-600
Memories               ~500-1,500
Calendar / Projects    (injected blocks)
Conversation history   dynamic (15-35 messages)
Response               ~2,000 reserved
Fixed per agent (17-22%): core prompt • agent prompt • tools
Dynamic: memories • conversation • response
Budget exceeded? Remove oldest messages • truncate tool results to 200 characters • prompt + tools remain complete

The history cap adapts dynamically: agents with short prompts (17%) retain up to 35 messages, agents with long prompts only 15. Before every LLM call, the token sum is checked — if it exceeds the budget, it is automatically trimmed. The agent prompt and tool definitions always remain fully intact.
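
A minimal sketch of that pre-call check, assuming a count_tokens helper and chat messages as dicts. The ~2,000-token response reserve and the 200-character truncation come from the rules above; everything else is illustrative.

// Context Budget Guard (sketch)
RESPONSE_RESERVE = 2000  # tokens kept free for the model's answer

def enforce_budget(prompt_blocks: list[str], history: list[dict],
                   budget: int, count_tokens) -> list[dict]:
    # Fixed parts (core prompt, agent prompt, tool definitions) are never trimmed.
    fixed = sum(count_tokens(block) for block in prompt_blocks)
    # Rule 1: truncate verbose tool results to 200 characters.
    for msg in history:
        if msg.get("role") == "tool" and len(msg["content"]) > 200:
            msg["content"] = msg["content"][:200]
    # Rule 2: drop the oldest messages until the sum fits the budget.
    def used() -> int:
        return fixed + sum(count_tokens(m["content"]) for m in history)
    while history and used() + RESPONSE_RESERVE > budget:
        history.pop(0)
    return history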

5

Structured Context Injection

Maximum information with minimal tokens

Instead of packing calendar, projects, and contacts as free text into the context, AIMOS injects them as compact, structured blocks. The LLM understands these formats with minimal tokens and can react to them immediately.

<calendar>
[OVERDUE] 2026-03-20 Proposal
[TODAY] 15:00 Meeting
</calendar>
<projects>
[OVERDUE] Statics → Mueller
[BLOCKED] Drawing missing
</projects>
<memories>
Company uses DATEV (imp=9)
Boss is Mueller (imp=8)
</memories>
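
A sketch of how such a block could be rendered from structured records. The field names, the [UPCOMING] flag, and the sorting are assumptions; only the [OVERDUE]/[TODAY] tags appear in the example above.

// Rendering a calendar block (sketch)
from datetime import date

def calendar_block(entries: list[dict], today: date) -> str:
    lines = []
    for e in sorted(entries, key=lambda e: e["due"]):  # field names are assumed
        flag = ("OVERDUE" if e["due"] < today
                else "TODAY" if e["due"] == today else "UPCOMING")
        lines.append(f"[{flag}] {e['due'].isoformat()} {e['title']}")
    return "<calendar>\n" + "\n".join(lines) + "\n</calendar>"

Each entry costs only a handful of tokens, so a full week of appointments fits in less context than a single free-text paragraph.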
6

Sequential VRAM Operation

All agents share one GPU, one model

Qwen 3.5:27B (Q4, ~17 GB VRAM)

27-billion-parameter model with native tool calling. Smaller models (<20B) fail at reliable tool control — a production-critical finding from our evaluation.

Orchestrator & VRAM Guard

The Orchestrator detects new messages in the DB queue, spawns the appropriate agent, and ensures that only one agent occupies the GPU at a time. Heartbeat monitoring detects hanging processes (>60s) and frees blocked VRAM.

SGLang & RadixAttention

High-performance LLM runtime with OpenAI-compatible API endpoint. RadixAttention: the prefix cache is shared between agents — agent switching in milliseconds instead of seconds.

Keep-Alive

The model stays in VRAM for 30 minutes. All agents share the same model — no unloading during agent switching. Only after 30 minutes of inactivity is VRAM freed.

// Anatomy of an LLM Request
System prompt + memory → Context Budget Guard (token check) → LLM inference (SGLang API)
→ Tool dispatch (Ring-Check) → Audit log + response (token tracking)
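
A simplified sketch of the sequential gate: one agent process at a time, with a hard 60-second deadline standing in for the real heartbeat (which a healthy agent would keep refreshing while it works). fetch_next_message and run_agent are hypothetical callables for the DB queue and the agent entry point.

// Sequential GPU gate (sketch)
import multiprocessing as mp
import time

HEARTBEAT_TIMEOUT = 60  # seconds, matching the >60 s rule above

def orchestrate(fetch_next_message, run_agent) -> None:
    while True:
        msg = fetch_next_message()                  # blocks on the DB queue
        proc = mp.Process(target=run_agent, args=(msg,))
        proc.start()                                # exactly one agent on the GPU
        deadline = time.monotonic() + HEARTBEAT_TIMEOUT
        while proc.is_alive() and time.monotonic() < deadline:
            time.sleep(1)
        if proc.is_alive():
            proc.terminate()                        # hanging agent: free blocked VRAM
        proc.join()                                  # GPU free, next message may run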
7

Escalation & PII Vault

Automatic fallback for complex tasks

Escalation

If a task exceeds the capabilities of the local 27B model — or a timeout occurs — the agent automatically escalates to a more powerful cloud LLM (e.g., Claude Sonnet). The user notices nothing; they always receive a response.

PII Vault (Anonymization)

Before escalation, the PII Vault automatically anonymizes all personal data: names, phone numbers, email addresses, company names. Only the sanitized query leaves the network; the response is re-personalized locally. Your personal data always stays local.
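
A toy sketch of the round trip: detected entities are swapped for numbered placeholders, only the sanitized text would go to the cloud, and the mapping never leaves the machine. The regexes here cover only emails and phone numbers; names and company names would need proper NER in a real vault.

// PII Vault round trip (sketch)
import re

PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE": r"\+?\d[\d ()/-]{6,}\d",
}

def anonymize(text: str) -> tuple[str, dict[str, str]]:
    vault: dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        for i, hit in enumerate(sorted(set(re.findall(pattern, text))), start=1):
            placeholder = f"<{label}_{i}>"
            vault[placeholder] = hit                 # mapping stays local
            text = text.replace(hit, placeholder)
    return text, vault

def repersonalize(response: str, vault: dict[str, str]) -> str:
    # Restore the original values in the cloud model's answer, locally.
    for placeholder, original in vault.items():
        response = response.replace(placeholder, original)
    return response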