Technical Documentation
From the challenge to the LLM call — the architecture principles of AIMOS.
The Challenge
Cloud LLMs operate with huge context windows on specialized server clusters. AIMOS runs on a single graphics card in your office, and through its architecture it achieves performance that is not only sufficient for enterprise tasks but often delivers better results than oversized cloud models.
Models with 200 billion parameters and 1 million tokens of context are impressive, but often oversized for structured enterprise tasks. Conversely, external customer contact requires a larger model than internal data queries do. AIMOS scales with the task.
Simple cases: document categorization, status emails. Runs on an RTX 4060 Ti (16 GB, from €400). With Multi-Pass Self-Refinement, roughly 80% of 27B quality.
Full AI assistant, FuSa Safety Manager, complex analyses. Precise tool calls (~86% BFCL), 33K context with TurboQuant KV compression. Runs on an RTX 3090 (24 GB) or an RTX 5090 with Speculative Decoding (~7× faster).
Both model sizes run on the same AIMOS platform. An upgrade from 27B to 70B is possible at any time — by swapping hardware, without reconfiguring the agents.
AIMOS compensates for the smaller context window not with bigger hardware, but with an architecture that ensures the agent has exactly what it needs in context for the current task.
Seven architecture principles make this possible; each is explained in detail on this page.
Data Flow
Messages arrive through various channels, are centrally distributed, and processed by the appropriate agent — on a shared GPU.
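For illustration, a minimal sketch of such a DB queue, assuming a SQLite-backed inbox table. The table name, columns, and channel labels are hypothetical, not the actual AIMOS schema.

```python
# Hypothetical message queue for the data flow described above.
import sqlite3

conn = sqlite3.connect("aimos.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS message_queue (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        channel    TEXT NOT NULL,           -- e.g. 'email', 'chat', 'api'
        agent      TEXT NOT NULL,           -- target specialist agent
        payload    TEXT NOT NULL,           -- raw message content
        status     TEXT DEFAULT 'pending',  -- pending -> processing -> done
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

def enqueue(channel: str, agent: str, payload: str) -> None:
    """Drop an incoming message into the queue; the Orchestrator picks it up."""
    conn.execute(
        "INSERT INTO message_queue (channel, agent, payload) VALUES (?, ?, ?)",
        (channel, agent, payload),
    )
    conn.commit()
```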
Architecture Principles
Each principle addresses a specific limitation of local operation — together they enable enterprise-grade capability on a single GPU.
Unlimited facts instead of finite context tokens
Each agent has its own memory with two search mechanisms: FTS5 (full-text search) and MiniLM-L6-v2 (384-dimensional vector embeddings). Results are combined via Reciprocal Rank Fusion — relevant memories are found even with imprecise search terms.
Instead of storing 200,000 tokens of history, the agent remembers the relevant facts — and retrieves them instantly with the right query. The number of stored memories is unlimited.
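As an illustration of how the two mechanisms can work together, a minimal Python sketch follows. Only FTS5, MiniLM-L6-v2, and Reciprocal Rank Fusion come from the description above; the table name memories_fts, the embedding storage, and the function signatures are assumptions.

```python
# Hybrid memory lookup: keyword search (FTS5) plus semantic search (MiniLM),
# merged with Reciprocal Rank Fusion.
import sqlite3
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim
conn = sqlite3.connect("aimos.db")

def fts_search(query: str, k: int = 20) -> list[int]:
    """Full-text search via FTS5, memory ids ranked by BM25 (best match first)."""
    rows = conn.execute(
        "SELECT rowid FROM memories_fts WHERE memories_fts MATCH ? "
        "ORDER BY bm25(memories_fts) LIMIT ?",
        (query, k),
    ).fetchall()
    return [r[0] for r in rows]

def vector_search(query: str, ids: np.ndarray, embs: np.ndarray, k: int = 20) -> list[int]:
    """Semantic search: cosine similarity against pre-normalized MiniLM embeddings."""
    q = model.encode(query, normalize_embeddings=True)
    sims = embs @ q
    return [int(ids[i]) for i in np.argsort(-sims)[:k]]

def reciprocal_rank_fusion(rankings: list[list[int]], c: int = 60) -> list[int]:
    """Combine rankings: score(id) = sum over lists of 1 / (c + rank)."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only looks at ranks, not raw scores, a memory that appears in both lists rises to the top even when the keyword match or the embedding similarity alone is weak.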
Securing knowledge before the context fills up
The trigger is not time-based but context pressure: when the conversation history exceeds the threshold (12/18/25 messages, depending on the agent), the Orchestrator starts a dreaming cycle.
The LLM analyzes the history and extracts facts as MEM: lines into long-term memory. Simultaneously, workspace files (notes, to-do lists) are updated via FILE: lines.
Then the history is cleared — without information loss. Weekly reports (Phase 5) additionally summarize the status every 7 days.
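A sketch of what such a dreaming cycle can look like. The MEM:/FILE: line format and the threshold logic follow the description above; the prompt wording and the helper callables are illustrative.

```python
from pathlib import Path
from typing import Callable

# Illustrative extraction prompt; the actual wording is not documented here.
DREAM_PROMPT = (
    "Review the conversation below. Emit one 'MEM: <fact>' line per fact worth "
    "keeping long-term and one 'FILE: <name> | <content>' line per workspace "
    "file (notes, to-do lists) that should be updated.\n\n{history}"
)

def dreaming_cycle(
    history: list[str],
    threshold: int,                      # 12/18/25 depending on the agent
    llm: Callable[[str], str],           # any chat-completion wrapper
    remember: Callable[[str], None],     # writes one fact into long-term memory
    workspace: Path,                     # the agent's workspace directory
) -> list[str]:
    """Secure knowledge under context pressure, then clear the history."""
    if len(history) <= threshold:
        return history                   # no context pressure yet

    reply = llm(DREAM_PROMPT.format(history="\n".join(history)))
    for line in reply.splitlines():
        if line.startswith("MEM:"):
            remember(line[4:].strip())
        elif line.startswith("FILE:"):
            name, _, content = line[5:].partition("|")
            (workspace / name.strip()).write_text(content.strip())

    return []                            # history cleared without information loss
```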
Specialists instead of generalists
Instead of overloading one agent with a huge system prompt, AIMOS distributes tasks across multiple specialists with short, focused prompts. Each agent uses only 17–22% of its context window for the system prompt — the rest remains for memory, conversation, and response.
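For illustration, a hypothetical agent registry. The agent names and prompts are invented; the dreaming thresholds and history caps mirror the values mentioned on this page.

```python
# Hypothetical registry of specialist agents, each with a short, focused prompt.
AGENTS = {
    "mail_triage": {
        "system_prompt": "You categorize incoming documents and draft status emails.",
        "dream_threshold": 12,   # messages before a dreaming cycle starts
        "history_cap": 35,       # short prompt -> more room for conversation
    },
    "fusa_safety_manager": {
        "system_prompt": "You manage functional-safety work products and audits.",
        "dream_threshold": 25,
        "history_cap": 15,       # long prompt -> tighter history cap
    },
}
```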
Automatic token management before every LLM call
KV cache (Key-Value Cache) = the working memory of the language model during a conversation. It holds the system prompt, memories, conversation history, and the reserved tokens for the response. The more VRAM remains for the KV cache, the longer and deeper conversations are possible.
The history cap adapts dynamically: agents with short prompts (17%) retain up to 35 messages, agents with long prompts only 15. Before every LLM call, the token sum is checked; if it exceeds the budget, the history is automatically trimmed. The agent prompt and tool definitions always remain fully intact.
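A minimal sketch of such a pre-call check, assuming a generic token counter. The budget numbers are illustrative; the rule that the agent prompt and tool definitions are never trimmed follows the description above.

```python
from typing import Callable

def fit_to_budget(
    system_prompt: str,
    tool_defs: str,
    history: list[str],
    count_tokens: Callable[[str], int],   # e.g. a tokenizer wrapper
    context_window: int = 33_000,         # 33K context (see above)
    response_reserve: int = 2_000,        # tokens kept free for the answer
) -> list[str]:
    """Trim only the oldest history messages until the token sum fits the budget."""
    fixed = count_tokens(system_prompt) + count_tokens(tool_defs)
    budget = context_window - response_reserve - fixed

    trimmed = list(history)
    while trimmed and sum(count_tokens(m) for m in trimmed) > budget:
        trimmed.pop(0)                    # drop the oldest message first
    return trimmed
```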
Maximum information with minimal tokens
Instead of packing calendar, projects, and contacts as free text into the context, AIMOS injects them as compact, structured blocks. The LLM understands these formats with minimal tokens and can react to them immediately.
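For illustration, one possible way to render calendar data as a compact block. The field layout and the [CALENDAR] marker are assumptions, not the actual AIMOS injection format.

```python
from datetime import date

def calendar_block(events: list[dict]) -> str:
    """Render upcoming events as a dense, line-oriented block the LLM parses cheaply."""
    lines = ["[CALENDAR]"]
    for e in events:
        lines.append(f"{e['date']} {e['time']} | {e['title']} | {e['location']}")
    return "\n".join(lines)

print(calendar_block([
    {"date": date(2025, 3, 14).isoformat(), "time": "09:00",
     "title": "FuSa audit kickoff", "location": "Room B2"},
]))
```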
All agents share one GPU, one model
32-billion-parameter model with native tool calling. Smaller models (<20B) fail at reliable tool control — a production-critical finding from our evaluation.
The Orchestrator detects new messages in the DB queue, spawns the appropriate agent, and ensures that only one agent occupies the GPU at a time. Heartbeat monitoring detects hanging processes (>60s) and frees blocked VRAM.
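A sketch of such an orchestration loop, reusing the hypothetical message_queue table from the data-flow example. The heartbeat-file mechanism and the spawn command are assumptions; the single-agent lock and the 60-second limit follow the description above.

```python
import sqlite3, subprocess, time
from pathlib import Path

HEARTBEAT_TIMEOUT = 60                          # seconds; a stale heartbeat means a hanging agent
HEARTBEAT_FILE = Path("/tmp/aimos_heartbeat")   # assumed: agents touch this file while working

def run_agent(name: str, payload: str) -> None:
    """Spawn one agent process and watch its heartbeat; kill it if it hangs."""
    proc = subprocess.Popen(["python", "run_agent.py", name, payload])
    while proc.poll() is None:
        time.sleep(5)
        age = time.time() - HEARTBEAT_FILE.stat().st_mtime if HEARTBEAT_FILE.exists() else 0
        if age > HEARTBEAT_TIMEOUT:
            proc.kill()                          # frees the blocked VRAM
            break

def orchestrate(db_path: str = "aimos.db") -> None:
    """Poll the DB queue and ensure only one agent occupies the GPU at a time."""
    conn = sqlite3.connect(db_path)
    while True:
        row = conn.execute(
            "SELECT id, agent, payload FROM message_queue "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            time.sleep(1)
            continue
        msg_id, agent, payload = row
        conn.execute("UPDATE message_queue SET status = 'processing' WHERE id = ?", (msg_id,))
        conn.commit()
        run_agent(agent, payload)                # blocking call -> single-GPU occupancy
        conn.execute("UPDATE message_queue SET status = 'done' WHERE id = ?", (msg_id,))
        conn.commit()
```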
High-performance LLM runtime with OpenAI-compatible API endpoint. RadixAttention: the prefix cache is shared between agents — agent switching in milliseconds instead of seconds.
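Because the endpoint is OpenAI-compatible, agents can talk to it with a standard client. The port, model name, and prompts below are assumptions.

```python
from openai import OpenAI

# Local runtime exposing an OpenAI-compatible endpoint (port is an assumption).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="local")

response = client.chat.completions.create(
    model="local-32b",                    # whichever model the runtime serves
    messages=[
        {"role": "system", "content": "You are the FuSa Safety Manager agent."},
        {"role": "user", "content": "List the open work products for project X."},
    ],
)
print(response.choices[0].message.content)
```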
The model stays in VRAM for 30 minutes. All agents share the same model — no unloading during agent switching. Only after 30 minutes of inactivity is VRAM freed.
Automatic fallback for complex tasks
If a task exceeds the capabilities of the local 27B model — or a timeout occurs — the agent automatically escalates to a more powerful cloud LLM (e.g., Claude Sonnet). The user notices nothing; they always receive a response.
Before escalation, the PII Vault automatically anonymizes all personal data: names, phone numbers, email addresses, company names. Only the sanitized query leaves the network. The response is re-personalized locally. Your data always stays local.
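A minimal sketch of the anonymize-and-restore round trip. The placeholder scheme and the explicit list of PII values are simplified assumptions, not the actual PII Vault implementation (which detects personal data automatically).

```python
class PIIVault:
    """Replace personal data with placeholders before escalation, restore after."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}

    def anonymize(self, text: str, pii_values: list[str]) -> str:
        for i, value in enumerate(pii_values):
            placeholder = f"<PII_{i}>"
            self.mapping[placeholder] = value
            text = text.replace(value, placeholder)
        return text

    def restore(self, text: str) -> str:
        for placeholder, value in self.mapping.items():
            text = text.replace(placeholder, value)
        return text

vault = PIIVault()
query = "Draft a reply to Maria Schmidt (maria@example.com) about the audit."
sanitized = vault.anonymize(query, ["Maria Schmidt", "maria@example.com"])
# Only `sanitized` leaves the network; the cloud reply comes back with placeholders.
cloud_reply = "Dear <PII_0>, the audit date is confirmed."   # stand-in for the cloud call
print(vault.restore(cloud_reply))                            # re-personalized locally
```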