Technical Documentation
From the challenge to the LLM call — the architecture principles of AIMOS.
The Challenge
Cloud LLMs operate with huge context windows on specialized server clusters. AIMOS runs on a single graphics card in your office, and through its architecture it achieves performance that is not only sufficient for enterprise tasks but often delivers better results than oversized cloud models.
Models with 200 billion parameters and 1 million tokens of context are impressive, but often oversized for structured enterprise tasks. Conversely, external customer contact requires a larger model than internal data queries do. AIMOS scales with the task.
Simple cases: document categorization, status emails. Runs on an RTX 4060 Ti (16 GB, from €400). With Multi-Pass Self-Refinement, roughly 80% of 27B quality.
Full AI assistant, FuSa Safety Manager, complex analyses. Precise tool calls (~86% BFCL), 33K context with TurboQuant KV compression. Runs on an RTX 3090 (24 GB) or an RTX 5090 with Speculative Decoding (~7× faster).
Both model sizes run on the same AIMOS platform. An upgrade from 27B to 70B is possible at any time — by swapping hardware, without reconfiguring the agents.
AIMOS compensates for the smaller context window not with bigger hardware, but with an architecture that ensures the agent has exactly what it needs in context for the current task.
Seven architecture principles make this possible; each is explained in detail on this page.
Data Flow
Messages arrive through various channels, are centrally distributed, and processed by the appropriate agent — on a shared GPU.
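For illustration, a minimal sketch of such a DB queue, assuming a SQLite-backed inbox table. The table name, columns, and channel labels are hypothetical, not the actual AIMOS schema.

```python
# Hypothetical message queue for the data flow described above.
import sqlite3

conn = sqlite3.connect("aimos.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS message_queue (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        channel    TEXT NOT NULL,           -- e.g. 'email', 'chat', 'api'
        agent      TEXT NOT NULL,           -- target specialist agent
        payload    TEXT NOT NULL,           -- raw message content
        status     TEXT DEFAULT 'pending',  -- pending -> processing -> done
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

def enqueue(channel: str, agent: str, payload: str) -> None:
    """Drop an incoming message into the queue; the Orchestrator picks it up."""
    conn.execute(
        "INSERT INTO message_queue (channel, agent, payload) VALUES (?, ?, ?)",
        (channel, agent, payload),
    )
    conn.commit()
```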
Architecture Principles
Each principle addresses a specific limitation of local operation — together they enable enterprise-grade capability on a single GPU.
Unlimited facts instead of finite context tokens
Each agent has its own memory with two search mechanisms: FTS5 (full-text search) and MiniLM-L6-v2 (384-dimensional vector embeddings). Results are combined via Reciprocal Rank Fusion — relevant memories are found even with imprecise search terms.
Instead of storing 200,000 tokens of history, the agent remembers the relevant facts — and retrieves them instantly with the right query. The number of stored memories is unlimited.
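As an illustration of how the two mechanisms can work together, a minimal Python sketch follows. Only FTS5, MiniLM-L6-v2, and Reciprocal Rank Fusion come from the description above; the table name memories_fts, the embedding storage, and the function signatures are assumptions.

```python
# Hybrid memory lookup: keyword search (FTS5) plus semantic search (MiniLM),
# merged with Reciprocal Rank Fusion.
import sqlite3
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim
conn = sqlite3.connect("aimos.db")

def fts_search(query: str, k: int = 20) -> list[int]:
    """Full-text search via FTS5, memory ids ranked by BM25 (best match first)."""
    rows = conn.execute(
        "SELECT rowid FROM memories_fts WHERE memories_fts MATCH ? "
        "ORDER BY bm25(memories_fts) LIMIT ?",
        (query, k),
    ).fetchall()
    return [r[0] for r in rows]

def vector_search(query: str, ids: np.ndarray, embs: np.ndarray, k: int = 20) -> list[int]:
    """Semantic search: cosine similarity against pre-normalized MiniLM embeddings."""
    q = model.encode(query, normalize_embeddings=True)
    sims = embs @ q
    return [int(ids[i]) for i in np.argsort(-sims)[:k]]

def reciprocal_rank_fusion(rankings: list[list[int]], c: int = 60) -> list[int]:
    """Combine rankings: score(id) = sum over lists of 1 / (c + rank)."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, mem_id in enumerate(ranking, start=1):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only looks at ranks, not raw scores, a memory that appears in both lists rises to the top even when the keyword match or the embedding similarity alone is weak.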
Securing knowledge before the context fills up
The trigger is not time-based but context pressure: when the conversation history exceeds the threshold (12/18/25 messages, depending on the agent), the Orchestrator starts a dreaming cycle.
The LLM analyzes the history and extracts facts as MEM: lines into long-term memory. Simultaneously, workspace files (notes, to-do lists) are updated via FILE: lines.
Then the history is cleared — without information loss. Weekly reports (Phase 5) additionally summarize the status every 7 days.
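A sketch of what such a dreaming cycle can look like. The MEM:/FILE: line format and the threshold logic follow the description above; the prompt wording and the helper callables are illustrative.

```python
from pathlib import Path
from typing import Callable

# Illustrative extraction prompt; the actual wording is not documented here.
DREAM_PROMPT = (
    "Review the conversation below. Emit one 'MEM: <fact>' line per fact worth "
    "keeping long-term and one 'FILE: <name> | <content>' line per workspace "
    "file (notes, to-do lists) that should be updated.\n\n{history}"
)

def dreaming_cycle(
    history: list[str],
    threshold: int,                      # 12/18/25 depending on the agent
    llm: Callable[[str], str],           # any chat-completion wrapper
    remember: Callable[[str], None],     # writes one fact into long-term memory
    workspace: Path,                     # the agent's workspace directory
) -> list[str]:
    """Secure knowledge under context pressure, then clear the history."""
    if len(history) <= threshold:
        return history                   # no context pressure yet

    reply = llm(DREAM_PROMPT.format(history="\n".join(history)))
    for line in reply.splitlines():
        if line.startswith("MEM:"):
            remember(line[4:].strip())
        elif line.startswith("FILE:"):
            name, _, content = line[5:].partition("|")
            (workspace / name.strip()).write_text(content.strip())

    return []                            # history cleared without information loss
```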
Specialists instead of generalists
Instead of overloading one agent with a huge system prompt, AIMOS distributes tasks across multiple specialists with short, focused prompts. Each agent uses only 17–22% of its context window for the system prompt — the rest remains for memory, conversation, and response.
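For illustration, a hypothetical agent registry. The agent names and prompts are invented; the dreaming thresholds and history caps mirror the values mentioned on this page.

```python
# Hypothetical registry of specialist agents, each with a short, focused prompt.
AGENTS = {
    "mail_triage": {
        "system_prompt": "You categorize incoming documents and draft status emails.",
        "dream_threshold": 12,   # messages before a dreaming cycle starts
        "history_cap": 35,       # short prompt -> more room for conversation
    },
    "fusa_safety_manager": {
        "system_prompt": "You manage functional-safety work products and audits.",
        "dream_threshold": 25,
        "history_cap": 15,       # long prompt -> tighter history cap
    },
}
```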
Automatic token management before every LLM call
KV cache (Key-Value Cache) = the working memory of the language model during a conversation. It holds the system prompt, memories, conversation history, and the reserved tokens for the response. The more VRAM remains for the KV cache, the longer and deeper conversations are possible.
The history cap adapts dynamically: agents with short prompts (17%) retain up to 35 messages, agents with long prompts only 15. Before every LLM call, the token sum is checked; if it exceeds the budget, the history is automatically trimmed. The agent prompt and tool definitions always remain fully intact.
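A minimal sketch of such a pre-call check, assuming a generic token counter. The budget numbers are illustrative; the rule that the agent prompt and tool definitions are never trimmed follows the description above.

```python
from typing import Callable

def fit_to_budget(
    system_prompt: str,
    tool_defs: str,
    history: list[str],
    count_tokens: Callable[[str], int],   # e.g. a tokenizer wrapper
    context_window: int = 33_000,         # 33K context (see above)
    response_reserve: int = 2_000,        # tokens kept free for the answer
) -> list[str]:
    """Trim only the oldest history messages until the token sum fits the budget."""
    fixed = count_tokens(system_prompt) + count_tokens(tool_defs)
    budget = context_window - response_reserve - fixed

    trimmed = list(history)
    while trimmed and sum(count_tokens(m) for m in trimmed) > budget:
        trimmed.pop(0)                    # drop the oldest message first
    return trimmed
```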
Maximum information with minimal tokens
Instead of packing calendar, projects, and contacts as free text into the context, AIMOS injects them as compact, structured blocks. The LLM understands these formats with minimal tokens and can react to them immediately.
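For illustration, one possible way to render calendar data as a compact block. The field layout and the [CALENDAR] marker are assumptions, not the actual AIMOS injection format.

```python
from datetime import date

def calendar_block(events: list[dict]) -> str:
    """Render upcoming events as a dense, line-oriented block the LLM parses cheaply."""
    lines = ["[CALENDAR]"]
    for e in events:
        lines.append(f"{e['date']} {e['time']} | {e['title']} | {e['location']}")
    return "\n".join(lines)

print(calendar_block([
    {"date": date(2025, 3, 14).isoformat(), "time": "09:00",
     "title": "FuSa audit kickoff", "location": "Room B2"},
]))
```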
All agents share one GPU, one model
32-billion-parameter model with native tool calling. Smaller models (<20B) fail at reliable tool control — a production-critical finding from our evaluation.
The Orchestrator detects new messages in the DB queue, spawns the appropriate agent, and ensures that only one agent occupies the GPU at a time. Heartbeat monitoring detects hanging processes (>60s) and frees blocked VRAM.
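A sketch of such an orchestration loop, reusing the hypothetical message_queue table from the data-flow example. The heartbeat-file mechanism and the spawn command are assumptions; the single-agent lock and the 60-second limit follow the description above.

```python
import sqlite3, subprocess, time
from pathlib import Path

HEARTBEAT_TIMEOUT = 60                          # seconds; a stale heartbeat means a hanging agent
HEARTBEAT_FILE = Path("/tmp/aimos_heartbeat")   # assumed: agents touch this file while working

def run_agent(name: str, payload: str) -> None:
    """Spawn one agent process and watch its heartbeat; kill it if it hangs."""
    proc = subprocess.Popen(["python", "run_agent.py", name, payload])
    while proc.poll() is None:
        time.sleep(5)
        age = time.time() - HEARTBEAT_FILE.stat().st_mtime if HEARTBEAT_FILE.exists() else 0
        if age > HEARTBEAT_TIMEOUT:
            proc.kill()                          # frees the blocked VRAM
            break

def orchestrate(db_path: str = "aimos.db") -> None:
    """Poll the DB queue and ensure only one agent occupies the GPU at a time."""
    conn = sqlite3.connect(db_path)
    while True:
        row = conn.execute(
            "SELECT id, agent, payload FROM message_queue "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            time.sleep(1)
            continue
        msg_id, agent, payload = row
        conn.execute("UPDATE message_queue SET status = 'processing' WHERE id = ?", (msg_id,))
        conn.commit()
        run_agent(agent, payload)                # blocking call -> single-GPU occupancy
        conn.execute("UPDATE message_queue SET status = 'done' WHERE id = ?", (msg_id,))
        conn.commit()
```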
High-performance LLM runtime with OpenAI-compatible API endpoint. RadixAttention: the prefix cache is shared between agents — agent switching in milliseconds instead of seconds.
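Because the endpoint is OpenAI-compatible, agents can talk to it with a standard client. The port, model name, and prompts below are assumptions.

```python
from openai import OpenAI

# Local runtime exposing an OpenAI-compatible endpoint (port is an assumption).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="local")

response = client.chat.completions.create(
    model="local-32b",                    # whichever model the runtime serves
    messages=[
        {"role": "system", "content": "You are the FuSa Safety Manager agent."},
        {"role": "user", "content": "List the open work products for project X."},
    ],
)
print(response.choices[0].message.content)
```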
The model stays in VRAM for 30 minutes. All agents share the same model — no unloading during agent switching. Only after 30 minutes of inactivity is VRAM freed.
Automatic fallback for complex tasks
If a task exceeds the capabilities of the local 27B model — or a timeout occurs — the agent automatically escalates to a more powerful cloud LLM (e.g., Claude Sonnet). The user notices nothing; they always receive a response.
Before escalation, the PII Vault automatically anonymizes all personal data: names, phone numbers, email addresses, company names. Only the sanitized query leaves the network. The response is re-personalized locally. Your data always stays local.
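A minimal sketch of the anonymize-and-restore round trip. The placeholder scheme and the explicit list of PII values are simplified assumptions, not the actual PII Vault implementation (which detects personal data automatically).

```python
class PIIVault:
    """Replace personal data with placeholders before escalation, restore after."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}

    def anonymize(self, text: str, pii_values: list[str]) -> str:
        for i, value in enumerate(pii_values):
            placeholder = f"<PII_{i}>"
            self.mapping[placeholder] = value
            text = text.replace(value, placeholder)
        return text

    def restore(self, text: str) -> str:
        for placeholder, value in self.mapping.items():
            text = text.replace(placeholder, value)
        return text

vault = PIIVault()
query = "Draft a reply to Maria Schmidt (maria@example.com) about the audit."
sanitized = vault.anonymize(query, ["Maria Schmidt", "maria@example.com"])
# Only `sanitized` leaves the network; the cloud reply comes back with placeholders.
cloud_reply = "Dear <PII_0>, the audit date is confirmed."   # stand-in for the cloud call
print(vault.restore(cloud_reply))                            # re-personalized locally
```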