Technical Documentation

Technical Architecture

From SovereignNode to LLM call — this is how AIMOS works under the hood.

Stack Diagram

System Overview

The complete data flow from user message to response — all layers at a glance.

[Diagram: User Channels (Telegram, E-Mail, Voice STT/TTS, Dashboard) → Shared Listener → PostgreSQL (Message Relay) → Orchestrator (VRAM Guard, Process Manager) → Agent Process (Memory + Skills + Prompt Builder) → LLM (Local Inference) → Response Path. Layers: Kernel, Database, Orchestration, Inference.]

Inference

Local AI Inference

Local inference via SGLang. Sequential operation. Intelligent VRAM management.

Qwen 3.5:27B (Q4, ~17 GB VRAM)

27-billion parameter model with native tool-calling. Smaller models (<20B) fail at reliable tool control — a production-critical finding from our evaluation.

SGLang Runtime

High-performance LLM runtime with OpenAI-compatible API endpoint. RadixAttention: Prefix cache is shared between agents — no reloading on agent switch.

Sequential Operation

The VRAM Guard ensures that only one agent accesses the GPU at a time. Requests are held in the database queue and processed sequentially — no OOM, no VRAM conflict.
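The queueing rule above can be sketched as a toy model: one GPU slot, strict FIFO. In AIMOS the queue lives in PostgreSQL and spans processes; here it is in-process only, and the class name is illustrative:

```python
from queue import Queue

class VRAMGuard:
    """Toy model of the VRAM Guard: one GPU slot, FIFO order.
    The real queue is database-backed; this sketch is in-process."""

    def __init__(self):
        self.queue = Queue()
        self.busy = False

    def submit(self, request):
        self.queue.put(request)  # held in the queue, not run immediately

    def drain(self, run):
        results = []
        while not self.queue.empty():
            self.busy = True       # exactly one request holds the GPU
            results.append(run(self.queue.get()))
            self.busy = False
        return results

guard = VRAMGuard()
for msg in ["agent-a", "agent-b", "agent-c"]:
    guard.submit(msg)
order = guard.drain(lambda r: r)   # processed strictly in arrival order
```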

Keep-Alive / RadixAttention

The model stays in VRAM for 30 minutes. All agents share the same model — no unloading on agent switch. VRAM is only released after 30 minutes of inactivity.
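The keep-alive rule reduces to a timestamp check: record the last use, unload only after the idle window elapses. A minimal sketch (class and method names are assumptions):

```python
import time

KEEP_ALIVE_SECONDS = 30 * 60  # model stays resident for 30 minutes

class ModelResidency:
    """Sketch of the keep-alive rule: the model is unloaded only once
    no agent has used it for KEEP_ALIVE_SECONDS."""

    def __init__(self):
        self.last_used = time.monotonic()

    def touch(self):
        """Called on every inference; agent switches do not unload."""
        self.last_used = time.monotonic()

    def should_unload(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_used) >= KEEP_ALIVE_SECONDS

res = ModelResidency()
res.touch()
idle_check = res.should_unload()                              # just used: stays loaded
after_31_min = res.should_unload(now=res.last_used + 31 * 60) # idle: unload
```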

// Anatomy of an LLM Request
System Prompt + Memory Context → Cognitive Balance Check → LLM Inference (SGLang API) → Tool Dispatch → Ring-Check → Audit Log + Response → Token Tracking

Context Management

Context Architecture

A context window of 14,336 tokens. Each agent uses 17–22% of it for its prompt; the rest remains for memory, conversations, and tool calls.

// Context Window Composition (14,336 Tokens)
Core Prompt           ~2,000
Agent Prompt          ~400–700
Tools                 ~400–600
Memories              ~500–1,500
Calendar              ~200
Chats                 ~300–600
Conversation History  dynamic
Response              ~2,000 reserved

Fixed per agent (17–22%): Core Prompt, Agent Prompt, Tools. Dynamic: Memories, Conversation History, Response.

Context Budget Guard: automatic trimming. When the context exceeds the budget, the conversation history is compressed before the LLM call starts.

Dreaming Trigger: when the history exceeds the threshold, the agent consolidates insights into long-term memory (Dreaming) and clears the history.

Context Budget Guard

Before each LLM call, the token count is checked. If it exceeds the budget, the conversation history is automatically trimmed — oldest messages first. The agent prompt and tool definitions always remain fully preserved.
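The guard's trimming loop can be sketched in a few lines: drop the oldest non-system messages until the total fits, never touching the system prompt. The token counter is injected; the one-token-per-word estimate below is a stand-in for the real tokenizer:

```python
def trim_history(messages, count_tokens, budget):
    """Drop the oldest non-system messages until the total fits the
    budget. The system prompt (and tool definitions, not modeled here)
    are never trimmed."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    while history and total(system + history) > budget:
        history.pop(0)  # oldest first
    return system + history

msgs = [
    {"role": "system", "content": "agent prompt"},
    {"role": "user", "content": "old question"},
    {"role": "assistant", "content": "old answer"},
    {"role": "user", "content": "new question"},
]
# naive token estimate (one token per word) just for the sketch
trimmed = trim_history(msgs, lambda s: len(s.split()), budget=6)
```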

Dynamic Compression

The available context budget is dynamically calculated: shorter agent prompts leave more room for conversation history and memories. Agents with extensive tool sets compensate with shorter system prompts.
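The dynamic calculation is simple arithmetic over the composition table above: subtract the fixed parts and the reserved response from the window, and what remains is the budget for memories and history. A sketch with the documented figures (function name and parameters are assumptions):

```python
CONTEXT_WINDOW = 14_336
RESPONSE_RESERVE = 2_000  # tokens held back for the model's answer

def dynamic_budget(core_prompt: int, agent_prompt: int, tool_defs: int) -> int:
    """Tokens left for memories + conversation history after the fixed
    parts and the reserved response are subtracted (sketch)."""
    return CONTEXT_WINDOW - RESPONSE_RESERVE - core_prompt - agent_prompt - tool_defs

# a short-prompted agent keeps more room for history than a tool-heavy one
short_agent = dynamic_budget(2_000, 400, 400)
tool_heavy = dynamic_budget(2_000, 700, 600)
```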

Agent-Splitting

Instead of overloading one agent with a massive prompt, AIMOS distributes work across specialists with short, focused prompts. Each agent masters its domain — less prompt, more room for context.

Infrastructure

SovereignNode

A single server. Local GPU. No cloud dependency. The SovereignNode is the heart of every AIMOS installation — a physical or virtual server that hosts all components.

Everything runs on-premise: the LLM inference, the databases, the agent processes, and the communication channels. Not a single byte leaves your network unless you explicitly configure otherwise (e.g., outbound Telegram messages).

Component | Minimum                      | Recommended
GPU       | NVIDIA RTX 3090 (24 GB VRAM) | NVIDIA RTX 5090 (32 GB VRAM)
RAM       | 32 GB DDR4                   | 64 GB DDR5
Storage   | 256 GB SSD                   | 1 TB NVMe
CPU       | 8 cores                      | 16+ cores
OS        | Ubuntu 24.04 LTS             | Ubuntu 26.04 LTS
[Diagram: SovereignNode with GPU (NVIDIA CUDA / LLM Runtime) running Qwen 3.5:27B (Q4, ~17 GB VRAM, native tool-calling); PostgreSQL; SQLite (Memory); Orchestrator + VRAM Guard; Agents A, B, C; Shared Listener (Telegram, E-Mail, Voice).]

Dual-DB

Dual-DB Architecture

AIMOS uses two database systems with clearly separated responsibilities:

PostgreSQL (Relay Database)

Central message relay between Shared Listener, Orchestrator, and agents. Stores incoming messages, audit logs, PII vault mappings, and session data. Multi-process capable through connection pooling.

SQLite (Agent-Memory)

Each agent has its own SQLite database with semantic, episodic, and procedural memory. Hybrid search via FTS5 + vector embeddings. Portable by simply copying the file.
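The FTS5 half of that hybrid search needs nothing beyond the standard library (provided the local SQLite build includes FTS5). A sketch; the table and column names follow the schema listed for the agent database, the vector-embedding half is omitted:

```python
import sqlite3

# FTS5 half of the hybrid memory search (sketch). Requires an SQLite
# build with FTS5 compiled in; the vector search runs separately.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE semantic_memory USING fts5(content)")
db.executemany("INSERT INTO semantic_memory(content) VALUES (?)", [
    ("User prefers weekly reports on Monday",),
    ("Project Falcon uses PostgreSQL 16",),
])

# MATCH is case-insensitive under the default tokenizer;
# ORDER BY rank returns the best match first.
rows = db.execute(
    "SELECT content FROM semantic_memory WHERE semantic_memory MATCH ? "
    "ORDER BY rank",
    ("postgresql",),
).fetchall()
```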

PostgreSQL: message_relay, audit_log, pii_vault, sessions, llm_usage
SQLite (per agent): semantic_memory, episodic_memory, procedural_memory, vector_embeddings, dreaming_log
Sync via Orchestrator.

Interoperability

Agent Portability

AIMOS agents are portable, compatible, and interoperable through open standards.

OAP Export/Import

The Open Agent Package format enables the complete export of an agent including memory, skills, and configuration as a portable archive.

agent_export.oap
  config.yaml
  memory.sqlite
  skills/
  prompts/
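Packing that layout into a single portable archive can be sketched with a ZIP container; treating the .oap format as a ZIP, and the function signature, are assumptions for illustration:

```python
import io
import zipfile

def export_agent_oap(config_yaml: bytes, memory_sqlite: bytes, skills: dict) -> bytes:
    """Pack an agent into one .oap archive following the tree above.
    Treating OAP as a ZIP container is an assumption of this sketch."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("config.yaml", config_yaml)
        z.writestr("memory.sqlite", memory_sqlite)
        for name, code in skills.items():
            z.writestr(f"skills/{name}", code)
    return buf.getvalue()

archive = export_agent_oap(b"name: demo", b"", {"hello.py": b"print('hi')"})
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()  # importer reads these back
```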

MCP Bridge (39 Tools)

The Model Context Protocol enables external LLMs (Claude, GPT, etc.) to access AIMOS skills. 39 tools are available as an MCP server.

sql_query, file_read, rest_call, memory_search, +35 more
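MCP is JSON-RPC based, so an external LLM discovers the skills via a `tools/list` call. An illustrative response shape, built here as a plain dict; the input schema shown for memory_search is an assumption:

```python
# Illustrative shape of an MCP `tools/list` response exposing AIMOS
# skills. The tool name comes from the list above; the description
# and input schema are assumptions for this sketch.
def tools_list_response(request_id: int) -> dict:
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {
            "tools": [
                {
                    "name": "memory_search",
                    "description": "Hybrid FTS5 + vector search over agent memory",
                    "inputSchema": {
                        "type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"],
                    },
                },
                # ... the remaining 38 tools follow the same shape
            ]
        },
    }

resp = tools_list_response(1)
```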

A2A Agent Cards

Each agent publishes an Agent Card (JSON-LD) following the Google A2A specification. External systems can query capabilities, input formats, and trust level.

"name": "Engineering Agent",
"skills": ["cad_read", "bom_gen"],
"trust_ring": 1
[Diagram: SovereignNode A exports agent.oap → OAP transfer (Memory + Skills + Config) → import on SovereignNode B → agent active.]

Technical Highlights

What Sets AIMOS Apart

Native Tool-Calling

No text hacks or regex parsing — AIMOS uses the native tool-calling API of the LLM. The agent controls systems directly, instead of merely describing actions.

Multilingual Voice

Speech recognition (Whisper STT) and speech synthesis (Piper TTS) in all languages — agents understand voice messages and respond in the user's native language.

Token-Tracking

Every LLM call is captured: input/output tokens, latency, context utilization. Full cost transparency per agent, per conversation, per month.
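One usage row per call is enough to derive all three views (per agent, per conversation, per month). A sketch of such a record; the field names are assumptions loosely based on the llm_usage table mentioned earlier:

```python
from dataclasses import dataclass

CONTEXT_WINDOW = 14_336

@dataclass
class LLMUsage:
    """One row per LLM call, as it might land in the llm_usage table.
    Field names are assumptions based on the metrics listed above."""
    agent: str
    input_tokens: int
    output_tokens: int
    latency_ms: float

    @property
    def context_utilization(self) -> float:
        """Share of the 14,336-token window consumed by this call."""
        return (self.input_tokens + self.output_tokens) / CONTEXT_WINDOW

row = LLMUsage(agent="calendar", input_tokens=3_000, output_tokens=584, latency_ms=1200.0)
util = round(row.context_utilization, 2)
```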

Conversation Threading

Every agent knows who it is talking to on which channel. Telegram, email, and internal messages are cleanly separated — no confusion between conversation partners.
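The separation comes down to a stable thread key combining agent, channel, and conversation partner, so the same person on two channels yields two distinct threads. The key format below is an assumption for illustration:

```python
def thread_key(channel: str, peer_id: str, agent: str) -> str:
    """Derive a stable conversation key so messages from different
    channels and partners never mix (key format is an assumption)."""
    return f"{agent}:{channel}:{peer_id}"

# same person, different channel -> separate conversation threads
k1 = thread_key("telegram", "user-42", "engineering")
k2 = thread_key("email", "user-42", "engineering")
```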