Building Context Compression in Astreus
Handle long conversations without hitting token limits. Automatic summarization keeps context while cutting costs.
Long conversations with AI agents create a fundamental problem. Every API call sends the entire message history, causing token costs to escalate and eventually hitting model limits. Astreus solves this with automatic context compression that summarizes older messages while maintaining conversational coherence.
## The Context Window Challenge
Language models are stateless. They don't remember previous exchanges unless you include them in every request. A 50-message conversation can easily consume 25,000+ tokens per API call, creating escalating costs and slower response times.
Even with the 128K token limit of GPT-4o or the 200K limit of Claude models, production applications will eventually hit these ceilings. Without compression, you're forced to either truncate history (losing context) or restart conversations (breaking continuity).
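To see how quickly resending full history compounds, here is a back-of-envelope sketch. The 500-token average per message is an illustrative assumption, not a measured figure:

```javascript
// Rough estimate of cumulative input tokens over a conversation,
// assuming each API call resends the entire history so far.
const avgTokensPerMessage = 500; // illustrative assumption

function cumulativeInputTokens(messageCount) {
  let total = 0;
  for (let n = 1; n <= messageCount; n++) {
    total += n * avgTokensPerMessage; // call n sends n messages of history
  }
  return total;
}

console.log(cumulativeInputTokens(50)); // → 637500 tokens billed across 50 messages
```

The quadratic growth is the key point: doubling the conversation length roughly quadruples total input tokens billed.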
## Enabling Auto Context Compression
Astreus provides `autoContextCompression` as a configuration option that works alongside the memory system. When enabled, it automatically summarizes older messages during extended dialogues, preserving important details through intelligent summarization rather than truncation.
Here's the real Astreus API for enabling context compression:
```javascript
const agent = await Agent.create({
  name: 'ContextAgent',
  model: 'gpt-4o',
  memory: true,
  autoContextCompression: true,
  systemPrompt: 'You can handle very long conversations efficiently.'
});
```
The `memory` option must be enabled for compression to work. This ensures conversational history is tracked and available for summarization.
## How It Works in Practice
When your conversation grows beyond a certain threshold, Astreus automatically triggers compression. The system analyzes older messages and creates concise summaries that capture key information while drastically reducing token count.
Recent messages remain in full detail for immediate context. This balances the need for rich recent context with the efficiency of summarized history.
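Conceptually, the strategy looks something like the sketch below. The threshold, window size, and `summarize()` stub are hypothetical stand-ins to illustrate the idea, not Astreus internals:

```javascript
// Toy sketch of the keep-recent / summarize-older strategy.
// All constants and the summarize() stub are illustrative assumptions.
const RECENT_WINDOW = 5;   // messages kept verbatim
const COMPRESS_AFTER = 10; // trigger once history exceeds this

function summarize(messages) {
  // Stand-in for an LLM summarization call.
  return { role: 'system', content: `Summary of ${messages.length} earlier messages` };
}

function buildContext(history) {
  if (history.length <= COMPRESS_AFTER) return history;
  const older = history.slice(0, history.length - RECENT_WINDOW);
  const recent = history.slice(-RECENT_WINDOW);
  return [summarize(older), ...recent]; // one summary + full recent detail
}

const history = Array.from({ length: 20 }, (_, i) => ({ role: 'user', content: `msg ${i + 1}` }));
console.log(buildContext(history).length); // → 6 (1 summary + 5 recent messages)
```

However the real implementation tunes these knobs, the shape is the same: the payload size stays roughly constant no matter how long the conversation runs.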
```javascript
// After 20+ exchanges, the agent still remembers early details
const response = await agent.ask('What was the first space fact I mentioned?');
// The system retrieves information from compressed history
console.log(response); // References the original fact despite many intervening messages
```
This demonstrates memory persistence across extended conversations. The agent maintains understanding of early exchanges even after dozens of subsequent messages.
## Environment Setup
To use context compression, you need two environment variables configured:
```bash
OPENAI_API_KEY=your_openai_api_key_here
DB_URL=sqlite://./memory.db
```
The `DB_URL` is essential because memory and compression require persistent storage. SQLite works well for development, while production deployments typically use PostgreSQL or other robust databases.
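For production, the connection string typically points at PostgreSQL instead. The host and credentials below are placeholders, not real values:

```bash
# Development (SQLite) vs. production (PostgreSQL) — placeholder credentials
DB_URL=sqlite://./memory.db
# DB_URL=postgresql://user:password@db-host:5432/astreus
```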
## Testing Compression Effectiveness
The best way to verify compression is through extended conversations. Send 20-30 messages on various topics, then ask the agent to recall early information.
```javascript
// Start a long conversation
for (let i = 0; i < 20; i++) {
  await agent.ask(`Tell me space fact number ${i + 1}`);
}
// Verify early context is preserved
const recall = await agent.ask('What was space fact number 1?');
```
If the agent accurately recalls information from early in the conversation, compression is working correctly. The memory persists despite the conversation length and automatic summarization.
## Cost Savings at Scale
Without compression, a 100-message conversation at 500 tokens each sends 50,000 tokens per request. At $0.01 per 1K tokens, that's $0.50 per message just for input.
With compression, you might send only 5,000 tokens (recent messages plus summary), dropping costs to $0.05 per message. That's 90% savings that compounds with every exchange.
```javascript
// Rough token calculation using the figures above (all values illustrative)
const messageCount = 100;         // total messages in the conversation
const avgTokensPerMessage = 500;
const recentMessages = 8;         // messages kept in full detail
const summaryTokens = 1000;       // size of the compressed summary

const uncompressedTokens = messageCount * avgTokensPerMessage;
const compressedTokens = (recentMessages * avgTokensPerMessage) + summaryTokens;
const savings = ((uncompressedTokens - compressedTokens) / uncompressedTokens) * 100;
console.log(`Token reduction: ${savings.toFixed(1)}%`); // → Token reduction: 90.0%
```
For production applications handling thousands of conversations daily, these savings translate to thousands of dollars in reduced API costs.
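Scaling the per-message figures above to whole conversations makes the effect concrete. The volumes and pricing here are illustrative assumptions, matching the numbers used earlier:

```javascript
// Illustrative per-conversation savings; volumes and pricing are assumptions.
const price = 0.01 / 1000; // USD per token, as in the example above
const msgs = 100;          // messages per conversation
const avgTokens = 500;     // average tokens per message

// Without compression: call n resends n * 500 tokens of history.
let uncompressed = 0;
for (let n = 1; n <= msgs; n++) uncompressed += n * avgTokens;

// With compression: each call sends ~5,000 tokens (recent window + summary).
const compressed = msgs * 5000;

const perConversation = (uncompressed - compressed) * price;
console.log(perConversation.toFixed(2)); // → 20.25 (USD saved per 100-message conversation)
```

At 1,000 such conversations per day, that hypothetical figure compounds to roughly $20,000 per day in avoided input costs.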
## Getting Started
You can explore context compression through two paths. Clone the official example repository from `astreus-ai/context-compression` for a complete working implementation, or install `@astreus-ai/astreus` directly via npm and integrate it into your existing project.
The example repository includes a pre-configured setup that demonstrates compression in action with a simple chat interface.
```bash
# Install Astreus
npm install @astreus-ai/astreus
# Or clone the example
git clone https://github.com/astreus-ai/context-compression.git
cd context-compression
npm install
```
Once installed, set your environment variables and run the example to see compression working in real-time.
## When to Use Compression
Context compression shines in customer support bots, personal assistants, and any application where conversations naturally extend beyond 10-15 exchanges. Short interactions (under 10 messages) rarely benefit from compression since they don't approach token limits.
For applications where users return to conversations over days or weeks, compression becomes essential. It enables indefinite conversation length without degrading performance or hitting hard limits.
```javascript
// Ideal for long-running assistants
const personalAssistant = await Agent.create({
  name: 'PersonalAssistant',
  model: 'gpt-4o',
  memory: true,
  autoContextCompression: true,
  systemPrompt: 'You are a helpful personal assistant who remembers past conversations.'
});
// Supports multi-day conversations
const morning = await personalAssistant.ask('What are my tasks today?');
// ... many hours and messages later ...
const evening = await personalAssistant.ask('Did I complete everything we discussed this morning?');
```
The agent maintains continuity across the entire day despite potentially hundreds of intervening messages.
## Best Practices
Enable compression from the start if you anticipate conversations exceeding 10-15 exchanges. Retrofitting compression into existing systems with uncompressed history can be challenging.
Monitor your actual token usage through OpenAI's dashboard or your database logs. This reveals whether compression is triggering appropriately and delivering expected savings.
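If you want a quick sanity check without dashboard access, a common rough heuristic is ~4 characters per token for English text. This is an approximation for eyeballing payload sizes, not a real tokenizer:

```javascript
// Rough token estimate (~4 chars per token for English text).
// Use a real tokenizer such as tiktoken when you need accurate counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

const prompt = 'What tasks did we discuss earlier?';
console.log(estimateTokens(prompt)); // rough token count for this prompt
```

Logging an estimate like this before each call makes it obvious whether your context is growing without bound or plateauing as compression kicks in.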
```javascript
// Example conversation with compression
const response = await agent.ask('What tasks did we discuss earlier?');
console.log(response);
```
If you notice token counts staying consistently high despite many messages, verify that both `memory` and `autoContextCompression` are enabled.
## Compression vs. Memory
Understand the distinction between these two features. Memory persists facts and context between sessions, while compression manages token efficiency within a single conversation thread.
They work together synergistically: memory stores what to remember; compression decides how to represent it efficiently.
```javascript
// Both features enabled for optimal behavior
const agent = await Agent.create({
  name: 'OptimalAgent',
  model: 'gpt-4o',
  memory: true,                 // Long-term fact persistence
  autoContextCompression: true, // Efficient context management
  systemPrompt: 'Remember important details and handle long conversations.'
});
```
This configuration gives you both long-term memory across sessions and efficient handling of extended conversations within sessions.
## Production Considerations
In production, ensure your database can handle the write load from memory operations. SQLite works for low-traffic applications, but high-volume deployments need PostgreSQL or similar robust solutions.
Test compression with realistic conversation patterns. Run automated tests that simulate 50+ message exchanges and verify that critical information persists throughout.
```javascript
// Automated compression test
async function testCompression() {
const agent = await Agent.create({
name: 'TestAgent',
model: 'gpt-4o',
memory: true,
autoContextCompression: true
});
// Send many messages
await agent.ask('My favorite color is blue.');
for (let i = 0; i < 30; i++) {
await agent.ask(`Tell me fact number ${i}`);
}
// Verify early info is retained
const response = await agent.ask('What is my favorite color?');
assert(response.includes('blue'), 'Compression failed to preserve early context');
}
```
This automated test ensures compression doesn't lose critical details during summarization.
## The Results
Context compression transforms AI agents from token-limited conversations into persistent assistants capable of indefinite dialogue. You gain predictable costs regardless of conversation length, consistent performance without degradation, and the ability to maintain natural long-running interactions.
In production deployments, teams report 85-95% token cost reductions for conversations exceeding 20 messages. Users experience faster responses due to smaller context windows, and developers gain confidence that conversations won't fail when approaching model limits.
For any Astreus application handling extended interactions, enabling `autoContextCompression` is a simple configuration change that delivers immediate benefits.
This example is written for Astreus v0.5.37. Make sure you are using a compatible version.