Forgetless: Smart Context Optimization for LLMs

Every LLM has a context limit. When your documents, conversations, and knowledge exceed that limit, you're forced to make difficult choices about what to include. Forgetless makes those choices intelligently, achieving up to 14.5x compression while preserving what matters most.

Every language model has a ceiling. GPT-5 caps at 400K tokens. Claude Opus 4.5 offers 200K. Gemini 2.5 Pro stretches to 1M. Even the most generous context windows eventually run out when you're processing research papers, analyzing codebases, or maintaining long conversation histories. When that happens, you're forced into an uncomfortable position: what do you cut?

Most developers reach for crude solutions. Truncate from the beginning. Keep only the last N messages. Maybe implement a sliding window. These approaches share a fundamental flaw: they treat all content as equally important. The system prompt gets the same treatment as a tangential comment from three hours ago. Critical context disappears alongside noise.

I built Forgetless to solve this problem properly.

## The Context Optimization Problem

Consider a typical LLM application. You have a system prompt defining the agent's behavior. You have retrieved documents from a knowledge base. You have the current conversation history. You have tool outputs and intermediate reasoning. Add these together and you quickly exceed your token budget.

The naive solution is to just cut content until it fits. But this creates a cascade of problems.

**Loss of critical context** means system prompts get truncated. Important instructions vanish. The agent's behavior becomes unpredictable because it literally forgot what it was supposed to do.

**Relevance blindness** occurs because a conversation from yesterday about the user's preferences might be more relevant to the current query than the last five messages about weather. Simple truncation can't distinguish between them.

**Semantic fragmentation** happens when cutting in the middle of a concept splits related information across the boundary. Half the explanation makes it in; half doesn't. The model receives broken context that's harder to reason about than no context at all.

**No prioritization** is perhaps the most fundamental issue: all tokens are treated equally. A 500-token tangent about a minor detail competes for space with your carefully crafted 50-token system prompt.

## How Forgetless Works

Forgetless implements a six-stage pipeline that transforms massive content into optimized context.

### Stage 1: Content Ingestion

The pipeline accepts diverse inputs: raw text, markdown documents, code files, PDFs, images, and structured data. Files are read lazily, meaning a 100MB PDF doesn't consume memory until it's actually needed.

```rust:Rust
let result = Forgetless::new()
    .add(WithPriority::critical(system_prompt))
    .add_file("research-paper.pdf")
    .add_file("codebase/src/**/*.rs")
    .add(conversation_history)
    .query(user_question)
    .run()
    .await?;
```

Each piece of content can be assigned a priority level: critical, high, medium, or low. Critical content is guaranteed to appear in the output. High-priority content is strongly preferred. Medium is the default. Low-priority content is dropped first when space is tight.

### Stage 2: Smart Chunking

Unlike naive approaches that split on fixed character counts, Forgetless uses content-aware chunking. The chunker understands the structure of what it's processing. Code is split at function and class boundaries, preserving complete logical units. Markdown is split at headers, keeping sections together. Conversations are split at message boundaries, never cutting mid-thought. Structured data respects object and array boundaries.

This semantic awareness ensures that each chunk represents a coherent piece of information that can stand on its own.
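To make the idea concrete, here is a minimal sketch of header-aware markdown chunking. It illustrates the principle only; it is not Forgetless's actual chunker, and the "split at `#` headings" rule is a deliberate simplification (it ignores code fences, nested lists, and oversized sections).

```rust:Rust
/// Minimal illustration of content-aware chunking for markdown:
/// start a new chunk at every heading instead of at a fixed byte offset.
/// Simplified sketch, not Forgetless's internal implementation.
fn chunk_markdown(input: &str) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for line in input.lines() {
        // A heading marks the start of a new semantic unit.
        if line.trim_start().starts_with('#') && !current.trim().is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.trim().is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let doc = "# Intro\nSome overview.\n\n## Details\nThe important part.\n";
    for (i, chunk) in chunk_markdown(doc).iter().enumerate() {
        println!("chunk {i}: {:?}", chunk);
    }
}
```

Real documents need more care than this, but the principle is the same: split where the author already drew a boundary, so every chunk stands on its own.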
### Stage 3: Local Embedding

Forgetless generates vector embeddings using the all-MiniLM-L6-v2 model, running entirely on your machine via FastEmbed. No API calls to OpenAI or other providers. No data leaving your infrastructure. No per-token costs.

An LRU cache prevents redundant computation; if you process the same document multiple times, the embeddings are retrieved from cache rather than recomputed.

### Stage 4: Hybrid Scoring

This is where Forgetless differentiates itself. Rather than relying on a single relevance signal, it combines four factors:

| Signal | Description |
|--------|-------------|
| **Priority** | Explicit importance markers set by the developer |
| **Recency** | Newer content weighted higher than older content |
| **Semantic Similarity** | Vector distance to the current query |
| **Position** | Original document position for tie-breaking |

The hybrid approach captures different dimensions of relevance. A chunk might score low on semantic similarity but high on priority because it contains system instructions. Another chunk might have low explicit priority but high semantic similarity because it directly answers the user's question.

### Stage 5: Budget Selection

With all chunks scored, the selector builds the optimal subset that fits within your token limit. Critical chunks are added first, unconditionally. Then high-priority chunks. Then the remaining chunks in order of their hybrid score until the budget is exhausted. (A simplified sketch of this scoring and selection logic appears below, after the Why Rust? section.)

Token counting uses the cl100k_base tokenizer (GPT-4 compatible) for accurate budget enforcement. No surprises when your carefully constructed context exceeds the limit.

### Stage 6: Context Assembly

The selected chunks are reassembled into coherent output. The result includes comprehensive statistics:

```rust:Rust
println!("Input tokens: {}", result.stats.input_tokens);
println!("Output tokens: {}", result.total_tokens);
println!("Compression ratio: {:.1}x", result.stats.compression_ratio);
println!("Chunks: {} -> {}", result.stats.chunks_processed, result.stats.chunks_selected);
```

## Real-World Performance

In production workloads, Forgetless achieves significant compression:

| Metric | Before | After |
|--------|--------|-------|
| Input Tokens | 1,847,291 | 127,843 |
| Compression | 1x | 14.5x |
| Processing Time | - | Sub-second |

The 14.5x compression ratio means content that would otherwise require roughly 14 API calls (or wouldn't fit at all) now fits in a single request. More importantly, the output contains the most relevant content, not just whatever happened to be at the end.

## Why Rust?

Forgetless is written in Rust for several compelling reasons.

Performance is critical because embedding generation and similarity search are computationally intensive. Rust's zero-cost abstractions and lack of garbage collection ensure consistent, predictable performance, with parallel processing that scales linearly with available cores.

Memory efficiency matters when large documents can consume significant memory. Rust's ownership model ensures memory is released as soon as it's no longer needed, and lazy file loading means you only pay for what you actually process.

Safety is paramount because context optimization sits in a critical path. A crash or memory corruption doesn't just break your application; it potentially corrupts your LLM's context. Rust's compile-time guarantees eliminate entire classes of bugs.

Additionally, optional CUDA and Metal support enables hardware-accelerated embedding generation on NVIDIA GPUs and Apple Silicon respectively.
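As promised above, here is a simplified sketch of how Stages 4 and 5 fit together: several signals folded into one score per chunk, then a greedy fill of the token budget with critical content exempt from the cut. The weights, struct fields, and exact ordering are assumptions made for this example, not Forgetless's actual internals.

```rust:Rust
/// Illustrative sketch of hybrid scoring and budget selection.
/// Weights and tie-breaking rules are placeholders, not Forgetless's formula.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Priority { Critical, High, Medium, Low }

struct Chunk {
    priority: Priority,
    recency: f32,    // 0.0 = oldest, 1.0 = newest
    similarity: f32, // similarity to the query, 0.0..=1.0
    position: usize, // original order, used as a tie-breaker
    tokens: usize,
}

fn hybrid_score(c: &Chunk) -> f32 {
    let priority_weight = match c.priority {
        Priority::Critical => 1.0,
        Priority::High => 0.75,
        Priority::Medium => 0.5,
        Priority::Low => 0.25,
    };
    // Arbitrary example weights; a real implementation would tune these.
    0.4 * priority_weight + 0.2 * c.recency + 0.4 * c.similarity
}

fn select(mut chunks: Vec<Chunk>, budget: usize) -> Vec<Chunk> {
    // Critical chunks first, then everything else by descending score,
    // with original position as the final tie-breaker.
    chunks.sort_by(|a, b| {
        let critical = |c: &Chunk| c.priority == Priority::Critical;
        critical(b)
            .cmp(&critical(a))
            .then(hybrid_score(b).total_cmp(&hybrid_score(a)))
            .then(a.position.cmp(&b.position))
    });

    let mut used = 0;
    let mut selected = Vec::new();
    for chunk in chunks {
        // Critical content is always kept; the rest must fit the budget.
        if chunk.priority == Priority::Critical || used + chunk.tokens <= budget {
            used += chunk.tokens;
            selected.push(chunk);
        }
    }
    selected
}

fn main() {
    let chunks = vec![
        Chunk { priority: Priority::Critical, recency: 0.1, similarity: 0.2, position: 0, tokens: 50 },
        Chunk { priority: Priority::Medium, recency: 0.9, similarity: 0.8, position: 1, tokens: 400 },
        Chunk { priority: Priority::Low, recency: 0.3, similarity: 0.1, position: 2, tokens: 500 },
    ];
    let kept = select(chunks, 500);
    println!("kept {} chunks", kept.len()); // keeps the critical and medium chunks
}
```

The point is the shape of the computation: one comparable score per chunk, with critical content guaranteed a place, exactly as described in Stages 4 and 5.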
## Integration Patterns

Forgetless integrates cleanly into existing LLM applications.

For direct library usage, add it as a Rust dependency and call it directly. This is ideal for Rust applications or when you need maximum performance:

```rust:Rust
let optimized = Forgetless::new()
    .config(Config::default().context_limit(128_000))
    .add(system_prompt)
    .add_file("knowledge_base/**/*.md")
    .query(user_message)
    .run()
    .await?;

let response = llm.chat(&optimized.context).await?;
```

For non-Rust applications, Forgetless provides an HTTP server with REST endpoints. Upload files via multipart form data and receive optimized context as JSON:

```bash:Terminal
cargo run --bin forgetless-server --features server
```

## When to Use Forgetless

Forgetless excels in research and analysis scenarios where academic papers, legal documents, or technical specifications exceed context limits. It's invaluable for codebase understanding, feeding entire repositories to LLMs for code generation, review, or documentation tasks. Agent memory systems particularly benefit from Forgetless, maintaining conversation history and learned information across sessions without hitting token ceilings. RAG pipelines can use it to post-process retrieved documents, fitting them within context limits while preserving relevance. Multi-document synthesis becomes practical, combining information from multiple sources into coherent, focused context.

## Privacy by Design

Forgetless processes everything locally. Your documents, conversations, and queries never leave your machine. The embedding model runs on your hardware. The optimization pipeline runs on your hardware. No external API calls are required for the core functionality.

Optional LLM-enhanced features like image understanding and smart summarization can be enabled for additional intelligence, but even these run locally via mistral.rs and SmolVLM. Your data stays yours.

## Getting Started

Add Forgetless to your Rust project:

```properties:Cargo.toml
[dependencies]
forgetless = { git = "https://github.com/pzzaworks/forgetless" }
```

Basic usage:

```rust:Rust
use forgetless::{Forgetless, Config, WithPriority};

let result = Forgetless::new()
    .config(Config::default().context_limit(128_000))
    .add(WithPriority::critical(system))
    .add_file("document.pdf")
    .query("What are the key points?")
    .run()
    .await?;

// Use result.content with your LLM
```

Comprehensive documentation is available at [forgetless.org](https://forgetless.org).
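One closing, practical note on choosing `context_limit`: the budget you hand to Forgetless should leave headroom for the model's response. A back-of-the-envelope helper is sketched below; the margins are arbitrary placeholders, not an official Forgetless recommendation.

```rust:Rust
/// Rough helper for picking a `context_limit`: reserve space for the
/// response plus a small safety margin, and give Forgetless the rest.
/// The numbers below are illustrative, not official guidance.
fn plan_context_limit(model_window: usize, max_response_tokens: usize) -> usize {
    let safety_margin = model_window / 20; // ~5% slack for tokenizer differences
    model_window
        .saturating_sub(max_response_tokens)
        .saturating_sub(safety_margin)
}

fn main() {
    // A 200K-token model window, keeping 8K tokens free for the reply.
    let limit = plan_context_limit(200_000, 8_000);
    println!("context_limit = {limit}"); // 182000 tokens for optimized context
}
```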