# Forgetless

Smart context optimization library for LLMs. Compress massive content to fit your token budget with intelligent prioritization.
## What is Forgetless?

Forgetless is a smart context optimization library for Large Language Models written in Rust. It solves one of the most challenging problems in LLM applications: fitting massive amounts of content into limited token budgets while preserving the most relevant information.
Every LLM has context limits. Whether you're working with GPT-5's 400K tokens, Claude Opus 4.5's 200K, Gemini 2.5 Pro's 1M, or smaller local models, there's always a ceiling. When your documents, conversation history, and knowledge base exceed that limit, you're forced to make difficult choices about what to include. Forgetless makes those choices intelligently, achieving compression ratios up to 14.5x while keeping what matters most.
The library was built with a clear philosophy: context optimization should be fast, private, and intelligent. All processing happens locally with no external API calls required for the core functionality. Your content never leaves your machine, embeddings are generated locally using FastEmbed, and the entire pipeline runs with minimal latency.
## Core Architecture

Forgetless operates through a sophisticated six-stage pipeline that transforms raw content into optimized context.
The pipeline begins with **Content Ingestion**, where text, files, and conversations are collected and prepared for processing. Files are read lazily, meaning large documents don't consume memory until they're actually needed. The system supports PDF extraction, image processing, code files, markdown, and structured data formats.
Next comes **Smart Chunking**, which breaks content into semantically meaningful pieces. Unlike naive approaches that split on fixed character counts, Forgetless uses content-aware chunking that respects natural boundaries. Code is split at function and class boundaries, markdown at headers, and conversations at message boundaries. This preserves context within each chunk.
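As a rough sketch of what boundary-aware chunking means in practice, the toy function below splits markdown at header lines so a section's heading stays together with its body. This is a simplified illustration, not Forgetless's actual chunker:

```rust:Rust
// Illustrative only: split markdown at header boundaries instead of
// fixed character counts, so each chunk is a self-contained section.
fn chunk_markdown(input: &str) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    for line in input.lines() {
        // A new header starts a new chunk (naive '#' prefix check).
        if line.starts_with('#') && !current.trim().is_empty() {
            chunks.push(current.trim().to_string());
            current.clear();
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.trim().is_empty() {
        chunks.push(current.trim().to_string());
    }
    chunks
}

fn main() {
    let doc = "# Intro\nHello.\n# Details\nMore text.";
    let chunks = chunk_markdown(doc);
    assert_eq!(chunks.len(), 2);
    println!("{} chunks", chunks.len());
}
```

The real chunker applies the same idea per format: function and class boundaries for code, message boundaries for conversations.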
**Local Embedding** generates vector representations using the all-MiniLM-L6-v2 model running entirely on your machine. No API calls to OpenAI or other providers are needed. An LRU cache prevents redundant computation when processing overlapping content.
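To illustrate why the cache helps, here is a minimal memoized embedder. The `fake_embed` function is a toy stand-in for the real MiniLM model; the point is only that repeated text hits the cache instead of recomputing:

```rust:Rust
use std::collections::HashMap;

// Illustrative only: memoize embeddings by text so overlapping
// content is embedded once. Not Forgetless's actual cache (which is LRU).
struct EmbedCache {
    cache: HashMap<String, Vec<f32>>,
    misses: usize,
}

impl EmbedCache {
    fn new() -> Self {
        Self { cache: HashMap::new(), misses: 0 }
    }

    fn embed(&mut self, text: &str) -> Vec<f32> {
        if let Some(v) = self.cache.get(text) {
            return v.clone(); // cache hit: no model call
        }
        self.misses += 1;
        let v = fake_embed(text);
        self.cache.insert(text.to_string(), v.clone());
        v
    }
}

// Toy stand-in: a real model would return a 384-dim MiniLM vector.
fn fake_embed(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

fn main() {
    let mut cache = EmbedCache::new();
    cache.embed("hello");
    cache.embed("hello"); // second call is served from the cache
    assert_eq!(cache.misses, 1);
    println!("model invoked {} time(s)", cache.misses);
}
```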
**Hybrid Scoring** combines multiple signals to determine chunk relevance: priority levels provide explicit importance markers, recency weights newer content higher, semantic similarity measures relevance to the query, and position scoring considers where content appears in the original document.
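A simplified version of combining these signals might look like the following; the weight values here are invented for illustration and are not Forgetless's actual defaults:

```rust:Rust
// Illustrative only: blend the four scoring signals into one relevance
// score with a weighted sum. Weights are assumptions, not library defaults.
struct ChunkSignals {
    priority: f32,   // explicit importance marker, 0.0..=1.0
    recency: f32,    // newer content scores higher
    similarity: f32, // semantic similarity to the query
    position: f32,   // where the chunk appears in the source document
}

fn hybrid_score(s: &ChunkSignals) -> f32 {
    0.4 * s.priority + 0.2 * s.recency + 0.3 * s.similarity + 0.1 * s.position
}

fn main() {
    let chunk = ChunkSignals {
        priority: 1.0,
        recency: 0.5,
        similarity: 0.8,
        position: 0.2,
    };
    println!("score = {:.2}", hybrid_score(&chunk));
}
```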
**Budget Selection** then uses these hybrid scores to select the optimal subset of chunks that fits within your token limit. Critical content is always preserved regardless of score, and the cl100k_base tokenizer (GPT-4 compatible) provides accurate token counts. Finally, **Context Assembly** combines the selected chunks into coherent output ready for LLM consumption, along with comprehensive statistics on compression ratio, chunks processed, and tokens saved.
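The selection step can be pictured as a greedy pass: keep every Critical chunk unconditionally, then fill the remaining budget in descending score order. This is an illustrative simplification, not the library's exact algorithm:

```rust:Rust
// Illustrative only: greedy budget selection with a Critical escape hatch.
#[derive(Clone)]
struct Chunk {
    text: String,
    tokens: usize,
    score: f32,
    critical: bool,
}

fn select(mut chunks: Vec<Chunk>, budget: usize) -> Vec<Chunk> {
    let mut selected: Vec<Chunk> = Vec::new();
    let mut used = 0;
    // Critical chunks are always kept, regardless of score or budget.
    chunks.retain(|c| {
        if c.critical {
            used += c.tokens;
            selected.push(c.clone());
            false
        } else {
            true
        }
    });
    // Everything else competes on score until the budget is exhausted.
    chunks.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    for c in chunks {
        if used + c.tokens <= budget {
            used += c.tokens;
            selected.push(c);
        }
    }
    selected
}

fn main() {
    let chunks = vec![
        Chunk { text: "system prompt".into(), tokens: 20, score: 0.1, critical: true },
        Chunk { text: "relevant note".into(), tokens: 50, score: 0.9, critical: false },
        Chunk { text: "old digression".into(), tokens: 50, score: 0.2, critical: false },
    ];
    let picked = select(chunks, 80);
    // The critical chunk survives despite its low score; only the
    // highest-scoring non-critical chunk fits the remaining budget.
    assert_eq!(picked.len(), 2);
    assert_eq!(picked[0].text, "system prompt");
}
```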
## Why Rust?
Forgetless is built with Rust for critical reasons that directly impact reliability and performance.
**Performance** is essential because embedding generation and similarity search are computationally intensive operations. Rust's zero-cost abstractions and lack of garbage collection ensure consistent, predictable performance. Parallel processing scales linearly with available cores thanks to Rayon.
**Memory efficiency** matters when processing large documents. Rust's ownership model ensures memory is released immediately when no longer needed. Lazy file loading means a 100MB PDF doesn't consume memory until it's actually processed.
**Safety** is paramount because context optimization sits in a critical path. A crash or memory corruption doesn't just break your application; it potentially corrupts your LLM's context. Rust's compile-time guarantees eliminate entire classes of bugs that would be runtime errors in other languages.
**Hardware acceleration** through optional CUDA and Metal support enables GPU-accelerated embedding generation on NVIDIA and Apple Silicon respectively.
## Key Features
Forgetless provides a comprehensive set of features designed for production LLM applications.
### Priority System
The four-tier priority system gives you explicit control over what matters most:
| Priority | Behavior |
|----------|----------|
| Critical | Always kept regardless of budget |
| High | Preferred during selection, dropped only when necessary |
| Medium | Default level, selected based on relevance score |
| Low | Dropped first when over budget |
This allows you to ensure system prompts and user queries are never truncated while letting less important context be intelligently compressed.
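One way to picture the tiers is as an ordered enum, where "drop lowest first" is just a sort. This enum is illustrative and is not Forgetless's public API (the quick start below uses `WithPriority` instead):

```rust:Rust
// Illustrative only: deriving Ord gives the tiers a natural drop order.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
enum Priority {
    Low,
    Medium,
    High,
    Critical,
}

// When over budget, chunks are dropped starting from the lowest priority.
fn drop_candidates(mut items: Vec<Priority>) -> Vec<Priority> {
    items.sort(); // ascending: Low is first in line to be dropped
    items
}

fn main() {
    let order = drop_candidates(vec![Priority::Critical, Priority::Low, Priority::High]);
    assert_eq!(order[0], Priority::Low);
    println!("{:?}", order);
}
```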
### Multi-Format Support
Forgetless handles diverse content types with specialized processing. PDFs are extracted via pdftotext with fallback parsing. Images can optionally be described using SmolVLM. Code files (.rs, .py, .js, .ts, .go, and more) are split at function and class boundaries. Markdown respects header-based structure, while JSON, YAML, and TOML maintain object boundary detection. Plain text uses sentence boundary chunking with semantic awareness.
### GPU Acceleration
For demanding workloads, Forgetless supports hardware acceleration through CUDA for NVIDIA GPUs, Metal for Apple Silicon, Intel MKL for CPU acceleration on Intel processors, and Apple Accelerate framework on macOS.
### Optional LLM Enhancement
While the core pipeline runs without any LLM, you can optionally enable enhanced features. The Vision LLM generates detailed descriptions for images using SmolVLM-256M, while the Context LLM enables smart scoring and summarization via mistral.rs. These features download models (~500MB) on first use and run entirely locally.
## Installation
Install Forgetless from crates.io:
```bash:Terminal
cargo add forgetless
```
Or add to your Cargo.toml:
```properties:Cargo.toml
[dependencies]
forgetless = "0.1"
```
For optional features like the HTTP server or GPU acceleration:
```bash:Terminal
# With HTTP server
cargo add forgetless --features server

# With GPU acceleration (macOS Apple Silicon)
cargo add forgetless --features metal

# With GPU acceleration (NVIDIA)
cargo add forgetless --features cuda
```
### Server Binary
You can also install the server as a standalone binary:
```bash:Terminal
cargo install forgetless --features server
forgetless-server # Runs on http://localhost:8080
```
## Quick Start
The fluent builder API makes optimization straightforward:
```rust:Rust
use forgetless::{Forgetless, Config, WithPriority};

let system = "You are a helpful assistant.";
let user_query = "What are the key points?";

let result = Forgetless::new()
    .config(Config::default().context_limit(128_000))
    .add(WithPriority::critical(system))
    .add_file("research-paper.pdf")
    .add_file("notes.md")
    .query(user_query)
    .run()
    .await?;

println!(
    "Compressed {} tokens to {}",
    result.stats.input_tokens,
    result.total_tokens
);
```
## Configuration Options
| Option | Default | Description |
|--------|---------|-------------|
| `context_limit` | 128,000 | Maximum tokens in output |
| `chunk_size` | 512 | Target chunk size in tokens |
| `vision_llm` | false | Enable image understanding |
| `context_llm` | false | Enable smart LLM-based scoring |
| `parallel` | true | Enable parallel file processing |
| `cache` | true | Enable embedding cache |
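These options can presumably be set through the same builder pattern as `context_limit` in the quick start; the method names below are assumed to mirror the table's option names and are not confirmed beyond `context_limit`:

```rust:Rust
// Assumed builder methods matching the option names above.
let config = Config::default()
    .context_limit(64_000) // cap output at 64K tokens
    .chunk_size(512)       // target chunk size in tokens
    .parallel(true)        // process files in parallel
    .cache(true);          // reuse cached embeddings
```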
## HTTP Server
For non-Rust applications, Forgetless provides an HTTP server that exposes REST endpoints with multipart form data support for file uploads, CORS for cross-origin requests, and health check endpoints for container orchestration:
```bash:Terminal
cargo run --bin forgetless-server --features server
```
## Use Cases
Forgetless excels in research paper analysis where entire papers with citations need to fit into context for Q&A. It's ideal for codebase understanding, compressing large codebases for code generation tasks. Conversation history can be maintained without hitting limits, and multiple documents can be processed with intelligent summarization. Agent memory systems particularly benefit, as Forgetless enables memory that scales beyond token limits.
## Performance
Benchmarks demonstrate Forgetless's efficiency with 14.5x compression ratio on typical workloads, sub-second processing for most documents, minimal memory footprint through lazy file loading, and parallel processing that scales with available cores.
## Documentation
Comprehensive documentation is available at [forgetless.org](https://forgetless.org), including getting started guides, API reference, and configuration deep dives.
## More Slices From This Pizza
Dive deeper into the ideas and technology behind this project:
- [Forgetless: Smart Context Optimization for LLMs](/oven/forgetless-smart-context-optimization-llms) - How Forgetless uses intelligent prioritization to compress content within token budgets
- [Advanced Memory Systems for AI Agents: Engineering for Persistence and Contextual Intelligence](/oven/advanced-memory-systems-ai-agents) - Engineering persistent and contextual memory systems for AI agents
- [The Technical Architecture of Modern AI Agents](/oven/technical-architecture-modern-ai-agents) - Architecture patterns behind modern AI agents and their context management