Building Agents with Vision in Astreus
Create agents that can see and understand images. Analyze screenshots, diagrams, photos, and visual content with multimodal AI capabilities.
Modern AI agents can process images just as naturally as they process text. Astreus makes it simple to build vision-enabled agents that analyze visual content, from UI screenshots to data visualizations.
Getting Started
First, install Astreus and set up your environment. You'll need an OpenAI API key with access to vision-capable models.
Create a .env file with your configuration:
OPENAI_API_KEY=sk-your-openai-api-key-here
DB_URL=sqlite://./astreus.db
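Before creating an agent, it can help to confirm that both variables are actually set. The following is a minimal sketch, not part of Astreus: `missingEnvKeys` is a hypothetical helper, and the environment is passed in as a plain object so the check is easy to test (in a Node.js app you would typically pass `process.env`).

```typescript
// Keys that this tutorial's .env file defines.
const REQUIRED_KEYS = ["OPENAI_API_KEY", "DB_URL"];

// Hypothetical helper: returns the names of required variables that are
// missing or blank in the given environment object.
function missingEnvKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Example: DB_URL was forgotten, so it is reported as missing.
const missingKeys = missingEnvKeys({
  OPENAI_API_KEY: "sk-your-openai-api-key-here",
});
console.log(missingKeys);
```

Failing fast on a missing key tends to produce a clearer error than a rejected API request later on.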
Creating a Vision Agent
Enable vision capabilities by setting the vision flag and specifying a vision-capable model. The gpt-4o model provides strong multimodal performance.
The visionModel parameter specifies which model handles image processing. Using the same model for both text and vision helps keep behavior consistent across the two kinds of input.
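As a rough sketch, the configuration described above could look like the object below. The field names come from this guide; the agent name and system prompt are illustrative values. The commented-out import and `Agent.create` call are assumptions about the Astreus API and are not executed here.

```typescript
// Field names follow this guide; the values are illustrative.
const visionConfig = {
  name: "VisionAgent", // hypothetical agent name
  model: "gpt-4o", // primary language model
  visionModel: "gpt-4o", // same model handles image processing
  vision: true, // enables image processing
  systemPrompt: "You are a helpful assistant that can analyze images.",
};

// Assumed Astreus usage (package and method names are not verified here):
//   import { Agent } from "@astreus-ai/astreus";
//   const agent = await Agent.create(visionConfig);
console.log(visionConfig.model === visionConfig.visionModel);
```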
Analyzing Images
Pass images to your agent using the attachments parameter. The agent processes both the text prompt and the image together.
The agent examines the image and provides a detailed description. You can ask specific questions to focus the analysis on particular aspects.
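A small sketch of how a question and its attachments might be assembled. The `type` and `path` properties follow the attachment shape described in this guide; `imageAttachment` is a hypothetical helper, the file path is a placeholder, and the `agent.ask` call in the comment is an assumption about the Astreus API.

```typescript
// Attachment shape described in this guide: a type and a path.
interface ImageAttachment {
  type: "image";
  path: string;
}

// Hypothetical helper that wraps a file path as an image attachment.
function imageAttachment(path: string): ImageAttachment {
  return { type: "image", path: path };
}

// A specific question focuses the analysis better than "describe this".
const question = "Describe this screenshot. What text is visible, and where?";
const askOptions = {
  attachments: [imageAttachment("./images/screenshot.png")], // placeholder path
};

// Assumed call shape (method name is an assumption):
//   const response = await agent.ask(question, askOptions);
console.log(JSON.stringify(askOptions));
```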
UI Design Review
Vision agents excel at evaluating interface designs. They spot inconsistencies, accessibility issues, and usability problems in mockups and screenshots.
The system prompt guides how the agent approaches visual analysis. Tailor it to your specific use case for more relevant feedback.
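One way to keep a review-oriented system prompt maintainable is to build it from a list of criteria. This is only a sketch: `buildSystemPrompt` and the criteria below are hypothetical, and the resulting string would be supplied as the agent's `systemPrompt` option.

```typescript
// Hypothetical helper: builds a system prompt from a role and a checklist.
function buildSystemPrompt(role: string, criteria: string[]): string {
  const bullets = criteria.map((item) => "- " + item).join("\n");
  return "You are " + role + ". When reviewing a design, check:\n" + bullets;
}

// Illustrative criteria matching the concerns named in this section.
const uiReviewPrompt = buildSystemPrompt("a senior UI/UX reviewer", [
  "visual consistency (spacing, typography, color)",
  "accessibility (contrast, text size, touch targets)",
  "usability (clear hierarchy, obvious primary action)",
]);

console.log(uiReviewPrompt);
```

Keeping the criteria in an array makes it easy to add or remove checks as your review standards change.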
Visual Debugging
Share error screenshots instead of transcribing stack traces. The agent reads error messages, identifies issues, and suggests fixes.
This natural workflow accelerates debugging. The agent examines the entire error context visible in the screenshot, often catching details you might miss when manually transcribing.
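For debugging, it may help to ask the agent to quote the error text before diagnosing it, so you can check the diagnosis against what is actually on screen. The sketch below is one possible phrasing; `buildDebugQuestion`, the context string, and the screenshot path are all hypothetical, and the `agent.ask` call in the comment is an assumed API shape.

```typescript
// Hypothetical helper: asks for the error text first, then cause and fix.
function buildDebugQuestion(context: string): string {
  return [
    "This screenshot shows an error. Context: " + context,
    "1. Quote the error message and any stack trace lines exactly as shown.",
    "2. Explain the most likely cause.",
    "3. Suggest a concrete fix.",
  ].join("\n");
}

const debugQuestion = buildDebugQuestion("Node.js 20, running the test suite");
const debugOptions = {
  attachments: [{ type: "image", path: "./screenshots/error.png" }], // placeholder
};

// Assumed call shape: await agent.ask(debugQuestion, debugOptions);
console.log(debugQuestion);
```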
Data Extraction
Extract structured data from invoices, receipts, and forms without complex OCR pipelines. Vision agents understand document layout contextually.
The agent recognizes field types and relationships, handling variations in format. It returns clean structured data ready for processing in your application.
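Model output is still text, so it is worth validating before it reaches the rest of your application. The sketch below assumes you asked the agent to reply with a JSON object; the `Invoice` shape and `parseInvoice` helper are hypothetical, and the sample reply is simulated rather than real agent output.

```typescript
// Hypothetical target shape; adapt the fields to your documents.
interface Invoice {
  invoiceNumber: string;
  total: number;
  currency: string;
}

// Parses the agent's text reply into an Invoice, or throws with a reason.
// Replies sometimes include prose around the JSON, so only the text between
// the first "{" and the last "}" is parsed.
function parseInvoice(reply: string): Invoice {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in reply");
  }
  const data = JSON.parse(reply.slice(start, end + 1));
  if (typeof data.invoiceNumber !== "string") throw new Error("invoiceNumber missing");
  if (typeof data.total !== "number" || isNaN(data.total)) throw new Error("total missing");
  if (typeof data.currency !== "string") throw new Error("currency missing");
  return { invoiceNumber: data.invoiceNumber, total: data.total, currency: data.currency };
}

// Simulated reply; a real one would come from the agent.
const sampleReply =
  'Here is the extracted data:\n{"invoiceNumber": "INV-001", "total": 129.5, "currency": "USD"}';
const invoice = parseInvoice(sampleReply);
console.log(invoice.invoiceNumber, invoice.total, invoice.currency);
```

Rejecting replies with missing or mistyped fields keeps extraction errors from silently propagating downstream.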
Chart Analysis
Analyze data visualizations to extract insights and trends. The agent interprets visual encodings like color, position, and size.
This works even when underlying data isn't available. The agent reads values from axes, identifies patterns, and highlights outliers directly from the visualization.
Comparing Multiple Images
Process multiple images simultaneously for side-by-side comparison. Pass multiple attachments in a single request.
The agent analyzes both images together, identifying specific differences and their impact. This enables sophisticated before-after analysis and A/B testing evaluation.
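A sketch of a two-image request. Because the question refers to the images by position, attachment order matters here. `buildComparison` and the file paths are hypothetical, and the `agent.ask` call in the comment is an assumed API shape.

```typescript
interface ImageAttachment {
  type: "image";
  path: string;
}

// Hypothetical helper: builds a before/after request. The "before" image is
// always first so the question can refer to the images by position.
function buildComparison(beforePath: string, afterPath: string) {
  const attachments: ImageAttachment[] = [
    { type: "image", path: beforePath },
    { type: "image", path: afterPath },
  ];
  return {
    question:
      "The first image is the current design and the second is the proposed redesign. " +
      "List the specific differences and explain how each one affects usability.",
    options: { attachments: attachments },
  };
}

const comparison = buildComparison("./designs/before.png", "./designs/after.png");
// Assumed call shape: await agent.ask(comparison.question, comparison.options);
console.log(comparison.options.attachments.map((a) => a.path).join(" vs "));
```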
Running Your Agent
Once you've built your vision agent, run it in your development environment.
The complete example repository is available at astreus-ai/agent-with-vision on GitHub. Clone it to explore the full implementation and experiment with different use cases.
Key Configuration Options
Understanding the configuration options helps you optimize your vision agents:
- name: Agent identifier for tracking and debugging
- model: Primary language model (gpt-4o recommended for vision)
- visionModel: Vision-specific model, typically matches the primary model
- vision: Boolean flag enabling image processing capabilities
- systemPrompt: Instructions that guide agent behavior and analysis approach
- attachments: Array of image references with type and path properties
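The agent-level options above can be summarized as a type. This is a sketch, not the library's own definition: which fields are optional is an assumption, and `checkOptions` is a hypothetical pre-flight check for combinations that may be unintended.

```typescript
// Agent-level options from the list above. Optionality is an assumption.
interface VisionAgentOptions {
  name: string;
  model: string;
  visionModel?: string;
  vision?: boolean;
  systemPrompt?: string;
}

// Hypothetical check (not part of Astreus): returns warnings for option
// combinations that are probably not what you meant.
function checkOptions(opts: VisionAgentOptions): string[] {
  const warnings: string[] = [];
  if (opts.name.trim() === "") {
    warnings.push("name is empty");
  }
  if (opts.vision && !opts.visionModel) {
    warnings.push("vision is enabled but visionModel is not set");
  }
  if (!opts.vision && opts.visionModel) {
    warnings.push("visionModel is set but vision is not enabled");
  }
  return warnings;
}

const optionWarnings = checkOptions({ name: "VisionAgent", model: "gpt-4o", vision: true });
console.log(optionWarnings);
```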
Image Input Methods
Astreus supports multiple ways to provide images. Local file paths are the most straightforward approach.
Relative paths work too, resolved from your project directory. Choose the method that fits your workflow and file organization.
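Before sending a path to the agent, a quick sanity check can catch typos and unsupported files. The helpers below are hypothetical. The extension list reflects formats that OpenAI's vision models accept at the time of writing; whether Astreus accepts exactly the same set is an assumption.

```typescript
// Formats accepted by OpenAI vision models at the time of writing.
// Treat this list as an assumption and adjust it for your setup.
const IMAGE_EXTENSIONS = [".png", ".jpg", ".jpeg", ".webp", ".gif"];

// True if the path ends with a known image extension (case-insensitive).
function hasImageExtension(path: string): boolean {
  const lower = path.toLowerCase();
  return IMAGE_EXTENSIONS.some((ext) => lower.slice(-ext.length) === ext);
}

// True for POSIX absolute paths ("/...") and Windows drive paths ("C:\...").
function isAbsolutePath(path: string): boolean {
  return path.charAt(0) === "/" || /^[A-Za-z]:[\\/]/.test(path);
}

const examplePaths = ["/home/user/images/chart.png", "./images/mockup.JPG", "notes.txt"];
for (const p of examplePaths) {
  console.log(p, { image: hasImageExtension(p), absolute: isAbsolutePath(p) });
}
```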
Crafting Effective Prompts
Specific prompts produce better results. Provide context about the image type and what aspects you want analyzed.
Frame questions around specific concerns or goals. Mention the target audience or use case to help the agent apply appropriate criteria in its analysis.
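To make the difference concrete, here is a sketch that assembles a specific question from the three ingredients mentioned above: image type, concerns, and audience. `buildQuestion` and the example values are hypothetical.

```typescript
interface QuestionContext {
  imageType: string; // what the image is, e.g. "checkout page screenshot"
  concerns: string[]; // aspects you want analyzed
  audience?: string; // who the product or result is for
}

// Hypothetical helper: turns context into a focused question.
function buildQuestion(ctx: QuestionContext): string {
  const parts = [
    "This image is a " + ctx.imageType + ".",
    "Focus on: " + ctx.concerns.join("; ") + ".",
  ];
  if (ctx.audience) {
    parts.push("The target audience is " + ctx.audience + ".");
  }
  return parts.join(" ");
}

// A vague question leaves the agent guessing what matters.
const vagueQuestion = "What do you think of this?";

// A specific question states the image type, the concerns, and the audience.
const specificQuestion = buildQuestion({
  imageType: "checkout page screenshot",
  concerns: ["form layout", "error message clarity"],
  audience: "first-time mobile shoppers",
});
console.log(vagueQuestion);
console.log(specificQuestion);
```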
Use Cases
Vision-enabled agents unlock powerful workflows across many domains:
Design & UX: Automated design review, accessibility audits, consistency checking across pages and components.
Development: Visual debugging from screenshots, code review from presentation slides, architecture diagram analysis.
Data Processing: Invoice and receipt processing, form data extraction, chart and graph interpretation.
Quality Assurance: Visual regression testing, screenshot comparison, UI compliance verification.
Performance Considerations
Image processing consumes more tokens than text-only interactions. Start with moderate resolution images and increase quality only when the agent misses important details. Balance quality with cost efficiency for your specific use case.
The gpt-4o model provides strong vision capabilities with reasonable token usage. Monitor your usage patterns to optimize for your workload.
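To get a feel for how resolution affects cost, the sketch below follows the tile-based accounting OpenAI has documented for gpt-4o-class models: a fixed base cost plus a per-tile cost for each 512px tile after the image is scaled down. Treat it as an approximation. The constants may change, other models use different rates, this version never upscales small images (an assumption), and how Astreus forwards images to the provider is not covered here. `estimateImageTokens` is a hypothetical helper.

```typescript
// Rough estimate of input tokens for one image, based on OpenAI's published
// tile-based accounting for gpt-4o-class models. Approximation only.
function estimateImageTokens(
  width: number,
  height: number,
  detail: "low" | "high" = "high"
): number {
  const BASE_TOKENS = 85;
  const TOKENS_PER_TILE = 170;
  if (detail === "low") return BASE_TOKENS; // low detail is a flat cost

  // Step 1: scale down (never up) to fit within a 2048 x 2048 square.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = Math.round(width * fit);
  let h = Math.round(height * fit);

  // Step 2: scale down (never up) so the shortest side is at most 768px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count the 512px tiles needed to cover the image.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}

console.log(estimateImageTokens(1024, 1024)); // 765
console.log(estimateImageTokens(2048, 4096)); // 1105
console.log(estimateImageTokens(4000, 3000, "low")); // 85
```

Under this accounting, images larger than the scaled-down size add no extra tokens, which is one reason to start with moderate resolutions.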
Building Specialized Agents
Create domain-specific agents by tailoring system prompts. This focuses analysis on relevant criteria for your use case.
Specialized agents provide more relevant insights because they apply domain-appropriate evaluation criteria. The system prompt acts as their expertise and guides their analytical approach.
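One way to manage several specialized agents is to derive each configuration from a shared base and vary only the name and system prompt. This is a sketch under that assumption; `specialize`, the agent names, and the prompts are hypothetical, and the `Agent.create` call in the comment is an assumed API shape.

```typescript
// Settings shared by every specialized agent in this sketch.
const baseVisionConfig = {
  model: "gpt-4o",
  visionModel: "gpt-4o",
  vision: true,
};

// Hypothetical factory: copies the base settings and adds domain expertise.
function specialize(agentName: string, systemPrompt: string) {
  return { ...baseVisionConfig, name: agentName, systemPrompt: systemPrompt };
}

const specialists = [
  specialize(
    "AccessibilityAuditor",
    "You are an accessibility expert. Evaluate contrast, text size, and touch targets."
  ),
  specialize(
    "InvoiceExtractor",
    "You extract invoice fields and reply with a single JSON object."
  ),
];

// Assumed usage: await Agent.create(specialists[0]);
for (const config of specialists) {
  console.log(config.name + ": " + config.systemPrompt);
}
```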
Next Steps
Start with simple tasks like image description to build intuition. Experiment with prompt phrasing to understand how different approaches affect output quality and focus.
As you gain experience, combine vision with other Astreus capabilities. Build specialized agents for your specific visual analysis needs. The example repository at astreus-ai/agent-with-vision provides a solid foundation to explore and extend.
Vision capabilities open up entirely new interaction patterns. Agents that can see bridge the gap between human visual communication and AI processing, enabling more natural and powerful workflows.
This example was written for Astreus v0.5.37. Make sure you are using a compatible version.