Building Agents with Vision in Astreus

Create agents that can see and understand images. Analyze screenshots, diagrams, photos, and visual content with multimodal AI capabilities.

Modern AI agents can process images just as naturally as they process text. Astreus makes it simple to build vision-enabled agents that analyze visual content, from UI screenshots to data visualizations.

## Getting Started

First, install Astreus and set up your environment. You'll need an OpenAI API key with access to vision-capable models.

```bash
npm install @astreus-ai/astreus
```

Create a `.env` file with your configuration:

```
OPENAI_API_KEY=sk-your-openai-api-key-here
DB_URL=sqlite://./astreus.db
```

## Creating a Vision Agent

Enable vision capabilities by setting the `vision` flag and specifying a vision-capable model. The `gpt-4o` model provides strong multimodal performance.

```javascript
import { Agent } from '@astreus-ai/astreus';

const agent = await Agent.create({
  name: 'VisionBot',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: 'You can analyze and describe images in detail.'
});
```

The `visionModel` parameter specifies which model handles image processing. Using the same model for both text and vision ensures consistent behavior.

## Analyzing Images

Pass images to your agent using the `attachments` parameter. The agent processes both the text prompt and the image together.

```javascript
const result = await agent.ask(
  "Analyze this image and describe what you see",
  {
    attachments: [{ type: 'image', path: './screenshot.png' }]
  }
);

console.log(result); // Returns detailed image analysis
```

The agent examines the image and provides a detailed description. You can ask specific questions to focus the analysis on particular aspects.

## UI Design Review

Vision agents excel at evaluating interface designs. They spot inconsistencies, accessibility issues, and usability problems in mockups and screenshots.
```javascript
const designAgent = await Agent.create({
  name: 'UIReviewer',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: `Analyze UI designs for:
    - Visual hierarchy and balance
    - Color contrast (WCAG standards)
    - Spacing consistency
    - Responsive design considerations`
});

const review = await designAgent.ask(
  'Evaluate this mobile app screen for accessibility issues.',
  {
    attachments: [{ type: 'image', path: './app-screen.png' }]
  }
);
```

The system prompt guides how the agent approaches visual analysis. Tailor it to your specific use case for more relevant feedback.

## Visual Debugging

Share error screenshots instead of transcribing stack traces. The agent reads error messages, identifies issues, and suggests fixes.

```javascript
const debugAgent = await Agent.create({
  name: 'DebugHelper',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: 'You are a senior developer helping debug code issues.'
});

const diagnosis = await debugAgent.ask(
  'This error appeared when I tried to save. What is wrong?',
  {
    attachments: [{ type: 'image', path: './error-screenshot.png' }]
  }
);
```

This natural workflow accelerates debugging. The agent examines the entire error context visible in the screenshot, often catching details you might miss when manually transcribing.

## Data Extraction

Extract structured data from invoices, receipts, and forms without complex OCR pipelines. Vision agents understand document layout contextually.

```javascript
const extractionAgent = await Agent.create({
  name: 'DataExtractor',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true
});

const data = await extractionAgent.ask(
  'Extract invoice details as JSON: number, date, total, items.',
  {
    attachments: [{ type: 'image', path: './invoice.png' }]
  }
);
```

The agent recognizes field types and relationships, handling variations in format. It returns clean structured data ready for processing in your application.
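In practice, a model reply can wrap the JSON in explanatory text, so it may be worth parsing the result defensively before using it. The sketch below assumes `ask` resolves to a string. `parseJsonReply` and `validateInvoice` are hypothetical helpers, not part of Astreus, and the sample reply is hard-coded for illustration rather than taken from a real run.

```javascript
// Hypothetical helper (not an Astreus API): pull a JSON object out of a reply
// that may contain prose before or after it.
function parseJsonReply(reply) {
  // Take everything between the first "{" and the last "}".
  const start = reply.indexOf('{');
  const end = reply.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in reply');
  }
  return JSON.parse(reply.slice(start, end + 1));
}

// Hypothetical check for the four fields requested in the extraction prompt.
function validateInvoice(invoice) {
  const missing = ['number', 'date', 'total', 'items'].filter(
    (key) => !(key in invoice)
  );
  if (missing.length > 0) {
    throw new Error(`Invoice is missing fields: ${missing.join(', ')}`);
  }
  if (!Array.isArray(invoice.items)) {
    throw new Error('Invoice "items" should be an array');
  }
  return invoice;
}

// Hard-coded stand-in for the string returned by extractionAgent.ask(...).
const sampleReply =
  'Here is the invoice data:\n' +
  '{"number":"INV-001","date":"2024-01-15","total":120.5,' +
  '"items":[{"description":"Widget","quantity":2}]}\n' +
  'Let me know if anything looks off.';

const invoice = validateInvoice(parseJsonReply(sampleReply));
console.log(invoice.total); // 120.5
```

Failing loudly on a missing field keeps malformed extractions from flowing silently into downstream code. The brace-matching approach is deliberately simple and would need refinement if the reply could contain more than one JSON object.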
## Chart Analysis

Analyze data visualizations to extract insights and trends. The agent interprets visual encodings like color, position, and size.

```javascript
const analysis = await agent.ask(
  'Summarize the key trends shown in this sales chart.',
  {
    attachments: [{ type: 'image', path: './sales-chart.png' }]
  }
);
```

This works even when underlying data isn't available. The agent reads values from axes, identifies patterns, and highlights outliers directly from the visualization.

## Comparing Multiple Images

Process multiple images simultaneously for side-by-side comparison. Pass multiple attachments in a single request.

```javascript
const comparison = await agent.ask(
  'Compare these two designs. Which has better visual hierarchy?',
  {
    attachments: [
      { type: 'image', path: './design-v1.png' },
      { type: 'image', path: './design-v2.png' }
    ]
  }
);
```

The agent analyzes both images together, identifying specific differences and their impact. This enables sophisticated before-after analysis and A/B testing evaluation.

## Running Your Agent

Once you've built your vision agent, run it in your development environment:

```bash
npm run dev
```

The complete example repository is available at `astreus-ai/agent-with-vision` on GitHub. Clone it to explore the full implementation and experiment with different use cases.

## Key Configuration Options

Understanding the configuration options helps you optimize your vision agents:

- **name**: Agent identifier for tracking and debugging
- **model**: Primary language model (gpt-4o recommended for vision)
- **visionModel**: Vision-specific model, typically matches the primary model
- **vision**: Boolean flag enabling image processing capabilities
- **systemPrompt**: Instructions that guide agent behavior and analysis approach
- **attachments**: Array of image references with type and path properties

## Image Input Methods

Astreus supports multiple ways to provide images.
Use local file paths for the most straightforward approach:

```javascript
attachments: [{ type: 'image', path: '/absolute/path/to/image.png' }]
```

Relative paths work too, resolved from your project directory. Choose the method that fits your workflow and file organization.

## Crafting Effective Prompts

Specific prompts produce better results. Provide context about the image type and what aspects you want analyzed.

```javascript
// Generic (less effective)
await agent.ask('What do you see?', { attachments: [...] });

// Specific (more effective)
await agent.ask(
  'This is a mobile checkout screen. Identify any usability issues that might prevent users from completing their purchase.',
  { attachments: [...] }
);
```

Frame questions around specific concerns or goals. Mention the target audience or use case to help the agent apply appropriate criteria in its analysis.

## Use Cases

Vision-enabled agents unlock powerful workflows across many domains:

**Design & UX**: Automated design review, accessibility audits, consistency checking across pages and components.

**Development**: Visual debugging from screenshots, code review from presentation slides, architecture diagram analysis.

**Data Processing**: Invoice and receipt processing, form data extraction, chart and graph interpretation.

**Quality Assurance**: Visual regression testing, screenshot comparison, UI compliance verification.

## Performance Considerations

Image processing consumes more tokens than text-only interactions. Start with moderate resolution images and increase quality only when the agent misses important details. Balance quality with cost efficiency for your specific use case.

The `gpt-4o` model provides strong vision capabilities with reasonable token usage. Monitor your usage patterns to optimize for your workload.

## Building Specialized Agents

Create domain-specific agents by tailoring system prompts. This focuses analysis on relevant criteria for your use case.
```javascript
// E-commerce product reviewer
const productAgent = await Agent.create({
  name: 'ProductReviewer',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: `Evaluate product photos for e-commerce. Check:
    lighting quality, background cleanliness, product visibility,
    angle appropriateness, color accuracy indicators.`
});

// Medical diagram analyzer
const medicalAgent = await Agent.create({
  name: 'MedicalAnalyzer',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: `Analyze medical diagrams and charts. Focus on:
    anatomical accuracy, labeling clarity, educational value.`
});
```

Specialized agents provide more relevant insights because they apply domain-appropriate evaluation criteria. The system prompt acts as their expertise and guides their analytical approach.

## Next Steps

Start with simple tasks like image description to build intuition. Experiment with prompt phrasing to understand how different approaches affect output quality and focus.

As you gain experience, combine vision with other Astreus capabilities. Build specialized agents for your specific visual analysis needs. The example repository at `astreus-ai/agent-with-vision` provides a solid foundation to explore and extend.

Vision capabilities open up entirely new interaction patterns. Agents that can see bridge the gap between human visual communication and AI processing, enabling more natural and powerful workflows.

This experiment is written for Astreus v0.5.37. Please ensure you are using a compatible version.