Building Agents with Vision in Astreus

December 1, 2025 by Berke

Create agents that can see and understand images. Analyze screenshots, diagrams, photos, and visual content with multimodal AI capabilities.


Modern AI agents can process images just as naturally as they process text. Astreus makes it simple to build vision-enabled agents that analyze visual content, from UI screenshots to data visualizations.

Getting Started

First, install Astreus and set up your environment. You'll need an OpenAI API key with access to vision-capable models.

Bash
npm install @astreus-ai/astreus

Create a .env file with your configuration:

OPENAI_API_KEY=sk-your-openai-api-key-here
DB_URL=sqlite://./astreus.db
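
Before creating an agent, it can help to fail fast when configuration is missing. The sketch below assumes both variables from the .env file are needed and have already been loaded into process.env (how that happens depends on your setup). `missingEnv` is a hypothetical helper, not part of Astreus.

```javascript
// Hypothetical helper (not part of Astreus): list the required variables
// that are unset or blank in the given environment object.
function missingEnv(env, required = ['OPENAI_API_KEY', 'DB_URL']) {
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

// Warn early instead of failing later inside an agent call.
const missing = missingEnv(process.env);
if (missing.length > 0) {
  console.warn(`Missing configuration: ${missing.join(', ')}`);
}
```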

Creating a Vision Agent

Enable vision capabilities by setting the vision flag and specifying a vision-capable model. The gpt-4o model provides strong multimodal performance.

JavaScript
import { Agent } from '@astreus-ai/astreus';

const agent = await Agent.create({
  name: 'VisionBot',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: 'You can analyze and describe images in detail.'
});

The visionModel parameter specifies which model handles image processing. Using the same model for both text and vision ensures consistent behavior.

Analyzing Images

Pass images to your agent using the attachments parameter. The agent processes both the text prompt and the image together.

JavaScript
const result = await agent.ask(
  "Analyze this image and describe what you see",
  {
    attachments: [{
      type: 'image',
      path: './screenshot.png'
    }]
  }
);

console.log(result); // Print the agent's analysis of the image

The agent examines the image and provides a detailed description. You can ask specific questions to focus the analysis on particular aspects.
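
To see how focused questions change the output, one option is to ask several of them about the same image in sequence. This is a minimal sketch: `askFocused` is a hypothetical helper, and it relies only on the `agent.ask(prompt, { attachments })` call shown above.

```javascript
// Hypothetical helper: ask each question about the same image, one at a time,
// and collect the answers keyed by question text.
async function askFocused(agent, imagePath, questions) {
  const answers = {};
  for (const question of questions) {
    answers[question] = await agent.ask(question, {
      attachments: [{ type: 'image', path: imagePath }]
    });
  }
  return answers;
}
```

Calling `askFocused(agent, './screenshot.png', ['What text is visible?', 'Which colors dominate?'])` resolves to an object with one answer per question.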

UI Design Review

Vision agents excel at evaluating interface designs. They spot inconsistencies, accessibility issues, and usability problems in mockups and screenshots.

JavaScript
const designAgent = await Agent.create({
  name: 'UIReviewer',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: `Analyze UI designs for:
  - Visual hierarchy and balance
  - Color contrast (WCAG standards)
  - Spacing consistency
  - Responsive design considerations`
});

const review = await designAgent.ask(
  'Evaluate this mobile app screen for accessibility issues.',
  {
    attachments: [{
      type: 'image',
      path: './app-screen.png'
    }]
  }
);

The system prompt guides how the agent approaches visual analysis. Tailor it to your specific use case for more relevant feedback.
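
One way to keep review prompts easy to tailor is to generate them from a list of criteria. `buildReviewPrompt` below is a hypothetical helper. It produces a string in the same bulleted shape as the system prompt above.

```javascript
// Hypothetical helper: build a system prompt from an intro line and a list of criteria.
function buildReviewPrompt(intro, criteria) {
  const bullets = criteria.map((item) => `- ${item}`);
  return [`${intro}:`, ...bullets].join('\n');
}

const systemPrompt = buildReviewPrompt('Analyze UI designs for', [
  'Visual hierarchy and balance',
  'Color contrast (WCAG standards)',
  'Spacing consistency',
  'Responsive design considerations'
]);
```

Swapping or adding criteria then changes the reviewer's focus without editing a long template string.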

Visual Debugging

Share error screenshots instead of transcribing stack traces. The agent reads error messages, identifies issues, and suggests fixes.

JavaScript
const debugAgent = await Agent.create({
  name: 'DebugHelper',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: 'You are a senior developer helping debug code issues.'
});

const diagnosis = await debugAgent.ask(
  'This error appeared when I tried to save. What is wrong?',
  {
    attachments: [{
      type: 'image',
      path: './error-screenshot.png'
    }]
  }
);

This natural workflow accelerates debugging. The agent examines the entire error context visible in the screenshot, often catching details you might miss when manually transcribing.

Data Extraction

Extract structured data from invoices, receipts, and forms without complex OCR pipelines. Vision agents understand document layout contextually.

JavaScript
const extractionAgent = await Agent.create({
  name: 'DataExtractor',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true
});

const data = await extractionAgent.ask(
  'Extract invoice details as JSON: number, date, total, items.',
  {
    attachments: [{
      type: 'image',
      path: './invoice.png'
    }]
  }
);

The agent recognizes field types and relationships and tolerates variations in format. The reply arrives as text, so check that it is valid JSON with the fields you asked for before processing it in your application.
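
Parsing the reply might look like the sketch below. It assumes the reply is either raw JSON or JSON wrapped in a Markdown code fence, which models often produce. `parseInvoiceReply` and its required-field list (taken from the prompt above) are illustrative, not an Astreus API.

```javascript
// Hypothetical helper: pull a JSON object out of a model reply and check
// that the fields requested in the prompt are present.
function parseInvoiceReply(reply, required = ['number', 'date', 'total', 'items']) {
  // Unwrap a Markdown code fence (three backticks) if the model added one.
  const fenced = reply.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/);
  const jsonText = (fenced ? fenced[1] : reply).trim();

  const invoice = JSON.parse(jsonText); // throws SyntaxError on malformed JSON
  if (invoice === null || typeof invoice !== 'object') {
    throw new Error('Invoice reply is not a JSON object.');
  }

  const missing = required.filter((field) => !(field in invoice));
  if (missing.length > 0) {
    throw new Error(`Invoice reply is missing: ${missing.join(', ')}`);
  }
  return invoice;
}
```

Wrapping the call in try/catch lets you retry or flag the document for manual review when the reply does not parse.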

Chart Analysis

Analyze data visualizations to extract insights and trends. The agent interprets visual encodings like color, position, and size.

JavaScript
const analysis = await agent.ask(
  'Summarize the key trends shown in this sales chart.',
  {
    attachments: [{
      type: 'image',
      path: './sales-chart.png'
    }]
  }
);

This works even when the underlying data isn't available. The agent estimates values from the axes, identifies patterns, and highlights outliers directly from the visualization, so treat any numbers it reports as approximate.

Comparing Multiple Images

Process multiple images simultaneously for side-by-side comparison. Pass multiple attachments in a single request.

JavaScript
const comparison = await agent.ask(
  'Compare these two designs. Which has better visual hierarchy?',
  {
    attachments: [
      { type: 'image', path: './design-v1.png' },
      { type: 'image', path: './design-v2.png' }
    ]
  }
);

The agent analyzes both images together, identifying specific differences and their impact. This enables sophisticated before-after analysis and A/B testing evaluation.
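
When the set of images varies, the attachments array can be built from a list of paths. Both functions below are hypothetical conveniences. They produce the same `{ type, path }` objects used throughout this post and rely only on `agent.ask`.

```javascript
// Hypothetical helper: turn file paths into image attachment objects.
function toImageAttachments(paths) {
  return paths.map((path) => ({ type: 'image', path }));
}

// Hypothetical helper: compare two or more images in a single request.
async function compareImages(agent, question, paths) {
  if (paths.length < 2) {
    throw new Error('Provide at least two images to compare.');
  }
  return agent.ask(question, { attachments: toImageAttachments(paths) });
}
```

For example, `compareImages(agent, 'Which version has better visual hierarchy?', ['./design-v1.png', './design-v2.png'])` sends both designs in one call.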

Running Your Agent

Once you've built your vision agent, run it in your development environment:

Bash
npm run dev

The complete example repository is available at astreus-ai/agent-with-vision on GitHub. Clone it to explore the full implementation and experiment with different use cases.

Key Configuration Options

Understanding the configuration options helps you optimize your vision agents:

name: Agent identifier for tracking and debugging
model: Primary language model (gpt-4o recommended for vision)
visionModel: Vision-specific model, typically matches the primary model
vision: Boolean flag enabling image processing capabilities
systemPrompt: Instructions that guide agent behavior and analysis approach
attachments: Array of image references with type and path properties
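
Since every agent in this post repeats the same three vision settings, a small factory can keep them in one place. This is a sketch that assumes only the options listed above. `visionConfig` is a hypothetical helper whose result would be passed to `Agent.create`.

```javascript
// Hypothetical helper: build the options object for a vision-enabled agent.
function visionConfig(name, systemPrompt, model = 'gpt-4o') {
  const config = { name, model, visionModel: model, vision: true };
  // Only include systemPrompt when one is given, as in the DataExtractor example.
  if (systemPrompt) config.systemPrompt = systemPrompt;
  return config;
}

const debugConfig = visionConfig(
  'DebugHelper',
  'You are a senior developer helping debug code issues.'
);
// debugConfig can then be passed to Agent.create(), as in the examples above.
```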

Image Input Methods

Astreus supports multiple ways to provide images. Use local file paths for the most straightforward approach:

JavaScript
attachments: [{
  type: 'image',
  path: '/absolute/path/to/image.png'
}]

Relative paths work too, resolved from your project directory. Choose the method that fits your workflow and file organization.

Crafting Effective Prompts

Specific prompts produce better results. Provide context about the image type and what aspects you want analyzed.

JavaScript
// Generic (less effective)
await agent.ask('What do you see?', { attachments: [...] });

// Specific (more effective)
await agent.ask(
  'This is a mobile checkout screen. Identify any usability issues that might prevent users from completing their purchase.',
  { attachments: [...] }
);

Frame questions around specific concerns or goals. Mention the target audience or use case to help the agent apply appropriate criteria in its analysis.
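
A prompt template is one way to make that context hard to forget. The field names below (`imageType`, `goal`, `audience`) are hypothetical. The point is only that the image type and the goal are always stated.

```javascript
// Hypothetical helper: compose a specific prompt from context about the image.
function buildVisionPrompt({ imageType, goal, audience }) {
  if (!imageType || !goal) {
    throw new Error('imageType and goal are required for a specific prompt.');
  }
  const parts = [`This is ${imageType}.`, goal];
  if (audience) parts.push(`The target audience is ${audience}.`);
  return parts.join(' ');
}

const prompt = buildVisionPrompt({
  imageType: 'a mobile checkout screen',
  goal: 'Identify any usability issues that might prevent users from completing their purchase.',
  audience: 'first-time shoppers'
});
```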

Use Cases

Vision-enabled agents unlock powerful workflows across many domains:

Design & UX: Automated design review, accessibility audits, consistency checking across pages and components.

Development: Visual debugging from screenshots, code review from presentation slides, architecture diagram analysis.

Data Processing: Invoice and receipt processing, form data extraction, chart and graph interpretation.

Quality Assurance: Visual regression testing, screenshot comparison, UI compliance verification.

Performance Considerations

Image processing consumes more tokens than text-only interactions. Start with moderate resolution images and increase quality only when the agent misses important details. Balance quality with cost efficiency for your specific use case.

The gpt-4o model provides strong vision capabilities with reasonable token usage. Monitor your usage patterns to optimize for your workload.
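
To get a feel for how resolution drives cost, the sketch below estimates image tokens using the tile-based rule OpenAI has documented for gpt-4o in high-detail mode: 85 base tokens plus 170 per 512-pixel tile after downscaling. Treat it as a rough estimate only. The rule may change, the sketch assumes small images are not upscaled, and this post does not say which detail setting Astreus requests or whether it resizes images first.

```javascript
// Rough estimate of gpt-4o image tokens in high-detail mode.
// Assumption: follows OpenAI's published tile rule; this is not an Astreus API.
function estimateImageTokens(width, height) {
  let w = width;
  let h = height;

  // Step 1: fit within a 2048 x 2048 square, keeping the aspect ratio.
  const fit = Math.min(1, 2048 / Math.max(w, h));
  w *= fit;
  h *= fit;

  // Step 2: scale down so the shortest side is at most 768 px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;

  // Step 3: count 512 px tiles; 170 tokens per tile plus 85 base tokens.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateImageTokens(1024, 1024)); // 765
console.log(estimateImageTokens(2048, 4096)); // 1105
```

Under this estimate, downscaling a 1024 x 1024 screenshot to 512 x 512 drops the cost from 765 to 255 tokens, which is why starting at moderate resolution is usually worthwhile.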

Building Specialized Agents

Create domain-specific agents by tailoring system prompts. This focuses analysis on relevant criteria for your use case.

JavaScript
// E-commerce product reviewer
const productAgent = await Agent.create({
  name: 'ProductReviewer',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: `Evaluate product photos for e-commerce.
  Check: lighting quality, background cleanliness, product visibility,
  angle appropriateness, color accuracy indicators.`
});

// Medical diagram analyzer
const medicalAgent = await Agent.create({
  name: 'MedicalAnalyzer',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: `Analyze medical diagrams and charts.
  Focus on: anatomical accuracy, labeling clarity, educational value.`
});

Specialized agents provide more relevant insights because they apply domain-appropriate evaluation criteria. The system prompt acts as their expertise and guides their analytical approach.

Next Steps

Start with simple tasks like image description to build intuition. Experiment with prompt phrasing to understand how different approaches affect output quality and focus.

As you gain experience, combine vision with other Astreus capabilities. Build specialized agents for your specific visual analysis needs. The example repository at astreus-ai/agent-with-vision provides a solid foundation to explore and extend.

Vision capabilities open up entirely new interaction patterns. Agents that can see bridge the gap between human visual communication and AI processing, enabling more natural and powerful workflows.

This experiment is written for Astreus v0.5.37. Please ensure you are using a compatible version.
