vLLora - Debug your agents in realtime Blog

Introducing Lucy: Trace-Native Debugging Inside vLLora

Tue, 20 Jan 2026 00:00:00 GMT

Your agent fails midway through a task. The trace is right there in vLLora, but it's 200 spans deep. You start scrolling, scanning for the red error or the suspicious tool call. Somewhere in those spans is the answer, but finding it takes longer than it should.

Today we're launching Lucy, an AI assistant built directly into vLLora that reads your traces and tells you what went wrong. You ask a question in plain English, Lucy inspects the trace, and you get a diagnosis with concrete next steps. Lucy is available now in beta.

Why finding out what went wrong is hard

Agent failures don’t look like traditional exceptions. A single bad response is usually the result of a chain of small choices spread across a long execution.

Long traces: One thread can include hundreds of spans across model calls, tool calls, retries, and fallbacks.
Delayed symptoms: The root cause often happens early, but only becomes visible much later in the run.
Silent degradation: A thread can be marked "successful" while actually running with missing data, wrong assumptions, or a broken tool path.

When debugging becomes "scroll until you get lucky," you miss important signals and burn time (and tokens) doing it.

Lucy is good at exactly this: reading the trace end-to-end, spotting failure patterns, and turning them into actionable fixes.

What you can ask Lucy to do

Lucy sits next to your traces and threads. Ask a plain-English question, and it will inspect the trace, flag failure points, and return a fix-oriented report: root cause, impact, and recommended next steps.

Ask Lucy questions like:

Analyze this thread for issues
Check for errors in this thread
Show me the slowest operations
What's the total cost?

Lucy can also help you spot patterns across multiple failing runs and suggest prompt rewrites to reduce ambiguity.

"What’s wrong with my thread?"

We had a travel agent that had been running for a long time, apparently stuck in a loop within the BetweenHorizonalEnd span. Instead of digging through the logs manually, we asked Lucy:

What's wrong with my thread?

Lucy inspected the thread's spans, identified a recurring failure pattern, and explained the root cause and impact, along with concrete next steps.

What Lucy found: Schema mismatches and contradictory prompts

In this trace, the agent was failing to complete a travel itinerary. Lucy didn't just flag the error; it identified a complex failure pattern involving both the code (schema) and the instructions (prompt).

1. The "Hallucinated" Arguments: Lucy pinpointed exactly why the tools were failing. The model was trying to call research_flights with a from_city argument and research_accommodations with check_in_date.

The Diagnosis: "Severe Tool Schema Mismatch."
The Reality: These arguments didn't exist in the registered tool definition, causing the model to hit a wall of unexpected keyword errors.

2. The Hidden Logic Trap: Lucy also found a root cause that a human scanning logs would likely miss, a Prompt Contradiction.

The Conflict: The system prompt instructed the agent to "prefer analysis only" while simultaneously telling it that it "MUST call tools."
The Result: The model was paralyzed between two opposing instructions, leading to the erratic tool behavior.

3. Silent Failures (Truncation): Lucy also caught a silent degradation issue, Severe Output Truncation. The Restaurant Extraction step was hitting token limits and cutting off data mid-list (output_tokens: 4000... truncated). The run looked "successful" to the server, but the downstream user was getting incomplete data.

Lucy’s report turned a vague "it's not working" complaint into three distinct engineering tasks: fix the tool schema, clarify the system prompt, and raise the output token limit for extraction.

Caption: Lucy analyzes the trace and detects multiple issues simultaneously: invalid tool arguments (from_city), contradictory system prompt instructions, and token truncation in the output.
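To make the truncation finding concrete, here is a minimal sketch of how such a check could look over exported span data. It is not Lucy's implementation. The span field names (finish_reason, output_tokens, max_output_tokens) and the sample values are assumptions for illustration, loosely modeled on OpenAI-style responses where a finish reason of "length" signals a cut-off output.

```python
def find_truncated_spans(spans: list[dict]) -> list[str]:
    """Return names of spans whose output appears to have hit the token limit."""
    truncated = []
    for span in spans:
        # Assumed convention: "length" means the model stopped at the token cap.
        hit_limit = span.get("finish_reason") == "length"
        cap = span.get("max_output_tokens")
        at_cap = cap is not None and span.get("output_tokens", 0) >= cap
        if hit_limit or at_cap:
            truncated.append(span["name"])
    return truncated


# Hypothetical span records; the real trace format may differ.
spans = [
    {"name": "plan_itinerary", "output_tokens": 812,
     "max_output_tokens": 4000, "finish_reason": "stop"},
    {"name": "restaurant_extraction", "output_tokens": 4000,
     "max_output_tokens": 4000, "finish_reason": "length"},
]
print(find_truncated_spans(spans))
```

A check like this flags runs that report success while returning incomplete data, which is the case error metrics miss.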

Why this matters in production

This is a common failure mode in tool-using agents: when the tool contract isn't perfectly aligned (schema, handler, prompt, examples), the model starts guessing.

The cost isn't limited to a single failed call:

Latency increases as the agent retries and thrashes on deterministic validation failures
Cost increases as token usage accumulates across repeated attempts
Quality degrades when the agent gives up on structured tools and improvises without real data

Even if your run "succeeds," you can still be paying for broken execution paths.
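One way to catch this kind of contract drift is to compare the argument names a model actually sent against the names the tool accepts. The sketch below assumes a hypothetical in-memory registry. The accepted parameter names in it are invented for illustration and are not the real schema of these tools.

```python
# Hypothetical registry: tool name -> parameter names the tool accepts.
# These accepted names are made up for the example.
REGISTERED_TOOLS = {
    "research_flights": {"origin", "destination", "departure_date"},
    "research_accommodations": {"city", "start_date", "end_date"},
}


def unknown_arguments(tool_name: str, arguments: dict) -> set[str]:
    """Return argument names that are not part of the registered schema."""
    accepted = REGISTERED_TOOLS.get(tool_name)
    if accepted is None:
        raise KeyError(f"tool {tool_name!r} is not registered")
    return set(arguments) - accepted


# Tool calls as they might appear in a trace (values are illustrative).
calls = [
    ("research_flights", {"from_city": "NYC", "departure_date": "2025-02-20"}),
    ("research_accommodations", {"city": "Tokyo", "check_in_date": "2025-02-20"}),
]
for name, args in calls:
    extra = unknown_arguments(name, args)
    if extra:
        print(f"{name}: unknown arguments {sorted(extra)}")
```

Running a check like this over recent traces, or in a test that replays recorded tool calls, surfaces mismatches before they turn into retry loops.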

How Lucy works with vLLora tracing

Lucy's intelligence comes from vLLora's tracing infrastructure. vLLora captures everything your agent does:

Spans: Individual operations like LLM calls, tool executions, and retrieval steps
Runs: A single execution of your agent, made up of a tree of spans
Threads: A full conversation, containing multiple runs over time

When you ask Lucy a question, it pulls the relevant spans and runs, reconstructs the execution flow, and analyzes patterns across the data. This is context that would take a human hours to piece together manually.
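The span, run, and thread hierarchy can be pictured with a small data model. This is a sketch under our own assumptions, not vLLora's actual schema: the class fields, the walk helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    name: str
    kind: str  # e.g. "llm", "tool", "retrieval"
    duration_ms: float
    error: Optional[str] = None
    children: list["Span"] = field(default_factory=list)


@dataclass
class Run:
    run_id: str
    root: Span  # a run is a tree of spans


@dataclass
class Thread:
    thread_id: str
    runs: list[Run] = field(default_factory=list)


def walk(span: Span) -> Iterator[Span]:
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def failing_spans(thread: Thread) -> list[Span]:
    """Collect every span in the thread that recorded an error."""
    return [s for run in thread.runs for s in walk(run.root) if s.error]


def slowest_spans(thread: Thread, n: int = 3) -> list[Span]:
    """Return the n longest-running spans across all runs in the thread."""
    spans = [s for run in thread.runs for s in walk(run.root)]
    return sorted(spans, key=lambda s: s.duration_ms, reverse=True)[:n]


root = Span("agent_run", "run", 5200.0, children=[
    Span("plan", "llm", 1800.0),
    Span("research_flights", "tool", 40.0,
         error="unexpected keyword argument 'from_city'"),
    Span("summarize", "llm", 2400.0),
])
thread = Thread("thread-1", [Run("run-1", root)])
print([s.name for s in failing_spans(thread)])
print([s.name for s in slowest_spans(thread, 2)])
```

Questions like "Check for errors in this thread" or "Show me the slowest operations" map naturally onto traversals of this kind.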

Get started

Lucy is available now in beta for all vLLora users.

Click the Lucy icon in the bottom right corner.
Ask a question like "What's wrong with my thread?"
Get an instant diagnosis without leaving your workflow.

Lucy will inspect your active context and give you a clear diagnosis, so you can spend less time scrolling and more time shipping.

See the full Lucy documentation here

Silent Failures: Why a “Successful” LLM Workflow Can Cost 40% More

Wed, 31 Dec 2025 00:00:00 GMT

Your agent returns the right answer. The status is 200 OK, and the user walks away satisfied. On the surface, everything looks fine. But when you check the API bill, it doesn’t line up with how simple the task actually was.

LLMs are unusually resilient. When a tool call fails, they don’t stop execution. They try again with small variations. When a response looks off, they adjust and keep going. That behavior is often helpful, but it can also hide broken execution paths. The user sees a successful result, while your token usage quietly absorbs retries, fallbacks, and extra reasoning that never needed to happen.

The Illusion of Success

When an agent returns the correct output and the logs are clean, we assume the logic is sound. However, LLM resilience introduces a new debugging challenge.

Standard Software: Invalid parameters trigger immediate exceptions. You see the stack trace and fix the bug.
LLMs: If a tool call fails, the workflow doesn't crash with an error. Most SDKs have a built-in retry mechanism, so the model retries with new arguments, switches strategies, or forces a solution.

This resilience masks architectural issues. The agent produces the correct output while quietly absorbing retries and extra reasoning steps.

Standard observability tools catch crashes but often miss these silent performance leaks. A "successful" run looks identical to an optimized one on a dashboard, even if it performed three times the necessary work.

The Suspect: A Slow Travel Agent

To make this concrete, consider a simple travel planning agent. It takes a destination, travel dates, and a few preferences, then generates a five-day itinerary.

From a functional perspective, the agent behaves as expected. Each run produces a reasonable itinerary that matches the user’s request, and there are no visible errors or user complaints.

The problem shows up when you look at the metrics:

Output: Correct itinerary
Status: 200 OK
Time Taken: 361 seconds (over 6 minutes)
Cost: $0.068 per run

For a task of this scope, both the time taken and the cost were unusually high. A closer look raised an obvious question: why did generating a straightforward itinerary require 49 separate LLM calls?

The Investigation (Using vLLora MCP)

At this point, the problem wasn’t correctness — it was understanding how the agent arrived at the result. Manually tracing through dozens of JSON logs would have been slow and error-prone, especially given the number of model calls involved.

Instead, we used the vLLora MCP server to inspect the most recent agent run. MCP exposes trace data as structured tools, which means a coding agent can reason about execution flow, tool calls, and model behavior directly — without parsing raw logs or switching to a separate dashboard.

We asked the coding agent:

Use vLLora MCP to inspect the most recent agent run and explain why it produced this result.

The agent inspected the latest traces and summarized what actually happened during the run. While the execution was marked successful, the trace revealed repeated failed attempts to call the same tool.

Specifically:

The agent retried the same tool call multiple times with adjusted parameters
Each failure was handled internally without surfacing an error
A fallback path eventually produced the correct result
The extra retries directly inflated both latency and cost

Because the run completed successfully, none of this appeared in error metrics. The inefficiency only becomes visible when you inspect the execution path itself rather than the final outcome.

The Reveal: The Parameter Mismatch

The MCP analysis pointed to a very specific failure pattern. This wasn’t a logic bug or a model hallucination. It was a syntax mismatch between what the model assumed and what the tool schema actually required.

The agent was effectively stuck in a validation loop.

Attempt 1

The model called research_accommodations with a parameter name that did not match the schema: checkin_date.

Result: ValidationError
Reason: The tool schema expected the snake_case name check_in_date.

Attempt 2

After observing the failure, the model retried with a lowercase variation: checkindate.

Result: ValidationError
Reason: The parameter name still did not match the schema.

Attempt 3

The model simplified further, removing part of the name and trying check_in.

Result: ValidationError
Reason: Still not a valid parameter.

After multiple failed attempts, the agent abandoned the structured tool entirely.

Fallback path

The model fell back to a generic search call:

tavily_search("hotels in Tokyo")

This fallback produced usable results, which is why the overall run completed successfully and returned a 200 OK. However, that success came at a cost. The trace showed 21 wasted tool calls and thousands of input tokens consumed by repeated retries, error messages, and recovery logic.

From the outside, the agent looked healthy. Under the hood, it was working much harder than it needed to.
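A pattern like this can be summarized mechanically once the tool calls are available as structured data. The sketch below is our own illustration, not the MCP server's output format: the record layout (tool, arguments, status) and the date values are assumptions, while the parameter names are the ones from the attempts above.

```python
from collections import defaultdict


def summarize_failed_attempts(tool_calls: list[dict]) -> dict[str, list[list[str]]]:
    """Group failed tool calls by tool and record the argument names each attempt used."""
    attempts = defaultdict(list)
    for call in tool_calls:
        if call["status"] == "error":
            attempts[call["tool"]].append(sorted(call["arguments"]))
    return dict(attempts)


# Hypothetical trace records mirroring the three attempts and the fallback.
calls = [
    {"tool": "research_accommodations",
     "arguments": {"checkin_date": "2025-02-20"}, "status": "error"},
    {"tool": "research_accommodations",
     "arguments": {"checkindate": "2025-02-20"}, "status": "error"},
    {"tool": "research_accommodations",
     "arguments": {"check_in": "2025-02-20"}, "status": "error"},
    {"tool": "tavily_search",
     "arguments": {"query": "hotels in Tokyo"}, "status": "ok"},
]
for tool, tried in summarize_failed_attempts(calls).items():
    print(f"{tool}: {len(tried)} failed attempts, argument names tried: {tried}")
```

Seeing several failures on one tool, each with a slightly different argument name, is a strong hint that the model is guessing at the schema rather than hitting a transient error.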

The Fix: Delegating to the Agent

Once the MCP analysis identified the root cause (ambiguous docstrings), there was no need to manually search through the codebase or write the fix by hand. We delegated the change to the coding agent.

From Cursor, we asked:

Update the research_accommodations tool definition.
Make the check_in_date parameter explicitly require snake_case to prevent retry loops.

The agent located the relevant Pydantic model and updated the docstrings to remove any ambiguity for the model.

The Code Change

Before: Ambiguous

from pydantic import BaseModel, Field

class AccommodationSearch(BaseModel):
    """Search for hotels and accommodations."""
    check_in_date: str = Field(
        description="Check-in date in YYYY-MM-DD format"
    )

The description specified the value format, but left the parameter name open to interpretation.

After: Explicit

from pydantic import BaseModel, Field

class AccommodationSearch(BaseModel):
    """
    Search for hotels.
    IMPORTANT: All parameters must be in snake_case.
    """
    check_in_date: str = Field(
        description="Check-in date (YYYY-MM-DD). Strictly use parameter name: 'check_in_date'."
    )

By explicitly stating the required parameter name, the ambiguity that caused the retry loop was removed.
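A complementary idea, which we did not apply in this fix, is to make the validation error itself name the expected parameter so the model can recover in one retry. The sketch below uses Python's standard difflib module. The explain_unknown_parameter helper and the parameter names other than check_in_date are hypothetical.

```python
import difflib

# check_in_date comes from the tool above; the other names are made up.
VALID_PARAMS = ["check_in_date", "check_out_date", "city"]


def explain_unknown_parameter(name: str, valid: list[str]) -> str:
    """Build an error message that points at the closest valid parameter name."""
    matches = difflib.get_close_matches(name, valid, n=1, cutoff=0.6)
    if matches:
        return f"Unknown parameter {name!r}. Did you mean {matches[0]!r}?"
    return f"Unknown parameter {name!r}. Valid parameters: {', '.join(valid)}"


for attempt in ["checkin_date", "checkindate", "check_in"]:
    print(explain_unknown_parameter(attempt, VALID_PARAMS))
```

An error message of this shape gives the model the same information the updated description does, but at the moment it goes wrong.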

With the fix applied, we cleared the agent context and ran the exact same travel planning task again to verify the results.

Measuring the Impact

To compare the two runs, we asked the coding agent to analyze both traces side by side and summarize the differences.

The Prompt

Compare the performance of the bad run 4ea18f79-4c4c-4d2c-b628-20d510af7181 against the fixed run a5cf084b-01b2-4288-acef-aa2bedc31426. Show me a table of Latency, Cost, and Token Usage differences.

The agent analyzed the telemetry from both traces and generated this comparison:

Metric	Bad Run (4ea18f79)	Fixed Run (a5cf084b)	Difference	Improvement
Latency	361.21 seconds (6.02 min)	194.66 seconds (3.24 min)	-166.55 seconds	46.1% faster
Total Cost	$0.0683	$0.0430	-$0.0254	37.1% cheaper
LLM Calls	49 calls	28 calls	-21 calls	42.9% fewer
Input Tokens	114,162	64,608	-49,554	43.4% reduction
Output Tokens	14,916	10,691	-4,225	28.3% reduction
Total Tokens	129,078	75,299	-53,779	41.7% reduction

Impact at Scale

For a single run, saving about 2.5 cents might seem negligible. But at production scale, "silent failures" are a massive budget leak.

Based on these numbers, an agent running 1,000 times a day would see:

Annual Savings: ~$9,271/year
Processing Time Saved: ~46 hours per day
Token Reduction: ~54 million tokens/day
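The projection is plain arithmetic on the Difference column of the comparison table. A short sketch that reproduces it, assuming 1,000 runs per day and 365 days per year:

```python
RUNS_PER_DAY = 1_000
DAYS_PER_YEAR = 365

# Per-run savings, taken from the Difference column of the comparison table.
seconds_saved_per_run = 166.55
cost_saved_per_run_usd = 0.0254
tokens_saved_per_run = 53_779

annual_savings_usd = cost_saved_per_run_usd * RUNS_PER_DAY * DAYS_PER_YEAR
hours_saved_per_day = seconds_saved_per_run * RUNS_PER_DAY / 3600
tokens_saved_per_day = tokens_saved_per_run * RUNS_PER_DAY

print(f"Annual savings: ~${annual_savings_usd:,.0f}/year")
print(f"Processing time saved: ~{hours_saved_per_day:.0f} hours per day")
print(f"Token reduction: ~{tokens_saved_per_day / 1e6:.0f} million tokens/day")
```

The time saved is cumulative processing time across all runs, which is why it exceeds 24 hours per day.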

Where did the waste go?

The comparison highlights exactly where the inefficiency was hiding. By fixing the parameter names, we eliminated:

Multiple Retry Loops: The agent no longer wastes rounds guessing the correct parameter syntax.
Context Pollution: We removed thousands of tokens of error messages and failed tool outputs from the context window.
Inefficient Fallbacks: The agent uses the specialized research_accommodations tool immediately, rather than falling back to a more expensive generic search.

The fix was a one-line documentation change. But we wouldn't have found it without seeing the actual execution pattern—the retry attempts that looked like normal agent behavior until we inspected the traces.

Why This Matters

Observability isn't just about catching errors; it's about catching inefficiencies. When agents "work" but cost too much, you need to see the execution flow, not just the final result.

Traditional debugging workflows require you to:

Notice the performance issue
Switch to a tracing UI
Search for the relevant trace
Manually parse JSON logs
Connect the dots across multiple tool calls

The MCP workflow lets your coding agent do steps 2-5. You stay in your editor. The agent understands the trace structure and can explain what's happening—not just what failed, but what's inefficient.

Connecting the MCP Server

vLLora's MCP server runs alongside your vLLora instance. Configure your MCP client to connect to it:

{
  "mcpServers": {
    "vllora": {
      "url": "http://localhost:9090/mcp"
    }
  }
}

or install the MCP server in your IDE:

Quick Install

Once connected, your coding agent automatically discovers the trace inspection tools and can start using them immediately.

Closing Thoughts

Silent failures are expensive. They don't break your application, but they inflate your costs and slow down your users. The challenge is visibility: you need to see the execution flow, not just the final result.

vLLora's MCP Server brings trace inspection into your coding workflow, so you can debug inefficiencies the same way you debug errors: in your editor, with your tools. Don't just check if your agent works. Check how it works.

For setup details and advanced configuration, see the vLLora MCP Server documentation.

Introducing the vLLora MCP Server

Tue, 23 Dec 2025 00:00:00 GMT

If you’re building agents with tools like Claude Code or Cursor, or you prefer working in the terminal, you’ve probably hit this friction already. Your agent runs, something breaks partway through, and now you have to context-switch to a web UI to understand what happened. You search for the right trace, click through LLM calls, and then try to carry that context back into your editor.

vLLora’s MCP Server removes that context switch. Your coding agent becomes the interface for inspecting traces, understanding failures, and debugging agent behavior — without leaving your editor or terminal.

Making Traces Programmatic

vLLora already captures detailed traces for every agent run — model calls, tool executions, and execution flow — and the web UI remains a powerful way to explore that data.

But not every debugging workflow fits a dashboard. If you’re working from the terminal, iterating inside an IDE, or using a coding agent to help debug another agent, you need trace data where that work happens. You need structured access that tools and agents can consume directly.

Built for Coding Agents

When you're building AI agents that need debugging, you shouldn't have to leave your coding environment to inspect traces.

Your coding agent already understands MCP. When you connect vLLora's MCP server, your agent immediately knows how to use the trace inspection tools. The JSON schemas are built into the protocol, so your agent understands what parameters each tool needs and what it returns.

For a complete list of available tools and prompts, see the MCP Server documentation.

The "Something Just Failed" Workflow

You run your agent and it produces an unexpected result. You need to debug it.

Instead of opening a tracing UI, you ask your coding agent to help debug it. The agent can:

locate recent failing runs
walk execution flow across spans
inspect the exact payload sent to the model

The agent handles the underlying queries and returns the context you need — while you stay in your editor.

Debugging in Practice

Here’s what debugging looks like once the MCP server is connected.

An agent run completes, but keeps failing in the same way. The agent believes it’s fixing the issue by retrying with different parameter names, but the failures persist.

You ask your coding agent:

Use vLLora MCP to inspect the most recent agent run and explain why it produced this result.

The agent searches recent traces, follows the execution flow, and inspects the tool call spans. It finds repeated calls like:

{
  "tool": "research_flights",
  "arguments": {
    "from_city": "NYC",
    "to_city": "SFO",
    "departure_date": "2025-02-20"
  }
}

From the trace data, the agent sees that from_city is not a valid parameter in the registered tool schema. Because the argument names don’t match the schema exposed at runtime, the function never executes — every retry fails before the tool logic runs.

Instead of guessing, the agent explains the root cause directly from execution data: a mismatch between the agent’s assumed parameter names and the actual tool definition.

You get a clear explanation of why retries didn’t help and what needs to change, without leaving your editor or inspecting raw logs.

Connecting the MCP Server

vLLora's MCP server runs alongside your vLLora instance. Configure your MCP client to connect to it:

{
  "mcpServers": {
    "vllora": {
      "url": "http://localhost:9090/mcp"
    }
  }
}

or install the MCP server in your IDE:

Quick Install

Once connected, your coding agent automatically discovers the trace inspection tools and can start using them.

Closing Thoughts

Debugging AI agents has been tedious—too much context switching, too little visibility into what's happening. vLLora's MCP Server brings trace inspection into your coding workflow, so you can debug agents the same way you debug code: in your editor, with your tools.

This brings observability closer to where agent reasoning happens.

For setup details and advanced configuration, see the vLLora MCP Server documentation.

Debugging Agents: Why Prompt Tweaks Can't Fix Stale State

Mon, 22 Dec 2025 00:00:00 GMT

In the earlier deep-agent case study (Browsr), I focused on architecture. Here I'll stay grounded in one debugging failure I hit in a maps agent—a failure that looked like a prompt problem but wasn't. The agent behaved correctly in chat, the UI looked correct, and yet the results were consistently from the wrong area. I tried the usual prompt tweaks: stronger instructions, "be careful," "use the visible map," retries. None of it moved the needle.

Here's how map state flows through the agent loop and where it can drift:

The Bug

The user expectation was simple:

Pan the map to the neighborhood you care about.
Ask for "Starbucks."
Get Starbucks locations in what's visible on the map.

What I observed instead:

The user panned and zoomed to San Francisco.
The agent responded confidently and took action.
The places returned were from Mumbai, not San Francisco.

Nothing in the conversation transcript looked obviously wrong. The root cause: the agent wasn't getting the context of what the user was seeing on the map. The mismatch was between the visible map state on the user's screen and what the agent had access to when making tool calls.

The Tool Payloads

You don't need the implementation to see the bug. The tool call arguments are enough. Here's the wrong tool call:

{
  "name": "search_place_by_name",
  "arguments": {
    "query": "Starbucks"
    // MISSING CONTEXT:
    // No "center_point" or "viewport_bbox" passed here.
    // The backend silently defaulted to the session start location (Mumbai).
  }
}

The tool was called without any location context because the agent didn't have access to what the user was seeing on the map. It defaulted to GPS/stale coordinates internally, so results came from the wrong area.

After I fixed the state being passed through, the tool call looked like this:

{
  "name": "search_places",
  "arguments": {
    "latitude": 37.7476,
    "longitude": -122.4337,
    "query": "starbucks"
  }
}

Now the visible map coordinates are explicitly passed, and results align with what the user is looking at.

And importantly, retries didn't help. Without correcting the state, the agent would keep calling search_place_by_name without location, producing the same wrong results.

The agent wasn’t “bad at following instructions.” It was acting on the wrong state.

Why Prompt Tweaks Failed

I had already told the agent the right thing, in multiple forms:

Search where the user is looking.
Prefer the visible map area over the user's location.
If the map moved, use the new location.

But the agent couldn't see what the user was seeing on the map. The agent didn't have access to the current map view context—the center coordinates, zoom level, or visible bounds. When the user panned to San Francisco, that information stayed on the client side and never made it to the agent's context.

The agent can't follow an instruction that depends on information it doesn't have. When the tool schema expects explicit coordinates, and the agent's internal state still contains GPS/default coordinates (because it never received the updated map context), retries reproduce the same error:

The prompt is correct.
The reasoning is coherent.
The tool arguments are wrong because the agent doesn't have the map context.

How I Discovered It

I found this bug by inspecting the tool call arguments in the execution logs. The logs showed the search_place_by_name tool being called with just the query—no location context.

The results came back from the wrong area because the agent never received the context of what the user was seeing—it was using stale GPS coordinates internally instead of the visible map bounds the user was actually looking at. Once I saw this mismatch between what the user saw and what the agent knew, the rest of the debugging was straightforward. I used vLLora to capture these traces, which made the missing location argument obvious immediately.

The Fix

The fix wasn't changing prompts or tool schemas. It was ensuring the map state flowed from the React frontend to the agent's execution context.

Here's the mechanism:

Frontend: React state tracking. The Google Maps component (GoogleMapsManager) tracks the current map center and zoom in React state. When the user pans or zooms, setCenter and setZoom update this state. This state lives entirely on the client side—the backend agent never sees it unless we explicitly send it.

State capture on message send. When the user types a message and submits it, we capture the current map state from React before sending the request to the agent backend. The Chat component reads center and zoom from the map component's state at that moment.

Context injection into agent execution. We inject the captured map coordinates into the agent's execution context. In our setup, this happens through the task context object that gets passed with each agent invocation. The context includes:

{
  map_center: { latitude: 37.7749, longitude: -122.4194 },
  map_zoom: 13
}

This context is available to the agent throughout its execution. The agent's system prompt can reference these values, or they can be injected directly into tool calls.

Agent uses context in tool calls. The agent now has access to the visible map coordinates. When it needs to call search_places, it extracts the coordinates from the context and passes them explicitly:

{
  "name": "search_places",
  "arguments": {
    "latitude": 37.7749,  // from context.map_center.latitude
    "longitude": -122.4194,  // from context.map_center.longitude
    "query": "starbucks"
  }
}

The tools themselves don't change—search_places still requires explicit latitude and longitude parameters. What changed is that the agent now receives the current visible map coordinates as context, so when the user pans to San Francisco and asks for "Starbucks," the agent uses the San Francisco coordinates instead of defaulting to GPS or stale coordinates.
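For the variant where coordinates are injected directly into tool calls, the mapping from context to arguments is small enough to sketch. The build_search_call helper is hypothetical and written in Python for illustration. The context keys and the search_places argument names are the ones shown above.

```python
def build_search_call(context: dict, query: str) -> dict:
    """Build a search_places tool call from the map state captured with the message."""
    center = context["map_center"]  # captured from the frontend on message send
    return {
        "name": "search_places",
        "arguments": {
            "latitude": center["latitude"],
            "longitude": center["longitude"],
            "query": query,
        },
    }


# Context as captured when the user sent the message.
context = {
    "map_center": {"latitude": 37.7749, "longitude": -122.4194},
    "map_zoom": 13,
}
call = build_search_call(context, "starbucks")
print(call["arguments"])
```

Because the coordinates come from the per-message context rather than from anything the agent remembers, a pan to a new city changes the tool arguments on the very next turn.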

Alternative approaches we considered:

WebSocket sync: Continuously sync map state to the backend. Too much overhead for infrequent updates.
Specialized tool: Add a get_current_map_state() tool the agent could call. Adds latency and another step the agent might forget.
Augment system prompt: Inject coordinates directly into the system prompt string. Works, but harder to debug and less flexible than structured context.

The context injection approach is clean: the state flows once per message, the agent has structured access to it, and we can inspect it in logs.

After the fix, the tool calls now include the visible map context, and the results appear exactly where the user is looking.

The logs show the corrected behavior: location context is now properly passed through in the execution context, and the search results align with the visible map area.

In hindsight, it's obvious. Without inspecting the actual tool call arguments, it wasn't.

The Lesson

This maps bug is one instance of a broader class:

The UI can be correct.
The agent's narration can be correct.
The tool call can still be wrong if state is stale, mis-scoped, or silently substituted.

Prompt tweaks help when the agent is misunderstanding an instruction. They don't help when the agent is faithfully executing the wrong state. That's when you need to inspect what context the agent actually has—and log or trace what state flows through your tool calls.

Building Better Agents

This bug highlights a design pattern for building agents that need to stay in sync with dynamic UI state:

Explicitly pass visible state into agent context. If your agent needs to act on what the user is seeing (map location, selected text, visible table rows, etc.), don't assume the agent knows. Make the connection explicit: when the user interacts with the UI, capture that state and inject it into the agent's context before each turn.

Design your state flow. Map out what state your agent needs to make correct tool calls. Then trace where that state lives (UI component, backend, user session) and ensure it flows through to the agent at the right time. The maps agent needed map center/zoom—those live in React state and get passed through the task context.

Inspect tool arguments, not just responses. The conversation looked fine because the agent's responses were coherent. The bug was in the tool call arguments. Make tool call inspection part of your debugging workflow—capture the actual arguments being sent, not just the tool responses or conversation transcript.

Prefer explicit context over implicit defaults. The search_place_by_name tool defaulted to GPS coordinates when location wasn't provided. That default masked the real problem. Better to require explicit parameters or fail fast when context is missing.
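A minimal sketch of the fail-fast option, assuming the check runs in the tool handler before any search is performed. MissingLocationError and validate_search_arguments are hypothetical names. The latitude and longitude keys match the search_places arguments shown earlier.

```python
class MissingLocationError(ValueError):
    """Raised when a location-dependent tool is called without coordinates."""


def validate_search_arguments(arguments: dict) -> dict:
    """Reject calls that lack coordinates instead of substituting a default location."""
    missing = [key for key in ("latitude", "longitude") if arguments.get(key) is None]
    if missing:
        raise MissingLocationError(
            f"search requires explicit {', '.join(missing)}; no default location is applied"
        )
    return arguments


try:
    validate_search_arguments({"query": "Starbucks"})
except MissingLocationError as exc:
    print(f"rejected: {exc}")
```

With a check like this, the Mumbai results would have shown up as an immediate, visible error in the trace rather than as plausible-looking output from the wrong city.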

The fix wasn't changing prompts or tool schemas—it was ensuring the agent receives the state it needs to make correct decisions. That's the difference between debugging symptoms and fixing architecture.

Building AI-Powered Image Generation with OpenAI-Compatible Responses API

Fri, 12 Dec 2025 00:00:00 GMT

Introduction

The Responses API represents a powerful evolution in how we interact with large language models. Unlike traditional chat completion APIs that return simple text responses, the Responses API enables structured, multi-step workflows that can orchestrate multiple tools and produce rich, multi-modal outputs.

In this article, we'll explore how to build an AI-powered application that combines web search and image generation capabilities.

Source Code: The complete example is available on GitHub.

Documentation: For comprehensive Responses API documentation, see the Responses API guide and Image Generation guide.

Understanding the Responses API

The Responses API is a more powerful alternative to the traditional Completions API. It enables structured, multi-step workflows with support for multiple built-in tools like web search and image generation, producing rich, multi-modal outputs that can be easily processed programmatically.

Prerequisites and Setup

Before we dive into the code, let's ensure we have everything we need.

Required Dependencies

Our example requires the following Rust crates:

vllora_llm - The Vllora LLM client library
async-openai-compat - OpenAI-compatible type definitions (version 0.30.1), imported in the examples through the vllora_llm::async_openai path
base64 - For decoding base64-encoded images (version 0.22)
tokio - Async runtime (version 1.x with full features)
serde_json - JSON serialization support

Cargo.toml Configuration

Here's the complete Cargo.toml for our example:

[package]
name = "responses_image_generation_example"
version = "0.1.0"
edition = "2021"

[workspace]

[dependencies]
vllora_llm = "0.1.17"

tokio = { version = "1", features = ["full"] }
serde_json = "1.0"
base64 = "0.22"

Environment Setup

You'll need to set your API key as an environment variable:

export VLLORA_OPENAI_API_KEY="your-api-key-here"

Note: Make sure to keep your API key secure. Never commit it to version control or expose it in client-side code.

Building the Request

Now let's construct our Responses API request. We'll create a request that uses both web search and image generation tools.

Creating the CreateResponse Structure

use vllora_llm::async_openai::types::responses::CreateResponse;
use vllora_llm::async_openai::types::responses::ImageGenTool;
use vllora_llm::async_openai::types::responses::InputParam;
use vllora_llm::async_openai::types::responses::Tool;
use vllora_llm::async_openai::types::responses::WebSearchTool;

let responses_req = CreateResponse {
    model: Some("gpt-4.1".to_string()),
    input: InputParam::Text(
        "Search for the latest news from today and generate an image about it".to_string(),
    ),
    tools: Some(vec![
        Tool::WebSearch(WebSearchTool::default()),
        Tool::ImageGeneration(ImageGenTool::default()),
    ]),
    ..Default::default()
};

Understanding the Components

Model Selection - We're using "gpt-4.1", which supports the Responses API and tool calling. Make sure to use a model that supports these features.

Input Parameter - We use InputParam::Text to provide a simple text prompt. The model will:

First use the web search tool to find current news
Then use the image generation tool to create an image related to that news

Tool Configuration - We specify two tools:

WebSearchTool::default() - Uses default web search configuration
ImageGenTool::default() - Uses default image generation settings

The ..Default::default() ensures all other fields use their default values, which is a common Rust pattern for struct initialization.
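
To see the struct update syntax in isolation, here is a minimal sketch using a hypothetical RequestSketch type. It is a stand-in invented for illustration, not the real CreateResponse, and its field names are assumptions.

```rust
// Hypothetical stand-in for a request type; NOT the real CreateResponse.
#[derive(Debug, Default, PartialEq)]
struct RequestSketch {
    model: Option<String>,
    temperature: Option<f32>,
    max_output_tokens: Option<u32>,
    tools: Vec<String>,
}

fn build_request() -> RequestSketch {
    RequestSketch {
        model: Some("gpt-4.1".to_string()),
        tools: vec!["web_search".to_string(), "image_generation".to_string()],
        // Every field not named above takes the value from Default::default().
        ..Default::default()
    }
}

fn main() {
    let req = build_request();
    // The unnamed fields fall back to None, the default for Option.
    assert_eq!(req.temperature, None);
    assert_eq!(req.max_output_tokens, None);
    println!("{:?}", req);
}
```

The pattern only works when the struct implements Default, which is why it pairs well with request types that have many optional fields.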

Initializing the Client

Next, we need to set up the Vllora LLM client with our credentials.

Client Configuration

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::credentials::ApiKeyCredentials;
use vllora_llm::types::credentials::Credentials;

let client = VlloraLLMClient::default()
    .with_credentials(Credentials::ApiKey(ApiKeyCredentials {
        api_key: std::env::var("VLLORA_OPENAI_API_KEY")
            .expect("VLLORA_OPENAI_API_KEY must be set"),
    }));

Credential Management

The client uses a builder pattern for configuration. Here we:

Start with VlloraLLMClient::default() for default settings
Chain .with_credentials() to provide authentication
Use Credentials::ApiKey() with ApiKeyCredentials for API key authentication
Read the API key from the environment variable

Tip: In production, consider using a more robust error handling approach instead of .expect(), such as returning a Result or using a configuration management library.
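
As one possible shape for that, the sketch below reads the key and returns a Result instead of panicking. ConfigError, validate_api_key, and load_api_key are hypothetical names introduced here; they are not part of vllora_llm.

```rust
use std::env;
use std::fmt;

// Hypothetical error type for this sketch; not part of vllora_llm.
#[derive(Debug, PartialEq)]
enum ConfigError {
    MissingApiKey(&'static str),
    EmptyApiKey(&'static str),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingApiKey(name) => write!(f, "{name} is not set"),
            ConfigError::EmptyApiKey(name) => write!(f, "{name} is set but empty"),
        }
    }
}

impl std::error::Error for ConfigError {}

// Pure validation step, kept separate so it can be tested without touching the environment.
fn validate_api_key(name: &'static str, value: Option<String>) -> Result<String, ConfigError> {
    match value {
        None => Err(ConfigError::MissingApiKey(name)),
        Some(v) if v.trim().is_empty() => Err(ConfigError::EmptyApiKey(name)),
        Some(v) => Ok(v),
    }
}

// Reads the variable used throughout this post and validates it.
fn load_api_key() -> Result<String, ConfigError> {
    validate_api_key(
        "VLLORA_OPENAI_API_KEY",
        env::var("VLLORA_OPENAI_API_KEY").ok(),
    )
}

fn main() {
    match load_api_key() {
        Ok(_) => println!("API key loaded"),
        Err(e) => eprintln!("configuration error: {e}"),
    }
}
```

The caller can then decide whether a missing key is fatal, rather than having the decision made by a panic deep inside client setup.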

Sending the Request and Handling Responses

Now let's send our request and see what we get back.

Making the API Call

use vllora_llm::error::LLMResult;

println!("Sending request with tools: web_search_preview and image_generation");
let response = client.responses().create(responses_req).await?;

The client.responses().create() method:

Returns a Result
Is async, so we use .await
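Is followed by the ? operator, which propagates errors up the call stack; the enclosing function must therefore return a compatible Result (main returns LLMResult<()> in the complete example)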

Understanding the Response Structure

The Response struct contains an output field, which is a vector of OutputItem variants. Each item represents a different type of output from the API:

Text messages from the model
Image generation results
Web search results
Other tool outputs

Processing Text Messages

Let's see how to extract and display text content from the response.

Matching Message Outputs

use vllora_llm::async_openai::types::responses::OutputItem;
use vllora_llm::async_openai::types::responses::OutputMessageContent;

for (index, output) in response.output.iter().enumerate() {
    match output {
        OutputItem::Message(message) => {
            println!("\n[Message {}]", index);
            println!("{}", "-".repeat(80));

            for content in &message.content {
                match content {
                    OutputMessageContent::OutputText(text_output) => {
                        // Print the text content
                        println!("\n{}", text_output.text);

                        // Print sources/annotations if available
                        if !text_output.annotations.is_empty() {
                            println!("Annotations: {:#?}", text_output.annotations);
                        }
                    }
                    _ => {
                        println!("Other content type: {:?}", content);
                    }
                }
            }
            println!("\n{}", "=".repeat(80));
        }
        // Other output types (such as image generation calls) are handled in the next section
        _ => {}
    }
}

Understanding Message Content

Message Structure - Each Message contains a content vector that can hold different content types:

OutputText - The actual text response
Other content types for different media

Annotations - Text outputs can include annotations which provide:

Citations and sources (especially useful with web search)
References to tool calls
Additional metadata

These annotations are particularly valuable when using web search tools, as they show where the information came from.

Handling Image Generation Results

This is the core focus of our example - extracting and saving generated images.

Understanding ImageGenToolCall

When the model uses the image generation tool, the response includes OutputItem::ImageGenerationCall variants. Each call contains:

A result field with the base64-encoded image data
Metadata about the generation

Decoding and Saving Images

Here's our complete image handling function:

use vllora_llm::async_openai::types::responses::ImageGenToolCall;
use base64::{engine::general_purpose::STANDARD, Engine as _};
use std::fs;

/// Decodes a base64-encoded image from an ImageGenerationCall and saves it to a file.
///
/// # Arguments
/// * `image_generation_call` - The image generation call containing the base64-encoded image
/// * `index` - The index to use in the filename
///
/// # Returns
/// * `Ok(filename)` - The filename where the image was saved
/// * `Err(e)` - An error if the call has no result, decoding fails, or file writing fails
fn decode_and_save_image(
    image_generation_call: &ImageGenToolCall,
    index: usize,
) -> Result<String, Box<dyn std::error::Error>> {
    // Extract base64 image from the call
    let base64_image = image_generation_call
        .result
        .as_ref()
        .ok_or("Image generation call has no result")?;

    // Decode base64 image
    let image_data = STANDARD.decode(base64_image)?;

    // Save to file
    let filename = format!("generated_image_{}.png", index);
    fs::write(&filename, image_data)?;

    Ok(filename)
}

Step-by-Step Breakdown

Extract Base64 Data - We access the result field, which is an Option. We use .ok_or() to convert None into an error if the result is missing.
Decode Base64 - The base64 crate's STANDARD engine decodes the base64 string into raw bytes. This can fail if the string is malformed, so we use ? to propagate errors.
Save to File - We use Rust's standard library fs::write() to save the decoded bytes to a file. We name it generated_image_{index}.png to avoid conflicts when multiple images are generated.
Return Filename - We return the filename so the caller knows where the image was saved.
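
The function above always uses a .png extension. If you would rather confirm that the decoded bytes really are a PNG before choosing the extension, a small check of the 8-byte PNG signature is enough. This is an optional addition sketched here under that assumption, not part of the original example; looks_like_png and extension_for are hypothetical helper names.

```rust
// The fixed 8-byte signature that starts every PNG file.
const PNG_SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];

// Returns true when the decoded bytes begin with the PNG signature.
fn looks_like_png(bytes: &[u8]) -> bool {
    bytes.starts_with(&PNG_SIGNATURE)
}

// Hypothetical helper: choose a file extension based on the signature check.
fn extension_for(bytes: &[u8]) -> &'static str {
    if looks_like_png(bytes) {
        "png"
    } else {
        "bin"
    }
}

fn main() {
    let png_like = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A, 0x00];
    println!("{}", extension_for(&png_like));
    println!("{}", extension_for(b"not an image"));
}
```

In decode_and_save_image, a check like this would sit between the decode step and the format! call that builds the filename.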

Using the Function

Here's how we integrate this into our response processing:

OutputItem::ImageGenerationCall(image_generation_call) => {
    println!("\n[Image Generation Call {}]", index);
    match decode_and_save_image(image_generation_call, index) {
        Ok(filename) => {
            println!("✓ Successfully saved image to: {}", filename);
        }
        Err(e) => {
            eprintln!("✗ Failed to decode/save image: {}", e);
        }
    }
}

We match on OutputItem::ImageGenerationCall, extract the call, and pass it to our decoding function. We handle both success and error cases gracefully.

Complete Example Walkthrough

Let's put it all together and see the complete flow:

Complete Source Code

use vllora_llm::async_openai::types::responses::CreateResponse;
use vllora_llm::async_openai::types::responses::ImageGenTool;
use vllora_llm::async_openai::types::responses::ImageGenToolCall;
use vllora_llm::async_openai::types::responses::InputParam;
use vllora_llm::async_openai::types::responses::OutputItem;
use vllora_llm::async_openai::types::responses::OutputMessageContent;
use vllora_llm::async_openai::types::responses::Tool;
use vllora_llm::async_openai::types::responses::WebSearchTool;

use base64::{engine::general_purpose::STANDARD, Engine as _};
use std::fs;

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::error::LLMResult;
use vllora_llm::types::credentials::ApiKeyCredentials;
use vllora_llm::types::credentials::Credentials;

fn decode_and_save_image(
    image_generation_call: &ImageGenToolCall,
    index: usize,
) -> Result<String, Box<dyn std::error::Error>> {
    let base64_image = image_generation_call
        .result
        .as_ref()
        .ok_or("Image generation call has no result")?;

    let image_data = STANDARD.decode(base64_image)?;
    let filename = format!("generated_image_{}.png", index);
    fs::write(&filename, image_data)?;

    Ok(filename)
}

#[tokio::main]
async fn main() -> LLMResult<()> {
    // 1) Build a Responses-style request using async-openai-compat types
    // with tools for web_search_preview and image_generation
    let responses_req = CreateResponse {
        model: Some("gpt-4.1".to_string()),
        input: InputParam::Text(
            "Search for the latest news from today and generate an image about it".to_string(),
        ),
        tools: Some(vec![
            Tool::WebSearch(WebSearchTool::default()),
            Tool::ImageGeneration(ImageGenTool::default()),
        ]),
        ..Default::default()
    };

    // 2) Construct a VlloraLLMClient
    let client =
        VlloraLLMClient::default().with_credentials(Credentials::ApiKey(ApiKeyCredentials {
            api_key: std::env::var("VLLORA_OPENAI_API_KEY")
                .expect("VLLORA_OPENAI_API_KEY must be set"),
        }));

    // 3) Non-streaming: send the request and print the final reply
    println!("Sending request with tools: web_search_preview and image_generation");
    let response = client.responses().create(responses_req).await?;

    println!("\nNon-streaming reply:");
    println!("{}", "=".repeat(80));

    for (index, output) in response.output.iter().enumerate() {
        match output {
            OutputItem::ImageGenerationCall(image_generation_call) => {
                println!("\n[Image Generation Call {}]", index);
                match decode_and_save_image(image_generation_call, index) {
                    Ok(filename) => {
                        println!("✓ Successfully saved image to: {}", filename);
                    }
                    Err(e) => {
                        eprintln!("✗ Failed to decode/save image: {}", e);
                    }
                }
            }
            OutputItem::Message(message) => {
                println!("\n[Message {}]", index);
                println!("{}", "-".repeat(80));

                for content in &message.content {
                    match content {
                        OutputMessageContent::OutputText(text_output) => {
                            println!("\n{}", text_output.text);

                            if !text_output.annotations.is_empty() {
                                println!("Annotations: {:#?}", text_output.annotations);
                            }
                        }
                        _ => {
                            println!("Other content type: {:?}", content);
                        }
                    }
                }
                println!("\n{}", "=".repeat(80));
            }
            _ => {
                println!("\n[Other Output {}]", index);
                println!("{:?}", output);
            }
        }
    }

    Ok(())
}

Execution Flow

Request Construction - We build a CreateResponse with our prompt and tools
Client Initialization - We create and configure the Vllora LLM client
API Call - We send the request and await the response
Response Processing - We iterate through output items:
- Handle image generation calls by decoding and saving
- Display text messages with annotations
- Handle any other output types
File Output - Generated images are saved to disk as PNG files

Expected Output

When you run this example, you'll see output like:

Sending request with tools: web_search_preview and image_generation

Non-streaming reply:
================================================================================

[Message 0]
--------------------------------------------------------------------------------

Here's the latest news from today: [summary of current news]

Annotations: [citations and sources from web search]

================================================================================

[Image Generation Call 1]
✓ Successfully saved image to: generated_image_1.png

The actual news content and image will vary based on what's happening when you run it!

Summary

This example demonstrates how to use the Responses API to create multi-tool workflows that combine web search and image generation. The key steps are:

Build a CreateResponse request with the desired tools (WebSearchTool and ImageGenTool)
Initialize the VlloraLLMClient with your API credentials
Send the request and receive structured outputs
Process different output types: extract text from OutputItem::Message and decode base64 images from OutputItem::ImageGenerationCall
Save decoded images to disk using standard Rust file I/O

The Responses API enables powerful, structured workflows that go beyond simple text completions, making it ideal for building applications that need to orchestrate multiple AI capabilities.

Pause, Inspect, Edit: Debug Mode for LLM Requests in vLLora

Thu, 11 Dec 2025 00:00:00 GMT

LLMs behave like black boxes. You send them a request, hope the prompt is right, hope your agent didn't mutate it, hope the framework packaged it correctly — and then hope the response makes sense. In simple one-shot queries this usually works fine. But when you're building agents, tools, multi-step workflows, or RAG pipelines, it becomes very hard to see what the model is actually receiving. A single unexpected message, parameter, or system prompt change can shift the entire run.

Today we're introducing Debug Mode for LLM requests in vLLora that makes this visible — and editable.

When Debug Mode is enabled, vLLora inserts a breakpoint on every outgoing LLM request, so each request pauses before it reaches the model.

You can:

Inspect the exact request
Edit anything
Continue execution normally

This brings a familiar software-engineering workflow ("pause -> inspect -> edit -> continue") to LLM development.

Why We Built This

If you've built anything beyond a simple chat interface, you've likely hit one of these:

Silent tool-call failures (wrong name / bad params / malformed JSON)
Overloaded or corrupted context / RAG input leading to hallucination or truncation
Error accumulation and state drift in long or multi-step workflows
Lack of visibility: standard logs rarely show the actual request sent to the model

It is difficult to fix these issues without proper observability. Debug Mode changes that.

What Happens When a Request Pauses

When vLLora intercepts a request right before it's sent, you get a real-time snapshot of:

The selected model
Full message array (system, user, assistant)
Parameters like temperature or max tokens
Any tool definitions
Any extra fields and headers your framework injected

This is the full request payload your application is about to send — not what you assume it's sending.

Edit Anything

Click Edit and the payload becomes modifiable. You can adjust:

Message content
System prompts
Model name
Parameters
Tool definitions
Metadata

Temporary Changes

Edits affect only the current request. Your application code stays untouched.

It's a fast way to validate fixes, test ideas, and confirm what the agent should have sent.

Continue the Workflow

When you click Continue, vLLora:

Sends your edited request to the model
Receives the real response
Passes it back to your application
Resumes the workflow as if nothing unusual happened

After you click Continue, the workflow proceeds using the response from your edited request. The agent treats it the same way it would treat any normal response from the model.
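
The pause, inspect, edit, continue cycle can be modelled in a few lines. The sketch below is a conceptual illustration only: LlmRequest, continue_with_edit, and fake_send are hypothetical names, and this is not how vLLora implements Debug Mode. It shows the property that matters, namely that the edit applies to a copy of the request while the application's original stays untouched.

```rust
// Hypothetical request shape for this sketch; not vLLora's internal type.
#[derive(Debug, Clone, PartialEq)]
struct LlmRequest {
    model: String,
    temperature: f32,
    messages: Vec<String>,
}

// Pause: snapshot the outgoing request. Edit: mutate the snapshot.
// Continue: send the edited snapshot and hand the reply back to the caller.
fn continue_with_edit<E, S>(original: &LlmRequest, edit: E, send: S) -> String
where
    E: FnOnce(&mut LlmRequest),
    S: FnOnce(&LlmRequest) -> String,
{
    let mut paused = original.clone();
    edit(&mut paused);
    send(&paused)
}

// Stand-in for the real model call.
fn fake_send(req: &LlmRequest) -> String {
    format!("sent to {} with {} message(s)", req.model, req.messages.len())
}

fn main() {
    let original = LlmRequest {
        model: "gpt-4.1-mini".to_string(),
        temperature: 1.0,
        messages: vec!["Summarize the page".to_string()],
    };

    // Swap the model for this one request only.
    let reply = continue_with_edit(&original, |r| r.model = "gpt-4.1".to_string(), fake_send);

    println!("{reply}");
    println!(
        "application still holds: {} at temperature {}",
        original.model, original.temperature
    );
}
```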

Why This Matters for Agents

Agents are long-running chains of decisions. Each step can depend on the previous one, and each step can affect the next. Once you're 15 steps deep, you might not know whether:

The prompt changed
A system message was overwritten
A parameter was set differently than expected
The context blew up
A tool schema got mutated

With Debug Mode:

You catch drift early
You see exactly what the model receives
You fix issues in seconds
You avoid rerunning long multi-step workflows
You test prompt or parameter changes instantly

For deep agents, where a rerun can mean repeating dozens of steps, this makes debugging much faster.

Closing Thoughts

Debugging LLM systems has been mostly tedious. Debug Mode gives you a clear view into what’s happening and a way to correct issues as they occur.

If you need to understand or fix what an agent is sending, this is the most direct way to do it.

Read the docs: Debug Mode

Try it locally: Quickstart

Exploring Deep Agent Architecture with vLLora: Case Study – Browsr

Mon, 08 Dec 2025 00:00:00 GMT

Over the last year, agents have grown from one-shot prompt wrappers into systems that can work a problem for minutes or hours—researching, trying ideas, fixing mistakes, and resuming where they left off. Tools like Claude Code, Deep Research, Manus AI, and LangChain’s deep-agents all use this pattern.

A typical deep-agent architecture:

Keeps a running plan / TODO list of what still needs to be done.
Uses tools (like a browser, shell, APIs) to act in the world step by step.
Stores persistent memory (artifacts, notes, intermediate results) so it doesn’t forget earlier work.
Regularly evaluates its own progress, adjusts the plan, and retries when something fails.

Because it can plan, remember, and correct itself, a deep agent can run for a long time, across tens or hundreds of steps, without losing the thread of the task.

Let’s debug and observe Browsr using vLLora (a tool for agent observability) and see what happens under the hood.

Browsr

Browsr is a headless browser agent that lets you create sequences using a deep agent pattern and then hands you the payloads to run over APIs at scale. It also exports website data as structured or LLM-friendly markdown.

You can explore the definition and related configurations in this repo.

Note: Always respect the copyright rules and terms of the sites you scrape.

Debugging with vLLora

In this article, we use vLLora to illustrate how deep agents work. vLLora lets you debug and observe your agents locally: it helps you understand the architecture, inspect tool calls, and follow the full agent timeline. It also works with all popular models.

Browsr iterates in bursts of 1–3 commands per step, saves context to artifacts, and completes the task with a final tool call.

Driver: browser_step is the main executor; every turn runs 1–3 browser commands with explicit thinking, evaluation_previous_goal, memory, and next_goal.
Context control: Large tool outputs are written to disk so the model can drop token-heavy responses and reload them on demand.
Stateful loop: Up to eight iterations, each grounded in the latest observation block (DOM + screenshot) to avoid hallucinating.
Strict tool contract: Exactly one tool call per reply (no free text), keeping the agent deterministic and debuggable.

Let's examine the tool definitions more closely.

browser_step is the driver between steps. The system prompt forces the model to read the latest DOM and screenshot, report the current state, and then decide what to do next. Each turn must include:

thinking: Reasoning about the current state.
evaluation_previous_goal: Verdict on the previous step.
next_goal: Next immediate goal in one sentence.
commands: Array of commands to be executed.
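
As a rough picture of what one turn looks like, the fields above can be written as a struct with a check for the 1–3 command limit. This is a sketch under assumptions: the field names follow the list (plus memory, which the driver description mentions), but the Rust types, the BrowserStep name, and the validation rule are hypothetical and not taken from the Browsr source.

```rust
// Hypothetical shape of one browser_step turn; types are assumptions.
#[derive(Debug)]
struct BrowserStep {
    thinking: String,
    evaluation_previous_goal: String,
    memory: String,
    next_goal: String,
    commands: Vec<String>,
}

// Enforces the "1 to 3 commands per turn" rule described in the post.
fn validate_step(step: &BrowserStep) -> Result<(), String> {
    if step.next_goal.trim().is_empty() {
        return Err("next_goal must not be empty".to_string());
    }
    match step.commands.len() {
        1..=3 => Ok(()),
        n => Err(format!("expected 1 to 3 commands, got {n}")),
    }
}

// Builds a sample turn with the given commands, for demonstration.
fn sample_step(commands: &[&str]) -> BrowserStep {
    BrowserStep {
        thinking: "The search box is visible in the DOM.".to_string(),
        evaluation_previous_goal: "Success: page loaded.".to_string(),
        memory: "Landing page reached.".to_string(),
        next_goal: "Type the query and submit.".to_string(),
        commands: commands.iter().map(|c| c.to_string()).collect(),
    }
}

fn main() {
    let step = sample_step(&["type", "click"]);
    println!("thinking: {}", step.thinking);
    println!("previous goal: {}", step.evaluation_previous_goal);
    println!("memory: {}", step.memory);
    println!("validation: {:?}", validate_step(&step));
}
```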

You can check out the full agent definition here.

Example: In one representative run, Browsr used the available context to navigate in step one, click in step two, and then run a JS evaluation to return structured data from the page.

Sample Traces

Average cost and number of steps using gpt-4.1-mini:

Average cost ≈ $0.0303 per run
Average length ≈ 10.5 steps per run
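
Dividing the two averages gives a rough per-step cost, which is handy when comparing models. The sketch below does that arithmetic and projects the cost of many runs. It assumes cost scales linearly with the number of runs, which is a simplification; the function names are illustrative.

```rust
// Average cost of a single step, from the per-run averages.
fn cost_per_step(avg_cost_per_run: f64, avg_steps_per_run: f64) -> f64 {
    avg_cost_per_run / avg_steps_per_run
}

// Linear projection of total cost for a number of runs (a simplifying assumption).
fn projected_cost(avg_cost_per_run: f64, runs: u32) -> f64 {
    avg_cost_per_run * f64::from(runs)
}

fn main() {
    let per_step = cost_per_step(0.0303, 10.5);
    // prints: cost per step ≈ $0.0029
    println!("cost per step ≈ ${per_step:.4}");
    // prints: 1,000 runs ≈ $30.30
    println!("1,000 runs ≈ ${:.2}", projected_cost(0.0303, 1_000));
}
```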

Why Observability is Critical for Deep Agents

AI engineers spend a lot of time trying to understand why their agents behave the way they do: tweaking system prompts, stepping through tool calls, and guessing what went wrong somewhere in the middle of a long run.

As agents move from single-shot tasks to long-running, multi-step workflows, understanding their behavior becomes much harder. A "deep agent" might run for 50+ steps, making hundreds of decisions.

Drift over time: An agent can start off doing exactly what you want, then slowly drift off-course because of noisy context, misinterpreted instructions, or a small misunderstanding early on that compounds over later steps.
Expose cost and context: Spot token spikes, context bloat, and expensive branches and compare between different models.
Make decisions traceable: Line up what the agent read, wrote, and decided so you can see cause and effect.
No big-picture view of execution: You rarely get a clear, end-to-end picture of where time and money are going: is it planning, tool execution, retries, or extraction?

vLLora is built to make this debuggable. It lets you see what your deep agents are actually doing across long runs.

Next Steps

Explore and compare using other models: Test the architecture with different LLMs to evaluate performance and cost-effectiveness.
Test computer-use automation with custom fine-tuned models: Extend the agent's capabilities beyond the browser to general computer use, leveraging fine-tuned models for specific tasks.
Simulate a complex scenario involving several steps to showcase the real capability of deep agents.

In the next article, we'll explore these extensions and how they change the agent's behavior.

Debugging LiveKit Voice Agents with vLLora

Tue, 04 Nov 2025 00:00:00 GMT

Voice agents built with LiveKit Agents enable real-time, multimodal AI interactions that can handle voice, video, and text. These agents power everything from customer support bots to telehealth assistants, and debugging them requires visibility into the complex pipeline of speech-to-text, language model, and text-to-speech interactions.

In this video, we go over how you can debug voice agents built using LiveKit Agents with vLLora. You'll see how to trace every model call, tool execution, and response as your agent processes real-time audio streams.

Setup

Run and configure vLLora locally. Follow the Quickstart guide to get started.

brew tap vllora/vllora
brew install vllora
vllora

In your LiveKit Agent code, configure your LLM provider to use vLLora's endpoint:

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    llm=openai.LLM(
        model="model-name",
        base_url="http://localhost:9090/v1",  # vLLora endpoint
        api_key="no_key",  # vLLora doesn't validate API keys
    ),
    # ... stt, tts, etc ...
)

What vLLora Shows You

With vLLora running, you can see:

Model Calls: Every LLM model call with complete input/output, token usage, cost, and timing information
Tool Definitions: All tools available to your agent, including their schemas and descriptions
Tool Usage: Every tool call made by the agent, including parameters and responses

By providing complete visibility into your voice agent's execution, vLLora makes it easier to build reliable, performant voice AI applications with LiveKit Agents.

Debugging Kilocode with vLLora

Mon, 03 Nov 2025 00:00:00 GMT

Developers building coding agents need visibility into how context flows through the agent, how much context is used, and which tools are being called. vLLora enables you to debug all of this in real time.

Setup

Run and configure vLLora locally. Follow the Quickstart guide to get started.

brew tap vllora/vllora
brew install vllora
vllora

In KiloCode, select OpenAI Compatible during setup and set the base URL to vLLora's endpoint. For the API key, enter no_key: vLLora does not validate it, because provider API keys are set in the vLLora UI.

Now open your code editor with KiloCode and start prompting your agent.

The Prompt

Add a customer leaderboard or loyalty points tracker component, 
and embed a mini gallery section for user engagement.

When this prompt runs in KiloCode, the agent edits several files, creates new components, updates imports, and adjusts the layout to match the request.

With vLLora running, we could see that the run involved 10 model calls and a sequence of tool executions including read_file, write_to_file, execute_command, apply_diff, and update_todo_list.

Across the session, the context size grew steadily: it started at about 9,000 input tokens and reached nearly 90,000 tokens by the end as the agent read, wrote, and reloaded files. This illustrates how coding agents like KiloCode repeatedly expand their working context as the project state evolves.
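
To put numbers on that growth, the sketch below computes the overall growth factor and the average increase per model call from the rounded figures in this post (about 9,000 tokens at the start, about 90,000 at the end, 10 model calls). Treating growth as evenly spread across calls is an assumption made for illustration; the real per-call numbers are in the trace.

```rust
// Ratio of final to initial context size.
fn growth_factor(first_tokens: u32, last_tokens: u32) -> f64 {
    f64::from(last_tokens) / f64::from(first_tokens)
}

// Average increase between consecutive calls; N calls have N - 1 gaps.
// Returns None when there are fewer than two calls or the context shrank.
fn average_growth_per_call(first_tokens: u32, last_tokens: u32, calls: u32) -> Option<f64> {
    if calls < 2 || last_tokens < first_tokens {
        return None;
    }
    Some(f64::from(last_tokens - first_tokens) / f64::from(calls - 1))
}

fn main() {
    // Rounded figures from the run described above.
    let (first, last, calls) = (9_000, 90_000, 10);

    println!("growth factor: {:.1}x", growth_factor(first, last));
    if let Some(avg) = average_growth_per_call(first, last, calls) {
        println!("average growth: {avg:.0} tokens per call");
    }
}
```

With these inputs the context grows roughly tenfold, or about 9,000 additional input tokens per call on average.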

Beyond the visible tools in this trace, the underlying agent also defines a larger toolset, such as:

new_task, list_code_definition_names, and search_files for project understanding
insert_content, search_and_replace, and apply_diff for precise code edits
browser_action and execute_command for testing and validation
update_todo_list and attempt_completion for managing the reasoning cycle

vLLora captures every call in sequence, showing which tools were called and how they were used, how much context each request consumed, and how the model responded. This makes debugging easier by exposing where the agent slows down, repeats steps, or mismanages context. It helps you identify issues faster, optimize performance, and build more reliable coding agents.

Using vLLora with OpenAI Agents SDK

Sun, 02 Nov 2025 00:00:00 GMT

The OpenAI Agents SDK makes it easy to build agents with handoffs, streaming, and function calling. The hard part? Seeing what's actually happening when things don't work as expected.

Setup vLLora

First, install vLLora using Homebrew:

brew tap vllora/vllora
brew install vllora
vllora

Quick Setup

Route your OpenAI requests through vLLora by changing the base URL:

from openai import OpenAI

client = OpenAI(
    api_key="no_key",
    base_url="http://localhost:9090/v1"
)

This gives you basic traces showing model calls, latencies, token usage, and function executions. You'll see what's being sent and received, but you're missing agent-specific context like handoffs, state transitions, and streaming details.

Full Agent Visibility

For complete tracing with agent state, handoffs, and streaming context, use the vLLora Python library:

pip install 'vllora[openai]'

Set your vLLora endpoint:

export VLLORA_API_BASE_URL=http://localhost:9090

Initialize vLLora before creating agents:

from vllora.openai import init

init()

# Now define your agents
from openai import OpenAI
# ...

vLLora automatically captures agent interactions, handoffs, function calls, and streaming responses. No client configuration needed—just initialize once and all your agent workflows are traced end-to-end.

You'll see agent state transitions, handoff triggers, function inputs and outputs, and streaming chunks bundled into unified traces. Each trace shows the complete execution path with timing information, so you can spot bottlenecks and debug multi-agent workflows. When an agent hands off to another, when a function executes, or when streaming starts and stops—it's all visible in one place.

Next Steps

Get started with vLLora: Quickstart Guide
Learn about deeper integrations: Working with Agent Frameworks
Explore the full documentation: Introduction

Using vLLora with Google ADK

Sat, 01 Nov 2025 00:00:00 GMT

Google ADK (Agent Development Kit) lets you build multi-agent systems across different LLM providers—Gemini, OpenAI, Anthropic, and more. But when your planner agent produces a FunctionCall for an AgentTool that doesn't run correctly, or a nested sub-agent fails silently, debugging what happened across agents and sessions becomes nearly impossible.

Debugging with vLLora

Start by pointing LiteLLM at vLLora's endpoint:

import litellm
import os

# Configure LiteLLM to route through vLLora
os.environ["OPENAI_API_KEY"] = "no_key"
os.environ["OPENAI_API_BASE"] = "http://localhost:9090/v1"

Then use LiteLLM models in your agents as usual:

from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

weather_agent = Agent(
    name="weather_agent",
    model=LiteLlm(model="openai/gpt-4o"),
    tools=[get_weather],
    # ...
)

All requests from this agent will flow through vLLora, giving you traces of model calls.

Here you can see the traces of the model calls as well as the get_weather tool call. However, Google ADK has additional metadata that is missing from these traces.

Advanced Tracing

To get complete observability, including agent boundaries, tool calls, and nested workflows, use the vLLora Python library:

pip install 'vllora[adk]'

Set your vLLora endpoint:

export VLLORA_API_BASE_URL=http://localhost:9090

Initialize vLLora before creating agents:

from vllora.adk import init

init()

# Now define your agents
from google.adk.agents import Agent
# ...

That's it. vLLora automatically discovers all agents, wraps their methods, and links sessions across your entire workflow. You don't need to configure LiteLLM separately; the initialization handles everything.

Now you can see the full ADK workflow with extra metadata about the agent and tools, all bundled together as a single run. Agent transitions, tool executions, and model calls are captured in one unified trace.

With the library integration, you get complete visibility into agent boundaries, seeing exactly when control passes between agents. Every tool call is tracked with its inputs and outputs, sessions are linked across multiple agents and sub-agents, and complex nested workflows become visualizable. Whether you're debugging a single agent or orchestrating dozens, vLLora shows you exactly what's happening at every step.

For more details on integrating vLLora with Google ADK and other agent frameworks, check out our agent framework documentation.

Using vLLora to debug Agents

Thu, 30 Oct 2025 00:00:00 GMT

Building AI agents is hard. Debugging them locally across multiple SDKs, tools, and providers feels like flying blind. Logs give you partial visibility. You need to see every call, latency, cost, and output in context without rewriting code.

Why debugging agents is hard

When you debug locally, requests disappear into SDKs. You piece together prints, partial logs, and guesswork. When something breaks or slows down, pinpointing the step, model, or tool is hard. Cost tracking is manual at best.

Meet vLLora

vLLora is a local debugging tool with a UI that intercepts LLM requests. It implements the OpenAI API, so your existing clients and frameworks work unchanged. Set base_url to http://localhost:9090/v1 and run your code as-is. vLLora forwards requests to your chosen provider using your keys, preserves streaming and tool/function calls, and records a trace for each step.

Get started in under a minute

Install vLLora, point your SDK to it, and keep your existing code:

brew tap vllora/vllora
brew install vllora
vllora

Change your base URL and you're done:

LangChain (Python)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:9090/v1",
    api_key="no_key",  # vLLora does not validate API keys
    model="openai/gpt-4o-mini",
)

Every request now flows through vLLora. Open http://localhost:9091 to see the traces streaming in real time. For detailed setup instructions across different frameworks, see Using vLLora.

Observe your agent in real time

Open the UI while you build. Each request shows inputs, outputs, timing, and cost. No custom logging code needed.

How debugging works

vLLora sits between your SDK and the provider, capturing every request your framework makes and streaming traces to the UI in real time. You get full visibility into inputs, outputs, latency, and cost per model call. Requests are grouped by run or time bucket so you can see how your agent behaves step by step, replay turns, inspect streaming output, or compare model responses across different calls.

Compatibility and models

vLLora works out of the box with OpenAI-compatible clients and major agent frameworks (LangChain, Google ADK, OpenAI Agents). Keep your code, just change the base URL. Use your own provider keys and switch between 300+ models to compare quality, performance, and cost.

When to use vLLora

vLLora shines when you're building agents that use multiple tools and models, measuring latency and cost per step matters, or you're switching between providers like OpenAI, Anthropic, and Gemini and need consistent logs across all of them. If you're debugging chain-of-thought issues or tracking down missing tool calls, vLLora gives you a single pane of glass to see everything that's happening.

Next steps

Ready to dive deeper? Check out the Quickstart for installation details and sending your first trace, or explore Working with Agent Frameworks for deeper integration with frameworks like OpenAI Agents SDK and Google ADK. For a complete overview of the product and setup details, see the Introduction.

You now have x-ray vision for your agents. Build, trace, and optimize faster, with no code changes beyond the base URL.
