How to Monitor AI Agents in Commerce Systems
Artificial intelligence (AI) isn’t just writing text or generating images anymore. It’s starting to make real-world decisions. Now, with agentic systems, we’re entering an era where AI models don’t just respond; they act autonomously, buying, booking, and negotiating on behalf of users.
That may sound promising, but those of us in the trenches of reliability know that progress always comes with trade-offs. Make no mistake, this shift fundamentally changes how observability works.
The technology: how agentic commerce works
Google’s Agent Payments Protocol (AP2) is one of the first open frameworks designed to let AI agents handle end-to-end commerce transactions, from product discovery to checkout.
Instead of a human clicking through a shopping cart, an AI agent (powered by Gemini) can:
- Interpret a natural-language intent: “Help me buy a coffee maker from Amazon”
- Search merchant catalogues through APIs
- Fetch user credentials from a secure wallet service
- Initiate and confirm payment, all through machine-to-machine communication
Behind this, a network of microservices, including merchant, payment, and credential agents, coordinates the transaction.
Each agent communicates over JSON-RPC (Remote Procedure Call), using the AP2 spec to ensure trust and interoperability between systems. Essentially, all agents in AP2 act like modular teammates that “speak the same language” (JSON-RPC) and follow a shared playbook (the AP2 spec).
This lets a shopping assistant, merchant, and payment provider coordinate securely, even if they’re built by different companies or run in different environments. By enabling coordination across organizations, AP2 shows how AI can now not only guide but also perform commercial transactions.
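To make that concrete, here is a minimal sketch of the kind of JSON-RPC 2.0 exchange a shopping agent might have with a merchant agent. The endpoint, method, and parameter names are illustrative placeholders, not the actual AP2 message schema.

```python
import json
import urllib.request

# A minimal JSON-RPC 2.0 request a shopping agent might send to a merchant
# agent. The method and parameter names are illustrative placeholders, not
# the real AP2 message schema.
rpc_request = {
    "jsonrpc": "2.0",
    "id": "req-001",
    "method": "find_products",        # hypothetical merchant-agent method
    "params": {
        "query": "coffee maker",
        "session_id": "sess-abc123",  # ties the call to one shopping session
    },
}

req = urllib.request.Request(
    "http://localhost:8001/rpc",      # assumed local merchant-agent endpoint
    data=json.dumps(rpc_request).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    rpc_response = json.load(resp)

# Per JSON-RPC 2.0, the response carries either a "result" or an "error" object.
print(rpc_response.get("result") or rpc_response.get("error"))
```

Because every agent speaks this same envelope, a new credentials provider or payment processor can join the flow without bespoke integrations, which is exactly what makes the transaction chain both powerful and harder to observe.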
The problem: visibility collapses when logic moves to AI
That new autonomy comes with a price: blind spots. Traditional synthetic tests assume deterministic logic. You click, the app calls an API, and you measure latency and response codes.
But in an agentic architecture:
- The logic is non-deterministic (driven by model reasoning)
- The execution chain spans multiple networks, LLM APIs, merchant systems, CDNs, and payment processors
- Failures are contextual, not just code-based (e.g., “Gemini throttled this request” might look like a 200 OK)
In other words, once AI starts making the decisions, traditional visibility breaks down — you lose the map. How do you monitor a system that decides, learns, and acts dynamically across infrastructure you don’t fully control?
Our setup: recreating the agentic stack
To answer that question, we built a functioning AP2 environment using Google’s open-source agentic-commerce/AP2 repository.
Here’s what we stood up locally:
- Frontend: The AP2 Dev UI, hosted behind Cloudflare CDN (simulating public access)
- Origin: A tunnelled environment representing an AWS backend (FastAPI services)
- Backend Agents:
  - Merchant Agent – processes shopping queries
  - Credentials Provider – stores payment credentials
  - Payment Processor – handles mandate creation and transaction flow
- AI Brain: Gemini 2.5 Flash, invoked via Google’s Generative Language API
In short, a complete AI-driven commerce stack, from intent to reasoning to payment initiation, running autonomously. We then used Catchpoint Internet Performance Monitoring (IPM) to measure performance, track latency, and identify failures across the system.

Monitoring the agentic commerce stack
We modelled the system using Internet Stack Map to visualize and test every layer.
To connect these layers, we used Catchpoint’s global variable extraction to capture the session ID generated in the UI synthetic test and reuse it across multiple backend API tests. Tests ran every 5 minutes from global agents.
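The scripting details belong to the monitoring platform, but the underlying pattern is simple: run the UI flow, capture the session ID it produced, then reuse it in the backend checks. A rough Python sketch of the idea, assuming the UI flow returns a JSON body containing a session_id and that the backend exposes the endpoints shown (both are assumptions for illustration):

```python
import re
import requests  # assumes the requests package is available

BASE = "http://localhost:8000"  # assumed base URL for the local AP2 stack

# Step 1: drive the UI flow and capture the session ID it generates.
ui_resp = requests.post(
    f"{BASE}/chat",
    json={"prompt": "Help me buy a coffee maker"},
    timeout=60,
)
match = re.search(r'"session_id"\s*:\s*"([^"]+)"', ui_resp.text)
session_id = match.group(1) if match else None

# Step 2: reuse the same ID in backend API checks so they exercise the
# exact transaction the UI test created, instead of a fresh, unrelated one.
if session_id:
    cart = requests.get(
        f"{BASE}/merchant/cart",
        params={"session_id": session_id},
        timeout=30,
    )
    print(cart.status_code, cart.elapsed.total_seconds())
```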
The UI synthetic flow traces a user journey from the browser prompt (“Help me buy a coffee maker”) through the multi-agent orchestration, cart creation, payment mandate signing, and final payment receipt. Using Catchpoint synthetic E2E flow testing and Internet Stack Map, we get the following:
Transaction correctness: verifies that the intent → cart → payment_mandate → signed_mandate → payment_result flow completes end to end and that transaction state is persisted across agents.
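A minimal sketch of what that correctness check looks like, assuming each stage returns a JSON object that reports its status and echoes the shared session ID (the run_stage helper is a stand-in for however the synthetic test invokes each agent):

```python
# Stage names mirror the agent calls in this flow; how each one is invoked
# is abstracted behind the hypothetical run_stage callable.
STAGES = [
    "create_intent_mandate",
    "create_cart_mandate",
    "create_payment_mandate",
    "sign_mandates_on_user_device",
    "initiate_payment",
]

def verify_transaction(run_stage, session_id: str) -> None:
    """Fail loudly if any stage errors or drops the shared transaction state."""
    for stage in STAGES:
        result = run_stage(stage, session_id)  # assumed to return a dict
        assert result.get("status") == "ok", f"{stage} failed: {result}"
        assert result.get("session_id") == session_id, (
            f"{stage} lost transaction state (session ID mismatch)"
        )
```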

Per-service performance: measures response time for each service / agent (create_intent_mandate, find_products, create_cart_mandate, get_payment_methods, create_payment_mandate, sign_mandates_on_user_device, initiate_payment). This converts “who’s slowing down the transaction” from guesswork to data.
In the example below, the find_products call alone took 13 seconds to fetch all product SKUs.
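Outside a monitoring platform, you can approximate the same per-agent breakdown by timing each call individually. A rough sketch, with endpoint paths assumed from the agent names above:

```python
import time
import requests  # assumes the requests package is available

# Endpoint paths are assumptions based on the agent call names in this post.
CALLS = {
    "create_intent_mandate": "/merchant/create_intent_mandate",
    "find_products": "/merchant/find_products",
    "create_cart_mandate": "/merchant/create_cart_mandate",
    "get_payment_methods": "/credentials/get_payment_methods",
    "create_payment_mandate": "/payments/create_payment_mandate",
}

def time_agent_calls(base_url: str, session_id: str) -> dict:
    """Return (status_code, seconds) per agent call so slow hops stand out."""
    timings = {}
    for name, path in CALLS.items():
        start = time.perf_counter()
        resp = requests.post(
            f"{base_url}{path}", json={"session_id": session_id}, timeout=30
        )
        timings[name] = (resp.status_code, round(time.perf_counter() - start, 2))
    return timings
```

With that breakdown, a 13-second find_products call is immediately attributable to the merchant agent rather than to “the assistant being slow”.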


Token consumption visibility: tracks model token usage (prompt, completion, total) per run so you can spot cost anomalies or prompt drift.
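As a sketch, token counts can be pulled straight from the model response body. The Generative Language API reports them under usageMetadata; if your client library wraps the response differently, treat these keys as assumptions and adjust accordingly:

```python
def extract_token_usage(gemini_response: dict) -> dict:
    """Pull token counts from a generateContent response body.

    The Generative Language API reports usage under `usageMetadata`
    (promptTokenCount / candidatesTokenCount / totalTokenCount); other
    client wrappers may expose the same data under different names.
    """
    usage = gemini_response.get("usageMetadata", {})
    return {
        "prompt_tokens": usage.get("promptTokenCount", 0),
        "completion_tokens": usage.get("candidatesTokenCount", 0),
        "total_tokens": usage.get("totalTokenCount", 0),
    }
```

Logging these three numbers per synthetic run turns cost anomalies and prompt drift into a trend you can alert on rather than a surprise on the monthly bill.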


Waterfall & Explorer insights: waterfall traces and Explorer request/response bodies surface the exact failing call and its payload, enabling fast root cause analysis (RCA).
Stack map correlation: connects failing runs to observed infrastructure components (CDN, edge, middleware, origin) to locate the root cause quickly.
Key blind spots we found
Catchpoint’s end-to-end AI commerce monitoring surfaced a few real-world failures that would otherwise appear as generic “assistant unresponsive” events.
1. Gemini errors
When we began testing, the system looked healthy: every endpoint returned 200 OK. But deeper inspection revealed Gemini 429 (rate-limit) errors hidden inside those successful responses.
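This is why the checks assert on the response body, not just the status code. A minimal sketch, assuming the agent echoes the upstream Gemini error (for example a 429 / RESOURCE_EXHAUSTED) somewhere inside its otherwise successful JSON response:

```python
import json

# Markers that suggest an upstream model error is buried in a 200 OK body.
# The exact shape varies by implementation, so this check is deliberately loose.
ERROR_MARKERS = ("429", "resource_exhausted", "rate limit", "quota")

def find_hidden_model_error(response_text: str) -> str | None:
    """Return the first rate-limit marker found inside a '200 OK' body, if any."""
    try:
        body = json.loads(response_text)
    except json.JSONDecodeError:
        return None
    flat = json.dumps(body).lower()
    for marker in ERROR_MARKERS:
        if marker in flat:
            return marker
    return None
```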


2. UI stuck due to merchant service timeout
- The front-end chat interface froze after create_intent_mandate
- Catchpoint Explorer and Timing breakdown revealed a connection timeout (~10s) while waiting on the merchant agent service
- The delay was isolated to the connect phase, confirming that the issue occurred before any payload transfer, likely a temporary connectivity or backend overload condition (see the connect-time sketch below)
- Stack Map correlation showed this hop passing through Cloudflare → Apigee → Merchant microservice, pinpointing the bottleneck to the merchant layer.
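Outside the platform, a rough way to confirm a connect-phase stall is to time the TCP connect in isolation; the hostname below is a placeholder:

```python
import socket
import time

def connect_time(host: str, port: int = 443, timeout: float = 10.0) -> float:
    """Measure TCP connect time on its own.

    If this alone approaches the timeout while DNS resolves fine, the stall
    sits in connection setup (connectivity or backend overload), not in
    payload transfer, which matches the timing breakdown above.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return round(time.perf_counter() - start, 3)

# Example with a placeholder hostname:
# print(connect_time("merchant-agent.example.com"))
```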

3. UI Down – 530 error from Cloudflare
- Synthetic UI tests later reported complete inaccessibility with an HTTP 530 (Origin Error) response
- The error originated at the CDN edge (Cloudflare), meaning requests never reached the origin host (a quick header check for this is sketched below)
- Catchpoint’s full waterfall and DNS tracing confirmed normal DNS resolution and SSL handshake, isolating the fault to Cloudflare’s route-to-origin link
- Because the 530 surfaced at the UI layer, RUM or browser logs alone would have masked it as “page load failed”
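A quick sanity check for this class of failure is to inspect the error response headers: HTTP 530 is a Cloudflare-specific status, and Cloudflare-served responses carry a cf-ray header and server: cloudflare. A minimal sketch:

```python
import requests  # assumes the requests package is available

def check_edge_status(url: str) -> str:
    """Report whether an error response was generated at the Cloudflare edge.

    HTTP 530 is Cloudflare-specific (it wraps a 1xxx origin-connectivity
    error), and edge-served responses include `server: cloudflare` plus a
    `cf-ray` header, placing the failure on the edge-to-origin hop rather
    than inside the application itself.
    """
    resp = requests.get(url, timeout=15)
    at_edge = (
        resp.headers.get("server", "").lower() == "cloudflare"
        and "cf-ray" in resp.headers
    )
    if resp.status_code == 530 and at_edge:
        return f"Cloudflare could not reach the origin (ray {resp.headers['cf-ray']})"
    return f"status {resp.status_code} (served_by_cloudflare={at_edge})"
```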


AI assistants are multi-layered systems where a “slow model” isn’t always the problem. By combining synthetic testing, end-to-end performance tracing, and cross-layer correlation, teams can pinpoint exactly which agent or network segment caused a stall—whether it’s a Gemini model delay, a merchant service timeout, or a CDN origin error.
Evolving observability for AI-driven systems
As AI systems begin to transact independently, observability must extend beyond servers and APIs into the logic that drives decisions. With IPM, organizations can monitor AI commerce environments across both the infrastructure and reasoning layers, ensuring reliability in this new class of intelligent systems and providing teams with visibility into:
- What the AI decided
- What it executed
- And where it broke
Agentic systems like AP2 are early prototypes of a broader shift, where applications become autonomous agents orchestrating workflows in real time. When that happens, performance data will no longer be just about page loads or API speeds. It will be about intent execution, the AI’s ability to fulfil a task successfully across systems. At Catchpoint, that’s the challenge we’re already preparing for: extending observability from systems to intelligent decision flows.
See it in action: good vs bad agentic outcomes
Below are real interactions highlighting both a smooth and an error-prone client experience in agentic commerce.
Summary
When AI agents buy, book, and negotiate autonomously, observability must evolve. Using Google’s AP2 framework and Gemini model, we built a full agentic commerce stack and instrumented it end to end. The findings reveal where visibility breaks across LLM reasoning, microservices, and networks, and how synthetic testing and Stack Map restore traceability and root-cause insight.

