How to Monitor AI Agents in Commerce Systems

Published November 11, 2025

AI agents are now executing commerce transactions end to end: searching for products, fetching credentials, and initiating payments without a human in the loop. With agentic systems, AI models don't just respond; they act autonomously, buying, booking, and negotiating on behalf of users.

That autonomy also introduces new monitoring blind spots that traditional observability wasn't designed to handle. This shift fundamentally changes how teams need to think about reliability.

Key takeaways:

  • Agentic commerce systems like Google's AP2 let AI agents handle full transactions autonomously
  • Traditional synthetic tests can't detect failures hidden inside non-deterministic AI logic
  • Catchpoint IPM, part of the LogicMonitor platform, monitors both infrastructure and AI reasoning layers
  • End-to-end tracing revealed Gemini rate-limit errors, merchant timeouts, and CDN origin failures that surface-level monitoring missed

The Technology: How Agentic Commerce Works

Google's Agent Payments Protocol (AP2) is one of the first open frameworks designed to enable AI agents to handle end-to-end commerce transactions from product discovery to checkout.

Instead of a human clicking through a shopping cart, an AI agent (powered by Gemini) can:

  • Interpret a natural-language intent: "Help me buy a coffee maker from Amazon"
  • Search merchant catalogues through APIs
  • Fetch user credentials from a secure wallet service
  • Initiate and confirm payment, all through machine-to-machine communication

Behind this, a network of microservices, including merchant, payment, and credential agents, coordinates the transaction.

Each agent communicates over JSON-RPC (JSON-encoded remote procedure calls) and follows the AP2 spec, which establishes trust and interoperability between systems.
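
To make the message format concrete, here is a minimal sketch of a JSON-RPC 2.0 request envelope in Python. The envelope fields come from the JSON-RPC 2.0 specification; using `find_products` as the method name and the `query` parameter are assumptions for illustration, not the actual AP2 message schema.

```python
import json
import uuid


def build_jsonrpc_request(method: str, params: dict) -> dict:
    """Wrap a method call in a JSON-RPC 2.0 request envelope."""
    return {
        "jsonrpc": "2.0",         # version marker required by JSON-RPC 2.0
        "id": str(uuid.uuid4()),  # lets the caller match the response to this request
        "method": method,
        "params": params,
    }


def is_jsonrpc_error(response: dict) -> bool:
    """A JSON-RPC 2.0 response carries either "result" or "error"."""
    return "error" in response


# Hypothetical method name and params, for illustration only.
request = build_jsonrpc_request("find_products", {"query": "coffee maker"})
wire_payload = json.dumps(request)
```

Because an error comes back as an `error` member of the response object, a monitor that only looks at the HTTP status can miss it, which is why inspecting the response body matters.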

This lets a shopping assistant, merchant, and payment provider coordinate securely, even if they're built by different companies or run in different environments. By enabling coordination across organizations, AP2 shows how AI can now guide and perform commercial transactions.

The Problem: Visibility Collapses When Logic Moves to AI

That new autonomy comes with a price: blind spots. Traditional synthetic tests assume deterministic logic. You click, the app calls an API, and you measure latency and response codes.

But in an agentic architecture:

  • The logic is non-deterministic (driven by model reasoning)
  • The execution chain spans multiple networks, LLM APIs, merchant systems, CDNs, and payment processors
  • Failures are contextual, not just code-based (e.g., "Gemini throttled this request" might look like a 200 OK)

Once AI starts making the decisions, traditional visibility tools can no longer map what's happening across the full execution chain. The core challenge: how do you monitor a system that decides, learns, and acts dynamically across infrastructure you don't fully control?

Our Setup: Recreating the Agentic Stack

To answer that question, we built a functioning AP2 environment using Google's open-source agentic-commerce/AP2 repository.

The local environment included:

  • Frontend: The AP2 Dev UI, hosted behind Cloudflare CDN (simulating public access)
  • Origin: A tunnelled environment representing an AWS backend (FastAPI services)
  • Backend Agents:
      • Merchant Agent -- processes shopping queries
      • Credentials Provider -- stores payment credentials
      • Payment Processor -- handles mandate creation and transaction flow
  • AI Brain: Gemini 2.5 Flash, invoked via Google's Generative Language API

In short, this was a complete AI-driven commerce stack, from intent to reasoning to payment initiation, running autonomously. We then used Catchpoint Internet Performance Monitoring (IPM), part of the LogicMonitor platform, to measure performance, track latency, and identify failures across the system.

Figure: Catchpoint Internet Stack Map

Monitoring the Agentic Commerce Stack

We modelled the system using Internet Stack Map to visualize and test every layer.

| Layer | What It Represents | Monitoring Method |
|---|---|---|
| DNS (Cloudflare) | DNS resolution and routing for the domain | DNS Monitor |
| CDN (Cloudflare Edge) | Edge delivery, caching, and TLS | CDN/HTTP Monitor |
| Origin (AWS) | FastAPI service entry point | Web Object Monitor |
| Backend (AP2 Agents) | Merchant, Payment, and Credentials microservices | API Monitors |
| AI Layer (Gemini) | Generative reasoning endpoint | API Monitor with response validation |
| UI (AP2 Dev Interface) | Frontend experience and intent submission | Full Browser Test & Transaction Tests with response validation |

To connect these layers, we used Catchpoint IPM's global variable extraction to capture the session ID generated in the UI synthetic test and reuse it across multiple backend API tests. Tests ran every 5 minutes from global agents.
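
The correlation idea can be sketched in a few lines of Python. This is not Catchpoint's variable-extraction syntax; the `session_id` field name, the `${session_id}` placeholder, and the request templates are all hypothetical and only illustrate capturing a value in one test and reusing it in others.

```python
import re
from typing import List, Optional

# Assumed shape of the UI response: a JSON body containing "session_id".
SESSION_PATTERN = re.compile(r'"session_id"\s*:\s*"([^"]+)"')


def extract_session_id(ui_response_body: str) -> Optional[str]:
    """Pull the session ID out of the UI test's response body, if present."""
    match = SESSION_PATTERN.search(ui_response_body)
    return match.group(1) if match else None


def build_backend_requests(session_id: str, templates: List[str]) -> List[str]:
    """Substitute the captured session ID into each backend request template."""
    return [template.replace("${session_id}", session_id) for template in templates]


# Hypothetical response body and request templates.
body = '{"session_id": "abc-123", "status": "ready"}'
session = extract_session_id(body)
requests_to_send = build_backend_requests(
    session,
    ['{"method": "find_products", "session": "${session_id}"}'],
)
```

Reusing one session ID means a failure in a backend test can be tied back to the exact UI run that triggered it.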

The UI synthetic flow traces a user journey from the browser prompt ("Help me buy a coffee maker") through the multi-agent orchestration, cart creation, payment mandate signing, and final payment receipt. Using Catchpoint synthetic E2E flow testing and Internet Stack Map, we get the following:

Transaction correctness: verifies that the flow intent → cart → payment_mandate → signed_mandate → payment_result completes in order and that transaction state persists across agents.

Figure: Dashboard showing a successful synthetic test run, with header-level detail confirming the successful transaction
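
As a rough illustration of what a correctness check involves, the sketch below validates that stages arrive in the expected order and share one session. The stage names come from the flow above; the record shape (`stage` and `session_id` keys) is an assumption made for the example.

```python
from typing import Dict, List, Tuple

EXPECTED_STAGES = [
    "intent",
    "cart",
    "payment_mandate",
    "signed_mandate",
    "payment_result",
]


def check_transaction(records: List[Dict[str, str]]) -> Tuple[bool, str]:
    """Return (ok, reason) for a list of observed stage records."""
    stages = [record["stage"] for record in records]
    # Every observed stage must match the expected sequence position by position.
    if stages != EXPECTED_STAGES[: len(stages)]:
        return False, "stages out of order"
    # A correct prefix that ends early means the flow stalled.
    if len(stages) < len(EXPECTED_STAGES):
        return False, f"stopped before {EXPECTED_STAGES[len(stages)]}"
    # State persistence: all stages must belong to the same session.
    if len({record["session_id"] for record in records}) != 1:
        return False, "session changed mid-transaction"
    return True, "complete"
```

Returning a reason alongside the boolean keeps the alert actionable: "stopped before payment_mandate" points at a different agent than "session changed mid-transaction".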

Per-service performance: measures response time for each service/agent (create_intent_mandate, find_products, create_cart_mandate, get_payment_methods, create_payment_mandate, sign_mandates_on_user_device, initiate_payment). This converts "who's slowing down the transaction" from guesswork to data.

In the example below, we see find_products alone took 13 seconds to fetch all product SKUs.

Figure: Dashboard correlating Waterfall and Explorer views, illustrating response times and performance metrics for each agentic service and API call
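
The per-service breakdown boils down to timing each call separately and ranking the results. A minimal sketch, assuming each service call can be wrapped in a zero-argument Python callable (the callables themselves are hypothetical stand-ins for real requests):

```python
import time
from typing import Callable, Dict, Tuple


def time_steps(steps: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each step in order and record its wall-clock duration in seconds."""
    timings = {}
    for name, call in steps.items():
        start = time.perf_counter()
        call()
        timings[name] = time.perf_counter() - start
    return timings


def slowest_step(timings: Dict[str, float]) -> Tuple[str, float]:
    """Return the name and duration of the slowest step."""
    name = max(timings, key=timings.get)
    return name, timings[name]
```

Fed with timings for the seven services listed above, `slowest_step` would single out `find_products` in a run like the one shown.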

Token consumption visibility: tracks model token usage (prompt, completion, total) per run so you can spot cost anomalies or prompt drift.

Figure: Token consumption dashboard showing various token usage metrics over time
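
One simple way to flag a cost anomaly is to compare each run's total against a recent baseline. The sketch below assumes the usage counts arrive in the `usageMetadata` object of a Gemini API response (field names as we understand them; verify against the current API reference). The 2x-median threshold is an arbitrary illustrative choice, not a recommendation from the original tests.

```python
from statistics import median
from typing import Dict, List


def token_totals(usage_metadata: Dict[str, int]) -> Dict[str, int]:
    """Normalize a usage object into prompt, completion, and total counts."""
    prompt = usage_metadata.get("promptTokenCount", 0)
    completion = usage_metadata.get("candidatesTokenCount", 0)
    # Fall back to the sum if the response omits an explicit total.
    total = usage_metadata.get("totalTokenCount", prompt + completion)
    return {"prompt": prompt, "completion": completion, "total": total}


def is_token_anomaly(current_total: int, recent_totals: List[int], factor: float = 2.0) -> bool:
    """Flag a run whose total exceeds `factor` times the recent median."""
    if not recent_totals:
        return False  # no baseline yet, so nothing to compare against
    return current_total > factor * median(recent_totals)
```

The median is used rather than the mean so that a single earlier spike does not inflate the baseline and hide the next one.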

Waterfall & Explorer insights: waterfall traces and Explorer request/response bodies surface the exact failing request and its payload, enabling fast root cause analysis.

Stack map correlation: connects failing runs to observed infrastructure components (CDN, edge, middleware, origin) to locate the root cause quickly.

Key Blind Spots We Found

End-to-end AI commerce monitoring with Catchpoint IPM surfaced a few real-world failures that would otherwise appear as generic "assistant unresponsive" events.

1. Gemini Errors

When we began testing, the system looked healthy: every endpoint returned 200 OK. Deeper inspection revealed Gemini 429 (rate-limit) errors hidden inside those successful responses.

Figure: A failed synthetic test run, with Waterfall and request-level data pinpointing the source and details of the transaction error
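
Response validation for this case can be as simple as scanning a "successful" body for throttling markers. The marker list below is an assumption: `RESOURCE_EXHAUSTED` is the status string Google APIs commonly pair with 429, and the other patterns are generic guesses that should be tuned to the payloads your own stack produces.

```python
import re
from typing import Optional

# Assumed markers of an upstream rate-limit error embedded in a response body.
THROTTLE_MARKERS = [
    re.compile(r'"code"\s*:\s*429'),
    re.compile(r"RESOURCE_EXHAUSTED"),
    re.compile(r"too many requests", re.IGNORECASE),
    re.compile(r"rate limit", re.IGNORECASE),
]


def find_hidden_throttle(status_code: int, body: str) -> Optional[str]:
    """Return the matched marker if a 200 response hides a throttling error."""
    if status_code != 200:
        return None  # non-200 responses are already visible as failures
    for marker in THROTTLE_MARKERS:
        match = marker.search(body)
        if match:
            return match.group(0)
    return None
```

A bare `429` is deliberately not matched on its own, since in a shopping flow it could just as easily be a price.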

2. UI Stuck Due to Merchant Service Timeout

  • The front-end chat interface froze after create_intent_mandate
  • Catchpoint Explorer and Timing breakdown revealed a connection timeout (~10s) while waiting on the merchant agent service
  • The delay was isolated at the connect phase, confirming that the issue occurred before payload transfer, likely a temporary connectivity or backend overload condition
  • Stack Map correlation showed this hop passing through Cloudflare → Apigee → Merchant microservice, pinpointing the bottleneck to the merchant layer

Figure: A failed synthetic test run, with Waterfall data revealing the connection timeout and the precise service layer responsible for the issue
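
The reasoning behind "isolated at the connect phase" can be sketched as picking the dominant phase from a request's timing breakdown. The phase names and hint texts here are hypothetical and do not mirror Catchpoint's metric names.

```python
from typing import Dict

# Hypothetical phase names mapped to the kind of fault each one suggests.
PHASE_HINTS = {
    "dns": "name resolution problem",
    "connect": "connectivity or backend overload before any payload was sent",
    "tls": "handshake problem",
    "wait": "slow server-side processing after the request was sent",
    "download": "slow payload transfer",
}


def dominant_phase(timings_ms: Dict[str, float]) -> str:
    """Return the phase that consumed the most time."""
    return max(timings_ms, key=timings_ms.get)


def diagnose(timings_ms: Dict[str, float]) -> str:
    """Describe the likely fault class for the dominant phase."""
    phase = dominant_phase(timings_ms)
    return f"{phase}: {PHASE_HINTS.get(phase, 'unclassified phase')}"
```

A request that spends roughly 10 seconds in `connect` and nothing afterwards never reached the point of exchanging a payload, which rules out slow application logic as the cause.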

3. UI Down: 530 Error From Cloudflare

  • Synthetic UI tests later reported complete inaccessibility with an HTTP 530 (Origin Error) response
  • The error originated at the CDN edge (Cloudflare), meaning requests never reached the origin host
  • Catchpoint's full waterfall and DNS tracing confirmed normal DNS resolution and SSL handshake, isolating the fault to Cloudflare's route-to-origin link
  • Because the 530 surfaced at the UI layer, RUM or browser logs alone would have masked it as "page load failed"

Figure: Screenshot showing the 530 error details
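
The isolation logic in this case follows a simple elimination order: DNS, then TLS, then the status code. A minimal sketch, assuming the 520-527 and 530 status codes are generated by the Cloudflare edge rather than the origin (per our reading of Cloudflare's documentation); the layer labels are made up for the example.

```python
from typing import Optional

# Status codes assumed to be generated by the Cloudflare edge, not the origin.
CLOUDFLARE_EDGE_CODES = set(range(520, 528)) | {530}


def locate_failure(dns_ok: bool, tls_ok: bool, status_code: Optional[int]) -> str:
    """Walk the layers outside-in and return the first one that looks faulty."""
    if not dns_ok:
        return "dns"
    if not tls_ok:
        return "edge-tls"
    if status_code is None:
        return "no-response"
    if status_code in CLOUDFLARE_EDGE_CODES:
        return "edge-to-origin"
    if status_code >= 500:
        return "origin"
    if status_code >= 400:
        return "client-request"
    return "healthy"
```

With healthy DNS and TLS and a 530 status, the function lands on `edge-to-origin`, matching the conclusion drawn from the waterfall above.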

AI assistants are multi-layered systems where a "slow model" isn't the only possible culprit. By combining synthetic testing, end-to-end performance tracing, and cross-layer correlation, teams can pinpoint exactly which agent or network segment caused a stall, whether it's a Gemini model delay, a merchant service timeout, or a CDN origin error.

Evolving Observability for AI-Driven Systems

As AI systems begin to transact independently, observability must extend beyond servers and APIs into the logic that drives decisions. With Catchpoint IPM, organizations can monitor AI commerce environments across both the infrastructure and reasoning layers, gaining visibility into the AI's decision logic, execution path, and failure points.

Agentic systems like AP2 are early prototypes of a broader shift, where applications become autonomous agents orchestrating workflows in real time. When that happens, performance data will no longer be about page loads or API speeds alone. It will be about intent execution: the AI's ability to fulfil a task successfully across systems.

By unifying LM Envision, Catchpoint IPM, and Edwin AI into one Autonomous IT platform, LogicMonitor extends observability from systems to intelligent decision flows, giving teams the visibility they need to keep AI-driven operations reliable.

Summary

When AI agents buy, book, and negotiate autonomously, observability must evolve. Using Google’s AP2 framework and Gemini model, we built a full agentic commerce stack and instrumented it end to end. The findings reveal where visibility breaks across LLM reasoning, microservices, and networks, and how synthetic testing and Stack Map restore traceability and root-cause insight.
