MCP Servers: the observability gap nobody is talking about
by Vandan

Since late 2024, AI assistants have evolved from simple chat interfaces into systems that actively interact with tools, APIs, observability platforms, ticketing systems, cloud environments, and enterprise workflows.

Instead of just answering questions, AI agents are now expected to:

  • Retrieve incidents
  • Query dashboards
  • Open tickets
  • Search repositories
  • Analyze logs
  • Trigger workflows
  • Interact with internal systems

And increasingly, all of this is happening through MCP servers.

The Model Context Protocol (MCP) is rapidly becoming the bridge between AI agents and operational systems. Everywhere you look, companies are launching MCP servers for observability platforms, developer tools, cloud services, collaboration systems, and internal enterprise applications.

The ecosystem is exploding.

But something interesting happens when AI assistants start depending on MCP infrastructure for real work.

A completely new reliability problem emerges.

AI Agents Now Have Dependencies

In traditional applications, the dependency chain was relatively straightforward:

  • User
  • Frontend
  • Backend
  • Database/API

But AI workflows now introduce another operational layer:

  • AI assistant
  • MCP client
  • MCP server
  • Tool execution
  • Downstream services
  • Enterprise systems

That changes everything.

Now an AI assistant may fail not because the model is broken, but because:

  • An MCP server is unavailable
  • Tool discovery is failing
  • Authentication expired
  • A downstream API slowed down
  • Tool schemas changed unexpectedly
  • Responses became incomplete
  • Permissions drifted
  • Certain tools work from one region but not another

And most users would never know where the failure actually occurred.

They simply experience:

  • “I couldn’t complete that task.”
  • Delayed responses
  • Incorrect outputs
  • Partial answers
  • Silent workflow failures

From the outside, the AI assistant appears unreliable.

But the real problem may exist several layers deeper inside the MCP ecosystem.

A Failure Mode That Doesn’t Exist in the HTTP World

Here’s the scenario that made this real for me.

An engineering team ships an AI-powered support product built on their own MCP server. A backend cache misconfiguration causes tools/list, the endpoint that tells the AI agent which tools are available, to return an empty array. Every HTTP request still returns status 200. Their existing monitoring shows green. Datadog shows green.

For six hours, every customer using the AI feature gets silent degradation: the agent can’t find any tools and falls back to unhelpful generic responses. Nobody on the engineering team knows. The incident is eventually surfaced by a customer support ticket. Total time to diagnose: three hours.

Here’s what makes this failure mode different from anything that came before: a standard HTTP uptime check has no way to catch it. HTTP 200 is correct. The server is reachable. Latency is normal. By every traditional measure, the system is healthy.

The failure is at the protocol level, inside the MCP conversation between agent and server, and that’s a layer that current monitoring tools simply don’t see.
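
To make that concrete, here is a minimal Python sketch of the two views of the same response. The helper functions and the tool names (search_tickets, get_customer) are hypothetical, invented for illustration; the only things taken from MCP itself are the JSON-RPC envelope and the result.tools array that tools/list returns.

```python
import json


def http_check(status_code: int) -> bool:
    """What a plain uptime check sees: only the HTTP status."""
    return status_code == 200


def tools_list_check(body: str, required: set[str]) -> tuple[bool, str]:
    """Look inside the JSON-RPC body of a tools/list response.

    Returns (ok, reason).
    """
    try:
        message = json.loads(body)
    except json.JSONDecodeError:
        return False, "body is not valid JSON"
    if not isinstance(message, dict):
        return False, "body is not a JSON-RPC object"
    if "error" in message:
        return False, f"JSON-RPC error: {message['error']}"
    result = message.get("result")
    tools = result.get("tools") if isinstance(result, dict) else None
    if not isinstance(tools, list):
        return False, "result.tools is missing or not a list"
    names = {t.get("name") for t in tools if isinstance(t, dict)}
    missing = required - names
    if missing:
        return False, f"missing tools: {sorted(missing)}"
    return True, "ok"


# The incident, reduced to its essentials: HTTP 200 with an empty tool list.
status = 200
body = json.dumps({"jsonrpc": "2.0", "id": 2, "result": {"tools": []}})
required = {"search_tickets", "get_customer"}  # hypothetical tool names

print("HTTP check passes:", http_check(status))
print("Protocol check:", tools_list_check(body, required))
```

The first check is satisfied by the status code alone. The second only fails because it parses the body and compares it against what the agent actually needs.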

What the Industry Is Doing Today

To be clear, observability companies are already moving quickly into the MCP ecosystem.

Organizations like Datadog and New Relic have started exposing observability capabilities through MCP servers. In many cases, the goal is straightforward and powerful: allow AI assistants to interact directly with operational data, incidents, dashboards, logs, traces, and workflows.

Datadog positions its MCP Server as a bridge between AI agents and observability telemetry, enabling agents to retrieve logs, metrics, incidents, traces, and operational context directly inside AI workflows.

New Relic has similarly focused on connecting AI agents to observability data and instrumenting MCP interactions so organizations can visualize MCP request lifecycles, tool invocations, and execution paths.

The first wave of MCP adoption has largely focused on enablement:

  • Making observability data accessible to AI agents
  • Allowing tools to be discovered dynamically
  • Giving assistants the ability to query systems and perform operational tasks
  • Instrumenting MCP requests and server-side activity

And honestly, that makes sense. The ecosystem is still young, and the immediate priority has been enabling AI-driven workflows as quickly as possible.

But there’s a capability gap that hasn’t been addressed yet. Both Datadog Synthetics and Grafana Synthetic Monitoring operate at the HTTP level. They can verify that an endpoint returns 200, that a response arrives within a latency threshold, that a certificate hasn’t expired. They don’t understand the MCP protocol. They can’t initialize an MCP handshake, call tools/list, or verify that the tools an agent depends on are present and returning valid responses.
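
For a sense of what "speaking MCP" involves, here is a rough Python sketch of the messages a probe would have to send before it can even ask for the tool list. MCP is built on JSON-RPC 2.0, and the method names below come from the protocol; the helper functions, the client name, and the chosen protocol version are my own assumptions, and the transport (stdio or HTTP) is deliberately left out.

```python
import itertools
import json

_ids = itertools.count(1)


def request(method, params=None):
    """Build a JSON-RPC 2.0 request (expects a response, so it carries an id)."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def notification(method):
    """Build a JSON-RPC 2.0 notification (no id, no response expected)."""
    return {"jsonrpc": "2.0", "method": method}


def handshake_messages(protocol_version="2025-06-18"):
    """Messages a client sends, in order, to reach a usable tools/list call.

    The protocol version is an assumption: use the revision your server supports.
    """
    return [
        request("initialize", {
            "protocolVersion": protocol_version,
            "capabilities": {},
            # Hypothetical client identity for the probe.
            "clientInfo": {"name": "synthetic-probe", "version": "0.1.0"},
        }),
        notification("notifications/initialized"),
        request("tools/list"),
    ]


for message in handshake_messages():
    print(json.dumps(message))
```

A real client waits for the initialize response before sending the next message. Either way, none of this fits the "send one request, check the status code" model that HTTP checks are built around.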

That means the industry is currently able to:

  • Expose MCP functionality
  • Instrument MCP traffic
  • Log and trace MCP activity

…but not yet validate whether the entire AI workflow is actually reliable from the agent’s perspective.

Because an MCP server being reachable does not necessarily mean:

  • The right tools are available
  • Tool responses are valid
  • Downstream dependencies are healthy
  • Workflows complete successfully
  • AI agents behave consistently
  • The end-user experience remains reliable

Traditional Monitoring May Not Be Enough

Most monitoring systems today focus on:

  • Infrastructure health
  • API uptime
  • CPU and memory
  • Logs and traces
  • Network latency

But MCP introduces something different.

An MCP server might technically be “healthy” while still failing AI workflows.

For example:

  • The endpoint is reachable, but tool discovery is broken
  • The MCP server responds, but critical tools time out
  • A downstream dependency silently fails
  • Responses return successfully, but contain unusable data
  • Tool definitions drift over time
  • AI agents begin behaving inconsistently

Traditional uptime checks may never catch these issues.
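
Tool definition drift is one of the easier items on that list to check mechanically. As a rough sketch, a probe could fingerprint each tool's name and inputSchema from a known-good tools/list result and compare later results against it. The function names and the example tool are hypothetical, and hashing only name and inputSchema is my own simplification.

```python
import hashlib
import json


def fingerprint(tool: dict) -> str:
    """Stable hash of the parts of a tool definition an agent relies on."""
    relevant = {"name": tool.get("name"), "inputSchema": tool.get("inputSchema")}
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_tools(baseline: list[dict], current: list[dict]) -> dict[str, list[str]]:
    """Compare two tools/list results; assumes every tool has a "name"."""
    old = {t["name"]: fingerprint(t) for t in baseline}
    new = {t["name"]: fingerprint(t) for t in current}
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


# Hypothetical example: a new required argument appears on an existing tool.
baseline = [{
    "name": "search_tickets",
    "inputSchema": {"type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"]},
}]
current = [{
    "name": "search_tickets",
    "inputSchema": {"type": "object",
                    "properties": {"query": {"type": "string"},
                                   "tenant_id": {"type": "string"}},
                    "required": ["query", "tenant_id"]},
}]

print(diff_tools(baseline, current))
```

Here the tool is still listed and the server still answers, yet an agent calling search_tickets the old way would start failing. That is exactly the kind of change an uptime check has no reason to notice.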

And as organizations increasingly rely on AI agents for operational tasks, these failures become much more visible.

In many ways, this feels similar to the early days of APIs and microservices. Initially, teams focused heavily on service availability and backend telemetry. But over time, organizations realized that uptime alone was not enough to measure actual user experience, and synthetic monitoring emerged to fill that gap by simulating real user interactions rather than just pinging endpoints.

MCP ecosystems are heading in the same direction. The question is just how long it takes.

The Next Operational Question

Over the next few years, I think observability teams will begin asking entirely new questions:

  • How do we measure MCP reliability?
  • What does an SLA for an MCP server even look like?
  • How do we validate tool correctness?
  • How do we detect schema drift?
  • How do we know whether an AI workflow actually succeeded?
  • How do we separate model issues from MCP infrastructure issues?
  • How do we troubleshoot AI task failures across multiple systems?

More importantly:

How do we monitor whether the agent itself can successfully do its job?

That feels fundamentally different from traditional infrastructure monitoring.
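
I don't have settled answers, but one possible starting point for the measurement questions is to record, for every synthetic run, the first stage at which it failed, and to compute reliability from that. The sketch below is only an illustration: the stage names, the ProbeResult type, and the summarize function are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical stages of one agent-style check, in the order they run.
STAGES = ("transport", "initialize", "tools/list", "tools/call")


@dataclass
class ProbeResult:
    failed_stage: Optional[str]  # None means every stage passed


def summarize(results: list) -> dict:
    """Success rate plus a breakdown of where the failing runs stopped."""
    total = len(results)
    failures = Counter(r.failed_stage for r in results if r.failed_stage is not None)
    passed = total - sum(failures.values())
    return {
        "success_rate": passed / total if total else None,
        "failures_by_stage": dict(failures),
    }


# Four hypothetical runs: three clean, one that stopped at tool discovery.
runs = [ProbeResult(None), ProbeResult(None), ProbeResult("tools/list"), ProbeResult(None)]
print(summarize(runs))
```

A run that fails at "transport" points at infrastructure, while one that fails at "tools/list" or "tools/call" points at the MCP layer. Runs that pass every stage while users still see bad answers suggest looking at the model or the prompt instead.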

A New Category of Observability Is Emerging

The answer, I think, is protocol-aware synthetic monitoring: probes that don’t just check whether an MCP server is reachable, but actually speak the MCP protocol. They initialize a session, call tools/list, verify the expected tools are present, invoke specific tools with known inputs, and assert that the responses are structurally valid.

This is the gap: treating the MCP conversation itself as the thing to test, not just the HTTP transport underneath it.
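
The probe steps described above could be sketched roughly like this in Python. This is not a finished tool. The send function is a hypothetical transport hook that I assume performs one JSON-RPC request and returns its result object (and also takes care of the notifications/initialized message after initialize). The fake server, the ping tool, and the protocol version are likewise assumptions made so the example runs on its own.

```python
from typing import Callable, Optional

# Hypothetical transport hook: send(method, params) -> JSON-RPC "result" dict.
Send = Callable[[str, Optional[dict]], dict]


class ProbeFailure(Exception):
    def __init__(self, stage: str, reason: str):
        super().__init__(f"{stage}: {reason}")
        self.stage = stage


def run_probe(send: Send, expected_tools: set, canary_tool: str, canary_args: dict) -> None:
    """Raise ProbeFailure at the first stage that does not look healthy."""
    # 1. Initialize a session.
    init = send("initialize", {
        "protocolVersion": "2025-06-18",  # assumed; use what your server speaks
        "capabilities": {},
        "clientInfo": {"name": "synthetic-probe", "version": "0.1.0"},
    })
    if "protocolVersion" not in init:
        raise ProbeFailure("initialize", "result has no protocolVersion")

    # 2. Call tools/list and verify the expected tools are present.
    tools = send("tools/list", None).get("tools")
    if not isinstance(tools, list):
        raise ProbeFailure("tools/list", "result.tools is not a list")
    names = {t.get("name") for t in tools if isinstance(t, dict)}
    missing = set(expected_tools) - names
    if missing:
        raise ProbeFailure("tools/list", f"missing tools: {sorted(missing)}")

    # 3. Invoke one tool with a known input and check the response shape.
    result = send("tools/call", {"name": canary_tool, "arguments": canary_args})
    if result.get("isError"):
        raise ProbeFailure("tools/call", "tool reported isError")
    content = result.get("content")
    if not isinstance(content, list) or not all(
        isinstance(item, dict) and "type" in item for item in content
    ):
        raise ProbeFailure("tools/call", "content is not a list of typed items")


def fake_server(tools: list) -> Send:
    """In-memory stand-in for a real MCP server, for demonstration only."""
    def send(method, params):
        if method == "initialize":
            return {"protocolVersion": "2025-06-18", "capabilities": {},
                    "serverInfo": {"name": "fake", "version": "0"}}
        if method == "tools/list":
            return {"tools": tools}
        if method == "tools/call":
            return {"content": [{"type": "text", "text": "pong"}], "isError": False}
        raise ValueError(f"unexpected method: {method}")
    return send


for label, server in [("healthy", fake_server([{"name": "ping"}])),
                      ("empty tool list", fake_server([]))]:
    try:
        run_probe(server, {"ping"}, "ping", {})
        print(label, "-> ok")
    except ProbeFailure as failure:
        print(label, "-> failed at", failure)
```

The second fake server reproduces the empty tools/list scenario from earlier: every call "succeeds" at the transport level, and the probe still fails at the discovery stage.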

It’s no longer just about servers, APIs, or applications.

It’s about workflows executed by AI agents and whether those workflows will actually succeed when a real user triggers them.

If you’re building or consuming MCP servers, I’d be curious how you’re handling this today. Are you catching protocol-level failures before your users do, or are you still finding out from support tickets?