AI Vendor and Build-vs-Buy | 4 min read

How to Evaluate an AI Agent Consultant When the Demo Tells You Nothing

A demo proves the agent works in a sandbox. Here are the six controls to inspect that prove it will work against your real data, permissions, and edge cases.

Figure 01: Business and technology leaders reviewing an AI agent evaluation checklist with permissions, data, monitoring, and fallback controls.
Answer summary

Best fit: B2B services and technology (industry); operations and technology (function)
Operating path: AI Vendor and Build-vs-Buy -> AI Transformation
Key metric: 3 checks before build: tools, data, and fallback paths

The demo is a magic trick, and you already know how it ends

The agent reads a clean invoice, pulls the right purchase order, posts a tidy summary to the channel. The room nods. The deal moves. Then six weeks into the build, somebody feeds it the invoice where the vendor name is spelled three different ways across two systems, the PO is closed, and the amount is off by a rounding error nobody can explain. The agent does what unconstrained agents do: it picks an answer and acts on it with total confidence.

That gap, between the rehearsed path and the messy Tuesday, is the entire evaluation. A demo is a sample of one happy case the consultant chose. What you are actually buying is the behavior on the cases they did not show you: the conflicting record, the missing field, the request that is one keystroke away from touching something it should never touch. The NIST AI Risk Management Framework is useful here precisely because it ignores the demo and organizes the problem around mapping intended use, measuring risk, and managing the system after it goes live. Bain's agentic AI research lands in the same place: agentic transformation is an operating problem, not a model showcase.

So stop asking the consultant to impress you. Ask them four questions the demo cannot answer: What systems can this agent reach? What actions can it take, versus only recommend? What evidence does it surface so a human can check its work? And what does it do the moment it is unsure or the source data disagrees with itself? An agent that can write to your accounting system is a different risk than one that drafts a message for a human to send. If the consultant treats those as the same conversation, that is your answer.

Inspect the allowlist before the feature list

The tell of a serious agent consultant is that they describe the thing as a constrained participant in a workflow, not a digital employee you can point at problems. Autonomous-employee language is a sales frame; constrained-participant language is a design. Microsoft's documentation of Copilot's architecture, data protection, and auditing is worth reading even if you will never buy Copilot, because it spells out the machinery that has to exist before any AI assistant touches real business data: tenant boundaries, the permissions the agent inherits, logging, and retention. The product is incidental; the checklist is the point.

Translate that into six things you can ask to see, today, without a contract:

  1. The tool allowlist: the explicit, finite set of actions the agent is permitted to invoke, and what is deliberately left off it.
  2. Action limits: where it can recommend versus where it can execute, and what dollar amount or record type trips a hard stop.
  3. The audit log: every action, with inputs and outputs, attributable and reviewable after the fact.
  4. The approval queue: which actions pause for a human, and who that human is.
  5. Retrieval and prompt governance: where its knowledge comes from and how a wrong source gets corrected.
  6. The edge-case test plan: the deliberately ugly inputs they will run before you trust it with anything that matters.
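To make the first four controls concrete, here is a minimal sketch of what a gate in front of an agent's tool calls might look like. Everything in it is an illustrative assumption, not any vendor's design: the tool names, the `Gate` class, and the 5,000 dollar threshold are hypothetical. The point is only that an allowlist, a hard stop, an approval queue, and an audit log are small, inspectable things a consultant should be able to show you.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical policy values; a real deployment would load these from reviewed config.
ALLOWED_TOOLS = {"crm.read_contact", "crm.update_contact", "email.draft_reply"}
NEEDS_APPROVAL = {"crm.update_contact"}  # pauses for a human before anything runs
HARD_STOP_AMOUNT = 5_000.00              # illustrative dollar threshold


@dataclass
class Decision:
    status: str  # "allowed", "queued_for_approval", or "blocked"
    reason: str


@dataclass
class Gate:
    audit_log: list[dict[str, Any]] = field(default_factory=list)
    approval_queue: list[dict[str, Any]] = field(default_factory=list)

    def request(self, tool: str, args: dict[str, Any], actor: str) -> Decision:
        """Decide whether a requested tool call may proceed, and log the decision."""
        if tool not in ALLOWED_TOOLS:
            decision = Decision("blocked", "tool not on allowlist")
        elif args.get("amount", 0) > HARD_STOP_AMOUNT:
            decision = Decision("blocked", "amount exceeds hard stop")
        elif tool in NEEDS_APPROVAL:
            self.approval_queue.append({"tool": tool, "args": args, "actor": actor})
            decision = Decision("queued_for_approval", "human sign-off required")
        else:
            decision = Decision("allowed", "within limits")

        # Every request is logged with its inputs, including the refused ones.
        self.audit_log.append(
            {
                "actor": actor,
                "tool": tool,
                "args": args,
                "status": decision.status,
                "reason": decision.reason,
            }
        )
        return decision
```

A request for a tool that is not on the list, such as a hypothetical `accounting.post_payment`, comes back blocked and still leaves a log entry. That is the behavior to ask a consultant to demonstrate: the refusal, and the record of the refusal.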

Here is the practical test. Say you run a 60-person professional services firm and the pitch is an agent that triages inbound client requests and updates your CRM. Ask: "Show me what happens when two records for the same client conflict." A consultant who has built this before has an answer queued up, usually some version of: it stops, flags the conflict, and routes to a named owner. A consultant who has only built demos will reach for the impressive-sounding answer instead, something about how the model "intelligently reconciles" the records. Reconciling records by guessing is exactly the failure you are trying to avoid.
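The "stops, flags, routes" answer is simple enough to sketch. The function below is a hypothetical illustration, with made-up field names and a made-up owner: it compares two records field by field and escalates on any disagreement instead of choosing a winner. A real CRM integration would be messier, but the shape of the decision should look like this.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Resolution:
    action: str                             # "proceed" or "escalate"
    owner: Optional[str]                    # named human who receives the conflict
    conflicts: dict[str, tuple[Any, Any]]   # field -> (value in A, value in B)


def reconcile(record_a: dict, record_b: dict, fields: list[str], owner: str) -> Resolution:
    """Compare two records on the given fields; never guess which one is right."""
    conflicts = {}
    for name in fields:
        a, b = record_a.get(name), record_b.get(name)
        if a != b:
            conflicts[name] = (a, b)

    if conflicts:
        # Stop, surface the evidence, and route to a named owner.
        return Resolution("escalate", owner, conflicts)
    return Resolution("proceed", None, {})
```

Note what is absent: there is no branch that picks the "more likely" value. If the consultant's design has one, ask who reviewed the rule behind it and where its decisions are logged.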

AI agent governance diagram showing source data, permitted tools, human review, monitoring, and exception handling.

Make them show you a workflow that got measurably better

The last move is to refuse model benchmarks as evidence. The agent's accuracy on some public test set tells you nothing about your CRM, your invoices, or your client mess. McKinsey's State of AI research and the PwC Responsible AI survey both point at the same value drivers: adoption, governance, and redesigning the work so a human stays accountable. Ask the consultant to show one workflow that got better and quantify it: hours of cycle time pulled out, error rate before and after a review gate, handoffs that stopped getting dropped, the percentage of cases that now escalate cleanly instead of dying in someone's inbox. If the only number they can produce describes the model and not the work, they have not run this in production.
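The numbers worth asking for describe the work, and they are cheap to compute once someone has logged the cases. As a rough sketch, assuming each case is recorded with a cycle time, an error flag, and whether it escalated cleanly (the `Case` fields here are hypothetical), a before-and-after comparison could be as small as this:

```python
from dataclasses import dataclass


@dataclass
class Case:
    cycle_hours: float       # time from request to resolution
    had_error: bool          # an error was found in the outcome
    escalated_cleanly: bool  # reached a named owner instead of stalling


def summarize(cases: list[Case]) -> dict[str, float]:
    """Average cycle time plus error and clean-escalation rates for a set of cases."""
    n = len(cases)
    if n == 0:
        raise ValueError("need at least one case")
    return {
        "avg_cycle_hours": sum(c.cycle_hours for c in cases) / n,
        "error_rate": sum(c.had_error for c in cases) / n,
        "clean_escalation_rate": sum(c.escalated_cleanly for c in cases) / n,
    }


def compare(before: list[Case], after: list[Case]) -> dict[str, float]:
    """Change in each metric (after minus before); negative is better for time and errors."""
    b, a = summarize(before), summarize(after)
    return {metric: a[metric] - b[metric] for metric in b}
```

A consultant who has run a workflow in production should be able to fill in both lists from real logs. One who cannot produce the "before" list never measured the baseline.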

What you can do this week: pick the single highest-volume, lowest-stakes workflow you have, and use it as the audition. Write down the six controls above as a one-page rubric, hand it to every consultant on your shortlist, and ask them to map their proposed agent against it before you discuss price. The ones who get sharper and more specific under that scrutiny are the ones worth hiring. The ones who steer back toward the demo are telling you they only have the demo.
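If it helps to keep the rubric honest across several consultants, the one-pager can be reduced to a small scoring sheet. The sketch below is one possible shape, not a standard: the 0 to 2 scale (0 = not addressed, 1 = described, 2 = demonstrated) and the control identifiers are assumptions made for illustration.

```python
# The six controls from the rubric, as hypothetical identifiers.
CONTROLS = (
    "tool_allowlist",
    "action_limits",
    "audit_log",
    "approval_queue",
    "retrieval_and_prompt_governance",
    "edge_case_test_plan",
)


def score_consultant(answers: dict[str, int]) -> dict[str, object]:
    """Total a consultant's rubric scores and list the controls they left unaddressed.

    Assumed scale: 0 = not addressed, 1 = described, 2 = demonstrated.
    A control missing from `answers` counts as 0.
    """
    unknown = set(answers) - set(CONTROLS)
    if unknown:
        raise ValueError(f"unknown controls: {sorted(unknown)}")
    for control, value in answers.items():
        if value not in (0, 1, 2):
            raise ValueError(f"score for {control} must be 0, 1, or 2")

    gaps = [c for c in CONTROLS if answers.get(c, 0) == 0]
    total = sum(answers.get(c, 0) for c in CONTROLS)
    return {"total": total, "max": 2 * len(CONTROLS), "gaps": gaps}
```

The total matters less than the gaps list. A consultant with a high score and an empty edge-case test plan still has only the demo.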

When the real job is a governed assistant that works inside your permissions and audit trail, that is the territory of AI agents and internal copilots. When the underlying problem is actually how work gets routed and data moves between systems, the agent is the wrong tool and workflow automation is the right one. Knowing which of those you have is itself worth more than any demo.

Sources
  1. NIST AI Risk Management Framework
  2. Bain agentic AI transformation research
  3. Microsoft Learn Copilot architecture, data protection, and auditing
  4. McKinsey State of AI research
  5. PwC Responsible AI survey