The pilot that scanned 91% of invoices and saved nobody any time
Picture a 60-person B2B services firm running vendor invoices through a shiny new extraction tool. The demo was flawless. Ninety-one percent of invoices come through clean: vendor name, PO number, line items, total — all extracted, all correct. Leadership sees the throughput chart and starts modeling headcount. Then they go look at the AP clerk's actual Tuesday, and she is doing exactly as much work as before.
Why? Because she never spent her day on the 91%. Those were the fast ones — ninety seconds each, half on autopilot. Her day was the other 9%: the invoice with a hand-scrawled PO, the vendor who bills three projects on one document, the credit memo masquerading as an invoice, the duplicate that arrives a week late. The automation industrialized the part that was already cheap and left the expensive part untouched. That is the single most common way a document intake ROI case quietly falls apart.
So the first number you model is not keystrokes saved. It is exception cost — the fully loaded minutes, escalations, and downstream cleanup that the messy minority consumes. IBM's overview of intelligent document processing is candid that classification and extraction are the easy half; the operating value lives in how the system handles what it is unsure about. If your ROI math averages the easy and the hard files together, you are pricing a process that does not exist.
Model two queues, not one average
Split every document family into two streams and price them separately. Stream one is straight-through: machine reads, machine validates against the source rule, machine posts, nobody touches it. Stream two is exceptions: low extraction confidence, a total that doesn't reconcile to the line items, an entity that doesn't match your vendor master, a document type that doesn't belong in this path at all. The honest ROI lives almost entirely in stream two — because that is where a human still sits, and that is where errors leak into finance, fulfillment, or a customer's record before anyone notices.
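The split above can be sketched as a small routing function. This is a minimal illustration, not a reference implementation: the `ExtractedDoc` fields, the vendor IDs, and the 0.90 confidence threshold are all hypothetical placeholders you would replace with your own extraction output and rules.

```python
from dataclasses import dataclass

# Hypothetical settings: tune these against your own data.
KNOWN_VENDORS = {"V-100", "V-200"}   # stand-in for the vendor master
MIN_CONFIDENCE = 0.90                # below this, a human looks


@dataclass
class ExtractedDoc:
    doc_type: str
    vendor_id: str
    line_item_cents: list[int]   # amounts in cents to avoid float rounding
    total_cents: int
    confidence: float            # extraction confidence, 0.0 to 1.0


def exception_reasons(doc: ExtractedDoc) -> list[str]:
    """Return every reason this document needs a human; empty means clean."""
    reasons = []
    if doc.doc_type != "invoice":
        reasons.append("wrong_document_type")
    if doc.confidence < MIN_CONFIDENCE:
        reasons.append("low_confidence")
    if sum(doc.line_item_cents) != doc.total_cents:
        reasons.append("total_mismatch")
    if doc.vendor_id not in KNOWN_VENDORS:
        reasons.append("unknown_vendor")
    return reasons


def route(doc: ExtractedDoc) -> str:
    """Stream one if no rule fired, stream two otherwise."""
    return "exception" if exception_reasons(doc) else "straight_through"
```

Returning the list of reasons, rather than a bare yes or no, is deliberate: it lets you count which rule sends the most documents to stream two, and it gives the reviewer a starting point.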
That reframes the question. Automation doesn't win by reaching 100% straight-through; it wins by shrinking the cost per exception. So measure that directly: how many minutes does the reviewer now spend per flagged document, and is that number dropping? A well-built workflow makes the exception faster to resolve, not just rarer. It surfaces what was extracted, which source field it pulled from, which rule fired, and what looks anomalous, so the clerk decides in twenty seconds instead of hunting through a PDF for three minutes. The same operating discipline runs through McKinsey's operations work and Gartner's data and analytics coverage: the payoff tracks data quality and exception design, not model cleverness.
Then watch the two leading indicators that tell you whether you bought leverage or just relocated the bottleneck. First, the downstream correction rate: how often does finance or fulfillment fix something the system "completed"? If that climbs, your straight-through number is fiction. Second, the trend in average review time per exception. Falling exception minutes are the signal to expand; a growing, more confusing review queue is the signal to fix rules and source data before you touch another document type. PwC's responsible AI research makes the governance case plainly: a clean audit trail is the difference between a tool finance trusts and one it quietly works around.
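Both indicators reduce to a week-over-week trend, which can be sketched with a plain least-squares slope. The function names and the expand-or-fix rule below are assumptions made for illustration; your own cutoffs for what counts as "climbing" may need to be less strict than a simple sign check.

```python
from statistics import mean


def trend(weekly_values: list[float]) -> float:
    """Least-squares slope per week; negative means the metric is falling."""
    n = len(weekly_values)
    if n < 2:
        return 0.0
    weeks = range(n)
    x_bar, y_bar = mean(weeks), mean(weekly_values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(weeks, weekly_values))
    den = sum((x - x_bar) ** 2 for x in weeks)
    return num / den


def correction_rate(completed: int, corrected_downstream: int) -> float:
    """Share of 'completed' documents that someone downstream had to fix."""
    return corrected_downstream / completed if completed else 0.0


def next_step(weekly_correction_rates: list[float],
              weekly_review_minutes: list[float]) -> str:
    """Hypothetical rule: expand only if corrections are not rising
    and review minutes per exception are falling."""
    if trend(weekly_correction_rates) > 0 or trend(weekly_review_minutes) >= 0:
        return "fix rules and source data"
    return "expand"
```

A slope over four or five weekly points is noisy, so treat it as a prompt for the weekly conversation rather than an automatic gate.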
What to do Monday: instrument one document family before you automate it
Resist the urge to automate "documents." Pick exactly one family, such as vendor invoices, new-client onboarding packets, or certificates of insurance (COIs) from subcontractors, with one owner and one downstream system. Then, before any AI touches it, spend a week instrumenting the manual process. Tally the straight-through count, the exception count, and the real minutes each consumes. Operations leaders are often stunned to find that exceptions are something like a fifth of volume and three-quarters of the labor. Whatever your ratio turns out to be, it is your entire business case; without it, you are guessing.
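The week of tallying can be turned into volume and labor shares with a few lines. This is a sketch under assumptions: the log format (one `(stream, minutes)` pair per document) is hypothetical, and the sample numbers are invented to reproduce the fifth-of-volume, three-quarters-of-labor pattern, not taken from any real firm.

```python
from collections import defaultdict


def summarize(log: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Per-stream share of volume, share of labor, and average minutes."""
    counts: dict[str, int] = defaultdict(int)
    minutes: dict[str, float] = defaultdict(float)
    for stream, mins in log:
        counts[stream] += 1
        minutes[stream] += mins
    total_docs = sum(counts.values())
    total_minutes = sum(minutes.values())
    return {
        stream: {
            "volume_share": counts[stream] / total_docs,
            "labor_share": minutes[stream] / total_minutes,
            "avg_minutes": minutes[stream] / counts[stream],
        }
        for stream in counts
    }


# Illustrative week: 80 clean invoices at 1.5 minutes each,
# 20 exceptions at a made-up 18 minutes each.
week = [("straight_through", 1.5)] * 80 + [("exception", 18.0)] * 20
summary = summarize(week)
```

With these invented numbers the blended average is 4.8 minutes per document, a figure that describes neither the 1.5-minute invoices nor the 18-minute ones. That is why the two streams are priced separately.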
Run the AI-assisted workflow in parallel with the manual process, not in place of it, and hold a weekly review on four questions only: which documents went clean, which got flagged, which fields caused the most reviewer disagreement, and which outputs got corrected downstream. The field that triggers the most disagreement is your roadmap — it tells you whether the fix is a tighter rule, better source data, or a document type that never belonged in this path. MIT Sloan's AI coverage keeps returning to this point: the systems that stick are the ones with a measurement cadence, not the ones with the slickest pilot.
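Finding the field that triggers the most disagreement is a counting exercise. The sketch below assumes "disagreement" means the reviewer's final value differs from what the system extracted, and it assumes a hypothetical record shape with `extracted` and `reviewer` dictionaries; adapt both to however your review tool stores its decisions.

```python
from collections import Counter


def rank_disputed_fields(reviews: list[dict]) -> list[tuple[str, int]]:
    """Count, per field, how often the reviewer's value differs from the
    extracted value, and return fields ordered from most to least disputed."""
    disputed: Counter = Counter()
    for review in reviews:
        extracted = review["extracted"]
        final = review["reviewer"]
        # Check every field either side mentions, so a value the system
        # missed entirely also counts as a disagreement.
        for name in extracted.keys() | final.keys():
            if extracted.get(name) != final.get(name):
                disputed[name] += 1
    return disputed.most_common()
```

The top entry of the returned list is the candidate for next week's fix. Whether that fix is a rule, a source-data cleanup, or a rerouted document type is still a human call.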
The unglamorous version wins: one document family, one owner, one downstream system, one definition of done, and a falling cost per exception you can point to. Once that holds for a quarter, the same intake architecture extends to the next family without re-litigating trust. Model the two-queue economics in the AI ROI Calculator, and when you're ready to move from a wiring diagram to a production workflow, the 90-Day AI Implementation Sprint is the governed path.