The bottleneck isn't the return. It's the chase before the return.
Walk into a 25-person CPA firm in the second week of March. The seniors aren't stuck on judgment calls — they're stuck waiting. A client uploaded eleven of fourteen requested items. The brokerage 1099 is a screenshot of a screenshot. Two PBC requests have aged past a week with no reply, and a manager just kicked a workpaper back because the supporting docs for a fixed-asset addition never made it into the binder. None of that is accounting. All of it is burning the hours you bill for accounting.
That gap is exactly where firms get talked into pointing AI at the wrong target. The Thomson Reuters 2026 AI in Professional Services report and AICPA and CIMA's AI resources both point at reviewable client-service work as the realistic entry point — but "reviewable" is doing heavy lifting in that sentence. The first AI job should prepare the file a human signs: it ingests what the client sent, classifies it against the request list, flags the three missing items, and assembles a clean review packet. It does not decide whether the R&D expense qualifies. The moment an assistant looks like it's reaching a tax, audit, or accounting conclusion without a reviewer's hands on it, you've crossed from administrative prep into professional judgment, and that's the line you don't let software walk across.
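To make the administrative scope concrete, here is a minimal sketch of the "classify against the request list, flag what's missing" step. Everything in it is an assumption for illustration: the `ReceivedDoc` type, the `matched_item` field, and the sample file names are hypothetical, and the classification itself (however your tool does it) is treated as already done. Notice what the code does not do: it never forms a view on the contents of a document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReceivedDoc:
    filename: str
    # Request-list item this upload was matched to (hypothetical field).
    # None means the classifier could not place it and a person should look.
    matched_item: Optional[str]


def missing_items(requested: List[str], received: List[ReceivedDoc]) -> List[str]:
    """Return requested items with no matching upload, in request order."""
    matched = {doc.matched_item for doc in received if doc.matched_item is not None}
    return [item for item in requested if item not in matched]


def unclassified(received: List[ReceivedDoc]) -> List[str]:
    """Return filenames the classifier could not match to any request."""
    return [doc.filename for doc in received if doc.matched_item is None]


# Illustrative data only.
requested = ["W-2", "Brokerage 1099", "Bank statement (Dec)"]
received = [
    ReceivedDoc("w2_2024.pdf", "W-2"),
    ReceivedDoc("IMG_4471.png", None),  # e.g. an unreadable screenshot
]

print(missing_items(requested, received))
print(unclassified(received))
```

Both lists go to a human: the first becomes the follow-up to the client, the second becomes the pile a senior eyeballs before anything is filed.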
Pick the metric before you pick the tool
Here's the test that separates a real pilot from a demo. Before you turn anything on, write down four numbers from last season: the average age of an open PBC request, the percentage of client document sets that arrive incomplete, the number of workpapers returned for missing support, and the hours a senior spends per engagement just reconciling what the client sent against what you asked for. Those are your baselines. If you can't name them, you're not ready to measure whether AI helped — you're just adding another tool to a process you haven't characterized.
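If last season's data lives in a practice-management export, the four baselines are a few lines of arithmetic. The sketch below assumes a hypothetical `Engagement` record with made-up field names; your system's export will look different, and the sample figures are illustrative, not benchmarks.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import Dict, List


@dataclass
class Engagement:
    # All field names are hypothetical; map them to your own export.
    open_pbc_request_dates: List[date]   # date each still-open PBC request was sent
    document_set_complete: bool          # did the client's first upload cover the list?
    workpapers_returned: int             # workpapers kicked back for missing support
    reconciliation_hours: float          # senior time matching uploads to requests


def baselines(engagements: List[Engagement], as_of: date) -> Dict[str, float]:
    """Compute the four baseline numbers across a set of engagements."""
    ages = [
        (as_of - sent).days
        for e in engagements
        for sent in e.open_pbc_request_dates
    ]
    incomplete = sum(1 for e in engagements if not e.document_set_complete)
    return {
        "avg_open_pbc_age_days": mean(ages) if ages else 0,
        "pct_incomplete_sets": 100 * incomplete / len(engagements),
        "workpapers_returned": sum(e.workpapers_returned for e in engagements),
        "avg_reconciliation_hours": mean(e.reconciliation_hours for e in engagements),
    }


# Illustrative data only.
sample = [
    Engagement([date(2025, 3, 1), date(2025, 3, 5)], False, 2, 3.0),
    Engagement([], True, 0, 1.0),
]
result = baselines(sample, as_of=date(2025, 3, 11))
print(result)
```

The point of writing it down this way is that the same function runs again after the pilot, on the same fields, so the before and after are comparable.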
Then run the weekly review on behavior, not output volume. The wrong scoreboard is "the assistant processed 340 documents." The right scoreboard is: did missing-item flags catch real gaps, or did the partner find a 1099 the system waved through? Are classification rejections going down week over week? Is the senior's reconciliation time actually shrinking, or has the work just moved to a new screen? A model that generates more drafts while review time stays flat hasn't earned anything. Once those numbers are tied to a named owner (a manager who owns the close calendar, not "the firm"), the AI Opportunity Score or the AI ROI Calculator tells you something. Run them on a baseline you made up and they'll just confirm the number you wanted.
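One way to keep that weekly review honest is to compare what the system flagged against what the reviewer actually found. The sketch below is an assumption about how you might record it: `flag_scorecard` and the item labels are hypothetical, and it presumes the reviewer's findings are logged somewhere you can export.

```python
from typing import Dict, List, Set


def flag_scorecard(flagged: Set[str], reviewer_found: Set[str]) -> Dict[str, List[str]]:
    """Split missing-item flags into caught, missed, and false alarms."""
    return {
        "caught": sorted(flagged & reviewer_found),        # real gaps the system flagged
        "missed": sorted(reviewer_found - flagged),        # gaps the system waved through
        "false_alarms": sorted(flagged - reviewer_found),  # flags the reviewer rejected
    }


# Illustrative data only.
week = flag_scorecard(
    flagged={"Brokerage 1099", "K-1", "Property tax bill"},
    reviewer_found={"Brokerage 1099", "K-1", "HSA 5498-SA"},
)
print(week)
```

The "missed" list is the one to read first: a document count tells you the tool is busy, while a missed 1099 tells you whether it can be trusted.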
Traceability is the whole game in a firm that gets examined
An accounting firm has a constraint a marketing team never will: someone may pull your workpapers two years later and ask where a number came from. So the non-negotiable for any AI in the file-prep lane is that every item it touches keeps a live link back to its source document. If the system summarizes a bank statement but can't point you to the statement, it's worse than useless — it's a hole in the audit trail. The NIST AI Risk Management Framework gives partners a structure for mapping intended use, risk, and accountability for exactly this kind of workflow, and CISA's AI data-security best practices should govern how client-confidential files, retention rules, and permissioned access actually work — because a leaked client tax file is a different category of problem than a leaked draft email.
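In practice, "keeps a live link" can be enforced as a simple check before anything enters the binder. This is a sketch under stated assumptions, not a description of any particular product: `SourceRef`, `ExtractedItem`, and `trail_gaps` are hypothetical names, and a real system would also need to confirm that the referenced document still exists and is access-controlled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SourceRef:
    document_id: str  # hypothetical identifier in your document store
    page: int         # 1-based page the figure was read from


@dataclass(frozen=True)
class ExtractedItem:
    label: str
    value: str
    source: Optional[SourceRef]  # None means the system cannot say where this came from


def trail_gaps(items: List[ExtractedItem]) -> List[str]:
    """Return labels of items that lack a usable source reference."""
    gaps = []
    for item in items:
        src = item.source
        if src is None or not src.document_id.strip() or src.page < 1:
            gaps.append(item.label)
    return gaps


# Illustrative data only.
packet = [
    ExtractedItem("Dec ending balance", "48,210.55", SourceRef("bank-stmt-2024-12", 3)),
    ExtractedItem("Interest income", "312.40", None),
]
print(trail_gaps(packet))
```

A packet with any entry in that list does not go to review as "supported"; the gap is either fixed or logged as an open item.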
So the guardrails are concrete: preserve source-document links on everything, require reviewer approval before any conclusion leaves the firm, log every missing or conflicting item the system finds, and keep unsupported tax and audit judgments out of the pilot entirely. Deloitte's State of AI in the Enterprise 2026 reads like a long argument that the firms getting returns are the ones that scoped narrowly and proved it before widening. Do the same: hold the line at one file-prep lane until you can demonstrate traceability, confidentiality, and reviewer acceptance, then expand into adjacent routines, like 1099 intake or fixed-asset rollforward support, one at a time. On Monday, the move is small: pick one engagement type, write down those four baseline numbers, and name the manager who owns the result.