The failure mode isn't a typo. It's an invoice that got paid and shouldn't have been.
Picture a 120-person company running maybe 900 invoices a month. A vendor emails new banking details. The invoice matches an open PO, the amount looks normal, and an assistant drafts a clean, confident exception note explaining why it should be approved. Everyone reads the note. Nobody re-verifies the bank change against the vendor master. The payment goes out. That is not a drafting failure. It is a routing-and-authority failure, and no amount of well-written prose would have caught it.
This is why "should we use ChatGPT Business or build something custom for invoice routing" is the wrong first question. The right one is: where does payment authority live? ChatGPT Business is a place to think, summarize, and explain. It is not a place to decide who gets paid. The instant a tool's output can move money — or can be copied into the system that moves money without a hard control in between — you have left the territory a general chat assistant should occupy.
The broader research points the same way: smaller companies get value from AI when it's wired into real work, not bolted on as a sidebar. RSM's middle-market survey, the San Francisco Fed's work on AI and small businesses, and the OECD's SME adoption report all support that reading. In accounts payable, "wired in" has a precise meaning: the route from inbox to posting touches vendor master data, three-way match, approval thresholds, GL coding, duplicate detection, and the audit file, and an auditor can later reconstruct every step.
Draw the line at the moment the route becomes a decision
Here's a clean test you can apply this week. Take any AP task and ask: if the model is wrong, who catches it before money moves, and how? If the answer is "a human reads the output and chooses what to do next," that task is safe in ChatGPT Business. Drafting a vendor reply, explaining why an invoice tripped an exception, summarizing a stack of remittance emails, helping a new AP clerk understand your approval matrix — all of that is reversible, human-gated work.
Now the other side. "Match this invoice to PO 4471, confirm the vendor bank record hasn't changed, route to the right approver based on the $25K threshold, and stage it for posting in the ERP." Every one of those steps has a correct answer that a system can enforce deterministically — and a wrong answer that costs real money. That is workflow territory. The model can still draft the human-facing explanation for the approver, but the match, the validation, the threshold check, and the posting must be code with a logged trail, not a generated suggestion someone trusts.
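A minimal sketch of what "code with a logged trail" could look like for the example above. Everything named here is an assumption for illustration: the Invoice and PurchaseOrder shapes, the route_invoice function, the "ap_manager" and "controller" roles, and the choice to send amounts at or above $25K to the controller. It checks the PO and the bank record only; a real three-way match would also compare the goods receipt.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional

# The $25K threshold comes from the text; who approves on each side is assumed.
APPROVAL_THRESHOLD = Decimal("25000")


@dataclass(frozen=True)
class Invoice:  # hypothetical shape
    invoice_id: str
    vendor_id: str
    po_number: str
    amount: Decimal
    bank_account: str


@dataclass(frozen=True)
class PurchaseOrder:  # hypothetical shape
    po_number: str
    vendor_id: str
    open_amount: Decimal
    is_open: bool


@dataclass
class RouteDecision:
    route: str               # "stage_for_posting" or "hold"
    approver: Optional[str]  # None when the invoice is held
    trail: List[str]         # one entry per check, for the audit file


def route_invoice(invoice: Invoice, po: PurchaseOrder, bank_on_file: str) -> RouteDecision:
    trail: List[str] = []

    # 1. The invoice must match an open PO for the same vendor and fit its open amount.
    po_ok = (
        po.is_open
        and po.po_number == invoice.po_number
        and po.vendor_id == invoice.vendor_id
        and invoice.amount <= po.open_amount
    )
    trail.append(f"po_match={po_ok}")
    if not po_ok:
        return RouteDecision("hold", None, trail)

    # 2. Bank details on the invoice must equal the vendor master record.
    bank_ok = invoice.bank_account == bank_on_file
    trail.append(f"bank_unchanged={bank_ok}")
    if not bank_ok:
        return RouteDecision("hold", None, trail)

    # 3. The threshold picks the approver (assumed roles).
    approver = "controller" if invoice.amount >= APPROVAL_THRESHOLD else "ap_manager"
    trail.append(f"approver={approver}")
    return RouteDecision("stage_for_posting", approver, trail)
```

Under these assumptions, an invoice whose bank account differs from the vendor master is held no matter how convincing the accompanying note is, and the trail records which check stopped it.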
So the architecture splits cleanly. The deterministic layer owns the invoice image, the OCR-extracted fields, the vendor match, the PO status, the approval rule that fired, the GL code, the segregation-of-duties check, the exception reason, and the final reviewer. The model contributes language on top. NIST's AI Risk Management Framework is useful here precisely because it forces you to name who is accountable for each AI-assisted step rather than letting "the tool did it" stand. And invoices are not benign documents — they carry vendor bank details, tax IDs, and contract terms, which is exactly why CISA's guidance on securing AI data matters and why OpenAI's enterprise privacy controls only answer half the question. The other half is your own rule about which fields are even allowed into a prompt. A vendor's new ACH routing number should never be the thing a chat tool reasons about; it should be the thing a control verifies.
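One way to make "which fields are even allowed into a prompt" enforceable is an allow-list applied before any text reaches the model. The sketch below is illustrative only: the field names and the build_prompt_context helper are hypothetical, and your own list would come from your data policy.

```python
from typing import Any, Dict, List, Tuple

# Assumed allow-list: only fields the model needs in order to draft language.
PROMPT_ALLOWED_FIELDS = frozenset({
    "invoice_id",
    "vendor_name",
    "po_number",
    "amount",
    "exception_reason",
    "approval_rule",
})


def build_prompt_context(record: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Split an invoice record into prompt-safe fields and withheld field names.

    Anything not explicitly allowed is withheld, so a newly added field
    (a bank routing number, a tax ID) stays out of prompts by default.
    """
    allowed = {k: v for k, v in record.items() if k in PROMPT_ALLOWED_FIELDS}
    withheld = sorted(k for k in record if k not in PROMPT_ALLOWED_FIELDS)
    return allowed, withheld
```

The design choice is allow-list over block-list: a field has to be approved before the chat tool can see it, instead of hoping every sensitive field was remembered in advance. The withheld names (not their values) can go into the audit trail.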
What to do Monday: pick one vendor group and watch six numbers
Don't pilot "invoice AI." Pilot one narrow slice — say, recurring SaaS invoices from your top 20 vendors, or freight bills from a single carrier group. Narrow scope is what lets you actually tell whether anything improved, and it matches where the Deloitte State of AI in the Enterprise 2026 evidence says value shows up: in production, on a defined workflow, not in open-ended experiments.
Instrument six numbers before you start and read them again after 60 days: invoice cycle time, exception rate, duplicate-payment flags caught, approval aging, GL correction rate, and close-period rework. If the only thing that improved is "exception notes read more nicely," you proved you needed ChatGPT Business and nothing more; keep it there and stop. If approval aging dropped and more duplicates were flagged before payment went out, you've justified building the workflow for that vendor group. Use the invoice-routing automation guide and the AI ROI Calculator to put AP hours, approval lag, and control risk into the same view.
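The before-and-after comparison can be written down as a small check so the verdict doesn't depend on anyone's memory of the baseline. This is a sketch under stated assumptions: the ApMetrics fields, their units, and the "inconclusive" outcome for mixed results are all hypothetical; the text above only defines the two clear cases.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ApMetrics:  # the six numbers; units are assumed
    invoice_cycle_days: float
    exception_rate: float
    duplicate_flags_caught: int
    approval_aging_days: float
    gl_correction_rate: float
    close_rework_hours: float


# For these five, a lower value after the pilot counts as an improvement.
LOWER_IS_BETTER = (
    "invoice_cycle_days",
    "exception_rate",
    "approval_aging_days",
    "gl_correction_rate",
    "close_rework_hours",
)


def improved_metrics(before: ApMetrics, after: ApMetrics) -> List[str]:
    names = [n for n in LOWER_IS_BETTER if getattr(after, n) < getattr(before, n)]
    # More duplicates caught before payment is the one "higher is better" number.
    if after.duplicate_flags_caught > before.duplicate_flags_caught:
        names.append("duplicate_flags_caught")
    return names


def pilot_verdict(before: ApMetrics, after: ApMetrics) -> str:
    moved = improved_metrics(before, after)
    if "approval_aging_days" in moved and "duplicate_flags_caught" in moved:
        return "build_workflow"      # the case the text says justifies the build
    if not moved:
        return "keep_in_chat_only"   # none of the six numbers improved
    return "inconclusive"            # assumed label for mixed results
```

Feed it the baseline snapshot and the 60-day snapshot for the one vendor group in the pilot, and keep the list from improved_metrics alongside the verdict so you can say which numbers moved.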
Before you expand past that first slice, write down three rules in plain language: who can override a system-chosen route, what happens when a duplicate-payment warning fires, and which exceptions must stop and wait for procurement or the controller. Those three sentences are the difference between AP that's faster and AP that quietly routed around its own controls. When the slice works and you can explain exactly which of the six numbers moved and why, then widen it — one vendor group at a time, never the whole inbox at once.
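Once the three rules exist in plain language, they can be mirrored in a small policy module so the workflow enforces what the document says. The sketch below is hypothetical throughout: the role names, the exception codes, and the choice to hold payment on a duplicate warning are placeholders for whatever your three sentences actually say.

```python
from typing import Dict, Optional

# Rule 1 (assumed): only these roles may override a system-chosen route.
OVERRIDE_ROLES = frozenset({"controller"})

# Rule 3 (assumed): exceptions that must stop and wait, and who they wait for.
MUST_STOP_EXCEPTIONS: Dict[str, str] = {
    "bank_detail_change": "controller",
    "po_mismatch": "procurement",
}


def can_override_route(role: str) -> bool:
    return role in OVERRIDE_ROLES


def on_duplicate_warning(invoice_id: str) -> Dict[str, str]:
    # Rule 2 (assumed): a duplicate-payment warning holds the payment
    # until a reviewer clears it; it is never auto-dismissed.
    return {"invoice_id": invoice_id, "action": "hold_payment", "needs": "reviewer_clearance"}


def stop_owner(exception_code: str) -> Optional[str]:
    """Return who must clear the exception, or None if it need not stop."""
    return MUST_STOP_EXCEPTIONS.get(exception_code)
```

Keeping the policy this small makes it easy to review next to the written rules: if the code and the three sentences disagree, one of them is wrong.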