The "passed" test that nobody actually ran
Picture the Thursday before a Monday go-live. The QA tracker shows 94 of 96 acceptance tests green. The customer's project manager is reassured. Then someone asks for the evidence behind test 47 — the SSO integration check — and there's a screenshot from a different environment, taken two sprints ago, attached to a criterion that changed in the meantime. The test wasn't run against the current build. It was marked passed because the tracker needed a green cell. That single gap is how a clean dashboard ships a defect into a customer's production environment.
This is the characteristic failure mode of implementation QA, and it's why "should we use ChatGPT Business or build something custom" is the wrong opening question. The real question is whether your acceptance criteria, test results, defect logs, configuration evidence, and customer signoff live in one place where a green status is provably tied to evidence, or whether they're reconstructed from someone's memory of what got tested. A chat assistant cannot fix the second problem. It can only make the prose around it cleaner.
ChatGPT Business genuinely earns its seat for the drafting work that surrounds QA: turning a messy acceptance-criteria email into structured test cases, summarizing a defect triage call, comparing two versions of a test plan and flagging what changed. That's real time saved, and the broader adoption case for targeted AI in smaller companies holds up in the research from RSM, the San Francisco Fed, and the OECD. But none of those drafting wins closes the gap between "marked passed" and "proven passed." That gap is where go-live risk lives, and it's a structural problem, not a wording problem.
The line: drafting versus the go-live gate
Here's the cleanest test for build-vs-buy in QA specifically. If the output is something a human reads before deciding what to do (a test plan draft, a triage summary), ChatGPT Business is fine, and the data handling described in OpenAI's enterprise privacy material covers that use comfortably. But the moment the output is a status that gates a release ("test 47: passed," "no critical defects open," "customer signed off"), the chat transcript is the wrong home. A status that authorizes shipping has to be enforced by a system, not asserted in a conversation.
A custom workflow for implementation QA isn't a fancier chatbot. It's a small set of hard rules: every test links to a specific acceptance criterion and a specific build; a test can't be marked passed without an attached evidence artifact from the right environment; every defect has an owner and a retest status; and go-live is mechanically blocked while any criterion tagged critical lacks evidence. The model still helps inside that frame — classifying a defect's severity, drafting the retest note — but it never gets to declare the gate open. A 40-person services firm doing six implementations a quarter doesn't need a heavy QA platform to do this; a structured tracker with enforced required fields and a blocking rule on the critical-evidence column gets most of the way there.
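To make those hard rules concrete, here is a minimal Python sketch of the evidence-and-gate idea. It is an illustration under stated assumptions, not a reference implementation: the names AcceptanceTest, Evidence, Defect, mark_passed, and go_live_blockers are all hypothetical, and a real tracker would enforce the same checks through required fields and a blocking rule rather than in-memory objects.

```python
from dataclasses import dataclass
from typing import List, Optional

RETEST_STATES = {"open", "retested", "closed"}  # assumed status vocabulary


@dataclass
class Evidence:
    artifact_ref: str   # hypothetical pointer to a screenshot, log, or export
    build_id: str       # build the artifact was captured against
    environment: str    # environment the artifact was captured in


@dataclass
class AcceptanceTest:
    test_id: str
    criterion_id: str   # every test links to one acceptance criterion
    build_id: str       # ...and to one specific build
    critical: bool = False
    status: str = "not_run"
    evidence: Optional[Evidence] = None


@dataclass
class Defect:
    defect_id: str
    owner: str
    retest_status: str = "open"

    def __post_init__(self) -> None:
        # Every defect needs a named owner and a known retest status.
        if not self.owner:
            raise ValueError(f"defect {self.defect_id} has no owner")
        if self.retest_status not in RETEST_STATES:
            raise ValueError(f"unknown retest status: {self.retest_status}")


def mark_passed(test: AcceptanceTest, evidence: Optional[Evidence],
                target_env: str) -> None:
    """Mark a test passed only if the evidence matches its build and environment."""
    if evidence is None:
        raise ValueError(f"test {test.test_id}: no evidence attached")
    if evidence.build_id != test.build_id:
        raise ValueError(f"test {test.test_id}: evidence is from another build")
    if evidence.environment != target_env:
        raise ValueError(f"test {test.test_id}: evidence is from another environment")
    test.evidence = evidence
    test.status = "passed"


def go_live_blockers(tests: List[AcceptanceTest]) -> List[str]:
    """Return the critical tests that still lack evidence; empty means the gate is open."""
    return [
        t.test_id
        for t in tests
        if t.critical and (t.status != "passed" or t.evidence is None)
    ]


# Usage: the stale SSO screenshot from the opening scenario is rejected.
sso_test = AcceptanceTest("47", "AC-SSO", build_id="build-2", critical=True)
stale = Evidence("old-screenshot.png", build_id="build-1", environment="staging")
try:
    mark_passed(sso_test, stale, target_env="uat")
except ValueError as err:
    print(err)
print(go_live_blockers([sso_test]))
```

The design choice worth noticing is that nothing outside mark_passed sets a test to passed, so a green cell cannot exist without an artifact from the right build and environment. A model can still draft the retest note or suggest a severity, but it has no path to opening the gate.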
Two guardrails matter more in implementation QA than in most AI use cases, because you're often testing inside the customer's environment. The NIST AI Risk Management Framework is useful for naming the specific failure paths here — wrong context (evidence from the wrong build), weak measurement (a green status with no artifact behind it), unclear accountability (no named owner on an open defect). And because QA touches customer configurations, credentials, and proprietary system details, CISA's AI data-security guidance applies directly: that evidence has to stay permissioned, not pasted into a general chat thread where it leaves your control.
Run one gate, measure leakage, then decide
Don't argue this in the abstract. Pick one project type and one acceptance gate — say, the integration-testing gate on your most common implementation pattern — and instrument it for one quarter. The number that settles the debate is defect leakage: how many defects escaped that gate and surfaced after go-live, in the customer's hands. Track it alongside retest cycle time (how long from defect logged to retested and closed) and evidence completeness (what fraction of "passed" criteria actually have a valid artifact attached). The Deloitte 2026 AI research makes the same point at the enterprise level: value shows up in production outcomes, not pilot enthusiasm. In QA, production outcome means defects that didn't escape.
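The three numbers can be computed from very little data. The Python sketch below shows one way to do it, with some assumptions made explicit: leakage is expressed as a rate (escaped defects divided by all defects attributable to the gate) so quarters of different sizes compare fairly, retest cycle time is averaged in days over closed defects only, and the function names are hypothetical.

```python
from datetime import datetime
from statistics import mean
from typing import List, Optional, Tuple


def leakage_rate(caught_at_gate: int, escaped_after_go_live: int) -> float:
    """Share of the gate's defects that surfaced after go-live (assumed definition)."""
    total = caught_at_gate + escaped_after_go_live
    return escaped_after_go_live / total if total else 0.0


def mean_retest_cycle_days(
    defects: List[Tuple[datetime, Optional[datetime]]]
) -> Optional[float]:
    """Average days from 'logged' to 'retested and closed'; open defects are skipped."""
    durations = [
        (closed - logged).total_seconds() / 86400
        for logged, closed in defects
        if closed is not None
    ]
    return mean(durations) if durations else None


def evidence_completeness(passed_has_valid_artifact: List[bool]) -> float:
    """Fraction of 'passed' criteria that have a valid artifact attached."""
    if not passed_has_valid_artifact:
        return 0.0
    return sum(passed_has_valid_artifact) / len(passed_has_valid_artifact)


# Usage with made-up illustrative numbers, not real project data.
print(leakage_rate(caught_at_gate=18, escaped_after_go_live=2))
print(mean_retest_cycle_days([
    (datetime(2025, 1, 1), datetime(2025, 1, 3)),
    (datetime(2025, 1, 2), datetime(2025, 1, 6)),
    (datetime(2025, 1, 5), None),  # still open, excluded from the average
]))
print(evidence_completeness([True, True, True, False]))
```

If you prefer to report leakage as a raw count, as the paragraph above describes it, keep the escaped count alongside the rate so neither number is read without the other.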
The read is straightforward once you have the numbers. If leakage is fine and evidence completeness is already high, your problem was never the gate, so keep ChatGPT Business for the drafting and move on. If leakage is high, or evidence completeness is low because green cells have nothing behind them, that's the signal that the status needs to be system-owned, and the custom workflow earns its build. Either way, write the decision down: what you chose, and the leakage and evidence numbers that drove it. A successor running the next implementation should be able to see why.
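That read can be written down as a small rule that also produces the decision record. The Python sketch below is hypothetical throughout: the names GateDecision and decide are invented for illustration, and the thresholds are deliberately required arguments, because what counts as "fine" leakage or "high" completeness is something your team has to set, not something this article supplies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    choice: str                   # what you chose
    leakage_rate: float           # the numbers that drove it
    evidence_completeness: float
    reason: str


def decide(leakage_rate: float, evidence_completeness: float, *,
           max_leakage: float, min_completeness: float) -> GateDecision:
    """Build when leakage is high OR evidence completeness is low; otherwise keep drafting-only."""
    reasons = []
    if leakage_rate > max_leakage:
        reasons.append("leakage above threshold")
    if evidence_completeness < min_completeness:
        reasons.append("evidence completeness below threshold")
    if reasons:
        return GateDecision("build_custom_workflow", leakage_rate,
                            evidence_completeness, "; ".join(reasons))
    return GateDecision("keep_chat_for_drafting", leakage_rate,
                        evidence_completeness, "both numbers within thresholds")


# Usage with placeholder thresholds chosen only for the example.
record = decide(0.02, 0.60, max_leakage=0.05, min_completeness=0.95)
print(record.choice, "-", record.reason)
```

Storing the returned record next to the project file is one simple way to leave the successor the "why" along with the choice.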
If you want the build pattern in more detail, the QA automation guide covers how to structure the evidence-and-gate layer, and a staged rollout plan keeps you from trying to instrument every implementation path at once. Start with the one gate where a leaked defect costs you the most customer trust. When the owner of that gate can explain — in cycle time and leakage, not vibes — what got better, expand to the next one. Build the AI roadmap from there.