The project was green every week until the week it cancelled
Every delivery leader at an implementation or consulting firm has lived this: a project shows green on the status report for eleven straight weeks, then the client calls and it's a five-alarm fire. The percent-complete field said 80. The risk log was empty. The summary read fine. None of it was true, because the PM was reporting the schedule they wished they had, not the one they were on.
That is exactly the report you should not hand to AI yet. A summarizer reading that data won't catch the lie — it will launder it. It pulls "80% complete, no open risks, on track" from the source fields and writes a confident paragraph that sounds even more authoritative than the human version. You haven't removed the bad signal; you've given it better grammar and faster distribution.
The pattern across the research is consistent. The RSM middle-market AI survey and Deloitte's State of AI in the Enterprise 2026 both land on the same uncomfortable point: AI returns value on top of governed workflows, not scattered ones. For a services firm, the status report sits downstream of your weakest discipline — how honestly PMs log slippage, how RAG status gets assigned, who owns the risk register. Automate that, and you scale the dishonesty, not the visibility.
Before you wire anything up, run the SMB AI readiness assessment and ask one blunt question: do two PMs looking at the same project agree on whether it's yellow?
Three things that have to be true before a model touches the report
The fix is not "buy a better tool." It's deciding whether your status data is even reportable. Three tests, specific to delivery work:
One — does "yellow" mean something? In most firms, RAG status is vibes. One PM marks yellow when a milestone slips a day; another stays green until the client escalates. If the definition lives in people's heads, a model trained to summarize it will produce summaries that mean nothing across the portfolio. Write the rule down: yellow = milestone at risk within two weeks, red = milestone already missed or client-facing scope at risk. Now the data has a spine.
Two — is there one source of truth, or four? The schedule is in the PM tool, the risk log is in a spreadsheet, the budget burn is in the finance system, and the real status is in the delivery lead's head. An AI that stitches those together will confidently reconcile numbers that contradict each other. Pick the system of record per field — schedule here, budget there — and retire the shadow trackers before automation, not after.
Three — who gets paged when the summary is wrong? The NIST AI Risk Management Framework and CISA's AI Data Security Best Practices both come back to the same operational hinge: named ownership, logged decisions, and a defined escalation path. If no human is accountable for the accuracy of the generated report — and for the client and revenue data it's pulling from — you have automated blame-shifting. "The AI said it was green" is not a defense a partner can give a client.
If you can't pass all three, the report stays manual. Not forever — just until the controls exist. Turn the pause into a sequenced fix with the 90-day AI implementation plan rather than letting "we'll get to it" become the permanent state.
Restart narrow: one program, one owner, one honest baseline
When you do come back to it, resist the urge to build "the AI that reports on every project." Start with one program manager and their portfolio of five engagements. Have the model draft the weekly status from the cleaned source fields, and have that one owner red-line every draft for a month before anything reaches a client or a leadership deck. You are not testing the model's writing. You are testing whether your newly-defined data produces a report the owner would actually stand behind.
Judge it on numbers that matter to delivery, not on "it saved time." Track the correction rate — how often the owner overrides the draft RAG status or fixes a misread risk. Track the surprise rate — how many projects still went red without warning, because that's the metric your old reports were failing. Track whether managers are catching slippage a week earlier than before. If the drafts are still missing the same red flags your humans missed, the problem was never the writing; it's still the source data, and you've confirmed it cheaply.
Only after that one PM trusts the output should you widen to the next. Before you scale a single seat further, run the math honestly with AI ROI measurement without fake savings — because faster status reports that are still wrong is a cost, not a saving.