Skip to content
Contact Us
Technical Debt5 min

The IP Audit That Decides Whether You're Buying an AI Asset or a Deletion Order

In AI/ML deals, the model weights can be a liability, not an asset. Here's the data-provenance audit and deal structure that protects your basis.

Illustration of an AI model training pipeline showing data inputs,
model weights, and potential IP contamination points.
Figure 01 Illustration of an AI model training pipeline showing data inputs, model weights, and potential IP contamination points.
Answer summary

The practical answer

Short answer
In AI/ML deals, the model weights can be a liability, not an asset. Here's the data-provenance audit and deal structure that protects your basis.
Best fit
Industry: Private Equity. Function: Technical Due Diligence
Operating path
Technical Debt -> Turnaround & Restructuring -> Transaction Advisory Services -> Valuations
Key metric
66% of AI/ML codebases contain high-risk open source vulnerabilities or license conflicts (Synopsys 2024).

The asset on the balance sheet might be a court order waiting to happen

Picture the diligence call. The target's CTO walks you through the model: 340 million parameters, fine-tuned on "proprietary industry data," 94% accuracy on the eval set, the whole moat. Your tech advisor nods. The financials show 70% gross margin on inference. Everybody likes the deal.

Nobody in the room can tell you where the training data came from. And that one gap is the difference between buying an asset and inheriting a liability that can be deleted by a judge.

Here is the part that breaks every instinct a PE operator brings from software M&A. In a normal code deal, IP risk is deterministic and containable. You scan the repo, you catch the GPL violation, you rewrite the offending module for maybe $50K in engineering time, and the risk is gone. The damage is bounded by the size of the bad code.

In an AI/ML deal, you are not primarily buying code. You are buying weights — billions of numbers that encode patterns the model absorbed from its training corpus. You cannot surgically remove a tainted dataset from a set of weights the way you delete a file. If the corpus was scraped without consent, salted with copyrighted material, or built on data licensed only for research, the contamination is baked into the math. The remedy isn't a patch. It's a full retrain — or, increasingly, deletion.

This matters because the cost asymmetry is brutal. Synopsys' 2024 OSSRA report found that 66% of AI, machine learning, and big-data codebases carry high-risk vulnerabilities or license conflicts. Better than a coin flip that your target is sitting on something a standard IP rep won't cover. And courts are no longer treating this as theoretical: Shoosmiths' litigation risk analysis documents AI disputes now outpacing conventional IP litigation, with disgorgement — forced deletion of models trained on illicit data — emerging as a live remedy, not a hypothetical.

So before you anchor on the revenue multiple, ask the only question that determines whether the multiple means anything: if a court ordered this model destroyed tomorrow, what would the company have left?

Audit the stack the way the contamination flows: input, engine, output

The reason generic IP checklists fail here is that they were built for a world where the code is the asset. In AI, the code is often the cheapest, most replaceable layer. Risk enters at the data and propagates upward. So audit it in the order the contamination actually travels.

Input: where did every byte of training data come from?

This is the layer that gets companies deleted, and it's the layer founders are most likely to wave past. You want a dataset-by-dataset accounting, not a paragraph of reassurance. Three questions separate a clean deal from a buried one:

  • Did they respect robots.txt? If the data team scraped sites that explicitly disallowed crawling, that conduct is now being used to strip away fair-use defenses. "Everyone scrapes" is not a diligence answer.
  • Is there PII in the corpus? Under GDPR and CPRA, a single deletion request for a user whose data sits in the training set can legally require unlearning — and for most architectures, unlearning means retraining. One subject-access request can trigger the same cost as a contamination finding.
  • Were "research only" datasets used to train a commercial product? A startup that bootstrapped on an academic dataset licensed for non-commercial use built its commercial model on a license violation. That's common, and it's almost never disclosed unprompted.

Engine: did they actually build the model, or fine-tune someone else's?

Most mid-market AI companies did not train a foundation model from scratch — the economics don't allow it. Epoch AI's compute analysis shows frontier-model training runs scaling into eight and nine figures, which is precisely why your target almost certainly fine-tuned an open base. That's fine — until you read the base model's license. Some "open" weights forbid commercial use above a user threshold, or prohibit using the model to compete with its creator. And if there's an AGPL-licensed component anywhere in the inference path, the copyleft terms can reach your entire proprietary platform. The question to put to the CTO: "Show me the license for every model and library in the serving stack, including the base checkpoint you fine-tuned." Watch how long it takes them to answer.

Output: can they even own what the model produces?

The US Copyright Office has held repeatedly that purely AI-generated work isn't copyrightable. If the target's core product is machine-generated output — copy, images, designs, code — with no meaningful human authorship, they may own zero IP in the thing customers pay for. There's no moat; a competitor can replicate the output and the target has no infringement claim. The diligence test is concrete: ask them to demonstrate substantial human modification of the AI's output. If they can't, you're underwriting a product with no defensible IP at the bottom of it.

Chart showing the rising cost of AI model retraining vs traditional
code remediation.
Chart showing the rising cost of AI model retraining vs traditional code remediation.

What you actually do Monday: demand a Data Bill of Materials, then price the gaps

You already require a Software Bill of Materials in tech diligence. For an AI target, add a Data Bill of Materials and treat its absence as a finding, not a delay. The DBOM lists every training dataset, its source, its license, and the consent mechanism behind it. A target with clean provenance can produce this in days because they tracked it as they went. A target that can't produce it didn't track it — which means you should assume contamination and price accordingly.

When the DBOM has holes, you have three levers, in order of how much trust the target has earned:

  1. Retrain escrow. Hold back enough consideration to fund a clean retrain — compute plus the months of lost go-to-market while the new model is rebuilt and re-validated. Don't size this off the compute bill alone; the GTM downtime is usually the larger number, because a model that's offline for a quarter is revenue you're not collecting.
  2. Specific indemnity, not boilerplate. A generic "no IP infringement" rep does not cover this. Name the risks: training-data copyright infringement, and model-disgorgement orders. If the seller's counsel resists carving these out, that resistance is itself diligence signal.
  3. Buy the team and architecture, retrain in a clean room. When the provenance is genuinely unsalvageable, stop trying to acquire the contaminated weights at all. Structure an asset purchase of the people and the model design, and make a documented clean-room retrain on licensed data a condition that closes before — or shortly after — the deal. You get the capability without inheriting the liability.

The instinct to fight is the urge to treat the model's reported accuracy as the asset. It isn't. The asset is a defensible, retrainable, license-clean pipeline. A 94% eval score on a poisoned corpus is worth less than an 88% score you can legally own and rebuild.

If you're working a tech-enabled deal more broadly, pair this with our walkthrough of technology due diligence red flags and our cybersecurity and IP assessment framework — the same operator lens, applied to the layers around the model.

Continue the operating path
Topic hub Technical Debt Quantification in dollars, not adjectives. Then a remediation plan that runs in parallel with delivery. Pillar Turnaround & Restructuring Technical debt is real money. Once you can name it as a number — its impact on velocity, EBITDA, and exit multiple — it stops being a vague engineering complaint and becomes a board agenda item. Service Transaction Advisory Services Operator-led buy-side and sell-side diligence for technology middle-market deals. Financial rigor, technical diligence, and integration risk in one workstream. Service Valuations Credible valuation work for SaaS, services, IP, ARR/MRR, cap tables, and exit readiness in technology middle-market transactions. Service Performance Improvement Revenue, margin, delivery, technical debt, and operating-system improvement for technology firms with stalled growth or compressed EBITDA.
Related intelligence
Sources
  1. Synopsys, "2024 Open Source Security and Risk Analysis Report"
  2. Epoch AI, "Training Compute of Frontier Models" (Cost Analysis)
  3. Shoosmiths, "Litigation Risk 2026: AI Disputes Surpass IP"
Move on this

A 14-day operator-led diagnostic, before the gap is priced into your multiple.

No retainer until we agree on the work.

Request a Turnaround Assessment →