Don’t Steal This Book: 10,000 authors push back as AI taps copyrighted text — What business leaders must know about AI training data and licensing

An “empty” book—its pages filled only with the printed names of nearly 10,000 writers—was handed out at the London Book Fair with a single, blunt demand: don’t let AI companies train on authors’ work without permission. The volume, titled Don’t Steal This Book, features contributors from Kazuo Ishiguro to Malorie Blackman and was organised by composer and copyright campaigner Ed Newton-Rex to dramatise a growing clash between creators and generative AI firms.

The AI industry was “built on stolen work … taken without permission or payment”. — Ed Newton‑Rex

For executives using AI for business, this isn’t a literary squabble. It’s a material issue for procurement, product strategy, vendor risk and the cost base of AI deployment. The policy choices governments make—and the commercial solutions publishers propose—will change how companies source AI training data, the price of using advanced models, and the legal exposure of any product that generates or summarises content.

Why CIOs and CMOs should care right now

Short version: the data that fuels ChatGPT-style systems and image generators often includes copyrighted material scraped from the web. If regulations or court decisions force AI providers to license that material, training costs and vendor pricing could rise; if governments allow an opt-out approach, creators argue the burden will shift to authors to protect their work. Either route changes the economics and legal risk of AI automation projects.

  • Reputational risk: Using models trained on contested sources invites public backlash and potential boycotts.
  • Legal risk: Litigation and settlements are already shaping the market—some reports put Anthropic’s settlement with authors at roughly $1.5bn, a sign of how costly disputes can become.
  • Procurement risk: Licensing rules will affect vendor contracts, indemnities and budgets for AI initiatives.

How models are trained — plain and simple

Large language models and many image generators learn patterns from huge datasets. Those datasets come from public web pages, books, news articles, forums and image repositories. “Training data” is simply the text and images used to teach a model language, style, facts and visual concepts. The heart of the dispute is whether it’s acceptable to include copyrighted works in those datasets without explicit permission or payment to rights-holders.

Key terms, defined

  • Opt-out: AI firms can use works by default unless creators explicitly say “don’t use mine.” Think of it like a front door left unlocked unless the homeowner bolts it.
  • Opt-in / mandatory licence: Companies must obtain permission or buy licences before using copyrighted works—creators need not take action to protect their material.
  • Collective licensing: Publishers or rights organisations negotiate a blanket licence that gives AI firms access to many works, with fees redistributed to creators.
  • Copyright waiver for “commercial research”: A proposed carve-out that could allow AI training on copyrighted material for commercial model development without licence—controversial because it could bypass creator consent.
  • Training data provenance: The record that proves where training data came from and whether it was licensed—vital for vendor due diligence.
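To make the provenance idea concrete, here is a minimal sketch of what checking a dataset record might look like. The record schema (`source`, `licence`, `rights_holder`) is entirely hypothetical, invented for illustration, and not an industry standard:

```python
# Minimal sketch of a training-data provenance check.
# The field names below are hypothetical, not any published standard.

def has_clear_provenance(record: dict) -> bool:
    """Return True if a dataset record documents its source, licence and rights-holder."""
    required = ("source", "licence", "rights_holder")
    return all(record.get(field) for field in required)

record = {
    "source": "https://example.com/corpus",
    "licence": "publisher-negotiated",
    "rights_holder": "Example Publishing Ltd",
}
print(has_clear_provenance(record))  # True: all three fields are documented
```

In practice, due diligence means asking vendors for records like this at dataset level, not reimplementing the check yourself; the point is simply that "provenance" is auditable metadata, not a vague assurance.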

Where policy stands (and why the UK matters)

The UK government has been consulting on copyright changes and is due to publish an economic impact assessment and update by March 18. Options floated include keeping current law, requiring licences, adopting an opt-out model, or even allowing a copyright waiver for “commercial research.” A government spokesperson framed the ambition as balancing protection for creators with innovation: “The government wants a copyright regime that values and protects human creativity, can be trusted, and unlocks innovation.”

“The UK government must not legalise book theft to benefit AI companies.” — back cover, Don’t Steal This Book

Policy outcomes in one market ripple globally. If the UK permits broad use of copyrighted works by default, AI firms may centralise training there; if it requires licences, companies may shift development where rules are laxer or where a clear commercial licensing market exists. Meanwhile, litigation in the US and actions across the EU (where copyright frameworks are generally more protective of rights-holders) mean companies building or licensing AI models will face a patchwork of legal regimes.

Industry responses — collective licences and settlements

Publishers’ Licensing Services (PLS) and other industry bodies are proposing collective licensing schemes to offer a scalable way for AI firms to access published works while compensating creators. Collective licensing can work much as music licensing does: a single contract covers many works, and fees are pooled and distributed. But the model must overcome global fragmentation, complex rights (translations, editions, anthologies) and the technical task of linking fees to individual authors.

At the same time, litigation and settlements are creating market signals. Reports indicate Anthropic reached a large settlement with authors—figures in the media have hovered around $1.5bn—demonstrating how litigation risk can translate into material commercial exposure.

Two counterpoints worth hearing

Supporters of broader data access argue that allowing models to use a wide array of texts accelerates innovation, improves discovery, and powers tools that can increase readership and monetise content in new ways. For example, summarisation tools can drive interest to the original works and open new revenue streams.

The trade-off is clear: broader access can accelerate product development but risks undermining traditional creator revenue and provoking legal pushback. The right policy or commercial structure needs to balance those benefits against fair compensation and creative control.

Three operational responses for businesses

Three pragmatic strategies will dominate corporate choices. Each has practical implications for cost, time-to-market and legal exposure.

1. Prioritise licensed training data

Pros: strongest legal defensibility, aligns with creators and publishers, lowers reputational risk.

Cons: licensing is likely to be the most expensive path and could slow development.

When to choose: customer-facing products with high regulatory scrutiny or where IP risk is material (e.g., legal, media, education).

2. Engage with collective licensing

Pros: scalable access to many works, redistributes fees to creators, reduces transactional friction.

Cons: depends on market uptake, may not cover all needed works or geographies, governance/fee distribution challenges.

When to choose: firms needing broad literary coverage and willing to participate in shaping licence terms.

3. Reduce reliance on contested public data

Pros: lower litigation exposure, greater control over dataset quality and provenance.

Cons: building proprietary datasets is time-consuming and costly; may limit model capability initially.

When to choose: companies building domain-specific AI agents, or those valuing long-term control (e.g., enterprise AI for sales or healthcare).

Practical 30/90/180-day plan for executive teams

  • 0–30 days: Audit current AI vendors for data provenance. Require written attestations that training data is licensed or permissibly sourced. Flag any vendors unable to provide provenance.
  • 30–90 days: Update procurement templates to include IP indemnities, warranties and clear termination rights tied to data provenance. Budget for potential licensing fees and renegotiate SOWs where necessary.
  • 90–180 days: Pilot an in-house dataset or choose vendors offering licensed models. Revisit product roadmaps to phase in features that minimise dependence on contested datasets; run legal and reputational impact assessments for major launches.

Vendor due-diligence checklist

  • Can the vendor provide provenance logs for the training data used?
  • Are there written licences or rights assignments for copyrighted works used to train models?
  • Does the vendor offer IP indemnity coverage and what are the limits/exclusions?
  • Is there a process for removing or retraining on contested content if required by regulation or court order?
  • Does the vendor participate in any collective licensing schemes or industry standards for data transparency?
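The checklist above can be turned into a simple screening score for procurement teams. The sketch below is illustrative only: the question keys map to the five bullets above, and the pass threshold of 4 is an assumption chosen for the example, not a recommended standard:

```python
# Sketch: score a vendor's answers to the due-diligence checklist above.
# Question keys and the pass threshold are illustrative assumptions.

CHECKLIST = [
    "provenance_logs",       # provenance logs for training data
    "written_licences",      # written licences / rights assignments
    "ip_indemnity",          # IP indemnity coverage offered
    "removal_process",       # can remove/retrain on contested content
    "collective_licensing",  # participates in licensing schemes or standards
]

def screen_vendor(answers: dict) -> tuple[int, bool]:
    """Count 'yes' answers and flag vendors below an illustrative threshold of 4."""
    score = sum(1 for question in CHECKLIST if answers.get(question))
    return score, score >= 4

answers = {
    "provenance_logs": True,
    "written_licences": True,
    "ip_indemnity": True,
    "removal_process": False,
    "collective_licensing": True,
}
score, passed = screen_vendor(answers)
print(score, passed)  # 4 True
```

A weighted score (e.g. treating written licences as mandatory rather than one point among five) would be the natural next refinement; the right weights depend on your product's risk profile.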

Where to watch next

  • March 18 — UK publication of its economic impact assessment and copyright consultation update.
  • Major court decisions and settlements in the US and EU, which can quickly recalibrate market expectations and vendor contracts.
  • Industry moves: rollouts of collective licences (like PLS) and vendor commitments to licensed training datasets.

Frequently asked questions

  • What form did the protest take?
  • About 10,000 authors had their names printed in Don’t Steal This Book, a volume distributed at the London Book Fair to protest AI firms using copyrighted texts without permission.

  • Why are authors protesting?
  • Authors and artists argue that models trained on their work without payment compete with their livelihoods. As Ed Newton‑Rex said, “This is not a victimless crime – generative AI competes with the people whose work it is trained on, robbing them of their livelihoods.”

  • What are the industry responses?
  • Publishers’ Licensing Services is promoting a collective licensing scheme to provide legal access to works and distribute fees. High-profile litigation and settlements—reported settlements include a large payout by Anthropic—are also shaping the landscape.

  • What should businesses do now?
  • Start with vendor audits and procurement updates, budget for licensing scenarios, and pilot less-contested data strategies. Prioritise decisions based on product risk and market exposure.

“It is not in any way unreasonable to expect AI companies to pay for the use of authors’ books.” — Malorie Blackman

Whether policy lands as opt-out, opt-in, mandatory licensing or some hybrid, companies building or buying AI should treat training data as a first-class legal and commercial risk. The blank pages of Don’t Steal This Book may have been theatrical, but the practical signal is firm: creators want compensation and control. For business leaders, the sensible move is proactive risk management—clarify vendor practices, budget for licensing, and design product choices that can tolerate regulatory shifts. That’s how you keep innovation moving without stumbling into tomorrow’s legal pitfalls.

Where possible, monitor official government updates and major legal rulings, and consider subscribing to specialist legal and policy briefings on AI copyright to stay ahead of fast-moving changes.