03. How It Works
Training data and copyright
When an AI company scrapes the web to build a training dataset, it creates copies of billions of copyrighted documents. These copies are used during training and are typically not retained verbatim afterward, but the model's behavior is shaped by them. The key legal question is whether this process falls within the US doctrine of fair use (or its equivalent in other jurisdictions).
Fair use (US) is a four-factor test: (1) purpose and character of the use, (2) nature of the copyrighted work, (3) amount and substantiality of the portion used, (4) effect on the potential market for the original. AI companies argue that training is transformative (factor 1) and does not substitute for the original (factor 4). Rights holders argue that AI outputs compete directly with their work and that a real market exists for licensing training data.
The New York Times v. OpenAI and Microsoft case, filed December 27, 2023, is the most prominent US case. The NYT alleges that its articles were used to train models that can reproduce them nearly verbatim in some cases, eliminating the need to visit nytimes.com. The Times is seeking billions in damages. The case is pending as of June 2026.
Other significant cases include Getty Images v. Stability AI, filed in US and UK courts and alleging unauthorized use of millions of licensed photos. Seventeen authors, including John Grisham, Jodi Picoult, and George R.R. Martin, sued OpenAI in September 2023. Music publishers including Universal Music sued Anthropic, alleging misuse of song lyrics. These cases are still working their way through the courts.
OpenAI's public position is that training on copyrighted material is both unavoidable and lawful. The company argues that it is unavoidable because "copyright today covers virtually every sort of human expression," that restricting training to public domain material "would not provide AI systems that meet the needs of today's citizens," and that the practice falls within fair use.
In 2025, the first substantive US ruling on these questions came in Bartz v. Anthropic. A federal court found that training a model on books Anthropic had lawfully acquired was fair use, while holding that obtaining works from pirated sources remained actionable infringement. Anthropic subsequently agreed to a settlement reported at around $1.5 billion to resolve the claims over pirated copies. This is one district-court ruling rather than settled law, and other courts may weigh the fair use factors differently.
Who owns AI-generated outputs
The US Copyright Office has consistently held that purely AI-generated works do not receive copyright protection because copyright requires human authorship. Key rulings and reports:
Thaler v. Perlmutter. A federal district court upheld the Copyright Office's refusal to register a work created autonomously by an AI system, finding that human authorship is required. On appeal, the US Court of Appeals for the D.C. Circuit affirmed in March 2025 that a purely AI-generated work cannot be copyrighted, and the Supreme Court denied certiorari in March 2026, leaving the human-authorship requirement in place.
Copyright Office AI Report, Part 2 (January 2025). The Office concluded that outputs generated entirely by AI cannot be copyrighted. However, works in which a human author makes creative choices while using AI as a tool, such as selecting, arranging, and modifying outputs, can qualify for copyright in the human-authored elements. The line between tool use and autonomous generation requires case-by-case analysis.
Copyright Office AI Report, Part 3 (pre-publication version May 2025, released in response to a congressional request). This part addresses generative AI training. The Office's conclusions on training data liability are expected to inform congressional action. The final published version had no substantive changes from the pre-publication analysis.
Data provenance
Data provenance refers to the documented lineage of training data: where it came from, what license it carried, whether it was scraped or licensed, and whether it has been modified. Most frontier models lack public provenance documentation. Researchers and journalists have inferred training data composition from model behavior rather than documentation.
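As a rough illustration of what documented lineage could look like, the sketch below models one provenance record per training document, covering the four questions above: origin, license, scraped versus licensed, and modifications. The field names and the `summarize` helper are hypothetical and are not drawn from any published provenance standard.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """One training document's lineage (hypothetical schema)."""
    source_url: str                      # where the document came from
    license: Optional[str]               # license it carried, None if unknown
    acquisition: str                     # "scraped" or "licensed"
    modifications: Tuple[str, ...] = ()  # processing steps, in the order applied


def summarize(records: Iterable[ProvenanceRecord]) -> Dict[str, int]:
    """Count records by acquisition method and count those with no known license."""
    counts: Counter = Counter()
    for record in records:
        counts[record.acquisition] += 1
        if record.license is None:
            counts["unknown_license"] += 1
    return dict(counts)


# Made-up example records; the URLs and license labels are placeholders.
records = [
    ProvenanceRecord("https://example.com/a", "CC-BY-4.0", "scraped", ("deduplicated",)),
    ProvenanceRecord("https://example.com/b", None, "scraped"),
    ProvenanceRecord("https://example.org/c", "commercial", "licensed", ("html_stripped",)),
]
print(summarize(records))
```

Even a schema this small makes gaps visible: any record with an unknown license is counted explicitly instead of disappearing into the dataset.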
Tools such as Have I Been Trained allow artists to check whether their images appear in LAION datasets (used to train many image generation models). Meta's opt-out mechanism lets EU users request that their public Facebook and Instagram posts not be used for AI training. Google, Apple, and others offer similar limited opt-outs.
Common Crawl, the most widely used public web crawl, makes no copyright assertions about the pages it captures and simply archives publicly accessible URLs. Providers who build datasets on top of Common Crawl (as many do) inherit whatever copyright questions attach to the content at those URLs.
Watermarking and content credentials
C2PA (Coalition for Content Provenance and Authenticity). A technical standard for cryptographically signed content provenance metadata, embedded in media files. A C2PA-compliant image carries a Content Credential: a tamper-evident record of how it was created, what tools were used, and what modifications were made. If the metadata is stripped, soft bindings (invisible watermarks or fingerprints) can link the file back to its credential. C2PA is royalty-free and supported by Adobe, Microsoft, Google, BBC, and camera manufacturers including Sony and Nikon. Version 2.x is current as of 2026 and has significant security improvements over 1.x.
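The idea of a tamper-evident record can be sketched in a few lines. The example below is not the C2PA format or API: real Content Credentials use certificate-based signatures and a manifest embedded in the media file, whereas this sketch substitutes an HMAC with a shared demo key and a plain dictionary. The function names, the key, and the claim fields are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
from typing import Dict, List

# Assumption: a shared demo key. Real C2PA signing uses certificates, not a shared secret.
SIGNING_KEY = b"demo-key-not-for-production"


def _sign(claim: Dict) -> str:
    """Sign a canonical JSON encoding of the claim with the demo key."""
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def make_credential(media: bytes, tool: str, actions: List[str]) -> Dict:
    """Bind a record of how the media was made to a hash of its exact bytes."""
    claim = {
        "tool": tool,
        "actions": actions,
        "content_sha256": hashlib.sha256(media).hexdigest(),
    }
    return {"claim": claim, "signature": _sign(claim)}


def verify_credential(media: bytes, credential: Dict) -> bool:
    """Return True only if neither the claim nor the media bytes have changed."""
    if not hmac.compare_digest(_sign(credential["claim"]), credential["signature"]):
        return False  # the record itself was edited after signing
    current_hash = hashlib.sha256(media).hexdigest()
    return credential["claim"]["content_sha256"] == current_hash


image = b"\x89PNG...example bytes"  # placeholder content, not a real image
credential = make_credential(image, "ExampleCamera", ["captured", "cropped"])
print(verify_credential(image, credential))            # untouched file
print(verify_credential(image + b"edit", credential))  # modified file
```

The hash ties the record to one exact sequence of bytes, and the signature ties the record to its signer, so changing either the file or the record causes verification to fail. It also shows why a hash-based binding alone is fragile: once the record is stripped from the file, nothing in the bytes points back to it, which is the gap that soft bindings are meant to cover.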
SynthID. Google DeepMind's watermarking system for AI-generated images, audio, video, and text. It embeds imperceptible signals directly in the content rather than attaching them as metadata. It is available through Google products, and the text watermarking component has been released as open source.
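To make the contrast with metadata concrete, here is a toy text watermark in which the signal lives in the choice of words. It does not implement SynthID's algorithm. It swaps in a much simpler keyed "green list" scheme of the kind described in the research literature, and the key, the candidate words, and the function names are all hypothetical.

```python
import hashlib
from typing import List, Sequence

# Assumption: a secret shared by the embedder and the detector.
SECRET_KEY = b"hypothetical-watermark-key"


def is_green(previous: str, token: str) -> bool:
    """Keyed pseudo-random split: roughly half of all tokens are 'green' in any context."""
    data = SECRET_KEY + b"|" + previous.encode("utf-8") + b"|" + token.encode("utf-8")
    return hashlib.sha256(data).digest()[0] % 2 == 0


def embed(candidates_per_position: Sequence[Sequence[str]]) -> List[str]:
    """At each position, prefer a green candidate; fall back to the first candidate."""
    output: List[str] = []
    previous = ""
    for candidates in candidates_per_position:
        choice = next((c for c in candidates if is_green(previous, c)), candidates[0])
        output.append(choice)
        previous = choice
    return output


def green_fraction(tokens: Sequence[str]) -> float:
    """Detector statistic: share of tokens that are green given their predecessor."""
    if not tokens:
        return 0.0
    previous = ""
    hits = 0
    for token in tokens:
        hits += is_green(previous, token)
        previous = token
    return hits / len(tokens)


# Hypothetical stand-in for a model: eight interchangeable candidates at each of 40 positions.
candidates = [[f"word{i}_{j}" for j in range(8)] for i in range(40)]
marked = embed(candidates)
unmarked = [options[0] for options in candidates]
print(f"marked:   {green_fraction(marked):.2f}")
print(f"unmarked: {green_fraction(unmarked):.2f}")
```

Text produced by `embed` should score close to 1.0, while text chosen without the key should hover around 0.5. Because the signal is carried by the tokens themselves, it survives copy and paste, which stripped metadata does not. Production systems face harder problems that this sketch ignores, such as preserving output quality and surviving paraphrase.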
The EU AI Act requires that providers of systems generating synthetic content ensure it is "marked in a machine-readable format and detectable as artificially generated or manipulated." This applies from August 2026. The Act does not mandate a specific technology, leaving room for C2PA, SynthID, or other approaches.