Copyright, IP, and Data Provenance in AI


In Short

Training large AI models on web-scraped data has triggered a wave of copyright lawsuits from news publishers, authors, and image companies, and US courts began ruling on fair use arguments in 2024-2025. The US Copyright Office has concluded that purely AI-generated outputs receive no copyright protection. Liability for training data is only beginning to be tested in court, with 2025 rulings (such as Bartz v. Anthropic) reaching different outcomes depending on how the data was acquired. Provenance and watermarking standards (such as C2PA) and opt-out mechanisms are emerging but are not yet mandated in most jurisdictions.

01. What It Is

AI training data copyright concerns whether using copyrighted works to train a model infringes the rights of the works' creators. This involves two distinct questions: does training itself constitute infringement, and who (if anyone) owns the copyright in the model's outputs? A third question concerns data provenance: how can anyone know what data a model was trained on, and how can that data be traced back to its origin?

Content credentials and watermarking address a related but different problem: how to mark AI-generated outputs so they can be identified as synthetic after they leave the model.

02. Why It Matters

The outputs of generative AI are commercially valuable. If those outputs draw on protected expression, the creators of that expression may be entitled to compensation or control. Courts and regulators are still determining whether this is true and under what conditions. The outcomes will shape what training data AI companies can legally use, whether licensing markets for training data develop, and what rights creators retain over their work.

At the same time, disinformation and synthetic media concerns have pushed governments and platforms toward requiring disclosure and traceability of AI-generated content. The EU AI Act requires detectable marking of synthetic content from August 2026. The US Copyright Office issued guidance requiring disclosure of AI-generated content in copyright applications from March 2023.

03. How It Works

Training data and copyright

When an AI company scrapes the web to build a training dataset, it creates copies of billions of copyrighted documents. The model learns from these copies during training, and they are typically not retained verbatim afterward. The key legal question is whether this process falls within the US doctrine of fair use (or its equivalent in other jurisdictions).

Fair use (US) is a four-factor test: (1) purpose and character of the use, (2) nature of the copyrighted work, (3) amount and substantiality of the portion used, (4) effect on the market for the original. AI companies argue that training is transformative (factor 1) and does not substitute for the original (factor 4). Rights holders argue that AI outputs compete directly with their work and that a real market exists for licensing training data.

The New York Times v. OpenAI and Microsoft case, filed December 27, 2023, is the most prominent US case. The NYT alleges that its articles were used to train models that can reproduce them nearly verbatim in some cases, eliminating the need to visit nytimes.com. The Times is seeking billions in damages. The case is pending as of June 2026.

Other significant cases: Getty Images v. Stability AI (filed in US and UK courts, alleging unauthorized use of millions of licensed photos). Seventeen authors including John Grisham, Jodi Picoult, and George R.R. Martin sued OpenAI in September 2023. Music publishers including Universal Music sued Anthropic alleging misuse of song lyrics. These cases are working through courts.

OpenAI's public position is that training on copyrighted material is both unavoidable and lawful. It argues that "copyright today covers virtually every sort of human expression," so restricting training to public domain material "would not provide AI systems that meet the needs of today's citizens," and that training falls within fair use.

In 2025, one of the first substantive US rulings on generative AI training came in Bartz v. Anthropic. A federal court found that training a model on books Anthropic had lawfully acquired was fair use, while holding that obtaining works from pirated sources remained actionable infringement. Anthropic subsequently agreed to a settlement reported at around $1.5 billion to resolve the claims over pirated copies. This is a single district-court ruling, not settled law, and other courts may weigh the fair use factors differently.

Who owns AI-generated outputs

The US Copyright Office has consistently held that purely AI-generated works do not receive copyright protection because copyright requires human authorship. Key rulings:

Thaler v. Perlmutter. A federal district court upheld the Copyright Office's refusal to register a work created autonomously by an AI system, finding that human authorship is required. On appeal, the US Court of Appeals for the D.C. Circuit affirmed in March 2025 that a purely AI-generated work cannot be copyrighted, and the Supreme Court denied certiorari in March 2026, leaving the human-authorship requirement in place.

Copyright Office AI Report, Part 2 (January 2025). The Office concluded that purely AI-generated outputs cannot be copyrighted. However, works where a human author makes creative choices using AI as a tool (selecting, arranging, and modifying outputs) can qualify for copyright in the human-authored elements. The line between tool use and autonomous generation requires case-by-case analysis.

Copyright Office AI Report, Part 3 (pre-publication May 2025, congressional request). This part addressed generative AI training. The Office's conclusions on training data liability are expected to inform congressional action. The final published version had no substantive changes from the pre-publication analysis.

Data provenance

Data provenance refers to the documented lineage of training data: where it came from, what license it carried, whether it was scraped or licensed, and whether it has been modified. Most frontier models lack public provenance documentation. Researchers and journalists have inferred training data composition from model behavior rather than documentation.
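The four lineage questions above (origin, license, acquisition method, modification) can be captured in a small record per training document. The following is a minimal sketch, not a published schema: the `ProvenanceRecord` class, its field names, and the `make_record` helper are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from hashlib import sha256

# Hypothetical acquisition labels; the article distinguishes scraped from licensed data.
ACQUISITION_METHODS = {"scraped", "licensed"}


@dataclass(frozen=True)
class ProvenanceRecord:
    """Illustrative lineage record for one training document (not a standard)."""
    source_url: str                      # where the document came from
    license: str                         # license it carried, or "unknown"
    acquisition: str                     # "scraped" or "licensed"
    sha256_hex: str                      # fingerprint of the exact bytes ingested
    modifications: tuple[str, ...] = ()  # processing steps applied afterward


def make_record(content: bytes, source_url: str, license: str,
                acquisition: str, modifications: tuple[str, ...] = ()) -> ProvenanceRecord:
    """Build a record, rejecting acquisition labels outside the known set."""
    if acquisition not in ACQUISITION_METHODS:
        raise ValueError(f"unknown acquisition method: {acquisition!r}")
    return ProvenanceRecord(
        source_url=source_url,
        license=license,
        acquisition=acquisition,
        sha256_hex=sha256(content).hexdigest(),
        modifications=tuple(modifications),
    )


record = make_record(
    b"Example article text.",
    source_url="https://example.com/article",  # placeholder URL
    license="unknown",
    acquisition="scraped",
    modifications=("html_stripped", "deduplicated"),
)
```

Storing a content hash alongside the source URL lets a later audit confirm that a document in the dataset is byte-for-byte the one that was originally collected.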

Tools such as Have I Been Trained allow artists to check whether their images appear in LAION datasets (used to train many image generation models). Meta's opt-out mechanism lets EU users request their public Facebook and Instagram posts not be used for AI training. Google, Apple, and others offer similar limited opt-outs.
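Alongside the product-specific opt-outs above, one widely used voluntary signal (not named in this article, so treat it as an added assumption) is a site's robots.txt file, where a publisher can disallow specific AI crawlers by user agent. The sketch below uses Python's standard `urllib.robotparser` to check a hypothetical robots.txt; the file contents, the `may_crawl` helper, and the example URL are illustrative. `GPTBot` and `CCBot` are the crawler names used by OpenAI and Common Crawl respectively.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to opt out of AI crawlers
# while still allowing everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)


url = "https://example.com/news/story"  # placeholder URL
print(may_crawl(ROBOTS_TXT, "GPTBot", url))       # False
print(may_crawl(ROBOTS_TXT, "CCBot", url))        # False
print(may_crawl(ROBOTS_TXT, "OtherCrawler", url)) # True
```

A check like this only reports what the site has requested. Whether a crawler honors the request is up to the crawler's operator.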

Common Crawl, the most widely used web scrape, makes no copyright assertions and simply captures publicly accessible URLs. Providers who build datasets on top of Common Crawl (as many do) inherit whatever copyright questions attach to those URLs.

Watermarking and content credentials

C2PA (Coalition for Content Provenance and Authenticity). A technical standard for cryptographically signed content provenance metadata, embedded in media files. A C2PA-compliant image carries a Content Credential: a tamper-evident record of how it was created, what tools were used, and what modifications were made. If the metadata is stripped, soft bindings (invisible watermarks or fingerprints) can link the file back to its credential. C2PA is royalty-free and supported by Adobe, Microsoft, Google, BBC, and camera manufacturers including Sony and Nikon. Version 2.x is current as of 2026 and has significant security improvements over 1.x.

SynthID. Google DeepMind's watermarking system for AI-generated images, audio, video, and text, which embeds imperceptible signals directly in the content rather than as metadata. It is available through Google products, and the text watermarking component has been released as open source.

The EU AI Act requires that providers of systems generating synthetic content ensure it is "marked in a machine-readable format and detectable as artificially generated or manipulated." This applies from August 2026. The Act does not mandate a specific technology, leaving room for C2PA, SynthID, or other approaches.
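The idea of a tamper-evident, signed provenance record described for C2PA above can be illustrated with a toy example. This is a sketch of the general principle only, not the C2PA format: real Content Credentials use certificate-based public-key signatures and a standardized manifest structure, whereas this example substitutes an HMAC with a shared key so that it runs on the Python standard library alone. The function names, the manifest fields, and the key are all hypothetical.

```python
import hashlib
import hmac
import json


def sign_credential(asset: bytes, claims: dict, key: bytes) -> dict:
    """Bind claims to the asset's hash, then sign the combined manifest.

    HMAC stands in for the public-key signature a real credential would use.
    """
    manifest = {"asset_sha256": hashlib.sha256(asset).hexdigest(), "claims": claims}
    payload = json.dumps(manifest, sort_keys=True).encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"manifest": manifest, "signature": signature}


def verify_credential(asset: bytes, credential: dict, key: bytes) -> bool:
    """Check that neither the manifest nor the asset changed after signing."""
    manifest = credential["manifest"]
    payload = json.dumps(manifest, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, credential["signature"])
    asset_ok = manifest["asset_sha256"] == hashlib.sha256(asset).hexdigest()
    return signature_ok and asset_ok


key = b"demo-signing-key"          # placeholder secret for the sketch
image = b"\x89PNG...pixel data"    # stand-in bytes, not a real image
cred = sign_credential(image, {"tool": "ExampleEditor", "ai_generated": False}, key)

print(verify_credential(image, cred, key))               # True
print(verify_credential(image + b"edited", cred, key))   # False: asset changed
```

Note what a successful check does and does not establish. It shows that the record and the file are unchanged since signing. It says nothing about whether the claims in the record were true when the tool wrote them.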

04. Provenance and Content Credentials for Creators

Beyond the standard itself, a working creator can attach a Content Credential to their own files and have it travel with the work. Adobe's free Content Authenticity web app, in public beta since April 2025, lets a photographer or artist sign up to 50 images at once, whether or not they were made in Adobe tools, and attach a verified name (through Verified on LinkedIn) plus social links so the work stays credited. The same credentials can be applied on export from Photoshop, Lightroom, and Premiere, and Adobe adds them automatically to anything generated in its Firefly models.

Provenance can also begin in the camera. Leica's M11-P was the first production camera to sign images at the moment of capture, and Nikon (Z6III) and Canon (EOS R1 and R5 Mark II) now do the same through newsroom services that require registration or paid activation, with Reuters an early Canon tester. Platforms are starting to surface these credentials. LinkedIn marks signed posts with a small "Cr" pin that a viewer can hover over to see who made a file and whether AI was involved.

On the IP side, the Adobe app also records a "do not train on my work" preference inside the credential, honored so far by Adobe Firefly and Spawning, the company behind Have I Been Trained. Anything generated with Google's own tools carries its SynthID watermark automatically.

The catches are practical. Camera signing is often paid or region-locked, few platforms display credentials yet, and a credential proves origin, not that the content is true.
For the detection side, see Deepfakes and Detecting AI-Generated Media.

05. Key Terms

Fair use:
US copyright doctrine allowing limited use of protected works without permission. Whether AI training qualifies is the central contested legal question.

Training data license:
A contractual right, rather than a statutory right, to use specific datasets. OpenAI and others have licensing deals with AP, Reuters, Axel Springer, and other publishers for training data use.

LAION:
Large-scale Artificial Intelligence Open Network. A German nonprofit that created large image-text datasets (LAION-400M, LAION-5B) used to train Stable Diffusion and many other models. LAION temporarily took LAION-5B offline in December 2023 after researchers reported links to illegal content in the dataset.

C2PA Content Credential:
A cryptographically signed provenance record embedded in a digital asset, verifiable by anyone using C2PA-compliant tools. Not DRM: it does not restrict use, only documents origin.

Opt-out mechanism:
A way for rights holders to signal that their content should not be used for AI training. Opt-outs have no legal force under current US law but are increasingly honored by major AI companies, either voluntarily or under EU/UK pressure.

06. Examples and Cases

NYT v. OpenAI (December 2023). The Times showed instances where GPT-4 reproduced articles near-verbatim. Such reproduction suggests the model memorized the text and weakens the argument that training is transformative under fair use. The case is the most closely watched in the field.

Getty v. Stability AI. Getty's images carry its watermark, and Stability AI's model reproduced distorted versions of that watermark in some outputs, which Getty presented as direct evidence of copying. The UK High Court ruled on 4 November 2025, and Getty largely lost. Its primary copyright claims failed (in part because Getty did not pursue the training-related claims to judgment in the UK), and it prevailed only on narrow trademark points. The parallel US case was still pending as of late 2025.

Authors Guild lawsuits. Grisham, Martin, and others allege that models trained on their books can produce derivative works without license fees. These cases turn on whether the outputs compete with or substitute for the originals.

Andersen et al. v. Stability AI. A class action by visual artists against Stability AI and Midjourney, alleging their styles and images were used without consent.

07. Common Pitfalls and Misconceptions

"Scraping public web pages is legal, so training on that data is legal."
Accessing a public page and copying its content are different acts. Terms of service frequently prohibit scraping, and terms-of-service violations may support certain legal theories even if copyright claims fail.

"AI companies have indemnified users."
Some vendors (OpenAI, Microsoft, Google, Adobe) offer limited copyright indemnification for enterprise customers using their APIs or products. This shifts liability to the vendor within the terms of the indemnity but does not resolve the underlying legal question.

"A C2PA credential proves an image is real."
C2PA proves the credential was signed by a conforming implementation. It does not verify the content is accurate or unadulterated; it documents what the tool recorded about the creation process.

"If AI generated it, no one owns it."
Correct for purely autonomous AI outputs under current US law. Incorrect when a human makes creative choices in using AI as a tool. The human's creative expression is protectable even if the AI did the rendering.


