Copyright negotiations between publishers and Large Language Model (LLM) providers represent one of the biggest regulatory and commercial challenges of 2026. With the EU AI Act compliance deadline set for August 2026, Italian publishers are facing critical decisions regarding content monetization, AI model training authorization, and intellectual property management. This article analyzes the legal frameworks, current economic agreements with ChatGPT, Claude, and Gemini, and operational strategies to maximize the value of editorial data.
Licensing dynamics have evolved significantly from simple crawling agreements. Today, publishers must consider three simultaneous dimensions: the right to index for organic search, the right to train generative models, and the right to cite and attribute in AI responses. Each dimension has distinct contractual, economic, and strategic implications.
The Regulatory Landscape: EU AI Act and Compliance August 2026
The Italian and European regulatory framework is constantly changing. The EU AI Act classifies AI systems by risk, and many of the tools used for training LLMs with editorial data fall into the high-risk category. As analyzed in detail in the article EU AI Act Compliance for Italian Publishers — Deadline August 2026, the transparency and disclosure obligations for model training become mandatory for anyone who provides data.
Publishers must provide LLM providers with compliance documents that include:
- Explicit declaration of which dataset was used for training
- Certification that the dataset does not contain unauthorized personal data
- Confirmation of copyright ownership on all transferred content
- Audit log regarding the dataset's exposure period to the model
This legal architecture makes formal, binding Data Licensing Agreements imperative; simple bilateral ToS are no longer enough.
Anatomy of a Modern Data Licensing Agreement
A data licensing agreement between a publisher and an LLM provider must include specific sections to operate in compliance with the EU AI Act and to protect the publisher's rights.
1. Dataset Definition and Licensing Scope
The first section must precisely specify which body of content is covered by the agreement. Examples of correct specification:
- All articles published on the www.editoriale.it domain between 01/01/2022 and 12/31/2025, excluding those classified as “draft” in their editorial status.
- Italian language content, minimum length of 500 words, excluding news ticker and aggregation articles
- Metadata included: title, publication date, author, category, structured tags
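A scope definition of this kind can be turned into a filter that both parties run against the corpus. The following is a minimal sketch, assuming the example criteria above are combined into one rule; the `Article` fields and the `kind` labels are hypothetical names introduced for illustration, not part of any provider's schema.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlsplit


@dataclass
class Article:
    # All field names are illustrative assumptions.
    url: str
    published: date
    language: str      # ISO 639-1 code, e.g. "it"
    word_count: int
    status: str        # editorial status, e.g. "published" or "draft"
    kind: str          # e.g. "article", "ticker", "aggregation"


SCOPE_START = date(2022, 1, 1)
SCOPE_END = date(2025, 12, 31)


def in_scope(a: Article) -> bool:
    """Return True if the article falls inside the licensed corpus."""
    return (
        urlsplit(a.url).hostname == "www.editoriale.it"
        and SCOPE_START <= a.published <= SCOPE_END
        and a.status != "draft"
        and a.language == "it"
        and a.word_count >= 500
        and a.kind not in {"ticker", "aggregation"}
    )
```

Keeping the rule executable makes it easier to show, in a later dispute, exactly which items were and were not covered.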
The lack of a clear dataset definition is the main cause of disputes in publishing. Many publishers have implicitly authorized crawling without realizing that the provider was using the data for generative training—a substantially different use.
2. Specific Use Rights and Restrictions
The modern licensing scheme includes a matrix of distinct rights:
| Usage Type | Typically Authorized | Compensation |
| --- | --- | --- |
| Web Search Indexing | Yes (unless blocked via robots.txt) | Implicit (referral traffic) |
| LLM Model Training | Only if explicit | Installment plan |
| Proprietary Fine-Tuning | Rarely | Premium (5x training) |
| Citability in AI Responses | Yes (with attribution) | Synthetic traffic + links |
The lack of clarity surrounding this matrix was the cause of the dispute between publishers and OpenAI (2023-2024). Publishers believed they had only granted indexing rights, while OpenAI used the data for generative training.
3. Compensation Mechanisms: Current Models in 2026
Currently, compensation models are structured around four main types:
Model A: Payment-Per-Million-Tokens (PPMT)
OpenAI and Anthropic have adopted this model with major French publishers (Le Monde, Agence France-Presse) and British ones (Financial Times). The publisher receives a fee based on the number of tokens from their dataset used in training:
- Standard fee: €0.02 – €0.08 per million tokens
- Dataset of 1M articles (average 800 words): ~1.5 billion tokens → potential revenue €30k–€120k annually
- Advantage: Scalable, transparent, measurable
- Disadvantage: Does not compensate for the value that carries over permanently into future versions of the model.
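The PPMT arithmetic can be sketched as follows. This is a minimal illustration, not a provider's billing formula; the function name `ppmt_fee` and the `tokens_per_unit` parameter are assumptions introduced here. The €30k–€120k range quoted above is reproduced only when the quoted rate is applied per thousand tokens; applied strictly per million tokens, the same dataset yields €30–€120, so the billing unit is a detail worth pinning down in the contract.

```python
def ppmt_fee(tokens: int, rate_eur: float, tokens_per_unit: int = 1_000_000) -> float:
    """Fee in euros for `tokens` tokens billed at `rate_eur` per `tokens_per_unit` tokens."""
    return tokens / tokens_per_unit * rate_eur


dataset_tokens = 1_500_000_000  # ~1.5 billion tokens, as in the example above

# Strict per-million reading of the quoted rates.
low_per_million = ppmt_fee(dataset_tokens, 0.02)
high_per_million = ppmt_fee(dataset_tokens, 0.08)

# Per-thousand reading, which is the one that matches the quoted revenue range.
low_per_thousand = ppmt_fee(dataset_tokens, 0.02, tokens_per_unit=1_000)
high_per_thousand = ppmt_fee(dataset_tokens, 0.08, tokens_per_unit=1_000)

print(f"per million tokens:  EUR {low_per_million:,.0f} to {high_per_million:,.0f}")
print(f"per thousand tokens: EUR {low_per_thousand:,.0f} to {high_per_thousand:,.0f}")
```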
Model B: AI Product Revenue Share
Some premium publishers (particularly in business journalism) have negotiated a percentage share of the revenue generated by AI products that incorporate their content:
- Revenue share: 0.5% – 2% of revenue from ChatGPT Plus, Claude Pro, and Google One AI Premium
- Applicable only to publishers with over 5M verified pages/year
- Typically capped at an annual maximum (€500k – €5M depending on publisher tier)
- Advantage: Incentive alignment, scalable upside
- Disadvantage: Audit complexity, accounting disputes
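The interaction between the percentage share and the annual cap can be sketched in a few lines. This is a simplified illustration, assuming the share is applied to a single attributable revenue figure; the function name, the 1% share, and the revenue amounts are hypothetical, and only the €500k cap is taken from the range above.

```python
def revenue_share_payout(attributable_revenue_eur: float, share: float, annual_cap_eur: float) -> float:
    """Publisher payout: a share of attributable product revenue, limited by the annual cap."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be a fraction between 0 and 1")
    return min(attributable_revenue_eur * share, annual_cap_eur)


# Hypothetical figures: a 1% share with a EUR 500k annual cap.
capped = revenue_share_payout(100_000_000, 0.01, 500_000)    # cap binds
uncapped = revenue_share_payout(20_000_000, 0.01, 500_000)   # cap does not bind
```

In practice the hard part is agreeing on what counts as attributable revenue, which is where the audit complexity mentioned above comes from.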
Model C: Temporal Exclusive Licensing
Less common but growing: the publisher authorizes training with a time embargo. Practical example:
- Content older than 6 months: authorized for training without restrictions
- Content published in the last 6 months: training prohibited, only crawling for search allowed
- Compensation: Annual fixed fee (€50k–€500k) + bonus if the provider complies with the embargo
- Advantage: Protects “fresh” news, maintains competitive edge
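The embargo rule above reduces to a date comparison that either side can apply to each item. A minimal sketch, assuming "6 months" is fixed at 183 days (the contract would need to define the exact window); `training_allowed` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Roughly 6 months; the exact day count is an assumption to be fixed in the contract.
EMBARGO = timedelta(days=183)


def training_allowed(published: date, today: date) -> bool:
    """True if the item is older than the embargo window and may be used for training."""
    return today - published > EMBARGO


reference_day = date(2026, 1, 1)
old_item_ok = training_allowed(date(2025, 1, 1), reference_day)     # older than the window
fresh_item_ok = training_allowed(date(2025, 12, 1), reference_day)  # still embargoed
```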
Model D: Hybrid Citability + Attribution Revenue
In line with the EU AI Act, the provider commits to explicitly citing the publisher in responses on specific topics, and the synthetic traffic generated (click-throughs from AI responses) is compensated:
- Compensation: €0.01 – €0.05 per citation generated in a response
- Monitoring: Via tracking APIs (e.g., UTM parameters on AI responses)
- Advantage: Simple to implement, based on real value (visibility)
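UTM-based monitoring can be sketched with the standard library alone. The parameter values below (`ai_response`, `ai_citation`) and the helper name `tag_citation_url` are assumptions chosen for illustration; the actual values would be agreed with each provider.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_citation_url(url: str, provider: str, campaign: str = "ai_citation") -> str:
    """Append UTM parameters to a cited URL, preserving any existing query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update(
        {
            "utm_source": provider,        # e.g. "gemini", "chatgpt", "claude"
            "utm_medium": "ai_response",   # hypothetical medium label
            "utm_campaign": campaign,
        }
    )
    return urlunsplit(parts._replace(query=urlencode(query)))


tagged = tag_citation_url("https://www.editoriale.it/articolo?id=7", "gemini")
```

Click-throughs carrying these parameters can then be counted in the publisher's own analytics and reconciled against the provider's citation report.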
Negotiation with the Three Dominant Providers: State of the Art August 2026
OpenAI (ChatGPT): Current Licensing Framework
In March 2026, OpenAI released a Publisher Data Licensing Program with standardized parameters:
- Tier 1 (Small Publishers: <10M pages/year): PPMT model at €0.02/M tokens, minimum €5k/year, maximum €50k/year
- Tier 2 (Medium Publishers: 10M–100M pages/year): PPMT at €0.05/M tokens, minimum €50k, maximum €500k
- Tier 3 (Major Publishers: >100M pages/year): Custom negotiation with optional revenue share
- Guaranteed Opt-Out: Publishers can exclude their content from GPT-5 training (next version), but not from GPT-4 Turbo (already in production).
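The tier thresholds and fee bounds listed above can be expressed as a small lookup plus a clamp. This is a sketch of the structure as described in this section, not OpenAI's implementation; the `Tier` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    min_fee_eur: Optional[float]  # None means no standard bound (custom negotiation)
    max_fee_eur: Optional[float]


def tier_for(pages_per_year: int) -> Tier:
    """Map annual page volume to the tier bounds listed above."""
    if pages_per_year < 10_000_000:
        return Tier("Tier 1", 5_000, 50_000)
    if pages_per_year <= 100_000_000:
        return Tier("Tier 2", 50_000, 500_000)
    return Tier("Tier 3", None, None)


def clamp_fee(raw_fee_eur: float, tier: Tier) -> float:
    """Apply the tier's minimum and maximum to a raw usage-based fee."""
    if tier.min_fee_eur is None or tier.max_fee_eur is None:
        return raw_fee_eur
    return max(tier.min_fee_eur, min(raw_fee_eur, tier.max_fee_eur))


small_publisher_fee = clamp_fee(4_800, tier_for(2_000_000))     # raised to the floor
medium_publisher_fee = clamp_fee(800_000, tier_for(50_000_000)) # cut to the ceiling
```

For small publishers the floor often matters more than the per-token rate, since the usage-based fee may never reach it.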
OpenAI's standard clauses include:
- Perpetual right to use data for training present and future versions of OpenAI models
- Prohibition of sub-licensing to third parties (e.g., you cannot transfer data sold to OpenAI to Anthropic)
- Publisher indemnity against liability for content reproduced verbatim in outputs (fair use defense)
- No-compete: if the publisher has its own LLM model, it cannot train it with the data it provides to OpenAI
Anthropic (Claude): An Approach with More Guarantees
Anthropic has adopted a more conservative legal stance, opposing mass licensing and instead proposing:
- Explicit Opt-In for Each Dataset: No data is used without a GDPR-compliant Data Processing Agreement (DPA).
- Guaranteed Minimum Compensation: €25k/year even for small publishers
- Right to Audit: The publisher can annually audit how the dataset was used in Claude training.
- Retention Policy: Data is not retained on Anthropic servers for more than 24 months after the end of the agreement.
Anthropic's competitive advantage is legal credibility: risk-averse publishers (like Italian publishing groups with strong legal exposure) prefer this model.
Google Gemini: Integration with Publisher Program
Google has incorporated data licensing into its Google News Initiative Partner Program:
- Compensation via Gemini for Publishers API: €0.001 per prompt citing publisher content in Gemini responses
- Priority access to the Gemini API beta for partner publishers (40% discount on API calls)
- Integration with Google Analytics to track synthetic quotes and traffic from Gemini
- No exclusivity: the publisher may license data simultaneously to OpenAI, Anthropic, and Google
This model is the most advantageous for niche Italian publishers, as Google incentivizes source variety to avoid information monoculture.
Operational Negotiation Strategies for Italian Publishers
1. Preliminary Audit of Your Dataset
Before starting negotiations, the publisher must accurately map its assets:
- Total number of published articles/pages
- Temporal distribution (counts per year)
- Average length (words per article)
- Languages (Italian, English, others)
- Thematic sectors (business, tech, lifestyle, news, etc.)
- Originality Rate: how much is original content vs. aggregation/wire services
Italian publishers often overestimate the value of their dataset. A niche publisher producing the national average of 2,000 articles/year generates only about 1.6M tokens per year, well below the threshold where PPMT becomes relevant (approximately 500M tokens for significant value).
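The estimate above can be reproduced with a back-of-the-envelope calculation. This sketch assumes an average of 800 words per article (the figure used in the Model A example) and one token per word, which is what the 1.6M figure implies; real tokenizers typically produce more than one token per Italian word, so `tokens_per_word` is left as a parameter. The function name is hypothetical.

```python
PPMT_RELEVANCE_THRESHOLD = 500_000_000  # ~500M tokens, the threshold cited above


def estimate_tokens(articles: int, avg_words: int, tokens_per_word: float = 1.0) -> int:
    """Rough token count for a corpus; tokens_per_word is an assumption, not a measurement."""
    return round(articles * avg_words * tokens_per_word)


yearly_tokens = estimate_tokens(articles=2_000, avg_words=800)
years_to_threshold = PPMT_RELEVANCE_THRESHOLD / yearly_tokens

print(f"{yearly_tokens:,} tokens/year; about {years_to_threshold:.0f} years of output to reach the threshold")
```

Running the same estimate on the actual archive, with a measured tokens-per-word ratio, gives a more defensible number to bring to the negotiation.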
2. Sectoral Coalition and Collective Bargaining
The EU AI Act Recital 50 explicitly promotes collective negotiations between publishers and providers. In 2026, regional coalitions emerged:
- Italy: FIEG (Italian Federation of Newspaper Publishers) is establishing a collective data pool to negotiate better terms.
- France: The APIG (General Information Press Alliance) has negotiated minimum terms that also bind non-member publishers through regulatory pressure.
- Spain: The APM (Association of Media Publishers) has forced Google to pay €1.3 million annually for snippets in search.
A small-to-medium sized Italian publisher (500k–2M articles) is 3x more likely to obtain favorable terms if negotiating through FIEG rather than on their own.
3. Proposal Structure: Cover Letter Template
An effective proposal to OpenAI, Anthropic, or Google must include:
- Executive Summary (1 page): Who you are, sector, audience size, geographic relevance
- Dataset Specification (2 pages): Exact volume, languages, quality score, originality
- Valuation Proposal (1 page): Compensation requested calculated according to PPMT baseline + premium for quality/originality
- Legal Assurances (1 page): Intellectual Property Statement, Absence of Third-Party Rights, GDPR Compliance
- Monitoring & Reporting (1 page): Audit framework, quarterly reporting, future opt-out right
Providers receive dozens of proposals daily: a well-structured proposal has a 10x greater chance of being analyzed by business teams (not relegated to a legal decline form).
Tax and Accounting Implications for Italian Publishers
Data licensing compensation has significant implications on the tax and accounting sides.
Tax System in Italy
Amounts received as data licensing fees are classified as income from intellectual property exploitation according to the Italian Tax Code (Articles 115 et seq.):
- If the publisher is a legal entity subject to IRES: The compensation is taxable income subject to the 24% IRES rate plus the regional IRAP rate (3.9% in Lombardy, for example)
- If it is a sole proprietorship: This is business income subject to ordinary taxation (marginal tax rate ranging from 23% to 43% depending on total income)
- Cost deduction: Confirm whether legal negotiation costs (IP lawyers), audits, and EU AI Act compliance costs are deductible
- Assignment of Rights vs. License: If you transfer rights perpetually (the transfer is not reversible), you realize a capital gain on intangible assets, with more complex tax implications
An Italian publisher receiving €100k from OpenAI for data licensing will need to calculate a tax liability of approximately €30k–€50k depending on their legal structure.
Accounting and EU AI Act Compliance
The EU AI Act requires permanent documentation of:
- Dataset start and end dates for training
- Unique identifier for each file/item transferred
- Later versions of the model that use the dataset (e.g., GPT-4 vs. GPT-5)
- Possible use for fine-tuning or domain-specific adaptation
This documentation must be kept for at least 7 years and made available at the request of EU authorities (EDPB, AGCM, Garante Privacy).
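One way to keep this documentation consistent is a per-item record with a computed retention date. The sketch below is an assumption about how such a record might look, not a format prescribed by the regulation; the class and field names are hypothetical, and the 7-year period is counted from the end of the exposure window, which the text above does not specify.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TrainingUseRecord:
    item_id: str                       # unique identifier of the transferred file/item
    exposure_start: date               # first day the item was available for training
    exposure_end: date                 # last day the item was available for training
    model_versions: list[str] = field(default_factory=list)
    used_for_fine_tuning: bool = False

    def retain_until(self, years: int = 7) -> date:
        """Earliest date the record may be discarded (assumed to run from exposure_end)."""
        end = self.exposure_end
        try:
            return end.replace(year=end.year + years)
        except ValueError:
            # 29 February with a non-leap target year: fall back to 28 February.
            return end.replace(year=end.year + years, day=28)


record = TrainingUseRecord("art-0001", date(2024, 1, 1), date(2025, 12, 31), ["GPT-4"])
keep_until = record.retain_until()
```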
Common Legal Risks and Mitigation
Risk 1: Third-Party Rights Embedded in the Dataset
Many Italian publishers republish content from news agencies (ANSA, Adnkronos, Dire) with simple attribution. If you provide this data to an LLM provider, you are potentially violating the original agency's copyright.
Mitigation
- Preliminary audit: segregate the dataset into “original content” vs. “aggregated content”
- License only the original part (reduces value, but eliminates liability)
- Negotiate sub-licensing agreements with news agencies (complex, but possible)
- Take out IP (Errors & Omissions) insurance that covers this exposure
Risk 2: GDPR and Personal Data in Articles
News articles often contain personal data (names, addresses, sensitive information). Transmitting this data to LLM providers who will train models without anonymization is a GDPR violation.
Mitigation
- Pre-processing: Automatically anonymize personal data before handover (tools: Microsoft Presidio, Stanford Stanza PII-extractor)
- Express DPA with the provider that specifies GDPR protections
- Right to opt-out for subjects requesting de-indexing (Art. 17 GDPR right to be forgotten)
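To make the pre-processing step concrete, here is a deliberately simplified sketch that masks e-mail addresses and phone-like digit sequences using only regular expressions. It is not a substitute for the NER-based tools named above: it does not detect names or addresses, and the crude phone pattern can also catch other long digit runs. The patterns and the `redact` function are illustrative assumptions.

```python
import re

# Simplified patterns for illustration only; real pipelines need NER-based detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d -]{6,}\d(?!\w)")


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like sequences with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


sample = "Scrivere a mario.rossi@example.com o chiamare +39 02 1234 5678."
redacted = redact(sample)
```

Whatever tool is used, the publisher should keep a log of what was masked and when, since the DPA will typically require evidence of the anonymization method.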
Risk 3: Perpetual Clauses and Lack of Sunset
Many OpenAI contracts include perpetual rights clauses for data usage. This means that even if you terminate your relationship with OpenAI, your data remains in the GPT-5, GPT-6, and later models.
Mitigation
- Explicitly negotiate a sunset clause: “Rights valid for up to 5 years after the end of the agreement; thereafter, data must be purged or anonymized.”
- Specify opt-out for future major versions (e.g., “data for GPT-4 yes, for GPT-5 no without a new agreement”)
- Request an annual right to audit to verify that the data has actually been purged
Integration with EU AI Act Compliance — Link to Reference Documents
As detailed in EU AI Act Compliance for Italian Publishers — Deadline August 2026, data licensing agreements must include compliance documentation:
- Copy of all provider notification communications on data usage for training
- Statement on the mode of anonymization or pseudonymization (if applicable)
- Assessment of the risk of negative consequences on fundamental rights (Art. 29 EU AI Act)
- Action plan for identified risk mitigation
In parallel, the management of citability and attribution in AI outputs must align with the strategies described in Answer Engine Optimization (AEO) Beyond AI Overviews, which examines how to position yourself to be cited by ChatGPT, Perplexity, and Google Deep Research Agent.
Operational Checklist for Negotiating a Data Licensing Agreement
- Weeks 1-2: Internal dataset audit (volume, quality, originality, GDPR gaps)
- Weeks 2-3: Preparation of legal documentation (IP ownership declaration, non-infringement certificate, insurance)
- Weeks 3-4: Drafting the licensing proposal (3-5 pages, template above)
- Weeks 4-6: Simultaneous forwarding to OpenAI, Anthropic, Google via business contacts (not generic forms)
- Weeks 6-12: Term negotiation (awaiting response, counterproposals, negotiation rounds)
- Week 12+: Technical implementation (API setup, monitoring, compliance documentation)
- Ongoing: Quarterly audit, AI citation tracking, compliance update for new model versions
Economic Scenarios: How Much Can an Italian Publisher Earn
Scenario A: Specialized Tech Publisher (1 million articles, 85% originality, Italian + English)
- Estimated tokens: ~800M tokens
- OpenAI Compensation (Tier 2 PPMT): €0.05/M tokens × 800 = €40k/year
- Anthropic Compensation: €25k guaranteed + quality bonus
- Google Compensation (citeability): ~0.5M citations/year × €0.001 = €500
- Total potential: €65.5k/year gross (after taxes: ~€40k net)
Scenario B: General Lifestyle/News Publisher (3 million articles, 60% originality, primarily in Italian)
- Estimated tokens: ~1.8B tokens
- Quality discount (originality 60%): -30%
- OpenAI Compensation: €0.04/M tokens × 1.8B × 0.7 = €50.4k/year
- Anthropic Compensation: €25k
- Google Compensation: negligible (low-specialization content)
- Total: €75k/year gross (after taxes: ~€45k net)
Scenario C: Vertical Niche Publisher (200,000 articles, 95% originality, specialty tech/business)
- Estimated tokens: ~160M tokens
- Premium quality: +50% (rare, highly specialized content)
- OpenAI Compensation (Tier 1): €0.02/M tokens × 160 × 1.5 = €4.8k, raised to the €5k floor = €5k
- Anthropic Compensation: €25k (minimum floor)
- Google Compensation: ~2M citations/year (vertical specialty) × €0.001 = €2k
- Total: €32k/year gross, but high strategic value (access to Claude/Gemini training)
These scenarios show a pattern: direct monetization from data licensing is modest (€5k–€75k/year for average Italian publishers). The real value is strategic: preferential access to beta APIs, cost reduction, and above all, positioning as a reliable source in AI outputs.
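The three scenario totals can be recomputed with a short script. The sketch reproduces the arithmetic exactly as the scenarios state it (rate × millions of tokens, read in thousands of euros) and applies the Tier 1 floor only where the scenario itself does; the helper name is hypothetical, and Anthropic's "quality bonus" in Scenario A is left out because no figure is given for it.

```python
def scenario_total_k(openai_k: float, anthropic_k: float, google_k: float = 0.0) -> float:
    """Gross yearly total in thousands of euros."""
    return round(openai_k + anthropic_k + google_k, 1)


# Scenario A: 800M tokens at 0.05; ~0.5M citations at EUR 0.001 each.
a_openai = 0.05 * 800
a_google = 500_000 * 0.001 / 1_000           # converted to thousands of euros
a_total = scenario_total_k(a_openai, 25, a_google)

# Scenario B: 1.8B tokens at 0.04 with a 30% quality discount; Google negligible.
b_openai = 0.04 * 1_800 * 0.7
b_total = scenario_total_k(b_openai, 25)

# Scenario C: 160M tokens at 0.02 with a 50% premium, then the EUR 5k Tier 1 floor.
c_openai = max(0.02 * 160 * 1.5, 5.0)
c_google = 2_000_000 * 0.001 / 1_000
c_total = scenario_total_k(c_openai, 25, c_google)

for name, total in (("A", a_total), ("B", b_total), ("C", c_total)):
    print(f"Scenario {name}: about EUR {total}k/year gross")
```

Swapping in a publisher's own token count, originality adjustment, and citation volume gives a quick first estimate before any formal valuation.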
FAQ
If I refuse to give my data to OpenAI/Claude/Gemini, can I still exclude my site from their training?
Partially. If you don't sign a data licensing agreement, you can prevent crawling via robots.txt and by sending a request to the provider's legal team. However, according to the EU AI Act, once content is publicly available (and not blocked by robots.txt), the provider could argue it is entitled to train on it under fair use. For total protection, you must: (1) block the providers' crawlers in robots.txt; (2) send a cease and desist letter signed by a lawyer; (3) actively monitor through tools like GPTbot detector. For details on citation monitoring, refer to Real-time Citability Monitoring.
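A robots.txt block can be verified locally before relying on it. The sketch below uses Python's standard `urllib.robotparser` to check which crawlers a given robots.txt would turn away. The user-agent tokens `GPTBot`, `ClaudeBot`, and `Google-Extended` are the ones the providers have published as far as we know; treat the list as an assumption and check each provider's current documentation. The helper name and sample URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt that blocks AI training crawlers but leaves search crawling open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the agents that this robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


agents_to_check = ["GPTBot", "ClaudeBot", "Google-Extended", "Googlebot"]
blocked = blocked_agents(ROBOTS_TXT, agents_to_check, "https://www.editoriale.it/articolo")
print(blocked)
```

Note that robots.txt only expresses a request; it does not technically prevent a crawler from fetching pages, which is why the legal and monitoring steps above remain necessary.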
What's the difference between licensing data to OpenAI and being cited in Google's AI Overviews?
They are two distinct channels: (1) Data licensing to OpenAI: you cede historical data for ChatGPT training under a one-time or annual commercial agreement. (2) Google AI Overviews: Google crawls your current content and cites it in AI answers via web search; it's free (or monetized via AdSense/AdX). The two are not mutually exclusive: you can license historical data to OpenAI and simultaneously be cited in Google AI Overviews for new content. See in-depth details at Zero-Click Permanent and AI Overview Citations.
If I have signed a Data Licensing Agreement with OpenAI, do I need to do the same with Anthropic and Google so as not to be disadvantaged?
No, but it’s strategically advisable. Each provider has a different audience and use case: OpenAI dominates ChatGPT (consumer), Anthropic has Claude (enterprise/developer), and Google controls over 90% of search traffic, so Google AI Overviews reaches more users. From a revenue perspective: if you license only to OpenAI, you lose citation revenue from Claude and Gemini. A rational publisher should negotiate simultaneously with all three, perhaps with slightly different terms (e.g., Google with revenue share from citations, OpenAI with PPMT, Anthropic with a guaranteed minimum fee).
How do I know if my dataset is “good” enough to negotiate terms above the standard PPMT?
Providers evaluate datasets along three dimensions: (a) Volume: over 1B tokens is interesting; (b) Specialization: datasets in niche verticals (legal tech, medical, fintech) command a 2-5x premium over generic content; (c) Originality: a dataset containing >85% original content is worth more than aggregated datasets. If your dataset has all three of these attributes, you have leverage to request revenue sharing instead of simple PPMT. Contact a lawyer specializing in IP and LLM licensing (e.g., AVG&Partners in Milan) for a pre-negotiation assessment.
What happens to my dataset if the LLM provider fails or is acquired?
It is the most dangerous legal gap today. If OpenAI were to fail tomorrow, what happens to the data surrendered for GPT-4 training? The standard contract states that the data remains “property of OpenAI” even in bankruptcy. An acquisition (e.g., Microsoft buys OpenAI) is no better: the rights to your data pass to Microsoft. To mitigate: (1) negotiate a “termination clause” that specifies that in case of M&A, the data must be purged or returned; (2) request “data escrow” (a neutral third party holds backups); (3) IP insurance that covers this scenario (extremely rare, but it exists). Unfortunately, none of the three providers (OpenAI, Anthropic, Google) accepts escrow terms today.
Conclusion
Data Licensing Agreements with LLM providers represent a marginal but strategically significant economic opportunity for Italian publishers in 2026. Direct compensation (€5k–€75k annually) will not transform publishing business models, but positioning as a primary source in AI outputs—through a combination of licensing + citability + Answer Engine Optimization—can stabilize synthetic traffic and organic visibility in an ecosystem where AI Overviews and permanent Zero-Click searches are increasingly eroding traditional web traffic.
Operational implementation requires three sequential steps: (1) internal dataset audit to understand volume, quality, and GDPR compliance; (2) simultaneous negotiation with OpenAI (PPMT), Anthropic (minimum fee + audit rights), and Google (citability + revenue share); (3) integration of EU AI Act compliance (documentation, monitoring, audit trails) to protect the publisher from future regulatory risk.
Publishers who postpone this decision until December 2026 (post-deadline EU AI Act compliance) will find themselves negotiating from a position of weakness: providers will have already frozen their training architecture, making it more difficult to extract economic concessions. The strategic window is August-October 2026.
For in-depth information on structural citability and how to position yourself in AI outputs, please refer to our articles on Featured Snippet Optimization in the AI Era and LLM Crawlbot Management 2026, which provide the technical framework to maximize data value beyond simple commercial licensing.





