Another AI Training Data Theft? BUT This Time It’s Anthropic, NOT OpenAI
Reddit has sued Anthropic for copying millions of posts, joining a long list of plaintiffs that includes authors, newspapers, artists, coders, and music companies.
Today's highlights:
You are reading the 100th edition of “The Responsible AI Digest by SoRAI” (School of Responsible AI). Subscribe today for regular updates!
At the School of Responsible AI (SoRAI), we empower individuals and organizations to become AI-literate through comprehensive, practical, and engaging programs. For individuals, we offer specialized training such as AI Governance certifications (AIGP, RAI) and an immersive AI Literacy Specialization. This specialization teaches AI using a scientific framework structured around four levels of cognitive skills. Our first course is now live and focuses on the foundational cognitive skills of Remembering and Understanding. Want to learn more? Explore all courses: [Link] Write to us for customized enterprise training: [Link]
🔦 Today's Spotlight
🔟 Key AI Training Data Lawsuits (as of June 2025)
As Generative AI systems become increasingly dependent on massive datasets, a surge of lawsuits has emerged globally to challenge how AI companies acquire and use such data. At the heart of these legal disputes lies the unauthorized scraping and use of copyrighted or user-generated content by AI firms to train large language models and other generative tools. This summary outlines ten of the most significant lawsuits filed as of June 2025, showcasing a growing confrontation between content creators, platforms, and AI developers.
1. Reddit vs Anthropic (2025, USA) – Social Media Platform vs AI Startup
Reddit sued Anthropic in California, alleging its content was scraped through more than 100,000 automated bot sessions despite explicit restrictions in Reddit's terms of service. The platform claims breach of contract, unfair competition, and even trespass to chattels, citing ignored robots.txt directives and strain on its servers. Having already licensed its data to OpenAI and Google, Reddit seeks damages and an injunction, marking a strong pushback against unauthorized data use by AI firms.
2. New York Times vs OpenAI & Microsoft (2023–2025, USA)
The Times filed suit alleging that OpenAI used millions of its articles without permission, and that ChatGPT produces summaries or excerpts resembling its work. A judge allowed key copyright claims like “inducement of infringement” to proceed, even as some others were dismissed. This ongoing case highlights media companies’ efforts to protect journalistic content from unlicensed AI training.
3. Authors Guild (Grisham, Martin, et al.) vs OpenAI (2023–2025, USA)
Seventeen high-profile authors accused OpenAI of using their novels for training without consent. The case includes copyright infringement and claims of mimicking their distinctive writing styles. Microsoft was later added due to its partnership with OpenAI. This class-action lawsuit is a major effort to defend literary rights against large-scale ingestion by generative AI systems.
4. Sarah Silverman & Authors vs Meta (2023–2025, USA)
Silverman and other writers sued Meta for training LLaMA on pirated versions of their books found in datasets like The Pile. They argue Meta scraped unauthorized copies from shadow libraries, violating their copyrights. A judge allowed parts of the lawsuit to continue in 2024, making it one of the most visible challenges to Meta’s AI training methods.
5. Mona Awad & Paul Tremblay vs OpenAI (2023–2025, USA)
Among the earliest suits by authors, this California case alleged OpenAI used entire novels without permission, as evidenced by ChatGPT’s ability to summarize their plots in detail. The case was later merged with other author actions and continues under the same legal counsel as the Silverman suit, laying early legal groundwork for literary copyright in AI.
6. Getty Images vs Stability AI (2023–2025, UK & USA)
Getty sued Stability AI for using 12 million of its copyrighted images to train Stable Diffusion without licensing. Generated outputs even retained distorted versions of Getty’s watermarks, raising both copyright and trademark concerns. Getty’s lawsuits in both the U.S. and UK remain active, with a major UK trial expected in mid-2025.
7. Artists (Andersen, et al.) vs Stability AI & Others (2023–2025, USA)
Visual artists filed a class action against Stability AI, Midjourney, and DeviantArt, claiming their copyrighted artwork was used to train AI image generators. They argue that the resulting outputs mimic their styles without permission, harming their livelihoods. This ongoing case explores whether artistic style can be protected under copyright in the AI context.
8. Open-Source Coders vs GitHub Copilot (2022–2024, USA)
Developers sued GitHub, Microsoft, and OpenAI over Copilot reproducing licensed code snippets verbatim without attribution. Filed in California, the suit challenges whether open-source licenses are violated when AI trains on public repositories. Some claims have survived motions to dismiss, making this a landmark test of AI’s legality in software development.
9. ACLU vs Clearview AI & Other Illinois BIPA Cases (2020–2023, USA)
Facial recognition firm Clearview AI was sued for scraping billions of facial images from social media without consent. Under Illinois’ Biometric Information Privacy Act (BIPA), the ACLU secured a settlement restricting Clearview’s business practices. The case catalyzed wider scrutiny of biometric data use in AI, with further regulatory actions across Europe.
10. Music Publishers (UMG, ABKCO, Concord) vs Anthropic (2023–2025, USA)
Major music publishers sued Anthropic for allegedly training Claude on the lyrics of more than 500 popular songs without licenses. Though Anthropic avoided a preliminary injunction in March 2025, the core copyright claims are heading to trial. This is among the first major lawsuits from the music industry over generative AI’s use of lyrical content.
These legal battles underscore a critical juncture in AI development, where questions of consent, ownership, and fair use clash with innovation and machine-learning practices. While some cases have progressed to trial and others remain pending, the overarching theme is clear: the era of “free-for-all” data scraping is facing increasing legal resistance. The outcomes of these lawsuits will likely shape the regulatory and ethical boundaries of AI for years to come.
🚀 AI Breakthroughs
Bing Video Creator Debuts to Democratize AI Video Generation for All Users
• Bing Video Creator, powered by Sora, transforms text prompts into short videos for free, allowing users to bring their ideas to life through video on the Bing Mobile App and soon on desktop
• Microsoft introduces Bing Video Creator to democratize AI video generation, providing accessible creativity tools and the ability to queue up to three video generations at once, with easy sharing via email or social media
• Designed with Responsible AI principles, Bing Video Creator uses safeguards to prevent misuse and attaches content credentials based on the C2PA standard for provenance.
NVIDIA Launches Llama Nemotron Nano VL for Advanced Intelligent Document Processing
• Llama Nemotron Nano VL, a new vision language model by NVIDIA, excels in extracting insights from complex documents like PDFs and charts, enhancing enterprise data processing.
• The model demonstrates industry-leading accuracy in tasks such as text recognition, table parsing, and diagram interpretation, as shown by its performance on the OCRBench v2 benchmark.
• Llama Nemotron Nano VL specializes in automating document workflows across the financial, healthcare, legal, and technical sectors; a hedged usage sketch follows below.
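As a rough illustration of how a document-focused vision-language model like this might be queried, the sketch below sends a scanned page plus an extraction prompt through an OpenAI-compatible chat endpoint. NVIDIA's hosted NIM endpoints follow this API shape, but the base URL and model ID here are assumptions, not confirmed values.

```python
# Hypothetical sketch: querying a vision-language model behind an
# OpenAI-compatible endpoint. URL and model ID below are assumptions.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NIM endpoint
    api_key="NVIDIA_API_KEY",                        # placeholder credential
)

# Encode a scanned invoice page as a base64 data URL.
with open("invoice_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/llama-nemotron-nano-vl",  # hypothetical model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the vendor name, invoice date, and line-item "
                     "table from this document as JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```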
Mistral Code Enables Enterprise Developers to Integrate Secure, Advanced AI Coding Seamlessly
• Mistral Code offers enterprise software teams a secure AI-powered coding assistant, enhancing productivity with local deployment, in-IDE support, and strong enterprise tooling.
• Building on the open-source Continue, Mistral Code provides solutions for enterprise challenges with a single integrated stack and robust controls for proprietary repositories.
• Global enterprises like Abanca, SNCF, and Capgemini adopt Mistral Code, benefiting from multiview editing, chat, and granular controls suitable for regulated industries.
NotebookLM Expands Sharing Capabilities with Public Link Feature for Diverse Uses
• NotebookLM introduces public sharing via single links, enabling users to share detailed notebooks on nonprofit projects, business manuals, or class study guides more easily than before
• Users can easily create public notebooks by selecting "Share" and setting access to "Anyone with a link," allowing broader interaction without compromising content integrity
• Public notebooks offer interactive features such as questions and generated content like audio overviews and FAQs, enhancing exploration and understanding within the NotebookLM community.
Meta Shares More About the Technology Inside Aria Gen 2
• Aria Gen 2 enhances computer vision capabilities with double the number of cameras of Gen 1, enabling a wider field of view and better 3D tracking for advanced applications
• New sensors onboard Aria Gen 2, including an ambient light sensor and contact microphone, offer improved audio capture and environmental context recognition, enhancing versatility in diverse settings
• Aria Gen 2 delivers precise time alignment across devices with Sub-GHz radio technology, allowing synchronized multi-device operations with sub-millisecond accuracy, a significant innovation over previous models.
Cursor AI Expands Capabilities with BugBot, Memories, and MCP in Major Update
• Cursor 1.0, the AI-enabled coding platform by Anysphere, features 'BugBot' for automatic code reviews, allowing developers to address issues using pre-filled prompts
• Memory capabilities, named 'Memories', let Cursor 1.0 retain and reference conversational facts, currently available as a beta feature for users
• Users gain enhanced coding flexibility with Jupyter Notebook integration, allowing direct editing with Claude’s Sonnet models, alongside quick setup of Model Context Protocol (MCP) servers; a configuration sketch follows below.
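For flavor, here is a minimal sketch of registering a project-level MCP server. The `.cursor/mcp.json` path and `mcpServers` schema mirror Cursor's documented format at the time of writing, and the server name and package are hypothetical; check the current docs before relying on either.

```python
# Minimal sketch: write a project-level MCP server config for Cursor.
# Path and schema are assumptions based on Cursor's docs; the server
# name and npm package below are invented for illustration.
import json
from pathlib import Path

config = {
    "mcpServers": {
        "docs-search": {                                 # hypothetical name
            "command": "npx",
            "args": ["-y", "@example/mcp-docs-server"],  # hypothetical package
            "env": {"DOCS_API_KEY": "..."},              # placeholder secret
        }
    }
}

path = Path(".cursor/mcp.json")
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(config, indent=2))
print(f"Wrote MCP config to {path}")
```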
FDA Launches AI Tool Elsa to Boost Efficiency and Transform Operations
• The FDA has introduced Elsa, an AI tool that enhances employee efficiency across various roles, from scientific reviewers to investigators, to better serve the American populace
• Operating within a high-security GovCloud environment, Elsa safeguards sensitive data, ensuring the integrity of FDA-regulated information without training models on data from the regulated industry
• Initial uses of Elsa include accelerated clinical protocol reviews and faster label comparisons, optimizing processes and supporting the FDA's mission of improving operational efficiency and safety assessments.
OpenAI Enhances ChatGPT with New Business Features, Expands Enterprise Functionality and Flexibility
• OpenAI expands ChatGPT's business capabilities with features like Connectors for integrating internal systems, record mode for meeting content, and custom connector support
• New product updates allow ChatGPT to analyze enterprise data, generate visualizations, and interact with systems like GitHub, Gmail, and SharePoint while respecting user permissions
• Flexible pricing structures and enterprise credits enhance access to advanced models, aiming to extend ChatGPT's role from conversational tool to comprehensive business assistant.
OpenAI Introduces Optional Internet Access to Codex for Enhanced Workflow Flexibility
• OpenAI has introduced optional internet access for Codex, enabling package installation, API access, and dependency upgrades, currently for Plus, Pro, and Team users, with Enterprise access imminent;
• Users can control which domains and HTTP methods Codex may access, a measure to mitigate the risk of unintended requests, with detailed usage guidance provided on X;
• Codex’s internet access was a top-requested feature, overcoming the limitations of sandboxed environments, and launched alongside updates including voice dictation and improved GitHub integration. A conceptual sketch of this kind of allowlisting appears below.
Manus AI Launches Text-to-Video Platform for Scripted, Animated Video Creation
• Manus AI Agent debuts a text-to-video service that converts text prompts into structured, sequenced video stories, automating the entire process from scene planning to animating content;
• Currently offered in early access to select subscribers, Manus challenges established rivals such as OpenAI’s Sora and Google’s video models, part of a growing trend of subscription-based video generation services;
• Manus, in partnership with Microsoft’s Azure AI Foundry, uses a multi-agent system to autonomously execute complex tasks, distinguishing itself in the competitive AI agent landscape amid rising US-China tensions.
Meta Enhances AI Tools to Fully Automate Advertising Campaign Creation by 2026
• Meta is intensifying its reliance on AI to transform digital advertising, aiming to enable brands to create comprehensive ad campaigns from scratch using AI by next year
• The new AI tools would generate visuals, videos, and ad copy from a product image and budget goal, targeting users based on factors like geolocation
• Small and medium-sized businesses are poised to benefit from these AI-driven tools, though concerns remain about AI's ability to match human-crafted ad polish.
Google CEO Highlights Growth of Vibe Coding with AI Tools Like Replit
• Google CEO Sundar Pichai highlights "vibe coding" as a growing trend, citing tools like Cursor and Replit for making casual coding more accessible and experimental;
• During interviews, Pichai emphasized the transformative power AI tools provide to developers, comparing current web development capabilities to those from 25 years ago;
• Google reports that AI assistance now contributes to over 30% of code written at the company, showcasing the impact of AI in enhancing software development efficiency and creativity.
AI Drives Wage and Skill Transformation as Industry Adapts to New Roles by 2025
• PwC's 2025 Global AI Jobs Barometer highlights AI's role in nearly tripling revenue growth per worker in AI-centric industries since 2022
• Every industry observed has ramped up AI usage, including traditionally unexpected sectors such as mining and agriculture, suggesting a universal AI adoption trend
• Workers with AI skills enjoy a 56% wage premium compared to peers without those skills, marking a significant increase from previous years.
⚖️ AI Ethics
Reddit Sues Anthropic Alleging Unlicensed Use of Data for AI Model Training
• Reddit has filed a lawsuit against Anthropic in Northern California, claiming the AI company used its data without proper authorization for model training, violating user agreements
• Reddit’s legal action marks it as the first Big Tech firm to challenge an AI model provider over training data practices, alongside publishers who sued tech firms on similar issues
• OpenAI CEO Sam Altman’s position as Reddit’s third-largest shareholder adds a notable layer to the case, while Reddit maintains data-licensing agreements with other AI providers under specific conditions.
Apple Set to Reveal Limited Apple Intelligence Features at WWDC25 Amid Upgrades
• Apple plans to introduce a new API for developers to directly access its Apple foundation models, offering an affordable on-device solution for adding AI features to apps
• New AI capabilities in Shortcuts will allow users to create automations using natural language, aiming to simplify the process for both newcomers and seasoned users
• The Apple foundation language model will be improved, with new versions tested across various parameter sizes, enhancing features like notifications and writing tools.
India Launches Four Responsible AI Solutions to Promote Ethical Development
• The Indian government plans to launch four responsible AI solutions under the IndiaAI Mission, focused on bias mitigation and fairness, between September and December via the AIKosh platform
• Over 400 AI tools, including deepfake detection and watermarking AI-generated content, are under evaluation, with plans to release them as open-source resources on AIKosh
• As part of a broader initiative, India will select 30 AI application projects spanning key sectors like healthcare and agriculture to promote real-world AI use cases.
West Bengal Employs AI to Detect Fake College Applications in Admission Portal
• The West Bengal Higher Education Department is deploying AI to detect and eliminate fake undergraduate applications via its centralised admission portal, enhancing process efficiency and integrity;
• The AI system will analyze ID proof, names, mobile numbers, and photos to identify fraudulent submissions, inspired by past incidents involving celebrity names like Sunny Leone on merit lists;
• College officials support the AI's integration, highlighting its potential to streamline admissions, reduce admin tasks, and limit fraud, which previously involved cybercafes generating fake applications.
Yashoda AI Launches in India to Empower Women in AI Literacy and Safety
• Yashoda AI, a pioneering initiative by the National Commission for Women (NCW), aims to empower Indian women with essential AI literacy, cybersecurity skills, and digital safety knowledge for inclusive digital empowerment
• Focused on rural and semi-urban communities, Yashoda AI fosters digital literacy and safety by engaging students, educators, and women from the police force in transformative community-driven education
• Aligned with India's AI leadership vision, the initiative emphasizes women's leadership in digital spaces and echoes national commitments to inclusive, responsible technology fostering a tech-savvy future.
OpenAI Identifies Significant Cyber Threat Misuses of ChatGPT Stemming from China
• OpenAI's latest report identifies China as a likely origin for a significant number of cyber threats and influence attempts using ChatGPT, highlighting global security concerns.
• The company's efforts to disrupt malicious AI usage found ChatGPT was misused for covert influence operations, with four out of ten sample cases reportedly linked to China.
• OpenAI underscores the need for common-sense AI regulations to prevent misuse by authoritarian regimes, aiming to protect global users from potential cyber threats and influence operations.
🎓 AI Academia
Adoption of Watermarking Measures for AI-Generated Content and Implications under the EU AI Act
• A recent study highlights the current lack of watermarking and labeling practices among AI image generators, with only 38% using watermarking and 8% labeling deepfakes;
• The 2024 EU AI Act requires the implementation of watermarks in AI-generated content, with non-compliance resulting in fines of up to 15 million Euros or 3% of annual turnover;
• Experts suggest that overcoming current implementation challenges is crucial for aligning AI creation with societal interests, potentially reducing the misinformation risks associated with AI-generated content; a toy watermarking sketch follows below.
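As a toy illustration of what "watermarking" means mechanically, the sketch below hides a bit string in the least-significant bits of an image's red channel and reads it back. Production schemes (including C2PA content credentials, which are signed metadata rather than pixel watermarks) are far more robust; this only makes the embed/extract idea concrete.

```python
# Toy invisible watermark: embed a bit string in the least-significant
# bit of the red channel. Illustrative only, not a production scheme.
from PIL import Image

def embed(img: Image.Image, bits: str) -> Image.Image:
    out = img.convert("RGB").copy()
    px = out.load()
    for i, bit in enumerate(bits):
        x, y = i % out.width, i // out.width
        r, g, b = px[x, y]
        px[x, y] = ((r & ~1) | int(bit), g, b)  # overwrite LSB of red
    return out

def extract(img: Image.Image, n: int) -> str:
    px = img.convert("RGB").load()
    return "".join(str(px[i % img.width, i // img.width][0] & 1)
                   for i in range(n))

marked = embed(Image.new("RGB", (64, 64), "white"), "10110010")
assert extract(marked, 8) == "10110010"  # watermark survives round-trip
```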
Generalized Auditing Framework Reveals Hidden Bias in Medical AI Datasets Across Modalities
• A new framework called G-AUDIT has been developed to detect and address dataset bias in medical AI, functioning across different modalities and data types;
• G-AUDIT evaluates the relationship between labels and data attributes, such as patient demographics and clinical environments, to identify risks of shortcut learning in AI models;
• Applied to various healthcare datasets, G-AUDIT effectively surfaces subtle biases, advancing the reliability and integrity of AI applications in settings like imaging and electronic health records; a simplified audit sketch follows below.
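The sketch below captures the spirit of this kind of audit, though it is not G-AUDIT's actual code: if non-causal metadata alone predicts the labels well, the dataset invites shortcut learning. The scanner/site scenario and all numbers are invented.

```python
# Shortcut-risk audit sketch (not G-AUDIT itself): measure how well
# metadata alone predicts labels. High accuracy is a red flag.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Hypothetical metadata: which scanner produced each image (3 sites).
scanner = rng.integers(0, 3, size=1000).reshape(-1, 1)
# Labels leak through the scanner: site 2 mostly sees positive cases.
label = (scanner.ravel() == 2).astype(int)
label ^= rng.random(1000) < 0.1  # flip 10% as noise

X = OneHotEncoder().fit_transform(scanner)
score = cross_val_score(LogisticRegression(), X, label, cv=5).mean()
print(f"Label predictable from scanner alone: {score:.0%}")  # ~90% -> risk
```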
New Research Reveals Stealthy Data Poisoning Threats to Language Model Integrity
• Researchers have developed PoisonedParrot, a data poisoning attack that causes large language models (LLMs) to generate copyrighted content, even without direct training on the specific materials
• PoisonedParrot is simple yet stealthy, and current defenses prove largely ineffective against it, underscoring LLMs’ vulnerability to subtle data poisoning
• Researchers propose a potential defense named ParrotTrap to counteract this new threat, inviting the community to delve deeper into this evolving security concern; a schematic of the fragment-injection idea follows below.
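Reading the attack description at a high level, the core trick is to splice small fragments of a protected text into otherwise benign training samples, so that no single sample contains the work yet a model can later reassemble it. The sketch below is a schematic of that data-construction step under that reading; the texts are toy examples and this is not the paper's code.

```python
# Schematic of fragment-injection poisoning (toy reading of the idea):
# scatter short n-grams of a protected text across benign samples.
import random

random.seed(0)
protected = "the quick brown fox jumps over the lazy dog".split()
benign_pool = ["models learn patterns from data", "training corpora are large"]

def poison(sample: str, n: int = 3) -> str:
    start = random.randrange(len(protected) - n + 1)
    fragment = " ".join(protected[start:start + n])  # one n-gram of the target
    words = sample.split()
    cut = random.randrange(len(words) + 1)
    return " ".join(words[:cut] + [fragment] + words[cut:])

poisoned_corpus = [poison(s) for s in benign_pool]
print(poisoned_corpus)  # benign-looking samples carrying target fragments
```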
Survey Explores Large Language Models Shifting from Fast Decisions to Analytical Reasoning
• A recent survey delves into the advancement of reasoning Large Language Models, highlighting their evolution from System 1 to more complex System 2 reasoning capabilities
• Cutting-edge models like OpenAI’s o1/o3 and DeepSeek’s R1 are showcased for their expert-level performances, closely mimicking human-like analytical processes in fields such as mathematics and coding
• The survey underscores the importance of combining foundational and reasoning LLMs to address biases and improve accuracy, offering a comprehensive comparison of LLM reasoning benchmarks.
Research Reveals How Data Contamination Affects Performance in Large Language Models
• A new paper examines data contamination's impact on Large Language Models (LLMs), highlighting that overlaps between training and test datasets can artificially boost performance evaluations
• The survey categorizes contamination detection methods into White-Box, Gray-Box, and Black-Box approaches and emphasizes the need for more rigorous evaluation protocols for LLMs
• Strategies for contamination-free evaluation include data updating, data rewriting, and prevention methods, with a focus on dynamic benchmarks and LLM-driven evaluation techniques; a minimal overlap check is sketched below.
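A minimal white-box check along these lines flags test examples whose n-grams overlap heavily with the training corpus. Real detection protocols are more sophisticated; the sketch below only shows the core overlap idea, with toy data.

```python
# Minimal white-box contamination check: fraction of a test example's
# n-grams that also appear in the training corpus.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(test_ex: str, train_corpus: list[str],
                        n: int = 8) -> float:
    test_ngrams = ngrams(test_ex, n)
    if not test_ngrams:
        return 0.0
    train_ngrams = set().union(*(ngrams(doc, n) for doc in train_corpus))
    return len(test_ngrams & train_ngrams) / len(test_ngrams)

train = ["the model was trained on a large corpus of web text scraped in 2023"]
test = "trained on a large corpus of web text scraped in 2023 and filtered"
print(f"overlap: {contamination_score(test, train):.0%}")  # high -> suspect
```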
Comprehensive Review Highlights Large Language Models' Impact on Text-to-SQL Conversion
• A recent survey provides a comprehensive review of how Large Language Models are utilized for Text-to-SQL tasks, highlighting methods like prompt engineering and fine-tuning
• The survey examines classic benchmarks and evaluation metrics, alongside offering practical insights and a detailed taxonomy of LLM-based Text2SQL methods
• Future directions and challenges in the evolving Text-to-SQL field are discussed, emphasizing the role of LLMs in bridging users with relational databases; a prompt-construction sketch follows below.
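A bare-bones version of the prompt-engineering approach the survey covers serializes the schema into the prompt and asks a model for SQL. The sketch below follows the OpenAI chat API shape; the model name is a placeholder and the schema is invented.

```python
# Bare-bones prompt-engineered Text-to-SQL: schema + question -> SQL.
# Model name is a placeholder; any capable chat model would do.
from openai import OpenAI

SCHEMA = """
CREATE TABLE orders (id INT, customer_id INT, total DECIMAL, placed_at DATE);
CREATE TABLE customers (id INT, name TEXT, country TEXT);
"""

def text_to_sql(question: str) -> str:
    prompt = (
        "Given this SQLite schema:\n"
        f"{SCHEMA}\n"
        f"Write one SQL query answering: {question}\n"
        "Return only SQL, no explanation."
    )
    client = OpenAI()  # assumes OPENAI_API_KEY in the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(text_to_sql("Total revenue per country in 2024, highest first"))
```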
New Benchmark Evaluates Uncertainty in Large Language Models Using Multiple Choice Questions
• UBENCH, a new benchmark for assessing uncertainty in large language models (LLMs), utilizes confidence intervals and features 11,978 multiple-choice questions encompassing various capabilities.
• Extensive experiments demonstrate that confidence interval-based methods effectively quantify uncertainty, with open-source models often rivaling closed-source models in performance.
• Chain-of-Thought and role-playing prompts show promise for enhancing model reliability, although temperature changes show no consistent effect; a calibration-scoring sketch follows below.
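Once a benchmark elicits per-question confidences, scoring them comes down to calibration. The sketch below computes a standard expected calibration error over (confidence, correctness) pairs; this is a generic metric, not UBENCH's specific protocol, and the toy data is invented.

```python
# Generic expected calibration error (ECE): bin elicited confidences and
# compare each bin's mean confidence with its empirical accuracy.
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight gap by bin occupancy
    return ece

# Toy data: a model that says "90% sure" but is right only ~70% of the time.
rng = np.random.default_rng(1)
conf = np.full(1000, 0.9)
correct = (rng.random(1000) < 0.7).astype(float)
print(f"ECE: {expected_calibration_error(conf, correct):.3f}")  # ~0.2
```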
Trust, Risk, and Security in Agentic AI: A Detailed Examination of TRiSM
• A recent review of TRiSM (Trust, Risk, and Security Management) in LLM-based agentic multi-agent systems highlights new architectural paradigms redefining autonomy, collaboration, and decision-making in enterprise and societal contexts
• The study presents a risk taxonomy and unique threat vectors for agentic AI applications, showcasing real-world vulnerabilities through case studies in comprehensive risk management
• Key insights into trust-building, transparency, and oversight techniques are provided, alongside state-of-the-art security, explainability strategies, and future research directions for AI system alignment with TRiSM principles.
About SoRAI: SoRAI is committed to advancing AI literacy through practical, accessible, and high-quality education. Our programs emphasize responsible AI use, equipping learners with the skills to anticipate and mitigate risks effectively. Our flagship AIGP certification courses, built on real-world experience, drive AI governance education with innovative, human-centric approaches, laying the foundation for quantifying AI governance literacy. Subscribe to our free newsletter to stay ahead of the AI Governance curve.