Another AI Training Data Theft? BUT This Time It’s Anthropic, NOT OpenAI
Reddit has sued Anthropic for copying millions of posts, joining a long list of plaintiffs that includes authors, newspapers, artists, coders, and music companies.
Today's highlights:
You are reading the 100th edition of “The Responsible AI Digest by SoRAI” (School of Responsible AI). Subscribe today for regular updates!
At the School of Responsible AI (SoRAI), we empower individuals and organizations to become AI-literate through comprehensive, practical, and engaging programs. For individuals, we offer specialized training such as AI Governance certifications (AIGP, RAI) and an immersive AI Literacy Specialization. This specialization teaches AI using a scientific framework structured around four levels of cognitive skills. Our first course is now live and focuses on the foundational cognitive skills of Remembering and Understanding. Want to learn more? Explore all courses: [Link] Write to us for customized enterprise training: [Link]
🔦 Today's Spotlight
🔟 Key AI Training Data Lawsuits (as of June 2025)
As Generative AI systems become increasingly dependent on massive datasets, a surge of lawsuits has emerged globally to challenge how AI companies acquire and use such data. At the heart of these legal disputes lies the unauthorized scraping and use of copyrighted or user-generated content by AI firms to train large language models and other generative tools. This summary outlines ten of the most significant lawsuits filed as of June 2025, showcasing a growing confrontation between content creators, platforms, and AI developers.
1. Reddit vs Anthropic (2025, USA) – Social Media Platform vs AI Startup
Reddit sued Anthropic in California, alleging its content was scraped through more than 100,000 automated bot sessions despite explicit restrictions in Reddit's terms of service. The platform claims breach of contract, unfair competition, and even trespass to chattels, citing ignored robots.txt directives and strain on its servers. Having already licensed its data to OpenAI and Google, Reddit seeks damages and an injunction, marking a strong pushback against unauthorized data use by AI firms.
2. New York Times vs OpenAI & Microsoft (2023–2025, USA)
The Times filed suit alleging that OpenAI used millions of its articles without permission, and that ChatGPT produces summaries or excerpts resembling its work. A judge allowed key copyright claims like “inducement of infringement” to proceed, even as some others were dismissed. This ongoing case highlights media companies’ efforts to protect journalistic content from unlicensed AI training.
3. Authors Guild (Grisham, Martin, et al.) vs OpenAI (2023–2025, USA)
Seventeen high-profile authors accused OpenAI of using their novels for training without consent. The case includes copyright infringement and claims of mimicking their distinctive writing styles. Microsoft was later added due to its partnership with OpenAI. This class-action lawsuit is a major effort to defend literary rights against large-scale ingestion by generative AI systems.
4. Sarah Silverman & Authors vs Meta (2023–2025, USA)
Silverman and other writers sued Meta for training LLaMA on pirated versions of their books found in datasets like The Pile. They argue Meta scraped unauthorized copies from shadow libraries, violating their copyrights. A judge allowed parts of the lawsuit to continue in 2024, making it one of the most visible challenges to Meta’s AI training methods.
5. Mona Awad & Paul Tremblay vs OpenAI (2023–2025, USA)
Among the earliest suits by authors, this California case alleged OpenAI used entire novels without permission, as evidenced by ChatGPT’s ability to summarize their plots in detail. The case was later merged with other author actions and continues under the same legal counsel as the Silverman suit, laying early legal groundwork for literary copyright in AI.
6. Getty Images vs Stability AI (2023–2025, UK & USA)
Getty sued Stability AI for using 12 million of its copyrighted images to train Stable Diffusion without licensing. Generated outputs even retained distorted versions of Getty’s watermarks, raising both copyright and trademark concerns. Getty’s lawsuits in both the U.S. and UK remain active, with a major UK trial expected in mid-2025.
7. Artists (Andersen, et al.) vs Stability AI & Others (2023–2025, USA)
Visual artists filed a class action against Stability AI, Midjourney, and DeviantArt, claiming their copyrighted artwork was used to train AI image generators. They argue that the resulting outputs mimic their styles without permission, harming their livelihoods. This ongoing case explores whether artistic style can be protected under copyright in the AI context.
8. Open-Source Coders vs GitHub Copilot (2022–2024, USA)
Developers sued GitHub, Microsoft, and OpenAI over Copilot reproducing licensed code snippets verbatim without attribution. Filed in California, the suit challenges whether open-source licenses are violated when AI trains on public repositories. Some claims have survived motions to dismiss, making this a landmark test of AI’s legality in software development.
9. ACLU vs Clearview AI & Other Illinois BIPA Cases (2020–2023, USA)
Facial recognition firm Clearview AI was sued for scraping billions of facial images from social media without consent. Under Illinois’ Biometric Information Privacy Act (BIPA), the ACLU secured a settlement restricting Clearview’s business practices. The case catalyzed wider scrutiny of biometric data use in AI, with further regulatory actions across Europe.
10. Music Publishers (UMG, ABKCO, Concord) vs Anthropic (2023–2025, USA)
Major music publishers sued Anthropic for allegedly training Claude on the lyrics of more than 500 popular songs without licenses. Though Anthropic avoided a preliminary injunction in March 2025, the core copyright claims are heading to trial. This is among the first major lawsuits from the music industry over generative AI’s use of lyrical content.
These legal battles underscore a critical juncture in AI development, where questions of consent, ownership, and fair use clash with innovation and machine-learning practices. While some cases have progressed to trial and others remain pending, the overarching theme is clear: the era of “free-for-all” data scraping is facing increasing legal resistance. The outcomes of these lawsuits will likely shape the regulatory and ethical boundaries of AI for years to come.
🚀 AI Breakthroughs
Bing Video Creator Debuts to Democratize AI Video Generation for All Users
• Bing Video Creator, powered by Sora, transforms text prompts into short videos for free, allowing users to bring their ideas to life through video on the Bing Mobile App and soon on desktop
• Microsoft introduces Bing Video Creator to democratize AI video generation, providing accessible creativity tools and the ability to queue up to three video generations at once, with easy sharing via email or social media
• Designed with Responsible AI principles, Bing Video Creator uses safeguards to prevent misuse and attaches content credentials based on the C2PA standard for provenance.
NVIDIA Launches Llama Nemotron Nano VL for Advanced Intelligent Document Processing
• Llama Nemotron Nano VL, a new vision language model by NVIDIA, excels in extracting insights from complex documents like PDFs and charts, enhancing enterprise data processing.
• The model demonstrates industry-leading accuracy in tasks such as text recognition, table parsing, and diagram interpretation, as shown by its performance on the OCRBench v2 benchmark.
• Llama Nemotron Nano VL specializes in automating document workflows across the financial, healthcare, legal, and technical sectors; a hedged usage sketch follows below.
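As a rough illustration of how a document-focused vision-language model like this might be queried, the sketch below sends a scanned page plus an extraction prompt through an OpenAI-compatible chat endpoint. NVIDIA's hosted NIM endpoints follow this API shape, but the base URL and model ID here are assumptions, not confirmed values.

```python
# Hypothetical sketch: querying a vision-language model behind an
# OpenAI-compatible endpoint. URL and model ID below are assumptions.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NIM endpoint
    api_key="NVIDIA_API_KEY",                        # placeholder credential
)

# Encode a scanned invoice page as a base64 data URL.
with open("invoice_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/llama-nemotron-nano-vl",  # hypothetical model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the vendor name, invoice date, and line-item "
                     "table from this document as JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```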
Mistral Code Enables Enterprise Developers to Integrate Secure, Advanced AI Coding Seamlessly
• Mistral Code offers enterprise software teams a secure AI-powered coding assistant, enhancing productivity with local deployment, in-IDE support, and strong enterprise tooling.
• Building on the open-source Continue, Mistral Code provides solutions for enterprise challenges with a single integrated stack and robust controls for proprietary repositories.
• Global enterprises like Abanca, SNCF, and Capgemini adopt Mistral Code, benefiting from multiview editing, chat, and granular controls suitable for regulated industries.
NotebookLM Expands Sharing Capabilities with Public Link Feature for Diverse Uses
• NotebookLM introduces public sharing via single links, enabling users to share detailed notebooks on nonprofit projects, business manuals, or class study guides more easily than before
• Users can easily create public notebooks by selecting "Share" and setting access to "Anyone with a link," allowing broader interaction without compromising content integrity
• Public notebooks offer interactive features such as questions and generated content like audio overviews and FAQs, enhancing exploration and understanding within the NotebookLM community.
Meta Shares More About the Technology Inside Aria Gen 2
• Aria Gen 2 enhances computer vision capabilities with double the number of cameras of Gen 1, enabling a wider field of view and better 3D tracking for advanced applications
• New sensors onboard Aria Gen 2, including an ambient light sensor and contact microphone, offer improved audio capture and environmental context recognition, enhancing versatility in diverse settings
• Aria Gen 2 delivers precise time alignment across devices with Sub-GHz radio technology, allowing synchronized multi-device operations with sub-millisecond accuracy, a significant innovation over previous models.
Cursor AI Expands Capabilities with BugBot, Memories, and MCP in Major Update
• Cursor 1.0, the AI-enabled coding platform by Anysphere, features 'BugBot' for automatic code reviews, allowing developers to address issues using pre-filled prompts
• Memory capabilities, named 'Memories', let Cursor 1.0 retain and reference conversational facts, currently available as a beta feature for users
• Users gain enhanced coding flexibility with Jupyter Notebook integration, allowing direct editing with Claude’s Sonnet models, alongside quick setup of Model Context Protocol (MCP) servers; a configuration sketch follows below.
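For flavor, here is a minimal sketch of registering a project-level MCP server. The `.cursor/mcp.json` path and `mcpServers` schema mirror Cursor's documented format at the time of writing, and the server name and package are hypothetical; check the current docs before relying on either.

```python
# Minimal sketch: write a project-level MCP server config for Cursor.
# Path and schema are assumptions based on Cursor's docs; the server
# name and npm package below are invented for illustration.
import json
from pathlib import Path

config = {
    "mcpServers": {
        "docs-search": {                                 # hypothetical name
            "command": "npx",
            "args": ["-y", "@example/mcp-docs-server"],  # hypothetical package
            "env": {"DOCS_API_KEY": "..."},              # placeholder secret
        }
    }
}

path = Path(".cursor/mcp.json")
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(config, indent=2))
print(f"Wrote MCP config to {path}")
```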
FDA Launches AI Tool Elsa to Boost Efficiency and Transform Operations
• The FDA has introduced Elsa, an AI tool that enhances employee efficiency across various roles, from scientific reviewers to investigators, to better serve the American populace
• Operating within a high-security GovCloud environment, Elsa safeguards sensitive data, ensuring the integrity of FDA-regulated information without training models on data from the regulated industry
• Initial uses of Elsa include accelerated clinical protocol reviews and faster label comparisons, optimizing processes and supporting the FDA's mission of improving operational efficiency and safety assessments.
OpenAI Enhances ChatGPT with New Business Features, Expands Enterprise Functionality and Flexibility
• OpenAI expands ChatGPT's business capabilities with features like Connectors for integrating internal systems, record mode for meeting content, and custom connector support
• New product updates allow ChatGPT to analyze enterprise data, generate visualizations, and interact with systems like GitHub, Gmail, and SharePoint while respecting user permissions
• Flexible pricing structures and enterprise credits enhance access to advanced models, aiming to extend ChatGPT's role from conversational tool to comprehensive business assistant.
OpenAI Introduces Optional Internet Access to Codex for Enhanced Workflow Flexibility
• OpenAI has introduced optional internet access for Codex, enabling package installation, API access, and dependency upgrades, currently for Plus, Pro, and Team users, with Enterprise access imminent;
• Users can control which domains and HTTP methods Codex may access, a measure to mitigate the risk of unintended requests, with detailed usage guidance provided on X;
• Codex’s internet access was a top-requested feature, overcoming the limitations of sandboxed environments, and launched alongside updates including voice dictation and improved GitHub integration. A conceptual sketch of this kind of allowlisting appears below.
Manus AI Launches Text-to-Video Platform for Scripted, Animated Video Creation
• Manus AI Agent debuts a text-to-video service that converts text prompts into structured, sequenced video stories, automating the entire process from scene planning to animating content;
• Currently offered in early access to select subscribers, Manus challenges established rivals such as OpenAI’s Sora and Google’s video models, part of a growing trend of subscription-based video generation services;
• Manus, in partnership with Microsoft’s Azure AI Foundry, uses a multi-agent system to autonomously execute complex tasks, distinguishing itself in the competitive AI agent landscape amid rising US-China tensions.
Meta Enhances AI Tools to Fully Automate Advertising Campaign Creation by 2026
• Meta is intensifying its reliance on AI to transform digital advertising, aiming to enable brands to create comprehensive ad campaigns from scratch using AI by next year
• The new AI tools would generate visuals, videos, and ad copy from a product image and budget goal, targeting users based on factors like geolocation
• Small and medium-sized businesses are poised to benefit from these AI-driven tools, though concerns remain about AI's ability to match human-crafted ad polish.
Google CEO Highlights Growth of Vibe Coding with AI Tools Like Replit
• Google CEO Sundar Pichai highlights "vibe coding" as a growing trend, citing tools like Cursor and Replit for making casual coding more accessible and experimental;
• During interviews, Pichai emphasized the transformative power AI tools provide to developers, comparing current web development capabilities to those from 25 years ago;
• Google reports that AI assistance now contributes to over 30% of code written at the company, showcasing the impact of AI in enhancing software development efficiency and creativity.
AI Drives Wage and Skill Transformation as Industry Adapts to New Roles by 2025
• PwC's 2025 Global AI Jobs Barometer highlights AI's role in nearly tripling revenue growth per worker in AI-centric industries since 2022
• Every industry observed has ramped up AI usage, including traditionally unexpected sectors such as mining and agriculture, suggesting a universal AI adoption trend
• Workers with AI skills enjoy a 56% wage premium compared to peers without those skills, marking a significant increase from previous years.
⚖️ AI Ethics
Reddit Sues Anthropic Alleging Unlicensed Use of Data for AI Model Training
• Reddit has filed a lawsuit against Anthropic in Northern California, claiming the AI company used its data without proper authorization for model training, violating user agreements
• Reddit’s legal action marks it as the first Big Tech firm to challenge an AI model provider over training data practices, alongside publishers who sued tech firms on similar issues
• OpenAI CEO Sam Altman’s position as Reddit’s third-largest shareholder adds a notable layer to the case, while Reddit maintains data-licensing agreements with other AI providers under specific conditions.
Apple Set to Reveal Limited Apple Intelligence Features at WWDC25 Amid Upgrades
• Apple plans to introduce a new API for developers to directly access its Apple foundation models, offering an affordable on-device solution for adding AI features to apps
• New AI capabilities in Shortcuts will allow users to create automations using natural language, aiming to simplify the process for both newcomers and seasoned users
• The Apple foundation language model will be improved, with new versions tested across various parameter sizes, enhancing features like notifications and writing tools.
India Launches Four Responsible AI Solutions to Promote Ethical Development
• The Indian government plans to launch four responsible AI solutions under the IndiaAI Mission, focused on bias mitigation and fairness, between September and December via the AIKosh platform
• Over 400 AI tools, including deepfake detection and watermarking AI-generated content, are under evaluation, with plans to release them as open-source resources on AIKosh
• As part of a broader initiative, India will select 30 AI application projects spanning key sectors like healthcare and agriculture to promote real-world AI use cases.
West Bengal Employs AI to Detect Fake College Applications in Admission Portal
• The West Bengal Higher Education Department is deploying AI to detect and eliminate fake undergraduate applications via its centralised admission portal, enhancing process efficiency and integrity;
• The AI system will analyze ID proof, names, mobile numbers, and photos to identify fraudulent submissions, inspired by past incidents involving celebrity names like Sunny Leone on merit lists;
• College officials support the AI's integration, highlighting its potential to streamline admissions, reduce admin tasks, and limit fraud, which previously involved cybercafes generating fake applications.
Yashoda AI Launches in India to Empower Women in AI Literacy and Safety
• Yashoda AI, a pioneering initiative by the National Commission for Women (NCW), aims to empower Indian women with essential AI literacy, cybersecurity skills, and digital safety knowledge for inclusive digital empowerment
• Focused on rural and semi-urban communities, Yashoda AI fosters digital literacy and safety by engaging students, educators, and women from the police force in transformative community-driven education
• Aligned with India's AI leadership vision, the initiative emphasizes women's leadership in digital spaces and echoes national commitments to inclusive, responsible technology fostering a tech-savvy future.
OpenAI Identifies Significant Cyber Threat Misuses of ChatGPT Stemming from China
• OpenAI's latest report identifies China as a likely origin for a significant number of cyber threats and influence attempts using ChatGPT, highlighting global security concerns.
• The company's efforts to disrupt malicious AI usage found ChatGPT was misused for covert influence operations, with four out of ten sample cases reportedly linked to China.
• OpenAI underscores the need for common-sense AI regulations to prevent misuse by authoritarian regimes, aiming to protect global users from potential cyber threats and influence operations.
🎓 AI Academia
Adoption of Watermarking Measures for AI-Generated Content and Implications under the EU AI Act
• A recent study highlights the current lack of watermarking and labeling practices among AI image generators, with only 38% using watermarking and 8% labeling deepfakes;
• The 2024 EU AI Act requires the implementation of watermarks in AI-generated content, with non-compliance resulting in fines of up to 15 million Euros or 3% of annual turnover;
• Experts suggest that overcoming current implementation challenges is crucial for aligning AI creation with societal interests, potentially reducing the misinformation risks associated with AI-generated content; a toy watermarking sketch follows below.
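As a toy illustration of what "watermarking" means mechanically, the sketch below hides a bit string in the least-significant bits of an image's red channel and reads it back. Production schemes (including C2PA content credentials, which are signed metadata rather than pixel watermarks) are far more robust; this only makes the embed/extract idea concrete.

```python
# Toy invisible watermark: embed a bit string in the least-significant
# bit of the red channel. Illustrative only, not a production scheme.
from PIL import Image

def embed(img: Image.Image, bits: str) -> Image.Image:
    out = img.convert("RGB").copy()
    px = out.load()
    for i, bit in enumerate(bits):
        x, y = i % out.width, i // out.width
        r, g, b = px[x, y]
        px[x, y] = ((r & ~1) | int(bit), g, b)  # overwrite LSB of red
    return out

def extract(img: Image.Image, n: int) -> str:
    px = img.convert("RGB").load()
    return "".join(str(px[i % img.width, i // img.width][0] & 1)
                   for i in range(n))

marked = embed(Image.new("RGB", (64, 64), "white"), "10110010")
assert extract(marked, 8) == "10110010"  # watermark survives round-trip
```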
Generalized Auditing Framework Reveals Hidden Bias in Medical AI Datasets Across Modalities
• A new framework called G-AUDIT has been developed to detect and address dataset bias in medical AI, functioning across different modalities and data types;
• G-AUDIT evaluates the relationship between labels and data attributes, such as patient demographics and clinical environments, to identify risks of shortcut learning in AI models;
• Applied to various healthcare datasets, G-AUDIT effectively surfaces subtle biases, advancing the reliability and integrity of AI applications in settings like imaging and electronic health records; a simplified audit sketch follows below.
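The sketch below captures the spirit of this kind of audit, though it is not G-AUDIT's actual code: if non-causal metadata alone predicts the labels well, the dataset invites shortcut learning. The scanner/site scenario and all numbers are invented.

```python
# Shortcut-risk audit sketch (not G-AUDIT itself): measure how well
# metadata alone predicts labels. High accuracy is a red flag.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Hypothetical metadata: which scanner produced each image (3 sites).
scanner = rng.integers(0, 3, size=1000).reshape(-1, 1)
# Labels leak through the scanner: site 2 mostly sees positive cases.
label = (scanner.ravel() == 2).astype(int)
label ^= rng.random(1000) < 0.1  # flip 10% as noise

X = OneHotEncoder().fit_transform(scanner)
score = cross_val_score(LogisticRegression(), X, label, cv=5).mean()
print(f"Label predictable from scanner alone: {score:.0%}")  # ~90% -> risk
```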
New Research Reveals Stealthy Data Poisoning Threats to Language Model Integrity
• Researchers have developed PoisonedParrot, a data poisoning attack that causes large language models (LLMs) to generate copyrighted content, even without direct training on the specific materials
• PoisonedParrot is simple yet stealthy, and current defenses prove largely ineffective against it, underscoring LLMs’ vulnerability to subtle data poisoning
• Researchers propose a potential defense named ParrotTrap to counteract this new threat, inviting the community to delve deeper into this evolving security concern; a schematic of the fragment-injection idea follows below.
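Reading the attack description at a high level, the core trick is to splice small fragments of a protected text into otherwise benign training samples, so that no single sample contains the work yet a model can later reassemble it. The sketch below is a schematic of that data-construction step under that reading; the texts are toy examples and this is not the paper's code.

```python
# Schematic of fragment-injection poisoning (toy reading of the idea):
# scatter short n-grams of a protected text across benign samples.
import random

random.seed(0)
protected = "the quick brown fox jumps over the lazy dog".split()
benign_pool = ["models learn patterns from data", "training corpora are large"]

def poison(sample: str, n: int = 3) -> str:
    start = random.randrange(len(protected) - n + 1)
    fragment = " ".join(protected[start:start + n])  # one n-gram of the target
    words = sample.split()
    cut = random.randrange(len(words) + 1)
    return " ".join(words[:cut] + [fragment] + words[cut:])

poisoned_corpus = [poison(s) for s in benign_pool]
print(poisoned_corpus)  # benign-looking samples carrying target fragments
```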
Survey Explores Large Language Models Shifting from Fast Decisions to Analytical Reasoning
• A recent survey delves into the advancement of reasoning Large Language Models, highlighting their evolution from System 1 to more complex System 2 reasoning capabilities
• Cutting-edge models like OpenAI’s o1/o3 and DeepSeek’s R1 are showcased for their expert-level performances, closely mimicking human-like analytical processes in fields such as mathematics and coding
• The survey underscores the importance of combining foundational and reasoning LLMs to address biases and improve accuracy, offering a comprehensive comparison of LLM reasoning benchmarks.
Research Reveals How Data Contamination Affects Performance in Large Language Models
• A new paper examines data contamination's impact on Large Language Models (LLMs), highlighting that overlaps between training and test datasets can artificially boost performance evaluations
• The survey categorizes contamination detection methods into White-Box, Gray-Box, and Black-Box approaches and emphasizes the need for more rigorous evaluation protocols for LLMs
• Strategies for contamination-free evaluation include data updating, data rewriting, and prevention methods, with a focus on dynamic benchmarks and LLM-driven evaluation techniques; a minimal overlap check is sketched below.
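A minimal white-box check along these lines flags test examples whose n-grams overlap heavily with the training corpus. Real detection protocols are more sophisticated; the sketch below only shows the core overlap idea, with toy data.

```python
# Minimal white-box contamination check: fraction of a test example's
# n-grams that also appear in the training corpus.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(test_ex: str, train_corpus: list[str],
                        n: int = 8) -> float:
    test_ngrams = ngrams(test_ex, n)
    if not test_ngrams:
        return 0.0
    train_ngrams = set().union(*(ngrams(doc, n) for doc in train_corpus))
    return len(test_ngrams & train_ngrams) / len(test_ngrams)

train = ["the model was trained on a large corpus of web text scraped in 2023"]
test = "trained on a large corpus of web text scraped in 2023 and filtered"
print(f"overlap: {contamination_score(test, train):.0%}")  # high -> suspect
```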
Comprehensive Review Highlights Large Language Models' Impact on Text-to-SQL Conversion
• A recent survey provides a comprehensive review of how Large Language Models are utilized for Text-to-SQL tasks, highlighting methods like prompt engineering and fine-tuning
• The survey examines classic benchmarks and evaluation metrics, alongside offering practical insights and a detailed taxonomy of LLM-based Text2SQL methods
• Future directions and challenges in the evolving Text-to-SQL field are discussed, emphasizing the role of LLMs in bridging users with relational databases; a prompt-construction sketch follows below.
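A bare-bones version of the prompt-engineering approach the survey covers serializes the schema into the prompt and asks a model for SQL. The sketch below follows the OpenAI chat API shape; the model name is a placeholder and the schema is invented.

```python
# Bare-bones prompt-engineered Text-to-SQL: schema + question -> SQL.
# Model name is a placeholder; any capable chat model would do.
from openai import OpenAI

SCHEMA = """
CREATE TABLE orders (id INT, customer_id INT, total DECIMAL, placed_at DATE);
CREATE TABLE customers (id INT, name TEXT, country TEXT);
"""

def text_to_sql(question: str) -> str:
    prompt = (
        "Given this SQLite schema:\n"
        f"{SCHEMA}\n"
        f"Write one SQL query answering: {question}\n"
        "Return only SQL, no explanation."
    )
    client = OpenAI()  # assumes OPENAI_API_KEY in the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(text_to_sql("Total revenue per country in 2024, highest first"))
```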
New Benchmark Evaluates Uncertainty in Large Language Models Using Multiple Choice Questions
• UBENCH, a new benchmark for assessing uncertainty in large language models (LLMs), utilizes confidence intervals and features 11,978 multiple-choice questions encompassing various capabilities.
• Extensive experiments demonstrate that confidence interval-based methods effectively quantify uncertainty, with open-source models often rivaling closed-source models in performance.
• Chain-of-Thought and role-playing prompts show promise for enhancing model reliability, although temperature changes show no consistent effect; a calibration-scoring sketch follows below.
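Once a benchmark elicits per-question confidences, scoring them comes down to calibration. The sketch below computes a standard expected calibration error over (confidence, correctness) pairs; this is a generic metric, not UBENCH's specific protocol, and the toy data is invented.

```python
# Generic expected calibration error (ECE): bin elicited confidences and
# compare each bin's mean confidence with its empirical accuracy.
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight gap by bin occupancy
    return ece

# Toy data: a model that says "90% sure" but is right only ~70% of the time.
rng = np.random.default_rng(1)
conf = np.full(1000, 0.9)
correct = (rng.random(1000) < 0.7).astype(float)
print(f"ECE: {expected_calibration_error(conf, correct):.3f}")  # ~0.2
```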
Trust, Risk, and Security in Agentic AI: A Detailed Examination of TRiSM
• A recent review of TRiSM (Trust, Risk, and Security Management) in LLM-based agentic multi-agent systems highlights new architectural paradigms redefining autonomy, collaboration, and decision-making in enterprise and societal contexts
• The study presents a risk taxonomy and unique threat vectors for agentic AI applications, showcasing real-world vulnerabilities through case studies in comprehensive risk management
• Key insights into trust-building, transparency, and oversight techniques are provided, alongside state-of-the-art security, explainability strategies, and future research directions for AI system alignment with TRiSM principles.
About SoRAI: SoRAI is committed to advancing AI literacy through practical, accessible, and high-quality education. Our programs emphasize responsible AI use, equipping learners with the skills to anticipate and mitigate risks effectively. Our flagship AIGP certification courses, built on real-world experience, drive AI governance education with innovative, human-centric approaches, laying the foundation for quantifying AI governance literacy. Subscribe to our free newsletter to stay ahead of the AI Governance curve.