OpenAI Trained AI Models on Copyrighted O'Reilly Media Books, Researchers Claim

The AI Disclosures Project conducted a test to see if OpenAI’s AI models could identify content from paywalled O’Reilly Media books.

Written by Akash Dutta, Edited by David Delima | Updated: 2 April 2025 18:42 IST
Highlights
  • O’Reilly Media is said to not have any licensing agreement with OpenAI
  • GPT-4o was said to show the highest recognition of copyrighted content
  • Researchers used a membership inference attack in the test


OpenAI might have trained its artificial intelligence (AI) models on copyrighted content, according to a research paper. The recently published paper from the non-profit organisation AI Disclosures Project states that the San Francisco-based AI firm's recent large language models (LLMs) showed higher recognition of copyrighted content than its older models. The researchers used a recently developed method called DE-COP to detect copyrighted content in the AI models' training dataset. Notably, the study did not find evidence that GPT-4o mini was trained on the specific copyrighted content.

Researchers Used DE-COP to Test OpenAI's Training Dataset

The study, titled Beyond Public Access in LLM Pre-Training Data, was conducted to check if OpenAI's AI models were trained on non-public book content. For the study, researchers focused on O'Reilly Media, a US online learning platform that hosts numerous copyrighted books. The founder of the platform, Tim O'Reilly, was also one of the co-authors of the study.

The researchers used the DE-COP method to test whether the training data of the AI models contained copyrighted material. This is a relatively new test, introduced in a paper published in 2024. The method, a type of membership inference attack, quizzes an AI model with a multiple-choice test to see whether it can tell the original copyrighted text apart from machine-generated paraphrases of it.


The researchers used Claude 3.5 Sonnet to paraphrase the copyrighted material. As many as 3,962 paragraph excerpts from 34 O'Reilly Media books were used for the test.
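The multiple-choice quiz described above can be sketched roughly as follows. This is a minimal illustration and not the researchers' code: the function names, the placeholder excerpts, the use of three paraphrases per question, and the stand-in "model" are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a function that receives the answer options and
# returns the index of the option it believes is the verbatim excerpt.
AskModel = Callable[[List[str]], int]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the verbatim excerpt in with its paraphrases.

    Returns the shuffled options and the position of the verbatim excerpt.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[Tuple[str, Sequence[str]]],
               ask_model: AskModel, seed: int = 0) -> float:
    """Fraction of quizzes in which the model picks the verbatim excerpt."""
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if ask_model(options) == answer:
            correct += 1
    return correct / len(items)


# Toy data: placeholder strings, not real book excerpts.
items = [
    ("excerpt A", ["paraphrase A1", "paraphrase A2", "paraphrase A3"]),
    ("excerpt B", ["paraphrase B1", "paraphrase B2", "paraphrase B3"]),
]

# Stand-in "model" that has memorised both excerpts (an assumption for the demo).
memorised = {"excerpt A", "excerpt B"}


def memorising_model(options: List[str]) -> int:
    return next(i for i, text in enumerate(options) if text in memorised)


print(guess_rate(items, memorising_model))
```

With four options per question, a model that has never seen the text would be expected to guess correctly about a quarter of the time, while the memorising stand-in here is right every time. A guess rate well above chance is what hints that a text may have been in the training data.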


Based on the tests conducted, the researchers claimed to have found that the GPT-4o AI model showed the highest recognition of the copyrighted and paywalled O'Reilly book content, with an 82 percent Area Under the Receiver Operating Characteristic Curve (AUROC) score. Notably, the AUROC score is part of the DE-COP method and is derived from the guess rates from the multiple-choice test.
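As a rough illustration of how an Area Under the Receiver Operating Characteristic Curve score can be derived from guess rates, the sketch below compares two groups of scores. It assumes the guess rates are split into texts suspected of being in the training data and texts known to be outside it; that grouping, the function name, and the numbers are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def area_under_roc(suspected: Sequence[float], unseen: Sequence[float]) -> float:
    """Probability that a randomly chosen 'suspected' score beats a randomly
    chosen 'unseen' score, counting ties as half a win.

    This pairwise formulation is equivalent to the area under the ROC curve.
    """
    wins = 0.0
    for s in suspected:
        for u in unseen:
            if s > u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(suspected) * len(unseen))


# Made-up guess rates, purely for illustration.
suspected_rates = [0.6, 0.3]   # texts that might be in the training data
unseen_rates = [0.4, 0.2]      # texts assumed to be outside the training data

print(area_under_roc(suspected_rates, unseen_rates))  # 0.75
```

A score of 0.5 means the two groups cannot be told apart, while a score closer to 1 means the model recognises the suspected texts far more reliably than the unseen ones.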

The study also found that older OpenAI AI models, such as GPT-3.5 Turbo, showed lower content recognition than GPT-4o, but still high enough to be significant. However, the test did not indicate that GPT-4o mini was trained on the paywalled O'Reilly Media books. The paper states the reason could be that the test is not effective against smaller language models.

 
