Apple, Anthropic and Other AI Firms Have Reportedly Trained AI Models on Thousands of YouTube Videos

EleutherAI, a non-profit AI research lab, reportedly compiled the dataset that trained Apple and Anthropic’s AI models.

Written by Akash Dutta, Edited by Manas Mitul | Updated: 17 July 2024 13:21 IST
Highlights
  • The stolen YouTube data comes from Marques Brownlee, MrBeast, and more
  • Apple reportedly used this dataset to train its OpenELM AI model
  • YouTube prohibits accessing videos using any automated means

Data from Indian YouTube creators such as CarryMinati and Ashish Chanchlani was also reportedly swiped

Photo Credit: Reuters

Apple, Anthropic, and other major artificial intelligence (AI) firms have reportedly trained AI models on data from more than 170,000 YouTube videos. A new report claims that multiple AI companies used a publicly available dataset called the Pile, which contained the plain text of the videos' subtitles without any video imagery. The subtitles were taken from videos by popular YouTube creators such as MrBeast, Marques Brownlee, and PewDiePie, as well as Indian YouTube creators such as CarryMinati, BB ki Vines, and Ashish Chanchlani.

Multiple AI Models Reportedly Trained on YouTube Videos

An investigation by Proof News found that subtitles from 173,536 YouTube videos, spread across more than 48,000 channels, were included in the dataset. As per the report, EleutherAI, a non-profit AI research lab, curated this dataset. Later, it was used by companies such as Apple, Anthropic, Nvidia, Salesforce, and more. Notably, the AI lab published a research paper detailing the dataset.

EleutherAI created an 800GB data repository dubbed the Pile and made it publicly available for those who wanted to train AI models but could not afford large datasets. The majority of the dataset was taken from publicly available sources such as English Wikipedia, e-books, and more. However, it also contained the subtitles from all the videos compiled in a dataset called YouTube Subtitles.

The report claimed that the Pile was used to train Apple's OpenELM AI model, based on the description in the model's research paper. Research papers for AI models from Salesforce, Nvidia, and Anthropic also reportedly mention using the dataset.

Anthropic spokesperson Jennifer Martinez told the publication in a statement, “The Pile includes a very small subset of YouTube subtitles. YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube's terms of service, we'd have to refer you to the Pile authors.”

Notably, YouTube's terms of service prohibit anyone from accessing the videos on the platform using automated means such as robots, botnets, or scrapers. The YouTube Subtitles dataset would fall under the scraping category. A Google spokesperson told Proof News in an email response that the tech giant has taken “action over the years to prevent abusive, unauthorised scraping.” However, the spokesperson did not comment on AI firms' use of the data.

In a post on X (formerly known as Twitter), Marques Brownlee called out Apple for sourcing data that included transcripts of his videos, but he also highlighted that it was not the iPhone maker's fault since it did not collect the data itself.

While this dataset was collected and distributed publicly, there could be other instances of data scraping on platforms such as YouTube. With AI firms scrambling to find more data to train their large language models (LLMs), data procurement might continue to enter similar legally grey areas.

 

© Copyright Red Pixels Ventures Limited 2026. All rights reserved.