Tech giants use YouTube subtitles for AI training without permission

Apple, Nvidia, and Anthropic have been found to have used YouTube subtitles to train AI models, in violation of YouTube’s policies. A report by Proof News and Wired showed that these firms used a dataset of transcripts from thousands of YouTube videos without obtaining a license to do so.

The report found that Apple, Nvidia, and Anthropic used the YouTube Subtitles dataset, which consists of transcripts from 173,536 YouTube videos across 48,000 channels. The videos come from educational channels like Khan Academy and MIT, news outlets like The Wall Street Journal, and top creators like MrBeast and Marques Brownlee.

Popular YouTubers react to data exploitation

Marques Brownlee, a popular YouTuber, commented on the issue on X. He said, “Apple has gathered data for AI from other firms. One of them collected a lot of data/transcripts from YouTube videos, including mine.” While Apple may not have scraped the data directly, Brownlee pointed out that this problem will persist.

The “YouTube Subtitles” dataset was developed by EleutherAI and published in 2020. It contains 5.7GB of data, including subtitles from YouTube videos that have since been removed from the platform.

According to YouTube’s terms and conditions, accessing videos by “automated means” is prohibited. The existence of subtitles from removed videos only adds to the issue, raising questions about privacy and copyright infringement.

Salesforce, another organization implicated in the probe, has admitted to using the dataset.

“The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes. The dataset was publicly available and released under a permissive license.”
Salesforce spokesperson

However, the use of YouTube content without permission remains controversial. In April, YouTube CEO Neal Mohan said that using YouTube videos, transcripts, or clips for AI training is a “clear violation” of the platform’s policies. Even so, according to the New York Times, OpenAI used a million hours of YouTube videos to train its GPT-4 model.

Legal battles erupt over AI companies’ use of internet content

The dispute over AI companies using content from the internet without authorization has intensified since the launch of ChatGPT. Content creators are suing Stability AI and Midjourney for allegedly scraping copyrighted works without permission. Google, YouTube’s owner, has faced class-action lawsuits over similar claims and has argued that legal actions of this kind threaten the basis of generative AI.

In an interview with The Wall Street Journal, OpenAI’s CTO Mira Murati did not elaborate on whether the company used videos from social media platforms to train its new video-generation model, Sora. Microsoft AI CEO Mustafa Suleyman stated that content on the open web has been considered fair use since the 1990s under what he called a “social contract.”