Apple, NVIDIA, others face backlash over scraping data from YouTube content for AI training
Photo by Szabo Viktor / Unsplash

This includes transcripts of videos from channels run by popular creators like Marques Brownlee and MrBeast.

by Emmanuel Oyedeji

Ever since generative AI became mainstream, the AI community has faced increasing scrutiny over its use of individuals' data, sparking essential conversations about the ethical boundaries and regulations needed for AI training data.

That scrutiny has grown because, over the past year, incidents and lawsuits over unethical data scraping for AI training have become increasingly common, and even the biggest names in the AI industry are not exempt.

In a recent development, YouTube content has become a goldmine for big tech companies training their AI models, and they are mining it unethically. A new investigation by Proof News revealed that some of the world's biggest tech companies, including Apple, NVIDIA, and Anthropic, have trained their AI models on a massive dataset of transcripts (not the videos or images themselves) scraped from over 173,000 YouTube videos without the creators' permission.

This dataset, created by a non-profit called EleutherAI, includes transcripts of videos from channels run by popular creators like Marques Brownlee and MrBeast, as well as major news outlets like BBC and The New York Times.

Scraping this data violates YouTube's terms of service, which prohibit data harvesting. YouTube CEO Neal Mohan and Google (YouTube's parent company) have already condemned the practice.

It is worth noting that this incident is just the tip of the iceberg. Unethical data scraping has become a recurring issue in AI development. Tech giants like Apple, OpenAI, and Google (the victim in this case) have been hit with lawsuits alleging that they scraped users' data without consent to train their AI products. There have also been reports of these tech companies quietly paying for content behind paywalls and login screens, fueling a hidden trade in chat logs and old personal photos from defunct social media apps.

This development renews concerns about the harvesting of personal data for AI training and what it means for the rights of internet users, a topic that has been fiercely debated online since generative AI emerged.

But the bigger challenge lies in the lack of transparency from some AI companies. Earlier this year, Apple faced criticism for keeping the source data used to train its new AI tool secret. Similarly, OpenAI, the company behind the upcoming AI video generator "Sora," dodged questions about whether YouTube videos were used in its development.

This lack of transparency makes it difficult to hold big tech companies accountable and stop this unethical data usage altogether. It raises a significant question: Can the unethical use of data from public websites for AI development ever be stopped?

The answer could define the ethical stakes of AI training and show how robust AI regulation must be to ensure responsible development and protect the rights of content creators and all internet users.
