Gary Marcus is a leading AI researcher who is increasingly horrified by what he sees. He founded not one but two AI startups, one of which was sold to Uber, and has been researching the subject for over 20 years. Just last weekend, the Financial Times called him "perhaps the loudest AI questioner" and reported that Marcus believed he was the target of a critical post by Sam Altman on X: "Give me the confidence of a mediocre deep learning skeptic."
The day after appearing in the FT, Marcus stepped up his criticism, writing on his Substack about "generative AI as a Shakespearean tragedy." The subject was a bombshell report from The New York Times that OpenAI had violated YouTube's terms of service by scraping over one million hours of user-generated content. What's worse, Google's need for data to train its own AI models was so insatiable that it did the same thing, potentially violating the copyrights of the content creators whose videos it used without their consent.
Marcus noted that he had expressed doubts as far back as 2018 about the "data-hungry" training approach that aimed to feed AI models as much content as possible. In fact, he listed eight of his warnings, dating back to a diagnosis of hallucinations in 2001, each now coming true like a curse in Macbeth or Hamlet that manifests itself in the fifth act. "What makes this tragic is that many of us tried so hard to warn the field that this is where we would end up," Marcus wrote.
While Marcus declined to comment to Fortune, the tragedy goes far beyond the fact that nobody listened to critics like him and Ed Zitron, another prominent skeptic quoted by the FT. According to the Times, which cites numerous background sources, both Google and OpenAI knew their actions were legally dubious, relying on the fact that copyright law in the age of AI had yet to be litigated, but felt they had no alternative but to keep pumping data into their large language models to stay ahead of the competition. And in Google's case, the company may have been harmed by OpenAI's massive scraping efforts, but its own rule-bending to scrape the same data left it with a proverbial hand tied behind its back.
Did OpenAI use YouTube videos?
Google employees became aware that OpenAI was using YouTube content to train its models, which would violate both YouTube's terms of service and potentially the copyright protections of the creators who own the videos. Caught in this quandary, Google decided not to publicly denounce OpenAI for fear of drawing attention to its own use of YouTube videos to train AI models, the Times reported.
A Google spokesperson told Fortune the company has "seen unconfirmed reports" that OpenAI used YouTube videos. They added that YouTube's terms of service prohibit "the unauthorized scraping or downloading" of videos, which the company "has long employed technical and legal measures to prevent."
According to Marcus, the behavior of these big tech firms was predictable, because data is the key ingredient in the AI tools they are in an arms race to develop. Without high-quality data such as well-written novels, podcasts from knowledgeable hosts, or expertly produced movies, chatbots and image generators risk spitting out mediocre content. The idea can be summed up by the data science saying "garbage in, garbage out." In a comment to Fortune, Jim Stratton, chief technology officer at HR software company Workday, said: "Data is the lifeblood of AI," making the "need for high-quality, timely data more important than ever."
Around 2021, OpenAI ran into a data shortage. Desperate for more examples of human language to further improve its ChatGPT tool, which was still about a year away from release, the company decided to source them from YouTube. Employees discussed the fact that transcribing YouTube videos might not be allowed. Eventually, a group including OpenAI President Greg Brockman implemented the plan.
The fact that a high-ranking figure like Brockman was involved in the plan was, according to Marcus, a testament to how central such data collection methods were to the development of AI. Brockman did so "most likely knowing he was in a legal gray area — and yet still eager to feed the beast," Marcus wrote. "If everything falls apart, whether for legal or technical reasons, this image may remain."
When reached for comment, an OpenAI spokesperson did not respond to specific questions about the use of YouTube videos to train its models. "Each of our models has a unique dataset that we curate to help them understand the world and remain globally competitive in research," they wrote in an email. "We leverage multiple sources, including publicly available data and non-public data partnerships, and are exploring synthetic data generation," they said, referring to the practice of using AI-generated content to train AI models.
Mira Murati, chief technology officer of OpenAI, was asked in a Wall Street Journal interview whether the company's new Sora video generator was trained on YouTube videos; she replied, "I'm actually not sure about that." Last week, YouTube CEO Neal Mohan said that while he didn't know whether OpenAI had actually used YouTube data to train Sora or any other tool, doing so would violate the platform's rules. Mohan did mention that Google uses some YouTube content to train its AI tools, under contracts with individual YouTubers, a statement a Google spokesperson reiterated to Fortune in an email.
Meta decides that license agreements would take too long
OpenAI was not alone in facing a shortage of data. Meta wrestled with the issue as well. When Meta realized that its AI products weren't as advanced as OpenAI's, it held numerous meetings with top executives to find ways to secure more data to train its systems. Executives considered options such as paying licensing fees of $10 per book for new releases and buying the publisher Simon & Schuster outright. In those meetings, executives admitted that they had already used copyrighted material without the authors' permission. Ultimately, they decided to press ahead, even if it meant possible lawsuits in the future, the New York Times reported.
Meta did not respond to a request for comment.
Meta's lawyers believed the company would be protected in the event of litigation by a case Google won against a consortium of authors in 2015. At the time, a judge ruled that Google could use the authors' books without paying a licensing fee because the company had used their work to build a search engine, which was sufficiently transformative to be considered fair use.
OpenAI is making a similar argument in a case brought against the company by the New York Times in December. The Times claims that OpenAI used its copyrighted material without compensation, while OpenAI contends that its use of the material falls under fair use because it was collected to train a large language model, not to compete with the Times as a news organization.
For Marcus, the hunger for more data is proof that the whole concept of AI rests on shaky ground. For AI to live up to the hype with which it is billed, it simply requires more data than is available. "All of this happened because of the realization that their systems simply cannot succeed without even more data than the Internet data they were already trained on," Marcus wrote on Substack.
OpenAI appeared to admit as much in written testimony to the House of Lords in December. "It would be impossible to train today's leading AI models without using copyrighted material," the company wrote.