OpenAI seeks deals with more news organizations to license content for training AI models

As news outlets partner with AI companies to train models on news stories, the amount that companies like OpenAI are willing to pay for copyrighted material is becoming clearer. According to The Information, OpenAI is willing to pay between $1 million and $5 million a year to license copyrighted news articles for training its AI models.

The report follows a recent one stating that Apple is seeking partnerships with media companies to use their content for AI training, and that it is offering at least $50 million over several years for the data. The Verge has contacted OpenAI for comment on these figures.

These figures are broadly in line with some earlier non-AI licensing deals. When Meta introduced the Facebook News tab (no longer available in Europe), it reportedly offered up to $3 million per year to license news stories, headlines, and previews.

It is unclear, however, whether the total payments would match some of the larger sums publishers have seen. In 2020, Google announced a $1 billion investment in partnerships with news organizations. More recently, in response to new laws, Google pledged $100 million a year to Canadian publishers as compensation for linking to their articles.

Today’s large language models have been trained predominantly on data from the internet. While some AI developers keep the sources of their training data secret, many others disclose details about the datasets or web crawlers they use. The price of a training dataset varies with the provider, the size of the dataset, and its contents.

Some data providers, such as LAION, are open source and free to use; LAION’s data is used by models like Stable Diffusion. AI developers also frequently deploy web crawlers to gather training data from the internet.

This approach, however, is running into significant obstacles. First, OpenAI’s web crawler, GPTBot, has been blocked by several companies, including The New York Times and Vox Media, the parent company of The Verge. Second, several organizations contend that training on their data amounts to copyright infringement.

The New York Times and others have sued OpenAI and Microsoft for copyright infringement, arguing that ChatGPT and Microsoft’s Copilot can produce content that closely resembles their work. Licensing deals with publishers give AI companies a way to avoid these problems, and such deals have become more common over the past year.

Rohan Sharma