Reddit’s User Content Sold to AI Company Ahead of IPO
Reddit, the popular online forum where millions of users post, comment and vote on a variety of topics, has signed a lucrative deal to sell its vast trove of user-generated content to an unnamed artificial intelligence company, according to people familiar with the matter.
The deal, which is worth about $60 million a year, was disclosed to potential investors earlier this year as Reddit prepares to go public as soon as next month, the people said, speaking on the condition of anonymity because the agreement was private. The AI company, which the people declined to identify, will use Reddit’s data to train its large language models, or LLMs, which are algorithms that learn from large volumes of text to generate human-like writing.
Reddit, which was founded in 2005 and is based in San Francisco, is one of the most visited websites in the world, with over 50 million daily active users and over 100,000 communities, or subreddits, covering topics ranging from politics and memes to science and sports. The site, which calls itself “the front page of the internet,” is known for its lively and sometimes controversial discussions, as well as its role in amplifying social movements, such as the GameStop stock frenzy and the Black Lives Matter protests.
Reddit’s data, which includes text, images, videos and metadata such as upvotes and downvotes, is a valuable resource for AI researchers and developers, who use it to train and test their models on a large and diverse corpus of human language and behavior. Reddit has previously allowed academic researchers and nonprofit organizations to access its data for free, but has also faced criticism for not adequately protecting its users’ privacy or seeking their consent.
The deal with the AI company marks the first time that Reddit has monetized its data in such a way, and could signal a new revenue stream for the company as it seeks to attract investors for its initial public offering, or IPO. Reddit, whose largest shareholder is Advance Publications, the parent company of Condé Nast, saw its revenue grow by about 20 percent in 2023, to more than $800 million, according to Bloomberg. The company also raised up to $700 million in a funding round in August 2021 that valued it at more than $10 billion.
However, the deal could also raise ethical and legal questions about how Reddit’s data is used and who benefits from it. Some Reddit users and moderators may not be aware of, or comfortable with, the fact that their posts and comments are being sold to a third-party company, especially one that has not been publicly disclosed. Moreover, some of the content on Reddit may be sensitive, offensive or inaccurate, and could potentially be used to create harmful or misleading AI applications, such as deepfakes, spam or propaganda.
Reddit did not respond to requests for comment on the deal. The AI company also did not respond to inquiries.
The deal comes at a time when large language models, such as GPT-4, developed by OpenAI, the Microsoft-backed company whose early backers included Elon Musk and Peter Thiel, are becoming more advanced and more widely used for natural language tasks such as writing, summarization and question answering. These models, which are trained on massive amounts of text data scraped from the internet, can generate coherent and convincing text on almost any topic, given a prompt or a query.
However, large language models also pose significant challenges and risks, such as environmental impact, bias, toxicity, misinformation and infringement of intellectual property rights. As these models rely on data from the internet, they often reflect and amplify the existing prejudices, stereotypes and falsehoods that are present in online sources. Moreover, these models may use the work of original authors or data owners without their consent or attribution, and may compromise their privacy.
To address some of these issues, some AI companies and organizations have started to seek formal licensing agreements with data providers, such as publishers, media outlets and social media platforms, to access and use their content for AI purposes. For example, OpenAI has signed deals with Axel Springer, the German publisher of Politico and Business Insider, and with the Associated Press, to use their articles to train and test its models. OpenAI is also reportedly in talks with CNN, Fox and Time, among others.
However, these agreements are not standardized or transparent, and may vary depending on the terms and conditions of each deal. Moreover, they may not cover all the possible uses and implications of the data, and may not adequately protect the rights and interests of the data subjects, such as the users, authors, editors and moderators.
As Reddit sells its data to an unnamed AI company ahead of its IPO, it may face scrutiny and backlash from its users, moderators, regulators and the public, who may demand more accountability and transparency from the company and its partner. Reddit may also have to balance its financial goals with its social and ethical responsibilities, as it navigates the complex and evolving landscape of AI and data.