Reddit has filed a lawsuit against AI search engine Perplexity and three data firms (Oxylabs UAB, AWMProxy, and SerpApi), accusing them of illegally scraping its content for AI training purposes. The suit, filed in the US District Court for the Southern District of New York, claims that the defendants bypassed Reddit’s and Google’s security measures to harvest nearly 3 billion search engine result pages (SERPs) in just two weeks this past July.
Allegations of Systematic Scraping
According to the lawsuit, the defendants used deceptive tactics to mask their identities and locations while extracting data from Reddit. The company likened the operation to “would-be bank robbers” targeting the cash transport instead of the bank vault itself. Reddit argues that this scraping undermines its copyright protections. The platform had previously issued a cease-and-desist letter to Perplexity after tracing the scraped data back to the company.
Key Players and Connections
Perplexity continues to be listed as a client of SerpApi, alongside major tech companies like Meta, Samsung, and Nvidia. That client list highlights the high demand for training data among AI developers. Reddit has already secured licensing deals with OpenAI and Google, and it has separately pursued legal action against Anthropic over unauthorized data usage.
Broader Legal Landscape
This case is part of a growing trend of copyright disputes involving AI companies. Encyclopedia Britannica, which owns Merriam-Webster, recently filed a similar lawsuit against Perplexity for copyright infringement. The core issue is AI’s need for massive datasets of human-generated content, much of which is copyrighted, and the legal complexity of obtaining that content.
Perplexity’s Defense
Perplexity argues that it does not require licensing agreements because it doesn’t train foundational AI models. Instead, it states that it uses Reddit responses in its search results “lawfully.” This position is at odds with Reddit’s assertion that the scraping was systematic and unauthorized.
Why This Matters
The lawsuit underscores the escalating tension between AI developers and content creators over data ownership. Reddit, with over 110 million daily active users and billions of posts, represents a valuable source of training data. The outcome of this case could set precedents for how AI companies access and utilize copyrighted material, potentially reshaping the future of data licensing and intellectual property rights in the rapidly evolving AI landscape.