Trapped in an 'AI labyrinth': One company's plan to stop bots scraping content for AI training
Cloudflare has developed a "honeypot" for web crawlers: a maze of AI-generated decoy pages designed to trap unwanted bots without their knowledge.

"We wanted to create a new way to thwart these unwanted bots, without letting them know," Cloudfare said of its "honeypot" for web crawlers. How can we stop artificial intelligence (AI) from stealing our content? US-based web services provider Cloudflare says it has come up with a solution to web scraping - by setting up an "AI labyrinth" to trap bots.
AI Labyrinth to Thwart Bots
More specifically, the maze is aimed at detecting "AI crawlers" - bots that systematically harvest content from web pages - and trapping them there. The company said that it has seen "an explosion of new crawlers used by AI companies to scrape data for model training". Generative artificial intelligence (genAI) models require enormous amounts of data for training.
Several tech companies - such as OpenAI, Meta, and Stability AI - have been accused of scraping data that includes copyrighted content. To counter this, when Cloudflare detects "inappropriate bot activity", it will "link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them", making the bots waste time and resources.
Protecting Content from AI Scraping
"We wanted to create a new way to thwart these unwanted bots, without letting them know they’ve been thwarted," the company said, comparing the process to a "honeypot" while also helping it to catalogue nefarious actors. Cloudflare is used in around 20 per cent of all websites, according to the latest estimations.
The decoy content is "real and related to scientific facts" but "just not relevant or proprietary to the site being crawled," the blog post added. It will also be invisible to human visitors and won’t affect a site’s search rankings, the company said.
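The mechanism described above - serving suspected crawlers hidden links into a maze of decoy pages, while keeping those links invisible to humans and out of search indexes - can be illustrated with a minimal sketch. This is not Cloudflare's actual implementation; the user-agent list, URL scheme, and helper functions are all hypothetical, and real detection relies on behavioural signals rather than user-agent strings alone.

```python
# Hypothetical sketch of an "AI labyrinth" style honeypot.
# Suspected crawlers get hidden links into a maze of decoy pages;
# humans never see the links, and noindex/nofollow keeps the decoys
# out of search rankings.

KNOWN_AI_CRAWLERS = {"GPTBot", "CCBot", "ClaudeBot"}  # illustrative names only


def is_suspected_ai_crawler(user_agent: str) -> bool:
    """Naive detection: production systems use behavioural signals,
    not just user-agent matching."""
    ua = user_agent.lower()
    return any(name.lower() in ua for name in KNOWN_AI_CRAWLERS)


def decoy_links(depth: int, fanout: int = 3) -> str:
    """Links leading one level deeper into the labyrinth, wrapped in a
    container that is invisible to human visitors."""
    links = "".join(
        f'<a href="/maze/{depth + 1}/{i}" rel="nofollow">related reading</a>'
        for i in range(fanout)
    )
    return f'<div style="display:none" aria-hidden="true">{links}</div>'


def render_page(user_agent: str, body: str) -> str:
    """Serve the normal page; append the hidden maze entrance only
    for suspected AI crawlers."""
    extra = decoy_links(depth=0) if is_suspected_ai_crawler(user_agent) else ""
    return f"<html><body>{body}{extra}</body></html>"


def render_decoy_page(depth: int) -> str:
    """A maze page: real-but-irrelevant filler text plus links deeper in.
    The robots meta tag keeps decoys out of search indexes."""
    filler = "Photosynthesis converts light energy into chemical energy."
    return (
        '<html><head><meta name="robots" content="noindex, nofollow"></head>'
        f"<body><p>{filler}</p>{decoy_links(depth)}</body></html>"
    )
```

The key design point is that the honeypot is asymmetric: a human visitor's page is byte-for-byte unaffected apart from a hidden container, while a crawler that follows the links finds an endless chain of plausible pages that cost it time and bandwidth.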
Increasing Concerns and Solutions
An increasing number of voices are calling for stronger measures, including regulations, to protect content from being stolen by AI actors. Visual artists are now exploring how to protect their work by adding a layer of data that acts as a decoy for AI, preserving their artistic style by making it harder for genAI to mimic.
Other approaches have also been explored: some tech companies have agreed to let AI train on their content in exchange for undisclosed sums, while others, including a news agency and several artists, have taken the matter to court over the potential infringement of copyright laws.