Cloudflare has deployed an AI system designed to intercept bots that crawl public pages and collect data to train other AI models.

Image source: cloudflare.com

In theory, website owners can fight off crawlers with robots.txt directives, web server settings, CAPTCHA checks, or outright blocking. In practice, crawler operators often ignore robots.txt and work around CAPTCHAs and server-side restrictions. As a result, sites receive a growing volume of unwanted traffic, and their content ends up in AI training datasets without the rights holders' permission – a practice whose legality remains an open question.
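For illustration, the robots.txt directives in question look roughly like the snippet below, which asks two well-known AI training crawlers to stay out of the entire site; the user agents shown are only examples, and compliance is entirely voluntary on the crawler's side.

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /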

Cloudflare's approach is not to block crawlers but to lure them into an "AI maze" of useless AI-generated content. When the system detects unauthorized scraping, it does not reject the request; instead it serves a set of links to AI-generated pages convincing enough to draw the crawler in. That content looks real, but it is no longer the material the system is protecting, so the crawler wastes time and resources on decoys. The generated pages themselves stick to genuine scientific facts: Cloudflare does not want to spread disinformation, and outright junk content could damage a site's reputation and search engine rankings.
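As a rough, hypothetical sketch of the idea (not Cloudflare's actual implementation), a server could flag suspicious requests and answer them with pre-generated decoy pages that only link deeper into the maze; the User-Agent check, URL scheme, and Flask app below are all assumptions made for illustration.

    # A minimal, hypothetical sketch of the decoy-maze idea described above,
    # not Cloudflare's implementation. It assumes a Flask app, a crude
    # User-Agent check, and a pre-generated pool of harmless filler pages.
    from flask import Flask, request
    import random

    app = Flask(__name__)

    SUSPECT_AGENTS = ("GPTBot", "CCBot", "Bytespider")  # example bot signatures
    DECOY_PATHS = [f"/maze/{i}" for i in range(1000)]   # hypothetical decoy URLs

    def looks_like_ai_crawler(req) -> bool:
        """Very naive heuristic: flag requests whose User-Agent matches a known bot."""
        user_agent = req.headers.get("User-Agent", "")
        return any(sig in user_agent for sig in SUSPECT_AGENTS)

    @app.route("/maze/<int:page_id>")
    def maze(page_id: int):
        # Serve plausible-looking filler text plus links that lead deeper into
        # the maze, so the crawler spends its crawl budget on decoys.
        links = " ".join(f'<a href="{random.choice(DECOY_PATHS)}">related article</a>'
                         for _ in range(5))
        return f"<html><body><p>Generated filler page {page_id}.</p><p>{links}</p></body></html>"

    @app.route("/")
    def index():
        if looks_like_ai_crawler(request):
            # Instead of blocking the bot, hand it an entry point into the maze.
            return maze(random.randrange(len(DECOY_PATHS)))
        return "<html><body><p>Real content for human visitors.</p></body></html>"

A real deployment would rely on far stronger bot-detection signals than a User-Agent string, but the control flow above captures the mechanism the article describes: suspected crawlers are redirected into self-referential generated pages rather than refused.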

The system is meant to deter content crawlers by wasting their resources and driving up their operating costs, and it doubles as a tool for detecting bot activity: the creators are confident that no human would follow such an "AI maze" more than four links deep. It is not a panacea, though – measures like this tend to trigger an arms race, and Cloudflare is already thinking about its next steps to stay ahead.
