Open-source Git hosting platform SourceHut says its services have been slowed by web crawlers run by AI companies. Other site operators are increasingly making similar complaints.
Image source: Kai Wenzel / unsplash.com
To limit traffic from AI bots, SourceHut deployed Nepenthes, a defense against rogue web crawlers that collect data to train AI models. The platform's administrators also unilaterally blocked entire address ranges of several cloud providers, including Google Cloud and Microsoft Azure, because of excessive bot traffic originating from their networks. Operators of legitimate services hosted on those clouds were advised to contact SourceHut individually to be added to an exception list. SourceHut had a similar problem in 2022, when Google's Go Module Mirror service flooded its resources with requests.
In 2023, OpenAI pledged that its bots would honor robots.txt directives, which tell web crawlers which parts of a website they may access. Other AI developers have made similar commitments, but complaints of abuse keep coming. Last summer, iFixit reported being hammered by Anthropic's ClaudeBot. In December, hosting provider Vercel reported heavy AI crawler activity on its infrastructure: OpenAI's GPTBot sent 569 million requests to its network, and Anthropic's Claude crawler sent 370 million. Together, that is about 20% of the 4.5 billion requests from Googlebot, which indexes pages for Google Search.
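As a rough illustration of how a well-behaved crawler is expected to interpret robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the paths, and the `is_allowed` helper are hypothetical and chosen only for the example. Nothing here describes how any particular AI crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the rules and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.org/repo/log")
        print(f"{agent}: {'allowed' if verdict else 'blocked'}")
```

Compliance is voluntary: the file only states the site owner's wishes, so a crawler that never performs this check is not stopped by it.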
Around the same time, Dennis Schubert, a developer of the distributed social network Diaspora, complained that AI bots had accounted for 70% of the traffic to his server over the previous 60 days. His post went viral, and AI crawler activity dropped sharply. Trolls, however, then flooded the server with requests from clients whose user-agent string matched OpenAI's GPTBot. The genuine OpenAI bot sends its requests from Microsoft Azure infrastructure, whereas the requests hitting the Diaspora server came from AWS addresses and even from US Internet providers.
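The mismatch described above, a user-agent that claims to be one crawler while the request arrives from unrelated networks, can be checked mechanically. The sketch below assumes the site operator has a list of address ranges from which the genuine bot operates. The ranges shown are placeholders from the RFC 5737 documentation blocks, not real crawler ranges, and `PUBLISHED_RANGES`, `claimed_bot`, and `is_spoofed` are hypothetical names.

```python
import ipaddress
from typing import Optional

# Placeholder ranges (RFC 5737 documentation blocks), NOT real crawler ranges.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "198.51.100.0/25"],
}


def claimed_bot(user_agent: str) -> Optional[str]:
    """Return the known bot name that the user-agent string claims, if any."""
    lowered = user_agent.lower()
    for name in PUBLISHED_RANGES:
        if name.lower() in lowered:
            return name
    return None


def is_spoofed(user_agent: str, remote_ip: str) -> bool:
    """True if the user-agent claims a known bot but the IP is outside its ranges."""
    name = claimed_bot(user_agent)
    if name is None:
        return False  # no known bot is claimed, so there is nothing to verify
    addr = ipaddress.ip_address(remote_ip)
    networks = (ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES[name])
    return not any(addr in net for net in networks)


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; GPTBot/1.0)"
    print(is_spoofed(ua, "192.0.2.10"))   # inside a listed range
    print(is_spoofed(ua, "203.0.113.7"))  # outside every listed range
```

A check like this only separates impostors from the bot whose ranges are known; it says nothing about whether the genuine bot is behaving acceptably.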
The situation is further complicated by the fact that some bots serve more than one purpose. The Meta AI bot and Applebot collect data exclusively for AI training, while Googlebot serves both AI training and search indexing. To avoid confusion, Google added a separate Google-Extended value for AI training in 2023.