@corbet Try to lose them in an AI-poisoning maze running on a different host to the real content? Robots.txt tells well-behaved crawlers to stay out of the maze, but every real page contains a link into it. The maze is much bigger than the real site so it should eat a higher proportion of the traffic, and it can be heavily throttled as there are no real users to worry about. Logs from the maze give you IPs to block from the real site, though as you note that doesn't help much if they only visit 3 times each.
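A minimal sketch of the robots.txt side, assuming the maze lives under a hypothetical /maze/ path:

```
# Well-behaved crawlers are told to stay out of the maze entirely;
# only scrapers that ignore robots.txt will follow the in-page links in.
User-agent: *
Disallow: /maze/
```

Any client that then shows up in the maze logs has, by definition, ignored robots.txt.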
@DanielRThomas @corbet I just saw an article about something like this called Nepenthes: https://www.404media.co/developer-creates-infinite-maze-to-trap-ai-crawlers-in/ I'm not sure how well it'll work here though, because each crawler source is only used for a handful of requests before moving on, so it's hard to identify a scraper on so few requests and serve it a tarpit without trapping legit users. Maybe combine it with something like @joeyh proposed: a page with bogus honeypot links is served as an interstitial in place of less common documents, with an auto-refresh to the legit document. The scraper then starts seeding its request queue with links you know are bogus, which become an instant trap for future requests. Ideally a normal client would just follow the meta-refresh and wouldn't try to pre-load the booby-trapped URLs, but this would need some validation, and for the LWN crowd that might include user-agents such as elinks, lynx and dillo, not just Firefox and Chrome.
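A rough sketch of what such an interstitial could look like; the document path and the honeypot URLs here are hypothetical placeholders:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- A normal client follows this immediately to the legit document -->
  <meta http-equiv="refresh" content="0; url=/doc/12345">
  <!-- Ask well-behaved crawlers not to index or follow the decoy -->
  <meta name="robots" content="noindex, nofollow">
</head>
<body>
  <p><a href="/doc/12345">Continue to the document</a></p>
  <!-- Honeypot links: never served anywhere else, so any later request
       for one of these marks the client as a scraper that harvested
       this page into its queue -->
  <a href="/maze/deadbeef">related</a>
  <a href="/maze/cafebabe">more</a>
</body>
</html>
```

Clients like lynx or dillo would need checking: they follow the meta-refresh, but the question is whether any of them pre-fetch in-page links.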
@raven667 @corbet @joeyh Yes, something like that, or even just putting links on pages with robots.txt set such that well-behaved crawlers ignore them. Longer list of options here: https://tldr.nettime.org/@asrg/113867412641585520