A followup for folks who are curious about the whole AI botswarm problem...

Some of these bots are clearly running on a bunch of machines on the same net. I have been able to reduce the traffic significantly by treating everything as a class-C net and doing subnet-level throttling. That and simply blocking a couple of them.
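For illustration, a minimal sketch of that kind of subnet-level throttling in Python: IPv4 clients are collapsed into /24 (class-C-sized) buckets and rate-limited per bucket rather than per address. The limits and names here are assumptions, not the actual configuration in use.

```python
import ipaddress
import time
from collections import defaultdict

BUCKET_LIMIT = 30         # max requests per /24 per window (illustrative)
WINDOW_SECONDS = 60

hits = defaultdict(list)  # /24 network -> recent request timestamps

def allow_request(client_ip: str) -> bool:
    # Collapse the client address into its /24, so machines on the
    # same class-C-sized net share one throttling bucket.
    net = ipaddress.ip_network(f"{client_ip}/24", strict=False)
    now = time.monotonic()
    recent = [t for t in hits[net] if now - t < WINDOW_SECONDS]
    recent.append(now)
    hits[net] = recent
    return len(recent) <= BUCKET_LIMIT
```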

But that leaves a lot of traffic with an interesting characteristic: there are millions of obvious bot hits (following a pattern through the site, for example) that each come from a different IP. An access log with 9M lines has over 1M IP addresses, and few of them appear more than about three times.
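That distribution is easy to measure from the log itself. A quick sketch, assuming a common/combined-format access log with the client address as the first field (the file name is made up):

```python
from collections import Counter

counts = Counter()
with open("access.log") as log:
    for line in log:
        parts = line.split(None, 1)   # client IP is the first field
        if parts:
            counts[parts[0]] += 1

print(f"{sum(counts.values())} requests from {len(counts)} addresses")
print("addresses seen more than 3 times:",
      sum(1 for c in counts.values() if c > 3))
```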

So these things are running on widely distributed botnets, likely on compromised computers, and they are doing their best to evade any sort of recognition or throttling. I don't think that any sort of throttling or database of known-bot IPs is going to help here...not quite sure what to do about it.

What a world we have made for ourselves...

@corbet Try to lose them in an AI poisoning maze running on a different host to the real content? Robots.txt blocks permitted crawlers from visiting the maze but every real page contains a link into the maze. Maze is much bigger than real site so should eat higher proportion of traffic, but heavily throttled as no real users to worry about. Logs from maze give you IPs to block from the real site, though as you note that doesn't help much if they only visit 3 times each.
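A rough sketch of how such a maze might be generated, assuming well-behaved crawlers honor a robots.txt that disallows the maze prefix; the /maze/ path and fan-out are invented for illustration:

```python
import hashlib
import random

# robots.txt for the real site would contain something like:
#   User-agent: *
#   Disallow: /maze/
# so permitted crawlers stay out, while every real page carries
# one link into /maze/.

def maze_page(path: str, fanout: int = 5) -> str:
    # Seed from the requested path so pages are reproducible without
    # storing anything, yet the maze is effectively infinite.
    rng = random.Random(hashlib.sha256(path.encode()).hexdigest())
    links = "\n".join(
        f'<a href="/maze/{rng.getrandbits(64):016x}">more</a>'
        for _ in range(fanout)
    )
    return f"<html><body>{links}</body></html>"
```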

Raven667

@DanielRThomas @corbet I just saw an article about something like this called Nepenthes: 404media.co/developer-creates- I'm not sure how well it will work here, though, because each crawler source makes only a handful of requests before moving on; that is too few requests to identify it and serve a tarpit without also trapping legitimate users. Maybe combine it with something like @joeyh proposed, where a page with bogus honeypot links is served as an interstitial in place of less common documents, with an auto-refresh to the legitimate document. The scraper then starts seeding its queue of requests with links that you know are bogus, so any future request for one is an instant trap. Ideally a normal client would just follow the meta-refresh and its user-agent wouldn't try to pre-load the booby-trapped URLs, but this would require some validation, and for the LWN crowd that might include user-agents such as elinks, lynx, and dillo, not just Firefox and Chrome.
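The interstitial idea might look something like this sketch: a real browser follows the immediate meta-refresh, while a scraper that harvests links queues up the honeypot URLs. The /trap/ prefix and token scheme are assumptions for illustration:

```python
import secrets

def interstitial(real_url: str, n_traps: int = 3) -> str:
    # Honeypot links that a link-harvesting scraper will queue; a normal
    # client follows the zero-second refresh and never fetches them.
    traps = "\n".join(
        f'<a href="/trap/{secrets.token_hex(8)}">.</a>'
        for _ in range(n_traps)
    )
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="0; url={real_url}">'
        f"</head><body>{traps}</body></html>"
    )

# Any later request for a /trap/ URL marks that client as a scraper.
```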

404 Media · Developer Creates Infinite Maze That Traps AI Training Bots

"Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself."

@raven667 @corbet @joeyh Yes, something like that, or even just putting links on pages with robots.txt set such that well-behaved crawlers ignore them. Longer list of options here: tldr.nettime.org/@asrg/1138674

iocaine

> The deadliest poison known to AI.

This is a tarpit, modeled after Nepenthes, intended to catch unwelcome web crawlers, but with a slightly different, more aggressive intended usage scenario. The core idea is to configure a reverse proxy to serve content generated by iocaine to AI crawlers, but normal content to every other visitor. This differs from Nepenthes, where the idea is to link to the tarpit and trap crawlers that way; with iocaine, the trap is laid by the reverse proxy itself.

iocaine does not try to slow crawlers. It does not try to waste their time that way - that is left up to the reverse proxy. iocaine is purely about generating garbage.
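A minimal sketch of the reverse-proxy split this relies on, in Python; the user-agent list is a small sample of publicly documented AI crawlers, and the upstream addresses are placeholders:

```python
AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

def pick_upstream(user_agent: str) -> str:
    # AI crawlers get the garbage generator; everyone else the real site.
    if any(bot in user_agent for bot in AI_CRAWLERS):
        return "http://127.0.0.1:42069"   # iocaine-style garbage
    return "http://127.0.0.1:8080"        # real content
```

In a real deployment this decision would live in the reverse proxy's own configuration (nginx, Caddy, etc.) rather than in application code.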
tldr.nettime · ASRG (@asrg@tldr.nettime.org)

Sabot in the Age of AI: a curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning.

🔻 iocaine: The deadliest AI poison; iocaine generates garbage rather than slowing crawlers.
https://git.madhouse-project.org/algernon/iocaine

🔻 Nepenthes: A tarpit designed to catch web crawlers, especially those scraping for LLMs. It devours anything that gets too close. @aaron@zadzmo.org
https://zadzmo.org/code/nepenthes/

🔻 Quixotic: Feeds fake content to bots and robots.txt-ignoring #LLM scrapers. @marcusb@mastodon.sdf.org
https://marcusb.org/hacks/quixotic.html

🔻 Poison the WeLLMs: A reverse proxy that serves dissociated-press style reimaginings of your upstream pages, poisoning any LLMs that scrape your content. @mike@mikecoats.social
https://codeberg.org/MikeCoats/poison-the-wellms

🔻 Django-llm-poison: A Django app that poisons content when served to #AI bots. @Fingel@indieweb.social
https://github.com/Fingel/django-llm-poison

🔻 KonterfAI: A model poisoner that generates nonsense content to degenerate LLMs.
https://codeberg.org/konterfai/konterfai