ROFLMAO.

Claude decided to crawl one of the sites on my new server, where known bots are redirected to an iocaine maze. Claude has been in the maze for 13k requests so far, over the course of 30 minutes.
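
A minimal sketch of what that kind of redirect can look like with Caddy as the reverse proxy; the bot list and the iocaine listen address below are placeholders, not the actual setup:

    # Sketch only: match a few known crawler user agents (illustrative list)
    # and hand them to iocaine instead of the real site.
    example.com {
        @aibots header_regexp User-Agent (?i)(claudebot|gptbot|ccbot|bytespider)
        handle @aibots {
            # iocaine's listen address is a placeholder here
            reverse_proxy 127.0.0.1:42069
        }
        handle {
            # everyone else gets the real site
            reverse_proxy 127.0.0.1:8080
        }
    }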

I will need to fine-tune the rate limiting, because it didn't hit any rate limits: it scanned from 902 different client IPs, so simply rate limiting by IP doesn't fly. I'll rate limit by (possibly normalized) user agent instead (they all used the same UA).
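
A sketch of per-agent rate limiting, assuming a Caddy build that includes the third-party github.com/mholt/caddy-ratelimit module; the zone name and the numbers are arbitrary:

    # Needs a Caddy build that compiles in github.com/mholt/caddy-ratelimit.
    # As a third-party directive it has no default order, so it may need an
    # `order rate_limit before reverse_proxy` global option or a route block.
    rate_limit {
        zone crawlers {
            # key on the User-Agent header; normalizing it first (for example
            # with the map directive) would fold minor UA variations together
            key    {http.request.header.User-Agent}
            events 100
            window 1m
        }
    }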

Over the course of those 30 minutes, it downloaded roughly 300 times less data than it would have if I had let it scrape the real thing, and each request took about a tenth of the time to serve. So I saved bandwidth, saved processing time, likely saved RAM too, and served garbage to Claude.

Job well done.

Link card: algernon/iocaine (MadHouse Git Repositories), "The deadliest poison known to AI."

@algernon this is awesome and I want to do it myself too. Is there a write-up or blog on how you set it up?

@Infosecben There are some notes in #iocaine's repo, here, and my exact setup is documented here (the server config is also free software).

Hope that helps! But if you have questions, feel free to @ me, I'm more than happy to help you serve garbage to the robots. :flan_evil:

Link card: algernon/iocaine, docs/deploying.md at main (MadHouse Git Repositories), "iocaine - The deadliest poison known to AI."

@algernon @Infosecben I thought I had heard that some of the bots use fake user agents that don't identify them as crawlers at all (so your proxy config there wouldn't catch them). Is that true?

@aburka @Infosecben Yep, some of them use fake user agents, and those are not caught in this trap. Yet.

I just configured my reverse proxy to direct /cgi-bin/ to the maze, and I will be adding links to the sites hosted there, so that crawlers will find it. I can then do some digging in the logs and figure out how to handle the misbehavers.
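
In Caddy terms that trap can be as small as this (the iocaine address is again a placeholder):

    # Anything under /cgi-bin/ goes straight to the maze; no real CGI lives there.
    handle /cgi-bin/* {
        reverse_proxy 127.0.0.1:42069
    }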

@algernon @aburka @Infosecben Is there a way to do dynamic config so that any source IP that requests something suspicious gets added to the maze list? I'm thinking there are some well-known resources that it makes no sense to see a request for in the course of a normal human visit to the website...

@arichtman @aburka @Infosecben I don't know if it is possible to set that up with Caddy out of the box. If there isn't, I can always write a module.

But first things first: trapping & limiting known baddies. Leading other baddies into the maze, and rate limiting within the maze, is the next step, and I'll iterate from there, likely by adding IP ranges or new user agents to the known baddies list.

It's a bit manual, but I'm not automating it until it turns out that automating it would save time.
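
Extending the known-baddies matching with an IP range is the same pattern with a different matcher; a sketch inside the same site block, using documentation-only example ranges:

    # Hypothetical offender ranges; replace with whatever the logs turn up.
    @badnets remote_ip 203.0.113.0/24 198.51.100.0/24
    handle @badnets {
        reverse_proxy 127.0.0.1:42069
    }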

@arichtman Anything that hits something robots.txt tells it not to hit is a candidate... @algernon @aburka @Infosecben

@BenAveling @arichtman @aburka @Infosecben A candidate, yes, but that in itself is far from enough indication. I think a better indicator is how much time it spends in the maze. A human won't spend much time there, and won't crawl links at lightning speed.

@algernon @Infosecben I liked an idea I saw around here: put a link on the main page saying "if you're human, don't click here", put the target URL of the link in robots.txt, and then put iocaine on the other end. That way humans won't click (at least not more than once...), well-behaved crawlers will stay out, and the bastards will get caught.
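
A sketch of that honeypot-link idea, assuming a hypothetical /trap/ path: disallow it in robots.txt, link to it from the page with a warning humans can read, and route it to iocaine:

    # robots.txt served by the site should contain:
    #   User-agent: *
    #   Disallow: /trap/
    # and the landing page carries a visible "if you're human, don't click this"
    # link pointing at /trap/.
    handle /trap/* {
        reverse_proxy 127.0.0.1:42069
    }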

@aburka @Infosecben yep, that's the plan (in addition to the current setup)!