ROFLMAO.
Claude decided to crawl one of the sites on my new server, where known bots are redirected to an iocaine maze. Claude has been in the maze for 13k requests so far, over the course of 30 minutes.
I will need to fine-tune the rate limiting, because it didn't hit any rate limits: it scanned using 902 different client IPs. So simply rate limiting by IP doesn't fly. I'll rate limit by (possibly normalized) user agent instead, since they all used the same UA.
Over the course of these 30 minutes, it downloaded roughly 300 times less data than if I had let it scrape the real thing, and each request took about a tenth of the time to serve that the real thing would have. So I saved bandwidth, saved processing time, likely saved RAM too, and served garbage to Claude.
Job well done.
@algernon this is awesome and I want to do it myself too. Is there a write-up or blog on how you set it up?
@Infosecben There are some notes in #iocaine's repo, here, and my exact setup is documented here (the server config is also free software).
Hope that helps! But if you have questions, feel free to @ me, I'm more than happy to help you serve garbage to the robots.
@algernon @Infosecben I thought I had heard some of the bots are using fake user agents that don't identify them as crawlers at all (so your proxy config there wouldn't catch them), is that true?
@aburka @Infosecben Yep, some of them use fake user agents, and those are not caught in this trap. Yet.
I just configured my reverse proxy to direct /cgi-bin/ to the maze, and I will be adding links to the sites hosted there, so that crawlers will find it. I can then do some digging in the logs and figure out how to handle the misbehavers.
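The log digging could start with something like the sketch below, which tallies who requested the trap path. It assumes access logs in the common "combined" format; the function name `trap_hits_by_agent` and the sample log lines are made up for illustration:

```python
import re
from collections import Counter, defaultdict

# Assumption: logs are in the "combined" access log format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def trap_hits_by_agent(lines, trap_prefix="/cgi-bin/"):
    """Count trap requests per user agent and collect the IPs each one used."""
    hits = Counter()
    ips = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like combined-format entries
        if match["path"].startswith(trap_prefix):
            hits[match["agent"]] += 1
            ips[match["agent"]].add(match["ip"])
    return hits, ips


# Made-up sample lines, using documentation-reserved IP addresses.
sample = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /cgi-bin/a HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.6 - - [10/Oct/2024:13:55:37 +0000] "GET /cgi-bin/b HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2024:13:55:38 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Firefox"',
]
hits, ips = trap_hits_by_agent(sample)
```

An agent that looks like a regular browser but shows up in `hits` from many addresses is a candidate for the fake-UA crawlers mentioned above.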
@algernon @Infosecben I liked an idea I saw around here: put a link on the main page saying "if you're human don't click here", put the target URL of the link in robots.txt, and then put iocaine on the other end. That way humans won't click (at least not more than once...), well-behaved crawlers will stay out, and the bastards will get caught.
@aburka @Infosecben yep, that's the plan (in addition to the current setup)!
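A rough sketch of the robots.txt trap idea, using Python's standard `urllib.robotparser` to show why a well-behaved crawler never reaches the maze. The `/trap/` path, the link markup, and the helper names are hypothetical, and the real routing would live in the reverse proxy rather than in Python:

```python
from urllib.robotparser import RobotFileParser

TRAP_PATH = "/trap/"  # hypothetical path; any otherwise unused URL prefix works

# robots.txt tells every crawler to stay out of the trap.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"

# The link on the main page that humans are asked not to follow.
TRAP_LINK = f'<a href="{TRAP_PATH}" rel="nofollow">If you are human, do not click here</a>'


def polite_crawler_may_fetch(path: str, agent: str = "ExampleBot") -> bool:
    """What a crawler that honours robots.txt would decide for this path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


def route(path: str) -> str:
    """Anything requesting the trap ignored robots.txt (or clicked anyway)."""
    return "maze" if path.startswith(TRAP_PATH) else "site"
```

A polite crawler gets `False` from `polite_crawler_may_fetch("/trap/page")` and stays away, so whatever `route` sends to the maze has either ignored robots.txt or is a curious human.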