@corbet Try to lose them in an AI-poisoning maze running on a different host to the real content? Robots.txt tells well-behaved crawlers to stay out of the maze, but every real page contains a link into it. The maze is much bigger than the real site so it should eat a higher proportion of the traffic, and it can be heavily throttled as there are no real users to worry about. Logs from the maze give you IPs to block from the real site, though as you note that doesn't help much if they only visit 3 times each.
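A minimal sketch of the robots.txt side, assuming the maze lives under a hypothetical /maze/ path:

```
# Well-behaved crawlers are told to stay out of the maze entirely;
# only scrapers that ignore robots.txt will follow the in-page links in.
User-agent: *
Disallow: /maze/
```

Any client that then shows up in the maze logs has, by definition, ignored robots.txt.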
@DanielRThomas @corbet I just saw an article about something like this called Nepenthes: https://www.404media.co/developer-creates-infinite-maze-to-trap-ai-crawlers-in/ I'm not sure how well it'll work here though, because each crawler source is only used for a handful of requests before moving on, so it's hard to identify a scraper on so few requests and serve it a tarpit without trapping legit users. Maybe combine it with something like @joeyh proposed: a page with bogus honeypot links is served as an interstitial in place of less common documents, with an auto-refresh to the legit document. The scraper then starts seeding its request queue with links you know are bogus, which become an instant trap for future requests. Ideally a normal client would just follow the meta-refresh and wouldn't try to pre-load the booby-trapped URLs, but this would need some validation, and for the LWN crowd that might include user-agents such as elinks, lynx and dillo, not just Firefox and Chrome.
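A rough sketch of what such an interstitial could look like; the document path and the honeypot URLs here are hypothetical placeholders:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- A normal client follows this immediately to the legit document -->
  <meta http-equiv="refresh" content="0; url=/doc/12345">
  <!-- Ask well-behaved crawlers not to index or follow the decoy -->
  <meta name="robots" content="noindex, nofollow">
</head>
<body>
  <p><a href="/doc/12345">Continue to the document</a></p>
  <!-- Honeypot links: never served anywhere else, so any later request
       for one of these marks the client as a scraper that harvested
       this page into its queue -->
  <a href="/maze/deadbeef">related</a>
  <a href="/maze/cafebabe">more</a>
</body>
</html>
```

Clients like lynx or dillo would need checking: they follow the meta-refresh, but the question is whether any of them pre-fetch in-page links.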
@raven667 @corbet @joeyh Yes, something like that, or even just putting links on pages with robots.txt set such that well-behaved crawlers ignore them. Longer list of options here: https://tldr.nettime.org/@asrg/113867412641585520