Got some complaints the #musl git server was sometimes giving errors from too much load. Load caused by abusive LLM scrapers hammering cgit. So I added frontend rules that apply a 10 byte/sec rate limit to client IPs making >5 requests per 10 sec, for as long as the hammering continues. Load average has plummeted.
I wasn't actually able to confirm the limits affecting anything I casually tried hitting the server with, but it appears to be working, so
haproxy recipe in case it's useful to anyone (or if anyone has recommendations to fix/improve it):
stick-table type ip size 10k expire 10s store http_req_rate(10s)
http-request track-sc0 src   # populate the table; without this, sc_http_req_rate(0) always reads 0
filter bwlim-out mylimit default-limit 10 default-period 1s   # ~10 bytes/sec
http-response set-bandwidth-limit mylimit if { sc_http_req_rate(0) gt 5 }
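A sketch of how these lines might sit in a full frontend section (the frontend/backend names and bind line are placeholders, not from the original post; the bwlim filter needs a recent haproxy, 2.7+):

```
frontend www
    bind :80                      # placeholder bind
    mode http

    # per-client-IP request-rate table; entries expire 10s after last update
    stick-table type ip size 10k expire 10s store http_req_rate(10s)
    # track the client source address in sticky counter 0 so the table fills
    http-request track-sc0 src
    # output bandwidth limiter: ~10 bytes/sec when applied
    filter bwlim-out mylimit default-limit 10 default-period 1s
    # throttle responses to IPs exceeding 5 requests per 10 seconds
    http-response set-bandwidth-limit mylimit if { sc_http_req_rate(0) gt 5 }

    default_backend cgit          # placeholder backend
```

Since the stick-table entries expire 10s after the last hit, the throttle lifts on its own once an IP stops hammering.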
@lanodan I specifically went with super-slow responses rather than errors, to keep them bogged down with open connections.
@dalias On my server I block them at the firewall level. Hosts are blocked for 72h, and on any given day there are at least 500 of them in the list; today there are more than 700.
The way I did it was with a Varnish rule that redirects to a daemon, which logs the hit and adds a rule to the firewall (rules are pruned after a few hours).
This list is updated hourly: https://tia.mat.br/blocked-ips.php (FWIW, it's not actually PHP; the .php extension is a hack so Varnish picks the right rule.)
@dalias Some of the rules also catch things probing for WordPress or phpMyAdmin and the like, and this has severely reduced the load and the amount of logs.
Another thing I did was add a fake robots.txt with a honeypot that blocks an IP address for 3 days if it doesn't follow the Disallow rule: https://tia.mat.br/robots.txt
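For illustration, a honeypot robots.txt of this kind might look like the following (the trap path is made up, not the one at the URL above; the blocking itself happens server-side, triggered by any client that fetches the forbidden path anyway):

```
User-agent: *
# hypothetical trap path: well-behaved crawlers skip it, and anything
# that requests it despite the Disallow gets its IP firewalled
Disallow: /trap/
```

The idea is that only bots ignoring robots.txt ever hit the trap, so false positives on legitimate crawlers are essentially zero.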
@lafp For now I'm happy with what I did. It got rid of all the excessive load with no risk of blocking any legitimate access.
@dalias Oh, absolutely! Limiting the bandwidth is a pretty good idea, especially since a lot of bots only apply timeouts to opening the connection, not to fetching the data.
@dalias 5 requests per 10 seconds seems like something a human could easily hit by accident when looking through commit logs.
@alwayscurious Yeah, probably should make it more like 20-30 per 10 sec.