User-agent aside, there are usually small details bots leave out, unless they are using headless Chrome of course. Most bots can't do HTTP/2, yet all common browsers support it. Most bots will not send a Sec-Fetch-Mode header (cors, no-cors, navigate), whereas browsers do. Some bots do not send an Accept-Language header. Those are just a few things one can look for and handle in simple web server ACLs. Some bots do not support HTTP keep-alive, though dropping connections that lack keep-alive can also knock out some poorly behaved middleboxes.
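As a rough illustration of those header checks (a hedged sketch in Python rather than a real server ACL; the header names are standard, but the scoring policy and threshold are assumptions, and an HTTP/2 check would have to live in the server or proxy, not here):

```python
# Hedged sketch: flag requests missing the small details browsers send.
SUSPICION_THRESHOLD = 2  # assumption, tune for your traffic

def suspicion_score(headers):
    """Count details that bots commonly leave out."""
    score = 0
    if headers.get("Sec-Fetch-Mode") not in ("navigate", "cors", "no-cors", "same-origin"):
        score += 1  # browsers send Sec-Fetch-Mode on every request
    if "Accept-Language" not in headers:
        score += 1  # browsers essentially always send Accept-Language
    if headers.get("Connection", "").lower() == "close":
        score += 1  # weak signal: client not using keep-alive
    return score

class BrowserDetailFilter:
    """WSGI middleware that rejects requests scoring above the threshold."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        headers = {
            key[5:].replace("_", "-").title(): value
            for key, value in environ.items()
            if key.startswith("HTTP_")
        }
        if suspicion_score(headers) >= SUSPICION_THRESHOLD:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)
```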
At the TCP layer, some bots do not set the MSS option, or use very strange values. This can run into false positives, so I just don't publish IPv6 records for my web servers and then limit IPv4 to an MSS range of 1280 to 1460, which knocks out many bots.
There is always the possibility of false positives, but those can be logged and reviewed, and treated as acceptable losses should the load on the servers get too high. Another mitigating control is to analyze previous logs and use maps to exclude people who post on a regular basis or have logins to the site, assuming none of them are part of the problem. If a registered user is part of the problem, give them an error page after {n} requests.
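To make the log-analysis idea concrete, here is a minimal sketch, assuming a common log format where the client IP is the first field; the visit threshold and the {n} cap are placeholder numbers:

```python
# Hedged sketch: build an "exclude map" from previous access logs and cap
# everyone else after {n} requests. Log format and numbers are assumptions.
from collections import Counter

REQUEST_CAP = 500        # the "{n}" above; pick something site-appropriate
MIN_PAST_VISITS = 20     # how regular a visitor must be to be excluded

def build_allowlist(log_lines):
    """IPs that showed up regularly in earlier logs (regular posters, logins)."""
    visits = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip for ip, count in visits.items() if count >= MIN_PAST_VISITS}

def should_serve(ip, allowlist, counters):
    """Allowlisted IPs pass; everyone else gets an error page after the cap."""
    if ip in allowlist:
        return True
    counters[ip] = counters.get(ip, 0) + 1
    return counters[ip] <= REQUEST_CAP
```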
We have been suffering from this. It's easy enough to weather high traffic loads for plain pages, but our issue is targeted applications. Things like website search bars are getting targeted with functional searches for sub-pages, content by labels, etc. It causes the web server to run out of handles for the pending database lookups.
A real mess. The problem is that these searches are valid, and the page will return a 200 result with a "Nothing in that search found!" type of message. Why would the crawler ever stop? It's going to work and work until we all die, and there will still be another epoch of search-term combos left to try.
We solve problems like this all the time, but we're hitting another level and really exposing some issues. Ideally our WAF can start to kick the traffic. It's good to see other people having this issue. We first started addressing this last fall -- around November.
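For the handle-exhaustion failure described above, one stopgap is to bound how many search lookups can be in flight at once and shed the rest before they ever reach the database. A minimal sketch, assuming an async handler and a hypothetical run_search coroutine; the concurrency limit and timeout are made-up numbers:

```python
import asyncio

MAX_CONCURRENT_SEARCHES = 32        # assumption: size this to the DB pool
ACQUIRE_TIMEOUT = 0.5               # seconds to wait before shedding load
_search_slots = asyncio.Semaphore(MAX_CONCURRENT_SEARCHES)

async def handle_search(query, run_search):
    """Run the search if a slot is free; otherwise fail fast with a 503."""
    try:
        # Fail fast instead of queueing unbounded work behind the database.
        await asyncio.wait_for(_search_slots.acquire(), timeout=ACQUIRE_TIMEOUT)
    except asyncio.TimeoutError:
        return 503, "Search is busy, please try again shortly."
    try:
        return 200, await run_search(query)
    finally:
        _search_slots.release()
```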
We see similar issues on LessWrong. We're constantly being hit by bots that are egregiously badly behaved. Common behaviors include making far more requests per second than our entire userbase combined, distributing those requests between many IPs in order to bypass the rate limit on our firewall, and making each request with a unique user-agent string randomly drawn from a big list of user agents, to prevent blocking them that way. They ignore robots.txt. Other than the IP address, there's no way to identify them or find an abuse contact.
We need some kind of fail2ban for AI scrapers. Fingerprint them, then share the fingerprint databases via torrent or something.
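A starting point could be hashing a few coarse request traits into something shareable (a hypothetical sketch; which traits actually stay stable across a scraping farm is the hard part):

```python
import hashlib

def scraper_fingerprint(ip_prefix, header_names, tls_fingerprint=None):
    """Hash a few coarse request traits into a shareable, anonymized token."""
    parts = [ip_prefix, ",".join(sorted(header_names))]
    if tls_fingerprint:
        parts.append(tls_fingerprint)   # e.g. a JA3 hash, if the proxy exposes one
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```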
For THIS application, would a boring rate-limiter not help? I mean it won't get rid of the DOS part of this right off, but these are not agents/bots MEANING to DOS a site, so if they get a 429 often enough they should back off on their own, no?
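The "boring rate-limiter" here is just a per-client token bucket that starts answering 429 once the bucket runs dry; a minimal sketch, with arbitrary rate and burst numbers:

```python
import time

RATE = 2.0        # tokens added per second (assumed)
BURST = 10.0      # bucket capacity (assumed)
_buckets = {}     # client ip -> (tokens, last refill time)

def allow(ip):
    """Token bucket per IP: False means the caller should answer 429."""
    tokens, last = _buckets.get(ip, (BURST, time.monotonic()))
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        _buckets[ip] = (tokens, now)
        return False          # caller responds with 429 Too Many Requests
    _buckets[ip] = (tokens - 1.0, now)
    return True
```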
they explicitly mention this:
> each IP stays below the thresholds for our existing circuit breakers, but the overload is overwhelming.
We need something like DCC[0] or the Spamhaus blocklist, but for the web, not for email.
[0] https://en.wikipedia.org/wiki/Distributed_Checksum_Clearingh...
There are already legal remedies in place. This isn't a technical problem. We can't advocate for three or four behemoths gatekeeping whitelists they're willing to peer with, which is how spam is currently controlled.
Huh? It's the opposite.
Megacorps have no problem with illegal AI scraping - they have departments full of people whose job is scraping detection and prevention, expensive lawyers who can sue scrapers, and plenty of machines to handle any scraping-related load.
But small orgs and individuals have none of that (see the OP article) - how would someone like LWN's Jonathan Corbet use "legal remedies" against scraper bots which use fake user agents, a wide variety of IP addresses, and don't announce themselves? Suing random ISPs is going to be prohibitively expensive.
It really seems to me that it's time for the web to get the same distributed protection mechanisms email did. It really is a technical problem. If you have ever run your own mail server (I used to), you know that it would be impossible without relying on non-commercial/non-profit organizations like Spamhaus and DCC.
Specifically, I am hoping for a "distributed blocklist" - by themselves, Fedora infra or LWN don't see enough traffic to detect distributed scraper traffic. But if enough small websites pool their logs together via some DCC-like technology, then they might be able to detect scraping farms and block them for good. Throw in some poisoner tech, like Nepenthes, and this might have a good enough false-positive/false-negative rate to be usable.
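In DCC-like terms, each participating site would report only coarse hashes of the sources it sees, and the pool would flag anything reported by many independent sites. A toy sketch of that idea (the threshold and what exactly gets hashed are assumptions, not an existing protocol):

```python
import hashlib
from collections import defaultdict

FLAG_THRESHOLD = 5   # assumed: independent sites that must report a source

class Clearinghouse:
    """Toy DCC-style pool: sites report hashed sources; heavy hitters get flagged."""
    def __init__(self):
        self._reports = defaultdict(set)   # source hash -> reporting site ids

    def report(self, site_id, source):
        """A site reports a suspicious source (IP, subnet, or fingerprint)."""
        digest = hashlib.sha256(source.encode()).hexdigest()
        self._reports[digest].add(site_id)

    def blocklist(self):
        """Hashes reported by enough independent sites to be worth blocking."""
        return {h for h, sites in self._reports.items() if len(sites) >= FLAG_THRESHOLD}
```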
It's inappropriate to usurp the language of cyberattacks just to denigrate certain traffic. It might be true that this category of traffic is too voluminous for them to handle at their current capacity, resulting in bad service.
However, cyberattacks, especially distributed ones, require intentionality, and they require coordination.
There is obvious coordination: all those requests come from different IP ranges, but rarely hit the same pages.
The intentionality is not as clear, but I think "willful bypassing of protections the site owner put up" counts for something? If a local store has a "one free gift per household" ad, and someone tries to get multiple free gifts by presenting fake addresses, that would be at least fraud. Similarly, if a scraping service rotates IP addresses and user agents to prevent blocks, it should be fraud as well, and maybe an "unintentional cyberattack" if the server has trouble or falls over.
I can't speak to LWN, but from what I've seen this is a bot that crawls the site, generates search terms and "deeper" crawling techniques using AI, and then makes another set of requests.
This would mean generating topical queries to search for, e.g.,
Again -- they don't say specifically what the traffic is doing, and this is just an example, but in this scenario DDoS is probably closer to accurate.