This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.
Sure, the people who make the AI scraper bots are going to figure out how to actually do the work. The point is that they hadn't, and this worked for quite a while.
As the botmakers circumvent, new methods of proof-of-notbot will be made available.
It's really as simple as that. If a new method comes out and your site is safe for a month or two, great! That's better than dealing with fifty requests a second, wondering if you can block whole netblocks, and if so, which.
This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
> they are just feigning a lack of understanding to be dismissive of Anubis.
I actually find the featured article very interesting. It doesn't feel dismissive of Anubis, but rather it questions whether this particular solution makes sense or not in a constructive way.
I still don't understand what Anubis solves if it can be bypassed this easily: if you use a user-agent switcher as a Firefox addon (I emulate wget) on kernel.org or ffmpeg.org, you save the entire check time and skip Anubis outright. Apparently they whitelist certain user agents on these domains to allow legitimate wget usage. But if I, an honest human, can do that, the scrapers and grifters can too.
Anyone can try this themselves. This is by no means an attack on Anubis, but it raises the question: can you even protect a domain if you force yourself to whitelist (for a full bypass) easy-to-guess user agents?
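For the curious, this is roughly all the "work" a scraper has to do to use that hole (a sketch assuming the UA whitelist behavior described above; the target URL is a placeholder and the exact whitelisted strings are site-specific):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder target: any Anubis-protected page on a domain that
	// whitelists wget-style user agents (as described above).
	req, err := http.NewRequest("GET", "https://git.example.org/", nil)
	if err != nil {
		panic(err)
	}
	// Present ourselves as wget; if the UA is whitelisted, the Anubis
	// interstitial and its proof-of-work never appear.
	req.Header.Set("User-Agent", "Wget/1.21.4")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes received")
}
```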
A lot of scrapers are actually utilizing malware installed on residential users' machines, so the request legitimately comes from a Chrome UA on a residential IP.
It really should be recognised just how many people sit through Cloudflare interstitials on nearly every site these days (and I totally get why this happens), yet make a huge amount of noise about Anubis on a very small number of sites.
You know, you say that, and while I understand where you're coming from I was browsing the git repo when github had a slight error and I was greeted with an angry pink unicorn. If Github can be fun like that, Anubis can too, I think.
Yeah, but do people like that? It feels pretty patronizing to me in a similar way. Like "Weee! So cute that our website is broken, good luck doing your job! <3"
I think it's reasonable and fair, and something you are expected to tolerate in a free world. In fact, I think it's rather unusual to take this benign and inconsequential thing as personally as you do.
Not at all. I can't stand it either. It's definitely patronising and infantile. I tolerate the silliness, grit my teeth and move on but it wears away at my patience.
I wonder why the anime girl is received so badly. Is it because it's seen as childish? Is it bad because it confuses people (i.e. don't do this because others don't do this)?
Thinking about it logically, putting some "serious" banner there would just make everything a bit more grey and boring and would make no functional difference. So why is it disliked so much?
Why? It has sexual connotations, and it involves someone under the age of consent. As wikipedia puts it: "In a 2010 critique of the manga series Loveless, the feminist writer T. A. Noonan argued that, in Japanese culture, catgirl characteristics have a similar role to that of the Playboy Bunny in western culture, serving as a fetishization of youthful innocence."
Keep in mind that the author explicitly asks you not to do this, and offers a paid white label version. You can still do it yourself, but maybe you shouldn’t.
Anubis was originally an open source project built for a personal blog. It gained traction, but the anime girl remained so that people are reminded of the nature of the project. Comparing it with Cloudflare is truly absurd. That said, a paid version is available with guard page customization.
The annoying thing about cloudflare is that most of the time once you’re blocked: you’re blocked.
There’s literally no way for you to bypass the block if you’re affected.
It's incredibly scary. I once had a bad user agent (without knowing it) and half the internet went offline for me. I couldn't even access documentation or my email provider's site, and there was no contact information or debugging information to help me resolve it: just a big middle finger from half the internet.
I haven’t had issues with any sites using Anubis (yet), but I suspect there are ways to verify that you’re a human if your browser fails the automatic check at least.
CloudFlare is dystopic. It centralizes even the part of the Internet that hadn't been centralized before. It is a perfect Trojan horse to bypass all encryption. And it chooses who accesses (a considerable chunk of) the Internet and who doesn't.
IaaS: Only if you do TLS termination at their gateway, otherwise not really, they'd need to get into your operating system to get the keys which might not always be easy. They could theoretically MITM the KVM terminal when you put in your disk decryption keys but that seems unlikely.
It could be a lot worse. Soccer rights-holders effectively shut down the Cloudflare-facilitated Internet in Spain during soccer matches to 'curb piracy'.
The Soccer rightsholders - LaLiga - claim more than 50% of pirate IPs illegally distributing its content are protected by Cloudflare. Many were using an application called DuckVision to facilitate this streaming.
Telefónica, the ISP, upon realizing they couldn’t directly block DuckVision’s IP or identify its users, decided on a drastic solution: blocking entire IP ranges belonging to Cloudflare, which continues to affect a huge number of services that had nothing to do with soccer piracy.
Now imagine your government provided internet agent gets blacklisted because your linked social media post was interpreted by an LLM to be anti-establishment, and we are painting a picture of our current trajectory.
The question might become, what side of the black wall are you going to be on?
Seriously though I do think we are going to see increasing interest in alternative nets, especially as governments tighten their control over the internet or even break away into isolated nation nets.
Paradoxically, the problem with an "alternative net" (which could be tunneled over the regular one) is keeping it alternative. It has to be kept small and un-influential in order to stay under the radar. If you end up with an "alternative" which is used by journalists and politicians, you've just reinvented the mainstream, and you're no longer safe from being hit by a policy response.
Think private trackers. The opposite of 4chan, which is an "alternative" that got too influential in setting the tone of the rest of the internet.
The truth is the internet was never designed or intended to host private information. It was created for scientists by scientists to share research papers. Capitalists perverted it.
I'm on an older system here, and both Cloudflare and Anubis entirely block me out of sites. Once you start blocking actual users out of your sites, it simply has gone too far. At least provide an alternative method to enter your site (e.g. via login) that's not hampered by erroneous human checks. Same for the captchas where you help train AIs by choosing out of a set of tiny/noisy pictures. I often struggle for 5 to 10 minutes to get past that nonsense. I heard bots have less trouble.
Basically we're already past the point where the web is made for actual humans, now it's made for bots.
> Once you start blocking actual users out of your sites, it simply has gone too far.
It has, scrapers are out of control. Anubis and its ilk are a desperate measure, and some fallout is expected. And you don't get to dictate how a non-commercial site tries to avoid throttling and/or bandwidth overage bills.
Because I don't have the fucking time to deal with AI scraper bots. I went harder - anything even looking suspiciously close to a scraper that's not on Google's index [1] or has wget in its user agent gets their entire /24 hard banned for a month, with an email address to contact for unbanning.
That seems to be a pretty effective way, for now, to keep scrapers, spammers and other abusive behavior away. Normal users don't do certain site actions at the speed that scraper bots do, there's no other practically relevant search engine than Google, I've never once seen an abusive bot disguise itself as wget (they all try to look like a human-operated web browser), and no AI agent yet is smart enough to figure out how to interpret the message "Your ISP's network appears to have been used by bot activity. Please write an email to xxx@yyy.zzz with <ABC> as the subject line (or click on this pre-filled link) and you will automatically get unblocked".
Simple. A honeypot link in a menu three levels deep that no ordinary human would care about and that, thanks to a JS animation, takes a human at least half a second to click. Any bot that clicks it in less than half a second gets the banhammer. No need for invasive tracking, third-party integrations, whatever.
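Roughly, the logic looks like this (my own sketch, not the exact setup; the honeypot path, the 500 ms threshold, the in-memory ban list and the contact address are all made up for illustration):

```go
package main

import (
	"net"
	"net/http"
	"sync"
	"time"
)

var (
	mu          sync.Mutex
	lastServed  = map[string]time.Time{} // when we last served a page to this IP
	bannedUntil = map[string]time.Time{} // active bans (a real setup would ban the /24 and persist this)
)

func clientIP(r *http.Request) string {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r.RemoteAddr
	}
	return host
}

func handler(w http.ResponseWriter, r *http.Request) {
	ip := clientIP(r)
	mu.Lock()
	defer mu.Unlock()

	if until, ok := bannedUntil[ip]; ok && time.Now().Before(until) {
		http.Error(w, "Banned. Email admin@example.org to be unblocked.", http.StatusForbidden)
		return
	}

	// Hypothetical honeypot URL, linked only from a deeply nested menu and
	// animated so that a human needs at least ~500ms to reach it.
	if r.URL.Path == "/menu/extras/archive/legacy-export" {
		if served, ok := lastServed[ip]; ok && time.Since(served) < 500*time.Millisecond {
			bannedUntil[ip] = time.Now().Add(30 * 24 * time.Hour) // banhammer: one month
			http.Error(w, "Banned.", http.StatusForbidden)
			return
		}
	}

	lastServed[ip] = time.Now()
	w.Write([]byte("normal page content"))
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```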
That does sound like a much more human-friendly approach than Anubis. I agree that tarpits and honeypots are a good stopgap until the legal system catches up to the rampant abuse by these "AI" companies. It's when your solutions start affecting real human users just because they are not "normal" in some way that I stop being sympathetic.
FYI - you can communicate with the author of Anubis, who has already said she's working on ways to make sure that all browsers - links, lynx, dillo, midori, et cetera, work.
Unless you're paying Cloudflare a LOT of money, you won't get to talk with anyone who can or will do anything about issues. They know about their issues and simply don't care.
If you don't mind taking a few minutes, perhaps put some details about your setup in a bug report?
I’m planning a trip to France right now, and it seems like half the websites in that country (for example, ratp.fr for Paris public transport info) require me to check a CloudFlare checkbox to promise that I am a human. And of those that don’t, quite a few just plain lock me out...
I find the same when using some foreign sites. I think the operator must have configured that France is OK, maybe neighboring countries too, the rest of the world must be checked.
You might have to show a passport when you enter France, and have your baggage and person (intrusively) scanned if you fly there, for much the same reason.
People, some of them in positions of government in some nation states want to cause harm to the services of other states. Cloudflare was probably the easiest tradeoff for balancing security of the service with accessibility and cost to the French/Parisian taxpayer.
Not that I'm happy about any of this, but I can understand it.
Even when not on VPN, if a site uses the CloudFlare interstitials, I will get it every single time - at least the "prove you're not a bot" checkbox. I get the full CAPTCHA if I'm on a VPN or I change browsers. It is certainly enough to annoy me. More than Anubis, though I do think Anubis is also annoying, mainly because of being nearly worthless.
For me both are things that mostly show up for 1-3 seconds, then get replaced by the actual website. I suspect that's the user experience of 99% of people.
If you fall in the other 1% (e.g. due to using unusual browsers or specific IP ranges), Cloudflare tends to be much worse.
You must be on a good network. You should run one of those "get paid to share your internet connection with AI companies" apps. Since you're on a good network you might make a lot of money. And then your network will get cloudflared, of course.
We should repeat this until every network is cloudflared and everyone hates cloudflare and cloudflare loses all its customers and goes bankrupt. The internet would be better for it.
I hit Cloudflare's garbage about as much as I hit Anubis. With the difference that far more sites use Cloudflare than Anubis, thus Anubis is far worse at triggering false positives.
The article doesn't say, and I constantly get the most difficult Google captchas, Cloudflare block pages saying "having trouble?" (which is a link to submit a ticket that seems to land in /dev/null), IP blocks because of user agent spoofing, and "unsupported browser" errors when I don't spoof the user agent. The only anti-bot thing that reliably works on all my clients is Anubis. I'm really wondering what kinds of false positives you think Anubis has, since (as far as I can tell) it's a completely open and deterministic algorithm that just lets you in if you solve the challenge, and as the author of the article demonstrated with some C code (if you don't want to run the included JavaScript that does it for you), that works even if you are a bot. And afaik that's the point: no heuristics and no false positives, just a straight game of costs; making bad scraping behavior simply cost more than implementing caching correctly or using commoncrawl.
I've had Anubis repeatedly fail to authorize me to access numerous open source projects, including the mesa3d gitlab, with a message looking something like "you failed".
As a legitimate open source developer and contributor to buildroot, I've had no recourse besides trying other browsers, networks, and machines, and it's triggered on several combinations.
Interesting, I didn't even know it had such a failure mode. Thanks for the reply, I'll sadly have to update my opinion on this project since it's apparently not a pure "everyone is equal if they can Prove the Work" system as I thought :(
I'm curious how, though, since the submitted article doesn't mention that and demonstrates curl working (which is about as low as you can go on the browser emulation front), but no time to look into it atm. Maybe it's because of an option or module that the author didn't have enabled
It sounds[1] like this was an issue with assumptions regarding header stability. Hopefully as people update their installations things will improve for us end users.
That's called authentication. In the case of the stalker, by biometrics (facial recognition). This could be a solution.
But that's not what Cloudflare does. Cloudflare guesses whether you are a bot and then either blocks you or not. If it currently likes you, bless your luck.
If your boss doesn't want you to browse the web, where some technical content is accompanied by an avatar that the author likes, they may not be suitable as a boss, or at least not for positions where it's their job to look over your shoulder and make sure you're not watching series during work time. It seems like a weird place to work if they need to check that anyway.
I fail to see how this particular "anime girl" and the potential for clients seeing it, could make you think that's a fair request. That seems extremely ridiculous to me.
It's an MIT licensed, open project. Fork it and change the icon to your favorite white-bread corporate logo if you want. It would probably take less time than complaining about it on HN.
I think the complaint is rather that you don't know when it will rear its face on third-party websites that you are visiting as part of work. Forking wouldn't help with not seeing it on other sites
(Even if I agree that the boss or customers should just get over it. It's not like they're drawing genitalia on screen and it's also easily explainable if they don't already know it themselves.)
Add a rule to your adblocker for the image, then. The main site appears to have it at `anubis.techaro.lol/.within.website/x/cmd/anubis/static/img/happy.webp?cacheBuster=v1.21.3-43-gb0fa256`, so a rule for `||*/.within.website/x/cmd/anubis/static/img/$image` ought to work for ublock origin (purely a guess regarding wildcards for domain, I've never set a rule without a domain before)
Cloudflare's solution works without javascript enabled unless the website turns up the scare level to max or you are on an IP with already bad reputation. Anubis does not.
But at the end of the day both are shit and we should not accept either. That includes not using one as an excuse for the other.
Laughable. They say this, but anyone who actually surfs the web with a non-bleeding-edge, non-corporate browser gets constantly blocked by Cloudflare. The idea that their JS computational paywalls only pop up rarely is absurd. Anyone believing this line lacks lived experience. My Comcast IP shouldn't have a bad rep, and using a browser from ~2015 shouldn't make me scary. But I can't even read bills on congress.gov anymore thanks to bad CF deployments.
Also, Anubis does have a non-JS mode: the meta-refresh based challenge in the HTML head. It's just that the type of people who use Cloudflare or Anubis almost always deploy the default (mostly broken) configs that block as many humans as bots. And they never realize it because they only measure such things with JavaScript.
Over the past few years I've read far more comments complaining about Cloudflare doing it than Anubis. In fact, this discussion section is the first time I've seen people talking about Anubis.
It sounds like you're saying that it's not the proof-of-work that's stopping AI scrapers, but the fact that Anubis imposes an unusual flow to load the site.
If that's true Anubis should just remove the proof-of-work part, so legitimate human visitors don't have to stare at a loading screen for several seconds while their device wastes electricity.
> If that's true Anubis should just remove the proof-of-work part
This is my very strong belief. To make it even clearer how absurd the present situation is, every single one of the proof-of-work systems I’ve looked at has been using SHA-256, which is basically the worst choice possible.
Proof-of-work is bad rate limiting which depends on a level playing field between real users and attackers. This is already a doomed endeavour. Using SHA-256 just makes it more obvious: there’s an asymmetry factor in the order of tens of thousands between common real-user hardware and software, and pretty easy attacker hardware and software. You cannot bridge such a divide. If you allow the attacker to augment it with a Bitcoin mining rig, the efficiency disparity factor can go up to tens of millions.
These proof-of-work systems are only working because attackers haven’t tried yet. And as long as attackers aren’t trying, you can settle for something much simpler and more transparent.
If they were serious about the proof-of-work being the defence, they’d at least have started with something like Argon2d.
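To make the asymmetry concrete, this is about all the work a SHA-256 prefix-style challenge asks for (a generic sketch, not Anubis's exact challenge format): find a nonce so the hash of challenge plus nonce has N leading zero bits. A plain single-threaded CPU loop like the one below already does millions of hashes per second; GPUs and ASICs multiply that by orders of magnitude, while a phone running the challenge in JavaScript sits at the bottom of the curve.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
	"time"
)

// leadingZeroBits counts the leading zero bits of a 32-byte hash.
func leadingZeroBits(h [32]byte) int {
	n := 0
	for _, b := range h {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

func main() {
	challenge := []byte("placeholder-challenge") // would come from the server
	const difficulty = 20                        // leading zero bits required (~1M hashes on average)

	buf := make([]byte, len(challenge)+8)
	copy(buf, challenge)

	start := time.Now()
	for nonce := uint64(0); ; nonce++ {
		binary.LittleEndian.PutUint64(buf[len(challenge):], nonce)
		if leadingZeroBits(sha256.Sum256(buf)) >= difficulty {
			fmt.Printf("nonce %d found in %v\n", nonce, time.Since(start))
			return
		}
	}
}
```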
Anubis also relies on modern web browser features:
- ES6 modules to load the client-side code and the proof-of-work challenge code.
- Web Workers to run the proof-of-work challenge in a separate thread to avoid blocking the UI thread.
- Fetch API to communicate with the Anubis server.
- Web Cryptography API to generate the proof-of-work challenge.
This ensures that browsers are decently modern in order to combat most known scrapers. It's not perfect, but it's a good start.
This will also lock out users who have JavaScript disabled, prevent your server from being indexed in search engines, require users to have HTTP cookies enabled, and require users to spend time solving the proof-of-work challenge.
This does mean that users using text-only browsers or older machines where they are unable to update their browser will be locked out of services protected by Anubis. This is a tradeoff that I am not happy about, but it is the world we live in now.
Except this is exactly the problem. Now you are checking for mainstream browsers instead of some notion of legitimate users. And as TFA shows, a motivated attacker can bypass all of that while legitimate users of non-mainstream browsers are blocked.
Aren't most scrapers using things like Playwright or Puppeteer anyway by now, especially since so many pages are rendered using JS and even without Anubis would be unreadable without executing modern JS?
... except when you do not crawl with a browser at all. It's trivial to solve, just like the taviso post demonstrated.
This makes zero sense; it is simply the wrong approach. I'm already tired of saying so and getting attacked for it. So I'm glad professional-random-Internet-bullshit-ignorer Tavis Ormandy wrote this one.
All this is true, but also somewhat irrelevant. In reality the amount of actual hash work is completely negligible.
For usability reasons Anubis only requires you to go through the proof-of-work flow once in a given period. (I think the default is once per week.) That's just very little work.
Detecting that you need to occasionally send a request through a headless browser is far more of a hassle than the PoW. If you prefer LLMs rather than normal internet search, it'll probably consume far more compute as well.
> For usability reasons Anubis only requires you to go through the proof-of-work flow once in a given period. (I think the default is once per week.) That's just very little work.
If you keep cookies. I do not want to keep cookies for otherwise "stateless" sites. I have maybe a dozen sites whitelisted, every other site loses cookies when I close the tab.
A bigger problem is that you should not have to enable javascript for otherwise static sites. If you enable JS, cookies are a relatively minor issue compared to all the other ways the website can keep state about you.
+1 for go-away. It's a bit more involved to configure, but worth the effort imo. It can be considerably more transparent to the user, triggering the nuclear PoW check less often, while being just as effective, in my experience.
My family takes care of a large-ish forest, so I've had to help since my early teens.
Let me tell you: think twice, it's f*ckin dangerous. Chainsaws, winches, heavy trees falling and breaking in unpredictable ways. I had a couple of close calls myself. Recently a guy from a neighbor village was squashed to death by a root plate that tilted.
I often think about quitting tech myself, but becoming a full-time lumberjack is certainly not an alternative for me.
Hah, I know, been around forests since childhood, seen (and done) plenty of sketchy stuff. For me it averages out to couple days of forest work a year. It's backbreaking labour, and then you deal with the weather.
But man, if tech goes straight into cyberpunk dystopia but without the cool gadgets, maybe it is the better alternative.
Worth getting to know the ins and outs of forest management now. I don't think AI will take most tech jobs soon, but they sure as hell are already making them boring.
I don't think anything will stop AI companies for long. They can do spot AI agentic checks of workflows that stop working for some reason and the AI can usually figure out what the problem is and then update the workflow to get around it.
1) scrapers just run a full browser and wait for the page to stabilize. They did this before this thing launched, so it probably never worked.
2) The AI reading the page needs something like 5 seconds * 1600W to process it. Assuming my phone can even perform that much compute as efficiently as a server class machine, it’d take a large multiple of five seconds to do it, and get stupid hot in the process.
Note that (2) holds even if the AI is doing something smart like batch processing 10-ish articles at once.
Yes. Obviously dumb but also nearly 100% successful at the current point in time.
And likely going to stay successful as the non-protected internet still provides enough information to dumb crawlers that it’s not financially worth it to even vibe-code a workaround.
Or in other words: Anubis may be dumb, but the average crawler that completely exhausts some sites' resources is even dumber.
And so it all works out.
And so the question remains: how dumb was it exactly, when it works so well and continues to work so well?
I understand this as an argument that it’s better to be down for everyone than have a minority of users switch browsers.
I'm not convinced that makes sense.
Now ideally you would have the resources to serve all users and all the AI bots without performance degradation, but for some projects that’s not feasible.
Does it work well? I run Chromium controlled by Playwright for scraping, and typically make Gemini implement the script for it because it's not worth my time otherwise, but I'm not crawling the Internet generally (which I think there is very little financial incentive to do; it's a very expensive process even ignoring Anubis et al); it's always that I want something specific and am sufficiently annoyed by the lack of an API.
regarding authentication mentioned elsewhere, passing cookies is no big deal.
Anubis is not meant to stop single endpoints from scraping. It's meant to make it harder for massive AI scrapers. The problematic ones evade rate limiting by using many different ip addresses, and make scraping cheaper on themselves by running headless. Anubis is specifically built to make that kind of scraping harder as i understand it.
And of all the high-profile projects implementing it, like the LKML archives, none have backed down yet, so I’m assuming the initial improvement in numbers must continue or it would have been removed since
I run a service under the protection of go-away[0], which is similar to Anubis, and can attest it works very well, still. Went from constant outages due to ridiculous volumes of requests to good load times for real users and no bad crawlers coming through.
the workaround is literally just running a headless browser, and that's pretty much the default nowadays.
if you want to save some $$$ you can spend like 30 minutes making a cracker like in the article. just make it multi threaded, add a queue and boom, your scraper nodes can go back to their cheap configuration. or since these are AI orgs we're talking about, write a gpu cracker and laugh as it solves challenges far faster than any user could.
custom solutions aren't worth it for individual sites, but with how widespread anubis is it's become worth it.
I agree. Your estimate for (2), about 0.0022 kWh, corresponds to about a sixth of the charge of an iPhone 15 Pro and would take longer than ten minutes on the phone, even at max power draw. It feels about right for the amount of energy/compute of a large modern MoE loading large pages of several 10k tokens. For example, this tech (a couple of months old) could input 52.3k tokens per second to a 672B parameter model, per H100 node instance, which probably burns about 6–8kW while doing it. The new B200s should be about 2x to 3x more energy efficient, but your point still holds within an order of magnitude.
The argument doesn't quite hold. The mass scraping (for training) is almost never done by a GPU system; it's almost always done by a dedicated system running a full Chrome fork in some automated way (not just the signatures but some bugs give that away).
And frankly, processing a single page of text happens within a single token window, so it likely runs for a blink (ms) before moving on to the next data entry. The kicker is that it's run over potentially thousands of times depending on your training strategy.
At inference there's now a dedicated tool that may perform a "live" request to scrape the site contents. But then this is just pushed into a massive context window to give the next token anyway.
The point is that scraping is already inherently cost-intensive so a small additional cost from having to solve a challenge is not going to make a dent in the equation. It doesn't matter what server is doing what for that.
100 billion web pages * 0.02 USD of PoW/page = 2 billion dollars. The point is not to stop every scraper/crawler; the point is to raise the costs enough to avoid being bombarded by all of them.
Yes, but it's not going to be 0.02 USD of PoW per page! That is an absurd number. It'd mean a two-hour proof of work for a server CPU, a ten hour proof of work for a phone.
In reality you can do maybe a 1/10000th of that before the latency hit to real users becomes unacceptable.
And then, the cost is not per page. The cost is per cookie. Even if the cookie is rate-limited, you could easily use it for 1000 downloads.
Those two errors are multiplicative, so your numbers are probably off by about 7 orders of magnitudes. The cost of the PoW is not going to be $2B, but about $200.
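Spelled out with the parent comments' rough numbers, the back-of-the-envelope correction is:

```latex
% Per-page PoW cost once latency limits and cookie reuse are factored in:
\frac{\$0.02}{10^{4}} \times \frac{1}{10^{3}} = \$2\times10^{-9}\ \text{per page}
\qquad
10^{11}\ \text{pages} \times \$2\times10^{-9}/\text{page} \approx \$200
```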
The problem is that 7 + 2 on a submission form only affects people who want to submit something, Anubis affects every user who wants to read something on your site
The question then is why read-only users are consuming so many resources that serving them big chunks of JS instead reduces load on the server. Maybe improve your rendering and/or caching before employing DRM solutions that are doomed to fail anyway.
The problem it's originally fixing is bad scrapers accessing dynamic site content that's expensive to produce, like trying to crawl all diffs in a git repo, or all mediawiki oldids.
Now it's also used on mostly static content because it is effective vs scrapers that otherwise ignore robots.txt.
The author makes it very clear that he understands the problem Anubis is attempting to solve. His issue is that the chosen approach doesn't solve that problem; it just inhibits access for humans, particularly those with limited access to compute resources.
That's the opposite of being dismissive. The author has taken the time to deeply understand both the problem and the proposed solution, and has taken the time to construct a well-researched and well-considered argument.
> This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.
This is a confusing comment because it appears you don’t understand the well-written critique in the linked blog post.
> This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
The key point in the blog post is that it’s the inverse of a CAPTCHA: The proof of work requirement is solved by the computer automatically.
You don’t have to teach a computer how to solve this proof of work because it’s designed for the computer to solve the proof of work.
It makes the crawling process more expensive because it has to actually run scripts on the page (or hardcode a workaround for specific versions) but from a computational perspective that’s actually easier and far more deterministic than trying to have AI solve visual CAPTCHA challenges.
But for actual live users who don't see anything but a transient screen, Anubis is a better experience than all those pesky CAPTCHAs (I am bored of trying to recognize bikes, pedestrian crossings, buses, hydrants).
The question is if this is the sweet spot, and I can't find anyone doing the comparative study (how many annoyed human visitors, how many humans stopped and, obviously, how many bots stopped).
> Anubis is a better experience than all those pesky CAPTCHAs (I am bored of trying to recognize bikes, pedestrian crossings, buses, hydrants).
Most CAPTCHAs are invisible these days, and Anubis is worse than them. Also, CAPTCHAs are not normally deployed just for visiting a site, they are mostly used when you want to submit something.
Wouldn't it be nice to have a good study that supports either your or my view?
FWIW, I've never been stopped by Anubis, so even if it's much more rarely implemented, that's still infinitely less than 5-10 captchas a day I do see regularly. I do agree it's still different scales, but I don't trust your gut feel either. Thus a suggestion to look for a study.
Not OP but try browsing the web with a combination of Browser + OS that is slightly off to what most people use and you'll see Captchas pop up at every corner of the Internet.
And if the new style of Captchas is then like this one it's much more disturbing.
It's been going on for decades now too. It's a cat and mouse game that will be with us for as long as people try to exploit online resources with bots. Which will be until the internet is divided into nation nets, suffocated by commercial interests, and we all decide to go play outside instead.
No. This went into overdrive in the "AI" (crawlers for massive LLM for ML chatbot) era.
Frankly, it's something I'm sad we don't yet see a lawsuit over, similar to The Times v. OpenAI. A lot of "new crawlers" claim to innocently forget about established standards like robots.txt.
I just wish people would name and shame the massive companies at the top stomping on the rest of the internet in an edge to "get a step up over the competition".
That doesn't really challenge what I said, there's not much "different this time" except the scale is commensurate to the era. Search engine crawlers used to take down websites as well.
I understand and agree with what you are saying though, the cat and mouse is not necessarily technical. Part of solving the searchbot issue was also social, with things like robots.txt being a social contract between companies and websites, not a technical one.
> The bots will eventually be indistinguishable from humans
Not until they get issued government IDs they won't!
Extrapolating from current trends, some form of online ID attestation (likely based on government-issued ID[1]) will become normal in the next decade, and naturally, this will be included in the anti-bot arsenal. It will be up to the site operator to trust identities signed by the Russian government.
1. Despite what Sam Altman's eyeball company will try to sell you, government registers will always be the anchor of trust for proof-of-identity, they've been doing it for centuries and have become good at it and have earned the goodwill.
We can't just have "send me a picture of your ID" because that is pointlessly easy to spoof - just copy someone else's ID.
So there must be some verification that you, the person at the keyboard, is the same person as that ID identifies. The UK is rapidly finding out that that is extremely difficult to do reliably. Video doesn't really work reliably on all cases, and still images are too easily spoofed. It's not really surprising, though, because identifying humans reliably is hard even for humans.
We could also do it at the network level, assigning a government-issued network connection to a specific individual, so the system knows that any traffic from a given IP address belongs to that specific individual. There are obvious problems with this model, not least that IP addresses were never designed for this, and spoofing an IP becomes identity theft.
We also do need bot access for things, so there must be some method of granting access to bots.
I think that to make this work, we'd need to re-architect the internet from the ground up. To get there, I don't think we can start from here.
- "The person at the keyboard, is the same person as that ID identifies" is a high expectation, and can probably be avoided—you just need verifiable credentials and you gotta trust they're not spoofed
- Many official government IDs are digital now
- Most architectures for solving this problem involve bundling multiple identity "attestations," so proof of personhood would ultimately be a gradient. (This does, admittedly, seem complicated though ... but World is already doing it, and there are many examples of services where providing additional information confers additional trust. Blue checkmarks to name the most obvious one.)
As for what it might look like to start from the ground up and solve this problem, https://urbit.org/, for all its flaws, is the only serious attempt I know of and proves it's possible in principle, though perhaps not in practice
Why isn't it necessary to prove that the person at the keyboard is the person in the ID? That seems like the minimum bar for entry to this problem. Otherwise we can automate the ID checks and the bots can identify as humans no problem.
We almost all have IC Chip readers in our pocket (our cell phones), so if the government issues a card that has a private key embedded in it, akin to existing GnuPG SmartCards, you can use your phone to sign an attestation of your personhood.
In fact, Japan already has this in the form of "My Number Card". You go to a webpage, the webpage says "scan this QR code, touch your phone to your ID card, and type in your PIN code", and doing that is enough to prove to the website that you're a human. You can choose to share name/birthday/address, and it's possible to only share a subset.
Robots do not get issued these cards. The government verifies your human-ness when they issue them.
Any site can use this system, not just government sites.
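Under the hood it's essentially a challenge-response signature under a key the government has certified as belonging to exactly one person. A toy sketch of that core idea (not the actual My Number Card protocol, which also involves the card's secure element, the PIN, and a government CA):

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

func main() {
	// Issued once, in person: the government verifies your human-ness and
	// certifies this public key. The private key stays on the card.
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)

	// The website generates a fresh challenge (what the QR code conveys).
	challenge := make([]byte, 32)
	rand.Read(challenge)

	// Phone + card: after the PIN is entered, the card signs the challenge.
	sig := ed25519.Sign(priv, challenge)

	// The website verifies against the certified public key: a valid
	// signature proves possession of a one-per-person, government-issued
	// credential without revealing anything else unless the user opts in.
	fmt.Println("human-backed credential verified:", ed25519.Verify(pub, challenge, sig))
}
```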
Germany has this. The card plus PIN technically proves you are in current possession of both, not that you are the person (no biometrics or the like). You can chose to share/request not only certain data fields but also eg if you are below or above a certain age or height without disclosing the actual number.
I want to believe that this would be used at amusement parks to scan "can I safely get on this ride" and at the entrance to stairs to tell you if you'll bump your head or not.
The system as a whole is rarely used. I think it’s a combination of poor APIs and hesitation of the population. For somebody without technical knowledge, there is no obvious difference to the private video ID companies. On the surface, you may believe that all data is transferred anyway and you have to trust providers in all cases, not that some magic makes it so third parties don’t get more than necessary.
I don’t know of any real world example that queries height, I mentioned it because it is part of the data set and privacy-preserving queries are technically possible. Age restrictions are the obvious example, but even there I am not aware of any commercial use, only for government services like tax filing or organ donor registry. Also, nobody really measures your height, you just tell them what to put there when you get the ID. Not so for birth dates, which they take from previous records going back to the birth certificate.
That is already solved by governments and businesses. If you have recently attempted to log into a US government website, you were probably told that you need Login.gov or ID.me. ID.me verifies identity via driver’s license, passport, Social Security number—and often requires users to take a video selfie, matched against uploaded ID images. If automated checks fail, a “Trusted Referee” video call is offered.
If you think this sounds suspiciously close the what businesses do with KYC, Know Your Customer, you're correct!
IDs would have to be reissued with a public/private key model you can use to sign your requests.
> the person at the keyboard, is the same person as that ID identifies
This won't be possible to verify - you could lend your ID out to bots but that would come at the risk of being detected and blanket banned from the internet.
"In Europe" is technically true but makes it sound more widely used than I believe it to be... though maybe my knowledge is out of date.
Their website lists 24 supported countries (including some non-EU like UK and Norway, and missing a few of the 27 EU countries) - https://www.itsme-id.com/en-GB/coverage
But does it actually have much use outside of Belgium?
Certainly in the UK I've never come across anyone, government or private business, mentioning it - even since the law passed requiring many sites to verify that visitors are adults. I wouldn't even be familiar with the name if I hadn't learned about its being used in Belgium.
Maybe some other countries are now using it, beyond just Belgium?
One problem with solutions like that is that the website needs to pay for every log-in. So you save a few dollars blocking scrapers, but now you have to pay thousands of dollars to this company instead.
Officially sanctioned 2fa tied to your official government ID. Over here we have "It's me" [1].
Yes, you can in theory still use your ID card with a usb cardreader for accessing gov services, but good luck finding up to date drivers for your OS or use a mobile etc.
Except that itsme crap is not from the government and doesn't support activation on anything but a Windows / Mac machine. No Linux support at all, while the Belgian government stuff (CSAM) supports Linux just fine.
It is from the banks, which leveraged their KYC, but it was adopted very broadly by the government and many other ID-required or ID-linked services. AFAIK it does not need a computer to activate, just your phone and one of those bank-issued 2FA challenge card readers.
For CSAM, also AFAIK, the first 'activation' includes a visit to your local municipality to verify your identity. Unless you go via itsme, as it is an authorized CSAM key holder.
UK is stupidly far behind on this though. On one hand the digitization of government services is really well done(thanks to the fantastic team behind .gov websites), but on the other it's like being in the dark ages of tech. My native country has physical ID cards that contain my personal certificate that I can use to sign things or to - gasp! - prove that I am who I say I am. There is a government app that you can use to scan your ID card using the NFC chip in your phone, after providing it with a password that you set when you got the card it produces a token that can then be used to verify your identy or sign documents digitally - and those signatures legally have the same weight as real paper signatures.
UK is in this weird place where there isn't one kind of ID that everyone has - for most people it's the driving licence, but obviously that's not good enough. But my general point is that UK could just look over at how other countries are doing it and copy good solutions to this problem, instead of whatever nonsense is being done right now with the age verification process being entirely outsourced to private companies.
> UK is in this weird place where there isn't one kind of ID that everyone has - for most people it's the driving licence, but obviously that's not good enough.
As a Brit I personally went through a phase of not really existing — no credit card, no driving licence, expired passport - so I know how annoying this can be.
But it’s worth noting that we have this situation not because of mismanagement or technical illiteracy or incompetence but because of a pretty ingrained (centuries old) political and cultural belief that the police shouldn’t be able to ask you “papers please”. We had ID cards in World War II, everyone found them egregious and they were scrapped. It really will be discussed in those terms each time it is mentioned, and it really does come down to this original aspect of policing by consent.
So the age verification thing is running up against this lack of a pervasive ID, various KYC situations also do, we can get an ID card to satisfy verification for in-person voting if we have no others, but it is not proof of identity anywhere else, etc.
It is frustrating to people who do not have that same cultural touchstone but the “no to ID” attitude is very very normal; generally the UK prefers this idea of contextual, rather than universal ID. It’s a deliberate design choice.
Same in Australia - there was a referendum about whether we should have government-issued ID cards, and the answer was an emphatic "NO". And Australia is hitting or going to hit the same problem with the age verification thing for social media.
I doesn’t require a ground up rework. The easiest idea is real people can get an official online id at some site like login.gov and website operators verify people using that api. Some countries already have this kind of thing from what I understand. The tech bros want to implement this on the blockchain but the government could also do it.
In all likelihood, most people will do so via the Apple Wallet (or the equivalent on their non-Apple devices). It's going to be painful to use Open source OSes for a while, thanks to CloudFlare and Anubis. This is not the future I want, but we can't have nice things.
> This is not the future I want, but we can't have nice things.
Actually, we can if we collectively decide that we should have them. Refuse to use sites that require these technologies and demand governments to solve the issue in better ways, e.g. by ensuring there are legal consequences for abusive corporations.
No worries. Stick an unregistered copy of win 11 (ms doesn't seem to care) and your driver's license in an isolated VM and let the AI RDP into it for you.
Manually browsing the web yourself will probably be trickier moving forward though.
Silly you, joking around like that. Can you imagine owning a toaster?! Sooo inconvenient and unproductive! Guess, if you change your housing plan, you gonna bring it along like an infectious tick? Hahah — no thank you! :D
You will own nothing and you will be happy!
(Please be reminded, failing behavioral compliance with, and/or voicing disapproval of this important moral precept, jokingly or not, is in violation of your citizenship subscription's general terms and conditions. This incident will be reported. Customer services will assist you within 48 hours. Please, do not leave your base zone until this issue has been resolved to your satisfaction.)
The internet would come to a grinding halt as everyone would suddenly become mindful of their browsing. It's not hard to imagine a situation where, say, pornhub sells its access data and the next day you get sacked at your teaching job.
It doesn't need to. Thanks to asymmetric cryptography governments can in theory provide you with a way to prove you are a human (or of a certain age) without:
1. the government knowing who you are authenticating yourself to
2. or the recipient learning anything but the fact that you are a human
3. or the recipient being able to link you to a previous session if you authenticate yourself again later
The EU is trying to build such a scheme for online age verification (I'm not sure if their scheme also extends to point 3 though. Probably?).
But I don't get how it goes for spam or scraping: if I can pass the test "anonymously", then what prevents me from doing it for illegal purposes?
I get it for age verification: it is difficult for a child to get a token that says they are allowed to access porn because adults around them don't want them to access porn (and even though one could sell tokens online, it effectively makes it harder to access porn as a child).
But how does it prevent someone from using their ID to get tokens for their scraper? If it's anonymous, then there is no risk in doing it, is there?
IIRC, you could use asymmetric cryptography to derive a site-specific pseudonymous token from the service and your government ID without the service knowing what your government ID is or the government provider knowing what service you are using.
The service then links the token to your account and uses ordinary detection measures to see if you're spamming, flooding, phishing, whatever. If you do, the token gets blacklisted and you can no longer sign on to that service.
This isn't foolproof - you could still bribe random people on the street to be men/mules in the middle and do your flooding through them - but it's much harder than just spinning up ten thousand bots on a residential proxy.
But that does not really answer my question: if a human can prove that they are human anonymously (by getting an anonymous token), what prevents them from passing that token to an AI?
The whole point is to prevent a robot from accessing the API. If you want to detect the robot based on its activity, you don't need to bother humans with the token in the first place: just monitor the activity.
It does not prevent a bot from using your ID. But a) the repercussions for getting caught are much more tangible when you can't hide behind anonymity - you risk getting blanket banned from the internet and b) the scale is significantly reduced - how many people are willing to rent/sell their IDs, i.e., their right to access the internet?
Edit: ok I see the argument that the feedback mechanism could be difficult when all the website can report is "hey, you don't know me but this dude from request xyz you just authenticated fucked all my shit up". But at the end of the day, privacy preservation is an implementation detail I don't see governments guaranteeing.
> But at the end of the day, privacy preservation is an implementation detail I don't see governments guaranteeing.
Sure, I totally see how you can prevent unwanted activity by identifying the users. My question was about the privacy-preserving way. I just don't see how that would be possible.
It does work as long as the attesting authority doesn't allow issuing a new identity (before it expires) if the old one is lost.
You (Y) generate a keypair and send your public key to the attesting authority A, and keep your private key. You get a certificate.
You visit site b.com, and it asks for your identity, so you hash b.com|yourprivatekey. You submit the hash to b.com, along with a ZKP that you possess a private key that makes the hash work out, and that the private key corresponds to the public key in the certificate, and that the certificate has a valid signature from A.
If you break the rules of b.com, b.com bans your hash. Also, they set a hard rate limit on how many requests per hash are allowed. You could technically sell your hash and proof, but a scraper would need to buy up lots of them to do scraping.
Now the downside is that if you go to A and say your private key was compromised, or you lost control of it - the answer has to be tough luck. In reality, the certificates would expire after a while, so you could get a new hash every 6 months or something (and circumvent the bans), and if you lost the key, you'd need to wait out the expiry. The alternative is a scheme where you and A share a secret key - but then they can calculate your hash and conspire with b.com to unmask you.
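The hash part of this is trivial; here's a sketch of just that derivation (the ZKP tying the hash to A's certificate is the hard part and is omitted):

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// siteID derives the stable, site-specific pseudonym described above:
// hash(domain | privatekey). b.com always sees the same ID from you,
// but IDs for different sites can't be linked to each other or traced
// back to the key (that unlinkability is what the omitted ZKP certifies).
func siteID(domain string, privateKey []byte) [32]byte {
	return sha256.Sum256(append([]byte(domain+"|"), privateKey...))
}

func main() {
	privateKey := make([]byte, 32) // stand-in for the user's real key
	rand.Read(privateKey)

	fmt.Printf("b.com ID: %x\n", siteID("b.com", privateKey))
	fmt.Printf("c.com ID: %x\n", siteID("c.com", privateKey))
}
```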
Isn't the whole point of a privacy-preserving scheme be that you can ask many "certificates" to the attesting authority and it won't care (because you may need as many as the number of websites you visit), and the website b.com won't be able to link you to them, and therefore if it bans certificate C1, you can just start using certificate C2?
And then of course, if you need millions of certificates because b.com keeps banning you, it means that they ban you based on your activity, not based on your lack of certificate. And in that case, it feels like the certificate is useless in the first place: b.com has to monitor and ban you already.
There isn't a technical solution to this: governments and providers not only want proof of identity matching IDs, they want proof of life, too.
This will always end with live video of the person requesting to log in to provide proof of life at the very least, and if they're lazy/want more data, they'll tie in their ID verification process to their video pipeline.
That's not the kind of proof of life the government and companies want online. They want to make sure their video identification 1) is of a living person right now, and 2) that the living person matches their government ID.
It's a solution to the "grandma died but we've been collecting her Social Security benefits anyway", or "my son stole my wallet with my ID & credit card", or (god forbid) "We incapacitated/killed this person to access their bank account using facial ID".
It's also a solution to the problem advertisers, investors and platforms face of 1) wanting huge piles of video training data for free and 2) determining that a user truly is a monetizable human being and not a freeloader bot using stolen/sold credentials.
> That's not the kind of proof of life the government and companies want online.
Well that's your assumption about governments, but it doesn't have to be true. There are governments that don't try to exploit their people. The question is whether such governments can have technical solutions to achieve that or not (I'm genuinely interested in understanding whether or not it's technically feasible).
It's the kind of proof my government already asks of me to sign documents much, much more important than watching adult content, such as social security benefits.
Such schemes have the fatal flaw that they can be trivially abused. All you need are a couple of stolen/sold identities and bots start proving their humanness and adultness to everyone.
> Such schemes have the fatal flaw that they can be trivially abused
I wouldn't expect the abuse rate to be higher than what it is for chip-and-pin debit cards. PKI failure modes are well understood and there are mitigations galore.
I did think asymmetric cryptography but I assumed the validators would be third parties / individual websites and therefore connections could be made using your public key. But I guess having the government itself provide the authentication service makes more sense.
I wonder if they'd actually honor 1 instead of forcing recipients to be registered, as presumably they'd be interested in tracking user activity.
Besides making yourself party to a criminal conspiracy, I suspect it would be partly the same reason you won't sell/rent your real-world identity to other people today; an illegal immigrant may be willing to rent it from you right now.
Mostly, it will be because online identities will be a market for lemons: there will be so many fake/expired/revoked identities being sold that each one will be worth pennies, and that's not commensurate with the risk of someone committing crimes and linking it to your government-registered identity.
> the same reason you won't sell/rent your real-world identity to other people today
If you sell your real-world identity to other people today, and they get arrested, then the police will know your identity (obviously). How does that work with a privacy-preserving scheme? If you sell your anonymous token that says that you are a human to a machine and the machine gets arrested, then the police won't be able to know who you are, right? That was the whole point of the privacy-preserving token.
I'm genuinely interested, I don't understand how it can work technically and be privacy-preserving.
> With privacy preserving cryptography the tokens are standalone and have no ties to the identity that spawned them.
I suspect there will be different levels of attestation, from the anonymous ("this is an adult"), to semi-anonymous ("this person was born in 20YY and resides in administrative region XYZ"), to the complete record ("This is John Quincy Smith III born on YYYY-MM-DD with ID doc number ABC123"). Somewhere in between the extremes is a pseudonymous token that's strongly tied to a single identity with non-repudiation.
Anonymous identities that can be easily churned out on demand by end-users have zero anti-bot utility.
While it's the privacy advocate's ideal, the political reality is that very few governments will deploy "privacy preserving" cryptography that gets in the way of LE investigations[1]. The best you can hope for is some escrowed service that requires a warrant to unmask the identity for any given token, so privacy is preserved in most cases, and against most parties, except law enforcement when there's a valid warrant.
1. They can do it overtly in the design of the system, or covertly via side-channels, logging, or leaking bits in ways that are hard for an outsider to investigate without access to the complete source code and/or system outputs, such as not-quite-random pseudo-randoms.
> Mostly, it will be because online identities will be a market for lemons: there will be so many fake/expired/revoked identities being sold that each one will be worth pennies, and that's not commensurate with the risk of someone committing crimes and linking it to your government-registered identity.
That would be trivially solved by checking them against the same verification mechanisms they would be used with.
You are right about the negative outcomes that this might have but you have way too much faith in the average person caring enough before it happens to them.
I live with the naïve and optimistic dream that something like that would just show that everyone was in the list so they can't use it to discriminate against people.
The eyeball company's play is to be a general identity provider, which is an obvious move for anyone trying to fill this gap. You can already connect your passport in the World app.
> some form of online ID attestation (likely based on government-issued ID[1]) will become normal in the next decade
I believe this is likely, and implemented in the right way, I think it will be a good thing.
A zero-knowledge way of attesting persistent pseudonymous identity would solve a lot of problems. If the government doesn’t know who you are attesting to, the service doesn’t know your real identity, services can’t correlate users, and a service always sees the same identity, then this is about as privacy-preserving as you can get with huge upside.
A social media site can ban an abusive user without them being able to simply register a new account. One person cannot operate tens of thousands of bot profiles. Crawlers can be banned once. Spammers can be locked out of email.
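To make the "same identity per service, no correlation across services" property concrete, here's a minimal Node.js sketch. It deliberately ignores the zero-knowledge part; the HMAC construction and names are mine, and a real scheme would use blind signatures or ZK proofs so that even the issuer can't link usage.

    // Sketch: derive a per-service pseudonym from a secret the attestation
    // provider holds for the user. Real systems would use blind signatures
    // or ZK proofs so the provider can't link usage; this only illustrates
    // the pseudonym property itself.
    const crypto = require('crypto');

    function pseudonymFor(userSecret, serviceDomain) {
      // Same user + same service  => same pseudonym (bans stick).
      // Same user + other service => unrelated-looking pseudonym (no correlation).
      return crypto.createHmac('sha256', userSecret)
                   .update(serviceDomain)
                   .digest('hex');
    }

    const secret = crypto.randomBytes(32); // held by the attestation provider
    console.log(pseudonymFor(secret, 'example-social.com'));
    console.log(pseudonymFor(secret, 'example-forum.net')); // different, uncorrelatable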
> A social media site can ban an abusive user without them being able to simply register a new account.
This is an absolutely gargantuan-sized antifeature that would single-handedly drive me out of the parts of the internet that choose to embrace this hellish tech.
I think social media platforms should have the ability to effectively ban abusive users, and I’m pretty sure that’s a mainstream viewpoint shared by most people.
The alternative is that you think people should be able to use social media platforms in ways that violate their rules, and that the platforms should not be able to refuse service to these users. I don’t think that’s a justifiable position to take, but I’m open to hearing an argument for it. Simply calling it “hellish” isn’t an argument.
And can you clarify if your position accounts for spammers? Because as far as I can see, your position is very clearly “spammers should be allowed to spam”.
This has quite nasty consequences for privacy. For this reason, alternatives are desirable. I have less confidence on what such an alternative should be, however.
It depends on your precise requirements and assumptions.
Does your definition of 'privacy-preserving' distrust Google, Apple, Xiaomi, HTC, Honor, Samsung and suchlike?
Do you also distrust third-party clowns like Experian and Equifax (whose current systems have gaping security holes), and distrust large government IT projects (which are outsourced to clowns like Fujitsu who don't know what they're doing)?
Do you require it to work on all devices, including outdated phones and tablets; PCs; Linux-only devices; other networked devices like smart lightbulbs; and so on? Does it have to work in places phones aren't allowed, or mobile data/bluetooth isn't available? Does the identity card have to be as thin, flexible, durable and cheap as a credit card, precluding any built-in fingerprint sensors and suchlike?
Does the age validation have to protect against an 18-year-old passing the age check on their 16-year-old friend's account? While also being privacy-preserving enough nobody can tell the two accounts were approved with the same ID card?
Does the system also have to work on websites without user accounts, because who the hell creates a pornhub account anyway?
Does the system need to work without the government approving individual websites' access to the system? Does it also need to support proving things like name, nationality, and right to work in the country, so people can apply for bank accounts and jobs online? And yet does it need to prevent sites from requiring names just for ad-targeting purposes?
Do all approvals have to be provable, so every company can prove to the government that the checks were properly carried out at the right time? Does it have to be possible to revoke cards in a timely manner, but without maintaining a huge list of revoked cards, and without every visit to a porn site triggering a call to a government server for a revocation check?
If you want to accomplish all of these goals - you're going to have a tough time.
I can easily imagine having a way to prove my age in a privacy-preserving way: a trusted party knows that I am 18+ and gives me a token that proves that I am 18+ without divulging anything else. I take that token and pass it to the website that requires me to be 18+. The website knows nothing about me other than I have a token that says I am 18+.
Of course, I can get a token and then give it to a child. Just like I can buy cigarettes and give them to a child. But the age verification helps in that I don't want children to access cigarettes, so I won't do it.
The "you are a human" verification fundamentally doesn't work, because the humans who make the bots are not aligned with the objective of the verification. If it's privacy-preserving, it means that a human can get a token, feed it to their bot and call it a day. And nobody will know who gave the token to the bot, precisely because it is privacy-preserving.
While the question of "is it actually possible to do this in a privacy preserving way?" is certainly interesting, was there ever a _single_ occasion where a government had the option of doing something in a privacy preserving way, when a non-privacy preserving way was also possible? Politicians would absolutely kill for the idea of unmasking dissenters on internet forums. Even if the option is a possibility, they are deliberately not going to implement it.
I didn't know about e-IDs in other countries, but in Scandinavia (at least in Norway and Sweden, but I know the same system is used in Denmark as well) they are very much tied to your personal number which uniquely identifies you. Healthcare data is also not encrypted.
Well the e-ID is an ID, so to the government it's tied to a person. But I know that in multiple countries it's possible to use the e-ID to only share the information necessary with the receiver in a way that the government cannot track. Typically, share only the fact that you are 18+ without sharing your name or birthday, and without the government being able to track where you shared that fact.
Fun fact: The Norwegian wine monopoly is rolling out exactly this to prevent scalpers buying up new releases. Each online release will require a signup in advance with a verified account.
Eh? With the "anonymous" models that we're pushing for right now, nothing stops you from handing over your verification token (or the control of your browser) to a robot for a fee. The token issued by the verifier just says "yep, that's an adult human", not "this is John Doe, living at 123 Main St, Somewhere, USA". If it's burned, you can get a new one.
If we move to a model where the token is permanently tied to your identity, there might be an incentive for you not to risk your token being added to a blocklist. But there's no shortage of people who need a bit of extra cash and for whom it's not a bad trade. So there will be a nearly-endless supply of "burner" tokens for use by trolls, scammers, evil crawlers, etc.
And if they are as successful as they are threatening to be, they will have destroyed so many jobs that I am sure they will find a few thousand people across the world who will accept a stipend to loan their essence to the machine.
Maybe there will be a way to certify humanness. A human-testing facility could be a local office you walk over to in order to get your “I am a human” hardware key. Maybe it expires after a week or so to ensure that you are still alive.
But if that hardware key is privacy-preserving (i.e. websites don't get your identity when you use it), what prevents you from using it for your illegal activity? Scrapers and spam are built by humans, who could get such a hardware key.
Not even: the government is supposed to provide you with more than one token (how would you verify yourself as a human to more than one website otherwise?)
It might be a tool in the box. But it’s still cat and mouse.
Where I work, we quickly concluded the scrapers have tons of compute and the “proof-of-work” aspect was meaningless to them. It’s simply the “response from site changed, need to change our scraping code” aspect that helps.
>But it takes a human some time and work to tell the crawler HOW.
Yes, for these human-based challenges. But this challenge is defined in code. It's not like crawlers don't run JavaScript. It's 2025, they all use headless browsers, not curl.
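As a sketch of how little effort that takes (Puppeteer here, with a placeholder URL and my own timing assumptions), a crawler that already drives a real browser needs no Anubis-specific code at all; it just waits the way a human would:

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      // The first navigation may land on the Anubis interstitial; its JS solves
      // the challenge, sets the cookie and redirects on its own.
      await page.goto('https://git.example.org/some/repo', { waitUntil: 'networkidle0' });

      // If we were served the challenge page, just wait for the redirect it triggers.
      // (Assumption: the default difficulty solves in well under 15 seconds.)
      await page.waitForNavigation({ timeout: 15000 }).catch(() => {});

      console.log((await page.content()).length); // the real page; cookie now reusable
      await browser.close();
    })();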
> The point is that they hadn't, and this worked for quite a while.
That's what I was hoping to get from the "Numbers" section.
I generally don't look up the logs or numbers on my tiny, personal web spaces hosted on my server, and I imagine I could, at some point, become the victim of aggressive crawling (or maybe I have without noticing because I've got an oversized server on a dual link connection).
But the numbers actually only show the performance of doing the PoW, not the effect it has had on any site — I am just curious, and I'd love it if someone has done the analysis, ideally grouped by the bot type ("OpenAI bot was responsible for 17% of all requests, this got reduced from 900k requests a day to 0 a day"...). Search, unfortunately, only gives me all the "Anubis is helping fight aggressive crawling" blog articles, nothing with substance (I haven't tried hard, I admit).
The cost benefit calculus for workarounds changes based on popularity. Your custom lock might be easy to break by a professional, but the handful of people who might ever care to pick it are unlikely to be trying that hard. A lock which lets you into 5% of houses however might be worth learning to break.
If you are going to rely on security through obscurity there are plenty of ways to do that that won't block actual humans because they dare use a non-mainstream browser. You can also do it without displaying cringeworthy art that is only there to get people to pay for the DRM solution you are peddling - that shit has no place in the open source ecosystem.
On the contrary: Making things look silly and unprofessional so that Big Serious Corporations With Money will pay thousands of dollars to whitelabel them is an OUTSTANDING solution for preserving software freedom while raising money for hardworking developers.
I'd rather not raise money for "hardworking" developers if their work is spreading DRM on the web. And it's not just "Big Serious Corporations" that don't want to see your furry art.
I'm not commenting on the value of this project (I wouldn't characterize captchas as DRM, but I see why you have that negative connotation) and I tend to agree with the OP that this is simply wasting energy, but the amount of seething over "anime catgirls" makes me want to write all the docs for my next projects in UwU text and charge for a whimsy-free version. (o˘◡˘o)
Please do, it's better if people make their negative personality traits public so that you can avoid them before wasting your time. It will also be useful to show your hypocrisy when you inevitably complain about someone else doing something that you don't like.
I don't think you need to try to die on this hill (primarily remarking w.r.t. your lumping in Anubis with Cloudflare/Google/et al. as one). In any case, I'm not appreciating the proliferation of the CAPTCHA-wall any more than you are.
The mascot artist wrote in here in another thread about the design philosophies, and they are IMO a lot more honorable in comparison (to BigCo).
Besides, it's MIT FOSS. Can't a site operator shoehorn in their own image if they were so inclined?
i love this thread because the Serious Business Man doesn't realize that purposeful unprofessionalism like anime art, silly uwu :3 catgirls, and writing with no capitalization are done specifically to be unpalatable to Serious Business Man—threatening to not interact with people like that is the funniest thing.
Acting obnoxiously to piss people off makes you seem like an inexperienced teenager and distances more than "Serious Business Man".
I look forward for this to be taken to the logical extreme when a niche subculture of internet nerds change their entire online persona to revolve around scat pornography to spite "the normals", I'm sure they'll be remembered fondly as witty and intelligent and not at all as mentally ill young people.
I deployed a proof-of-work based auth system once where every single request required hashing a new nonce. Compare with Anubis, where only one request a week requires it. The math said doing it that frequently, and with variable Argon2 parameters the server could tune if it suspected bots, would be impactful enough to deter bots.
Would I do that again? Probably not. These days I’d require a weekly mDL or equivalent credential presentation.
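A rough sketch of that kind of per-request scheme, with Node's built-in scrypt standing in for the Argon2 I used and made-up parameter names:

    const crypto = require('crypto');

    // Server issues a fresh nonce per request; difficulty (leading zero hex
    // chars required) can be raised for clients it suspects are bots.
    function issueChallenge(difficulty) {
      return { nonce: crypto.randomBytes(16).toString('hex'), difficulty };
    }

    function work(nonce, counter) {
      // Memory-hard function as the unit of work (scrypt standing in for Argon2).
      return crypto.scryptSync(String(counter), nonce, 16, { N: 16384, r: 8, p: 1 })
                   .toString('hex');
    }

    // Client grinds counters until the output hits the target.
    function solve({ nonce, difficulty }) {
      const target = '0'.repeat(difficulty);
      for (let counter = 0; ; counter++) {
        if (work(nonce, counter).startsWith(target)) return counter;
      }
    }

    // Server verifies with a single call.
    function verify({ nonce, difficulty }, counter) {
      return work(nonce, counter).startsWith('0'.repeat(difficulty));
    }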
I have to disagree that an anti-bot measure that only works globally for a few weeks, until bots trivially bypass it, is effective. In an arms race against bots, the bots win. You have to outsmart them by challenging them to do something that only a human can do, or that is actually prohibitively expensive for bots to do at scale. Anubis doesn't pass that test. And now it’s littered everywhere, defunct and useless.
> As the botmakers circumvent, new methods of proof-of-notbot will be made available.
Yes, but the fundamental problem is that the AI crawler does the same amount of work as a legitimate user, not more.
So if you design the work such that it takes five seconds on a five-year-old smartphone, it could inconvenience a large portion of your user base. But once that scheme is understood by the crawler, it will delay the start of their aggressive crawling by... well under five seconds.
An open source javascript challenge as a crawler blocker may work until it gets large enough for crawlers to care, but then they just have an engineer subscribe to changes on GitHub and have new challenge algorithms implemented before the majority of the deployment base migrates.
Weren't there also weird behaviors reported by web admins across the world, like LLM companies' crawlers fetching evergreen data ad nauseam, or something along those lines? I thought the point of adding PoW rather than just blocking them was to convince them to at least do it right.
No, it’s exactly because I understand it that it bothers me. I understand it will be effective against bots for a few months at best, and legitimate human users will be stuck dealing with the damn thing for years to come. Just like captchas.
It is because you are dealing with crawlers that already have a nontrivial cost per page, adding something relatively trivial that is still within the bounds regular users accept won't change the motivations of bad actors at all.
Technical people are prone to black-and-white thinking, which makes it hard to understand that making something more difficult will cause people to do it less even though it’s still possible.
I think the argument on offer is more, this juice isn't worth the squeeze. Each user is being slowed down and annoyed for something that bots will trivially bypass if they become aware of it.
If they become aware of it and actually think it’s worthwhile. Malicious bots work by scaling, and implementing special cases for every random web site doesn’t scale. And it’s likely they never even notice.
If this kind of security by not being noticed is the plan, why not just have a trivial (but unique) captcha that asks the user to click a button with no battery wasting computation?
Respectfully, I think it's you missing the point here. None of this is to say you shouldn't use Anubis, but Tavis Ormandy is offering a computer science critique of how it purports to function. You don't have to care about computer science in this instance! But you can't dismiss it because it's computer science.
Consider:
An adaptive password hash like bcrypt or Argon2 uses a work function to apply asymmetric costs to adversaries (attackers who don't know the real password). Both users and attackers have to apply the work function, but the user gets ~constant value for it (they know the password, so to a first approx. they only have to call it once). Attackers have to iterate the function, potentially indefinitely, in the limit obtaining 0 reward for infinite cost.
A blockchain cryptocurrency uses a work function principally as a synchronization mechanism. The work function itself doesn't have a meaningfully separate adversary. Everyone obtains the same value (the expected value of attempting to solve the next round of the block commitment puzzle) for each application of the work function. And note in this scenario most of the value returned from the work function goes to a small, centralized group of highly-capitalized specialists.
A proof-of-work-based antiabuse system wants to function the way a password hash functions. You want to define an adversary and then find a way to incur asymmetric costs on them, so that the adversary gets minimal value compared to legitimate users.
And this is in fact how proof-of-work-based antispam systems function: the value of sending a single spam message is so low that the EV of applying the work function is negative.
But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.
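To put rough numbers on that (made up, but order-of-magnitude plausible): say the challenge costs one CPU-second and the resulting cookie is good for a week.

    Human:   1 CPU-second for a week of browsing, paid as a visible page-load delay.
    Scraper: 1 CPU-second per identity; 10,000 throwaway identities is about 3 CPU-hours,
             i.e. well under a dollar of cloud compute, amortized over millions of pages
             fetched that week.

The per-page cost:value ratio doesn't move for either party; only the absolute spend does, and the well-capitalized crawler is the party least bothered by that.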
There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.
This is also how the Blu-Ray BD+ system worked.
The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).
The problem with "this is good because none of the scrapers even bother to do this POW yet" is that you don't need an annoying POW to get that value! You could just write a mildly complicated Javascript function, or do an automated captcha.
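For instance, something as dumb as the following (a made-up example, not anything Anubis actually ships) already filters out every client that doesn't execute JavaScript, at essentially zero cost to real users:

    // Served instead of the page; any JS-executing client passes instantly,
    // curl-level scrapers don't. No CPU is burned on either side.
    const token = btoa([...location.pathname].reverse().join('') + ':' +
                       Math.floor(Date.now() / 3600000)); // hour bucket
    document.cookie = 'not_a_bot=' + token + '; path=/; max-age=604800';
    location.reload(); // server recomputes the same value and validates the cookie

The server just recomputes the same value when it sees the cookie. No work function, no burned battery, same "the bots haven't bothered yet" level of protection.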
A lot of these passive types of anti-abuse systems rely on the rather bold assumption that making a bot perform a computation is expensive, but isn't for me as an ordinary user.
According to whom or what data exactly?
AI operators are clearly well-funded operations, and the amount of electricity and CPU power is negligible. Software like Anubis and nearly all its identical predecessors grant you access after a single "proof". So you then have free rein to scrape the whole site.
The best physical analogy are those shopping cart things where you have to insert a quarter to unlock the cart, and you presumably get it back when you return the cart.
The group of people this doesn't affect are the well-funded, a quarter is a small price to pay for leaving your cart in the middle of the parking lot.
Those that suffer the most are the ones that can't find a quarter in the cupholder so you're stuck filling your arms with groceries.
Would you be richer if they didn't charge you a quarter? (For these anti-bot tools you're paying the electric company, not the site owner.) Maybe. But if you're Scrooge McDuck, who is counting?
Right, that's the point of the article. If you can tune asymmetric costs on bots/scrapers, it doesn't matter: you can drive bot costs to infinity without doing so for users. But if everyone's on a level playing field, POW is problematic.
I like your example because quarters for shopping carts are not universal everywhere. Some societies have either accepted shopping cart shrinkage as an acceptable cost of doing business or have found better ways to deter it.
Scrapers are orders of magnitude faster than humans at browsing websites. If the challenge takes 1 second but a human stays on the page for 3 minutes, it's negligible. But if the challenge takes 1 second and the scraper does its job in 5 seconds, you already have a 20% slowdown.
No, because in this case there are cookies involved. If the scraper accepts cookies then it's trivial to detect it and block it. If it doesn't, it will have to solve the challenge every single time.
For what it's worth, kernel.org seems to be running an old version of Anubis that predates the current challenge generation method. Previously it took information about the user request, hashed it, and then relied on that being idempotent to avoid having to store state. This didn't scale and was prone to issues like in the OP.
The modern version of Anubis as of PR https://github.com/TecharoHQ/anubis/pull/749 uses a different flow. Minting a challenge generates state including 64 bytes of random data. This random data is sent to the client and used on the server side in order to validate challenge solutions.
The core problem here is that kernel.org isn't upgrading their version of Anubis as it's released. I suspect this means they're also vulnerable to GHSA-jhjj-2g64-px7c.
Right, I get that. I'm just saying that over the long term, you're going to have to find asymmetric costs to apply to scrapers, or it's not going to work. I'm not criticizing any specific implementation detail of your current system. It's good to have a place to take it!
I think that's the valuable observation in this post. Tavis can tell me I'm wrong. :)
> But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.
Based on my own experience fighting these AI scrapers, I feel that the way they are actually implemented means that, in practice, there is asymmetry in the work scrapers have to do vs. humans.
The pattern these scrapers follow is that they are highly distributed. I’ll see a given {ip, UA} pair make a request to /foo, immediately followed by _hundreds_ of requests from completely different {ip, UA} pairs to all the links from that page (i.e. /foo/a, /foo/b, /foo/c, etc.).
This is a big part of what makes these AI crawlers such a challenge for us admins. There isn’t a whole lot we can do to apply regular rate limiting techniques: the IPs are always changing and are no longer limited to corporate ASN (I’m now seeing IPs belonging to consumer ISPs and even cell phone companies), and the User Agents all look genuine. But when looking through the logs you can see the pattern that all these unrelated requests are actually working together to perform a BFS traversal of your site.
Given this pattern, I believe that’s what makes the Anubis approach actually work well in practice. A given user will encounter the challenge once when accessing the site the first time, then they’ll be able to navigate through it without incurring any cost. The AI scrapers, meanwhile, would need to solve the challenge for every single one of their “nodes” (or whatever it is they would call their {ip, UA} pairs). From a site reliability perspective, I don’t even care if the crawlers manage to solve the challenge or not. That it manages to slow them down enough to rate limit them as a network is enough.
To be clear: I don’t disagree with you that the cost incurred by regular human users is still high. But I don’t think it’s fair to say the cost to the adversary isn’t asymmetrical. It wouldn’t be if the AI crawlers hadn’t converged on an implementation that behaves like a DDoS botnet.
> The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).
No, that's missing the point. Anubis is effectively a DDoS protection system, all the talking about AI bots comes from the fact that the latest wave of DDoS attacks was initiated by AI scrapers, whether intentionally or not.
If these bots would clone git repos instead of unleashing the hordes of dumbest bots on Earth pretending to be thousands and thousands of users browsing through git blame web UI, there would be no need for Anubis.
Did you accidentally reply to a wrong comment? (not trying to be snarky, just confused)
The only "justification" there would be is that it keeps the server online that struggled under load before deploying it. That's the whole reason why major FLOSS projects and code forges have deployed Anubis. Nobody cares about bots downloading FLOSS code or kernel mailing lists archives; they care about keeping their infrastructure running and whether it's being DDoSed or not.
I just said you didn't have to justify it. I don't care why you run it. Run whatever you want. The point of the post is that regardless of your reasons for running it, it's unlikely to work in the long run.
And what I said is that all these most visible deployments of Anubis did not deploy it to be a content protection system of any kind, so it doesn't have to work this way at all for them. As long as the server doesn't struggle with load anymore after deploying Anubis, it's a win - and it works so far.
(and frankly, it likely will only need to work until the bubble bursts, making "the long run" irrelevant)
Not exactly sure what you're talking about. The problem is caused by tons of shitty companies cutting corners to collect training data as fast as possible, fueled by easy money that you get by putting "AI" somewhere in your company's name.
As soon as the investment boom is over, this will be largely gone. LLMs will continue to be trained and data will continue to be scraped, but that alone isn't the problem. Search engine crawlers somehow manage not to DDoS the servers they pull the data from, competent AI scrapers can do the same. In fact, a competent AI scraper wouldn't even be stopped by Anubis as it is right now at all, and yet Anubis works pretty well in practice. Go figure.
> There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.
That depends on what you count as normal users, though. Users who want to use alternative players also have to deal with this, and since yt-dlp (and youtube-dl before it) have been able to provide a solution for those users, and bots can just do the same, I'm not sure I'd call the scheme successful in any way.
The (almost only?) distinguishing factor between genuine users and bots is the total volume of requests, but this can still be used for asymmetric costs. If botPain > botPainThreshold and humanPain < humanPainThreshold then Anubis is working as intended. A key point is that those inequalities look different at the next level of detail. A very rough model might be:
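(Illustrative only; the constants are made up.)

    botPain   ≈ total CPU burned = (number of challenges forced) × (CPU per challenge)
    humanPain ≈ c_1 × (worst-case elapsed delay on a page load) + c_2 × (CPU burned over a session)

with c_1 >> c_2, since people notice lag far more readily than they notice battery drain.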
The article points out that the botPain Anubis currently generates is unfortunately much too low to hit any realistic threshold. But if the cost model I've suggested above is in any way realistic, then useful improvements would include:
1. More frequent but less taxing computation demands (this assumes c_1 >> c_2)
2. Parallel computation (this improves the human experience with no effect for bots)
ETA: Concretely, regarding (1), I would tolerate 500ms lag on every page load (meaning forget about the 7-day cookie), and wouldn't notice 250ms.
That's exactly what I'm saying isn't happening: the user pays some cost C per article, and the bot pays exactly the same cost C. Both obtain the same reward. That's not how Hashcash works.
Can you flesh that out more? In the case of AI scrapers it seems especially clear: the model companies just want tokens, and are paying a (one-time) cost of C for N tokens.
Again, with Hashcash, this isn't how it works: most outbound spam messages are worthless. The point of the system is to exploit the negative exponent on the attacker's value function.
The scraper breaking every time a new version of Anubis is deployed, until new anti-Anubis features are implemented, is the point; if the scrapers were well-engineered by a team that cared about the individual sites they're scraping, they probably wouldn't be so pathological towards forges.
The human-labor cost of working around Anubis is unlikely to be paid unless it affects enough data to be worth dedicating time to, and the data they're trying to scrape can typically be obtained "respectfully" in those cases -- instead of hitting the git blame route on every file of every commit of every repo, just clone the repos and run it locally, etc.
Sure, but if that's the case, you don't need the POW, which is what bugs people about this design. I'm not objecting to the idea of anti-bot content protection on websites.
Perhaps I caused confusion by writing "If botPain > botPainThreshold and humanPain < humanPainThreshold then Anubis is working as intended", as I'm not actually disputing that Anubis is currently ineffective against bots. (The article makes that point and I agree with it.) I'm arguing against what I take to be your stronger claim, namely that no "Anubis-like" countermeasure (meaning no countermeasure that charges each request the same amount of CPU in expectation) can work.
I claim that the cost for the two classes of user are meaningfully different: bots care exclusively about the total CPU usage, while humans care about some subjective combination of average and worst-case elapsed times on page loads. Because the sheer number of requests done by bots is so much higher, there's an opportunity to hurt them disproportionately according to their cost model by tweaking Anubis to increase the frequency of checks but decrease each check's elapsed time below the threshold of human annoyance.
The fundamental failure of this is that you can’t publish data to the web and not publish data to the web. If you make things public, the public will use it.
It’s ineffective. (And furry sex-subculture propaganda pushed by its author, which is out of place in such software.)
The misguided parenthetical aside, this is not about resources being public, this is about bad actors accessing those resources in a highly inefficient and resource-intensive manner, effectively DDOS-ing the source.
Yes, and those sites take way more effort to crawl than other sites. They may still get crawled, but likely less often than the ones that don't use JavaScript for rendering (which is the main purpose of Anubis - saving bandwidth from crawlers who crawl sites way too often).
(Also, note the difference between using JavaScript for display logic and requiring JavaScript to load any content at all. Most websites do the first, the second isn't quite as common.)
TFA — and most comments here — seem to completely miss what I thought was the main point of Anubis: it counters the crawler's "identity scattering"/sybil'ing/parallel crawling.
Any access will fall into either of the following categories:
- client with JS and cookies. In this case the server now has an identity to apply rate limiting to, from the cookie. Humans should never hit it, but crawlers will be slowed down immensely or ejected. Of course the identity can be rotated — at the cost of solving the puzzle again.
- amnesiac (no cookies) clients with JS. Each access is now expensive.
(- no JS - no access.)
The point is to prevent parallel crawling and overloading the server. Crawlers can still start an arbitrary number of parallel crawls, but each one costs to start and needs to stay below some rate limit. Previously, the server would collapse under thousands of crawler requests per second. That is what Anubis is making prohibitively expensive.
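A sketch of what that buys the operator (Express-style middleware, assuming cookie-parser, with made-up numbers and a placeholder cookie name): once the challenge cookie exists, ordinary token-bucket rate limiting finally has a stable key to hang on to, which per-IP limiting no longer provides.

    // Sketch: per-identity token bucket keyed on the (already validated)
    // challenge cookie rather than on the IP address.
    const buckets = new Map(); // cookieValue -> { tokens, last }

    function rateLimitByChallengeCookie(req, res, next) {
      const id = req.cookies['anubis-auth-cookie'];    // placeholder name; whatever your challenge layer sets
      if (!id) return res.redirect('/challenge');      // no identity: go solve the puzzle first

      const now = Date.now();
      const b = buckets.get(id) ?? { tokens: 30, last: now };
      b.tokens = Math.min(30, b.tokens + (now - b.last) / 1000); // refill 1 req/s, burst of 30
      b.last = now;

      if (b.tokens < 1) return res.status(429).end();  // this identity exhausted its budget
      b.tokens -= 1;
      buckets.set(id, b);
      next();
    }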
Yes, I think you're right.
The commentary about its (presumed, imagined) effectiveness is very much making the assumption that it's designed to be an impenetrable wall[0] -- i.e. prevent bots from accessing the content entirely.
I think TFA is generally quite good and has something of a good point about the economics of the situation, but finding that the math shakes out that way should, perhaps, lead one to question their starting point / assumptions[1].
In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?
[0] Mentioning 'impenetrable wall' is probably setting off alarm bells, because of course that would be a bad design.
[1] (Edited to add:) I should say 'to question their assumptions more' -- like I said, the article is quite good and it does present this as confusing, at least.
> In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?
I agree, but the advertising is the whole issue. "Checking to see you're not a bot!" and all that.
Therefore some people using Anubis expect it to be an impenetrable wall, to "block AI scrapers", especially those that believe it's a way for them to be excluded from training data.
It's why just a few days ago there was a HN frontpage post of someone complaining that "AI scrapers have learnt to get past Anubis".
But that is a fight that one will never win (analog hole as the nuclear option).
If it said something like "Wait 5 seconds, our servers are busy!", I would think that people's expectations will be more accurate.
As a robot I'm really not that sympathetic to anti-bot language backfiring on humans. I have to look away every time it comes up on my screen. If they changed their language and advertising, I'll be more sympathetic -- it's not as if I disagree that overloading servers for not much benefit is bad!
You could set up a system for parallelizing the creation of these Anubis PoW cookies independent of the crawling logic. That would probably work, but it's a pretty heavy lift compared to 'just run a browser with JavaScript'.
Well maybe, but even then, how many parallel crawls are you going to do per site? 100 maybe? You can still get enough keys to do that for all sites in just a few hours per week.
I'm a scraper developer and Anubis would have worked 10 - 20 years ago, but now all broad scrapers run on a real headless browser with full cookie support and costs relatively nothing in compute. I'd be surprised if LLM bots would use anything else given the fact that they have all of this compute and engineers already available.
That being said, one point is very correct here - by far the best effort to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times", because handling something custom is very difficult at broad scale. It took the author just a few minutes to solve this, but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation, which is likely just not worth it.
You can actually see this in real life if you google web scraping services and which targets they claim to bypass - all of them bypass generic anti-bots like Cloudflare, Akamai etc. but struggle with custom and rare stuff like Chinese websites or small forums, because the scraping market is a market like any other and high-value problems are solved first. So becoming a low-value problem is a very easy way to avoid confrontation.
> That being said, one point is very correct here - by far the best effort to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times", because handling something custom is very difficult at broad scale.
Isn't this what Microsoft is trying to do with their sliding puzzle piece and choose the closest match type systems?
Also, if you come in on a mobile browser it could ask you to lay your phone flat and then shake it up and down for a second or something similar that would be a challenge for a datacenter bot pretending to be a phone.
How do you bypass Cloudflare? I do some light scraping for some personal stuff, but I can't figure out how to bypass it. Like, do you randomize IPs using several VPNs at the same time?
I usually just sit there on my phone pressing the "I am not a robot box" when it triggers.
It's still pretty hard to bypass it with open source solutions. To bypass CF you need:
- an automated browser that doesn't leak the fact it's being automated
- ability to fake the browser fingerprint (e.g. Linux is heavily penalized)
- residential or mobile proxies (for small scale your home IP is probably good enough)
- deployment environment that isn't leaked to the browser.
- realistic scrape pattern and header configuration (header order, referer, prewalk some pages with cookies etc.)
This is really hard to do at scale, but for small personal scripts you can get reasonable results with flavor-of-the-month Playwright forks on GitHub like nodriver, or dedicated tools like FlareSolverr. Still, I'd just find a web scraping API with a low entry price, drop $15/month and avoid this chase, because it can be really time consuming.
If you're really on a budget - most of them offer 1,000 credits for free, which will get you on average 100 pages a month per service, and you can get 10 of them as they all mostly function the same.
That's really the only option available here, right? The goal is to keep sites low friction for end users while stopping bots. Requiring an account with some moderation would stop the majority of bots, but it would add a lot of friction for your human users.
The other option is proof of work. Make clients use JS to do expensive calculations that aren’t a big deal for single clients, but get expensive at scale. Not ideal, but another tool to potentially use.
> It took the author just few minutes to solve this but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation which is likely just not worth it.
These are trivial for an AI agent to solve though, even with very dumb watered down models.
That seems like it would make bot-blocking SaaS (like Cloudflare or Tollbit) more attractive, because it could amortize that effort/cost across many clients.
>This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.
>I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.
No need to mimic the actual challenge process. Just change your user agent to not have "Mozilla" in it; Anubis only serves you the challenge if your user agent contains that. For myself, I just made a sideloaded browser extension to override the UA header for the handful of websites I visit that use Anubis, including those two kernel.org domains.
(Why do I do it? For most of them I don't enable JS or cookies for so the challenge wouldn't pass anyway. For the ones that I do enable JS or cookies for, various self-hosted gitlab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)
Sadly, touching the user-agent header more or less instantly makes you uniquely identifiable.
Browser fingerprinting works best against people with unique headers. There's probably millions of people using an untouched safari on iPhone. Once you touch your user-agent header, you're likely the only person in the world with that fingerprint.
If you're browsing with a browser, then there are 1000 ways to identify you. If you're browsing without a browser, then there is at least one way to identify you.
'There's so many cliffs around that not jumping off that one barely helps you'.
I meeeeeannn... sure? I know that browser fingerprinting works quite well without, but custom headers are actually a game over in terms of not getting tracked.
UA fingerprinting isn't a problem for me. As I said I only modify the UA for the handful of sites that use Anubis that I visit. I trust those sites enough that them fingerprinting me is unlikely, and won't be a problem even if they did.
yes, but it puts you in the incredibly small bucket of "users that has weird headers that don't mesh well", and makes using the rest of the (many) other fingerprinting techniques all the more accurate.
While it's definitely possible to train a model for that, 'very easy' is nonsense.
Unless you've got some superintelligence hidden somewhere, you'd choose a neural net. To train, you need a large supply of LABELED data. Seems like a challenge to build that dataset; after all, we have no scalable method for classifying as of yet.
The string “null” or actually null? I have recently seen a huge amount of bot traffic which has actually no UA and just outright block it. It’s almost entirely (microsoft cloud) Azure script attacks.
Yes, but you can take the bet, and win more often than not, that your adversary is most likely not tracking visitor probabilities if you can detect that they aren't using a major fingerprinting provider.
The string I use in my extension is "anubis is crap". I took it from a different FF extension that had been posted in a /g/ thread about Anubis, which is where I got the idea from in the first place. I don't use other people's extensions if I can help it (because of the obvious risk), but I figured I'd use the same string in my own extension so as to be combined with users of that extension for the sake of user-agent statistics.
It's also a bit telling that you read the phrase "I took it from a different FF extension that had been posted" and interpreted it as taking advice instead of reading source code.
The UA will be compared to other data points such as screen resolution, fonts, plugins, etc. which means that you are definitely more identifiable if you change just the UA vs changing your entire browser or operating system.
Anubis will let curl through, while blocking any non-mainstream browser which will likely say "Mozilla" in its UA just for best compatibility and call that a "bot"? WTF.
> (Why do I do it? For most of them I don't enable JS so the challenge wouldn't pass anyway. For the ones that I do enable JS for, various self-hosted gitlab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)
Hm. If your site is "sticky", can it mine Monero or something in the background?
We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"
> We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"
Doesn't Safari sort of already do that? "This tab is using significant power", or summat? I know I've seen that message, I just don't have a good repro.
Edge does, as well. It drops a warning in the middle of the screen, displays the resource-hogging tab, and asks whether you want to force-close the tab or wait.
> Just change your user agent to not have "Mozilla" in it. Anubis only serves you the challenge if you have that.
Won't that break many other things? My understanding was that basically everyone's user-agent string nowadays is packed with a full suite of standard lies.
Nope, they're on cloudflare so that all my banking traffic can be intercepted by a foreign company I have no relation to. The web is really headed in a great direction :)
If your Firefox supports sideloading extensions then making extensions that modify request or response headers is easy.
All the API is documented in https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web... . My Anubis extension modifies request headers using `browser.webRequest.onBeforeSendHeaders.addListener()` . Your case sounds like modifying response headers which is `browser.webRequest.onHeadersReceived.addListener()` . Either way the API is all documented there, as is the `manifest.json` that you'll need to write to register this JS code as a background script and whatever permissions you need.
Then zip the manifest and the script together, rename the zip file to "<id_in_manifest>.xpi", place it in the sideloaded extensions directory (depends on distro, eg /usr/lib/firefox/browser/extensions), restart firefox and it should show up. If you need to debug it, you can use the about:debugging#/runtime/this-firefox page to launch a devtools window connected to the background script.
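For reference, the entire background script for that kind of UA override is only a few lines (the site list and replacement string here are placeholders; the manifest additionally needs the "webRequest", "webRequestBlocking" and matching host permissions, plus a gecko id for sideloading):

    // background.js: rewrite the User-Agent header for a handful of sites
    // so Anubis never serves the challenge (it keys off "Mozilla" in the UA).
    const SITES = ['https://git.example.org/*', 'https://lore.example.org/*']; // placeholder list

    browser.webRequest.onBeforeSendHeaders.addListener(
      (details) => {
        for (const header of details.requestHeaders) {
          if (header.name.toLowerCase() === 'user-agent') {
            header.value = 'plain-http-client/1.0'; // anything without "Mozilla"
          }
        }
        return { requestHeaders: details.requestHeaders };
      },
      { urls: SITES },
      ['blocking', 'requestHeaders']
    );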
This is neither here nor there but the character isn't a cat. It's in the name, Anubis, who is an Egyptian deity typically depicted as a jackal or generic canine, and the gatekeeper of the afterlife who weighs the souls of the dead (hence the tagline). So more of a dog-girl, or jackal-girl if you want to be technical.
Every representation I've ever seen of Anubis - including remarkably well preserved statues from antiquity - are either a male human body with a canine head, or fully canine.
This anime girl is not Anubis. It's a modern cartoon characters that simply borrows the name because it sounds cool, without caring anything about the history or meaning behind it.
Anime culture does this all the time, drawing on inspiration from all cultures but nearly always only paying the barest lip service to the original meaning.
I don't have an issue with that, personally. All cultures and religions should be fair game as inspiration for any kind of art. But I do have an issue with claiming that the newly inspired creation is equivalent in any way to the original source just because they share a name and some other very superficial characteristics.
It's also that the anime style already makes all heads shaped vaguely like felines. Add upwards pointing furry ears and it's not wrong to call it a cat girl.
> they share a name and some other very superficial characteristics.
I wasn't implying anything more than that, although now I see the confusing wording in my original comment. All I meant to say was that between the name and appearance it's clear the mascot is canid rather than feline. Not that the anime girl with dog ears is an accurate representation of the Egyptian deity haha.
I think you're taking it a bit too seriously. In turn, I am, of course, also taking it too seriously.
> I do have an issue with claiming that the newly inspired creation is equivalent in any way to the original source
Nobody is claiming that the drawing is Anubis or even a depiction of Anubis, like the statues etc. you are interested in.
It's a mascot. "Mascot design by CELPHASE" -- it says, in the screenshot.
Generally speaking -- I can't say that this is what happened with this project -- you would commission someone to draw or otherwise create a mascot character for something after the primary ideation phase of the something.
This Anubis-inspired mascot is, presumably, Anubis-inspired because the project is called Anubis, which is a name with fairly obvious connections to and an understanding of "the original source".
> Anime culture does this all the time, ...
I don't know what bone you're picking here. This seems like a weird thing to say.
I mean, what anime culture? It's a drawing on a website.
Yes, I can see the manga/anime influence -- it's a very popular, mainstream artform around the world.
I like to talk seriously about art, representation, and culture. What's wrong with that? It's at least as interesting as discussing databases or web frameworks.
In case you feel it needs linking to the purpose of this forum, the art in question here is being forcefully shown to people in a situation that makes them do a massive context switch. I want to look at the linux or ffmpeg source code but my browser failed a security check and now I'm staring at a random anime girl instead. What's the meaning here, what's the purpose behind this? I feel that there's none, except for the library author's preference, and therefore this context switch wasted my time and energy.
Maybe I'm being unfair and the code author is so wrapped up in liking anime girls that they think it would be soothing to people who end up on that page. In which case, massive failure of understanding the target audience.
Maybe they could allow changing the art or turning it off?
> Anime culture does this all the time
>> I don't know what bone you're picking here
I'm not picking any bone there. I love anime, and I love the way it feels so free in borrowing from other cultures. That said, the anime I tend to like is more Miyazaki or Satoshi Kon and less kawaii girls.
Hey there! The design of the mascot serves a dual-purpose, and was done very intentionally.
Your workflow getting interrupted, especially with a full-screen challenge page, is a very high-stress event. The mascot serves a purpose in being particularly distinct and recognizable, but also disarming for first-time users. This emotional response was calibrated particularly for more non-technical users who would be quick to be worried about 'being hit by a virus'. In particular I find that bot challenges tend to feel very accusing ("PROVE! PROVE YOU ARE NOT A ROBOT!"), and that a little bit of silly would disarm that feeling.
Similarly, that's why the error version of the mascot looks more surprised if anything. After all, only legitimate users will ever see that. (bots don't have eyes, or at least don't particularly care)
As for the design specifically, making it more anubis-like would probably have been a bit TOO furry and significantly hurt adoption. The design prompt was to stick to a jackal girl. Then again, I kinda wished in retrospect I had made the ears much, much longer.
Viewing the challenge screenshot again after reading your response definitely sheds light on why I have no aggro toward Anubis (even if the branding supposedly wouldn't jibe well with a super professional platform, but hey, I think having the alternate, commercial offering is super brilliant in turn).
On the other hand, I immediately see red when I get stopped in my tracks by all the widely used (and often infinitely-unpassable) Cloudflare/Google/etc. implementations with wordings that do nothing but add insult to injury.
Thank you for the thought you put into that. I think you guys hit it out of the park.
What does all of this have to do with (depictions of, references to, etc.) Anubis though? You responded to a comment about the mascot surely being a "jackalgirl" as opposed to a "catgirl" because of the Anubis name and other references. It seemed like you had an issue with the artwork, that it wasn't Anubisy enough, or something. Why would the drawing being more like the statues improve the situation?
Now you seem to be saying that anything that isn't what you wanted to find on the website is the problem. This makes sense, it just has nothing to do with what is shown on that page. But you're effectively getting frustrated at not getting to the page you wanted to and then directing your frustration toward the presentation of the "error message". That does not make sense.
> I like to talk seriously about art, representation, and culture. What's wrong with that? It's at least as interesting as discussing databases or web frameworks.
I don't have a problem with talking about art, you'll note that I responded in kind.
When I said "I think you're taking it too seriously" I wasn't expecting that to be extrapolated to all subjects, just the one that was being discussed in the local context.
>I like to talk seriously about art, representation, and culture. What's wrong with that?
It's no fun.
For one, you pulled your original response out of your ass. That the mascot is not a "catgirl" as identified by OP, but a canine variant of the same concept, because the project is named after the Egyptian god, is both obvious and uninteresting. You added nothing to that.
You're running around shouting "I get the joke, I get the joke" while grandstanding about how serious you are about art, one of the human pursuits helped least by seriousness, considering.
If you've decided you also need to be silly about it today, then at least have the decency to make up a conspiracy theory about the author being in fact a front for an Egyptian cult trying to bring back the old gods using the harvested compute, or whatever.
>massive failure of understanding the target audience.
Heh.
The anime image is put there as an intentional, and to my view rightful, act of irreverence.
One that works, too: I unironically find the people going like "my girl/boss will be mad at me if they see this style of image on my computer" positively hilarious.
>Maybe they could allow changing the art or turning it off?
They sure do. For money. Was in the release announcement.
Not enough irreverence in your game and you can end up being the person who let them build the torment nexus. Many such cases, and that's why we're where we are.
>That said, the anime I tend to like is more Miyazaki or Satoshi Kon and less kawaii girls.
Both. I don't want any random pictures of young girls popping up while I'm browsing the web, and why would adults insert pictures of young girls into their project in the first place?
What a strange comment to make about a cartoon character.
Also, the anime reference is very much intentional at this point; while the source code is open so anyone can change it, the author sells a version for the boring serious types where you can easily change the logo without recompiling the source yourself. Adding the additional bottleneck of having to sync a custom fork or paying out to placate the "serious" people is a great way to get the large corporations to pay a small fee to cover maintenance.
This anime thing is the one thing about computer culture that I just don't seem to get. I did not get it as a child, when suddenly half of children's cartoons became anime and I just disliked the aesthetic. I didn't get it in school, when people started reading manga. I'll probably never get it.
Therefore I sincerely hope they do go away from Anubis, so I can keep dwelling in my ignorance.
I feel the same. It's a distinct part of nerd culture.
In the '70s, if you were into computers you were most likely also a fan of Star Trek. I remember an anecdote from the 1990s when an entire dial-up ISP was troubleshooting its modem pools because there were zero people connected and they assumed there was an outage. The outage happened to occur exactly while that week's episode of X-Files was airing in their time zone. Just as the credits rolled, all modems suddenly lit up as people connected to IRC and Usenet to chat about the episode. In ~1994 close to 100% of residential internet users also happened to follow X-Files on linear television. There was essentially a 1:1 overlap between computer nerds and sci-fi nerds.
Today's analog seems to be that almost all nerds love anime and Andy Weir books and some of us feel a bit alienated by that.
> Today's analog seems to be that almost all nerds love anime and Andy Weir books and some of us feel a bit alienated by that.
Especially because (from my observation) modern "nerds" who enjoy anime seem to relish at bringing it (and various sex-related things) up at inappropriate times and are generally emotionally immature.
It's quite refreshing seeing that other people have similar lines of thinking and that I'm not alone in feeling somewhat alienated.
I think I'd push back and say that nerd culture is no longer really a single thing. Back in the star trek days, the nerd "community" was small enough that star trek could be a defining quality shared by the majority. Now the nerd community has grown, and there are too many people to have defining parts of the culture that are loved by the majority.
Eg if the nerd community had $x$ people in the star trek days, now there are more than $x$ nerds who like anime and more than $x$ nerds who dislike it. And the total size is much bigger than both.
You don't have to get it to be able to accept that others like it. Why not let them have their fun?
This sounds more as though you actively dislike anime than merely not seeing the appeal or being "ignorant". If you were to ignore it, there wouldn't be an issue...
Well, this is their personal project. You're welcome to make your own, or you can remove the branding if you want: it's open licensed. Or if you're not a coder, they even offer to remove the branding if you support the project
I don't get the impression that it's meant to be annoying, but a personal preference. I can't know that, though whitelabeling is a common thing people pay for without the original brand having made their logo extra ugly
While subjecting the entire Internet to industrial-scale abuse by inconsiderate and poorly written crawlers for the sake of building an overhyped "whatever" is of course perfectly acceptable.
Well, to be fair, that's not our doing, so it's not really an argument for why one should accept something one apparently dislikes (I myself find the character funny and it brings a fun moment when it flashes by, but I can understand/accept that others see it differently, of course).
Might've caught on because the animes had plots, instead of considering viewers to have the attention spans of idiots like Western kids' shows (and, in the 21st century, software) tend to do.
I don't think it's relevant to debate whether anime or other forms of media are objectively better. But as someone who has never understood anime, I view mainstream western TV series as filled with hours of cleverly written dialogue and long story arcs, whereas the little anime I've watched seems to mostly be overly dramatic colorful action scenes with intense screamed dialogue and strange bodily noises. Should we maybe assume that we are both a bit ignorant of the preferences of others?
Let's rather assume that you're the kind of person who debates a thing by first saying that it's not relevant to debate, then putting forward a pretty out-of-context comparison, and finally concluding that I should feel bad about myself. That kind of story arc does seem to correlate with finding mainstream Western TV worthwhile; there's something structurally similar to the funny way your thought went.
It's more likely that the project itself will disappear into irrelevance as soon as AI scrapers bother implementing the PoW (which is trivial for them, as the post explains) or figure out that they can simply remove "Mozilla" from their user-agent to bypass it entirely.
That's what it's for, isn't it? Make crawling slower and more expensive. Shitty crawlers not being able to run the PoW efficiently or at all is just a plus. Although:
> which is trivial for them, as the post explains
Sadly the site's being hugged to death right now so I can't really tell if I'm missing part of your argument here.
> figure out that they can simply remove "Mozilla" from their user-agent
And flag themselves in the logs to get separately blocked or rate limited. Servers win if malicious bots identify themselves again, and forcing them to change the user agent does that.
> That's what it's for, isn't it? Make crawling slower and more expensive.
The default settings produce a computational cost of milliseconds for a week of access. For this to be relevant it would have to be significantly more expensive to the point it would interfere with human access.
I thought the point (which the article misses) is that a token gives you an identity, and an identity can be tracked and rate limited.
So a crawler that behaves ethically and puts very little strain on the server should indeed be able to crawl for a whole week on cheap compute, while one that hammers the server hard will not.
...unless you're sus, then the difficulty increases. And if you unleash a single scraping bot, you're not a problem anyway. It's for botnets of thousands, mimicking browsers on residential connections to make them hard to filter out or rate limit, effectively DDoSing the server.
Perhaps you just don't realize how much the scraping load has increased in the last 2 years or so. If your server can stay up after deploying Anubis, you've already won.
If it's an actual botnet, then it's hijacked computers belonging to other people, who are the ones paying the power bills. The attacker doesn't care that each computer takes a long time to calculate. If you have 1000 computers each spending 5s/page, then your botnet can retrieve 200 pages/s.
If it's just a cloud deployment, still it has resources that vastly outstrip a normal person's.
The fundamental issue is that you can't serve example.com slower than a legitimate user on a crappy 10-year-old laptop could tolerate, because that starts losing you real human users. So if, let's say, a user is happy to wait 5 seconds per page at most, then this is absolutely no obstacle to a modern 128-core Epyc. If you make it troublesome to the 128-core monster, then no normal person will find the site usable.
But right now, it does, as these bots tend to be really dumb (presumably, a more competent botnet user wouldn't have it do an equivalent of copying Wikipedia by crawling through its every single page in the first place). With a bit of luck, it will be enough until the bubble bursts and the problem is gone, and you won't need to deploy Anubis just to keep your server running anymore.
The explanation of how the estimate is made is more detailed, but here is the referenced conclusion:
>> So (11508 websites * 2^16 sha256 operations) / 2^21, that’s about 6 minutes to mine enough tokens for every single Anubis deployment in the world. That means the cost of unrestricted crawler access to the internet for a week is approximately $0.
>> In fact, I don’t think we reach a single cent per month in compute costs until several million sites have deployed Anubis.
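To make that concrete, here is a minimal sketch of an Anubis-style leading-zero-bits check over SHA-256 (an approximation; the real challenge format may differ), plus the quoted arithmetic:

    import hashlib, time

    def solve(challenge: str, difficulty_bits: int = 16) -> int:
        """Find a nonce so that sha256(challenge + nonce) has `difficulty_bits`
        leading zero bits -- roughly what an Anubis-style check boils down to."""
        target = 1 << (256 - difficulty_bits)
        nonce = 0
        while int.from_bytes(hashlib.sha256(f"{challenge}{nonce}".encode()).digest(), "big") >= target:
            nonce += 1
        return nonce

    t0 = time.time()
    solve("example-challenge")                    # ~2^16 hashes on average
    print(f"solved in {time.time() - t0:.3f} s")  # tens of milliseconds on a laptop

    # The quoted estimate: 11508 sites * 2^16 hashes at ~2^21 hashes/second
    print(11508 * 2**16 / 2**21 / 60, "minutes")  # ≈ 6 minutes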
If you use one solution to browse the entire site, you're linking every pageload to the same session, and can then be easily singled out and blocked. The idea that you can scan a site for a week by solving the riddle once is incorrect. That works for non-abusers.
Well, since they can get a unique token for every site every 6 minutes only using a free GCP VPS that doesn't really matter, scraping can easily be spread out across tokens or they can cheaply and quickly get a new one whenever the old one gets blocked.
Unless they require a new token for each new request or every x minutes or something it won't matter.
And as the poster mentioned if you are running an AI model you probably have GPUs to spare. Unlike the dev working from a 5 year old Thinkpad or their phone.
Apparently bcrypt has a design that makes it difficult to accelerate effectively on a GPU.
Indeed, a new token should be requested per request; the tokens could also be pre-calculated, so that while the user is browsing a page, the browser could calculate tokens suitable for accessing the next likely browsing targets (e.g. the "next" button).
The biggest downside I see is that mobile devices would likely suffer. Possibly the difficulty of the challenge is/should be varied by other metrics, such as the number of requests arriving per time unit from a C-class network etc.
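For what it's worth, a purely hypothetical sketch of such a memory-hard, per-request challenge (not how Anubis works today; scrypt stands in for bcrypt here because it is in Python's standard library and is likewise designed to resist GPU acceleration):

    import hashlib, secrets

    def issue_challenge() -> bytes:
        # Server hands out a fresh random challenge per request (or per likely next page,
        # so the browser can pre-compute the answer while the user reads the current one).
        return secrets.token_bytes(16)

    def solve(challenge: bytes, difficulty_bits: int = 4) -> int:
        # Find a nonce whose scrypt output has `difficulty_bits` leading zero bits.
        # n=2**14, r=8 needs ~16 MiB per evaluation, which is what hurts GPUs;
        # the difficulty would need tuning so low-end phones aren't punished.
        target = 1 << (256 - difficulty_bits)
        nonce = 0
        while True:
            digest = hashlib.scrypt(nonce.to_bytes(8, "big"), salt=challenge,
                                    n=2**14, r=8, p=1, dklen=32)
            if int.from_bytes(digest, "big") < target:
                return nonce
            nonce += 1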
That's a matter of increasing the difficulty isn't it? And if the added cost is really negligible, we can just switch to a "refresh" challenge for the same added latency and without burning energy for no reason.
And if you don't increase it, crawlers will DoS the sites again and legitimate users will have to wait until the next tech hype bubble for the site to load, which is the reason why software like Anubis is being installed in the first place.
If you triple the difficulty, the cost of solving the PoW is still negligible to the crawlers but you've harmed real users even more.
The reason why anubis works is not the PoW, it is that the dev time to implement the bypass takes out the lowest effort bots. Thus the correct response is to keep the PoW difficulty low so you minimize harm to real users. Or better yet, implementing your own custom check that doesn't use any PoW and relies on ever higher obscurity to block the low effort bots.
The more anubis is used, the less effective it is and the more it harms real users.
I'm not using the latest generation of phones, not in the slightest, and I don't really care, because the alternative to Anubis-like intersitials is the sites not loading at all when they're mass-crawled to death.
There are some sites[1] that can print your user agent for you. Try it in a few different browsers and you will be surprised. They're honestly unhinged... I have no idea why we still use this header in 2025!
> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.
Counterpoint - it seems to work. People use anubis because its the best of bad options.
If theory and reality disagree, it means either you are missing something or your theory is wrong.
Geoblocking China and Singapore solves that problem, it seems, at least the non-residential IPs (though I also see a lot of aggressive bots coming from residential IP space from China).
I wish the old trick of sending CCP-unfriendly content to get the great firewall to kill the connection for you still worked, but in the days of TLS everywhere that doesn't seem to work anymore.
Superficial comment regarding the catgirl, I don't get why some people are so adamant and enthusiastic for others to see it, but if you like me find it distasteful and annoying, consider copying these uBlock rules: https://sdf.org/~pkal/src+etc/anubis-ublock.txt. Brings me joy to know what I am not seeing whenever I get stopped by this page :)
Can you clarify if you mean that you do not understand the reasons that people dislike these images, or that you find the very idea of disliking them hard to relate to?
I cannot claim that I understand it well, but my best guess is that these are images that represent a kind of culture that I have encountered both in real-life and online that I never felt comfortable around. It doesn't seem unreasonable that this uneasiness around people with identity-constituting interests in anime, Furries, MLP, medieval LARP, etc. transfers back onto their imagery. And to be clear, it is not like I inherently hate anime as a medium or the idea of anthropomorphism in art. There is some kind of social ineptitude around propagating these _kinds_ of interests that bugs me.
I cannot claim that I am satisfied with this explanation. I know that the dislike I feel for this is very similar to what I feel when visiting a hacker space where I don't know anyone. But I hope that I could at least give a feeling for why some people don't like seeing catgirls every time I open a repository, and that it doesn't necessarily have anything to do with advocating for a "corporate soulless web".
I can't really explain it but it definitely feels extremely cringeworthy. Maybe it's the neckbeard sexuality or the weird furry aspect. I don't like it.
> The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans
I'm unsure if this is deadpan humor or if the author has never tried to solve a CAPTCHA that is something like "select the squares with an orthodox rabbi present".
I wonder if it's an intentional quirk that you can only pass some CAPTCHAs if you're a human who knows what an American fire hydrant or school bus looks like?
So much this. The first time one asked me to click on "crosswalks", I genuinely had to think for a while as I struggled to remember WTF a "crosswalk" was in AmEng. I am a native English speaker, writer, editor and professionally qualified teacher, but my form of English does not have the word "crosswalk" or any word that is a synonym for it. (It has phrases instead.)
Our schoolbuses are ordinary buses with a special number on the front. They are no specific colour.
There are other examples which aren't coming immediately to mind, but it is vexing when the designer of a CAPTCHA isn't testing if I am human but if I am American.
I doubt it’s intentional. Google (owner of reCAPTCHA) is a US company, so it’s more likely they either haven’t considered what they see every day is far from universal; don’t care about other countries; or specifically just care about training for the US.
Google demanding I flag yellow cars when asked to flag taxis is the silliest Americanism I've seen. At least the school bus has SCHOOL BUS written all over it and fire hydrants aren't exactly an American exclusive thing.
On some Russian and Asian sites I ran into trouble signing up for a forum using translation software, because the CAPTCHA required me to enter characters I couldn't read or reproduce. It doesn't happen as often as the Google thing, but the problem certainly isn't restricted to American sites!
There are also services out there that will solve any CAPTCHA for you at a very small cost to you. And an AI company will get steep discounts with the volumes of traffic they do.
There are some browser extensions for it too, like NopeCHA, it works 99% of the time and saves me the hassle of doing them.
Any site using CAPTCHAs today is really only hurting their real customers and low-hanging fruit.
Of course this assumes they can't solve the CAPTCHA themselves, with AI, which often they can.
Yes, but not at a rate that enables them to be a risk to your hosting bill. My understanding is that the goal here isn't to prevent crawlers, it's to prevent overly aggressive ones.
Every time I see one of these I think it's a malicious redirect to some pervert-dwelling imageboard.
On that note, is kernel.org really using this for free and not the paid version without the anime? Linux Foundation really that desperate for cash after they gas up all the BMW's?
It's crazy (especially considering anime is more popular now than ever; Netflix alone is making billions a year on anime) that people see a completely innocent little anime picture and immediately think "pervert-dwelling imageboard".
> people see a completely innocent little anime picture and immediately think "pervert-dwelling imageboard"
Think you can thank the furries for that.
Every furry I've happened to come across was very pervy in some way, so that's what immediately comes to mind when I see furry-like pictures like the one shown in the article.
Out of interest, how many furries have you met? I've been to several fur meets, and have met approximately three furries who I would not want to know anymore for one reason or another
Admittedly just a handful. But I met them in entirely non-furry settings, for example as a user of a regular open source program I was a contributor to (which wasn't Rust based[1]).
None of them were very pervy at first, only after I got to know them.
Even if the images aren’t the kind of sexualized (or downright pornographic) content this implies… having cutesy anime girls pop up when a user loads your site is, at best, wildly unprofessional. (Dare I say “cringe”?) For something as serious and legit as kernel.org to have this, I do think it’s frankly shocking and unacceptable.
Huh, why would they need the unbranded version? The branded version works just fine. It's usually easier to deploy ordinary open source software than it is for software that needs to be licensed, because you don't need special download pages or license keys.
If it makes sense for an organization to donate to a project they rely on, then they should just donate. No need to debrand if it's not strictly required, all that would do is give the upstream project less exposure. For design reasons maybe? But LKML isn't "designed" at all, it has always exposed the raw ugly interface of mailing list software.
Also, this brand does have trust. Sure, I'm annoyed by these PoW captcha pages, but I'm a lot more likely to enable Javascript if it's the Anubis character, than if it is debranded. If it is debranded, it could be any of the privacy-invasive captcha vendors, but if it's Anubis, I know exactly what code is going to run.
If I saw an anime pic show up, that'd be a flag. I only know of Anubis' existence and use of anime from HN.
It is only trusted by a small subset of people who are in the know. It is not about "anime bad" but that a large chunk of the population isn't into it for whatever reason.
I love anime but it can also be cringe. I find this cringe as it seems many others do too.
It won't stop the crawlers immediately, but it might lead to an overhyped and underwhelming LLM release from a big name company, and force them to reassess their crawling strategy going forward?
Crawlers already know how to stop crawling recursive or otherwise excessive/suspicious content. They've dealt with this problem long before LLM-related crawling.
Why is kernel.org doing this for essentially static content? Cache control headers and ETags should solve this. Also, the Linux kernel has solved the C10K problem.
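For reference, this is the mechanism being invoked; a minimal sketch of conditional-GET handling (illustrative only, not how kernel.org actually serves its content):

    import hashlib
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = b"<html><body>static content</body></html>"
    ETAG = '"' + hashlib.sha256(PAGE).hexdigest()[:16] + '"'

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # A well-behaved client (or crawler) sends If-None-Match on revisits
            # and gets a body-less 304 if nothing has changed.
            if self.headers.get("If-None-Match") == ETAG:
                self.send_response(304)
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("ETag", ETAG)
            self.send_header("Cache-Control", "public, max-age=3600")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)

    # HTTPServer(("", 8080), Handler).serve_forever()

The catch, as noted elsewhere in the thread, is that the problematic crawlers never send conditional requests in the first place.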
The contents in question are statically generated, 1-3 KB HTML files. Hosting a single image would be the equivalent of cold serving 100s of requests.
Putting up a scraper shield seems like it's more of a political statement than a solution to a real technical problem. It's also antithetical to open collaboration and an open internet of which Linux is a product.
A great option for most people, and indeed Anubis' README recommends using Cloudflare if possible. However, not everyone can use a paid CDN. Some people can't pay because their payment methods aren't accepted. Some people need to serve content, or serve countries, that a major CDN can't for legal and compliance reasons. Some organizations need their own independent infrastructure to serve their organizational mission.
I have an S24 (flagship of 2024) and Anubis often takes 10-20 seconds to complete; that time is going to add up if more and more sites adopt it, leading to a worse browsing experience and wasted battery life.
Meanwhile AI farms will just run their own nuclear reactors eventually and be unaffected.
I really don't understand why someone thought this was a good idea, even if well intentioned.
Something must be wrong on your flagship smartphone because I have an entry level one that doesn't take that long.
It seems there is a large number of operations crawling the web to build models that aren't directly using infrastructure hosted on AI farms BUT botnets running on commodity hardware and residential networks, to keep their IP ranges from being blacklisted. Anubis's point is to block those.
Which browser and which difficulty setting is that?
Because I've got the same model line but about 3 or 4 years older and it usually just flashes by in the browser Lightning from F-droid which is an OS webview wrapper. On occasion a second or maybe two, I assume that's either bad luck in finding a solution or a site with a higher difficulty setting. Not sure if I've seen it in Fennec (firefox mobile) yet but, if so, it's the same there
I've been surprised that this low threshold stops bots but I'm reading in this thread that it's rather that bot operators mostly just haven't bothered implementing the necessary features yet. It's going to get worse... We've not even won the battle let alone the war. Idk if this is going to be sustainable, we'll see where the web ends up...
Either your phone is on some extreme power saving mode, your ad blocker is breaking Javascript, or something is wrong with your phone.
I've certainly seen Anubis take a few seconds (three or four maybe) but that was on a very old phone that barely loaded any website more complex than HN.
I remember that LiteCoin briefly had this idea, to be easy on consumer hardware but hard on GPUs. The ASICs didn't take long to obliterate the idea though.
Maybe there's going to be some form of pay-per-browse system? Even if it's some negligible cost on the order of $1 per month (and packaged with other costs), I think economies of scale would allow servers to perform a lifetime of S24 captchas in a couple of seconds.
That's not bypassing it, that's them finally engaging with the PoW challenge as intended, making crawling slower and more expensive, instead of failing to crawl at all (which would be even more of a plus).
This however forces servers to increase the challenge difficulty, which increases the waiting time for the first-time access.
> After further investigation and communication. This is not a bug. The threat actor group in question installed headless chrome and simply computed the proof of work. I'm just going to submit a default rule that blocks huawei.
It doesn't work for headless Chrome, sure. The thing is that often, for threats like this to work they need lots of scale, and they need it cheaply because the actors are just throwing a wide net and hoping to catch something. Headless Chrome doesn't scale cheaply, so by forcing script kiddies to use it you're pricing them out of their own game. For now.
Doesn't have to be black or white. You can have a much easier challenge for regular visitors if you block the only (and giant) party that has implemented a solver so far. We can work on both fronts at once...
Why does that matter? The challenge needs to stay expensive enough to slow down bots, but legitimate users won't be solving anywhere near the same amount of challenges and the alternative is the site getting crawled to death, so they can wait once in a while.
Too bad the challenge's result is only a waste of electricity. Maybe they should do like some of those alt-coins and search for prime numbers or something similar instead.
Of course that doesn't directly help the site operator. Maybe it could actually do a bit of bitcoin mining for the site owner. Then that could pay for the cost of accessing the site.
This only holds true if the data to be accessed is less valuable than the computational cost. In this case that is false, and spending a few dollars to scrape the data is more than worth it.
Reducing the problem to a cost issue is bound to be short-sighted.
This is not about preventing crawling entirely; it's about finding a way to prevent crawlers from re-crawling everything way too frequently just because crawling is very cheap. Of course it will always be worth it to crawl the Linux kernel mailing list, but maybe with a high enough cost per crawl the crawlers will learn to be fine with only crawling it once per hour, for example.
If you want my help training up your billion dollar model then you should pay me. My content is for humans. If you're not a human you are an unwelcome burden.
Search engines, at least, are designed to index the content, for the purpose of helping humans find it.
Language models are designed to filch content out of my website so it can reproduce it later without telling the humans where it came from or linking them to my site to find the source.
This is exactly the reason that "I just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."
You can't copyright an idea, only a specific expression of an idea. An LLM works at the level of "ideas" (in essence - for example if you subtract the vector for "woman" from "man" and add the difference to "king" you get a point very close to "queen") and reproduces them in new contexts and makes its own connections to other ideas. It would be absurd for you to demand attribution and payment every time someone who read your Python blog said "Python is dynamically type-checked and garbage-collected". Thankfully that's not how the law works. Abusive traffic is a problem, but the world is a better place if humans can learn from these ideas with the help of ChatGPT et al. and to say they shouldn't be allowed to just because your ego demands credit for every idea someone learns from you is purely selfish.
LLMs quite literally work at the level of their source material, that's how training works, that's how RAG works, etc.
There is no proof that LLMs work at the level of "ideas"; if you could prove that, you'd solve a whole lot of incredibly expensive problems that are current bottlenecks for training and inference.
It is a bit ironic that you'd call someone wanting to control and be paid for the thing they themselves created "selfish", while at the same time writing apologia on why it's okay for a trillion dollar private company to steal someone else's work for their own profit.
It isn't some moral imperative that OpenAI gets access to all of humanity's creations so they can turn a profit.
As a reference on the volume aspect: I have a tiny server where I host some of my git repos. After the fans of my server spun increasingly faster/louder every week, I decided to log the requests [1]. In a single week, ClaudeBot made 2.25M (!) requests (7.55GiB), whereas GoogleBot made only 24 requests (8.37MiB). After installing Anubis the traffic went down to before the AI hype started.
As others have said, it's definitely volume, but also the lack of respecting robots.txt. Most AI crawlers that I've seen bombarding our sites just relentlessly scrape anything and everything, without even checking to see if anything has changed since the last time they crawled the site.
Yep, AI scrapers have been breaking our open-source project's Gerrit instance hosted at the Linux Network Foundation.
Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me. This should be a solved problem. But it looks like this field is full of badly behaving companies with complete disregard for the common good.
>Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me.
a mix of ignorance, greed, and a bit of the tragedy of the commons. If you don't respect anyone around you, you're not going to care about any rules or etiquette that don't directly punish you. Society has definitely broken down over the decades.
My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.
It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
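A minimal sketch of what "tied to their IP" could look like (an assumption based on this description, not Anubis's actual implementation): the server signs the client's IP into the token, so rotating IPs invalidates it and forces a fresh proof of work.

    import hashlib, hmac, json, time

    SECRET = b"server-side-secret"

    def issue_token(client_ip: str, valid_days: int = 7) -> str:
        # Handed out only after the client has solved the PoW challenge.
        claims = json.dumps({"ip": client_ip, "exp": time.time() + valid_days * 86400})
        sig = hmac.new(SECRET, claims.encode(), hashlib.sha256).hexdigest()
        return f"{claims}|{sig}"

    def check_token(token: str, client_ip: str) -> bool:
        claims, sig = token.rsplit("|", 1)
        expected = hmac.new(SECRET, claims.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return False
        data = json.loads(claims)
        # Rotating IPs invalidates the token, forcing a new PoW per IP.
        return data["ip"] == client_ip and data["exp"] > time.time()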
The math in the article assumes scrapers only need one Anubis token per site, whereas a scraper using 500,000 IPs would require 500,000 tokens.
Scaling up the math in the article, which states it would take 6 CPU-minutes to generate enough tokens to scrape 11,508 Anubis-using websites, we're now looking at 4.3 CPU-hours to obtain enough tokens to scrape your website (and 50,000 CPU-hours to scrape the Internet). This still isn't all that much -- looking at cloud VM prices, that's around 10c to crawl your website and $1000 to crawl the Internet, which doesn't seem like a lot but it's much better than "too low to even measure".
However, the article observes Anubis's default difficulty can be solved in 30ms on a single-core server CPU. That seems unreasonably low to me; I would expect something like a second to be a more appropriate difficulty. Perhaps the server is benefiting from hardware accelerated sha256, whereas Anubis has to be fast enough on clients without it? If it's possible to bring the JavaScript PoW implementation closer to parity with a server CPU (maybe using a hash function designed to be expensive and hard to accelerate, rather than one designed to be cheap and easy to accelerate), that would bring the cost of obtaining 500k tokens up to 138 CPU-hours -- about $2-3 to crawl one site, or around $30,000 to crawl all Anubis deployments.
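Spelling out that scaling, taking the article's figures at face value:

    seconds_per_token = 6 * 60 / 11508                 # ~31 ms per token at default difficulty
    print(500_000 * seconds_per_token / 3600)          # ≈ 4.3 CPU-hours for one site's 500k IPs
    print(11508 * 500_000 * seconds_per_token / 3600)  # ≈ 50,000 CPU-hours for every deployment
    print(500_000 * 1.0 / 3600)                        # ≈ 139 CPU-hours if a token took ~1 s instead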
I'm somewhat skeptical of the idea of Anubis -- that cost still might be way too low, especially given the billions of VC dollars thrown at any company with "AI" in their sales pitch -- but I think the article is overly pessimistic. If your goal is not to stop scrapers, but rather to incentivize scrapers to be respectful by making it cheaper to abide by rate limits than it is to circumvent them, maybe Anubis (or something like it) really is enough.
(Although if it's true that AI companies really are using botnets of hacked computers, then Anubis is totally useless against bots smart enough to solve the challenges since the bots aren't paying for the CPU time.)
If the scraper scrapes from a small number of IPs they're easy to block or rate-limit. Rate-limits against this behaviour are fairly easy to implement, as are limits against non-human user agents, hence the botnet with browser user agents.
The Duke University Library analysis posted elsewhere in the discussion is promising.
I'm certain the botnets are using hacked/malwared computers, as the huge majority of requests come from ISPs and small hosting providers. It's probably more common for this to be malware, e.g. a program that streams pirate TV, or a 'free' VPN app, which joins the user's device to a botnet.
Criminal convictions in the US require a standard of proof that is "beyond a reasonable doubt" and I suspect cases like this would not pass the required mens rea test, as, in their minds at least (and probably a judge's), there was no ill intent to cause a denial of service... and trying to argue otherwise based on any technical reasoning (e.g. "most servers cannot handle this load and they somehow knew it") is IMO unlikely to sway the court... especially considering web scraping has already been ruled legal, and that a ToS clause against that cannot be legally enforced.
There's an angle where criminal intent doesn't matter when it comes to negligence and damages. They had to have known that their scrapers would cause denial of service, unauthorized access, increased costs for operators, etc.
That's not a certain outcome. If you're willing to do this case, I can provide access logs and any evidence you want. You can keep any money you win plus I'll pay a bonus on top! Wanna do it?
Keep in mind I'm in Germany, the server is in another EU country, and the worst scrapers are overseas (in China, the USA, and Singapore). Thanks to these LLMs there is no barrier to having the relevant laws translated in all directions, so I trust that won't be a problem! :P
> criminal intent doesn't matter when it comes to negligence and damages
Are you a criminal defense attorney or prosecutor?
> They had to have known
IMO good luck convincing a judge of that... especially "beyond a reasonable doubt" as would be required for criminal negligence. They could argue lots of other scrapers operate just fine without causing problems, and that they tested theirs on other sites without issue.
Coming from a different legal system, so please forgive my ignorance: Is it necessary in the US to prove ill intent in order to sue for damages?
Just wondering, because when I accidentally punch someones tooth out, I would assume they certainly are entitled to the dentist bill.
When we say "do we need" or "can we do" we're talking about the idea of how plausible it is to win case. A lawyer won't take a case with bad odds of winning, even if you want to pay extra because a part of their reputation lies on taking battles they feel they can win.
>because when I accidentally punch someones tooth out, I would assume they certainly are entitled to the dentist bill.
IANAL, so the boring answer is "it depends". Reparations aren't guaranteed, but there are 50 different state laws to consider, on top of federal law.
Generally, they are not automatically entitled to have the damages paid, but the other person may possibly be charged with battery. Intent will be a strong factor in winning the case.
I thought only capital crimes (murder, for example) held the standard of beyond a reasonable doubt. Lesser crimes require the standard of either a "Preponderance of Evidence" or "Clear and Convincing Evidence" as burden of proof.
Still, even by those lesser standards, it's hard to build a case.
It's civil cases that have the lower standard of proof. Civil cases arise when one party sues another, typically seeking money, and they are claims in equity, where the defendant is alleged to have harmed the plaintiff in some way.
Criminal cases require proof beyond a reasonable doubt. Most things that can result in jail time are criminal cases. Criminal cases are almost always brought by the government, and criminal acts are considered harm to society rather than to (strictly) an individual. In the US, criminal cases are classified as "misdemeanors" or "felonies," but that language is not universal in other jurisdictions.
they seem to be written by idiots and/or people who don't give a shit about being good internet citizens
either way the result is the same: they induce massive load
well-written crawlers will (see the sketch after this list):
- not hit a specific ip/host more frequently than say 1 req/5s
- put newly discovered URLs at the end of a distributed queue (NOT do DFS per domain)
- limit crawling depth based on crawled page quality and/or response time
- respect robots.txt
- make it easy to block them
- wait 2 seconds for a page to load before aborting the connection
- wait for the previous request to finish before requesting the next page, since that would only induce more load, get even slower, and eventually take everything down
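A rough sketch of that checklist in code (hypothetical names throughout; assumes the third-party requests library; a real crawler would also need a shared queue and robots.txt caching):

    import time
    import urllib.robotparser
    from collections import deque
    from urllib.parse import urlparse

    import requests  # assumed available; any HTTP client would do

    MIN_DELAY = 5.0                   # at most one request per host every 5 seconds
    last_hit: dict[str, float] = {}
    frontier: deque[str] = deque()    # newly discovered URLs go to the back of one shared
                                      # FIFO, so no single domain gets crawled depth-first

    def allowed(url: str, agent: str = "ExampleBot") -> bool:
        # Respect robots.txt (in practice, cache the parsed file per host).
        rp = urllib.robotparser.RobotFileParser()
        parts = urlparse(url)
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(agent, url)

    def polite_fetch(url: str) -> str | None:
        host = urlparse(url).netloc
        wait = MIN_DELAY - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)          # throttle per host, one request at a time
        last_hit[host] = time.time()
        if not allowed(url):
            return None
        resp = requests.get(url,
                            headers={"User-Agent": "ExampleBot/0.1 (+https://example.com/bot)"},
                            timeout=10)   # identify yourself and don't hang around
        resp.raise_for_status()
        return resp.text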
I've designed my site to hold up to traffic spikes anyway and the bots I'm getting aren't as crazy as the ones I hear about from other, bigger website operators (like the OpenStreetMap wiki, still pretty niche), so I don't block much of them. Can't vet every visitor so they'll get the content anyway, whether I like it or not. But if I see a bot having HTTP 499 "client went away before page finished loading" entries in the access log, I'm not wasting my compute on those assholes. That's a block. I haven't had to do that before, in a decade of hosting my own various tools and websites
I disagree with the post author in their premise that things like Anubis are easy to bypass if you craft your bot well enough and throw the compute at it.
Thing is, the actual lived experience of webmasters tells that the bots that scrape the internets for LLMs are nothing like crafted software. They are more like your neighborhood shit-for-brain meth junkies competing with one another who makes more robberies in a day, no matter the profit.
Those bots are extremely stupid. They are worse than script kiddies’ exploit searching software. They keep banging the pages without regard to how often, if ever, they change. If they were 1/10th like many scraping companies’ software, they wouldn’t be a problem in the first place.
Since these bots are so dumb, anything that is going to slow them down or stop them in their tracks is a good thing. Short of drone strikes on data centers or accidents involving owners of those companies that provide networks of botware and residential proxies for LLM companies, it seems fairly effective, doesn’t it?
It is the way it is because there are easy pickings to be made even with this low effort, but the more sites adopt such measures, the less stupid your average bot will be.
As I've been saying for a while now - if you want to filter for only humans, ask questions only a human can easily answer; counting the number of letters in a word seems to be a good way to filter out LLMs, for example. Yes, that can be relatively easily gotten around, just like Anubis, but with the benefit that it doesn't filter out humans and has absolutely minimal system requirements (a browser that can submit HTML forms), possibly even less than the site itself.
There are forums which ask domain-specific questions as a CAPTCHA upon attempting to register an account, and as someone who has employed such a method, it is very effective. (Example: what nominal diameter is the intake valve stem on a 1954 Buick Nailhead?)
For smaller forums, any customization to the new account process will work. When I ran a forum that was getting a frustratingly high amount of spammer signups, I modified the login flow to ask the user to add 1 to the 6-digit number in the stock CAPTCHA. Spam signups dropped like a rock.
> counting the number of letters in a word seems to be a good way to filter out LLMs
As long as this challenge remains obscure enough to not be worth implementing special handlers in the crawler, this sounds like a neat idea.
But I think if everyone starts doing this particular challenge (char count), the crawlers will start instructing a cheap LLM to do appropriate tool calls and get around it. So the challenge needs to be obscure.
I wonder if anyone has tried building a crawler firewall or even an nginx script that lets the site admin plug in their own challenge generator in Lua or something, which would then generate a minimal HTML form. Maybe even vibe code it :)
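A toy sketch of that idea (entirely hypothetical, and in Python rather than Lua/nginx): the admin plugs in any question/answer generator and the gate serves a bare HTML form until the session has answered it.

    import secrets

    # Admin-supplied plug-in: returns (question, expected_answer).
    def letter_count_challenge() -> tuple[str, str]:
        word = secrets.choice(["anubis", "kernel", "crawler", "mailing"])
        return f"How many letters are in the word '{word}'?", str(len(word))

    PENDING: dict[str, str] = {}   # session id -> expected answer
    PASSED: set[str] = set()       # sessions that already answered correctly

    def gate(session_id: str | None, answer: str | None,
             make_challenge=letter_count_challenge) -> str:
        """Return the real page for verified sessions, otherwise a minimal challenge form."""
        if session_id in PASSED:
            return "<html>actual content</html>"
        expected = PENDING.pop(session_id, None) if session_id else None
        if expected is not None and answer == expected:
            PASSED.add(session_id)
            return "<html>actual content</html>"
        question, expected = make_challenge()
        sid = secrets.token_hex(8)
        PENDING[sid] = expected
        return (f'<form method="post"><p>{question}</p>'
                f'<input type="hidden" name="sid" value="{sid}">'
                f'<input name="answer"><button>Submit</button></form>')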
I would actually say that it's been successful in identifying at least one large-scale abuser so far, which can then be blocked via more traditional methods.
I have my own project that finds malicious traffic IP addresses, and through searching through the results, it's allowed me to identify IP address ranges to be blocked completely.
Yielding useful information may not have been what it was designed to do, but it's still a useful outcome. Funny thing about Anubis' viral popularity is that it was designed to just protect the author's personal site from a vast army of resource-sucking marauders, and grew because it was open sourced and a LOT of other people found it useful and effective.
Whenever I see an otherwise civil and mature project utilize something outwardly childish like this I audibly groan and close the page.
I'm sure the software behind it is fine but the imagery and style of it (and the confidence to feature it) makes me doubt the mental credibility/social maturity of anybody willing to make it the first thing you see when accessing a webpage.
Edit:
From a quick check of the "CEO" of the company, I was unsurprised to have my concerns confirmed. I may be behind the times but I think there are far too many people in who act obnoxiously (as part of what can only be described as a new subculture) in open source software today and I wish there were better terms to describe it.
"Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project."
So it's meant/preferred to block low-effort crawlers, which can still cause damage if you don't deal with them. A 3-second deterrent seems good in that regard. Maybe the deterrent could come as rate limiting an IP? But they might use swaths of IPs :/
Anubis exists specifically to handle the problem of bots dodging IP rate limiting. The challenge is tied to your IP, so if you're cycling IPs with every request, you pay dramatically more PoW than someone using a single IP. It's intended to be used in depth with IP rate limiting.
Yea I'm not convinced unless somehow the vast majority of scrapers aren't already using headless browsers (which I assume they are). I feel like all this does is warm the planet.
On my daily browser with V8 JIT disabled, Cloudflare Turnstile has the worst performance hit, and often requires an additional click to clear.
Anubis usually clears in with no clicks and no noticeable slowdown, even with JIT off. Among the common CAPTCHA solutions it's the least annoying for me.
The software is easy. Apt install debian apache2 php certbot and you're pretty much set to deploy content to /var/www. I'm sure any BSD variant is also fine, or lots of other software distributions that don't require a graphical environment
On an old laptop running Windows XP (yes, with GUI, breaking my own rule there) I've also run a lot of services, iirc on 256MB RAM. XP needed about 70 I think, or 52 if I killed stuff like Explorer and unnecessary services, and the remainder was sufficient to run a uTorrent server, XAMPP (Apache, MySQL, Perl and PHP) stack, Filezilla FTP server, OpenArena game server, LogMeIn for management, some network traffic monitoring tool, and probably more things I'm forgetting. This ran probably until like 2014 and I'm pretty sure the site has been on the HN homepage with a blog post about IPv6. The only thing that I wanted to run but couldn't was a Minecraft server that a friend had requested. You can do a heck of a lot with a hundred megabytes of free RAM but not run most Javaware :)
Article might be a bit shallow, or maybe my understanding of how Anubis works is incorrect?
1. Anubis makes you calculate a challenge.
2. You get a "token" that you can use for a week to access the website.
3. (I don't see this being considered in the article) "token" that is used too much is rate limited. Calculating a new token for each request is expensive.
- https://news.ycombinator.com/item?id=44970290 mentions other requirements that are allegedly there on purpose to block older clients (as browser emulators presumably often would appear to be, because why would they bother implementing newer mechanisms when the web has backwards compatibility)
Basically. Anubis is meant to block mindless, careless, rude bots with seemingly no technically proficient human behind the process; these bots tend to be very aggressive and make tons of requests bringing sites down.
The assumption is that if you’re the operator of these bots and care enough to implement the proof of work challenge for Anubis you could also realize your bot is dumb and make it more polite and considerate.
Of course nothing precludes someone implementing the proof of work on the bot but otherwise leaving it the same (rude and abusive). In this case Anubis still works as a somewhat fancy rate limiter which is still good.
There’s no possible setting that would make it expensive enough to deter AI scrapers while preserving an acceptable user experience. The more zeros you add the more real users suffer, despite not creating much of a challenge to datacenter-hosted scrapers.
Yeah, the PoW is minor for botters but annoying for people. I think the only positive is that if enough people see anime girls on their screens there might actually be political pressure to make laws against rampant bot crawling.
But still enough to prevent a billion request DDoS
These sites have been scraped by search engines forever. It's not about blocking bots entirely, just about this new wave of "fuck you, I don't care if your host goes down" quasi-malicious scrapers.
Yes, but a single bot is not a concern. It's the first "D" in DDoS that makes it hard to handle
(and these bots tend to be very, very dumb - which often happens to make them more effective at DDoSing the server, as they're taking the worst and the most expensive ways to scrape content that's openly available more efficiently elsewhere)
What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net.
I hate Amazon's failure pets, I hate Google's failure mini-games -- it strikes me as an organizational effort to get really good at failing rather than spending that same effort to avoid failures altogether.
It's like everyone collectively thought the standard old Apache 404 not found page was too feature-rich and that customers couldn't handle a 3 digit error, so instead we now get a "Whoops! There appears to be an error! :) :eggplant: :heart: :heart: <pet image.png>" and no one knows what the hell is going on even though the user just misplaced a number in the URL.
This is something I've always felt about design in general. You should never make it so that a symbol for an inconvenience appears happy or smug, it's a great way to turn people off your product or webpage.
Reddit implemented something a while back that says "You've been blocked by network security!" with a big smiling Reddit snoo front and centre on the page and every time I bump into it I can't help but think this.
The original versions were a way to make even a boring event such as a 404 a bit of fun. If the page stops conveying the type of error to the user then it's just bad UX, but vomiting all the internal jargon at a non-tech user is also bad UX.
So I don't see an error code + something fun as being that bad.
People love dreaming of the 90s wild web and hate the clean-cut soulless corp web of today, so I don't see why having fun error pages is such an issue.
Usually when I hit an error page, and especially if I hit repeated errors, I'm not in the mood for fun, and I'm definitely not in the mood for "fun" provided by the people who probably screwed up to begin with. It comes off as "oops, we can't do anything useful, but maybe if we try to act cute you'll forget that".
Also, it was more fun the first time or two. There's not a lot of original fun on the error pages you get nowadays.
> People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today
It's been a while, but I don't remember much gratuitous cutesiness on the 90s Web. Not unless you were actively looking for it.
Not to those who don't exist in such cultures. It's creepy, childish, and strange to them. It's not something they see in everyday life, nor would I really want to. There is a reason why cartoons are aimed at younger audiences.
Besides, if your webserver is throwing errors, you've configured it incorrectly. Those pages should be branded to match the site design, with a neat and polite description of what the error is.
> What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net
This is probably intentional. They offer a paid unbranded version. If they had a corporate-friendly brand on the free offering, then there would be fewer people paying for the unbranded one.
That also got old when you got it again and again while you were trying to actually do something. But there wasn't the space to fit quite as much twee on the screen...
I can't find any documentation that says Anubis does this, (although it seems odd to me that it wouldn't, and I'd love a reference) but it could do the following:
1. Store the nonce (or some other identifier) of each jwt it passes out in the data store
2. Track the number or rate of requests from each token in the data store
3. If a token exceeds the rate limit threshold, revoke the token (or do some other action, like tarpit requests with that token, or throttle the requests)
Then if a bot solves the challenge it can only continue making requests with the token if it is well behaved and doesn't make requests too quickly.
It could also do things like limit how many tokens can be given out to a single ip address at a time to prevent a single server from generating a bunch of tokens.
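A minimal sketch of steps 1-3 (hypothetical, since it's unclear whether Anubis actually does this), keyed on the token's nonce with an in-memory store:

    import time
    from collections import defaultdict

    WINDOW = 60          # seconds
    MAX_REQUESTS = 120   # per token per window

    hits: dict[str, list[float]] = defaultdict(list)
    revoked: set[str] = set()

    def allow(token_nonce: str) -> bool:
        """Count requests per token and revoke tokens that exceed the limit."""
        if token_nonce in revoked:
            return False
        now = time.time()
        recent = [t for t in hits[token_nonce] if now - t < WINDOW]
        recent.append(now)
        hits[token_nonce] = recent
        if len(recent) > MAX_REQUESTS:
            revoked.add(token_nonce)   # a misbehaving token must re-solve the PoW
            return False
        return True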
Good on you for finding a solution for yourself, but personally I will just not use websites that pull this, and not contribute to projects where using such a website is required. If you respect me so little that you will make demands about how I use my computer and block me as a bot if I don't comply, then I am going to assume that you're not worth my time.
With the asymmetry of doing the PoW in javascript versus compiled c code, I wonder if this type of rate limiting is ever going to be directly implemented into regular web browsers. (I assume there's already plugins for curl/wget)
Other than Safari, mainstream browsers seem to have given up on considering browsing without javascript enabled a valid usecase. So it would purely be a performance improvement thing.
Apple supports people that want to not use their software as the gods at Apple intended it? What parallel universe Version of Apple is this!
Seriously though, does anything of Apple's work without JS, like Icloud or Find my phone? Or does Safari somehow support it in a way that other browsers don't?
> an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources
Sure, if you ignore that humans click on one page and the problematic scrapers (not the normal search engine volume, but the level we see nowadays where misconfigured crawlers go insane on your site) are requesting many thousands to millions of times more pages per minute. So they'll need many many times the compute to continue hammering your site whereas a normal user can muster to load that one page from the search results that they were interested in
Something feels bizarrely incongruent about the people using Anubis. These people used to be the most vehemently pro-piracy, pro internet freedom and information accessibility, etc.
Yet now when it's AI accessing their own content, suddenly they become the DMCA and want to put up walls everywhere.
I'm not part of the AI doomer cult like many here, but it would seem to me that if you publish your content publicly, typically the point is that it would be publicly available and accessible to the world...or am I crazy?
As everything moves to AI-first, this just means nobody will ever find your content and it will not be part of the collective human knowledge. At which point, what's the point of publishing it.
Oh, it's time to bring the Internet back to humans. Maybe it's time to treat the first layer of the Internet just as transport. Then, layer large VPN networks on top and put services there. People would just VPN to a vISP to reach content. Different networks, different interests :) But this time don't fuck up abuse handling. Someone is doing something fishy? Depeer them from the network (or their un-cooperating upstream!).
The argument isn't that it's difficult for them to circumvent - it's not - but that it adds enough friction to force them to rethink how they're scraping at scale and/or self-throttle.
I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.
The OP author shows that the cost to scrape an Anubis site is essentially zero since it is a fairly simple PoW algorithm that the scraper can easily solve. It adds basically no compute time or cost for a crawler run out of a data center. How does that force rethinking?
The cookie will be invalidated if shared between IPs, and it's my understanding that most Anubis deployments are paired with per-IP rate limits, which should reduce the amount of overall volume by limiting how many independent requests can be made at any given time.
That being said, I agree with you that there are ways around this for a dedicated adversary, and that it's unlikely to be a long-term solution as-is. My hope is that the act of having to circumvent Anubis at scale will prompt some introspection (do you really need to be rescraping every website constantly?), but that's hopeful thinking.
>do you really need to be rescraping every website constantly
Yes, because if you believe you out-resource your competition, by doing this you deny them training material.
The problem with crawlers is that they're functionally indistinguishable from your average malware botnet in behavior. If you saw a bunch of traffic from residential IPs using the same token, that's a big tell.
Time to switch to stagit. Unfortunately it does not generate static pages for any branch of a git repo except "master". I am sure someone will modify it to support branches.
Anubis works because AI crawlers do very little requests from an ip address to bypass rate-limiting. Last year they could still be blocked by ip range, but now the requests are from so many different networks that doesn't work anymore.
Doing the proof-of-work for every request is apparently too much work for them.
Crawlers using a single ip, or multiple ips from a single range are easily identifiable and rate-limited.
This is about the difficulty of proving you are human, especially when every test built has so much incentive to be broken. I don't think it will be solved, or could ever be solved.
Still, 128MB is not enough to even run Debian let alone Apache/NGINX. I’m on my phone, but it doesn’t seem like the author is using Cloudflare or another CDN. I’d like to know what they are doing.
* Thin clients with only 256 MiB RAM and 400 MHz are possible, though more RAM and faster processors are recommended.
* For workstations, diskless workstations and standalone systems, 1500 MHz and 1024 MiB RAM are the absolute minimum requirements. For running modern webbrowsers and LibreOffice at least 2048 MiB RAM is recommended.
That's for some educational distro, which presumably is running some fancy desktop environment with fancy GUI programs. I don't think that is reflective of what a web server needs.
A web server is really only going to be running 3 things: init, sshd, and the web server software. Even if we give init and sshd half of 128 MB, there's still 64 MB left for the web server.
Moving bytes around doesn't take RAM but CPU. Notice how switches don't advertise how many gigabytes of RAM they have, yet can push a few gigabits of content around between all 24 ports at once without even being expensive.
Also, the HN homepage is pretty tame so long as you don't run WordPress. You don't get more than a few requests per second, so multiply that with the page size (images etc.) and you probably get a few megabits as bandwidth, no problem even for a Raspberry Pi 1 if the sdcard can read fast enough or the files are mapped to RAM by the kernel
Reading the original release post for Anubis [0], it seems like it operates mainly on the assumption that AI scrapers have limited support for JS, particularly modern features. At its core it's security through obscurity; I suspect that as usage of Anubis grows, more scrapers will deliberately implement the features needed to bypass it.
That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.
> It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.
It's meant to rate-limit access by requiring client-side compute that is light enough for legitimate human users and responsible crawlers, but taxing enough to cost indiscriminate crawlers that request host resources excessively.
It indeed mentions that lighter crawlers do not implement the right functionality in order to execute the JS, but that's not the main reason why it is thought to be sensible. It's a challenge saying that you need to want the content bad enough to spend the amount of compute an individual typically has on hand in order to get me to do the work to serve you.
> Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don't support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.
As the article notes, the work required is negligible, and as the linked post notes, that's by design. Wasting scraper compute is part of the picture to be sure, but not really its primary utility.
Why require proof of work with difficulty at all then? Just have no UI other than (javascript) required and run a trivial computation in WASM as a way of testing for modern browser features. That way users don't complain that it is taking 30s on their low-end phone and it doesn't make it any easier for scrapers to scrape (because the PoW was trivial anyways).
Once per ip. Presumably there's ip-based rate limiting implemented on top of this, so it's a barrier for scrapers that aggressively rotate ip's to circumvent rate limits.
It happens once if the user agent keeps a cookie that can be used for rate limiting. If a crawler hits the limit they need to either wait or throw the cookie away and solve another challenge.
Anubis is based on hashcash concepts - just adapted to a web request flow. Basically the same thing - moderately expensive for the sender/requester to compute, insanely cheap for the server/recipient to verify.
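The asymmetry is easy to see, assuming a hashcash-style leading-zero check: verification is a single hash, versus roughly 2^difficulty hashes for the solver.

    import hashlib

    def verify(challenge: str, nonce: int, difficulty_bits: int = 16) -> bool:
        # One sha256 call for the verifier vs. ~2^difficulty_bits calls for the solver.
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))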
We need bitcoin-based lightning nano-payments for such things. Like visiting the website will cost $0.0001 cent, the lightning invoice is embedded in the header and paid for after single-click confirmation or if threshold is under a pre-configured value. Only way to deal with AI crawlers and future AI scams.
With the current approach we just waste the energy, if you use bitcoin already mined (=energy previously wasted) it becomes sustainable.
We deployed hashcash for a while back in 2004 to implement Picasa's email relay - at the time it was a pretty good solution because all our clients were kind of similar in capability. Now I think the fastest/slowest device is a broader range (just like Tavis says), so it is harder to tune the difficulty for that.
No, the economics will never work out for a Proof of Work-based counter-abuse challenge. CPU is just too cheap in comparison to the cost of human latency. An hour of a server CPU costs $0.01. How much is an hour of your time worth?
That's all the asymmetry you need to make it unviable. Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark of the cost imposed on the legitimate users. So there's no point in theorizing about an attacker solving the challenges cheaper than a real user's computer, and thus no point in trying to design a different proof of work that's more resistant to whatever trick the attackers are using to solve it for cheap. Because there's no trick.
But for a scraper to be effective it has to load orders of magnitude more pages than a human browses, so a fixed delay causes a human to take 1.1x as long, but it will slow down a scraper by 100x. Requiring 100x more hardware to do the same job is absolutely a significant economic impediment.
The entire problem is that proof of work does not increase the cost of scraping by 100x. It does not even increase it by 100%. If you run the numbers, a reasonable estimate is that it increases the cost by maybe 0.1%. It is pure snakeoil.
>An hour of a server CPU costs $0.01. How much is an hour of your time worth?
That's irrelevant. A human is not going to be solving the challenge by hand, nor is the computer of a legitimate user going to be solving the challenge continuously for one hour. The real question is, does the challenge slow down clients enough that the server does not expend outsized resources serving requests of only a few users?
>Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark of the cost imposed on the legitimate users.
No, I disagree. If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
The problem with proof-of-work is many legitimate users are on battery-powered, 5-year-old smartphones. While the scraping servers are huge, 96-core, quadruple-power-supply beasts.
The human needs to wait for their computer to solve the challenge.
You are trading something dirt-cheap (CPU time) for something incredibly expensive (human latency).
Case in point:
> If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
No. A human sees a 10x slowdown. A human on a low end phone sees a 50x slowdown.
And the scraper paid 1/1,000,000th of a dollar. (The scraper does not care about latency.)
That is not an effective deterrent. And there is no difficulty factor for the challenge that will work. Either you are adding too much latency to real users, or passing the challenge is too cheap to deter scrapers.
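To put numbers on the asymmetry (using the $0.01-per-CPU-hour figure from above and a challenge tuned to roughly 250 ms):

    # Back-of-the-envelope cost of one PoW challenge to a scraper.
    cpu_hour_usd = 0.01        # rented server CPU, per hour (figure from above)
    challenge_seconds = 0.25   # difficulty tuned to ~250 ms on fast hardware

    cost_per_challenge = cpu_hour_usd * challenge_seconds / 3600
    print(f"${cost_per_challenge:.9f}")  # ~$0.0000007, well under a millionth of a dollar

The latency cost lands entirely on the human; the monetary cost to the scraper is noise.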
For the actual request, yes. For the complete experience of using the website not so much, since a human will take at least several seconds to process the information returned.
>And the scraper paid 1/1,000,000th of a dollar. (The scraper does not care about latency.)
The point need not be to punish the client, but to throttle it. The scraper may not care about taking longer, but the website's operator may very well care about not being hammered by requests.
Yes, and then we can avoid the entire issue. It's patronizing for people to assume users wouldn't notice a 10x or 50x slowdown. You can tell those who think that way are not web developers, as we know that every millisecond has a real, nonlinear fiscal cost.
Of course, then the issue becomes "what is the latency and cost incurred by a scraper to maintain and load balance across a large list of IPs". If it turns out that this is easily addressed by scrapers then we need another solution. Perhaps, the user's browser computes tokens in the background and then serves them to sites alongside a certificate or hash (to prevent people from just buying and selling these tokens).
We solve the latency issue by moving it off-line, and just accept the tradeoff that a user is going to have to spend compute periodically in order to identify themselves in an increasingly automated world.
Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.
Wait, but then why bother with this PoW system at all? If they're just trying to block anyone without JS, that's way easier and doesn't require slowing things down for end users on old devices.
Reminds me of how Wikipedia literally makes all its data available for bulk download, even in a nice format just for scrapers (I think), and even THEN some scrapers still scraped Wikipedia directly and cost them enough money that, I'm pretty sure, they had to make a statement about it or otherwise disclose it.
Even then, man, I feel like so many resources could be saved (both the scrapers' and Wikipedia's) if scrapers had the sense to not scrape Wikipedia and instead follow Wikipedia's rules.
If we're presupposing an adversary with infinite money then there's no solution. One may as well just take the site offline. The point is to spend effort in such a way that the adversary has to spend much more effort, hopefully so much it's impractical.
> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.
A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
Secondly, Anubis specifically targets bots that try to blend in with human traffic. Bots that don't try to blend in with humans are basically ignored and out-of-scope. Most malicious bots don't want to be targeted, so they want to blend in... so they kind of have to deal with this. If they want to avoid the Anubis challenge, they have to essentially identify themselves. If not, they have to solve it.
Finally... If bots really want to durably be able to pass Anubis challenges, they pretty much have no choice but to run the arbitrary code. Anything else would be a pretty straight-forward cat and mouse game. And, that means that being able to accelerate the challenge response is a non-starter: if they really want to pass it, and not appear like a bot, the path of least resistance is to simply run a browser. That's a big hurdle and definitely does increase the complexity of scraping the Internet. It increases more the more sites that use this sort of challenge system. While the scrapers have more resources, tools like Anubis scale the resources required a lot more for scraping operations than it does a specific random visitor.
To me, the most important point is that it only fights bot traffic that intentionally tries to blend in. That's why it's OK that the proof-of-work challenge is relatively weak: the point is that it's non-trivial and can't be ignored, not that it's particularly expensive to compute.
If bots want to avoid the challenge, they can always identify themselves. Of course, then they can also readily be blocked, which is exactly what they want to avoid.
In the long term, I think the success of this class of tools will stem from two things:
1. Anti-botting improvements, particularly in the ability to punish badly behaved bots, and possibly share reputation information across sites.
2. Diversity of implementations. More implementations of this concept will make it harder for bots to just hardcode fastpath challenge response implementations and force them to actually run the code in order to pass the challenge.
I haven't kept up with the developments too closely, but as silly as it seems I really do think this is a good idea. Whether it holds up as the metagame evolves is anyone's guess, but there's actually a lot of directions it could be taken to make it more effective without ruining it for everyone.
> A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
... has phpbb not heard of the old "only create the session on the second visit, if the cookie was successfully created" trick?
phpBB supports browsers that don't support or accept cookies: if you don't have a cookie, the URL for all links and forms will have the session ID in it. Which would be great, but it seems like these bots are not picking those up either for whatever reason.
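For what it's worth, the "second visit" trick is simple enough to sketch (illustrative Python with made-up request/response objects, not phpBB's actual code):

    def handle_request(request, response, create_session):
        # Only allocate a server-side session once the client has proven it
        # actually sends cookies back; cookie-less bots never touch the DB.
        if request.cookies.get("cookie_check") == "1":
            return create_session()
        response.set_cookie("cookie_check", "1")  # first visit: just hand out a test cookie
        return None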
We have been seeing our clients' sites being absolutely *hammered* by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.
Personally I have no issue with AI bots that properly identify themselves scraping content, since if the site operator doesn't want it to happen they can easily block the offending bot(s).
We built our own proof-of-work challenge that we enable on client sites/accounts as they come under 'attack' and it has been incredible how effective it is. That said I do think it is only a matter of time before the tactics change and these "malicious" AI bots are adapted to look more human / like real browsers.
I mean honestly it wouldn't be _that_ hard to enable them to run javascript or to emulate a real/accurate User-Agent. That said they could even run headless versions of the browser engines...
It's definitely going to be cat-and-mouse.
The most brutally honest truth is that if they throttled themselves so as not to totally crash whatever site they're trying to scrape, we'd probably have never noticed or gone through the trouble of writing our own proof-of-work challenge.
Unfortunately those writing/maintaining these AI bots that hammer sites to death probably either have no concept of the damage it can do or they don't care.
> We have been seeing our clients' sites being absolutely hammered by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.
Yep. I noticed this too.
> That said they could even run headless versions of the browser engines...
Yes, exactly. To my knowledge that's what's going on with the latest wave that is passing Anubis.
That said, it looks like the solution to that particular wave is going to be to just block Huawei cloud IP ranges for now. I guess a lot of these requests are coming from that direction.
Personally though I think there are still a lot of directions Anubis can go in that might tilt this cat and mouse game a bit more. I have some optimism.
Yes, Anubis is a dog-headed or jackal-headed god. I actually can't find anywhere on the Anubis website where they talk about their mascot; they just refer to her neutrally as the "default branding".
Since dog girls and cat girls in anime can look rather similar (both being mostly human + ears/tail), and the project doesn't address the point outright, we can probably forgive Tavis for assuming catgirl.
> The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans.
> Anubis – confusingly – inverts this idea.
Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.
The actual answer to how this blocks AI crawlers is that they just don't bother to solve the challenge. Once they do bother solving the challenge, the challenge will presumably be changed to a different one.
We're 1-2 years away from putting the entire internet behind Cloudflare, and Anubis is what upsets you? I really don't get these people. Seeing an anime catgirl for 1-2 seconds won't kill you. It might save the internet though.
The principle behind Anubis is very simple: it forces every visitor to brute force a math problem. This cost is negligible if you're running it on your computer or phone. However, if you are running thousands of crawlers in parallel, the cost adds up. Anubis basically makes it expensive to crawl the internet.
It's not perfect, but much much better than putting everything behind Cloudflare.
Doesn't that line of reasoning imply that companies with multi-billion-dollar war chests are much more "human" than a literal human with student loans?
What are "bots"? This needs to include goggleadservices, PIA sharing for profit, real-time ad auctions, and other "non-user" traffic.
The difference between that and the LLM training data scraping, is that the previous non-human traffic was assumed, by site servers, to increase their human traffic, through search engine ranking, and thus their revenue. However the current training data scraping is likely to have the opposite effect: capturing traffic with LLM summaries, instead of redirecting it to original source sites.
This is the first major disruption to the internet's model of finance since ad revenue took over after the dot bomb.
So far, it's in the same category as the environmental disaster in progress, ownership is refusing to acknowledge the problem, and insisting on business as usual.
Rational predictions are that it's not going to end well...
"Although the long term problem is the business model of servers paying for all network bandwidth."
Servers do not "pay for all the network bandwidth" as if they are somehow being targeted for fees and carrying water for the clients that are somehow getting it for "free". Everyone pays for the bandwidth they use, clients, servers, and all the networks in between, one way or another. Nobody out there gets free bandwidth at scale. The AI scrapers are paying lots of money to scrape the internet at the scales they do.
The AI scrapers are most likely VC-funded, and all they care about is getting as much data as possible without worrying about the costs.
They are renting machines at scale too, so bandwidth etc. is definitely cheaper for them. Maybe they use a provider that doesn't have much in the way of bandwidth costs (Hetzner?).
But still, the point is that you might be hosting a website on your small server, and a scraper with its beast of a machine fleet can come along and effectively DDoS your server looking for data to scrape. Deterring them is what matters, so that the economics finally slide back in our favour.
Maybe my statement wasn't clear. The point is that the server operators pay for all of the bandwidth of access to their servers.
When this access is beneficial to them, that's OK, when it's detrimental to them, they're paying for their own decline.
The statement isn't really concerned with what if anything the scraper operators are paying, and I don't think that really matters in reaching the conclusion.
> The difference between that and the LLM training data scraping
Is the traffic that people are complaining about really training traffic?
My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.
That doesn't seem like enough traffic to be a really big problem.
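Spelling that estimate out with made-up but plausible numbers:

    # Very rough upper bound on pure training-crawl traffic per site per year,
    # assuming a few dozen foundation models, each crawling every site ~10
    # times with no caching at all.
    models_per_year = 24
    crawls_per_model = 10
    full_site_downloads = models_per_year * crawls_per_model  # ~240 per year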
On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.
That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping" if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.
Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.
So what's really going on here? Anybody actually know?
The traffic I'm seeing on a wiki I host looks like plain old scraping. When it hits it's a steady load of lots of traffic going all over, from lots of IPs. And they really like diffs between old page revisions for some reason.
That sounds like a really dumb scraper indeed. I don't think you'd want to feed very many diffs into a training run or most inference runs.
But if there's a (discoverable) page comparing every revision of a page to every other revision, and a page has N revisions, there are going to be (N^2-N)/2 delta pages, so could it just be the majority of the distinct pages your Wiki has are deltas?
I would think that by now the "AI companies" would have something smarter steering their scrapers. Like, I dunno, some kind of AI. But maybe they don't for some reason? Or maybe the big ones do, but smaller "hungrier" ones, with less staff but still probably with a lot of cash, are willing to burn bandwidth so they don't have to implement that?
Can we talk about the "sexy anime girl" thing? Seems it's popular in geek/nerd/hacker circles and I for one don't get it. Browsing reddit anonymously you're flooded with near-pornographic fan-made renders of these things, I really don't get the appeal. Can someone enlighten me?
It's a good question. Anime (like many media, but especially anime) is known to have gratuitous fan service where girls/women of all ages are in revealing clothing for seemingly no reason except to just entice viewers.
The reasoning is that because they aren't real people, it's okay to draw and view images of anime, regardless of their age. And because geek/nerd circles tend not to socialize with real women, we get this over-proliferation of anime girls.
I always wondered about these anti-bot precautions... as a Firefox user, with ad blocking and 3rd-party cookies disabled, I get the goddamn captcha or some other random check (like this) on a bunch of pages now, every time I visit them...
Is it worth it? Millions of users wasting CPU and power for what? Saving a few cents on hosting? Just rate limit requests per second per IP and be done.
Sooner or later bots will be better at captchas than humans, what then? What's so bad with bots reading your blog? When bots evolve, what then? UK style, scan your ID card before you can visit?
The internet became a pain to use... back in the day, you opened the website and saw the content. Now you open it, get an anti-bot check, click, get forwarded to the actual site, a cookie prompt, multiple clicks, then a headline + ads, scroll down a millimeter... do you want to subscribe to a newsletter? Why, I didn't even read the first sentence of the article yet... scroll down... chat-with-AI-bot popup... a bit further down, log in here to see the full article...
Most of the modern web is unusable. I know I'm ranting, but this is just one of the pieces of a puzzle that makes basic browsing a pain these days.
So it's a paywall with good intentions, and even more accessibility concerns. Thus accelerating enshittification.
Who's managing the network effects? How do site owners control false positives? Do they have support teams granting access? How do we know this is doing any good?
It's convoluted security theater mucking up an already bloated, flimsy and sluggish internet. It's frustrating enough to guess schoolbuses every time I want to get work done; now I have to see pornified kitty waifus
(openwrt is another community plagued with this crap)
> The idea of “weighing souls” reminded me of another anti-spam solution from the 90s… believe it or not, there was once a company that used poetry to block spam!
> Habeas would license short haikus to companies to embed in email headers. They would then aggressively sue anyone who reproduced their poetry without a license. The idea was you can safely deliver any email with their header, because it was too legally risky to use it in spam.
Kind of a tangent but learning about this was so fun. I guess it's ultimately a hack for there not being another legally enforceable way to punish people for claiming "this email is not spam"?
IANAL so what I'm saying is almost certainly nonsense. But it seems weird that the MIT license has to explicitly say that the licensed software comes with no warranty that it works, but that emails don't have to come with a warranty that they are not spam! Maybe it's hard to define what makes an email spam, but surely it is also hard to define what it means for software to work. Although I suppose spam never e.g. breaks your centrifuge.
I suppose one nice property is that it is trivially scalable. If the problem gets really bad and the scrapers have LLMs embedded in them to solve captchas, the difficulty could be cranked up and the lifetime cranked down. It would make the user experience pretty crappy (party like it's 1999), but it could keep sites up for unauthenticated users without engaging in some captcha complexity race.
It does have arty political vibes though: the distributed and decentralized open-source internet with guardian catgirls vs. late-stage capitalism's quixotic quest to eat itself to death trying to build an intellectual and economic robot black hole.
Kernel.org* just has to actually configure Anubis rather than deploying the default broken config: enable the meta-refresh proof of work rather than relying on the corporate-browsers-only, bleeding-edge JavaScript proof of work.
* or whatever site the author is talking about, his site is currently inaccessible due to the amount of people trying to load it.
Oh I totally reacted to the title. The last few times Anubis has been the topic there's always comments about "cringy" mascot and putting that front and center in the title just made me believe that anime catgirls was meant as an insult.
Honestly I am okay with anime catgirls since I just find them funny, but it would still be cool to see Linux-related stuff. Imagine a gif of Tux the penguin racing in SuperTuxKart for the Linux website.
SourceHut also uses Anubis, but they have replaced the anime catgirl with their own logo. I think Disroot does that too, though I am not sure.
> As you may have noticed, SourceHut has deployed Anubis to parts of our services to protect ourselves from aggressive LLM crawlers.
It's nice that SourceHut themselves have talked about it on their own blog, but I had discovered this through the Anubis website's showcase or something like that, IIRC.
Yes, your link from four months ago says they deployed Anubis. Now actually go to sourcehut yourself and you'll see it uses go-away, not Anubis. Or read the footnote at the bottom of your link (in fact, linked from the very sentence you quoted) that says they were looking at go-away at the time.
> A few weeks after this blog post, I moved us from Anubis to go-away, which is more configurable and allows us to reduce the user impact of Anubis (e.g. by offering challenges that don’t require JavaScript, or support text-mode browsers better). We have rolled this out on several services now, and unfortunately I think they’re going to remain necessary for a while yet – presumably until the bubble pops, I guess.
Oh sorry, I didn't know that you moved from Anubis to go-away, my bad.
But if I remember correctly, when you were using Anubis, you had changed the anime catgirl logo to something related to SourceHut / its own logo, right?
I don't understand, why do people resort to this tool instead of simply blocking by UA string or IP address. Are there so many people running these AI crawlers?
I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. They are really idiotic in behaviour, but at least report themselves correctly.
OpenAI/Anthropic/Perplexity aren't the bad actors here. If they are, they are relatively simply to block - why would you implement an Anubis PoW MITM Proxy, when you could just simply block on UA?
I get the sense many of the bad actors are simply poor copycats that are poorly building LLMs and are scraping the entire web without a care in the world
Perplexity's defense is that they're not doing it for training/KB-building crawls but for answering dynamic user queries, and that this is apparently better.
I do not see the words "residential" or "proxy" anywhere in that article... or any other text that might imply they are using those things. And personally... I don't trust crimeflare at all. I think they and their MITM-as-a-service has done even more/lasting damage to the global Internet and user privacy in general than all AI/LLMs combined.
However, if this information is accurate... perhaps site owners should allow AI/bot user agents but respond with different content (or maybe a 404?) instead, to try to prevent it from making multiple requests with different UAs.
I had 500,000 residential IPs make 1-4 requests each in the past couple of days.
These had the same user agent (latest Safari), but previously the agent has been varied.
Blocking this shit is much more complicated than any blocking necessary before 2024.
The data is available for free download in bulk (it's a university) and this is advertised in several places, including the 429 response, the HTML source and the API documentation, but the AI people ignore this.
Lots of companies run these kind of crawlers now as part of their products.
They buy proxies and rotate through proxy lists constantly. It's all residential IPs, so blocking IPs actually hurts end users. Often it's the real IPs of VPN service customers, etc.
There are lots of companies around that you can buy this type of proxy service from.
Why does Anubis not leverage the PoW from its users to do something useful (at best, distributed computing for science; at worst, a cryptocurrency that at least lets the webmasters get back some cash)?
People are already complaining. Could you imagine how much fodder this'd give people who didn't like the work or the distribution of any funds that a cryptocurrency would create (which would be pennies, I think, and more work to distribute than would be worth doing).
If people are truly concerned about the crawlers hammering their 128mb raspberry pi website then a better solution would be to provide an alternative way for scrapers to access the data (e.g., voluntarily contribute a copy of their public site to something like common crawl).
If Anubis blocked crawler requests but helpfully redirected to a giant tarball of every site using their service (with deltas or something to reduce bandwidth), I bet nobody would bother actually spending the time to automate cracking it, since it's basically negative value. You could even make it a torrent so most of the bandwidth costs are paid by random large labs/universities.
I think the real reason most are so obsessed with blocking crawlers is they want “their cut”… an imagined huge check from OpenAI for their fan fiction/technical reports/whatever.
No, this doesn’t work. Many of the affected sites have these but they’re ignored. We’re talking about git forges, arguably the most standardised tool in the industry, where, instead of just fetching the repository, every single history revision of every single file gets recursively hammered to death.
The people spending the VC cash to make the internet unusable right now don’t know how to program. They especially don’t give a shit about being respectful. They just hammer all the sites, all the time, forever.
The kind of crawlers/scrapers who DDoS a site like this aren't going to bother checking common crawl or tarballs. You vastly overestimate the intelligence and prosociality of what bursty crawler requests tend to look like. (Anyone who is smart or prosocial will set up their crawler to not overwhelm a site with requests in the first place - yet any site with any kind of popularity gets flooded with these requests sooner or later)
If they don’t have the intelligence to go after the more efficient data collection method then they likely won’t have the intelligence or willpower to work around the second part I mentioned (keeping something like Anubis). The only problem is when you put Anubis in the way of determined, intelligent crawlers without giving them a choice that doesn’t involve breaking Anubis.
> I think the real reason most are so obsessed with blocking crawlers is they want “their cut”…
I find that an unfair view of the situation. Sure, there are examples such as StackOverflow (which is ridiculous enough as they didn't make the content) but the typical use case I've seen on the small scale is "I want to self-host my git repos because M$ has ruined GitHub, but some VC-funded assholes are drowning the server in requests".
They could just clone the git repo, and then pull every n hours, but it requires specialized code so they won't. Why would they? There's no money in maintaining that. And that's true for any positive measure you may imagine until these companies are fined for destroying the commons.
While that’s a reasonable opinion to have, it’s a fight they can’t really win. It’s like putting up a poster in a public square then running up to random people and shouting “no, this poster isn’t for you because I don’t like you, no looking!” Except the person they’re blocking is an unstoppable mega corporation that’s not even morally in the wrong imo (except for when they overburden people’s sites, that’s bad ofc)
The looking is fine, the photographing and selling the photo less so… and fyi in denmark monuments have copyright so if you photograph and sell the photos you owe fees :)
The bad scrapers would get blocked by the wall I mentioned. The ones intelligent enough to break the wall would simply take the easier way out and download the alternative data source.
literally the top link when I search for his exact text "why are anime catgirls blocking my access to the Linux kernel?" https://lock.cmpxchg8b.com/anubis.html
Maybe Tavis needs more google-fu. Maybe that includes using DuckDuckGo?
This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.
Sure, the people who make the AI scraper bots are going to figure out how to actually do the work. The point is that they hadn't, and this worked for quite a while.
As the botmakers circumvent, new methods of proof-of-notbot will be made available.
It's really as simple as that. If a new method comes out and your site is safe for a month or two, great! That's better than dealing with fifty requests a second, wondering if you can block whole netblocks, and if so, which.
This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
> they are just feigning a lack of understanding to be dismissive of Anubis.
I actually find the featured article very interesting. It doesn't feel dismissive of Anubis, but rather it questions whether this particular solution makes sense or not in a constructive way.
I agree - the article is interesting and not dismissive.
I was talking more about some of the people here ;)
I still don't understand what Anubis solves if it can be bypassed too easily: If you use User-agent switcher (i emulate wget) as firefox addon on kernel.org or ffmpeg.org you save the entire check time and straight up skip Anubis. Apparently they use a whitelist for user-agents due to allowing legitimate wget usage on these domains. However if I (an honest human can) the scrapers and grifters can too.
https://addons.mozilla.org/en-US/firefox/addon/uaswitcher/
If anyone wants to try themselves. This is by no means against Anubis, but raising the question: Can you even protect a domain if you force yourself to whitelist (for a full bypass) easy to guess UAs?
It's extra work for scrapers. They pretend to be upstanding citizens (Chrome UA from residential IPs). You can more easily block those.
A lot of scrapers are actually utilizing some malware installed on residential user's machines, so the request is legitimately coming from a chrome UA on a residential ip.
Well, “actually”, perhaps; “legitimately” is still a stretch.
It really should be recognised just how many people are watching Cloudflare interstitials on nearly every site these days (and I totally get why this happens) yet making a huge amount of noise about Anubis on a very small amount of sites.
That says something about the chosen picture, doesn't it? Probably that it's not well liked. It certainly isn't neutral, while the Cloudflare page is.
You know, you say that, and while I understand where you're coming from I was browsing the git repo when github had a slight error and I was greeted with an angry pink unicorn. If Github can be fun like that, Anubis can too, I think.
Yeah, but do people like that? It feels pretty patronizing to me in a similar way. Like "Weee! So cute that our website is broken, good luck doing your job! <3"
Reminds me of the old uwu error message meme.
> patronizing
I think it's reasonable and fair, and something you are expected to tolerate in a free world. In fact, I think it's rather unusual to take this benign and inconsequential thing as personal as you do.
Not at all. I can't stand it either. It's definitely patronising and infantile. I tolerate the silliness, grit my teeth and move on but it wears away at my patience.
I don't think you want to suggest that everyone must like it?
Nothing says, "Change out the logo for something that doesn't make my clients tingle in an uncomfortable way" like the MIT license.
I wonder why the anime girl is received so badly. Is it because it's seen as childish? Is it bad because it confuses people (i.e. don't do this because other don't do this)?
Thinking about it logically, putting some "serious" banner there would just make everything a bit more grey and boring and would make no functional difference. So why is it disliked so much?
The GitHub unicorn doesn't look as if it came out of a furry dev's wank bank.
Why? It has sexual connotations, and it involves someone under the age of consent. As wikipedia puts it: "In a 2010 critique of the manga series Loveless, the feminist writer T. A. Noonan argued that, in Japanese culture, catgirl characteristics have a similar role to that of the Playboy Bunny in western culture, serving as a fetishization of youthful innocence."
> Thinking about it logically
This isn't about logic.
I'm glad that they kept the anime girl rather than replacing her with a sterile message. The Internet should be a fun place again.
Because the world is full of haters?
I personally find anime kind of cringe but that's just a matter of taste.
Keep in mind that the author explicitly asks you not to do this, and offers a paid white label version. You can still do it yourself, but maybe you shouldn’t.
That's a good point and I didn't know that.
Anubis was originally an open source project built for a personal blog. It gained traction, but the anime girl remained so that people are reminded of the nature of the project. Comparing it with Cloudflare is truly absurd. That said, a paid version is available with guard-page customization.
I don't trip over CloudFlare except when in a weird VPN, and then it always gets out of my way after the challenge.
Anubis screws with me a lot, and often doesn't work.
The annoying thing about cloudflare is that most of the time once you’re blocked: you’re blocked.
There’s literally no way for you to bypass the block if you’re affected.
It's incredibly scary. I once had a bad user agent (without knowing it) and half the internet went offline; I couldn't even access documentation or my email provider's site, and there was no contact information or debugging information to help me resolve it: just a big middle finger from half the internet.
I haven’t had issues with any sites using Anubis (yet), but I suspect there are ways to verify that you’re a human if your browser fails the automatic check at least.
CloudFlare is dystopic. It centralizes even the part of the Internet that hadn't been centralized before. It is a perfect Trojan horse to bypass all encryption. And it chooses who accesses (a considerable chunk of) the Internet and who doesn't.
Anubis looks much better than this.
It's literally insane. After Snowden, how the fuck did we end up with a single US company terminating almost every TLS connection?
> It is a perfect Trojan horse to bypass all encryption
Isn't any hosting provider also this?
Not necessarily.
FaaS: Yes.
IaaS: Only if you do TLS termination at their gateway, otherwise not really, they'd need to get into your operating system to get the keys which might not always be easy. They could theoretically MITM the KVM terminal when you put in your disk decryption keys but that seems unlikely.
It could be a lot worse. Soccer rights-holders effectively shut down the Cloudflare-facilitated Internet in Spain during soccer matches to 'curb piracy'.
The Soccer rightsholders - LaLiga - claim more than 50% of pirate IPs illegally distributing its content are protected by Cloudflare. Many were using an application called DuckVision to facilitate this streaming.
Telefónica, the ISP, upon realizing they couldn’t directly block DuckVision’s IP or identify its users, decided on a drastic solution: blocking entire IP ranges belonging to Cloudflare, which continues to affect a huge number of services that had nothing to do with soccer piracy.
https://pabloyglesias.medium.com/telef%C3%B3nicas-cloudflare...
https://www.broadbandtvnews.com/2025/02/19/cloudflare-takes-...
https://community.cloudflare.com/t/spain-providers-blocks-cl...
Now imagine your government provided internet agent gets blacklisted because your linked social media post was interpreted by an LLM to be anti-establishment, and we are painting a picture of our current trajectory.
I don't have to imagine
Anubis checks proof of work so as long as JavaScript runs you will pass it.
A "digital no-fly-list" is hella cyberpunk, though.
The question might become, what side of the black wall are you going to be on?
Seriously though I do think we are going to see increasing interest in alternative nets, especially as governments tighten their control over the internet or even break away into isolated nation nets.
Paradoxically, the problem with an "alternative net" (which could be tunneled over the regular one) is keeping it alternative. It has to be kept small and un-influential in order to stay under the radar. If you end up with an "alternative" which is used by journalists and politicians, you've just reinvented the mainstream, and you're no longer safe from being hit by a policy response.
Think private trackers. The opposite of 4chan, which is an "alternative" that got too influential in setting the tone of the rest of the internet.
Not necessarily, Yggdrasil flies under the radar because it's inherently hard to block.
Tor even more so, the power of Tor is that the more people use it: the stronger it becomes to centralised adversaries.
The main issue with Tor is the performance of it though.
I thought that the main issue with tor was that so many of the exit nodes are actually the FBI.
You don't ever have to leave the Tor network.
I host IRC on a hidden service, and even Facebook (lol) offers a hidden service endpoint.
All that is needed is for a critical mass of people and a decent index: and we successfully have reinvented "the wired" from Serial Experiments: Lain
The truth is the internet was never designed or intended to host private information. It was created for scientists by scientists to share research papers. Capitalists perverted it.
I'm on an older system here, and both Cloudflare and Anubis entirely block me out of sites. Once you start blocking actual users out of your sites, it simply has gone too far. At least provide an alternative method to enter your site (e.g. via login) that's not hampered by erroneous human checks. Same for the captchas where you help train AIs by choosing out of a set of tiny/noisy pictures. I often struggle for 5 to 10 minutes to get past that nonsense. I heard bots have less trouble.
Basically we're already past the point where the web is made for actual humans, now it's made for bots.
> Once you start blocking actual users out of your sites, it simply has gone too far.
It has, scrapers are out of control. Anubis and its ilk are a desperate measure, and some fallout is expected. And you don't get to dictate how a non-commercial site tries to avoid throttling and/or bandwidth overage bills.
No, they are a lazy measure. Most websites that slap on these kinds of checks don't even bother with more human-friendly measures first.
Because I don't have the fucking time to deal with AI scraper bots. I went harder: anything even looking suspiciously close to a scraper that isn't on Google's crawler list [1] and doesn't have wget in its user agent gets its entire /24 hard banned for a month, with an email address to contact for unbanning.
That seems to be a pretty effective way for now to keep scrapers, spammers and other abusive behavior away. Normal users don't do certain site actions at the speed that scraper bots do, there's no other practically relevant search engine than Google, I've never ever seen an abusive bot hide as wget (they all try to emulate looking like a human operated web browser), and no AI agent yet is smart enough to figure out how to interpret the message "Your ISP's network appears to have been used by bot activity. Please write an email to xxx@yyy.zzz with <ABC> as the subject line (or click on this pre-filled link) and you will automatically get unblocked".
[1] https://developers.google.com/search/docs/crawling-indexing/...
> Normal users don't do certain site actions at the speed that scraper bots do
How would you know, when you have already banned them?
Simple. A honeypot link in a three-levels-deep menu that no ordinary human would care about and that, thanks to a JS animation, needs at least half a second for a human to click on. Any bot that clicks it in less than half a second gets the banhammer. No need for invasive tracking, third-party integrations, whatever.
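A sketch of that kind of timing check, with hypothetical helper names (the real thing would hang off whatever web framework and firewall you already use; IPv4 only here):

    import time
    import ipaddress

    MIN_HUMAN_SECONDS = 0.5       # the JS animation makes a faster click implausible
    BAN_SECONDS = 30 * 24 * 3600  # one month

    page_served_at = {}           # client IP -> when we last rendered a page for it
    banned_until = {}             # /24 network -> unban timestamp

    def on_page_served(ip: str) -> None:
        page_served_at[ip] = time.monotonic()

    def on_honeypot_hit(ip: str) -> None:
        elapsed = time.monotonic() - page_served_at.get(ip, float("-inf"))
        if elapsed < MIN_HUMAN_SECONDS:
            net = ipaddress.ip_network(f"{ip}/24", strict=False)
            banned_until[net] = time.time() + BAN_SECONDS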
That does sound like a much human friendlier approach than Anubis. I agree that tarpits and honeypots are a good stopgap until the legal system catches up to the rampant abuse of these "AI" companies. It's when your solutions start affecting real human users just because they are not "normal" in some way that I stop being sympathetic.
I gave up on a lot of websites because of the aggressive blocking.
FYI - you can communicate with the author of Anubis, who has already said she's working on ways to make sure that all browsers (links, lynx, dillo, midori, et cetera) work.
Unless you're paying Cloudflare a LOT of money, you won't get to talk with anyone who can or will do anything about issues. They know about their issues and simply don't care.
If you don't mind taking a few minutes, perhaps put some details about your setup in a bug report?
It's the other way around for me sometimes — I've never had issue with Anubis, I frequently get it with CF-protected sites.
(Not to mention all the sites which started putting country restrictions in on their generally useful instruction articles etc — argh)
I’m planning a trip to France right now, and it seems like half the websites in that country (for example, ratp.fr for Paris public transport info) require me to check a CloudFlare checkbox to promise that I am a human. And of those that don’t, quite a few just plain lock me out...
And a lot of US sites don't work in France either, or they ban you after just a couple requests with no appeal...
I find the same when using some foreign sites. I think the operator must have configured that France is OK, maybe neighboring countries too, the rest of the world must be checked.
It's not hard to understand why though surely?
You might have to show a passport when you enter France, and have your baggage and person (intrusively) scanned if you fly there, for much the same reason.
People, some of them in positions of government in some nation states want to cause harm to the services of other states. Cloudflare was probably the easiest tradeoff for balancing security of the service with accessibility and cost to the French/Parisian taxpayer.
Not that I'm happy about any of this, but I can understand it.
The antagonists in this case are not state sponsored terrorists, instead it's AI bros DDoSing the internet.
I get one basically every time I go to gitlab.com on Firefox.
It is easy to pass the challenge, but it isn't any better than Anubis.
Even when not on VPN, if a site uses the CloudFlare interstitials, I will get it every single time - at least the "prove you're not a bot" checkbox. I get the full CAPTCHA if I'm on a VPN or I change browsers. It is certainly enough to annoy me. More than Anubis, though I do think Anubis is also annoying, mainly because of being nearly worthless.
For me both are things that mostly show up for 1-3 seconds, then get replaced by the actual website. I suspect that's the user experience of 99% of people.
If you fall in the other 1% (e.g. due to using unusual browsers or specific IP ranges), cloudflare tends to be much worse
You must be on a good network. You should run one of those "get paid to share your internet connection with AI companies" apps. Since you're on a good network you might make a lot of money. And then your network will get cloudflared, of course.
We should repeat this until every network is cloudflared and everyone hates cloudflare and cloudflare loses all its customers and goes bankrupt. The internet would be better for it.
I hit Cloudflare's garbage about as much as I hit Anubis. With the difference that far more sites use Cloudflare than Anubis, thus Anubis is far worse at triggering false positives.
Huh? What false positives does Anubis produce?
The article doesn't say and I constantly get the most difficult Google captchas, cloudflare block pages saying "having trouble?" (which is a link to submit a ticket that seems to land in /dev/null), IP blocks because user agent spoofing, errors "unsupported browser" when I don't do user agent spoofing... the only anti-bot thing that reliably works on all my clients is Anubis. I'm really wondering what kinds of false positives you think Anubis has, since (as far as I can tell) it's a completely open and deterministic algorithm that just lets you in if you solve the challenge, and as the author of the article demonstrated with some C code (if you don't want to run the included JavaScript that does it for you), that works even if you are a bot. And afaik that's the point: no heuristics and false positives but a straight game of costs; making bad scraping behavior simply cost more than implementing caching correctly or using commoncrawl
I've had Anubis repeatedly fail to authorize me to access numerous open source projects, including the mesa3d gitlab, with a message looking something like "you failed".
As a legitimate open source developer and contributor to buildroot, I've had no recourse besides trying other browsers, networks, and machines, and it's triggered on several combinations.
Interesting, I didn't even know it had such a failure mode. Thanks for the reply, I'll sadly have to update my opinion on this project since it's apparently not a pure "everyone is equal if they can Prove the Work" system as I thought :(
I'm curious how, though, since the submitted article doesn't mention that and demonstrates curl working (which is about as low as you can go on the browser emulation front), but no time to look into it atm. Maybe it's because of an option or module that the author didn't have enabled
It sounds[1] like this was an issue with assumptions regarding header stability. Hopefully as people update their installations things will improve for us end users.
[1]: https://anubis.techaro.lol/blog/release/v1.20.0/#chrome-wont...
Thank goodness. It was feeling quite dystopian being caught in a bot dragnet that blocked me from resources that are relevant and vital to my work.
So yes, it is like having a stalker politely open the door for you as you walk into a shop, because they know very well who you are.
In a world full of robots that look like humans, the stalker who knows you and lets you in might be the only solution.
That's called authentication. In the case of the stalker, by biometrics (facial recognition). This could be a solution
But that's not what Cloudflare does. Cloudflare guesses whether you are a bot and then either blocks you or not. If it currently likes you, bless your luck
> This could be a solution
Until the moment someone will figure out the generation of realistic enough 3d faces.
That stalker might itself be a bot though, so there's no solution.
Both are equally terrible - one doesn't require explanations to my boss though
If your boss doesn't want you to browse the web, where some technical content is accompanied by an avatar that the author likes, they may not be suitable as boss, or at least not for positions where it's their job to look over your shoulder and make sure you're not watching series during work time. Seems like a weird employment place if they need to check that anyway
we have customers in our offices pretty much every day, I think "no anime girls on screens" is a fair request
I fail to see how this particular "anime girl" and the potential for clients seeing it, could make you think that's a fair request. That seems extremely ridiculous to me.
It's an MIT licensed, open project. Fork it and change the icon to your favorite white-bread corporate logo if you want. It would probably take less time than complaining about it on HN.
I think the complaint is rather that you don't know when it will rear its face on third-party websites that you are visiting as part of work. Forking wouldn't help with not seeing it on other sites
(Even if I agree that the boss or customers should just get over it. It's not like they're drawing genitalia on screen and it's also easily explainable if they don't already know it themselves.)
Add a rule to your adblocker for the image, then. The main site appears to have it at `anubis.techaro.lol/.within.website/x/cmd/anubis/static/img/happy.webp?cacheBuster=v1.21.3-43-gb0fa256`, so a rule for `||*/.within.website/x/cmd/anubis/static/img/$image` ought to work for ublock origin (purely a guess regarding wildcards for domain, I've never set a rule without a domain before)
If Anubis didn't ship with a weird looking anime girl I think people would treat it akin to Cloudflares block pages.
Which means they'd still hate it and find it annoying
We can make noise about both things, and how they're ruining the internet.
Cloudflare's solution works without javascript enabled unless the website turns up the scare level to max or you are on an IP with already bad reputation. Anubis does not.
But at the end of the day both are shit and we should not accept either. That includes not using one as an excuse for the other.
Laughable. They say this, but anyone who actually surfs the web with a non-bleeding-edge, non-corporate browser gets constantly blocked by Cloudflare. The idea that their JS computational paywalls only pop up rarely is absurd. Anyone believing this line lacks lived experience. My Comcast IP shouldn't have a bad rep, and using a browser from ~2015 shouldn't make me scary. But I can't even read bills on congress.gov anymore thanks to bad CF deployments.
Also, Anubis does have a non-JS mode: the meta-refresh challenge in the HTML head. It's just that the type of people who use Cloudflare or Anubis almost always deploy the default (mostly broken) configs that block as many humans as bots. And they never realize it because they only measure such things with JavaScript.
TO BE FAIR
I dislike those even more.
Over the past few years I've read far more comments complaining about Cloudflare doing it than Anubis. In fact, this discussion section is the first time I've seen people talking about Anubis.
It sounds like you're saying that it's not the proof-of-work that's stopping AI scrapers, but the fact that Anubis imposes an unusual flow to load the site.
If that's true Anubis should just remove the proof-of-work part, so legitimate human visitors don't have to stare at a loading screen for several seconds while their device wastes electricity.
> If that's true Anubis should just remove the proof-of-work part
This is my very strong belief. To make it even clearer how absurd the present situation is, every single one of the proof-of-work systems I’ve looked at has been using SHA-256, which is basically the worst choice possible.
Proof-of-work is bad rate limiting which depends on a level playing field between real users and attackers. This is already a doomed endeavour. Using SHA-256 just makes it more obvious: there’s an asymmetry factor in the order of tens of thousands between common real-user hardware and software, and pretty easy attacker hardware and software. You cannot bridge such a divide. If you allow the attacker to augment it with a Bitcoin mining rig, the efficiency disparity factor can go up to tens of millions.
These proof-of-work systems are only working because attackers haven’t tried yet. And as long as attackers aren’t trying, you can settle for something much simpler and more transparent.
If they were serious about the proof-of-work being the defence, they’d at least have started with something like Argon2d.
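For comparison, here's roughly what a memory-hard version of the same challenge could look like. This sketch uses scrypt from Python's standard library as a stand-in for Argon2d; it is not Anubis's scheme, and the point is the per-evaluation memory cost, which is what narrows the gap between a phone and a mining rig.

    import hashlib

    def pow_hash(challenge: bytes, nonce: int) -> bytes:
        # Each evaluation needs ~16 MiB of RAM (128 * n * r bytes), so ASICs
        # and GPUs gain far less over a phone than they do with plain SHA-256.
        return hashlib.scrypt(nonce.to_bytes(8, "big"), salt=challenge,
                              n=2**14, r=8, p=1, dklen=32)

    def solve(challenge: bytes, difficulty_bits: int = 4) -> int:
        # Find a nonce whose hash has `difficulty_bits` leading zero bits.
        nonce = 0
        while int.from_bytes(pow_hash(challenge, nonce), "big") >> (256 - difficulty_bits):
            nonce += 1
        return nonce

Even then, the economics argument upthread still applies: memory-hardness shrinks the attacker's advantage, it doesn't change who pays the latency.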
The proof of work isn't really the crux. They've been pretty clear about this from the beginning.
I'll just quote from their blog post from January.
https://xeiaso.net/blog/2025/anubis/
Anubis also relies on modern web browser features:
- ES6 modules to load the client-side code and the proof-of-work challenge code.
- Web Workers to run the proof-of-work challenge in a separate thread to avoid blocking the UI thread.
- Fetch API to communicate with the Anubis server.
- Web Cryptography API to generate the proof-of-work challenge.
This ensures that browsers are decently modern in order to combat most known scrapers. It's not perfect, but it's a good start.
This will also lock out users who have JavaScript disabled, prevent your server from being indexed in search engines, require users to have HTTP cookies enabled, and require users to spend time solving the proof-of-work challenge.
This does mean that users using text-only browsers or older machines where they are unable to update their browser will be locked out of services protected by Anubis. This is a tradeoff that I am not happy about, but it is the world we live in now.
Except this is exactly the problem. Now you are checking for mainstream browsers instead of some notion of legitimate users. And as TFA shows a motivated attacker can bypass all of that while legitimate users of non-mainstream browsers are blocked.
Aren't most scrapers using things like Playwright or Puppeteer anyway by now, especially since so many pages are rendered using JS and even without Anubis would be unreadable without executing modern JS?
... except when you do not crawl with a browser at all. It's trivial to solve, just like the taviso post demonstrated.
This makes zero sense; it is simply the wrong approach. I'm already tired of saying so and getting attacked for it. So I'm glad professional-random-Internet-bullshit-ignorer Tavis Ormandy wrote this one.
All this is true, but also somewhat irrelevant. In reality the amount of actual hash work is completely negligible.
For usability reasons Anubis only requires you to go through the proof-of-work flow once in a given period. (I think the default is once per week.) That's just very little work.
Detecting that you occasionally need to send a request through a headless browser is far more of a hassle than the PoW. If you prefer LLMs rather than normal internet search, it'll probably consume far more compute as well.
> For usability reasons Anubis only requires you to go through the proof-of-work flow once in a given period. (I think the default is once per week.) That's just very little work.
If you keep cookies. I do not want to keep cookies for otherwise "stateless" sites. I have maybe a dozen sites whitelisted, every other site loses cookies when I close the tab.
A bigger problem is that you should not have to enable javascript for otherwise static sites. If you enable JS, cookies are a relatively minor issue compared to all the other ways the website can keep state about you.
Well, that's not a problem when scraping. Most scraping libraries have ways to retain cookies.
This is basically what most of the challenge types in go-away (https://git.gammaspectra.live/git/go-away/wiki/Challenges) do.
+1 for go-away. It's a bit more involved to configure, but worth the effort imo. It can be considerably more transparent to the user, triggering the nuclear PoW check less often, while being just as effective, in my experience.
I feel like the future will have this, plus ads displayed while the work is done, so websites can profit while they profit.
Every now and then I consider stepping away from the computer job, and becoming a lumberjack. This is one of those moments.
My family takes care of a large-ish forest, so I've had to help since my early teens. Let me tell you: think twice, it's f*ckin dangerous. Chainsaws, winches, heavy trees falling and breaking in unpredictable ways. I had a couple of close calls myself. Recently a guy from a neighboring village was squashed to death by a root plate that tilted.
I often think about quitting tech myself, but becoming a full-time lumberjack is certainly not an alternative for me.
Hah, I know, been around forests since childhood, seen (and done) plenty of sketchy stuff. For me it averages out to a couple of days of forest work a year. It's backbreaking labour, and then you deal with the weather.
But man, if tech goes straight into cyberpunk dystopia but without the cool gadgets, maybe it is the better alternative.
Worth getting to know the ins and outs of forest management now. I don’t think AI will take most tech jobs soon, but they sure as hell are already making them boring.
adCAPTCHA already does this:
https://adcaptcha.com
This is a joke, right? The landing page makes it seem so.
I tried the captcha in their login page and it made the entire page, including the puzzle piece slider, run at 2 fps.
My god, we do really live in 2025.
Holy shit. Opening the demo from the menu, it's like captchas and youtube ads had a baby
Exactly this.
I don't think anything will stop AI companies for long. They can do spot AI agentic checks of workflows that stop working for some reason and the AI can usually figure out what the problem is and then update the workflow to get around it.
This was obviously dumb when it launched:
1) scrapers just run a full browser and wait for the page to stabilize. They did this before this thing launched, so it probably never worked.
2) The AI reading the page needs something like 5 seconds * 1600W to process it. Assuming my phone can even perform that much compute as efficiently as a server class machine, it’d take a large multiple of five seconds to do it, and get stupid hot in the process.
Note that (2) holds even if the AI is doing something smart like batch processing 10-ish articles at once.
> This was obviously dumb when it launched:
Yes. Obviously dumb but also nearly 100% successful at the current point in time.
And likely going to stay successful as the non-protected internet still provides enough information to dumb crawlers that it’s not financially worth it to even vibe-code a workaround.
Or in other words: Anubis may be dumb, but the average crawler that completely exhausts some sites' resources is even dumber.
And so it all works out.
And so the question remains: how dumb was it exactly, when it works so well and continues to work so well?
> Yes. Obviously dumb but also nearly 100% successful at the current point in time.
Only if you don't care about negatively affecting real users.
I understand this as an argument that it’s better to be down for everyone than have a minority of users switch browsers.
I’m not convinced that makes sense.
Now ideally you would have the resources to serve all users and all the AI bots without performance degradation, but for some projects that’s not feasible.
In the end it’s all a compromise.
Does it work well? I run Chromium controlled by Playwright for scraping and typically have Gemini implement the script for it because it's not worth my time otherwise. But I'm not crawling the Internet generally (which I think there is very little financial incentive to do; it's a very expensive process even ignoring Anubis et al.); it's always that I want something specific and am sufficiently annoyed by the lack of an API.
regarding authentication mentioned elsewhere, passing cookies is no big deal.
Anubis is not meant to stop single endpoints from scraping. It's meant to make it harder for massive AI scrapers. The problematic ones evade rate limiting by using many different IP addresses, and make scraping cheaper on themselves by running headless. Anubis is specifically built to make that kind of scraping harder, as I understand it.
Does it actually? I don't think I've seen a case study with hard numbers.
Here’s one study
https://dukespace.lib.duke.edu/server/api/core/bitstreams/81...
And of all the high-profile projects implementing it, like the LKML archives, none have backed down yet, so I’m assuming the initial improvement in numbers must continue or it would have been removed since
I run a service under the protection of go-away[0], which is similar to Anubis, and can attest it works very well, still. Went from constant outages due to ridiculous volumes of requests to good load times for real users and no bad crawlers coming through.
[0]: https://git.gammaspectra.live/git/go-away
Great, thanks for the link.
the workaround is literally just running a headless browser, and that's pretty much the default nowadays.
if you want to save some $$$ you can spend like 30 minutes making a cracker like in the article: just make it multi-threaded, add a queue, and boom, your scraper nodes can go back to their cheap configuration. Or since these are AI orgs we're talking about, write a GPU cracker and laugh as it solves challenges far faster than any user could.
custom solutions aren't worth it for individual sites, but with how widespread anubis is it's become worth it.
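For illustration, here is a rough sketch of the kind of multi-threaded cracker with a work queue being described, assuming a challenge of the form "find a nonce so that SHA-256(challenge:nonce) has N leading zero bits" (the challenge format and difficulty are assumptions for the sketch, not Anubis's actual wire protocol):

  # Rough sketch: parallel PoW cracker handing out chunks of nonces as work items.
  import hashlib
  import itertools
  from concurrent.futures import ProcessPoolExecutor, as_completed

  CHUNK = 200_000  # nonces per work item

  def search_chunk(challenge: str, zero_bits: int, start: int):
      """Scan one chunk of nonces; return a winning nonce or None."""
      shift = 256 - zero_bits
      for nonce in range(start, start + CHUNK):
          digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
          if int.from_bytes(digest, "big") >> shift == 0:
              return nonce
      return None

  def solve(challenge: str, zero_bits: int = 16, workers: int = 8) -> int:
      chunks = itertools.count(0, CHUNK)  # the "queue" of work to hand out
      with ProcessPoolExecutor(max_workers=workers) as pool:
          pending = {pool.submit(search_chunk, challenge, zero_bits, next(chunks))
                     for _ in range(workers)}
          while True:
              done = next(as_completed(pending))
              pending.remove(done)
              if (nonce := done.result()) is not None:
                  return nonce
              pending.add(pool.submit(search_chunk, challenge, zero_bits, next(chunks)))

  if __name__ == "__main__":
      print(solve("example-challenge"))

Scaling this across a scraper fleet (or porting the inner loop to a GPU) is exactly the kind of one-off engineering cost the comment is pointing at.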
I agree. Your estimate for (2), about 0.0022 kWh, corresponds to about a sixth of the charge of an iPhone 15 Pro and would take longer than ten minutes on the phone, even at max power draw. It feels about right for the amount of energy/compute of a large modern MoE loading large pages of several 10k tokens. For example, this tech (a couple of months old) could input 52.3k tokens per second to a 672B-parameter model per H100 node instance, which probably burns about 6–8 kW while doing it. The new B200s should be about 2x to 3x more energy efficient, but your point still holds within an order of magnitude.
https://lmsys.org/blog/2025-05-05-large-scale-ep/
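Back-of-the-envelope check of those figures (the battery capacity is my assumption, roughly the ~12.7 Wh of an iPhone 15 Pro class device):

  # Sanity check of the "5 s * 1600 W" estimate from upthread.
  seconds, watts = 5, 1600
  joules = seconds * watts              # 8000 J
  wh = joules / 3600                    # ~2.2 Wh, i.e. ~0.0022 kWh
  battery_wh = 12.7                     # assumed phone battery capacity
  print(wh / battery_wh)                # ~0.17, roughly a sixth of a full charge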
The argument doesn't quite hold. The mass scraping (for training) is almost never done by a GPU system; it's almost always done by a dedicated system running a full Chrome fork in some automated way (not just the signatures but some bugs give that away).
And frankly, processing a single page of text fits within a single token window, so it likely runs for a blink (milliseconds) before moving on to the next data entry. The kicker is that it's run potentially thousands of times depending on your training strategy.
At inference there's now a dedicated tool that may perform a "live" request to scrape the site contents. But then this is just pushed into a massive context window to give the next token anyway.
The point is that scraping is already inherently cost-intensive so a small additional cost from having to solve a challenge is not going to make a dent in the equation. It doesn't matter what server is doing what for that.
100 billion web pages * 0.02 USD of PoW/page = 2 billion dollars. The point is not to stop every scraper/crawler; the point is to raise the costs enough to avoid being bombarded by all of them.
Yes, but it's not going to be 0.02 USD of PoW per page! That is an absurd number. It'd mean a two-hour proof of work for a server CPU, a ten hour proof of work for a phone.
In reality you can do maybe a 1/10000th of that before the latency hit to real users becomes unacceptable.
And then, the cost is not per page. The cost is per cookie. Even if the cookie is rate-limited, you could easily use it for 1000 downloads.
Those two errors are multiplicative, so your numbers are probably off by about 7 orders of magnitude. The cost of the PoW is not going to be $2B, but about $200.
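Spelling out both corrections with the same rough assumptions (these are the comments' own ballpark figures, not measurements):

  # The grandparent's headline number, then the two corrections applied.
  pages = 100e9                          # pages to crawl
  naive_pow_per_page = 0.02              # USD of PoW per page (grandparent's figure)
  print(pages * naive_pow_per_page)      # 2e9  -> the $2B headline

  tolerable_fraction = 1 / 10_000        # largest PoW real users will tolerate
  pages_per_cookie = 1_000               # one solved challenge reused many times
  cost = pages * naive_pow_per_page * tolerable_fraction / pages_per_cookie
  print(cost)                            # 200.0 -> roughly $200 in total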
I'm going to phrase the explanation like this in the future. Couldn't have said it better myself.
The problem is that 7 + 2 on a submission form only affects people who want to submit something, Anubis affects every user who wants to read something on your site
The question then is why read-only users are consuming so many resources that serving them big chunks of JS instead reduces the load on the server. Maybe improve your rendering and/or caching before employing DRM solutions that are doomed to fail anyway.
The problem it's originally fixing is bad scrapers accessing dynamic site content that's expensive to produce, like trying to crawl all diffs in a git repo, or all mediawiki oldids. Now it's also used on mostly static content because it is effective vs scrapers that otherwise ignore robots.txt.
The author makes it very clear that he understands the problem Anubis is attempting to solve. His issue is that the chosen approach doesn't solve that problem; it just inhibits access for humans, particularly those with limited access to compute resources.
That's the opposite of being dismissive. The author has taken the time to deeply understand both the problem and the proposed solution, and has taken the time to construct a well-researched and well-considered argument.
> This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.
This is a confusing comment because it appears you don’t understand the well-written critique in the linked blog post.
> This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
The key point in the blog post is that it’s the inverse of a CAPTCHA: The proof of work requirement is solved by the computer automatically.
You don’t have to teach a computer how to solve this proof of work because it’s designed for the computer to solve the proof of work.
It makes the crawling process more expensive because it has to actually run scripts on the page (or hardcode a workaround for specific versions) but from a computational perspective that’s actually easier and far more deterministic than trying to have AI solve visual CAPTCHA challenges.
But for actual live users who don't see anything but a transient screen, Anubis is a better experience than all those pesky CAPTCHAs (I am bored of trying to recognize bikes, pedestrian crossings, buses, hydrants).
The question is if this is the sweet spot, and I can't find anyone doing the comparative study (how many annoyed human visitors, how many humans stopped and, obviously, how many bots stopped).
> Anubis is a better experience than all those pesky CAPTCHAs (I am bored of trying to recognize bikes, pedestrian crossings, buses, hydrants).
Most CAPTCHAs are invisible these days, and Anubis is worse than them. Also, CAPTCHAs are not normally deployed just for visiting a site, they are mostly used when you want to submit something.
We are obviously living a different Internet reality, and that's the whole point — we need numbers to really establish baseline truth.
FTR, I am mostly browsing from Serbia using Firefox browser on a Linux or MacOS machine.
I don’t think we are living in a different reality, I just don’t think you are accounting for all the CAPTCHAs you successfully pass without seeing.
Wouldn't it be nice to have a good study that supports either your or my view?
FWIW, I've never been stopped by Anubis, so even if it's much more rarely implemented, that's still infinitely less than 5-10 captchas a day I do see regularly. I do agree it's still different scales, but I don't trust your gut feel either. Thus a suggestion to look for a study.
Not OP but try browsing the web with a combination of Browser + OS that is slightly off to what most people use and you'll see Captchas pop up at every corner of the Internet.
And if the new style of CAPTCHAs is like this one, then it's much more disturbing.
It's been going on for decades now too. It's a cat and mouse game that will be with us for as long as people try to exploit online resources with bots. Which will be until the internet is divided into nation nets, suffocated by commercial interests, and we all decide to go play outside instead.
No. This went into overdrive in the "AI" era (crawlers feeding massive LLMs for ML chatbots).
Frankly it's something I'm sad we don't yet see a lawsuit for, similar to the Times v. OpenAI. A lot of "new crawlers" claim to innocently forget about established standards like robots.txt.
I just wish people would name and shame the massive companies at the top stomping on the rest of the internet in an edge to "get a step up over the competition".
That doesn't really challenge what I said, there's not much "different this time" except the scale is commensurate to the era. Search engine crawlers used to take down websites as well.
I understand and agree with what you are saying though, the cat and mouse is not necessarily technical. Part of solving the searchbot issue was also social, with things like robots.txt being a social contract between companies and websites, not a technical one.
Yes, this is not a problem that will be solved with technical measures. Trying to do so is only going to make the web worse for us humans.
This arms race will have a terminus. The bots will eventually be indistinguishable from humans. Some already are.
> The bots will eventually be indistinguishable from humans
Not until they get issued government IDs they won't!
Extrapolating from current trends, some form of online ID attestation (likely based on government-issued ID[1]) will become normal in the next decade, and naturally, this will be included in the anti-bot arsenal. It will be up to the site operator to trust identities signed by the Russian government.
1. Despite what Sam Altman's eyeball company will try to sell you, government registers will always be the anchor of trust for proof-of-identity, they've been doing it for centuries and have become good at it and have earned the goodwill.
How does this work, though?
We can't just have "send me a picture of your ID" because that is pointlessly easy to spoof - just copy someone else's ID.
So there must be some verification that you, the person at the keyboard, is the same person as that ID identifies. The UK is rapidly finding out that that is extremely difficult to do reliably. Video doesn't really work reliably in all cases, and still images are too easily spoofed. It's not really surprising, though, because identifying humans reliably is hard even for humans.
We could do it at the network level - assigning a government-issued network connection to a specific individual, so the system knows that any traffic from a given IP address belongs to that specific individual. There are obvious problems with this model, not least that IP addresses were never designed for this, and spoofing an IP becomes identity theft.
We also do need bot access for things, so there must be some method of granting access to bots.
I think that to make this work, we'd need to re-architect the internet from the ground up. To get there, I don't think we can start from here.
If you're really curious about this, there's a place where people discuss these problems annually: https://internetidentityworkshop.com/
Various things you're not thinking of:
- "The person at the keyboard, is the same person as that ID identifies" is a high expectation, and can probably be avoided—you just need verifiable credentials and you gotta trust they're not spoofed
- Many official government IDs are digital now
- Most architectures for solving this problem involve bundling multiple identity "attestations," so proof of personhood would ultimately be a gradient. (This does, admittedly, seem complicated though ... but World is already doing it, and there are many examples of services where providing additional information confers additional trust. Blue checkmarks to name the most obvious one.)
As for what it might look like to start from the ground up and solve this problem, https://urbit.org/, for all its flaws, is the only serious attempt I know of and proves it's possible in principle, though perhaps not in practice
that is interesting, thanks.
Why isn't it necessary to prove that the person at the keyboard is the person in the ID? That seems like the minimum bar for entry to this problem. Otherwise we can automate the ID checks and the bots can identify as humans no problem.
And how come the UK is failing so badly at this?
We almost all have IC Chip readers in our pocket (our cell phones), so if the government issues a card that has a private key embedded in it, akin to existing GnuPG SmartCards, you can use your phone to sign an attestation of your personhood.
In fact, Japan already has this in the form of the "My Number Card". You go to a webpage, the webpage says "scan this QR code, touch your phone to your ID card, and type in your PIN code", and doing that is enough to prove to the website that you're a human. You can choose to share name/birthday/address, and it's possible to only share a subset.
Robots do not get issued these cards. The government verifies your human-ness when they issue them. Any site can use this system, not just government sites.
Germany has this. The card plus PIN technically proves you are in current possession of both, not that you are the person (no biometrics or the like). You can choose to share/request not only certain data fields but also, e.g., whether you are below or above a certain age or height without disclosing the actual number.
> if you are below or above a certain age or height
Is discrimination against dwarves still a thing in Germany?
I want to believe that this would be used at amusement parks to scan "can I safely get on this ride" and at the entrance to stairs to tell you if you'll bump your head or not.
The system as a whole is rarely used. I think it’s a combination of poor APIs and hesitation of the population. For somebody without technical knowledge, there is no obvious difference to the private video ID companies. On the surface, you may believe that all data is transferred anyway and you have to trust providers in all cases, not that some magic makes it so third parties don’t get more than necessary.
I don’t know of any real world example that queries height, I mentioned it because it is part of the data set and privacy-preserving queries are technically possible. Age restrictions are the obvious example, but even there I am not aware of any commercial use, only for government services like tax filing or organ donor registry. Also, nobody really measures your height, you just tell them what to put there when you get the ID. Not so for birth dates, which they take from previous records going back to the birth certificate.
That is already solved by governments and businesses. If you have recently attempted to log into a US government website, you were probably told that you need Login.gov or ID.me. ID.me verifies identity via driver’s license, passport, Social Security number—and often requires users to take a video selfie, matched against uploaded ID images. If automated checks fail, a “Trusted Referee” video call is offered.
If you think this sounds suspiciously close the what businesses do with KYC, Know Your Customer, you're correct!
Not good enough, providers and governments want proof of life and proof of identity that matches government IDs.
Without that, anyone can pretend to be their dead grandma/murder victim, or someone whose ID they stole.
How about a chip implant signed by the government hospital that attests for your vitality? Looks like this is where things are headed
IDs would have to be reissued with a public/private key model you can use to sign your requests.
> the person at the keyboard, is the same person as that ID identifies
This won't be possible to verify - you could lend your ID out to bots but that would come at the risk of being detected and blanket banned from the internet.
I have a wonderful new idea for this problem space based on your username.
In Europe we have itsme. You link the phone app to your ID, then you can use it to scan QR codes to log into websites.
"In Europe" is technically true but makes it sound more widely used than I believe it to be... though maybe my knowledge is out of date.
Their website lists 24 supported countries (including some non-EU like UK and Norway, and missing a few of the 27 EU countries) - https://www.itsme-id.com/en-GB/coverage
But does it actually have much use outside of Belgium?
Certainly in the UK I've never come across anyone, government or private business, mentioning it - even since the law passed requiring many sites to verify that visitors are adults. I wouldn't even be familiar with the name if I hadn't learned about its being used in Belgium.
Maybe some other countries are now using it, beyond just Belgium?
Oh I wasn't aware of that. I remember a Dutch friend talking to me about a similar app they had. Maybe they have a re-branded version of it?
One problem with solutions like that is that the website needs to pay for every log-in. So you save a few dollars blocking scrapers but now you have to pay thousands of dollars to this company instead.
I'm from Europe and I've never heard of it.
In Singapore, we have SingPass, which is also an OpenID Connect implementation.
Officially sanctioned 2fa tied to your official government ID. Over here we have "It's me" [1].
Yes, you can in theory still use your ID card with a USB card reader for accessing gov services, but good luck finding up-to-date drivers for your OS or using it on mobile, etc.
[1] https://www.itsme-id.com/en-BE/
Except that itsme crap is not from the government and doesn't support activation on anything but a Windows / Mac machine. No Linux support at all, while the Belgian government stuff (CSAM) supports Linux just fine.
It is from the banks, which leveraged their KYC, but it was adopted very broadly by gov and many other ID-required or ID-linked services. AFAIK it does not need a computer to activate besides your phone and one of those bank-issued 2FA challenge card readers.
For CSAM, also AFAIK, the first 'activation' includes a visit to your local municipality to verify your identity. Unless you go via itsme, as it is an authorized CSAM key holder.
The UK is stupidly far behind on this though. On one hand the digitization of government services is really well done (thanks to the fantastic team behind the .gov websites), but on the other it's like being in the dark ages of tech. My native country has physical ID cards that contain my personal certificate, which I can use to sign things or to - gasp! - prove that I am who I say I am. There is a government app that you can use to scan your ID card using the NFC chip in your phone; after providing it with a password that you set when you got the card, it produces a token that can then be used to verify your identity or sign documents digitally - and those signatures legally have the same weight as real paper signatures.
UK is in this weird place where there isn't one kind of ID that everyone has - for most people it's the driving licence, but obviously that's not good enough. But my general point is that UK could just look over at how other countries are doing it and copy good solutions to this problem, instead of whatever nonsense is being done right now with the age verification process being entirely outsourced to private companies.
> UK is in this weird place where there isn't one kind of ID that everyone has - for most people it's the driving licence, but obviously that's not good enough.
As a Brit I personally went through a phase of not really existing — no credit card, no driving licence, expired passport - so I know how annoying this can be.
But it’s worth noting that we have this situation not because of mismanagement or technical illiteracy or incompetence but because of a pretty ingrained (centuries old) political and cultural belief that the police shouldn’t be able to ask you “papers please”. We had ID cards in World War II, everyone found them egregious and they were scrapped. It really will be discussed in those terms each time it is mentioned, and it really does come down to this original aspect of policing by consent.
So the age verification thing is running up against this lack of a pervasive ID, various KYC situations also do, we can get an ID card to satisfy verification for in-person voting if we have no others, but it is not proof of identity anywhere else, etc.
It is frustrating to people who do not have that same cultural touchstone but the “no to ID” attitude is very very normal; generally the UK prefers this idea of contextual, rather than universal ID. It’s a deliberate design choice.
Same in Australia - there was a referendum about whether we should have government-issued ID cards, and the answer was an emphatic "NO". And Australia is hitting or going to hit the same problem with the age verification thing for social media.
It doesn’t require a ground-up rework. The easiest idea is that real people can get an official online ID at a site like login.gov and website operators verify people using that API. Some countries already have this kind of thing from what I understand. The tech bros want to implement this on the blockchain, but the government could also do it.
Can't wait to sign into my web browser with my driver's license.
In all likelihood, most people will do so via the Apple Wallet (or the equivalent on their non-Apple devices). It's going to be painful to use Open source OSes for a while, thanks to CloudFlare and Anubis. This is not the future I want, but we can't have nice things.
> This is not the future I want, but we can't have nice things.
Actually, we can if we collectively decide that we should have them. Refuse to use sites that require these technologies and demand governments to solve the issue in better ways, e.g. by ensuring there are legal consequences for abusive corporations.
No worries. Stick an unregistered copy of win 11 (ms doesn’t seem to care) and your drivers license in an isolated VM and let the AI RDP into it for you.
Manually browsing the web yourself will probably be trickier moving forward though.
What's next? Requiring a license to make toast in your own damn toaster?
> your own damn toaster
Silly you, joking around like that. Can you imagine owning a toaster?! Sooo inconvenient and unproductive! Guess, if you change your housing plan, you gonna bring it along like an infectious tick? Hahah — no thank you! :D
You will own nothing and you will be happy!
(Please be reminded, failing behavioral compliance with, and/or voicing disapproval of this important moral precept, jokingly or not, is in violation of your citizenship subscription's general terms and conditions. This incident will be reported. Customer services will assist you within 48 hours. Please, do not leave your base zone until this issue has been resolved to your satisfaction.)
"Luckily" you won't have to do only that, you'll need to provide live video to prove you're the person in the ID and that you're alive.
The internet would come to a grinding halt as everyone would suddenly become mindful of their browsing. It's not hard to imagine a situation where, say, pornhub sells its access data and the next day you get sacked at your teaching job.
It doesn't need to. Thanks to asymmetric cryptography governments can in theory provide you with a way to prove you are a human (or of a certain age) without:
1. the government knowing who you are authenticating yourself to
2. or the recipient learning anything but the fact that you are a human
3. or the recipient being able to link you to a previous session if you authenticate yourself again later
The EU is trying to build such a scheme for online age verification (I'm not sure if their scheme also extends to point 3 though. Probably?).
But I don't get how it goes for spam or scraping: if I can pass the test "anonymously", then what prevents me from doing it for illegal purposes?
I get it for age verification: it is difficult for a child to get a token that says they are allowed to access porn because adults around them don't want them to access porn (and even though one could sell tokens online, it effectively makes it harder to access porn as a child).
But how does it prevent someone from using their ID to get tokens for their scraper? If it's anonymous, then there is no risk in doing it, is there?
IIRC, you could use asymmetric cryptography to derive a site-specific pseudonymous token from the service and your government ID without the service knowing what your government ID is or the government provider knowing what service you are using.
The service then links the token to your account and uses ordinary detection measures to see if you're spamming, flooding, phishing, whatever. If you do, the token gets blacklisted and you can no longer sign on to that service.
This isn't foolproof - you could still bribe random people on the street to be men/mules in the middle and do your flooding through them - but it's much harder than just spinning up ten thousand bots on a residential proxy.
But that does not really answer my question: if a human can prove that they are human anonymously (by getting an anonymous token), what prevents them from passing that token to an AI?
The whole point is to prevent a robot from accessing the API. If you want to detect the robot based on its activity, you don't need to bother humans with the token in the first place: just monitor the activity.
It does not prevent a bot from using your ID. But a) the repercussions for getting caught are much more tangible when you can't hide behind anonymity - you risk getting blanket banned from the internet and b) the scale is significantly reduced - how many people are willing to rent/sell their IDs, i.e., their right to access the internet?
Edit: ok I see the argument that the feedback mechanism could be difficult when all the website can report is "hey, you don't know me but this dude from request xyz you just authenticated fucked all my shit up". But at the end of the day, privacy preservation is an implementation detail I don't see governments guaranteeing.
> But at the end of the day, privacy preservation is an implementation detail I don't see governments guaranteeing.
Sure, I totally see how you can prevent unwanted activity by identifying the users. My question was about the privacy-preserving way. I just don't see how that would be possible.
One option I can think of is that the attesting authority might block you if you're behaving badly.
That doesn't work without the attesting authority knowing what you are doing, which would make this scheme no longer anonymous.
It does work as long as the attesting authority doesn't allow issuing a new identity (before it expires) if the old one is lost.
You (Y) generate a keypair and send your public key to the attesting authority A, and keep your private key. You get a certificate.
You visit site b.com, and it asks for your identity, so you hash b.com|yourprivatekey. You submit the hash to b.com, along with a ZKP that you possess a private key that makes the hash work out, and that the private key corresponds to the public key in the certificate, and that the certificate has a valid signature from A.
If you break the rules of b.com, b.com bans your hash. Also, they set a hard rate limit on how many requests per hash are allowed. You could technically sell your hash and proof, but a scraper would need to buy up lots of them to do scraping.
Now the downside is that if you go to A and say your private key was compromised, or you lost control of it - the answer has to be tough luck. In reality, the certificates would expire after a while, so you could get a new hash every 6 months or something (and circumvent the bans), and if you lost the key, you'd need to wait out the expiry. The alternative is a scheme where you and A share a secret key - but then they can calculate your hash and conspire with b.com to unmask you.
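A minimal sketch of just the pseudonym-derivation step described above (the zero-knowledge proof tying the hash to A's certificate is the hard part and is omitted; the use of HMAC-SHA-256 and the key size are my assumptions):

  # Sketch: deriving unlinkable, site-specific pseudonyms from one private secret.
  # The accompanying ZKP (that the pseudonym is bound to a certificate from the
  # attesting authority A) is not shown.
  import hashlib
  import hmac
  import os

  private_key = os.urandom(32)  # held by the user, never revealed to any site

  def site_pseudonym(domain: str) -> str:
      # hash(domain | private key); keyed HMAC keeps it one-way and per-site.
      return hmac.new(private_key, domain.encode(), hashlib.sha256).hexdigest()

  # b.com and c.com each see a stable identifier they can rate-limit or ban,
  # but cannot link it to each other or to the user's legal identity.
  print(site_pseudonym("b.com"))
  print(site_pseudonym("c.com"))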
Isn't the whole point of a privacy-preserving scheme be that you can ask many "certificates" to the attesting authority and it won't care (because you may need as many as the number of websites you visit), and the website b.com won't be able to link you to them, and therefore if it bans certificate C1, you can just start using certificate C2?
And then of course, if you need millions of certificates because b.com keeps banning you, it means that they ban you based on your activity, not based on your lack of certificate. And in that case, it feels like the certificate is useless in the first place: b.com has to monitor and ban you already.
Or am I missing something?
There isn't a technical solution to this: governments and providers not only want proof of identity matching IDs, they want proof of life, too.
This will always end with live video of the person requesting to log in to provide proof of life at the very least, and if they're lazy/want more data, they'll tie in their ID verification process to their video pipeline.
You already provided proof of a living legal identity when you got the ID, and it already expires to make you provide proof again every few years.
That's not the kind of proof of life the government and companies want online. They want to make sure their video identification 1) is of a living person right now, and 2) that that living person matches their government ID.
It's a solution to the "grandma died but we've been collecting her Social Security benefits anyway", or "my son stole my wallet with my ID & credit card", or (god forbid) "We incapacitated/killed this person to access their bank account using facial ID".
It's also a solution to the problem advertisers, investors and platforms face of 1) wanting huge piles of video training data for free and 2) determining that a user truly is a monetizable human being and not a freeloader bot using stolen/sold credentials.
> That's not the kind of proof of life the government and companies want online.
Well that's your assumption about governments, but it doesn't have to be true. There are governments that don't try to exploit their people. The question is whether such governments can have technical solutions to achieve that or not (I'm genuinely interested in understanding whether or not it's technically feasible).
It's the kind of proof my government already asks of me to sign documents much, much more important than watching adult content, such as social security benefits.
Such schemes have the fatal flaw that they can be trivially abused. All you need are a couple of stolen/sold identities and bots start proving their humanness and adultness to everyone.
> Such schemes have the fatal flaw that they can be trivially abused
I wouldn't expect the abuse rate to be higher than what it is for chip-and-pin debit cards. PKI failure modes are well understood and there are mitigations galore.
Blatant automatic behavior can still be detected, and much more definitive actions can be taken in such a system.
Detecting is a thing, but how do you identify the origin if it was done in a privacy-preserving manner? The whole point was that you couldn't, right?
I did think of asymmetric cryptography, but I assumed the validators would be third parties / individual websites and therefore connections could be made using your public key. But I guess having the government itself provide the authentication service makes more sense.
I wonder if they'd actually honor 1 instead of forcing recipients to be registered, as presumably they'd be interested in tracking user activity.
How would it prevent you from renting your identity out to a bot farm?
Besides making yourself party to a criminal conspiracy, I suspect it would be partly the same reason you won't sell/rent your real-world identity to other people today; an illegal immigrant may be willing to rent it from you right now.
Mostly, it will be because online identities will be a market for lemons: there will be so many fake/expired/revoked identities being sold that the value of each one will be worth pennies, and that's not commensurate with the risk of someone committing crimes and linking it to your government-registered identity.
> the same reason you won't sell/rent your real-world identity to other people today
If you sell your real-world identity to other people today, and they get arrested, then the police will know your identity (obviously). How does that work with a privacy-preserving scheme? If you sell your anonymous token that says that you are a human to a machine and the machine gets arrested, then the police won't be able to know who you are, right? That was the whole point of the privacy-preserving token.
I'm genuinely interested, I don't understand how it can work technically and be privacy-preserving.
It would appear most of the people commenting on the subject don't even understand it.
With privacy preserving cryptography the tokens are standalone and have no ties to the identity that spawned them.
No enforcement for abuse is possible.
> With privacy preserving cryptography the tokens are standalone and have no ties to the identity that spawned them.
I suspect there will be different levels of attestations, from the anonymous ("this is an adult"), to semi-anonymous ("this person was born in 20YY and resides in administrative region XYZ"), to the complete record ("This is John Quincy Smith III born on YYYY-MM-DD with ID doc number ABC123"). Somewhere in between the extremes is a pseudonymous token that's strongly tied to a single identity with non-repudiation.
Anonymous identities that can be easily churned out on demand by end-users have zero antibot utility
The latter attestation will be completely useless for privacy.
Right, that's my feeling as well
While it's the privacy advocate's ideal, the political reality is that very few governments will deploy "privacy preserving" cryptography that gets in the way of LE investigations[1]. The best you can hope for is some escrowed service that requires a warrant to unmask the identity for any given token, so privacy is preserved in most cases, and against most parties except law enforcement when there's a valid warrant.
1. They can do it overtly in the design of the system, or covertly via side-channels, logging, or leaking bits in ways that are hard for an outsider to investigate without access to the complete source code and/or system outputs, such as not-quite-random pseudo-randoms.
> Mostly, it will be because online identities will be a market for lemons: there will be so many fake/expired/revoked identities being sold that the value of each one will be worth pennies, and that's not commensurate with the risk of someone committing crimes and linking it to your government-registered identity.
That would be trivially solved by using the same verification mechanisms they would be used with.
You are right about the negative outcomes that this might have but you have way too much faith in the average person caring enough before it happens to them.
I live with the naïve and optimistic dream that something like that would just show that everyone was in the list so they can't use it to discriminate against people.
> sells its access data
or has it leaked somehow.
The eyeball company's play is to be a general identity provider, which is an obvious move for anyone who tries to fill this gap. You can already connect your passport in the World app.
https://world.org/blog/announcements/new-world-id-passport-c...
Note: one of the founders of the World app is Sam Altman.
> some form of online ID attestation (likely based on government-issued ID[1]) will become normal in the next decade
I believe this is likely, and implemented in the right way, I think it will be a good thing.
A zero-knowledge way of attesting persistent pseudonymous identity would solve a lot of problems. If the government doesn’t know who you are attesting to, the service doesn’t know your real identity, services can’t correlate users, and a service always sees the same identity, then this is about as privacy-preserving as you can get with huge upside.
A social media site can ban an abusive user without them being able to simply register a new account. One person cannot operate tens of thousands of bot profiles. Crawlers can be banned once. Spammers can be locked out of email.
> A social media site can ban an abusive user without them being able to simply register a new account.
This is an absolutely gargantuan-sized antifeature that would single-handedly drive me out of the parts of the internet that choose to embrace this hellish tech.
I think social media platforms should have the ability to effectively ban abusive users, and I’m pretty sure that’s a mainstream viewpoint shared by most people.
The alternative is that you think people should be able to use social media platforms in ways that violate their rules, and that the platforms should not be able to refuse service to these users. I don’t think that’s a justifiable position to take, but I’m open to hearing an argument for it. Simply calling it “hellish” isn’t an argument.
And can you clarify if your position accounts for spammers? Because as far as I can see, your position is very clearly “spammers should be allowed to spam”.
>A zero-knowledge way of attesting persistent pseudonymous identity
why would a government do that though? the alternative is easier and gives it more of what it wants.
The alternative would have far less support from the public.
This has quite nasty consequences for privacy. For this reason, alternatives are desirable. I have less confidence on what such an alternative should be, however.
Can you elaborate on that? Are you implying that it is strictly impossible to do this in a privacy-preserving way?
It depends on your precise requirements and assumptions.
Does your definition of 'privacy-preserving' distrust Google, Apple, Xiaomi, HTC, Honor, Samsung and suchlike?
Do you also distrust third-party clowns like Experian and Equifax (whose current systems have gaping security holes) and distrust large government IT projects (which are outsourced to clowns like Fujitsu who don't know what they're doing)?
Do you require it to work on all devices, including outdated phones and tablets; PCs; Linux-only devices; other networked devices like smart lightbulbs; and so on? Does it have to work in places phones aren't allowed, or mobile data/bluetooth isn't available? Does the identity card have to be as thin, flexible, durable and cheap as a credit card, precluding any built-in fingerprint sensors and suchlike?
Does the age validation have to protect against an 18-year-old passing the age check on their 16-year-old friend's account? While also being privacy-preserving enough nobody can tell the two accounts were approved with the same ID card?
Does the system also have to work on websites without user accounts, because who the hell creates a pornhub account anyway?
Does the system need to work without the government approving individual websites' access to the system? Does it also need to be support proving things like name, nationality, and right to work in the country so people can apply for bank accounts and jobs online? And yet does it need to prevent sites from requiring names just for ad targeting purposes?
Do all approvals have to be provable, so every company can prove to the government that the checks were properly carried out at the right time? Does it have to be possible to revoke cards in a timely manner, but without maintaining a huge list of revoked cards, and without every visit to a porn site triggering a call to a government server for a revocation check?
If you want to accomplish all of these goals - you're going to have a tough time.
Not sure what you are trying to say.
I can easily imagine having a way to prove my age in a privacy-preserving way: a trusted party knows that I am 18+ and gives me a token that proves that I am 18+ without divulging anything else. I take that token and pass it to the website that requires me to be 18+. The website knows nothing about me other than I have a token that says I am 18+.
Of course, I can get a token and then give it to a child. Just like I can buy cigarettes and give them to a child. But the age verification helps in that I don't want children to access cigarettes, so I won't do it.
The "you are a human" verification fundamentally doesn't work, because the humans who make the bots are not aligned with the objective of the verification. If it's privacy-preserving, it means that a human can get a token, feed it to their bot and call it a day. And nobody will know who gave the token to the bot, precisely because it is privacy-preserving.
I am not implying anything and mean only what I directly said.
More specifically, I do not know if a privacy preserving method exists. This is different from thinking that it doesn't exist.
While the question of "is it actually possible to do this in a privacy preserving way?" is certainly interesting, was there ever a _single_ occasion where a government had the option of doing something in a privacy preserving way, when a non-privacy preserving way was also possible? Politicians would absolutely kill for the idea of unmasking dissenters on internet forums. Even if the option is a possibility, they are deliberately not going to implement it.
> was there ever a _single_ occasion
I don't know where you live, but in my case, many. Beginning with the fact that I can buy groceries with cash.
The example does not fit: when cash was introduced, electronic money transfer was not an option.
Health insurance being digitalised and encrypted on the insurance card in a decentralised way?
Many e-IDs in many countries?
I didn't know about e-IDs in other countries, but in Scandinavia (at least in Norway and Sweden, but I know the same system is used in Denmark as well) they are very much tied to your personal number which uniquely identifies you. Healthcare data is also not encrypted.
Well the e-ID is an ID, so to the government it's tied to a person. But I know that in multiple countries it's possible to use the e-ID to only share the information necessary with the receiver in a way that the government cannot track. Typically, share only the fact that you are 18+ without sharing your name or birthday, and without the government being able to track where you shared that fact.
This is privacy-preserving and modern.
Fun fact: The Norwegian wine monopoly is rolling out exactly this to prevent scalpers buying up new releases. Each online release will require a signup in advance with a verified account.
Eh? With the "anonymous" models that we're pushing for right now, nothing stops you from handing over your verification token (or the control of your browser) to a robot for a fee. The token issued by the verifier just says "yep, that's an adult human", not "this is John Doe, living at 123 Main St, Somewhere, USA". If it's burned, you can get a new one.
If we move to a model where the token is permanently tied to your identity, there might be an incentive for you not to risk your token being added to a blocklist. But there's no shortage of people who need a bit of extra cash and for whom it's not a bad trade. So there will be a nearly-endless supply of "burner" tokens for use by trolls, scammers, evil crawlers, etc.
If it's illegal that person could face legal consequences
They... stole it from me?
At this future point, AI firms will simply rent people’s identities to use online.
They are already getting people hooked on "free" access so they will have plenty of subjects willing to do that to keep that access.
And if they are as successful as they are threatening to be, they will have destroyed so many jobs that I am sure they will find a few thousand people across the world who will accept a stipend to loan their essence to the machine.
Can't wait to start my stolen id as a service for the botnets
It will be hard to tune them to be just the right level of ignorant and slow as us though!
Soon enough there will be competing Unicode characters that can remove exclamation points.
Maybe there will be a way to certify humanness. Human testing facility could be a local office you walk over to get your “I am a human” hardware key. Maybe it expires after a week or so to ensure that you are still alive.
But if that hardware key is privacy-preserving (i.e. websites don't get your identity when you use it), what prevents you from using it for your illegal activity? Scrapers and spam are built by humans, who could get such a hardware key.
You’d at least be limited to deploying a single verified scraper which might be too slow for people to bother with.
Not even: the government is supposed to provide you with more than one token (how would you verify yourself as a human to more than one website otherwise?)
The idea would be a connection requires the key so you could both verify at more than one website and be limited to one instance per website.
If you use the same token on more than one website, it's not privacy-preserving anymore.
Sure.
It might be a tool in the box. But it’s still cat and mouse.
In my place we quickly concluded the scrapers have tons of compute and the “proof-of-work” aspect was meaningless to them. It’s simply the “response from site changed, need to change our scraping code” aspect that helps.
>But it takes a human some time and work to tell the crawler HOW.
Yes, for these human-based challenges. But this challenge is defined in code. It's not like crawlers don't run JavaScript. It's 2025, they all use headless browsers, not curl.
> The point is that they hadn't, and this worked for quite a while.
That's what I was hoping to get from the "Numbers" section.
I generally don't look up the logs or numbers on my tiny, personal web spaces hosted on my server, and I imagine I could, at some point, become the victim of aggressive crawling (or maybe I have without noticing because I've got an oversized server on a dual link connection).
But the numbers actually only show the performance of doing the PoW, not the effect it has had on any site — I am just curious, and I'd love it if someone has done the analysis, ideally grouped by the bot type ("OpenAI bot was responsible for 17% of all requests, this got reduced from 900k requests a day to 0 a day"...). Search, unfortunately, only gives me all the "Anubis is helping fight aggressive crawling" blog articles, nothing with substance (I haven't tried hard, I admit).
Edit: from further down the thread there's https://dukespace.lib.duke.edu/server/api/core/bitstreams/81... but no analysis of how many real customers were denied — more data would be even better
The cost benefit calculus for workarounds changes based on popularity. Your custom lock might be easy to break by a professional, but the handful of people who might ever care to pick it are unlikely to be trying that hard. A lock which lets you into 5% of houses however might be worth learning to break.
If you are going to rely on security through obscurity there are plenty of ways to do that that won't block actual humans because they dare use a non-mainstream browser. You can also do it without displaying cringeworthy art that is only there to get people to pay for the DRM solution you are peddling - that shit has no place in the open source ecosystem.
On the contrary: Making things look silly and unprofessional so that Big Serious Corporations With Money will pay thousands of dollars to whitelabel them is an OUTSTANDING solution for preserving software freedom while raising money for hardworking developers.
I'd rather not raise money for "hardworking" developers if their work is spreading DRM on the web. And it's not just "Big Serious Corporations" that don't want to see your furry art.
I'm not commenting on the value of this project (I wouldn't characterize captchas as DRM, but I see why you have that negative connotation) and I tend to agree with the OP that this is simply wasting energy, but the amount of seething over "anime catgirls" makes me want to write all the docs for my next projects in UwU text and charge for a whimsy-free version. (o˘◡˘o)
Please do, it's better if people make their negative personality traits public so that you can avoid them before wasting your time. It will also be useful to show your hypocrisy when you inevitably complain about someone else doing something that you don't like.
I don't think you need to try to die on this hill (primarily remarking w.r.t. your lumping in Anubis with Cloudflare/Google/et al. as one). In any case, I'm not appreciating the proliferation of the CAPTCHA-wall any more than you are.
The mascot artist wrote in here in another thread about the design philosophies, and they are IMO a lot more honorable in comparison (to BigCo).
Besides, it's MIT FOSS. Can't a site operator shoehorn in their own image if they were so inclined?
i love this thread because the Serious Business Man doesn't realize that purposeful unprofessionalism like anime art, silly uwu :3 catgirls, writing with no capitalization are done specifically to be unpalatable to Serious Business Man—threatening to not interact with people like that is the funniest thing.
negative signaling works!
Acting obnoxiously to piss people off makes you seem like an inexperienced teenager and distances more than "Serious Business Man".
I look forward to this being taken to the logical extreme when a niche subculture of internet nerds changes their entire online persona to revolve around scat pornography to spite "the normals". I'm sure they'll be remembered fondly as witty and intelligent and not at all as mentally ill young people.
Sounds like a similar idea to what the "plus N-word license" is trying to accomplish
I deployed a proof of work based auth system once where every single request required hashing a new nonce. Compare with Anubis where only one request a week requires it. The math said doing it that frequently, and with variable argon params the server could tune if it suspected bots, would be impactful enough to deter bots.
Would I do that again? Probably not. These days I’d require a weekly mDL or equivalent credential presentation.
I have to disagree that an anti-bot measure that only works globally for a few weeks until bots trivially bypass it is effective. In an arms race against bots the bots win. You have to outsmart them by challenging them to do something that only a human can do or is actually prohibitively expensive for bots to do at scale. Anubis doesn't pass that test. And now it’s littered everywhere defunct and useless.
> As the botmakers circumvent, new methods of proof-of-notbot will be made available.
Yes, but the fundamental problem is that the AI crawler does the same amount of work as a legitimate user, not more.
So if you design the work such that it takes five seconds on a five-year-old smartphone, it could inconvenience a large portion of your user base. But once that scheme is understood by the crawler, it will delay the start of their aggressive crawling by... well under five seconds.
An open source javascript challenge as a crawler blocker may work until it gets large enough for crawlers to care, but then they just have an engineer subscribe to changes on GitHub and have new challenge algorithms implemented before the majority of the deployment base migrates.
With all the SPAs out there, if you want to crawl the entire Web, you need a headless browser running JavaScript. Which will pass Anubis for free.
Weren't there also weird behaviors reported by webadmins across the world, like crawlers used by LLM companies fetching evergreen data ad nauseam, or something along those lines? I thought the point of adding PoW rather than just blocking them was to convince them to at least do it right.
Every time we need to deploy such mechanisms, we reward those that have already crawled the data and penalize newcomers and other honest crawlers.
For some sites Anubis might be fitting, but it should be mindfully deployed.
Many sufficiently technical people take to heart:
- Everything is pwned
- Security through obscurity is bad
Without taking to heart:
- What a threat model is
And settle on a kind of permanent contrarian nihilist doomerism.
Why eat greens? You'll die one day anyway.
No, it’s exactly because I understand it that it bothers me. I understand it will be effective against bots for a few months at best, and legitimate human users will be stuck dealing with the damn thing for years to come. Just like captchas.
Did you read the article? OP doesn't care about bots figuring it out. It's about the compute needed to do the work.
It's quite an interesting piece, I feel like you projected something completely different onto it.
Your point is valid, but completely adjacent.
You don't even need to go there. If the damn thing didn't work the site admin wouldn't have added it and kept it.
Sure the program itself is jank in multiple ways but it solves the problem well enough.
As I understand it, this is Proof of Work, which is strictly not a cat-and-mouse situation.
It is because you are dealing with crawlers that already have a nontrivial cost per page: adding something relatively trivial that is still within the bounds regular users accept won't change the motivations of bad actors at all.
What is the existing cost per page? As far as I know, an HTTP request and some string parsing are somewhat trivial; say 14 kB of bandwidth per page?
On a side note, is the anime girl image customizable? I did a quick Google search and it seems that only the commercial version offers rebranding.
It's free software. The paid version includes an option to change it, and they ask politely that you don't change it otherwise.
Technical people are prone to black-and-white thinking, which makes it hard to understand that making something more difficult will cause people to do it less even though it’s still possible.
I think the argument on offer is more, this juice isn't worth the squeeze. Each user is being slowed down and annoyed for something that bots will trivially bypass if they become aware of it.
If they become aware of it and actually think it’s worthwhile. Malicious bots work by scaling, and implementing special cases for every random web site doesn’t scale. And it’s likely they never even notice.
If this kind of security by not being noticed is the plan, why not just have a trivial (but unique) captcha that asks the user to click a button with no battery wasting computation?
Because you can't sell that as a commercial solution that the open source software ecosystem provides free advertising to.
That works too, but not quite as well so it decreases the unwanted activity somewhat less.
Respectfully, I think it's you missing the point here. None of this is to say you shouldn't use Anubis, but Tavis Ormandy is offering a computer science critique of how it purports to function. You don't have to care about computer science in this instance! But you can't dismiss it because it's computer science.
Consider:
An adaptive password hash like bcrypt or Argon2 uses a work function to apply asymmetric costs to adversaries (attackers who don't know the real password). Both users and attackers have to apply the work function, but the user gets ~constant value for it (they know the password, so to a first approx. they only have to call it once). Attackers have to iterate the function, potentially indefinitely, in the limit obtaining 0 reward for infinite cost.
A blockchain cryptocurrency uses a work function principally as a synchronization mechanism. The work function itself doesn't have a meaningfully separate adversary. Everyone obtains the same value (the expected value of attempting to solve the next round of the block commitment puzzle) for each application of the work function. And note in this scenario most of the value returned from the work function goes to a small, centralized group of highly-capitalized specialists.
A proof-of-work-based antiabuse system wants to function the way a password hash functions. You want to define an adversary and then find a way to incur asymmetric costs on them, so that the adversary gets minimal value compared to legitimate users.
And this is in fact how proof-of-work-based antispam systems function: the value of sending a single spam message is so low that the EV of applying the work function is negative.
But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.
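To put toy numbers on that (all figures below are made-up assumptions, not measurements from Anubis or any scraper):

    // Toy model of a symmetric per-request proof of work (illustrative numbers only).
    const powSeconds = 1;              // CPU-seconds each challenge costs, same for everyone
    const dollarsPerCpuSecond = 1e-5;  // rough assumption for cheap cloud compute

    // A scraper pays once per site (or per identity) and then crawls for a week.
    const sitesCrawled = 10000;
    const scraperDollars = sitesCrawled * powSeconds * dollarsPerCpuSecond;

    // A human pays the identical CPU cost, but experiences it as first-load latency.
    const humanExtraSeconds = powSeconds;

    console.log(`scraper: ~$${scraperDollars.toFixed(2)} for ${sitesCrawled} sites`);
    console.log(`human: +${humanExtraSeconds}s on the first page load`);
    // Same cost per application for both sides, so the cost:value ratio doesn't move.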
There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.
This is also how the Blu-Ray BD+ system worked.
The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).
The problem with "this is good because none of the scrapers even bother to do this POW yet" is that you don't need an annoying POW to get that value! You could just write a mildly complicated Javascript function, or do an automated captcha.
A lot of these passive types of anti-abuse systems rely on the rather bold assumption that making a bot perform a computation is expensive for the bot, but isn't for me as an ordinary user.
According to whom or what data exactly?
AI operators are clearly well-funded operations, and the amount of electricity and CPU power involved is negligible. Software like Anubis and nearly all its identical predecessors grant you access after a single "proof". So you then have free rein to scrape the whole site.
The best physical analogy is those shopping carts where you have to insert a quarter to unlock the cart, and you presumably get it back when you return the cart.
The group of people this doesn't affect are the well-funded: a quarter is a small price to pay for leaving your cart in the middle of the parking lot.
Those who suffer the most are the ones who can't find a quarter in the cupholder and are stuck filling their arms with groceries.
Would you be richer if they didn't charge you a quarter? (For these anti-bot tools you're paying the electric company, not the site owner.) Maybe. But if you're Scrooge McDuck, who is counting?
Right, that's the point of the article. If you can tune asymmetric costs on bots/scrapers, it doesn't matter: you can drive bot costs to infinity without doing so for users. But if everyone's on a level playing field, POW is problematic.
I like your example because quarters for shopping carts are not universal everywhere. Some societies have either accepted shopping cart shrinkage as an acceptable cost of doing business or have found better ways to deter it.
Scrapers are orders of magnitude faster than humans at browsing websites. If the challenge takes 1 second but a human stays on the page for 3 minutes, it's negligible. But if the challenge takes 1 second and the scraper does its job in 5 seconds, you already have a 20% slowdown.
By that logic you could just make your website in general load slower to make scraping harder.
No, because in this case there are cookies involved. If the scraper accepts cookies then it's trivial to detect it and block it. If it doesn't, it will have to solve the challenge every single time.
Scrapers do not care about a 20% slowdown. All they care about is being able to scale up. This does not block any scale-up attempt.
For what it's worth, kernel.org seems to be running an old version of Anubis that predates the current challenge generation method. Previously it took information about the user request, hashed it, and then relied on that being idempotent to avoid having to store state. This didn't scale and was prone to issues like in the OP.
The modern version of Anubis as of PR https://github.com/TecharoHQ/anubis/pull/749 uses a different flow. Minting a challenge generates state including 64 bytes of random data. This random data is sent to the client and used on the server side in order to validate challenge solutions.
The core problem here is that kernel.org isn't upgrading their version of Anubis as it's released. I suspect this means they're also vulnerable to GHSA-jhjj-2g64-px7c.
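For readers following along, here is a rough sketch of the difference between the two flows (hypothetical names; this is not the actual Anubis code, just an illustration of stateless-vs-stateful challenges):

    const crypto = require('crypto');

    // Old-style flow (roughly): derive the challenge from request metadata only,
    // so the server stores nothing and the same request always yields the same challenge.
    function legacyChallenge(ip, userAgent, weekNumber) {
      return crypto.createHash('sha256')
        .update(`${ip}|${userAgent}|${weekNumber}`)
        .digest('hex');
    }

    // New-style flow (roughly): mint per-challenge random state, keep it server-side,
    // and validate submitted solutions against that stored state.
    const pending = new Map(); // challengeId -> random data
    function mintChallenge() {
      const id = crypto.randomUUID();
      const randomData = crypto.randomBytes(64).toString('hex');
      pending.set(id, randomData);
      return { id, randomData, difficulty: 16 };
    }
    function validate(id, nonce) {
      const randomData = pending.get(id);
      if (!randomData) return false;
      const digest = crypto.createHash('sha256').update(randomData + nonce).digest();
      return digest[0] === 0 && digest[1] === 0; // difficulty 16: first 16 bits zero
    }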
OP is a real human user trying to make your DRM work with their system. That you consider this to be an "issue" that should be fixed says a lot.
Right, I get that. I'm just saying that over the long term, you're going to have to find asymmetric costs to apply to scrapers, or it's not going to work. I'm not criticizing any specific implementation detail of your current system. It's good to have a place to take it!
I think that's the valuable observation in this post. Tavis can tell me I'm wrong. :)
> But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.
Based on my own experience fighting these AI scrapers, I feel that the way they are actually implemented means that in practice there is asymmetry in the work scrapers have to do vs. humans.
The pattern these scrapers follow is that they are highly distributed. I’ll see a given {ip, UA} pair make a request to /foo, immediately followed by _hundreds_ of requests from completely different {ip, UA} pairs to all the links from that page (i.e. /foo/a, /foo/b, /foo/c, etc.).
This is a big part of what makes these AI crawlers such a challenge for us admins. There isn’t a whole lot we can do to apply regular rate-limiting techniques: the IPs are always changing and are no longer limited to corporate ASNs (I’m now seeing IPs belonging to consumer ISPs and even cell phone companies), and the user agents all look genuine. But when looking through the logs you can see the pattern that all these unrelated requests are actually working together to perform a BFS traversal of your site.
Given this pattern, I believe that’s what makes the Anubis approach actually work well in practice. A given user will encounter the challenge once when accessing the site for the first time, then be able to navigate through it without incurring any cost, while the AI scrapers would need to solve the challenge for every single one of their “nodes” (or whatever it is they would call their {ip, UA} pairs). From a site reliability perspective, I don’t even care if the crawlers manage to solve the challenge or not. That it manages to slow them down enough to rate limit them as a network is enough.
To be clear: I don’t disagree with you that the cost incurred by regular human users is still high. But I don’t think it’s fair to say that the cost to the adversary isn’t asymmetrical here. It wouldn’t be if the AI crawlers hadn’t converged on an implementation that behaves like a DDoS botnet.
> The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).
No, that's missing the point. Anubis is effectively a DDoS protection system, all the talking about AI bots comes from the fact that the latest wave of DDoS attacks was initiated by AI scrapers, whether intentionally or not.
If these bots would clone git repos instead of unleashing the hordes of dumbest bots on Earth pretending to be thousands and thousands of users browsing through git blame web UI, there would be no need for Anubis.
I'm not moralizing, I'm talking about whether it can work. If it's your site, you don't need to justify putting anything in front of it.
Did you accidentally reply to a wrong comment? (not trying to be snarky, just confused)
The only "justification" there would be is that it keeps the server online that struggled under load before deploying it. That's the whole reason why major FLOSS projects and code forges have deployed Anubis. Nobody cares about bots downloading FLOSS code or kernel mailing lists archives; they care about keeping their infrastructure running and whether it's being DDoSed or not.
I just said you didn't have to justify it. I don't care why you run it. Run whatever you want. The point of the post is that regardless of your reasons for running it, it's unlikely to work in the long run.
And what I said is that all these most visible deployments of Anubis did not deploy it to be a content protection system of any kind, so it doesn't have to work this way at all for them. As long as the server doesn't struggle with load anymore after deploying Anubis, it's a win - and it works so far.
(and frankly, it likely will only need to work until the bubble bursts, making "the long run" irrelevant)
> and frankly, it likely will only need to work until the bubble bursts, making "the long run" irrelevant
Now I get why people are so weirdly being dismissive about the whole thing. Good luck, it's not going to "burst" any time soon.
Or rather, a "burst" would not change the world in the direction you want it to be.
Not exactly sure what you're talking about. The problem is caused by tons of shitty companies cutting corners to collect training data as fast as possible, fueled by easy money that you get by putting "AI" somewhere in your company's name.
As soon as the investment boom is over, this will be largely gone. LLMs will continue to be trained and data will continue to be scraped, but that alone isn't the problem. Search engine crawlers somehow manage not to DDoS the servers they pull the data from, competent AI scrapers can do the same. In fact, a competent AI scraper wouldn't even be stopped by Anubis as it is right now at all, and yet Anubis works pretty well in practice. Go figure.
> There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.
That depends on what you count as normal users, though. Users who want to use alternative players also have to deal with this, and since yt-dlp (and youtube-dl before it) have been able to provide a solution for those users, and bots can just do the same, I'm not sure I'd call the scheme successful in any way.
The (almost only?) distinguishing factor between genuine users and bots is the total volume of requests, but this can still be used for asymmetric costs. If botPain > botPainThreshold and humanPain < humanPainThreshold then Anubis is working as intended. A key point is that those inequalities look different at the next level of detail. A very rough model might be:
botPain = nBotRequests * cpuWorkPerRequest * dollarsPerCpuSecond
humanPain = c_1 * max(elapsedTimePerRequest) + c_2 * avg(elapsedTimePerRequest)
The article points out that the botPain Anubis currently generates is unfortunately much too low to hit any realistic threshold. But if the cost model I've suggested above is in any way realistic, then useful improvements would include:
1. More frequent but less taxing computation demands (this assumes c_1 >> c_2)
2. Parallel computation (this improves the human experience with no effect for bots)
ETA: Concretely, regarding (1), I would tolerate 500ms lag on every page load (meaning forget about the 7-day cookie), and wouldn't notice 250ms.
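A sketch of that rough model in code, with every constant invented for illustration:

    // Rough cost model from above; all numbers are made-up assumptions.
    const dollarsPerCpuSecond = 1e-5;

    function botPain(nBotRequests, cpuWorkPerRequest) {
      return nBotRequests * cpuWorkPerRequest * dollarsPerCpuSecond;
    }

    // c1 >> c2: worst-case lag annoys humans far more than average lag.
    function humanPain(elapsedSecondsPerRequest, c1 = 10, c2 = 1) {
      const worst = Math.max(...elapsedSecondsPerRequest);
      const avg = elapsedSecondsPerRequest.reduce((a, b) => a + b, 0)
                / elapsedSecondsPerRequest.length;
      return c1 * worst + c2 * avg;
    }

    // One 5s challenge per week vs. 250ms on every page load:
    console.log(humanPain([5, 0, 0, 0, 0]));           // ~51: dominated by the 5s spike
    console.log(humanPain([0.25, 0.25, 0.25, 0.25]));  // ~2.75: barely noticeable
    // A bot making millions of requests only cares about total CPU (what botPain
    // measures), so the same tweak can hit it much harder than it hits humans.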
That's exactly what I'm saying isn't happening: the user pays some cost C per article, and the bot pays exactly the same cost C. Both obtain the same reward. That's not how Hashcash works.
I'm saying your notion of "the same cost" is off. They pay the same total CPU cost, but that isn't the actual perceived cost in each case.
Can you flesh that out more? In the case of AI scrapers it seems especially clear: the model companies just want tokens, and are paying a (one-time) cost of C for N tokens.
Again, with Hashcash, this isn't how it works: most outbound spam messages are worthless. The point of the system is to exploit the negative exponent on the attacker's value function.
The scraper breaking every time a new version of Anubis is deployed, until new anti-Anubis features are implemented, is the point; if the scrapers were well-engineered by a team that cared about the individual sites they're scraping, they probably wouldn't be so pathological towards forges.
The human-labor cost of working around Anubis is unlikely to be paid unless it affects enough data to be worth dedicating time to, and the data they're trying to scrape can typically be obtained "respectfully" in those cases -- instead of hitting the git blame route on every file of every commit of every repo, just clone the repos and run it locally, etc.
Sure, but if that's the case, you don't need the POW, which is what bugs people about this design. I'm not objecting to the idea of anti-bot content protection on websites.
Perhaps I caused confusion by writing "If botPain > botPainThreshold and humanPain < humanPainThreshold then Anubis is working as intended", as I'm not actually disputing that Anubis is currently ineffective against bots. (The article makes that point and I agree with it.) I'm arguing against what I take to be your stronger claim, namely that no "Anubis-like" countermeasure (meaning no countermeasure that charges each request the same amount of CPU in expectation) can work.
I claim that the cost for the two classes of user are meaningfully different: bots care exclusively about the total CPU usage, while humans care about some subjective combination of average and worst-case elapsed times on page loads. Because the sheer number of requests done by bots is so much higher, there's an opportunity to hurt them disproportionately according to their cost model by tweaking Anubis to increase the frequency of checks but decrease each check's elapsed time below the threshold of human annoyance.
The fundamental failure of this is that you can’t publish data to the web and not publish data to the web. If you make things public, the public will use it.
It’s ineffective. (And furry sex-subculture propaganda pushed by its author, which is out of place in such software.)
The misguided parenthetical aside, this is not about resources being public, this is about bad actors accessing those resources in a highly inefficient and resource-intensive manner, effectively DDOS-ing the source.
>And furry sex-subculture propaganda pushed by its author
if your first thought when seeing a catgirl is sex, i got bad news for you
Also, it forces the crawler to gain code execution capabilities, which for many companies will just make them give up and scrape someone else.
I don't know if you've noticed, but there's a few websites these days that use javascript as part of their display logic.
Yes, and those sites take way more effort to crawl than other sites. They may still get crawled, but likely less often than the ones that don't use JavaScript for rendering (which is the main purpose of Anubis - saving bandwidth from crawlers who crawl sites way too often).
(Also, note the difference between using JavaScript for display logic and requiring JavaScript to load any content at all. Most websites do the first, the second isn't quite as common.)
TFA — and most comments here — seem to completely miss what I thought was the main point of Anubis: it counters the crawler's "identity scattering"/sybil'ing/parallel crawling.
Any access will fall into either of the following categories:
- client with JS and cookies. In this case the server now has an identity to apply rate limiting to, from the cookie. Humans should never hit it, but crawlers will be slowed down immensely or ejected. Of course the identity can be rotated — at the cost of solving the puzzle again.
- amnesiac (no cookies) clients with JS. Each access is now expensive.
(- no JS - no access.)
The point is to prevent parallel crawling and overloading the server. Crawlers can still start an arbitrary number of parallel crawls, but each one costs to start and needs to stay below some rate limit. Previously, the server would collapse under thousands of crawler requests per second. That is what Anubis is making prohibitively expensive.
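A minimal sketch of the rate limiting that such an identity makes possible (hypothetical values; not how Anubis itself is wired up):

    // Token bucket keyed on the PoW cookie: minting a new identity is expensive,
    // so a crawler can't rotate identities to dodge the limit for free.
    const buckets = new Map(); // cookieId -> { tokens, last }
    const REFILL_PER_SECOND = 2;
    const BURST = 20;

    function allowRequest(cookieId) {
      const now = Date.now() / 1000;
      const bucket = buckets.get(cookieId) ?? { tokens: BURST, last: now };
      bucket.tokens = Math.min(BURST, bucket.tokens + (now - bucket.last) * REFILL_PER_SECOND);
      bucket.last = now;
      const allowed = bucket.tokens >= 1;
      if (allowed) bucket.tokens -= 1;   // otherwise respond with 429 or a fresh challenge
      buckets.set(cookieId, bucket);
      return allowed;
    }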
Yes, I think you're right. The commentary about its (presumed, imagined) effectiveness is very much making the assumption that it's designed to be an impenetrable wall[0] -- i.e. prevent bots from accessing the content entirely.
I think TFA is generally quite good and has something of a good point about the economics of the situation, but finding the math shake out that way should, perhaps, lead one to question their starting point / assumptions[1].
In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?
[0] Mentioning 'impenetrable wall' is probably setting off alarm bells, because of course that would be a bad design.
[1] (Edited to add:) I should say 'to question their assumptions more' -- like I said, the article is quite good and it does present this as confusing, at least.
> In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?
I agree, but the advertising is the whole issue. "Checking to see you're not a bot!" and all that.
Therefore some people using Anubis expect it to be an impenetrable wall, to "block AI scrapers", especially those that believe it's a way for them to be excluded from training data.
It's why just a few days ago there was a HN frontpage post of someone complaining that "AI scrapers have learnt to get past Anubis".
But that is a fight that one will never win (analog hole as the nuclear option).
If it said something like "Wait 5 seconds, our servers are busy!", I would think that people's expectations will be more accurate.
As a robot I'm really not that sympathetic to anti-bot language backfiring on humans. I have to look away every time it comes up on my screen. If they changed their language and advertising, I'll be more sympathetic -- it's not as if I disagree that overloading servers for not much benefit is bad!
You don't necessarily need JS, you just need something that can detect if Anubis is used and complete the challenge.
Sure, doesn't change anything though; you still need to spend energy on a bunch of hash calculations.
But then you rate limit that challenge.
You could set up a system for parallelizing the creation of these Anubis PoW cookies independent of the crawling logic. That would probably work, but it's a pretty heavy lift compared to 'just run a browser with JavaScript'.
Well maybe, but even then, how many parallel crawls are you going to do per site? 100 maybe? You can still get enough keys to do that for all sites in just a few hours per week.
This is a good point, presuming the rate limiting is actually applied.
I'm a scraper developer and Anubis would have worked 10 - 20 years ago, but now all broad scrapers run on a real headless browser with full cookie support and costs relatively nothing in compute. I'd be surprised if LLM bots would use anything else given the fact that they have all of this compute and engineers already available.
That being said, one point is very correct here - by far the best effort to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times" because handling something custom is very difficult at broad scale. It took the author just a few minutes to solve this but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation, which is likely just not worth it.
You can actually see this in real life if you google web scraping services and which targets they claim to bypass: all of them bypass generic anti-bots like Cloudflare, Akamai, etc., but struggle with custom and rare stuff like Chinese websites or small forums, because the scraping market is a market like any other and high-value problems are solved first. So becoming a low-value problem is a very easy way to avoid confrontation.
> That being said, one point is very correct here - by far the best effort to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times" because handling something custom is very difficult at broad scale.
Isn't this what Microsoft is trying to do with their sliding puzzle piece and choose the closest match type systems?
Also, if you come in on a mobile browser it could ask you to lay your phone flat and then shake it up and down for a second or something similar that would be a challenge for a datacenter bot pretending to be a phone.
How do you bypass Cloudflare? I do some light scraping for some personal stuff, but I can't figure out how to bypass it. Like, do you randomize IPs using several VPNs at the same time?
I usually just sit there on my phone pressing the "I am not a robot box" when it triggers.
It's still pretty hard to bypass it with open source solutions. To bypass CF you need:
- an automated browser that doesn't leak the fact it's being automated
- ability to fake the browser fingerprint (e.g. Linux is heavily penalized)
- residential or mobile proxies (for small scale your home IP is probably good enough)
- deployment environment that isn't leaked to the browser.
- realistic scrape pattern and header configuration (header order, referer, prewalk some pages with cookies etc.)
This is really hard to do at scale, but for small personal scripts you can have reasonable results with flavor-of-the-month Playwright forks on GitHub like nodriver, or dedicated tools like FlareSolverr. That said, I'd just find a web scraping API with a low entry price, drop $15 a month, and avoid this chase, because it can be really time consuming.
If you're really on a budget: most of them offer 1,000 credits for free, which will get you an average of 100 pages a month per service, and you can get 10 of them as they all mostly function the same.
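For a rough idea of what the basic ingredients from that list look like in code, here is a bare-bones sketch using Playwright's Node API. The proxy address and UA string are placeholders, and by itself this does nothing about fingerprint or deployment-environment leaks; it only covers the basics.

    // Sketch only: real browser, proxy, realistic context, human-ish pacing.
    const { chromium } = require('playwright');

    (async () => {
      const browser = await chromium.launch({
        headless: true,
        proxy: { server: 'http://proxy.example:8080' }, // placeholder residential proxy
      });
      const context = await browser.newContext({
        userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...', // common desktop UA
        locale: 'en-US',
        viewport: { width: 1366, height: 768 },
      });
      const page = await context.newPage();
      await page.goto('https://example.com/', { waitUntil: 'networkidle' });
      await page.waitForTimeout(2000); // pace requests like a human would
      console.log(await page.title());
      await browser.close();
    })();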
I use Camoufox for the browser and "playwright-captcha" for the CAPTCHA solving action. It's not fully reliable but it works.
I believe usually you would bypass by using residential ips / proxies?
I run it through my home network and I'm still triggering it. I add 2s delays between page loads and it still triggers.
Well, if that's true... I am so sorry to tell you this, it looks like you are in fact a robot.
Flaresolverr can bypass it.
Ironically, by running Cloudflare WARP.
This only works if you're a low-value site (which admittedly most sites are).
Bot blocking through obscurity
That's really the only option available here, right? The goal is to keep sites low friction for end users while stopping bots. Requiring an account with some moderation would stop the majority of bots, but it would add a lot of friction for your human users.
The other option is proof of work. Make clients use JS to do expensive calculations that aren’t a big deal for single clients, but get expensive at scale. Not ideal, but another tool to potentially use.
I like it, make the bot developers play whack-a-mole.
Of course, you're going to have to verify each custom puzzle, aren't you?
> It took the author just a few minutes to solve this but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation, which is likely just not worth it.
These are trivial for an AI agent to solve though, even with very dumb watered down models.
You can also generate custom solutions at scale with LLMs. So each user could get a different CAPTCHA.
At that point you’re probably spending more money blocking the scrapers than you would spend just letting them through.
That seems like it would make bot blocking saas (like cloudflare or tollbit) more attractive because it could amortize that effort/cost across many clients.
>This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.
>I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.
No need to mimic the actual challenge process. Just change your user agent to not have "Mozilla" in it; Anubis only serves you the challenge if it has that. For myself I just made a sideloaded browser extension to override the UA header for the handful of websites I visit that use Anubis, including those two kernel.org domains.
(Why do I do it? For most of them I don't enable JS or cookies, so the challenge wouldn't pass anyway. For the ones that I do enable JS or cookies for, various self-hosted GitLab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)
Sadly, touching the user-agent header more or less instantly makes you uniquely identifiable.
Browser fingerprinting works best against people with unique headers. There's probably millions of people using an untouched safari on iPhone. Once you touch your user-agent header, you're likely the only person in the world with that fingerprint.
If someone's out to uniquely identify your activity on the internet, your User-Agent string is going to be the least of your problems.
Not sure what you mean, as exactly this is happening currently on 99% of the web. Brought to you by: ads
If you're browsing with a browser, then there are 1000 ways to identify you. If you're browsing without a browser, then there is at least one way to identify you.
I think what they meant is: there’s already so many other ways to fingerprint (say, canvas) that a common user agent doesn’t significantly help you
'There's so many cliffs around that not jumping off that one barely helps you'.
I meeeeeannn... sure? I know that browser fingerprinting works quite well without, but custom headers are actually a game over in terms of not getting tracked.
UA fingerprinting isn't a problem for me. As I said I only modify the UA for the handful of sites that use Anubis that I visit. I trust those sites enough that them fingerprinting me is unlikely, and won't be a problem even if they did.
If your headers are new every time then it is very difficult to figure out who is who.
yes, but it puts you in the incredibly small bucket of "users that has weird headers that don't mesh well", and makes using the rest of the (many) other fingerprinting techniques all the more accurate.
> If your headers are new every time then it is very difficult to figure out who is who.
https://xkcd.com/1105/
It is very easy unless the IP address is also switching up.
It's very easy to train a model to identify anomalies like that.
While it's definitely possible to train a model for that, 'very easy' is nonsense.
Unless you've got some superintelligence hidden somewhere, you'd choose a neural net. To train, you need a large supply of LABELED data. Seems like a challenge to build that dataset; after all, we have no scalable method for classifying as of yet.
I'll set mine to "null" if the rest of you will set yours...
The string “null” or actually null? I have recently seen a huge amount of bot traffic which has no UA at all, and I just outright block it. It’s almost entirely (Microsoft cloud) Azure script attacks.
I was thinking the string "null". But if you have a better idea.
User-Agent: '; DROP TABLE blocked_bots;
Yes, but you can take the bet, and win more often than not, that your adversary is most likely not tracking visitor probabilities if you can detect that they aren't using a major fingerprinting provider.
I wouldn’t think the intention is to s/Mozilla// but to select another well-known UA string.
The string I use in my extension is "anubis is crap". I took it from a different FF extension that had been posted in a /g/ thread about Anubis, which is where I got the idea from in the first place. I don't use other people's extensions if I can help it (because of the obvious risk), but I figured I'd use the same string in my own extension so as to be combined with users of that extension for the sake of user-agent statistics.
It's a bit telling that you "don't use extensions if you can help it" but trust advice from a 4chan board
It's also a bit telling that you read the phrase "I took it from a different FF extension that had been posted" and interpreted it as taking advice instead of reading source code.
It's telling that he understands the difference between taking something he can't fully verify and taking simple hints that improve his understanding?
4chan, the worlds greatest hacker
The UA will be compared to other data points such as screen resolution, fonts, plugins, etc. which means that you are definitely more identifiable if you change just the UA vs changing your entire browser or operating system.
I don't think there are any.
Because servers would serve different content based on user agent, virtually all browsers start with Mozilla/5.0...
curl, wget, lynx, and elinks all don't by default (I checked). Mainstream web browsers likely all do, and will forever.
Anubis will let curl through, while blocking any non-mainstream browser (which will likely say "Mozilla" in its UA just for best compatibility) and call that a "bot"? WTF.
> (Why do I do it? For most of them I don't enable JS so the challenge wouldn't pass anyway. For the ones that I do enable JS for, various self-hosted gitlab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)
Hm. If your site is "sticky", can it mine Monero or something in the background?
We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"
> We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"
Doesn't Safari sort of already do that? "This tab is using significant power", or summat? I know I've seen that message, I just don't have a good repro.
Edge does, as well. It drops a warning in the middle of the screen, displays the resource-hogging tab, and asks whether you want to force-close the tab or wait.
> Just change your user agent to not have "Mozilla" in it. Anubis only serves you the challenge if you have that.
Won't that break many other things? My understanding was that basically everyone's user-agent string nowadays is packed with a full suite of standard lies.
It doesn't break the two kernel.org domains that the article is about, nor any of the others I use. At least not in a way that I noticed.
In 2025 I think most of the web has moved on from checking user strings. Your bank might still do it but they won't be running Anubis.
Nope, they're on cloudflare so that all my banking traffic can be intercepted by a foreign company I have no relation to. The web is really headed in a great direction :)
The web as a whole definitely has not moved on from that.
I'm interested in your extension. I'm wondering if I could do something similar to force text encoding of pages into Japanese.
If your Firefox supports sideloading extensions then making extensions that modify request or response headers is easy.
All the API is documented in https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web... . My Anubis extension modifies request headers using `browser.webRequest.onBeforeSendHeaders.addListener()` . Your case sounds like modifying response headers which is `browser.webRequest.onHeadersReceived.addListener()` . Either way the API is all documented there, as is the `manifest.json` that you'll need to write to register this JS code as a background script and whatever permissions you need.
Then zip the manifest and the script together, rename the zip file to "<id_in_manifest>.xpi", place it in the sideloaded extensions directory (depends on distro, eg /usr/lib/firefox/browser/extensions), restart firefox and it should show up. If you need to debug it, you can use the about:debugging#/runtime/this-firefox page to launch a devtools window connected to the background script.
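For anyone curious, a minimal background.js along those lines might look like this. The host pattern and UA string are placeholders; the manifest needs the webRequest, webRequestBlocking, and matching host permissions as described in the MDN docs linked above.

    // background.js for a sideloaded Firefox extension (Manifest V2 style).
    // manifest.json needs "permissions": ["webRequest", "webRequestBlocking",
    // "https://git.example.org/*"] and "background": { "scripts": ["background.js"] }.
    browser.webRequest.onBeforeSendHeaders.addListener(
      (details) => {
        for (const header of details.requestHeaders) {
          if (header.name.toLowerCase() === 'user-agent') {
            header.value = 'not-mozilla'; // anything without "Mozilla" skips the challenge
          }
        }
        return { requestHeaders: details.requestHeaders };
      },
      { urls: ['https://git.example.org/*'] }, // only the Anubis-fronted sites you visit
      ['blocking', 'requestHeaders']
    );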
Cheers! I'm in Safari so I'll see if there's a match.
Doesn’t that just mean the AI bots can do the same? So what’s the point?
wtf? how is this then better than a captcha or something similar?!
This is neither here nor there but the character isn't a cat. It's in the name, Anubis, who is an Egyptian deity typically depicted as a jackal or generic canine, and the gatekeeper of the afterlife who weighs the souls of the dead (hence the tagline). So more of a dog-girl, or jackal-girl if you want to be technical.
Every representation I've ever seen of Anubis - including remarkably well preserved statues from antiquity - is either a male human body with a canine head, or fully canine.
This anime girl is not Anubis. It's a modern cartoon character that simply borrows the name because it sounds cool, without caring anything about the history or meaning behind it.
Anime culture does this all the time, drawing on inspiration from all cultures but nearly always only paying the barest lip service to the original meaning.
I don't have an issue with that, personally. All cultures and religions should be fair game as inspiration for any kind of art. But I do have an issue with claiming that the newly inspired creation is equivalent in any way to the original source just because they share a name and some other very superficial characteristics.
It's also that the anime style already makes all heads shaped vaguely like felines. Add upwards pointing furry ears and it's not wrong to call it a cat girl.
> they share a name and some other very superficial characteristics.
I wasn't implying anything more than that, although now I see the confusing wording in my original comment. All I meant to say was that between the name and appearance it's clear the mascot is canid rather than feline. Not that the anime girl with dog ears is an accurate representation of the Egyptian deity haha.
It's refreshing to see a reply as thought out as this in today's day and age of "move fast and post garbage".
I think you're taking it a bit too seriously. In turn, I am, of course, also taking it too seriously.
> I do have an issue with claiming that the newly inspired creation is equivalent in any way to the original source
Nobody is claiming that the drawing is Anubis or even a depiction of Anubis, like the statues etc. you are interested in. It's a mascot. "Mascot design by CELPHASE" -- it says, in the screenshot.
Generally speaking -- I can't say that this is what happened with this project -- you would commission someone to draw or otherwise create a mascot character for something after the primary ideation phase of the something. This Anubis-inspired mascot is, presumably, Anubis-inspired because the project is called Anubis, which is a name with fairly obvious connections to and an understanding of "the original source".
> Anime culture does this all the time, ...
I don't know what bone you're picking here. This seems like a weird thing to say. I mean, what anime culture? It's a drawing on a website. Yes, I can see the manga/anime influence -- it's a very popular, mainstream artform around the world.
I like to talk seriously about art, representation, and culture. What's wrong with that? It's at least as interesting as discussing databases or web frameworks.
In case you feel it needs linking to the purpose of this forum, the art in question here is being forcefully shown to people in a situation that makes them do a massive context switch. I want to look at the linux or ffmpeg source code but my browser failed a security check and now I'm staring at a random anime girl instead. What's the meaning here, what's the purpose behind this? I feel that there's none, except for the library author's preference, and therefore this context switch wasted my time and energy.
Maybe I'm being unfair and the code author is so wrapped up in liking anime girls that they think it would be soothing to people who end up on that page. In which case, massive failure of understanding the target audience.
Maybe they could allow changing the art or turning it off?
> Anime culture does this all the time
>> I don't know what bone you're picking here
I'm not picking any bone there. I love anime, and I love the way it feels so free in borrowing from other cultures. That said, the anime I tend to like is more Miyazaki or Satoshi Kon and less kawaii girls.
Hey there! The design of the mascot serves a dual-purpose, and was done very intentionally.
Your workflow getting interrupted, especially with a full-screen challenge page, is a very high-stress event. The mascot serves a purpose in being particularly distinct and recognizable, but also disarming for first-time users. This emotional response was calibrated particularly for more non-technical users who would be quick to be worried about 'being hit by a virus'. In particular I find that bot challenges tend to feel very accusing ("PROVE! PROVE YOU ARE NOT A ROBOT!"), and that a little bit of silly would disarm that feeling.
Similarly, that's why the error version of the mascot looks more surprised if anything. After all, only legitimate users will ever see that. (bots don't have eyes, or at least don't particularly care)
As for the design specifically, making it more anubis-like would probably have been a bit TOO furry and significantly hurt adoption. The design prompt was to stick to a jackal girl. Then again, I kinda wished in retrospect I had made the ears much, much longer.
(wow another Only on HN moment, the designer of the topic's catgirl shows up with topical work)
Thanks for sharing your design notes on the mascot!
Hi there, thank you for chiming in.
Viewing the challenge screenshot again after reading your response definitely sheds light as to why I have no aggro toward Anubis (even if the branding supposedly wouldn't jive well with a super professional platform, but hey, I think having the alternate, commercial offering is super brilliant in turn).
On the other hand, I immediately see red when I get stopped in my tracks by all the widely used (and often infinitely-unpassable) Cloudflare/Google/etc. implementations with wordings that do nothing but add insult to injury.
Thank you for the thought you put into that. I think you guys hit it out of the park.
What does all of this have to do with (depictions of, references to, etc.) Anubis though? You responded to a comment about the mascot surely being a "jackalgirl" as opposed to a "catgirl" because of the Anubis name and other references. It seemed like you had an issue with the artwork, that it wasn't Anubisy enough, or something. Why would the drawing being more like the statues improve the situation?
Now you seem to be saying that anything that isn't what you wanted to find on the website is the problem. This makes sense, it just has nothing to do with what is shown on that page. But you're effectively getting frustrated at not getting to the page you wanted to and then directing your frustration toward the presentation of the "error message". That does not make sense.
> I like to talk seriously about art, representation, and culture. What's wrong with that? It's at least as interesting as discussing databases or web frameworks.
I don't have a problem with talking about art, you'll note that I responded in kind. When I said "I think you're taking it too seriously" I wasn't expecting that to be extrapolated to all subjects, just the one that was being discussed in the local context.
As far as I'm aware it is already possible to change the art displayed (don't know about turning off), most just seem to not care and use the default
>I like to talk seriously about art, representation, and culture. What's wrong with that?
It's no fun.
For one, you pulled your original response out of your ass. That the mascot is not a "catgirl" as identified by OP, but a canine variant of the same concept, because the project is named after the Egyptian god, is both obvious and uninteresting. You added nothing to that.
You're running around shouting "I get the joke, I get the joke" while grandstanding about how serious you are about art, one of the human pursuits helped least by seriousness, considering.
If you've decided you also need to be silly about it today, then at least have the decency to make up a conspiracy theory about the author being in fact a front for an Egyptian cult trying to bring back the old gods using the harvested compute, or whatever.
>massive failure of understanding the target audience.
Heh.
The anime image is put there as an intentional, and to my view rightful, act of irreverence.
One that works, too: I unironically find the people going like "my girl/boss will be mad at me if they see this style of image on my computer" positively hilarious.
>Maybe they could allow changing the art or turning it off?
They sure do. For money. Was in the release announcement.
Not enough irreverence in your game and you can end up being the person who let them build the torment nexus. Many such cases, and that's why we're where we are.
>That said, the anime I tend to like is more Miyazaki or Satoshi Kon and less kawaii girls.
A true connoisseur only watches chibi :3
> out of your ass
If I wanted to be spoken to this way I'd make a reddit account.
Might try for one anyway.
I'm assuming the aversion is more about why young anime girls are popping up, not about what animal it is
Why is there an aversion though? Is it about the image itself or because of the subculture people are associating with the image?
Both. I don't want any random pictures of young girls popping up while I'm browsing the web, and why would adults insert pictures of young girls into their project in the first place?
What a strange comment to make about a cartoon character.
Also, the anime reference is very much intentional at this point; while the source code is open so anyone can change it, the author sells a version for the boring serious types where you can easily change the logo without recompiling the source yourself. Adding the additional bottleneck of having to sync a custom fork or paying out to placate the "serious" people is a great way to get the large corporations to pay a small fee to cover maintenance.
It's an aversion to the sexualised depiction of girls barely the age of puberty or under the age of consent.
I'd ask why you /don't/ have an aversion to that?
(yes, "not all anime" etc...)
I don't know why OP said 'young' at all. It has no characteristics pointing to age, and no secondary sexual characteristics.
So I'd ask why that makes you think of sexual consent.
What part of that image is sexualized?
You know, there are enough real pedophiles in the world, like POTUS for instance. No need to go after made up ones.
Well, thank you for that. That's a great weight off me mind.
... but entirely lacking the primary visual feature that Anubis had.
When I saw it, I instantly knew it was Anubis. I hope the anime catgirls never disappear from that project :)
This anime thing is the one thing about computer culture that I just don't seem to get. I did not get it as a child, when suddenly half of children's cartoons became anime and I just disliked the aesthetic. I didn't get it in school, when people started reading manga. I'll probably never get it. Therefore I sincerely hope they do go away from Anubis, so I can further dwell in my ignorance.
I feel the same. It's a distinct part of nerd culture.
In the '70s, if you were into computers you were most likely also a fan of Star Trek. I remember an anecdote from the 1990s when an entire dial-up ISP was troubleshooting its modem pools because there were zero people connected and they assumed there was an outage. The outage happened to occur exactly while that week's episode of X-Files was airing in their time zone. Just as the credits rolled, all modems suddenly lit up as people connected to IRC and Usenet to chat about the episode. In ~1994 close to 100% of residential internet users also happened to follow X-Files on linear television. There was essentially a 1:1 overlap between computer nerds and sci-fi nerds.
Today's analog seems to be that almost all nerds love anime and Andy Weir books and some of us feel a bit alienated by that.
> Today's analog seems to be that almost all nerds love anime and Andy Weir books and some of us feel a bit alienated by that.
Especially because (from my observation) modern "nerds" who enjoy anime seem to relish at bringing it (and various sex-related things) up at inappropriate times and are generally emotionally immature.
It's quite refreshing seeing that other people have similar lines of thinking and that I'm not alone in feeling somewhat alienated.
I think I'd push back and say that nerd culture is no longer really a single thing. Back in the star trek days, the nerd "community" was small enough that star trek could be a defining quality shared by the majority. Now the nerd community has grown, and there are too many people to have defining parts of the culture that are loved by the majority.
Eg if the nerd community had $x$ people in the star trek days, now there are more than $x$ nerds who like anime and more than $x$ nerds who dislike it. And the total size is much bigger than both.
But what if they choose a different image that you don't get? What if they used an abstract modern art piece that no one gets? Oh the horror!
You don't have to get it to be able to accept that others like it. Why not let them have their fun?
This sounds more as though you actively dislike anime than merely not seeing the appeal or being "ignorant". If you were to ignore it, there wouldn't be an issue...
They can have their fun on their personal websites. Subjecting others to your "fun" when you know it annoys them is not cool.
Well, this is their personal project. You're welcome to make your own, or you can remove the branding if you want: it's open licensed. Or if you're not a coder, they even offer to remove the branding if you support the project
I don't get the impression that it's meant to be annoying, but a personal preference. I can't know that, though whitelabeling is a common thing people pay for without the original brand having made their logo extra ugly
While subjecting the entire Internet to industrial-scale abuse by inconsiderate and poorly written crawlers for the sake of building an overhyped "whatever" is of course perfectly acceptable.
Well, to be fair, that's not our doing so not really an argument for why one should accept something one apparently dislikes (I myself find the character funny and it brings a fun moment when it flashes by, but I can understand/accept that others see it differently of course)
Yes, an argument for why one should accept something one apparently dislikes usually only works when it's from authority.
Might've caught on because the anime had plots, instead of treating viewers as having the attention spans of idiots the way Western kids' shows (and, in the 21st century, software) tend to do.
I don't think it's relevant to debate whether anime or other media are objectively better. But as someone who has never understood anime, I view mainstream western TV series as filled with hours of cleverly written dialogue and long story arcs, whereas the little anime I've watched seems to mostly be overly dramatic colorful action scenes with intense screamed dialogue and strange bodily noises. Should we maybe assume that we are both a bit ignorant of the preferences of others?
Let's rather assume that you're the kind of person who debates a thing by first saying that it's not relevant to debate, then putting forward a pretty out-of-context comparison, and finally concluding that I should feel bad about myself. That kind of story arc does seem to correlate with finding mainstream Western TV worthwhile; there's something structurally similar to the funny way your thought went.
It's nice to see there is still some whimsy on the internet.
Everything got so corporate and sterile.
Everyone copying the same Japanese cartoon style isn't any better than everyone copying corporate memphis.
I think it definitely would be. Perhaps a small one, but still.
As Anubis the Egyptian god is represented as a dog-headed human, I thought the drawing was of a dog-girl.
Perhaps a jackal girl? I guess "cat girl" gets used very broadly to mean kemomimi (pardon the spelling) though
kemono == animal
mimi == ears
We all know it's doomed
That's called a self-fulfilling prophecy and is not in fact mandatory to participate in.
I'm not making any git commits to remove it…
Probably talking about different doomed things then, sorry.
It's not the only project with an anime girl as its mascot.
ComfyUI has what I think is a foxgirl as its official mascot, and that's the de-facto primary UI for generating Stable Diffusion or related content.
I've noticed the word "comfy" used more than usual recently and often by the anime-obsessed, is there cultural relevance I'm not understanding?
It's more likely that the project itself will disappear into irrelevance as soon as AI scrapers bother implementing the PoW (which is trivial for them, as the post explains) or figure out that they can simply remove "Mozilla" from their user-agent to bypass it entirely.
> as AI scrapers bother implementing the PoW
That's what it's for, isn't it? Make crawling slower and more expensive. Shitty crawlers not being able to run the PoW efficiently or at all is just a plus. Although:
> which is trivial for them, as the post explains
Sadly the site's being hugged to death right now so I can't really tell if I'm missing part of your argument here.
> figure out that they can simply remove "Mozilla" from their user-agent
And flag themselves in the logs to get separately blocked or rate limited. Servers win if malicious bots identify themselves again, and forcing them to change the user agent does that.
> That's what it's for, isn't it? Make crawling slower and more expensive.
The default settings produce a computational cost of milliseconds for a week of access. For this to be relevant it would have to be significantly more expensive to the point it would interfere with human access.
I thought the point (which the article misses) is that a token gives you an identity, and an identity can be tracked and rate limited.
So a crawler that behaves ethically and puts very little strain on the server should indeed be able to crawl for a whole week on cheap compute; one that hammers the server hard will not.
Sure, but it's really cheap to mint new identities; each node on their scraping cluster can mint hundreds of thousands of tokens per second.
Provisioning new ips is probably more costly than calculating the tokens, at least with the default difficulty setting.
...unless you're sus, then the difficulty increases. And if you unleash a single scraping bot, you're not a problem anyway. It's for botnets of thousands, mimicking browsers on residential connections to make them hard to filter out or rate limit, effectively DDoSing the server.
Perhaps you just don't realize how much the scraping load has increased in the last 2 years or so. If your server can stay up after deploying Anubis, you've already won.
How is it going to hurt those?
If it's an actual botnet, then it's hijacked computers belonging to other people, who are the ones paying the power bills. The attacker doesn't care that each computer takes a long time to calculate. If you have 1000 computers each spending 5s/page, then your botnet can retrieve 200 pages/s.
If it's just a cloud deployment, still it has resources that vastly outstrip a normal person's.
The fundamental issue is that you can't serve example.com slower than a legitimate user on a crappy 10 year old laptop could tolerate, because that starts losing you real human users. So if, let's say, a user is happy to wait 5 seconds per page at most, then this is absolutely no obstacle to a modern 128-core Epyc. If you make it troublesome for the 128-core monster, then no normal person will find the site usable.
It's not really hijacked computers, there is a whole market for vpns with residential exit nodes.
The way I think it works is they provide a free VPN to the users or even pay their internet bill and then sell access to their IP.
The client just connects to a vpn and has a residential exit IP.
The cost of the VPN is probably higher than the cost for the proof of work though.
> How is it going to hurt those?
In an endless cat-and-mouse game, it won't.
But right now, it does, as these bots tend to be really dumb (presumably, a more competent botnet user wouldn't have it do an equivalent of copying Wikipedia by crawling through its every single page in the first place). With a bit of luck, it will be enough until the bubble bursts and the problem is gone, and you won't need to deploy Anubis just to keep your server running anymore.
> Sadly the site's being hugged to death right now
Luckily someone had already captured an archive snapshot: https://archive.ph/BSh1l
The explanation of how the estimate is made is more detailed, but here is the referenced conclusion:
>> So (11508 websites * 2^16 sha256 operations) / 2^21, that’s about 6 minutes to mine enough tokens for every single Anubis deployment in the world. That means the cost of unrestricted crawler access to the internet for a week is approximately $0.
>> In fact, I don’t think we reach a single cent per month in compute costs until several million sites have deployed Anubis.
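To make those numbers concrete, here is a minimal sketch (in Python, purely illustrative, not Anubis's actual code) of the kind of SHA-256 proof-of-work loop being costed; a difficulty of 4 leading zero hex digits corresponds to the ~2^16 hash attempts assumed above.

    import hashlib
    import itertools

    def solve_pow(challenge: str, difficulty: int = 4) -> int:
        """Brute-force a nonce so sha256(challenge + nonce) starts with
        `difficulty` zero hex digits -- roughly 16**difficulty attempts on average."""
        target = "0" * difficulty
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith(target):
                return nonce

    # At difficulty 4 this is ~65k hashes, which a single modern core clears
    # in well under a second -- hence the near-zero cost estimate quoted above.
    print(solve_pow("example-challenge"))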
If you use one solution to browse the entire site, you're linking every pageload to the same session, and can then be easily singled out and blocked. The idea that you can scan a site for a week by solving the riddle once is incorrect. That works for non-abusers.
Well, since they can get a unique token for every site every 6 minutes using only a free GCP VPS, that doesn't really matter; scraping can easily be spread out across tokens, or they can cheaply and quickly get a new one whenever the old one gets blocked.
Wasn't SHA-256 designed to be very fast to compute? They should be using bcrypt or something similar.
Unless they require a new token for each new request or every x minutes or something it won't matter.
And as the poster mentioned if you are running an AI model you probably have GPUs to spare. Unlike the dev working from a 5 year old Thinkpad or their phone.
Apparently bcrypt has a design that makes it difficult to accelerate effectively on a GPU.
Indeed, a new token should be requested per request; the tokens could also be pre-calculated, so that while the user is browsing a page, the browser could calculate tokens suitable for accessing the next likely browsing targets (e.g. the "next" button).
The biggest downside I see is that mobile devices would likely suffer. Possibly the difficulty of the challenge is/should be varied by other metrics, such as the number of requests arriving per time unit from a C-class network etc.
That's a matter of increasing the difficulty isn't it? And if the added cost is really negligible, we can just switch to a "refresh" challenge for the same added latency and without burning energy for no reason.
If you increase the difficulty much beyond what it currently is, legitimate users end up having to wait for ages.
And if you don't increase it, crawlers will DoS the sites again and legitimate users will have to wait until the next tech hype bubble for the site to load, which is the reason why software like Anubis is being installed in the first place.
If you triple the difficulty, the cost of solving the PoW is still neglible to the crawlers but you've harmed real users even more.
The reason why anubis works is not the PoW, it is that the dev time to implement the bypass takes out the lowest effort bots. Thus the correct response is to keep the PoW difficulty low so you minimize harm to real users. Or better yet, implementing your own custom check that doesn't use any PoW and relies on ever higher obscurity to block the low effort bots.
The more anubis is used, the less effective it is and the more it harms real users.
I am guessing you don't realize that that means people not using the latest generation of phones will suffer.
I'm not using the latest generation of phones, not in the slightest, and I don't really care, because the alternative to Anubis-like intersitials is the sites not loading at all when they're mass-crawled to death.
It's more about the (intentional?) DDoS from AI scrapers than preventing them from accessing the content. Bandwidth is not cheap.
I'm not on Firefox or any Firefox derivative and I still get anime cat girls making sure I'm not a bot.
Mozilla is used in the user agent string of all major browsers for historical reasons, but not necessarily headless ones or so on.
Oh that's interesting, I had no idea.
There are some sites[1] that can print your user agent for you. Try it in a few different browsers and you will be surprised. The strings are honestly unhinged... I have no idea why we still use this header in 2025!
[1]: https://dnschecker.org/user-agent-info.php
¡Nyah!
> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.
Counterpoint - it seems to work. People use Anubis because it's the best of bad options.
If theory and reality disagree, it means either you are missing something or your theory is wrong.
Counter-counter point: it only stopped them for a few weeks and now it doesn’t work: https://news.ycombinator.com/item?id=44914773
Geoblocking China and Singapore solves that problem, it seems, at least the non-residential IPs (though I also see a lot of aggressive bots coming from residential IP space from China).
I wish the old trick of sending CCP-unfriendly content to get the great firewall to kill the connection for you still worked, but in the days of TLS everywhere that doesn't seem to work anymore.
Only Huawei so far, no? That could be easy to block on a network level for the time being
Of course we knew from the beginning that this first stage of "bots don't even try to solve it, no matter the difficulty" isn't a forever solution
AliCloud also seems to send a more capable scraper army, but so far they're not using botnets ("residential proxies") to hide their bad practices.
Superficial comment regarding the catgirl, I don't get why some people are so adamant and enthusiastic for others to see it, but if you like me find it distasteful and annoying, consider copying these uBlock rules: https://sdf.org/~pkal/src+etc/anubis-ublock.txt. Brings me joy to know what I am not seeing whenever I get stopped by this page :)
I don't get why so many people find it "distasteful and annoying"
Can you clarify: do you mean that you do not understand the reasons people dislike these images, or do you find the very idea of disliking them hard to relate to?
I cannot claim that I understand it well, but my best guess is that these are images that represent a kind of culture that I have encountered both in real-life and online that I never felt comfortable around. It doesn't seem unreasonable that this uneasiness around people with identity-constituting interests in anime, Furries, MLP, medieval LARP, etc. transfers back onto their imagery. And to be clear, it is not like I inherently hate anime as a medium or the idea of anthropomorphism in art. There is some kind of social ineptitude around propagating these _kinds_ of interests that bugs me.
I cannot claim that I am satisfied with this explanation. I know that the dislike I feel for this is very similar to what I feel when visiting a hacker space where I don't know anyone. But I hope that I could at least give a feeling for why some people don't like seeing catgirls every time I open a repository, and that it doesn't necessarily have anything to do with advocating for a "corporate soulless web".
You could respect it without "getting" it though.
I can't really explain it but it definitely feels extremely cringeworthy. Maybe it's the neckbeard sexuality or the weird furry aspect. I don't like it.
> The CAPTCHA forces vistors to solve a problem designed to be very difficult for computers but trivial for humans
I'm unsure if this is deadpan humor or if the author has never tried to solve a CAPTCHA that is something like "select the squares with an orthodox rabbi present"
I enjoyed the furor around the 2008 RapidShare CAPTCHA lol
- https://www.htmlcenter.com/blog/now-thats-an-annoying-captch...
- https://depressedprogrammer.wordpress.com/2008/04/20/worst-c...
- https://medium.com/xato-security/a-captcha-nightmare-f6176fa...
The problem with that CAPTCHA is you're not allowed to solve it on Saturdays.
I wonder if it's an intentional quirk that you can only pass some CAPTCHAs if you're a human who knows what an American fire hydrant or school bus looks like?
> an American fire hydrant or school bus
So much this. The first time one asked me to click on "crosswalks", I genuinely had to think for a while as I struggled to remember WTF a "crosswalk" was in AmEng. I am a native English speaker, writer, editor and professionally qualified teacher, but my form of English does not have the word "crosswalk" or any word that is a synonym for it. (It has phrases instead.)
Our schoolbuses are ordinary buses with a special number on the front. They are no specific colour.
There are other examples which aren't coming immediately to mind, but it is vexing when the designer of a CAPTCHA isn't testing if I am human but if I am American.
I doubt it’s intentional. Google (owner of reCAPTCHA) is a US company, so it’s more likely they either haven’t considered what they see every day is far from universal; don’t care about other countries; or specifically just care about training for the US.
Google demanding I flag yellow cars when asked to flag taxis is the silliest Americanism I've seen. At least the school bus has SCHOOL BUS written all over it and fire hydrants aren't exactly an American exclusive thing.
On some Russian and Asian site I ran into trouble signing up for a forum using translation software because the CAPTCHA requires me to enter characters I couldn't read or reproduce. It doesn't happen as often as the Google thing, but the problem certainly isn't restricted to American sites!
There are also services out there that will solve any CAPTCHA for you at a very small cost. And an AI company will get steep discounts with the volumes of traffic they do.
There are some browser extensions for it too, like NopeCHA, it works 99% of the time and saves me the hassle of doing them.
Any site using CAPTCHAs today is really only hurting their real customers and the low-hanging fruit.
Of course this assumes they can't solve the CAPTCHA themselves, with AI, which often they can.
Yes, but not at a rate that enables them to be a risk to your hosting bill. My understanding is that the goal here isn't to prevent crawlers, it's to prevent overly aggressive ones.
Well the problem is that computers got good at basically everything.
Early 2000s captchas really were like that.
The original reCAPTCHA was doing distributed book OCR. It was sold as an altruistic project to help transcribe old books.
And now they're using us to train car driving AI :(
I don't mind car driving AI, why the sad face?
Maybe training autonomous trains would be even cooler, but it's not like we're improving tobacco products, which only have downsides.
I mind working for free.
Every time I see one of these I think it's a malicious redirect to some pervert-dwelling imageboard.
On that note, is kernel.org really using this for free and not the paid version without the anime? Linux Foundation really that desperate for cash after they gas up all the BMW's?
It's crazy (especially considering anime is more popular now than ever; Netflix alone is making billions a year on anime) that people see a completely innocent little anime picture and immediately think "pervert-dwelling imageboard".
> people see a completely innocent little anime picture and immediately think "pervert-dwelling imageboard"
Think you can thank the furries for that.
Every furry I've happened to come across was very pervy in some way, so that's what immediately comes to mind when I see furry-like pictures like the one shown in the article.
YMMV
Out of interest, how many furries have you met? I've been to several fur meets, and have met approximately three furries who I would not want to know anymore for one reason or another
Admittedly just a handful. But I met them in entirely non-furry settings, for example as a user of a regular open source program I was a contributor to (which wasn't Rust based[1]).
None of them were very pervy at first, only after I got to know them.
[1]: https://www.reddit.com/r/rust/comments/vyelva/why_are_there_...
To be fair, that's the sort of place where I spend most of my free time.
"Anime pfp" stereotype is alive and well.
It's not crazy at all that anyone who has been online for more than a day has that association.
they've seized the moment to move the anime cat girls off the Arch Linux desktop wallpapers and onto lore.kernel.org.
Even if the images aren’t the kind of sexualized (or downright pornographic) content this implies… having cutesy anime girls pop up when a user loads your site is, at best, wildly unprofessional. (Dare I say “cringe”?) For something as serious and legit as kernel.org to have this, I do think it’s frankly shocking and unacceptable.
https://storage.courtlistener.com/recap/gov.uscourts.miwd.11...
https://storage.courtlistener.com/recap/gov.uscourts.miwd.11...
“The future is now, old man”
Assuming your quote isn't a joke, I think those links prove the opposite.
Not only is it unprofessional, courts have found it impermissible.
This is the most hilarious thing I have ever read from HN, thank you.
never forget the Ponies CV of an ML guy https://www.huffingtonpost.co.uk/2013/09/03/my-little-pony-r...
Never mind the content, that is one of the most printer-unfriendly CVs I've ever seen
HP loves it "oops you're out of ink"
Noted, I will now add anime girls to my website, so I'm not at risk of being misconstrued as "professional"
If anime girls prevent LLM scraper sympathizers from interacting with the kernel, that's a good thing and should be encouraged more!
You'd think it's the opposite, look at Joseph Redmon's resume:
https://web.itu.edu.tr/yavuzid19/cv.pdf
Isn't the mascot/logo for the Linux kernel a cartoon penguin?
Right, but, that's different. Penguins are serious and professional.
I mean, he's wearing a tuxedo!
I have a plushy tux at home (about 30cm high). So now I'm in the same league as the people with anime pillows?
Well, the people with anime plushies would be a better comparison. There's plenty more of those than pillows.
It depends. What do you do with the plushy?
I bet he's keeping it on some shelf because he think it's cute like only a true sicko would do
What’s the difference?
You'll live.
For me it's the flipside: It makes me think "Ahh, my people!"
Huh, why would they need the unbranded version? The branded version works just fine. It's usually easier to deploy ordinary open source software than it is for software that needs to be licensed, because you don't need special download pages or license keys.
If it makes sense for an organization to donate to a project they rely on, then they should just donate. No need to debrand if it's not strictly required, all that would do is give the upstream project less exposure. For design reasons maybe? But LKML isn't "designed" at all, it has always exposed the raw ugly interface of mailing list software.
Also, this brand does have trust. Sure, I'm annoyed by these PoW captcha pages, but I'm a lot more likely to enable Javascript if it's the Anubis character, than if it is debranded. If it is debranded, it could be any of the privacy-invasive captcha vendors, but if it's Anubis, I know exactly what code is going to run.
If I saw an anime pic show up, that'd be a flag. I only know of Anubis' existence and use of anime from HN.
It is only trusted by a small subset of people who are in the know. It is not about "anime bad" but that a large chunk of the population isn't into it for whatever reason.
I love anime but it can also be cringe. I find this cringe as it seems many others do too.
I wonder if the best solution is still just to create link mazes with garbage text like this: https://blog.cloudflare.com/ai-labyrinth/
It won't stop the crawlers immediately, but it might lead to an overhyped and underwhelming LLM release from a big name company, and force them to reassess their crawling strategy going forward?
That won't work, because garbage data is filtered after the full dataset is collected anyway. Every LLM trainer these days knows that curation is key.
If the "garbage data" is AI generated, it'll be hard or impossible to filter.
Crawlers already know how to stop crawling recursive or otherwise excessive/suspicious content. They've dealt with this problem long before LLM-related crawling.
Why is kernel.org doing this for essentially static content? Cache control headers and ETAGS should solve this. Also, the Linux kernel has solved the C10K problem.
Because it's static content that is almost never cached because it's infrequently accessed. Thus, almost every hit goes to the origin.
The contents in question are statically generated, 1-3 KB HTML files. Hosting a single image would be the equivalent of cold serving 100s of requests.
Putting up a scraper shield seems like it's more of a political statement than a solution to a real technical problem. It's also antithetical to open collaboration and an open internet of which Linux is a product.
Bots don't respect that.
Use a CDN.
A great option for most people, and indeed Anubis' README recommends using Cloudflare if possible. However, not everyone can use a paid CDN. Some people can't pay because their payment methods aren't accepted. Some people need to serve content or to countries which a major CDN can't for legal and compliance reasons. Some organizations need their own independent infrastructure to serve their organizational misson.
So that someone else pays for your bandwidth while seeing who is interested in this content? Idk about that solution
Maybe the Linux Foundation should cover kernel.org's hosting costs?
I have a S24 (flagship of 2024) and Anubis often takes 10-20 seconds to complete, that time is going to add up if more and more sites adopt it, leaning to a worse browsing experience and wasted battery life.
Meanwhile AI farms will just run their own nuclear reactors eventually and be unaffected.
I really don't understand why someone thought this was a good idea, even if well intentioned.
Something must be wrong on your flagship smartphone because I have an entry level one that doesn't take that long.
It seems there is a large number of operations crawling the web to build models that aren't directly using infrastructure hosted on AI farms, but botnets running on commodity hardware and residential networks to keep their IP ranges from being blacklisted. Anubis's point is to block those.
Which browser and which difficulty setting is that?
Because I've got the same model line but about 3 or 4 years older and it usually just flashes by in the browser Lightning from F-droid which is an OS webview wrapper. On occasion a second or maybe two, I assume that's either bad luck in finding a solution or a site with a higher difficulty setting. Not sure if I've seen it in Fennec (firefox mobile) yet but, if so, it's the same there
I've been surprised that this low threshold stops bots but I'm reading in this thread that it's rather that bot operators mostly just haven't bothered implementing the necessary features yet. It's going to get worse... We've not even won the battle let alone the war. Idk if this is going to be sustainable, we'll see where the web ends up...
I have Pixel 7 (released in 2022) and it usually takes less than a second...
Either your phone is on some extreme power saving mode, your ad blocker is breaking Javascript, or something is wrong with your phone.
I've certainly seen Anubis take a few seconds (three or four maybe) but that was on a very old phone that barely loaded any website more complex than HN.
Something is wrong with your flagship if it takes that long.
Samsung's UI has this feature where it turns on power saving mode when it detects light use.
I guess his flagship IS compromised and part of an AI crawling botnet ;-)
You're looking at it wrong.
I remember that LiteCoin briefly had this idea, to be easy on consumer hardware but hard on GPUs. The ASICs didn't take long to obliterate the idea though.
Maybe there's going to be some form of pay-per-browse system? Even if it's some negligible cost on the order of $1 per month (and packaged with other costs), I think economies of scale would allow servers to perform a lifetime of S24 captchas in a couple of seconds.
Seems like ai bots are indeed bypassing the challenge by computing it: https://social.anoxinon.de/@Codeberg/115033790447125787
That's not bypassing it, that's them finally engaging with the PoW challenge as intended, making crawling slower and more expensive, instead of failing to crawl at all (which was more of a plus).
This however forces servers to increase the challenge difficulty, which increases the waiting time for the first-time access.
Obviously the developer of Anubis thinks it is bypassing: https://github.com/TecharoHQ/anubis/issues/978
Fair, then I obviously think Xe may have a kinda misguided understanding of their own product. I still stand by the concept I stated above.
latest update from Xe:
> After further investigation and communication. This is not a bug. The threat actor group in question installed headless chrome and simply computed the proof of work. I'm just going to submit a default rule that blocks huawei.
this kinda proves the entire project doesn't work if they have to resort to manual IP blocking lol
It doesn't work for headless Chrome, sure. The thing is that often, for threats like this to work they need lots of scale, and they need it cheaply because the actors are just throwing a wide net and hoping to catch something. Headless Chrome doesn't scale cheaply, so by forcing script kiddies to use it you're pricing them out of their own game. For now.
Doesn't have to be black or white. You can have a much easier challenge for regular visitors if you block the only (and giant) party that has implemented a solver so far. We can work on both fronts at once...
The point is that it isn't "implementing a solver", it's just using a browser and waiting a few seconds.
The point is that it will always be cheaper for bot farms to pass the challenge than for regular users.
Why does that matter? The challenge needs to stay expensive enough to slow down bots, but legitimate users won't be solving anywhere near the same amount of challenges and the alternative is the site getting crawled to death, so they can wait once in a while.
It might be a lot closer if they were using argon2 instead of SHA. SHA is kind of a bad choice for this sort of thing.
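For illustration, a sketch of what a memory-hard variant could look like, using scrypt from Python's standard library in place of Argon2 (which would need a third-party package); the parameters, fixed salt, and difficulty rule here are assumptions, not anything Anubis actually does.

    import hashlib
    import itertools

    def solve_memory_hard_pow(challenge: bytes, difficulty_bits: int = 8) -> int:
        """Same brute-force idea, but every attempt runs scrypt with N=2**14, r=8,
        which needs roughly 16 MiB of RAM per guess and is far less GPU/ASIC
        friendly than plain SHA-256."""
        for nonce in itertools.count():
            digest = hashlib.scrypt(
                challenge + nonce.to_bytes(8, "big"),
                salt=b"illustrative-salt",  # a real scheme would use a per-challenge salt
                n=2**14, r=8, p=1, dklen=32,
            )
            # Accept once the top `difficulty_bits` bits of the digest are zero.
            if int.from_bytes(digest, "big") >> (32 * 8 - difficulty_bits) == 0:
                return nonce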
Too bad the challenge's result is only a waste of electricity. Maybe they should do like some of those alt-coins and search for prime numbers or something similar instead.
Most of those alt-coins are kind of fake/scams. It's really hard to make it work with actually useful problems.
Of course that doesn't directly help the site operator. Maybe it could actually do a bit of bitcoin mining for the site owner. Then that could pay for the cost of accessing the site.
This only holds true if the data to be accessed is less valuable than the computational cost. In this case, that is false, and spending a few dollars to scrape data is more than worth it.
Reducing the problem to a cost issue is bound to be short-sighted.
This is not about preventing crawling entirely, it's about finding a way to stop crawlers from re-crawling everything way too frequently just because crawling is so cheap. Of course it will always be worth it to crawl the Linux Kernel mailing list, but maybe with a high enough cost per crawl the crawlers will learn to be fine with only crawling it once per hour, for example.
What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?
If you want my help training up your billion dollar model then you should pay me. My content is for humans. If you're not a human you are an unwelcome burden.
Search engines, at least, are designed to index the content, for the purpose of helping humans find it.
Language models are designed to filch content out of my website so it can reproduce it later without telling the humans where it came from or linking them to my site to find the source.
This is exactly the reason that "I just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."
> copyright attribution
You can't copyright an idea, only a specific expression of an idea. An LLM works at the level of "ideas" (in essence - for example if you subtract the vector for "woman" from "man" and add the difference to "king" you get a point very close to "queen") and reproduces them in new contexts and makes its own connections to other ideas. It would be absurd for you to demand attribution and payment every time someone who read your Python blog said "Python is dynamically type-checked and garbage-collected". Thankfully that's not how the law works. Abusive traffic is a problem, but the world is a better place if humans can learn from these ideas with the help of ChatGPT et al. and to say they shouldn't be allowed to just because your ego demands credit for every idea someone learns from you is purely selfish.
LLMs quite literally work at the level of their source material, that's how training works, that's how RAG works, etc.
There is no proof that LLMs work at the level of "ideas", if you could prove that, you'e solve a whole lot of incredibly expensive problems that are current bottlenecks for training and inference.
It is a bit ironic that you'd call someone wanting to control and be paid for the thing they themselves created "selfish", while at the same time writing apologia on why it's okay for a trillion dollar private company to steal someone else's work for their own profit.
It isn't some moral imperative that OpenAI gets access to all of humanity's creations so they can turn a profit.
As a reference on the volume aspect: I have a tiny server where I host some of my git repos. After the fans of my server spun increasingly faster/louder every week, I decided to log the requests [1]. In a single week, ClaudeBot made 2.25M (!) requests (7.55GiB), whereas GoogleBot made only 24 requests (8.37MiB). After installing Anubis the traffic went down to before the AI hype started.
[1] https://types.pl/@marvin/114394404090478296
Same, ClaudeBot makes a stupid amount of requests on my git storage. I just blocked them all on Cloudflare.
As others have said, it's definitely volume, but also the lack of respecting robots.txt. Most AI crawlers that I've seen bombarding our sites just relentlessly scrape anything and everything, without even checking to see if anything has changed since the last time they crawled the site.
Yep, AI scrapers have been breaking our open-source project's Gerrit instance hosted at the Linux Network Foundation.
Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me. This should be a solved problem. But it looks like this field is full of badly behaving companies with complete disregard for the common good.
>Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me.
A mix of ignorance, greed, and a bit of the tragedy of the commons. If you don't respect anyone around you, you're not going to care about any rules or etiquette that don't directly punish you. Society has definitely broken down over the decades.
Volume, primarily - the scrapers are running full-tilt, which many dynamic websites aren't designed to handle: https://pod.geraspora.de/posts/17342163
Why not just actually rate-limit everyone, instead of slowing them down with proof-of-work?
My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.
It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
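A rough sketch of that IP-binding idea (my own illustration, not Anubis's actual token format, which is JWT-based): sign the client IP and an expiry into the token, so a token minted from one address is rejected when replayed from another.

    import hmac
    import hashlib

    SECRET = b"per-deployment signing key"  # assumption: kept server-side

    def issue_token(client_ip: str, expires_at: int) -> str:
        """Issued only after the client solves the PoW challenge."""
        payload = f"{client_ip}|{expires_at}"
        sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        return f"{payload}|{sig}"

    def validate_token(token: str, client_ip: str, now: int) -> bool:
        """Reject tokens that are forged, expired, or presented from a different IP."""
        try:
            ip, expires_at, sig = token.split("|")
        except ValueError:
            return False
        expected = hmac.new(SECRET, f"{ip}|{expires_at}".encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(sig, expected) and ip == client_ip and now < int(expires_at)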
Earlier today I found we'd served over a million requests to over 500,000 different IPs.
All had the same user agent (current Safari), they seem to be from hacked computers as the ISPs are all over the world.
The structure of the requests almost certainly means we've been specifically targeted.
But it's also a valid query, reasonably for normal users to make.
From this article, it looks like Proof of Work isn't going to be the solution I'd hoped it would be.
The math in the article assumes scrapers only need one Anubis token per site, whereas a scraper using 500,000 IPs would require 500,000 tokens.
Scaling up the math in the article, which states it would take 6 CPU-minutes to generate enough tokens to scrape 11,508 Anubis-using websites, we're now looking at 4.3 CPU-hours to obtain enough tokens to scrape your website (and 50,000 CPU-hours to scrape the Internet). This still isn't all that much -- looking at cloud VM prices, that's around 10c to crawl your website and $1000 to crawl the Internet, which doesn't seem like a lot but it's much better than "too low to even measure".
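(A quick back-of-the-envelope in Python, using the article's figures as assumptions, confirms that scaling:)

    hashes_per_token = 2**16      # article's estimate at the default difficulty
    hashes_per_second = 2**21     # article's single-core throughput estimate
    tokens_needed = 500_000       # one per IP in the botnet described above

    cpu_hours = tokens_needed * hashes_per_token / hashes_per_second / 3600
    print(f"{cpu_hours:.1f} CPU-hours")  # ~4.3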
However, the article observes Anubis's default difficulty can be solved in 30ms on a single-core server CPU. That seems unreasonably low to me; I would expect something like a second to be a more appropriate difficulty. Perhaps the server is benefiting from hardware accelerated sha256, whereas Anubis has to be fast enough on clients without it? If it's possible to bring the JavaScript PoW implementation closer to parity with a server CPU (maybe using a hash function designed to be expensive and hard to accelerate, rather than one designed to be cheap and easy to accelerate), that would bring the cost of obtaining 500k tokens up to 138 CPU-hours -- about $2-3 to crawl one site, or around $30,000 to crawl all Anubis deployments.
I'm somewhat skeptical of the idea of Anubis -- that cost still might be way too low, especially given the billions of VC dollars thrown at any company with "AI" in their sales pitch -- but I think the article is overly pessimistic. If your goal is not to stop scrapers, but rather to incentivize scrapers to be respectful by making it cheaper to abide by rate limits than it is to circumvent them, maybe Anubis (or something like it) really is enough.
(Although if it's true that AI companies really are using botnets of hacked computers, then Anubis is totally useless against bots smart enough to solve the challenges since the bots aren't paying for the CPU time.)
If the scraper scrapes from a small number of IPs they're easy to block or rate-limit. Rate-limits against this behaviour are fairly easy to implement, as are limits against non-human user agents, hence the botnet with browser user agents.
The Duke University Library analysis posted elsewhere in the discussion is promising.
I'm certain the botnets are using hacked/malwared computers, as the huge majority of requests come from ISPs and small hosting providers. It's probably more common for this to be malware, e.g. a program that streams pirate TV, or a 'free' VPN app, which joins the user's device to a botnet.
Why haven't they been sued and jailed for DDoS, which is a felony?
Criminal convictions in the US require a standard of proof that is "beyond a reasonable doubt" and I suspect cases like this would not pass the required mens rea test, as, in their minds at least (and probably a judge's), there was no ill intent to cause a denial of service... and trying to argue otherwise based on any technical reasoning (e.g. "most servers cannot handle this load and they somehow knew it") is IMO unlikely to sway the court... especially considering web scraping has already been ruled legal, and that a ToS clause against that cannot be legally enforced.
There's an angle where criminal intent doesn't matter when it comes to negligence and damages. They have to had known that their scrapers would cause denial of service, unauthorized access, increased costs for operators, etc.
That's not a certain outcome. If you're willing to do this case, I can provide access logs and any evidence you want. You can keep any money you win plus I'll pay a bonus on top! Wanna do it?
Keep in mind I'm in Germany, the server is in another EU country, and the worst scrapers are overseas (in China, the USA, and Singapore). Thanks to these LLMs there is no barrier to having the relevant laws translated in all directions, so I trust that won't be a problem! :P
> criminal intent doesn't matter when it comes to negligence and damages
Are you a criminal defense attorney or prosecutor?
> They have to had known
IMO good luck convincing a judge of that... especially "beyond a reasonable doubt" as would be required for criminal negligence. They could argue lots of other scrapers operate just fine without causing problems, and that they tested theirs on other sites without issue.
coming from a different legal system so please forgive my ignorance: Is it necessary in the US to prove ill intent in order to sue for repairs? Just wondering, because when I accidentally punch someones tooth out, I would assume they certainly are entitled to the dentist bill.
>Is it necessary in the US to prove ill intent in order to sue for repairs?
As a general rule of thumb: you can sue anyone for anything in the US. There are even a few cases where someone tried to sue God: https://en.wikipedia.org/wiki/Lawsuits_against_supernatural_...
When we say "do we need" or "can we do" we're talking about the idea of how plausible it is to win case. A lawyer won't take a case with bad odds of winning, even if you want to pay extra because a part of their reputation lies on taking battles they feel they can win.
>because when I accidentally punch someones tooth out, I would assume they certainly are entitled to the dentist bill.
IANAL, so the boring answer is "it depends". Reparations aren't guaranteed, but there are 50 different state laws to consider, on top of federal law.
Generally, they are not obligated to pay for the damages themselves, but they may possibly be charged with battery. Intent will be a strong factor in winning the case.
Manslaughter vs. murder. Same act, different intent, different stigma, different punishment
I thought only capital crimes (murder, for example) held the standard of beyond a reasonable doubt. Lesser crimes require the standard of either a "Preponderance of Evidence" or "Clear and Convincing Evidence" as burden of proof.
Still, even by those lesser standards, it's hard to build a case.
It's civil cases that have the lower standard of proof. Civil cases arise when one party sues another, typically seeking money, and they are claims in equity, where the defendant is alleged to have harmed the plaintiff in some way.
Criminal cases require proof beyond a reasonable doubt. Most things that can result in jail time are criminal cases. Criminal cases are almost always brought by the government, and criminal acts are considered harm to society rather than to (strictly) an individual. In the US, criminal cases are classified as "misdemeanors" or "felonies," but that language is not universal in other jurisdictions.
Thank you.
No, all criminal convictions require proof beyond a reasonable doubt: https://constitution.congress.gov/browse/essay/amdt14-S1-5-5...
>Absent a guilty plea, the Due Process Clause requires proof beyond a reasonable doubt before a person may be convicted of a crime.
Proof or a guilty plea, which is often extracted from not guilty parties due to the lopsided environment of the courts
Thank you.
Many are using botnets, so it's not practical to find out who they are.
Then how do we know they are OpenAI?
High volume and inorganic traffic patterns. Wikimedia wrote about it here: https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...
they seem to be written by either idiots and/or people that don't give a shit about being good internet citizens
either way the result is the same: they induce massive load
well written crawlers will (a rough sketch follows this list):
- wait 2 seconds for a page to load before aborting the connection
- wait for the previous request to finish before requesting the next page, since that would only induce more load, get even slower, and eventually take everything down
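To illustrate, a minimal sketch of that kind of polite crawler in Python (the third-party `requests` package, the URL list, the timeout, and the delay are assumptions for illustration):

    import time
    import requests

    def polite_crawl(urls, timeout_seconds=2.0, delay_seconds=1.0):
        """Fetch pages strictly one at a time, give up quickly on slow responses,
        and pause between requests instead of hammering the origin."""
        for url in urls:
            try:
                resp = requests.get(url, timeout=timeout_seconds)
                yield url, resp.status_code, resp.text
            except requests.RequestException:
                yield url, None, None  # skip pages that time out or error out
            time.sleep(delay_seconds)   # crawl delay between consecutive requests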
I've designed my site to hold up to traffic spikes anyway and the bots I'm getting aren't as crazy as the ones I hear about from other, bigger website operators (like the OpenStreetMap wiki, still pretty niche), so I don't block much of them. Can't vet every visitor so they'll get the content anyway, whether I like it or not. But if I see a bot having HTTP 499 "client went away before page finished loading" entries in the access log, I'm not wasting my compute on those assholes. That's a block. I haven't had to do that before, in a decade of hosting my own various tools and websites
I disagree with the post author in their premise that things like Anubis are easy to bypass if you craft your bot well enough and throw the compute at it.
Thing is, the actual lived experience of webmasters says that the bots scraping the internet for LLMs are nothing like carefully crafted software. They are more like your neighborhood shit-for-brains meth junkies competing with one another over who can pull off more robberies in a day, no matter the profit.
Those bots are extremely stupid. They are worse than script kiddies’ exploit searching software. They keep banging the pages without regard to how often, if ever, they change. If they were 1/10th like many scraping companies’ software, they wouldn’t be a problem in the first place.
Since these bots are so dumb, anything that is going to slow them down or stop them in their tracks is a good thing. Short of drone strikes on data centers or accidents involving owners of those companies that provide networks of botware and residential proxies for LLM companies, it seems fairly effective, doesn’t it?
It is the way it is because there are easy pickings to be made even with this low effort, but the more sites adopt such measures, the less stupid your average bot will be.
Those are just the ones that you've managed to ID as bots.
Ask me how I know.
As I've been saying for a while now - if you want to filter for only humans, ask questions only a human can easily answer; counting the number of letters in a word seems to be a good way to filter out LLMs, for example. Yes, that can be relatively easily gotten around, just like Anubis, but with the benefit that it doesn't filter out humans and has absolutely minimal system requirements (a browser that can submit HTML forms), possibly even less than the site itself.
There are forums which ask domain-specific questions as a CAPTCHA upon attempting to register an account, and as someone who has employed such a method, it is very effective. (Example: what nominal diameter is the intake valve stem on a 1954 Buick Nailhead?)
For smaller forums, any customization to the new account process will work. When I ran a forum that was getting a frustratingly high amount of spammer signups, I modified the login flow to ask the user to add 1 to the 6-digit number in the stock CAPTCHA. Spam signups dropped like a rock.
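As a sketch of how little code that kind of customization takes (the "add 1" rule and the helper names here are made up for illustration, not the actual forum's code):

    import random

    def make_challenge() -> tuple[str, str]:
        """Return (number to display in the form, expected answer)."""
        shown = random.randint(100_000, 999_999)
        return str(shown), str(shown + 1)

    def check_answer(expected: str, submitted: str) -> bool:
        return submitted.strip() == expected

    # Usage: render the first value in the signup form, stash the expected answer
    # in the server-side session, and call check_answer() when the form is submitted.
    shown, expected = make_challenge()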
> counting the number of letters in a word seems to be a good way to filter out LLMs
As long as this challenge remains obscure enough to be not worth implementing special handlers in the crawler, this sounds a neat idea.
But I think if everyone starts doing this particular challenge (char count), the crawlers will start instructing a cheap LLM to do appropriate tool calls and get around it. So the challenge needs to be obscure.
I wonder if anyone tried building a crawler-firewall or even nginx script which will let the site admin plug their own challenge generator in lua or something, which would then create a minimum HTML form. Maybe even vibe code it :)
Tried and true method! An old video game forum named moparscape used to ask what mopar was and I always had to google it
Good thing modern bots can't do a web search!
They will be as likely if not more so to fall victim to the large amount of misinformation... and AI-generated crap you'll find from doing so.
There is a decent segment of the population that will have a hard time with that.
So it's no different from real CAPTCHAs, then.
If you want a tip my friend, just block all of Huawei Cloud by ASN.
... looks like they did: https://github.com/TecharoHQ/anubis/pull/1004, timestamped a few hours after your comment.
lmfao so that kinda defeats the entire point of this project if they have to resort to a manual IP blocklist anyways
I would actually say that it's been successful in identifying at least one large-scale abuser so far, which can then be blocked via more traditional methods.
I have my own project that finds malicious traffic IP addresses, and through searching through the results, it's allowed me to identify IP address ranges to be blocked completely.
Yielding useful information may not have been what it was designed to do, but it's still a useful outcome. Funny thing about Anubis' viral popularity is that it was designed to just protect the author's personal site from a vast army of resource-sucking marauders, and grew because it was open sourced and a LOT of other people found it useful and effective.
I think that was already common knowledge as hansjorg above suggests
Whenever I see an otherwise civil and mature project utilize something outwardly childish like this I audibly groan and close the page.
I'm sure the software behind it is fine but the imagery and style of it (and the confidence to feature it) makes me doubt the mental credibility/social maturity of anybody willing to make it the first thing you see when accessing a webpage.
Edit: From a quick check of the "CEO" of the company, I was unsurprised to have my concerns confirmed. I may be behind the times but I think there are far too many people in who act obnoxiously (as part of what can only be described as a new subculture) in open source software today and I wish there were better terms to describe it.
I wouldn't be surprised if just delaying the server response by some 3 seconds would have the same effect on those scrapers as Anubis claims.
There is literally no point wasting 3 seconds of a computer's time and it's expensive wasting 3 seconds of a person's time.
That is literally an anti-human filter.
From tjhorner on this same thread
"Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project."
So it's meant/preferred to block low-effort crawlers, which can still cause damage if you don't deal with them. A 3-second deterrent seems good in that regard. Maybe the 3-second deterrent could come as rate limiting an IP? But they might use swaths of IPs :/
Anubis exists specifically to handle the problem of bots dodging IP rate limiting. The challenge is tied to your IP, so if you're cycling IPs with every request, you pay dramatically more PoW than someone using a single IP. It's intended to be used in depth with IP rate limiting.
Anubis easily wastes 3 seconds of a human's time already.
You've just described Anubis, yeah
I know, I read the article and that's the thesis.
Yea I'm not convinced unless somehow the vast majority of scrapers aren't already using headless browsers (which I assume they are). I feel like all this does is warm the planet.
On my daily browser with V8 JIT disabled, Cloudflare Turnstile has the worst performance hit, and often requires an additional click to clear.
Anubis usually clears in with no clicks and no noticeable slowdown, even with JIT off. Among the common CAPTCHA solutions it's the least annoying for me.
> I host this blog on a single core 128MB VPS
Where does one even find a VPS with such small memory today?
Or software to run on it. I'm intrigued about this claim as well.
The software is easy. Apt install debian apache2 php certbot and you're pretty much set to deploy content to /var/www. I'm sure any BSD variant is also fine, or lots of other software distributions that don't require a graphical environment
On an old laptop running Windows XP (yes, with GUI, breaking my own rule there) I've also run a lot of services, iirc on 256MB RAM. XP needed about 70 I think, or 52 if I killed stuff like Explorer and unnecessary services, and the remainder was sufficient to run a uTorrent server, XAMPP (Apache, MySQL, Perl and PHP) stack, Filezilla FTP server, OpenArena game server, LogMeIn for management, some network traffic monitoring tool, and probably more things I'm forgetting. This ran probably until like 2014 and I'm pretty sure the site has been on the HN homepage with a blog post about IPv6. The only thing that I wanted to run but couldn't was a Minecraft server that a friend had requested. You can do a heck of a lot with a hundred megabytes of free RAM but not run most Javaware :)
Article might be a bit shallow, or maybe my understanding of how Anubis works is incorrect?
1. Anubis makes you calculate a challenge.
2. You get a "token" that you can use for a week to access the website.
3. (I don't see this being considered in the article) "token" that is used too much is rate limited. Calculating a new token for each request is expensive.
That, but apparently also restrictions on what tech you can use to access the website:
- https://news.ycombinator.com/item?id=44971990 person being blocked with `message looking something like "you failed"`
- https://news.ycombinator.com/item?id=44970290 mentions of other requirements that are allegedly on purpose to block older clients (as browser emulators presumably often would appear to be, because why would they bother implementing newer mechanisms when the web has backwards compatibility)
That's the basic principle. It's a tool to fight crawlers that spam requests without cookies to evade rate limiting.
The Chinese crawlers seem to have adjusted their crawling techniques to give their browsers enough compute to pass standard Anubis checks.
So... Is Anubis actually blocking bots because they didn't bother to circumvent it?
Basically. Anubis is meant to block mindless, careless, rude bots with seemingly no technically proficient human behind the process; these bots tend to be very aggressive and make tons of requests bringing sites down.
The assumption is that if you’re the operator of these bots and care enough to implement the proof of work challenge for Anubis you could also realize your bot is dumb and make it more polite and considerate.
Of course nothing precludes someone implementing the proof of work on the bot but otherwise leaving it the same (rude and abusive). In this case Anubis still works as a somewhat fancy rate limiter which is still good.
Essentially the PoW aspect is pointless then? They could require almost any arbitrary thing.
What else do you envision being used instead of proof of work?
Rot13 a challenge string. It could be any arbitrary function.
That wouldn’t have the fallback rate-limiting functionality. It’s too cheap.
It’s too cheap as a rate limiter as it is if you read TFA.
That's a configurable setting.
There’s no possible setting that would make it expensive enough to deter AI scrapers while preserving an acceptable user experience. The more zeros you add the more real users suffer, despite not creating much of a challenge to datacenter-hosted scrapers.
Real users suffer much more if the site is entirely down due to being DDoSed by aggressive AI scrapers.
Yeah, and if this tool doesn’t stop them then the site is down anyway.
s/circumvent/implement/
Yeah, the PoW is minor for botters but annoying for people. I think the only positive is that if enough people see anime girls on their screens there might actually be political pressure to make laws against rampant bot crawling.
> PoW is minor for botters
But still enough to prevent a billion request DDoS
These sites have been search-engine scraped forever. It's not about blocking bots entirely, just about this new wave of fuck-you-I-don't-care-if-your-host-goes-down quasi-malicious scrapers.
"But still enough to prevent a billion request DDoS" - don't you just do the PoW once to get a cookie and then you can browse freely?
Yes, but a single bot is not a concern. It's the first "D" in DDoS that makes it hard to handle
(and these bots tend to be very, very dumb - which often happens to make them more effective at DDoSing the server, as they're taking the worst and the most expensive ways to scrape content that's openly available more efficiently elsewhere)
Reading TFA, those billion requests would cost web crawlers what, about $100 in compute?
I don't care that they use anime catgirls.
What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net.
I hate Amazon's failure pets, I hate google's failure mini-games -- it strikes me as an organizational effort to get really good at failing rather than spending that same effort to avoid failures altogether.
It's like everyone collectively thought the standard old Apache 404 not found page was too feature-rich and that customers couldn't handle a 3 digit error, so instead we now get a "Whoops! There appears to be an error! :) :eggplant: :heart: :heart: <pet image.png>" and no one knows what the hell is going on even though the user just misplaced a number in the URL.
This is something I've always felt about design in general. You should never make it so that a symbol for an inconvenience appears happy or smug, it's a great way to turn people off your product or webpage.
Reddit implemented something a while back that says "You've been blocked by network security!" with a big smiling Reddit snoo front and centre on the page and every time I bump into it I can't help but think this.
The original versions were a way to make even a boring event such as a 404 a bit of fun. If the page stops conveying the type of error to the user then it's bad UX, but vomiting all the internal jargon at a non-tech user is bad UX too.
So, I don't see an error code + something fun to be that bad.
People love dreaming of the 90s wild web and hate the clean-cut soulless corporate web of today, so I don't see why fun error pages are such an issue?
This assumes it's fun.
Usually when I hit an error page, and especially if I hit repeated errors, I'm not in the mood for fun, and I'm definitely not in the mood for "fun" provided by the people who probably screwed up to begin with. It comes off as "oops, we can't do anything useful, but maybe if we try to act cute you'll forget that".
Also, it was more fun the first time or two. There's not a lot of original fun on the error pages you get nowadays.
> People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today
It's been a while, but I don't remember much gratuitous cutesiness on the 90s Web. Not unless you were actively looking for it.
> This assumes it's fun.
Not to those who don't exist in such cultures. It's creepy, childish, strange to them. It's not something they see in everyday life, nor would I really want to. There is a reason why cartoons are aimed at younger audiences.
Besides if your webserver is throwing errors, you've configured it incorrectly. Those pages should be branded as the site design with a neat and polite description to what the error is.
> What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net
This is probably intentional. They offer an paid unbranded version. If they had a corporate friendly brand on the free offering, then there would be fewer people paying for the unbranded one.
Guru Meditations and Sad Macs are not your thing?
FWIW second and third iteration of AmigaOS didn't have "Guru Meditation"; instead it bluntly labeled the numbers as error and task.
That also got old when you got it again and again while you were trying to actually do something. But there wasn't the space to fit quite as much twee on the screen...
I hear this
I can't find any documentation that says Anubis does this (although it seems odd to me that it wouldn't, and I'd love a reference), but it could do the following (a rough sketch appears after this comment):
1. Store the nonce (or some other identifier) of each jwt it passes out in the data store
2. Track the number or rate of requests from each token in the data store
3. If a token exceeds the rate limit threshold, revoke the token (or do some other action, like tarpit requests with that token, or throttle the requests)
Then if a bot solves the challenge it can only continue making requests with the token if it is well behaved and doesn't make requests too quickly.
It could also do things like limit how many tokens can be given out to a single ip address at a time to prevent a single server from generating a bunch of tokens.
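A rough sketch of that scheme (an in-memory dict standing in for whatever data store Anubis actually uses; the window and threshold are illustrative):

    import time
    from collections import defaultdict

    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 120          # illustrative threshold

    revoked: set[str] = set()
    request_log: dict[str, list[float]] = defaultdict(list)

    def allow_request(token_nonce: str) -> bool:
        """Track requests per token; revoke any token that exceeds the rate limit,
        forcing that client back through the proof-of-work challenge."""
        if token_nonce in revoked:
            return False
        now = time.time()
        log = request_log[token_nonce]
        log[:] = [t for t in log if now - t < WINDOW_SECONDS]   # drop old entries
        log.append(now)
        if len(log) > MAX_REQUESTS_PER_WINDOW:
            revoked.add(token_nonce)
            return False
        return True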
Good on you for finding a solution, but personally I will just not use websites that pull this and not contribute to projects where using such a website is required. If you respect me so little that you will make demands about how I use my computer and block me as a bot if I don't comply, then I am going to assume that you're not worth my time.
Interesting take to say the Linux Kernel is not worth your time.
As far as I know Linux kernel contributions still use email.
This sounds a bit overdramatic for less than a second of waiting time per week per device. Unless you employ an army of crawlers, of course.
With the asymmetry of doing the PoW in javascript versus compiled c code, I wonder if this type of rate limiting is ever going to be directly implemented into regular web browsers. (I assume there's already plugins for curl/wget)
Other than Safari, mainstream browsers seem to have given up on considering browsing without javascript enabled a valid usecase. So it would purely be a performance improvement thing.
Apple supports people that want to not use their software as the gods at Apple intended it? What parallel universe Version of Apple is this!
Seriously though, does anything of Apple's work without JS, like Icloud or Find my phone? Or does Safari somehow support it in a way that other browsers don't?
Last I checked, safari still had a toggle to disable javascript long after both chrome and firefox removed theirs. That's what I was referring to.
I think the solution to captcha-rot is micro-payments. It does consume resources to serve a web page, so who's going to pay for that?
If you want to run advertising, then don't require a payment, and be happy that crawlers will spread your ad to the users of AI bots.
If you run a non-profit site, then it's great to get a micro-payment to help you maintain and run the site.
> an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources
Sure, if you ignore that humans click on one page while the problematic scrapers (not the normal search-engine volume, but the level we see nowadays where misconfigured crawlers go insane on your site) are requesting many thousands to millions of times more pages per minute. So they'll need many, many times the compute to continue hammering your site, whereas a normal user only has to muster enough to load the one page from the search results that they were interested in.
Something feels bizarrely incongruent about the people using Anubis. These people used to be the most vehemently pro-piracy, pro internet freedom and information accessibility, etc.
Yet now when it's AI accessing their own content, suddenly they become the DMCA and want to put up walls everywhere.
I'm not part of the AI doomer cult like many here, but it would seem to me that if you publish your content publicly, typically the point is that it would be publicly available and accessible to the world...or am I crazy?
As everything moves to AI-first, this just means nobody will ever find your content and it will not be part of the collective human knowledge. At which point, what's the point of publishing it.
It is rather funny. "We must prevent AI accessing the Arch Linux help files or it will start the singularity and kill us all!"
Oh, it's time to bring the Internet back to humans. Maybe it's time to treat the first layer of the Internet as just transport. Then, layer large VPN networks on top and put services there. People will just VPN to a vISP to reach content. Different networks, different interests :) But this time don't fuck up abuse handling. Someone is doing something fishy? Depeer them from the network (or their un-cooperating upstream!).
The argument isn't that it's difficult for them to circumvent - it's not - but that it adds enough friction to force them to rethink how they're scraping at scale and/or self-throttle.
I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.
The OP author shows that the cost to scrape an Anubis site is essentially zero since it is a fairly simple PoW algorithm that the scraper can easily solve. It adds basically no compute time or cost for a crawler run out of a data center. How does that force rethinking?
The cookie will be invalidated if shared between IPs, and it's my understanding that most Anubis deployments are paired with per-IP rate limits, which should reduce the amount of overall volume by limiting how many independent requests can be made at any given time.
That being said, I agree with you that there are ways around this for a dedicated adversary, and that it's unlikely to be a long-term solution as-is. My hope is that the act of having to circumvent Anubis at scale will prompt some introspection (do you really need to be rescraping every website constantly?), but that's hopeful thinking.
> do you really need to be rescraping every website constantly
Yes, because if you believe you out-resource your competition, by doing this you deny them training material.
The problem with crawlers is that they're functionally indistinguishable from your average malware botnet in behavior. If you saw a bunch of traffic from residential IPs using the same token, that's a big tell.
Time to switch to stagit. Unfortunately it only generates static pages for a repo's "master" branch. I am sure someone will modify it to support other branches.
Anubis works because AI crawlers make very few requests from any one IP address in order to evade rate-limiting. Last year they could still be blocked by IP range, but now the requests come from so many different networks that this doesn't work anymore.
Doing the proof-of-work for every request is apparently too much work for them.
Crawlers using a single IP, or multiple IPs from a single range, are easily identified and rate-limited.
I actually really liked seeing the mascot. Brought a sense of whimsy to the Internet that I've missed for a long time.
I wrote about something similar a while back: https://maori.geek.nz/proof-of-human-2ee5b9a3fa28
It's about the difficulty of proving you are human, especially when every test we build carries so much incentive to be broken. I don't think it will be solved, or could ever be solved.
> This isn’t perfect of course, we can debate the accessibility tradeoffs and weaknesses, but conceptually the idea makes some sense.
It was arguably never a great idea to begin with, and stopped making sense entirely with the advent of generative AI.
> I host this blog on a single core 128MB VPS
No wonder the site is being hugged to death. 128MB is not a lot. Maybe it's worth upgrading if you post to Hacker News. Just a thought.
It doesn't take much to host a static website. It's all the dynamic stuff/frameworks/db/etc. that bogs everything down.
Still, 128MB is not enough to even run Debian, let alone Apache/NGINX. I’m on my phone, but it doesn’t seem like the author is using Cloudflare or another CDN. I’d like to know what they are doing.
128MB is more than enough to run Debian and serve a static site. I had no issue with doing it a decade ago and it still works fine.
How much memory do you think it actually takes to accept a TLS connection and copy files from disk to a socket?
Modern Linux is much less frugal these days:
https://wiki.debian.org/DebianEdu/Documentation/Bullseye/Req...
* Thin clients with only 256 MiB RAM and 400 MHz are possible, though more RAM and faster processors are recommended.
* For workstations, diskless workstations and standalone systems, 1500 MHz and 1024 MiB RAM are the absolute minimum requirements. For running modern webbrowsers and LibreOffice at least 2048 MiB RAM is recommended.
That's for some educational distro, which presumably is running some fancy desktop environment with fancy GUI programs. I don't think that is reflective of what a web server needs.
A web server is really only going to be running 3 things: init, sshd, and the web server software. Even if we give init and sshd half of 128 MB, there's still 64 MB left for the web server.
Moving bytes around doesn't take RAM but CPU. Notice how switches don't advertise how many gigabytes of RAM they have, yet can push a few gigabits of content around between all 24 ports at once without getting expensive.
Also, the HN homepage is pretty tame so long as you don't run WordPress. You don't get more than a few requests per second, so multiply that by the page size (images etc.) and you probably get a few megabits of bandwidth, no problem even for a Raspberry Pi 1 if the SD card can read fast enough or the files are mapped to RAM by the kernel.
And Codeberg, even behind Anubis, is not immune from scrapers either
https://social.anoxinon.de/@Codeberg/115033782514845941
Reading the original release post for Anubis [0], it seems like it operates mainly on the assumption that AI scrapers have limited support for JS, particularly modern features. At its core it's security through obscurity; I suspect that as usage of Anubis grows, more scrapers will deliberately implement the features needed to bypass it.
That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.
[0] https://xeiaso.net/blog/2025/anubis/
Your link explicitly says:
> It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.
It's meant to rate-limit access by requiring client-side compute that is light enough for legitimate human users and responsible crawlers, but taxing enough to impose a real cost on indiscriminate crawlers that consume host resources excessively.
It does mention that lighter crawlers don't implement the functionality needed to execute the JS, but that's not the main reason it's thought to be sensible. It's a challenge saying: you need to want the content badly enough to spend the amount of compute an individual typically has on hand before you get me to do the work of serving you.
Here's a more relevant quote from the link:
> Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don't support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.
As the article notes, the work required is negligible, and as the linked post notes, that's by design. Wasting scraper compute is part of the picture to be sure, but not really its primary utility.
Why require proof of work with any difficulty at all, then? Just show no UI other than a "JavaScript required" notice and run a trivial computation in WASM as a way of testing for modern browser features. That way users don't complain that it takes 30s on their low-end phone, and it doesn't make it any easier for scrapers to scrape (because the PoW was trivial anyway).
The compute also only seems to happen once, not for every page load, so I'm not sure how this is a huge barrier.
Once per ip. Presumably there's ip-based rate limiting implemented on top of this, so it's a barrier for scrapers that aggressively rotate ip's to circumvent rate limits.
It happens once if the user agent keeps a cookie that can be used for rate limiting. If a crawler hits the limit they need to either wait or throw the cookie away and solve another challenge.
Can that cookie then be used across multiple IPs?
I like hashcash.
https://github.com/factor/factor/blob/master/extra/hashcash/...
https://bitcoinwiki.org/wiki/hashcash
Anubis is based on hashcash concepts - just adapted to a web request flow. Basically the same thing - moderately expensive for the sender/requester to compute, insanely cheap for the server/recipient to verify.
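For anyone who hasn't seen a hashcash-style scheme before, a rough sketch of its shape in Go (this is not Anubis's actual code; the challenge string and difficulty are made up):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "strconv"
        "strings"
    )

    // solve brute-forces a nonce such that SHA-256(challenge + nonce), written
    // as hex, starts with `difficulty` zero characters. This is the expensive
    // side, done by the requester.
    func solve(challenge string, difficulty int) (int, string) {
        prefix := strings.Repeat("0", difficulty)
        for n := 0; ; n++ {
            h := sha256.Sum256([]byte(challenge + strconv.Itoa(n)))
            s := hex.EncodeToString(h[:])
            if strings.HasPrefix(s, prefix) {
                return n, s
            }
        }
    }

    // verify is the cheap side, done by the server: a single hash.
    func verify(challenge string, nonce, difficulty int) bool {
        h := sha256.Sum256([]byte(challenge + strconv.Itoa(nonce)))
        return strings.HasPrefix(hex.EncodeToString(h[:]), strings.Repeat("0", difficulty))
    }

    func main() {
        challenge := "random-server-chosen-string" // would be unique per visitor
        nonce, hash := solve(challenge, 4)          // ~16^4 hashes on average
        fmt.Println(nonce, hash, verify(challenge, nonce, 4))
    }

The asymmetry is the whole point: the solver does tens of thousands of hashes, the verifier does one.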
We need bitcoin-based lightning nano-payments for such things. Visiting the website would cost a fraction of a cent, say $0.0001; the lightning invoice is embedded in the header and paid after a single-click confirmation, or automatically if the amount is under a pre-configured threshold. It's the only way to deal with AI crawlers and future AI scams.
With the current approach we just waste the energy; if you use bitcoin already mined (= energy previously wasted) it becomes sustainable.
We deployed hashcash for a while back in 2004 to implement Picasa's email relay - at the time it was a pretty good solution because all our clients were kind of similar in capability. Now I think the fastest/slowest device is a broader range (just like Tavis says), so it is harder to tune the difficulty for that.
I don't think I've ever actually seen Anubis once. Always interesting to see what's going on in parts of the internet you aren't frequenting.
I read hackernews on my phone when I'm bored and I've seen it a lot lately. I don't think I've ever seen it on my desktop.
Hug of death https://archive.ph/BSh1l
Hmm... What if instead of using plain SHA-256 it was a dynamically tweaked hash function that forced the client to run it in JS?
No, the economics will never work out for a Proof of Work-based counter-abuse challenge. CPU is just too cheap in comparison to the cost of human latency. An hour of a server CPU costs $0.01. How much is an hour of your time worth?
That's all the asymmetry you need to make it unviable. Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users. So there's no point in theorizing about an attacker solving the challenges cheaper than a real user's computer, and thus no point in trying to design a different proof of work that's more resistant to whatever trick the attackers are using to solve it for cheap. Because there's no trick.
But for a scraper to be effective it has to load orders of magnitude more pages than a human browses, so a fixed delay causes a human to take 1.1x as long but slows the scraper down by 100x. Requiring 100x more hardware to do the same job is absolutely a significant economic impediment.
The entire problem is that proof of work does not increase the cost of scraping by 100x. It does not even increase it by 100%. If you run the numbers, a reasonable estimate is that it increases the cost by maybe 0.1%. It is pure snakeoil.
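A back-of-the-envelope version of "running the numbers", with every figure assumed purely for illustration (CPU price, pages per token, and per-page bandwidth cost are all guesses):

    challenge cost   ≈ 1 CPU-second × ($0.01 / 3600 s)   ≈ $0.000003 per solve
    pages per token  ≈ 1,000 (cookie reused until revoked)
    PoW cost/page    ≈ $0.000000003
    bandwidth+parse  ≈ $0.00001 or more per page anyway

    => under these assumptions the PoW is a rounding error on the scraper's
       bill, while a slow phone that needs several seconds to solve it makes
       a human wait.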
>An hour of a server CPU costs $0.01. How much is an hour of your time worth?
That's irrelevant. A human is not going to be solving the challenge by hand, nor is the computer of a legitimate user going to be solving the challenge continuously for one hour. The real question is, does the challenge slow down clients enough that the server does not expend outsized resources serving requests of only a few users?
>Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users.
No, I disagree. If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
The problem with proof-of-work is many legitimate users are on battery-powered, 5-year-old smartphones. While the scraping servers are huge, 96-core, quadruple-power-supply beasts.
The human needs to wait for their computer to solve the challenge.
You are trading something dirt-cheap (CPU time) for something incredibly expensive (human latency).
Case in point:
> If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
No. A human sees a 10x slowdown. A human on a low end phone sees a 50x slowdown.
And the scraper paid one 1/1000000th of a dollar. (The scraper does not care about latency.)
That is not an effective deterrent. And there is no difficulty factor for the challenge that will work. Either you are adding too much latency to real users, or passing the challenge is too cheap to deter scrapers.
>No. A human sees a 10x slowdown.
For the actual request, yes. For the complete experience of using the website not so much, since a human will take at least several seconds to process the information returned.
>And the scraper paid one 1/1000000th of a dollar. (The scraper does not care about latency.)
The point need not be to punish the client, but to throttle it. The scraper may not care about taking longer, but the website's operator may very well care about not being hammered by requests.
But now I have to wait several seconds before I can even start to process the webpage! It's like the internet suddenly became slow again overnight.
Yeah, well, bad actors harm everyone. Such is the nature of things.
A proof of work challenge does not throttle the scrapers at steady state. All it does is add latency and cost to the first request.
Hypothetically, the cookie could be used to track the client and increase the difficulty if its usage becomes abusive.
Yes, and then we can avoid the entire issue. It's patronizing for people to assume users wouldn't notice a 10x or 50x slowdown. You can tell those who think that way are not web developers, as we know that every millisecond has a real, nonlinear fiscal cost.
Of course, then the issue becomes "what is the latency and cost incurred by a scraper to maintain and load balance across a large list of IPs". If it turns out that this is easily addressed by scrapers then we need another solution. Perhaps, the user's browser computes tokens in the background and then serves them to sites alongside a certificate or hash (to prevent people from just buying and selling these tokens).
We solve the latency issue by moving it off-line, and just accept the tradeoff that a user is going to have to spend compute periodically in order to identify themselves in an increasingly automated world.
crawlers can run JS, and can also invest in running the Proof-Of-JS better than you can
Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.
wait but then why bother with this PoW system at all? if they're just trying to block anyone without JS that's way easier and doesn't require slowing things down for end users on old devices.
Reminds me of how Wikipedia literally makes all its data available, even in a nice bulk format just for scrapers (I think), and even THEN some scrapers still scraped the site and cost Wikipedia so much money that, I'm pretty sure, an official statement had to be made, or at least they disclosed it.
Even then, man, scrapers could save so many resources (both their own and Wikipedia's) if they had the sense not to scrape Wikipedia and instead follow Wikipedia's rules.
If we're presupposing an adversary with infinite money then there's no solution. One may as well just take the site offline. The point is to spend effort in such a way that the adversary has to spend much more effort, hopefully so much it's impractical.
> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.
A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
Secondly, Anubis specifically targets bots that try to blend in with human traffic. Bots that don't try to blend in with humans are basically ignored and out-of-scope. Most malicious bots don't want to be targeted, so they want to blend in... so they kind of have to deal with this. If they want to avoid the Anubis challenge, they have to essentially identify themselves. If not, they have to solve it.
Finally... If bots really want to durably be able to pass Anubis challenges, they pretty much have no choice but to run the arbitrary code. Anything else would be a pretty straight-forward cat and mouse game. And, that means that being able to accelerate the challenge response is a non-starter: if they really want to pass it, and not appear like a bot, the path of least resistance is to simply run a browser. That's a big hurdle and definitely does increase the complexity of scraping the Internet. It increases more the more sites that use this sort of challenge system. While the scrapers have more resources, tools like Anubis scale the resources required a lot more for scraping operations than it does a specific random visitor.
To me, the most important point is that it only fights bot traffic that intentionally tries to blend in. That's why it's OK that the proof-of-work challenge is relatively weak: the point is that it's non-trivial and can't be ignored, not that it's particularly expensive to compute.
If bots want to avoid the challenge, they can always identify themselves. Of course, then they can also readily be blocked, which is exactly what they want to avoid.
In the long term, I think the success of this class of tools will stem from two things:
1. Anti-botting improvements, particularly in the ability to punish badly behaved bots, and possibly share reputation information across sites.
2. Diversity of implementations. More implementations of this concept will make it harder for bots to just hardcode fastpath challenge response implementations and force them to actually run the code in order to pass the challenge.
I haven't kept up with the developments too closely, but as silly as it seems I really do think this is a good idea. Whether it holds up as the metagame evolves is anyone's guess, but there's actually a lot of directions it could be taken to make it more effective without ruining it for everyone.
> A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
... has phpbb not heard of the old "only create the session on the second visit, if the cookie was successfully created" trick?
phpBB supports browsers that don't support or accept cookies: if you don't have a cookie, the URL for all links and forms will have the session ID in it. Which would be great, but it seems like these bots are not picking those up either for whatever reason.
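For reference, the trick being described looks roughly like this; a sketch using Go's net/http rather than phpBB's actual PHP, with the cookie name invented for illustration:

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func handler(w http.ResponseWriter, r *http.Request) {
        if _, err := r.Cookie("probe"); err != nil {
            // First visit (or a client that discards cookies): set a throwaway
            // cookie and serve the page WITHOUT creating a server-side session.
            http.SetCookie(w, &http.Cookie{Name: "probe", Value: "1", Path: "/"})
            fmt.Fprintln(w, "hello, anonymous visitor (no session created)")
            return
        }
        // The cookie came back, so this client keeps cookies: only now is it
        // worth creating a real session row in the database.
        fmt.Fprintln(w, "hello, cookie-keeping visitor (session created)")
    }

    func main() {
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

Cookie-less bots then never trigger session creation, at the cost of real cookie-less clients getting a slightly degraded first request.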
We have been seeing our clients' sites being absolutely *hammered* by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.
Personally I have no issue with AI bots that properly identify themselves scraping content; if the site operator doesn't want it to happen, they can easily block the offending bot(s).
We built our own proof-of-work challenge that we enable on client sites/accounts as they come under 'attack' and it has been incredible how effective it is. That said I do think it is only a matter of time before the tactics change and these "malicious" AI bots are adapted to look more human / like real browsers.
I mean honestly it wouldn't be _that_ hard to enable them to run javascript or to emulate a real/accurate User-Agent. That said they could even run headless versions of the browser engines...
It's definitely going to be cat-and-mouse.
The brutally honest truth is that if they throttled themselves so as not to totally crash whatever site they're trying to scrape, we'd probably never have noticed or gone to the trouble of writing our own proof-of-work challenge.
Unfortunately those writing/maintaining these AI bots that hammer sites to death probably either have no concept of the damage it can do or they don't care.
> We have been seeing our clients' sites being absolutely hammered by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.
Yep. I noticed this too.
> That said they could even run headless versions of the browser engines...
Yes, exactly. To my knowledge that's what's going on with the latest wave that is passing Anubis.
That said, it looks like the solution to that particular wave is going to be to just block Huawei cloud IP ranges for now. I guess a lot of these requests are coming from that direction.
Personally though I think there are still a lot of directions Anubis can go in that might tilt this cat and mouse game a bit more. I have some optimism.
I haven't seen much if anything getting past our pretty simple proof-of-work challenge but I imagine it's only a matter of time.
Thankfully, so far, it's still been pretty easy to block them by their user agents as well.
> The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans
Not for me. I have nothing but a hard time solving CAPTCHAs; about 50% of the time I give up after 2 tries.
it's still certainly trivial for you compared to mentally computing a SHA256 op.
Isn’t Anubis a dog? So it should be an anime dog/wolf girl rather than a cat girl?
Yes, Anubis is a dog-headed or jackal-headed god. I actually can't find anywhere on the Anubis website where they talk about their mascot; they just refer to her neutrally as the "default branding".
Since dog girls and cat girls in anime can look rather similar (both being mostly human + ears/tail), and the project doesn't address the point outright, we can probably forgive Tavis for assuming catgirl.
I hope there's some kind of memory-hungry checker to replace the CPU cost.
A 2GB memory requirement won't stop them, but it will limit the parallelism of crawlers.
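Memory-hard KDFs exist for exactly this; a sketch of what the expensive side of such a check could look like, using scrypt in Go (this is not something Anubis does today, and the parameters are purely illustrative):

    package main

    import (
        "encoding/hex"
        "fmt"

        "golang.org/x/crypto/scrypt"
    )

    func main() {
        challenge := []byte("server-chosen-random-challenge")
        salt := []byte("per-visitor-salt")

        // scrypt touches roughly 128*r*N bytes: with N=1<<20 and r=8 that is
        // about 1 GiB of RAM per concurrent solve, which is what caps how many
        // challenges a crawler farm can grind in parallel.
        key, err := scrypt.Key(challenge, salt, 1<<20, 8, 1, 32)
        if err != nil {
            panic(err)
        }
        fmt.Println("response:", hex.EncodeToString(key))
    }

In a browser this would have to run in JS or WASM, which is where the practical friction (and the accessibility cost) comes in.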
> The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans.
> Anubis – confusingly – inverts this idea.
Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.
The actual answer to how this blocks AI crawlers is that they just don't bother to solve the challenge. Once they do bother solving the challenge, the challenge will presumably be changed to a different one.
This seems like a good place to ask. How do I stop bots from signing up to my email list on my website without hosting a backend?
Depending on your target audience, you could require people signing up to send you an email first.
Would it not be more effective just to require payment for accessing your website? Then you don't need to care about bot or not.
We're 1-2 years away from putting the entire internet behind Cloudflare, and Anubis is what upsets you? I really don't get these people. Seeing an anime catgirl for 1-2 seconds won't kill you. It might save the internet though.
The principle behind Anubis is very simple: it forces every visitor to brute force a math problem. This cost is negligible if you're running it on your computer or phone. However, if you are running thousands of crawlers in parallel, the cost adds up. Anubis basically makes it expensive to crawl the internet.
It's not perfect, but much much better than putting everything behind Cloudflare.
The solution is to make premium subscription service for those who do not want to solve CAPTCHAs.
Money is the best proof of humanity.
Doesn't that line of reasoning imply that companies with multiple billions of dollars in their war chest are much more "human" than a literal human with student loans?
My biggest bitch is that it requires JS and cookies...
Although the long term problem is the business model of servers paying for all network bandwidth.
Actual human users have consumed a minority of total net bandwidth for decades:
https://www.atom.com/blog/internet-statistics/
Part 4 shows bots out-using humans as far back as 1996 8-/
What are "bots"? This needs to include goggleadservices, PIA sharing for profit, real-time ad auctions, and other "non-user" traffic.
The difference between that and the LLM training data scraping, is that the previous non-human traffic was assumed, by site servers, to increase their human traffic, through search engine ranking, and thus their revenue. However the current training data scraping is likely to have the opposite effect: capturing traffic with LLM summaries, instead of redirecting it to original source sites.
This is the first major disruption to the internet's model of finance since ad revenue took over after the dot bomb.
So far, it's in the same category as the environmental disaster in progress: ownership is refusing to acknowledge the problem and insisting on business as usual.
Rational predictions are that it's not going to end well...
"Although the long term problem is the business model of servers paying for all network bandwidth."
Servers do not "pay for all the network bandwidth" as if they are somehow being targeted for fees and carrying water for the clients that are somehow getting it for "free". Everyone pays for the bandwidth they use, clients, servers, and all the networks in between, one way or another. Nobody out there gets free bandwidth at scale. The AI scrapers are paying lots of money to scrape the internet at the scales they do.
The AI scrapers are most likely VC funded, and all they care about is getting as much data as possible, not worrying about the costs.
They are renting machines at scale too, so bandwidth etc. is definitely cheaper for them. Maybe use a provider that doesn't have too many bandwidth issues (Hetzner?).
But still, the point is that you might be hosting a website on your small server, and a scraper with its beast of a machine fleet can come along and effectively DDoS your server looking for data to scrape. Deterring them is what matters, so that the economics finally slide back in our favour again.
Maybe my statement wasn't clear. The point is that the server operators pay for all of the bandwidth of access to their servers.
When this access is beneficial to them, that's OK; when it's detrimental to them, they're paying for their own decline.
The statement isn't really concerned with what if anything the scraper operators are paying, and I don't think that really matters in reaching the conclusion.
> The difference between that and the LLM training data scraping
Is the traffic that people are complaining about really training traffic?
My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.
That doesn't seem like enough traffic to be a really big problem.
On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.
That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping", if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.
Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.
So what's really going on here? Anybody actually know?
The traffic I've seen is the big AI players just voraciously scraping for ~everything. What they do with it, if anything, who knows.
There's some user-directed traffic, but it's a small fraction, in my experience.
It's not random internet people saying it's training. It's Cloudflare, among others.
Search for “A graph of daily requests over time, comparing different categories of AI Crawlers” on this blog: https://blog.cloudflare.com/ai-labyrinth/
In the feed today:
AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders
https://www.theregister.com/2025/08/21/ai_crawler_traffic/
The traffic I'm seeing on a wiki I host looks like plain old scraping. When it hits it's a steady load of lots of traffic going all over, from lots of IPs. And they really like diffs between old page revisions for some reason.
That sounds like a really dumb scraper indeed. I don't think you'd want to feed very many diffs into a training run or most inference runs.
But if there's a (discoverable) page comparing every revision of a page to every other revision, and a page has N revisions, there are going to be (N^2-N)/2 delta pages, so could it just be the majority of the distinct pages your Wiki has are deltas?
I would think that by now the "AI companies" would have something smarter steering their scrapers. Like, I dunno, some kind of AI. But maybe they don't for some reason? Or maybe the big ones do, but smaller "hungrier" ones, with less staff but still probably with a lot of cash, are willing to burn bandwidth so they don't have to implement that?
The questions just multiply.
It's near-stock mediawiki, so it has a ton of old versions and diffs off the history tab but I'd expect a crawler to be able to handle it.
For the same reason why cats sit on your keyboard. Because they can
Soon any attempt to actually do it would indicate you're a bot.
Site doesn't load, must be hit by AI crawlers.
Surely the difficulty factor scales with the system load?
HN hug of death
I’m getting a black page. Not sure if it’s an ironic meta commentary, or just my ad blocker.
Can we talk about the "sexy anime girl" thing? Seems it's popular in geek/nerd/hacker circles and I for one don't get it. Browsing reddit anonymously you're flooded with near-pornographic fan-made renders of these things, I really don't get the appeal. Can someone enlighten me?
It's a good question. Anime (like many media, but especially anime) is known to have gratuitous fan service where girls/women of all ages are in revealing clothing for seemingly no reason except to just entice viewers.
The reasoning is that because they aren't real people, it's okay to draw and view images of anime characters, regardless of their age. And because geek/nerd circles tend not to socialize with real women, we get this over-proliferation of anime girls.
This also was my best guess. A "victimless crime" kind of logic that really really creeps me out.
I'd say it's partially a result of 4chan.
Probably depends on the person, but this stuff is mostly the cute instinct, same as videos of kittens. "Aww" and "I must protect it."
We live in a decadent society.
2D girls don't nag and I've never had to clear their clogged hair out of my shower drain.
Anubis doesn't use enough resources to deter AI bots. If you really want to go this way, use React, preferably with more than one UI framework.
I always wondered about these anti-bot precautions... As a Firefox user with ad blocking and third-party cookies disabled, I get the goddamn captcha or some other random check (like this) on a bunch of pages now, every time I visit them...
Is it worth it? Millions of users wasting CPU and power for what? Saving a few cents on hosting? Just rate-limit requests per second per IP and be done.
Sooner or later bots will be better at captchas than humans, what then? What's so bad with bots reading your blog? When bots evolve, what then? UK style, scan your ID card before you can visit?
The internet became a pain to use... Back in the day, you opened a website and saw the content. Now you open it, get an anti-bot check, click, get forwarded to the actual site, a cookie prompt, multiple clicks, then a headline + ads, scroll down a millimeter... do you want to subscribe to a newsletter? Why, I didn't even read the first sentence of the article yet... scroll down... chat-with-AI-bot popup... a bit further down, log in here to see the full article...
Most of the modern web is unusable. I know I'm ranting, but this is just one of the pieces of a puzzle that makes basic browsing a pain these days.
So it's a paywall with -- good intentions -- and even more accessibility concerns. Thus accelerating enshittification.
Who's managing the network effects? How do site owners control false positives? Do they have support teams granting access? How do we know this is doing any good?
It's convoluted security theater mucking up an already bloated, flimsy and sluggish internet. It's frustrating enough to guess school buses every time I want to get work done; now I have to see pornified kitty waifus.
(openwrt is another community plagued with this crap)
here is the community post with Anubis pro / con experiences https://forum.openwrt.org/t/trying-out-anubis-on-the-wiki/23...
blame canada
aren't you happy? at least you get to see a catgirl
Just use Anubis Bypass: https://addons.mozilla.org/en-US/android/addon/anubis-bypass...
Haven't seen dumb anime characters since.
> The idea of “weighing souls” reminded me of another anti-spam solution from the 90s… believe it or not, there was once a company that used poetry to block spam!
> Habeas would license short haikus to companies to embed in email headers. They would then aggressively sue anyone who reproduced their poetry without a license. The idea was you can safely deliver any email with their header, because it was too legally risky to use it in spam.
Kind of a tangent but learning about this was so fun. I guess it's ultimately a hack for there not being another legally enforceable way to punish people for claiming "this email is not spam"?
IANAL so what I'm saying is almost certainly nonsense. But it seems weird that the MIT license has to explicitly say that the licensed software comes with no warranty that it works, but that emails don't have to come with a warranty that they are not spam! Maybe it's hard to define what makes an email spam, but surely it is also hard to define what it means for software to work. Although I suppose spam never e.g. breaks your centrifuge.
It's posts like this that make me really miss the webshit weekly
i suppose one nice property is that it is trivially scalable. if the problem gets really bad and the scrapers have llms embedded in them to solve captchas, the difficulty could be cranked up and the lifetime could be cranked down. it would make the user experience pretty crappy (party like it's 1999) but it could keep sites up for unauthenticated users without engaging in some captcha complexity race.
it does have arty political vibes though, the distributed and decentralized open source internet with guardian catgirls vs. late stage capitalism's quixotic quest to eat itself to death trying to build an intellectual and economic robot black hole.
the action is great, anubis is a very clever idea i love it.
I'm not a huge fan of the anime thing, but i can live with it.
Kernel.org* just has to actually configure Anubis rather than deploying the default broken config. Enable the meta-refresh proof of work rather than relying on the bleeding-edge JavaScript application proof of work that only the corporate browsers handle.
* or whatever site the author is talking about, his site is currently inaccessible due to the amount of people trying to load it.
This cartoon mascot has absolutely nothing to do with anime
If you disagree, please say why
Oh I saw this recently on ffmpeg's site, pretty fun
I really don't understand the hostility towards the mascot. I can't think of a bigger red flag.
Funny to say this when the article literally says "nothing wrong with mascots!"
Out of curiosity, what did you read as hostility?
Oh I totally reacted to the title. The last few times Anubis has been the topic there's always comments about "cringy" mascot and putting that front and center in the title just made me believe that anime catgirls was meant as an insult.
Honestly I am okay with anime catgirls since I just find it funny, but it would still be cool to see Linux-related stuff. Imagine a gif of Tux the penguin racing in SuperTuxKart for the Linux website.
SourceHut also uses Anubis, but they replaced the anime catgirl with their own logo; I think Disroot does that too, though I'm not sure.
Sourcehut uses go-away, not Anubis.
https://sourcehut.org/blog/2025-04-15-you-cannot-have-our-us...
> As you may have noticed, SourceHut has deployed Anubis to parts of our services to protect ourselves from aggressive LLM crawlers.
It's nice that SourceHut themselves have talked about it on their own blog, but I had discovered this through the Anubis website's showcase or something like that, IIRC.
Yes, your link from four months ago says they deployed Anubis. Now actually go to sourcehut yourself and you'll see it uses go-away, not Anubis. Or read the footnote at the bottom of your link (in fact, linked from the very sentence you quoted) that says they were looking at go-away at the time.
https://sourcehut.org/blog/2025-05-29-whats-cooking-q2/
> A few weeks after this blog post, I moved us from Anubis to go-away, which is more configurable and allows us to reduce the user impact of Anubis (e.g. by offering challenges that don’t require JavaScript, or support text-mode browsers better). We have rolled this out on several services now, and unfortunately I think they’re going to remain necessary for a while yet – presumably until the bubble pops, I guess.
Oh sorry, I didn't know you had moved from Anubis to go-away after that, my bad.
But if I remember correctly, when you were using Anubis, you had changed the anime catgirl to something related to SourceHut / its own logo, right?
I don't understand why people resort to this tool instead of simply blocking by UA string or IP address. Are there so many people running these AI crawlers?
I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
Because that solution simply does not work for everyone. People tried, and the crawlers started using proxies with residential IPs.
less savory crawlers use residential proxies and are indistinguishable from malware traffic
You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.
Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. they are really idiotic in behaviour, but at least report themselves correctly.
[1]: https://pod.geraspora.de/posts/17342163
OpenAI/Anthropic/Perplexity aren't the bad actors here. If they are, they are relatively simple to block - why would you implement an Anubis PoW MITM proxy when you could just block on UA?
I get the sense many of the bad actors are simply poor copycats that are poorly building LLMs and are scraping the entire web without a care in the world
> why would you implement an Anubis PoW MITM Proxy, when you could just simply block on UA?
That's in fact what I was asking: I've only seen traffic from these kind of companies and I've easily blocked them without an annoying PoW scheme.
I have yet to see any of these bad actors and I'm interested in knowing who they actually are.
Huawei. Be happy that you haven't been hit by them yet.
> AI companies use residential proxies
Source:
Source: Cloudflare
https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
Perplexity's defense is that they're not doing it for training/KB-building crawls but for answering dynamic query calls, and this is apparently better.
Well yes it is better. It's a page load triggered by a user for their own processing.
If web security worked a little differently, the requests would likely come from the user's browser.
I do not see the words "residential" or "proxy" anywhere in that article... or any other text that might imply they are using those things. And personally... I don't trust crimeflare at all. I think they and their MITM-as-a-service has done even more/lasting damage to the global Internet and user privacy in general than all AI/LLMs combined.
However, if this information is accurate... perhaps site owners should allow AI/bot user agents but respond with different content (or maybe a 404?) instead, to try to prevent it from making multiple requests with different UAs.
I had 500,000 residential IPs make 1-4 requests each in the past couple of days.
These had the same user agent (latest Safari), but previously the agent has been varied.
Blocking this shit is much more complicated than any blocking necessary before 2024.
The data is available for free download in bulk (it's a university) and this is advertised in several places, including the 429 response, the HTML source and the API documentation, but the AI people ignore this.
Lots of companies run these kind of crawlers now as part of their products.
They buy proxies and rotate through proxy lists constantly. It's all residential IPs, so blocking IPs actually hurts end users. Often it's the real IPs of VPN service customers, etc.
There are lots of companies around that you can buy this type of proxy service from.
Why does Anubis not leverage the PoW from its users to do something useful (at best, distributed computing for science; at worst, a cryptocurrency at least allowing the webmasters to get back some cash)?
People are already complaining. Could you imagine how much fodder this'd give people who didn't like the work or the distribution of any funds that a cryptocurrency would create (which would be pennies, I think, and more work to distribute than would be worth doing).
If people are truly concerned about crawlers hammering their 128MB Raspberry Pi website, then a better solution would be to provide an alternative way for scrapers to access the data (e.g., voluntarily contribute a copy of their public site to something like Common Crawl).
If Anubis blocked crawler requests but helpfully redirected to a giant tarball of every site using the service (with deltas or something to reduce bandwidth), I bet nobody would bother spending the time to automate cracking it, since it's basically negative value. You could even make it a torrent so most of the costs are paid by random large labs/universities.
I think the real reason most are so obsessed with blocking crawlers is they want “their cut”… an imagined huge check from OpenAI for their fan fiction/technical reports/whatever.
No, this doesn’t work. Many of the affected sites have these but they’re ignored. We’re talking about git forges, arguably the most standardised tool in the industry, where instead of just fetching the repository every single history revision of every single file gets recursively hammered to death. The people spending the VC cash to make the internet unusable right now don’t know how to program. They especially don’t give a shit about being respectful. They just hammer all the sites, all the time, forever.
The kind of crawlers/scrapers who DDoS a site like this aren't going to bother checking common crawl or tarballs. You vastly overestimate the intelligence and prosociality of what bursty crawler requests tend to look like. (Anyone who is smart or prosocial will set up their crawler to not overwhelm a site with requests in the first place - yet any site with any kind of popularity gets flooded with these requests sooner or later)
If they don’t have the intelligence to go after the more efficient data collection method then they likely won’t have the intelligence or willpower to work around the second part I mentioned (keeping something like Anubis). The only problem is when you put Anubis in the way of determined, intelligent crawlers without giving them a choice that doesn’t involve breaking Anubis.
> I think the real reason most are so obsessed with blocking crawlers is they want “their cut”…
I find that an unfair view of the situation. Sure, there are examples such as StackOverflow (which is ridiculous enough as they didn't make the content) but the typical use case I've seen on the small scale is "I want to self-host my git repos because M$ has ruined GitHub, but some VC-funded assholes are drowning the server in requests".
They could just clone the git repo, and then pull every n hours, but it requires specialized code so they won't. Why would they? There's no money in maintaining that. And that's true for any positive measure you may imagine until these companies are fined for destroying the commons.
There's a lot of people that really don't like AI, and simply don't want their data used for it.
While that’s a reasonable opinion to have, it’s a fight they can’t really win. It’s like putting up a poster in a public square then running up to random people and shouting “no, this poster isn’t for you because I don’t like you, no looking!” Except the person they’re blocking is an unstoppable mega corporation that’s not even morally in the wrong imo (except for when they overburden people’s sites, that’s bad ofc)
The looking is fine; the photographing and selling of the photo, less so… and FYI, in Denmark monuments have copyright, so if you photograph and sell the photos you owe fees :)
I'm generally very pro-robot (every web UA is a robot really IMO) but these scrapers are exceptionally poorly written and abusive.
Plenty of organizations managed to crawl the web for decades without knocking things over. There's no reason to behave this way.
It's not clear to me why they've continued to run them like this. It seems so childish and ignorant.
The bad scrapers would get blocked by the wall I mentioned. The ones intelligent enough to break the wall would simply take the easier way out and download the alternative data source.
Literally the top link when I search for his exact text "why are anime catgirls blocking my access to the Linux kernel?" is https://lock.cmpxchg8b.com/anubis.html. Maybe Tavis needs more google-fu. Maybe that includes using DuckDuckGo?
The top link when you search the title of the article is the article itself?
I am shocked, shocked I say.