I didn't understand what Cloudflare is saying, so perhaps somebody else could clarify - Google already crawls sites to index them for its search. So why would Google need a separate AI crawler? Cloudflare (according to the article) claims they already "block" Gemini - does that mean Google does operate a separate AI crawler? Why though, when they already have a huge index? What does an AI crawler do differently from a search engine (indexing) crawler?
Google has made it very difficult to completely block its AI crawling, because the standard Googlebot search crawler also feeds data into AI Overviews and other AI features within Google Search. Google says there is a workaround, but it also blocks your site from fully indexing in Google Search. This is all covered in the article, though.
> What does an AI crawler do differently from a search engine (indexing) crawler?
Many people don't want the extra bot traffic that AI brings to their site, especially when AI chat and AI Overviews in Google send such a small amount of traffic in return, and that traffic pretty much always has horrendous conversion rates (personally seen across multiple industries).
It doesn't seem like the extra traffic is the issue. People don't want Google's AI reading and summarizing their data and thus preventing clickthroughs. Why would I click on your site if Google did all the work of giving me the answer ahead of time?
> It doesn't seem like the extra traffic is the issue.
It really can be. The Anubis AI crawler detection tool was created mainly because of way too many AI bot requests; to quote:
> This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies.
Both are an issue. People don't want AI overviews cannibalizing their website traffic. People also don't want AI bots spamming their website with outrageous numbers of requests everyday.
In the specific case of Google would there be any additional traffic that isn't just the normal googlebot? I can't imagine they would bother crawling twice for every site on the internet.
There are about a dozen Google crawlers that can hit your website for different reasons:
https://developers.google.com/search/docs/crawling-indexing/...
Google-Extended is what is associated with AI crawling, but GoogleBot also crawls to produce AI overviews in addition to indexing your website in Google search.
While the number of crawlers and their overlapping responsibilities makes it difficult to know which ones you can safely block, I should also say that pure AI company bots behave 1000x worse than Google's crawlers when it comes to flooding your site with scraping requests.
which again comes back to
this is a problem that needs regulatory action, not one that should be solved by one quasi-monopoly forcing it onto everyone except another quasi-monopoly that can use its monopoly power to avoid it
require:
- respecting robots.txt and similar
- purpose binding/separation (of the crawler agent, but also of the retrieved data), similar to what the GDPR does
- public documentation of agent purpose and stable agent identities
- no obfuscation of who is crawling what
- and actually enforce it
and sure, making something illegal doesn't prevent anyone from being technically able to do it
but at least large companies like Google then have to decide whether they want to commit a crime, and the more they obfuscate that they are doing it, the more proof there is that it was done in bad faith, i.e. the higher judges can push punitive damages
combine it with internet gateways like CF trying to provide technical enforcement and you might have a good solution
but one quasi-monopoly trying to force another to "comply" with its money-making scheme (even if it's in the interest of the end user) smells a lot like a winnable case against CF w.r.t. unfair market practices, abuse of monopoly power, etc.
I find it wild that you focus on CF being a monopoly here when they are providing tools that help publishers not have all of their content stolen and repurposed. AI companies have been notorious over the last few years for not respecting any directives and spamming sites with requests to scrape all of their data.
There is also nothing stopping other CDN/DNS providers from spinning up a marketplace similar to what CF is looking to do now.
> this is a problem that needs regulatory action
I thought we were broadly opposed to regulatory action for a number of reasons, including anti-socialism ideology, dislike of "red tape", and belief that free markets can solve problems.
Google does not use a separate crawler, but they do use the robots.txt semantics to control what's allowed in the AI "index" separately from the search index. From the docs:
> Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity.
https://developers.google.com/search/docs/crawling-indexing/...
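To make that "control capacity" idea concrete, here is a minimal sketch using Python's standard-library robotparser; the sample robots.txt and URL are made up for illustration, not taken from any real site:

```python
# Minimal sketch of the Google-Extended "control capacity" semantics using
# only the standard library. The robots.txt content and URL are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Search crawling remains allowed...
print(rp.can_fetch("Googlebot", "https://example.com/article"))        # True
# ...while the Google-Extended token opts the page out of AI use, even
# though no separate "Google-Extended" crawler ever requests it.
print(rp.can_fetch("Google-Extended", "https://example.com/article"))  # False
```

This only shows the parsing semantics; per the docs quoted above, Google evaluates the token on its own side when deciding what it may use for AI features.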
The assumption is that "Google" is a good player, not a malicious one. Same idea as robots.txt: it's only there to configure the "good guys'" behavior, not to prevent malicious actors.
So I'm assuming Cloudflare is basically asking Google to split its crawler's UA, distinguish between search and AI overviews, and respect something akin to robots.txt.
cloudflare wants to sell an "AI bots have to pay to access your site" service, so they need to argue it out with Google one way or another
the reality of the tech is irrelevant
Google does crawl with a different crawler. It makes no sense, other than that there are clearly two different teams with different goals.
it could be the "grounding with google search" feature when using Gemini models
Watching these two behemoths wrestle over the future of a space we all share, and wondering if they will need to loop in regulators on one side or another, convinces me that we shouldn't have gifted all of our digital infrastructure to just two companies, along with our economy, healthcare, government, and civil infrastructure. We've put all our eggs into only a couple of very greedy, impossible-to-audit baskets. We've really done this all to ourselves. We've raced ourselves all the way to the bottom.
There is nothing stopping other CDN/DNS providers from implementing similar services and tools to what Cloudflare offers. Part of the reason CF has become so popular is because so many of their competitors don't offer nearly the same convenience for routine tasks & protection.
> we shouldn't have gifted all of our digital infrastructure to just 2 companies
We didn't. Just as we didn't gift all our chocolate-making infrastructure to Hershey's and Cadbury's.
Hey did you forget the market is consumer driven?
It used to be just Ma Bell
After 50 years of effectively zero activity, we had some glimmers of anti-trust enforcement under the Biden admin. But then eggs were expensive in the summer of 2024, so we decided it was actually no problem for these half-dozen companies to control our speech and economy, and here we are.
Is this how the democratic process works now? Cloudflare threatens Google to pass a law?
Yes. It has been long known as "lawfare"
In practice it means that we are pretty close to how the ancient Greeks (in the city-state of Athens) defined democracy: ~3% of the population decided - by voting - how the remaining 97% would live their lives.
What has been democratic about how the internet has evolved over the last 2 decades? Because as far as I can see, the internet has undergone a massive centralization into the hands of a few players with practically no regulation. Especially Google, which can make decisions such as adding AI Overviews to search results leading to millions of websites seeing a ~25% drop in organic traffic in the last few months.
Tech regulation, or lack thereof, tends to be "biggest pile of money wins", but in this case there's already large anti-Google and anti-AI constituencies which CF may be able to mobilize. Especially in the EU.
Do you have any more doubts about who runs the US and makes the laws there?
Hard to believe Cloudflare has more sway than Google.
Cloudflare is positioned to become a very powerful company. Well worth investing in.
Guess they've been pretty successful converting scummy Enterprise plan upsells to lobbyist retainers.
"Always has been"
I don't understand how AI scrapers make up such a large percentage of traffic to websites, as people claim it does.
In principle, if you post a webpage, presumably, it's going to be viewed at least a few dozen times. If it's an actually good article, it might be viewed a few hundred or even thousands of times. If each of the 20 or so large AI labs visit it as well, does it just become N+20?
Or am I getting this wrong somehow?
> I don't understand how AI scrapers make up such a large percentage of traffic to websites, as people claim it does.
I think a lot of people confuse scraping for training with on-demand scraping for "agentic use" / "deep research", etc. Today I was testing the new GLM-experimental model on their demo site. It had "web search", so I enabled that and asked it for something I had recently researched myself for work. It gave me a good overall list of agentic frameworks, after some Google searching and "crawling" of ~6 sites it found.
As a second message I asked for a list of repo links, how many stars each repo has, and general repo activity. It went on and "crawled" each of the 10 repos on GitHub, couldn't read the stars, but then searched and found a site that reports that, and it "crawled" that site 10 times, once for each framework.
All in all, my 2 message chat session performed ~ 5-6 searches and 20-30 page "crawls". Imagine what they do when traffic increases. Now multiply that for every "deep research" provider (perplexity, goog, oai, anthropic, etc etc). Now think how many "vibe-coded" projects like this exist. And how many are poorly coded and re-crawl each link every time...
Yeah it seems the implementation of these web-aware GPT queries lacks a(n adequate) caching layer.
Could also be framed as an API issue, as there are no technical limitations preventing a search provider from serving relevant snapshots of the body of the search results. Then again, there might be legal issues behind not providing that information.
Caching on the client side is an obvious improvement, but probably not trivial to implement at the provider level (what do you cache, are you allowed to, how do you deal with auth tokens if supported, a small difference in the search might invalidate the cache, and so on).
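To make that concrete, here is a minimal sketch of the client-side caching idea, assuming a plain urllib fetch and an arbitrary 15-minute TTL; both are illustrative choices, not anyone's actual implementation:

```python
# Minimal sketch: cache raw page bodies by URL with a TTL so repeated tool
# calls in the same "research" session don't re-crawl the same page.
# The TTL value and fetch helper are assumptions for illustration.
import time
import urllib.request

_CACHE = {}  # url -> (fetched_at, body)
TTL_SECONDS = 15 * 60  # assume 15 minutes is "fresh enough" for a chat session


def fetch_cached(url: str) -> bytes:
    now = time.time()
    hit = _CACHE.get(url)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]  # serve from cache, no request leaves the box
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read()
    _CACHE[url] = (now, body)
    return body
```

Even something this naive would avoid the "fetch the same stars page ten times in one session" pattern described above; the hard parts listed here (auth tokens, query-sensitive invalidation) are what make a production version non-trivial.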
Another content-creator avenue might be to move to 2-tier content serving, where you serve pure HTML as a public interface, and only allow "advanced" features that take many CPU cycles for authenticated / paying users. It suddenly doesn't make sense to use a huge, heavy and resource-intensive framework for things that might be crawled a lot by bots / users doing queries w/ LLMs.
Another idea was recently discussed here, and covers "micropayments" for access to content. Probably not trivial to implement either, even though it sounds easy in theory. We've had an entire web3.0 hype cycle on this, and yet no clear easy solutions for micropayments... Oh well. Web4.0 it is :)
A caching layer sounds wonderful. Improves reliability while reducing load on the original servers.
I worry that such caching layers might run afoul of copyright, though :(
Though an internal caching layer would work, surely?
If you run a website, you'll realize it's very difficult to get human traffic. Worse, trying to understand what those eyeballs are doing is a swamp; there are legitimate privacy concerns, for example. Maybe all you care about is whether your articles about sewing machines are getting more traction than your articles about computing Pi, but you can't learn that without navigating all the legal complications of your analytics platform of choice, which wants to make sure you suffer for not letting it collect private information on your visitors to sell to third parties and to dump ads onto your visitors. Were it not for the bots, you would be fine just running grep on your access logs. But no, bot traffic leaves noise everywhere, and for small websites that noise is more than enough to bury the signal and to be most of the traffic bill.
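For the grep-on-access-logs point, here is a rough sketch of what that tally can look like, assuming a combined-format log at a made-up path and an illustrative list of bot user-agent substrings:

```python
# Rough sketch: tally requests by user agent from a combined-format access
# log. The log path and the list of bot substrings are assumptions.
import re
from collections import Counter

BOT_HINTS = ("gptbot", "claudebot", "amazonbot", "bytespider",
             "perplexitybot", "googlebot", "bingbot")

# combined log format ends with: "referer" "user-agent"
UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"$')

bots, humans = Counter(), Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = UA_RE.search(line.strip())
        if not m:
            continue
        ua = m.group("ua").lower()
        (bots if any(h in ua for h in BOT_HINTS) else humans)[ua] += 1

print("bot requests:", sum(bots.values()))
print("other requests:", sum(humans.values()))
print("top bot agents:", bots.most_common(5))
```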
On multiple client sites that have > 1 million unique real visitors per month, we are seeing some days where ~25-30% of requests are from AI crawlers. Thankfully we block almost all of this traffic. But it is a huge pain, because it adds additional load to your server and messes up your analytics data for what is a terrible return: traffic from AI sources has a horrendous conversion rate, even worse than social media traffic conversion rates.
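For illustration, a minimal sketch of what user-agent based blocking can look like at the application layer, written as WSGI middleware; the blocklist is illustrative, real deployments usually do this at the CDN or reverse proxy, and user-agent strings can of course be spoofed:

```python
# Minimal sketch of user-agent based blocking as WSGI middleware.
# The substrings below are examples of well-known AI crawler agents; adjust
# to taste. This is not a substitute for proxy/CDN-level controls.
BLOCKED_UA_SUBSTRINGS = ("gptbot", "claudebot", "ccbot", "bytespider",
                         "amazonbot", "perplexitybot")


class BlockAICrawlers:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in BLOCKED_UA_SUBSTRINGS):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)
```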
I don't understand it either. I track requests, and AI crawlers are there but not as abusive as people claim. The most annoying requests are from hackers who are trying to find my ".git" directory. But I highly doubt these guys will respect any rules anyway.
It varies vastly depending on what type of website you have, how many pages you have, and how often they are updated. We routinely see thousands of requests per minute coming from AI bots, and the scraping lasts for hours: enough to make up 20-30% of overall requests to the server.
Pages are dynamic, they change often: if a page is worth scraping once, it is worth scraping again and again and again to keep up to date with any changes.
Your speculation assumes a low page count.
Maybe they're vibe-coding the scrapers.
I have a symbol server hosting a few thousand PDBs for a FOSS package.
Every day, Amazonbot tries to scan every single PDB directory listed, for no real reason. This causes 10k+ requests each day, when legitimate traffic sits at maybe 50 requests a day.
Post a URL to a page on your website on something like Mastodon and tail your logs.
Related: Please Don’t Share Our Links on Mastodon
https://news.ycombinator.com/item?id=40222067
Can we block them for customers? They're extremely inaccurate. For example, the Google "summary" for "stop killing games started by" was returning "Scott Ross", when it was actually started by Ross Scott. The first search results give the correct answer.
You can use the element picker in uBlock Origin.
> Worst case we’ll pass a law somewhere that requires them to break out their crawlers and then announce all routes to their crawlers from there. And that wouldn’t be hard. But I’m hopeful it won’t need to come to that.
It's interesting to see Matthew say the quiet part out loud here, if by "pass a law," Matthew means get federal legislation passed.
Probably because that's one of the only things you can say to Google these days that will escalate it past a low-level support agent on the other side of the world.
So this boils down to Cloudflare supposedly having more regulatory capture than Google.
The thing he doesn't mention is that as soon as they do something legislatively and announce routes there, etc., Google just won't crawl those sites. It turns into a game of whether you would like zero traffic from Google, or whether you allow them to use your content both for search results and AI summaries.
Google is the bringer of traffic and if you want it, then you play by their rules. I don't like that the web is in that position, but here we are.
I think it gets tricky with AI overviews. Some sites are saying they have lost the majority of their traffic (some losing more than 90%). Many of those make money from ads (often by google).
Some sites don't get enough traffic from google to sustain their business where they previously did. Wholesale blocking google crawlers doesn't seem like a risky move for them.
It makes me wonder if that is a trend? Will more sites go to 'google zero'?
Doesn't this also cannibalize Google's ad revenue?
To some extent, yes. I speculate that is why they are sometimes not putting the AI overview first.
They do have ads in search though. I guess it depends on how much more money they make from people spending more time not leaving google vs showing you display ads on other sites.
I've still never understood the complaint here. Robots are part of the web, the whole point of HTML is that robots can read it.
The whole point of publication is so that humans can read it. Robots not so much, especially if they're not paying customers. This is the distinction between how the web works technically and how it works socioeconomically.
This is the next iteration of things like the news snippet case. Publishers are not happy that Google crawls their content (at their expense) and then republishes it on their own site, while serving ads around it and getting user data, without cutting in the publisher who originally made it. And, for what little it's worth, owns the copyright.
Robots don't exist for their own sake, at the end of the day they are user agents for some group of humans.
Again it sounds like the people who are upset by this really want to publish images rather than web pages.
> Again it sounds like the people who are upset by this really want to publish images rather than web pages.
More like people don't want to lose money because a 3rd party stole all of their content, and then repurposed it to show people before they visit their website.
Web 3.0 is all about machine-readability: https://en.m.wikipedia.org/wiki/Semantic_Web
(Not to be confused with Web3.)
It's a cost thing. It costs more to render a website than it does to consume it. When you have some bot traffic mixed in with human traffic, that is fine.
When you have egregious bot traffic, say 10k requests per minute sustained load, it becomes a real problem for webmasters.
Having Perplexity or other AI bots go haywire and send tens of thousands of requests per minute to your website (despite you having a robots.txt blocking them) is a giant pain in the ass. Not only do your server costs go up, but your analytics and attribution reports start to look messed up because of all the bot traffic.
Yeah obviously if they're being abusive that's a problem but that's not what the article seems to be talking about.
Well, besides them being abusive, the other issue is that AI overviews and answer boxes cannibalize traffic to websites, leading to fewer conversions for the original content producers. This is pretty well established across industries at this point:
https://ahrefs.com/blog/ai-overviews-reduce-clicks/
That's how people want to browse the web. If you block it you won't even get links from those. That's like blocking the search crawler.
...that's literally the entire point of this article. People don't want their websites de-listed from the monopoly that controls organic traffic. At the same time, they would like some control over stopping companies (in this case, the same company that controls the organic search monopoly) from scraping and repurposing their content, so that the traffic to their website doesn't decrease.
Why is it such an issue that publishers and website owners want to maintain the traffic to their website so that they can continue operating as usual? Or should we all just accept every Google decision, even when those decisions result in more engagement on google.com, but 20-35% decreases in traffic to the original websites?
Also I'm going to need a citation that the vast majority of people want and get value out of AI overviews. Because that is certainly not the case from my experience.
Google absolutely is not the only company doing this and if they didn't do it I'd feed the results into my local models to get the same thing.
This isn't a "Google decision"; people are changing the way they use the web.
Google has clearly decided to keep users on their platform longer, hoping that this will lead to more ad clicks. There is a clear reason why AI overviews very seldom link to outside websites, and why website links are much more hidden on Google Maps/Business Profiles. More time spent on the Google platform makes it more likely that someone will eventually click an ad.
Also - I noticed a pretty huge outcry when AI overviews were introduced to search. Can you show me all the people who enjoy the experience of using them more than not?
Because you're stopping at an extremely shallow level of analysis. The fact that robots are part of the web is not relevant; the relevant issue is how some of those robots behave, and what the consequences of that behavior are.
Have you ever been responsible for the performance and security of a publicly accessible web server? I'll accept robots indexing my content if they play nicely. Unfortunately most do not, even from major vendors.
Not a web server, but yeah, we dealt with it by blacklisting patterns (IPs, requests, etc.) from misbehaving domains.
We never distinguished automations from people though, that makes no sense on the internet.
> We never distinguished automations from people though, that makes no sense on the internet.
LOL, I see you've never sold anything on the internet, run a website that is supposed to generate leads, or had to gauge the effectiveness of an ad campaign. There is a huge part of the internet that relies on real humans doing things on websites, and ignoring that is insane.
HTML was not made for robots to read (a semantic web or an internet of data); it just so happens that crawlers try to index things in meaningful ways. It's an unordered blob of unstructured data.
That's why we need the Captchaweb. It's like the web, but everything is in captcha text.
Machine-readable does not mean centralized.
A lot of people don't want AI slop and don't want the companies pushing it to crawl their websites.
We're seeing the free (libre) internet destroyed in front of us.
I come to the exact opposite conclusion. A few large companies scraping and repurposing original content from publishers kills original content on the internet, because it takes away the ability to earn a livelihood from creating original content or running your own ecommerce store that is not tied to a mega-company's platform.
They are going to have to use the law. Are you kidding? You block AI for Google while every other agent is able to bypass that block and crawl your site? How would Google stay competitive in AI? Of course Google would rather have you do it by law: if you block it for them, you had better do it for every single one of their competitors too.
Also, it's guaranteed that even if you do block it by law, there will be actors who ignore the law, for example people outside of the US. As a result, those people will likely build better AI because they have access to more training data.
The first example of this is of course China. One thing with China is that there's no holy sanctity behind data: whatever is made is copyable and effectively goes into the public domain regardless. It causes China to both exceed the US and be slightly less innovative at the same time.
If these laws come to pass you bet your ass China will be exceeding the US in AI like they already have with stem cell research.
Ah ok, yet the megacorps continue to reap profits off stealing labor and work from the peasants.
I propose we just remove the charades and require us peasants pay these megacorps 20% of our annual income yearly at tax time.
Who pays the megacorps to use AI? The peasants. Business only succeeds if the peasantry wants it. So if businesses are fucking over the peasantry, what's really going on is that the businesses are just middlemen for the peasantry fucking itself. In the end the megacorps are just a feedback loop back to the peasantry. That's the irony. You like to blame corporations, but it's just a mirror of yourself.
Still doesn't hurt to lower taxes for peasants and raise taxes for corps. That's how you prevent wealth inequality. And also, middlemen are by nature just a bit corrupt, as they siphon resources just by being in the middle.
Related discussion previously:
Cloudflare to introduce pay-per-crawl for AI bots
https://news.ycombinator.com/item?id=44432385