I’m sure they do better than me. Sometimes I get stuck on an endless loop of buses and fire hydrants.
Also, when they ask you to identify traffic lights, do you select the post? And when it's motor/bicycles, do you select the guy riding it?
Testing those same captcha on Google Chrome improved my accuracy by at least an order of magnitude.
Either that or it was never about the buses and fire hydrants.
That's because Chrome tracks so much telemetry about you that Google is satisfied with how well it has you surveilled. If you install a ton of privacy extensions like Privacy Badger, uBlock, VPN extensions with information leakage protections, etc., watch that "accuracy" plummet again as it makes you click 20 traffic signals to pass one check.
It's a known "issue" of reCaptcha, and many other systems like it. If it thinks you're a bot, it will "fail" the first few correct solves before it lets you through.
The worst offenders will just loop you forever, no matter how many solves you get right.
stock Chrome logged into a Google account = definitely not a bot. here, click a few fire hydrants and come on in :^)
I sincerely wish that all the folx at Google directly responsible for this particular user acquisition strategy get every cancer available in California.
I would think that when you're viewing reCAPTCHA on a site, if you have 3rd-party cookies disabled, the embedded reCAPTCHA script won't have any way of connecting you with your Google account, even if you're logged in. At least that's how disabling 3rd-party cookies is supposed to work.
Yeah, we've looked at it in the context of reCAPTCHA v3 and 'invisible behavioral analysis': https://www.youtube.com/watch?v=UeTpCdUc4Ls
It doesn't catch OpenAI even though the mouse/click behavior is clearly pretty botlike. One hypothesis is that Google reCAPTCHA is over-indexing on browser patterns rather than on behavioral signals like mouse movement.
The buses and fire hydrants are easy. It is the bicycles. If it goes a pixel over the next box, do I select the next box? Is the pole part of the traffic light? And the guy, as you say. There is a special place in hell for the inventor of reCaptcha (and for all of Cloudflare staff as far as I am concerned!)
It doesn't matter. Select it if you think other people would select it too.
That's the thing, you could go either way. I am not sure I can answer the question "what would a reasonable person click?".
The 'Process Turing Test' extends the CAPTCHA from 'What would a reasonable person click' to 'How would a reasonable person click'.
For example, hesitation/confusion patterns in CAPTCHAs are different between humans and bots, and those can actually be used to validate humans.
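To make that concrete, here is a minimal sketch of the kind of timing feature such a check could compute. The DOM APIs are real; the feature choice and the threshold are invented for illustration:

    // Sketch: crude "hesitation" signal from pointer-event timing.
    // The 0.2 cutoff is made up; real systems use far richer models.
    const gaps: number[] = [];
    let last = performance.now();

    document.addEventListener("pointerdown", () => {
      const now = performance.now();
      gaps.push(now - last); // time since the previous click
      last = now;
    });

    function looksHuman(): boolean {
      if (gaps.length < 3) return true; // not enough signal yet
      const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
      const sd = Math.sqrt(
        gaps.reduce((a, b) => a + (b - mean) ** 2, 0) / gaps.length
      );
      // Humans hesitate, so click gaps vary; bots are often metronomic.
      return sd / mean > 0.2; // coefficient of variation
    }

A bot clicking on a fixed schedule has near-zero variance in those gaps; a human scanning the grid and second-guessing tiles does not.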
That's not due to accuracy, you're getting tarpitted for not looking human enough.
Pro tip: select a section you know is wrong, then deselect it before submitting. Seems to help prove you are not a bot.
Shhh, you're not supposed to tell people. Now they'll patch it and I'll have to select stairs and goddamn motorcycles 4 times in a row.
Another pro tip: the audio version of the captcha is usually much easier / faster to solve if you're in a quiet environment
I always assume that people are lazy and try to click the fewest squares possible while still getting broadly the correct answer. Therefore, if it says motorbikes, just click on the body of the bike and leave out the rider and any tiles with hardly any bike in them.
If it says traffic lights, just click on the ones you can see lit, not the posts, and ignore them if they are too far in the distance. Seems to work for me.
I haven't looked into this much, but I think the fact that humans are willing to do this for something in the "cents per thousand" range means it's really hard to drum up much interest in automating it.
Not sure it's your case, but I think I sometimes had to solve many of them when I was rushing through my daily tasks. My hypothesis is that I solve them faster than the "average human solving duration" reCAPTCHA seems to expect (I think solving too fast triggers the bot fingerprint). More recently, when I hit a reCAPTCHA, I consciously don't rush it, and I no longer seem to have to solve more than one. I don't think I have superpowers, but as a tech guy I do a lot of computing tasks mechanically.
that, and VPN.
Yes.
While running this I looked at hundreds and hundreds of captchas. And I still get rejected on like 20% of them when I do them. I truly don't understand their algorithm lol
There's a browser extension to solve them. Buster.
Forget whether humans can't distinguish your AI from another human. The real Turing test is whether your AI passes all the various flavors of captcha checks.
One of the writers here. We believe the real Turing Test is whether your AI performs a CAPTCHA like a human would/does.
To be honest I'm surprised how well it holds up. I expected close-to-total collapse. It's only a matter of time, I guess, but still.
I wonder if any of the agents hit the audio button and listened to the instructions? In my experience, that can be pretty helpful.
Same! As we talk about in the article, the failures were less about raw model intelligence/ability than about challenges with timing and dynamic interfaces.
i mean did you see the cross-tile numbers
Would performance improve if the tiles were stitched together and fed to a vision model, and the tiles were then selected based on a bounding box?
That's a cool idea. I bet it would work better.
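A rough sketch of the second half of that idea, mapping the detector's bounding box back onto the tile grid. The 4x4 grid and square image are assumptions; the overlap math is the point:

    // Map a detected bounding box (pixels) onto an n-by-n tile grid and
    // return the indices (row-major) of every tile the box touches.
    interface Box { x: number; y: number; w: number; h: number; }

    function tilesForBox(box: Box, imageSize: number, grid = 4): number[] {
      const tile = imageSize / grid;
      const selected: number[] = [];
      for (let row = 0; row < grid; row++) {
        for (let col = 0; col < grid; col++) {
          const overlapsX = box.x < (col + 1) * tile && box.x + box.w > col * tile;
          const overlapsY = box.y < (row + 1) * tile && box.y + box.h > row * tile;
          if (overlapsX && overlapsY) selected.push(row * grid + col);
        }
      }
      return selected;
    }

In practice you'd probably want a minimum-overlap threshold rather than any-overlap, or you run straight into the "goes a pixel over the next box" problem from upthread.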
Seems like Google Gemini is tied for the best and is the cheapest way to solve Google's reCAPTCHA.
Will be interesting to see how Gemini 3 does later this year.
After watching hundreds of these runs, Gemini was by far the least frustrating model to observe.
In my admittedly limited-domain tests, Gemini did _far_ better at image recognition tasks than any of the other models. (This was about 9 months ago, though, so who knows what the current state of things). Google has one of the best internal labeled image datasets, if not the best, and I suspect this is all related.
Makes sense, what do you think it was trained on?
If not today, models will get better at solving captchas in the near future. IMHO, the real concern, however, is cheap captcha solving services.
The solvers are a problem but they give themselves away when they incorrectly fake devices or run out of context. I run a bot detection SaaS and we've had some success blocking them. Their advertised solve times are also wildly inaccurate. They take ages to return a successful token, if at all. The number of companies providing bot mitigation is also growing rapidly, making it difficult for the solvers to stay on top of reverse engineering etc.
> when they incorrectly fake devices
And how often does this happen? Do you have any proof? Most YC companies building browser agents have built-in captcha solvers.
That's a good question. I haven't checked the stats to see how often it happens, but I'll make a note to return with some info. We're dealing with the entire internet, not just YC companies, and many scrapers/solvers will send a user agent that doesn't quite match the JS capabilities you would expect of that browser version. Some solving companies let you supply your own user agent, which causes inconsistencies because they don't change their stack to match the user agent you supply. Under the hood they're running whatever version of headless Chrome they're currently pinned to.
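To illustrate the kind of mismatch that gives them away: compare the version the UA string claims against JS features that shipped in known Chrome versions. A toy sketch (the feature/version pairs are illustrative examples, not an actual ruleset):

    // Toy check: does the claimed Chrome version match the JS surface?
    function uaMajorVersion(ua: string): number | null {
      const m = ua.match(/Chrome\/(\d+)/);
      return m ? parseInt(m[1], 10) : null;
    }

    function uaLooksInconsistent(): boolean {
      const claimed = uaMajorVersion(navigator.userAgent);
      if (claimed === null) return false;
      // [minVersion, featurePresent] pairs; examples only.
      const checks: Array<[number, boolean]> = [
        [92, typeof (Array.prototype as any).at === "function"],
        [98, typeof (globalThis as any).structuredClone === "function"],
      ];
      // The UA claims version >= N but the feature is missing: likely an
      // older pinned headless build hiding behind a spoofed user agent.
      return checks.some(([minVer, present]) => claimed >= minVer && !present);
    }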
Wow. Cross-tile performance was 0-2%. That's the challenge where a single object spans several tiles and you select every tile containing part of it, as opposed to selecting all the tiles that contain the item type (static - 60% max) or the reload version (21% max). Seems to really highlight how far these things are from reasoning or human-level intelligence. Although to be fair, cross-tile is the one I perform worst on too (but more like 90+% rather than 2%).
I think the prompt is probably at fault here. You can use LLMs for object segmentation and they do fairly well, less than 1% seems too low.
The cross-tile challenges were quite robust - every model struggled with them, and we tried with several iterations of the prompt. I'm sure you could improve with specialized systems, but the models out-of-the-box definitely struggle with segmentation
interesting results. why does reload/cross-tile have worse results? would be nice to see some examples of failed results (how close did it get to solving?)
We have an example of a failed cross-tile result in the article - the models seem like they're much better at detecting whether something is in an image vs. identifying the boundaries of those items. This probably has to do with how they're trained - if you train on descriptions/image pairs, I'm not sure how well that does at learning boundaries.
Reload challenges are tricky because of how the agent-action loop works. But the models were pretty good at identifying when a tile contained an item.
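Roughly, the loop looks like this: classify, click, wait for the replacement image to settle, repeat. The two helpers are hypothetical stand-ins for the real agent internals:

    // Hypothetical stand-ins for the agent's screenshot and model calls.
    declare function screenshotTile(tile: Element): Promise<Blob>;
    declare function modelSaysTileMatches(img: Blob, prompt: string): Promise<boolean>;

    async function solveReload(tiles: Element[], prompt: string): Promise<void> {
      let changed = true;
      while (changed) {
        changed = false;
        for (const tile of tiles) {
          const img = await screenshotTile(tile);
          if (await modelSaysTileMatches(img, prompt)) {
            (tile as HTMLElement).click();
            // The clicked tile fades out and a new image fades in; screenshot
            // too early and the model judges a half-loaded transition frame.
            await new Promise((r) => setTimeout(r, 1500)); // delay is a guess
            changed = true;
          }
        }
      }
    }

That wait is where things go wrong: too short and the model sees a transition frame, too long and the challenge can time out, which is exactly the timing/dynamic-interface issue mentioned upthread.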
I'm also curious what the success rates are for humans. Personally I find those two the most bothersome as well. Cross-tile because it's not always clear which parts of the object count and reload because it's so damn slow.
static, cross-tile and reload. recaptcha call window pings LPRs.
Ok and then? Those models were not trained for this purpose.
It's like the recent hype around using generative AI for trading.
You might use it for sentiment analysis, summarization and data pre-processing. But classic forecast models will outperform them if you feed them the right metrics.
It is relevant because they are trained for the purpose of browser use and completing tasks on websites. Being able to bypass captchas is important for using many websites.
It would be nice to see comparisons to some special-purpose CAPTCHA solvers though.
These are all multi-modal models, right? And the vision capabilities are particularly touted in Gemini.
https://ai.google.dev/gemini-api/docs/image-understanding
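For reference, the tile question is a few lines with the @google/genai SDK, following the pattern in those docs (the model name and prompt are just examples):

    import { GoogleGenAI } from "@google/genai";

    const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });

    // Ask which tiles of a 3x3 grid image contain the target object.
    async function tilesContaining(base64Png: string, label: string): Promise<string> {
      const response = await ai.models.generateContent({
        model: "gemini-2.5-flash",
        contents: [
          { inlineData: { mimeType: "image/png", data: base64Png } },
          { text: `The image is a 3x3 grid of tiles numbered 0-8, row-major. ` +
                  `List the numbers of the tiles containing a ${label}.` },
        ],
      });
      return response.text ?? "";
    }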
hCaptcha cofounder here. Enterprise users have a lot of fancy configuration behind the scenes. I wonder if they coordinated with reCAPTCHA or just assumed their sitekey is the same as others.
I know people were solving CAPTCHAS with neural nets (with PHP no less!) back in 2009.
Indeed, captcha vs captcha bot solvers has been an ongoing war for a long time. Considering all the cybercrime and ubiquitous online fraud today, it's pretty impressive that captchas have held the line as long as they have.
You could definitely do better than we do here - this was just a test of how well these general-purpose systems do out-of-the-box.
So, when do we reach a level where AI is better than humans and we remove captchas from pages altogether? If you don't want bots to read content, don't put it online - you're just inconveniencing real people now.
They can also sign up and post spam/scams. There are a lot of those spam bots on YouTube, and there would probably be a lot more without any bot protection. Another issue is aggressive scrapers effectively DoSing a website. Some defense against bots is necessary.
3 models only, can we really call that a benchmark?
yes
in other words, reasoning calls fill the context window with crap
I hypothesize that these AI agents all likely exceed human performance now.