As someone building AI SaaS products, I used to have the position that directly integrating with APIs is going to get us most of the way there in terms of complete AI automation.
I wanted to take a stab at this problem and started researching some everyday businesses and how they use software.
My brother-in-law (who is a doctor) showed me the bespoke software they use in his practice. Running on Windows. Using MFC forms.
My accountant showed me Cantax - a very powerful software package they use to prepare tax returns in Canada. Also on Windows.
I started to realize that most of the real world runs on software that interfaces directly with people, without clearly defined public APIs you can integrate with. Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs, etc.
Boy was I wrong.
I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile to automate.
This has existed for a long time, it's called "RPA" or Robotic Process Automation. The biggest incumbent in this space is UiPath, but there are a host of startups and large companies alike that are tackling it.
Most of the things that RPA is used for can be easily scripted, e.g. download a form from one website, open up Adobe. There are a lot of startups that are trying to build agentic versions of RPA, I'm glad to see Anthropic is investing in it now too.
> Being in the SaaS space makes you believe that everyone ought to have client-server backend APIs etc.
FWIW, looking at it from end-user perspective, it ain't much different than the Windows apps. APIs are not interoperability - they tend to be tightly-controlled channels, access gated by the vendor and provided through contracts.
In a way, it's easier to make an API to a legacy native desktop app than it is to a typical SaaS[0] - the native app gets updated infrequently, and isn't running in an obstinate sandbox. The older the app, the better - it's more likely to rely on OS APIs and practices, designed with collaboration and accessibility in mind. E.g. in Windows land, in many cases you don't need OCR and mouse emulation - you just need to enumerate the window handles, walk the tree structure looking for text or IDs you care about, and send targeted messages to those components.
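To make that concrete, here's a minimal sketch of the approach - assuming Python with ctypes on Windows; which window and controls you'd target is app-specific, so treat the details as illustrative:

    import ctypes
    from ctypes import wintypes

    user32 = ctypes.windll.user32
    WM_GETTEXT, WM_GETTEXTLENGTH = 0x000D, 0x000E

    def window_text(hwnd):
        # Ask the control itself for its text - no OCR involved.
        n = user32.SendMessageW(hwnd, WM_GETTEXTLENGTH, 0, 0)
        buf = ctypes.create_unicode_buffer(n + 1)
        user32.SendMessageW(hwnd, WM_GETTEXT, n + 1, buf)
        return buf.value

    @ctypes.WINFUNCTYPE(wintypes.BOOL, wintypes.HWND, wintypes.LPARAM)
    def print_child(hwnd, lparam):
        text = window_text(hwnd)
        if text:
            print(hex(hwnd), repr(text))
        return True  # keep walking the tree

    def dump_controls(top_hwnd):
        # Walk every child control of a top-level window and print its text.
        user32.EnumChildWindows(top_hwnd, print_child, 0)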
Unfortunately, desktop apps are headed the same direction web apps are (increasingly often, they are web apps in disguise), so I agree that AI-level RPA is a huge deal.
--
[0] - This is changing a bit in that frameworks seem to be getting complex enough that SaaS vendors often have no clue what kind of access they're leaving open to people who know how to press F12 in their browsers and how to call cURL. I'm not talking about bespoke APIs the backend team wrote, but standard ones built into middleware that fell beyond the dev team's "abstraction horizon". GraphQL is a notable example.
Basically, if it means companies can introduce automation without changing anything about the tooling/workflow/programs they already use, it's going to be MASSIVE. Just an install and a prompt and you've already automated a lengthy manual process - awesome.
Talking about ancient Windows software... Windows used to have an API for automation in the 2000s (I don't know if it still does). In one of my first jobs, I wrote an MS Access script that moved the cursor to the exact pixel coordinates where buttons and fields were positioned in a GUI we wanted to extract data from. My boss used to do this manually. After a week he had millions of records ready to query in Access. You can imagine how excited he was. It was a fun little project, and pretty hilarious to see the cursor moving fast AF around the screen like it was possessed. PS: you could screw up the script run pretty easily by bumping into the mouse of that PC.
Really good software has automation built in, in the form of macros/scripting.
One of the reasons my bash terminal is more effective than point-and-click is the ease of automating routine tasks: from history and aliases to scripting.
Software interop is trickier: it doesn't so much depend on being able to send messages from one piece of software to another; you need an agreed format for those messages (whether they're expressed in the payload of an API or a UI-specific stream of points and clicks).
I tried once to integrate with software used by pharmacists in Australia (which costs a fortune and looks straight out of the 90's). Even though they have a SQL database with all information about everything, the DB is encrypted and they provide no way to access it. I tried contacting the company to learn how I could integrate with it but they offered no help. It's not in their interest to help upcoming competitors and they claim they cannot provide the encryption key as that's for protecting the customer's privacy, forgetting that the pharmacists already have access to all data through their software. But without a way to automatically extract information from the DB, there's no way to migrate to new software as no pharmacist would want to use new software if it cannot "import" the data they need.
It's clear that the reason there's no competition in the market is that the companies (I think it's literally one company) selling this software want it that way, and legislation has yet to catch up with the way they ensure their monopoly.
I'm a bit skeptical about this working well enough to handle exceptions as soon as something out of the ordinary occurs. But it seems this could work great for automated testing.
Has anyone tried asking "use computer" to do "Please write a selenium/capybara/whatever test for filling out this form and sending it?"
That would take away some serious drudge work. And it's not a big problem if it fails, contrary to when it makes a mistake in filling out a form in an actual business process.
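For reference, the kind of artifact you'd hope it produces looks something like this (a hand-written sketch assuming Python + Selenium; the URL and field names are made up):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def test_contact_form_submits():
        driver = webdriver.Firefox()
        try:
            driver.get("https://example.com/contact")  # hypothetical form
            driver.find_element(By.NAME, "email").send_keys("test@example.com")
            driver.find_element(By.NAME, "message").send_keys("Hello!")
            driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
            assert "Thank you" in driver.page_source  # hypothetical success text
        finally:
            driver.quit()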
LLMs are enabling a reimagination of UI.
Where's the biggest opportunity in UI that hasn't kept up to date? Legacy and regulated software in the long tail. That's the disruption opportunity for LLMs.
Not to mention software like PeopleSoft, SAP, and ServiceNow. Absolute shit interfaces that employees have to deal with day in and day out.
Now, an in-house tool built on top of this Anthropic API can save hours of drudgery. I can already see sales teams smiling at the new 'submit your expense report' button.
This is pretty similar to the argument for why humanoid robots will be a big deal. The physical world is also built mostly for humans, so having an autonomous agent that can interact with it is huge.
You don’t know for a fact that those two specific packages don’t have supported APIs. Just because the user doesn’t know of any API doesn’t mean none exists. The average accountant or doctor is never going to even ask the vendor “is there an API” because they wouldn’t know what to do with one if there was.
On their "Developing a computer use model" post they have mention
> On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.
Here, "next-best AI model in the same category" referes to which model.
> I apologize, but it seems we're having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?
It seems unable to figure out how to install Chromium. It sees the Firefox icon out there and just wants to use it. You know, like a real lazy human being.
It is insistent when it needs to be, though:
> I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:
> Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:
> Let me try clicking Firefox one more time to ensure it starts:
> Let me try one more time to take a screenshot to see if Firefox has opened:
> Let me try starting Firefox using the bash tool as an alternative method:
> Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it's open?
I didn't instruct him to ignore the setup wizard. So my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn't run loose.
And here he goes through my website, through my Hacker News account, and then finds this very comment:
> Looking at his first/most recent comment, it's about a discussion of Claude and computer use. Here's what he wrote:
"I like its lazy approach"
This appears to be a humorous response in a thread about "Computer use, a new Claude 3.5 Sonnet, and Claude..." where he's commenting on an AI's behavior in a situation. The comment is very recent (shown as "8 minutes ago" in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.
SYSTEM_PROMPT = f"""<SYSTEM_CAPABILITY>
* You are utilising an Ubuntu virtual machine using {platform.machine()} architecture with internet access.
* You can feel free to install Ubuntu applications with your bash tool. Use curl instead of wget.
* To open firefox, please just click on the firefox icon. Note, firefox-esr is what is installed on your system.
* Using bash tool you can start GUI applications, but you need to set export DISPLAY=:1 and use a subshell. For example "(DISPLAY=:1 xterm &)". GUI apps run with bash tool will appear within your desktop environment, but they may take some time to appear. Take a screenshot to confirm it did.
* When using your bash tool with commands that are expected to output very large quantities of text, redirect into a tmp file and use str_replace_editor or `grep -n -B <lines before> -A <lines after> <query> <filename>` to confirm output.
* When viewing a page it can be helpful to zoom out so that you can see everything on the page. Either that, or make sure you scroll down to see everything before deciding something isn't available.
* When using your computer function calls, they take a while to run and send back to you. Where possible/feasible, try to chain multiple of these calls all into one function calls request.
* The current date is {datetime.today().strftime('%A, %B %-d, %Y')}.
</SYSTEM_CAPABILITY>
<IMPORTANT>
* When using Firefox, if a startup wizard appears, IGNORE IT. Do not even click "skip this step". Instead, click on the address bar where it says "Search or enter address", and enter the appropriate search term or URL there.
* If the item you are looking at is a pdf, if after taking a single screenshot of the pdf it seems that you want to read the entire document instead of trying to continue to read the pdf from your screenshots + navigation, determine the URL, use curl to download the pdf, install and use pdftotext to convert it to a text file, and then read that text file directly with your StrReplaceEditTool.
</IMPORTANT>"""
I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic's website it says that Opus is the most advanced, but on other parts it says Sonnet is the most advanced and also the fastest. The UI doesn't make the distinction clear either. Then on Perplexity, Perplexity says that Opus is the most advanced, compared to Sonnet.
And finally, in the table in the blog post, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default to using it - maybe the ROI is lower on Opus or something?
When I manually tested it, I felt like Opus gave slightly better replies than Sonnet, but I'm not 100% sure it wasn't just placebo.
Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere.
I think they originally announced that Opus would get a 3.5 update, but with every product update they are doing I'm doubting it more and more. It seems like their strategy is to beat the competition on a smaller model that they can train/tune more nimbly and pair it with outside-the-model product features, and it honestly seems to be working.
Opus is a larger and more expensive model. Presumably 3.5 Opus will be the best but it hasn't been released. 3.5 Sonnet is better than 3.0 Opus kind of like how a newer i5 midrange processor is faster and cheaper than an old high-end i7.
Anthropic use the names Haiku/Sonnet/Opus for the small/medium/large versions of each generation of their models, so within a generation that is also their performance (& cost) order. Evidently Sonnet 3.5 outperforms Opus 3.0 on at least some tasks, but that is not a same-generation comparison.
I'm wondering at this point if they are going to release Opus 3.5 at all, or maybe skip it and go straight to 4.0. It's possible that Haiku 3.5 is a distillation of Opus 3.5.
By reputation -- I can't vouch for this personally, and I don't know if it'll still be true with this update -- Opus is still often better for things like creative writing and conversations about emotional or political topics.
The models "3.5 Sonnet" and "3 Opus" are in my experience nearly at the same level. Once in my last 250 prompts did I run into a problem that 3 Opus was able to solve, but 3.5 Sonnet could not. (I forget the details but it was a combination of logic and trivia knowledge. It is highly likely 3.5 Sonnet would have done a better job with better prompting and richer context, but this was a problem where also I lacked the context and understanding to prompt well.)
Given that 3.5 Sonnet is cheaper and faster than 3 Opus, I default to 3.5 Sonnet, so I don't know what the number for the reverse is. How many problems does 3.5 Sonnet get right that 3 Opus does not? ¯\_(ツ)_/¯
My best guess would be that it's something in the same kind of range.
The "computer use" ability is extremely impressive!
This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.
Take a look at their demo of using this for coding.
This seems to be an OpenAI o1 killer - it may be using an agent to do reasoning (it's still not clear exactly what is under the hood), as opposed to o1 supposedly being a model (but still basically a loop around an LLM). The reasoning it is able to achieve in pursuit of a real-world goal is very impressive. It'd be mind-boggling if we hadn't had the last few years to get used to this escalation of capabilities.
It's also interesting to consider this from POV of Anthropic's focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc, but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking ?
From what I'm seeing on GH, this could have technically already been built, right? Is it not just taking screenshots of the computer screen and deciding what to do from there / looping until it gets to the solution?
I've been saying this is coming for a long time, but my really smart SWE friend who is nevertheless not in the AI/ML space dismissed it as a stupid roundabout way of doing things. That software should just talk via APIs. No matter how much I argued regarding legacy software/websites and how much functionality is really only available through GUI, it seems some people are really put off by this type of approach. To me, who is more embedded in the AI, computer vision, robotics world, the fuzziness of day-to-day life is more apparent.
Just as how expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.
I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.
>it seems some people are really put off by this type of approach
As someone who has had to interact with legacy enterprise systems via RPA (screen scraping and keystroke recording) it is absolutely awful, incredibly brittle, and unmaintainable once you get past a certain level of complexity. Even when it works, performance at scale is terrible.
It's Postel's Law, on steroids. Be liberal in what you accept (with LLMs, that means 'literally anything'), but strict in what you return (which in an LLM is still 'literally anything' but you can constrain that).
Beyond that, I can't help but think of the old thin vs. thick client debate, and I would argue that "software should just talk via APIs" is why, in the web space, everybody is blowing time and energy on building client/server architectures and SPAs instead of basic-ass full-stacks.
It's basically the digital equivalent of humanoid robots - people object because having computers interact with a browser, like building a robot in the form of a human, is incredibly inefficient in theory or if you're designing a system from scratch.
The problem is that we're not starting from scratch - we have a web engineered for browser use and a world engineered for humanoid use. That means an agent that can use a browser, while less efficient than an agent using APIs at any particular task, is vastly more useful because it can complete a much greater breadth of tasks. Same thing with humanoid robots - not as efficient at cleaning the floor as my purpose-built Roomba, but vastly more useful because the breadth of tasks it can accomplish means it can be doing productive things most of the time, as opposed to my Roomba, which is not in use 99% of the time.
I do think that once AI agents become common, the web will increasingly be designed for their use and will move away from the browser, but that will probably take a comparable amount of time to how long the mobile web took to emerge after the iPhone came out. (Actually, that's probably not true - it'll take less time, because AI will be doing the work instead of humans.)
I recall 90's Macs had a 3rd party app that offered to observe your mouse/keyboard then automatically recommend routine tasks for you. As a young person I found that fascinating. It's interesting to see history renew itself.
If you want an API, have Claude procedurally test actions and then write a pyautogui/pywinauto/autohotkey etc script to perform it instead. Have it auto-test to verify and classify the general applicability of each action. Repeat for all of life...
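Something like this is presumably what the "frozen" output would look like - a pyautogui sketch with hypothetical, resolution-dependent coordinates that would come from Claude's verified exploration:

    import time
    import pyautogui

    pyautogui.FAILSAFE = True  # slam the mouse into a screen corner to abort

    def export_report():
        pyautogui.click(120, 48)    # "File" menu (hypothetical coordinates)
        time.sleep(0.3)
        pyautogui.click(140, 190)   # "Export..." entry
        time.sleep(1.0)
        pyautogui.typewrite("report.csv", interval=0.02)
        pyautogui.press("enter")

    if __name__ == "__main__":
        export_report()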
Not specific to this update, but I wanted to chime in with just how useful Claude has been, and relatively better than ChatGPT and GitHub copilot for daily use. I've been pro for maybe 6 months. I'm not a power user leveraging their API or anything. Just the chat interface, though with ever more use of Projects, lately. I use it every day, whether for mundane answers or curiosities, to "write me this code", to general consultation on a topic. It has replaced search in a superior way and I feel hugely productive with it.
I do still occasionally pop over to ChatGPT to test their waters (or if Claude is just not getting it), but I've not felt any need to switch back or have both. Well done, Anthropic!
It's possible that they've determined that Opus no longer makes sense if they're able to focus on continuously optimising Sonnet. That said, Anthropic have been relatively good at setting and managing expectations, so today would have been a good time to make that clear.
One of the funnier things during training with the new API (which can control your computer) was this:
"Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.
Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park."
* Fixed bug where Claude got bored during compile times and started editing Wikipedia articles to claim that birds aren't real
* Blocked news.ycombinator.com in the Docker image's hosts file to avoid spurious flamewar posts (Note: The site is still recovering from the last incident)
* Addressed issue of Claude procrastinating on debugging by creating elaborate ASCII art in Vim
* Patched tendency to rickroll users when asked to demonstrate web scraping"
Seeing models act as though they have agency gives me goosebumps (e.g. seeking out photos of Yellowstone for fun). LLMs don't yet have a concept of true intent or agency, but it's wild to think of them acquiring it.
I have been playing with Mindcraft which lets models interact with Minecraft through the bot API and one of them started saying things like "I want to place some cobblestone there" and then later more general "I want to do X" and then start playing with the available commands, it was pretty cool to watch it explore.
I think the best use case for AI `Computer Use` would be simple positioning of the mouse and asking for confirmation before a click. For most use cases this is all people will want/need. If you don't know how to do something, it is basically teaching you how, rather than taking full control and doing things so fast you don't have time to stop it from going rogue.
You'll know AGI is here when it takes time out to go talk to ChatGPT, or another instance of itself, or maybe goes down a rabbit hole of watching YouTube music videos.
In 2015, when I was asked by friends if I'm worried about Self driving Cars and AI, I answered:
"I'll start worrying about AI when my Tesla starts listening to the radio because it's bored."
... that didn't take too long
This is, craaaaaazzzzzy. I'm just a layman, but to me, this is the most compelling evidence that things are starting to tilt toward AGI that I've ever seen.
True end-user programming and product manager programming are coming, probably pretty soon. Not the same thing, but Midjourney went from v.1 to v.6 in less than 2 years.
If something similar happens, most jobs that could be done remotely will be automatable in a few years.
Every time I see this argument made, there seems to be a level of complexity and/or operational cost above which people throw up their hands and say "well of course we can't do that".
I feel like we will see that again here as well. It really is similar to the self-driving problem.
> True end-user programming and product manager programming are coming
This means that either product managers will have to start (effectively) writing in-depth specs again, or they will have to learn to accept the LLM's ideas in a way that most have not accepted their human programmers' ideas.
Definitely will be interesting to see how that plays out.
Idk, LLMs have basically stopped improving for over a year now. And in their current state, no matter how many abstractions you add to them - or how you chain them - they are nowhere near capable of replacing even simple jobs.
It makes me wonder if people that make these claims have an actual job. Because if they did then I doubt anyone could make that claim with a straight face.
> If something similar happens, most jobs that could be done remotely will be automatable in a few years.
I'm really curious on the cost of that sort of thing. Seems astronomical atm, but as much as i get shocked at the today-cost, staffing is also a pretty insane cost.
"Create a simple website" has to be one of the most common blog / example out there in about every programming language.
It can automate stuff? That's cool: I already automate screenshots and then have an AI check whether they look like phishing or not (and it's quite good at it).
I mean, the "Claude using Claude" demo may seem cool, but I dispute the "for coding" part. That's trivial stuff. A trivial error (which it doesn't fix, btw: it just deletes everything).
"Claude, write me code to bring SpaceX rockets back to earth"
or
"Claude, write me code to pilot a machine to treat a tumor with precision"
Molmo was released recently and is able to provide point coordinates for objects in images. I've been testing it out and am currently building an automation tool that lets users control a computer more easily. Looks like Anthropic built a better one.
Edit: it seems like these new features will eliminate a lot of automated testing tools we have today.
From the computer use video demo, that's a lot of API calls. Even though Claude 3.5 Sonnet is relatively cheap for its performance, I suspect computer use won't be. It's a very good idea that Anthropic is upfront that it isn't perfect. And it's guaranteed there will be a viral story where Claude accidentally deletes something important with it.
I'm more interested in Claude 3.5 Haiku, particularly if it is indeed better than the current Claude 3.5 Sonnet at some tasks as claimed.
Seemed like a reasonable number of API calls. For a first public iteration this seems quite nice and a logical progression in tooling. UiPath has a $7bn market cap, and that's only a single player in the automation industry. If they can figure out the quirks, this can be a game changer.
Completely irrelevant, and it might just be me, but I really like Anthropic's understated branding.
OpenAI's branding isn't exactly screaming in your face either, but for something that's generated as much public fear/scaremongering/outrage as LLMs have over the last couple of years, Anthropic's presentation has a much "cosier" veneer to my eyes.
This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of werthers wipe-us-all-out AI, and that means it's going to be OK.
I have to agree. I've been chatting with Claude for the first time in a couple days and while it's very on-par with ChatGPT 4o in terms of capability, it has this difficult-to-quantify feeling of being warmer and friendlier to interact with. I think the human name, serif font, system prompt, and tendency to create visuals contributes to this feeling.
> This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of werthers wipe-us-all-out AI, and that means it's going to be OK.
Ray: I tried to think of the most harmless thing. Something I loved from my childhood. Something that could never ever possibly destroy us. Mr. Stay Puft!
This is actually very relevant: most people think this is just an arms race to see who can get the better percentages on benchmarks, but to me all this technology is useless if we don't give programmers and end users the right interfaces to utilize it.
Anthropic seems to have a better core design and human-computer interaction ethos that shows up all throughout their product and marketing.
I found the “Computer Use” product name funny. Many other companies would’ve used the opportunity to come up with something like “Human Facing Interface Navigation and Task Automation Capabilities” or “HFINTAC”.
I didn't know what Computer Use meant. I read the article and thought to myself: oh, it's using a computer. Makes sense.
Take a read through the user agreements for all the major LLM providers and marvel at the simplicity and customer friendliness of the Anthropic one vs the others.
Not irrelevant at all! Compare their branding to that of Boston Dynamics. Their branding of robots reminds me more of a Black Mirror episode... If Claude was a dog like robot, it sure would look like a golden retriever or something.
Positive AI branding should create a positive public perception which in turn should create a positive attitude towards AI regulation.
Claude is amazing. The project documents functionality makes it a clear leader ahead of ChatGPT and I have found it to be the clear leader in coding assistance over the past few months. Web automation is really exciting.
I look forward to the brave new future where I can code a webapp without ever touching the code, just testing, giving feedback, and explaining discovered bugs to it and it can push code and tweak infrastructure to accomplish complex software engineering tasks all on its own.
Its going to be really wild when Claude (or other AI) can make a list of possible bugs and UX changes and just ask the user for approval to greenlight the change.
And today I realized that despite it being an extremely common activity, we don’t really have a word for “using the computer” which is distinct from “computing”. It’s funny because AI models are always “using a computer” but now they can “use your computer.”
The word is generally "interfacing" (or "programming" for some), but it's just not commonly used by general users. I'd say this is probably because general users' activity centers on using the applications, not the computer itself, even though it takes place on a computer. The computer is thus less commonly the user's object of activity, and more commonly the setting for activity.
Similarly, using our homes is an extremely common 'activity', yet the object-activities that get special words in common use are the ones with a specific application.
Reminds me of the rise in job application bots. People are applying to thousands of jobs using automated tools. It’s probably one of the inevitable use cases of this technology.
It makes me think. Perhaps the act of applying to jobs will go extinct. Maybe the endgame is that as soon as you join a website like Monster or LinkedIn, you immediately “apply” to every open position, and are simply ranked against every other candidate.
> People are applying to thousands of jobs using automated tools
Employers were already screening thousands of applications using automated tools for years. Candidates are catching up to the automation cat-and-mouse game.
The `Hiring Process` in America is definitely BADLY broken. Maybe worldwide afaik. It's a far too difficult, time-consuming, and painful process for everyone involved.
I have a feeling AI can fix this, although I'd never allow an AI bot to interview me. I just mean other ways of using AI to help the process.
Also people are hired for all kinds of reasons having little to do with their qualifications lots of the time, and often due to demographics (race, color, age, etc), and this is another way maybe AI can help by hiding those aspects of a candidate somehow.
If "computer use" feature is able to find it's way in Azure, AAD/Entra, SharePoint settings, etc. - it has a chance of becoming a better user interface for Microsoft products. :)
Can you imagine how simple the world would be if you'd just need to tell Claude: "user X needs to have access to feature Y, please give them the correct permissions", with no need to spend days in AAD documentation and the settings screens maze. I fear AAD is AI-proof, though :)
I have been a paying ChatGPT customer for a long time (since the very beginning). Last week I've compared ChatGPT to Claude and the results (to my eye) were better, the output better structured and the canvas works better. I'm on the edge of jumping ship.
For python, at least, Sonnet’s code is much more elegant, well composed, and thoughtfully written. It also seems to be biased towards more recent code, whereas the gpt models can’t even properly write an api call to itself.
o1 is pretty decent as a roto-rooter, i.e. the type of task that requires both lots of instruction and lots of context. I honestly think half of why it works as well as it does is that it properly mulls through the user's true intent, which usually takes the multiple shots that nobody has the patience for.
Claude is the daily driver. o1 for complicated tasks - for example, questions where linear reasoning is not enough, like advanced Rust ownership questions.
Anthropic's rate limits are very low, sadly, even for paid customers. You can use the API of course, but it's not as convenient and may be more expensive.
I'd jump ship if it weren't for the real time voice chat. It's extremely powerful for beginner conversation language learning. Hoping that a company will make use of the real time api for a dedicated language learning app soon.
Yeah I think I might also jump ship. It’s just that chatGPT now kinda knows who I am and what I like and I’m afraid of losing that. It’s probably not a big deal though.
I think this is good evidence that people's jobs are not being replaced by AI, because no AI would give the product a confusing name like "new Claude 3.5 Sonnet".
Google, OpenAI, and Anthropic are responsibly scaling their models by confusing their customers into using the wrong ones.
When AGI finally is launched, adoption will be responsibly slowed because it is called something like "new new Gemini Giga 12.9.2xo IT" and users will have to select it from dozens of similar names.
The best answer I've seen so far is that "Claude 3.5 Sonnet" is a brand name rather than a specific version. Not saying I agree, just a way to visualize how the team is coming up with marketing.
Not that I'm scared of this update, but I'd probably be alright with pausing LLM development today, at least in regard to producing code.
I don't want an llm to write all my code, regardless of if it works, I like to write code. What these models are capable of at the moment is perfect for my needs and I'd be 100% okay if they didn't improve at all going forward.
Edit: also I don't see how an llm controlled system can ever replace a deterministic system for critical applications.
I have trouble with this too. I'm working on a small side project and while I love ironing out implementation details myself, it's tough to ignore the fact that Claude/GPT4o can create entire working files for me on demand.
It's still enjoyable working at a higher architecture level and discussing the implementation before actually generating any code though.
In a normal critical system you use 3 CPUs. With an LLM you can do 1000-shot majority voting. Approaches like entropix might also reduce hallucinations.
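A toy sketch of what 1000-shot majority voting means here, assuming a hypothetical sample_answer() that calls the model once per invocation:

    from collections import Counter

    def majority_vote(prompt, sample_answer, n=1000):
        # Sample the model n times; return the modal answer and its vote share.
        votes = Counter(sample_answer(prompt) for _ in range(n))
        answer, count = votes.most_common(1)[0]
        return answer, count / n  # vote share doubles as a crude confidence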
Great progress from Anthropic! They really shouldn't change models from under the hood, however. A name should refer to a specific set of model weights, more or less.
On the other hand, as long as its actually advancing the Pareto frontier of capability, re-using the same name means everyone gets an upgrade with no switching costs.
Though, all said, Claude still seems to be somewhat of an insider secret. "ChatGPT" has something like 20x the Google traffic of "Claude" or "Anthropic".
> Great progress from Anthropic! They really shouldn't change models from under the hood, however. A name should refer to a specific set of model weights, more or less.
In the API (https://docs.anthropic.com/en/docs/about-claude/models) they have proper naming you can rely on. I think the shorthand of "Sonnet 3.5" is just the "consumer friendly" name user-facing things will use. The new model in API parlance would be "claude-3-5-sonnet-20241022" whereas the previous one's full name is "claude-3-5-sonnet-20240620"
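e.g., a quick sketch of pinning the exact snapshot via the anthropic Python SDK:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # date-stamped snapshot, not just "Sonnet 3.5"
        max_tokens=256,
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(msg.content[0].text)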
There was a recent article[0] trending on HN about their revenue numbers, split by B2C vs B2B.
Based on it, it seems like Anthropic is 60% of OpenAI API-revenue wise, but just 4% B2C-revenue wise. Though I expect this is partly because the Claude web UI makes 3.5 available for free, and there's not that much reason to upgrade if you're not using it frequently.
Google has the tech (some of it's gathering dust, but they have it). They can use the gameplay tech developed for stadia when a user experiences lag and combine it with their LLM.
Definitely! Those with movement disabilities could have a much easier time if they could just dictate actions to the computer and have them completed with some reliability.
I am curious why "upgraded Claude 3.5 Sonnet" instead of simply Claude 3.6 Sonnet? Minor version increment is a standard way of versioning update. Am i missing something or it is just Anthropic marketing?
Probably because there was no 3.1-3.4, and that the .5 is mostly just to represent that it's an upgrade on Claude 3 but not quite enough to be a Claude 4
Claude is absurdly better at coding tasks than OpenAI. Like it's not even close. Particularly when it comes to hallucinations. Prompt for prompt, I see Claude being rock solid and returning fully executable code, with all the correct imports, while OpenAI struggles to even complete the task and will make up nonexistent libraries/APIs out of whole cloth.
I've been using a lot of o1-mini and having a good experience with it.
Yesterday I decided to try Sonnet 3.5. I asked for a simple but efficient script to perform fuzzy matching of strings in Python. Strangely, it didn't even mention existing fast libraries like FuzzyWuzzy and RapidFuzz - it went on to create everything from scratch using standard libraries. I don't know, I thought this was something basic for it to stumble on.
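For comparison, the library route I expected is a couple of lines (assuming pip install rapidfuzz):

    from rapidfuzz import fuzz, process

    print(fuzz.ratio("strawberry", "strawbery"))                   # similarity ~95
    print(process.extractOne("aple", ["apple", "grape", "pear"]))  # best match + score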
Like find me a list of things to do with a family, given today's weather and in the next 2 hours, quiet sit down with lots of comfy seating, good vegetarian food...
Not only is this kind of use getting around API restrictions, it is also a superior way to do search: Specify arbitrary preferences upfront instead of a search box and trawling different modalities of content to get better result. The possibilities for wellness use cases are endless, especially for end users that care about privacy and less screen use.
- "computer use" is basically using Claude's vision + tool use capability in a loop. There's a reference impl but there's no "claude desktop" app that just comes with this OOTB
- they're basically advertising that they bumped up Claude 3.5's screen vision capability. we discussed the importance of this general computer agent approach with David on our pod https://x.com/swyx/status/1771255525818397122
- @minimaxir points out questions on cost. Note that the vision use is very sparing - the loop is I/O constrained - it waits for the tool to run and then takes a screenshot, then loops. for a simple 10 loop task at max resolution, Haiku costs <1 cent, Sonnet 8 cents, Opus 41 cents.
- beating o1-preview on SWEbench Verified without extended reasoning and at 4x cheaper output per token (a lot cheaper in total tokens since no reasoning tokens) is ABSOLUTE mogging
- New 3.5 Haiku is 68% cheaper than Claude Instant haha
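For the loop itself, the shape is roughly this - a sketch, not the reference impl; model_decide, perform, and take_screenshot stand in for the Anthropic API call and the OS-level executor:

    import time

    def agent_loop(goal, model_decide, perform, take_screenshot, max_steps=10):
        history = []
        for _ in range(max_steps):
            shot = take_screenshot()                    # I/O-bound: grab the screen
            action = model_decide(goal, shot, history)  # vision + tool-use call
            if action["type"] == "done":
                return action.get("result")
            perform(action)                             # click / type / scroll
            history.append(action)
            time.sleep(0.5)                             # let the UI settle
        raise TimeoutError("goal not reached within step budget")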
Cody by Sourcegraph has unlimited code completions for Claude & a very generous monthly message limit. They don't have this new version I think but they roll these out very fast.
Perhaps it's just because English is not my native language, but prompt 3 isn't quite clear at the beginning where it says "group of four. Words (...)". It doesn't explain what the group of four must be. If I add "group of four words" to the prompt, Claude 3.5 manages to answer it; without that, Claude says the prompt isn't clear enough and can't answer.
What a neat bench mark! I'm blown away that o1 absolutely crushes everyone else in this. I guess the chain of thought really hashes out those associations.
The demo shows Sonnet 3.5 using the Google web UI in an automated fashion. Do Google's terms really permit this? Will Google permit it when it's happening at scale?
I wonder how they could combat it if they choose to disallow AI access through human interfaces. Maybe more captchas, anti-AI design language, or even more tracking of the user's movements?
For me, one of the more useful steps on macOS will be when local AI can manipulate anything that has an Apple Script library. The hooks are there and decently documented. For meta purposes, having AI work with a third-party app like Keyboard Maestro or Raycast will even further expand the pre-built possibilities without requiring the local AI to reinvent steps or tools at the time of each prompt.
Nice improvements in scores across the board, e.g.
> On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.
I've been using Sonnet 3.5 for most of my AI-assisted coding and I'm already very happy (using it with the Zed editor, I love the "raw" UX of its AI assistant), so any improvements, especially seemingly large ones like this are very welcome!
I'm still extremely curious about how Sonnet 3.5 itself, and its new iteration are built and differ from the original Sonnet. I wonder if it's in any way based on their previous work[0] which they used to make golden-gate Claude.
It will be interesting to see how this evolves. The UI automation use case is different from accessibility due to latency requirements: latency matters a lot for accessibility, not so much for a UI automation testing apparatus.
I've often wondered what combining grammar-based speech recognition with an LLM could do for accessibility: low-domain natural-language speech recognition augmented by grammar-based speech recognition for high-domain commands, improving efficiency and accuracy while reducing voice strain.
This bolsters my opinion that OpenAI is falling rapidly behind. Presumably due to Sam's political machinations rather than hard-driving technical vision, at least that's what it seems like, outside looking in.
Computer use seems it might be good for e2e tests.
I'm giving the new Sonnet a chance, although for my use as a writing companion so far, Opus has been king among all the models I've tried.
However, I've been using Opus as a writing companion for several months, especially when you have writer's block and ask it for alternative phrases, it was super creative. But in recent weeks I was noticing a degradation in quality. My impression is that the model was degrading. Could this be technically possible? Might it be some kind of programmed obsolescence to hype new models?
Fascinating. Though I expect people to be concerned about privacy implications of sending screenshots of the desktop, similar to the backlash Microsoft has received about their AI products. Giving the remote service actual control of the mouse and keyboard is a whole another level!
But I am very excited about this in the context of accessibility. Screen readers and screen control software is hard to develop and hard to learn to use. This sort of “computer use” with AI could open up so many possibilities for users with disabilities.
I suspect businesses will create VDIs or VMs for this express purpose: one, because it scales better, and two, because you can more easily control what it has access to and isolate those functions.
Of course there's great inefficiency in having the Claude software control a computer with a human GUI mediating everything, but it's necessary for many uses right now given how much we do where only human interfaces are easily accessible. If something like it takes off, I expect interfaces for AI software would be published, standardized, etc. Your customers may not buy software that lacks it.
But what I really want to see is a CLI. Watching their software crank out Bash, vim, Emacs!, etc. - that would be fascinating!
I hope specialized interfaces for AI never happen. I want AI to use human interfaces, because I want to be empowered to use the same interfaces as AI in the future. A future where only AI can do things because it uses an incomprehensible special interface and the human interface is broken or non-existent is a dystopia.
I also want humanoid robots instead of specialized non-humanoid robots for the same reason.
I agree, I bet models could excel at CLI tasks, since the feedback would be immediate and in a language they can readily consume. It's probably much easier for them to handle "command requires 2 arguments and only 1 was provided" than to do image-to-text on an error modal and apply context to figure out what went wrong.
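A small sketch of that feedback loop - ask_model is a hypothetical one-call wrapper around whatever LLM you use:

    import subprocess

    def cli_agent(task, ask_model, max_turns=5):
        feedback = ""
        for _ in range(max_turns):
            cmd = ask_model(task, feedback)  # model proposes a shell command
            proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            if proc.returncode == 0:
                return proc.stdout
            feedback = proc.stderr  # "requires 2 arguments..." goes straight back
        raise RuntimeError("task failed after retries")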
Wow, there's a whole industry devoted to what they're calling "Computer Use" (Robotic Process Automation, or RPA). I wonder how those folks are viewing this.
Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks.
Nice, but I wonder why didn't they use UI automation/accessibility libraries, that have access to the semantic structure of apps/web pages, as well as accessing documents directly instead of having Excel display them for you.
I wonder if the model has difficulties for the same reason some people do - UI affordance has gone down with the flattening, hover-to-see scrollbars, and hamburger-menu-ization of UIs.
I'd like to see a model trained on a Windows 95/NT style UI - would it have an easier time with each UI element having clearly defined edges, clearly defined click and dragability, unified design language, etc.?
I think this is to make the human/user experience better. If you use accessibility features, the user needs to know how to use those features. Similar to another comment here, the UX they're shooting for is "click the red button with cancel on it", and to ship that ASAP.
- AI Labs will eat some of the wrappers on top of their APIs - even complex ones like this. There are whole startups that are trying to build computer use.
- AI is fitting _some_ scaling law - the best models are getting better and the "previously-state-of-the-art" models are fractions of what they cost a couple years ago. Though it remains to be seen if it's like Moore's Law or if incremental improvements get harder and harder to make.
It seems a little silly to pretend there’s a scaling “law” without plotting any points or doing a projection. Without the mathiness, we could instead say that new models keep getting better and we don’t know how long that trend will continue.
How does the computer use work -- Is this a desktop app they are providing that can do actions on your computer? Didn't see any such mention in the post
Tried my standard go-to for testing: asked it to generate a voronoi diagram using p5js. For the sake of job security, I'm relieved to see it still can't do a relatively simple task with ample representation in the Google search results. Granted, p5js is kind of niche, but not terribly so - it's arguably the most popular library for creative coding.
In case you're wondering, I tried o1-preview, and while it did work, I was also initially perplexed why the result looked pixelated. Turns out, that's because many of the p5js examples online use a relatively simple approach where they just see which cell-center each pixel is closest to, more or less. I mean, it works, but it's a pretty crude approach.
Now, granted, you're probably not doing creative coding at your job, so this may not matter that much, but to me it was an example of pretty poor generalization capabilities. Curiously, Claude has no problem whatsoever generating a voronoi diagram as an SVG, but writing a script to generate said diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting to do something similar.
Really hard to get a real sense of capabilities when you're faced with experiences like this, all the while somehow it's able to solve 46% of real-world python pull-requests from a certain dataset. In case you're wondering, one paper (https://cs.paperswithcode.com/paper/swe-bench-enhanced-codin...) found that 94% of the pull-requests on SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there's almost certainly a degree of data-leakage.
It's surprising how much knowledge is not easily googleable and can only be unearthed by deep diving into OSS or asking an expert. I was recently debugging a rather naive gstreamer issue where I was seeing a delay in the processing. ChatGPT, Claude, and Google were all unhelpful. I spent the next couple of days reading the source code, found my answer, and thought it was a bug.
Asked the mailing list, and my problem was solved in 10 seconds by someone who could identify the exact parameter that was missing (and IMO, required some architecture knowledge on how gstreamer worked - and why the unrelatedly named parameter would fix it). The most difficult problems fall into this camp - I don't usually find myself reaching for LLMs when the problem is trivial unless it involves a mountain of boilerplate.
I recently asked it to give me code to do gapless playback of audio files using Apple's AVAudioEngine APIs. It got it wrong and additional prompts to explain why it was wrong didn't help.
To me what it seems like these tools do really well is paraphrase stuff that's in their training data.
I really don't get their model. They have very advanced models, but the service overall seems to be a jumble of priorities. Some examples:
Anthropic doesn't offer an unlimited chatbot service, only plans that give you "more" usage, whatever that means. If you have an API key, you are "unlimited," so they have the capability. Why doesn't the chatbot allow one to use their API key in the Claude app to get unlimited usage? (Yes, I know there are third-party BYOK tools. That's not the question.)
Claude appears to be smart enough to make an Excel spreadsheet with simple formulae. However, it is apparently prevented from making any kind of file. Why? What principle underlies that guardrail that does not also apply to Computer Use?
Really want to make Claude my daily driver, but right now it often feels too much like a research project.
I've seen quite a few YC startups working on AI-powered RPA, and now it looks like a foundational model player is directly competing in their space. It will be interesting to see whether Anthropic will double down on this or leave it to third-party developers to build commercial applications around it.
Is there anything out there yet that will let me issue the command:
> Refactor the api folder with any recommended readability improvements or improvements that would help DRY up code without adding additional complexity.
I tried to get it to translate a document and it stopped after a few paragraphs and asked if I wanted it to keep going. This is not appropriate for my use case and it kept doing this even though I explicitly told it not to. The old version did not do this.
> "... and similar speed to the previous generation of Haiku."
To me this is the most annoying grammatical error. I can't wait for AI to take over all prose writing so this egregious construction finally vanishes from public fora. There may be some downsides -- okay, many -- but at least I won't have to read endless repetitions of "similar speed to ..." when the correct form is obviously "speed similar to".
In fact, in time this correct grammar may betray the presence of AI, since lowly biologicals (meaning us) appear not to either understand or fix this annoying error without computer help.
That's too much control for my taste. I don't want Anthropic to see my screen. I'd prefer a VS Code with integrated Claude - a version that can see all my dev files in a given folder. I don't need it to run Chrome for me.
This is what the Rabbit "large action model" pretended to be. Wouldn't be surprised to see them switch to this and claim they were never lying about their capabilities because it works now.
How long until it's profitable to tell a cheap AI to "win this game by collecting resources and advancing in-game" and then sell the account on eBay?
I wonder what optimizations could be made? Could a gold farmer have the directions from one AI control many accounts? Could the AI program simpler bots for each bit of the game?
I can imagine not being smart enough to play against computers, because I am flagged as a bot. I can imagine a message telling me I am banned because "nobody but a stupid bot would score so low."
If anyone would like to try the new Sonnet in VSCode. I just updated https://double.bot to the new Sonnet. (disclaimer: I am the cofounder/creator)
---
Some thoughts:
* Will be interesting to see what we can build in terms of automatic development loops with the new computer use capabilities.
* I wonder if they are not releasing Opus because it's not done or because they don't have enough inference compute to go around, and Sonnet is close enough to state of the art?
It's a problem we used to work on and perhaps many other people have always wanted to accomplish since 10 years ago. So it's yet to be seen how well it works outside a demo.
What was surprising was the slow/human speed of operations. It types into the text boxes at a human speed rather than just dumping the text there. Is that so the human can better monitor what's happening, or so it doesn't trigger captchas?
This is incredibly cool but it seems like the potential damage from a "hallucination" in this mode is considerable, especially when they provide examples of it going very far off-track (looking up Yellowstone pictures). Would basically need constant monitoring for me not to be paranoid it did something stupid.
Also seems like a privacy issue with them sending screenshots of your device back to their servers.
Anybody know how the hell they're combating / going to combat captchas, Cloudflare blocking, etc.? I remember playing in this space on a toy project and being utterly frustrated by anti-scraping. Maybe one good thing that will come out of this AI boom is that companies will become nicer to scrapers? Or maybe they'll just cut sweetheart deals?
I am surprised it uses macOS for the demo, as I thought it would be harder to control than Ubuntu. But maybe macOS is the most predictable/reliable desktop environment? I noticed that they use a virtual environment for the demo; curious how they built that along with Docker - is it leveraging the latest virtualization framework from Apple?
I skimmed through the computer use code. It's possible to build this with other AI providers too. For instance, you can ask the ChatGPT API to call functions for click, scroll, and type with specific parameters, and execute them using the OS's APIs (usually a11y APIs).
Did I miss something? Did they have to make changes to the model for this?
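A rough sketch of that click/scroll/type idea - tool definitions in OpenAI's function-calling format, with pyautogui as one possible OS-level executor (the tool names are illustrative):

    import json
    import pyautogui

    TOOLS = [
        {"type": "function", "function": {
            "name": "click",
            "description": "Click at absolute screen coordinates",
            "parameters": {"type": "object",
                           "properties": {"x": {"type": "integer"},
                                          "y": {"type": "integer"}},
                           "required": ["x", "y"]}}},
        {"type": "function", "function": {
            "name": "type_text",
            "description": "Type a string at the current focus",
            "parameters": {"type": "object",
                           "properties": {"text": {"type": "string"}},
                           "required": ["text"]}}},
    ]

    def execute(tool_call):
        # Dispatch one tool call returned by the chat completions API.
        args = json.loads(tool_call.function.arguments)
        if tool_call.function.name == "click":
            pyautogui.click(args["x"], args["y"])
        elif tool_call.function.name == "type_text":
            pyautogui.typewrite(args["text"])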
Imagine the possibilities for cyber-crime. Surely you could program it to log in to a financial institution and transfer money. And if you had a list of user names and passwords from some large info breach? You could automate a LOT of transfers in a short amount of time...
This demo is impressive although my initial reaction is a sort of grief that I wasn't born in the timeline where Alan Kay's vision of object-oriented computing was fully realized -- then we wouldn't have to manually reconcile wildly heterogeneous data formats and interfaces in the first place!
OpenAI must be scared at this point. Anthropic is clobbering them at the high end of the market and Meta is providing free AIs at the low end. OpenAI is pretty soon going to be in the valueless middle fighting with tons of other companies for relevance
I wonder if OpenAI will fast follow; usually they're the ones to throw down the gauntlet. That being said, you can play around with OpenAI with a similar architecture of vision + agent + exec + loop using Donobu, though it is constrained to web browsers.
Off-topic, but YouTube doesn't let me view the embedded video, showing a "Sign in to confirm you're not a bot" message. I need to open a dedicated YouTube tab to watch it.
The barrier to scraping YouTube has increased a lot recently; I can barely use yt-dlp anymore.
Hopefully the coding improvements are meaningful because I find that as a coding assistant o1-preview beats it (at least the Claude 3.5 that was available yesterday) but I like Claude's demeanor more (I know this sounds crazy but it matters a bit to me)
Any comments on alignment with Anthropic's mission? Last time I checked, Anthropic is about building SOTA because that is the only way to do safety research. Making money and commercially useful stuff is a means to an end.
Looks like visual understanding of diagrams is improved significantly! For example, it was on par with Chat GPT 4o and Gemini 1.5 in parsing an ERD for a conceptual model, but now far excels over the others.
I suspect they are gonna need some local offload capabilities for Computer Use, the repeated screen reading can definitely be done locally on modern machines, otherwise the cost maybe impractical.
One suggestion: use the following prompt on an LLM:
The combination of the words "computer use" is highly confusing. It's also "Yoda speak". For example, it's hard for humans to parse the sentences *"Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"*, *"Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"* (it literally relies on the comma to make any sense) and *"Computer use for automated interaction"* (in the YouTube vid's title: this one is just broken English). Please suggest terms that are not confusing for a new ability allowing an AI to control a computer as if it were a human.
This "computer use" feature is obviously perfect for automating GUI tests. Will it work on screenshots of mobile devices like smartphones/tables, also?
Since they didn't rev the version, does this mean that if we were using 3.5 today, it's just automatically using the new version? That doesn't seem great from a change management perspective.
Though I am looking forward to using the new one in cursor.ai.
wow, i almost got worried but the cute music and the funny little monster on the desk convinced me that this all just fun and dandy and all will be good. the future is coming and we'll all be much more happy :)
While I was initially impressed with its context window, I got so sick of fighting with Claude about what it was allowed to answer that I quit my subscription after 3 months.
Their whole policing AI models stance is commendable but ultimately renders their tools useless.
It actually started arguing with me about whether it was allowed to help implement a GitHub repository's code as it might be copyrighted... it was MIT-licensed open source from Google :/
Computer use is really going to highlight how fragmented the desktop ecosystem is, but this also paints more context on how Microsoft wants to use their screenshot AI.
Now I am really curious how to programmatically create a sandboxed compute environment to do a self-hosted "Computer use" and see how well other models, including self-hosted Ollama models, can do this.
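A rough sketch of what the core loop could look like against a local Ollama vision model (run it inside a throwaway VM or container; the model name and the "describe the next action" protocol here are my assumptions, not an established interface):

    import base64
    import io
    import pyautogui
    import requests

    def screenshot_b64() -> str:
        buf = io.BytesIO()
        pyautogui.screenshot().save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode()

    # Ask a local vision model (e.g. llava served by Ollama) what to do next.
    resp = requests.post("http://localhost:11434/api/chat", json={
        "model": "llava",   # assumed locally pulled vision model
        "stream": False,
        "messages": [{
            "role": "user",
            "content": "Given this screenshot, reply with the next action as 'click X Y'.",
            "images": [screenshot_b64()],
        }],
    })
    print(resp.json()["message"]["content"])  # parse and execute this in a real loop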
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).
84% Claude 3.5 Sonnet 10/22
80% o1-preview
77% Claude 3.5 Sonnet 06/20
72% DeepSeek V2.5
72% GPT-4o 08/06
71% o1-mini
68% Claude 3 Opus
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!
It's quite sad that application interoperability requires parsing bitmaps instead of exchanging structured information. Feels like a devastating failure in how we do computing.
'Hey Claude 3.5 New, pretend I'm a CEO of a big company and need to lay off 20% people, make me a spreadsheet and send it to HR. Oh make sure to not fire the HR department'
Both new Sonnet and gpt-4o still fail at a simple:
"How many w's are in strawberry?"
gpt-4o: There are 2 "w's" in "strawberry."
Claude 3.5 Sonnet (new): Let me count the w's in "strawberry": 0 w's.
(same question with 'r' succeeds)
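(For the record, the right answer is 1, as a deterministic check confirms:)

    >>> "strawberry".count("w")
    1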
What is artificial about the current gen of "artificial intelligence" is the way training (predict the next token) and benchmarking (overfitting) are done. Perhaps a fresh approach is needed to achieve a true next step.
This is actually a huge deal.
Really good software has automation built in, in the form of macros/scripting.
One of the reasons my bash terminal is more effective than point-and-click is the ease of automating routine tasks: from history and aliases to scripting.
Software interop is trickier: it doesn't just depend on being able to send messages from one piece of software to another; you need an agreed format for those messages (whether they're expressed in the payload of an API or a UI-specific stream of points and clicks).
I tried once to integrate with software used by pharmacists in Australia (which costs a fortune and looks straight out of the 90's). Even though they have a SQL database with all information about everything, the DB is encrypted and they provide no way to access it. I tried contacting the company to learn how I could integrate with it but they offered no help. It's not in their interest to help upcoming competitors and they claim they cannot provide the encryption key as that's for protecting the customer's privacy, forgetting that the pharmacists already have access to all data through their software. But without a way to automatically extract information from the DB, there's no way to migrate to new software as no pharmacist would want to use new software if it cannot "import" the data they need.
It's clear that the reason there's no competition in the market is that the companies (I think it's literally one company) selling this software want it that way, and legislation has yet to catch up with the way they ensure their monopoly.
I'm a bit skeptical about this working well enough to handle exceptions as soon as something out of the ordinary occurs. But it seems this could work great for automated testing.
Has anyone tried asking "use computer" to do "Please write a selenium/capybara/whatever test for filling out this form and sending it?"
That would take away some serious drudge work. And it's not a big problem if it fails, contrary to when it makes a mistake in filling out a form in an actual business process.
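For reference, this is roughly the artifact you'd hope it produces - a minimal Selenium sketch with a hypothetical URL, field names, and success text:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def test_contact_form_submits():
        driver = webdriver.Firefox()
        try:
            driver.get("https://example.com/contact")  # hypothetical form
            driver.find_element(By.NAME, "email").send_keys("test@example.com")
            driver.find_element(By.NAME, "message").send_keys("Hello there!")
            driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
            assert "Thank you" in driver.page_source  # assumed success message
        finally:
            driver.quit()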
LLMs are enabling a reimagining of UI. Where's the biggest opportunity in UI that hasn't kept up to date? Legacy and regulated software in the long tail. That's the disruption opportunity for LLMs.
Not to mention software like PeopleSoft, SAP, and ServiceNow. Absolute shit interfaces that employees have to deal with day in and day out.
Now, an in-house tool built on top of this Anthropic API can save hours of drudgery. I can already see sales teams smiling at the new 'submit your expense report' button.
I think it was pretty clear from the beginning that the whole AI thing is going to be winner-takes-all.
If you're in some niche doing AI development work, you are going to be outcompeted by more generalized AI at some point.
Absolutely! This reminds me of the humanoid robots vs specialized machines debate.
This is pretty similar to the argument for why humanoid robots will be a big deal. The physical world is also built mostly for humans, so having an autonomous agent that can interact with it is huge.
You don’t know for a fact that those two specific packages don’t have supported APIs. Just because the user doesn’t know of any API doesn’t mean none exists. The average accountant or doctor is never going to even ask the vendor “is there an API” because they wouldn’t know what to do with one if there was.
Anthropic blog post outlining the research process: https://www.anthropic.com/news/developing-computer-use
Computer use API documentation: https://docs.anthropic.com/en/docs/build-with-claude/compute...
Computer Use Demo: https://github.com/anthropics/anthropic-quickstarts/tree/mai...
On their "Developing a computer use model" post they have mention > On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.
Here, "next-best AI model in the same category" referes to which model.
This needs to be brought up. Was looking for the demo and ended up on the contact form
Thanks for these. Wonder how many people will use this at work to pretend that they are doing work while they listen to a podcast.
I like its lazy approach
> I apologize, but it seems we're having some issues installing Chromium. As an alternative, since I can see Firefox is already installed on the system, would you like to use Firefox instead? Or would you prefer me to try a different approach to installing Chromium?
It seems unable to figure out how to install Chromium. It sees the Firefox icon out there and just wants to use it. You know, like a real lazy human being.
It seems to be insistent when it needs to be, though.
> I clicked on Firefox and it should start opening. Let me take another screenshot to see if the window has appeared:
> Firefox seems to be taking a moment to start up. Let me wait a few seconds and take another screenshot:
> Let me try clicking Firefox one more time to ensure it starts:
> Let me try one more time to take a screenshot to see if Firefox has opened:
> Let me try starting Firefox using the bash tool as an alternative method:
> Firefox has successfully opened! As per the instructions, I should ignore the setup wizard and instead click directly on the address bar. Would you like me to do anything specific with Firefox now that it's open?
I didn't instruct him to ignore the setup wizard. So my guess is that Anthropic has configured it to ignore stuff happening on the screen so it doesn't go loose.
And here he goes, through my website, through my Hacker News account, and then finds this very comment:
> Looking at his first/most recent comment, it's about a discussion of Claude and computer use. Here's what he wrote:
"I like its lazy approach"
This appears to be a humorous response in a thread about "Computer use, a new Claude 3.5 Sonnet, and Claude..." where he's commenting on an AI's behavior in a situation. The comment is very recent (shown as "8 minutes ago" in the screenshot) and is referring to a situation where an AI seems to have taken a simpler or more straightforward approach to solving a problem.
SYSTEM_PROMPT = f"""<SYSTEM_CAPABILITY> * You are utilising an Ubuntu virtual machine using {platform.machine()} architecture with internet access. * You can feel free to install Ubuntu applications with your bash tool. Use curl instead of wget. * To open firefox, please just click on the firefox icon. Note, firefox-esr is what is installed on your system. * Using bash tool you can start GUI applications, but you need to set export DISPLAY=:1 and use a subshell. For example "(DISPLAY=:1 xterm &)". GUI apps run with bash tool will appear within your desktop environment, but they may take some time to appear. Take a screenshot to confirm it did. * When using your bash tool with commands that are expected to output very large quantities of text, redirect into a tmp file and use str_replace_editor or `grep -n -B <lines before> -A <lines after> <query> <filename>` to confirm output. * When viewing a page it can be helpful to zoom out so that you can see everything on the page. Either that, or make sure you scroll down to see everything before deciding something isn't available. * When using your computer function calls, they take a while to run and send back to you. Where possible/feasible, try to chain multiple of these calls all into one function calls request. * The current date is {datetime.today().strftime('%A, %B %-d, %Y')}. </SYSTEM_CAPABILITY>
<IMPORTANT> * When using Firefox, if a startup wizard appears, IGNORE IT. Do not even click "skip this step". Instead, click on the address bar where it says "Search or enter address", and enter the appropriate search term or URL there. * If the item you are looking at is a pdf, if after taking a single screenshot of the pdf it seems that you want to read the entire document instead of trying to continue to read the pdf from your screenshots + navigation, determine the URL, use curl to download the pdf, install and use pdftotext to convert it to a text file, and then read that text file directly with your StrReplaceEditTool. </IMPORTANT>"""
Now that was most excellent.
So meta! I love this story
I still feel like the difference between Sonnet and Opus is a bit unclear. Somewhere on Anthropic's website it says that Opus is the most advanced, but on other parts it says Sonnet is the most advanced and also the fastest. The UI doesn't make the distinction clear either. Then on Perplexity, Perplexity says that Opus is the most advanced, compared to Sonnet.
And finally, in the table in the blogpost, Opus isn't even included? It seems to me like Opus is the best model they have, but they don't want people to default using it, maybe the ROI is lower on Opus or something?
When I manually tested it, I felt like Opus gave slightly better replies than Sonnet, but I'm not 100% sure it wasn't just placebo.
Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere.
I think they originally announced that Opus would get a 3.5 update, but with every product update they are doing I'm doubting it more and more. It seems like their strategy is to beat the competition on a smaller model that they can train/tune more nimbly and pair it with outside-the-model product features, and it honestly seems to be working.
Opus is a larger and more expensive model. Presumably 3.5 Opus will be the best but it hasn't been released. 3.5 Sonnet is better than 3.0 Opus kind of like how a newer i5 midrange processor is faster and cheaper than an old high-end i7.
Anthropic uses the names Haiku/Sonnet/Opus for the small/medium/large versions of each generation of their models, so within a generation that is also their performance (and cost) order. Evidently Sonnet 3.5 outperforms Opus 3.0 on at least some tasks, but that is not a same-generation comparison.
I'm wondering at this point if they are going to release Opus 3.5 at all, or maybe skip it and go straight to 4.0. It's possible that Haiku 3.5 is a distillation of Opus 3.5.
By reputation -- I can't vouch for this personally, and I don't know if it'll still be true with this update -- Opus is still often better for things like creative writing and conversations about emotional or political topics.
Opus has been stuck on 3.0, so Sonnet 3.5 is better for most things as well as cheaper.
Big/huge models take weeks or months longer than the smaller ones. That's why they release them with that skew.
Sonnet is better for most things. But I do prefer Opus's writing style to Sonnet.
Opus is the biggest and slowest and most expensive one - not the most advanced.
The models "3.5 Sonnet" and "3 Opus" are in my experience nearly at the same level. Once in my last 250 prompts did I run into a problem that 3 Opus was able to solve, but 3.5 Sonnet could not. (I forget the details but it was a combination of logic and trivia knowledge. It is highly likely 3.5 Sonnet would have done a better job with better prompting and richer context, but this was a problem where also I lacked the context and understanding to prompt well.)
Given that 3.5 Sonnet is cheaper and faster than 3 Opus, I default to 3.5 Sonnet so I don't know what the number for the reverse is. How many problems do 3.5 Sonnet get which 3 Opus does not? ¯\_(ツ)_/¯
My best guess would be that it's something in the same kind of range.
Yes, it baffles me that they can't semver the shit out of these properly (Anthropic, Meta, OpenAI, lol).
The "computer use" ability is extremely impressive!
This is a lot more than an agent able to use your computer as a tool (and understanding how to do that) - it's basically an autonomous reasoning agent that you can give a goal to, and it will then use reasoning, as well as its access to your computer, to achieve that goal.
Take a look at their demo of using this for coding.
https://www.youtube.com/watch?v=vH2f7cjXjKI
This seems to be an OpenAI GPT-o1 killer - it may be using an agent to do reasoning (still not clear exactly what is under the hood) as opposed to GPT-o1 supposedly being a model (but still basically a loop around an LLM), but the reasoning it is able to achieve in pursuit of a real world goal is very impressive. It'd be mind boggling if we hadn't had the last few years to get used to this escalation of capabilities.
It's also interesting to consider this from POV of Anthropic's focus on AI safety. On their web site they have a bunch of advice on how to stay safe by sandboxing, limiting what it has access to, etc, but at the end of the day this is a very capable AI able to use your computer and browser to do whatever it deems necessary to achieve a requested goal. How far are we from paperclip optimization, or at least autonomous AI hacking ?
From what I'm seeing on GH, this could technically already have been built, right? Is it not just taking screenshots of the computer screen and deciding what to do from there, looping until it gets to the solution?
I've been saying this is coming for a long time, but my really smart SWE friend who is nevertheless not in the AI/ML space dismissed it as a stupid roundabout way of doing things. That software should just talk via APIs. No matter how much I argued regarding legacy software/websites and how much functionality is really only available through GUI, it seems some people are really put off by this type of approach. To me, who is more embedded in the AI, computer vision, robotics world, the fuzziness of day-to-day life is more apparent.
Just as how expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.
I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.
>it seems some people are really put off by this type of approach
As someone who has had to interact with legacy enterprise systems via RPA (screen scraping and keystroke recording) it is absolutely awful, incredibly brittle, and unmaintainable once you get past a certain level of complexity. Even when it works, performance at scale is terrible.
It's Postel's Law, on steroids. Be liberal in what you accept (with LLMs, that means 'literally anything'), but strict in what you return (which in an LLM is still 'literally anything' but you can constrain that).
Beyond that, I can't help but think of the old thin vs. thick client debate, and I would argue that "software should just talk via APIs" is why, in the web space, everybody is blowing time and energy on building client/server architectures and SPAs instead of basic-ass full-stacks.
It's basically the digital equivalent of humanoid robots - people object because having computers interact with a browser, like building a robot in the form of a human, is incredibly inefficient in theory or if you're designing a system from scratch.
The problem is that we're not starting from scratch - we have a web engineered for browser use and a world engineered for humanoid use. That means an agent that can use a browser, while less efficient than an agent using APIs at any particular task, is vastly more useful because it can complete a much greater breadth of tasks. Same thing with humanoid robots - not as efficient at cleaning the floor as my purpose-built Roomba, but vastly more useful because the breadth of tasks it can accomplish means it can be doing productive things most of the time, as opposed to my Roomba, which is not in use 99% of the time.
I do think that once AI agents become common, the web will increasingly be designed for their use and will move away from the browser, but that will probably take a comparable amount of time as it did for the mobile web to emerge after the iPhone came out. (Actually, that's probably not true - it'll take less time because AI will be doing the work instead of humans.)
I recall 90's Macs had a 3rd party app that offered to observe your mouse/keyboard then automatically recommend routine tasks for you. As a young person I found that fascinating. It's interesting to see history renew itself.
If you want an API, have Claude procedurally test actions and then write a pyautogui/pywinauto/autohotkey etc script to perform it instead. Have it auto-test to verify and classify the general applicability of each action. Repeat for all of life...
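i.e. let the model discover the action once, then freeze it into a plain deterministic script - something like this pyautogui sketch (coordinates, menu layout, and filename are all made up):

    import pyautogui

    # Hypothetical frozen action sequence, emitted by the model and verified once:
    pyautogui.click(412, 230)           # open the "File" menu (assumed coordinates)
    pyautogui.click(412, 310)           # "Export as CSV..." (assumed menu item)
    pyautogui.typewrite("report.csv")   # type the output filename
    pyautogui.press("enter")            # confirm the dialog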
> and how much functionality is really only available through GUI
Isn't the GUI driven by code? Can anything at all in the GUI work that can't be done programmatically?
Not specific to this update, but I wanted to chime in with just how useful Claude has been, and relatively better than ChatGPT and GitHub copilot for daily use. I've been pro for maybe 6 months. I'm not a power user leveraging their API or anything. Just the chat interface, though with ever more use of Projects, lately. I use it every day, whether for mundane answers or curiosities, to "write me this code", to general consultation on a topic. It has replaced search in a superior way and I feel hugely productive with it.
I do still occasionally pop over to ChatGPT to test their waters (or if Claude is just not getting it), but I've not felt any need to switch back or have both. Well done, Anthropic!
Claude 3.5 Opus is no longer mentioned at all on https://docs.anthropic.com/en/docs/about-claude/models
Internet Archive confirms that on the 8th of October that page listed 3.5 Opus as coming "Later this year" https://web.archive.org/web/20241008222204/https://docs.anth...
The fact that it's no longer listed suggests that its release has at least been delayed for an unpredictable amount of time, or maybe even cancelled.
Before anyone reads too much into this, here's what an Anthropic staff said on Discord:
> i don't write the docs, no clue
> afaik opus plan same as its ever been
It's possible that they've determined that Opus no longer makes sense if they're able to focus on continuously optimising Sonnet. That said, Anthropic have been relatively good at setting and managing expectations, so today would have been a good time to make that clear.
Or maybe they released it
One of the funnier things during training with the new API (which can control your computer) was this:
"Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.
Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park."
[0] https://x.com/AnthropicAI/status/1848742761278611504
Next release patch notes:
* Fixed bug where Claude got bored during compile times and started editing Wikipedia articles to claim that birds aren't real
* Blocked news.ycombinator.com in the Docker image's hosts file to avoid spurious flamewar posts (Note: The site is still recovering from the last incident)
* Addressed issue of Claude procrastinating on debugging by creating elaborate ASCII art in Vim
* Patched tendency to rickroll users when asked to demonstrate web scraping
Seeing models act as though they have agency gives me goosebumps (e.g. seeking out photos of Yellowstone for fun). LLMs don't yet have a concept of true intent or agency, but it's wild to think of them acquiring it.
I have been playing with Mindcraft which lets models interact with Minecraft through the bot API and one of them started saying things like "I want to place some cobblestone there" and then later more general "I want to do X" and then start playing with the available commands, it was pretty cool to watch it explore.
https://github.com/kolbytn/mindcraft
At least now we know SkyClaude’s plan to end human civilization.
It’s planning on triggering a Yellowstone caldera super eruption.
I think the best use case for AI `Computer Use` would be simple positioning of the mouse and asking for confirmation before a click. For most use cases this is all people will want/need. If you don't know how to do something, it is basically teaching you how, rather than taking full control and doing things so fast you don't have time to stop it going rogue.
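The interaction pattern could be as simple as this sketch (the model only proposes coordinates; the function name and coordinates are assumptions):

    import pyautogui

    def propose_click(x: int, y: int, label: str) -> None:
        # Move the cursor so the user can see what would be clicked...
        pyautogui.moveTo(x, y, duration=0.5)
        # ...and only click after explicit confirmation.
        if input(f"Click '{label}' at ({x}, {y})? [y/N] ").strip().lower() == "y":
            pyautogui.click(x, y)

    propose_click(640, 410, "Submit")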
You'll know AGI is here when it takes time out to go talk to ChatGPT, or another instance of itself, or maybe goes down a rabbit hole of watching YouTube music videos.
> Claude accidentally stopped a long-running screen recording,
It's kind of interesting that they're not running a 2-PC setup with an HDMI splitter, but (presumably) just laptops and screen recording apps...
It enjoyed nature photos of the Yellowstone national park? I rather expected it would prefer images of big data centers.
In 2015, when I was asked by friends if I'm worried about Self driving Cars and AI, I answered: "I'll start worrying about AI when my Tesla starts listening to the radio because it's bored." ... that didn't take too long
This is, craaaaaazzzzzy. I'm just a layman, but to me, this is the most compelling evidence that things are starting to tilt toward AGI that I've ever seen.
This needs more discussion:
Claude using Claude on a computer for coding https://youtu.be/vH2f7cjXjKI?si=Tw7rBPGsavzb-LNo (3 mins)
True end-user programming and product manager programming are coming, probably pretty soon. Not the same thing, but Midjourney went from v.1 to v.6 in less than 2 years.
If something similar happens, most jobs that could be done remotely will be automatable in a few years.
Every time I see this argument made, there seems to be a level of complexity and/or operational cost above which people throw up their hands and say "well of course we can't do that".
I feel like we will see that again here as well. It really is similar to the self-driving problem.
> True end-user programming and product manager programming are coming
This means that either product managers will have to start (effectively) writing in-depth specs again, or they will have to learn to accept the LLM's ideas in a way that most have not accepted their human programmers' ideas.
Definitely will be interesting to see how that plays out.
Idk, LLMs have basically stopped improving for over a year now. And in their current state, no matter how many abstractions you add to them - or how you chain them - they are not even close to capable of replacing even simple jobs.
> True end-user programming and product manager programming are coming, probably pretty soon.
I'm placing my bets rather on this new object-oriented programming thing. It will make programming jobs obsolete any day now...
> If something similar happens, most jobs that could be done remotely will be automatable in a few years.
I'd be willing to bet a large amount of money this doesn't happen, assuming "most" means >50% and "a few" is <5.
It makes me wonder if people that make these claims have an actual job. Because if they did then I doubt anyone could make that claim with a straight face.
> If something similar is the case, most jobs that can be done remotely will be automatable in a couple of years.
I'm really curious about the cost of that sort of thing. It seems astronomical atm, but as much as I get shocked at the today-cost, staffing is also a pretty insane cost.
> This needs more discussion:
"Create a simple website" has to be one of the most common blog / example out there in about every programming language.
It can automate stuff? That's cool: I already did automate screenshots and then AI looking if it looks like phishing or not (and it's quite good at it).
I mean: the "Claude using Claude" may seem cool, but I dispute the "for coding" part. That's trivial stuff. A trivial error (which it doesn't fix btw: it just deletes everything).
'Claude, write me code to bring SpaceX rockets back on earth"
or
"Claude, write me code to pilot a machine to treat a tumor with precision"
This was not it.
I am sure it will do great handling error cases and pixel-perfect UI.
openinterpreter has been doing this for a while, with a bunch of LLMs, glad to see first party support for this use case
and how is Midjourney doing? did it change the world?
I wrote up some of my own notes on Computer Use here: https://simonwillison.net/2024/Oct/22/computer-use/
Molmo was released recently and is able to provide point coordinates for objects in images. I've been testing it out and am currently building an automation tool that allows users to more easily control a computer. Looks like Anthropic built a better one.
Edit: it seems like these new features will eliminate a lot of automated testing tools we have today.
Code for molmo coordinate tests https://github.com/logankeenan/molmo-server
From the computer use video demo, that's a lot of API calls. Even though Claude 3.5 Sonnet is relatively cheap for its performance, I suspect computer use won't be. It's good that Anthropic is upfront that it isn't perfect. And it's guaranteed that there will be a viral story where Claude accidentally deletes something important with it.
I'm more interested in Claude 3.5 Haiku, particularly if it is indeed better than the current Claude 3.5 Sonnet at some tasks as claimed.
Seemed like a reasonable number of API calls. For a first public iteration this seems quite nice and a logical progression in tooling. UiPath has a $7bn market cap, and that's only a single player in the automation industry. If they can figure out the quirks, this can be a game changer.
It's just bizarre to force a computer to go through a GUI to use another computer. Of course it's going to be expensive.
I suspect these models have been getting smaller on the back-end, and the GPUs have been getting bigger. It's probably not a huge deal.
Completely irrelevant, and it might just be me, but I really like Anthropic's understated branding.
OpenAI's branding isn't exactly screaming in your face either, but for something that's generated as much public fear/scaremongering/outrage as LLMs have over the last couple of years, Anthropic's presentation has a much "cosier" veneer to my eyes.
This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of Werther's wipe-us-all-out AI, and that means it's going to be OK.
I have to agree. I've been chatting with Claude for the first time in a couple days and while it's very on-par with ChatGPT 4o in terms of capability, it has this difficult-to-quantify feeling of being warmer and friendlier to interact with. I think the human name, serif font, system prompt, and tendency to create visuals contributes to this feeling.
As a Kurt Vonnegut fan, their asterisk logo on claude.ai always amuses me. It must be intentional:
https://en.m.wikipedia.org/wiki/File:Claude_Ai.svg
https://www.redmolotov.com/vonnegut-ahole-tshirt
Anthropic has recently begun a new, big ad campaign (ads in Times Square) that more-or-less takes potshots at OpenAI. https://www.reddit.com/r/singularity/comments/1g9e0za/anthro...
> This isn't the Skynet Terminator wipe-us-all-out AI, it's the adorable grandpa with a bag of werthers wipe-us-all-out AI, and that means it's going to be OK.
Ray: I tried to think of the most harmless thing. Something I loved from my childhood. Something that could never ever possibly destroy us. Mr. Stay Puft!
Venkman: Nice thinkin', Ray.
This is actually very relevant: most people think this is just an arms race to see who can get better percentages on benchmarks, but to me all this technology is useless if we don't give programmers and end users the right interfaces to utilize it.
Anthropic seems to have a better core design and human-computer interaction ethos that shows up all throughout their product and marketing.
I wrote on the topic as well: https://blog.frankdenbow.com/statement-of-purpose/
I found the “Computer Use” product name funny. Many other companies would’ve used the opportunity to come up with something like “Human Facing Interface Navigation and Task Automation Capabilities” or “HFINTAC”.
I didn't know what Computer Use meant. I read the article and thought to myself: oh, it's using a computer. Makes sense.
I find myself wanting to say please and thank you to Claude when I didn't have the reflex to do that with chatgpt. Very successful branding.
Take a read through the user agreements for all the major LLM providers and marvel at the simplicity and customer friendliness of the Anthropic one vs the others.
Not irrelevant at all! Compare their branding to that of Boston Dynamics. Their branding of robots reminds me more of a Black Mirror episode... If Claude was a dog like robot, it sure would look like a golden retriever or something. Positive AI branding should create a positive public perception which in turn should create a positive attitude towards AI regulation.
Claude is amazing. The project documents functionality makes it a clear leader ahead of ChatGPT and I have found it to be the clear leader in coding assistance over the past few months. Web automation is really exciting.
I look forward to the brave new future where I can code a webapp without ever touching the code, just testing, giving feedback, and explaining discovered bugs to it and it can push code and tweak infrastructure to accomplish complex software engineering tasks all on its own.
Its going to be really wild when Claude (or other AI) can make a list of possible bugs and UX changes and just ask the user for approval to greenlight the change.
And today I realized that despite it being an extremely common activity, we don’t really have a word for “using the computer” which is distinct from “computing”. It’s funny because AI models are always “using a computer” but now they can “use your computer.”
The word is generally "interfacing" (or "programming", for some), but it's just not commonly used by general users. I'd say this is probably because the activity general users focus on is the use of applications, not the computer itself, despite it being instanced on a computer. Thus the computer is commonly less the user's object of activity and more the setting for activity.
Similarly, using our homes is an extremely common 'activity', yet the object-activities that get special words in common use are the ones with a specific user application.
Computering
With all the buzz around "agent" I'm surprised they didn't brand it "Computer Agent"
In English at least. In other languages there are.
What is the impact on desktop programs' UI if it's just being manipulated by AI?
Operating a computer?
Reminds me of the rise in job application bots. People are applying to thousands of jobs using automated tools. It’s probably one of the inevitable use cases of this technology.
It makes me think. Perhaps the act of applying to jobs will go extinct. Maybe the endgame is that as soon as you join a website like Monster or LinkedIn, you immediately “apply” to every open position, and are simply ranked against every other candidate.
> People are applying to thousands of jobs using automated tools
Employers were already screening thousands of applications using automated tools for years. Candidates are catching up to the automation cat-and-mouse game.
The `Hiring Process` in America is definitely BADLY broken. Maybe worldwide afaik. It's a far too difficult, time-consuming, and painful process for everyone involved.
I have a feeling AI can fix this, although I'd never allow an AI bot to interview me. I just mean other ways of using AI to help the process.
Also, people are hired for all kinds of reasons having little to do with their qualifications a lot of the time, often due to demographics (race, color, age, etc.), and this is another way maybe AI can help - by hiding those aspects of a candidate somehow.
I've found that doing some research and finding the phone number of the hiring person and calling them directly is very powerful.
Pretty cool! I use Claude 3.5 to control a robot (ARKit/iOS based) and it does surprisingly well in the real world: https://youtu.be/-iW3Vzzr3oU?si=yzu2SawugXMGKlW9
That looks pretty cool, congrats! How feasible is it to be a product by itself? Did you try with a local edge model?
Great work by Anthropic!
After paying for ChatGPT and OpenAI API credits for a year, I switched to Claude when they launched Artifacts and never looked back.
Claude Sonnet 3.5 is already so good, especially at coding. I'm looking forward to testing the new version if it is, indeed, even better.
Sonnet 3.5 was a major leap forward for me personally, similar to the GPT-3.5 to GPT-4 bump back in the day.
How are you using it with coding?
If "computer use" feature is able to find it's way in Azure, AAD/Entra, SharePoint settings, etc. - it has a chance of becoming a better user interface for Microsoft products. :)
Can you imagine how simple the world would be if you'd just need to tell Claude: "user X needs to have access to feature Y, please give them the correct permissions", with no need to spend days in AAD documentation and the settings screens maze. I fear AAD is AI-proof, though :)
Sure, Ted. I’ve let user HAL access feature “door locks”. I’ve corrected all permissions accordingly.
I have been a paying ChatGPT customer for a long time (since the very beginning). Last week I compared ChatGPT to Claude, and the results (to my eye) were better; the output was better structured and the canvas works better. I'm on the edge of jumping ship.
For Python, at least, Sonnet's code is much more elegant, well composed, and thoughtfully written. It also seems to be biased towards more recent code, whereas the GPT models can't even properly write an API call to itself.
o1 is pretty decent as a roto-rooter, i.e. for the type of task that requires both lots of instruction and lots of context. I honestly think half the reason it works as well as it does is that it's able to properly mull through the true intent of the user, which usually takes the multiple shots that nobody has the patience for.
Claude is the daily driver. GPT-O1 for complicated tasks. For example, questions where linear reasoning is not enough like advanced rust ownership questions.
I jumped ship in April of this year and haven’t looked back.
Use the best tool available for your needs. Don’t get trapped by a feeling of sunk cost.
Anthropic's rate limits are very low, sadly, even for paid customers. You can use the API of course, but it's not as convenient and may be more expensive.
I'd jump ship if it weren't for the real time voice chat. It's extremely powerful for beginner conversation language learning. Hoping that a company will make use of the real time api for a dedicated language learning app soon.
interesting. i couldn’t imagine giving up o1-preview right now even with just 30/week.
and i do get a some bit of value from advanced voice mode, although it would be a lot more if it were unlimited
i started liking ai as a tool for coding once i switched to claude.
> I'm on the edge of jumping ship.
Yeah I think I might also jump ship. It’s just that chatGPT now kinda knows who I am and what I like and I’m afraid of losing that. It’s probably not a big deal though.
I think this is good evidence that people's jobs are not being replaced by AI, because no AI would give the product a confusing name like "new Claude 3.5 Sonnet".
I wonder why they didn't choose a "point update" scheme, like bumping it up to v3.6, for example. I agree, the naming is super confusing.
Google, OpenAI, and Anthropic are responsibly scaling their models by confusing their customers into using the wrong ones.
When AGI finally is launched, adoption will be responsibly slowed because it is called something like "new new Gemini Giga 12.9.2xo IT" and users will have to select it from dozens of similar names.
Why on god's green earth is it not just called Claude 3.6 Sonnet. Or Claude 4 Sonnet.
I don't actually care what the answer is. There's no answer that will make it make sense to me.
The best answer I've seen so far is that "Claude 3.5 Sonnet" is a brand name rather than a specific version. Not saying I agree, just a way to visualize how the team is coming up with marketing.
A bit like the new Gemini Pro 1.5-002 release.
Not that I'm scared of this update, but I'd probably be alright with pausing LLM development today, at least in regard to producing code.
I don't want an LLM to write all my code, regardless of whether it works; I like to write code. What these models are capable of at the moment is perfect for my needs, and I'd be 100% okay if they didn't improve at all going forward.
Edit: also, I don't see how an LLM-controlled system can ever replace a deterministic system for critical applications.
I have trouble with this too. I'm working on a small side project and while I love ironing out implementation details myself, it's tough to ignore the fact that Claude/GPT4o can create entire working files for me on demand.
It's still enjoyable working at a higher architecture level and discussing the implementation before actually generating any code though.
In a normal critical system you use 3 CPUs. With an LLM you can do 1000-shot majority voting. Approaches like entropix might also reduce hallucinations.
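A sketch of what N-shot majority voting could look like with the Anthropic SDK (the prompt, N, and model choice here are arbitrary, and real critical systems would need far more than this):

    from collections import Counter
    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

    def majority_vote(prompt: str, n: int = 5) -> str:
        answers = []
        for _ in range(n):
            msg = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=32,
                messages=[{"role": "user", "content": prompt}],
            )
            answers.append(msg.content[0].text.strip())
        # Return the most common answer across the n samples.
        return Counter(answers).most_common(1)[0][0]

    print(majority_vote("How many w's are in 'strawberry'? Answer with a single digit."))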
Great progress from Anthropic! They really shouldn't change models from under the hood, however. A name should refer to a specific set of model weights, more or less.
On the other hand, as long as its actually advancing the Pareto frontier of capability, re-using the same name means everyone gets an upgrade with no switching costs.
Though, all said, Claude still seems to be somewhat of an insider secret. "ChatGPT" has something like 20x the Google traffic of "Claude" or "Anthropic".
https://trends.google.com/trends/explore?date=now%201-d&geo=...
> Great progress from Anthropic! They really shouldn't change models from under the hood, however. A name should refer to a specific set of model weights, more or less.
In the API (https://docs.anthropic.com/en/docs/about-claude/models) they have proper naming you can rely on. I think the shorthand of "Sonnet 3.5" is just the "consumer friendly" name user-facing things will use. The new model in API parlance would be "claude-3-5-sonnet-20241022" whereas the previous one's full name is "claude-3-5-sonnet-20240620"
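So if you want stable behavior, pin the dated snapshot in the API call - a minimal sketch with the Anthropic Python SDK:

    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # dated snapshot, not a floating alias
        max_tokens=100,
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(msg.content[0].text)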
There was a recent article[0] trending on HN a about their revenue numbers, split by B2C vs B2B.
Based on it, it seems like Anthropic is 60% of OpenAI API-revenue wise, but just 4% B2C-revenue wise. Though I expect this is partly because the Claude web UI makes 3.5 available for free, and there's not that much reason to upgrade if you're not using it frequently.
[0]: https://www.tanayj.com/p/openai-and-anthropic-revenue-breakd...
Traveling to the US recently, I was surprised to see Claude ads around the city/in the airport. It seems like they're investing on marketing there.
In my country I've never seen anyone mention them at all.
Maybe LLM's helping blind people like me play video games that aren't accessible to us normally, is getting closer!
Google has the tech (some of it's gathering dust, but they have it). They can use the gameplay tech developed for stadia when a user experiences lag and combine it with their LLM.
Definitely! Those with movement disabilities could have a much easier time if they could just dictate actions to the computer and have them completed with some reliability.
I am curious why "upgraded Claude 3.5 Sonnet" instead of simply Claude 3.6 Sonnet? A minor version increment is the standard way of versioning updates. Am I missing something, or is it just Anthropic marketing?
Probably because there was no 3.1-3.4, and that the .5 is mostly just to represent that it's an upgrade on Claude 3 but not quite enough to be a Claude 4
Claude is absurdly better at coding tasks than OpenAI. Like it's not even close. Particularly when it comes to hallucinations. Prompt for prompt, I see Claude being rock solid and returning fully executable code, with all the correct imports, while OpenAI struggles to even complete the task and will make up nonexistent libraries/APIs out of whole cloth.
Does anyone know _why_ it’s so much better at coding? Better architecture, better training data, better RLHF?
I've been using a lot of o1-mini and having a good experience with it.
Yesterday I decided to try Sonnet 3.5. I asked for a simple but efficient script to perform fuzzy matching of strings in Python. Strangely, it didn't even mention existing fast libraries like FuzzyWuzzy and RapidFuzz - it went on to create everything from scratch using standard libraries. I don't know, I thought this was something basic for it to stumble on.
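For comparison, the library route is only a couple of lines (RapidFuzz; the choices list is just an example):

    from rapidfuzz import fuzz, process

    choices = ["apple pie", "apple crumble", "banana bread"]
    # Best fuzzy match for a (misspelled) query, with a 0-100 similarity score.
    match, score, index = process.extractOne("aple pie", choices, scorer=fuzz.WRatio)
    print(match, round(score))  # -> "apple pie" with a high score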
Yeah, Sonnet is noticeably better. To the point that OpenAI is almost unusable - too many small errors.
This can power one of my favorite use-cases.
Like find me a list of things to do with a family, given today's weather and in the next 2 hours, quiet sit down with lots of comfy seating, good vegetarian food...
Not only does this kind of use get around API restrictions, it is also a superior way to do search: specify arbitrary preferences upfront instead of using a search box, and trawl different modalities of content for better results. The possibilities for wellness use cases are endless, especially for end users who care about privacy and less screen use.
my quick notes on Computer Use:
- "computer use" is basically using Claude's vision + tool use capability in a loop. There's a reference impl but there's no "claude desktop" app that just comes with this OOTB
- they're basically advertising that they bumped up Claude 3.5's screen vision capability. we discussed the importance of this general computer agent approach with David on our pod https://x.com/swyx/status/1771255525818397122
- @minimaxir points out questions on cost. Note that the vision use is very sparing - the loop is I/O constrained - it waits for the tool to run and then takes a screenshot, then loops. For a simple 10-loop task at max resolution, Haiku costs <1 cent, Sonnet 8 cents, Opus 41 cents (see the back-of-envelope sketch after these notes).
- beating o1-preview on SWEbench Verified without extended reasoning and at 4x cheaper output per token (a lot cheaper in total tokens since no reasoning tokens) is ABSOLUTE mogging
- New 3.5 Haiku is 68% cheaper than Claude Instant haha
references i had to dig a bit to find
- https://www.anthropic.com/pricing#anthropic-api
- https://docs.anthropic.com/en/docs/build-with-claude/vision#...
- loop code https://github.com/anthropics/anthropic-quickstarts/blob/mai...
- some other screenshots https://x.com/swyx/status/1848751964588585319
- https://x.com/alexalbert__/status/1848743106063306826
- model card https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Cla...
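Back-of-envelope on the cost note above, assuming Anthropic's documented ~(width x height)/750 tokens-per-image estimate and $3/M input tokens for Sonnet (this ignores output tokens and the growing conversation context, so the real number lands higher, in line with the ~8 cent figure):

    # Rough input-token cost of a 10-screenshot loop at 1280x800.
    w, h, loops = 1280, 800, 10
    tokens_per_shot = w * h / 750            # ~1,365 tokens per screenshot
    input_tokens = loops * tokens_per_shot   # ~13,653 tokens
    cost_usd = input_tokens / 1_000_000 * 3.00
    print(f"~{input_tokens:,.0f} tokens -> ${cost_usd:.3f}")  # ~$0.041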
Haven't used vision models before - can someone comment on whether they are good at "pointing at things"? E.g. given a picture, give the coordinates of the text "foo".
This is the key to accurate control; it needs to be very precise.
Maybe Claude's model is trained for this. Also, what about open-source vision models - are any good at "pointing at things" on a typical computer screen?
See https://github.com/OpenAdaptAI/OpenAdapt for an open source implementation that includes a desktop app OOTB.
Is there an easy way to use Claude as a Co-Pilot in VS Code? If it is better at coding, it would be great to have it integrated.
You can use it in Cursor - called "Cursor Tab"
IMO Cursor Tab performs much better than Co-Pilot, easily works through things that would cause Co-Pilot to get stuck, you should give it a try
For Copilot-like use, Continue is the plugin you're looking for, though I would suggest using a cheaper/faster model to get inline completions.
For Cursor-like use (giving prompts and letting it create and modify files across the project), Cline – previously Claude Dev – is pretty good.
Cody by Sourcegraph has unlimited code completions for Claude & a very generous monthly message limit. They don't have this new version I think but they roll these out very fast.
You can easily use a plugin like https://www.continue.dev/ and configure it to use Claude 3.5 Sonnet.
Codeium (cheapest), double.bot and continue.dev (with api key) have Claude in chat.
https://github.com/cline/cline (with api key) has Claude as agent.
Tabnine includes Claude as an option. I've been using it to compare Claude Sonnet to Chatgpt-4o and Sonnet is clearly much better.
You can use Cursor (VS fork) with private Anthropic key
Cursor uses Claude as its base model.
There may be extensions for VScode to do it but it will never be allowed in Copilot unless MS and OpenAI have a falling out.
Continue.dev's VS Code extension is fantastic for this
It improves to 25.9 over the previous version of Claude 3.5 Sonnet (24.4) on NYT Connections: https://github.com/lechmazur/nyt-connections/.
Perhaps it's just because English is not my native language, but prompt 3 isn't quite clear at the beginning where it says "group of four. Words (...)". It is not explained what the group of four must be. If I add "group of four words" to the prompt, Claude 3.5 manages to answer it; without it, Claude says it isn't clear and can't answer.
What a neat benchmark! I'm blown away that o1 absolutely crushes everyone else in this. I guess the chain of thought really hashes out those associations.
This "Computer use" demo:
https://www.youtube.com/watch?v=jqx18KgIzAE
shows Sonnet 3.5 using the Google web UI in an automated fashion. Do Google's terms really permit this? Will Google permit this when it is happening at scale?
I wonder how they could combat it if they choose to disallow AI access through human interfaces. Maybe more captchas, anti-AI design language, or even more tracking of the user's movements?
For me, one of the more useful steps on macOS will be when local AI can manipulate anything that has an AppleScript library. The hooks are there and decently documented. For meta purposes, having AI work with a third-party app like Keyboard Maestro or Raycast will expand the pre-built possibilities even further, without requiring the local AI to reinvent steps or tools at the time of each prompt.
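Even a local model with nothing but shell access could drive AppleScript-able apps through the stock osascript runner - a sketch (the Finder query is just an example command):

    import subprocess

    def run_applescript(script: str) -> str:
        # osascript is the built-in macOS AppleScript interpreter.
        result = subprocess.run(["osascript", "-e", script],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()

    print(run_applescript('tell application "Finder" to get name of home'))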
This looks quite fantastic!
Nice improvements in scores across the board, e.g.
> On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.
I've been using Sonnet 3.5 for most of my AI-assisted coding and I'm already very happy (using it with the Zed editor, I love the "raw" UX of its AI assistant), so any improvements, especially seemingly large ones like this are very welcome!
I'm still extremely curious about how Sonnet 3.5 itself, and its new iteration are built and differ from the original Sonnet. I wonder if it's in any way based on their previous work[0] which they used to make golden-gate Claude.
[0]: https://transformer-circuits.pub/2024/scaling-monosemanticit...
I'm waiting for Aider benchmark
It will be interesting to see how this evolves. The UI automation use case is different from accessibility due to latency requirements: latency matters a lot for accessibility, not so much for a UI automation testing apparatus.
I've often wondered what the combination of grammar-based speech recognition and an LLM could do for accessibility: low-domain natural-language speech recognition augmented by grammar-based recognition of high-domain commands, for efficiency and accuracy, reducing voice strain and increasing recognition accuracy.
https://github.com/dictation-toolbox/dragonfly
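For the grammar-based half, a minimal dragonfly rule looks something like this (the mappings are arbitrary, and a supported speech engine backend must already be running):

    from dragonfly import Grammar, Key, MappingRule, Text

    class EditCommands(MappingRule):
        # Spoken phrase -> action; recognized reliably because the engine
        # only has to match against this small fixed grammar.
        mapping = {
            "save file": Key("c-s"),
            "insert signature": Text("-- sent by voice"),
        }

    grammar = Grammar("edit commands")
    grammar.add_rule(EditCommands())
    grammar.load()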
This bolsters my opinion that OpenAI is falling rapidly behind. Presumably due to Sam's political machinations rather than hard-driving technical vision, at least that's what it seems like, outside looking in.
Computer use seems like it might be good for e2e tests.
Livebench updated
https://livebench.ai
> we have provided three tools
> bash shell
November 2024: AI is allowed to execute commands in a bash shell. What could possibly go wrong?
Can this solve CAPTCHAs for me? It's starting to get to the point where limited biological brains can't do them.
I'm giving the new Sonnet a chance, although for my use as a writing companion so far, Opus has been king among all the models I've tried.
However, I've been using Opus as a writing companion for several months - especially when you have writer's block and ask it for alternative phrases, it was super creative. But in recent weeks I've noticed a degradation in quality; my impression is that the model is degrading. Could this be technically possible? Might it be some kind of planned obsolescence to hype new models?
Your expectations could just be increasing as you start taking it for granted and are using other models.
Fascinating. Though I expect people to be concerned about privacy implications of sending screenshots of the desktop, similar to the backlash Microsoft has received about their AI products. Giving the remote service actual control of the mouse and keyboard is a whole another level!
But I am very excited about this in the context of accessibility. Screen readers and screen control software is hard to develop and hard to learn to use. This sort of “computer use” with AI could open up so many possibilities for users with disabilities.
The key difference is that Microsoft Recall wasn't opt-in.
There's such a gulf between choosing to send screenshots to Anthropic and Microsoft recording screenshots without user intent or consent.
I suspect businesses will create VDIs or VMs for this express purpose: one, because it scales better, and two, because you can more easily control what it has access to and isolate those functions.
> I expect people to be concerned about privacy implications of sending screenshots of the desktop
That's why in https://github.com/OpenAdaptAI/OpenAdapt we've built in several state-of-the-art PII/PHI scrubbers.
Of course there's great inefficiency in having the Claude software control a computer with a human GUI mediating everything, but it's necessary for many uses right now given how much we do where only human interfaces are easily accessible. If something like it takes off, I expect interfaces for AI software would be published, standardized, etc. Your customers may not buy software that lacks it.
But what I really want to see is a CLI. Watching their software crank out Bash, vim, Emacs!, etc. - that would be fascinating!
I hope specialized interfaces for AI never happen. I want AI to use human interfaces, because I want to be empowered to use the same interfaces as AI in the future. A future where only AI can do things because it uses an incomprehensible special interface and the human interface is broken or non-existent is a dystopia.
I also want humanoid robots instead of specialized non-humanoid robots for the same reason.
I agree, I bet models could excel at CLI tasks since the feedback would be immediate and in a language they can readily consume. It's probably much easier for them to handle "command requires 2 arguments and only 1 was provided" than to do image-to-text on an error modal and apply context to figure out what went wrong.
Wow, there's a whole industry devoted to what they're calling "Computer Use" (Robotic Process Automation, or RPA). I wonder how those folks are viewing this.
> Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks.
Nice, but I wonder why they didn't use UI automation/accessibility libraries, which have access to the semantic structure of apps/web pages, as well as accessing documents directly instead of having Excel display them for you.
We use operating system accessibility APIs when available in https://github.com/OpenAdaptAI/OpenAdapt.
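For contrast with pixel-based control, here's a minimal sketch (assuming Windows and the pywinauto library) of walking an app's semantic UI tree through the UI Automation backend instead of taking screenshots:

```python
# Sketch: enumerate an application's controls via Windows UI Automation
# instead of screenshots (assumes pywinauto and a running Notepad).
from pywinauto import Application

app = Application(backend="uia").connect(title_re=".*Notepad.*")
window = app.top_window()
for ctrl in window.descendants():
    # Each element exposes a control type and accessible text/name.
    print(ctrl.element_info.control_type, repr(ctrl.window_text()))
```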
I wonder if the model has difficulties for the same reason some people do - UI affordability has gone down with the flattening, hover-to-see scrollbar, hamburger-menu-ization of UIs.
I'd like to see a model trained on a Windows 95/NT style UI - would it have an easier time with each UI element having clearly defined edges, clearly defined click and dragability, unified design language, etc.?
I think this is to make the human/user experience better. If you use accessibility features, then the user needs to know how to use those features. Similar to another comment here, the UX they're shooting for is "click the red button with cancel on it", and shipping that ASAP.
> developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text.
So, this is how AI takes over the world.
Seems like both:
- AI Labs will eat some of the wrappers on top of their APIs - even complex ones like this. There are whole startups that are trying to build computer use.
- AI is fitting _some_ scaling law - the best models are getting better and the "previously-state-of-the-art" models are fractions of what they cost a couple years ago. Though it remains to be seen if it's like Moore's Law or if incremental improvements get harder and harder to make.
It seems a little silly to pretend there’s a scaling “law” without plotting any points or doing a projection. Without the mathiness, we could instead say that new models keep getting better and we don’t know how long that trend will continue.
How does the computer use work -- is this a desktop app they're providing that can do actions on your computer? I didn't see any mention of that in the post.
Quickstart is here: https://github.com/anthropics/anthropic-quickstarts/tree/mai...
It's a sandboxed compute environment, using gVisor or Firecracker or similar, which exposes a browser environment to the LLM.
modal.com's modal.Sandbox can be the compute layer for this. It uses gVisor under the hood.
See https://github.com/OpenAdaptAI/OpenAdapt for an open source alternative that includes a desktop app.
It is a Docker container providing a remote desktop you can watch; they strongly recommend you also run it inside a VM.
Tried my standard go-to for testing, asked it to generate a voronoi diagram using p5js. For the sake of job security I'm relieved to see it still can't do a relatively simple task with ample representation in the Google search results. Granted, p5js is kind of niche, but not terribly so. It's arguably the most popular library for creative coding.
In case you're wondering, I tried o1-preview, and while it did work, I was also initially perplexed why the result looked pixelated. Turns out, that's because many of the p5js examples online use a relatively simple approach where they just see which cell-center each pixel is closest to, more or less. I mean, it works, but it's a pretty crude approach.
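For reference, here is the crude approach in question, sketched in Python rather than p5js for brevity: label each pixel with its nearest seed point. It's correct, but inherently raster-resolution, hence the pixelated look.

```python
# Sketch of the naive per-pixel Voronoi approach: O(W*H*N) and blocky,
# since cell boundaries are only as sharp as the pixel grid.
import random

W, H, N = 200, 120, 12
random.seed(1)
sites = [(random.randrange(W), random.randrange(H)) for _ in range(N)]

def nearest_site(x, y):
    return min(range(N), key=lambda i: (sites[i][0] - x) ** 2 + (sites[i][1] - y) ** 2)

cells = [[nearest_site(x, y) for x in range(W)] for y in range(H)]
```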
Now, granted, you're probably not doing creative coding at your job, so this may not matter that much, but to me it was an example of pretty poor generalization capabilities. Curiously, Claude has no problem whatsoever generating a voronoi diagram as an SVG, but writing a script to generate said diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting to do something similar.
Really hard to get a real sense of capabilities when you're faced with experiences like this, all the while somehow it's able to solve 46% of real-world python pull-requests from a certain dataset. In case you're wondering, one paper (https://cs.paperswithcode.com/paper/swe-bench-enhanced-codin...) found that 94% of the pull-requests on SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there's almost certainly a degree of data-leakage.
It's surprising how much knowledge is not easily googleable and can only be unearthed by deep-diving into OSS or asking an expert. I was recently debugging a rather naive gstreamer issue where I was seeing a delay in the processing. ChatGPT, Claude, and Google were all unhelpful. I spent the next couple of days reading the source code, found my answer, and thought it was a bug.
Asked the mailing list, and my problem was solved in 10 seconds by someone who could identify the exact parameter that was missing (and IMO, required some architecture knowledge on how gstreamer worked - and why the unrelatedly named parameter would fix it). The most difficult problems fall into this camp - I don't usually find myself reaching for LLMs when the problem is trivial unless it involves a mountain of boilerplate.
I recently asked it to give me code to do gapless playback of audio files using Apple's AVAudioEngine APIs. It got it wrong and additional prompts to explain why it was wrong didn't help.
To me what it seems like these tools do really well is paraphrase stuff that's in their training data.
"generate a voronoi diagram using p5js. use a vector algorithm to generate the diagram instead of a raster" seems to work for me?
They should just adopt Apple "version numbers:" Claude Sonnet (Late 2024).
What are the licensing implications of this? If I’m Google, I’d be pissed that my software is being used without a human there looking at the ads.
Are these people aware that they can bump minor versions?
The mkt team vetoed Claude 3.6 ???
Ok I know that we're in the post-nerd phase of computers, but version numbers are there for a reason. 3.6, please? 3.5.1??
I really don't get their model. They have very advanced models, but the service overall seems to be a jumble of priorities. Some examples:
Anthropic doesn't offer an unlimited chatbot service, only plans that give you "more" usage, whatever that means. If you have an API key, you are "unlimited," so they have the capability. Why doesn't the chatbot allow one to use their API key in the Claude app to get unlimited usage? (Yes, I know there are third-party BYOK tools. That's not the question.)
Claude appears to be smart enough to make an Excel spreadsheet with simple formulae. However, it is apparently prevented from making any kind of file. Why? What principle underlies that guardrail that does not also apply to Computer Use?
Really want to make Claude my daily driver, but right now it often feels too much like a research project.
I've seen quite a few YC startups working on AI-powered RPA, and now it looks like a foundational model player is directly competing in their space. It will be interesting to see whether Anthropic will double down on this or leave it to third-party developers to build commercial applications around it.
Is there anything out there yet that will let me issue the command:
> Refactor the api folder with any recommended readability improvements or improvements that would help DRY up code without adding additional complexity.
Then I can just `git status` to see the changes?
They need to work on their versioning.
"3.5 Sonnet (New)", WTAF? - just call it 3.6 Sonnet or something.
Is it "New" sonnet? is it "upgraded"? Is there a difference? How do I know which one I use?
I can understand claude-3-5-sonnet-20241022, but that's not what users see.
I tried to get it to translate a document and it stopped after a few paragraphs and asked if I wanted it to keep going. This is not appropriate for my use case and it kept doing this even though I explicitly told it not to. The old version did not do this.
> "... and similar speed to the previous generation of Haiku."
To me this is the most annoying grammatical error. I can't wait for AI to take over all prose writing so this egregious construction finally vanishes from public fora. There may be some downsides -- okay, many -- but at least I won't have to read endless repetitions of "similar speed to ..." when the correct form is obviously "speed similar to".
In fact, in time this correct grammar may betray the presence of AI, since lowly biologicals (meaning us) appear not to either understand or fix this annoying error without computer help.
That's too much control for my taste. I don't want Anthropic to see my screen. I'd rather have VS Code with integrated Claude, a version that can see all my dev files in a given folder. I don't need it to run Chrome for me.
This is what the Rabbit "large action model" pretended to be. Wouldn't be surprised to see them switch to this and claim they were never lying about their capabilities because it works now.
Pretty cool for sure.
How long until it is profitable to tell a cheap AI to "win this game by collecting resources and advancing in-game" and then sell the account on eBay?
I wonder what optimizations could be made? Could a gold farmer have the directions from one AI control many accounts? Could the AI program simpler bots for each bit of the game?
I can imagine not being smart enough to play against computers, because I am flagged as a bot. I can imagine a message telling me I am banned because "nobody but a stupid bot would score so low."
Finally a general tool to solve captchas for my web scrapers.
If anyone would like to try the new Sonnet in VSCode. I just updated https://double.bot to the new Sonnet. (disclaimer: I am the cofounder/creator)
---
Some thoughts:
* Will be interesting to see what we can build in terms of automatic development loops with the new computer use capabilities.
* I wonder if they are not releasing Opus because it's not done or because they don't have enough inference compute to go around, and Sonnet is close enough to state of the art?
The "computer use" demos are interesting.
It's a problem we used to work on, and one that many other people have wanted to solve for the past ten years or more. So it's yet to be seen how well it works outside a demo.
What was surprising was the slow/human speed of operations. It types into the text boxes at a human speed rather than just dumping the text there. Is it so the human can better monitor what's happening or is it so it does not trigger Captchas ?
This is incredibly cool but it seems like the potential damage from a "hallucination" in this mode is considerable, especially when they provide examples of it going very far off-track (looking up Yellowstone pictures). Would basically need constant monitoring for me not to be paranoid it did something stupid.
Also seems like a privacy issue with them sending screenshots of your device back to their servers.
Anybody know how the hell they're combating / going to combat captchas, Cloudflare blocking, etc.? I remember playing in this space on a toy project and being utterly frustrated by anti-scraping measures. Maybe one good thing that will come out of this AI boom is that companies will become nicer to scrapers? Or maybe they'll just cut sweetheart deals?
I am surprised it uses macOS for the demo, as I thought it would be harder to control than Ubuntu. But maybe macOS is also the most predictable/reliable desktop environment? I noticed that they use a virtual environment for the demo; curious how they built that along with Docker. Is it leveraging the latest virtualization framework from Apple?
Interesting stuff, I look forward to future developments.
A comment about the video: Sam Runger talks wayyy too fast, in particular at the beginning.
I skimmed through the computer use code. It's possible to build this with other AI providers too. For instance, you can ask the ChatGPT API to call functions for click, scroll, and type with specific parameters, and execute them using the OS's APIs (usually a11y APIs).
Did I miss something? Did they have to make changes to the model for this?
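A minimal sketch of that approach with the OpenAI Python SDK; the `click` tool schema here is hypothetical (my own, not something either vendor ships), and pyautogui performs the action locally:

```python
# Sketch: let a chat model request a hypothetical `click` tool, then
# execute it locally with pyautogui (assumes openai>=1.0 and pyautogui).
import json

import pyautogui
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "click",
        "description": "Click at absolute screen coordinates",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "integer"},
                "y": {"type": "integer"},
            },
            "required": ["x", "y"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Click the OK button at (640, 480)."}],
    tools=tools,
)
for call in resp.choices[0].message.tool_calls or []:
    if call.function.name == "click":
        args = json.loads(call.function.arguments)
        pyautogui.click(args["x"], args["y"])  # perform the click locally
```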
Imagine the possibilities for cyber-crime. Surely you could program it to log in to a financial institution and transfer money. And if you had a list of user names and passwords from some large info breach? You could automate a LOT of transfers in a short amount of time...
This demo is impressive although my initial reaction is a sort of grief that I wasn't born in the timeline where Alan Kay's vision of object-oriented computing was fully realized -- then we wouldn't have to manually reconcile wildly heterogeneous data formats and interfaces in the first place!
OpenAI must be scared at this point. Anthropic is clobbering them at the high end of the market and Meta is providing free AIs at the low end. OpenAI is pretty soon going to be in the valueless middle fighting with tons of other companies for relevance
Next stop after "Computer Use": multimodal input from a robot's sensors, and generating the various signals that control its actions.
Looking forward to seeing this in the coming few years, and hoping such a robot could help many people, including the elderly.
How long until "computer use" is tricked into entering PII or PHI into an attackers website?
I wonder if OpenAI will fast follow; usually they're the ones to throw down the gauntlet. That being said, you can play around with OpenAI with a similar architecture of vision + agent + exec + loop using Donobu, though it is constrained to web browsers.
Offtopic, but YouTube doesn't allow me to view the embedded video, giving a "Sign in to confirm you're not a bot" message. I need to open a dedicated YouTube tab to watch it.
The barrier to scraping youtube has increased a lot recently, I can barely use yt-dlp anymore
Hopefully the coding improvements are meaningful, because I find that as a coding assistant o1-preview beats it (at least the Claude 3.5 that was available yesterday), but I like Claude's demeanor more (I know this sounds crazy, but it matters a bit to me).
Any comments on alignment with Anthropic's mission? Last time I checked, Anthropic is about building SOTA because that is the only way to do safety research; making money and useful commercial products is a means to an end.
I wonder when it'll actually be available in the Bedrock AU region, because as of right now we're still stuck using mid-range models from a year ago.
Amazon has really neglected ap-southeast-2 when it comes to LLMs.
Looks like visual understanding of diagrams is improved significantly! For example, it was on par with Chat GPT 4o and Gemini 1.5 in parsing an ERD for a conceptual model, but now far excels over the others.
They need to get the price of 3.5 Haiku down. It's about 2x 4o-mini.
I suspect they are going to need some local offload capabilities for Computer Use; the repeated screen reading can definitely be done locally on modern machines, otherwise the cost may be impractical.
This looks really similar to rabbit's Large Action Model (LAM). Cool!
https://www.rabbit.tech/rabbit-os
One suggestion: use the following prompt with an LLM:
This "computer use" feature is obviously perfect for automating GUI tests. Will it work on screenshots of mobile devices like smartphones/tables, also?
Interestingly, the new Claude only knows content up to:
> I'm limited to what I know as of April 2024, which includes the initial Claude 3 family launch but not subsequent updates.
Did they just invent a new world of warcraft or runescape bot?
Cursor AI already have the option to switch to using claude-3-5-sonnet-20241022 in the chat box.
I was about to try to add a custom API. I’m impressed by the speed of that team.
> Claude 3.5 Haiku matches the performance of Claude 3 Opus
Oh wow!
Can anyone share a .http or curl or anything similar based session with computer tool use? Docker containers make me cry.
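For what it's worth, the API itself is just HTTP; the container only matters when you actually execute the actions the model requests. A sketch from memory of the beta docs (the beta header and tool fields are assumptions and may have drifted):

```python
# Sketch: raw computer-use request against the Messages API, no Docker.
# Beta header and tool field names are from memory of the announcement
# docs; verify against the current reference before relying on them.
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": "YOUR_KEY",
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "computer-use-2024-10-22",
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [{
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        }],
        "messages": [{"role": "user", "content": "Open the calculator."}],
    },
)
print(resp.json())  # tool_use blocks describe clicks/keystrokes to perform
```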
Computer use won't allow you to log in to social media accounts, even if it is your account and credentials. Bummer.
Does anyone know how I could check whether my Claude Sonnet version that I am using in the UI has been updated already?
Looks like it just takes a screenshot and can't scroll so it might miss things.
Claude 3.5 Haiku will be released later this month.
We are approaching FSD for the computer, with all of the lofty promises, and all of the horrible accidents.
I love how they don't seem to be calling it "AgenticAI" or something like that.
What I'd like to know is whether prompt caching is available to Claude on AWS Bedrock now.
Aider benchmarks for Claude 3.5 (new) are impressive: from 77.4% to 83.5%, beating o1-preview.
Since they didn't rev the version, does this mean that if we were using 3.5 today, it's just automatically using the new version? That doesn't seem great from a change management perspective.
Though I am looking forward to using the new one in cursor.ai.
wow, i almost got worried but the cute music and the funny little monster on the desk convinced me that this all just fun and dandy and all will be good. the future is coming and we'll all be much more happy :)
Does this make cursor obsolete?
You can just use any IDE you want and it will work with it.
While I was initially impressed with its context window, I got so sick of fighting with Claude about what it was allowed to answer that I quit my subscription after 3 months.
Their whole policing-AI-models stance is commendable but ultimately renders their tools useless.
It actually started arguing with me about whether it was allowed to help implement a GitHub repository's code because it might be copyrighted... it was MIT-licensed open source from Google :/
This is bad news for SWEs!
I checked the docs but couldn't find it: does Claude have an API like the GPT Assistants API, with the ability to give it a set of documents to work with?
It seems that you can only send a single message, so there's no way for it to "learn" from predefined documents.
Computer use is really going to highlight how fragmented the desktop ecosystem is, but it also adds more context on how Microsoft wants to use their screenshot AI.
Can Claude create and run a CI/CD pipeline now from a prompt?
How does one get access to it without using the API??
Eventually, we'll be able to eliminate the intermediate "computer" and just let the AI render everything we need to interact with.
Captchas are toast.
I'm unclear: is Haiku supposed to be similar to 4o-mini in use case/cost/performance? If not, do they have an analog?
Still can't use their services. They still require a phone number for some reason. What about those of us who don't have one?
Just to confirm: did they just release a model with the exact same name as the previous one?
Does anyone know what some use cases for "computer use" are?
Just a question: For this thingy to work, I must give the provider access to my computer? Good luck. :)
Just another reason to use ONLY local LLMs.
This is amazing
Is it just me who feels that Anthropic has been innovating faster than ChatGPT in the past year?
and i was just planning to go to sleep…
Now I am really curious how to programmatically create a sandboxed compute environment to do a self-hosted "Computer use" and see how well other models, including self-hosted Ollama models, can do this.
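A bare-bones sketch of one iteration of such a loop, assuming a local Ollama server with a vision model (llava here) and pyautogui for screenshots; a real agent would add structured action parsing, execution, and sandboxing:

```python
# Sketch: one iteration of a self-hosted screenshot -> vision model loop
# (assumes a local Ollama server with the llava model, plus pyautogui).
import base64
import io

import pyautogui
import requests

def screenshot_b64():
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llava",
    "prompt": "Describe the next UI action needed to open the web browser.",
    "images": [screenshot_b64()],
    "stream": False,
})
print(resp.json()["response"])  # would be parsed into click/type actions
```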
new VBA version just landed
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%! https://aider.chat/docs/leaderboards/
cries in UiPath
Why not rev the numbers? "3.5" vs. "3.5 New" feels weird -- is there a particular reason why Anthropic doesn't want to call this 3.6 (or even 3.5.1)?
It's quite sad that application interoperability requires parsing bitmaps instead of exchanging structured information. Feels like a devastating failure in how we do computing.
My First Experience with Claude Computer Use - It's Mind-Blowing!
Just tested Claude's new Computer Use feature and had to share this simple but powerful test:
My basic prompt: "Please:
1. Search Amazon for 3 wireless earbuds: find price, rating, brand name
2. Make a simple Excel file 'earbuds.xlsx': put the information in a basic table, add colors to the headers, sort by price
3. Show me the results"
What blew my mind:
- Claude actually looked at my screen
- Moved the mouse by itself
- Clicked buttons like a human
- Created reports automatically
It's like having a virtual assistant that can really use your computer! No coding needed - just simple English instructions.
For those interested: https://mergisi.medium.com/8f56f683e307
Scary stuff.
'Hey Claude 3.5 New, pretend I'm a CEO of a big company and need to lay off 20% people, make me a spreadsheet and send it to HR. Oh make sure to not fire the HR department'
cf. IBM, 1979.
Both the new Sonnet and gpt-4o still fail at a simple:
"How many w's are in strawberry?"
gpt-4o: There are 2 "w's" in "strawberry."
Claude 3.5 Sonnet (new): Let me count the w's in "strawberry": 0 w's.
(same question with 'r' succeeds)
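The usual explanation is tokenization: models see subword tokens, not letters, so character counting has no direct representation in the input. A quick sketch using OpenAI's tiktoken as a stand-in (Claude's tokenizer differs, but the principle is the same):

```python
# Show the subword pieces a tokenizer produces for "strawberry"; the
# exact split depends on the vocabulary, but it is never per-letter,
# which is why letter-counting questions trip up LLMs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print([enc.decode([t]) for t in tokens])
```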
What is artificial about the current generation of "artificial intelligence" is the way training (predict the next token) and benchmarking (overfitting) are done. Perhaps a fresh approach is needed to achieve a true next step.