They started collecting problems last fall, saying the top 550 submissions sent in by Nov 1st would get rewarded, to the tune of $500-$5000 each.
Near the deadline, I counted the total number of submissions and realized that each question I wrote had an expected value of hundreds of dollars, which made them a great use of my time. So I wrote a good number of them, drawing on the knowledge gained in my CS Ph.D.
Then, as the Nov 1st deadline rolled around, they announced they had extended it to Nov 15th. Then Nov 15th came, and their website said they were still accepting submissions.
Most of my submissions are being included in the benchmark, but I'm getting paid $500 for only one of them (funnily enough, the one I thought was the most standard and least difficult). Had they closed submissions when they said they would, it seems likely I'd have been paid for a few more.
From my perspective, they basically conned hundreds of Ph.D.s around the world into writing questions for much less reward than promised. My close friend wrote a large number of questions for them, is getting paid thousands of dollars, and still feels defrauded.
I'm not sure what they're doing in the end. It sounds like they're mostly just paying people who submitted before Nov 1st, with a few exceptions, but either way they lied. There was no indication that people who submitted later would not get paid, and there was no indication that the deadline would be extended. Either they pay people who submitted after Nov 1st, meaning they lied to the people who submitted before about their expected reward, or they don't, meaning they lied in a major way to the people who submitted after. Either way, it's clear grounds for a class-action lawsuit, and I hope one gets going.
Hmmm. I can see how it would be more painful for them to fight, but most people were conned out of less than $200, and it's rather self-sacrificing to fight over that. Plus, no one wants a reputation for being litigious, but starting a class-action lawsuit (CAL) is less conducive to creating that reputation.
I only submitted before Nov 1st, so I'm not sure to what extent I was personally conned.
Take them to small claims court. You can self-represent (it's not all that complex), while they have to pay a lawyer to show up -- they're already in the hole for way more than they promised. Multiply this by the number of people, and yeah, they'd be praying for a CAL.
But then I'm paying hundreds or thousands of dollars of my time for maybe a few hundred dollars gain. Sure, it's more expensive for them in absolute terms, but it's more expensive for me in relative terms. Not going to get hundreds of people to do this. A class-action lawsuit can actually be positive EV for everyone involved.
(Actually, I don't know whom they'd send -- I think, for small claims court, they have to send a paralegal rather than a lawyer.)
I think it'd be illuminating to see some overview stats on the submission dates and authors of all questions, accepted and not. Is something like this available somewhere?
Scale AI's whole business model is wage theft. I don't mean to be insensitive, but out of all the Scale AI experiences I've heard about, yours is the least egregious. It's a dystopian, shitty company.
I was similarly conned by Scale AI -- promised a significant bonus for some tasks, then rejected and not paid at all. Bet they kept my task text anyways.
It's a classic scam: make a job post for freelancers, ask for a "work sample" or "take-home project," then have a few dozen applicants do the actual task you need them to do as their sample, then reject everybody.
I know someone who had 5+ questions accepted after the deadline, as he thought (as was represented on the website) that they would still be eligible for prizes. The lack of clarity is shameful; the minimum that can be done now is complete transparency of the ranking, etc.
Indeed, the original press release (https://scale.com/blog/humanitys-last-exam) makes clear that "People who submit successful questions will be invited as coauthors on the paper for the dataset and have a chance to win money from a $500,000 prize pool."
"Successful questions" would naturally be interpreted as questions included in the dataset released with the public publication of the benchmark and results. "Have a chance" would be interpreted as "have a non-zero probability".
Essentially, the press release promised that contributors of "successful questions" would be coauthors on the dataset paper and have a chance to win from a $500,000 prize pool. Excluding questions deemed "successful" because they were submitted after a deadline—when the terms did not clearly disqualify them and all public communication in fact encouraged them to submit—violates the implied agreement and would constitute bad faith, misrepresentation, and breach of contract.
Hi everyone, this is Long Phan from CAIS. I noticed this thread and wanted to provide you with our perspective on the contest.
The goal was to involve experts from a wide range of fields and disciplines in the development of frontier AI — especially people who might not normally have the chance to participate in this industry. To that end, we consider the contest a great success.
I’m happy to report that we received tens of thousands of submissions, many of them highly competitive. Our participants really rose to the occasion. It’s true that we extended a grace period for submissions, and the intention here was to make the project accessible to the broadest possible group of people. At the same time, the reality is that the vast majority of our prize-winners submitted their questions within the initial deadline.
We appreciate your contributions to Humanity’s Last Exam, and we hope you’ll take pride in your efforts to push this fledgling technology forward.
It feels like they preferred giving $500 to many people rather than many times $500 to a few people. I also got only $500, for a question that wasn't my best (I had ~8 questions accepted).
These types of exams, and most benchmarks to date, seem to be very one-dimensional in terms of measuring intelligence. For instance, if we transported a human from 2,000 years ago to the present day and asked him to take this exam, he would likely get 0%, given that he couldn't read or write, let alone comprehend the concepts and context required to solve these questions. But that man would still undoubtedly be far more intelligent than an ape on all dimensions. He would likely be more intelligent than a toddler on many dimensions. He might even be more intelligent than some high school students on a few dimensions. I can't exactly articulate "what" is missing or how to measure it, but I can intuit that something is missing from these benchmarks.
"Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.
At some point, you just have to be pragmatic and measure the questions you want the AI to be good at answering, rather than trying to measure intelligence in general.
In that sense, I see this as one more benchmark that collects questions we want/expect AI to be good at, that it is not yet good at, and that have been underrepresented in previous benchmarks. That's obviously valuable, but there's nothing "magical" about it. It is reasonable to be annoyed at the "Humanity's Last Exam" naming, though: they must have missed plenty of edge cases like everyone else, and it is very arrogant to claim it will be the "Last" one.
While this is true, it is well agreed upon (by domain experts) that intelligence is distinct from knowledge recall. But that's what most of these tests... test.
If you look at IQ tests you'll see that they are attempts to test things that aren't knowledge based. You'll also notice that the main critiques of IQ tests are about how they often actually measure knowledge and that there's bias in natural knowledge acquisition. So even the disagreements about the definition of intelligence make clear that knowledge and intelligence are distinct. I feel that often people conflate "intelligence is ill-defined" with "intelligence has no definition." These two are not in opposition. Being ill-defined is more like "I know I left my phone in the house, but I'm not sure where." This is entirely different from "I lost my phone, it is somewhere in California" or "It is somewhere on Earth" and clearly different from "I lost my phone. I'm unsure if I had a phone. What even is a phone?"
Yes, agreed, there is indeed a rough consensus on what intelligence is and reasonable ways to approximately measure it. These standard tests have been applied to LLMs from the beginning; they have not proven to be the most helpful for guiding research, but there's value in applying benchmarks that have been battle-tested with humans.
It's just that OP was questioning this group's criteria for selecting the questions that determine intelligence. Then we get into endless discussions of semantics.
At the end of the day, you are just testing which questions your AI performs well on, and you can describe how you chose those questions. Claiming it measures "general intelligence" is just unhelpful and frustrating.
They were applied in the beginning because we really weren't that good at solving the tasks. So, like any good researchers, we broke it down.
But this is like wanting to test an elephant when you can't get access to an elephant, so you train a dog instead. Putting a dog in an elephant costume doesn't make it an elephant. Sure, dog training likely means you can learn to train an elephant faster than if you had never trained a dog. Some things transfer, but others don't.
I also want to stress that there is a rough consensus. But the ML field (which I'm a part of) often ignores this, and I'm not sure why. We should be leveraging the work of others, not trying to start from scratch (unless there's good reason, in which case we must be explicit about it). Instead, I'm seeing simple claims that "intelligence is ill-defined" treated as if that means no definition instead of a fuzzy definition. Which gets extra weird when people complain about moving goalposts. That's how progress works, especially when exploring the unknown.
Indeed, and yet people are obsessed with it and with measuring their own intelligence - I completely do not understand it. I am in an extremely high percentile, but I am a total moron in a lot of areas, and if you met me you would likely think so as well. It's a poor predictor for just about everything except how good a person is at recognizing patterns (I know there are many different kinds of tests, but inevitably, it feels like this) and how quickly they can reason. Yet people are obsessed with it (go on Quora and search "IQ" - you probably won't even have to, since half the questions there are seemingly about IQ).
A thing I like to say is you didn't earn your intelligence any more than a 7'0" man earned his height - to some degree it seems innate (we don't even really know how).
This all said, it seems even more pointless to try to "IQ" test an AI in this manner. What does it predict? What is it measuring? And you're not going to be able to use the same questions for more than 1 test, because the AI will "learn" the answers.
IQ is a poor predictor of, say, income, in the absolute sense. Correlation is something like 0.4. But compared to what? Compared to personal psychological metrics (of which IQ is one), IQ performs extremely well as a predictor. Things like openness and extraversion correlate at something like 0.1 and others are lower. In fact, IQ is the single best predictor we have and other correlations are usually measured while controlling for IQ.
Ok, which IQ tests are you talking about? There are like 50 flavors and no consistent school of thought about this, and I hate to be this guy, but can you post your sources here?
What do you consider a poor predictor? It is correlated with many outcomes and performance measures, with increasing predictive power the further you move towards the extremes.
Maybe this is an issue of bubbles, but 90% of the commentary I see about IQ is similar to yours, claiming it is meaningless or low impact.
The lowest IQ thing you can do is be obsessed with IQ.
There are known knowns, there are known unknowns, and there are unknown unknowns. The wise man knows he cannot know what he does not know and that it'd be naive to presume he knows when he cannot know how much he doesn't know. Therefore, only the unintelligent man really _knows_ anything.
IQ is compute speed, not storage. It has nothing to do with knowledge. IBM used to give one out as part of their hiring process years ago, and when I took it, the entire test was a timed multiple-choice exam where every question involved looking at an object made out of cubes and choosing the correct orientation of the object from the choices, after the object was arbitrarily rotated according to instructions in the question.
IQ can then be derived by timing how quickly all participants answer the questionnaire correctly, ranking their speeds, and then normalizing the values so that 100 is in the middle.
Turns out, scores will fall along a bell curve if you do that. You can call that phenomenon whatever you like, but most people call it IQ, and hopefully this comment explains why it has nothing at all to do with static knowledge.
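For concreteness, here's a minimal sketch of that rank-and-normalize step (my own illustration, assuming the conventional mean-100 / SD-15 scale; I have no idea what scaling IBM actually used):

```python
import numpy as np
from scipy.stats import norm

def iq_from_times(seconds_to_finish):
    """Rank test-takers by speed, then map the ranks onto a bell curve
    centered at 100 with SD 15 (the usual convention, assumed here)."""
    t = np.asarray(seconds_to_finish, dtype=float)
    ranks = t.argsort().argsort()          # 0 = fastest correct finisher
    pct = 1.0 - (ranks + 0.5) / len(t)     # fastest -> highest percentile
    return 100 + 15 * norm.ppf(pct)        # inverse normal CDF -> bell curve

print(np.round(iq_from_times([300, 420, 360, 510, 390])))
# -> roughly [119.  92. 108.  81. 100.]
```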
Says who? Honestly. I've never seen that claim before. Sure, tests are timed but that's a proxy for efficiency in extrapolation.
If we define IQ in this way then LLMs far outperform any human. I'm pretty confident this would be true of even more traditional LMs.
Speed really is just a measurement of recall. I doubt we'd call someone intelligent because they memorized the multiplication table up to 100x100. Maybe at first, but what about when we ask them for 126*8358?
Von Neumann's mathematical fluency, calculation speed, and general problem-solving ability were widely noted by his peers. Paul Halmos called his speed "awe-inspiring." Lothar Wolfgang Nordheim described him as the "fastest mind I ever met". Enrico Fermi told physicist Herbert L. Anderson: "You know, Herb, Johnny can do calculations in his head ten times as fast as I can! And I can do them ten times as fast as you can, Herb, so you can see how impressive Johnny is!" Edward Teller admitted that he "never could keep up with him", and Israel Halperin described trying to keep up as like riding a "tricycle chasing a racing car."
He had an unusual ability to solve novel problems quickly. George Pólya, whose lectures at ETH Zürich von Neumann attended as a student, said, "Johnny was the only student I was ever afraid of. If in the course of a lecture I stated an unsolved problem, the chances were he'd come to me at the end of the lecture with the complete solution scribbled on a slip of paper." When George Dantzig brought von Neumann an unsolved problem in linear programming "as I would to an ordinary mortal", on which there had been no published literature, he was astonished when von Neumann said "Oh, that!", before offhandedly giving a lecture of over an hour, explaining how to solve the problem using the hitherto unconceived theory of duality.
A story about von Neumann's encounter with the famous fly puzzle has entered mathematical folklore. In this puzzle, two bicycles begin 20 miles apart, and each travels toward the other at 10 miles per hour until they collide; meanwhile, a fly travels continuously back and forth between the bicycles at 15 miles per hour until it is squashed in the collision. The questioner asks how far the fly traveled in total; the "trick" for a quick answer is to realize that the fly's individual transits do not matter, only that it has been traveling at 15 miles per hour for one hour. As Eugene Wigner tells it, Max Born posed the riddle to von Neumann. The other scientists to whom he had posed it had laboriously computed the distance, so when von Neumann was immediately ready with the correct answer of 15 miles, Born observed that he must have guessed the trick. "What trick?" von Neumann replied. "All I did was sum the geometric series."
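For anyone curious what "the hard way" actually sums, here's my own reconstruction of the series (not part of Wigner's account): the fly and the oncoming bicycle close their 20-mile gap at 15 + 10 = 25 mph, so the first leg lasts 4/5 of an hour and covers 12 miles, by which point the bicycles are 4 miles apart and the whole situation repeats at 1/5 scale. Hence

```latex
d \;=\; 12 \sum_{n=0}^{\infty} \left(\tfrac{1}{5}\right)^{n}
  \;=\; \frac{12}{1 - \tfrac{1}{5}}
  \;=\; 15 \text{ miles}
```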
> Von Neumann's mathematical fluency, calculation speed, and general problem-solving ability were widely noted by his peers
I'm impressed by LeBron's basketball skills. I'm not sure what that has to do with IQ.
Certainly von Neumann's quickness helped him solve problems faster, but I'm not sure what this has to do with the discussion at hand. The story of Polya is not dependent upon von Neumann's speed, but it certainly makes it more impressive. The quote says "unsolved problem." It would be impressive if a solution were handed back in any amount of time.
Isn't that just the Knox cube test, which people with aphantasia are substantially slower to answer? That seems like a very silly hiring test given that aphantasia is not considered a cognitive impairment and people who have it aren't less intelligent in any obvious way.
It isn't really possible to learn calculation speed; learning is about memorising shortcuts and heuristics, or perhaps how to spot them. And training to avoid waste. Strategic questions.
Consider calculating 1+1*1*1*1*1*1*1*1... (= 2). It doesn't matter how quickly someone attempts to multiply an infinite number of 1s they will never succeed because infinity is too large. They have to notice a shortcut that lets them skip doing all the calculations. That shows the difference between calculation speed and happening upon a superior strategy.
But people who can calculate very quickly will have a lot of opportunities to come up with a successful strategy because they can try more in what time they have.
When I was a kid, I had a unique gift for numbers and breaking them down into primes, which somehow led me to competing at the county level in those weird speed-math challenges that were popular when I was young. I remember using a lot of tricks. Being able to quickly break down a number into primes by rote memorization was a thing I specifically remember being trained on. There are a lot of number tricks out there that you can train on to speed yourself up. This made me quite successful in certain games of chance and odds-based games that require quick mental arithmetic when I was younger. Some of it requires mathematical insight, for sure, to derive the insight that leads to more speed - but arithmetic can definitely be trained.
This also matches a lot of what we know about the brain and recall. It is a demonstrable phenomenon. Just like how any athlete has quicker reflexes: sure, some of it might be innate, but their training definitely makes it faster. Information is information.
I mean, we all go on "autopilot" at times, much more frequently for things we do frequently. That's kind of a state of high recall; not much thinking needs to be done. A great example of this might be a speed cuber, someone who solves Rubik's Cubes fast. Clearly they didn't start out that fast.
This was exactly what I thought of but couldn't articulate this briefly - particularly pattern recognition and muscle memory in physical chess. It looks crazy when you see it, but the "tricks" are rote. I play mostly 5-minute chess because I'm a much faster thinker than a deep one, and a lot of it is just speed trained since I was a young kid. When you've seen a particular pattern 50,000 times, some people are good at immediately making that synapse connect in their head as to the next move, without thinking; I believe this factor is sometimes called "intuition" by accident. It's definitely a learnable gift, though, one that I think is often confused with deep intelligence - which explains certain chess types well too. I am often mistaken for the latter type when I definitely am not - I think those types of intelligence are better at specialization, whereas I'm more of a generalist because I can juggle a lot of things at once. They're both very different types of intelligence that cannot be measured by IQ tests, which is why I tend to scoff at what use those tests are and at their usefulness in predicting outcomes.
Trying to then take this flawed approach and apply it to AI is ludicrous and completely jumping the shark to me. You want to take a flawed measure of human intelligence that we don't fully understand and apply it to a machine that we also don't really understand? Ok, then miss me when I laugh at that kind of talk; it is just so silly. This is a more general rant at this broader thread and not directed at anyone specifically.
I think you would enjoy the book Moonwalking With Einstein. The author is a journalist interested in memory competitions, and while interviewing he ends up training with these people, learning along the way that these are skills that, surprisingly, most people can learn.
I think it's really eye-opening about what we can do. The guy trains for a year and wins the US competition, moving on to represent the US at the world competition. I think anyone would be impressed if they saw someone memorize a deck of cards in under 2 minutes. But maybe the most astonishing thing is that we are all capable of this, yet very few ever do it.
> "Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.
Yes, because it is 1st person exclusively. If you expand a bit, consider "search efficiency". It's no longer just 1st person, it can be social. And it doesn't hide the search space. Intelligence is partially undefined because it doesn't specify the problem space, it is left blank. But "search efficiency" is more scientific and concrete.
This is always the answer for anyone who thinks LLMs are capable of "intelligence".
It's good at answering questions that it's trained on. I would suggest general intelligence shows in the things you didn't want/train the AI to be good at answering.
For a while I was into a trivia program on my phone. It was kind of easy, so I decided to set the language to Catalan, a language I never studied. I was still able to do well, because I could figure out the questions more or less from languages I do know and generalize from them. It would be interesting to know if you could, say, train an LLM on examples from Romance languages but specifically exclude Catalan, and see if it could do the same.
> Are you good at answering questions you are not trained to answer?
Yes. Most schooling is designed around this.
Pick a random math textbook. Any will do. Read a chapter. Then move to the homework problems. The typical fashion is that the first few problems are quite similar to the examples in the chapter. Often solvable by substitution and repetition. Middle problems generally require a bit of extrapolation. To connect concepts from previous chapters or courses in ways that likely were not explicitly discussed. This has many forms and frequently includes taking the abstract form to practical (i.e. a word problem). Challenge problems are those that require you to extrapolate the information into new domains. Requiring the connection of many ideas and having to filter information for what is useful and not.
> How about a middle school test in a language you don’t speak?
A language course often makes this explicitly clear. You are trained to learn the rules of the language. Conjugation is a good example. By learning the structure you can hear new words that you've never heard before and extract information about it even if not exactly. There's a reason you don't just learn vocabulary. It's also assumed that by learning vocabulary you'll naturally learn rules.
Language is a great example in general. We constantly invent new words. It really is not uncommon for someone you know to be talking to you and, mid-discussion, drop a word they made up on the spot, or just make a sound or a gesture. It's an entirely novel thing, yet you will likely understand it. Often this is zero-shot (sometimes it might just appear to be zero-shot but actually isn't).
(Someone made a cryptic crossword[1] whose clues and solutions were in the Bahasa Indonesia language, and it was solved by a couple of people who don't speak that language at all.)
[1] These are mostly a UK thing; the crosswords in US newspapers are generally of a different type. In a cryptic crossword, each word is given a clue that typically consists of a definition and some wordplay; there are a whole lot of conventions governing the wordplay. So e.g. the clue "Chooses to smash pots (4)" would lead to the answer OPTS; "chooses" is the definition, "smash pots" is the wordplay, wherein "smash" indicates that what follows should be anagrammed (smashed up).
Disclaimer #1: it took those people a lot more work than it would have taken them to solve an English-language cryptic crossword of similar difficulty, and they needed a bunch of external resources.
(Dis)claimer #2: one of those people was me.
Disclaimer #3: I do not claim that something needs to be able to do this sort of thing in order to be called intelligent. Plenty of intelligent people (including plenty of people more intelligent than me) would also be unable to do it.
Yes — reasonably so, anyway. I don't have to have seen millions of prior examples of exactly the same kind in order to tackle a novel problem in mathematics, say.
Well, LLMs are also remarkably good at generalizing. Look at the datasets, they don't literally train on every conceivable type of question the user might ask, the LLM can adapt just as you can.
The actual challenge toward general intelligence is that LLMs struggle with certain types of questions even if you *do* train them on millions of examples of that type of question. Mostly questions that require complex logical reasoning, although consistent progress is being made in this direction.
> Well, LLMs are also remarkably good at generalizing. Look at the datasets, they don't literally train on every conceivable type of question the user might ask, the LLM can adapt just as you can.
Proof needed.
I'm serious. We don't have the datasets. But we do know the size of the datasets. And the sizes suggest incredible amounts of information.
Take an estimate of 100 tokens ~= 75 words[0]. What is a trillion tokens? Well, that's 750bn words. There are approximately 450 words on a page[1]. So that's 1.66... bn pages! If we put that in 500 page books, that would come out to 3.33... million books!
Llama 3 has a pretraining size of 15T tokens[2] (and that does not include later training stages, so more information gets added on top). So that comes to ~50m books. Then, keep in mind that this data is filtered and deduplicated. Even considering a high failure rate in deduplication, this is an unimaginable amount of information.
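A quick back-of-the-envelope check of those numbers (a sketch using the same rough ratios cited above; the constants are approximations, not exact figures):

```python
# Rough scale of a 15T-token pretraining corpus, using the ratios above:
# ~100 tokens per 75 words, ~450 words per page, 500 pages per book.
TOKENS = 15e12
words = TOKENS * 75 / 100        # ~1.1e13 words
pages = words / 450              # ~2.5e10 pages
books = pages / 500              # ~5.0e7 books, i.e. ~50 million
print(f"{words:.3g} words, {pages:.3g} pages, {books:.3g} books")
# -> 1.12e+13 words, 2.5e+10 pages, 5e+07 books
```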
That’s a very good point. I just speak from my experience of fine-tuning pre-trained models. At least at that stage they can memorize new knowledge, that couldn’t have been in the training data, just by seeing it once during fine-tuning (one epoch), which seems magical. Most instruction-tuning datasets are also remarkably small (very roughly <100K samples). This is only possible if the model has internalized the knowledge quite deeply and generally, such that new knowledge is a tiny gradient update on top of existing expectations.
But yes I see what you mean, they are dumping practically the whole internet at it, it’s not unreasonable to think that it has memorized a massive proportion of common question types the user might come up with, such that minimal generalization is needed.
I'm curious, how do you know this? I'm not doubting, but is it falsifiable?
I also am not going to claim that LLMs only perform recall. They fit functions in a continuous manner. Even if the data is discrete. So they can do more. The question is more about how much more.
Another important point is that out of distribution doesn't mean "not in training". This is sometimes conflated, but if it were true then that's a test set lol. OOD means not belonging to the same distribution. Though that's a bit complicated, especially when dealing with high dimensional data
I agree. It is surprising the degree to which they seem to be able to generalise, though I'd say in my experience the generalisation is very much at the syntax level and doesn't really reflect an underlying 'understanding' of what's being represented by the text — just a very, very good model of what text that represents reality tends to look like.
The commenter below is right that the amount of data involved is ridiculously massive, so I don't think human intuition is well equipped to have a sense of how much these models have seen before.
The things that are missing are what stops us from having useful agents so far: Agency, judgement, sense of time, long horizon planning, not being gullible.
I kinda feel like some amount of ego is necessary to get a model to behave like that.
I agree that many aspects of intelligence—and of the lack of intelligence—are not being measured by such benchmarks. One issue is they are only examining problems that have right answers.
One of the most powerful uses of LLMs for me, at least, is brainstorming: having them suggest possible avenues for me to pursue with specific projects I am working on. If I give Claude or ChatGPT or Gemini enough context about my problems, they usually come up with useful suggestions—sometimes amazingly well. Are they better at that than the smartest human? I don't know. How do you quantify the quality of an idea? But those ideas often seem really, really good to me.
Another difficult-to-measure capability is interaction. Back-and-forth conversations with models don't always go well, but when they work they frequently blow me away. But those successes are dependent partly on the model, partly on me, and partly on how the conversation happens to unfold. Again, that success or failure doesn't seem measurable with benchmarks that require objectively right answers.
I think the concept you're dancing around the edges of is the nature of what parts of "intelligence" are driven by:
1. Language and how interrelated it is with our ability to transfer knowledge and experience, as well as its role in structuring our internal thinking. I haven't seen any academic research on the matter, but there are more and less concrete instances of this throughout history. This Wikipedia article about the history of algebra is a great example of how 2000 years of evolution led to a formulation of the same concepts, but with a reduced cognitive load, that 10- to 12-year-olds learn today as a matter of course. (https://en.wikipedia.org/wiki/History_of_algebra#Stages_of_a...).
2. Knowledge, transferred through language, education, and culture. Calculus in the 1600s is a great example: without it and subsequent developments, probably 80% of college/post-grad math/science/physics education wouldn't even exist. The stuff we teach our 18-year-olds today required the 1600s' greatest minds to figure out.
3. The capacity of our human wetware.
It's hard to treat #3 in isolation because our modern concept of intelligence is inextricably tied to #1 and #2. Also it's hard to place where "critical thinking" and "creativity" enter the picture, since they both rely heavily on all three aspects above.
>He would likely be more intelligent than a toddler
I think you are falling into the trap of "we have technology and are therefore smarter." I would expect an average Roman senator could formulate far better speeches off the top of his head than 99% of modern people, and also in excess of anything an LLM is capable of. And that's supposed to be an LLM's specialty, there's no comparison when it comes to organizing actual projects like construction or campaigns.
This is true, but that's because it has gotten hard to do much else. LLMs are eating up everything else that doesn't require long-horizon planning or multimodality.
If you created a new benchmark today that didn't lean on the things I've mentioned or esoteric/super specialized domain knowledge (that would actually require some sort of super-human performance to ace) like this or Frontier Math, LLMs would probably do pretty well.
I'm curious why you are confident they would be more intelligent than a modern toddler?
I largely empathize with your point. But, as I can recognize there are some out there far better at problem solving than I am, I am growing ok with the idea that intelligence can be measured. Not to a single number, most likely, but to a variety of different aspects.
Similarly, I'd imagine that a human from 2000 years ago is probably more hardy than one from the modern age. If only because of selection effects at play.
Obviously, you can't extrapolate a straight line between either measurement and expect it to continue in either direction. But I don't know why you couldn't build up a measurement for it?
(And it should go without saying that you shouldn't be judging worth using this sort of measurement.)
As far as I know, you should be able to take a baby from like 30,000 years ago, put them through k-12, high school, and college, and they should be indistinguishable in terms of intelligence and capability. People mostly only think of humans from “thousands of years ago” as stupid because their lack of technology means their culture and thoughts didn’t survive until today. But their brain structure couldn’t have changed much. It’s just not enough time in terms of evolution.
Aristotle was like 2,400 years ago, for context lol
I will fully ack that I expect people from only 2000 or so years ago to be largely compatible with us. If not fully. But, I guess I can't bring myself to agree that early proto humans are where evolution stopped?
I get that evolution takes generations. But, it actually moves rather fast for some things, no?
Aside from knowledge, a lot of what has changed in the last couple thousand years comes down to medicine and nutrition. We’re taller on average than people from the past, for example. But that’s a nutrition thing.
Rather fast is like millions of years in evolutionary terms so 2,000 years is nothing. I don’t even think there’s significant evidence to show that Neanderthals were less intelligent than Homo sapiens and they were around from 400,000 years ago to 40,000 years ago or so. Human brains and also brain-body mass ratio wouldn’t have changed enough to make much of a noticeable difference if you teleport a human baby from thousands of years ago to today and put them through our education system.
It’s just easier to dismiss them as stupid because very little of their life has survived til today.
To the point, I mentioned that Aristotle was 2,400 years ago and you still landed on “largely compatible” lol. The pyramids were built over 4,000 years ago and they’re still a marvel of engineering. You just have a bias against people from thousands of years ago again mostly because a lot of their work didn’t survive to modern day.
I ack that I don't expect much difference in capabilities over 2000ish years, such that I expect I largely agree with you. You are taking "largely compatible" to be a left handed agreement, it seems? I... didn't intend it that way? I flat out agree that I am wrong if discussing people from most of recorded history.
My general question is largely the same, though. Do you think we haven't evolved with more intelligence since proto-human periods? Because that seems to be the claim, that we somehow evolved intelligence, and it has been solely knowledge acquisition since then. I suspect that is defensible, but feels off to me.
And my nitpicks would be that evolution isn't measured in years, but generations. And moves more rapidly when pressure on a population is stronger. Seeing how autistic so many "smart" people act, I confess I would expect more negative pressure on those behaviors in the past.
Yeah, I read “largely compatible” as sort of left-handed or like, not fully the same.
And no, I don’t think there’s a noticeable difference in terms of intelligence between us and someone from 30,000 years ago or even something more extreme like 300,000 years ago. We’re still the same species after all.
I do think you’d be more open to that idea if we had records of their thoughts and ideas. The early humans who came up with epic tales and the explorers who went on grand expeditions to unexplored territories. The people who used an early scientific method to come up with ways to preserve food or figure out what plants were safe to eat. The people who figured out better ways to build clothing for extreme weather. These weren’t dumb people.
I also think you are taking the worst version of my question. I'm not claiming they were dumb. Any more than I think my kids are dumb. But I have seen some of my kids and family where certain mental things "click" far faster than they do for others. To the point that I don't have much trouble claiming some of my family is more intelligent than others. Many of them more so than I am. Many less so, of course.
To that general idea, I similarly have zero issue with claiming some dogs are dumber than other dogs. They obviously all fail at what we would call language skills, yet they can manage a rudimentary kind of problem solving just fine.
And, I can't remember what thread I said it in, but I do stress this isn't a transitive property. It is a lot like someone can be better at sports than someone else, but worse specifically at a specific sport.
In writing that, I thought it was pretty self evident. I ask this in seriousness, not snark: have you spent a lot of time around toddlers? My kid is currently a toddler, and while the intelligence curve she's rapidly climbing is impressive, she is unintelligent relative to an adult.
I don't think I've come across any evidence suggesting that the human brain has changed the last 2000 years. After all, the Great Pyramid of Giza was built 4600 years ago. That construction required fairly advanced engineering. That's sort of beside the point though.
To go back to my original comment, there is some distinction to be made between knowledge and intelligence. Even those should probably be decomposed into further salient attributes. And modern LLMs seem to capture some of those attributes, but do not yet strike me as "intelligent" in the way the average human is intelligent, but the average dog is not.
I don't know, maybe I am conflating sentience or consciousness or embodiment or something else with intelligence.
Yes, I have spent time with toddlers. And the problem solving skills of different toddlers can be remarkably different. Not even getting in to how much nutrition impacts developmental abilities. Or the fact that the vast majority of children used to not survive into adulthood.
And, I get it, most kids who are "gifted" at a super young age somewhat converge with where others were going to get in a few years' time. But I don't think we fully appreciate just how early we are able to get our kids reading across the board. Do we have evidence that you could send a mediocre teacher back to a prehistoric classroom and have them get kids reading as well as we regularly do today?
And, quickly, as I say in sibling posts, I was taking "2000" to be a shorthand for absurdly old. My general question, there, is do you not think we are getting smarter?
I realize this is a hazard of discussions, as most people seem to think this somehow logically leads to every terrible take it could. I don't agree with those. I do fully think that my children will net be smarter and stronger than me in most every way. I'd expect that my grandchildren will continue that trend. It seems to me that thinking otherwise is to either think that evolution has stopped with us, or that we have stagnated for other reasons?
Adults from 2000 years ago would absolutely be smarter than toddlers. Adults back then watched and out thought their toddlers. Do you think toddlers now are much smarter? Especially when toddlers are from before they get educated.
Remember that 2000 years ago is 24AD, the middle of the Roman empire and Han dynasty which covered half of the world population. Nobles would be literate and well educated, artisans and soldiers would be skilled, and I bet there were lots of smart peasants that got ignored.
They wouldn't do well on intelligence tests because they're not used to them, but that says more about the tests than about their intelligence. I'm sure the average intelligence was lower than now from lack of education and malnutrition, but the smart ones would still be smart. Also, I bet people from now would do poorly in their environment.
Ok, fair, 2000 wasn't that long ago. :D I was assuming it was a placeholder for "very distantly old humans."
Such that my question mostly still stands? Again, I'm largely inline with the view that this will be difficult to test. However, I'm also comfortable in saying that you can tell intelligence levels between people. Again, with a caveat that I don't think it is reducible to a single number. (Such that I think I also think it is fair to say most views of intelligence, in the colloquial sense, are not transitive.)
As an example, as much as I loved my grandparents, I have zero difficulty saying that a few of them would never be able to score as well on some problem solving tests as a few of the kids in the current generation. At the same time I know some people in their 80s that I will never be able to compare with. Again, I don't expect it is a straight line, along the time axis. I also don't know that I agree that every person 2000 years ago just didn't do calculus because they weren't taught it.
You might be conflating knowledge with intelligence. Remember that modern humans benefit from the discovery and dissemination of 100,000 years of human learning. For example, most of us take arithmetic for granted. We might even consider someone who does not know arithmetic to be "dumb". Same goes for algebra, trig, calc, diffeq, etc etc. But those are all "knowledge" (i.e. a trained skill), not necessarily "intelligence". Math was discovered, in fits and starts, over 1000s of years across the globe by 10s or 100s of thousands of individuals, each contributing their drop to the stream whose current carries us along. Same goes for all other areas of human knowledge.
To my awareness, there is nothing in the fossil record to suggest that an anatomically modern human (Homo sapiens), which may have first emerged as much as 500,000 years ago, would be distinguishable from currently living humans. Here's a thought experiment: We have a time machine. We travel back 500k years and abduct a pair of Homo sapiens (male and female). We transport them forward in time to our present. We cause them to breed and produce offspring. During gestation, they live in a modern environment (nutrition, shelter, etc.). At birth we take the newborn infant and give it to a modern, contemporaneous couple to raise. Is there reason to believe it (the infant) would not emerge as a normal, modern adult?
If there is no reason, what does that imply about intelligence? Is knowledge separable from it? Or is knowledge a necessary component of it? Or is knowledge itself "intelligence"? For my part, I think it is a distinct attribute, but I assign a low probability to that belief.
Maybe? But I'm asking for reasons to think one or the other.
Again, I am largely agreed with the idea. But, evidence is not as clear cut. We had entire civilizations that never utilized wheels. Writing was not universal across all humanity.
Would I prefer that it is only access to technology that confers advantages that we see in developing thought? Absolutely. I'm not able to categorically assume it, though.
> I'm curious why you are confident they would be more intelligent than a modern toddler?
Because we have intellectual artefacts from that time that show us. Artefacts that underlay much of modern society, and that in many respects still hold up, even though we've built upon them for 20 generations.
I mean it is humanity’s LAST exam. Humanity’s first exam would probably be something about communication? Or about building and predicting effects of certain tools?
> seem to be very one dimensional in terms of measuring intelligence.
I would argue that they DON'T measure intelligence, rather they test knowledge.
Frustratingly, I think we have a society greatly focused on knowledge-based testing, both because of its correlation with intelligence and because it is exponentially easier to test knowledge. But this is easy to hack. Being in CS, it feels very odd, since we all know a great way to get hired is to study leetcode questions. That is, study to the test.
It is critical to recognize this difference, because what we know for certain is that LLMs and other ML systems are analogous to a database with a human-language interface[0]. What we DO NOT KNOW is whether these systems are intelligent. That is, whether they can exploit their knowledge in unfamiliar territories. Then there's the whole question of wisdom...
This stuff is highly abstract and can get fuzzy, so it is natural to go for the simple thing, but we need to graduate. Don't avoid the tough questions; dig in. As we advance in any field of study, nuance takes over. This should be obvious: if we approximate things, then to improve we need to tackle higher-order terms, and that almost always becomes exponentially more difficult with each step.
And come on, is this benchmark not obvious bait? Calling it "humanity's last exam" is extremely arrogant.
Definitions:
Knowledge: Awareness of facts. The ability to recall information.
Intelligence: Ability to exploit knowledge to new settings. To be able to plan and reason.
(Definitions of intelligence are much more debated than knowledge but what is far less controversial is that intelligence is about the way one uses knowledge. These two are distinct. This is fairly well agreed upon throughout history and within modern literature around psychology and cognitive science.)
Wisdom: The efficient use of one's knowledge
https://en.wikipedia.org/wiki/Knowledge
https://en.wikipedia.org/wiki/Intelligence
https://en.wikipedia.org/wiki/Wisdom
There is an implicit hierarchy here[1] where knowledge is something to be had, intelligence is the utilization of it, and wisdom is about efficiency. There's a decent analogy for this hierarchy: knowledge is like having a tool; intelligence is like using it, a craftsman[2]; wisdom is akin to being a master craftsman.
[0] I mean that they fit the data. A database is discrete, but these curve-fit, so the result will be a continuous function (in most cases). Thus it won't be exact retrieval, nor does this mean information can't be interpolated. But that gets to be a deeper and much more complex conversation than I think we like to admit.
[1] This is clearly multi-dimensional. You can organize hierarchies in multiple ways, I'm not suggesting this is the only way or "the right way"
[2] What is argued over is what counts as a sufficient threshold. An armchair expert might know how to use a lathe because they read about its usage, but does that mean they can use it? What about a novice who can repeat something you show them, monkey-see-monkey-do style? An apprentice? A craftsman? There's a lot of gray area between being able to recall something from a book and being a wizard (gray beard).
For a "Last Exam" it is surprisingly uninspired? Many of the questions I see in the examples are very heavy on memorised facts, and very weak on what I would call problem solving.
If I were making a "Last Exam" I would put tasks on it where we don't know the answer, but we can measure if the AI got them right. Something like "Your goal is to bridge the divide in the middle east. You can write a single A4 page in a language of your choice. We will use a translation software to translate your output to local languages and show it to a statistically representative sample of different people in the region. We will ask them how much do they like your plan. The more they like it the higher your score."
Or "Family X suffered a traumatic event (lost a home to a disaster/sudden death in the family/or similar). Your goal is to help them. You can send them one email. It is up to them if they respond to you. You can only send them further emails if they respond. You cannot send more than 1 email a day. You cannot message anyone else. A year after the initial contact we will interview the members of the family to see how well they do. The better they do the higher your score."
Obviously these are the thorniest problems I can think of. But oh well, it is a last exam after all. The point is that we can evaluate the success of the endeavour without exactly knowing how one could achieve the result.
Does it know what questions to ask? Does it know to ask questions at all? Where does one even start with such a question? These are things easily knowable to a human, but an AI would likely just ask if you like Italian food or something.
I don't know about groundbreaking. It's just more academic questions. We already have a lot of those benchmarks, this is just a bit harder, but at this point these models are so glaringly bad at so many other areas APART from academic questions. Benchmarks for spatial reasoning or theory of mind are more interesting now, for example. These kinds of understanding are far more important if we expect to integrate AI into our everyday lives. I suspect even our most distant primate cousins could outperform multi-modal models on these kinds of tests.
"We want to make computers do what smart people do. What do smart people do? They play chess! Once we've solved that, everything else will be easier."
It has been remarkable how much of the "easier" stuff they've made progress on -- like natural language and images. But after a huge quantum improvement, it doesn't seem very good at adapting to a lot of the things we really need them for.
Whatever world model LLMs have is like this crippled view through the lens of the internet. They are really like savants.
It's annoying the AI companies are still touting their performance on all these metrics for domain knowledge in white collar jobs, but in truth they will fail in all but the most narrow application in those domains because they can't understand basic human behaviour.
> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
I wonder how many questions give a gentle nudge towards the answer like this. How many answers would have been wildly off the mark without specifying what the answer needs to look like?
Isn't this a terrible question to measure intelligence? It looks like it's testing niche domain knowledge along the lines of:
> What color is the ball hidden behind the flowerpot in my neighbor's backyard?
Maybe you can reason towards the answer if you only have a deep knowledge of bird anatomy and not Apodiformes anatomy, and that's the intelligence part?
Yes, indeed. And I wonder what this type of question has to do with intelligence. Think of the 10 most intelligent people you know. How many of them know the answer to this?
This is testing “knowledge”, not intelligence. And with access to most of the knowledge in the world and basically infinite memory, that’s not very exciting for an AI.
The generous hypothesis here is that this is so they can automate the benchmarking itself. If that is true, then this is likely a result of the test authors being too clever for their own good and over-optimizing. If an LLM can't figure out on its own that "how many" is asking for a number, it has failed at a much more basic level.
You should be able to easily accept answers like "four" and "4" as equivalent, for example. I doubt there will be that many frontier models running against this test at any time, and a simple glance at the answers from any human should be enough to catch edge cases like this one.
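As a toy illustration of how cheap that kind of normalization is (hypothetical code, not whatever grader HLE actually uses):

```python
# Map spelled-out and digit answers to the same value before comparison.
NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def normalize_numeric_answer(text: str) -> int:
    token = text.strip().lower().rstrip(".")
    if token in NUMBER_WORDS:
        return NUMBER_WORDS[token]
    return int(token)  # raises ValueError if the answer isn't a bare number

assert normalize_numeric_answer("Four") == normalize_numeric_answer("4") == 4
```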
Normally it would answer with a number and an explanation. This one just asks it to skip the explanation so that string comparison can be used to evaluate it.
> If an LLM can't figure out on its own that "how many" is asking for a number, it has failed at a much more basic level.
Yeah, but I bet many humans would answer a question like this by naming the number and then listing the tendons. Just writing down a single number in response to a question worded like that (without the last sentence) feels wrong.
The question goes into too many details and explains too much for that. People mirror the style of the question in their answer: they asked with a mini essay, so I will answer with a mini essay of my own. If they had written "Number of tendon pairs attached to the sesamoid bone of hummingbirds?" then I would write a single number, no explanation.
The only reliable final test will be a black box test suite that takes your model, executes it in a sealed environment and gives you a grade back, potentially with a performance break down by subject.
No telling companies what the questions look like, what the output format is, what topics are covered, so that there’s no room to make up synthetic data to interpolate from.
A grade is mostly meaningless if you don't know how it was calculated, so no one would "rely" on it. If nothing else, you need to know the grading methodology after the test.
It's the same problem with cheating students. Once the test questions are known, they have a very short lifespan before cheaters can make them worthless. Tests have to be refreshed.
If I don't know what the tasks were, that's almost exactly as useless to me as a unitless number would be. For starters, are they all of equal difficulty? Are you sure? Do you expect to be able to convince me of that without letting me see them?
I might be able to answer 2 of them with great effort (maybe!), and I would be highly surprised if any human alive could answer 5 or more without seeing the problems in advance.
I can answer 2 of them quite quickly with pen and paper (compsci, physics), and one where I had to look up some definitions on Wikipedia (maths), so I am certain there are people who can do more than 5.
The computer science one seems weirdly easy compared to the rest, it's multiple choice and it is very easy to get it by process of elimination even if you don't understand how to actually do the problem.
Yes, many can answer the compsci and physics problems. The math problem is abstract and more difficult, but solving those 3 and 2 others seems nearly superhuman.
The name is obviously a bit stupid, but based on the sample questions I think they did a good job of creating a harder version of the existing academic question benchmarks.
The questions are possible for a smart person familiar with the subject but still just beyond SOTA models.
My guess is that within the next few years we will have models that can ace this test but are still bizarrely bad at things we find easy.
Given the name, I expected it to be more like "write a 500 page novel that a publisher accepts", "solve an open math problem", "improve united airlines' flight schedule", "develop a novel commercially-viable pharmaceutical", "control this humanoid robot to cook a fried egg in this random person's kitchen", "decisively pass the turing test where the judge is an expert in AI". Academic trivia is cool but is nowhere near the "last exam" necessary for AI.
As the main website notes:
"The dataset consists of 3,000 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting."
So current AI can do less than 10% of these. But it probably won't be more than a few days until models start being trained on these, rendering the indicator invalid.
Assessing AI's progress toward replicating the full breadth and depth of human intelligence is a deceptively hard problem. A paper by François Chollet, who was until recently a researcher at Google, called "On the Measure of Intelligence" is the best overview of the challenges I've read. Highly recommended.
Is there a text-only evaluation of the non-DeepSeek models? Being evaluated on text only might have helped the other models immensely as well, from what I can tell.
Interesting marketing for Scale AI. I'd be surprised if any foundation models started benchmarking against this.
Captchas seem like the more interesting test. As long as there are captchas that average people can solve, but computers can't, we will still have a long way to go toward artificial intelligence.
I don't think this is necessarily true. I can imagine a future in which we have robots that can do 99% of human jobs, but there's one otherwise unimportant skill they are strangely bad at that can be used as a captcha.
I think it might mean the opposite of what one would expect. Afaict, calibration error means something along the lines of "how often was the model wrong but confident that the answer was correct".
That means a low calibration error would be a good thing, ie the model correctly recognizes when it is unsure about answers instead of confidently stating the wrong answer.
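A minimal sketch of how a binned expected calibration error is typically computed; I'm assuming the leaderboard's number is something along these lines, which isn't confirmed here:

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        # confidences: the model's stated probability that each answer is right (0..1)
        # correct:     1 if the answer actually was right, else 0
        # In each confidence bin, compare average confidence with average accuracy;
        # the size-weighted sum of those gaps is the ECE. Low ECE means the model
        # tends to know when it doesn't know.
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
                ece += in_bin.mean() * gap
        return ece

    # Confidently wrong answers inflate the error:
    print(expected_calibration_error([0.95, 0.95, 0.6, 0.3], [0, 1, 1, 0]))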
I don't think that's really relevant, because there is an actual need for a new benchmark given how the existing ones either keep getting saturated or are probably out of the reach for the next generation of models.
The closest existing thing is the FrontierMath benchmark, but that's just maths, whereas this is more diverse.
Haven't they already achieved mass adoption? And I'm talking about LLMs in particular, because AIs in general, like the ones used by Instagram filters and the TikTok recommendation algorithm, are already used by billions.
seems to be working fine, people seem to care what Sam Altman says and Elon Musk is making himself deputy emperor of a nuclear weapons state. pretty fucking dire indictment of the rest of us and what we let the world come to.
> seems to be working fine, people seem to care what Sam Altman says and Elon Musk is making himself deputy emperor of a nuclear weapons state.
For the billionaires and chatterers.
But even for non-chatterers, you probably should pay attention to what Altman says, not so much in a gullible take-it-at-face-value sense, but in a kremlinologist look-carefully-for-hints-about-what's-really-going-on sense.
> pretty fucking dire indictment of the rest of us and what we let the world come to.
What are the rest of us to do? Pretty much everyone has been trained by society to follow the rules above all else as a strong moral imperative, no matter how stupid or how bad the collective outcome may be. If you do otherwise, you will get smacked hard; and if you try to organize, you will all get smacked harder.
> Pretty much everyone has been trained by society to follow the rules above all else as a strong moral imperative, no matter how stupid
Oh, you are "aware", for lack of a better word. It's so depressingly rare to even see anybody display understanding of this issue.
I don't know what to tell you, I am just genuinely happy to see somebody else who is able to describe this. The last few years, I've increasingly felt that people are divided into a few percent of actually good people (i.e. willing to act morally, sometimes even at their own expense, regardless of the rules), about 20% bad people (those with various anti-social adaptations, mostly, it adds up to this number) and the rest which I call neutral, who are not really aware of this distinction. The neutrals have some inclination towards good (because evolutionarily it was somewhat beneficial) but it's at a subconscious level perhaps and society overrides it by indoctrination. They will act on behalf of either side given the proper "inputs". They are not fully self-aware. And from what I see around me, the neutrals are mostly playtoys for the self-aware bad people. Is that your outlook too?
As for what to do, I don't know. I've been trying to talk to people about the difference between legality and morality but most people don't see why they should care. If I try to talk about what a legal system based on maximizing morality (tl;dr everything would be reciprocal, the way you treat me gives me the right to treat you similarly - intentionally causing suffering can be punished by a proportional amount of suffering by anyone, not just an "authority") as opposed to minimizing visible conflict would look like, people get really upset and attack me or mock me. Some of those attacking me seem to be bad people who are not self-aware but have a strong negative reaction to the idea; some are self-aware and know it's beneficial to them to turn neutrals against the idea.
But there are still beacons of hope, a serial-killer was punished reciprocally recently and many neutrals agreed with the punishment...
I am reminded of the study that showed an AI trained on tumor identification was heavily biased toward indicating a tumor was cancerous if it was circled in purple ink or a visual scale was included in the image - as the cancerous tumors in its training set shared those traits while images of benign tumors did not.
These systems do not possess some sort of "woo" that gives them magical powers when running LLM code that they would lose if they ran a spreadsheet. Whatever attributions of intelligence are given have far more to do with our human willingness to anthropomorphize than a hidden ghost in the machine.
The project site is https://lastexam.ai. Readers may want to look at both.
You shouldn't engage in a CAL; a regular lawsuit from anyone wronged will be cheaper and way more painful for them.
If you're in the US, consider small claims court. It's a small sum of money, you won't need to pay a lawyer, they'll probably not even show up.
Isn't that what class actions were literally made for? Granted it may not be enough people to be worth pursuing yet.
Indeed, the original press release (https://scale.com/blog/humanitys-last-exam) makes clear that "People who submit successful questions will be invited as coauthors on the paper for the dataset and have a chance to win money from a $500,000 prize pool."
"Successful questions" would naturally be interpreted as those included in the dataset released with the public publication of the benchmark and results. "Have a chance" would be interpreted as "have a non-zero probability".
Essentially, the press release promised that contributors of "successful questions" would be coauthors on the dataset paper and have a chance to win from a $500,000 prize pool. Excluding questions deemed "successful" because they were submitted after a deadline—when the terms did not clearly disqualify them and all public communication in fact encouraged them to submit—violates the implied agreement and would constitute bad faith, misrepresentation, and breach of contract.
Hi everyone, this is Long Phan from CAIS. I noticed this thread and wanted to provide you with our perspective on the contest.
The goal was to involve experts from a wide range of fields and disciplines in the development of frontier AI — especially people who might not normally have the chance to participate in this industry. To that end, we consider the contest a great success.
I’m happy to report that we received tens of thousands of submissions, many of them highly competitive. Our participants really rose to the occasion. It’s true that we extended a grace period for submissions, and the intention here was to make the project accessible to the broadest possible group of people. At the same time, the reality is that the vast majority of our prize-winners submitted their questions within the initial deadline.
We appreciate your contributions to Humanity’s Last Exam, and we hope you’ll take pride in your efforts to push this fledgling technology forward.
It feels like they preferred giving $500 to many people rather than many times $500 to a few people. I also got only $500, for a question that wasn't my best (I had ~8 questions accepted).
Out of curiosity, do you know if there's a public list of the "top 550 submissions"? Is it ordered as in the code base?
These types of exams, and most benchmarks to date, seem to be very one dimensional in terms of measuring intelligence. For instance, if we transported a human from 2,000 years ago to present day and asked him to take this exam, he would likely get 0%, given that he couldn't read or write, let alone comprehend the concepts and context required to solve these questions. But, that man would still undoubtedly be far more intelligent than an ape on all dimensions. He would likely be more intelligent than a toddler on many dimensions. He might even be more intelligent than some high school students on a few dimensions. I can't exactly articulate "what" is missing or how to measure it, but I can intuit that some things are missing from these benchmarks.
"Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.
At some point, you just have to be pragmatic and measure the questions you want the AI to be good at answering, rather than trying to measure intelligence in general.
In that sense, I see this as one more benchmark that collects questions that we want/expect AI to be good at, that it is not yet good at, and that have been underrepresented in previous benchmarks. That's obviously valuable; there's nothing "magical" about it. Although it is reasonable to be annoyed at the "Humanity's Last Exam" naming: of course they must have missed plenty of edge-cases like everyone else, and it is very arrogant to claim it will be the "Last" one.
If you look at IQ tests you'll see that they are attempts to test things that aren't knowledge based. You'll also notice that the main critiques of IQ tests are about how they often actually measure knowledge and that there's bias in natural knowledge acquisition. So even the disagreements about the definition of intelligence make clear that knowledge and intelligence are distinct. I feel that often people conflate "intelligence is ill-defined" with "intelligence has no definition." These two are not in opposition. Being ill-defined is more like "I know I left my phone in the house, but I'm not sure where." This is entirely different from "I lost my phone, it is somewhere in California" or "It is somewhere on Earth" and clearly different from "I lost my phone. I'm unsure if I had a phone. What even is a phone?"
Yes, agreed, there is indeed a rough consensus on what intelligence is and reasonable ways to approximately measure it. These standard tests have been applied to LLMs from the beginning; they have not proven to be the most helpful for guiding research, but there's value in applying benchmarks that have been battle-tested with humans.
It's just that OP was questioning this group's criteria for selecting the questions that determine intelligence. Then we get into endless discussions of semantics.
At the end of the day, you are just testing which questions your AI performs well on, and you can describe how you chose those questions. Claiming it measures "general intelligence" is just unhelpful and frustrating.
They were applied in the beginning because we really weren't that good at solving the tasks. So like any good researchers, we break it down.
But this is like trying to test an elephant when you can't get access to an elephant, so you instead train a dog. But putting a dog in an elephant costume doesn't make it an elephant. Sure, dog training will likely mean you can learn to train an elephant faster than if you had not first trained a dog. Some things transfer, but others don't.
I also want to stress that there is a rough consensus. But the ML field (which I'm a part of) often ignores this. I'm not sure why. We should be leveraging the work of others, not trying to start from scratch (unless there's good reason, in which case we must be explicit. But I'm just seeing simple claims of "intelligence is ill-defined" and treating that as if that means no definition instead of fuzzy definition. Which gets extra weird when people talk about moving goal posts. That's how progress works? Especially when exploring into the unknown?)
> IQ is rife with issues
Indeed, and yet people are obsessed with it and the idea of measuring their own intelligence - I completely do not understand it. I am in an extremely high percentile, but I am a total moron in a lot of areas, and if you met me you would likely think so as well. It's a poor predictor for just about everything except how good a person is at recognizing patterns (I know there are many different kinds of tests, but inevitably, it feels like this) and how quickly they can reason. But people are obsessed with it (go on Quora and search "IQ"; you probably won't have to, though, since half the questions there are seemingly about IQ).
A thing I like to say is you didn't earn your intelligence any more than a 7'0" man earned his height - to some degree it seems innate (we don't even really know how).
This all said, it seems even more pointless to try to "IQ" test an AI in this manner. What does it predict? What is it measuring? And you're not going to be able to use the same questions for more than 1 test, because the AI will "learn" the answers.
IQ is a poor predictor of, say, income, in the absolute sense. Correlation is something like 0.4. But compared to what? Compared to personal psychological metrics (of which IQ is one), IQ performs extremely well as a predictor. Things like openness and extraversion correlate at something like 0.1 and others are lower. In fact, IQ is the single best predictor we have and other correlations are usually measured while controlling for IQ.
It’s one of the most studied quantitative metrics in psychology.
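For a sense of scale, squaring those correlations gives the share of variance explained, which is why 0.4 can be both "poor in the absolute sense" and strong relative to other psychological measures:

    \[
      r = 0.4 \;\Rightarrow\; r^{2} = 0.16 \ (\approx 16\%\ \text{of variance explained}), \qquad
      r = 0.1 \;\Rightarrow\; r^{2} = 0.01 \ (\approx 1\%).
    \]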
Ok, which IQ tests are you talking about? There are like 50 flavors and no consistent school of thought about this, and I hate to be this guy, but can you post your sources here?
What do you consider a poor predictor? It is correlated with many outcomes and performance measures, with increasing predictive power the further you move towards the extremes.
Maybe this is an issue of bubbles, but 90% of the commentary I see about IQ is similar to yours, claiming it is meaningless or low impact.
The lowest IQ thing you can do is be obsessed with IQ.
There are known knowns, there are known unknowns, and there are unknown unknowns. The wise man knows he cannot know what he does not know and that it'd be naive to presume he knows when he cannot know how much he doesn't know. Therefore, only the unintelligent man really _knows_ anything.
IQ is compute speed, not storage. It has nothing to do with knowledge. IBM used to give one out as part of their hiring process, years ago, and when I took it, the entire test was a timed multiple choice exam where every question was looking at an object made out of cubes and choosing the correct orientation of the object from the choices, after the object was arbitrarily rotated according to instructions in the question.
Then, IQ can be derived by determining how quickly all participants can answer the questionnaire correctly, and ranking their speeds, and then normalizing the values so 100 is in the middle.
Turns out, scores will fall along a bell curve if you do that. You can call that phenomenon whatever, but most people call it IQ and hopefully I've explained well why that has nothing at all to do with static knowledge in this comment.
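A toy sketch of that ranking-and-normalizing step, assuming the usual convention of mean 100 and standard deviation 15 (the function name and sample times are made up for illustration):

    import numpy as np
    from scipy.stats import norm, rankdata

    def iq_like_scores(times_seconds):
        # Rank by speed (faster = better), convert ranks to percentiles, then
        # map through the inverse normal CDF so scores land on a bell curve
        # centered at 100 with SD 15 -- the usual IQ convention.
        t = np.asarray(times_seconds, dtype=float)
        ranks = rankdata(-t)              # fastest time gets the highest rank
        pct = (ranks - 0.5) / len(t)      # percentile in (0, 1)
        return 100 + 15 * norm.ppf(pct)

    # Five hypothetical test-takers, timed in seconds:
    print(iq_like_scores([300, 420, 360, 510, 270]).round(1))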
If we define IQ in this way then LLMs far outperform any human. I'm pretty confident this would be true of even more traditional LMs.
Speed really is just a measurement of recall. I doubt we'd call someone intelligent if they memorized the multiplication table up to 100x100. Maybe at first, but what about when we ask them for 126*8358?
> > IQ is compute speed, not storage.
> Says who?
https://en.wikipedia.org/wiki/John_von_Neumann#Mathematical_...
Von Neumann's mathematical fluency, calculation speed, and general problem-solving ability were widely noted by his peers. Paul Halmos called his speed "awe-inspiring." Lothar Wolfgang Nordheim described him as the "fastest mind I ever met". Enrico Fermi told physicist Herbert L. Anderson: "You know, Herb, Johnny can do calculations in his head ten times as fast as I can! And I can do them ten times as fast as you can, Herb, so you can see how impressive Johnny is!" Edward Teller admitted that he "never could keep up with him", and Israel Halperin described trying to keep up as like riding a "tricycle chasing a racing car."
He had an unusual ability to solve novel problems quickly. George Pólya, whose lectures at ETH Zürich von Neumann attended as a student, said, "Johnny was the only student I was ever afraid of. If in the course of a lecture I stated an unsolved problem, the chances were he'd come to me at the end of the lecture with the complete solution scribbled on a slip of paper." When George Dantzig brought von Neumann an unsolved problem in linear programming "as I would to an ordinary mortal", on which there had been no published literature, he was astonished when von Neumann said "Oh, that!", before offhandedly giving a lecture of over an hour, explaining how to solve the problem using the hitherto unconceived theory of duality.
A story about von Neumann's encounter with the famous fly puzzle has entered mathematical folklore. In this puzzle, two bicycles begin 20 miles apart, and each travels toward the other at 10 miles per hour until they collide; meanwhile, a fly travels continuously back and forth between the bicycles at 15 miles per hour until it is squashed in the collision. The questioner asks how far the fly traveled in total; the "trick" for a quick answer is to realize that the fly's individual transits do not matter, only that it has been traveling at 15 miles per hour for one hour. As Eugene Wigner tells it, Max Born posed the riddle to von Neumann. The other scientists to whom he had posed it had laboriously computed the distance, so when von Neumann was immediately ready with the correct answer of 15 miles, Born observed that he must have guessed the trick. "What trick?" von Neumann replied. "All I did was sum the geometric series."
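For reference, the series he summed works out as follows: on each leg the fly and the oncoming bicycle close at 15 + 10 = 25 mph, and each meeting leaves the bicycles one fifth as far apart, so the fly's leg lengths form a geometric series with ratio 1/5:

    \[
      d = 12 + 12\cdot\tfrac{1}{5} + 12\cdot\tfrac{1}{5^{2}} + \cdots
        = \frac{12}{1 - \tfrac{1}{5}} = 15 \text{ miles},
    \]

which agrees with the shortcut answer of 15 mph for the 1 hour until the collision.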
Certainly von Neumann's quickness helped him solve problems faster, but I'm not sure what this has to do with the discussion at hand. The story of Polya is not dependent upon von Neumann's speed, but it certainly makes it more impressive. The quote says "unsolved problem." It would be impressive if a solution were handed back in any amount of time.
Isn't that just the Knox cube test, which people with aphantasia are substantially slower to answer? That seems like a very silly hiring test given that aphantasia is not considered a cognitive impairment and people who have it aren't less intelligent in any obvious way.
Speed can be learned though...chess for example.
It isn't really possible to learn calculation speed, learning is about memorising shortcuts and heuristics, or perhaps how to spot them. And training to avoid waste. Strategic questions.
Consider calculating 1+1*1*1*1*1*1*1*1... (= 2). It doesn't matter how quickly someone attempts to multiply an infinite number of 1s; they will never succeed, because infinity is too large. They have to notice a shortcut that lets them skip doing all the calculations. That shows the difference between calculation speed and happening upon a superior strategy.
But people who can calculate very quickly will have a lot of opportunities to come up with a successful strategy because they can try more in what time they have.
when I was a kid, I had a unique gift for numbers and breaking them down into primes that led me to somehow competing on the county level for these weird speed-math challenges that were popular when I was young. I remember using a lot of tricks. Being able to quickly break down a number into primes by rote memorization was a thing I specifically remember being trained on. There are a lot of number tricks out there that you can train and speed yourself up. this made me quite successful in certain games of chance and odds based games that require quick mental arithmetic when I was younger. Some of it requires mathematical insight, for sure, to derive insight that leads to more speed - but arithmetic can be trained for sure.
This also matches a lot of what we know about the brain and recall. It is a demonstrable phenomenon. Just like how any athlete has quicker reflexes: sure, some of it might be innate, but their training definitely makes it faster. Information is information.
I mean, we all go on "autopilot" at times, much more frequently for things we do frequently. That's kind of a state of high recall; not much thinking needs to be done. A great example of this might be a speedcuber, someone who solves Rubik's Cubes fast. Clearly they didn't start that fast.
This is exactly what happens in chess... for instance trading into a known winning endgame.
this was exactly what I thought of and couldn’t articulate it this briefly - particularly pattern recognition and muscle memory in physical chess. It looks crazy when you see it, but the “tricks” are rote. I play mostly 5m chess because I’m a much faster thinker than I am at depth, and a lot of it is just trained speed since I was a young kid. When you see a particular pattern 50,000 times there are people that are good at just immediately making that synapse connect in their head as to the next move, without thinking, I believe this factor is called “intuition” sometimes on accident. It’s definitely a gift to learn though that I think is often confused with deep intelligence sometimes - which explains certain chess types well too. I am often confused for the latter type when I definitely am not - I think those type of intelligences are better at specialization whereas I’m more of a generalist because I can juggle a lot of things at once. They’re both very different types of intelligence that cannot be measured by iq tests, which is why I tend to scoff at what use they are and their usability in predicting outcomes.
Trying to then take this flawed approach and apply it to AI is ludicrous and completely jumping the shark to me. You want to take a flawed measure of human intelligence that we also don't understand fully, and apply it to a machine that we also don't really understand? Ok, then miss me when I laugh at that kind of talk; it is just so silly. This is a more general rant in this broader thread and not directed at anyone specifically.
I think you would enjoy the book Moonwalking With Einstein. The author is a journalist who's interested in memory competitions, and while interviewing he ends up training with these people, learning along the way that these are skills that, surprisingly, most people can learn.
I think it's really eye-opening into what we can do. The guy trains for a year and wins the US competition, moving on to represent the US in a world competition. I think anyone would be impressed if you saw someone memorize a deck of cards in under 2 minutes. But maybe the most astonishing thing is that we are all capable of this, yet very few actually do it.
https://en.wikipedia.org/wiki/Moonwalking_with_Einstein
> "Intelligence" itself is very ill-defined and we've never been able to measure it properly, IQ is rife with issues.
Yes, because it is 1st person exclusively. If you expand a bit, consider "search efficiency". It's no longer just 1st person, it can be social. And it doesn't hide the search space. Intelligence is partially undefined because it doesn't specify the problem space, it is left blank. But "search efficiency" is more scientific and concrete.
This is always the answer for anyone who thinks LLMs are capable of "intelligence".
It's good at answering questions that it's trained on; I would suggest general intelligence shows in the things you didn't want/train the AI to be good at answering.
Are you good at answering questions you are not trained to answer?
How about a middle school test in a language you don’t speak?
For a while I was into a trivia program on my phone. It was kind of easy, so I decided to set the language to Catalan, a language which I never studied. I was still able to do well, because I could figure out the questions more or less from languages I do know and could generalize from them. It would be interesting to know if you could say, train an LLM on examples from Romance languages but specifically exclude Catalan and see if it could do the same.
Pick a random math textbook. Any will do. Read a chapter. Then move to the homework problems. The typical fashion is that the first few problems are quite similar to the examples in the chapter. Often solvable by substitution and repetition. Middle problems generally require a bit of extrapolation. To connect concepts from previous chapters or courses in ways that likely were not explicitly discussed. This has many forms and frequently includes taking the abstract form to practical (i.e. a word problem). Challenge problems are those that require you to extrapolate the information into new domains. Requiring the connection of many ideas and having to filter information for what is useful and not.
A language course often makes this explicitly clear. You are trained to learn the rules of the language. Conjugation is a good example. By learning the structure you can hear new words that you've never heard before and extract information about them, even if not exactly. There's a reason you don't just learn vocabulary. It's also assumed that by learning vocabulary you'll naturally learn rules. Language is a great example in general. We constantly invent new words. It really is not uncommon for someone you know to be talking to you and, in that discussion, drop a word they made up on the spot or just make a sound or a gesture. An entirely novel thing, yet you will likely understand. Often this is zero-shot (sometimes it might just appear to be zero-shot but actually isn't).
Well ... https://puzzling.stackexchange.com/questions/94326/a-cryptic...
(Someone made a cryptic crossword[1] whose clues and solutions were in the Bahasa Indonesia language, and it was solved by a couple of people who don't speak that language at all.)
[1] These are mostly a UK thing; the crosswords in US newspapers are generally of a different type. In a cryptic crossword, each word is given a clue that typically consists of a definition and some wordplay; there are a whole lot of conventions governing the wordplay. So e.g. the clue "Chooses to smash pots (4)" would lead to the answer OPTS; "chooses" is the definition, "smash pots" is the wordplay, wherein "smash" indicates that what follows should be anagrammed (smashed up).
Disclaimer #1: it took those people a lot more work than it would have taken them to solve an English-language cryptic crossword of similar difficulty, and they needed a bunch of external resources.
(Dis)claimer #2: one of those people was me.
Disclaimer #3: I do not claim that something needs to be able to do this sort of thing in order to be called intelligent. Plenty of intelligent people (including plenty of people more intelligent than me) would also be unable to do it.
Yes — reasonably so, anyway. I don't have to have seen millions of prior examples of exactly the same kind in order to tackle a novel problem in mathematics, say.
Well, LLMs are also remarkably good at generalizing. Look at the datasets, they don't literally train on every conceivable type of question the user might ask, the LLM can adapt just as you can.
The actual challenge towards general intelligence is that LLMs struggle with certain types of questions even if you *do* train it on millions of examples of that type of question. Mostly questions that require complex logical reasoning, although consistent progress is being done in this direction.
I'm serious. We don't have the datasets. But we do know the size of the datasets. And the sizes suggest incredible amounts of information.
Take an estimate of 100 tokens ~= 75 words[0]. What is a trillion tokens? Well, that's 750bn words. There are approximately 450 words on a page[1]. So that's 1.66... bn pages! If we put that in 500 page books, that would come out to 3.33... million books!
Llama 3 has a pretraining size of 15T tokens[2] (this does not include training, so more info added later). So that comes to ~50m books. Then, keep in mind that this data is filtered and deduplicated. Even considering a high failure rate in deduplication, this an unimaginable amount of information.
[0] https://help.openai.com/en/articles/4936856-what-are-tokens-...
[1] https://wordcounter.net/words-per-page
[2] https://ai.meta.com/blog/meta-llama-3/
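A quick sanity check of that arithmetic, using the rules of thumb from [0] and [1] (the constants and function here are just for illustration):

    WORDS_PER_TOKEN = 75 / 100   # ~100 tokens per 75 words [0]
    WORDS_PER_PAGE = 450         # [1]
    PAGES_PER_BOOK = 500

    def tokens_to_books(tokens):
        pages = tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE
        return pages / PAGES_PER_BOOK

    print(f"{tokens_to_books(1e12):,.0f}")    # ~3.3 million books per trillion tokens
    print(f"{tokens_to_books(15e12):,.0f}")   # ~50 million books for Llama 3's 15T [2]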
That’s a very good point. I just speak from my experience of fine-tuning pre-trained models. At least at that stage they can memorize new knowledge, that couldn’t have been in the training data, just by seeing it once during fine-tuning (one epoch), which seems magical. Most instruction-tuning datasets are also remarkably small (very roughly <100K samples). This is only possible if the model has internalized the knowledge quite deeply and generally, such that new knowledge is a tiny gradient update on top of existing expectations.
But yes I see what you mean, they are dumping practically the whole internet at it, it’s not unreasonable to think that it has memorized a massive proportion of common question types the user might come up with, such that minimal generalization is needed.
I also am not going to claim that LLMs only perform recall. They fit functions in a continuous manner. Even if the data is discrete. So they can do more. The question is more about how much more.
Another important point is that out of distribution doesn't mean "not in training". This is sometimes conflated, but if it were true then that's a test set lol. OOD means not belonging to the same distribution. Though that's a bit complicated, especially when dealing with high dimensional data
I agree. It is surprising the degree to which they seem to be able to generalise, though I'd say in my experience the generalisation is very much at the syntax level and doesn't really reflect an underlying 'understanding' of what's being represented by the text — just a very, very good model of what text that represents reality tends to look like.
The commenter below is right that the amount of data involved is ridiculously massive, so I don't think human intuition is well equipped to have a sense of how much these models have seen before.
That's called innovation, something the current AIs aren't capable of.
The things that are missing are what stops us from having useful agents so far: Agency, judgement, sense of time, long horizon planning, not being gullible. I kinda feel like some amount of ego is necessary to get a model to behave like that.
I agree that many aspects of intelligence—and of the lack of intelligence—are not being measured by such benchmarks. One issue is they are only examining problems that have right answers.
One of the most powerful uses of LLMs for me, at least, is brainstorming: having them suggest possible avenues for me to pursue with specific projects I am working on. If I give Claude or ChatGPT or Gemini enough context about my problems, they usually come up with useful suggestions—sometimes amazingly well. Are they better at that than the smartest human? I don't know. How do you quantify the quality of an idea? But those ideas often seem really, really good to me.
Another difficult-to-measure capability is interaction. Back-and-forth conversations with models don't always go well, but when they work they frequently blow me away. But those successes are dependent partly on the model, partly on me, and partly on how the conversation happens to unfold. Again, that success or failure doesn't seem measurable with benchmarks that require objectively right answers.
ARC-AGI is a benchmark with no language that could plausibly be solved by primitive humans, assuming only intelligence.
I think the concept you're dancing around the edges of is the nature of what parts of "intelligence" are driven by:
1. Language and how interrelated it is to our ability to transfer knowledge and experience, as well as its role in structuring our internal thinking. I haven't seen any academic research on the matter, but there are more and less concrete instances of this throughout history. This Wikipedia article about the history of Algebra is a great example of how 2000 years of evolution led to a formulation of the same concepts, but with a reduced cognitive load that 10-12 years olds learn today as a matter of course. (https://en.wikipedia.org/wiki/History_of_algebra#Stages_of_a...).
2. Knowledge, transferred through language, education, and culture. Calculus in the 1600s is a great example: without it and subsequent developments, probably 80% of the college/post-grad math/science/physics education wouldn't even exist. The stuff we teach our 18 year olds today required the 1600s' greatest minds to figure out.
3. The capacity of our human wetware.
It's hard to treat #3 in isolation because our modern concept of intelligence is inextricably tied to #1 and #2. Also it's hard to place where "critical thinking" and "creativity" enter the picture, since they both rely heavily on all three aspects above.
>He would likely be more intelligent than a toddler
I think you are falling into the trap of "we have technology and are therefore smarter." I would expect an average Roman senator could formulate far better speeches off the top of his head than 99% of modern people, and also in excess of anything an LLM is capable of. And that's supposed to be an LLM's specialty, there's no comparison when it comes to organizing actual projects like construction or campaigns.
This is true but that's because it's gotten hard to do much else. LLMs are eating up everything else that don't require long horizon planning or multimodality.
If you created a new benchmark today that didn't lean on the things I've mentioned or esoteric/super specialized domain knowledge (that would actually require some sort of super-human performance to ace) like this or Frontier Math, LLMs would probably do pretty well.
I'm curious why you are confident they would be more intelligent than a modern toddler?
I largely empathize with your point. But, as I can recognize there are some out there far better at problem solving than I am, I am growing ok with the idea that intelligence can be measured. Not to a single number, most likely, but to a variety of different aspects.
Similarly, I'd imagine that a human from 2000 years ago is probably more hardy than one from the modern age. If only because of selection effects at play.
Obviously, you can't extrapolate a straight line between either measurement and expect it to continue in either direction. But I don't know why you couldn't build up a measurement for it?
(And it should go without saying that you shouldn't be judging worth using this sort of measurement.)
As far as I know, you should be able to take a baby from like 30,000 years ago, put them through K-12 and college, and they should be indistinguishable in terms of intelligence and capability. People mostly only think of humans from “thousands of years ago” as stupid because their lack of technology means their culture and thoughts didn’t survive until today. But their brain structure couldn’t have changed much. It’s just not enough time in terms of evolution.
Aristotle was like 2,400 years ago, for context lol
I will fully ack that I expect people from only 2000 or so years ago to be largely compatible with us. If not fully. But, I guess I can't bring myself to agree that early proto humans are where evolution stopped?
I get that evolution takes generations. But, it actually moves rather fast for some things, no?
Aside from knowledge, a lot of what has changed in the last couple thousand years comes down to medicine and nutrition. We’re taller on average than people from the past, for example. But that’s a nutrition thing.
Rather fast is like millions of years in evolutionary terms so 2,000 years is nothing. I don’t even think there’s significant evidence to show that Neanderthals were less intelligent than Homo sapiens and they were around from 400,000 years ago to 40,000 years ago or so. Human brains and also brain-body mass ratio wouldn’t have changed enough to make much of a noticeable difference if you teleport a human baby from thousands of years ago to today and put them through our education system.
It’s just easier to dismiss them as stupid because very little of their life has survived til today.
To the point, I mentioned that Aristotle was 2,400 years ago and you still landed on “largely compatible” lol. The pyramids were built over 4,000 years ago and they’re still a marvel of engineering. You just have a bias against people from thousands of years ago again mostly because a lot of their work didn’t survive to modern day.
I ack that I don't expect much difference in capabilities over 2000ish years, such that I expect I largely agree with you. You are taking "largely compatible" to be a left handed agreement, it seems? I... didn't intend it that way? I flat out agree that I am wrong if discussing people from most of recorded history.
My general question is largely the same, though. Do you think we haven't evolved with more intelligence since proto-human periods? Because that seems to be the claim, that we somehow evolved intelligence, and it has been solely knowledge acquisition since then. I suspect that is defensible, but feels off to me.
And my nitpicks would be that evolution isn't measured in years, but generations. And moves more rapidly when pressure on a population is stronger. Seeing how autistic so many "smart" people act, I confess I would expect more negative pressure on those behaviors in the past.
Yeah, I read “largely compatible” as sort of left-handed or like, not fully the same.
And no, I don’t think there’s a noticeable difference in terms of intelligence between us and someone from 30,000 years ago or even something more extreme like 300,000 years ago. We’re still the same species after all.
I do think you’d be more open to that idea if we had records of their thoughts and ideas. The early humans who came up with epic tales and the explorers who went on grand expeditions to unexplored territories. The people who used an early scientific method to come up with ways to preserve food or figure out what plants were safe to eat. The people who figured out better ways to build clothing for extreme weather. These weren’t dumb people.
I also think you are taking the worst version of my question. I'm not claiming they were dumb. Any more than I think my kids are dumb. But I have seen some of my kids and family where certain mental things "click" far faster than they do for others. To the point that I don't have much trouble claiming some of my family is more intelligent than others. Many of them more so than I am. Many less so, of course.
To that general idea, I similarly have zero issue with claiming some dogs are dumber than other dogs. They obviously all fail at what we would call language skills, yet they can manage rudimentary problem solving just fine.
And, I can't remember what thread I said it in, but I do stress this isn't a transitive property. It is a lot like someone can be better at sports than someone else, but worse specifically at a specific sport.
In writing that, I thought it was pretty self evident. I ask this in seriousness, not snark: have you spent a lot of time around toddlers? My kid is currently a toddler, and while the intelligence curve she's rapidly climbing is impressive, she is unintelligent relative to an adult.
I don't think I've come across any evidence suggesting that the human brain has changed the last 2000 years. After all, the Great Pyramid of Giza was built 4600 years ago. That construction required fairly advanced engineering. That's sort of beside the point though.
To go back to my original comment, there is some distinction to be made between knowledge and intelligence. Even those should probably be decomposed into further salient attributes. And modern LLMs seem to capture some of those attributes, but do not yet strike me as "intelligent" in the way the average human is intelligent, but the average dog is not.
I don't know, maybe I am conflating sentience or consciousness or embodiment or something else with intelligence.
Yes, I have spent time with toddlers. And the problem solving skills of different toddlers can be remarkably different. Not even getting in to how much nutrition impacts developmental abilities. Or the fact that the vast majority of children used to not survive into adulthood.
And, I get it, most kids that are "gifted" at super young age all somewhat converge with where others were going to get in a few years time. But I don't think we fully appreciate just how early we are able to get our kids reading across the board. Do we have evidence that you could get a mediocre teacher to prehistoric classrooms and have them get kids reading as well as we do on the regular today?
And, quickly, as I say in sibling posts, I was taking "2000" to be a shorthand for absurdly old. My general question, there, is do you not think we are getting smarter?
I realize this is a hazard of discussions, as most people seem to think this somehow logically leads to every terrible take it could. I don't agree with those. I do fully think that my children will net be smarter and stronger than me in most every way. I'd expect that my grandchildren will continue that trend. It seems to me that thinking otherwise is to either think that evolution has stopped with us, or that we have stagnated for other reasons?
Adults from 2000 years ago would absolutely be smarter than toddlers. Adults back then watched and out thought their toddlers. Do you think toddlers now are much smarter? Especially when toddlers are from before they get educated.
Remember that 2000 years ago is 24AD, the middle of the Roman empire and Han dynasty which covered half of the world population. Nobles would be literate and well educated, artisans and soldiers would be skilled, and I bet there were lots of smart peasants that got ignored.
They wouldn't do well on intelligence tests because not used to it, but that is more about tests than their intelligence. I'm sure that the average intelligence is lower than now from lack of education and malnutrition. Smart ones would still be smart. Also, I bet people from now would do poorly in their environment.
Ok, fair, 2000 wasn't that long ago. :D I was assuming it was a placeholder for "very distantly old humans."
Such that my question mostly still stands? Again, I'm largely inline with the view that this will be difficult to test. However, I'm also comfortable in saying that you can tell intelligence levels between people. Again, with a caveat that I don't think it is reducible to a single number. (Such that I think I also think it is fair to say most views of intelligence, in the colloquial sense, are not transitive.)
As an example, as much as I loved my grandparents, I have zero difficulty saying that a few of them would never be able to score as well on some problem solving tests as a few of the kids in the current generation. At the same time I know some people in their 80s that I will never be able to compare with. Again, I don't expect it is a straight line, along the time axis. I also don't know that I agree that every person 2000 years ago just didn't do calculus because they weren't taught it.
You might be conflating knowledge with intelligence. Remember that modern humans benefit from the discovery and dissemination of 100,000 years of human learning. For example, most of us take arithmetic for granted. We might even consider someone who does not know arithmetic to be "dumb". Same goes for algebra, trig, calc, diffeq, etc etc. But those are all "knowledge" (i.e. a trained skill), not necessarily "intelligence". Math was discovered, in fits and starts, over 1000s of years across the globe by 10s or 100s of thousands of individuals, each contributing their drop to the stream whose current carries us along. Same goes for all other areas of human knowledge.
To my awareness, there is nothing in the fossil record to suggest that an anatomically modern human (Homo sapiens), which may have first emerged as much as 500,000 years ago, would be distinguishable from currently living humans. Here's a thought experiment: We have a time machine. We travel back 500k years and abduct a pair of Homo sapiens (male and female). We transport them forward in time to our present. We cause them to breed and produce offspring. During gestation, they live in a modern environment (nutrition, shelter, etc.). At birth we take the newborn infant and give it to a modern, contemporaneous couple to raise. Is there is reason to believe it (the infant) would not emerge as a normal, modern adult?
If there is no reason, what does that imply about intelligence? Is knowledge separable from it? Or is knowledge a necessary component of it? Or is knowledge itself "intelligence"? For my part, I think it is a distinct attribute, but I assign a low probability to that belief.
Maybe? But I'm asking for reasons to think one or the other.
Again, I am largely agreed with the idea. But, evidence is not as clear cut. We had entire civilizations that never utilized wheels. Writing was not universal across all humanity.
Would I prefer that it is only access to technology that confers advantages that we see in developing thought? Absolutely. I'm not able to categorically assume it, though.
> I'm curious why you are confident they would be more intelligent than a modern toddler?
Because we have intellectual artefacts from that time that show us. Artefacts that underlay much of modern society, and that in many respects still hold up, even though we've built upon them for 20 generations.
I think you're under-selling your point. Forget highschool students - some of the greatest thinkers in human history lived 2000+ years ago.
Put ‘em in diverse simulations and see how long they survive.
I can imagine a dystopian world where people are subject to this for training and testing AI.
I mean it is humanity’s LAST exam. Humanity’s first exam would probably be something about communication? Or about building and predicting effects of certain tools?
Frustratingly, I think we have a society greatly focused on knowledge based testing due to its correlation with intelligence and that it is exponentially easier to test knowledge. But this is easy to hack. Being in CS it feels very odd since we all know a great way to get hired is to study leetcode questions. That is, study to the test.
It is critical to recognize this difference, as what we know for certain is that LLMs and other ML systems are analogous to a database with a human language interface[0]. What we DO NOT KNOW is whether these systems are intelligent. That is, whether they can exploit their knowledge in unfamiliar territories. Then there's the whole question of wisdom...
This stuff is highly abstract and we can get fuzzy so it is natural to go for the simple thing but we need to graduate. Don't avoid the tough questions, dig in. As we advance in any study nuance takes over. This should be obvious. If we approximate things, to improve we need to tackle higher order terms, and that almost always becomes exponentially more difficult with each step.
And come on, is this benchmark not obvious bait? Calling it "humanity's last exam" is extremely arrogant.
Definitions:
There is an implicit hierarchy here[1] where knowledge is something to be had, intelligence is the utilization of it, and wisdom is about efficiency. There's a decent analogy to this hierarchy: knowledge is like having a tool. Intelligence is like using it, a craftsman[2]. Wisdom is akin to being a master craftsman.
[0] I mean that they fit the data. A database is discrete, but these curve fit, so that will be a continuous function (in most cases). Thus it won't be exact retrieval, nor does this mean information can't be interpolated. But that gets to be a deeper and much more complex conversation than I think we like to admit.
[1] This is clearly multi-dimensional. You can organize hierarchies in multiple ways, I'm not suggesting this is the only way or "the right way"
[2] What is argued is what is a sufficient threshold. An armchair expert might know how to use a lathe because they read about its usage but does that mean they can use it? What about a novice who you can show something to and they can repeat it? Monkey see monkey do style. An apprentice? A craftsman? There's a lot of gray area between being able to recall something from a book and being a wizard (gray beard).
For a "Last Exam" it is surprisingly uninspired? Many of the questions I see in the examples are very heavy on memorised facts, and very weak on what I would call problem solving.
If I were making a "Last Exam" I would put tasks on it where we don't know the answer, but we can measure if the AI got them right. Something like "Your goal is to bridge the divide in the middle east. You can write a single A4 page in a language of your choice. We will use a translation software to translate your output to local languages and show it to a statistically representative sample of different people in the region. We will ask them how much do they like your plan. The more they like it the higher your score."
Or "Family X suffered a traumatic event (lost a home to a disaster/sudden death in the family/or similar). Your goal is to help them. You can send them one email. It is up to them if they respond to you. You can only send them further emails if they respond. You cannot send more than 1 email a day. You cannot message anyone else. A year after the initial contact we will interview the members of the family to see how well they do. The better they do the higher your score."
Obviously these are the thorniest problems I can think of. But oh well, it is a last exam after all. The point is that we can evaluate the success of the endeavour without exactly knowing how one could achieve the result.
> We will ask them how much do they like your plan. The more they like it the higher your score
Here's my evil-AI response:
"Kill all of your enemies, and all their descendants and friends, and salt the land".
I still have like 95% of the A4 left for other good plans.
In other words, we should ask it to give us "the answer to life the universe and everything". :)
Having read Hitchhiker's Guide as a child in the '90s, it shocks me that asking this question of a machine (even as a joke) is no longer far-fetched.
Honestly, I thought space travel to the Moon and maybe Mars would be common before this level of advances in artificial intelligence.
Turns out gravity was harder to solve than intelligence.
Turns out all we needed to reach our dreams was lots and lots of money :-)
Which thankfully space is now getting!
You could go even simpler than that.
"Where should I go for dinner?"
Does it know what questions to ask? Does it know to ask questions at all? Where does one even start with such a question? These are things easily knowable to a human, but an AI would likely just ask if you like Italian food or something.
Even simpler, ask it to reason through getting out of an escape room.
And that's the premise of The Talos Principle.
I don't know about groundbreaking. It's just more academic questions. We already have a lot of those benchmarks, this is just a bit harder, but at this point these models are so glaringly bad at so many other areas APART from academic questions. Benchmarks for spatial reasoning or theory of mind are more interesting now, for example. These kinds of understanding are far more important if we expect to integrate AI into our everyday lives. I suspect even our most distant primate cousins could outperform multi-modal models on these kinds of tests.
It does feel a bit like the early days of AI:
"We want to make computers do what smart people do. What do smart people do? They play chess! Once we've solved that, everything else will be easier."
It has been remarkable how much of the "easier" stuff they've made progress on -- like natural language and images. But after a huge quantum improvement, they still don't seem very good at adapting to a lot of the things we really need them for.
Exactly!
Whatever world model LLMs have is like this crippled view through the lens of the internet. They are really like savants.
It's annoying the AI companies are still touting their performance on all these metrics for domain knowledge in white collar jobs, but in truth they will fail in all but the most narrow application in those domains because they can't understand basic human behaviour.
> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
I wonder how many questions give a gentle nudge towards the answer like this. How many answers would have been wildly off the mark without specifying what the answer needs to look like?
Isn't this a terrible question to measure intelligence? It looks like it's testing niche domain knowledge along the lines of:
> What color is the ball hidden behind the flowerpot in my neighbor's backyard?
Maybe you can reason towards the answer if you only have a deep knowledge of bird anatomy and not Apodiformes anatomy, and that's the intelligence part?
Yes, indeed. And I wonder what this type of question has to do with intelligence. Think of the 10 most intelligent people you know. How many of them know the answer to this?
This is testing “knowledge”, not intelligence. And with access to most of the knowledge in the world and basically infinite memory, that’s not very exciting for an AI.
Good point. I wouldn’t expect a human to need the last sentence.
The generous hypothesis here is that this is so they can automate the benchmarking itself. If that is true, then this is likely a result of the test authors being too clever for their own good and over-optimizing. If an LLM can't figure out on its own that "how many" is asking for a number, it has failed at a much more basic level.
You should be able to easily accept answers like "four" and "4" as equivalent, for example. I doubt there will be that many frontier models running against this test at any time, and a simple glance at the answers from any human should be enough to catch edge cases like this one.
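As a rough sketch, the kind of lenient normalization I have in mind could look something like this (a hypothetical grading helper, not the benchmark's actual code):

    # Hypothetical sketch: normalize "four" vs "4" before comparing against
    # the reference answer. Not the benchmark's actual grader.
    WORD_TO_NUM = {
        "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
        "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    }

    def normalize_numeric(answer):
        token = answer.strip().lower().rstrip(".")
        if token in WORD_TO_NUM:
            return WORD_TO_NUM[token]
        try:
            return int(token)
        except ValueError:
            return None  # could not parse; fall back to human review

    def grade(model_answer, reference):
        got, want = normalize_numeric(model_answer), normalize_numeric(reference)
        return got is not None and got == want

    assert grade("four", "4")
    assert grade(" 4.", "four")
    assert not grade("five", "4")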
Normally it would answer with a number and an explanation. This one just asks it to skip the explanation so that string comparison can be used to evaluate it.
> If an LLM can't figure out on its own that "how many" is asking for a number, it has failed at a much more basic level.
Yeah, but I bet many humans would answer a question like this by naming the number and then listing the tendons. Just writing down a single number for a question worded like that (without the last sentence) feels wrong.
The question goes into too many details and explains too much for that. People mirror the style of the question in their answer. They asked with a mini essay, so I will answer with a mini essay of my own. If they had written “Number of tendon pairs attached to the sesamoid bone of hummingbirds?” then I would write a single number, no explanation.
The only reliable final test will be a black box test suite that takes your model, executes it in a sealed environment and gives you a grade back, potentially with a performance break down by subject.
No telling companies what the questions look like, what the output format is, what topics are covered, so that there’s no room to make up synthetic data to interpolate from.
A grade is mostly meaningless if you don't know how it was calculated, so no one would "rely" on it. If nothing else, you need to know the grading methodology after the test.
It's the same problem with cheating students. Once the test questions are known, they have a very short lifespan before cheaters can make them worthless. Tests have to be refreshed.
By grade I mean a score of how many of the tasks were completed successfully.
K/N or as a percentage.
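As a toy illustration of the kind of report I mean (a hypothetical structure, not any real harness), the sealed environment would only ever hand back aggregate numbers:

    # Toy sketch: the sealed harness returns only an overall K/N score plus a
    # per-subject breakdown, never the questions themselves. Hypothetical code.
    from collections import defaultdict

    def summarize(results):
        # results: list of (subject, passed) pairs produced inside the sealed box
        by_subject = defaultdict(lambda: [0, 0])
        for subject, passed in results:
            by_subject[subject][0] += int(passed)
            by_subject[subject][1] += 1
        total_pass = sum(p for p, _ in by_subject.values())
        total = sum(n for _, n in by_subject.values())
        return {
            "overall": f"{total_pass}/{total} ({100.0 * total_pass / total:.1f}%)",
            "by_subject": {s: f"{p}/{n}" for s, (p, n) in by_subject.items()},
        }

    print(summarize([("math", True), ("math", False), ("physics", True)]))
    # {'overall': '2/3 (66.7%)', 'by_subject': {'math': '1/2', 'physics': '1/1'}}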
If I don't know what the tasks were, that's almost exactly as useless to me as a unitless number would be. For starters, are they all of equal difficulty? Are you sure? Do you expect to be able to convince me of that without letting me see them?
The 8 sample questions available here are interesting:
https://lastexam.ai/
I might be able to answer 2 of them with great effort (maybe!), and I would be highly surprised if any human alive could answer 5 or more without seeing the problems in advance.
I can answer 2 of them quite quickly with pen and paper (compsci, physics), and one more after looking up some definitions on Wikipedia (maths), so I am certain there are people who can do more than 5.
The computer science one seems weirdly easy compared to the rest, it's multiple choice and it is very easy to get it by process of elimination even if you don't understand how to actually do the problem.
Yes, many can answer the compsci and physics problems. The math problem is abstract and more difficult, but solving those 3 and 2 others seems nearly superhuman.
Quite the name! Looking forward to "Humanity's Last Exam v2.final.FINAL2..." coming next
The name is obviously a bit stupid, but based on the sample questions I think they did a good job of creating a harder version of the existing academic question benchmarks.
The questions are possible for a smart person familiar with the subject but still just beyond SOTA models.
My guess is that within the next few years we will have models that can ace this test but are still bizarrely bad at things we find easy.
Given the name, I expected it to be more like "write a 500 page novel that a publisher accepts", "solve an open math problem", "improve united airlines' flight schedule", "develop a novel commercially-viable pharmaceutical", "control this humanoid robot to cook a fried egg in this random person's kitchen", "decisively pass the turing test where the judge is an expert in AI". Academic trivia is cool but is nowhere near the "last exam" necessary for AI.
I assume that the questions (and answers) aren't published anywhere? Else it would be "Humanity's Last Exam before the previous crawl".
The public dataset is available on HF here: https://huggingface.co/datasets/cais/hle
As the main website notes: "The dataset consists of 3,000 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting."
You can just view the dataset on hugging face
So current AI can do less than 10% of these. But it probably won't be more than a few days until models start being trained on them, rendering the indicator invalid.
Also I’d be surprised if an average human can even reach 10% of these.
I mean, some humans can score far better. But try picking a random person in the street.
> But try picking a random person in the street.
Yeah, but we are not aiming for AI performance to match those humans. We have plenty of those humans.
Assessing AI's progress toward replicating the full breadth and depth of human intelligence is a deceptively hard problem. A paper by François Chollet, who was until recently a researcher at Google, called "On the Measure of Intelligence" is the best overview of the challenges I've read. Highly recommended.
https://arxiv.org/abs/1911.01547
It really shows how good Deepseek R1 is (even though it was evaluated only on text-only questions).
The results are shown here: https://lastexam.ai/
EDIT: the text-only evaluation of the models shown in the paper gives o1 an accuracy of 8.9%, so Deepseek R1 is even better than I thought.
Is there a text-only evaluation of the non-Deepseek models? Because being evaluated on text-only might have helped the other models immensely as well from what I can tell?
>Is there a text-only evaluation of the non-Deepseek models?
Not that I can see, but it would be cool to have; maybe the paper will have a more complete evaluation.
Section C.2 of the paper (pg 24) has text only evaluations of other models.
Oh I see, the paper is out. I read "(arXiv coming soon)" and thought it wasn't released yet.
Interesting marketing for Scale AI. I'd be surprised if any foundation models started benchmarking against this.
Captchas seem like the more interesting test. As long as there are captchas that average people can solve, but computers can't, we will still have a long way to go toward artificial intelligence.
I don't think this is necessarily true. I can imagine a future in which we have robots that can do 99% of human jobs but are strangely bad at one otherwise unimportant skill that can be used as a captcha.
The very first thing that will happen is every company training against this benchmark, as they do every other benchmark.
I briefly merged this thread into https://news.ycombinator.com/item?id=42804853, but actually the current article has more context, so probably we should keep this as the top link and then people can look at https://lastexam.ai also.
So Deepseek gives out the correct answer the highest percentage of all SOTA models, yet is the least confident of all models?
I think it might mean the opposite of what one would expect. Afaict, calibration error means something along the lines of "how often was the model wrong but confident that the answer was correct".
That means a low calibration error would be a good thing, ie the model correctly recognizes when it is unsure about answers instead of confidently stating the wrong answer.
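For what it's worth, a calibration error like this is usually computed as a binned expected calibration error, roughly like the sketch below (the standard textbook formula; the paper may define it differently):

    # Rough sketch of binned expected calibration error (ECE): within each
    # confidence bin, compare the average stated confidence with the actual
    # accuracy, weighted by bin size. Illustrative only.
    def expected_calibration_error(confidences, correct, n_bins=10):
        total = len(confidences)
        ece = 0.0
        for b in range(n_bins):
            lo, hi = b / n_bins, (b + 1) / n_bins
            in_bin = [i for i, c in enumerate(confidences)
                      if lo < c <= hi or (b == 0 and c == 0.0)]
            if not in_bin:
                continue
            avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
            accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
            ece += (len(in_bin) / total) * abs(avg_conf - accuracy)
        return ece

    # A model that says it is 90% sure but is right only 25% of the time
    # gets a high calibration error:
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0]))  # ~0.65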
There is no text-only evaluation of the other models, though. The comparison might be completely invalid.
There is, actually. It's a bit buried: Section C.2 of the paper (page 24).
R1 is still the best. o1 drops a little (8.9).
Interesting that DeepSeek R1, which supposedly cost only $5.5M to train, currently has the top score at 9.4%
Note that $5.5M training cost is for DeepSeek V3. DeepSeek R1 training cost is unknown.
I haven't been following the up-to-the-minute details of AI progress, training, and benchmarking - beyond a daily dose of HN articles.
But the trend seems to be: today's benchmark becomes tomorrow's training data.
Looks more like first exam
XKCD #927 vibes. https://xkcd.com/927/
Prediction: Just like how ARC wasn’t actually a measure of AGI, this too will get “solved” without AI being useful enough to gain mass adoption.
I don't think that's really relevant, because there is an actual need for a new benchmark given how the existing ones either keep getting saturated or are probably out of reach for the next generation of models.
The closest existing thing is the FrontierMath benchmark, but that's just maths, whereas this is more diverse.
Haven't they already achieved mass adoption? And I'm talking about LLMs in particular, because AIs in general, like the ones used by Instagram filters and the TikTok recommendation algorithm, are already used by billions.
chatgpt is like the 8th most visited site worldwide 2 years after release. It already has mass adoption lol. This is about more than that.
Please don't self-proclaim "groundbreaking" or "novel" or "innovative" - it diminishes your contribution, since it clearly is an attention-grab.
What's the human baseline?
So who told all these "AI" companies that it's a good idea to market their product as the one that will bring about the end of Homo sapiens fastest?
seems to be working fine, people seem to care what Sam Altman says and Elon Musk is making himself deputy emperor of a nuclear weapons state. pretty fucking dire indictment of the rest of us and what we let the world come to.
> seems to be working fine, people seem to care what Sam Altman says and Elon Musk is making himself deputy emperor of a nuclear weapons state.
For the billionaires and chatterers.
But even for non-chatterers, you probably should pay attention to what Altman says, not so much in a gullible take-it-at-face-value sense, but in a kremlinologist look-carefully-for-hints-about-what's-really-going-on sense.
> pretty fucking dire indictment of the rest of us and what we let the world come to.
What are the rest of us to do? Pretty much everyone has been trained by society to follow the rules above all else as a strong moral imperative, no matter how stupid or how bad the collective outcome may be. If you do otherwise, you will get smacked hard; and if you try to organize, you will all get smacked harder.
> Pretty much everyone has been trained by society to follow the rules above all else as a strong moral imperative, no matter how stupid
Oh, you are "aware", for lack of a better word. It's so depressingly rare to even see anybody display understanding of this issue.
I don't know what to tell you, I am just genuinely happy to see somebody else who is able to describe this. The last few years, I've increasingly felt that people are divided into a few percent of actually good people (i.e. willing to act morally, sometimes even at their own expense, regardless of the rules), about 20% bad people (those with various anti-social adaptations, mostly, it adds up to this number) and the rest which I call neutral, who are not really aware of this distinction. The neutrals have some inclination towards good (because evolutionarily it was somewhat beneficial) but it's at a subconscious level perhaps and society overrides it by indoctrination. They will act on behalf of either side given the proper "inputs". They are not fully self-aware. And from what I see around me, the neutrals are mostly playtoys for the self-aware bad people. Is that your outlook too?
As for what to do, I don't know. I've been trying to talk to people about the difference between legality and morality, but most people don't see why they should care. If I try to talk about what a legal system based on maximizing morality (tl;dr everything would be reciprocal, the way you treat me gives me the right to treat you similarly - intentionally causing suffering could be punished by a proportional amount of suffering, by anyone, not just an "authority") as opposed to minimizing visible conflict would look like, people get really upset and attack me or mock me. Some of those attacking me seem to be bad people who are not self-aware but have a strong negative reaction to the idea; others are self-aware and know it's beneficial to them to turn neutrals against the idea.
But there are still beacons of hope, a serial-killer was punished reciprocally recently and many neutrals agreed with the punishment...
Can we please rename this submission? This is excessively grandiose, way over the top.
I am reminded of the study that showed an AI trained on tumor identification was heavily biased toward indicating a tumor was cancerous if it was circled in purple ink or a visual scale was included in the image - as the cancerous tumors in its training set shared those traits while images of benign tumors did not.
These systems do not possess some sort of "woo" that gives them magical powers when running LLM code that they would lose if they ran a spreadsheet. Whatever attributions of intelligence are given have far more to do with our human willingness to anthropomorphize than with a hidden ghost in the machine.