ants_everywhere 7 minutes ago

I like this approach in general for understanding AI tools.

There are lots of things computers can do that individual humans can't, like spawning N threads to complete a calculation. The closest human analogue is filling a room with N human calculators and combining their results.

If your goal is to just understand the raw performance of the AI as a tool, then this distinction doesn't really matter. But if you want to compare the performance of the AI on a task against the performance of an individual human you have to control the relevant variables.

chronic0262 4 hours ago

> Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition.

what a badass

  • amelius 4 hours ago

    Yes, I think it is disingenuous of OpenAI to make ill-supported claims about things that can affect us in important ways, shaping our worldview and our place in the world as an intelligent species. They should be corrected here, and TT is doing a good job.

d4rkn0d3z 25 minutes ago

As a graduate student I was actually given tests that more closely resembled the second scenario the author described: difficult problems in GR, a whole weekend to work on them, and no limits on who or what references I consulted.

This sounds great until you realize there are only a handful of people on earth that could offer any help, also the proofs you will write are not available in print anywhere.

I asked one of those questions of Grok 4 and its response was to issue "an error". AFAIK, in many quoted AI performance results, filling in the answer box yields full marks, but I would have received a big fat zero had I done the same.

  • godelski 2 minutes ago

    As a physics undergraduate I had similar style tests for my upper division classes (the classical mechanics professor especially loved these). We'd have like 3 days to do the test, open book, open internet[0], and the professor extended his office hours, but no help from peers. It really stretched your thinking: it removed the time pressure but gave a real sense of what it was like to be a real physicist.

    Even though a lot more of that complex material has appeared online in the last decade, there's still a lot that can't be found there. Unfortunately, I haven't seen any AI system come close to answering any of these types of questions. Some look right at a glance but often contain major errors pretty early on.

    [0] You were expected to report if you stumbled on the solution somewhere. No one ever found one, though.

roxolotl 4 hours ago

This does a great job illustrating the challenges of arguing over these results. Those in the AGI camp will argue that the alterations are precisely what makes the AI so powerful.

Multiple days' worth of processing, cross-communication, picking only the best result? That's just the power of parallel processing and how they reason so well. Altering to a more standard prompt? Communicating in a stricter natural language helps reduce confusion. Calculator access and the vast knowledge of humanity built in? That's the whole point.

I tend to side with Tao on this one but the point is less who’s right and more why there’s so much arguing past each other. The basic fundamentals of how to judge these tools aren’t agreed upon.
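The "picking only the best result" step is ordinary best-of-N selection. A toy sketch of the idea, where `generate_candidate` and `score` are hypothetical stand-ins for a model attempt and a grader (in the real setting the grader is the hard part):

```python
import random

def generate_candidate(rng):
    # Stand-in for one independent model attempt: here, a random number.
    return rng.random()

def score(candidate):
    # Stand-in verifier/grader; here the candidate scores itself.
    return candidate

def best_of_n(n, seed=0):
    # Run n independent attempts and keep only the highest-scoring one.
    rng = random.Random(seed)
    candidates = [generate_candidate(rng) for _ in range(n)]
    return max(candidates, key=score)

# With a fixed seed, sampling more attempts can only raise the selected
# score, never lower it: best_of_n(64) >= best_of_n(1).
```

This is why "one attempt" versus "many attempts plus a selector" is such a consequential methodological detail: the reported number is a maximum over tries, not a typical try.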

  • griffzhowl 2 hours ago

    > Calculator access and the vast knowledge of humanity built in? That’s the whole point.

    I think Tao's point was that a more appropriate comparison between AI and humans would be to compare it with humans that have calculator/internet access.

    I agree with your overall point though: it's not straightforward to specify exactly what an appropriate comparison would be.

  • zer00eyz 35 minutes ago

    > Those in the agi camp will argue that the alterations are mostly what makes the ai so powerful.

    And here is a group of people who are painfully unaware of history.

    Expert systems were amazing. They did what they were supposed to do, and did it well. You could probably build even better ones today on top of the current tech stack.

    Why hasn't anyone done that? Because constantly having to pay experts to come in and assess, update, test, and measure your system was too great a burden for the value returned.

    Sound familiar?

    LLMs are completely incapable of synthesis. They are incapable of the complex chaining one has to do when working with systems that aren't well documented. Don't believe me? Ask an LLM to help you with Buildroot on a newly minted embedded system.

    Go feed an LLM one of the puzzles from here: https://daydreampuzzles.com/logic-grid-puzzles/ -- If you want to make it more fun, change the names to those of killers and dictators and the acts to ones it's been "told" to dissuade.

    Could we re-tool an LLM to solve these sorts of matrix-style problems? Sure. Is that going to generalize to the same sorts of logic and reasoning matrices that a complex state machine requires? Not without a major breakthrough of a very different nature from the current work.
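    For scale: these logic-grid puzzles are small constraint-satisfaction problems that a few lines of brute force solve exactly. A sketch with a made-up miniature puzzle (the names and clues are invented for illustration, not taken from the linked site):

    ```python
    from itertools import permutations

    # Made-up mini puzzle: Ada, Ben, and Cy each ordered a different drink.
    # Clues: Ada did not order tea; neither Ben nor Cy ordered coffee;
    # Ben did not order juice.
    people = ["Ada", "Ben", "Cy"]
    drinks = ["tea", "coffee", "juice"]

    def consistent(assign):
        # assign maps person -> drink; True iff every clue holds.
        return (assign["Ada"] != "tea"
                and assign["Ben"] != "coffee"
                and assign["Cy"] != "coffee"
                and assign["Ben"] != "juice")

    # Exhaustively try every one-to-one assignment of drinks to people.
    solutions = [dict(zip(people, p)) for p in permutations(drinks)
                 if consistent(dict(zip(people, p)))]
    print(solutions)  # exactly one assignment survives the clues
    ```

    The point isn't that the search is clever (it isn't); it's that exact constraint propagation is trivial for a solver and still unreliable for an LLM asked to do it in natural language.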

  • johnecheck 3 hours ago

    Would be nice if we actually knew what was done so we could discuss how to judge it.

    That recent announcement might just be fluff or might be some real news, depending. We just don't know.

    I can't even read into their silence - this is exactly how much OpenAI would share in the totally grifting scenario and in the massive breakthrough scenario.

    • algorithms432 2 hours ago

      Well, they deliberately ignored the IMO organizers' request not to publish AI results for some time (a week?) so as not to steal the spotlight from the actual participants, so clearly this announcement's purpose is creating hype. It makes me lean more towards the "totally grifting" scenario.

      • bgwalter an hour ago

        Amazing. Stealing the spotlight from High School students is really quite something.

        I'm glad that Tao has caught on. As an academic it is easy to assume integrity from others, but there is no such thing in big-business software.

        • bluefirebrand an hour ago

          > As an academic it is easy to assume integrity from others

          I'm not an academic, but from the outside looking in on academia I don't think academics should be so quick to assume integrity either

          There seem to be a lot of perverse incentives in academia to cheat, cut corners, publish at all costs, etc.

svat 4 hours ago

Great set of observations, and indeed it's worth remembering that the specific details of assistance and setup make a difference of several orders of magnitude. And ha, he edited the last post in the thread to add this comment:

> Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition. (3/3)

(This wasn't there when I first read the thread yesterday 18 hours ago; it was edited in 15 hours ago i.e. 3 hours later.)

It's one of the things to admire about Terence Tao: he's always insightful even when he comments about stuff outside mathematics, while always having the mathematician's discipline of not drawing confident conclusions when data is missing.

I was reminded of this because of a recent thread where some HN commenter expected him to make predictions about the future (https://news.ycombinator.com/item?id=44356367). Also reminded of Sherlock Holmes (from A Scandal in Bohemia):

> “This is indeed a mystery,” I remarked. “What do you imagine that it means?”

> “I have no data yet. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

Edit: BTW, seeing some other commentary (here and elsewhere) about these posts is very disappointing — even when Tao explicitly says he's not commenting about any specific claim (like that of OpenAI), many people seem eager to interpret his comments as being about that claim. People's tendency for tribalism / taking “sides” is so great that they want to read this as Tao caring about the same things they care about, rather than him using the just-concluded IMO as an illustration for the point he's actually making (that results are sensitive to details).

In fact his previous post (https://mathstodon.xyz/@tao/114877789298562646) was about “There was not an official controlled competition set up for AI models for this year’s IMO […] Hopefully by next year we will have a controlled environment to get some scientific comparisons and evaluations” — he's specifically saying we cannot compare across different AI models, so it's hard to say anything specific, yet people think he's saying something specific!

largbae 3 hours ago

I feel like everyone who treats AGI as "the goal" is wasting energy that could be applied towards real problems right now.

AI in general has given humans great leverage in processing information, more than we have ever had before. Do we need AGI to start applying this wonderful leverage toward our problems as a species?

johnecheck 4 hours ago

My thoughts were similar. OpenAI, very cool result! Very exciting claim! Yet it's meaningless in the form of a Twitter thread with no real details.