Competence and Performance in Benchmarking LLMs

By Bbenzon @bbenzon

I’ve been hearing complaints about the inadequacy of LLM benchmarks for two years now. So let’s think a bit. Consider this passage from Rodney Brooks, [FoR&AI] The Seven Deadly Sins of Predicting the Future of AI (Sept. 7, 2017):

One of the social skills that we all develop is an ability to estimate the capabilities of individual people with whom we interact. It is true that sometimes “out of tribe” issues tend to overwhelm and confuse our estimates, and such is the root of the perfidy of racism, sexism, classism, etc. In general, however, we use cues from how a person performs some particular task to estimate how well they might perform some different task. We are able to generalize from observing performance at one task to a guess at competence over a much bigger set of tasks. We understand intuitively how to generalize from the performance level of the person to their competence in related areas.

When in a foreign city we ask a stranger on the street for directions and they reply in the language we spoke to them with confidence and with directions that seem to make sense, we think it worth pushing our luck and asking them about what is the local system for paying when you want to take a bus somewhere in that city. If our teenage child is able to configure their new game machine to talk to the household wifi we suspect that if sufficiently motivated they will be able to help us get our new tablet computer on to the same network.

If we notice that someone is able to drive a manual transmission car, we will be pretty confident that they will be able to drive one with an automatic transmission too. Though if the person is North American we might not expect it to work for the converse case.

He then gives a few more examples.

So, consider benchmarks. Many of them are standardized tests devised to discriminate among humans. Any human capable of taking one of these tests – such as the Advanced Placement test for physics, the law bar exam, a standard set of programming problems – is capable of performing other tasks in the domain, tasks which may not be amenable to testing procedures. The tests themselves are nonetheless sufficient to sort individuals.

Consider the bar examination. Practicing lawyers have to be able to meet with clients, take depositions, negotiate with other lawyers, and appear in court. None of these things are tested in a bar exam, but lawyers have to do them. To use Brooks’ terms, an individual’s performance on the bar exam does not directly test their overall competence as a lawyer; with a human candidate we infer that competence from the performance. We have no grounds for making the same inference about an LLM, so it would be a mistake to assume that an LLM’s performance on the bar exam is an indication of its overall lawyerly competence.

Furthermore, all tests present the test-taker with a well-defined situation to which they must respond. But life isn’t like that. It’s messy and murky. Perhaps the most difficult thing a person has to do is to wade into the mess and murk and impose a structure on it – perhaps by simply asking a question – so that one can then set about dealing with that situation in terms of the imposed structure. Tests give you a structured situation. That’s not what the world does.

Consider this passage from Sam Rodriques, “What does it take to build an AI Scientist”:

Scientific reasoning consists of essentially three steps: coming up with hypotheses, conducting experiments, and using the results to update one’s hypotheses. Science is the ultimate open-ended problem, in that we always have an infinite space of possible hypotheses to choose from, and an infinite space of possible observations. For hypothesis generation: How do we navigate this space effectively? How do we generate diverse, relevant, and explanatory hypotheses? It is one thing to have ChatGPT generate incremental ideas. It is another thing to come up with truly novel, paradigm-shifting concepts.

Right.

How do we put an LLM, or any other AI, out in the world where it can roam around, poke into things, and come up with its own problems to solve? If you want AGI in any deep and robust sense, that’s what you have to do. That calls for real agency. I don’t see that OpenAI or any other organization is anywhere close to figuring out how to do this.