What’s It Mean to Understand How LLMs Work?

By Bbenzon @bbenzon

I don’t think we know. What bothers me is that people in machine learning seem to think of word meanings as Platonic ideals. No, that’s not what they’d say, but some such belief seems implicit in what they’re doing. Let me explain.

I’ve been looking through two Anthropic papers on interpretability: Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, and Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. They’re quite interesting. In some respects they involve technical issues that are a bit beyond me. But, setting that aside, they also involve masses of detail that you just have to slog through in order to get a sense of what’s going on.

As you may know, the work centers on things that they call features, a common term in this business. I gather that:

  • features are not to be identified with individual neurons or even well-defined groups of neurons, which is fine with me,
  • nor are features to be closely identified with particular tokens. A wide range of tokens can be associated with any given feature.

There is a proposal that these features are some kind of computational intermediate.
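
For concreteness, here is a minimal sketch of the dictionary-learning setup those papers describe: a sparse autoencoder that re-expresses a layer’s neuron activations as a much larger set of sparsely active features. The layer sizes, weights, and penalty below are illustrative assumptions on my part, not the papers’ actual architecture or hyperparameters.

    # A toy sparse autoencoder in the spirit of the dictionary-learning papers.
    # All sizes and constants are assumed for illustration.
    import numpy as np

    rng = np.random.default_rng(0)

    d_model = 512       # width of the model layer being decomposed (assumed)
    d_features = 4096   # dictionary size: many more features than neurons (assumed)

    W_enc = rng.normal(0.0, 0.02, size=(d_model, d_features))
    b_enc = np.zeros(d_features)
    W_dec = rng.normal(0.0, 0.02, size=(d_features, d_model))
    b_dec = np.zeros(d_model)

    def encode(x):
        """Map neuron activations to sparse, non-negative feature activations."""
        return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

    def decode(f):
        """Reconstruct the original activations from the feature activations."""
        return f @ W_dec + b_dec

    def loss(x, lam=1e-3):
        """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
        f = encode(x)
        x_hat = decode(f)
        return np.mean((x - x_hat) ** 2) + lam * np.abs(f).mean()

    # A batch of fake activations standing in for a model's internal states.
    x = rng.normal(size=(8, d_model))
    print(loss(x))

The point of the sketch is that a “feature,” in this setup, is a direction learned by the autoencoder: it can fire for many different tokens and need not line up with any single neuron, which is consistent with the two bullet points above.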

We’ve got neurons, features, and tokens. I believe that the number of token types is on the order of 50K or so. The number of neurons is considerably larger; it varies with the size of the model, but will be 3 or 4 orders of magnitude larger. The weights on those neurons characterize all possible texts that can be constructed with those tokens. Features are some kind of intermediate between neurons and texts.
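
To make those orders of magnitude concrete, here is a trivial back-of-the-envelope calculation; the figures are round assumptions, not measurements of any particular model.

    # Rough scale comparison; all numbers are assumed round figures.
    token_types = 50_000                   # on the order of 50K token types
    neurons_low = token_types * 10 ** 3    # 3 orders of magnitude larger
    neurons_high = token_types * 10 ** 4   # 4 orders of magnitude larger
    print(f"token types: {token_types:,}")
    print(f"neurons: roughly {neurons_low:,} to {neurons_high:,}")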

The question that keeps posing itself to me is this: What are we looking for here? What would an account of model mechanics, if you will, look like?

A month or so ago Lex Fridman posted a discussion with Ted Gibson, an MIT psycholinguist, which I’ve excerpted here at New Savanna. Here’s an excerpt:

LEX FRIDMAN: (01:30:35) Well, let’s take a stroll there. You wrote that the best current theories of human language are arguably large language models, so this has to do with form.

EDWARD GIBSON: (01:30:43) It’s a kind of a big theory, but the reason it’s arguably the best is that it does the best at predicting what’s English, for instance. It’s incredibly good, better than any other theory, but there’s not enough detail.

LEX FRIDMAN: (01:31:01) Well, it’s opaque. You don’t know what’s going on.

EDWARD GIBSON: (01:31:03) You don’t know what’s going on.

LEX FRIDMAN: (01:31:05) Black box.

EDWARD GIBSON: (01:31:06) It’s in a black box. But I think it is a theory.

LEX FRIDMAN: (01:31:08) What’s your definition of a theory? Because it’s a gigantic black box with a very large number of parameters controlling it. To me, theory usually requires a simplicity, right?

EDWARD GIBSON: (01:31:20) Well, I don’t know, maybe I’m just being loose there. I think it’s not a great theory, but it’s a theory. It’s a good theory in one sense in that it covers all the data. Anything you want to say in English, it does. And so that’s how it’s arguably the best, is that no other theory is as good as a large language model in predicting exactly what’s good and what’s bad in English. Now, you’re saying is it a good theory? Well, probably not because I want a smaller theory than that. It’s too big, I agree.

It’s that smaller theory that interests me. Do we even know what such a theory would look like?

Classically, linguists have been looking for grammars, a finite set of rules that characterizes all the sentences in a language. When I was working with David Hays back in the 1970s, we were looking for a model of natural language semantics. We chose to express that model as a directed graph. Others were doing that as well. Perhaps the central question we faced was this: what collection of node types and what collection of arc types did we need to express all of natural language semantics? Even more crudely, what collection of basic building blocks did we need in order to construct all possible texts?
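
To give a flavor of what such a model looks like, here is a toy sketch of a directed-graph semantic representation with a small, fixed inventory of node types and arc types. The particular inventories and the example sentence are only illustrative; they are not the building blocks we actually settled on.

    # A toy directed-graph semantic representation: typed nodes, typed arcs.
    # The node and arc inventories here are illustrative assumptions.
    from dataclasses import dataclass, field

    NODE_TYPES = {"entity", "event", "property"}
    ARC_TYPES = {"agent", "object", "recipient", "isa"}

    @dataclass
    class Node:
        label: str
        kind: str  # one of NODE_TYPES

    @dataclass
    class Graph:
        nodes: list = field(default_factory=list)
        arcs: list = field(default_factory=list)   # (head, arc_type, tail) triples

        def add_node(self, label, kind):
            assert kind in NODE_TYPES
            node = Node(label, kind)
            self.nodes.append(node)
            return node

        def add_arc(self, head, arc_type, tail):
            assert arc_type in ARC_TYPES
            self.arcs.append((head, arc_type, tail))

    # "John gives Mary a book" as a tiny semantic graph.
    g = Graph()
    give = g.add_node("give", "event")
    john = g.add_node("John", "entity")
    mary = g.add_node("Mary", "entity")
    book = g.add_node("book", "entity")
    g.add_arc(give, "agent", john)
    g.add_arc(give, "object", book)
    g.add_arc(give, "recipient", mary)

The central question, in these terms, is how small those two inventories can be while still covering everything a natural language can say.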

These machine learning people seem to be operating under the assumption that they can figure it out by an empirical bottom-up procedure. That strikes me as being a bit like trying to understand the principles governing the construction of temples by examining the materials from which they’re constructed, the properties of rocks and mortar, etc. You can’t get there from here. Now, I’ve some ideas about how natural language semantics works, which puts me a step ahead of them. But I’m not sure how far that gets us.

What if the operating principles of these models can’t be stated in any existing conceptual framework? The implicit assumption behind all this work is that, if we keep at it with the proper tools, sooner or later the model is going to turn out to be an example of something we already understand. To be sure, it may be an extreme, obscure, and extraordinarily complicated example, but in the end, it’s something we already understand.

Imagine that some UFO crashes in a field somewhere and we are able to recover it, more or less intact. Let us imagine, for the sake of argument, that the pilots have disappeared, so all we’ve got is the machine. Would we be able to figure out how it works? Imagine that somehow a modern digital computer were transported back in time and ended up in the laboratory of, say, Nikola Tesla. Would he have been able to figure out what it is and how it works?

Let’s run another variation on the problem. Imagine that some superintelligent, but benevolent aliens were to land, examine our LLMs, and present us with documents explaining how they work. We would be able to read and understand those documents. Remember, these are benevolent aliens, so they’re doing their best to help us. I can imagine three possibilities:

  1. Yes, perhaps with a bit of study, we can understand the documents.
  2. We can’t understand them right away, but the aliens establish a learning program that teaches us what we need to know to understand those documents.
  3. The documents are forever beyond us.

I don’t believe possibility 3. Why not? Because I don’t believe our brains limit us to current modes of thought. In the past we’ve invented new ways of thinking; there’s no reason why we couldn’t continue doing so, or learn new methods under the tutelage of benevolent aliens.

That leaves us with 1 and 2. Which is it? At the moment I’m leaning toward 2. But of course those superintelligent aliens don’t exist. We’re going to have to figure it out for ourselves.
