I just read Gary Marcus, "Does AI Really Need a Paradigm Shift?" The BIG-bench paper, "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models," speaks to his concerns. In particular, this passage struck me:
"Limitations that we believe will require new approaches, rather than increased scale alone, include an inability to process information across very long contexts (probed in tasks with the keyword context length), a lack of episodic memory into the training set (not yet directly probed), an inability to engage in recurrent computation before outputting a token (making it impossible, for instance, to perform arithmetic on numbers of arbitrary length), and an inability to ground knowledge across sensory modalities (partially probed in tasks with the keyword visual reasoning)."
It's the "inability to engage in recurrent computation before outputting a token" that has my attention. I've been thinking about that one for a while, though that is a more succinct formulation of the problem than any I've come up with. I note that our capacity for arithmetic computation is not part of our native endowment. It doesn't exist in pre-literate cultures, and our particular system originated in India and China and made its way to Europe via the Arabs. We owe the words "algebra" and "algorithm" to that process.
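To make the arithmetic point concrete, here is a minimal Python sketch of the schoolbook addition procedure. It is my own illustration, not anything from the BIG-bench paper, and the function name add_digit_strings is made up for the purpose. The thing to notice is the loop: it carries a bit of state (the carry) from one step to the next, and the number of steps depends on how long the numbers are. That is the kind of recurrent computation the passage is talking about.

```python
def add_digit_strings(a: str, b: str) -> str:
    """Add two non-negative decimal numbers given as digit strings.

    Illustrative sketch only: the name and interface are assumptions.
    """
    i, j = len(a) - 1, len(b) - 1  # start at the rightmost digits
    carry = 0
    digits = []
    # One pass per digit position: the step count grows with the
    # length of the inputs instead of being fixed in advance.
    while i >= 0 or j >= 0 or carry:
        da = int(a[i]) if i >= 0 else 0
        db = int(b[j]) if j >= 0 else 0
        total = da + db + carry
        digits.append(str(total % 10))  # digit written at this position
        carry = total // 10             # state handed to the next step
        i -= 1
        j -= 1
    return "".join(reversed(digits)) or "0"


print(add_digit_strings("999", "1"))
```

A procedure that is only allowed a fixed number of steps per output cannot run this loop to completion once the numbers get long enough, which is how I read the paper's remark about numbers of arbitrary length.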
Think of arithmetic as a very specialized form of language, which it is. That is to say, it piggy-backs on language. The capacity for recurrent computation is part of the language system. Language involves both a stream of signifiers and a stream of signifieds. I think you'll find that the capacity for recurrent computation is required to manage those two streams. And that's where you'll find operations over variables and an explicit type/token distinction (which Marcus mentions in his post).
Of course, linguistic fluency is one of the most striking characteristics of these LLMs. So one might think that architectural weakness – for that is what it is – has little or no effect on language, whatever its effect on arithmetic. But I suspect that's wrong. We know that the linguistic fluency has a relatively limited span. I'm guessing that effectively and consistently extending that span is going to require the capacity for recurrent computation. That capacity is necessary to stay focused on the unfolding development of a single topic. That problem isn't going to be fixed by allowing for wider attention during the training process, though that might produce marginal improvements.
The problem is architectural and requires an architectural fix, both for the training engine and the inference engine. Simply scaling up – more parameters, more data – won't fix the problem.
