How Not To Destroy the World With AI - Stuart Russell

By Bbenzon @bbenzon

So we actually need to get rid of the standard model; we need a different model. The standard model is this: machines are intelligent to the extent that their actions can be expected to achieve their objectives.

Instead, we need the machines to be beneficial to us. We don't want this sort of pure intelligence that, once it has the objective, is off doing its own thing. We want systems that are beneficial, meaning that their actions can be expected to achieve our objectives.

And how do we do that? [...] You do not build in a fixed, known objective up front. Instead, the machine knows that it doesn't know what the objective is, but it still needs a way of grounding its choices over the long run.

And the evidence about human preferences, we will say, flows from human behavior. [...] So we call this an assistance game. It involves at least one person and at least one machine, and the machine is designed to be of assistance to the human. [...] The key point is that there's a priori uncertainty about what those utility functions are. So it's got to optimize something, but it doesn't know what it is.
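To make that setup concrete, here is a bare-bones Python sketch of the structure just described. The field names and types are mine, chosen purely for illustration; the essential features are the ones Russell lists: a single payoff shared by both players, a preference parameter that only the human knows, and a prior over that parameter which is all the robot gets.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# A bare-bones rendering of the assistance-game setup (illustrative field
# names, not Russell's formal definition). Both players share ONE payoff
# U(human_action, robot_action, theta); the human knows theta, while the
# robot only has a prior over it and must maximize expected payoff under
# that prior.

@dataclass
class AssistanceGame:
    human_actions: Sequence[str]
    robot_actions: Sequence[str]
    thetas: Sequence[str]                      # candidate human preference parameters
    prior: Dict[str, float]                    # robot's a priori belief P(theta)
    payoff: Callable[[str, str, str], float]   # U(human_action, robot_action, theta)
```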

And if you solve the game, in principle you can just solve these games offline and then look at the solution and how it behaves. As the solution unfolds, information about the human utilities is effectively flowing at runtime, based on the human's actions. And the humans can take deliberate actions to try to convey information, and that's part of the solution of the game: they can give commands, they can prohibit you from doing things, they can reward you for doing the right thing. [...]
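One way to picture that runtime flow of information is as Bayesian updating under a noisily rational model of the human, which is the standard assumption in inverse reinforcement learning rather than necessarily the exact model used in the talk. The hypotheses and numbers below are invented purely for illustration.

```python
import numpy as np

# The robot watches the human choose an action and updates its belief over
# the preference parameter theta, assuming the human picks actions with
# probability proportional to exp(rationality * utility) (Boltzmann model).

thetas = ["likes_coffee", "likes_tea"]          # illustrative preference hypotheses
belief = np.array([0.5, 0.5])                   # robot's prior P(theta)

actions = ["make_coffee", "make_tea"]
utility = np.array([[2.0, 0.0],                 # human's utilities under likes_coffee
                    [0.0, 2.0]])                # human's utilities under likes_tea

def update(belief, observed_action_idx, rationality=1.0):
    """Posterior over theta after observing one human action."""
    likelihood = np.exp(rationality * utility)          # P(action | theta), unnormalized
    likelihood /= likelihood.sum(axis=1, keepdims=True)
    posterior = belief * likelihood[:, observed_action_idx]
    return posterior / posterior.sum()

# The human makes tea; the robot's belief shifts toward "likes_tea".
belief = update(belief, actions.index("make_tea"))
print(dict(zip(thetas, belief.round(3))))       # {'likes_coffee': 0.119, 'likes_tea': 0.881}
```

Commands, prohibitions, and rewards fit the same pattern: each is just an observed human action whose likelihood depends on the hidden preferences, so each one moves the posterior.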

So in some sense, the entire written record of humanity is a record of humans doing things and other people being upset about it. All of that information is useful for understanding, algorithmically, what human preference structures really are.

We can solve these, and in fact the one-machine, one-human game can be reduced to a partially observable MDP (POMDP).

And for small versions of that we can solve it exactly, and actually look at the equilibrium of the game and how the agents behave.
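As a toy illustration of what "solve it exactly" can look like, here is a tiny two-hypothesis version in which the preference parameter theta is the POMDP's hidden state and the robot plans over its belief b = P(theta = A). The actions, payoffs, and query cost are made up for the sketch.

```python
import numpy as np

# Toy belief-state solution of a one-human, one-robot game (illustrative
# numbers). theta is "A" or "B". Robot actions: do_x (good under A, bad
# under B), do_y (the reverse), or ask (small cost; the human then reveals
# theta and the robot acts optimally, earning +1 either way).

ASK_COST = 0.1

def value(b: float) -> tuple[float, str]:
    """Optimal value and action for belief b = P(theta = A)."""
    v_do_x = b * (+1) + (1 - b) * (-1)
    v_do_y = b * (-1) + (1 - b) * (+1)
    v_ask = -ASK_COST + 1.0          # ask, learn theta, then act optimally
    return max((v_do_x, "do_x"), (v_do_y, "do_y"), (v_ask, "ask"))

for b in np.linspace(0.0, 1.0, 5):
    v, a = value(b)
    print(f"P(theta=A) = {b:.2f}  ->  act: {a:5s}  value: {v:+.2f}")

# Only when the belief is confident enough (here b = 0 or b = 1) does the
# robot act directly; in between, paying the small cost to consult the human
# is the optimal policy.
```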

But there's an important point here, and the word "alignment" is often used in discussing these kinds of things. As Ken mentioned, this is related to inverse reinforcement learning, the learning of human preference structures by observing behavior. But alignment gives you this idea that we're going to align the machine and the human and then off they go. That's never going to happen in practice.

The machines are always going to have considerable uncertainty about human preference structures, partly because there are just whole areas of the universe where there's no experience and no evidence from human behavior about how we would behave or how we would choose in those circumstances. And of course, we don't know our own preferences in those areas either. [...]

So when you look at these solutions, how does the robot behave? If it's playing this game, it actually defers to human requests and commands. It behaves cautiously, because it doesn't want to mess with parts of the world where it's not sure about your preferences. In the extreme case, it's willing to be switched off.

In the interest of time, I'm going to have to skip over the proof of that, which is proved with a little game. But basically we can show very straightforwardly that as long as the robot is uncertain about how the human is going to choose, it has a positive incentive to allow itself to be switched off. It gains information by leaving that choice available to the human. And it only closes off that choice when it has, or at least believes it has, perfect knowledge of human preferences.
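The flavour of that argument can be reproduced numerically. In the sketch below (with invented numbers), deferring, i.e. proposing the action and letting the human veto it, is never worse than acting unilaterally or shutting down, and is strictly better whenever the robot's belief puts positive probability on both good and bad outcomes.

```python
import numpy as np

# Illustrative off-switch calculation (numbers made up for the sketch).
# The robot can (a) act now, (b) switch itself off, or (c) defer: propose
# the action and let the human block it, assuming the human blocks it
# exactly when its utility U is negative.

utilities = np.array([-2.0, 1.0, 3.0])   # robot's hypotheses about the human's utility U
belief    = np.array([0.3, 0.4, 0.3])    # robot's probability over those hypotheses

value_act    = np.dot(belief, utilities)                  # commit without asking:  +0.70
value_switch = 0.0                                        # shut down, do nothing:   0.00
value_defer  = np.dot(belief, np.maximum(utilities, 0))   # human vetoes when U < 0: +1.30

print(f"act now : {value_act:+.2f}")
print(f"shut off: {value_switch:+.2f}")
print(f"defer   : {value_defer:+.2f}")

# value_defer >= max(value_act, value_switch) always, and the inequality is
# strict whenever the belief assigns positive probability to both U > 0 and
# U < 0 -- i.e. whenever the robot is genuinely uncertain, it prefers to
# leave the off-switch in the human's hands.
```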