A place is set at a lavishly decorated table. The drape of the tablecloth is just so. Light twinkles off fine crystal goblets. Two nicely dressed people take their seats. One of them pours wine into his companion’s goblet and passes it to her. She accepts it, takes a sip, and smiles.
Cut. The elaborate lighting and multiple cameras power down. The actors get up and change out of their costumes. The set is disassembled. This wasn’t the opening scene of a two-hour rom-com. This was the movie.
That’s because this film wasn’t intended for human consumption. It was more like a puppet show for AI.
AI has learned a lot – enough to make six-fingered-but-otherwise-realistic photos and to give you plausible answers in the medium of text (which should always be fact-checked!). But it has no idea what any of this stuff means, as you probably know from reading the reams of AI coverage that have proliferated in the past year. It hasn’t got a clue about the semantic meaning of images of cats or ladies or cat ladies, or of a treatise on Boulangerism in 19th-century France, or of “what is a Christie Aschwanden“. That’s no obstacle to spitting out confident content, though, and because it has been trained on such a colossal amount of data, it intuits which word to put in front of another mostly correctly, the way you might put one foot in front of the other to shuffle around a pitch-black room. Occasionally it steps on a rake.
The hype around Google’s recent launch of its new AI model, Gemini, has to do with the fact that it is multimodal: it can parse your input whether you give it speech, text, or images. That’s because it was also trained multimodally, on richer types of sensory input like YouTube videos rather than on still pictures.
People have been showing AIs movies for years, but none of those movies were made with AI models in mind as the specific audience demographic. Those are our videos.
By contrast, these puppet shows were made explicitly for them and include millions of annotated high-resolution frames of people doing everyday things like playing board games, exercising, and unwrapping presents. Take a look at another one. Here we have a man washing dishes. It’s a mess: the countertops are cluttered, most of the objects are partially or wholly obscured, and unless you understand what a kitchen is or why a person would be washing dishes in one, you couldn’t make inferences about how any of these objects relate to one another. How would any given AI model – having no experience of the real world – figure out what the hell is going on here?
The puppet shows have many possible use cases, and one of them is training generative AI, which is why it makes sense that they fixed a head-mounted camera on the actor. This gives the AI model a first-person POV on doing the dishes.
This ties in with many conversations about whether “real” AI needs either to be fully embodied in a physical robot, or to virtually inhabit a simulation so realistic that it can perfectly emulate the experience of gravity, light, wind, cold, heat, and all the other components of reality. Because this is how we learn to understand the world: as toddlers we don’t just watch things break when other people drop them, we break them ourselves. Our embodiment is the only thing that makes us feel the consequential nature of our actions in the universe. The cause-and-effect chain is the start of narrative. We humans learn best when there is narrative.
That sounds great in theory, but the dishwashing puppet show revealed that embodiment actually has its costs. The AI couldn’t make heads or tails of the objects in the scene. Its first-person perspective was wobbly and shaky and confusing. Looking at everything from your own limited vantage point makes things feel much more complicated than they are from an omniscient or global framework.
It seemed to do a lot better at categorising the scene when presented with a third-person perspective; it was better able to distinguish salient objects from the chaos. Hands. Dishes. Sponge. Countertop (even obscured by a few dishes and a dish towel). The objective POV also gave a much better idea of how the salient items were interacting with each other. From the bird’s-eye view, the AI could see an entire human being doing the dishes, not just a set of disembodied hands rummaging.
(As an aside, the limiting qualities of the first-person perspective are evident in plenty of other contexts. Ever give your friends advice that you can’t seem to give yourself?)
Forcing AI to watch movies to learn about the world throws big “A Clockwork Orange” vibes. (Come to think of it, it’s also what it has felt like to try to keep up with AI news coverage this year.)
I feel like it also says something about how we approach the project of getting AI to be “aligned” with us. “Alignment” is another fancy word that needs to be unpacked before you can appreciate its full implications. Alignment means the AI shares our goals and values – not just in broad strokes (“Alexa, please make the world a better place”), but with an intricate understanding of the obstacles, complications, and unintended consequences involved in meeting any goal, and an awareness that there are shared goals (omniscient POV) and individual goals (first-person head-mounted camera POV), and that sometimes these conflict. In a sense, our ideal AI would need to understand us and have empathy for us if it is going to help us do the things we want to do, but do them better than we can, which is what we are ultimately going for with the creation of AI, right? We are trying to make a thing that is both god-like and slave-like. It will faithfully give us what we want, while understanding better than we do what we want, because it will understand the world better than we do.
But is our cause-and-effect way of seeing the universe the correct way? Plenty of evidence from maths and physics suggests it is not. For example, maybe even time itself is a technocratic illusion, and if that’s true, you can throw cause and effect right in the bin as a human sensory misapprehension of a more complex reality.
So if we force an AI to see the world our way, is it going to be capable of seeing the world more clearly than we do, or will it be befuddled by the same stuff that benights our perception? And if we allow AI to “see” the world in a way that is beyond our ability to audit, will our perspective become totally irrelevant to it?
Unrelated, I wonder if the AI puppet-theatre auteurs will ever shoot a live-action rendering of the “This is Fine” dog drinking his coffee as the flames lick the ceiling.
Image credit: In A Clockwork Orange, Alex is forced to watch movies intended to program him into pro-social behaviour. Source: Wikimedia Commons; image is in the public domain.