
OpenAI’s DALL-E Creates Plausible Images of Literally Anything You Ask It To

Posted on 05 January 2021 by Thiruvenkatam Chinnagounder @tipsclear

OpenAI's latest weird but fascinating creation is DALL-E, which by way of a hasty summary could be called "GPT-3 for images". It creates illustrations, photos, renderings, or whatever medium you prefer, of anything you can intelligibly describe, from "a cat wearing a bow tie" to "a daikon radish in a tutu walking a dog". But don't write obituaries for stock photos and illustrations just yet.

As usual, OpenAI's description of its invention is quite readable and not overly technical. But it bears a bit of context.

What the researchers created with GPT-3 was an AI that, given a prompt, would attempt to generate a plausible version of whatever that prompt describes. So if you ask for "a story about a child finding a witch in the woods", it will try to write one, and if you press the button again, it will write it again, differently. And again, and again, and again.

Some of these attempts will be better than others; some will be barely coherent, while others may be nearly indistinguishable from something written by a human. But what it returns is rarely junk or riddled with grammatical errors, which makes it suitable for a variety of tasks, as startups and researchers are exploring right now.
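To make that "press the button again" workflow concrete, here is a minimal sketch of sampling several completions of one prompt, assuming access to OpenAI's completions API as it existed around this time; the engine name, placeholder key, and prompt are illustrative, not from the article:

    import openai  # pip install openai

    openai.api_key = "sk-..."  # your API key here

    # Ask for several independent completions of the same prompt; each
    # one is a fresh "press of the button" that samples a new story.
    response = openai.Completion.create(
        engine="davinci",
        prompt="A story about a child finding a witch in the woods:",
        max_tokens=200,
        temperature=0.9,  # higher temperature -> more varied attempts
        n=3,              # three different attempts at the same prompt
    )

    for i, choice in enumerate(response.choices):
        print(f"--- Attempt {i + 1} ---")
        print(choice.text.strip())

Run it a few times and the quality spread described above becomes obvious: some attempts read almost like human writing, others barely hang together.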

DALL-E (a portmanteau of Dalí and WALL-E) takes this concept further. Turning text into images has been attempted for years by AI agents, with varying but steadily increasing success. In this case, the agent uses the language understanding and context provided by GPT-3 and its underlying structure to create a plausible image that matches a prompt.

As OpenAI says:

GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate high-fidelity images. We extend these findings to show that the manipulation of visual concepts through language is now within reach.

What they mean is that an image generator of this kind can be manipulated naturally, simply by telling it what to do. Sure, you could dig into its guts, find the token that represents color, and decode its pathways so you can activate and change them, the way you might stimulate the neurons of a real brain. But you wouldn't do that when asking your staff illustrator to make something blue instead of green. You just say "a blue car" instead of "a green car" and they get it.

So it is with DALL-E, which understands these prompts and rarely fails badly, although it must be said that even when you look at the best of a hundred or a thousand attempts, many of the images it generates are more than a little... off. More on that below.

In the OpenAI post, the researchers give copious interactive examples of how the system can be told to produce minor variations of the same idea, and the results are plausible and often quite good. The truth is that these systems can be very fragile, as OpenAI admits DALL-E is in some ways, and asking for "a green leather bag shaped like a pentagon" may produce what you expect, while "a blue suede bag shaped like a pentagon" might produce nightmare fuel. Why? It's hard to say, given the black-box nature of these systems.
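Those variations amount to filling slots in a prompt template. Here is a toy sketch of that idea; generate_image is a hypothetical stand-in, since DALL-E itself was not publicly callable:

    from itertools import product

    def generate_image(prompt: str):
        # Hypothetical stand-in for a text-to-image model such as DALL-E.
        raise NotImplementedError("plug in your text-to-image model here")

    colors = ["green", "blue"]
    materials = ["leather", "suede"]
    shapes = ["pentagon", "hexagon"]

    # Vary one attribute at a time, the way the post's interactive demos do.
    for color, material, shape in product(colors, materials, shapes):
        prompt = f"a {color} {material} bag shaped like a {shape}"
        print(prompt)  # e.g. "a green leather bag shaped like a pentagon"
        # image = generate_image(prompt)  # some combinations may come out... off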


But DALL-E is remarkably robust to such changes and reliably produces pretty much whatever you ask for. A torus of guacamole, a sphere of zebra; a large blue block sitting on a small red block; a front view of a happy capybara; an isometric view of a sad capybara; and so on and so forth. You can play with all the examples in the post.

It also exhibited some unintended but useful behavior, using intuitive logic to understand requests such as asking it to produce multiple sketches of the same (non-existent) cat, with the original on top and the sketch on the bottom. No special coding here: "We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it." This is fine.

Interestingly, another new OpenAI system, CLIP, was used alongside DALL-E to understand and rank the images in question, although it is a bit more technical and harder to understand. You can read about CLIP here.
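For the curious, here is a minimal sketch of that generate-then-rank idea using OpenAI's released CLIP package (the openai/CLIP repository on GitHub); the prompt and candidate filenames are hypothetical, and this mirrors the spirit of the curation step rather than OpenAI's exact pipeline:

    import torch
    import clip  # from https://github.com/openai/CLIP
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    prompt = "an armchair in the shape of an avocado"
    candidates = ["sample_0.png", "sample_1.png", "sample_2.png"]  # generator outputs

    text = clip.tokenize([prompt]).to(device)
    images = torch.stack([preprocess(Image.open(f)) for f in candidates]).to(device)

    with torch.no_grad():
        image_features = model.encode_image(images)
        text_features = model.encode_text(text)
        # Cosine similarity between the prompt and each candidate image
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).squeeze(1)

    # Keep the best-scoring candidates, discard the rest
    for name, score in sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1]):
        print(f"{name}: {score:.3f}")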

The implications of this capability are many and varied, so much so that I won't attempt to go into them here. Even OpenAI punts:

In the future, we plan to analyze how models like DALL-E relate to societal issues like the economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer-term ethical challenges implied by this technology.

Right now, like GPT-3, this technology is astonishing and yet difficult to make clear predictions about.

In particular, very little of what it produces seems truly "final", that is, I couldn't tell it to create a lead image for anything I've written lately and expect it to put out something I could use without modification. Even a brief inspection reveals all kinds of AI quirks (Janelle Shane's specialty), and while these rough edges will surely be smoothed out over time, it's far from safe, the way GPT-3 text can't be sent out unedited in place of human writing.

It helps to generate many and pick the best few, as the image collections accompanying the original post show.

None of this detracts from OpenAI's results here. This is extraordinarily interesting and powerful work, and like the company's other projects it will no doubt develop into something even more fabulous and interesting before long.

