The AI world is still figuring out how to deal with the amazing display of prowess that is DALL-E 2’s ability to draw/paint/imagine just about anything… but OpenAI isn’t the only one working on something like this. Google Research has rushed to publicize a similar model it has been working on, which it claims is even better.
Imagen (get it?) is a text-to-image diffusion-based generator built on large transformer language models that… OK, let’s slow down and unpack that real quick.
Text-to-image models take text inputs like “a dog on a bike” and produce a corresponding image, something that has been done for years but has recently seen huge leaps in quality and accessibility.
Part of that is the use of diffusion techniques, which basically start with an image of pure noise and refine it bit by bit until the model thinks it can’t make it look any more like a dog on a bike than it already does. This was an improvement over top-to-bottom generators that could get it hilariously wrong on the first guess, and over others that could easily be led astray.
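To make the “start with noise, refine bit by bit” idea concrete, here is a deliberately tiny sketch. It is not Imagen’s code and not a real diffusion model: the “denoiser” below is a hypothetical stand-in that already knows the answer, whereas a real system uses a trained neural network guided by the text prompt. All names (`toy_denoise_step`, `generate`) and the four-value “image” are assumptions made for illustration.

```python
import random

def toy_denoise_step(image, target, strength=0.2):
    """Hypothetical stand-in for a learned denoiser.

    Nudges every pixel a fraction of the way toward the target.
    A real model would predict this correction from the noisy
    image and the text prompt; here we cheat and use the answer.
    """
    return [px + strength * (t - px) for px, t in zip(image, target)]

def generate(target, steps=50, seed=0):
    """Start from pure Gaussian noise and refine it step by step."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise
    for _ in range(steps):
        image = toy_denoise_step(image, target)
    return image

# A four-"pixel" image standing in for "a dog on a bike".
target = [0.0, 0.5, 1.0, 0.5]
result = generate(target)
max_error = max(abs(r - t) for r, t in zip(result, target))
```

Each step removes a fixed fraction of the remaining error, so after enough steps the noise has been refined into something very close to the target. The point is only the shape of the loop: many small corrections rather than one big guess.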
The other part is improved language understanding through large language models using the transformer approach, the technical aspects of which I won’t (and can’t) go into here, but it and a few other recent advances have led to convincing language models like GPT-3 and others.
Imagen starts by generating a small image (64 x 64 pixels) and then runs two “super-resolution” passes on it to bring it up to 1024 x 1024. This isn’t like regular upscaling, though: AI super-resolution creates new details that are in harmony with the smaller image, using the original as a basis.
Say, for instance, that you have a dog on a bike and the dog’s eye is 3 pixels wide in the first image. Not much room for expression! But in the second image, it’s 12 pixels wide. Where does the detail needed for this come from? Well, the AI knows what a dog’s eye looks like, so it generates more detail as it draws. Then this happens again when the eye is redone, this time at 48 pixels wide. But at no point did the AI have to pull 48-by-whatever pixels of dog eye out of its… let’s say magic bag. Like many artists, it started with the equivalent of a rough sketch, filled it out in a study, and then really went to town on the final canvas.
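The arithmetic behind that example is easy to check. The sketch below assumes each super-resolution pass enlarges the image by a factor of 4 per side, which is what the 3, 12 and 48 pixel figures imply; the function name `cascade` is made up for illustration and says nothing about how Imagen is actually implemented.

```python
def cascade(size, factor, stages):
    """Return the size at each stage of a repeated upscaling cascade."""
    sizes = [size]
    for _ in range(stages):
        sizes.append(sizes[-1] * factor)
    return sizes

# Two passes at an assumed 4x per side.
image_sizes = cascade(64, 4, 2)  # width of the whole image, in pixels
eye_widths = cascade(3, 4, 2)    # width of the dog's eye, in pixels

# Pixel count grows with the square of the side length.
pixel_growth = (image_sizes[-1] ** 2) // (image_sizes[0] ** 2)
```

Here `image_sizes` comes out as 64, 256, 1024 and `eye_widths` as 3, 12, 48, matching the numbers above. The final image has 256 times as many pixels as the first, nearly all of which the model has to invent.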
This isn’t unprecedented, and in fact artists working with AI models already use this technique to create pieces much larger than the AI can handle in one go. If you split a canvas into several pieces and run super-resolution on each of them separately, you end up with something much larger and more intricately detailed; you can even do this repeatedly. An interesting example from an artist I know:
The advances Google’s researchers claim with Imagen are several. They say that existing text models can be used for the text encoding portion, and that the quality of those models matters more than simply increasing visual fidelity. That makes sense intuitively, since a detailed picture of nonsense is definitely worse than a slightly less detailed picture of exactly what you asked for.
For instance, in the paper describing Imagen, they compare its results with those of DALL-E 2 for the prompt “a panda making latte art.” In all of the latter’s images, it’s latte art of a panda; in most of Imagen’s, it’s a panda making the art. (Neither was able to render a horse riding an astronaut, showing the opposite in all attempts. It’s a work in progress.)
In Google’s tests, Imagen came out ahead in human evaluations of both accuracy and fidelity. This is obviously quite subjective, but even matching the perceived quality of DALL-E 2, which until today was considered a huge leap ahead of everything else, is impressive. I’ll just add that while they’re very good, none of these images (from any generator) will withstand more than a cursory look before people notice they’re generated or have serious suspicions.
OpenAI is a step or two ahead of Google in a couple of ways, though. DALL-E 2 is more than just a research paper; it’s a private beta with people using it, just as they used its predecessor and GPT-2 and 3. Ironically, the company with the word “open” in its name has focused on productizing its text-to-image research, while the fabulously profitable internet giant has yet to try.
That is more than clear from the choice DALL-E 2’s researchers made to curate the training dataset ahead of time and remove any content that might violate their own guidelines. The model couldn’t make something NSFW if it tried. Google’s team, however, used some large datasets that are known to include inappropriate material. In an insightful section describing Imagen’s “Limitations and Societal Impact,” the researchers write:
Downstream applications of text-to-image models are varied and may impact society in complex ways. The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo.
The data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M dataset, which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.
While some might scoff at this, saying that Google is afraid its AI might not be politically correct enough, that is an uncharitable and short-sighted view. An AI model is only as good as the data it is trained on, and not every team can spend the time and effort it would take to remove the really awful stuff these scrapers pick up as they assemble datasets of many millions of images or billions of words.
Such biases are supposed to show up during the research process, which exposes how the systems work and provides an unfettered testing ground for identifying these and other limitations. How else would we know that an AI can’t draw hairstyles common among Black people, hairstyles that any kid could draw? Or that when prompted to write stories about work environments, the AI invariably makes the boss a man? In these cases, the AI model is working perfectly and as designed: it has successfully learned the biases that pervade the media on which it is trained. Not unlike people!
But while unlearning systemic bias is a lifelong project for many humans, an AI has it easier, and its creators can remove the content that caused it to behave badly in the first place. Perhaps one day there will be a need for an AI to write in the style of a racist, sexist pundit from the 1950s, but for now, the benefits of including that data are small and the risks large.
In any case, Imagen, like the others, is clearly still in the experimental phase, not ready to be used in anything other than a strictly human-supervised manner. When Google gets around to making its capabilities more accessible, I’m sure we’ll learn more about how and why it works.