There is a hot new trend in artificial intelligence: text-to-image generators. Feed these programs any text you like and they’ll generate remarkably accurate images that match that description. They can match a range of styles, from oil paintings to CGI renders and even photographs, and, though it sounds trite, in many ways the only limit is your imagination.
To date, the leader in the field has been DALL-E, a program created by the commercial AI lab OpenAI (and updated just back in April). Yesterday, though, Google announced its own take on the genre, Imagen, and it claims to surpass DALL-E in the quality of its output.
The best way to understand the amazing capability of these models is simply to look at some of the images they can create. There are some generated by Imagen above, and more below (you can see further examples on Google’s dedicated landing page).
In both cases, the text at the bottom of the image was the prompt fed into the program, and the image above it, the output. Just to stress: that’s all it takes. You type what you want to see and the program generates it. Pretty fantastic, right?
But while these pictures are undeniably impressive in their coherence and accuracy, they should also be taken with a pinch of salt. When research teams like Google Brain release a new AI model, they tend to cherry-pick the best results. So, while these pictures all look perfectly polished, they may not represent the average output of the system.
Images generated by text-to-image models often look unfinished, smeared, or blurry: problems we’ve seen with pictures generated by OpenAI’s DALL-E program. (For more on the trouble spots for text-to-image systems, check out this interesting Twitter thread that dives into the problems with DALL-E. It highlights, among other things, the system’s tendency to misunderstand prompts, and to struggle with both text and faces.)
Despite this, Google claims that Imagen produces consistently better images than DALL-E 2, based on a new benchmark it created for this project named DrawBench.
DrawBench isn’t a particularly complex metric: it’s essentially a list of around 200 text prompts that Google’s team fed into Imagen and other text-to-image generators, with the output from each program then judged by human raters. As shown in the graphs below, Google found that humans generally preferred Imagen’s output to that of its rivals.
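The evaluation idea is straightforward: tally which system's output human raters preferred for each prompt, then compare preference rates. A minimal sketch of that tallying step, with entirely hypothetical judgment data (Google's actual DrawBench pipeline and rater interface are not public):

```python
from collections import Counter

# Hypothetical pairwise judgments: for each prompt, the label of the
# system whose output the human rater preferred (or a tie).
judgments = [
    "imagen", "imagen", "dalle2", "imagen", "tie",
    "imagen", "dalle2", "imagen", "imagen", "tie",
]

def preference_rates(judgments):
    """Return each label's share of the total judgments."""
    counts = Counter(judgments)
    total = len(judgments)
    return {label: counts[label] / total for label in counts}

rates = preference_rates(judgments)
print(rates)  # {'imagen': 0.6, 'dalle2': 0.2, 'tie': 0.2}
```

With real data, each entry would correspond to one of the ~200 DrawBench prompts, and the resulting rates are what the comparison graphs plot.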
It would be hard for us to judge this for ourselves, however, as Google isn’t making the Imagen model available to the public. There’s good reason for this, too. While text-to-image models certainly have fantastic creative potential, they also have a range of troubling applications. Imagine a system that generates pretty much any image you like being used to spread fake news, hoaxes, or harassment, for example. As Google notes, these systems also encode social biases, and their output is often racist, sexist, or toxic in some other way.
A lot of this is due to how these systems are programmed. Essentially, they’re trained on huge amounts of data (in this case: lots of pairs of images and captions), which they study for patterns and learn to replicate. But these models need a great deal of data, and most researchers, even those working for well-funded tech giants like Google, have decided that it’s too onerous to comprehensively filter this input. So, they scrape huge quantities of data from the web, and as a consequence their models ingest (and learn to replicate) all the hateful content you’d expect to find online.
Google’s researchers summarize this problem in their paper: “[T]he large scale data requirements of text-to-image models […] have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets […] Dataset audits have revealed these datasets tend to reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups.”
In other words, the computer scientists’ old adage still applies in the whizzy world of AI: garbage in, garbage out.
Google doesn’t go into too much detail about the troubling content generated by Imagen, but notes that the model “encodes several social biases and stereotypes, including an overall bias towards generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes.”
This is something researchers have also found while evaluating DALL-E. Ask DALL-E to generate images of a “flight attendant,” for example, and almost all the subjects will be women. Ask for pictures of “a CEO,” and, surprise, surprise, you get a bunch of white men.
For this reason, OpenAI has also decided not to release DALL-E publicly, though the company does give access to a select group of beta testers. It also filters certain text inputs in an attempt to stop the model being used to generate racist, violent, or pornographic imagery. These measures go some way to restricting potentially harmful applications of this technology, but the history of artificial intelligence tells us that text-to-image models will almost certainly become public at some point in the future, with all the troubling implications that wider access brings.
Google’s own conclusion is that Imagen is “not suitable for public use at this time,” and the company says it plans to develop a new way to benchmark “social and cultural bias in future work” and to test future iterations. For now, though, we’ll have to be satisfied with the company’s upbeat selection of images: royal raccoons and cacti wearing sunglasses. That’s just the tip of the iceberg, though: an iceberg made of the unintended consequences of technological research, and of everything else Imagen could start generating.