Josh, I’ve heard a lot about “AI-generated art” and seen a lot of really crazy looking memes. What’s going on, are the machines picking up brushes now?
No brushes, no. What you see are neural networks (algorithms loosely inspired by how our neurons signal each other) that have been trained to generate images from text. It’s actually a lot of math.
Neural networks? Generate images from text? So, like, you plug “Kermit the Frog in Blade Runner” into a computer and it spits out pictures of… that?
You don’t think outside the box enough! Of course, you can create any Kermit images you want. But the reason you hear about AI art is because of its ability to create images of ideas that have never been expressed before. If you google “a kangaroo made of cheese” you won’t really find anything. But here are nine generated by a model.
You said earlier it’s all a bunch of math, but – to put it as simply as possible – how does it actually work?
I’m no expert, but essentially they had a computer “look” at millions or billions of pictures of cats and bridges and so on. These are usually scraped from the internet along with their captions.
The algorithms identify patterns in the images and captions and can eventually begin to predict which captions and images belong together. Once a model can predict what an image “should” look like from a caption, the next step is to invert it – creating brand new images from new “captions”.
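The matching step can be sketched in miniature. This is a toy illustration, not a real model: the “embeddings” below are random stand-in vectors, where a real system would produce them with trained neural networks. The idea is just that an image and its caption end up as nearby vectors, so similarity scores can pair them up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "embeddings": in a real model these vectors come from
# trained networks; here they are random stand-ins.
captions = {
    "a cat on a bridge": rng.normal(size=8),
    "a kangaroo made of cheese": rng.normal(size=8),
}

# Pretend each image's embedding lands near its true caption's embedding.
images = {name: vec + 0.1 * rng.normal(size=8) for name, vec in captions.items()}

def cosine(a, b):
    """Similarity between two vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Prediction step: for each image, pick the caption it matches best.
for img_name, img_vec in images.items():
    best = max(captions, key=lambda c: cosine(img_vec, captions[c]))
    print(img_name, "->", best)
```

Generation then runs this in reverse: instead of scoring existing images against captions, the model searches for brand-new pixels that score highly against a brand-new caption.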
When these programs create new images, are they just finding matches – like, all my images tagged with ‘kangaroo’ are usually big blocks of shapes like this one, and ‘cheese’ is usually a bunch of pixels that look like this one – and making up variations on them?
It’s a little more than that. If you look at this blog post from 2018 you can see how much trouble older models had. When the caption “a herd of giraffes on a ship” was given, a bunch of giraffe-colored blobs emerged standing in the water. So the fact that we’re getting recognizable kangaroos and different types of cheese shows that there’s been a big leap in the “understanding” of the algorithms.
Damn. So what’s changed so that the stuff it makes doesn’t look like horrible nightmares anymore?
There have been a number of advances in techniques, as well as in the datasets they train on. In 2020, a company called OpenAI released GPT-3 – an algorithm capable of generating text uncannily close to what a human could write. One of the most hyped text-to-image algorithms, DALL-E, is based on GPT-3; more recently, Google released Imagen, using their own text models.
These algorithms are fed massive amounts of data and put through thousands of “practice” runs to get better at their predictions.
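Those “practice” runs can be pictured with a toy example. This is a deliberately tiny sketch – fitting a three-number linear model, nothing like a real image generator – but the loop is the same shape: make a prediction, measure how wrong it is, nudge the model to be a little less wrong, repeat thousands of times.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up practice data: inputs X and the answers y the model should predict.
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(3)           # the model starts out knowing nothing
lr = 0.05                 # how big each nudge is
errors = []
for step in range(1000):  # thousands of "practice" runs
    pred = X @ w
    grad = X.T @ (pred - y) / len(y)  # how wrong, and in which direction
    w -= lr * grad                    # nudge the weights to be less wrong
    errors.append(float(np.mean((pred - y) ** 2)))

print(f"error on run 1: {errors[0]:.3f}, on run 1000: {errors[-1]:.6f}")
```

The error shrinks with every pass – that’s all “getting better at predictions” means mechanically, just at vastly larger scale in the real systems.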
‘Practice’ runs? Are there any real people involved? Like, how do the algorithms tell if what they’re making is right or wrong?
Actually, this is another big development. When you use one of these models, you will probably only see a handful of the images that were actually generated. Just as these models were initially trained to predict the best captions for images, they show you only the images that best match the text you’ve given them. They mark their own work.
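That self-marking step can be sketched too. Again, everything here is a hypothetical stand-in – random vectors in place of real images and prompts – but it shows the mechanism: generate many candidates, score each one against the prompt, and only show the top few.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "embedding" of the user's prompt.
prompt_vec = rng.normal(size=16)

# Pretend-generated candidates: a few land close to the prompt,
# the rest are noise (the nightmare images the user never sees).
candidates = [prompt_vec + 0.3 * rng.normal(size=16) for _ in range(3)]
candidates += [rng.normal(size=16) for _ in range(97)]

def score(img_vec):
    """How well a candidate matches the prompt (cosine similarity)."""
    return float(prompt_vec @ img_vec /
                 (np.linalg.norm(prompt_vec) * np.linalg.norm(img_vec)))

# The "marking" step: rank all 100 candidates, keep only the best handful.
top = sorted(range(len(candidates)), key=lambda i: score(candidates[i]),
             reverse=True)[:4]
print("shown to the user:", top)
```

Out of 100 candidates, the user only ever sees the four best-scoring ones – which is why the public results look so much more polished than the average thing the model produces.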
But there are still weaknesses in this generation process, right?
I cannot emphasize enough that this is not intelligence. The algorithms don’t “understand” what the words or the pictures mean like you or I do. It’s kind of a best guess based on what they’ve “seen” before. So there are quite a few limitations, both in what they can do and in what they do that they probably shouldn’t (like producing potentially graphic images).
Okay, so if the machines are painting on demand now, is this going to cost artists their jobs?
For now, these algorithms are largely restricted or expensive to use. I’m still on the waiting list to try DALL-E. But computing power is also getting cheaper, there are many huge image datasets, and even ordinary people are making their own models – like the one we used to create the kangaroo images. There’s also an online version called DALL-E mini that people are using to create, explore and share everything from Boris Johnson eating a fish to kangaroos made of cheese.
I doubt anyone knows what will happen to artists. But there are still so many edge cases where these models break that I wouldn’t rely on them exclusively.
Are there other problems with creating images based purely on matching patterns and then marking their own answers? Questions about, for example, bias or unfortunate associations?
One thing you notice in the company announcements of these models is that they tend to use innocuous examples – lots of generated images of animals, for instance. This speaks to one of the huge problems with using the internet to train a pattern-matching algorithm: so much of it is absolutely terrible.
A few years ago, a dataset of 80 million images that had been used to train algorithms was taken down by MIT researchers because it contained “derogatory terms as categories and offensive images”. And what we noticed in our own experiments is that “business” words seem to be associated with generated images of men.
So right now it’s about good enough for memes, and it’s still making weird nightmare images (especially of faces), though not as much as it used to. But who knows about the future. Thanks, Josh.