DeepMind recently trained Flamingo, an 80B parameter vision-language model (VLM) AI. Flamingo combines separately pre-trained vision and language models and outperforms all other few-shot learning models on 16 vision-language benchmarks. Flamingo can also chat with users and answer questions about input images and videos.
The model was announced in a blog post by lead researchers Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, and Antoine Miech. Flamingo is based on two previous models developed by DeepMind: Chinchilla, a 70B parameter language generation model; and Perceiver, a multimodal classification model. Flamingo combines these two models into a single neural network, which is then trained on sequences of interleaved image and text data. The result is an AI that can learn new vision-language tasks with little or no additional training data. According to Alayrac et al:
Models such as Flamingo hold promise for society in practical ways and we continue to improve their flexibility and capabilities so that they can be safely deployed for everyone’s benefit. Flamingo’s capabilities are paving the way for rich interactions with learned visual language models that enable better interpretability and exciting new uses, such as a visual assistant that helps people in everyday life—and we’re thrilled with the results so far.
Multimodal VLMs, such as CLIP, have proved successful at zero-shot learning; however, because such models only give a score that indicates the agreement between an image and a textual description, their range of tasks is limited. Other VLMs, such as DALL-E, can generate photorealistic images based on a description, but do not generate language, so they cannot perform tasks such as visual question answering (VQA) or image captioning.
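To make the limitation of score-only models concrete, here is a minimal sketch of zero-shot classification by image-text agreement. The embedding vectors and label strings are made-up toy values, not the output of any real model, and `zero_shot_classify` is a hypothetical helper. Note that the model can only rank the descriptions it is given; it never produces text of its own.

```python
import math


def cosine_similarity(a, b):
    """Agreement score between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Score every candidate description and return the best match."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings; a real model would compute these.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a flamingo": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
}

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```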
Because large generative language models such as GPT-3 have been shown to perform well at few-shot learning on a wide variety of natural language processing (NLP) tasks, the DeepMind team chose to build on their Chinchilla language model, which outperforms GPT-3 at many of these tasks. This required several changes to Chinchilla. The first was the need to process multimodal data without negatively impacting the language proficiency of the model. To solve this, the team interleaved new cross-attention layers with the existing self-attention layers, which were kept frozen during training.
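The interleaving idea can be sketched in a few lines of plain Python. This is a toy illustration, not DeepMind's implementation: the attention has no learned projections, the class names are hypothetical, and the zero-initialised `tanh` gate is an assumption added here to show one way new layers can leave a frozen model's behaviour unchanged at the start of training.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


class FrozenSelfAttention:
    """Stands in for a pre-trained language model layer (never updated)."""
    trainable = False

    def __call__(self, text, visual):
        mixed = attend(text, text, text)  # text attends to text only
        return [[t + m for t, m in zip(tr, mr)] for tr, mr in zip(text, mixed)]


class GatedCrossAttention:
    """New layer: text attends to visual tokens, scaled by a gate."""
    trainable = True

    def __init__(self, gate=0.0):
        self.gate = gate  # assumption: starts at zero, so the layer is a no-op

    def __call__(self, text, visual):
        mixed = attend(text, visual, visual)
        g = math.tanh(self.gate)
        return [[t + g * m for t, m in zip(tr, mr)] for tr, mr in zip(text, mixed)]


def interleave(frozen_layers):
    """Insert one new cross-attention layer before each frozen layer."""
    stack = []
    for layer in frozen_layers:
        stack.append(GatedCrossAttention())
        stack.append(layer)
    return stack


def forward(stack, text, visual):
    for layer in stack:
        text = layer(text, visual)
    return text


text_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[0.5, 0.5], [1.0, -1.0], [0.2, 0.1]]
frozen_layers = [FrozenSelfAttention(), FrozenSelfAttention()]
stack = interleave(frozen_layers)

baseline = forward(frozen_layers, text_tokens, visual_tokens)
with_closed_gates = forward(stack, text_tokens, visual_tokens)
```

With every gate at zero, `with_closed_gates` equals `baseline`, so the language-only behaviour is preserved; only the layers flagged `trainable` would receive updates in a real training loop.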
To support both single-frame images and video, the researchers incorporated a Perceiver model that generates a “small fixed number of visual tokens” for both images and videos. This improved the scalability of the model with respect to input size. Finally, the team needed a large combined image-text training dataset. For this, the team scraped text and images from approximately 43 million web pages to create the MultiModal MassiveWeb (M3W) dataset, which contains 185 million images and 182 GB of text. Flamingo was trained on a combination of M3W and several other pre-existing image-text datasets.
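One way to picture how a fixed number of visual tokens can come out of inputs of any size is the following rough sketch, in which a fixed set of latent query vectors cross-attends over however many input features there are. The dimensions, the random features, and the `resample` helper are all hypothetical; in a real model the latents and projections would be learned.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(features, latents):
    """Cross-attend fixed latent queries over a variable number of features.

    Returns exactly len(latents) tokens, regardless of len(features).
    """
    dim = len(latents[0])
    tokens = []
    for query in latents:
        scores = [
            sum(q * f for q, f in zip(query, feature)) / math.sqrt(dim)
            for feature in features
        ]
        weights = softmax(scores)
        tokens.append([
            sum(w * feature[j] for w, feature in zip(weights, features))
            for j in range(dim)
        ])
    return tokens


rng = random.Random(0)
DIM, NUM_LATENTS = 4, 2  # hypothetical sizes, chosen only for the demo

# Stand-in for learned latent queries.
latents = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]


def fake_features(count):
    """Random stand-ins for vision-encoder outputs."""
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(count)]


image_tokens = resample(fake_features(9), latents)      # one image, 9 patches
video_tokens = resample(fake_features(8 * 9), latents)  # 8 frames of 9 patches

print(len(image_tokens), len(video_tokens))  # prints: 2 2
```

Because the output length depends only on the number of latents, the cost of every later layer stays constant whether the input is a single image or a multi-frame video.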
To evaluate Flamingo, DeepMind tested it on 16 multimodal benchmarks across a range of tasks, including visual dialogue, VQA, captioning, and image classification. In few-shot learning scenarios, Flamingo outperformed previous best results “by a large margin”. On six of the benchmarks, Flamingo outperformed state-of-the-art fine-tuned models without being fine-tuned itself; instead, Flamingo was used in a few-shot scenario and given only 32 samples, “about 1000 times less” than the fine-tuned models.
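In a few-shot setting like this, the examples are supplied in the prompt rather than used to update the model's weights. The sketch below shows how such an interleaved image-text prompt might be assembled. The `<image>` placeholder, the file names, and the `build_few_shot_prompt` helper are hypothetical and do not describe Flamingo's actual input format.

```python
def build_few_shot_prompt(support_examples, query_image, max_shots=32):
    """Interleave (image, text) examples, ending with an unanswered query.

    Returns the ordered list of images and the text with one placeholder
    per image. No model weights are involved at any point.
    """
    shots = support_examples[:max_shots]
    images = [image for image, _ in shots] + [query_image]
    parts = [f"<image>Output: {text}" for _, text in shots]
    parts.append("<image>Output:")  # left open for the model to complete
    return images, "\n".join(parts)


# Hypothetical support set: (image file, expected text) pairs.
support_examples = [
    ("chinchilla.jpg", "This is a chinchilla."),
    ("shiba.jpg", "This is a shiba."),
]

images, prompt = build_few_shot_prompt(support_examples, "flamingo.jpg")
print(prompt)
```

Under this scheme, going from 2 examples to 32 only lengthens the prompt, whereas a fine-tuned baseline needs a full training run on a dataset roughly a thousand times larger.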
In a Reddit discussion about Flamingo, one user noted:
Any work that can reduce the required training data and generalize understanding will be incredibly relevant. There are so many different developments that these companies are trying to combine to create generalized AI, it’s amazing to see. I imagine we’ll also see a lot more research on catastrophic forgetting this year.
Multimodal AI is an active research topic. Earlier this year, InfoQ covered data2vec, a multimodal AI from Meta that can perform a variety of computer vision and speech recognition tasks. Last year InfoQ covered DeepMind’s Perceiver, and more recently the new generalist AI model Gato, which can perform “more than 600 different tasks”, including image captioning and robot control.