The Allen Institute for AI (AI2), the division within the nonprofit Allen Institute focused on machine learning research, today published its work on an AI system called Unified-IO, which it says is among the first to perform a “large and diverse” set of AI tasks. Unified-IO can process and create images, text, and other structured data, an achievement the research team says is a step toward building capable, unified AI systems for general use.
“We are interested in building task-agnostic [AI systems] that let practitioners train [machine learning] models for new tasks with little to no knowledge of the underlying machinery,” Jiasen Lu, a research scientist at AI2 who worked on Unified-IO, told TechCrunch via email. “Such unified architectures alleviate the need for task-specific parameters and system customizations, can be collectively trained to perform a wide variety of tasks, and can share knowledge about tasks to improve performance.”
AI2’s early efforts in building unified AI systems led to GPV-1 and GPV-2, two general-purpose “vision-language” systems that supported a handful of workloads, including captioning images and answering questions. Unified-IO, however, required the team to go back to the drawing board and design a new model from scratch, according to Lu.
Unified-IO shares features in common with OpenAI’s GPT-3 in the sense that it’s a “transformer.” Dating back to 2017, the transformer has become the architecture of choice for complex reasoning tasks, demonstrating an aptitude for summarizing documents, generating music, classifying objects in images, and analyzing protein sequences.
Like all AI systems, Unified-IO learned from billions of words, images, and more in the form of tokens. These tokens served to represent data in a way that Unified-IO could understand.
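As a toy illustration of what a token is (a simplified sketch for this article, not Unified-IO's actual tokenizer, which uses a far larger learned vocabulary), text tokenization can be as basic as mapping each word to an integer ID from a fixed vocabulary:

```python
# Toy word-level tokenizer: the model never sees raw characters,
# only integer IDs drawn from a shared vocabulary.
vocab = {"<unk>": 0, "a": 1, "cat": 2, "sat": 3, "the": 4, "on": 5, "mat": 6}

def tokenize(text):
    """Map each whitespace-separated word to its vocabulary ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The cat sat on the mat"))  # [4, 2, 3, 5, 4, 6]
```

Real systems use subword tokenizers learned from data, but the principle is the same: everything the model reads or writes is a sequence of IDs from one vocabulary.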
“The natural language processing (NLP) community has been very successful in building unified [AI systems] that support many different tasks, because many NLP tasks can be represented homogeneously: words as input and words as output. But the nature and diversity of computer vision tasks has meant that, in the past, multitasking models have been limited to a small number of tasks, and mostly tasks that produce language output (answering a question, captioning an image, etc.),” Chris Clark, who collaborated with Lu on Unified-IO at AI2, told TechCrunch in an email. “Unified-IO shows that by converting a range of diverse structured outputs, such as images, binary masks, bounding boxes, sets of key points, grayscale maps and more, into homogeneous sequences of tokens, we can model many classic computer vision tasks in a way that is very similar to how we model tasks in NLP.”
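Clark's point about homogeneous tokens can be sketched with a hypothetical example (an illustration of the general technique, not AI2's actual code): a bounding box's pixel coordinates can be quantized into discrete "location tokens" that live in the same vocabulary as text tokens.

```python
# Hypothetical sketch: discretizing a bounding box into location tokens
# so structured vision outputs share a token vocabulary with text.

def box_to_tokens(box, image_size, num_bins=1000):
    """Map (x1, y1, x2, y2) pixel coordinates to discrete token strings."""
    w, h = image_size
    tokens = []
    for coord, extent in zip(box, (w, h, w, h)):
        # Normalize the coordinate to [0, 1], then quantize into num_bins buckets.
        bin_id = min(int(coord / extent * num_bins), num_bins - 1)
        tokens.append(f"<loc_{bin_id}>")
    return tokens

print(box_to_tokens((64, 32, 512, 480), image_size=(640, 480)))
# ['<loc_100>', '<loc_66>', '<loc_800>', '<loc_999>']
```

Once boxes, masks, and depth maps are all expressed this way, a single sequence-to-sequence transformer can emit any of them simply by generating tokens, exactly as it would generate words.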
Unlike some systems, Unified-IO cannot analyze or create video or audio, a limitation of the model “from a modality perspective,” Clark explains. But among the tasks Unified-IO can complete are generating images, detecting objects in images, estimating depth, paraphrasing documents, and highlighting specific regions in photos.
“This has huge implications for computer vision as it begins to treat modalities as diverse as images, masks, language and bounding boxes as simple strings of tokens, similar to language,” Clark added. “In addition, unification at this scale can now open the doors to new avenues in computer vision, such as massive unified pre-training, knowledge transfer between tasks, few-shot learning and more.”
Matthew Guzdial, an assistant professor of computer science at the University of Alberta who was not involved in the AI2 research, was hesitant to call Unified-IO a breakthrough. He noted that the system is similar to DeepMind’s recently detailed Gato, a single model that can perform over 600 tasks, from playing games to controlling robots.
“The difference [between Unified-IO and Gato] is of course that it’s a different set of tasks, but also that these tasks are, by and large, much more useful. By that I mean there are clear, current use cases for the things this Unified-IO network can do, whereas Gato mostly could only play games. This makes it more likely that Unified-IO or a similar model will actually impact people’s lives in terms of potential products and services,” Guzdial said. “My only concern is that, while the demo is flashy, there’s no indication of how well it does on these tasks compared to models trained separately for these individual tasks. Given that Gato underperformed models trained on the individual tasks, I expect the same to be the case here.”
Nevertheless, the AI2 researchers view Unified-IO as a strong foundation for future work. They plan to improve the efficiency of the system while adding support for more modalities, such as audio and video, and scaling it up to improve performance.
“Recent works like Imagen and DALL-E 2 have shown that with sufficient training data, models… can be trained to produce very impressive results. Yet these models only support one task,” says Clark. “Unified-IO allows us to train large-scale multitasking models. We hypothesize that massively scaling the data and model size will yield much better results.”