Advances in the AI realm are constantly coming out, but they tend to be limited to a single domain: for instance, a cool new method for producing synthetic speech isn’t also a way to recognize expressions on human faces. Meta (AKA Facebook) researchers are working on something a little more versatile: an AI that can learn capably on its own whether it does so in spoken, written, or visual materials.
The traditional way of training an AI model to correctly interpret something is to give it lots and lots (like millions) of labeled examples. A picture of a cat with the cat part labeled, a conversation with the speakers and words transcribed, etc. But that approach is no longer in vogue as researchers found that it was no longer feasible to manually create databases of the sizes needed to train next-gen AIs. Who wants to label 50 million cat pictures? Okay, a few people probably — but who wants to label 50 million pictures of common fruits and vegetables?
Currently some of the most promising AI systems are what are called self-supervised: models that can work from large quantities of unlabeled data, like books or video of people interacting, and build their own structured understanding of what the rules are of the system. For instance, by reading a thousand books it will learn the relative positions of words and ideas about grammatical structure without anyone telling it what objects or articles or commas are — it got it by drawing inferences from lots of examples.
This feels intuitively more like how people learn, which is part of why researchers like it. But the models still tend to be single-modal, and all the work you do to set up a semi-supervised learning system for speech recognition won’t apply at all to image analysis — they’re simply too different. That’s where Facebook/Meta’s latest research, the catchily named data2vec, comes in.
The idea for data2vec was to build an AI framework that would learn in a more abstract way, meaning that starting from scratch, you could give it books to read or images to scan or speech to sound out, and after a bit of training it would learn any of those things. It’s a bit like starting with a single seed, but depending on what plant food you give it, it grows into an daffodil, pansy, or tulip.
Testing data2vec after letting it train on various data corpi showed that it was competitive with and even outperformed similarly sized dedicated models for that modality. (That is to say, if the models are all limited to being 100 megabytes, data2vec did better — specialized models would probably still outperform it as they grow.)
“The core idea of this approach is to learn more generally: AI should be able to learn to do many different tasks, including those that are entirely unfamiliar,” wrote the team in a blog post. “We also hope data2vec will bring us closer to a world where computers need very little labeled data in order to accomplish tasks.”
“People experience the world through a combination of sight, sound and words, and systems like this could one day understand the world the way we do,” commented CEO Mark Zuckerberg on the research.
This is still early stage research, so don’t expect the fabled “general AI” to emerge all of a sudden — but having an AI that has a generalized learning structure that works with a variety of domains and data types seems like a better, more elegant solution than the fragmented set of micro-intelligences we get by with today.
The code for data2vec is open source; it and some pretrained models are available here.
 
					