Scaling audiovisual learning without labels

Researchers from MIT, the MIT-IBM Watson AI Lab, IBM Research, and elsewhere have developed a new technique for analyzing unlabeled audio and visual data that could improve the performance of machine-learning models used in applications such as speech recognition and object detection. The work, for the first time, combines two self-supervised learning techniques, contrastive learning and masked data modeling, in an effort to scale machine-learning tasks, such as event classification, in single-modality and multimodal data without the need for annotation, thereby replicating how humans understand and perceive our world.

"A larger portion of human knowledge is learned in a self-supervised way, because we don't always get supervision signals, and we want to enable the machine-learning model to have the same ability," says Yuan Gong, an MIT postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

"Another way to put it is that self-supervised learning often forms the foundation of an initial model, because it can learn on vast amounts of unlabeled data. And then you can use classical, supervised learning or reinforcement learning to fine-tune the model to something particular if you want to," says Jim Glass, an MIT senior research scientist and member of the MIT-IBM Watson AI Lab.

The technique, called the contrastive audio-visual masked autoencoder (CAV-MAE), is a type of neural network that can learn to extract and map meaningful latent representations into high-dimensional space from acoustic and visual data by training on large YouTube datasets of 10-second audio and video clips. The researchers say the technique is more effective than previous approaches because it explicitly models the relationships between audio and visual data in a way that other methods do not.

Joining Gong and Glass on the study are graduate students Andrew Rouditchenko and Alexander H. Liu of MIT, David Harwath PhD '18 of the University of Texas at Austin, and MIT-IBM Watson AI Lab members Leonid Karlinsky and Hilde Kuehne. Kuehne is also affiliated with Goethe University Frankfurt. The method was recently presented at the International Conference on Learning Representations.

A joint and coordinated approach

CAV-MAE works by "learning by prediction and learning by comparison," says Gong. The masked data modeling, or prediction, method takes a video along with its coordinated audio waveform, converts the audio to a spectrogram, and masks 75 percent of both. The unmasked data is tokenized, then fed into separate audio and visual encoders before entering a joint encoder/decoder, where the model is asked to recover the missing data. The difference (reconstruction loss) between the resulting reconstructed prediction and the original audio-visual combination is then used to train the model for better performance. An example of this would be covering part of a video of a piano and part of a spectrogram of piano music, and then asking the model to determine the masked inputs. Unfortunately, this method may not capture the association between the video and audio pair, whereas contrastive learning exploits this association but may discard some modality-unique information, like the background in a video.
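The masked-prediction branch can be pictured with a short, self-contained sketch. The PyTorch code below is only an illustration of the idea described above, not the authors' implementation: the class and function names, the patch sizes, and the use of zeroed-out tokens as stand-ins for learned mask tokens are assumptions made for brevity.

```python
# Illustrative sketch of masked audiovisual prediction (not the CAV-MAE code).
# Spectrogram and image patches are tokenized, 75% are masked, the remainder is
# encoded per modality, fused in a joint encoder, and a decoder is trained to
# reconstruct the original patches at the masked positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_tokens(tokens, mask_ratio=0.75):
    """Hide a random 75% of tokens (zeroed here as a stand-in for mask tokens)."""
    batch, n_tokens, _ = tokens.shape
    mask = torch.rand(batch, n_tokens) < mask_ratio       # True = masked position
    return tokens.masked_fill(mask.unsqueeze(-1), 0.0), mask

def encoder(d_model=256, n_layers=2):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), n_layers)

class MaskedAVAutoencoder(nn.Module):
    def __init__(self, patch_dim=256, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(patch_dim, d_model)   # spectrogram patches -> tokens
        self.video_proj = nn.Linear(patch_dim, d_model)   # image patches -> tokens
        self.audio_enc, self.video_enc = encoder(d_model), encoder(d_model)
        self.joint_enc = encoder(d_model)                 # joint audio-visual encoder
        self.decoder = nn.Linear(d_model, patch_dim)      # predicts raw patch values

    def forward(self, audio_patches, video_patches):
        a, a_mask = mask_tokens(self.audio_proj(audio_patches))
        v, v_mask = mask_tokens(self.video_proj(video_patches))
        joint = self.joint_enc(torch.cat([self.audio_enc(a), self.video_enc(v)], dim=1))
        pred = self.decoder(joint)                        # reconstruction for every patch
        target = torch.cat([audio_patches, video_patches], dim=1)
        masked = torch.cat([a_mask, v_mask], dim=1)       # score only the hidden positions
        return F.mse_loss(pred[masked], target[masked])

# Toy usage with random stand-ins for "piano video + piano spectrogram" patches.
model = MaskedAVAutoencoder()
audio = torch.randn(2, 64, 256)    # (batch, spectrogram patches, patch dim)
video = torch.randn(2, 49, 256)    # (batch, image patches, patch dim)
recon_loss = model(audio, video)   # reconstruction loss used to train the model
```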

Contrastive learning aims to map representations that are similar close to each other. For example, the model will attempt to place video and audio data of different parrots close to each other and farther away from pairs of video and audio of guitars playing. In a similar fashion to masked autoencoding, the audio-visual pairs are passed into separate modality encoders; however, the audio and visual components are kept separate within the joint encoder before the model performs pooling and the contrastive loss. In this way, contrastive learning aims to identify the parts of each audio or video that are most relevant to the other. For example, if a video shows someone speaking and the corresponding audio clip contains speech, the autoencoder will learn to associate the speaker's mouth movements with the spoken words. It will then adjust the model's parameters so that those inputs are represented close to each other. Ultimately, the CAV-MAE method combines both techniques with multiple forward data streams, with masking as a first step, modality-specific encoders, and layer normalization so that the representation strengths are similar. A compact sketch of how the two losses could be combined follows.
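The sketch below illustrates the "learning by comparison" side and one way the two objectives could be combined. It is an approximation under assumptions: the mean pooling, the temperature of 0.07, and the 0.01 loss weight are illustrative placeholders, not the paper's reported settings.

```python
# Illustrative contrastive branch: pool each clip's audio and video tokens,
# then pull matched audio-video pairs together and push mismatched pairs apart.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_tokens, video_tokens, temperature=0.07):
    a = F.normalize(audio_tokens.mean(dim=1), dim=-1)   # (batch, d_model), one vector per clip
    v = F.normalize(video_tokens.mean(dim=1), dim=-1)
    logits = a @ v.t() / temperature                    # pairwise similarity matrix
    targets = torch.arange(len(a))                      # the matched pair sits on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Combined objective sketch: "learning by prediction" plus "learning by comparison".
# recon_loss would come from the masked-autoencoder sketch above; the 0.01 weight
# is an illustrative placeholder, not a reported hyperparameter.
# total_loss = recon_loss + 0.01 * contrastive_loss(audio_tokens, video_tokens)
```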

"We [then] wanted to compare the proposed CAV-MAE with a model trained only with a masked autoencoder and a model trained only with contrastive learning, because we want to show that by combining the masked autoencoder and contrastive learning, we can get some performance improvement," says Gong, "and the results support our hypothesis that there is an obvious improvement."

The researchers tested CAV-MAE, as well as their method without contrastive loss or a masked autoencoder, against other state-of-the-art methods on audio-visual retrieval and audio-visual event classification tasks using the standard AudioSet (20K and 2M) and VGGSound datasets: labeled, realistic short clips, which could include multiple sounds. Audio-visual retrieval means that the model sees either the audio or visual component of a query pair and searches for the missing one; event classification involves identifying actions or sounds within the data, such as a person singing or a car driving.
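To make the retrieval task concrete, here is a toy sketch of how audio-to-visual retrieval can be scored once a model has produced clip-level embeddings; the embedding size and the candidate pool are invented for illustration.

```python
# Toy audio-to-visual retrieval: rank candidate video clips by cosine similarity
# to an audio query embedding (the embeddings here are random stand-ins).
import torch
import torch.nn.functional as F

audio_query = F.normalize(torch.randn(1, 256), dim=-1)     # one audio clip embedding
video_bank = F.normalize(torch.randn(1000, 256), dim=-1)   # candidate video embeddings
scores = (audio_query @ video_bank.t()).squeeze(0)         # cosine similarities
top10 = scores.topk(10).indices                            # indices of best-matching clips
```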

Overall, they found that contrastive learning and masked data modeling are complementary methods. CAV-MAE was able to outperform previous techniques (with fully self-supervised pre-training) by about 2 percent on event classification performance versus models with comparable computation and, more impressively, kept pace with or outperformed models with industry-level computational resources. The team's model ranked similarly to models trained with only contrastive loss. And surprisingly, the team says, incorporating multimodal data into CAV-MAE pre-training greatly improves the fine-tuning of single-modality representations via supervised learning (with some labeled data) and performance on audio-only event classification tasks. This demonstrates that, like humans, multimodal information provides an additional "soft label" boost even for audio-only or visual-only tasks; for instance, it helps the model to understand whether it is looking for an electric or acoustic guitar, a richer supervision signal.

"I think people like the elegance of this model for combining information in the different audio and visual streams. It has the contrastive and the reconstruction loss, and compared to models that have been evaluated with similar data, it clearly does very well across a range of these tasks," Glass says.

Building on this, "one special thing is that our model can do both classification and retrieval, which is not common," Gong adds. "Before this work, these methods were used separately, but after this work, I see that most of the audio-visual learning frameworks use contrastive loss and the masked autoencoder together, implicitly or explicitly."

Bringing self-supervised audiovisual learning to our world

The researchers see their contribution of the contrastive audio-visual masked autoencoder (CAV-MAE) as an important milestone and a step forward for applications, which are increasingly moving from single modality to multimodality and which require or leverage audio-visual fusion. They hypothesize that one day it could be used for action recognition in realms like sports, education, entertainment, motor vehicles, and public safety. It could also, one day, extend to other modalities. "At this time, the fact that this only applies to audio-visual data may be a limitation, but we are targeting multimodal learning, which is a trend of machine learning," says Gong. "As humans, we have multimodality; we have smell and touch, many more things than just audio-visual. So when we try to build AI, we try to mimic humans in some way, not necessarily from the biological perspective, and this method could [potentially be] generalized to other unexplored modalities."

As machine learning models continue to play an increasingly important role in our lives, techniques like this will become ever more valuable.

This research was supported by the MIT-IBM Watson AI Lab.
