
Breaking the Data Boundaries With a General Self-Supervised Learning Approach


Understand how Data2vec achieves general self-supervision on speech, vision, and text data

Modern data is complex, diverse, and largely unlabeled. It also comes in different modalities, such as text, images, and audio.

Over the last two decades, artificial intelligence (AI) has demonstrated powerful predictive capabilities across all of these data types. However, each type of data requires its own training and processing techniques.

Existing AI systems fail to provide a generic model capable of handling such diverse inputs simultaneously. The typical approach is to develop a separate algorithm for every input source.

To fill the gap, researchers at Meta devised a general self-supervised learning solution called data2vec that works on speech, vision, and text data at once.

In this post, we'll explore self-supervision and discuss the data2vec architecture. We'll also compare the performance of data2vec with existing state-of-the-art speech, language, and image models to understand how self-supervision can help build truly intelligent AI systems in the future.

What Is Self-Supervised Learning?


In supervised learning, models learn from data that humans have labeled in advance, which is expensive and slow to produce at scale. Self-supervised learning removes this bottleneck: the model creates its own training signal directly from unlabeled data, most commonly by hiding part of the input and learning to predict the hidden part from the surrounding context.

This simple recipe drives many of today's most successful models. Language models such as BERT are pre-trained by predicting masked words in a sentence, speech models such as wav2vec 2.0 predict masked segments of the audio waveform, and vision models such as BEiT predict masked image patches. Because the supervision comes from the data itself, these models can learn rich representations from huge amounts of raw text, audio, and images, and can then be fine-tuned on a small labeled dataset for a specific task.


Why Do We Need a Generic Data Handling Strategy?

In AI, each data type is processed differently. Self-supervised language models are trained by masking a portion of the input text and predicting the hidden words from the surrounding sentence. During training, the models operate over a vocabulary of discrete word units, which helps them predict the masked words accurately.
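
To make the idea concrete, here is a minimal sketch of masked-word prediction using the Hugging Face transformers library (assumed to be installed; the model name is just a common BERT checkpoint used for illustration):

```python
# Predict a masked word with a BERT-style language model (illustrative sketch).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("Self-supervised models learn from [MASK] data."):
    print(candidate["token_str"], round(candidate["score"], 3))
```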

Training looks more complex for computer vision (CV) and speech models. Vision models predict the intensities of missing pixels in an image or video, while speech models learn to predict missing segments of the sound waveform. Unlike text, however, there is no pre-existing vocabulary of speech units or visual tokens, because these signals are continuous in nature.

Because each data source has different informational units, i.e., characters or words for text, pixels for images, and waveform samples for speech, a single unified AI model has so far been unable to handle such diverse training data.


Self-Supervised Data2vec Architecture Explained


Data2vec provides a unified training mechanism for text, speech, and vision data. It simplifies the learning process by masking part of the input and training a transformer network to predict latent representations of the full input, rather than predicting words, pixels, or waveforms directly.

Data2vec is trained using two networks: a teacher and a student. First, the teacher network computes target representations from the full input text passage, image, or speech audio. Next, the input is masked and passed to the student network, which learns to predict the teacher's representations of the hidden parts. The two networks share the same architecture, but the teacher's weights are an exponentially moving average of the student's weights, which keeps the prediction targets stable while the student learns.
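
Here is a highly simplified PyTorch sketch of this teacher/student loop (an illustration under assumed shapes and a toy encoder, not Meta's implementation; the real model builds its targets from an average of the top transformer layers and uses modality-specific encoders and masking):

```python
# Simplified data2vec-style training step: the student predicts the teacher's
# representations of the unmasked input at the masked positions.
import copy
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in transformer encoder producing one vector per input position."""
    def __init__(self, dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        return self.net(x)

student = Encoder()
teacher = copy.deepcopy(student)            # identical architecture
for p in teacher.parameters():
    p.requires_grad_(False)                 # the teacher is never trained directly

def ema_update(teacher, student, tau=0.999):
    """Teacher weights track an exponential moving average of the student's."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(tau).add_(s, alpha=1 - tau)

x = torch.randn(8, 16, 64)                  # a batch of already-encoded patches/tokens
mask = torch.rand(8, 16) < 0.15             # positions the student must reconstruct

with torch.no_grad():
    target = teacher(x)                     # representations of the full, unmasked input

masked_x = x.clone()
masked_x[mask] = 0.0                        # hide masked positions (real code uses a learned mask token)
pred = student(masked_x)

loss = nn.functional.smooth_l1_loss(pred[mask], target[mask])
loss.backward()                             # ...then an optimizer step on the student, and:
ema_update(teacher, student)
```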

Let’s learn more about this training process in detail.


1. Transformer Architecture and Encoding Schemes


The transformer, initially developed for language problems, is now widely adopted for self-supervised learning tasks across different data domains. The data2vec algorithm uses a standard transformer architecture and encodes each input according to its modality.

Images are encoded with the ViT strategy as a sequence of pixel patches: each 16×16-pixel patch is linearly transformed and fed to the standard transformer.
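
As an illustration, a 224×224 image can be turned into a sequence of 16×16 patch tokens with a single strided convolution (a minimal PyTorch sketch with assumed dimensions, not Meta's code):

```python
# ViT-style patch embedding: split an image into 16x16 patches and project each
# patch to an embedding vector that the transformer can consume.
import torch
import torch.nn as nn

patch_size, dim = 16, 768
to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)          # one 224x224 RGB image
tokens = to_patches(image)                   # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch tokens
print(tokens.shape)
```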

Audio is encoded with a multi-layer 1-D convolutional neural network that maps the 16 kHz waveform to 50 Hz representations, similar to the encoder used in the wav2vec 2.0 self-supervised speech recognition model.
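
A rough sketch of such a downsampling convolutional encoder is shown below (the strides and kernel sizes follow the published wav2vec 2.0 recipe, while the channel width and activation are illustrative assumptions):

```python
# A stack of 1-D convolutions that downsamples a 16 kHz waveform by 320x,
# yielding roughly 50 feature vectors per second of audio.
import torch
import torch.nn as nn

strides = [5, 2, 2, 2, 2, 2, 2]      # product = 320, and 16000 / 320 = 50 Hz
kernels = [10, 3, 3, 3, 3, 2, 2]

layers, in_channels = [], 1
for k, s in zip(kernels, strides):
    layers += [nn.Conv1d(in_channels, 512, kernel_size=k, stride=s), nn.GELU()]
    in_channels = 512
encoder = nn.Sequential(*layers)

waveform = torch.randn(1, 1, 16000)  # one second of 16 kHz audio
print(encoder(waveform).shape)       # (1, 512, 49): about 50 frames per second
```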

For text data, the input is pre-processed into word units and tokenized with byte-pair encoding.
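
For example, a byte-pair-encoding tokenizer splits raw text into subword units (this sketch uses the RoBERTa tokenizer from the Hugging Face transformers library as a stand-in for data2vec's BPE vocabulary):

```python
# Byte-pair encoding turns raw text into a sequence of subword tokens and ids.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
tokens = tokenizer.tokenize("Data2vec breaks the data boundaries")
print(tokens)                                   # subword pieces
print(tokenizer.convert_tokens_to_ids(tokens))  # their ids in the BPE vocabulary
```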


2. Different Masking Strategies


Data2vec masks or hides some parts of the encoded input and feeds it to the transformer network.

Images are masked with the block-wise masking strategy used in BEiT, which hides multiple adjacent image patches. For audio, data2vec reuses the masking technique of the self-supervised wav2vec 2.0 speech model, while for language data it adopts BERT-style token masking.
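
A toy version of block-wise masking over a 14×14 grid of image patches could look like the following (a simplified sketch; the real strategy repeatedly samples variable-sized blocks until a target fraction of patches is hidden):

```python
# Hide one contiguous block of adjacent image patches in a 14x14 patch grid.
import torch

grid = 14                                    # 14 x 14 = 196 patches for a 224px image
mask = torch.zeros(grid, grid, dtype=torch.bool)

top, left, height, width = 3, 5, 4, 6        # an arbitrary block of patches
mask[top:top + height, left:left + width] = True

print(mask.float().mean().item())            # fraction of patches hidden from the model
```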

With a unified data handling strategy, the data2vec model can learn the underlying structure of any kind of unlabeled input data and predict the missing information.

A general self-supervised Data2vec architecture for learning from different input sources.


Comparison of Data2vec Performance With Benchmark Models

Performance on Image Data

To evaluate data2vec on visual data, researchers at Meta pre-trained the model on the benchmark ImageNet-1K image dataset. Data2vec was then fine-tuned with labeled data from the same benchmark for the image classification task. The results show that data2vec outperforms previous state-of-the-art image models such as MoCo v3, DINO, and BEiT.

Data2vec vs. previous CV models. Image by Meta AI


Performance on Speech Audio

To assess its speech processing capabilities, data2vec was pre-trained and fine-tuned on the Librispeech audio dataset, which provides labeled subsets ranging from 10 to 960 hours of clean speech (a standard benchmark in the speech community). Data2vec was compared with previous state-of-the-art self-supervised speech recognition models such as wav2vec 2.0 and HuBERT, and the results again show improved performance.

Data2vec records a lower word error rate than the Librispeech benchmark models with 10 hours of labeled data. Image by Meta AI


Performance on Text Data

To measure data2vec's performance on text, a processing setup similar to BERT was replicated: the model was pre-trained on the Books Corpus and evaluated on the General Language Understanding Evaluation (GLUE) benchmark. Comparison with the RoBERTa baseline language model shows that data2vec slightly outperforms it on text data as well.

Data2vec scores higher than the RoBERTa language model. Image by Meta AI

Data2vec has the potential to handle multimodal data (text, audio, and images) with a single self-supervision mechanism, enabling AI researchers to develop all-in-one models.
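
If you want to experiment with data2vec yourself, Meta has published pre-trained checkpoints that can be loaded through the Hugging Face transformers library (the checkpoint names below are assumed from the Hugging Face Hub listings at the time of writing; verify availability in your environment):

```python
# Load data2vec checkpoints for all three modalities (names assumed from the
# Hugging Face Hub; verify they exist in your environment before running).
from transformers import AutoModel

text_model = AutoModel.from_pretrained("facebook/data2vec-text-base")
audio_model = AutoModel.from_pretrained("facebook/data2vec-audio-base-960h")
vision_model = AutoModel.from_pretrained("facebook/data2vec-vision-base")

print(type(text_model).__name__)    # e.g. Data2VecTextModel
print(type(audio_model).__name__)   # e.g. Data2VecAudioModel
print(type(vision_model).__name__)  # e.g. Data2VecVisionModel
```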


Limitations of Self-Supervised Data2vec


Data2vec is a significant step towards building more generalized AI models. But it has a few limitations.

Data2vec requires data-specific input encoding schemes. It also requires different masking schemes for each of the audio, image, and text data.

To build truly intelligent AI systems that learn by observing the real world, future models should be able to process any kind of data with a unified encoding and masking approach.


Boost the Performance of Your AI Applications With High-Quality Data


Data is growing exponentially, so we need efficient AI solutions to manage it. With a general self-supervised learning approach, data2vec handles unlabeled and diverse image, text, and audio data effectively. However, self-supervised techniques require more research before applying them to real-world applications. Until then, AI systems must feed on high-quality labeled datasets.

DATUMO is a leading crowdsourcing platform that enables quick and accurate data collection and annotation for audio, video, image, and text data. Our highly trained crowdsourced workers can diligently tag, edit, classify, segment, and transcribe data to meet your needs. Contact us today and start curating high-quality datasets to fuel your AI applications.

