How does Wav2vec 2 for speech recognition (speech2text) work?


Wav2vec 2 is the successor to the Wav2vec model and was developed by Facebook AI. It can be used for speech recognition tasks, such as speech-to-text.
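
To get a practical sense of what speech-to-text with Wav2vec 2 looks like, here is a minimal inference sketch. It assumes the HuggingFace Transformers and soundfile packages, the publicly available facebook/wav2vec2-base-960h checkpoint (a BASE model fine-tuned on 960 hours of LibriSpeech), and a hypothetical 16 kHz mono recording at "speech.wav":

```python
# Minimal speech-to-text inference sketch with a pretrained Wav2vec 2 model.
# Assumes the HuggingFace Transformers and soundfile packages and a 16 kHz
# mono recording at the (hypothetical) path "speech.wav".
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sample_rate = sf.read("speech.wav")         # raw audio samples

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits        # (batch, frames, vocab)

predicted_ids = torch.argmax(logits, dim=-1)          # greedy CTC decoding
print(processor.batch_decode(predicted_ids)[0])       # transcribed text
```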
 
Goal of creating Wav2vec 2:
* Showing that a Transformer pretrained on unlabeled speech can be fine-tuned to speech recognition tasks with very small amounts of labeled data. This potentially opens up speech recognition for obscure, low-resource languages, where only limited labeled data is available.
 
Wav2vec 2 architecture:
(Architecture diagram: see Baevski et al., 2020)

* The Wav2vec 2 architecture describes a flow from a raw waveform to an output classification (when fine-tuned) or to context representations (when only pretrained).
* The raw waveform is first normalized to zero mean and unit variance, then windowed and fed to a 1D ConvNet, the feature encoder. This ConvNet has 7 blocks with kernel widths (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2), producing 512 feature maps per layer; GeLU activations and LayerNorm are applied. Its output is called the “latent speech representation”, i.e. a more abstract (latent) representation of the raw waveform. A minimal sketch of this feature encoder follows after this list.
* The outputs of the ConvNet are fed through a quantization module, which performs product quantization: mapping a continuous vector onto the closest of a finite set of discrete vector representations, thereby reducing the set of possible vectors. This is done by generating G codebooks with V entries each, choosing one entry from each codebook, concatenating them, and applying a linear transformation to obtain the quantized vector. A Gumbel Softmax makes this choice differentiable. The why: previous work has shown that this approach works well (and the pretraining loss below describes how the quantized vectors are used). A quantization sketch also follows after this list.
* The outputs of the ConvNet are also used as inputs for a Transformer model, called the “context network”, which generates “context representations”. Instead of the fixed positional encodings of the original Transformer, a convolutional layer acts as a relative positional embedding; its output is added to the input and then LayerNormalized. A proportion of the Transformer inputs is masked in order to perform an MLM-style task during pretraining. Note that the inputs to the quantization module are not masked.
* For fine-tuning, a feedforward (classification) segment is stacked on top of the Transformer model, as is common with Transformer fine-tuning (e.g. the classifier on top of the [CLS] representation C in BERT fine-tuning).
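
To make the feature encoder concrete, here is a minimal PyTorch sketch of the 7-block convolutional stack with the kernel widths and strides listed above. Channel count and activations follow the description, while the normalization details are simplified, so treat it as an illustration of the shapes involved rather than the original fairseq implementation:

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Sketch of the wav2vec 2.0 feature encoder: 7 temporal conv blocks with
    kernels (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2),
    512 channels each, GeLU activations (normalization omitted for brevity)."""
    def __init__(self, dim=512):
        super().__init__()
        kernels = (10, 3, 3, 3, 3, 2, 2)
        strides = (5, 2, 2, 2, 2, 2, 2)
        blocks, in_ch = [], 1
        for k, s in zip(kernels, strides):
            blocks += [nn.Conv1d(in_ch, dim, kernel_size=k, stride=s), nn.GELU()]
            in_ch = dim
        self.conv = nn.Sequential(*blocks)

    def forward(self, waveform):                    # (batch, samples)
        x = waveform.unsqueeze(1)                   # (batch, 1, samples)
        z = self.conv(x)                            # (batch, 512, frames)
        return z.transpose(1, 2)                    # (batch, frames, 512)

# One second of 16 kHz audio yields roughly 49 latent frames (~20 ms stride).
z = FeatureEncoder()(torch.randn(2, 16000))
print(z.shape)   # torch.Size([2, 49, 512])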
 
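And a minimal sketch of the Gumbel-softmax product quantization step. The paper uses G = 2 codebooks with V = 320 entries each; the dimensions, module names, and projection below are illustrative assumptions, not the original implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelProductQuantizer(nn.Module):
    """Sketch of product quantization with a Gumbel softmax: pick one entry
    from each of G codebooks, concatenate, then apply a linear projection."""
    def __init__(self, in_dim=512, groups=2, entries=320,
                 codevector_dim=256, out_dim=256):
        super().__init__()
        self.groups, self.entries = groups, entries
        # Logits that score every entry of every codebook for each frame.
        self.to_logits = nn.Linear(in_dim, groups * entries)
        # The codebook entries themselves (G x V vectors of size codevector_dim / G).
        self.codebook = nn.Parameter(
            torch.randn(groups, entries, codevector_dim // groups))
        self.out_proj = nn.Linear(codevector_dim, out_dim)

    def forward(self, z, tau=2.0):                   # z: (batch, frames, in_dim)
        b, t, _ = z.shape
        logits = self.to_logits(z).view(b, t, self.groups, self.entries)
        # Differentiable (hard) one-hot selection per codebook.
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        # Select and concatenate the chosen entry from each codebook.
        chosen = torch.einsum("btgv,gvd->btgd", one_hot, self.codebook)
        q = chosen.reshape(b, t, -1)                 # (batch, frames, codevector_dim)
        return self.out_proj(q), one_hot             # quantized targets + selections

q, one_hot = GumbelProductQuantizer()(torch.randn(2, 49, 512))
print(q.shape)   # torch.Size([2, 49, 256])
```
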
Wav2vec 2 datasets:
 
Unlabeled data used for pretraining:
* LibriSpeech corpus with 960 hours of audio (LS-960), OR
* LibriVox dataset (LV-60k); 53,200 hours of audio after preprocessing.
 
Labeled data for finetuning (one subset per experiment):
* 960 hours of transcribed LibriSpeech
* 100 hours of transcribed LibriSpeech
* 10 hours of transcribed LibriSpeech
* 1 hour of transcribed LibriSpeech
* 10 minutes of transcribed LibriSpeech
 
Wav2vec 2 pretraining:
* Pretraining uses a loss function L = Lm + alpha * Ld, which combines two sub-losses: a “contrastive loss” (Lm) and a “diversity loss” (Ld). Here, alpha is a tunable hyperparameter weighting the diversity loss. A sketch of both terms follows after this list.
* The contrastive loss measures how well the model can identify the true quantized vector for a masked time step among a set of distractors (quantized vectors sampled from other masked time steps), given the context representation the Transformer produces for that time step.
* The diversity loss measures how well the model uses the full variety of codebook entries when performing product quantization: the more uniformly the V entries of each of the G codebooks are used, the better.
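
As a rough illustration of how the two loss terms fit together, here is a simplified sketch. The distractor sampling, the temperature, and the exact form of the diversity term are simplifications of the paper's formulation, not the original implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, mask, num_distractors=100, temperature=0.1):
    """For each masked frame, identify the true quantized vector among
    distractors drawn from other masked frames (cosine similarity + softmax).
    `context` and `quantized` are assumed to share the same dimensionality."""
    c = context[mask]        # (n_masked, dim) context-network outputs
    q = quantized[mask]      # (n_masked, dim) quantized targets
    n = c.size(0)
    # Simplified: distractors sampled uniformly, may occasionally hit the target.
    distractor_idx = torch.randint(0, n, (n, num_distractors), device=c.device)
    candidates = torch.cat([q.unsqueeze(1), q[distractor_idx]], dim=1)  # (n, 1+K, dim)
    sims = F.cosine_similarity(c.unsqueeze(1).expand_as(candidates),
                               candidates, dim=-1) / temperature
    # The true target always sits at index 0 of the candidate set.
    return F.cross_entropy(sims, torch.zeros(n, dtype=torch.long, device=c.device))

def diversity_loss(codebook_probs):
    """Simplified diversity term: push the average softmax distribution over
    the V entries of each of the G codebooks towards maximum entropy."""
    avg = codebook_probs.float().mean(dim=(0, 1))            # (G, V) average usage
    entropy = -(avg * torch.log(avg + 1e-7)).sum(dim=-1)     # per codebook
    max_entropy = torch.log(torch.tensor(float(avg.size(-1))))
    return (max_entropy - entropy).mean()

# Total pretraining loss, with alpha weighting the diversity term:
# loss = contrastive_loss(c, q, mask) + alpha * diversity_loss(probs)
```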
 
Model setup for pretraining:
* The proportion of time steps selected as masking starting points is 0.065 (i.e. 6.5%); from each starting point, the 10 subsequent time steps are masked as well. Spans of masked time steps may overlap. On average, approximately 49% of all time steps is masked. A sampling sketch follows after this list.
* Two configurations – BASE and LARGE.
* BASE has 12 Transformer blocks, an internal Transformer dimension of 768, a Transformer feedforward dimension of 3072, and 8 attention heads. Audio examples are cropped to 250k samples (about 15.6 seconds each) and batched so as not to exceed 1.4M samples per GPU. BASE was trained on 64 V100 GPUs for 1.6 days.
* LARGE has 24 Transformer blocks, an internal Transformer dimension of 1024, a Transformer feedforward dimension of 4096, and 16 attention heads. Audio examples are cropped to 320k samples (about 20.0 seconds each), limited to 1.2M samples per GPU. LARGE was trained on 128 V100 GPUs for 2.3 days (LibriSpeech) or 5.2 days (LibriVox). Dropout with p = 0.1 was applied within the Transformer, at the feature encoder output, and at the quantization module input.
* Adam optimization was used with an LR warmup over the first 8% of updates, a peak LR of 5 x 10^-4 for BASE and 3 x 10^-4 for LARGE, followed by linear decay.
* BASE was trained for 250k updates (600k on the LV-60k dataset); LARGE for 400k updates. An L2 penalty was applied for the small datasets. The checkpoint with the lowest contrastive validation loss was selected.
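
A small sketch of the span masking described above (start probability 0.065, span length 10, overlapping spans allowed). The sampling details of the original implementation differ slightly; this is only meant to show why roughly 49% of frames end up masked:

```python
import torch

def sample_mask(batch_size, num_frames, mask_prob=0.065, span_length=10):
    """Sample masked positions: each frame is chosen as a span start with
    probability `mask_prob`, and the span of `span_length` frames from that
    start is masked; spans may overlap."""
    starts = torch.rand(batch_size, num_frames) < mask_prob
    mask = torch.zeros(batch_size, num_frames, dtype=torch.bool)
    for offset in range(span_length):
        mask[:, offset:] |= starts[:, :num_frames - offset]
    return mask

mask = sample_mask(batch_size=2, num_frames=49)
print(mask.float().mean())   # roughly 0.49 on average, as in the paper
```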
 
Wav2vec 2 finetuning:
* A CTC loss is used during fine-tuning, with SpecAugment applied to avoid overfitting; a minimal CTC sketch follows below.
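
For reference, a minimal sketch of a CTC loss on top of per-frame character logits, using PyTorch's built-in nn.CTCLoss. The vocabulary size, sequence lengths, and random tensors are placeholders, and SpecAugment is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Per-frame logits from the classifier on top of the Transformer:
# shape (time, batch, vocab), where index 0 is reserved for the CTC blank token.
batch, time, vocab = 2, 49, 32
logits = torch.randn(time, batch, vocab)
log_probs = F.log_softmax(logits, dim=-1)

# Character targets of different lengths (padded to a common width).
targets = torch.randint(1, vocab, (batch, 20))
input_lengths = torch.full((batch,), time, dtype=torch.long)
target_lengths = torch.tensor([20, 14])

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```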
 
Model setup for finetuning:
* The exact setup depends on the dataset used for fine-tuning, but some aspects are shared.
* Adam optimizer with a tri-stage LR schedule (10% warmup, 40% constant, then linear decay); a schedule sketch follows after this list.
* BASE used 3.2M samples/GPU (8 GPUs); LARGE 1.28M samples/GPU (24 GPUs).
* For the first 10k updates, only the weights of the classifier segment on top of the Transformer are updated; after that, the Transformer weights are updated as well (the convolutional feature encoder stays frozen during fine-tuning).
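
To make the tri-stage schedule concrete, a small sketch of the learning-rate factor (10% warmup, 40% constant, then linear decay). The decay target of zero and the example peak LR are assumptions, since only the three stages are specified above:

```python
def tri_stage_lr(step, total_steps, peak_lr, warmup=0.10, hold=0.40):
    """Tri-stage schedule: linear warmup for the first 10% of updates,
    constant at the peak for the next 40%, then linear decay
    (decaying to zero is an assumption made for this sketch)."""
    warmup_steps = int(warmup * total_steps)
    hold_steps = int(hold * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    if step < warmup_steps + hold_steps:
        return peak_lr
    decay_steps = total_steps - warmup_steps - hold_steps
    progress = (step - warmup_steps - hold_steps) / max(1, decay_steps)
    return peak_lr * (1.0 - progress)

# Peak LR during the constant stage, near zero at the end of training.
print(tri_stage_lr(step=5_000, total_steps=20_000, peak_lr=3e-5))
print(tri_stage_lr(step=19_999, total_steps=20_000, peak_lr=3e-5))
```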
 
Wav2vec 2 Results:
 
Results for low-resource training (= fine-tuning with very few samples)
* Low-resource settings (only 10 minutes of labeled data!) combined with unlabeled pretraining give good results.
* Jointly using the discrete quantized units and Transformer-based contextual representations in the loss improves results, reducing error by approximately 1/3.
* Generally, fine-tuning on the 10h/100h subsets yields improvements of 24-29% up to 42-45%; the 1h subset yielded improvements of 7-12%.
 
Results for high-resource training (= fine-tuning with 960h of data)
* Here, too, the model improved performance, for both BASE and LARGE.
 
Results for phoneme recognition (= fine-tuning on a phoneme recognition task, TIMIT).
* 23-29% improvements, resulting in SOTA performance.
 
Conclusions:
* Wav2vec 2 shows that it is possible to pretrain Transformer-based models on large, unlabeled corpora and get good results with a variety of dataset sizes during fine-tuning.
* It works with big fine-tuning datasets, but also with smaller ones.
* This is promising for low-resource situations, e.g. for obscure or otherwise low-resource languages, or for personalized speech tasks.
 
Sources:
 
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
 
Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
