
Microsoft Research Lab – Asia

R-Drop: A simple and effective regularization method to correct the defects of Dropout


Deep neural networks (DNNs) have recently achieved remarkable success in various fields. When training these large-scale DNN models, regularization techniques such as L2 regularization, Batch Normalization, and Dropout are indispensable for preventing overfitting and improving generalization. Among them, Dropout [1], which simply drops a portion of the neurons during training, has become the most widely used regularization technique. Although Dropout is effective and performs well, the randomness it introduces causes a non-negligible inconsistency between how the model is used in training and in inference. Concretely, the randomly sampled sub-model (with Dropout) used during training is inconsistent with the full model (without Dropout) used during inference.

Researchers from Microsoft Research Asia and Soochow University investigated this inconsistency and proposed a simple yet effective regularization method: Regularized Dropout, or R-Drop for short. Unlike traditional regularization methods that act on neurons (Dropout [1]) or model parameters (DropConnect [2]), R-Drop acts on the output layer of the model. The algorithm is simple: in every mini-batch, each data sample goes through two sub-models randomly sampled by Dropout, and R-Drop uses KL divergence to constrain their two outputs. In other words, R-Drop enforces output consistency between the two random sub-models created by Dropout. Compared with standard training, R-Drop simply adds a KL-divergence loss term without introducing any other changes. Although the method is simple, experiments on five widely studied NLP and CV tasks (covering a total of 18 data sets) show that R-Drop achieves significant and consistent performance improvements, reaching state-of-the-art results on machine translation, text summarization, and other tasks.


Paper: https://arxiv.org/abs/2106.14448
Code: https://github.com/dropreg/R-Drop

R-Drop

Since deep neural networks overfit easily, Dropout [1] randomly drops neurons in each layer during training to mitigate overfitting. Because different neurons are discarded each time, every Dropout pass yields a different sub-model, so Dropout can be viewed as implicitly training the model as an ensemble of sub-models. Though simple and effective, Dropout introduces a significant inconsistency between training and inference that hinders model performance: the training stage uses sub-models with randomly dropped units, while the inference stage uses the full model without Dropout. Moreover, the sub-models obtained from different random Dropout samples also differ from one another. Considering this randomness that Dropout brings to the network, the researchers proposed R-Drop to further constrain the output predictions of the (sub-model) networks.


Figure 1: The R-Drop framework. The two output probabilities P_1 and P_2 for the same input differ because of Dropout during training.
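To see this inconsistency concretely, the short PyTorch sketch below (an illustration of general Dropout behavior, not code from the R-Drop repository; the toy network and tensor sizes are arbitrary) shows that two forward passes over the same input in training mode yield different outputs, while evaluation mode deterministically uses the full model:

```python
import torch
import torch.nn as nn

# Toy network with Dropout; the layer sizes are arbitrary and only for illustration.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(32, 4),
)
x = torch.randn(1, 16)

model.train()                       # Dropout active: each pass samples a different sub-model
out1, out2 = model(x), model(x)
print(torch.allclose(out1, out2))   # False (almost surely): the two sub-models disagree

model.eval()                        # Dropout disabled: inference always uses the full model
out3, out4 = model(x), model(x)
print(torch.allclose(out3, out4))   # True: the full model is deterministic
```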

Specifically, given the training data $D=\{(x_i, y_i)\}_{i=1}^{n}$, each training sample $x_i$ goes through the forward pass of the network twice, producing two output predictions $P_1(y_i \mid x_i)$ and $P_2(y_i \mid x_i)$. Since Dropout randomly drops neurons in each pass, $P_1$ and $P_2$ are two different predicted distributions obtained from two different sub-networks of the same model (as shown in Figure 1). R-Drop then constrains $P_1$ and $P_2$ with the symmetric Kullback-Leibler (KL) divergence between them:

$$\mathcal{L}_{KL}^{i} = \frac{1}{2}\Big(\mathcal{D}_{KL}\big(P_1(y_i \mid x_i)\,\big\|\,P_2(y_i \mid x_i)\big) + \mathcal{D}_{KL}\big(P_2(y_i \mid x_i)\,\big\|\,P_1(y_i \mid x_i)\big)\Big)$$

Coupled with the standard negative log-likelihood loss on the two outputs:

$$\mathcal{L}_{NLL}^{i} = -\log P_1(y_i \mid x_i) - \log P_2(y_i \mid x_i)$$

The final training loss function is:

$$\mathcal{L}^{i} = \mathcal{L}_{NLL}^{i} + \alpha \cdot \mathcal{L}_{KL}^{i}$$

Here, $\alpha$ is the coefficient that controls the weight of $\mathcal{L}_{KL}^{i}$, so training the entire model remains very simple. In the actual implementation, $x_i$ does not need to be forwarded through the model twice sequentially; instead, $x_i$ is simply duplicated within the same mini-batch so that both passes are computed in a single forward call. In addition, from a theoretical point of view, the paper analyzes how the R-Drop constraint restricts the model's degrees of freedom, which helps improve its generalization.
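As a minimal sketch of this training objective (assuming a PyTorch classification model whose forward pass applies Dropout; the names `model`, `x`, `y`, and `alpha` are illustrative rather than taken from the official repository), the loss can be computed as follows:

```python
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """Sketch of the R-Drop loss: two stochastic forward passes plus a symmetric KL term.

    Assumes `model(x)` returns unnormalized logits and that Dropout is active
    (model.train()). In practice, x is duplicated within the mini-batch so that
    both passes run as a single forward call, as described above.
    """
    logits1 = model(x)  # first pass  -> sub-model 1
    logits2 = model(x)  # second pass -> sub-model 2

    # Negative log-likelihood (cross-entropy) on both outputs
    nll = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)

    # Symmetric KL divergence between the two predicted distributions
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(logp1, logp2, reduction="batchmean", log_target=True)
                + F.kl_div(logp2, logp1, reduction="batchmean", log_target=True))

    return nll + alpha * kl
```

In a training loop, this term simply replaces the usual cross-entropy loss, with the weight α tuned per task.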

Experiments

To verify the role of R-Drop, the researchers conducted experiments on five different NLP and CV tasks: machine translation, text summarization, language modeling, language understanding, and image classification, covering a total of 18 data sets.

1. On the machine translation task, based on the widely used Transformer [3] model, R-Drop training achieved the best BLEU scores (30.91/43.95) on the WMT14 English->German and English->French tasks, surpassing models that rely on more complex architectures, pre-training combinations, or larger scales.


Figure 2: The results of R-Drop on WMT14 English->German and English->French machine translation

2. On the image classification task, using pre-trained Vision Transformers (ViT) [4] as the backbone networks, fine-tuning with R-Drop on the CIFAR-100 and ImageNet datasets brought significant performance improvements to both the ViT-B/16 and ViT-L/16 models.


Figure 3: Image classification results of R-Drop based on Vision Transformer, fine-tuned on the CIFAR-100 and ImageNet datasets

3. On the natural language understanding (NLU) task, fine-tuning with R-Drop on the pre-trained BERT-base [5] and RoBERTa-large [6] backbone networks improved the average score on the GLUE benchmark by more than 1.2 and 0.8 points, respectively.


Figure 4: R-Drop’s fine-tuning results on the GLUE language understanding validation sets

4. On the text summarization task, training with R-Drop on top of the pre-trained BART [7] model delivered top results after fine-tuning on the CNN/Daily Mail dataset.


Figure 5: R-Drop’s fine-tuning results on CNN/Daily Mail text summarization based on the BART model

5. On the language modeling task, training with R-Drop on the original Transformer and the Adaptive Input Transformer [8] improved perplexity (PPL) on the Wikitext-103 dataset by 1.79 and 0.80 points, respectively.


Figure 6: R-Drop’s language modeling results on the Wikitext-103 dataset

These results show that although R-Drop is very simple, it is remarkably effective, achieving state-of-the-art results on many tasks and working across different domains such as text and images. The researchers also conducted a range of analysis experiments, including studies of training cost, k-step R-Drop, and m-time R-Drop, to provide a more comprehensive understanding of the method.

Conclusion and outlook

R-Drop is motivated by the randomness of Dropout, which causes an inconsistency between the model used during training and the one used at inference. R-Drop is simple yet very effective and has been verified on many well-known benchmarks. This work explores only supervised tasks; unsupervised and semi-supervised learning, as well as more tasks with different data types, are worth further exploration. We welcome everyone to apply the R-Drop training technique to various practical scenarios, and we hope that the ideas behind R-Drop can inspire future work.

 

References:

[1] Srivastava, Nitish, et al. “Dropout: a simple way to prevent neural networks from overfitting.” The journal of machine learning research 15.1 (2014): 1929-1958.
[2] Wan, Li, et al. “Regularization of neural networks using dropconnect.” International conference on machine learning. PMLR, 2013.
[3] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.
[4] Dosovitskiy, Alexey, et al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” International Conference on Learning Representations. 2020.
[5] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of NAACL-HLT. 2019.
[6] Liu, Yinhan, et al. “Roberta: A robustly optimized bert pretraining approach.” arXiv preprint arXiv:1907.11692 (2019).
[7] Lewis, Mike, et al. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
[8] Baevski, Alexei, and Michael Auli. “Adaptive Input Representations for Neural Language Modeling.” International Conference on Learning Representations. 2018.