
Microsoft Research Lab – Asia

ACL 2022 highlights: From unified-modal encoder-decoder to neural machine translation


As a top international academic conference in the field of natural language processing, ACL attracts paper submissions and participation from a large number of scholars every year. This year's ACL conference was held from May 22nd to May 27th. Notably, it was the first edition of the conference held since ACL adopted the ACL Rolling Review mechanism.

This article highlights six of the papers submitted by Microsoft Research Asia. Their topics cover a range of subjects, including encoder-decoder frameworks, natural language generation, knowledge neurons, extractive summarization, pre-trained language models, zero-shot neural machine translation, and more.

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing


Paper link: https://www.microsoft.com/en-us/research/publication/speecht5-unified-modal-encoder-decoder-pre-training-for-spoken-language-processing/

Encoder-decoder frameworks are widely used in natural language processing and speech processing, such as in end-to-end neural machine translation models and speech recognition models. Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, the researchers proposed a unified-modal SpeechT5 framework that explores encoder-decoder pre-training for self-supervised speech/text representation learning.

The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. Leveraging large-scale unlabeled speech and text data, SpeechT5 was pre-trained to learn a unified-modal representation in the hopes of improving modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, the researchers proposed a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations have shown the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

Figure 1: (a) The model architecture of SpeechT5, which contains an encoder-decoder module and six modal-specific pre/post-nets. (b) By sharing discrete tokens across modalities, the joint pre-training approach builds bridges between speech and text.
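To make the layout concrete, here is a minimal PyTorch sketch of a shared encoder-decoder with modality-specific pre/post-nets. It is an illustration under simplifying assumptions (toy dimensions, a single pre-net per modality instead of separate encoder and decoder pre-nets, and no vector quantization), not the authors' implementation.

```python
# Illustrative sketch of a SpeechT5-style layout: one shared encoder-decoder
# plus modality-specific pre/post-nets. Sizes and routing are simplified.
import torch
import torch.nn as nn

class SharedEncoderDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )

    def forward(self, src, tgt):
        return self.transformer(src, tgt)

class SpeechT5Sketch(nn.Module):
    """Routes speech or text through modality-specific pre/post-nets."""
    def __init__(self, d_model=256, vocab_size=1000, n_mels=80):
        super().__init__()
        self.backbone = SharedEncoderDecoder(d_model)
        # Pre-nets map raw inputs of each modality into the shared space.
        self.speech_prenet = nn.Linear(n_mels, d_model)
        self.text_prenet = nn.Embedding(vocab_size, d_model)
        # Post-nets map shared decoder states back to each modality.
        self.speech_postnet = nn.Linear(d_model, n_mels)
        self.text_postnet = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_modality="speech", tgt_modality="text"):
        enc_in = self.speech_prenet(src) if src_modality == "speech" else self.text_prenet(src)
        dec_in = self.speech_prenet(tgt) if tgt_modality == "speech" else self.text_prenet(tgt)
        hidden = self.backbone(enc_in, dec_in)
        return self.speech_postnet(hidden) if tgt_modality == "speech" else self.text_postnet(hidden)

# Example: ASR-style routing (speech in, text out).
model = SpeechT5Sketch()
mel = torch.randn(2, 50, 80)              # (batch, frames, mel bins)
tokens = torch.randint(0, 1000, (2, 20))  # (batch, text length)
logits = model(mel, tokens, src_modality="speech", tgt_modality="text")
print(logits.shape)  # torch.Size([2, 20, 1000])
```

Routing speech through the speech pre-net and decoding into the text post-net corresponds to an ASR-style task; swapping the nets gives TTS-style generation, which is what lets one shared backbone serve the full range of spoken language processing tasks.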

Controllable Natural Language Generation with Contrastive Prefixes


Paper link: https://www.microsoft.com/en-us/research/publication/controllable-natural-language-generation-with-contrastive-prefixes/

Previous works on guiding the generation of large pre-trained language models (LMs) have focused on directly fine-tuning the language model or utilizing an attribute discriminator. In this work, the researchers proposed a novel lightweight framework for controllable GPT2 generation, which utilizes a set of small attribute-specific vectors called prefixes (Li and Liang, 2021) to steer natural language generation.

Compared with using an attribute model or a generative discriminator, using learned prefixes to achieve controllability has the following benefits: first, it introduces fewer additional parameters (0.2%-2% of GPT2 parameters in the experiments); second, it keeps the inference speed comparable to that of the original GPT2 model. Moreover, the paper proposed a novel supervised method and a novel unsupervised method within this framework, both of which take the relationships among prefixes into consideration and train multiple prefixes simultaneously with novel training objectives. Experimental results on single-aspect control tasks (sentiment control, detoxification, and topic control) have shown that the proposed methods can guide generation towards the target attribute while maintaining high linguistic quality, even when only several dozen labeled examples are available. In addition to single-aspect control, multi-aspect control can be achieved by combining the supervised and unsupervised methods in this framework. Experimental results on sentiment and topic control have shown that prefixes trained with the new method can successfully control these two aspects simultaneously.

Figure 2: A comparison of prefix-tuning (top) and the novel lightweight framework (bottom) on sentiment control. The solid arrows show the training process, while the dashed arrows show the inference (generation) process. In the new proposed framework, the training can be supervised, semi-supervised, or unsupervised.
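To illustrate the mechanism, the sketch below prepends a small set of learned key/value vectors (a prefix) to a frozen GPT-2 through Hugging Face's past_key_values interface, in the spirit of prefix-tuning (Li and Liang, 2021). The tiny randomly initialized GPT-2 config, the prefix length, and the parametrization are assumptions for illustration only; they are not the paper's training setup or its contrastive objectives.

```python
# Sketch: steering a frozen GPT-2 with a learned attribute prefix.
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=2, n_head=4, n_embd=128, vocab_size=1000)
model = GPT2LMHeadModel(config)          # randomly initialized for the sketch
for p in model.parameters():
    p.requires_grad = False              # the LM itself stays frozen

prefix_len = 5
head_dim = config.n_embd // config.n_head
# One trainable key/value tensor per layer; only these few parameters
# (roughly the 0.2%-2% regime mentioned above) would be optimized for an
# attribute such as positive sentiment.
prefix = nn.ParameterList([
    nn.Parameter(torch.randn(2, 1, config.n_head, prefix_len, head_dim) * 0.02)
    for _ in range(config.n_layer)
])

def prefixed_past(batch_size):
    """Expand the learned prefix into GPT-2's past_key_values format."""
    return tuple(
        (kv[0].expand(batch_size, -1, -1, -1), kv[1].expand(batch_size, -1, -1, -1))
        for kv in prefix
    )

input_ids = torch.randint(0, 1000, (2, 8))
attention_mask = torch.ones(2, prefix_len + 8)   # attend to prefix + tokens
out = model(input_ids=input_ids,
            past_key_values=prefixed_past(2),
            attention_mask=attention_mask)
print(out.logits.shape)  # torch.Size([2, 8, 1000])
```

Because the prefix only enters through the attention cache, generation runs at essentially the same speed as the unmodified GPT2 model, which is the efficiency argument made above.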

Knowledge Neurons in Pretrained Transformers


Paper link: https://www.microsoft.com/en-us/research/publication/knowledge-neurons-in-pretrained-transformers/

Large-scale pretrained language models are surprisingly good at recalling factual knowledge presented in the training corpus. This paper presents preliminary studies on how factual knowledge is stored in pretrained Transformers by introducing the concept of knowledge neurons.

As illustrated in Figure 3, the researchers proposed a knowledge attribution method to identify the neurons that express a relational fact; these neurons are named knowledge neurons. Specifically, the feed-forward network modules (i.e., two-layer perceptrons) in the Transformer were viewed as key-value memories. For the example in Figure 3, the hidden state is fed into the first linear layer and activates knowledge neurons; then, the second linear layer integrates the corresponding memory vectors. This key-value-memory view inspired the researchers to propose the knowledge attribution method, which identifies knowledge neurons in feed-forward networks by computing the contribution of each neuron to the knowledge prediction.

Figure 3: Through knowledge attribution, the researchers identify knowledge neurons that express a relational fact.
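The sketch below illustrates the key-value-memory view together with an integrated-gradients-style attribution over the hidden activations of a toy two-layer feed-forward block. The toy network, the fill-in-the-blank readout, and the step count are illustrative assumptions; the paper applies this kind of attribution inside the feed-forward layers of a pretrained Transformer.

```python
# Toy illustration: attribute an answer prediction to individual FFN neurons.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, vocab = 16, 32, 50

ffn_in = nn.Linear(d_model, d_ff)    # "keys": hidden state -> neuron activations
ffn_out = nn.Linear(d_ff, d_model)   # "values": activations -> memory vectors
readout = nn.Linear(d_model, vocab)  # maps back to answer-token logits

hidden = torch.randn(1, d_model)     # hidden state at the masked position
answer_id = 7                        # index of the correct answer token

def answer_prob(scale):
    """Probability of the answer when neuron activations are scaled by `scale`."""
    acts = torch.relu(ffn_in(hidden)) * scale
    logits = readout(ffn_out(acts))
    return torch.softmax(logits, dim=-1)[0, answer_id]

# Riemann approximation of integrated gradients w.r.t. the scaling factor.
# Each gradient w.r.t. the scale already carries the activation factor via
# the chain rule, so averaging the gradients approximates the attribution.
steps = 20
grads = []
for k in range(1, steps + 1):
    scale = torch.full((1, d_ff), k / steps, requires_grad=True)
    prob = answer_prob(scale)
    grads.append(torch.autograd.grad(prob, scale)[0])
attribution = torch.stack(grads).mean(dim=0)

top = attribution[0].topk(3).indices.tolist()
print("candidate knowledge neurons:", top)
```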

Extensive analyses showed that the activation of the identified knowledge neurons is positively correlated to the knowledge expression, which demonstrates the effectiveness of the proposed knowledge attribution method. First, suppressing and amplifying knowledge neurons notably affected the expression of the corresponding knowledge. Second, the researchers found that knowledge neurons of a fact tend to be activated more by corresponding knowledge-expressing prompts. Third, given the knowledge neurons of a fact, the top activating prompts retrieved from open-domain texts usually express the corresponding fact, while the bottom activating prompts do not express the correct relation.

In case studies, the researchers tried to leverage knowledge neurons to explicitly edit factual knowledge in pretrained Transformers without any fine-tuning. The paper presented two preliminary studies: updating facts and erasing relations. After identifying the knowledge neurons, the researchers performed knowledge surgery on pretrained Transformers by directly modifying the corresponding parameters in the feed-forward networks.
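As a rough illustration of what such surgery touches, the sketch below edits the value slot (a column of the second feed-forward matrix) associated with an identified neuron. The concrete update rule shown here, zeroing a slot to erase a fact or writing a new vector to update it, is an assumption for illustration rather than the paper's exact procedure.

```python
# Sketch: edit the value slot of an identified knowledge neuron in place.
import torch
import torch.nn as nn

d_model, d_ff = 16, 32
ffn_out = nn.Linear(d_ff, d_model)   # columns of ffn_out.weight are value slots
neuron_id = 11                       # a previously identified knowledge neuron

with torch.no_grad():
    # Erase: remove the memory this neuron contributes.
    ffn_out.weight[:, neuron_id] = 0.0
    # Update: write a new memory vector (e.g., one derived from the embedding
    # of the corrected answer) into the same slot.
    new_value = torch.randn(d_model) * 0.02
    ffn_out.weight[:, neuron_id] = new_value
```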

Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization


Paper Link: https://www.microsoft.com/en-us/research/publication/neural-label-search-for-zero-shot-multi-lingual-extractive-summarization/

Extractive summarization models have achieved great performance on English datasets, mainly owing to large pre-trained language models and the availability of large annotated datasets. However, it is difficult to obtain large-scale annotated summarization data for low-resource languages. In response, the researchers focused this paper on zero-shot multi-lingual extractive summarization, where an extractive summarization model is trained on an English summarization dataset and then applied directly to non-English datasets. The researchers identified the monolingual label bias problem in this setting and proposed a multilingual label annotation algorithm together with a neural label search model (NLSSum).

Multilingual labels are created using machine translation and bilingual dictionaries. As shown in Figure 4, label sets a, b, c, and d were obtained through translation and word replacement between different languages. Intuitively, the labels created by this method could introduce more cross-lingual information.

Figure 4: The process of creating multi-lingual extractive summarization labels. Label set a is obtained from English data; label sets b, c and d are obtained from foreign language data, created by translating from the English data with machine translation (MT) or word replacement (WR) using a bilingual dictionary.

The NLSSum model assigns different weights to the different multilingual label sets to create sentence-level labels, which are then used to train the model on the English summarization dataset (see Figure 5). The sentence-level label scores combine the outputs of a sentence-level weight predictor Tα and a label-set-level weight predictor Tβ. Compared with monolingual labels, the new multilingual labels contain more cross-lingual information. Experiments show that NLSSum outperforms all baseline models across different datasets in the zero-shot setting and even outperforms one supervised model (i.e., Pointer-Generator).

Figure 5: Multi-lingual Neural Label Search Summarization Model (NLSSum).
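The following sketch shows one way such a weighted combination of label sets could look, with a small sentence-level predictor standing in for Tα and a label-set-level weight vector standing in for Tβ. The tiny modules, inputs, and combination rule are illustrative assumptions, not the NLSSum architecture.

```python
# Sketch: combine several multilingual label sets into soft sentence labels.
import torch
import torch.nn as nn

n_sents, n_sets, d_sent = 6, 4, 32        # sentences per doc, label sets a-d, sentence features

sent_repr = torch.randn(n_sents, d_sent)  # encoded sentence representations
label_sets = torch.randint(0, 2, (n_sets, n_sents)).float()  # binary labels from sets a, b, c, d

T_alpha = nn.Linear(d_sent, n_sets)         # per-sentence weights over the label sets
T_beta = nn.Parameter(torch.zeros(n_sets))  # global weight per label set

alpha = torch.softmax(T_alpha(sent_repr), dim=-1)     # (n_sents, n_sets)
beta = torch.softmax(T_beta, dim=-1)                  # (n_sets,)
weights = alpha * beta                                # combine both levels
soft_labels = (weights * label_sets.t()).sum(dim=-1)  # (n_sents,) soft supervision
print(soft_labels)
```

The soft labels replace the hard monolingual labels when training the extractive model on English data, which is how the cross-lingual signal from the translated label sets enters training.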

In this paper, the researchers further studied, through visual analysis, how important information is distributed across different languages. They found that important information in English documents tends to be relatively concentrated, while in other languages it is scattered more evenly across the document. This is an important reason why the multilingual labels can improve model performance.

NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better


Paper link: https://www.microsoft.com/en-us/research/publication/noisytune-a-little-noise-can-help-you-finetune-pretrained-language-models-better/

Effectively finetuning pretrained language models (PLMs) is critical for their success in downstream tasks. Most existing methods directly finetune PLMs with task data (Fig. 6(a)). However, PLMs carry the risk of overfitting the pretraining tasks and data, which enlarges the gap with downstream tasks and data. It can be difficult for existing methods to overcome this gap and effectively adapt to downstream tasks, especially when labeled task data is very limited. In this work, the researchers proposed a simple yet effective solution to this problem, named NoisyTune, which adds a little noise to perturb the PLM before finetuning (Fig. 6(b)).

Figure 6: Schematic comparison between standard PLM finetuning and NoisyTune.

Inspired by the dueling bandits mechanism, the researchers believed that adding a little noise to PLMs could help them “explore” more of the feature space and thereby mitigate overfitting to the pretraining tasks and data. However, it is no simple task to add proper noise to a PLM. Instead of adding noise with the same distribution to all parameters, the researchers proposed adding matrix-wise uniform noise scaled according to the variance of each parameter matrix. A hyperparameter lambda controls the intensity of the noise, so that the diverse characteristics of different parameters are taken into account and constant parameters in the model are not perturbed.
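A minimal sketch of this kind of matrix-wise perturbation is shown below, assuming noise drawn uniformly from [-lambda/2, lambda/2] and scaled by the standard deviation of each parameter matrix; the exact formula and the lambda values used in the paper may differ. The randomly initialized BERT-style encoder is only a stand-in for a real PLM.

```python
# Sketch: perturb each parameter matrix with variance-scaled uniform noise
# before finetuning (NoisyTune-style, under the assumptions stated above).
import torch
from transformers import BertConfig, BertModel

lam = 0.15  # noise intensity hyperparameter (illustrative value)
model = BertModel(BertConfig(num_hidden_layers=2, hidden_size=128,
                             num_attention_heads=4, intermediate_size=256))

with torch.no_grad():
    for name, param in model.named_parameters():
        std = param.std()
        if torch.isnan(std) or std == 0:   # skip constant / single-element parameters
            continue
        noise = (torch.rand_like(param) - 0.5) * lam * std  # U(-lam/2, lam/2) * std
        param.add_(noise)
# The perturbed model is then finetuned on the downstream task as usual.
```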


Experiments were conducted on the GLUE English language understanding benchmark and the XTREME multilingual language understanding benchmark. The performance improvement brought by NoisyTune is usually larger on relatively small datasets.

The researchers also studied which kind of noise is more suitable for NoisyTune. The results show that adding matrix-wise noise is better than adding global noise with the same distribution to all PLM parameters, and uniform noise is a better choice than Gaussian noise.

Figure 7: Different noise types and perturbing methods.

Towards Making the Most of Cross-Lingual Transfer for Zero-Shot Neural Machine Translation


Paper link: https://www.microsoft.com/en-us/research/publication/towards-making-the-most-of-multilingual-pretraining-for-zero-shot-neural-machine-translation/

This paper demonstrates that multilingual pretraining and multilingual fine-tuning are both critical for facilitating cross-lingual transfer in zero-shot translation. The researchers therefore present SixT+, a strong many-to-English NMT model that supports 100 source languages but is trained with parallel data from only six source languages.

SixT+ initializes the decoder embedding and the full encoder with XLM-R large and then trains the encoder and decoder layers with a simple two-stage training strategy. SixT+ achieves impressive performance on many-to-English translation, significantly outperforming CRISS and m2m-100, two strong multilingual NMT systems, with average gains of 7.2 and 5.0 BLEU, respectively.
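The following sketch shows the general shape of such a two-stage schedule: the XLM-R-initialized components are frozen while the randomly initialized decoder layers are trained, and more components are unfrozen in the second stage. Exactly which modules are frozen at each stage here is an assumption for illustration; see Figure 8 for the schedule used in the paper.

```python
# Sketch: two-stage training with frozen pretrained components.
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, d_model=256, vocab=1000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.dec_embed = nn.Embedding(vocab, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

model = Seq2Seq()
# (In the real model, encoder and dec_embed would be loaded from XLM-R.)

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: freeze the pretrained parts, train only the new decoder layers.
set_trainable(model.encoder, False)
set_trainable(model.dec_embed, False)
set_trainable(model.decoder, True)
# ... train on the six-language parallel data ...

# Stage 2: unfreeze (part of) the pretrained components and continue training.
set_trainable(model.encoder, True)
set_trainable(model.dec_embed, True)
# ... continue training; the resulting model then transfers zero-shot to the
# other source languages covered by XLM-R ...
```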

Additionally, SixT+ offers a set of model parameters that can be further fine-tuned for other unsupervised tasks. The paper demonstrated that initializing with SixT+ outperforms state-of-the-art, explicitly designed unsupervised NMT models on Si->En and Ne->En by over 1.2 average BLEU. When applied to zero-shot cross-lingual abstractive summarization, it produces an average performance gain of 12.3 ROUGE-L over mBART-ft. The researchers conducted detailed analyses to understand the key ingredients of SixT+, including the multilinguality of the auxiliary parallel data, the positional disentangled encoder, and the cross-lingual transferability of its encoder.

Figure 8: The proposed two-stage training framework for building a cross-lingual NLG model with XLM-R. The blue (icy) blocks are initialized with XLM-R and frozen, while the red (fiery) blocks are initialized randomly or from the first stage.