
When does text prediction benefit from additional context? An exploration of contextual signals for chat and email messages


By Stojan Trajanovski, Chad Atalla, Kunho Kim, Vipul Agarwal, Milad Shokouhi, and Chris Quirk

Email and chat communication tools are increasingly important for completing daily professional and personal tasks. Given the recent pandemic and the shift to remote work, this usage has surged. The number of daily active users of Microsoft Teams, the largest business communication and chat platform, grew from 20 million pre-pandemic in 2019 to more than 115 million by October 2020 and 145 million by April 2021. Email, meanwhile, remains the crucial channel for formal communication and continues to grow in usage. Providing real-time suggestions for word or phrase auto-completion is known as text prediction. Highly accurate, low-latency text predictions make these communications more efficient. Text prediction services have been deployed across popular communication tools and platforms, such as Microsoft Outlook Text Predictions and Gmail Smart Compose [1].

Modern text prediction algorithms are based on large language models and generally rely on the prefix of a message (the characters typed up to the cursor position) to create predictions. We study the extent to which additional contextual signals improve text predictions for chat and email messages in two of the largest commercial communication platforms, Microsoft Teams and Outlook. We examine several signals accompanying the main message: composition time, subject, and previous messages (summarized below).

  • Composition time: A contextual signal that can add value to text prediction by enabling suggestions with relevant date-time words, such as “weekend” or “tonight”.
  • Subject: Message subjects often state the purpose or summarize the content of a message. In the email scenario, we use the subject directly as context; in the chat scenario, we use the chat window name as a proxy for the subject.
  • Previous email: Previous messages can provide valuable background information that influences the text of the message being composed. In the email case, we create pairs of messages and replies.
  • Previous chat messages: Contextualizing prior messages is more complex in the chat scenario, since chat conversations typically consist of many short messages sent in quick succession.

We combine these signals with the message body and encode them into a single “contextualized” string for the language model, using special tokens to separate the signals (Figure 1).


Figure 1. Context extraction and encoding.
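The exact encoding scheme and separator tokens are internal details not spelled out in this post; the Python sketch below illustrates the general idea, with `<TIME>`, `<SUBJECT>`, `<CONTEXT>`, and `<BODY>` as hypothetical placeholder tokens.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    body_prefix: str                      # characters typed up to the cursor position
    compose_time: Optional[str] = None    # e.g., "Friday 17:42"
    subject: Optional[str] = None         # email subject, or chat window name as proxy
    prior_context: Optional[str] = None   # aggregated previous message(s)

def encode_with_context(msg: Message) -> str:
    """Concatenate the available contextual signals with the message prefix,
    separated by special tokens, into one 'contextualized' model input."""
    parts = []
    if msg.compose_time:
        parts.append(f"<TIME> {msg.compose_time}")
    if msg.subject:
        parts.append(f"<SUBJECT> {msg.subject}")
    if msg.prior_context:
        parts.append(f"<CONTEXT> {msg.prior_context}")
    parts.append(f"<BODY> {msg.body_prefix}")
    return " ".join(parts)

# Example: an email reply composed on a Friday evening
print(encode_with_context(Message(
    body_prefix="Sounds good, let's meet on",
    compose_time="Friday 17:42",
    subject="Project sync",
    prior_context="Can we find time to discuss the roadmap?",
)))
```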

We segment chat histories by message blocks and time windows. A series of uninterrupted messages sent by one sender is considered a single message block. Messages sent within the past N minutes fall within a time window, which enforces recency as a proxy for relevance. We define three previous-message context aggregation modes for the chat scenario (visualized in Figure 2), mimicking prior email context:

  • Ignore-Blocks: chat messages from the current sender, in the past N minutes (e.g., 2, 5, 10 minutes), ignoring any message block boundaries.
  • Respect-Blocks: chat messages from the current sender, in the past N minutes, confined to the most recent message block.
  • Both-Senders: chat messages from both senders, in the past N minutes. When the sender turn changes, strings are separated by a space or a special token.

Figure 2. Aggregating a 5-minute prior chat window in the various context modes.
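A minimal sketch of the three aggregation modes, assuming a simple `ChatMessage` record; the field names, the `<TURN>` separator token, and the block-detection rule are illustrative simplifications of the description above, not the production implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChatMessage:
    sender: str
    text: str
    timestamp: float  # seconds since the epoch

def aggregate_context(history: List[ChatMessage], now: float, current_sender: str,
                      mode: str = "both-senders", window_minutes: int = 5,
                      turn_token: str = "<TURN>") -> str:
    """Aggregate prior chat messages from the past N minutes,
    following one of the three context modes."""
    cutoff = now - window_minutes * 60
    recent = [m for m in history if m.timestamp >= cutoff]  # oldest to newest

    if mode == "ignore-blocks":
        # All of the current sender's recent messages, across block boundaries.
        kept = [m.text for m in recent if m.sender == current_sender]
    elif mode == "respect-blocks":
        # Only the most recent uninterrupted run (message block) of the
        # current sender's messages within the window.
        blocks, run = [], []
        for m in recent:
            if m.sender == current_sender:
                run.append(m.text)
            elif run:
                blocks.append(run)
                run = []
        if run:
            blocks.append(run)
        kept = blocks[-1] if blocks else []
    elif mode == "both-senders":
        # Messages from both senders, marking each sender turn change
        # with a separator token (a space would also work).
        kept, prev = [], None
        for m in recent:
            if prev is not None and m.sender != prev:
                kept.append(turn_token)
            kept.append(m.text)
            prev = m.sender
    else:
        raise ValueError(f"unknown context mode: {mode}")
    return " ".join(kept)
```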

For example, the 2-minute Both-Senders mode and the 5-minute Ignore-Blocks mode aggregate a similar amount of context: 2.5 chat messages on average, with 56-59% of chat messages having at least one prior message as context. Given the message length statistics in Figure 3, we expect chat messages to be roughly 10x shorter than emails. In a statistical analysis of chat message lengths (Figure 3, blue box), we find a mean of 9.15 tokens and a median of 6 tokens per message; for email (Figure 3, green box), the mean is 94 tokens and the median is 53. We therefore limit chat histories to 20 messages, which at roughly 180 tokens (20 x ~9 tokens) is about the length of an average email message-reply pair (2 x 94 tokens).

“1 formal email ≈ 10 informal chat messages”


Figure 3. Box-plot statistics for message aggregation in Teams and Outlook.
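Length statistics like those above can be approximated with a simple whitespace tokenizer, as in the sketch below; note that the tokenizer behind the reported numbers is not specified in this post, so whitespace splitting is only a rough stand-in.

```python
import statistics

def token_stats(messages):
    """Mean and median token counts over a list of message strings,
    using whitespace splitting as a rough stand-in for a real tokenizer."""
    counts = [len(m.split()) for m in messages]
    return statistics.mean(counts), statistics.median(counts)

# On our data we would expect roughly (9.15, 6) for Teams chat
# messages and (94, 53) for Outlook emails.
```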

We process data with multiple privacy precautions, in accordance with the General Data Protection Regulation (GDPR), and apply “fair block-listing” of denigrative, offensive, controversial, sensitive, and stereotype-prone words and phrases; for a full discussion of these ethical considerations, see our NAACL paper [2].

Results. Previous-message contextualization leads to significant gains for chat messages in Microsoft Teams when an appropriate message aggregation strategy is used. With a 5-minute time window and messages from both senders, we see a 9.4% relative increase in match rate¹ and an 18.6% relative gain in estimated characters accepted. This 5-minute window of prior messages from both senders outperforms the corresponding 2- and 10-minute configurations. Chat messages are often short and can lack context about a train of thought; the right number of previous messages brings the semantics the model needs to make a correct prediction. By comparison, the benefits of subject and composition time as contextual signals are insignificant for chat messages.

In the email scenario, based on Microsoft Outlook, we find that composition time yields the largest boost, a 2% relative increase in match rate, while subject helps only in conjunction with time, and prior messages yield no improvement. We conclude that the differing characteristics of chat and email messages impede domain transfer: the best contextual text prediction models are custom trained for each scenario, using the most impactful subset of contextual signals. Future work includes exploring different encodings for contextual signals, such as hierarchical RNNs to better capture context, or more advanced architectures such as transformers, generative models, or GPT-3.
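For reference, a minimal sketch of the two evaluation metrics; match rate follows the footnote's definition, while the precise definition of estimated characters accepted is not given in this post, so the character-count interpretation below is an assumption.

```python
def prediction_metrics(suggestions):
    """suggestions: list of (text, matched) pairs, where `matched` is True
    when the suggestion matched what the user actually typed."""
    matched = [text for text, ok in suggestions if ok]
    # Match rate: matched suggestions over all generated suggestions.
    match_rate = len(matched) / len(suggestions) if suggestions else 0.0
    # Estimated characters accepted: characters the user saves by
    # accepting matched suggestions (our simplified interpretation).
    est_chars_accepted = sum(len(text) for text in matched)
    return match_rate, est_chars_accepted

# Example: 2 of 4 suggestions matched
print(prediction_metrics([("see you tonight", True), ("best regards", False),
                          ("this weekend", True), ("thank you", False)]))
```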

References

[1] M. X. Chen, B. N. Lee, G. Bansal, Y. Cao, S. Zhang, J. Lu, J. Tsay, Y. Wang, A. M. Dai, Z. Chen et al. (2019) Gmail Smart Compose: Real-time assisted writing, In Proc. of the 25th ACM SIGKDD Intl. Conf. on Knowledge Discovery & Data Mining, pp. 2287–2295.

[2] S. Trajanovski, C. Atalla, K. Kim, V. Agarwal, M. Shokouhi, and C. Quirk (2021) When does text prediction benefit from additional context? An exploration of contextual signals for chat and email messages, In Proc. of NAACL-HLT (Annual Conf. of the North American Chapter of the Association for Computational Linguistics – Industry track papers).


¹ The ratio of the number of matched suggestions to the total number of generated suggestions.