Multilingual E5 Text Embeddings: A Technical Report

Liang Wang; Nan Yang; Xiaolong Huang; Linjun Yang; Rangan Majumder; Furu Wei

MSR-TR-2024-45 | February 2024

Published by Microsoft

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a trade-off between inference efficiency and embedding quality. The training procedure follows the English E5 model recipe: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.
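As a rough illustration of what contrastive training on text pairs optimizes, the sketch below computes an InfoNCE-style loss with in-batch negatives in plain Python. The abstract does not specify the loss, so the choice of InfoNCE, the function names, the temperature value, and the toy vectors are all assumptions made for illustration, not the report's implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    queries[i] is assumed to pair with passages[i]; every other passage
    in the batch serves as a negative for that query. The temperature
    default is an illustrative assumption.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every passage.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
        # Negative log-probability of the matching passage.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy 2-dimensional "embeddings": each query is close to its own passage.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_passages = [[1.0, 0.1], [0.1, 1.0]]
aligned_loss = info_nce_loss(toy_queries, toy_passages)
shuffled_loss = info_nce_loss(toy_queries, toy_passages[::-1])
```

With these toy vectors, the correctly aligned batch yields a lower loss than the batch whose passages are reversed, which is the behavior a contrastive objective rewards.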

Publication downloads

UniLM – Unified Language Model Pre-training

October 1, 2019

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities.