Domain-specific language model pretraining for biomedical natural language processing
Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general-domain corpora, such as newswire and web text. Biomedical text is very different from general-domain text, yet biomedical NLP has been relatively underexplored. A prevailing assumption is that even domain-specific pretraining can benefit from starting with general-domain language models.
In this webinar, Microsoft researchers Hoifung Poon, Senior Director of Biomedical NLP, and Jianfeng Gao, Distinguished Scientist, will challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models.
You will begin by understanding how biomedical text differs from general-domain text and how biomedical NLP poses substantial challenges that are not present in mainstream NLP. You will also learn about the two paradigms for domain-specific language model pretraining and see how pretraining from scratch significantly outperforms mixed-domain pretraining on a wide range of biomedical NLP tasks. Finally, find out about BLURB, our comprehensive benchmark and leaderboard created specifically for biomedical NLP, and see how our biomedical language model, PubMedBERT, sets a new state of the art.
Together, you’ll explore:
- How biomedical NLP differs from mainstream NLP
- A shift in approach to pretraining language models for specialized domains
- BLURB: a comprehensive benchmark and leaderboard for biomedical NLP
- PubMedBERT: the state-of-the-art biomedical language model pretrained from scratch on biomedical text
Resource list:
- BioMed NLP Group (Group page)
- Hanover (Project page)
- Deep Learning (Group page)
- BLURB (GitHub)
- PubMedBERT (GitHub)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing (Paper)
- Hoifung Poon (Profile page)
- Jianfeng Gao (Profile page)
*This on-demand webinar features a previously recorded Q&A session and open captioning.
Explore more Microsoft Research webinars: https://aka.ms/msrwebinars
- Speakers: Hoifung Poon, Jianfeng Gao
- Affiliation: Microsoft Research
- Hoifung Poon, General Manager, Health Futures
- Jianfeng Gao, Distinguished Scientist & Vice President