
Microsoft Academic

Changes to Microsoft Academic Services (MAS) During COVID-19


Update: An article on the CORD-19 dataset is now available on arXiv.

Since we answered the call from the White House and teamed up with our partners to release CORD-19 and MAS resources a few weeks ago, the scientific output on COVID-19 has continued to grow at an amazing pace. In contrast to what we reported in our previous blog post, the COVID-19/SARS-CoV-2 specific publications now look like the following, based on the March 27 snapshot of the Microsoft Academic Graph (MAG):

[Figure: COVID-19 papers in MAG]

In the meantime, we continue to be in an “ultimate open science” era in which all publishers have dropped their paywalls and expedited peer review and publication, and some have even waived the article processing charges (APCs) on publications related to the pandemic. Most impressively, major publishers have granted text and data mining (TDM) rights on coronavirus-related publications and agreed to their distribution in CORD-19 (albeit temporarily). As of April 6, 2020, there are more than 47K articles in the CORD-19 dataset mirrored at Kaggle and MIT, among others. That is rapid growth compared with the outset, when only 13K of the 29K CORD-19 articles had full-text content.

These developments have necessitated a few new changes in addition to those described in our previous blog post. First, starting this week, we have doubled our data update frequency from every other week to once a week. This has become necessary given that research communities are publishing more than 3,500 articles a week on COVID-19 alone, as shown in the figure above. The faster pace of data updates can be seen in MAG, MAKES (including the public REST API), and the Microsoft Academic website.

Secondly, the figure above is currently not reproducible using the publication dates reported by publishers. To understand when content actually became available for the research community to consume, we have found it necessary to use the online dates rather than the publication dates publishers prefer. For instance, there are papers reported as published in January 2020 that contain references to “COVID-19”, a term that was not chosen by the World Health Organization (WHO) until February 11, 2020. On the other hand, some journals have scheduled many COVID-19 articles as far out as their September issue, even though those articles have already been cited by articles published in March 2020. Forward references like these should be rare but have become more common in recent months. They are a legacy of the publishing industry that could use an update in the online era. Accordingly, we will add an “online publication date” to every article, alongside the existing publication date reported by the publisher, as soon as the new property passes our quality control evaluations.
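To make the forward-reference problem concrete, here is a minimal sketch of how one might flag citations that point to a paper dated later than the paper citing it. The function name, the paper IDs, and the dates are all hypothetical and chosen only for illustration; this is not the MAG schema or our production pipeline.

```python
from datetime import date

def find_forward_references(dates, citations):
    """Return (citing, cited) pairs where the cited paper is dated after the citing one."""
    return [
        (citing, cited)
        for citing, cited in citations
        if citing in dates and cited in dates and dates[cited] > dates[citing]
    ]

# Hypothetical records: paper "A" cites paper "B".
# "B" went online in March but is assigned to a September issue.
publisher_dates = {"A": date(2020, 3, 15), "B": date(2020, 9, 1)}
online_dates = {"A": date(2020, 3, 15), "B": date(2020, 3, 2)}
citations = [("A", "B")]

print(find_forward_references(publisher_dates, citations))  # [('A', 'B')]
print(find_forward_references(online_dates, citations))     # []
```

In this toy example the forward reference appears only under the publisher-reported dates and disappears when the online dates are used, which is the effect an online publication date is meant to capture.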

Thirdly, as proud as we are of the concept recognition capability, we have to recognize that the technology is not yet perfect. This sample code on our GitHub page illustrates a way to conduct semantic search and keyword matching against MAG, and for the past several snapshots, concept-only retrieval has consistently covered only about 85% of the results. Additionally, MAG has yet to recognize all chemical compounds and pharmaceutical products, such as many drugs that were designed as treatments for other diseases but are being considered for COVID-19 clinical trials. To compensate for the 15% shortfall in semantic search and the missing concepts, we have quickly added rudimentary keyword search capabilities to the Microsoft Academic website. Effective immediately, website users can search for phrases in quotes (e.g., “novel coronavirus” “china”, which is very useful for finding COVID-19 papers written before official terminology was widely adopted) and expect such queries to retrieve articles with literal matches in the title or the abstract. Harmonizing the semantic and keyword search experience is not trivial, and we will publish a separate blog post on this subject in the coming week.
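As a rough illustration of the quoted-phrase behavior described above, the sketch below extracts the quoted phrases from a query and checks for literal matches in a title or abstract. It rests on two assumptions not stated in this post: that matching is case-insensitive, and that every quoted phrase must match. The function names and the sample record are made up, and this is not the website's actual implementation.

```python
import re

# Matches a phrase wrapped in straight or curly double quotes.
QUOTED = re.compile(r'[“”"]([^“”"]+)[“”"]')

def quoted_phrases(query):
    """Extract the quoted phrases from a query, lowercased."""
    return [p.strip().lower() for p in QUOTED.findall(query) if p.strip()]

def literal_match(query, title, abstract):
    """True if every quoted phrase occurs verbatim in the title or the abstract.

    Assumption: case-insensitive matching with AND semantics across phrases.
    """
    fields = (title.lower(), abstract.lower())
    phrases = quoted_phrases(query)
    return bool(phrases) and all(any(p in f for f in fields) for p in phrases)

# Made-up record for illustration only.
paper = {
    "title": "An illustrative title about a novel coronavirus outbreak",
    "abstract": "A made-up abstract mentioning cases reported in China.",
}

print(literal_match('"novel coronavirus" "china"', paper["title"], paper["abstract"]))  # True
print(literal_match('"covid-19"', paper["title"], paper["abstract"]))                   # False
```

The second query shows why literal matching helps with early papers: a record that never uses the term "COVID-19" is still retrievable through the phrases it does contain.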

Finally, as a requirement for the CORD-19 dataset, we have adopted the WHO’s paper collection as a credible source. In addition to research articles written in English, this collection includes news, commentaries and, most importantly, non-English publications that would otherwise have been excluded from MAG (see our recent article on this subject). A sizeable number of the non-English articles in the WHO’s collection are from Chinese journals that provide high-quality English translations of the title and the abstract. We have started working with our colleagues in China to develop scalable means of including these journals and their publications in MAG.

As for news and other non-scholarly articles, they are slipping through the principal component analysis because of their strong connections to the two fields of study COVID-19 and SARS-CoV-2. Based on this observation, starting this week we are excluding the relations to fields of study from the principal component analysis. A preliminary analysis indicates that this algorithmic change can filter out more than half a million articles previously included in MAG, mostly from university websites and not published in peer-reviewed venues. We do not expect the removal of articles of this type to have a dramatic impact on analytics based on MAG, but in the coming weeks we will continue to monitor the effect of this tweak and run additional experiments.

Suffice it to say this pandemic has profoundly impacted our lives in many ways. We hope this blog finds you safe and healthy, and happy researching!