Microsoft Academic increases power of semantic search by adding more fields of study
In the video below, our colleague Darrin tells a personal story of unleashing the power of semantic search:
Let’s look at a few more examples of powerful search experiences that are unique to Microsoft Academic.
Imagine you met a scholar at a conference but can’t quite remember his name, only that it was short and started with K. You do remember, however, that his area of work was technology acceptance and that he worked at a university in Colorado. So you type whatever you can remember into Microsoft Academic:
As you type, you notice that University of Colorado Boulder appears in the query suggestions. This means it is one of the top-ranked entities containing the letters “col” in the “technology acceptance model” area. You click that query suggestion and see the following search results:
The search engine results page shows you all the papers about “technology acceptance model” authored by individuals affiliated with the University of Colorado Boulder. You recognize the name Kai Larsen among the filters on the left side and realize this is the name of the person you were looking for.
Or, assume you are a graduate student and, in one of your HCI courses, you heard the phrase “seven stages of action.” As you type the words into Microsoft Academic, you notice that the phrase is recognized as a research topic, because the beaker icon appears next to it in the query suggestion drop-down. You accept the suggestion by clicking it and find papers that refer specifically to Don Norman’s seven stages of action.
To learn more about the topic, you click the topic’s card on the right-hand rail. On the topic detail page, you see a list of related topics that help you understand other concepts relevant to Norman’s seven stages of action.
So, how does Microsoft Academic make it possible to discover knowledge in such a powerful way? Three aspects contribute to the power of our semantic search:
1. Author entity disambiguation, addressed in a previous post;
2. The recent increase in the number of fields of study in our graph; and
3. The accuracy of tagging fields of study onto papers (items 2 and 3 are explained below).
Field of study increase
During the past few weeks, we have increased the number of topics, or fields of study (FoS), in our graph from about 50K to almost 200K. We leveraged Wikipedia content and used graph link analysis to expand the coverage of FoS. We started with a few thousand high-quality seed FoS and iterated for a few rounds between graph link analysis and entity filtering to identify more FoS. Then we scanned the metadata of our 170 million academic publications, such as titles, keywords, and abstracts, to confirm the existence of the new FoS.
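The post does not spell out the exact algorithm, so the following is only a minimal sketch of how a seed-and-expand loop of this kind might look. It assumes a simple stand-in for link analysis (a candidate is accepted when enough already-accepted topics link to it) and a stand-in entity filter (a hypothetical predicate that rejects non-topic pages such as people). The function name `expand_fields`, the threshold `min_links`, and the toy link data are all illustrative assumptions, not the production pipeline.

```python
def expand_fields(seeds, links, is_valid_entity, min_links=2, rounds=3):
    """Grow a set of fields of study from seed topics.

    links: dict mapping a page title to the set of page titles it links to.
    is_valid_entity: predicate standing in for the entity-filtering step.
    """
    accepted = set(seeds)
    for _ in range(rounds):
        # Link analysis stand-in: count how many accepted topics link to
        # each page that has not been accepted yet.
        votes = {}
        for page in accepted:
            for target in links.get(page, ()):
                if target not in accepted:
                    votes[target] = votes.get(target, 0) + 1
        # Entity filtering stand-in: keep well-linked candidates that pass
        # the predicate.
        new_fields = {
            page for page, count in votes.items()
            if count >= min_links and is_valid_entity(page)
        }
        if not new_fields:
            break  # nothing new was found, so further rounds would not help
        accepted |= new_fields
    return accepted


# Toy link structure, invented for illustration only.
toy_links = {
    "Machine learning": {"Neural network", "Support vector machine", "Alan Turing"},
    "Artificial intelligence": {"Neural network", "Support vector machine", "Alan Turing"},
    "Neural network": {"Backpropagation"},
    "Support vector machine": {"Backpropagation", "Kernel method"},
}
people = {"Alan Turing"}  # hypothetical list of pages that are not topics

expanded = expand_fields(
    seeds={"Machine learning", "Artificial intelligence"},
    links=toy_links,
    is_valid_entity=lambda page: page not in people,
)
print(sorted(expanded))
```

In this toy run, well-linked topic pages are added over successive rounds, while the person page is rejected by the filter and the weakly linked page never reaches the threshold.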
The table below shows a comparison of before and after numbers for each one of our 19 top-level topics.
| Top level | Before | After | Difference |
| --- | --- | --- | --- |
| Biology | 4173 | 70019 | 1578% |
| Medicine | 1675 | 27022 | 1513% |
| Geology | 2120 | 15117 | 613% |
| Chemistry | 3522 | 24333 | 591% |
| Psychology | 2430 | 13291 | 447% |
| Philosophy | 1665 | 9066 | 445% |
| Sociology | 2047 | 9623 | 370% |
| Engineering | 2689 | 12100 | 350% |
| Economics | 2347 | 10439 | 345% |
| Computer Science | 5180 | 21328 | 312% |
| Art | 422 | 1703 | 304% |
| Physics | 6618 | 24075 | 264% |
| History | 656 | 2245 | 242% |
| Political Science | 250 | 667 | 167% |
| Materials Science | 945 | 2404 | 154% |
| Mathematics | 8022 | 19540 | 144% |
| Geography | 502 | 929 | 85% |
| Business | 536 | 917 | 71% |
| Environmental Science | 178 | 262 | 47% |
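The Difference column appears to be the relative increase from the Before count to the After count, rounded to a whole percentage. As a quick check, here is a small sketch that recomputes it for a few rows of the table above; the helper name `percent_increase` is ours, not part of any Microsoft Academic API.

```python
# Before/after counts copied from four rows of the table above.
counts = {
    "Biology": (4173, 70019),
    "Computer Science": (5180, 21328),
    "Mathematics": (8022, 19540),
    "Environmental Science": (178, 262),
}


def percent_increase(before: int, after: int) -> int:
    """Relative growth from `before` to `after`, as a whole-number percentage."""
    return round((after - before) / before * 100)


for topic, (before, after) in counts.items():
    print(f"{topic}: {percent_increase(before, after)}%")
```

Read this way, a value of 1578% for Biology means the number of biology fields of study grew to nearly 17 times its previous size.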
Tagging fields of study on papers
Once the fields of study were extracted from papers, the next step was to stamp them appropriately onto the 170+ million papers in the Microsoft Academic Graph, the largest knowledge graph of scholarly publications in existence.
The tagging was done by machine, with minimal human intervention: the system applied the almost 200K fields of study to the papers in our graph, tagging each paper with the topics it judged appropriate. We began the tagging process by considering the metadata associated with each publication. However, publication metadata is neither complete nor accurate. Common pitfalls include, but are not limited to:
- Most papers about a topic like “artificial intelligence” do not actually mention these words explicitly (incompleteness);
- A large number of raw keywords from various data sources are noisy and irrelevant to the paper (inaccuracy; e.g., some websites assign the same set of keywords to every paper they publish);
- The same word refers to different concepts in different disciplines (ambiguity; e.g., “entropy”).
We applied several state-of-the-art natural language processing techniques to tackle these challenges. For example, we extended convolutional neural networks for short text classification and made them scalable to our 140M English papers, so that high-level disciplines such as computer science, mathematics, and artificial intelligence would be properly tagged. We also pre-trained word embedding vectors on text from more than 80M abstracts. Used together with bag-of-words for text similarity calculation, these vectors helped eliminate noisy tags effectively.
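To make the idea of combining word embeddings with bag-of-words similarity more concrete, here is a minimal sketch. It is not our production code: the two-dimensional vectors in `TOY_EMBEDDINGS` are made up for illustration (real embeddings are pre-trained and have far more dimensions), and the equal weighting in `tag_score` is an assumption, since this post does not describe how the two signals are combined. The toy words loosely echo the microblogging and entropy examples in this post.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bow_similarity(text_a, text_b):
    """Bag-of-words similarity: cosine over raw word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    vocab = sorted(set(counts_a) | set(counts_b))
    return cosine([counts_a[w] for w in vocab], [counts_b[w] for w in vocab])


def mean_embedding(text, embeddings):
    """Average the vectors of the words we have embeddings for, or None."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def embedding_similarity(text_a, text_b, embeddings):
    """Cosine similarity between the mean word vectors of two texts."""
    mean_a = mean_embedding(text_a, embeddings)
    mean_b = mean_embedding(text_b, embeddings)
    if mean_a is None or mean_b is None:
        return 0.0
    return cosine(mean_a, mean_b)


def tag_score(paper_text, tag, embeddings, weight=0.5):
    """Blend both signals; the 50/50 weighting is an illustrative assumption."""
    return (weight * bow_similarity(paper_text, tag)
            + (1 - weight) * embedding_similarity(paper_text, tag, embeddings))


# Hand-made toy vectors, NOT real pre-trained embeddings.
TOY_EMBEDDINGS = {
    "twitter": [0.9, 0.1],
    "tweets": [0.85, 0.15],
    "microblogging": [0.8, 0.2],
    "entropy": [0.1, 0.9],
    "thermodynamics": [0.05, 0.95],
}

paper = "sentiment of tweets on twitter"
relevant = tag_score(paper, "microblogging", TOY_EMBEDDINGS)
noisy = tag_score(paper, "thermodynamics", TOY_EMBEDDINGS)
print(f"microblogging: {relevant:.3f}, thermodynamics: {noisy:.3f}")
```

The paper text shares no words with either tag, so the bag-of-words signal alone cannot separate them. The embedding signal ranks the related tag well above the unrelated one, which is the kind of evidence that could be used to drop a noisy keyword.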
The results exhibit high accuracy, according to our observations. Take, for example, the paper below, which, according to its abstract, “re-examines the concept of ‘meme’ in the context of digital culture.” Our machines have appropriately tagged it with fields of study that include “user-generated content,” and even “Internet meme.”
As a result of tagging papers with so many fields of study, you can now explore and locate scholarship much more easily, in ways that are unique to Microsoft Academic.
For example, “microblogging” is now a field of study, which has been stamped onto 6977 publications as of the time of this writing.
You can explore all publications tagged with “microblogging” by clicking the “See all publications” link on the topic’s detail page (shown above) and then sorting them and applying filters. Because Microsoft Academic uses semantic search, when you explore papers tagged “microblogging,” you will find papers about this topic that might not even include the word “microblogging” but refer, for example, to Twitter.
Researchers in biology and medicine will notice that the fields of study in that domain area can be as specific as the names of various genes, as illustrated in the screenshot below:
We hope that our recent efforts to increase the number of fields of study and tag them onto papers help you explore and discover knowledge in more powerful ways than ever before.
How do you unleash the power of semantic search? As always, we would like to hear from you, either through the feedback link at the bottom right of the website or on Twitter. You can also find our project home page with this blog on the Microsoft Research site at aka.ms/msracad.
Happy researching!