In a recent collaboration, Microsoft and China’s Tsinghua University released an academic graph named Open Academic Graph (OAG). This billion-scale academic graph integrates the current Microsoft Academic Graph (MAG) and Tsinghua’s AMiner academic graph. Specifically, it contains the metadata of 155 million academic papers from AMiner and 166 million papers from MAG. By linking the metadata of the two, it generates nearly 65 million matching relationships between the two academic graphs [1].
Constructing the billion-scale OAG is challenging because academic data is distributed heterogeneously across the different academic graphs, because names are subject to homonyms and synonyms, and because data matching must be accurate. Some examples:
- Heterogeneous data. Because the data comes from different sources, the same entity can be recorded in different formats. For example, an author's name may appear as "Quoc Le" or as "Le, Quoc", and a journal or conference may be listed under either its full name or an abbreviation.
- Same-name disambiguation. The same name can represent multiple entities. For example, in China, one common name might be shared by more than 200,000 people. Similarly, one topic, such as “data,” may correspond to multiple articles.
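The heterogeneous-format problem for author names can be illustrated with a small normalization routine. This is a minimal sketch, not the method used to build OAG: the function name `normalize_author_name` and its rules (strip accents, reorder "Family, Given", lowercase, collapse whitespace) are assumptions made for illustration.

```python
import unicodedata


def normalize_author_name(raw: str) -> str:
    """Return a lowercase 'given family' form of an author name.

    Hypothetical helper for illustration; the rules are assumptions,
    not the normalization used by OAG.
    """
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))

    # Reorder "Family, Given" into "Given Family".
    if "," in text:
        family, _, given = text.partition(",")
        text = f"{given} {family}"

    # Lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


print(normalize_author_name("Quoc le"))
print(normalize_author_name("Le, Quoc"))
```

Both calls produce the same string, so the two spellings can be compared directly. Normalization of this kind only addresses formatting differences; it does nothing for the same-name problem, where identical strings refer to different people.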
In addition, efficient computing is key to integrating data at the billion scale. AMiner contains 155 million published papers, while MAG has public data on 166 million papers. Matching two graphs by comparing every record in one against every record in the other generally has O(N²) complexity, which requires a great deal of computation. We designed a compromise approach that uses a hashing algorithm to improve efficiency. This approach completed matches for approximately 300 million papers automatically, while still ensuring high matching accuracy.
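The idea of replacing pairwise comparison with hashing can be sketched as follows. The article does not describe the actual hashing scheme, so everything here is an assumption for illustration: the functions `title_key` and `match_papers`, the choice of a normalized title as the hash input, and the use of SHA-256. Each paper from one graph is placed in a hash index, and each paper from the other graph is looked up in that index, so the work grows roughly linearly with the number of papers rather than with the number of pairs.

```python
import hashlib
import re
from collections import defaultdict


def title_key(title: str) -> str:
    """Hash a normalized title (hypothetical key choice, for illustration)."""
    # Lowercase and replace every run of non-alphanumerics with one space.
    normalized = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def match_papers(graph_a, graph_b):
    """Match (paper_id, title) records from two graphs by hashed title.

    Returns a list of (id_from_a, id_from_b) pairs.
    """
    # Build the index over graph_a once: one hash per paper.
    index = defaultdict(list)
    for paper_id, title in graph_a:
        index[title_key(title)].append(paper_id)

    # Probe the index with each paper from graph_b: one lookup per paper.
    matches = []
    for paper_id, title in graph_b:
        for candidate in index.get(title_key(title), []):
            matches.append((candidate, paper_id))
    return matches


aminer = [("a1", "Network Embedding as Matrix Factorization"),
          ("a2", "ArnetMiner: Extraction and Mining of Academic Social Networks")]
mag = [("m1", "network embedding as matrix factorization."),
       ("m2", "An unrelated title")]
print(match_papers(aminer, mag))
```

An exact-key lookup like this misses records whose titles differ by more than case and punctuation, which is one reason a real pipeline would treat hash hits as candidates and verify them with further metadata such as authors, venue, and year.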
OAG is an important project of the Open Academic Society (OAS), a consortium of 20 global institutions (including Microsoft, Tsinghua, the Allen Institute for Artificial Intelligence, the University of Arizona, the University of Washington, the University of California, Los Angeles, and the Australian National University) formed to promote the open sharing of academic data and strengthen academic exchanges and cooperation. The OAG aims to integrate the global atlas of academic knowledge, publicly share academic atlas data, and provide relevant academic search and mining services. Specifically, OAS activities include:
- The integration of rich academic knowledge data. At present, the core data of OAG is from MAG and AMiner. The next step will be to integrate additional academic data, including the semantic data of different types of entities such as authors and papers. Data integration and data mining algorithms will link more entities to more accurate and richer data, including metadata, concept networks, research field, full text and author biographical information.
- Data sharing. By sharing different academic knowledge maps and their links, we hope to benefit academic research in the fields of knowledge atlases, scholar cooperative relationships, and academic topic mining.
- Service sharing. We want to design more intelligent academic atlas connectivity systems and provide relevant services (such as APIs) to encourage more people to use the services and join open academic communities.
In another collaboration, this time between Microsoft Academic, Tsinghua and the Documentation and Information Center of the Chinese Academy of Sciences, more than 1000 students in 400 teams participated in the “Open Academic Precision Portrait Competition.” Students from Peking University, University of Science and Technology of China, and Harbin Institute of Technology took home the top prizes.
One last collaboration of note is a contract between Professor Jie Tang (of Tsinghua University and the founder of AMiner) and Microsoft, under which AMiner [2] will be deployed on Azure. This, together with OAG, will further enlarge the impact of Azure on academic research.
The future of the collaboration between Microsoft and Tsinghua will include integrating different types of entities (such as authors and conferences) in a large-scale heterogeneous academic atlas, publishing more academic atlas connectivity data, and designing more intelligent academic atlas connectivity systems.
References:
[1] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In WSDM 2018, February 5–9, 2018, Marina Del Rey, CA, USA. ACM. (https://dl.acm.org/citation.cfm?doid=3159652.3159706)
[2] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008). pp. 990-998.