Over the past few years, deep representation learning has revolutionized development across various domains, including natural language processing (NLP), computer vision, and speech. For example, in the NLP domain, representation learning aims to learn contextual embeddings for tokens/words such that “words that occur in the same contexts tend to have similar meanings,” the distributional hypothesis first proposed by Harris in 1954. The representation learning idea has also been extended to networks, in which vertices that share the same structural contexts tend to be similar.
Existing representation learning techniques use only one embedding vector for each token/node, even though a token or node may carry different meanings in different contexts. This fundamental issue creates the need for more complicated models, such as ELMo and Transformers, that recapture the contextual information for each context, because a single vector is not enough to capture the contextual differences in either natural language or network structures. The issue gets worse when the network structures are organized heterogeneously, as is the nature of the Microsoft Academic Graph (MAG), in which the structural contexts are naturally diverse given the different types of entities and the relationships between them.
For additional context, it’s important to review how representation learning has shaped network mining and to demonstrate why one embedding vector is not enough to model different structural contexts in MAG. The traditional paradigm of mining and learning with networks usually begins with discovering the networks’ structural properties. With these structural properties extracted as features, machine learning algorithms can be applied to a variety of applications. Often, however, characterizing these features requires domain knowledge and expensive computation. The emergence of representation learning on networks offers a new perspective on this issue by translating discrete, structural symbols into continuous representations, such as low-dimensional vectors, that computers can “understand” and process algebraically.
The Microsoft Academic Graph (MAG) is a prime example of a network that can benefit from these recent advances in network representation learning. To illustrate, imagine two scholars who both work extensively on machine learning. One publishes all of their papers at the ICML conference, and the other publishes exclusively at the NeurIPS conference. Intuitively, these two scholars should be considered very similar given the strong similarity between ICML and NeurIPS. In the discrete space, however, they have never published in the same venue, so their similarity is zero, which is quite counterintuitive. This issue can be addressed by computing the similarity between their representations in the latent continuous space.
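To make the idea concrete, here is a minimal sketch with made-up numbers: the two scholars share nothing in the discrete venue space, yet their (hypothetical) learned embeddings can still be nearly parallel:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical low-dimensional embeddings for the two scholars. They have
# zero overlap in the discrete venue space, but their learned vectors point
# in almost the same direction.
scholar_icml = np.array([0.82, 0.11, 0.40, 0.05])
scholar_neurips = np.array([0.79, 0.15, 0.43, 0.02])

print(cosine_similarity(scholar_icml, scholar_neurips))  # ≈ 0.998
```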
Learning representations for MAG is more complex because it is a heterogeneous network consisting of different types of entities (publications, authors, venues, affiliations, fields of study) with various types of relationships between them (the publication relation between papers and authors, the citation relation between papers, and so on). The heterogeneous network of MAG is illustrated on the left of the figure below, and its five types of meta relations are introduced on the right:
The premise of network representation learning is to map network structures into a latent continuous space such that the structural relations between entities can be embedded. In heterogeneous networks, there exist various structural relations corresponding to different semantic similarities. For example, the two scholars mentioned earlier are similar to each other in the sense of their publication venues. Their similarity can also be measured through other senses, such as scientific collaborations and research topics, and through combinations of all of these senses, since MAG is strongly connected.
The core question we must answer here is how to define and encode these different senses of similarity in MAG. To address this, we produce multi-sense network similarities for MAG, each of which corresponds to one semantic sense in the academic domain. The general idea is to project the heterogeneous structure of MAG into homogeneous structures according to different semantic senses and to learn entity representations for each of them.
We are happy to announce that users can now access these multi-sense MAG network embeddings and similarity computation functions with the Network Similarity Package (NSP), an optional utility available as part of the larger MAG package. Note that the NSP is not included in the basic MAG distribution and must be specifically requested when signing up to receive MAG.
The senses of entity embeddings that are currently available in NSP include:
| Entity type | Sense 1 | Sense 2 | Sense 3 |
| --- | --- | --- | --- |
| affiliation | copaper | covenue | metapath |
| venue | coauthor | cofos | metapath |
| field of study | copaper | covenue | metapath |
| author | copaper | | |
The descriptions for each sense:
| Entity type | Sense | Description |
| --- | --- | --- |
| affiliation | copaper | Two affiliations are similar if they are closely connected in the weighted affiliation collaboration graph. |
| affiliation | covenue | Two affiliations are similar if they publish in similar venues (journals and conferences). |
| affiliation | metapath | Two affiliations are similar if they co-occur with common affiliations, venues, and fields of study. |
| venue | coauthor | Two venues are similar if they publish papers with common authors. |
| venue | cofos | Two venues are similar if they publish papers with similar fields of study. |
| venue | metapath | Two venues are similar if they co-occur with common affiliations, venues, and fields of study. |
| field of study | copaper | Two fields of study are similar if they appear in the same paper. |
| field of study | covenue | Two fields of study are similar if they have papers from similar venues. |
| field of study | metapath | Two fields of study are similar if they co-occur with common affiliations, venues, and fields of study. |
| author | copaper | Two authors are similar if they are closely connected in the weighted author collaboration graph. |
Using the journal “Nature” as an example, the top five most similar venues under the three senses are quite different:
The NSP supports both U-SQL and PySpark. Using PySpark as an example, creating an NSP instance and loading network embeddings is as easy as:
```python
# Placeholders: your Azure storage container, account name, access key, and
# the path to the embedding resource for the desired sense.
ns = NetworkSimilarity(container=MagContainer, account=AzureStorageAccount, key=AzureStorageAccessKey, resource=ResourcePath)
```
Note that the “resource” parameter specifies the desired sense.
To retrieve the raw similarity score between two entities (i.e., EntityId1 and EntityId2) after initializing an NSP instance for a specific sense, use the “getSimilarity()” method:
```python
score = ns.getSimilarity(EntityId1, EntityId2)
```
If you would like to get the top k most similar items to one particular entity (i.e., EntityId1) under this sense, call the “getTopEntities()” method:
```python
topEntities = ns.getTopEntities(EntityId1)
```
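Putting the pieces together, here is a hypothetical end-to-end sketch. The resource path, the venue EntityIds, and the shape of the returned results are all illustrative assumptions; the NSP samples linked below document the exact values and schemas.

```python
# Initialize an NSP instance for the venue "cofos" sense
# (the resource path below is illustrative).
ns = NetworkSimilarity(container=MagContainer, account=AzureStorageAccount,
                       key=AzureStorageAccessKey,
                       resource="ns/VenueCofosEmbedding.tsv")

venueId1 = 123   # hypothetical EntityId of a venue, e.g., Nature
venueId2 = 456   # hypothetical EntityId of another venue, e.g., Science

# Raw similarity between the two venues under the "cofos" sense.
print(ns.getSimilarity(venueId1, venueId2))

# Entities most similar to venueId1 under this sense; we assume each
# result carries an entity ID and a similarity score.
for entity in ns.getTopEntities(venueId1):
    print(entity)
```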
The “metapath” sense combines all the different types of senses together and has been powering the related-journals function on Microsoft Academic for some time now. For example, on the entity detail page (EDP) for the journal “Nature”:
For more detailed instructions on using NSP please see the following two samples:
Introducing Network Representation Learning Techniques
Given MAG as input, we first define the semantic senses of network similarities based on how we project the heterogeneous structures into corresponding homogeneous network structures. For example, for the “copaper” sense for affiliations, we construct the affiliation collaboration network from MAG, in which two affiliations are connected with each other if both appear in the same paper. Given the projected homogeneous affiliation collaboration network, we leverage the NetSMF algorithm (Qiu et al., WWW 2019) to learn the co-paper sense embeddings for affiliation entities. We chose NetSMF as our network representation learning algorithm because of its ability to learn representations for billion-scale homogeneous networks on a modern single-node computer. The network embeddings for the “copaper”, “covenue”, “coauthor”, and “cofos” senses were all trained using this framework.
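As an illustration, a minimal PySpark sketch of this projection might look like the following. It assumes an active Spark session with the MAG PaperAuthorAffiliations table registered; the column names follow the MAG schema, but treat the details as a sketch rather than our production pipeline.

```python
from pyspark.sql import functions as F

# Distinct (paper, affiliation) pairs from the MAG PaperAuthorAffiliations
# table, dropping rows with no affiliation.
paa = (spark.table("PaperAuthorAffiliations")
            .select("PaperId", "AffiliationId")
            .dropna()
            .dropDuplicates())

# Two affiliations are linked whenever they appear on the same paper; the
# edge weight is the number of papers they share.
edges = (paa.alias("a")
            .join(paa.alias("b"), on="PaperId")
            .where(F.col("a.AffiliationId") < F.col("b.AffiliationId"))
            .groupBy(F.col("a.AffiliationId").alias("Source"),
                     F.col("b.AffiliationId").alias("Target"))
            .agg(F.count("*").alias("Weight")))

# "edges" is the weighted, homogeneous collaboration network that is then
# fed to NetSMF to learn the co-paper embeddings.
```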
For the “metapath” sense, instead of projecting MAG into homogeneous networks, we fully utilize its heterogeneous structure, using meta-path-based random walks to convert the non-Euclidean structures into entity sequences. We then apply the metapath2vec algorithm to encode the semantic relations underlying these structures into latent continuous embeddings. This sense aims to capture the co-occurrence of all types of entities based on the heterogeneous structure of the MAG network.
For example, the similarity between the “Nature” and “Science” journals encodes how frequently they co-occur with other venues, fields of study, affiliations, and authors.
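To give a flavor of how such walks work, here is a toy sketch of a meta-path-guided random walk; the graph encoding, the meta path, and the function itself are illustrative, not the production implementation:

```python
import random

def metapath_walk(graph, start, meta_path, walk_length):
    """One meta-path-guided random walk over a heterogeneous graph.

    graph: dict mapping (entity_type, entity_id) to a dict of
           {neighbor_type: [neighbor_ids]}.
    start: (entity_type, entity_id) tuple whose type matches meta_path[0].
    meta_path: cyclic list of entity types, e.g.
               ["venue", "paper", "author", "paper"].
    """
    walk = [start]
    node_type, node_id = start
    for step in range(walk_length):
        # The meta path dictates which entity type to visit next.
        next_type = meta_path[(step + 1) % len(meta_path)]
        candidates = graph.get((node_type, node_id), {}).get(next_type, [])
        if not candidates:
            break  # dead end: no neighbor of the required type
        node_type, node_id = next_type, random.choice(candidates)
        walk.append((node_type, node_id))
    return walk
```

The resulting entity sequences play the role of sentences: metapath2vec trains skip-gram-style embeddings over them so that entities sharing heterogeneous contexts end up close together in the latent space.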
The current version of multi-sense network similarity is based solely on the heterogeneous network structure of MAG and reflects the structural semantics underlying the network’s organization. Note, however, that it does not currently cover language semantics, another important and unique property of MAG. Language semantics are critical, as scientific innovations are carried by the text of each publication. To understand how language semantics play a role in MAG, please refer to the Language Similarity Package.
Network representation learning is a new frontier, and one that our team is committed to exploring. We plan to continue making improvements both to MAG and, more broadly, to representation learning research. Our ongoing effort is to combine structural and language semantics in the entity representations by unleashing the power of graph neural networks. To accomplish this, we are developing a self-attention-based heterogeneous graph neural network framework, the Heterogeneous Graph Transformer (HGT), to learn unified entity representations with both structural and text information encoded in their embeddings. The technical details of the HGT model will be presented in an upcoming WWW 2020 publication titled “Heterogeneous Graph Transformer.” Stay tuned!
If you would like to receive the Network Similarity Package with your MAG distribution, please refer to the Network Similarity overview page.
Happy researching!