At the ACM Conference on Web Search and Data Mining 2018 (opens in new tab), my team will introduce research that, for the first time, provides a theoretical explanation of popular methods used to automatically map the structure and characteristics of networks, known as network embedding. We then use this theoretical explanation to present a new network embedding method that performs as well as or better than existing methods.
Networks are fundamental ways of representing knowledge and relating to the world. As humans, we think through association; we naturally draw links between concepts or entities. People are linked through family relationships or through collaborations. Diseases are linked to treatments. Works of art are linked to their creators. Wikipedia represents the interconnectedness of human knowledge.
Microsoft research podcast
As humans, we have great capacity to understand knowledge, notice similarities and make connections and inferences. However, our capacity is limited by the amount of information we can process. Computers, on the other hand, can process tremendous amounts of information, but their ability to understand knowledge and make inferences is limited. That’s because humans think in symbols and computers think in math. Recent advances in computer science such as word and network embedding address this problem by using methods that translate discrete symbols into continuous representations such as high-dimensional vectors that computers can process algebraically.
Network embedding methods enable us to map the structure and characteristics of a network algorithmically, without human intervention. Mapping can be used to predict missing links, to identify missing entities and to make inferences that inform, for example, recommender systems on e-commerce websites.
Existing methods of network embedding work empirically, but so far, scientists have considered them a black box: We lacked understanding of how or why these methods work. Our recent research (opens in new tab)—a collaboration between Microsoft Research and Tsinghua University—contributes to the theoretical understanding of four popular network-embedding methods, including DeepWalk (opens in new tab), node2vec (opens in new tab), as well as LINE (opens in new tab) and PTE (opens in new tab), which were developed at Microsoft Research Asia.
Our research bridges the gap between the empirical performance and theoretical mechanism of network embedding. We prove that four popular network-embedding models are —in theory— implicitly performing matrix factorizations (opens in new tab) over a network. More excitingly, the theoretical analysis also provides the closed forms of those factorized matrices, most of which are related to the network’s Laplacian matrix (opens in new tab). We show that these models share a common practice: they are unified into a general matrix-factorization framework. As such, this work lays the theoretical foundation for network embedding, yielding a better understanding of latent representation learning over networks.
Theoretical understanding is important because it enables us to go beyond empirical attempts. Once we understand why and how something works, we are better able to predict how it will work in the future and can use it more efficiently. Based on the new theoretical understanding we derived from four existing methods, we proposed a new algorithm, NetMF, that uses matrix factorization explicitly to create network embeddings. The new algorithm performed at comparative or better levels than existing methods in our tests, providing further support for our theoretical explanation.
We use network embedding in the Microsoft Academic Graph (opens in new tab) – the largest knowledge graph of research publications in existence, accessible via the Academic Knowledge API (opens in new tab) and the Microsoft Academic (opens in new tab) website. As machines map the structure and meaning of entities in the graph and their relationships, they can begin making inferences. For example, they can infer that two papers are quite similar in content, or two authors do similar work. In a way, this vector representation is like your genome sequence. If it is similar to someone else’s, you must be related. With network embedding, we can find academic twins that perhaps were not aware of each other. We can help researchers discover scholarship they might not have come across otherwise, and bridge gaps between traditional disciplinary divisions.
The research team included Jiezhong Qiu (a summer intern at Microsoft Research), Yuxiao Dong, Hao Ma, Kuansan Wang from Microsoft Research, and Jian Li and Jie Tang from Tsinghua University.