
Microsoft Research Lab – Asia

Graphormer wins the Open Catalyst Challenge and upgrades to AI for Molecular Simulation Toolkit


Graphormer is a new-generation deep learning model for graph data (typical examples include molecular chemical structures and social networks) proposed by Microsoft Research Asia. Compared with the previous generation of graph neural networks, Graphormer is more expressive, captures graph structure information more efficiently, and has greater potential to scale. In the graph prediction track of the recently held KDD Cup 2021, Graphormer won first place, outperforming models developed by a number of research institutions around the world.

In recent years, AI-based prediction and simulation of molecular properties have become extremely important in materials science and drug discovery. However, the open source machine learning community still lacks algorithms and models that support cutting-edge deep learning for molecular simulation, as well as easy-to-use toolkits. To close this gap, researchers at MSR Asia have been upgrading the previously open-sourced Graphormer into a general AI toolkit for molecular simulation, in an effort to help researchers use the most advanced machine learning algorithms for molecular simulation, molecular property prediction, molecular generation, and other tasks. This major upgrade includes cutting-edge algorithms, easy-to-use pre-trained models, more flexible user interfaces, a more efficient architecture, and comprehensive documentation. Whether you are a scientific researcher or an engineer, Graphormer is intended to help you with AI-based molecular simulation.


GitHub: https://github.com/microsoft/Graphormer

Graphormer v2.0 helps researchers to win the Open Catalyst Challenge

The recent Open Catalyst Challenge was jointly organized by Meta AI Research Institute, Carnegie Mellon University, and NeurIPS, a top machine learning conference, with the aim of using artificial intelligence algorithms to model and discover new catalyst materials and help solve critical scientific issues such as those related to new energy storage and climate change.

The discovery and optimization of catalysts are key to solving many social and energy-related challenges, including solar fuel synthesis, long-term energy storage, and renewable fertilizer production. New catalyst structures can be screened and evaluated using quantum chemistry-based simulations of molecules and chemical reactions, such as density functional theory (DFT). However, the very high computational and time cost of such methods limits the throughput and scale of simulations, and in turn holds back the development of the entire field. For this reason, using machine learning algorithms to provide efficient approximations of molecular and reaction simulations is gradually becoming a new trend in catalyst discovery.

Although the catalysis community has made considerable efforts to apply machine learning models to computational catalyst discovery, it remains an open challenge to build models that generalize across surface element compositions and adsorbates. To address this challenge and advance the field, the Open Catalyst Challenge asked participating teams to develop machine learning algorithms that approximate more than 660,000 density functional theory calculations of catalyst-adsorbate reaction systems (more than 140 million structure-energy estimates), where each system requires simulating the structure and energy of the adsorbate from the initial state to the relaxed state (the state with the lowest energy).

An illustration of the catalyst-adsorbate reaction relaxation process system


This year's Open Catalyst Challenge attracted the attention and participation of many research institutions because of its scientific significance, the difficulty of the task, and the large scale of the dataset. At the NeurIPS 2021 conference, the organizers announced the results of the Direct Track (direct prediction of relaxed energy), where Microsoft Research Asia achieved a mean absolute error of 0.547 eV, winning the challenge by a relatively large margin. The performance analysis showed that for complex systems with multiple adsorbates, Graphormer can identify the lowest-energy system with an accuracy of 89%, which can save at least 50% of the density functional theory computation overhead.
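To make the reported number concrete, here is a minimal sketch of how an energy error of this kind is typically computed, assuming the metric is the mean absolute error between predicted and reference (DFT) relaxed energies. The function name and the energy values are hypothetical and are not taken from the challenge data.

```python
def mean_absolute_error(predicted, reference):
    """Average of |prediction - reference| over all systems (same unit as inputs)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Made-up relaxed energies in eV for three hypothetical systems.
predicted_ev = [0.5, -1.0, 2.0]
reference_ev = [0.75, -1.5, 2.0]
print(mean_absolute_error(predicted_ev, reference_ev))
```

A lower value means that, on average, predicted relaxed energies sit closer to the DFT reference.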


Graphormer wins first place on both public and private leaderboards of the Direct Track.

Graphormer was recently upgraded to support 3D molecular modeling and to deliver better performance. Previously, to better capture structural information in 2D graphs, Graphormer used the shortest-path distance between nodes as a spatial encoding to describe their relative positions, and used degree information as a centrality encoding to describe the structural importance of each node. In the 3D setting, however, no chemical bond information is given, so the entire system can be regarded as a fully connected graph of all atoms. The researchers therefore used Gaussian kernel functions to encode the Euclidean distance between each pair of nodes as the spatial encoding, and summed the spatial encodings of each node to obtain a centrality encoding that describes the importance of that atom in the 3D molecular graph.
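The 3D encodings described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the toolkit's implementation: the kernel centers and width are fixed here (in a trained model they would presumably be learned), the sum for each atom skips the atom itself, and all function names are hypothetical.

```python
import math


def gaussian_features(distance, centers, width):
    """Expand one scalar distance into a vector of Gaussian kernel responses."""
    return [math.exp(-0.5 * ((distance - mu) / width) ** 2) for mu in centers]


def spatial_and_centrality_encoding(coords, centers, width):
    """Return (spatial, centrality) for a list of (x, y, z) atom positions.

    spatial[i][j] is the Gaussian feature vector of the distance between
    atoms i and j; centrality[i] sums spatial[i][j] over all other atoms j.
    """
    n = len(coords)
    spatial = [
        [gaussian_features(math.dist(coords[i], coords[j]), centers, width)
         for j in range(n)]
        for i in range(n)
    ]
    centrality = []
    for i in range(n):
        total = [0.0] * len(centers)
        for j in range(n):
            if j == i:
                continue  # assumption: an atom does not contribute to itself
            for k, value in enumerate(spatial[i][j]):
                total[k] += value
        centrality.append(total)
    return spatial, centrality


# Three atoms; kernels centered at 0, 1 and 2 (arbitrary length units).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
spatial, centrality = spatial_and_centrality_encoding(atoms, [0.0, 1.0, 2.0], 0.5)
```

Because both encodings depend only on interatomic distances, they do not change when the whole system is rotated or translated.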

In addition to having Graphormer directly predict the energy of the system in the relaxed state, the researchers also designed an auxiliary task: predicting the coordinate displacement of each atom from the initial state to the relaxed state. Molecular dynamics tasks often require predicting the forces on atoms or their coordinate displacements, so the model output needs to be equivariant with respect to rotation and translation of the system. To this end, the researchers designed a special 3D attention layer for Graphormer in which the effect of the target node on the source node is projected onto the x, y, and z axes, which makes the model output equivariant.
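The idea behind such an equivariant output can be illustrated with a simplified sketch: if the attention weights depend only on interatomic distances, and each weight multiplies the unit direction vector between two atoms along x, y, and z, then rotating the input rotates the output in the same way. The logits used here (negative distance) and the direction convention (pointing from atom i to atom j) are assumptions made for illustration; this is not the actual Graphormer layer.

```python
import math


def displacement_sketch(coords):
    """Per-atom 3D vector: attention-weighted sum of unit direction vectors."""
    n = len(coords)
    outputs = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        dists = [math.dist(coords[i], coords[j]) for j in others]
        # Attention logits depend only on distances, so they are rotation-
        # and translation-invariant (hypothetical choice: closer = stronger).
        logits = [-d for d in dists]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        norm = sum(exps)
        vec = [0.0, 0.0, 0.0]
        for j, d, e in zip(others, dists, exps):
            for axis in range(3):
                # Project the pairwise effect onto the x, y and z axes.
                vec[axis] += (e / norm) * (coords[j][axis] - coords[i][axis]) / d
        outputs.append(tuple(vec))
    return outputs


def rotate_z(point, angle):
    """Rotate a 3D point or vector about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.5, 0.5, 1.5)]
# Rotate and then translate the whole system.
moved = [tuple(a + b for a, b in zip(rotate_z(p, 0.7), (3.0, -1.0, 2.0))) for p in atoms]
out_original = displacement_sketch(atoms)
out_moved = displacement_sketch(moved)
```

In this sketch, `out_moved` matches `out_original` rotated by the same angle (up to floating-point error), and the translation has no effect on the predicted vectors. The sketch assumes at least two atoms at distinct positions.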


The winning architecture of Graphormer in the Open Catalyst Challenge


The latest open source Graphormer toolkit includes all models, training and inference code, and data processing scripts used in the Open Catalyst Challenge. It is hoped that researchers in related fields can easily apply Graphormer to molecular dynamics or related tasks, and that this will support the development of artificial intelligence algorithms in materials science, drug discovery, and other fields.

Open source pushes the frontiers of cross-disciplinary research and applications

The integration of AI research with the natural sciences is accelerating, and MSR Asia has made important breakthroughs in fields such as biology, materials science, and environmental science. It is hoped that Graphormer, as MSR Asia’s first open-source toolkit at the intersection of AI and the natural sciences, can better promote cutting-edge research and applications of AI in molecular science, such as the discovery of new energy storage materials and drug discovery. In addition to advanced algorithms and models, Graphormer also provides powerful pre-trained models trained on different datasets.

Precise physicochemical and pharmacological properties of molecules are difficult to obtain in the laboratory or in clinical trials. As a result, high-quality data in these areas is often scarce, which leaves cutting-edge deep learning models little chance to show their capabilities. With a powerful pre-trained model, researchers often need only a small amount of task-specific data to fine-tune it into a promising deep learning model. For example, after the upgrade, the toolkit provides a Graphormer model pre-trained on the PCQM4M dataset. This dataset contains quantum chemical properties of more than 3.8 million molecules, allowing the pre-trained Graphormer model to learn a wealth of chemistry knowledge and to transfer well. When the Graphormer model pre-trained on this dataset is transferred to bioassay prediction tasks (such as the OGBG-PCBA dataset), it can substantially outperform the previous generation of graph neural networks.

Furthermore, the Graphormer toolkit currently supports a variety of mainstream graph libraries and datasets, such as PyG, DGL, and OGB, so that researchers can quickly develop and verify algorithms on benchmark datasets or their own private data. Compared with the previous version, the upgraded Graphormer is more efficient, supports high-performance parallel training at large scale, and allows flexible customization of models and algorithms. In addition to user-friendly interfaces, powerful cutting-edge algorithms, and pre-trained models, the new Graphormer toolkit also has improved documentation, with rich sample tutorials to help users quickly understand it and get started.

In the future, beyond molecular property prediction and molecular dynamics, the Graphormer toolkit will also support a variety of common applications in scientific research and industry, such as drug molecule-protein interaction, chemical reaction prediction (retrosynthesis), molecular generation, and large-molecule (polymer, protein) simulation. A large number of public datasets, industry benchmarks, and unified evaluation standards will help researchers avoid unnecessary work and concentrate on algorithms or applications.

Graphormer has received positive feedback from community members and users. The toolkit is being actively updated, and more features will be released in the future. Anyone interested in molecular modeling or related problems is very welcome to use the Graphormer toolkit. It is hoped that more exchange and sharing will promote better communication in the field of molecular modeling.