MatterGen: a generative model for inorganic materials design
- Claudio Zeni,
- Robert Pinsler,
- Daniel Zügner,
- Andrew Fowler,
- Matthew Horton,
- Xiang Fu,
- Sasha Shysheya,
- Jonathan Crabbé,
- Lixin Sun,
- Jake Smith,
- Ryota Tomioka,
- Tian Xie
The design of functional materials with desired properties is essential in driving technological advances in areas like energy storage, catalysis, and carbon capture. Generative models provide a new paradigm for materials design by directly generating entirely novel materials given desired property constraints. Despite recent progress, current generative models have a low success rate in proposing stable crystals, or can only satisfy a very limited set of property constraints. Here, we present MatterGen, a model that generates stable, diverse inorganic materials across the periodic table and can further be fine-tuned to steer the generation towards a broad range of property constraints. To enable this, we introduce a new diffusion-based generative process that produces crystalline structures by gradually refining atom types, coordinates, and the periodic lattice. We further introduce adapter modules to enable fine-tuning towards any given property constraints with a labeled dataset. Compared to prior generative models, structures produced by MatterGen are more than twice as likely to be novel and stable, and more than 15 times closer to the local energy minimum. After fine-tuning, MatterGen successfully generates stable, novel materials with desired chemistry, symmetry, and mechanical, electronic, and magnetic properties. Finally, we demonstrate multi-property materials design capabilities by proposing structures that have both high magnetic density and a chemical composition with low supply-chain risk. We believe that the quality of generated materials and the breadth of MatterGen’s capabilities represent a major advancement towards creating a universal generative model for materials design.
MatterGen
December 13, 2024
MatterGen is a generative model for inorganic materials design across the periodic table that can be fine-tuned to steer the generation towards a wide range of property constraints.
Keynote: The Revolution in Scientific Discovery
Presented by Chris Bishop at Microsoft Research Forum, Episode 2
Chris Bishop shared the vision for how AI for Science will leverage deep learning to model and predict natural phenomena, including the exciting real-world progress being made by the team.
Transcript
Keynote: The revolution in scientific discovery
CHRIS BISHOP: Good morning. A very warm welcome to the Microsoft Research Forum. My name is Chris, and I’m going to talk today about an extraordinary revolution that’s unfolding at the intersection of AI and deep learning with the natural sciences.
In my view, the most important use case of AI will be scientific discovery. And the reason I believe this is that it’s our understanding of the natural world obtained through scientific discovery, together with its application in the form of technology, that has really transformed the human species. This transformation has very broad applicability, spanning vast ranges of length and time. Now, we’ve seen remarkable advances, of course, in AI in the last couple of years. And you may ask, can we just apply large language models to scientific discovery and be done? Well, the answer is no. But first, let me say that large language models do have two remarkable properties that are very useful. The first one is, of course, they can generate and understand human language, so they provide a wonderful human interface to very sophisticated technologies. But the other property of large language models—and I think this came as a big surprise to many of us—is that they can function as effective reasoning engines. And, of course, that’s going to be very useful in scientific discovery. But large language models alone don’t address the full challenge of scientific discovery. And the reason is that there are some key differences in the natural sciences. And let me highlight some of these.
So the first one is that in scientific discovery, we need to do precise quantitative numerical calculations. We may need to calculate the properties of molecules or materials. And large language models are very poor at doing complex numerical calculations. They don’t produce accurate results. And, of course, they’re hugely inefficient from a computational point of view in doing such calculations. A second critical difference is that in the natural sciences, the ultimate truth—the gold standard—is experiment. It doesn’t matter how beautiful your theory is or how clever your code is. If it doesn’t agree with experiment, you have to go back and think again. So in scientific discovery, experiment needs to be embedded in the loop of the scientific discovery process.
Another difference is that with large language models, we can exploit internet-scale data that, you know, to a first approximation is readily available, freely available. In scientific discovery, however, the training data is often scarce. We may generate it computationally at great expense, or we gather it through sophisticated, complex laboratory experiments. But it tends to be scarce. It tends to be expensive. It tends to be limited. But there’s a final difference that, to some extent, offsets that scarcity of data, and it’s the fact that we have the known laws of physics. We’ve had more than three and a half centuries of scientific discovery that’s given us tremendous insight into the machinery of the universe. So let me say a little bit more about that, what I’ll call prior knowledge.
So very often, this prior knowledge is expressed in the form of differential equations. So think about Newton’s laws of motion or the law of gravity, going back to the 17th century; Maxwell’s equations of electrodynamics, in the 19th century; and then, of course, very importantly, at the beginning of the 20th century, the discovery of the equations of quantum physics. And here I show a simplified version of Schrödinger’s equation. And if you sprinkle in a few relativistic effects, then this really describes matter at the molecular level with exquisite precision. And it would, of course, be crazy not to use those centuries of scientific advance. But there’s a problem, which is that these equations, although they’re very simple to write down, are computationally very expensive to solve. In fact, an exact solution of Schrödinger’s equation is exponential in the number of electrons, so it’s prohibitive for any practical application. And even accurate approximations to Schrödinger’s equation are still computationally very expensive. Nevertheless, we can make efficient use of that: instead of viewing your solver for Schrödinger’s equation as a way of directly calculating the properties of materials or molecules—that’s expensive—we can use that simulation to generate synthetic training data and then use that training data to train deep learning models, which we’ll call emulators. And once they’re trained, those emulators can be several orders of magnitude faster than the original simulator. And I’ll show an example of that in a moment. But it’s not just these differential equations that constitute powerful prior knowledge.
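The slide itself is not reproduced in this transcript; for reference, the standard time-independent form of Schrödinger’s equation referred to here (a well-known formula, not the exact content of the slide) is:

```latex
\hat{H}\,\psi(\mathbf{r}_1,\dots,\mathbf{r}_N) = E\,\psi(\mathbf{r}_1,\dots,\mathbf{r}_N),
\qquad
\hat{H} = -\sum_{i=1}^{N}\frac{\hbar^2}{2m_e}\nabla_i^2 + V(\mathbf{r}_1,\dots,\mathbf{r}_N)
```

It is the N-electron wavefunction ψ that makes exact solutions scale exponentially with the number of electrons, as noted above.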
Let’s have a look at this molecule in isolation. Just a simple molecule. And it has various properties. Let’s say it has some energy. Now imagine rotating the molecule: in the computer, the coordinates of all the atoms are stored as numbers. As we rotate the molecule, all of those numbers change, but the energy doesn’t change. We call that an invariance property, and it’s a powerful, exact piece of prior knowledge. We want to make sure that’s baked into our machine learning models. And if that molecule happens to have a dipole moment, like a little bar magnet, then when the molecule rotates, that little magnet rotates with the molecule; that’s called equivariance. And there’s a lot more besides. These are examples of symmetries, and symmetries play a very powerful role in the natural sciences. The symmetry of spacetime gives rise to conservation of momentum and conservation of energy; gauge symmetry in the electromagnetic field gives rise to the conservation of charge. These hold exactly, with exquisite precision, and again, we want to exploit all of that prior knowledge.
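To make the invariance/equivariance distinction concrete, here is a minimal numerical sketch. It illustrates the concept only, not any of the models discussed; the toy pairwise-distance energy and the centroid standing in for a dipole-like vector are assumptions made for illustration.

```python
import numpy as np

def toy_energy(coords):
    # A toy scalar "energy" built only from interatomic distances,
    # so it cannot change when the whole molecule is rotated.
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    i, j = np.triu_indices(len(coords), k=1)
    return np.sum(1.0 / (1.0 + dists[i, j]))

def random_rotation(rng):
    # Random orthogonal 3x3 matrix (a rotation, possibly combined with a reflection).
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q

rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))        # 5 "atoms" at random positions
vector_property = coords.mean(axis=0)   # toy stand-in for a dipole-like vector
R = random_rotation(rng)
rotated = coords @ R.T

# Invariance: the scalar energy is unchanged by the rotation.
assert np.isclose(toy_energy(coords), toy_energy(rotated))
# Equivariance: the vector quantity rotates together with the molecule.
assert np.allclose(rotated.mean(axis=0), vector_property @ R.T)
print("invariance and equivariance checks passed")
```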
So how can we actually make use of that prior knowledge in practice? Well, it really comes down to a very fundamental theorem that’s right at the heart of machine learning. It has a strange title. It’s called the no-free-lunch theorem. But what it says is that you cannot learn purely from data. You can only learn from data in the presence of assumptions, or prior knowledge. And in the machine learning context, we call that inductive bias. And there’s a tradeoff between the data and the inductive bias. So if you’re in a situation where data is scarce, you can compensate for that by using powerful inductive bias. And so it leads to a different kind of tradeoff. If you think about large language models, I’ve already said that we have data available at a very large scale, and so those large language models use very lightweight inductive bias. They’re often based on transformers. The inductive biases that we have are deep hierarchical representation; perhaps there’s some data-dependent self-attention. But it’s very lightweight inductive bias. And many scientific models are in the other regime. We don’t have very much data, but we have these powerful inductive biases arising from three and a half centuries of scientific discovery.
So let me give you an example of how we can use those inductive biases in practice. And this is some work done by our close collaborators and partners in the Microsoft Azure Quantum team. And the goal here is to find new electrolytes for lithium-ion batteries and, in particular, to try to replace some of that increasingly scarce lithium with cheap, widely available sodium. And so this really is a screening process. We start at the top with over 32 million computer-generated candidate materials, and then we go through a series of ever more expensive screening steps, including some human-guided screening towards the end, eventually to arrive at a single best candidate. Now, those steps involve things like density functional theory, which gives approximate solutions to Schrödinger’s equation but is computationally very expensive.
So we do what I talked about earlier, which is we use those solutions—we use solutions from density functional theory—to train an emulator, and now the emulator can do the screening much faster. In fact, it’s more than three orders of magnitude faster at screening these materials. And anytime something gets three orders of magnitude faster, that really is a disruption. And so what this enabled us to do is to take a process, a screening process, that would have taken many years of compute by conventional methods and reduce it to just 80 hours of computation. And here you see the best candidate material from that screening process. This was synthesized by our partners at the Pacific Northwest National Laboratory. And here you can see some test batteries being fabricated. And then here are the batteries in a kind of test cell. And then just to prove that it really works, here’s a little alarm clock being powered by one of these new lithium-ion batteries that uses 70 percent less lithium than a standard lithium-ion battery. So that’s extremely exciting. But there’s much more that we can do. It’s really just the beginning. So as well as using AI to accelerate that screening process by three orders of magnitude, we can also use AI to transform the way we generate those candidate materials at the top of that funnel.
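As a rough illustration of how an emulator slots into such a screening funnel, here is a small sketch in Python. Everything in it (the function names, thresholds, and the dictionary format for a candidate) is an assumption made for illustration, not the Azure Quantum pipeline itself.

```python
from typing import Callable, Dict, Iterable, List

def screen(candidates: Iterable[Dict],
           fast_energy_emulator: Callable[[Dict], float],
           conductivity_estimate: Callable[[Dict], float],
           stability_cutoff: float = 0.1,
           conductivity_cutoff: float = 1e-3) -> List[Dict]:
    """Cheap-to-expensive filtering: a fast ML emulator (trained on DFT data)
    prunes most candidates so costlier checks only run on the survivors."""
    survivors = []
    for material in candidates:
        if fast_energy_emulator(material) > stability_cutoff:
            continue  # predicted too far above the convex hull: likely unstable
        if conductivity_estimate(material) < conductivity_cutoff:
            continue  # predicted ionic conductivity too low for an electrolyte
        survivors.append(material)
    return survivors

# Toy usage with stand-in predictors.
pool = [{"id": i, "e_hull": 0.05 * i, "sigma": 1e-2 / (i + 1)} for i in range(10)]
picks = screen(pool, lambda m: m["e_hull"], lambda m: m["sigma"])
print(len(picks), "candidates survive the toy screen")
```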
So this is some recent work called MatterGen. And the idea here is not simply to generate materials at random and then screen them but instead generate materials in a much more focused way, materials that have specific values of magnetic density, bandgap, and other desired properties. And we use a technique called diffusion models. You’re probably familiar at least with the output of diffusion models; they’re widely used to generate images and now video, as well. And here they are being used to generate—can we just play that video? Is that possible? This is a little video … here we go. So this, the first part of the video here, is just showing a typical generation of a random material. And now we see MatterGen generating materials that have specific desired properties. What this means is that we can take that combinatorially vast space of possible new materials and, by focusing our attention on a subspace of that overall space and then using accelerated AI, gain a further several orders of magnitude of acceleration in our ability to explore the space of materials to find new candidates for things like battery electrolytes. But it’s not just materials design. This disruption has much broader applicability.
It’s a very sad fact that in 2022, 1.3 million people died of tuberculosis. Now, you may find that surprising because there are antibiotics; there are drugs to treat tuberculosis. But the bacterium that causes TB is developing very strong drug resistance, and so the search is on for new and better treatments. So again, we can use modern deep learning techniques, and I’ll talk through a framework here called TamGen, for target-aware molecular generation, which allows us to search very specifically for new molecules that bind to a particular protein. So here’s how it works. We first of all train a language model, but it’s not trained on human language; it’s trained on the language of molecules. And, in particular, this uses a standard representation called SMILES, which is just a way of taking a molecule and expressing it as a one-dimensional sequence of tokens, so a bit like a sequence of words in language. And now we train a transformer with self-attention to be able to effectively predict the next token, and when it’s trained, it now understands the language of SMILES strings—it understands the language of molecules—and it can generate new molecules.
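Here is a small sketch of the “language of molecules” idea: split a SMILES string into tokens and form next-token prediction targets, exactly as one would for words in text. The regular expression and the example molecule are illustrative assumptions; TamGen’s actual tokenizer is not described in this talk.

```python
import re

# Simplistic SMILES tokenizer: two-letter elements and bracketed atoms first,
# then any single character (ring indices, bonds, branches).
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|.")

def tokenize(smiles: str):
    return SMILES_TOKEN.findall(smiles)

caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"   # caffeine, as a SMILES string
tokens = tokenize(caffeine)

# Next-token training pairs: given the tokens so far, predict the next one,
# just like next-word prediction in a text language model.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(tokens[:8])   # ['C', 'N', '1', 'C', '=', 'N', 'C', '2']
print(pairs[0])     # (['C'], 'N')
```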
But we don’t just want to generate new molecules at random, of course. We want to generate molecules that are targeted to a particular protein. And so we use another transformer-based model to encode the properties of that protein. And, in particular, we’re looking for a region of the protein called a pocket, which is where the drug molecule binds and, in the process, alters the function of the protein, and that breaks the chain of the disease. And so we use some of those geometrical properties that I talked about earlier to encode the geometrical structure of the protein, taking account of those invariance and equivariance properties. And we learn a model that can map that into the representation of the SMILES string. We want to do one more thing, as well. What we want to do is to be able to refine molecules. We want to take molecules that we know bind but improve them, increase their binding efficiency. And so we need a way of encoding an existing molecule but also generating variability. And we use another standard deep learning technique called a variational autoencoder, which takes a representation of the starting molecule and again encodes it into that representation space.
And then finally we use a thing called cross-attention that combines the output of those two encoders into that SMILES language model. So once the system has been trained, we can now present it with a target protein, in this case, for TB. We can present it with a known molecule that binds to that target, and then it can generate candidates that we hope will have an improved efficacy compared to the starting molecule. Now, we collaborate with a partner called GHDDI—the Global Health Drug Discovery Institute. They’ve synthesized these candidate molecules, and they found that this one in particular shows more than a two-orders-of-magnitude improvement over a standard drug molecule. So it’s got a long way to go before we have a clinical drug. But nevertheless, this is an extraordinary achievement. This is the state of the art in terms of candidate drug molecules which bind to this particular protein. So I think very, very exciting. And, of course, we’re continuing to work with GHDDI to refine and optimize this and hope eventually to take this towards pre-clinical trials.
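Here is a rough sketch, assuming PyTorch, of the cross-attention wiring described above: the SMILES decoder attends both to the protein-pocket encoding and to the molecule encoding from the variational autoencoder. The module names, dimensions, and the simple concatenation of the two contexts are illustrative assumptions rather than TamGen’s actual architecture (residual norms and masking are omitted for brevity).

```python
import torch
import torch.nn as nn

class ConditionedSmilesDecoderLayer(nn.Module):
    """One decoder layer: self-attention over the SMILES tokens generated so far,
    then cross-attention into the concatenated pocket and molecule encodings."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, smiles_h, pocket_h, molecule_h):
        x, _ = self.self_attn(smiles_h, smiles_h, smiles_h)
        x = x + smiles_h
        # Cross-attention: queries from the SMILES decoder, keys/values from the
        # protein-pocket encoder and the molecule (VAE) encoder outputs.
        context = torch.cat([pocket_h, molecule_h], dim=1)
        y, _ = self.cross_attn(x, context, context)
        x = x + y
        return x + self.ff(x)

# Toy shapes: batch of 2, 20 SMILES tokens, 50 pocket residues, 30 molecule tokens.
layer = ConditionedSmilesDecoderLayer()
out = layer(torch.randn(2, 20, 128), torch.randn(2, 50, 128), torch.randn(2, 30, 128))
print(out.shape)  # torch.Size([2, 20, 128])
```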
So I’ve mentioned several concepts here: transformers, attention, variational autoencoders, diffusion models, and so on. And if you want to learn more about these techniques, I’m delighted to say that a new book called Deep Learning: Foundations and Concepts was published just a few weeks ago by Springer—a beautiful, very high-quality hardback copy. But it’s also available from BishopBook.com as a free online version. So I encourage you to take a look at that.
So finally, I hope I’ve given you a glimpse of how AI and deep learning are transforming the world of scientific discovery. I’ve highlighted two examples, one of them in materials design and one of them in drug discovery. This is just scratching the surface. The potential of this disruption has huge breadth of applicability. And so to hear more about this exciting field, in a few minutes, Bonnie [Kruft] will be moderating a panel discussion on transforming the natural sciences with AI.
Thank you very much.
Unlocking Real-World Solutions with AI – Chris Bishop
Chris Bishop reveals how AI is revolutionizing materials science with an innovative battery electrolyte material. With the help of MatterGen, a generative AI system for proposing novel materials, researchers can explore new material options with precision and efficiency. The broad potential of these AI systems spans industries from drug discovery to environmental science.
MatterGen: A Generative Model for Materials Design
Presented by Tian Xie at Microsoft Research Forum, Episode 3
Tian Xie introduces MatterGen, a generative model that creates new inorganic materials based on a broad range of property conditions required by the application, aiming to shift the traditional paradigm of materials design with generative AI.
Transcript
MatterGen: A Generative Model for Materials Design
TIAN XIE: Hello, everyone. My name is Tian, and I’m from Microsoft Research AI for Science. I’m excited to be here to share with you MatterGen, our latest model that brings generative AI to materials design.
Materials design is the cornerstone of modern technology. Many of the challenges our society is facing today are bottlenecked by our ability to find a good material. For example, if we can find a novel material that conducts lithium very well, it will be a key component for our next-generation battery technology. The same applies to many other domains, like finding a novel material for solar cells, carbon capture, and quantum computers. Traditionally, materials design is conducted by search-based methods. We search through a list of candidates and gradually filter them using a list of design criteria for the application. For batteries, for example, we need the materials to contain lithium, to be stable, and to have a high lithium-ion conductivity, and each filtering step can be conducted using simulation-based methods or AI emulators. At the end, we get five to 10 candidates that we send to the lab for experimental synthesis.
In MatterGen, we hope to rethink this process with generative AI. We’re aiming to directly generate materials given the design requirements for the target application, bypassing the process of searching through candidates. You can think of it as using text-to-image generative models like DALL-E to generate images given a prompt rather than needing to search through the entire internet for images via a search engine. The core of MatterGen is a diffusion model specifically designed for materials. A material can be represented by its unit cell, the smallest repeating unit of the infinite periodic structure. It has three components: atom types, atom positions, and the periodic lattice. We designed the forward process to corrupt all three components towards a random structure and then have a model reverse this process to generate a novel material. Conceptually, it is similar to using a diffusion model for images, but we build a lot of inductive bias, like equivariance and periodicity, into the model because we’re operating in a sparse data regime, as in most scientific domains.
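A simplified sketch of that three-part representation and one forward corruption step, in Python. The specific noise forms below (Gaussian noise on wrapped fractional coordinates, a symmetric perturbation of the lattice, uniform resampling of atom types) are crude stand-ins chosen for illustration, not the actual diffusion processes defined in the MatterGen paper.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ELEMENTS = 100  # atom types drawn from (roughly) the periodic table

def random_crystal(num_atoms: int):
    """A unit cell: atom types, fractional coordinates, and a 3x3 lattice matrix."""
    return {
        "types": rng.integers(1, NUM_ELEMENTS + 1, size=num_atoms),
        "frac_coords": rng.random((num_atoms, 3)),            # fractional, in [0, 1)
        "lattice": np.diag(rng.uniform(3.0, 6.0, size=3)),    # toy orthorhombic cell (Å)
    }

def corrupt(crystal, t: float):
    """Forward diffusion step at noise level t in [0, 1]: push all three
    components towards a random structure (coordinates wrap periodically)."""
    noisy = dict(crystal)
    # Coordinates: add Gaussian noise and wrap back into the unit cell.
    jitter = t * rng.normal(scale=0.1, size=crystal["frac_coords"].shape)
    noisy["frac_coords"] = (crystal["frac_coords"] + jitter) % 1.0
    # Lattice: add a symmetric random perturbation scaled by t.
    noise = rng.normal(scale=0.5, size=(3, 3))
    noisy["lattice"] = crystal["lattice"] + t * (noise + noise.T) / 2
    # Atom types: with probability t, resample each type uniformly at random.
    mask = rng.random(len(crystal["types"])) < t
    resampled = rng.integers(1, NUM_ELEMENTS + 1, size=len(mask))
    noisy["types"] = np.where(mask, resampled, crystal["types"])
    return noisy

cell = random_crystal(4)
print(corrupt(cell, t=0.5)["frac_coords"])
```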
Given this diffusion architecture, we train the base model of MatterGen using the structure of all known stable materials. Once trained, we can generate novel, stable materials by sampling from the base model unconditionally. To generate the material given desired conditions, we further fine-tune this base model by adding conditions to each layer of the network using a ControlNet-style parameter-efficient fine-tuning approach. The condition can be anything like a specific chemistry, symmetry, or any target property. Once fine-tuned, the model can directly generate the materials given desired conditions. Since we use fine-tuning, we only need a small labeled dataset to generate the materials given the corresponding condition, which is actually very useful for the users because it’s usually computationally expensive to generate a property-labeled dataset for materials.
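A rough sketch, assuming PyTorch, of the ControlNet-style adapter idea: a small conditioning module is attached to each frozen base-model layer and zero-initialized, so fine-tuning starts exactly from the unconditional base model and only a small labeled dataset is needed to learn the conditioning. The class and variable names here are illustrative assumptions, not the MatterGen code.

```python
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    """Wraps a frozen base layer and adds a property-conditioned correction.
    The zero-initialized output projection means the wrapped layer initially
    behaves exactly like the base layer (ControlNet-style)."""
    def __init__(self, base_layer: nn.Module, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.base_layer = base_layer
        for p in self.base_layer.parameters():
            p.requires_grad = False            # keep the pretrained base model frozen
        self.cond_proj = nn.Linear(cond_dim, hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out_proj.weight)   # start as an identity modification
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, h, condition):
        base_out = self.base_layer(h)
        return base_out + self.out_proj(torch.relu(self.cond_proj(condition)))

# Toy usage: condition a hidden state on a 1-D target property (e.g. a bandgap value).
base = nn.Linear(64, 64)
layer = AdapterLayer(base, hidden_dim=64, cond_dim=1)
h = torch.randn(8, 64)
target_property = torch.full((8, 1), 2.3)   # hypothetical property label per structure
print(layer(h, target_property).shape)      # torch.Size([8, 64])
```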
Here’s an example of how MatterGen generates novel materials in the strontium-vanadium-oxygen chemical system. It generates candidates with lower energy than two other competing methods: random structure search and substitution. The resulting structure looks very reasonable and is shown to be stable using computational methods. MatterGen also generates materials given desired magnetic, electronic, and mechanical properties. The most impressive result here is that we can shift the distribution of generated materials towards extreme values compared with the training distribution of the property. This is very significant because most materials design problems involve finding materials with extreme properties, like superhard materials or magnets with high magnetic density, which is difficult to do with traditional search-based methods and is the key advantage of generative models.
Our major next step is to bring these generative AI–designed materials into the real world, making real-world impact in a variety of domains like battery design, solar cell design, and carbon capture. One limitation is that we have only validated these AI-generated materials using computation. We’re working with experimental partners to synthesize them in the wet lab. It is a nontrivial process, but we keep improving our model and getting feedback from the experimentalists, and we are looking forward to a future where generative AI–designed materials can make real-world impact in a broad range of domains. Here’s a link to our paper in case you want to learn more about the details. We look forward to any comments and feedback that you might have. Thank you very much.
