Materials Informatics
Machine Learning in Materials Science
Trevor David Rhone

"I Keep Six Honest Serving Men ..."
I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who…
by Rudyard Kipling

What is Materials Informatics?
o The use of statistical analysis in materials science
   • Analyze materials’ property data to make predictions and to uncover relationships among materials’ properties
o Statistical approaches are common in other fields
   • Biology – bioinformatics
   • Astronomy
Goal
o Materials discovery
   • Make predictions based on some inputs
   • Speed up a set of ‘expensive’ theory calculations
   • Eliminate the need for a multitude of serial experiments
o Uncover physical principles

What is Materials Informatics?
Problem:
o You are an alchemist living in ancient Greece
o You have access to a small random set of elements
o You would like to predict which properties an element X will have
o How do you proceed in a systematic way?

What is Materials Informatics?
Challenges:
o What are appropriate features to represent our elements?
o How to deal with missing data?
o How to deal with uncertainty / errors in the data?
o How do you construct a model?
o Which models are appropriate?

What is Materials Informatics?
Solution:
o Develop appropriate descriptors for the data
   • Atomic number
   • Number of valence electrons
   • etc.
o Build models that you can apply to your data to make predictions
o Train your model
o Test the effectiveness of your model to make faithful predictions

What is Materials Informatics?
Solution:
o Select and train model
   • Group by color?
   • Sort by weight?
   • Sort by atomic number?
o Handle missing values
o Account for errors or uncertainty in the data
o Use features in a model to
   • develop an understanding of data
   • make property predictions of unknown elements

What is Materials Informatics?
Real World Example: Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning
o Use a machine learning (ML) model to predict atomization energies of a set of organic molecules
   • Features based on nuclear charges and atomic positions only
   • Regression models are trained on and compared to atomization energies computed with density-functional theory
   • Can make good predictions of molecular atomization potential energies
The problem of solving the molecular Schrödinger equation is mapped onto a nonlinear statistical regression problem of reduced complexity
M. Rupp, et al., PRL 108, 058301 (2012)
Correlation of DFT results (Eref) with ML based estimates (Eest) of atomization energies. Correlations for bond counting and semiempirical quantum chemistry (PM6) are also shown.

What is Materials Informatics?
Real World Example: Prediction of Low-Thermal-Conductivity Compounds with First-Principles Anharmonic Lattice Dynamics Calculations and Bayesian Optimization
o Look for compounds with low lattice thermal conductivity (LTC)
   • good thermoelectric materials
o Virtual screening of a library containing 54,779 compounds
o Perform a Bayesian optimization search
   • using initial LTC data obtained from first-principles calculations (a set of 101 compounds)
o Discovered 221 materials with very low LTC
o Two of them even have an electronic band gap < 1 eV, making them good candidates for thermoelectric applications
A. Seko, et al., Phys. Rev. Lett. 115, 205901

What is Materials Informatics?
Real World Example: Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids
o Combine systematic theory (DFT) calculations and regression techniques to predict the melting temperature for single- and binary-component compounds
o Models compared: ordinary least-squares regression, partial least-squares regression, support vector regression (SVR), and Gaussian process regression
o SVR makes the best predictions
o Including physical properties computed by the DFT calculations improves predictions
o The kriging design (Bayesian search) finds the compound with the highest melting temperature much faster than random designs
Predictions of Melting Temperature using SVR
A. Seko et al., Physical Review B 89, 054303 (2014)

Why use Machine Learning to study materials?
Our goal is to make some materials discovery:
o Why not just start testing a bunch of compounds for a desired property?
o Or do a bunch of DFT calculations?
o Serial experiments are slow and expensive
o First-principles calculations are slow and computationally expensive
o Parameter search space is huge
   • Previous example – library of 54,779 compounds
   • 49,037,297 organic and inorganic substances registered with the Chemical Abstracts Service

When to use Machine Learning for Materials studies?
• Data-intensive problems
• Making many DFT calculations would take a long time
• Many serial experiments are not desirable (slow and expensive)
• Problems where there are many parameters whose relationships are not easily understood using conventional means
• Data exist and are accessible
• Data science tools are accessible
• Computing power exists

When to use Machine Learning for Materials studies?: Practical considerations
• Data exist and are accessible
• Data science tools are accessible
• Computing power exists
o Materials Project Database
   • Access the database programmatically using an application programming interface (API) via a Python package (pymatgen)
GET EXAMPLE !!!!

How to use Machine Learning in Materials Science?
o If a collection of data exists that is amenable to statistical analysis
o Identify features/variables that are important for predicting some desired quantity
   • Requires some domain knowledge
   • Principal Component Analysis (PCA) to discover ‘best’ features (i.e. those most relevant for making a prediction)
o Choose an appropriate algorithm or statistical analysis to tease information from the data
o Train the model and do model testing
o Make predictions for a desired property on test data

How to use Machine Learning in Materials Science?
K. Rajan, Materials Informatics, Materials Today, October 2005

How to use Machine Learning in Materials Science?
The machine (or statistical) learning methodology:
1. Material motifs are reduced to numerical fingerprint vectors
2. A suitable measure of chemical (dis)similarity, or ‘chemical distance’, is used within a learning scheme (e.g. linear regression or kernel ridge regression) to map the distances to properties
Ghanshyam Pilania, Scientific Reports, 3 : 2810, DOI: 10.1038/srep02810

How to use Machine Learning in Materials Science?
Iris Dataset: What is a good descriptor or feature?
Iris setosa, Iris virginica, Iris versicolor
The Iris dataset is a famous dataset in machine learning
The machine (or statistical) learning methodology:
o Feature selection is important
o Which ‘features’ of the flowers can help us to classify them?

How to use Machine Learning in Materials Science?
Iris Dataset: What is a good descriptor or feature?
setosa, virginica, versicolor
Feature Selection:
1. Sepal length
2. Sepal width
3. Petal length
4. Petal width

How to use Machine Learning in Materials Science?
Iris Dataset: Exploiting ‘features’ to help build models
setosa, virginica, versicolor
Machine Learning:
o Use features for data visualization
o Perform non-linear regression on training data
o Make predictions on test data

The Where? And Who? of Machine Learning in Materials Science
o Academia
   • Research
o Persons doing materials informatics come from a very diverse set of backgrounds and have a diverse set of skills
   • Statistics
   • Computer science
   • Materials science
      Physics
      Engineering
      Chemistry
Materials Project (Materials Genome Initiative)
NIMS
Small community of researchers (a few at Harvard)
o Industry
   • Materials design for the energy industry
      Photovoltaics
      Thermoelectrics
      Organic LEDs
   • Automotive industry
      Lighter engines

A case study in Materials Informatics: Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids
Dataset:
o Features of Compounds, X (input data)
o Melting temperature, Y (target variables)
FeSe ?
o Have a new compound with unknown melting temperature
o Would like to make a prediction: y* = f(x*)
o Model selection (f(x) = ?):
   • Ordinary least-squares regression (y = mx + c)
   • Partial least-squares regression
   • Support vector regression (SVR)
   • Gaussian process regression (GPR)
o SVR makes the best predictions
   • Compared actual data in a test set with corresponding predictions

A case study in Materials Informatics: Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single- and binary-component solids
Let’s focus on one model:
o Predictions of Melting Temperature using Gaussian process regression (GPR)
   • GPR does a good job of making predictions
   • The kriging design finds the compound with the highest melting temperature much faster than random designs
A. Seko et al., Physical Review B 89, 054303 (2014)

A case study in Materials Informatics: Gaussian process regression (GPR)
Melting Temperature Optimization
Let’s focus on one model:
o Gaussian process regression (GPR)
   • GPR does a good job of making predictions
   • The kriging design finds the compound with the highest melting temperature much faster than random designs
(Plot axes: Target property vs. Compound)
A. Seko et al., Physical Review B 89, 054303 (2014)
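The model-selection step in this case study (fit several regressors, then compare their predictions with the actual values in a test set) can be sketched as follows. This is a minimal illustration on synthetic data, not the workflow of the cited study: it uses NumPy only, and Gaussian kernel ridge regression is swapped in as a stand-in for the kernel-based models (SVR, GPR) named in the slide. All function names, the synthetic features and the hyperparameters (`alpha`, `length_scale`) are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * length_scale**2))

def fit_ols(X, y):
    """Ordinary least squares with an intercept (y = m x + c)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_ols(coef, X):
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    return X1 @ coef

def fit_kernel_ridge(X, y, alpha=1e-2, length_scale=1.0):
    """Dual coefficients of Gaussian kernel ridge regression."""
    K = gaussian_kernel(X, X, length_scale)
    return np.linalg.solve(K + alpha * np.eye(len(X)), y)

def predict_kernel_ridge(dual, X_train, X_new, length_scale=1.0):
    return gaussian_kernel(X_new, X_train, length_scale) @ dual

def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical data: two made-up features per "compound" and a nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(80, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.05 * rng.normal(size=80)

# Hold out a test set that is never used for fitting.
X_train, X_test = X[:60], X[60:]
y_train, y_test = y[:60], y[60:]

ols = fit_ols(X_train, y_train)
krr = fit_kernel_ridge(X_train, y_train)

scores = {
    "OLS": rmse(y_test, predict_ols(ols, X_test)),
    "kernel ridge": rmse(y_test, predict_kernel_ridge(krr, X_train, X_test)),
}
best = min(scores, key=scores.get)  # model with the lowest test error
print(scores, "-> selected:", best)
```

Which model comes out ahead depends on the data and the hyperparameters; the point of the sketch is only the procedure of scoring every candidate on the same held-out set.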

Gaussian Process Regression
• GPR is a Bayesian regression technique
• Solves nonlinear estimation problems
• A Gaussian process is a generalization of the multivariate Gaussian probability distribution
(Illustrations: bivariate Gaussian distribution; multivariate Gaussian distribution)
• The prediction f(x∗) at a point x∗ and its variance v(f∗) are expressed in terms of the Gaussian kernel function, where k∗ = [k(x1,x∗), . . . ,k(xN,x∗)]⊤ is the vector of kernel values between x∗ and the training examples, I is the unit matrix, and the distribution has variance σ²
A. Seko et al., Physical Review B 89, 054303 (2014)
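For reference, the standard GPR predictive mean and variance can be written with the symbols used in this slide. This is a reconstruction under stated assumptions rather than a quotation of the cited paper: it assumes a zero-mean Gaussian process, takes σ² to be the variance of Gaussian noise on the observations, and introduces three symbols that do not appear in the text above: K (the N×N matrix of kernel values between training inputs), y (the vector of training targets) and ℓ (the kernel length scale).

```latex
\begin{align}
k(\mathbf{x}_i, \mathbf{x}_j) &= \exp\!\left(-\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}}{2\ell^{2}}\right), \\
f(\mathbf{x}_*) &= \mathbf{k}_*^{\top}\left(\mathbf{K} + \sigma^{2}\mathbf{I}\right)^{-1}\mathbf{y}, \\
v(f_*) &= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^{\top}\left(\mathbf{K} + \sigma^{2}\mathbf{I}\right)^{-1}\mathbf{k}_* .
\end{align}
```

With the Gaussian kernel, k(x∗,x∗) = 1, so the predicted variance is close to 1 far from the training data and shrinks toward the noise level near observed points.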

GPR, Function-space View:
We use a Gaussian process (GP) to describe a distribution over functions
o The covariance between the outputs is written as a function of the inputs
o The covariance is almost unity between variables whose corresponding inputs are very close, and decreases as their distance in the input space increases.
Bayesian probability:
C. Rasmussen & C. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006
o It can be shown that the squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of Gaussian-shaped basis functions.
o This property implies a ...

GPR, Function-space View:
We use a Gaussian process (GP) to describe a distribution over functions
We are interested in the conditional probability p(y∗|y): “given the data, how likely is a certain prediction for y∗?”

Gaussian Process Regression
Consider linear model: f(x) = x⊤w, y = f(x) + ε, with Gaussian noise ε
Likelihood: p(y|X,w)
Posterior: p(w|y,X) ∝ p(y|X,w) p(w)
• To predict for a test case we average over all possible parameter values, weighted by their posterior probability
• Thus the predictive distribution for f∗ at x∗ is given by averaging the output of all possible linear models w.r.t. the Gaussian posterior

Gaussian Process Regression
The Bayesian linear model suffers from limited expressiveness.
Fix: project the inputs into some high dimensional space using a set of basis functions, and apply the linear model in that space
Change the representation: x → φ(x), where φ(x) collects the basis functions evaluated at x
An algorithm defined in terms of inner products in input space can be lifted into feature space by replacing occurrences of those inner products by k(x, x′)

Gaussian Process Regression
Algorithm for Bayesian optimization:
1. An initial training set is prepared by randomly choosing compounds
2. A compound is selected based on GPR. The compound is chosen as the one with the largest probability of getting beyond the current best value fcur. Since the probability is a monotonically increasing function of the z score, the compound with the highest z score is chosen from the pool of unobserved materials
3. The melting temperature of the selected compound is observed
4. The selected compound is added into the training data set. Then the simulation goes back to step (2)
5. Steps (2)–(4) are repeated until all data of melting temperatures are included in the training set
A. Seko et al., Physical Review B 89, 054303 (2014)

Gaussian Process Regression
Choice of features
Feature Selection:
o Requires some domain knowledge
o The more features the better
o Some features may not affect prediction

Gaussian Process Regression
Choice of features
Feature Selection:
o Can generate new features by ‘combining’ existing features
o Perform calculations to create new features.

Gaussian Process Regression
Benchmarking – model testing
o Evaluate our Bayesian search algorithm by comparison with a standard (e.g. a random search)
Melting Temperature Optimization
o Select a subpopulation of n compounds, e.g. n = 3
o Within a subpopulation perform a search and report the value you want to optimize (melting temperature)
   • Bayesian search
   • Random search
o Take the average of a search through a subpopulation
o Increase n and repeat
Bayesian search finds the optimal value in fewer evaluations
A. Seko et al., Physical Review B 89, 054303 (2014)

Summary for Case Study: Gaussian Process Regression
o Statistical nonlinear model implemented as Gaussian Process Regression
   • Bayesian optimization algorithm
o Bayesian regression model makes faithful predictions
o Bayesian search can find the optimal compound faster than a random search
Predictions of Melting Temperature using Gaussian process regression (GPR)
A. Seko et al., Physical Review B 89, 054303 (2014)
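The Bayesian search procedure of the case study (random initial set, GPR fit, pick the unobserved compound with the highest z score, observe it, repeat) and its comparison against a random search can be sketched as below. This is a toy illustration under stated assumptions, not the authors' implementation: the z score is taken to be (predicted mean − fcur) / sqrt(predicted variance), the targets are centered before fitting, and the features, the "property" values, the function names and the hyperparameters (`noise_var`, `length_scale`) are all hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, length_scale=0.3):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * length_scale**2))

def gpr_predict(X_train, y_train, X_new, noise_var=1e-4, length_scale=0.3):
    """GPR predictive mean and variance at X_new (targets centered first)."""
    y_mean = y_train.mean()
    K = gaussian_kernel(X_train, X_train, length_scale)
    K = K + noise_var * np.eye(len(X_train))
    k_star = gaussian_kernel(X_train, X_new, length_scale)  # shape (N, M)
    mean = y_mean + k_star.T @ np.linalg.solve(K, y_train - y_mean)
    # k(x*, x*) = 1 for the Gaussian kernel
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, np.maximum(var, 1e-12)  # guard against round-off

def bayesian_search(X, y, n_init=3, seed=0):
    """Return the order in which the pool is observed by the Bayesian search."""
    rng = np.random.default_rng(seed)
    # Step 1: random initial training set
    observed = [int(i) for i in rng.choice(len(X), size=n_init, replace=False)]
    while len(observed) < len(X):
        pool = [i for i in range(len(X)) if i not in observed]
        mean, var = gpr_predict(X[observed], y[observed], X[pool])
        f_cur = y[observed].max()              # current best value
        z = (mean - f_cur) / np.sqrt(var)      # Step 2: z score per candidate
        pick = pool[int(np.argmax(z))]
        observed.append(pick)                  # Steps 3-4: observe, add to training set
    return observed

def random_search(X, seed=0):
    """Baseline: observe the pool in a random order."""
    rng = np.random.default_rng(seed)
    return [int(i) for i in rng.permutation(len(X))]

def evaluations_to_best(order, y):
    """Number of observations made until the true optimum is found."""
    return order.index(int(np.argmax(y))) + 1

# Hypothetical pool: 30 "compounds", 2 features, smooth toy property to maximize.
rng = np.random.default_rng(1)
X_pool = rng.uniform(0.0, 1.0, size=(30, 2))
y_pool = -np.sum((X_pool - 0.7) ** 2, axis=1)

order_bayes = bayesian_search(X_pool, y_pool)
order_random = random_search(X_pool)
print("Bayesian search:", evaluations_to_best(order_bayes, y_pool), "evaluations")
print("Random search:  ", evaluations_to_best(order_random, y_pool), "evaluations")
```

A single run like this says little on its own; the benchmarking described in the slides averages the search over many subpopulations and repeats it for increasing n before comparing the two strategies.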

Why is this useful for materials discovery?

Summary: Materials Informatics
o Materials informatics is a way of identifying and exploiting trends in materials’ properties data
o Data availability and access to computing power, along with machine learning algorithms, make this approach viable
Outlook:
o Machine learning to guide experiments in real-time
o Use statistical inference to uncover new physics and guide the discovery of new physical principles

Bayesian statistics
Bayes Rule:
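Bayes' rule in its general form, followed by the same statement applied to a regression model with parameters w (the form used when a posterior is built from a likelihood and a prior). The symbols A, B, w, X and y are generic labels chosen for this illustration.

```latex
\begin{align}
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)}, \\
p(\mathbf{w} \mid \mathbf{y}, X) &= \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)}
\qquad \left(\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}\right).
\end{align}
```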