Download lecture14 - Projects at Harvard

Materials Informatics
Machine Learning in Materials Science
Trevor David Rhone
"I Keep Six Honest
Serving Men ..."
I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who…
by Rudyard Kipling
What is Materials Informatics?
o The use of statistical analysis in materials science
• Analyze materials’ property data to make predictions and to uncover
relationships among materials’ properties
o Statistical approaches are common in other fields
• Biology – bioinformatics
• Astronomy
Goal
o Materials discovery
• Make predictions based on some inputs
• Speed up a set of ‘expensive’ theory calculations
• Eliminate need for a multitude of serial experiments
o Uncover physical principles
What is Materials Informatics?
Problem:
o You are an alchemist living
in ancient Greece
o You have access to a small
random set of elements
o You would like to predict
which properties an
element X will have
o How do you proceed in a
systematic way?
What is Materials Informatics?
Challenges:
o What are appropriate
features to represent our
elements?
o How to deal with missing
data?
o How to deal with
uncertainty / errors in the
data?
o How do you construct a
model?
o Which models are
appropriate?
What is Materials Informatics?
Solution:
o Develop appropriate
descriptors for the data
• Atomic number
• Number of valence
electrons
• etc.
o Build models that you can
apply to your data to make
predictions
o Train your model
o Test the effectiveness of
your model to make faithful
predictions
What is Materials Informatics?
Solution:
o Select and train model
• Group by color?
• Sort by weight?
• Sort by atomic number?
o Handle missing values
o Account for errors or
uncertainty in the data
o Use features in a model to
• develop an understanding
of data
• make property predictions
of unknown elements
What is Materials Informatics?
Real World Example:
Fast and Accurate Modeling of Molecular Atomization Energies
with Machine Learning
o Use a machine learning (ML) model to predict
atomization energies of a set of organic
molecules
• Features based on nuclear charges and atomic
positions only
• Regression models are trained on and compared
to atomization energies computed with density-functional theory
• Can make good predictions of molecular
atomization energies
The problem of solving the molecular Schrödinger
equation is mapped onto a nonlinear statistical
regression problem of reduced complexity
M. Rupp, et al., PRL 108, 058301 (2012)
Correlation of DFT results (Eref) with ML based
estimates (Eest) of atomization energies.
Correlations for bond counting and semiempirical quantum chemistry (PM6) are also
What is Materials Informatics?
Real World Example:
Prediction of Low-Thermal-Conductivity Compounds with First-Principles Anharmonic Lattice
Dynamics Calculations and Bayesian Optimization
o Look for compounds with low lattice thermal
conductivity (LTC)
• good thermoelectric materials
o Virtual screening of a library containing
54,779 compounds
o Perform a Bayesian optimization search
• using initial LTC data obtained from first-principles calculations (a set of 101
compounds)
o Discovered 221 materials with very low LTC
o Two of them even have an electronic band
gap < 1 eV, making them good candidates
for thermoelectric applications
A. Seko, et al., Phys. Rev. Lett. 115, 205901
What is Materials Informatics?
Real World Example:
Machine learning with systematic density-functional theory calculations: Application to
melting temperatures of single- and binary-component solids
o Combine systematic theory (DFT)
calculations and regression techniques to
predict the melting temperature for single
and binary component compounds
o Ordinary least-squares regression, partial
least-squares regression, support vector
regression, and Gaussian process regression
o SVR makes the best predictions
o Including physical properties computed by
the DFT calculations improves predictions
o The kriging design (Bayesian search) finds
the compound with the highest melting
temperature much faster than random
designs
Predictions of Melting Temperature using SVR
A. Seko et al., Physical Review B 89, 054303 (2014)
Why use Machine Learning to study materials?
Our goal is to make some materials discovery:
o Why not just start testing a bunch of compounds for a desired
property?
o Or do a bunch of DFT calculations?
o Serial experiments are slow and expensive
o First principles calculations are slow and computationally expensive
o Parameter search space is huge
• Previous example – library of 54,779 compounds
• 49,037,297 organic and inorganic substances registered with the Chemical
Abstracts Service
When to use Machine Learning for Materials studies?
• Data intensive problems
• Making many DFT calculations would take a long time
• Many serial experiments are not desirable (slow and expensive)
• Problems where there are many parameters whose relationships are
not easily understood using conventional means
• Data exist and are accessible
• Data science tools are accessible
• Computing power exists
When to use Machine Learning for Materials studies?:
Practical considerations
o Materials Project Database
• Access database
programmatically using
an application program
interface (API) via python
package (pymatgen)
How to use Machine Learning in Materials Science?
o If a collection of data exists that is amenable to statistical analysis:
o Identify features/variables that are important for predicting some
desired quantity
• Requires some domain knowledge
• Principal Component Analysis (PCA) to discover ‘best’ features (i.e. those
most relevant for making a prediction)
o Choose an appropriate algorithm or statistical analysis to tease
information from the data
o Train model and do model testing
o Make predictions for a desired property on test data
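As a rough illustration of the PCA step, here is a minimal sketch that finds the first principal component of a small feature table by power iteration on the covariance matrix. The function names and the two-feature data are hypothetical. Note that PCA ranks directions by how much variance they carry, so whether a high-variance direction is also relevant for the prediction still has to be checked against the target.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix and its explained-variance ratio."""
    cov = covariance(center(rows))
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # starting guess for the power iteration
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Hypothetical table: two strongly correlated features for five samples
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
component, explained = first_principal_component(rows)
```

Here almost all of the variance lies along one direction, which suggests the two features carry largely the same information.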
How to use Machine Learning in Materials Science?
K. Rajan, Materials Informatics, Materials Today, October 2005,
How to use Machine Learning in Materials Science?
The machine (or statistical)
learning methodology:
1. Material motifs are reduced to
numerical fingerprint vectors
2. A suitable measure of chemical
(dis)similarity, or ‘chemical distance’,
is used within a learning scheme (e.g.
linear regression or kernel ridge
regression) to map the distances to
properties
Ghanshyam Pilania, Scientific Reports, 3 : 2810, DOI: 10.1038/srep02810
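A minimal sketch of the two ingredients named above: a distance between fingerprint vectors, and a similarity derived from it that a kernel-based learner could use. The Euclidean distance, the Gaussian form of the similarity, the length scale and the fingerprints themselves are illustrative assumptions, not the choices made in the cited work.

```python
import math

def chemical_distance(u, v):
    """Euclidean distance between two fingerprint vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def gaussian_similarity(u, v, length_scale=1.0):
    """Gaussian kernel: 1 for identical fingerprints, decaying with distance."""
    d = chemical_distance(u, v)
    return math.exp(-d * d / (2.0 * length_scale ** 2))

# Hypothetical two-component fingerprints for three materials
fingerprints = {"A": [1.0, 0.0], "B": [1.0, 0.5], "C": [4.0, 4.0]}

sim_ab = gaussian_similarity(fingerprints["A"], fingerprints["B"])
sim_ac = gaussian_similarity(fingerprints["A"], fingerprints["C"])
```

Materials A and B are close in fingerprint space and so get a similarity near 1, while A and C are far apart and get a similarity near 0.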
How to use Machine Learning in Materials Science?
Iris Dataset: What is a good descriptor or feature?
Iris setosa
Iris virginica
Iris versicolor
The Iris dataset is a famous
dataset in machine learning
The machine (or
statistical) learning
methodology:
o Feature selection is important
o Which ‘features’ of the flowers
can help us to classify them?
How to use Machine Learning in Materials Science?
Iris Dataset: What is a good descriptor or feature?
setosa
virginica
versicolor
Feature
Selection:
1. Sepal length
2. Sepal width
3. Petal length
4. Petal width
How to use Machine Learning in Materials Science?
Iris Dataset: Exploiting ‘features’ to help build models
setosa
virginica
versicolor
Machine Learning:
o Use features for data
visualization
o Perform non-linear regression
on training data
o Make predictions on test data
The Where? And Who? of Machine Learning in
Materials Science
o Academia
• Research
• Materials Project (Materials Genome Initiative)
• NIMS
• Small community of researchers (a few at Harvard)
o Persons doing materials informatics
come from a very diverse set of
backgrounds and have a diverse set of
skills
• Statistics
• Computer science
• Materials science
• Physics
• Engineering
• Chemistry
o Industry
• Materials design for the energy
industry: photovoltaics, thermoelectrics, organic LEDs
• Automotive industry: lighter engines
A case study in Materials Informatics:
Machine learning with systematic density-functional theory calculations:
Application to melting temperatures of single- and binary-component solids
Dataset:
o Features of Compounds, X (input data)
o Melting temperature, Y (target variables)
o Have a new compound (e.g. FeSe) with unknown melting
temperature
o Would like to make a prediction: y* = f(x*)
o Model selection (f(x) = ? ):
• Ordinary least-squares regression (y = mx + c)
• partial least-squares regression
• support vector regression (SVR)
• Gaussian process regression (GPR)
o SVR makes the best predictions
• Compared actual data in a test set with
corresponding predictions
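The simplest of the listed models, ordinary least squares with y = mx + c, together with the test-set comparison, can be sketched as follows. This is a one-feature toy version with made-up numbers; the values are illustrative only and are not melting temperatures from the paper.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares estimates of slope m and intercept c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

# Illustrative values only: one feature x and a target y
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.1, 4.9, 7.2, 8.8]
test_x = [5.0, 6.0]
test_y = [11.0, 13.1]

m, c = fit_line(train_x, train_y)             # train on the training set
predictions = [m * x + c for x in test_x]     # y* = f(x*) for unseen inputs
test_error = rmse(test_y, predictions)        # compare with held-out data
```

The same train, predict, compare pattern applies to the other models; only `fit_line` and the prediction rule change.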
A case study in Materials Informatics:
Machine learning with systematic density-functional theory calculations:
Application to melting temperatures of single- and binary-component solids
Let’s focus on one model:
o Gaussian process regression (GPR)
Predictions of Melting Temperature using Gaussian process regression (GPR)
• GPR does a good job of making
predictions
• The kriging design finds the
compound with the highest
melting temperature much faster
than random designs
A. Seko et al., Physical Review B 89, 054303 (2014)
A case study in Materials Informatics:
Gaussian process regression (GPR)
Melting Temperature Optimization
Let’s focus on one model:
o Gaussian process regression (GPR)
• GPR does a good job of making
predictions
• The kriging design finds the
compound with the highest
melting temperature much faster
than random designs
Figure axes: target property vs. compound
A. Seko et al., Physical Review B 89, 054303 (2014)
Gaussian Process Regression
• GPR is a Bayesian regression technique
• Solves nonlinear estimation problems
• A Gaussian process is a generalization of the multivariate Gaussian probability
distribution
Bivariate
Gaussian
Distribution:
Multivariate Gaussian
distribution:
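• The prediction f(x∗) at a point x∗ and its variance v(f∗) are described by using the
Gaussian kernel function as follows:
$$ f(x_*) = k_*^\top (K + \sigma^2 I)^{-1} y $$
$$ v(f_*) = k(x_*, x_*) - k_*^\top (K + \sigma^2 I)^{-1} k_* $$
where k∗ = [k(x1,x∗), . . . ,k(xN,x∗)]⊤ is the vector of kernel values between x∗ and the training
examples, K is the N × N kernel matrix with entries k(xi,xj), y is the vector of training targets,
I is the unit matrix, and the noise distribution has variance σ².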
A. Seko et al., Physical Review B 89, 054303 (2014)
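A minimal sketch of the prediction step, assuming the standard GPR expressions f(x∗) = k∗⊤(K + σ²I)⁻¹y and v(f∗) = k(x∗,x∗) − k∗⊤(K + σ²I)⁻¹k∗ with a Gaussian kernel on a single scalar descriptor. The function names, length scale, noise variance and training values are illustrative assumptions and are not taken from Seko et al.

```python
import math

def kernel(a, b, length_scale=1.0):
    """Gaussian kernel k(a, b) for scalar inputs (length scale is an assumption)."""
    return math.exp(-(a - b) ** 2 / (2.0 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gpr_predict(train_x, train_y, x_star, noise_var=1e-6):
    """Return the predictive mean f(x*) and variance v(f*)."""
    n = len(train_x)
    # K + sigma^2 I
    K = [[kernel(train_x[i], train_x[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [kernel(x, x_star) for x in train_x]
    alpha = solve(K, train_y)   # (K + sigma^2 I)^-1 y
    beta = solve(K, k_star)     # (K + sigma^2 I)^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_star, x_star) - sum(k * b for k, b in zip(k_star, beta))
    return mean, var

# Illustrative training data: scalar descriptor x, target y in arbitrary units
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 0.8, 0.9, 0.1]

mean_near, var_near = gpr_predict(train_x, train_y, 1.0)    # at a training point
mean_far, var_far = gpr_predict(train_x, train_y, 100.0)    # far from all data
```

Near a training point the variance collapses towards the noise level; far from the data the prediction falls back to the prior, with mean 0 and variance k(x∗,x∗) = 1. That variance is what a Bayesian search uses to decide where to sample next.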
GPR, Function-space View:
We use a Gaussian process (GP) to describe a distribution over
functions
o The covariance between the
outputs is written as a function
of the inputs
o The covariance is almost unity
between variables whose
corresponding inputs are very
close, and decreases as their
distance in the input space
increases.
Bayesian probability:
C. Rasmussen & C. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006,
o It can be shown that the
squared exponential covariance
function corresponds to a
Bayesian linear regression
model with an infinite number of
Gaussian shaped basis
functions.
o This property implies a
GPR, Function-space View:
We use a Gaussian process (GP) to describe a distribution over
functions
We are interested in the conditional probability p(y∗|y):
“given the data, how likely is a certain prediction for y∗?”
Gaussian Process Regression
Consider linear model:
Likelihood:
Posterior:
• To predict for a test case we average over
all possible parameter values, weighted by
their posterior probability
• Thus the predictive distribution for f∗ at x∗
is given by averaging the output of all
possible linear models w.r.t. the Gaussian posterior
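The equations for this slide are not present in the transcript. Assuming the slide follows the weight-space view of Rasmussen & Williams (cited earlier), with a Gaussian prior w ~ N(0, Σ_p), noise variance σ_n², and a design matrix X whose columns are the training inputs, the model, likelihood, posterior and predictive distribution would read:

```latex
\begin{align}
f(\mathbf{x}) &= \mathbf{x}^\top \mathbf{w}, \qquad
y = f(\mathbf{x}) + \varepsilon, \qquad
\varepsilon \sim \mathcal{N}(0, \sigma_n^2) \\
p(\mathbf{y} \mid X, \mathbf{w}) &= \mathcal{N}\!\left(X^\top \mathbf{w},\; \sigma_n^2 I\right) \\
p(\mathbf{w} \mid X, \mathbf{y}) &= \mathcal{N}\!\left(\sigma_n^{-2} A^{-1} X \mathbf{y},\; A^{-1}\right),
\qquad A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1} \\
p(f_* \mid \mathbf{x}_*, X, \mathbf{y}) &= \mathcal{N}\!\left(\sigma_n^{-2}\, \mathbf{x}_*^\top A^{-1} X \mathbf{y},\;
\mathbf{x}_*^\top A^{-1} \mathbf{x}_*\right)
\end{align}
```

The last line is the average over all linear models weighted by the posterior: its mean is the posterior mean of w applied to x∗, and its variance grows with the size of x∗.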
Gaussian Process Regression
o The Bayesian linear model suffers from limited
expressiveness. Fix: project the inputs into some
high dimensional space using a set of basis
functions
o Change the representation:
An algorithm defined in terms
of inner products in input
space can be lifted into feature
space by replacing
occurrences of those inner
products by k(x, x′)
Gaussian Process Regression
Algorithm for Bayesian optimization:
1. An initial training set is prepared by randomly choosing compounds
2. A compound is selected based on GPR. The compound is chosen as the one with the
largest probability of getting beyond the current best value fcur. Since the probability
is a monotonically increasing function of the z score, the compound with the
highest z score is chosen from the pool of unobserved materials
3. The melting temperature of the selected compound is observed
4. The selected compound is added into the training data set. Then the simulation goes
back to step (2)
5. Steps (2)–(4) are repeated until all data of melting temperatures are included in the
training set
A. Seko et al., Physical Review B 89, 054303 (2014)
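The loop above can be sketched in a few lines. This is a toy version under stated assumptions: a single scalar descriptor per compound, a Gaussian kernel with unit length scale, targets centered on the mean of the observed values before fitting, and a z score defined as (f(x) − fcur) / sqrt(v(f)). The pool values are synthetic and all function names are hypothetical.

```python
import math
import random

def kernel(a, b, length_scale=1.0):
    """Gaussian kernel between two scalar descriptors."""
    return math.exp(-(a - b) ** 2 / (2.0 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def gpr_predict(xs, ys, x_star, noise_var=1e-4):
    """GPR mean and variance at x_star, fitted to mean-centered targets."""
    offset = sum(ys) / len(ys)
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [kernel(x, x_star) for x in xs]
    alpha = solve(K, [y - offset for y in ys])
    beta = solve(K, k_star)
    mean = offset + sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_star, x_star) - sum(k * b for k, b in zip(k_star, beta))
    return mean, max(var, 1e-12)  # guard against rounding below zero

def bayesian_search(pool_x, pool_y, n_initial=2, seed=0):
    """Return the order in which the pool indices are observed."""
    rng = random.Random(seed)
    observed = rng.sample(range(len(pool_x)), n_initial)       # step 1
    while len(observed) < len(pool_x):                         # step 5
        xs = [pool_x[i] for i in observed]
        ys = [pool_y[i] for i in observed]
        f_cur = max(ys)                                        # current best

        def z_score(i):
            mean, var = gpr_predict(xs, ys, pool_x[i])
            return (mean - f_cur) / math.sqrt(var)

        candidates = [i for i in range(len(pool_x)) if i not in observed]
        chosen = max(candidates, key=z_score)                  # step 2
        observed.append(chosen)   # steps 3-4: "observe" pool_y[chosen], add it
    return observed

# Synthetic pool: descriptor x and a hidden target in arbitrary units
pool_x = [0.5 * i for i in range(10)]
pool_y = [math.sin(x) for x in pool_x]

order = bayesian_search(pool_x, pool_y)
best = max(range(len(pool_y)), key=lambda i: pool_y[i])
evaluations_needed = order.index(best) + 1  # evaluations until the best is found
```

In a real application the lookup `pool_y[chosen]` would be replaced by an experiment or a first-principles calculation, and the loop would normally stop once the budget is spent rather than when the pool is exhausted.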
Gaussian Process Regression
Choice of features
Feature Selection:
o Requires some domain knowledge
o The more features the better
o Some features may not affect
prediction
Gaussian Process Regression
Choice of features
Feature Selection:
o Can generate new features by
‘combining’ existing features
o Perform calculations to create new
features.
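One simple way to generate new features by combining existing ones is to form pairwise products and ratios. The sketch below does that for a single compound; the descriptor names and values are hypothetical, and which combinations are physically meaningful remains a matter of domain knowledge.

```python
from itertools import combinations

def combine_features(features):
    """Return the original features plus pairwise products and ratios."""
    combined = dict(features)
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        combined[f"{name_a}*{name_b}"] = a * b
        if b != 0:  # skip ratios that would divide by zero
            combined[f"{name_a}/{name_b}"] = a / b
    return combined

# Hypothetical descriptors for one compound (names and values are illustrative)
base = {
    "mean_atomic_number": 30.0,
    "mean_valence_electrons": 7.0,
    "volume_per_atom": 20.0,
}
expanded = combine_features(base)
```

Three base features become nine here. The count grows quadratically with the number of base features, so a selection step is usually needed afterwards.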
Gaussian Process Regression
Benchmarking – model testing
o Evaluate our Bayesian search algorithm by
comparison with a standard (e.g. a random search)
Melting Temperature Optimization
o Select subpopulation of n compounds, e.g. n = 3
o Within a subpopulation perform a search and report
the value you want to optimize (melting temperature)
• Bayesian search
• Random search
o Take the average of a search through a subpopulation
o Increase n and repeat
Bayesian search finds the optimal
value in fewer evaluations
A. Seko et al., Physical Review B 89, 054303 (2014)
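The random-search baseline in this benchmark can be computed exactly for small subpopulations by averaging over every possible visiting order. A minimal sketch, with hypothetical function names and illustrative target values:

```python
from itertools import permutations

def evaluations_to_find_best(order, values):
    """Number of evaluations until the largest value has been observed."""
    best_index = max(range(len(values)), key=lambda i: values[i])
    return order.index(best_index) + 1

def random_search_average(values):
    """Exact average over all possible random visiting orders."""
    orders = list(permutations(range(len(values))))
    total = sum(evaluations_to_find_best(order, values) for order in orders)
    return total / len(orders)

# Illustrative target values for subpopulations of n = 3 and n = 4 compounds
average_n3 = random_search_average([1200.0, 950.0, 1430.0])
average_n4 = random_search_average([1200.0, 950.0, 1430.0, 1100.0])
```

For a random search the best compound is equally likely to sit at any position, so the average is (n + 1) / 2 evaluations. A Bayesian search is useful to the extent that its average falls below this line as n grows.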
Summary for Case Study:
Gaussian Process Regression
o Statistical nonlinear model implemented as Gaussian Process Regression
• Bayesian optimization algorithm
o Bayesian regression model makes faithful predictions
o Bayesian search can find the optimal compound faster than a random search
Predictions of Melting Temperature using Gaussian process regression (GPR)
A. Seko et al., Physical Review B 89, 054303 (2014)
Why is this useful for materials discovery?
Summary:
Materials Informatics
o Materials informatics is a way of identifying and
exploiting trends in materials’ properties data
o Data availability and access to computing power
along with machine learning algorithms make this
approach viable
Outlook:
o Machine learning to guide experiments in real-time
o Use statistical inference to uncover new physics and guide the
discovery of new physical principles
Bayesian statistics
Bayes Rule:
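The equation for this slide is not present in the transcript. In the regression setting used throughout, with weights w, inputs X and observed targets y (notation assumed here), Bayes' rule reads:

```latex
p(\mathbf{w} \mid \mathbf{y}, X)
  = \frac{p(\mathbf{y} \mid X, \mathbf{w})\; p(\mathbf{w})}{p(\mathbf{y} \mid X)},
\qquad
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}
```

The marginal likelihood in the denominator does not depend on w; it only normalizes the posterior.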