Copyright ©JCPDS-International Centre for Diffraction Data 2006 ISSN 1097-0002
DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA
C. K. Lowe-Ma, A. E. Chen, D. Scholl
Physical & Environmental Sciences, Research and Advanced Engineering
Ford Motor Company, Dearborn, Michigan, USA
C. J. Gilmore, R. J. Thatcher
Chemistry Department, University of Glasgow, Glasgow, Scotland, UK
W. Sverdlik
Department of Computer Science, Eastern Michigan University, Ypsilanti, Michigan, USA
Abstract
High-Throughput Materials Discovery uses automation and parallelism to synthesize and
evaluate large numbers of specimens while reducing time and costs associated with finding and
optimizing novel materials. As optimal performance may not be uniformly distributed
throughout parameter space, efficient tools for analyzing data and evaluating large areas of
compositional or parameter space are needed. Data mining tools enable moving from the
statistics of limited experimental designs to more descriptive and predictive relationships.
Clustering a set of 47 samples for which both X-ray powder diffraction data and X-ray
fluorescence-based elemental composition data were available showed that elemental
composition correlated strongly with phase composition in this particular set of samples. Also,
the clustering of the X-ray data was found to be exactly coincident with a different sample
characteristic "type". Decision tree classification of a larger data set of 86 samples showed that
"type" could be defined with very few errors from relatively few splits of the XRF-based
compositions. Although composition exhibited strong clustering, measures of performance in
these same samples exhibited only very weak clustering. However, performance of the materials
could be predicted from linear regression using different slices of the data. Neural nets were
attempted for improved predictability of performance beyond linear regression. As expected
from the linear regression results, single-output linear-based multi-layer perceptrons yielded
acceptable predictive capability, but were found to yield notably degraded predictive results if
"type" was excluded from the models. The strong dependence of performance on "type" for
these samples was an unexpected outcome of the data analysis.
Introduction
High-Throughput Materials Discovery makes use of automated instrumentation and parallelism
to synthesize and test large numbers of specimens (Figure 1). The foundation of this approach is
that more can be learned from experiments on a widely diverse set of specimens than from
complex, detailed measurements on simple systems or on measurements of a limited number of
samples. Automated instrumentation and large numbers of specimens imply that large amounts
of data will be generated, so efficient methods are needed for evaluating data from
large areas of compositional or parameter space.
This document was presented at the Denver X-ray Conference (DXC) on Applications of X-ray Analysis.
Sponsored by the International Centre for Diffraction Data (ICDD).
This document is provided by ICDD in cooperation with the authors and presenters of the DXC for the express purpose of educating the scientific community.
All copyrights for the document are retained by ICDD.
Usage is restricted for the purposes of education and scientific research.
DXC Website: www.dxcicdd.com
ICDD Website: www.icdd.com
Although standard experimental design (DOE) statistical tools can provide a basis for selecting
parameters and interpreting results, DOE tools are inherently constrained to the parameter space
examined. We would like to take knowledge gleaned from a wide diversity of specimens,
describe our knowledge about these specimens, and develop predictions about regions in
parameter space where further studies would be warranted. Describing data and developing
predictions falls in the realm of data mining. Instead of the inward deductive data focus of DOE
and statistical analysis tools, data mining emphasizes learning from examples and extrapolating
to more general descriptive or predictive models through the use of a variety of artificial
intelligence, pattern recognition, and machine learning algorithms. Effective data mining is all
about how to formulate questions that are meaningful or sensible and how to prepare data to
correctly answer those questions. Unfortunately, no general recipes exist for designing good
questions nor for preparing data, especially scientific data, although some useful general
references are available. 1,2 Types of standard data mining algorithms that might be used to
answer questions are listed in Table 1. In this paper, clustering, regression, decision tree
classification, and neural nets were used to examine relationships in a dataset that contained
both quantitative X-ray fluorescence compositions and X-ray powder diffraction data.
[Figure 1 diagram: workflow linking Design Experiment (DoE Tools), Robotic Synthesis, Parallel Screening, Database, and Data Reduction and Data Mining]
Figure 1. Ford Motor Company implementation of High-Throughput Materials Discovery
Results and Discussion
As previously mentioned, one of the biggest challenges in data mining is data preparation.
Although many vendors offer very capable software for handling X-ray powder diffraction data,
we developed a fully automated empirical algorithm for background subtraction using Python.
The algorithm (Equation 1) uses a 6-parameter fit with complex non-linear weighting but
requires only a single input parameter from the user specifying an estimate of where background
is relative to the last few points at the high-angle end of the scan. The algorithm fits both the
low-angle scatter arising from powder surface roughness and the flat background expected at
higher angles from off-axis-cut zero-background quartz substrates (Figure 2). Minimization is
achieved using a Nelder-Mead simplex.
$$y = a_1 + a_2\,\frac{1}{x} + a_3\,\frac{1}{x^{2}} + a_4\,\frac{1}{x^{3}} + a_5\,e^{x^{2}/a_6} \qquad \text{(Eqn. 1)}$$
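As a rough illustration of the kind of fit described above, the following sketch evaluates a six-parameter background and minimizes a weighted misfit with a small Nelder-Mead simplex search. It is not the authors' code. The exponential term is taken here as exp(x^2/a6), which is an assumed reading of Equation 1; the asymmetric weighting (`peak_weight`) is only a stand-in for the unspecified non-linear weighting; and the names `background`, `misfit`, and `nelder_mead` are hypothetical.

```python
import math

def background(x, a):
    """Six-parameter background; the exp(x**2 / a6) term is an assumed reading of Eqn. 1."""
    a1, a2, a3, a4, a5, a6 = a
    return a1 + a2 / x + a3 / x**2 + a4 / x**3 + a5 * math.exp(x**2 / a6)

def misfit(a, xs, ys, peak_weight=0.01):
    """Weighted sum of squares in which points above the curve (likely peaks) count less.
    This weighting is an illustrative assumption, not the weighting used in the paper."""
    total = 0.0
    try:
        for x, y in zip(xs, ys):
            r = y - background(x, a)
            total += (peak_weight if r > 0 else 1.0) * r * r
    except (OverflowError, ZeroDivisionError):
        return math.inf  # reject parameter sets that blow up the exponential
    return total

def nelder_mead(f, x0, step=0.1, tol=1e-9, max_iter=5000):
    """Minimal Nelder-Mead simplex search (reflection, expansion, contraction, shrink)."""
    n = len(x0)
    simplex = [list(x0)] + [
        [v + (step if i == j else 0.0) for j, v in enumerate(x0)] for i in range(n)
    ]
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Stop once every vertex lies within tol of the best vertex.
        if max(abs(p[j] - best[j]) for p in simplex for j in range(n)) < tol:
            break
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:  # shrink every vertex toward the best one
                simplex = [best] + [
                    [b + 0.5 * (v - b) for b, v in zip(best, p)] for p in simplex[1:]
                ]
    return min(simplex, key=f)
```

A call such as `nelder_mead(lambda a: misfit(a, two_theta, counts), start)` would return fitted parameters for one scan, where `two_theta`, `counts`, and `start` are placeholders for the measured data and an initial guess. In practice the step size would need to be scaled to each parameter.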
The advantage of using this algorithm for background subtraction is that all diffraction scans are
treated the same and a very large number of data files can be handled very efficiently by listing
the filenames in a batch run file. Following background subtraction, the X-ray powder
diffraction data can then be further processed. For the analyses described below, the X-ray
powder diffraction data were subsequently processed using PolySNAP. 3,4
Table 1. Types of Data Mining Algorithms
Regression (numerical data)
Linear and multiple regression
Regression and model trees
Adaptive neural nets, multilayer nets
Genetic algorithms
Descriptive Data
Visualization
Statistical exploratory data analysis
Market basket analyses, a priori algorithms
Decision trees and lists
Hierarchical clustering
Textual analyses
Instance-based classifiers
K-means clustering
Image analysis and segmentation
Classification Models
Version space hypotheses
Perceptron neural nets
Genetic algorithms
Other
Expectation Maximization clustering
Bayesian inference
Figure 2. Two X-ray powder diffraction scans showing the effectiveness of the new algorithm in fitting a
background. The red line is the fitted background, y in Equation 1.
The X-ray powder diffraction data were obtained with either a PAD-V or an X2 Scintag powder
diffractometer equipped with a copper-target X-ray tube. Data were collected with continuous
scans and electronic integration over 0.03° 2θ. The X-ray fluorescence data were obtained with a
Philips PW2400 with a chromium tube using UniQuant5 and sensitivities optimized using
additional in-house calibration standards and with background channels customized to better
handle the chemistries of these samples. The resulting output of oxide weight percentages was
converted to moles of each element.
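The conversion from oxide weight percentages to moles of each element can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the three oxides listed are examples only (the paper does not state the sample chemistry), the molar masses are rounded standard values, and the 100 g basis and the function name are assumptions.

```python
# oxide -> (element, oxide molar mass in g/mol, cations per formula unit)
# Illustrative oxides only; the actual sample chemistry is not given in the paper.
OXIDES = {
    "Al2O3": ("Al", 101.96, 2),
    "SiO2": ("Si", 60.08, 1),
    "Fe2O3": ("Fe", 159.69, 2),
}

def oxide_wt_to_element_moles(oxide_wt_percent, basis_grams=100.0):
    """Convert oxide weight percentages to moles of each element per `basis_grams` of sample."""
    moles = {}
    for oxide, wt in oxide_wt_percent.items():
        element, molar_mass, n_cations = OXIDES[oxide]
        grams = basis_grams * wt / 100.0
        moles[element] = moles.get(element, 0.0) + n_cations * grams / molar_mass
    return moles
```

For example, 50.98 wt% Al2O3 corresponds to 0.5 mol of oxide per 100 g of sample, and therefore 1.0 mol of Al.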
The data were prepared such that relationships between phase composition, elemental
composition, and performance could be examined. Merging data from different characterization
techniques yielded two data sets: (1) a set of 47 samples with X-ray powder diffraction (XRD)
data, elemental compositions from X-ray fluorescence (XRF), surface area, and four measures of
performance; (2) a related data set containing 86 samples with XRF data, surface area, a
parameter for history (sample aging), and four measures of performance but without XRD data.
Data sets (1) and (2) were initially examined for natural groupings in the data with clustering.
STATISTICA5 was used for hierarchical clustering of the XRF, surface area, and performance
data. Similarity clustering of the XRF, surface area, and performance data in various
combinations with the XRD data was accomplished with the three-way multidimensional scaling
of PolySNAP.6,7 For more predictive models, regression and decision tree classification were
accomplished with the open-source software WEKA6. Neural nets were developed using
STATISTICA Neural Nets.
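To make the idea of decision tree classification concrete, the sketch below finds the single best threshold split on one numeric variable, which is the basic step a tree learner repeats. It uses Gini impurity as the split criterion; that choice is an assumption for illustration and is not necessarily the criterion used by the WEKA tree learners. The function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split of the form value <= threshold.
    The threshold is None if no split improves on the unsplit impurity."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2.0, impurity)
    return best
```

Applied to one element's molar amount and a class label for each sample, a returned impurity near zero would indicate that a single split on that element separates the classes almost perfectly.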
Figure 3. (a) The clustering of the XRPD data in data set (1) by multi-dimensional scaling in PolySNAP. (b)
The clustering of the XRF data in data set (1) also by multi-dimensional scaling. Although difficult to see in
these images, the cluster membership is exactly the same for both types of data.
Figure 4. From PolySNAP using data set (1), similarity clustering of a subset of the elemental data from XRF
(a) without surface area and (b) with surface area included.
As illustrated in Figure 3, cluster membership is found to be the same for both types of X-ray
data, XRD and XRF. Therefore, the phase composition has a strong relationship to elemental
composition. Different variations in specimen composition are related to the presence of
different phases. Examination of the cluster membership shows that the members accurately
reflect a descriptor “sample type” that was derived from other information unrelated to any
chemical or characterization measurements, e.g., “sample type” reflects the source from which
the chemicals originated.
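Checking that two clusterings have the same membership, or that a clustering matches a descriptor such as "sample type", amounts to testing whether two labelings define the same partition regardless of how the groups are named. A minimal sketch, with a hypothetical function name, is:

```python
def same_partition(labels_a, labels_b):
    """True if two labelings group the samples identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same samples")
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one list must always pair with the same label in the other.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True
```

Here `labels_a` might hold the cluster numbers assigned from the XRD data and `labels_b` those from the XRF data, listed in the same sample order.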
Figure 5. Similarity clustering from PolySNAP that results from adding XRD data to surface area and a
subset of XRF elemental data (data set 1).
Our knowledge of the samples tells us that not all of the elemental composition should be related
to “sample type”. Manually selecting a subset of the XRF data enables probing relationships
beyond the influence of “sample type”. However, the subset of XRF data exhibits relatively
weak clustering (Figure 4a). Including surface area with the XRF data changes the clustering
membership (Figure 4b) but does not strengthen the relationships. Hierarchical clustering using
complete Euclidean linkage distances for the same subset of XRF data but from the larger data
set (2) of 86 samples still yields poor clustering with very small linkage distances. However,
inclusion of surface area in the hierarchical clustering of the larger data set does yield more
numerically significant linkage distances and more distinct clusters. Not surprisingly, because
the XRD data contain information so strongly related to “sample type” (Figure 3a), the addition
of XRD clustering to the XRF subset-surface area clustering imposes a more definite structure in
the overall clustering (Figure 5). Nevertheless, surface area and the XRF subset of data
influence the cluster membership compared with Figure 3a. Examination of the clustering
relationships for the four measures of performance indicates that the performance data alone
show no strong tendency to cluster.
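The role of linkage distance in the hierarchical clustering discussed above can be illustrated with a small agglomerative routine using complete linkage on Euclidean distances. This is a plain-Python sketch for small data sets, not the STATISTICA implementation, and the function name is hypothetical.

```python
import math

def complete_linkage(points):
    """Agglomerative clustering with complete linkage.
    Returns the merges in order as (members_a, members_b, linkage_distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def dist(a, b):
        # Complete linkage: the largest Euclidean distance between any pair of members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, ia, ib = min(
            (dist(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        merges.append((clusters[ia], clusters[ib], d))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [merged]
    return merges
```

A large jump in linkage distance between the early merges and the last few merges indicates well-separated clusters; uniformly small linkage distances correspond to the poor clustering described in the text.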
The larger data set (2) of 86 samples but without XRD data was used to test for efficacy in
predicting performance. For the prediction model building, selecting amongst the possible
twenty-one primary variables, thirteen derived variables, and four response variables was
accomplished by either the independent feature selection heuristic of STATISTICA Data Miner or
by using in each technique the embedded algorithms that selectively add or subtract parameters.
Rather surprisingly, linear regression models for all four measures of performance could be
found with correlation coefficients ranging from 0.84 to 0.96.7 Different combinations of XRF
elements, surface area, and history parameter yield statistically comparable models although all
models included “sample type”. Decision tree classification shows that “sample type” can be
defined with very few errors from relatively few splits of the XRF-based compositions, which is
consistent with the clustering observed using PolySNAP (Figure 3b). To examine the influence
of “sample type” on the regression models, “sample type” and elements defining “sample type”
were excluded, but the history parameter and various combinations of surface area with
remaining XRF elements were included. With these exclusions, the correlation coefficients for the linear
regression models dropped significantly to 0.77 to 0.82. This leads us to conclude that the
measures of performance that were tested do depend to some extent on aging history of the
specimens, surface area, and other aspects of composition besides “sample type”, but that for
these particular materials, “sample type” is a significant factor related to the performance of the
materials. Predictive models developed using neural nets show the same trend; predictions are
notably degraded without the inclusion of “sample type”. The predictive capabilities of neural
net models are further degraded if multiple predictions are attempted. This suggests that the
parameters remaining after removing “sample type” may be only weakly related to performance
and may be insufficiently independent to successfully predict material performance.
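The comparison of regression models with and without "sample type" can be sketched as an ordinary least-squares fit followed by the correlation between predicted and observed values. The numbers below are made-up toy values chosen only to show the mechanics; they are not data from this study, and the function names are hypothetical. The fit uses the normal equations, which is adequate for a small example but is not how WEKA selects or fits its models.

```python
import math

def fit_linear(rows, y):
    """Ordinary least squares with an intercept, solved by Gauss-Jordan elimination."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented normal equations: (X^T X | X^T y)
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * t for x, t in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                A[r] = [vr - A[r][col] * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

def predict(coeffs, row):
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))

def correlation(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))

# Toy values (not from the paper): one numeric predictor and a 0/1 "type" indicator.
area = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kind = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
perf = [3.0, 10.0, 7.0, 14.0, 11.0, 18.0]

with_type = list(zip(area, kind))
without_type = [(a,) for a in area]
coeffs_with = fit_linear(with_type, perf)
coeffs_without = fit_linear(without_type, perf)
r_with = correlation([predict(coeffs_with, r) for r in with_type], perf)
r_without = correlation([predict(coeffs_without, r) for r in without_type], perf)
```

In this toy case the correlation coefficient falls when the indicator is dropped, which is the same qualitative pattern reported for the real models.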
Conclusions
XRD phase composition and XRF elemental composition were found to yield the same
clustering and, hence, both types of X-ray data have a strong relationship to each other in the
specimens examined. Cluster membership of the X-ray data was found to be indicative of an
unrelated descriptor “sample type”. Models developed for these data sets needed the inclusion of
“sample type” to be effective in predicting performance. Although the dependence on “sample
type” is, perhaps, not surprising in retrospect, models independent of sample type would be more
useful. Hence, the next step for extending our data mining is to find other descriptors that
improve prediction of performance without requiring the inclusion of sample type in the model.
Dimensionality reduction of spectral-type X-ray data may yield other descriptors useful for
modeling performance. Improved predictive models would guide us to other regions in
parameter space in which to search for new or optimized materials.
References
1. Data Mining, Ian Witten and Eibe Frank (2000); Machine Learning, Tom Mitchell (1997); Data Mining: Concepts and Techniques, J. Han and M. Kamber (2001); Data Mining: Concepts, Models, Methods, and Algorithms, Mehmed Kantardzic (2003).
2. Data Preparation for Data Mining, Dorian Pyle (1999).
3. PolySNAP, Bruker AXS; also G. Barr, W. Dong, C.J. Gilmore (2004). “PolySNAP: a computer program for analysing high-throughput powder diffraction data”, J. Appl. Cryst. 37, 658.
4. C.J. Gilmore, G. Barr, J. Paisley (2004). “High-throughput powder diffraction. I. A new approach to qualitative and quantitative powder diffraction pattern analysis using full pattern profiles”, J. Appl. Cryst. 37, 231; G. Barr, W. Dong, C.J. Gilmore (2004). “High-throughput powder diffraction. II. Applications of clustering methods and multivariate data analysis”, J. Appl. Cryst. 37, 243.
5. StatSoft, Inc. (2005). STATISTICA 7.1 or STATISTICA Data Miner, version 7. www.statsoft.com.
6. Ian H. Witten and Eibe Frank (2005). "Data Mining: Practical machine learning tools and techniques", 2nd Edition, Morgan Kaufmann, San Francisco; see also Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/ .
7. Using WEKA’s “Greedy” algorithm for linear regression models with the outlier (sample P31) removed.