Copyright ©JCPDS - International Centre for Diffraction Data 2004, Advances in X-ray Analysis, Volume 47.
FINDING DESCRIPTORS USEFUL FOR DATA MINING
IN THE CHARACTERIZATION DATA OF CATALYSTS
C. K. Lowe-Ma, A. R. Drews, A. E. Chen
Research & Advanced Engineering, Ford Motor Company, Dearborn, Michigan, USA
ABSTRACT
The ultimate goal of materials characterization is often to optimize materials by relating
observed features to a response function or performance specification. For X-ray data to be
successfully included in statistical or data mining methodologies that examine contributions to a
response function, sufficient pieces of information of the right kind must be extracted from the
X-ray data. Traditional X-ray analysis methods using individual comparisons cannot keep up
with the flux of specimens and data needed for data mining approaches to materials optimization.
The work described herein focuses on obtaining descriptors from X-ray fluorescence, X-ray
powder diffraction, and other characterization data from automotive exhaust-gas catalysts using
automated or semi-automated processes, and relating these descriptors to other performance
measures. Our results are also relevant to informatics requirements for high-throughput
screening and combinatorial studies.
INTRODUCTION
The goal of this work is to combine X-ray powder diffraction features with other characterization
data to build up mathematical relationships in automated or semi-automated processes that not
only describe existing data, but can also predict results and materials performance. These new
data analysis approaches can (a) help to develop better catalysis strategies and new materials, (b)
help to understand failure mechanisms, and (c) help in examining large numbers of fleet and
customer-aged catalysts for usage-dependent aging. Understanding and improving automotive
exhaust-gas catalysis enables us to improve air quality by continuing to mitigate undesirable
exhaust gases.
Although automotive exhaust-gas catalysts are only one component of a complex exhaust
emissions system, the catalysts themselves are also complex heterogeneous chemical systems
designed to perform multiple functions. An example of an automotive exhaust-gas catalyst is
shown in Figure 1.
FINDING AND USING DESCRIPTORS
Statistical methods and data mining algorithms have evolved to handle discrete bits of
information that are abstractions of data that may not necessarily have simple physical
interpretations. Ideal descriptors are those that can provide real distinctions amongst data
without redundancy. The value of descriptors derives from using them to enable comparisons
between disparate, unrelated types of characterization data. One of the challenges of obtaining
useful descriptors is that the results of subsequent statistical analyses or data mining
algorithms may depend strongly on the form and choice of descriptors. [1]
Complex data, such as X-ray powder diffraction patterns of catalyst materials, are often difficult
and very time consuming to interpret, making detailed individual interpretations impractical if
large numbers of samples are involved. Accurately predicting catalyst performance over a wide
variety of scenarios requires models built upon large numbers of specimens and built upon data
from different sources, hence the need to find data-driven descriptors that can provide the most
useful information from the fewest variables in an automated fashion.
Figure 1. Shown at the left is a catalytic converter brick for a vehicle. The magnified image in the
middle shows the channels in the brick through which engine exhaust gases pass. The electron
microscopy image at the right is an image of a single corner of a channel and shows the active catalytic
material that has been washcoated onto the substrate. (Image labels: substrate; active catalytic material.)
Most physical characterization techniques (and their associated software) have evolved to
examine one (or small n) sample(s) at a time. For example, X-ray powder diffraction scans are
collected sequentially, one at a time, on an individual specimen. Each resulting diffraction scan
is processed either by hand or in a batch mode for baseline correction, possibly some additional
geometric corrections, and peak picking. Each processed diffraction scan is then analyzed to
identify phases present, possibly analyzed for crystallite size or quantitative information, and
relationships to other characterization data are deduced manually. The limitations of this
conventional approach are obvious: it is problematic for materials containing many phases with
severe overlap; it is problematic for complex mixtures of crystalline and poorly crystalline
materials; and it is certainly problematic for handling large numbers of diffraction patterns
containing many phases of variable crystallinity mixed with highly crystalline (but uninteresting)
substrate phases. Figure 2 shows representative powder diffraction data obtained from the active
catalytic material scraped from three different catalysts and illustrates how unrealistic a
conventional approach might be.
Several approaches to computationally examining diffraction patterns were considered.
Described herein are results obtained by: (a) using whole-pattern (SNAP)-derived correlations
and peaks as descriptors, (b) using expectation maximization to bin peaks into clusters and using
the clusters as descriptors, and (c) using principal component analysis of large regions of raw
powder pattern data to obtain key factors with high variance (high information content).
COMPUTATIONALLY-DERIVED DESCRIPTORS INSTEAD OF PHASE DESCRIPTORS
Using the non-parametric whole-pattern analysis described by Gilmore et al. [2] (also called
SNAP), correlations amongst a set of catalyst diffraction patterns were obtained from pair-wise
pattern comparisons between catalyst diffraction patterns and a standard set of reference patterns
for different ceria-zirconia compositions and crystallinity. These correlations, when examined
for clustering, yield the hierarchical cluster analysis tree (without pruning and using standard tree
clustering algorithms [3]) shown in Figure 3. The resulting tree exhibits three (or possibly four)
major clusters. These correlation values (or the mean value of each cluster) could, therefore, be
used in subsequent analyses as a variable (or a descriptor) representative of the type and
composition of ceria-zirconia present in the catalyst.
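The idea of turning pair-wise pattern correlations into clusters can be sketched as follows. This is only an illustration under stated assumptions: plain Pearson correlation stands in for the SNAP matching of reference [2], a simple average-linkage merge stands in for the tree clustering of reference [3], and the four synthetic single-peak patterns, the function names, and the 2-theta grid are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length intensity lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_linkage(dist, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average distance between all members of the two clusters
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return sorted(sorted(c) for c in clusters)

def peak(grid, center, width=0.3):
    """Synthetic single-peak pattern (hypothetical data, not from the paper)."""
    return [math.exp(-0.5 * ((t - center) / width) ** 2) for t in grid]

grid = [20 + 0.05 * k for k in range(400)]   # assumed 2-theta grid in degrees
patterns = [peak(grid, 28.5), peak(grid, 28.6),   # two similar patterns
            peak(grid, 33.0), peak(grid, 33.1)]   # two other similar patterns

corr = [[pearson(p, q) for q in patterns] for p in patterns]
dist = [[1.0 - c for c in row] for row in corr]   # high correlation = small distance
labels = average_linkage(dist, n_clusters=2)
print(labels)
```

The correlation values themselves (or a per-cluster mean) would then be the descriptor carried into later analyses, as described in the text.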
Figure 2. Representative diffraction data obtained from the catalytic material in automotive catalysts.
Figure 3. Hierarchical cluster analysis tree illustrating the clustering of correlation coefficients obtained
from SNAP (see text). The horizontal axis shows individual data labels.
A concern with using SNAP-derived correlations is that every pair-wise comparison yields only
a single value that may not adequately represent the complexity and subtle differences between
diffraction patterns containing data of the type shown in Figure 2. For this reason, an approach
using peaks was also considered.
Peak positions and intensities (d's & I's) have, historically, been used as short-hand descriptors
for powder patterns. However, the mechanics of the approach we used were rather different and
more amenable to computational methods. A complete list of all observed peak positions from a
number of diffraction patterns was used as input to a clustering algorithm using expectation
maximization [4]. Expectation Maximization examines distributions of data and develops naïve
Bayesian probabilities about which data values (peak positions) belong together. A sampling of
the cluster or binning results is shown in Table I below. Then, using normalized intensities
(from zero to one) to represent the peak heights every diffraction pattern exhibits for each bin,
relationships between the peak-position bins can be further examined. For example, Table II
shows, for a small subset of bins, intensity-based correlations between some bins.
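A minimal sketch of binning peak positions by expectation maximization is given here, assuming a one-dimensional Gaussian mixture with a fixed, shared width. The fixed width is a stand-in for the prior diffraction knowledge about the likely spread in peak positions; the function name, the starting bin centers, and the peak list are hypothetical, and the software actually used by the authors is not reproduced.

```python
import math

def em_bins(positions, means, sigma=0.05, n_iter=50):
    """Fit a 1-D Gaussian mixture (fixed shared width) to peak positions by EM."""
    means = list(means)              # do not modify the caller's list
    k = len(means)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: probability that each peak position belongs to each bin
        resp = []
        for x in positions:
            dens = [w * math.exp(-0.5 * ((x - m) / sigma) ** 2)
                    for w, m in zip(weights, means)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update bin centers and mixing weights
        for j in range(k):
            rj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, positions)) / rj
            weights[j] = rj / len(positions)
    # assign each peak to its most probable bin
    assign = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, assign

# Hypothetical peak positions (degrees 2-theta) pooled from several patterns
positions = [18.11, 18.12, 18.10, 18.13, 19.00, 19.01, 19.02, 18.99,
             19.66, 19.68, 19.67]
centers, assign = em_bins(positions, means=[18.0, 19.1, 19.8])
print([round(c, 3) for c in centers], assign)
```

This sketch assumes the starting centers lie reasonably close to the data; a starting center far from every peak would receive no weight and the update would fail.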
Over an entire scan, redundant phase composition information is present in diffraction data that
contain no distortions due to preferred orientation. In the present example, although the major
ceria-zirconia [111] peak envelope region was not included in the expectation maximization
procedure, enough ceria-zirconia phase information still remains in other parts of the diffraction
pattern to derive a general regression relationship between binned (peak) intensities and the
amount of cerium observed by X-ray fluorescence (Figure 4).
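The kind of regression mentioned above can be illustrated with a single-predictor least-squares fit. This is a simplified sketch: the regression in the paper presumably uses several binned intensities, whereas this example uses one hypothetical bin, and all numerical values below are invented for illustration only.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Hypothetical data: normalized intensity in one ceria-zirconia peak bin,
# and Ce content from XRF (wt%) for the same five specimens.
bin_intensity = [0.10, 0.25, 0.40, 0.55, 0.80]
ce_xrf = [2.1, 4.9, 8.2, 10.8, 16.0]

slope, intercept = fit_line(bin_intensity, ce_xrf)
predicted = [slope * v + intercept for v in bin_intensity]
residuals = [p - o for p, o in zip(predicted, ce_xrf)]
print(round(slope, 2), round(intercept, 2))
```

Plotting `predicted` against `ce_xrf` gives a predicted-versus-observed comparison of the type shown in Figure 4.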
Table I. Sample of expectation maximization results (after including diffraction knowledge about the
likely spread in peak position values). N is the number of patterns examined containing a peak at that
average position.

Cluster or bin #   2theta (deg)   Std. error   -99.00%   +99.00%    N
       1              18.115         0.008      18.094    18.136    21
       2              19.007         0.008      18.986    19.027    22
       3              19.354         0.021      19.299    19.410     3
       4              19.670         0.011      19.643    19.698    12
       6              20.427         0.010      20.402    20.452    15
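The -99% and 99% columns of Table I appear consistent with a normal-theory interval, mean plus or minus z times the standard error with z of about 2.576. The short check below tests that reading against the tabulated values; whether the original software computed the limits this way is an assumption, not something stated in the paper.

```python
from statistics import NormalDist

# (bin, mean 2-theta, standard error, lower limit, upper limit) copied from Table I
table_i = [
    (1, 18.115, 0.008, 18.094, 18.136),
    (2, 19.007, 0.008, 18.986, 19.027),
    (3, 19.354, 0.021, 19.299, 19.410),
    (4, 19.670, 0.011, 19.643, 19.698),
    (6, 20.427, 0.010, 20.402, 20.452),
]

z = NormalDist().inv_cdf(0.995)   # two-sided 99% critical value, about 2.576

# largest disagreement between mean +/- z*SE and the tabulated limits
max_dev = 0.0
for _, mean, se, lower, upper in table_i:
    max_dev = max(max_dev,
                  abs(mean - z * se - lower),
                  abs(mean + z * se - upper))
print(round(z, 4), round(max_dev, 4))
```

The remaining disagreement is within what would be expected from the standard errors being rounded to three decimals.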
Table II. Example of intensity-based correlations between peak-position bins. The blue highlighted
values in the 18.114 and 19.006 columns are correlations from peaks due to the same phase, cordierite
from the substrate. The green highlighted correlation is new information not previously known; peaks at
19.67° and at 21.31° appear to be due to the same (but unknown) phase.

Bin (2theta)   18.114   19.006   19.354   19.670
  18.114        1.000    0.886    0.408    0.182
  19.006        0.886    1.000    0.416    0.292
  19.354        0.408    0.416    1.000    0.416
  19.670        0.182    0.292    0.416    1.000
  20.427        0.661    0.669    0.301    0.183
  21.307        0.229    0.336    0.401    0.815
  21.739        0.815    0.856    0.407    0.292
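One way to read Table II computationally is to list the bin pairs whose intensity correlation exceeds a cutoff, as candidates for peaks belonging to the same phase. The sketch below uses the values from Table II; the cutoff of 0.8 is an assumption chosen for illustration and is not taken from the paper.

```python
# Correlations copied from Table II: row bin -> values for the four column bins
columns = [18.114, 19.006, 19.354, 19.670]
table_ii = {
    18.114: [1.000, 0.886, 0.408, 0.182],
    19.006: [0.886, 1.000, 0.416, 0.292],
    19.354: [0.408, 0.416, 1.000, 0.416],
    19.670: [0.182, 0.292, 0.416, 1.000],
    20.427: [0.661, 0.669, 0.301, 0.183],
    21.307: [0.229, 0.336, 0.401, 0.815],
    21.739: [0.815, 0.856, 0.407, 0.292],
}

THRESHOLD = 0.8   # assumed cutoff for "likely the same phase"

# collect each strongly correlated pair once, ordered by peak position
pairs = sorted({
    (min(row, col), max(row, col), value)
    for row, values in table_ii.items()
    for col, value in zip(columns, values)
    if row != col and value >= THRESHOLD
})
for low, high, value in pairs:
    print(f"{low:.3f}  {high:.3f}  r = {value:.3f}")
```

With this cutoff the bins near 18.11, 19.01, and 21.74 degrees group together, and the bins near 19.67 and 21.31 degrees form a separate pair, in line with the caption of Table II.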
Expectation Maximization may represent a different, and possibly improved, approach to phase
analysis through intensity correlations, but this approach still results in far too many variables
(too many peak positions). Another approach is to use Principal Component Analysis (PCA) to
obtain composite factors that contain the largest amount of information. [5] PCA has been used
in other diffraction studies [6] and has been widely used in the spectroscopy community [5b, 7].
PCA could be used to reduce the number of peak-position bins obtained by expectation
maximization but PCA on peak-position bins could be problematic because across any given set
of diffraction scans many of the peak-position bins may have zero peak intensity. However, if
PCA is used on raw normalized data, every 2θ step becomes a variable and the intensity is the
value of the variable. Shown in Figure 5 are plots of the first two principal components obtained
from raw diffraction data over a "low-angle" region and over a "mid-angle" region. Because
PCA derives directions in parameter space with the highest variance (information content), the
PCA approach is able to delineate differences between the diffraction data for three types of
catalysts; and these differences are more substantive than being due to just the variable amount
of cordierite substrate that inadvertently occurs when scraping catalyst washcoat.
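Treating every 2θ step as a variable, the first two principal components can be sketched with a small power-iteration routine. This is an illustrative implementation under stated assumptions, not the software used by the authors: the six synthetic patterns (three hypothetical catalyst types, two specimens each), the function names, and the iteration count are all invented for the example.

```python
import math

def first_components(rows, n_comp=2, n_iter=200):
    """Leading principal components by power iteration with deflation.

    rows: one list of intensities per pattern; each column is one 2-theta step.
    Returns (components, scores), one entry per component.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    x = [[r[j] - means[j] for j in range(p)] for r in rows]   # center each step
    comps, scores = [], []
    for c in range(n_comp):
        v = [math.sin(j + 1 + c) for j in range(p)]   # deterministic start vector
        for _ in range(n_iter):
            t = [sum(a * b for a, b in zip(row, v)) for row in x]           # X v
            v = [sum(t[i] * x[i][j] for i in range(n)) for j in range(p)]   # X^T (X v)
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        t = [sum(a * b for a, b in zip(row, v)) for row in x]   # scores on this factor
        comps.append(v)
        scores.append(t)
        # deflate: remove the part of each pattern explained by this component
        x = [[row[j] - t[i] * v[j] for j in range(p)] for i, row in enumerate(x)]
    return comps, scores

def gaussian_pattern(center, width, n_steps=40):
    """Synthetic normalized pattern with one peak (hypothetical data)."""
    return [math.exp(-0.5 * ((j - center) / width) ** 2) for j in range(n_steps)]

# Three hypothetical catalyst types, two slightly different specimens of each
patterns = [gaussian_pattern(c + shift, w)
            for c, w in ((10, 1.0), (20, 2.0), (30, 3.0))
            for shift in (0.0, 0.2)]
components, scores = first_components(patterns)
points = list(zip(scores[0], scores[1]))   # (factor 1, factor 2) for each pattern
print([(round(a, 2), round(b, 2)) for a, b in points])
```

In this toy case the two specimens of each type land close together in the plane of the first two factors, while the three types are well separated, which is the behaviour described for Figure 5.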
Figure 4. Comparison of the regression-predicted Ce composition with the observed Ce from XRF.
INCORPORATING DESCRIPTORS FROM OTHER CHARACTERIZATION TECHNIQUES
Analytical techniques such as quantitative X-ray fluorescence generally yield descriptors
(numerical values for composition) that are examined easily for relationships. Other
spectroscopic characterization methods can yield descriptors using procedures similar to those
described here for X-ray diffraction data. Continuous curve data (e.g., reactor or emissions data)
represent another type of data for which obtaining useful descriptors can be difficult. Images,
and microscopy images in particular, also pose challenges to using computational methods to
examine relationships. For electron microprobe images of catalysts, we obtain descriptors by
separating substrate regions from washcoat and then deriving elemental spatial correlations from
the X-ray emission maps. [8]
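A rough sketch of an elemental spatial correlation restricted to washcoat pixels follows. The details of the actual procedure are in reference [8] and are not reproduced here; the mask is assumed to be given, and the tiny Ce and Zr maps, the mask, and the function name are hypothetical.

```python
import math

def masked_correlation(map_a, map_b, mask):
    """Pearson correlation of two element maps over pixels where mask is True."""
    a = [v for row, keep in zip(map_a, mask) for v, k in zip(row, keep) if k]
    b = [v for row, keep in zip(map_b, mask) for v, k in zip(row, keep) if k]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

# Hypothetical 3 x 4 X-ray emission maps (counts) and a washcoat mask
ce_map = [[0, 5, 9, 2],
          [0, 6, 8, 1],
          [0, 0, 7, 3]]
zr_map = [[0, 4, 8, 2],
          [1, 5, 7, 1],
          [0, 1, 6, 2]]
washcoat = [[False, True, True, True],
            [False, True, True, True],
            [False, False, True, True]]   # False marks substrate pixels

r_ce_zr = masked_correlation(ce_map, zr_map, washcoat)
print(round(r_ce_zr, 3))
```

Each element pair yields one such value per image, and those values can serve as descriptors alongside the diffraction and XRF descriptors.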
PUTTING IT ALL TOGETHER — RELATIONSHIPS BETWEEN DESCRIPTORS AND PERFORMANCE
Aggregate X-ray diffraction descriptors, such as the first few PCA factors, can be compared to
performance groupings derived from, e.g., tailpipe emissions to determine the usefulness of the
descriptors. Figure 6a shows the effectiveness of the selected X-ray diffraction PCA factors in
discriminating amongst emissions-based groupings. The larger the numbers on both axes, the
better the discrimination. Some discrimination occurs using just the X-ray diffraction descriptors
(Figure 6a), but the discrimination between performance groups is greatly enhanced if the most
important PCA factors from electron microprobe image correlations and from XRF-based
compositions are included with the X-ray diffraction factors (Figure 6b).
This document was presented at the Denver X-ray Conference (DXC) on Applications of X-ray Analysis.
Sponsored by the International Centre for Diffraction Data (ICDD).
This document is provided by ICDD in cooperation with the authors and presenters of the DXC for the
express purpose of educating the scientific community. All copyrights for the document are retained
by ICDD. Usage is restricted for the purposes of education and scientific research.
DXC Website: www.dxcicdd.com
ICDD Website: www.icdd.com
Figure 5. Plots illustrating the ability of the first PCA factor (horizontal axis) and second PCA factor
(vertical axis) derived from raw X-ray diffraction data to separate the catalysts into their
three (known) types. The plot on the left shows the first two PCA factors derived from "low-angle"
data; the plot on the right shows the first two PCA factors for "mid-angle" data.
Figure 6 (panels 6a and 6b). Plots of each catalyst discriminant score for the first two discriminant functions (canonical
roots). Discriminant analysis is used to determine which variables can successfully differentiate between
groups, in this case, groups based on emission performance. Discriminant functions are obtained from
weighted linear combinations of variables with the weights derived to maximize differentiation between
groups. As shown above, PCA factors derived from raw XRD data can (marginally) discriminate
between emission groups (6a), but the differentiation between emissions groups is notably more effective
if the first few PCA factors obtained from XRF and EPMA data are also included (6b).
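The notion of weights chosen to maximize differentiation between groups can be illustrated with a two-group Fisher linear discriminant on two variables. This is a simplified stand-in for the multi-group canonical discriminant analysis used in the paper; the two emission groups, their factor scores, and the function name are hypothetical.

```python
def fisher_weights(group_a, group_b):
    """Fisher discriminant weights w = Sw^-1 (mean_a - mean_b) for 2-variable samples."""
    def mean(group):
        return [sum(s[k] for s in group) / len(group) for k in range(2)]

    ma, mb = mean(group_a), mean(group_b)
    # pooled within-group scatter matrix (2 x 2)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for group, m in ((group_a, ma), (group_b, mb)):
        for x in group:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # multiply the explicit 2 x 2 inverse of the scatter matrix by the mean difference
    return [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
            (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det]

# Hypothetical (PCA factor 1, PCA factor 2) values for two emission groups
low_emitters = [(1.0, 2.0), (1.5, 2.4), (0.8, 1.7), (1.2, 2.3)]
high_emitters = [(2.0, 1.0), (2.6, 1.5), (2.2, 0.9), (2.4, 1.4)]

w = fisher_weights(low_emitters, high_emitters)
scores_low = [w[0] * x + w[1] * y for x, y in low_emitters]
scores_high = [w[0] * x + w[1] * y for x, y in high_emitters]
print([round(v, 1) for v in scores_low], [round(v, 1) for v in scores_high])
```

The further apart the two sets of scores lie, the better the chosen variables discriminate between the groups; adding variables from other techniques corresponds to extending the sample vectors.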
CONCLUSIONS
SNAP is a powerful approach to comparing diffraction patterns and enables obtaining useful
correlations between whole patterns. Expectation Maximization is found to be useful for
deriving clusters of peaks associated with a single average peak position (a peak bin);
correlations between intensities in the bins can then be used to determine which peaks belong
together and are due to a single phase. Principal Component Analysis of normalized step-scan
data is found to yield useful descriptors that subsequently can be related to measures of materials
performance. These approaches to deriving descriptors from powder diffraction data will
facilitate finding new and improved inorganic materials, especially heterogeneous catalytic
materials, using high-throughput discovery methods and data mining to target specific property
criteria.
REFERENCES
[1] Kantardzic, M., Data Mining -- Concepts, Models, Methods, and Algorithms, Wiley-Interscience,
IEEE Press: New Jersey (2003), pp. 19-38.
[2] Gilmore, C. J., Barr, G., Paisley, J., "High Throughput Powder Diffraction I: Full-profile Qualitative
and Quantitative Powder Diffraction Pattern Analysis", J. Appl. Cryst. (submitted)
[3] StatSoft, Inc. (2003). STATISTICA (data analysis software system), version 6. Tulsa, Oklahoma:
www.statsoft.com.
[4] Mitchell, T. M., Machine Learning, McGraw-Hill: Boston (1997), pp. 191-195.
[5] (a) Reference [1], pp. 48-51; (b) Jurs, P.C., "Chemometric and Multivariate Analysis in Analytical
Chemistry" in Reviews in Computational Chemistry, Lipkowitz and Boyd, eds., VCH Publishers:
New York (1990), pp. 169-212; (c) Jambu, M., Exploratory and Multivariate Data Analysis,
Academic Press: Boston (1991), pp. 129-167.
[6] (a) Kato, M., Fujii, S., Ui, T., Asada, E., Powder Diffract. 5(1), 33-35 (1990); (b) Klar, P.J., Chen,
L., Rentschler, T., J. Mater. Chem., 6(11), 1815-1821 (1996); (c) Artursson, T., et al., Applied
Spectr., 54(8), 1222-1230 (2000); (d) Hida, M., Sato, H., Sugawara, H., Mitsui, T., Forensic Science
International 115, 129-134 (2001).
[7] (a) Aries, R., Lidiard, D., Spragg, R., Spectroscopy 5(3), 41-44 (1990); (b) Workman, J.J., et al.,
"Review of Chemometrics Applied to Spectroscopy: 1985-95, Part I" in Applied Spectr. Reviews,
31(1&2), 73-124 (1996).
[8] Chen, A.E. and Lowe-Ma, C.K., Microscopy and Microanalysis 7, Suppl. 2, 1116-7 (2001).