The Discovery by Data Mining of Rogue Equipment in the Manufacture of
Semiconductor Devices
Steven G. Barbee
A Thesis
Submitted in Partial Fulfillment
of the Requirements for the Degree of
Master of Science in Data Mining
Department of Mathematical Sciences
Central Connecticut State University
New Britain, Connecticut
April 2007
Thesis Advisor
Dr. Daniel Larose
Director of Data Mining Program
The Discovery by Data Mining of Rogue Equipment in the Manufacture of
Semiconductor Devices
Steven G. Barbee
An Abstract of a Thesis
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Science in Data Mining
Department of Mathematical Sciences
Central Connecticut State University
New Britain, Connecticut
April 2007
Thesis Advisor
Dr. Daniel Larose
Department of Mathematical Sciences
Key Words: Radial Basis Function Network, Semiconductor Manufacturing,
Feature Selection, Tree Stump, Tree Curve, Tree Bias
Abstract
Finding equipment causes of faulty devices in semiconductor manufacturing is inhibited
by several difficulties which are briefly described. The main problem area focused on
here is that of biased data mining methods. By judiciously selecting two data mining
methods from IBM’s data mining workbench, the Intelligent Miner for Data (IM4D),
discovery of the known root cause of a decrease in device parametric data from a
manufacturing line is more likely to be obtained. The methods employed are the radial
basis function network with chi-square ranking for feature selection followed by
sequential one-level regression trees (tree stumps) to provide rules. A graphical
representation of the rules, the tree curve, is introduced which makes the determination of
the root cause visually easy. The value of this approach was proven when it revealed the
key candidate for a problem, in IBM’s primary manufacturing line, which was later
confirmed by traditional engineering methods to be the root cause.
Dedication
This thesis is dedicated to my wife, Debbie, whom I have loved since our teenage years.
Her encouragement and support over the past four years allowed me to pursue this degree
to completion.
Acknowledgements
I am grateful to the Lord for leading me to a vocational epiphany during my 2001
Christmas vacation which eventually led me to enter this field within IBM. He opened
doors, before closing doors, leading me to this chapter in my life.
I would also like to thank Professor Larose for creating the online degree program in data
mining which quickly brought me up to speed in my second career. His friendly
encouragement to me at the end of the 2nd of the first 3 of his courses was especially
helpful. I doubt that I would be a data miner today without his encouragement or his
foresight and perseverance in providing the world with this resource.
My knowledge of data mining was broadened and deepened by the challenging courses
of Professors Markov and Dziuda. Professor Markov introduced me to WEKA and
machine learning using Prolog and Mitchell’s text. Professor Dziuda introduced me to
the fascinating field of gene expression preprocessing and mining for genomics
information discovery from microarray data.
I would like to thank the above-mentioned professors for serving on my thesis committee.
CONTENTS
Abstract
I. Introduction
A) Goal
B) Related Work
II. Semiconductor Device Fabrication
A) Introduction
B) Definition of Terms
C) Business motivation
1) Implications of Wafer Size
2) Time to Market
D) Data Mining as an Adjunct to Conventional Engineering
III. Data Mining Challenge
A) Scope Limited to Knowledge Discovery (vs. Predictive Analytics)
B) Hierarchical Nature of Fabrication
C) Variable Reduction and SME Bias
D) Complexity of Wafer Trajectories
E) Probabilities from the Hypergeometric Distribution
F) The Problem of Autocorrelation
G) The Problem of Bias in Selection Methods
1) Chi-square
2) Entropy Ranking
3) Information Gain
4) Gain Ratio
5) Gini Index
6) Minimum Description Length
IV. Data Mining Approach and Application
A) Data Exploration
B) Data Preparation
C) Feature Selection to Reduce Variables vs. Records
D) Methods Chosen for Categorical Variables and Numeric Targets
1) Radial Basis Functions
a) Applying RBFNs
b) Selecting Features with RBFNs
2) Classification and Regression Trees
a) Operation of the basic tree method
b) Sequential top node method
E) Rules for Multiple Process Tools
1) Method for Summarizing and Prioritizing Rules
F) Results for Two Datasets
V. Further Study
A) Mining Methods
B) Feature Creation
C) Data Access Limitations Due to Commercial Prudence
Appendix A: Acronyms
Appendix B: Cited Books
Biography of the Author
Abstract
Finding equipment causes of faulty devices in semiconductor manufacturing is inhibited
by several difficulties which are briefly described. The main problem area focused on
here is that of biased data mining methods. By judiciously selecting two data mining
methods from IBM’s data mining workbench, the Intelligent Miner for Data (IM4D),
discovery of the known root cause of a decrease in device parametric data from a
manufacturing line is more likely to be obtained. The methods employed are the radial
basis function network with chi-square ranking for feature selection followed by
sequential one-level regression trees (tree stumps) to provide rules. A graphical
representation of the rules, the tree curve, is introduced which makes the determination of
the root cause visually easy. The value of this approach was proven when it revealed the
key candidate for a problem, in IBM’s primary manufacturing line, which was later
confirmed by traditional engineering methods to be the root cause.
I. Introduction
Before we delve into data mining, we’ll first explore the complicated world of
semiconductor manufacturing so that we can better understand the business motivation
behind this work. Then we’ll look at technical difficulties, particular to manufacturing,
which may hinder our mining or the interpretation of our mining results. Finally, we’ll
present different approaches to mining and concentrate on our preferred method and a
means of interpreting its results.
In the manufacture of semiconductor devices, electrical properties can be affected by
deviations in the fabrication process due to drifting or otherwise poorly performing
processes or equipment (also referred to as tools). Data mining methods in knowledge
discovery can detect which process equipment (tool) may have caused such deviations.
However, a naive application of data mining methods may lead to erroneous results due
to the presence of bias in the method. The problem may be further worsened by an
imbalanced dataset in that there are few wafers (work pieces) with the best electrical
characteristics versus a majority with poorer characteristics or vice versa. The logistics data
(the particular path of each work piece through the manufacturing line) contained in datasets
may frequently present us with as many variables as there are records or instances. Given
all of the methods contained in a single data mining workbench, the challenge is then to
find the best method, or combination of methods, to discover any equipment causing poor
electrical characteristics in the final product (referred to as work pieces, integrated
circuits, microchips or semiconductor devices).
I.A.) Goal
The goal of this work is to find the best discovery method among those available in a
given commercial data mining suite to use on such datasets as are common in the
semiconductor industry. Since combinations of two or more of the process tools used
during semiconductor device fabrication may be causative, the methods must allow for
the discovery of such tool combinations. The hypothesis within this work is that a
method or combination of methods will overcome a misleading selection bias in this
knowledge discovery problem.
I.B.) Related Work
One of the earliest papers detailing the application of data mining to semiconductor
manufacturing applied a tree to process control. 1 There is now a growing level of
sophistication in the body of literature in the semiconductor data mining field. Perhaps
the most impressive recent paper on the application of data mining to semiconductor
manufacturing is that from Intel. 2 A variety of recently developed data mining methods,
both commercial and in-house, were shown to accurately predict the completion date of
wafers and the device speed of chips based on electrical parameters tested during earlier
steps in their fabrication. Motorola has also published their development of a software
application for data mining named CORDEX 3 which uses Kohonen self-organizing maps
for clustering the data followed by a tree for rule analysis within the clusters.
Based on the complexity of data mining methods and the history of the acceptance and
use of statistical methods, such as the design of experiments, among the engineering
community, it is this author’s opinion that the newest, most advanced data mining
methods are wielded today by only a select few in a company’s semiconductor
manufacturing operation despite the huge ROI realizable when they are properly applied.
The use of more generic, but still powerful, methods through user-friendly GUIs in
commercial software will best spread the use of data mining among the general
engineering community and should be strongly encouraged.
The community of data mining companies 4 producing specialized software or services
customized to semiconductor manufacturing is growing. Trees and neural networks are
commonly encountered in such software, the latter for both clustering and prediction.
For instance, AMD’s Richard Kittler (formerly of Yield Dynamics) and Weidong Wang
(Yield Dynamics) tout the use of Bayesian nets and decision trees; 5 they also survey the
present and predict the future use of data mining in the semiconductor industry. 6
1
“Applying Machine Learning to Semiconductor Manufacturing,” K. Irani, J. Cheng, U. Fayyad, Z. Qian,
IEEE Expert, Febr. 1993, pp. 41—47.
2
“Advancements and Applications of Statistical Learning/Data Mining in Semiconductor Manufacturing,”
Randall Goodwin, Russell Miller, Eugene Tuv, Alexander Borisov, Mani Janakiram, Sigal Louchheim,
Intel Technology Journal, Volume 8, Issue 4, 2004; pp. 325—336.
3
“Data Mining Solves Tough Semiconductor Manufacturing Problems,” Mike Gardner and Jack Bieker,
Proc. ACM SIGKDD 2000, Boston, MA, pp.376—383.
4
See, for example, Yield Dynamics’ YieldMine; Neumath:
http://www.neumath.com/solutions/solutions_index.htm ; Quadrillion Q-Yield: www.quadrillion.com ;
and PDF Solutions: http://www.pdf.com/services_solutions.phtml
5
“Data Mining for Yield Improvements,” Richard Kittler and Weidong Wang, Proc. Int’l. Conf. on
Modeling and Analysis of Semiconductor Manufacturing, Tempe, Arizona, 2000, pp.270-277.
6
“The Emerging Role for Data Mining,” Richard Kittler and Weidong Wang, Solid State Technology, Vol.
42, Issue 11, November, 1999.
An example of literature from academia that is closely related to the present work is that
of the National Chiao Tung University with TSMC (Taiwan Semiconductor
Manufacturing Company) 7 where several undisclosed methods were used to back test
data mining methods to discover the root cause using data from a semiconductor
manufacturing facility. If the results of the data mining method ranked the root cause
among its top 5 candidates then the method was deemed to be successful. A similar
ranking approach will be used to determine the success of the mining methods used in
this work.
Back to CONTENTS
II. Semiconductor Device (Chip) Fabrication 8
II.A) Introduction
The manufacture of semiconductor-based devices, integrated circuits or microchips,
involves a lengthy and repetitive set of processes often numbering in the hundreds of
steps. Each of many (perhaps hundreds of) microchips is formed on a cross-sectional
(orthogonal to the axis) slice of a cylindrical crystal of silicon. These slices are known in
the trade as “wafers” and are currently as large as 300mm in diameter and over a
millimeter in thickness. Batches, or “lots,” of wafers move together in a container from
tool to tool, automatically, in the newest factories (also known as foundries, fabricators or
“fabs”). Within a tool they may be individually processed in several single-wafer
chambers. It is expected that there will be differences in processing from tool to tool and
chamber to chamber that may lead to different electrical results in the microchips on the
wafers. Furthermore, the processes can vary across the wafer, creating planar regions of
variation, so that individual microchips within each wafer may receive slightly
different processing.
The process steps exercised to form devices on the microchips within a wafer involve
various physical and chemical methods to define and form circuit elements, such as
chemically doping the semiconductor material to form electrically active regions in the
transistors, diodes, and capacitors comprising memory or logic components within a device on
the microchip.
II.B) Definition of Terms
In knowledge discovery, we search attribute values of independent variables that affect
the dependent variable by poring over thousands of instances. We will use any of the
following synonyms depending on their context:
Independent variable = variable = process (e.g. EQP_A11)
Independent variable’s attribute = attribute = tool or equipment (e.g. AE01)
Dependent variable = target = electrical parameter % of goal or ratio (e.g. C or LR)
Instance = record = wafer = work piece
7
“A Data Mining Project for Solving Low-yield Situations of Semiconductor Manufacturing,” Wei-Chou
Chen, Shian-Shyong Tseng, Kuo-Rong Hsiao, Chia-Chun Liu, 2004 IEEE/SEMI Advanced Semiconductor
Manufacturing Conference, pp.129—134.
8
See, for example, Silicon Processing for the VLSI Era, Vol. 1: Process Technology, 2nd edn., S. Wolf
and R. Tauber, Lattice Press, 1999. Volume 2 on Process Integration is also of interest here.
In this section we also introduce common acronyms for process categories (see Appendix
A). Semiconductor processes require very complex phenomena such as: (a) ion
implantation (ION) where dopant ions accelerated to KeV (or higher) energies become
embedded with tightly controlled dosages and depths in well-defined areas of
semiconductor materials; (b) reactive ion etching (RIE) where chemically reactive ions
from plasmas impinge on masked openings to isotropically or anisotropically remove thin
films; (c) chemical vapor deposition where thermally, or plasma (PE), activated chemical
species chemisorb and react on surfaces forming thin (as thin as an atomic layer)
conformal films, typically insulators (INS), but of any type of material controlled to
molecular dimensions in thickness; (d) physical vapor deposition, where the species are
sputtered from targets or evaporated from sources to form films by physisorption, that
may serve as seed layers or liners (LNR) between metal films; (e) lithography (LTH)
which involves exposing photoreactive films to ultraviolet light to define the smallest
possible, circuit speed-limiting, widths of areas on the chip; (f) anneals (FRN) where
wafers are heated with tightly controlled thermal ramps and temperatures that move or
activate dopants or densify films or cause reactions at film interfaces; (g) wet etches to
remove thin films; (h) cleans where either wet (WET) reagents chemically, or with
cryogenic aerosols (AERO), ballistically remove monolayer or particulate contaminants;
(i) chemical-mechanical planarization (CMP) where a combination of abrasion and
chemical reactions remove films; (j) oxidation where, for Si, a stable silicon dioxide film
forms to protect the highly reactive Si surface; (k) stripping of photoresist (RstStrip) by
liquid or plasma exposure; (l) metal films are electroplated (MTL) on seed layers to serve
as interconnect lines or vias between or within devices on a microchip.
The acronyms above have been replaced by an alphabetic encoding that will be seen later
in our anonymized data mining results. For example, EQP_A11 might represent the
variable for a p-well ION implantation process with attributes (tools) represented by
alphanumerics (e.g. AE01).
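To make this encoding concrete, the short Python sketch below arranges a few wafer-level records the way they will later be mined: each row is a wafer (instance), each logistics column is a process step (variable) whose value is the anonymized tool (attribute) that processed the wafer, and C is the numeric electrical target. All wafer IDs, field names and values here are hypothetical illustrations, not data from this study.

# Hypothetical wafer-level layout: rows are wafers, logistics columns hold the
# anonymized tool that processed each wafer, and C is the electrical target.
import pandas as pd

wafers = pd.DataFrame(
    {
        "WAFER_ID": ["W001", "W002", "W003", "W004"],
        "EQP_A11": ["AE01", "AE02", "AE01", "AE03"],  # e.g. a p-well ION implant step
        "EQP_B07": ["BH04", "BH04", "BH02", "BH02"],  # another anonymized process step
        "C": [98.2, 97.5, 84.1, 83.7],                # electrical parameter, % of goal
    }
).set_index("WAFER_ID")

print(wafers)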
A typical sequence during device formation is to deposit a film, lithographically define a
region by projecting the image of a mask, etch the defined regions in the film, deposit a
different film or perform an ion implantation. This sequence is repeated many times to
define the different regions of the semiconducting device. The different active regions of
the device are connected by forming many layers of thin wires connected to each other by
vertical conductors or “vias.”
Below is a sequence of cross-sections 9 that illustrate the large number of steps required,
in this case, to form a submicron spacing or opening.
9
S.G.Barbee & G.R.Goth, “Virtual Image Structure for Defining Sub-Micron Dimensions,” IBM Technical
Disclosure Bulletin, Vol. 25, No. 3B, August, 1982, p.1448.
In the first figure, a series of thin films have been deposited as blanket films across the
entire wafer surface. Then a layer of photoresist is exposed and developed through a
mask which defines a border. In the next figure an anisotropic dry etch process uses the
photoresist as a template to remove 3 films down to silicon nitride, leaving a vertical wall
in the polysilicon. Next the exposed polysilicon is oxidized at a well-controlled rate to
form a “plug” of silicon dioxide. The silicon nitride and polysilicon surrounding the plug
are removed and photoresist is spun onto the wafer enveloping the plug. An isotropic dry
etch removes the top of the photoresist exposing the plug which is then etched away
leaving a submicron channel.
Such sequences of steps are used to form the active device regions and interconnect them
with layers of insulated metal lines which are connected to form word and bit lines in
DRAM, for instance. Similar methods are used with different materials and larger
dimensional scales to package the devices and integrate the chips into computer boards.
At specific points during the fabrication of complete circuit devices on each wafer,
surrogate electrical structures formed in the microchip or in the kerf between microchips
are electrically tested to indicate the quality of the product. This in-line testing (ILT)
may involve thousands of measurements of voltage, resistance, current, capacitance,
transconductance and other properties directly or derived from direct measurements for
different structures which roughly emulate components of the circuits in the device. It is
especially desirable to predict the electrical performance of the completed chip from
these tests at intermediate stages in the chip’s fabrication. The benefit of data mining
such electrical tests against wafer logistics is to find rogue tools as quickly as possible
before more wafers are misprocessed or out-of-spec by the time they reach the final
testing of each chip. The chip fabricator or foundry can then scrap the defective wafers
early and replace them by starting more so that committed shipments of working chips to
customers can be achieved.
Back to CONTENTS
II.C) Business Motivation
II.C.1) Implications of increasing wafer size and decreasing device geometries.
An economy of scale applies to the number of chips per wafer, driving the wafer diameter
to increase dramatically over the years from around 25 mm in the late 60s to 300mm in
the late 90s. There are few such examples of manufacturing work pieces increasing in
size by an order of magnitude over a few decades. Each change to a larger diameter
requires building semiconductor processing equipment to accommodate the larger sizes,
but more importantly, requires redeveloping the process to maintain the tight control
across the much larger surface of the wafer. Unfortunately, certain processes have
inherent radial variations for which prior or posterior processes must compensate, if
possible. The cumulative effect of such radial nonuniformities can result in a steep
degradation of circuit performance or functionality in a radial region of the wafer.
Each generation of devices has a shorter gate length across which the charge carriers
traverse, so that the switching speed of the transistors is faster (higher GHz) or so that
more transistors can be fit into a given chip size, thereby allowing more processors per
chip. As chip areas increase so does the chance of large particulates (killer defects)
landing on their surface even in pristine clean room conditions. Smaller vertical
dimensions typically accompany shorter gate lengths so more precisely controlled thin
film forming and removing processes (and tools) are required with each generation.
There is always a race to achieve yield on new generations of chips so the pursuit of high
yield manufacturing is a continual challenge for the semiconductor foundry.
II.C.2) Time to Market
The profit incentive is compelling to find rogue equipment preventing rapid yield
improvement during this ramp-up stage. The result of one study is that a 6 month delay
in ramp-up will reduce profits by two thirds; however, a reduction by 6 months of the
time to ramp-up doubles the profit. 10 This motivates intense mining during the period
leading up to mature device production. A general misconception among R&D
management is that there is insufficient data (wafers) during this period for data mining.
Data mining methods have been shown to work even when there are far more variables
than records. An example of this is the genomic mining field. The mining method must
be judiciously selected by the miner for this scenario.
In time there are more and more wafers to mine for yield improvement even though the
profit of increasing the yield decreases. This is a trade-off that faces data miners tasked
with increasing the yield during the volume ramp-up stage of the product life cycle.11
Given the above business considerations, and the imperative for understanding the
business need (e.g. step one in the CRISP-DM framework 12 ) for data mining, we could
create a business goal map: 13
10
“Yield Learning and the Sources of Profitability in Semiconductor Manufacturing and Process
Development,” Charles Weber, IEEE Trans. On Semiconductor Manufacturing, Vol. 17, No. 4, November
2004, pp.590—596.
11
One reference is by Gabriel Steinhardt, “Extending Product Life Cycle Stages,” pp. 1—3.
http://www.blackblot.com/files/articles/Blackblot_Extending_Product_Life_Cycle_Stages.pdf
12
See for example, Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining,
Wiley, 2005, pp.5-7.
13
Dorian Pyle, Business Modeling and Data Mining, Morgan Kaufmann, 2003, p.207.
With increasing granularity towards the lowest layer, this diagram shows the cascade of
business effects due to manufacturing problems leading to the topmost “bottom line.”
Such charts can be used to elicit executive buy-in for the use of data mining.
Back to CONTENTS
II.D) Data Mining as an Adjunct to Conventional Engineering
The yield of the chips on a wafer can be affected by circuit design (and layout), and by
random and systematic causes. Due to the small lateral dimensions of the circuits (the smallest in
manufacturing are currently 90nm and 65nm and are steadily decreasing by roughly 30%
with each generation) the landing of any air-borne or surface-electrostatically-driven
particulate contaminant can effectively kill the device or circuit connection proximate to
the particle’s random location. Such effects typically follow the well-known Poisson
distribution and can be minimized by controlling the ambient conditions in the wafer’s
environment (wafer containers, clean rooms and load-locks on tools) and surface
cleaning. Such measures are hindered by the wafer’s environment during processing in tools,
for example by extraneous coated materials flaking from reactive chamber walls in
process tools used to deposit or remove films from the wafer. Persistent monitoring and
more frequent cleaning measures reduce such sources.
When circuits are laid-out during the chip design stage, minimal widths and spaces are
imposed for the features comprising the device and its interconnections based on models
of, for example, cross-talk or other signal interference between digital and analog devices
in the chip. These dimensional ground rules may be inadvertently violated during the
layout stage. Such violations can interact with process tolerances to drastically reduce
device yield. With aggressive linewidths, even shapes and their proximities on the
patterned photolithographic masks must be modeled to obtain the desired effect.
Systematic problems are those arising from sporadic change points or gradual drifts in the
control or performance of the semiconductor fabrication processes described above. Just
the act of improperly performing routine maintenance on a process tool can transform it
from the best-yielding to the worst-yielding tool at a given process step.
Statistical Process Control (SPC) using Shewhart charts and other means is typical of
any manufacturing industry where the performance of the manufacturing tool is tracked
through time. When the tool's performance wanders out of specification, corrective
action is taken. Advanced Process Control (APC) describes the feeding-forward of
departures from the mean for each wafer from a previous process tool to a subsequent
one that can correct or compensate for the departure. This is possible for only certain
processes and degrees of departure from the mean but is a valuable yield-enhancing
practice.
The design of experiments (DoE) 14 is typically used to find the “sweet spot” of a process
in a tool and is commonly used during the development of the manufacturing process.
There may be a need for several dependent variables to be optimized and the influential
variables do not always have a control knob. Despite any such lurking variables, clever
and carefully designed experiments can keep the manufacturing processes away from a
cliff. The use of DoE is imperative for developing efficient manufacturing processes,
used to fabricate increasingly complex devices, that continue to push the limits of
manufacturing science.
Diagnosticians of device electrical parametric results and yield often avail themselves of
standard ANOVA or Generalized Linear Methods from statistics to determine sources of
aberrant results. These methods are well-proven but do not always detect the cause of a
problem and may tax the analytical capability of engineers tasked with such
responsibilities. This view was put forth by Gardner and Bieker: 15 “Quickly solving
product yield and quality problems in a complex manufacturing process is becoming
increasingly more difficult. The ‘low hanging fruit’ has been plucked using process
control, statistical analysis, and design of experiments which have established a solid
base for a well tuned manufacturing process.”
Based on this author’s observation of job postings by the semiconductor industry and the
success of software providers in this area, data mining methods are beginning to be in
demand to fill the gaps in the arsenal of those improving device yield in semiconductor
fabricators worldwide.
Back to CONTENTS
III. Data Mining Challenges in Semiconductor Manufacturing
In this section we will survey many pitfalls awaiting the data miner or analyst when
attempting to discover knowledge in semiconductor manufacturing data. There are a
variety of problems, some of which are perhaps unique to manufacturing. Although not
14
A good reference is Design and Analysis of Experiments, 6th edn., Douglas C. Montgomery, Wiley
Press, 2005.
15
“Data Mining Solves Tough Semiconductor Manufacturing Problems,” Mike Gardner and Jack Bieker,
Proc. of ACM SIGKDD 2000, Boston, MA; 2000, pp.376—383.
strictly pertinent to finding a preferred mining method, these problems may still interfere
with reaching our goal of discovering knowledge in data in this industrial area.
As Professor Mastrangelo has pointed out, “current data mining and analysis techniques
do not readily enable modeling of semiconductor manufacturing environments 16 .” While
many mining methods perform well on deep, narrow datasets, and there are proven
methods for mining shallow, wide datasets such as those found in genomics, we are
focused on datasets of fairly equal depth and width, between these two extremes.
With the growth of sensor networks and other input devices, the number of variables
pertinent to data analysis is increasing dramatically. With competitive time pressures on
the discovery of yield-improving knowledge, the luxury of deeper datasets (i.e., more
wafers) is a vanishing possibility for the mining analyst. This work will attempt to
demonstrate the performance of two well-known methods in this venue.
III.A) Scope Limited to Knowledge Discovery (vs. Predictive Analytics)
Data mining encompasses two goals: (1) knowledge discovery from data as well as (2)
predictive analytics. The latter typically involves forming a model, or scoring function,
from one or a combination of mining methods which are compared by a confusion matrix
or gains and ROC charts on how accurately they perform on new data after being trained
and then tested on a hold-out dataset. A model is desired which generalizes the behavior
of the data and which is therefore not overfit to the training data.
The former involves the discovery of data or variable relationships, especially finding the
most pertinent variables affecting a target value. This present work is entirely within the
purview of knowledge discovery from data (KDD). In our application of KDD, mining
methods are successful to the degree that they can discover the cause of reduced
performance of integrated circuits (chips). Only process causes are of interest to us; any
other causes of poor chip performance are not investigated. The process causes we will
investigate are due to rogue process equipment (tools). Processes and equipment
can vary in performance through the drifting of equipment states, changes in consumables, or the
frequency and correct performance of preventive and corrective maintenance. Discovering
these altered states of the process is the goal of our knowledge discovery. If equipment
sensor data were available, one could conceivably create predictive models for process
variation and subsequently, variation in device performance.
Back to CONTENTS
III.B) Hierarchical Nature of Fabrication
There is also a hierarchical nature of the problem in that the wafers are grouped in
batches or lots of typically 25—50 wafers and each wafer may have 100 to several
hundred integrated circuits (chips). It is quite common that a certain region (e.g. center,
16
“Multivariate Process Modeling: The ‘Preprocessing’ Challenge,” Christina Mastrangelo and David
Forrest, pp.1478—1483.
top, annulus, outer edge) of the wafer, comprising a percentage of all the chips, is
consistently performing or yielding differently than the rest of the wafer’s chips.
Although data mining can be performed at the lot or chip level, in this work we will
restrict our focus to the wafer level (using the average parameter value of all chips per
wafer) of mining.
Hierarchies are found in 17 : (i) workpiece groupings; (ii) process & tool state levels; and
(iii) electrically testing increasingly complex structures—from kerf to devices on a chip.
A reasonable approach to use when ferreting out problems in chip performance due to
processing differences is to:
(i) Mine a lot-level, wafer-level or chip-regional (quadrant, radii, patterns) performance
variable (continuous or discrete classes) against the process steps at the tool or chamber
levels to find candidates. Recall that a lot may contain 25-50 wafers, and a wafer may
contain 50-100s of chips.
(ii) Mine performance data against sensor data (e.g. temperature, pressure, power,
chemical flow, step duration) for the top candidate process tools. This adds another layer
of granularity below that of tool in the process / tool / sensor hierarchy. Other
consumable and maintenance activities could be added at the sensor level (tool
maintenance, chamber conditioning, etc.).
(iii) Mine electrical test data of kerf (between chip) structures (such as serpentine intralevel resistor chains) at intermediate metal levels during device interconnect formation as
well as functional tests when the chip is completed and when it is packaged. Obviously,
the sooner a problem is detected (such as at an early kerf device test) the fewer
subsequently started wafers will be affected after fixing the problem compared to having
to wait until the chips are at a functional test before mining and problem detection can
begin.
Back to CONTENTS
III.C) Variable reduction and SME Bias
As with all other mining applications, guidance from subject matter experts (SMEs) may
serve well to exclude variables that are deemed long shots, i.e., variables whose effect would be
unexplainable for the given outcome (although this may exclude surprising results). One example
would be all of the process steps that occur AFTER the performance parameter is
measured (the target variable). Another example would be metrology or test steps prior
to that used to obtain the target variable. This practice can greatly reduce the noise level
in the data. If this approach fails to reveal a likely candidate, then the entire set of
variables can be included. The danger here is that the engineering team may have
conceived a physical model (hypothesis) which they hope that data mining will confirm,
and have therefore limited the input to the data mining model to the types or sequences of
processes that would bring about such a result. Even if it does confirm their suspicion,
17
Another very important hierarchy is that in the time dimension. How many weeks or months of logistic
data will you mine in an automatic production mode?
by excluding the other process steps they may have precluded the discovery of a
secondary or even the primary effect. It is very common that data mining will present
candidates which are corrected one-at-a-time by the engineering team. Subsequent
mining of all of the processes to determine whether previously discarded candidates are
persistent, and therefore real, is a reasonable follow-up action.
III.D) Complexity of Wafer Trajectories
In large manufacturing lines, due to economies of scale, as many wafers are processed as
is possible within the confines of the factory floor space. This often requires many
duplicate toolsets to perform the same type of process. As we’ve already seen, there are
many process steps required to fabricate a device. It would be nice to know the best or
worst combination of tools among the steps. Unfortunately, there is a myriad of possible
combinations of tools used in the hundreds of fabrication steps. The detection of errant
equipment may be further stymied by the need for discovery on relatively few
manufacturing work pieces (silicon wafers). As mentioned above, this is especially true
during the development of a new product or during the initial stages of product ramp-up
when there may be as few as 200 wafers for which pertinent data is to be mined against
over 400 process steps with as many as 10 different tools at each step and with a tool
containing one or several processing chambers.
Considering just tools (not chambers) it is simple to determine the number of ordered
samples for a number, r, of consecutive process steps for which there is a selection of a
tool from a toolset of size n, with replacement: 18 = n^r. For example, simplistically
assuming that there are only 2 process tools at each of 100 sequential process steps, there
are 2 possible paths at the 1st step times 2 possible paths at the 2nd step ... times 2
possible paths at the 100th step = 2^100 possible trajectories of a wafer through the 100
steps.
More realistically, we could evaluate the scenario where there are over 400 steps with
perhaps 50 using 2 tools, 200 using 3 tools, 100 using 4 tools and 50 using 10 tools.
Then there are 2^50 * 3^200 * 4^100 * 10^50 possible unique process-tool trajectories for a
wafer. These are almost astronomical numbers. The number is decreased when just
unique tools are considered since many types of process step use the same toolset. Even
so, these are large numbers.
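As a quick arithmetic check of the scale just described, the short Python sketch below evaluates both counts exactly; the figures are only as meaningful as the assumed tool counts above.

# Exact counts of the wafer trajectories described above.
from math import log10

simple_case = 2 ** 100                            # 2 tools at each of 100 steps
mixed_case = 2**50 * 3**200 * 4**100 * 10**50     # the 400-step scenario above

print(f"2^100 is roughly 10^{log10(simple_case):.0f}")
print(f"The 400-step scenario allows roughly 10^{log10(mixed_case):.0f} trajectories")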
It is quite unrealistic to think that the volume of wafers through a manufacturing line
would come anywhere near populating all of these possibilities. For example, if 10,000
wafers were started each day, it would take well over 10^46 days (= 2.7x10^42 centuries) to
begin to populate the many process-tool combinations. This makes approaches where a
large number of possible combinations is assumed in the population, such as with
traditional statistical hypothesis testing methods, quite unrealistic for finding the best or
worst overall process-tool trajectories through the entire fabrication of complete devices.
18
See, for example, Probability and Statistical Inference, 6th edn., Hogg and Tanis, Prentice Hall, 2001,
p.82.
Our focus should rather be on which tools are bad at particular process steps.
Back to CONTENTS
III.E) Probabilities from the Hypergeometric Distribution
For binary classification where the wafers are partitioned into good vs. bad, defective vs.
nondefective (or high vs. low yield) categories, we can find the probability of a tool
processing a good or bad wafer. This is an excellent sanity check on the results of a
mining model. A useful method for this scenario is to use the hypergeometric
distribution 19 to find random probabilities, since we are drawing without replacement
from a population whose probability of success changes with each draw. Consider
that there are 2 classes of wafer: defective and nondefective. The
probability of a tool processing m wafers having k defective wafers where the defective
population is n out of a total of N wafers is:
$$P(k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}} = \frac{\left(\dfrac{m!}{(m-k)!\,k!}\right)\left(\dfrac{(N-m)!}{(N-m-n+k)!\,(n-k)!}\right)}{\dfrac{N!}{(N-n)!\,n!}}$$
where:
m = the number of wafers processed by a tool
k = the number of hypothetical defective wafers processed in a tool
n = the number of defective wafers in the population
N = the total number of wafers
We can substitute the values from the tools at a suspect process step and use either
Excel’s COMBIN function or, for larger values, the Hypergeometric Calculator 20 to find
the probabilities for each tool at the process step.
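As a programmatic alternative to Excel's COMBIN function or the online calculator, the Python sketch below uses scipy's hypergeometric distribution to produce the same probabilities; the wafer and defect counts shown are hypothetical, chosen only to illustrate the call.

# Hypergeometric probabilities for one tool at a suspect process step.
from scipy.stats import hypergeom

N = 500   # total wafers at this process step (hypothetical)
n = 40    # defective wafers in that population (hypothetical)
m = 60    # wafers processed by the tool under suspicion (hypothetical)
k = 12    # defective wafers actually seen on that tool (hypothetical)

# scipy's parameter order: population size, successes in the population, draws.
dist = hypergeom(N, n, m)
pmf = dist.pmf(k)     # probability of exactly k defective wafers by chance
cdf = dist.cdf(k)     # probability of k or fewer (the CDF discussed below)
ccdf = 1.0 - cdf      # complementary CDF, 1 - CDF

print(f"P(X = {k}) = {pmf:.4f}, CDF = {cdf:.4f}, CCDF = {ccdf:.4f}")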
Consider, for example, the distribution of defective (red) and nondefective (blue) wafers
distributed among 10 tools as shown in this bar chart 21 :
Which of these tools would be the most suspicious and, hopefully, selected by our mining
analysis?
19
The Wikipedia entry on this topic was used for this section.
20
http://stattrek.com/Tables/Hypergeometric.aspx
21
Obtained using Spotfire DecisionSite software.
We can use the hypergeometric distribution to find the probabilities and compare them in
a chart:
where CDF = cumulative distribution function (the probability of having that value or
less) and CCDF = complementary CDF = 1 – CDF.
From the hypergeometric probabilities above, we would think that the defective wafer
counts for tools 2, 3, 6 and 7 would be out of the ordinary: 2 and 6 for being defect-free
and 3 and 7 for having above-normal distributions of defective wafers. This calculation
is not entirely independent because when one tool receives (from prior steps) more
defective wafers, another or others will receive fewer.
As a foretaste of the mining we’ll do in section IV, we can point out here that the
regression tree in IM4D selected tools 3, 5, 7, 10, 11 and 12 as the defective ones given
the process fields from a prior feature selection by RBFN’s entropy ranking. Clearly
there must be a better way (mining method).
The motivation for the data miner approaching many multi-attributed variables with few
records is then to select a method which uses an attribute discriminator that is as unbiased
as possible to the number of attributes in a variable. Fortunately, IM4D’s RBFN method
affords us an improvement towards this selection capability due to its inherent clustering,
nonlinear modeling and, in IM4D, its ranking of variables by entropy or chi-square as
we shall see in section IV.
We now continue our brief survey of trouble areas in mining semiconductor
manufacturing data.
Back to CONTENTS
III.F) The Problem of Autocorrelation
Autocorrelation is a curse in mining manufacturing logistical data. The following
scenario explains what it is and how it may occur.
One scenario is to try to find the tool causing a single bad lot of a product with a small
number of wafer starts (i.e. there aren’t many such lots that have finished processing in
the manufacturing line). There may be many process steps where an entire lot is
processed in a single tool at that step. Each of these single tools is a possible candidate
for the anomalous processing of that bad lot.
Consider another case where each batch consists of 9 wafers. The order of these 9 wafers
is kept in the same sequence within each batch. Suppose that they are processed at steps
with 3 tools or a single tool with 3 chambers. Consider also that the loading of the wafers
into the tools or chambers is always performed the same way (1st 3 wafers to chamber I;
2nd 3 wafers to chamber II; etc.) at each process step’s tool. Assuming no wafer breakage
or culling in previous steps, and no tool unavailability, the clusters of wafers in each
batch would have identical process tool histories:
In the above example, wafers 1,2 and 3 would always have been processed by Chamber I
at the deposition step and Chamber A at the etch step. Likewise, wafers 4, 5 and 6 in
each batch would always have been processed by Chamber II and Chamber B. Now
suppose that Chamber II has a faulty condition causing all wafers processed in it to have
poor electrical performance. These would be wafers 4, 5 and 6 in each batch. The data
mining method would not be able to differentiate between Chamber B and Chamber II
because the wafers processed in them are common.
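A tiny pandas illustration of this confounding, with entirely hypothetical chamber names and results, is sketched below; the cross-tabulation shows the one-to-one mapping between the two chambers, which is why a mining method cannot separate them.

# Fixed loading order: the deposition and etch chambers partition the nine
# wafers identically, so their columns carry the same information.
import pandas as pd

batch = pd.DataFrame(
    {
        "wafer": range(1, 10),
        "dep_chamber": ["I"] * 3 + ["II"] * 3 + ["III"] * 3,
        "etch_chamber": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
        "electrical_ok": [1, 1, 1, 0, 0, 0, 1, 1, 1],  # Chamber II (wafers 4-6) is faulty
    }
)

print(pd.crosstab(batch["dep_chamber"], batch["etch_chamber"]))
print(batch.groupby("dep_chamber")["electrical_ok"].mean())
print(batch.groupby("etch_chamber")["electrical_ok"].mean())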
Suppose that there are only 2 process steps: process A with 2 tools and process B with
10. Suppose that there are only 10 wafers and that process A’s 2nd tool is 20% defective
but has not yet been discovered. Next assuming that we evenly divide the wafers among
the tools at both steps, we would have process A with 5 perfect wafers in the 1st tool and
4 out of 5 perfect wafers in the 2nd tool. This difference would not be terribly noticeable.
However, for process B, one of the 10 tools would have the bad wafer from the 2nd tool of
process A. The obvious mining signal would be that, say, the 6th tool is 100% defective
based on the single bad wafer that happened by chance to be processed by that tool.
Process B’s other tools are 0% defective. This is contrasted with Process A’s 1st tool
which is 0% defective and its 2nd tool which is 20% defective. The 6th tool of process B
is presented as a key candidate to the engineering team for further investigation based on
the mining results.
Obviously, the above scenarios take an extreme perspective of very few batches or few
steps. However, the possibility of mistaking a signal from a process with few tools for
that from a process with many tools is only one reason why it is necessary to investigate
bias in the mining methods.
Back to CONTENTS
III.G) The Problem of Bias in Selection Methods
A tendency with some mining methods is an inherent bias towards selecting variables
with many attributes. This problem has been described in the literature and several
methods have been used to try to mitigate it. The fact that there are many approaches and
no universal unbiased method should give us pause. After Chi-square, we
will briefly list some of the more common selection methods (splitting criteria) from
among 16 listed by Maimon & Rokach 22 :
III.G.1) Chi-square
Simply stated, the Chi-square statistic, χ2, compares the expected to the observed values.
The chi-square ranking orders variables by the sum, over their attributes, of the squared
departure of the observed (O) values from the expected (E) values, divided by the
expected value: 23

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
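A minimal Python sketch of this ranking, using scipy's contingency-table chi-square rather than IM4D's implementation, is given below; the process fields and wafer outcomes are hypothetical.

# Chi-square score of one categorical process field against a binary outcome.
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_score(tool_column, outcome_column):
    observed = pd.crosstab(tool_column, outcome_column)  # observed counts per tool
    statistic, p_value, dof, expected = chi2_contingency(observed)
    return statistic

data = pd.DataFrame(
    {
        "EQP_A11": ["AE01", "AE01", "AE02", "AE02", "AE03", "AE03", "AE03", "AE01"],
        "EQP_B07": ["BH04", "BH02", "BH04", "BH02", "BH04", "BH02", "BH04", "BH02"],
        "defective": [0, 0, 1, 1, 0, 0, 1, 0],
    }
)
scores = {c: chi_square_score(data[c], data["defective"]) for c in ["EQP_A11", "EQP_B07"]}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))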
For the following impurity-based splitting criteria, the following nomenclature and
definition holds: given a training set, S, these are the probabilities of the target feature, y:
24
$$P_y(S) = \left( \frac{\left|\sigma_{y=c_1} S\right|}{|S|}, \ldots, \frac{\left|\sigma_{y=c_{|dom(y)|}} S\right|}{|S|} \right)$$
where $\sigma_{y=c_1} S$ indicates the subset of instances in S for which the feature y takes its
first value c_1, and dom(y) is the entire domain of values of the feature y.
III.G.2) Entropy Ranking 25
The entropy ranking is based on the amount of disorder for a given distribution of values
of the attributes for a variable: 26
$$\mathrm{Entropy}(S) = \sum_{c_j \in dom(y)} -\frac{\left|\sigma_{y=c_j} S\right|}{|S|} \log_2 \frac{\left|\sigma_{y=c_j} S\right|}{|S|}$$
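A minimal Python sketch of this entropy calculation over a column of class labels follows; the six-good, two-bad example is hypothetical.

# Entropy, in bits, of a column of class labels.
import numpy as np
import pandas as pd

def entropy(target):
    proportions = target.value_counts(normalize=True).to_numpy()
    return float(-(proportions * np.log2(proportions)).sum())

print(entropy(pd.Series(["good"] * 6 + ["bad"] * 2)))  # about 0.811 bits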
22
Oded Maimon & Lior Rokach, “Decision Trees,” Chapter 9, The Data Mining and Knowledge Discovery
Handbook, Springer, 2005, pp.168-178.
23
See for example Data Mining: Introductory and Advanced Topics, Margaret H. Dunham, Prentice Hall,
2003, pp.54-55
24
Maimon & Rokach, p.170.
25
See Larose, pp.116-121 for worked-out examples of entropy and information gain.
26
Dunham, pp.98-100.
III.G.3) Information Gain
Briefly, the information gain is the reduction in entropy obtained by splitting on a variable;
when the remaining entropy after splitting on one variable is smaller than that after splitting
on another variable, the first variable has the larger information gain: 27

$$\mathrm{InformationGain}(a_i, S) = \mathrm{Entropy}(S) - \sum_{v_{i,j} \in dom(a_i)} \frac{\left|\sigma_{a_i=v_{i,j}} S\right|}{|S|} \cdot \mathrm{Entropy}\left(\sigma_{a_i=v_{i,j}} S\right)$$
III.G.4) Gain Ratio 28
This is merely the information gain divided by the split information to “normalize” it:
$$\mathrm{GainRatio} = \frac{\mathrm{InformationGain}}{\mathrm{Entropy}}$$
where the information gain is typically calculated first and then, preferably for
appreciable (nonzero) entropy, the gain ratio is calculated.
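The sketch below combines these two criteria for one categorical process field against a binary target, reusing the entropy definition sketched above; it is a generic illustration with hypothetical tool names, not the workbench's own calculation.

# Information gain and gain ratio of a categorical process field.
import numpy as np
import pandas as pd

def entropy(target):
    p = target.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, target):
    # Entropy of the target minus the weighted entropy after splitting on the feature.
    weighted = 0.0
    for value in feature.unique():
        mask = feature == value
        weighted += mask.mean() * entropy(target[mask])
    return entropy(target) - weighted

def gain_ratio(feature, target):
    # Information gain normalized by the split information (entropy of the feature itself).
    split_information = entropy(feature)
    return information_gain(feature, target) / split_information if split_information > 0 else 0.0

df = pd.DataFrame(
    {
        "EQP_A11": ["AE01", "AE01", "AE02", "AE02", "AE01", "AE02", "AE01", "AE02"],
        "EQP_B07": ["BH01", "BH02", "BH03", "BH04", "BH01", "BH02", "BH03", "BH04"],
        "defective": [0, 0, 1, 1, 0, 1, 0, 1],
    }
)
for col in ["EQP_A11", "EQP_B07"]:
    print(col, information_gain(df[col], df["defective"]), gain_ratio(df[col], df["defective"]))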
III.G.5) Gini Index 29
$$\mathrm{Gini} = 1 - \sum_{c_j \in dom(y)} \left( \frac{\left|\sigma_{y=c_j} S\right|}{|S|} \right)^2$$
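A corresponding short Python sketch of the Gini index over a column of class labels, using the same hypothetical six-good, two-bad example:

# Gini index of a column of class labels.
import pandas as pd

def gini_index(target):
    proportions = target.value_counts(normalize=True)
    return float(1.0 - (proportions ** 2).sum())

print(gini_index(pd.Series(["good"] * 6 + ["bad"] * 2)))  # 1 - (0.75^2 + 0.25^2) = 0.375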
III.G.6) Minimum Description Length 30
The size of the tree is the size of its encoding by bits, and the fewer bits the better. The
cost of a split at a leaf t is:
$$\mathrm{Cost}(t) = \sum_{c_i \in dom(y)} \left|\sigma_{y=c_i} S_t\right| \ln \frac{|S_t|}{\left|\sigma_{y=c_i} S_t\right|} + \frac{|dom(y)| - 1}{2} \ln \frac{|S_t|}{2} + \ln \frac{\pi^{|dom(y)|/2}}{\Gamma\left(|dom(y)|/2\right)}$$
where $S_t$ denotes the instances that have reached node t.
Mitchell 31 indicts any method that uses information gain with these comments:
“There is a natural bias in the information gain measure that favors [variables] with many
[attribute] values over those with few values.” He describes the problem this way: “[the
many-attributed variable] has so many possible values that it is bound to separate the
training examples into very small subsets. Because of this, it will have a very high
information gain relative to the training examples, despite being a very poor predictor of
the target function over unseen instances.” He goes on to describe the gain ratio (=
27
Maimon & Rokach, p.174.
28
Maimon & Rokach, p.171.
29
ibid., p.172.
30
ibid., p.178.
31
Thomas Mitchell, Machine Learning, McGraw-Hill, 1997, pp.73-74.
information gain / split information) as a remedy which unfortunately creates another
problem when the split information tends towards zero for a variable.
He then references several alternative approaches to resolving this problem.
Han and Kamber 32 agree with Mitchell and elucidate further in their description of
attribute selection methods for trees: “Information gain … is biased toward multi-valued
[variables]. Although the gain ratio adjusts for this bias, it tends to prefer unbalanced
splits in which one partition is much smaller than the others. The Gini index is biased
toward multi-valued [variables] and has difficulty when the number of classes is large. It
also tends to favor tests that result in equal-sized partitions and purity in both partitions.”
They go on to emphasize Kononenko’s finding 33 : “[Variable] selection measures based
on the Minimum Description Length (MDL) principle have the least bias toward multivalued [variables].”
A popular tree method that claims to mitigate this bias problem is QUEST: 34
“QUEST (quick unbiased efficient statistical tree) addressed the problems in CART that
it tends to select variables with many values, which creates a bias in the model.” 35
Any mining method with this bias is quickly exposed when applied to semiconductor
process logistics data where there are a sizable number of steps with many more tools &
chambers than others (e.g. 15 vs. 2). This is a large problem which we address in this
work by using feature selection prior to ranking top node competitors in a tree.
Back to CONTENTS
IV. Data Mining Approach and Application
IV.A) Data Exploration
Two datasets were created from known causes of electrical test problems. The logistical
data for the time period during which the problems manifested was joined to the electrical
test data. One is labeled SLMP and the other RX. There are thousands of wafers
(instances or records) from electrical tests performed over several months (SLMP) or one
month (RX).
Dataset
SLMP
RX
Variables
406
208
Instances
8,193
3,183
Ratio of Instances to Variables
20.18
15.30
In both datasets, the target or dependent variable is continuous (numeric) and the
independent variables are the fields representing the process steps used to fabricate the
32
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edn., Morgan
Kaufmann, 2006, pp.304-305.
33
ibid. p.379; “On biases in estimating multi-valued attributes,” I. Kononenko, Proc. 14th Joint Int. Conf.
Artificial Intelligence (IJCAI’95), vol.2, pp.1034-1040, Montreal, Canada, Aug. 1995.
34
www.stat.wisc.edu/~loh/quest.html
35
Dunham, p. 123.
devices. Each process field may have a few or around a dozen attributes known as
“tools” representing the semiconductor process equipment used at that step.
Here is a bar chart of the SLMP dataset showing the distribution of attributes:
The RX dataset has a similar distribution of high-attributed fields:
The mode appears to be around 5 attributes (tools) per field (process). The maximum is
16 and the minimum is 1. There is a large number of fields with 9 or more attributes for
RX. This is cause for concern when using mining methods susceptible to a bias for
highly-attributed fields.
As mentioned before, these tools are as identical as possible but unintentional variation
may cause a tool or tools to drift in their processes exercised on the wafers resulting in
unacceptable variation in the device’s electrical behavior. Detecting these rogue tools is
our quest in data mining the logistical data.
The target variable distribution for the SLMP dataset is given in a percentage range of an
electrical parameter, C, on the abscissa and as a frequency percentage on the ordinate:
where we see that it is skewed. However, since we want to retain clarity for the management
team receiving these results, and since there is no assumption of normality in our
modeling algorithms, the variable will not be transformed to reduce the skewness.
Likewise, shown below is the distribution for the RX dataset’s target variable LR, the
ratio of two electrical parameters (multiplied by 100 to overcome a minor glitch 36 in the
tree algorithm):
Back to CONTENTS
IV.B) Data Preparation
36
A bug in this version (8.1) of IM4D precludes the use of target values < 1.0 in the
regression tree.
As with most data mining investigations, a large effort is required in the preparation of
the datasets, 37 both of which comprise a month or more of logistics data from the
company’s data warehouse joined to electrical parameter data. The datasets are then
transformed to formats usable by IBM’s Intelligent Miner for Data and various mining
methods are applied to the data.
SAS Base software was used to join the logistics dataset to the electrical test dataset; both
were obtained from a data warehouse. The joined file was then exported as a CSV file
and processed by IM4D’s idmcsv function 38 to create data (*.dat) and information
(*.info) files which are then read into IM4D using the loadmnb function to create a
mining base object.
Chip functional or parametric test data was averaged for each wafer. There was no need
to filter out subsequent logistical steps with few or no wafers. The range of values for the
electrical test had to be within its specifications or the datum was considered as an outlier
and culled from the data.
Missing data, if excessive (> 25%), could also eliminate a process field from the data. If a
step were repeated, only the 1st incidence was used. This excluded experimental
processes or processes that were being transitioned to newer, replacement processes.
Measurement and testing steps were not included in the logistics fields since they do not
affect the structure or functioning of the device.
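A hedged pandas sketch of these preparation steps is shown below. It stands in for the SAS join and IM4D utilities actually used; the column names (WAFER_ID, PROCESS_STEP, TOOL, C) and the layout of the warehouse extracts are assumptions for illustration only.

# Sketch of the preparation described above: average chips to wafers, cull
# out-of-spec values, keep the first incidence of repeated steps, pivot the
# logistics to one column per process step, drop fields with > 25% missing,
# and join logistics to the electrical target.
import pandas as pd

def prepare(logistics, chip_tests, spec_low, spec_high):
    wafer_tests = chip_tests.groupby("WAFER_ID")["C"].mean().rename("C")
    wafer_tests = wafer_tests[(wafer_tests >= spec_low) & (wafer_tests <= spec_high)]
    logistics = logistics.drop_duplicates(subset=["WAFER_ID", "PROCESS_STEP"], keep="first")
    wide = logistics.pivot(index="WAFER_ID", columns="PROCESS_STEP", values="TOOL")
    wide = wide.loc[:, wide.isna().mean() <= 0.25]
    return wide.join(wafer_tests, how="inner")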
In summary, two datasets were used for this exercise. One (RX) had a single process and
tool causing a problem. The other (SLMP) had a group of 5 process steps which all used
the same tool within their processes.
Back to CONTENTS
IV.C) Feature Selection to Reduce Variables vs. Records
The reduction of dimension or feature selection entails eliminating inconsequential
variables to form a less noisy, more manageable number suitable for most mining
methods. This is critical for mining methods that are very sensitive to noise or are
computationally intensive (assuming a large amount of data must be crunched). Rules of
thumb for the number of variables or records are often suggested by those with
experience in using particular data mining methods: “there must be at least six times as
many records as variables” or “there must be at least one half as many records as
variables for a valid result” and so on. However, these are not easily followed when there
are extremes in the number of variables and records. These extremes are sometimes
denoted as “wide and shallow” versus “narrow and deep.” The former is a challenge to
analytical methods and is encountered routinely in bioinformatics with records on the
order of 10 and variables on the order of 10,000. Narrow and deep datasets are routinely
37
Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999, p. 11.
38
See, for example, Appendix C (pp.347-353) of Using the Intelligent Miner for Data, version 6 release 1,
pub order SH12-6394-00, IBM, 1999.
encountered in business applications of data mining. For example, CRM (customer
relationship management) or retail sales mining can involve ~100 variables and millions
or billions of records. Sampling may be used to reduce the computing time required for
such large datasets although not without a penalty. 39
Semiconductor manufacturing is often limited in the number of available records (wafers)
especially if mining is needed for ramping yield on a new product. Depending on how
they are combined, the number of variables can quickly overwhelm the accuracy of many
methods when only 200 wafers are available: 2,000 electrical variables, 500 process
steps, 2 to 15 process tools used at a process step and 1 to 4 process chambers used in a
process tool. This scenario places the data mining of semiconductor manufacturing near
the 1:1 dashed line but closer to genomic mining than other commonly encountered
mining in the business world as indicated in the chart below:
[Chart: Range of Variables vs. Records in Data Mining. Common logarithm of number of records (y-axis, 0 to 8) vs. common logarithm of number of variables (x-axis, 0 to 5), with retail mining in the narrow & deep region, genomic mining in the wide & shallow region, and semiconductor mining between them near the 1:1 line.]
Since feature selection is commonly used (and desired) in genomic mining 40 , it should be
seriously considered in any approach used for mining semiconductor data.
39
See for example, Data Mining: Concepts, Models, Methods, and Algorithms, Mehmed Kantardzic,
Wiley Interscience, 2003, pp 58—61.
40
D. Amaratunga and J. Cabrera, Exploration and Analysis of DNA Microarray and Protein Array Data,
Wiley, 2004, pp.189—193.
A different realm of problems, beyond the scope of this study, is encountered when realtime sensor data (e.g. temperature, pressure, flow, power) from process tools is mined
against electrical or defect parameters. 41
Various methods are employed in reducing the number of variables in data mining. For
instance, WEKA 42 has methods ranging from genetic algorithms 43 to boosted trees for
feature selection.
The mining in this work is restricted to IBM’s Intelligent Miner for Data. 44 The output
from the radial basis function network is used to find a variable hierarchy. A manageable
number of highly-ranked variables is then fed into the sequential top-node tree mining method.
Back to CONTENTS
IV.D) Methods Chosen for Categorical Variables and Numeric Targets
IBM’s data mining workbench, Intelligent Miner for Data, IM4D, contains 6 different
mining methods:
Neural Network (classification or prediction)
Radial Basis Function Network (classification or prediction)
Association Rules (a priori)
Clustering: Kohonen self-organizing map
Demographic
Classification Tree
Perhaps the 3 most popular methods among these are the tree, neural network and
association rules. However, in my experience, individually these have had limited
success in mining semiconductor logistics data. My preference has been the radial
basis function network, or RBF, method, but it suffers from a difficulty in ranking and
reporting its results.
Shown below is the utility of combining two of these data mining methods: the radial
basis function network and the classification tree to discover a known root cause of a
wide distribution in electrical data.
First, we shall offer a “math-lite” overview of radial basis function networks to
see where they fit in the arsenal of data mining methods and to get a sense of how they
work.
41 PDF Solutions, Proc. AEC/APC Conference, Denver, CO, 2006.
42 Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn, Elsevier, 2005, pp. 420-425.
43 For an example (using WEKA) see D. Larose, Data Mining Methods and Models, Wiley, 2006, pp. 252-261.
44 http://www-306.ibm.com/software/data/iminer/fordata/; this particular member of the Intelligent Miner family of products is a standalone workbench, is no longer sold by IBM, and will no longer be supported after September 30, 2007.

IV.D.1) The Radial Basis Function Network
IBM’s IM4D uses the radial basis function network as a predictor method. 45 Radial
basis function networks are described in great detail in several texts. 46 47 48
The radial basis function network, RBFN, is introduced by several authors 49 as a
special case of a neural network with similarities 50 or equivalency 51 to a type of support
vector machine. Elsewhere 52, a study is cited where an RBFN underperformed against an
SVM with a Gaussian kernel (deemed by others to essentially be an RBFN); but with
“clever engineering” the RBFN could achieve performance close to that of an SVM. 53
The similarity of the RBFN structure to that of other neural networks is evident in the
diagram below where all of the p input variables are individually “fed” into the network’s
hidden nonlinear layer and then combined linearly to the output layer:
45 Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees & Alessandro Zanasi, Discovering Data Mining: From Concept to Implementation, Prentice Hall, 1998, p. 77.
46 See, for example, Chapter 7, “Radial-Basis Function Networks,” in Simon Haykin, Neural Networks: A Comprehensive Foundation, IEEE Press, 1994, pp. 236—284, from which many of the concepts in this section are drawn.
47 Another useful reference work is Christopher M. Bishop, “Radial Basis Functions,” Chapter 5 of Neural Networks for Pattern Recognition, Oxford, 1995, pp. 164—193.
48 A text which presents results and methods from recent research is Lipo Wang and Xiuju Fu, “An Improved RBF Neural Network Classifier,” Chapter 4, and “Rule Extraction from RBF Neural Networks,” Chapter 7, in Data Mining with Computational Intelligence, Springer, 2005.
49 One example is Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and other Kernel-based Learning Methods, Cambridge, 2000, p. 44.
50 See comments by Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edn., Morgan Kaufman, 2006, p. 344.
51 Witten & Frank, p. 234.
52 Bernhard Scholkopf and Alexander J. Smola, Learning with Kernels: SVMs, Regularization, Optimization, and Beyond, MIT Press, 2002, p. 203.
53 ibid., p. 204.
It is unusual for an RBFN to have more than one hidden layer, in contrast to other NNs,
which may have several hidden (and output) layers, all nonlinear.
Functionally, one difference is the use of Gaussian radial basis functions in the hidden
layer, in contrast to the linear nodes of the output layer that produce the classifications or
discrete real values. A useful way to think of the hidden layer is: “each hidden unit essentially
represents a particular point in input space, and its output, or activation, for a given
instance depends on the distance between its point and the instance … the closer these
two points, the stronger the activation.” 54
Haykin introduces RBFNs as addressing a “curve-fitting (approximation) problem in a
high-dimensional space” 55 using “nonlinear layered feedforward networks.” 56
Elsewhere, he describes them as “networks using exponentially decaying localized
nonlinearities (e.g. Gaussian functions) [to] construct local approximations to nonlinear
input-output mapping.” 57 This is in contrast to neural networks which provide a global
approximation with the possibility of exhibiting local minima. With this ability, an
RBFN is capable of solving the Exclusive OR problem.
Cover’s theorem states that a “complex pattern-classification problem cast in high-dimensional
space nonlinearly is more likely to be linearly separable than in a low-dimensional
approximate the data; in fact, one RBF is centered at each data point: 59
F(x) = \sum_{i=1}^{N} w_i \, \varphi\left( \| x - x_i \| \right)

Where the Euclidean norm between the parallel lines indicates the distance of each of the N
data points, x_i, from the vector x. And where: 60

\varphi(r) = \exp\left( -\frac{r^2}{2\sigma^2} \right), for r > 0 and \sigma > 0.
In this equation, σ is an effective width of the radial basis function. However, practical
considerations (computation) would reduce the number of RBFs to a much smaller number
than one per data point. By so doing, the hypersurface-approximating function no
longer passes through each of the data points (i.e. no longer has RBFs centered at each data
point). Each RBF unit, or center, must therefore be strategically located and sized. The
learning mechanism, and optimization, for locating the nonlinear hidden-layer RBF
centers is separate from, and slower than, computing the linear output-layer weights. The
RBF centers may be obtained randomly, by a self-organized method (e.g. k-nearest neighbor), or
54 Ian H. Witten & Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn., Elsevier Press, 2005, p. 234.
55 Simon Haykin, Neural Networks: A Comprehensive Foundation, Chapter 7, “Radial-Basis Function Networks,” IEEE Press, 1994, p. 236.
56 ibid., p. 262.
57 ibid., p. 263.
58 ibid., p. 237.
59 ibid., p. 243.
60 ibid., p. 244.
by a supervised method. The output layer uses supervised learning (e.g. least mean
square). 61 Hastie et al. point out the desirability of normalizing each basis function so
that there are no gaps in coverage throughout hyperspace. 62
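To make these ideas concrete, here is a minimal sketch of a Gaussian RBF network in Python, with centers chosen by k-means clustering and the linear output weights fit by least squares. This is illustrative only and not IM4D’s implementation; the function names and the choice of a single shared width σ are assumptions.

# A minimal Gaussian RBF-network sketch (illustrative only; IM4D's algorithm
# differs in its details, e.g. region sizing and handling of categorical fields).
# Centers come from k-means clustering; output weights from linear least squares.
import numpy as np
from sklearn.cluster import KMeans

def _activations(X, centers, sigma):
    # phi(r) = exp(-r^2 / (2 sigma^2)); one column of activations per center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf_network(X, y, n_centers=6, sigma=1.0):
    """Return (centers, weights) for a Gaussian RBF network with a linear output layer."""
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_
    Phi = _activations(X, centers, sigma)               # hidden-layer outputs
    weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear output layer
    return centers, weights

def predict_rbf(X, centers, weights, sigma=1.0):
    return _activations(X, centers, sigma) @ weights

In practice the widths would be set per center (e.g. from the cluster spread) and, following the suggestion of Hastie et al. above, the basis functions could be normalized so that their activations cover hyperspace without gaps.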
Disadvantages:
Various authors have cautioned users about drawbacks to the use of RBFNs:
• An RBFN gives “every attribute the same weight because all are treated equally
in the distance computation” and “cannot deal effectively with irrelevant
attributes – unlike multilayer perceptrons. SVMs share the same problem . . .” 63
• “in order to represent a mapping to some desired degree of smoothness, the
number of radial-basis functions required to span the input space adequately may
have to be very large.” 64
But RBFNs have advantages, too. “RBF networks are capable of fast learning and
reduced sensitivity to the order of presentation of training data.” 65
In summary, there is usually just one hidden layer in the network comprised of nonlinear
(typically Gaussian) nodes. Mapping the data nonlinearly into a high-dimensional space
makes it easier to find a linear separation of classes than in a low-dimensional space
(Cover’s Theorem 66 ). Key differences from neural networks are given by Specht 67 and
include: “RBFs always cluster whereas PNNs are defined with one node per training
point and have to have clustering added.” Extracting centers using k-means clustering is
a method associated with RBFNs. 68 This clustering feature is evident in IM4D, where a
number of regions (selectable by the miner) is formed for the n points in p-dimensional
space. In the IM4D version, the p variables within each region can be ranked by chi-square or entropy measures.
Back to CONTENTS
IV.D.1.a) Applying RBFNs
When running RBFNs in IM4D, there are default values for many “expert” fields.
One of these is the number of centers or hidden units in the hidden layer. The default
setting usually results in over 10 centers being formed. A setting of 6 results in 4 to 6
centers.
61 ibid., section 7.11, “Learning Strategies,” pp. 264—268.
62 Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2001, pp. 186—188.
63 Ian H. Witten & Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn., Elsevier Press, 2005, p. 234.
64 Haykin, p. 263.
65 Haykin, p. 263.
66 Haykin, p. 237.
67 Donald F. Specht, Chapter 3: “Probabilistic Neural Networks and General Regression Neural Networks,” Fuzzy Logic and Neural Network Handbook, ed. C.H. Chen, IEEE Press, 1996, p. 3.39.
68 Scholkopf and Smola, p. 203.
Below is an example of a 4-unit RBFN for the SLMP problem. The left vertical gray bar
indicates the values of the target variable that correspond to the 4 centers of RBFs in the
network. Each horizontal row has the most important variables for that hidden unit
ranked by either chi-square or entropy. The right column shows the root mean square
error for each hidden unit or center.
Within each row, the variables are represented by pie charts with an outer annulus which
represents the total population distribution of the various tools for each process field.
The inner circle shows the distribution of tools for that center alone. Uncolored slices in
the pie represent missing values.
The leftmost histogram for each row has the gray bars representing the entire population
and red-lined bars that represent the current row.
[Figure: RBFN result rows for the SLMP problem, ranked by chi-square. Each row shows pie charts per process; the red counts of tools per process are 2 or 3.]
The Region ID (4, 5, 6, 7) is shown in the upper right area of each row. Added below
each pie chart is the number (in red) of tools for a particular process.
Now compare the same regions ranked by entropy instead of chi-square:
[Figure: the same RBFN result rows ranked by entropy. The red counts of tools per process are larger, ranging up to 9.]
The black areas in some pies are labeled “other” as a graphical-process-saving feature of
the software. Again, the red number indicates the true number of slices in the inner pie
for each process. It is readily seen in comparing the two charts that there are many more
tools in the entropy ranking than in the chi-square ranking. This difference is of
paramount importance to the work in this thesis.
The distribution of wafers found in each center is shown in a table:
When the number of regions (units in the hidden layer) is large (>10) there is a good
chance that the relative percentage of wafers in a small target-valued row is very small
(<2%). This problem is overcome by setting the number of centers to 6 as we have done
in the above example. The smallest row still has 16.44% of the wafers.
When looking for important variables, the best situation is to find the same field among
the 4 most important in both the top (highest-valued) and bottom (lowest-valued) rows.
Furthermore, we would want to see an opposite or complementary distribution of tools
between the top and bottom rows so that there is a clear story of perhaps only one tool
associated with most of the low value and another tool associated with most of the high
value.
Our observations of drawbacks in the practical application of RBFN in IM4D are that it:
1) produces mining results geared towards arbitrary visual discernment of patterns (rows
of pie charts; one row per center or region for each radial basis function in the network)
2) can sometimes produce too many rows (dozens) for easy discernment of results
3) offers no clear way to perform an overall ranking of the variables from its output (they
are ranked by chi-square or entropy, but separately for each of many rows)
4) works best with many times more instances than variables: a deep, narrow dataset
An illustration of (2) is from our back-tested dataset with the number of regions set to
“default” where we obtain 32 centers (or regions).
The summary table for these 32 centers is shown here:
Here we see that the 4 smallest-valued centers at the bottom each have < 2% of the
population represented. This may lead to misleading results since usually the smallest-valued
row is compared to the largest-valued row. The RMSE is also noticeably larger
for these bottom rows.
As mentioned above, the intuitively appealing nature of radial basis function networks is
that they have features similar to both clustering and neural networks. Each ‘cluster’ is
centered at an RBF region, with a (typically Gaussian) ‘distance measure’ capturing points
within its radius. The RBF method can use categorical or numeric input (or both) and
output. The regions are found automatically by the RBF algorithm and can be tuned via
the minimum number of points defining a region as well as the number of regions to
include in the model. If these are not specified, the software uses default values.
To fine tune the RBF method for best results in this work, this sequence was used:
1) Set the number of regions to that number (typically 4 or 6) which will provide on the
order of 10% of the data in the extreme (e.g. highest or lowest target values) regions of
interest. Although the Redbook 69 suggests finding a ratio of 2 between the regions’
highest and lowest target values (for numeric prediction) I find that the percentage of
points in these regions is probably more important and that the ratio is not important.
2) Save the partition details for the highest and lowest regions as text files and load them
into a spreadsheet program.
3) In the spreadsheet, rank each of the 2 regions by chi-square (provided) and create a
(Pareto) chart of the variables’ target values. Include the topmost variables as features
for further modeling. Alternatively, select the top 50 (depending on the total number of
variables and the number the next model can accept).
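A sketch of step 3 in this sequence is given below. The tab-separated layout and the column names "field" and "chi_square" are assumptions made for illustration, not IM4D’s actual export format.

# Sketch of step 3: rank the saved partition details from the highest- and
# lowest-valued regions by chi-square and keep the union of the top-ranked
# process fields as the feature set for the follow-on trees.
# The file layout and column names here are assumed for illustration.
import pandas as pd

def top_fields(partition_file, top_n=25):
    df = pd.read_csv(partition_file, sep="\t")
    return df.sort_values("chi_square", ascending=False)["field"].head(top_n)

def select_features(high_region_file, low_region_file, top_n=25):
    high = top_fields(high_region_file, top_n)
    low = top_fields(low_region_file, top_n)
    return sorted(set(high) | set(low))   # union of the two regions' top fields

In practice the cutoff would be placed at a knee in the Pareto chart of chi-square values rather than at a fixed count.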
Our initial RBF summary table illustrated the utility of (1) when we ran the RBFN with
regions=6 (resulting in 4 centers) for the same dataset as used above.
We immediately notice that there are only 4 rows (regions) and the values of the target
are less extreme than in the 32-region case: 0.8118—1.008 vs. 0.728—1.05.
For illustrative purposes, let us next compare the rankings of our results for the
above 32- and 4-region RBFNs by chi-square vs. entropy. The entropy ranking contains
many processes with a large number of attributes (tools). The chi-square ranking is less
susceptible to this bias and snags our true causative tool. The chi-square formula used in
IM4D is: 70
\chi^2 = \frac{1}{100\,(d-1)} \sum_{i=1}^{d} \frac{\left( \dfrac{c_i}{c_k} - \dfrac{p_i}{P} \right)^{2}}{\dfrac{p_i}{P}}

Where:
k = a fixed partition
d = the number of categories for a discrete field, or the number of buckets for a continuous field
c(k) = the size of the partition k
c(i) = the number of records in the partition k with value i
p(i) = the total number of records with value i
P = the total number of records
Note that c(i)/c(k) and p(i)/P are the relative frequencies in the partition or background.

69 C. Baragoin, C. M. Andersen, S. Bayerl, G. Bent, J. Lee, C. Schommer, Mining Your Own Business in Retail: Using DB2 Intelligent Miner for Data, IBM Redbooks, August, 2001, pp. 111—116, 126—130.
70 This information is (until 2008) available from IBM’s help site for IM4D.
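As a sanity check on the formula, a literal transcription in Python (treating c and p as aligned per-value counts, and assuming d > 1 and nonzero background counts) might be:

# Literal transcription of the IM4D chi-square measure above: compares the
# value distribution inside one partition (region) with the background
# distribution over all records.  Assumes d > 1 and p_i > 0.
def im4d_chi_square(c, p):
    """c[i]: count of value i inside the partition; p[i]: count of value i overall."""
    d = len(c)            # number of categories (or buckets) for the field
    ck = sum(c)           # size of the partition, c(k)
    P = sum(p)            # total number of records
    total = 0.0
    for ci, pi in zip(c, p):
        background = pi / P       # background relative frequency, p(i)/P
        observed = ci / ck        # relative frequency within the partition, c(i)/c(k)
        total += (observed - background) ** 2 / background
    return total / (100 * (d - 1))

Note that the 100(d − 1) denominator explicitly discounts fields with many categories, which is consistent with the chi-square ranking’s reduced bias toward highly-attributed (many-tool) processes observed in the comparison with the entropy ranking above.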
Here are the results for the 4-regional RBFN in the SLMP dataset. Note that the process
variables (in the PDID column) are ranked here by chi-square and then a Pareto chart is
shown for all of the variables. The red line shows our cutoff point which includes the 24
fields listed below. If there is an easily discernible knee in the Pareto bar charts, then a
cut-off line is placed there; otherwise the choice is arbitrary. In this way, the candidates
are selected (feature extraction). The blue highlighted fields are the known-cause fields,
which is reassuring. Ideally our top 5 fields would be these 5 but having them within the
top 20 is still impressive.
Next we will show the same analysis for the lowest region, still using the chi-square
ranking:
We see that only 4 of the known-cause fields (highlighted in blue) are included and
ranked lower than in the high region. The combination of the high and low ranked fields
comprises our feature selection. These fields are then fed into the IM4D decision tree.
First we will discuss such hybrid (meta) models in the next section.
Back to CONTENTS
IV.D.1.b) Selecting Features with RBFN
Using one learning algorithm to select features for another is a known approach 71 in data
mining. For instance, using the output of a clustering method as input to a
classification or prediction method is a common practice known as meta-, or hybrid,
modeling. In the application of data mining to the semiconductor industry, Intel
has reported on the use of regression or neural network methods after clustering. 72
Since the RBF method clusters the data within its regions, it provides an attractive
precursor to a follow-on method; the RBFN selects features for reducing the dimension
of the variables. The top-ranked variables from either chi-square or entropy ranking can
71 See, for example, section 2.1.4, “Using One Learning Algorithm as a Filter for Another,” in Chapter 5, “Dimension Reduction and Feature Selection,” by B. Chizi and O. Maimon, in The Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, eds., Springer, 2005, pp. 98-100.
72 Randall Goodwin, Russell Miller, Eugene Tuv, Alexander Borisov, Mani Janakiram, Sigal Louchheim, “Advancements and Applications of Statistical Learning / Data Mining in Semiconductor Manufacturing,” Intel Technology Journal, 8(4), 2004, p. 330, Figure 6.
be compared for matches (e.g. visually using a Pareto chart) as a form of voting. Another
possible approach is to find matches among highly chi-square ranked variables between
the RBF regions for the highest and lowest target values.
Next we have the problem of reconciling or accounting for results from the different
regions of the RBFN. Since each region is individually ranked, it is very likely, for
hundreds of variables, that a different variable will be ranked highest in each region.
IM4D’s RBFN results are presented in the order of their distributions of the target
variable typically with the highest-valued region as the top row of results. One can
therefore look at the topmost and bottommost regions to try to find the same variable
among those that are top ranked. The next consideration is whether the same tool is
predominant in both regions for a variable (process). If this is the case, then it may
merely indicate that a particular tool has a wide variation in its values and is both the best
and worst tool according to the target values of its records (wafers). A design of
experiments (using ANOVA) could then clarify where the variation originates so that it
can be corrected.
However, we would rather find different tools, for a given process, between the top and
bottom regions for a highly ranked process. In IM4D this can be done visually by
comparing the pie charts to see if there is a large difference in tools (inner pie slices)
compared to the overall population’s distribution in that tool (outer circle).
I have found that an easier approach is to let a tree do this discrimination. We have then
used the RBFN for feature selection (using its top-ranked variables), followed by
sequential trees to discriminate among the tools for each variable.
The RBFN may require fine-tuning to determine the best number of regional centers as
well as discovering the best way to use its output directly for interpretation or as input to
a follow-on mining method (decision tree). Rules for single or combined equipment
paths can then be obtained in view of the desired value of the electrical characteristic.
Ultimately, the ranked output of each method is compared to the known cause(s) in order
to find the best method.
Back to CONTENTS
IV.D.2) Classification and Regression Trees
IV.D.2.a) Operation of the basic tree method
Tree methods are very popular 73 among, and a mainstay of, data miners. The data is split
two or more ways at each node or branch in the tree according to which variable best
separates the data by one of various criteria (e.g. Gini, Information Gain, Gain ratio). An
inherent drawback of this approach is that the dataset is reduced in size after each split, so
the significance of a splitting variable is pertinent only to the sub-branch which it separates.
This splitting continues down to the final leaves. The variable splitting the top or initial
node is the only one of interest to us here.
73 See, for example, the KDnuggets poll archive at www.kdnuggets.com
IV.D.2.b) Sequential top node method 74
In this method of KDD using trees, after the tree is formed, the variable at the top node is
recorded and then removed from the input fields, and the tree method is applied again to find the 2nd-most
important variable. Repeating this to completion among the variables creates a hierarchy
of rules ranked by their importance. With hundreds of variables, this is performed only
after a feature selection method reduces the number of variables down to a more
manageable dataset.
Such a method has been used with the 1R tree “stump” method (found in WEKA)
to create single node rules. 75
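A sketch of the sequential top-node procedure, using scikit-learn rather than IM4D and assuming the process fields have already been numerically encoded (one column per process step), is:

# Sequential top-node ("tree stump") ranking: fit a one-level regression tree,
# record the variable chosen at the root, drop it, and repeat until the candidate
# features are exhausted.  Illustrative only; the thesis work used IM4D's tree.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def sequential_top_nodes(X: pd.DataFrame, y) -> list:
    """X: encoded process/tool columns (one column per process step); y: numeric target."""
    remaining = X.copy()
    ranking = []
    while remaining.shape[1] > 0:
        stump = DecisionTreeRegressor(max_depth=1).fit(remaining, y)
        root_feature = stump.tree_.feature[0]
        if root_feature < 0:          # the stump could not split any further
            break
        top = remaining.columns[root_feature]
        ranking.append(top)
        remaining = remaining.drop(columns=[top])
    return ranking

Here each column is treated as one variable; if tools were instead one-hot encoded, all columns belonging to the chosen process step would be dropped together before the next stump is grown.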
Back to CONTENTS
IV.E) Rules for Multiple Process Tools
If a subject matter expert (SME, e.g. a manufacturing or device yield engineer)
suspects that there is a process-to-process interaction between tools, such as between a
lithography tool and a subsequent etching process’ tool, then it would be useful to retain
2 levels in each tree for the SME to peruse. This could be performed for each tree but the
“stump” single-node tree could not be used. In IM4D, by setting the number of levels to
2, such 2-node trees are easily produced.
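For readers without IM4D, an equivalent 2-level tree can be sketched with scikit-learn, whose export_text listing pairs the root-node process with a second-node process for the SME to review. The column names are placeholders, and the encoding assumptions are the same as in the stump sketch above.

# Sketch of a 2-level tree whose printed rules pair the root-node process with a
# second-node process, so an SME can look for plausible process-to-process
# interactions.  Illustrative only; the thesis work used IM4D.
from sklearn.tree import DecisionTreeRegressor, export_text

def two_level_rules(X, y):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
    return export_text(tree, feature_names=list(X.columns))

# Example usage: print(two_level_rules(encoded_logistics, electrical_target))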
When a “match” to intuition or experience is made by the SME between the top
and second node variables (processes) then both process variables (top and 2nd nodes) of
the tree should be added to the pool of variables before proceeding with the sequential
elimination of top nodes as described in the section above. Matches in this case are
processes that are strongly related in the formation of a device feature that can directly or
indirectly affect the electrical parameter used as the target. As mentioned, a common
interaction of variables is for the definition of a feature which involves
photolithographically defining the shape’s boundary features and a subsequent etching
step that reproduces the photolithographically-defined pattern into an underlying film.
The degree of exactness of the pattern transfer down to the base of the underlying film
depends on how vertical (anisotropic) the walls of the pattern are. The process variables
of the etching process can strongly affect this anisotropy. The feature dimensions at the
base of the film may define the region for doping the underlying silicon comprising the
active device region (e.g. the gate channel) and therefore is of paramount importance in
device performance (e.g. speed of switching for gate length).
An example of a rule for multiple process tools would be:
If tools JS06, JN03, or JS05 are used at the D39 process step
AND
If tool IA01 is used at the F23 step
THEN
C = 69.348 for 2,386 wafers (29% of the entire population)
The electrical parameter is reduced in value by 5.5% from the overall mean for 29% of
the wafers. The interaction of D39 with F23 may make sense to the SME.

74 This method was first described to me in a private conversation with an IBM colleague, Roger Cheek.
75 Holmes, G. and Nevill-Manning, C.G., “Feature selection via the discovery of simple classification rules,” Proc. Symp. Intelligent Data Analysis, Baden-Baden, Germany, 1995. http://www.cs.waikato.ac.nz/~ml/publications/1995/Holmes95-IDA95.pdf
This is shown below in Node ID 1.1.1 in the tree summary from Intelligent Miner for
Data.
The tree diagram provides a histogram of the target variable “C” and pie charts of the
distribution of tools used at the process steps. For example, the distribution in Node Id
1.1.2 (4th row below) has a distribution (tan bars in front of background population’s gray
bars) that is slightly higher than that of Node 1.1.1 (3rd row). The predominant tool used
in process F23 (rightmost column) is IFD2 for node 1.1.2.
We now return to our analysis of the SLMP data where we selected features based on a 4-region RBF with chi-square ranking.
The decision tree is not run in the usual way, but is run as a stump where we are
interested in only the top node. That field is then eliminated from the input fields and
another stump is run. In this way, we have a listing of sequential top nodes:
This is the 1st or top node tree stump that we obtain for the SLMP dataset with chi-square
feature selection from a 4-region RBFN:
The next tree obtained is:
We continue in this manner until we have exhausted all of the selected features with as
many tree stumps formed. We are then ready for our visualization approach described in
the next section.
Back to CONTENTS
IV.E.1) Method for Summarizing and Prioritizing Rules: The Tree Curve 76
A distinct disadvantage of knowledge discovery, compared with predictive analytics, when
there are roughly as many variables as records, is the lack of diagnostic methods, such as gain
or lift curves, with which to compare mining methods. The use of the hypergeometric method
mentioned in Part III.D for categorical variables is appropriate for classification problems
but our target is continuous. We could bin our continuous target for classification, but
often a binning boundary, or cutoff, is difficult to determine in electrical test parameters.
A method introduced in this work is that of the “tree curve.” Rather than simply rank the
variables again using sequential top nodes from the tree, we will plot the nodes on a chart
with the number of records (wafers) vs. target value (electrical parameter). The mining
results can then be judged in importance visually by the end-user.
The general form of the tree curve is shown below. It has the shape of a cusp where the
apex is defined by the mean of the population. Any rules obtained from the tree will
involve a smaller number than the total population due to the splitting nature of nodes.
Rules which result in target values less than the mean will be represented by a cloud of
points roughly on or below the left curve; complementarily, those higher than the mean
follow on or below the right curve. Admittedly, this is subjective since the curve is
drawn by eye. Perhaps splines or other smoothing methods could be used to draw the
boundary curves on both sides less ambiguously.

76 Submitted by the author as an invention in IBM.
If there is a large effect due to a single process or combination of processes, it will be
represented as a point above the curves towards the upper left or right extremes of the
chart depending on whether the target values of interest are below or above the mean.
For example, a point representing low yield would follow the left curve if the target is
percent yield. However, if the target is yield loss (=1-yield), then low yield would follow
the right curve.
Most of the rules obtained by the tree lie below the curve(s). The candidates of most
interest are therefore those lying on or above the curve. If any points lie above the curve,
they would draw the most interest from the end-user since they represent more instances
for given departures from the mean.
If the phenomenon being mined affects only a small percentage of the wafers, then points
towards the lower extremities of the curve(s) would be of interest. If the yield loss is of a
smaller degree and affecting a sizeable percentage of the wafers, then points near the
apex of the cusp would be of more interest. In this way, the end-user can quickly
ascertain which of a plurality of rules from the tree are of most interest for the situation
being modeled.
Figure 1: Tree Curve. From the above chart we note the apex of the curves which
represents the average target value of all of the wafers.
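A minimal sketch of how such a chart could be drawn (outside IM4D) is shown below; the rule dictionary keys are illustrative assumptions, and the boundary cusp itself is still drawn by eye as described above.

# Tree-curve sketch: one point per tree-stump branch, plotting the branch's mean
# target value against the number of wafers it covers; the population mean and
# size mark the apex of the cusp.  Field names in the rule dicts are assumptions.
import matplotlib.pyplot as plt

def plot_tree_curve(rules, population_mean, population_size):
    """rules: e.g. [{"label": "process/tool", "target": 0.95, "wafers": 1200}, ...]"""
    fig, ax = plt.subplots()
    for r in rules:
        ax.scatter(r["target"], r["wafers"], color="tab:blue")
        ax.annotate(r["label"], (r["target"], r["wafers"]), fontsize=8)
    ax.scatter([population_mean], [population_size], color="tab:red", marker="^")
    ax.set_xlabel("Target value (electrical parameter)")
    ax.set_ylabel("Number of wafers covered by rule")
    ax.set_title("Tree curve")
    return ax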
To show how the best candidate is visually identified as an outlier from the boundary
curve, here is a section of the left side of the Tree Curve showing that the tools (IA02,03
at E24 in the chart) from the one process (E24) have a markedly lower value of the
electrical parameter for a sizable number of wafers. This result was obtained prior to
discovery by conventional engineering methods! This proves the utility of this
combined approach (RBFN for feature selection, followed by the classification tree for
rule formation, and the tree curve for visualization of the key process(es)/tool(s)).
Here is the same visualization for the Entropy-ranked variables showing that no
process/tool extended above the curve as in the chi-square case above. Note the large
number of tools identified for these highly ranked processes – revealing the inherent bias
in the entropy-based ranking for highly-attributed variables during the feature selection.
We now continue with the SLMP dataset and, unlike the RX dataset tree curve above, we
will illustrate plotting BOTH branches within the top node of each tree stump that we
obtained for each selected feature. One branch will be plotted on the left side of the tree
curve and the other branch will be plotted on the right side. The visualization does not
yield as dramatic a story as for the RX dataset.
With selective labeling, this tree-curve immediately draws our attention to the field the
farthest outside of the curve: D39. However, we also note the strong clustering of the
F21-24 fields. Although sought out and labeled in 20/20 hindsight, this grouping of
fields, with its nearness to the curve, would have drawn attention (in their “real”
and not anonymized labeling, they all share the same prefix so the grouping is much more
striking). The RBF ranking would further elicit suspicion and attention.
A “sanity check” on the D39 finding would be to see how many attributes (process tools)
were associated with it. There are 6 tools, but one large slice of the pie is for missing
values. Even though this process was not implicated at the time, it would have been
interesting for the engineering team to investigate it.
Back to CONTENTS
IV.F) Results for 2 datasets:
With less clutter than the next table, the one here summarizes the results of the hybrid or
meta-model using an RBFN and a tree on the 2 datasets:

Dataset   Known Cause   RBF-Tree Result                  KDD approach
SLMP      F20—F24       3 w/i top 13 and 2 w/i top 10    Fairly successful
RX        E24           #1                               Successful
The RBF-tree results are included along with those of each method used alone; the results
of a neural network model are also added. Note that parameters are also given for the
RBF models. The rank of the known-cause variables is given for each method:
Using the techniques described above on 2 datasets, we find that RBF with less than or
equal to 6 regions using Chi-square ranking of the fields (green rectangle) resulted in the
highest (top 10) ranking (blue) of the actual causes. It was the only method that identified
all of the actual causes within the top 10 variables. This required ranking in both the
highest region and lowest region. From the tree curve below right, it is obvious by eye
that F20 has different behavior from the grouping (green ellipse) of processes F21--24.
This is also evident in its ranking by the RBF method in that the highest rank for F20 was
obtained by the entropy ranking (3rd) rather than chi-square ranking (9th).
The neural prediction method gave a noteworthy performance on all of the fields except
F23 and F24, where the lowest of 8 results was required to find F22.
The decision tree method alone was poor across the board. Its lack of performance was a
strong motivator for embarking on this entire work.
The RBF-TreeCurve method mirrored the RBF, Chi-square methods for the RX dataset.
However, it was not as strong in identifying the causes for the SLMP dataset other than
the fact that F21-24 were clustered tightly together and slightly penetrated the curve. A
key qualification for success in the SLMP dataset was to obtain enough F20-F24 fields in
a high ranking so that suspicion would be directed towards a process tool that was
common to all of these F steps. By that criterion, the tree-curve is a success since the
tight clustering of F21-24 would immediately raise the question to the observer as to the
cause of their joint behavior. This further justifies the thrust of the work in this thesis.
In summary, this work introduces the use of an RBFN to select variables, which are used as
the input to sequential trees (with 1 or 2 levels) whose top nodes are then plotted in
a tree curve to allow visual confirmation of highly-ranked candidates associated with
the mining target.
Back to CONTENTS
V. Further Study
V.A) Mining Methods
As described above, the QUEST tree method is touted as being unbiased towards multi-attributed variables; such bias is one of the major problems encountered in mining
semiconductor manufacturing logistics data. The QUEST tree or any tree using the MDL
method for node splitting would be logical next steps beyond this work.
Since SVMs were noted by some authors to be superior to RBFNs, it would be interesting
to use an SVM with an RBF kernel and compare the results for feature selection. IM4D
does not offer SVM mining.
V.B) Feature Creation
A promising avenue of study 77 would be to transform the logistics dataset so that similar
processes with identical tools are reduced to a single variable. A common objection from
the process engineering team, when asked to investigate a data mining result, is that there
are processes similar to the implicated one, using the same tool, which show
no signal or indication of being a problem. “If the tool is bad for process A then
why isn’t it bad for process B?” This question is especially valid when process B is a
repetition of process A during the formation of a repetitive structure such as when
forming different levels of metal interconnect lines. Unless it can be shown that a layout
design ground rule is violated at a particular level so that it is more susceptible to a short
or misalignment, it is difficult to obtain “buy-in” from the process engineers or their
management that the mining result is real.
77 Disclosed by the author as an invention within IBM on December 18, 2006.
If such similar processes are combined into a single variable, and it is ranked highly in
the mining results, then the objection is less likely to occur. Such a transformation is
likely to make the signal stronger and easier to find by the mining method. The
alternative is to hope that the signal is equally strong among the similar processes so that
all of them are highly ranked. A private observation by the author is that not all mining
methods are equal to this task and a favorite mining method may easily miss such a
combined signal.
The determination of which processes should be combined into single variables could be
easily made by subject matter experts, specifically process integration teams.
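A sketch of such a transformation, assuming a pandas DataFrame of wafer logistics and an SME-provided grouping map (both hypothetical here), might be:

# Sketch of the proposed feature creation: collapse a family of similar process
# steps that should share a common tool (e.g. repeated interconnect levels) into
# one variable.  Column names and the grouping map are hypothetical.
import pandas as pd

def combine_similar_processes(logistics: pd.DataFrame, groups: dict) -> pd.DataFrame:
    """groups maps a new variable name to the list of process columns it replaces,
    e.g. {"MTL_RIE_ALL": ["MTL_RIE_L1", "MTL_RIE_L2", "MTL_RIE_L3"]}."""
    combined = logistics.copy()
    for new_col, cols in groups.items():
        def collapse(row):
            tools = set(row)
            # Same tool at every similar step -> that tool; otherwise flag as mixed
            return tools.pop() if len(tools) == 1 else "MIXED"
        combined[new_col] = combined[cols].apply(collapse, axis=1)
        combined = combined.drop(columns=cols)
    return combined

One possible encoding is shown (the common tool when all similar steps agree, a "MIXED" flag otherwise); other encodings, such as retaining the full tool path, are equally possible.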
Please note that such a transformed dataset should be mined separately from the original
dataset since the possibility of a problem at a particular level, not due to a common tool
among similar levels, is still real.
V.C) Dataset Access Limitations Due to Commercial Prudence
Typical of applied data mining studies is the proprietary nature of the algorithms used in
the data mining method as well as the dataset. However, for those mining manufacturing
data in industry, the generic description of the method and the availability of
manufacturing data should suffice not only in the testing of the findings from this study
but also as a springboard to the testing of new or comparable mining methods.
Since this study involves manufacturing data from a company in a highly competitive
industry, the dataset is proprietary. This prevents the testing of alternative methods on
this dataset unless performed by those from this company. This is a common situation in
the mining literature involving semiconductor manufacturing. [1-12] Unfortunately, this
limits research of these problems to those within, or funded by, this industry.
Back to CONTENTS
Appendix A:
Acronyms for Process Variables
AERO for cryogenic aerosol cleaning
CMP for chemical mechanical planarization
CVD for chemical vapor deposition
FRN for furnace
INS for insulator
ION for ion implantation
LNR for liner
LTH for photolithography
MTL for metal
PE for plasma activated
PLT for plating
RIE for dry, or reactive ion, etching
RTP for rapid thermal processing
WET for wet chemical etching or cleaning
Acronyms
AMD = Advanced Micro Devices
ANOVA for analysis of variance
APC for advanced process control
CART for classification and regression tree
CCDF for complementary CDF
CDF for cumulative distribution function
CRM for customer relationship management
CSV for comma separated variables
DoE for design of experiments
GHz for gigahertz (10E9 cycles/sec)
GUI = graphical user interface
ILT for in-line electrical test
IM4D = intelligent miner for data
KDD for knowledge discovery in data
MDL for minimum description length
PDID for process description identification
PNN for probabilistic neural network
QUEST for quick unbiased efficient statistical tree
R&D for research and development
RBF = radial basis function network
RBFN = RBF network
ROC for receiver operator characteristic
ROI = return on investment
SME = subject matter expert
SPC for statistical process control
TSMC for Taiwan Semiconductor Manufacturing Company
Back to CONTENTS
Appendix B: Cited Books
1) Dhammika Amaratunga and Javier Cabrera, Exploration and Analysis of DNA
Microarray and Protein Array Data, Wiley Series in Probability and Statistics, 2004.
2) C. Baragoin, C. M. Andersen, S. Bayerl, G. Bent, J. Lee, C. Schommer, Mining Your
Own Business in Retail: Using DB2 Intelligent Miner for Data, IBM Redbooks, August,
2001.
3) Christopher M. Bishop, Neural Networks for Pattern Recognition, Oxford, 1995
4) Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees & Alessandro Zanasi,
Discovering Data Mining: From Concept to Implementation, Prentice Hall, 1998.
5) C.H. Chen, ed., Fuzzy Logic and Neural Network Handbook, IEEE Press, 1996.
6) Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines
and other Kernel-based Learning Methods, Cambridge, 2000
7) Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice
Hall, 2003.
8) Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd
edn., Morgan Kaufmann, 2006.
9) Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical
Learning: Data Mining, Inference and Prediction, Springer, 2001.
10) Simon Haykin, Neural Networks: A Comprehensive Foundation, IEEE Press, 1994
11) Robert V. Hogg and Elliot A. Tanis, Probability and Statistical Inference, 6th edn.,
Prentice Hall, 2001.
12) IBM DB2 Intelligent Miner for Data: Using the Intelligent Miner for Data, Version
6, Release 1, IBM, 1999.
13) Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms,
Wiley Interscience, 2003.
14) Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining,
Wiley, 2005.
15) Daniel T. Larose, Data Mining Methods and Models, Wiley, 2006
16) Oded Maimon and Lior Rokach, eds., The Data Mining and Knowledge Discovery
Handbook, Springer, 2005.
17) Thomas Mitchell, Machine Learning, McGraw-Hill, 1997.
18) Douglas C. Montgomery, Design and Analysis of Experiments, 6th edn., Wiley Press,
2005.
19) Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufman, 1999.
20) Dorian Pyle, Business Modeling and Data Mining, Morgan Kaufmann, 2003.
21) Bernhard Scholkopf and Alexander J. Smola, Learning with Kernels: SVMs,
Regularization, Optimization, and Beyond, MIT Press, 2002.
22) Lipo Wang and Xiuju Fu, Data Mining with Computational Intelligence, Springer,
2005.
23) Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and
Techniques, 2nd edn, Elsevier, 2005
24) S. Wolf and R. Tauber, Silicon Processing for the VLSI Era, Vol. 1: Process
Technology, 2nd edn., Lattice Press, 1999.
Back to CONTENTS
Biography of the Author
Steven G. Barbee is continuing his second career in data mining as a Sales Engineer with
SPSS, Inc. He retired from IBM after having mined semiconductor manufacturing data
for 4 years as a Senior Engineer in IBM's Systems and Technology Group in East
Fishkill, NY. Prior to that he was a Senior Engineering Manager in IBM's
Microelectronics Division reporting to the Research Division where he supervised the R
& D of specific semiconductor unit processes and their control. He is a patentee (33 US
patents) and author in this field. He served as a National Science Foundation Auditor for
the two annual reviews of the Engineering Research Center at North Carolina State
University. He holds a Master of Science degree concentrated in plasma physics from
Columbia University's School of Engineering and Applied Science. He earned a
Bachelor of Arts degree in Physics and Math, summa cum laude, from Doane College.
He is a member of the IEEE Computer Society, the American Computer Society and the
New York Academy of Sciences and a past member of the American Physical Society
and the American Vacuum Society. His community interests include leading an alpha
class as a deacon at the Warren Congregational Church in Connecticut; prior to that he
served as senior warden and licensed lay reader of an Episcopal church. He held two
national staff positions at the ’93 and ‘97 National BSA Jamborees and served his local
community as Scoutmaster for 9 years. Mr. Barbee is married with two sons (Paul and
David); his extended family includes 4 international exchange student “sons” (Rob,
Minh, Ashley and Tercio). He resides with his wife, Debbie, in Amenia, New York.