These are general notional tutorial
slides on data mining theory and
practice from which content may be
freely drawn.
Monte F. Hancock, Jr.
Chief Scientist
Celestech, Inc.
Data Mining is the detection,
characterization, and exploitation of
actionable patterns in data.
Data Mining (DM)
• Data Mining (DM) is the principled detection, characterization, and
exploitation of actionable patterns in data.
• It is performed by applying modern mathematical techniques to collected
data in accordance with the scientific method.
• DM uses a combination of empirical and theoretical principles to Connect
Structure to Meaning by:
– Selecting and conditioning relevant data
– Identifying, characterizing, and classifying latent patterns
– Presenting useful representations and interpretations to users
• DM attempts to answer these questions:
  – What patterns are in the information?
  – What are the characteristics of these patterns?
  – Can "meaning" be ascribed to these patterns and/or their changes?
  – Can these patterns be presented to users in a way that will facilitate their assessment, understanding, and exploitation?
  – Can a machine learn these patterns and their relevant interpretations?
DM for Decision Support
● "Decision Support" is all about…
  – enabling users to group information in familiar ways
  – controlling complexity by layering results (e.g., drill-down)
  – supporting users' changing priorities
  – allowing intuition to be triggered ("I've seen this before!")
  – preserving and automating perishable institutional knowledge
  – providing objective, repeatable metrics (e.g., confidence factors)
  – fusing & simplifying results
  – automating alerts on important results ("It's happening again!")
  – detecting emerging behaviors before they consummate ("Look!")
  – delivering value (timely-relevant-accurate results)
● …helping users make the best choices.
DM Provides “Intelligent” Analytic Functions
● Automating pattern detection – to characterize complex, distributed signatures that are worth human attention… and recognize those that are not.
● Associating events – that "go together" but are difficult for humans to correlate.
● Characterizing interesting processes – not just facts or simple events.
● Detecting actionable anomalies – and explaining what makes them "different AND interesting".
● Describing contexts – from multiple perspectives – with numbers, text, and graphics.
DM Answers Questions Users are Asking
● Fusion Level 1: Who/What is Where/When in my space?
  – Organize and present facts in domain context
● Fusion Level 2: What does it mean?
  – Has this been seen before? What will happen next?
● Fusion Level 3: Do I care?
  – Enterprise relevance? What action should be taken?
● Fusion Level 4: What can I do better next time?
  – Adaptation by pattern updates and retraining
● How certain am I?
  – Quantitative assessment of evidentiary pedigree
Useful Data Applications
● Accurate identification and classification – add value to raw data by tagging and annotation (e.g., fraud detection)
● Anomaly / normalcy and fusion – characterize, quantify, and assess "normalcy" of patterns and trends (e.g., network intrusion detection)
● Emerging patterns and evidence evaluation – capturing institutional knowledge of how "events" arise and alerting when they emerge
● Behavior association – detection of actions that are distributed in time & space but "synchronized" by a common objective: "connecting the dots"
● Signature detection and association – detection & characterization of multivariate signals, symbols, and emissions (e.g., voice recognition)
● Concept tagging – reasoning about abstract relationships to tag and annotate media of all types (e.g., automated web bots)
● Software agents assisting analysts – small-footprint "fire-and-forget" apps that facilitate search, collaboration, etc.
Some “Good” Data Mining Analytic Applications
• Help the user focus via unobtrusive automation
  – Off-load burdensome labor (perform intelligent searches, smart winnowing)
  – Post "smart" triggers/tripwires to data streams (e.g., anomaly detection)
  – Help with mission triage ("Sort my in-basket!")
• Automate aspects of classification and detection
  – Determine which sets of data hold the most information for a task
  – Support construction of ad hoc "on-the-fly" classifiers
  – Provide automated constructs for merging decision engines (multi-level fusion)
  – Detect and characterize "domain drift" (the "rules of the game" are changing)
  – Provide functionality to make best estimates of "missing data"
• Extract/characterize/employ knowledge
  – Rule induction from data; develop "signatures" from data
  – Implement reasoning for decision support
  – High-dimensional visualization
  – Embed "decision explanation" capability into analytic applications
• Capture/automate/institutionalize best practice
  – Make proven analytic processes available to all
  – Capture rare, perishable human knowledge… and put it everywhere
  – Generate "signature-ready" prose reports
  – Capture and characterize the analytic process to anticipate user needs
Things that make “hard” problems VERY hard
– Events of interest occur relatively infrequently in very large datasets (“population
imbalance”)
– Information is distributed in a complex way across many features (the “feature selection
problem”)
– Collection is hard to task, data are difficult to prepare for analysis, and are never “perfect”
(“noise” in the data, data gaps, coverage gaps)
– Target patterns are ambiguous/unknown; “squelch” settings are brittle (e.g., hard to
balance detection vs. “false-alarm” rates)
– Target patterns change/morph over time and across operational modes ("domain drift"; processing methods become "stale")
Some Key Principles of
“Information Driven” Data Mining
1. Right People, Methods, Tools (in that order)
2. Make no prior assumptions about the problem ("agnostic")
3. Begin with general techniques that let the data determine the direction of the analysis ("Funnel Method")
4. Don't jump to conclusions; perform process audits as needed
5. Don't be a "one widget wonder"; integrate multiple paradigms so the strengths of one compensate for the weaknesses of another
6. Break the problem into the right pieces ("Divide and Conquer")
7. Work the data, not the tools, but automate when possible
8. Be systematic, consistent, thorough; don't lose the forest for the trees
9. Document the work so that it is reproducible
10. Collaborate to avoid surprises: team members, experts, customer
11. Focus on the Goal: maximum value to the user within cost and schedule
Select Appropriate Machine Reasoners
1.) Classifiers
Classifiers ingest a list of attributes, and determine into which of finitely many categories the entity exhibiting
these attributes falls. Automatic object recognition and next-event prediction are examples of this type of
reasoning.
2.) Estimators
Estimators ingest a list of attributes, and assign some numeric value to the entity exhibiting these attributes.
The estimation of a probability or a "risk score" are examples of this type of reasoning.
3.) Semantic Mappers
Semantic mappers ingest text (structured, unstructured, or both), and generate a data structure that gives the
"meaning" of the text. Automatic gisting of documents is an example of this type of reasoning Semantic
mapping generally requires some kind of domain model.
4.) Planners
Planners ingest a scenario description, and formulate an efficient sequence of feasible actions that will move the
domain to the specified goal state.
5.) Associators
Associators sample the entire corpus of domain data, and identify relationships among entities. Automatic
clustering of data to identify coherent subpopulations is a simple example. A more sophisticated example is
the forensic analysis of phone, flight, and financial records to infer the structure of terrorist networks.
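To make the first two reasoner types concrete, here is a minimal sketch (not from the deck) contrasting a classifier, which maps an attribute list to one of finitely many categories, with an estimator, which maps the same kind of attribute list to a numeric value such as a risk score; the data and feature codes are invented.

```python
# Minimal sketch: a classifier vs. an estimator (illustrative data, scikit-learn).
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Invented attribute vectors: [habitat_code, diet_code, integument_code]
X = np.array([[2, 3, 1], [1, 3, 3], [2, 2, 1], [2, 1, 4]])

# Classifier: ingest attributes, emit one of finitely many categories.
y_class = np.array([1, 0, 1, 0])                 # e.g., mammal vs. non-mammal
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[2, 3, 1]]))                  # -> a category label

# Estimator: ingest attributes, emit a numeric value (e.g., a "risk score").
y_score = np.array([0.9, 0.1, 0.8, 0.2])
est = LinearRegression().fit(X, y_score)
print(est.predict([[2, 3, 1]]))                  # -> a real number
```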
Embedded Knowledge…
• Principled, domain-savvy synthesis of "circumstantial" evidence
• Copes well with ambiguous, incomplete, or incorrect input
• Enables justification of results in terms domain experts use
• Facilitates good pedagogical helps
• "Solves the problem like the man does", and so is comprehensible to most domain experts
• Degrades linearly in combinatorial domains
• Can grow in power with "experience"
• Preserves perishable expertise
• Allows efficient incremental upgrade/adjustment/repurposing
Features
• A feature is the value assumed by some attribute
of an entity in the domain
(e.g., size, quality, age, color, etc.)
• Features can be numbers, symbols, or complex
data objects
• Features are usually reduced to some simple form
before modeling is performed.
>>>features are usually single numeric values or contiguous strings.<<<
Feature Space
• Once the features have been designated, a feature space can be
defined for a domain by placing the features into an ordered array
in a systematic way.
• Each instance of an entity having the given features is then
represented by a single point in n-dimensional Euclidean space:
its feature vector.
• This Euclidean space, or feature space for the domain, has
dimension equal to the number of features.
• Feature spaces can be one-dimensional, infinite-dimensional, or
anywhere in between.
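A minimal sketch of the idea (illustrative codes, not from the deck): placing an entity's features into an ordered array turns each entity into a single point in an n-dimensional feature space, after which ordinary geometry (distances, clusters, separating surfaces) applies.

```python
# Sketch: entities as points (feature vectors) in a 5-dimensional feature space.
import numpy as np

# Ordered feature array: (habitat, diet, integument, morphology, life cycle),
# each already reduced to a simple numeric code (codes are illustrative).
bill   = np.array([2, 3, 1, 3, 1])        # one entity -> one point in R^5
tweety = np.array([2, 3, 2, 2, 3])

print(bill.shape[0])                      # dimension = number of features
print(np.linalg.norm(bill - tweety))      # geometry is available once vectors exist
```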
How do classifiers work?
Machines
• Data mining paradigms are characterized by
– A “concept of operation (CONOP: component structure, I/O, training
alg., operation)
– An architecture (component type, #, arrangement, semantics)
– A set of parameters (weights/coefficients/vigilance parameters)
>>>it is assumed here that parameters are real numbers.<<<
A machine is an instantiation of a data mining paradigm.
• Examples of parameter sets for various paradigms (a short sketch follows below)
  – Neural Networks: interconnect weights
  – Belief Networks: conditional probability tables
  – Kernel-based classifiers (SVM, RBF): regression coefficients
  – Metric classifiers (K-means): cluster centroids
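As a hedged illustration of "a machine is an instantiation of a data mining paradigm," the sketch below instantiates two paradigms on synthetic data and reads back their fitted parameter sets: cluster centroids for a metric classifier and interconnect weights for a neural network. The library calls are scikit-learn; nothing here is the deck's own tooling.

```python
# Sketch: two "machines" and their parameter sets (synthetic data, scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Metric classifier: the parameters are the cluster centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)          # (2, 3)

# Neural network: the parameters are the interconnect weights.
mlp = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0).fit(X, y)
print([w.shape for w in mlp.coefs_])      # [(3, 4), (4, 1)]
```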
A Spiral Methodology for the
Data Mining Process
The DM Discovery Phase:
Descriptive Modeling
• OLAP
• Visualization
• Unsupervised learning
• Link Analysis/Collaborative Filtering
• Rule Induction
The DM Exploitation Phase:
Predictive Modeling
• Paradigm selection
• Test design
• Formulation of meta-schemes
• Model construction
• Model evaluation
• Model deployment
• Model maintenance
A “de facto” standard DM Methodology
CRISP-DM (“cross-industry standard process for data mining”)
1.) Business Understanding
2.) Data Understanding
3.) Data Preparation
4.) Modeling
5.) Evaluation
6.) Deployment
Data Mining Paradigms:
What does your solution look like?
• Conventional Decision Models
-statistical inference, logistic regression, score cards
• Heuristic Models
-human expert, knowledge-based expert systems,
fuzzy logic, decision trees, belief nets
• Regression Models
-neural networks (all sorts), radial basis functions,
adaptive logic networks, decision trees, SVM
Real-World DM Business Challenges
• Complex and conflicting goals
– Defining “success”
– Getting “buy in”
• Enterprise data is distributed
• Limited automation
• Unrealistic expectations
Real-World DM Technical Challenges
• Big data consume space and time
• Efficiency vs. comprehensibility
• Combinatorial explosion
• Diluted information
• Difficult to develop "intuition"
• Algorithm roulette
Data Mining Problems:
What does your domain look like?
• How well is the problem understood?
• How "big" is the problem?
• What kind of data do we have?
• What question are we answering?
• How deeply buried in the data is the answer?
• How must the answer be presented to the user?
1. Business Understanding
How well is the problem understood?
How well is the problem
understood?
•Domain intuition: low/medium/high
–Experts available?
–Good documentation?
–DM team’s prior experience?
–Prior art?
•What is the enterprise definition of “success”?
•What is the target environment?
•How skillful are the users?
•Where are the pitchforks?
2. Data Understanding
3. Preparing the Data
How "big" is the problem?
What kind of data do we have?
DM Aspects of Data Preparation
• Data Selection
• Data Cleansing
• Data Representation
• Feature Extraction and Transformation
• Feature Enhancement
• Data Division
• Configuration Management
How "big" is the problem?
•Number of exemplars (“rows”)
•Number of features (“columns”)
•Number of classes (“ground truth”)
•Cost/schedule/talent (dollars, days, dudes)
•Tools (own/make/buy, familiarity, scope)
What kind of data do we have?
•Feature type: nominal/numeric/complex
•Feature mix: homo/heterogeneous by type
•Feature tempo:
–Fresh/stale
–Periodic/sporadic
–Synchronous/asynchronous
•Feature data quality:
–Low/high SNR
–Few/many gaps
–Easy/hard to access
–Objective/subjective
•Feature information quality
–Salience, correlation, localization, conditioning
–Comprehensive? Representative?
How much data do I need?
• Many heuristics
– Monte's "6MN rule" and similar heuristics
– Support vectors
• Segmentation requirements
• Comprehensive
• Representative
– Consider population imbalance
Feature Saliency Tests
• Correlation/Independence
• Visualization to determine saliency
• Autoclustering to test for homogeneity
• KL-Principal Component Analysis
• Statistical Normalization (e.g., ZSCORE; see the sketch below)
• Outliers, Gaps
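The sketch below is a hedged illustration of two items on this list, z-score normalization and (KL-)principal component analysis, as quick checks on feature scale, redundancy, and how the variance is spread across directions. The data are synthetic.

```python
# Sketch: ZSCORE normalization and PCA as quick feature checks (synthetic data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) * np.array([1.0, 5.0, 0.1, 2.0])   # mixed scales

Z = (X - X.mean(axis=0)) / X.std(axis=0)          # z-scores: remove scale effects
print(np.round(np.corrcoef(Z, rowvar=False), 2))  # correlated features are redundant

pca = PCA().fit(Z)                                # how many directions carry the variance?
print(np.round(pca.explained_variance_ratio_, 2))
```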
Making Feature Sets for Data Mining
• Converting Nominal Data to Numeric: Numeric Coding
• Converting Numeric Data to Nominal: Symbolic Coding
• Creating Ground-Truth
Information can be Irretrievably Distributed
(e.g., the parity-N problem)
0010100110… 1
The best feature set is not necessarily the set
of best features.
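To make the parity-N point concrete, the sketch below (not from the deck) labels every 4-bit string with its parity. Any single bit predicts the label no better than chance, yet the full set of bits determines it exactly, so feature-by-feature "saliency" screening would discard every feature.

```python
# Sketch: the parity-N problem for N = 4. Each bit alone carries no information
# about the label, but the full feature set determines it exactly.
import itertools
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array(list(itertools.product([0, 1], repeat=4)))   # all 4-bit strings
y = X.sum(axis=1) % 2                                      # ground truth: parity

for j in range(4):                                         # single-feature models
    acc = DecisionTreeClassifier().fit(X[:, [j]], y).score(X[:, [j]], y)
    print(f"bit {j} alone: {acc:.2f}")                     # 0.50 (chance)

print("all four bits:", DecisionTreeClassifier().fit(X, y).score(X, y))   # 1.0
```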
An example of a Feature Metric
“Salience” : geometric mean of class precisions
• an objective measure of the ability of a feature
to distinguish classes
• takes class proportion into account
• specific to a particular classifier and problem
• does not measure independence
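One plausible way to compute such a metric is sketched below (the deck gives the definition, not code): score a classifier's predictions, take the per-class precisions, and return their geometric mean.

```python
# Sketch: "salience" as the geometric mean of per-class precisions.
import numpy as np
from sklearn.metrics import precision_score

def salience(y_true, y_pred):
    # Per-class precisions; zero_division=0 guards classes that are never predicted.
    p = precision_score(y_true, y_pred, average=None, zero_division=0)
    return float(np.prod(p) ** (1.0 / len(p)))    # geometric mean

# Toy three-class example.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]
print(round(salience(y_true, y_pred), 3))
```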
Nominal to Numeric Coding...
…one step at a time!
Original Data:

Name    | Class    | Feature 1 (habitat) | Feature 2 (diet) | Feature 3 (integument) | Feature 4 (morphology) | Feature 5 (life cycle)
Bill    | primates | land                | omnivore         | skin w/o feathers      | biped, no wings        | live birth
Bubbles | fishes   | sea                 | omnivore         | scales                 | no wings, non-biped    | eggs w/o meta
Rover   | domestic | land                | carnivore        | skin w/o feathers      | no wings, non-biped    | live birth
Ringo   | bugs     | land                | herbivore        | exoskeleton            | wings, non-biped       | eggs w/ meta
Chuck   | bacteria | parasitic           | other            | other                  | no wings, non-biped    | other
Tweety  | birds    | land                | omnivore         | skin with feathers     | wings, biped           | eggs w/o meta
Step 1: Collapse the class labels to mammals vs. non-mammals.

Name    | Class       | Feature 1 (habitat) | Feature 2 (diet) | Feature 3 (integument) | Feature 4 (morphology) | Feature 5 (life cycle)
Bill    | mammals     | land                | omnivore         | skin w/o feathers      | biped, no wings        | live birth
Bubbles | non-mammals | sea                 | omnivore         | scales                 | no wings, non-biped    | eggs w/o meta
Rover   | mammals     | land                | carnivore        | skin w/o feathers      | no wings, non-biped    | live birth
Ringo   | non-mammals | land                | herbivore        | exoskeleton            | wings, non-biped       | eggs w/ meta
Chuck   | non-mammals | parasitic           | other            | other                  | no wings, non-biped    | other
Tweety  | non-mammals | land                | omnivore         | skin with feathers     | wings, biped           | eggs w/o meta
Step 2: Replace each nominal value (including the class label) with an integer code.

Name | Class | Feature 1 (habitat) | Feature 2 (diet) | Feature 3 (integument) | Feature 4 (morphology) | Feature 5 (life cycle)
1    | 1     | 2                   | 3                | 1                      | 3                      | 1
2    | 2     | 1                   | 3                | 3                      | 4                      | 3
3    | 1     | 2                   | 2                | 1                      | 4                      | 1
4    | 2     | 2                   | 1                | 4                      | 1                      | 2
5    | 2     | 3                   | 4                | 5                      | 4                      | 4
6    | 2     | 2                   | 3                | 2                      | 4                      | 3
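The same two coding steps can be sketched in code. The snippet below (illustrative, not the author's tooling) collapses the class labels to mammals vs. non-mammals and then maps a few of the nominal features to small integer codes; the particular code values are arbitrary, which is exactly why they carry no ordinal meaning.

```python
# Sketch: nominal-to-numeric coding of (part of) the animal table with pandas.
import pandas as pd

df = pd.DataFrame({
    "Name":    ["Bill", "Bubbles", "Rover", "Ringo", "Chuck", "Tweety"],
    "Class":   ["primates", "fishes", "domestic", "bugs", "bacteria", "birds"],
    "habitat": ["land", "sea", "land", "land", "parasitic", "land"],
    "diet":    ["omnivore", "omnivore", "carnivore", "herbivore", "other", "omnivore"],
})

# Step 1: collapse ground truth to two classes.
df["Class"] = df["Class"].map(lambda c: "mammals" if c in {"primates", "domestic"} else "non-mammals")

# Step 2: replace each nominal value with an (arbitrary) integer code, starting at 1.
coded = df.copy()
for col in ["Class", "habitat", "diet"]:
    coded[col] = pd.factorize(coded[col])[0] + 1

print(coded)
```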
Numeric to Nominal Quantization
“Clusters” Usually Mean Something
How many objects are shown here? One, seen from various perspectives!
This illustrates the danger of using ONE METHOD/TOOL/VISUALIZATION!
Autoclustering
• Automatically find spatial patterns in
complex data
– find patterns in data
– measure the complexity of the data
Differential Analysis
• Discover the Difference “Drivers” Between Groups
– Which combination of features accounts for the
observed differences between groups?
– Focus research
Sensitivity Analysis
• Measure the Influence of Individual
Features on Outcomes
– Rank order features by salience and
independence
– Estimate problem difficulty
Rule Induction
• Automatically find semantic patterns in
complex data
– discover rules directly from data
– organize “raw” data into actionable knowledge
A Rule Induction Example
(using data splits)
Rule Induction Example (Data Splits)
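The rule-induction figures themselves are not reproduced in this transcript. As a stand-in, the hedged sketch below induces rules from the Step 2 animal codes by repeatedly splitting the data on feature values (a small decision tree) and prints the result as IF/THEN text; it is a generic example, not the deck's specific algorithm.

```python
# Sketch: rule induction by data splits, using the Step 2 animal codes above.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [  # (habitat, diet, integument, morphology, life cycle) codes from Step 2
    [2, 3, 1, 3, 1], [1, 3, 3, 4, 3], [2, 2, 1, 4, 1],
    [2, 1, 4, 1, 2], [3, 4, 5, 4, 4], [2, 3, 2, 2, 3],
]
y = [1, 2, 1, 2, 2, 2]   # 1 = mammals, 2 = non-mammals

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["habitat", "diet", "integument", "morphology", "life_cycle"]))
# A single split (e.g., on the life-cycle or integument code) is enough to
# separate the live-birth, skin-covered animals from the rest of this tiny set.
```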
4. Modeling
What question are we answering?
How deeply buried in the data is the answer?
How must the answer be presented to the user?
What question are we answering?
•Ground truth type
–Nominal
–Numeric
–Complex (e.g., interval estimate, plan, concept)
•Ground truth data quality
–Low/high SNR
–Few/many gaps
–Easy/hard to access
–Objective/subjective
•Ground truth predictability
–Correlation with features
–Population balance
–Class collisions
How deeply buried in the
data is the answer?
•Solvable by a 1 layer Multi-Layer Perceptron (easy)
–Linearly separable; any two classes can be separated by a
hyperplane
•Solvable by a 2 layer Multi-Layer Perceptron (moderate)
–Convex hulls of classes overlap, but classes do not
•Solvable by a 3 layer Multi-Layer Perceptron (hard)
–Classes overlap but do not “collide”
•“intractable”
–Data contain class collisions
How must the answer be presented to the user?
•Forensics
–GUI, confidence factors, intervals, justification
•Integration
–Web-based, Web-enable, dll/sl, fully integrated
•Accuracy
–% correct, confusion matrix, lift chart
•Performance
–Throughput, ease of use, accuracy, reliability
Textbook Neural Network
Knowledge Acquisition
What the Expert says:
KE: ...and, primates. What evidence makes you CERTAIN an animal is a primate?
EX: Yeah, well, like...If it's a land animal that'll eat anything...but it bears live young and walks upright,...
KE: Any obvious physical characteristics?
EX: Uh...yes...and no feathers, of course, or wings, or any of that... Well, then...then, it's gotta be a primate...yeah.
KE: So, ANY animal which is a land-dwelling, omnivorous, skin-covered, unwinged, featherless biped which bears live young is NECESSARILY a primate?
EX: Yep.
KE: Could such an animal be, say, a fish?
EX: No...it couldn't be anything but a primate.
What the KE hears:
IF
  (f1, f2, f3, f4, f5) = (land, omni, no feathers, wingless biped, born alive)
THEN
  PRIMATE and (not fish, not domestic, not bug, not germ, not bird)
Evaluation
How must the answer be presented to the user?
Model Evaluation
• Accuracy
  – Classification accuracy, geometric accuracy
  – Precision/recall
  – RMS
  – Lift curve
  – Confusion matrices
  – ROI
• Speed, space, utility, other
Classification Errors
                    Prediction = 1   Prediction = 2   Prediction = 3   PRECISION   Type I Error
Ground Truth = 1         302              128               35           64.9%        35.1%
Ground Truth = 2          55              526               68           81.0%        19.0%
Ground Truth = 3          21              194              469           68.6%        31.4%
RECALL                  79.9%            62.0%            82.0%
Type II Error           20.1%            38.0%            18.0%
• Type I error – accepting an item as a member of a class to which it does not actually belong: a "false positive".
• Type II error – rejecting an item as a member of a class to which it actually belongs: a "false negative".
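The percentage columns of the table can be recomputed from the raw counts, following the deck's convention (row percentages labeled PRECISION, with Type I error = 100% minus precision; column percentages labeled RECALL, with Type II error = 100% minus recall). A sketch:

```python
# Sketch: deriving the percentage rows/columns of the table above from the counts.
import numpy as np

cm = np.array([[302, 128,  35],    # rows: ground truth 1..3
               [ 55, 526,  68],    # columns: prediction 1..3
               [ 21, 194, 469]])

row_pct = np.diag(cm) / cm.sum(axis=1)     # labeled PRECISION in the slide
col_pct = np.diag(cm) / cm.sum(axis=0)     # labeled RECALL in the slide

print(np.round(100 * row_pct, 1))          # [64.9 81.  68.6]
print(np.round(100 * (1 - row_pct), 1))    # Type I error rates
print(np.round(100 * col_pct, 1))          # [79.9 62.  82. ]
print(np.round(100 * (1 - col_pct), 1))    # Type II error rates
```

Note that many texts attach the name "recall" to the per-ground-truth-row figure and "precision" to the per-prediction-column figure; the counts and error rates are the same either way.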
Model Maintenance
• Retraining, stationarity
• Generalization (e.g., heteroscedasticity)
• Changing the feature set (add/subtract)
• Conventional maintenance issues
What do we give the
user besides an application?
• Documentation
• Support
• Model retraining
• New model generation
Using a Paradigm Taxonomy to Select a
DM Algorithm
Place paradigms into a taxonomy by
specifying their attributes. This
taxonomy can be used for algorithm
selection.
First, an example taxonomy….
KBES (knowledge-based Expert System)
required intuition: high
vector count supported: high
feature count supported: medium
class count supported: medium
cost to develop: high
schedule to develop: high
talent to develop: medium, high
tools to develop: can be expensive to buy/make
feature types supported: nominal/numeric/complex
feature mix supported: homogeneous, heterogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, complex
relative representational power: low
relative performance: fast, intuitive, robust
relative weaknesses: ad hoc; relatively simple class boundaries
relative strengths: intuitive; easy to provide conclusion justification
MLP (Multi-Layer Perceptron)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: medium
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: must fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; uncontrolled regression
relative strengths: easy to build
RBF (Radial Basis Function)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: high
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; models tend to be large
relative strengths: uncontrolled regression can be mitigated
SVM (Support Vector Machines)
required intuition: low
vector count supported: high
feature count supported: high
class count supported: two
cost to develop: medium
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: must fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; can be hard to train
relative strengths: minimal need to enhance features
Decision Trees (e.g., CART, BBN’s)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: high
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: nominal, numeric
feature mix supported: homogeneous, heterogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: many "low support" nodes or rules
relative strengths: can provide insight into the domain
The taxonomy can be used to match
available paradigms with the
characteristics of the data mining
problem to be addressed…
IF
the ground truth is discrete;
there aren't too many classes;
the class boundaries are simple;
the number of features is medium;
the data are heterogeneous;
no comprehensive, representative data set with GT;
the population is unbalanced by class;
the domain is well-understood by available experts;
conclusion justification is needed;
THEN
KBES
ELSE IF
the ground truth is numeric;
there is a medium number of classes;
the class boundaries are complex;
the number of features is medium;
the data are numeric;
comprehensive, representative data set tagged with GT;
the population is relatively balanced by class;
the domain is not well-understood by available experts;
conclusion justification is not needed;
THEN
MLP
ELSE IF
the ground truth is numeric or nominal;
there is a large number of classes;
the class boundaries are very complex;
the number of features is medium;
the data are numeric;
representative data set tagged with GT;
the population is unbalanced by class;
the domain is not well-understood by available experts;
conclusion justification is not needed;
THEN
RBF
ELSE IF
the ground truth is numeric or nominal;
the number of classes is two;
the class boundaries are very complex;
the number of features is very large;
the data are numeric;
comprehensive, representative data set tagged with GT;
the population is unbalanced by class;
the domain is not well-understood by available experts;
conclusion justification is not needed;
THEN
SVM
ELSE IF
the ground truth is numeric or nominal;
there is a medium number of classes;
the class boundaries are very complex;
the number of features is medium;
the data are numeric, nominal, or complex;
representative data set tagged with GT;
the population is unbalanced by class;
the domain is not well-understood by available experts;
conclusion justification is needed;
THEN
Decision Tree (CART, BBN, etc.)
END IF
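A minimal sketch of how a simplified version of the selection rules above could be encoded; the dictionary field names are invented to mirror the wording of the rules and are not an API from the deck.

```python
# Sketch: the paradigm-selection rules as a simple function (simplified conditions).
def select_paradigm(p: dict) -> str:
    if (p["ground_truth"] == "discrete" and p["num_classes"] == "few"
            and p["class_boundaries"] == "simple" and p["expert_available"]
            and p["justification_needed"] and not p["tagged_data_available"]):
        return "KBES"
    if (p["class_boundaries"] == "complex" and p["data_type"] == "numeric"
            and p["tagged_data_available"] and p["balanced_classes"]
            and not p["justification_needed"]):
        return "MLP"
    if (p["num_classes"] == "many" and p["data_type"] == "numeric"
            and not p["balanced_classes"] and not p["justification_needed"]):
        return "RBF"
    if (p["num_classes"] == 2 and p["num_features"] == "very large"
            and p["data_type"] == "numeric" and not p["justification_needed"]):
        return "SVM"
    return "Decision Tree (CART, BBN, etc.)"

print(select_paradigm({
    "ground_truth": "numeric", "num_classes": 2, "class_boundaries": "very complex",
    "num_features": "very large", "data_type": "numeric", "tagged_data_available": True,
    "balanced_classes": False, "expert_available": False, "justification_needed": False,
}))   # -> SVM
```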
Common Reasons
Data Mining Projects Fail
Mistakes can occur in each major
element of data mining practice!
1. Specification of Enterprise Objectives
– Defining “success”
2. Creation of the DM Environment
– Understanding and Preparing the Data
3. Data Mining Management
4a,b. Descriptive Modeling and Predictive Modeling
– Detecting and Characterizing Patterns
– Building Models
5. Model Evaluation
6. Model Deployment
7. Model Maintenance
1. Specification of Enterprise Objectives
Define “success”:
• Knowledge acquisition interviews (who, what,
how)
• Objective measures of performance (enterprise
specific)
• Assessment of enterprise process and data
environment
• Specification of data mining objectives
Specification Mistakes
• DM projects require careful management of user
expectations. Choosing the wrong person as
customer interface can guarantee user
disappointment.
(GIGOO: Garbage in, GOLD out!)
• Since the default assessment of “R&D type”
efforts is “failure”, not defining “success”
unambiguously will guarantee “failure”.
2. Creation of the DM Environment
• Data Warehouse/Data Mart/Database
• Metadata and schemas
• Data dependencies
• Access paths and mechanisms
Environmental Mistakes
• Big data require bigger storage. DM efforts
typically work against multiple copies of the data;
try 2 or 3 x.
• Unwillingness to invest in tools forces data miners
to consume resources building inferior versions of
what could have been purchased more cheaply.
• Get labs and network connections set up quickly.
Understanding the Data
• Enterprise data survey
– Data as a process artifact
– Temporal Considerations
• Data Characterization
– Metadata
– Collection paths
• Data Metrics and Quality
– currency, completeness, correctness, correlation
A List of Common Data Problems
• Conformation (e.g., a dozen ways to say lat/lon)
• Accessibility (distributed, sensitive)
• Ground Truth (missing, incorrect)
• Outliers (detect/process)
• Gaps (imputation scheme)
• Time (coverage, periodicity, trends, Nyquist)
• Consistency (intra/inter record)
• Class collisions (how to adjudicate)
• Class population imbalance (balancing)
• Coding/quantization
Data Understanding Mistakes
• Assuming that no understanding of the domain is
needed for a successful DM effort
• Temporal infeasibility: assuming every type of
data you find in the warehouse will actually be
there when your fielded system needs it.
• Ignoring the data conformation problem
Data Preparation Mistakes
• Improper handling of missing data, outliers
• Improper conditioning of data
• "Trojan Horsing" ground truth into the feature set
• Having no plan for getting operational access to data
3. Data Mining Management
• Data mining skill mix (who are the DM
practitioners?)
• Data mining project planning (RAD vs. waterfall)
• Data mining project management
• Sample DM project cost/schedule
• Don’t forget Configuration Management!
DM Management Mistakes
• Appointing a "domain expert" as the technical lead on a DM project virtually guarantees that no new ground will be covered.
• Inadequate schedule and/or budget poison the
psychological atmosphere necessary for discovery.
• Failure to parallelize work
• Allowing planless tinkering
• Letting technical people “snow” you
• Failure to conduct “process audits”
Configuration Management
• Nomenclature and naming conventions
• Documenting the workflow for reproducibility
• Modeling Process Automation
Configuration Management Mistakes
• Not having a configuration management plan (files, directories, nomenclature, audit trail) virtually guarantees that any success you have will be unreproducible.
• Allowing each data miner to establish their own
documentation and auditing procedures guarantees that no
one will understand what anyone else has done.
• Failure to automate configuration management (e.g.,
putting annotated experiment scripts in a log) guarantees
that your configuration management plan will not work.
4a. Descriptive Modeling
• OLAP (on-line analytical processing)
• Visualization
• Unsupervised learning
• Link/Market Basket Analysis
• Collaborative Filtering
• Rule Induction Techniques
• Logistic Regression
4b. Predictive Modeling
• Paradigms
• Test Design
• Meta-Schemes
• Model Construction
• Model Evaluation
• Model Deployment
• Model Maintenance
Paradigms
• Know what they are
• Know when to use which
• Know how to instantiate them
• Know how to validate them
• Know how to maintain them
Model Construction
• Architecture (monolithic, hybrid)
• Formulation of Objective Function
• Training (e.g., NN)
• Construction (e.g., KBES)
• Meta Schemes (a bagging sketch follows below)
  – Bagging
  – Boosting
  – Post-process model calibration
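As a hedged illustration of one meta-scheme, the sketch below bags many small decision trees into a single voting ensemble with scikit-learn; the data are synthetic and the settings generic.

```python
# Sketch: bagging as a meta-scheme -- many weak models, one voted answer.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # synthetic, nonlinear ground truth

bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),    # the base paradigm being combined
    n_estimators=50,
    random_state=0,
).fit(X, y)
print(bag.score(X, y))
```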
Modeling Mistakes
• The “Silver Bullet Syndrome”: relying entirely on a single
tool/method
• Expecting your tools to think for you
• Overreliance on visualization
• Using tools that you don’t understand
• Not knowing when to quit (maybe this is just dirt)
• Quitting too soon (I haven’t dug deep enough)
• Picking the wrong modeling paradigm
• Ignoring population imbalance
• Overtraining
• Ignoring feature correlation
5. Model Evaluation
• Blind Testing
• N-fold Cross-Validation
• Generalization and Overtraining
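A minimal sketch of N-fold cross-validation on synthetic data: every vector is scored by a model that never saw it during training, which also exposes overtraining that resubstitution (scoring on the training data) hides.

```python
# Sketch: N-fold cross-validation vs. "validating the model on the training data".
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)                   # pure noise: nothing to learn

model = DecisionTreeClassifier()
print(model.fit(X, y).score(X, y))                 # ~1.0: resubstitution flatters the model
print(cross_val_score(model, X, y, cv=10).mean())  # ~0.5: blind folds tell the truth
```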
Model Evaluation Mistakes
• Not validating the model
• Validating the model on the training data
• Not escrowing a “holdback set”
6. Model Deployment
• ASP (applications service provider)
• API (application program interface)
• Other
– plug-ins
– linked objects
– file interface, etc.
Model Deployment Mistakes
• Not considering the fielded architecture
• No user training
• Not having any operational performance
requirements (except “accuracy”)
7. Model Maintenance
• Retraining
• Poor generalization
– Heteroscedasticity
– Non-stationarity
– Overtraining
• Changing the problem architecture
– Adding/subtracting features
– Modifying ground truth
• Other
Model Maintenance Mistakes
• Not having a mechanism, method, and criteria for
tracking performance of the fielded model
• Not providing a model “retraining” capability
• No documentation, no support
Published by:
Digital Press, 2001
ISBN: 1-55558-231-1