Data Mining vs. Statistics
Pavel Brusilovsky
Objectives
• Intro to Data Mining
• Data Mining vs. Statistics
• Data Mining vs. Text Mining
• Applications of Data Mining
Data Mining
• Data Mining
– is a cutting-edge technology to analyze diverse,
multidisciplinary and multidimensional complex data
• Data mining could identify relationships in your multidimensional and
heterogeneous data that cannot be identified in any other way
• Successful application of state-of-the-art data mining technology to
marketing and sales is indicative of analytic maturity and the
success of a company
• Working definition of Data Mining:
– Data Mining is a process of discovering previously unknown and
potentially useful hidden patterns in your data
What is the Taxonomy of Data Mining?
• Data mining taxonomy, based on application
– Data Mining
– Text Mining
– Web Mining
– Image Mining…
• Data mining taxonomy, based on the usage of domain knowledge:
– Verification-driven data mining
• Is associated with traditional quantitative approaches that permit a
decision maker to express and verify organizational and personal
domain knowledge
– Discovery-driven data mining
• Is tied to knowledge discovery technology capable of automatically
discovering previously unknown patterns hidden in the data
– Combination of both classes leads to synergy that can produce
meaningful and reliable results that may not be obtained within the
framework of each class of data mining independently
• Data mining taxonomy, based on estimation paradigm:
– supervised learning
– unsupervised learning
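The two estimation paradigms can be illustrated with a minimal Python sketch. The toy data, the nearest-centroid classifier, and the two-cluster routine are illustrative assumptions, not part of the slides: the supervised functions use the known labels, while the unsupervised one sees only the inputs.

```python
from statistics import mean

# Toy one-dimensional data: two well-separated groups (made up for illustration).
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["low", "low", "low", "high", "high", "high"]  # known targets


def nearest_centroid_fit(xs, ys):
    """Supervised: use the known labels to compute one centroid per class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def two_means(xs, iterations=10):
    """Unsupervised: split xs into two clusters without using any labels."""
    lo, hi = min(xs), max(xs)  # initial cluster centers
    for _ in range(iterations):
        near_lo = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        near_hi = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(near_lo), mean(near_hi)
    return lo, hi


centroids = nearest_centroid_fit(values, labels)
prediction = nearest_centroid_predict(centroids, 4.5)  # uses the labels
cluster_centers = two_means(values)                    # ignores the labels
```

In this sketch the unsupervised routine recovers two group centers on its own, but only the supervised model can attach the names "low" and "high" to them.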
What is the difference between “Search”
and “Discover”
Source:
http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt
Example: Amazon.com purchase suggestion
Amazon.com increased sales by 15% using purchase suggestions
generated by data/text mining
Data Mining and Related Fields
Statistics: “The model is king” (Hand)
Data Mining: “The data is king”
Is Data Mining an extension of Statistics?
• Data Mining and Statistics: mutual fertilization with
convergence
• Statistical Data Mining (Graduate course, George Mason
University)
• Statistical Data Mining and Knowledge Discovery (Hardcover)
by Hamparsum Bozdogan (Editor)
– An overview of Bayesian and frequentist issues that arise in
multivariate statistical modeling involving data mining
• Data Mining with Stepwise Regression (Dean Foster, Wharton
School)
– use interactions to capture non-linearities
– use Bonferroni adjustment to pick variables to include
– use the sandwich estimator to get robust standard errors
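As a hedged illustration of the Bonferroni step only, the sketch below keeps a candidate variable when its p-value falls below alpha divided by the number of candidates. The variable names and p-values are made up, and this is not Foster's actual procedure, which also involves interaction terms and sandwich standard errors.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return names of candidates whose p-value passes the Bonferroni threshold.

    With m candidate variables, each is tested at level alpha / m so that the
    chance of admitting at least one spurious variable stays at most alpha.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return sorted(name for name, p in p_values.items() if p < threshold)


# Hypothetical p-values for five candidate predictors (illustration only).
candidate_p_values = {
    "price": 0.0004,
    "price*season": 0.008,
    "region": 0.03,
    "age": 0.2,
    "age*region": 0.6,
}

selected = bonferroni_select(candidate_p_values)
```

With five candidates the per-variable threshold is 0.05 / 5, so "region" would pass an unadjusted 0.05 test but is rejected after the adjustment.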
What are Data Mining Myths?
• Myth 1: Data mining automatically discovers hidden patterns in your
data
• Myth 2: Data mining is designed for business analysts who are not
professionals in quantitative fields
• Myth 3: Data mining findings can be easily translated into decision-maker actions
• Myth 4: Data mining encompasses decision analysis/decision
support technology
What are the logical steps of Data Mining?
SEMMA methodology (SAS Enterprise Miner)
• The core process of conducting a data mining study includes the following
steps (SEMMA):
– Sample
– Explore
– Modify
– Model
– Assess
• SEMMA is a logical organization of the functional tool set of SAS
Enterprise Miner for carrying out the core tasks of data mining
• SEMMA is focused on the model development aspects of data mining
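The five SEMMA steps can be sketched as a chain of small functions. This is a toy illustration in plain Python, not SAS Enterprise Miner code: the function names, the mean imputation, the least-squares line, and the synthetic data are all assumptions chosen to keep the example short.

```python
import random
from statistics import mean


def sample(rows, fraction, seed=0):
    """Sample: draw a random subset of the rows to work with."""
    rng = random.Random(seed)
    return rng.sample(rows, max(2, round(len(rows) * fraction)))


def explore(rows):
    """Explore: summarize the data, here just the count of missing inputs."""
    return {"n": len(rows), "missing_x": sum(1 for x, _ in rows if x is None)}


def modify(rows):
    """Modify: impute missing inputs with the mean of the observed inputs."""
    observed = mean(x for x, _ in rows if x is not None)
    return [(observed if x is None else x, y) for x, y in rows]


def model(rows):
    """Model: fit y = a + b * x by ordinary least squares."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def assess(params, rows):
    """Assess: mean squared error of the fitted line on the given rows."""
    a, b = params
    return mean((y - (a + b * x)) ** 2 for x, y in rows)


# Synthetic data: y = 2x + 1, with every tenth input missing (None).
data = [(None if i % 10 == 0 else float(i), 2.0 * i + 1.0) for i in range(1, 51)]

working = sample(data, fraction=0.6)
summary = explore(working)
cleaned = modify(working)
fitted = model(cleaned)
error = assess(fitted, cleaned)
```

A real study would assess the model on held-out data rather than on the rows used for fitting; the sketch reuses the same rows only to stay short.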
CRoss-Industry Standard Process for Data
Mining (CRISP-DM)
SPSS Clementine
Six phases of CRISP-DM:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Model deployment
www.crisp-dm.org
Statistics vs. Data Mining: Concepts
Feature | Statistics | Data Mining
Type of problem | Well structured | Unstructured / semi-structured
Inference role | Explicit inference plays a great role in any analysis | No explicit inference
Objective of the analysis and data collection | First the objective is formulated, then the data are collected | Data are rarely collected for the objective of the analysis/modeling
Size of data set | Small and hopefully homogeneous | Large and heterogeneous
Paradigm/approach | Theory-based (deductive) | Synergy of theory-based and heuristic-based approaches (inductive)
Signal-to-noise ratio | STNR > 3 | 0 < STNR <= 3
Type of analysis | Confirmative | Explorative
Number of variables | Small | Large
Statistics vs. Data Mining: Regression Modeling
Feature | Statistics | Data Mining
Number of inputs | Small | Large
Type of inputs | Interval-scaled and categorical with a small number of categories (percentage of categorical variables is small) | Any mixture of interval-scaled, categorical, and text variables
Multicollinearity | Wide range of degrees of multicollinearity, with intolerance to multicollinearity | Severe multicollinearity is always present, with tolerance to multicollinearity
Distributional assumptions, homoscedasticity, outliers, missing values | Intolerance to violations of distributional assumptions and homoscedasticity, outliers/leverage points, and missing values | Tolerance to violations of distributional assumptions, outliers/leverage points, and missing values
Type of model | Linear / non-linear / parametric / non-parametric in low-dimensional X-space (intolerance to uncharacterizable non-linearities) | Non-linear and non-parametric in high-dimensional X-space, with tolerance to uncharacterizable non-linearities
What is an unstructured problem?
Feature | Well-structured Business Problem | Unstructured Business Problem
Definition | Can be described with a high degree of completeness | Cannot be described with a high degree of completeness
Definition | Can be solved with a high degree of certainty | Cannot be resolved with a high degree of certainty
Definition | Experts usually agree on the best method and best solution | Experts often disagree about the best method and best solution
Definition | Can be easily and uniquely translated into a quantitative counterpart | Cannot be easily and uniquely translated into a quantitative counterpart
Goal | Find the best solution | Find a reasonable solution
Complexity | Ranges from very simple to complex | Ranges from complex to very complex
What are the differences between Data/Text
Mining and Statistics?
• Statistical analysis is designed to deal with structured data in order to
solve structured problems:
– Results are software and researcher independent
– Inference reflects statistical hypothesis testing
• Data mining is designed to deal with structured data in order to solve
unstructured business problems
– Results are software and researcher dependent (absence of
implementation standards)
– Inference reflects computational properties of the data mining
algorithm at hand
• Text mining is designed to deal with unstructured data in order to solve
unstructured problems
– Results are software and researcher dependent
– Inference reflects computational properties and visualization
capability of the text mining algorithm at hand
When is data mining technology
appropriate?
• Data mining technology is appropriate if:
– The business problem is unstructured
– Accurate prediction is more important than the explanation
– The data include a mixture of interval, nominal, ordinal, count,
and text variables, and the role and the number of non-numeric
variables are essential
– Among those variables there are a lot of irrelevant and redundant
attributes
– The relationship among variables could be non-linear with
uncharacterizable nonlinearities
– The data are highly heterogeneous with a large percentage of
outliers, leverage points, and missing values
– The sample size is relatively large
• Important marketing and sales studies/projects have the majority of
these features
Accurate prediction is more important than
the explanation
What is the Breiman Uncertainty Principle?
• Breiman uncertainty principle:
Accuracy * Interpretability = Breiman’s constant
• The Breiman uncertainty principle means that
the higher a method’s accuracy, the lower its interpretability, and
vice versa
What are great Data Mining Ideas?
• Injecting randomness into the function estimation procedure
• Bagging (Breiman, 1996):
– Apply the same unstable algorithm to different bootstrap samples (drawn
with replacement) of the original data
– Different samples yield different models
– The average of the predictions of these models might be better
than the predictions from any single model
• Boosting (Friedman, Hastie, and Tibshirani, 1999):
– Each model is based on the same original data
– The first individual model is fit to the original data
– For the second model, subtract the predicted value from the
original target value, and use the difference as the target value
to train the second model
– For the third model, subtract the weighted average of the predictions
from the original target value, and use the difference as the
target value to train the third model, and so on.
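Both ideas can be sketched in plain Python with a one-split regression stump as the base learner. Everything here is an illustrative assumption (the stump, the toy data, the number of models, the learning rate). The boosting function is a residual-fitting variant that adds a shrunken copy of each new model's prediction, one common reading of the steps above, rather than the exact weighted-average scheme on the slide.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump: predict the mean of y on each side."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # both sides are always non-empty
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        l_mean, r_mean = mean(left), mean(right)
        sse = sum((y - l_mean) ** 2 for y in left)
        sse += sum((y - r_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    if best is None:  # all x identical: fall back to a constant prediction
        constant = mean(ys)
        return lambda x: constant
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x < threshold else r_mean


def bagging(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


def boosting(xs, ys, n_models=25, learning_rate=0.5):
    """Fit each new stump to the residuals left by the models so far."""
    models = []

    def predict(x):
        return sum(learning_rate * m(x) for m in models)

    for _ in range(n_models):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        models.append(fit_stump(xs, residuals))
    return predict


# Toy non-linear target (made up for illustration).
xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]


def mse(predict):
    """Mean squared error of a predictor on the toy training data."""
    return mean((y - predict(x)) ** 2 for x, y in zip(xs, ys))


single = fit_stump(xs, ys)
bagged = bagging(xs, ys)
boosted = boosting(xs, ys)
```

Comparing `mse(single)`, `mse(bagged)`, and `mse(boosted)` on this toy data shows how combining many weak models changes the fit; these are training errors, so they say nothing about performance on new data.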
What are the best Data Mining
Conferences?
• Annual SAS Data Mining Technology Conference
– The world’s largest data mining conference that balances
theory and practice
• Annual International Conference on Knowledge Discovery and Data
Mining (KDD)
– Sponsored by the American Association for Artificial Intelligence
(AAAI)
• Annual International Salford Systems Data Mining Conference
– Focusing on solving real world challenges
– Business Applications of CART, MARS, TreeNet, and Random
Forest
– Keynote speakers: Jerome Friedman (Stanford University) and Leo
Breiman (University of California, Berkeley)
What are the best data mining tools?
• Salford Systems’ Tools (CART, Random Forest, MARS, TreeNet)
• SAS Enterprise Miner/Text Miner
• SPSS Clementine
• Megaputer Intelligence PolyAnalyst
Reference (Data Mining)
• Randall Matignon (2007), Neural Network Modeling Using SAS
Enterprise Miner, SAS® Institute Inc.
• David J. Hand, Data Mining: Statistics and More? The American
Statistician, May 1998, Vol. 52 No. 2
http://www.amstat.org/publications/tas/hand.pdf
• Friedman, J.H. 1997. Data Mining and Statistics: What’s the Connection?
Proceedings of the 29th Symposium on the Interface: Computing
Science and Statistics, May 1997, Houston, Texas
• Doug Wielenga (2007), Identifying and Overcoming Common Data
Mining Mistakes, SAS Global Forum Paper 073-2007
• Nathan Treloar (2002), Text Mining: Tools, Techniques, and
Applications
http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt