Database Management System
Data Mining
Knowledge Discovery and Data Mining
 Knowledge Discovery
 The nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data (Fayyad et al., 1996).
 Data Mining
 A step in the knowledge discovery process
 Application of algorithms to extract meaningful patterns
 Data Dredging
 Blind application of data mining techniques
Knowledge Discovery in Databases
[Figure: the KDD process. Data → Cleaning and Integration → Data Warehouse → Selection and Transformation → Prepared data → Data Mining → Patterns → Evaluation and Visualization → Knowledge, with a Knowledge Base]
What is Data Mining?
 Filtering large amounts of data
 Searching for hidden patterns and/or trends
 Predicting future results
 Creating a competitive advantage and improving decision making
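Data mining is a form of artificial intelligence, but it differs from other BI
tools.
 Discovery versus verification: data mining searches for patterns that were not
specified in advance, whereas other BI tools verify hypotheses the user supplies.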
What Sparked Data Mining?
 “Motivated by business need, large amounts of available data,
and humans’ limited cognitive processing abilities
 Enabled by data warehousing, parallel processing, and data
mining algorithms”
Source: Dr. Hugh Watson
Popular Data Mining Methods
 Neural networks – learning from data patterns and predicting new data
 Genetic algorithms – optimization techniques
 Decision trees – rules for classifying data
 Regression analysis – statistical modeling of relationships between variables
 K-nearest neighbor – classification technique that labels a record by its
closest neighbors, based on weighting of selected variables
 Data visualization – visually showing patterns
Types of Data Mining
 Association – identifies relationships
 Sequential pattern – identifies sequencing
 Classifying – identifies potential outcomes for predetermined
categories
 Clustering – identifies categories
 Prediction – estimates future values or forecasts
Types of Data Mining
 Classification
 Prediction
 Clustering
 Association Analysis
 Summarization
 …
Types of Data Mining
 Classification
 From data with known labels, create a classifier that determines
which label to apply to a new observation
 E.g. Label loan applications as low, medium, or high risk
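The loan example can be sketched with a k-nearest-neighbor classifier, one of the methods listed earlier. This is a minimal illustration, not a recommended credit model: the two features (debt-to-income ratio, missed payments), the training rows, and the function name `classify` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled data: (debt-to-income ratio, missed payments) -> risk label.
TRAINING = [
    ((0.10, 0), "low"),
    ((0.15, 1), "low"),
    ((0.35, 2), "medium"),
    ((0.40, 3), "medium"),
    ((0.70, 6), "high"),
    ((0.80, 7), "high"),
]

def classify(applicant, training=TRAINING, k=3):
    """Label a new observation by majority vote among its k nearest labeled rows."""
    by_distance = sorted(training, key=lambda row: math.dist(row[0], applicant))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(classify((0.12, 0)))  # nearest neighbors are mostly "low"
print(classify((0.75, 6)))  # nearest neighbors are mostly "high"
```

In practice the features would need to be scaled so that no single variable dominates the distance.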
Types of Data Mining
 Prediction
 Given a collection of data with known numeric outputs, create a
function that outputs a predicted value from a new set of inputs.
 E.g. Given historical consumption of milk in the U.S., predict what
the consumption will be over the next five years.
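One simple way to build such a function is a least-squares straight line fitted to the historical values and extended into the future. The sketch below assumes a linear trend, and the consumption figures are made up for illustration; they are not real U.S. milk data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: year -> consumption (arbitrary units).
years = [2000, 2001, 2002, 2003, 2004]
consumption = [22.0, 21.6, 21.3, 20.9, 20.4]

slope, intercept = fit_line(years, consumption)

# Predicted values for the next five years.
forecast = {year: slope * year + intercept for year in range(2005, 2010)}
for year, value in forecast.items():
    print(year, round(value, 2))
```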
Types of Data Mining
 Clustering
 Identify “natural” groupings in data
 Unsupervised learning, no predefined groups
 E.g. A city planner grouping houses by value, location, and house
type.
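A minimal k-means sketch shows how groupings can emerge without predefined labels. K-means is only one of many clustering algorithms and is used here as an assumption; the house data (value in $100k, distance from the city center in km) and the naive "first k points" initialization are hypothetical.

```python
import math

def k_means(points, k, iterations=20):
    """Group points around k centroids; no labels are supplied in advance."""
    centroids = list(points[:k])  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical houses: (value in $100k, distance from center in km).
houses = [(1.0, 9.0), (5.0, 2.0), (1.2, 8.5), (0.9, 9.5), (5.5, 1.5), (4.8, 2.2)]
centroids, clusters = k_means(houses, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```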
Types of Data Mining
 Association Analysis
 Identify relationships in data from co-occurring terms or items.
 E.g. Analyze grocery store purchases to identify items most
commonly purchased together. This is often used to create coupons
and sales: buy chips and get $0.50 off salsa.
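The grocery example can be sketched by counting how often pairs of items appear in the same basket. The baskets below are invented, and `support` and `confidence` are the usual association-rule measures, computed here in the simplest possible way for pairs only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records, one set of items per basket.
baskets = [
    {"chips", "salsa", "soda"},
    {"chips", "salsa"},
    {"bread", "milk"},
    {"chips", "soda"},
    {"bread", "salsa", "chips"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def support(a, b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return pair_counts[tuple(sorted((antecedent, consequent)))] / item_counts[antecedent]

print(support("chips", "salsa"))     # 0.6
print(confidence("chips", "salsa"))  # 0.75
```

A rule such as "chips, then salsa" with high support and confidence is the kind of pattern that would motivate the coupon described above.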
Types of Data Mining
 Summarization
 Given a data set, summarize the important characteristics of
the data.
 E.g. calculate mean and standard deviation, determine
statistical distribution, identify most commonly appearing
attribute values, etc.
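These summaries map directly onto Python's standard library. The sketch below uses made-up values for one numeric attribute and one categorical attribute.

```python
import statistics
from collections import Counter

# Hypothetical attribute values.
ages = [23, 35, 35, 41, 52, 35, 29]
colors = ["red", "blue", "red", "green", "red"]

summary = {
    "mean": statistics.mean(ages),
    "stdev": statistics.stdev(ages),      # sample standard deviation
    "median": statistics.median(ages),
    "most_common_color": Counter(colors).most_common(1)[0],
}

for name, value in summary.items():
    print(name, value)
```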
Types of Data Mining
 Sequence Analysis
 Given data collected over time, identify trends in the data
that may be used to predict future events
 E.g. Analyzing stock data to identify stocks that will
perform well vs. those that will perform poorly.
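As a very rough illustration of trend identification, the sketch below smooths a series with a moving average and compares the first and last smoothed values. The prices are invented, the window size is an arbitrary assumption, and a trend found this way is no guarantee of future performance.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def trend(values, window=3):
    """Classify a series as 'up', 'down', or 'flat' from its smoothed endpoints."""
    smoothed = moving_average(values, window)
    change = smoothed[-1] - smoothed[0]
    if change > 0:
        return "up"
    if change < 0:
        return "down"
    return "flat"

prices = [10, 11, 10, 12, 13, 14]  # hypothetical daily closing prices
print(trend(prices))  # up
```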
Data Mining Process
 “Requires personnel with domain, data warehousing, and data
mining expertise
 Requires data selection, data extraction, data cleansing, and
data transformation
 Most data mining tools work with highly granular flat files
 Is an iterative and interactive process”
Source: Dr. Hugh Watson
Data Mining Process
[Flowchart: Fit a Model → Calculate Performance → Meet Criteria? If No, return to Fit a Model; if Yes, Interpret Model]
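The fit, measure, and check loop can be sketched as follows. Everything here is hypothetical: the model is a single threshold on hours studied, performance is plain accuracy on a toy data set, and the acceptance criterion is an arbitrary 90%.

```python
# Hypothetical data: (hours studied, passed the course).
DATA = [(2, 0), (4, 0), (5, 0), (7, 1), (9, 1), (12, 1)]

def fit(threshold):
    """Fit a model: predict 'passed' when hours reach the threshold."""
    return lambda hours: int(hours >= threshold)

def performance(model):
    """Calculate performance: fraction of records predicted correctly."""
    return sum(model(x) == y for x, y in DATA) / len(DATA)

def run_process(candidate_thresholds, target=0.9):
    """Repeat fit -> measure until the criterion is met, then hand back the model settings."""
    for threshold in candidate_thresholds:
        model = fit(threshold)
        score = performance(model)
        if score >= target:          # Meet criteria? Yes: stop and interpret.
            return threshold, score
    return None, None                # No candidate met the criterion.

result = run_process([1, 3, 6, 10])
print(result)  # (6, 1.0)
```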
Data Mining Algorithms
 Determine the preference criterion
 Given two models, which one is “better”?
 Examples: goodness of fit, prediction accuracy,
size/complexity, etc.
 Search algorithm
 Good models are found by searching the space of all
possible models
 How is this space organized and searched?
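The two ingredients above can be sketched together on a toy problem. The assumptions are loud ones: a model is a subset of boolean features that must all be true to predict a positive outcome, the preference criterion is accuracy minus a small penalty per feature (a stand-in for size/complexity), and the search is exhaustive only because the space is tiny.

```python
from itertools import combinations

FEATURES = ("studied", "did_homework", "slept_well")

# Hypothetical records: (feature values in FEATURES order, passed).
ROWS = [
    ((1, 1, 1), 1),
    ((1, 1, 0), 1),
    ((1, 0, 1), 0),
    ((0, 1, 1), 0),
    ((0, 0, 0), 0),
    ((1, 1, 1), 1),
]

def predict(model, x):
    """A model is a tuple of feature indices; predict 1 only if all are true."""
    return int(all(x[i] for i in model))

def accuracy(model):
    return sum(predict(model, x) == y for x, y in ROWS) / len(ROWS)

def preference(model, penalty=0.01):
    """Preference criterion: reward accuracy, penalize model size."""
    return accuracy(model) - penalty * len(model)

def search():
    """Search algorithm: enumerate every feature subset and keep the preferred one."""
    space = [
        subset
        for size in range(len(FEATURES) + 1)
        for subset in combinations(range(len(FEATURES)), size)
    ]
    return max(space, key=preference)

best = search()
print([FEATURES[i] for i in best])  # ['studied', 'did_homework']
```

With realistic numbers of features the space grows too quickly to enumerate, which is why practical algorithms organize and prune it.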
Data Mining Models
 Mathematical Functions
 Mathematical combination of attribute values
 E.g. linear model, non-linear model, support vectors, etc.
 CPU performance prediction
PRP = -55.9 + 0.489 MYCT + 0.0153 MMIN + 0.0056 MMAX
+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
Data Mining Models
 Decision Trees
[Figure: decision tree predicting a grade (A, B, C, or F) from tests on Study (>= 10 hours / < 10 hours), Do Homework (Yes / No), and Test Well (Yes / No)]
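A decision tree is equivalent to nested if/else rules. The sketch below uses the node names from the slide (Study, Do Homework, Test Well) and the grades A, B, C, and F, but the exact branch layout did not survive in this transcript, so the arrangement shown is an assumption.

```python
def predict_grade(study_hours, does_homework, tests_well):
    """Walk an assumed decision tree from the root to a leaf grade."""
    if study_hours >= 10:                      # root test: Study
        if does_homework:                      # Do Homework?
            return "A" if tests_well else "B"  # Test Well?
        return "C"
    # Fewer than 10 hours of study: only test performance is checked.
    return "C" if tests_well else "F"

print(predict_grade(12, True, True))   # A
print(predict_grade(4, False, False))  # F
```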
Data Mining Models
 Neural Networks
[Figure: neural network diagram with connection weights 0.8, 0.23, -0.48, 0.67, 1.5, 1.93, -0.81, -0.4, 0.18, 0.5, -0.88]
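A neural network model is applied by passing inputs through weighted connections. The sketch below is a forward pass through a small fully connected network with sigmoid units; the 2-2-1 layout and every weight and bias are hypothetical and are not taken from the slide's diagram.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid, per unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate inputs through one hidden layer to the output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Hypothetical weights for a network with 2 inputs, 2 hidden units, 1 output.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.1, -0.2]
output_w = [[1.2, -0.7]]
output_b = [0.05]

output = forward([1.0, 0.0], hidden_w, hidden_b, output_w, output_b)
print(output)
```

Training, which adjusts the weights to fit data, is a separate step not shown here.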
Data Mining Models
 Mixture Models
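A mixture model describes data as a weighted combination of simpler distributions. As a minimal sketch, assuming Gaussian components, the density at a point is the weighted sum of the component densities; the weights, means, and standard deviations below are hypothetical.

```python
from statistics import NormalDist

def mixture_pdf(x, components):
    """Density of a Gaussian mixture; components are (weight, mean, std dev) triples."""
    return sum(
        weight * NormalDist(mu, sigma).pdf(x)
        for weight, mu, sigma in components
    )

# Hypothetical two-component mixture; the weights sum to 1.
components = [(0.3, 0.0, 1.0), (0.7, 5.0, 2.0)]

for x in (0.0, 2.5, 5.0):
    print(x, mixture_pdf(x, components))
```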
Data Mining Models
 Bayesian Networks
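[Figure: Bayesian network in which Burglary (B) and Earthquake (E) are parents of Alarm (A), and Alarm is the parent of John Calls (J) and Mary Calls (M)]
P(B) = .001
P(E) = .002
B E P(A)
T T 0.95
T F 0.95
F T 0.29
F F 0.001
A P(J)
T 0.90
F 0.05
A P(M)
T 0.70
F 0.01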
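A Bayesian network gives the probability of any complete assignment as the product of each node's conditional probability given its parents. The sketch below uses the probabilities printed on the slide and assumes the network structure implied by the table headers (Burglary and Earthquake feed Alarm; Alarm feeds John Calls and Mary Calls). The function and variable names are illustrative.

```python
from itertools import product

P_B = 0.001
P_E = 0.002
# P(Alarm = T | Burglary, Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.95,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}  # P(John Calls = T | Alarm)
P_M = {True: 0.70, False: 0.01}  # P(Mary Calls = T | Alarm)

def prob(p_true, value):
    """Probability of `value` for a boolean variable that is true with probability p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of the local conditional probabilities."""
    return (prob(P_B, b) * prob(P_E, e) * prob(P_A[(b, e)], a)
            * prob(P_J[a], j) * prob(P_M[a], m))

# Both neighbors call and the alarm sounds, with no burglary and no earthquake.
print(joint(False, False, True, True, True))

# Sanity check: the joint distribution sums to 1 over all 32 assignments.
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(total)
```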
Searching the Model Space
 Concept generalization is searching
 Almost all search algorithms are heuristic
 Optimal models are not guaranteed
 Enumerating the space involves bias
 Language bias – what the model can represent
 Search bias – which models are ignored
 Overfitting-avoidance bias – how models are simplified to
handle outliers
Searching the Model Space
[Figure: two candidate decision trees, Model 1 and Model 2, each predicting a grade (A, B, C, or F) from tests such as Study (>= 10 hours / < 10 hours), Do Homework, Test Well, and Good Project]
How Is Data Mining Used?
 CRM: Research, churn and promotional management.
 Process Mgmt: Reduce operational delays.
 Analysis: Develop forecasting models and fraud prevention.
 Predictive Capabilities: Develop rules for queries or expert systems and oil
exploration.
 Health Care: Medical research and trends.
 Banking: Identify bank locations.
 Sports: Guide movement of players.
Sources
 Davis, Jennifer, and others. 2002. Data Mining I: KnowledgeSEEKER.
http://www.terry.uga.edu/~hwatson/Presentation_DMining_Final.ppt
 Struble, Craig A. 2004. Data Mining. http://www.mscs.mu.edu/~cstruble