Introduction to Data Mining
Michael R. Wick
Professor and Chair
Department of Computer Science
University of Wisconsin – Eau Claire
Eau Claire, WI 54701
[email protected]
715-836-2526
Acknowledgements
Some of the material used in this talk is
drawn from:
– Dr. Jiawei Han at University of Illinois at
Urbana Champaign
– Dr. Bhavani Thuraisingham (MITRE Corp. and
UT Dallas)
– Dr. Chris Clifton, Indiana Center for Database
Systems, Purdue University
Road Map
• Definition and Need
• Applications
• Process
• Types
• Example: The Apriori Algorithm
• State of Practice
• Related Techniques
• Data Preprocessing
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
– Data mining: a misnomer
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
– (Deductive) query processing.
– Expert systems or small learning programs
What is Data Mining?
Real Example from the NBA
• Play-by-play information recorded by teams
– Who is on the court
– Who shoots
– Results
• Coaches want to know what works best
– Plays that work well against a given team
– Good/bad player matchups
• Advanced Scout (from IBM Research) is a
data mining tool to answer these questions
[Chart: shooting percentage with Starks + Houston + Ward playing vs. overall, on a 0–60% scale]
http://www.nba.com/news_feat/beyond/0126.html
Necessity for Data Mining
• Large amounts of current and historical data are being stored
– Only a small portion (~5-10%) of the collected data is ever analyzed
– Data that may never be analyzed is collected for fear that something that may prove important will be missed
• As databases grow larger, decision-making directly from the data becomes impossible; knowledge must be derived from the stored data
• Data sources
– Health-related services, e.g., benefits, medical analyses
– Commercial, e.g., marketing and sales
– Financial
– Scientific, e.g., NASA, Genome
– DOD and Intelligence
• Desired analyses
– Support for planning (historical supply and demand trends)
– Yield management (scanning airline seat reservation data to maximize yield per seat)
– System performance (detect abnormal behavior in a system)
– Mature database analysis (clean up the data sources)
Potential Applications
• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection
• Finding outliers in credit card purchases
• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– DNA and bio-data analysis
Knowledge Discovery in Databases: Process

[Figure: the KDD pipeline. Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge]

adapted from: U. Fayyad et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Steps of a KDD Process
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning (may take 60% of the effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing methods of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Data Mining and Business Intelligence

[Figure: pyramid of analysis layers, with increasing potential to support business decisions toward the top. Bottom to top: Data Sources (paper, files, information providers, database systems, OLTP); Data Warehouses / Data Marts (OLAP, MDA); Data Exploration (statistical analysis, querying and reporting); Data Mining (information discovery); Data Presentation (visualization techniques); Making Decisions. Corresponding roles, bottom to top: DBA, Data Analyst, Business Analyst, End User.]
Multiple Perspectives in
Data Mining
• Data to be mined
– Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, Web mining, etc.
Ingredients of an Effective
KDD Process
“In order to discover anything, you must be looking for something.” Murphy’s 1st Law of Serendipity

[Figure: components of an effective KDD process. Goals for learning drive a plan for learning; discovery algorithms generate and test hypotheses against the database(s) to discover knowledge; background knowledge and a knowledge base support determining knowledge relevancy and evolving the knowledge/data; visualization and human-computer interaction connect the loop to the user.]
What Can Data Mining Do?
• Clustering
– Identify previously unknown groups
• Classification
– Give operational definitions to categories
• Association
– Find association rules
• Many others…
Clustering
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no
predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Some Clustering Approaches
• Iterative Distance-based Clustering (e.g., k-means; a sketch follows this list)
– Specify in advance the number of desired clusters (k)
– k random points chosen as cluster centers
– Instances assigned to the closest center
– Centroid (or mean) of all points in each cluster is calculated
– Repeat until clusters are stable
• Incremental Clustering
– Uses a tree to represent clusters
– Nodes represent clusters (or subclusters)
– Instances added one by one and the tree updated
– Updating can involve simple placement of the instance in a cluster or re-clustering
– Uses a category utility function to determine whether an instance fits with each cluster
– Can result in merging or splitting of existing clusters
• Category Utility
– Uses a quadratic loss function of conditional probabilities
– Does the addition of a new instance help us better predict the values of attributes for other instances?
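A minimal sketch of the iterative distance-based loop described above (plain k-means on toy 2-D points; this is an illustration under those assumptions, not a tool from this talk):

import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    # Mean of all points in the cluster.
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))

def kmeans(points, k, max_iters=100):
    # Step 1: choose k random points as initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each instance to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the centroid of its cluster.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 4: repeat until the clusters are stable.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.5, 2.0), (0.5, 1.2), (8.0, 8.0), (9.0, 9.5)]
centers, clusters = kmeans(points, k=2)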
General Applications of
Clustering
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering
Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should
be clustered along continent faults
Classification (vs Prediction)
• Classification:
– predicts categorical class labels (discrete/nominal)
– classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and uses it
in classifying new data
– Learns operational definition
• Prediction:
– models continuous-valued functions, i.e., predicts unknown or
missing values
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification—A Two-Step
Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or
mathematical formula
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
– If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
Classification Process (1): Model Construction

Training Data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classification algorithms take the training data and produce a classifier (model), e.g.:

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

The classifier is first checked against the testing data, then applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured? (a sketch of this step follows)
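A small sketch of model usage on the toy data above: the learned rule is applied to the test set to estimate accuracy, then to the unseen tuple. The data and the rule come from the slides; the code itself is only an illustration:

def tenured_rule(rank, years):
    # The rule learned in step (1): IF rank = 'professor' OR years > 6.
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(1 for _, rank, years, label in test_set
              if tenured_rule(rank, years) == label)
print(correct / len(test_set))        # 0.75: Merlisa (7 years) is misclassified
print(tenured_rule("Professor", 4))   # unseen tuple (Jeff): 'yes'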
Classification Approaches
• Divide and Conquer
– Results in decision tree
– Uses “information gain” function
• Covering
– Select a category for which to learn a rule
– Add conditions to the rule until it is “good enough”
Association
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction
databases, relational databases, and other information
repositories.
– Frequent pattern: pattern (set of items, sequence, etc.) that
occurs frequently in a database [AIS93]
• Motivation: finding regularities in data
– What products were often purchased together? — Beer and
diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
Why Is Association Mining
Important?
• Foundation for many essential data mining
tasks
– Association, correlation, causality
– Sequential patterns, temporal or cyclic association,
partial periodicity, spatial and multimedia association
– Associative classification, cluster analysis, iceberg
cube, fascicles (semantic data compression)
• Broad applications
– Basket data analysis, cross-marketing, catalog
design, sale campaign analysis
– Web log (click stream) analysis, DNA sequence
analysis, etc.
Basic Concepts: Association Rules

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

• Itemset X = {x1, …, xk}
• Find all rules X ⇒ Y with minimum confidence and support
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction having X also contains Y
• Let min_support = 50%, min_conf = 50%:
– A ⇒ C (50%, 66.7%)
– C ⇒ A (50%, 100%)
Mining Association Rules: Example

Min. support 50%
Min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 66.7%
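A brief sketch (illustrative, not from the talk) that computes these support and confidence values for the four toy transactions:

transactions = [
    {"A", "B", "C"},  # 10
    {"A", "C"},       # 20
    {"A", "D"},       # 30
    {"B", "E", "F"},  # 40
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # P(rhs ⊆ t | lhs ⊆ t) = support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))       # 0.5
print(confidence({"A"}, {"C"}))  # 0.666...
print(confidence({"C"}, {"A"}))  # 1.0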
Apriori: A Candidate Generation-and-Test Approach
• Any subset of a frequent itemset must be frequent
– if {beer, diaper, nuts} is frequent, so is {beer, diaper}
– Every transaction having {beer, diaper, nuts} also contains {beer,
diaper}
• Apriori pruning principle: If there is any itemset
which is infrequent, its superset should not be
generated/tested!
• Method:
– generate length (k+1) candidate itemsets from length k frequent
itemsets, and
– test the candidates against DB
• Performance studies show its efficiency and
scalability
The Apriori Algorithm—A Mathematical Definition

Let I = {a, b, c, …} be the set of all items in the domain.
Let T = { S | S ⊆ I } be the set of all transaction records (item sets).
Let support(S) = |{ A | A ∈ T ∧ S ⊆ A }|.
Let L1 = { {a} | a ∈ I ∧ support({a}) ≥ minSupport }.

∀k (k > 1 ∧ Lk−1 ≠ ∅), let

Lk = { Si ∪ Sj | (Si ∈ Lk−1) ∧ (Sj ∈ Lk−1) ∧
      (|Si − Sj| = 1) ∧ (|Sj − Si| = 1) ∧
      (∀S [ ((S ⊂ Si ∪ Sj) ∧ (|S| = k−1)) → S ∈ Lk−1 ]) ∧
      (support(Si ∪ Sj) ≥ minSupport) }

Then the set of all frequent item sets is given by

L = ∪k Lk

and the set of all association rules is given by

R = { A ⇒ C | ∃F ∈ L: (A ⊂ F) ∧ (C = F − A) ∧ (A ≠ ∅) ∧ (C ≠ ∅) ∧
      (support(F) / support(A) ≥ minConfidence) }
The Apriori Algorithm—An Example

Example: minSupport = 2
I = {Table Saw, Router, Kreg Jig, Sander, Drill Press}
T = {
  {Table Saw, Router, Drill Press},
  {Router, Sander},
  {Router, Kreg Jig},
  {Table Saw, Router, Sander},
  {Table Saw, Kreg Jig},
  {Router, Kreg Jig},
  {Table Saw, Kreg Jig},
  {Table Saw, Router, Kreg Jig, Drill Press},
  {Table Saw, Router, Kreg Jig}
}
The Apriori Algorithm

• Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;
Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• How to count supports of candidates?
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd} (a code sketch of this step follows)
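A tiny sketch of this self-join and prune step on the L3 example, with itemsets kept as sorted tuples so the "shared (k-1)-prefix" join of classic Apriori is explicit (illustrative code, not from the talk):

from itertools import combinations

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
k = 3

# Self-join: merge itemsets sharing their first k-1 items.
joined = sorted({x[:k-1] + tuple(sorted({x[-1], y[-1]}))
                 for x in L3 for y in L3
                 if x[:k-1] == y[:k-1] and x[-1] < y[-1]})
# joined == [('a','b','c','d'), ('a','c','d','e')]

# Prune: drop any candidate with a k-subset that is not frequent.
L3set = set(L3)
C4 = [c for c in joined if all(s in L3set for s in combinations(c, k))]
# C4 == [('a','b','c','d')]; acde is removed because ('a','d','e') is not in L3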
State of Commercial/Research
Practice
• Increasing use of data mining systems in financial
community, marketing sectors, retailing
• Still have major problems with large, dynamic sets of
data (need better integration with the databases)
– Off-the-shelf data mining packages perform specialized learning on a small subset of the data
• Most research emphasizes machine learning; little
emphasis on database side (especially text)
• People achieving results are not likely to share
knowledge
Related Techniques: OLAP
On-Line Analytical Processing
• On-Line Analytical Processing tools provide the
ability to pose statistical and summary queries
interactively
– Traditional On-Line Transaction Processing (OLTP) databases
may take minutes or even hours to answer these queries
• Advantages relative to data mining
– Can obtain a wider variety of results
– Generally faster to obtain results
• Disadvantages relative to data mining
– User must “ask the right question”
– Generally used to determine high-level statistical summaries,
rather than specific relationships among instances
Integration of Data Mining
and Data Warehousing
• Data mining systems, DBMS, Data warehouse
systems coupling
– No coupling, loose-coupling, semi-tight-coupling, tight-coupling
• On-line analytical mining of data
– integration of mining and OLAP technologies
• Interactive mining of multi-level knowledge
– Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
• Integration of multiple mining functions
– Characterized classification, first clustering and then association
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
• e.g., occupation=“”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or
names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Why Is Data Dirty?
• Incomplete data comes from
– n/a data value when collected
– different considerations between the time the data was collected and the time it is analyzed
– human/hardware/software problems
• Noisy data comes from the process of data
– collection
– entry
– transmission
• Inconsistent data comes from
– Different data sources
– Functional dependency violation
Why Is Data Preprocessing
Important?
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
– Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation
comprises the majority of the work of building a data
warehouse. —Bill Inmon (father of the data
warehouse)
Major Tasks in Data
Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the
same or similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially
for numerical data
Data Cleaning
• Importance
– “Data cleaning is one of the three biggest problems in
data warehousing”—Ralph Kimball
– “Data cleaning is the number one problem in data
warehousing”—DCI survey
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus
deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the
time of entry
– history or changes of the data not registered
• Missing data may need to be inferred.
How to Handle Missing
Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with (a sketch follows this list)
– a global constant: e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter
– the most probable value: inference-based, such as a Bayesian formula or decision tree
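A short sketch of the automatic fill-in strategies using pandas; the DataFrame and its column names are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "income": [50_000, None, 72_000, None, 61_000],
    "class":  ["A", "A", "B", "B", "A"],
})

# Global constant.
df["income_const"] = df["income"].fillna(-1)
# Attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())
# Attribute mean per class (smarter).
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))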
Noisy Data
• Noise: random error or variance in a measured
variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitations
– inconsistency in naming conventions
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g.,
deal with possible outliers)
• Regression
– smooth by fitting the data into regression functions
Simple Discretization
Methods: Binning
• Equal-width (distance) partitioning:
– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N (see the sketch after this list)
– The most straightforward, but outliers may dominate
presentation
– Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
– Divides the range into N intervals, each containing approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky.
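A minimal sketch of both partitioning schemes, plus smoothing by bin means; the twelve sample values are illustrative, and the code is not from the talk:

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# Equal-width: intervals of size W = (B - A) / N.
A, B = data[0], data[-1]
W = (B - A) / N
width_bins = [[x for x in data
               if A + i * W <= x < A + (i + 1) * W
               or (i == N - 1 and x == B)]
              for i in range(N)]
# [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]: outliers can dominate

# Equal-depth: each bin holds roughly the same number of samples.
depth = len(data) // N
depth_bins = [data[i * depth:(i + 1) * depth] for i in range(N)]
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Smoothing by bin means: replace each value with its bin's mean.
smoothed = [sum(b) / len(b) for b in depth_bins for _ in b]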
Thank you!
Michael R. Wick
Professor and Chair
Department of Computer Science
University of Wisconsin – Eau Claire
Eau Claire, WI 54701
[email protected]
715-836-2526