Module 1 - Intro to Data Mining
Data Mining
Business Intelligence
Outline

- Data Mining and KDD
- Why Data Mining
- Applications of Data Mining
- Data Preprocessing
- Data Mining techniques
- Visualization of the results
- Summary
2
Data Mining and KDD
3
Looking for knowledge

- The Explosive Growth of Data
  • The World Wide Web
  • Business: e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation
  • Society and everyone: news, digital cameras, YouTube, forums, blogs, Google & Co.
- We are drowning in data, but starving for knowledge!
- Avoid data tombs
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets
4
What is Data Mining?

- Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
5
Knowledge Discovery (KDD) Process

[Figure: the KDD pipeline]
Data sources → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
6
Data Mining and Business Intelligence

[Figure: the business intelligence pyramid; the potential to support business decisions increases toward the top, the quantity of data increases toward the bottom]
- Decision Making (End User)
- Data Presentation: visualization techniques (Business Analyst)
- Data Mining: information discovery (Data Analyst)
- Data Exploration: statistical summary, querying, and reporting
- Data Preprocessing/Integration, Data Warehouses (DBA)
- Data Sources: paper, files, Web documents, scientific experiments, database systems
7
Data Mining: confluence of multiple disciplines

[Figure: Data Mining at the center of Database Technology, Machine Learning, Pattern Recognition, Statistics, Algorithms, Visualization, and other disciplines]
8
Why Data Mining?
9
Why is Data Mining so complex? A matter of data dimensions

- Tremendous amount of data
  • Walmart (customer buying patterns): a data warehouse 7.5 Terabytes large in 1995
  • VISA (detecting credit card interoperability issues): 6,800 payment transactions per second
- High-dimensionality of data
  • Many dimensions to be combined together
  • Data cube example: time, location, product → sales
- High complexity of data
  • Time-series data, temporal data, sequence data
  • Structured data, graphs, social networks and multi-linked data
  • Spatial, spatiotemporal, multimedia, text and Web data
10
What does Data Mining provide me with? (1)

- Multidimensional concept description: characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
  • Characterization describes things in the same class, discrimination describes how to separate different classes
- Frequent patterns, association, correlation vs. causality
  • Wine → Spaghetti [0.3% of all basket cases, 75% of cases when tomato sauce is bought]
  • Is this correlation or not?
11
What does Data Mining provide me with? (2)

- Classification and prediction
  • Construct models (functions) that describe and distinguish classes or concepts for future prediction
  • E.g., classify countries based on climate, or classify cars based on gas mileage
  • Predict some unknown or missing numerical values
- Cluster analysis
  • Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  • Maximizing intra-class similarity & minimizing inter-class similarity
12
What does Data Mining provide me with? (3)

- Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • Fraud detection is the main application area
  • Noise or exception?
- Trend and evolution analysis
  • Trend and deviation: e.g., regression analysis
  • Sequential pattern mining: e.g., digital camera → large SD memory
  • Periodicity analysis
  • Similarity-based analysis
13
Applications of Data Mining
Market Analysis and Management

- Data sources:
  • credit card transactions, loyalty cards, smart cards, discount coupons, ...
- Target marketing
  • Find clusters of "model" customers who share the same characteristics:
    • Geographics (lives in Rome, lives in Trentino)
    • Demographics (married, between 21-35, at least one child, family income more than 40,000 €/year)
    • Psychographics (likes new products, consistently uses the Web)
    • Behaviors (searches for info on the Internet, always defends her decisions)
  • Determine customer purchasing patterns over time
14
Applications of Data Mining
Market Analysis and Management

- Cross-market analysis
  • Find associations between product sales, and predict based on such associations
  • Compare the sales in the US and in Italy, find associations among old products and predict if new ones will be successful
- Customer profiling
  • What types of customers buy what products
  • Customers aged between 20-30 with income > 20K€ will buy product A
- Customer requirement analysis
  • Identify the best products for different groups of customers
  • Predict what factors will attract new customers
15
Applications of Data Mining
Corporate Analysis

- Finance planning and asset evaluation
  • Cash flow prediction and analysis
  • Cross-sectional and time-series analysis (financial ratios, trend analysis)
- Resource planning
  • Summarize and compare the resources and spending
- Competition
  • Monitor competitors and market directions
  • Group customers into classes and set a class-based pricing procedure
  • Set pricing strategies in a highly competitive market
- Other examples?
16
What's next?

- Data Preprocessing
  • Why is it needed?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy
- Data Mining techniques
  • Frequent patterns, association rules
  • Classification and prediction
  • Cluster analysis
  • Are you sleeping?
- Visualization of the results
- Summary
17
Data Preprocessing
18
Why Data Preprocessing?

- Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=" ", birthdate="31/12/2099"
  • noisy: containing errors or outliers
    • e.g., Salary="-10"
  • inconsistent: containing discrepancies in codes or names
    • e.g., Age="42", Birthday="03/07/1997" (we are in 2007!!)
    • e.g., was rating "1,2,3", now rating "A, B, C"
    • e.g., discrepancy between duplicate records: in one copy of the data customer A has to pay 200,000€, in the second copy A does not have to pay anything
19
Why is data dirty?

- Incomplete data may come from
  • "Not applicable" data values when collected
  • Different considerations between the time when the data was collected and when it is analyzed
  • Human/hardware/software problems
- Noisy data (incorrect values) may come from
  • Faulty data collection instruments
  • Human or computer error at data entry
  • Errors in data transmission
- Inconsistent data may come from
  • Different data sources
  • Functional dependency violations (e.g., modify some linked data)
20
Why Is Data Preprocessing Important?
21
Data Preprocessing
1. Data cleaning – missing values

"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)

- Fill in missing values
  • Name="John", Occupation="Lawyer", Age="28", Salary=""
  • Ignore the record (is it always feasible?)
  • Manually fill in missing attributes
  • Automatically insert a constant
  • Automatically insert the mean value (relative to the record's class)
  • Most probable value: make some inference!
22
Data Preprocessing
1. Data cleaning – binning

- Handle noisy data
  • Binning, clustering, regression (no details)
- Binning
  1. Sort data by price (€): 4, 8, 9, 15, 21, 21, 24, 25, 26
  2. Partition into equal-frequency (equi-depth) bins:
     • Bin 1: 4, 8, 9
     • Bin 2: 15, 21, 21
     • Bin 3: 24, 25, 26
  3. Smoothing by bin means:
     • Bin 1: 7, 7, 7
     • Bin 2: 19, 19, 19
     • Bin 3: 25, 25, 25
23
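A minimal Python sketch of the equal-frequency binning and bin-mean smoothing described on this slide; the price list and the number of bins are taken from the slide, the variable names are illustrative:

```python
# Equal-frequency binning with smoothing by bin means (illustrative sketch).
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
n_bins = 3
bin_size = len(prices) // n_bins   # assumes the data splits evenly

bins = [prices[i * bin_size:(i + 1) * bin_size] for i in range(n_bins)]
smoothed = [[round(sum(b) / len(b)) for _ in b] for b in bins]

print(bins)      # [[4, 8, 9], [15, 21, 21], [24, 25, 26]]
print(smoothed)  # [[7, 7, 7], [19, 19, 19], [25, 25, 25]]
```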
Data Preprocessing
1. Data cleaning – clustering

[Figure: detecting noise via clustering; points falling outside all clusters are treated as noise]
24
Data Preprocessing
2. Integration and transformation

- Data integration combines data from multiple sources into a coherent store
  [Figure: sources D1, D2, D3 merged into a single store D1,2,3]
- Schema integration
  • Integrate metadata from different sources
  • e.g., A.cust-id ≡ B.cust-number
- Entity identification problem
  • Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources differ (e.g., cm vs. inch)
25
Data Preprocessing
2. Integration and transformation

- Data integration can lead to redundant attributes
  • Same object (A.house = B.residence)
  • Derived attributes (A.annualIncome = B.salary + C.rentalIncome)
- Redundant attributes can be discovered via correlation analysis
  • A mathematical method for detecting the correlation between two attributes
  • Correlation coefficient (Pearson's product-moment coefficient): the higher it is, the stronger the correlation between attributes
  • Χ² (chi-square) test
  • No details on these methods here
26
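As a rough illustration of the correlation check mentioned above, a small NumPy sketch; the attribute values are made up for the example:

```python
import numpy as np

# Two hypothetical attributes collected from different sources.
salary        = np.array([30_000, 45_000, 52_000, 61_000, 75_000])
annual_income = np.array([31_000, 46_500, 53_000, 60_500, 76_000])

# Pearson's product-moment correlation coefficient:
# values close to +1 or -1 suggest the attributes are redundant.
r = np.corrcoef(salary, annual_income)[0, 1]
print(f"Pearson r = {r:.3f}")
```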
Data Preprocessing
2. Integration and transformation

- Aggregation
  • Sum the sales of different branches (in different data sources) to compute the company sales
- Generalization: concept hierarchy climbing
  • From an integer attribute age to classes of age (children, adult, old)
- Normalization: values scaled to fall within a small, specified range
  • Change the range from [-∞, +∞] to [-1, +1]
  • {-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1}
27
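A small sketch of the normalization example above; it scales by the maximum absolute value, which reproduces the numbers on the slide (the variable names are illustrative):

```python
# Scale values into [-1, +1] by dividing by the maximum absolute value.
values = [-13, -6, -3, 10, 100]
max_abs = max(abs(v) for v in values)
normalized = [v / max_abs for v in values]
print(normalized)  # [-0.13, -0.06, -0.03, 0.1, 1.0]
```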
Data Preprocessing
3. Data reduction

- Data reduction
  • Obtain a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results
  • Different reduction types (dimensionality, numerosity, discretization)
- Dimensionality: attribute subset selection
  • Example with a decision tree (left branches True, right branches False)
  [Figure: a decision tree over A4, A1, A6; initial attribute set {A1, A2, A3, A4, A5, A6}, reduced attribute set {A1, A4, A6}]
28
Data Preprocessing
3. Data reduction

- Dimensionality: Principal Components Analysis (PCA)
  • Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  • Works for numeric data only
  • Used when the number of dimensions is large
- Numerosity: clustering
  • Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter)
  [Figure: a data set summarized by 2 clusters; sparse data leads to many clusters, making the approach ineffective]
29
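A minimal sketch of dimensionality reduction with PCA, assuming scikit-learn is available; the data here is random and purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 samples in 6 dimensions, reduced to k = 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```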
Data Preprocessing
3. Data reduction

- Numerosity: sampling
  • Obtain a small sample s to represent the whole data set N
  • Problem: how to select a representative sample
  • Simple random sampling is not enough: representative subpopulations should be preserved
  • Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
  [Figure: random sampling may draw no samples from some regions, while stratified sampling covers each class]
30
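A compact sketch of stratified sampling in plain Python; the records, class labels, and sampling fraction are invented for illustration:

```python
import random
from collections import defaultdict

# Hypothetical records labeled with a class attribute (80 of class A, 20 of class B).
records = [("r%d" % i, "A") for i in range(80)] + [("r%d" % i, "B") for i in range(80, 100)]
fraction = 0.1  # keep 10% of each class

random.seed(42)
by_class = defaultdict(list)
for rec in records:
    by_class[rec[1]].append(rec)

# Sample the same fraction from every class so class proportions are preserved.
sample = []
for label, group in by_class.items():
    k = max(1, round(fraction * len(group)))
    sample.extend(random.sample(group, k))

print(len(sample), "records;", sum(1 for r in sample if r[1] == "A"), "of class A")
```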
Data Preprocessing
4. Discretization - concept hierarchy

- Three types of attributes
  • Nominal: values from an unordered set (color, profession)
  • Ordinal: values from an ordered set (military or academic rank)
  • Continuous: numbers (integer or real numbers)
- Discretization
  • Divide the range of a continuous attribute into intervals
  • Reduces data size and its complexity
  • Some data mining algorithms do not support continuous types; in those cases discretization is mandatory
- Some useful methods:
  • Binning, clustering (already presented)
  • Entropy-based discretization (no details here)
31
Data Preprocessing
4. Discretization - concept hierarchy

- Concept hierarchy generation for categorical data
  • Specification of an ordering between attributes (schema level)
    • street < city < state < country
  • Specification of a hierarchy of values (data level)
    • {Urbana, Champaign, Chicago} < Illinois
  • Automatic generation using the number of distinct values
    • For the set of attributes {street, city, state, country}:
    • IF |street| = 600,000, |city| = 3,000, |state| = 300, |country| = 15
    • THEN street < city < state < country
32
Outline

- Data Mining techniques
  • Frequent patterns, association rules
    • Support and confidence
  • Classification and prediction
    • Decision trees
    • Bayesian classifiers
    • Support Vector Machines
    • Lazy learning
  • Cluster analysis
- Visualization of the results
- Summary
33
Data Mining techniques
34
Frequent pattern analysis

- What is it?
  • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
  • Frequent pattern analysis: searching for frequent patterns
- Motivation: finding inherent regularities in data
  • Which products are bought together? Yesterday's wine and spaghetti example
  • What are the subsequent purchases after buying a PC?
  • Can we automatically classify web documents?
- Applications
  • Basket data analysis
  • Cross-marketing
  • Catalog design
  • Sale campaign analysis
35
Basic Concepts: Frequent Patterns and Association Rules (1)

Transaction-id | Items bought
1              | Wine, Bread, Spaghetti
2              | Wine, Cocoa, Spaghetti
3              | Wine, Spaghetti, Cheese
4              | Bread, Cheese, Sugar
5              | Bread, Cocoa, Spaghetti, Cheese, Sugar

(Itemsets = transactions in this example)

Goal: find all rules of type X → Y between items in an itemset, with minimum:
- Support s: probability that an itemset contains X ∪ Y
- Confidence c: conditional probability that an itemset containing X also contains Y
36
Support and confidence

That is:
- support, s: probability that a transaction contains {A ∪ B}
  s = P(A ∪ B)
- confidence, c: conditional probability that a transaction having A also contains B
  c = P(B|A)
- Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
37
Basic Concepts: Frequent Patterns and Association Rules (2)

Transaction-id | Items bought
1              | Wine, Bread, Spaghetti
2              | Wine, Cocoa, Spaghetti
3              | Wine, Spaghetti, Cheese
4              | Bread, Cheese, Sugar
5              | Bread, Cocoa, Spaghetti, Cheese, Sugar

Suppose: support s = 50%, confidence c = 50%

Support is used to define frequent patterns (sets of products in more than s% of the itemsets):
- {Wine} in itemsets 1, 2, 3 (support = 60%)
- {Bread} in itemsets 1, 4, 5 (support = 60%)
- {Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
- {Cheese} in itemsets 3, 4, 5 (support = 60%)
- {Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)
38
Basic Concepts: Frequent Patterns and Association Rules (3)

Transaction-id | Items bought
1              | Wine, Bread, Spaghetti
2              | Wine, Cocoa, Spaghetti
3              | Wine, Spaghetti, Cheese
4              | Bread, Cheese, Sugar
5              | Bread, Cocoa, Spaghetti, Cheese, Sugar

Suppose: support s = 50%, confidence c = 50%

Confidence defines association rules: X → Y rules over frequent patterns whose confidence is greater than c.
Suggestion: {Wine, Spaghetti} is the only frequent pattern to be considered. Why?
Association rules:
- Wine → Spaghetti (support = 60%, confidence = 100%)
- Spaghetti → Wine (support = 60%, confidence = 75%)
39
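A short Python sketch that recomputes the support and confidence figures above from the transaction table; the itemset handling is simplified for illustration:

```python
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs): support of lhs ∪ rhs divided by support of lhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Wine", "Spaghetti"}))        # 0.6
print(confidence({"Wine"}, {"Spaghetti"}))   # 1.0
print(confidence({"Spaghetti"}, {"Wine"}))   # 0.75
```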
Advanced concepts in Association Rules discovery

- Algorithms must face scalability problems
  • Apriori: if any itemset is infrequent, its supersets should not be generated/tested!
- Advanced problems
  • Boolean vs. quantitative associations
    age(x, "30..39") and income(x, "42..48K") → buys(x, "car") [s=1%, c=75%]
  • Single-level vs. multiple-level analysis
    What brands of wine are associated with what brands of spaghetti?

Are support and confidence clear?
40
Another example for association rules

Transaction-id | Items bought
1              | Margherita, Beer, Coke
2              | Margherita, Beer
3              | Quattro stagioni, Coke
4              | Margherita, Coke

Support s = 40%
Confidence c = 70%
41
Another example for association rules

Transaction-id | Items bought
1              | Margherita, Beer, Coke
2              | Margherita, Beer
3              | Quattro stagioni, Coke
4              | Margherita, Coke

Support s = 40%
Confidence c = 70%

Frequent itemsets:
- {Margherita} = 75%
- {Beer} = 50%
- {Coke} = 75%
- {Margherita, Beer} = 50%
- {Margherita, Coke} = 50%

Association rules:
- Beer → Margherita [s=50%, c=100%]
42
Classification vs. Prediction

- Classification
  • Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
  • The characterization is a model
  • The model can be applied to classify new data (predict the class it should belong to)
- Prediction
  • Models continuous-valued functions, i.e., predicts unknown or missing numerical values
- Applications
  • Credit approval, target marketing, fraud detection
43
Classification: the process

1. Model construction
   • The class label attribute defines the class each item should belong to
   • The set of items used for model construction is called the training set
   • The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage
   • Estimate the accuracy of the model
     • on the training set
     • on a generalization of the training set
   • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
44
Classification: the process
Model construction

Training data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Classification algorithm → Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
45
Classification: the process
IF rank = ‘professor’
Model usage
OR years > 6
THEN tenured = ‘yes’
Classifier
Testing
Data
Unseen Data
(Jeff, Professor, 4)
NAME
T om
M erlisa
G eorge
Joseph
RANK
YEARS TENURED
A ssistant P rof
2
no
A ssociate P rof
7
no
P rofessor
5
yes
A ssistant P rof
7
yes
Tenured?
46
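As an illustration of model construction and usage, a minimal scikit-learn sketch trained on the toy tenure table above; the numeric encoding of rank and the choice of a decision tree are assumptions made for this example, not the course's prescribed method:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy tenure data from the slides; rank is encoded numerically
# (Assistant Prof = 0, Associate Prof = 1, Professor = 2) -- an assumption for this sketch.
rank = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}
X_train = [[rank["Assistant Prof"], 3],
           [rank["Assistant Prof"], 7],
           [rank["Professor"], 2],
           [rank["Associate Prof"], 7],
           [rank["Assistant Prof"], 6],
           [rank["Associate Prof"], 3]]
y_train = ["no", "yes", "yes", "yes", "no", "no"]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The unseen tuple (Jeff, Professor, 4) from the slide.
print(model.predict([[rank["Professor"], 4]]))  # likely 'yes', matching the rule above
```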
Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
- Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
47
Evaluating generated models

- Accuracy
  • classifier accuracy: predicting class labels
  • predictor accuracy: guessing values of predicted attributes
- Speed
  • time to construct the model (training time)
  • time to use the model (classification/prediction time)
- Robustness
  • handling noise and missing values
- Scalability
  • efficiency in disk-resident databases
- Interpretability
  • understanding and insight provided by the model
48
Example of Classification

Example: Suppose that we have a database of customers on the AllElectronics mailing list. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics.

Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. Sending out promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose.
49
[Figure: a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. Each internal node represents a test on an attribute; each leaf node represents a class.]
(Slide credit: Assoc. Prof. Dr. D. T. Anh)
50
Classification techniques
Decision Trees (1)

[Figure: a decision tree for investment type choice.
 Income > 20K€? no → Low risk; yes → Age > 60?
 Age > 60? yes → Mid risk; no → Married?
 Married? yes → Mid risk; no → High risk]
51
Classification techniques
Decision Trees (2)

- How are the attributes in decision trees selected?
  • Two well-known indexes are used:
    • Information gain selects the most informative attribute for distinguishing the items between the classes
      • It is biased towards attributes with a large set of values
    • Gain ratio addresses the information gain limitations
52
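To make the attribute-selection idea concrete, a small sketch computing entropy and information gain for one candidate split; the class counts are invented for illustration:

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Hypothetical split: the parent node has 3 'yes' / 3 'no';
# a candidate attribute splits it into [3 yes, 1 no] and [0 yes, 2 no].
parent = entropy([3, 3])
children = [[3, 1], [0, 2]]
weighted = sum(sum(c) / 6 * entropy(c) for c in children)

info_gain = parent - weighted
print(f"information gain = {info_gain:.3f}")
```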
Classification techniques
Bayesian classifiers (2)

- Bayesian classification
  • A statistical classification technique
    • Predicts class membership probabilities
  • Founded on the Bayes theorem:
    P(H|X) = P(X|H) · P(H) / P(X)
    • What if X = "red and rounded" and H = "Apple"?
  • Performance
    • The simplest implementation (Naïve Bayes) is comparable to decision trees and neural networks
  • Incremental
    • Each training example can increase/decrease the probability that a hypothesis is correct
53
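A tiny worked example of the Bayes rule above in Python; all the probabilities are invented purely to illustrate the computation for X = "red and rounded" and H = "Apple":

```python
# Hypothetical probabilities, chosen only to illustrate Bayes' theorem.
p_x_given_h = 0.8   # P(X | H): an apple is red and rounded
p_h = 0.3           # P(H): prior probability of "Apple"
p_x = 0.4           # P(X): probability of observing "red and rounded"

# P(H | X) = P(X | H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(Apple | red and rounded) = {p_h_given_x:.2f}")  # 0.60
```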
Other Classification Methods

- k-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches
54
The k-Nearest Neighbor Algorithm

- All instances (samples) correspond to points in the n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance. The Euclidean distance of two points, X = (x1, x2, …, xn) and Y = (y1, y2, …, yn), is
  d(X, Y) = sqrt( Σ_{i=1..n} (xi - yi)² )
- When given an unknown sample xq, the k-Nearest Neighbor classifier searches the space for the k training samples that are closest to xq. The unknown sample is assigned the most common class among its k nearest neighbors; when k = 1, the unknown sample is assigned the class of the training sample that is closest to it in the space.
- Once we have obtained xq's k nearest neighbors using the distance function, the neighbors vote to determine xq's class.
55
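A minimal k-NN sketch in plain Python following the description above; the sample points and labels are made up for the example:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical labeled training points in 2 dimensions.
training = [((1.0, 1.2), "red"), ((0.8, 0.9), "red"),
            ((3.1, 3.0), "blue"), ((3.3, 2.8), "blue"), ((2.9, 3.2), "blue")]

def knn_classify(xq, k=3):
    """Return the majority class among the k training samples closest to xq."""
    neighbors = sorted(training, key=lambda pair: dist(pair[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.0, 1.0), k=1))  # 'red'
print(knn_classify((2.5, 2.5), k=3))  # 'blue'
```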
Genetic Algorithms

- GA: based on an analogy to biological evolution
- Each rule is represented by a string of bits.
  • Example: the rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string "100", where the two left bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as "001".
- An initial population is created consisting of randomly generated rules
- Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
- The fitness of a rule is represented by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation.
56
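A toy sketch of the crossover and mutation operators on bit-string rules, following the encoding described above; the rules and the mutation rate are arbitrary choices for illustration:

```python
import random

random.seed(1)

def crossover(rule_a, rule_b):
    """Single-point crossover: swap the tails of two bit strings."""
    point = random.randint(1, len(rule_a) - 1)
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]

def mutate(rule, rate=0.1):
    """Flip each bit independently with the given probability."""
    return "".join(b if random.random() > rate else str(1 - int(b)) for b in rule)

parent_1 = "100"  # IF A1 AND NOT A2 THEN C2 (encoding from the slide)
parent_2 = "001"  # IF NOT A1 AND NOT A2 THEN C1

child_1, child_2 = crossover(parent_1, parent_2)
print(child_1, child_2, mutate(child_1))
```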
5 minutes break!
57
Classification techniques
Support Vector Machines

- One of the most advanced classification techniques
  • Left figure: a small margin between the classes is found
  • Right figure: the largest margin is found
  • Support vector machines (SVMs) are able to identify the right-figure margin
[Figures: two separating hyperplanes for the same data, one with a small margin (left) and one with the largest margin (right)]
58
Classification techniques
SVMs + Kernel Functions

- Is data always linearly separable?
  • NO!!!
  • Solution: SVMs + kernel functions
[Figure: a data set that cannot be split linearly ("How to split this?"); an SVM alone vs. an SVM with kernel functions]
59
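A short scikit-learn sketch contrasting a linear SVM with an RBF-kernel SVM on data that is not linearly separable; the circular toy data set is generated only for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: class 0 near the origin, class 1 on a surrounding ring (not linearly separable).
rng = np.random.default_rng(0)
inner = rng.normal(scale=0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, size=50)
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)]) + rng.normal(scale=0.2, size=(50, 2))

X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# The linear kernel struggles on this data; the RBF kernel separates it well.
for kernel in ("linear", "rbf"):
    accuracy = SVC(kernel=kernel).fit(X, y).score(X, y)
    print(kernel, "training accuracy:", round(accuracy, 2))
```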
Classification techniques
Lazy learning

- Lazy learning
  • Simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  • Less time spent in training but more time in predicting
  • Uses a richer hypothesis space (many local linear functions), and hence the accuracy can be higher
- Instance-based learning
  • A subcategory of lazy learning
  • Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
  • An example: the k-nearest neighbor approach
60
Classification techniques
k-nearest neighbor

- All instances correspond to points in the n-dimensional space; x is the instance to be classified
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
- For discrete-valued classes, k-NN returns the most common value among the k training examples nearest to x
[Figure: which class should the green circle belong to? It depends on k: k = 3 → red, k = 5 → blue]
61
Prediction techniques
An overview

- Prediction is different from classification
  • Classification predicts categorical class labels
  • Prediction models continuous-valued functions
- Major method for prediction: regression
  • Model the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  • Linear and multiple regression
  • Non-linear regression
  • Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees
  • No details here
62
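A minimal linear-regression sketch with NumPy as an example of the regression idea above; the data points are invented:

```python
import numpy as np

# Hypothetical predictor x and response y with a roughly linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of y = w * x + b.
w, b = np.polyfit(x, y, deg=1)
print(f"y ≈ {w:.2f} * x + {b:.2f}")
print("prediction for x = 6:", round(w * 6 + b, 2))
```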
What is cluster analysis?

- Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
- Cluster analysis
  • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
  • It belongs to unsupervised learning
- Typical applications
  • As a stand-alone tool to get insight into data distribution
  • As a preprocessing step for other algorithms (day 1 slides)
63
Examples of cluster analysis

- Marketing
  • Help marketers discover distinct groups in their customer bases
- Land use
  • Identification of areas of similar land use in an earth observation database
- Insurance
  • Identifying groups of motor insurance policy holders with a high average claim cost
- City planning
  • Identifying groups of houses according to their house type, value, and geographical location
64
Good clustering

- A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
- It is hard to define "similar enough" or "good enough"
65
A small example

How to cluster this data? [Figure: a scatter plot of unlabeled points]
This process is not easy in practice. Why?
66
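A brief k-means clustering sketch with scikit-learn, as one possible way to cluster a small point set like the one in the example; the points and the choice of k = 2 are assumptions for this sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

# A small, made-up 2-D data set with two visually separated groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels:   ", kmeans.labels_)
print("centroids:", kmeans.cluster_centers_)
```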
Visualization of the results

- Presentation of the results or knowledge obtained from data mining in visual forms
- Examples
  • Scatter plots
  • Association rules
  • Decision trees
  • Clusters
67
Summary

- Why Data Mining?
- Data Mining and KDD
- Data preprocessing
- Some scenarios
- Classification
- Clustering
68