Tutorial Document of ITM638 Data Warehousing and Data Mining
Dr. Chutima Beokhaimook
24th March 2012
DATA WAREHOUSES AND OLAP TECHNOLOGY
What is a Data Warehouse?
- Data warehouses have been defined in many ways.
- "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" - W. H. Inmon
- The four keywords: subject-oriented, integrated, time-variant and non-volatile.
So, what is data warehousing?
- A process of constructing and using data warehouses.
- Constructing a data warehouse involves data integration, data cleaning and data consolidation.
- The utilization of a data warehouse necessitates a collection of decision support technologies.
- These allow knowledge workers (e.g. managers, analysts and executives) to use the data warehouse to obtain an overview of the data and make decisions based on the information in the warehouse.
- The term "warehouse DBMS" refers to the management and utilization of data warehouses.
Operational Database vs. Data Warehouses
Operational DBMS
- OLTP (on-line transaction processing)
- Day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, etc.
Data warehouses
- OLAP (on-line analytical processing)
- Serve users or knowledge workers in the role of data analysis and decision making
- The system can organize and present data in various formats
OLTP vs. OLAP

Feature                | OLTP                            | OLAP
Characteristic         | Operational processing          | Informational processing
Users                  | Clerk, IT professional          | Knowledge worker
Orientation            | Transaction                     | Analysis
Function               | Day-to-day operations           | Long-term informational requirements, DSS
DB design              | ER-based, application-oriented  | Star/snowflake, subject-oriented
Data                   | Current, guaranteed up-to-date  | Historical; accuracy maintained over time
Summarization          | Primitive, highly detailed      | Summarized, consolidated
# of records accessed  | Tens                            | Millions
# of users             | Thousands                       | Hundreds
DB size                | 100 MB to GB                    | 100 GB to TB
Why Have a Separate Data Warehouse?
- High performance for both systems
  - An operational database is tuned for OLTP: access methods, indexing, concurrency control, recovery.
  - A data warehouse is tuned for OLAP: complex OLAP queries, multidimensional views, consolidation.
- Different functions and different data
  - DSS require historical data, whereas operational DBs do not maintain historical data.
  - DSS require consolidation (such as aggregation and summarization) of data from heterogeneous sources, resulting in high-quality, clean and integrated data, whereas operational DBs contain only detailed raw data, which needs to be consolidated before analysis.
A Multidimensional Data Model (1)
- Data warehouses and OLAP tools are based on a multidimensional data model, which views data in the form of a data cube.
- A data cube allows data to be modeled and viewed in multiple dimensions.
- Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
- Example
  - A sales data warehouse keeps records of the store's sales with respect to the dimensions time, item, branch and location.
  - Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.
  - Ex. item(item_name, brand, type)
  - Dimension tables can be specified by users or experts, or automatically adjusted based on data distributions.
A Multidimensional Data Model (2)
- A multidimensional model is organized around a central theme, for instance sales, which is represented by a fact table.
- Facts are numerical measures such as quantities: dollars_sold, unit_sold, amount_budget.
Example: A 2-D View
Table 2.1 A 2-D view of sales data according to the dimensions time and item, where the sales are from branches located in Vancouver. The measure shown is dollars_sold (in thousands).
Example: A 3-D View
Table 2.2 A 3-D view of sales data according to the dimensions time, item and location. The measure shown is dollars_sold (in thousands).
Example: A 3-D Data Cube
A 3-D data cube represents the data in Table 2.2 according to the dimensions time, item and location. The measure shown is dollars_sold (in thousands).
Star Schema
- The most common modeling paradigm, in which the data warehouse contains:
  1. A large central table (fact table) containing the bulk of the data, with no redundancy.
  2. A set of smaller attendant tables (dimension tables), one for each dimension.
Example: star schema of a data warehouse for sales
- The central fact table is sales, which contains keys to each of the four dimensions, along with two measures: dollars_sold and unit_sold.
Example: snowflake schema of a data warehouse for sales
Example: fact constellation schema of a data warehouse for sales and shipping
- Two fact tables (sales and shipping)
Concept Hierarchies
- A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts (the example below is location).
Concept Hierarchies (2)
- Many concept hierarchies are implicit within the database schema.
- location is described by the attributes number, street, city, province_or_state, zipcode and country. The hierarchy street < city < province_or_state < country is a total order.
- time is described by the attributes day, week, month, quarter and year. The hierarchy day < {month < quarter; week} < year is a partial order (a week may cross month boundaries, so week rolls up only to year).
Typical OLAP Operations for Multidimensional Data (1)
- Roll-up (drill-up): climbing up a concept hierarchy or reducing a dimension; summarizes data.
- Drill-down (roll-down): stepping down a concept hierarchy or introducing additional dimensions.
  - The reverse of roll-up.
  - Navigates from less detailed data to more detailed data.
- Slice and dice:
  - The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
  - The dice operation defines a subcube by performing a selection on two or more dimensions.
Typical OLAP Operations for Multidimensional Data (2)
- Pivot (rotate):
  - A visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data.
- Other OLAP operations, such as:
  - drill-across: executes queries involving more than one fact table
  - drill-through
[Figures: OLAP operations illustrated on the sales cube (measure dollars_sold; dimensions time, item and location)]
- Roll-up on location (from cities to countries)
- Drill-down on time (from quarters to months)
- Dice for (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer")
- Slice for (time = "Q1")
- Pivot
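To make these operations concrete, the following is a minimal Python sketch (not from the original slides) using the pandas library, which is an assumption of this illustration; the table layout and numbers are made up rather than the AllElectronics figures.

    import pandas as pd

    # Illustrative sales facts: one row per (quarter, city, item) cell.
    sales = pd.DataFrame({
        "quarter":  ["Q1", "Q1", "Q1", "Q2", "Q2"],
        "city":     ["Vancouver", "Toronto", "New York", "Vancouver", "Chicago"],
        "country":  ["Canada", "Canada", "USA", "Canada", "USA"],
        "item":     ["computer", "computer", "phone", "security", "computer"],
        "dollars_sold": [605, 395, 400, 280, 440],
    })

    # Roll-up on location (from cities to countries): aggregate away the city level.
    rollup = sales.groupby(["quarter", "country", "item"], as_index=False)["dollars_sold"].sum()

    # Slice for time = "Q1": a selection on a single dimension.
    slice_q1 = sales[sales["quarter"] == "Q1"]

    # Dice: a selection on two or more dimensions.
    dice = sales[sales["city"].isin(["Toronto", "Vancouver"])
                 & sales["quarter"].isin(["Q1", "Q2"])
                 & sales["item"].isin(["home entertainment", "computer"])]

    # Pivot: rotate the 2-D view (quarter rows vs. item columns).
    pivot = sales.pivot_table(index="quarter", columns="item",
                              values="dollars_sold", aggfunc="sum")

    print(rollup, slice_q1, dice, pivot, sep="\n\n")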
MINING FREQUENT PATTERNS AND ASSOCIATIONS
What is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Rule form: "Body => Head [support, confidence]"
  - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]
A typical example of association rule mining is market basket analysis.
- The information that customers who purchase a computer also tend to buy antivirus software at the same time is represented in the association rule below:
  computer => antivirus_software [support = 2%, confidence = 60%]
- Rule support and confidence are two measures of rule interestingness.
  - Support = 2% means that 2% of all transactions under analysis show that computer and antivirus software are purchased together.
  - Confidence = 60% means that 60% of the customers who purchased a computer also bought the software.
- Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
- Such thresholds can be set by users or domain experts.
Rule Measure: Support and Confidence

TransID | Items Bought
T001    | A, B, C
T002    | A, C
T003    | A, D
T004    | B, E, F

- Find all rules of the form A ^ B => C with minimum confidence and support.
- Let min_sup = 50%, min_conf = 50%.
- Support: the probability that a transaction contains {A, B, C}.
- Confidence: the conditional probability that a transaction containing {A, B} also contains C.
- Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
- Such thresholds can be set by users or domain experts.
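As a concrete illustration (not part of the original slides), a minimal Python sketch that computes support and confidence for a candidate rule over the four transactions above:

    transactions = [
        {"A", "B", "C"},   # T001
        {"A", "C"},        # T002
        {"A", "D"},        # T003
        {"B", "E", "F"},   # T004
    ]

    def support(itemset, transactions):
        # Fraction of transactions containing every item in the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(body, head, transactions):
        # P(head | body) = support(body and head together) / support(body)
        return support(body | head, transactions) / support(body, transactions)

    # Example: rule {A} => {C}
    print(support({"A", "C"}, transactions))       # 0.5   -> support = 50%
    print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> confidence is about 67%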
- Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
- A set of items is referred to as an itemset.
- An itemset that contains k items is a k-itemset.
- The occurrence frequency of an itemset is the number of transactions that contain the itemset.
- An itemset satisfies minimum support if its occurrence frequency >= min_sup * total number of transactions.
- An itemset that satisfies minimum support is a frequent itemset.
Two Steps in Mining Association Rules
- Step 1: Find all frequent itemsets.
  - Any subset of a frequent itemset must also be a frequent itemset, i.e. if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
- Step 2: Generate strong association rules from the frequent itemsets.
Mining Single-Dimensional Boolean Association Rules from Transaction Databases
- Method for mining the simplest form of association rules (single-dimensional, single-level, boolean association rules): the Apriori algorithm.
- The Apriori algorithm: finding frequent itemsets for boolean association rules.
  - Lk, the set of frequent k-itemsets, is used to explore Lk+1.
  - It consists of a join step and a prune step:
    1. The join step: a set of candidate k-itemsets (Ck) is generated by joining Lk-1 with itself.
    2. The prune step: determine Lk using the property that any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
The Apriori Algorithm
- Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k
  L1 = {frequent 1-itemsets};
  for (k = 1; Lk is not empty; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database D do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return the union of all Lk;
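A minimal, runnable Python sketch of the same loop (my own illustration, not the lecture's code); it assumes transactions are given as sets of items and that min_support is an absolute count:

    from itertools import combinations

    def apriori(transactions, min_support_count):
        """Return a dict mapping each frequent itemset (a frozenset) to its count."""
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s: c for s, c in counts.items() if c >= min_support_count}
        frequent = dict(current)

        k = 2
        while current:
            # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
            prev = list(current)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k:
                        # Prune step: every (k-1)-subset must itself be frequent.
                        if all(frozenset(sub) in current
                               for sub in combinations(union, k - 1)):
                            candidates.add(union)
            # Scan the database to count candidate occurrences.
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            current = {s: c for s, c in counts.items() if c >= min_support_count}
            frequent.update(current)
            k += 1
        return frequent

    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(apriori(transactions, min_support_count=2))
    # {A}: 3, {B}: 2, {C}: 2, {A, C}: 2  (min_sup = 50% of 4 transactions)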
Example: Finding Frequent Itemsets in D
Transaction database D, |D| = 9
1. Each item is a member of the set of candidate 1-itemsets (C1); count the number of occurrences of each item.
2. Suppose the minimum transaction support count is 2; L1 is the set of candidate 1-itemsets that satisfy minimum support.
3. Generate C2 = L1 join L1.
4. Continue the algorithm until C4 is empty.
Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 join L3
  - C4 = {abcd, acde}
- Pruning:
  - acde is removed because ade is not in L3
  - C4 = {abcd}
Generating Association Rules from Frequent Itemsets
- confidence(A => B) = P(B|A) = support_count(A and B) / support_count(A)
  - support_count(A and B) is the number of transactions containing the itemset A united with B.
  - support_count(A) is the number of transactions containing the itemset A.
- Association rules can be generated as follows:
  - For each frequent itemset l, generate all nonempty subsets of l.
  - For every nonempty subset s of l, output the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
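The rule-generation step can be sketched as follows (again an illustrative Python fragment; the hard-coded counts are the frequent itemsets of the toy database D above, e.g. as produced by the Apriori sketch earlier):

    from itertools import combinations

    # Frequent-itemset counts for the toy database (min_sup = 50% of 4 transactions).
    freq = {
        frozenset({"A"}): 3,
        frozenset({"B"}): 2,
        frozenset({"C"}): 2,
        frozenset({"A", "C"}): 2,
    }

    def generate_rules(frequent_counts, min_conf):
        """Yield every strong rule s => (l - s) together with its confidence."""
        for l, count in frequent_counts.items():
            if len(l) < 2:
                continue
            for r in range(1, len(l)):
                for subset in combinations(l, r):
                    s = frozenset(subset)
                    conf = count / frequent_counts[s]   # support_count(l) / support_count(s)
                    if conf >= min_conf:
                        yield set(s), set(l - s), conf

    for body, head, conf in generate_rules(freq, min_conf=0.5):
        print(body, "=>", head, f"(confidence = {conf:.0%})")
    # {'A'} => {'C'} (confidence = 67%)
    # {'C'} => {'A'} (confidence = 100%)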
Example
- Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l?
- The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
- The resulting association rules are:
  1. I1 ^ I2 => I5, confidence = 2/4 = 50%
  2. I1 ^ I5 => I2, confidence = 2/2 = 100%
  3. I2 ^ I5 => I1, confidence = 2/2 = 100%
  4. I1 => I2 ^ I5, confidence = 2/6 = 33%
  5. I2 => I1 ^ I5, confidence = 2/7 = 29%
  6. I5 => I1 ^ I2, confidence = 2/2 = 100%
- If the minimum confidence threshold is 70%, the output is rules no. 2, 3 and 6.
CLASSIFICATION AND PREDICTION
976-451 Data Warehousing and Data Mining
Lecture 5: Classification and Prediction
Chutima Pisarn
Faculty of Technology and Environment
Prince of Songkla University
What Is Classification?
- Cases
  - A bank loan officer needs analysis of her data in order to learn which loan applicants are "safe" and which are "risky" for the bank.
  - A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy a new computer.
  - A medical researcher wants to analyze breast cancer data in order to predict which one of three specific treatments a patient should receive.
- The data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as
  - "safe" or "risky" for the loan application data
  - "yes" or "no" for the marketing data
  - "treatment A", "treatment B" or "treatment C" for the medical data
What Is Prediction?
- Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics.
- This data analysis task is numeric prediction, where the model constructed predicts a continuous value or ordered values, as opposed to a categorical label.
- This model is a predictor.
- Regression analysis is a statistical methodology that is most often used for numeric prediction.
How does classification work?
- Data classification is a two-step process.
- In the first step -- the learning step or training phase:
  - A model is built describing a predetermined set of data classes or concepts.
  - The model is constructed by analyzing database tuples described by attributes.
  - Each tuple is assumed to belong to a predefined class, as determined by the class label attribute.
  - Data tuples used to build the model are called the training data set.
  - The individual tuples in a training set are referred to as training samples.
  - If the class label is provided, this step is known as supervised learning; otherwise it is called unsupervised learning (or clustering).
  - The learned model is represented in the form of classification rules, decision trees or mathematical formulae.
How does classification work? (cont.)
- In the second step:
  - The model is used for classification.
  - First, estimate the predictive accuracy of the model.
  - The holdout method is a technique that uses a test set of class-labeled samples which are randomly selected and are independent of the training samples.
  - The accuracy of a model on a given test set is the percentage of test set samples correctly classified by the model.
  - If the accuracy of the model were estimated based on the training data set, the model would tend to overfit the data.
  - If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is unknown.
How is prediction different from classification?
- Data prediction is a two-step process, similar to that of data classification.
- For prediction, the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered).
- Prediction can also be viewed as a mapping or function, y = f(X).
Classification by Decision Tree Induction
- A decision tree is a flow-chart-like tree structure, where
  - each internal node denotes a test on an attribute,
  - each branch represents an outcome of the test, and
  - each leaf node represents a class.
- The top-most node in a tree is the root node.
- Example: the decision tree below represents the concept buys_computer.
  age?
  - <=30: student? (no => no, yes => yes)
  - 31…40: yes
  - >40: credit_rating? (excellent => no, fair => yes)
Attribute Selection Measure
- The information gain measure is used to select the test attribute at each node in the tree.
- Information gain is referred to as an attribute selection measure or a measure of the goodness of split.
- The attribute with the highest information gain is chosen as the test attribute for the current node.
- Let S be a set consisting of s data samples, and let the class label attribute have m distinct values defining m distinct classes Ci (for i = 1, ..., m).
- Let si be the number of samples of S in class Ci.
- The expected information needed to classify a given sample is
  I(s1, s2, ..., sm) = - sum_{i=1..m} pi * log2(pi),
  where pi is the probability that a sample belongs to class Ci, pi = si / s.
Attribute Selection Measure (cont.)
- Find the entropy of attribute A:
  - Let A have v distinct values {a1, a2, ..., av}, which partition S into {S1, S2, ..., Sv}.
  - For each Sj, let sij be the number of samples of class Ci in Sj.
  - The entropy, or expected information based on the partitioning by A, is given by
    E(A) = sum_{j=1..v} ((s1j + ... + smj) / s) * I(s1j, ..., smj)
  - Gain(A) = I(s1, s2, ..., sm) - E(A)
- The algorithm computes the information gain of each attribute. The attribute with the highest information gain is chosen as the test attribute for the given set S.
Example
The class label attribute buys_computer has two classes.

RID | age   | income | student | credit_rating | Class: buys_computer
1   | <=30  | High   | No      | Fair          | No
2   | <=30  | High   | No      | Excellent     | No
3   | 31…40 | High   | No      | Fair          | Yes
4   | >40   | Medium | No      | Fair          | Yes
5   | >40   | Low    | Yes     | Fair          | Yes
6   | >40   | Low    | Yes     | Excellent     | No
7   | 31…40 | Low    | Yes     | Excellent     | Yes
8   | <=30  | Medium | No      | Fair          | No
9   | <=30  | Low    | Yes     | Fair          | Yes
10  | >40   | Medium | Yes     | Fair          | Yes
11  | <=30  | Medium | Yes     | Excellent     | Yes
12  | 31…40 | Medium | No      | Excellent     | Yes
13  | 31…40 | High   | Yes     | Fair          | Yes
14  | >40   | Medium | No      | Excellent     | No

I(s1, s2) = I(9, 5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940
I(s1, s2) = I(9, 5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940
Compute the entropy of each attribute.

For attribute "age":
  For age = "<=30":  s11 = 2, s21 = 3
  For age = "31…40": s12 = 4, s22 = 0
  For age = ">40":   s13 = 3, s23 = 2
  Gain(age) = I(s1, s2) - E(age)
            = 0.940 - [(5/14)I(2,3) + (4/14)I(4,0) + (5/14)I(3,2)]
            = 0.246

For attribute "income":
  For income = "high":   s11 = 2, s21 = 2
  For income = "medium": s12 = 4, s22 = 2
  For income = "low":    s13 = 3, s23 = 1
  Gain(income) = I(s1, s2) - E(income)
               = 0.940 - [(4/14)I(2,2) + (6/14)I(4,2) + (4/14)I(3,1)]
               = 0.029
For attribute "student":
  For student = "yes": s11 = 6, s21 = 1
  For student = "no":  s12 = 3, s22 = 4
  Gain(student) = I(s1, s2) - E(student)
                = 0.940 - [(7/14)I(6,1) + (7/14)I(3,4)]
                = 0.151

For attribute "credit_rating":
  For credit_rating = "fair":      s11 = 6, s21 = 2
  For credit_rating = "excellent": s12 = 3, s22 = 3
  Gain(credit_rating) = I(s1, s2) - E(credit_rating)
                      = 0.940 - [(8/14)I(6,2) + (6/14)I(3,3)]
                      = 0.048

Since age has the highest information gain, age is selected as the test attribute.
- A node is created and labeled with age.
- Branches are grown for each of the attribute's values.
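The same numbers can be reproduced with a short Python sketch (my own illustration; the data below is the 14-tuple training set from the table above):

    from math import log2

    # (age, income, student, credit_rating, buys_computer)
    data = [
        ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31...40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
        ("31...40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
    ]
    ATTR = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

    def info(labels):
        # I(s1, ..., sm) = -sum pi * log2(pi) over the class distribution of `labels`
        total = len(labels)
        probs = [labels.count(c) / total for c in set(labels)]
        return -sum(p * log2(p) for p in probs)

    def gain(attr, rows):
        labels = [r[-1] for r in rows]
        col = ATTR[attr]
        e = 0.0
        for value in set(r[col] for r in rows):
            subset = [r[-1] for r in rows if r[col] == value]
            e += len(subset) / len(rows) * info(subset)   # E(A)
        return info(labels) - e                           # Gain(A) = I - E(A)

    for a in ATTR:
        print(a, round(gain(a, data), 3))
    # age 0.246, income 0.029, student 0.151, credit_rating 0.048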
[Figure: the training set partitioned on age]
Branching on age splits the 14 tuples into three subsets:

S1 (age <= 30), 5 tuples:
  income | student | credit_rating | Class
  High   | No      | Fair          | No
  High   | No      | Excellent     | No
  Medium | No      | Fair          | No
  Low    | Yes     | Fair          | Yes
  Medium | Yes     | Excellent     | Yes

S2 (age 31…40), 4 tuples:
  income | student | credit_rating | Class
  High   | No      | Fair          | Yes
  Low    | Yes     | Excellent     | Yes
  Medium | No      | Excellent     | Yes
  High   | Yes     | Fair          | Yes

S3 (age > 40), 5 tuples:
  income | student | credit_rating | Class
  Medium | No      | Fair          | Yes
  Low    | Yes     | Fair          | Yes
  Low    | Yes     | Excellent     | No
  Medium | Yes     | Fair          | Yes
  Medium | No      | Excellent     | No
- For the partition age = "<=30":
  - Find the information gain of each attribute in this partition, then select the attribute with the highest information gain as the test node (call generate_decision_tree(S1, {income, student, credit_rating})) => student has the highest information gain.
  - The branch student = "yes" contains the tuples (Low, Fair, Yes) and (Medium, Excellent, Yes): all samples belong to class "yes", so a leaf node labeled "yes" is created.
  - The branch student = "no" contains the tuples (High, Fair, No), (High, Excellent, No) and (Medium, Fair, No): all samples belong to class "no", so a leaf node labeled "no" is created.
[Figure: the partially grown tree, with the age = "<=30" branch resolved and the age = ">40" tuples still to be split]
- For the partition age = "31…40": all samples belong to class "yes" => create a leaf node labeled "yes".
- For the partition age = ">40": consider credit_rating and income => credit_rating has the higher information gain.
The resulting decision tree:
  age?
  - <=30: student? (no => no, yes => yes)
  - 31…40: yes
  - >40: credit_rating? (excellent => no, fair => yes)
- The only attribute left is income, but the sample set is empty => terminate generate_decision_tree.

Assignment 1: Show the construction of this decision tree in detail, including the calculations.
Example: generate rules from the decision tree
Using the decision tree above (age at the root, student for age <= 30, credit_rating for age > 40):
1. IF age = "<=30" AND student = "no" THEN buys_computer = "no"
2. IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
3. IF age = "31…40" THEN buys_computer = "yes"
4. IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
5. IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
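These rules translate directly into code. A minimal sketch (my own illustration; the function name classify is hypothetical):

    def classify(age, student, credit_rating):
        """Apply the five rules derived from the decision tree for buys_computer."""
        if age == "<=30":
            return "yes" if student == "yes" else "no"         # rules 1 and 2
        if age == "31...40":
            return "yes"                                       # rule 3
        # age == ">40"
        return "no" if credit_rating == "excellent" else "yes" # rules 4 and 5

    print(classify("<=30", "yes", "fair"))     # "yes"
    print(classify(">40", "no", "excellent"))  # "no"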
Naïve Bayesian Classification
1. The naïve Bayesian classifier, also called the simple Bayesian classifier, works as follows: each data sample is represented by an n-dimensional feature vector X = (x1, x2, ..., xn), drawn respectively from n attributes A1, A2, ..., An.

Training samples (A1, ..., An = Outlook, Temperature, Humidity, Windy; class label = Play):

Outlook  | Temperature | Humidity | Windy | Play
Rainy    | Mild        | Normal   | False | Y
Overcast | Cool        | Normal   | True  | Y
Sunny    | Hot         | High     | True  | N
Overcast | Hot         | High     | False | Y

X = (Sunny, Hot, High, False), class unknown
Naïve Bayesian Classification (cont.)
2. Suppose that there are m classes, C1, C2, ..., Cm.
  - Given an unknown data sample X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X.
  - That is, the naïve Bayesian classifier assigns an unknown X to class Ci if and only if
    P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i
  - In other words, it finds the maximum posterior probability among P(C1|X), P(C2|X), ..., P(Cm|X).
  - The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
For the weather example, m = 2, with C1: Play = "Y" and C2: Play = "N". Given the four training samples above and the unknown X = (Sunny, Hot, High, False), X is classified as "Y" if P(Play = "Y" | X) > P(Play = "N" | X).
Naïve Bayesian Classification (cont.)
3. By Bayes' theorem,
   P(Ci|X) = P(X|Ci) P(Ci) / P(X)
  - As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
  - If the P(Ci) are not known, it is commonly assumed that P(C1) = P(C2) = ... = P(Cm), and therefore only P(X|Ci) needs to be maximized.
  - Otherwise, we maximize P(X|Ci) P(Ci), where P(Ci) = si / s (the number of training samples of class Ci divided by the total number of training samples).
In the weather example (m = 2, C1: Play = "Y", C2: Play = "N"), three of the four training samples have Play = "Y" and one has Play = "N", so
  P(Play = "Y" | X) is proportional to P(X | Play = "Y") P(Play = "Y") = P(X | Play = "Y") * (3/4)
  P(Play = "N" | X) is proportional to P(X | Play = "N") P(Play = "N") = P(X | Play = "N") * (1/4)
Naïve Bayesian Classification (cont.)
4. Given a data set with many attributes, it is expensive to compute P(X|Ci).
  - To reduce computation, the naïve assumption of class conditional independence is made (there are no dependence relationships among the attributes):
    P(X|Ci) = prod_{k=1..n} P(xk|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci)
  - If Ak is categorical, then P(xk|Ci) = sik / si, where sik is the number of training samples of class Ci having the value xk for Ak, and si is the total number of training samples belonging to class Ci.
  - If Ak is continuous-valued, a Gaussian distribution is used (not covered in this class).
Naïve Bayesian Classification (cont.)
5. In order to classify an unknown X, P(X|Ci) P(Ci) is evaluated for each class Ci.
  - Sample X is assigned to the class Ci for which P(X|Ci) P(Ci) is the maximum.
Example: Predicting a class label using naïve Bayesian classification

RID | age   | income | student | credit_rating | Class: buys_computer
1   | <=30  | High   | No      | Fair          | No
2   | <=30  | High   | No      | Excellent     | No
3   | 31…40 | High   | No      | Fair          | Yes
4   | >40   | Medium | No      | Fair          | Yes
5   | >40   | Low    | Yes     | Fair          | Yes
6   | >40   | Low    | Yes     | Excellent     | No
7   | 31…40 | Low    | Yes     | Excellent     | Yes
8   | <=30  | Medium | No      | Fair          | No
9   | <=30  | Low    | Yes     | Fair          | Yes
10  | >40   | Medium | Yes     | Fair          | Yes
11  | <=30  | Medium | Yes     | Excellent     | Yes
12  | 31…40 | Medium | No      | Excellent     | Yes
13  | 31…40 | High   | Yes     | Fair          | Yes
14  | >40   | Medium | No      | Excellent     | No
15  | <=30  | Medium | Yes     | Fair          | (unknown sample)
- C1: buys_computer = "Yes", C2: buys_computer = "No"
- The unknown sample we wish to classify is X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair").
- We need to maximize P(X|Ci) P(Ci) for i = 1, 2.

i = 1: P(X|buys_computer = "yes") P(buys_computer = "yes")
  P(buys_computer = "yes") = 9/14 = 0.64
  P(X|buys_computer = "yes") = P(age = "<=30" | buys_computer = "yes") *
                               P(income = "medium" | buys_computer = "yes") *
                               P(student = "yes" | buys_computer = "yes") *
                               P(credit_rating = "fair" | buys_computer = "yes")
                             = 2/9 * 4/9 * 6/9 * 6/9 = 0.044
  P(X|buys_computer = "yes") P(buys_computer = "yes") = 0.64 * 0.044 = 0.028
i = 2: P(X|buys_computer = "no") P(buys_computer = "no")
  P(buys_computer = "no") = 5/14 = 0.36
  P(X|buys_computer = "no") = P(age = "<=30" | buys_computer = "no") *
                              P(income = "medium" | buys_computer = "no") *
                              P(student = "yes" | buys_computer = "no") *
                              P(credit_rating = "fair" | buys_computer = "no")
                            = 3/5 * 2/5 * 1/5 * 2/5 = 0.019
  P(X|buys_computer = "no") P(buys_computer = "no") = 0.36 * 0.019 = 0.007

Therefore, X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair") should be in class buys_computer = "yes".
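A small Python sketch (my own illustration) that reproduces this calculation for categorical attributes:

    data = [  # (age, income, student, credit_rating, buys_computer) - the 14 training tuples above
        ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31...40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
        ("31...40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
    ]

    def naive_bayes(x, rows):
        """Return the class Ci maximizing P(X|Ci) P(Ci) for a categorical sample x."""
        scores = {}
        for c in set(r[-1] for r in rows):
            in_class = [r for r in rows if r[-1] == c]
            score = len(in_class) / len(rows)            # P(Ci) = si / s
            for k, value in enumerate(x):
                matches = sum(r[k] == value for r in in_class)
                score *= matches / len(in_class)         # P(xk|Ci) = sik / si
            scores[c] = score
        return max(scores, key=scores.get), scores

    label, scores = naive_bayes(("<=30", "medium", "yes", "fair"), data)
    print(label, scores)   # 'yes', with scores of about 0.028 for "yes" and 0.007 for "no"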
Assignment 2: Using the naïve Bayesian classifier, predict the class of the unknown data samples below.

Outlook  | Temperature | Humidity | Windy | Play
Sunny    | Hot         | High     | False | N
Sunny    | Hot         | High     | True  | N
Overcast | Hot         | High     | False | Y
Rainy    | Mild        | High     | False | Y
Rainy    | Cool        | Normal   | False | Y
Rainy    | Cool        | Normal   | True  | N
Overcast | Cool        | Normal   | True  | Y
Sunny    | Mild        | High     | False | N
Sunny    | Cool        | Normal   | False | Y
Rainy    | Mild        | Normal   | False | Y
Sunny    | Mild        | Normal   | True  | Y
Overcast | Hot         | Normal   | False | Y
Overcast | Mild        | High     | True  | Y
Rainy    | Mild        | High     | True  | N

Unknown data samples:
Sunny    | Cool        | Normal   | False | ?
Rainy    | Mild        | High     | False | ?
Prediction: Linear Regression
- The prediction of continuous values can be modeled by the statistical technique of regression.
- Linear regression is the simplest form of regression:
  Y = α + βX
  - Y is called the response variable.
  - X is called the predictor variable.
  - α and β are regression coefficients specifying the Y-intercept and the slope of the line.
- These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line.
Example: Find the linear regression of salary data

Salary data:
X (years experience) | Y (salary in $1000s)
 3 | 30
 8 | 57
 9 | 64
13 | 72
 3 | 36
 6 | 43
11 | 59
21 | 90
 1 | 20
16 | 83

With mean(x) = 9.1 and mean(y) = 55.4,
  β = sum_{i=1..s} (xi - mean(x))(yi - mean(y)) / sum_{i=1..s} (xi - mean(x))^2 = 3.5
  α = mean(y) - β * mean(x) = 23.6
The predicted line is estimated by Y = 23.6 + 3.5X.
For example, for X = 10 years of experience the predicted salary is 23.6 + 3.5(10) = 58.6, i.e. $58,600.
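A short Python sketch (illustrative, not part of the original slides) that reproduces these coefficients:

    xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
    ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

    x_bar = sum(xs) / len(xs)    # 9.1
    y_bar = sum(ys) / len(ys)    # 55.4

    # Least-squares slope: sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
           / sum((x - x_bar) ** 2 for x in xs)
    print(round(beta, 1))                      # 3.5

    # The slide rounds beta to 3.5 before solving for the intercept.
    alpha = y_bar - round(beta, 1) * x_bar
    print(round(alpha, 1))                     # about 23.6
    print(round(alpha + 3.5 * 10, 1))          # predicted salary for 10 years: about 58.6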
Classifier Accuracy Measures
- The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.
  - It is also called the recognition rate in the pattern recognition literature.
- The error rate or misclassification rate of a classifier M is simply 1 - Acc(M), where Acc(M) is the accuracy of M.
- If we were to use the training set to estimate the error rate of the model, this estimate is called the resubstitution error.
- A confusion matrix is a useful tool for analyzing how well the classifier can recognize tuples of each class.
Confusion matrix: Example

Actual class        | Predicted buys_computer=yes | Predicted buys_computer=no | Total  | Recognition (%)
buys_computer=yes   | 6,954                       | 46                         | 7,000  | 99.34
buys_computer=no    | 412                         | 2,588                      | 3,000  | 86.27
Total               | 7,366                       | 2,634                      | 10,000 | 95.52

(6,954 is the number of tuples of class buys_computer=yes that were labeled by the classifier as class buys_computer=yes.)

In general, for classes C1 and C2:
Actual C1 | predicted C1: true positives  | predicted C2: false negatives
Actual C2 | predicted C1: false positives | predicted C2: true negatives
Are there alternatives to the accuracy measure?
- Sensitivity refers to the true positive (recognition) rate = the proportion of positive tuples that are correctly identified.
- Specificity is the true negative rate = the proportion of negative tuples that are correctly identified.
  sensitivity = t_pos / pos          (pos = the number of positive tuples)
  specificity = t_neg / neg          (neg = the number of negative tuples)
  precision   = t_pos / (t_pos + f_pos)
  accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
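Applied to the confusion matrix above (treating buys_computer = yes as the positive class), a quick Python check (illustrative):

    t_pos, f_neg = 6954, 46      # actual yes tuples
    f_pos, t_neg = 412, 2588     # actual no tuples
    pos, neg = t_pos + f_neg, f_pos + t_neg

    sensitivity = t_pos / pos                      # 6954 / 7000 = 0.9934
    specificity = t_neg / neg                      # 2588 / 3000 = 0.8627
    precision   = t_pos / (t_pos + f_pos)          # 6954 / 7366 = 0.9441
    accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

    print(round(sensitivity, 4), round(specificity, 4), round(precision, 4), round(accuracy, 4))
    # 0.9934 0.8627 0.9441 0.9542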
Predictor Error Measures
- Loss functions measure the error between yi and the predicted value yi'.
- The most common loss functions are:
  - Absolute error: |yi - yi'|
  - Squared error: (yi - yi')^2
- Based on the above, the test error (rate), or generalization error, is the average loss over the test set.
- Thus, for a test set of d tuples, we get the following error rates:
  - Mean absolute error: (1/d) * sum_{i=1..d} |yi - yi'|
  - Mean squared error:  (1/d) * sum_{i=1..d} (yi - yi')^2
Evaluating the Accuracy of a Classifier or Predictor
- How can we use these measures to obtain a reliable estimate of classifier accuracy (or predictor accuracy)?
- Accuracy estimates can help in the comparison of different classifiers.
- Common techniques for assessing accuracy, based on randomly sampled partitions of the given data:
  - Holdout, random subsampling, cross-validation, bootstrap
Evaluating the Accuracy of a Classifier or Predictor
- Holdout method
  - The given data are randomly partitioned into two independent sets, a training set and a test set.
  - Typically, 2/3 of the data form the training set and 1/3 forms the test set.
  - Training set: used to derive the classifier.
  - Test set: used to estimate the accuracy of the derived classifier.
  [Figure: the data are split into a training set, from which the model is derived, and a test set, which is used to estimate its accuracy.]
Evaluating the Accuracy of a Classifier or Predictor
- Random subsampling
  - A variation of the holdout method.
  - Repeat the holdout method k times.
  - The overall accuracy estimate is the average of the accuracies obtained from each iteration.
Evaluating the Accuracy of a Classifier or Predictor
- k-fold cross-validation
  - The initial data are randomly partitioned into k equal-sized subsets ("folds") S1, S2, ..., Sk.
  - Training and testing are performed k times.
  - In iteration i, the subset Si is the test set, and the remaining subsets are collectively used to train the classifier.
  - Accuracy = (overall number of correct classifications from the k iterations) / (total number of samples in the initial data).
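A minimal sketch of the k-fold loop in Python (illustrative; the toy majority-class "classifier" below is only there so the example runs end to end, and the function names are my own):

    import random

    def k_fold_indices(n, k, seed=0):
        """Randomly partition indices 0..n-1 into k roughly equal folds."""
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        return [idx[i::k] for i in range(k)]

    def cross_validate(rows, k, train, classify):
        """Return k-fold accuracy: correct classifications over all samples."""
        folds = k_fold_indices(len(rows), k)
        correct = 0
        for i in range(k):
            test = [rows[j] for j in folds[i]]
            training = [rows[j] for f in range(k) if f != i for j in folds[f]]
            model = train(training)
            correct += sum(classify(model, r[:-1]) == r[-1] for r in test)
        return correct / len(rows)

    # Toy classifier: always predict the majority class of the training fold.
    def train(rows):
        labels = [r[-1] for r in rows]
        return max(set(labels), key=labels.count)

    def classify(model, x):
        return model

    rows = [("a", "yes"), ("b", "yes"), ("c", "no"), ("d", "yes"), ("e", "no"), ("f", "yes")]
    print(cross_validate(rows, k=3, train=train, classify=classify))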
Evaluating the Accuracy of a Classifier or Predictor
- Bootstrap method
  - The training tuples are sampled uniformly with replacement.
  - Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
  - There are several bootstrap methods; the commonly used one is the .632 bootstrap, which works as follows:
    - Given a data set of d tuples, the data set is sampled d times, with replacement, resulting in a bootstrap sample (training set) of d samples.
    - It is very likely that some of the original data tuples will occur more than once in this sample.
    - The data tuples that did not make it into the training set end up forming the test set.
    - Suppose we try this out several times: on average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set.
CLUSTER ANALYSIS
What is Cluster Analysis?
- Clustering: the process of grouping data into classes or clusters, such that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.
- What are some typical applications of clustering?
  - In business: discovering distinct groups in customer bases and characterizing customer groups based on purchasing patterns.
  - In biology: deriving plant and animal taxonomies, categorizing genes.
  - Etc.
- Clustering is also called data segmentation in some applications, because clustering partitions a large data set into groups according to similarity.
What is Cluster Analysis? (cont.)
- Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases.
  - Applications: the detection of credit card fraud, monitoring of criminal activities in electronic commerce.
- In machine learning, clustering is an example of unsupervised learning (it does not rely on predefined classes).
How to compute the dissimilarity between objects?
- The dissimilarity (or similarity) between objects described by interval-scaled variables is typically computed based on the distance between each pair of objects.
  - Euclidean distance (Minkowski with q = 2):
    d(i,j) = (|xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2)^(1/2)
  - Manhattan (or city block) distance (Minkowski with q = 1):
    d(i,j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|
  - Minkowski distance, a generalization of both Euclidean and Manhattan distance:
    d(i,j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q)
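A small Python sketch of these three distance measures (illustrative):

    def minkowski(x, y, q):
        """Minkowski distance between two points given as equal-length sequences."""
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

    def euclidean(x, y):
        return minkowski(x, y, 2)   # q = 2

    def manhattan(x, y):
        return minkowski(x, y, 1)   # q = 1

    print(euclidean((2, 10), (5, 8)))   # about 3.61
    print(manhattan((2, 10), (5, 8)))   # 5.0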
Centroid-Based Technique: The K-Means Method
- Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.
The k-means algorithm
- Input: the number of clusters k and a database containing n objects.
- Output: a set of k clusters that minimizes the squared-error criterion.
1. Randomly select k of the objects, each of which initially represents a cluster mean or center.
2. For each remaining object:
  - The object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean.
  - Compute the new mean for each cluster.
  - The process iterates until the criterion function converges.
The K-Means Method (cont.)
- The criterion used is called the squared-error criterion:
  E = sum_{i=1..k} sum_{p in Ci} |p - mi|^2
  where E is the sum of squared error for all objects in the database, p is the point representing a given object, and mi is the mean of cluster Ci.

Assignment 3: Suppose that the data mining task is to cluster the following eight points into three clusters:
  A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)
- The distance function is Euclidean distance.
- Suppose A1, B1 and C1 are assigned as the initial centers of the clusters.
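For reference, a compact k-means sketch in Python (my own illustration of the algorithm above; it does not work through Assignment 3 step by step, but it can be run on those eight points with A1, B1 and C1 as the initial centers):

    from math import dist   # Euclidean distance (Python 3.8+)

    def k_means(points, centers, max_iter=100):
        """Iteratively assign points to the nearest center and recompute the means."""
        centers = [tuple(c) for c in centers]
        for _ in range(max_iter):
            # Assignment step: each point goes to the cluster with the nearest mean.
            clusters = [[] for _ in centers]
            for p in points:
                nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
                clusters[nearest].append(p)
            # Update step: recompute each cluster mean (keep the old center if a cluster is empty).
            new_centers = [
                tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
                for i, cluster in enumerate(clusters)
            ]
            if new_centers == centers:   # converged
                break
            centers = new_centers
        return centers, clusters

    points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
    centers, clusters = k_means(points, centers=[(2, 10), (5, 8), (1, 2)])   # A1, B1, C1
    print(centers)
    print(clusters)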