OLAP and Data Warehouses
Introduction
• Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.
• Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; the data is fairly static.
  – Contrast such On-Line Analytic Processing (OLAP) with traditional On-Line Transaction Processing (OLTP): mostly long queries, instead of short update transactions.
OLTP Compared With OLAP
• On-Line Transaction Processing (OLTP)
  – Maintains a database that is an accurate model of some real-world enterprise. Supports day-to-day operations. Characteristics:
    • Short, simple transactions
    • Relatively frequent updates
    • Transactions access only a small fraction of the database
• On-Line Analytic Processing (OLAP)
  – Uses information in the database to guide strategic decisions. Characteristics:
    • Complex queries
    • Infrequent updates
    • Transactions access a large fraction of the database
    • Data need not be up-to-date
Three Complementary Trends
• Data Warehousing: Consolidate data from many
sources in one large repository.
– Loading, periodic synchronization of replicas.
– Semantic integration.
• OLAP:
– Complex SQL queries and views.
– Queries based on spreadsheet-style operations and
“multidimensional” view of data.
– Interactive and “online” queries.
• Data Mining: Exploratory search for interesting trends
and anomalies.
Data Warehousing
• Integrated data spanning long time periods, often augmented with summary information.
• Several gigabytes to terabytes common.
• Interactive response times expected for complex queries; ad-hoc updates uncommon.
[Figure: external data sources are EXTRACTed, TRANSFORMed, and LOADed into the data warehouse, which is periodically REFRESHed; a metadata repository describes the warehouse, and the warehouse supports OLAP and data mining.]
Warehousing Issues
• Semantic Integration: When getting data from
multiple sources, must eliminate mismatches, e.g.,
different currencies, schemas.
• Heterogeneous Sources: Must access data from a
variety of source formats and repositories.
– Replication capabilities can be exploited here.
• Load, Refresh, Purge: Must load data, periodically
refresh it, and purge too-old data.
• Metadata Management: Must keep track of source,
loading time, and other information for all data in
the warehouse.
OLAP, Data Mining, and Analysis
• The "A" in OLAP stands for "Analytic"
• Many OLAP and data mining applications involve sophisticated analysis methods from mathematics, statistics, and artificial intelligence
• Our main interest is in the database aspects of these fields, not the sophisticated analysis techniques
Fact Tables
• Many OLAP applications are based on a fact table
• For example, a supermarket application might be
based on a table
Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
• The table can be viewed as multidimensional
– Market_Id, Product_Id, Time_Id are the dimensions that
represent specific supermarkets, products, and time
intervals
– Sales_Amt is a function of the other three
A Data Cube
• A fact table can be viewed as an N-dimensional data cube (3-dimensional in our example)
  – The entries in the cube are the values of Sales_Amt
A Sample Data Cube
[Figure: a 3-D cube with dimensions Date (1Qtr–4Qtr), Product (TV, PC, VCR), and Country (U.S.A., Canada, Mexico); "sum" planes along each dimension hold aggregates. One highlighted cell holds the total annual sales of TVs in the U.S.A.]
Multidimensional Data Model
• Collection of numeric measures, which depend on a set of dimensions.
  – E.g., measure Sales; dimensions Product (key: pid), Location (locid), and Time (timeid).
• The fact table holds one sales value per (pid, timeid, locid) combination:

    pid  timeid  locid  sales
    11   1       1      25
    11   2       1      8
    11   3       1      15
    12   1       1      30
    12   2       1      20
    12   3       1      50
    13   1       1      8
    13   2       1      10
    13   3       1      10
    11   1       2      35

• The slice locid=1 is the pid × timeid grid:

              timeid=1  timeid=2  timeid=3
    pid=11       25         8        15
    pid=12       30        20        50
    pid=13        8        10        10
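As a concrete illustration, the same fact table can be held in memory as a mapping from (pid, timeid, locid) to sales, and the locid=1 slice is just a filter on the third key. A minimal Python sketch (variable names are ours):

    # Fact table from the slide: (pid, timeid, locid) -> sales
    sales = {
        (11, 1, 1): 25, (11, 2, 1): 8,  (11, 3, 1): 15,
        (12, 1, 1): 30, (12, 2, 1): 20, (12, 3, 1): 50,
        (13, 1, 1): 8,  (13, 2, 1): 10, (13, 3, 1): 10,
        (11, 1, 2): 35,
    }

    # Slice locid=1: keep only cells whose locid is 1, leaving
    # a 2-D (pid x timeid) view of the cube.
    slice_loc1 = {(pid, timeid): amt
                  for (pid, timeid, locid), amt in sales.items()
                  if locid == 1}

    print(slice_loc1[(12, 3)])  # 50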
Dimension Hierarchies
• For each dimension, the set of values can be organized in a hierarchy:

    PRODUCT:  category → pname
    TIME:     year → quarter → month → date (with week as an alternative path from year to date)
    LOCATION: country → state → city
Dimension Tables
• The dimensions of the fact table are further
described with dimension tables
• Fact table:
Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
• Dimension Tables:
Market (Market_Id, City, State, Region)
Product (Product_Id, Name, Category, Price)
Time (Time_Id, Week, Month, Quarter)
Conceptual Modeling of Data
Warehouses
• Modeling data warehouses: dimensions &
measures
– Star schema: A fact table in the middle connected to
a set of dimension tables.
– Snowflake schema: A refinement of the star schema in which some dimensional hierarchies are normalized into sets of smaller dimension tables, forming a shape similar to a snowflake.
– Fact constellations: Multiple fact tables share dimension tables. Viewed as a collection of stars, this is also called a galaxy schema or fact constellation.
Star Schema
• The fact and dimension relations can be
displayed in an E-R diagram, which looks
like a star and is called a star schema
Example of Star Schema
Dimension tables:
    time (time_key, day, day_of_the_week, month, quarter, year)
    item (item_key, item_name, brand, type, supplier_type)
    branch (branch_key, branch_name, branch_type)
    location (location_key, street, city, province_or_state, country)
Sales Fact Table:
    (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
Example of Snowflake Schema
Dimension tables:
    time (time_key, day, day_of_the_week, month, quarter, year)
    item (item_key, item_name, brand, type, supplier_key), normalized into
        supplier (supplier_key, supplier_type)
    branch (branch_key, branch_name, branch_type)
    location (location_key, street, city_key), normalized into
        city (city_key, city, province_or_state, country)
Sales Fact Table:
    (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
Example of Fact Constellation
Shared dimension tables:
    time (time_key, day, day_of_the_week, month, quarter, year)
    item (item_key, item_name, brand, type, supplier_type)
    branch (branch_key, branch_name, branch_type)
    location (location_key, street, city, province_or_state, country)
    shipper (shipper_key, shipper_name, location_key, shipper_type)
Sales Fact Table:
    (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Shipping Fact Table:
    (time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped)
Measures: units_sold, dollars_sold, avg_sales (sales); dollars_cost, units_shipped (shipping)
OLAP Server Architectures
• Relational OLAP (ROLAP)
  – Uses a relational or extended-relational DBMS to store and manage warehouse data, with OLAP middleware to supply the missing pieces
  – Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
  – Greater scalability
• Multidimensional OLAP (MOLAP)
  – Array-based multidimensional storage engine (sparse-matrix techniques)
  – Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
  – User flexibility, e.g., low level relational, high level array
Aggregation
• Many OLAP queries involve aggregation of the
data in the fact table
• For example, to find the total sales (over time) of
each product in each market, we might use
    SELECT   S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM     Sales S
    GROUP BY S.Market_Id, S.Product_Id

• The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data
Aggregation over Time
• The output of the previous query, SUM(Sales_Amt) by market and product:

    Product_Id   M1     M2     M3   M4
    P1           3003   3      …    …
    P2           1503   7503   …    …
    P3           6003   7000   …    …
    P4           2402   …      …    …
    P5           4503   …      …    …
Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or
detailed data, or introducing new dimensions
• Slice and dice:
– project and select
– Slice performs a selection on one dimension
– Dice defines a subcube by performing a selection on two
or more dimensions
• Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes.
Drilling Down and Rolling Up
• Some dimension tables form an aggregation hierarchy
    Market_Id → City → State → Region
• Executing a series of queries that moves down the hierarchy (e.g., from aggregation over regions to aggregation over states) is called drilling down
  – Requires the fact table or information more specific than the requested aggregation (e.g., cities)
• Executing a series of queries that moves up the hierarchy (e.g., from states to regions) is called rolling up
  – Note: in a rollup, coarser aggregations can be computed from prior queries for finer aggregations
Drilling Down
• Drilling down on market: from Region to State
    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
    Market (Market_Id, City, State, Region)

1.  SELECT   S.Product_Id, M.Region, SUM (S.Sales_Amt)
    FROM     Sales S, Market M
    WHERE    M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Region

2.  SELECT   S.Product_Id, M.State, SUM (S.Sales_Amt)
    FROM     Sales S, Market M
    WHERE    M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.State
Rolling Up
• Rolling up on market, from State to Region
  – If we have already created a table, State_Sales, using

1.  SELECT   S.Product_Id, M.State, SUM (S.Sales_Amt)
    FROM     Sales S, Market M
    WHERE    M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.State

    then we can roll up from there to:

2.  SELECT   T.Product_Id, M.Region, SUM (T.Sales_Amt)
    FROM     State_Sales T, Market M
    WHERE    M.State = T.State
    GROUP BY T.Product_Id, M.Region
Pivoting
• Aggregation on selected dimensions.
  – E.g., pivoting on Location and Time yields this cross-tabulation:

             WI    CA    Total
    1995     63    81    144
    1996     38    107   145
    1997     75    35    110
    Total    176   223   399
Pivoting
• When we view the data as a multidimensional cube and group on a subset of the axes, we are said to be performing a pivot on those axes
  – Pivoting on dimensions D1, …, Dk in a data cube with dimensions D1, …, Dk, Dk+1, …, Dn means that we use GROUP BY A1, …, Ak and aggregate over Ak+1, …, An, where Ai is an attribute of dimension Di
  – Example: pivoting on Product and Time corresponds to grouping on Product_Id and Quarter and aggregating Sales_Amt over Market_Id:

    SELECT   S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
    FROM     Sales S, Time T
    WHERE    T.Time_Id = S.Time_Id
    GROUP BY S.Product_Id, T.Quarter      -- the pivot
Slicing-and-Dicing
• When we use WHERE to specify a particular value for an axis (or several axes), we are performing a slice
  – Slicing the data cube in the Time dimension (choosing sales only in week 12), then pivoting to Product_Id (aggregating over Market_Id):

    SELECT   S.Product_Id, SUM (S.Sales_Amt)
    FROM     Sales S, Time T
    WHERE    T.Time_Id = S.Time_Id AND T.Week = 'Wk-12'   -- the slice
    GROUP BY S.Product_Id                                 -- the pivot
Slicing-and-Dicing
• Typically slicing and dicing involves several queries to find the "right slice." For instance, change the slice and the axes:
• Slicing on the Time and Market dimensions, then pivoting to Product_Id and Week (in the Time dimension):

    SELECT   S.Product_Id, T.Week, SUM (S.Sales_Amt)
    FROM     Sales S, Time T
    WHERE    T.Time_Id = S.Time_Id
             AND T.Quarter = 4            -- the slice
             AND S.Market_Id = 12345
    GROUP BY S.Product_Id, T.Week         -- the pivot
The CUBE Operator
• Constructing the following table (the previous output extended with row and column totals) would take three queries (next slide):

    Product_Id   M1     M2     M3   Total
    P1           3003   3      …    …
    P2           1503   7503   …    …
    P3           6003   7000   …    …
    P4           2402   …      …    …
    Total        4503   …      …    …
The Three Queries
• For the table entries, without the totals (aggregation over time):

    SELECT   S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM     Sales S
    GROUP BY S.Market_Id, S.Product_Id

• For the row totals (aggregation over time and supermarkets):

    SELECT   S.Product_Id, SUM (S.Sales_Amt)
    FROM     Sales S
    GROUP BY S.Product_Id

• For the column totals (aggregation over time and products):

    SELECT   S.Market_Id, SUM (S.Sales_Amt)
    FROM     Sales S
    GROUP BY S.Market_Id
Definition of the CUBE Operator
• Doing these three queries is wasteful
  – The first does much of the work of the other two: if we could save that result and aggregate over Market_Id and Product_Id, we could compute the other queries more efficiently
• The CUBE clause is part of SQL:1999
  – GROUP BY CUBE (v1, v2, …, vn)
  – Equivalent to a collection of GROUP BYs, one for each of the 2^n subsets of {v1, v2, …, vn}
Example of CUBE Operator
• The following query returns all the information
needed to make the previous products/markets
table:
SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
FROM Sales S
GROUP BY CUBE (S.Market_Id, S.Product_Id)
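To see exactly what the CUBE query above computes, here is a minimal Python sketch (illustrative only, with toy rows of our own; a DBMS computes this far more efficiently) that materializes all 2^n groupings of a small in-memory fact table, using None to mark a column that a grouping aggregates over:

    from itertools import combinations
    from collections import defaultdict

    rows = [  # (market_id, product_id, sales_amt) -- toy data
        ("M1", "P1", 1000), ("M1", "P1", 2003),
        ("M1", "P2", 1503), ("M2", "P1", 3),
    ]

    def group_by_cube(rows, n_dims=2):
        totals = defaultdict(int)
        for size in range(n_dims + 1):                  # subset sizes 0..n
            for keep in combinations(range(n_dims), size):
                for row in rows:
                    # None marks an aggregated-over column
                    key = tuple(row[i] if i in keep else None
                                for i in range(n_dims))
                    totals[key] += row[-1]
        return totals

    cube = group_by_cube(rows)
    print(cube[("M1", "P1")])  # 3003 -- finest granularity
    print(cube[("M1", None)])  # 4506 -- total for M1 over all products
    print(cube[(None, None)])  # 4509 -- grand total

The ROLLUP variant introduced next keeps only the prefix groupings: both columns, then Market_Id alone, then the empty grouping.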
ROLLUP
• ROLLUP is similar to CUBE except that instead of aggregating over all subsets of the arguments, it creates subsets moving from right to left
• GROUP BY ROLLUP (A1, A2, …, An) is a series of these aggregations:
  – GROUP BY A1, …, An-1, An
  – GROUP BY A1, …, An-1
  – …
  – GROUP BY A1, A2
  – GROUP BY A1
  – No GROUP BY
• ROLLUP is also in SQL:1999
Example of ROLLUP Operator

    SELECT   S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM     Sales S
    GROUP BY ROLLUP (S.Market_Id, S.Product_Id)

  – first aggregates with the finest granularity:
        GROUP BY S.Market_Id, S.Product_Id
  – then with the next level of granularity:
        GROUP BY S.Market_Id
  – then the grand total is computed with no GROUP BY clause
ROLLUP vs. CUBE
• The same query with CUBE:
  – first aggregates with the finest granularity:
        GROUP BY S.Market_Id, S.Product_Id
  – then with the next level of granularity:
        GROUP BY S.Market_Id
    and
        GROUP BY S.Product_Id
  – then the grand total with no GROUP BY
Materialized Views
• The CUBE operator is often used to precompute aggregations on all dimensions of a fact table and save them as materialized views to speed up future queries
Incremental Updates
• The large volume of data in a data warehouse
makes loading and updating a significant task
• For efficiency, updating is usually incremental
– Different parts are updated at different times
• Incremental updates might result in the database
being in an inconsistent state
– Usually not important because queries involve only
statistical summaries of data, which are not greatly
affected by such inconsistencies
Introduction to Data Mining
Outline
• What's Data Mining?
• Data Mining Techniques
  – Association Rule Mining
  – Classification
  – Clustering
  – Other Data Mining Techniques
• New Challenges in Data Mining
Data Mining
• Data mining, also referred to as KDD (Knowledge Discovery in Databases), is the process of nontrivial extraction of implicit, previously unknown, and potentially useful information from data in large databases.
• This useful information can take the form of knowledge rules, constraints, or regularities.
Data Mining: A KDD Process
[Figure: databases feed data cleaning and data integration into a data warehouse; selection produces the task-relevant data, on which data mining is performed; evaluation and presentation turn the mined patterns into knowledge.]
Tasks Solved by Data Mining
• Prediction/Classification
• Clustering
• Association Analysis
• Outlier Detection
• Evolution Analysis
Data Mining Applications
• Customer Relation Management (CRM)
– Customer Retention
• Market analysis
– Target market
– Market segmentation
– Cross selling
• Fraud detection
– Health care fraud detection
– Credit card fraud detection
– Telecom fraud detection
Data Mining – a Multidisciplinary Field
• Database technology
• Data warehousing
• Artificial intelligence
• Machine learning
• Statistics
• Neural networks
• Pattern recognition
• Knowledge-based systems
• Knowledge acquisition
• Information retrieval
• Data visualization
• High-performance computing
Association Rule Mining
• An association is a correlation between certain
values in a database (in the same or different
columns)
• Originally proposed on market basket data
(items purchased on a per-transaction basis)
– A rule such as “70% of transactions that purchase bread
also purchase butter”.
– In a convenience store in the early evening, a large
percentage of customers who bought diapers also
bought beer
Confidence and Support
• To determine whether an association exists, the system
computes the confidence and support for that
association
• Confidence in A => B
– The percentage of transactions (recorded in the database)
that contain B among those that contain A
• Diapers => Beer:
The percentage of customers who bought beer among those who bought
diapers
• Support
– The percentage of transactions that contain both items
among all transactions
• 100* (customers who bought both Diapers and Beer)/(all customers)
Formal Definition
• Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items (an itemset). An association rule is an implication of the form X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
• X is called the antecedent; Y is called the consequent.
• The rule X => Y has support s if s% of the transactions in D contain X ∪ Y.
• The rule X => Y has confidence c if c% of the transactions in D that contain X also contain Y.
• Support indicates the frequency of the occurring pattern, while confidence measures the rule's strength.
Example

    TID   Items
    1     Bread, Milk
    2     Beer, Diaper, Bread, Eggs
    3     Beer, Coke, Diaper, Milk
    4     Beer, Bread, Diaper, Milk
    5     Coke, Bread, Diaper, Milk

Rule: {Diaper, Milk} => Beer
    Support = count(Diaper, Milk, Beer) / |D| = 2/5 = 0.4
    Confidence = count(Diaper, Milk, Beer) / count(Diaper, Milk) = 2/3 ≈ 0.66
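Both measures are easy to compute directly from a transaction list. A minimal Python sketch using the five transactions above (function names are ours):

    # Support and confidence for {Diaper, Milk} => Beer.
    transactions = [
        {"Bread", "Milk"},
        {"Beer", "Diaper", "Bread", "Eggs"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Coke", "Bread", "Diaper", "Milk"},
    ]

    def support(itemset, db):
        # Fraction of transactions containing every item in itemset.
        return sum(itemset <= t for t in db) / len(db)

    def confidence(antecedent, consequent, db):
        # support(X u Y) / support(X) for the rule X => Y.
        return support(antecedent | consequent, db) / support(antecedent, db)

    x, y = {"Diaper", "Milk"}, {"Beer"}
    print(support(x | y, transactions))    # 0.4
    print(confidence(x, y, transactions))  # 0.666...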
Two Steps of Association Rule Mining
• Given a user-specified minimum support threshold (minsup) and minimum confidence threshold (minconf), the problem can be solved in two steps:
  – Find the frequent itemsets: those with support above minsup.
  – From each frequent itemset, derive all rules with confidence above minconf.
• The overall performance is determined by step 1.
Apriori Algorithm
• A level-wise, iterative algorithm
• Key property: any subset of a frequent itemset is also frequent
• Uses this property for candidate set generation and pruning
Apriori Algorithm
    k-itemset: an itemset having k items
    Lk: the set of frequent k-itemsets (those with minimum support)
    Ck: the set of candidate k-itemsets (potentially frequent itemsets)

    L1 = {frequent 1-itemsets};
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk-1);          // new candidates
        forall transactions t ∈ D do begin
            Ct = subset(Ck, t);          // candidates contained in t
            forall candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count >= minsup}
    end
    Answer = ∪k Lk;
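The pseudocode maps onto a short runnable Python sketch. This is a simplification of ours: minsup is given as a fraction, and apriori-gen's join and prune steps are folded into set comprehensions. It uses the worked example's database (next slide), so the output can be checked against it:

    from itertools import combinations

    def apriori(db, minsup):
        # db: list of transaction sets; returns {itemset: support count}.
        n = len(db)

        def frequent(cands):
            counts = {c: sum(c <= t for t in db) for c in cands}
            return {c: s for c, s in counts.items() if s / n >= minsup}

        L = frequent({frozenset([i]) for t in db for i in t})  # L1
        answer, k = dict(L), 2
        while L:
            # apriori-gen: join frequent (k-1)-itemsets, then prune any
            # candidate with an infrequent (k-1)-subset (every subset of
            # a frequent itemset must itself be frequent).
            cands = {a | b for a in L for b in L if len(a | b) == k}
            cands = {c for c in cands
                     if all(frozenset(s) in L
                            for s in combinations(c, k - 1))}
            L = frequent(cands)
            answer.update(L)
            k += 1
        return answer

    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(apriori(db, 0.5))  # includes frozenset({'B', 'C', 'E'}): 2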
Example
Database D (minsup = 50%, minconf = 60%):

    TID   Items
    100   A, C, D
    200   B, C, E
    300   A, B, C, E
    400   B, E

Candidate 1-itemsets (with support counts): {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
Frequent 1-itemsets: {A}:2, {B}:3, {C}:3, {E}:3

Candidate 2-itemsets (with support counts): {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
Frequent 2-itemsets: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

Candidate 3-itemset (with support count): {B,C,E}:2
Frequent 3-itemset: {B,C,E}:2

Rules derived from {B, C, E}, each with support = 50%:
    B and C => E   confidence = 100%
    B and E => C   confidence = 66.7%
    C and E => B   confidence = 100%
    B => C and E   confidence = 66.7%
    C => B and E   confidence = 66.7%
    E => B and C   confidence = 66.7%
Topics on Association Rule Mining
• Quantitative association rule mining
  – Partition continuous data into intervals
• Mining multilevel association rules
  – Concept hierarchy
    • 80% of customers that purchase milk may also purchase bread
    • 70% of people buy wheat bread if they buy 2% milk
• Constraint-based rule mining
• Mining high-confidence rules
• Rule interestingness
• Removing redundant and misleading rules
• From association rule mining to correlation analysis
Classification
• Each tuple belongs to a class. The class of a tuple is indicated by the value of a goal attribute (or class-label attribute). The other attributes are predicting attributes.
• The task is to discover a relationship between the predicting attributes and the goal attribute.
• The discovered knowledge can be used to predict the class of a new tuple whose class is unknown.
• Classification and regression:
  – Classification – on categorical data
  – Regression – on continuous data
Classification Accuracy
• The data is split into a training set and a testing set.
• For each tuple in the test data, the class predicted by the discovered rules is compared with the actual class.
• Accuracy is the fraction of test tuples correctly classified by the discovered rules.
Example 1
Input data:

    Gender   Country   Age   Buy (goal)
    Male     France    25    Yes
    Male     England   21    Yes
    Female   France    23    Yes
    Female   England   34    Yes
    Female   France    30    No
    Male     Germany   21    No
    Male     Germany   20    No
    Female   Germany   18    No
    Female   France    34    No
    Male     France    55    No

IF-THEN classification rules:
    If (Country = "Germany") then (Buy = "no")
    If (Country = "England") then (Buy = "yes")
    If (Country = "France" and Age <= 25) then (Buy = "yes")
    If (Country = "France" and Age > 25) then (Buy = "no")
Example 2

    No.  Risk      Credit History  Debt  Collateral  Income
    1    high      bad             high  none        $0 to $15k
    2    high      unknown         high  none        $15 to $35k
    3    moderate  unknown         low   none        $15 to $35k
    4    high      unknown         low   none        $0 to $15k
    5    low       unknown         low   none        over $35k
    6    low       unknown         low   adequate    over $35k
    7    high      bad             low   none        $0 to $15k
    8    moderate  bad             low   adequate    over $35k
    9    low       good            low   none        over $35k
    10   low       good            high  adequate    over $35k
    11   high      good            high  none        $0 to $15k
    12   moderate  good            high  none        $15 to $35k
    13   low       good            high  none        over $35k
    14   high      bad             high  none        $15 to $35k
Example 2 (cont.): Decision Tree

    Income?
        $0 to $15K: high risk
        $15 to $35K: Credit history?
            unknown: Debt?
                high: high risk
                low: moderate risk
            bad: high risk
            good: moderate risk
        Over $35K: Credit history?
            unknown: low risk
            bad: moderate risk
            good: low risk
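The same tree can be written as a nested Python structure with a small classifier. A sketch (attribute keys are ours, adapted from the table):

    # Internal nodes are (attribute, branches) pairs; leaves are risk labels.
    tree = ("income", {
        "$0 to $15k":  "high",
        "$15 to $35k": ("credit_history", {
            "unknown": ("debt", {"high": "high", "low": "moderate"}),
            "bad":     "high",
            "good":    "moderate",
        }),
        "over $35k":   ("credit_history", {
            "unknown": "low",
            "bad":     "moderate",
            "good":    "low",
        }),
    })

    def classify(node, record):
        # Walk the tree until a leaf (a plain string) is reached.
        while not isinstance(node, str):
            attribute, branches = node
            node = branches[record[attribute]]
        return node

    print(classify(tree, {"income": "$15 to $35k",
                          "credit_history": "unknown",
                          "debt": "low"}))  # moderate (matches example 3)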
Decision Tree Induction
• Decision tree
  – A flow-chart-like tree structure
  – Internal nodes denote a test on an attribute
  – Branches represent outcomes of the test
  – Leaf nodes represent class labels or class distributions
• Two phases of decision tree generation
  – Tree construction
  – Tree pruning
Algorithm for Decision Tree
Induction
• Basic algorithm
– Tree is constructed in a top-down recursive manner
– At start, all the training examples are at the root
– Test attributes are selected based on a heuristic or some statistical
measure (e.g., information gain in ID3/C4.5 or Gini Index in IBM
Intelligent Miner)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
• DTI Algorithms: ID3, C4.5, CART, SPRINT, SLIQ
Attribute selection
Splitting on Income? partitions the training examples:
    $0 to $15K: examples {1, 4, 7, 11}
    $15 to $35K: examples {2, 3, 12, 14}
    Over $35K: examples {5, 6, 8, 9, 10, 13}

Why select the "Income" attribute at the root level?

Entropy of the full set (6 high, 3 moderate, 5 low out of 14):
    I = -(6/14) log2(6/14) - (3/14) log2(3/14) - (5/14) log2(5/14) = 1.531 bits

Expected entropy after splitting on income:
    E(income) = (4/14) (-(4/4) log2(4/4))
              + (4/14) (-(2/4) log2(2/4) - (2/4) log2(2/4))
              + (6/14) (-(5/6) log2(5/6) - (1/6) log2(1/6))
              = 0.564 bits

Information_gain(income) = 1.531 - 0.564 = 0.967 bits
Information_gain(credit history) = 0.266 bits
Information_gain(debt) = 0.581 bits
Information_gain(collateral) = 0.756 bits
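The same arithmetic in a short Python sketch (the risk and income columns are copied row by row from the table in Example 2):

    from math import log2
    from collections import Counter

    def entropy(labels):
        # I(p1, ..., pn) = -sum p_i * log2(p_i) over class frequencies.
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    risk = ["high", "high", "moderate", "high", "low", "low", "high",
            "moderate", "low", "low", "high", "moderate", "low", "high"]
    income = ["$0-15k", "$15-35k", "$15-35k", "$0-15k", ">35k", ">35k",
              "$0-15k", ">35k", ">35k", ">35k", "$0-15k", "$15-35k",
              ">35k", "$15-35k"]

    print(round(entropy(risk), 3))  # 1.531 bits

    # Expected entropy after splitting on income: the weighted sum of
    # the entropies of the three partitions.
    e_income = sum(
        (len(part) / len(risk)) * entropy(part)
        for v in set(income)
        for part in [[r for r, i in zip(risk, income) if i == v]]
    )
    print(round(e_income, 3))                  # 0.564
    print(round(entropy(risk) - e_income, 3))  # 0.966 (the slide's 0.967
                                               # subtracts rounded values)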
Other Classification Methods
• Bayesian classification
  – Naïve Bayesian classification
  – Bayesian belief networks
• Neural networks
  – Classification by backpropagation
• Nearest-neighbor classification
• Case-based reasoning
• Genetic algorithms
• Rough set approach
• Fuzzy set approach
Clustering
• Data clustering groups a set of data objects (without a predefined class attribute) based on the conceptual clustering principle: maximize the intraclass similarity and minimize the interclass similarity.
• Classification is supervised learning, while clustering is unsupervised learning.
Similarity Metrics
Suppose i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects.

• Manhattan distance
    d(i, j) = |xi1 - xj1| + |xi2 - xj2| + … + |xip - xjp|
• Euclidean distance
    d(i, j) = sqrt(|xi1 - xj1|^2 + |xi2 - xj2|^2 + … + |xip - xjp|^2)
• Minkowski distance
    d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q)
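All three metrics are one-liners in Python, and Minkowski with q = 1 or q = 2 reproduces the other two. A minimal sketch:

    def manhattan(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def euclidean(x, y):
        return sum(abs(a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    def minkowski(x, y, q):
        # q = 1 gives Manhattan distance, q = 2 gives Euclidean distance.
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

    i, j = (1, 2, 3), (4, 6, 3)
    print(manhattan(i, j))     # 7
    print(euclidean(i, j))     # 5.0
    print(minkowski(i, j, 2))  # 5.0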
The K-Means Clustering Method
• Given k, the k-means algorithm proceeds in four steps (see the sketch below):
  1. Partition the objects into k nonempty subsets.
  2. Compute the seed point (mean) of each cluster.
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to step 2; stop when no object changes cluster.
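A minimal 2-D sketch of these four steps in Python (for simplicity it seeds with k random points rather than a random initial partition; illustrative only):

    import random

    def kmeans(points, k, iters=100):
        # points: list of (x, y) tuples; returns the k cluster means.
        means = random.sample(points, k)          # initial seed points
        for _ in range(iters):
            # Step 3: assign each object to the nearest mean.
            clusters = [[] for _ in range(k)]
            for p in points:
                d = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
                clusters[d.index(min(d))].append(p)
            # Step 2: recompute each cluster's mean (keep the old seed
            # if a cluster ends up empty).
            new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else m
                   for cl, m in zip(clusters, means)]
            if new == means:                      # Step 4: stop when stable
                break
            means = new
        return means

    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    print(sorted(kmeans(pts, 2)))  # means near (1.33, 1.33) and (8.33, 8.33)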
Example of K-Means Clustering Method
[Figure: four scatter plots on a 0–10 grid showing successive k-means iterations: objects are assigned to the nearest cluster mean, the means are recomputed, and the assignments stabilize.]
k-Means and k-Medoids Clustering
• Partitioning methods
  – k-means: each cluster is represented by the center (mean) of the cluster
  – k-medoids: each cluster is represented by one of the objects in the cluster, called the medoid, which is the most centrally located object
Other Clustering Methods
• Hierarchical methods
– Top-down (starts with all the objects in the same cluster)
– Bottom-up (starts with each object forming a separate
cluster)
• Density-Based Methods
– a neighborhood has to contain at least a minimum number
of points (density threshold)
• Grid-Based Methods
– It quantizes the object space into a finite number of cells
that form a grid structure
• Model-Based Clustering Methods
– Neural Networks, Self-Organizing Maps (SOM)
Other Data Mining Techniques
• Sequential Pattern Mining and Time-Series
Similarity Analysis
– Stock Analysis
• Outlier Detection
– Fraud Detection
– Medical Analysis
• Text Mining
• Web Mining
New Challenges in Data Mining
• Incremental data mining – data mining on data
streams, such as supermarket transactions and
phone call records
• Scalable data mining methods, parallel and
distributed data mining
• New application areas, such as Bioinformatics
• Privacy protection and information security in data
mining
New Challenges in Data Mining (cont.)
• Incorporation of domain knowledge
• Interestingness measures
• Data mining query languages
• Handling noisy or incomplete data
• Data mining on complex types of data – spatial data, web data, multimedia data, temporal data, XML data