Download Chapter 3: Cluster Analysis

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Chapter 3: Cluster Analysis
`
`
`
`
`
`
`
3.1 Basic Concepts of Clustering
3.2 Partitioning Methods
3.3 Hierarchical Methods
3.4 Density-Based Methods
3.5 Model-Based Methods
3.6 Clustering High-Dimensional Data
3.7 Outlier Analysis
3.7.1 Definition
3.7.2 Statistical-Based Methods
3.7.3 Distance-Based Methods
3.7.4 Density-Based Local Methods
3.7.5 Deviation-Based Methods
3.7.1 Definition
`
Outliers: data objects that do not comply with the general
behavior or model of the data
`
Outlier detection or analysis is referred to as Outlier Mining
`
Outlier mining has different applications
Fraud detection
Detecting unusual usage of telecommunication services
Identifying the spending behavior of costumers with extremely low
or extremely high incomes
Finding unusual responses to various medical treatments
Etc.
Outlier Mining
`
Given a set if n data objects and k expected number of outliers
`
Find the top k objects that are considerably
`
Dissimilar
Exceptional
Inconsistent with respect to the remaining data
The outlier mining problem can be seen as two sub-problems
1) Define what data can be considered as inconsistent in a given
data set
2) Find an efficient method to mine the outliers so defined
`
Data visualization methods are weak in detecting data with
many categorical attributes or data of high dimensionality
`
Investigate computer-based techniques to detect outliers
3.7.2 Statistical Distribution-Based Methods
`
Assume a distribution model for the given data set(e.g., Normal)
Identify outliers w. r. t the model using a discordancy test
How does it work?
`
Examine two hypothesis
`
`
`
working hypothesis
alternative hypothesis
A working hypothesis H is a statement that the entire data set of n
objects comes from an initial distribution model F that is:
H: oi∈F,
`
where i=1,2,…,n
The hypothesis H is retained if there is no statistically significant
evidence supporting its rejection
Discordancy Test
`
Verifies whether an object oi is significantly large(or small) in
relation to the distribution F
`
Principle
Choose a some statistic T for discordancy testing
Consider the value vi of an object oi
If significance probability SP(vi) is sufficiently small



`
oi is discordant
The working hypothesis is rejected
An alternative hypothesis ¬H which says that oi comes from a
another distribution model G is adopted
The result depends on the model F is chosen because oi may be
an outlier under one model and perfectly valid value under
another
Discordancy Test: Example
`
Let o1,…,on represent the data objects
`
Compute the sample mean µ and the standard deviation σ
`
If the an object oi is suspected to be an outlier
Compute the test statistic T
T=
| µ − oi |
σ
If T exceeds some critical value, then oi is an outlier
Discordancy Test: Example
`
Consider the following ordered data: 3.84, 4.26, 4.53, 4.60, 5.28,
5.29, 5.74, 5.86
`
Consider an additional sample P: 10 (it is suspected that this point
might be an outlier)
`
Compute µ and σ without the suspected outlier µ = 5.48, σ = 1.82
| 5.48 −10 |
T=
= 2.48
1.82
`
With n=9 and level of significance α=0.05, the critical value is
2.110
`
T>2.110, then there is an evidence that P is an outlier
Alternative Distributions
Inherent Alternative Distribution
`
The working hypothesis that all objects come from distribution F is
rejected
`
Alternative hypothesis assume that all objects come from another
distribution G
¬H: oi∈G,
`
`
`
where i=1,2,…,n
F and G: different distributions
F and G : the same distribution but with different parameters
Distribution G must have the potential to produce outliers (a
different mean, or dispersion, or a longer tail)
Alternative Distributions
Mixture Alternative Distribution
`
The discordant values are not outliers in F population but
contaminants from some other population G
`
The alternative hypothesis is
¬H: oi∈ (1-λ) F+ λG,
where i=1,2,…,n
Slippage Alternative Distribution
`
All objects (except a small number) are from initial model F, with
its given parameters
`
The remaining objects are from a modified version of F in which
the parameters have been shifted
Characteristics of Statistical-Based Methods
`
Tests are for single attributes
`
Need to find outliers in multidimensional space
`
Statistical approaches require knowledge about parameters of
the data set
`
Statistical methods do not guarantee that all outliers will be found
No specific test was developed
The distribution cannot be adequately modeled with any
standard distribution
3.7.2 Distance-Based Methods
`
Generalize the test-based techniques
`
Distance-based outliers are those objects that do not have
“enough” neighbors
`
Formally
`
Define DB(pct, dmin)-outlier: a distance based outlier with
parameters pct and dmin
An object o is DB(pct, dmin)-outlier if at least a fraction pct of the
objects lie at a distance greater than dmin from o
Avoids excessive computation related to fitting the observed
data into some standard distribution and selecting discordancy
tests
Distance-Based Algorithms
Index-based algorithms
`
Use multidimensional indexing structures such as R-trees or kd trees to search for neighbors of each object o
Distance-Based Algorithms
`
Find neighbors of object o within a
radius dmin
`
M is the maximum number of objects
within the dmin-neighborhood of an
outlier
`
Once M+1 objects of object o are
found, then o is not an outlier
`
Complexity of O(n2k)
`
N: number of objects
K: dimensionality
Complexity is in search time. Building
the index can be computationally
very expensive
Distance-Based Algorithms
`
Cell-based algorithms
The data space is partitioned into cells with a side length equal to
dmin
2 k
`
dmin: radius around objects
K: dimensionality
`
Each cell has two layers surrounding it
`
First layer is 1-cell thick
Second layer is 2 k − 1 thick, rounded up to the closest integer
Distance-Based Algorithms
Cell-based algorithms
`
Count outliers on a cell-by-cell rather than object-by-object
basis
`
For a given cell, the algorithm accumulates three counts
`
The number of objects on the cell C
The number of objects in the cell and the first layer C+1
The number of objects In the cell and the second layer C+2
How to determine outliers with these counts?
Distance-Based Algorithms
Cell-based algorithms
`
Assume M to be a threshold used to detect outliers
`
An object o is considered as an outlier if C+1 <M, else all the
objects in the cell are considered as non outliers
`
If C+2 <M, all the objects in the cell are considered outliers
`
If C+2 >M, it is possible that some objects in the cell are outliers
do object-by-object processing to detect outliers
only objects that have less than M objects in their dminneighborhood are outliers
the dmin-neighborhood consist of the object’s cell, all of its first
layer and some of its second layer
Characteristics of Distance-Based Methods
`
Avoid O(n2) computational complexity
`
Its complexity is O(ck+n)
c is a constant depending on the number of cells
k the dimensionality
n number of objects
`
Developed for memory-resident data sets
`
Requires the user to set both dmin and pct
`
Finding suitable settings for these parameters can involve much
trial and error
3.7.3 Density-Based Methods
`
Statistical and distance-based methods depend on the overall
“global” distribution of data
`
Data are usually not uniformly distributed
`
Data can have different density distributions
C1
C2
o2
o1
Density-Based Methods
`
Define Local Outliers
`
Does not consider being an outlier as a binary property
`
An object is a local outlier if it is outlying relative to its local
neighborhood (w. r. t the density of the neighborhood )
Asses the degree to which an object is an outlier
The degree of the “outlierness” is computed as the Local Outlier
Factor(LOF) of an object
The degree depends on how isolated the object is with respect to
the surrounding neighborhood
Detect global and local outliers
Density-Based Methods
`
To define the local outlier factor of an object, the following
concepts should be introduced
K-distance
K-distance neighborhood
Reachability distance
Local reachability distance
K-distance & K-distance neighborhood
`
`
The k-distance of an object p is the maximal distance that p gets
from its k-nearest neighbors
Denoted k-distance(p)
How k is determined?
LOF method sets k to the parameter MinPts used in the densitybased clustering (e.g., Minpts=4) [MinPts-distance]
p
K-distance neighborhood of an object p contains the MinPtsnearest neighbors of p
Denoted Nk-distance(P) or Nk(P), also NMinPts
p
Reachability distance
`
The reachability-distance of an object q with respect to object o
(where o is within the MinPts-nearest neighbors of P) is denoted
reach_distMinPts(p,o)
p
`
Reach_distMinPts (p,o)=max{MinPts_distance(o), d(p,o)}
`
If p is far away from o, the reachability distance between the two
is simply their actual distance
`
If they are close, then the actual distance is replaced by the
MinPts_distance of o
Local Outlier Factor (LOF)
`
The local reachability density of p is the inverse of the average
reachability density based on the MinPts-nearest neighbors of p
| N MinPts(P)|
lrdMinPts(p) =
∑o∈NMinPts(p)reach_distMinPts(p,o)
`
The local outlier factor (LOF) of p captures the degree to which
we call p an outlier
IrdMinPts (o)
∑o∈NMinPts(P) Ird (P)
MinPts
LOFMinPts(p) =
| N MinPts ( P) |
3.7.4 Deviation-Based Methods
`
Identify outliers by examining the main characteristics of objects
on a group
`
Objects that deviate from this description are outliers
`
The term deviation is used to refer to outliers
`
Two main methods
Sequential Exception Technique
OLAP Data Cube Technique
Summary of Chapter 3
`
A cluster is a collection of data objects that are similar within the
same cluster and dissimilar to the objects on other clusters
`
Clustering can be used as
`
a main task to gain insights about the data
a preprocessing step for other data mining algorithms
Several applications
Market segmentation
Pattern recognition
Biological studies
Spatial data analysis
Web document classification, etc.
Summary of Chapter 3
`
The quality of clustering can be assessed based on dissimilarity of
objects
`
Many techniques have been developed
Partitioning Methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Clustering high dimensional data
Constrained-based methods
Applications and Tools in Data Mining
Summary
1. Financial Data Analysis
`
Banks and Institutions offer a wise variety of banking services
Checking and savings accounts for business or individual
customers
Credit business, mortgage, and automobile loans
Investment services (mutual funds)
Insurance services and stock investment services
`
Financial data is relatively complete, reliable, and of high
quality
`
What to do with this data?
1. Financial Data Analysis
`
Design of data warehouses for multidimensional data analysis
and data mining
Construct data warehouses (data come from different sources)
Multidimensional Analysis: e.g., view the revenue changes by
month. By region, by sector, etc. along with some statistical
information such as the mean, the average, the maximum and
the minimum values, etc.
Characterization and class comparison
Outlier analysis
1. Financial Data Analysis
`
Loan Payment Prediction and costumer credit policy analysis
Attribute selection and attribute relevance ranking may help
indentifying important factors and eliminate irrelevant ones
Example of factors related to the risk of loan payment






Term of the loan
Debt ratio
Payment to income ratio
Customer level income
Education level
Residence region
The bank can adjust its decisions
according to the subset of factors selected (use classification)
2. Retail Industry
`
Collect huge amount of data on sales, customer
shopping history, goods transportation,
consumption and service, etc.
`
Many stores have web sites where you can buy
online. Some of them exist only online (e.g.,
Amazon)
`
Data mining helps to
Identify costumer buying behaviors
Discover customers shopping patterns and trends
Improve the quality of costumer service
Achieve better costumer satisfaction
Design more effective good transportation
Reduce the cost of business
2. Retail Industry
`
Design data warehouses
`
Multidimensional analysis
`
Analysis of the effectiveness of sales campaigns
`
Costumer retention
`
Advertisements, coupons, discounts, bonuses, etc
Comparing transactions that contain sales items
during and after the campaign
Analyze the change in costumers behaviors
Product Recommendation
Mining association rules
Display associative information to promote sales
3. Telecommunication Industry
`
Many different ways of communicating
`
Fax, cellular phone, Internet messenger, images, email, computer and Web data transmission, etc.
Great demand of data mining to help
Understanding the business involved
Indentifying telecommunication patterns
Catching fraudulent activities
Making better use of resources
Improve the quality of service
3. Telecommunication Industry
`
Multidimensional analysis (several attributes)
`
Several features: Calling time, Duration, Location of
caller, Location of callee, Type of call, etc.
Compare data traffic, system workload, resource
usage, user group behavior, and profit
Fraudulent Pattern Analysis
Identify potential fraudulent users
Detect attempts to gain fraudulent entry to costumer
accounts
Discover unusual patterns (outlier analysis)
4. Many Other Applications
`
Biological Data Analysis
`
Web Mining
`
E.g., identification and analysis of human genomes
and other species
E.g., explore linkage between web pages to compute
authority scores (Page Rank Algorithm)
Intrusion detection
Detect any action that threaten file integrity,
confidentiality, or availability of a network resource
How to Choose a Data Mining System (Tool)?
`
Do data mining system share the same well defined operations
and a standard query language?
`
No
`
Many commercial data mining system have a little in common
`
Different functionalities
Different methodology
Different data sets
You need to carefully choose the data mining system that is
appropriate for your task
How to Choose a Data Mining System (Tool)?
`
Data Types
`
Available systems handle formatted record-based, relational-like
data with numerical, and nominal attributes
That data could be on the form of ASCII text, relational databases,
or data warehouse data
It is important to check which kind of data the system you are
choosing can handle
Operating System
A data mining system may run only on one operating system
The most popular operating systems that host data mining tools
are UNIX/LINUX and Microsoft Windows
Large industry data mining systems adopt client-server
architecture
How to Choose a Data Mining System (Tool)?
`
Data Sources
`
Data formats
Some systems work only with ASCII test files, whereas many other
work with databases
It is important that the data mining system supports ODBC
connections (Open Database Connectivity)
Data Mining functions and Methodologies
Some systems provide only one data mining function(e.g.,
classification). Other system support many functions
For a given data mining function (e.g., classification), some
systems support only one method. Other systems may support
many methods (k-nearest neighbor, naive Bayesian, etc.)
Data mining system should provide default settings for non experts
How to Choose a Data Mining System (Tool)?
`
Coupling data mining with databases(data warehouse) systems
No Coupling



Loose coupling



Efficient implementation of few essential data mining primitives
(sorting, indexing, histogram analysis) is provided by the DB/DW
Tight coupling


`
A DM system use some facilities of a DB/DW system
Fetch data from data repositories managed by a DB/DW
Store results in a file or in the DB/DW
Semi-tight coupling

A DM system will not use any function of a DB/DW system
Fetch data from particular resource (file)
Process data and then store results in a file
A DM system is smoothly integrated into the DB/DW
Data mining queries are optimized
Tight coupling is highly desirable because it facilitates
implementations and provide high system performance
How to Choose a Data Mining System (Tool)?
`
Scalability
`
Visualization
`
Query execution time should increase linearly with the number of
dimensions
“A picture is worth a thousand words”
The quality and the flexibility of visualization tools may strongly
influence usability, interpretability and attractiveness of the system
Data Mining Query Language and Graphical user Interface
High quality user interface
It is not common to have a query language in a DM system
Examples of Commercial Data Mining Tools
Database system and graphics vendors
`
Intelligent Miner (IBM)
`
Microsoft SQL Server 2005
`
MineSet (Purple Insight)
`
Oracle Data Mining (ODM)
Examples of Commercial Data Mining Tools
Vendors of statistical analysis or data mining software
`
Clementine (SPSS)
`
Enterprise Miner (SAS Institute)
`
Insightful Miner (Insightful Inc.)
Examples of Commercial Data Mining Tools
Machine learning community
`
CART (Salford Systems)
`
See5 and C5.0 (RuleQuest)
`
Weka developed at the university Waikato (open source)
End of The Data Mining Course
Questions? Suggestions?