Data Mining with DB
14
1
Introduction to Data Mining

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Classification of data mining systems

Major issues in data mining
2
Necessity Is the Mother of Invention

Data explosion problem
 Automated data collection tools and mature database technology
lead to tremendous amounts of data accumulated and/or to be
analyzed in databases, data warehouses, and other information
repositories

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining
 Data warehousing and on-line analytical processing
 Mining interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
3
Evolution of Database Technology

1960s:
 Data collection, database creation, IMS and network DBMS

1970s:
 Relational data model, relational DBMS implementation

1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases

2000s:
 Stream data management and mining
 Data mining with a variety of applications
 Web technology and global information systems
4
What Is Data Mining?

Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
 Data mining: a misnomer?

Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.

Watch out: Is everything “data mining”?
 (Deductive) query processing.
 Expert systems or small ML/statistical programs
5
Why Data Mining?—Potential Applications

Data analysis and decision support
 Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
 Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
 Fraud detection and detection of unusual patterns (outliers)

Other Applications
 Text mining (news group, email, documents) and Web mining
 Stream data mining
 DNA and bio-data analysis
6
Market Analysis and Management

 Where does the data come from?
• Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
 Target marketing
• Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
 Cross-market analysis
• Associations/co-relations between product sales, & prediction based on such association
 Customer profiling
• What types of customers buy what products (clustering or classification)
 Customer requirement analysis
• Identifying the best products for different customers
• Predict what factors will attract new customers
 Provision of summary information
• Multidimensional summary reports
• Statistical summary information (data central tendency and variation)
7
Corporate Analysis & Risk Management

Finance planning and asset evaluation
 cash flow analysis and prediction
 contingent claim analysis to evaluate assets
 cross-sectional and time series analysis (financial-ratio,
trend analysis, etc.)

Resource planning
 summarize and compare the resources and spending

Competition
 monitor competitors and market directions
 group customers into classes and develop a class-based pricing procedure
 set pricing strategy in a highly competitive market
8
Fraud Detection & Mining Unusual Patterns

Approaches: Clustering & model construction for frauds, outlier analysis

Applications: Health care, retail, credit card service, telecomm.
 Auto insurance: ring of collisions
 Money laundering: suspicious monetary transactions
 Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
 Telecommunications: phone-call fraud
• Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
 Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest employees
 Anti-terrorism
9
Other Applications

Sports
 IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and the Miami Heat

Astronomy
 JPL and the Palomar Observatory discovered 22 quasars with the
help of data mining

Internet Web Surf-Aid
 IBM Surf-Aid applies data mining algorithms to Web access logs
for market-related pages to discover customer preference and
behavior pages, analyzing effectiveness of Web marketing,
improving Web site organization, etc.
10
Data Mining: A KDD Process
 Data mining—core of the knowledge discovery process
[Figure: the KDD process flow — Databases → Data Cleaning & Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
11
Steps of a KDD Process

 Learning the application domain
• relevant prior knowledge and goals of the application
 Creating a target data set: data selection
 Data cleaning and preprocessing (may take 60% of the effort!)
 Data reduction and transformation
• Find useful features, dimensionality/variable reduction, invariant representation
 Choosing functions of data mining
• summarization, classification, regression, association, clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
• visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge
12
Data Mining and Business Intelligence
[Figure: pyramid of increasing potential to support business decisions, from bottom to top, with the roles End User, Business Analyst, Data Analyst, and DBA attached to the levels they serve]
 Data Sources: paper, files, information providers, database systems, OLTP
 Data Warehouses / Data Marts: OLAP, MDA (DBA)
 Data Exploration: statistical analysis, querying and reporting
 Data Mining: information discovery (Data Analyst)
 Data Presentation: visualization techniques (Business Analyst)
 Making Decisions (End User)
13
Architecture: Typical Data Mining System
[Figure: typical data mining system architecture, from top to bottom]
 Graphical user interface
 Pattern evaluation
 Data mining engine (both supported by a knowledge base)
 Database or data warehouse server
 Data cleaning, data integration & filtering
 Databases and data warehouse
14
Data Mining: On What Kinds of Data?




Relational database
Data warehouse
Transactional database
Advanced database and information repository
 Object-relational database
 Spatial and temporal data
 Time-series data
 Stream data
 Multimedia database
 Heterogeneous and legacy database
 Text databases & WWW
15
Data Mining Functionalities

Concept description: Characterization and discrimination
 Generalize, summarize, and contrast data characteristics, e.g., dry
vs. wet regions

Association (correlation and causality)
 Diaper → Beer [support = 0.5%, confidence = 75%]

Classification and Prediction
 Construct models (functions) that describe and distinguish classes
or concepts for future prediction
• E.g., classify countries based on climate, or classify cars based on gas
mileage
 Presentation: decision-tree, classification rule, neural network
 Predict some unknown or missing numerical values
16
Data Mining Functionalities (2)




Cluster analysis
 Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
 Maximizing intra-class similarity & minimizing interclass
similarity
Outlier analysis
 Outlier: a data object that does not comply with the general
behavior of the data
 Noise or exception? No! useful in fraud detection, rare events
analysis
Trend and evolution analysis
 Trend and deviation: regression analysis
 Sequential pattern mining, periodicity analysis
 Similarity-based analysis
Other pattern-directed or statistical analyses
17
Are All the “Discovered” Patterns Interesting?

Data mining may generate thousands of patterns: Not all of them are
interesting
 Suggested approach: Human-centered, query-based, focused mining

Interestingness measures
 A pattern is interesting if it is easily understood by humans, valid on
new or test data with some degree of certainty, potentially useful, novel,
or validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures
 Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
 Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
18
Can We Find All and Only Interesting Patterns?

Find all the interesting patterns: Completeness
 Can a data mining system find all the interesting patterns?
 Heuristic vs. exhaustive search
 Association vs. classification vs. clustering

Search for only interesting patterns: An optimization problem
 Can a data mining system find only the interesting patterns?
 Approaches
• First generate all the patterns and then filter out the uninteresting ones.
• Generate only the interesting patterns—mining query optimization
19
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of database systems, machine learning, algorithms, statistics, visualization, and other disciplines]
20
Data Mining: Classification Schemes

General functionality
 Descriptive data mining
 Predictive data mining

Different views, different classifications
 Kinds of data to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques utilized
 Kinds of applications adapted
21
Multi-Dimensional View of Data Mining

Data to be mined
 Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW

Knowledge to be mined
 Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels

Techniques utilized
 Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, etc.

Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, Web mining, etc.
22
OLAP Mining: Integration of Data Mining and Data Warehousing

Data mining systems, DBMS, Data warehouse systems
coupling
 No coupling, loose-coupling, semi-tight-coupling, tight-coupling

On-line analytical mining (OLAM)
 integration of mining and OLAP technologies

Interactive mining of multi-level knowledge
 Necessity of mining knowledge and patterns at different levels of
abstraction by drilling/rolling, pivoting, slicing/dicing, etc.

Integration of multiple mining functions
 Characterized classification, first clustering and then association
23
An OLAM Architecture
[Figure: four-layer OLAM architecture, with mining queries in and mining results out at the top]
 Layer 4 (user interface): user GUI API
 Layer 3 (OLAP/OLAM): an OLAM engine and an OLAP engine side by side, accessed through a data cube API
 Layer 2 (MDDB): multidimensional database with metadata
 Layer 1 (data repository): databases and a data warehouse, populated via data cleaning, data integration, and filtering (database API)
24
Major Issues in Data Mining

Mining methodology
 Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing knowledge: knowledge fusion

User interaction
 Data mining query languages and ad-hoc mining
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts
 Domain-specific data mining & invisible data mining
 Protection of data security, integrity, and privacy
25
Summary

Data mining: discovering interesting patterns from large amounts of
data

A natural evolution of database technology, in great demand, with
wide applications

A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation

Mining can be performed in a variety of information repositories

Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.

Data mining systems and architectures

Major issues in data mining
26
Chapter 2: Data Warehousing and
OLAP Technology for Data
Mining

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

Further development of data cube technology

From data warehousing to data mining
27
What Is a Data Warehouse?

Defined in many different ways, but not rigorously.
 A decision support database that is maintained separately from the
organization’s operational database
 Support information processing by providing a solid platform of
consolidated, historical data for analysis.

“A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon

Data warehousing:
 The process of constructing and using data warehouses
28
Data Warehouse—Subject-Oriented

Organized around major subjects, such as customer,
product, sales.

Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.

Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process.
29
Data Warehouse—Integrated

Constructed by integrating multiple, heterogeneous
data sources
 relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are
applied.
 Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
 When data is moved to the warehouse, it is converted.
30
Data Warehouse—Time Variant

The time horizon for the data warehouse is significantly
longer than that of operational systems.
 Operational database: current value data.
 Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)

Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not contain “time
element”.
31
Data Warehouse—Non-Volatile

A physically separate store of data transformed from the
operational environment.

Operational update of data does not occur in the data
warehouse environment.
 Does not require transaction processing, recovery, and
concurrency control mechanisms
 Requires only two operations in data accessing:
• initial loading of data and access of data.
32
Data Warehouse vs. Heterogeneous
DBMS

Traditional heterogeneous DB integration:
 Build wrappers/mediators on top of heterogeneous databases
 Query driven approach
• When a query is posed to a client site, a meta-dictionary is used to
translate the query into queries appropriate for individual
heterogeneous sites involved, and the results are integrated into a
global answer set
• Complex information filtering, compete for resources

Data warehouse: update-driven, high performance
 Information from heterogeneous sources is integrated in advance and
stored in warehouses for direct query and analysis
33
Data Warehouse vs. Operational DBMS

OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.

OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making

Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
34
OLTP vs. OLAP
Feature               OLTP                                  OLAP
users                 clerk, IT professional                knowledge worker
function              day-to-day operations                 decision support
DB design             application-oriented                  subject-oriented
data                  current, up-to-date, detailed,        historical, summarized, multidimensional,
                      flat relational, isolated             integrated, consolidated
usage                 repetitive                            ad-hoc
access                read/write, index/hash on prim. key   lots of scans
unit of work          short, simple transaction             complex query
# records accessed    tens                                  millions
# users               thousands                             hundreds
DB size               100 MB to GB                          100 GB to TB
metric                transaction throughput                query throughput, response time
35
Why Separate Data Warehouse?

High performance for both systems
 DBMS— tuned for OLTP: access methods, indexing, concurrency control,
recovery
 Warehouse—tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation.

Different functions and different data:
 missing data: Decision support requires historical data which operational
DBs do not typically maintain
 data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
 data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
36
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

Further development of data cube technology

From data warehousing to data mining
37
From Tables and Spreadsheets to Data
Cubes

A data warehouse is based on a multidimensional data model which
views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or time(day,
week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to each of
the related dimension tables

In data warehousing literature, an n-D base cube is called a base
cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
38
Cube: A Lattice of Cuboids
[Figure: lattice of cuboids over the dimensions time, item, location, supplier]
 0-D (apex) cuboid: all
 1-D cuboids: time; item; location; supplier
 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
 4-D (base) cuboid: time, item, location, supplier
39
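The size of the lattice follows directly from the definition: with one level per dimension, an n-D base cuboid induces 2^n cuboids, one per subset of the dimensions. As a minimal illustrative sketch (Python; not part of the original slides), the lattice in the figure can be enumerated:

```python
# Sketch: enumerate the lattice of cuboids for the 4-D cube above.
# Each cuboid is one subset of the dimension set, so there are 2^4 = 16.
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

for k in range(len(dimensions) + 1):          # 0-D (apex) up to 4-D (base)
    for cuboid in combinations(dimensions, k):
        label = ",".join(cuboid) if cuboid else "all"
        print(f"{k}-D cuboid: {label}")
```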
Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to a set of
dimension tables
 Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
 Fact constellations: Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore called galaxy schema or
fact constellation
40
Example of Star Schema
[Figure: star schema for sales]
 time dimension: time_key, day, day_of_the_week, month, quarter, year
 item dimension: item_key, item_name, brand, type, supplier_type
 branch dimension: branch_key, branch_name, branch_type
 location dimension: location_key, street, city, state_or_province, country
 Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
41
Example of Snowflake Schema
[Figure: snowflake schema for sales]
 time dimension: time_key, day, day_of_the_week, month, quarter, year
 item dimension: item_key, item_name, brand, type, supplier_key, normalized into supplier(supplier_key, supplier_type)
 branch dimension: branch_key, branch_name, branch_type
 location dimension: location_key, street, city_key, normalized into city(city_key, city, state_or_province, country)
 Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
42
Example of Fact Constellation
[Figure: fact constellation with dimension tables shared between a sales fact table and a shipping fact table]
 time dimension: time_key, day, day_of_the_week, month, quarter, year
 item dimension: item_key, item_name, brand, type, supplier_type
 branch dimension: branch_key, branch_name, branch_type
 location dimension: location_key, street, city, province_or_state, country
 shipper dimension: shipper_key, shipper_name, location_key, shipper_type
 Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
 Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
43
A Data Mining Query Language:
DMQL

 Cube definition (fact table):
define cube <cube_name> [<dimension_list>]: <measure_list>
 Dimension definition (dimension table):
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
 Special case (shared dimension tables):
• First time, defined as part of a cube definition
• Thereafter: define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
44
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
45
Defining a Snowflake Schema in
DMQL
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))
46
Defining a Fact Constellation in
DMQL
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
47
Measures: Three Categories

distributive: if the result derived by applying the function
to n aggregate values is the same as that derived by
applying the function on all the data without partitioning.
• E.g., count(), sum(), min(), max().

algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate
function.
• E.g., avg(), min_N(), standard_deviation().

holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
• E.g., median(), mode(), rank().
48
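To make the distinction concrete, here is a small hedged sketch (Python, toy data; not from the slides): sum() and count() are distributive, so per-partition results combine directly; avg() is algebraic, derived from the bounded pair (sum, count) per partition; median() is holistic, since partition medians alone cannot be combined:

```python
# Two partitions of a toy data set.
partitions = [[3, 5, 8], [1, 9, 2, 6]]

total_sum = sum(sum(p) for p in partitions)      # distributive
total_count = sum(len(p) for p in partitions)    # distributive
overall_avg = total_sum / total_count            # algebraic: M = 2 sub-aggregates

# Holistic counterexample: the medians of the partitions (5 and 4) are not,
# by themselves, enough to recover the true median of all the data.
combined = sorted(x for p in partitions for x in p)
true_median = combined[len(combined) // 2]
print(overall_avg, true_median)
```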
A Concept Hierarchy: Dimension (location)
[Figure: concept hierarchy tree for the location dimension]
 all: all
 region: Europe, ..., North_America
 country: Germany, ..., Spain; Canada, ..., Mexico
 city: Frankfurt, ...; Vancouver, ..., Toronto
 office: L. Chan, ..., M. Wind
49
View of Warehouses and Hierarchies
Specification of hierarchies

Schema hierarchy
day < {month < quarter;
week} < year

Set_grouping hierarchy
{1..10} < inexpensive
50
Multidimensional Data

 Sales volume as a function of product, month, and region
 Dimensions: Product, Location, Time
 Hierarchical summarization paths:
• Industry → Category → Product
• Region → Country → City → Office
• Year → Quarter → Month or Week → Day
51
A Sample Data Cube
[Figure: a 3-D sales data cube with dimensions Date (1Qtr to 4Qtr, plus sum), Product (TV, PC, VCR, plus sum), and Country (U.S.A, Canada, Mexico, plus sum); one aggregate cell holds the total annual sales of TVs in the U.S.A.]
52
Cuboids Corresponding to the Cube
[Figure: lattice of cuboids over the dimensions product, date, country]
 0-D (apex) cuboid: all
 1-D cuboids: product; date; country
 2-D cuboids: product,date; product,country; date,country
 3-D (base) cuboid: product, date, country
53
Browsing a Data Cube
 Visualization
 OLAP capabilities
 Interactive manipulation
54
Typical OLAP Operations

Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up
 from higher level summary to lower level summary or detailed data, or
introducing new dimensions

Slice and dice:
 project and select

Pivot (rotate):
 reorient the cube, visualization, 3D to a series of 2D planes.

Other operations
 drill across: involving (across) more than one fact table
 drill through: through the bottom level of the cube to its back-end relational
tables (using SQL)
55
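These operations can be emulated on a flat fact table. The following is a rough sketch (hypothetical data, pandas assumed available; not how an OLAP server is actually implemented):

```python
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["1Qtr", "1Qtr", "2Qtr", "2Qtr"],
    "country": ["U.S.A", "Canada", "U.S.A", "Canada"],
    "product": ["TV", "TV", "PC", "TV"],
    "dollars_sold": [100.0, 80.0, 150.0, 60.0],
})

# Roll up: climb from (quarter, country, product) to (country, product)
rollup = sales.groupby(["country", "product"])["dollars_sold"].sum()

# Slice: select along one dimension
tv_slice = sales[sales["product"] == "TV"]

# Dice: select along two or more dimensions
dice = sales[(sales["product"] == "TV") & (sales["quarter"] == "1Qtr")]

# Pivot: reorient, e.g., countries as rows and quarters as columns
pivot = sales.pivot_table(index="country", columns="quarter",
                          values="dollars_sold", aggfunc="sum")
print(rollup, tv_slice, dice, pivot, sep="\n\n")
```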
A Star-Net Query Model
[Figure: star-net query model, with radial lines from a central point, one per dimension; each circle on a line is called a footprint and represents one abstraction level]
 Customer: CUSTOMER ORDERS, CONTRACTS, ORDER
 Shipping Method: AIR-EXPRESS, TRUCK
 Time: ANNUALLY, QTRLY, DAILY
 Product: PRODUCT LINE, PRODUCT GROUP, PRODUCT ITEM
 Location: REGION, COUNTRY, DISTRICT, CITY
 Organization: DIVISION, SALES PERSON
 Promotion
56
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

Further development of data cube technology

From data warehousing to data mining
57
Design of a Data Warehouse: A Business
Analysis Framework

Four views regarding the design of a data warehouse
 Top-down view
• allows selection of the relevant information necessary for the data
warehouse
 Data source view
• exposes the information being captured, stored, and managed by
operational systems
 Data warehouse view
• consists of fact tables and dimension tables
 Business query view
• sees the perspectives of data in the warehouse from the view of the end user
58
Data Warehouse Design Process

Top-down, bottom-up approaches or a combination of both
 Top-down: Starts with overall design and planning (mature)
 Bottom-up: Starts with experiments and prototypes (rapid)

From software engineering point of view
 Waterfall: structured and systematic analysis at each step before
proceeding to the next
 Spiral: rapid generation of increasingly functional systems, short turnaround time, quick turnaround

Typical data warehouse design process
 Choose a business process to model, e.g., orders, invoices, etc.
 Choose the grain (atomic level of data) of the business process
 Choose the dimensions that will apply to each fact table record
 Choose the measure that will populate each fact table record
59
Multi-Tiered Architecture
[Figure: multi-tiered data warehouse architecture, left to right]
 Data sources: operational DBs and other sources
 Extract, transform, load, refresh, coordinated by a monitor & integrator with metadata
 Data storage: data warehouse and data marts
 OLAP engine: OLAP server
 Front-end tools: analysis, query, reports, data mining
60
Density Concepts

Core object (CO)–object with at least ‘M’ objects
within a radius ‘E-neighborhood’

Directly density reachable (DDR)–x is CO, y is in x’s
‘E-neighborhood’

Density reachable–there exists a chain of DDR objects
from x to y

 Density-based cluster–density-connected objects, maximal w.r.t. reachability
61
Density-Based Clustering: Background

Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-neighbourhood of
that point

 NEps(p): {q belongs to D | dist(p, q) <= Eps}
 Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if
• 1) p belongs to NEps(q)
• 2) core point condition: |NEps(q)| >= MinPts
[Figure: p inside the Eps-neighborhood of core point q; MinPts = 5, Eps = 1 cm]
62
Density-Based Clustering: Background
(II)
 Density-reachable:
• A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
 Density-connected:
• A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
[Figure: a chain q = p1 → ... → p illustrating density-reachability, and points p, q both density-reachable from o illustrating density-connectivity]
63
DBSCAN: Density Based Spatial
Clustering of Applications with Noise


Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier (noise) points; Eps = 1 cm, MinPts = 5]
64
DBSCAN: The Algorithm
 Arbitrarily select a point p
 Retrieve all points density-reachable from p wrt Eps and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-reachable from p and
DBSCAN visits the next point of the database.
 Continue the process until all of the points have been processed (a sketch follows this slide).
65
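A minimal, naive rendering of these steps (Python, O(n^2) neighborhood scans, 2-D points; real implementations use spatial indexes) might look like this. It is a sketch of the published algorithm, not a reference implementation:

```python
from math import dist

def region_query(points, i, eps):
    # All indices within eps of point i (includes i itself, at distance 0)
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None = unvisited, -1 = noise
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # not a core point: noise for now
            labels[i] = -1
            continue
        labels[i] = cluster_id             # i is a core point: grow a cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id     # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:    # j is also a core point
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))   # two clusters and one noise point
```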
OPTICS: A Cluster-Ordering Method
(1999)

OPTICS: Ordering Points To Identify the Clustering
Structure
 Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
 Produces a special order of the database wrt its density-based
clustering structure
 This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
 Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
 Can be represented graphically or using visualization techniques
66
OPTICS: Some Extension from
DBSCAN

Index-based:
• k = number of dimensions, N = 20, p = 75%, M = N(1-p) = 5
• Complexity: O(kN^2)
 Core distance: the smallest distance that makes a point a core object (undefined if the point is not a core object wrt. Eps, MinPts)
 Reachability distance of p wrt. o: max(core-distance(o), d(o, p))
[Figure: core distance of o, and reachability distances r(p1, o) = 2.8 cm, r(p2, o) = 4 cm; MinPts = 5, Eps = 3 cm]
67
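As a hedged sketch of these two definitions (Python, naive linear scans, 2-D points; the function names are mine, not from the paper):

```python
from math import dist

def core_distance(points, o, eps, min_pts):
    """Distance from point o to its min_pts-th nearest neighbor, or None
    (undefined) if o is not a core object wrt. eps, min_pts."""
    dists = sorted(dist(points[o], q) for q in points)  # o itself is at 0
    neighborhood = [d for d in dists if d <= eps]
    if len(neighborhood) < min_pts:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, p, o, eps, min_pts):
    """max(core-distance(o), d(o, p)); undefined if o is not a core object."""
    cd = core_distance(points, o, eps, min_pts)
    if cd is None:
        return None
    return max(cd, dist(points[o], points[p]))
```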
[Figure: reachability plot, showing reachability-distance (undefined for the first object) against the cluster order of the objects; valleys in the plot correspond to clusters]
68
Density-Based Cluster analysis:
OPTICS & Its Applications
69
DENCLUE: Using density
functions

DENsity-based CLUstEring by Hinneburg & Keim
(KDD’98)

Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
 Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
 But needs a large number of parameters
70
Denclue: Technical Essence

Uses grid cells but only keeps information about
grid cells that do actually contain data points and
manages these cells in a tree-based access structure.

Influence function: describes the impact of a data
point within its neighborhood.

Overall density of the data space can be calculated
as the sum of the influence function of all data
points.

Clusters can be determined mathematically by
identifying density attractors.

 Density attractors are local maxima of the overall density function.
71
Gradient: The Steepness of a Slope
 Example: Gaussian influence function, the resulting overall density function, and its gradient:

$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$

$f^D_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$

$\nabla f^D_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
72
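The formulas translate directly into code. Below is an illustrative sketch (Python, 1-D data, parameters chosen arbitrarily) of the Gaussian density function, its gradient, and a simple gradient ascent toward a density attractor:

```python
import math

def density(x, data, sigma):
    # f^D_Gaussian(x): sum of the Gaussian influences of all data points on x
    return sum(math.exp(-((x - xi) ** 2) / (2 * sigma ** 2)) for xi in data)

def gradient(x, data, sigma):
    # Sum over (xi - x) * influence(x, xi), as in the gradient formula above
    return sum((xi - x) * math.exp(-((x - xi) ** 2) / (2 * sigma ** 2))
               for xi in data)

def climb_to_attractor(x, data, sigma, step=0.1, iters=200):
    # Hill-climb along the gradient; the fixed point approximates a density
    # attractor, i.e., a local maximum of the overall density function
    for _ in range(iters):
        x += step * gradient(x, data, sigma)
    return x

data = [1.0, 1.2, 0.8, 5.0, 5.3]
print(climb_to_attractor(0.5, data, sigma=0.5))  # drifts toward the cluster near 1.0
```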
Density Attractor
73
Center-Defined and Arbitrary
74
Grid-Based Clustering Method

Using multi-resolution grid data structure

Several interesting methods
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
• A multi-resolution clustering approach using wavelet
method
 CLIQUE: Agrawal, et al. (SIGMOD’98)
75
STING: A Statistical Information
Grid Approach



Wang, Yang and Muntz (VLDB’97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different
levels of resolution
76
STING: A Statistical Information
Grid Approach (2)
 Each cell at a high level is partitioned into a number of smaller cells in
the next lower level
 Statistical info of each cell is calculated and stored beforehand and is
used to answer queries
 Parameters of higher-level cells can be easily calculated from parameters of lower-level cells (a sketch follows this slide)
• count, mean, s, min, max
• type of distribution—normal, uniform, etc.
 Use a top-down approach to answer spatial data queries
 Start from a pre-selected layer—typically with a small number of cells
 For each cell in the current level compute the confidence interval
77
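For the distributive parameters this bottom-up calculation is simple moment algebra. A small sketch (Python, hypothetical cell statistics; the standard deviation s can be merged the same way via per-cell second moments, omitted here for brevity):

```python
def merge_cells(children):
    # Derive a parent cell's parameters from its child cells' parameters,
    # with no rescan of the underlying points.
    # children: list of dicts with keys n, mean, min, max
    n = sum(c["n"] for c in children)
    mean = sum(c["n"] * c["mean"] for c in children) / n   # weighted mean
    lo = min(c["min"] for c in children)
    hi = max(c["max"] for c in children)
    return {"n": n, "mean": mean, "min": lo, "max": hi}

children = [
    {"n": 10, "mean": 2.0, "min": 0.5, "max": 4.0},
    {"n": 40, "mean": 3.0, "min": 1.0, "max": 6.0},
    {"n": 25, "mean": 2.4, "min": 0.2, "max": 5.5},
    {"n": 25, "mean": 3.6, "min": 1.5, "max": 7.0},
]
print(merge_cells(children))
```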
STING: A Statistical
Information Grid Approach (3)
 Remove the irrelevant cells from further consideration
 When finished examining the current layer, proceed to the next lower level
 Repeat this process until the bottom layer is reached
 Advantages:
• Query-independent, easy to parallelize, incremental update
• O(K), where K is the number of grid cells at the lowest level
 Disadvantages:
• All the cluster boundaries are either horizontal or vertical,
and no diagonal boundary is detected
78
WaveCluster (1998)

Sheikholeslami, Chatterjee, and Zhang (VLDB’98)

A multi-resolution clustering approach which applies
wavelet transform to the feature space
 A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.

Both grid-based and density-based

Input parameters:
 # of grid cells for each dimension
 the wavelet, and the # of applications of wavelet transform.
79
WaveCluster (1998)

 How to apply wavelet transform to find clusters
• Summarize the data by imposing a multidimensional grid structure onto the data space
• These multidimensional spatial data objects are represented in an n-dimensional feature space
• Apply wavelet transform on the feature space to find the dense regions in the feature space
• Apply wavelet transform multiple times, which results in clusters at different scales from fine to coarse
81
Wavelet Transform
 Decomposes a signal into different frequency sub-bands (can be applied to n-dimensional signals)
 Data are transformed to preserve relative distance between objects at different levels of resolution
 Allows natural clusters to become more distinguishable
82
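As a concrete, hedged illustration of the sub-band idea (Python; one level of the orthonormal Haar wavelet, which is only one of the wavelets a method like WaveCluster could use):

```python
import math

def haar_step(signal):
    # One decomposition level: a low-frequency (average) sub-band and a
    # high-frequency (detail) sub-band, each half the original length.
    assert len(signal) % 2 == 0
    averages = [(signal[i] + signal[i + 1]) / math.sqrt(2)
                for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / math.sqrt(2)
               for i in range(0, len(signal), 2)]
    return averages, details

sig = [2.0, 2.0, 8.0, 8.0, 1.0, 3.0, 0.0, 0.0]
avg, det = haar_step(sig)
print(avg)  # coarse approximation: where the "mass" of the signal sits
print(det)  # detail coefficients: zero where the signal is locally flat
```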
What Is Wavelet (2)?
83
Quantization
84
Transformation
85
WaveCluster (1998)

Why is wavelet transformation useful for clustering?
 Unsupervised clustering: it uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information on their boundaries
 Effective removal of outliers
 Multi-resolution
 Cost efficiency
Major features:
 Complexity O(N)
 Detects arbitrarily shaped clusters at different scales
 Not sensitive to noise, not sensitive to input order
 Only applicable to low-dimensional data
86
CLIQUE (Clustering In QUEst)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).

Automatically identifying subspaces of a high dimensional
data space that allow better clustering than original space

 CLIQUE can be considered as both density-based and grid-based
• It partitions each dimension into the same number of equal-length intervals
• It partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
• A cluster is a maximal set of connected dense units within a subspace
87
CLIQUE: The Major Steps

Partition the data space and find the number of points that lie inside each cell of the partition (a sketch follows this slide).

Identify the subspaces that contain clusters using the
Apriori principle

Identify clusters:
• Determine dense units in all subspaces of interest
• Determine connected dense units in all subspaces of interest

Generate minimal description for the clusters
 Determine maximal regions that cover a cluster of connected dense
units for each cluster
 Determination of minimal cover for each cluster
88
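A rough sketch of the first two steps (Python, hypothetical grid parameters and threshold; the real algorithm also prunes candidates by counting the points that actually fall in each candidate unit):

```python
from collections import Counter
from itertools import combinations

def dense_units_1d(points, dim, n_intervals, lo, hi, tau):
    # Partition dimension `dim` into equal-length intervals and keep the
    # units whose fraction of the points exceeds the density threshold tau.
    width = (hi - lo) / n_intervals
    counts = Counter(min(int((p[dim] - lo) / width), n_intervals - 1)
                     for p in points)
    return {u for u, c in counts.items() if c / len(points) > tau}

def candidate_units_2d(dense_by_dim):
    # Apriori principle: a dense 2-D unit must project onto dense 1-D units,
    # so candidates are joined from pairs of dense 1-D units.
    for d1, d2 in combinations(sorted(dense_by_dim), 2):
        for u1 in dense_by_dim[d1]:
            for u2 in dense_by_dim[d2]:
                yield (d1, u1), (d2, u2)

pts = [(25, 3.0), (26, 3.1), (27, 2.9), (60, 7.0), (61, 7.2)]
dense = {0: dense_units_1d(pts, 0, 10, 20, 70, 0.2),
         1: dense_units_1d(pts, 1, 10, 0, 10, 0.2)}
print(dense, list(candidate_units_2d(dense)))
```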
=3
30
40
Vacation
20
50
Salary
(10,000)
0 1 2 3 4 5 6 7
30
Vacation
(week)
0 1 2 3 4 5 6 7
age
60
20
30
40
50
age
60
50
age
89
Strength and Weakness of CLIQUE

Strength
 It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in those
subspaces
 It is insensitive to the order of records in input and does not
presume some canonical data distribution
 It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases

Weakness
 The accuracy of the clustering result may be degraded at
the expense of simplicity of the method
90
Model-Based Clustering Methods


Attempt to optimize the fit between the data and some
mathematical model
Statistical and AI approach
 Conceptual clustering
• A form of clustering in machine learning
• Produces a classification scheme for a set of unlabeled objects
• Finds characteristic description for each concept (class)
 COBWEB (Fisher’87)
• A popular and simple method of incremental conceptual learning
• Creates a hierarchical clustering in the form of a classification tree
• Each node refers to a concept and contains a probabilistic description of
that concept
91
COBWEB Clustering Method
A classification tree
92
More on Statistical-Based
Clustering

Limitations of COBWEB
 The assumption that the attributes are independent of each
other is often too strong because correlation may exist
 Not suitable for clustering large database data – skewed tree
and expensive probability distributions

CLASSIT
 an extension of COBWEB for incremental clustering of
continuous data
 suffers similar problems as COBWEB

AutoClass (Cheeseman and Stutz, 1996)
 Uses Bayesian statistical analysis to estimate the number of
clusters
 Popular in industry
93
Other Model-Based
Clustering Methods

Neural network approaches
 Represent each cluster as an exemplar, acting as a
“prototype” of the cluster
 New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure

Competitive learning
 Involves a hierarchical architecture of several units
(neurons)
 Neurons compete in a “winner-takes-all” fashion for the
object currently being presented
94
Model-Based Clustering Methods
95
Self-organizing feature maps
(SOMs)
 Clustering is also performed by having several units competing for the current object
 The unit whose weight vector is closest to the current object wins
 The winner and its neighbors learn by having their weights adjusted
 SOMs are believed to resemble processing that can occur in the brain
 Useful for visualizing high-dimensional data in 2- or 3-D space
96
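A minimal competitive-learning sketch of this winner-and-neighbors update (Python, toy 4x4 map, square neighborhood, fixed learning rate; real SOMs decay both the rate and the radius over time):

```python
import random

random.seed(0)
GRID = 4                                   # 4x4 map of units with 2-D weights
weights = [[[random.random(), random.random()] for _ in range(GRID)]
           for _ in range(GRID)]

def winner(x):
    # Competitive step: the unit whose weight vector is closest to x wins
    return min(((r, c) for r in range(GRID) for c in range(GRID)),
               key=lambda rc: sum((w - xi) ** 2
                                  for w, xi in zip(weights[rc[0]][rc[1]], x)))

def train_step(x, lr=0.2, radius=1):
    # The winner and its grid neighbors move toward the presented object
    wr, wc = winner(x)
    for r in range(GRID):
        for c in range(GRID):
            if abs(r - wr) <= radius and abs(c - wc) <= radius:
                weights[r][c] = [w + lr * (xi - w)
                                 for w, xi in zip(weights[r][c], x)]

for x in [(0.1, 0.1), (0.9, 0.9), (0.1, 0.9)] * 50:   # three toy clusters
    train_step(x)
print(winner((0.1, 0.1)), winner((0.9, 0.9)))  # distinct regions of the map
```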
What Is Outlier Discovery?

What are outliers?
 A set of objects considerably dissimilar from the remainder of the data
 Example: sports stars such as Michael Jordan and Wayne Gretzky
Problem
 Find the top n outlier points
Applications:
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis
97
Outlier Discovery: Statistical Approaches
Assume a model underlying distribution that generates the data set (e.g., normal distribution); a sketch of such a test follows this slide
Use discordancy tests depending on
 data distribution
 distribution parameter (e.g., mean, variance)
 number of expected outliers
Drawbacks
 most tests are for a single attribute
 in many cases, the data distribution may not be known
98
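A toy discordancy test under an assumed normal model (Python; the 2-sigma threshold and the data are arbitrary, and the test examines a single attribute, which is exactly the limitation noted above):

```python
import statistics

def discordant(values, k=2.0):
    # Flag values more than k standard deviations from the mean
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > k * sd]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 9.7, 10.3, 10.0, 9.9, 10.1, 25.0]
print(discordant(data))   # [25.0] under the normality assumption
```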
Outlier Discovery: Distance-Based Approach
Introduced to counter the main limitations imposed by statistical methods
 We need multi-dimensional analysis without knowing the data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers (a sketch of the nested-loop idea follows this slide)
 Index-based algorithm
 Nested-loop algorithm
 Cell-based algorithm
99
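The definition maps directly onto a naive nested-loop scan; here is a sketch (Python, O(n^2), 2-D points; the index-based and cell-based variants exist precisely to avoid this cost):

```python
from math import dist

def db_outliers(points, p, D):
    # O is a DB(p, D)-outlier if at least a fraction p of the other
    # objects lie at a distance greater than D from O.
    outliers = []
    for o in points:
        far = sum(1 for q in points if q is not o and dist(o, q) > D)
        if far / (len(points) - 1) >= p:
            outliers.append(o)
    return outliers

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(db_outliers(pts, p=0.9, D=5.0))   # [(10, 10)]
```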
Outlier Discovery: Deviation-Based Approach

Identifies outliers by examining the main
characteristics of objects in a group

Objects that “deviate” from this description are
considered outliers

sequential exception technique
 simulates the way in which humans can distinguish
unusual objects from among a series of supposedly like
objects

OLAP data cube technique
 uses data cubes to identify regions of anomalies in large
multidimensional data
100
Problems and Challenges

Considerable progress has been made in scalable
clustering methods
 Partitioning: k-means, k-medoids, CLARANS
 Hierarchical: BIRCH, CURE
 Density-based: DBSCAN, CLIQUE, OPTICS
 Grid-based: STING, WaveCluster
 Model-based: Autoclass, Denclue, Cobweb

Current clustering techniques do not address all the
requirements adequately

Constraint-based clustering analysis: Constraints exist in
data space (bridges and highways) or in user queries
101
Constraint-Based Clustering Analysis

 Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
102
Clustering With Obstacle Objects
[Figure: clustering results when not taking obstacles into account vs. taking obstacles into account]
103