國立雲林科技大學
National Yunlin University of Science and Technology
Pattern Discovery: A Data Driven
Approach to Decision Support
Advisor: Dr. Hsu
Graduate: Keng-Wei Chang
Authors: Andrew K. C. Wong and Yang Wang
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS-PART C:
APPLICATIONS AND REVIEWS, VOL. 33, NO. 1, FEBRUARY 2003
Intelligent Database Systems Lab
N.Y.U.S.T.
I.M.
Outline








Motivation
Objective
Introduction
Brief Description of Pattern Discovery
Data, Events, and Patterns
Pattern Discovery
Inferencing with Patterns for Decision Support
Summary and Discussion
Motivation

Decision support is increasingly targeted at large-scale, complicated systems and domains.
Objective


Provide the capability to process large amounts of data and efficiently extract useful knowledge from the data
Develop a fundamental framework for intelligent decision support by analyzing large amounts of mixed-mode data
1. Introduction
Machine intelligence could be used comfortably by decision makers if it can

1) Discover multiple patterns from a database without relying on prior knowledge
2) Cope with multiple and flexible decisions and objectives
3) Provide explicit patterns and associated rules for interpretation
4) Render a high-speed interactive mode for information and knowledge extraction
1. Introduction
Three related issues are also of concern to decision makers

1) Flexibility and versatility of the pattern discovery process;
2) Transparency of the supporting evidence;
3) Processing speed.
2. Brief Description of Pattern
Discovery

In the seventies

Quantitative basis of information measures and statistical patterns
This formed the early basis of our pattern discovery approach
Pattern recognition methods for discrete-valued and continuous data
2. Brief Description of Pattern
Discovery

In the late seventies and early eighties

The dimensionality was too large for pattern discovery
Database partitioning was proposed
In the nineties

The pattern recognition paradigm shifted from the variable level to the event level
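3. Data, Events, and Patterns

$X = \{X_1, \ldots, X_N\}$

$$d_i = \begin{cases} \mathbb{R}, & \text{if } X_i \text{ is continuous} \\ \{\alpha_i^1, \ldots, \alpha_i^{m_i}\}, & \text{if } X_i \text{ is discrete} \end{cases}$$

Each variable $X_i$ takes its values from its domain $d_i$

Generalized Event
Pattern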

3.1 Generalized Event

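The Borel σ-field on $\mathbb{R}^N$ is generated by the collection of all rectangles in $\mathbb{R}^N$, that is, products of intervals $I_i$ with endpoints $a_i$ and $b_i$, where $-\infty \le a_i \le b_i \le \infty$

Two advantages

geometric perspective
probability measure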
3.1 Generalized Event

Definition 1: Consider the sample space $\Omega^N$. A subset $H$ of $\Omega^N$ is called a hypercell if it has the form $H = I_1 \times I_2 \times \cdots \times I_N$
3.1 Generalized Event

Definition 2: An event in $\Omega^N$ is a hypercell.

Definition 3: The volume of an event is the hypervolume of the N-dimensional subspace occupied by the event.
Example: a data set $D = \{\omega\}$ drawn from the sample space $\Omega^N$
3.1 Generalized Event

Definition 4: The observed frequency, $o_E$, of an event $E$ in the sample space Ω is the number of data points that fall within the volume of $E$.
Example: if $\{X\} \subseteq D$ is the finite set of data points inside the volume of $E$, then $o_E = |\{X\}|$

Definition 5: The probability of an event $E$ is intuitively estimated by the proportion of data points contained in the event
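
Definitions 4 and 5 can be illustrated with a minimal sketch. It assumes an event is stored as one closed interval per continuous variable and one set of allowed values per discrete variable; the names `contains`, `observed_frequency`, and `estimated_probability`, and the toy data, are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set, Tuple, Union

# A sample maps variable names to values (float if continuous, str if discrete).
Sample = Dict[str, Union[float, str]]
# One side of a hypercell: a closed interval or a set of discrete values (assumed encoding).
Side = Union[Tuple[float, float], Set[str]]
Event = Dict[str, Side]


def contains(event: Event, sample: Sample) -> bool:
    """Return True if the sample lies inside every side of the hypercell."""
    for name, side in event.items():
        value = sample[name]
        if isinstance(side, tuple):
            low, high = side
            if not (low <= value <= high):
                return False
        elif value not in side:
            return False
    return True


def observed_frequency(event: Event, data: List[Sample]) -> int:
    """o_E: the number of data points inside the volume of E (Definition 4)."""
    return sum(contains(event, s) for s in data)


def estimated_probability(event: Event, data: List[Sample]) -> float:
    """Proportion of data points contained in E (Definition 5)."""
    return observed_frequency(event, data) / len(data)


data = [
    {"age": 25.0, "smoker": "no"},
    {"age": 41.0, "smoker": "yes"},
    {"age": 47.0, "smoker": "yes"},
    {"age": 63.0, "smoker": "no"},
]
event = {"age": (40.0, 50.0), "smoker": {"yes"}}
print(observed_frequency(event, data))     # 2
print(estimated_probability(event, data))  # 0.5
```

Mixing an interval and a value set in one event is what lets the same counting routine serve mixed-mode data.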
3.2 Pattern

Definition 6: Let Ω be the sample space and $g(\cdot)$ be a test statistic corresponding to a specified discovery criterion $c$.
A pattern is an event $E$ whose test statistic $g(E)$ satisfies the test condition against a threshold, where the threshold is the critical value of the statistical test for criterion $c$ at a significance level of α
3.2 Pattern

Definition 7: An event association is a significant
joint occurrence of low-dimensional events.

An N-dimensional event (N > 1) can be considered an event association composed of N one-dimensional events.
4. Pattern Discovery

Definition 8: Suppose we have a data set with
sample space Ω. Pattern Discovery is the search
for significant events (hypercells) in a compact
subspace of the sample space Ω demarcated by
the available data set D.

Pattern Discovery as Residual Analysis
Pattern Discovery as Optimization

4.1 Pattern Discovery as Residual
Analysis

Definition 9: The residual of an event $E$ against a pre-assumed model is defined as the difference between the actual occurrence of the event and its expected occurrence: $r_E = o_E - e_E$, where $o_E$ is the actual (observed) occurrence and $e_E$ is the expected occurrence
4.1 Pattern Discovery as Residual
Analysis

Definition 10: The standardized residual of event $E$ is defined as the ratio of its residual and the square root of its expectation: $z_E = (o_E - e_E) / \sqrt{e_E}$

Definition 11: The adjusted residual of event $E$ is defined as $d_E = z_E / \sqrt{v_E}$, where $v_E$ is the variance of $z_E$
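
The three residuals of Definitions 9 to 11 can be sketched as below. This is a minimal illustration, assuming the variance of the standardized residual is supplied by the caller (how it is estimated depends on the pre-assumed model and is not shown here); the function names and the numbers are hypothetical.

```python
import math


def residual(observed: float, expected: float) -> float:
    """Observed occurrence minus expected occurrence (Definition 9)."""
    return observed - expected


def standardized_residual(observed: float, expected: float) -> float:
    """Residual divided by the square root of the expectation (Definition 10)."""
    return residual(observed, expected) / math.sqrt(expected)


def adjusted_residual(observed: float, expected: float, variance: float) -> float:
    """Standardized residual divided by the square root of its variance (Definition 11).

    `variance` is an input here: estimating it is model-specific.
    """
    return standardized_residual(observed, expected) / math.sqrt(variance)


# Toy numbers: 30 points observed where the model expects 16.
print(residual(30, 16))                  # 14
print(standardized_residual(30, 16))     # 3.5
print(adjusted_residual(30, 16, 0.25))   # 7.0
```

Because the adjusted residual is scaled to unit variance, it can be compared against a critical value of the standard normal distribution when deciding whether an event is significant.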
4.1 Pattern Discovery as Residual
Analysis

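Two pre-assumed models

Uniform distribution (concentration-driven discovery): $e_E = M \cdot \mathrm{vol}(E) / V$, where $\mathrm{vol}(E)$ is the volume of $E$, $V$ is the volume of $S$, and $M$ is the total number of observations.

Mutual independence (dependency-driven discovery): $e_E = M \prod_{i=1}^{N} (n_i / M)$, where $n_i$ is the number of data points that fall into $I_i$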
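
The expected occurrence under each pre-assumed model can be sketched as follows. The sketch assumes that the uniform model spreads the M observations evenly over S, and that the independence model multiplies the marginal proportions n_i / M; `expected_uniform` and `expected_independent` are hypothetical names.

```python
from math import prod
from typing import Sequence


def expected_uniform(event_volume: float, space_volume: float, total: int) -> float:
    """Expected count in E if the `total` points were spread uniformly over S."""
    return total * event_volume / space_volume


def expected_independent(side_counts: Sequence[int], total: int) -> float:
    """Expected count in E if the variables were mutually independent.

    side_counts[i] is n_i, the number of points whose i-th value falls in I_i.
    """
    return total * prod(n / total for n in side_counts)


# 100 observations; E occupies 2 of the 10 volume units of S.
print(expected_uniform(2.0, 10.0, 100))      # 20.0
# 100 observations; 50 fall in I_1 and 40 fall in I_2.
print(expected_independent([50, 40], 100))   # 20.0
```

An event whose observed frequency is far above either expectation is a candidate pattern under the corresponding discovery mode.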
4.2 Pattern Discovery as Optimization

C represents one of the corners of E, and L
represents the lengths of the edges.

Further define
4.2 Pattern Discovery as Optimization

The pattern discovery problem is to find the events $E$ that maximize an objective function $O(E)$

The objective function $O(E)$ is the adjusted residual
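
To make the optimization view concrete, here is a deliberately naive sketch for a single continuous variable. It is not the search procedure of the paper: it exhaustively scores grid-aligned intervals, and it uses the standardized residual under a uniform model as a stand-in objective, since the adjusted residual additionally needs a variance estimate. `best_interval` and the toy points are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple


def best_interval(points: List[float], low: float, high: float,
                  bins: int) -> Tuple[float, float, float]:
    """Return (a, b, score) for the grid-aligned interval [a, b] with the largest score.

    score = (observed - expected) / sqrt(expected), with `expected`
    computed under a uniform model over [low, high].
    """
    total = len(points)
    width = (high - low) / bins
    edges = [low + k * width for k in range(bins + 1)]
    best = (low, high, 0.0)
    for a, b in combinations(edges, 2):  # every pair a < b of grid edges
        observed = sum(a <= p <= b for p in points)
        expected = total * (b - a) / (high - low)
        score = (observed - expected) / math.sqrt(expected)
        if score > best[2]:
            best = (a, b, score)
    return best


points = [0.5, 4.1, 4.3, 4.5, 4.7, 4.9, 9.5]
a, b, score = best_interval(points, 0.0, 10.0, 10)
print(a, b)  # 4.0 5.0
```

Exhaustive enumeration grows quickly with the number of variables and grid edges, which is why a real system would need a more careful search strategy.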
5. Inferencing with Patterns for
Decision Support




Building Classifiers
Multivariate Probabilistic Density Estimation
Interpretation of Patterns
Discovered Patterns as Queries for Class Data
Retrieval
5.1 Building Classifiers

Based on mutual information from information theory (dependency-driven discovery)

information gain
weight of evidence
I(.) is the mutual information
The weight of evidence can be positive, negative, or zero: the evidence supports, refutes, or is neutral toward the class
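
As an illustration of the weight of evidence, the sketch below uses its usual log-likelihood-ratio form, log [P(x | y) / P(x | not y)], which equals the difference between the mutual information of x with the class and with its complement. Probabilities are estimated from raw counts; the function name, the argument names, and the counts are hypothetical, and zero counts are not handled.

```python
import math


def weight_of_evidence(n_xy: int, n_y: int, n_x: int, total: int) -> float:
    """Weight of evidence, in bits, that observation x provides for class y.

    n_xy: samples with both x and y     n_y: samples in class y
    n_x:  samples with x                total: all samples
    Assumes every count involved is non-zero (no smoothing is applied).
    """
    p_x_given_y = n_xy / n_y
    p_x_given_not_y = (n_x - n_xy) / (total - n_y)
    return math.log2(p_x_given_y / p_x_given_not_y)


# 100 samples, 40 in class y, 30 showing x.
supports = weight_of_evidence(n_xy=24, n_y=40, n_x=30, total=100)  # x is common within y
neutral = weight_of_evidence(n_xy=12, n_y=40, n_x=30, total=100)   # x is equally common in both
refutes = weight_of_evidence(n_xy=3, n_y=40, n_x=30, total=100)    # x is rare within y
print(supports > 0, neutral == 0, refutes < 0)  # True True True
```

A classifier built this way would pick the class with the largest total weight of evidence.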
5.1 Building Classifiers

However, this requires estimating the conditional probabilities or knowing the distribution

The weight of evidence can be decomposed if the significant event associations related to $y_i$ and $x$ are known
5.1 Building Classifiers


Only the significant event associations discovered
from the data set are used in the inference process.

Thus, maximize the

Conditions
Using the pattern as a model, any missing value of a discrete variable can be classified
5.2 Multivariate Probabilistic Density
Estimation

The estimation of the probability density function
(pdf) is a central problem in multivariate data
analysis. (concentration-driven discovery)


discrete pdf Estimation
continuous pdf Estimation
discrete pdf Estimation

Definition 12: The indicator function for an event, $E_i$, is defined as $\mathbf{1}_{E_i}(x) = 1$ if $x \in E_i$, and $0$ otherwise

The probability density estimate

The normalization condition

The discrete probability density function
continuous pdf Estimation

The basic idea is to estimate a kernel for each
event.

A Gaussian kernel is used: it is continuous and satisfies
where
continuous pdf Estimation

To fit a kernel ψ(x) to the event E

compute the mean and covariance matrix

The combined pdf is estimated by
where
5.2 Multivariate Probabilistic Density
Estimation

An example of continuous density estimation
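
The kernel-per-event idea can be sketched in a simplified, univariate form. The sketch assumes each event contributes one Gaussian kernel fitted to the mean and variance of its own points, and that kernels are combined with weights proportional to the number of points in each event; the weighting scheme, the function names, and the data are assumptions. The multivariate case would use a mean vector and covariance matrix instead.

```python
import math
from statistics import fmean, pvariance
from typing import Callable, List


def fit_kernel(points: List[float]) -> Callable[[float], float]:
    """Fit a 1-D Gaussian kernel to the data points of one event."""
    mu = fmean(points)
    var = pvariance(points, mu)  # population variance; needs at least two distinct points

    def kernel(x: float) -> float:
        return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

    return kernel


def combined_pdf(events: List[List[float]]) -> Callable[[float], float]:
    """Mix the per-event kernels, weighting each by its share of the data (assumed)."""
    total = sum(len(e) for e in events)
    parts = [(len(e) / total, fit_kernel(e)) for e in events]
    return lambda x: sum(w * k(x) for w, k in parts)


# Two discovered events, each given as the points that fall inside it.
pdf = combined_pdf([[0.8, 0.9, 1.0, 1.1, 1.2], [4.5, 5.0, 5.5]])
print(pdf(1.0) > pdf(3.0))  # True: density is higher inside an event than between events
```

Because the weights sum to one and each kernel integrates to one, the mixture is itself a valid density.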
5.3 Interpretation of Patterns

Once events have been discovered, rules can be easily extracted.

Association rules take the form X => Y
Support and confidence are measured by P(X,Y) and P(Y|X), respectively
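
Support and confidence of a rule X => Y can be estimated from counts, as in this minimal sketch. The record layout, the predicates, and the name `support_and_confidence` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Predicate = Callable[[Record], bool]


def support_and_confidence(data: List[Record], x: Predicate,
                           y: Predicate) -> Tuple[float, float]:
    """Estimate support P(X, Y) and confidence P(Y | X) for the rule X => Y."""
    n_x = sum(1 for r in data if x(r))
    n_xy = sum(1 for r in data if x(r) and y(r))
    support = n_xy / len(data)
    confidence = n_xy / n_x if n_x else 0.0  # confidence is reported as 0 when X never occurs
    return support, confidence


data = [
    {"smoker": "yes", "risk": "high"},
    {"smoker": "yes", "risk": "high"},
    {"smoker": "yes", "risk": "low"},
    {"smoker": "no", "risk": "low"},
    {"smoker": "no", "risk": "high"},
]
support, confidence = support_and_confidence(
    data,
    x=lambda r: r["smoker"] == "yes",
    y=lambda r: r["risk"] == "high",
)
print(support)  # 0.4
```

Here the rule "smoker = yes => risk = high" covers two of five records, and holds for two of the three smokers.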
5.4 Discovered Patterns as Queries
for Class Data Retrieval

One pattern

The Query
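
One way a discovered pattern could be turned into a retrieval query is sketched below. This is only an illustration under assumptions: the pattern is stored as one closed interval or one value set per variable, the data live in a relational table, and Python's standard `sqlite3` module stands in for the database. The table, the columns, and `pattern_to_query` are hypothetical.

```python
import sqlite3
from typing import Dict, List, Set, Tuple, Union

Side = Union[Tuple[float, float], Set[str]]


def pattern_to_query(table: str, event: Dict[str, Side]) -> Tuple[str, List[object]]:
    """Build a parameterized SELECT that retrieves the rows inside the hypercell.

    Table and column names are interpolated directly, so they must come
    from a trusted schema; the values are passed as bound parameters.
    """
    clauses: List[str] = []
    params: List[object] = []
    for column, side in event.items():
        if isinstance(side, tuple):  # continuous variable: closed interval
            clauses.append(f"{column} BETWEEN ? AND ?")
            params.extend(side)
        else:  # discrete variable: set of allowed values
            values = sorted(side)
            clauses.append(f"{column} IN ({', '.join('?' for _ in values)})")
            params.extend(values)
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses), params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (age REAL, smoker TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [(25.0, "no"), (41.0, "yes"), (47.0, "yes"), (63.0, "no")],
)
sql, params = pattern_to_query("patients", {"age": (40.0, 50.0), "smoker": {"yes"}})
rows = conn.execute(sql, params).fetchall()
print(sql)  # SELECT * FROM patients WHERE age BETWEEN ? AND ? AND smoker IN (?)
```

Each side of the hypercell becomes one condition, so the query retrieves exactly the records that the pattern covers.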
6. Summary and Discussion
Develop a framework of data-driven decision support based on pattern discovery

1) the motivation, historical background, and rationale of our approach;
2) a novel unified framework to define and represent mixed-mode data, the most general and common data encountered in today’s databases;
3) the theoretical basis of pattern discovery based on statistical residual and optimization principles;
6. Summary and Discussion
Develop a framework of data-driven decision support based on pattern discovery

4) a novel and unified framework to represent probability models for mixed-mode data in the form of a pdf;
5) an inferencing system using the discovered patterns and weight of evidence for classification and prediction;
6) a new way of retrieving data, using the patterns of each class as queries, in a distributed database of unlimited size;
7) supporting evidence validating the efficacy of the proposed framework and its new developments in solving large-scale problems with online interactive capability.