Pattern Management: Models, Languages, and Architectural Issues
Tutorial
DASFAA’05 - Beijing, April 16th
Barbara Catania
DISI - University of Genoa
Tutorial Objectives
• Provide a definition of pattern management
• Identify the environments in which pattern management
could be useful
• Understand the analogies and the differences with data
mining, data warehousing and metadata management
• Introduce the main requirements of pattern management
• Present some theoretical proposals and some standards,
and discuss their features with respect to the general
pattern management requirements
• Discuss open issues
Outline
• Introduction to pattern management
• Features
– Architecture
– Models
– Languages
• Theoretical proposals
• Standards
• Open issues
A lot of data but only little information ...
The world produces between 1 and 2
exabytes of unique information per year,
which is roughly 250 megabytes for
every man, woman, and child on earth
(Lyman & Varian, 2003)
Which Data?
• Large datasets
• Distributed sources
• Heterogeneous
• Data is not knowledge!
Which information?
• Knowledge artifacts
• Smaller datasets, manageable by humans
• Preserve as much as possible of the hidden/interesting/available information in the data
[Figure: data ⇒ a decision tree for contact-lens prescription, with tests on tear production (reduced/normal), astigmatism (no/yes), and eye prescription (myopia/hypermetropia), and outcomes none/soft/rigid]
We probably need patterns ...
• A compact and semantically rich representation of raw data
[Figure: data and the patterns that represent them]
An example
T1 Beer, Potato Chips, Refreshments, Nappies
T2 Whisky, Beer, Nappies
T3 Detergents, Broom, Beer, Potato Chips
T4 Milk, Potato Chips, Tomatoes, Carrots
T5 Cigarettes, Meat, Refreshments
T6 Meat, Cheese, Fish, Refreshments, Beer, Potato Chips
A1: Beer ⇒ Potato Chips
Pattern examples
Not all patterns are data mining patterns!
Should heterogeneous patterns be
managed together?
• Example 1
– Which items are co-purchased with a certain
promotional item p?
• Frequent itemsets
– What are the circumstances (e.g., location, time, etc.) under which the frequent co-purchases were made?
• Frequent itemsets and decision trees
Should heterogeneous patterns be
managed together?
• Example 2
– Mobile objects monitoring through trajectories
• Equations
– What objects are similar with respect to their
trajectories?
• Clusters over equations
Should patterns be combined?
• Example 3
– Classifying customers in China into the categories of
highRisk and lowRisk for credit rating
• Decision tree T1
– Predicting under which conditions people in China live
in cities vs. the countryside
• Decision tree T2
– How is it possible to combine T1 and T2 in order to be
able to predict under which conditions people have a
certain credit rating and tend to live in a certain
neighborhood?
A many-to-many relationship
[Figure: many-to-many mapping between the data space and the pattern space]
An example
T1 Beer, Potato Chips, Refreshments, Nappies
T2 Whisky, Beer, Nappies
T3 Detergents, Broom, Beer, Potato Chips
T4 Milk, Potato Chips, Tomatoes, Carrots
T5 Cigarettes, Meat, Refreshments
T6 Meat, Cheese, Fish, Refreshments, Beer, Potato Chips
A1: Beer ⇒ Potato Chips
A2: Beer ⇒ Refreshments
But …how can the spaces be
characterized?
[Figure: data space and pattern space]
1. The pattern space must provide the representation of heterogeneous patterns
2. Are the two spaces distinct, or do they coincide?
• Patterns as a kind of data
3. Is the pattern space defined in terms of the data space?
• Patterns as views over data
… how is pattern importance
determined?
[Figure: data space and pattern space, with patterns p1 and p2, where p2 > p1]
• How can we quantify how well a pattern represents the data?
• Need for measures
An example
T1 Beer, Potato Chips, Refreshments, Nappies
T2 Whisky, Beer, Nappies
T3 Detergents, Broom, Beer, Potato Chips
T4 Milk, Potato Chips, Tomatoes, Carrots
T5 Cigarettes, Meat, Refreshments
T6 Meat, Cheese, Fish, Refreshments, Beer, Potato Chips
A1: Beer ⇒ Potato Chips
Support(A1) = 3
Confidence(A1) = 0.75
A2: Beer ⇒ Refreshments
Support(A2) = 2
Confidence(A2) = 0.5
S(X) = # of transactions containing X
Support(X ⇒ Y) = S(X ∪ Y)
Confidence(X ⇒ Y) = S(X ∪ Y) / S(X)
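The definitions of S, Support, and Confidence can be checked against the six example transactions with a few lines of Python. This is only a minimal sketch: the function names `s`, `support`, and `confidence` and the `TRANSACTIONS` list are illustrative choices, not part of the tutorial.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

# The six example transactions T1..T6 from the slide.
TRANSACTIONS: List[Transaction] = [
    frozenset({"Beer", "Potato Chips", "Refreshments", "Nappies"}),
    frozenset({"Whisky", "Beer", "Nappies"}),
    frozenset({"Detergents", "Broom", "Beer", "Potato Chips"}),
    frozenset({"Milk", "Potato Chips", "Tomatoes", "Carrots"}),
    frozenset({"Cigarettes", "Meat", "Refreshments"}),
    frozenset({"Meat", "Cheese", "Fish", "Refreshments", "Beer", "Potato Chips"}),
]


def s(items: Iterable[str], transactions: List[Transaction]) -> int:
    """S(X): number of transactions containing every item of X."""
    wanted = frozenset(items)
    return sum(1 for t in transactions if wanted <= t)


def support(body: Iterable[str], head: Iterable[str],
            transactions: List[Transaction]) -> int:
    """Support(X => Y) = S(X u Y), an absolute count as on the slide."""
    return s(frozenset(body) | frozenset(head), transactions)


def confidence(body: Iterable[str], head: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Confidence(X => Y) = S(X u Y) / S(X)."""
    body = frozenset(body)
    return support(body, head, transactions) / s(body, transactions)


if __name__ == "__main__":
    print(support({"Beer"}, {"Potato Chips"}, TRANSACTIONS))
    print(confidence({"Beer"}, {"Potato Chips"}, TRANSACTIONS))
```

Beer occurs in four transactions (T1, T2, T3, T6), three of which also contain Potato Chips and two of which also contain Refreshments, which is where the values 3, 0.75, 2, and 0.5 come from.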
… how is data importance
determined?
[Figure: data space and pattern space, with data items d1 and d2, where d1 > d2]
• How many patterns does a data item (or a subset of data items) correspond to?
… how is the relationship
represented?
1. Set of data items from which a pattern has been extracted (in case of mining)
2. Set of data items represented by a pattern
3. Set of data items possibly represented by a pattern
…how is relationship evolution
traced?
[Figure: data space and pattern space]
• The data space changes with high frequency
• What happens to the pattern space?
• … patterns are not necessarily views over raw data …
• Need for synchronization
Spaces interaction: a generic usage
scenario
[Figure: data space and pattern space, annotated with the nine steps below]
1. Extraction
2. Select one or more patterns (PS1)
3. What are the data represented by such patterns? (DS1)
4. Select a certain subset of DS1 (DS2)
5. What are the patterns representing, possibly approximately, DS2? (PS2)
6. Is a pattern in PS2 suitable for representing a dataset DS3?
7. How different are two pattern sets? Is the difference significant?
8. How similar are two patterns?
9. How can patterns be combined?
Pattern management
• Representation of data and pattern spaces
• Generation or definition of patterns
• Storage
• Retrieval
• Synchronization
• Visualization
• Analysis, inference
• …
Pattern management vs data
mining
• Data mining
– Generation of new and previously unknown knowledge from large datasets
• Pattern management
– Generation and management of (heterogeneous) patterns
Data mining is an activity of pattern management
[Figure: data, data mining techniques, knowledge artifacts (patterns), pattern base, pattern management]
Pattern management vs data
warehousing
• Is a pattern base a sort of meta-warehouse?
– Relationships between the data warehouse and the source data exist in the form of metadata
– Measures are a key concept
• However
– Patterns cannot necessarily be represented in terms of dimensions and measures
– The DW model is not adequate (and therefore neither are its languages)
[Figure: data warehouse, data mining techniques, knowledge artifacts (patterns), pattern base, pattern management]
Pattern management vs metadata
management
• Patterns are metadata
– Semantic metadata
– Knowledge over data
• Metadata are not patterns
– No quantification of importance
– Not necessarily first-class citizens
[Figure: data warehouse, data mining techniques, knowledge artifacts (patterns), pattern base, pattern management, and metadata repositories]
We need a Pattern Base
Management System ...
• Pattern-Base Management System (PBMS): technology for
– Modeling patterns as first-class citizens
– Querying patterns
– Managing patterns efficiently
– Managing heterogeneous patterns uniformly
Is DBMS technology sufficient?
• Patterns can be represented according to some advanced data model
– Object-oriented
– Semi-structured
– …
• … but specific applications are required to cope with patterns
[Figure: a pattern management layer on top of a DBMS that stores a combined data & pattern base]
Is DBMS technology sufficient?
• Alternatively, we can use DBMS technology to design an ad hoc PBMS
[Figure: a DBMS over the database and a PBMS over the pattern base]
Disciplines involved
[Figure: disciplines involved in pattern management: artificial intelligence, DBMS technology, data mining, constraint programming, data warehousing, spatial databases, metadata management, …]
Outline
• Introduction to pattern management
• Features
– Architectures
– Models
– Languages
• A classification of existing proposals
• Theoretical proposals
• Standards
• Open issues
PBMS features
• A PBMS architecture
• A pattern model
• Languages to query and manipulate patterns
[Figure: data warehouse, data mining techniques, knowledge artifacts (patterns), pattern base, pattern management]
Integrated architecture
[Figure: queries are posed to a pattern management layer on top of a DBMS that stores a combined data & pattern base]
Integrated architecture
• One management system
• One logical model
- Less effort in design
- Limitations due to the chosen model
• Pattern generation is a query operator
- Need to extend traditional query languages
- Mixing manipulation and query operations
• Only one language for pattern and data querying
- Less effort in query design
- What about optimization?
Separated architecture
[Figure: a DBMS over the database answers data queries; a PBMS over the pattern base answers pattern queries and performs pattern extraction; cross-over queries involve both]
Separated architecture
• Two management systems
• Two logical models
- More effort in design
• Pattern generation is a pattern manipulation operation
- Clear distinction between query and manipulation operations
• Two groups of query languages
- More effort in query design
- Ad hoc optimization techniques
The model
• Support for typical (data mining) patterns
• Support for user-defined pattern types
• Hierarchies over pattern types
• Relation between raw data and patterns
• Quality measures
User-defined pattern types
support
• Typical data mining patterns (association rules, clusters, decision trees, etc.) are usually supported but often managed independently
• The pattern space should be extensible, to guarantee the representation of user-defined pattern types
Pattern hierarchies
• Ability to model hierarchies between pattern types
– Expressivity
– Reusability
– Modularity
Relationship between raw data
and patterns
• Storage of the relation between patterns and raw data
• Makes the pattern richer in semantics and provides significant information for pattern retrieval
• Three different approaches
1. Set of data items from which the pattern has been extracted (in case of mining)
2. Set of data items represented by a pattern
3. Set of data items possibly represented by a pattern
Quality Measures
• Patterns are usually associated with measures
– Association rules: support, confidence, J-measure, conviction (Smyth, Goodman, 1992)
– Clusters: average intra-cluster distance
• In general, measures are static
– Computed at pattern extraction time
– New computation = new pattern extraction
Languages
• Pattern manipulation language
– Automatic extraction
– Direct insertion of patterns
– Modifications and deletions
– Synchronization over source data
– Mining function
• Pattern query language
– Queries against patterns
• Similarity
• Combination
– Queries involving source data
PML: automatic extraction
• Capability of a system to generate patterns starting from raw data using a mining function
• It corresponds to the data mining step of a knowledge discovery process
• Generates a-posteriori patterns
PML: direct insertion
• Some patterns are not extracted from
raw data
• Inserted directly from scratch in the
system
• a-priori patterns
• Example
• Import or insert a classifier from scratch
• Use it to classify existing data
PML: synchronization
• Source data change with high frequency
• It is important to determine whether existing patterns, after a certain time, still represent the data source from which they have been generated
• If they do not, it could be useful to update the information associated with a pattern when the quality of the representation or its validity changes over time
• Alternative: generation of new patterns
PML: synchronization
[Figure: rule P1 (Shoes ⇒ socks, support = 0.55, confidence = 0.75) is extracted from transactions T1. When transactions T2 are added, either a new pattern P2 (Shoes ⇒ socks, support = 0.7, confidence = 0.80) is generated alongside P1, or the measures of P1 are updated to support = 0.7, confidence = 0.80.]
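The two alternatives (updating the measures of an existing pattern versus generating a new pattern) can be sketched as follows. This is a hypothetical illustration: the `Rule` class and the functions `measures`, `synchronize`, and `recompute` are invented for the example, and support is taken as a fraction of the transactions, as the values 0.55 and 0.7 in the example suggest.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Rule:
    pid: str
    body: FrozenSet[str]
    head: FrozenSet[str]
    support: float      # fraction of transactions containing body and head
    confidence: float   # that count divided by the count for the body alone


def measures(body: FrozenSet[str], head: FrozenSet[str],
             transactions: List[FrozenSet[str]]) -> Tuple[float, float]:
    """Compute (support, confidence) of body => head over the transactions."""
    n_body = sum(1 for t in transactions if body <= t)
    n_both = sum(1 for t in transactions if (body | head) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


def synchronize(rule: Rule, transactions: List[FrozenSet[str]]) -> Rule:
    """Same pattern identifier, measures refreshed against the current data."""
    sup, conf = measures(rule.body, rule.head, transactions)
    return replace(rule, support=sup, confidence=conf)


def recompute(rule: Rule, transactions: List[FrozenSet[str]],
              new_pid: str) -> Rule:
    """A new pattern is created; the old one is left untouched."""
    sup, conf = measures(rule.body, rule.head, transactions)
    return Rule(new_pid, rule.body, rule.head, sup, conf)
```

With synchronization the pattern base keeps a single pattern whose measures follow the data; with recomputation it keeps both the old and the new pattern, so the history of the measures remains available.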
PML: mining function
• A-posteriori patterns are generated from raw data by applying some kind of mining function
• Association rules: Apriori algorithm
• Clusters: k-means algorithm
• The presence of a library of mining functions and the possibility to define new functions when required make pattern generation much more flexible
PQL: queries over patterns
• Primitives for pattern retrieval
– Selection
– Similarity-based selection
• How is it possible to define pattern similarities?
– Join: how can patterns be combined together?
• If shoes ⇒ socks and socks ⇒ t-shirts, what can we say about shoes ⇒ t-shirts?
PQL: similarity
• Useful whenever we have to measure differences between models describing evolving data or data extracted from different sources
– monitoring monthly sales of a supermarket
– analyzing differences in data characteristics across several sets of data (customer transactions, reactions to chemical/biological substances)
• If similarity is high, there is no need to perform a thorough (and costly) analysis on actual data
PQL: queries over patterns and
source data
• Cross-over queries
• Which data is best represented by a given
pattern?
• Which patterns represent a given set of
data?
– A sort of classification
Spaces interaction: a generic usage
scenario
1. Extraction [PML]
2. Select one or more patterns (PS1) [PQL selection]
3. What are the data represented by such patterns? (DS1) [PQL cross-over]
4. Select a certain subset of DS1 (DS2) [QL selection]
5. What are the patterns representing, possibly approximately, DS2? (PS2) [PQL cross-over]
6. Is a pattern in PS2 suitable for representing a dataset DS3? [PQL cross-over]
7. How different are two pattern sets? Is the difference significant? [PQL measures]
8. How similar are two patterns? [PQL similarity]
9. How can patterns be combined? [PQL combination]
Pattern management taxonomy
Pattern management
• Theoretical proposals
– Integrated architecture (frameworks, languages)
– Separated architecture
– Pattern similarity
• Standards for patterns
• Metadata management
Pattern management taxonomy
Theoretical proposals, integrated architecture: frameworks
• Inductive databases (Imielinski & Mannila, 1996)
• CINQ project (1998-2002) (De Raedt, 2002) (Meo et al., 2004)
Pattern management taxonomy
Theoretical proposals, integrated architecture: languages for inductive databases
• No storage: DMQL (Han et al., 1996), ODMQL (Elfeky et al., 2001)
• Storage, no query: Mine Rule (Meo et al., 1996-1999), XMine (Braga et al., 2002)
• Storage, query, recomputation: MSQL (Imielinski & Virmani, 1996-1999)
Pattern management taxonomy
Theoretical proposals: separated architecture
• 3World Model (Johnson, Lakshmanan, Ng, 2000)
• PANDA framework (Rizzi et al., 2001-2004)
Pattern management technology
Theoretical proposals: pattern similarity
• FOCUS (Ganti et al., 1999)
• PANDA approach (Bartolini et al., 2004)
Pattern management taxonomy
Standards for patterns (data mining standards)
• PMML
• CWM
• ISO SQL/MM
• JDM API
Pattern management taxonomy
Metadata management
• RDF
• Dublin Core
• …
Theoretical proposals: what is the
aim?
• Definition of pattern management frameworks providing full support for heterogeneous pattern generation and management
– back-end technologies for pattern management applications
• Similarities for patterns
Inductive databases
• First defined in 1996 (Imielinski & Mannila, 1996)
• Mainly investigated in the context of the EU project CINQ (Consortium on Discovering Knowledge with Inductive Queries, 1998-2002)
• Aim
– Develop a general theory of inductive databases (IDBs)
– Analyze query evaluation for well-known pattern domains (e.g., association rules) and some new ones (e.g., graphs)
– Provide extensions of existing query languages
– Implement prototypes
– Evaluate prototypes against several applications (Web mining, bioinformatics)
IDBs: features
• Integrated architecture
• Model
– No user-defined pattern types (support for common pattern types)
– No general hierarchies
– Patterns represented according to the raw data model
– Measures
– No explicit relationship with raw data
• PML
– Extraction is a query operation
• PQL
– Constraint theories as formal foundation
– Extension of standard data query languages
IDBs
• Knowledge discovery as an extended query process
– There is no such thing as real discovery, just a matter of the expressive power of the query languages (Imielinski and Mannila, CACM, Nov. 1996)
• General frameworks + inductive extensions for different querying paradigms
• Specific types of patterns
– Itemsets
– Association rules
– Sequences
– Clusters
– Equations
IDBs: application domains
• Molecular (MOLFEA, 2004)
– a domain specific IDB
• Association rules and itemsets (Minerule
System, 2004)
– main paradigm for IDBs
IDBs: the framework
• An IDB is composed of:
– A set of data sets
– A set of pattern sets
• IDB languages
– A query language that generates data sets
– An inductive query language that generates
pattern sets
• Data & pattern sets can be
extensional/intensional
IDBs: the framework
• create data set D as query
• create view data set D as query
• create pattern set P as query
• create pattern view P as query
• Insert/delete/update statements
IDBs: theoretical foundations
• Formal theory for IDBs (De Raedt et al., 2002)
• For each pattern type
– Language of patterns (e.g., itemsets, association rules, sequences, graphs, dependencies, decision trees, clusters)
– Evaluation functions (e.g., frequency, closures, generality, validity, accuracy)
– Primitive constraints (e.g., minimal/maximal frequency, minimal accuracy)
• Constraint programming can be used for extraction and further queries (post-processing)
– Constraint-based mining (SIGKDD, 2002)
IDBs: constraint examples
• Cmaxfreq(φ,r) ≡ freq(φ,r) ≤ γ, γ ∈ [0,1]
• Cminfreq(φ,r) ≡ freq(φ,r) ≥ γ, γ ∈ [0,1]
• Cclose(φ,r) ≡ closure(φ,r) = φ
• …
IDBs: examples of constraint-based
computations
• Standard association rule mining
– Cminfreq(φ,r) ∧ Cminconf(φ,r)
• Discriminant patterns
– Frequent in one dataset and infrequent in another
– Cminfreq(φ,r1) ∧ Cmaxfreq(φ,r2)
• Post-processing
– As before, without extractions
• Computing condensed (concise) representations
– Cminfreq(φ,r) ∧ Cclose(φ,r)
– …
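The constraint combinations listed above can be tried out on small itemset data. The sketch below is a brute-force illustration rather than how an inductive database would evaluate such queries: all function names are hypothetical, every candidate itemset is enumerated, and the closure of an itemset is assumed to be the intersection of the transactions that contain it.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterator, List

Itemset = FrozenSet[str]


def freq(phi: Itemset, r: List[Itemset]) -> float:
    """Relative frequency of itemset phi in dataset r."""
    return sum(1 for t in r if phi <= t) / len(r)


def closure(phi: Itemset, r: List[Itemset]) -> Itemset:
    """Intersection of all transactions containing phi (phi if there are none)."""
    covering = [t for t in r if phi <= t]
    if not covering:
        return phi
    return frozenset.intersection(*covering)


def c_minfreq(phi: Itemset, r: List[Itemset], gamma: float) -> bool:
    return freq(phi, r) >= gamma


def c_maxfreq(phi: Itemset, r: List[Itemset], gamma: float) -> bool:
    return freq(phi, r) <= gamma


def c_close(phi: Itemset, r: List[Itemset]) -> bool:
    return closure(phi, r) == phi


def candidates(r: List[Itemset]) -> Iterator[Itemset]:
    """Every non-empty itemset over the items of r (exponential, demo only)."""
    items = sorted(set().union(*r))
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def solve(r: List[Itemset], constraint: Callable[[Itemset], bool]) -> List[Itemset]:
    """Return the candidate itemsets that satisfy the given constraint."""
    return [phi for phi in candidates(r) if constraint(phi)]
```

For instance, `solve(r1, lambda phi: c_minfreq(phi, r1, 0.5) and c_maxfreq(phi, r2, 0.25))` expresses the discriminant-pattern query over two datasets `r1` and `r2`, and replacing the second conjunct with `c_close(phi, r1)` gives the condensed representation.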
IDBs: the languages
[Figure: languages plotted by capabilities against year (1996, 1999, 2001, 2002)]
• Only extraction: DMQL (Han et al., 1996), ODMQL (Elfeky et al., 2001)
• + storage: Mine Rule (Meo et al., 1996-1999), XMINE (Braga et al., 2002)
• + retrieval, recomputation: MSQL (Imielinski & Virmani, 1996-1999)
DMQL (Han et al., 1996) and ODMQL (Elfeky et al., 2001)
• DMQL: SQL-like language
• ODMQL: OQL-like language
• Similar characteristics
• Extracted patterns
– association rules
– characteristic rules
– discriminant rules
– classification rules
• No rule storage
DMQL and ODMQL
• For each type of pattern, a set of measures is provided
• Conditions over them can be specified
• Ability to generalize or specialize the mined results
{hiking_boots} ⇒ {ski_pants}
{hiking_boots} ⇒ {pants}
{hiking_boots} ⇒ {clothes}
…
MineRule (Meo et al., 1996-1999)
• SQL extension with the MineRule operator for rule extraction
• Transactions are assumed to be stored in relations
• Usage of hierarchies over raw data to generalize association rules
MINE RULE MarketAssRules AS
SELECT DISTINCT 1..n item AS BODY, 1..n item AS HEAD,
SUPPORT, CONFIDENCE
WHERE BODY.price >= 100 AND HEAD.price < 100
FROM Purchase
GROUP BY Customer
CLUSTER BY date
EXTRACTING RULES WITH SUPPORT: 0.01, CONFIDENCE: 0.2
– WHERE: conditions to be satisfied by items in the body and the head
– GROUP BY: purchases are grouped to form transactions
– CLUSTER BY: rules are generated from purchases performed on the same date
XMINE (Braga et al., 2002)
• Association rules for XML documents
• Merge of XQuery and MineRule
MSQL (Imielinski & Virmani, 1996-1999)
• SQL-like language for rule extraction and management
• Transactions are stored in relations
• GET_RULES statement: association rules are generated and stored in extended relations
GetRules(Employees)
into My_Emp_Rules
where
Body has { (Job=*) } AND
Consequent has { (Age=*) } AND
confidence > 0.9 and support > 0.3
– GetRules(Employees): source dataset
– where: conditions over rules
MSQL
• SELECT_RULES operator: provides queries over association rules
SelectRules(My_Emp_Rules)
where Body has { (Sex=*), (Salary=[30000,80000]) }
• Measure recomputation over views
Project Body, Consequent,
Confidence(NJ_Emp), Support(NJ_Emp),
Confidence(NY_Emp), Support(NY_Emp),
SelectRules(My_Emp_Rules)
where Body has { (Sex=*) }
AND Consequent has { (Car=*) }
MSQL
• SATISFY/VIOLATE: two operators for cross-over
queries
• Determine whether a tuple satisfies or violates at least
one or all the association rules in a given set.
Select …
From …
Where { SATISFIES | VIOLATES } { ALL |
ANY } (<GetRules | SelectRules Subquery>)
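The effect of the four combinations (SATISFIES or VIOLATES, ALL or ANY) can be sketched outside MSQL. The sketch assumes, as an interpretation not stated in the tutorial, that a tuple satisfies a rule when it matches both body and consequent and violates it when it matches the body but not the consequent; descriptors are simplified to attribute = value pairs, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Descriptor = Tuple[str, object]                    # (attribute, value)
Rule = Tuple[List[Descriptor], List[Descriptor]]   # (body, consequent)
Row = Dict[str, object]


def holds(descriptors: List[Descriptor], row: Row) -> bool:
    """True if the row matches every attribute = value descriptor."""
    return all(row.get(attr) == value for attr, value in descriptors)


def satisfies(row: Row, rule: Rule) -> bool:
    # Assumed semantics: body and consequent both hold on the row.
    body, consequent = rule
    return holds(body, row) and holds(consequent, row)


def violates(row: Row, rule: Rule) -> bool:
    # Assumed semantics: body holds on the row but the consequent does not.
    body, consequent = rule
    return holds(body, row) and not holds(consequent, row)


def select(rows: List[Row], rules: List[Rule],
           mode: str, quantifier: str) -> List[Row]:
    """Emulate: Where { SATISFIES | VIOLATES } { ALL | ANY } (rules)."""
    test = satisfies if mode == "SATISFIES" else violates
    combine = all if quantifier == "ALL" else any
    return [row for row in rows if combine(test(row, rule) for rule in rules)]
```

A call such as `select(rows, rules, "VIOLATES", "ANY")` would return the tuples that break at least one rule of the set, which is the kind of cross-over query described above.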
Impact on DBMS technology
• Constraint programming as theoretical
framework
• Extensions of existing query languages
– Impact on the relational and object-oriented
model
The 3 World framework
(Johnson, Lakshmanan, Ng, 2000)
• Knowledge discovery as a multistep process
• Three main issues:
– a model for heterogeneous pattern
representation, based on 3 different worlds
– Languages for manipulating data in each world
– Operators to move in and out of the worlds
• No prototype
3W framework: features
• Separated architecture
• Model
– User-defined pattern types
– Hierarchies
– Measures
– Relationship with raw data
• PML
– A-posteriori patterns
– Synchronization
– No mining function specification
• PQL
– Selection, projection, …
– Pattern combination
– No similarity
– Cross-over queries
The 3W model
• I-World: intensional description of patterns
• E-World: extensional (e.g., by enumeration) description of patterns
• D-World: raw data from which patterns have been defined
The I-world
• Patterns can be represented as possibly overlapping regions of the data space
– Each frequent itemset coincides with a region in the data space of itemsets
• Regions described as conjunctions of linear constraints
– Beer = 1 AND diaper = 1
• Sets of regions form a dimension
– All frequent itemsets containing a given promotional item p
• Regions are associated with attributes
– Measures
• Regions can be hierarchically organized
– Partial order over regions
[Figure: possibly overlapping regions p1, p2, p3]
The I-world
Isothetic regions: axis-parallel and hyper-rectangular
Efficient manipulation and processing through linear constraints
The E-World and the D-world
• E-World
– Extensional representation of a region or
dimension by enumerating its components with
respect to the D-world
• D-world
– Relational database
3W model – relationships with raw data
• Source data
– D-world
• Extensional representation
– E-world
• Intensional representation
– I-world
An example
D-World
T1 Beer, Potato Chips, Refreshments, Nappies
T2 Whisky, Beer, Nappies
T3 Detergents, Broom, Beer, Potato Chips
T4 Milk, Potato Chips, Tomatoes, Carrots
T5 Cigarettes, Meat, Refreshments
T6 Meat, Cheese, Fish, Refreshments, Beer, Potato Chips
I-World
Beer = 1 AND Potato Chips = 1
E-World
T1 Beer, Potato Chips, Refreshments, Nappies
T3 Detergents, Broom, Beer, Potato Chips
T6 Meat, Cheese, Fish, Refreshments, Beer, Potato Chips
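The relationship between the three worlds in this example can be sketched in a few lines. The code is an illustration under simplifying assumptions: a region is a dictionary of item = value equality constraints read as a conjunction, and `populate` here is a hypothetical helper that enumerates the matching transactions, not the operator of the 3W framework.

```python
from typing import Dict, List, Set

# D-world: the raw transactions.
D_WORLD: Dict[str, Set[str]] = {
    "T1": {"Beer", "Potato Chips", "Refreshments", "Nappies"},
    "T2": {"Whisky", "Beer", "Nappies"},
    "T3": {"Detergents", "Broom", "Beer", "Potato Chips"},
    "T4": {"Milk", "Potato Chips", "Tomatoes", "Carrots"},
    "T5": {"Cigarettes", "Meat", "Refreshments"},
    "T6": {"Meat", "Cheese", "Fish", "Refreshments", "Beer", "Potato Chips"},
}

# I-world: the region "Beer = 1 AND Potato Chips = 1".
REGION: Dict[str, int] = {"Beer": 1, "Potato Chips": 1}


def in_region(items: Set[str], region: Dict[str, int]) -> bool:
    """True if the transaction satisfies every constraint of the region."""
    return all((1 if item in items else 0) == value
               for item, value in region.items())


def populate(region: Dict[str, int], d_world: Dict[str, Set[str]]) -> List[str]:
    """E-world: enumerate the transactions that fall inside the region."""
    return [tid for tid, items in d_world.items() if in_region(items, region)]


if __name__ == "__main__":
    print(populate(REGION, D_WORLD))
```

The enumeration yields T1, T3, and T6, matching the E-World listed in the example.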
3-W computations
[Figure: the three worlds and their algebras: dimension algebra (I-world), extended relational algebra (E-world), relational algebra (D-world); the operators mine, populate, lookup, and refresh move between the worlds]
PQL: the dimension algebra
• Selection
– Overlap, containment, disjointness, non-containment
• Projection
– Project out some attributes
• Purge
– Return satisfiable constraints
• Cartesian product
– Pairwise combination of constraints
• Union, Difference
– Absence of the compatibility requirement
• Renaming
Impact on DBMS technology
• Logical optimization for dimension algebra
similar to that for relational algebra
• Linear constraint programming
• First-order logic augmented with linear
polynomial inequalities over reals, and relation
variables with fixed arities
– PTIME
– decidability for equivalence checking
• Spatial database technology for pattern
management
The PANDA framework
(Rizzi et al., ER’01)
• EU Project PAtterns for Next-generation
DAtabase systems (2001-2004)
• Aims
– lay the foundations for pattern modeling
– investigate the main issues involved in
managing and querying a pattern-base
– outline the requirements for building a PBMS
• Preliminary prototype
PANDA: features
• Separated architecture
• Model
– User-defined pattern types
– Hierarchies
– Measures
– Relationship with raw data
• PML
– A-posteriori and a-priori patterns
– Synchronization
– Mining function specification
• PQL
– Selection, projection, …
– Combination
– Similarity
– Cross-over queries
The PANDA architecture
[Figure: layered architecture]
• PBMS
– Class layer: classes such as PatternsOfExperiment145 and MyClustersOnTableEMP
– Type layer: pattern types such as Ass.Rules Type, Dec.Trees Type, Cluster Type, CyclicCluster Type
– Pattern layer: patterns such as Ass.Rule 1, Ass.Rule 2, Ass.Rule 3, DBSCAN Cluster 1, DBSCAN Cluster 2
– Patterns are linked to classes by "member of" and to pattern types by "instance of"
• Intermediate mapping layer: data mining algorithms, pattern recognition algorithms
• Raw data layer: DB1, DB2, flat file
The PANDA model
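[Figure: the three layers of the model]
• Type layer: pattern types (e.g., ass. rule type, cluster type, dec. tree type)
• Pattern layer: patterns, linked to pattern types by instance-of
• Class layer: classes (e.g., supermarket rules, my clusters), linked to pattern types by related-to and to patterns by member-of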
The PANDA model
• Pattern type: name, structure schema, source schema, measure schema, formula, validity period schema
• Pattern (instance-of a pattern type): PID, structure, source, measure, instantiated formula, validity period
• Class (related-to a pattern type; patterns are member-of a class): name
PANDA: Example
[Figure: a cluster annotated with its components: structure (center, radius), data source, measure (average intra-cluster distance), formula]
PANDA: Example
Pattern type
n: Cluster
s: disk: TUPLE(center: TUPLE(CX1: real, CX2: real), rad: real)
d: SET(X1: real, X2: real)
m: AvgIntraClusterDistance: real
f: (X1 - disk.center.CX1)² + (X2 - disk.center.CX2)² ≤ disk.rad²
vp: [DAY, DAY)
Pattern
pid: 337
s: disk: TUPLE(center: TUPLE(CX1: 2, CX2: 3), rad: 4)
d: users(X,Y)
m: AvgIntraClusterDistance: 0.9
f: (X - 2)² + (Y - 3)² ≤ 4²
vp: [01/01/2004, 03/31/2004)
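The components of the cluster pattern with pid 337 map naturally onto a small record type. The following is only a sketch of that mapping, not PANDA's implementation: the class names `Disk` and `ClusterPattern` and the method `valid_on` are hypothetical, the source `d` is kept as a plain string, and the dates are read as month/day/year.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class Disk:
    center: Tuple[float, float]   # (CX1, CX2)
    rad: float


@dataclass(frozen=True)
class ClusterPattern:
    pid: int
    s: Disk                  # structure
    d: str                   # source, kept here as a descriptive string
    m: float                 # measure: AvgIntraClusterDistance
    vp: Tuple[date, date]    # validity period, half-open [start, end)

    def f(self, x: float, y: float) -> bool:
        """Instantiated formula: (X - CX1)^2 + (Y - CX2)^2 <= rad^2."""
        cx, cy = self.s.center
        return (x - cx) ** 2 + (y - cy) ** 2 <= self.s.rad ** 2

    def valid_on(self, day: date) -> bool:
        """True if the day falls inside the validity period."""
        start, end = self.vp
        return start <= day < end


# The pattern from the example.
P337 = ClusterPattern(
    pid=337,
    s=Disk(center=(2, 3), rad=4),
    d="users(X,Y)",
    m=0.9,
    vp=(date(2004, 1, 1), date(2004, 3, 31)),
)
```

Evaluating `P337.f(x, y)` tells whether a point is possibly represented by the pattern without touching the raw data, which is the role the formula plays as an intensional description.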
PANDA: Example
Pattern type
n: AssociationRule
s: TUPLE(head: SET(STRING), body: SET(STRING))
d: BAG(transaction: SET(STRING))
m: TUPLE(confidence: REAL, support: REAL)
f: ∀ x (x ∈ head ∨ x ∈ body → x ∈ transaction)
vp: [DAY, DAY)
Pattern
pid: 512
s: (head = {'Boots'}, body = {'Socks', 'Hat'})
d: SELECT SETOF(article) AS transaction
FROM sales GROUP BY transactionId
m: (confidence = 0.75, support = 0.55)
f: ∀ x (x ∈ {'Boots', 'Socks', 'Hat'} → x ∈ transaction)
vp: [01/01/2004, 03/31/2004)
PANDA: Example
• Pattern type: AssociationRule
• Class: SaleRules
• Dataset 1, mined with Apriori, yields association rules 512, 513, 514:
SELECT SETOF(article) AS transaction
FROM sales_shop1
GROUP BY transactionId
• Dataset 2, mined with Apriori, yields association rules 515, 516, 517:
SELECT SETOF(article) AS transaction
FROM sales_shop2
GROUP BY transactionId
Pattern Space and Data Space
[Figure: the pattern space (pattern types with name, structure schema, source schema, formula; patterns with PID, type, structure, source, formula, validity period) and the data space (datasets), connected through the source and the backward image of the formula]
PANDA: Relationships with raw data
• Source data
– Raw data
– Intensional description inside
patterns
• Extensional representation
– Explicit intermediate mapping
• Intensional representation
– The formula
– Approximated intermediate
mapping
PANDA: Pattern Hierarchies
• Specialization
– Inheritance relationship between pattern type 1 and pattern type 2
• Composition
– Part-of relationship between pattern type 1 and pattern type 2
– Ability to refer to pattern types in the structure schema
• Refinement
– Refined-by relationship between pattern type 1 and pattern type 2
– Ability to refer to pattern types in the source schema
PANDA: Pattern Hierarchies
Example
n: ClusterOfRules
ss: representative: AssociationRule (composition)
ds: SET(rule: AssociationRule) (refinement)
ms: TUPLE(deviationOnConfidence: REAL, deviationOnSupport: REAL)
f: rule.ss.head = representative.ss.head
PANDA: Pattern Validity
• Temporal validity
– Pattern validity with respect to user requirements
– A certain pattern is assumed to be usable in a given interval I
• Semantic validity
– Pattern (measure) validity with respect to data source
[Figure: axes Time and Data; regions labelled Temporal validity, Semantic validity, Safety]
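A small Python sketch of the two validity notions. It assumes that temporal validity is a membership test in the half-open validity period [start, end), and that semantic validity can be approximated by checking the current measures against user-chosen thresholds. Both function names and the threshold values are hypothetical.

```python
from datetime import date


def temporally_valid(when, start, end):
    """True if `when` falls in the half-open validity period [start, end)."""
    return start <= when < end


def semantically_valid(measures, thresholds):
    """True if every measure still reaches its (hypothetical) threshold."""
    return all(measures.get(name, 0.0) >= minimum
               for name, minimum in thresholds.items())


# Validity period and measures of pattern 512 from the earlier example
vp = (date(2004, 1, 1), date(2004, 3, 31))
measures = {"confidence": 0.75, "support": 0.55}
thresholds = {"confidence": 0.7, "support": 0.6}  # assumed for illustration
```

With these assumed thresholds, pattern 512 would be temporally valid on 30 March 2004 but not semantically valid, since its support is below 0.6.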
PANDA PML
[Figure: PML operations between Patterns and Raw data: direct insertion, deletion, extraction, synchronization, recomputation]
PANDA PML
• Both a-priori (direct insertion) and a-posteriori (extraction) patterns, with mining function specification
• Synchronization
– verifies whether an existing pattern still holds with respect to its source data and, if needed, updates the pattern measures
• Recomputation
– like synchronization, but new patterns are generated from the update
[Figure: an UPDATE on the raw data; pattern p1 with components s, d, m, f, pv, where measure m is replaced by m'; patterns P1 and P2 of pattern type PT1; pattern types PT1, PT2, ... with classes C1, C2, ...]
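The difference between synchronization and recomputation can be sketched in Python as follows. This is an illustration under simplifying assumptions, not PANDA's PML: patterns are plain itemsets, the only measure is support, and recomputation is a brute-force search over single items and pairs. All names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions containing the itemset."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def synchronize(patterns, transactions, min_support):
    """Keep the same patterns; refresh their measure and flag validity."""
    refreshed = {}
    for pid, itemset in patterns.items():
        s = support(itemset, transactions)
        refreshed[pid] = {"itemset": itemset, "support": s,
                          "valid": s >= min_support}
    return refreshed


def recompute(transactions, min_support):
    """Extract patterns again from the updated data (items and pairs only)."""
    items = sorted(set().union(*transactions))
    found = {}
    for k in (1, 2):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= min_support:
                found[candidate] = s
    return found


patterns = {1: frozenset({"Beer", "Nappies"})}
updated = [{"Beer", "Chips"}, {"Beer", "Chips"}, {"Beer", "Nappies"}]
synced = synchronize(patterns, updated, 0.5)
fresh = recompute(updated, 0.5)
```

After the update, synchronization only reports that pattern 1 no longer holds, while recomputation also discovers the new itemset {Beer, Chips}.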
PANDA PQL
• Selection
– Predicates over all components, including formula
• Measure projection
– Project out some attributes
• Reconstruction
– Structure manipulation
• Join
• Union, Difference
• Renaming
PANDA PQL: join
• Intersection join
– (p1 ⋈ p2).s = (p1.s, p2.s)
– (p1 ⋈ p2).d = p1.d ∪ p2.d
– (p1 ⋈ p2).f = p1.f ∧ p2.f
• Union join
– (p1 ⋈ p2).s = (p1.s, p2.s)
– (p1 ⋈ p2).d = p1.d ∪ p2.d
– (p1 ⋈ p2).f = p1.f ∨ p2.f
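The two joins can be sketched in Python by treating the formula as a predicate over a record. The Pattern class, the use of record-id sets for the source d, and the sample patterns are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Pattern:
    s: object                    # structure
    d: frozenset                 # source data, here a set of record ids
    f: Callable[[dict], bool]    # formula, as a predicate over a record


def intersection_join(p1, p2):
    # structure is paired, sources are united, formulas are conjoined
    return Pattern((p1.s, p2.s), p1.d | p2.d,
                   lambda r: p1.f(r) and p2.f(r))


def union_join(p1, p2):
    # same structure and source, but formulas are disjoined
    return Pattern((p1.s, p2.s), p1.d | p2.d,
                   lambda r: p1.f(r) or p2.f(r))


older = Pattern("age>30", frozenset({1, 2}), lambda r: r["age"] > 30)
in_ny = Pattern("city=NY", frozenset({2, 3}), lambda r: r["city"] == "NY")
both = intersection_join(older, in_ny)
either = union_join(older, in_ny)
record = {"age": 40, "city": "Rome"}
```

For the sample record, the intersection join's formula is false and the union join's formula is true.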
PANDA PQL: processing
• Structure-based processing
– Usage of pattern structure
– Object-relational processing
– Example: determine all association rules whose Body contains attribute Job
• Approximated processing
– Usage of formula component
– Logical, constraint-based processing
– Example: determine all patterns that represent in an approximate way employees in New York
• Mapping-based processing
– Usage of relationships with raw data
– Data processing
– Example: determine all patterns that represent only employees in New York
Impact on DBMS technology
• Object-relational technology useful for
structure manipulation
• Constraint programming for formulas
• Complexity of the language depends on the
chosen formula language
– Linear constraints: PTIME
• Useful spatial DBMS technology
Theoretical proposals:
architecture and model
Architecture
3W model
PANDA
Inductive DB
Separated
Separated
Integrated
region-based
Hierarchies
Measures
Data source
User-defined types
Validity
Prototype
Theoretical proposals: manipulation
3W model
PML
A-posteriori
(extraction)
A-priori
(direct insertion)
Deletion & update
PANDA
Inductive DB
Extraction as a
query operation
Synchronization
recomputation recomputation
recomputation synchronization
Mining function
Theoretical proposals: querying
Proposals compared: 3W model, PANDA, Inductive DB
PQL: Algebra (3W model); Algebra/calculus (PANDA); Constraint-based, SQL- and OQL-like (Inductive DB)
Combination: Cartesian product (3W model); Join (PANDA)
Similarity
Cross-over queries
Integrated archi
Similarities for patterns
• Computed with respect to either
– Data source represented by patterns
• overhead
– Just patterns: Pattern structure + Pattern measures
• sim(p1,p2) ∈ {0,1}
– p1 and p2 have the same type
• Two main general approaches
– FOCUS (Ganti et al., 1999)
– PANDA approach (Bartolini et al., 2004)
Similarities for patterns: FOCUS
[Figure: p1 and p2 are refined into p1r and p2r; an aggregate function Faggreg over their measures yields sim(p1,p2)]
• Refinement
– Detection of the greatest common refinement (GCR) of
p1.s and p2.s
– Recomputation of p1.m and p2.m over GCR
• Decision trees, clusters, frequent itemsets can be
refined
Similarities for patterns: FOCUS
sim(T1,T2) =
|0.0 -0.0| +|0.0 -0.04|+
|0.1-0.14| + |0.0-0.0| +
|0.0-0.0| + |0.005-0.1| =
0.175
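The sum above can be checked with a few lines of Python. The sketch assumes the six pairs of numbers are the measures of T1 and T2 over the regions of their common refinement, and that the aggregate is a plain sum of absolute differences. The function name is hypothetical.

```python
def aggregate_difference(measures_1, measures_2):
    """Sum of absolute differences between measures, region by region."""
    return sum(abs(a - b) for a, b in zip(measures_1, measures_2))


# Measures of T1 and T2 as they appear in the sum on the slide
t1 = [0.0, 0.0, 0.1, 0.0, 0.0, 0.005]
t2 = [0.0, 0.04, 0.14, 0.0, 0.0, 0.1]
result = aggregate_difference(t1, t2)
```

Up to floating-point rounding, the result matches the value 0.175 reported on the slide.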
Similarities for patterns: PANDA
• No need for refinement
• Applicable also to complex patterns, defined over other patterns
– Clusters of association rules
• Does not necessarily require data access
Outline
• Introduction to pattern management
• Features
– Architectures
– Models
– Languages
• A classification of existing proposals
• Theoretical proposals
• Standards
• Open issues
Standards for patterns: what is
the aim?
• Standard representation purposes for
patterns resulting from data mining and
data warehousing processes
– No generic patterns are supported
• support their exchange between different
architectures
• front-end for pattern management
applications
Standards for patterns: issues
1. Modeling the overall process by which data mining
models are produced, used, and deployed
2. A standard representation for data mining and statistical
models
3. A standard representation for cleaning, transforming, and
aggregating attributes to provide the inputs for data mining
models
4. A standard representation for specifying the settings
required to build models and to use the outputs of models
in other systems
5. Interfaces and Application Programming Interfaces (APIs)
to other languages and Systems (Java & SQL)
6. Standards for viewing, analyzing, and mining remote and
distributed data
Standards & theoretical proposals: an overall picture
[Figure: a pattern engine application on top of a Pattern Base; labels: Web standards, Process standards, Standards for pattern representation, Standard pattern representation, Standard APIs, Theoretical proposals]
Standards for patterns: a classification
• Process standards
– Cross Industry Standard Process for Data Mining (CRISP-DM)
• Standards for pattern representation
– Predictive Model Markup Language (PMML)
– Common Warehouse Metamodel for Data Mining (CWM-DM)
• Standard APIs
– SQL/MM, JDM
• Web standards
– XML for Analysis (XMLA)
Predictive Model Markup Language
(PMML)
• Standardization effort of DMG (Data Mining Group)
• XML-based language for representing data
mining models
• Aim
– support the exchange of data mining models
between different applications and visualization
tools
PMML usage
[Figure: a model in PMML format exchanged among a pattern management layer over a Data & Pattern Base (DBMS), a Data Base (DBMS), and a Pattern Base (PBMS)]
<!-- model in PMML format -->
<PMML version="1.1">
<TreeModel ModelName="golf" ...>
...
<Node score="play">
...
</Node>
...
</TreeModel>
</PMML>
PMML
• Relatively narrow so that it could serve as
common ground for possible subsequent
standards
– Source data
– Mining function
– Parameters for the mining function
• Specification of the pattern type and pattern
instances
PMML: supported patterns
• Association Rules
• Decision Trees
• Center Based Clustering
• Distribution Based Clustering
• (General) Regression
• Neural Networks
• Naive Bayes
• Sequences
• …
PMML: pattern type specification
• Data dictionary
– Describes the attributes of the source data
• Mining schema
– One for each pattern
– For each attribute of the data dictionary specifies
whether it is used by the pattern as an input or an output
• Transformation dictionary
– Defines derived fields
• Model statistics
• Model parameters
– Parameters required by each pattern type
• Mining model and functions
PMML: an example
<?xml version="1.0" ?>
<PMML version="3.0">
<Header copyright="www.dmg.org" description="example model for association rules"/>
<DataDictionary numberOfFields="2">
<DataField name="transaction" optype="categorical" />
<DataField name="item" optype="categorical" />
</DataDictionary>
<AssociationModel modelName="My_Ass_rule"
functionName="associationRules" algorithmName="Apriori"
numberOfTransactions="4" numberOfItems="3"
minimumSupport="0.6" minimumConfidence="0.5" numberOfItemsets="3"
numberOfRules="2">
<MiningSchema>
<MiningField name="transaction" usageType="group" />
<MiningField name="item" usageType="predicted"/>
</MiningSchema>
PMML: an example
<!-- We have three items in our input data -->
<Item id="1" value="Cracker" /> <Item id="2" value="Coke" />
<Item id="3" value="Water" />
<!-- and two frequent itemsets with a single item -->
<Itemset id="1" support="1.0" numberOfItems="1">
<ItemRef itemRef="1" /> </Itemset>
<Itemset id="2" support="1.0" numberOfItems="1">
<ItemRef itemRef="3" /> </Itemset>
<!-- and one frequent itemset with two items. -->
<Itemset id="3" support="1.0" numberOfItems="2">
<ItemRef itemRef="1" /> <ItemRef itemRef="3" />
</Itemset>
<!-- Two rules satisfy the requirements -->
<AssociationRule support="1.0" confidence="1.0" antecedent="1"
consequent="2" />
<AssociationRule support="1.0" confidence="1.0" antecedent="2"
consequent="1" />
</AssociationModel>
</PMML>
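As an illustration of how an application could consume such a document, the following Python sketch reads the items, itemsets and rules with the standard library's xml.etree.ElementTree. The embedded document is a trimmed copy of the example above, and the function name is hypothetical. Real PMML files usually declare an XML namespace, which this sketch assumes to be absent.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example model (no namespace, header omitted)
PMML_DOC = """
<PMML version="3.0">
  <AssociationModel modelName="My_Ass_rule" functionName="associationRules">
    <Item id="1" value="Cracker"/>
    <Item id="2" value="Coke"/>
    <Item id="3" value="Water"/>
    <Itemset id="1" support="1.0" numberOfItems="1">
      <ItemRef itemRef="1"/>
    </Itemset>
    <Itemset id="2" support="1.0" numberOfItems="1">
      <ItemRef itemRef="3"/>
    </Itemset>
    <Itemset id="3" support="1.0" numberOfItems="2">
      <ItemRef itemRef="1"/> <ItemRef itemRef="3"/>
    </Itemset>
    <AssociationRule support="1.0" confidence="1.0" antecedent="1" consequent="2"/>
    <AssociationRule support="1.0" confidence="1.0" antecedent="2" consequent="1"/>
  </AssociationModel>
</PMML>
"""


def read_rules(pmml_text):
    """Return (antecedent items, consequent items, support, confidence)."""
    model = ET.fromstring(pmml_text).find("AssociationModel")
    items = {i.get("id"): i.get("value") for i in model.findall("Item")}
    itemsets = {
        s.get("id"): frozenset(items[r.get("itemRef")]
                               for r in s.findall("ItemRef"))
        for s in model.findall("Itemset")
    }
    # antecedent and consequent refer to itemset ids, not item ids
    return [
        (itemsets[r.get("antecedent")], itemsets[r.get("consequent")],
         float(r.get("support")), float(r.get("confidence")))
        for r in model.findall("AssociationRule")
    ]


rules = read_rules(PMML_DOC)
```

Since the rules reference itemsets, the first rule reads {Cracker} ⇒ {Water}: itemset 2 contains item 3.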
Common Warehouse Metamodel
(CWM)
• Standardization effort of Object Management Group
(OMG)
• A common metamodel of the data warehousing and
business intelligence domains
• Consists of a platform-independent metamodel
definition
• Includes an XML-based interchange format for
metadata
• Also includes a mapping to a platform-independent
API specification (CORBA IDL)
• Tools that standardize on CWM can readily share
metadata via CWM-compliant XML files
CWM architecture
[Figure: CWM architecture; labels: MOF (Meta-Object Facility), UML, CWM, XMI (XML Metadata Interchange), XML document]
CWM Data Mining
• CWM Metamodel consists of a number of sub-metamodels
– Data Resources
– Data Analysis (OLAP, Data Mining, …)
– Warehouse Management
– …
CWM Data Mining
• Three conceptual areas (UML instances)
– Model description
• MiningModel: a representation of the mining model itself
• MiningSettings: driving the construction of the model
• ApplicationInputSpecification: set of input attributes for the model
• MiningModelResult: result set produced by the testing or application of a generated model
– Settings (for the mining functions)
• StatisticsSettings
• ClusteringSettings
• AssociationRulesSettings
• SupervisedMiningSettings (ClassificationSettings, RegressionSettings)
– Attributes
CWM-DM patterns
• Clustering
• Association Rules
• Supervised
– Classification
– Regression
• Statistics
– Attribute Importance
SQL/MM
• SQL Multimedia and Application Packages (ISO
SQL/MM)
– specification for data management of data types relevant in
multimedia and other knowledge intensive applications in SQL-99
• It defines several class libraries of SQL object types
• The structured types defined in such libraries are first-class
SQL types accessed through ordinary SQL:1999
• SQL/MM Parts
– Part 1: Framework
– Part 2: Full-Text
– Part 3: Spatial
– Part 5: Still Image
– Part 6: Data Mining
SQL/MM Part 6
• Standardized interface to data mining
algorithms
• Can be layered on top of any ORDBMS
or even deployed as middleware when
required
• Provides several SQL user-defined types to
support pattern extraction, storage, and
retrieval of common pattern types
SQL/MM architecture
[Figure: a Pattern Base of SQL/MM types stored in an ORDBMS; PML through SQL/MM types; PQL through SQL and SQL/MM types]
SQL/MM: supported patterns
• Association rules
• Clusters
• Regression
• Classification
SQL/MM: supported phases
[Figure: three phases. Training phase: data and settings produce a model. Test phase: produces a test result. Application phase: raw data produce a result. Annotations: source data, pattern type & mining function, patterns]
SQL/MM: types for mining
• DM_*Model
– Defines the model that you want to use when mining your data
• DM_*Settings
– Stores various parameters of the data mining model, e.g. depth of a
decision tree, maximum number of clusters
• DM_*Result
– Sets of patterns created by running a data mining model against real
data
• DM_*TestResult
– Holds the results of testing during the training phase of the data
mining models
• DM_*Task
– Stores the metadata that describe the process and control of the
testing and of the actual runs
* : Clas, Rule, Clustering, Regression
SQL/MM: additional types
• DM_MiningData
– Abstraction for real data contained in tables or views
– It just stores metadata to access the real data sources
and any other information necessary to make the real
data accessible for a later data mining training or test
run (e.g., transformations)
• DM_MiningMapping
– Allows the specification of data mining field related
information (e.g., categorical)
• DM_ApplicationData
– Abstraction for a set of values with associated names
representing a single row of input data
SQL/MM: type interactions
[Figure: diagrams of the interactions among the SQL/MM data mining types]
JDM
• Java Specification Request 73 (JSR-73), also
known as Java Data Mining (JDM)
• Pure Java API to support
– Creation
– Storage
– Access
– Maintenance
of data and metadata supporting data mining
models, data scoring and data mining results
• Input/output in various formats (PMML, CWM-DM)
JDM architecture
[Figure: the JDM API on top of a data mining engine (DME), which accesses the pattern base (Mining Object Repository); together they play the role of a PBMS]
JDM: supported patterns
• Classification
• Regression
• Attribute importance
• Clustering
• Association rules
… and several algorithms …
JDM: supported operations
• Building a Model
– Users define input tasks specifying the parameters model name, mining data
and mining settings
– Specification of pattern type and mining function details
• Testing a Model
– Gives an estimate of the accuracy a model has in predicting the target
• Applying a Model
– Model is applied to a case. Produces one or more predictions or assignments
– Pattern extraction
• Object Import and Export
– Interchange with other DMEs, Persistent storage outside the DME
– Object inspection or manipulation
– To enable import and export of system metadata JDM specifies 2 standards for
defining metadata in XML: PMML and CWM
– A-priori patterns
• Computing statistics on data
– computes various statistics on a given physical data set
– Measure computation
• Verifying task correctness
JDM: an example
// Create the physical representation of the data
PhysicalDataSetFactory pdsf = (PhysicalDataSetFactory)
    dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
PhysicalDataSet bd = pdsf.create(uri, true);
dmeConn.saveObject("myBuildData", bd, false);
// Create the settings to build an association rule model
AssociationSettingsFactory asf = (AssociationSettingsFactory)
    dmeConn.getFactory("javax.datamining.association.AssociationSettings");
AssociationSettings associationSettings = asf.create();
associationSettings.setMaxNumberOfRules(100);
dmeConn.saveObject("myAssBS", associationSettings, false);
JDM: an example
// Create a task to build an association model with data and settings
BuildTaskFactory btf = (BuildTaskFactory)
    dmeConn.getFactory("javax.datamining.task.BuildTask");
BuildTask task = btf.create("myBuildData", "myAssBS", "myAssociationModel");
dmeConn.saveObject("myAssTask", task, false);
// Execute the task and check the status
ExecutionHandle handle = dmeConn.execute("myAssTask");
handle.waitForCompletion(Integer.MAX_VALUE); // wait until done
// check the returned status …
JDM: an example
// Restore an association model to extract association rules
AssociationModel assocModel = (AssociationModel)
    dmeConn.retrieveObject("myAssociationModel");
// Specify rule selection criteria (support >= 30% AND confidence >= 90%)
RulesFilterFactory filterFactory = (RulesFilterFactory)
    dmeConn.getFactory("javax.datamining.association.RulesFilterFactory");
RulesFilter rulesFilter = filterFactory.create();
// The range of the support values is from 0.3 (30%) to 1.0 (100%).
rulesFilter.setRange(RuleProperty.support, 0.3, 1.0);
// The range of the confidence values is 0.9 (90%) to 1.0 (100%).
rulesFilter.setRange(RuleProperty.confidence, 0.9, 1.0);
JDM: an example
// Extract rules from the model using the filtering criteria
Collection rulesCollection = assocModel.getRules(rulesFilter);
Iterator ruleIt = rulesCollection.iterator();
while (ruleIt.hasNext()) {
    AssociationRule r = (AssociationRule) ruleIt.next();
    /* work with the rule retrieved here... */
}
Standards: models
Standards compared: JDM, PMML, CWM-DM, SQL/MM
Features compared: User-defined pattern types, Hierarchies, Measures, Data source, Validity
Standards: manipulation and
querying
Standards compared: SQL/MM, JDM
Languages: SQL (SQL/MM), Java (JDM)
Features compared: A-posteriori, A-priori, Deletion, Synchronization, Mining function, Combination, Similarity, Cross-over queries
Standards: the commercial DBMS
choices
• Most commercial DBMSs have been extended
with data mining functionalities
– Oracle Data Mining
– Microsoft SQL Server 2005 Data Miner
– IBM Intelligent Miner
• They usually provide
– SQL extensions for pattern representation and
manipulation
– Oracle Java API
– Import/Export in PMML for patterns
Outline
• Introduction to pattern management
• Features
– Architectures
– Models
– Languages
• A classification of existing proposals
• Theoretical proposals
• Standards
• Open issues
Where are we now?
• Frameworks more expressive than existing
standard proposals
• Lack in modeling
– No user-defined patterns
– No hierarchies
• Lack in manipulation
– No manipulation of heterogeneous patterns
– Similarity functions
– Pattern combination operators
– Pattern synchronization with source data
Where are we now?
• Are those characteristics really needed?
– Combined efforts with industries for
establishing the real need of those features
What else?
• Measure ontologies
– Pattern comparison based on measures
– Various strategies for measure computation
• general probabilities, Dempster-Shafer, Bayesian
Networks
– Need of measure ontologies for quantitative
pattern reasoning
What else?
• Physical design
– What is a reasonable physical layer for
patterns?
– What are reasonable clustering techniques for
patterns?
– What about reasonable indexing techniques?
What else?
• Query optimization
– Separated architecture
• Data-based computations versus pattern-based
computations
• Heuristics: pattern-based computations are more
efficient
• How is it possible to use patterns to reduce data
access in data and cross-over queries?
• How can data and pattern query processors be
combined?
– Integrated architecture
• extraction optimization
What else?
• Query optimization in integrated
architectures (IDBs)
– Itemsets and association rule mining
– Extraction optimization based on constraints
usage (Ng et al., 1998)
• Anti-monotonic, monotonic, succinct constraints
– Incremental refinement (Baralis, Psaila, 1999)
– Condensed representations (various proposals
of the CINQ consortium)
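To illustrate why anti-monotonic constraints help extraction, here is a minimal levelwise sketch in Python. It is not the algorithm of Ng et al.: it only shows the pruning idea that once an itemset violates an anti-monotonic constraint, none of its supersets need to be generated or counted. The function name, prices and budget are invented for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count, constraint):
    """Levelwise search; `constraint` is assumed to be anti-monotonic."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        kept = []
        for candidate in level:
            if not constraint(candidate):
                continue  # pruned: no superset can satisfy the constraint
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count:
                result[candidate] = count
                kept.append(candidate)
        # next level: unions of two kept sets that grow by exactly one item
        size = len(level[0]) + 1
        level = sorted({a | b for a, b in combinations(kept, 2)
                        if len(a | b) == size}, key=sorted)
    return result


prices = {"Beer": 3, "Chips": 2, "Whisky": 20}   # invented prices


def within_budget(itemset):
    # total price <= 10 is anti-monotonic: adding items never lowers the sum
    return sum(prices[i] for i in itemset) <= 10


transactions = [{"Beer", "Chips"}, {"Beer", "Chips", "Whisky"},
                {"Beer", "Whisky"}]
found = frequent_itemsets(transactions, 2, within_budget)
```

Here Whisky is discarded at the first level, so no itemset containing it is ever counted.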
What else?
• Access control
– Patterns are highly sensitive information
– An authorized access over data may correspond to an
unauthorized access over patterns extracted from
those data
– Instance of the inference problem (Farkas &
Jajodia, 2002)
What else?
• Access control approaches
– Preprocessing techniques: checking through
mining techniques whether it is possible to infer
sensitive data
– Run-time techniques: release patterns only
when they do not represent sensitive
information
– Data modifications: perturbation and sample
size restrictions are applied without disturbing
data mining results
What else?
• Access control
– What happens when a pattern is used against a
dataset which is not the source dataset?
• Cross-over computations may reduce the effect of
existing techniques
Main references (1)
• Agrawal, R., Srikant, R. (1994) Fast Algorithms for Mining Association
Rules in Large Databases. In Proc. of the 20th VLDB, pages 487–499
• Bartolini, I., Ciaccia, P., Ntoutsi, I., Patella, M., Theodoridis, Y. (2004) A
Unified and Flexible Framework for Comparing Simple and Complex
Patterns. In LNAI 3202: Proc. of the 15th ECML/PKDD, pages 496–499.
• Baralis, E. and Psaila, G (1999). Incremental refinement of mining queries. In
LNCS 1676: Proc. of DaWaK’99, pages 173–182.
• Braga, D., Campi, A., Klemettinen, M., Lanzi, P.L. (2002) Mining
Association Rules from XML Data. In Proc. of DaWaK, pages 21–30.
• Catania, B., Maddalena, A., Mazza, M., Bertino, E., Rizzi, S. (2004). A
Framework for Data Mining Pattern Management. In LNAI 3202: Proc. of
the 15th ECML/PKDD, pages 87–98.
• De Raedt, L. (2002). A Perspective on Inductive Databases. ACM SIGKDD
Explorations Newsletter, 4(2), pages 69–77.
• De Raedt, L., Jaeger, M., Lee, S.D., Mannila, H.(2002) A Theory on
Inductive Query Answering. In Proc. of ICDM, pages 123–130.
• Elfeky, M. G., Saad, A., Fouad, S.A. (2001). ODMQL: Object Data Mining
Query Language. Lecture Notes in Computer Science (1944), pages 128–140.
Main references (2)
• Farkas, C., Jajodia, S. (2002) The Inference Problem: a Survey. SIGKDD
Explor. Newsl., 4(2): 6–11.
• Ganti, V., Gehrke, J., Ramakrishnan, R., Loh, W.-Y. (1999) A Framework for
Measuring Changes in Data Characteristics. In Proc. of PODS’99, pages 126–137.
• Han, J., Fu, Y., Wang, W., Koperski, K., Zaiane, O. (1996). DMQL: A data
mining query language for relational databases. In Proc.of ACM SIGMOD'96
Workshop on Research Issues in Data Mining and Knowledge Discovery
(DMKD'96).
• Han,J., Kamber,M. (2001). Data Mining: Concepts and Techniques. Academic
Press.
• Imielinski, T., Mannila, H. (1996). A Database Perspective on Knowledge
Discovery. Communications of the ACM, 39(11): 58–64.
• Imielinski, T. , Virmani, A. (1999). MSQL: A Query Language for Database
Mining. Data Mining and Knowledge Discovery, 2(4): 373–408.
• Johnson,S., Lakshmanan, L.V.S., Ng, R.T. (2000). The 3W Model and
Algebra for Unified Data Mining. In Proc. of VLDB, pages 21–32.
• Lyman, P. and Varian, H. R. (2003). How much information. Available at
http://www.sims.berkeley.edu/how-much-info-2003
Main references (3)
• Meo, R., Psaila, G., Ceri, S. (1996) A New SQL-like Operator for Mining
Association Rules. In Proc. of VLDB, pages 122–133.
• Meo, R., Psaila, G., Ceri,S. (1999). An Extension to SQL for Mining
Association Rules. Data Mining and Knowledge Discovery, 2(2): 195–224.
• Meo, R., Lanzi, P.L., Klemettinen, M. (editors) (2004). Database Support
for Data Mining Applications - Discovering Knowledge with Inductive
Queries. LNAI 2682.
• Ng, R., Lakshmanan, L. V., Han, J., Pang, A. (1998) Exploratory Mining
and Pruning Optimizations of Constrained Associations Rules. In Proc. of
SIGMOD’98, pages 13–24.
• Rizzi, S. et Al. (2003). Towards a Logical Model for Patterns. In Proc. of
the 22nd Int. Conf. on Conceptual Modeling (ER 2003), pages 77–90.
• SIGKDD Explorations (2002). Special Issue on Constraint-Based Mining.
• Smyth, P., Goodman, R.M. (1992) An Information Theoretic
Approach to Rule Induction from Databases. IEEE Transactions on
Knowledge and Data Engineering, 4(4):301–316.
• Theodoridis, Y., Vazirgiannis, M., Vassiliadis, P., Catania, B., Rizzi,
S.(2003) A Manifesto for Pattern Bases. PANDA Technical Report TR2003-03, 2003.
References: standards
• PMML (2003). Predictive Model Markup Language.
http://www.dmg.org/pmml-v3-0.html
• CWM (2001). Common Warehouse Metamodel.
http://www.omg.org/cwm
• MOF (2003). Meta-Object Facility specification.
http://www.omg.org/technology/documents/formal/mof.htm
• XMI (2003) XML Metadata Interchange specification
http://www.omg.org/technology/documents/formal/mof.htm
• Melton, J., Eisenberg, A. (2001) SQL Multimedia and Application
Packages (SQL/MM). SIGMOD Record, 30(4): 97–102.
• JDM (2003). Java Data Mining API.
http://www.jcp.org/jsr/detail/73.prt
References: projects
• CINQ (2001). The CINQ project. http://www.cinq-project.org
– Minerule System (2004). Minerule Mining System (demo version)
http://kdd.di.unito.it/minerule2/demo.html
– MOLFEA (2004) The Molecular Feature Miner based on the LVS
Algorithm. (demo version).
http://www.predictive-toxicology.org/cgi bin/molfea/molfea.cgi
• PANDA (2001). The PANDA Project. http://dke.cti.gr/panda/