Data Mining Primitives
CS 5331 by Rattikorn Hewett
Texas Tech University

Outline
- Motivation
- Data mining primitives
- Data mining query languages
- Designing GUIs for data mining systems
- Architectures
Motivations: Why primitives?
- Data mining systems uncover a large set of patterns – not all of them are interesting
- Data mining should be an interactive process: the user directs what is to be mined
- Users need data mining primitives to communicate with the data mining system, by incorporating them in a data mining query language
- Benefits:
  - More flexible user interaction
  - Foundation for the design of graphical user interfaces
  - Standardization of data mining industry and practice

Data mining primitives
- Data mining tasks can be specified in the form of data mining queries by five data mining primitives:
  - Task-relevant data → input
  - The kind of knowledge to be mined → function & output
  - Background knowledge → interpretation
  - Interestingness measures → evaluation
  - Visualization of the discovered patterns → presentation
Task-relevant data
- Specify the data to be mined:
  - Database, data warehouse, relation, cube
  - Conditions for selection & grouping
  - Relevant attributes

Knowledge to be mined
- Specify the data mining "functions":
  - Characterization/discrimination
  - Association
  - Classification/prediction
  - Clustering
Background Knowledge
- Typically given in the form of concept hierarchies:
  - Schema hierarchy
    - E.g., street < city < state < country
  - Set-grouping hierarchy
    - E.g., {low, high} → all, {30..49} → low, {50..100} → high
  - Operation-derived hierarchy
    - E.g., an email address decomposes as login-name < department < university < organization
  - Rule-based hierarchy
    - E.g., 87 ≤ temperature < 90 → normal_temperature

Interestingness
- Objective measures (illustrated in the sketch below):
  - Simplicity: simpler rules are easier to understand and more likely to be interesting
    - E.g., (association) rule length, (decision) tree size
  - Certainty: validity of the rule
    - A rule A => B has confidence P(B|A) = #(A and B) / #(A)
    - Related notions: classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
  - Utility: potential usefulness
    - E.g., a rule A => B has support #(A and B) / sample size
    - noise threshold (for descriptions)
  - Novelty: not previously known, surprising (used to remove redundant rules)
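To make the certainty and utility measures concrete, here is a minimal Python sketch (toy transactions, all data hypothetical) computing the support and confidence of a rule A => B:

    # Minimal sketch: support and confidence of a rule A => B over
    # a toy transaction set (hypothetical data).
    transactions = [
        {"milk", "bread"},
        {"milk", "bread", "butter"},
        {"bread", "butter"},
        {"milk"},
        {"milk", "bread"},
    ]

    def support(A, B, transactions):
        """#(A and B) / sample size."""
        both = sum(1 for t in transactions if A <= t and B <= t)
        return both / len(transactions)

    def confidence(A, B, transactions):
        """P(B|A) = #(A and B) / #(A)."""
        has_A = [t for t in transactions if A <= t]
        return sum(1 for t in has_A if B <= t) / len(has_A)

    print(support({"milk"}, {"bread"}, transactions))     # 3/5 = 0.6
    print(confidence({"milk"}, {"bread"}, transactions))  # 3/4 = 0.75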
Visualization of Discovered Patterns
- Specify the form in which to view the patterns
  - E.g., rules, tables, charts, decision trees, cubes, reports, etc.
- Specify operations for data exploration at multiple levels of abstraction
  - E.g., drill-down, roll-up, etc.

DMQL (data mining query language)
- A DMQL can provide the ability to support ad hoc and interactive data mining
- By providing a standardized language:
  - Hope to achieve an effect similar to the one SQL had on relational databases
  - Foundation for system development and evolution
  - Facilitates information exchange, technology transfer, commercialization, and wide acceptance
- DMQL is designed around the primitives described earlier
Languages & Standardization Efforts
- Association rule language specifications:
  - MSQL (Imielinski & Virmani '99)
  - MineRule (Meo, Psaila & Ceri '96)
  - Query flocks based on Datalog syntax (Tsur et al. '98)
- OLEDB for DM (Microsoft 2000)
  - Based on OLE, OLE DB, OLE DB for OLAP
  - Integrates DBMS, data warehouse, and data mining
- CRISP-DM (CRoss-Industry Standard Process for Data Mining)
  - Provides a platform and process structure for effective data mining
  - Emphasizes deploying data mining technology to solve business problems

Designing GUIs based on a DMQL
- What tasks should be considered in designing GUIs based on a data mining query language?
  - Data collection and data mining query composition
  - Presentation of discovered patterns
  - Hierarchy specification and manipulation
  - Manipulation of data mining primitives
  - Interactive multilevel mining
  - Other information
Architectures
- Coupling the data mining system with a DB/DW system:
  - No coupling: flat file processing, not recommended
  - Loose coupling: fetching data from the DB/DW (see the sketch below)
  - Semi-tight coupling: enhanced DM performance
    - Provides efficient implementations of a few data mining primitives in the DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
  - Tight coupling: a uniform information processing environment
    - DM is smoothly integrated into the DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
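To make the coupling spectrum concrete, a minimal sketch of loose coupling (database file, schema, and column names are hypothetical): the mining code fetches the task-relevant data with one SQL query and does all the mining outside the DBMS:

    # Minimal sketch of LOOSE coupling (hypothetical schema/file names):
    # fetch task-relevant data via SQL, then mine it outside the DBMS.
    import sqlite3
    from collections import Counter

    conn = sqlite3.connect("big_university.db")  # hypothetical DB file
    rows = conn.execute(
        "SELECT gender, major, gpa FROM student WHERE status = 'graduate'"
    ).fetchall()
    conn.close()

    # The "mining" (here just a frequency count) runs in the application:
    major_counts = Counter(major for _gender, major, _gpa in rows)
    print(major_counts.most_common())

A semi-tight variant would push such aggregation into the DBMS instead, e.g., SELECT major, COUNT(*) FROM student WHERE status = 'graduate' GROUP BY major.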
Concept Description
CS 5331 by Rattikorn Hewett
Texas Tech University
Outline
- Review terms
  - Characterization
  - Summarization
  - Hierarchical generalization
  - Attribute relevance analysis
- Comparison/discrimination
- Descriptive statistical measures

Review terms
- Descriptive vs. predictive data mining:
  - Descriptive: describes the data set in concise, summarative, informative, discriminative forms
  - Predictive: constructs models representing the data set, and uses them to predict the behavior of unknown data
- Concept description involves:
  - Characterization: provides a concise and succinct summarization of a given collection of data
  - Comparison (discrimination): provides descriptions comparing two or more collections of data
Concept Description vs. OLAP
- Concept description:
  - can handle complex data types (e.g., text, image) for attributes and their aggregations
  - a more automated process
- OLAP:
  - restricted to a small number of dimension and measure data types
  - a user-controlled process
Characterization methods
- One approach to characterization is to transform data from low conceptual levels to high ones: "data generalization"
  - E.g., daily sales → annual sales; Biology → Science
- Two methods:
  - Summarization, as in a data cube's OLAP
  - Hierarchical generalization: attribute-oriented induction

Summarization by OLAP
- Data are stored in data cubes
- Identify summarization computations, e.g., count(), sum(), average(), max()
- Perform the computations and store the results in data cubes
- Generalization and specialization can be performed on a data cube by roll-up and drill-down (illustrated in the sketch below)
- An efficient implementation of data generalization
- Limitations:
  - Can handle only simple, non-numeric data types for dimensions
  - Can handle only summarization of numeric data
  - Does not guide users on which dimensions to explore or which levels to reach
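The cube-style summarization above can be mimicked with a group-by; a minimal pandas sketch (hypothetical sales data) of a roll-up from daily to annual sales:

    # Minimal sketch: roll-up from daily to annual sales via group-by
    # (hypothetical data; a real cube would precompute such aggregates).
    import pandas as pd

    daily = pd.DataFrame({
        "date":   pd.to_datetime(["2023-01-05", "2023-07-19", "2024-03-02"]),
        "branch": ["Lubbock", "Lubbock", "Austin"],
        "sales":  [120.0, 80.0, 200.0],
    })

    # Roll-up: generalize date -> year, then aggregate with sum()
    annual = (daily.assign(year=daily["date"].dt.year)
                   .groupby(["year", "branch"], as_index=False)["sales"]
                   .sum())
    print(annual)  # drill-down is the inverse: return to the daily level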
Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop)
- Not confined to categorical data nor to particular measures
- How is it done?
  - Collect the task-relevant data (the initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
  - Interactive presentation with users
Basic Elements
- Data focusing: task-relevant data, including dimensions; the result is the initial relation
- Attribute removal and attribute generalization, for an attribute A with a large set of distinct values:
  - If there is no generalization operator on A, or A's higher-level concepts are expressed in terms of other attributes (giving redundancy), remove A
  - If there exists a set of generalization operators on A, select an operator and generalize A
- Generalization threshold controls:
  - Attribute generalization threshold: bounds the number of distinct values an attribute may keep before it is generalized or removed (~2-8, specified or default)
  - Relation generalization threshold: controls the final relation/rule size (~10-30)

General Steps
- InitialRel: query processing of the task-relevant data, deriving the initial relation
- PreGen: based on the analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: removal? or how high to generalize?
- PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation", accumulating the counts (see the sketch below)
- Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations
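A minimal sketch of the InitialRel → PreGen → PrimeGen pipeline (concept hierarchies, threshold, and data are all hypothetical): climb each attribute's hierarchy while it has too many distinct values, then merge identical generalized tuples, accumulating counts:

    # Minimal sketch of attribute-oriented induction (hypothetical
    # hierarchies, threshold, and data): generalize attributes with too
    # many distinct values, then merge identical tuples with counts.
    from collections import Counter

    # Initial relation: (gender, major, birth_place)
    rows = [("M", "CS", "Vancouver"), ("M", "CS", "Montreal"),
            ("F", "Physics", "Seattle"), ("F", "Biology", "Seattle")]

    # One-level concept hierarchies per column index (hypothetical)
    hierarchy = {
        1: {"CS": "Science", "Physics": "Science", "Biology": "Science"},
        2: {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "Foreign"},
    }
    THRESHOLD = 2  # attribute generalization threshold

    def prime_generalized_relation(rows, hierarchy, threshold):
        rows = [list(r) for r in rows]
        for col in range(len(rows[0])):
            # PreGen: generalize while the attribute has too many values
            while (col in hierarchy
                   and len({r[col] for r in rows}) > threshold
                   and all(r[col] in hierarchy[col] for r in rows)):
                for r in rows:
                    r[col] = hierarchy[col][r[col]]
            # (a real system would REMOVE a column it cannot generalize)
        # PrimeGen: merge identical generalized tuples, accumulate counts
        return Counter(tuple(r) for r in rows)

    for tup, count in prime_generalized_relation(rows, hierarchy, THRESHOLD).items():
        print(tup, "count =", count)
    # -> ('M', 'Science', 'Canada') count = 2
    #    ('F', 'Science', 'Foreign') count = 2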
Example
- DMQL: describe general characteristics of graduate students in the Big-University database:

    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in "graduate"

- Transformed into the corresponding SQL statement:

    Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in {"Msc", "MBA", "PhD"}

Example (cont.)
- Initial relation:

    Name            Gender  Major    Birth_Place            Birth_date  Residence                 Phone #   GPA
    Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
    Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
    Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
    ...             ...     ...      ...                    ...         ...                       ...       ...

- Generalization plan: Name removed; Gender retained; Major generalized to Sci, Eng, Bus; Birth_Place generalized to Birth_country; Birth_date generalized to Age_range; Residence generalized to City; Phone # removed; GPA generalized to Excl, VG, ...
- Prime generalized relation:

    Gender  Major    Birth_country  Age_range  Residence  GPA        Count
    M       Science  Canada         20-25      Richmond   Very-good  16
    F       Science  Foreign        25-30      Burnaby    Excellent  22
    ...     ...      ...            ...        ...        ...        ...
Presentation of results
- Generalized relation: a relation where some or all attributes are generalized, with counts or other aggregation values accumulated
- Cross tabulation: mapping results into a cross tabulation, e.g., by Gender and Birth_Region:

    Gender \ Birth_Region  Canada  Foreign  Total
    M                      16      10       26
    F                      14      22       36
    Total                  30      32       62

- Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms
- Quantitative characteristic rules: mapping the generalized result into characteristic rules with quantitative information attached, e.g., t = typicality (see the sketch below):

    grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
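A short sketch of where the t values come from: the t-weight of each disjunct is its accumulated count divided by the total count for the target class (the counts here are hypothetical, chosen to match the percentages in the rule above):

    # Sketch: deriving t-weights (typicality) for a quantitative
    # characteristic rule. Counts are hypothetical, chosen to match
    # the 53% / 47% figures in the rule above.
    counts = {"Canada": 53, "foreign": 47}  # per-value counts for grad males

    total = sum(counts.values())
    for value, count in counts.items():
        print(f'birth_region(x) = "{value}" [t: {100 * count / total:.0f}%]')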
Analysis of Attribute Relevance
- Goal: filter out statistically irrelevant attributes, or rank attributes for mining
  - Irrelevant attributes → inaccurate or unnecessarily complex patterns
- Idea: compute a measure that quantifies the relevance of an attribute with respect to a given class or concept
- An attribute is highly relevant for classifying/predicting a class if its values are likely to distinguish that class from the others
  - E.g., to describe cheap vs. expensive cars, is "color" a relevant attribute? What about using "color" to compare bananas and apples?

Methods
- Such measures include:
  - Information gain
  - The Gini index
  - Uncertainty
  - Correlation coefficients
Example
- How relevant is the attribute "Major" to the classification of graduate vs. undergraduate students?
- Relevance measure: information gain
- Review formulae:
  - For a set S of samples, each labeled with a class in C, where p_i is the probability that a sample in S belongs to class i:

        Ent(S) = -Σ_{i∈C} p_i log2(p_i)

  - Expected information needed to classify a sample if S is partitioned into sets S_i, one per value i of attribute A:

        I(A) = Σ_{i∈dom(A)} (|S_i| / |S|) Ent(S_i)

  - Information gain: Gain(A) = Ent(S) - I(A)

Example (cont.)
- Data (counts of generalized tuples):

    Gender  Major     Birth_country  Age_range  GPA        Count
    M       Science   Canada         20-25      Very-good  16
    F       Science   Foreign        25-30      Excellent  22
    M       Eng       Foreign        ...        ...        18
    F       Science   Foreign        ...        ...        25
    M       Science   Canada         ...        ...        21
    F       Eng       Canada         ...        ...        18
    M       Science   Foreign        ...        ...        18
    F       Business  Canada         ...        ...        20
    M       Business  Canada         ...        ...        22
    F       Science   Canada         ...        ...        24
    M       Eng       Foreign        ...        ...        22
    F       Eng       Canada         ...        ...        24

  The first six rows are the 120 graduates; the last six are the 130 undergraduates.
- Dom(Major) = {Science, Eng, Business}
- Partition the data into S_Sc, S_Eng, S_Bus: the sets of data points whose Major is Science, Eng, and Business, respectively
Example (cont.)
- Class counts per value of Major:
  - 120 graduates: Science = 84 (= 16+22+25+21), Eng = 36, Business = 0
  - 130 undergraduates: Science = 42, Eng = 46, Business = 42
- Class information captured from S:

    Ent(S) = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988

- Entropy of each partition:

    Ent(S_Sc)  = -(84/126) log2(84/126) - (42/126) log2(42/126) = 0.9183
    Ent(S_Eng) = -(36/82) log2(36/82) - (46/82) log2(46/82) = 0.9892
    Ent(S_Bus) = -(0/42) log2(0/42) - (42/42) log2(42/42) = 0   (taking 0 log2 0 = 0)

- Expected class information induced by attribute "Major":

    I(Major) = (126/250) Ent(S_Sc) + (82/250) Ent(S_Eng) + (42/250) Ent(S_Bus) = 0.7873

- Information gain:

    Gain(Major) = Ent(S) - I(Major) = 0.9988 - 0.7873 = 0.2115

- Similarly, compute Gain(Gender), Gain(Birth_country), Gain(Age_range), and Gain(GPA)
- We can rank the "importance" or degree of "relevance" of attributes by their Gain values
- We can use a threshold to prune out attributes that are less "relevant" (the computation is reproduced in the sketch below)
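A short sketch reproducing the computation above from the same per-Major class counts, so the 0.9988, 0.7873, and 0.2115 figures can be checked:

    # Sketch reproducing Gain(Major) from the per-Major class counts above.
    from math import log2

    def ent(counts):
        """Ent(S) = -sum p_i log2 p_i (0 log2 0 taken as 0)."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    # (graduates, undergraduates) for each value of Major
    partitions = {"Science": (84, 42), "Eng": (36, 46), "Business": (0, 42)}

    S = [sum(g for g, _ in partitions.values()),   # 120 graduates
         sum(u for _, u in partitions.values())]   # 130 undergraduates
    n = sum(S)                                     # 250 students

    I_major = sum(sum(c) / n * ent(c) for c in partitions.values())
    print(round(ent(S), 4), round(I_major, 4), round(ent(S) - I_major, 4))
    # -> 0.9988 0.7873 0.2115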
Class comparison
- Goal: mine properties (or rules) that compare a target class with a contrasting class
- The two classes must be comparable:
  - E.g., address and gender are not comparable; store_address and home_address are comparable; CS students and Eng students are comparable
- Comparable classes should be generalized to the same conceptual level
- Approaches:
  - Use attribute-oriented induction or data cubes to generalize the data for the two contrasting classes, then compare the results
  - Pattern recognition approach: approximate discriminating rules from a data set and repeatedly fine-tune them until the errors are small enough
Descriptive statistical measures
- Data characteristics that can be computed:
  - Central tendency:
    - mean, median
    - When is the "mean" not an appropriate measure? For a very large data set, how do we compute the median?
  - Dispersion:
    - five-number summary: Min, Quartile1, Median, Quartile3, Max
    - variance and standard deviation: spread about the mean. What does var = 0 mean?
  - Outliers:
    - Detected by rules of thumb, e.g., values falling at least 1.5 × (Q3 - Q1) above Q3 or below Q1 (see the sketch below)
- Useful displays:
  - Boxplots, quantile-quantile (q-q) plots, scatter plots, loess curves
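A minimal sketch (hypothetical data) of the five-number summary and the 1.5 × IQR outlier rule of thumb:

    # Minimal sketch: five-number summary and the 1.5 * (Q3 - Q1)
    # outlier rule of thumb, on hypothetical data.
    import statistics

    data = [2, 4, 4, 5, 6, 7, 8, 9, 11, 40]

    q1, median, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    print("five-number summary:", min(data), q1, median, q3, max(data))
    print("outliers:", [x for x in data if x < low or x > high])  # [40]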
References
- E. Baralis and G. Psaila. Designing templates for mining association rules. Journal of Intelligent Information Systems, 9:7-32, 1997.
- Microsoft Corp. OLEDB for Data Mining, version 1.0, http://www.microsoft.com/data/oledb/dm, Aug. 2000.
- J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A data mining query language for relational databases. DMKD'96, Montreal, Canada, June 1996.
- T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
- R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, pages 122-133, Bombay, India, Sept. 1996.
- A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, Seattle, Washington, June 1998.
- D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.