DM.Lab in University of Seoul
An Excel-Based Data Mining Tool
iData Analyzer
Data Mining Laboratory
April 24th, 2008
Summarized by Sungjick Lee
Contents
The iData Analyzer
ESX: A Multipurpose Tool for Data Mining
iDAV Format for Data Mining
An Approach for Unsupervised Clustering
An Approach for Supervised Learning
The iData Analyzer
Scanning for errors
• illegal numeric values
• blank lines
• missing items
An exemplar-based data mining tool that builds a concept hierarchy to generalize data
Allows users to extract a representative subset of the data
• A backpropagation neural network for supervised learning
• A self-organizing feature map for unsupervised clustering
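The three checks listed under "Scanning for errors" can be sketched as follows. This is only an illustration, not iDA's actual preprocessor: the function name `scan_for_errors`, the row layout (lists of strings), and the way problems are reported are all assumptions.

```python
def scan_for_errors(header, rows, numeric_columns):
    """Return (row_number, problem) tuples for the three kinds of error.

    header          : list of attribute names
    rows            : list of rows, each a list of cell strings
    numeric_columns : set of attribute names expected to hold numbers
    All names and the reporting format are hypothetical.
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        cells = [cell.strip() for cell in row]
        # Blank line: every cell is empty.
        if not any(cells):
            problems.append((row_number, "blank line"))
            continue
        # Pad short rows so that absent cells count as missing items.
        cells += [""] * (len(header) - len(cells))
        for name, cell in zip(header, cells):
            if cell == "":
                problems.append((row_number, f"missing item in {name}"))
            elif name in numeric_columns:
                try:
                    float(cell)
                except ValueError:
                    problems.append(
                        (row_number, f"illegal numeric value in {name}: {cell!r}")
                    )
    return problems


header = ["Sex", "Age"]
rows = [["Male", "45"], ["", ""], ["Female", ""], ["Male", "4x"]]
problems = scan_for_errors(header, rows, {"Age"})
for row_number, problem in problems:
    print(row_number, problem)
```

The sketch only reports problems; what the tool does after detecting them is not described in the slides.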
ESX: A Multipurpose Tool for Data Mining (1/2)
• Supports both supervised learning and unsupervised clustering
• Makes no statistical assumptions about the nature of the data
• Provides an automated method for dealing with missing attribute values
• Works in domains containing both categorical and numerical data
• For supervised classification, determines the instances and attributes best able to classify new instances of unknown origin
• For unsupervised clustering, uses a globally optimizing evaluation function that encourages a best instance clustering
ESX: A Multipurpose Tool for Data Mining (2/2)
[Figure: the ESX concept hierarchy and the Report Generator]
Root Level: Root
Concept Level: C1, C2, ..., Cn
Instance Level: I11 I12 . . . I1j; I21 I22 . . . I2k; In1 In2 . . . Inl
Figure labels: summary information about the domain; summary statistics about the attribute values found within the instance level; class resemblance scores; define the concept classes
Report Generator: summary report in spreadsheet format
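The three-level hierarchy in the figure (root, concept level, instance level) can be pictured as a small tree. The sketch below is a hypothetical data structure for illustration only; the class names `Concept` and `ConceptHierarchy` and their methods are assumptions and say nothing about how ESX stores its hierarchy internally.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A concept-level node (C1 ... Cn) with its instance-level children."""
    name: str
    instances: list = field(default_factory=list)


@dataclass
class ConceptHierarchy:
    """The root-level node; its children are the concept-level nodes."""
    concepts: list = field(default_factory=list)

    def instance_count(self):
        # Total number of instance-level nodes below the root.
        return sum(len(concept.instances) for concept in self.concepts)

    def class_sizes(self):
        # Number of instances under each concept-level node.
        return {concept.name: len(concept.instances) for concept in self.concepts}


# Hypothetical example: two concept classes with labels borrowed from the figure.
tree = ConceptHierarchy([
    Concept("C1", ["I11", "I12"]),
    Concept("C2", ["I21", "I22", "I23"]),
])
print(tree.class_sizes())
print(tree.instance_count())
```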
iDAV Format for Data Mining
Table 4.1 • Credit Card Promotion Database: iDAV Format

Income   Magazine   Watch      Life Insurance  Credit Card
Range    Promotion  Promotion  Promotion       Insurance    Sex     Age
C        C          C          C               C            C       R
I        I          I          I               I            I       I
40–50K   Yes        No         No              No           Male    45
30–40K   Yes        Yes        Yes             No           Female  40
40–50K   No         No         No              No           Male    42
30–40K   Yes        Yes        Yes             Yes          Male    43
50–60K   Yes        No         Yes             No           Female  38
20–30K   No         No         No              No           Female  55
30–40K   Yes        No         Yes             Yes          Male    35
20–30K   No         Yes        No              No           Male    27
30–40K   Yes        No         No              No           Male    43
30–40K   Yes        Yes        Yes             No           Female  41
40–50K   No         Yes        Yes             No           Female  43
20–30K   No         Yes        Yes             No           Male    29
50–60K   Yes        Yes        Yes             No           Female  39
40–50K   No         Yes        No              No           Male    55
20–30K   No         No         Yes             Yes          Female  19

Second row (attribute type)
C : categorical (nominal)
R : real-valued (numerical)

Third row (attribute usage)
I : input attribute
U : not used
D : not used for classification or clustering, but attribute value summary information is displayed
O : used as an output attribute
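To make the layout concrete (attribute names in the first row, type codes in the second, usage codes in the third, instances after that), here is a minimal parsing sketch. It assumes the sheet has been exported as comma-separated text, which is not something the slides describe; `parse_idav` and the dictionary layout are hypothetical names chosen for illustration.

```python
import csv
import io

# The first two instances of Table 4.1, written as CSV text (an assumed encoding).
IDAV_TEXT = """Income Range,Magazine Promotion,Watch Promotion,Life Insurance Promotion,Credit Card Insurance,Sex,Age
C,C,C,C,C,C,R
I,I,I,I,I,I,I
40–50K,Yes,No,No,No,Male,45
30–40K,Yes,Yes,Yes,No,Female,40
"""

VALID_TYPES = {"C", "R"}             # categorical, real-valued
VALID_USAGES = {"I", "U", "D", "O"}  # input, unused, display-only, output


def parse_idav(text):
    """Split iDAV-style text into attribute descriptions and instances."""
    rows = list(csv.reader(io.StringIO(text)))
    names, types, usages = rows[0], rows[1], rows[2]
    attributes = []
    for name, kind, usage in zip(names, types, usages):
        if kind not in VALID_TYPES or usage not in VALID_USAGES:
            raise ValueError(f"bad type/usage code for {name!r}: {kind}/{usage}")
        attributes.append({"name": name, "type": kind, "usage": usage})
    instances = []
    for row in rows[3:]:
        record = {}
        for attr, value in zip(attributes, row):
            # Real-valued attributes become floats; categorical ones stay text.
            record[attr["name"]] = float(value) if attr["type"] == "R" else value
        instances.append(record)
    return attributes, instances


attributes, instances = parse_idav(IDAV_TEXT)
print(attributes[-1])
print(instances[0]["Income Range"], instances[0]["Age"])
```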
An Approach for Unsupervised Clustering
1. Enter data into a new Excel spreadsheet
2. Perform a data mining session
3. Read and interpret summary results
4. Read and interpret results for individual clusters
5. Visualize and interpret rules defining the individual clusters
Enter data into a new Excel Spreadsheet
CreditCardPromotion.xls
Perform a data mining session (1/2)
Instance similarity
• A value closer to 100 encourages the formation of new clusters
• A value closer to 0 discourages the formation of new clusters
• If the session produces too many classes (8 here), change the instance similarity value and try again
The similarity criterion for real-valued attributes
• 1.0 is usually appropriate
Perform a data mining session (2/2)
Attribute Significance
(largest class mean - smallest class mean) / domain standard deviation
For age: the largest class mean is Class 1 (43.33) and the smallest class mean is Class 2 (37.00)
Result–RES RUL (the generated production rules)

Rules for Class 1
**Total Percent Coverage = 0.00%

Rules for Class 2
Income Range = "20-30,000"
:rule accuracy 100.00%
:rule coverage 80.00%
19.00 <= Age <= 29.00
:rule accuracy 100.00%
:rule coverage 60.00%
19.00 <= Age <= 29.00
and Income Range = "20-30,000"
:rule accuracy 100.00%
:rule coverage 60.00%
19.00 <= Age <= 29.00
and Magazine Promo = No
:rule accuracy 100.00%
:rule coverage 60.00%
(middle part omitted)
**Total Percent Coverage = 80.00%

Rules for Class 3
Income Range = "30-40,000"
:rule accuracy 80.00%
:rule coverage 57.14%
Magazine Promo = Yes
:rule accuracy 75.00%
:rule coverage 85.71%
Life Ins Promo = Yes
:rule accuracy 77.78%
:rule coverage 100.00%
35.00 <= Age <= 43.00
:rule accuracy 77.78%
:rule coverage 100.00%
(middle part omitted)
**Total Percent Coverage = 100.00%
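The accuracy and coverage figures attached to each rule can be reproduced with a short calculation. The sketch below assumes the usual reading of the two terms: accuracy is the share of instances matching the rule that belong to the class, and coverage is the share of the class that the rule matches. The ages come from Table 4.1; the Class 2 membership is an assumption, since the slides do not list it, and these rows were picked because they give the reported Class 2 mean age of 37.00.

```python
# Ages from Table 4.1, in row order.
ages = [45, 40, 42, 43, 38, 55, 35, 27, 43, 41, 43, 29, 39, 55, 19]

# ASSUMPTION: 0-based row indices taken to form Class 2 (not given in the slides).
class2_rows = {5, 7, 11, 13, 14}


def rule_stats(matches, members):
    """Return (accuracy %, coverage %) for a rule.

    matches : set of row indices that satisfy the rule's conditions
    members : set of row indices that belong to the class
    """
    hits = matches & members
    accuracy = 100.0 * len(hits) / len(matches) if matches else 0.0
    coverage = 100.0 * len(hits) / len(members) if members else 0.0
    return accuracy, coverage


# Rule: 19.00 <= Age <= 29.00
matches = {i for i, age in enumerate(ages) if 19 <= age <= 29}
accuracy, coverage = rule_stats(matches, class2_rows)
print(f"rule accuracy {accuracy:.2f}%")
print(f"rule coverage {coverage:.2f}%")
```

Under these assumptions the result agrees with the 100.00% accuracy and 60.00% coverage reported for this rule.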
Result–RES SUM (summary statistics) (1/2)
Resemblance Score
Are the within-class resemblance scores higher than the domain resemblance value?
If not, why?
• Bad choice of attributes
• Bad choice of instances
• The domain does not contain definable classes
Attribute Significance
(largest class mean - smallest class mean) / domain standard deviation
For age: (43.33 [Class 1] - 37.00 [Class 2]) / 9.51 ≈ 0.67
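As a worked check of the attribute significance for age, the calculation can be run on the ages from Table 4.1. The class means are the values quoted on the slide. Using the sample standard deviation (dividing by n - 1) is an assumption; the slides do not say which form is used, but it reproduces the quoted 9.51.

```python
import statistics

# Ages from Table 4.1, in row order.
ages = [45, 40, 42, 43, 38, 55, 35, 27, 43, 41, 43, 29, 39, 55, 19]

# ASSUMPTION: sample standard deviation (n - 1 in the denominator).
domain_sd = statistics.stdev(ages)

# Class means as quoted on the slide.
largest_class_mean = 43.33   # Class 1
smallest_class_mean = 37.00  # Class 2

significance = (largest_class_mean - smallest_class_mean) / domain_sd
print(f"domain standard deviation: {domain_sd:.2f}")
print(f"attribute significance:    {significance:.2f}")
```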
Result–RES CLS (statistics about the individual classes) (1/2)
Typicality
• the average similarity of an instance to all other members of its cluster
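The typicality definition can be sketched directly. The similarity measure below (the fraction of attributes on which two instances agree) is a stand-in chosen for illustration, not the similarity function ESX uses, and the three-instance cluster is hypothetical.

```python
def similarity(a, b):
    """ASSUMED similarity: fraction of attributes with equal values."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def typicality(index, cluster):
    """Average similarity of cluster[index] to all other cluster members."""
    others = [member for i, member in enumerate(cluster) if i != index]
    if not others:
        raise ValueError("typicality needs a cluster with at least two members")
    return sum(similarity(cluster[index], other) for other in others) / len(others)


# Hypothetical cluster: (Magazine Promotion, Watch Promotion, Sex)
cluster = [
    ("Yes", "No", "Male"),
    ("Yes", "No", "Female"),
    ("No", "No", "Male"),
]
for i in range(len(cluster)):
    print(i, round(typicality(i, cluster), 3))
```

In this toy cluster the first instance shares two of three values with each of the others, so it is the most typical member.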
Predictability
• the degree to which a correct forecast can be made
• the percent of instances within a class that have the attribute value
• a within-class measure
• If ‘1’, the value is necessary
Predictiveness
• the state of being predicted
• the probability that an instance resides in the class, given that it has the attribute value
• a between-class measure
• If ‘1’, the value is sufficient
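The two measures differ only in what they divide by, which a small sketch makes visible. The function names and the toy values are hypothetical; the formulas follow the definitions above (a within-class share and a between-class share).

```python
def predictability(value, class_values):
    """Share of the class's instances that have this attribute value."""
    return class_values.count(value) / len(class_values)


def predictiveness(value, class_values, domain_values):
    """Share of all instances with this value that fall in the class."""
    total = domain_values.count(value)
    return class_values.count(value) / total if total else 0.0


# Hypothetical values of one categorical attribute.
class_values = ["Yes", "Yes", "Yes"]                      # instances in the class
domain_values = ["Yes", "Yes", "Yes", "Yes", "No", "No"]  # all instances, class included

print(predictability("Yes", class_values))                 # within-class measure
print(predictiveness("Yes", class_values, domain_values))  # between-class measure
```

Here every class member has the value "Yes", so its predictability is 1 and the value is necessary for membership. One instance outside the class also has it, so its predictiveness is below 1 and the value is not sufficient.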
Result–RES CLS (statistics about the individual classes) (2/2)
Highly: a score greater than or equal to 0.80
An Approach for Supervised Learning
1. Enter data into a new Excel spreadsheet and choose an output attribute
2. Perform a data mining session
3. Read and interpret summary results
4. Read and interpret test set results
5. Read and interpret results for individual classes
6. Visualize and interpret class rules