Download Roiger_DM_ch04 - Gonzaga University

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
Chapter 4
An Excel-based Data Mining Tool
(iData Analyzer)
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
[email protected]
A/W & Dr. Chen, Data Mining
Objectives
• This chapter will introduce you the iData Analyzer(iDA)
and how to use two of learner models contained in your
iDA software of data mining tools.
• In Section 4.1 overviews the iDA Model for Knowledge
Discovery.
• In Section 4.2, introduces an exemplar-based data
mining tool, ESX, capable of both supervised learning
and unsupervised clustering.
• The way of representing datasets and how to use ESX to
perform unsupervised clustering and building supervised
learning models and others will be also introduced in
this chapter.
2
A/W & Dr. Chen, Data Mining
4.1 The iData Analyzer
• iDA provides support for the business or technical
analyst by offering a visual learning environment,
an integrated tool set, and data mining process
support.
• iDA consists of the following components:
–
–
–
–
–
–
Preprocessor
Heuristic agent (for larger Large Dataset)
ESX
Neural Network
Rule Maker
See p.107 and Appendix A-2 for the
instructions of installation
Report Generator
3
A/W & Dr. Chen, Data Mining
Limitations
• The commercial version of iDA is bounded by the
size of a single MS Excel spreadsheet, i.e., up to
65,536 rows and 256 columns
• The iDA input format uses the first three rows of a
spreadsheet to house information about individual
attributes
– Up to 65,533 data instances in attribute-value format
can be mined
– The student version allows a maximum of 7,000 data
instances (i.e., 7003 rows)
After completing the installation if the security setting is high,
you should change it to medium and click OK.
A/W & Dr. Chen, Data Mining
4
Interface
Figure 4.1 – The iDA system
architecture
Data
PreProcessor
Large
Dataset
Yes
Heuristic
Agent
No
Mining
Technique
Neural
Networks
Explaination
No
ESX
Yes
Generate
Rules
Yes
RuleMaker
Rules
No
Report
Generator
Excel
Sheets
A/W & Dr. Chen, Data Mining
5
6
A/W & Dr. Chen, Data Mining
4.2 ESX: A Multipurpose Tool for
Data Mining
• ESX can help create target data, find irregularities
in data, perform data mining, and offer insight
about the practical value of discovered knowledge.
• Features of ESX learner model are:
– It supports both supervised learning and unsupervised
clustering
– It supports an automated method for dealing with
missing attribute value
– It does not make statistical assumptions about the
nature of data to be processed
– It can point out inconsistencies and unusual values in
data
A/W & Dr. Chen, Data Mining
7
Figure 4.3 An ESX concept hierarch
Root Level
Concept Level
Instance Level
Root
C1
I11 I12 . . . I1j
C2
I21 I22 . . . I2k
...
Cn
In1 In2 . . . Inl
8
A/W & Dr. Chen, Data Mining
4.3 iDAV Format for Data Mining
Second Row: C: categorical; R: real-valued
Third Row (see Table 4.2 below)
Table 4.2 – Values for Attribute Usage
Character
Usage
I
The attribute is used as an input attribute
U
The attribute is not used (categorical attribute with several unique
values are of little predictive value)
D
The attribute is not used for classification or clustering, but attribute
value summary information is displayed in all output reports
O
The attribute is used as an output attribute. For supervised learning with
ESX exactly one categorical attribute is selected as the output attribute.
9
A/W & Dr. Chen, Data Mining
Table 4.1 – Credit Card Promotion Database: iDAV Format
Income Range Magazine Promo Watch Promo Life Ins Promo Credit Card Ins.
C
C
C
C
C
I
I
I
I
I
40-50,000
Yes
No
No
No
30-40,000
Yes
Yes
Yes
No
40-50,000
No
No
No
No
30-40,000
Yes
Yes
Yes
Yes
50-60,000
Yes
No
Yes
No
20-30,000
No
No
No
No
30-40,000
Yes
No
Yes
Yes
20-30,000
No
Yes
No
No
30-40,000
Yes
No
No
No
30-40,000
Yes
Yes
Yes
No
40-50,000
No
Yes
Yes
No
20-30,000
No
Yes
Yes
No
50-60,000
Yes
Yes
Yes
No
40-50,000
No
Yes
No
No
20-30,000
No
No
Yes
Yes
Sex
C
I
Male
Female
Male
Male
Female
Female
Male
Male
Male
Female
Female
Male
Female
Male
Female
Age
R
I
45
40
42
43
38
55
35
27
43
41
43
29
39
55
19
10
A/W & Dr. Chen, Data Mining
4.4 A Five-step Approach for
Unsupervised Clustering
Step 1: Enter the Data to be Mined
Step 2: Perform a Data Mining Session
Step 3: Read and Interpret Summary Results
Step 4: Read and Interpret Individual Class Results
Step 5: Visualize Individual Class Rules
A/W & Dr. Chen, Data Mining
Step 1: Enter The Data To Be Mined
12
A/W & Dr. Chen, Data Mining
Step 2: Perform A Data Mining Session
13
A/W & Dr. Chen, Data Mining
Figure 4.5 – Unsupervised settings for ESX (#4,p.116)
Value for instance similarity:
A value closer to 100 encourages the formation of new clusters
A value closer to 0 favors new instances to enter existing clusters
The real-valued tolerance setting helps determine the similarity criteria
for real-valued attributes. A setting of 1.0 is usually appropriate.
14
A/W & Dr. Chen, Data Mining
#6 A message box indicating that eight clusters were formed.
This tells us the data has been successfully mine.
15
A/W & Dr. Chen, Data Mining
#6, #7 (p.116)As a general rule, an unsupervised clustering
of more than five or six clusters is likely to be less than
optimal.
16
A/W & Dr. Chen, Data Mining
#8 and #9, Repeat steps 1-4. For step 5, set the similarity value to 55
17
A/W & Dr. Chen, Data Mining
Re-rule feature
Covering set rules:
RuleMaker will generate a
set of best-defining rules for
each class.
Minimum correctness rule (50-100): if 80, the rules generated must have an
error rate less than or equal to 20%
Minimum coverage (10-100): if 10, RuleMaker will generate rules that cover
10% or more of the instances in each class.
Attribute significance (start with 80-90): values close to 100 will allow
RuleMaker to consider only those attribute values most highly predictive of
class membership for rule generation.
18
A/W & Dr. Chen, Data Mining
#10 (p.117) Set minimum rule coverage at 30
30
Minimum correctness rule (50-100): if 80, the rules generated must have an
error rate less than or equal to 20%
Minimum coverage (10-100): if 10, RuleMaker will generate rules that cover
10% or more of the instances in each class.
Attribute significance: values close to 100 will allow RuleMaker to consider
only those attribute values most highly predictive of class membership for
rule generation.
19
A/W & Dr. Chen, Data Mining
A Production Rule for the
Credit Card Promotion Database
IF Sex = Female & 19 <=Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%
Question: Can we assume that two-thirds of all females in
the specified age range will take advantage of the
promotion?
• Rule accuracy is a between-class measure.
• Rule coverage is a within-class measure.
A/W & Dr. Chen, Data Mining
Output Reports:
Unsupervised Clustering
• RES SUM: This sheet contains summary statistics
about attribute values and offers several heuristics
to help us determine the quality of a data mining
session.
• RES CLS: this sheet has information about the
clusters formed as a result of an unsupervised
mining session
• RUL TYP: Instances are listed by their cluster
number. The typicality of instance i is the average
similarity of i to the other members of its cluster.
• RES RUL: The production rules generated for each
cluster are contained in this sheet.
21
A/W & Dr. Chen, Data Mining
#10 (p.117) Set minimum rule coverage at 30
22
A/W & Dr. Chen, Data Mining
Figure 4.7 Rules for the credit card promotion database
23
A/W & Dr. Chen, Data Mining
Step 3: Read and Interpret
Summary Results (p.117)
(Sheet1 RES SUM)
• Class Resemblance Scores (RES)
• Domain Resemblance Score
• Domain Predictability
24
A/W & Dr. Chen, Data Mining
Step 3: Read and Interpret Summary Results (p.119)
Instances of Class 1
have a best withinclass fit
Similarity value
(within the class)
In general, the within-class RES scores should be higher than the domain RES. It should be true for most of the 25
classes.
A/W & Dr. Chen, Data Mining
Figure 4.9 - Step 3: Read and Interpret Summary Results (cont.)
26
A/W & Dr. Chen, Data Mining
Figure 4.9 -Statistics for numerical attributes and common categorical attribute values
Step 3: Read and Interpret Summary Results (cont.)
27
A/W & Dr. Chen, Data Mining
Step 4: Read and Interpret
Individual Class Results (p.121)
(Sheet1 RES CLS)
• Typicality
– is defined as the average similarity of an instance to all
other members of its cluster or class
• Class Predictability is a within-class measure.
– the percent of class instances having a particular value for a
categorical attribute
• Class Predictiveness is a between-class measure
– it is defined as probability an instance resides in a specified class
given the instance has the value for the chosen attribute
28
A/W & Dr. Chen, Data Mining
Figure 4.10 – Class 3 Summary Results
within-class
between-class
29
A/W & Dr. Chen, Data Mining
Figure 4.11 – Necessary and sufficient attribute values for Class 3
30
A/W & Dr. Chen, Data Mining
Step 5: Visualize Individual Class Rules
IF life ins Promo = Yes
THEN Class = 3
:rule accuracy 77.78%
:rule coverage 100.00%
31
A/W & Dr. Chen, Data Mining
4.5 A Six-Step Approach for
Supervised Learning
• Step 1: Choose an Output Attribute
– Launch a fresh life insurance promotion
• Step 2: Perform the Mining Session
• Step 3: Read and Interpret Summary
Results
• Step 4: Read and Interpret Test Set Results
• Step 5: Read and Interpret Class Results
• Step 6: Visualize and Interpret Class Rules
32
A/W & Dr. Chen, Data Mining
Step 2: Perform the Mining Session
Filename: CreditCardPromotion-supervised.xls
O: output; D: Display-Only
33
A/W & Dr. Chen, Data Mining
Step 2(#4): Select the number of instances for training and a real-valued tolerance setting (p.127)
34
A/W & Dr. Chen, Data Mining
Step 3 – Read and Interpret Summary Results
Domain statistics for
categorical attributes tells
us that 80% of the training
instances represent
individuals without credit
card insurance.
35
A/W & Dr. Chen, Data Mining
Step 3 – Read and Interpret Summary Results (cont.)
36
A/W & Dr. Chen, Data Mining
Step 4 - Read and Interpret Test Set Results
37
A/W & Dr. Chen, Data Mining
Step 5 - Read and Interpret Results for Individual Classes (p.130)
38
A/W & Dr. Chen, Data Mining
Sheet1 RUL TYP
In Class Yes (Life Ins. Promo)
Instances of Credit Card Ins = Yes is 40% (2/5)
39
A/W & Dr. Chen, Data Mining
Step 6 – Visualize and Interpret Class Rules (p.130)
Re-rule feature
Covering set rules:
RuleMaker will generate a
set of best-defining rules for
each class.
Minimum correctness rule (50-100): if 80, the rules generated must have an
error rate less than or equal to 20%
Minimum coverage (10-100): if 10, RuleMaker will generate rules that cover
10% or more of the instances in each class.
Attribute significance (start with 80-90): values close to 100 will allow
RuleMaker to consider only those attribute values most highly predictive of
class membership for rule generation.
40
A/W & Dr. Chen, Data Mining
4.6 Techniques for Generating
Rules
1.
2.
3.
4.
5.
Define the scope of the rules.
Choose the instances.
Set the minimum rule correctness.
Define the minimum rule coverage.
Choose an attribute significance value.
A/W & Dr. Chen, Data Mining
4.7 Instance Typicality
Typicality Scores
• Identify prototypical and outlier instances.
• Select a best set of training instances.
• Used to compute individual instance
classification confidence scores.
A/W & Dr. Chen, Data Mining
Figure 4.13 Instance Typicality
43
A/W & Dr. Chen, Data Mining
4.8 Special Considerations and
Features
• Avoid Mining Delays
• The Quick Mine Feature
– Supervised with more than 2000 training set
instances, “quick mine” feature will be asked
– Unsupervised with more than 2000 data
instances. ESX is given a random selection of
500 instances.
• Erroneous and Missing Data
44
A/W & Dr. Chen, Data Mining
Homework
• Use EXS (and iDA) to perform a supervised data
mining session using the CardiologyCategorical.xls
data file.
• Save output file as
CardiologyCategorical-supervised.xls
• Lab#4 (p.141)
• Turn in
– 1. Spreadsheet file (CardiologyCategoricalsupervised.xls) that contains the outcome of data mining
session
– 2. Word file that includes (and explains) answers to all
questions (a. thru n.)
45
A/W & Dr. Chen, Data Mining