Exam 3 Review
SAS
Decision Trees
Cluster Analysis
Association Rules
Data Visualization
SAS
•  When to Use Which Analysis (D, C or A)?
–  When someone gets an A in this class, what other classes do they get
an A in?
–  What predicts whether a company will go bankrupt?
–  If someone upgrades to an iPhone, do they also buy a new case?
–  Which party will win the election?
–  Can we group our website visitors into types based on their online
behaviors?
–  Which customers will purchase our product?
–  Can we identify different product markets based on customer
demographics?
SAS
•  When to Use Which Analysis (D = Decision Trees, C = Cluster Analysis, A = Association Rules)?
–  When someone gets an A in this class, what other classes do they get
an A in? (A)
–  What predicts whether a company will go bankrupt? (D)
–  If someone upgrades to an iPhone, do they also buy a new case? (A)
–  Which party will win the election? (D)
–  Can we group our website visitors into types based on their online
behaviors? (C)
–  Which customers will purchase our product? (D)
–  Can we identify different product markets based on customer
demographics? (C)
Decision Trees
•  Which is the Root Node?
•  # Leaf Nodes?
Decision Trees
•  Which is the Root Node?
–  Height (the first split in the tree)
•  # Leaf Nodes?
–  5 (numbered 1 to 5 on the slide)
•  Probability of Purchase?
i) Female, 130 lbs, 12 ft? ii) 120 lbs, 5 feet, male?
•  Best predictor variable?
[Figure: decision tree predicting Outcome (1 = purchase, 0 = no purchase)]
Height
   < 6 ft: Weight
      < 150 lbs: leaf, 0 = 62%, 1 = 38%, n = 350
      >= 150 lbs: leaf, 0 = 55%, 1 = 45%, n = 250
   >= 6 ft: Weight
      < 170 lbs: Gender (node data: 0 = 40%, 1 = 60%, n = 150)
         Male: leaf, 0 = 45%, 1 = 55%, n = 75
         Female: leaf, 0 = 35%, 1 = 65%, n = 75
      >= 170 lbs: leaf, 0 = 60%, 1 = 40%, n = 250
•  Probability of Purchase?
i) Female, 130 lbs, 12 ft? ii) 120 lbs, 5 feet, male?
•  Best predictor variable?
[Figure: the decision tree with the answers highlighted: i) 65% (Height >= 6 ft, Weight < 170 lbs, Female leaf); ii) 38% (Height < 6 ft, Weight < 150 lbs leaf)]
•  Probability of Purchase?
i) 5 ft 5 inches?
ii) 6 ft 5 inches 190 lbs?
[Figure: decision tree splitting on Height (6 ft), then Weight (150 lbs or 170 lbs), then Gender]
Decision Trees
•  What does it mean that Gender is only on the right side of
the tree? Why is it not on both sides?
•  Based on the tree, which demographic is MOST likely to buy
the product? Least likely to buy the product?
Decision Trees
•  What does it mean that Gender is only on the right side of the
tree? Why is it not on both sides?
–  Gender only has predictive/explanatory power for customers who are
at least 6 feet tall and under 170 lbs.
–  That is, in the other subsets of the population, splitting on gender does
no better than chance at predicting behavior.
•  Based on the tree, which demographic is MOST likely to buy the
product? Least likely to buy the product?
–  Most likely, biggest leaf node probability of 1: 6 ft or taller, below
170 lbs, female (1 = 65% probability)
–  Least likely, biggest leaf node probability of 0: below 6 ft, below
150 lbs (0 = 62% probability, so 1 = 38%)
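The tree can also be read as a set of nested if/else rules. The sketch below is one way to write that down; the thresholds (6 ft, 150 lbs, 170 lbs) and the leaf percentages are a reconstruction read off the slide's tree figure, and the function name `purchase_probability` is hypothetical.

```python
def purchase_probability(height_ft, weight_lbs, gender=None):
    """Return P(Outcome = 1) for the node a customer falls into.

    Thresholds and percentages are read from the slide's tree figure;
    treat them as a reconstruction, not verified course data.
    """
    if height_ft < 6:
        # Left side of the tree: split on weight at 150 lbs.
        return 0.38 if weight_lbs < 150 else 0.45
    # Right side (6 ft or taller): split on weight at 170 lbs.
    if weight_lbs >= 170:
        return 0.40
    # Only this subgroup is split further, by gender.
    if gender == "female":
        return 0.65
    if gender == "male":
        return 0.55
    return 0.60  # gender unknown: fall back to the Gender node itself


print(purchase_probability(6.5, 160, "female"))  # most likely group
print(purchase_probability(5.0, 120, "male"))    # least likely group
```

Gender appears in only one branch of the code, which mirrors why it appears on only one side of the tree.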
Decision Trees
•  What Statistics are Used to Determine Splits for Decision Trees?
–  Gini Coefficient, Chi-Square Statistics (p-value)
•  What does it mean when the Gini = 1?
•  What does it mean when the Chi-square is bigger?
•  What happens to the p-value as the Chi-square gets bigger?
Decision Trees
•  What Statistics are Used to Determine Splits for Decision Trees?
–  Gini Coefficient, Chi-Square Statistics (p-value)
•  What does it mean when the Gini = 1?
–  The predictor is no better than flipping a coin (you want a small Gini)
•  What does it mean when the Chi-square is bigger?
–  The variable is better at predicting the outcome (you want a big Chi-square)
•  What happens to the p-value as the Chi-square gets bigger?
–  The p-value gets smaller as the Chi-square gets bigger (you want a small
p-value)
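A small sketch of these statistics, using made-up counts rather than course data. It assumes the common Gini impurity definition, 1 minus the sum of squared class proportions; under that definition a two-class node peaks at 0.5 for a 50/50 mix, so the slide's "Gini = 1" presumably refers to a rescaled version of the index (that reading is an assumption). The p-value formula below holds only for 1 degree of freedom, which is the case for a 2x2 table.

```python
import math


def gini_impurity(proportions):
    """Gini impurity 1 - sum(p^2); 0 means the node is pure."""
    return 1.0 - sum(p * p for p in proportions)


def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: rows = the two branches of a split, columns = (buy, no buy).
strong_split = [[30, 70], [60, 40]]
weak_split = [[45, 55], [50, 50]]

print("50/50 node:", gini_impurity([0.5, 0.5]))
print("pure node: ", gini_impurity([1.0, 0.0]))
for name, table in [("strong", strong_split), ("weak", weak_split)]:
    stat = chi_square_2x2(table)
    print(name, "chi-square =", round(stat, 2), "p-value =", p_value_df1(stat))
```

The split whose branches differ more in purchase rate gets the larger chi-square and the smaller p-value.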
Clustering
•  What statistics do we care about in cluster analysis? What
do they represent?
•  What happens to these statistics as the number of
clusters is increased?
•  Why do we standardize data? Why do we eliminate outliers?
Clustering
•  What statistic do we care about in cluster analysis? What does it
represent?
–  Sum of Squared Errors ‒ SSE (or Root Mean Square Std Dev.)
–  Within SSE = cohesion, Between SSE = distinctiveness
•  What happens to these statistics as the number of clusters is
increased?
–  SSE goes down (both within and between)
–  More cohesive clusters, less distinct though
•  Why do we standardize data? Why do we eliminate outliers?
–  Standardize; otherwise variables with bigger values will have greater weighting
–  Eliminate outliers because they can skew results
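A minimal sketch of two of the points above: why standardizing matters, and how within-cluster SSE shrinks as clusters are added. The customer data and the hand-picked cluster assignments are hypothetical, and the helper names `standardize` and `within_sse` are illustrative; no clustering algorithm is run here.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw values to z-scores (mean 0, standard deviation 1)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def within_sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    sse = 0.0
    for members in clusters.values():
        centroid = [mean(dim) for dim in zip(*members)]
        for point in members:
            sse += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return sse


# Hypothetical customers: age in years, income in dollars.
ages = [25, 27, 52, 55]
incomes = [40000, 90000, 42000, 88000]
raw = list(zip(ages, incomes))
scaled = list(zip(standardize(ages), standardize(incomes)))

by_age = [0, 0, 1, 1]      # young vs. old
by_income = [0, 1, 0, 1]   # low vs. high income

# On raw data, income dominates the SSE because its values are larger.
print("raw:   ", within_sse(raw, by_age), within_sse(raw, by_income))
# After standardizing, both variables carry similar weight.
print("scaled:", within_sse(scaled, by_age), within_sse(scaled, by_income))

# Within-cluster SSE for 1, 2 and 4 clusters on the standardized data.
for labels in ([0, 0, 0, 0], by_income, [0, 1, 2, 3]):
    print(len(set(labels)), "cluster(s):", round(within_sse(scaled, labels), 3))
```

With every point in its own cluster the within-cluster SSE reaches zero, which is why a lower SSE alone does not make a solution better.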
Clustering
•  What are the pros and cons of having only a few clusters
(compared to having many clusters)?
•  What is bad about the below cluster analysis result?
How would you improve it?
Clustering
•  What are the pros and cons of having only a few clusters
(compared to having many clusters)?
–  Easier to interpret/analyze, but they may be less informative
•  What is bad about the below cluster analysis result?
How would you improve it?
–  Clusters should be fairly round!
–  Add more clusters.
Association Rules
•  How would you describe the following association rule?
–  {Meat, Dairy} → {Vegetables}
•  How many items are in this item set?
•  What is (are) the antecedents? What are the consequents?
•  What are the statistics we care about when evaluating an
association rule?
Association Rules
•  How would you describe the following association rule?
–  {Meat, Dairy} → {Vegetables}
–  When someone eats meat and dairy, they tend to also eat vegetables.
•  How many items are in this item set?
–  This is a 3-item set.
•  What is (are) the antecedents? What are the consequents?
–  Meat and Dairy are the antecedents, vegetables is the consequent.
•  What are the statistics we care about when evaluating an
association rule?
–  Support count, Support Percent, Confidence and Lift
Association Rules
•  Do the following two rules have to have the same
Confidence? The same Support? The same Lift?
–  {Meat, Dairy} → {Vegetables}
–  {Vegetables} → {Meat, Dairy}
•  What does Lift > 1 mean? Would you take action on such a
rule?
–  What about Lift < 1?
–  What about Lift = 1?
Association Rules
•  Do the following two rules have to have the same Confidence
(No)? The same Support (Yes)? The same Lift (Yes)?
–  {Meat, Dairy} → {Vegetables}
–  {Vegetables} → {Meat, Dairy}
•  What does Lift > 1 mean? Would you take action on such a rule?
–  More co-purchase observed than chance would predict (+ association)
–  What about Lift < 1? Less than chance predicts (- association)
–  What about Lift = 1? Chance explains the observed co-purchase
(no apparent association)
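The sketch below computes support, confidence and lift for a rule and for its reverse. The baskets are invented for illustration and the function names are hypothetical; the formulas are the standard ones (confidence = support of both sides / support of the antecedent, lift = confidence / support of the consequent).

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    n = len(transactions)
    both = support_count(transactions, antecedent | consequent)
    support = both / n
    confidence = both / support_count(transactions, antecedent)
    lift = confidence / (support_count(transactions, consequent) / n)
    return support, confidence, lift


# Hypothetical baskets, invented for illustration.
baskets = [
    {"Meat", "Dairy", "Vegetables"},
    {"Meat", "Dairy", "Vegetables"},
    {"Meat", "Dairy"},
    {"Vegetables"},
    {"Vegetables", "Bread"},
    {"Bread"},
    {"Meat", "Dairy", "Vegetables", "Bread"},
    {"Dairy"},
]

forward = rule_stats(baskets, {"Meat", "Dairy"}, {"Vegetables"})
reverse = rule_stats(baskets, {"Vegetables"}, {"Meat", "Dairy"})
print("forward (support, confidence, lift):", forward)
print("reverse (support, confidence, lift):", reverse)
```

Support and lift are symmetric in the two sides of the rule, so they match for both directions; confidence divides by the antecedent's count, so it generally differs.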
Association Rules
•  What might you do as a manager if you saw a very high Lift
and Confidence for the following rule about product
purchase? Why would you do this?
–  {Pasta} → {Orange Juice}
Association Rules
•  What might you do as a manager if you saw a very high Lift
and Confidence for the following rule about product
purchase? Why would you do this?
–  {Pasta} → {Orange Juice}
•  Encourage pasta buyers to see OJ (placement)
•  Get them in and milk ‘em (discount pasta, premium OJ)
•  Target market (advertise new OJ to Pasta customers)
Association Rules
•  What is the most reliable association rule below?
Association Rules
•  What is the most reliable association rule below?
–  Rule 2 ‒ tied for the best Lift (3.60), but it has the better Confidence!
Data Visualization
•  Look at In-Class Exercise Answers...