MIS2502:
Review for Final Exam (Exam 3)
Jing Gong
[email protected]
http://community.mis.temple.edu/gong
Overview
• Date/Time: Wednesday, December 16, 1 – 2:30 pm
• Place: Regular classroom (Alter Hall 232)
Please arrive 5 minutes early!
• Multiple-choice and short-answer questions
• Closed-book, closed-note
• No computer
Multiple Choice Questions
• We will use the blue exam answer sheet.
• Bring a #2 pencil and an eraser. (Extra pencils will be provided.)
Office Hours
• Mon 12/7, 3:00 – 4:00 pm
• Tue 12/15, 2:00 – 4:00 pm
• Or by appointment.
Coverage
1. Data Mining and Data Analytics Techniques
2. Using R and RStudio
3. Understanding Descriptive Statistics (Introduction to R)
4. Decision Tree Analysis
5. Cluster Analysis
6. Association Rules
Study Materials
• Lecture notes
• In-class exercises
• Assignments
– Video reviews will be uploaded
• Course recordings
Data Mining and Data Analytics Techniques
• Explain the three data analytics techniques we covered in the course
– Decision Trees, Clustering, and Association Rules
• Explain how data mining differs from OLAP analysis
Using R and RStudio
• Explain the difference between R and RStudio
• The role of packages in R
• Generate and explain basic syntax for R, for example (a short sketch follows this list):
– Variable assignment (e.g. NUM_CLUSTERS <- 5)
– Identify functions versus variables (e.g. kmeans())
– Identify how to access a variable (column) from a dataset (table) (e.g. dataSet$Salary)
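A minimal sketch pulling those syntax elements together; dataSet and its Salary column are placeholder names, not a specific course dataset:

NUM_CLUSTERS <- 5                                    # variable assignment
salaries <- dataSet$Salary                           # access a column (variable) from a dataset (table)
result <- kmeans(salaries, centers = NUM_CLUSTERS)   # kmeans() is a function; NUM_CLUSTERS is a variable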
Understanding Descriptive Statistics
• Be able to read and interpret a histogram
• Be able to read and interpret sample (descriptive) statistics (a short R sketch follows)
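As a hedged illustration, again using the placeholder dataSet$Salary, the usual R calls behind these statistics are:

summary(dataSet$Salary)    # min, quartiles, median, mean, max
sd(dataSet$Salary)         # standard deviation
hist(dataSet$Salary)       # histogram: shape, center, and spread of the distribution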
Hypothesis Testing
• Uses p-values to weigh the strength of the evidence
• T-test: A small p-value (typically ≤ 0.05) suggests that there is a statistically significant difference in means.
> t.test(subset$TaxiOut~subset$Origin);
Welch Two Sample t-test
data: subset$TaxiOut by subset$Origin
t = 51.5379, df = 24976.07, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 6.119102 6.602939
sample estimates:
mean in group ORD mean in group PHX
         20.58603          14.22501
Note: 2.2e-16 means 2.2 × 10^-16, which is ≤ 0.05, so we conclude that the difference in means is statistically significant.
More about p-values:
http://www.dummies.com/how-to/content/the-meaning-of-the-p-value-from-a-test.html
http://www.dummies.com/how-to/content/statistical-significance-and-pvalues.html
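Purely as an illustration (simulated data, not the course's flight dataset), the same reading of a t-test p-value looks like this in R:

set.seed(1)
groupA <- rnorm(100, mean = 20)   # e.g., taxi-out times at one airport
groupB <- rnorm(100, mean = 14)   # taxi-out times at another airport
result <- t.test(groupA, groupB)
result$p.value                    # compare against 0.05
result$p.value <= 0.05            # TRUE here: statistically significant difference in means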
Decision Tree Analysis
• Outcome variable: Discrete/Categorical
• Interpreting decision tree output
• Complexity factor, minimum split, and the size of the training partition (see the sketch after this list)
• Chi-Squared statistic:

χ² = Σ (observed − expected)² / expected
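A hedged sketch of where the complexity factor and minimum split appear when fitting a tree with rpart (a common R choice); trainingSet and the outcome Buy are placeholder names, not the assignment's actual code:

library(rpart)
treeModel <- rpart(Buy ~ ., data = trainingSet, method = "class",
                   control = rpart.control(cp = 0.01, minsplit = 20))
# cp (complexity factor) and minsplit limit how far the tree is grown;
# the size of the training partition is decided when the data is split before this call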
Chi-Squared Statistics
Assignment 8:

Observed

PromSpend     <50    >=50   Total
Buy           520     730    1250
No Buy        480     770    1250
Total        1000    1500    2500

PromTime       <6     >=6   Total
Buy           370     880    1250
No Buy        630     620    1250
Total        1000    1500    2500

Expected

PromSpend     <50    >=50   Total
Buy             ?       ?       ?
No Buy          ?       ?       ?
Total        1000    1500    2500

PromTime       <6     >=6   Total
Buy             ?       ?       ?
No Buy          ?       ?       ?
Total        1000    1500    2500
Chi-Squared Statistics
Assignment 8:

Observed

PromSpend     <50    >=50   Total
Buy           520     730    1250
No Buy        480     770    1250
Total        1000    1500    2500

PromTime       <6     >=6   Total
Buy           370     880    1250
No Buy        630     620    1250
Total        1000    1500    2500

Expected

PromSpend     <50    >=50   Total
Buy           500     750    1250
No Buy        500     750    1250
Total        1000    1500    2500

PromTime       <6     >=6   Total
Buy           500     750    1250
No Buy        500     750    1250
Total        1000    1500    2500
PromSpend:
χ² = (520 − 500)²/500 + (730 − 750)²/750 + (480 − 500)²/500 + (770 − 750)²/750 = 2.67

PromTime:
χ² = (370 − 500)²/500 + (880 − 750)²/750 + (630 − 500)²/500 + (620 − 750)²/750 = 112.67
PromTime is the stronger differentiator
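As a check, the same arithmetic can be reproduced in R; this sketch simply re-enters the Assignment 8 counts by hand:

observed_spend <- matrix(c(520, 730, 480, 770), nrow = 2, byrow = TRUE,
                         dimnames = list(c("Buy", "No Buy"), c("<50", ">=50")))
expected_spend <- outer(rowSums(observed_spend), colSums(observed_spend)) / sum(observed_spend)
sum((observed_spend - expected_spend)^2 / expected_spend)   # 2.67 for PromSpend

observed_time <- matrix(c(370, 880, 630, 620), nrow = 2, byrow = TRUE,
                        dimnames = list(c("Buy", "No Buy"), c("<6", ">=6")))
expected_time <- outer(rowSums(observed_time), colSums(observed_time)) / sum(observed_time)
sum((observed_time - expected_time)^2 / expected_time)      # 112.67 for PromTime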
Cluster Analysis
• Interpret output from a cluster analysis
• Interpret what the boxplot and histogram tell you about each variable
Cluster Analysis
• Interpret within-cluster sum of squares (cohesion) and between-cluster sum of squares (separation); see the sketch after these bullets
– Higher cohesion: smaller within-cluster sum of squares (withinss)
– Higher separation: larger between-cluster sum of squares (betweenss)
• What happens to those statistics as the number of clusters increases?
– (1) Higher cohesion (good), but (2) lower separation (bad)
• What is the advantage of fewer clusters?
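A hedged sketch, using R's built-in iris data rather than a course dataset, of where cohesion and separation appear in kmeans() output:

set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3)
km$tot.withinss    # within-cluster sum of squares (cohesion): smaller = tighter clusters
km$betweenss       # between-cluster sum of squares (separation)
# Re-running with a different 'centers' value lets you see how these statistics move.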
Cluster Analysis
• Interpret standardized cluster means for each input variable (a sketch follows)
– For standardized values, “0” is the average value for that variable.
– For group 5:
  – average RegionDensityPercentile > 0 → higher than the population average
  – average MedianHouseholdIncome and AverageHouseholdSize < 0 → lower than the population average
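A minimal sketch of producing standardized cluster means, again on the built-in iris data rather than the course data:

dataScaled <- scale(iris[, 1:4])        # z-scores: mean 0 for every variable
set.seed(1)
clusterAssignment <- kmeans(dataScaled, centers = 3)$cluster
aggregate(as.data.frame(dataScaled),
          by = list(cluster = clusterAssignment), FUN = mean)
# Positive means: above the overall average; negative means: below it.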
Association Rules
• Interpret the output from an association rule analysis

    lhs                        rhs        support    confidence lift
611 {CCRD,CKING,MMDA,SVG}   => {CKCRD}    0.01026154 0.6029412  5.335662
485 {CCRD,MMDA,SVG}         => {CKCRD}    0.01026154 0.5985401  5.296716
489 {CCRD,CKING,MMDA}       => {CKCRD}    0.01776999 0.5220588  4.619903
265 {CCRD,MMDA}             => {CKCRD}    0.01776999 0.5107914  4.520192
530 {CCRD,MMDA,SVG}         => {CKING}    0.01701915 0.9927007  1.157210
308 {CCRD,MMDA}             => {CKING}    0.03403829 0.9784173  1.140559
• Support, confidence, and lift
– Understand the difference
– Know how to compute
– These formulas will be provided:

Confidence(X → Y) = σ(X → Y) / σ(X)
Lift(X → Y) = s(X → Y) / ( s(X) × s(Y) )
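For reference, output like the table above can be generated with the arules package; this is only a sketch, and trans is a placeholder transactions object:

library(arules)
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
inspect(sort(rules, by = "lift"))   # support, confidence, and lift for each rule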
Support, confidence, and lift

Basket  Items
1       Coke, Pop-Tarts, Donuts
2       Cheerios, Coke, Donuts, Napkins
3       Waffles, Cheerios, Coke, Napkins
4       Bread, Milk, Coke, Napkins
5       Coffee, Bread, Waffles
6       Coke, Bread, Pop-Tarts
7       Milk, Waffles, Pop-Tarts
8       Coke, Pop-Tarts, Donuts, Napkins

Rule                           Support       Confidence   Lift
{Coke} → {Donuts}              3/8 = 0.375   3/6 = 0.50   0.375 / (0.75 × 0.375) = 1.33
{Coke, Pop-Tarts} → {Donuts}   2/8 = 0.25    2/3 = 0.67   0.25 / (0.375 × 0.375) = 1.78

• Which rule has the stronger association?
• Consider:
(1) a customer with Coke in the shopping cart.
(2) a customer with Coke and Pop-Tarts in the shopping cart.
Who do you think is more likely to buy Donuts?
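The support, confidence, and lift above can also be reproduced by hand in R; this sketch just re-enters the eight baskets:

baskets <- list(
  c("Coke", "Pop-Tarts", "Donuts"),
  c("Cheerios", "Coke", "Donuts", "Napkins"),
  c("Waffles", "Cheerios", "Coke", "Napkins"),
  c("Bread", "Milk", "Coke", "Napkins"),
  c("Coffee", "Bread", "Waffles"),
  c("Coke", "Bread", "Pop-Tarts"),
  c("Milk", "Waffles", "Pop-Tarts"),
  c("Coke", "Pop-Tarts", "Donuts", "Napkins"))
supp <- function(items) mean(sapply(baskets, function(b) all(items %in% b)))
supp(c("Coke", "Donuts"))                                    # support    = 0.375
supp(c("Coke", "Donuts")) / supp("Coke")                     # confidence = 0.50
supp(c("Coke", "Donuts")) / (supp("Coke") * supp("Donuts"))  # lift       = 1.33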
Support, confidence, and lift

                   Krusty-O's
Potato Chips         No      Yes
No                 5000     1000
Yes                4000      500
                       Total: 10500

• What is the lift for the rule {Potato Chips} → {Krusty-O's}?
• Are people who bought Potato Chips more likely than chance to buy Krusty-O's too?

s(Potato Chips) = (4000 + 500) / 10500 ≈ 0.429
s(Krusty-O's) = (1000 + 500) / 10500 ≈ 0.143
s(Potato Chips, Krusty-O's) = 500 / 10500 ≈ 0.048

Lift = s(Potato Chips, Krusty-O's) / ( s(Potato Chips) × s(Krusty-O's) )
     = 0.048 / (0.429 × 0.143)
     = 0.782

They appear in the same basket less often than what you'd expect by chance (i.e., Lift < 1).