MIS2502: Review for Final Exam (Exam 3)
Jing Gong
[email protected]
http://community.mis.temple.edu/gong

Overview
• Date/Time: Wednesday, December 16, 1 – 2:30 pm
• Place: Regular classroom (Alter Hall 232). Please arrive 5 minutes early!
• Multiple-choice and short-answer questions
• Closed-book, closed-note
• No computer

Multiple Choice Questions
• We will use the blue exam answer sheet.
• Bring a #2 pencil and an eraser. (Extra pencils will be provided.)

Office Hours
• Mon 12/7, 3:00 – 4:00 pm
• Tue 12/15, 2:00 – 4:00 pm
• Or by appointment

Coverage
1. Data Mining and Data Analytics Techniques
2. Using R and RStudio
3. Understanding Descriptive Statistics (Introduction to R)
4. Decision Tree Analysis
5. Cluster Analysis
6. Association Rules

Study Materials
• Lecture notes
• In-class exercises
• Assignments – video reviews will be uploaded
• Course recordings

Data Mining and Data Analytics Techniques
• Explain the three data analytics techniques we covered in the course – Decision Trees, Clustering, and Association Rules
• Explain how data mining differs from OLAP analysis

Using R and RStudio
• Explain the difference between R and RStudio
• Explain the role of packages in R
• Generate and explain basic R syntax (a short R sketch appears after the Decision Tree Analysis section below), for example:
  – Variable assignment (e.g., NUM_CLUSTERS <- 5)
  – Identifying functions versus variables (e.g., kmeans())
  – Accessing a variable (column) from a dataset (table) (e.g., dataSet$Salary)

Understanding Descriptive Statistics
• Be able to read and interpret a histogram
• Be able to read and interpret sample (descriptive) statistics

Hypothesis Testing
• Hypothesis testing uses p-values to weigh the strength of the evidence.
• T-test: a small p-value (typically ≤ 0.05) suggests that there is a statistically significant difference in means (a t.test sketch on simulated data appears below).

> t.test(subset$TaxiOut ~ subset$Origin)

        Welch Two Sample t-test

data:  subset$TaxiOut by subset$Origin
t = 51.5379, df = 24976.07, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 6.119102 6.602939
sample estimates:
mean in group ORD mean in group PHX
         20.58603          14.22501

Here the p-value 2.2e-16 = 2.2 × 10^−16 ≤ 0.05, so we conclude that the difference in means is statistically significant.

More about p-values:
http://www.dummies.com/how-to/content/the-meaning-of-the-p-value-from-a-test.html
http://www.dummies.com/how-to/content/statistical-significance-and-pvalues.html

Decision Tree Analysis
• Outcome variable: discrete/categorical
• Interpreting decision tree output (an rpart sketch appears below)
• Complexity factor, minimum split, and the size of the training partition
• Chi-squared statistic:
  χ² = Σ (observed − expected)² / expected
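Returning to the Using R and RStudio points above, here is a minimal sketch of the basic syntax. NUM_CLUSTERS <- 5, kmeans(), and dataSet$Salary are the slide's own examples; the dataSet contents (Salary and Age values) are made up purely for illustration.

# Variable assignment (the slide's example): NUM_CLUSTERS is a variable
NUM_CLUSTERS <- 5

# dataSet is a hypothetical data frame used only for illustration
dataSet <- data.frame(Salary = c(45000, 62000, 38000, 71000, 55000),
                      Age    = c(29, 41, 25, 52, 37))

# mean() and summary() are functions; the parentheses give them away.
# dataSet$Salary accesses the Salary column (variable) of the dataset (table).
mean(dataSet$Salary)
summary(dataSet$Salary)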
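The t.test output shown earlier came from the course's airline data, which is not reproduced here. The sketch below shows the same two-sample pattern on simulated data: the Origin and TaxiOut names mirror the slide, but the values and group sizes are made up.

# Simulated stand-in for the airline data: TaxiOut times for two origins
set.seed(42)
flights <- data.frame(
  Origin  = rep(c("ORD", "PHX"), each = 100),
  TaxiOut = c(rnorm(100, mean = 20, sd = 5),    # ORD group
              rnorm(100, mean = 14, sd = 5)))   # PHX group

# Welch two-sample t-test comparing mean TaxiOut across Origin groups
result <- t.test(flights$TaxiOut ~ flights$Origin)
result$p.value    # p-value <= 0.05 suggests a significant difference in means
result$estimate   # the two group means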
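For the decision tree settings listed above, here is a sketch using the rpart package on R's built-in iris data as a stand-in; the course dataset is not reproduced, and the 60% split, cp, and minsplit values are illustrative assumptions, not the course's settings.

library(rpart)

# A 60% training partition (an illustrative split ratio)
set.seed(7)
train_rows <- sample(nrow(iris), size = 0.6 * nrow(iris))
training   <- iris[train_rows, ]

# cp is the complexity factor; minsplit is the minimum observations to split a node
tree <- rpart(Species ~ ., data = training, method = "class",
              control = rpart.control(cp = 0.01, minsplit = 20))
print(tree)   # each node shows the split rule, n, and the predicted class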
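To connect the χ² formula above to the Assignment 8 example that follows, here is a minimal R check using the observed counts for the PromSpend split; the counts come from the assignment, and the code is only a sketch of the calculation.

# Observed counts for the PromSpend split (Buy / No Buy by <50 / >=50)
observed <- matrix(c(520, 480,
                     730, 770),
                   nrow = 2,
                   dimnames = list(c("Buy", "No Buy"), c("<50", ">=50")))

# Expected counts assume the outcome is independent of the split:
# expected = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

sum((observed - expected)^2 / expected)   # about 2.67, as computed by hand below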
Chi-Squared Statistics (Assignment 8)
• Given the observed counts for each candidate split, fill in the expected counts (assuming Buy/No Buy is independent of the split) and compute χ² for each split.

Observed:

PromSpend    <50    >=50   Total      PromTime     <6     >=6    Total
Buy           520    730    1250      Buy           370    880    1250
No Buy        480    770    1250      No Buy        630    620    1250
Total        1000   1500    2500      Total        1000   1500    2500

Expected:

PromSpend    <50    >=50   Total      PromTime     <6     >=6    Total
Buy           500    750    1250      Buy           500    750    1250
No Buy        500    750    1250      No Buy        500    750    1250
Total        1000   1500    2500      Total        1000   1500    2500

PromSpend:
χ² = (520 − 500)²/500 + (480 − 500)²/500 + (730 − 750)²/750 + (770 − 750)²/750 = 2.67

PromTime:
χ² = (370 − 500)²/500 + (630 − 500)²/500 + (880 − 750)²/750 + (620 − 750)²/750 = 112.67

PromTime is the stronger differentiator.

Cluster Analysis
• Interpret output from a cluster analysis (a kmeans sketch appears at the end of this review)
• Interpret what the boxplot and histogram tell you about each variable
• Interpret within-cluster sum of squares (cohesion) and between-cluster sum of squares (separation)
  – Higher cohesion: less withinss error
  – Higher separation: more betweenss error
• What happens to those statistics as the number of clusters increases?
  – (1) Higher cohesion (good), but (2) lower separation (bad)
• What is the advantage of fewer clusters?
• Interpret standardized cluster means for each input variable
  – For standardized values, "0" is the average value for that variable.
  – For group 5: average RegionDensityPercentile > 0 (higher than the population average); average MedianHouseholdIncome and AverageHouseholdSize < 0 (lower than the population average)

Association Rules
• Interpret the output from an association rule analysis (an arules sketch appears at the end of this review):

    lhs                        rhs       support    confidence  lift
611 {CCRD,CKING,MMDA,SVG} =>  {CKCRD}    0.01026154  0.6029412  5.335662
485 {CCRD,MMDA,SVG}       =>  {CKCRD}    0.01026154  0.5985401  5.296716
489 {CCRD,CKING,MMDA}     =>  {CKCRD}    0.01776999  0.5220588  4.619903
265 {CCRD,MMDA}           =>  {CKCRD}    0.01776999  0.5107914  4.520192
530 {CCRD,MMDA,SVG}       =>  {CKING}    0.01701915  0.9927007  1.157210
308 {CCRD,MMDA}           =>  {CKING}    0.03403829  0.9784173  1.140559

• Support, confidence, and lift
  – Understand the difference
  – Know how to compute them
  – These formulas will be provided:
    Confidence = σ(X → Y) / σ(X)
    Lift = s(X → Y) / (s(X) × s(Y))

Support, confidence, and lift

Basket  Items
1       Coke, Pop-Tarts, Donuts
2       Cheerios, Coke, Donuts, Napkins
3       Waffles, Cheerios, Coke, Napkins
4       Bread, Milk, Coke, Napkins
5       Coffee, Bread, Waffles
6       Coke, Bread, Pop-Tarts
7       Milk, Waffles, Pop-Tarts
8       Coke, Pop-Tarts, Donuts, Napkins

Rule                            Support       Confidence   Lift
{Coke} => {Donuts}              3/8 = 0.375   3/6 = 0.50   0.375 / (0.75 × 0.375) = 1.33
{Coke, Pop-Tarts} => {Donuts}   2/8 = 0.25    2/3 = 0.67   0.25 / (0.375 × 0.375) = 1.78

• Which rule has the stronger association?
• Consider: (1) a customer with Coke in the shopping cart, and (2) a customer with Coke and Pop-Tarts in the shopping cart. Who do you think is more likely to buy Donuts?

Support, confidence, and lift

                 Krusty-O’s
Potato Chips     No      Yes
No              5000    1000
Yes             4000     500
(10,500 baskets in total)

• What is the lift for the rule {Potato Chips} => {Krusty-O’s}?
• Are people who bought Potato Chips more likely than chance to buy Krusty-O’s too?

Lift = s(Potato Chips, Krusty-O’s) / (s(Potato Chips) × s(Krusty-O’s))
     = 0.048 / (0.429 × 0.143)
     = 0.782

They appear in the same basket less often than what you’d expect by chance (i.e., Lift < 1).
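A quick R check of the Krusty-O’s lift above, using the counts from the contingency table. The slide's 0.782 comes from the rounded supports (0.048, 0.429, 0.143); the unrounded counts give about 0.78, and either way the lift is below 1.

# Supports from the contingency table (10,500 baskets in total)
s_chips    <- (4000 + 500) / 10500   # s(Potato Chips)  ~ 0.429
s_krustyos <- (1000 + 500) / 10500   # s(Krusty-O's)    ~ 0.143
s_both     <-          500 / 10500   # s(both together) ~ 0.048

s_both / (s_chips * s_krustyos)      # lift ~ 0.78, i.e., less than 1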
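The rule output shown earlier in this review came from the course data, which is not reproduced here. As an illustration only, the sketch below runs apriori() from the arules package on the eight-basket example above; the minimum support and confidence thresholds are made-up values.

library(arules)

# The eight baskets from the example above, as a list of item vectors
baskets <- list(
  c("Coke", "Pop-Tarts", "Donuts"),
  c("Cheerios", "Coke", "Donuts", "Napkins"),
  c("Waffles", "Cheerios", "Coke", "Napkins"),
  c("Bread", "Milk", "Coke", "Napkins"),
  c("Coffee", "Bread", "Waffles"),
  c("Coke", "Bread", "Pop-Tarts"),
  c("Milk", "Waffles", "Pop-Tarts"),
  c("Coke", "Pop-Tarts", "Donuts", "Napkins"))

trans <- as(baskets, "transactions")

# Hypothetical minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.5))

inspect(sort(rules, by = "lift"))   # each rule's support, confidence, and lift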
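Finally, returning to the Cluster Analysis points above: a minimal kmeans sketch on simulated, standardized data. The variable names mirror the review bullet, but the values, k = 5, and nstart = 25 are arbitrary choices; the course dataset is not reproduced.

set.seed(1)
# Simulated stand-ins for two of the course's input variables
customers <- data.frame(MedianHouseholdIncome = rnorm(200, mean = 60000, sd = 15000),
                        AverageHouseholdSize  = rnorm(200, mean = 2.5,   sd = 0.6))

customers_std <- scale(customers)   # standardize: 0 = the population average

fit <- kmeans(customers_std, centers = 5, nstart = 25)
fit$tot.withinss   # within-cluster sum of squares: lower means higher cohesion
fit$betweenss      # between-cluster sum of squares: higher means more separation

# Standardized cluster means: values above 0 are above the population average
# for that variable, and values below 0 are below it
aggregate(as.data.frame(customers_std), by = list(cluster = fit$cluster), FUN = mean)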