Turning data into knowledge to solve real world problems
Christopher R. Bilder, Ph.D.
Department of Statistics
University of Nebraska-Lincoln
www.chrisbilder.com

15 years ago…
The year is 1990
– Music – U2
– George Bush is president
– TV – The Simpsons
Millard South – Senior year
– Big hair
– In the middle of winning state titles in basketball for 3 out of 4 years (1988, 1989, 1991)
What am I going to major in at college?
– Calculus I
– No AP Statistics!

15 years ago…
UNO (1990 – 1994)
– Math undergraduate major
– What can you do with a degree?
– Planned to be an actuary
– Hypothesis testing in a statistics course (junior year)
  • Use for decision making!
  • Scientifically prove a hypothesis or statement
Kansas State University for graduate school (1994 – 2000)
– Statistics graduate major in Department of Statistics
– Master of Science (MS) and Doctor of Philosophy (PhD)
Oklahoma State University faculty (2000 – 2003)
– Department of Statistics
UNL faculty (2003 – now)
– NEW Department of Statistics

Purpose
Tell you a little about the statistical science
Turning data into knowledge to solve real world problems
– 3 actual examples
AP statistics exam
Website (www.chrisbilder.com/statistics) for more information

Grocery store prices
Undergraduate teaching example for a course like AP Statistics
How could you determine which grocery store, Super Wal-Mart or Baker’s, has lower average prices?
– Paired or dependent two sample hypothesis test for $\mu_{\text{Wal-Mart}} - \mu_{\text{Baker's}}$
– Sample the same items at each store

Grocery store prices
Undergraduate teaching example for a course like AP Statistics
How could you determine which grocery store, Dillon’s or Food-4-Less in Manhattan, KS, has lower average prices?
– Paired or dependent two sample hypothesis test for $\mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}}$
– Sample the same items at each store
Only cereals from Fall 1998

Grocery store prices
Sample:

No.  Item                                            Dillon’s  Food-4-Less  Difference
 1   Malt-o-meal - Tootie Fruities, 15oz             $1.99     $1.84         $0.15
 2   Malt-o-meal - Golden Puffs, 18oz                $1.99     $1.84         $0.15
 3   Quaker Oats - Life Cereal: Original, 21oz       $3.69     $3.49         $0.20
 4   Cheerios, 20oz                                  $4.59     $4.24         $0.35
 5   Cheerios, 15oz                                  $3.79     $3.50         $0.29
 6   Wheaties, 18oz                                  $3.89     $3.60         $0.29
 7   Kellogg’s Funpack, 8 9/16oz                     $2.89     $2.67         $0.22
 8   Kellogg’s Variety Pack, 9 5/8oz                 $3.49     $3.14         $0.35
 9   Kellogg’s Frosted Mini-Wheats Bite Size, 19oz   $3.49     $2.50         $0.99
10   Kellogg’s Frosted Mini-Wheats, 16oz             $2.50     $2.73        -$0.23
11   Kellogg’s Frosted Flakes, 15oz                  $3.19     $2.92         $0.27
12   Our Family Frosted Flakes, 20oz                 $2.50     $1.90         $0.60
13   Kellogg’s Crispix, 12oz                         $3.49     $3.20         $0.29
14   Our Family - Raisin Bran, 20oz                  $2.50     $1.92         $0.58
15   Kellogg’s Smart Start, 13.3oz                   $3.49     $3.24         $0.25
16   Grape Nuts, 24oz                                $3.00     $2.85         $0.15
17   Frosted Alpha Bits, 15oz                        $3.00     $2.87         $0.13

Grocery store prices
Do you think there are mean differences?
[Figure: plots of the Dillon’s - Food 4 Less price differences (vertical axis from -$0.2 to $1.0), with points labeled by item number and the 25%, 50%, and 75% quartiles marked]
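
The paired comparison can be checked by hand from the 17 price pairs in the sample table. The block below is a minimal sketch, not the presenter's own code: the function name `paired_t_summary` is illustrative, and the critical value 2.1199 (the 0.975 quantile of a t distribution with 16 degrees of freedom) is hard-coded from a standard t table so that only the Python standard library is needed.

```python
import math
import statistics

# Cereal prices (Fall 1998) copied from the sample table, items 1 to 17.
dillons = [1.99, 1.99, 3.69, 4.59, 3.79, 3.89, 2.89, 3.49, 3.49,
           2.50, 3.19, 2.50, 3.49, 2.50, 3.49, 3.00, 3.00]
food4less = [1.84, 1.84, 3.49, 4.24, 3.50, 3.60, 2.67, 3.14, 2.50,
             2.73, 2.92, 1.90, 3.20, 1.92, 3.24, 2.85, 2.87]

# Assumption: t table value for a two-sided 95% interval with n - 1 = 16 df.
T_CRIT_16_DF = 2.1199


def paired_t_summary(x, y, t_crit):
    """Return (mean difference, t statistic, CI lower, CI upper) for x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample SD over sqrt(n)
    t_stat = mean_diff / std_err  # test statistic for H0: mean difference = 0
    margin = t_crit * std_err
    return mean_diff, t_stat, mean_diff - margin, mean_diff + margin


mean_diff, t_stat, ci_low, ci_high = paired_t_summary(
    dillons, food4less, T_CRIT_16_DF)
print(f"mean difference = {mean_diff:.4f}")
print(f"t = {t_stat:.2f} on {len(dillons) - 1} degrees of freedom")
print(f"95% C.I.: {ci_low:.4f} to {ci_high:.4f}")
```

Working on the differences turns the two-sample problem into a one-sample t test, which is why the same items have to be sampled at both stores. A p-value would need the t distribution's CDF, which the standard library does not provide, so it is left out of this sketch.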
Grocery store prices
Paired two sample hypothesis test
– $H_0: \mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}} = 0$
  $H_a: \mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}} \ne 0$
– t = 4.77, p-value = 0.0002, 95% C.I.: $0.1644 < \mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}} < 0.4274$
– Reject equal mean prices
If price was the only consideration, what store should one shop at?
Assumptions
– Normal populations
– The sample was taken in 1998; what about now?
– Finite populations

Placekicking
The use of the statistical science in sports
Find a model to estimate the probability of success for placekicks (field goals, PATs) in the NFL
Video
– January 7, 1996
– Playoff game
– Indianapolis Colts 10, Kansas City Chiefs 7
– Lin Elliott of KC will attempt a 42 yard field goal to tie the game and send it into overtime
– Field goal video

Placekicking
What factors affect the probability of success for NFL placekicks?
– Distance
– Pressure – How do you quantitatively measure?
– Wind
– Grass vs. artificial turf
– Dome vs.
outdoor stadium
Collect sample of >1,700 placekicks during the 1995 NFL season
Find the best logistic regression model of the form

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}$$

where
– p is the probability of success
– $x_i$ for i=1,…,k are independent variables
– $\beta_i$ measures the effect of $x_i$ on p for i=1,…,k
– e ≈ 2.718; ln(e) = 1

Placekicking
The $\beta_i$’s are parameters which are estimated using “iteratively reweighted least squares”
Estimated model

$$\hat{p} = \frac{e^{4.4984 - 0.3306\,\text{change} - 0.0807\,\text{distance} + 1.2592\,\text{PAT} + 2.8778\,\text{wind} - 0.0907\,\text{distance}\times\text{wind}}}{1 + e^{4.4984 - 0.3306\,\text{change} - 0.0807\,\text{distance} + 1.2592\,\text{PAT} + 2.8778\,\text{wind} - 0.0907\,\text{distance}\times\text{wind}}}$$

– Change: lead change = 1, non-lead change = 0
– Distance: distance in yards
– PAT: point after touchdown = 1, field goal = 0
– Wind: windy (speed > 15 MPH) = 1, non-windy = 0
What is the estimated probability of success for Elliott’s field goal?
– Conditions: Change = 1, Distance = 42, PAT = 0, Wind = 0
– Estimated probability of success: $\hat{p} = 0.6850$
– 90% confidence interval for probability of success: 0.6298 < p < 0.7402

Estimated probability of success for a field goal (PAT=0)
[Figure: estimated probability of success versus distance in yards (20 to 60) for the four combinations of Change (0 or 1) and Wind (0 or 1); the 42-yard kick is marked at 0.685]

HCV prevalence
Hepatitis C (HCV)
– Viral infection that causes cirrhosis and cancer of the liver
Questions:
– How can people be tested in a cost effective and timely manner?
  • Blood bank setting
– What is the probability a person has HCV?
  • What proportion of people is afflicted with HCV in a population?
  • Prevalence in a population
Individual testing
– Each blood sample is tested individually
– Problems:
  • Costly
  • Time
[Figure: six individual blood samples, each with its own + or - test result]

HCV prevalence
Group testing
– Pool the blood samples together to form n groups of size s
[Figure: pooled samples labeled Group 1, Group 2, …, Group n, each with a + or - test result]
– If the GROUP sample is negative, then all s people do not have the disease
– If the GROUP sample is positive, then at least ONE of the s people has the disease
  • May want to determine who in the group has the disease
– Strategy works well when prevalence of a disease is small

HCV prevalence
Notation
– p = probability an INDIVIDUAL is HCV positive (prevalence)
– $\pi$ = probability a GROUP is HCV positive
– s = group size
– n = number of groups
– T is a random variable denoting the number of positive GROUPS
  • T has a binomial distribution with “n trials” and “$\pi$ as the probability of success”

$$f(t) = \binom{n}{t} \pi^{t} (1 - \pi)^{n - t} \quad \text{for } t = 0, 1, 2, \ldots, n$$

HCV prevalence
How can we estimate p?
– We observe information about the groups, not individuals!
– Estimate $\pi$ with $\hat{\pi} = T/n$ = # positive / # of groups
– $\pi$ = P(group is positive)
  = P(at least one individual is positive)
  = 1 – P(no individuals are positive), using the complement rule
  = 1 – P(all individuals are negative)
  = $1 - (1 - p)^{s}$, since p = P(individual is positive) and there are s individuals per group
– $p = 1 - (1 - \pi)^{1/s}$
– Then $\hat{p} = 1 - (1 - \hat{\pi})^{1/s} = 1 - (1 - T/n)^{1/s}$

HCV prevalence
Estimation of HCV prevalence in Xuzhou City, China
– Data from Liu et al. (Transfusion, 1997)
– 1,875 blood donors screened for HCV
  • There were 42 positives
– In order to test the usefulness of group testing, blood samples were also pooled
  • n = 375 groups
  • s = 5 individuals per group
  • t = 37 positive groups
– Estimates of p, probability individual is positive
  • Using individual data: 42/1875 = 0.0224
  • Using group data: $\hat{p} = 1 - (1 - 37/375)^{1/5} = 0.0206$
– Which is easier and more cost effective?
  • 1875 tests using individual testing
  • 375 tests using group testing

HCV prevalence
New research
– MS/PhD research
– What factors could affect p?
– Include independent variables to help model p

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}$$

– Problem: Do not have the individual outcomes
– After a group is tested positive, how can you find which individuals have the disease?
  • Use model to help decide who to retest if get a positive group
– Multiple diseases
  • HCV
  • HIV
  • Other disease
  • Simultaneously model

HCV prevalence
Multiple vector transfer designs
– Swallow (Phytopathology, 1985)
– Want to estimate the probability an insect vector transfers a pathogen (virus, bacteria, etc.) to a plant
[Figure: images labeled “Brown planthopper” and “Whitebacked planthopper”]

HCV prevalence
Multiple vector transfer designs (continued)
– y = 0 if plant is negative, 1 if plant is positive
– T = number of plants with disease
[Figure: greenhouse with enclosed test plants, each labeled y=0 or y=1, and planthoppers that either transmit or do not transmit the virus]

Why statistics?
Statistics is used in many diverse areas!
– Statistics is the “science of science”
– Florence Nightingale quote: the most important science in the whole world: for upon it depends the practical application of every other science and of every art: the one science essential to all political and social administration, all education, all organization based on experience, for it only gives results of our experience.
Take statistics courses in college!
– Of course, I want you to consider coming to UNL!
– Statistics is mainly a graduate discipline, so there is no undergraduate major at UNL
– Undergraduate minor in statistics can be useful for many majors
– Most statisticians have an undergraduate degree (Bachelor of Science) in math

Why statistics?
Where do statisticians work?
– Pharmaceutical and medical research – Pfizer, Merck, medical centers
– Marketing – Target, Hallmark
– Government research labs – INEEL, Los Alamos, Sandia, Argonne
– Agriculture – Pioneer Hi-Bred
– Consulting firms – Quintiles
– In Nebraska – ConAgra, Gallup, First National Bank, MDS Pharma, Experian, UNMC and Creighton medical center, various universities, Pfizer, Acton International, Nebraska state agencies, Union Pacific
Everyone that I have known has had a job offer before they graduated!
How many statisticians are there?
– 20,000

Why statistics?
Salaries
– Non-academic starting (2003 American Statistical Association survey)

                        Percentile
Degree  Sample size  25th    50th   75th
MS      102          45.5K   50K    59K
PhD     99           60K     65K    75K

– Survey response rate was 23.5%; see salary surveys at the American Statistical Association’s website
Background needed
Strong in mathematics and using computers
– Majority of statisticians have Bachelor’s degrees in mathematics
  • Good with calculus
  • Applied math courses
  • Take at least one statistics course
  • Comfortable with using software packages
– To actually be a “statistician”, usually need to go to graduate school to get a MS or PhD in statistics
  • Financial support
  • Graduate Teaching Assistantship

Why statistics?
What courses to take next in college?
– AP statistics equivalent to a one semester introductory statistics course without calculus
  • UNL: STAT 218 (Introduction to Statistics)
  • UNO: MATH/STAT 3000 (Statistical Methods I); Business Administration 2130 (Principles of Business Statistics)
– Theory – 2 semester sequence using calculus I-III
  • UNL: STAT 462 (Distribution Theory) and STAT 463 (Statistical Inference)
  • UNO: MATH 4740 and 4750 (Intro.
to Probability and Statistics I and II)
– Applications
  • UNL: STAT 450 (Introduction to Regression Analysis) or STAT 412 (Introduction to Experimental Design)
  • UNO: MATH/STAT 3010 (Statistical Methods II); Business Administration 3140 (Business Statistical Applications)

Why statistics?
Other recommended UNL classes (undergraduate)
– MATH 340 Numerical Analysis
– MATH 314 Applied Linear Algebra
– MATH 325 Elementary Analysis and MATH 425 Mathematical Analysis
  • Helpful if go on for a PhD
– Computer science programming courses
Other recommended UNO classes (undergraduate)
– MATH 3300 Numerical Methods
– MATH 4050 Linear Algebra
– MATH 4760 Topics in Modeling
– MATH 4230 and 4240 Mathematical Analysis I and II
  • Helpful if go on for a PhD
– Computer science programming courses

AP Statistics
Grading done in Lincoln!
– State fair grounds
– Grade the free response section of about 66,000 student exams (2004)
– 250 AP statistics high school teachers and college professors
– June 13 to June 19, 2005
– 8:30AM – 4:45PM EVERY DAY

AP Statistics
I graded in 2002
– About 900 problems graded!
– 16 graders in a room split into two groups
– Each group has a leader
  • Answer questions
  • CHECKS some of your grading!
– Paid $1,450
  • Stay in dorms
  • Free meals and snacks
Grading is not fun
– Evening activities
– Discussions on how to teach introductory statistics better
The grading rubric
– An outline of how to grade a problem that must be followed!
– These are put together before graders arrive by examining a sample set of tests

AP Statistics
Question #6 in 2002
– 4 parts – (a), (b), (c), (d)
– Each part is graded as
  • E = Essentially correct
  • P = Partially correct
  • I = Incomplete
– Graders are given a “conversion” table to show how to convert the scores into a numerical score
  • 4 = Complete response
  • 3 = Substantial response
  • 2 = Developing response
  • 1 = Minimal response
  • 0 = No credit
– 1 point given to an E, 0.5 points given to a P, 0 points given to an I
  • Round up if (a) or (c) has the correct interpretation
– Example given at end of PowerPoint file

For more information…
E-mail me at [email protected]
Website: www.chrisbilder.com/statistics
– This PowerPoint presentation (including example question)
– Links to
  • Introductory information about being a statistician
  • Jobs (including internships)
  • Salary information
  • List of all Departments of Statistics
  • Professional societies
  • Course websites that I and others teach
  • Newspaper and magazine articles about statistical applications

Statistics at UNL
[Figure: image labeled “33rd st.” and “Department of Statistics”]

AP Statistics
[Figures: example question slides]
– May actually be an E?
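
The point conversion described for Question #6 can be sketched in a few lines. This is an illustration under stated assumptions, not the official rubric: the function name `question_score` and the `round_up` flag are hypothetical, and the slides only say when a half point is rounded up, so rounding down in every other case is an assumption.

```python
import math

# Points per part as stated in the talk: E = 1, P = 0.5, I = 0.
POINTS = {"E": 1.0, "P": 0.5, "I": 0.0}


def question_score(part_grades, round_up):
    """Convert the E/P/I grades for parts (a) to (d) into a 0-4 score.

    round_up stands for "part (a) or (c) has the correct interpretation".
    Assumption: a half point is rounded down whenever round_up is False.
    """
    total = sum(POINTS[grade] for grade in part_grades)
    if total.is_integer():
        return int(total)
    return math.ceil(total) if round_up else math.floor(total)


# Two E's, one P, and one I give 2.5 points before rounding.
print(question_score(["E", "E", "P", "I"], round_up=True))
print(question_score(["E", "E", "P", "I"], round_up=False))
```

A response on the boundary between P and E for a single part can move the whole question by a full point once rounding is applied, which is why borderline parts get a second look.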
AP Statistics
[Figures: two further example question slides]

Estimated probability of success for a field goal (PAT=0)
[Figure: estimated probability of success versus distance in yards (20 to 60), showing the estimated probability and 90% confidence interval for the lowest and for the highest number of risk factors]
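
Curves like the ones in this figure can be traced from the estimated placekicking model given earlier in the talk. The block below is a sketch under assumptions: the function name `estimated_probability` is illustrative; only the intercept, change, and distance coefficients can be checked against the worked 42-yard example, so the signs of the PAT, wind, and distance-by-wind terms are a reconstruction; and reading "lowest" and "highest number of risk factors" as (Change=0, Wind=0) and (Change=1, Wind=1) is an interpretation. The confidence bands are not reproduced, since they need the coefficient covariance matrix, which the slides do not give.

```python
import math


def estimated_probability(change, distance, pat=0, wind=0):
    """Estimated probability of a successful placekick.

    Assumption: signs of the PAT, wind, and distance*wind terms are
    reconstructed; the other three terms match the worked example.
    """
    eta = (4.4984
           - 0.3306 * change
           - 0.0807 * distance
           + 1.2592 * pat
           + 2.8778 * wind
           - 0.0907 * distance * wind)
    # Same value as e^eta / (1 + e^eta), written in a numerically stable form.
    return 1.0 / (1.0 + math.exp(-eta))


# Worked example from the talk: 42-yard field goal, lead change, no wind.
elliott = estimated_probability(change=1, distance=42)
print(f"42-yard kick: {elliott:.3f}")

# Field goals (PAT=0) over the plotted range of distances.
print("distance  lowest risk  highest risk")
for distance in range(20, 61, 10):
    lowest = estimated_probability(change=0, distance=distance, wind=0)
    highest = estimated_probability(change=1, distance=distance, wind=1)
    print(f"{distance:>8}  {lowest:>11.3f}  {highest:>12.3f}")
```

The 42-yard value comes out at about 0.685; the small gap from the reported 0.6850 is consistent with the coefficients being rounded to four decimals.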