Describing Data
September 14, 2016

Updates
• This week: lab sections begin
  • Wed: 2-4pm (Today!)
  • Wed: 4-6pm (Today!)
  • Mon: 4-6pm
• Next week
  • Eric Glass, guest speaker from DSSC (part of class)
• The following week, another speaker talking about Zotero.

Updates to assignments
• Updated LiPS assignment
  • Still have to do seven write-ups
  • One must be either Fulong Wu (Monday evening, Nov. 14th) or Malo Hutson (Tuesday evening, Sept. 20th)
• Assignment 2 posted to CourseWorks
  • Due at the start of your lab in 2 weeks.
  • Hand in a paper copy to your TA and also post it to CourseWorks.

Today: Statistics
• Descriptive
  • Describe and summarize our data to give insights
• Inferential
  • Use statistics to make generalizations about a broader population

Types of Variables
• Categorical
  • Nominal (not ranked)
    • College major, type of property, color of car
  • Ordinal (ordered or ranked)
    • Useful for preferences, though no value assigned
  • Dichotomous (two categories, not ranked)
    • Yes/no
• Numerical
  • Discrete (values are counts)
  • Continuous (values are measures)

Variables
• Nominal
  • Exclusive but not ordered or ranked
• Ordinal
  • Ranked
• Interval
  • Equally spaced values

Nominal Examples
• Think of nominal scales as “labels”
• No quantitative value

Color          Count
Blue           10
Black          8
Red            6
blue           5
Purple         3
Green          2
Purple         2
White          2
BLUE           1
Brown          1
Burgundy       1
Gray           1
Pink           1
Red            1
Yellow         1
nav            1
orange         1
purple         1
red            1
seafoam green  1
turquoise      1
white          1

• Other examples:
  • Gender
  • Hair color
  • Neighborhood
• When there are only two categories, we call this “dichotomous.”
  • Examples: Heads/Tails, On/Off, Rural/Urban, In poverty / Not in poverty
• Q: What about gender? Is that a dichotomous variable?
Ordinal
• Ranked in order of values, but the difference between values is not always known
• Example:
  • Educational attainment

Ordinal example: educational attainment

Interval
• Numerical scales where the order of and the differences between values are known
• Examples:
  • Money or income
  • Height
  • Weight
[Figure: bar chart of “Hours of Sleep” on the x-axis and “Count” on the y-axis]

Likert items
• Allow people to respond according to some scale
• Examples:
  Question: How frequently do you think you need to come to class to get a high pass?
  o Always
  o Often
  o Occasionally
  o Rarely
  o Never
  Question: I already know everything there is to know about “Planning Techniques”
  o Agree Strongly
  o Agree Slightly
  o Neutral
  o Disagree Slightly
  o Disagree Strongly
• Examples, four-point scale:
  Question: I read emails from Nick Klein
  o Most of the time
  o Some of the time
  o Seldom
  o Never
  Question: I read emails from Nick Klein
  o Most of the time – ALL OF THE TIME
  o Some of the time
  o Seldom
  o Never

Likert Scales
• What types of variables are these?
• How can we interpret them?

Descriptive stats
We need some data to describe
Lucky us!
What year were you born?
50 responses:
1993, 1991, 1960, 1993, 1994, 1992, 1989, 1992, 1993, 1993,
1994, 1991, 1990, 1992, 1987, 1989, 1994, 1992, 1989, 1992,
1994, 1985, 1994, 1991, 1991, 1992, 1993, 1993, 1993, 1992,
1991, 1985, 1992, 1992, 1992, 1985, 1994, 1993, 1995, 1991,
1985, 1993, 1990, 1992, 1994, 1994, 1994, 1994, 1992, 1990
Hard to make sense of this…

We can use a “frequency table”

Year born  Frequency  Percent
1960        1          2.00
1985        4          8.00
1987        1          2.00
1989        3          6.00
1990        3          6.00
1991        6         12.00
1992       12         24.00
1993        9         18.00
1994       10         20.00
1995        1          2.00

Let’s represent it another way, graphically
We can use a “dot plot” where each dot represents a response
[Figure: dot plot; x-axis “What year were you born?” (1960 to 2000), y-axis “Frequency”]

This is similar to a histogram
[Figure: histogram; x-axis “What year were you born?” (1960 to 2000), y-axis “Frequency”]

But a histogram is more flexible
• We can change the number of “bins”
• And change the y-axis to a measure of “relative frequency” rather than a count.
[Figure: histogram; x-axis “What year were you born?” (1960 to 2000), y-axis “Percent”]

Another approach is a “stem and leaf”
• The stem consists of the numbers with the last digit omitted. So for our years, this means dropping the final digit of the year but keeping the decade. So “1975” would become “197”

195. |
196. |
197. |
198. |
199. |
200. |

• Then add the final digits (the leaf or leaves) back in to the corresponding stem

195. |
196. | 0
197. |
198. | 55557999
199. | 00011111122222222222233333333344444444445
200. |

Summary Statistics
Central Tendency and Spread
• Two of the simplest and most important measures

Central Tendency
• There are a number of measures of central tendency
• The most common are:
  • Mean
  • Median
  • Mode
• Let’s focus on the first two

Mean
• The mean is the average.
• To calculate it, we add up all the values and divide by the number of observations.
• If we have n observations x_1 through x_n, we can write it out as an equation:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

• Let’s use the first five years from our data as an example:
  1. 1985
  2. 1992
  3. 1992
  4. 1992
  5. 1985

\[ \bar{x} = \frac{1985 + 1992 + 1992 + 1992 + 1985}{5} = 1989.2 \]

Median
• The median is the middle-most value
• We can identify it by placing our data in order. Let’s use the same five values:
  1985 1985 1992 1992 1992
• The mean (1989.2) and median (1992) are often different. The median has a nice attribute in that it is generally not sensitive to outliers.
• If there are two middle-most values, we take the average of the two
• Let’s add our outlier (1960) to our data set and figure out the median:
  1960 1985 1985 1992 1992 1992
  The median is now (1985 + 1992) / 2 = 1988.5

Mean and Median
Mean
● Easy to understand. It’s the average
● Affected by extreme high or low values (outliers)
● May not best characterize skewed distributions
Median
● Not affected by outliers
● May better characterize skewed distributions

What about mode?
Mode
● The most frequent value
● Less often used in social science

Percentiles
• Imagine a chart with all the observable values in a population; it contains 100 percent of the possible values.
• The pth percentile is the value of a given distribution such that p% of the distribution is less than or equal to that value.
• Quartiles: the 25th, 50th, and 75th percentiles
• Quintiles: the 20th, 40th, 60th, and 80th percentiles
• Deciles: the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and 90th percentiles
• The 50th percentile is the MEDIAN

Basic descriptive statistics
[Figures: a curve with the area below each percentile shaded red]
• 10 percent under curve: 10th percentile = -1.2816
• 25 percent under curve: 25th percentile = -0.67
• 50 percent under curve: 50th percentile = 0.00
• 75 percent under curve: 75th percentile = 0.6745
• 90 percent under curve: 90th percentile = 1.2816

Percentiles from our data
• 25th percentile is 1991
• 50th percentile / the median value is 1992
• 75th percentile is 1993
[Figure: x-axis “What year were you born?” (1960 to 2000)]

Measures of Spread
How do we describe the different distributions?

Measures
• Range
• Interquartile range
• Index of dispersion
• Standard Deviation

Interquartile Range (IQR)
• The IQR is a simple measure of spread: it is the difference between the 25th and 75th percentile values.
• The IQR tells us about the spread from the median
• In our data, the 25th percentile is 1991, the median is 1992, and the 75th percentile is 1993, so the IQR is 1993 - 1991 = 2 years

Boxplots

Standard Deviation
• Often, we will use and talk about the st. dev.
• Represented by sigma: σ
• The st. dev. tells us about the spread from the mean
• (The IQR tells us about the spread from the median)
• We can think of the st. dev. (σ) as a measure of the average distance from the mean, where μ is the mean and x_i is each of the n values:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}} \]

• And we call the part under the square root the variance:

\[ \mathit{Variance} = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1} \]

Standard Deviation
• The st. dev. is quite cumbersome to calculate
  • First, we need the average, \( \mu \)
  • Then calculate the squared distance from the average for each value, \( (x_i - \mu)^2 \)
  • Sum them all up, \( \sum_{i=1}^{n} (x_i - \mu)^2 \)
  • Divide by n-1
  • Take the square root of all of that.

Standard Deviation
• But the st. dev. is really useful.
• If we have normally distributed data,
  • We can expect about 68% of values to be within 1 st. dev. of the mean
  • And about 95% to be within 2.

Other ways to describe spread
Skewness and Symmetry
• Why might data be skewed?
• Why might data be bimodal?
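Returning to the 50 birth-year responses, the frequency table, the measures of central tendency, the quartiles, and the standard deviation steps described in the slides can be reproduced in a few lines. This is a minimal sketch using only the Python standard library, not the method used in class. The variable names are illustrative, and `method="inclusive"` is an assumption about how the quartiles were computed (percentile conventions differ slightly between packages).

```python
from collections import Counter
from math import sqrt
import statistics

# The 50 "What year were you born?" responses listed in the slides.
years = [
    1993, 1991, 1960, 1993, 1994, 1992, 1989, 1992, 1993, 1993,
    1994, 1991, 1990, 1992, 1987, 1989, 1994, 1992, 1989, 1992,
    1994, 1985, 1994, 1991, 1991, 1992, 1993, 1993, 1993, 1992,
    1991, 1985, 1992, 1992, 1992, 1985, 1994, 1993, 1995, 1991,
    1985, 1993, 1990, 1992, 1994, 1994, 1994, 1994, 1992, 1990,
]
n = len(years)

# Frequency table: count and percent of responses for each year.
freq = Counter(years)
print("Year born  Frequency  Percent")
for year in sorted(freq):
    print(f"{year:<9}  {freq[year]:>9}  {100 * freq[year] / n:>7.2f}")

# Central tendency.
mean = statistics.mean(years)
median = statistics.median(years)
mode = statistics.mode(years)

# Quartiles. method="inclusive" is an assumption: it interpolates between
# observed values and never goes beyond the smallest or largest response.
q1, q2, q3 = statistics.quantiles(years, n=4, method="inclusive")
iqr = q3 - q1

# Standard deviation, following the steps on the slides:
# squared distance from the average for each value, summed,
# divided by n - 1, then the square root of all of that.
squared_distances = [(x - mean) ** 2 for x in years]
variance = sum(squared_distances) / (n - 1)
st_dev = sqrt(variance)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"quartiles={q1}, {q2}, {q3} IQR={iqr}")
print(f"variance={variance:.2f} st. dev.={st_dev:.2f}")
```

With this data the mean is 1990.92, the median and mode are both 1992, and the quartiles are 1991, 1992, and 1993, which matches the percentiles quoted in the slides. The single 1960 response pulls the mean below the median, which is the outlier sensitivity the slides describe.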
Skewed data example: Family Income
[Figure: histogram of family income; x-axis from 0 to 800,000, y-axis from 0 to 6000]
• Q: Guess the mean
  • $71,840
• Q: Guess the median
  • $55,000

Interpreting Tables
Elements of a Table

Table 1: Outbound Freight Destinations from the Port of New York and New Jersey

               Port: Global      Port: NYCT
Destination    Trips   Share     Trips   Share
Canada            15    0.6%         6    0.2%
Connecticut        6    0.2%        20    0.6%
Delaware           3    0.1%         3    0.1%
Massachusetts     42    1.7%        19    0.6%
Maryland           1    0.0%         1    0.0%
Maine              1    0.0%         0    0.0%
New Jersey     1,941   79.9%     2,681   82.0%
New York         156    6.4%       314    9.6%
Ohio               2    0.1%         0    0.0%
Pennsylvania     119    4.9%       153    4.7%
Texas              1    0.0%         0    0.0%
California         0    0.0%        16    0.5%
Missing          141    5.8%        56    1.7%
Total          2,428             3,269
NY & NJ        2,097   86.4%     2,995   91.6%

• Title describes content
• Sample size presented
• Actual and percentage shares presented

Table 4: Toll and Operation Cost Estimates for 20 Mile Trip from New York Area Ports, 2011

                                   U.S. Average                To and From Global   To and From NYCT
Cost of Operations                 Avg. Cost   Total    % of   Total    % of        Total    % of
                                   per Mile    Costs    Costs  Costs    Costs       Costs    Costs
Vehicle-based
  Fuel and Oil                     $0.59       $11.90   35%    $11.80   27%         $11.80   14%
  Truck/Trailer Lease or Purchase  $0.19       $3.78    11%    $3.78    9%          $3.78    5%
  Repair and Maintenance           $0.15       $3.04    9%     $3.04    7%          $3.04    4%
  Truck Insurance Premiums         $0.07       $1.34    4%     $1.34    3%          $1.34    2%
  Permits and Licenses             $0.04       $0.76    2%     $0.76    2%          $0.76    1%
  Tires                            $0.04       $0.84    2%     $0.84    2%          $0.84    1%
  Tolls: General                   $0.02       $0.34    1%     $0.34    1%          $0.34    0%
  Tolls: Bridges                                               $8.97    21%         $48.22   59%
Driver-based
  Driver Wages                     $0.46       $9.20    27%    $9.20    21%         $9.20    11%
  Driver Benefits                  $0.15       $3.02    9%     $3.02    7%          $3.02    4%
Total Costs                        $1.71       $34.12   100%   $43.09   100%        $83.34   100%

Note: Estimates are of overall cost of a 20 mile trip. General operating costs from 2012 ATRI Average Carrier Costs per Mile. Calculations by Jonathan Peters.

• Assumptions stated
• Source of calculations stated

Interpreting Tables
Homicide Rates per 100,000 residents by year and treatment status in 1977

Group      1975  1977  Total
Untreated   8.0   6.9    7.5
Treated    10.3   9.7   10.0
Total       9.6   8.8    9.2

• From Manski (2014)
• The death penalty moratorium was lifted in the U.S. in 1976
• Three ways to interpret the data presented

Interpreting Tables
1) “Before and after”
• Average effect of death penalty is -0.6 (calculated as 9.7 - 10.3)

Interpreting Tables
2) Compare treated and untreated
• Assumes all else equal, e.g. propensity to kill is the same everywhere
• Average effect in 1977 is 2.8 (= 9.7 - 6.9)

Interpreting Tables
3) Difference in difference
• Changes in effects over time to account for policy changes
• Treated states declined from 10.3 to 9.7 = -0.6
• Untreated states declined from 8.0 to 6.9 = -1.1
• Effect = 0.5 = [(9.7 - 10.3) - (6.9 - 8.0)]

Interpreting Tables
• Before and after shows reduced homicide rates (-0.6 per 100,000)
• Comparison of treated and untreated shows a rate that is 2.8 per 100,000 higher in treated states
• Difference in difference shows an increase of 0.5 per 100,000
• Explanations?

Presenting Data
• Tables
• Charts
• Graphs

Problems with Pie Charts
• No sample size
• Similarly sized pies suggest all groups are equal and all response rates are about the same
• Were yes/no the only options?
• What are “enough transportation options”?

When Pie Charts Are Appropriate

Bar Chart

Measures of association
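Looking back at the homicide-rate table from the Interpreting Tables slides, the three readings (before and after, treated versus untreated, and difference in difference) are all simple subtractions on the same four numbers. The sketch below only checks that arithmetic; the dictionary layout and variable names are illustrative and do not come from Manski (2014).

```python
# Homicide rates per 100,000 residents, copied from the table in the slides.
rates = {
    "untreated": {1975: 8.0, 1977: 6.9},
    "treated": {1975: 10.3, 1977: 9.7},
}

# 1) Before and after: change over time within the treated group only.
before_after = rates["treated"][1977] - rates["treated"][1975]

# 2) Treated vs. untreated: gap between the groups in 1977 only.
treated_vs_untreated = rates["treated"][1977] - rates["untreated"][1977]

# 3) Difference in difference: change in the treated group minus
#    the change in the untreated group over the same period.
untreated_change = rates["untreated"][1977] - rates["untreated"][1975]
diff_in_diff = before_after - untreated_change

print(f"Before and after:         {before_after:+.1f}")
print(f"Treated vs. untreated:    {treated_vs_untreated:+.1f}")
print(f"Difference in difference: {diff_in_diff:+.1f}")
```

The three printed values are -0.6, +2.8, and +0.5. The same table supports a decrease, a large increase, or a small increase depending on which comparison is chosen.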