ch 3: numerically summarizing data - center, spread, shape

3.1 measure of central tendency
or, give me one number that represents all the data

consider the number of math classes taken by math 150 students.
how can we represent the results in one number?

average: add up all the numbers and divide by how many numbers there are
also called the mean
ex) suppose you score 71, 75, 84 on three tests. what is your test average?
    (71 + 75 + 84)/3 = 230/3 ≈ 76.7
ex) for number of math classes, mean =

median: the middle number
ex) suppose you score 71, 75, 84 on three tests. what is your median test score?
    median is 75
interpretation: half the time the score is above 75, half the time the score is below 75
note: you must put data in ascending order to determine the median
ex) what is the median for: 75, 84, 71
    ...in order: 71, 75, 84 ...the median is 75
    0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2...
    median is
ex) heights of students (in inches):
    59,61,62,64,64,64,65,66,66,66,67,68,68,69,70,70,71,71,73
    what is the median height?
    ...find middle number: there are 19 numbers
    (19+1)/2 = 10
    ...so it's the number in the 10th position
    ...the median is 66

what do you do if there are two middle numbers?
add together, divide by two (i.e. take the average)
...this will happen when there is an even amount of data
note that, using the "+1" method, you would get (20+1)/2 = 10.5
...this means the median is between the 10th and 11th numbers, so take their average
ex) find the median for: 5 7 8 9
    the two middle numbers are 7 and 8 ...(7 + 8)/2 = 7.5

mode: most common number
ex) number of math classes: 1
ex) heights: two modes, 64 and 66
ex) test scores: no mode (all the same frequency)

Question: which of these should we use, and why?
ex) number of credits taken at BMCC among math150 students:
    0,0,9,12,21,22,27,32,35,38,44,50,52,56
    mean = 398/14 ≈ 28.4
    median = (27 + 32)/2 = 29.5
    mode = 0
ex) there can be a problem with the mean
    the average salary in this class is around $15,000
    if Bill Gates (and his $1,000,000,000 salary) walks into the room,
    the average salary is now around $35,000,000.
    does this make us all millionaires?
    ...no
    the median salary is still around $15,000, because at most you go to the next number on the list
"the Bill Gates effect"
Bill Gates' salary is an outlier: it is a value far away from most of the data
the average is not robust with respect to an outlier
the median is robust with respect to an outlier
robust: not affected by [also known as resistant]
Q: where are the median and the mean (approx)?

3.2 Measures of Dispersion
how spread out is the data
because mean & median do not tell the whole story
ex) group of 5 men, heights
    group 1: 5'8, 5'10, 5'11, 6', 5'9 ... in inches: 68,69,70,71,72
    group 2: 4'6, 7'4, 4'2, 6'8, 6'6 ... in inches: 50,54,78,80,88
find mean:
    group 1: (68+69+70+71+72)/5 = 350/5 = 70" (or 5'10)
    group 2: (54+88+50+80+78)/5 = 350/5 = 70" (or 5'10)

- range
(highest) - (lowest)
ex) group #1: 72" - 68" = 4"
    group #2: 88" - 50" = 38"
note: affected by an outlier
ex) our salary range is 30000 - 0 = 30000
    with Bill Gates, range is 1000000000 - 0 = 1000000000

- standard deviation
ex) group 1 (inches) 68,69,70,71,72
    mean = 70
    standard deviation =
you do: ex) group #2: 54, 88, 50, 80, 78 ... mean = 70
    find the standard deviation
ex) var = 4 ... st.dev. = 2
ex) st.dev. = 9 ... var = 81

                sample              population
    mean        x̄ "x-bar"           µ "mu"
    st.dev.     s                   σ "sigma"
    variance    s²                  σ²
    size        n                   N
                depends on          fixed
                your sample
                a "statistic"       a "parameter"

also: "data value" = x
the way that you calculate the sample mean and the population mean is exactly the same.
the difference is the kind of information it gives you

ex) find the standard deviation of the sample 7,10,16 (and the variance)
    mean = (7 + 10 + 16)/3 = 11
    squared distances from the mean: (-4)² + (-1)² + (5)² = 16 + 1 + 25 = 42
    variance: s² = 42/(3 - 1) = 21
    standard deviation: s = sqrt(21) ≈ 4.58
note for standard dev:
    for a population, divide by the number of data:  σ = sqrt( Σ(x - µ)² / N )
    for a sample, divide by the number - 1:          s = sqrt( Σ(x - x̄)² / (n - 1) )

3.3 calculating that stuff from a table [extra credit material]
(measures of central tendency and dispersion)
or, what to do if we have only the table of data and not the raw data
ex) what's the mean?
note: the table is an approximation, so the result will be an approximation
note: divide by 12, not 5, because 12 is the total frequency (e.g. 25 appears 7 times)

this is similar to a weighted mean
ex) get three scores, 80, 95, 70
    what's the mean?...
    but the first score is your hw grade (that counts 20%)
    the second score is your midterm grade (that counts 30%)
    the third score is your final exam grade (that counts 50%)
Formula for a weighted mean:
    mean = Σ x · rel.freq(x)        (x̄ or µ)
    here: (80)(.20) + (95)(.30) + (70)(.50) = 16 + 28.5 + 35 = 79.5
what's the standard deviation? [extra credit material]
    s =

measures of position
- rank (location)
ex) New York marathon, 12,635 people run, you finished 586th
    your rank is 586 (out of 12635)

- percentile
you are above ? % of the data

percentile --> data value
ex) 3,7,9,12,15,15,16,18,19,21,24,26,28,29   (n=14)
    find the 37th percentile:
    rank = (n+1)(P/100), then find the data value
    rank = (15)(.37) = 5.55 ...between the 5th and 6th numbers, which are both 15 ...so P37 = 15
ex) find the 58th percentile
you do: ex) find the 82nd percentile

data value --> percentile
ex) at what percentile is x=24? [recall: "x" means data value]
    x=24 is above 10 data values (out of 14)
    percentile: 10/14 = .71 or 71st percentile (above 71% of the data)
notation: the 71st percentile is 24
    P71 = 24
note: for both problems, the middle step is to find the rank (position)
note: the "+1" formula has some glitches for small data sets.
    this comes from the fact that one data value represents a large chunk of your data set
    (e.g. if you have 20 numbers, each one represents 5%)
    ...just follow the formula

- quartile
break the data into four quartiles.
they are marked off by: quarter point, half-way point, three-quarter point

- 5-number summary
min--Q1--Q2--Q3--max
Q1: data value after one quarter of the data.
that's the same as P25 (the data value at the 25th percentile mark).
it separates the first quartile and the second quartile
Q2 is in the 50th percentile position (then find the data value)
Q3 is in the 75th percentile position (then find the data value)
ex) 14,15,16,17,18,19,20,21,22 (n=9)
using the formula:
    Q1 appears in which position? (9+1)(.25) = 2.5 ...between 15 and 16 ... Q1 = 15.5
    Q2 appears in which position? (9+1)(.50) = 5 ... Q2 = 18
    Q3 appears in which position? (9+1)(.75) = 7.5 ...between 20 and 21 ... Q3 = 20.5
follow-up: in which quartile is x=19?
why do we need the "+1"?
well, if we didn't have it then for Q2 we would calculate (9)(.5) = 4.5
but we know that's not right, it's too low...the "+1" fixes that problem

Boxplot
- a visual representation of the 5 number summary
- helps you see if the distribution is symmetric or skewed
ex) here, the width of the box (Q3 - Q1) is 20.5 - 15.5 = 5
this distribution shape is called "symmetric"
here are some other shapes (as seen with boxplots):

- z-score
"the number of standard deviations from the mean"
ex) there is an exam. the mean score is 77, you got an 85. is that good? how good?
    it depends. suppose the standard deviation is 4.
    how many standard dev's above the mean is your score?
    you are 8 points above the mean...that is 2 standard deviations (since st.dev. is 4)
    Jerry got an 88. how many standard deviations above the mean is his score?
    what is each number called?
Formula for a z-score:
    z = (x - µ) / σ     (population)
    for a sample, same formula, different notation:
    z = (x - x̄) / s
ex) find the z-score for 47 if µ=38, σ=5
    z = (47 - 38)/5 = 1.8
    what does that mean, in words? ... 1.8 standard deviations above the mean
ex) find the z-score for 68 if µ=78, σ=4
note that a positive z-score means your data value is above the mean
and a negative z-score means your data value is below the mean
ex) which exam score is relatively better,
    a 75 when the class average was 68 and the standard deviation was 4, or
    an 89 when the class average was 76 and the standard deviation was 12?
    (use the z-score)
ex) find the data value which is 2 standard deviations above the mean if µ=32, σ=6
formula for x:
    x = µ + z·σ
    same as the formula for z, but you solve for x
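
the z-score examples above can also be checked with a few lines of Python. this is only a sketch: the function names z_score and data_value are made up for illustration (they are not from the course), and it assumes you already know the mean and the standard deviation.

```python
def z_score(x, mean, st_dev):
    """Number of standard deviations that the data value x lies from the mean."""
    return (x - mean) / st_dev


def data_value(z, mean, st_dev):
    """The z-score formula solved for x: x = mean + z * st_dev."""
    return mean + z * st_dev


# z-score for 47 if the mean is 38 and the st.dev. is 5 (above the mean, so positive)
z_above = z_score(47, 38, 5)

# z-score for 68 if the mean is 78 and the st.dev. is 4 (below the mean, so negative)
z_below = z_score(68, 78, 4)

# which exam score is relatively better? compare the two z-scores
z_exam_a = z_score(75, 68, 4)    # a 75 when the average was 68, st.dev. 4
z_exam_b = z_score(89, 76, 12)   # an 89 when the average was 76, st.dev. 12
better = "75" if z_exam_a > z_exam_b else "89"

# data value that is 2 standard deviations above the mean if the mean is 32, st.dev. 6
x = data_value(2, 32, 6)

print(z_above, z_below, z_exam_a, round(z_exam_b, 2), better, x)
```

the score with the larger z-score is the one that is relatively better, because it sits more standard deviations above its own class average.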