ch 3: numerically summarizing data
- center, spread, shape
3.1 measures of central tendency
or, give me one number that represents all the data
consider the number of math classes taken by math 150 students. how can
we represent the results in one number?
average: add up all the numbers and divide by how many numbers there are
(also called the mean)
ex) suppose you score 71, 75, 84 on three tests. what is your test average?
(71+75+84)/3 = 230/3 ≈ 76.7
ex) for number of math classes, mean =
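a minimal Python sketch of the definition of the mean, using the three test scores from the example above. the function name `mean` is only for illustration:

```python
def mean(values):
    """Add up all the numbers and divide by how many there are."""
    return sum(values) / len(values)


test_scores = [71, 75, 84]
print(round(mean(test_scores), 2))  # 76.67
```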
median: the middle number
ex) suppose you score 71, 75, 84 on three tests. what is your median test
score?
median is 75
interpretation: half of the scores are at or below 75, and half of the scores are
at or above 75
note: you must put data in ascending order to determine the median
ex) what is the median for: 75, 84, 71 ...sort first: 71, 75, 84 ...median is 75
0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2...median is
ex) heights of students (in inches):
59,61,62,64,64,64,65,66,66,66,67,68,68,69,70,70,71,71,73
what is the median height?
...find the middle number: there are 19 numbers, and (19+1)/2 = 10 ...so it's the
number in the 10th position ...the median is 66
what do you do if there are two middle numbers? add them together and divide by
two (i.e. take their average)
...this happens when there is an even number of data values
note that with 20 data values, the "+1" method gives (20+1)/2 = 10.5
...this means the median is between the 10th and 11th numbers, so take their
average
ex) find the median for: 5 7 8 9
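the "+1" method can be sketched in Python as follows. this is one way to write it, not the only one; the function name `median` is an illustrative choice:

```python
def median(values):
    ordered = sorted(values)      # data must be in ascending order first
    n = len(ordered)
    position = (n + 1) / 2        # the "+1" method from the notes

    if position.is_integer():
        # odd number of data values: one middle number
        # (positions start at 1, Python indexes start at 0)
        return ordered[int(position) - 1]

    # even number of data values: average the two middle numbers,
    # e.g. position 10.5 means the 10th and 11th values
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


heights = [59, 61, 62, 64, 64, 64, 65, 66, 66, 66,
           67, 68, 68, 69, 70, 70, 71, 71, 73]

print(median([75, 84, 71]))   # 75
print(median(heights))        # 66
print(median([5, 7, 8, 9]))   # 7.5
```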
mode: most common number
ex) number of math classes: 1
ex) heights: two modes 64 and 66
ex) test scores: no mode (all the same frequency)
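a small Python sketch of the mode, following the convention in these notes that there is no mode when every value has the same frequency. the function name `modes` is hypothetical:

```python
from collections import Counter


def modes(values):
    counts = Counter(values)             # how many times each value appears
    highest = max(counts.values())

    # convention from the notes: if every value is equally common, no mode
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []

    return sorted(v for v, c in counts.items() if c == highest)


heights = [59, 61, 62, 64, 64, 64, 65, 66, 66, 66,
           67, 68, 68, 69, 70, 70, 71, 71, 73]

print(modes(heights))        # [64, 66]
print(modes([71, 75, 84]))   # []
```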
Question: which of these should we use, and why?
ex) number of credits taken at BMCC among math150 students:
0,0,9,12,21,22,27,32,35,38,44,50,52,56
mean =
median =
mode = 0
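to compare the three measures on the credits data, one option is Python's standard `statistics` module. this is only a sketch; the variable name `credit_counts` is made up for the example:

```python
import statistics

credit_counts = [0, 0, 9, 12, 21, 22, 27, 32, 35, 38, 44, 50, 52, 56]

print(round(statistics.mean(credit_counts), 2))  # 28.43
print(statistics.median(credit_counts))          # 29.5
print(statistics.multimode(credit_counts))       # [0]
```

here the mean and median are close to each other, while the mode (0) is far from the middle of the data.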
ex) there can be a problem with the mean
the average salary in this class is around $15,000
if Bill Gates (and his $1,000,000,000 salary) walks into the room,
the average salary is now around $35,000,000. does this make us all millionaires?
...no
the median salary is still around $15,000, because at most the median moves to the next
number on the list
"the Bill Gates effect"
Bill Gates' salary is an outlier: it is a value far away from most of the data
the average is not robust with respect to an outlier
the median is robust with respect to an outlier
robust: not strongly affected by outliers [also known as resistant]
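the "Bill Gates effect" can be tried out in Python. the salary list below is hypothetical (it is not the real class data); it is chosen only so that the mean and median both start at $15,000:

```python
import statistics

# hypothetical class salaries, NOT real data
salaries = [0, 5_000, 10_000, 12_000, 15_000, 18_000, 20_000, 25_000, 30_000]
with_outlier = salaries + [1_000_000_000]   # the outlier walks in

print(statistics.mean(salaries))        # 15000
print(statistics.median(salaries))      # 15000

print(statistics.mean(with_outlier))    # 100013500
print(statistics.median(with_outlier))  # 16500.0
```

the mean jumps by a huge amount, while the median only moves to halfway between two neighboring numbers on the list.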
Q: where are the median and the mean (approx)?
3.2 Measures of Dispersion
how spread out is the data
because mean & median do not tell the whole story
ex) two groups of 5 men, heights
group 1: 5'8,5'10, 5'11, 6', 5'9 ... in inches: 68,69,70,71,72
group 2: 4'6,7'4,4'2,6'8,6'6 ... in inches: 50,54,78,80,88
find mean:
group 1: (68+69+70+71+72)/5 = 350/5 = 70" (or 5'10)
group 2: (54+88+50+80+78)/5 = 350/5 = 70" (or 5'10)
- range
(highest) - (lowest)
ex) group #1: 72" - 68"=4"
group #2: 88" - 50" = 38"
note: the range is affected by an outlier
ex) our salary range is $30,000 - $0 = $30,000
with Bill Gates, the range is $1,000,000,000 - $0 = $1,000,000,000
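the range is simple enough to check in Python. a minimal sketch, with the illustrative function name `data_range`:

```python
def data_range(values):
    """(highest) - (lowest)"""
    return max(values) - min(values)


group_1 = [68, 69, 70, 71, 72]   # heights in inches
group_2 = [50, 54, 78, 80, 88]

print(data_range(group_1))  # 4
print(data_range(group_2))  # 38
```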
- standard deviation
ex) group 1 (inches) 68,69,70,71,72 mean = 70
standard deviation =
you do:
ex) group #2: 54, 88, 50, 80, 78 ... mean = 70
find the standard deviation
ex) var = 4 ... st.dev. =
ex) st.dev. = 9 ... var =
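a step-by-step Python sketch of the standard deviation calculation. the notes do not say at this point whether the two groups are samples or populations, so the `sample` flag (a made-up parameter name) lets you choose the divisor; treating the groups as samples in the printout is an assumption:

```python
import math


def variance_and_std_dev(data, sample=True):
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]  # (x - mean)^2 for each x
    divisor = n - 1 if sample else n    # sample: n - 1, population: n
    variance = sum(squared_deviations) / divisor
    return variance, math.sqrt(variance)  # st.dev. is the square root of the variance


group_1 = [68, 69, 70, 71, 72]
group_2 = [54, 88, 50, 80, 78]

# assumption: both groups are treated as samples here
var_1, sd_1 = variance_and_std_dev(group_1)
var_2, sd_2 = variance_and_std_dev(group_2)
print(var_1, round(sd_1, 2))   # 2.5 1.58
print(var_2, round(sd_2, 2))   # 286.0 16.91

# going back and forth between variance and standard deviation
print(math.sqrt(4))   # var = 4 ... st.dev. = 2.0
print(9 ** 2)         # st.dev. = 9 ... var = 81
```

both groups have the same mean (70), but group 2 has a much larger standard deviation.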
            sample                    population
mean        x̄ "x-bar"                 µ "mu"
st.dev.     s                         σ "sigma"
variance    s²                        σ²
size        n                         N
            depends on your sample    fixed
            a "statistic"             a "parameter"
also: "data value" = x
the way that you calculate the sample mean and
the population mean is exactly the same.
the difference is the kind of information each one gives
you
ex) find the standard deviation of the sample 7,10,16 (and the
variance)
note for standard deviation:
for a population, divide by the number of data values (N)
for a sample, divide by the number of data values minus 1 (n - 1)
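Python's standard `statistics` module has separate functions for the two divisors: `variance` and `stdev` divide by n - 1 (sample), while `pvariance` and `pstdev` divide by N (population). a quick check with the sample 7, 10, 16:

```python
import statistics

data = [7, 10, 16]

# treated as a sample: divide by n - 1 = 2
print(statistics.variance(data))            # 21
print(round(statistics.stdev(data), 2))     # 4.58

# the same numbers treated as a population: divide by N = 3
print(statistics.pvariance(data))           # 14
print(round(statistics.pstdev(data), 2))    # 3.74
```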
3.3 calculating that stuff from a table [extra credit material]
(measures of central tendency and dispersion)
or, what to do if we have only the table of data and not the raw data
ex)
what's the mean?
note: the table is an approximation, so the result will be an approximation
note: divide by 12, not 5, because 12 is the total frequency (e.g. 25
appears 7 times)
this is similar to a weighted mean
ex) get three scores, 80, 95, 70
what's the mean?...
but the first score is your hw grade (that counts 20%), the second score is your midterm
grade (that counts 30%), and the third score is your final exam grade (that counts 50%)
formula for a weighted mean:
x̄ or µ = Σ x · rel.freq(x)
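a Python sketch of the weighted mean for the three scores above. the function name `weighted_mean` is illustrative. dividing by the total weight means the same function also works when the weights are frequencies from a table (that is the "divide by the total frequency" note):

```python
def weighted_mean(values, weights):
    total_weight = sum(weights)   # 1.0 for percentages, total frequency for a table
    return sum(x * w for x, w in zip(values, weights)) / total_weight


scores = [80, 95, 70]          # hw, midterm, final exam
weights = [0.20, 0.30, 0.50]   # how much each score counts

plain = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)

print(round(plain, 2))      # 81.67 (every score counts the same)
print(round(weighted, 2))   # 79.5  (the final exam counts the most)
```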
what's the standard deviation? [extra credit material]
s=
measures of position
- rank (location)
ex) New York marathon: 12,635 people run, and you finish 586th
your rank is 586 (out of 12,635)
- percentile
you are above ? % of the data
percentile --> data value
ex) 3,7,9,12,15,15,16,18,19,21,24,26,28,29
find the 37th percentile:
(n=14)
rank = (n+1)(P/100), then find the data value
ex) find the 58th percentile
you do:
ex) find the 82nd percentile
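a Python sketch of going from a percentile to a data value with the rank formula above. one assumption is labeled in the code: when the rank is not a whole number, the sketch averages the two neighboring data values (the same idea as the median with two middle numbers). other textbooks round or interpolate differently. the function name `percentile_value` is hypothetical:

```python
import math


def percentile_value(data, p):
    ordered = sorted(data)
    n = len(ordered)
    rank = (n + 1) * p / 100          # rank = (n+1)(P/100)
    rank = min(max(rank, 1), n)       # keep the rank between 1 and n

    # ASSUMPTION: a rank like 5.55 means "between the 5th and 6th values",
    # so take the average of those two values
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


data = [3, 7, 9, 12, 15, 15, 16, 18, 19, 21, 24, 26, 28, 29]

print(percentile_value(data, 37))   # rank 5.55 -> 15.0
print(percentile_value(data, 58))   # rank 8.7  -> 18.5
print(percentile_value(data, 82))   # rank 12.3 -> 27.0
```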
data value --> percentile
ex) at what percentile is x=24? [recall: "x" means data value]
x=24 is above 10 data values (out of 14)
percentile: 10/14 ≈ .71 or 71st percentile (above 71% of the data)
notation: the 71st percentile is 24
P71 = 24
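the other direction (data value to percentile) is a count and a division. a minimal Python sketch, with the made-up function name `percentile_of`:

```python
def percentile_of(data, x):
    below = sum(1 for value in data if value < x)   # how many data values x is above
    return 100 * below / len(data)


data = [3, 7, 9, 12, 15, 15, 16, 18, 19, 21, 24, 26, 28, 29]

print(round(percentile_of(data, 24)))   # 71
```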
note: for both problems, the middle step is to find the rank (position)
note: the "+1" formula has some glitches for small data sets. this comes from
the fact that one data value represents a large chunk of your data set (e.g. if
you have 20 numbers, each one represents 5%)
...just follow the formula
- quartile
break the data into four quartiles. they are marked off by: quarter
point, half-way point, three-quarter point
- 5-number summary
min--Q1--Q2--Q3--max
Q1: the data value after one quarter of the data. that's the same as P25
(the data value at the 25th percentile mark). it separates the first quartile and
the second quartile
Q2 is in the 50th percentile position (then find the data value); it's the median
Q3 is in the 75th percentile position (then find the data value)
ex) 14,15,16,17,18,19,20,21,22
(n=9)
using the formula:
Q1 appears in which position?
Q1 =
Q2 appears in which position?
Q2 =
Q3 appears in which position?
Q3 =
follow-up: in which quartile is x=19 ?
why do we need the "+1"? well, if we didn't have it, then for Q2 we would
calculate
(9)(.5) = 4.5
but we know that's not right, it's too low: the middle of 9 numbers is the 5th position,
and (9+1)(.5) = 5 ...the "+1" fixes that problem
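a Python sketch of the 5-number summary using the "+1" position formula. as an assumption, a position such as 2.5 is handled by averaging the 2nd and 3rd data values; the function names are hypothetical:

```python
import math


def value_at_position(ordered, position):
    position = min(max(position, 1), len(ordered))   # stay inside the list
    # ASSUMPTION: for a position like 2.5, average the 2nd and 3rd values
    lower = math.floor(position)
    upper = math.ceil(position)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


def five_number_summary(data):
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) * 0.25)
    q2 = value_at_position(ordered, (n + 1) * 0.50)
    q3 = value_at_position(ordered, (n + 1) * 0.75)
    return ordered[0], q1, q2, q3, ordered[-1]


data = [14, 15, 16, 17, 18, 19, 20, 21, 22]

# positions are (9+1)(.25) = 2.5, (9+1)(.5) = 5, (9+1)(.75) = 7.5
print(five_number_summary(data))   # (14, 15.5, 18.0, 20.5, 22)
```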
Boxplot
- a visual representation of the 5 number summary
- helps you see if the distribution is symmetric or skewed
ex) here, the box runs from Q1 = 15.5 to Q3 = 20.5, so its length is 20.5 - 15.5 = 5
this distribution shape is called "symmetric"
here are some other shapes (as seen with boxplots):
- z-score
"the number of standard deviations from the mean"
ex) there is an exam. the mean score is 77, you got an 85. is that good? how
good?
it depends.
suppose the standard deviation is 4. how many standard dev's above
the mean is your score?
you are 8 points above the mean...that is 2 standard deviations
(since st.dev. is 4)
Jerry got an 88. how many standard deviations above the mean is his
score?
what is each number called?
Formula:
for a z-score: z = (x - µ) / σ   (population)
for a sample, same formula, different notation: z = (x - x̄) / s
ex) find the z-score for 47 if µ=38, σ=5
what does that mean, in words?
... 1.8 standard deviations above the mean
ex) find the z-score for 68 if µ=78, σ=4
note that a positive z-score means your data value is above the mean
and a negative z-score means your data value is below the mean
ex) which exam score is relatively better, a 75 when the class average was
68 and the standard deviation was 4, or an 89 when the class average was 76
and the standard deviation was 12? (use the z-score)
ex) find the data value which is 2 standard deviations above the mean if µ=32,
σ=6
formula for x: x = µ + z·σ
same as the formula for z, but you solve for x
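the two formulas can be sketched in Python as a pair of small functions. the names `z_score` and `value_from_z` are illustrative, and the numbers come from the examples above:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x is from the mean."""
    return (x - mean) / sd


def value_from_z(z, mean, sd):
    """Solve the z-score formula for x: x = mean + z * sd."""
    return mean + z * sd


print(z_score(47, 38, 5))   # 1.8  (above the mean)
print(z_score(68, 78, 4))   # -2.5 (below the mean)

# which exam score is relatively better?
z_first = z_score(75, 68, 4)     # 1.75
z_second = z_score(89, 76, 12)   # about 1.08
print(z_first > z_second)        # True: the 75 is relatively better

print(value_from_z(2, 32, 6))    # 44
```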