Measures of Position
Where does a certain data value fit in relative
to the other data values?
To accompany Hawkes lesson 3.3
Original content by D.R.S.
1
Nth Place
• The highest and the lowest
• 2nd highest, 3rd highest, etc.
• “If I made $60,000, I would be 6th richest.”
2
Another view: “How does my 𝑥
compare to the mean?”
• “Am I in the middle of the pack?”
• “Am I above or below the middle?”
• “Am I extremely high or extremely low?”
• 𝑧 Score is the measuring stick
3
𝒛 Score: 𝑥 is how many standard
deviations away from the mean?
If you know the x value
• Population: 𝑧 = (𝑥 − 𝜇) / 𝜎
• Sample: 𝑧 = (𝑥 − 𝑥̄) / 𝑠
To work backward from z to x
• Population: 𝑥 = 𝑧 ∙ 𝜎 + 𝜇
• Sample: 𝑥 = 𝑧 ∙ 𝑠 + 𝑥̄
4
𝑧 score is also called “Standard Score”
• No matter what 𝑥 is measured in or how large
or small the 𝑥 values are…
• The 𝑧 score of the mean will be 0
– Because numerator 𝑥 − 𝑥̄ turns out to be 0.
• If 𝑥 is above the mean, its 𝑧 is positive.
– Because numerator 𝑥 − 𝑥̄ turns out to be positive.
• If 𝑥 is below the mean, its 𝑧 is negative.
– Because numerator 𝑥 − 𝑥̄ turns out to be negative.
5
𝑧 score values
• Typically round to two decimal places.
– Don’t say “0.2589”, say “0.26”
• If not two decimal places, pad
– Don’t say “2”, say “2.00”
– Don’t say “-1.1”, say “-1.10”
• 𝑧 scores are almost always in the interval
−4 < 𝑧 < 4. Be very suspicious if you
calculate a 𝑧 score outside that interval.
6
Practice: Given x, compute z
Find the 𝑧 scores corresponding to the 𝑥 salary
values, given that the mean 𝑥̄ = $51,168 and
the standard deviation 𝑠 = $16,291.
• 𝑥 = $90,000
• 𝑥 = $70,000
• 𝑥 = $50,000
• 𝑥 = $30,000
• 𝑥 = $10,000
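One way to check this practice by machine is the short Python sketch below. It applies the sample formula 𝑧 = (𝑥 − 𝑥̄) / 𝑠 to the five salaries. The names `z_score`, `MEAN`, and `STD_DEV` are made up for this illustration and do not come from the Hawkes materials.

```python
MEAN = 51168     # sample mean from the slide, in dollars
STD_DEV = 16291  # sample standard deviation from the slide, in dollars


def z_score(x, mean, std_dev):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


salaries = [90_000, 70_000, 50_000, 30_000, 10_000]
z_scores = [z_score(x, MEAN, STD_DEV) for x in salaries]

# Report each z score padded to two decimal places, as the slides recommend.
for x, z in zip(salaries, z_scores):
    print(f"x = ${x:,}  ->  z = {z:.2f}")
# z values printed: 2.38, 1.16, -0.07, -1.30, -2.53
```

Note that $50,000 is just below the mean, so its 𝑧 score is a small negative number.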
7
Practice: Given z, compute x
Find the 𝑥 scores (salaries) corresponding to
these 𝑧 standard scores, given that the mean
𝑥̄ = $51,168 and the standard deviation
𝑠 = $16,291.
• 𝑧=0
• 𝑧 = 1 and 𝑧 = −1
• 𝑧 = 2 and 𝑧 = −2
• 𝑧 = 3 and 𝑧 = −3
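Working backward can be sketched the same way, using 𝑥 = 𝑧 ∙ 𝑠 + 𝑥̄. The function name `x_from_z` is only an illustrative choice.

```python
MEAN = 51168     # sample mean from the slide, in dollars
STD_DEV = 16291  # sample standard deviation from the slide, in dollars


def x_from_z(z, mean, std_dev):
    """Convert a standard score back to the original units."""
    return z * std_dev + mean


for z in [0, 1, -1, 2, -2, 3, -3]:
    print(f"z = {z:+d}  ->  x = ${x_from_z(z, MEAN, STD_DEV):,}")
# z = 0 gives the mean itself, $51,168.
# z = +1 and -1 give $67,459 and $34,877.
# z = +2 and -2 give $83,750 and $18,586.
# z = +3 and -3 give $100,041 and $2,295.
```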
8
Two parallel axes (scales), 𝑥 and 𝑧
9
Example: Using 𝑧 scores to compare
unlike items
The Literature test
• The mean score was 77
points.
• The standard deviation was
11 points
• Sue earned 91 points
• Find her z score for this test
The Biology test
• The mean score was 47
points
• The standard deviation was
6 points
• Sue earned 55 points
• Find her z score for this test
• On which test did she have
the “better” performance?
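Here is a small Python sketch of the comparison, assuming "better" means the higher 𝑧 score. The variable names are illustrative only.

```python
def z_score(x, mean, std_dev):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


literature_z = z_score(91, mean=77, std_dev=11)  # 14 / 11
biology_z = z_score(55, mean=47, std_dev=6)      # 8 / 6

print(f"Literature z = {literature_z:.2f}")  # 1.27
print(f"Biology z    = {biology_z:.2f}")     # 1.33

# The higher z score is the stronger result relative to the class.
better = "Biology" if biology_z > literature_z else "Literature"
print(f"Better performance: {better}")       # Biology
```

Sue's raw score is higher in Literature, but her Biology score sits farther above that test's mean when measured in standard deviations.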
10
𝑧 scores caution with negatives
• Example: compare test scores on two different
tests to ascertain “Which score was the more
outstanding of the two?”
• Be careful if the 𝑧 scores turn out to be
negative. Which is the better performance?
𝑧 = −1.99 or 𝑧 = −0.34 ?
• Stop and think back to your basic number line
and the meaning of “<” and “>”
11
Percentiles
• “What percent of the values are lower than
my value?”
– 90th percentile is pretty high
– 50th percentile is right in the middle
– 10th percentile is pretty low
• If you scored in the 99th percentile on your
SAT, I hope you got a scholarship.
12
Salary data for our percentile examples
• With these salary values again
• What’s the percentile for a salary of $59,000?
• You can see it’s going to be higher than 50th
because it’s in the top half.
13
Example: Given x, find the percentile
• Count 𝑥 = how many values below $59,000
• Count 𝑛 = how many values in the data set
• Formula for percentile: 𝑝 = (𝑥 / 𝑛) ∙ 100%
• Here we have 𝑥 = 15 values lower than our
$59,000
• Here we have 𝑛 = 20 values in the data set.
• 𝑝 = (15 / 20) ∙ 100% so 𝑝 = 75, “75th percentile”
14
Continued: Given x, find the percentile
• 𝑝 = (15 / 20) ∙ 100% so 𝑝 = 75
• Do not say “75%”, but say “the 75th percentile”
• Other sources use different formulas, beware!
– Some other books use 𝑥 + 0.5 in the numerator.
– Excel has two different answers, PERCENTILE.EXC
and PERCENTILE.INC functions.
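The counting formula 𝑝 = (𝑥 / 𝑛) ∙ 100 can be sketched in Python as below. The salary table is not reproduced in this transcript, so the data here is a hypothetical stand-in (the integers 1 through 20), chosen so that 15 of 20 values fall below the value being ranked. The function name `percentile_rank` is also just an assumption for illustration.

```python
def percentile_rank(values, x):
    """Percent of the data values that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100


# Hypothetical data: 20 values, 15 of which are lower than 16.
data = list(range(1, 21))
p = percentile_rank(data, 16)
print(f"{p:.0f}th percentile")  # 75th percentile
```

This follows only the formula used in these slides. Other textbooks and Excel may give a different answer for the same data.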
15
Given Percentile 𝑝, find the 𝑥 value
• Formula: position from bottom 𝑐 = (𝑛 ∙ 𝑝) / 100
– Again, 𝑛 = how many data values in the set
– and 𝑝 = the percentile rank that’s given.
• Is there a decimal remainder in position 𝑐?
– If so, then BUMP UP to the next highest whole #
and take the value in that position.
– Or if 𝑐 is an exact whole number, take the average
from positions 𝑐 and (𝑐 + 1).
• Note: Book uses lowercase 𝑙 instead of 𝑐.
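The bump-up rule can be sketched in Python as follows. This is one reading of the rule on this slide, assuming 𝑝 is a whole number strictly between 0 and 100. The function name `value_at_percentile` and the sample data (the integers 1 through 20) are hypothetical, since the salary table is not reproduced in this transcript.

```python
def value_at_percentile(values, p):
    """Value at the p-th percentile using the position rule c = n*p/100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    c, remainder = divmod(n * p, 100)  # whole part and leftover of n*p/100
    if remainder:
        # Decimal remainder: bump up to position c + 1 (1-based),
        # which is index c in a 0-based Python list.
        return ordered[c]
    # Exact whole number: average positions c and c + 1 (1-based).
    return (ordered[c - 1] + ordered[c]) / 2


data = list(range(1, 21))             # hypothetical 20 values
print(value_at_percentile(data, 31))  # c = 6.2, bump to 7th value -> 7
print(value_at_percentile(data, 40))  # c = 8, average 8th and 9th -> 8.5
```

The two calls mirror the 31st and 40th percentile cases worked by hand: one position has a remainder, the other is an exact integer.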
16
Given Percentile 𝑝, find the 𝑥 value
• Example: What is the 31st percentile in the
salary data?
• 31st percentile: plug in 𝑛 = 20, 𝑝 = 31
• Compute 𝑐 = (20 ∙ 31) / 100 = 6.2. It has a remainder.
• Bump it up! 𝑐 = 7.
– Not rounding, but rather bumpety-upping
• So we look 7 positions from the bottom
• “The 31st percentile is $44,476”
17
Given Percentile 𝑝, find the 𝑥 value
• Example: What is the 40th percentile in the
salary data? Plug in 𝑛 = 20, 𝑝 = 40
• Compute 𝑐 = (20 ∙ 40) / 100 = 8. Exact integer!
• So count 𝑐 = 8th and 𝑐 + 1 = 9th from bottom.
• (47043 + 47692) / 2 = 47367.5
• “The 40th percentile is $47,367.50, or $47,368.”
18
Excel gives different answers
• Excel does some
fancy interpolation
19
Quartiles Q1, Q2, Q3
• Data values are arranged from low to high.
• The Quartiles divide the data into four groups.
• Q2 is just another name for the Median.
• Q1 = Find the Median of Lowest to Q2 values
• Q3 = Find the Median of Q2 to Highest values
• It gets tricky, depending on how many values.
20
Quartiles example
• 10, 20, 30, 40, 50, 60, 70, 80, 90
• The Second Quartile, Q2 = median = 50
• Find the medians of the subsets left and right.
• Keep the 50 in each of those subsets.
• The First Quartile, Q1
= median of { 10, 20, 30, 40, 50 } = 30
• The Third Quartile, Q3
= median of { 50, 60, 70, 80, 90 } = 70
21
Quartiles example
• 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
• Q2 = median = (50 + 60) / 2 = 55 (two middle #s)
• Leave the 50 and 60 in place; do not reuse 55
• Q1 = median of {10, 20, 30, 40, 50} = 30
• Q3 = median of {60, 70, 80, 90, 100} = 80
22
Quartiles example
• 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110
• Q2 = median = (50 + 60) / 2 = 55 (two middle #s).
• 55 isn’t really there so you can’t remove it!
• Leave the 50 and 60 in place
• Q1 = median of {0, 10, 20, 30, 40, 50} = 25
• Q3 = median of {60, 70, 80, 90, 100, 110} = 85
• Two middle numbers happened again!
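The three quartile examples can be checked with a short Python sketch of the by-hand rule as these slides describe it. When the count is odd, the median is a real data value and stays in both halves. When the count is even, the data simply splits in two. The function name `quartiles_keep_median` is made up for this illustration.

```python
from statistics import median


def quartiles_keep_median(values):
    """Return (Q1, Q2, Q3), keeping an actual median value in both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2:
        # Odd count: the median is a data value, so both halves include it.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # Even count: the median falls between two values; split cleanly.
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles_keep_median(range(10, 100, 10)))  # (30, 50, 70)
print(quartiles_keep_median(range(10, 110, 10)))  # (30, 55.0, 80)
print(quartiles_keep_median(range(0, 120, 10)))   # (25.0, 55.0, 85.0)
```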
23
Interquartile Range
• Definition: IQR = Q3 – Q1
• In the previous example, 85 – 25 = 60.
• Interquartile Range measures how spread out
the middle of the data are
– The lowest quartile (x < Q1) is not involved
– And the highest quartile (x > Q3) is not involved.
24
Quartiles with TI-84
• 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110
• Put values into a TI-84 List
• Use STAT, CALC,
1-Var Stats
• Scroll down down
down to get to them.
25
There is disagreement about Quartiles
• The TI-84 sometimes gives different answers
than the method we use in the Hawkes
materials
• Excel might give different answers from
Hawkes and TI-84, both.
• Use the Hawkes method in this course’s work
• Be aware of the others
– You should know how to use TI-84 and Excel
– You should be aware that differences can occur.
26
Quartiles with TI-84 vs. Hawkes
• 10, 20, 30, 40, 50, 60, 70, 80, 90
• We got Q1=30 and Q3=70 before.
• Hawkes keeps the 50,
using 10,20,30,40,50
to compute Q1.
• But the TI-84 throws
out 50 and uses
10,20,30,40.
• Hawkes says the TI-84 is computing “hinges”.
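The difference between the two rules shows up in a few lines of Python. This is only a sketch of the two by-hand procedures described on this slide, not the calculator's actual firmware. The function name `q1_q3` and its `keep_median` flag are hypothetical.

```python
from statistics import median


def q1_q3(values, keep_median):
    """Return (Q1, Q3); keep_median controls what happens with an odd count."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 and keep_median:
        # Keep the middle value in both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    elif n % 2:
        # Throw the middle value out of both halves.
        lower, upper = ordered[:half], ordered[half + 1:]
    else:
        # Even count: no single middle value, so both rules agree.
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(upper)


data = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(q1_q3(data, keep_median=True))   # (30, 70)
print(q1_q3(data, keep_median=False))  # (25.0, 75.0)
```

With an even number of values there is no middle value to keep or throw out, so the two rules only differ when the count is odd.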
27
Quartiles in Excel
• =QUARTILE.INC(cells, 1 or 2 or 3) seems to
give the same results as the old QUARTILE
function
• There’s new =QUARTILE.EXC(cells, 1 or 2 or 3)
• Excel does fancy interpolation stuff and may
give different Q1 and Q3 answers compared to
the TI-84 and our by-hand methods.
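To see what interpolation does to the answers, Python's standard `statistics.quantiles` function offers an "exclusive" and an "inclusive" method. As far as I know these follow the same interpolation rules as QUARTILE.EXC and QUARTILE.INC, but treat that correspondence as an assumption and check against Excel itself.

```python
from statistics import quantiles

data = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]

# n=4 asks for the three cut points that split the data into quarters.
exclusive = quantiles(data, n=4, method="exclusive")
inclusive = quantiles(data, n=4, method="inclusive")

print(exclusive)  # [22.5, 55.0, 87.5]
print(inclusive)  # [27.5, 55.0, 82.5]
```

The by-hand method gives Q1 = 25 and Q3 = 85 for this data set. All three approaches agree on the median, 55, but not on Q1 and Q3.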
28
The Five Number Summary
• Again: 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110
• Q2 = median = (50 + 60) / 2 = 55, Q1 = 25 and Q3 = 85
• “The Five Number Summary” is defined as:
the minimum, then Q1, Q2, Q3, then the
maximum
• For this set of numbers, the Five Number
Summary is “0, 25, 55, 85, 110”
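A compact Python sketch of the Five Number Summary follows, using the same by-hand quartile rule as the earlier slides (an actual median value stays in both halves). The function name `five_number_summary` is an illustrative choice.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the keep-the-median quartile rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    # Odd count: the lower half reaches up to and includes the median value.
    lower = ordered[:half + 1] if n % 2 else ordered[:half]
    # The upper half starts at index `half` in both the odd and even cases.
    upper = ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


data = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
print(five_number_summary(data))  # (0, 25.0, 55.0, 85.0, 110)
```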
29
The Five Number Summary
• Again: 0, 10, 20, 30, 40,
50, 60, 70, 80, 90, 100,
110
• Q2=55, Q1=25, Q3 = 85
• Min is 0, Max is 110
• For this set of numbers,
the Five Number
Summary is “0, 25, 55,
85, 110”
• Box Plot (figure): Min = 0, Q1 = 25, Q2 = 55, Q3 = 85, Max = 110
• TI-84 can do Box Plot
too, but again its
quartiles disagree with
the way Hawkes defines
quartiles.
30
Why Box Plot?
• Don’t lose sight of the big picture here:
– We have a data set
– It’s a bunch of numbers
– We want to summarize the data
• Summarize means make it into a sound bite
– We must be Concise – don’t say too much
– We must be Informative – don’t say too little
31
We must be Concise
• Bad: “Here is a report that tells you the mean
and the variance and the standard deviation
and the quartiles and the percentiles from 0
to 100… and the marketing survey analyzed by
demographic subgroups …” (there is a place
for that, but not right now)
• Good: “Got fifteen seconds? Here’s what we
found.”
32
Notice the pieces of the boxplot:
• Horizontal scale, maybe a little beyond the
min and the max. A generic number line.
• The five numbers.
• The box holds the quartiles
– With a line in the middle at the median.
• The whiskers extend out to the min and the
max.
33
TI-84 Boxplot
• See instructions on separate handout.
• Caution again that TI-84 computes quartiles
differently from Hawkes and differently from
Excel, so the results aren’t always going to
agree.
34
Additional Topics
• Might not be needed for Hawkes homework
• But you should be aware of them
• Quintiles and Deciles
• Interquartile Range and Outliers
• TI-84 Box Plot
35
Quintiles and Deciles
• You might also encounter
– Quintiles, dividing data set into 5 groups.
– Deciles, dividing data set into 10 groups.
• Reconcile everything back with percentiles:
– Quartiles correspond to percentiles 25, 50, 75
– Deciles correspond to percentiles 10, 20, …, 90
– Quintiles correspond to percentiles 20, 40, 60, 80
36
Interquartile Range and Outliers
• Concept: An OUTLIER is a wacky far-out
abnormally small or large data value
compared to the rest of the data set.
• We’d like something more precise.
• Define: IQR = Interquartile Range = Q3 – Q1.
• Define: If 𝑥 < 𝑥̄ − 1.5 ∙ 𝐼𝑄𝑅, 𝑥 is an Outlier.
• Define: If 𝑥 > 𝑥̄ + 1.5 ∙ 𝐼𝑄𝑅, 𝑥 is an Outlier.
• (Other books might make different definitions)
37
Outliers Example
• Here’s a quick elementary example:
• Data values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20
• Mean 𝑥̄ = 6.8 and 𝐼𝑄𝑅 = 9 – 3 = 6
• Or in Hawkes method, 𝑄1 = 3.5, 𝑄3 = 8.5,
so the interquartile range = 8.5 – 3.5 = 5
(the two methods won’t always work out the
same; the rest of this example uses 𝐼𝑄𝑅 = 6)
38
Outliers Example
• Data values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20
• We found IQR = 6 and the mean is 6.8
• One definition uses 𝐼𝑄𝑅 ∗ 1.5 to define outliers
• Here, 6 ∗ 1.5 = 9
• Anything more than 9 units away from 𝑥̄ is then
considered to be abnormally small or large.
• 6.8 – 9 = −2.2, nothing smaller than −2.2
• 6.8 + 9 = 15.8: the 20 is an outlier.
39
No-Outliers Example
• Data values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10
• Mean 𝑥̄ = 5.9 and 𝐼𝑄𝑅 = 9 – 3 = 6
(coincidence that 𝑥̄ ≈ 𝐼𝑄𝑅, insignificant)
• 𝐼𝑄𝑅 ∗ 1.5 = 9
• Anything more than 9 units away from 𝑥̄ is
abnormal. 5.9 − 9 = −3.1; 5.9 + 9 = 14.9
• This data set has No Outliers.
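Both outlier examples can be checked with the Python sketch below. It uses the definition from these slides (more than 1.5 ∙ 𝐼𝑄𝑅 away from the mean), which other books define differently. The quartiles Q1 = 3 and Q3 = 9 are passed in as given on the slides rather than computed. The function name `mean_based_outliers` is hypothetical.

```python
from statistics import mean


def mean_based_outliers(values, q1, q3, k=1.5):
    """Values more than k * IQR away from the mean (this deck's definition)."""
    iqr = q3 - q1
    center = mean(values)
    low, high = center - k * iqr, center + k * iqr
    return [v for v in values if v < low or v > high]


with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20]
no_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]

print(mean_based_outliers(with_outlier, q1=3, q3=9))  # [20]
print(mean_based_outliers(no_outlier, q1=3, q3=9))    # []
```

The code uses the unrounded mean (about 6.82 and 5.91), so its cutoffs differ slightly from the rounded hand calculations, but the conclusions are the same.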
40
Outliers: Good or Bad?
• “I have an outlier in my data set.
Should I be concerned?”
– Could be bad data. A bad measurement.
Somebody not being honest with the pollster.
– Could be legitimately remarkable data, genuine
true data that’s extraordinarily high or low.
• “What should I do about it?”
– The presence of an outlier is shouting for
attention. Evaluate it and make an executive
decision.
41