Chapter 4
Numerical Methods for
Describing Data
Parameter - a value about a population
• Fixed
• Typically unknown
Suppose we want to know the MEAN length of all the fish in Lake Lewisville ...
Is this a value that is known?
Can we find it out?
At any given point in time, how many values are there for the mean length of fish in the lake?
Statistic - a value calculated from a sample
Suppose we want to know the MEAN length of all the fish in Lake Lewisville.
What can we do to estimate this unknown parameter?
Measures of Central Tendency
• Mode – the observation that occurs the
most often
– Can be more than one mode
– If all values occur only once – there is
no mode
– Not used as often as mean & median
Measures of Central Tendency
Median - the middle value of the data; it divides the observations in half
To find: list the observations in numerical order
sample median = the single middle value if n is odd
sample median = the average of the two middle values if n is even
Where n = sample size
Suppose we catch a sample of 5 fish from the lake. The lengths of the fish (in inches) are listed below. Find the median length of fish.
3 4 5 8 10
The numbers are in order & n is odd – so find the middle observation.
The median length of fish is 5 inches.
Suppose we caught a sample of 6 fish from the lake. The median length is …
3 4 5 6 8 10
The numbers are in order & n is even – so find the middle two observations.
Now, average these two values.
The median length is 5.5 inches.
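The odd/even rule for the median can be sketched in a few lines of Python. This is only an illustration; the function name `sample_median` and the variable names are made up for this example and are not part of the slides.

```python
def sample_median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)      # list the observations in numerical order
    n = len(ordered)              # n = sample size
    mid = n // 2
    if n % 2 == 1:                # n is odd: single middle value
        return ordered[mid]
    # n is even: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


five_fish = [3, 4, 5, 8, 10]
six_fish = [3, 4, 5, 6, 8, 10]

print(sample_median(five_fish))   # 5
print(sample_median(six_fish))    # 5.5
```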
Measures of Central Tendency
Mean is the arithmetic average.
– Use μ to represent a population mean (a parameter)
  μ is the lower case Greek letter mu
– Use $\bar{x}$ to represent a sample mean (a statistic)
Formula:
$\bar{x} = \frac{\sum x}{n}$
Σ is the capital Greek letter sigma – it means to sum the values that follow
Suppose we caught a sample of 6 fish from the lake. Find the mean length of the fish.
3 4 5 6 8 10
To find the mean length of fish - add the observations and divide by n.
$\bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 10}{6} = 6$
Now find how each observation deviates from the mean.
  x     $x - \bar{x}$
  3     -3    (3 - 6; this is the deviation from the mean)
  4     -2
  5     -1
  6      0
  8      2
 10      4
Sum      0
Find the rest of the deviations from the mean.
What is the sum of the deviations from the mean?
The mean is considered the balance point of the distribution because it “balances” the positive and negative deviations.
Will this sum always equal zero?
YES
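As a quick check of the balance-point idea, the deviations for the six fish can be computed directly. This is a minimal sketch; the variable names are illustrative only. With other data, floating-point rounding may leave a sum that is very close to zero rather than exactly zero.

```python
lengths = [3, 4, 5, 6, 8, 10]          # fish lengths in inches

n = len(lengths)
x_bar = sum(lengths) / n               # sample mean: sum of x divided by n

# Deviation of each observation from the mean
deviations = [x - x_bar for x in lengths]

print(x_bar)             # 6.0
print(deviations)        # [-3.0, -2.0, -1.0, 0.0, 2.0, 4.0]
print(sum(deviations))   # 0.0
```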
Imagine a ruler with pennies placed at
3”, 4”, 5”, 6”, 8” and 10”.
To balance the
ruler on your
finger, you would
need to place your
finger at the mean
of 6.
The mean is the
balance point of a
distribution
What happens to the median & mean if the length of 10 inches was 15 inches?
3 4 5 6 8 15
The median is . . . 5.5
The mean is . . . $\frac{3 + 4 + 5 + 6 + 8 + 15}{6} = 6.833$
What happened?
What happens to the median & mean if the 15 inches was 20?
3 4 5 6 8 20
The median is . . . 5.5
The mean is . . . $\frac{3 + 4 + 5 + 6 + 8 + 20}{6} = 7.667$
What happened?
Statistics that are not affected by extreme values are said to be resistant.
Is the median resistant? YES
Is the mean resistant? NO
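The three versions of the fish sample can be compared side by side to see resistance in action. The sketch below uses Python's standard `statistics` module; the dictionary labels are just descriptive names chosen for this example.

```python
from statistics import mean, median

# The same sample with the largest fish changed from 10 to 15 to 20 inches
samples = {
    "original": [3, 4, 5, 6, 8, 10],
    "10 -> 15": [3, 4, 5, 6, 8, 15],
    "15 -> 20": [3, 4, 5, 6, 8, 20],
}

# (median, mean rounded to 3 decimal places) for each version
summary = {label: (median(data), round(mean(data), 3))
           for label, data in samples.items()}

for label, (med, avg) in summary.items():
    print(f"{label}: median = {med}, mean = {avg}")
```

The median stays at 5.5 in every version, while the mean moves with the extreme value.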
Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.)
3 6 5 4 6 7 10 5 6 9
7 9 7 8 8 7 4 6 5 8
Calculate the mean and median.
Mean = 6.5
Median = 6.5
Look at the placement of the mean and median in this symmetrical distribution.
Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.)
3 6 5 4 6 12 10 5 15 3
7 4 3 8 3 13 4 11 5 9
Calculate the mean and median.
Mean = 6.8
Median = 5.5
Look at the placement of the mean and median in this skewed distribution.
Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.)
3 6 5 4 6 9 10 10 10 9
7 9 10 10 8 7 9 10 5 8
Calculate the mean and median.
Mean = 7.75
Median = 8.5
Look at the placement of the mean and median in this skewed distribution.
Recap:
• In a symmetrical distribution, the mean
and median are equal.
• In a skewed distribution, the mean is
pulled in the direction of the skewness.
• In a symmetrical distribution, you should
report the mean!
• In a skewed distribution, the median
should be reported as the measure of
center!
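The recap can be checked against the three samples of 20 fish. In this sketch the names `symmetric`, `skewed_right`, and `skewed_left` are labels added for the example (the slides only call the last two "skewed"); they follow the usual convention that the mean is pulled toward the longer tail.

```python
from statistics import mean, median

symmetric = [3, 6, 5, 4, 6, 7, 10, 5, 6, 9, 7, 9, 7, 8, 8, 7, 4, 6, 5, 8]
skewed_right = [3, 6, 5, 4, 6, 12, 10, 5, 15, 3, 7, 4, 3, 8, 3, 13, 4, 11, 5, 9]
skewed_left = [3, 6, 5, 4, 6, 9, 10, 10, 10, 9, 7, 9, 10, 10, 8, 7, 9, 10, 5, 8]

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    # Compare where the mean sits relative to the median
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

In the symmetric sample the two values agree; in the skewed samples the mean sits on the side of the longer tail.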
Trimmed mean:
Purpose is to remove outliers from a data
set
To calculate a trimmed mean:
• Multiply the percent to trim by n
• Truncate that many observations from
BOTH ends of the distribution (when
listed in order)
• Calculate the mean with the shortened
data set
Find the mean of the following set of data.
12 14 19 20 22 24 25 26 26 50
Mean = 23.8
Find a 10% trimmed mean.
10%(10) = 1
So remove one observation from each side!
$\bar{x}_T = \frac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = 22$
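The trimming steps can be sketched as a small function. The name `trimmed_mean` and its parameters are hypothetical, and the sketch assumes the trimming percent is given as a whole number so that integer arithmetic can do the truncation.

```python
def trimmed_mean(values, percent):
    """Mean after dropping percent% of the observations from EACH end."""
    ordered = sorted(values)               # list the data in order
    n = len(ordered)
    k = (percent * n) // 100               # multiply the percent by n, then truncate
    kept = ordered[k:n - k]                # remove k observations from both ends
    return sum(kept) / len(kept)           # mean of the shortened data set


data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 50]

print(sum(data) / len(data))      # 23.8
print(trimmed_mean(data, 10))     # 22.0
```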
What values are used to describe categorical data?
Suppose that each person in a sample of 15 cell phone users is asked if he or she is satisfied with the cell phone service.
What would be the possible responses?
Here are the responses:
Y N Y Y N N Y Y Y N Y Y Y N N
Find the sample proportion of the people who answered “yes”:
$\hat{p} = \frac{\text{number of successes}}{n} = \frac{9}{15} = 0.6$
$\hat{p}$ is pronounced p-hat.
The population proportion is denoted by the letter p.
60% of the sample was satisfied with their cell phone service.
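A sample proportion is just a count divided by the sample size. The sketch below tallies the 15 responses; treating "Y" as a success is the only assumption, and the variable names are illustrative.

```python
# The 15 responses from the cell phone survey, in the order listed
responses = list("YNYYNNYYYNYYYNN")

n = len(responses)                    # sample size
successes = responses.count("Y")      # number who answered "yes"
p_hat = successes / n                 # sample proportion

print(successes, n)    # 9 15
print(p_hat)           # 0.6
```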
Why is the study of variability important?
• There is variability in virtually everything
  Does this can of soda contain exactly 12 ounces?
• Allows us to distinguish between usual & unusual values
• Reporting only a measure of center doesn’t provide a complete picture of the distribution.
[Figure: dotplots of three data sets, A, B, and C, each drawn on a scale from 20 to 70]
What are the mean and median of these three graphs?
Measures of Variability
The simplest numeric measure of variability
is range.
Range = largest observation – smallest observation
What is the range of these data sets?
[Figure: dotplots of data sets A, B, and C, each drawn on a scale from 20 to 70]
The first two data
sets have a range of
50 (70-20) but the
third data set has a
much smaller range
of 10.
Measures of Variability
How would a dotplot look if the average
deviation was 0?
What does it mean to have an average
deviation of 0?
[Figure: dotplot drawn on a scale from 1 to 5]
Measures of Variability
Another measure of the variability in a data set uses the deviations from the mean ($x - \bar{x}$).
What is a deviation from the mean?
What is the mean of this distribution?
[Figure: dotplot of data set A on a scale from 20 to 70, with the mean marked at 45]
Measures of Variability
Another measure of the variability in a data set uses the deviations from the mean ($x - \bar{x}$).
Remember the sample of 6 fish that we caught from the lake ...
They were the following lengths: 3”, 4”, 5”, 6”, 8”, 10”
The mean length was 6 inches. Recall that we calculated the deviations from the mean. What was the sum of these deviations?
Can we find an average deviation?
What can we do to the deviations so that we could find an average?
The estimated average of the deviations squared is called the variance.
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
The denominator $n - 1$ is the degrees of freedom (explained later).
Population variance is denoted by $\sigma^2$ and is divided by n.
Remember the sample of 6 fish that we caught from the lake . . .
Find the variance of the length of the fish.
Finding the average of the deviations would always equal 0!
What could we do so that we would be able to find an average deviation?
First square the deviations.
  x     $x - \bar{x}$     $(x - \bar{x})^2$
  3        -3                 9
  4        -2                 4
  5        -1                 1
  6         0                 0
  8         2                 4
 10         4                16
Sum         0                34
What is the sum of the deviations squared? 34
Divide this by 5.
$s^2 = \frac{34}{5} = 6.8$
Measures of Variability
The square root of variance is called standard deviation.
A typical deviation from the mean is the standard deviation.
$s^2 = 6.8$ inches² so $s = 2.608$ inches
The fish in our sample deviate from the mean of 6 by an average of 2.608 inches.
The most commonly used measures of center and variability are the mean and standard deviation, respectively.
Calculation of standard deviation of a sample:
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Population standard deviation is denoted by σ (where n is used in the denominator).
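The variance and standard deviation of the six fish can be worked step by step and then compared with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1 while `pvariance` divides by n. This is a sketch with illustrative variable names.

```python
from math import sqrt
from statistics import pvariance, stdev, variance

lengths = [3, 4, 5, 6, 8, 10]          # fish lengths in inches
n = len(lengths)
x_bar = sum(lengths) / n               # 6.0

# Square each deviation from the mean and add them up
sum_sq_dev = sum((x - x_bar) ** 2 for x in lengths)

s2 = sum_sq_dev / (n - 1)              # sample variance: divide by n - 1
s = sqrt(s2)                           # sample standard deviation

print(sum_sq_dev)                      # 34.0
print(s2, round(s, 3))                 # 6.8 2.608

# Library versions: the sample forms use n - 1, the population form uses n
print(variance(lengths), round(stdev(lengths), 3))
print(round(pvariance(lengths), 3))    # 34 / 6
```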
Degrees of Freedom (df)
• The number of independent observations that are free to vary
Suppose we consider the sample of 6 fish where the mean is 6 inches.
Five of these values are free to be any possible length of fish!
However, once these five values occur, then the sixth value is no longer free to vary. It MUST be a specific value in order for the deviations from the mean (of 6) to have a sum of zero.
Thus, out of a sample of n, n - 1 observations are free to vary.
Measures of Variability
Interquartile range (IQR) is the range of
the middle half of the data.
Lower quartile (Q1) is the median of the
lower half of the data
Upper quartile (Q3) is the median of the
upper half of the data
IQR = Q3 – Q1
The Chronicle of Higher Education (2009-2010
issue) published the accompanying data on the
percentage of the population with a bachelor’s or
higher degree in 2007 for each of the 50 states
and the District of Columbia.
Find the interquartile range for this set of data.
First put the data in order & find the median.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
Find the lower quartile (Q1) by finding the median of the lower half.
Find the upper quartile (Q3) by finding the median of the upper half.
IQR = 30 – 24 = 6
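The quartile procedure described here (median of the lower half, median of the upper half) can be sketched as follows. The function name `quartiles` is made up for this example, and it assumes the convention that the overall median is left out of both halves when n is odd; other software may use a different quartile convention and give slightly different values.

```python
from statistics import median

# Percent with a bachelor's or higher degree, 50 states and D.C., in order
percents = [17, 19, 19, 20, 20, 21, 22, 22, 22, 23,
            23, 23, 24, 24, 24, 24, 25, 25, 25, 25,
            25, 26, 26, 26, 26, 26, 26, 27, 27, 27,
            27, 27, 28, 29, 29, 29, 30, 30, 30, 30,
            31, 32, 33, 34, 34, 34, 35, 35, 35, 38,
            47]


def quartiles(values):
    """Return (Q1, median, Q3) using medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                  # lower half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # upper half
    return median(lower), median(ordered), median(upper)


q1, med, q3 = quartiles(percents)
iqr = q3 - q1
print(q1, med, q3, iqr)   # 24 26 30 6
```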
Which measure(s) of variability
(spread) is/are resistant?
Only the IQR!
Wolf Stat Company Activity
How do the mean and standard deviation change with linear transformations?
Linear transformation rule
• When adding a constant to a random variable, the mean changes but not the standard deviation.
• When multiplying a random variable by a constant, both the mean and the standard deviation change.
An appliance repair shop charges a $30 service call to go to a home for a repair. It also charges $25 per hour for labor. From past history, the average length of repairs is 1 hour 15 minutes (1.25 hours) with standard deviation of 20 minutes (1/3 hour). Including the charge for the service call, what are the mean and standard deviation for the charges for labor?
$\mu = 30 + 25(1.25) = \$61.25$
$\sigma = 25\left(\frac{1}{3}\right) = \$8.33$
Stat Land Game Activity
How do you combine the mean and
standard deviation of two
independent random variables?
Rules for Combining two variables
• To find the mean for the sum (or
difference), add (or subtract) the two
means
• To find the standard deviation of the sum (or difference), ALWAYS add the variances, then take the square root.
$\mu_{a+b} = \mu_a + \mu_b$
$\mu_{a-b} = \mu_a - \mu_b$
$\sigma_{a \pm b} = \sqrt{\sigma_a^2 + \sigma_b^2}$ (if the variables are independent)
Bicycles arrive at a bike shop in boxes. Before they
can be sold, they must be unpacked, assembled, and
tuned (lubricated, adjusted, etc.). Based on past
experience, the times for each setup phase are
independent with the following means & standard
deviations (in minutes). What are the mean and
standard deviation for the total bicycle setup times?
Phase        Mean    SD
Unpacking     3.5    0.7
Assembly     21.8    2.4
Tuning       12.3    2.7
$\mu_T = 3.5 + 21.8 + 12.3 = 37.6$ minutes
$\sigma_T = \sqrt{0.7^2 + 2.4^2 + 2.7^2} = 3.680$ minutes
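The bicycle setup totals can be reproduced by adding the means and adding the variances (not the standard deviations). The sketch below assumes, as the problem states, that the three phases are independent; the dictionary layout is just one convenient way to hold the table.

```python
from math import sqrt

# phase: (mean, standard deviation), both in minutes
phases = {
    "Unpacking": (3.5, 0.7),
    "Assembly": (21.8, 2.4),
    "Tuning": (12.3, 2.7),
}

# Means add directly
total_mean = sum(m for m, _ in phases.values())

# Standard deviations do not add: add the variances, then take the square root
total_sd = sqrt(sum(sd ** 2 for _, sd in phases.values()))

print(round(total_mean, 1))   # 37.6
print(round(total_sd, 3))     # 3.68
```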
Another graph- Boxplots
What are some advantages of boxplots?
• Ease of construction
• Convenient handling of outliers
• Construction is not subjective (like
histograms)
• Used with medium or large size data
sets (n > 10)
• Useful for comparative displays
Boxplots
When to Use: Univariate numerical data
The five-number summary is the minimum value, first quartile, median, third quartile, and maximum value
How to construct a Skeleton Boxplot
– Calculate the five number summary
– Draw a horizontal (or vertical) scale
– Construct a rectangular box from the lower quartile (Q1) to the upper quartile (Q3)
– Draw lines from the lower quartile to the smallest observation and from the upper quartile to the largest observation
Use for moderate to large data sets. Don’t use with data sets of n < 10.
To describe
– comment on the center, spread, and shape of the distribution and if there are any unusual features
Remember the data on the percentage of the
population with a bachelor’s or higher degree in
2007 for each of the 50 states and the District of
Columbia.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
First draw a scale
Draw a box from Q1 to Q3
Draw a line for the median
Draw lines for the whiskers
[Figure: skeleton boxplot drawn on a “Percentages” scale from 10 to 50]
Modified boxplots
Modified boxplots are generally preferred because they provide more information about the data distribution.
To display outliers:
• Identify mild & extreme outliers
An observation is an outlier if it is more than 1.5(iqr) away from the nearest quartile.
$Q_1 - 1.5(iqr)$ and $Q_3 + 1.5(iqr)$
An outlier is extreme if it is more than 3(iqr) away from the nearest quartile.
$Q_1 - 3(iqr)$ and $Q_3 + 3(iqr)$
• whiskers extend to largest (or smallest) data observation that is not an outlier
Remember the data on the percentage of the population with a bachelor’s or higher degree in 2007 for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
First, draw the scale, the box, and the line for the median.
Next calculate the fences for the outliers.
24 - 1.5(6) = 15
30 + 1.5(6) = 39
There is one outlier at the upper end of the distribution but none at the lower end. Is it extreme?
30 + 3(6) = 48
Draw lines for the whiskers
Place a solid dot for the outlier
[Figure: modified boxplot drawn on a “Percentages” scale from 10 to 50]
To describe:
The distribution of the percent of the population with a bachelor’s degree or higher for the U.S. states and the District of Columbia is positively skewed with an outlier at 47%. The median percentage is at 26% with a range of 30%.
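The fence calculations for this data set can be sketched in code. The quartiles Q1 = 24 and Q3 = 30 are taken from the earlier slide rather than recomputed, and the variable names are illustrative. The lower extreme fence is not shown on the slide; it is computed here from the same rule.

```python
percents = [17, 19, 19, 20, 20, 21, 22, 22, 22, 23,
            23, 23, 24, 24, 24, 24, 25, 25, 25, 25,
            25, 26, 26, 26, 26, 26, 26, 27, 27, 27,
            27, 27, 28, 29, 29, 29, 30, 30, 30, 30,
            31, 32, 33, 34, 34, 34, 35, 35, 35, 38,
            47]

q1, q3 = 24, 30          # quartiles from the earlier slide
iqr = q3 - q1            # 6

# Fences: 1.5(iqr) from the nearest quartile for outliers, 3(iqr) for extreme ones
mild_fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
extreme_fences = (q1 - 3 * iqr, q3 + 3 * iqr)

outliers = [x for x in percents
            if x < mild_fences[0] or x > mild_fences[1]]
extreme = [x for x in outliers
           if x < extreme_fences[0] or x > extreme_fences[1]]

# Whiskers stop at the most extreme observations that are not outliers
inside = [x for x in percents if x not in outliers]
whiskers = (min(inside), max(inside))

print(mild_fences, extreme_fences)
print(outliers, extreme, whiskers)
```

Since 47 lies between the fences at 39 and 48, it is an outlier but not an extreme one.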
Symmetrical boxplots
Notice that all 3 boxplots are identical, but their corresponding histograms are very different. Can you determine the number of modes from a boxplot?
Approximately symmetrical boxplot
Notice that the range of the lower half and the range of the upper half of this distribution are approximately equal so we can say that it is approximately symmetrical.
Skewed boxplot
However, the ranges of the two halves of this distribution are definitely different sizes, so it would be skewed in the direction of the longest side.
The 2009-2010 salaries of NBA players
published on the web site hoopshype.com were
used to construct the comparative boxplot of
salary data for five teams.
Discuss the
similarities
and
differences.
Normal Curve
• Bell-shaped, symmetrical, unimodal curve
• Transition points between cupping upward and downward occur at μ ± σ
• As the standard deviation increases, the curve flattens and spreads
• As the standard deviation decreases, the curve gets taller and thinner
Let’s use our calculator to graph some normal curves
Put the following into your calculator:
(Window: x: [0,20] & y: [0,0.3])
Y1: normalpdf(X,10,2)
Y2: normalpdf(X,10,1.5)
Y3: normalpdf(X,10,3)
What happens?
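Without a graphing calculator, the same three curves can be compared with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This sketch only compares the height of each curve at the mean; it is a stand-in for the calculator's `normalpdf`, not a reproduction of it.

```python
from statistics import NormalDist

# Same mean (10), three different standard deviations, as in Y1, Y2, Y3
curves = {sd: NormalDist(mu=10, sigma=sd) for sd in (1.5, 2, 3)}

# Height of each curve at its peak, which occurs at the mean
peaks = {sd: dist.pdf(10) for sd, dist in curves.items()}

for sd, height in peaks.items():
    print(f"sd = {sd}: peak height = {height:.4f}")
```

The smaller the standard deviation, the taller the peak, which matches the "taller and thinner" description.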
What’s my area?
Input the following command into a graphing calculator
in order to graph a normal curve with a mean of 20 and
standard deviation of 3.
Y1 = normalpdf(X,20,3)
(Window x: [10,30] y: [0,0.2])
Use the command 2nd trace, 7 to find the area under
the curve for the: (Round to 3 decimal places.)
Lower limit: 17    Upper limit: 23    Area: ________
Lower limit: 14    Upper limit: 26    Area: ________
Lower limit: 11    Upper limit: 29    Area: ________
What’s my area?
Graph a normal curve with a mean of 50 and standard
deviation of 5.
Y1 = normalpdf(X,50,5)
(x: [30,70] y: [0,0.1])
Find the area under the curve for the following:
Lower limit: 45    Upper limit: 55    Area: ________
Lower limit: 40    Upper limit: 60    Area: ________
Lower limit: 35    Upper limit: 65    Area: ________
What pattern do you notice?
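The areas can also be checked without the calculator. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_between` is made up for this example, and the area is found as a difference of cumulative probabilities rather than by the calculator's numerical integration.

```python
from statistics import NormalDist


def area_between(mean, sd, lower, upper):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)


# Mean 20, standard deviation 3
areas_20_3 = [round(area_between(20, 3, lo, hi), 3)
              for lo, hi in [(17, 23), (14, 26), (11, 29)]]

# Mean 50, standard deviation 5
areas_50_5 = [round(area_between(50, 5, lo, hi), 3)
              for lo, hi in [(45, 55), (40, 60), (35, 65)]]

print(areas_20_3)
print(areas_50_5)
```

Each pair of limits is 1, 2, or 3 standard deviations on either side of the mean, so the two lists come out the same.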
Interpreting Center & Variability
Empirical Rule
Can ONLY be used with distributions that are mound shaped!
• Approximately 68% of the observations are within 1 standard deviation of the mean
• Approximately 95% of the observations are within 2 standard deviations of the mean
• Approximately 99.7% of the observations are within 3 standard deviations of the mean
The height of male students at PWSH is
approximately normally distributed with a
mean of 71 inches and standard deviation of
2.5 inches.
a) What percent of the male students are shorter than 66 inches? About 2.5%
b) Taller than 73.5 inches? About 16%
c) Between 66 & 73.5 inches? About 81.5%
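The three answers come from splitting the leftover percentage evenly between the two tails. A minimal sketch of that arithmetic, using the Empirical Rule percentages as given (the variable names are illustrative):

```python
mean_height, sd_height = 71, 2.5     # inches

# How many standard deviations from the mean are the two cutoffs?
z_66 = (66 - mean_height) / sd_height        # -2.0
z_73_5 = (73.5 - mean_height) / sd_height    # 1.0

within_1, within_2 = 68.0, 95.0      # Empirical Rule percentages

# The percentage outside the band is split evenly between the two tails
shorter_than_66 = (100 - within_2) / 2       # below 2 SDs under the mean
taller_than_73_5 = (100 - within_1) / 2      # above 1 SD over the mean
between = 100 - shorter_than_66 - taller_than_73_5

print(shorter_than_66, taller_than_73_5, between)   # 2.5 16.0 81.5
```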
Measures of Relative Standing
Z-score
A z-score tells us how many standard
deviations the value is from the mean.
$z\text{-score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$
One example of a standardized score.
What do these z-scores mean?
-2.3
2.3 standard deviations below the mean
1.8
1.8 standard deviations above the mean
-4.3
4.3 standard deviations below the mean
Sally is taking two different math achievement
tests with different means and standard
deviations. The mean score on test A was 56
with a standard deviation of 3.5, while the mean
score on test B was 65 with a standard deviation
of 2.8. Sally scored a 62 on test A and a 69 on
test B. On which test did Sally score the best?
Z-score on test A: $z = \frac{62 - 56}{3.5} = 1.714$
Z-score on test B: $z = \frac{69 - 65}{2.8} = 1.429$
She did better on test A.
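Sally's two scores can be compared with a one-line z-score function. The function name `z_score` is chosen for this sketch.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


z_a = z_score(62, 56, 3.5)    # test A
z_b = z_score(69, 65, 2.8)    # test B

print(round(z_a, 3), round(z_b, 3))   # 1.714 1.429
print("A" if z_a > z_b else "B")      # A
```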
Measures of Relative Standing
Percentiles
A percentile is a value in the data set where
r percent of the observations fall AT or
BELOW that value
In addition to weight and length, head
circumference is another measure of health in
newborn babies. The National Center for Health
Statistics reports the following summary values
for head circumference (in cm) at birth for boys.
Head circumference (cm)   32.2   33.2   34.5   35.8   37.0   38.2   38.6
Percentile                   5     10     25     50     75     90     95
What percent of newborn boys had head circumferences greater than 37.0 cm? 25%
10% of newborn babies have head circumferences bigger than what value? 38.2 cm