Download Chapter 3: Describing Relationships (first spread)

Ex. 2.1-1: Wins in Major League Baseball
The stemplot below shows the number of wins for each of the 30 Major League Baseball teams in 2009.
 5 | 9
 6 | 2455
 7 | 00455589
 8 | 0345667778
 9 | 123557
10 | 3
Key: 5|9 represents a team with 59 wins.
Find the percentiles for the following teams:
(a) The Colorado Rockies, who won 92 games.
(b) The New York Yankees, who won 103 games.
(c) The Kansas City Royals and Cleveland Indians, who both won 65 games.
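The percentile calculation can be sketched in Python. This is a minimal sketch, assuming a percentile is defined as the percent of observations strictly less than the given value; the function name `percentile` and the list `wins` are illustrative, with the data read off the stemplot above.

```python
# Wins read off the stemplot (30 teams, 2009 season).
wins = [59,
        62, 64, 65, 65,
        70, 70, 74, 75, 75, 75, 78, 79,
        80, 83, 84, 85, 86, 86, 87, 87, 87, 88,
        91, 92, 93, 95, 95, 97,
        103]

def percentile(value, data):
    """Percent of observations strictly less than `value` (assumed definition)."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

for team, w in [("Rockies", 92), ("Yankees", 103), ("Royals/Indians", 65)]:
    print(f"{team}: {w} wins -> {percentile(w, wins):.1f}% of teams had fewer wins")
```

Under this definition, tied teams share the same percentile because neither counts the other as being below it.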
Ex. 2.1-2, 3: State Median Household Incomes
The table below shows the distribution of median household incomes for the 50 states and the District of
Columbia.
Median Income   Frequency   Relative         Cumulative   Cumulative Relative
($1000s)                    Frequency        Frequency    Frequency
35 to < 40       1          1/51 = 0.020      1           1/51 = 0.020
40 to < 45      10          10/51 = 0.196    11           11/51 = 0.216
45 to < 50      14          14/51 = 0.275    25           25/51 = 0.490
50 to < 55      12          12/51 = 0.235    37           37/51 = 0.725
55 to < 60       5          5/51 = 0.098     42           42/51 = 0.824
60 to < 65       6          6/51 = 0.118     48           48/51 = 0.941
65 to < 70       3          3/51 = 0.059     51           51/51 = 1.000
The cumulative relative frequency graph below shows the same income data. The point at (50, 0.490)
means that 49% of the states had median household incomes less than $50,000. The point at (55, 0.725)
means that 72.5% of the states had median household incomes less than $55,000. Thus, 72.5% - 49% =
23.5% of the states had median household incomes between $50,000 and $55,000, since the cumulative
relative frequency increased by 0.235. This agrees with the relative frequency for the 50 to < 55
category, 12/51 = 0.235.
(a) At what percentile is California, with a median
income of $57,445?
(b) Estimate and interpret the first quartile of this
distribution.
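The cumulative columns of the table can be rebuilt from the frequencies alone. The sketch below does that and then estimates a percentile by linear interpolation, assuming the graph joins the plotted points with straight line segments; `percentile_at` and the variable names are hypothetical helpers, not part of the original example.

```python
from itertools import accumulate

upper_bounds = [40, 45, 50, 55, 60, 65, 70]   # upper class limits, in $1000s
freq = [1, 10, 14, 12, 5, 6, 3]

n = sum(freq)                                  # 51 (50 states plus D.C.)
cum_freq = list(accumulate(freq))
rel_freq = [f / n for f in freq]
cum_rel_freq = [c / n for c in cum_freq]

# Proportion between $50,000 and $55,000: difference of two cumulative values.
between_50_and_55 = cum_rel_freq[3] - cum_rel_freq[2]

def percentile_at(income):
    """Interpolate the cumulative relative frequency at `income` ($1000s).

    Assumes straight line segments between plotted points, starting at (35, 0).
    """
    xs = [35] + upper_bounds
    ys = [0.0] + cum_rel_freq
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= income <= x1:
            return y0 + (income - x0) / (x1 - x0) * (y1 - y0)
    raise ValueError("income outside the range of the table")

print(f"Between $50K and $55K: {between_50_and_55:.3f}")
print(f"Estimated cumulative proportion at $57,445: {percentile_at(57.445):.3f}")
```

An estimate read off the graph by eye will differ slightly from the interpolated value.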
Ex. 2.1-5: Wins in Major League Baseball
In 2009, the mean number of wins was 81 with a standard deviation of 11.4 wins.
Find and interpret the z-scores for the following teams.
(a) The New York Yankees, with 103 wins.
(b) The New York Mets, with 70 wins.
Ex. 2.1-6: Home Run Kings
The single-season home run record for major league baseball has been set just three times since Babe
Ruth hit 60 home runs in 1927. Roger Maris hit 61 in 1961, Mark McGwire hit 70 in 1998 and Barry
Bonds hit 73 in 2001. In an absolute sense, Barry Bonds had the best performance of these four players,
since he hit the most home runs in a single season. However, in a relative sense this may not be true.
Baseball historians suggest that hitting a home run has been easier in some eras than others. This is due
to many factors, including quality of batters, quality of pitchers, hardness of the baseball, dimensions of
ballparks, and possible use of performance-enhancing drugs. To make a fair comparison, we should see
how these performances rate relative to other hitters during the same year.
Compute the standardized scores for each performance.
Year   Player          HR   Mean   SD     z-score
1927   Babe Ruth       60    7.2    9.7
1961   Roger Maris     61   18.8   13.4
1998   Mark McGwire    70   20.7   12.7
2001   Barry Bonds     73   21.4   13.2
Which player had the most outstanding performance relative to his peers?
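The standardized scores can be computed directly from the table with z = (value - mean) / SD. The following is a minimal sketch; the tuple layout and the name `z_score` are illustrative choices.

```python
# (year, player, home runs, league mean, league SD), taken from the table above.
seasons = [
    (1927, "Babe Ruth", 60, 7.2, 9.7),
    (1961, "Roger Maris", 61, 18.8, 13.4),
    (1998, "Mark McGwire", 70, 20.7, 12.7),
    (2001, "Barry Bonds", 73, 21.4, 13.2),
]

def z_score(value, mean, sd):
    """Number of standard deviations `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z_by_player = {player: z_score(hr, mean, sd) for _, player, hr, mean, sd in seasons}

for player, z in z_by_player.items():
    print(f"{player}: z = {z:.2f}")

# The largest z-score marks the season furthest above its own era's average.
most_outstanding = max(z_by_player, key=z_by_player.get)
print("Largest z-score:", most_outstanding)
```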
Ex. 2.1-7, 8: Test Scores (Transforming Data)
The graph and table below show a sample of 30 test scores and their summary statistics.
The maximum possible score on the test was 50 points.
[Dot plot of Score; horizontal axis from 10 to 50]
         n    x̄      sx     Min   Q1   M    Q3   Max   IQR   Range
Score    30   35.8   8.17   12    32   37   41   48    9     36
Suppose that the teacher was nice and added 5 points to each test score.
How would this change the shape, center, and spread of the distribution?
Shown below are the graphs and summary statistics for the original scores and the +5 scores:
[Dot plots of Score and Score_Plus5; shared horizontal axis from 10 to 50]
             n    x̄      sx     Min   Q1   M    Q3   Max   IQR   Range
Score        30   35.8   8.17   12    32   37   41   48    9     36
Score + 5    30   40.8   8.17   17    37   42   46   53    9     36
From both the graph and summary statistics, we can see that the measures of center and measures of
position all increased by 5. However the shape of the distribution did not change nor did the spread of
the distribution.
Suppose that the teacher wanted to convert the original test scores to percents. Since the test was out of
50 points, he should multiply each score by 2 to make them out of 100. Shown below are graphs and
summary statistics for the original scores and the doubled scores.
[Dot plots of Score and Score x 2; shared horizontal axis from 10 to 100]
             n    x̄      sx      Min   Q1   M    Q3   Max   IQR   Range
Score        30   35.8    8.17   12    32   37   41   48     9    36
Score x 2    30   71.6   16.34   24    64   74   82   96    18    72
From the graphs and summary statistics we can see that the measures of center, location, and spread all
have doubled, just like the individual observations. But even though the distribution is more spread out,
the shape hasn’t changed. It is still skewed to the left (the mean of 35.8 is below the median of 37, and the minimum of 12 lies far below Q1) with the same clusters and gaps.
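The effect of adding a constant and of multiplying by a constant can be checked numerically. The sketch below uses a small hypothetical list of scores (not the 30 scores from the example, which are not listed); the helper `summary` is illustrative.

```python
from statistics import mean, median, stdev

# Hypothetical scores for illustration only; the example's raw data are not given.
scores = [12, 28, 32, 35, 37, 38, 41, 44, 48]

plus5 = [x + 5 for x in scores]     # add 5 points to every score
times2 = [x * 2 for x in scores]    # convert a 50-point scale to percent

def summary(data):
    """Center (mean, median) and spread (SD, range) of a data set."""
    return {
        "mean": mean(data),
        "median": median(data),
        "sd": stdev(data),
        "range": max(data) - min(data),
    }

for label, data in [("Score", scores), ("Score + 5", plus5), ("Score x 2", times2)]:
    s = summary(data)
    print(f"{label:10s} mean={s['mean']:.2f} median={s['median']} "
          f"sd={s['sd']:.2f} range={s['range']}")
```

Adding 5 shifts the mean and median by 5 and leaves the SD and range alone; multiplying by 2 doubles all four.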
Ex. 2.2-1, 2: Batting Averages
The histogram below shows the distribution of batting average (proportion of hits) for the 432 Major
League Baseball players with at least 100 plate appearances in the 2009 season. The smooth curve
shows the overall shape of the distribution.
In the first graph below, the bars in red represent the proportion of players who had batting averages of
at least 0.270. There are 177 such players out of a total of 432, for a proportion of 0.410. In the second
graph below, the area under the curve to the right of 0.270 is shaded. This area is 0.391, only 0.019
away from the actual proportion of 0.410.
The mean of the 432 batting averages in MLB in 2009 was 0.261 with a standard deviation of 0.034.
Suppose that the distribution is exactly Normal with μ = 0.261 and σ = 0.034.
(a) Sketch a Normal density curve for this distribution
of batting averages.
Label the points that are 1, 2, and 3 standard deviations
from the mean.
(b) What percent of the batting averages are
above 0.329? Show your work.
(c) What percent of the batting averages are
between 0.227 and 0.295? Show your work.
Compare your results with results from using the 68-95-99.7 rule.
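For comparison with the 68-95-99.7 rule, the two areas can also be computed with software. This is a sketch using `statistics.NormalDist` from the Python standard library, under the example's assumption that the distribution is exactly Normal; the variable names are illustrative.

```python
from statistics import NormalDist

# Model assumed in the example: Normal with mean 0.261 and SD 0.034.
batting = NormalDist(mu=0.261, sigma=0.034)

# 0.329 is 2 standard deviations above the mean.
above_329 = 1 - batting.cdf(0.329)

# 0.227 and 0.295 are 1 standard deviation below and above the mean.
between_227_and_295 = batting.cdf(0.295) - batting.cdf(0.227)

print(f"P(average > 0.329) = {above_329:.4f}")
print(f"P(0.227 < average < 0.295) = {between_227_and_295:.4f}")
```

The 68-95-99.7 rule gives about 2.5% and 68% for these two areas; the computed values are slightly more precise versions of the same quantities.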
Ex. 2.2-3: Finding Area to the Right
Suppose we wanted to find the proportion of observations in a Normal distribution that were more than
1.53 standard deviations above the mean. That is, we want to know what proportion of observations in
the standard Normal distribution are greater than z = 1.53.
To find this proportion, locate the value 1.5 in the left-hand column of Table A, then locate the
remaining digit 3 as .03 in the top row. The corresponding entry is 0.9370. This is the area to the left of
z = 1.53. To find the area above z = 1.53, subtract 0.9370 from 1 to get 0.0630.
[Figure: standard Normal curve. The table entry 0.9370 is the area to the left of z = 1.53; the area to the right of z = 1.53 is 1 – 0.9370 = 0.0630.]
Ex. 2.2-4: Finding Areas Under the Standard Normal Curve
Find the proportion of observations from the standard Normal distribution that are between -0.58 and
1.79.
To find the proportion of observations from the standard Normal curve that are between -0.58 and 1.79,
we must find the proportion of values that are less than z = 1.79 and then subtract the proportion of
values that are less than z = -0.58. The difference in these proportions is the proportion of observations
that are between z = -0.58 and z = 1.79.
[Figure: three standard Normal curves illustrating the subtraction.]
Area to the left of z = 1.79 (0.9633) – area to the left of z = -0.58 (0.2810) = area between z = -0.58 and z = 1.79 (0.6823).
Ex. 2.2-5: Working Backward
In a standard Normal distribution, 20% of the observations are above what value? Using Table A, we
should look up an area of 0.8000 since the table always lists area to the left of a boundary. The closest
area to 0.8000 is 0.7995 which corresponds to a z-score of z = 0.84. Thus, approximately 20% of the
observations in a standard Normal distribution are above z = 0.84.
[Figure: standard Normal curve with the area to the right of z shaded. The shaded area is 0.20. What’s z?]
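The Table A lookups in the last three examples (area to the right, area between two values, and working backward from an area) can be reproduced in software. The sketch below uses `statistics.NormalDist` from the Python standard library in place of the printed table; results may differ from Table A in the last decimal place because the table is rounded.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, SD 1

# Area to the right of z = 1.53: one minus the area to the left.
right_of_153 = 1 - Z.cdf(1.53)

# Area between z = -0.58 and z = 1.79: difference of two left-hand areas.
between = Z.cdf(1.79) - Z.cdf(-0.58)

# Working backward: the z with 20% above it has 80% below it.
z_top_20 = Z.inv_cdf(0.80)

print(f"Area right of 1.53:          {right_of_153:.4f}")
print(f"Area between -0.58 and 1.79: {between:.4f}")
print(f"z with 20% above it:         {z_top_20:.2f}")
```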
Ex. 2.2-6, 7: Serving Speed
In the 2008 Wimbledon tennis tournament, Rafael Nadal averaged 115 miles per hour (mph) on his first
serves [1]. Assume that the distribution of his first serve speeds is Normal with a mean of 115 mph and a
standard deviation of 6 mph.
About what proportion of his first serves would you expect to exceed 120 mph?
State: Let x = the speed of Nadal’s first serve. The variable x has a Normal distribution with μ = 115
and σ = 6. We want the proportion of first serves with x > 120.
Plan: The figure below shows the distribution with the area of interest shaded (x = 120 corresponds to z = 0.83).
Do: Standardize: z = (120 – 115)/6 = 0.83. Table A: Looking up a z-score of 0.83 shows us that the area
less than z = 0.83 is 0.7967. This means that the area to the right of z = 0.83 is 1 – 0.7967 = 0.2033.
Conclude: About 20% of Nadal’s first serves will travel more than 120 mph.
What percent of Rafael Nadal’s first serves are between 100 and 110 mph?
State: Let x = the speed of Nadal’s first serve. The variable x has a Normal distribution with μ = 115
and σ = 6. We want the proportion of first serves with 100 < x < 110.
Plan: The figure below shows the distribution with the area of interest shaded (x = 100 corresponds to z = -2.50 and x = 110 corresponds to z = -0.83).
Do: Standardize: When x = 100, z = (100 – 115)/6 = -2.50. When x = 110, z = (110 – 115)/6 = -0.83.
Table A: Looking up a z-score of -2.50 shows us that the area less than z = -2.50 is 0.0062. Looking up a
z-score of -0.83 shows us that the area less than z = -0.83 is 0.2033. Thus, the area between z = -2.50 and
z = -0.83 is 0.2033 – 0.0062 = 0.1971.
Conclude: About 20% of Nadal’s first serves will travel between 100 and 110 mph.
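The State-Plan-Do-Conclude calculations for the serve speeds can be mirrored in code. The sketch below standardizes each boundary and rounds z to two decimals, as a Table A lookup requires, then compares the result with the unrounded calculation; the function name `standardize` is illustrative.

```python
from statistics import NormalDist

MU, SIGMA = 115, 6           # model assumed in the example (mph)
Z = NormalDist()             # standard Normal, used in place of Table A
serve = NormalDist(MU, SIGMA)

def standardize(x):
    """z-score of a serve speed, rounded to 2 decimals like a Table A lookup."""
    return round((x - MU) / SIGMA, 2)

# Proportion of first serves faster than 120 mph.
over_120_table = 1 - Z.cdf(standardize(120))
over_120_exact = 1 - serve.cdf(120)

# Proportion of first serves between 100 and 110 mph.
mid_table = Z.cdf(standardize(110)) - Z.cdf(standardize(100))
mid_exact = serve.cdf(110) - serve.cdf(100)

print(f"x > 120:       rounded z {over_120_table:.4f}, unrounded {over_120_exact:.4f}")
print(f"100 < x < 110: rounded z {mid_table:.4f}, unrounded {mid_exact:.4f}")
```

The two approaches differ only in the third decimal place, so both support the conclusion of about 20%.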
[1] http://sports.espn.go.com/sports/tennis/wimbledon08/columns/story?columnist=garber_greg&id=3472238
Ex. 2.2-8: Heights of three-year-old females.
According to http://www.cdc.gov/growthcharts/, the heights of 3-year-old females are approximately
Normally distributed with a mean of 94.5 cm and a standard deviation of 4 cm.
What is the third quartile of this distribution?
State: Let x = the height of a randomly selected 3-year-old female. This variable has the N(94.5, 4)
distribution. The third quartile is the value with 75% of the distribution to its left.
Plan: The picture below illustrates what we are trying to find.
Do: Using Table A, the table entry closest to 0.75 is 0.7486. This corresponds to a z-score of 0.67. To
un-standardize, we solve the following equation for x: 0.67 = (x – 94.5)/4, and get x = 97.18 cm.
Conclude: The third quartile of 3-year-old female heights is 97.18 cm.
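The un-standardizing step can be checked with an inverse Normal calculation. This is a sketch using `statistics.NormalDist.inv_cdf` from the Python standard library; it does not round z to two decimals, so the answer differs slightly from the Table A result.

```python
from statistics import NormalDist

heights = NormalDist(mu=94.5, sigma=4)   # N(94.5, 4) model from the example (cm)

# z-score with 75% of the standard Normal distribution to its left.
z_q3 = NormalDist().inv_cdf(0.75)

# Un-standardize by hand (x = mu + z * sigma) and directly with inv_cdf.
q3_by_hand = 94.5 + z_q3 * 4
q3_direct = heights.inv_cdf(0.75)

print(f"z for the third quartile: {z_q3:.4f}")
print(f"Third quartile: {q3_direct:.2f} cm")
```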
Ex. 2.2-9: No Space in the Fridge? (Assessing Normality)
The measurements listed below describe the usable capacity (in cubic feet) of a sample of 36 side-by-side refrigerators (source: Consumer Reports, May 2010).
Are the data close to Normal?
12.9 13.7 14.1 14.2 14.5 14.5 14.6 14.7 15.1 15.2 15.3 15.3
15.3 15.3 15.5 15.6 15.6 15.8 16.0 16.0 16.2 16.2 16.3 16.4
16.5 16.6 16.6 16.6 16.8 17.0 17.0 17.2 17.4 17.4 17.9 18.4
A histogram of these data is shown below. It seems roughly symmetric and bell shaped.
The mean and standard deviation of these data are x̄ = 15.825 and sx = 1.217.
x̄ ± 1sx = (14.608, 17.042)    24 of 36 = 66.7%
x̄ ± 2sx = (13.391, 18.259)    34 of 36 = 94.4%
x̄ ± 3sx = (12.174, 19.476)    36 of 36 = 100%
These percents are quite close to what we would expect based on the 68-95-99.7 rule. Combined with
the graph, this gives good evidence that this distribution is close to Normal.
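The counts used in this check can be reproduced from the listed data. The sketch below computes the mean and sample standard deviation, then counts the observations within 1, 2, and 3 standard deviations of the mean; the variable names are illustrative.

```python
from statistics import mean, stdev

# Usable capacity (cubic feet) of the 36 refrigerators listed above.
capacity = [
    12.9, 13.7, 14.1, 14.2, 14.5, 14.5, 14.6, 14.7, 15.1, 15.2, 15.3, 15.3,
    15.3, 15.3, 15.5, 15.6, 15.6, 15.8, 16.0, 16.0, 16.2, 16.2, 16.3, 16.4,
    16.5, 16.6, 16.6, 16.6, 16.8, 17.0, 17.0, 17.2, 17.4, 17.4, 17.9, 18.4,
]

xbar = mean(capacity)
s = stdev(capacity)          # sample standard deviation (divides by n - 1)

within = {}                  # k -> number of observations within k SDs of the mean
for k in (1, 2, 3):
    low, high = xbar - k * s, xbar + k * s
    within[k] = sum(1 for x in capacity if low <= x <= high)
    pct = 100 * within[k] / len(capacity)
    print(f"within {k} SD: ({low:.3f}, {high:.3f}) "
          f"{within[k]} of {len(capacity)} = {pct:.1f}%")
```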
Ex. 2.2-10: Assessing Normality
Alternate Example: Here is a Normal probability plot (also called a Normal quantile plot) of the
refrigerator data from the example above. It is quite linear, supporting our earlier decision that the
distribution is close to Normal.
Ex. 2.2-11: State Land Areas (Assessing Normality)
The histogram and Normal probability plot below display the land areas for the 50 states.
Is this distribution approximately Normal?
Both the histogram and Normal probability plot indicate that this distribution is strongly skewed to the
right. In particular, there is one state whose area is much larger than we would expect if the distribution
were approximately Normal.
Ex. 2.2-12: NBA Free Throw Percentage (Assessing Normality)
This is an example of a distribution that is skewed to the left. Notice that the lowest free throw
percentages are to the left of what we would expect and the highest free throw percentages are not as
far to the right as we would expect.
Ex. 2.2-13: How linear should a Normal probability plot actually be?
The screen shots below show the results for taking random samples from a Normal distribution and
generating the data’s Normal probability plot. As you can see, none of the plots were perfectly linear,
even though the sample came from a Normal population.
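A similar experiment can be run without a graphing calculator. The sketch below draws several random samples from a Normal population and, instead of plotting, reports the correlation between the sorted data and their expected z-scores as a rough measure of how linear each Normal probability plot would be. The sample size of 20, the seed, and the plotting positions (i - 0.5)/n are assumptions made for illustration; software packages use slightly different conventions.

```python
import random
from statistics import NormalDist, mean

def normal_plot_points(sample):
    """Pair each sorted observation with its expected z-score.

    Uses plotting positions (i - 0.5)/n, one common convention (assumed here).
    """
    n = len(sample)
    std = NormalDist()
    expected_z = [std.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(sorted(sample), expected_z))

def correlation(points):
    """Pearson correlation of the (x, z) pairs; 1.0 would be a perfect line."""
    xs, zs = zip(*points)
    mx, mz = mean(xs), mean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / (sxx * szz) ** 0.5

random.seed(1)               # fixed seed so the run is repeatable
correlations = []
for trial in range(5):
    sample = [random.gauss(0, 1) for _ in range(20)]
    r = correlation(normal_plot_points(sample))
    correlations.append(r)
    print(f"sample {trial + 1}: r = {r:.3f}")
```

Each correlation is high but below 1, which matches the point of the example: samples from a truly Normal population still give plots that are only approximately linear.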