AP Stats / Topic TWO “Summarizing Distributions”
Contents
1. Measuring the center / Measures of central tendency
2. Measuring spread / Measures of variation
3. Empirical rule
4. Measuring position
Objectives / SWBAT
• Find the mean, median, and mode of a set of data, including the weighted mean and the mean
  of a frequency distribution.
• Describe the shape of a distribution as symmetric, skewed, or uniform.
• Compare the mean and median for each shape of a distribution.
• Find the range, the variance, and the standard deviation of a set of data.
• Find an approximation of the sample standard deviation for grouped data.
• Find the first, second, and third quartiles of a set of data.
• Find the interquartile range.
• Display data by using a box plot / box-and-whisker plot.
• Find, interpret, and compare z-scores (standard scores).
Measuring the center / Measures of central tendency
Mean and Median / Notes
• Population mean (µ is the Greek letter mu): µ = ∑x / N
• Sample mean (x̄, read as “x bar”): x̄ = ∑x / n
• Median / the value that lies in the middle of the data when the data are ORDERED.
• Mode / the data entry that occurs with the greatest frequency. If no entry is repeated,
  the data set has no mode. A data set with two modes is bimodal; one with more than two
  is multimodal. Ordering the data helps to find the mode.
• Outlier / a data entry that is far removed from the other entries in the data set.
  Outliers cause gaps.
• Weighted mean / the mean of a data set whose entries have varying weights:
  x̄ = ∑(x · w) / ∑w, where w is the weight of each entry x.
• Mean of a frequency distribution / for a sample, it is approximated by
  x̄ = ∑(x · f) / n, where x and f are the midpoint and frequency of a class, respectively.
  Note that n = ∑f.
• Shape of distributions:
   o Symmetric: a vertical line can be drawn through the middle of a graph of the
     distribution, and the resulting halves are APPROXIMATELY mirror images.
   o Uniform (or rectangular): all entries or classes have equal or APPROXIMATELY equal
     frequencies. A uniform distribution is also symmetric.
   o Skewed: the distribution has a tail extended to the left or to the right.
   o Bell-shaped (mound-shaped).
• In general, the median is at the (n + 1) / 2 position. If we have 28 entries in order,
  we will find the median at the (28 + 1) / 2 = 14.5th position, that is, between the
  14th and 15th terms.
• Median and mean are both measures of center, but sometimes we must choose which one to
  use to describe the distribution. Whether to use the mean or the median depends on the
  SHAPE of the distribution:
   o Symmetric and bell-shaped: the mean and median will be close.
   o If the distribution has outliers or is strongly skewed, the median is probably the
     better choice to describe the center because the MEDIAN is a resistant statistic;
     it is not dramatically affected by extreme values. The mean is not resistant; it is
     dramatically affected by extreme values.
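The formulas in these notes can be checked with a short Python sketch. The helper names (median_position, weighted_mean, grouped_mean) are illustrative and not part of the handout. The grade weights and scores are taken from class example 4, and the midpoint/frequency pairs are read from the table in class example 5, so that pairing is an assumption.

```python
def median_position(n):
    """Position of the median in an ordered list of n entries: (n + 1) / 2."""
    return (n + 1) / 2


def weighted_mean(values, weights):
    """Weighted mean: sum(x * w) / sum(w)."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


def grouped_mean(midpoints, freqs):
    """Approximate mean of a frequency distribution: sum(x * f) / n, with n = sum(f)."""
    return sum(x * f for x, f in zip(midpoints, freqs)) / sum(freqs)


# 28 ordered entries: the median sits between the 14th and 15th terms.
print(median_position(28))  # 14.5

# Class example 4: scores and their weights (assumed order: tests, midterm,
# final exam, computer lab, homework).
scores = [86, 96, 82, 96, 100]
weights = [0.50, 0.15, 0.20, 0.10, 0.05]
print(round(weighted_mean(scores, weights), 2))  # 88.4

# Class example 5: class midpoints and frequencies (pairing assumed from the table).
midpoints = [12.5, 24.5, 36.5, 48.5, 60.5, 72.5, 84.5]
freqs = [6, 10, 13, 8, 5, 6, 2]
print(round(grouped_mean(midpoints, freqs), 2))  # 41.78
```

Under these assumptions the weighted mean falls below 90, and the grouped mean is only an approximation because every entry in a class is treated as if it sat at the class midpoint.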
Class Examples
1. The prices (in dollars) for a sample of round-trip flights from Chicago, Illinois to Cancun,
   Mexico are listed below.
   872  432  397  427  388  782  397
   a) Find the mean, median, and mode of the data set.
2. The ages of students in a college class are listed below.
   20  20  20  20  20  20  21  21  21  21
   22  22  22  23  23  23  23  24  24  65
   a) Find the mean, median, and the mode.
   b) Make a histogram. Indicate the measures of central tendency.
   c) Which measure of central tendency best describes a typical entry of the data set?
   d) Are there any outliers?
   e) Remove the data entry of 65 from the preceding data set. Find the mean, median, and the mode.
   f) How does the absence of this outlier change each of the measures? Compare these measures
      with those found in part a).
The median is resistant to outliers; it is a particularly useful measure for describing a
distribution that has outliers (extreme values).
The mean is strongly affected by outliers.
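The effect of the outlier in example 2 can be seen directly with Python's standard statistics module. This is only a sketch for checking hand or calculator work; summarize is an illustrative helper name, and multimode requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Data copied from class examples 1 and 2.
flights = [872, 432, 397, 427, 388, 782, 397]
ages = [20] * 6 + [21] * 4 + [22] * 3 + [23] * 4 + [24] * 2 + [65]


def summarize(data):
    """Return (mean, median, list of modes) for a data set."""
    return mean(data), median(data), multimode(data)


flight_summary = summarize(flights)   # mean about 527.86, median 427, mode 397
with_outlier = summarize(ages)        # mean 23.75, median 21.5, mode 20
without_outlier = summarize([a for a in ages if a != 65])  # mean about 21.58, median 21, mode 20

for label, (m, med, modes) in [("flights", flight_summary),
                               ("ages with 65", with_outlier),
                               ("ages without 65", without_outlier)]:
    print(f"{label}: mean={m:.2f} median={med} modes={modes}")
```

Removing the single entry of 65 moves the mean by more than two years but the median by only half a year, which is what "resistant" means in practice.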
3. Suppose that the numbers of unnecessary procedures recommended by five doctors in a 1-month
   period are given by the set {2, 2, 8, 20, 33}.
   a. Find the mean and the median.
   b. If it is discovered that the fifth doctor also recommended an additional 25 unnecessary
      procedures, how will the median and mean be affected?
4. You are taking a class in which your grade is determined from five sources: 50% from your test
   mean, 15% from your midterm, 20% from your final exam, 10% from your computer lab work, and 5%
   from your homework. Your scores are 86 (test mean), 96 (midterm), 82 (final exam), 96 (computer
   lab), and 100 (homework). What is the weighted mean of your scores? If the minimum average for
   an A is 90, did you get an A?
5. Approximate the mean of the following frequency distribution. The data represent the number of
   minutes that a sample of Internet subscribers spent online during their last session.
   Class midpoint   Frequency
   12.5             6
   24.5             10
   36.5             13
   48.5             8
   60.5             5
   72.5             6
   84.5             2
6. Suppose the salaries of six employees are listed below.
   $3 000  $7 000  $15 000  $22 000  $23 000  $38 000
   a) What is the mean salary?
   b) What will the new mean salary be if everyone receives a $3 000 increase?
   c) What will the new mean salary be if everyone receives a 10% raise?
   d) Make conclusions.
Adding the same constant to each value increases the mean and median by
the same constant. Multiplying each value by the same constant multiplies
the mean and median by that constant.
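The conclusion of example 6 can be verified numerically. The sketch below uses the six salaries from the example; the 10% raise is computed with integer arithmetic (multiply by 11, divide by 10) only to avoid floating-point rounding in the comparison.

```python
from statistics import mean, median

salaries = [3000, 7000, 15000, 22000, 23000, 38000]

raised_flat = [s + 3000 for s in salaries]     # everyone receives $3 000 more
raised_pct = [s * 11 // 10 for s in salaries]  # everyone receives a 10% raise

print(mean(salaries), median(salaries))        # original mean and median
print(mean(raised_flat), median(raised_flat))  # both shift up by 3000
print(mean(raised_pct), median(raised_pct))    # both are multiplied by 1.1
```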
Measuring spread / Measures of variation
• Range / the difference between the largest and smallest values. The range gives some impression
  of the dispersion (spread), but it depends only on the two extreme values and is not sensitive
  to the values in the middle. We could use the range to evaluate samples with very few terms.
• Interquartile range / IQR = Q3 - Q1. The IQR is useful for removing the influence of extreme
  values or outliers on the range. It removes the upper and lower quarters of the values and
  represents the range of the middle 50% of the values (or entries).
• The numerical rule for identifying outliers is to calculate 1.5 × IQR and then call a value an
  outlier if it is more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the
  third quartile.
• Deviation / the deviation of an entry x in a population data set is the difference between the
  entry and the mean µ of the data set, x - µ. It shows how far each observation lies from the
  mean. The sum of the deviations is always zero, so their plain average cannot measure spread.
  To overcome this problem we square each deviation. For a sample, the deviations (x - x̄) are
  called residuals.
• Variance / by definition, the average squared deviation from the mean. It is a measure of spread
  because the more distant a value is from the mean, the larger the square of the difference
  between it and the mean.
• Population variance: σ² = ∑(x - µ)² / N
• Population standard deviation: σ = √σ² = √( ∑(x - µ)² / N )
• Sample variance: s² = ∑(x - x̄)² / (n - 1)
• Sample standard deviation: s = √( ∑(x - x̄)² / (n - 1) ). The divisor n - 1 is the number of
  independent values, not n: if you know n - 1 of the entries and the mean x̄, then the nth entry
  is determined.
• Standard deviation is a measure of the typical (usual, representative) amount an entry deviates
  from the mean. The more the entries are spread out, the greater the standard deviation. s
  measures the spread of the x-values around the sample mean.
• Each (x - x̄) is called a residual, and s is a “typical value” of the residuals.
• Variance is measured in squared units. Standard deviation is measured in the same units as
  the data.
• Standard deviation for grouped data: s = √( ∑(x - x̄)² f / (n - 1) )
• On the calculator, Sx is the sample standard deviation.

The definition of standard deviation
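These definitions can be traced step by step in a short Python sketch. It borrows the five calorie counts from class example 8; computing a population variance for them is done here only to contrast the N and n - 1 divisors, not because the handout treats them as a population.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

calories = [95, 152, 188, 205, 131]  # data from class example 8

x_bar = mean(calories)
deviations = [x - x_bar for x in calories]

# The deviations sum to zero (up to floating-point rounding), so we square them.
print(round(sum(deviations), 9))

sum_sq = sum(d ** 2 for d in deviations)

pop_var = sum_sq / len(calories)           # divide by N: data treated as a population
sample_var = sum_sq / (len(calories) - 1)  # divide by n - 1: data treated as a sample

print(round(sample_var, 1), round(sqrt(sample_var), 2))  # variance in squared units, s in calories

# The standard library gives the same values.
print(round(stdev(calories), 2), round(pstdev(calories), 2))
```

The sample standard deviation is always a little larger than the population version for the same numbers, because it divides the same sum of squares by n - 1 instead of N.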
Class Examples
7. Suppose that the starting salaries (in $1 000) for college graduates who took AP Stats in high
   school have the following characteristics: the smallest value is 18.8, 10% of the values are
   below 25.6, 25% are below 41.1, the median is 59.3, 60% are below 84.3, 75% are below 101.9,
   90% are below 118.0, and the top value is 201.7.
   a. What is the range?
   b. What is the IQR?
   c. When the numerical rule is used for outliers, should either the smallest or largest value
      be called an outlier?
8. The numbers of calories in 12-ounce servings of five popular beers are {95, 152, 188, 205,
   131}. Use a calculator to find the mean (x̄), the sample standard deviation (Sx), and the
   variance of the data.
9. Sample office rental rates (in dollars per square foot per year) for Seattle’s central business
   district are listed. Use a calculator to find the mean rental rate and the sample standard
   deviation.
   40.00  43.00  46.00  40.50  35.75
   39.75  32.75  36.75  35.75  38.75
   38.75  36.75  38.75  39.00  29.00
   35.00  42.75  32.75  40.75  35.25
10. The following frequency distribution shows the results of a survey in which 1000 adults were
    asked how much they spend in preparation for personal travel each year.
    Class      x (midpoint)   f        xf    x - x̄    (x - x̄)²    (x - x̄)² f
    0-99       49.5           380
    100-199    149.5          230
    200-299    249.5          210
    300-399    349.5          50
    400-499    449.5          60
    500-600    549.5          70
                              ∑ 1000
    a) Find the sample mean and the sample standard deviation of the set of data.
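Examples 7 and 10 can be checked with a short Python sketch. The function names outlier_fences and grouped_sample_sd are illustrative; the quartiles come from example 7 and the midpoints and frequencies from the table in example 10.

```python
from math import sqrt


def outlier_fences(q1, q3):
    """Lower and upper cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def grouped_sample_sd(midpoints, freqs):
    """Approximate sample mean and standard deviation for grouped data."""
    n = sum(freqs)
    x_bar = sum(x * f for x, f in zip(midpoints, freqs)) / n
    sum_sq = sum((x - x_bar) ** 2 * f for x, f in zip(midpoints, freqs))
    return x_bar, sqrt(sum_sq / (n - 1))


# Example 7: Q1 = 41.1, Q3 = 101.9, smallest value 18.8, largest value 201.7.
low, high = outlier_fences(41.1, 101.9)
print(18.8 < low, 201.7 > high)  # False True: only the largest value is flagged

# Example 10: midpoints and frequencies from the table.
midpoints = [49.5, 149.5, 249.5, 349.5, 449.5, 549.5]
freqs = [380, 230, 210, 50, 60, 70]
x_bar, s = grouped_sample_sd(midpoints, freqs)
print(round(x_bar, 1), round(s, 2))
```

The grouped result is an approximation, since each response is replaced by the midpoint of its class.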
Empirical rule
• The empirical rule (also called the 68-95-99.7 rule) applies to symmetric, bell-shaped data. In
  this case about 68% of the values lie within 1 standard deviation of the mean, about 95% of the
  values lie within 2 standard deviations of the mean, and about 99.7% of the values lie within
  3 standard deviations of the mean.
11. Suppose that taxicabs in New York City are driven an average of 75 000 miles per year with a
    standard deviation of 12 000 miles. What information does the empirical rule give us? Assume
    that the distribution is bell-shaped. Use a graphical representation to illustrate your answer.
Answer: About 68% of taxicabs in New York City are driven between 63 000 and 87 000 miles per year.
About 95% of taxicabs in New York City are driven between 51 000 and 99 000 miles per year.
About 99.7% of taxicabs in New York City are driven between 39 000 and 111 000 miles per year.
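The intervals in the answer to example 11 follow directly from mean ± k standard deviations for k = 1, 2, 3. A minimal Python sketch, with empirical_rule_intervals as an illustrative helper name:

```python
def empirical_rule_intervals(mean, sd):
    """Map each approximate percentage to its (low, high) interval around the mean."""
    percents = (68, 95, 99.7)
    return {p: (mean - k * sd, mean + k * sd)
            for k, p in enumerate(percents, start=1)}


intervals = empirical_rule_intervals(75_000, 12_000)
for pct, (low, high) in intervals.items():
    print(f"about {pct}% between {low} and {high} miles per year")
```

The percentages are approximations that hold only when the distribution is roughly bell-shaped, as the example assumes.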
Measuring position: simple ranking, percentile ranking, and z-score
• To describe data, we also need to be able to talk about the position of any value.
• For describing position, there are three procedures:
   o Simple ranking / involves arranging the values in some order and noting where in that order
     a particular value falls.
   o Percentile ranking / indicates what percentage of all values fall below the value under
     consideration. For Q1 and Q3 the percentile ranks are 25% and 75%, respectively.
   o The z-score (standard score) / states by how many standard deviations a particular value
     differs from the mean: z = (x - µ) / σ
Class Examples
12. Suppose the average price of gasoline in a large city is $3.80 per gallon with a standard
    deviation of $0.05. Calculate the z-score of the following values:
    a. $3.90
    b. $3.65
    c. For a z-score of +2.2, what is the raw score?
13. The water capacities (in gallons) of some of the major solid-fuel boilers in the USA are
    6.3   21    7.4   8.6   12.1
    16.1  23    23    28    33
    26    21    65    65    56
    66    70    34    35    50
    a. What is the position of the Passat HO-45, which has a capacity of 34 gallons?
    Answer: Enter the data in L1 and sort the list in descending order. There are 7 boilers with
    higher capacity on the list, so the Passat HO-45 has a SIMPLE RANKING of 8th. The remaining
    12 boilers have lower capacity, so the Passat has a PERCENTILE RANKING of 12/20 = 0.6 = 60%.
    The list has a mean of 33.325 and a sample standard deviation of 21.244, so the Passat has a
    z-score of (34 - 33.325) / 21.244 ≈ 0.032.
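The three position measures for the Passat HO-45 can be recomputed from the 20 capacities with a short Python sketch. It applies the definition given in the notes, under which the percentile ranking is the share of values that fall below 34 gallons, and it uses the sample standard deviation, which is what reproduces the 21.244 quoted in the answer.

```python
from statistics import mean, stdev

capacities = [6.3, 21, 7.4, 8.6, 12.1, 16.1, 23, 23, 28, 33,
              26, 21, 65, 65, 56, 66, 70, 34, 35, 50]
x = 34  # capacity of the Passat HO-45

# Simple ranking from the top: one more than the number of larger values.
simple_rank = sum(c > x for c in capacities) + 1

# Percentile ranking: percentage of values that fall below x.
percentile_rank = 100 * sum(c < x for c in capacities) / len(capacities)

# z-score: number of standard deviations between x and the mean.
z = (x - mean(capacities)) / stdev(capacities)

print(simple_rank, percentile_rank, round(z, 3))
```

With 7 larger and 12 smaller capacities, the rank from the top is 8th and 12 of the 20 values lie below 34, and the z-score close to zero shows that 34 gallons is almost exactly average for this list.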
14. A small used car dealer wanted to get an idea of how many cars her dealership sells per day.
    Listed below is the number of cars sold per day over a two-week period.
    14  9  23  7  11  23  17
    11  3  24  21  2  20  20
    a. Find the mean number of cars sold per day.
    b. Find the range of cars sold per day.
    c. Find the standard deviation of the number of cars sold per day.
    d. Find the median of cars sold per day.
    e. Find the first and third quartiles of the number of cars sold per day. Find the IQR.
    f. Find the 90th percentile of the number of cars sold per day.
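Parts a through e of example 14 can be checked in Python, much as the calculator's 1-Var Stats output would be used. Quartile conventions differ between calculators and software, so this sketch assumes the median-of-halves method (Q1 and Q3 are the medians of the lower and upper halves of the ordered data); five_number_summary is an illustrative helper name. The 90th percentile is left out because its value also depends on the convention chosen.

```python
from statistics import mean, median, stdev


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the lower and upper halves.

    Needs at least two entries; for an odd count the middle entry is left
    out of both halves.
    """
    xs = sorted(data)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[-half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


cars = [14, 9, 23, 7, 11, 23, 17, 11, 3, 24, 21, 2, 20, 20]

minimum, q1, med, q3, maximum = five_number_summary(cars)
print("mean:", round(mean(cars), 2))
print("range:", maximum - minimum)
print("sample standard deviation:", round(stdev(cars), 2))
print("median:", med)
print("Q1, Q3, IQR:", q1, q3, q3 - q1)
```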
Calculator Tips
To summarize a distribution / find the five-number summary of the data:
• STAT → Edit → enter the list entries
• STAT → CALC → 1: 1-Var Stats → 1-Var Stats L1 (list of entries)