Download Q - Lycoming College

Return to the definitions in Class Handout #1:
five-number summary Minimum (Min), First Quartile (Q1), Second Quartile (Q2), Third Quartile (Q3), Maximum (Max)
Now return to Exercise #1(i) in Class Handout #1:
1.-continued
(i) From the stem-and-leaf display in part (g) for the variable “yearly income,” find
the five-number summary, the median, the range, and the interquartile range.
2 | 5 6 7 8 9
3 | 0 0 3 4 4 5 9 9 9
4 | 0 1 4 5 5 9
5 | 3 5
6 | 0 1 4 5 8
7 | 1 5 8
Min = 25
Q1 = 33
Q2 = 40.5 = median
Q3 = 60
Max = 78
range = Max – Min = 78 – 25 = 53
IQR = Q3 – Q1 = 60 – 33 = 27
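The values above can be checked with a short script. The following is a minimal Python sketch, not part of the handout: the function name `five_number_summary` is hypothetical, and it assumes the quartile rule that matches the answers here, namely that Q1 and Q3 are the medians of the lower and upper halves of the sorted data (with the middle value left out when n is odd). Other software may use a different quartile rule and give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, Q2, Q3, Max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data; the middle value is excluded when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # smallest half of the observations
    upper = data[n - half:]      # largest half of the observations
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Yearly incomes ($1000s) read off the stem-and-leaf display:
# the stem is the tens digit and each leaf is a ones digit.
stems = {2: "56789", 3: "003445999", 4: "014559", 5: "35", 6: "01458", 7: "158"}
incomes = [10 * stem + int(leaf) for stem, leaves in stems.items() for leaf in leaves]

mn, q1, q2, q3, mx = five_number_summary(incomes)
print(mn, q1, q2, q3, mx)        # 25 33 40.5 60 78
print("range:", mx - mn)         # range: 53
print("IQR:", q3 - q1)           # IQR: 27
```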
outlier an observation whose value is deemed to be either unusually high or
unusually low relative to the other observations in the data set; one guideline
is to consider any value which is a distance of more than 1.5(IQR) below Q1
or above Q3 a potential outlier
box plot a graphical display for quantitative sample data where two edges of a
rectangular box mark the values of Q1 and Q3 , a line dividing the box into
two sections marks the value of Q2 , and the ends of two lines extending from
the sides of the box represent the minimum and maximum.
Often, the lines are stopped at the largest and smallest values which are not
potential outliers, and all potential outliers are designated with dots.
(j) From the five-number summary found in part (i) for the variable “yearly
income,” decide whether or not there are any potential outliers, and construct a
box plot.
(1.5)IQR = (1.5)27 = 40.5
Since Q1 – Min = 33 – 25 = 8 < 40.5, the minimum is not a potential outlier.
Since Max – Q3 = 78 – 60 = 18 < 40.5, the maximum is not a potential outlier.
[Box plot of Yearly Income ($1000s) on an axis from 25 to 80: Min = 25, Q1 = 33, Q2 = 40.5 = median, Q3 = 60, Max = 78]
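The 1.5(IQR) guideline can also be stated as a pair of cutoff values, with anything below the lower cutoff or above the upper cutoff flagged as a potential outlier. A minimal Python sketch follows; the function name `outlier_cutoffs` is hypothetical and is used only for illustration.

```python
def outlier_cutoffs(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs Q1 - k(IQR) and Q3 + k(IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Five-number summary for yearly income ($1000s) from part (i)
minimum, q1, q3, maximum = 25, 33, 60, 78

low, high = outlier_cutoffs(q1, q3)
print(low, high)                          # -7.5 100.5
print("minimum is a potential outlier:", minimum < low)   # False
print("maximum is a potential outlier:", maximum > high)  # False
```

Checking Min against Q1 – 1.5(IQR) is the same comparison as checking whether Q1 – Min exceeds 1.5(IQR).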
center value in a distribution refers to a middle or average value for the distribution
Some measures of center for a distribution are as follows:
mean (denoted $\bar{y}$) the arithmetic average of n quantitative
observations $y_1, y_2, \ldots, y_n$, calculated
by dividing the sum of the observations
by the number of observations; that is,
$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum y}{n}$$
median the second quartile Q2 in the five-number summary
mode the most frequently occurring observation(s) if any exist
(With a quantitative variable, the mean and median are often
of more interest than the mode; the mode is more useful with
a qualitative variable, where computing a mean or median
makes no sense as a general rule.)
dispersion in a distribution refers to the amount of variation (spread) in the distribution
Some measures of dispersion for a distribution are as follows:
range the difference between the largest and smallest observations,
that is, Max – Min
interquartile range (IQR) the difference between the third and first
quartiles, that is, Q3 – Q1
variance (denoted $s^2$) the sum of the squared deviations from the mean
(i.e., the sum of $(y_1 - \bar{y})^2, (y_2 - \bar{y})^2, \ldots, (y_n - \bar{y})^2$)
divided by one less than the number of
observations; that is,
$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$
standard deviation (denoted s)
the square root of the variance
These statistics can all be verified by using the
Excel spreadsheet named Summary_Statistics.
1.-continued
(k) The yearly incomes ($1000s) for the eight Democrats in the sample are
28  75  26  78  40  60  49  39
Do each of the following for these eight incomes:
List the observations in ascending order.
26  28  39  40  49  60  75  78
Find the five-number summary.
Min = 26, Q1 = 33.5, Q2 = 44.5, Q3 = 67.5, Max = 78
Find the median.
44.5
Find the mean.
$$\bar{y} = \frac{26 + 28 + 39 + 40 + 49 + 60 + 75 + 78}{8} = \frac{395}{8} = 49.375$$
Find the variance and standard deviation.
$$s^2 = \frac{(26 - 49.375)^2 + (28 - 49.375)^2 + \cdots + (78 - 49.375)^2}{8 - 1} = \frac{2787.875}{7} = 398.268$$
$$s = \sqrt{398.268} = 19.957$$
Find the mode.
There is no mode, since no observations are repeated.
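The mean, variance, and standard deviation for these eight incomes can also be checked outside of Excel. The sketch below uses Python's standard `statistics` module as one possible tool (the handout itself uses the Summary_Statistics spreadsheet). Its `variance` and `stdev` functions divide by n – 1, which matches the sample variance defined above.

```python
from statistics import mean, variance, stdev

# Yearly incomes ($1000s) for the eight incomes in part (k)
incomes = [28, 75, 26, 78, 40, 60, 49, 39]

y_bar = mean(incomes)      # sum of observations divided by n
s2 = variance(incomes)     # sum of squared deviations divided by n - 1
s = stdev(incomes)         # square root of the sample variance

print(sum(incomes))        # 395
print(y_bar)               # 49.375
print(round(s2, 3))        # 398.268
print(round(s, 3))         # 19.957
```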
If the largest salary of 78 ($78,000) were changed to 175 ($175,000),
what would be the values for the mean and median? What does this
suggest about how the values of the mean and median are influenced
by symmetry and skewness in a distribution?
The mean would be 61.5, but the median would stay equal to 44.5.
(This can be checked using the Excel spreadsheet named Summary_Statistics.)
The mean is pulled toward an extreme value, while the median depends
only on the middle observations.
When a distribution is nearly symmetric, the mean and median will
be close in value. If the mean is smaller than the median, the
distribution is negatively skewed; if the mean is larger than the
median, the distribution is positively skewed.
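The effect of the one changed salary can be seen by computing both measures of center before and after the change. This is a small Python sketch using the standard `statistics` module, offered only as an alternative check to the spreadsheet.

```python
from statistics import mean, median

original = [26, 28, 39, 40, 49, 60, 75, 78]
# Replace the largest salary, 78, with 175
modified = [175 if y == 78 else y for y in original]

print(mean(original), median(original))   # 49.375 44.5
print(mean(modified), median(modified))   # 61.5 44.5
```

Only the mean moves, because the two middle observations (40 and 49) are unchanged.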
shape of a distribution (symmetry and skewness) involves the type and amount of
symmetry or non-symmetry present
in the distribution
A distribution is called positively skewed when larger values tend to be more
dispersed (perhaps resulting in a few unusually high values) and called
negatively skewed when smaller values tend to be more dispersed (perhaps
resulting in a few unusually small values). One measure of skewness in a
distribution is as follows:
$$\text{skewness ratio} = \frac{\bar{y} - \text{median}}{s}$$
(If the skewness ratio for a distribution is less than –0.3, then the distribution
can be considered very negatively skewed, and if the skewness ratio for a
distribution is greater than +0.3, then the distribution can be considered very
positively skewed.)
What does the skewness ratio suggest about the shape of the
distribution?
$$\frac{49.375 - 44.5}{19.957} = +0.24$$
suggests a somewhat positively skewed distribution
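The skewness ratio is a one-line calculation once the mean, median, and standard deviation are known. The following Python sketch wraps it in a hypothetical helper named `skewness_ratio`; the ±0.3 cutoffs are the guideline stated in the handout.

```python
from statistics import mean, median, stdev

def skewness_ratio(values):
    """(mean - median) divided by the sample standard deviation."""
    return (mean(values) - median(values)) / stdev(values)

incomes = [26, 28, 39, 40, 49, 60, 75, 78]
ratio = skewness_ratio(incomes)

print(round(ratio, 2))     # 0.24
print(ratio > 0.3)         # False: positive, but below the "very skewed" cutoff
```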
(l) From the information displayed in part (d) for the variable “political party
affiliation,” find the mode.
The mode is the category “Republican” which is repeated most often.
2. For each of four levels of educational attainment, the distribution of ages for
Americans at least 25 years old in 1984 is organized into a frequency polygon
(from data collected by the U.S. Bureau of the Census and taken from the World
Almanac and Book of Facts, 1986).
[Figure: four frequency polygons of Relative Frequency (vertical axis, 0 to 40) against Age (years) (horizontal axis, 15 to 95), one panel for each level of education: Did Not Complete High School; Completed High School; 1 to 3 Years of College; 4 or More Years of College]
(a) Does the distribution of ages appear to be centered at different values for the
different levels of education? If yes, for which level of education does the
center appear to be smallest, and for which level of education does the center
appear to be largest?
The ages for those not completing high school appear to be centered at a
considerably higher value than for those in each of the other three
level-of-education categories.
(b) Does the distribution of ages appear to have a different amount of dispersion
for the different levels of education? If yes, for which level of education does
the dispersion appear to be smallest, and for which level of education does the
dispersion appear to be largest?
The variation of ages appears to be roughly the same for the four
level-of-education categories.
(c) Does the distribution of ages appear to have a different shape for the different
levels of education? If yes, how does the shape appear to differ?
None of the distributions appear to be symmetric. The distribution of ages
looks negatively skewed for those not completing high school and
positively skewed for each of the other three level-of-education categories.
bar chart a graphical display for qualitative data where categories are listed on a
horizontal axis and the height of a bar for each category represents a raw or
relative frequency as indicated by the labels on a vertical axis.
pie chart a graphical display for qualitative data where relative frequency for each
category is represented as a slice of a circle (pie) and the categories listed
include all possibilities
histogram a graphical display for quantitative sample data where the possible
numerical values are scaled on a horizontal axis and the height of a bar for
each of several intervals of values represents a raw or relative frequency as
indicated by the labels on a vertical axis.
frequency polygon a graphical display for quantitative sample data where the possible
numerical values are scaled on a horizontal axis and dots placed at
the middle of the top of where each bar for a histogram would be
are connected to produce a rough “curve” - the proportion of
observations that fall within a given interval of values is
represented by the corresponding area under the “curve”
probability distribution curve a “smooth curve” designed to describe quantitative
population data so that the proportion, or probability, of
observations falling within a given interval of values is
represented by the corresponding area under the “curve”
3. Suppose each of the box plots displayed represents the distribution of sample data
selected from some population. For each box plot, make a sketch of what a
corresponding histogram of the data could look like, and make a sketch of what the
probability distribution curve for the corresponding population could look like.
Uniform Distribution
Bell-Shaped Distribution
Positively Skewed Distribution
Negatively Skewed Distribution
Now look at Class Handout #2.
Class Handout #2
Definitions
parameter a numerical quantity which describes some characteristic of a population
statistic a numerical quantity which describes some characteristic of a sample
The symbol $\bar{x}$ is used to represent the mean for a sample.
The symbol $\mu$ is used to represent the mean for a population.
The symbol s is used to represent the standard deviation for a sample.
The symbol $\sigma$ is used to represent the standard deviation for a population.
We see then that $\mu$ and $\sigma$ are parameters, and that $\bar{x}$ and s are statistics.
1. For each situation, identify the experimental unit, the population, the sample, the
parameter, and the statistic.
(a) A state government official computes the mean yearly amount of spending per
pupil for 75 selected public school districts in the state in order to draw a
conclusion about the mean yearly amount of spending per pupil for all public
school districts in the state.
The experimental unit is each public school district in the state.
The population is all public school districts in the state.
The sample is the 75 selected public school districts.
The parameter is the mean yearly amount of spending per pupil for all
public school districts in the state.
The statistic is the mean yearly amount of spending per pupil for the
75 selected public school districts in the state.
(b) A pollster surveys 427 selected voters in a state by phone and calculates the
proportion intending to vote for the incumbent governor in order to draw a
conclusion about the proportion of all voters in the state intending to vote for
the incumbent governor.
The experimental unit is each voter in the state.
The population is all voters in the state.
The sample is the 427 selected voters.
The parameter is the proportion of all voters in the state intending to vote
for the incumbent governor.
The statistic is the proportion of the 427 surveyed voters intending to vote
for the incumbent governor.
Tchebysheff’s Theorem a statement of the following facts (requiring calculus to prove):
For any data set (population or sample), at least 75% (three fourths) of
the measurements must lie within two standard deviations of the mean,
that is, at least 75% (3/4) of a data set must lie between the values
mean – 2(standard deviation) and mean + 2(standard deviation).
For any data set (population or sample), at least 8/9 (about 89%) of the
measurements must lie within three standard deviations of the mean,
that is, at least 8/9 (about 89%) of a data set must lie between the values
mean – 3(standard deviation) and mean + 3(standard deviation).
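Both percentages above are instances of one general bound, which the handout does not state but which is the usual form of the theorem: for any k > 1, at least 1 – 1/k² of the measurements lie within k standard deviations of the mean. A minimal Python sketch, using a hypothetical helper name and exact fractions so that 3/4 and 8/9 appear without rounding:

```python
from fractions import Fraction

def tchebysheff_lower_bound(k):
    """Minimum proportion of any data set within k standard deviations
    of the mean, for a whole number k > 1: 1 - 1/k^2."""
    return 1 - Fraction(1, k * k)

for k in (2, 3):
    bound = tchebysheff_lower_bound(k)
    print(k, bound, f"{float(bound):.1%}")
# 2 3/4 75.0%
# 3 8/9 88.9%
```

The bound is a guaranteed minimum for any distribution; actual data sets usually have a larger proportion inside the interval.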
2. For each data set (all of which are displayed in a stem-and-leaf display),
(i) verify that the given mean, the given standard deviation, and the given
five-number summary are correct;
(ii) find the proportion of measurements which lie within two standard
deviations of the mean, and compare this proportion with what
Tchebysheff’s Theorem states.
(a)
10 | 9
11 | 5
12 |
13 |
14 |
15 |
16 |
17 | 8
18 | 1 5
19 | 1 4 6 8 9
20 | 1 2 4 6 9
21 | 5 9
22 | 2
23 |
24 |
25 |
26 |
27 |
28 | 5
29 | 1
Min = 109, Q1 = 188, Q2 = 200, Q3 = 212, Max = 291
mean = 200
standard deviation = 41.94
These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics.
The interval within two standard deviations of the mean is
from 200 – 2(41.94) to 200 + 2(41.94), that is, from 116.12 to 283.88.
The proportion of measurements which lie within two standard
deviations of the mean is 16/20 = 80%.
Tchebysheff’s Theorem states that this proportion will be at least 75%.
(b)
10 |
11 |
12 | 9
13 | 5
14 |
15 |
16 |
17 | 8
18 | 1 5
19 | 1 4 6 8 9
20 | 1 2 4 6 9
21 | 5 9
22 | 2
23 |
24 |
25 |
26 | 5
27 | 1
28 |
29 |
Min = 129, Q1 = 188, Q2 = 200, Q3 = 212, Max = 271
mean = 200
standard deviation = 33.20
These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics.
The interval within two standard deviations of the mean is
from 200 – 2(33.20) to 200 + 2(33.20), that is, from 133.60 to 266.40.
The proportion of measurements which lie within two standard
deviations of the mean is 18/20 = 90%.
Tchebysheff’s Theorem states that this proportion will be at least 75%.
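The two proportions can be checked by counting the measurements inside each interval. The Python sketch below is one way to do this; the helper name `proportion_within` is hypothetical, and the two lists are the measurements as read from the stem-and-leaf displays (stem = hundreds and tens digits, leaf = ones digit), so they are an assumption about the displayed data rather than a separately supplied file.

```python
from statistics import mean, stdev

def proportion_within(values, k=2):
    """Proportion of values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    inside = [v for v in values if m - k * s <= v <= m + k * s]
    return len(inside) / len(values)

# Data set (a), as read from its stem-and-leaf display
set_a = [109, 115, 178, 181, 185, 191, 194, 196, 198, 199,
         201, 202, 204, 206, 209, 215, 219, 222, 285, 291]
# Data set (b): same middle 16 values, different two smallest and two largest
set_b = [129, 135] + set_a[2:18] + [265, 271]

for name, data in (("a", set_a), ("b", set_b)):
    print(name, mean(data), round(stdev(data), 2), proportion_within(data))
# a 200 41.94 0.8
# b 200 33.2 0.9
```

Both proportions exceed the 75% minimum guaranteed by the theorem.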
Note that in both data sets, the two largest measurements and the two smallest
measurements are all potential outliers.
For most bell-shaped (or mound-shaped) distributions with no outliers, 95% of the
measurements will lie within two standard deviations of the mean.
3. A random sample of 30 days of hotel reservations is selected to obtain information
about the distribution of the daily number of “no shows”. Each value recorded
below is the number of room reservation bookings where the party did not show up
and did not cancel the reservation.
18  16  16  16  14  18  16  18  14  19
15  19   9  20  10  10  12  14  18  12
14  14  17  12  18  13  15  13  15  19
(a) Identify the experimental unit, the variable of interest, the population, and the
sample.
The experimental unit is each day.
The variable of interest is the number of “no shows” among hotel
bookings for a day.
The population is all days when hotel reservations are made.
The sample is the 30 selected days.
(b) Are the mean and standard deviation for this data parameters or statistics?
The mean and standard deviation calculated from this random sample are
statistics. (Parameters are numerical quantities which describe
characteristics of a population.)
(c) Use the Excel files Summary_Statistics and M214_Data to find the mean and
standard deviation for this data, and comment on the type of distribution the
data appears to have.
mean = $\bar{x}$ = 15.133
standard deviation = s = 2.945
The data appears to have a somewhat bell-shaped (mound-shaped)
distribution, and there are no outliers.
(d) We shall use the mean and standard deviation from part (c) as estimates of the
mean and standard deviation for the
population. Explain why it is reasonable to assume that about 95% of all days
will have a number of “no shows” within two standard deviations of the mean,
and then find this interval.
Since the data appears to have a somewhat bell-shaped (mound-shaped) distribution,
and there are no outliers, we expect about 95% of all measurements to be within two
standard deviations of the mean.
The interval within two standard deviations of the mean is from
15.133 – 2(2.945) to 15.133 +2(2.945), that is from 9.243 to 21.023
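The interval can be reproduced directly from the 30 recorded values. The following Python sketch is an alternative to the Excel files named in part (c), not the handout's own method. Because it carries full precision instead of the rounded values 15.133 and 2.945, the endpoints can differ from the hand calculation in the third decimal place.

```python
from statistics import mean, stdev

# Daily numbers of "no shows" for the 30 sampled days
no_shows = [18, 16, 16, 16, 14, 18, 16, 18, 14, 19,
            15, 19, 9, 20, 10, 10, 12, 14, 18, 12,
            14, 14, 17, 12, 18, 13, 15, 13, 15, 19]

x_bar = mean(no_shows)
s = stdev(no_shows)                      # sample standard deviation (n - 1 divisor)
low, high = x_bar - 2 * s, x_bar + 2 * s

inside = sum(low <= v <= high for v in no_shows)

print(round(x_bar, 3), round(s, 3))      # 15.133 2.945
print(round(low, 2), round(high, 2))     # 9.24 21.02
print(inside, "of", len(no_shows))       # 29 of 30
```

In this sample 29 of the 30 days (about 97%) fall inside the interval, which is in line with the "about 95%" guideline for bell-shaped data.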
(e) From the interval found in part (d), how many rooms per day can the hotel
overbook and still feel confident that all reservations will be honored?
Since it appears that the number of “no shows” each day will almost always be at
least 9 or 10, a hotel might feel confident with overbooking 9 or 10 rooms per day.
Before submitting Homework #1, check some of the answers (if you haven’t
done so already) from the link on the course schedule:
http://srv2.lycoming.edu/~sprgene/M214/Schedule214.htm