Math 227 Ch 3 notes S16 KEY

Chapter 3 Numerically Summarizing Data
Class ex: mean 6 7 8 fair share
Mean 2 2 3 6 7 7 8 balancing point
Chapter 3.1 Measures of Central Tendency
Objective A: Mean, Median and Mode
Three measures of central tendency: the mean, the median, and the mode.
A1. Mean (same as average)
The mean of a variable is the sum of all data values divided by the number of observations.
Population mean: $\mu = \frac{\sum x_i}{N}$, where $x_i$ is each data value and $N$ is the population size (the number of observations in the population). $\sum$ means sum.
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$, where $x_i$ is each data value and $n$ is the sample size (the number of observations in the sample).
Example 1: Population: Weights (in pounds) of adult turkeys
12 16 23 17 32 27 14 16
(a) Population mean: (Round the mean to one more decimal place than that in the raw data)
$\mu = \frac{12 + 16 + 23 + 17 + 32 + 27 + 14 + 16}{8} = \frac{157}{8} = 19.625 \approx 19.6$ pounds
The mean (average) weight of an adult turkey is 19.6 pounds.
The typical weight of an adult turkey is 19.6 pounds.

(b) Sample mean (from a simple random sample of size 4):
Using a lottery method, 23, 16, 14, and 17 were selected.
$\bar{x} = \frac{23 + 16 + 14 + 17}{4} = \frac{70}{4} = 17.5$ pounds
The mean (average) weight of an adult turkey is 17.5 pounds (based on this sample).
The typical weight of an adult turkey is 17.5 pounds (based on this sample).
(c) Is the sample mean equal to the population mean? No.
(d) Does the population mean or sample mean stay the same? Explain.
The population mean stays the same since it always includes everyone. The sample mean varies for each
sample taken.
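The turkey example can be checked with a few lines of Python. This is only a sketch: the data are the eight weights from Example 1, while the random seed and the choice of five repeated samples are assumptions made for illustration.

```python
import random
from statistics import mean

# Population from Example 1: weights (in pounds) of adult turkeys
population = [12, 16, 23, 17, 32, 27, 14, 16]

# Population mean: sum of all data values divided by N
mu = sum(population) / len(population)

# The sample selected in part (b)
sample = [23, 16, 14, 17]
x_bar = mean(sample)

print(f"population mean = {mu}")     # 19.625, reported as 19.6
print(f"sample mean     = {x_bar}")  # 17.5

# Repeated samples of size 4 generally give different sample means.
# The seed is arbitrary and only makes the run repeatable.
rng = random.Random(0)
sample_means = [mean(rng.sample(population, 4)) for _ in range(5)]
print(sample_means)
```

The population mean never changes between runs, while the list of sample means depends on which four turkeys happen to be drawn.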
A2. Median
The median, M, is the value that lies in the middle of the data when arranged in ascending order.
Example 1: Find the median of the data given below (ages of randomly selected movie theater customers).
4 12 32 24 9 18 28 10 36
1) Put data in order:
4 9 10 12 18 24 28 32 36
2) Select middle number: 18 years old
The typical movie theater customer age was 18 years old.
*NOTE: two ways to measure typical center: mean and median
Example 2: Find the median of the data given below, (money spent at the movies for eight customers).
$35.34 $42.09 $38.72 $43.28 $39.45 $49.36 $30.15 $40.88
1) Put data in order:
30.15 35.34 38.72 39.45 40.88 42.09 43.28 49.36
2) Select middle number: 39.45 and 40.88 are both in the middle, so take their average:
(39.45 + 40.88) / 2 = 80.33 / 2 = 40.165, which rounds to $40.17
The typical amount spent at the movies was $40.17.
*NOTE: two ways to measure typical center: mean and median
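Both median examples can be reproduced with Python's standard library. A minimal sketch, assuming the data lists from Examples 1 and 2; `statistics.median` sorts the data and averages the two middle values when the count is even.

```python
from statistics import median

# Example 1: ages of movie theater customers (odd count: one middle value)
ages = [4, 12, 32, 24, 9, 18, 28, 10, 36]

# Example 2: money spent at the movies (even count: average the two middle values)
spending = [35.34, 42.09, 38.72, 43.28, 39.45, 49.36, 30.15, 40.88]

median_age = median(ages)
median_spending = median(spending)

print(sorted(ages), "->", median_age)
print(sorted(spending), "->", f"{median_spending:.3f}")
```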
A3. Mode
The mode is the most frequent observation in the data set, that is, the data value that occurs most often.
Example 1: Find the mode of the data given below (ages of randomly selected adults at a senior care facility).
76 60 81 72 60 80 68 73 80 67
60 and 80 (both occurred twice)
Example 2: Find the mode of the data given below (randomly selected grades in a math class).
A C D C B C A B B F B W F D B W D A D C D
B and D occurred the most (five times each)
AAA
BBBBB
CCCC
DDDDD
FF
WW
NOTE: mean and median used for quantitative data only
mode used for both quantitative and qualitative (categorical) data
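The mode examples can be tallied in Python as a check. This sketch assumes Python 3.8 or later for `statistics.multimode`, which returns every value tied for the highest count. Note that when no value repeats, as in a data set with no mode, `multimode` returns all of the values rather than "none".

```python
from collections import Counter
from statistics import multimode

# Example 1: ages at a senior care facility (quantitative data)
ages = [76, 60, 81, 72, 60, 80, 68, 73, 80, 67]

# Example 2: grades in a math class (qualitative data)
grades = "A C D C B C A B B F B W F D B W D A D C D".split()

age_modes = multimode(ages)
grade_modes = multimode(grades)
grade_counts = Counter(grades)  # full tally of each grade

print("age modes:", age_modes)
print("grade modes:", grade_modes)
print("grade tally:", dict(grade_counts))
```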
Example 3: The following data represent the G.P.A. of 12 students.
2.56 3.21 3.88 2.44 1.96 2.85 2.32 3.38 1.86 3.04 2.75 2.23
Find the mean, median, and mode G.P.A.
Try it!
Answers:
Mean: $\bar{x} = \frac{\sum x_i}{n} = \frac{32.48}{12} = 2.7066\ldots \approx 2.707$
Median: $\frac{2.56 + 2.75}{2} = 2.655$
Mode: none
Objective B: Relation Between the Mean, Median and Distribution Shape
- The mean is sensitive to extreme data. For continuous data, if the distribution shape is
a bell-shaped curve, the mean is a better measure of central tendency because it
includes all data values in a data set.
- The median is resistant to extreme data. For continuous data, if the distribution shape is
skewed to the right or left, the median is a better measure of central tendency.
- The mode is used to represent the measure of central tendency for qualitative data.
Definition: A numerical summary of data is said to be resistant if extreme values (very large or small) relative
to the data do not affect its value substantially.
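A small numerical experiment illustrates what "resistant" means. This is a sketch only: the base data are the turkey weights from Section 3.1, and the extreme value of 90 pounds is a hypothetical observation added for illustration, not part of the notes.

```python
from statistics import mean, median

# Turkey weights (in pounds) from Section 3.1, Example 1
weights = [12, 16, 23, 17, 32, 27, 14, 16]

# Hypothetical extreme value (an assumption for this illustration)
with_extreme = weights + [90]

mean_shift = mean(with_extreme) - mean(weights)
median_shift = median(with_extreme) - median(weights)

print(f"mean:   {mean(weights):.3f} -> {mean(with_extreme):.3f}")
print(f"median: {median(weights)} -> {median(with_extreme)}")
print(f"mean moved by {mean_shift:.3f}, median moved by {median_shift}")
```

The single extreme value pulls the mean up by several pounds, while the median barely moves.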
Mean or Median versus Skewness
Chapter 3.2 Measures of Dispersion (same as spread)
6, 6, 6 and 1, 6, 11 bank example
A measure of dispersion is a numerical measure that quantifies the spread of data.
In this section, the three numerical measures of dispersion that we will discuss are the range, variance, and
standard deviation. In a later section, we will discuss another measure of dispersion called the interquartile
range (IQR).
Objective A: Range, Variance and Standard Deviation
A1. Range (overall range)
Range = R = largest data value - smallest data value
The range is not resistant because it is affected by extreme values in the data set.
A2. Variance and Standard Deviation (used for symmetric distributions)
Variance is based on the deviations about the mean. Since the sum of the deviations about the mean is always zero, we
cannot use the average deviation about the mean as a measure of spread.
We use the average squared deviation instead.
The population variance, $\sigma^2$, of a variable is the sum of the squared deviations about the population
mean, $\mu$, divided by the number of observations in the population, $N$.
Definition Formula: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Computational Formula: $\sigma^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N}$
The sample variance, $s^2$, of a variable is the sum of the squared deviations about the sample mean, $\bar{x}$,
divided by the number of observations in the sample minus 1, $n - 1$.
Definition Formula: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Computational Formula: $s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$
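The computational formula follows from the definition formula by expanding the square. A short derivation for the sample case is sketched here, using only the fact that the sample mean is the sum of the data values divided by n (the population case is identical with the population mean and N in place of the sample mean and n):

```latex
\begin{align*}
\sum (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2\,\frac{\left(\sum x_i\right)^2}{n} + \frac{\left(\sum x_i\right)^2}{n} \\
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
\end{align*}
```

Dividing both sides by n - 1 gives the computational formula for the sample variance.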
In order to use the sample variance to obtain an unbiased estimate of the population variance, we divide
the sum of the squared deviations about the sample mean by $n - 1$. We call $n - 1$ the degrees of freedom
because the first $n - 1$ observations have freedom to be whatever value they wish, but the $n$th value has
no freedom: it is forced to be whatever value makes $\sum (x_i - \bar{x})$ equal zero.
The population standard deviation, $\sigma$, is the square root of the population variance, or $\sigma = \sqrt{\text{Population Variance}}$
The sample standard deviation, $s$, is the square root of the sample variance, or $s = \sqrt{\text{Sample Variance}}$
To avoid round-off error, never use the rounded value of the variance to compute the standard deviation.
Keep a few more decimal places for an intermediate step calculation.
Example 1: Use the definition formula to find the population variance and population standard deviation.
Population (randomly selected ages of movie theater customers for a rated G movie):
4, 10, 12, 13, 21 (Assume that this was the entire population. Only 5 people saw this movie in the entire
theater.)
Population Mean: $\mu = \frac{4 + 10 + 12 + 13 + 21}{5} = \frac{60}{5} = 12$

$x_i$     $x_i - \mu$        $(x_i - \mu)^2$
4         4 - 12 = -8        $(-8)^2 = 64$
10        10 - 12 = -2       $(-2)^2 = 4$
12        12 - 12 = 0        $0^2 = 0$
13        13 - 12 = 1        $1^2 = 1$
21        21 - 12 = 9        $9^2 = 81$
                             $\sum (x_i - \mu)^2 = 150$

Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{150}{5} = 30$
Population standard deviation: $\sigma = \sqrt{30} = 5.477\ldots \approx 5.5$
Summary: The typical age of a movie theater customer for a rated G movie is 12 years
old with a standard deviation of 5.5 years. Therefore, a typical age is 12 ± 5.5 or
between 6.5 and 17.5 years old.
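The hand calculation in Example 1 can be verified in Python. This sketch computes the population variance with both the definition formula and the computational formula, then compares them with the standard library's `pvariance` and `pstdev`; the variable names are illustrative.

```python
from math import sqrt
from statistics import pvariance, pstdev

# Example 1: ages of the five customers (treated as the entire population)
ages = [4, 10, 12, 13, 21]
N = len(ages)
mu = sum(ages) / N

# Definition formula: average squared deviation about the population mean
var_definition = sum((x - mu) ** 2 for x in ages) / N

# Computational formula: uses only the sum of x and the sum of x squared
var_computational = (sum(x * x for x in ages) - sum(ages) ** 2 / N) / N

# Standard deviation from the unrounded variance
sigma = sqrt(var_definition)

print(mu, var_definition, var_computational)
print(f"sigma = {sigma:.4f}")        # about 5.4772, reported as 5.5
print(pvariance(ages), pstdev(ages))  # library cross-check
```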
Example 2: Use the definition formula to find the sample variance and sample standard deviation.
Sample of four exam grades: 83, 65, 91, 84
Sample Mean: $\bar{x} = \frac{83 + 65 + 91 + 84}{4} = \frac{323}{4} = 80.75$

$x_i$     $x_i - \bar{x}$           $(x_i - \bar{x})^2$
83        83 - 80.75 = 2.25         $2.25^2 = 5.0625$
65        65 - 80.75 = -15.75       $(-15.75)^2 = 248.0625$
91        91 - 80.75 = 10.25        $10.25^2 = 105.0625$
84        84 - 80.75 = 3.25         $3.25^2 = 10.5625$
                                    $\sum (x_i - \bar{x})^2 = 368.75$

Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{368.75}{4 - 1} = \frac{368.75}{3} = 122.91666\ldots$
Sample standard deviation: $s = \sqrt{122.91666\ldots} = 11.08677\ldots \approx 11.1$
Summary: The typical grade for this exam (based on this sample) was 80.75
with a standard deviation of 11.1 points. Therefore, a typical score is
80.75 ± 11.1, or between 69.65 and 91.85.
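The sample calculation can be checked the same way. A minimal sketch using the four exam grades from Example 2: the only change from the population case is the divisor n - 1, which is what `statistics.variance` and `statistics.stdev` use.

```python
from math import sqrt
from statistics import mean, variance, stdev

# Example 2: sample of four exam grades
grades = [83, 65, 91, 84]
n = len(grades)
x_bar = mean(grades)

# Definition formula with n - 1 in the denominator
sum_sq_dev = sum((x - x_bar) ** 2 for x in grades)
s2 = sum_sq_dev / (n - 1)

# Use the unrounded variance to avoid round-off error
s = sqrt(s2)

print(f"mean = {x_bar}, sum of squared deviations = {sum_sq_dev}")
print(f"s^2 = {s2:.4f}, s = {s:.4f}")
print(variance(grades), stdev(grades))  # library cross-check
```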
Example 3: Use StatCrunch to find the sample variance and standard deviation.
Sample: 83, 65, 91, 84 (same data set as Example 2)
Step 1:
Click the StatCrunch navigation button under the Course Home page →
Click StatCrunch website → Click Open StatCrunch →
Input the raw data in the var1 column → Click Stat → Click Summary Stats → Columns
Step 2:
Click var1 under Select column(s): → Under Statistics:, choose Variance and Std. dev. (click them while
holding the Ctrl key on the keyboard) → Click Compute!
Variance and standard deviation are computed.
$s^2 \approx 122.9$
$s \approx 11.1$
For more detailed instructions, please download "Q3.2.20" by clicking the StatCrunch Handout navigation
button on the course homepage.
Note : For a small data set, students are expected to calculate the standard deviation by hand.
Objective C : Empirical Rule
The figure below illustrates the Empirical Rule
Note: Within 3 SD of the mean, you may say 'almost all' of the data is captured (instead of 99.7%)
Class example: ages
Example 1:
SAT Math scores have a bell-shaped distribution with a mean of 515 and a standard
deviation of 114. (Source: College Board, 2007)
(a) What percentage of SAT scores is between 401 and 629?
515 - 114 = 401 and 515 + 114 = 629
Since these scores are within 1 SD of the mean, 68% of all SAT scores are between 401 and 629.
(b) What percentage of SAT scores is between 287 and 743?
515 - 2(114) = 287 and 515 + 2(114) = 743
Since these scores are within 2 SD of the mean, 95% of all SAT scores are between 287 and 743.
(c) What percentage of SAT scores is less than 401 or greater than 629?
100% - 68% = 32%
(d) What percentage of SAT scores is between 515 and 743?
Since 743 is 2 SD above the mean and a bell-shaped distribution is symmetric, this is half of the middle 95%: 95% / 2 = 47.5%
(e) About 99.7% of SAT scores will be between what scores?
515 - 3(114) = 173 and 515 + 3(114) = 857
Almost all (99.7%) of all SAT scores are between 173 and 857.
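The Empirical Rule says that for a bell-shaped distribution about 68%, 95%, and 99.7% of the data lie within 1, 2, and 3 standard deviations of the mean. The intervals in the SAT example can be generated with a short sketch; the dictionary layout and variable names are illustrative choices.

```python
# SAT Math example: bell-shaped, mean 515, standard deviation 114
mean_score, sd = 515, 114

# Empirical Rule: approximate percent of data within k SD of the mean
rule = {1: 68, 2: 95, 3: 99.7}

# Interval for each k: (mean - k*SD, mean + k*SD)
intervals = {k: (mean_score - k * sd, mean_score + k * sd) for k in rule}

for k, (low, high) in intervals.items():
    print(f"within {k} SD: about {rule[k]}% of scores are between {low} and {high}")

# Part (c): outside 1 SD; part (d): from the mean up to 2 SD above it
outside_1sd = 100 - rule[1]
mean_to_plus_2sd = rule[2] / 2
print(outside_1sd, mean_to_plus_2sd)
```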
Chapter 3.4 Measures of Position and Outliers
Measures of position determine the relative position of a certain data value within the entire set of data.
Objective A: z-scores
The z-score represents the distance that a data value is from the mean in terms of the number
of standard deviations.
Population z-score: $z = \frac{x - \mu}{\sigma}$
Sample z-score: $z = \frac{x - \bar{x}}{s}$
Note: Beyond 2 SD is unusual for the population
$x$ is each data value, $\mu$ and $\bar{x}$ are the population mean and sample mean, and $\sigma$ and $s$ are the population SD and sample SD.
Example 1: The average 20- to 29-year-old man is 69.6 inches tall, with a standard deviation of
3.0 inches, while the average 20- to 29-year-old woman is 64.1 inches tall, with a standard deviation of 3.8
inches. Who is relatively taller, a 67-inch man or 62-inch woman?
Man: $z = \frac{x - \mu}{\sigma} = \frac{67 - 69.6}{3.0} = \frac{-2.6}{3.0} \approx -0.867$
Woman: $z = \frac{x - \mu}{\sigma} = \frac{62 - 64.1}{3.8} = \frac{-2.1}{3.8} \approx -0.553$
Both the man and the woman are below average height. However, the man is relatively farther below the average.
Therefore, the 62-inch woman is relatively taller.
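The height comparison can be expressed as a small function. This is a sketch; `z_score` is a hypothetical helper name, and the means and standard deviations are the ones given in Example 1.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Example 1: men have mean 69.6 in, SD 3.0 in; women have mean 64.1 in, SD 3.8 in
z_man = z_score(67, 69.6, 3.0)
z_woman = z_score(62, 64.1, 3.8)

print(f"man:   z = {z_man:.3f}")
print(f"woman: z = {z_woman:.3f}")

# The larger (less negative) z-score is relatively taller
relatively_taller = "woman" if z_woman > z_man else "man"
print("relatively taller:", relatively_taller)
```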
Objective B: Percentiles and Quartiles
B1. Percentiles
The $k$th percentile, $P_k$, of a set of data is a value such that $k$ percent of the observations are less
than or equal to the value.
Example 1: Explain the meaning of the following statement: the 5th percentile of the weight of males 36 months of age is
12.0 kg.
5% of three-year-old (36-month-old) males weigh less than 12.0 kg. (Thus, 95% of three-year-old males weigh
more than 12.0 kg.)
The most common percentiles are quartiles.
The first quartile, $Q_1$, is equivalent to $P_{25}$.
The second quartile, $Q_2$, is equivalent to $P_{50}$.
The third quartile, $Q_3$, is equivalent to $P_{75}$.
Example: 200 people in survey. How many are in each quartile?
Example 2: Determine the quartiles of the following data (weights of 11 randomly selected elementary
school students, in pounds).
46 45 58 71 42 66 72 42 61 49 80
Find the median by ordering the data (the 50th percentile):
42 42 45 46 49 58 61 66 71 72 80
Thus $Q_2 = 58$
The 25th percentile will be the middle of the lower half (not including 58): 42 42 45 46 49
Thus $Q_1 = 45$
The 75th percentile will be the middle of the upper half (not including 58): 61 66 71 72 80
Thus $Q_3 = 71$
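The quartile method used in Example 2 (median of the lower half and median of the upper half, leaving out the overall median when the count is odd) can be sketched in Python. The function name `quartiles` is a hypothetical helper; other software may use a different quartile convention and give slightly different values on other data sets.

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 by the median-of-halves method described above."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it is in neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Example 2: weights (in pounds) of 11 elementary school students
weights = [46, 45, 58, 71, 42, 66, 72, 42, 61, 49, 80]
q1, q2, q3 = quartiles(weights)

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```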
B2. Interquartile Range (used to measure typical spread for non-symmetric graphs)
For example 2: A typical elementary student weighed between 45 and 71 pounds which was a 26 pound
spread. This captures the middle 50% of the population. (The typical student weighed 58 pounds).
The interquartile range, IQR, is the measure of dispersion (spread) that is based on quartiles. The range and
standard deviation are affected by extreme values. The IQR is resistant to extreme values.
Example 1: One variable that is measured by online homework systems is the amount of time a student
spends on homework for each section of the text. The following is a summary of the number
of minutes a student spends for each section of the text for the fall 2007 semester in a
College Algebra class at Joliet Junior College.
$Q_1 = 42$
$Q_2 = 51.5$
$Q_3 = 72.5$
(a) Provide an interpretation of $Q_1 = 42$ minutes.
25% of students spend less than 42 min on homework per section. (75% of students spend more than 42
min.)
* Provide an interpretation of $Q_2$: 50% of the students spend less than 51.5 min on hw (50% of the students
spend more than 51.5 min on hw).
* Provide an interpretation of $Q_3$: 75% of the students spend less than 72.5 min on hw (25% of the students
spend more than 72.5 min on hw).
(b) Determine and interpret the interquartile range.
IQR: $Q_3 - Q_1 = 72.5 - 42 = 30.5$ minutes
The typical spread of the middle 50% of the class was 30.5 minutes. OR The middle 50% of the class had a
typical range of 30.5 minutes.
(c) Do you believe that the distribution of time spent doing homework is skewed or
symmetric? Why?
The spread from Q1 to Q2 is 51.5 - 42 = 9.5 min.
The spread from Q2 to Q3 is 72.5 - 51.5 = 21 min.
Since the spread from Q2 to Q3 is greater, the distribution is skewed right.
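The reasoning in parts (b) and (c) can be written as a short sketch. The helper `shape_from_quartiles` is hypothetical and applies only the rule of thumb used in this example (compare the two halves of the box); it is not a formal test of skewness.

```python
def shape_from_quartiles(q1, q2, q3):
    """Rule of thumb: compare the spread Q1-to-Q2 with the spread Q2-to-Q3."""
    lower_spread = q2 - q1
    upper_spread = q3 - q2
    if upper_spread > lower_spread:
        return "skewed right"
    if upper_spread < lower_spread:
        return "skewed left"
    return "roughly symmetric"

# Homework-time quartiles (in minutes) from Example 1
q1, q2, q3 = 42, 51.5, 72.5

iqr = q3 - q1
lower_spread = q2 - q1
upper_spread = q3 - q2
shape = shape_from_quartiles(q1, q2, q3)

print(f"IQR = {iqr} min, lower spread = {lower_spread}, upper spread = {upper_spread}")
print("shape:", shape)
```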
Note: Use median with IQR for any distribution that is not symmetric
Objective C: Outliers
Extreme observations are called outliers; they may occur by error in the measurement or during data
entry or from errors in sampling.
Chapter 3.5 The Five-Number Summary and Boxplots
Objective A : The Five-Number Summary
Example 1: The number of chocolate chips in each of 22 randomly selected name-brand cookies
was recorded. The results are shown below.
28 23 28 31 27 29 24 19 26 23 21 25 22 23
21 23 33 28 33 21 30 50
Find the five-number summary.
Minimum: 19
Maximum: 50
Find Q1, M (Q2), and Q3 using the method stated before, by first putting the data in order:
19 21 21 21 22 23 23 23 23 24 25 26 27 28 28 28 29 30 31 33 33 50
Q1 = 23, Q2 = 25.5, Q3 = 29
The five-number summary is:
19  23  25.5  29  50
Check for outliers (unusually low or high values for the given data):
IQR = Q3 - Q1 = 29 - 23 = 6
Outliers on the lower end: Q1 - 1.5*IQR = 23 - 1.5(6) = 23 - 9 = 14. Any data value below 14 is an outlier.
Outliers on the higher end: Q3 + 1.5*IQR = 29 + 1.5(6) = 29 + 9 = 38. Any data value above 38 is an outlier.
There is one outlier at 50.
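The five-number summary and the outlier check can be computed directly from the values exactly as listed above. This is a sketch under stated assumptions: `quartiles` is a hypothetical helper implementing the median-of-halves method from Section 3.4, and the fences are Q1 - 1.5*IQR and Q3 + 1.5*IQR as in the outlier check.

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 by the median-of-halves method from Section 3.4."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Chocolate chip counts, copied as listed in Example 1
chips = [28, 23, 28, 31, 27, 29, 24, 19, 26, 23, 21, 25, 22, 23,
         21, 23, 33, 28, 33, 21, 30, 50]

q1, q2, q3 = quartiles(chips)
five_number = (min(chips), q1, q2, q3, max(chips))

# Fences: anything outside them is flagged as an outlier
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in chips if x < lower_fence or x > upper_fence]

print("five-number summary:", five_number)
print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("outliers:", outliers)
```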
Objective B: Boxplots
The five-number summary can be used to construct a graph called the boxplot.
Objective C: Using a Boxplot to describe the shape of a distribution
Example: using boxplot c:
How many dogs lived between 7 and 11 years if 80 dogs were part of the study? Is there enough information?
Yes: 20 dogs since each quartile contains one fourth of the data.
Example 1: Use the side-by-side boxplots shown to answer the questions that follow. The boxplots
show the amount spent on a haircut separated by genders. Let x represent the data for males and y the data
for females.
(a) To the nearest integer, what is the median for the males?
The median spent on a haircut for males is $16.
(b) To the nearest integer, what is the first quartile for the females?
The 25th percentile for amount spent on a haircut for females is $22.
(c) Which variable has more dispersion (spread)? Why?
Females because the IQR is greater. Therefore the middle 50% has more
variability.
(d) Does the variable x have any outliers? If so, what is the value of the outlier?
Yes, at $27. (This means that it was unusual for a male to spend $27 on a haircut.)
(e) Describe the shape of the variable y . Support your position.
Left skewed. The spread between Q1 & Q2 is greater than the spread between Q2 & Q3.
Summary of notations:
$\mu$: mean of the entire population
$\bar{x}$: mean of a random sample (of the entire population)
$\sigma$: standard deviation of the population
$s$: standard deviation of the sample
$x$: an individual response (ex: $x_1$ is the first measurement or response, $x_2$ would be the second measurement or
response, $x_3$ would be the third, etc.)