Download Numerically Summarizing Data

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Time series wikipedia , lookup

Transcript
Chapter 3: Numerically Summarizing Data
Section 3.1: Measure of Central Tendency
Objectives: Students will be able to:
Determine the arithmetic mean of a variable from raw data
Determine the median of a variable from raw data
Determine the mode of a variable from raw data
Use the mean and the median to help identify the shape of a distribution
Vocabulary:
Parameter – a descriptive measure of a population
Statistic – a descriptive measure of a sample
Arithmetic Mean – sum of all values of a variable in a data set divided by the number of observations
Population Arithmetic Mean – (μ) summation ( ∑ ) of all values of a variable from the population divided by the total
number in the population (N)
Sample Arithmetic Mean – (x‾) summation of all values of a variable from a sample divided by the total number of
observation from the sample (n)
Median – (M) the value of the variable that lies in the middle of the data when arranged in ascending order (if there is a
even number of observations, then the median is the average of observations either side of the middle (½) value
Mode – the most frequently observed value of the variable
Resistant – extreme values do not affect the statistic
Key Concepts: Three characteristics used to describe distributions (from histograms or similar charts)
1. Shape
2. Center
3. Spread
Distributions Parameters
Skewed Left:
Mean substantially smaller
than median
Mean < Median < Mode
Symmetric:
Mean roughly equal to median
Mean ≈ Median ≈ Mode
Skewed Right:
Mean substantially greater
than median
Mean > Median > Mode
Measure of
Central Tendency
Computation
Interpretation
μ = (∑xi ) / N
Mean
Median
Mode
Center of gravity
x‾ = (∑xi) / n
Arrange data in ascending
order and divide the data set
into half
Tally data to determine most
frequent observation
Divides into
bottom 50% and top 50%
Most frequent observation
When to use
Data are quantitative and
frequency distribution is
roughly symmetric
Data are quantitative and
frequency distribution is
skewed
Data are qualitative or the
most frequent observation is
the desired measure of
central tendency
Chapter 3: Numerically Summarizing Data
Example 1: Which of the following are resistant measures of central tendency:
Mean,
Median or
Mode?
Example 2: Given the following set of data:
70, 56, 48, 48, 53, 52, 66, 48, 36, 49, 28, 35, 58, 62, 45, 60, 38, 73, 45, 51,
56, 51, 46, 39, 56, 32, 44, 60, 51, 44, 63, 50, 46, 69, 53, 70, 33, 54, 55, 52
What is the mean?
What is the median?
What is the mode?
What is the shape of the distribution?
Example 3: Given the following types of data and sample sizes, list the measure of central tendency you would use and
explain why?
Sample of 50
Sample of 200
Hair color
Height
Weight
Parent’s Income
Number of Siblings
Age
Does sample size affect your decision?
Homework: pg : 130-7; 9, 21, 23, 27, 33, 34, 44
Chapter 3: Numerically Summarizing Data
Section 3.2: Measures of Dispersion (Spread)
Objectives: Students will be able to:
Compute the range of a variable from raw data
Compute the variance of a variable from raw data
Computer the standard deviation of a variable from raw data
Use the Empirical Rule to describe data that are bell shaped
Use Chebyshev’s inequality to describe any set of data
Vocabulary:
Range – difference between the smallest and largest data values
Variance – based on the deviation about the mean (how spread out the data is)
Population Variance – ( σ2 ) computed using (∑(xi – μ)2)/N
Sample Variance – ( s2 ) computed using (∑(xi – x‾)2)/((n – 1)
Biased – a statistic that consistently under-estimates or over-estimates a population parameter
Degrees of Freedom – number of observations minus the number of parameters estimated in the computation
Population Standard Deviation – square root of the population variance
Sample Standard Deviation – square root of the sample variance
Key Concepts:
Sample variance is found by dividing by (n – 1) to keep it an unbiased (since we estimate the population mean, μ, by
using the sample mean, x‾) estimator of population variance
The larger the standard deviation, the more dispersion the distribution has
Empirical Rule and Chebyshev’s Inequality
Empirical Rule
Chebyshev’s Inequality
μ ± 3σ
at least 88.9%
μ ± 2σ
at least 75%
μ±σ
Nothing
99.7%
At least (1 – 1/k2)*100%, k>1
within k standard deviations of the mean
95%
68%
34%
0.15%
34%
13.5%
13.5%
2.35%
μ - 3σ
μ - 2σ
2.35%
μ-σ
μ
μ+σ
μ + 2σ
μ + 3σ
0.15%
μ - 3σ
μ - 2σ
μ-σ
μ
μ+σ
μ + 2σ
μ + 3σ
Chapter 3: Numerically Summarizing Data
Example 1: Which of the following measures of spread are resistant?
1.
Range
2.
Variance
3.
Standard Deviation
Example 2: Given the following set of data:
70, 56, 48, 48, 53, 52, 66, 48, 36, 49, 28, 35, 58, 62, 45, 60, 38, 73, 45, 51,
56, 51, 46, 39, 56, 32, 44, 60, 51, 44, 63, 50, 46, 69, 53, 70, 33, 54, 55, 52
1. What is the range?
2. What is the variance?
3. What is the standard deviation?
Example 3: Compare the Empirical Rule and Chebyshev’s Inequality
Empirical Rule
Chebyshev
μ±σ
μ ± 2σ
μ ± 3σ
Homework: pg 148-155: 11, 14, 22, 23, 35, 39, 40, 43, 45, 51
Chapter 3: Numerically Summarizing Data
Section 3.3: Measures of Central Tendency and Dispersion from Grouped Data
Objectives: Students will be able to:
Approximate the mean of a variable from grouped data
Compute the weighted mean
Approximate the variance and standard deviation of a variable from grouped data
Vocabulary:
Weighted Mean – mean of a variable value times its weighted value
Key Concepts:
Use raw data whenever possible
If grouped (summarized data) is the only data available, estimates for mean and standard deviation can still be
obtained
Descriptive Stats from Summary Stats
Class
Midpt
xi
Freq
fi
xi fi
xi - x
(xi – x)²fi
0 – 1.99
1
2
2
-5.1
52.02
2 – 3.99
3
5
15
-3.1
48.05
4 – 5.99
5
6
30
-1.1
7.26
6 – 7.99
7
8
56
0.9
6.48
8 – 9.99
9
9
81
2.9
75.69
Total
Σfi = 30
Σxifi = 184
Σxifi = 184
x = --------------- = 6.1
Σfi = 30
Σ(xi – x)²fi = 189.5
Σ(xi – x)² fi = 189.5
s² = -------------------------- = 6.53
Σfi – 1 = 29
30/2 - 13
n/2 – CF
Median = M = L + ------------ (ci ) = 6 + ------------- (2) = 6 + (2/8) 2 = 6.25
8
f
L is the lower class limit of the class containing the median
n is the total number of data values in the frequency distribution
CF is the cum freq of the class preceding the median class
f is the frequency of the median class
ci is the class width of the median class
Homework: pg 161 - 165: 3, 4, 5, 21, 25
Chapter 3: Numerically Summarizing Data
Section 3.4: Measures of Position
Objectives: Students will be able to:
Determine and interpret z-scores
Determine and interpret percentiles
Determine and interpret quartiles
Check a set of data for outliers
Vocabulary:
Z-Score – the distance that a data value is from the mean in terms of the number of standard deviations
K Percentile – (Pk) divides the lower kth percentile of a set of data from the rest
Quartiles – (Qi) divides the whole data into four (25%) sets of data
Outliers – extreme observations
IQR (Interquartile range) – difference between third and first quartiles (IQR = Q3 – Q1)
Lower fence – Q1 – 1.5(IQR)
Upper fence – Q3 – 1.5(IQR)
Key Concepts:
Data sets should be checked for outliers as the mean and standard deviation are not resistant statistics and any
conclusions drawn from a set of data that contains outliers can be flawed
Fences serve as cutoff points for determining outliers (data values less than lower or greater than upper fence are
considered outliers)
Z-Scores
Population z-Score
Sample z-Score
x–μ
z = -----------σ
z=
x–x
-----------s
Allows comparisons of different distributions. Mean of z is 0
and the standard deviation of z is 1.
Quartiles
Smallest
Data Value
25% of
the data
Median
Q1
Q2
25% of
the data
Example 1: Which player had a better year in 1967?
Carl Yastrzemski
AL Batting Champ 0.326
Roberto Clemente NL Batting Champ 0.357
AL average 0.236
AL stdev 0.01072
NL average 0.249
NL stdev 0.01257
Q3
25% of
the data
Largest
Data Value
25% of
the data
Chapter 3: Numerically Summarizing Data
Example 2: Given the following set of data:
70, 56, 48, 48, 53, 52, 66, 48, 36, 49, 28, 35, 58, 62, 45, 60, 38, 73, 45, 51,
56, 51, 46, 39, 56, 32, 44, 60, 51, 44, 63, 50, 46, 69, 53, 70, 33, 54, 55, 52
What is the median?
What is the Q1?
What is the Q3?
What is the IQR?
What is the upper fence?
What is the lower fence?
Are there any outliers?
Homework: pg 172 - 174: 9-12, 14, 19
Chapter 3: Numerically Summarizing Data
Section 3.5: The Five-Number Summary and Boxplots
Objectives: Students will be able to:
Compute the five-number summary
Draw and interpret boxplots
Vocabulary:
Five-number Summary – the minimum data value, Q1, median, Q3 and the maximum data value
Key Concepts:
Distribution Shape Based on Boxplots:
a. If the median is near the center of the box and each horizontal line is of approximately equal length, then the
distribution is roughly symmetric
b. If the median is to the left of the center of the box or the right line is substantially longer than the left line,
then the distribution is skewed right
c. If the median is to the right of the center of the box or the left line is substantially longer than the right line,
then the distribution is skewed left
Remember identifying a distribution from boxplots or histograms is subjective!
Why Use a Boxplot?
A boxplot provides an alternative to a histogram, a dotplot, and a stem-and-leaf plot. Among the advantages of a
boxplot over a histogram are ease of construction and convenient handling of outliers. In addition, the
construction of a boxplot does not involve subjective judgements, as does a histogram. That is, two individuals
will construct the same boxplot for a given set of data - which is not necessarily true of a histogram, because the
number of classes and the class endpoints must be chosen. On the other hand, the boxplot lacks the details the
histogram provides.
Dotplots and stemplots retain the identity of the individual observations; a boxplot does not. Many sets of data
are more suitable for display as boxplots than as a stemplot. A boxplot as well as a stemplot are useful for
making side-by-side comparisons.
Five-number summary
Min
Q1
smallest
value
Boxplot
M
Q3
Max
First, Second and Third Quartiles
(Second Quartile is the Median, M)
largest
value
Lower
Fence
Upper
Fence
[
]
Smallest Data Value > Lower Fence
(Min unless min is an outlier)
*
Largest Data Value < Upper Fence
(Max unless max is an outlier)
Outlier
Chapter 3: Numerically Summarizing Data
Ex. #1 Consumer Reports did a study of ice cream bars (sigh, only vanilla flavored) in their August 1989 issue. Twentyseven bars having a taste-test rating of at least “fair” were listed, and calories per bar was included. Calories vary quite a bit
partly because bars are not of uniform size. Just how many calories should an ice cream bar contain?
342
377
319
353
295
234
294
286
377
182
310
439
111
201
182
197
209
147
190
151
131
151
Construct a boxplot for the data above.
Ex. #2
The weights of 20 randomly selected juniors at MSHS are recorded below:
121
126
130
132
134
137
141
144
125
128
131
133
135
139
141
147
148
153
205
213
a) Construct a boxplot of the data
b) Determine if there are any mild or extreme outliers.
Ex. #3
The following are the scores of 12 members of a woman’s golf team in tournament play:
89
90
87
95
86
81
111
108
83
88
91
a)
79
Construct a boxplot of the data.
b) Are there any mild or extreme outliers?
c) Find the mean and standard deviation.
d) Based on the mean and median describe the distribution?
Ex. #4 Comparative Boxplots: The scores of 18 first year college women on the Survey of Study Habits and Attitudes (this
psychological test measures motivation, study habits and attitudes toward school) are given below:
154
109
137
115
152
140
154
178
101
103
126
126
137
165
165
129
200
148
The college also administered the test to 20 first-year college men. There scores are also given:
108
140
114
91
180
115
126
92
169
146
109
132
75
88
113
151
70
115
187
104
Compare the two distributions by constructing boxplots. Are there any outliers in either group? Are there any noticeable
differences or similarities between the two groups?
Homework: pg 181-183: 5-7, 15
Chapter 3: Numerically Summarizing Data
Chapter 3: Review
Objectives: Students will be able to:
Summarize the chapter
Define the vocabulary used
Complete all objectives
Successfully answer any of the review exercises
Use the technology to display graphs and plots of data
Vocabulary: None new
Distribution Spread
Distributions Parameters
Standard Deviation
and Variance High
Lots of Data on
extremes
Data very spread out
Skewed Left:
Mean substantially smaller
than median
Mean < Median < Mode
Symmetric:
Mean roughly equal to median
Standard Deviation and
Variance Lower
Some data on extremes
Data more clustered
than first, but still not
tightly packed
Mean ≈ Median ≈ Mode
Skewed Right:
Mean substantially greater
than median
Standard Deviation and
Variance Lowest
Mean > Median > Mode
Fewest data on extremes
Data clustered more
near then mode (tightly
packed)
Homework: pg 186 - 191: 7, 9, 11, 12, 13, 19