Chapter 3: Statistics for describing, exploring, and comparing data
Chapter Problem: A common belief is that women talk more than men.
Is that belief founded in fact, or is it a myth?
Data Set 8 in Appendix B includes word counts from different sample groups. The results
provided by researchers show a sample mean of 15,668.5 words per day for males (sample
size 186) and 16,215.0 words per day for females (sample size 210).
Chapter 3-1 Overview and 3-2.1 Measures of Center
3-1 Overview:
• Discuss the characteristics of a data set: CVDOT (center, variation, distribution, outliers, and changes over time)
• Statistics:
  • Descriptive statistics: summarize or describe the characteristics of a data set;
    Chapters 2 and 3 discuss the fundamental principles of descriptive statistics
  • Inferential statistics: use sample data to make inferences (or generalizations)
    about a population; the focus of later chapters
3-2 Measures of Center:
• Objective: discuss the characteristic “center”, the mean and median of a data
  set, and the effect of outliers on the mean and median
• Definition: a measure of center is a value at the center or middle of a
  data set
• Definition: the arithmetic mean of a set of values is the sum of the data
  values divided by the total number n of values. We call it the “mean” (meaning the
  arithmetic mean), and the sample mean is denoted by x-bar:
  $\bar{x} = \frac{\sum x}{n}$ (the sum of all sample values divided by the number of sample values)
3-2.2 Mean, Median & Notations
The mean is relatively reliable: because it takes every data value into account,
means of samples drawn from the same population don't vary as much as other
measures of center. But the mean is sensitive to every value, especially
outliers (a disadvantage).
• $\sum$ (Greek capital letter sigma) denotes the sum of a set of values
• x is the variable for an individual data value
• n is the number of values in a sample
• N is the number of values in a population
• $\bar{x} = \frac{\sum x}{n}$ is the mean of a set of sample values
• $\mu = \frac{\sum x}{N}$ is the mean of all values in a population
• The median $\tilde{x}$ (x-tilde) of a data set is the measure of center that is the middle
  value when the original data values are arranged in ascending order
  • If the number of values is odd, the median is exactly the middle of the list
  • If the number of values is even, the median is the mean (average) of the two
    middle numbers
  • Note: the median is not affected by outliers
3-2.3 Example of mean and median

Monitoring Lead in Air
• Data taken are 5.4, 1.10, 0.42, 0.73, 0.48, 1.10 (μg/m³)
• The mean is 1.538 μg/m³
• To find the median, first arrange the data in ascending order
• 0.42, 0.48, 0.73, 1.10, 1.10, 5.4 (n = 6): the median is
  (0.73 + 1.10)/2 = 0.915 μg/m³
• 0.42, 0.48, 0.66, 0.73, 1.10, 1.10, 5.4 (n = 7): the median is
  0.73 μg/m³
• These examples show that the median is not sensitive to extreme values;
  the median is often used for data sets with a few extremes.
Example 2
• Find the mean and median for the word counts from 5 men: 27,531; 15,684;
  5,638; 27,997; and 25,433. (mean 20,456.6; median 25,433)
• Find the mean and median after including an additional count of 8,077 words.
  (mean 18,393.3; median 20,558.5)
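The answers to Example 2 can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable names are illustrative and do not come from the text.

```python
from statistics import mean, median

# Word counts from the 5 men in Example 2
word_counts = [27531, 15684, 5638, 27997, 25433]
mean_5 = mean(word_counts)
median_5 = median(word_counts)

# Include the additional count of 8,077 words
word_counts_6 = word_counts + [8077]
mean_6 = mean(word_counts_6)
median_6 = median(word_counts_6)

print(round(mean_5, 1), median_5)  # 20456.6 25433
print(round(mean_6, 1), median_6)  # 18393.3 20558.5
```

With an even number of values, `median` averages the two middle values, which matches the rule stated earlier.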
3-2.3 Mode and Midrange








• The mode of a data set is the value that occurs most frequently
• A data set is bimodal when two values occur with the same
  greatest frequency; each one is a mode
• A data set is multimodal when more than two values occur
  with the same greatest frequency; each one is a mode
• A data set has no mode when no value is repeated
• The midrange is the measure of center calculated as the average of the
  minimum and maximum values of the data set: Midrange = (max + min)/2
• Example from lead in the air: Midrange = (5.4 + 0.42)/2 = 2.910 μg/m³
• Example from word counts: 27,531; 15,684; 5,638; 27,997; 25,433;
  Midrange = ?
• The midrange is rarely used, since it uses only the max and min and is too sensitive to
  those extremes. It is easy to compute and is one of the values that can define the
  “center” of a data set. The midrange is different from the mean.
• Round-off rule: carry one more decimal place than is present in the
  original set of values
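The mode and midrange rules above can be sketched in Python as follows. The helper names `modes` and `midrange` are hypothetical; `modes` returns an empty list when no value repeats, following the "no mode" convention in these notes rather than the behavior of any particular library.

```python
from collections import Counter

def modes(values):
    """Return the mode(s) in ascending order, or [] when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # the "no mode" case
    return sorted(v for v, c in counts.items() if c == top)

def midrange(values):
    """Average of the minimum and maximum values."""
    return (max(values) + min(values)) / 2

lead = [5.4, 1.10, 0.42, 0.73, 0.48, 1.10]
word_counts = [27531, 15684, 5638, 27997, 25433]

print(modes(lead), round(midrange(lead), 3))       # [1.1] 2.91
print(modes(word_counts), midrange(word_counts))   # [] 16817.5
```

A list with two entries corresponds to a bimodal data set, and a list with more than two entries to a multimodal one.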
3-2.4 Examples

Example 1: Comparison of ages of Best Actresses and Best Actors

            Best Actresses   Best Actors
Mean        35.7             43.9
Median      33.5             42
Mode        35               41 and 42
Midrange    50.5             52.5

• What do these data tell us? The measures of center suggest that Best Actresses are
  younger than Best Actors.
• In Ch 9, we will discuss methods for determining whether such
  differences are statistically significant.
Example 2: Find the mean, median, mode, and midrange of the randomly
selected cans of Coke: 12.3, 12.1, 12.2, 12.3, 12.2 (mean 12.22; median 12.20;
modes 12.2 and 12.3; midrange 12.2)
3-2.5 More Examples

The following examples identify a major reason why the mean and median
are not always meaningful statistics that accurately and effectively serve as measures
of center.
• Find the mean and median of the following:
  • Zip codes: 12601, 90210, 02116, 76177, 19102
  • Ranks of stress level from different jobs: 2 3 1 7 9
  • Surveyed respondents coded as 1 (for Democrat), 2 (for Republican), 3 (for
    liberal), 4 (for conservative), or 5 (for others)
• Mean salary of secondary school teachers: the mean salaries from the 50 states are
  $37,200, $49,400, $40,000, …, $37,800. Their mean is $42,210, but is this the mean
  salary of all secondary school teachers in the U.S.? Why or why not?
• The calculation above does not take into account the number of
  secondary school teachers in each state. The mean for all secondary school
  teachers in the U.S. is $45,200, not $42,210.
3-2.6 Mean from a Frequency Distribution

The mean from a frequency distribution is defined as (Formula 3-2)
$\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$
where x is the class midpoint and f is the frequency for that class.
Example:

Age of actress   Frequency f   Class midpoint x   f · x
21 - 30          28            25.5               714
31 - 40          30            35.5               1065
41 - 50          12            45.5               546
51 - 60          2             55.5               111
61 - 70          2             65.5               131
71 - 80          2             75.5               151
Totals           76                               2718

$\bar{x} = \frac{\sum (f \cdot x)}{\sum f} = \frac{2718}{76} = 35.8$
You can use a TI calculator, with the midpoints in L1 and the frequencies f in L2, then calculate
3-2.7 Weighted Mean

Weighted mean: used when the values have
different degrees of importance (weights w); it is defined as
Formula 3-3: $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
Example: mean of 3 test scores (85, 90, 75)
• Test 1: 20%
• Test 2: 30%
• Test 3: 50%
$\bar{x} = \frac{\sum (w \cdot x)}{\sum w} = \frac{(20 \times 85) + (30 \times 90) + (50 \times 75)}{20 + 30 + 50} = \frac{8150}{100} = 81.5$
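Formula 3-3 can be sketched as a small Python function. The name `weighted_mean` is hypothetical. The same helper also reproduces the frequency-distribution mean of Formula 3-2, since that is a weighted mean with class midpoints as the values and class frequencies as the weights.

```python
def weighted_mean(values, weights):
    """Formula 3-3: sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Three test scores weighted 20%, 30%, and 50%
test_mean = weighted_mean([85, 90, 75], [20, 30, 50])

# Ages of actresses: class midpoints weighted by class frequencies
midpoints = [25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
frequencies = [28, 30, 12, 2, 2, 2]
age_mean = weighted_mean(midpoints, frequencies)

print(test_mean)           # 81.5
print(round(age_mean, 1))  # 35.8
```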

3-2.8 Best measure of Center?
Use the mean most often, then the median.
• Mean: find the sum of all values, then divide by the number of values
  • (1) Sensitive to extreme values
  • (2) Sample means tend to vary less than other measures of center
• Median: sort the data
  • Odd number of values: the median is the value in the exact middle
  • Even number of values: add the 2 middle numbers, then divide by 2
  • The median is a good choice if there are some extreme values
• Mode: the value that occurs most frequently
  • The mode is good for data at the nominal level of measurement
• Midrange: (max + min)/2
  • The midrange is rarely used
3-2.9 Skewness of data




• Definition: a distribution of data is skewed if it is not
  symmetric and extends more to one side than the other
• Skewed to the left (negatively skewed; has a longer left tail): the
  mean and the median are typically to the left of the mode
• Skewed to the right (positively skewed; has a longer right tail):
  the mean and the median are typically to the right of the mode
• A distribution is symmetric (zero skewness) if the left half of
  its histogram is roughly a mirror image of its right half;
  mean = median = mode
3-2.10 Summary and Homework #8 Section 3-2




• We have learned the types of measures of center of a data set,
  the mean from a frequency distribution, the weighted mean, the best
  measure of center, and skewness.
• The mean and median cannot always be used to identify the
  shape of the distribution.
• Question: What is the highest point of the graph, whether the distribution is
  symmetric or skewed?
• HW #8, Pages 94-96, #5-17 odd, 33-34 (answer for 34: mean
  = 84.8, grade = B)
3-3.1 Measures of Variation

Objective 3-3:
• Learn the characteristic of variation, including measures such as the standard deviation and variance
• Learn how to use a data set to find the values of the range and standard deviation
• Interpret values of standard deviations and understand the reasoning behind the standard deviation
Definition: The range of a set of data is the difference between the
maximum value and the minimum value
• Range = (maximum value) – (minimum value)
• Not as useful as other measures of variation, since it depends only on the max and min
  (i.e. it is extremely sensitive to extreme values)
Definition: The standard deviation of a set of sample values is a measure
of variation of values about the mean.
• Formula 3-4 (sample standard deviation):
  $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
• Formula 3-5 (shortcut formula for the sample standard deviation; the formula used by
  calculators and computer programs):
  $s = \sqrt{\frac{n \sum (x^2) - (\sum x)^2}{n(n - 1)}}$
3-3.2 Properties of Standard Deviation (S.D.)




• S.D. is a measure of variation of all values from the mean
• S.D. is always ≥ 0. It is zero only when all of the data values are the same;
  larger S.D. values indicate a greater amount of variation
• S.D. can increase dramatically with the inclusion of one or more outliers
• The units of the S.D. s are the same as the units of the original values, e.g.
  minutes, feet, pounds, etc.
Procedure for finding the standard deviation with Formula 3-4:
1. Compute the mean $\bar{x}$
2. Subtract the mean from each value: $x - \bar{x}$
3. Square each difference: $(x - \bar{x})^2$
4. Add all the squares: $\sum (x - \bar{x})^2$
5. Divide the total by n – 1 (i.e. one less than the number of values): $\frac{\sum (x - \bar{x})^2}{n - 1}$
6. Find the square root of the result of step 5: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Example: Find the standard deviation of the following waiting times.
Those times (in minutes) are 1, 3, 14.
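The six-step procedure can be traced in Python for the waiting times 1, 3, and 14. This is a minimal sketch: the function names are hypothetical, and the shortcut version follows Formula 3-5. Both are compared against `statistics.stdev`, which also divides by n – 1.

```python
from math import isclose, sqrt
from statistics import stdev

def sample_sd_steps(values):
    """Sample standard deviation, following the six steps of Formula 3-4."""
    n = len(values)
    mean = sum(values) / n                    # step 1: compute the mean
    deviations = [x - mean for x in values]   # step 2: subtract the mean
    squares = [d ** 2 for d in deviations]    # step 3: square each difference
    total = sum(squares)                      # step 4: add the squares
    variance = total / (n - 1)                # step 5: divide by n - 1
    return sqrt(variance)                     # step 6: take the square root

def sample_sd_shortcut(values):
    """Sample standard deviation using the shortcut Formula 3-5."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

waiting_times = [1, 3, 14]  # minutes
s = sample_sd_steps(waiting_times)
print(s)                                  # 7.0
print(sample_sd_shortcut(waiting_times))  # 7.0
```

For these data the mean is 6, the squared deviations are 25, 9, and 64, their sum is 98, and 98/2 = 49, so s = 7.0 minutes.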
3-3.3 Standard Deviation of a Population


• The standard deviation of a population, σ, uses the same formula as the
  sample standard deviation, except that it uses the population mean μ and
  divides by N (N is the population size)
• The population standard deviation is defined as
  $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
• Since we generally deal with sample data, we
  usually use Formula 3-4:
  $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
3-3.4 Variance of a Sample and Population






• Definition: the variance of a set of values is a measure of variation equal
  to the square of the standard deviation
  • Sample variance: s², the square of the sample standard deviation s
  • Population variance: σ², the square of the population standard deviation σ
• s² is called an unbiased estimator of the population variance σ²
• Example: Use the waiting times of 1 min, 3 min, and 14 min to find the
  variance of the waiting times
• Q: Is a smaller variance better?
• Note: The units of the variance are the squares of the units of the original data set (e.g. min²);
  the standard deviation has the same units as the data set
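The claim that s² is an unbiased estimator of σ² can be illustrated with a small enumeration. The setup below is an assumption made for illustration and is not taken from the text: the waiting times 1, 3, and 14 are treated as an entire population, and every possible sample of size 2 is drawn with replacement. The sketch uses `statistics.variance` (divides by n – 1) and `statistics.pvariance` (divides by N).

```python
from itertools import product
from math import isclose
from statistics import pvariance, variance

population = [1, 3, 14]  # minutes, treated here as a whole population (assumption)
sigma_squared = pvariance(population)  # population variance, divides by N

# All 9 possible samples of size 2, drawn with replacement
samples = list(product(population, repeat=2))
sample_variances = [variance(sample) for sample in samples]  # each divides by n - 1
mean_of_s_squared = sum(sample_variances) / len(sample_variances)

print(round(sigma_squared, 3))      # 32.667
print(round(mean_of_s_squared, 3))  # 32.667
```

The sample variances average out to the population variance, which is what "unbiased" means here. If 1, 3, and 14 are instead treated as a sample, `variance(population)` gives s² = 49 min².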
Notations:
• s = sample standard deviation
• s² = sample variance
• σ = population standard deviation
• σ² = population variance
• SD = standard deviation
• VAR = variance
Round-off rule: carry one more decimal place than the original set of
data for the final answer (don't round off in the middle of a calculation)
3-3.5 Why learn Standard Deviation and interpretation

Standard deviation measures the variation among values
• A small standard deviation means values are close together, while a large standard
  deviation means values are spread farther apart
Range Rule of Thumb to estimate standard deviation
• It is used to roughly estimate the standard deviation, based on the principle that for
  most data sets the vast majority (such as 95%) of sample values lie within 2 standard
  deviations of the mean
• s ≈ Range/4 (range = max – min)
If the standard deviation s is known, we can use it to estimate the min and
max “usual” sample values
• Minimum “usual” value = (mean) – 2 * (standard deviation)
• Maximum “usual” value = (mean) + 2 * (standard deviation)
Example 1: IQ test, mean is 100, S.D. is 15; min is 70, max is 130
• Interpretation: Based on these results, we expect that typical IQ scores fall between 70
  and 130. How do you interpret IQ 65 or IQ 135?
Example 2: Pulse rates of women: mean is 76, S.D. is 12.5; min is 51
beats/min, and max is 101 beats/min
• Interpretation:
  • Typical pulse rates of women are from 51 to 101 beats/min
  • A pulse rate of 110 would be unusual, since 110 is outside those limits
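The range rule of thumb is simple enough to express directly. This is a minimal sketch; the helper names are hypothetical, and it reproduces the IQ and pulse-rate limits from the two examples.

```python
def usual_range(mean, sd):
    """Minimum and maximum "usual" values: mean -/+ 2 standard deviations."""
    return mean - 2 * sd, mean + 2 * sd

def is_unusual(value, mean, sd):
    """True when the value lies outside the "usual" limits."""
    low, high = usual_range(mean, sd)
    return value < low or value > high

def estimate_sd(minimum, maximum):
    """Rough estimate of the standard deviation: s is about range / 4."""
    return (maximum - minimum) / 4

print(usual_range(100, 15))        # (70, 130)
print(usual_range(76, 12.5))       # (51.0, 101.0)
print(is_unusual(110, 76, 12.5))   # True
print(is_unusual(65, 100, 15), is_unusual(135, 100, 15))  # True True
```

By this rule, IQ scores of 65 and 135 both fall outside the usual limits of 70 and 130.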
3-3.6 Empirical (or 68-95-99.7) Rule for data with normal distribution

Empirical rule: a data set having a distribution that is approximately bell-shaped
has the following properties:
• About 68% of all values fall within 1 standard deviation of the mean, i.e. between (mean – s) and (mean + s)
• About 95% of all values fall within 2 standard deviations of the mean, i.e. between (mean – 2s) and (mean + 2s)
• About 99.7% of all values fall within 3 standard deviations of the mean, i.e. between (mean – 3s) and (mean + 3s)

Example: IQ scores have a mean of 100 and a standard deviation of 15. What percentage of IQ scores are
between 70 and 130?
3-3.7 Chebyshev's Theorem

The proportion (or fraction) of any data set lying within K
standard deviations of the mean is always at least 1 – 1/K²,
where K > 1
• When K = 2, at least 3/4 (75%) of all values lie
  within 2 standard deviations of the mean
• When K = 3, at least 8/9 (or 89%) of all values lie
  within 3 standard deviations of the mean
Example: IQ scores (mean 100, S.D. 15)
• At least 75% of people have IQ between 70 and 130 (2 SD from mean)
• At least 89% of people have IQ between 55 and 145 (3 SD from mean)
Comparison: IQ scores using the empirical rule
• About 68% of people have IQ between 85 and 115 (1 SD from mean)
• About 95% of people have IQ between 70 and 130 (2 SD from mean)
• About 99.7% of people have IQ between 55 and 145 (3 SD from mean)
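The two rules can be compared side by side in code. This is a minimal sketch: the function name `chebyshev_lower_bound` and the `EMPIRICAL` lookup table are hypothetical names introduced for the illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires K > 1")
    return 1 - 1 / k ** 2

# Approximate proportions from the empirical rule (bell-shaped data only)
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

mean, sd = 100, 15  # IQ scores
for k in (2, 3):
    low, high = mean - k * sd, mean + k * sd
    print(f"K={k}: {low} to {high}, "
          f"at least {chebyshev_lower_bound(k):.0%} for any data set, "
          f"about {EMPIRICAL[k]:.1%} if bell-shaped")
```

Chebyshev's bound holds for any distribution, which is why it is so much weaker than the empirical rule's figures for bell-shaped data.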
3-3.8 Coefficient of Variation in Different Populations

The coefficient of variation (CV) for a set of nonnegative sample or
population data, expressed as a percent, describes the standard
deviation relative to the mean:
• Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
• Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
Example: Heights and Weights of Men (Data Set 1 in Appendix B)
• For heights: mean $\bar{x}$ = 68.34 in, s.d. = 3.02 in
• For weights: mean $\bar{x}$ = 172.55 lb, s.d. = 26.33 lb
• We want to compare the variation among heights to the variation among weights.
• Heights: CV = 4.42%; weights: CV = 15.26%
• Interpretation: The heights have considerably less variation than the weights.
  Does that make sense?
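The CV values for heights and weights can be reproduced with a one-line function. This is a minimal sketch; the name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation relative to the mean, expressed as a percent."""
    return sd / mean * 100

cv_height = coefficient_of_variation(3.02, 68.34)     # inches
cv_weight = coefficient_of_variation(26.33, 172.55)   # pounds

print(f"{cv_height:.2f}% {cv_weight:.2f}%")  # 4.42% 15.26%
```

Because the units cancel in the ratio, the two percentages can be compared even though heights are in inches and weights are in pounds.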
3-3.9 Summary and HW #9 (3-3)

Range rule of thumb
• s ≈ range/4
• Min (usual) = mean – 2*s
• Max (usual) = mean + 2*s
Empirical rule (only applicable to normal (bell-shaped)
distributions)
• 68% within 1 S.D. means data values are between (mean – s) and (mean + s)
• 95% within 2 S.D. means data values are between (mean – 2s) and (mean + 2s)
• 99.7% within 3 S.D. means data values are between (mean – 3s) and (mean + 3s)
Chebyshev's Theorem gives a lower bound on the proportion of values within K
standard deviations of the mean (applicable to any data set, but of limited usefulness)
HW #9 (3-3) Pp. 110-113, #5-11 odd, 17, 31-35 odd
3-4.1 Measure of Relative Standing and Boxplots


Objective: To learn measures that can be used to compare values from
the same or different data sets (z scores, quartiles, percentiles, and boxplots), and to be able to convert data values to z scores
Definition: a z score (or standardized value) is the number of standard
deviations that a given value x is above or below the mean
• For sample data: $z = \frac{x - \bar{x}}{s}$
• For population data: $z = \frac{x - \mu}{\sigma}$
• Round z to two decimal places
Example: A man is 76.in tall and weighs 237.1 lb. Find the z scores for his height
and weight (mean height = 68.34 in, s.d. = 3.02 in; mean weight = 172.55 lb,
s.d. = 26.33 lb). The z scores for this man are 2.60 for height and 2.45 for weight.
• Interpretation: The man's height is 2.60 standard deviations above the mean height,
  and his weight is 2.45 standard deviations above the mean weight. The height is more
  extreme than the weight.
Example: Lyndon Johnson, 75” (mean 71.5”, S.D. 2.1”); Shaquille O'Neal,
85” (mean 80”, S.D. 3.3”)
• Interpretation: ?
3-4.2 z-score and unusual values

• Using the range rule of thumb, a value is “unusual” if it is more than 2 S.D. from the
  mean:
  • min (usual) = mean – 2*s, and
  • max (usual) = mean + 2*s
• Using z scores, a value is “unusual” if its z score is less than –2 or greater than +2
  • Ordinary values: –2 ≤ z score ≤ +2
  • Unusual values: z score < –2 or z score > +2
• z scores are measures of position relative to the mean: a z score of +2 means 2
  standard deviations above the mean; a z score of –3 means 3 standard deviations
  below the mean
• Example: Over the past 30 years, heights of basketball players at Newport
  University have a mean of 74.5 in and a s.d. of 2.5 in. The latest recruit has a
  height of 79.0 in.
  • Find the z score
  • Is the height of 79.0 in unusual among the heights of players over the past 30
    years? Why or why not?
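The z score calculation and the "unusual" test can be sketched together. The function names are hypothetical; the inputs are the Newport University figures, the weight from the earlier height-and-weight example, and the Johnson and O'Neal heights.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def is_unusual(z):
    """A value is unusual when its z score is below -2 or above +2."""
    return z < -2 or z > 2

z_recruit = z_score(79.0, 74.5, 2.5)
print(round(z_recruit, 2), is_unusual(z_recruit))  # 1.8 False

z_weight = z_score(237.1, 172.55, 26.33)
print(round(z_weight, 2), is_unusual(z_weight))    # 2.45 True

z_johnson = z_score(75, 71.5, 2.1)
z_oneal = z_score(85, 80, 3.3)
print(round(z_johnson, 2), round(z_oneal, 2))      # 1.67 1.52
```

A z score of 1.8 is within 2 standard deviations of the mean, so by this criterion the recruit's height is not unusual.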
3-4.3 Percentiles

Definition:
• Percentiles are one type of quantiles (or fractiles), which partition data into groups with roughly the same
  number of values in each group
• Percentiles are measures of location. There are 99 percentiles, denoted by P1, P2,
  P3, …, P99, which divide a set of data into 100 groups with about 1% of the values in each
  group
• Example: the 50th percentile, denoted by P50, has about 50% of the data values below it
  and about 50% of the data values above it; the 50th percentile is the same as the median.
Formula (round the result to the nearest whole number):
$\text{percentile of value } x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$
Another way is $L = \frac{k}{100} \cdot n$ (or $k = \frac{L}{n} \cdot 100$), where
• n = total number of values in the data set
• k = percentile being used
• L = locator that gives the position of a value
Find the percentile for the value of $29 million in Table 3-4 in the text (values in millions of dollars, sorted):
4.5   5     6.5   7     20    20    29    30    35    40
40    41    50    52    60    65    68    68    70    70
70    72    74    75    80    100   113   116   120   125
132   150   160   200   225
3-4.4 Converting from kth percentile to the corresponding data value
1. Sort the data (arrange the data from low to high)
2. Compute L = (k/100)n, where n = number of values and k = percentile
3. Is L a whole number?
   • Yes: the value of Pk is the mean of the values at the Lth location and the (L+1)th
     location
   • No: change L by rounding it up to the next whole number; the value of Pk
     is the Lth value, counting from the lowest
Example: Find the 17th percentile of the previous data set.
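The locator procedure can be sketched in Python using the 35 values from the previous data set. The function names are hypothetical, and `kth_percentile` assumes k is an integer from 1 to 99 so that the whole-number test can be done with integer arithmetic. Other software may use different percentile conventions and give slightly different answers.

```python
def kth_percentile(data, k):
    """Value of P_k using the locator L = (k/100) * n, for integer k in 1..99."""
    ordered = sorted(data)
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)  # L = whole + remainder/100
    if remainder == 0:
        # L is a whole number: average the Lth and (L+1)th values (1-based)
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up to whole + 1 and take that value (1-based)
    return ordered[whole]

def percentile_of_value(data, x):
    """Percent of values less than x, rounded to a whole number."""
    below = sum(1 for v in data if v < x)
    return round(below / len(data) * 100)  # note: round() sends exact .5 ties to even

budgets = [4.5, 5, 6.5, 7, 20, 20, 29, 30, 35, 40,
           40, 41, 50, 52, 60, 65, 68, 68, 70, 70,
           70, 72, 74, 75, 80, 100, 113, 116, 120, 125,
           132, 150, 160, 200, 225]

print(kth_percentile(budgets, 17))       # 20
print(percentile_of_value(budgets, 29))  # 17
```

For P17, L = (17/100)(35) = 5.95, which rounds up to 6, and the 6th sorted value is 20. In the other direction, 6 of the 35 values are less than 29, and 6/35 × 100 rounds to 17.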
3-4.5 Example: Setting Speed Limits




The table lists recorded speeds (in miles/hour) randomly selected
on Highway 405:
68  68  72  73  65  74  73  72  68  65
65  73  66  71  68  74  66  71  65  73
59  75  70  56  66  75  68  75  62  72
60  73  61  75  58  74  60  73  58  75
• Find the 85th percentile of the listed speeds
• Given that speed limits are usually rounded to a multiple of 5,
  what speed limit is suggested by these data? Explain your
  choice
• Does the existing speed limit on Highway 405 conform to the
  85th percentile rule (i.e. the speed limit is set so that 85% of
  drivers are at or below the speed limit)?
3-4.6 Quartiles

Definition
• Quartiles are measures of location, denoted by Q1, Q2, and Q3, which divide a
  data set into four groups with about 25% of the values in each group (percentiles
  divide the data into 100 groups)
• The three quartiles Q1, Q2, Q3 divide the sorted data values into 4 equal parts
  • Q1 (first quartile) separates the bottom 25% from the rest
  • Q2 (second quartile) separates the bottom 50% from the rest (Q2 is also the
    median and the 50th percentile)
  • Q3 (third quartile) separates the bottom 75% from the rest
• Interquartile range (IQR) = Q3 – Q1
  • Semi-interquartile range = (Q3 – Q1)/2
  • Midquartile = (Q3 + Q1)/2
  • 10-90 percentile range = P90 – P10
• Example: find the values of min, Q1, Q2, Q3, max, and IQR of the movie budgets
3-4.8 5-Number Summary and Boxplot

• Definition: for a set of data, the 5-number summary consists of
  (1) minimum, (2) Q1, (3) Q2 (the median), (4) Q3, (5) maximum
• A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a
  line extending from the min to the max and a box with lines drawn at Q1, the
  median, and Q3
• A boxplot is useful for revealing
  • The center of the data
  • The spread of the distribution of the data
  • The presence of outliers
• An outlier is a value that is located far away from almost all of the other values;
  an extreme value that falls outside the general pattern
  • A data value x is an outlier if x – Q3 > 1.5 × IQR or Q1 – x > 1.5 × IQR
• An outlier can have a dramatic effect on
  • The mean
  • The standard deviation
  • The scale of the histogram, so that the true nature of the distribution is totally obscured
3-4.9 Procedure for Constructing a Boxplot, HW #10

1. Find the 5-number summary: min, Q1, median, Q3, and max
2. Construct a scale with values that include the min and max data values
3. Construct a box (rectangle) extending from Q1 to Q3, and draw a line in the box at
   the median value
4. Draw lines extending outward from the box to the min and max data values
Example on the board: use the 5-number summary of the movie budgets
Boxplots don't show as much detail as histograms or stem-and-leaf plots, so they are not the best
choice when dealing with a single data set; but they are great for comparing different
data sets (use the same scale)
Do women really talk more than men? Use the 5-number summaries:

         Min     Q1      Q2      Q3      Max
Men      695     10009   14290   20565   47016
Women    1674    11010   15917   20571   40055

Read Table 3-3, Comparison of word counts of men and women, for the “mean”,
“median”, “midrange”, “range”, and “S.D.”
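The 1.5 × IQR outlier criterion can be applied directly to the two 5-number summaries above. This is a minimal sketch; the helper name `outlier_fences` is hypothetical. Only the minimum and maximum of each group can be tested, since the individual word counts are not listed here.

```python
def outlier_fences(q1, q3):
    """Limits beyond which a value is an outlier: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 5-number summaries: (min, Q1, Q2, Q3, max)
summaries = {
    "Men": (695, 10009, 14290, 20565, 47016),
    "Women": (1674, 11010, 15917, 20571, 40055),
}

for group, (minimum, q1, q2, q3, maximum) in summaries.items():
    low, high = outlier_fences(q1, q3)
    print(group, "IQR:", q3 - q1,
          "| min is an outlier:", minimum < low,
          "| max is an outlier:", maximum > high)
```

Both maximum word counts lie above their upper limits (36,399 for men and 34,912.5 for women), so each group contains at least one high outlier under this criterion.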
Example: Here are measured reaction times (in seconds) in a test of driving skills:
2.4, 2.5, 2.8, 2.0, 2.4, 2.9, 3.2, 3.5, 2.7, 2.7, 2.8, 2.6; find the five-number summary.
HW #10: P. 127-128, #1, 5-7, 9, 13, 15, 19, 23, 27
Review (1)

Ch 1: You should be able to do the following
• Distinguish between a population and a sample, and between a parameter and a statistic
• Understand the importance of good experimental design, including the control of
  variable effects, replication, and randomization
• Recognize the importance of good sampling methods in general, and of a simple random sample
  in particular
• Understand that if sample data are not collected in an appropriate way, the data may be
  completely useless
Ch 2: You should be able to do the following
• Summarize data by constructing a frequency distribution or relative frequency
  distribution
• Visually display the nature of the distribution by constructing a histogram or relative
  frequency histogram
• Investigate important characteristics of a data set by creating visual displays, such as a
  frequency polygon, dotplot, stemplot, Pareto chart, pie chart, scatterplot, or time-series
  graph
• Understand and interpret those results
Review (3) - Continued

You should be able to
• Calculate measures of center by finding the mean and
  median
• Calculate measures of variation by finding the standard
  deviation, variance, and range
• Understand and interpret the standard deviation by using
  tools such as the range rule of thumb
• Compare individual values by using z scores, quartiles, or
  percentiles, and identify outliers
• Investigate and explore the spread of data, the center of the
  data, and the range of values by constructing a boxplot
• Understand and interpret those results, such as the standard
  deviation as a measure of how much data vary, and use the
  standard deviation to distinguish between values that are
  usual and unusual
Examples
Always consider certain key factors:
• Context of the data
• Source of the data
• Sampling method
• Measures of center
• Measures of variation
• Distribution
• Outliers
• Changing patterns over time
• Conclusion
• Practical implications