Download Class Notes - Wells` Math Classes

What is statistics?
Statistics is the science of:

Collecting information

Organizing and summarizing the
information collected

Analyzing the information collected in
order to draw conclusions
Two types of Statistics
Descriptive Statistics
Organizing and summarizing the
information collected.
Inferential Statistics
Drawing conclusions from the information
collected.
Chapter 1
Exploring Data
Lesson 1-1, Displaying
Distributions with Graphs
Bar Graphs and Pie Charts
Data
Individuals are objects described by a set of data.
Individuals may be people, animals, or things.
A variable is any characteristic of an individual. A
variable can take different values for different
individuals.
Types of Variables
A categorical variable allows for classification of
individuals based on some attribute or characteristic.
A quantitative variable provides numerical measures of
individuals.
Example, Page 7, #1.2
Data from a medical study contain values of many
variables for each of the people who were subjects
of the study. Which of the following variables are
categorical and which are quantitative?
Example, Page 7, #1.2
a) Gender (female or male): categorical
b) Age (years): quantitative
c) Race (Asian, black, white or other): categorical
d) Smoker (yes or no): categorical
e) Systolic blood pressure (millimeters of mercury): quantitative
f) Level of calcium in blood (micrograms per milliliter): quantitative
Distribution
The distribution of a variable tells us what values the
variable takes and how often it takes each value.
Displaying Distributions

Categorical Variables



Bar Graphs
Pie Charts
Quantitative Variables



Dotplots
Stemplots
Histograms
Example – Page 11, #1.6
In 1997 there were 92,353 deaths from accidents in the
United States. Among these were 42,340 deaths from
motor vehicle accidents, 11,858 from falls, 10,163 from
poisoning, 4051 from drowning, and 3601 from fires.
A) Find the percent of accidental deaths from each of
these causes, rounded to the nearest percent. What
percent of accidental deaths were due to other causes?
Example – Page 11, #1.6
Accidents        Number    Percentage
Motor Vehicle    42,340    42,340/92,353 = 45.8% ≈ 46%
Falls            11,858    13%
Poisoning        10,163    11%
Drowning         4051      4%
Fires            3601      4%
Other Causes     20,340    22%
Total            92,353    100%
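The percentages in this example can be checked with a short script. This is a minimal sketch in Python; the names `deaths` and `percents` are illustrative and not part of the original notes.

```python
# Accidental deaths in the United States, 1997 (counts from the example).
total = 92_353
deaths = {
    "Motor Vehicle": 42_340,
    "Falls": 11_858,
    "Poisoning": 10_163,
    "Drowning": 4_051,
    "Fires": 3_601,
}

# "Other causes" is whatever remains after the listed causes.
deaths["Other Causes"] = total - sum(deaths.values())

# Percent of all accidental deaths, rounded to the nearest percent.
percents = {cause: round(100 * n / total) for cause, n in deaths.items()}

for cause, pct in percents.items():
    print(f"{cause:<14}{deaths[cause]:>8,}{pct:>5}%")
```

Rounding each category separately does not always give a total of exactly 100%; here it happens to.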
Example – Page 11, #1.6
B) Make a well-labeled bar graph of the distribution of
causes of accidental deaths. Be sure to include an
“other causes” bar.
[Bar graph: US Accidental Deaths, 1997. Vertical axis: Percentage of Accidental Deaths (0 to 50); horizontal axis: Causes of Accidental Deaths (MV, Falls, Poison, Drown, Fires, OC).]
Example – Page 11, #1.6
C) Would it also be correct to use a pie chart to display
these data? If so, construct the pie chart. If not,
explain why not.
Yes, since categories represent parts of a whole.
Example – Page 11, #1.6
Accidents    Number    Percentage    Pie Chart
MV           42,340    46%           0.46 × 360° = 165.6° ≈ 166°
Falls        11,858    13%           47°
Poisoning    10,163    11%           40°
Drowning     4051      4%            14°
Fires        3601      4%            14°
OC           20,340    22%           79°
Total        92,353    100%          360°
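The sector angles for the pie chart follow from giving each category the same share of 360° as its share of the whole. A minimal Python sketch, assuming the rounded percentages from the example as input (the variable names are illustrative):

```python
# Rounded percentages for each cause of accidental death (sum to 100).
percents = {
    "MV": 46,
    "Falls": 13,
    "Poisoning": 11,
    "Drowning": 4,
    "Fires": 4,
    "OC": 22,
}

# Each sector's central angle is its percentage of a full 360-degree circle.
angles = {cause: round(pct * 360 / 100) for cause, pct in percents.items()}

for cause, angle in angles.items():
    print(f"{cause:<10}{percents[cause]:>4}%{angle:>5} degrees")
print("Total angle:", sum(angles.values()))
```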
Example – Page 11, #1.6
[Pie chart: US Accidental Deaths, 1997. Motor Vehicle 46%, Falls 13%, Poisoning 11%, Drowning 4%, Fires 4%, Other Causes 22%.]
Lesson 1-1, Displaying
Distributions with Graphs
Dot Plots and Stem Leaf Plots
Overall Pattern of Distribution
(Quantitative Variables)
Center: divides the data in half
Spread: smallest to largest values
Shape: skewness of the data
Outlier: data that fall outside of the pattern
Example – Page 16, #1.8
Are you driving a gas guzzler? Table 1.3 displays the highway
gas mileage for 32 model year 2000 midsize cars.
A). Make a dot plot of these data.
[Dot plot: Highway Gas Mileage, with the axis marked from 21 to 33.]
Example – Page 16, #1.8
B) Describe the shape, center, and spread of the distribution
of gas mileages. Are there any potential outliers?
The shape of the distribution is skewed to the left, with a
major peak at 28 and a minor peak at 24. The spread is
relatively narrow (21 to 32 mpg). The two observations at
21 and the observation at 32 appear to be outliers. The center
is 28 mpg.
Example – Page 35, #1.28
In 1798 the English scientist Henry Cavendish measured
the density of the earth by careful work with a torsion balance.
The variable recorded was the density of the earth as a
multiple of the density of water. Here are Cavendish’s 29
measurements:
5.50 5.61 4.88 5.07 5.26 5.55 5.36 5.29 5.58 5.65
5.57 5.53 5.62 5.29 5.44 5.34 5.79 5.10 5.27 5.39
5.42 5.47 5.63 5.34 5.46 5.30 5.75 5.68 5.85
Present these measurements graphically in a stemplot.
Discuss the shape, center, and spread of the distribution.
Are there any outliers? What is your estimate of the density of
the earth based on these measurements?
Example – Page 35, #1.28
Density of the Earth
48 | 8
49 |
50 | 7
51 | 0
52 | 6 7 9 9
53 | 0 4 4 6 9
54 | 2 4 6 7
55 | 0 3 5 7 8
56 | 1 2 3 5 8
57 | 5 9
58 | 5
Key: 48|8 = 4.88
The shape of the distribution is
roughly symmetric, with one
possible outlier at 4.88 that is
somewhat low. The spread is
from 4.88 to 5.85. The
center of the distribution is
between 5.4 and 5.5. Based on
the plot, we would estimate the
Earth’s density to be about
halfway between 5.4 and 5.5.
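A stemplot like this one can be built by splitting each measurement into a stem (all digits but the last) and a leaf (the last digit). The sketch below is one possible Python version; the function name `stemplot` is hypothetical, and it assumes every value is recorded to two decimal places, as Cavendish's measurements are.

```python
from collections import defaultdict
from statistics import median

cavendish = [
    5.50, 5.61, 4.88, 5.07, 5.26, 5.55, 5.36, 5.29, 5.58, 5.65,
    5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39,
    5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.75, 5.68, 5.85,
]

def stemplot(values):
    """Return stemplot rows; the hundredths digit of each value is the leaf."""
    rows = defaultdict(list)
    for v in values:
        # Work in whole hundredths to avoid floating-point surprises.
        stem, leaf = divmod(round(v * 100), 10)
        rows[stem].append(leaf)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in sorted(rows[stem]))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

lines = stemplot(cavendish)
print("\n".join(lines))
print("Median:", median(cavendish))
```

The median printed at the end gives a numerical check on the "center between 5.4 and 5.5" reading of the plot.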
Lesson 1-1 Displaying
Distributions with Graphs
Histograms and Relative
Frequency Graphs
Histogram and categories
[Two histograms, each with n = 92 students.
Age of Spring 1998 Stat 250 Students (horizontal axis: Age (in years); vertical axis: Frequency (Count)): too few categories.
GPAs of Spring 1998 Stat 250 Students (horizontal axis: GPA): too many categories.]
Example – Histogram
Suppose you are considering investing in a Roth IRA.
You collect the data in the table below, which represent the
three-year rate of return (in percent) for 40 small
capitalization growth mutual funds.
27.4 12.7 22.6 32.1 18.2 23.7 18.4 14.7
16.7 28.5 29.6 47.7 32.0 14.7 21.3 37.0
10.8 22.2 11.6 10.9 25.5 12.8 27.0 19.2
24.1 18.4 45.9 18.4 23.7 31.1 19.6 18.5
35.9 17.4 16.6 23.3 38.1 21.9 18.5 29.1
Example – Histogram
A) Construct a histogram to display these data. Record
your class intervals and counts
Step 1 – Find the class intervals
Locate the smallest number (10.8) and the largest
number (47.7)
Lower class limit will be 10.0 with a class width of 5
Example – Histogram
3-yr Rate of Return    Frequency
10.0 – 14.9            7
15.0 – 19.9            11
20.0 – 24.9            8
25.0 – 29.9            6
30.0 – 34.9            3
35.0 – 39.9            3
40.0 – 44.9            0
45.0 – 49.9            2
Total                  40
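The class counts can be reproduced by assigning each return to a class of width 5 starting at 10.0. This is a minimal Python sketch of that tally; the variable names are illustrative, and the indexing assumes every value falls between 10.0 and 49.9, as it does here.

```python
returns = [
    27.4, 12.7, 22.6, 32.1, 18.2, 23.7, 18.4, 14.7,
    16.7, 28.5, 29.6, 47.7, 32.0, 14.7, 21.3, 37.0,
    10.8, 22.2, 11.6, 10.9, 25.5, 12.8, 27.0, 19.2,
    24.1, 18.4, 45.9, 18.4, 23.7, 31.1, 19.6, 18.5,
    35.9, 17.4, 16.6, 23.3, 38.1, 21.9, 18.5, 29.1,
]

lower_limit = 10.0   # lower class limit of the first class
width = 5.0          # class width
n_classes = 8        # classes 10.0-14.9 through 45.0-49.9

counts = [0] * n_classes
for r in returns:
    # Index of the class that contains r.
    counts[int((r - lower_limit) // width)] += 1

for i, count in enumerate(counts):
    low = lower_limit + i * width
    print(f"{low:4.1f} - {low + width - 0.1:4.1f}  {count:>3}")
print("Total", sum(counts))
```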
Example – Histogram
Step 2 – Graph it using the TI
Stat Plot (2nd Y=), Window, Graph, Trace
Example – Histogram
[Histogram: 3-Year Rate of Return of Mutual Funds. Horizontal axis: Rate of Return (10 to 50); vertical axis: Frequency (0 to 12).]
Example – Histogram
B) Describe the distribution of 3 – Year Rate of Return.
The shape of the distribution is
skewed to the right with the
center at the 15.0% – 19.9% class.
There is a gap at 40.0% – 44.9%, and the two
observations in the 45.0% – 49.9% class are
potential outliers. The spread is from 10% to 50%.
Shape of a Distribution




Uniform (symmetric)
Bell-shaped (Symmetric)
Skewed Right
Skewed Left
Example – Relative Cumulative Frequency
Class          Freq   Relative Frequency   Cumulative Frequency   Relative Cumulative Frequency
10.0 – 14.9    7      7/40 = 0.175         7                      0.175
15.0 – 19.9    11     0.275                7 + 11 = 18            0.175 + 0.275 = 0.45
20.0 – 24.9    8      0.20                 18 + 8 = 26            0.45 + 0.2 = 0.65
25.0 – 29.9    6      0.15                 32                     0.8
30.0 – 34.9    3      0.075                35                     0.875
35.0 – 39.9    3      0.075                38                     0.95
40.0 – 44.9    0      0                    38                     0.95
45.0 – 49.9    2      0.05                 40                     1
Total          40     1
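The relative, cumulative, and relative cumulative frequencies all follow from the class counts. A minimal Python sketch, assuming the eight class frequencies of the mutual fund example as input (the variable names are illustrative):

```python
from itertools import accumulate

# Class frequencies for 10.0-14.9, 15.0-19.9, ..., 45.0-49.9.
freq = [7, 11, 8, 6, 3, 3, 0, 2]
n = sum(freq)

# Relative frequency: share of all observations in each class.
rel_freq = [f / n for f in freq]

# Cumulative frequency: running total of the class counts.
cum_freq = list(accumulate(freq))

# Relative cumulative frequency: running total as a share of n.
rel_cum_freq = [c / n for c in cum_freq]

for f, rf, cf, rcf in zip(freq, rel_freq, cum_freq, rel_cum_freq):
    print(f"{f:>3} {rf:>6.3f} {cf:>3} {rcf:>6.3f}")
```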
Example – Relative Cumulative Frequency
Class          Freq   Rel Freq   Cum Freq   Rel Cum Freq
20.0 – 24.9    8      0.2        26         0.65
45.0 – 49.9    2      0.05       40         1



26 of the 40 mutual funds had a 3-year rate of
return of 24.9% or less.
65% of the mutual funds had a 3-year rate of return of
24.9% or less.
A mutual fund with a 3-year rate of return of 45% or
higher is outperforming 95% of its peers.
Example – Relative Cumulative Frequency
L3 – Upper Class Limits
L4 – Relative Cumulative Frequency
[Relative cumulative frequency graph: 3 Year Rate of Return for Small Capitalization Mutual Funds. Horizontal axis: Rate of Return (10, then the upper class limits 14.9 to 49.9); vertical axis: Cumulative Relative Frequency (0 to 1.2).]
Lesson 1-2 Describing
Distributions with Numbers
Measuring the center
Mean
To find the sample mean, add up all of the observations and
divide by the number of observations.
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$
The mean is affected by unusual values called outliers.
Median
The median is the midpoint of a distribution, such
that half the observations are smaller and the other
half are larger.

Another name for the 50th percentile

Is not affected by unusual values called outliers
Center and Distribution
Mean < Median: Skewed Left
Mean = Median: Symmetric
Mean > Median: Skewed Right
Measuring the Spread





Range
Quartiles
Boxplots
Standard Deviation
Variance
Range
The range is the difference between the largest
and smallest observations.
$R = x_{\max} - x_{\min}$
Quartiles
Quartiles divide the observations into fourths, or four equal
parts.
Smallest data value | 25% of the data | Q1 | 25% of the data | Q2 | 25% of the data | Q3 | 25% of the data | Largest data value
Interquartile Range (IQR)
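The interquartile range (IQR) is the distance between
the first and third quartiles.
$IQR = Q_3 - Q_1$
Outliers
An observation is a potential outlier if it falls above the
upper cutoff or below the lower cutoff.
Upper Cutoff: $Q_3 + 1.5(IQR)$
Lower Cutoff: $Q_1 - 1.5(IQR)$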
Five Number Summary





Smallest observation (minimum)
Quartile 1
Quartile 2 (median)
Quartile 3
Largest observation (maximum)
Example – Page 41, #1.32
The Survey of Study Habits and Attitudes (SSHA) is a
psychological test that evaluates college students’
motivation, study habits, and attitudes toward school.
A private college gives the SSHA to a sample of 18 of
its incoming first-year women students. Their scores are
154 109 137 115 152 140 154 178 101
103 126 126 137 165 165 129 200 148
Example – Page 41, #1.32
A) Make a stemplot of these data. The overall shape of the
distribution is irregular, as often happens when only a few
observations are available. Are there any potential
outliers? About where is the center of the distribution (the
score with half the scores above it and half below)?
What is the spread of the scores (ignoring any outliers)?
STAT, EDIT, 1:Edit
Example – Page 41, #1.32
10 | 1 3 9
11 | 5
12 | 6 6 9
13 | 7 7
14 | 0 8
15 | 2 4 4
16 | 5 5
17 | 8
18 |
19 |
20 | 0
200 is a potential outlier. The center
is approximately 140. The spread
(excluding 200) is 178 – 101 = 77.
Example – Page 41, #1.32
B) Find the mean.
$\bar{x} = 141.056$
C) Find the median of these scores. Which is larger: the
median or the mean? Explain why.
Median = 138.5
The mean is larger than the median because of the outlier
at 200, which pulls the mean towards the long right
tail of the distribution.
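The pull of the outlier on the mean can be seen by computing both measures with and without the score of 200. This is a small Python sketch using the standard `statistics` module; dropping the outlier is done here only for illustration and is not part of the textbook exercise.

```python
from statistics import mean, median

women = [154, 109, 137, 115, 152, 140, 154, 178, 101,
         103, 126, 126, 137, 165, 165, 129, 200, 148]

# Same scores with the potential outlier removed, for comparison only.
without_outlier = [score for score in women if score != 200]

print("All 18 scores: mean", round(mean(women), 3), "median", median(women))
print("Without 200:   mean", round(mean(without_outlier), 3),
      "median", median(without_outlier))
```

Removing one score moves the mean by about 3.5 points but the median by only 1.5, which is the sense in which the median resists outliers.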
Example – Page 47, #1.36
Here are the scores on the Survey of Study Habits and
Attitudes (SSHA) for 18 first-year college women:
154 109 137 115 152 140 154 178 101 103 126 126 137 165 165 129 200 148
and for 20 first-year college men:
108 140 114 91 180 115 126 92 169 146 109 132 75 88 113 151 70 115 187 104
A) Make side-by-side boxplots to compare the distributions.
Example – Page 47, #1.36
[Side-by-side boxplots: SSHA Scores for Women and Men, on a common axis from 0 to 200.]
Example – Page 47, #1.36
B) Compute the numerical summaries for these two
distributions.
         Mean      Min    Q1     Median    Q3     Max
Women    141.06    101    126    138.5     154    200
Men      121.25    70     98     114.5     143    187
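These summaries can be reproduced in Python. The sketch below assumes the convention that Q1 and Q3 are the medians of the lower and upper halves of the sorted data (excluding the overall median when n is odd); other software may use a different quartile rule and give slightly different values. The function names are hypothetical.

```python
from statistics import mean, median

women = [154, 109, 137, 115, 152, 140, 154, 178, 101,
         103, 126, 126, 137, 165, 165, 129, 200, 148]
men = [108, 140, 114, 91, 180, 115, 126, 92, 169, 146,
       109, 132, 75, 88, 113, 151, 70, 115, 187, 104]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]            # observations below the median position
    upper = xs[n // 2 + n % 2:]     # observations above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def outliers(data):
    """Return observations outside the 1.5 * IQR cutoffs."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

for name, scores in [("Women", women), ("Men", men)]:
    print(name, round(mean(scores), 2), five_number_summary(scores),
          "outliers:", outliers(scores))
```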
Example – Page 47, #1.36
C) Write a paragraph comparing SSHA scores for men and
women.
All the displays and descriptions reveal that women
generally score higher than men. The men’s scores
(IQR = 45) are more spread out than the women’s
(IQR = 28), even if we don’t ignore the outlier. The shapes
of the distributions are reasonably similar, with each
displaying right skewness.
Describing Distributions with
Numbers
Standard Deviation and
Variance
Standard Deviation
The standard deviation (s) measures the average distance
of observations from their mean.
Example, Page 52, #1.40
The levels of various substances in the blood influence
our health. Here are measurements of the level of
phosphate in the blood of a patient, in milligrams
of phosphate per deciliter of blood, made on 6
consecutive visits to a clinic.
5.6 5.2 4.6 4.9 5.7 6.4
Example, Page 52, #1.40
5.6 5.2 4.6 4.9 5.7 6.4
A. Find the mean.
$\bar{x} = \frac{5.6 + 5.2 + 4.6 + 4.9 + 5.7 + 6.4}{6} = \frac{32.4}{6} = 5.4$
Example, Page 52, #1.40
[Number line from 4.5 to 6.5 marking $\bar{x} = 5.4$, the observation x = 4.6 (deviation −0.8), and the observation x = 6.4 (deviation 1).]
Example, Page 52, #1.40
Observation $x_i$    Deviation $x_i - \bar{x}$    Squared deviation $(x_i - \bar{x})^2$
5.6                  5.6 − 5.4 = 0.2              (0.2)² = 0.04
5.2                  5.2 − 5.4 = −0.2             0.04
4.6                  4.6 − 5.4 = −0.8             0.64
4.9                  4.9 − 5.4 = −0.5             0.25
5.7                  5.7 − 5.4 = 0.3              0.09
6.4                  6.4 − 5.4 = 1                1
                     SUM = 0                      SUM = 2.06
Example – Page 52, #1.40
B) Find the standard deviation (s) from its definition.
$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{1}{6-1}(2.06) = \frac{1}{5}(2.06) = 0.412$
$s = \sqrt{s^2} = \sqrt{0.412} = 0.64187 \approx 0.6419$
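The same calculation can be followed step by step in code. This is a minimal Python sketch that mirrors the definition (deviations, squared deviations, division by n − 1, square root) and then compares the result with `statistics.stdev`; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

phosphate = [5.6, 5.2, 4.6, 4.9, 5.7, 6.4]

n = len(phosphate)
x_bar = sum(phosphate) / n                      # sample mean
deviations = [x - x_bar for x in phosphate]     # these sum to 0 (up to rounding)
squared = [d ** 2 for d in deviations]          # squared deviations
variance = sum(squared) / (n - 1)               # s^2, dividing by n - 1
s = sqrt(variance)                              # s

print("mean:", round(x_bar, 4))
print("sum of squared deviations:", round(sum(squared), 4))
print("variance:", round(variance, 4))
print("standard deviation:", round(s, 4))
print("statistics.stdev:", round(stdev(phosphate), 4))
```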
Example – Page 52, #1.40
C) Use your TI-83 to find $\bar{x}$ and s. Do the results agree with
part B?
Standard Deviation






Standard deviation (s) is the square root of the
variance (s² )
Units are the original units
Measures spread about the mean and should only
be used when the mean is chosen as the center
If s = 0 then there is no spread. Observations are
the same value
As s gets larger the observations are more spread
out.
Highly affected by outliers. Best for symmetric data
Variance



Variance (s²) measures the average
squared deviation of observations from the
mean
Units are squared
Highly affected by outliers.
How to Choose?

Skewed Distribution or Outliers


Five number summary
Symmetric Distribution or No Outliers


Mean
Standard Deviation
Homework


HW, page 52, #1.41, 1.43
Read pages 53 – 61
Linear Transformation
A linear transformation changes the original variable
x into the new variable $x_{new}$ given by an equation of the
form
$x_{new} = a + bx$
Adding the constant a shifts all values of x upward or
downward by the same amount.
Multiplying by the positive constant b changes the size
of the unit of measurement.
Example – Page 56, #1.44
Maria measures the lengths of 5 cockroaches that
she finds at school. Here are her results in inches
1.4 2.2 1.1 1.6 1.2
A. Find the mean and standard deviation.
Example – Page 56, #1.44
B) Maria’s science teacher is furious to discover that
she has measured the cockroaches’ lengths in inches
rather than centimeters. (There are 2.54 cm in 1 inch.)
Find the mean and standard deviation of the 5
cockroaches in centimeters.
$\bar{x} = 1.5$ in, so $\bar{x}_{new} = 1.5(2.54) = 3.81$ cm
$s = 0.436$ in, so $s_{new} = 0.436(2.54) = 1.107$ cm
Example – Page 56, #1.44
C) Considering the 5 cockroaches that Maria found
as a small sample from the population of all
cockroaches at her school, what would you
estimate as the average length of the population
of cockroaches? How sure of your estimate
are you?
The average cockroach length can be estimated
as the mean length of the 5 sampled cockroaches,
1.5 inches. This is a questionable estimate,
because the sample is so small.
Example – Page 63, #1.56
A change of units that multiplies each unit by b, such
as the change $x_{new} = 0 + 2.54x$ from inches x to centimeters
$x_{new}$, multiplies our usual measures of spread by b. This
is true of the IQR and standard deviation. What happens
to the variance when we change units this way?
Variance is changed by a factor of 2.54² = 6.4516
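The effect of a change of units on the mean, standard deviation, and variance can be checked numerically. The sketch below converts Maria's five cockroach lengths from inches to centimeters with b = 2.54 and compares each summary before and after; the variable names are illustrative.

```python
from statistics import mean, stdev, variance

inches = [1.4, 2.2, 1.1, 1.6, 1.2]
b = 2.54                                  # centimeters per inch
cm = [b * x for x in inches]              # x_new = 0 + 2.54x

# The mean and standard deviation are multiplied by b ...
print("mean:", round(mean(inches), 3), "in ->", round(mean(cm), 3), "cm")
print("s:   ", round(stdev(inches), 3), "in ->", round(stdev(cm), 3), "cm")

# ... while the variance is multiplied by b squared.
ratio = variance(cm) / variance(inches)
print("variance ratio:", round(ratio, 4), "vs b**2 =", round(b ** 2, 4))
```

The variance scales by b² because it is built from squared deviations, each of which is multiplied by b before squaring.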
Homework


HW, Page 56, #1.45
HW, Page 63, #1.55
1-2 Describing Distributions
with Numbers.
Comparing Distributions
Example – Page 59, #1.48
The table below gives the distribution of grades earned
by students taking the Calculus AB and Statistics
exams in 2000.
Grade         5        4        3        2        1
Calculus      16.8%    23.2%    23.5%    19.6%    16.8%
Statistics    9.8%     21.5%    22.4%    20.5%    25.8%
A. Make a graphical display to compare the AP exam
grades for Calculus AB and Statistics.
[Bar graph: 2000 AP Exam. Horizontal axis: Grade on Exam (1 to 5); vertical axis: % of Students Earning Grade (0.0 to 30.0); separate bars for Calculus AB and Statistics.]
Example – Page 59, #1.48
B) Write a few sentences comparing the two
distributions of exam grades. Do you now know which
exam is easier? Why or why not?
The distributions are very similar for grades 2, 3, and 4.
The major difference occurs for grades 1 and 5, with a
larger proportion of Statistics students receiving a grade
of 1 and a smaller proportion of Statistics students receiving
a grade of 5.
This suggests that the Statistics exam is harder in the
sense that students are more likely to get a poor grade
on the Statistics exam than on the Calculus AB exam.
Example – Page 63, 1.54
The mean $\bar{x}$ and standard deviation s measure the center
and spread but are not a complete description of a
distribution. Data sets with different shapes can have
the same mean and standard deviation. To demonstrate
this fact, use your calculator to find $\bar{x}$ and s for the
following two small data sets. Then make a stemplot
of each and comment on the shape of each distribution.
Data A: 9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74
Data B: 6.58 5.76 7.71 8.84 8.47 7.04 5.25 5.56 7.91 6.89 12.50
Example – Page 63, 1.54
Set A
3 | 1
4 | 7
5 |
6 | 1
7 | 2
8 | 1 1 7 7
9 | 1 1 2
Key: 3|1 = 3.1
Set B
5 | 2 5 7
6 | 5 8
7 | 0 7 9
8 | 4 8
9 |
10 |
11 |
12 | 5
Example – Page 63, 1.54
The means and standard deviations are basically the same. Set A is
skewed to the left, while Set B has a high outlier.
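A quick numerical check of this claim, as a minimal Python sketch using the standard `statistics` module (the names `data_a` and `data_b` are illustrative):

```python
from statistics import mean, stdev

data_a = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
data_b = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]

# Mean and sample standard deviation of each set, rounded for display.
for name, data in [("Set A", data_a), ("Set B", data_b)]:
    print(name, "mean =", round(mean(data), 2), "s =", round(stdev(data), 2))
```

Both sets give nearly identical summaries even though their stemplots look quite different, which is why a graph should accompany the numbers.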
Homework


HW, Page 59, #1.47, #1.49
HW, Page 62, #1.51, #1.57