Chapter 2
Descriptive Statistics
2-1 Frequency Distributions
2-2 Displaying Data
2-3 Measures of the Center
2-4 Measures of Dispersion
2-5 Measures of Position
2-6 Bivariate Data
1
2-1 Frequency Distributions
• Frequency Distribution
lists classes (or categories) of values, along with frequencies (or counts) of the number of values that fall into each class
2
Qwerty Keyboard Word Ratings
2  2  5  1  2  6  3  3  4  2  4  0  5
7  7  5  6  6  8  10  7  2  2  10  5  8
2  5  4  2  6  2  6  1  7  2  7  2  3
8  1  5  2  5  2  14  2  2  6  3  1  7
3
Frequency Table of Qwerty Word Ratings
Rating     Frequency
0-2        20
3-5        14
6-8        15
9-11       2
12-14      1
4
Lower Class Limits
are the smallest numbers that can actually belong to the different classes
Rating     Lower Class Limit   Frequency
0-2        0                   20
3-5        3                   14
6-8        6                   15
9-11       9                   2
12-14      12                  1
5
Upper Class Limits
are the largest numbers that can actually belong to the different classes
Rating     Upper Class Limit   Frequency
0-2        2                   20
3-5        5                   14
6-8        8                   15
9-11       11                  2
12-14      14                  1
6
Class Midpoints
midpoints of the classes
Rating     Class Midpoint   Frequency
0-2        1                20
3-5        4                14
6-8        7                15
9-11       10               2
12-14      13               1
7
Class Width
is the difference between two consecutive lower class limits or two consecutive class boundaries
Rating     Class Width   Frequency
0-2        3             20
3-5        3             14
6-8        3             15
9-11       3             2
12-14      3             1
8
Guidelines For Frequency
Distributions
1. Be sure that the classes are mutually exclusive.
2. Include all classes, even if the frequency is zero.
3. Try to use the same width for all classes.
4. Select convenient numbers for class limits.
5. Use between 5 and 20 classes.
6. The sum of the class frequencies must equal the
number of original data values.
9
Constructing A Frequency Distribution
1. Decide on the number of classes.
2. Determine the class width by dividing the range by the number of classes (range = highest score - lowest score) and rounding up:
   $\text{class width} = \text{round up of } \dfrac{\text{range}}{\text{number of classes}}$
3. Select for the first lower limit either the lowest score or a convenient value slightly less than the lowest score.
4. Use calculator procedures to construct histogram.
5. List the class limits and frequency.
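The construction steps can be sketched in Python. This is a minimal illustration, not the course's calculator procedure. The function name `frequency_table` is made up for this sketch, and it assumes integer-valued data and that the 52 Qwerty ratings are the values listed earlier, with the trailing 3 read as a slide number.

```python
import math

# Qwerty word ratings as listed on the earlier slide (52 values; assumption:
# the final "3" in the transcript is a slide number, not a rating).
ratings = [2, 2, 5, 1, 2, 6, 3, 3, 4, 2, 4, 0, 5, 7, 7, 5, 6, 6, 8, 10, 7, 2, 2, 10, 5, 8,
           2, 5, 4, 2, 6, 2, 6, 1, 7, 2, 7, 2, 3, 8, 1, 5, 2, 5, 2, 14, 2, 2, 6, 3, 1, 7]

def frequency_table(values, num_classes):
    """Return a list of (lower limit, upper limit, frequency) for integer data."""
    lowest, highest = min(values), max(values)
    # Step 2: class width = range / number of classes, rounded up.
    width = math.ceil((highest - lowest) / num_classes)
    # Guard: widen by one if the last class would not reach the highest value.
    if lowest + width * num_classes - 1 < highest:
        width += 1
    table = []
    for i in range(num_classes):
        lower = lowest + i * width          # Step 3: first lower limit is the lowest score
        upper = lower + width - 1
        freq = sum(lower <= v <= upper for v in values)
        table.append((lower, upper, freq))  # Step 5: list limits and frequency
    return table

table = frequency_table(ratings, 5)
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```

With five classes the sketch reproduces the classes 0-2 through 12-14 used in the slides.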
10
Relative Frequency Distribution
$\text{relative frequency} = \dfrac{\text{class frequency}}{\text{sum of all frequencies}}$
11
Relative Frequency Distribution
Rating     Frequency   Relative Frequency
0-2        20          20/52 = 38.5%
3-5        14          14/52 = 26.9%
6-8        15          15/52 = 28.8%
9-11       2           2/52 = 3.8%
12-14      1           1/52 = 1.9%
Total frequency = 52
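As a quick check of the percentages, here is a short Python sketch that divides each class frequency by the total. The dictionary layout is just one convenient choice, not something the slides prescribe.

```python
# Class frequencies from the Qwerty frequency table.
frequencies = {"0-2": 20, "3-5": 14, "6-8": 15, "9-11": 2, "12-14": 1}

total = sum(frequencies.values())  # sum of all frequencies (52)

# relative frequency = class frequency / sum of all frequencies, as a percent
relative = {rating: round(100 * f / total, 1) for rating, f in frequencies.items()}

for rating, pct in relative.items():
    print(f"{rating}: {pct}%")
```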
12
Cumulative Frequency Distribution
Rating     Frequency   Rating          Cumulative Frequency
0-2        20          Less than 3     20
3-5        14          Less than 6     34
6-8        15          Less than 9     49
9-11       2           Less than 12    51
12-14      1           Less than 15    52
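The cumulative column is a running total of the class frequencies. A minimal Python sketch using the standard library's `itertools.accumulate`:

```python
from itertools import accumulate

class_frequencies = [20, 14, 15, 2, 1]   # classes 0-2, 3-5, 6-8, 9-11, 12-14
upper_cutoffs = [3, 6, 9, 12, 15]        # "less than" values for each class

cumulative = list(accumulate(class_frequencies))  # running totals

for cutoff, total in zip(upper_cutoffs, cumulative):
    print(f"Less than {cutoff}: {total}")
```

The last cumulative frequency should equal the number of data values, which gives a simple check on the table.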
13
Frequency Distributions
Rating     Frequency   Relative Frequency   Cumulative Frequency
0-2        20          38.5%                20 (less than 3)
3-5        14          26.9%                34 (less than 6)
6-8        15          28.8%                49 (less than 9)
9-11       2           3.8%                 51 (less than 12)
12-14      1           1.9%                 52 (less than 15)
14
2-2 Visualizing Data
Histogram
a bar graph in which the horizontal scale
represents classes and the vertical scale
represents frequencies
15
Histogram of Qwerty Word Ratings
Rating     Frequency
0-2        20
3-5        14
6-8        15
9-11       2
12-14      1
16
TI-83 Calculator
Constructing a Frequency Table and Histogram
STEP 1
1. Choose statplot (2nd Y=)
2. Press Enter
3. Plot on should be ON
4. Cursor to Type and choose the last plot type in first
row
5. Press Enter
17
TI-83 Calculator
Constructing a Frequency Table and Histogram
STEP 2 – Enter Data
1. Press Stat
2. Press “1” Edit
3. Enter data in L1
18
TI-83 Calculator
Constructing a Frequency Table and Histogram
STEP 3 – Set window and graph
1. Press Window
2. Set Xmin = 1st lower class limit
3. Set Xscl = class width
4. Set Xmax = largest upper class limit (maybe larger)
5. Set Ymin = negative 5
6. Set Ymax = a little higher than the largest frequency
7. Press Trace
19
Relative Frequency Histogram
of Qwerty Word Ratings
Rating     Relative Frequency
0-2        38.5%
3-5        26.9%
6-8        28.8%
9-11       3.8%
12-14      1.9%
20
Histogram and Relative Frequency Histogram
21
Frequency Polygon
Midpoints
(points plotted at middle top of each bar in the histogram)
22
Cumulative Frequency Histogram of Qwerty Word Ratings
Rating     Cumulative Frequency
0-2        20
0-5        34
0-8        49
0-11       51
0-14       52
23
Ogive
Upper Class Limit
(points are plotted at the upper right corner of each bar in
the histogram)
24
Stem-and-Leaf Plot
Raw Data (Test Grades): 67 72 89 85 88 90 75 89 99 100
Stem | Leaves
6    | 7
7    | 2 5
8    | 5 8 9 9
9    | 0 9
10   | 0
Used to observe the distribution of data
Back to back stem-leaf hw #7 (also a test problem)
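A stem-and-leaf plot for the test grades can be sketched in Python as follows. This assumes whole-number data where the last digit is the leaf, and the helper name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

grades = [67, 72, 89, 85, 88, 90, 75, 89, 99, 100]  # raw test grades from the slide

def stem_and_leaf(values):
    """Group sorted values by stem (all digits but the last) and leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(grades)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```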
25
Dot Plot
See HW problem #5b (will be a test problem)
26
Pareto Chart
Accidental Deaths by Type
used for qualitative data
[Bar chart: vertical axis Frequency, 0 to 45,000; categories Motor Vehicle, Falls, Poison, Drowning, Fire, Ingestion of food or object, Firearms]
See HW problem #4a
27
Pie Chart
Pie charts and Pareto charts can illustrate the same data
Accidental Deaths by Type
Type                          Deaths    Percent
Motor vehicle                 43,500    57.8%
Falls                         12,200    16.2%
Poison                        6,400     8.5%
Drowning                      4,600     6.1%
Fire                          4,200     5.6%
Ingestion of food or object   2,900     3.9%
Firearms                      1,400     1.9%
See HW problem #4d
28
Deaths in British Military Hospitals During the Crimean War
other causes
preventable diseases
wounds
29
Other Graphs
• Boxplots
• Pictographs
• Time-Series Graphs (forecasting)
30
2-3 Measures of Center
a value at the
center or middle
of a data set
31
Definitions
Mean
(Arithmetic Mean)
AVERAGE
the number obtained by adding the values
and dividing the total by the number of
values
32
Notation
$\Sigma$ denotes the addition of a set of values
$x$ is the variable usually used to represent the individual data values
$n$ represents the number of data values in a sample
$N$ represents the number of data values in a population
33
Notation
$\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set of sample values
$\bar{x} = \dfrac{\sum x}{n}$
$\mu$ is pronounced ‘mu’ and denotes the mean of all values in a population
$\mu = \dfrac{\sum x}{N}$
Calculators can calculate the mean of data
34
TI-83 Calculator
Calculate Mean
1. Press Stat
2. Press “1” Edit
3. Enter Data in L1
4. Press Stat
5. Cursor over to CALC
6. Choose the 1-Var stats option
7. Enter 1-Var stats L1 (Press 2nd then 1)
35
TI-83 Calculator
Clearing Data in Column
1. Press Stat
2. Press “4” ClrList
3. Enter ClrList L1,L2,etc
36
Definitions
• Median
the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude
• not affected by an extreme value
37
6.72  3.46  3.60  6.44
3.46  3.60  6.44  6.72 (sorted)
(even number of values)
no exact middle -- shared by two numbers
$\dfrac{3.60 + 6.44}{2} = 5.02$
MEDIAN is 5.02
6.72  3.46  3.60  6.44  26.70
3.46  3.60  6.44  6.72  26.70 (sorted)
(odd number of values)
exact middle
MEDIAN is 6.44
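The two cases (even and odd number of values) can be sketched in Python. The function below is an illustration written for these notes. Python's built-in `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: exact middle
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: shared by two numbers

even_case = median([6.72, 3.46, 3.60, 6.44])        # about 5.02
odd_case = median([6.72, 3.46, 3.60, 6.44, 26.70])  # 6.44, unaffected by 26.70
print(round(even_case, 2), odd_case)
```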
38
Definitions
• Mode
the score that occurs most frequently
Bimodal
Multimodal
No Mode
39
Examples
a. 5 5 5 3 1 5 1 4 3 5
Mode is 5
b. 1 2 2 2 3 4 5 6 6 6 7 9
Bimodal - 2 and 6
c. 1 2 3 6 7 8 9 10
No Mode
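The three examples can be checked with a small Python sketch. The helper `modes` is hypothetical. It returns every value tied for the highest count, and an empty list when no value repeats.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value occurs more than once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value appears once: no mode
    return sorted(v for v, c in counts.items() if c == top)

example_a = modes([5, 5, 5, 3, 1, 5, 1, 4, 3, 5])
example_b = modes([1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9])
example_c = modes([1, 2, 3, 6, 7, 8, 9, 10])
print(example_a, example_b, example_c)
```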
40
Definitions
• Midrange
the value midway between the highest and lowest values in the original data set
$\text{Midrange} = \dfrac{\text{highest score} + \text{lowest score}}{2}$
41
Round-off Rule for
Measures of Center
Carry one more decimal place than is
present in the original set of values
Let’s try #2 on HW
Go to Excel or Calculator
42
Mean from a Frequency Table
use class midpoint of classes for variable x
$\bar{x} = \dfrac{\sum (x \cdot f)}{\sum f}$
x = class midpoint
f = frequency
$\sum f = n$
Let’s try #7 on HW
Go to Excel or Calculator
43
TI-83 Calculator
Calculate Mean from a Frequency Distribution
1. Press Stat
2. Press “1” Edit
3. Enter midpoint in L1
4. Enter frequency in L2
5. Press Stat
6. Cursor over to CALC
7. Choose the 1-Var stats option
8. Enter 1-Var stats L1,L2
44
Weighted Mean
$\bar{x} = \dfrac{\sum (w \cdot x)}{\sum w}$
I use this to calculate your grade in the class, the weights being 20% for HW, 60% for exams and 20% for the final, and x being your total points for each area.
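The weighted mean and the mean from a frequency table share one formula: in the frequency-table case the frequencies act as the weights. A minimal Python sketch follows. The grade scores of 90, 80 and 70 are invented purely for illustration, and the midpoints and frequencies are taken from the Qwerty table.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Mean from a frequency table: class midpoints weighted by frequencies.
midpoints = [1, 4, 7, 10, 13]
frequencies = [20, 14, 15, 2, 1]
table_mean = weighted_mean(midpoints, frequencies)   # 214 / 52

# Course grade: weights 20% HW, 60% exams, 20% final (scores are hypothetical).
scores = [90, 80, 70]
weights = [20, 60, 20]
grade = weighted_mean(scores, weights)

print(round(table_mean, 1), grade)
```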
45
Best Measure of Center
Advantages - Disadvantages
Measure    How often used?   Takes every value into account?   Affected by extreme values?
Mean       Most familiar     Yes                               Yes
Median     Commonly          No                                No
Mode       Sometimes         No                                No
Midrange   Rarely            No                                Yes
46
Definitions
• Symmetric
Data is symmetric if the left half of its histogram is roughly a mirror of its right half.
• Skewed
Data is skewed if it is not symmetric and if it extends more to one side than the other.
47
Skewness
SYMMETRIC: Mode = Mean = Median
Sample data: 2 3 3 4 4 4 5 5 6
Median = 4
Mode = 4
Mean = 4
Frequencies are 1 2 3 2 1
48
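Skewness
[Figure: three distribution curves, each marking the positions of the mean, median, and mode]
SYMMETRIC: Mode = Mean = Median
SKEWED LEFT (negatively): the mean and median typically fall to the left of the mode
SKEWED RIGHT (positively): the mean and median typically fall to the right of the mode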
49
Waiting Times of Bank Customers at Different Banks (in minutes)
Green Valley Bank:   6.5  6.6  6.7  6.8  7.1  7.3  7.4  7.7  7.7  7.7
Big Spenders Bank:   4.2  5.4  5.8  6.2  6.7  7.7  7.7  8.5  9.3  10.0
            Green Valley Bank   Big Spenders Bank
Mean        7.15                7.15
Median      7.20                7.20
Mode        7.7                 7.7
Midrange    7.10                7.10
Same data used in section 2.3 & 2.4 #3
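The four measures of center for both banks can be reproduced with Python's standard `statistics` module. The midrange has no built-in, so it is computed directly, and the helper name `centers` is made up for this sketch.

```python
import statistics

green_valley = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
big_spenders = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def centers(values):
    """Mean, median, mode and midrange, rounded to two decimal places."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": round(statistics.median(values), 2),
        "mode": statistics.mode(values),
        "midrange": round((max(values) + min(values)) / 2, 2),
    }

print(centers(green_valley))
print(centers(big_spenders))
```

Both banks give identical measures of center even though the individual waiting times are clearly spread out differently.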
50
Dotplots of Waiting Times
Green Valley Bank
Big Spenders Bank
So how do we differentiate this data? We need measures that describe how the data is dispersed.
51
2-4 Measures of Variation
Range = highest value - lowest value
52
Measures of Variation
Standard Deviation
a measure of variation of the scores
about the mean
(average deviation from the mean)
53
Sample Standard Deviation Formula
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
calculators can compute the sample standard deviation of data
54
Population Standard Deviation
$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
calculators can compute the population standard deviation of data
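Both formulas can be written out in Python and compared with the standard library's `statistics.stdev` (sample) and `statistics.pstdev` (population). The data are the bank waiting times from the earlier slide, and the function names are chosen for this sketch.

```python
import math
import statistics

green_valley = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
big_spenders = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

def sample_sd(values):
    """s = sqrt( sum (x - xbar)^2 / (n - 1) )"""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def population_sd(values):
    """sigma = sqrt( sum (x - mu)^2 / N ), treating the values as a whole population"""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

for name, data in [("Green Valley", green_valley), ("Big Spenders", big_spenders)]:
    print(name, round(sample_sd(data), 2), round(population_sd(data), 2))
```

The two banks share the same mean, but the sample standard deviations (about 0.48 and 1.82 minutes) separate them.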
55
Symbols for Standard Deviation
                                Sample             Population
Most textbooks                  $s$                $\sigma$
Some graphics calculators       $S_x$              $\sigma_x$
Some non-graphics calculators   $x\sigma_{n-1}$    $x\sigma_n$
Articles in professional journals and reports often use SD for standard deviation and VAR for variance.
56
Measures of Variation
Variance
standard deviation squared
Notation: $s^2$ (sample), $\sigma^2$ (population)
use square key on calculator
57
Variance
Which is the parameter and which is the statistic?
Sample Variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
Population Variance: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
58
TI-83 Calculator
Calculate Standard Deviation
1. Same procedure as Mean
Calculate Standard Deviation from a frequency
distribution
1. Same procedure as Mean from a frequency
distribution
59
Round-off Rule
for measures of variation
Carry one more decimal place than
is present in the original set of
values.
Round only the final answer, never in the
middle of a calculation.
Now try Example #2 (go to Excel or Calculator)
60
Calculating Standard Deviation from the Mean
Example:
Let $\bar{x} = 40$ and $s = 3$
1. Calculate 1, 2 and 3 standard deviations from the mean
2. Find the maximum and minimum unusual values (test question)
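A short Python sketch of the example. It assumes the cutoff between usual and unusual values sits 2 standard deviations from the mean. That choice is an assumption made here, since the slides elsewhere say "2 or 3".

```python
mean, s = 40, 3

# Intervals 1, 2 and 3 standard deviations either side of the mean.
intervals = {k: (mean - k * s, mean + k * s) for k in (1, 2, 3)}
for k, (low, high) in intervals.items():
    print(f"within {k} SD: {low} to {high}")

# Assumed cutoff: values beyond 2 standard deviations are treated as unusual.
minimum_cutoff, maximum_cutoff = intervals[2]
print("cutoffs:", minimum_cutoff, maximum_cutoff)
```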
61
Standard Deviation from a Frequency Distribution
$s = \sqrt{\dfrac{n\left[\sum (f \cdot x^2)\right] - \left[\sum (f \cdot x)\right]^2}{n(n - 1)}}$
Use the class midpoints as the x values
Calculators can compute the standard deviation for frequency table
Let’s try #7 (go to excel or calculator)
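The frequency-table formula can be sketched in Python. The example applies it to the Qwerty ratings table (midpoints 1, 4, 7, 10, 13), which is a choice made here for illustration rather than the homework problem itself.

```python
import math

def sd_from_frequency_table(midpoints, frequencies):
    """s = sqrt( (n * sum(f*x^2) - (sum(f*x))^2) / (n * (n - 1)) ), with n = sum of f."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, frequencies))
    return math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))

midpoints = [1, 4, 7, 10, 13]
frequencies = [20, 14, 15, 2, 1]
s = sd_from_frequency_table(midpoints, frequencies)
print(round(s, 2))
```

Because midpoints stand in for the actual values, the result only approximates the standard deviation of the raw data.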
62
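The Empirical Rule
(applies to bell-shaped distributions)
68% within 1 standard deviation
95% within 2 standard deviations
99.7% of data are within 3 standard deviations of the mean
[Figure: bell curve with the horizontal axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$; areas between marks, left to right: 0.1%, 2.4%, 13.5%, 34%, 34%, 13.5%, 2.4%, 0.1%]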
63
Chebyshev’s Theorem
• applies to distributions of any shape, so its bounds are more conservative (less precise) than those of the empirical rule
• the proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K is any positive number greater than 1.
• at least 3/4 (75%) of all values lie within 2 standard deviations of the mean.
• at least 8/9 (89%) of all values lie within 3 standard deviations of the mean.
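The bound is a one-line calculation. A minimal Python sketch, with a function name chosen for illustration:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(f"within {k} SD: at least {chebyshev_lower_bound(k):.0%}")
```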
64
Chebyshev’s Theorem & Empirical Rule Summary
                      Applies To                  Within 1 SD           Within 2 SD’s          Within 3 SD’s
Chebyshev’s Theorem   Any Distribution            At least 0% of data   At least 75% of data   At least 89% of data
Empirical Rule        Bell Shaped Distributions   Approx 68% of data    Approx 95% of data     Approx 99.7% of data
Important table for your notes!!
65
Empirical & Chebyshev Examples
Will be one of each on the test
• Example: A batch of bolts has a mean of 4 inches and a standard deviation of .007 inch. What can you conclude about the percentage of bolts between various intervals?
1. Apply Empirical Rule (assume bell shaped data)
2. Apply Chebyshev’s Theorem
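One way to lay out the bolt example is to tabulate, for 1, 2 and 3 standard deviations, the interval of lengths and what each rule says about it. The Python sketch below does this. The layout of `rows` is an illustrative choice, and the Chebyshev entry for 1 standard deviation is the trivial 0% bound, since the theorem proper needs K greater than 1.

```python
mean, sd = 4.0, 0.007  # bolt length in inches

empirical = {1: 68.0, 2: 95.0, 3: 99.7}  # approximate percentages, bell-shaped data only

rows = []
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    chebyshev = 100 * (1 - 1 / k ** 2)  # "at least" percentage, any distribution
    rows.append((k, round(low, 3), round(high, 3), empirical[k], round(chebyshev, 1)))

for k, low, high, emp, cheb in rows:
    print(f"{low} to {high} in: about {emp}% (Empirical), at least {cheb}% (Chebyshev)")
```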
66
Unusual Scores
For typical data sets, it is unusual for a
score to differ from the mean by more than
2 or 3 standard deviations.
Note: I ask this on several questions on the
test
67
Measures of Variation (dispersion) and Measures of the Center
Center       Variation / Dispersion
Mean         Range
Median       Standard Deviation
Mode         Variance
Midrange
68
2–5
Measures of Relative Standing
OR
Measures of Position
69
When is the relative position of
data important?
Example:
Student A gets 80 out of 84 on a test.
Student B gets 50 out of 62 on another test.
Which one did relatively better?
What if the average of the test Student A
took was 82 and the average of the test
Student B took was 30. Now which one did
better?
70
Measures of Position
• z Score (or standard score)
the number of standard deviations that a given value x is above or below the mean
71
Measures of Position
z score
Sample: $z = \dfrac{x - \bar{x}}{s}$
Population: $z = \dfrac{x - \mu}{\sigma}$
Round to 2 decimal places
72
Mean and Standard Deviation of z score
If all data values in a data set have been converted to z scores then
Mean = 0
Standard Dev = 1
Note: This is a test question
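This property can be checked numerically. The Python sketch below converts a sample to z scores using the sample formula and then recomputes the mean and standard deviation. The test grades from the stem-and-leaf slide are reused as example data.

```python
import statistics

def z_scores(values):
    """Convert each value to z = (x - xbar) / s using the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]

grades = [67, 72, 89, 85, 88, 90, 75, 89, 99, 100]
z = z_scores(grades)

print([round(value, 2) for value in z])
print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```

Apart from floating-point rounding, the z scores have mean 0 and standard deviation 1 whatever the original data were.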
73
Interpreting Z Scores
[Figure: z-score number line marked at -3, -2, -1, 0, 1, 2, 3, with Ordinary Values in the middle and Unusual Values in both tails]
74
Let’s try some examples:
See Excel Z – Score Example
#2, #4, 9
Note: #9 is very similar to the one on the test
75
Measures of Position
Quartiles and
Percentiles
76
Quartiles
Q1, Q2, Q3
divide ranked scores into four equal parts
(minimum) 25% Q1 25% Q2 (median) 25% Q3 25% (maximum)
77
Quartiles
Q1 = P25
Q2 = P50
Q3 = P75
Percentiles
99 percentiles
78
Finding the Percentile of a Given Score
$\text{Percentile of score } x = \dfrac{\text{number of scores less than } x}{\text{total number of scores}} \cdot 100$
Use normal rounding rules
Use first value if duplicates
Let’s try #10 together – will be one like this on the test
Go to Excel for the data
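A minimal Python sketch of the percentile formula. The test grades from the stem-and-leaf slide are used as stand-in data because the homework data are not in these notes. Rounding half up is implemented with `math.floor(p + 0.5)`, since Python's built-in `round` rounds halves to the nearest even number.

```python
import math

def percentile_of_score(values, x):
    """(number of scores less than x) / (total number of scores) * 100, rounded half up."""
    below = sum(v < x for v in values)
    return math.floor(100 * below / len(values) + 0.5)

grades = [67, 72, 89, 85, 88, 90, 75, 89, 99, 100]  # stand-in data
for score in (67, 88, 89, 100):
    print(score, "is at percentile", percentile_of_score(grades, score))
```

Counting only scores strictly less than x handles duplicates such as the two 89s: both land at the same percentile.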
79
Finding the Score Given a Percentile
$L = \dfrac{k}{100} \cdot n$
n = total number of values in the data set
k = percentile being used
L = locator that gives the position of a value
$P_k$ = kth percentile
Let’s try a few #14 – 20 even
80
Finding the Value of the kth Percentile
1. Start: sort the data. (Arrange the data in order of lowest to highest.)
2. Compute $L = \left(\dfrac{k}{100}\right) n$, where n = number of values and k = percentile in question.
3. Is L a whole number?
   Yes: The value of the kth percentile is midway between the Lth value and the next value in the sorted set of data. Find $P_k$ by adding the Lth value and the next value and dividing the total by 2.
   No: Change L by rounding it up to the next larger whole number. The value of $P_k$ is the Lth value, counting from the lowest.
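The flowchart translates directly into code. This Python sketch assumes k is a whole number from 1 to 99 and uses integer arithmetic (`divmod`) to decide whether L is a whole number, which avoids floating-point surprises. The test grades are reused as example data.

```python
def percentile_value(values, k):
    """Value of the kth percentile following the flowchart (k an integer, 1 to 99)."""
    ordered = sorted(values)                  # sort lowest to highest
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)     # L = (k / 100) * n
    if remainder == 0:
        # L is a whole number: midway between the Lth value and the next value.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round L up; P_k is the Lth value counting from the lowest.
    return ordered[whole]                     # index `whole` is position whole + 1

grades = [67, 72, 89, 85, 88, 90, 75, 89, 99, 100]
q1, q2, q3 = (percentile_value(grades, k) for k in (25, 50, 75))
print(q1, q2, q3)
```

Other textbooks and software use slightly different percentile conventions, so results from Excel or a calculator may not match this rule exactly.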
81
Interquartile Range (or IQR): $Q_3 - Q_1$
Semi-interquartile Range: $\dfrac{Q_3 - Q_1}{2}$
Midquartile: $\dfrac{Q_1 + Q_3}{2}$
10 - 90 Percentile Range: $P_{90} - P_{10}$
Won’t ask you anything about these
82
2-6 Bivariate Data
Information about two variables (e.g., Weight and Height)
• Often presented visually as a scatter plot on an xy plane
• Weight is the x axis and Height is the y axis
83
Scatter Diagram
[Figure: scatter diagram of TAR (vertical axis, 0 to 20) against NICOTINE (horizontal axis, 0.0 to 1.5)]
Go to Excel Example (Weights of Cars and
Gas Mileage)
84
Scatter Diagram of Paired Data
85
Plot a scatter diagram
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
86
Definition
Correlation
exists between two variables when one of
them is related to the other in some way
88
Positive Linear Correlation
[Scatter plots: (a) Positive, (b) Strong positive, (c) Perfect positive]
89
Negative Linear Correlation
[Scatter plots: (d) Negative, (e) Strong negative, (f) Perfect negative]
90
No Linear Correlation
[Scatter plots: (g) No Correlation, (h) Nonlinear Correlation]
91
Definition
Linear Correlation Coefficient r
measures strength of the linear relationship between paired x and y values in a sample
$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$
This formula is rarely applied by hand, as it’s much easier to use a computer or spreadsheet
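For reference, the formula can be coded directly. The Python sketch below applies it to the Garbage Project data used in these slides. The function name is chosen for this example, and a spreadsheet or calculator remains the practical route.

```python
import math

def correlation(xs, ys):
    """Linear correlation coefficient r from the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

plastic = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]  # x, pounds of plastic
household = [2, 3, 3, 6, 4, 2, 1, 5]                        # y, household size

r = correlation(plastic, household)
print(round(r, 3))
```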
92
Notes on correlation
• r represents linear correlation coefficient for a sample
• $\rho$ (rho) represents linear correlation coefficient for a population
• $-1 \le r \le 1$
• r measures strength of a linear relationship.
• r = -1 perfect negative correlation
• r = 1 perfect positive correlation
• r = 0 no linear correlation
93
Plot a Scatter Diagram and Calculate Correlation
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
Go to Excel or Calculator
94
TI-83 Calculator
Calculate Correlation
1. Turn on DiagnosticOn Mode
2. Press 2nd then 0
3. Arrow down to DiagnosticOn
4. Press Enter
Note: the value for “r” will not appear on the screen if
calculator is in DiagnosticOff Mode
95
TI-83 Calculator
Calculate Correlation and Slope/Intercept
1. Press Stat
2. Press Edit
3. Enter x values in L1 and y values in L2
4. Press Stat
5. Cursor over to CALC
6. Choose the 4: LinReg(a+bx) option
7. Enter LinReg(a+bx) L1,L2
8. Press Enter
96
Plot a Scatter Diagram
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
[Figure: scatter diagram of the data; x axis from 0 to 3.5, y axis from 0 to 7]
97
Calculate Correlation?
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
[Figure: scatter diagram of the data; x axis from 0 to 3.5, y axis from 0 to 7]
r = 0.84236
98
Definition
Line of Best Fit
Algebraically describes the relationship between the
two variables
99
Line of Best Fit Equation
$\hat{y} = b_0 + b_1 x$ (compare $y = mx + b$)
x is the predictor variable
$\hat{y}$ is the response variable
$b_0$ = y-intercept
$b_1$ = slope
100
Line of Best Fit Plotted on Scatter Plot
The line of best fit goes through $(\bar{x}, \bar{y})$
101
Formula for b0 and b1
$b_0 = \dfrac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$ (y-intercept)
$b_1 = \dfrac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$ (slope)
Best to use calculators or computers.
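As a cross-check on calculator output, the two formulas can be sketched in Python. The function name `line_of_best_fit` is hypothetical, the data are the Garbage Project values from these slides, and the prediction at 0.50 lb is included as an illustration.

```python
def line_of_best_fit(xs, ys):
    """Return (b0, b1): y-intercept and slope from the computational formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    b0 = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b1 = (n * sum_xy - sum_x * sum_y) / denominator
    return b0, b1

plastic = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]  # x, pounds of plastic
household = [2, 3, 3, 6, 4, 2, 1, 5]                        # y, household size

b0, b1 = line_of_best_fit(plastic, household)
predicted = b0 + b1 * 0.50  # predicted household size for 0.50 lb of plastic
print(round(b0, 3), round(b1, 2), round(predicted, 1))
```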
102
Find the Line of Best Fit
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
Go to Excel or Calculator for the following example.
103
Find the Line of Best Fit
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
Using a calculator:
$b_0 = 0.549$
$b_1 = 1.48$
$\hat{y} = 0.549 + 1.48x$
104
Plot Line of Best Fit
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
[Figure: scatter diagram of the data with the line of best fit; x axis from 0 to 3.5, y axis from 0 to 7]
105
What is the best predicted size of a household that discards 0.50 lb of plastic?
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
[Figure: scatter diagram of the data with the line of best fit; x axis from 0 to 3.5, y axis from 0 to 7]
106
What is the best predicted size of a household that discards 0.50 lb of plastic?
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
Using a calculator:
$b_0 = 0.549$
$b_1 = 1.48$
$\hat{y} = 0.549 + 1.48(0.50)$
$\hat{y} = 1.3$
107
What is the best predicted size of a household that discards 0.50 lb of plastic?
Data from the Garbage Project
x Plastic (lb):   0.27   1.41   2.19   2.83   2.19   1.81   0.85   3.05
y Household:      2      3      3      6      4      2      1      5
Using a calculator:
$b_0 = 0.549$
$b_1 = 1.48$
$\hat{y} = 0.549 + 1.48(0.50)$
$\hat{y} = 1.3$
A household that discards 0.50 lb of plastic has approximately one person.
108