Download Chapter 2

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

History of statistics wikipedia , lookup

World Values Survey wikipedia , lookup

Time series wikipedia , lookup

Transcript
Descriptive Statistics
Frequency Tables
Visual Displays
Measures of Center
1
Overview
 Descriptive Statistics
summarize or describe the important
characteristics of a known set of
population data
 Inferential Statistics
use sample data to make inferences (or
generalizations) about a population
2
Important Characteristics of Data
1. Center: A representative or average value that
indicates where the middle of the data set is located
2. Variation: A measure of the amount that the values
vary among themselves
3. Distribution: The nature or shape of the distribution
of data (such as bell-shaped, uniform, or skewed)
4. Outliers: Sample values that lie very far away from
the vast majority of other sample values
5. Time: Changing characteristics of the data over
time
3
Summarizing Data With
Frequency Tables
 Frequency Table
lists classes (or categories) of values,
along with frequencies (or counts) of the
number of values that fall into each class
4
Table 2-1
Qwerty Keyboard Word Ratings
2
2
5
1
2
6
3
3
4
2
4
0
5
7
7
5
6
6
8
10
7
2
2
10
5
8
2
5
4
2
6
2
6
1
7
2
7
2
3
8
1
5
2
5
2
14
2
2
6
3
1
7
5
Table 2-3
Frequency Table of Qwerty Word Ratings
Rating
Frequency
0-2
20
3-5
14
6-8
15
9 - 11
2
12 - 14
1
6
Frequency Table
Definitions
7
Lower Class Limits
are the smallest numbers that can actually belong to
different classes
Rating
Lower Class
Limits
Frequency
0-2
20
3-5
14
6-8
15
9 - 11
2
12 - 14
1
8
Upper Class Limits
are the largest numbers that can actually belong to
different classes
Rating
Upper Class
Limits
Frequency
0-2
20
3-5
14
6-8
15
9 - 11
2
12 - 14
1
9
Class Boundaries
number separating classes
Rating
Frequency
- 0.5
2.5
Class
Boundaries
5.5
0-2
20
3-5
14
6-8
15
9 - 11
2
12 - 14
1
8.5
11.5
14.5
10
Class Midpoints
midpoints of the classes
Rating
Class
Midpoints
Frequency
0- 1 2
20
3- 4 5
14
6- 7 8
15
9 - 10 11
2
12 - 13 14
1
11
Class Width
is the difference between two consecutive lower
class limits or two consecutive class boundaries
Rating
Class Width
Frequency
3
0-2
20
3
3-5
14
3
6-8
15
3 9 - 11
2
3 12 - 14
1
12
Guidelines For Frequency Tables
1. Be sure that the classes are mutually exclusive.
2. Include all classes, even if the frequency is zero.
3. Try to use the same width for all classes.
4. Select convenient numbers for class limits.
5. Use between 5 and 20 classes.
6. The sum of the class frequencies must equal the
number of original data values.
13
Constructing A Frequency Table
1.
Decide on the number of classes .
2.
Determine the class width by dividing the range by the number
of classes (range = highest score - lowest score) and round up.
class width

range
round up of
number of classes
3.
Select for the first lower limit either the lowest score or a
convenient value slightly less than the lowest score.
4.
Add the class width to the starting point to get the second lower
class limit, add the width to the second lower limit to get the
third, and so on.
5.
List the lower class limits in a vertical column and enter the
upper class limits.
6.
Represent each score by a tally mark in the appropriate class.
Total tally marks to find the total frequency for each class.
14
Inputting a Frequency Table into
the TI83 Calculator
•
•
•
•
Push Stat
Select Edit
Input Class Marks in List 1
Input Frequencies in List 2
15
Relative Frequency Table
relative frequency =
class frequency
sum of all frequencies
16
Relative Frequency Table
Rating Frequency
Relative
Rating Frequency
0-2
20
0-2
38.5%
20/52 = 38.5%
3-5
14
3-5
26.9%
14/52 = 26.9%
6-8
15
6-8
28.8%
9 - 11
2
9 - 11
3.8%
12 - 14
1
12 - 14
1.9%
etc.
Total frequency = 52
17
Cumulative Frequency Table
Rating Frequency
Rating
Cumulative
Frequency
0-2
20
Less than 3
20
3-5
14
Less than 6
34
6-8
15
Less than 9
49
9 - 11
2
Less than 12
51
12 - 14
1
Less than 15
52
Cumulative
Frequencies
18
Frequency Tables
Rating Frequency
Rating
Relative
Frequency
Rating
Cumulative
Frequency
0-2
20
0-2
38.5%
Less than 3
20
3-5
14
3-5
26.9%
Less than 6
34
6-8
15
6-8
28.8%
Less than 9
49
9 - 11
2
9 - 11
3.8%
Less than 12
51
12 - 14
1
12 - 14
1.9%
Less than 15
52
19
Pictures of Data
depict the nature or shape of the data distribution
Histogram
a bar graph in which the horizontal scale
represents classes and the vertical scale
represents frequencies
20
Histogram of Qwerty Word Ratings
Rating Frequency
0-2
20
3-5
14
6-8
15
9 - 11
2
12 - 14
1
21
Relative Frequency Histogram
of Qwerty Word Ratings
Relative
Rating Frequency
0-2
38.5%
3-5
26.9%
6-8
28.8%
9 - 11
3.8%
12 - 14
1.9%
22
Histogram
and
Relative Frequency Histogram
23
Viewing a Histogram
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
First input class marks and frequencies in L1 & L2
2nd y= (Statplot)
Select 1
Select “on”
Select histogram type (top row, 3rd graph)
X list = 2nd 1 (L1)
Freq = 2nd 2 (L2)
Zoom
9 (ZoomStat)
24
Frequency Polygon
25
Viewing a Frequency Polygon
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
First input class marks and frequencies in L1 & L2
2nd y= (Statplot)
Select 1
Select “on”
Select polygon type (top row, 2nd graph)
X list = 2nd 1 (L1)
Freq = 2nd 2 (L2)
Zoom
9 (ZoomStat)
26
Ogive
27
Dot Plot
28
Stem-and Leaf Plot
Stem
Raw Data (Test Grades)
67 72
89
85
88 90
75
89
99 100
6
7
8
9
10
Leaves
7
25
5899
09
0
29
Pareto Chart
45,000
40,000
35,000
Accidental Deaths by Type
25,000
20,000
15,000
10,000
5,000
Firearms
Ingestion of food
or object
Fire
Drowning
Poison
Falls
0
Motor Vehicle
Frequency
30,000
30
Pie Chart
Firearms
(1400. 1.9%)
Ingestion of food or object
(2900. 3.9%
Fire
(4200. 5.6%)
Motor vehicle
(43,500. 57.8%)
Drowning
(4600. 6.1%)
Poison
(6400. 8.5%)
Falls
(12,200. 16.2%)
Accidental Deaths by Type
31
Scatter Diagram
20
TAR
•
10
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
0
0.0
0.5
1.0
1.5
NICOTINE
32
Other Graphs
 Boxplots
 Pictographs
 Pattern of data over time
33
Measures of Center
a value at the
center or middle
of a data set
34
Definitions
Mean
(Arithmetic Mean)
AVERAGE
the number obtained by adding the values
and dividing the total by the number of
values
35
Notation

denotes the addition of a set of values
x
is the variable usually used to represent the individual
data values
n
represents the number of data values in a sample
N
represents the number of data values in a population
36
Notation
x is pronounced ‘x-bar’ and denotes the mean of a set
of sample values
x
x =
n
µ
is pronounced ‘mu’ and denotes the mean of all values
population
µ =
in a
x
N
Calculators can calculate the mean of data
37
Finding a Mean
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
Push Stat
Select Edit
Enter data into L1
2nd mode (quit)
Push Stat
>Calc
Select 1 (1-VarStats)
2nd 1 (L1)
The first output item is the mean
38
Definitions
 Median
the middle value when the original
data values are arranged in order of
increasing (or decreasing) magnitude
 often denoted by x~
(pronounced ‘x-tilde’)
 is not affected by an extreme value
39
6.72
3.46
3.60
6.44
26.70
3.46
3.60
6.44
6.72
26.70
(in order -
exact middle
odd number of values)
MEDIAN is 6.44
40
6.72
3.46
3.60
6.44
3.46
3.60
6.44
6.72
(even number of values)
no exact middle -- shared by two numbers
3.60 + 6.44
2
MEDIAN is 5.02
41
Finding a Median
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
Push Stat
Select Edit
Enter data into L1
2nd mode (quit)
Push Stat
>Calc
Select 1 (1-VarStats)
2nd 1 (L1)
Arrow down to “Med” to see the value for median
42
Definitions
 Mode
the score that occurs most frequently
Bimodal
Multimodal
No Mode
denoted by M
the only measure of central tendency that can be used
with nominal data
43
Examples
a. 5 5 5 3 1 5 1 4 3 5
Mode is 5
b. 1 2 2 2 3 4 5 6 6 6 7 9
Bimodal
c. 1 2 3 6 7 8 9 10
No Mode
2 and 6
44
Definitions
 Midrange
the value midway between the highest and
lowest values in the original data set
Midrange =
highest score + lowest score
2
45
Finding a Midrange
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
Push Stat
Select Edit
Enter data into L1
2nd mode (quit)
Push Stat
>Calc
Select 1 (1-VarStats)
2nd 1 (L1)
Arrow down to see the “min” and “max” values
for the midrange calculation
46
Round-off Rule for
Measures of Center
Carry one more decimal place than is
present in the original set of values
47
Mean from a Frequency Table
use class midpoint of classes for variable x
 (f • x)
x =
f
x = class midpoint
f = frequency
f=n
48
Finding a Grouped Mean
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
Push Stat
Select Edit
Enter class marks into L1 and frequencies into L2
2nd mode (quit)
Push Stat
>Calc
Select 1 (1-VarStats)
2nd 1, 2nd 2 (L1,L2)
The first output item is the mean
49
Descriptive Statistics
Measures of Variability
Measures of Position
50
Definitions
 Symmetric
Data is symmetric if the left half of its
histogram is roughly a mirror of its
right half.
 Skewed
Data is skewed if it is not symmetric
and if it extends more to one side than
the other.
51
Skewness
Mode
=
Mean
=
Median
SYMMETRIC
Mean
Mode
Median
SKEWED LEFT
(negatively)
Mean
Mode
Median
SKEWED RIGHT
(positively)
52
Waiting Times of Bank Customers
at Different Banks
in minutes
Jefferson Valley Bank
6.5
6.6
6.7
6.8
7.1
7.3
7.4
7.7
7.7
7.7
Bank of Providence
4.2
5.4
5.8
6.2
6.7
7.7
7.7
8.5
9.3
10.0
Jefferson Valley Bank
Bank of Providence
Mean
7.15
7.15
Median
7.20
7.20
Mode
7.7
7.7
Midrange
7.10
7.10
53
Dotplots of Waiting Times
54
Measures of Variation
Range
highest
value
lowest
value
55
Measures of Variation
Standard Deviation
a measure of variation of the scores
about the mean
(average deviation from the mean)
56
Sample Standard Deviation
Formula
S=
 (x - x)
n-1
2
calculators can compute the
sample standard deviation of data
57
Sample Standard Deviation
Shortcut Formula
n (x ) - (x)
n (n - 1)
2
s=
2
calculators can compute the
sample standard deviation of data
58
Population Standard Deviation
 =
 (x - µ)
2
N
calculators can compute the
population standard deviation
of data
59
Symbols
for Standard Deviation
Sample
Textbook
Some graphics
calculators
Some
non-graphics
calculators
s
Sx
xn-1
Population

x
x n
Textbook
Some graphics
calculators
Some
non-graphics
calculators
Articles in professional journals and reports often use SD
for standard deviation and VAR for variance.
60
Finding a Standard Deviation
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
Push Stat
Select Edit
Enter data into L1
2nd mode (quit)
Push Stat
>Calc
Select 1 (1-VarStats)
2nd 1 (L1)
The 4th output item is the sample s.d. and the 5th
output item is the population s.d.
61
Measures of Variation
Variance
standard deviation squared
}
Notation
s

2
2
use square key
on calculator
62
Variance
2
s =

2
=
 (x - x )
2
n-1
 (x - µ)
N
2
Sample
Variance
Population
Variance
63
Round-off Rule
for measures of variation
Carry one more decimal place than
is present in the original set of
values.
Round only the final answer, never in the
middle of a calculation.
64
Standard Deviation from a
Frequency Table
n [(f • x 2)] -[(f • x)]2
S=
n (n - 1)
Use the class midpoints as the x values
Calculators can compute the standard deviation for frequency
table
65
Finding a Grouped Standard Deviation
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
Push Stat
Select Edit
Enter class marks into L1 and frequencies into L2
2nd mode (quit)
Push Stat
>Calc
Select 1 (1-VarStats)
2nd 1, 2nd 2 (L1, L2)
The 4th output item is the sample s.d and the 5th
output item is the population s.d
66
Estimation of Standard Deviation
Range Rule of Thumb
x - 2s
x + 2s
x
(minimum
usual value)
(maximum
usual value)
Range  4s
or
s
Range
4
=
highest value - lowest value
4
67
Usual Sample Values
minimum ‘usual’ value  (mean) - 2 (standard deviation)
minimum  x - 2(s)
maximum ‘usual’ value  (mean) + 2 (standard deviation)
maximum  x + 2(s)
68
The Empirical Rule
(applies to bell-shaped distributions)
x
69
The Empirical Rule
(applies to bell-shaped distributions)
68% within
1 standard deviation
34%
x-s
34%
x
x+s
70
The Empirical Rule
(applies to bell-shaped distributions)
95% within
2 standard deviations
68% within
1 standard deviation
34%
34%
13.5%
x - 2s
13.5%
x-s
x
x+s
x + 2s
71
The Empirical Rule
(applies to bell-shaped distributions)
99.7% of data are within 3 standard deviations of the mean
95% within
2 standard deviations
68% within
1 standard deviation
34%
34%
2.4%
2.4%
0.1%
0.1%
13.5%
x - 3s
x - 2s
13.5%
x-s
x
x+s
x + 2s
x + 3s
72
Chebyshev’s Theorem
 applies to distributions of any shape.
 the proportion (or fraction) of any set of data lying
within K standard deviations of the mean is always at
least 1 - 1/K2 , where K is any positive number greater
than 1.
 at least 3/4 (75%) of all values lie within 2 standard
deviations of the mean.
 at least 8/9 (89%) of all values lie within 3 standard
deviations of the mean.
73
Measures of Variation Summary
For typical data sets, it is unusual for a
score to differ from the mean by more than
2 or 3 standard deviations.
74
Measures of Position
 z Score
(or standard score)
the number of standard deviations
that a given value x is above or below
the mean
75
Measures of Position
z score
Sample
x
x
z= s
Population
x
µ
z=

Round to 2 decimal places
76
Interpreting Z Scores
Unusual
Values
-3
Ordinary
Values
-2
-1
0
Unusual
Values
1
2
3
Z
77
Measures of Position
Quartiles, Deciles,
Percentiles
78
Quartiles
Q1, Q2, Q3
divides ranked scores into four equal parts
25%
(minimum)
25%
25% 25%
Q1 Q2 Q3
(maximum)
(median)
79
Deciles
D1, D2, D3, D4, D5, D6, D7, D8, D9
divides ranked data into ten equal parts
10% 10% 10%
D1
D2
D3
10% 10% 10%
D4
D5
10% 10% 10% 10%
D6
D7
D8
D9
80
Percentiles
99 Percentiles
81
Quartiles, Deciles, Percentiles
Fractiles
(Quantiles)
partitions data into approximately equal parts
82
Quartiles
Q1 = P25
Q2 = P50
Q3 = P75
Deciles
D1 = P10
D2 = P20
D3 = P30
•
•
•
D9 = P90
83
Viewing Quartiles
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
Push Stat
Select Edit
Enter data into L1
2nd mode (quit)
Push Stat
>Calc
Select 1 (1-VarStats)
2nd 1 (L1)
Arrow down to view the values for Q2 and Q3 (the
median “med” is Q2)
84
Exploratory Data Analysis
the process of using statistical tools
(such as graphs, measures of center,
and measures of variation) to investigate
the data sets in order to understand
their important characteristics
85
Outliers
 a value located very far away from
almost all of the other values
 an extreme value
 can have a dramatic effect on the
mean, standard deviation, and on the
scale of the histogram so that the true
nature of the distribution is totally
obscured
86
Boxplots
(Box-and-Whisker Diagram)
Reveals the:
 center of the data
 spread of the data
 distribution of the data
 presence of outliers
Excellent for comparing two
or more data sets
87
Boxplots
5 - number summary
 Minimum
 first quartile Q1
 Median (Q2)
 third quartile Q3
 Maximum
88
Boxplots
2
4
6
14
0
0
2
4
6
8
10
12
14
Boxplot of Qwerty Word Ratings
89
Viewing Boxplots
Using the TI83 Calculator
•
•
•
•
•
•
•
•
•
First input class marks and frequencies in L1 & L2
2nd y= (Statplot)
Select 1
Select “on”
Select boxplot type (bottom row, 2nd graph)
X list = 2nd 1 (L1)
Freq = 2nd 2 (L2)
Zoom
9 (ZoomStat)
90
Boxplots
Bell-Shaped
Uniform
Skewed
91
Exploring
 Measures of center: mean, median, and mode
 Measures of variation: standard deviation and
range
 Measures of spread & relative location:
minimum values, maximum value, and quartiles
 Unusual values: outliers
 Distribution: histograms, stem-leaf plots, and
boxplots
92