Summarizing Data
Graphical Methods
Histogram
[Figure: histogram; vertical axis: frequency (0 to 8); horizontal axis: class intervals 70 to 80, 80 to 90, 90 to 100, 100 to 110, 110 to 120, 120 to 130]
Stem-Leaf Diagram
 8 | 024669
 9 | 04455699
10 | 224559
11 | 189
12 |
Grouped Freq Table
Interval      Verbal IQ   Math IQ
70 to 80          1          1
80 to 90          6          2
90 to 100         7         11
100 to 110        6          4
110 to 120        3          4
120 to 130        0          1
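As a rough illustration, the stem-leaf diagram and the Verbal IQ column of the grouped frequency table can be rebuilt from raw scores. The following is a minimal Python sketch rather than part of the original slides. It uses the 23 Verbal IQ scores listed in the Range and IQR example of these notes, and it assumes that a class "a to b" excludes a and includes b, which is the reading that reproduces the tabulated Verbal IQ counts.

```python
from collections import defaultdict

# Verbal IQ scores for the n = 23 students (from the Range/IQR example).
verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96, 99, 99,
             102, 102, 104, 105, 105, 109, 111, 118, 119]

# Stem-leaf diagram: stem = tens, leaf = units digit.
stems = defaultdict(list)
for x in sorted(verbal_iq):
    stems[x // 10].append(x % 10)
for stem in sorted(stems):
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in stems[stem])}")

# Grouped frequency table.
# Assumption: the class "a to b" contains the scores with a < x <= b.
edges = [70, 80, 90, 100, 110, 120, 130]
counts = [sum(1 for x in verbal_iq if lo < x <= hi)
          for lo, hi in zip(edges, edges[1:])]
for lo, hi, c in zip(edges, edges[1:], counts):
    print(f"{lo} to {hi}: {c}")
```

Only stems that actually hold a leaf are printed, so the empty stem 12 shown on the slide does not appear in this output.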
Box-whisker Plot
Example
• The Baby Boom
Age Distribution for Canada
1921 - 2006
[Figure: bar charts of the age distribution, one panel per census year (1921, 1931, 1941, 1951, 1956, 1961, 1966, 1971, 1976, 1981, 1986, 1991, 1996, 2001, 2006); vertical axis: percentage of population (0.0% to 20.0%); horizontal axis: age group (Under 5, 5 to 9, 10 to 14, 15 to 24, 25 to 34, 35 to 44, 45 to 54, 55 to 64, 65 to 74, 75 to 84, 85 and over)]
Median Age in Canada by Gender and Year
Year    Male    Female
1921    24.7    23.2
1931    25.5    24.0
1941    27.5    26.6
1951    27.8    27.6
1956    27.2    27.3
1961    26.1    26.6
1966    25.0    25.9
1971    25.7    26.7
1976    27.2    28.4
1981    29.0    30.3
1986    30.9    32.4
1991    32.7    34.2
1996    34.5    36.1
2001    36.8    38.4
2006    38.6    40.4
[Figure: Median Age in Canada by Gender; line chart with one line each for Male and Female; vertical axis: median age (10.0 to 40.0); horizontal axis: year (1920 to 2000)]
Total Population in Canada by Year
Year    Total Population
1921     8,787,949
1931    10,376,786
1941    11,506,655
1951    14,009,429
1956    16,080,791
1961    18,238,247
1966    20,014,880
1971    21,568,310
1976    22,992,600
1981    24,343,180
1986    25,309,330
1991    27,296,855
1996    28,846,760
2001    30,007,095
2006    31,612,895
[Figure: Total Population (Canada); vertical axis ticks 0 to 35; horizontal axis: year (1920 to 2010)]
Summary
Numerical Measures
Measure of Central Location
1. Mean
2. Median
Measure of Non-Central Location
1. Percentiles
2. Quartiles
   1. Lower quartile (Q1) (25th percentile) (lower mid-hinge)
   2. Median (Q2) (50th percentile) (hinge)
   3. Upper quartile (Q3) (75th percentile) (upper mid-hinge)
Measure of Variability (Dispersion, Spread)
1. Range
2. Inter-Quartile Range
3. Variance, standard deviation
4. Pseudo-standard deviation
1. Range
R = Range = max - min
2. Inter-Quartile Range (IQR)
Inter-Quartile Range = IQR = Q3 - Q1
Example
The data Verbal IQ on n = 23 students
arranged in increasing order is:
80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119
min = 80
Q1 = 89
Q2 = 96
Q3 = 105
max = 119
Range and IQR
Range = max – min = 119 – 80 = 39
Inter-Quartile Range
= IQR = Q3 - Q1 = 105 – 89 = 16
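The quartiles in this example agree with taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data, leaving out the middle value when n is odd. The following is a minimal Python sketch under that assumption. The function name `five_number_summary` is illustrative and not from the slides, and other quartile conventions can give slightly different values.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max), taking Q1 and Q3 as the medians of the
    lower and upper halves (the middle value is left out when n is odd)."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]              # values below the median position
    upper = xs[len(xs) - half:]    # values above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96, 99, 99,
             102, 102, 104, 105, 105, 109, 111, 118, 119]

lo, q1, q2, q3, hi = five_number_summary(verbal_iq)
data_range = hi - lo   # Range = max - min
iqr = q3 - q1          # IQR = Q3 - Q1
print(lo, q1, q2, q3, hi)   # 80 89 96 105 119
print(data_range, iqr)      # 39 16
```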
3. Sample Variance
Let $x_1, x_2, x_3, \dots, x_n$ denote a set of n numbers.
Recall the mean of the n numbers is defined as:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_{n-1} + x_n}{n}$$
The numbers
$$d_1 = x_1 - \bar{x}, \quad d_2 = x_2 - \bar{x}, \quad d_3 = x_3 - \bar{x}, \quad \dots, \quad d_n = x_n - \bar{x}$$
are called deviations from the mean.
The sum
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
is called the sum of squares of deviations from the mean.
Writing it out in full:
$$d_1^2 + d_2^2 + d_3^2 + \dots + d_n^2$$
or
$$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2$$
The Sample Variance
The sample variance is defined as the quantity:
$$\frac{\sum_{i=1}^{n} d_i^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
and is denoted by the symbol $s^2$.
The Sample Standard Deviation s
Definition: The Sample Standard Deviation is defined by:
$$s = \sqrt{\frac{\sum_{i=1}^{n} d_i^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
Hence the Sample Standard Deviation, s, is the square root of the sample variance.
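To make the definitions concrete, here is a small Python sketch that follows them step by step (mean, deviations, sum of squares, division by n - 1, square root) for the Verbal IQ scores from the earlier example. It is an illustration only. The cross-check against Python's standard `statistics` module is an addition not found in the slides.

```python
import math
import statistics

verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96, 99, 99,
             102, 102, 104, 105, 105, 109, 111, 118, 119]

n = len(verbal_iq)
mean = sum(verbal_iq) / n                  # x-bar
deviations = [x - mean for x in verbal_iq] # d_i = x_i - x-bar
sum_sq = sum(d ** 2 for d in deviations)   # sum of squares of deviations
s2 = sum_sq / (n - 1)                      # sample variance s^2
s = math.sqrt(s2)                          # sample standard deviation s

print(f"mean = {mean:.2f}, s^2 = {s2:.2f}, s = {s:.2f}")

# statistics.variance and statistics.stdev also divide by n - 1.
print(math.isclose(s2, statistics.variance(verbal_iq)))
print(math.isclose(s, statistics.stdev(verbal_iq)))
```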
Interpretations of s
• In Normal distributions
– Approximately 2/3 of the observations will lie
within one standard deviation of the mean
– Approximately 95% of the observations lie
within two standard deviations of the mean
– In a histogram of the Normal distribution, the
standard deviation is approximately the
distance from the mode to the inflection point
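[Figure: Normal density curve (horizontal axis 0 to 25) marking the mode, an inflection point, the distance s between them, and the central interval of width 2s containing about 2/3 of the observations]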
A Computing Formula for the Sample Variance:
Sum of squares of deviations from the mean:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$
The difficulty with this formula is that $\bar{x}$ will have many decimals.
The result will be that each term in the above sum will also have many decimals.
The sum of squares of deviations from the mean can also be computed using the following identity:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
Then:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$$
and
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$$
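The identity can be checked numerically. The sketch below is an illustration, not the slides' own code: it evaluates both sides with exact rational arithmetic (`fractions.Fraction`) so that rounding cannot hide a discrepancy. The function names are hypothetical.

```python
from fractions import Fraction

def sum_sq_definition(xs):
    """Sum of squared deviations from the mean, computed term by term."""
    mean = Fraction(sum(xs), len(xs))
    return sum((x - mean) ** 2 for x in xs)

def sum_sq_computing(xs):
    """Same quantity via the computing formula: sum(x^2) - (sum x)^2 / n."""
    return sum(x * x for x in xs) - Fraction(sum(xs) ** 2, len(xs))

verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96, 99, 99,
             102, 102, 104, 105, 105, 109, 111, 118, 119]

a = sum_sq_definition(verbal_iq)
b = sum_sq_computing(verbal_iq)
print(a == b)                                 # both sides agree exactly
print(float(b / (len(verbal_iq) - 1)))        # sample variance s^2
```

With integer data, the computing formula needs only the two running totals of the values and of their squares, which is why it suits hand calculation.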
A quick (rough) calculation of s
$$s \approx \frac{\text{Range}}{4}$$
The reason for this is that approximately all (95%) of the observations are between $\bar{x} - 2s$ and $\bar{x} + 2s$.
Thus $\max \approx \bar{x} + 2s$ and $\min \approx \bar{x} - 2s$,
and $\text{Range} = \max - \min \approx (\bar{x} + 2s) - (\bar{x} - 2s) = 4s$.
Hence $s \approx \frac{\text{Range}}{4}$.
The Pseudo Standard Deviation (PSD)
Definition: The Pseudo Standard Deviation (PSD) is defined by:
$$\text{PSD} = \frac{\text{IQR}}{1.35} = \frac{\text{Inter-Quartile Range}}{1.35}$$
Properties
• For Normal distributions the magnitude of the
pseudo standard deviation (PSD) and the standard
deviation (s) will be approximately the same value
• For leptokurtic distributions the standard deviation
(s) will be larger than the pseudo standard
deviation (PSD)
• For platykurtic distributions the standard deviation
(s) will be smaller than the pseudo standard
deviation (PSD)
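As a quick illustration of how the two shortcut estimates compare with s, the Python sketch below applies Range/4 and IQR/1.35 to the Verbal IQ scores, using the quartiles Q1 = 89 and Q3 = 105 found in the earlier example. The helper names are hypothetical. The divisor 1.35 reflects the fact that the inter-quartile range of a Normal distribution is about 1.35 standard deviations.

```python
import statistics

def rough_sd(xs):
    """Quick estimate of s: Range / 4."""
    return (max(xs) - min(xs)) / 4

def pseudo_sd(q1, q3):
    """Pseudo standard deviation: IQR / 1.35."""
    return (q3 - q1) / 1.35

verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96, 99, 99,
             102, 102, 104, 105, 105, 109, 111, 118, 119]

s = statistics.stdev(verbal_iq)      # sample standard deviation (n - 1)
print(f"s       = {s:.2f}")
print(f"Range/4 = {rough_sd(verbal_iq):.2f}")
print(f"PSD     = {pseudo_sd(89, 105):.2f}")
```

The three values will not agree exactly. How far s sits from the PSD is what the properties above relate to the shape of the distribution.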
Measures of Shape
• Skewness
[Figure: three example density curves with differing skewness, each on a horizontal axis from 0 to 25]
• Kurtosis
[Figure: example density curves with differing kurtosis]
• Skewness: based on the sum of cubes
$$\sum_{i=1}^{n} (x_i - \bar{x})^3$$
• Kurtosis: based on the sum of 4th powers
$$\sum_{i=1}^{n} (x_i - \bar{x})^4$$
The Measure of Skewness
$$g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}$$
The Measure of Kurtosis
$$g_2 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{s^4} - 3$$
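A minimal Python sketch of these two measures follows. It assumes, as the formulas are read here, that the sums of cubes and 4th powers are divided by n and that s is the sample standard deviation with divisor n - 1. Statistical packages often use slightly different conventions, so their output may differ a little. The function name and the small data sets are made up for illustration.

```python
import math

def shape_measures(xs):
    """Return (g1, g2): skewness and kurtosis as defined above."""
    n = len(xs)
    mean = sum(xs) / n
    # Sample standard deviation s (divisor n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    m3 = sum((x - mean) ** 3 for x in xs) / n   # averaged sum of cubes
    m4 = sum((x - mean) ** 4 for x in xs) / n   # averaged sum of 4th powers
    g1 = m3 / s ** 3
    g2 = m4 / s ** 4 - 3
    return g1, g2

symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric data
right_tail = [1, 1, 1, 2, 10]      # hypothetical data with a long right tail
left_tail = [-10, -2, -1, -1, -1]  # mirror image: long left tail

for name, xs in [("symmetric", symmetric),
                 ("right tail", right_tail),
                 ("left tail", left_tail)]:
    g1, g2 = shape_measures(xs)
    print(f"{name:>10}: g1 = {g1:+.3f}, g2 = {g2:+.3f}")
```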
Interpretations of Measures of Shape
• Skewness
[Figure: three density curves labelled g1 > 0 (long tail to the right), g1 = 0 (symmetric), g1 < 0 (long tail to the left)]
• Kurtosis
[Figure: density curves labelled g2 < 0 (platykurtic), g2 = 0 (Normal), g2 > 0 (leptokurtic)]
Advanced Box Plots
• An outlier is a “wild” observation in the
data
• Outliers occur because
– of errors (typographical and computational)
– Extreme cases in the population
• We will now consider the drawing of boxplots where outliers are identified
To Draw a Box Plot we need to:
• Compute the Hinge (Median, Q2) and the
Mid-hinges (first & third quartiles – Q1
and Q3 )
• To identify outliers we will compute the
inner and outer fences
The fences are like the fences at a prison. We
expect the entire population to be within both
sets of fences.
If a member of the population is between the
inner and outer fences it is a mild outlier.
If a member of the population is outside of the
outer fences it is an extreme outlier.
Inner fences
Lower inner fence
f1 = Q1 - (1.5)IQR
Upper inner fence
f2 = Q3 + (1.5)IQR
Outer fences
Lower outer fence
F1 = Q1 - (3)IQR
Upper outer fence
F2 = Q3 + (3)IQR
• Observations that are between the lower and
upper inner fences are considered to be
non-outliers.
• Observations that are outside the inner
fences but not outside the outer fences are
considered to be mild outliers.
• Observations that are outside outer fences
are considered to be extreme outliers.
• mild outliers are plotted individually in a box-plot, using one marker symbol
• extreme outliers are plotted individually in a box-plot, using a different marker symbol
• non-outliers are represented with the box
and whiskers with
– Max = largest observation within the fences
– Min = smallest observation within the fences
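The fence rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the slides' own procedure: the function names, the data, and the quartiles are hypothetical, and an observation lying exactly on a fence is treated here as being inside it.

```python
def fences(q1, q3):
    """Return ((f1, f2), (F1, F2)): the inner and outer fences."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

def classify(x, q1, q3):
    """Label one observation as non-outlier, mild outlier or extreme outlier."""
    (f1, f2), (F1, F2) = fences(q1, q3)
    if x < F1 or x > F2:
        return "extreme outlier"
    if x < f1 or x > f2:
        return "mild outlier"
    return "non-outlier"

def whisker_ends(data, q1, q3):
    """Smallest and largest observations that lie within the inner fences."""
    (f1, f2), _ = fences(q1, q3)
    inside = [x for x in data if f1 <= x <= f2]
    return min(inside), max(inside)

# Hypothetical data and quartiles, for illustration only.
data = [2, 9, 10, 12, 15, 18, 20, 22, 40, 60]
q1, q3 = 10, 20
labels = {x: classify(x, q1, q3) for x in data}
print(labels[40], "|", labels[60])   # mild outlier | extreme outlier
print(whisker_ends(data, q1, q3))    # (2, 22)
```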
[Figure: box-whisker plot representing the data that are not outliers, with the inner fences, an outer fence, the mild outliers and an extreme outlier labelled]
Example
Data collected on n = 109 countries in 1995.
Data collected on k = 25 variables.
The variables
1. Population Size (in 1000s)
2. Density = Number of people/Sq kilometer
3. Urban = percentage of population living in
cities
4. Religion
5. lifeexpf = Average female life expectancy
6. lifeexpm = Average male life expectancy
7. literacy = % of population who read
8. pop_inc = % increase in popn size (1995)
9. babymort = Infant mortality (deaths per 1000)
10. gdp_cap = Gross domestic product/capita
11. Region = Region or economic group
12. calories = Daily calorie intake.
13. aids = Number of AIDS cases
14. birth_rt = Birth rate per 1000 people
15. death_rt = Death rate per 1000 people
16. aids_rt = Number of AIDS cases/100000 people
17. log_gdp = log10(gdp_cap)
18. log_aidsr = log10(aids_rt)
19. b_to_d = Birth to death ratio
20. fertility = average number of children in
family
21. log_pop = log10(population)
22. cropgrow = ??
23. lit_male = % of males who can read
24. lit_fema = % of females who can read
25. Climate = predominant climate
The data file as it appears in SPSS
Consider the data on infant mortality
Stem-Leaf diagram (stem = 10s, leaf = unit digit)
Stems: 0 to 16
Leaves, one row per non-empty stem in increasing stem order:
4455555666666666777778888899
0122223467799
0001123555577788
45567999
135679
011222347
03678
4556679
5
4
1569
0022378
46
7
8
Summary Statistics
median = Q2 = 27
Quartiles
Lower quartile = Q1 = the median of lower half
Upper quartile = Q3 = the median of upper half
$$Q_1 = \frac{12 + 12}{2} = 12, \qquad Q_3 = \frac{66 + 67}{2} = 66.5$$
Interquartile range (IQR)
IQR = Q3 - Q1 = 66.5 – 12 = 54.5
The Outer Fences
lower = Q1 - 3(IQR) = 12 – 3(54.5) = - 151.5
upper = Q3 + 3(IQR) = 66.5 + 3(54.5) = 230.0
No observations are outside of the outer fences
The Inner Fences
lower = Q1 – 1.5(IQR) = 12 – 1.5(54.5) = - 69.75
upper = Q3 + 1.5(IQR) = 66.5 + 1.5(54.5) = 148.25
Only one observation (168 – Afghanistan) is
outside of the inner fences – (mild outlier)
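The fence arithmetic for this example can be reproduced with a short Python check. It takes the quartiles Q1 = 12 and Q3 = 66.5 and the largest observation, 168, as given above. It does not recompute them from the data file.

```python
q1, q3 = 12, 66.5          # quartiles as stated in the example
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # outer fences

largest = 168              # largest observation (Afghanistan)
is_mild = inner[1] < largest <= outer[1]   # beyond inner fence, within outer

print(iqr)      # 54.5
print(inner)    # (-69.75, 148.25)
print(outer)    # (-151.5, 230.0)
print(is_mild)  # True
```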
Box-Whisker Plot of Infant Mortality
[Figure: box-whisker plot; axis: Infant Mortality (0 to 200)]
Example 2
In this example we are looking at the weight
gains (grams) for rats under six diets differing
in level of protein (High or Low) and source
of protein (Beef, Cereal, or Pork).
– Ten test animals for each diet
Table
Gains in weight (grams) for rats under six diets
differing in level of protein (High or Low)
and source of protein (Beef, Cereal, or Pork)
Level        High Protein                Low Protein
Source       Beef     Cereal   Pork      Beef     Cereal   Pork
Diet         1        2        3         4        5        6
             73       98       94        90       107      49
             102      74       79        76       95       82
             118      56       96        90       97       73
             104      111      98        64       80       86
             81       95       102       86       98       81
             107      88       102       51       74       97
             100      82       108       72       74       106
             87       77       91        90       67       70
             117      86       120       95       89       61
             111      92       105       78       58       82
Median       103.0    87.0     100.0     82.0     84.5     81.5
Mean         100.0    85.9     99.5      79.2     83.9     78.7
IQR          24.0     18.0     11.0      18.0     23.0     16.0
PSD          17.78    13.33    8.15      13.33    17.04    11.85
Variance     229.11   225.66   119.17    192.84   246.77   273.79
Std. Dev.    15.14    15.02    10.92     13.89    15.71    16.55
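The summary rows of the table can be recomputed from the ten weight gains in each column. The sketch below is an illustration in Python, assuming the quartiles are the medians of the lower and upper halves of the sorted data and that PSD = IQR/1.35, as defined earlier. The function name `summarize` is hypothetical.

```python
import statistics

# Weight gains (grams) for the six diets, ten rats each.
diets = {
    1: [73, 102, 118, 104, 81, 107, 100, 87, 117, 111],   # High, Beef
    2: [98, 74, 56, 111, 95, 88, 82, 77, 86, 92],         # High, Cereal
    3: [94, 79, 96, 98, 102, 102, 108, 91, 120, 105],     # High, Pork
    4: [90, 76, 90, 64, 86, 51, 72, 90, 95, 78],          # Low, Beef
    5: [107, 95, 97, 80, 98, 74, 74, 67, 89, 58],         # Low, Cereal
    6: [49, 82, 73, 86, 81, 97, 106, 70, 61, 82],         # Low, Pork
}

def summarize(gains):
    """Median, mean, IQR, PSD, variance and standard deviation of one diet."""
    xs = sorted(gains)
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])             # median of lower half
    q3 = statistics.median(xs[len(xs) - half:])   # median of upper half
    iqr = q3 - q1
    return {
        "median": statistics.median(xs),
        "mean": statistics.mean(xs),
        "iqr": iqr,
        "psd": iqr / 1.35,
        "variance": statistics.variance(xs),      # divisor n - 1
        "sd": statistics.stdev(xs),
    }

for diet, gains in diets.items():
    row = {k: round(v, 2) for k, v in summarize(gains).items()}
    print(diet, row)
```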
Box Plots: Weight Gains for Six Diets
[Figure: box plots of weight gain for diets 1 to 6 (High Protein: Beef, Cereal, Pork; Low Protein: Beef, Cereal, Pork); vertical axis: Weight Gain (40 to 130); horizontal axis: Diet; legend: Non-Outlier Max, Non-Outlier Min, 75%, 25%, Median]
Conclusions
• Weight gain is higher for the high protein
meat diets
• Increasing the level of protein increases weight gain, but only if the source of protein is a meat source
Multivariate Data