Anatomy of the histogram
Those who don’t know
statistics are condemned to
reinvent it…
David Freedman
All you ever wanted to know
about the histogram and
more ...
Distribution of number of graphics on web pages (N = 1873)
Mean = 17.93, Median = 16.00, Std. Dev = 17.92
[Histogram: graphic count (0 to 95) on the horizontal axis, frequency (0 to 400) on the vertical axis]
Horizontal Scale
Distribution of redundant link % on web pages (N = 1861)
Mean = 22.1, Median = 14, Std. Dev = 37.33
[Histogram: frequency (0 to 1000) on the vertical axis]
Plotting a histogram:
endpoint convention,
plot frequencies,
make equal intervals etc.
Frequency Table
Frequency   Interval   Midpoint
110         0-2        1
430         2-4        3
860         4-6        5
280         6-8        7
180         8-10       9
40          10-12      11
20          12-14      13
10          14-16      15
convention: include the left endpoint in the class interval
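The endpoint convention can be sketched in a few lines of Python. This is a minimal illustration, not the method used to build the slides: the function names `class_interval` and `frequency_table` and the sample values are made up for the example, and the interval width of 2 simply mirrors the table above.

```python
def class_interval(value, width=2, start=0):
    """Return the (left, right) class interval that contains value.

    The left endpoint is included and the right endpoint is excluded,
    so a value of exactly 2 falls in 2-4, not in 0-2.
    """
    index = int((value - start) // width)
    left = start + index * width
    return left, left + width


def frequency_table(values, width=2, start=0):
    """Count how many values fall in each equal-width class interval."""
    counts = {}
    for v in values:
        interval = class_interval(v, width, start)
        counts[interval] = counts.get(interval, 0) + 1
    return dict(sorted(counts.items()))


# Made-up sample values (not the data behind the slides).
sample = [0, 1.5, 2, 3.9, 4, 4, 5, 6]
table = frequency_table(sample)
for (left, right), count in table.items():
    print(f"{left}-{right}: {count}")
```

Values that sit exactly on a boundary (2, 4, 6) are counted in the interval that starts there, which is the whole point of fixing a convention before plotting.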
Frequency/Probability
Frequency   Probability   Interval   Midpoint
110         0.06          0-2        1
430         0.22          2-4        3
860         0.45          4-6        5
280         0.15          6-8        7
180         0.09          8-10       9
40          0.02          10-12      11
20          0.01          12-14      13
10          0.01          14-16      15
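The probability column is each frequency divided by the total count. A short Python sketch, using the frequencies from the table (the variable names are chosen for the example only):

```python
# Frequencies per class interval, taken from the frequency table.
frequencies = {
    "0-2": 110, "2-4": 430, "4-6": 860, "6-8": 280,
    "8-10": 180, "10-12": 40, "12-14": 20, "14-16": 10,
}

total = sum(frequencies.values())

# Relative frequency (probability) of each interval, rounded to 2 places.
probabilities = {
    interval: round(count / total, 2)
    for interval, count in frequencies.items()
}

for interval, p in probabilities.items():
    print(f"{interval}: {p:.2f}")
```

Because each value is rounded separately, the rounded probabilities need not add up to exactly 1.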
Number of fonts used on a web page
[Histogram: interval midpoints (1 to 15) on the horizontal axis; the vertical axis shows frequency (0 to 1000) and probability (0 to .5) on the same scale]
Frequency: 110, 430, 860, 280, 180, 40, 20, 10
Probability: .06, .22, .45, .15, .09, .02, .01, .01
Cleaning up a histogram:
getting rid of outliers
Distribution of word count (N = 1903)
Mean = 393.2, Median = 223, Std. Dev = 725.24
Minimum = 0, Maximum = 20,357
[Histogram: frequency (0 to 1600) on the vertical axis]
Distribution of word count (N = 1897), top six removed
Mean = 368.0, Median = 223, Std. Dev = 474.04
Minimum = 0, Maximum = 4132
[Histogram: frequency (0 to 800) on the vertical axis]
Distribution of word count (N = 1873)
Mean = 333.4, Median = 220, Std. Dev = 360.30
Minimum = 0, Maximum = 4132
[Histogram: WORDCNT2 on the horizontal axis, frequency (0 to 500) on the vertical axis]
What can histograms tell you?
Distribution of link count on good & bad web pages
[Histogram: link count (0 to 280) on the horizontal axis, frequency (0 to 300) on the vertical axis, with series for Good Sites and Bad Sites]
Making inferences from histograms:
Incidence of riots and temperature
[Histogram: temperature (30 to 110) on the horizontal axis]
Mean and Median
Mean is arithmetic average, median is 50% point
Mean is point where graph balances
Mean shifts around,
Median does not shift much, is more stable
Computing the median (sort the list first):
for odd N,
find the middle number;
for even N,
interpolate between the middle two,
e.g. if they are 7 and 9, then 8 is the median
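The two cases can be written as a small Python function. This is a sketch for illustration; the name `median` is chosen here, and Python's standard library offers `statistics.median` for the same job.

```python
def median(values):
    """Return the 50% point of a list of numbers."""
    ordered = sorted(values)          # the middle is only meaningful once sorted
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the single middle number.
        return ordered[mid]
    # Even N: halfway between the two middle numbers.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 9]))                            # even N
print(median([11, 9, 10, 4, 5, 7, 8, 11, 20]))   # odd N
```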
The instability of means and standard
deviations
List: 11, 9, 10, 4, 5, 7, 8, 11, 20
Mean = 9.44, Median = 9.00, SD = 4.67
Add two numbers: watch the mean, median, & SD
List: 11, 9, 12, 4, 5, 7, 8, 11, 20
Mean = 9.67, Median = 9.00, SD = 4.74

List: 11, 9, 12, 4, 5, 7, 8, 11, 40, 9
Mean = 11.60, Median = 9.00, SD = 10.31
Add one outlier...
List: 11, 9, 12, 4, 5, 7, 8, 11, 20
Mean = 9.67, Median = 9.00, SD = 4.74

List: 11, 9, 12, 4, 5, 7, 8, 11, 40, 9
Mean = 11.60, Median = 9.00, SD = 10.31

List: 11, 9, 12, 4, 55, 7, 8, 11, 40
Mean = 17.44, Median = 11.00, SD = 17.61
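The same comparison can be reproduced with Python's standard `statistics` module. The sketch below uses the three lists from the slides; the helper name `summarize` and the variable names are made up for the example, and `statistics.stdev` is the sample SD (divides by N-1).

```python
import statistics


def summarize(values):
    """Return (mean, median, SD) rounded to two decimals."""
    return (
        round(statistics.mean(values), 2),
        round(statistics.median(values), 2),
        round(statistics.stdev(values), 2),   # sample SD, N-1 divisor
    )


base = [11, 9, 12, 4, 5, 7, 8, 11, 20]
changed = [11, 9, 12, 4, 5, 7, 8, 11, 40, 9]   # 20 becomes 40, one value added
outlier = [11, 9, 12, 4, 55, 7, 8, 11, 40]     # 5 becomes 55

for name, data in [("base", base), ("changed", changed), ("outlier", outlier)]:
    mean, median, sd = summarize(data)
    print(f"{name:8s} mean={mean:6.2f} median={median:5.2f} sd={sd:6.2f}")
```

The median moves from 9 to 11 across the three lists while the mean nearly doubles and the SD almost quadruples, which is the instability the slides point at.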
Standard Deviation: a measure of
spread
Same mean, different spread
[Figure: two curves with the same mean, each with its SD marked]
The Standard Deviation
          12.00    18.00
          12.00    15.00
          10.00    11.00
           9.00     7.00
           7.00     7.00
           6.00     2.00
           5.00     1.00

Mean       8.71     8.71
SD         2.81     6.34
Minimum    5.00     1.00
Maximum   12.00    18.00
The SD says how far away numbers
on a list are from their average.
Most entries on the list will be
somewhere around one SD away
from the average. Very few will be
more than two or three SDs away.
Understanding the standard
deviation
Let's start with a list: 1, 2, 2, 3
[Histogram: vertical axis 0% to 50%]
Histogram is symmetric about 2,
2 is the mean,
and 50% lies to the left of 2, 50% to the right

List: 1, 2, 2, 3
Average = 2
SD = .8
[Histogram: vertical axis 0% to 50%]

List: 1, 2, 2, 5
Average = 2.5
SD = 1.73
[Histogram: vertical axis 0% to 50%]

List: 1, 2, 2, 7
Average = 3
SD = 2.71
[Histogram: vertical axis 0% to 50%]
Computing the standard deviation
List: 20, 10, 15, 15
Average = 15
Find the deviations from the average:
5, -5, 0, 0
Square the deviations and add them up:
$5^2 + (-5)^2 + 0^2 + 0^2 = 50$
Divide by N-1: 50/3 = 16.67
Take the square root: $\sqrt{16.67} = 4.08$
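The steps above translate directly into Python. This is a minimal sketch with a made-up function name, `sample_sd`; it divides by N-1, as the slide does.

```python
import math


def sample_sd(values):
    """Standard deviation computed step by step with an N-1 divisor."""
    n = len(values)
    average = sum(values) / n
    deviations = [v - average for v in values]        # e.g. 5, -5, 0, 0
    sum_of_squares = sum(d * d for d in deviations)   # e.g. 50
    variance = sum_of_squares / (n - 1)               # e.g. 50 / 3
    return math.sqrt(variance)


print(round(sample_sd([20, 10, 15, 15]), 2))
print(round(sample_sd([1, 2, 2, 5]), 2))
```

The same function gives the SDs quoted for the short lists earlier (1, 2, 2, 5 and 1, 2, 2, 7), which confirms those slides also use the N-1 divisor.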
Properties of the standard
deviation
• The standard deviation is in the
same units as the mean
• The sample standard deviation depends on
sample size through the N-1 divisor (dividing
by N instead underestimates the spread,
i.e. gives a biased measure)
• In normally distributed data about 68%
of the sample lies within 1 SD of the mean
Properties of the Normal
Probability Curve
• The graph is symmetric about the mean
(the part to the right is a mirror image of
the part to the left)
• The total area under the curve equals
100%
• Curve is always above horizontal axis
• Appears to stop after a certain point (the
curve gets really low)
• Mean ± 1 SD covers about 68% of the area
• Mean ± 2 SD covers about 95% of the area
• Mean ± 3 SD covers about 99.7% of the area
• You can disregard the rest of the curve
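The 68%, 95% and 99.7% figures can be checked numerically. For a normal curve, the area within k SDs of the mean equals erf(k / sqrt(2)); the sketch below uses Python's `math.erf`, and the function name `area_within` is made up for the example.

```python
import math


def area_within(k):
    """Fraction of the area under a normal curve within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k) * 100:.1f}%")
```

The result does not depend on the particular mean or SD, only on how many SDs away from the mean the cut-off is.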
Distribution of judges' ratings for the Webby Awards (N = 1867)
Mean = 6.3, Median = 6.3, Std. Dev = 1.98
Skewness = -.43, Kurtosis = -.201
[Histogram: rating (1.0 to 10.0) on the horizontal axis, frequency (0 to 500) on the vertical axis]
It is a remarkable fact that many histograms in
real life tend to follow the Normal Curve.
For such histograms, the mean and SD are
good summary statistics.
The average pins down the center, while the SD
gives the spread.
For histograms that do not follow the normal
curve, the mean and SD are not good
summary statistics.
What happens when the histogram is not normal ...
Distribution of word count on web pages
Mean = 348.3, Std. Dev = 384.83
[Histogram: frequency (0 to 500) on the vertical axis]
3 SD = 384 × 3 = 1152
Mean - 3 SD = 348 - 1152 = -804, so a normal curve with this mean and SD
would put about 30% of the sample at a negative count, which is impossible
When the SD is influenced by outliers,
use the interquartile range:
75th percentile - 25th percentile
Note.
A percentile is a score below which a certain % of the sample falls
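A short Python sketch of the interquartile range, using `statistics.quantiles` from the standard library (Python 3.8+). The function name `iqr` is made up for the example, and the cut points follow that function's default interpolation rule; other packages may place the quartiles slightly differently.

```python
import statistics


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)   # three quartile cut points
    return q3 - q1


data = [11, 9, 12, 4, 55, 7, 8, 11, 40]       # list with an outlier, from the slides
extreme = [11, 9, 12, 4, 5500, 7, 8, 11, 40]  # same list, outlier made far larger

print(iqr(data), round(statistics.stdev(data), 2))
print(iqr(extreme), round(statistics.stdev(extreme), 2))
```

Making the largest value a hundred times bigger leaves the interquartile range unchanged while the SD grows enormously, which is why the range between quartiles is preferred when outliers are present.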
Measures of Normality
• Visual examination
• Skewness: measure of symmetry
[Figure: three example curves labeled Positively Skewed, Negatively Skewed, and Symmetric]
Kurtosis: Does it cluster in the
middle?
Kurtosis is based on the size of a distribution's tails.
Distributions with large tails: leptokurtic
Distributions with small tails: platykurtic
Distributions with normal tails: mesokurtic
[Figure: three example curves labeled Large Tail, Small Tail, and Normal Tail]
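Skewness and kurtosis can be sketched from central moments. The version below is the simple moment-based definition (an assumption made for this example); statistical packages typically apply small-sample corrections, so their output, including the values printed on these slides, will differ somewhat. All function names are made up for the illustration.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness(values):
    """Zero for symmetric data, positive for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Zero for a normal curve, positive for heavy tails, negative for light tails."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


print(skewness([1, 2, 2, 3]))   # symmetric list
print(skewness([1, 2, 2, 7]))   # long right tail
print(skewness([1, 6, 6, 7]))   # long left tail
```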
Positively Skewed and Leptokurtic: Word Count (N = 1903)
Mean = 393.2, Median = 223, Std. Dev = 725.24
Skewness = 13.62, Kurtosis = 321.84
[Histogram: frequency (0 to 1600) on the vertical axis]
Distribution of word count (N = 1897), top six removed
Mean = 368.0, Median = 223, Std. Dev = 474.04
Skewness = 3.49, Kurtosis = 16.40
[Histogram: frequency (0 to 800) on the vertical axis]
Degrees of Freedom
• The number of independent pieces of
information remaining after estimating one or
more parameters
• Example: List = 1, 2, 3, 4
Average = 2.5
• For the average to remain the same, three of the
numbers can be anything you want; the fourth is
then fixed
• New List = 1, 5, 2.5, __ Average = 2.5
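The missing entry follows from the fixed average: the list must sum to N times the average, so the last value is whatever is left over. A minimal Python sketch, with a made-up function name:

```python
def forced_last_value(free_values, target_average):
    """Value the final entry must take so the list keeps the target average."""
    n = len(free_values) + 1                  # list length including the last entry
    required_total = target_average * n       # the sum the whole list must have
    return required_total - sum(free_values)


print(forced_last_value([1, 2, 3], 2.5))      # original list: last entry must be 4
print(forced_last_value([1, 5, 2.5], 2.5))    # new list: fills the blank
```

With four numbers and one estimated parameter (the average), only three values are free to vary, so the list has 3 degrees of freedom. This is the same N-1 that appears as the divisor in the standard deviation.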