251descr 8/16/01 (Open this document in 'Outline' view!)
ECONOMICS 251 COURSE OUTLINE
A. Introduction
1. Definitions
Define Statistics, Descriptive and Analytic Statistics, Induction and Deduction.
2. Uses of Statistics
B. Sources and Types of Data
1. Data
Define data sets, observation, unit of observation. Qualitative and quantitative data. Nominal, ordinal,
interval and ratio data. Discrete vs continuous data.
a. Qualitative Data
(i) Nominal Data: There is no natural number scale - numbers are only used to define categories, so that
no operations like addition or multiplication are valid.
(ii) Ordinal Data: Numbers are used only to order things (e.g. first, second, third). Differences between
ranks do not always have the same meaning. Most mathematical operations are still not valid.
b. Quantitative Data
(i) Interval Data: Differences between values have consistent meaning, but, like Celsius temperature,
there is no natural origin, so that, although addition and subtraction can be used, multiplication and
division have no real meaning.
(ii) Ratio Data: there is a meaningful origin, so that multiplication and division are valid.
2. Sources
Define primary and secondary sources, internal and external data.
3. Cross Section and Time Series Data
a. Cross Section Data
b. Time Series Data.
i. Indices
ii. Real Values
iii. Rates of change
iv. Logarithms
C. Presentation of Data
1. Classification
Define collectively exhaustive and mutually exclusive classes. These are not the same thing. Collectively
exhaustive means that every item you are considering has a place in a class. Mutually exclusive means that
if an item belongs in any given class, it does not belong in another class as well.
2. Tables
Define parts of tables. See 251pttbl.
3. Charts and Graphs
Define parts of graphs
D. Frequency Distributions and Populations.
1. Definitions
Meaning of Population, Frame, Census, Sample, Grouped Data, Frequency, Example of Frequency
Distribution, Relative Frequency. Width of a class interval.
$w = \dfrac{\text{largest} - \text{smallest}}{\text{number of classes}}$ (Always round this result up!)
2. Graphs of the Frequency Distribution.
a. The Histogram
b. The Frequency Polygon
c. The Cumulative Frequency Distribution (Ogive).
d. Relative Frequencies.
e. Smoothed Histograms
E. Sampling and Descriptive Statistics.
1. Sampling to Learn About a Population.
Infinite and finite populations, target and sampled populations, the Stability of Mass Data.
2. The Meaning of Random Sampling.
A simple random sample of n items taken from a population of N items must be selected in such a way
that all combinations of n items are equally likely.
3. Descriptive Statistics.
a. Measures of Central Tendency. (Where's the middle of the data?)
b. Measures of Dispersion. (How spread out are the data?)
c. Measures of Asymmetry etc. (What else can I say about the shape?)
F. Measures of Central Tendency.
1. The Arithmetic Mean of Ungrouped Data.
a. The Population Mean.
$\mu = \dfrac{\sum x}{N}$
b. The Sample Mean.
$\bar{x} = \dfrac{\sum x}{n}$
2. The Arithmetic Mean of Grouped Data.
For grouped data generally substitute $\sum f$ for $\sum$, so that the sample mean becomes
$\bar{x} = \dfrac{\sum fx}{n}$. For $x$ substitute the midpoint of the group. This is
defined for our purposes as the arithmetic mean of the lower limit of the group in question and the lower
limit of the next group. In other words, if we have the group 10 to 10.99, followed by 11 to 11.99, the
midpoint of the first group is 10.50, not 10.495.
3. The Weighted Arithmetic Mean.
$\mu = \dfrac{\sum wx}{\sum w}$, $\bar{x} = \dfrac{\sum wx}{\sum w}$
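The weighted mean and the grouped mean are the same calculation: for grouped data the class midpoints stand in for x and the frequencies f act as the weights. The following is a minimal sketch, assuming a hypothetical helper named weighted_mean and using the profit-rate classes that appear later in this outline (midpoints 10 to 18, frequencies 3, 3, 5, 3, 1).

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Grouped data: midpoints play the role of x, frequencies f are the weights.
midpoints = [10, 12, 14, 16, 18]
frequencies = [3, 3, 5, 3, 1]

grouped_mean = weighted_mean(midpoints, frequencies)
print(round(grouped_mean, 3))  # 13.467, i.e. 202 / 15
```

With all weights equal, the same function reduces to the ordinary arithmetic mean.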
4. The Median of Ungrouped Data.
Defined simply as the middle point when the data are in order. If there are two middle points, take their
arithmetic mean. In continuous data half the points will be above the median and half below it.
5. The Median of Grouped Data.
$x_{1-p} = L_p + \left[\dfrac{pn - F}{f_p}\right] w$. See formula for fractiles below and remember that the
median is the .5 fractile.
$\text{position} = \frac{1}{2}(n + 1)$.
6. The Mode
Simply the most common point, not very useful in discrete ungrouped data. For grouped data it is defined as
the midpoint of the largest group.
7. Other Means.
a. The Geometric Mean.
$x_g = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n} = \sqrt[n]{\prod x}$ or $\ln(x_g) = \dfrac{1}{n} \sum \ln(x)$
b. The Harmonic Mean.
$\dfrac{1}{x_h} = \dfrac{1}{n} \sum \dfrac{1}{x}$
c. The Root-Mean-Square.
$x_{rms} = \sqrt{\dfrac{1}{n} \sum x^2}$ or $x_{rms}^2 = \dfrac{1}{n} \sum x^2$
d. What Formulas for Means Have in Common.
$f(\bar{x}) = \dfrac{1}{n} \sum f(x)$
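The common pattern in item d can be sketched in code: pick a function f, average f(x) over the data, and undo f to get the mean. The helper names below (generalized_mean and the four wrappers) are illustrative assumptions, not part of the course material, and the sketch assumes all data values are positive so that the logarithm and the reciprocal are defined.

```python
import math


def generalized_mean(xs, f, f_inverse):
    """Solve f(mean) = (1/n) * sum(f(x)) for the mean."""
    return f_inverse(sum(f(x) for x in xs) / len(xs))


def arithmetic_mean(xs):
    return generalized_mean(xs, lambda x: x, lambda y: y)


def geometric_mean(xs):
    # f(x) = ln(x), so ln(x_g) = (1/n) * sum(ln x)
    return generalized_mean(xs, math.log, math.exp)


def harmonic_mean(xs):
    # f(x) = 1/x, so 1/x_h = (1/n) * sum(1/x)
    return generalized_mean(xs, lambda x: 1 / x, lambda y: 1 / y)


def root_mean_square(xs):
    # f(x) = x^2, so x_rms^2 = (1/n) * sum(x^2)
    return generalized_mean(xs, lambda x: x * x, math.sqrt)


data = [1, 2, 4]
for mean in (harmonic_mean, geometric_mean, arithmetic_mean, root_mean_square):
    print(mean.__name__, round(mean(data), 4))
```

For positive data that are not all equal, the four results come out in the order harmonic, geometric, arithmetic, root-mean-square, from smallest to largest.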
8. Measures of Position.
Percentiles, deciles, quintiles, quartiles and fractiles.
The two formulas below are two-step formulas. The first step is multiplying $n + 1$ (or $N + 1$) by $p$. $p$
represents the fractile of the data wanted. For example, if we want the 91st percentile, $p$ is .91. Note that
the number you have found is called $x_{1-p} = x_{1-.91} = x_{.09}$ (i.e. 9% from the top!). If we want the third
quartile, $Q_3 = x_{.25}$, $p$ is $\frac{3}{4}$ or 0.75. If we want the first quartile, $Q_1 = x_{.75}$, $p$ is
$\frac{1}{4}$ or 0.25. Of course, for the median $p = .5$. $N$ or $n$ represents the number of items in the
population or sample, not the number of groups.
a. Finding a Fractile of Grouped Data.
To use this formula, we must first compute the cumulative distribution of the group and determine in which
group the desired fractile is located with the calculation $\text{position} = p(n + 1)$. Once we have found the
group that this is in, let $f_p$ be the frequency of the chosen group, and let $F$ be the cumulative frequency
up to but not including the chosen group. The formula here is $x_{1-p} = L_p + \left[\dfrac{pn - F}{f_p}\right] w$.
In this formula, $w$ is the class interval (the interval between the lower limit of the chosen group and the
lower limit of the next group) and $L_p$ is the lower limit of the chosen group. Suppose that in the example
below we must find the first quartile. Since the first quartile is the .25 fractile, $p$ is .25. To locate the
group use $\text{position} = p(n + 1) = 0.25(16) = 4$.

Profit Rate    f    F
9-10.99%       3    3
11-12.99%      3    6
13-14.99%      5    11
15-16.99%      3    14
17-18.99%      1    15
Total          15

Using the cumulative distribution ($F$) column, we find the fourth item in the sample. Since 4 is above 3 and
below 6 in the $F$ column, we pick the group 11-12.99%. $n$ is 15, and for the group we have picked,
$w = 13 - 11 = 2$, $L_p = 11$, $F = 3$, and $f_p = 3$. If we put these numbers into the formula, we find that
$x_{1-.25} = x_{.75} = 11 + \left[\dfrac{.25(15) - 3}{3}\right] 2 = 11.5$.
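The two-step grouped-data procedure (locate the group with position = p(n + 1), then interpolate with L_p + [(pn - F)/f_p] w) can be sketched as follows. The function name grouped_fractile and its argument layout are assumptions made for illustration. The sketch assigns a position that exactly equals a cumulative frequency to the lower group, and it makes no adjustment when pn - F turns out negative.

```python
def grouped_fractile(p, lower_limits, frequencies, width):
    """Fractile of grouped data: L_p + ((p*n - F) / f_p) * w."""
    n = sum(frequencies)
    position = p * (n + 1)      # step 1: where the fractile sits
    cumulative = 0              # F: cumulative frequency before this group
    for lower, f in zip(lower_limits, frequencies):
        if position <= cumulative + f:
            # step 2: interpolate inside the chosen group
            return lower + (p * n - cumulative) / f * width
        cumulative += f
    raise ValueError("position falls beyond the last group")


lower_limits = [9, 11, 13, 15, 17]
frequencies = [3, 3, 5, 3, 1]

first_quartile = grouped_fractile(0.25, lower_limits, frequencies, width=2)
print(first_quartile)  # 11.5
```

Calling the same function with p = 0.5 gives the grouped median, since the median is the .5 fractile.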
Note: Sometimes $\left[\dfrac{pn - F}{f_p}\right]$ is negative. In this case choose the group before the one you
would ordinarily have chosen. Example: If you want the 19th percentile of the data above,
$\text{position} = p(n + 1) = .19(16) = 3.04$, which would normally take us into 11-12.99. But
$\dfrac{pn - F}{f_p} = \dfrac{.19(15) - 3}{3} = -0.05$, so use the group 9-10.99 instead. But see c below.
b. Finding a Fractile of Ungrouped Data.
This time when we compute $\text{position} = p(n + 1)$, we divide it into an integer part, $a$, and a
fractional part, $.b$. For example, if $n = 10$, and we wish to find the first quartile,
$p = 0.25$, so that $p(n + 1) = 0.25(11) = 2.75$. Then $a = 2$, and $.b = .75$. Now find $x_a$ and $x_{a+1}$,
in this case $x_2$ and $x_3$, and use the formula $x_{1-p} = x_a + .b(x_{a+1} - x_a)$. For example, if our sample
consists of 10 numbers, 1, 5, 7, 9, 9, 11, 13, 14, 17, 19, $x_a = x_2 = 5$ and $x_{a+1} = x_3 = 7$, so that
$x_{1-p} = x_{.75} = 5 + 0.75(7 - 5) = 6.5$.
c. Experimental formula (Don't read this!)
Because of problems with the grouped data formula above, I intend to experiment with a new pair of
formulas: $\text{position} = 1 + p(n - 1) = a.b$ (the position formula can be used with both grouped and
ungrouped data) and $x_{1-p} = L_p + \left[\dfrac{p(n - 1) + 0.5 - F}{f_p}\right] w$.
Example: Using the data in 8a ($n = 15$).
First quartile: $\text{position} = 1 + p(n - 1) = 1 + .25(14) = 4.5$. This is in group 11-12.99.
$x_{1-.25} = x_{.75} = 11 + \left[\dfrac{.25(14) + 0.5 - 3}{3}\right] 2 = 11.67$.
Median: $\text{position} = 1 + p(n - 1) = 1 + .5(14) = 8$ (same as with the old formula). This is in group 13-14.99.
$x_{1-.5} = x_{.5} = 13 + \left[\dfrac{.5(14) + 0.5 - 6}{5}\right] 2 = 13.6$.
Third quartile: $\text{position} = 1 + p(n - 1) = 1 + .75(14) = 11.5$. This is in group 15-16.99.
$x_{1-.75} = x_{.25} = 15 + \left[\dfrac{.75(14) + 0.5 - 11}{3}\right] 2 = 15$.
Seventy-fourth percentile: $\text{position} = 1 + p(n - 1) = 1 + .74(14) = 11.36$. This is in group 13-14.99. Why? For
13-14.99, the cumulative frequency through the end of the group is 11. This means that numbers up to $x_{11}$ are
in 13-14.99 or lower groups and that $x_{12}$ and numbers above it are in 15-16.99 and higher groups. Thus we set
the boundary at 11.5.
$x_{1-.74} = x_{.26} = 13 + \left[\dfrac{.74(14) + 0.5 - 6}{5}\right] 2 = 14.94$.
Nineteenth percentile: $\text{position} = 1 + p(n - 1) = 1 + .19(14) = 3.66$. This is in group 11-12.99.
$x_{1-.19} = x_{.81} = 11 + \left[\dfrac{.19(14) + 0.5 - 3}{3}\right] 2 = 11.11$.
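The experimental pair of formulas can be checked against the five worked cases with a short sketch. The function name experimental_fractile is an assumption; the group boundaries are placed half a position above each cumulative frequency, as in the seventy-fourth percentile explanation, and a position that lands exactly on a boundary is assigned to the higher group, as in the third quartile case.

```python
def experimental_fractile(p, lower_limits, frequencies, width):
    """L_p + ((p*(n - 1) + 0.5 - F) / f_p) * w, with position = 1 + p*(n - 1)."""
    n = sum(frequencies)
    position = 1 + p * (n - 1)
    cumulative = 0              # F: cumulative frequency before this group
    for lower, f in zip(lower_limits, frequencies):
        # The group covers positions below cumulative + f + 0.5.
        if position < cumulative + f + 0.5:
            return lower + (p * (n - 1) + 0.5 - cumulative) / f * width
        cumulative += f
    raise ValueError("position falls beyond the last group")


lower_limits = [9, 11, 13, 15, 17]
frequencies = [3, 3, 5, 3, 1]

for p in (0.25, 0.5, 0.75, 0.74, 0.19):
    print(p, round(experimental_fractile(p, lower_limits, frequencies, 2), 2))
```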
G. Measures of Dispersion and Asymmetry.
1. Range
$\text{Range} = \text{highest number} - \text{lowest number}$ or $\text{highest midpoint} - \text{lowest midpoint}$.
Interquartile Range: $IQR = Q_3 - Q_1$.
2. The Variance and Standard Deviation of Ungrouped Data.
a. The Population Variance - Definitional and Computational Formulas.
Definitional $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
Computational $\sigma^2 = \dfrac{\sum x^2}{N} - \mu^2$
Standard Deviation $= \sqrt{\text{variance}}$
b. The Sample Variance.
Definitional $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
Computational $s^2 = \dfrac{\sum x^2 - n\bar{x}^2}{n - 1}$
The computational formula is one of the most important formulas you will learn. Note that $\sum x^2$ is not
the same as $\left(\sum x\right)^2$. For example, if $x$ is 2, 3, 5, $\sum x^2 = 2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38$, not
$(2 + 3 + 5)^2 = 10^2 = 100$.
Example: Use $x = 2, 3, 5$.

Computational Method
x        x^2
2        4
3        9
5        25
10       38

Definitional Method
x        $(x - \bar{x})$    $(x - \bar{x})^2$
2        -1.33333           1.77778
3        -0.33333           0.11111
5        1.66667            2.77778
10       0.00001            4.66667

From this we find $\sum x = 10$, $\sum x^2 = 38$, $\bar{x} = \dfrac{\sum x}{n} = \dfrac{10}{3} = 3.33333$ and
$\sum (x - \bar{x})^2 = 4.66667$. Note that $\sum (x - \bar{x})$ should be zero, but is not because of rounding.
Now, if we use the computational method, we can use
$s^2 = \dfrac{\sum x^2 - n\bar{x}^2}{n - 1} = \dfrac{38 - 3(3.33333)^2}{3 - 1} = \dfrac{4.6667}{2} = 2.3333$. (Some texts prefer
$s^2 = \dfrac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n - 1} = \dfrac{38 - \frac{1}{3}(10)^2}{3 - 1} = \dfrac{4.66666667}{2} = 2.33333$,
which gives us a little more accuracy for a little more work.) If we use the definitional method,
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{4.66667}{2} = 2.33333$, but note that
we had to do three subtractions instead of 1.
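The agreement between the definitional and computational formulas for the sample variance is easy to confirm numerically. The sketch below uses the x = 2, 3, 5 example; the two function names are illustrative, and the standard library's statistics.variance is included only as an independent cross-check.

```python
import statistics


def sample_variance_definitional(xs):
    """s^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_computational(xs):
    """s^2 = (sum(x^2) - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


xs = [2, 3, 5]
print(sum(x * x for x in xs), sum(xs) ** 2)           # 38 100
print(round(sample_variance_definitional(xs), 5))     # 2.33333
print(round(sample_variance_computational(xs), 5))    # 2.33333
print(round(statistics.variance(xs), 5))              # 2.33333
```

The computational form subtracts two numbers of similar size, so with data that have a large mean and a small spread it can lose precision in floating-point arithmetic; the definitional form is the safer one to program even though it is more work by hand.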
c. The Coefficient of Variation.
$C = \dfrac{\text{std. deviation}}{\text{mean}}$
d. Chebyshef’s Inequality and the Empirical Rule
Chebyshef Inequality: $P\left(|x - \mu| \ge k\sigma\right) \le \dfrac{1}{k^2}$
Empirical rule: (For Symmetrical Unimodal distributions only)
68% within one standard deviation of the mean, 95% within two and almost all within three.
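The inequality holds for any set of numbers, which can be illustrated by treating a small data set as a whole population and counting how many items lie at least k standard deviations from the mean. The sketch below reuses the ten numbers from the ungrouped fractile example as a population; that choice, and the helper names, are assumptions made for illustration.

```python
import math


def population_mean_and_sd(xs):
    """Population mean and standard deviation (divisor N)."""
    n_items = len(xs)
    mu = sum(xs) / n_items
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n_items)
    return mu, sigma


def share_outside(xs, k):
    """Proportion of items at least k standard deviations from the mean."""
    mu, sigma = population_mean_and_sd(xs)
    return sum(1 for x in xs if abs(x - mu) >= k * sigma) / len(xs)


population = [1, 5, 7, 9, 9, 11, 13, 14, 17, 19]
for k in (1.5, 2, 3):
    # The observed share never exceeds the bound 1 / k^2.
    print(k, share_outside(population, k), "<=", round(1 / k ** 2, 3))
```

The bound is usually far from tight: here only 20% of the items lie 1.5 or more standard deviations out, against a bound of about 44%.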
3. The Variance and Standard Deviation of Grouped Data.
For grouped data generally substitute $\sum f$ for $\sum$.
4. Skewness and Kurtosis.
Population skewness, the 3rd k-statistic, coefficients of skewness; population kurtosis, the 4th k-statistic, the
coefficient of excess; leptokurtic, platykurtic and mesokurtic distributions.
The usual measurement of skewness is often called the third moment about the mean.
(The population variance is the second.) The formula for population skewness is:
$\mu_3 = \dfrac{\sum (x - \mu)^3}{N}$.
The corresponding sample statistic is the third k-statistic, $k_3 = \dfrac{n}{(n-1)(n-2)} \sum (x - \bar{x})^3$. The
corresponding computational formulas are
$\mu_3 = \dfrac{1}{N}\left[\sum x^3 - 3\mu \sum x^2 + 2N\mu^3\right]$ and
$k_3 = \dfrac{n}{(n-1)(n-2)}\left[\sum x^3 - 3\bar{x} \sum x^2 + 2n\bar{x}^3\right]$.
To make grouped data formulas, put an $f$ to the right of the $\sum$ sign. Positive values of these formulas
imply skewness to the right, negative values to the left. Note that multiplying all the values of $x$ by two
would multiply the values of these coefficients by eight, but would not change the shape of the distribution.
If we want to compare shapes, we need measurements that will not change if we multiply all values by a
constant. Such a measure would be called the coefficient of relative skewness, with the formulas
$\gamma_1 = \dfrac{\mu_3}{\sigma^3}$ and $g_1 = \dfrac{k_3}{s^3}$. Note that for the Normal distribution $\mu_3 = 0$.
Another measure of skewness is Pearson's measure of skewness,
$SK = \dfrac{3(\text{mean} - \text{mode})}{\text{std. deviation}}$; the median is sometimes used instead of the mode in
this formula.
Example:
Profit Rate    f     x (midpoint)    fx     fx^2    fx^3
9-10.99        3     10              30     300     3000
11-12.99       3     12              36     432     5184
13-14.99       5     14              70     980     13720
15-16.99       3     16              48     768     12288
17-18.99       1     18              18     324     5832
Total          15                    202    2804    40024

So $\sum f = n = 15$, $\sum fx = 202$, $\sum fx^2 = 2804$, $\sum fx^3 = 40024$, so that
$\bar{x} = \dfrac{\sum fx}{n} = \dfrac{202}{15} = 13.467$ and
$s^2 = \dfrac{\sum fx^2 - n\bar{x}^2}{n - 1} = \dfrac{2804 - 15(13.467)^2}{15 - 1} = \dfrac{83.733}{14} = 5.981$, which means
$s = \sqrt{5.981} = 2.446$. $C = \dfrac{s}{\bar{x}} = \dfrac{2.446}{13.467} = 0.182$.
To measure skewness, use one of the following three results.
$k_3 = \dfrac{n}{(n-1)(n-2)}\left[\sum fx^3 - 3\bar{x} \sum fx^2 + 2n\bar{x}^3\right]
= \dfrac{15}{(14)(13)}\left[40024 - 3(13.467)(2804) + 2(15)(13.467)^3\right] = \dfrac{15(8.249)}{(14)(13)} = 0.680$, or
Relative Skewness $g_1 = \dfrac{k_3}{s^3} = \dfrac{0.680}{2.446^3} = .046$, or
Pearson's Measure of Skewness $SK = \dfrac{3(\text{mean} - \text{mode})}{\text{std. deviation}} = \dfrac{3(13.467 - 14)}{2.446} = -0.654$
(or $-0.163$ if the median, 13.6, is used in place of the mode). Note that, in this case,
Pearson's Measure and Relative Skewness contradict each other as to the direction of skewness.
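The grouped-data computational formulas for the mean, variance, third k-statistic and relative skewness can be reproduced in a few lines. This is a minimal sketch using the profit-rate table (midpoints and frequencies); the variable names are illustrative. It carries full precision throughout, so intermediate values can differ slightly in the last digit from hand calculations made with rounded figures.

```python
import math

midpoints = [10, 12, 14, 16, 18]
frequencies = [3, 3, 5, 3, 1]

n = sum(frequencies)
sum_fx = sum(f * x for x, f in zip(midpoints, frequencies))
sum_fx2 = sum(f * x ** 2 for x, f in zip(midpoints, frequencies))
sum_fx3 = sum(f * x ** 3 for x, f in zip(midpoints, frequencies))

mean = sum_fx / n
variance = (sum_fx2 - n * mean ** 2) / (n - 1)     # computational s^2
std_dev = math.sqrt(variance)

# Third k-statistic, computational form with f inserted after each sum.
k3 = n / ((n - 1) * (n - 2)) * (sum_fx3 - 3 * mean * sum_fx2 + 2 * n * mean ** 3)
g1 = k3 / std_dev ** 3                             # relative skewness

print(sum_fx, sum_fx2, sum_fx3)                    # 202 2804 40024
print(round(mean, 3), round(variance, 3), round(std_dev, 3))
print(round(k3, 3), round(g1, 3))
```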
The measures of kurtosis are, for populations,
$\mu_4 = \dfrac{\sum (x - \mu)^4}{N} = \dfrac{1}{N}\left[\sum x^4 - 4\mu \sum x^3 + 6\mu^2 \sum x^2\right] - 3\mu^4$
and, for samples,
$k_4 = \dfrac{n^2}{(n-1)(n-2)(n-3)}\left[\dfrac{n+1}{n} \sum (x - \bar{x})^4 - \dfrac{3(n-1)^3}{n^2} s^4\right]$.
$k_4$ can be considered an estimate of $\mu_4 - 3\sigma^4$. To get a measurement of shape use the coefficient of
excess $\gamma_2 = \dfrac{\mu_4}{\sigma^4} - 3$ or $g_2 = \dfrac{k_4}{s^4}$. Since the Normal distribution has
$\mu_4 = 3\sigma^4$, the coefficient of excess is zero for the Normal distribution.
Kurtosis has traditionally been considered a measure of the peakedness of a distribution relative to the
Normal distribution, though there are some exceptions to this interpretation. If the coefficient of excess is
positive, we may call a distribution leptokurtic or sharp-peaked. If the coefficient of excess is negative, the
distribution can be called platykurtic or flat-peaked. If the coefficient of excess is close to zero, we call the
distribution mesokurtic, middle-peaked. A symmetric, mesokurtic distribution is essentially Normal.
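The sample kurtosis measure, the fourth k-statistic, is built from the sum of fourth powers of deviations and the squared sample variance, and dividing it by the squared variance gives the coefficient of excess. The sketch below applies this to ungrouped data, using the ten numbers from the ungrouped fractile example purely as an illustration; the function names and the 0.1 cut-off for "close to zero" are assumptions, not part of the course material.

```python
def fourth_k_statistic(xs):
    """k4 = n^2/((n-1)(n-2)(n-3)) * [(n+1)/n * sum(d^4) - 3(n-1)^3/n^2 * s^4]."""
    n = len(xs)
    mean = sum(xs) / n
    sum_d2 = sum((x - mean) ** 2 for x in xs)
    sum_d4 = sum((x - mean) ** 4 for x in xs)
    s2 = sum_d2 / (n - 1)                      # sample variance
    bracket = (n + 1) / n * sum_d4 - 3 * (n - 1) ** 3 / n ** 2 * s2 ** 2
    return n ** 2 / ((n - 1) * (n - 2) * (n - 3)) * bracket


def coefficient_of_excess(xs):
    """g2 = k4 / s^4."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return fourth_k_statistic(xs) / s2 ** 2


def describe_peakedness(g2, tolerance=0.1):
    # The tolerance for "close to zero" is an arbitrary illustrative choice.
    if g2 > tolerance:
        return "leptokurtic"
    if g2 < -tolerance:
        return "platykurtic"
    return "mesokurtic"


sample = [1, 5, 7, 9, 9, 11, 13, 14, 17, 19]
g2 = coefficient_of_excess(sample)
print(round(g2, 3), describe_peakedness(g2))
```

The formula needs at least four observations, since the divisor contains n - 3.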
Example (using definitional formulas):
Profit Rate  f   x (midpoint)  fx   $(x-\bar{x})$  $f(x-\bar{x})$  $f(x-\bar{x})^2$  $f(x-\bar{x})^3$  $f(x-\bar{x})^4$
9-10.99      3   10            30   -3.467         -10.400         36.053            -124.985          433.281
11-12.99     3   12            36   -1.467         -4.400          6.453             -9.465            13.882
13-14.99     5   14            70   0.533          2.667           1.422             0.759             0.405
15-16.99     3   16            48   2.533          7.600           19.253            48.775            123.564
17-18.99     1   18            18   4.533          4.533           20.551            93.164            422.348
Total        15                202                 0.000           83.732            8.249             993.479

So $\sum f = n = 15$, $\sum fx = 202$, $\sum f(x - \bar{x}) = 0$, $\sum f(x - \bar{x})^2 = 83.732$,
$\sum f(x - \bar{x})^3 = 8.249$ and $\sum f(x - \bar{x})^4 = 993.479$, so that
$\bar{x} = \dfrac{\sum fx}{n} = \dfrac{202}{15} = 13.467$ and
$s^2 = \dfrac{\sum f(x - \bar{x})^2}{n - 1} = \dfrac{83.732}{14} = 5.981$, which means $s = \sqrt{5.981} = 2.446$.
$C = \dfrac{s}{\bar{x}} = \dfrac{2.446}{13.467} = 0.182$.
To measure skewness, use one of the following three results.
$k_3 = \dfrac{n}{(n-1)(n-2)} \sum f(x - \bar{x})^3 = \dfrac{15(8.249)}{(14)(13)} = 0.680$, or
Relative Skewness $g_1 = \dfrac{k_3}{s^3} = \dfrac{0.680}{2.446^3} = .046$, or
Pearson's Measure of Skewness $SK = \dfrac{3(\text{mean} - \text{mode})}{\text{std. deviation}} = \dfrac{3(13.467 - 14)}{2.446} = -0.654$
(or $-0.163$ if the median, 13.6, is used in place of the mode). Note that, in this case,
Pearson's Measure and Relative Skewness contradict each other as to the direction of skewness.
$k_4 = \dfrac{n^2}{(n-1)(n-2)(n-3)}\left[\dfrac{n+1}{n} \sum f(x - \bar{x})^4 - \dfrac{3(n-1)^3}{n^2} s^4\right]
= \dfrac{15^2}{(14)(13)(12)}\left[\dfrac{16}{15}(993.479) - \dfrac{3(14)^3}{15^2}(5.981)^2\right] = -25.66$. So
$g_2 = \dfrac{k_4}{s^4} = \dfrac{-25.66}{5.981^2} = -0.717$. The negative sign implies that the distribution is
platykurtic.
5. Review
a. Grouped Data.
b. Ungrouped Data.
Appendix: Explanation of Sample Formulas (Not for student consumption)
1. The Sample Variance.
The Sample Variance is defined as
$s^2 = \dfrac{1}{n-1} \sum (x - \bar{x})^2 = \dfrac{1}{n-1}\left[\sum x^2 - n\bar{x}^2\right]$. If $s^2$ has an expected
value of $\sigma^2$ it must be true that
$E\left[\sum (x - \bar{x})^2\right] = E\left[\sum x^2 - n\bar{x}^2\right] = (n - 1)\sigma^2$. We can assume, without loss
of generality, that $\mu = E(x) = 0$. Under these conditions, the Variance is defined as
$\sigma^2 = E(x - \mu)^2 = E\left(x^2\right)$. Thus $E\left(\sum x^2\right) = n\sigma^2$. An expression like
$\bar{x}^2 = \dfrac{1}{n^2}\left(\sum x\right)^2$ has terms like $\dfrac{1}{n^2} x_1 x_2$. Because of the
independence assumption on the sample, all these terms have expected values of zero except for terms with
two identical subscripts and
$E\left(\bar{x}^2\right) = E\left[\dfrac{1}{n^2}\left(\sum x\right)^2\right] = \dfrac{1}{n^2} E\left(\sum x^2\right)
= \dfrac{1}{n^2} n\sigma^2 = \dfrac{1}{n}\sigma^2$. Thus
$E\left[\sum x^2 - n\bar{x}^2\right] = E\left(\sum x^2\right) - nE\left(\bar{x}^2\right)
= n\sigma^2 - n\dfrac{1}{n}\sigma^2 = (n - 1)\sigma^2$.
2. The Third k Statistic.
The third k statistic is
$k_3 = \dfrac{n}{(n-1)(n-2)} \sum (x - \bar{x})^3 = \dfrac{n}{(n-1)(n-2)}\left[\sum x^3 - 3\bar{x} \sum x^2 + 2n\bar{x}^3\right]$.
If $k_3$ has an expected value of $\mu_3$, it must be true that
$E\left[\sum (x - \bar{x})^3\right] = E\left[\sum x^3 - 3\bar{x} \sum x^2 + 2n\bar{x}^3\right] = \dfrac{(n-1)(n-2)}{n}\mu_3$.
We can assume, without loss of generality, that $\mu = E(x) = 0$. Under these conditions, the skewness is
defined as $\mu_3 = E(x - \mu)^3 = E\left(x^3\right)$. Thus $E\left(\sum x^3\right) = n\mu_3$. An expression like
$\bar{x}^3 = \dfrac{1}{n^3}\left(\sum x\right)^3$ has terms like $\dfrac{1}{n^3} x_1 x_2 x_3$. Because of the
independence assumption on the sample, all these terms have expected values of zero except for terms with
three identical subscripts and
$E\left(\bar{x}^3\right) = E\left[\dfrac{1}{n^3}\left(\sum x\right)^3\right] = \dfrac{1}{n^3} E\left(\sum x^3\right)
= \dfrac{1}{n^3} n\mu_3 = \dfrac{1}{n^2}\mu_3$. By the same reasoning $E\left(\bar{x} \sum x^2\right) = \mu_3$. Thus
$E\left[\sum x^3 - 3\bar{x} \sum x^2 + 2n\bar{x}^3\right] = E\left(\sum x^3\right) - 3E\left(\bar{x} \sum x^2\right) + 2nE\left(\bar{x}^3\right)
= n\mu_3 - 3\mu_3 + 2n\dfrac{1}{n^2}\mu_3 = \left(n - 3 + \dfrac{2}{n}\right)\mu_3
= \dfrac{n^2 - 3n + 2}{n}\mu_3 = \dfrac{(n-1)(n-2)}{n}\mu_3$.
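The two expected-value results above can be checked exactly, without simulation, by listing every possible sample of independent draws from a tiny population and averaging the statistic over all of them. The sketch below is an illustration under assumptions: the three-value population (0, 1, 4, equally likely), the sample size n = 3, and the function names are all arbitrary choices, and exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import product


def sample_variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def third_k_statistic(xs):
    n = len(xs)
    mean = sum(xs) / n
    return Fraction(n, (n - 1) * (n - 2)) * sum((x - mean) ** 3 for x in xs)


# Hypothetical population: each value is equally likely on every draw.
population = [Fraction(0), Fraction(1), Fraction(4)]
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance
mu3 = sum((x - mu) ** 3 for x in population) / N      # population skewness

# All equally likely samples of n independent draws (with replacement).
n = 3
samples = list(product(population, repeat=n))
expected_s2 = sum(sample_variance(s) for s in samples) / len(samples)
expected_k3 = sum(third_k_statistic(s) for s in samples) / len(samples)

print(expected_s2 == sigma2, expected_k3 == mu3)
```

Replacing the divisor n - 1 with n in sample_variance makes the first comparison fail, which is the point of the derivation in item 1.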
3. And now, for considerable extra credit, what can you say about the
expected value of k 4 ?