Download transparency of financial time series.(Topic 1)

Part I: Introduction to Data Analysis
The true foundation of theology is to ascertain the character of God. It is by the
aid of Statistics that law in the social sphere can be ascertained and codified, and
certain aspects of the character of God thereby revealed. The study of statistics is
thus a religious service. — Florence Nightingale (1820-1910).
Statistical Thinking is understanding variation and how to deal with it. In this
course we explore methods for moving as far as possible to the right on this
continuum:
Ignorance --> Uncertainty --> Risk --> Certainty
Types of Data
Categorical vs. Numerical
Discrete vs. Continuous
Nominal Data are the weakest type of measurement for statistical methods. They
can be numbers, but really are just names or labels (not quantities). Same as
Categorical.
Ordinal Data, by their size, rank or order observations on some basis. The
intervals between these numbers, and their ratios, are meaningless.
Interval Data also rank observations according to some dimension, but the
interval or distance between observations has a constant meaning. Readings on
the Fahrenheit temperature scale are examples of interval data; the zero point is
somewhat arbitrary, but a difference of, say, ten degrees means the same thing
everywhere on the scale. We can do addition and subtraction with interval data,
but not multiplication or division.
Ratio Data are the most useful type for statistical analysis. Ratio data are
numbers which by their size rank observations in order of importance and
between which intervals as well as ratios are meaningful. All types of arithmetic
operations can be performed with ratio data.
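The difference between interval and ratio data can be checked with a little arithmetic. The sketch below is only an illustration: the two temperatures are made-up readings, and the two weights are taken from the roster in Example 1. A ratio of temperatures changes when the scale changes, while a ratio of weights does not.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit reading to Celsius."""
    return (f - 32.0) * 5.0 / 9.0

# Interval data: two illustrative (hypothetical) temperature readings.
f_low, f_high = 40.0, 80.0
c_low, c_high = fahrenheit_to_celsius(f_low), fahrenheit_to_celsius(f_high)

ratio_f = f_high / f_low   # 2.0 on the Fahrenheit scale
ratio_c = c_high / c_low   # about 6.0 on the Celsius scale, same two temperatures

# Ratio data: two player weights (pounds) from the Example 1 roster.
LB_TO_KG = 0.45359237
w1_lb, w2_lb = 240.0, 170.0
ratio_lb = w1_lb / w2_lb
ratio_kg = (w1_lb * LB_TO_KG) / (w2_lb * LB_TO_KG)

print(f"Temperature ratio: {ratio_f:.2f} (F) vs {ratio_c:.2f} (C)")
print(f"Weight ratio:      {ratio_lb:.3f} (lb) vs {ratio_kg:.3f} (kg)")
```

Because the zero point of a temperature scale is arbitrary, "twice as hot" depends on the scale chosen; a weight of zero means the same thing in any unit, so the weight ratio is stable.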
Example 1
1998 New York Yankees Roster
No.  Last        First     Position  Bats  Throws  Ht.   Wt.  Born
2    Jeter       Derek     Infield   R     R       6-3   185  6/26/74
11   Knoblauch   Chuck     Infield   R     R       5-9   170  7/7/68
14   Irabu       Hideki    Pitcher   R     R       6-4   240  5/5/69
18   Brosius     Scott     Infield   R     R       6-1   202  8/15/66
19   Sojo        Luis      Infield   R     R       5-11  175  1/3/66
20   Posada      Jorge     Catcher   S     R       6-2   205  8/17/71
20   Davis       X-Chili   Outfield  S     R       6-3   220  1/17/60
21   O'Neill     Paul      Outfield  L     L       6-4   215  2/25/63
22   Bush        Homer     Infield   R     R       5-10  175  11/12/72
24   Martinez    Tino      Infield   L     R       6-2   210  12/7/67
25   Girardi     Joe       Catcher   R     R       5-11  195  10/14/64
26   Hernandez   Orlando   Pitcher   R     R       6-2   190  10/11/69
26   Spencer     Shane     Outfield  R     R       5-11  210  2/20/72
27   Lloyd       Graeme    Pitcher   L     L       6-7   234  4/9/67
28   Curtis      Chad      Outfield  R     R       5-10  185  11/6/68
29   Stanton     Mike      Pitcher   L     L       6-1   215  6/2/67
31   Raines      Tim       Outfield  S     R       5-8   186  9/16/59
33   Wells       David     Pitcher   L     L       6-4   225  5/20/63
36   Cone        David     Pitcher   L     R       6-1   190  1/2/63
39   Strawberry  Darryl    Outfield  L     L       6-6   215  3/12/62
40   Holmes      X-Darren  Pitcher   R     R       6-0   202  4/25/66
42   Rivera      Mariano   Pitcher   R     R       6-2   168  11/29/69
43   Nelson      X-Jeff    Pitcher   R     R       6-8   235  11/17/66
46   Pettitte    Andy      Pitcher   L     L       6-5   235  6/15/72
51   Williams    Bernie    Outfield  S     R       6-2   205  9/13/68
54   Borowski    Joe       Pitcher   R     R       6-2   225  5/4/71
55   Mendoza     Ramiro    Pitcher   R     R       6-2   154  6/15/72
58   Jerzembeck  Mike      Pitcher   R     R       6-1   185  5/18/72
Example 2
THE WORLD COMPETITIVENESS SCOREBOARD
Source: IMD - International Institute for Management Development, Lausanne, Switzerland
(http://www.imd.ch/)
Country        2000  1999  1998  1997  1996  1995
USA               1     1     1     1     1     1
Singapore         2     2     2     2     2     2
Finland           3     3     5     4    15    18
Netherlands       4     5     4     6     7     8
Switzerland       5     6     7     7     9     5
Luxembourg        6     4     9    12     8     -
Ireland           7    11    11    15    22    22
Germany           8     9    14    14    10     6
Sweden            9    14    17    16    14    12
Iceland          10    17    19    21    25    25
Canada           11    10    10    10    12    13
Denmark          12     8     8     8     5     7
Australia        13    12    15    18    21    16
Hong Kong        14     7     3     3     3     3
UK               15    15    12    11    19    15
Norway           16    13     6     5     6    10
Japan            17    16    18     9     4     4
Austria          18    19    22    20    16    11
France           19    21    21    19    20    19
Belgium          20    22    23    22    17    21
New Zealand      21    20    13    13    11     9
Taiwan           22    18    16    23    18    14
Israel           23    24    25    26    24    24
Spain            24    23    27    25    29    28
Malaysia         25    27    20    17    23    23
Chile            26    25    26    24    13    20
Hungary          27    26    28    36    39    41
Korea            28    38    35    30    27    26
Portugal         29    28    29    32    36    32
Italy            30    30    30    34    28    29
China            31    29    24    27    26    31
Greece           32    31    36    37    40    40
Thailand         33    34    39    29    30    27
Brazil           34    35    37    33    37    38
Slovenia         35    40     -     -     -     -
Mexico           36    36    34    40    42    42
Czech Rep        37    41    38    35    34    39
South Africa     38    42    42    44    44    43
Philippines      39    32    32    31    31    36
Poland           40    44    45    43    43    45
Argentina        41    33    31    28    32    30
Turkey           42    37    33    38    35    35
India            43    39    41    41    38    37
Colombia         44    43    44    42    33    33
Indonesia        45    46    40    39    41    34
Venezuela        46    45    43    45    45    44
Russia           47    47    46    46    46    46
(- : no rank listed for that year)
Example 3
a. Ballard Power Systems, Inc. stock has risen in price by $107 per share in five
years.
b. Ballard Power Systems, Inc. stock has risen in price from $8 to $115 per share
in five years.
Operational Definitions
An important concept, perhaps difficult to measure (e.g. the overall health of the
U.S. equity market), is often operationalized with an easy-to-measure proxy (e.g.
the Dow Jones Industrial Average).
Sampling
One of the fundamental principles of statistics is that we can learn a great deal
about a complete population of data by looking at a smaller subset, or sample, from
the population.
Types of Samples
Nonprobability: Judgement, Quota, Chunk
Probability: Simple Random, Systematic, Stratified, Cluster
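The four probability sampling schemes can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the population of 100 numbered items, the two groups "A" and "B", and all function names are hypothetical, and the seed is fixed only so the run is repeatable.

```python
import random

def simple_random_sample(population, n, rng):
    """Simple random: every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Systematic: pick a random start, then take every k-th item."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, rng):
    """Stratified: draw the same fraction at random from every stratum."""
    sample = []
    for members in strata.values():
        size = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Cluster: choose whole clusters at random and keep all their members."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(0)                      # fixed seed for repeatability
population = list(range(1, 101))            # hypothetical items numbered 1..100
groups = {"A": population[:60], "B": population[60:]}  # hypothetical strata/clusters

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(groups, 0.10, rng)
clus = cluster_sample(groups, 1, rng)
print(srs, sys_sample, strat, len(clus), sep="\n")
```

The nonprobability schemes (judgement, quota, chunk) are not coded here because they depend on the sampler's choices rather than on a random mechanism.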
Getting Started in Microsoft Excel
Frequency Distribution
Focus                  Count
National Liberal Arts      2
National University        8
Regional Liberal Arts     18
Regional University       32
Percentage Distribution
Focus                  Count  Percent
National Liberal Arts      2    3.33%
National University        8   13.33%
Regional Liberal Arts     18   30.00%
Regional University       32   53.33%
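The percentage distribution is just each count divided by the total count. As a rough sketch in Python (the course itself uses Excel; the variable names here are illustrative), using the counts for the 60 schools:

```python
# Counts of schools by focus, from the frequency distribution.
counts = {
    "National Liberal Arts": 2,
    "National University": 8,
    "Regional Liberal Arts": 18,
    "Regional University": 32,
}

total = sum(counts.values())  # 60 schools in all
percentages = {focus: 100.0 * c / total for focus, c in counts.items()}

for focus, pct in percentages.items():
    print(f"{focus:<22} {counts[focus]:>3} {pct:6.2f}%")
```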
Graphs and Charts
History
Johann Heinrich Lambert (1728-1777) was a Swiss-German scientist and
mathematician. He is generally recognized as the inventor of the time series
graph, in which the values of some variable of interest are plotted against the
vertical axis and time is plotted on the horizontal axis.
William Playfair (1759-1823) was a Scottish political economist. He advocated
the use of charts instead of tables of data, because "a man who has carefully
investigated a printed table, finds, when done, that he has only a very faint and
partial idea of what he has read". Playfair also invented the bar graph.
Florence Nightingale (1820-1910) was a British Army nurse in the Crimean War
(1854). She used graphical tools to convince army officers to improve conditions
in military hospitals. In 1860 she offered to fund a chair in applied statistics at
Oxford, and was turned down.
Edward Tufte (1946- ) is a professor of political science, statistics, and computer
science at Yale. He has written several excellent books about statistics and
graphic design.
Personal Computers and Integrated Software such as the Microsoft Excel,
PowerPoint, and Word programs used by most students in this class, have
greatly simplified the creation of graphs and their use in documents and
multimedia presentations. An unfortunate side effect has been to limit people's
creativity in creating graphs.
Types of Charts
Frequency Bar Chart
[Figure: bar chart "Focuses of 60 Texas Universities"; school focus (National
Liberal Arts, National University, Regional Liberal Arts, Regional University)
plotted against frequency.]
Pie Chart
[Figure: pie chart "Focuses of 60 Texas Universities"; National Liberal Arts 3%,
National University 13%, Regional Liberal Arts 30%, Regional University 54%.]
Pareto Diagram
Defect    Cost         Cumulative Cost   Cumulative %
DeBurr    $ 8,181.25   $  8,181.25        52.5%
Cut       $ 5,950.00   $ 14,131.25        90.7%
Engrave   $   848.75   $ 14,980.00        96.2%
Grind     $   446.25   $ 15,426.25        99.0%
Weld      $   148.75   $ 15,575.00       100.0%
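The cumulative columns of a Pareto table come from sorting the categories by cost, largest first, and keeping a running total. A minimal Python sketch of that calculation, using the five defect costs (the variable names are illustrative):

```python
from itertools import accumulate

# Cost of each defect type, in dollars.
costs = {
    "DeBurr": 8181.25,
    "Cut": 5950.00,
    "Engrave": 848.75,
    "Grind": 446.25,
    "Weld": 148.75,
}

# Pareto order: largest cost first.
ordered = sorted(costs.items(), key=lambda item: item[1], reverse=True)
total = sum(costs.values())

cumulative = list(accumulate(cost for _, cost in ordered))
cumulative_pct = [100.0 * c / total for c in cumulative]

for (defect, cost), cum, pct in zip(ordered, cumulative, cumulative_pct):
    print(f"{defect:<8} {cost:>10,.2f} {cum:>11,.2f} {pct:6.1f}%")
```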
[Figure: Pareto diagram "5 Types of Manufacturing Defects"; defect type (DeBurr,
Cut, Engrave, Grind, Weld) on the horizontal axis, cost ($) on the left vertical
axis, cumulative % on the right vertical axis.]
Histogram
[Figure: histogram "Texas Tuitions - 60 Universities"; tuition ($1000), labeled
at class midpoints 1 through 13, plotted against frequency.]
Scatter Plot
[Figure: scatter plot "Education vs. Income"; education (years) on the horizontal
axis, income ($) on the vertical axis.]
Here is a time-series graph from March, 1999, showing the growth of the Dow
Jones Industrial Average during the 1990s. Note how the minimum value on the
vertical axis has been set to accentuate the Dow's growth — a mild example of
lying with charts.
Here is another example of lying with charts. The proportion of the number of
titles in the Barnes and Noble database to the number in Amazon's is evidently
8,000,000 to 4,700,000, or about 170%. But this one-dimensional relationship is
distorted in the two-dimensional graph.
The area of Barnes and Noble's black bar is 2700 square centimeters, while the
area of Amazon's gray bar is 800 square centimeters. This gives the visual
impression that the proportion of titles is more like 340%. The distortion is
augmented by the choice of color: Barnes and Noble looks bold, clear and strong,
while Amazon looks washed-out, pale, and weak.
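The size of the distortion can be put into numbers by comparing the ratio the picture shows with the ratio in the data. The sketch below simply divides one by the other; the name "exaggeration" is an illustrative label, not a term from the original chart.

```python
# Title counts quoted in the text.
titles_bn, titles_amazon = 8_000_000, 4_700_000
# Bar areas quoted in the text, in square centimeters.
area_bn, area_amazon = 2700.0, 800.0

data_ratio = titles_bn / titles_amazon    # what the numbers say
visual_ratio = area_bn / area_amazon      # what the bars show
exaggeration = visual_ratio / data_ratio  # how much the picture overstates

print(f"Data ratio:   {data_ratio:.2f}")
print(f"Visual ratio: {visual_ratio:.2f}")
print(f"The bars overstate the difference by a factor of {exaggeration:.2f}")
```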
Here's an example of a graphical technique that you can't do with Excel. In this
NEW YORK TIMES map of Kosovo, colors and shapes are used creatively to
communicate complicated quantitative information simply and clearly (e.g. the
volume and direction of refugee movements over time).
Juran's Suggestions for Good Charts
General
Label all axes with the variable name and units.
Don't use a legend for univariate charts (charts with only one variable).
Put the dependent variable on the vertical (Y) axis and the independent variable
on the horizontal (X) axis. (We will discuss dependent and independent variables
in greater detail later in the course.)
Let horizontal and vertical axes start at zero unless you have a good reason not
to.
Keep your scales, colors, patterns, and symbols consistent.
Eschew fancy effects that do not contribute to the reader's understanding (e.g. 3D effects, distracting colors or patterns, etc.).
Watch your ink-to-information ratio (see Tufte).
Keep it simple. Don't present data that aren't central to the point you are making.
Don't rely on the reader to infer the point of your chart; state your point
explicitly in the text.
Pareto Charts
Let the left vertical axis show the values for the various categories, and be scaled
so the maximum value corresponds to the total of all categories. Let the right
vertical axis show the cumulative percent, and be scaled so that the maximum
value is 100%.
Histograms
Don't let Excel decide what values to use for the class boundaries (a.k.a. bin or
bucket boundaries). Specify them yourself.
The proper number of classes is subjective; try to use between six and ten.
Don't use the upper class boundary as the category label on the X-axis. Use the
class midpoint to avoid confusion.
The default Excel column chart has gaps between the columns; these make a
histogram harder to read. Double-click on one of the columns, select "Options",
and reduce the gap width to zero.
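Choosing class boundaries by hand and labeling each class by its midpoint can be sketched as follows. The notes do not list the underlying tuition figures, so the ten values below are hypothetical and serve only to exercise the function; the boundaries 0, 2, ..., 14 are an assumption chosen to give seven classes with midpoints 1, 3, ..., 13.

```python
from bisect import bisect_right

def histogram_counts(values, boundaries):
    """Count values in classes [b0, b1), [b1, b2), ... and
    return (class midpoint, count) pairs."""
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        if boundaries[0] <= v < boundaries[-1]:
            counts[bisect_right(boundaries, v) - 1] += 1
    midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
    return list(zip(midpoints, counts))

# Class boundaries specified by hand (in $1000): seven classes of width 2.
boundaries = [0, 2, 4, 6, 8, 10, 12, 14]
# Hypothetical tuition values in $1000, for illustration only.
tuitions = [1.5, 2.0, 2.4, 3.9, 4.1, 5.5, 7.0, 9.8, 12.3, 13.9]

table = histogram_counts(tuitions, boundaries)
for midpoint, count in table:
    print(f"{midpoint:>5.1f}  {'#' * count}")
```

A value that falls exactly on a boundary (2.0 here) is counted in the class that starts at that boundary; whichever convention is used, it should be stated and applied consistently.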
Descriptive Statistics
Measures of Central Tendency
1) Average or Arithmetic Mean.
Example: The annual salaries (in $1000s) of the seven employees of a small
government department are as follows:
48, 90, 46, 42, 40, 46, 49.
The mean is:
\[ \mu = \frac{48 + 90 + 46 + 42 + 40 + 46 + 49}{7} = \frac{361}{7} = 51.571 \]
The mean salary is therefore $51,571. We use the Greek letter mu (\(\mu\)) to
symbolize the mean.
Notation: We will sometimes use a mathematical shorthand notation called
Summation Notation. It is easy to use and should not scare anyone; ask for help
if you need it.
If we have 7 data points, we can abstractly write these numbers \(X_1, X_2, \dots, X_7\)
(where \(X_1 = 48\), \(X_2 = 90\), ..., \(X_7 = 49\)). Then we write the average of
N = 7 numbers as:
\[ \mu = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N} \]
We can also write the average:
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} \]
where
\[ \sum_{i=1}^{N} X_i = X_1 + X_2 + X_3 + \dots + X_N = 48 + 90 + \dots + 49 \]
48 + 90 + ... + 49 = 361, so the average or mean is 361/7 = 51.571 or $51,571.
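The same calculation, written as a short Python sketch (the variable names are illustrative):

```python
# Annual salaries in $1000s for the seven employees.
salaries = [48, 90, 46, 42, 40, 46, 49]

N = len(salaries)      # number of data points
total = sum(salaries)  # the summation: X_1 + X_2 + ... + X_N
mu = total / N         # arithmetic mean

print(f"Sum = {total}, N = {N}, mean = {mu:.3f}")
```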
2) Median
The median of a data set is the “middle” value: the value such that half of the
population lies at or below it and half lies at or above it.
To find the median salary, first arrange the salaries in ascending order:
40, 42, 46, 46, 48, 49, 90.
The median salary is the middle value. In this case, it is $46,000, which (at least
here) seems more representative of a typical salary than the mean value
($51,571).
This worked nicely because we had an odd number of observations. Suppose we
want to find the median of the following:
48, 90, 46, 42, 40, 46, 49, 51.
For an even number of observations, the median is the average of the two middle
values. In this case, the average of 46 and 48, that is $47,000.
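Both cases, odd and even numbers of observations, can be handled by one small function. This is a sketch of the rule just described; the function name is illustrative.

```python
def median(values):
    """Middle value of the sorted data; for an even count,
    the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

seven = [48, 90, 46, 42, 40, 46, 49]
eight = [48, 90, 46, 42, 40, 46, 49, 51]

print(median(seven))  # middle of 40, 42, 46, 46, 48, 49, 90
print(median(eight))  # average of the two middle values, 46 and 48
```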
3) Mode
The mode of a data set is the “most popular” value or the value with highest
frequency.
Example: The manager of a men's store observes that the 10 pairs of trousers sold
yesterday have the following waist sizes (in inches): 31, 34, 36, 33, 28, 34, 30, 34,
32, 40. The mode of these waist sizes is 34 inches, and this fact is undoubtedly of
more interest to the manager than are the facts that the mean waist size is 33.2
inches and the median is 33.5 inches.
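Finding the mode amounts to counting how often each value occurs. A minimal sketch using Python's standard `collections.Counter`:

```python
from collections import Counter

# Waist sizes (inches) of the 10 pairs of trousers sold.
waists = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

# most_common(1) returns the (value, frequency) pair with the highest count.
size, frequency = Counter(waists).most_common(1)[0]
mean_waist = sum(waists) / len(waists)

print(f"Mode: {size} inches (sold {frequency} times); mean: {mean_waist:.1f} inches")
```

If two sizes were tied for the highest count, this sketch would report only one of them, so a tie would need to be checked separately.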
Measures of Dispersion
1) Range = maximum value - minimum value.
In the salary example above, the range is 90 - 40 = 50 (that is, $50,000).
2) Quartiles, Interquartile Range
Top 20 U.S. Banks (by Total Assets)
Rank  Bank                             City            Assets ($Billion)
1     Bank of America                  Charlotte       571.7
2     Chase Manhattan Bank             New York        332.2
3     Citibank                         New York        327.9
4     First Union National Bank        Charlotte       229.3
5     Morgan Guaranty Trust Company    New York        167.7
6     Wells Fargo Bank                 San Francisco    96.3
7     Bank One                         Chicago          93.9
8     Fleet National Bank              Providence       87.7
9     HSBC Bank                        Buffalo          79.6
10    BankBoston                       Boston           78.3
11    U.S. Bank                        Minneapolis      75.4
12    Keybank                          Cleveland        75.0
13    Bank of New York                 New York         71.8
14    PNC Bank                         Pittsburgh       68.2
15    Wachovia Bank                    Winston-Salem    63.6
16    State Street Bank And Trust Co.  Boston           56.2
17    Bankers Trust Company            New York         51.2
18    Southtrust Bank                  Birmingham       43.2
19    AmSouth Bank                     Birmingham       43.2
20    Regions Bank                     Birmingham       42.2
As Of 3/31/2000; source: http://www.ffiec.gov
Quartiles are used to divide a data set into four pieces; they can be thought of as
statistical dividing lines between these pieces. You will discover that there are
differences in the way statisticians calculate these dividing lines; here we will
illustrate the method used in the Excel QUARTILE function.
For a list of n numbers, first sort the numbers in increasing order and count
how many data points there are. In this case, n = 20. In the Excel method, the first
quartile is the number that is three quarters of the way from the fifth observation
(from the bottom) to the sixth. The fifth is 56.2 (State Street Bank; 56,226,197 in
$ thousands), the sixth is 63.6 (Wachovia; 63,557,835 in $ thousands), and the
first quartile is:
\[ 56.2 + \tfrac{3}{4}\,(63.6 - 56.2) \approx 56.2 + 5.5 = \$61.7 \text{ billion} \]
The second quartile is the number that is half way from the tenth observation
(from the bottom) to the eleventh. The tenth is 75.4 (U.S. Bank), the eleventh is
78.3 (BankBoston), so the second quartile is:
\[ 75.4 + \tfrac{1}{2}\,(78.3 - 75.4) \approx 75.4 + 1.5 = \$76.9 \text{ billion} \]
The third quartile is the number that is one quarter of the way from the fifteenth
observation (from the bottom) to the sixteenth. The fifteenth is 96.3 (Wells Fargo),
the sixteenth is 167.7 (Morgan), so the third quartile is:
\[ 96.3 + \tfrac{1}{4}\,(167.7 - 96.3) \approx 96.3 + 17.9 = \$114.2 \text{ billion} \]
The interquartile range is the difference between the third and first quartile:
114.2 - 61.7 = $52.5 billion.
Percentiles are like quartiles, except they are dividing lines between hundredths
of the data instead of fourths. The 25th percentile is the same as the 1st quartile,
the 50th percentile is the same as the 2nd quartile, and the 75th percentile is the
same as the 3rd quartile.
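The interpolation rule described above can be sketched in Python. The function name is illustrative; it places the q-th quartile at the fractional position (n - 1) * q / 4, counting from zero in the sorted list, which reproduces the "three quarters of the way from the fifth to the sixth" rule for n = 20. Because it works from the rounded asset figures in the table, the results (61.75, 76.85, 114.15) can differ in the last digit from the values quoted in the text.

```python
def excel_quartile(values, q):
    """Quartile q (1, 2 or 3) by linear interpolation between the two
    sorted observations that bracket position (n - 1) * q / 4."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 4  # 0-based fractional position
    lower = int(position)                  # index of the observation below
    fraction = position - lower            # how far toward the next one
    if lower + 1 >= len(ordered):
        return ordered[-1]
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])

# Total assets ($ billion) of the top 20 banks, from the table.
assets = [571.7, 332.2, 327.9, 229.3, 167.7, 96.3, 93.9, 87.7, 79.6, 78.3,
          75.4, 75.0, 71.8, 68.2, 63.6, 56.2, 51.2, 43.2, 43.2, 42.2]

q1, q2, q3 = (excel_quartile(assets, q) for q in (1, 2, 3))
iqr = q3 - q1
print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
```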
Quartiles can be used to create a type of chart called a Box Plot, or Box and
Whisker Plot, as in this example from the Texas College data:
Notice that the box plot allows us to compare central tendency and dispersion
across several variables in one chart. Here we can see how tuition varies across
four different types of schools. Unfortunately, Excel can't help you with box plots
very well (these were created in SPSS, a popular statistics software package).
3) Variance: The average of the squared deviations of values from the arithmetic
mean.
Example: To calculate the variance of the above 7 governmental salaries, first
calculate the mean; it is 51.571. Then for each number, calculate its deviation
from the mean, so we get
48 - 51.571 = -3.57,
90 - 51.571 = 38.43, and so forth...,
49 - 51.571 = -2.57.
Add the squares of these together, and we get
\((-3.57)^2 + (38.43)^2 + \dots + (-2.57)^2 \approx 1{,}783.7\). Then dividing by 7
we get 254.82. The variance of the above salaries is 254.82, in units of ($1000)
squared. Using summation notation this is:
\[ \sigma^2 = \frac{1}{N}\left[ (X_1 - \mu)^2 + (X_2 - \mu)^2 + (X_3 - \mu)^2 + \dots + (X_N - \mu)^2 \right] = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2 \]
(Beware of the units of the variance: it is in the original units squared.)
4) Standard deviation = \(\sqrt{\sigma^2} = \sigma\). This can be thought of as the
“average” deviation from the mean. It is simply the square root of the variance:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{254.82} = 15.96 \]
Since the salaries are recorded in $1000s, the standard deviation is about $15,960.
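The population variance and standard deviation of the seven salaries can be checked with a short Python sketch. The variable names are illustrative, and the divisor is N because the seven employees are treated as the whole population.

```python
import math

# Annual salaries in $1000s (the whole population of seven employees).
salaries = [48, 90, 46, 42, 40, 46, 49]

N = len(salaries)
mu = sum(salaries) / N

# Squared deviation of each salary from the mean.
squared_deviations = [(x - mu) ** 2 for x in salaries]

variance = sum(squared_deviations) / N  # population variance, in ($1000)^2
sigma = math.sqrt(variance)             # population standard deviation, in $1000s

print(f"Sum of squared deviations = {sum(squared_deviations):.1f}")
print(f"Variance = {variance:.2f}, standard deviation = {sigma:.2f}")
```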
Example: A school system employs teachers at salaries between $28,000 and
$50,000. The teachers' union and the school board are negotiating the form of
next year's salary increases.
1. If every teacher is given a flat $1000 raise, what will this do to the mean
salary?
2. To the median salary?
3. To the range?
4. To the quartiles of the salary distribution?
5. What would a flat $1000 raise do to the standard deviation of teachers'
salaries?
6. If, instead, each teacher receives a 5% raise, what will this do to the mean
salary?
7. To the median salary?
8. Will the 5% raise increase the standard deviation of the salaries?
Population versus Sample

A population is usually a group we want to know something about: e.g., all
potential customers, all eligible voters, all the products coming off an
assembly line, all items in inventory, etc....

A population parameter is a number relevant to the population that is of
interest to us: e.g., the proportion (in the population) that would buy a
product, the proportion of eligible voters who will vote for a candidate, the
average number of M&M's in a pack....

A sample is a subset of the population that we actually do know about (by
taking measurements of some kind): e.g., a group who fill out a survey, a
group of voters that are polled, a number of randomly chosen items off the
line....

A sample statistic is often the only practical estimate of a population
parameter. In practice we will use sample statistics as proxies for population
parameters, but it is important to remember the difference.
Sample Mean and Variance: To determine the average amount of money spent
in the Central Mall, a Central City official randomly samples 12 people as they
exit the mall. He asks them the amount of money spent and records the data. The
official is trying to estimate mean and variance of the population from a sample of
12 data points. Here are the data for the 12 people:
Person  $ spent
1       $132
2       $334
3       $33
4       $10
5       $123
6       $5
7       $6
8       $14
9       $449
10      $133
11      $44
12      $1
Sample Means, Variances and Standard Deviations: A sample \((X_1, X_2, \dots, X_n)\)
has sample mean, sample variance, and sample standard deviation as follows:
Sample Mean
\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i \]
Sample Variance
\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]
Note: The denominator of the sample variance formula is n - 1, not n. This is
because of the aforementioned distinction between population parameters and
sample statistics. The n - 1 formula for \(s^2\) tends to give a better estimate
of \(\sigma^2\).
Sample Standard Deviation
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
Example:
The sample mean is
\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{132 + 334 + 33 + \dots + 1}{12} = \frac{\$1{,}284}{12} = \$107 \]
The sample variance is
\[ s^2 = \frac{(132 - 107)^2 + (334 - 107)^2 + \dots + (1 - 107)^2}{11} = \frac{229{,}394}{11} \approx 20{,}854 \ (\$^2) \]
The sample standard deviation is
\[ s = \sqrt{s^2} = \sqrt{20{,}854} \approx \$144.41 \]
So we estimate that on average $107 is spent per shopper, with a standard
deviation of $144.41. From now on we will be working almost exclusively with
sample data. Population data are usually not easily obtained.
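The mall example can be reproduced with a few lines of Python. This is a sketch with illustrative variable names; the only difference from a population calculation is the n - 1 divisor.

```python
import math

# Dollars spent by the 12 sampled shoppers.
spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]

n = len(spent)
x_bar = sum(spent) / n                               # sample mean

sum_of_squares = sum((x - x_bar) ** 2 for x in spent)
s_squared = sum_of_squares / (n - 1)                 # sample variance: divide by n - 1
s = math.sqrt(s_squared)                             # sample standard deviation

print(f"Sample mean = ${x_bar:.2f}")
print(f"Sample variance = {s_squared:,.0f} ($ squared)")
print(f"Sample standard deviation = ${s:.2f}")
```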
The Coefficient of Variation
Quality Application: Suppose we have two machines producing pipes, one of
small diameter (4 inches) the other of larger diameter (30 inches). Due to
imperfections in the production processes, the small pipes that come off the line
are not all exactly 4 inches in diameter. Some differ by as much as 0.1 inches
in either direction. We calculate that the mean diameter is 4.0 inches and the
standard deviation of the pipe diameters is 0.05 inches. For the larger pipes, the
mean diameter is 30.0 inches, and the standard deviation is also 0.05 inches. By
comparing standard deviations we would say the quality of the output is
identical; the same variability exists. However, in relative terms they differ. This
is where the coefficient of variation (CV) is useful. The CV measures the variation
relative to the value of the mean. For a sample with mean \(\bar{X}\) and standard
deviation s:
\[ CV = \frac{s}{\bar{X}} \]
Usually the CV is multiplied by 100 and stated as a percentage. For the smaller
pipes, the CV is 0.05/4.0 = 0.0125 = 1.25%. For the larger pipes, the CV is
0.05/30.0 = 0.001667 = 0.17%. Thus the variability of the larger pipes is not as
great as that of the smaller pipes, relative to their diameter. Their quality, one
could say, is therefore better.
Example: Individual firms in the toy industry find that their annual growth rates
in sales tend to fluctuate substantially from year to year, because of changing
fads. In comparison, the growth rate in total industry sales remains relatively
stable.
Percentage Growth in Sales
Company     1980     1981     1982     1983     1984
Mattel    +13.7%   +23.9%   +18.3%   -52.8%   +39.1%
Tonka     -21.7%    +4.1%   -22.9%    +8.3%   +58.3%
Industry   +8.4%   +14.0%   +28.9%   -19.9%   +50.0%
Here are the relevant statistics (calculated using sample formulae (why?)):
Company   Sample Mean  Sample Std. Dev.
Mattel       8.44%         35.5%
Tonka        5.22%         33.0%
Industry    16.28%         25.9%
We see from the standard deviations that the individual toy companies have
slightly larger standard deviations than the industry. What do we mean by
slightly? What are we measuring against? For Mattel, the CV is 35.5/8.44 = 4.21
or 421%. For Tonka, the CV is 33.0/5.22 = 6.32 or 632%. For the industry as a
whole, the CV is 25.9/16.28 = 1.59 or 159%.
The industry’s growth rate actually is a lot less variable than that of the
individual firms, when measured against the average growth rate.
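The coefficients of variation can be recomputed from the five yearly growth rates with Python's standard `statistics` module, whose `stdev` function uses the sample (n - 1) formula. This is a sketch; because it works from unrounded standard deviations, the last digit can differ slightly from the figures above (for example, about 6.31 rather than 6.32 for Tonka).

```python
from statistics import mean, stdev

# Percentage growth in sales, 1980 through 1984.
growth = {
    "Mattel":   [13.7, 23.9, 18.3, -52.8, 39.1],
    "Tonka":    [-21.7, 4.1, -22.9, 8.3, 58.3],
    "Industry": [8.4, 14.0, 28.9, -19.9, 50.0],
}

# CV = sample standard deviation divided by sample mean.
cv = {name: stdev(rates) / mean(rates) for name, rates in growth.items()}

for name, rates in growth.items():
    print(f"{name:<9} mean = {mean(rates):6.2f}%  "
          f"std dev = {stdev(rates):5.1f}%  CV = {cv[name]:.0%}")
```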
Selected Bibliography
Bernstein, Peter L. (1996). Against the Gods: The Remarkable Story of Risk. New
York: John Wiley and Sons.
Paulos, John Allen (1995). A Mathematician Reads the Newspaper. New York: Basic
Books.
Paulos, John Allen (1998). Once Upon a Number. New York: Basic Books.
Tufte, Edward (1983). The Visual Display of Quantitative Information. Cheshire, CT:
Graphics Press.