Download Lesson 1 - WordPress @ VIU Sites

Chapter 1
An Introduction to
Econometrics and Statistical
Inference
Copyright © 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
Learning Objectives
• Understand the steps involved in conducting
an empirical research project
• Understand the meaning of the term
econometrics
• Understand the relationship between populations,
samples, and statistical inference
• Understand the important role that sampling
distributions play in statistical inference
1-2
What is an Empirical Research Project?
An empirical research project is a project that
applies empirical analysis to observed data to
provide insight into questions of theoretical
interest.
1-3
The 5 Steps in Conducting an
Empirical Research Project
(1) Determining the question of interest
(2) Developing the appropriate theory to address
the question
(3) Collecting data that is appropriate for
empirically investigating the answer
(4) Implementing appropriate empirical techniques,
correctly interpreting results, and drawing
appropriate conclusions based on the estimated
results
(5) Effectively writing up a summary of the first four
steps
1-4
What is Econometrics?
Econometrics is the application of statistical
techniques to economic data.
1-5
Populations, Samples, and
Statistical Inference
A population is the entire group of entities that we are
interested in learning about.
A sample is a subset or part of the population and it is what is
used to perform statistical inference.
Statistical inference is the process of drawing conclusions from
data that are subject to random variation.
1-6
Populations, Samples, and
Statistical Inference Continued
1-7
Some Important Definitions
A parameter is a function that exists within the
population.
A statistic is a function that is computed from
the sample data.
A point estimate is a single valued statistic that
is the best guess of a population parameter.
1-8
Sampling Distributions
A sampling distribution is the distribution of a sample
statistic such as the sample mean.
A sampling distribution is constructed by
(1) collecting all possible samples of size 𝑛 that could be
drawn from the unobserved population of size 𝑁
(2) calculating the value of a given statistic (say, the
sample mean) for each of those samples
(3) placing those values in order on the number-line to
create a distribution known as a sampling distribution
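The three construction steps above can be sketched in Python. This is only an illustration: the five-value population is made up for the example, and samples are assumed to be drawn without replacement.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of size N = 5 (not from the slides)
population = [2, 4, 6, 8, 10]
n = 2  # sample size

# (1) Collect all possible samples of size n (without replacement)
samples = list(combinations(population, n))

# (2) Calculate the statistic (here the sample mean) for each sample
sample_means = [mean(sample) for sample in samples]

# (3) Place the values in order to form the sampling distribution
sampling_distribution = sorted(sample_means)

print(len(samples))            # number of possible samples
print(sampling_distribution)   # ordered sample means
```

In this small example the average of all the sample means equals the population mean, which is one reason sampling distributions matter for statistical inference.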
1-9
A Visual Example
1-10
Chapter 2
Collection and
Management of Data
Learning Objectives
• Consider potential sources of data
• Work through an example of the first three
steps in conducting an empirical research
project
• Develop data management skills
• Understand some useful Excel commands
1-12
Goals of the Chapter
1-13
Types of Data
• Cross-sectional data is data collected for many
different individuals, countries, firms, etc. in a
given time-period.
• Time-series data is data collected for a given
individual, country, firm, etc. over many different
time periods.
• Panel data are data collected for a number of
individuals, countries, firms, etc. over many
different time periods.
1-14
Primary Data Sources
• private-use data
– government surveys or internal firm-level data
– obtained through formal request and/or having the
appropriate connections.
• publicly-available data
– obtained through the internet or through formal
Freedom of Information Act (FOIA) request
• personal survey data
– obtained by personally conducting a survey asking
people for information and recording their responses
1-15
An Example of the First Three Steps
Suppose you are trying to convince your significant
other to go camping but he or she is afraid of bears.
How can you use your empirical research skills to
convince him or her that bear attacks are not a
realistic concern?
Step 1: Identify a question of interest
What factors affect the number of fatal bear attacks
in the US?
1-16
An Example of the First Three Steps
Step 2: Develop appropriate theory
The number of fatal bear attacks in the US
should depend on:
• The number of bears
• The number of campers
• Square feet of national parkland
1-17
An Example of the First Three Steps
Step 3: Collect appropriate data
Start with an internet search for the data you seek
1-18
An Example of the First Three Steps
Download data to Excel and then repeat the process for
the independent variables you seek.
1-19
Data Management Skills
Two important points:
(1) When working with data, it is common to
make mistakes which alter the initial data
(2) When working on a larger project, it is
common to take time off before
returning to the project
1-20
Data Management Skills
Our goals with data management are to be
able to:
(1) Recreate our initial data as easily as
possible
(2) Recall what we had previously done as
easily as possible
1-21
Data Management Skills
When working with data, we recommend:
(1) Creating a “Master” file with the initial
data and performing calculations in a
different “working” file
(2) Exhaustively documenting all initial data
sources
(3) Making file and variable names as
intuitive as possible
(4) Documenting all commands used when
performing estimation
1-22
Chapter 3
Summary Statistics
Learning Objectives
• Construct relative frequency histograms
• Calculate measures of central tendency
• Calculate measures of dispersion
• Use measures of central tendency and dispersion
• Detect whether outliers are present
• Construct scatter diagrams for the relationship
between two variables
• Calculate the covariance and the correlation
coefficient between two variables
1-24
1-25
Construct a Relative Frequency
Histogram
• A bar chart that shows how often
observations lie within specified classes
• Allows a visual inspection of the data
• Based on a Relative Frequency Table
• The example dataset for constructing a
histogram is states.xls, a survey of
econometrics students that asked how many
states they have visited.
1-26
[Figure: relative frequency histogram titled "Number of States Visited". Vertical axis: Relative Frequency (0 to 0.45). Horizontal axis: Number of States Visited, with classes 0-5.99, 6-11.99, 12-17.99, and 18-24.]
1-27
To create a frequency distribution
we must …
1. Select the number of classes
2. Choose the class interval or width of the
classes
3. Select the class boundaries or the values
that form the interval for each class
4. Count the number of values in the dataset
that fall in each class
1-28
Step 1: Select the number of
classes
The rule for determining the approximate number
of classes is:
Approximate number of classes = $[(2)(\text{Number of observations})]^{0.3333}$
The actual number of classes is the smallest integer
that exceeds this value.
If the formula gives us 4.66 we use 5
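The rule can be sketched in Python as follows. The function name and the 100-observation input are illustrative assumptions, and the exponent is written as 1/3 rather than .3333.

```python
import math

def number_of_classes(n_observations):
    """Approximate number of classes: (2 * n) ** (1/3), rounded up."""
    approximate = (2 * n_observations) ** (1 / 3)
    # math.ceil rounds any non-integer value up to the next integer
    return math.ceil(approximate)

# Hypothetical dataset with 100 observations: (200) ** (1/3) is about 5.85
print(number_of_classes(100))

# The slide's illustration: a formula value of 4.66 becomes 5 classes
print(math.ceil(4.66))
```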
1-29
Step 1: Example
We have 43 data points so the rule is:
Approximate number of classes = [(2)(20)].3333
= 3.503
Round this up to the next integer value which is 4.
The number of classes is 4.
**Always round up!!
1-30
Step 2: Choose the width of the
interval
The rule for determining interval width is:
Approximate interval width = $\frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$
The actual interval width is the smallest integer that
exceeds this value.
If the formula gives us 6.17 we use 7
**Always round up!!
1-31
Step 2: Example
Approximate interval width = (24-1)/4 = 5.75
Round up to 6.
Therefore the class width is 6.
1-32
Step 3: Select the class boundaries
• Class boundaries must be chosen such that each
data item belongs to one and only one class.
• Start just below the lowest value in the dataset to
get the lower boundary. The lower boundary for
the second class is then found by adding the class
width. The upper boundary for the first class is
found by subtracting .01 from the lower
boundary of the second class.
• Keep adding the class width and subtracting .01
to get the boundaries.
1-33
Step 3: Example
The lowest data point is 1, so we start our classes
at 0.
Lower boundary of Class 1 = 0
Lower boundary of Class 2 = 6 (=0+6)
Lower boundary of Class 3 = 12 (=6+6)
Lower boundary of Class 4 = 18 (=12+6)
1-34
Step 3: Example Continued
Class boundaries are then:
Class 1: 0-5.99
Class 2: 6-11.99
Class 3: 12-17.99
Class 4: 18-24
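Steps 2 and 3 can be reproduced with a short Python sketch. The function name is hypothetical. The inputs (smallest value 1, largest value 24, 4 classes, starting point 0) come from the example slides, and closing the last class at the largest data value is an assumption made to match the 18-24 class shown above.

```python
import math

def class_boundaries(smallest, largest, n_classes, start):
    """Return the class width and a list of (lower, upper) boundaries."""
    # Step 2: interval width, always rounded up
    width = math.ceil((largest - smallest) / n_classes)
    # Step 3: lower boundaries are found by repeatedly adding the width
    lowers = [start + k * width for k in range(n_classes)]
    # Upper boundary = next lower boundary minus .01
    uppers = [round(low + width - 0.01, 2) for low in lowers]
    # Assumption: close the last class at the largest data value
    uppers[-1] = max(uppers[-1], largest)
    return width, list(zip(lowers, uppers))

width, classes = class_boundaries(smallest=1, largest=24, n_classes=4, start=0)
print(width)
for lower, upper in classes:
    print(lower, "-", upper)
```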
1-35
Step 4: Count the number of values in
the dataset that fall into each class
• Doing this by hand is tedious and, therefore,
we want to rely on Excel to do this for us.
• Enter the class boundaries into Excel next to
the data set.
• Enter the Upper Boundaries of each of the
classes
• Use the Frequency command
1-36
How to use the Frequency
command in Excel
1. Select the cells next to the class intervals where
the frequencies should go (say E2:E6).
2. Type, but do not yet enter, the formula
=Frequency(A2:A44,D2:D6)
A2:A44 contains the data and D2:D6 contains the
upper class boundaries.
3. Press CTRL+SHIFT+ENTER and the array formula
will be entered into each of the cells E2:E6.
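For readers working outside Excel, the counting step can be sketched in Python. The function below is a hypothetical stand-in written for this example, not Excel's implementation: it counts each value in the first class whose upper boundary is greater than or equal to the value, and adds one extra slot for values above the last boundary. The ten data values are made up, since the states.xls data are not reproduced here.

```python
from bisect import bisect_left

def frequency(data, upper_boundaries):
    """Count how many values fall in each class defined by sorted upper boundaries."""
    # One extra slot collects values larger than the last boundary
    counts = [0] * (len(upper_boundaries) + 1)
    for value in data:
        # Index of the first upper boundary that is >= value
        counts[bisect_left(upper_boundaries, value)] += 1
    return counts

# Hypothetical data (not the survey responses from the slides)
states_visited = [1, 3, 5, 6, 8, 11, 12, 15, 18, 24]
upper_limits = [5.99, 11.99, 17.99, 24]

print(frequency(states_visited, upper_limits))
```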
1-37
Our Excel Results
Class Boundaries   Upper Limit   Frequency
0-5.99             5.99          18
6-11.99            11.99         18
12-17.99           17.99         4
18-24              24.00         3
1-38
Creating relative frequency and
percent frequency distributions
Recall that the relative frequency is the
proportion of the observations belonging to a
class. With n observations
$\text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{n}$
The percent frequency is the relative frequency
multiplied by 100.
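As a quick check, the relative and percent frequencies can be computed in Python from the class frequencies in the Excel results (18, 18, 4, and 3). The variable names are chosen for this sketch only.

```python
# Class frequencies from the Excel results slide
frequencies = [18, 18, 4, 3]
n = sum(frequencies)  # total number of observations

# Relative frequency = frequency of the class / n
relative = [f / n for f in frequencies]
# Percent frequency = relative frequency * 100
percent = [100 * r for r in relative]

for f, r, p in zip(frequencies, relative, percent):
    print(f, round(r, 4), round(p, 2))
```

The relative frequencies add up to 1, and the two largest classes each hold a little under 42% of the observations.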
1-39
Relative Frequency Table
1-40
Using Excel’s Chart Wizard to
Construct a Histogram
1. Use the frequency distribution we just
constructed and highlight the frequencies
2. Click the Chart Wizard and choose column in
the chart type
3. Click on the Category (X) axis labels box and
enter the class boundaries
4. To get the bars to touch right click on any
rectangle in the column chart and choose
Format Data Series. Select the Options tab
and enter 0 in the Gap Width box.
1-41
[Figure: relative frequency histogram titled "Number of States Visited". Vertical axis: Relative Frequency (0 to 0.45). Horizontal axis: Number of States Visited, with classes 0-5.99, 6-11.99, 12-17.99, and 18-24.]
1-42
Soda Consumption Data
Your mission is to pair up with a classmate and
draw what you think the histogram for soda
consumption looks like.
1-43
Calculate Measures of Central
Tendency
• Central tendency is the middle value of a
dataset.
• The measure of central tendency is typically
thought of as the number that best describes
the data.
• Measures of central tendency are:
(1) Mean
(2) Median
1-44
Measure of Central Tendency - Mean
The mean is the arithmetic average of the data. To calculate the
mean, sum all the observations and divide by the number of
observations.
Represented by the symbol $\bar{x}$
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + \dots + x_n)$
For the following small data set:
95 85 99 92 80
Mean =(95+85+99+92+80)/5 = 451/5 = 90.2
In Excel =average(highlight data)
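The same calculation as a minimal Python sketch, using the five data points from this slide:

```python
data = [95, 85, 99, 92, 80]

# Sum all the observations and divide by the number of observations
total = sum(data)
n = len(data)
x_bar = total / n

print(total, n, x_bar)
```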
1-45
Measure of Central Tendency - Median
Median – the middle observation when the data are
arranged from smallest to largest, sometimes called the 50th
percentile. Half the observations lie below the median and
half the observations lie above the median.
The median is the middle observation for an odd number of
ordered observations and the average of the middle two
ordered observations for an even number of observations.
The median is an order statistic so in order to calculate it
the data must be ordered from smallest to largest.
1-46
Measure of Central Tendency - Median
Median – Central observation for an odd number of observations
and an average of the two middle data points for an even number
of observations
For the following small data set :
95 85 99 92 80
(ordered data 80 85 92 95 99)
Median = 92 (the 3rd data point)
If we had 75 80 85 92 95 99, then
median = (85+92)/2 = (.5*85)+(.5*92) = 42.5+46 = 88.5
In Excel =median(highlight data)
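The odd and even cases can be sketched in Python. The function is written out by hand here to mirror the definition (Python's standard library also offers statistics.median).

```python
def median(values):
    """Middle observation (odd n) or average of the two middle observations (even n)."""
    ordered = sorted(values)      # the median is an order statistic
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([95, 85, 99, 92, 80]))        # odd number of observations
print(median([75, 80, 85, 92, 95, 99]))    # even number of observations
```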
1-47
Calculate Measures of Dispersion
Dispersion is a measure of how the data vary.
Measures of dispersion are:
(1) Variance
(2) Standard Deviation
(3) Percentiles
(4) Five Number Summary
1-48
Measure of Dispersion – Variance
and Standard Deviation
Standard Deviation – a measure of the typical deviation of the
observations away from the mean. It is the square root of the variance.
The variance is calculated by subtracting the mean from each
observation, squaring that value, adding up all n values, and
then dividing that by the number of observations less one.
Sample variance formula is $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
Standard deviation is $s = \sqrt{s^2}$
In Excel =var(highlight data) for the variance
and =stdev(highlight data) for the standard deviation
1-49
Measure of Dispersion – Variance
and Standard Deviation
Sample variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
For the following small data set :
95 85 99 92 80
$s^2 = [(95-90.2)^2 + (85-90.2)^2 + (99-90.2)^2 + (92-90.2)^2 + (80-90.2)^2]/4 = 234.8/4 = 58.7$
Sample standard deviation: $s = \sqrt{s^2}$
$s = \sqrt{58.7} = 7.6616$
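The worked example can be checked with a few lines of Python that follow the verbal recipe step by step. The variable names are chosen for this sketch.

```python
from math import sqrt

data = [95, 85, 99, 92, 80]
n = len(data)
x_bar = sum(data) / n                      # mean = 90.2

# Subtract the mean from each observation and square the result
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Add up all n values and divide by the number of observations less one
s2 = sum(squared_deviations) / (n - 1)     # sample variance
s = sqrt(s2)                               # sample standard deviation

print(round(s2, 4), round(s, 4))
```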
1-50
Measure of Dispersion – Percentile
The pth percentile is a number such that p% of the ordered
observations lie below it and (100-p)% of the
observations lie above it.
The median is the 50th percentile: 50% of the ordered data lies
below that level and 50% of the ordered data lies above that level.
A percentile is an order statistic.
There are many different ways to calculate percentiles. The
next slide shows one of the easiest.
1-51
Steps to Calculate a Percentile, p
(with p written as a proportion, e.g. .10 for the 10th percentile)
(1) Sort the data from low to high
(2) Count the number of observations, n
(3) Select observation number p(n+1) in the sorted data
(4) If the value p(n+1) is not a whole number then select the
closest whole number
(5) If p(n+1) is less than 1 then select the smallest number
(6) If p(n+1) is greater than n then select the largest number.
In Excel =percentile(highlight data, p)
Note that the steps to calculate a percentile by hand and
calculating percentiles in Excel will likely not result in the same
value.
1-52
Measure of Dispersion - Percentile
Calculate the 10th and the 70th percentile for the following
small data set :
95 85 99 92 80
(ordered data 80 85 92 95 99)
10th percentile: select observation number .1(n+1) = .1(6) = .6 in the
ordered data set.
The closest whole number is 1, so the 10th percentile is the
first observation, or 80.
70th percentile: select observation number .7(n+1) = .7(6) = 4.2 in the
ordered data set.
The closest whole number is 4, so the 70th percentile is the
fourth observation, or 95.
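The by-hand steps can be sketched in Python. The function name is hypothetical, p is written as a proportion, and rounding exact halves upward is an assumption, since the slides do not say how to break ties. The sketch also assumes the largest observation is selected whenever p(n+1) exceeds n.

```python
import math

def percentile_by_hand(data, p):
    """Percentile using the p(n+1) rule, with p given as a proportion."""
    ordered = sorted(data)                 # (1) sort from low to high
    n = len(ordered)                       # (2) count the observations
    position = p * (n + 1)                 # (3) position p(n+1)
    if position < 1:                       # (5) below the first observation
        return ordered[0]
    if position > n:                       # (6) beyond the last observation
        return ordered[-1]
    rank = math.floor(position + 0.5)      # (4) closest whole number
    return ordered[rank - 1]               # ranks start at 1, list indices at 0

data = [95, 85, 99, 92, 80]
print(percentile_by_hand(data, 0.10))      # 10th percentile
print(percentile_by_hand(data, 0.70))      # 70th percentile
```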
1-53
Measure of Dispersion – Five
Number Summary
The Five Number Summary is
(1) Minimum
(2) Q1 or 25th Percentile
(3) Q2 or Median (50th Percentile)
(4) Q3 or 75th Percentile
(5) Maximum
1-54
How to Calculate the Five Number
Summary in Excel
Minimum: =Min(data)
Q1 or 25th Percentile: =percentile(data,.25) or =quartile(data,1)
Q2 or Median: =median(data)
Q3 or 75th Percentile: =percentile(data,.75) or =quartile(data,3)
Maximum: =Max(data)
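A comparable five number summary can be sketched in Python with the standard library. The function name is hypothetical. The "inclusive" method interpolates linearly between ordered observations, which is assumed here to correspond to Excel's percentile command; other percentile methods can give different quartiles.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # n=4 splits the data into quartiles; "inclusive" treats the data
    # as including the smallest and largest possible values
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))

data = [95, 85, 99, 92, 80]
print(five_number_summary(data))
```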
1-55
Shapes of Histograms
• Symmetric
• Skewed to the right or Positively Skewed
• Skewed to the left or Negatively Skewed
• Bimodal
1-56
Symmetric Histogram
[Figure: "Histogram for Diameter of 400 Elevator Rails". Vertical axis ticks: 0 to 90. Horizontal axis (Category): classes from <=0.455 to >0.545 in steps of 0.01.]
1-57
Positively Skewed Distribution
[Figure: "Histogram for Time Between Bank Customer Arrivals". Vertical axis ticks: 0 to 160. Horizontal axis (Category): classes from <=2.5 to >27.5 in steps of 2.5.]
1-58
Negatively Skewed Distribution
[Figure: "Histogram for Scores on a Midterm". Vertical axis ticks: 0 to 20. Horizontal axis (Category): classes from <=45 to >95 in steps of 5.]
1-59
Bimodal Distribution
1-60
Positively Skewed Distribution
Median = 2.77
Mean = 4.16
[Figure: "Histogram for Time Between Bank Customer Arrivals" with the median and mean marked. Vertical axis ticks: 0 to 160. Horizontal axis (Category): classes from <=2.5 to >27.5 in steps of 2.5.]
1-61
Why is the shape of the histogram
important?
• The shape of the empirical distribution
dictates which summary statistics should be
used
Symmetric – Use mean and standard deviation
Skewed – Use median and five number
summary
1-62
How to determine if your data is
skewed or symmetric
Pearson’s coefficient of skewness:
sk = 3*(mean-median)/(standard dev.)
Rule of Thumb:
If sk<-.5 or sk>.5 then the distribution is skewed.
Otherwise the distribution is symmetric.
sk < -.5: Negatively skewed
-.5 ≤ sk ≤ .5: Symmetric
sk > .5: Positively skewed
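The coefficient and the rule of thumb can be sketched in Python. The function names are hypothetical. The data are the five-point example used earlier in the chapter, which is small and serves only to illustrate the arithmetic.

```python
from statistics import mean, median, stdev

def pearson_skewness(data):
    """sk = 3 * (mean - median) / (sample standard deviation)."""
    return 3 * (mean(data) - median(data)) / stdev(data)

def shape(sk):
    """Apply the rule of thumb with cutoffs at -.5 and .5."""
    if sk < -0.5:
        return "negatively skewed"
    if sk > 0.5:
        return "positively skewed"
    return "symmetric"

data = [95, 85, 99, 92, 80]        # mean 90.2, median 92, s = 7.6616
sk = pearson_skewness(data)
print(round(sk, 4), shape(sk))
```

Here the mean lies below the median, so sk is negative and falls below the -.5 cutoff.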
1-63
Symmetric Histogram
Mean = .5013
Standard Deviation = .019
[Figure: "Histogram for Diameter of 400 Elevator Rails". Vertical axis ticks: 0 to 90. Horizontal axis (Category): classes from <=0.455 to >0.545 in steps of 0.01.]
1-64
Positively Skewed Distribution
Median = 2.779
[Figure: "Histogram for Time Between Bank Customer Arrivals". Vertical axis ticks: 0 to 160. Horizontal axis (Category): classes from <=2.5 to >27.5 in steps of 2.5.]
Five Number Summary
Minimum   0.008
Q1        1.1578
Median    2.779
Q3        5.643
Maximum   29.001
1-65
How to Detect Outliers with
Symmetric data
Use the Empirical Rule
Approximately 68% of the data should be within one standard deviation of the
mean: $\bar{x} \pm s$
Approximately 95% of the data should be within two standard deviations of
the mean: $\bar{x} \pm 2s$
Nearly all (about 99.7%) of the data should be within three standard deviations
of the mean: $\bar{x} \pm 3s$
Therefore, an observation is an outlier if it lies beyond three
standard deviations from the mean, that is, outside the interval
$(\bar{x} - 3s,\ \bar{x} + 3s)$
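A minimal Python sketch of the three-standard-deviation check, using the mean (.5013) and standard deviation (.019) reported for the elevator rail diameters. The function names and the two test diameters are assumptions made for illustration.

```python
def three_sigma_interval(mean, sd):
    """Interval (mean - 3s, mean + 3s) from the Empirical Rule."""
    return (mean - 3 * sd, mean + 3 * sd)

def is_outlier(x, mean, sd):
    """An observation is an outlier if it lies outside the interval."""
    low, high = three_sigma_interval(mean, sd)
    return x < low or x > high

# Summary statistics from the elevator rail histogram
low, high = three_sigma_interval(0.5013, 0.019)
print(round(low, 4), round(high, 4))

# Hypothetical diameters to classify
print(is_outlier(0.50, 0.5013, 0.019))
print(is_outlier(0.56, 0.5013, 0.019))
```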
1-66
How to detect an outlier with
skewed data
• Calculate the interquartile range, IQR = Q3 – Q1.
• If a value is greater than Q3 plus 1.5*IQR or
less than Q1 minus 1.5*IQR, then it is a
moderate outlier
• If a value is greater than Q3 plus 3*IQR or less
than Q1 minus 3*IQR, then it is an extreme
outlier
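The IQR rule can be sketched in Python using the quartiles from the bank customer arrival data (Q1 = 1.1578, Q3 = 5.643). The function name and the value 15 are hypothetical.

```python
def classify_outlier(x, q1, q3):
    """Label a value as an extreme outlier, a moderate outlier, or neither."""
    iqr = q3 - q1
    # Check the wider fences first so extreme outliers are not
    # reported as merely moderate
    if x > q3 + 3 * iqr or x < q1 - 3 * iqr:
        return "extreme outlier"
    if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
        return "moderate outlier"
    return "not an outlier"

q1, q3 = 1.1578, 5.643
print(classify_outlier(29.001, q1, q3))   # the maximum in the summary
print(classify_outlier(15, q1, q3))       # hypothetical value
print(classify_outlier(2.779, q1, q3))    # the median
```

With these quartiles the IQR is 4.4852, so the upper fences sit at about 12.37 (moderate) and 19.10 (extreme).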
1-67
Construct Scatter Diagrams for the
Relationship between two Random
Variables
• A scatter diagram (or scatter plot) is used to
show the relationship between two variables
• It contains one variable on the x-axis and the
other variable on the y-axis
• A scatter diagram shows how the two
variables are related to each other, both the
strength and direction of the relationship
1-68
Scatter Diagram Examples
[Figure: scatter diagrams of y against x. Panels: Positive Linear relationship, Negative Linear relationship, Curvilinear relationships.]
1-69
Scatter Diagram Examples
[Figure: scatter diagrams of y against x. Panels: Strong relationships, Weak relationships.]
1-70
Scatter Diagrams Examples
[Figure: scatter diagrams of y against x showing No relationship.]
1-71
Salary vs. Years of Education
1-72
How to Create a Scatter Diagram
in Excel
• Highlight the data, making sure that the
variable you want on the y-axis is in the right-hand column
• Select “Insert” and then “Scatter” and click on
the first option
• Make sure to change the chart title and add axis
titles.
• Possibly delete the legend and change the
start values for the axes.
1-73
[Figure: scatter diagram titled "Salary vs. Experience". Vertical axis: Salary (dollars), 0 to 160,000. Horizontal axis: Experience (years), 10 to 22.]
1-74
What does the Scatter Diagram on
the previous slide tell us?
• The relationship between education and
salary is positive (in general as education
increases salary increases)
• The relationship is fairly strong because the
data points are clustered closely together
• This scatter diagram indicates that while the
variable education is helpful for predicting
salaries, it will not yield perfect predictions.
1-75
Covariance and the Correlation
Coefficient for the Linear
Relationship between two variables
• The covariance and the correlation coefficient supply a
numeric value for the strength and direction of the
linear relationship between two variables
– Only concerned with strength of the
relationship
– No causal effect is implied
1-76
Covariance
• Covariance is a measure of the linear relationship
between two random variables
• A positive covariance indicates a positive linear
relationship between x and y (if x is below its mean
then y tends to be below its mean and if x is above
its mean then y tends to be above its mean)
• A negative covariance indicates a negative linear
relationship between x and y (if x is below its mean
then y tends to be above its mean and if x is above
its mean then y tends to be below its mean)
1-77
Covariance
• A covariance near 0 indicates no linear relationship
between x and y
• A problem with covariance is that it depends on the
units of measurement for x and y. If we change from
measuring in feet to inches, the magnitude of the covariance will go up
even though the overall relationship hasn’t changed.
1-78
Covariance – a Measure of Linear
Association Between Two Variables
• Remember the formula for variance is
$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})}{n-1}$
or how x varies with itself.
The formula for Covariance is
$\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$
and it measures how x varies with y in a linear fashion.
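The covariance formula can be sketched in Python. The function name is chosen for this sketch, and the height and weight values are made up to show the units problem: converting x from feet to inches multiplies the covariance by 12 even though the relationship itself is unchanged.

```python
def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)

# Hypothetical data (not from the slides)
heights_feet = [5.0, 5.5, 6.0, 6.5]
weights = [120.0, 150.0, 160.0, 190.0]
heights_inches = [12 * h for h in heights_feet]

cov_feet = sample_covariance(heights_feet, weights)
cov_inches = sample_covariance(heights_inches, weights)
print(cov_feet, cov_inches)
```

Applying the function to a variable and itself returns that variable's sample variance, which is the "x varies with itself" case described above.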
1-79
Applying the Covariance Formula
Cov(x,y) = Sum/(n-1) = 743000/9 = 82,555.5556
1-80
Calculating Covariance in Excel
• Excel's covar command divides by n rather than by n-1, so it
returns the population covariance, not the sample covariance.
• The Excel command is
=Covar(highlight x values, highlight y values)
• You should perform this command in Excel for the
data set above and see if it matches the value
82,555.5556.
• If you obtain 74,300 using the covar command
(which is likely), you must multiply the value you
obtain in Excel by n/(n-1) to obtain the correct
value for covariance.
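The n/(n-1) correction can be verified with the numbers from the example, where the sum of cross products is 743,000 and n = 10. The variable names are chosen for this sketch.

```python
n = 10
sum_of_cross_products = 743000

# Dividing by n gives the value reported by the covar command
population_cov = sum_of_cross_products / n        # 74,300
# Dividing by n - 1 gives the sample covariance
sample_cov = sum_of_cross_products / (n - 1)      # 82,555.5556

# Multiplying the Excel value by n/(n-1) recovers the sample covariance
corrected = population_cov * n / (n - 1)
print(population_cov, round(sample_cov, 4), round(corrected, 4))
```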
1-81
Correlation Coefficient
• The sample correlation coefficient, rxy, is an
estimate of the population correlation coefficient
and is used to measure the strength and
direction of the linear relationship between two random
variables.
• The correlation is a unit free measure (unlike the
covariance) and falls between -1 and 1.
1-82
What Does the Correlation Coefficient Mean?
• If all the points in a data set fall on a positively
sloped line, rxy =1.
• If all the points in a data set fall on a negatively
sloped line, rxy =-1.
• If there is no linear relationship between x and y
then rxy =0.
• The closer to -1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear
relationship
• The closer to 0, the weaker the linear relationship
1-83
Examples of Approximate rxy Values
[Figure: five scatter diagrams of y against x with approximate correlations r = -1, r = -.6, r = 0, r = +.3, and r = +1.]
1-84
Calculating the Correlation Coefficient
Sample correlation coefficient:
$r_{xy} = \frac{\mathrm{Cov}(x, y)}{\text{st.dev.}(x)\ \text{st.dev.}(y)} = \frac{s_{xy}}{s_x s_y}$
From above, the standard deviation of x is 2.708
and the standard deviation of y is 38,189.037.
$r_{xy} = \frac{82{,}555.5556}{(2.708)(38{,}189.0037)} = 0.7983$
A correlation of 0.7983 means that education and
salary are positively related and the relationship is
strong (because this value lies near 1)
In Excel =correl(highlight x values, highlight y values)
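The calculation can be sketched in Python. The function name is hypothetical and works from raw data. Because the salary data set is not reproduced in this transcript, the slide's result is checked from the reported covariance and standard deviations instead, and the two short lists are made up to show the limiting cases.

```python
from math import sqrt

def sample_correlation(x, y):
    """r_xy = s_xy / (s_x * s_y), using n - 1 in every denominator."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Check of the slide's arithmetic from the reported summary values
r_slide = 82555.5556 / (2.708 * 38189.037)
print(round(r_slide, 4))

# Hypothetical data lying exactly on a line
print(sample_correlation([1, 2, 3], [2, 4, 6]))    # positively sloped line
print(sample_correlation([1, 2, 3], [6, 4, 2]))    # negatively sloped line
```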
1-85
What Does Correlation Mean?
• Correlation provides a measure of linear association
between two variables. A correlation coefficient near 0
only means that there is a weak linear association between
the two variables, not that there isn’t any relationship
between the two variables.
• A high correlation between two variables does not mean
that changes in one variable will cause changes in the
other variable.
• We might find that the quality rating and the typical mean
price of restaurants are positively correlated. However,
simply increasing the mean price at a restaurant will not
cause the quality rating to increase.
1-86