Introduction to Stata
Luigi Benfratello
Laboratorio Econometria e Statistica 2014-15
Stata: Data Analysis and Statistical Software (i)
• Why Stata?
• Which Stata? (Different versions exist; data
limits depend on the Stata version)
– Stata/MP (multiprocessor): fastest version of Stata (for dual-core
and multicore/multiprocessor computers)
– Stata/SE: Stata for large datasets
– Stata/IC (Intercooled Stata): Stata for moderate-sized datasets
– Small Stata: a version of Stata that handles small datasets (for
educational purchases only)
For more detail see: http://www.stata.com/products/whichstata.html
Stata: Data Analysis and Statistical Software (ii)
How to get help?
Stata web site (http://www.stata.com)
Online help (the help command)
Official manuals
Books on Stata:
• Acock, A Gentle Introduction to Stata, 3rd Edition, Stata Press, 2010
• Baum, An Introduction to Modern Econometrics Using Stata, Stata Press, 2006
• Cameron & Trivedi, Microeconometrics Using Stata, Stata Press, 2009
Stata: Data Analysis and Statistical Software
• Operation
– Interactive Mode
• Command
• Menu
– Batch Mode
• do files
• Windows Interface
– 4 Windows
• Command Window, Results Window, Command Review,
Variables in memory
– Other windows
• Viewer, Data Editor, Do-file Editor, Graph
Directories in Stata
Stata supports some DOS-style commands, such as:
• cd changes directory
• mkdir creates a new directory within the current one
• dir lists the contents of a directory or folder
• pwd displays the current directory
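The directory commands above can be tried in sequence. This is only a minimal sketch: the folder name stata_lab is a hypothetical example, and capture is used so that the do file does not stop if the folder already exists.

```stata
* Minimal sketch; "stata_lab" is a hypothetical folder name
pwd                        // show the current working directory
capture mkdir stata_lab    // create a subfolder (capture ignores the error if it exists)
cd stata_lab               // move into the new folder
pwd                        // confirm the change
dir                        // list the contents of the folder
cd ..                      // go back to the parent folder
```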
How to read the data?
• Stata Dataset
– use filename
• Text Data
– .csv or .txt file with names of the variables in the first
row
– insheet using filename
• Copy-and-paste or manual typing
• Stat/Transfer software
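A short sketch of the first two routes, assuming the auto example dataset that ships with Stata; the file name auto_copy is hypothetical. In Stata 13 and later, import delimited is the newer counterpart of insheet, but insheet still works.

```stata
sysuse auto, clear                            // example dataset shipped with Stata
save auto_copy, replace                       // writes auto_copy.dta (hypothetical name)
outsheet using auto_copy.csv, comma replace   // text file, variable names in the first row

use auto_copy, clear                          // read a Stata dataset
insheet using auto_copy.csv, comma clear      // read the comma-separated text file
describe                                      // check what was loaded
```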
Data in Stata
• Usually information is stored in variables (but it might also be stored in matrices
and macros)
• Each row of the dataset refers to an observation
• Indicator variable
• Different types of data
– Numeric
• byte: integer between -127 and 100, e.g. dummy variable
• int: integer between -32,767 and 32,740, e.g. year variable
• long: integer between -2,147,483,647 and 2,147,483,620, e.g.
population data
• float: real number with about 7 digits of accuracy, e.g. output data
• double: real number with about 16 digits of accuracy
– String
• Missing values
– Numeric: single dot (.)
– String: empty string ("")
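The storage types and missing-value codes listed above can be inspected on the auto example dataset shipped with Stata. This is a sketch only; the new variable names (expensive, price_k) are hypothetical.

```stata
sysuse auto, clear
describe price rep78 make foreign          // shows the storage type of each variable
generate byte expensive = price > 10000    // a 0/1 indicator fits in a byte
generate double price_k = price / 1000     // double precision for a real number
count if missing(rep78)                    // numeric missing values are stored as .
count if make == ""                        // a missing string is the empty string
compress                                   // store each variable in the smallest suitable type
```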
Examining the Data (I)
General comments: commands and variable names in Stata
are case-sensitive; commands can be abbreviated
• To reduce the size of a dataset, use the compress
command
• The browse and edit commands open the Data Editor
• The list command displays the data in the
Results window (if necessary, use the in and if qualifiers)
• Other useful commands:
• describe
• codebook
• summarize (sometimes with the option detail)
• tabulate
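A possible sequence using the commands above, again on the auto example dataset shipped with Stata (the choice of variables is only illustrative):

```stata
sysuse auto, clear
describe                            // variable names, storage types, labels
codebook rep78                      // detailed description of one variable
summarize price, detail             // mean, variance, percentiles, and more
tabulate foreign                    // frequency table
list make price mpg in 1/5          // first five observations
list make price if price > 10000    // only observations satisfying a condition
```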
Examining the Data (II)
• To draw graphs, use the graph command. The syntax
for graph is quite complex and specific to the type of
graph. It is easier to work with the menu when making a
graph
• Examples:
• graph twoway histogram age, ///
title(Age of the Italian Population) ///
subtitle(Adults) note(Source: Istat) ///
xtitle(age in years)
• graph twoway scatter wage age, ///
title(Scatterplot age vs wage) ///
note(Source: World Bank) xtitle(age) ///
ytitle(wage)
Saving the dataset and keeping track (I)
• To save a dataset use the save command (with the
replace option if you want to overwrite an existing file)
• To save results in a file, use the log using command
(with the replace option if you want to overwrite an
existing file)
• Write a do file with all your commands to use Stata in
batch mode
• Insert comments in your do file text with *, //, and /* */
characters. If necessary, break a command line with ///
Saving the dataset and keeping track (II)
• Put labels on the dataset with the label data
command
• Put labels on the variables with the label variable
command
• Put labels on values of a variable with the label
values and label define commands
Example of a do file (I)
* Example of do file
clear
cd "…"
capture log close // this command closes an open log file, if any
log using class1.log, replace
set more off
set memory 100m // needed only in Stata 11 or earlier; later versions ignore it
use "…" // this command opens the dataset
/* let's label the data, label one variable, and assign
value labels to the variable */
label data "Data from World Bank"
label variable education "level of education, by class"
label define educ 1 "none" 2 "primary" 3 "secondary" ///
4 "university"
label values education educ
log close // this command closes the log file
Manipulating data (I)
• Use the rename command to change the name of a
variable
• Use the recode and replace commands to change
the values of some variables
• Use the keep and drop commands to select the variables
and/or observations to keep in or drop from your dataset
• Use the sort command to sort in ascending order and the
gsort command to sort in ascending or descending order
Manipulating data (II)
• Relational operators are:
• == equal to
• != not equal to
• > greater than
• >= greater than or equal to
• < less than
• <= less than or equal to
• Logical operators are:
• & and
• | or
• ~ not
• ! not
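The operators above are typically used in if qualifiers. A small sketch on the auto example dataset shipped with Stata; it also shows that a numeric missing value counts as larger than any number, so it has to be excluded explicitly when using > or >=.

```stata
sysuse auto, clear
count if price > 5000 & foreign == 1      // and
count if rep78 <= 2 | rep78 == 5          // or
count if !(foreign == 1)                  // not; same as foreign != 1
* Numeric missing (.) is treated as larger than any number
count if rep78 > 3                        // also counts observations with missing rep78
count if rep78 > 3 & !missing(rep78)      // excludes them
```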
Manipulating data (III)
• You can re-run a command for different subsets of the
data using the sort command and the by prefix or
simply with the bysort command
• You can change the order of your variables with order,
aorder, and move commands
• You can combine datasets with the merge, append, and
joinby commands
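As an illustration of merge, the sketch below types in two tiny hypothetical datasets that share an identifier id and combines them. It is meant to be run as a whole do file, because the temporary file name is held in a local macro.

```stata
* First hypothetical dataset: id and x
clear
input id x
1 10
2 20
3 30
end
tempfile left
save "`left'"

* Second hypothetical dataset: id and y
clear
input id y
1 100
3 300
4 400
end

merge 1:1 id using "`left'"    // match observations on id; creates _merge
tabulate _merge                // 1 = only in memory, 2 = only in using file, 3 = matched
sort id
list
```

Here ids 1 and 3 appear in both files, id 4 only in the data in memory, and id 2 only in the saved file, so the result has four observations.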
Example of a do file (II)
* rename a variable
rename lf41 age
* recode the missing values
recode age (-9999=.)
* replace the content of a string variable
replace region="Piedmont" if region_code=="Pdm"
* keep some variables
keep region region_code
drop age /* these two commands are equivalent if there are only
three variables in the dataset */
* keep some observations
keep if age > 20 // beware: missing values are also kept
* sort by age (in ascending order)
sort age
* sort in ascending order by age and in descending order by educ
gsort age -educ
* use of bysort
bysort year: sum gdp
* use of aorder, order, move
aorder
order gdp year country
move year country
Generating new variables
• The generate and egen commands create new
variables from existing ones. egen works on summary
statistics
• Dummy variables are more easily created with the gen
command by writing the condition after the “=” sign
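A sketch of dummy creation with a condition after the "=" sign, using the auto example dataset shipped with Stata; the variable names heavy and good_rep are hypothetical. The second command adds an if qualifier so that missing values stay missing instead of being coded as 1.

```stata
sysuse auto, clear
* The logical expression evaluates to 1 when true and 0 when false
generate byte heavy = weight > 3000
* Missing values count as larger than any number, so guard against them
generate byte good_rep = rep78 >= 4 if !missing(rep78)
tabulate heavy
tabulate good_rep, missing
```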
Example of a do file (III)
• generate pnew=ln(pnc) // log price of new cars
• gen constant=1 /* constant value of 1 */
• gen popsq=pop^2 /* squared population */
• egen totalpop=sum(pop), by(year) /* world population per year */
• egen avgpop=mean(pop), by(year) /* average country pop per year */
• egen maxpop=max(pop) /* largest population value */
• egen countpop=count(pop) /* counts number of non-missing obs */
• egen groupid=group(country_code) /* generates numeric id variable for countries */
Estimation and testing
• The summarize command estimates the population mean
and variance with their sample counterparts
• The ttest command performs hypothesis tests
about the mean (one sample, two samples, paired samples)
• The correlate and pwcorr commands compute
linear correlations among variables
• The command pctile newvar = exp, pctile_options creates a
variable containing percentiles
• The command xtile newvar = exp, xtile_options creates a
variable containing quantile categories
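The difference between pctile and xtile is easiest to see side by side. A sketch on the auto example dataset shipped with Stata, with hypothetical variable names: pctile stores the cutoff values themselves, whereas xtile assigns each observation to a category.

```stata
sysuse auto, clear
pctile price_pct = price, nquantiles(4)   // the 3 quartile cutoffs, stored in observations 1-3
xtile price_q = price, nquantiles(4)      // quartile category (1 to 4) of each observation
list price_pct in 1/3
tabulate price_q
```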
Example of a do file (IV)
• summarize pnew // summary stats
• bysort foreign: summarize pnew /* summary stats by foreign */
• ttest pnew=5 /* test on the overall mean */
• ttest pnew=pworld, unpaired /* test on equality for means of two different
populations; add the unequal option to allow for unequal variances */
• ttest pnew=pnew_after /* test on equality for means of
the same population (paired test) */
• pwcorr pnew pworld // computes correlation coefficient
Estimation and testing (II)
• The regress command performs linear regression (both
bivariate and multiple regression)
The basic syntax is
[by varlist:] regress depvar [varlist] [weight]
[if exp] [in range] [, level(#) noconstant robust]
Stata automatically inserts a constant among the regressors,
unless the noconstant option is specified
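A sketch of how the options in the syntax diagram can be combined, using the auto example dataset shipped with Stata (the choice of variables is only illustrative):

```stata
sysuse auto, clear
regress price weight mpg                      // constant included by default
regress price weight mpg, noconstant          // regression without a constant
regress price weight mpg, robust level(90)    // robust standard errors, 90% confidence intervals
regress price weight mpg if foreign == 0      // estimation on a subsample
bysort foreign: regress price weight mpg      // one regression per group
```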
Estimation and testing (III)
• The test command performs hypothesis tests
about the regression coefficients
The basic syntax is:
test [exp = exp] for testing general hypotheses
or
test [coefficientlist] for testing coefficients = 0
Alternative commands are lincom, whose syntax is:
lincom exp [, level(#)]
and testparm, which tests that a list of coefficients is jointly zero.
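The three commands can be compared after a single regression. A sketch on the auto example dataset shipped with Stata; the hypotheses are chosen only for illustration. Note that lincom takes a linear combination without an equal sign.

```stata
sysuse auto, clear
regress price weight mpg length
test weight                 // H0: the coefficient of weight is 0
test weight = 2*mpg         // H0: a linear restriction across two coefficients
lincom weight - 2*mpg       // estimate and confidence interval of the same combination
testparm weight mpg         // H0: both coefficients are jointly 0
```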
Estimation and testing (IV)
• The predict command generates new variables
from the estimation results, notably residuals and fitted
(predicted) values
The basic syntax is
predict [type] newvarname [if exp] [in range] [,
statistic]
where statistic can be:
• xb, the linear prediction (fitted values; the default)
• residuals, the residuals
Example of a do file (V)
• regress gdp // regression on a constant only
• regress gdp money unemp /* regression on a constant and
money and unemp */
• regress gdp money unemp, robust /* as before, with robust
standard errors */
• test money /* test the coefficient of money = 0 */
• test money=2*unemp /* test that the coefficient of money is
twice that of unemp */
• lincom money-2*unemp /* as before, slightly different
output */
• predict y_hat /* generates predicted values */
• predict residuals, resid /* generates residuals */
Accessing and showing results
• The display command displays
computations, strings, and estimates obtained from previous
estimation commands
• Generally, Stata stores results in e() or in r(). You can
see them with the commands ereturn list or return
list.
Example of a do file (VI)
• regress gdp money unemp /* regression on a constant and
money and unemp */
• ereturn list /* lists everything stored after regress */
• generate b_money=_b[money] /* generates a variable equal
to the estimated coefficient of money */
• display log(e(N)) /* computes the log of the number of
observations used in the previous regression */
• summarize gdp // computes descriptive stats
• return list /* lists everything stored after summarize */
• gen mean_gdp=r(mean) /* generates a new variable equal to
the mean, equivalent to egen mean_gdp=mean(gdp) */
Accessing and showing results
• The command estimates store name stores the results of
the last estimated model in memory under the name name,
for later use
• The command
estimates table [namelist] [, stats(scalarlist)
keep(coeflist) drop(coeflist) b[(%fmt)] se[(%fmt)]
stfmt(%fmt) title(string)]
displays previously stored results in a table
Example of a do file (VII)
• regress gdp money unemp /* regression on a
constant and money and unemp */
• est store reg1 /* results are stored under
the reg1 name */
• regress gdp money /* regression on a constant
and money only */
• est store reg2 /* results are stored under
the reg2 name */
• estimates table reg*, b(%7.4f) se(%7.4f) stats(N r2_a)
/* results are shown in a table with estimated coefficients,
standard errors, number of observations and adjusted R2 */
Models with binary dependent variables
The basic command for estimating a model with a binary dependent
variable as a logit model is:
● logit depvar [varlist] [if exp] [, options]
whereas the command to estimate it with a probit model is:
● probit depvar [varlist] [if exp] [, options]
The command dprobit directly displays marginal effects
instead of coefficients for the probit case
To obtain goodness-of-fit measures, the following command
(in version 9 or later) can be used:
● estat classif
Models with binary dependent variables
For both logit and probit, it is possible to use the predict command
predict [type] newvarname [if] [in] [, statistic]
where some of the possible statistics are
p probability of a positive outcome (the default)
xb fitted values of the index function
The command to obtain marginal effects after estimating the model is
margins [marginlist] [if] [in] [weight] [,
response_options options]
where the response options are:
dydx(varlist) marginal effect of variables in varlist
eyex(varlist) elasticities of variables in varlist
dyex(varlist) semielasticity -- d(y)/d(lnx)
eydx(varlist) semielasticity -- d(lny)/d(x)
and the main option is
at(atlist)
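A sketch of the full sequence on the auto example dataset shipped with Stata, where foreign is a 0/1 variable; the regressors and the value in at() are chosen only for illustration, and the new variable names are hypothetical.

```stata
sysuse auto, clear
logit foreign weight mpg
estat classification                  // classification table and related measures
predict p_hat                         // predicted probability (the default statistic)
predict index, xb                     // fitted value of the index function
margins, dydx(weight mpg)             // marginal effects averaged over the sample
margins, dydx(mpg) at(weight=3000)    // marginal effect of mpg with weight fixed at 3000
probit foreign weight mpg             // same model, probit link
margins, dydx(*)                      // marginal effects of all regressors
```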
Example of a do file (VIII)
• logit unemployed female educ /* to estimate a logit
model in which the probability of unemployment is
a function of gender and education */
• probit unemployed female educ /* same as above, but probit */
• estat classif /* to obtain some goodness of fit measures */
• margins, dydx(educ) at(educ=10) /* to obtain the
marginal effect of the variable educ evaluated
at the value of 10 (by default, margins averages the
marginal effect over the sample) */
Panel data
Before using panel data, you have to declare the panel structure of your data
xtset panelvar [timevar]
xtset, clear (clears the panel settings)
Commands specific to panel data are:
xtdescribe shows the structure of the panel
xtsum [varlist] decomposes total variation into a
“between” and a “within” component
xttab [varlist] decomposes occurrences into a
“between” and a “within” component.
Panel data
To estimate a linear regression model with panel data the
standard command is:
xtreg depvar [varlist] [if exp] [, re be fe
robust]
where the option re (the default) refers to the random
effects estimator, the be option to the between estimator,
and the fe option to the fixed effects (or within groups)
estimator.
The option robust also implies clustering at the individual
level.
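The panel commands can be tried on simulated data. The sketch below is an assumption-laden illustration, not real data: it builds a hypothetical balanced panel of 50 units observed for 6 periods, with a unit-specific effect u that is correlated with the regressor x, and then declares the panel and runs the estimators.

```stata
clear
set seed 12345
set obs 50                              // 50 hypothetical units
generate id = _n
generate u = rnormal()                  // unit-specific effect
expand 6                                // 6 observations per unit
bysort id: generate t = _n              // time index within unit
generate x = rnormal() + 0.5*u          // regressor correlated with the unit effect
generate y = 1 + 2*x + u + rnormal()    // true slope is 2 by construction

xtset id t                              // declare the panel structure
xtdescribe
xtsum y x
xtreg y x, re                           // random effects (the default)
xtreg y x, be                           // between estimator
xtreg y x, fe robust                    // fixed effects, SEs clustered on id
```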