Introduction to Stata
Luigi Benfratello
Laboratorio Econometria e Statistica 2014-15

Stata: Data Analysis and Statistical Software (i)
• Why Stata?
• Which Stata? (Different versions exist; data limits depend on the Stata version)
  – Stata MP (Multiprocessor): fastest version of Stata (for dual-core and multicore/multiprocessor computers)
  – Stata SE: Stata for large datasets
  – Stata Intercooled: Stata for moderate-sized datasets
  – Small Stata: a version of Stata that handles small datasets (for educational purchases only)
For more detail see: http://www.stata.com/products/whichstata.html

Stata: Data Analysis and Statistical Software (ii)
How to get help?
• Stata web site (http://www.stata.com)
• Online help
• Official manuals
• Books on Stata:
  – Acock, A Gentle Introduction to Stata, 3rd Edition, Stata Press, 2010
  – Baum, An Introduction to Modern Econometrics Using Stata, Stata Press, 2006
  – Cameron & Trivedi, Microeconometrics Using Stata, Stata Press, 2009

Stata: Data Analysis and Statistical Software
• Operation
  – Interactive mode
    • Command
    • Menu
  – Batch mode
    • do files
• Windows interface
  – 4 main windows: Command Window, Results Window, Command Review, Variables in memory
  – Other windows: Viewer, Data Editor, Do file editor, Graph

Directories in Stata
Stata supports some DOS-style commands, such as:
• cd: changes directory
• mkdir: creates a new directory within the current one
• dir: lists the contents of a directory or folder
• pwd: displays the current directory

How to read the data?
• Stata dataset
  – use filename
• Text data
  – .csv or .txt file with the names of the variables in the first row
  – insheet using filename
• Copy-and-paste or manual typing
• Stat/Transfer software

Data in Stata
• Usually information is stored in variables (but it might also be stored in matrices and macros)
• Each row of a variable refers to an observation
• Indicator variable
• Different types of data
  – Numeric
    • byte: integer between -127 and 100, e.g. dummy variable
    • int: integer between -32,767 and 32,740, e.g. year variable
    • long: integer between -2,147,483,647 and 2,147,483,620, e.g. population data
    • float: real number with about 7 digits of accuracy, e.g. output data
    • double: real number with about 16 digits of accuracy
  – String
• Missing values
  – Numeric: single dot (.)
  – String: empty double quotes ("")

Examining the Data (I)
General comments: commands and variable names in Stata are case-sensitive; commands can be abbreviated
• To reduce the size of a dataset, use the compress command
• The browse and edit commands open the data editor
• The list command displays the data in the Results window (if necessary, restrict the output with the in and if qualifiers)
• Other useful commands:
  – describe
  – codebook
  – summarize (sometimes with the option detail)
  – tabulate

Examining the Data (II)
• To draw graphs, use the graph command. The syntax for graph is quite complex and specific to the type of graph. It is easier to work with the menu when making a graph
• Examples:
graph twoway histogram age, ///
    title(Age of the Italian Population) ///
    subtitle(Adults) note(Source: Istat) ///
    xtitle(age in years)
graph twoway scatter wage age, ///
    title(Scatterplot age vs wage) ///
    note(Source: World Bank) xtitle(age) ///
    ytitle(wage)

Saving the dataset and keeping track (I)
• To save a dataset, use the save command (with the replace option if you want to overwrite an existing file)
• To save results in a file, use the log using command (with the replace option if you want to overwrite an existing file)
• Write a do file with all your commands to use Stata in batch mode
• Insert comments in your do file with the *, //, and /* */ characters.
If necessary, break a command line with ///

Saving the dataset and keeping track (II)
• Put a label on the dataset with the label data command
• Put labels on the variables with the label variable command
• Put labels on the values of a variable with the label define and label values commands

Example of a do file (I)
* Example of do file
clear
cd "…"
capture log close // this command closes existing output, if any
log using class1.log, replace
set more off
set memory 100m // not needed from Stata 12 onwards
use "…" // this command opens the dataset
/* let's label the data, label one variable and assign
   some value labels to the variable */
label data "Data from World Bank"
label variable education "level of education, by class"
label define educ 1 "none" 2 "primary" 3 "secondary" ///
    4 "university"
label values education educ
log close // this command closes existing output

Manipulating data (I)
• Use the rename command to change the name of a variable
• Use the recode and replace commands to change the values of some variables
• Use the keep and drop commands to select the variables and/or observations to keep in or drop from your dataset
• Use the sort command to sort in ascending order and the gsort command to sort in ascending or descending order

Manipulating data (II)
• Relational operators are:
  – ==  equal to
  – !=  not equal to
  – >   greater than
  – >=  greater than or equal to
  – <   less than
  – <=  less than or equal to
• Logical operators are:
  – &   and
  – |   or
  – ~   not
  – !   not

Manipulating data (III)
• You can run a command separately on different subsets of the data using the sort command and the by prefix, or simply with the bysort prefix
• You can change the order of your variables with the order, aorder, and move commands
• You can combine datasets with the merge, append, and joinby commands

Example of a do file (II)
* rename a variable
rename lf41 age
* recode the missing values
recode age (-9999=.)
* replace the content of a string variable
replace region="Piedmont" if region_code=="Pdm"
* keep some variables
keep region region_code
drop age /* these two commands are equivalent if there are only
            three variables in the dataset */
* keep some observations
keep if age > 20 // beware: observations with missing age are kept too
* sort by age (in ascending order)
sort age
* sort in ascending order by age and in descending order by educ
gsort age -educ
* use of bysort
bysort year: sum gdp
* use of aorder, order, move
aorder
order gdp year country
move year country

Generating new variables
• The generate and egen commands create new variables from existing ones. egen works on summary statistics
• Dummy variables are most easily created with the generate command by writing the condition after the "=" sign

Example of a do file (III)
generate pnew=ln(pnc) // log price of new cars
gen constant=1 /* constant value of 1 */
gen popsq=pop^2 /* squared population */
egen totalpop=total(pop), by(year) /* world population per year */
egen avgpop=mean(pop), by(year) /* average country pop per year */
egen maxpop=max(pop) /* largest population value */
egen countpop=count(pop) /* counts the number of non-missing obs */
egen groupid=group(country_code) /* generates numeric id variable for countries */

Estimation and testing
• The summarize command estimates the population mean and variance with their sample counterparts
• The ttest command performs tests of hypotheses about the mean (one sample, two samples, paired samples)
• The correlate and pwcorr commands compute linear correlations among variables
• The command pctile newvar = exp, pctile_options creates a variable containing percentiles
• The command xtile newvar = exp, xtile_options creates a variable containing quantile categories

Example of a do file (IV)
summarize pnew // summary stats
bysort foreign: summarize pnew /* summary stats by foreign */
ttest pnew==5 /* test on the overall mean */
ttest pnew==pworld, unpaired [unequal] /* test on equality of the means of two different populations; the unequal option is optional */
ttest pnew==pnew_after /* test on equality of the means of the same population (paired test) */
pwcorr pnew pworld // computes the correlation coefficient

Estimation and testing (II)
• The regress command performs linear regression (both bivariate and multiple regression)
The basic syntax is
[by varlist:] regress depvar [varlist] [weight] [if exp] [in range] [, level(#) noconstant robust]
Stata automatically inserts a constant among the regressors, unless the noconstant option is specified

Estimation and testing (III)
• The test command performs tests of hypotheses about the regression coefficients
The basic syntax is:
test [exp = exp]
for testing general hypotheses, or
test [coefficientlist]
for testing coefficients = 0
Alternative commands are lincom, whose syntax is:
lincom exp [, level(#)]
and testparm, which tests a range of coefficients.

Estimation and testing (IV)
• The predict command generates new variables from the estimation results, notably residuals and fitted (predicted) values
The basic syntax is
predict [type] newvarname [if exp] [in range] [, statistic]
where statistic can be:
• xb: linear prediction, i.e. fitted values (the default)
• residuals: residuals

Example of a do file (V)
regress gdp // regression on a constant only
regress gdp money unemp /* regression on a constant and money and unemp */
regress gdp money unemp, robust /* as before, with robust standard errors */
test money /* test that the coefficient of money = 0 */
test money=2*unemp /* test that the coefficient of money is twice that of unemp */
lincom money-2*unemp /* as before, written as a linear combination tested against 0; slightly different output */
predict y_hat /* generates predicted values */
predict residuals, resid /* generates residuals */

Accessing and showing results
• The display command displays computations, strings, and estimates obtained in previous estimation commands
• Generally, Stata
stores results in e() or in r(). You can see them with the commands ereturn list or return list.

Example of a do file (VI)
regress gdp money unemp /* regression on a constant and money and unemp */
ereturn list /* lists everything that is stored after regress */
generate b_money=_b[money] /* generates a variable equal to the estimated coefficient of money */
display log(e(N)) /* computes the log of the number of observations used in the previous regression */
summarize gdp // computes descriptive stats
return list /* lists everything that is stored after summarize */
gen mean_gdp=r(mean) /* generates a new variable equal to the mean, equivalent to egen mean_gdp=mean(gdp) */

Accessing and showing results
• The command estimates store name stores the results of the last estimated model in memory under the name name, for later use
• The command
estimates table [namelist] [, stats(scalarlist) keep(coeflist) drop(coeflist) b[(%fmt)] se[(%fmt)] stfmt(%fmt) title(string)]
displays previously stored results in a table

Example of a do file (VII)
regress gdp money unemp /* regression on a constant and money and unemp */
est store reg1 /* results are stored under the reg1 name */
regress gdp money /* regression on a constant and money only */
est store reg2 /* results are stored under the reg2 name */
estimates table reg*, b(%7.4f) se(%7.4f) stats(N r2_a) /* results are shown in a table with estimated coefficients, standard errors, number of observations and adjusted R2 */

Models with binary dependent variables
• The basic command for estimating a model with a binary dependent variable by logit is:
logit depvar [varlist] [if exp] [, options]
• whereas the command to estimate it by probit is:
probit depvar [varlist] [if exp] [, options]
• The command dprobit directly displays marginal effects instead of coefficients for the probit case
• To obtain goodness-of-fit measures, the following command (in
version 9 or later) can be used:
estat classif

Models with binary dependent variables
For both logit and probit, it is possible to use the predict command
predict [type] newvarname [if] [in] [, statistic]
where some of the possible statistics are
p     probability of a positive outcome (the default)
xb    fitted values of the index function
The command to obtain marginal effects after estimating the model is
margins [marginlist] [if] [in] [weight] [, response_options options]
where the response options are:
dydx(varlist)   marginal effects of the variables in varlist
eyex(varlist)   elasticities of the variables in varlist
dyex(varlist)   semielasticity -- d(y)/d(lnx)
eydx(varlist)   semielasticity -- d(lny)/d(x)
and the main option is at(atlist).

Example of a do file (VIII)
logit unemployed female educ /* to estimate a logit model in which the probability of unemployment is a function of gender and education */
probit unemployed female educ /* same as above, but probit */
estat classif /* to obtain some goodness-of-fit measures */
margins, dydx(educ) at(educ=10) /* to obtain the marginal effect of the variable educ evaluated at educ = 10 (by default, margins averages the marginal effect over the sample) */

Panel data
Before using panel data commands, you have to declare your data to be a panel:
xtset panelvar [timevar]
xtset, clear
Commands specific to panel data are:
xtdescribe        shows the structure of the panel
xtsum [varlist]   decomposes total variation into a “between” and a “within” component
xttab [varlist]   decomposes occurrences into a “between” and a “within” component

Panel data
To estimate a linear regression model with panel data the standard command is:
xtreg depvar [varlist] [if exp] [, re be fe robust]
where the option re (the default) refers to the random effects estimator, the be option to the between estimator, and the fe option to the fixed effects (or within groups) estimator. The option robust also implies clustering at the individual level.
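
Example of a do file (IX)
The panel commands above have no worked example in these slides, so here is a minimal sketch of how they might be chained together. It does not come from the original course material: the dataset is simulated, and the variable names (id, year, x, u, y), the seed, and the panel dimensions are all hypothetical choices made only so that the do file runs on its own.

```stata
* Hypothetical simulated panel: 50 units observed over 5 years
clear
set seed 12345
set obs 50
generate id = _n                        // panel identifier
expand 5                                // 5 rows per unit
bysort id: generate year = 2009 + _n    // years 2010 to 2014
generate x = rnormal()                  // regressor
bysort id (year): generate u = rnormal() if _n == 1
bysort id (year): replace u = u[1]      // unit effect, constant within id
generate y = 1 + 2*x + u + rnormal()    // assumed data-generating process

xtset id year                           // declare the panel structure
xtdescribe                              // structure of the panel
xtsum y x                               // between and within variation

xtreg y x, fe                           // fixed effects (within) estimator
estimates store fe
xtreg y x, re                           // random effects (the default)
estimates store re
estimates table fe re, b(%7.4f) se(%7.4f) stats(N)
```

Because the simulated unit effect u is drawn independently of x, both estimators should give a coefficient on x close to the value of 2 used in the simulation; with real data the two can differ substantially.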