Download Slide 1 - UNDP Climate Change Adaptation
WORKSHOP ON ECONOMIC ANALYSIS OF CLIMATE CHANGE
PRACTICAL LESSONS ON STATA 11

INTERACTIVE USE OF STATA
• Interactive use means that STATA commands are initiated within STATA.
• A graphical user interface (GUI) for STATA is available. It enables almost all the STATA commands to be accessed using drop-down menus.
• STATA allows users to type commands directly to execute a particular task.
• The standard procedure in STATA, however, is to collect the various commands needed into one file, called a do-file, that can be run with or without interactive use.

BASICS IN STATA
• Like most software, STATA has some example data sets that new users can use as a starting point in learning STATA.
  – An example of such a data set is auto.dta
• To access the example data:
  – Click File/Example Datasets/… Example datasets installed with Stata
  – Select the data set auto.dta
  – Interactive users can instead type the command
    sysuse auto

DATA MANAGEMENT
• To describe the variables in the data set type:
  – describe or des
  – To describe specific variables, add the name of the variable to the command.
    Eg: des mpg
• NB: Stata commands are case sensitive and must be typed in lower case.
• If you wish to see the summary statistics of the variables, type any of:
    summarize, detail
    sum, detail
    su, detail
    su, d
  – You can drop the detail option if you only want the basic summary statistics.
  – You can summarize specific variables:
    sum varlist, detail
    Eg: sum mpg, detail
        sum mpg
        su mpg

DATA MANAGEMENT
• If you are only interested in a subset of your data, you can inspect it using filters. E.g. if you are only interested in cars within a particular price range you can type:
  – sum if price>=3000 & price<=4400
  – sum if mpg>=16 & mpg<=23
• And then you can contrast
  – sum if price>=3000 | price<=4400
  – sum if mpg>=16 | mpg<=23
  With | (or), an observation is kept if either condition holds, so these two commands select every observation.
• Interpretation of logical and relational operators in STATA:
    >=          greater than or equal to
    <=          less than or equal to
    ==          equal to
    &           and
    |           or
    != or ~=    not equal to
    >           greater than
    <           less than
    .           missing

DATA MANAGEMENT
• The usual arithmetic operators (+, -, *, /) are applicable in STATA.
• STATA allows users to tabulate a variable to see its distribution:
  – tabulate mpg
  – tab mpg

DATA MANAGEMENT
• Some variables have been coded with value labels already assigned to the values. If the user wants to see the actual values used, type:
  – tab varname, nolabel
  – Eg: tab foreign, nolabel

GENERATING NEW VARIABLES
• You can create a new variable by combining existing variables or by performing some arithmetic operations. [gen, egen, recode]
• To create a ratio of two variables:
  – gen mpgratio=mpg/weight
  – sum mpgratio

DATA MANAGEMENT
The same procedure can be applied to obtain traditional transformations such as:
    Square         gen mpg2=mpg^2
    Cubic          gen mpg3=mpg^3
    Square root    gen mpgsqrt=sqrt(mpg)
    Exponential    gen expmpg=exp(mpg)
    Natural log    gen lnmpg=ln(mpg)
                   gen logmpg=log(mpg)
    Base 10 log    gen l10mpg=log10(mpg)

DATA MANAGEMENT
• Eg: gen lprice=log(price+1)
  – Why +1? It avoids taking the log of zero, which is undefined and would be stored as a missing value. It does not help when price itself is missing.
• Sometimes the user may want to generate a new variable only for observations within a particular range.
  – gen lprice2=log(price) if mpg==.     (only observations where mpg is missing)
  – gen llprice=log(price) if mpg>15
• The generate command can also be used to create new (binary) variables.
  – Eg: from the auto.dta data set we are using, we may be interested in finding out how many cars have a 1978 repair record (rep78) above 2. Thus we create a new variable repair = 1 if rep78 is greater than 2, and 0 otherwise.

DATA MANAGEMENT
• Use the commands:
    gen repair=1 if rep78>2 & rep78<.
    replace repair=0 if rep78<=2
  Stata treats a missing value as larger than any number, so the condition rep78<. keeps cars with a missing repair record out of the repair=1 group. The alternative second step, replace repair=0 if repair==., would instead code those cars as 0.
• You can also create categorical variables from a continuous variable.
    tab mpg
    gen mpgcat=1 if mpg<16
    replace mpgcat=2 if mpg>=16 & mpg<26
    replace mpgcat=3 if mpg>=26 & mpg<=35
    replace mpgcat=4 if mpg>35 & mpg<.
    tab mpgcat

DATA MANAGEMENT
tabulate …, generate
• This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable. The syntax is:
    tab oldvar, gen(newvar)
    Eg: tab rep78, gen(repair)
        tab foreign, gen(origin)
• The old variable is categorical. The new variables take the form newvar1, newvar2, newvar3, …

DATA MANAGEMENT
EGEN
This is an extended version of "generate" that creates a new variable by aggregating the existing data. The syntax is:
    egen newvar = fcn(argument) [if exp] [in range], by(var)
where
    newvar is the new variable to be created
    fcn is one of numerous functions such as count(), max(), min(), mean(), median(), rank(), sd(), sum()
    argument is normally just a variable
    var in the by() option must be a categorical variable.
Eg:
    egen avg=mean(mpg)                          : creates a variable holding the average mpg over the entire sample
    egen avg2=median(weight), by(foreign)       : creates a variable holding the median weight of cars for each origin
    egen totalrepairs=sum(rep78), by(foreign)   : generates the total of rep78 for vehicles from each origin
    egen prodwgt=sum(weight*price), by(make)
NB: egen must be typed in lower case. The sum() function is the older name of total(); both are accepted.

DATA MANAGEMENT
recode
• This command changes the values of a categorical variable according to the rules specified. The syntax is:
  – recode varname (oldvalue=newvalue) (oldvalue=newvalue) … [if exp] [in range]
  – recode foreign (0=1) (1=2)
  – recode rep78 (.=9) (*=7)

DATA MANAGEMENT
recode can also leave the original variable untouched and store the recoded values in a new variable if the generate() option is used.
    recode rep78 (1/2=1) (3=2) (4/5=3), gen(repcat)
This creates a new variable that takes on the values 1, 2 or 3. Values of rep78 that match none of the rules (here, only missing values) are carried over to repcat unchanged.
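The variable-creation commands above can be combined into one short sequence. The following is a minimal sketch, assuming the auto.dta example data; the new variable names (highrep, mpgband, meanmpg) are hypothetical and chosen only for illustration.

```stata
* Sketch only: assumes Stata's bundled auto.dta; new variable names are hypothetical
sysuse auto, clear

* Binary indicator that stays missing when rep78 is missing
generate byte highrep = rep78 > 2 if !missing(rep78)

* Four mpg bands with no gaps; min and max are recode keywords
recode mpg (min/15 = 1) (16/25 = 2) (26/35 = 3) (36/max = 4), generate(mpgband)

* Group-level mean attached to every observation
egen meanmpg = mean(mpg), by(foreign)

tabulate mpgband foreign
summarize highrep meanmpg
```

The recode ranges work here because mpg is recorded in whole numbers in auto.dta; for a variable with fractional values, the band limits would need to meet exactly (for example 15/25 rather than 16/25).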
XTILE
• This command creates a new variable that indicates which category a record falls into, when the sample is sorted by an existing variable and divided into n groups of equal size.
• The syntax is:
  – xtile newvar = variable [if exp] [in range], nq(#)
  where
    newvar is the new categorical variable created
    variable is the existing variable used to create the quantile groups
    # is the number of different categories.
  Eg: xtile mpg1quint = mpg, nq(5)
      xtile weight1dec = weight, nq(10)

LIST
The most detailed of the commonly used descriptive commands is list. list displays the values of variables by observation. If varlist is not specified, the output will contain the values of every variable.
    list varlist, or l varlist
    Eg: list mpg

XI: INDICATOR VARIABLES
A complete set of mutually exclusive indicator (dummy) variables can be created in several ways. A simple method is the xi command:
    xi i.rep78, noomit
The noomit option is added because the default setting is to omit the lowest category.

INSPECT
    inspect variable [if exp] [in range]
Gives a small histogram and the number of values that are: unique; positive, zero, negative; integer and non-integer; missing.

LABEL VARIABLE
This command is used to attach labels to variables in order to make the output easier to understand. For example, we know that maritalstat indicates the marital status of the head of household, but other people using the tables may not know this. So we may want to label the variables as follows:
    label variable region "Region of country"
    label variable maritalstat "Marital status"

LABEL VALUES
This command attaches a named set of value labels to a categorical variable. The syntax is:
    label values varname lblname
where
    varname is the categorical variable which will get the labels
    lblname is a set of labels that has already been defined by label define
Here are some examples of labeling values in Stata.
    label variable yield "Yield (tons/hectare)"     gives a label to the variable yield
    label define yesno 0 "no" 1 "yes"               defines a set of labels called yesno
    label values electricity yesno                  attaches the labels to the variable electricity
    label define yesno 3 "perhaps", add             adds a new value label to the existing set
    label define yesno 3 "maybe", modify            modifies an existing value label
    label define reglbl 1 West 2 Center 3 East      defines regional labels
    label values region reglbl                      attaches the regional labels to region
    label define reglbl 2 Central, modify           modifies the regional labels

TABULATE … SUMMARIZE
• This command creates one- and two-way tables that summarize continuous variables. The command tabulate by itself gives frequencies and percentages in each cell (cross-tabulations). With the summarize() option, we can show means and other statistics of a continuous variable.
• The syntax is:
    tabulate varname1 [varname2] [if exp] [in range], summarize(varname3) [options]
• where
  – varname1 is a categorical row variable
  – varname2 is a categorical column variable (optional)
  – varname3 is the continuous variable summarized in each cell
  – options can be used to tell Stata which statistics you want
• tab make, sum(mpg) gives the mean, standard deviation and frequency of mpg for each car model.
• tab make, sum(price) means gives the mean price for each car.
• tab foreign rep78, sum(price) gives a two-way table of price statistics by origin and repair record.

TABSTAT
This command gives summary statistics for a set of continuous variables for each value of a categorical variable. The syntax is:
    tabstat varlist [if exp] [in range], stat(statname [...]) by(varname)
where
    varlist is a list of continuous variables
    statname is a type of statistic
    varname is a categorical variable.

TABLE
This command can create many types of tables. It is probably the most flexible and useful of all the table commands in Stata. The syntax is:
    table rowvar [colvar] [if exp] [in range], c(clist) [row col]
where
    rowvar is the categorical row variable
    colvar is the categorical column variable
    clist is a list of statistics and variables
    row is an option to include a summary row
    col is an option to include a summary column
Examples:
• table foreign, c(mean rep78 sd rep78 median rep78) – table of rep78 statistics by origin
• table foreign rep78, c(mean mpg) – table of average mpg by foreign and rep78
• table foreign, c(mean rep78 mean mpg) – table of average rep78 and mpg by foreign

MODIFYING DATA FILES
• This section describes a number of commands that are used to modify data files in Stata: rename, drop and keep.

rename
This command renames variables. Syntax: rename oldname newname
• Eg: rename mpg mile_per_gallon

drop
This command deletes records or variables.
    drop if price>=4000
    drop if foreign==1

keep
This command deletes everything but the specified observations or variables.
    keep if price<=3000
    keep if foreign==1
    keep mpg rep78 headroom trunk
keep accepts either an if condition or a variable list, not both in one command, so selecting foreign cars and selecting variables takes two commands.

PRESENTING DATA WITH GRAPHS
• In Stata, graphs are primarily made with the graph command, followed by numerous subcommands for controlling the type and format of graph. In addition to graph, there are many other commands that draw graphs.
• Commonly used graph types, commands and options: graph twoway, graph bar, graph pie, graph matrix, histogram, scatter, connect( ), msymbol( )
  http://www.stata.com/support/faqs/graphics/piechart.html

PRESENTING DATA WITH GRAPHS
graph
This command generates numerous types of graphs and diagrams. The syntax is:
    graph graphtype [varlist] [if exp] [in range] [, options]
where
    graphtype is the type of graph
    varlist is the list of variables to graph
    if is used to limit the observations that are included based on the exp condition
    in is used to limit the observations that are included based on the case number
    options are commands to control the look of the graph
• graph bar income, over(sexhead) over(locality)

Histograms
    histogram income, by(sexhead) normal bin(20)
    histogram income, by(locality) normal bin(20)
    histogram mpg, by(foreign) normal bin(20)
NB: bin() sets the number of bins (columns) to include in the histogram.

Scatter Plots
    scatter mpg price
    scatter mpg price, by(foreign)

PIE CHARTS
In Stata, pie charts are drawn using the sums of the variables specified (bar charts plot means by default, and sums when (sum) is requested). Therefore, any zero values will not appear in the chart, as they sum to zero and make no difference to the sum of any other values. If you have a categorical variable that contains labeled integers (for example, 0 or 1, or 1 upwards), and you want a pie or bar chart, you presumably want to show counts or frequencies of those integer values. To create pie charts, first run the variable through tabulate to produce a set of indicator variables:
    Eg: tab foreign, gen(f)
        graph pie f1 f2
    Try:
        tabulate rep78, generate(r)
        graph pie r1 r2 r3 r4 r5
        graph bar (sum) r1 r2 r3 r4 r5

DO-FILE EDITOR
A do-file is a file that stores a Stata program (a set of commands) so that you can edit it and run it later. The Do-file Editor is like a simplified word processor for writing Stata programs. Why use the Do-file Editor rather than the Command window or the menu system?
  – It makes it easier to check and fix errors,
  – it allows you to run the commands later,
  – it lets you show others how you got your result, and
  – it allows you to collaborate with others on the analysis.
In general, any time you are running more than 5-10 commands to get a result, it is easier and safer to use a do-file to store the commands.
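As an illustration of the point about do-files, the graph commands from the preceding slides could be stored together as follows. This is a minimal sketch, assuming Stata 11 and the bundled auto.dta; the file name graphs.do and the indicator stub f are hypothetical.

```stata
* graphs.do: a hypothetical do-file; the file name and variable stub are illustrative
version 11
clear
set more off

sysuse auto                              // example data shipped with Stata

histogram mpg, by(foreign) normal bin(20)
scatter mpg price, by(foreign)

tabulate foreign, generate(f)            // creates the indicators f1 and f2
graph pie f1 f2                          // slice sizes are the sums of f1 and f2
```

Saved under that name, the file can be run from the Command window with do graphs.do, or from the Do-file Editor. Each graph command replaces the previous graph in the Graph window unless a name() option is added.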
LOG FILES
• You can click on File/Log to begin or close a log file (Suspend and Resume temporarily turn the log off and on).
• You can use "log" commands in the Command window.
• You can use "log" commands in a do-file.

OPENING FILES
STATA FILES (.dta)
To open a Stata file:
    use filename, clear
    Eg: use "G:\fenergydata.dta", clear
    use varlist using filename, clear     [for a subset of the data file]
Alternatively you can use the drop-down menu bar to open the data
  – File/Open/… (select the data)

IMPORTING EXCEL DATA
To import data from Excel, one first has to save the data in a text format, either CSV (comma-separated) or tab-delimited. For such non-Stata files, the command for importing data is "insheet using"
  – insheet using filename, clear
  – Eg: insheet using "C:\Users\myjumens\Desktop\fenergydata.csv"
• Alternatively you can use the drop-down menu bar to import the data.
  – File/Import/ASCII data created by a spreadsheet/… (select the data)

CODING QUESTIONNAIRES INTO STATA
• Coding data into STATA can be done in the DATA VIEW
  – Generate new variables. Eg: gen q1=.
                                gen q2=.
  – Click Data Editor on the menu bar
  – Click on Variables Manager
      Type the variable name
      Type the variable label
      Click Apply to save your entries
      Click on Manage to display a new dialog box
• Creating Value Labels
      Click on Create Label
      Type the name of the value label here
      Type in the value. Eg: 1
      Type in the label that corresponds to the value
      Click on Add
• Note that you can create the value labels for all the questions before exiting the Manage Value Labels dialog box.
• Assign the value labels you entered to their corresponding questions (variables) in the Variables Manager.
• Exit the Variables Manager dialog box and go back to the Data Editor.
• You can now type in the coded responses.
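The same questionnaire set-up can also be typed as commands rather than clicked through the Variables Manager. The following is a minimal sketch, not the workshop's own procedure; the variable labels, the label set yesno, the number of respondents and the coded responses are all hypothetical.

```stata
* Sketch: q1, q2, their labels and the label set yesno are hypothetical
clear
set obs 3                                // room for three respondents

generate byte q1 = .
generate byte q2 = .
label variable q1 "Question 1"
label variable q2 "Question 2"

label define yesno 0 "no" 1 "yes"        // one label set shared by both questions
label values q1 yesno
label values q2 yesno

replace q1 = 1 in 1                      // coded responses entered as commands
replace q2 = 0 in 1
list q1 q2
```

Defining a label set once and attaching it to several variables mirrors the note above about creating all value labels before assigning them to questions.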
MICROECONOMETRIC REGRESSION ANALYSIS
• Ordinary Least Squares
• Probit Models
• Logit Models
• Ordered Probit/Logit Models
• Multinomial Logit Models
• Tobit Models

ORDINARY LEAST SQUARES
Like most statistical packages, STATA allows users to run basic regressions such as OLS. The syntax is:
    regress depvar indepvars
    Eg: regress gpa tuce psi
        reg gpa tuce psi

LOGIT AND PROBIT MODELS
• Probit and logit models are among the most widely used members of the family of generalized linear models for binary dependent variables.
• This group of models allows researchers to analyse data even though the dependent variable is binary (0, 1).
  – Eg: yes/no; married or not married; foreign or domestic

PROBIT MODEL
Let us examine whether a new method of teaching economics, PSI, significantly influences performance in later economics courses, using the probit model. The dependent variable used is GRADE, which indicates whether a student's grade in an intermediate macroeconomics course was higher than that in the principles course. The probit model relates the probability that GRADE equals one to GPA, PSI and TUCE.
• Estimation of the Probit Model
    probit grade gpa psi tuce
• The basic probit command reports coefficient estimates and their standard errors. These coefficients are index coefficients: all they tell us is the direction of each effect and the partial effect on the probit index (score). They do not correspond to the average partial effects.
• Let's try to interpret the results:
  – Tuce: a one-unit increase in tuce increases the probit index by 0.05 standard deviations.
  – But are we concerned with the probit index? No.
• In analysing binary choice models, the parameters of interest are not the index coefficients but the marginal (partial) effects.

Marginal Effects
• A marginal effect gives the derivative of the probability that the dependent variable equals one with respect to a particular conditioning variable.
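The definition above can be written out for this example. The notation is the standard textbook one and is an assumption here, since the slide's own equation is not available: Φ and φ denote the standard normal distribution and density functions, and the β's are the index coefficients reported by probit.

```latex
\[
\Pr(\mathrm{GRADE}=1 \mid \mathbf{x}) = \Phi(\mathbf{x}'\boldsymbol{\beta}),
\qquad
\mathbf{x}'\boldsymbol{\beta} = \beta_0 + \beta_1\,\mathrm{GPA} + \beta_2\,\mathrm{PSI} + \beta_3\,\mathrm{TUCE}
\]
\[
\frac{\partial \Pr(\mathrm{GRADE}=1 \mid \mathbf{x})}{\partial x_j}
  = \phi(\mathbf{x}'\boldsymbol{\beta})\,\beta_j
\]
\[
\Delta_{\mathrm{PSI}}
  = \Phi(\mathbf{x}'\boldsymbol{\beta}\mid \mathrm{PSI}=1)
  - \Phi(\mathbf{x}'\boldsymbol{\beta}\mid \mathrm{PSI}=0)
\]
```

Because φ(x'β) is always positive, a marginal effect has the same sign as its index coefficient, but its size depends on where x is evaluated (typically at the sample means). For a dummy variable such as PSI, the derivative is replaced by the discrete change in the last line.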
In Stata these marginal effects can be computed using two methods:
  – dprobit
  – mfx compute

Interpretation
For a one-unit increase in an independent variable from the baseline, the probability of the event is expected to increase/decrease by the marginal effect.
For instance, a one-unit increase in GPA from the baseline (3.11) increases the probability of grade improvement by 53.3 percentage points.
NB: The interpretation for dummy variables differs: the reported effects are discrete changes, not marginal effects.
The interpretation of PSI is that a student exposed to PSI has a probability of grade improvement that is 0.46 greater than that of a student who is not exposed to the method.

LOGIT MODEL
The logit model yields results similar to those of the probit model.
• The coefficients of the logit model are quite difficult to interpret directly: the model is based on the logistic distribution function, so the coefficients measure changes in the log-odds.
• As a result we compute the odds ratios and the marginal effects.

MARGINAL EFFECTS
• In Stata these marginal effects can be computed using the mfx command.
• Recall that for a one-unit increase in an independent variable from the baseline, the probability of the event is expected to increase/decrease by the magnitude of the marginal effect, holding other variables constant.
• In our case, a one-unit increase in GPA from the baseline mark of 3.11 increases the probability of grade improvement by 53.3 percentage points.
• A one-unit increase in previous knowledge of the material (TUCE) from the baseline (21.93) increases the probability of grade improvement by 1.8 percentage points.
• What about PSI?

ODDS RATIO
• Odds are a way of presenting probabilities, but unless you know much about betting you will probably need an explanation of how odds are calculated. The odds of an event happening is the probability that the event will happen divided by the probability that the event will not happen.
• Stata command (option or):
    logit grade gpa psi tuce, or
• Being exposed to the new teaching method (PSI) increases the odds of performing well by 0.79.
• For every one-unit increase in GPA, the odds of improving performance increase by a factor of 16.87.

ROBUSTNESS
Cross-sectional data are usually plagued by the problem of heteroscedasticity.
• This statistical deficiency has implications for the results of binary choice models.
• Thus, to report standard errors that are robust, we use the option r or robust.
  – Eg: probit grade psi tuce gpa, r
        probit grade psi tuce gpa, robust

ORDERED PROBIT/LOGIT
Some multinomial choice models are inherently ordered. Examples include:
• Bond ratings
• Opinion surveys
• Assignment of military personnel to job classifications by skill level
• Voting outcomes on certain programs
• The level of insurance coverage taken by a consumer: none, part, or full
• Employment status: unemployed, part-time, or full-time
In each of these cases the outcome is discrete, but the multinomial logit, conditional logit and nested logit models would fail to account for the ordinal nature of the dependent variable.
The ordered probit/logit models, however, account for these ordinal properties.

ORDERED PROBIT
Suppose we wish to analyze the 1977 repair records of 66 foreign and domestic cars. The 1977 repair records take on the values poor, fair, average, good and excellent. The main research problem is to explore the factors that explain the repair records in 1977. The categories are:
    1. Poor
    2. Fair
    3. Average
    4. Good
    5. Excellent

MARGINAL EFFECTS
• We need the marginal effects to interpret the results of the ordered probit effectively.
• The marginal effects show how the probabilities of each outcome change with respect to changes in the regressors.
• To calculate the marginal effects we run the mfx command separately for each outcome.
  – mfx, predict(outcome(1))
  – mfx, predict(outcome(2))
  – mfx, predict(outcome(3))
  – mfx, predict(outcome(4))

ORDERED LOGIT
MARGINAL EFFECTS

MULTINOMIAL LOGIT
The multinomial logit (MNL) model, also known as multinomial logistic regression, is a regression model which generalizes logistic regression by allowing more than two discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).

IMPLEMENTATION IN STATA
• Stata estimates the multinomial logit with the mlogit command. The example here uses the closely related multinomial probit (MNP), which is estimated with the mprobit command. To use mprobit we must have a single observation for each decision maker in the sample.
• Eg: We use data on the type of health insurance available to 616 psychologically depressed subjects in the US. Patients may have either an indemnity (fee-for-service) plan or a prepaid plan such as a Health Maintenance Organization (HMO), or the patient may be uninsured.
• Demographic variables include age, gender, race and site.
• Indemnity insurance is the most popular alternative, so Stata will choose it as the base outcome by default.
  – The main research problem is to explore the factors that explain the choice of health insurance
    mprobit insure age male nonwhite site2 site3

Computation of the Marginal Effects
• We need the marginal effects to interpret the results of the MNP effectively.
• The marginal effects show how the probabilities of each outcome change with respect to changes in the regressors.
• To calculate the marginal effects we run the mfx command separately for each outcome.

TOBIT MODEL
• There are instances whereby the variable we are investigating is censored at a point.
• For instance, suppose our research objective is to explore the factors that explain the mileage rating (mpg) of the cars in our data.
• Mpg in our data ranges from 12 to 41.
• Assume that our data are censored so that we cannot observe a mileage rating below 17 mpg.

CENSORING THE MPG
• If the true mpg is 17 or less, all we know is that the mpg is less than or equal to 17.
• Let's first generate a new variable called mpg1
  – gen mpg1=mpg
• Replace any value that is equal to 17 or below with 17
  – replace mpg1=17 if mpg<=17
  – (14 real changes made)
Let's see what we actually observe after censoring.

IMPLEMENTATION IN STATA
• Notice that our dependent variable mpg is not dichotomous but continuous.
• Let's run two regressions.
• Create wgt by dividing weight by 1000 to make the discussion easier to follow
    gen wgt=weight/1000

TYPES OF TOBIT
• Left-censored Tobit model
• Right-censored Tobit model
We can estimate a tobit model by instructing the software to censor the data from below (left-censored), from above (right-censored) or both.

Left-censored Tobit model
  – Using the already censored data mpg1
• Using the uncensored data, we can instruct the software to censor it in the estimation by using the option ll(…)
  – tobit mpg wgt, ll(17)

Right-censored Tobit model
Two-limit Tobit models
• Tobit regression coefficients are interpreted in the same manner as OLS regression coefficients, except that the effect is on the uncensored (latent) variable.
• For a one-unit increase in wgt (that is, 1,000 units of weight), there is a 6.2 point decrease in the predicted value of mpg. In other words, a unit increase in wgt is associated with a 6.2 unit decrease in mileage.

Computation of the Marginal Effects
• We need the marginal effects to interpret the results of the tobit model effectively.
• The marginal effects show how the predicted outcome changes with respect to changes in the regressors.
• To calculate the marginal effects we run the mfx command.
• NB: The marginal effects are just the same as the coefficients from the regression model.

STARTING WITH DO-FILES
    version 11
    set matsize 400
    clear
    set mem 1000
    capture log close
    set more off
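To show how these opening lines fit into a complete do-file, here is a minimal sketch that wraps the tobit example in a log. It assumes Stata 11 and the bundled auto.dta; the file names tobit_example.do and tobit_example.log are hypothetical, and the memory settings from the header above are left out because the example data are small.

```stata
* tobit_example.do: hypothetical file names; assumes Stata 11 and the bundled auto.dta
version 11
clear
set more off
capture log close                        // close any log left open by an earlier run
log using tobit_example.log, replace text

sysuse auto
generate wgt = weight/1000               // weight in thousands of pounds

regress mpg wgt                          // OLS benchmark
tobit mpg wgt, ll(17)                    // treat mpg values at or below 17 as left-censored
mfx compute                              // marginal effects evaluated at the means

log close
```

Putting capture log close before log using means the do-file can be rerun after an error without stopping on an already open log.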