dummy
A Stata Command to Create Dummy Variables and
Interactions from Categorical Variables
The file dummy.ado implements a command that creates dummy variables from
categorical variables and interactions from categorical and continuous variables. Place
the file in Stata’s working directory (folder), or in the folder that contains Stata’s ado
files. The command can be used to:
1. Create a set of dummy variables from a categorical variable.
2. Create two or three way interaction dummy variables from categorical variables.
3. Create two, three, or four way interactions from a single continuous variable and
one, two, or three categorical variables.
The dummy command has the following features, which must be understood in order to
use it successfully.
1. The command uses the reference category method, and does not create dummy
variables for reference categories or their interactions. Thus, the full list of
created variables can be used as independent variables in a regression.
2. By default, the reference category is the one with the lowest coded value. The
reference category can be changed with the omit characteristic, described below.
3. Stata must know whether a variable being operated on by the command is a
categorical variable or a continuous variable. By default, the command assumes
that variables are continuous. To inform the command that a variable is
categorical, its factor characteristic must be specified as category, as
described below.
4. Names for dummy variables and interactions are formed from abbreviations of
the variables and, for categorical variables, their coded values. By default, the
abbreviation for a variable is its first letter. You can change the abbreviation
used with the abrev characteristic, described below. You must use this
characteristic if two variables operated on by the command begin with the same
letter, or if the name that would be created by the command is already taken by
an existing variable.
5. Optionally, you can specify a minimum frequency for a category that is required
for a dummy variable to be created for that category. This can be useful if you
want to lump categories with very few observations together with the reference
category.
6. The command creates labels for the variables it creates from the value labels of
the categorical variables, if present. Thus, providing value labels for categorical
variables is a good way to document the variables that the command creates.
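For orientation, the reference category method described in features 1 and 2 can be reproduced by hand with built-in Stata commands. This is only a sketch, not the code inside dummy.ado. The toy data and the names g2, g3, and g4 are assumptions chosen to mirror the naming scheme described above.

```stata
* Sketch only: a hand-built version of the reference category method,
* not the code inside dummy.ado. The toy data below are made up.
* Note: clear discards the data currently in memory.
clear
set obs 12
generate byte drug = mod(_n, 4) + 1

* Category 1 (lowest coded value) is the reference, so no g1 is created.
foreach k of numlist 2/4 {
    generate byte g`k' = (drug == `k') if !missing(drug)
    label variable g`k' "drug `k'"
}
describe g2 g3 g4
```

Because the reference category has no dummy of its own, g2, g3, and g4 can all enter a regression that includes a constant.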
Command Syntax for Variable Characteristics
Before giving Stata the dummy command, you must first specify the characteristics of the
variables on which it operates. To learn more about characteristics of variables, type
help char at the command line.
Each categorical variable must have its factor characteristic set to “category”.
This is done by:
char varname[factor]category
where varname is the categorical variable. For example, if drug is a categorical variable,
type:
char drug[factor]category
If you want to change the default reference category, use the omit characteristic.
char varname[omit]#
where varname is the categorical variable, and # is the value of the category that you
want to make the reference category. For example, if drug is a categorical variable that
takes the values 1, 2, 3, 4, and you want to make the category with the value 3 the
reference category, type:
char drug[omit]3
If you want to change the default abbreviation for a variable, you must supply one or
more abbreviation characteristics for the variable. The syntax of the abbreviation
characteristic is:
char varname[abrev#]name
where varname is the variable you want to make the abbreviation for, name is the
abbreviation, and # is the number of characters in name. name must conform to the
requirements of a variable name in Stata and, in addition, should be short. This is
because a category number will be appended to it, along with other abbreviations and
category numbers if the variable is involved in an interaction. The full name must
conform to the maximum length of eight characters. You can supply additional
abbreviations. The dummy command will attempt to use the longest abbreviation it can.
For example,
char drug[abrev1]g
dummy drug
will result in the dummy variables g1, g2, and g4 being created, given that category 3
was specified as the reference category above.
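Before running dummy, it can help to confirm what has been stored. The sketch below uses only Stata's built-in char list command and characteristic macro expansion. It assumes that a variable named drug is in memory and that the factor, omit, and abrev1 characteristics were set as in the examples above.

```stata
* List every characteristic currently attached to drug
char list drug[]

* Display individual characteristics through macro expansion;
* an unset characteristic expands to an empty string
display "factor: `drug[factor]'"
display "omit:   `drug[omit]'"
display "abrev1: `drug[abrev1]'"
```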
Command Syntax for the dummy Command
The syntax for the dummy command is:
dummy varlist, minfreq(#) 1 2 3
varlist is a variable list of up to four existing categorical and continuous variable
names, with a maximum of one continuous variable and a maximum of three categorical
variables. If you specify only one variable, it must be a categorical variable, and a set of
dummies will be created for it. If you specify two variable names, a set of variables for
the two way interaction will be created. If you specify three variable names, a set of
variables for the three way interaction will be created, and so on.
In the minimum frequency option, minfreq(#), “#” is the minimum frequency a category
must have in order for a dummy variable to be created for it. The option can be
abbreviated as “m(#)”. Omitting the option is equivalent to specifying minfreq(1).
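One way to choose a sensible threshold is to look at the category frequencies first. The sketch below uses the built-in tabulate command and assumes that the variables drug and disease are in memory. How dummy itself counts frequencies is not documented here, so treat the tables only as a guide.

```stata
* Frequency of each drug category, with missing values shown separately
tabulate drug, missing

* Cell counts for the drug by disease combinations
tabulate drug disease
```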
The numbers “1 2 3” may be specified if interactions are requested. For example, when
three way interactions are created, the command must first create main effects dummies,
and then two way interactions, before the three way interaction can be created.
Normally, the command would then delete the main effects and two way interaction
variables. However, specifying the “1” option keeps the main effects dummies that were
created, and specifying the “2” option keeps the two way interaction variables that were
created. The “3” option would only be specified if varlist contained four variables.
Example
As an example, consider the “fully” interactive model estimated with the sysage.dta data
set in Lab 7. The dummy command could be used to more easily construct the variables
needed in the regression model in the following manner. First, although it is not
necessary, provide the drug and disease categorical variables with value labels. (Type
help label for information on value labels.)
label define drug 1 "drug 1" 2 "drug 2" 3 "drug 3" 4 "drug 4"
label define disease 1 "disease 1" 2 "disease 2" 3 "disease 3"
label values drug drug
label values disease disease
Next, specify the characteristics of the variables. Since the variables drug and disease
begin with the same letter, one of them must be abbreviated.
char drug[factor]category
char disease[factor]category
char drug[abrev1]g
Lastly, create the dummy variables and interactions in one fell swoop.
dummy drug disease age, 1 2
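To see exactly which variables the call created, the variable list can be recorded before and after it. The sketch below is meant to be run in place of the bare call above. It assumes that dummy.ado is installed and that drug, disease, and age are in memory. The macro names before, after, and created are illustrative.

```stata
* Record the variable list, run dummy, and describe whatever is new
unab before : _all
dummy drug disease age, 1 2
unab after : _all
local created : list after - before
describe `created'
```

The names held in created, together with age, are the candidates for the right-hand side of the regression.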
Troubleshooting
Here are some common error messages and their “fixes”.
dummy drug
Error : This combination not allowed
Fix: Stata thinks drug is a continuous rather than a categorical variable. You forgot to
specify the category factor characteristic for drug, i.e., “char drug[factor]category”.
dummy drug
Error : d2 already exists.
Fix: When Stata attempted to create the dummy variable d2 for the second drug category,
it found a pre-existing variable with the name “d2”. Either drop the variable d2 and issue
the dummy command again, or use the abbreviation characteristic to give drug an
abbreviation other than “d”, e.g., “char drug[abrev1]g” or “char drug[abrev2]dr”.
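A name clash of this kind can be checked for ahead of time with Stata's built-in confirm command. The sketch below assumes that d2 is the name dummy would try to create; the message text is illustrative.

```stata
* confirm new variable sets a nonzero return code if d2 is already in use
capture confirm new variable d2
if _rc != 0 {
    display as text "d2 is already in use: drop it or set an abrev characteristic for drug"
}
```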
END