DATA MANAGEMENT AND PRESENTATION, FIRST PROJECT

Exercises for Lecture 1
EXERCISE 1 – Scleroderma data
A randomized double-blind, controlled study was conducted to determine the
safety and efficacy of a particular compound in the treatment of scleroderma, a
multi-system skin disease characterized by thickening of the skin and possible
involvement of the blood vessels and internal organs. Skin mobility is the sum of
mobility scores from 20 skin locations and assesses the ability of the skin to be
stretched, compressed, and lifted. Skin thickening is the sum of thickening scores
graded 0-3 (3 is the worst) from 15 areas. Patient assessment is the sum of the
scores from the hand, forearms, and arms using a score of 0-3 (3 is the worst).
There are 76 observations and the variables in the data set are as described
below. In the file, I have put periods in the appropriate places for missing values.
variable                            type      additional description
clinic number                       numeric
patient ID                          numeric
placebo/drug                        numeric   1=drug, 2=placebo
skin thickening at 1st visit        numeric   the higher the number the worse the thickening
skin thickening at 2nd visit        numeric   the higher the number the worse the thickening
skin mobility at 1st visit          numeric   the higher the number the better the mobility
skin mobility at 2nd visit          numeric   the higher the number the better the mobility
patient assessment at 1st visit     numeric   the higher the number the worse the patient
patient assessment at 2nd visit     numeric   the higher the number the worse the patient
The data are on the CSASS website as a .csv file, sclero.csv, with variable
names in the first row. You may want to open the file with a text editor, such as
Notepad, to see what it looks like. If you simply double-click the file, it will
open as an Excel workbook, because Excel is the program that created the .csv file.
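One possible way to read a .csv file with variable names in the first row is PROC IMPORT. The sketch below is only an illustration: the file path is a placeholder for wherever you saved sclero.csv, and the output data set name sclero is chosen to match the examples later in this handout.

```sas
/* Sketch only: the path below is a placeholder, not the real location. */
proc import datafile="C:\mydata\sclero.csv"
            out=sclero
            dbms=csv
            replace;
    getnames=yes;   /* first row of the file holds the variable names */
run;

/* List the variables and their types to check the import. */
proc contents data=sclero;
run;
```

Running PROC CONTENTS afterwards is a quick way to confirm the variable names you will need in the later steps.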
1. Import the data.
2. Using the variable names, print out the variables clinic number, patient ID and the
two skin thickening measurements.
3. We would like summary statistics for both skin thickening measurements computed for
each clinic. From the output you create, you should be able to easily find the means and
SDs for both skin thickening measurements for each clinic. If you have extra information
in the output, that is okay for now, although you should be able to use the program in the
first lecture to restrict the output to the sample size, mean and SD for only the two skin
thickening measurements for each of the clinics.
4. The following SAS commands are useful for selecting a subset of the data.
Subsetting IF statements in the DATA step are used to exclude some
observations from the data set. Only observations meeting the IF criteria are
included in the SAS data set formed. The WHERE statement can be used as well
except when reading raw data files with an INPUT statement. However, unlike
the IF, the WHERE statement can be used with any PROC and eliminates the
need to create a new data set meeting the selection criterion. Some examples
follow (either the where or if can be used in these examples).
if carsize = 'small';     (data set includes only observations whose carsize is 'small')
where 25 < mpg < 35;
if country = 'japan' or country = 'us';
where group in (1, 2, 3);
To create a data set called 'new' from data=old, which keeps all the original
variables but restricts the new data set to small cars, you could use either of the
SAS commands below.
data new;
set old;
if carsize='small';
or
data new;
set old;
where carsize='small';
Here is an example with the sclero data. Either of the following programs can be
used to create a data set clinic36 from sclero which includes only the data for
which the variable clinnum (clinic number) is 36.
data clinic36; set sclero; if clinnum=36;
or
data clinic36; set sclero; where clinnum=36;
Using the previous ideas, create a new data set called placebo consisting only of
the subjects in the placebo group and then print only the variables clinic number
and patient id for the data set placebo.
5. The SET statement is also useful for adding a new variable to a data set that already
exists in SAS. The SET statement keeps all the original variables in the data set
and adds the new one that you create. For example, the code
Data new; set old; milesq=miles*miles;
creates a data set called new which contains all the variables in old plus the
square of mileage.
Create a data set called sclero2 from sclero which contains only the patients in
the drug group, and in addition to the original variables also includes the variable
improve which represents the improvement in mobility score, namely improve =
mobility2 - mobility1. For sclero2 print out the variables clinic number, patient id,
first mobility score, second mobility score and the new variable improve which
you have created with an assignment statement. How are the missing values in
the mobility scores handled when creating the variable improve?
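Item 4 notes that WHERE, unlike the subsetting IF, can be used inside any PROC without creating a new data set. The sketch below illustrates that difference on a small made-up data set. The data values are invented for this illustration, and the data set name midmpg is hypothetical.

```sas
/* Hypothetical toy data, invented for this illustration. */
data old;
    input carsize $ mpg;
    datalines;
small 31
large 18
small 36
medium 27
;
run;

/* WHERE inside a PROC: prints only the small cars; no new data set is made. */
proc print data=old;
    where carsize = 'small';
run;

/* Subsetting IF inside a DATA step: the new data set keeps only the
   observations with 25 < mpg < 35 (here the rows with mpg 31 and 27). */
data midmpg;
    set old;
    if 25 < mpg < 35;
run;
```

The first approach is convenient when you only need the subset for one procedure. The second is the better fit when several later steps will use the same subset.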
EXERCISE 2 – Airlines data
The data for this part are contained in the file AIRLINES.xls, also stored on the
CSASS website. First, import the AIRLINES.xls data into SAS. The result will be
a SAS data set with the variable names carried over from the spreadsheet. The
data are on two flight destinations, Los Angeles (LAX) and Dallas (DFW). The
variables are flight, date, dest (destination), boarded (number of passengers
boarding the flight as their initial flight), trans (number of passengers transferring
onto the flight from a previous flight) and nonrev (nonpaying passengers such as
a pilot traveling to another airport). Assume the cost of a flight to Dallas is $650
and the cost of a flight to Los Angeles is $850. Using the airlines data, create a new data
set (name it newairlines) with only the observations for which the total number of
passengers is 150 or more. In addition, for this new data set create the three
new variables (1) total number of passengers on the flight, (2) name of
destination city (Los Angeles or Dallas) and (3) the revenue (amount earned for
the paying passengers). For example, the first flight in the data set has a total of
135 + 22 + 2 = 159 passengers so would be included in the data set. The value
of revenue is (135 + 22) * $850 = $133,450, since there are 135 + 22 paying
passengers and the cost of the flight is $850 because it is going to Los Angeles.
You should use a length statement for the destination city as Los Angeles is
more than 8 characters. When using character variables in your code, remember
the value of a character variable needs to be included in quotes and the values
are case sensitive (LAX is not the same as lax).
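The two points above (declaring a length for a long character value, and the case sensitivity of quoted values) can be seen in a small sketch. The data set cities and its three input values are invented for this illustration and are not the airlines data.

```sas
/* Hypothetical toy data: three destination codes, one in lowercase. */
data cities;
    length city $ 11;          /* 'Los Angeles' has 11 characters */
    input dest $;
    if dest = 'LAX' then city = 'Los Angeles';
    else if dest = 'DFW' then city = 'Dallas';
    datalines;
LAX
DFW
lax
;
run;

proc print data=cities;
run;
```

In the printed output the third observation has a blank city, because 'lax' does not match 'LAX'. Placing the LENGTH statement before the first assignment guards against truncation regardless of the order in which the values are assigned.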
Print only the variables flight, date, destination city and revenue, creating
informative labels for each variable to be used as the column headers. For the
revenue variable use a format that includes a dollar sign and commas when
printing. To create this format, use the following statement as part of your print
procedure (put it at the end before RUN).
format revenue dollar15.;
(The width of the format needs to include space for the digits, commas and dollar sign.)
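As a rough illustration of labels and a dollar format in PROC PRINT, here is a sketch on a made-up data set. The data set sales, its variables, and the label text are all hypothetical and chosen only for this example.

```sas
/* Hypothetical toy data, invented for this illustration. */
data sales;
    input item $ amount;
    datalines;
A 133450
B 1200
;
run;

/* The LABEL option tells PROC PRINT to use labels as column headers. */
proc print data=sales label noobs;
    var item amount;
    label item   = 'Item code'
          amount = 'Amount earned';
    format amount dollar15.;   /* 133450 prints as $133,450 */
run;
```

Note the LABEL option on the PROC PRINT statement: without it, PROC PRINT shows the variable names as column headers even when labels have been defined.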