Two Methods to Merge Data onto Every Observation in Another Dataset
Lisa Mendez, Ph.D., Knowesis, Inc., San Antonio, TX
Kim Brunnert, Ph.D., Elsevier, Houston, TX
ABSTRACT
There are times when you just can’t seem to find a PROC that will do exactly what you want. We came across a scenario where we needed to calculate the mean of a student data file and then flag student observations that were more than three standard deviations from the mean. We found two methods to do what we needed to do. One method uses a combination of Data steps and Procs, and utilizes the If _N_=1 Set method. Another method utilizes Proc SQL. This paper outlines both methods step-by-step and illustrates two different ways to do the same thing. Personal preference dictates which method to use.
INTRODUCTION
We have a student data set that identifies students by ID number across two years. Each student has two time values (Time1 & Time2). We want to calculate the mean of all of the students’ time values and flag those students whose time values fall more than three standard deviations above or below the mean.
RAW DATA
The raw data set has five variables and 25 observations. The variables are: Student, Semester, Year, Time1, and Time2. Each student has a time value, which represents time in seconds.
Figure 1. Raw Data.
USING THE IF _N_=1 SET METHOD
Essentially, the If _N_=1 Set method attaches the values of a one-record dataset to every observation of another dataset.
Figure 2. Illustration of how one observation is set to every observation in another dataset.
To create the one-record dataset that we will use, we need to start by calculating the means of the student dataset. We want to calculate the mean for both the Time1 and Time2 variables. We will use the output out= statement to create an output dataset.
Sample SAS Code and Output Dataset
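A minimal sketch of this step, not the authors' exact program. It assumes a stand-in dataset work.student_data with made-up values (the paper reads Student_data from the SASPaper library), and the output name student_means is hypothetical.

```sas
/* Hypothetical stand-in for the 25-observation student file (made-up values) */
data work.student_data;
   do Student = 1 to 25;
      Time1 = 60 + mod(Student, 5);
      Time2 = 70 + mod(Student, 7);
      output;
   end;
run;

/* With no statistic keywords on the OUTPUT statement, PROC MEANS writes one
   row per default statistic (N, MIN, MAX, MEAN, STD), identified by _STAT_,
   plus the automatic variables _TYPE_ and _FREQ_ */
proc means data=work.student_data noprint;
   var Time1 Time2;
   output out=work.student_means;
run;

proc print data=work.student_means;
run;
```

The resulting dataset has five rows and the columns _TYPE_, _FREQ_, _STAT_, Time1, and Time2.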
Next we will drop the non-essential variables. We need only the _STAT_ variable (along with Time1 and Time2), so we drop the _TYPE_ and _FREQ_ variables.
Sample SAS Code and Output Dataset
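As a sketch of the drop step, the example below builds a stand-in for the PROC MEANS output with made-up numbers and removes the two automatic variables with the DROP= dataset option. The dataset names student_means and student_means2 are assumptions for illustration.

```sas
/* Stand-in for the PROC MEANS output; all numbers are made up */
data work.student_means;
   input _TYPE_ _FREQ_ _STAT_ $ Time1 Time2;
   datalines;
0 25 N    25    25
0 25 MIN  58    61
0 25 MAX  131   140
0 25 MEAN 84.2  90.5
0 25 STD  15.3  17.8
;
run;

/* Keep only _STAT_, Time1, and Time2 */
data work.student_means2;
   set work.student_means (drop=_TYPE_ _FREQ_);
run;
```

The same result could be had by placing (drop=_TYPE_ _FREQ_) directly on the OUT= dataset in PROC MEANS, which would save a step.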
Once we get the mean and standard deviations for both Time1 & Time2 we will need to transpose the data to get the data into a format that we can use to calculate three standard deviations above and below the means.
Sample SAS Code and Output Dataset
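One way to sketch the transpose, assuming the statistics sit in a dataset shaped like the trimmed PROC MEANS output (stand-in values below are made up, and the dataset names are hypothetical):

```sas
/* Stand-in for the trimmed statistics dataset; all numbers are made up */
data work.student_means2;
   input _STAT_ $ Time1 Time2;
   datalines;
N    25    25
MIN  58    61
MAX  131   140
MEAN 84.2  90.5
STD  15.3  17.8
;
run;

/* ID _STAT_ turns each statistic into its own column; the automatic
   variable _NAME_ records which variable (Time1 or Time2) each row holds */
proc transpose data=work.student_means2 out=work.student_means_t;
   id _STAT_;
   var Time1 Time2;
run;
```

After the transpose there is one row per time variable, with MEAN and STD side by side so they can be combined in a single expression.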
At this point we can calculate three (3) standard deviations above and below the means.
Sample SAS Code and Output Dataset
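A minimal sketch of the calculation, assuming a transposed dataset with one row per time variable and columns MEAN and STD (stand-in values are made up). The variable names sd3plus and sd3minus follow the paper; the dataset names are hypothetical.

```sas
/* Stand-in for the transposed statistics; numbers are made up */
data work.student_means_t;
   input _NAME_ $ MEAN STD;
   datalines;
Time1 84.2 15.3
Time2 90.5 17.8
;
run;

data work.student_sd3;
   set work.student_means_t;
   sd3plus  = MEAN + 3 * STD;   /* upper cutoff: mean plus three SDs  */
   sd3minus = MEAN - 3 * STD;   /* lower cutoff: mean minus three SDs */
run;
```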
Now we have the four values we need: sd3minus and sd3plus at Time1 and Time2. Now, to put those values in the original dataset we first split the new data into two datasets: one with 3SD ABOVE the mean for Time1 and Time2, and one with 3SD BELOW the mean for Time1 and Time2.
To create the 3SD dataset ABOVE the mean, we will create a dataset with only the Time variables and the 3SD ABOVE values, transpose the data, and then rename variables and drop non-essential variables.
Sample SAS Code and Output Dataset
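The keep, transpose, rename, and drop sequence might look like the sketch below. The input values are made up, the intermediate name student_sdplus_t and the renamed variables Time1_sd3plus and Time2_sd3plus are assumptions; only Student_sdplus_t2 is a name the paper itself uses.

```sas
/* Stand-in for the dataset holding both cutoffs; numbers are made up */
data work.student_sd3;
   input _NAME_ $ sd3plus sd3minus;
   datalines;
Time1 130.1 38.3
Time2 143.9 37.1
;
run;

/* Keep only the upper cutoffs and flip them into a single row;
   ID _NAME_ makes Time1 and Time2 the new column names */
proc transpose data=work.student_sd3 (keep=_NAME_ sd3plus)
               out=work.student_sdplus_t;
   id _NAME_;
   var sd3plus;
run;

/* One-record dataset containing only Time1_sd3plus and Time2_sd3plus */
data work.student_sdplus_t2;
   set work.student_sdplus_t (drop=_NAME_
                              rename=(Time1=Time1_sd3plus Time2=Time2_sd3plus));
run;
```

Renaming matters here: without it, the cutoff columns would share the names Time1 and Time2 with the student data and collide when the two datasets are combined.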
We will do the same steps to create the 3SD BELOW dataset.
Sample SAS Code and Output Dataset
Now we are ready to merge the data together by using the If _N_=1 Set syntax. Reading the syntax, we are stating: on the first iteration of the data step (_N_ equals 1), read the single observation in the first dataset (Student_sdplus_t2); then read every observation in the Student_data dataset located in the SASPaper library, and output the file to a dataset named Student_SD3_plus_minus1. Because variables read with a SET statement are retained across iterations, the values from that single observation are carried onto every observation of Student_data.
Sample SAS Code and Output Dataset
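A minimal sketch of the If _N_=1 Set pattern. Both input datasets are hypothetical stand-ins with made-up values, placed in WORK rather than the SASPaper library, and the cutoff variable names are assumptions.

```sas
/* Stand-in one-record dataset of upper cutoffs (made-up values) */
data work.student_sdplus_t2;
   Time1_sd3plus = 130.1;
   Time2_sd3plus = 143.9;
run;

/* Small stand-in for the student file (made-up values) */
data work.student_data;
   do Student = 1 to 5;
      Time1 = 60 + Student;
      Time2 = 70 + Student;
      output;
   end;
run;

data work.student_sd3_plus_minus1;
   /* Executes only on the first iteration. Variables read by SET are
      retained, so the cutoffs stay in place for every later iteration. */
   if _N_ = 1 then set work.student_sdplus_t2;
   /* Executes on every iteration and drives the step: one output
      observation per student observation */
   set work.student_data;
run;
```

The condition is what keeps the step from stopping early: an unconditional SET on the one-record dataset would hit end-of-file on the second iteration and end the step after a single output observation.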
Notice that every observation in the original Student_Data dataset has the same 3SD Above values for Time1 and Time2 merged with it.
We will now merge the 3SD Below values.
The final file has both the 3SD values Above and 3SD values Below for the Time1 and Time2 variables. Now we can flag the observations that have either a Time1 or Time2 value above or below three standard deviations from the mean. We can create a dataset with all of the observations and flags, but we can also create a dataset with only flagged observations and another dataset with un-flagged observations.
Sample SAS Code and Output Dataset
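The flagging and the three output datasets can be sketched in a single data step, as below. This is an illustration under assumptions: the input is a stand-in with made-up values, and the dataset names (student_sd3_plus_minus2, all_flags, flagged, unflagged) and cutoff variable names are hypothetical.

```sas
/* Stand-in for the merged file: student values plus both sets of cutoffs */
data work.student_sd3_plus_minus2;
   do Student = 1 to 4;
      Time1 = 60 + Student;   Time2 = 70 + Student;
      Time1_sd3plus  = 130.1; Time2_sd3plus  = 143.9;
      Time1_sd3minus = 38.3;  Time2_sd3minus = 37.1;
      if Student = 4 then Time1 = 200;   /* made-up outlier */
      output;
   end;
run;

data work.all_flags
     work.flagged   (drop=Flag)
     work.unflagged (drop=Flag);
   set work.student_sd3_plus_minus2;
   /* A logical expression evaluates to 1 (true) or 0 (false) */
   Flag = (Time1 > Time1_sd3plus or Time1 < Time1_sd3minus or
           Time2 > Time2_sd3plus or Time2 < Time2_sd3minus);
   output work.all_flags;
   if Flag = 1 then output work.flagged;
   else output work.unflagged;
run;
```

One caution with this comparison: SAS treats a missing numeric value as smaller than any number, so an observation with a missing Time1 or Time2 would be flagged as below the lower cutoff unless it is handled separately.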
All Student_Data observations with Flag variable
Only Student_Data observations with Flag variable equal to 1 (Flag variable has been dropped)
Only Student_Data observations with Flag variable NOT EQUAL to 1 (Flag variable has been dropped)
USING PROC SQL
Using PROC SQL to achieve the same results can eliminate multiple steps; however, many people are reluctant to use PROC SQL if they are unfamiliar with it.
As with the If _N_=1 Set method, we must begin by calculating the means and standard deviations for the Time1 and Time2 variables. In the SAS Code example, note that the format statement is not necessary, but it helps to make the values more readable.
Sample SAS Code and Output Dataset
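One way this step could be written is sketched below; it is an assumption about the approach, not the authors' listing. It relies on PROC SQL remerging summary statistics: when a SELECT mixes detail columns with summary functions and has no GROUP BY, SAS computes the statistics over the whole table and attaches them to every row. The stand-in data, the table name student_stats, and the new column names are hypothetical.

```sas
/* Hypothetical stand-in for the 25-observation student file (made-up values) */
data work.student_data;
   do Student = 1 to 25;
      Time1 = 60 + mod(Student, 5);
      Time2 = 70 + mod(Student, 7);
      output;
   end;
run;

proc sql;
   /* No GROUP BY: each summary value is remerged onto every observation.
      FORMAT= is optional and only affects how the values are displayed. */
   create table work.student_stats as
   select *,
          mean(Time1) as Time1_mean format=8.2,
          std(Time1)  as Time1_std  format=8.2,
          mean(Time2) as Time2_mean format=8.2,
          std(Time2)  as Time2_std  format=8.2
   from work.student_data;
quit;
```

SAS writes a note to the log saying the query requires remerging summary statistics back with the original data; that note is expected here.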
Next we will use a data step to compute three (3) standard deviations above and below the means, flag the observations, and write out the three different datasets. We will also eliminate non-essential variables for each output dataset specified.
Sample SAS Code and Output Dataset
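A sketch of such a data step, assuming the input already carries the mean and standard deviation of each time variable on every observation. The stand-in values are made up, and every dataset and statistic variable name below is hypothetical.

```sas
/* Stand-in for the table with statistics on every row (made-up values) */
data work.student_stats;
   do Student = 1 to 4;
      Time1 = 60 + Student;  Time2 = 70 + Student;
      Time1_mean = 84.2;     Time1_std = 15.3;
      Time2_mean = 90.5;     Time2_std = 17.8;
      if Student = 4 then Time1 = 200;   /* made-up outlier */
      output;
   end;
run;

/* KEEP= and DROP= on each output dataset remove non-essential variables */
data work.all_flags (drop=Time1_mean Time1_std Time2_mean Time2_std)
     work.flagged   (keep=Student Time1 Time2)
     work.unflagged (keep=Student Time1 Time2);
   set work.student_stats;
   Time1_sd3plus  = Time1_mean + 3 * Time1_std;
   Time1_sd3minus = Time1_mean - 3 * Time1_std;
   Time2_sd3plus  = Time2_mean + 3 * Time2_std;
   Time2_sd3minus = Time2_mean - 3 * Time2_std;
   Flag = (Time1 > Time1_sd3plus or Time1 < Time1_sd3minus or
           Time2 > Time2_sd3plus or Time2 < Time2_sd3minus);
   output work.all_flags;
   if Flag = 1 then output work.flagged;
   else output work.unflagged;
run;
```

Because the dataset options are applied only when observations are written, the step can still use every variable internally while each output keeps just the columns it needs.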
All Student_Data observations with Flag variable
Only Student_Data observations with Flag variable equal to 1 (Flag variable has been dropped)
Only Student_Data observations with Flag variable NOT EQUAL to 1 (Flag variable has been dropped)
Each method yields the exact same data sets and results. It is up to the user to determine which method he or she wants to use. Neither method is better or worse than the other. Some people may be more comfortable using the data steps and PROCs, while others are more comfortable and familiar with PROC SQL.
CONCLUSION
It is always difficult to figure out ways to do things that are not common. We were able to figure out how to utilize the IF _N_=1 Set method by using the data step and PROCs that we were familiar with. We decided to tackle the same issue using PROC SQL and illustrate both ways to accommodate many SAS users. Each method will yield the same results. As long as the results are accurate, either method will work.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Kim Brunnert
Elsevier
1011 Richmond Ave, Ste 450
Houston, TX 77042
Phone: 713-346-6984
E-mail: k.brunne[email protected]
Lisa Mendez
Knowesis, Inc.
San Antonio, TX
Phone: 202-709-8932 ext 231
E-mail: lmendez@knowesis-inc.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.