SAS SOFTWARE FOR LOG-LINEAR MODELS
Peter B. Imrey
University of Illinois
response, on which sampled units are jointly
observed. These response variables may be
nominal, ordinal, or scaled, or the set of
variables may contain some of each type. When
multiple dimensions of the variables defining
the r response categories correspond to
observation of the same response under several
conditions or at different times, then the
term "repeated measurement" is used to describe
the array of responses. When a repeated
measurement response array exists in each of
several populations differentiated by levels
of one or more experimental or observational
factors, the s x r table is called a "split-plot" categorical data set. These terms are
entirely consistent statistically with their
counterparts in the terminology of classical
analysis-of-variance of continuous measurement
data. Categorical data analysis software
varies in the degree of flexibility in handling
the various types of structure that may exist
in the s populations and/or the r responses
of the underlying table. Programs which may
be able to incorporate scaling in the population structure may provide less opportunity
for employing scaling of response categories
efficiently in an analysis. Programs which
easily handle factorially-structured response
categories involving several conceptually
different variables may not be equipped to
address the specific scientific and, hence,
analytic issues that arise when the cross-classified response dimensions represent
different levels of underlying split-plot
1. Introduction
This paper will review the capabilities of
current SAS software for log-linear model
analyses of counted data and extensively discuss, using a variety of examples, the enhancements to these capabilities incorporated in the
SAS Version 5.0 procedure CATMOD, which replaces
FUNCAT from Version 4 releases. The relationships of various procedures available from
SAS Institute and, to some extent, those from
other vendors, will be remarked upon. Capabilities of PROC CATMOD and aspects of its
syntax that seem important, but are not
highlighted in the forthcoming documentation,
will receive attention here as will limitations of the new software. Although the
collection of software mentioned here provides
capabilities far more general than log-linear
model analysis, no attempt will be made to
comment upon such capabilities more than
peripherally or for contextual purposes.
No attempt is made to provide a comprehensive
treatise on categorical data analysis using
SAS software. It is hoped that this paper
will inform the reader wishing to apply log-linear model analytic techniques to SAS data
sets, using SAS software, and that portions of
it will serve as useful adjuncts to SAS
Institute documentation.
2. General Framework for Log-linear Modeling
It is assumed that available data may be
structured as a contingency table of counts
and underlying probabilities, as in Table 1.
Rows represent different physical or conceptual populations from which sampling is
reasonably modeled as product-multinomial,
implying response independence for different
subjects sampled from the same population and
independent sampling of subjects from different
populations. Columns represent categories into
exactly one of which each response must be
classified. The population by response
tabulation has s populations and r responses,
with $n_{ij}$ representing the observed data
count of response j in population i, and
$\pi_{ij}$ symbolizing the probability of a random
individual from population i exhibiting
response category j. The s populations may
relate to one another through structure imposed
by the nature and levels of defining variables.
For instance, they may correspond to a factorial
cross-classification of levels of several
experimental factors, possibly all of substantive interest or some of which may
represent strata defined by other variables to
be controlled for in the analysis. Populations
may correspond to ordered or scaled levels of
experimental factors, such as doses of a test
compound in a bioassay. Similarly, the r
response categories may correspond to cross-classification of several dimensions of
factors.
Table 2 represents an example of a data
array incorporating several types of structure.
Six populations, corresponding to a 2 x 3
cross-classification of Age range by Town, are
represented. Town is a nominal variable,
while Age is ordinal and subject to reasonable
scaling in analysis. The response is a sixteen
category (r = 16) indicator structured as a
$2^4$ cross-classification corresponding to the
presence of any root caries in each of four
quadrants of the mouth of each subject under
study. Thus, this is a split-plot categorical
data array. The comparison of quadrants within
the mouth is of interest, especially as to
whether frequency of root caries varies between
mandibular and maxillary tooth surfaces. Of
primary interest is any community effect,
which might be attributable to the substantial
difference in natural water fluoride level
between towns. While this example is one which
would not usually be addressed by log-linear
model analysis, we will return to it in that
context subsequently.
For purposes of this paper, a mathematical
framework for log-linear modeling is now presented. A log-linear model is defined as a
structural equation
Thus, the design is in essence nested in the
levels specified by j of the log (conditional
odds of j vs. r) response function specified
by $\ln(\pi_j/\pi_r)$. Each $\ln(\pi_j/\pi_r)$ is fitted with
its own set of parameters. No modeling of
the conditional odds as they depend on the
value of j is specifiable. The permissible
models form a subset of classical hierarchical
log-linear models, the nature of which has been
clarified in some detail by Bishop (1969).
Use of FUNCAT in this capacity has been
somewhat hampered for users by a notation in
the mathematical documentation which is at
variance with the literature in employing
conventional symbols to represent different
matrices than the research articles underlying
the analyses, and by errors in the descriptive
documentation which confuse the design matrix
for all responses with that for a single
conditional odds, thus suggesting that the
program has wider capabilities than it actually
possesses.
Beyond the analyses just described, however,
FUNCAT is capable of providing WLS fits and
test statistics for other log-linear models
for which the same model matrix applies to
a subset of the collection of $\ln(\pi_j/\pi_r)$, and
to proportional odds analogues of all such
analyses. Such latter models would involve,
as responses, some subset of the functions
$\ln\bigl(\sum_{\ell \ge j}\pi_\ell \big/ \sum_{\ell < j}\pi_\ell\bigr)$ for a selected set of
$\pi = D^{-1}\exp(X\beta)$,
where
$\pi$ is a vector of probabilities (such as all
those in Table 1);
$\beta$ is a vector of unknown model parameters;
X is a known "model matrix" expressing the
model subspace through a selected coordinate
system (parametrization);
exp is the element-wise matrix exponentiation operator;
and $D^{-1}$ is a diagonal normalizing matrix
incorporating restrictions on
sums of probabilities, specified by
$D = \operatorname{diag}\{E\exp(X\beta)\}$, with E containing zeroes
and ones.
Three special cases of this model have been
found especially useful:
Classical set-up: $\pi$ is a "strung-out" vector
of probabilities internal to a multi-way
contingency table with s populations and
r response categories, such as Table 1.
E is a block diagonal matrix, $E = 1_r 1_r' \otimes I_s$.
This formulation incorporates all of
what are conventionally termed log-linear
models in the current literature, including
the factorial models of Bishop, Fienberg
and Holland (1975) and the ordinal models
discussed by Goodman (1983) and Agresti
(1984).
Proportional (cumulative) odds: $\pi$ is a vector
of 2(r - 1) "stairstep" cumulative
probabilities and their complements, from
each of s populations. $E = 1_2 1_2' \otimes I_{s(r-1)}$.
Split-plot analysis: $\pi$ is a vector of k
marginal distributions of a repeated
response, repetitions of which constitute
dimensions of a multiway contingency table,
within each of s populations.
$E = 1_r 1_r' \otimes I_{sk}$.
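As an illustration of the classical set-up, the following minimal Python sketch evaluates cell probabilities by exponentiating the linear predictor and normalizing within each population, which is the role played by the diagonal normalizing matrix. The function name `loglinear_probs`, the model matrix, and the parameter values are hypothetical and are not taken from the paper; the rows of X are assumed to be ordered population by population.

```python
import math

def loglinear_probs(X, beta, s, r):
    """Cell probabilities for s populations with r response categories each.

    Assumes the rows of X are ordered population by population, so that
    normalizing within consecutive blocks of r rows does the work of the
    diagonal normalizing matrix.
    """
    # Element-wise exponentiation of the linear predictor X * beta.
    e = [math.exp(sum(x * b for x, b in zip(row, beta))) for row in X]
    pi = []
    for i in range(s):
        block = e[i * r:(i + 1) * r]
        total = sum(block)  # normalizing constant for population i
        pi.extend(v / total for v in block)
    return pi

# Hypothetical example: s = 2 populations, r = 2 responses,
# one log-odds parameter per population (last category is the reference).
X = [[1, 0], [0, 0], [0, 1], [0, 0]]
beta = [0.5, -1.0]
pi = loglinear_probs(X, beta, s=2, r=2)
```

With this coding each parameter is the log odds of the first response versus the second within its population, so the probabilities in each block of two sum to one.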
3. Current SAS Software for Log-linear Modeling
Prior to the release of Version 5.0, SAS
software for log-linear modeling has consisted
of the SAS/BASE product PROC FUNCAT [Sall (1982)],
the supplemental library author-supported
PROC LOGIST [Harrell (1983)], and the PROC MATRIX
MACRO CATMAX (Stokes and Koch) described in the
1982 volume of the Proceedings. This listing
excludes user-generated programs which may
have been locally produced using the IPF
facility of PROC MATRIX. Each of the three
major facilities for log-linear modeling has
had substantial limitations from the perspective
of the devotee of such analyses. PROC FUNCAT,
a general functional modeling program for
grouped data, incorporates as one branch of
its capabilities a limited capacity for log-linear model fitting. In particular, PROC FUNCAT
provides weighted (generalized) least-squares
(WLS/GLS) or maximum likelihood (ML) analyses
of "multiple logit" models for which the same
across-population model matrix applies,
separately, to each $\ln(\pi_j/\pi_r)$ for all j.
cutpoints, modeled in parallel in terms of
the population structure. Whereas for the
classical log-linear model subset described
earlier, FUNCAT makes available ML fitted
parameters and cell probabilities along with
a likelihood ratio test of fit [obtained via
a Newton-Raphson (iterative WLS) computing
algorithm with product-multinomial likelihood
assumption], for the models described in this
paragraph only WLS solutions are available.
PROC LOGIST provides, for grouped or
ungrouped data: i) multiple logistic regression analyses for binary data; and ii) proportional odds analyses for ordered polytomies.
In contrast to the models specifiable using
PROC FUNCAT, the log cumulative odds from
models for polytomies generated by PROC LOGIST
share the same parameter sets. The model
matrix consists of blocks, specifying the
model for the log cumulative odds relative
to a single cut-point, stacked vertically to
model several response functions, rather than
diagonally as in FUNCAT. For each type of
model, PROC LOGIST generates ML estimates of
parameters and an omnibus likelihood ratio
test of significance for the model as a
whole. Stepwise model selection is available.
Significance of model parameters is evaluated
using Wald tests comparing each to its
standard error, estimated on the assumption
of model validity. For automated selection
of variables for entry into a model, efficient
score statistics are used, evaluated at the
current model fitted values. Like FUNCAT, LOGIST
Version 5.0 of SAS/BASE incorporates
PROC CATMOD, a replacement/enhancement of
PROC FUNCAT to a general categorical data
utility for grouped or ungrouped data, with
full log-linear modeling capacity. Along with
the power of FUNCAT, CATMOD incorporates the
capabilities of CATMAX and many, but not all,
of the functions of LOGIST. CATMOD lets the
user fit hierarchical and non-hierarchical log-linear models, multiple logistic models, quasi-independence models, quasi-symmetry models, a
variety of models for ordinal data, proportional
odds models, and log-linear models for marginal
probabilities generated by split-plot categorical data arrays. It incorporates a powerful,
parsimonious syntax for easy specification of
all hierarchical and many non-hierarchical
models. CATMOD provides the flexibility of
allowing the user to select from different
parametrizations which may be appropriate for
a log-linear model with one or more equivalent
specifications as a multiple logistic model.
Within CATMOD, WLS or ML procedures or both
may be generated for the classical log-linear
model set-up, as selected by the user. Like
the previous SAS programs, CATMOD uses a
Newton-Raphson approach to obtain ML solutions.
For proportional odds and split-plot or
repeated measurement models, only WLS solutions
are available. The log-linear modeling capabilities of CATMOD are fairly cleanly meshed
with the general WLS functional modeling
approach of Grizzle, Starmer and Koch (1969)
and colleagues, forming within one program a
tool for categorical analysis with superior
flexibility and user-friendliness. Nevertheless, CATMOD and its documentation do have
limitations, and the user attempting to learn
this program may encounter these. In the hope
of smoothing the way for some, a series of
illustrative examples is provided below.
uses a Newton-Raphson algorithm to obtain its
ML fits.
For dependent dichotomies to be related to
multiple explanatory variables, LOGIST and
FUNCAT are both powerful tools, with LOGIST
possessing greater flexibility in terms of its
stepwise fitting capabilities and its orientation to ungrouped data. For dependent
polytomies, LOGIST fits proportional odds
models only, having no capability for classical
log-linear modeling. Thus, LOGIST in no way
attempts to be a general tool for the fitting
of log-linear models to complex categorical
data arrays.
However, such a general tool does exist
within present SAS software, that tool being
the PROC MATRIX MACRO CATMAX. CATMAX fits
any classical log-linear model to any suitable
grouped categorical array. Its scope does not
include proportional odds or repeated measurement analyses, but it does produce both WLS
and ML solutions for classical models (using
Newton-Raphson, as the other programs, to obtain the ML fit). CATMAX provides the omnibus
likelihood-ratio test of model significance,
and Wald tests of individual parameters and
arbitrary sets of linear functions of them.
A major and, for many uses and users, critical
debility of CATMAX is its lack of a user-friendly front end. The data analyst must
construct and input the appropriate model
matrix, and any matrices specifying linear
functions to be tested, using PROC MATRIX
manipulations outside of CATMAX, and then
arrange to move these matrices into the
CATMAX code stream. Alternately, matrices
can be generated and entered by hand. This
inconvenience has made CATMAX, despite its
generality, a tool for experts rather than
a general purpose facility.
Thus, while it is clear that the three SAS
programs available all are useful for the
fitting of log-linear models, the most convenient and accessible program for fitting of
hierarchical log-linear models has, for many
if not most SAS users, been PROC BMDP. While
BMDP4F was not listed with SAS software above,
its range of available classical models, extensive model selection diagnostics and automated
sequencing aids, and ease of model specification have made it superior, as a general purpose
log-linear fitting tool, to the available software internal to SAS [Brown (1981)]. More
generally, when SAS software is compared to
its primary competitors in this area, BMDP4F
and SPSSX LOGLINEAR, the following summary
conclusions are apparent: i) FUNCAT is much
more general for categorical data modeling on
the whole, but rather less general and less
automated for log-linear modeling specifically,
than either major competitor; ii) LOGIST is
a logistic regression and proportional cumulative odds program that does not attempt to be,
and is not fairly evaluable as, a general log-linear modeling program; and iii) CATMAX is a
more general program for grouped data than
alternatives, but does not support the user with
a front-end for simple model specification.
4. Examples
4.1. A 3 x 2 table from a clinical trial of
respiratory therapies
Initially, a statistically trivial example
is used to illustrate aspects of the relation
of CATMOD to FUNCAT, and the difference in how
CATMOD views the same model in its log-linear
vs. its logistic formulation. Table 3 indicates
presence or absence of atelectasis, a common
complication of surgery, in patients 24 hours
post-abdominal surgery who each received one of
three respiratory therapies designed to promote
rapid recovery of lung function: cough and deep
breathing exercises (CDB), continuous positive
airway pressure (CPAP), or incentive spirometry
(IS). Are atelectasis at 24 hours and mode of
therapy significantly associated? Exclusive
of data entry and title statements, which will
generally be omitted from presentations of code
that follow, the CATMOD syntax for a full logit
analysis is:
PROC CATMOD;
  MODEL ATELECT=TRTMNT/ONEWAY FREQ PROB XPX
        COV COVB CORRB ML PRED=FREQ;
  CONTRAST 'CDB VS. CPAP' TRTMNT 1 -1;
  CONTRAST 'CDB VS. IS' TRTMNT 2 1;
  CONTRAST 'CPAP VS. IS' TRTMNT 1 2;
With the exception of the procedure name, these
commands are identical to those of PROC FUNCAT.
CATMOD interprets the data as a 3 population,
2 response contingency table and generates,
for each population by default, the logit of
absence of atelectasis, along with the full rank
model matrix corresponding to the deviation
parameterization of each therapy's variation
from the mean of three logits. The response
logits and the model matrix are shown in
Table 3. The CONTRAST statements specify
pairwise comparisons of treatments by Wald
chi-square statistics, in terms of the deviation
parameters in the model matrix which represent,
respectively, CDB and CPAP vs. all treatments.
As the model is saturated, CATMOD produces
identical WLS and ML estimates which perfectly
fit the data. A Wald statistic Q_W = 1.41,
D.F. = 2, P = .49, is reported for the TRTMNT
effect.
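The coefficients in the three CONTRAST statements follow from the deviation parameterization, since the IS deviation is minus the sum of the CDB and CPAP deviations. The following Python sketch checks this arithmetic. The counts are hypothetical placeholders (Table 3 is not reproduced here), and the variable names are illustrative only.

```python
import math

# Hypothetical counts, NOT the paper's Table 3:
# (atelectasis absent, atelectasis present) for each therapy.
counts = {"CDB": (30, 10), "CPAP": (28, 12), "IS": (25, 15)}

# Observed logit of absence of atelectasis in each population.
logits = [math.log(absent / present) for absent, present in counts.values()]

# Deviation coding: intercept, CDB deviation, CPAP deviation.
# The IS row carries -1, -1 because the three deviations sum to zero.
X = [[1, 1, 0],
     [1, 0, 1],
     [1, -1, -1]]

# The model is saturated, so the parameters follow directly from the logits.
mean = sum(logits) / 3
beta = [mean, logits[0] - mean, logits[1] - mean]

def contrast(c):
    """Apply CONTRAST coefficients to the two treatment deviation parameters."""
    return c[0] * beta[1] + c[1] * beta[2]

cdb_vs_cpap = contrast((1, -1))  # equals logit(CDB) - logit(CPAP)
cdb_vs_is = contrast((2, 1))     # equals logit(CDB) - logit(IS)
cpap_vs_is = contrast((1, 2))    # equals logit(CPAP) - logit(IS)
```

Each contrast reduces to a simple difference of two observed logits, which is why the coefficient pairs 2 1 and 1 2 appear for the comparisons involving IS.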
CATMOD may alternatively fit this model in
its equivalent log-linear form. Substitute
code for this approach follows.
MODEL TRTMNT*ATELECT=_RESPONSE_ /ONEWAY
      FREQ PROB XPX COV COVB CORRB ML PRED=FREQ;
REPEATED/ _RESPONSE_=ATELECT|TRTMNT;
CONTRAST 'IS MAIN EFFECT U-TERM'
         _RESPONSE_ 0 -1 -1;
CONTRAST 'IS BY ATELECT U-TERM'
         _RESPONSE_ 0 0 0 -1 -1;
CONTRAST 'CDB VS. CPAP EFFECT'
         _RESPONSE_ 0 0 0 1 -1;
CONTRAST 'CDB VS. IS EFFECT'
         _RESPONSE_ 0 0 0 2 1;
CONTRAST 'CPAP VS. IS EFFECT'
         _RESPONSE_ 0 0 0 1 2;
This syntax makes evident that CATMOD is
operated in log-linear model mode through the
combined use of the MODEL statement and a new
REPEATED statement, coupled with a new keyword,
_RESPONSE_. In this example, TRTMNT*ATELECT
on the left-hand side of the model equation
specifies that r = 6 response categories are
to be constructed from all combinations of
TRTMNT and ATELECT exhibited by the data.
Since there are no other variables, CATMOD
reads the data as a one-population (s = 1),
6 response table. By default in the absence of
a RESPONSE statement, five log ratios of counts
are generated from this table, comparing each
of the first five counts to the sixth. The
use of the keyword _RESPONSE_ on the right-hand
side of the model equation indicates that some
of the variables used to define the r = 6
response categories on the left-hand side will
also be used to produce columns of the model
matrix fit to these five response log ratios.
Which variables are to be used to construct
what columns is specified by the
_RESPONSE_= phrase following the slash
in the subsequent REPEATED statement. Here,
_RESPONSE_=ATELECT|TRTMNT specifies a model
matrix with the ATELECT*TRTMNT interaction, as
well as all lower order effects contained
within, viz. ATELECT and TRTMNT main effects.
The impact is to produce the saturated log-linear model for these data, equivalent to the
logit model specified earlier.
However, this formulation models five log
ratios with five independent parameters, rather
than three logits with three independent
parameters in the logit formulation. The
responses modeled, and the (full rank) model
matrix constructed by CATMOD, are shown in
Table 4. The matrix columns specify, respectively, an ATELECT main effect u-term, two
TRTMNT main effect u-terms (cols. 2 and 3),
and two interaction u-terms (cols. 4 and 5).
Remaining u-terms are dependent upon these,
and are specified by the first two CONTRAST
statements. The effect of TRTMNT on ATELECT
is represented by the interaction deviation
contrast u-terms, and pairwise comparisons
among treatments are generated from these by
the remaining three CONTRAST statements. The
identical Wald chi-square Q_W = 1.41 labeled as
TRTMNT main effect in the logit formulation is
here reported as the TRTMNT*ATELECT interaction
test in the printed "ANOVA" table. The pairwise
contrast statistics from the log-linear formulation are also identical to those from the logit
analysis, viz. 0.38, 1.41 and 0.37 for CDB vs.
CPAP, CDB vs. IS and CPAP vs. IS, each with
D.F. = 1 and clearly non-significant.
Specification of different hierarchical
analysis-of-variance models is simple using
CATMOD syntax. As mentioned above, vertical
bars between effects specify the corresponding
interaction and all lower order effects it
contains. To specify the interaction term
without lower order effects, * replaces the |.
Thus, TRTMNT|ATELECT is equivalent to
TRTMNT ATELECT TRTMNT*ATELECT. TRTMNT*ATELECT
alone would define a model with a general mean
column and two interaction columns with main
effects constrained to zero.
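The bar convention just described can be mimicked in a few lines. The Python sketch below is an illustration only: the helper `expand_bar` is hypothetical, not part of any SAS product, and it lists effects by increasing order, which may differ from the order in which CATMOD itself arranges model matrix columns.

```python
from itertools import combinations

def expand_bar(term):
    """Expand a bar term such as 'A|B' into the interaction and all
    lower order effects it contains. A term written with * is returned
    unchanged, since it names only the interaction itself."""
    factors = term.split("|")
    effects = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            effects.append("*".join(combo))
    return effects

bar_form = expand_bar("TRTMNT|ATELECT")   # main effects plus interaction
star_form = expand_bar("TRTMNT*ATELECT")  # interaction only
```

For two factors the bar form yields the three effects named in the text, while the asterisk form yields the interaction alone.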
The syntax
MODEL TRTMNT*ATELECT=_RESPONSE_ /ONEWAY
      FREQ PROB XPX COV COVB CORRB ML PRED=FREQ;
REPEATED/ _RESPONSE_=ATELECT TRTMNT;
specifies a main effects (independence) model
of treatment and clinical response. The lack-of-fit test for this model is the test of treatment effect. The WLS lack-of-fit test will be
identical to the Wald statistic obtained from
the saturated model but, since ML is requested,
the likelihood ratio lack-of-fit statistic
Q_L = 1.44, D.F. = 2, P = .49 is also generated.
Predicted values for the five modeled logits,
and for the six cell counts, are listed or may
optionally be output in a SAS data set for
model assessment. These fitted log ratios are
shown in Table 4, and the fitted counts appear
in Table 3.
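For the independence model in a two-way table, the ML fitted counts have the familiar closed form (row total times column total over grand total), so the likelihood ratio lack-of-fit statistic can be sketched without iteration. The Python fragment below is an illustration under that assumption; the counts are hypothetical and will not reproduce the value of 1.44 reported for Table 3.

```python
import math

def independence_fit(table):
    """Fitted counts under independence: row total * column total / grand total."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[rt * ct / n for ct in col_tot] for rt in row_tot]

def likelihood_ratio(table, fitted):
    """Q_L = 2 * sum of n_ij * ln(n_ij / m_ij), skipping empty cells."""
    return 2 * sum(o * math.log(o / m)
                   for obs_row, fit_row in zip(table, fitted)
                   for o, m in zip(obs_row, fit_row) if o > 0)

# Hypothetical 3 x 2 table (therapy by atelectasis), not the paper's data.
table = [[30, 10], [28, 12], [25, 15]]
fitted = independence_fit(table)
q_l = likelihood_ratio(table, fitted)
df = (len(table) - 1) * (len(table[0]) - 1)  # 2 for a 3 x 2 table
```

The degrees of freedom agree with the D.F. = 2 quoted above for the 3 x 2 table.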
Before moving to more complex illustrations
of CATMOD, some comments on syntax are appropriate. The code presented above repeats an
array of options which generate useful documentation and check output but may, of course, be
deleted when unnecessary. The REPEATED
statement and _RESPONSE_ keyword are new SAS
terms for which some explanation is helpful.
The REPEATED statement appears not only in
Version 5.0 CATMOD, but also in ANOVA and
GLM. Its primary purpose is to designate and
label situations where response dimensions
themselves represent levels of factors to be
included in an analysis, such as in repeated
measures or split-plot situations. In these
cases, model matrix columns are generated
corresponding to comparisons amongst dependent
responses arising from within the same populations. Since log-linear modeling involves the
construction of analogous model matrix columns
for multiple dependent log ratios, CATMOD uses
the REPEATED statement to specify the form of
a log-linear model even when, as in most cases,
the data and model do not involve repeated
measurement or split-plot data. A technical
statistical analogy has here been converted
into an unfortunate terminologic red herring.
The REPEATED statement is most easily understood as a code which fulfills two quite
separate functions: i) the formulation of a
repeated measurement or split-plot model; and
ii) specification of a log-linear model
structure, whether or not there is any repeated
measurement aspect to the situation.
The _RESPONSE_ keyword is much easier to
understand intuitively. It replaces, on the
right-hand side of the MODEL statement, any
variable or combination of variables used both
in designating response categories and in forming the model matrix. Whenever such variables
exist, the manner in which they contribute to
the model is specified on the right-hand side
of a _RESPONSE_= equation within a REPEATED
statement, while _RESPONSE_ in the MODEL statement represents all parameters they determine.
_RESPONSE_ may appear only on the right-hand
side of the MODEL statement, may appear with
other variables, may be included in interactions
with them, or nested within their levels.
Other variables may not be nested in levels of
_RESPONSE_.
4.2. A $2^4$ table classifying motor vehicle
accidents with serious driver injury
Table 5 is a cross-classification of North
Carolina motor vehicle accidents involving
serious driver injury in 1973 or 1974 by Year
and dichotomies of Speed, Time (of day), and
Place (urban or rural). Exploration of this
table is of interest because of changes in the
driving context that took place in the 1973-4
period, including gasoline shortages and enactment of a national 55 mph speed limit, that
might have selectively affected frequencies
of certain types of accidents. A basic CATMOD
request for a maximum-likelihood log-linear
analysis of these data, on initial exploration
for a suitable model, is:
MODEL SPEED*TIME*PLACE*YEAR=_RESPONSE_ /ONEWAY
      NOPARM NOGLS ML;
This statement requests only an ML analysis of
the $2^4$ table, with no printout of tests of
individual fitted parameter values, but with
marginal one-dimensional distributions reported
as a data check. The resulting output includes
these distributions, a report on the iterations
of parameter estimates to convergence (so that
fitted parameters themselves are reported as
the last stage iteration), and an "ANOVA"
chi-square table for the effects incorporated
in the model. These are specified on a
subsequent REPEATED statement, as desired. Thus,
REPEATED/ _RESPONSE_=SPEED|TIME|PLACE|YEAR;
yields the saturated model, results of which
suggest various approaches to model reduction.
Several tried were:
REPEATED/ _RESPONSE_=SPEED|TIME|PLACE
          SPEED|TIME|YEAR;
REPEATED/ _RESPONSE_=SPEED|TIME|PLACE
          SPEED|YEAR TIME|YEAR PLACE|YEAR;
REPEATED/ _RESPONSE_=SPEED|TIME|PLACE
          SPEED|YEAR TIME|YEAR;
Note that, in accord with other programs for
fitting hierarchical ANOVA log-linear models,
each model is designated by listing only the
sufficient statistics which specify the model,
using an appropriate notation. Thus, in CATMOD,
SPEED|TIME|PLACE represents the SPEED by TIME
by PLACE three-way observed marginal distribution, which might be represented as STP,
S*T*P, SPEED BY TIME BY PLACE or otherwise in
another program. CATMOD allows the flexibility,
however, of fitting non-hierarchical models by
replacement of vertical bars by asterisks and
incorporation, thusly, of only selected sets of
main effects and interactions.
For these accident data, the last model
fit was regarded as acceptable (Q_L = 4.69,
D.F. = 5, P = .45 for lack-of-fit), with all
parameters significant at p < .015. Fitted
counts from this model are shown in Table 5,
and the analysis of individual parameters
produced by CATMOD is displayed in Table 6.
The SPEED*YEAR parameter reflects the overall
24% reduction in high-speed accidents with
serious driver injury from 1973 to 1974, as
compared to only a 10% reduction in lower
speed accidents with serious driver injury
across that period. The TIME*YEAR effect
reflects the overall 21% reduction in daytime
accidents of this type as compared to only a
3% reduction in nighttime accidents. These
findings are compatible with hypotheses involving reduced high-speed driving overall, and
increased car-pooling and use of public transport leading to reduced daytime exposure in
routine commuting. Other explanations are
possible, of course, and nothing conclusive can
be said in the absence of denominator data for
this selected group of accidents involving
serious driver injury.
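The lack-of-fit degrees of freedom quoted for the accepted model can be checked by counting effects. With four dichotomous factors there are 15 independent log ratios, and each effect in a hierarchical model contributes one parameter. The Python sketch below does this count; the helper names are hypothetical, and the model strings are written in the bar notation described in the text.

```python
from itertools import combinations

def effects_of(model):
    """Set of effects implied by a hierarchical model given as bar terms."""
    effects = set()
    for term in model.split():
        factors = term.split("|")
        for order in range(1, len(factors) + 1):
            for combo in combinations(factors, order):
                effects.add(frozenset(combo))
    return effects

def lack_of_fit_df(model, n_factors=4):
    """Residual d.f. when every factor is dichotomous (one parameter per effect)."""
    saturated = 2 ** n_factors - 1  # 15 independent log ratios in a 2^4 table
    return saturated - len(effects_of(model))

saturated_df = lack_of_fit_df("SPEED|TIME|PLACE|YEAR")
final_df = lack_of_fit_df("SPEED|TIME|PLACE SPEED|YEAR TIME|YEAR")
```

The final model contains seven effects from the SPEED by TIME by PLACE margin plus YEAR, SPEED*YEAR and TIME*YEAR, leaving 15 - 10 = 5 degrees of freedom, in agreement with the D.F. = 5 reported.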
4.3. A $2^3$ repeated measures drug comparison
Table 7 gives data from a drug comparison
trial in which 46 subjects received each of
Drugs A, B and C under similar circumstances,
with their joint responses noted. These data
have been analysed by many authors including
Koch, Imrey et al. (1976) and Koch, Landis
et al. (1977). Initially, a no three-way interaction model
is easily fit by CATMOD using the statements:
MODEL DRUGA*DRUGB*DRUGC=_RESPONSE_;
REPEATED/ _RESPONSE_=DRUGA|DRUGB DRUGA|DRUGC DRUGB|DRUGC;
Since ML is not specified, the fit defaults to
WLS, with the goodness-of-fit statistics
Q_W = 0.08, D.F. = 1, P = .78 reported by other
authors. The "ANOVA" table for this model
shows identical Wald chi-squares for
DRUGA*DRUGC and DRUGB*DRUGC of 0.45, D.F. = 1,
P = .50, but a DRUGA*DRUGB chi-square of 7.94,
p = .005.
PROC CATMOD; POPULATION ARTIFAC;
  MODEL DIST= /NODESIGN NOPARM NOGLS NOPROFILE
        ML PRED=FREQ;
  RESPONSE OUT=ARCHOUT;
DATA OUTCHEC; SET ARCHOUT
  (KEEP=_TYPE_ _NUMBER_ _RESID_ _PRED_);
  IF _TYPE_='FREQ'; STDRESD=_RESID_/SQRT(_PRED_);
  KEEP _NUMBER_ STDRESD;
PROC SORT; BY DESCENDING STDRESD;
PROC PRINT;
The POPULATION statement forces separate
treatment of each type of artifact as a population, whether or not ARTIFAC appears as a
variable in the MODEL statement. An independence
model is iteratively fit to the table, and the
likelihood ratio chi-square goodness-of-fit
test obtained (Q_L = 180.49, D.F. = 85, p < .0001).
An output data set is produced containing the
predicted values and residuals from each cell.
Standardized residuals are created as described
above. The data are sorted by descending
standardized residual and printed. The
analysis identifies two standardized residuals
with absolute values 5.09 and 4.06, as compared
to all others substantially below three. The
large standardized residuals correspond to
excesses of grinding stones in the immediate
vicinity of water, and Humboldt projectile
points 1-3 miles from water. Each excess has
a plausible substantive explanation. To search
for additional outliers, we fit a quasi-independence model to the remainder of the
table by excluding the cells generating the
identified outliers. Equivalently, these cells
are being treated as structural zeroes for the
continuing analysis.
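The second-stage search just described, in which flagged cells are set aside and independence is refit to the remaining cells, can be sketched as follows. This Python fragment uses iterative proportional fitting rather than the Newton-Raphson algorithm used by CATMOD; both target the same ML fitted counts for this model. The table, the excluded cell, and the function name are hypothetical, and the sketch assumes every row and column retains at least one positive count.

```python
import math

def quasi_independence_fit(table, excluded, iters=100):
    """Fit independence to the cells not in `excluded` by iterative
    proportional fitting; excluded cells are held at zero."""
    nr, nc = len(table), len(table[0])
    keep = [[(i, j) not in excluded for j in range(nc)] for i in range(nr)]
    fit = [[1.0 if keep[i][j] else 0.0 for j in range(nc)] for i in range(nr)]
    for _ in range(iters):
        for i in range(nr):  # match row totals over retained cells
            target = sum(table[i][j] for j in range(nc) if keep[i][j])
            scale = target / sum(fit[i])
            fit[i] = [v * scale for v in fit[i]]
        for j in range(nc):  # match column totals over retained cells
            target = sum(table[i][j] for i in range(nr) if keep[i][j])
            scale = target / sum(fit[i][j] for i in range(nr))
            for i in range(nr):
                fit[i][j] *= scale
    return fit, keep

# Hypothetical 3 x 3 table with cell (0, 0) treated as a structural zero.
table = [[40, 5, 5], [6, 18, 6], [4, 7, 19]]
excluded = {(0, 0)}
fit, keep = quasi_independence_fit(table, excluded)

# Standardized residuals for the retained cells, largest first.
residuals = sorted(
    (((table[i][j] - fit[i][j]) / math.sqrt(fit[i][j]), i, j)
     for i in range(3) for j in range(3) if keep[i][j]),
    reverse=True,
)
```

The sorted list plays the same role as the PROC SORT and PROC PRINT steps above: the cells with the largest standardized residuals appear first.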
The two outlying cell counts are deleted
from entry in the DATA step. For the earlier
MODEL statement, we substitute:
MODEL ARTIFAC*DIST=_RESPONSE_ /NODESIGN
      NOPARM NOGLS NOPROFILE ML PRED=FREQ;
REPEATED/ _RESPONSE_=ARTIFAC DIST;
Care is necessary here because of the manner in
which CATMOD handles zeroes and missing cells
or, equivalently, random zero counts vs.
structural zeroes. We wish to treat two cells
as structural zeroes while retaining several
random zeroes in the data table. In a multi-population context such as was defined in the
initial outlier screen, CATMOD treats all missing
cells or input zero counts as random unless a
response category is missing or unfilled in
all populations simultaneously, in which case the
category is disregarded entirely. In a single
population set-up, a zero or missing cell in
that single population is, by definition, zero
or missing in all populations, and its category
is thus analogously discarded by CATMOD. Thus,
in single population problems all zero or missing
cells are treated as structural zeroes and are
not modeled, unless special measures are taken
to identify those which are truly random, and
should be modeled, to the software. This is
awkward; however, one cannot effectively define
structural zeroes in the multi-population set-up
since missing cells are automatically treated
as random zeroes. Thus, the quasi-independence
model is fit in full log-linear rather than
logistic form, so that a distinction between
These results support a model in
which response to Drug C is independent of
responses to Drug A and Drug B, and in which
responses to these latter are associated.
Examination of Table 7 reveals that the data
are completely symmetric with respect to
Drugs A and B. Thus, an appropriate model
might incorporate terms due to a Drug C main
effect, with terms dependent on the number of
responses (0, 1 or 2) to Drugs A and B,
regardless of which of these latter drugs
generates a response. Such a non-hierarchical
log-linear model, with effects not specifiable
as obvious main effects, interactions or
nested effects of the dependent variable
factors, may be designated by directly entering the appropriate model matrix. The following
MODEL statement will do the job:
MODEL DRUGA*DRUGB*DRUGC=(2 2 0, 0 2 0,
2 2 -2, 0 2 -2, 2 2 -2, 0 2 -2, 2 0 0)
(1='C',2='A OR B',3='A AND B')/ONEWAY
FREQ PROB XPX COV COVB CORRB ML PRED=FREQ;
This statement enters directly a full-rank
model matrix corresponding to the seven log
ratios of each observed count to the last
(U U U), where the first parameter is an
increment due to positive response to C, the
second is an increment due to positive response
to at least one of A and B, and the third is
an additional increment due to concordant
responses to both A and B. The parenthesized
information after the matrix literal instructs
CATMOD to compute Wald tests of significance
for each parameter, and label these tests
respectively C, A OR B, A AND B. Note that
the model matrix entered is derived from a
corresponding matrix for the full set of eight
log probabilities by subtraction of the last
row from each of the first seven; the initial
matrix X to which this operation is applied is
given, in transposed form, by

$$
X' = \begin{bmatrix}
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
1 & 1 & 1 & 1 & 1 & 1 & -1 & -1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1
\end{bmatrix},
$$

and the likelihood ratio lack-of-fit test for this model is
Q_L = 1.75, D.F. = 4, p = .78. The ML-fitted
counts are shown in Table 7.
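The subtraction step described above can be checked mechanically. The following is a minimal Python sketch (not part of CATMOD; the helper name `full_row` and the coding of the three columns are read off the transposed matrix quoted in the text) that builds the eight-row matrix for the response patterns FFF, FFU, ..., UUU and subtracts the last row from each of the first seven.

```python
# Response patterns for Drugs (A, B, C), in the order FFF, FFU, ..., UUU.
patterns = [(a, b, c) for a in "FU" for b in "FU" for c in "FU"]

def full_row(a, b, c):
    """One row of the 8 x 3 matrix for the log probabilities (hypothetical helper)."""
    c_term = 1 if c == "F" else -1            # favorable response to C
    a_or_b = 1 if "F" in (a, b) else -1       # favorable response to at least one of A, B
    concordant = 1 if a == b else -1          # concordant responses to A and B
    return [c_term, a_or_b, concordant]

full = [full_row(*p) for p in patterns]

# Subtract the last row (U U U) from each of the first seven rows.
last = full[-1]
reduced = [[x - y for x, y in zip(row, last)] for row in full[:-1]]

for pattern, row in zip(patterns, reduced):
    print("".join(pattern), row)
```

The seven rows printed agree with the matrix literal in the MODEL statement, which is a convenient check when a model matrix is typed in by hand.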
Although the overall response rates in this
study are 61% for both Drugs A and B, and only
35% for Drug C, this model suggests (see fitted
counts) that once either Drug A or Drug B has
failed, the probability of success with Drug C
is at least as great as that for the remaining
drug.
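These statements can be verified from the observed and fitted counts of Table 7. The sketch below is only an arithmetic check in Python, assuming the rows of Table 7 are ordered FFF, FFU, FUF, FUU, UFF, UFU, UUF, UUU for Drugs A, B and C; the closed-form tail probability used for the p-value is valid only because the chi-square distribution here has 4 degrees of freedom.

```python
import math

patterns = ["FFF", "FFU", "FUF", "FUU", "UFF", "UFU", "UUF", "UUU"]  # Drugs A, B, C
observed = [6, 16, 2, 4, 2, 4, 6, 6]
fitted = [7.65, 14.35, 2.09, 3.91, 2.09, 3.91, 4.17, 7.83]

# Marginal favorable-response rates for each drug.
total = sum(observed)
rate = {
    drug: sum(n for p, n in zip(patterns, observed) if p[k] == "F") / total
    for k, drug in enumerate("ABC")
}

# Likelihood ratio lack-of-fit statistic: 2 * sum(observed * log(observed / fitted)).
g2 = 2 * sum(o * math.log(o / f) for o, f in zip(observed, fitted))
# Upper tail of chi-square with 4 D.F.: exp(-x/2) * (1 + x/2).
p_value = math.exp(-g2 / 2) * (1 + g2 / 2)

# Fitted conditional probabilities once Drug A has failed.
fit = dict(zip(patterns, fitted))
a_failed = sum(v for p, v in fit.items() if p[0] == "U")
p_c_after_a_fails = sum(v for p, v in fit.items() if p[0] == "U" and p[2] == "F") / a_failed
p_b_after_a_fails = sum(v for p, v in fit.items() if p[0] == "U" and p[1] == "F") / a_failed

print(rate, round(g2, 2), round(p_value, 2))
print(round(p_c_after_a_fails, 3), round(p_b_after_a_fails, 3))
```

By the symmetry of the fitted counts in A and B, the same comparison holds when Drug B is the one that has failed.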
4.4. Outlying cells and quasi-independence in
an 18 x 6 table
Table 8 is a slightly abridged version of
data presented by Casjens (1974), and analysed
for outliers by Mosteller and Parunak (1985,
in press). The latter authors explore various
methods of searching for extreme departures
from independence in such a table, in the hope
that such departures will be informative. Here,
CATMOD is used to implement a rather conventional
outlier search using residuals standardized by
division by the root of the predicted values
(the Poisson-based standard deviation estimate).
For an initial run, one may use:
structural and random zeroes may be drawn. This
is accomplished by transforming all random
zeroes to small positive numbers in the DATA
step. CATMOD documentation recommends replacement of all random zeroes by 1E-20 as a routine
procedure whenever any log-linear model is to
be fit. For this quasi-independence model such
treatment is essential.
Once the data are so presented to CATMOD,
the quasi-independence model yields a likelihood
ratio lack-of-fit statistic of Q_L = 145.16,
D.F. = 83, p < .0001, so that quasi-independence
is a poor fit, as was independence to the
original table. Nevertheless, the standardized
residuals from quasi-independence remain
between ±2.7, and are fairly symmetrically
and unimodally distributed, suggesting that
further departure from independence is not due
to one or a few additional outlying cells.
It must be noted that the above analyses,
that is, the independence and quasi-independence
fits, required respectively six and five iterations of the Newton-Raphson computing algorithm,
which is very inefficient and time-consuming
for problems of this size relative to the simple
computation of expected values under independence
(row total x column total/grand total), or to
the use of iterative proportional fitting (IPF)
for the quasi-independence model. For sufficiently large problems, limitations of computer
resources may suggest the use of IPF through
PROC BMDP and BMDP4F, or through the IPF PROC
MATRIX command.
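The closed-form independence fit, the standardized residuals used in the outlier screen, and IPF for quasi-independence can be sketched in a few lines. The Python below is an illustrative sketch, not the CATMOD, BMDP4F or PROC MATRIX implementation. The independence example uses the atelectasis counts of Table 3; the 3 x 3 table and its structural-zero position are hypothetical, and the function names are assumptions made for illustration.

```python
import math

def independence_fit(table):
    """Expected counts under independence: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def standardized_residuals(table, fitted):
    """(observed - fitted) / sqrt(fitted); cells with zero fitted value are set to 0."""
    return [
        [(o - e) / math.sqrt(e) if e > 0 else 0.0 for o, e in zip(obs_row, fit_row)]
        for obs_row, fit_row in zip(table, fitted)
    ]

def quasi_independence_ipf(table, structural, tol=1e-10, max_iter=1000):
    """IPF for quasi-independence; `structural` is a set of (row, column) cells
    held at zero. Assumes every row and column keeps a positive total."""
    n_rows, n_cols = len(table), len(table[0])
    free = [[(i, j) not in structural for j in range(n_cols)] for i in range(n_rows)]
    row_totals = [sum(table[i][j] for j in range(n_cols) if free[i][j]) for i in range(n_rows)]
    col_totals = [sum(table[i][j] for i in range(n_rows) if free[i][j]) for j in range(n_cols)]
    fitted = [[1.0 if free[i][j] else 0.0 for j in range(n_cols)] for i in range(n_rows)]
    for _ in range(max_iter):
        for i in range(n_rows):                      # rescale to match row margins
            scale = row_totals[i] / sum(fitted[i])
            fitted[i] = [x * scale for x in fitted[i]]
        for j in range(n_cols):                      # rescale to match column margins
            scale = col_totals[j] / sum(fitted[i][j] for i in range(n_rows))
            for i in range(n_rows):
                fitted[i][j] *= scale
        worst = max(abs(sum(fitted[i]) - row_totals[i]) for i in range(n_rows))
        if worst < tol:
            break
    return fitted

# Table 3 counts (rows CDB, CPAP, IS; columns Absent, Present).
atelectasis = [[13, 6], [13, 9], [11, 11]]
fit = independence_fit(atelectasis)
residuals = standardized_residuals(atelectasis, fit)

# Hypothetical table: cell (0, 1) is a structural zero, cell (2, 2) a random zero.
hypothetical = [[10, 0, 5], [4, 8, 6], [3, 7, 0]]
quasi = quasi_independence_ipf(hypothetical, structural={(0, 1)})
```

Each IPF cycle involves only rescaling of rows and columns, which is why it is attractive for large tables; the random zero in the hypothetical table receives a positive fitted value, while the structural zero stays at zero.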
4.5. Stratified clinical trial data with an
ordinal response
Table 9 displays results of a clinical
trial of an experimental agent vs. placebo in
treatment of pain from a chronic joint disease
at one of two possible anatomic sites. The
trial was conducted at two clinical centers,
and response classified into one of three
ordinal categories. Two models of interest for
smoothing and analysing data of this type are
an equal adjacent odds-ratio model [Andrich
(1979), a classical log-linear model], and the
proportional odds model obtained by applying a
similar uniform association structure as the
former to cumulative logits rather than to log
ratios of probabilities of individual categories.
The equal adjacent odds-ratio model specifies
that the conditional odds-ratio of Good to Medium
response for Active relative to Placebo drug is
equal to that for Medium to Poor response, for
each anatomic site and clinical center combination. This single odds-ratio then measures the
Drug effect. The model which incorporates this
assumption with main effects for anatomic site
and clinical center may be fit using CATMOD by:
POPULATION SITE CENTER DRUG;
MODEL PAIN = (1 0 0 0 0, 0 1 0 0 0,
1 0 2 0 0, 0 1 1 0 0, 1 0 0 2 0, 0 1 0 1 0,
1 0 2 2 0, 0 1 1 1 0, 1 0 0 0 2, 0 1 0 0 1,
1 0 2 0 2, 0 1 1 0 1, 1 0 0 2 2, 0 1 0 1 1,
1 0 2 2 2, 0 1 1 1 1)(3='DRUG EFFECT',
4='CENTER EFFECT', 5='SITE EFFECT')/ONEWAY
FREQ PROB XPX COVB CORRB ML PRED=FREQ NOGLS;
The POPULATION statement tells CATMOD that data
are to be arranged in populations corresponding
to all distinct combinations of the variables
Site, Center and Drug. Because the MODEL statement directly inputs the model matrix, instead
of directing its construction using names of
these factors, the POPULATION statement is
required to ensure that the data are not treated
as a single population. Treated as eight populations, sixteen log ratios of probabilities are
modeled by default, and the model matrix is
16 x 5. The same model might have been fit without the POPULATION statement, by entering SITE x
CENTER x DRUG x PAIN on the left-hand side of the
MODEL statement equation. However, CATMOD would
then have expected a larger (and more complex)
23 x 12 model matrix, corresponding to the 23 log
ratios of probabilities it would create, by
default, under the single-population, 24-response-category
assumption. For these data, with two
random zeroes, failure to specify populations
would have led CATMOD to treat the problem as
a 22-response-category single population, and
the desired model would not have been possible
to fit without first replacing the sampling
zeroes by negligible numbers, as discussed in
the previous example.
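The pattern in the 16 x 5 matrix literal may be easier to see when it is generated rather than typed. The Python sketch below is illustrative only. It assumes 0/1 indicators for Site, Center and Drug, with Drug varying fastest as implied by the POPULATION statement, and it uses scores of 2 and 1 for the two default response functions (taken here to be the log ratios of Good and of Medium to Poor, so that the first spans two adjacent-category steps and the second spans one).

```python
from itertools import product

rows = []
# Populations in POPULATION SITE CENTER DRUG order: Site slowest, Drug fastest.
for site, center, drug in product((0, 1), repeat=3):
    # Two log ratios per population: (intercept column, score).
    for intercept_column, score in ((0, 2), (1, 1)):
        row = [0, 0, 0, 0, 0]
        row[intercept_column] = 1      # separate intercept for each log ratio
        row[2] = score * drug          # parameter 3: Drug effect
        row[3] = score * center        # parameter 4: Center effect
        row[4] = score * site          # parameter 5: Site effect
        rows.append(row)

for row in rows:
    print(*row)
```

Printing the rows reproduces the matrix entered in the MODEL statement, two rows per population.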
For the equal adjacent odds-ratio model,
the likelihood ratio lack-of-fit chi-square
is 14.98 with D.F. = 11, p = .18, indicating
an adequate fit. The Site, Center and Drug
parameters, standard errors, and Wald chi-square statistics are reported in Table 10.
The Drug parameter is essentially at the 5%
level of significance.
An analogous proportional odds model may
be fit by adding to the code:
RESPONSE 1 -1 0 0/0 0 1 -1 LOG 1 0 0/0 1 1/
1 1 0/0 0 1;
which generates log ratios of cumulative odds.
Uniform association across cumulative logits is
imposed by changing all 2's to 1's in the previous
model matrix. For this model, maximum likelihood
analysis is not available from CATMOD (though
PROC LOGIST will provide it); ML must be dropped
as an option in the MODEL statement. Further,
the WLS fitting procedure uses the observed
response functions which, for this data set, are
undefined due to the placement of the two sampling
zeroes. To remedy this, these zeroes are replaced
by 0.5 in accordance with conventional practice
using WLS categorical data analysis [Grizzle,
Starmer and Koch (1969)], and the analysis is
carried through for illustration. The model fits
adequately (Q_W for lack-of-fit = 9.60, D.F. = 11,
p = .57), and its Site, Center and Drug parameters
and related statistics are included in Table 10.
In terms of parameter values, these results agree
closely with those of the ML fit obtained from
PROC LOGIST [Koch, Imrey, Singer, Atkinson and
Stokes (1985)]. With regard to formal inference,
all methods confirm the existence of Site and
Center effects (p < .05), while only the ML fit
of the proportional odds model yields a p-value
for Drug below 5% (chi-square = 3.95, p = .047). However,
of the three sets of results, the WLS fit has the
least favorable asymptotics, and is likely too
conservative.
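The cumulative logits can be written as a linear, log, linear sequence of operations on the three category proportions. The Python sketch below is illustrative only: the matrices A1 and A2 encode "sum the categories, take logs, take differences" for a response ordered Good, Medium, Poor, and the counts used are hypothetical. It also shows why a sampling zero in the first category leaves the observed response functions undefined, and how the conventional replacement by 0.5 removes the problem.

```python
import math

# Sums formed before taking logs: p1, p2 + p3, p1 + p2, p3.
A1 = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
# Differences of logs: log(p1) - log(p2 + p3) and log(p1 + p2) - log(p3).
A2 = [[1, -1, 0, 0], [0, 0, 1, -1]]

def matvec(matrix, vector):
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def cumulative_logits(counts):
    """Observed cumulative logits for one population of ordered counts."""
    n = sum(counts)
    proportions = [c / n for c in counts]
    return matvec(A2, [math.log(x) for x in matvec(A1, proportions)])

print(cumulative_logits([6, 9, 5]))          # hypothetical counts

try:
    cumulative_logits([0, 12, 12])           # sampling zero in the first category
except ValueError:
    print("undefined: log of zero")

print(cumulative_logits([0.5, 12, 12]))      # zero replaced by 0.5
```

With both logits sharing unit scores for Drug, Center and Site, the corresponding model matrix is the earlier 16 x 5 matrix with every 2 replaced by 1.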
4.6. Induced tumor regression in rat
carcinogenesis
Frequently it is desirable to fit log-linear
models to counts arising from sampling schemes
more complex than the product-multinomial or
Poisson models assumed by the WLS or ML
algorithms of CATMOD and related software.
Sometimes this may be accomplished through
CATMOD if the conventional sampling model
applies to some set of sampling units from
which the observed counts are derived, and if
the mode of derivation of these counts can be
explicitly formulated within the class of
response transformations supported by CATMOD.
For instance, Table 11 shows artificial data
on total palpated tumors and total tumors
regressing in a common rodent model of breast
carcinogenesis. A regressed tumor is one which
was palpable repeatedly but was subsequently
not found on palpation or necropsy. The data
reported are for two strains of rats fed an
anti-tumor agent, and their controls. It is
of interest to compare the observed proportions
of tumors which regress in these groups, viz.
48% and 52% for treated rats of Strain A and
B respectively, vs. 22% and 27% for their
corresponding control groups. Log-linear model
analysis might be used to do this, but would be
inappropriate if applied to the 2 x 2 x 2
table of Strain x Treatment x Regression,
because of the clustered nature of tumor sampling, which violates the product-multinomial
assumption since outcomes for multiple tumors
in the same animal are undoubtedly associated
biologically and statistically.
However, a product-multinomial model does
apply to Table 11, which uses rats rather than
tumors as the unit of analysis. Thus, log-linear models may be fit by deriving the
regression proportions as response functions
from Table 11. Appropriate code for the
saturated model is:
MODEL TUMTOT*TUMREG=TRTMNT|STRAIN/ONEWAY
FREQ PROB XPX COV COVB CORRB;
RESPONSE 1 -1 LOG 0 1 0 1 2 0 1 2 3/1 0 2
No ML analysis is available from CATMOD for this
model, as CATMOD is capable of ML fits only when
the product-multinomial likelihood applies to
counts in the initially entered population by
response table. Also, CATMOD is not capable
of accepting and applying WLS modeling to an
observed vector and covariance matrix generated
externally to CATMOD, such as from a TYPE=CORR
SAS data set.
4.7. A split-plot marginal log-linear model for
root caries prevalence data
As a final example, we return to the data
of Table 2 relating presence of any root caries
in each of four quadrants of the mouth to Town
of residence and Age, where subjects were drawn
from two towns of widely different water fluoride levels. Constructing variable names using
MAX and MAND to represent maxillary (upper) and
mandibular (lower) levels of teeth, and L and R
prefixes to designate left and right sides of
the mouth, the following CATMOD code fits a
saturated log-linear model to the marginal
prevalence proportions:
RESPONSE LOGIT;
MODEL LMAND*RMAND*LMAX*RMAX=TOWN|AGECAT
|_RESPONSE_;
REPEATED LEVEL 2 SIDE 2/_RESPONSE_=
LEVEL|SIDE;
The command RESPONSE LOGIT; defines the
responses for analysis as the marginal logits
of each variable listed on the left-hand side
of the MODEL statement equation. That equation
defines the model matrix as consisting of all
main effects and interactions included in
the interaction TOWN*AGECAT crossed with effects
which are functions of the dimensions of the
four-way dependent variable response table.
The syntax REPEATED LEVEL 2 SIDE 2/ indicates
that the dimensions of the LMAND*RMAND*LMAX*RMAX
table correspond to the cross-classification of
two repeated measures (split-plot) factors,
LEVEL and SIDE, each with two values, and that
the value of SIDE, the second listed variable,
changes most rapidly in the listing of response
dimensions in the MODEL statement. This is
sufficient to specify the prefixes L and R as
designating the two categories of SIDE, and
MAND and MAX the two categories of LEVEL.
_RESPONSE_=LEVEL|SIDE; puts LEVEL, SIDE and
their interaction into the model, and these are
crossed with the whole-plot factors TOWN and
AGECAT as a result of the |_RESPONSE_ portion
of the MODEL statement.
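To make the notion of marginal logits concrete, the following Python sketch computes them for a single population from the 16 joint Y/N patterns of the four quadrant variables. The counts are hypothetical, and the direction of the logit (Y relative to N) is an assumption made for illustration; the direction CATMOD uses depends on how it orders the response levels.

```python
from itertools import product
import math

variables = ["LMAND", "RMAND", "LMAX", "RMAX"]
# All 16 joint patterns, e.g. ('N', 'N', 'N', 'Y'), with RMAX varying fastest.
patterns = list(product("NY", repeat=4))

# Hypothetical counts of subjects for the 16 patterns in one Town x Age population.
hypothetical_counts = [60, 5, 4, 3, 6, 2, 2, 1, 7, 2, 3, 1, 2, 1, 1, 0]
counts = dict(zip(patterns, hypothetical_counts))

def marginal_logits(counts):
    """Logit of the marginal proportion with caries (Y) in each quadrant."""
    total = sum(counts.values())
    logits = {}
    for k, name in enumerate(variables):
        yes = sum(n for pattern, n in counts.items() if pattern[k] == "Y")
        logits[name] = math.log(yes / (total - yes))
    return logits

print(marginal_logits(counts))
```

All four logits are computed from the same subjects, so they are correlated; this is why the analysis treats them as a vector of response functions per population rather than as four independent samples.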
The "ANOVA" table resulting from this
analysis shows both whole-plot factors significant at p < .0001, and no other terms with
p < .05. Clearly the model may be reduced. To
specify the main effects model, one may simply
remove all vertical bars from the MODEL and
REPEATED statements. This model also shows no
significant effects of the split-plot factors
LEVEL and SIDE, while TOWN and AGECAT remain
significant at p < .0001. If models without
parameters representing differences due to
LEVEL and SIDE are to be fit, the REPEATED
statement may be dropped. To simultaneously fit
separate main effects models of TOWN and AGECAT
to each of the four quadrants, use
1 0 3 2 1 0;
Other models may be obtained by changing
TRTMNT|STRAIN e.g. to TRTMNT STRAIN for the main
effects model. The RESPONSE statement computes,
for each population formed by combinations of
TRTMNT and STRAIN, the log of the ratio of regressed
to non-regressed tumors, which is equal to the
logit of the proportion of tumors regressing in
that population. This is the same response
function that would be formed if the data were
input on a per tumor basis and the default
response was used. However, analysis in that
situation would use the wrong likelihood or,
for WLS, moments based on the wrong probability
model. Formulation of these responses through
the RESPONSE statement based on data input on a
per subject basis allows the software to create
the correct estimated moments for a WLS analysis
based on an appropriate probability model. The
Wald statistics for the saturated and main
effects models are shown in Table 12; the WLS
analysis shows no interaction or Strain effect,
and a significant (p=.0003) effect of the
experimental drug in enhancing tumor regression.
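The response function can be illustrated for a single population. In the Python sketch below the nine response categories are the (total tumors, regressed tumors) combinations for rats with one to three palpated tumors; the two rows of the matrix applied before the log, 0 1 0 1 2 0 1 2 3 and 1 0 2 1 0 3 2 1 0, give the regressed and non-regressed tumor counts in each category. The numbers of rats are hypothetical and are not taken from Table 11.

```python
import math

# Response categories (TUMTOT, TUMREG) for rats with 1 to 3 palpated tumors.
categories = [(tot, reg) for tot in (1, 2, 3) for reg in range(tot + 1)]

# Rows applied before the log: regressed and non-regressed tumors per category.
regressed_row = [reg for tot, reg in categories]
non_regressed_row = [tot - reg for tot, reg in categories]

# Hypothetical numbers of rats in each category for one population.
rats = [4, 2, 3, 5, 1, 6, 4, 2, 1]
n = sum(rats)
proportions = [r / n for r in rats]

# Mean regressed and mean non-regressed tumors per rat, then the log ratio.
mean_regressed = sum(a * p for a, p in zip(regressed_row, proportions))
mean_non_regressed = sum(a * p for a, p in zip(non_regressed_row, proportions))
response = math.log(mean_regressed) - math.log(mean_non_regressed)

# The same quantity is the logit of the pooled proportion of tumors regressing.
total_regressed = sum(reg * r for (tot, reg), r in zip(categories, rats))
total_tumors = sum(tot * r for (tot, reg), r in zip(categories, rats))
pooled = total_regressed / total_tumors
print(response, math.log(pooled / (1 - pooled)))
```

The point estimate is the same as a per-tumor analysis would give; what changes is the covariance attached to it, which is derived from the multinomial distribution of rats over the nine categories.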
complex multivariate association structures which
its test procedures may adjudge to be non-random.
CATMOD, on the other hand, allows full log-linear or other structural modeling and general
estimation of effects within any structural
model which it can test. iv) FREQ produces
some exact testing, but relies mainly on
asymptotic procedures. CATMOD uses asymptotic
methods only.
Thus, FREQ and CATMOD have essentially
disjoint capabilities. However, frequently
each will be valuable in generating analyses
which will provide complementary insights into
the same data set, addressing similar questions
under different assumptions or at different
levels of generality.
MODEL LMANDIND*RMANDIND*LMAXIND*RMAXIND=
TOWN AGECAT;
To treat all four quadrants identically by
fitting the same main effects model to them with
one set of parameters, so that the fitted values
are the same for each quadrant, use
MODEL LMANDIND*RMANDIND*LMAXIND*RMAXIND=
TOWN AGECAT/AVERAGED PRED;
Here, the option AVERAGED directs CATMOD
to construct a model matrix with identical rows
corresponding to the elements of each set of
multiple response functions within the same
population, so that the resultant parameters
apply to all response functions simultaneously.
Since the fitted parameters are (weighted)
averages of those that would have been fitted
to each response function separately, the usage
is justified.
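The structure of an averaged model matrix can be sketched as follows. This Python fragment is illustrative and is not CATMOD output: it assumes deviation-from-mean coding for TOWN and AGECAT, and the population order and column order are assumptions. Its purpose is only to show the repetition of one design row for each of the four quadrant response functions within a population.

```python
towns = ["Stratford", "Woodstock"]
ages = ["30-49", "50-59", "60+"]
n_response_functions = 4        # one marginal logit per quadrant

def design_row(town, age):
    """Intercept, Town, and two Agecat columns (assumed deviation coding)."""
    town_code = 1 if town == towns[0] else -1
    age_code = {"30-49": (1, 0), "50-59": (0, 1), "60+": (-1, -1)}[age]
    return [1, town_code, *age_code]

# The same row is repeated for every response function in a population.
matrix = [
    design_row(town, age)
    for town in towns
    for age in ages
    for _ in range(n_response_functions)
]

n_functions = len(matrix)                  # 6 populations x 4 functions
n_parameters = len(matrix[0])
residual_df = n_functions - n_parameters
print(n_functions, n_parameters, residual_df)
```

With 24 response functions and 4 parameters, the residual degrees of freedom are 20, which agrees with the Residual line of Table 13.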
6. SASware Ballot 1985, CATMOD Section
To conclude, a number of recommendations
are provided by which CATMOD might be made
even more flexible and attractive to its users.
The "ANOVA" results for this
model are given in Table 13.
It is evident
from examination of the parameters that this
model might be further reduced to include only
a pseudo-linear term for the effect across age
categories. To do this, the variable AGEINC,
with values -1, 0 and 1, must be created from
AGECAT in the DATA step. Replacing AGECAT by
AGEINC in the MODEL statement, and prefacing
that statement by DIRECT AGEINC, would then
fit the desired model. The added statement
places AGEINC directly into the model matrix,
rather than using its values to define a three-level categorical factor as was AGECAT. Finally,
note that AVERAGED also allows modeling of
differences between response functions using
appropriate model specifications.
5. Comparison of PROC CATMOD with PROC FREQ
In Version 5.0 of SAS/BASE, PROC CATMOD and
PROC FREQ will form complementary tools for the
analysis of categorical data. A brief (and
somewhat oversimplified) comparison is appropriate here. Note that PROC FREQ in Version 5.0
has been available as PROC TFREQ in earlier
versions. PROC FREQ produces a variety of
contingency table measures of association, with
Cochran-Mantel-Haenszel generalized average
partial association tests. Major differences
in approach between PROC FREQ and PROC CATMOD
are enumerated below. i) FREQ executes
randomization model-based analyses allowing
formal statistical inference only to the subjects
generating the data under study. CATMOD
incorporates random sampling assumptions which,
if valid, allow generalization to broader target
populations. On the other hand, CATMOD analyses
are invalid if these assumptions are grossly
violated. Depending on the nature of the model
and sample sizes available, CATMOD may rely
upon stringent but untestable assumptions.
FREQ makes very weak assumptions, at sacrifice
of scope of inference. ii) FREQ generates a
variety of standard descriptive statistics which
are independent of any structural model. CATMOD
generates only descriptive statistics calculated
through its RESPONSE statement by the user, or
based on structural models for such response
functions. iii) FREQ does average partial
association testing only, with summary partial
association measures available under particular
circumstances. FREQ is poor at describing
A. The REPEATED statement should be split
   into the two rather different statements
   of which it currently forms a hybrid.
   REPEATED should be retained for genuine
   repeated measures or split-plot analyses
   and a new statement, for instance, LOGLIN,
   added to incorporate its current function
   of log-linear model specification.
B. Provision should be allowed for direct
   input of a vector of response functions and
   associated covariance matrix, possibly
   produced as an output data set from another
   SAS PROC, for direct WLS modeling. This
   capability would allow for the modeling of
   data sets from complex sample surveys and
   other situations in which the standard
   probability models do not apply. The
   programming needed to add such a capability
   would seem to be minimal, but may not be.
C. Allow the concatenation of response statements, so that more general response
   functions may be more easily constructed,
   for instance by taking ratios of mean
   scores by matrix operations on a vector
   of means, where the means are specifiable
   by key word.
D. Incorporate predicted probabilities under
   general functional WLS modeling, such as
   marginal log-linear modeling or proportional
   odds modeling. Where invertible functions
   are modeled [see Dunn (1985) in this
   volume], back-transformation may be used.
   Otherwise, minimum Neyman chi-square estimates are available under the model [Koch,
   Imrey, Singer, Atkinson and Stokes (1985)].
E. Provide a focussed model reduction capability within a single model statement,
   such as by specification of a selected set
   of null effects, parameters, or contrasts
   among parameters.
F. Introduce limited automated model building
   capabilities, offering certain commonly
   used sequences of model reduction.
G. Allow front-end selection of individual
   degrees of freedom in multiple degree of
   freedom effects which are not nested. This
   would make it possible, for instance, to
   construct a model incorporating a linear
   trend only among three equally-spaced
   populations, without having to specify the
   linear contrast through a direct statement.
H. In a related vein, allow the user to choose
   effect parametrizations of certain common
   types, e.g. orthogonal polynomial contrasts,
   comparisons with control contrasts, etc.,
   within CATMOD automatically, as is done
   in SPSSX LOGLINEAR.
I. Since the Newton-Raphson algorithm is
   highly inefficient for large models and
   large tables, allow an iterative proportional fitting option for hierarchical
   models for large tables.
J. Improve or clarify the handling of random
   vs. structural zeroes. It seems foolish
   and inelegant for the user to have to
   distinguish random zeroes to a log-linear
   model program by converting them to small
   positive numbers. Perhaps a specific entry
   symbol for a structural zero might be
   designated.
K. Improve the labeling of parameters in
   general, especially those incorporated
   with the _RESPONSE_ effect.
L. Provide an option to allow printing of the
   model matrix when the combination of NOGLS
   and ML options is used. These options will
   frequently be used together by those who
   do not wish the WLS output, but are fitting
   a moderate sized model to an array of
   grouped categorical data. They would find
   the model matrix useful, are not doing
   logistic regression, are not otherwise at
   risk of generating a monstrous output,
   and should be able to see the design.
M. Allow the user to print the non-full rank
   design matrix for the log-linear model,
   rather than the reduced design matrix, as
   the former is easier for most users to
   check and interpret.
References
Agresti, A. [1984]. Analysis of Ordinal
Categorical Data. New York: Wiley.
Andrich, D. [1979]. Biometrics 35, 403-415.
Bishop, Y.M.M. [1969]. Biometrics 25, 383-400.
Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W.
[1975]. Discrete Multivariate Analysis.
Cambridge, MA: MIT Press.
Brown, M.B. [1981]. In BMDP Statistical Software,
Eds. W. J. Dixon, et al., 143-208.
Los Angeles: University of California Press.
Casjens, L. [1974]. The Prehistoric Human
Ecology of Southern Ruby Valley, Nevada.
Doctoral Dissertation. Harvard University,
Department of Anthropology.
Dunn, J.E. [1985]. In SUGI-SAS Users Group
10th Conference Proceedings, 989-998.
Fienberg, S.E. [1980]. The Analysis of Cross-Classified Categorical Data. 2nd Ed.
Cambridge, MA: MIT Press.
Goodman, L.A. [1983]. Biometrics 39, 149-160.
Grizzle, J.E., Starmer, C.F. and Koch, G.G.
[1969]. Biometrics 25, 489-504.
Harrell, F.E., Jr. [1983]. In SUGI Supplemental
Library User's Guide, Ed. S. P. Joyner,
181-202. Cary, NC: SAS Institute, Inc.
Koch, G.G., Imrey, P.B., Freeman, D.H., Jr. and
Tolley, H.D. [1976]. In Proc. 9th Int.
Biometric Conf. I, 317-336. Raleigh, NC:
The Biometric Society.
Koch, G.G., Imrey, P.B., Singer, J.S.,
Atkinson, S. and Stokes, M.E. [1985].
Lecture Notes on Categorical Data
Analysis. Montreal: University of
Montreal.
Koch, G.G., Landis, J.R., Freeman, J.L.,
Freeman, D.H., Jr. and Lehnen, R.G. [1977].
Biometrics 33, 133-158.
Mosteller, F. and Parunak, A. [1985]. In
Exploring Data Tables, Trends, and Shapes,
Eds. D. C. Hoaglin, F. Mosteller and J.
Tukey, Ch. 5. New York: Wiley, in press.
Sall, J.P. [1982]. In SAS User's Guide:
Statistics, Ed. A. A. Ray, 257-286. Cary,
NC: SAS Institute, Inc.
SPSS, Inc. [1983]. SPSSX User's Guide, 541-570. New York: McGraw Hill.
Stock, M.C., Downs, J.B., Gauer, P.K.,
Alster, J.M. and Imrey, P.B. [1985]. Chest
87, 151-157.
Stokes, M.E. and Koch, G.G. [1983]. In SUGI-SAS Users Group 8th Conference Proceedings,
795-800.
Acknowledgements
The author is grateful to Sandra Emerson
of SAS Institute, and to Beth Richardson,
Vicki Dingler and Joan Alster of the University
of Illinois Computing Services Office, for
making available a test version of SAS 5.0 for
exploration of PROC CATMOD during the preparation of this paper. William Stanish, developer
of CATMOD at the Institute, served extensively
as a consultant with regard to CATMOD's inner
workings and those of PROC FUNCAT. Katherine
Council and Andy Littleton of the Institute, and
Gary Koch of the University of North Carolina,
provided unusual editorial cooperation to allow
production of the manuscript in time to appear
in these Proceedings. Ann Thomas of the University of North Carolina at Chapel Hill typed the
manuscript rapidly and efficiently. Partial
support for the activities at the Department
of Biostatistics, University of North Carolina
was provided through Joint Statistical
Agreement JSA 84-5 with the U.S. Bureau of
the Census. Fred Mosteller introduced the
author to the data in problems of Section 4.4.
J. S. Stamm and D. W. Banting kindly permitted
use of the data from the Stratford-Woodstock
Root Caries Studies in Section 4.7. William
Stanish kindly reviewed the manuscript but bears
no responsibility for any remaining errors.
*SAS is the registered trademark of SAS
Institute, Inc., Cary, NC, USA.
SPSSX is the registered trademark of
SPSS, Inc., Chicago, IL, USA.
TABLE 1. CANONICAL DATA ARRAY

                        Responses
Populations      1      2      3     ...     r
     1          n11    n12    n13    ...    n1r
     2          n21    n22    n23    ...    n2r
    ...
     s          ns1    ns2    ns3    ...    nsr

TABLE 2. ROOT CARIES IN QUADRANTS OF THE MOUTH, BY AGE AND TOWN:
STRATFORD-WOODSTOCK CARIES STUDY
Quadrant
Left Maxillary
Right Maxillary
Left Mandibular
Right Mandibular
~
N
N
N
N
N
N
N
Y
N
N
Y
N
N
N
Y
Y
N
Y
N
N
N
Y
N
Y
N
Y
Y
N
N
Y
Y
Y
Y
N
N
N
Y
N
N
Y
Y
N
Y
N
Y
N
Y
Y
Y
Y
N
N
Y
Y
N
Y
Y
Y
Y
N
Y
Y
Y
Y
0
1
4
4
0
0
1
5
1
5
0
4
0
2
0
2
3
4
2
3
0
6
3
3
5
4
Town
30-49
30-49
Stratford
Woodstock
139
95
3
10
1
7
0
6
8
3
3
3
2
0
2
1
3
6
1
2
0
1
50-59
50-59
Stratford
Woodstock
61
43
5
2
5
7
2
3
7
5
2
2
0
0
0
1
3
7
2
0
1
3
60+
60+
Stratford
Woodstock
31
28
5
5
1
2
3
5
5
0
3
0
2
2
3
3
1
0
2
1
5
6
0
4
TABLE 3. ATELECTASIS 24 HOURS AFTER ABDOMINAL SURGERY, BY RESPIRATORY THERAPY: OBSERVED COUNTS
AND LOGITS BY TREATMENT, WITH FITTED COUNTS UNDER INDEPENDENCE AND LOGIT DESIGN MATRIX*

                     Atelect Counts
               Observed           Fitted        Observed    Logit Model Matrix
Treatment   Absent  Present   Absent  Present    Logits      Mean   Treatment
CDB           13       6       11.2     7.8       .773         1      1    0
CPAP          13       9       12.9     9.1       .368         1      0    1
IS            11      11       12.9     9.1       .000         1     -1   -1

*From Stock, M.C. et al., Chest 87, 151-157; 1985.
TABLE 4. LOG RATIOS OF ALL COUNTS TO LAST COUNT FROM ATELECTASIS DATA: OBSERVED VALUES, SATURATED
FULL-RANK MODEL MATRIX FOR U-TERMS, AND FITTED VALUES UNDER INDEPENDENCE

                       Log Ratio To IS-Present       Saturated Log-Linear Model Matrix for U-terms
                                  Fitted Under                            Atelect*  Atelect*
Treatment  Atelect    Observed    Independence     Atelect   CDB   CPAP     CDB       CPAP
CDB        Absent       .167         .206             2       2     1        0        -1
CDB        Present     -.606        -.147             0       2     1       -2        -1
CPAP       Absent       .167         .353             2       1     2       -1         0
CPAP       Present     -.201         .000             0       1     2       -1        -2
IS         Absent       .000         .353             2       0     0       -2        -2
TABLE 5. NORTH CAROLINA MOTOR VEHICLE ACCIDENTS YIELDING SERIOUS DRIVER INJURY, BY SPEED,
TIME OF DAY, PLACE AND YEAR: OBSERVED COUNTS AND LOG-LINEAR MODEL FITTED COUNTS*

                                       1973                  1974
Speed (MPH)   Time    Place      Observed   Fitted     Observed   Fitted
≤ 55          Night   Urban         121       120         125       126
≤ 55          Night   Rural**       374       369         383       388
≤ 55          Day     Urban         232       236         197       193
≤ 55          Day     Rural         697       699         575       573
> 55          Night   Urban          27        22          13        18
> 55          Night   Rural         278       289         252       241
> 55          Day     Urban           5         5           4         4
> 55          Day     Rural         200       193         119       126

*Compiled by Highway Safety Research Center, University of North Carolina at Chapel Hill
**Includes both all rural locations and urban interstate highways.
TABLE 6. ANALYSIS OF INDIVIDUAL PARAMETERS (U-TERMS) FROM FINAL
LOG-LINEAR MODEL FOR ACCIDENT DATA

Effect               Estimate    S.E.    Chi-square    p-value
Speed                  0.933     .048      371.33      <.0001
Time                  -0.121     .048        6.30       .0121
Speed*Time             0.388     .048       64.28      <.0001
Place                  1.045     .048      467.45      <.0001
Speed*Place           -0.493     .048      103.81      <.0001
Time*Place             0.118     .048        5.99       .0144
Speed*Time*Place      -0.128     .048        6.97       .0083
Year                   0.096     .019       24.24      <.0001
Speed*Year            -0.058     .020        8.52       .0035
Time*Year              0.062     .017       12.92       .0003

TABLE 7. JOINT RESPONSES OF 46 SUBJECTS TO ADMINISTRATION OF DRUGS A, B AND C:
OBSERVED COUNTS AND FITTED COUNTS UNDER LOG-LINEAR MODEL WITH A AND B
SYMMETRIC, C INDEPENDENT OF A AND B (F = FAVORABLE, U = UNFAVORABLE)

     Response Pattern
Drug A   Drug B   Drug C     Observed Count    Fitted Count
  F        F        F              6               7.65
  F        F        U             16              14.35
  F        U        F              2               2.09
  F        U        U              4               3.91
  U        F        F              2               2.09
  U        F        U              4               3.91
  U        U        F              6               4.17
  U        U        U              6               7.83
TABLE 8.
NUMBER OF ARTIFACTS BY DISTANCE TO PERMANENT WATER
Distance to Permanent Water
2
3
4
5
Immediate Withi n a 1/4 to 1/2 1/2 to 1 1 to 3
mi 1es'
vicinity 1/4 mile
Artifact
mil e
mile
102
29
Specialized unifaces
20
54
38
136
56
Unifaces 2 or more edges retouched
33
86
58
Unifaces 1 edge retouched
27
122
51
53
68
4
Limited bifacial retouched
2
10
8
5
Large, heavy tools
82
35
30
11
34
10
53
17
17
Whole bi face
25
Round biface, base snapped, side notched
39
185
88
100
58
179
78
60
Pointed biface, base snapped,side notched
34
70
j -
1
6
Over
3 mil es
1
2
3
4
5
6
7
8
9 Rectangular biface, base snapped,
26
24
8
15
side notched
10
11
12
13
14
15
16
17
18
Biface midsection
Humboldt Pinto Northern (projectile points)
Elko Gypsum (projectile points)
Eastgate and Rose Spring
(projectile points)
11
Cottonwood Desert, side notched
12
Drill s
2
Pots
3
Grinding stones
13
Poi nt fragments
20
3
7
0
0
2
3
13
11
78
88
44
75
24
32
16
30
26
41
28
35
14
26
39
27
6
3
3
8
32
28
10
8
5
36
5
5
4
4
3
19
11
18
2
6
9
20
21
7
6
8
7
28
2
0
0
0
0
1
aThese counts exclude those in the Immediate vicinity column.
TABLE 9. TABULATION OF ANALGESIA RESPONSES OF PATIENTS WITH PAIN FROM CHRONIC JOINT
DISEASE TO ONE OF TWO DRUG PREPARATIONS, BY SITE OF PAIN AND TREATMENT CENTER
Pain
Clinical
Site
Center
Drug
2
2
II
II
Gooa
2
2
Total
filed,um
Poor
11
Total
Active
Pl acebo
5
8
20
14
Active
Pl acebo
0
0
12
10
12
11
24
21
Active
12
5
14
13
3
6
29
24
4
3
37
9
9
101
3
6
55
16
18
193
Pl acebo
II
II
Res~onse
Active
Pl acebo
28
33
3
TABLE 10. ESTIMATED PARAMETERS, STANDARD ERRORS AND WALD CHI-SQUARE TESTS FOR EQUAL ADJACENT
ODDS AND PROPORTIONAL ODDS MODELS FOR THE CHRONIC JOINT PAIN CLINICAL TRIAL DATA

              ML Fit of Equal-Adjacent Odds               WLS Fit of Proportional Odds
                                Chi-square                                  Chi-square
Parameter   Estimate    S.E.    (1 D.F.)      p         Estimate    S.E.    (1 D.F.)      p
Drug         -.446      .229      3.79      .0517        -.504      .285      3.13      .0768
Center       -.899      .243     13.63      .0002        -.918      .296      9.59      .0020
Site          .675      .234      8.34      .0039         .725      .287      6.37      .0116
TABLE 11.
ARTIFICIAL DATA SHOWING REGRESSED BREAST TUMORS VS. TOTAL PALPATED BREAST TUMORS
IN TWO STRAINS OF RATS TREATED WITH DIMETHYLBENZANTHRACENE, FOR CONTROLS
AND RATS FED A NEW ANTI-TUMOR AGENT, AMONG RATS WITH PALPABLE TUMORS
Groue
Strain
0
1
2
3
1
2
3
1
2
3
1
2
3
4
3
4
6
13
10
8
4
3
9
8
10
A
A
A
A
A
A
Experimental
Experimental
Experimental
Control
Control
Control
Experimenta 1
B
B
B
Experimental
Experimenta 1
Control
Contro 1
Control
TABLE 12.
Total Palpated
Tumors (Tumtot)
B
B
B
Regressed Tumors (TiJmregl
2
1
5
6
4
2
3
1
4
4
2
0
3
1
3
6
3
2
2
2
5
4
4
3
2
3
TABLE 12. ESTIMATED PARAMETERS AND WALD TESTS OF SIGNIFICANCE FOR ARTIFICIAL RAT TUMOR REGRESSION DATA

                             Saturated Model                   Main Effects Model
                                   Chi-square                        Chi-square
Effect                 Estimate    (1 D.F.)      p       Estimate    (1 D.F.)      p
Strain                  -.1012       0.42       .52       -.0958       0.40       .53
Treatment               -.5570      12.82       .0003     -.5576      12.86       .0003
Strain by Treatment     -.0222       0.02       .89
Lack-of-fit                                                             .02       .89
TABLE 13. RESULTS OF AVERAGED MAIN EFFECTS MODEL FOR ROOT CARIES PREVALENCE DATA

Analysis of Variance Table
Source        DF    Chi-Square    Prob
Intercept      1      358.57      .0001
Town           1       18.45      .0001
Agecat         2       37.35      .0001
Residual      20       27.56      .1202

Effect       Parameter    Estimate    S.E.    Chi-Square    Prob
Intercept        1         -1.446     .076      358.57      .0001
Town             2          -.328     .076       18.45      .0001
Agecat           3          -.502     .103       23.99      .0001
                 4          -.062     .107        0.34      .5607

Predicted Marginal Quadrant Caries Prevalence
Town          Agecat    Prevalence
Stratford     30-49       .0931
Stratford     50-59       .1375
Stratford     60 +        .2298
Woodstock     30-49       .1248
Woodstock     50-59       .1813
Woodstock     60 +        .2920