A USER PROGRAM FOR MULTIPLE LINEAR REGRESSION ANALYSIS
======================================================
Marten van Gelderen
Version 5H(246)
16-Feb-80
Copyright (C) 1975, 1979, 1980 by
Foundation Mathematical Centre, Amsterdam
Institute for Nuclear Physics Research, Amsterdam
General permission to make fair use in teaching or research of all
or part of this material is granted to individual readers and to
nonprofit organizations, provided that the copyright notice of
either the Foundation Mathematical Centre or the Institute for
Nuclear Physics Research is given and that reference is made to
this publication and to the fact that reprinting privileges were
granted by permission of one of the above mentioned organizations.
CONTENTS

Abstract
Introduction
Chapter 1 - Multiple linear regression analysis
    1.  The regression model
    2.  Least squares
        2.1  Weighted least squares
        2.2  Residual analysis
    3.  Tests of hypotheses
        3.1  A particular regression coefficient
        3.2  Analysis of variance
Chapter 2 - The input to the program
    1.  The model specification
        1.1  Transformations
    2.  The input specification
    3.  The option specification
    4.  The data specification
    5.  The user program
    6.  Examples
Chapter 3 - The output from the program
    1.  Standard and optional printed output
    2.  Standard and optional data output
    3.  Error messages
    4.  Examples
Appendices
    1.  Technical remarks
    2.  Definition of the syntax of a user program
    3.  Technical description of the program
References
ABSTRACT
Performing multiple linear regression analysis on an electronic computer
can be very laborious when preparation of the input is complicated. The
program introduced in this document is especially designed to have
flexible and comprehensible model and input specifications. It accepts a
"Model" formula which resembles the notation of regression models in
common statistical literature quite closely. An accompanying "Input"
formula provides the program with information about the arrangement of
the observations in the input "Data", which consists of a series of
numbers in freefield format. A "Run" command activates the program, while
an "Exit" command causes the program to stop. Extensive runtime "Help"
information is available. The following piece of program may serve as an
example of some of these ideas:
"Model" y = alfa0 + alfa1 * x + alfa2 * x ^ 2;
"Input" 5 * ([x], n, n * [y]);
"Options" Transformed data matrix, Process submodels (1);
"Data"
 1   4    1.1   0.7   1.8   0.4
 3   5    3.0   1.4   4.9   4.4   4.5
 5   3    7.3   8.2   6.2
10   4   12.0  13.1  12.6  13.2
15   4   18.7  19.7  17.4  17.1
"Run"
"Exit"
INTRODUCTION
Many computer programs or subroutines exist, which in one way or another
calculate estimates for the parameters of a multiple linear regression
model, but most of them can hardly be used by the layman. Subroutines
have to be embedded in a higher level language program to perform the
necessary input and output; as a layman however, one is not always
acquainted with these languages. The programs usually require the input
data to be presented in a specific standard form, which in practice means
that input data that are already available in a machine readable form
must be transformed into that standard form or that the input data should
be punched exactly in that standard form. More difficult still and often
more confusing, is the way in which the program is told which regression
model the user wants to consider. For instance, for the program for
multiple polynomial regression analysis of the Mathematical Centre a code
matrix of zeros and ones had to be given in order to indicate the form of
the regression polynomial. Transformations of one or more variables are
hardly ever automatically possible and require a separate program to be
run in advance to prepare the transformed data matrix.
In this document a program is described that will make it possible for
almost everyone to obtain his desired results with no more than a
superficial knowledge of the underlying programming system. It has been
recognized recently that standard programs should not bother the user too
much with awkward input specifications. A statistician is more interested
in the results of a program and in how he can obtain them without
interference of software specialists of the computer centre, than in a
few seconds gain in actual computing time. Therefore one of the
objectives was to develop a program which would enable the user to
specify the form of the regression model and the arrangement of the
numbers in his input data, in a straightforward manner, even if he has
hardly any knowledge of programming languages at all. The program thus
accepts a model formula which resembles the notation of regression models
in common statistical literature quite closely; the accompanying input
formula indicates which numbers or series of numbers in the input data
belong to which variable in the model formula.
This input scheme gives the user the opportunity to work with existing
data and to process possible transformations of the variables in the
model formula without the need for external data adjustment. However, the
model formula must be given explicitly and completely and the structure
of the input data must be known exactly before an input formula can be
constructed. In the output the results are identified with the names
given by the user in the model formula. Any technical (machine dependent)
action required to run the program is described in appendix 1.
ACKNOWLEDGEMENTS
The basic ideas for the input arrangement presented in this document
originated from A.P.B.M. VEHMEYER. The article 'ALGOL 60 translation for
everybody' by F.E.J. KRUSEMAN ARETZ [6] provided the foundation for the
translator and the execution section of the program, while the techniques
used in the storage section were designed by L.J. OOSTRIJK. The matrix
and least squares routines used in the program were copied from 'ALGOL 60
procedures in numerical algebra, part 1', by T.J. DEKKER [2]. The classic
book 'Applied Regression Analysis' by N.R. DRAPER & H. SMITH [3] served
as a frame of reference throughout the design and implementation of the
program. Numerous recently incorporated statistical improvements were
suggested by R.D. GILL.
CHAPTER 1
MULTIPLE LINEAR REGRESSION ANALYSIS
1.1  THE REGRESSION MODEL

In a regression problem the researcher postulates a certain relationship
between a random variable y (the realizations of which are subject to
some form of disturbance) on the one side and a number of variables
x1,...,xp (which are without or at least almost without disturbances) on
the other side. This relationship is expressed by a mathematical formula,
which is called the (linear) regression model, for instance:

    y = a0 + a1 * x1 + ... + ap * xp + e                             (1)

in which a0,...,ap represent unknown regression coefficients (parameters)
which are to be estimated and e represents the disturbance. If a constant
term is present in the model formula (in (1) the a0), the model is said
to be an 'intercept model'; if no constant term is present, the model is
called a 'no-intercept model'.
The variables x1,...,xp and the variable y can also represent (other)
transformed variables. The researcher might have reasons to believe (from
background information concerning the experiment) that transformations
are necessary, for instance:
1) to obtain normally distributed disturbances,
2) to obtain a greater homogeneity of the variances of the disturbances,
3) to linearize non-linear regression models (if possible).
The transformed regression model can be written as:

    G(y) = a0 + a1 * F1(x1,...,xm) + ... + ap * Fp(x1,...,xm) + e    (2)

in which G, F1,...,Fp  represent the transformations,
         a0,...,ap     represent the parameters to be estimated,
         y             represents the dependent variable,
         x1,...,xm     represent the independent variables,
         e             represents the disturbance.
The choice of a transformation by means of 'trial and error' is rather
time consuming and costly. The difficulty lies mainly in the importance
of the location parameter. It is not unusual that Log (x) yields no
improvement, but that Log (c+x) gives better results for a particular
choice of c. Because this holds for almost any transformation of some
importance, we must actually solve in each case a nonlinear adjustment
problem. Often though, a simple form of the transformation is suggested
by the researcher who is better acquainted with the peculiarities of the
experiment.
1.2  LEAST SQUARES

Regression analysis consists in fact of the adjustment of a hyperplane of
the required dimension to the data. The fitting is done with the method
of least squares, which means that the sum of the squares of the
differences between the observed values for y and the estimated values
for the expectation of y is minimized. This sum of squares is also called
the residual sum of squares. In matrix notation the regression model can
be written as (cf. DRAPER & SMITH [3] pp. 58-62):

    Y = Xa + e                                                       (3)

in which Y is a (n*1) random vector of observations,
         X is a (n*p) matrix of known (fixed) values,
         a is a (p*1) vector of (unknown) parameters,
     and e is a (n*1) random vector of disturbances.

It is supposed that E(e) = 0 and var(e) = Isigma^2, in which I is the
unit matrix, thus:

    E(Y) = Xa                                                        (4)
The sum of squares of the differences between the observed values of Y
and the estimated values for the expectation of Y thus equals:

    (Y-Xa)'(Y-Xa) = Y'Y - 2a'X'Y + a'X'Xa                            (5)

(for a'X'Y is a scalar and therefore equal to Y'Xa).
Choosing as least squares estimator b that value of a which minimizes
(5), involves differentiating with respect to the elements of a and
equating the result to zero:

    -2X'Y + 2X'Xb = 0,    thus:    X'Y = X'Xb                        (6)

This system is called the normal equations. If the rank of X equals p,
X'X is nonsingular and the inverse of X'X exists. In that case the
solution of the normal equations can be written as:

    b = inv(X'X)X'Y                                                  (7)

Observe that p <= n must hold, in order that the rank of X can be p at
all. Therefore at least as many observations must be made, as there are
parameters in the model. Also observe that E(b) = inv(X'X)X'E(Y) = a,
thus b is an unbiased estimator of a.
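The normal equations (6) can be illustrated with a short sketch in Python.
This is not the routine used by the program (which relies on the ALGOL 60
procedures of DEKKER [2]); the function names and the small data set are
assumptions chosen purely for illustration, and the system X'Xb = X'Y is
solved here by plain Gaussian elimination.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def solve(a, rhs):
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: the rank of X is less than p")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def least_squares(x_rows, y):
    """Return b satisfying the normal equations X'Xb = X'Y."""
    xt = transpose(x_rows)
    xtx = matmul(xt, x_rows)
    xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in xt]
    return solve(xtx, xty)


# Intercept model y = a0 + a1 * x, fitted to points lying on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
X = [[1.0, x] for x in xs]          # first column of ones: constant term
Y = [1.0 + 2.0 * x for x in xs]
b = least_squares(X, Y)             # close to [1.0, 2.0]
```

Forming X'X explicitly is convenient for a small example, but it is
numerically less robust than an orthogonal decomposition of X; the sketch
is meant only to mirror formulas (6) and (7).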
The least squares estimator has the following properties:
1. It is an estimator which minimizes the sum of squares of deviations,
   irrespective of any distribution properties of the disturbances. The
   assumption that the disturbances are normally distributed is, of
   course, necessary for tests which depend on this assumption, such as
   t- or F-tests, or for obtaining confidence intervals based on t- or
   F-distributions.
2. According to the Gauss-Markov theorem, the elements of b are unbiased
   estimators, which have minimum variance (of any linear function of the
   Y's which provides unbiased estimators), again irrespective of the
   distribution properties of the disturbances.
3. If the disturbances are mutually independent and normally distributed
   (with E(e) = 0 and var(e) = Isigma^2), then b is also the maximum
   likelihood estimator.

The variance-covariance matrix of b is:

    var(b) = inv(X'X)sigma^2                                         (8)

The variances are the diagonal and the covariances the off-diagonal
elements.
An unbiased estimator for sigma^2 is given by:

    s^2 = (Y'Y - b'X'Y) / (n-p)                                      (9)

The square root of this estimator is frequently called 'standard error of
estimate'. In the printed output of the program it is indicated more
properly as 'standard deviation of the error term'.

Let vij be the element in the i-th row and j-th column of inv(X'X), then
sdi = s * Sqrt(vii) estimates the standard deviation of bi, and
cij = vij / Sqrt(vii * vjj) gives the correlation coefficient between bi
and bj for i = 1,...,p and j = 1,...,p. Thus:

    vii = (sdi / s)^2                                                (10)

and
    vij = cij * Sqrt(vii * vjj) = cij * (sdi * sdj) / s^2            (11)
A frequently used statistical measure for evaluating regression models is
the multiple correlation coefficient R which is defined in the intercept
model as the square root of the proportion of the corrected total sum of
squares accounted for by the model. If the correction for means is
denoted by nu^2, with u = Sum(i,1,n,yi)/n, then R can be defined by:

    R^2 = (b'X'Y-nu^2)/(Y'Y-nu^2) = 1 - (Y'Y-b'X'Y)/(Y'Y-nu^2)       (12)

However, we must divide Y'Y-b'X'Y by n-p, not by n, to obtain an unbiased
estimator of sigma^2; moreover it is customary to divide Y'Y-nu^2 by n-1,
not by n. If we adopt both modifications we obtain the adjusted multiple
correlation coefficient, which can thus be defined by:

    adj(R)^2 = 1 - (n-1)/(n-p) * (Y'Y-b'X'Y)/(Y'Y-nu^2)              (13)

In the no-intercept model the correction for means is ignored, giving as
definition of R^2: b'X'Y/Y'Y = 1 - (Y'Y-b'X'Y)/Y'Y, while the adj(R)^2 is
defined correspondingly as: 1 - n/(n-p) * (Y'Y-b'X'Y)/Y'Y. R^2 itself is
often called the 'proportion of variation explained' (cf. THEIL [11]
pp. 178-179).
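As an illustration of (12) and (13), the following Python sketch computes
R^2 and adj(R)^2 from the sums of squares for both the intercept and the
no-intercept model. The function name, its arguments and the numerical
values are hypothetical and are not part of the program.

```python
def r_squared(yty, bxty, n, p, mean_correction=None):
    """Return (R^2, adj(R)^2) as defined in (12) and (13).

    yty  : total uncorrected sum of squares Y'Y
    bxty : sum of squares due to the fit, b'X'Y
    n, p : number of observations and of parameters
    mean_correction : n*u^2 for an intercept model, or None for a
                      no-intercept model (correction for means ignored)
    """
    sse = yty - bxty                       # residual sum of squares
    if mean_correction is None:
        total, df_total = yty, n
    else:
        total, df_total = yty - mean_correction, n - 1
    r2 = 1.0 - sse / total
    adj_r2 = 1.0 - (df_total / (n - p)) * sse / total
    return r2, adj_r2


# Hypothetical sums of squares for n = 10 observations and p = 2 parameters.
r2, adj_r2 = r_squared(yty=110.0, bxty=108.0, n=10, p=2, mean_correction=90.0)
r2_no, adj_r2_no = r_squared(yty=110.0, bxty=108.0, n=10, p=2)
```

With these made-up numbers the intercept model gives R^2 = 0.9 and
adj(R)^2 = 0.8875; the adjusted value is always the smaller of the two
when p > 1.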
1.2.1  WEIGHTED LEAST SQUARES

It sometimes happens that some of the observations for the dependent
variable are 'less reliable' than others. This usually means that the
variances of the observations are not all equal; in other words the
matrix V = var(e) is not of the form Isigma^2, but is diagonal with
unequal diagonal elements. The basic idea for solving this problem is to
transform Y to other variables, which do appear to satisfy the usual
tentative model assumptions, and then apply the usual (unweighted)
analysis to the variables so obtained. The estimates can then be
re-expressed in terms of the original variables Y (cf. DRAPER & SMITH [3]
pp. 77-81).
Let the original regression model be: Y = Xa + e, with E(e) = 0 and
var(e) = Vsigma^2, with V diagonal with unequal diagonal elements, and
let P = inv(V). Premultiplying the original regression model with
Q = Sqrt(P) gives as transformed regression model:

    QY = QXa + Qe                                                    (14)

with E(Qe) = 0 and var(Qe) = Isigma^2.
The normal equations then become:

    (QX)'QY = (QX)'QXa                                               (15)

giving as solution if the indicated inverse matrix exists:

    b = inv((QX)'QX)(QX)'QY = inv(X'PX)X'PY                          (16)

with variance-covariance matrix:

    var(b) = inv(X'PX)sigma^2                                        (17)
In practical situations it is often difficult to obtain specific
information on the form of V at first. For this reason it is sometimes
necessary to make the (known to be erroneous) assumption V = I and then
attempt to discover something about the form of V by examining the
residuals from the regression analysis.
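The equivalence stated in (16) can be checked numerically with a minimal
Python sketch. To keep it short it assumes the one-parameter no-intercept
model y = a * x, so that X'PX and X'PY are scalars; the data and the
variances in v are hypothetical.

```python
import math


def wls_direct(x, y, v):
    """b = inv(X'PX) X'PY for the model y = a * x, with P = inv(V)."""
    p = [1.0 / vi for vi in v]
    xpy = sum(pi * xi * yi for pi, xi, yi in zip(p, x, y))
    xpx = sum(pi * xi * xi for pi, xi in zip(p, x))
    return xpy / xpx


def wls_transformed(x, y, v):
    """Same estimate via (14): premultiply by Q = Sqrt(P), then apply the
    usual unweighted least squares formula to QX and QY."""
    q = [1.0 / math.sqrt(vi) for vi in v]
    qx = [qi * xi for qi, xi in zip(q, x)]
    qy = [qi * yi for qi, yi in zip(q, y)]
    return sum(a * b for a, b in zip(qx, qy)) / sum(a * a for a in qx)


x = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.6]
v = [1.0, 1.0, 4.0]   # assumed: the third observation has 4 times the variance

b_direct = wls_direct(x, y, v)
b_transformed = wls_transformed(x, y, v)   # equal up to rounding
```

Both routes give the same estimate, which is the point of (16): weighting
by P is the same as an unweighted analysis of the transformed variables.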
1.2.2  RESIDUAL ANALYSIS

The vector of residuals D is defined as the difference between the vector
of observations Y and the vector of fitted values Z, obtained by using
the regression equation Z = Xb. So D = Y - Z or di = yi - zi for
i = 1,...,n. If the model is correct, the residual mean square MSE = s^2
estimates sigma^2, and the estimated standard deviation of the fitted
value zi at xi = (xi1,...,xip)' is:

    sd(zi) = s * Sqrt(xi'inv(X'X)xi)                                 (18)

which can be used to construct a confidence interval for the expected
value of yi: E(yi) at xi = (xi1,...,xip)', or to construct a prediction
interval for the mean of h new observations at this point (cf. DRAPER &
SMITH [3] pp. 121-122). In the first case the confidence interval is:

    zi +- t(n-p-1,1-alpha/2) * s * Sqrt(xi'inv(X'X)xi)               (19)

and in the second case the prediction interval is:

    zi +- t(n-p-1,1-alpha/2) * s * Sqrt(1/h + xi'inv(X'X)xi)         (20)
Researchers often divide the residuals di by s, resulting in the
standardized residuals, which can be examined to see if they make it
appear that the assumption ei/sigma ~ N(0,1) is violated (cf. DRAPER &
SMITH [3] pp. 86-97). It might be expected that roughly 95% of the di/s
were between the limits (-2,2).

However, the variances of the residuals are not constant but a function
of the X matrix (see (18)), which suggests as standardization:

    ti = di / s / Sqrt(1 - xi'inv(X'X)xi)                            (21)

giving the studentized residual. The maximum studentized residual can be
used in a test for detecting outliers, as follows: let t^2 = max(ti^2),
then min(1, n * (1-Fisher(1, n-p-1, t^2*(n-p-1)/(n-p-t^2)))) is an 'upper
bound for the right tail probability of the largest absolute studentized
residual' (cf. LUND [9] pp. 473-474).
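The studentized residuals of (21) can be sketched in Python for the
simplest intercept model y = a0 + a1 * x. For this model the quadratic
form xi'inv(X'X)xi reduces to 1/n + (xi - xbar)^2 / Sxx, which avoids a
matrix inversion; the function name and the data are assumptions made for
the sake of the example.

```python
import math


def studentized_residuals(x, y):
    """Return residuals di, the values xi'inv(X'X)xi and the studentized
    residuals ti of (21) for the model y = a0 + a1 * x."""
    n, p = len(x), 2
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    d = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]     # di = yi - zi
    s = math.sqrt(sum(di * di for di in d) / (n - p))     # root of (9)
    # Quadratic form xi'inv(X'X)xi for a straight-line model.
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    t = [di / s / math.sqrt(1.0 - hi) for di, hi in zip(d, h)]
    return d, h, t


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
d, h, t = studentized_residuals(x, y)
```

The values in h sum to p (here 2), so observations far from the mean of x
have residuals with a smaller variance and are scaled up more strongly
than by the plain division di/s.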
1.3  TESTS OF HYPOTHESES

1.3.1  A PARTICULAR REGRESSION COEFFICIENT

If the disturbances are mutually independent and normally distributed
(with E(e) = 0 and var(e) = Isigma^2) and with a (preset) level of
significance alpha, a significance test for a particular regression
coefficient can be performed, or more specifically: the null hypothesis
is:

    H0: ai = 0 (given that all other aj are in the model),

which is tested against the alternative hypothesis:

    H1: ai is not equal to zero,

by treating FRi = bi^2/var(bi) as a realization of a Fisher(1,n-p)
variate.

However, this test must be used with caution, because with the (preset)
level of significance alpha, only one coefficient can be tested properly,
while the computer output lists statistics for all coefficients. It seems
very tempting to test the coefficients serially one at a time, but one
must keep in mind that in doing so the level of significance of the whole
test rises above the nominal value (cf. DRAPER & SMITH [3] p. 65).
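How quickly the overall level rises can be illustrated with a small
Python sketch. It rests on an assumption that does not hold exactly in
regression (the estimates bi are in general correlated): if k tests were
independent, each at level alpha, the probability of at least one false
rejection would be 1 - (1 - alpha)^k. The function names are
hypothetical.

```python
def overall_level(alpha, k):
    """Probability of at least one false rejection among k tests at level
    alpha, under the simplifying assumption of independent tests."""
    return 1.0 - (1.0 - alpha) ** k


def per_test_level(alpha, k):
    """Level per test that keeps the overall level at or below alpha,
    whatever the dependence between the tests (Bonferroni bound)."""
    return alpha / k


level_5 = overall_level(0.05, 5)     # about 0.226 instead of 0.05
safe_5 = per_test_level(0.05, 5)     # 0.01 per coefficient
```

Testing five coefficients serially at the 5% level thus corresponds,
under this assumption, to an overall level of roughly 23%.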
1.3.2  ANALYSIS OF VARIANCE

In the analysis of variance table the different contributions to the
total uncorrected sum of squares Y'Y (which is the first part in the
table) are given (cf. DRAPER & SMITH [3] pp. 57 & 62).

The second part of the table assumes the presence of an (unknown)
constant term in the model; if this term is absent, the 'mean'-line
disappears and in the 'regression'-line p-1 changes into p and
b'X'Y-nu^2 changes into b'X'Y.

The third part of the table is only present when repeated observations
for the dependent variable are available, in which case:

    k is the number of groups of replications,
    mi is the number of replications in group i, and
    W = (w1,...,wk)', with wi = Sum(j,1,mi,yij) / Sqrt(mi), for i = 1,...,k.

The fourth part of the table is only present if a reduction is requested
and possible. SSQ then stands for the residual sum of squares from a
regression analysis with the first p-q out of the p independent variables
(1 <= q <= p-1), while SSE stands for the residual sum of squares from a
regression analysis with the original p independent variables.
                          Analysis of variance

source of                                                     right tail
variation     df    sum of squares   mean square   F-ratio    probability
-------------------------------------------------------------------------
total         n     Y'Y
-------------------------------------------------------------------------
mean          1     nu^2             MSM = nu^2    FRM        P(FM>=FRM)
regression    p-1   b'X'Y - nu^2     MSR           FRR        P(FR>=FRR)
residual      n-p   Y'Y - b'X'Y      MSE = s^2
-------------------------------------------------------------------------
lack of fit   k-p   W'W - b'X'Y      MSL           FRL        P(FL>=FRL)
pure error    n-k   Y'Y - W'W        MSP
-------------------------------------------------------------------------
reduction     q     SSQ - SSE        MSQ           FRQ        P(FQ>=FRQ)
-------------------------------------------------------------------------

The column 'mean square' is obtained by division of the sums of squares
by their corresponding degrees of freedom. The column 'F-ratio' is
obtained by division of the mean squares by the residual mean square,
except for the lack of fit F-ratio, which is obtained by division of the
lack of fit mean square by the pure error mean square, thus:

    MSM = nu^2/1, MSR = (b'X'Y-nu^2)/(p-1), MSE = s^2 = (Y'Y-b'X'Y)/(n-p),
    MSL = (W'W-b'X'Y)/(k-p), MSP = (Y'Y-W'W)/(n-k), MSQ = (SSQ-SSE)/q,
    FRM = MSM/MSE, FRR = MSR/MSE, FRL = MSL/MSP and FRQ = MSQ/MSE.
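The arithmetic of the table can be sketched in a few lines of Python. The
function below follows the formulas above for an intercept model; its
name, the dictionary layout and the numerical values are assumptions made
for illustration, and the right tail probabilities are left out because
they require the Fisher distribution function.

```python
def anova_table(n, p, yty, bxty, mean_corr, wtw=None, k=None):
    """Mean squares and F-ratios for an intercept model.

    yty = Y'Y, bxty = b'X'Y, mean_corr = n*u^2; wtw = W'W and k (number
    of groups of replications) are only needed for the lack of fit part.
    """
    table = {}
    table["MSM"] = mean_corr / 1
    table["MSR"] = (bxty - mean_corr) / (p - 1)
    table["MSE"] = (yty - bxty) / (n - p)
    table["FRM"] = table["MSM"] / table["MSE"]
    table["FRR"] = table["MSR"] / table["MSE"]
    if wtw is not None and k is not None:
        table["MSL"] = (wtw - bxty) / (k - p)
        table["MSP"] = (yty - wtw) / (n - k)
        # The lack of fit F-ratio uses the pure error mean square.
        table["FRL"] = table["MSL"] / table["MSP"]
    return table


# Hypothetical sums of squares: n = 10 observations in k = 5 groups, p = 2.
table = anova_table(n=10, p=2, yty=110.0, bxty=108.0, mean_corr=90.0,
                    wtw=109.0, k=5)
```

With these made-up numbers MSE = 0.25, FRM = 360, FRR = 72 and
FRL = (1/3)/(1/5) = 5/3.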
If the disturbances are mutually independent and normally distributed
(with E(e) = 0 and var(e) = Isigma^2) and with a (preset) level of
significance alpha, significance tests can be performed for:

1. The mean of the observations for the dependent variable, or more
   specifically: the null hypothesis is:

       H0: E(u) = 0,

   which is tested against the alternative hypothesis:

       H1: E(u) is not equal to zero,

   by treating FRM as a realization of a Fisher(1,n-p) variate.

2. The regression equation, or more specifically: the null hypothesis
   is:

       H0: a1 = ... = ap = 0, except for the ai that denotes
           the constant term (if present),

   which is tested against the alternative hypothesis:

       H1: at least one of a1,...,ap is not equal to zero,

   by treating FRR as a realization of a Fisher(p-1,n-p) variate.

3. The adequacy (linearity) of the model, or more specifically: the null
   hypothesis is:

       H0: the linear model is adequate (that is, no model
           significantly improves the prediction of Y over the
           linear model),

   which is tested against the alternative hypothesis:

       H1: the linear model is not adequate,

   by treating FRL as a realization of a Fisher(k-p,n-k) variate.
4. A subset of regression coefficients, or more specifically: suppose
   without loss of generality that the subset consists of the last q
   coefficients, then the null hypothesis is:

       H0: ar = ... = ap = 0, with r = p-q+1,

   which is tested against the alternative hypothesis:

       H1: at least one of ar,...,ap is not equal to zero,

   by treating FRQ as a realization of a Fisher(q,n-p) variate.

5. A linear combination of the regression coefficients, or more
   specifically: the null hypothesis is:

       H0: c'a = m, in which c is a vector of constants with order q+1,

   which is tested against the alternative hypothesis:

       H1: c'a is not equal to m,

   by substituting c'a = m in the original model, shifting the known
   terms to the left hand part, combining the corresponding terms in the
   right hand part, and testing the thus derived so called 'reduced
   model' by treating FRQ as a realization of a Fisher(q,n-p) variate.

In each case the right tail probability P(F >= FR) can be found in the
last column of the analysis of variance table (cf. DRAPER & SMITH [3]
pp. 63-64, 68 & 72-75).
CHAPTER 2
THE INPUT TO THE PROGRAM
2.0 The purpose of the input system is to give the user a simple and
adequate formalism to tell the program what he wants. In order to specify
the regression model and the correspondence between the variables in the
model and the numbers in the input data, the user must provide a so
called user program, which consists of one or more jobs which in their
turn are made up out of one or more specifications. Each specification
starts with a keyword and terminates at the next keyword. The keywords
are: "Model", "Input", "Options", "Data", "Run", "Exit" and "Help"
(quotes included).
2.1  THE MODEL SPECIFICATION

To let the program know between which variables the statistician expects
a certain kind of relationship, he must provide a model specification,
which consists of the keyword "Model" followed by a formula (the model
statement), which resembles the notation of regression models in common
statistical literature quite closely. For instance:

    "Model" y = alpha0 + alpha1 * x1 + alpha2 * x2;

A model formula consists of an identifier to denote the dependent
variable (the left hand part), followed by an '=' (equal), followed by
the sum of a number of terms (the right hand part), while it is
terminated with a ';' (semicolon). Each term must be the product of an
identifier to denote the parameter (which is to be estimated) and an
identifier to denote the independent variable. An exception is made for
the optional constant term, which is given as a single identifier
denoting that constant term, and which may be placed anywhere in the
model.
Each identifier must start with a letter and is allowed to contain any
number of letters, digits and blanks. As most peripheral equipment of a
computer is unable to process sub- or superscripts or Greek letters, we
write alpha0, alpha1 and alpha2. Identifiers have no inherent meaning,
but serve for the identification of variables, parameters and functions.
They may be chosen freely (except for the twenty-one standard function
names and the ten option names, cf. section 2.1.1 and section 2.3). It is
advised not to use the same identifier to denote two (or more) different
quantities; for regression parameters, however, it will not lead to fatal
errors, whereas for the dependent and independent variables
distinguishable identifiers must be used indeed. Correct model formulae
are for instance:

    "Model" y variable = constant + parameter * x variable;

and

    "Model" depvar = const + beta1 * xvar1 + beta2 * xvar2;
2.1.1  TRANSFORMATIONS

Almost all transformations a user would like to perform on his input data
fit quite naturally in the model formula: each transformation is
expressed as a formula itself. If, for instance, the user wants to
include in the model formula as an independent variable the natural
logarithm of the sum of two other variables, called xvar1 and xvar2, he
writes:

    Ln (xvar1 + xvar2)

In model formulae the operators '+' (plus), '-' (minus), '*' (asterisk),
and '/' (slash) are allowed, all with their conventional meaning of
addition, subtraction, multiplication and division respectively. Of
course the normal operator precedence rules are obeyed. Special operators
are: ':' (colon), integer division and '^' (uparrow), exponentiation.

The operation term : factor is defined only for operands both of type
integer and will yield a result of type integer, with the same sign as
would be obtained by normal division, while the magnitude is found by
dividing the two quantities and taking the whole part; mathematically it
can be defined as: a : b = Sign (a / b) * Entier (Abs (a / b)),
for instance: 5 : 2 = 2 and -7 : 2 = -3.

The operation factor ^ primary denotes exponentiation, where the factor
is the base and the primary is the exponent,
for instance: 5 ^ 2 = 25 and 2 ^ 3 ^ 2 = 64 but 2 ^ (3 ^ 2) = 512.
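Readers used to other languages may find the two special operators
surprising, so a small Python sketch of their behaviour follows. The
helper names int_div and power_chain are hypothetical; they only mimic
the definitions given above.

```python
def int_div(a, b):
    """a : b = Sign (a / b) * Entier (Abs (a / b)) for integer operands,
    that is, division with the result truncated toward zero."""
    q = abs(a) // abs(b)                 # whole part of the magnitude
    return q if (a < 0) == (b < 0) else -q


def power_chain(*operands):
    """Evaluate x1 ^ x2 ^ ... ^ xk from left to right, as in the model
    formula, so that 2 ^ 3 ^ 2 means (2 ^ 3) ^ 2."""
    result = operands[0]
    for exponent in operands[1:]:
        result = result ** exponent
    return result


examples = (int_div(5, 2), int_div(-7, 2), power_chain(2, 3, 2))
```

Note that Python's own operators behave differently: -7 // 2 rounds
downward and gives -4, and 2 ** 3 ** 2 groups from the right and gives
512. Parentheses remove any doubt in both notations.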
Also the following twenty-one standard functions are allowed:

    Abs (E), Sign (E), Sqrt (E), Sin (E), Cos (E), Tan (E), Ln (E),
    Log (E), Exp (E), Entier (E), Round (E), Mod (E1, E2), Min (E1, E2),
    Max (E1, E2), Arcsin (E), Arccos (E), Arctan (E), Sinh (E), Cosh (E),
    Tanh (E) and Indicator (E1, E2, E3)

in which E, E1, E2 and E3 are expressions in terms of variables,
operators and standard functions. Round (E) is defined as:
Entier (E + 0.5) and Indicator (E1, E2, E3) is defined as:
IF E1 <= E2 <= E3 THEN 1 ELSE 0.
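The two functions defined explicitly above can be mimicked in Python as
follows. The sketch assumes that Entier has its usual ALGOL 60 meaning
(the largest integer not greater than its argument); the Python names are
hypothetical.

```python
import math


def entier(e):
    """Largest integer not greater than e (assumed ALGOL 60 meaning)."""
    return math.floor(e)


def round_half_up(e):
    """Round (E), defined as Entier (E + 0.5)."""
    return entier(e + 0.5)


def indicator(e1, e2, e3):
    """Indicator (E1, E2, E3): 1 if E1 <= E2 <= E3, otherwise 0."""
    return 1 if e1 <= e2 <= e3 else 0


rounded = [round_half_up(v) for v in (2.4, 2.5, -2.5)]
flags = [indicator(0, 5, 10), indicator(0, 11, 10)]
```

Halves are therefore always rounded upward, which differs from Python's
built-in round, where round(2.5) gives 2. Indicator is convenient for
building dummy variables, for example to mark observations whose value
lies within a given range.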
The dependent variable may be transformed in a similar way and as a
consequence the model formula in its most general form looks like:

    "Model" G (y) = b0 + b1 * F1 (x1,...,xm) + ... + bp * Fp (x1,...,xm);

Some examples of transformed model formulae are:

    "model" y = a0 + a1 * Sqrt (x1 + x2) + a2 * Sqrt (x3);

and

    "MODEL" Arcsin (Sqrt (Y)) = A0 + A1 * X + A2 * X ^ 2;

A user can specify model formulae in which terms with known regression
coefficients appear, by subtracting those terms from the left hand part
of the model formula, for instance:

    "Model" y - 5.4321 * x3 = a0 + a1 * x + a2 * x ^ 2;

This applies especially to the constant term; if this term is known it
must be shifted to the left hand part.

If weights are present in the input data (or can be computed out of the
input data), to indicate that the variances of the observations are not
all equal (cf. section 1.2.1), the left hand part of the model formula
can be expanded with a so called weight part (which can be an
expression), preceded by a '&' (ampersand), for instance:

    "Model" Depvar & Max (Abs (Weight), 10) = Const + Param * Indepvar;
2.2  THE INPUT SPECIFICATION

To indicate which numbers or series of numbers in the input data belong
to which variable in the model formula and which numbers can be skipped,
the program expects an input specification. It consists of the keyword
"Input", followed by a formula (the input statement) which describes the
arrangement of the observations in the input data, while it is terminated
with a ';' (semicolon). The basic idea is that numbers from the input
data are identified with the names from the input formula in such a way
that (in order of entry) numbers belonging to the same name are put in a
queue appended to that name. For instance:

    "Input" 100 * (codenr, 10 * [yvar], [xvar1, xvar2], -1);

means that one hundred series of numbers (each, as a check, terminated in
this example by -1) are present in the input data. Each series consists
of fourteen numbers: first one value which is read and assigned to the
name codenr, then ten values for the name yvar, then one value for the
name xvar1, followed by one value for the name xvar2 and finally the
value -1.
The basic constituent of an input formula is a variable enclosed in
square brackets, in the example: [yvar]. The corresponding number from
the input data will be appended to the queue for that name. Several
variables can be put together in a variable list by separating them by
commas and enclosing them in square brackets, in the example:
[xvar1, xvar2]. This only serves to save the writing of several opening
and closing brackets.

Separate numbers, series or blocks of numbers can be treated by putting a
repetition factor (control) followed by an asterisk in front of a
variable list (or in front of an input formula which must then be
enclosed in parentheses), in the example: 100 * and 10 * .
If a repetition factor is 1, it may be omitted together with the asterisk
and a parentheses pair, but square bracket pairs must remain. When a name
is used as a repetition factor, a value must already have been assigned
to it, which is done by giving that name, without square brackets and
followed by a comma (or closing parenthesis), earlier in the input
formula than the use of that name as a repetition factor. The
corresponding number from the input data is then assigned as a value to
that name. If such names are used repeatedly in the input formula, the
corresponding numbers from the input data are compared with the first one
and, in the case of inequality, an error message is supplied. This may
serve as a check against shifted data reading. A similar check can be
obtained by giving an explicit number followed by a comma (or closing
parenthesis), in the example: the -1. The corresponding number from the
input data is then compared with that given number and, in the case of
inequality, an error message is produced.
Also an expression is allowed as a repetition factor, or for that matter,
as a check value, provided that it is enclosed in angle brackets, for
instance: <k+n>. As in the case of single names used as a repetition
factor each (non-standard function and non-option) name used in such a
(special) expression must have been given, followed by a comma, earlier
in the input formula than the use of that name in the expression.
The linkage between the model formula and the input formula is
established by using the same names in the model terms and in the input
variable lists. Numbers from the input data that belong to such input
names will be treated as observations for the model variables, while
numbers that belong to input names between square brackets which do not
appear in the model formula, are skipped.
Often, repeated observations for the dependent variable are available. In
order to be able to process these observations automatically, it is
necessary that a variable list consisting entirely of dependent variables
is preceded by a repetition factor (followed by an asterisk) indicating
the number of repetitions. If a variable list contains independent as
well as dependent variables, the number of replications is assumed to be
1. A series of (say 100) observations for a dependent variable with no
replications is denoted as:

    100 * ([dep var])

The repetition factor in front of the opening square bracket is omitted
(because it is 1), although the parentheses are not. Without the
parentheses it would mean 100 replications of [dep var].
EXAMPLE
"Input" k, n, <k+n> * (c, m, m * [y], [x1,x2,x3,x4], c), -99;
means that: the first number is read and its value assigned to k,
the next number is read and its value assigned to n,
then k+n times the following happens:
a number is read and its value assigned to c,
the next number is read and its value assigned to m,
then the m replications for y are read,
next the observations for x1, x2, x3 and x4 are read,
then a number is read and its value compared with c,
finally a number is read and its value compared with -99.
If the comparisons fail, an error message is supplied and execution of
the
job is terminated, otherwise (k+n) observations for x1, x2, x3, x4 and
for
each quadruple m replications for y, have been identified.
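The reading order described in this example can be sketched in Python. This is only an illustration of how the input formula imposes structure on a flat series of numbers, not the program's own (SIMULA) implementation; the function name read_example and the sample numbers are assumptions made for the sketch.

```python
def read_example(numbers):
    """Read a flat series of numbers according to the input formula
    k, n, <k+n> * (c, m, m * [y], [x1,x2,x3,x4], c), -99
    and return a list of (x values, y replications) pairs.
    Hypothetical helper: it mirrors the description in the text only."""
    stream = iter(numbers)
    k = next(stream)                 # first number is assigned to k
    n = next(stream)                 # next number is assigned to n
    cases = []
    for _ in range(int(k + n)):      # repetition factor <k+n>
        c = next(stream)             # value assigned to c
        m = int(next(stream))        # number of replications for y
        y = [next(stream) for _ in range(m)]
        x = [next(stream) for _ in range(4)]
        check = next(stream)         # check value: must equal c
        if check != c:
            raise ValueError(f"check value {check} differs from c = {c}")
        cases.append((x, y))
    if next(stream) != -99:          # final check value
        raise ValueError("final check value is not -99")
    return cases


data = [1, 1,
        7, 2, 0.5, 0.6, 1, 2, 3, 4, 7,
        8, 1, 0.9, 5, 6, 7, 8, 8,
        -99]
cases = read_example(data)
```

With these sample numbers k + n = 2, so two quadruples of x values are identified, the first with two replications for y and the second with one.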
2.3  THE OPTION SPECIFICATION
It is possible to have the program perform some tasks optionally
by
providing an option specification in a job. It consists of the
keyword
"Options" followed by a list of option identifiers or corresponding
option
numbers (the option statement), separated by commas and terminated with
a
';' (semicolon). The following ten options are available:
option number   option name
       1        Transformed data matrix
       2        Correlation matrix
       3        Residual analysis
       4        No regression analysis
       5        Process submodels
       6        Print input data
       7        No input data rewind
       8        Save original model
       9        Test reduced model
      10        Missing values
Options 1, 2, 3 and 6 cause the corresponding piece of information
to
be printed.
However, option 1 lists only those (possibly
transformed)
variables that are present in the model formula in a neat tabular
form,
while option 6 lists all the original input data serially (eleven
numbers
per line) without any special layout, because the input data consists
(by
definition) of an unstructured series of numbers (cf. section 2.4).
Option 4 suppresses the regression analysis; it is meant to be used in
combination with option 1 and/or 2.
Option 5 causes the program to process submodels, which are formed
by
a form of backward elimination: each time, the last term of the right
hand part of the model formula is omitted by deleting the last column
of the design matrix, and a regression analysis is performed with the
reduced design matrix.
Messages are generated about which terms
are
omitted, while further processing of the job ceases when the
resulting
model formula is of the form: y = c. Moreover a test is made (under
the
usual assumptions) whether the omitted terms did contribute
significantly
to the regression sum of squares (cf. section 1.3.2.4).
To option 5 a specifier list may be appended, to prevent the
production of waste output for unwanted submodels. In this list the number
of
terms to be omitted from the model formula (counting backwards, starting
at
the end) must be given enclosed in parentheses. For example the
option:
process submodels (6, 10) instructs the program to process only
two
submodels, one with the last six terms omitted and one with the last
ten
terms omitted (from the original model formula). If the user asks for
more
terms to be omitted than are present in the model formula, an error
message
is supplied and the execution of that job is terminated. Moreover, if no
explicit specifier list is appended to option 5, options 2 and 3 have no
effect (even if specified), which also serves to prevent the production of
waste output for the submodels.
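The way submodels are derived from the right hand part can be sketched as follows. The function name submodels is hypothetical, and two points are assumptions rather than documented behaviour: without a specifier list the elimination is taken to stop when a single term is left, and a request to omit exactly all terms is not treated as an error.

```python
def submodels(terms, omit_counts=None):
    """Return the right hand parts of the submodels of option 5.

    terms       : terms of the original right hand part, in order
    omit_counts : specifier list, e.g. (6, 10); None means plain
                  backward elimination, one term at a time
    Illustrative sketch only, not the program's own code."""
    if omit_counts is None:
        # assumption: elimination stops when one term remains
        omit_counts = range(1, len(terms))
    result = []
    for count in omit_counts:
        if count > len(terms):
            # more terms to be omitted than are present in the model
            raise ValueError(f"cannot omit {count} of {len(terms)} terms")
        # counting backwards from the end of the model formula
        result.append(terms[:len(terms) - count])
    return result


# Right hand part of the model  y = c * Log (x) + a + b * x
rhs = ["c * Log (x)", "a", "b * x"]
listed = submodels(rhs, (1, 2))
```

For the specifier list (1, 2) the sketch yields the submodels y = c * Log (x) + a and y = c * Log (x).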
Option 7 gives the user the opportunity to process consecutive
pieces
of input data in consecutive jobs. Normally the processing of the
input
data for each job starts with the first number in the data
specification
(or with the first number in the datastream), and the program gives
a
(warning) message if the input formula does not match the input
data
precisely. This option suppresses the message and causes the program to
continue processing input data where the previous job had finished.
Option 8 causes the residual degrees of freedom and residual sum
of
squares from the current job to be saved, in order to be able in the
next
job, by means of specifying option 9, to test whether the model
under
consideration in that next job, shows a significant increase in
residual
sum of squares in comparison with the model in the previous job. In
effect
this gives the possibility of testing a hypothesis concerning a
linear
combination of the parameters from a model (cf. section 1.3.2.5),
for
instance (cf. SEARLE [11] pp. 121-122):
"Model 1"   y = b1 * x1 + b2 * x2 + b3 * x3;
"Options"   Save original model;
"Run"
"Model 2"   y - 4 * x1 = b2 * (x1 + x2) + b3 * x3;
"Options"   Test reduced model;
"Run"
causes the null hypothesis: b1 = b2 + 4 to be tested (in the second job).
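The second model follows from the first by substituting the hypothesis, as the short derivation below shows. The F-ratio in the last line is the usual textbook form for comparing a reduced model with the original one; it is assumed here to agree with the test of section 1.3.2.5. RSS denotes a residual sum of squares and \nu the corresponding residual degrees of freedom.

```latex
\begin{align*}
y &= b_1 x_1 + b_2 x_2 + b_3 x_3, \qquad H_0:\; b_1 = b_2 + 4 \\
y &= (b_2 + 4)\, x_1 + b_2 x_2 + b_3 x_3 \\
y - 4 x_1 &= b_2 (x_1 + x_2) + b_3 x_3 \\[1ex]
F &= \frac{(\mathrm{RSS}_{\mathrm{reduced}} - \mathrm{RSS}_{\mathrm{original}})
           \,/\, (\nu_{\mathrm{reduced}} - \nu_{\mathrm{original}})}
          {\mathrm{RSS}_{\mathrm{original}} \,/\, \nu_{\mathrm{original}}}
\end{align*}
```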
Option 10 may be used to identify some observations or
repetitions
as 'missing'.
In a specifier list, appended to this option, the
missing
values must be given enclosed in parentheses. When a repetition equal to
a
missing value is encountered in the input data, the corresponding set
of
repetitions for the dependent variable(s) is not included in the
design
matrix. When an observation equal to a missing value is encountered in
the
input data, or when none of the repetitions are included in the
design
matrix, the corresponding set of observations for the independent
variables
together with the (possibly empty) set of repetitions for the
dependent
variable(s) (i.e. the 'case') is not included in the design matrix.
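The treatment of missing values can be illustrated with a small sketch. It assumes a single dependent variable and an exact comparison with the listed missing values; the name drop_missing and the data layout (a list of cases, each a pair of x observations and y repetitions) are chosen for the illustration only.

```python
def drop_missing(cases, missing):
    """Apply the rules of option 10 to a list of cases.

    cases   : list of (x observations, y repetitions) pairs
    missing : collection of values that mark 'missing'
    Illustrative sketch for one dependent variable."""
    kept = []
    for x, y in cases:
        # a repetition equal to a missing value is left out
        y_kept = [value for value in y if value not in missing]
        # a missing observation, or no repetition left, drops the case
        if any(value in missing for value in x) or not y_kept:
            continue
        kept.append((x, y_kept))
    return kept


cases = [([1.0, 2.0], [3.0, -9.0]),   # one repetition is missing
         ([-9.0, 2.0], [1.0]),        # an observation is missing
         ([4.0, 5.0], [-9.0])]        # all repetitions are missing
kept = drop_missing(cases, {-9.0})
```

Only the first case survives, with its single valid repetition.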
2.4  THE DATA SPECIFICATION
The data specification consists of an unstructured series of
numbers
(the data statement), preceded by the keyword "Data" and terminated at
the
next keyword. The structure is imposed onto it by the input formula.
A sequence of symbols is considered a number when it satisfies
the
definition of number in appendix 2, together with the machine
dependent
restrictions imposed by the underlying SIMULA programming system.
Note
that the definition of number does not allow FORTRAN-like numbers as
for
instance 10. or 1# (at least one digit must follow).
It is
recommended
always to use blanks as delimiters between numbers, but the use of
other
non-numerical symbols as delimiters will not lead to fatal errors, only
to
a slight increase in processing time.
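The definition of number in appendix 2 can be paraphrased as a regular expression. The sketch below is an illustration in Python, not the program's own reader; the name parse_number is hypothetical, and the machine dependent restrictions of the SIMULA system are not modelled.

```python
import re

# [sign] [decimal number] ['#' integer], following the syntax in appendix 2;
# at least one of the decimal number and the exponent part must be present.
_NUMBER = re.compile(r"([+-]?)(\d+(?:\.\d+)?|\.\d+)?(?:#([+-]?\d+))?")


def parse_number(text):
    """Convert a number in the notation of this manual to a float."""
    match = _NUMBER.fullmatch(text)
    if match is None or (match.group(2) is None and match.group(3) is None):
        raise ValueError(f"not a number: {text!r}")
    sign, decimal, exponent = match.groups()
    # a bare exponent part such as #+3 stands for 1 times ten to that power
    value = float((decimal or "1") + "e" + (exponent or "0"))
    return -value if sign == "-" else value


values = [parse_number(s) for s in ("1.234", ".98", "-0.5673#2", "+.02#-1", "#+3")]
```

FORTRAN-like forms such as 10. or 1# are rejected by the sketch, because a digit must follow the '.' and the '#'.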
If a datafile is specified in response to the datastream request,
or
is appended to the "Run" keyword (cf. appendix 1), the program will try
to
read a record of input data, which then do not have to be preceded by
the
keyword "Data", from that file. The record ends at the next "Eor"
keyword.
However, a non empty data specification in the user program will
get
priority over reading input data from the specified file, while an
empty
data specification causes the program to start reading the next record
of
input data from the specified file. If neither the data specification nor
the next nonempty record in the data file contains any numerical
information, an error message is supplied.
EXAMPLES
real number     value
1.234           1.234
.98             0.98
-0.5673#2       -56.73
+.02#-1         0.002
#+3             1000.0
2.5  THE USER PROGRAM
In a user program several jobs can be submitted to the program.
Each
job is separated from its preceding one by the keyword "Run", while
the
entire user program is terminated with the keyword "Exit".
In the
first
job, the model, input and data specification must be given in some
order.
The option specification is, of course, optional. In each following job
a
specification which is not changed may be omitted, the program then
retains
the last given specification. If options have been specified in a
previous
job and one wants to delete them, this is done by providing a new
option
specification which may be empty if no options are to be executed (that
is
by only providing: "Options";).
In front of each job or in front of the keyword "Exit" a text may
be
given for further identification of the output of a job or of the output
of
the entire user program. The use of quotes in that text should be
avoided
in view of confusion with the keywords. The program starts reading
a
(possibly empty) text at the beginning of the next line after the
keyword
"Run" of the previous job (with the first job the program starts with
the
first line in the inputstream).
2.6  EXAMPLES
The following user program can be submitted without any
modification
to the Multiple Linear Regression Analysis program. It consists of
four
jobs, three of which are preceded by an identifying header, while
the
whole user program ends with an identifying trailer:
**********************************
* Example 1 originates from:     *
* DE JONGE [4], pp. 472 & 479.   *
**********************************
"Model"
y = c * Log (x) + a + b * x;
"Input"
5 * ([x], 10 * [y]);
"Options" Transformed data matrix, Correlation matrix,
Residual analysis, Process submodels (1, 2);
"Data"
 25  0.67 0.70 0.75 0.76 0.78 0.80 0.83 0.84 0.88 0.89
 50  0.88 0.92 0.93 0.96 0.98 1.00 1.01 1.03 1.06 1.07
 80  0.96 0.98 0.99 1.03 1.05 1.06 1.08 1.11 1.15 1.17
130  1.07 1.09 1.11 1.13 1.14 1.14 1.19 1.22 1.25 1.29
180  1.10 1.13 1.17 1.19 1.20 1.21 1.23 1.25 1.28 1.33
"Run"
**********************************
* Example 2 originates from:     *
* SEARLE [11], pp. 121-123       *
**********************************
"Input"    5 * [y, x1, x2, x3];
"Data"      8   2   1   4
           10  -1   2   1
            9   1  -3   4
            6   2   1   2
           12   1   4   6
"Model 1"  y = a3 * x3 + a2 * x2 + a1 * x1;
"Options"  Save original model, Process submodels (1);
"Run"
"Model 2"  y - 4 * x1 = b2 * (x1 + x2) + b3 * x3;
           (eqn. 118, p. 121)
"Options"  Test reduced model, Transformed data matrix;
"Run"
****************************************
* Example 3 originates from:           *
* AFIFI & AZEN [1], pp. 88 & 93-100.   *
****************************************
"Model" y = alfa0 + alfa1 * x;
"Input" 5 * ([x], n, n * [y]);
"Option" Transformed data matrix, Print input data;
"Data"
 1  4   1.1   0.7   1.8   0.4
 3  5   3.0   1.4   4.9   4.4   4.5
 5  3   7.3   8.2   6.2
10  4  12.0  13.1  12.6  13.2
15  4  18.7  19.7  17.4  17.1
"Run"
*** Marten van Gelderen; Mathematisch Centrum ***
"Exit"
CHAPTER 3
THE OUTPUT FROM THE PROGRAM
3.1  STANDARD AND OPTIONAL PRINTED OUTPUT
After having read the keyword "Run", the processing of the job
is
initiated.
First the model, input and option texts are printed in
this
order. Next an attempt is made at translating the specifications.
Errors
against syntax or semantics cause error messages to be printed below
each
specification, while further processing of that job ceases. Note that
the
processing of the next job, if present, will be of little or no use
unless
the specification which developed the error(s) is changed.
Next
the
(transformed) data matrix is formed and passed to the regression
routines,
which supply the following printed output in the order indicated:
1) a listing of the original input data (option 6),
2) the (transformed) data matrix (option 1),
3) per (transformed) variable the:
mean, standard deviation, minimum and maximum,
4) the correlation matrix of the (transformed) variables (option 2),
5) the multiple correlation coefficient (with adjustment),
6) the proportion of variation explained (with adjustment),
7) the standard deviation of the error term,
8) the estimates for the regression parameters with
estimated standard deviation, F-ratio and right tail probability,
9) the correlation matrix of the estimates (option 2),
10) the analysis of variance table,
11) the residual analysis (option 3).
Ad 1) cf. section 2.4.
Ad 2) The transformed data matrix gives the input data after
possible
transformations according to the model specifications have
been
applied. If the model formula contains no transformations,
the
original input data are given. The dependent variable is given as
a
separate column. In the case of replications for the dependent
variable, the mean value of them is given, and the number of
replications is given as an extra (last) column.
If a
weightvariable (or -expression) is specified in the model formula,
the
(transformed) data comprising the weights are given as an
extra
(last) column. Each (transformed) independent variable is
indicated
by its corresponding parameter. This convention stems from the fact that
it is not obvious how to denote a variable which is transformed
like: Arcsin (Sqrt (y+25)), with 'Arcsin', with 'Sqrt' or perhaps
with 'y' itself. The dependent variable is indicated by 'dep.var.'.
Ad 4) and 9) The matrices of estimated correlation coefficients of the
variables and of the estimates are supplied only if option 2 is
specified.
Ad 5), 6) and 7) cf. section 1.2.
Ad 8) The F-ratio and right tail probability give the user the
opportunity
to test the significance of a particular regression
coefficient
(cf. section 1.3.1).
Ad 10) The layout of the table closely resembles that of the table
in
section 1.3.2.
The F-ratios and right tail probabilities give
the
user the opportunity to test the significance of all the
regression
coefficients or of a subset or combination thereof or to test
the
adequacy of the (linear) model (cf. section 1.3.2).
Ad 11) A table of observations, fitted values, standard deviations of
the
fitted values, residuals, standardized residuals and
studentized
residuals is provided (cf. section 1.2.2). As a check on
computations, the sum of the residuals is also given. If an
unknown
constant term is present in the model formula, this sum should
be
zero.
Furthermore the upperbound for the right tail probability
of
the largest absolute studentized residual is given.
Without options specified, the printed output from the
program
consists of 3), 5), 6), 7), 8) and 10). If option 5 is specified,
the
output for the model itself is given as specified by the other options,
but
for the submodels it depends on the use of a submodel specifier
list.
Without that list the output from the options 1, 2 and 3 is
suppressed
(even if those options are specified). With that list only the
superfluous
parts of the output (that is the transformed data matrix and
the
correlation matrix of the variables) are suppressed.
3.2  STANDARD AND OPTIONAL DATA OUTPUT
If an outputfile is specified in response to the outputstream
request,
or is appended to the "Run" keyword (cf. appendix 1), the program
writes
the following pieces of information in one record to that file:
1) if option 1 is specified: the transformed data matrix, preceded by
the
number of rows and columns respectively,
2) if option 2 is specified: the correlation matrix of the
variables,
preceded by its order,
3) the number of submodels specified in the list appended to option 5,
if
that option is specified at all;
otherwise the number 1. It
is
followed by (for each (sub) model): the number of estimated
parameters,
the estimates for the parameters of the (sub) model, and:
a) if option 2 is specified:
the variance-covariance matrix of
the
estimates, preceded by its order, (BE CAREFUL: this is NOT
the
correlation matrix of the estimates, which is printed; however,
the
correspondence between the two matrices is established by
the
relations (10) and (11) in "Help"/Theory),
b) if option 3 is specified: the number of respondents, followed by
for
each respondent the: observation, fitted value, standard
deviation,
residual, standardized residual and studentized residual,
and finishes by writing an "Eor" keyword.
As in the case of printed output, the output described in 3) is
only
effected for submodels, if an explicit submodel specifier list is
appended
to option 5.
An input specification to describe one record of data written to
the
outputstream when options 1, 2, 3 and 5 (with a submodel specifier
list
appended to it) are specified, could read:
"Input"
n, m, n * (m * [transformed data element]),
p, <p * (p+1) : 2> * [correlation element],
s, s * (t, t * [estimate],
q, <q * (q+1) : 2> * [covariance element],
r, r * (6 * [residual element]) );
For the original model the following relations hold: q = t, t = m-1,
r = n and p = m (or p = m-1 if replications and/or weights are
specified);
s is the number of processed (sub)models; for each submodel t and q are
decreased by the number of terms that are omitted from the original
model.
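Because the data output holds the variance-covariance matrix while the printed output shows the correlation matrix of the estimates, a reader of the record may want to convert one into the other. The sketch below uses the usual normalisation by the square roots of the diagonal elements, which is assumed to be what relations (10) and (11) express. It further assumes that the q * (q+1) : 2 covariance elements form the lower triangle stored row by row; the record layout is not stated in this detail, so both the ordering and the name correlation_from_packed are hypothetical.

```python
import math


def correlation_from_packed(packed, order):
    """Turn a packed symmetric covariance matrix into a correlation matrix.

    packed : order * (order + 1) / 2 elements, assumed to be the lower
             triangle stored row by row (an assumption of this sketch)
    order  : the order q of the matrix"""
    if len(packed) != order * (order + 1) // 2:
        raise ValueError("number of elements does not match the order")
    cov = [[0.0] * order for _ in range(order)]
    index = 0
    for i in range(order):
        for j in range(i + 1):
            cov[i][j] = cov[j][i] = packed[index]
            index += 1
    # corr[i][j] = cov[i][j] / sqrt(cov[i][i] * cov[j][j])
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])
             for j in range(order)]
            for i in range(order)]


corr = correlation_from_packed([4.0, 2.0, 9.0], 2)
```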
Real numbers in the printed output are given in fixed point
format
with a six decimal fractional part, the only exceptions are the
estimates
for the regression parameters with their standard deviations, which have
a
ten decimal fractional part and the numbers in the listings of the
input
data and the transformed data matrix, which have a three decimal
fractional
part.
Real numbers in the data output are given in floating point
format
with a sixteen decimal mantissa and a two decimal exponent part.
3.3  ERROR MESSAGES
Error messages against syntax or semantics have the following layout:
Error : <error text> or <error number>
The error text corresponding to the error numbers is:
 1  No input data given.
 2  All input data has been skipped.
 3  Attempt to process more input data than provided.
 4  Number in the input data is incorrect or too large.
 5  In a number '.' is not followed by a digit.
 6  In a number '#' is not followed by '+', '-' or a digit.
10  No model formula given.
11  Left hand part is not followed by '='.
12  Expression is not followed by ')'.
13  Option name used in a primary in an expression.
14  Incorrect primary in a factor in an expression.
15  Incorrect (control) identifier in an expression.
16  Parameter list of a standard function is not followed by ')'.
17  Standard function call with incorrect number of parameters.
20  No input formula given.
21  Expression in a control is not followed by '>'.
22  Option name used in a control in an input statement.
23  Input statement in a description is not followed by ')'.
24  Variable list in a description is not followed by ']'.
25  Incorrect description in an input statement.
26  Incorrect identifier in a variable list.
27  Item in a variable list is not an identifier.
30  Incorrect option number in an option statement.
31  Incorrect option name in an option statement.
32  Specifier list is not followed by ')'.
33  Number in a specifier list is incorrect or too large.
34  Specifier list is appended to incorrect option.
35  Specifier is not a number.
36  Specification is not properly continued.
37  Specification is not terminated with ';'.
40  No defined (independent) identifier to the right of '='.
41  Incorrect use of a parameter in a regression term.
42  Undefined (weight) identifier to the left of '='.
43  Undefined (dependent) identifier to the left of '='.
44  Number in a regression term is incorrect or too large.
45  Term does not have the form: param * factor or factor * param.
46  Undefined (independent) identifier in a regression term.
47  No regression parameter in a regression term.
50  Division by zero.
51  Integer division by zero.
52  Observation for dependent variable is in absolute value too large.
53  Observation for independent variable is in absolute value too large.
54  Exponentiation with zero base and non positive exponent.
55  Exponentiation with negative base and real exponent.
56  Weight factor is not positive.
60  Argument of 'Sqrt' is negative.
61  Argument of 'Ln' is not positive.
62  Argument of 'Log' is not positive.
63  Argument of 'Exp' is too large.
64  Argument of 'Arcsin' is in absolute value larger than one.
65  Argument of 'Arccos' is in absolute value larger than one.
66  Argument of 'Sinh' is in absolute value too large.
67  Argument of 'Cosh' is in absolute value too large.
70  Number of observations for the first dependent variable is zero.
71  Numbers of observations for the dependent variables are not equal.
72  Number of observations for the first independent variable is zero.
73  Numbers of observations for the independent variables are not equal.
74  Control reads an incorrect number in the input data.
75  Numbers of replications for the dependent variables are not equal.
76  Given, read or computed replication factor is not integral.
If the error number lies between:
 5 and 37, it is followed by the most recently processed identifier,
           number and symbol. Only the first eight characters of each
           name are displayed.
41 and 47, it is followed by the number of the right hand part regression
           term which causes the error, or a zero if the left hand part
           is at fault.
50 and 67, it is followed by the wrong value and the number of the line in
           the transformed data matrix which causes the error. Instead of
           the wrong value, the number of the right hand part regression
           term which causes the error is displayed when the error number
           lies between 50 and 53.
70 and 76, it is followed by the check value and the wrong value. Instead
           of the wrong value, the value of the controlling variable of
           the next enclosing repetition loop is displayed when the error
           number is 76.
3.4  EXAMPLES
An impression of the printed output of the Multiple Linear
Regression
Analysis program may be obtained by actually submitting the user program
in
section 2.6 (which resides on: Hlp: Mulexa.hlp) to the MULREG program.
APPENDIX 1
TECHNICAL REMARKS
The following technical remarks reflect the SIMULA implementation
of
the Multiple Linear Regression Analysis program as of version 5H(246).
Any comments or queries concerning the functioning of the
software
described in this document should be addressed to:
Marten van Gelderen, IKO Computer Systems Group, Postbox 4395,
1009 AJ Amsterdam, The Netherlands. (telephone: 31-(0)20-930951).
The program resides on a device called USR:. It is started by
typing:
.R MULREG, and responds to the standard output device (usually TTY:)
with
an identifying header and requests to specify four files, as follows:
Multiple Linear Regression Analysis
Enter file specifications
Inputstream  :
Printstream  :
Datastream   :
Outputstream :
The inputstream serves to read the user program from; the
printstream
receives the printed output from the program; the datastream serves
to
read the separate input data records from; the outputstream receives
the
data output records from the program.
If the data- and outputstream requests are answered with the default
carriage return, the program assumes that no separate input data are
present and that no data output is required. If the input- and
printstream requests are answered with the default carriage return, the
program connects the standard input and output devices (usually TTY:) to
the input- and printstream respectively. To indicate that the inputstream
is connected to the standard input device, the program displays the
prompting character '*' (asterisk).
If both the input- and printstream are connected to the standard
input
and output devices, the program echoes every text it cannot
interpret
properly, preceded by the error character '?' (question mark).
However,
in response to a single carriage return the text 'For help type: "Help"'
is
displayed.
To the "Run" keyword a file specification list may be appended,
preceded by the character '/' (slash), with the following general format:
/print-spec;output-spec=input-spec;data-spec
Each of the specifications will be connected to the corresponding
program
stream respectively.
Also, each of the specifications may be omitted, which means that the
corresponding stream is left unchanged. However, if the character ' '
(blank) is substituted for one (or more) of the specifications, the
defaults (as described previously) will be connected to the corresponding
streams respectively.
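The distinction between an omitted specification and a blank one can be made concrete with a small sketch. It only illustrates the general format given above; how the program itself scans the list is not documented here, so the splitting rules, the name parse_run_files, the DEFAULT marker and the file names in the example are all assumptions.

```python
DEFAULT = "<default>"   # hypothetical marker for "connect the default"


def parse_run_files(spec):
    """Split '/print-spec;output-spec=input-spec;data-spec' into its parts.

    Returns a dictionary in which None means 'leave the stream unchanged'
    (specification omitted) and DEFAULT means 'connect the default'
    (a blank was substituted for the specification)."""
    if not spec.startswith("/"):
        raise ValueError("file specification list must start with '/'")
    left, _, right = spec[1:].partition("=")
    print_spec, _, output_spec = left.partition(";")
    input_spec, _, data_spec = right.partition(";")

    def classify(part):
        if part == "":
            return None          # omitted: nothing is changed
        if part.strip() == "":
            return DEFAULT       # blank: the default is connected
        return part.strip()

    return {"print": classify(print_spec), "output": classify(output_spec),
            "input": classify(input_spec), "data": classify(data_spec)}


streams = parse_run_files("/out.lst; =prog.mlr")
```

In this example the printstream and inputstream receive new files, the outputstream returns to its default, and the datastream is left as it was.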
If the specified files do not exist (for input- or datastreams)
or
cannot be created (for print- or outputstreams), the corresponding
streams
are displayed again, followed by the character '?' (question mark),
to
indicate the erroneous situation and to enable the specification of
other
files (or defaults).
If the program encounters a premature end-of-file condition in
the
inputstream, it will connect the inputstream to the standard input
device
and thus respond with the prompting character. New specifications for
the
"Model", "Input", "Options" or "Data" (or keywords like "Run" or
"Exit")
may then be entered.
In the inputstream, the program does not discriminate between upper
and lower case letters. Identifiers may contain any number of blanks;
however, a carriage return is not permitted, restricting the maximum
length of identifiers to the maximum number of characters in one input
line. In front of and following the opening quote of keywords only
nonprinting ASCII characters like tabs and/or blanks are permitted,
otherwise the keyword (and the whole line following it) is not recognized.
Two implementation dependent restrictions are imposed on user programs:
the maximum number of differently spelled identifiers and numbers
is 789 and the maximum number of nested parentheses is 62. In addition
two machine dependent restrictions are imposed: the maximum number of
characters in one input line is 132 and the number of significant digits
in computations is 18.
In response to the keyword "Help", the information in this
appendix
(which resides on HLP:) is copied to the printstream.
Each of the following switches may be appended to the "Help" keyword,
preceded by the character '/' (slash), in order to obtain more detailed
information (which also resides on HLP:).
/Theory    Regression model & least squares (section 1.1 & 1.2),
/Tests     Possible tests of hypotheses (section 1.3),
/Model     Specification of the model formula (section 2.1),
/Input     Specification of the input formula (section 2.2),
/Options   Possible options and their effects (section 2.3),
/Data      Acceptable numbers and their delimiters (section 2.4),
/User      Setup of a complete user program (section 2.5),
/Example   Example of a complete user program (section 2.6),
/Print     Standard and optional printed output (section 3.1),
/Output    Standard and optional data output (section 3.2),
/Errors    Meaning of the error numbers (section 3.3),
/Syntax    Definition of the syntax of a user program (appendix 2).
If no help information is available an appropriate message is displayed.
APPENDIX 2
DEFINITION OF THE SYNTAX OF A USER PROGRAM
The syntax of a user program is defined in an extended version of a
notation known as the Backus Naur Form, in short: BNF (cf. NAUR [10]).
The extensions comprise an explicit repetition and optionality construct
together with the possibility of factorization.
The BNF may be regarded as a metalanguage for the description of
a
user program.
In addition to the symbols that are admissible in a
user
program, the metalanguage requires a number of extra symbols,
called
metasymbols.
The ten metasymbols used in extended BNF are: ::=, |, <,
>,
{, }, [, ], ( and ). The , and . are part of the metalanguage English
in
which we are describing BNF. We write:
<expression>::= ['+' | '-'] <term> { ('+' | '-') <term> }
The metasymbols < and > are used as delimiters to enclose the name
of
a class. The metasymbol ::= may be read as 'is defined as' or as
'consists
of'. The metasymbol | is read as 'or'. Repetition is denoted by curly
brackets, i.e. { a } stands for e | a | aa | ..., where e denotes the
empty string. Optionality is expressed by square brackets, i.e. [ a ]
stands for e | a. Parentheses merely serve for grouping (factorization),
i.e. a (b | c) stands for ab | ac. Terminal symbols appear enclosed in
single apostrophes.
The above phrase defines an expression as a term, optionally preceded
by a '+' or a '-' and followed by an arbitrary repetition of terms, each
preceded by a '+' or a '-'.
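A phrase in this notation maps almost mechanically onto a recursive procedure: repetition becomes a loop and optionality becomes a conditional. The Python sketch below evaluates expressions built from unsigned numbers, parentheses and the operators of the rules for expression, term and factor given in the syntax definition in this appendix. It is an illustration only and not the program's compiler: identifiers, function designators and the '#' exponent notation are left out, ':' is assumed to denote integer division, and '^' is applied from left to right as a literal reading of the repetition construct suggests.

```python
import re

_TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|\.\d+|[-+*/:^()])")
_OPERATORS = {"+", "-", "*", "/", ":", "^", ")"}


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expression(self):
        # <expression>::= ['+' | '-'] <term> { ('+' | '-') <term> }
        sign = self.take() if self.peek() in ("+", "-") else "+"
        value = self.term()
        if sign == "-":
            value = -value
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # <term>::= <factor> { ('*' | '/' | ':') <factor> }
        value = self.factor()
        while self.peek() in ("*", "/", ":"):
            operator = self.take()
            right = self.factor()
            if operator == "*":
                value *= right
            elif operator == "/":
                value /= right
            else:
                value = float(value // right)   # assumed integer division
        return value

    def factor(self):
        # <factor>::= <primary> { '^' <primary> }
        value = self.primary()
        while self.peek() == "^":
            self.take()
            value = value ** self.primary()
        return value

    def primary(self):
        # restricted here to an unsigned number or '(' <expression> ')'
        token = self.take()
        if token == "(":
            value = self.expression()
            if self.take() != ")":
                raise SyntaxError("expression is not followed by ')'")
            return value
        if token is None or token in _OPERATORS:
            raise SyntaxError("incorrect primary")
        return float(token)


def evaluate(text):
    parser = Parser(text)
    value = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError("expression is not properly continued")
    return value
```

For instance, evaluate("3 * (3 + 1) : 2") gives the number of elements in the triangle of a symmetric matrix of order 3.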
The syntax of a user program can thus be defined as follows:
<letter>::= 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z'|
'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z'
<digit>::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<model keyword>::= '"Model"' | '"MO"'
<input keyword>::= '"Input"' | '"IN"'
<option keyword>::= '"Options"' | '"OP"'
<data keyword>::= '"Data"' | '"DA"'
<run keyword>::= '"Run"' | '"RU"'
<exit keyword>::= '"Exit"' | '"EX"'
<function name>::= 'Abs' | 'Sign' | 'Sqrt' | 'Sin' | 'Cos' | 'Tan' |
'Ln' | 'Log' | 'Exp' | 'Entier' | 'Round' | 'Mod' |
'Min' | 'Max' | 'Arcsin' | 'Arccos' | 'Arctan' |
'Sinh' | 'Cosh' | 'Tanh' | 'Indicator'
<option name>::= 'Transformed data matrix' | 'Correlation matrix' |
'Residual analysis' | 'No regression analysis' |
'Process submodels' | 'Print input data' |
'No input data rewind' | 'Save original model' |
'Test reduced model' | 'Missing values'
<option number>::= '1' | '2' | '3' | '4' | '5' |
'6' | '7' | '8' | '9' | '10'
<number>::= ['+' | '-'] <unsigned number>
<unsigned number>::= <decimal number> | <exponent part> |
<decimal number> <exponent part>
<decimal number>::= <unsigned integer> | <fractional part> |
<unsigned integer> <fractional part>
<exponent part>::= '#' <integer>
<fractional part>::= '.' <unsigned integer>
<integer>::= ['+' | '-'] <unsigned integer>
<unsigned integer>::= <digit> { <digit> }
<identifier>::= <letter> { <letter> | <digit> }
<data specification>::= <data keyword> [ <input data> ]
<input data>::= <number> { <number> }
<option specification>::= <option keyword> [ <option statement> ] ';'
<option statement>::= <option> { ',' <option> }
<option>::= <simple option> [ '(' <specifier list> ')' ]
<simple option>::= <option name> | <option number>
<specifier list>::= <specifier> { ',' <specifier> }
<specifier>::= <number>
<input specification>::= <input keyword> <input statement> ';'
<input statement>::= <input part> { ',' <input part> }
<input part>::= <control> | <description> | <control> '*' <description>
<control>::= <number> | <identifier> | '<' <expression> '>'
<description>::= '(' <input statement> ')' | '[' <variable list> ']'
<variable list>::= <variable> { ',' <variable> }
<variable>::= <identifier>
<model specification>::= <model keyword> <model statement> ';'
<model statement>::= <left hand part> '=' <right hand part>
<left hand part>::= <expression> [ '&' <weight part> ]
<weight part>::= <expression>
<right hand part>::= ['+'] <term> { '+' <term> }
<expression>::= ['+' | '-'] <term> { ('+' | '-') <term> }
<term>::= <factor> { ('*' | '/' | ':') <factor> }
<factor>::= <primary> { '^' <primary> }
<primary>::= <unsigned number> | <identifier> |
<function designator> | '(' <expression> ')'
<function designator>::= <function name> [ '(' <parameter list> ')' ]
<parameter list>::= <parameter> { ',' <parameter> }
<parameter>::= <expression>
<user program>::= { <job> } <exit keyword>
<job>::= { <specification> } <run keyword>
<specification>::= <model specification> | <input specification> |
<option specification> | <data specification>
APPENDIX 3
TECHNICAL DESCRIPTION OF THE PROGRAM
In this appendix a more or less technical description of the program
is given in terms of variables, procedures and control flow.
Basically the program logic is as follows:
       Init program
       Enter & open files
Job:   Read job
Run:   Init compiler tables
       Compile model & input
       Init data buffers
       Execute (to produce design matrix)
       Regression analysis
       Print results
       GOTO Job
Exit:  Close files
In case of an error situation in one of the sections, no further action
is undertaken and control is transferred to Job.
1. The program is initialized by setting various declared system
and
compiler constants to their appropriate values. Some of them are
highly
machine dependent constants, others are internal codings or table
limits.
The four basic informationstreams in the program: the input-,
print-,
data- and outputstream are connected to the external file
specifications,
entered by the user, in the procedure enterfiles.
2. A job is read specification by specification (after printing
and
skipping leading text by a call on echotext) by means of calls on
readtext
or readdata, depending on which specification from the job is read.
Both
procedures first set up a buffer administration as will be described
in
section 3, then process an input line or input numbers and finally call
on readline to obtain further information. Their task is finished when
the variable endtext becomes true, that is when the next input line starts
with
a '"' (quote), which is the beginning of what can be a keyword, or when
the
end-of-file condition is met. If an input data record is to be read from
a
file, the same administration is set up and the same routine for reading
the
actual numbers is used, only information is obtained from the
datastream
rather than the inputstream. The echoing of input text and input data
is
done via calls on the procedures printtext and printdata respectively.
3. The text and data storage section provides two kinds of SIMULA
classes, one for text storage and another for data storage. The class
textstorage provides a procedure to store lines of text into a linked
list of textbuffers (starting at base). The procedure nextline
retrieves those lines (after a reset). The class datastorage provides a
procedure to store real numbers into a databuffer which is linked into
a linked list of such buffers (starting at base). In case of buffer
overflow a new buffer is created and linked into the list. A super
class inputstorage provides the reset procedure and a procedure
nextnumber to retrieve numbers out of the buffers in a FIFO (first in,
first out) manner. Another super class rightstorage provides a
procedure lastnumber, which is quite similar to nextnumber, except that
it retrieves numbers out of the buffers in a LIFO (last in, first out)
manner.
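
The data storage described above can be sketched as a linked list of
fixed-size buffers with two read cursors. The sketch is in Python rather
than SIMULA and folds the classes datastorage, inputstorage and
rightstorage into a single class for brevity; the buffer size of 4 and
the field names next, prev and last are assumptions made for
illustration.

```python
class DataBuffer:
    """One fixed-size buffer in the linked list."""

    def __init__(self, size):
        self.numbers = []
        self.size = size
        self.next = None
        self.prev = None


class DataStorage:
    def __init__(self, size=4):
        self.size = size
        self.base = DataBuffer(size)    # first buffer of the list
        self.last = self.base           # buffer currently being filled
        self.reset()

    def store(self, x):
        if len(self.last.numbers) == self.size:
            # Buffer overflow: create a new buffer and link it in.
            new = DataBuffer(self.size)
            new.prev = self.last
            self.last.next = new
            self.last = new
        self.last.numbers.append(float(x))

    def reset(self):
        # FIFO cursor at the first number, LIFO cursor past the last one.
        self.fifo_buf, self.fifo_pos = self.base, 0
        self.lifo_buf, self.lifo_pos = self.last, len(self.last.numbers)

    def nextnumber(self):
        while self.fifo_pos == len(self.fifo_buf.numbers):
            if self.fifo_buf.next is None:
                raise IndexError("no more numbers")
            self.fifo_buf, self.fifo_pos = self.fifo_buf.next, 0
        x = self.fifo_buf.numbers[self.fifo_pos]
        self.fifo_pos += 1
        return x

    def lastnumber(self):
        while self.lifo_pos == 0:
            if self.lifo_buf.prev is None:
                raise IndexError("no more numbers")
            self.lifo_buf = self.lifo_buf.prev
            self.lifo_pos = len(self.lifo_buf.numbers)
        self.lifo_pos -= 1
        return self.lifo_buf.numbers[self.lifo_pos]


storage = DataStorage()
for value in range(1, 7):               # six numbers fill two buffers
    storage.store(value)
storage.reset()
first_in = [storage.nextnumber() for _ in range(3)]
last_in = [storage.lastnumber() for _ in range(3)]
```

Both cursors cross buffer boundaries transparently, so the caller never
sees how many buffers the numbers occupy.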
4. The compiler uses two tables: one called program and another called
hashtable. The first table accepts the (macro) instructions generated
by the compiler via calls of load. The second table is used by the
procedure nextatom to identify the various items in the model and input
formulae. The recognition method used is twin-prime-hashing described
in KNUTH [5] p. 522. Alphanumerical items can belong to one of the
following classes: identifier, number, functionname or optionname,
which are all superclasses of atomcell. The procedure nextatom calls on
nextchar and delivers a pointer to an atomcell of the appropriate type
containing, among others, the actual text of the item and its index in
the hashtable.
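
As an illustration of the table lookup, the sketch below assumes that
twin-prime hashing means open addressing with double hashing in which
the table size M and M-2 are both prime. The table size 13, the function
key_of that folds the characters of an item into an integer, and the
function name lookup are hypothetical and chosen for the example only.

```python
M = 13                      # table size; 13 and 11 are twin primes


def key_of(text):
    # Fold the characters of an item into one integer (illustrative choice).
    k = 0
    for ch in text:
        k = k * 64 + ord(ch)
    return k


def lookup(table, text):
    """Return the index of text in table, entering it if it is absent."""
    k = key_of(text)
    i = k % M                       # first probe
    step = 1 + k % (M - 2)          # probe decrement, in 1 .. M-2
    for _ in range(M):
        if table[i] is None:
            table[i] = text         # free slot: enter the new item
            return i
        if table[i] == text:
            return i                # item already known
        i = (i - step) % M          # M is prime, so every slot is reached
    raise OverflowError("hashtable full")


hashtable = [None] * M
first = lookup(hashtable, "x1")
again = lookup(hashtable, "x1")
other = lookup(hashtable, "log")
```

Looking up the same item twice yields the same index, which is what
allows an atomcell to carry its hashtable index as its identity.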
The compiler assumes that the running system has at its disposal a
(programmed pseudo) register F which is capable of handling both
integer and real numbers. Furthermore a (programmed pseudo) memory
organization known as a stack must be available to the running system.
A stackpointer refers to the first free position in the stack. All
binary operations will take place with the top of the stack as first
operand and F as the second. The result is delivered in F and as a side
effect the stackpointer is decreased by 1. When the contents of F is
saved in the stack, the stackpointer is increased by 1.
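
The conventions for the register F and the stack can be sketched as
follows. The class name Machine, the method names load, save and binary,
and the stack size of 16 are hypothetical; only the operand order and
the stackpointer behaviour are taken from the description above.

```python
import operator


class Machine:
    def __init__(self):
        self.F = 0                  # pseudo register, integer or real
        self.stack = [None] * 16
        self.sp = 0                 # stackpointer: first free position

    def load(self, value):
        self.F = value

    def save(self):
        # Save F in the stack; the stackpointer is increased by 1.
        self.stack[self.sp] = self.F
        self.sp += 1

    def binary(self, op):
        # Top of stack is the first operand, F the second. The result
        # is delivered in F and the stackpointer is decreased by 1.
        self.sp -= 1
        self.F = op(self.stack[self.sp], self.F)


m = Machine()
m.load(7)
m.save()                    # stack holds 7
m.load(2)                   # F holds 2
m.binary(operator.sub)      # F becomes 7 - 2
```

The operand order matters for the non-commutative operations: the value
saved first ends up on the left of the operator.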
The fundamental idea behind most procedures for translating the model
and input formulae is that the first atom of the syntactical unit to be
processed by that procedure has been read already (its 'value' being
assigned to lastatom). The procedure considers itself to have finished
its task after reading the first atom that can no longer belong to that
unit syntactically. Meanwhile the translation of that unit has been
produced. A more elaborate description of the procedure system
(arithmetic) expression can be found in KRUSEMAN ARETZ [6] and [7].
Reference [8] provides the description of a complete ALGOL 60 compiler.
We only mention here that every expression is transformed into a macro
program that corresponds to the reverse Polish form, thus:

      (a+b) * (c-d) ^ e

becomes:

      ab+ cd- e ^ *.
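
The lastatom convention and the reverse Polish output can be sketched
with a small recursive descent translator. It is an illustration only:
atoms are restricted to single characters, the exponentiation operator
is taken to be left associative as in ALGOL 60, and the function name
to_reverse_polish and the end marker "<end>" are hypothetical.

```python
def to_reverse_polish(formula):
    atoms = [ch for ch in formula if not ch.isspace()] + ["<end>"]
    pos = 0
    lastatom = atoms[0]          # the first atom has been read already
    out = []

    def nextatom():
        nonlocal pos, lastatom
        pos += 1
        lastatom = atoms[pos]

    def primary():
        if lastatom == "(":
            nextatom()
            expression()
            if lastatom != ")":
                raise SyntaxError("')' expected")
        elif lastatom.isalnum():
            out.append(lastatom)
        else:
            raise SyntaxError("unexpected " + lastatom)
        nextatom()               # read the first atom past this unit

    def factor():
        primary()
        while lastatom == "^":
            nextatom()
            primary()
            out.append("^")

    def term():
        factor()
        while lastatom in ("*", "/"):
            op = lastatom
            nextatom()
            factor()
            out.append(op)

    def expression():
        term()
        while lastatom in ("+", "-"):
            op = lastatom
            nextatom()
            term()
            out.append(op)

    expression()
    if lastatom != "<end>":
        raise SyntaxError("unexpected " + lastatom)
    return " ".join(out)


rpn = to_reverse_polish("(a+b) * (c-d) ^ e")
```

Each procedure returns with lastatom holding the first atom it could not
use, so its caller can decide what to do next without reading ahead.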
The procedure for translating the input formula must, among others,
generate instructions to perform the linkage between identifiers from
the model formula and numbers from the input data.
While translating a model formula, identifiers to the right of the
equal sign are assigned type 1, those to the left of the equal sign
type 2. If these identifiers appear in a variable list the types are
changed into 3 and 5 respectively. While translating an input
specification, identifiers in a variable list not appearing in the
model formula are assigned type 4, those in the input formula not
appearing in a variable list are assigned type 6. Meanwhile
instructions are generated to put the next number from the input data
in the appropriate column of the (yet untransformed) design matrix, or
to skip that number. For variable lists that consist entirely of
identifiers not appearing in the model formula, special instructions
are generated to skip the corresponding numbers all at once.
5. In the procedure check model a check is made whether the model
formula, after the linkage to the input data (by means of the input
formula), still satisfies some elementary statistical conditions, like:

a) each term must be the product of a parameter and a factor,
b) in that factor no identifier may appear that is not present in a
   variable list in the input formula. (An attempt to perform
   regression analysis with variables for which no input data is
   present may not succeed.)

In the procedure check input a check is made whether an equal amount of
numbers is present for each variable in the model formula; moreover a
check is made whether all numbers in the input data have actually been
processed (option 7 disengages this check).
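
The two checks made by check input can be sketched as follows. The
function signature, the dictionary of columns and the message texts are
hypothetical; only the two conditions and the role of option 7 are taken
from the description above.

```python
def check_input(columns, numbers_read, numbers_processed, options=()):
    """Return an error message, or None when the input passes both checks."""
    counts = {name: len(values) for name, values in columns.items()}
    if len(set(counts.values())) > 1:
        return "unequal amount of numbers: %s" % sorted(counts.items())
    # Option 7 disengages the check that all numbers were processed.
    if 7 not in options and numbers_processed != numbers_read:
        return "%d numbers not processed" % (numbers_read - numbers_processed)
    return None


ok = check_input({"y": [1.0, 2.0], "x": [3.0, 4.0]}, 4, 4)
unequal = check_input({"y": [1.0, 2.0], "x": [3.0]}, 3, 3)
leftover = check_input({"y": [1.0], "x": [3.0]}, 3, 2)
ignored = check_input({"y": [1.0], "x": [3.0]}, 3, 2, options=(7,))
```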
6. The execution section of the program is activated by a call of the
procedure execute which, among other things, simulates the basic cycle
of a computer:

next: get the instruction indicated by the programcounter
      increase the programcounter by 1
      isolate the instruction and address part
      execute the instruction
      GOTO next

This cycle ends when the programcounter tries to leave the (translated
macro) program. The (macro) instructions themselves are coded via the
switch lists macro and macro2.
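
The basic cycle can be sketched as follows. The packing of an
instruction word as opcode * 1000 + address and the four opcodes are
assumptions made for the example, and a chain of comparisons stands in
for the switch lists macro and macro2.

```python
ADDRESS_RANGE = 1000        # hypothetical packing: opcode * 1000 + address
LOAD, ADD, MUL, STORE = 1, 2, 3, 4


def execute(program, memory):
    F = 0.0
    programcounter = 0
    # The cycle ends when the programcounter leaves the program.
    while 0 <= programcounter < len(program):
        word = program[programcounter]      # get the instruction
        programcounter += 1                 # increase the programcounter by 1
        # Isolate the instruction part and the address part.
        opcode, address = divmod(word, ADDRESS_RANGE)
        if opcode == LOAD:                  # execute the instruction
            F = memory[address]
        elif opcode == ADD:
            F = F + memory[address]
        elif opcode == MUL:
            F = F * memory[address]
        elif opcode == STORE:
            memory[address] = F
        else:
            raise ValueError("unknown instruction %d" % opcode)
    return memory


# (memory[0] + memory[1]) * memory[2] is stored in memory[3].
memory = execute([1000, 2001, 3002, 4003], [2.0, 3.0, 4.0, 0.0])
```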
7. After the execution of the input and model instructions, the
(transformed) design matrix is delivered to the regression routine(s).
The actual computation of the regression coefficients is done via a
call of lsqdec followed by a call of lsqsol and lsqinv. The first two
of these procedures are described extensively in DEKKER [2] pp. 65-69.
The vector and matrix multiplication is done via calls of vecvec,
matvec, tamvec and tammat, described in DEKKER [2] pp. 8-9. The
algorithms for phi and Fisher are copied from CACM: algorithms 209 and
322 respectively. All other computations are straightforward.
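
To make the decompose-then-solve sequence concrete, the sketch below
solves a least squares problem by Householder triangularization followed
by back substitution. It is not a transcription of lsqdec and lsqsol:
the choice of Householder reflections without column interchanges is an
assumption, and the function name least_squares is hypothetical. The
computation done by lsqinv is not shown.

```python
import math


def least_squares(X, y):
    """Minimize ||X b - y|| by Householder triangularization (no pivoting)."""
    n, p = len(X), len(X[0])
    A = [list(row) + [yi] for row, yi in zip(X, y)]   # augmented [X | y]
    for k in range(p):
        norm = math.sqrt(sum(A[i][k] ** 2 for i in range(k, n)))
        if norm == 0.0:
            raise ValueError("design matrix is rank deficient")
        alpha = -norm if A[k][k] >= 0 else norm
        v = [A[i][k] for i in range(k, n)]            # reflection vector
        v[0] -= alpha
        vtv = sum(vi * vi for vi in v)
        for j in range(k, p + 1):                     # reflect each column
            s = 2.0 * sum(v[i - k] * A[i][j] for i in range(k, n)) / vtv
            for i in range(k, n):
                A[i][j] -= s * v[i - k]
    # Back substitution on the upper triangular system.
    b = [0.0] * p
    for k in reversed(range(p)):
        s = A[k][p] - sum(A[k][j] * b[j] for j in range(k + 1, p))
        b[k] = s / A[k][k]
    return b


# Four observations lying exactly on y = 1 + 2*x.
b = least_squares([[1, 0], [1, 1], [1, 2], [1, 3]], [1, 3, 5, 7])
```

Working on the design matrix directly, rather than forming the normal
equations, avoids squaring its condition number.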
REFERENCES

[1]  AFIFI, A.A. & S.P. AZEN, Statistical Analysis; A Computer Oriented
     Approach; Academic Press, (1972).
[2]  DEKKER, T.J., ALGOL 60 procedures in numerical algebra, part 1;
     MC Tract 22, Mathematisch Centrum, Amsterdam.
[3]  DRAPER, N.R. & H. SMITH, Applied Regression Analysis; John Wiley &
     Sons, (1966).
[4]  JONGE, H. DE, Inleiding tot de Medische Statistiek, deel II;
     Nederlands Instituut voor Praeventieve Geneeskunde, (1960).
[5]  KNUTH, D.E., The Art of Computer Programming, Vol. 3, Sorting and
     Searching; Addison Wesley, (1973).
[6]  KRUSEMAN ARETZ, F.E.J., ALGOL 60 translation for everybody;
     Elektronische Datenverarbeitung, Vol. 6 (1964), 6, p. 233-244.
[7]  KRUSEMAN ARETZ, F.E.J., Programmeren voor rekenautomaten, (De MC
     ALGOL 60 vertaler voor de EL X8); MC Syllabus 13, Mathematisch
     Centrum, (1972).
[8]  KRUSEMAN ARETZ, F.E.J., P.J.W. TEN HAGEN & H.L. OUDSHOORN, An
     ALGOL 60 compiler in ALGOL 60; MC Tract 48, Mathematisch Centrum,
     (1973).
[9]  LUND, R.E., Tables for an approximate test for outliers in linear
     models; Technometrics, Vol. 17 (1975), 4, p. 473-476.
[10] NAUR, P. (ed.), Revised report on the algorithmic language ALGOL
     60; Regnecentralen, Copenhagen, (1964).
[11] SEARLE, S.R., Linear Models; John Wiley & Sons, (1971).
[12] THEIL, H., Principles of Econometrics; John Wiley & Sons, (1971).