Writing and Developing Linear Models
1 Introduction
A statistical model attempts to describe reality based upon variables that are observable.
Statistical models are used to analyze all kinds of data. There are three parts to every
model. Part 1 is an equation where the observation on a trait is described as being
influenced by a list of factors (in an additive manner). The equation is written as
$$y_{ijkl} = \mu + A_i + B_j + C_k + \cdots + e_{ijkl},$$
where
$y_{ijkl}$ is the observation on a trait of interest,
$\mu$ is the overall mean of the population,
$A_i$ is the effect of factor A, level i, on the trait of interest,
$B_j$ is the effect of factor B, level j, on the trait of interest,
$C_k$ is the effect of factor C, level k, on the trait of interest, and
$e_{ijkl}$ is a residual effect composed of all factors not observed.
The equation could contain any number of factors that influence the observed trait value.
What are A, B, and C? Suppose y is the score of a dog at an obedience trial. Factor A
could be the breed of dog, factor B could be the judge, and factor C could be the handler
or trainer. Other factors could include the gender of the dog, the number of hours of training,
the number of previous obedience trials in which the dog has participated, the conditions
within the ring during the trial (noise and temperature), and the number of competitors.
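As a preview of how such an equation is analyzed later in these notes, the dog example could be written as an R formula. This is only a minimal sketch; the data frame trials and its columns score, breed, judge, and handler are hypothetical.
_________________________________________________________
# hypothetical data frame "trials": one row per dog per trial
modelDog = lm(score ~ factor(breed) + factor(judge) + factor(handler),
              data = trials)
_________________________________________________________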
Part 2 of a model is an indication of which factors are fixed or random (see later). If
a factor is random, then it is assumed to be a variable that is sampled from a population
that has a particular mean and variance. The mean and variance should be specified.
Determining whether a factor is fixed or random is not always easy, and takes experience
in data analysis.
Part 3 of the model is a list of all implied or explicit assumptions or limitations about
the first two parts. This part is often missing, but is important to be able to judge the
quality of the analysis. The best way to explain Part 3 is to give an example model.
2 Model for Weaning Weights of Beef Calves
Picture yourself as a beef calf and then try to think of the factors that would influence
your growth and eventual weaning weight. For example,
$$y_{ijklm} = A_i + B_j + X_k + HYS_l + c_m + e_{ijklm},$$
where
$y_{ijklm}$ is a weaning weight on a calf,
$A_i$ is the age of the dam (in years), either 2, 3, 4, or 5 and greater,
$B_j$ is a breed of calf effect,
$X_k$ is a gender of calf effect (male or female),
$HYS_l$ is a herd-year-season of birth effect, with three seasons per year (i.e. Nov-Feb,
Mar-Jun, and Jul-Oct),
$c_m$ is a calf additive genetic effect, and
$e_{ijklm}$ is a residual effect.
The fixed factors are age of dam, breed of calf, and gender of calf. Herd-year-season
effects, calf additive genetic effects, and residual effects are random. Instead of stating
that the variance of calf additive genetic effects, for example, is 3000 kg², one could just
say that the variance is 0.35 of the total variance, and herd-year-season effects comprise
0.15 of the total variance. The variance of residual effects is the remaining variation of
0.50 of the total. The means of the random effects are usually assumed to be zero. Calves
could be related to each other because of a common sire, and/or related mothers. Thus,
the analysis should take into account these relationships.
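In symbols, letting $\sigma^2_y$ denote the total variance, the proportions stated above are
$$\sigma^2_{HYS} = 0.15\,\sigma^2_y, \qquad \sigma^2_c = 0.35\,\sigma^2_y, \qquad \sigma^2_e = 0.50\,\sigma^2_y.$$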
Part 3 of the model lists the assumptions and limitations of the data and model
equation.
1. There are no interactions between age of dam, breed of calf, or gender of calf.
2. The weaning weights have been properly adjusted to a 200-day age-of-calf weight.
3. There are no maternal effects on calf weaning weights.
4. Age of dam is known.
5. All calves in the same herd-year-season were raised and managed in the same manner.
A researcher would discuss the consequences of each assumption if it were not true. For
example, if interactions among the fixed factors exist, then using this model might give
biased estimates of age of dam, breed, and gender of calf, which might bias the estimates
of calf additive genetic effects. However, So and So (1929) showed that interactions were
negligible. (Note: this article would be considered to be too old to be used as a reference
in 2006).
Maternal effects are known to exist for weaning weights. Thus, the model should be
changed by adding a maternal genetic effect of the dam. Thus, the equation is revised,
maternal genetic effects are another random factor, and the proportions of each to the
total variance need to be revised. There is also a genetic correlation between calf additive
genetic effects and the maternal genetic effects. (This is discussed more in the notes on
Maternal Genetic Effects.)
The last assumption may not be true in some herds, because owners sometimes
separate male and female calves earlier than weaning. Also, some herds may be very
large, and so there could be more than one management group within a herd-year-season.
From the recorded data, this fact may not be obvious unless producers correctly fill in
the management group codes.
For this course, students should be able to write an equation of the model (subscripts
not necessary) in words, e.g.
Wean. Wt. = Age of dam + Breed + Gender + HYS + Calf + residual.

Then indicate the fixed and random factors and the proportion of total variance for each
random factor, and finally make a good attempt at Part 3.
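For illustration only, the fixed and random structure of such a model could be sketched in R with the lmer() function from the lme4 package (not used elsewhere in these notes). The data frame calves and its column names are hypothetical.
_________________________________________________________
library(lme4)
# fixed: age of dam, breed, gender; random: herd-year-season
# (the calf additive genetic effect requires an animal model with
#  pedigree relationships, which lmer() does not provide)
modelWW = lmer(WeanWt ~ AgeOfDam + Breed + Gender + (1|HYS), data = calves)
_________________________________________________________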
3 Model Building
Developing an appropriate linear statistical model is best accomplished in discussions
with other scientists. Full awareness of models that have been published in the literature
for a particular species and trait is important. Model building, in the beginning, is a
trial and error ordeal. The Analysis of Variance was created to allow factors in models
to be tested for their significance. Factors that are significant should be in the model
(for genetic evaluation). Sometimes factors that are not significant in your data, but
which have consistently been important in previous studies, should also be included in
the model. As more data accumulate, the model may need to be re-tested and refinements
could be made. A genetic evaluation model will likely be used many times per year and
over years. Therefore, scientists should be open towards making improvements to their
models as new information becomes available.
4 Practice Models
Write a linear statistical model for one or more of the following cases. A similar case will
be given on the mid-term exam.
Case 1. Body condition scores of cows during the lactation are assigned by the owner
(from 1 to 5 in half increments, 1, 1.5, 2, 2.5,...), where 1 is very thin and lacking
in condition, and 5 is very obese. A farmer has body condition scores on all cows
every 30 days during the year.
Case 2. Beef bulls, at weaning, go to test stations for a 112 day growth test and the best
bulls at the end of test are sold to beef producers in an auction. Growth, feed intake,
and scrotal circumference are measured during the test period every 2 weeks. Write
a model for either growth, feed intake, or scrotal circumference to evaluate the beef
bulls. There are data from many test stations over the last 10 years. Several breeds
and crossbreds are involved in the tests.
Case 3. Weight and length at two years of age in Atlantic cod are important growth
traits. Fish are individually identified with pit tags. Fish are reared in tanks at a
research facility with the capability of controlling water temperature and hours of
daylight. Tanks differ somewhat in size and number of fish.
Case 4. Income from milk sales minus expenses for feed, breeding, and health problems
from one calving to the next are available on many herds of dairy cows. Call the
difference cow profit and write a model to analyze this trait for cows finishing their
first lactation.
Case 5. A reproductive physiology study collected statistics on semen volume, sperm
motility, and number of sperm per ejaculate on stallions from one year to ten years
of age (on the same horses - a long term study) to see how semen characteristics
change with age.
Case 6. Canadian Warmblood horses are raised for dressage and jumping. Mares can be
sent to a central location for a brief training (breaking) period and are scored for
a number of traits, such as gait and movement. Three experts score the horses as
well as two riders, and the results are combined into a weighted average.
Case 7. Horses differ in their reactions to insect bites. A veterinarian observed horses
that had been bitten by horse flies and rated the areas around the insect bites as
mild to severe. Horses came from many ranches and were observed over the course
of three summers in Ontario. Some horses were observed in each year.
5 Testing Factors in a Model
Below are a few example records (out of 311 total records) in a data frame called “pigs”.
Litter size (LS) of sows.

Sow ID    parity   year    month    LS
AXL82A       2     2002     FEB     10
AXL33A       2     2001     JAN      9
AXL27B       1     2001     JUN     10
BAS99Y       4     2003     MAY     11
BAS63A       2     2002     APR     12
  ...       ...     ...     ...     ...
The first model to explore for these data is
LS = parity + year + month + sow + residual.
The “sow” factor will definitely be included in the final model because the estimated
breeding values of the sows are of interest. The value or significance of the other factors
needs to be tested.
Testing is done with the Analysis of Variance (ANOVA, or AOV) table. Every ANOVA
table has 3 basic rows, as shown below.
Basic ANOVA table.

Source          df      SS          MS        F-value    Pr(>F)
1) Total        N       SST
2) Model        p       SSM         SSM/p     F
3) Residual     N-p     SST-SSM     MSE
The “Total” Sum of Squares is the sum of each litter size observation squared, and
N is the total number of observations (in this case N = 311).
The “Model” Sum of Squares has another pre-defined formula for calculation, but
should always be smaller than the Total SS. The degrees of freedom of the model is p,
where p is the number of parities in the data PLUS the number of years PLUS the number
of months (according to factors in the model) MINUS the number of factors PLUS one.
“MS” stands for Mean Square, and is the Sum of Squares DIVIDED by the degrees of
freedom.
5
The “Residual” Sum of Squares is the Total Sum of Squares MINUS the Model Sum
of Squares. The degrees of freedom is N − p. MSE is (SST-SSM) divided by N − p.
The “F-values” are computed only for the Model Sum of Squares, and are equal to
$$F_{\text{model}} = \frac{SSM/p}{MSE}.$$
The last column gives the probability of obtaining an F value greater than the one that
was observed. The smaller this probability, the more important that source of variation.
Usually any probability less than 0.05 is considered significant; less than 0.01 is highly
significant, and so on. These probabilities are computed by the statistical software.
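In R, for example, such a tail probability can be obtained from the F distribution function pf(); a minimal sketch, using the parity F-value and degrees of freedom that appear in the ANOVA table shown in the next section:
_________________________________________________________
# Pr(F > observed F), with numerator and denominator degrees of freedom
1 - pf(11.4, 3, 301)   # roughly reproduces the Pr(>F) of .0000004 shown later
_________________________________________________________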
Most statistical software packages provide these three lines. The model sum of squares is
nearly always significant if the researcher has written a sensible model. Of greater interest
are tests of the separate factors in the model. Thus, the Model Sum of Squares is broken
down, or partitioned, into separate sums of squares for each factor. For the example,
the ANOVA for the Litter Size model would have 3 additional lines, as shown below.
ANOVA table with the Model Sum of Squares partitioned by factor.

Source            df      SS           MS             F-value     Pr(>F)
1) Total          N       SST
2) Model          p       SSM          SSM/p          Fmodel
   2a) Parity     pa      SSParity     SSParity/pa    Fparity
   2b) Year       py      SSYear       SSYear/py      Fyear
   2c) Month      pm      SSMonth      SSMonth/pm     Fmonth
3) Residual       N-p     SST-SSM      MSE

6 ANOVA in R
The lm or “linear model” function in R can be used to generate an ANOVA. First, the
factor() function needs to be used.
_________________________________________________________
# Make factor variables for parity, year, month
fpar = factor(pigs$parity)
fyr  = factor(pigs$year)
fmo  = factor(pigs$month)
y    = pigs$LS

modelA = lm( y ~ fpar + fyr + fmo, data = pigs)
_________________________________________________________
The lm() function may take some time to execute depending on the amount of data
and the complexity of the model. The function generates a lot of information that could
be useful to the researcher. The str() (structure) function gives a list of the information
that is generated by lm().
_________________________________________________________
str(modelA)

$coefficients
$residuals
$rank
$df.residual
$xlevels
$call
$terms
$model
$anova
$summary
_________________________________________________________
The last two items are the ones of interest for this course. To view their contents
enter anova(modelA) or summary(modelA).
_________________________________________________________
anova(modelA)

              df        SS       MS       F      Pr(>F)
fpar           3     60.68    20.23    11.4    .0000004
fyr            2     19.01     9.51     5.4      .00511
fmo            5     41.63     8.33     4.7      .00037
residual     301    532.95     1.77
_________________________________________________________
Notice that this table does not contain the Total Sum of Squares, because the Total
Sum of Squares is usually not of any interest. Also, the Model Sum of Squares is omitted
for the same reason.
In the above example, all three factors, parity, year, and month are highly significant.
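As a quick arithmetic check of the formulas from the previous section, each MS is the SS divided by its df, and each F is that MS divided by the residual MS: for parity, $60.68/3 = 20.23$ and $20.23/1.77 \approx 11.4$; for year, $19.01/2 \approx 9.51$ and $9.51/1.77 \approx 5.4$.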
_________________________________________________________
summary(modelA)

Call:
formula = y ~ fpar + fyr + fmo, data=pigs

Residuals:
    min          mean          max
   -.82   ....    .02   ....   +.76

Coefficients:
              estimate      SE    t-value     Pr(>t)
(Intercept)       9.27     .24       .003    .000001
fpar2             1.08     .22       .017    .001328
fpar3              .08     .21       .442    .540116
fpar4              .00     .21       .899    .982350
fyr2002            .52     .18       .261    .357988
fyr2003            .53     .18       .261    .358022
fmo2               .28     .26       .335    .501665
fmo3              -.44     .28       .309    .499756
fmo4               .63     .26       .274    .367139
fmo5               .60     .27       .288    .373232
fmo6               .04     .27       .807    .863421
------------------------------------------------
Residual SE            1.331
Multiple R-squared     .1854      Adjusted R2 = .1584
F-statistic   6.852 on 10, 301 df    p-value 1.19e-09
_________________________________________________________
The summary() function for a model gives the “estimates” of the levels of the factors
in the model. The intercept is similar to the overall mean of the data, in this case 9.27
piglets per litter for sows in first parity, from year 2001 and month of JAN.
“fpar2” is 1.08 and means that sows farrowing in parity 2 gave 1.08 piglets more per
litter than sows farrowing in parity 1. Similarly, parity 3 sows only gave 0.08 piglets more
than parity 1 sows.
Sows farrowing in 2002 gave .52 more piglets than sows that farrowed in 2001, and sows
farrowing in 2003 gave .53 more. Similarly, the month effects are compared to sows that
farrowed in JAN.
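As an illustration of how these estimates combine, the expected litter size for a parity-2 sow farrowing in January of 2002 is the intercept plus the corresponding effects: $9.27 + 1.08 + 0.52 = 10.87$ piglets per litter.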
SE is the standard error of the estimate. The t-value is similar to the F-value in the
ANOVA, except that it has only 1 degree of freedom in the numerator. Lastly, Pr(>t) is
interpreted like Pr(>F) in the ANOVA: smaller values indicate greater significance.
The residual SE is the square root of the residual MS from the ANOVA (here
$\sqrt{1.77} \approx 1.33$, matching the 1.331 above).
The multiple R-squared is a useful statistic for comparing different models; higher
values are better. Values closer to 0.5 would be more desirable and indicate that the
model explains the data better. The adjusted R² is the multiple R-squared adjusted for
the amount of data available for the analysis; with more data, the adjusted R² should
not be much lower than the multiple R-squared.
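One common form of this adjustment (a standard formula, stated here for reference rather than taken from these notes) is
$$R^2_{adj} = 1 - (1 - R^2)\,\frac{N-1}{N-p-1},$$
which with $R^2 = 0.1854$, $N = 311$ observations, and $p = 10$ model degrees of freedom gives about 0.158, in agreement with the value reported above.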
The F-statistic is for the model as a whole, and as mentioned earlier should always
be significant if the model is reasonable.
7 A Second Model
The nice feature about R is that the model can be changed very quickly and a different
ANOVA can be easily generated. Suppose a new model is proposed as follows:
LS = parity + year×month + sow + residual.

The effects of year and month therefore act together: the effect of JAN is not the same
for all years. The effect of JAN in 2001 differs from that of JAN in 2002 and JAN in 2003.
Thus, an interaction effect needs to be created and used in the model. To create this
factor, the interaction() function in R is used.
_________________________________________________________
ymf = interaction(pigs$year, pigs$month, drop=TRUE)

modelB = lm(y ~ fpar + ymf, data = pigs)
_________________________________________________________
The results were as follows:
_________________________________________________________
anova(modelB)

              df        SS       MS       F      Pr(>F)
fpar           3     60.68    20.23    11.4    .0000004
ymf           17    160.07     9.42     6.3    1.27e-12
residual     291    433.53     1.49
_________________________________________________________
_________________________________________________________
summary(modelB)

Call:
formula = y ~ fpar + ymf, data=pigs

Residuals:
    min          mean          max
   -.78   ....   -.02   ....   +.74

Coefficients:
              estimate      SE    t-value     Pr(>t)
(Intercept)       8.33     .24       .003    .000001
fpar2             1.13     .22       .017    .001328
fpar3              .11     .21       .452    .450116
fpar4             -.03     .21       .886    .882350
ymf2               .12     .21       .271    .327964
ymf3               .23     .21       .274    .316023
   .                .        .          .         .
------------------------------------------------
Residual SE            1.221
Multiple R-squared     .3374      Adjusted R2 = .2942
F-statistic   7.431 on 20, 291 df    p-value 2.2e-16
_________________________________________________________
The multiple R-squared for modelB was greater than that of modelA, and therefore,
modelB would be better to use for a final analysis.
The residual SE for modelB was smaller than that of modelA, and therefore, modelB
would be a better model.
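Because modelA is nested within modelB (the year-month combinations contain the separate year and month effects), the two fits can also be compared directly in R; a minimal sketch:
_________________________________________________________
# F-test comparing the nested models, plus an information criterion
anova(modelA, modelB)
AIC(modelA, modelB)
_________________________________________________________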
The statistical result should agree with the researchers’ intuition and biological results
as well. Does the model make any sense? Can the model be defended on a biological basis
as well as the statistical basis?
Model development and improvement is a process that takes many attempts and
careful interpretations.
8 Regression Variables
A regression variable (also called a covariate), is one that has a particular relationship
with the observations. One example is the relationship between height at the shoulders
(of a dairy cow) and the weight of the animal. Another is the heart girth (circumference
around the midsection of the cow) and the weight of the cow. If you know the heart girth,
then you can reliably predict the weight, or vice versa. Suppose the model is
$$\text{Weight} = \text{Intercept} + b_1\,\text{Heartgirth} + b_2\,\text{Height} + e,$$
where $b_1$ and $b_2$ are regression coefficients. Let girth be a vector of girth measurements (in cm), height be a vector of heights at the shoulders, and y be a vector of weights
(kg) as labelled in the data frame called cows. The way to analyze this model in R is
_________________________________________________________
modelWT = lm(y ~ girth + height, data = cows)
_________________________________________________________
Note that girth and height were not made into factors, as was done with parity,
year, and month in the pigs example.
A regression variable (covariate) only takes up 1 (one) degree of freedom.
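One way to see this in R is to compare the ANOVA of the regression model with one in which girth is (mistakenly) converted to a factor; a sketch, assuming the cows data frame from above:
_________________________________________________________
anova(lm(y ~ girth, data = cows))           # girth uses 1 degree of freedom
anova(lm(y ~ factor(girth), data = cows))   # one df per distinct girth value, minus 1
_________________________________________________________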
9 Excluding the Intercept
The intercept is always included in the lm() function call, but it can be excluded, if
desired. The way to do that is to add -1 as follows:
_________________________________________________________
modelWT = lm(y ~ girth + height - 1, data = cows)
_________________________________________________________
10 Dates
Often the date of recording of an observation, or the date of birth, is available in the
data as a number of the form
yyyymmdd
The model of analysis may require just the year or the month or the month converted into
a season. Thus, dates have to be manipulated to obtain what is needed for the model.
Let calve represent the calving date of a cow, and the model needs to have a season
of calving effect in it where there are four seasons per year (every three months).
_________________________________________________________
# extract the year from the calving date
year = as.integer(calve/10000)

# extract the month from the calving date
month = as.integer((calve - year*10000)/100)

# define the seasons (1 = JAN-FEB-MAR, 2 = APR-MAY-JUN, ...)
ch = c(1,1,1,2,2,2,3,3,3,4,4,4)
season = ch[month]
_________________________________________________________
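As an illustrative example (hypothetical date), a cow calving on 15 February 2003 would be recorded as 20030215, and the code above would give:
_________________________________________________________
calve = 20030215
year = as.integer(calve/10000)                  # 2003
month = as.integer((calve - year*10000)/100)    # 2
ch = c(1,1,1,2,2,2,3,3,3,4,4,4)
season = ch[month]                              # 1 (JAN-FEB-MAR)
_________________________________________________________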