Writing and Developing Linear Models

1 Introduction

A statistical model attempts to describe reality based upon variables that are observable. Statistical models are used to analyze all kinds of data. There are three parts to every model.

Part 1 is an equation in which the observation on a trait is described as being influenced by a list of factors (in an additive manner). The equation is written as

    y_ijkl = µ + A_i + B_j + C_k + ... + e_ijkl,

where
    y_ijkl is the observation on a trait of interest,
    µ is the overall mean of the population,
    A_i is the effect of factor A, level i, on the trait of interest,
    B_j is the effect of factor B, level j, on the trait of interest,
    C_k is the effect of factor C, level k, on the trait of interest, and
    e_ijkl is a residual effect composed of all factors not observed.

The equation could contain any number of factors that influence the observed trait value. What are A, B, and C? Suppose y is the score of a dog at an obedience trial. Factor A could be the breed of dog, factor B could be the judge, and factor C could be the handler or trainer. Other factors could include the gender of the dog, the number of hours of training, the number of previous obedience trials the dog has participated in, the conditions within the ring during the trial (noise and temperature conditions), and the number of competitors.

Part 2 of a model is an indication of which factors are fixed or random (see later). If a factor is random, then it is assumed to be a variable that is sampled from a population that has a particular mean and variance, and the mean and variance should be specified. Determining whether a factor is fixed or random is not always easy, and takes experience in data analysis.

Part 3 of the model is a list of all implied or explicit assumptions or limitations about the first two parts. This part is often missing, but it is important for judging the quality of the analysis. The best way to explain Part 3 is to give an example model.
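As a concrete (and entirely hypothetical) illustration of Part 1, the R sketch below simulates observations from an additive equation with two factors. The effect sizes, sample size, and residual standard deviation are invented for the example and are not from the text:

```r
# Simulate y = mu + A_i + B_j + e for a made-up trait.
set.seed(1)
mu <- 100                                # overall mean of the population
A  <- c(5, -5)                           # hypothetical effects of factor A (2 levels)
B  <- c(2, 0, -2)                        # hypothetical effects of factor B (3 levels)
n  <- 60
ai <- sample(1:2, n, replace = TRUE)     # level of factor A for each record
bj <- sample(1:3, n, replace = TRUE)     # level of factor B for each record
e  <- rnorm(n, mean = 0, sd = 3)         # residual: all factors not observed
y  <- mu + A[ai] + B[bj] + e             # the simulated observations
```

Fitting lm(y ~ factor(ai) + factor(bj)) to data generated this way would recover estimates close to the chosen effects.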
2 Model for Weaning Weights of Beef Calves

Picture yourself as a beef calf and then try to think of the factors that would influence your growth and eventual weaning weight. For example,

    y_ijklm = A_i + B_j + X_k + HYS_l + c_m + e_ijklm,

where
    y_ijklm is a weaning weight on a calf,
    A_i is an age of dam effect (in years: 2, 3, 4, or 5 and greater),
    B_j is a breed of calf effect,
    X_k is a gender of calf effect (male or female),
    HYS_l is a herd-year-season of birth effect, with three seasons per year (i.e. Nov-Feb, Mar-Jun, and Jul-Oct),
    c_m is a calf additive genetic effect, and
    e_ijklm is a residual effect.

The fixed factors are age of dam, breed of calf, and gender of calf. Herd-year-season effects, calf additive genetic effects, and residual effects are random. Instead of stating that the variance of calf additive genetic effects, for example, is 3000 kg^2, one could just say that this variance is 0.35 of the total variance, and that herd-year-season effects comprise 0.15 of the total variance. The variance of residual effects is the remaining 0.50 of the total. The means of the random effects are usually assumed to be zero. Calves could be related to each other through a common sire and/or related mothers, so the analysis should take these relationships into account.

Part 3 of the model lists the assumptions and limitations of the data and model equation.

1. There are no interactions between age of dam, breed of calf, or gender of calf.
2. The weaning weights have been properly adjusted to a 200-day age of calf.
3. There are no maternal effects on calf weaning weights.
4. Age of dam is known.
5. All calves in the same herd-year-season were raised and managed in the same manner.

A researcher would discuss the consequences of each assumption if it were not true.
For example, if interactions among the fixed factors exist, then using this model might give biased estimates of age of dam, breed, and gender of calf effects, which might in turn bias the estimates of calf additive genetic effects. However, So and So (1929) showed that interactions were negligible. (Note: this article would be considered too old to be used as a reference in 2006.)

Maternal effects are known to exist for weaning weights. Thus, the model should be changed by adding a maternal genetic effect of the dam: the equation is revised, maternal genetic effects become another random factor, and the proportions of the total variance need to be revised. There is also a genetic correlation between calf additive genetic effects and maternal genetic effects. (This is discussed more in the notes on Maternal Genetic Effects.)

The last assumption may not be true in some herds, because owners sometimes separate male and female calves earlier than weaning. Also, some herds may be very large, so there could be more than one management group within a herd-year-season. From the recorded data, this fact may not be obvious unless producers correctly fill in the management group codes.

For this course, students should be able to write an equation of the model (subscripts not necessary) in words, e.g.

    Wean. Wt. = Age of dam + Breed + Gender + HYS + Calf + residual.

Then indicate the fixed and random factors, give the proportion of total variance for each random factor, and then make a good attempt at Part 3.

3 Model Building

Developing an appropriate linear statistical model is best accomplished in discussions with other scientists. Full awareness of models that have been published in the literature for a particular species and trait is important. Model building, in the beginning, is a trial and error ordeal. The Analysis of Variance was created to allow factors in models to be tested for their significance.
Factors that are significant should be in the model (for genetic evaluation). Sometimes factors that are not significant in your data, but which have consistently been important in previous studies, should also be included in the model. As more data accumulate, the model may need to be re-tested and refinements could be made. A genetic evaluation model will likely be used many times per year and over years. Therefore, scientists should be open to making improvements to their models as new information becomes available.

4 Practice Models

Write a linear statistical model for one or more of the following cases. A similar case will be given on the mid-term exam.

Case 1. Body condition scores of cows during the lactation are assigned by the owner (from 1 to 5 in half increments: 1, 1.5, 2, 2.5, ...), where 1 is very thin and lacking in condition, and 5 is very obese. A farmer has body condition scores on all cows every 30 days during the year.

Case 2. Beef bulls, at weaning, go to test stations for a 112-day growth test, and the best bulls at the end of test are sold to beef producers in an auction. Growth, feed intake, and scrotal circumference are measured every 2 weeks during the test period. Write a model for either growth, feed intake, or scrotal circumference to evaluate the beef bulls. There are data from many test stations over the last 10 years. Several breeds and crossbreds are involved in the tests.

Case 3. Weight and length at two years of age in Atlantic cod are important growth traits. Fish are individually identified with pit tags. Fish are reared in tanks at a research facility with the capability of controlling water temperature and hours of daylight. Tanks differ somewhat in size and number of fish.

Case 4. Income from milk sales minus expenses for feed, breeding, and health problems from one calving to the next is available on many herds of dairy cows.
Call the difference cow profit and write a model to analyze this trait for cows finishing their first lactation.

Case 5. A reproductive physiology study collected statistics on semen volume, sperm motility, and number of sperm per ejaculate on stallions from one year to ten years of age (on the same horses - a long term study) to see how semen characteristics change with age.

Case 6. Canadian Warmblood horses are raised for dressage and jumping. Mares can be sent to a central location for a brief training (breaking) period and are scored for a number of traits, such as gait and movement. Three experts score the horses, as well as two riders, and the results are combined into a weighted average.

Case 7. Horses differ in their reactions to insect bites. A veterinarian observed horses that had been bitten by horse flies and rated the areas around the insect bites as mild to severe. Horses were from many ranches and were observed over the course of three summers in Ontario. Some horses were observed in each year.

5 Testing Factors in a Model

Below are a few example records (out of 311 total records) in a data frame called "pigs".

Litter size (LS) of sows.

    Sow ID   parity   year   month   LS
    AXL82A   2        2002   FEB     10
    AXL33A   2        2001   JAN      9
    AXL27B   1        2001   JUN     10
    BAS99Y   4        2003   MAY     11
    BAS63A   2        2002   APR     12
    ...

The first model to explore for these data is

    LS = parity + year + month + sow + residual.

The "sow" factor will definitely be included in the final model because the estimated breeding values of sows are of interest. The value or significance of the other factors needs to be tested. Testing is done with the Analysis of Variance table, also called the ANOVA or AOV. Every ANOVA table has 3 basic rows, as shown below.

Basic ANOVA table.

    Source        df     SS        MS       F-value   Pr(>F)
    1) Total      N      SST
    2) Model      p      SSM       SSM/p    F
    3) Residual   N-p    SST-SSM   MSE

The "Total" Sum of Squares is the sum of each litter size observation squared, and N is the total number of observations (in this case N = 311).
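The quantities in the basic table can be computed directly from their definitions. The sketch below uses a small simulated data set (the pigs records themselves are not reproduced here, and the factor and means are invented) and checks the residual sum of squares against R's own lm() fit. Note that SST and SSM here are the uncorrected sums of squares, as defined in these notes:

```r
set.seed(2)
# Hypothetical stand-in data: one factor with 3 levels, 10 records each
f <- factor(rep(1:3, each = 10))
y <- rnorm(30, mean = c(8, 10, 12)[f], sd = 1)

N <- length(y)
X <- model.matrix(~ f)                     # design matrix: intercept + factor levels
p <- ncol(X)                               # model degrees of freedom
b <- solve(crossprod(X), crossprod(X, y))  # least-squares solutions

SST <- sum(y^2)                 # 1) Total SS: each observation squared
SSM <- sum((X %*% b)^2)         # 2) Model SS: fitted values squared
SSE <- SST - SSM                # 3) Residual SS: Total minus Model
MSE <- SSE / (N - p)
Fmodel <- (SSM / p) / MSE       # F-value for the model
```

SSE here matches sum(resid(lm(y ~ f))^2), since both come from the same least-squares fit.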
The "Model" Sum of Squares has another pre-defined formula for calculation, but should always be smaller than the Total SS. The degrees of freedom of the model is p, where p is the number of parities in the data PLUS the number of years PLUS the number of months (according to the factors in the model) MINUS the number of factors PLUS one. "MS" stands for Mean Square, and is the Sum of Squares DIVIDED by the degrees of freedom.

The "Residual" Sum of Squares is the Total Sum of Squares MINUS the Model Sum of Squares. The degrees of freedom is N - p. MSE is (SST - SSM) divided by N - p.

The "F-value" is computed only for the Model Sum of Squares, and is equal to

    F(model) = (SSM/p) / MSE.

The last column gives the probability of observing an F-value greater than the one that was observed. The smaller this probability, the more significantly important is that sum of squares. Usually any probability less than 0.05 is considered significant; less than 0.01 is highly significant, and so on. These are computed by the software that is used. Most statistical software packages provide these three lines.

The model sum of squares is almost always significant, if researchers are good at writing a model. Of greater interest are tests about the separate factors in the model. Thus, the Model Sum of Squares is broken down or partitioned into separate sums of squares for each factor. For the example, the ANOVA for the litter size model would have 3 additional lines, as shown below.

Basic ANOVA table, with the model partitioned by factor.

    Source         df     SS         MS            F-value    Pr(>F)
    1) Total       N      SST
    2) Model       p      SSM        SSM/p         Fmodel
      2a) Parity   pa     SSParity   SSParity/pa   Fparity
      2b) Year     py     SSYear     SSYear/py     Fyear
      2c) Month    pm     SSMonth    SSMonth/pm    Fmonth
    3) Residual    N-p    SST-SSM    MSE

6 ANOVA in R

The lm or "linear model" function in R can be used to generate an ANOVA. First, the factor() function needs to be used.
```r
# Make factor variables for parity, year, month
fpar = factor(pigs$parity)
fyr  = factor(pigs$year)
fmo  = factor(pigs$month)
y    = pigs$LS

modelA = lm(y ~ fpar + fyr + fmo, data = pigs)
```

The lm() function may take some time to execute depending on the amount of data and the complexity of the model. The function generates a lot of information that could be useful to the researcher. The str() (structure) command gives a list of the information that is generated by lm():

```r
str(modelA)
```

    $coefficients
    $residuals
    $rank
    $df.residual
    $xlevels
    $call
    $terms
    $model

The results of most interest for this course come from the anova() and summary() functions applied to the fitted model. To view them, enter anova(modelA) or summary(modelA).

```r
anova(modelA)
```

                df     SS      MS     F     Pr(>F)
    fpar         3    60.68   20.23  11.4   .0000004
    fyr          2    19.01    9.51   5.4   .00511
    fmo          5    41.63    8.33   4.7   .00037
    residual   301   532.95    1.77

Notice that this table does not contain the Total Sum of Squares, because the Total Sum of Squares is usually not of any interest. The Model Sum of Squares is omitted for the same reason. In the above example, all three factors (parity, year, and month) are highly significant.

```r
summary(modelA)
```

    Call:
    formula = y ~ fpar + fyr + fmo, data=pigs

    Residuals:
        min    mean    max
       -.82     .02   +.76

    Coefficients:
                 estimate    SE    t-value   Pr(>t)
    (Intercept)     9.27    .24      .003    .000001
    fpar2           1.08    .22      .017    .001328
    fpar3            .08    .21      .442    .540116
    fpar4            .00    .21      .899    .982350
    fyr2002          .52    .18      .261    .357988
    fyr2003          .53    .18      .261    .358022
    fmo2             .28    .26      .335    .501665
    fmo3            -.44    .28      .309    .499756
    fmo4             .63    .26      .274    .367139
    fmo5             .60    .27      .288    .373232
    fmo6             .04    .27      .807    .863421
    ------------------------------------------------
    Residual SE 1.331
    Multiple R-squared .1854    Adjusted R-squared .1584
    F-statistic 6.852 on 10 and 301 df, p-value 1.19e-09

The summary() function for a model gives the "estimates" of the levels of the factors in the model. The intercept is similar to the overall mean of the data, in this case 9.27 piglets per litter for sows in first parity, from year 2001 and the month of JAN. "fpar2" is 1.08 and means that sows farrowing in parity 2 gave 1.08 piglets more per litter than sows farrowing in parity 1. Similarly, parity 3 sows gave only 0.08 piglets more than parity 1 sows. Sows farrowing in 2002 gave .52 more piglets than sows that farrowed in 2001. The months are likewise compared to sows farrowing in JAN.

SE is the standard error of the estimate. The t-value is similar to the F-value in the ANOVA except that it has only 1 degree of freedom for the numerator. Lastly, Pr(>t) is interpreted as in the ANOVA, where smaller values indicate more significant differences. The residual SE is the square root of the residual MS from the ANOVA.

The multiple R-squared is a useful statistic for comparing different models; higher values are better. Values closer to 0.5 would be more desirable, and mean that the model better explains the data. The adjusted R-squared is the multiple R-squared adjusted for the amount of data that was available for the analysis.
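The reading of fpar2 as a deviation from the parity-1 baseline is a consequence of R's default treatment contrasts. A small sketch with made-up numbers confirms that a factor-level coefficient equals the difference between that group's mean and the baseline group's mean:

```r
# Two groups with known means 10 and 12 (made-up values)
g <- factor(rep(c("p1", "p2"), each = 5))
y <- c(9, 10, 11, 10, 10,      # group p1, mean 10
       11, 12, 13, 12, 12)     # group p2, mean 12
m <- lm(y ~ g)
coef(m)[["(Intercept)"]]       # 10, the baseline (p1) group mean
coef(m)[["gp2"]]               # 2, the p2 minus p1 difference
```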
The more data that are analyzed, the less the adjusted R-squared should fall below the multiple R-squared. The F-statistic is for the model as a whole and, as mentioned earlier, should always be significant if the model is reasonable.

7 A Second Model

A nice feature of R is that the model can be changed very quickly and a different ANOVA can be easily generated. Suppose a new model is proposed as follows:

    LS = parity + year-month + sow + residual.

Thus, the effects of year and month act together. The effect of JAN is not the same for all years: the effect of JAN in 2001 is different from JAN in 2002 and JAN in 2003. Thus, an interaction effect needs to be created and used in the model. To create this factor, the interaction() function in R is used.

```r
ymf = interaction(pigs$year, pigs$month, drop=TRUE)

modelB = lm(y ~ fpar + ymf, data = pigs)
```

The results were as follows:

```r
anova(modelB)
```

                df     SS      MS     F     Pr(>F)
    fpar         3    60.68   20.23  11.4   .0000004
    ymf         17   160.07    9.42   6.3   1.27e-12
    residual   291   433.53    1.49

```r
summary(modelB)
```

    Call:
    formula = y ~ fpar + ymf, data=pigs

    Residuals:
        min    mean    max
       -.78    -.02   +.74

    Coefficients:
                 estimate    SE    t-value   Pr(>t)
    (Intercept)     8.33    .24      .003    .000001
    fpar2           1.13    .22      .017    .001328
    fpar3            .11    .21      .452    .450116
    fpar4           -.03    .21      .886    .882350
    ymf2             .12    .21      .271    .327964
    ymf3             .23    .21      .274    .316023
    ...
    ------------------------------------------------
    Residual SE 1.221
    Multiple R-squared .3374    Adjusted R-squared .2942
    F-statistic 7.431 on 20 and 291 df, p-value 2.2e-16

The multiple R-squared for modelB was greater than that of modelA; therefore, modelB would be better to use for a final analysis. The residual SE for modelB was also smaller than that of modelA, again indicating that modelB is the better model. The statistical result should agree with the researchers' intuition and biological results as well. Does the model make any sense? Can the model be defended on a biological basis as well as a statistical basis? Model development and improvement is a process that takes many attempts and careful interpretation.

8 Regression Variables

A regression variable (also called a covariate) is one that has a particular relationship with the observations. One example is the relationship between height at the shoulders (of a dairy cow) and the weight of the animal. Another is heart girth (the circumference around the midsection of the cow) and the weight of the cow. If you know the heart girth, then you can reliably predict the weight, or vice versa. Suppose the model is

    Weight = Intercept + b1 * Heartgirth + b2 * Height + e,

where b1 and b2 are regression coefficients. Let girth be a vector of girth measurements (in cm), height be a vector of heights at the shoulders, and y be a vector of weights (in kg), as labelled in the data frame called cows. The way to analyze this model in R is

```r
modelWT = lm(y ~ girth + height, data = cows)
```

Note that girth and height were not made into factors, as was done with parity, year, and month in the pigs example. A regression variable (covariate) takes up only 1 (one) degree of freedom.
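The one-degree-of-freedom claim can be seen directly in the ANOVA of a regression model. The sketch below uses simulated girth and height values with made-up slopes (the real cows frame is not reproduced here):

```r
set.seed(4)
# Hypothetical measurements for 25 cows
girth  <- runif(25, 150, 210)                          # cm
height <- runif(25, 120, 150)                          # cm
weight <- 5 * girth + 2 * height + rnorm(25, sd = 10)  # kg, invented slopes

mWT <- lm(weight ~ girth + height)
anova(mWT)$Df      # girth and height each take 1 df; the residual has 22
```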
9 Excluding the Intercept

The intercept is always included in the lm() function call, but it can be excluded, if desired, by adding -1 to the formula:

```r
modelWT = lm(y ~ girth + height - 1, data = cows)
```

10 Dates

Often the date of recording of an observation, or a date of birth, is available in the data as a number of the form yyyymmdd. The model of analysis may require just the year, or the month, or the month converted into a season. Thus, dates have to be manipulated to obtain what is needed for the model. Let calve represent the calving date of a cow, and suppose the model needs a season of calving effect with four seasons per year (every three months):

```r
# extract the year from the calving date
year = as.integer(calve/10000)

# extract the month from the calving date
month = as.integer((calve - year*10000)/100)

# define the seasons (1 = JAN-FEB-MAR, ...)
ch = c(1,1,1,2,2,2,3,3,3,4,4,4)
season = ch[month]
```
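The date arithmetic above can be verified on a single value. For a made-up calving date of 20031115, the year should come out as 2003, the month as 11, and the season as 4:

```r
calve <- 20031115
year  <- as.integer(calve / 10000)                 # 2003
month <- as.integer((calve - year * 10000) / 100)  # 11
ch <- c(1,1,1, 2,2,2, 3,3,3, 4,4,4)                # JAN-MAR = 1, ..., OCT-DEC = 4
season <- ch[month]                                # 4 (a November calving)
```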