Selection intensity and order statistics for breeding

Dag Lindgren, ?????

To this document is associated an EXCEL workbook (.XLS) and a Mathcad file (.CAD). When working with this draft, try to keep these in the same directory.

Introduction

Response to selection can be expressed as

    R = i·h²·σ_P = i·h·σ_A

where R is the response to selection, that is the difference in mean before and after selection, h² the heritability, σ_P² the phenotypic variance, σ_A² the variance of breeding values (the additive genetic variance) and i the selection intensity. The selection intensity can be regarded as a standardized value of the difference before and after selection (Falconer and Mackay 1996). Geneticists and breeders routinely use the selection intensity for predictions of genetic gain. To predict the selection intensity, the underlying density function must be known; usually a normally distributed variable is assumed. Methods to deal with such numeric calculations are needed, and for more advanced considerations a collection of formulas and methods related to the concept is needed. This paper tries to meet the need to compile such formulas and methods. The subject was earlier compiled by Lindgren and Nilsson (1985), and many of their formulations and presentations remain. For more citations of earlier efforts, and for extended tables with numerical values, that report can be consulted.

Model and assumptions

Let X be a continuous random variable with probability density function f and distribution function F. A standardized probability density function with mean 0 and variance 1 will be assumed when applicable. Unless the contrary is indicated, a normal distribution will be assumed. A sample of size n is obtained by unrestricted random sampling from an infinite population of values with distribution function F; j values are selected and the others rejected. The expected mean of the selected values is the selection intensity. The sample may be finite or infinite.
The most frequent application of selection intensity concerns truncation selection (it may also be characterized as directional selection or censorship). This means that the j largest values of the sample are selected, or, in the infinite case, the corresponding proportion of the values, p, above a truncation point, t. The concepts for the infinite case are illustrated in Figure 1.

DAG LINDGREN, SLU, CREATED 1998-08-09; UPDATED 08-16; PRINTED: 2017-05-06; FILE: 840982932

[Figure 1 near here: probability density f(x) with the truncation point t and the selection intensity i marked.]

Figure 1. Truncation selection in an infinite population. It is possible to measure x. All individuals with values of x above t (the truncation point) are selected, and those below the truncation point rejected. The probability density function of x is f(x) and the distribution function is F(x), which can be expressed F(t) = ∫_{-∞}^{t} f(x)dx. The proportion selected is denoted p, which can be expressed p = 1 − F(t).

Value formula for selection intensity following truncation selection, infinite case

The selection intensity (i_t) following truncation selection at X = t in a sample of infinite size is

    i_t = ∫_{t}^{∞} x·f(x)dx / (1 − F(t))

i_t is the mean value of the selected values in standardized terms. In the standardized normal distribution case

    f(x) = φ(x) = e^(−x²/2)/√(2π)
    F(t) = Φ(t) = ∫_{-∞}^{t} φ(x)dx
    Q(x) = 1 − Φ(x)
    i_t = φ(t)/(1 − Φ(t)) = φ(t)/Q(t)

Note that for the normal case i_t can be expressed more simply than for most distributions, as ∫ x·e^(−x²/2)dx can be evaluated in closed form. Usually the selection intensity with the selected proportion as entry, i(p), is required rather than with the truncation point as entry.
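The value formula for the normal case is easy to check numerically. The following sketch (Python is used here for illustration; it is not part of the original paper, and the function names are mine) evaluates i_t = φ(t)/(1 − Φ(t)) with the distribution function obtained from the error function:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def intensity_t(t):
    """Selection intensity i_t = phi(t) / (1 - Phi(t)) for truncation at t."""
    return phi(t) / (1 - Phi(t))

# Truncation at the mean (t = 0) selects half the population;
# i_0 = 2*phi(0) = sqrt(2/pi), approximately 0.798.
print(intensity_t(0.0))
```

For t = 0 the closed form √(2/π) provides a convenient accuracy check of any implementation.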
For the normal case there is no simple formula for that, as there is no simple formula for Φ(x); some iterative method has to be used where Φ(x) is needed. It can be calculated with arbitrary accuracy. Methods compiled by Zelen and Severo (1964) can be used; some methods are developed in more detail by Lindgren and Bondesson (1987). Many commercial program packages include cumulative normal distribution functions which can be used for selection intensity calculations. E.g. MS EXCEL7 functions gave the selection intensity (with the truncation point as entry) with at least four correct decimals below 2.5, but at around 3.5 there could be an error in the third decimal. Mathcad7 has an inverse cumulative normal distribution function, and can thus find t_p. Using the default precision, the selection intensity could be calculated with at least four correct digits for selected proportions above 10^−10, but it seems possible to make the accuracy arbitrarily high with Mathcad. Standard programs will continue to become more powerful. For own programming, the following rational approximation for finding the truncation point corresponding to a certain proportion, t_p (Bratley et al 1983), can be used rather than methods which require iterative loops and branches:

    t_p = u − (a + b·u + c·u² + d·u³ + e·u⁴)/(f + g·u + h·u² + i·u³ + j·u⁴),  u = √(−2·ln p)

where

    a = 0.322232431088        f = 0.099348462606
    b = 1                     g = 0.588581570495
    c = 0.342242088547        h = 0.531103462366
    d = 0.0204231210245       i = 0.103537752850
    e = 0.0000453642210148    j = 0.0038560700634

The relative accuracy is 6 decimal digits. For selection intensities derived by the formula there were at least four correct decimals for selection intensities below 3. For fast and rough calculations simpler rational approximations can be used.
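The rational approximation above translates directly into code. This is a sketch (Python, function names mine); i(p) then follows as φ(t_p)/p, and for p > 0.5 the symmetry t_p = −t_{1−p} is used, which is my addition rather than part of the cited source:

```python
import math

# Coefficients of the rational approximation (Bratley et al 1983),
# numerator a..e and denominator f..j, in order of increasing power of u.
A = (0.322232431088, 1.0, 0.342242088547, 0.0204231210245, 0.0000453642210148)
B = (0.099348462606, 0.588581570495, 0.531103462366, 0.103537752850, 0.0038560700634)

def truncation_point(p):
    """Approximate t_p such that a proportion p of a standard normal
    distribution lies above t_p. For p > 0.5 the symmetry of the
    normal distribution is used."""
    if p > 0.5:
        return -truncation_point(1 - p)
    u = math.sqrt(-2 * math.log(p))
    num = sum(a * u ** k for k, a in enumerate(A))
    den = sum(b * u ** k for k, b in enumerate(B))
    return u - num / den

def intensity_p(p):
    """Selection intensity i(p) = phi(t_p) / p."""
    t = truncation_point(p)
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi) / p
```

As a check, truncation_point(0.5) should be very close to 0 and truncation_point(0.025) close to the familiar 1.960.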
A direct approximation of the selection intensity was made by Saxton (1988),

    i_p ≈ (2.97425 − 3.35197·p^0.2 − 1.9319·p^0.4 + 2.3097·p^0.6) / (0.51953 + 0.88768·p^0.2 − 2.38388·p^0.4 + p^0.6)

cited from Walsh (1999), untested. The appearance of the selection intensity is demonstrated in Figure A2.

[Figure A2 near here: i(p) = φ(t_p)/p plotted against the selected proportion; annotations mark a steep but almost undetectable increase near p = 1, the inflexion point, and the steep increase for extreme p.]

Figure A2. The selection intensity as a function of the selected proportion assuming a normal distribution. The curve starts at 0 when all are selected, increases monotonically (starting straight upwards), passes an inflexion point at p = 0.71 and approaches infinity when the selected proportion becomes small.

The selection intensity value derived for a selected proportion in an infinite population will, however, constitute a poor approximation of the selection intensity for the same selected proportion in a finite population (cf Figure A3). If the selected number is small, the error can be considerable.

[Figure A3 near here: upper curve, the best fraction p of values selected from a very large population; lower curve, the best value selected from a sample of 1/p.]

Figure A3. The selection intensity as a function of the proportion of a truncated normal distribution. Two cases are demonstrated: either the best value from a sample is selected (lower curve) or the best proportion from an infinite population (upper curve). The scale of the X-axis is logarithmic; in this scale the expressions approach the parabola √(−2·ln p) when p becomes small.

Order statistics and its use for breeding theory studies

The expected value of the j:th largest observation from a sample of size n, denoted ξ(j,n), can be calculated as

    ξ(j,n) = n!/((n−j)!·(j−1)!) · ∫ x·(1 − F)^(j−1)·F^(n−j)·f dx
The coefficient of this formula may be realized from the fact that n ranked objects can be arranged in n! ways; of these permutations, (n−j)!·(j−1)! are equivalent concerning which element is the j:th ranking one, as the j−1 higher-ranking and the n−j lower-ranking elements can be permuted freely. The integrand can be interpreted as the probability density that the j:th ranking value lies at x, i.e. that j−1 values fall above x and n−j below. These values can be seen as the selection intensity for an individual selected on its rank. Thus, in theoretical breeding studies it is a useful technique to use order statistics for individual values. It has the advantage that the values are typical and the underlying parameters are known. It is also possible to deal with the corresponding situations by Monte Carlo simulation of values, but that is technically more complicated, and sometimes computing time is limiting, although the simulations are more reliable, as they will catch the variance around the typical and not just the typical. The technique of using expected order statistic values has been used e.g. by Lindgren et al (1989), Hodge and White (1993), Ruotsalainen and Lindgren (1998) and Wei (199?).

An algorithm for the calculation of exact expected order statistic values was given by Royston (1982). Values are tabulated by Harter (1970). When a numeric evaluation of expression AX is made, limited accuracy can cause considerable problems and large numeric errors even if the program code is correct; alertness against this is strongly recommended.

Value formula for selection intensity following truncation selection, finite case

If n is finite but large and j not close to 1 or n, the formulas for the infinite case with p = j/n will be reasonable approximations, but if high accuracy is desired the demands on size are considerable.
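The order-statistic integral above can be evaluated by direct numerical integration. The following sketch (Python; a simple midpoint rule over a wide interval is my choice for illustration, not the method of the cited algorithms) computes ξ(j,n) for a standard normal distribution:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def xi(j, n, lo=-8.0, hi=8.0, steps=4000):
    """Expected value of the j:th largest of n standard normal values:
    xi(j,n) = n!/((n-j)!(j-1)!) * integral of x (1-F)^(j-1) F^(n-j) f dx,
    here integrated by the midpoint rule over [lo, hi]."""
    coef = math.factorial(n) / (math.factorial(n - j) * math.factorial(j - 1))
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += x * (1 - Phi(x)) ** (j - 1) * Phi(x) ** (n - j) * phi(x)
    return coef * total * h
```

For example, xi(1, 5) should reproduce the value 1.1630 in Table A6, and by symmetry the median of five, xi(3, 5), should be 0.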
The expected mean value i(j,n) of the j largest observations from a sample of size n can be calculated as

    i(j,n) = (ξ(1,n) + ξ(2,n) + ... + ξ(j,n)) / j

A useful approximation was constructed by Burrows (1972) by expanding the mean of a truncated distribution in a Taylor series. It has the following form, with p = j/n:

    i_B(j,n) = i(p) − (n − j)/(2·j·(n + 1)·i(p))

For the normal probability density case Lindgren and Nilsson (1985) studied the accuracy. The error is rather constant independent of n for n > 10 and decreases with j roughly as 1/j². The error is less than 0.0005 when j > 7 but is of magnitude 0.02 when j = 1. Some examples of the size of the error are given in Table AX. It is suggested that Burrows' approximation is used for all j if an error of 0.025 (less than 5% of the selection intensity) is acceptable, for j > 2 if an error < 0.01 is desired, and for j > 6 if an error < 0.001 is desired. Bulmer (1980) suggested an alternative finite-sample approximation with p replaced by

    p = (j + 0.5)/(n + j/(2n))

cited from Walsh 1999, check!

[Figure A4 near here: selection intensity plotted against the number selected, j, for selected proportions 0.5, 0.2, 0.1, 0.01 and 0.001.]

Figure A4. Illustration of the dependence of selection intensity on the selected number. Note that at a given proportion, the selection intensity is lower if the selected number is smaller, thus if the selection is made from a smaller sample.

Variances of normal order statistics

The j:th ranking value in a sample from some density distribution will not be identical for every sample: it has not only an expectation, but also a variance around that expectation. Normal order statistics are thus associated not only with expectations but also with variances. The variance of order statistics for small samples was analysed and tabulated by Pearson and Hartley (19??).
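Burrows' approximation from the preceding section is simple to implement. The following sketch (Python; the bisection-based i(p) is my choice, names hypothetical) computes i_B(j,n) and can be compared with the exact values in Table A6:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def intensity_p(p):
    """Infinite-population i(p): solve 1 - Phi(t) = p by bisection,
    then return phi(t)/p."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 1 - Phi(mid) > p:
            lo = mid        # too much selected: move truncation point up
        else:
            hi = mid
    t = (lo + hi) / 2
    return phi(t) / p

def burrows(j, n):
    """Burrows' (1972) approximation of i(j,n), with p = j/n."""
    p = j / n
    ip = intensity_p(p)
    return ip - (n - j) / (2 * j * (n + 1) * ip)
```

Comparing with Table A6, burrows(2, 10) is very close to the exact 1.2701, while burrows(1, 10) misses the exact 1.5388 by roughly 0.02, in line with the stated error magnitude for j = 1.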
Although the expectation of a sum of order statistics is the sum of the individual expectations, the corresponding is not true for the variance of a sum of values with certain ranks. If some top-ranking values in a sample are high, other top-ranking values also tend to be high. Thus the sum of high-ranking values tends to be more variable than the sum of the individual variances suggests. The covariances of the order statistics are presented for some cases by Pearson and Hartley ( ). For the simple case of the sum of two ranked values from a normal sample, the variance (thus the variance of the achieved selection intensity) can be calculated as the sum of the variances of the corresponding values plus twice their covariance.

Symmetry relation of the selection intensity

Knowing i(1−p), i(p) of a symmetric density function may be calculated utilizing

    i(p) = ((1 − p)/p) · i(1 − p)

For the finite case the expression takes the form

    i(j,n) = ((1 − j/n)/(j/n)) · i(n−j, n)

These relations may be realized by considering the fact that the change in mean due to selection, multiplied by the proportion selected, is the same for the selected and the non-selected parts of the population.

Selection intensity for intermediary fractions

The selection intensity of some intermediary fraction can be evaluated in terms of the selection intensities and sizes of the fractions above it, including and excluding the fraction itself. Let p1 be the proportion belonging to the best fraction and p2 the proportion belonging to the second best fraction. The selection intensity of the best fraction is i1 = i(p1). The selection intensity of the second fraction, its average in standardized terms, is i2; the problem is to find a formula for it.
An average is the sum of (proportions × values) divided by the total proportion. i(p1 + p2) can thus be calculated as an average of the two fractions, but also in the known way, so

    i(p1 + p2) = (i1·p1 + i2·p2)/(p1 + p2)

    i2 = ((p1 + p2)·i(p1 + p2) − p1·i(p1))/p2

There are formal (but hardly real) difficulties when p1 or p2 approaches zero; it would be better with a formulation which avoided these difficulties, but I cannot find one. In integral form the average of an intermediary fraction can be written

    i2 = ∫_{t_{p1+p2}}^{t_{p1}} x·f(x)dx / p2

To make computations involving selection (and selection intensity in particular), one can regard the probability density function as composed of a number of intervals and then integrate over each interval:

    ∫_{t_0}^{t_n} x·f(x)dx = Σ_{j=0}^{n−1} ∫_{t_j}^{t_{j+1}} x·f(x)dx,  t_0 < t_1 < ... < t_n

Asymptotes

For a normal distribution

    i_t → t as t → ∞,  i(p) → √(−2·ln p) as p → 0

However, the convergence of these expressions is too slow to be useful. The selection intensity for the top-ranking of n values taken from a normal distribution has a similar asymptotic expression,

    i(1,n) → √(2·ln n) as n → ∞

Derivatives of selection intensity

James (1976) gave the following derivatives (without any particular assumption concerning the distribution)

    di/dp = (t − i)/p

    di/dt = (i − t)·f(t)/p

In the special case of a normal distribution

    di/dt = (i − t)·i

Variance of a truncated normal distribution

The variance (V_t) of a truncated density function is necessarily less than the variance of the non-truncated function. The variance within the selected part following directional truncation selection at x = t in a normal distribution is

    V_t = 1 − i·(i − t) = 1 − di/dt

The variance of breeding values after truncation selection for phenotypes is reduced by the factor 1 − h²·i·(i − t), where h² is the heritability (cf Robertson 1961).
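The variance formula can be illustrated numerically (a Python sketch; the half-normal comparison is my addition). For truncation at t = 0, V_t = 1 − i·(i − t) should equal the variance of the upper half-normal distribution, 1 − 2/π:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def truncated_variance(t):
    """V_t = 1 - i(i - t) for a standard normal truncated below at t."""
    i = phi(t) / (1 - Phi(t))
    return 1 - i * (i - t)

# Selecting the upper half (t = 0) leaves the half-normal variance 1 - 2/pi.
print(truncated_variance(0.0))
```

The variance shrinks monotonically as the truncation point is raised, consistent with the formula's statement that V_t is always below the untruncated variance of 1.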
Selection intensity for linear deployment

Breeders often deal with populations where all selected members are assumed to give equal contributions. Even if this is useful, the constraints implied easily lead to non-optimal conclusions. The goal of selection can often be expressed as combining selection intensity and the square sum of genetic contributions in an optimal way (e.g. Lindgren 1991); this formulation is useful for the deployment of clones to a seed orchard (Bondesson & Lindgren 1993). Under certain conditions the optimum is attained if the contributions of the genetic units are linearly related to their assumed values (linear deployment). This is true even if there is an upper limit on the contribution. The corresponding selection intensity of linear deployment, if the original genetic values are N(0,1), is presented slightly modified from Lindgren (1991):

    i = (Q(x0) − Q(xt)) / (φ(x0) − x0·Q(x0) − φ(xt) + xt·Q(xt))

where x is the genetic value; individuals are represented in linear proportion to their breeding values if x0 ≤ x ≤ xt, they are not represented at all if x ≤ x0, and they are represented at the upper bound if x ≥ xt.

Free software

This information is from August 1997. SELEINT2.EXE and SELENOR3.EXE perform some tasks described here (for example calculation of order statistics). Selection intensity algorithms are also available as part of some of the offered EXCEL programs. The programs are available at an FTP site on the net, which has its home page at the address http://linne.genfys.slu.se/breed/breed.htm.

Acknowledgement

Literature

Becker WA. Manual of quantitative genetics. Washington State Univ., Pullman, Washington, 186 pp.

Bondesson FL & Lindgren D 1993. Optimal utilization of clones and genetic thinning of seed orchards. Silvae Genetica 42:157-163.

Bratley P, Fox BL & Schrage LE 1983. A guide to simulation. Springer-Verlag, Heidelberg.
Burrows P 1972. Expected selection differentials for directional selection. Biometrics 28:1091-1100.

Hodge GR & White TL 1993. Advanced generation wind-pollinated seed orchard design. New Forests 7:213-236.

James JW 1976. Maximizing a function of the selection differential. Theoretical and Applied Genetics 47:203-205.

Lindgren D & Nilsson J-E 1985. Calculations concerning selection intensity. Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences. Report 5.

Lindgren D 1993. Quantitative comparison between truncation selection and a better procedure. Hereditas.

Lindgren D 1991. Optimal utilization of genetic resources. Forest Tree Improvement 23:49-67.

Lindgren D, Libby WJ & Bondesson FL 1989. Deployment to plantations of numbers and proportions of clones with special emphasis on maximizing gain at a constant diversity. Theoretical and Applied Genetics 77:825-831.

Lindgren D & Bondesson L 1987. Calculation of selection intensity and rankits. Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences. Arbetsrapport 21.

Pearson ES & Hartley HO 19??.

Robertson A 1961. Inbreeding in artificial selection programmes. Genet. Res. Camb. 2:189-194.

Royston JP 1982. Expected normal order statistics (exact and approximate). Applied Statistics 31:131-165. (Corrections in Applied Statistics 32:223-224.)

Zelen M & Severo NC 1964. Probability functions. In: Abramowitz M & Stegun IA (Eds) 1964. Handbook of mathematical functions. National Bureau of Standards Applied Mathematics Series 55:925-995.

Tables

Remark: Nowadays people who need numerical values are usually able to calculate them themselves, at least by the aid of this paper. Numerical values are also published elsewhere. So the main purpose of the published values here is to give an impression of how they vary, and to give values for checking the accuracy of own algorithms.

Table A5.
Order statistics ξ(j,n), the expected value of the j:th ranking individual from a random sample of size n from a population with a normal distribution.

    n          j=1       j=2       j=3       j=4       j=5
    5
    10
    20
    50
    100
    200
    1000
    1000000

Table A6. Selection intensity i(j,n) when selecting the j top values from a random sample of size n from a population with a normal distribution.

    n          j=1       j=2       j=3       j=4       j=5
    5          1.1630    .8290     .5527               0
    10         1.5388    1.2701    1.0654    .8930     .7389
    50         2.2491    2.0520    1.9109    1.7991    1.7055
    100
    1000
    10^6

Table A7. Error in the estimate of the selection intensity if estimated as if the population were infinite. The estimates become too high.

               Error                             Error in % of value
    j       P=0.5     P=0.1     P=0.001      P=0.5     P=0.1     P=0.001
    1       0.23      0.22      0.13         29.29     12.32     3.73
    2       0.13      0.12      0.068        16.88     6.69      2.02
    5       0.059     0.050     0.029        7.39      2.82      0.85
    10      0.030     0.025     0.015        3.74      1.45      0.44
    100     0.0031    0.0026    0.0015       0.39      0.15      0.04
    1000    0.0003    0.0002    0.0001       0.04      0.01      0.00

Table A8. Error in the estimate of the selection intensity if estimated by Burrows' approximation. The estimates are usually too low (indicated by minus signs).

               Error                                       Error in % of value
    j     P=0.5    P=0.1     P=0.001    P=10^-5      P=0.5    P=0.1    P=0.001    P=10^-5
    1     0.025    -0.017    -0.022     -0.017       4.40     4.40     -0.69      -0.39
    2     0.009    -0.005    -0.006     -0.005       1.41     1.41     -0.18      -0.10
    5     0.002    -0.001    -0.001     -0.000       0.27     0.27     -0.03      -0.02

Table A9.
Values for a truncated normal function with the remaining proportion as entry.

    Proportion   Corresponding          φ(t)      Selection         V_t
    p            truncation point t               intensity i(t)
    1            →−∞                    →0        0
    .999
    .99
    .95
    .9
    .7
    .5
    .3
    .2
    .1
    .05
    .02
    .01
    .001
    .0001
    .00001
    .000001

Table A10. Values for a truncated normal function with the truncation point as entry.

    Truncation   Selected proportion     φ(t)        Selection         V_t
    point t      p_t = 1 − Φ(t)                      intensity i(t)
    →−∞          →1                      →0          →0
    −3           .998650                 .004432     .0044
    −2
    −1
    0
    1
    2
    3
    4
    5
    →∞

Table A11.

Possible further tasks

If we set the goal to have an algorithm which gives an error < 0.01, I guess Burrows' approximation is good except when j = 1, but I have not tested that this is true for very low n or for n = infinity. We also need a way to deal with j = 1 and possibly j = 2. Lindgren and Nilsson made a manual on how to get the selection intensity with three correct digits for almost all cases, and a computer program (SELEINT) was written. However, this is too complicated to include as a part of other software, so I would like a manual useful for programming (as well as the actual program).