Selection intensity and order statistics for breeding
Dag Lindgren, ?????
Associated with this document are an EXCEL workbook (.XLS) and a Mathcad file (.CAD). When working with this draft, try to keep these in the same directory.
Introduction
Response to selection can be expressed as

$$R = i h^2 \sigma_P = i h \sigma_A,$$

where $R$ is the response to selection (the difference in mean before and after selection), $h^2$ the heritability, $\sigma_P^2$ the phenotypic variance, $\sigma_A^2$ the variance of breeding values (the additive genetic variance) and $i$ the selection intensity.
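As a numerical illustration (the values are chosen only for this example): with a selected proportion of 10% ($i \approx 1.755$), $h^2 = 0.25$ and $\sigma_P = 10$, the predicted response is $R \approx 1.755 \cdot 0.25 \cdot 10 \approx 4.4$ in units of the trait.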
The selection intensity can be regarded as a standardized value of the difference before and after selection (Falconer and Mackay 1996). Geneticists and breeders routinely use the selection intensity for predictions of genetic gain. To predict selection intensity, the underlying density function must be known; usually a normally distributed variable is assumed. Methods to deal with such numeric calculations are needed, and for more advanced considerations a collection of formulas and methods related to the concept is needed. This paper tries to meet the need to compile such formulas and methods.
The subject was earlier compiled by Lindgren and Nilsson (1985), and many of their formulations and presentations still remain. That paper can be consulted for more citations to earlier efforts and for extended tables with numerical values.
Model and assumptions
Let X be a continuous random variable with probability density function f and distribution function F. A
standardized probability density function with mean 0 and variance 1 will be assumed when applicable. Unless
the contrary is indicated, a normal distribution will be assumed.
A sample of size n is obtained by unrestricted random sampling from an infinite population of values with distribution function F. j values are selected and the others rejected. The expected mean of the selected values is the selection intensity. The sample may be finite or infinite. The most frequent application of selection intensity concerns truncation selection (it may also be characterized as directional selection or censorship). This means that the j largest values of the sample are selected, or, for the infinite case, the corresponding proportion of the values, p, above a truncation point, t. The concepts for the infinite case are illustrated in Figure 1.
[Figure 1 appears here: the standardized normal probability density f(x) plotted against x (from -3 to 4), with the truncation point t and the selection intensity i_t marked.]
Figure 1. Truncation selection in an infinite population. It is possible to measure x. All individuals with values of x above t (the truncation point) are selected, and those below the truncation point rejected. The probability density function of x is $f(x)$ and the distribution function is $F(x)$, which can be expressed $F(t) = \int_{-\infty}^{t} f(x)\,dx$. The proportion selected is denoted p, which can be expressed $p = 1 - F(t)$.
Value formula for selection intensity following truncation selection, infinite case
The selection intensity ($i_t$) following truncation selection at X = t in a sample of infinite size is

$$i_t = \frac{\int_t^{\infty} x f(x)\,dx}{1 - F(t)}$$

$i_t$ is the mean value of the selected values in standardized terms.
In the standardized normal distribution case
$$f(x) = \varphi(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}}$$

$$F(t) = \Phi(t) = \int_{-\infty}^{t} \varphi(x)\,dx$$

$$Q(x) = 1 - \Phi(x)$$

$$i_t = \frac{\varphi(t)}{1 - \Phi(t)} = \frac{\varphi(t)}{Q(t)}$$
Note that for the normal case $i_t$ can be expressed more simply than for most distributions, since $\int_t^{\infty} x e^{-x^2/2}\,dx = e^{-t^2/2}$ can be evaluated analytically. Usually the selection intensity with the selected proportion as an entry, i(p), is required rather than the truncation point. For the normal case there is no simple formula for that, as there is no simple formula for $\Phi(x)$, so some iterative method has to be used where $\Phi(x)$ is needed. It can be calculated with arbitrary accuracy. Methods compiled by Zelen and Severo (1964) can be used; some methods are developed in more detail by Lindgren and Bondesson (1987).
Many commercial program packages include cumulative normal distribution functions which can be used for selection intensity calculations. For example, MS EXCEL7 functions gave selection intensity (with truncation point as an entry) with at least four correct decimals below 2.5, but around 3.5 there could be an error in the third decimal. Mathcad7 has an inverse cumulative normal distribution function, and can thus find t_p. Using the default precision, selection intensity could be calculated with at least four correct digits for a selected proportion above $10^{-10}$, and it seems possible to make the accuracy arbitrarily high with Mathcad. Standard programs will continue to become more powerful.
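As an additional minimal sketch (assuming Python with SciPy, an environment not used elsewhere in this paper), the exact formulas above can be evaluated directly; norm.sf and norm.isf are SciPy's names for Q(x) and its inverse:

from scipy.stats import norm

def selection_intensity_t(t: float) -> float:
    """Selection intensity with the truncation point t as entry: i_t = phi(t)/Q(t)."""
    return norm.pdf(t) / norm.sf(t)   # norm.sf(t) = 1 - Phi(t) = Q(t)

def selection_intensity_p(p: float) -> float:
    """Selection intensity with the selected proportion p as entry: i(p) = phi(t_p)/p."""
    t_p = norm.isf(p)                 # inverse of Q: truncation point for proportion p
    return norm.pdf(t_p) / p

print(selection_intensity_t(0.0))     # ~0.7979 (half the population selected)
print(selection_intensity_p(0.1))     # ~1.7550 (truncation point ~1.2816)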
If own programming is made, the following rational approximation for finding the truncation point corresponding to a certain proportion, t_p (Bratley et al. 1983), can be used rather than methods which require iterative loops and branches:

$$t_p = u - \frac{a + bu + cu^2 + du^3 + eu^4}{f + gu + hu^2 + iu^3 + ju^4}, \qquad u = \sqrt{-2 \ln p}$$

where
a = 0.322232431088
b = 1
c = 0.342242088547
d = 0.0204231210245
e = 0.0000453642210148
f = 0.099348462606
g = 0.588581570495
h = 0.531103462366
i = 0.10353775285
j = 0.0038560700634
The relative accuracy is 6 decimal digits. For selection intensities derived by this formula there were at least four correct decimals for selection intensities below 3. For fast and rough calculations, simpler rational approximations can be used.
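The approximation is easy to program; a sketch in the same assumed Python setting, using the coefficients listed above (nothing beyond the standard math module is needed):

import math

def truncation_point(p: float) -> float:
    """Rational approximation (Bratley et al. 1983) of the truncation point t_p
    for a selected proportion p, 0 < p < 1; relative accuracy about 6 digits."""
    u = math.sqrt(-2.0 * math.log(p))
    num = 0.322232431088 + u * (1.0 + u * (0.342242088547
          + u * (0.0204231210245 + u * 0.0000453642210148)))
    den = 0.099348462606 + u * (0.588581570495 + u * (0.531103462366
          + u * (0.10353775285 + u * 0.0038560700634)))
    return u - num / den

def intensity_from_p(p: float) -> float:
    """i(p) = phi(t_p)/p using the approximation above."""
    t = truncation_point(p)
    return math.exp(-t * t / 2.0) / (math.sqrt(2.0 * math.pi) * p)

print(truncation_point(0.5))   # ~0.0
print(intensity_from_p(0.1))   # ~1.7550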
A direct approximation of the selection intensity was made by Saxton (1988):

$$i_p = \frac{2.97425 - 3.35197\,p^{0.2} - 1.9319\,p^{0.4} + 2.3097\,p^{0.6}}{0.51953 + 0.88768\,p^{0.2} - 2.38388\,p^{0.4} + p^{0.6}}$$

cited from Walsh (1999), untested.
The appearance of the selection intensity, $i(p) = \varphi(t_p)/p$, is demonstrated in Figure A2.

[Figure A2 appears here: selection intensity (0 to 3) plotted against selected proportion (1.0 down to 0.0). Annotations mark a steep but almost undetectable increase near p = 1, an inflexion point, and a steep increase to large values for extreme p.]
Figure A2. The selection intensity as a function of the selected proportion, assuming a normal distribution. The curve starts at 0 when all are selected and increases monotonically, starting straight upwards; it passes an inflexion point at p = 0.71 and approaches infinity when the selected proportion becomes small.
The selection intensity value derived for a selected proportion in an infinite population will, however, constitute a poor approximation of the selection intensity for the same selected proportion in a finite population (cf. Figure A3).
[Figure A3 appears here: selection intensity (0 to 5) plotted against selected proportion (1 down to 10^-6) on a logarithmic scale. Upper curve: the best fraction p of values selected from a very large population. Lower curve: the best value selected from a sample of 1/p.]

Figure A3. The selection intensity as a function of the proportion of a truncated normal distribution. Two cases are demonstrated: either the best value from a sample is selected (lower curve), or the best proportion from an infinite population (upper curve). The scale of the X-axis is logarithmic; in this scale the expressions approach the parabola $\sqrt{2 \ln p^{-1}}$ when p becomes small.
If the selected number is small, the error can be considerable if the infinite-population value is used (cf. Table A7).
Order statistics and its use for breeding theory studies
The expected value of the j:th largest observation from a sample of size n, designated $\mu(j,n)$, can be calculated as

$$\mu(j,n) = \frac{n!}{(n-j)!\,(j-1)!} \int_{-\infty}^{\infty} x \left(1 - F(x)\right)^{j-1} F(x)^{\,n-j} f(x)\,dx$$
The coefficient of this formula may be realised by noting that n ranked objects can be arranged in n! ways, and that of these permutations (n-j)!(j-1)! are equivalent concerning which element is the j:th ranking one. The integrand can be interpreted as the probability density that the j:th ranking value lies at x, i.e. that j-1 values fall above x and n-j fall below it.
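A minimal numerical sketch of this integral, under the same Python/SciPy assumption as the earlier snippets; the factorial coefficient is kept on a log scale so that large n does not overflow:

import math
from scipy.integrate import quad
from scipy.stats import norm

def expected_order_stat(j: int, n: int) -> float:
    """Expected value mu(j, n) of the j:th largest of n standard normal values."""
    # log of the coefficient n! / ((n-j)! (j-1)!), via log-gamma
    log_coef = math.lgamma(n + 1) - math.lgamma(n - j + 1) - math.lgamma(j)
    def integrand(x: float) -> float:
        F, Q = norm.cdf(x), norm.sf(x)   # Q = 1 - F
        if F <= 0.0 or Q <= 0.0:
            return 0.0
        return x * math.exp(log_coef + (j - 1) * math.log(Q)
                            + (n - j) * math.log(F) + norm.logpdf(x))
    return quad(integrand, -10.0, 10.0, limit=200)[0]

print(expected_order_stat(1, 10))   # ~1.5388, the expected maximum of 10 (cf. Table A6)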
These values can be seen as the selection intensity for an individual selected on its rank. Thus, in theoretical breeding studies it is a useful technique to use order statistics for individual values. It has the advantage that the values are typical and the underlying parameters are known. It is also possible to deal with the corresponding situations with Monte Carlo simulations of values, but that is technically more complicated and sometimes computing time is limiting, although the simulations are more reliable, as they will catch the variance around the typical and not just the typical. The technique of using expected order statistics values has been used e.g. by Lindgren et al (1989), Hodge and White (1993), Ruotsalainen and Lindgren (1998) and Wei (199?).
An algorithm for the calculation of exact expected order statistics values was given by Royston (1982). Values are tabulated by Harter (1970). When a numeric evaluation of expression AX is made, accuracy can cause considerable problems and large numeric errors even if the program code is correct; alertness against this is strongly recommended.
Value formula for selection intensity following truncation selection, finite case
If n is finite but large and j not close to 1 or n, the formulas for the infinite case with p = j/n will be reasonable approximations, but if high accuracy is desired the demands on size are considerable.
The expected mean value i(j,n) of the j largest observations from a sample of size n can be calculated as

$$i(j,n) = \frac{\mu(1,n) + \mu(2,n) + \dots + \mu(j,n)}{j}$$
A useful approximation was constructed by Burrows (1972) by expanding the mean of a truncated distribution in a Taylor series. It has the following form:

$$i_B(j,n) = i(p) - \frac{n-j}{2j(n+1)\,i(p)}, \qquad p = j/n$$
For the normal probability density case, Lindgren and Nilsson (1985) studied the accuracy. The error is rather constant independent of n for n > 10 and decreases with j roughly as 1/j². The error is less than 0.0005 when j > 7 but is of magnitude 0.02 when j = 1. Some examples of the size of the error are given in Table A8. It is suggested that Burrows' approximation is used for all j if an error of 0.025 (less than 5% of the selection intensity) is acceptable, for j > 2 if an error < 0.01 is desired, and for j > 6 if an error < 0.001 is desired.
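A sketch of Burrows' approximation under the same Python/SciPy assumption, with i(p) computed exactly rather than by the rational approximation:

from scipy.stats import norm

def intensity_infinite(p: float) -> float:
    """i(p) = phi(t_p)/p for the standard normal distribution."""
    return norm.pdf(norm.isf(p)) / p

def intensity_burrows(j: int, n: int) -> float:
    """Burrows' approximation of i(j, n), with p = j/n."""
    p = j / n
    i_inf = intensity_infinite(p)
    return i_inf - (n - j) / (2.0 * j * (n + 1) * i_inf)

# Best of two: the approximation gives ~0.589 against the exact ~0.564,
# an error of ~+0.025, as for j = 1, P = 0.5 in Table A8
print(intensity_burrows(1, 2))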
Bulmer (1980) suggested an alternative finite-sample approximation with p replaced by

$$\tilde{p} = \frac{j + 0.5}{n + j/(2n)}$$

cited from Walsh 1999, check!
[Figure A4 appears here: selection intensity (0 to 3) plotted against the number selected, j (1 to 10^6, logarithmic scale), with one curve for each selected proportion: 0.5, 0.2, 0.1, 0.01 and 0.001.]
Figure A4. Illustration of the dependence of selection intensity on the selected number. Note that at a given
proportion, the selection intensity is lower if the selected number is smaller, thus if the selection is made from a
smaller sample.
Variances of normal order statistics
The n:th ranking value in a sample assuming some density distribution will not be identical for every sample. It
not only an expectation, but also a variance around that expectation. Normal order statistics is not just
associated with expectations but also variances. The variance of order statistics for small samples was analysed
and tabulated by Pearson and Hartley (19??).
Although expectation of the sum of the order statistics is the sum of individual expectations, this is not the case
for variances of the sum of values with a certain rank. If some top ranking values in a sample are high, other
top ranking values also tend to be high. Thus the variance of the sum of high ranking values tend to be more
variable than the sum of the variances. The coancestry of the order statistics is presented for some case by
Pearson and Harley ( ). For the simple case of the sum of two ranked values from a normal sample the
variance, thus the variance of the expected selection intensity, can be calculated as the sum of the variance of
the expectation of the corresponding values and double their covariance.
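In symbols, for the two top ranking values $X_{(1)}$ and $X_{(2)}$ from the same sample,

$$\mathrm{Var}\left(X_{(1)} + X_{(2)}\right) = \mathrm{Var}\left(X_{(1)}\right) + \mathrm{Var}\left(X_{(2)}\right) + 2\,\mathrm{Cov}\left(X_{(1)}, X_{(2)}\right),$$

where the covariance term is positive for order statistics from the same normal sample.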
Symmetry relation of the selection intensity
Knowing i(1-p), i(p) of a symmetric density function may be calculated utilizing

$$i(p) = \frac{1-p}{p}\, i(1-p)$$
For the finite case the expression gets the form

$$i(j,n) = \frac{1 - j/n}{j/n}\, i(n-j,\, n)$$
These relations may be realized by considering that the change in mean due to selection, multiplied by the proportion selected, must be the same (apart from sign) for the selected and non-selected parts of the population.
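As a check (rounded values): for p = 0.3 the truncation point is t ≈ 0.5244, so i(0.3) = φ(0.5244)/0.3 ≈ 0.3477/0.3 ≈ 1.159, while the relation gives the same value from the other tail, i(0.3) = (0.7/0.3) i(0.7) = (0.7/0.3)(0.3477/0.7) ≈ 1.159.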
Selection intensity for intermediary fractions
The selection intensity of some intermediary fraction can be evaluated in terms of the selection intensity and size of the area above the fraction, including and excluding the fraction.
p_1 is the proportion belonging to the best fraction.
p_2 is the proportion belonging to the second best fraction.
Selection intensity of the best fraction = i_1 = i(p_1).
Selection intensity of the second fraction = its average in standardized terms = i_2; the problem is to find a formula for it.
Average = sum of (proportions × values) / (total proportion). i(p_1 + p_2) can thus be calculated as an average, but also by known ways:

$$i(p_1 + p_2) = \frac{i_1 p_1 + i_2 p_2}{p_1 + p_2} \;\Rightarrow\; i_2 = \frac{(p_1 + p_2)\, i(p_1 + p_2) - p_1\, i(p_1)}{p_2}$$
There are formal (but hardly real) difficulties when p_1 or p_2 approaches zero; it would be better with a formulation which avoided these difficulties, but I cannot find one.
In integral form the average of an intermediary fraction can be written

$$i_2 = \frac{\int_{t_{p_1+p_2}}^{t_{p_1}} x f(x)\,dx}{p_2}$$
To make computations involving selection (and selection intensity in particular), one can slice the probability density function into a number of intervals and then integrate over each interval:

$$\int_{-\infty}^{\infty} x f(x)\,dx = \sum_{j=1}^{n-1} \int_{t_j}^{t_{j+1}} x f(x)\,dx, \qquad t_1 = -\infty,\; t_n = \infty,\; t_{j+1} > t_j$$
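A sketch of the intermediary-fraction formula in the same assumed Python setting, cross-checked against direct numerical integration between the two truncation points (the fractions are chosen only for the example):

from scipy.integrate import quad
from scipy.stats import norm

def intensity(p: float) -> float:
    """i(p) = phi(t_p)/p."""
    return norm.pdf(norm.isf(p)) / p

def intensity_intermediary(p1: float, p2: float) -> float:
    """Mean, in standard deviations, of the fraction p2 lying directly below
    the best fraction p1: i2 = ((p1+p2) i(p1+p2) - p1 i(p1)) / p2."""
    return ((p1 + p2) * intensity(p1 + p2) - p1 * intensity(p1)) / p2

# Cross-check by integrating x*f(x) between the two truncation points
p1, p2 = 0.05, 0.10
upper, lower = norm.isf(p1), norm.isf(p1 + p2)
print(intensity_intermediary(p1, p2))
print(quad(lambda x: x * norm.pdf(x), lower, upper)[0] / p2)   # same value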
Asymptotes
For a normal distribution

$$t \to \infty \Rightarrow i_t \to t \qquad \text{and} \qquad p \to 0 \Rightarrow i(p) \to \sqrt{-2 \ln p}.$$

However, the convergence of these expressions is too slow to be useful.
The selection intensity for the top ranking of n values taken from a normal distribution has a similar asymptotic expression:

$$n \to \infty \Rightarrow i(1,n) \to \sqrt{2 \ln n}.$$
Derivatives of selection intensity
James (1976) gave the following derivatives (without any particular assumption concerning the distribution):

$$\frac{di}{dp} = \frac{t - i}{p}$$

$$\frac{di}{dt} = \frac{(i - t)\, f(t)}{p}$$

In the special case of a normal distribution

$$\frac{di}{dt} = (i - t)\, i$$
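For example (rounded values): at p = 0.1 the truncation point is t ≈ 1.2816 and i ≈ 1.7550, so di/dp = (1.2816 - 1.7550)/0.1 ≈ -4.73, i.e. the intensity falls steeply as the selected proportion grows, while in the normal case di/dt = (1.7550 - 1.2816) · 1.7550 ≈ 0.83.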
Variance of a truncated normal distribution
The variance ($V_t$) of a truncated density function is necessarily less than the variance of the non-truncated function. The variance within the selected part following directional truncation selection at x = t in a normal distribution is

$$V_t = 1 - i(i - t) = 1 - \frac{di}{dt}.$$

The variance of breeding values after truncation selection for phenotypes can be expressed as

$$1 - h^2 i(i - t),$$

where $h^2$ is the heritability (cf. Robertson 1961).
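As a check: for p = 0.5 (t = 0, i ≈ 0.7979) the formula gives V_t = 1 - 0.7979² ≈ 0.363, which agrees with the known variance 1 - 2/π of the upper half of a standardized normal distribution.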
Selection intensity for linear deployment
Breeders often deal with populations where all members are assumed to give equal contributions to selection intensity. Even if this is useful, the constraints implied easily lead to non-optimal conclusions. The goal of selection can often be expressed as combining selection intensity and the square sum of genetic contributions in an optimal way (e.g. Lindgren 1991); this formulation is useful for deployment of clones to a seed orchard (Bondesson & Lindgren 1993). Under certain conditions this is achieved if contributions of genetic units are linearly related to their assumed value (linear deployment). This is true even if there is an upper limit on the contribution. The corresponding selection intensity of linear deployment, if original genetic values are N(0,1), is presented slightly modified from Lindgren (1991):

$$i = \frac{Q(x_0) - Q(x_t)}{\varphi(x_0) - x_0 Q(x_0) - \varphi(x_t) + x_t Q(x_t)}$$

where x is the genetic value; individuals are represented in linear proportion to their breeding values if $x_0 \le x \le x_t$, they are not represented at all if $x \le x_0$, and they are represented at the upper bound if $x_t \le x$.
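A sketch in the same assumed Python setting; the deployment weight w(x) below is my reading of the conditions above (zero below x_0, linear between x_0 and x_t, capped above x_t), and the cross-check evaluates the weighted mean numerically:

from scipy.integrate import quad
from scipy.stats import norm

def intensity_linear_deployment(x0: float, xt: float) -> float:
    """Selection intensity of linear deployment for N(0,1) genetic values:
    no contribution below x0, contribution linear in x between x0 and xt,
    and capped at the upper bound above xt."""
    Q, phi = norm.sf, norm.pdf
    return (Q(x0) - Q(xt)) / (phi(x0) - x0 * Q(x0) - phi(xt) + xt * Q(xt))

# Cross-check as a weighted mean with weight w(x) = min(max(x - x0, 0), xt - x0)
x0, xt = 0.0, 2.0   # example thresholds only
w = lambda x: min(max(x - x0, 0.0), xt - x0)
num = quad(lambda x: x * w(x) * norm.pdf(x), -8.0, 8.0)[0]
den = quad(lambda x: w(x) * norm.pdf(x), -8.0, 8.0)[0]
print(intensity_linear_deployment(x0, xt), num / den)   # the two agree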
Free software
This information is from August 1997. SELEINT2.EXE and SELENOR3.EXE perform some tasks described here (for example calculation of order statistics). Selection intensity algorithms are also available as part of some of the offered EXCEL programs. The programs are available at an FTP site on the net which has its home page at the address: http://linne.genfys.slu.se/breed/breed.htm.
Acknowledgement
Literature
Becker WA. Manual of quantitative genetics. Washington State Univ., Pullman, Washington, 186 pp.
Bondesson FL & Lindgren D 1993. Optimal utilization of clones and genetic thinning of seed orchards. Silvae Genetica 42:157-163.
Bratley P, Fox BL & Schrage LE 1983. A guide to simulation. Springer-Verlag, Heidelberg.
Burrows P 1972. Expected selection differentials for directional selection. Biometrics 28:1091-1100.
Hodge GR & White TL 1993. Advanced generation wind-pollinated seed orchard design. New Forests 7:213-236.
James JW 1976. Maximizing a function of the selection differential. Theoretical and Applied Genetics 47:203-205.
Lindgren D & Nilsson J-E 1985. Calculations concerning selection intensity. Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences. Report 5.
Lindgren D 1993. Quantitative comparison between truncation selection and a better procedure. Hereditas.
Lindgren D 1991. Optimal utilization of genetic resources. Forest Tree Improvement 23:49-67.
Lindgren D, Libby WJ & Bondesson FL 1989. Deployment to plantations of numbers and proportions of clones with special emphasis on maximizing gain at a constant diversity. Theoretical and Applied Genetics 77:825-831.
Lindgren D & Bondesson L 1987. Calculation of selection intensity and rankits. Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences. Arbetsrapport 21.
Pearson & Hartley 19??.
Robertson A 1961. Inbreeding in artificial selection programmes. Genet. Res. Camb. 2:189-194.
Royston JP 1982. Expected normal order statistics (exact and approximate). Applied Statistics 31:161-165. (Corrections in Applied Statistics 32:223-224.)
Zelen M & Severo NC 1964. Probability functions. In: Abramowitz M & Stegun IA (eds) Handbook of mathematical functions. National Bureau of Standards Applied Mathematics Series 55:925-995.
Tables
Remark: Nowadays people who need numerical values are usually able to calculate them themselves, at least with the aid of this paper. Numerical values are also published elsewhere. So the main purpose of the published values here is to give an impression of how they vary, and to give values for checking the accuracy of one's own algorithms.
Table A5. Order statistics, μ(j,n), the expected value of the j:th ranking individual from a random sample of size n from a population with a normal distribution.

n          j=1       j=2       j=3       j=4       j=5
5                              0
10
20
50
100
200
1000
1000000
Table A6. Selection intensity i(j,n) when selecting the j top values from a random sample of size n from a population with a normal distribution.

n          j=1       j=2       j=3       j=4       j=5
5          1.1630    .8290     .5527               0
10         1.5388    1.2701    1.0654    .8930     .7389
50         2.2491    2.0520    1.9109    1.7991    1.7055
100
1000
10^6
Table A7. Error in the estimate of selection intensity if estimated as if the population were infinite. The estimates become too high.

           Error                                Error in % of value
j          P=0.5     P=0.1     P=0.001          P=0.5     P=0.1     P=0.001
1          0.23      0.22      0.13             29.29     12.32     3.73
2          0.13      0.12      0.068            16.88     6.69      2.02
5          0.059     0.050     0.029            7.39      2.82      0.85
10         0.030     0.025     0.015            3.74      1.45      0.44
100        0.0031    0.0026    0.0015           0.39      0.15      0.04
1000       0.0003    0.0002    0.0001           0.04      0.01      0.00
Table A8. Error in the estimate of selection intensity if estimated by Burrows' approximation. The estimates are usually too low (indicated by minus signs).

           Error                                          Error in % of value
j          P=0.5     P=0.1     P=0.001   P=10^-5          P=0.5     P=0.1     P=0.001   P=10^-5
1          0.025     -0.017    -0.022    -0.017           4.40                -0.69     -0.39
2          0.009     -0.005    -0.006    -0.005           1.41                -0.18     -0.10
5          0.002     -0.001    -0.001    -0.000           0.27                -0.03     -0.02
Table A9. Values for a truncated normal function with remaining proportion as entry.

Proportion   Corresponding        φ(t)     Selection        V_t
p            truncation point t            intensity i(t)
1            -> -inf              -> 0     -> 0
.999
.99
.95
.9
.7
.5
.3
.2
.1
.05
.02
.01
.001
.0001
.00001
.000001
Table A10. Values for a truncated normal function with truncation point as entry.

Truncation   Selected                   φ(t)       Selection        V_t
point t      proportion p_t = 1-Φ(t)               intensity i(t)
-> -inf      -> 1                       -> 0       -> 0
-3           .998650                    .004432    .0044
-2
-1
0
1
2
3
4
5
-> inf
Table A11
Possible further tasks
If we set the goal to have an algorithm which gives an error <0.01 I guess Burrows approximation is good
except when j=1. But I have not tested that this is true for very low n or for n=infinity. And we need a way to
deal with j=1 and possible j=2
Lindgren and Nilsson made a manual how to get selection intensity with three correct digits for almost all
cases, and a computer program (SELEINT) was written. However this is too complicated to include as a part of
other software, so I would like it to be a manual useful for programming (as well as the actual program).