The first thing to consider is how to deal with Qi and g(Q)j(i), since it can be
assumed that g(Q)j(i) is related to Qi. To do this, we can consider one factor for the
genetic effect, G, which accounts for both of these genetic effect variables:
G = Qi + g(Q)j(i)
The ANOVA equation can then be written as:
$$y_{ij} = \mu + G_i + \varepsilon_{ij}, \quad y \sim \mathrm{NID}(\mu, \sigma^2_p),\; G \sim \mathrm{NID}(0, \sigma^2_g),\; \varepsilon \sim \mathrm{NID}(0, \sigma^2_e),$$
where yij is the trait value for genotype i in replication j, μ is the mean, Gi is the
genetic effect for genotype i and εij is the error. The phenotypic, genotypic and error
components of the trait are assumed to be normally distributed with variances σ²p, σ²g
and σ²e respectively, and this gives:
$$\sigma^2_p = \sigma^2_g + \sigma^2_e$$
To relate this equation to the data we are given, we must include the data for
phenotypes in four environments, in blocks with 149 replications (one cell line
missing). To adapt the one-way ANOVA equation we include Environment and
Blocks, which gives the theoretical form of the ANOVA for the genetic effect as:
$$y_{ijk} = \mu + G_i + E_j + (GE)_{ij} + \varepsilon_{ijk},$$
where Gi and (GE)ij are the genotypic effects measured within blocks.
Source            dof            Expected MSQ                    F Value
Environment (E)   e-1            σ²g + bσ²ge + beσ²e             MSQE/MSQe; MSQE/MSQG
Blocks            (b-1)e         MSQ in blocks to be expected
Genotypes (G)     g-1            σ²e + bσ²ge + beσ²g             MSQG/MSQe
GE                (g-1)(e-1)     σ²e + bσ²ge                     MSQGE/MSQe
Error (e)         (b-1)(g-1)e    σ²e
Table 1: ANOVA table for randomized blocks within environments, where b is the
number of blocks (replications) within each environment. The focus is on the genotype
effect. The F values compare each component to the error component as a ratio. There
is also an F ratio comparing Environment to Genotype.
The tests involve measuring the variance in the data set of each environment location,
to calculate σ²e. The variance σ²g also has to be calculated for the genotypes, which
can be done as:
$$\sigma^2_g = \frac{1}{4}\left(\frac{1}{n_{QQ}} + \frac{1}{n_{Qq}}\right),$$
where nQQ is the number of individuals with genotype QQ and nQq is the number with
genotype Qq.
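As a minimal sketch of the variance formula above, the Python function below evaluates it for one pair of genotype counts. The function name `genotype_variance` and its argument names are illustrative and not part of the original analysis. The counts used in the example are the 1's and 2's reported for marker WG622 in Appendix 1, Table A.

```python
def genotype_variance(n_first: int, n_second: int) -> float:
    """Variance of the genetic effect from the two genotype class counts.

    Implements sigma^2_g = (1/n_first + 1/n_second) / 4.
    """
    if n_first <= 0 or n_second <= 0:
        raise ValueError("both genotype counts must be positive")
    return (1.0 / n_first + 1.0 / n_second) / 4.0


# Counts for marker WG622 (82 ones, 58 twos) from Appendix 1, Table A.
print(f"{genotype_variance(82, 58):.2e}")  # 7.36e-03
```

The printed value agrees with the first σ²g entry listed in Table A.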
The F-test, the t-test and the pooled variance of Environment and Genotype are the
main tests.
The pooled estimate σ²EG can be calculated from the individual variances σ²E and σ²G:
$$\sigma^2_{EG} = \frac{(b_E - 1)\,\sigma^2_E + (b_G - 1)\,\sigma^2_G}{b_E + b_G - 2}$$
The pooled estimate σ²EG is used to calculate the pooled variance t-test statistic [1]:
$$t = \frac{(\bar{E} - \bar{G}) - (\mu_E - \mu_G)}{\sqrt{\sigma^2_{EG}\left(\dfrac{1}{b_E} + \dfrac{1}{b_G}\right)}}$$
where
$\bar{E}$ is the sample mean of the Environment,
$\bar{G}$ is the sample mean of the Genotype,
μE is the population mean of the Environment,
μG is the population mean of the Genotype,
σ²EG is the pooled variance estimate from above,
bE is the number of replications for the Environment sample,
bG is the number of replications for the Genotype sample.
The t-test statistic can then be used to test the sample means of the two variables,
environment and genotype, against their population means.
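The pooled variance and the t statistic described above can be sketched in Python as follows. This is an illustration under the stated formulas, not the original calculation: the function names are hypothetical, the square root in the denominator follows the standard pooled-variance t-test, and the numbers in the usage lines are made up for demonstration.

```python
import math


def pooled_variance(var_e: float, b_e: int, var_g: float, b_g: int) -> float:
    """Pooled estimate of variance from two sample variances."""
    return ((b_e - 1) * var_e + (b_g - 1) * var_g) / (b_e + b_g - 2)


def pooled_t(mean_e, mean_g, mu_e, mu_g, var_e, b_e, var_g, b_g) -> float:
    """Pooled-variance t statistic for two sample means.

    The degrees of freedom for this statistic are b_e + b_g - 2.
    """
    s2 = pooled_variance(var_e, b_e, var_g, b_g)
    standard_error = math.sqrt(s2 * (1.0 / b_e + 1.0 / b_g))
    return ((mean_e - mean_g) - (mu_e - mu_g)) / standard_error


# Illustrative (made-up) inputs: equal sample sizes, so the pooled variance
# is simply the average of the two variances.
print(pooled_variance(2.0, 10, 4.0, 10))            # 3.0
print(pooled_t(5.0, 3.0, 0.0, 0.0, 2.0, 10, 4.0, 10))
```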
The equation for the analysis of variance for a single marker using backcross progeny
is given as:
$$Y_{i(j)k} = \mu + M_i + g(M)_{j(i)} + \varepsilon_{i(j)k}$$
where:
Yi(j)k = trait value for an individual j with marker genotype i in replication k
μ = population mean
Mi = effect of marker genotype i
g(M)j(i) = genotypic effect left unexplained by the marker
εi(j)k = error term
Similarly to part (a), we assume Mi and g(M)j(i) additively represent G, the genetic
effect:
G = Mi + g(M)j(i)
The ANOVA table from Table 1 is drawn up again for the genotypic effect in the
blocks for the Environment and for the Genotype:
$$y_{ijk} = \mu + G_i + E_j + (GE)_{ij} + \varepsilon_{ijk}$$
The variances for the marker genotype can be calculated identically to those of the
QTL, since the definition of genetic effect is the same for the QTL and for the marker,
so the concept of additive and dominance effects μ1, μ2 etc. still applies.
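The pooled estimate of variance is still:
$$\sigma^2_{EG} = \frac{(b_E - 1)\,\sigma^2_E + (b_G - 1)\,\sigma^2_G}{b_E + b_G - 2}$$
The t-test statistic is also:
$$t = \frac{(\bar{E} - \bar{G}) - (\mu_E - \mu_G)}{\sqrt{\sigma^2_{EG}\left(\dfrac{1}{b_E} + \dfrac{1}{b_G}\right)}}$$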
The expectation of the difference between the marker genotype classes is given by
calculating the expected trait value for each genotype and then taking the difference
between the two genotypes as follows:
$$\mu_{AA} = (1 - r)\,\mu_1 + r\,\mu_2$$
$$\mu_{Aa} = r\,\mu_1 + (1 - r)\,\mu_2$$
$$E(\mathrm{diff}) = \mu_{AA} - \mu_{Aa}$$
Q2(a). The t-tests were calculated for each marker, taking:
The environment population mean as the average of the environment sample means.
The genotype population mean as the average of the 26 marker genetic effects.
The environment sample mean as the average phenotype in the replication block.
The genotype sample mean as the genetic effect calculated for each marker.
The t-test was calculated using the formula in Q1(b), and the raw data for this are
summarised in Appendix 1, Table A.
The data contained some zeros representing missing data. We assume these to be
non-trivial, but they should be disregarded in the calculations, because replacing a
zero with either one or two would bias the data. To represent the sampling as
accurately as possible, the zeros were therefore disregarded. The genetic effect of
genotype Steptoe was calculated as the ratio of the Steptoe genotype to the total
genotype count, given as the number of 1's divided by the total number, where the
total number is the sum of 1's and 2's for each marker. Hence u1 and u2 were calculated
as proportions of the total genotype count. The mean and variance of the genetic effect
were calculated from the formula described in Q1(a). The CV was calculated to show
how the variance compares to the mean genetic effect. The population mean for the
genetic effect was calculated as the overall mean of the genetic means taken from each
marker.
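The handling of missing data described above can be sketched in Python as follows. This is a hypothetical illustration: the function name and the short list of codes are invented, and the coding (1 for Steptoe, 2 for Morex, 0 for missing) is taken from the description in the text.

```python
def genotype_proportions(codes):
    """Return (u1, u2): the proportions of 1's and 2's, ignoring 0's.

    codes: iterable of marker scores, where 1 = Steptoe, 2 = Morex and
    0 = missing data (as described in the text).
    """
    codes = list(codes)
    n1 = sum(1 for c in codes if c == 1)
    n2 = sum(1 for c in codes if c == 2)
    total = n1 + n2  # zeros are excluded from the total
    if total == 0:
        raise ValueError("no non-missing scores for this marker")
    return n1 / total, n2 / total


# Made-up scores for one marker across eight lines, two of them missing.
u1, u2 = genotype_proportions([1, 1, 2, 0, 1, 2, 0, 1])
print(round(u1, 3), round(u2, 3))  # 0.667 0.333
```

Excluding the zeros from the denominator means u1 and u2 always sum to one for each marker.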
The means and variances for the phenotype were calculated from the replication blocks
in the four environments, and the population mean was calculated as the average of the
phenotype means. The population variance of the environment was calculated as the
average of the phenotype variances. From these data and the equation in Q1(b), the
pooled estimate of variance and the degrees of freedom were calculated for the t-test
which follows.
Table 2. The results of the t-tests show t-test statistics and p values for each respective
environment and marker.
Table 3. The results for the expected mean differences for each marker.
The estimated mean differences of trait values are given by the formula in Q1. The
values for μ1 and μ2 are taken from the previous calculations. The recombination
fraction r was not readily computable; considering that the distance between
each marker is sufficient, we assume that the recombination fraction is zero.
For each marker, μAA denotes genotype Steptoe and μAa denotes genotype
Morex. In terms of the difference, it is not important which is which, since we are only
measuring the difference between the two. The expected difference is then the
difference between μ1 and μ2, where Steptoe is μ1 and Morex is μ2.
The total E(Diff) over all the 26 markers is given as the average of the 26 individual
marker results. This is given:
_______
E(diff) = ΣE(diff)/n
Where n is the total number of markers 26. The E(diff) equals to 0.07.
The null hypothesis here is that the test statistic for the pooled Genotype and
Environment means and variance falls within the 95% critical region. Any t-test
statistic with a p-value below 5% should lead to rejection of the null hypothesis.
The t-test statistics in Table 2 all fall outside the 5% critical region, indicating
that there is a large variation in the pooled Genotype and Environment data, most
likely due to the large differences found going from one location to the next.
Q2(b)
ANOVA          Degrees of Freedom   Expected MSQ   F Value
Environment    3                    1.54E+03       656.02 (vs Error); 9.33 (vs Genotypes)
Blocks         447
Genotypes      25                   1.65E+02       70.33
G x E          75                   1.43E+02       61.02
Error          300                  2.34789
Table 4. The ANOVA results for the trait values given the genotype and
phenotype (environment) data.
In Table 4 above, the ANOVA was calculated from the theoretical form of the
ANOVA found in Q1(a). The variances used for Genotype and Environment were
calculated as averages of the multiple variances calculated for each marker and
each location respectively. The pooled variance for the Genotype and Environment
interaction term GE was calculated from the pooled estimate of variance found in
Q1(b). The error variance was calculated from the equation in Q1(a): if the phenotype
and genotype variances are known, the error variance is the difference between the
phenotypic (environment) and genotypic variances.
In calculating the MSQ, the variances for each variable were used, together
with the number of replications (149) and the number of environments (4). The F values
compare the MSQs of each component; most of them compare a variable's MSQ to the
error MSQ. The results show that most of the variation is found in the main variables
Environment and Genotype and in the interaction of Genotype and Environment,
shown by the large F ratio found for each (656.02, 70.33 and 61.02 respectively).
The F ratio was also calculated between the Environment MSQ and the Genotype MSQ
to show where most of the variation lies. The result is 9.33, which shows that the
environment factor accounts for about 9 times more variation than the genotypic
effect factor. The t-test results showed something similar.
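The F ratios discussed above can be reproduced approximately from the mean squares printed in Table 4. The Python sketch below uses the rounded MSQ values from that table, so its results differ slightly from the tabulated F values, which were presumably computed from unrounded figures. The variable and function names are illustrative.

```python
# Mean squares as printed in Table 4 (rounded in the source).
msq = {"Environment": 1.54e3, "Genotypes": 1.65e2, "G x E": 1.43e2}
msq_error = 2.34789


def f_ratio(msq_numerator: float, msq_denominator: float) -> float:
    """F value as the ratio of two mean squares."""
    return msq_numerator / msq_denominator


# Each component against the error mean square.
f_values = {name: f_ratio(value, msq_error) for name, value in msq.items()}
# Environment against Genotypes, to compare the two main factors.
f_env_vs_gen = f_ratio(msq["Environment"], msq["Genotypes"])

for name, value in f_values.items():
    print(f"{name}: {value:.2f}")
print(f"Environment vs Genotypes: {f_env_vs_gen:.2f}")
```

The results come out close to 656, 70 and 61 for the three components, and about 9.33 for Environment against Genotypes.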
The t-test results show that, within any one location, the results are
reproducible across markers. The phenotype variations at each individual
location are low, represented by low coefficients of variation (CV) in the range of
2-4% (see Appendix 1, Table A). Further, the t-test results are shown in Table 2, and it
can be seen that the individual locations have to some degree similar t-test statistics
for each marker, and the CV calculated is a reasonable measure below 20% for all
locations. The biggest variance comes from changing locations, where the t-test
statistic varies greatly. The high F value for MSQE/MSQG in the ANOVA and the
large shifts in the t-test statistics going from one location to another are due to one
factor: the use of an environmental population mean calculated from the individual
location means, which themselves vary around the calculated population mean quite
significantly.
Q2(c)
The ratio of Steptoe genotype occurrences to the total genotype number was
calculated by counting the number of Steptoe occurrences in each marker,
totalling over all markers, and dividing by the total number of possible markers (not
including the zeros, which are missing data). The Morex genotype ratio to the total
genotype number was calculated likewise. The probability of Steptoe occurring
was taken as the ratio of Steptoe occurrences in the total data, and the probability of
Morex was calculated likewise. Given the raw data, the occurrences of Steptoe and
Morex in each cell line were plotted.
[Figure 1 plot: "Steptoe and Morex"; x-axis: Cell Line no; y-axis: Count; series: Steptoe, Morex]
Figure 1. A scatter plot showing the total occurrences of both Steptoe and Morex over
all the markers for the genotype experiments.
[Figure 2 plot: x-axis: Cell line; y-axis: pdf; series: Steptoe, Morex]
Figure 2. Predicted binomial probability distribution function of the genotype data.
Given the probability of both Steptoe and Morex occurring, a binomial distribution
can be drawn up to allow prediction of genotype occurrences.
For both Steptoe and Morex, the variation of the genotype count can be predicted
from the distribution in Figure 2. This can be related to the ANOVA and t-test
results earlier, in that the genetic effect is measured for each genotype as μ1 and μ2.
The probability of either Steptoe or Morex occurring directly affects the calculation
of the genetic effect g and its variance σ²g, which are calculated from μ1 and μ2.
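A binomial distribution of the kind described above can be sketched with the Python standard library. The number of lines (150) and the probability (0.5) used here are assumptions chosen only to illustrate the calculation; they are not the values used in the original figure, and the function name is hypothetical.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k occurrences in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


# Assumed values for illustration only.
n_lines = 150      # hypothetical number of cell lines
p_steptoe = 0.5    # hypothetical probability of the Steptoe genotype

pmf = [binomial_pmf(k, n_lines, p_steptoe) for k in range(n_lines + 1)]

# The most likely count and its probability under these assumptions.
most_likely = max(range(n_lines + 1), key=lambda k: pmf[k])
print(most_likely, round(pmf[most_likely], 4))
```

The probability for the second genotype follows from the same function with p replaced by 1 - p.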
Conclusion
In the analysis of t-test statistics, based on the assumption in the null hypotheses,
(Q2b) any p-value of less than 0.05 should lead to rejection of the null hypothesis.
This means that the effect of the genotype-environment interaction is greater than that
expected to occur by chance, and indicates that this interaction has a significant effect
on phenotype.
Appendix 1:
Number  Marker     1's  2's  u1    u2    g       σ²g       CV      σ²EG  Degrees of freedom
1       >WG622     82   58   0.59  0.41  0.086   7.36E-03  8.6     0.95  149
2       >ABG313B   82   61   0.59  0.44  0.075   7.15E-03  9.5     0.94  149
3       >CDO669    81   56   0.58  0.40  0.089   7.55E-03  8.5     0.96  149
4       >BCD402B   65   68   0.46  0.49  -0.011  7.52E-03  -70.2   0.98  149
5       >BCD351D   78   70   0.56  0.50  0.029   6.78E-03  23.7    0.93  149
6       >TubA1     71   69   0.51  0.49  0.007   7.14E-03  100.0   0.95  149
7       >Dhn6      81   66   0.58  0.47  0.054   6.87E-03  12.8    0.93  149
8       >WG1026B   80   68   0.57  0.49  0.043   6.80E-03  15.9    0.93  149
9       >Adh4      75   67   0.54  0.48  0.029   7.06E-03  24.7    0.95  149
10      >ABA003    80   64   0.57  0.46  0.057   7.03E-03  12.3    0.94  149
11      >ABG484    81   65   0.58  0.46  0.057   6.93E-03  12.1    0.93  149
12      >WG464     73   67   0.52  0.48  0.021   7.16E-03  33.4    0.95  149
13      >BCD453B   85   62   0.61  0.44  0.082   6.97E-03  8.5     0.93  149
14      >ABG472    79   62   0.56  0.44  0.061   7.20E-03  11.9    0.95  149
15      >ABG500B   91   55   0.65  0.39  0.129   7.29E-03  5.7     0.93  149
16      >ABG366    82   54   0.59  0.39  0.100   7.68E-03  7.7     0.97  149
17      >ABG397    83   63   0.59  0.45  0.071   6.98E-03  9.8     0.93  149
18      >BCD351F   74   70   0.53  0.50  0.014   6.95E-03  48.6    0.94  149
19      >ABG008    68   62   0.49  0.44  0.021   7.71E-03  36.0    0.99  149
20      >KFP195    78   71   0.56  0.51  0.025   6.73E-03  26.9    0.92  149
21      >ABR337    54   75   0.39  0.54  -0.075  7.96E-03  -10.6   0.99  149
22      >Ica1      57   86   0.41  0.61  -0.104  7.29E-03  -7.0    0.94  149
23      >ABG499    73   68   0.52  0.49  0.018   7.10E-03  39.8    0.95  149
24      >WG110     71   72   0.51  0.51  -0.004  6.99E-03  -195.8  0.94  149
25      >ABG004    74   76   0.53  0.54  -0.007  6.67E-03  -93.3   0.92  149
26      >ABG395    86   62   0.61  0.44  0.086   6.94E-03  8.1     0.93  149

Population Mean for Genetic Effect: 0.036676
Pop. Mean for Environment Effect: 74.08075

Means for each environment location
                 Idaho     Montana   Oregon    Washington  Average
Mean             74.3      73.1      73.65     75.23       74.08
Variance (σ²E)   1.84      2.79      2.44      2.32        2.35
CV               2.476818  3.819478  3.313153  3.081314

Table A. Raw data used to calculate the t-test statistic.

References:
[1] Teh Sin Yin and Abdul Rahman Othman. "When does the pooled variance t-test
fail?" African Journal of Mathematics and Computer Science Research, Vol. 2(4),
pp. 056-062, May 2009.