General enquiries on this form should be made to:
Defra, Science Directorate, Management Support and Finance Team,
Telephone No. 020 7238 1612
E-mail:
[email protected]
SID 5



Research Project Final Report
Note
In line with the Freedom of Information
Act 2000, Defra aims to place the results
of its completed research projects in the
public domain wherever possible. The
SID 5 (Research Project Final Report) is
designed to capture the information on
the results and outputs of Defra-funded
research in a format that is easily
publishable through the Defra website. A
SID 5 must be completed for all projects.
This form is in Word format and the
boxes may be expanded or reduced, as
appropriate.
ACCESS TO INFORMATION
The information collected on this form will
be stored electronically and may be sent
to any part of Defra, or to individual
researchers or organisations outside
Defra for the purposes of reviewing the
project. Defra may also disclose the
information to any outside organisation
acting as an agent authorised by Defra to
process final research reports on its
behalf. Defra intends to publish this form
on its website, unless there are strong
reasons not to, which fully comply with
exemptions under the Environmental
Information Regulations or the Freedom
of Information Act 2000.
Defra may be required to release
information, including personal data and
commercial information, on request under
the Environmental Information
Regulations or the Freedom of
Information Act 2000. However, Defra will
not permit any unwarranted breach of
confidentiality or act in contravention of
its obligations under the Data Protection
Act 1998. Defra or its appointed agents
may use the name, address or other
details on your form to contact you in
connection with occasional customer
research aimed at improving the
processes through which Defra works
with its contractors.
SID 5 (Rev. 3/06)
Project identification
1. Defra Project code: IF0174
2. Project title: Identification of candidate genes for resource-use efficiency by genome-wide association mapping in
Arabidopsis thaliana.
3. Contractor organisation(s): Warwick HRI, University of Warwick, Wellesbourne, Warwick, CV35 9EF
4. Total Defra project costs (agreed fixed price): £49,874
5. Project: start date 01 October 2008; end date 31 March 2009
6. It is Defra’s intention to publish this form.
Please confirm your agreement to do so: YES / NO
(a) When preparing SID 5s contractors should bear in mind that Defra intends that they be made public. They
should be written in a clear and concise manner and represent a full account of the research project
which someone not closely associated with the project can follow.
Defra recognises that in a small minority of cases there may be information, such as intellectual property
or commercially confidential data, used in or generated by the research project, which should not be
disclosed. In these cases, such information should be detailed in a separate annex (not to be published)
so that the SID 5 can be placed in the public domain. Where it is impossible to complete the Final Report
without including references to any sensitive or confidential data, the information should be included and
section (b) completed. NB: only in exceptional circumstances will Defra expect contractors to give a "No"
answer.
In all cases, reasons for withholding information must be fully in line with exemptions under the
Environmental Information Regulations or the Freedom of Information Act 2000.
(b) If you have answered NO, please explain why the Final report should not be released into public domain
We have no objection to the publication of the SID 5 form, but we do not wish the appendix to be
published because it contains details where protection of intellectual property may be an issue.
7. Executive Summary
The executive summary must not exceed 2 sides in total of A4 and should be understandable to the
intelligent non-scientist. It should cover the main objectives, methods and findings of the research, together
with any other significant events and options for new work.
Water and nutrient use efficiencies are key targets for future crop improvements because of the increasing
pressures on fresh water resources, the declining availability of key nutrients such as phosphorus, the
energy required for production of nitrogen fertilizers and the need to reduce nutrient leaching to water
courses.
In this project we aimed to develop a methodology, known as genome-wide association mapping, for
the identification of plant genes that influence these traits. This is a relatively new and powerful approach
that has been made possible by the ever-decreasing cost of genotyping. Genes that are identified in a
model plant can then later be tested in crop species, and potentially used as markers for selecting crop
varieties with improved water and nutrient use efficiencies.
In simple terms, the approach we have used searches for associations between particular gene
alleles and the value of a quantitative trait. The analysis makes use of 241,000 single nucleotide
polymorphisms (SNPs) throughout the genome of the model plant Arabidopsis thaliana. Because linkage
disequilibrium is believed to decay after about 10 kilobasepairs, if a SNP is associated with the trait of
interest, the SNP is likely to be within a few genes of the causative gene. Each SNP has previously been
scored in 363 different accessions of this species, and the data are publicly available. We have developed
algorithms and data pipelines to enable association analysis with this SNP genotype dataset, and
validated the approach by searching for flowering time genes; the best hit for flowering time genes was
within 10 kilobasepairs of a gene known to affect flowering time by 10 days.
We have used trait data generated in project HH3608TX for water use efficiency traits, and publicly
available nutrient data, to perform association mapping using the pipeline. We were able to find many
associated genes, and we are able to predict which associations are likely to be close to causative genes,
and which are more likely to be false positives created by the structured relationships between
accessions. Some of the associated genes can be predicted, based on known gene functions, to influence
water and nutrient use efficiencies. Others have no known function and so are putative new candidate
genes.
Additional work is needed to further improve the pipeline by making refinements to the analysis, and
further work is needed in order to investigate and validate candidate genes in crop species.
8. Project Report to Defra
As a guide this report should be no longer than 20 sides of A4. This report is to provide Defra with
details of the outputs of the research project for internal purposes; to meet the terms of the contract; and
to allow Defra to publish details of the outputs to meet Environmental Information Regulation or
Freedom of Information obligations. This short report to Defra does not preclude contractors from also
seeking to publish a full, formal scientific report/paper in an appropriate scientific or other
journal/publication. Indeed, Defra actively encourages such publications as part of the contract terms.
The report to Defra should include:
• the scientific objectives as set out in the contract;
• the extent to which the objectives set out in the contract have been met;
• details of methods used and the results obtained, including statistical analysis (if appropriate);
• a discussion of the results and their reliability;
• the main implications of the findings;
• possible future work; and
• any action resulting from the research (e.g. IP, Knowledge Transfer).
------------------------------------------------------------------------------------------
Aims and objectives
Aim
The aim of this project is to use Genome-Wide Association Mapping (GWAM) to identify genetic
markers very closely associated with genes that control water and nutrient use efficiency. Such genes
would be validated in Arabidopsis and tested in crop species in later projects.
Objectives
1. To organise genotype and trait data into an appropriate format for local analysis
2. To create a data analysis pipeline
3. To perform GWAM analysis using the data analysis pipeline
1. To organise genotype and trait data into an appropriate format for local analysis
Organise the genotype and trait data
The single nucleotide polymorphism (SNP) data were available from http://walnut.usc.edu/2010/SNPs at the
beginning of the project. They were generated by the NSF 2010 grant DEB-0519961 (2005-2008), USA. They were
downloaded in the format of a zip file containing a comma-separated text file (.csv) with one column for each of
363 accessions and one row for each of 241 thousand SNPs. A separate file links the accession numbers with
their names. The SNP data were then split into separate files for each chromosome, and these were read into R
and converted into an R data frame with SNPs as columns and accessions as rows, so that the data for individual
markers were readily available. From each chromosome eight separate files were then created, each containing
one eighth of the SNP data, with an overlap of one marker.
Trait data on flowering time for the original 96 accessions were downloaded from
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1283159#supplementary-material-sec. Data for seven
traits related to water use efficiency (see below) had been measured on 94 of the original accessions in project
HH3608TX. Data for 19 mineral traits on 96 accessions were downloaded from
http://www.ionomicshub.org/arabidopsis/piims/showIndex.action, and accession means calculated. For all these
sets of trait data, the data were then organised so that the accession order was consistent with that for the SNP
data.
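The data organisation steps above can be illustrated with a short sketch. The project pipeline was written in R; the Python below is only an illustration, and the function names (transpose_snps, split_with_overlap, align_traits) are hypothetical. The report states that each chromosome was split into eight files with an overlap of one marker; the exact layout of that overlap is not given, so the sketch assumes that each piece after the first begins with the last marker of the previous piece.

```python
def transpose_snps(rows):
    """Turn a SNP-by-accession table (one row per SNP) into an
    accession-by-SNP table (one row per accession)."""
    return [list(col) for col in zip(*rows)]


def split_with_overlap(markers, n_chunks=8):
    """Split a list of markers into n_chunks consecutive pieces.
    Assumption: every piece after the first starts with the last
    marker of the previous piece (an overlap of one marker)."""
    n = len(markers)
    bounds = [i * n // n_chunks for i in range(n_chunks + 1)]
    chunks = []
    for i in range(n_chunks):
        start = max(0, bounds[i] - 1) if i > 0 else 0
        chunks.append(markers[start:bounds[i + 1]])
    return chunks


def align_traits(trait_by_accession, accession_order):
    """Reorder trait values to follow the accession order of the SNP data.
    Accessions without a trait value are given None."""
    return [trait_by_accession.get(acc) for acc in accession_order]
```

For example, seventeen markers split in this way give eight pieces in which the last marker of one piece is the first marker of the next.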
Examine the effects of population structure
Project HH3608TX had identified three regions of particular interest based on linkage mapping in the Col-gl x
Kas-1 population: two on chromosome 4 and one on chromosome 5. The range of linkage disequilibrium (LD)
between the SNP markers on these chromosomes was studied with LD decay plots (Flint-Garcia et al, 2003) in
consecutive windows of 200 kilobases along chromosomes 4 and 5. These graphs plot linkage disequilibrium
between a pair of markers (calculated here as r², the square of the correlation coefficient between the two) against
the physical distance between the pair, for each pair of markers within the 200 kilobase region. LD decay was
found to vary along the chromosomes. Typically most, but not all, of the linkage disequilibrium decayed within
about 25 kilobases, but there are regions where LD does not noticeably decay over 200 kilobases. Example plots
are shown in Figure A1 (appendix). The region of interest on chromosome 5 has been located to between 23837141 bp and
the end of the chromosome, but some evidence suggests that the gene is between 26321176 bp and the end of the
chromosome. Figure 1 shows a linkage disequilibrium matrix for this region.
Figure 1. Linkage disequilibrium matrix for region on chromosome 5 believed to contain
a water use efficiency gene. The figure shows linkage disequilibrium between all pairs of
markers in the region. Horizontal and vertical axes are identical. In the upper half of the
figure colour indicates the linkage disequilibrium between the markers, as the r² value.
In the lower half of the figure, colour indicates the significance of the association. The
arrow indicates the position of a marker which was associated with gravimetric water use
efficiency in this study.
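The quantity plotted in an LD decay plot can be sketched as follows. This is an illustrative Python sketch, not the project's R code. It assumes each marker is coded 0/1 across accessions, and the function names (r_squared, ld_decay_points) are hypothetical. For every pair of markers inside one 200 kilobase window it returns the physical distance between the pair and r², the squared correlation coefficient.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two 0/1-coded markers.
    Returns NaN if either marker does not vary."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov * cov / (var_a * var_b)


def ld_decay_points(positions, genotypes, window_start, window_size=200_000):
    """Return (distance in bp, r squared) for each pair of markers whose
    positions fall inside [window_start, window_start + window_size)."""
    idx = [i for i, p in enumerate(positions)
           if window_start <= p < window_start + window_size]
    points = []
    for k, i in enumerate(idx):
        for j in idx[k + 1:]:
            distance = abs(positions[j] - positions[i])
            points.append((distance, r_squared(genotypes[i], genotypes[j])))
    return points
```

Plotting the second value against the first for each window gives a plot of the kind described above.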
2. To create a data analysis pipeline
A Mac Pro computer with two 3.0 GHz Intel Xeon processors (each with four cores) and 4 GB of RAM was
purchased and dedicated to the project.
A data analysis pipeline was written comprising several R programs which perform the various stages of the analysis,
and Perl code to automate running the R programs. When the analysis is performed on a data file
containing one or more sets of trait data, each trait is analysed separately. For each trait, two analyses are
performed at each SNP. For the first, a Mann-Whitney U test is performed relating the trait to the marker data.
For the second, a Kruskal-Wallis one-way analysis of variance is performed, relating the trait to the combinations of
alleles at three consecutive markers centred on the current one. For both tests, asymptotic p values are used, calculated with the
R package coin (Hothorn et al, 2008). For each chromosome, the markers in the eight sub-files created from that
chromosome's marker data are analysed in parallel, one on each of the
computer’s processor cores. The p values for the various markers are then summarised. Graphs of minus the log
of the p value against position of the markers are plotted for the two tests. Figure 2 shows the output for δ13C with
three consecutive markers.
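The single-marker test can be sketched in a few lines. The project used the R package coin; the Python below is an independent illustration that computes a two-sided asymptotic Mann-Whitney p value from the normal approximation with a correction for ties. The function names are hypothetical, markers are assumed to be coded 0/1, and the use of log base 10 is an assumption, since the report does not state the base. The three-marker Kruskal-Wallis test is not sketched here.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_p(trait, alleles):
    """Two-sided asymptotic p value for a trait split by a 0/1 marker."""
    n1 = sum(1 for a in alleles if a == 1)
    n0 = len(alleles) - n1
    if n0 == 0 or n1 == 0:
        return 1.0
    n = n0 + n1
    ranks = midranks(trait)
    rank_sum = sum(r for r, a in zip(ranks, alleles) if a == 1)
    u = rank_sum - n1 * (n1 + 1) / 2
    counts = {}
    for v in trait:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n0 * n1 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return 1.0
    z = (u - n0 * n1 / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def scan(trait, markers):
    """Minus log10 of the p value at each marker (log base is an assumption)."""
    return [-math.log10(max(mann_whitney_p(trait, m), 1e-300)) for m in markers]
```

Each chunk of markers can be passed to scan independently, which is what makes the analysis straightforward to run in parallel across processor cores.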
A threshold p value is calculated, which is intended to highlight the 100-150 most significant markers from the
single marker analysis, although the threshold is constrained to be no greater than 0.0001. For the markers from
the single marker analysis which are more significant than the threshold value, following Aranzana et al (2005)
we consider each marker as dividing the accessions into two groups, and the similarity between the groupings of
accessions produced by different markers is examined. Pairwise differences between markers are calculated as the
number of accessions classified differently between the two markers, and hierarchical cluster analysis using
UPGMA is used to construct a dendrogram, showing how similar the different markers are in terms of the
groupings they impose on the accessions. Figure 3 shows this dendrogram for log dry weight. A problem with
association mapping is that the population structure can result in linkage disequilibrium over long distances
within the genome, between regions which are not closely linked. Where a trait is associated with a gene in such a
region, significant marker association will be found in all of these regions. This dendrogram shows for which of
the most significant markers this may be a problem.
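The threshold and clustering steps can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, not the project's R code. The report gives only the intent of the threshold rule (roughly the 100-150 most significant markers, capped at 0.0001), so choose_threshold is an assumed form of that rule. The difference between two markers is counted exactly as described above, and upgma is a plain average-linkage clustering that reports each merge and the distance at which it occurs.

```python
def choose_threshold(p_values, target=125, cap=1e-4):
    """Assumed rule: the p value of the target-th most significant marker,
    but never greater than the cap."""
    ranked = sorted(p_values)
    kth = ranked[min(target, len(ranked)) - 1]
    return min(kth, cap)


def marker_difference(a, b):
    """Number of accessions classified differently by two markers."""
    return sum(1 for x, y in zip(a, b) if x != y)


def upgma(dist):
    """Average-linkage clustering of a square symmetric distance matrix.
    Returns a list of merges: (members of A, members of B, distance)."""
    clusters = {i: [i] for i in range(len(dist))}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(dist)
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        size_a, size_b = len(clusters[a]), len(clusters[b])
        updated = {}
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted mean keeps the distance an average over all pairs.
            updated[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        for c, value in updated.items():
            d[(c, next_id)] = value
        next_id += 1
    return merges
```

Markers that merge at a small distance impose nearly the same grouping on the accessions, which is the situation in which a significant association may reflect population structure.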
Figure 2. Significance of the association between δ13C and a triplet of adjacent
SNP markers. Height of a bar is minus the log of the p value of the association
with the three markers centred at the position of the bar.
Figure 3. Relationship between the most significant markers for log dry weight, based
on the differences between the groupings of accessions they impose
Figure 3 shows that many of the most significant markers produce similar groups of accessions, so that a gene at
one of these locations could be causing the significant associations on other chromosomes. Consequently, the
significance of these markers needs to be discounted, since although there should be a gene at one of these
locations there is no indication as to which. The figure also shows that some of the significant markers are very
different from any distant marker, so that the significance at these markers cannot be explained as due to
structural effects.
The output produced by the analysis comprises the graphs discussed above, a .csv file containing summary data for each
of the markers more significant than the threshold, and an R data file containing p values of the two tests at all the
markers, to facilitate any further analysis. On the computer described above, it takes around half an hour to
analyse and summarise each trait.
One possible weakness of the approach described above is that it may be difficult to detect a gene with three or
more alleles segregating at a locus, since the analysis is largely based on individual SNPs. Flowering time is
known to be influenced by the FRIGIDA gene (FRI), located around 270000 bp on chromosome four, which has
three alleles segregating in this population. Preliminary work was performed to investigate whether we could
improve the detection of these alleles from the flowering time data, using an approach based on one
described by Aranzana et al (2005). Over the first part of chromosome four, windows of eleven consecutive SNPs
were considered and the haplotypes formed from the bases were grouped into a phylogeny using neighbour
joining. Flowering time was then compared with groupings of accessions formed in two ways: first, by comparing the
accessions with each haplotype to those with different haplotypes; and second, for each internal node of the
phylogeny, by comparing the accessions with any of the haplotypes joined by that node to the other accessions.
Where this resulted in significant association between flowering time and any of the groupings of accessions
(p<0.0001), combinations of the most significant such grouping and the other groupings were then tested for association with
flowering time. At regions on either side of FRI, around 159808 bp and 376513 bp, this resulted in the
significance of the association with flowering time increasing by a factor of at least 20, and improved the
association with the known FRI alleles. However, this analysis approach appears to be around an order of
magnitude slower than the single-SNP analysis.
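The first step of the windowed approach, forming haplotypes and comparing the accessions carrying each haplotype with all others, can be sketched as below. This is an illustrative Python sketch with hypothetical function names. It does not build the neighbour-joining phylogeny or the node-based groupings, and it assumes genotypes are held as one list of single-letter bases per SNP.

```python
from collections import defaultdict


def haplotype_groups(genotypes, centre, width=11):
    """Group accessions by their haplotype over `width` consecutive SNPs
    centred on SNP index `centre`. genotypes[s][a] is the base at SNP s
    in accession a. Returns {haplotype string: [accession indices]}."""
    half = width // 2
    if centre - half < 0 or centre + half >= len(genotypes):
        raise ValueError("window does not fit within the marker list")
    window = genotypes[centre - half:centre + half + 1]
    groups = defaultdict(list)
    for accession, bases in enumerate(zip(*window)):
        groups["".join(bases)].append(accession)
    return dict(groups)


def one_vs_rest(groups, n_accessions):
    """For each haplotype, a 0/1 vector marking the accessions that carry it.
    Each vector can be tested against the trait like a single marker."""
    splits = {}
    for haplotype, members in groups.items():
        member_set = set(members)
        splits[haplotype] = [1 if i in member_set else 0
                             for i in range(n_accessions)]
    return splits
```

Each 0/1 vector returned by one_vs_rest defines one grouping of accessions that could be compared with flowering time in the same way as an individual SNP.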
3. To perform GWAM analysis using the data analysis pipeline
Trait analysis and validation using flowering time data
Initially the flowering time data were analysed as described above in order to detect well known
flowering time genes. The strongest association detected was with position 454542 on chromosome 4, which is
within 6 kbp of the gene At4g01060, which is known to cause flowering to occur 10 days earlier when it is present as
a null allele (Tominaga et al, 2008). An earlier approach to genome-wide association mapping (Aranzana et al,
2005), using the same flowering time data but different marker data, was validated by its detection of
FRI, which is also known to affect flowering time. The significance of the association we detected with markers
around FRI was much lower, and FRI would not have been identified by this analysis. Nonetheless, the close
linkage between the most significant marker produced by the analysis and a gene known to affect flowering time
validates this approach.
Using the approach described above we then analysed trait data from the project HH3608TX, consisting of the
following traits measured in a collection of 94 Arabidopsis thaliana ecotypes:
• gravimetric water use efficiency (WUEp)
• carbon isotope composition (δ13C, a surrogate measure of intrinsic water use efficiency)
• oxygen isotope composition (δ18O, a surrogate measure of transpiration)
• log(dry weight of rosette)
• relative water content
• leaf chlorophyll content estimated by SPAD meter
• specific leaf weight (mg/cm², an estimate of leaf thickness)
We also analysed mineral content data from http://www.ionomicshub.org/arabidopsis/piims/showIndex.action for
19 mineral nutrients.
Results from WUE and mineral nutrient traits and interpretation
Markers of interest were selected on the basis of (i) the lowest P values (top quartile of the pre-selected list) for
each trait, (ii) co-association of more than one of the three traits WUEp, δ13C and δ18O with a particular cluster of
markers, and (iii) the extent to which the high significance could be explained by population structure alone. Several
highly interesting clusters were apparent near genes whose functions relate to plant response to water deficit, stomatal
control or osmotic regulation (Table A1, appendix). These genes are likely to influence WUE.
In Defra project WU0116 we are fine mapping WUE QTL identified in the Col-gl x Kas-1 mapping
population, and in Table A1 we indicated the positions of three QTL of interest. In all cases there are promising
association mapping hits that fall within the regions of these QTL. This information will be useful when
attempting to find the genes underlying these three QTL.
In the case of mineral traits, preliminary analysis of the association hits was performed for three of the
minerals: phosphorus, iron and sodium. Again, several candidate genes could be identified that, based on their
annotation, could be hypothesized to affect these traits (Table A2, appendix).
Future work
Association mapping should be viewed as a powerful method to generate hypotheses that specific alleles at a
given locus cause a trait to increase or decrease; the association by itself cannot prove a cause and effect
relationship between gene and trait. Further work is needed to form the gene hypotheses that are most likely to be
correct, and then to test them by performing experiments to directly test gene functions.
Improved association mapping approaches
The preliminary work looking for multiple alleles in regions described by several adjacent SNPs was promising.
However, the markers found around FRI were around 100 kb away from it in either direction. The linkage
disequilibrium behaviour of markers constructed in this way needs to be studied, and further work done to
determine the best number of adjacent SNPs to include before this approach can be implemented across the
genome.
The trait values for the accessions are influenced by several QTL simultaneously. This means that when trying to
detect a QTL through association at a particular marker, the signal is reduced by the noise from the effect of all
the other QTL, elsewhere in the genome. One approach to solving this problem in the context of using interval
mapping to detect QTL is multiple QTL mapping (MQM). MQM involves a series of passes across the data,
where the QTL found in previous passes are included in the analysis, so that the effects of already found QTL do
not add to the noise when searching for new QTL. This approach could be adapted to GWAM, which would be
expected to increase the sensitivity of the analysis, and by explicitly looking at the effects of combinations of
QTL would give some information about epistatic effects.
Hypothesis forming
Further analysis of the large amounts of output data is needed in order to look for co-localisation of associations
for different traits, and to prioritise the associations that should be taken forward, i.e. by using statistical analysis
and biological interpretation to select hits for further experimentation. The statistical analysis would automate the
process of finding groups of closely linked markers with significant associations to several related traits. In doing
so it would allow such associations to be found where the significance levels were not extremely high for all of
the traits compared. By biological interpretation, we mean that a candidate gene may already have some functional
information from genome annotation or published literature that would suggest a role in a particular trait, or that
co-localisation of different, but physiologically related, trait data to the same marker associations would suggest that
the hits were more robust.
Population structure results in markers in widely different areas of the genome being in linkage
disequilibrium. If a QTL is closely linked to one of these markers, then all of them will show association with the
trait. Where this has occurred we could seek to pick accessions which have not been scored for the trait but have
been scored for the markers which break the linkage disequilibrium between the different areas of the genome. If
these accessions were then scored for the trait the data should indicate which of the areas located in the earlier
analysis has the QTL. In this approach a collection of lines would be selected from the lines that have already
been genotyped at the 241,000 loci, and phenotype data would be collected and the association analysis repeated.
This should reduce the number of regions that contain associated markers, and make it easier to select candidate
genes for further functional experiments. Selection of such accessions will become easier in the future, as the
number of accessions scored for the markers will increase: the Nordborg team aim to score around 1300
accessions.
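One possible way to shortlist such accessions is sketched below. This is a hypothetical illustration in Python, not a method used in the project. It assumes two markers in long-range linkage disequilibrium, takes the two most common allele pairs as the combinations that carry the LD, and returns the accessions with any other allele pair that have not yet been scored for the trait.

```python
from collections import Counter


def ld_breaking_accessions(marker_a, marker_b, phenotyped):
    """Indices of accessions whose allele pair at two markers is not one of
    the two most common pairs and which are not in the `phenotyped` set.
    Assumption: the two most common pairs represent the LD between markers."""
    pairs = list(zip(marker_a, marker_b))
    common = {pair for pair, _ in Counter(pairs).most_common(2)}
    return [i for i, pair in enumerate(pairs)
            if pair not in common and i not in phenotyped]
```

Scoring the trait in the accessions returned would add individuals in which the two regions are no longer confounded, which is the information needed to decide which region holds the QTL.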
Testing gene functions
Candidate genes would be investigated by phenotypic assessment of knockout lines (available from NASC for
most genes), and over-expression lines (alleles with a positive effect would be transferred to a genetic background
with a decreasing allele). Such experiments need to take into account the possibility of epistatic interactions
whereby an allele will only function in the presence of specific alleles at other loci. Data from the association
analysis could be used to select the best parental lines for transformation experiments that would maintain
potential positive epistatic interactions.
Application to crop species
Further detailed analysis is required to substantiate the basis of these marker associations. However, they all
present candidate genes whose function can be confirmed in transgenic experiments, or through comparison of
ecotypes with different alleles, but in populations chosen to be balanced at other loci. They can also form the
basis of gene-targeted association mapping (rather than genome-wide) in germplasm collections in other species
such as Brassica oleracea and Brassica napus. This is an attractive approach as it rapidly addresses the question
of whether a particular gene influences the trait of interest in the crop species; to perform such experiments, field
trial trait data is needed for a diversity set of the target crop, and then the diversity set must be genotyped at a
small number of markers located around the target gene.
Evidence of a cause-and-effect relationship for an associated trait/gene requires construction of near-isogenic
lines, transgenic experiments, or pilot MAS breeding programs.
Knowledge Transfer Activities
The following presentations were given:
• Invited Seminar, SCRI, Dundee (2008). Improving water use efficiency in field crops. Dr Andrew
Thompson.
• Brassica2008, Lillehammer, Norway (Sept. 8th-12th, 2008). Selected presentation for trait genetics
session. "Water-Use-Efficiency Genes in Brassica oleracea". Carol Ryder.
• Jan 2009: AJT invited speaker: Understanding and Exploiting Plant Signalling: A JXB Meeting to
celebrate the contributions to plant science made by Ernst Steudle and Wolfram Hartung.
• March 2009: AJT invited speaker: Academia Sinica, Taipei, Institute of Plant and Microbial Biology.
Sensing, response and adaptation to altered water status.
John Hammond also made several presentations related to nutrient use efficiency.
References
1. Aranzana et al., Genome-wide association mapping in Arabidopsis identifies previously known flowering
time and pathogen resistance genes. PLoS Genetics, 2005. 1(5): e60.
2. Flint-Garcia et al., Structure of linkage disequilibrium in plants. Annu Rev Plant Biol, 2003. 54: p. 357-374.
3. Hothorn et al., Implementing a class of permutation tests: The coin package. Journal of Statistical
Software, 2008. 28(8): p. 1-23.
4. Tominaga, R., et al., Arabidopsis CAPRICE-LIKE MYB 3 (CPL3) controls endoreduplication and
flowering development in addition to trichome and root hair formation. Development, 2008. 135(7): p.
1335-1345.
9. References to published material
This section should be used to record links (hypertext links where possible) or references to other
published material generated by, or relating to this project.
Publications are expected in a subsequent project to extend this work. In this six month project there were
no publications directly from the work.
Publications related to the projects are listed below:
1. Deswarte, J-C (2008) The genetic control of Water-Use Efficiency in Arabidopsis thaliana (L.) Heynh.
PhD Thesis, University of Warwick.
2. Tung S A, Smeeton R, White C A, Black C R, Taylor I B, Hilton H W, Thompson A J (2008) 'Overexpression of LeNCED1 in tomato (Solanum lycopersicum L.) with the rbcS3C promoter allows
recovery of lines that accumulate very high levels of abscisic acid and exhibit severe phenotypes.',
Plant Cell And Environment, 31 968 - 981
3. Jones, M.O., Manning, K., Andrews, J., Wright, C., Taylor, I.B. and Thompson, A.J. (2008) 'The
promoter from SlREO, a highly-expressed, root-specific Solanum lycopersicum gene, directs
expression to cortex of mature roots', Functional Plant Biology, 35 (12), 1224 - 1233
4. Thompson A. J., Hilton H. (Feb., 2008) “Improving water efficiency in plants”. Warwick iCAST.
http://www2.warwick.ac.uk/newsandevents/icast/archive/s2week12/water/