The Complete Arabidopsis Transcriptome MicroArray (CATMA) Project
P. Hilson, T. Altmann, S. Aubourg, J. Beynon, F. Bitton, M. Caboche, M. Crowe, P. Dehais, H. Eickhoff, E. Kuhn, S. May, W. Nietfeld, J.
Paz-Ares, W. Rensink, P. Reymond, P. Rouzé, U. Schneider, C. Serizet, A. Tabrett, V. Thareau, M. Trick, G. van den Ackerveken, P. Van
Hummelen, P. Weisbeek, M. Zabeau
http://jic-bioinfo.bbsrc.ac.uk/CATMA/
Introduction
Most cDNA clones included in DNA arrays are identified by an EST covering only a
portion of their length. The complete clone sequence is generally unknown and is not
selected to yield hybridisation results specific to a single gene. Moreover, ESTs
represent only about half the genes identified in model eukaryote genomes.
To bypass these shortcomings, we are constructing a collection of high quality Gene
Specific Tags (GSTs) representing most Arabidopsis genes for use in microarray
transcriptome analyses and in other functional genomic approaches.

1. Gene structural annotation
The identification of each gene in the Arabidopsis genome is at the root of any
genome-wide effort to study gene expression. Since the structure of only a minority of
Arabidopsis genes has been determined experimentally so far, annotation still relies on
gene prediction to identify the boundaries of transcription units and of the exon(s)
within them (The AGI Consortium, 2000). Using the AGI nuclear genome sequence, we
have generated an updated structural annotation of all 5 Arabidopsis chromosomes.
The annotation process has been automated. It uses the EuGène software (Schiex et al.,
2001) with a single set of parameters and algorithms applied to all chromosome
regions (Figure 1A). Its prediction quality has been tested by matching results against a
set of experimentally defined full-length cDNAs as described by Rouzé and
collaborators (Pavy et al., 1999). Quality assessment parameters for chromosome 2
annotation are shown in Table 1.
EuGène identifies 29,804 genes in the Arabidopsis nuclear genome, more than the
25,470 identified by the AGI (Figure 2). The detailed comparative analysis of
the EuGène and AGI annotations is currently underway. Preliminary observations
indicate that EuGène's higher number results from a combination of several factors:
EuGène can predict two genes where AGI annotates one, it predicts genes where none
is annotated by AGI (3,369) more often than the contrary (1,533), and it seems biased
towards overprediction in pericentromeric regions rich in repeated sequences.

2. Automated design of GSTs
The Specific Primer & Amplicon Design Software (SPADS) selects specific regions
within genes and designs primer pairs picked to amplify such regions (Figure 1B;
Thareau et al., 2001). The procedure is summarised in the following four steps:
  1. Search for the most specific region within each gene. Each exon is tested with
  BLASTn against the whole genome sequence and segments with hits are removed.
  Primer pairs are designed in the remaining regions. If no primer pair is found, the
  mismatch parameter of BLASTn is decreased and only segments with stringent hits
  are subtracted, thus enlarging the specific remaining regions for primer design.
  2. Primer design. The specific regions are used as input for the Primer3 software.
  3. Selection of specific primer pairs. Oligonucleotides designed by Primer3 are
  tested for specificity with BLASTn against a 2 Mb segment containing the gene and
  are excluded if matches indicate potential unwanted PCR amplification.
  4. Analysis of amplicon specificity. Each successive amplicon is tested with BLASTn
  to determine its specificity. If the identity with a putative paralogous sequence is
  over 70%, the amplicon is removed and the next one is processed. GSTs are searched
  from 3' to 5' until one is found.
Our GST design is based on expressed sequences (EST or cDNA) or on coding regions
predicted by EuGène (i.e. excluding UTRs not represented in an EST or cDNA). The GST
lengths range between 150 and 500 bp, which is sufficient to yield a reproducible
microarray signal for transcriptome analysis (Figure 3). Because of the inherently
duplicated nature of the Arabidopsis genome, not all genes will be represented by
perfect GSTs. Rejecting candidate sequences that show over 70% identity with another
sequence in the Arabidopsis nuclear genome, our process has so far identified a GST
for 21,420 (72.0%) of the 29,775 genes identified on all 5 chromosomes (Figure 2).

Figure 4. GST characteristics
A. Distribution of GST lengths: 150-200 bp, 42%; 200-300 bp, 36%; 300-500 bp, 22%.
B. Position of GSTs along the gene (labels: 5', ATG, CDS, center, stop, UTR, 3'): 5115 (24%), 3267 (16%), 12701 (60%).
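The region search of step 1 and the 3' to 5' amplicon scan of step 4 can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the SPADS implementation: the function names, the half-open coordinate convention, the use of 150 bp (the shortest GST length) as the minimum size of a usable region, and the `best_paralog_identity` callback standing in for a BLASTn search are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

# Half-open [start, end) coordinates, assumed to increase from 5' to 3'.
Interval = Tuple[int, int]


def subtract_hits(exon: Interval, hits: List[Interval],
                  min_len: int = 150) -> List[Interval]:
    """Step 1 (sketch): remove segments with BLASTn hits from an exon.

    Returns the remaining hit-free pieces that are at least min_len long.
    min_len = 150 is an assumption based on the shortest GST length.
    """
    start, end = exon
    free: List[Interval] = []
    cursor = start
    for h_start, h_end in sorted(hits):
        # Clip the hit to the exon boundaries.
        h_start, h_end = max(h_start, start), min(h_end, end)
        if h_start >= h_end:
            continue  # hit lies entirely outside the exon
        if h_start > cursor:
            free.append((cursor, h_start))
        cursor = max(cursor, h_end)
    if cursor < end:
        free.append((cursor, end))
    return [(s, e) for s, e in free if e - s >= min_len]


def pick_gst(candidates: List[Interval],
             best_paralog_identity: Callable[[Interval], float],
             max_identity: float = 70.0) -> Optional[Interval]:
    """Step 4 (sketch): scan amplicons from 3' to 5', keep the first specific one.

    best_paralog_identity is a hypothetical callback returning the highest
    percent identity to another genomic sequence (e.g. parsed from BLASTn).
    An amplicon is rejected when that identity is over max_identity.
    """
    for amplicon in sorted(candidates, key=lambda iv: iv[1], reverse=True):
        if best_paralog_identity(amplicon) <= max_identity:
            return amplicon
    return None


# Toy data: one 1000 bp exon with three hit segments, two candidate amplicons.
regions = subtract_hits((0, 1000), [(100, 200), (150, 400), (900, 1200)])
identities = {(650, 880): 82.0, (420, 620): 55.0}
gst = pick_gst(list(identities), lambda iv: identities[iv])
print(regions)  # [(400, 900)]
print(gst)      # (420, 620)
```

In the toy run, the 3'-most amplicon is discarded because its best paralogous match is above 70% identity, so the next amplicon towards the 5' end is retained.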
3. Structure of the GST collection
Figure 1. Gene identification and GST selection
(Panels A and B. Labels: genomic fragment, NetStart, NetGene2, SplicePredictor, BLASTn, BLASTx, RepeatMasker, EuGène, genes, gene sequence, exon coordinates, SPADS, Primer3, primer specificity, GST specificity, GST.)

Table 1. Assessment of EuGène prediction results

Gene level
Data set    Actual genes   Correct gene models   Partial gene models   Split genes   Missing genes
PlantGene   238            182 (76%)             50 (21%)              5 (2%)        1 (0.5%)
Araset      51             37 (67%)              14 (27%)              0             0

Exon level
Data set    Actual exons   Missing exons   Missing central exons   Missing exons in 5'   Missing exons in 3'   Wrong exons
PlantGene   1639           51 (3%)         33 (2%)                 12 (0.7%)             6 (0.4%)              1 (0.06%)
Araset      254            15 (6%)         8 (3%)                  5 (2%)                2 (0.8%)              1 (0.4%)
Primer combination picture: 24 column primers (U 5', e.g. col. 4) x 16 row primers (U 3', e.g. row 2).
Each primer designed to synthesize a GST carries a gene specific 3' domain
corresponding to the sequence selected by SPADS (18-25 nt) and a 5' extension (17 nt)
added to allow for reamplification of the GSTs with a limited set of universal primers.
A set of 40 extensions has been designed so that each sample in a 384-well plate can
be amplified with the unique combination of one row primer and one column primer,
hence avoiding the cross-contamination which often plagues the storage and
dissemination of large-scale clone collections. The primary amplicons obtained from
BAC DNA templates in large excess can be conveniently reamplified and distributed.
Also, amplicon production using BACs increases the quality of the GSTs and the
fraction of successful PCR amplifications by reducing the complexity of the templates
(Figure 5). All GSTs are oriented with regard to transcription, with column primers at
the 5' end (see above picture). As of 26 September 2001, the Consortium had PCR
amplified 16,280 GSTs.
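The arithmetic behind the 40 extensions can be checked with a short sketch: a 384-well plate has 16 rows and 24 columns, so 24 column primers plus 16 row primers give 40 extensions and 384 distinct row/column pairs. The primer names (`U5_col4`, `U3_row2`), the well notation and the example sequences below are hypothetical labels chosen for illustration; only the plate geometry and the primer lengths (17 nt extension, 18-25 nt gene specific domain) come from the text.

```python
import string
from typing import Tuple

ROWS = string.ascii_uppercase[:16]  # rows A to P of a 384-well plate
N_COLS = 24                         # columns 1 to 24


def well_primers(well: str) -> Tuple[str, str]:
    """Return (5' column primer, 3' row primer) names for a well such as 'B4'.

    The naming scheme is hypothetical; column primers sit at the 5' end.
    """
    row, col = well[0].upper(), int(well[1:])
    if row not in ROWS or not 1 <= col <= N_COLS:
        raise ValueError(f"not a 384-well position: {well}")
    return f"U5_col{col}", f"U3_row{ROWS.index(row) + 1}"


def full_primer(extension: str, specific: str) -> str:
    """Join a 17 nt 5' extension to an 18-25 nt gene specific 3' domain (5' to 3')."""
    if len(extension) != 17:
        raise ValueError("the 5' extension is expected to be 17 nt long")
    if not 18 <= len(specific) <= 25:
        raise ValueError("the gene specific domain is expected to be 18-25 nt long")
    return extension + specific


wells = [f"{r}{c}" for r in ROWS for c in range(1, N_COLS + 1)]
pairs = {well_primers(w) for w in wells}
extensions = {name for pair in pairs for name in pair}

print(len(wells), len(pairs), len(extensions))  # 384 384 40
print(well_primers("B4"))                       # ('U5_col4', 'U3_row2')
print(len(full_primer("A" * 17, "ACGT" * 5)))   # 37
```

Because every well has its own primer pair, a sample carried over into a neighbouring well shares at most one of the two universal primers with that well and should not be reamplified exponentially there.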
Figure 2. Gene density according to the EuGène and AGI annotations
(One panel per chromosome, I to V; x axis: position in Mb; series: AGI, EuGène, GST.)
Figure 3. Transcription profiling with a test set of GSTs
(A. Hybridization (Cy5 cDNA). B. Signal according to length: signal against probe length (bp) for predicted gene GST, known gene GST, intergenic region GST, highly expressed cDNA and negative control. Sample labels: flower bud, leaf, root, plantlet.)
Figure 5. Two-step GST amplification
(First round PCR on genomic BAC DNA with specific primer pairs (S5', S3') gives the primary amplicon; second round PCR with universal primer pairs (U5', U3').)
Conclusion
The project is based on a novel complete unified annotation of the Arabidopsis nuclear
genome, generated with our upgraded EuGène software, from which GSTs are
selected with SPADS. We are currently studying how best to complement the current
GST collection to minimize the presence of non-specific probes that allow hybridisation
with transcripts from non-cognate genes. Given its structure, the GST collection
can be adapted to a variety of microarray protocols and procedures. It can also serve as
a key resource for other large scale functional genomic endeavours based on specific
nucleic acid hybridisations, such as systematic Arabidopsis RNAi programmes.
References
The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815.
Pavy N, Rombauts S, Déhais P, Mathé C, Ramana DV, Leroy P and Rouzé P (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15, 887-899.
Schiex T, Moisan A and Rouzé P (2001) EuGène: an eukaryotic gene finder that combines several sources of evidence. In "JOBIM 2000, LNCS 2066", O. Gascuel, M.F. Sagot (Eds.), pp 111-125.
Thareau V, Déhais P, Rouzé P and Aubourg S (2001) Automatic design of gene specific tags for transcriptome studies. Proc. of JOBIM'2001 (Journées Ouvertes Biologie Informatique Mathématiques). Toulouse, France.