Download ESTs

Solanaceae 2006
BAC Annotation
2006. 07. 26
Plant Genome Research Center
KRIBB, KOREA
Development Environments
• OS: SGI IRIX 6.5
• CPU: MIPS 500MHz, 12 CPUs
• MEM: 12288 MB
• OS: SUSE Linux 9.0, version 2.6.11.4-21.11-bigsmp
• CPU: Intel(R) Xeon(TM) CPU 2.80GHz
• MEM: 6231 MB
• DBMS: MySQL-4.0.25
• Language: PHP 5.0.4, Apache 2.0.54, Perl-5.8.7
Data Sets
• BACs (SGN test BACs)
– Annotated: 10
• ESTs: 200,015 (cf. 202,043 current)
• Full-length mRNAs (GenBank): 596
• Protein DB (UniProt Release 7.7)
– Swiss-Prot/trEMBL: 228,917 / 2,914,826
– Swiss-Prot/trEMBL (plant): 15,203 / 219,361
• Arabidopsis Proteins
– Proteins, Genomes (TAIR): 30,693
– GO associated (TAIR): 28,812
– Pathway/EC associated (KEGG): 1,521
• Tomato Chip DATA - Tomato Expression Database (Cornell)
Structural Annotation
Tools / Data by target and analysis (SGN Guideline vs. KRIBB)
• Protein Coding Genes
– Computational Gene Prediction
   SGN Guideline: GeneMark.hmm, FGENESH, GlimmerM, GENSCAN+, Eugene
   KRIBB: FGENESH (N. tabacum), GENSCAN
– Experimental Gene Identification
   SGN Guideline: GeneSeqer, SIM4, BLAST (Tomato cDNAs, ESTs, unigenes)
   KRIBB: BLAT, SIM4, GMAP, GeneSeqer (dbEST, GenBank mRNAs), GeneWise2.0 (GenPept Proteins)
– Resolution of Conflict
   SGN Guideline: PASA, GeneSeqer (Automatic), Apollo Genome Viewer (Manual)
   KRIBB: Combined Modeller (Automatic), Apollo Genome Viewer (Manual)
• tRNA
– Computational tRNA Prediction
   SGN Guideline: tRNAscan-SE
   KRIBB: tRNAscan-SE
• Other RNAs
– Similarity-based RNA Identification (microRNAs, snoRNAs): Cross-match (GenBank rRNA, Rfam)
• Promoter
– TFBS/Promoter analysis
   SGN Guideline: -
   KRIBB: Transfac, MEME, Gibbs, Pratt
• Repeats
– Repeat Scanning
   SGN Guideline: -
   KRIBB: RepeatMasker/Cross-match (RepBase/TIGR Plant Repeats)
Functional Annotation
Tools / Data by target and analysis (SGN Guideline vs. KRIBB)
• Function of Protein Coding Genes
– Conserved Functional Domains
   SGN Guideline: InterProScan (InterPro Databases)
   KRIBB: InterProScan (InterPro Databases)
– Homology to Proteins
   SGN Guideline: BLASTx (Arabidopsis, rice, Medicago, Swiss-Prot, GenBank nr)
   KRIBB: BLASTx, WU-BLAST-2.0 (Swiss-Prot, trEMBL, Arabidopsis)
– Gene Ontology assignment
   SGN Guideline: -
   KRIBB: BLASTx (Arabidopsis Proteins associated with GOA, TAIR GO data)
– EC/Pathway
   SGN Guideline: -
   KRIBB: BLASTx (Arabidopsis Proteins associated with KEGG EC/Pathway data)
– TFBS / Promoter: WU-BLAST2 (blastx) (Arabidopsis proteins associated with TFBS/Promoter)
– Protein Location Predictions
   SGN Guideline: Transmembrane Domains (TMHMM), Subcellular Location (TargetP)
   KRIBB: Transmembrane Domains (TMHMM), Subcellular Location (TargetP)
Define gene structure by various data evidences
• Full-length evidenced genes (mRNAs / Proteins)
• Full-length clue evidenced genes (Full-length clue ESTs from Kazusa full-length cDNA library)
• Partially evidenced genes (Other partial ESTs)
• No-evidenced genes (Prediction only)
[Figure: diagram tracks labelled Predict, mRNA, Protein, EST]
1) Full-length Evidenced Genes
• Gene locus with full-length mRNA / Protein (GMAP, GeneWise)
• Almost complete gene structure: Gene boundary (mRNA: TSS/poly-A, protein: CDS), Exon/Intron, (some alternative splicing structure)
• Requirement: more than 1 mRNA or Proteins
• Processing:
– Merge the same AS forms
– mRNA evidence: Predict CDS (ESTscan etc.)
– Protein evidence: Mend gene boundary (TSS, poly-A)
[Figure: sample locus; track labels: Predicted Genes, ESTs, mRNAs, TIGR TC, Protein, stackPACK]
2) Full-length Clue Evidenced Genes
• Gene locus with full-length clue ESTs from Kazusa full-length cDNA library (GMAP)
• Gene boundary (TSS, poly-A), some Exon/Intron
• Requirement: more than 1 full-length clue ESTs
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTscan / orfPredictor etc.)
[Figure: sample locus; track labels: Predicted Genes, ESTs, Full-length Clue ESTs (Kazusa)]
3) Partially Evidenced Genes
• Gene locus with general ESTs (GMAP)
• Some Exon/Intron, poly-A
• More ESTs, more information expected
• Requirement: more than 2 ESTs with more than 2 couples of overlapped hard-edges
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTscan/orfPredictor etc.)
[Figure: sample locus; track labels: Predicted Genes, ESTs, EST1, EST2]
4) No-evidenced Genes
• Predicted model only (hypothetical gene)
• Predicted CDS
[Figure: sample locus with a predicted model only; label: No Evidence !!]
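The four evidence classes above form a priority cascade: the strongest available evidence decides the class. A minimal sketch of that cascade follows. The names `LocusEvidence` and `classify_locus` are hypothetical, and the numeric thresholds are assumptions, because the slides' "more than N" wording is ambiguous. The "overlapped hard-edges" criterion for partial ESTs is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class LocusEvidence:
    """Counts of evidence aligned to one gene locus (hypothetical record)."""
    full_length: int = 0            # full-length mRNAs or proteins
    full_length_clue_ests: int = 0  # ESTs from a full-length cDNA library
    other_ests: int = 0             # remaining partial ESTs


def classify_locus(ev, min_full_length=1, min_clue_ests=1, min_other_ests=2):
    """Return the evidence class, checking the strongest evidence first.

    The default thresholds are assumptions, not values confirmed by the slides.
    """
    if ev.full_length >= min_full_length:
        return "full-length evidenced"
    if ev.full_length_clue_ests >= min_clue_ests:
        return "full-length clue evidenced"
    if ev.other_ests >= min_other_ests:
        return "partially evidenced"
    return "no-evidenced"
```

For example, a locus with one full-length mRNA and five partial ESTs is classed as full-length evidenced, because the stronger evidence takes priority.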
Gene Structure Annotation - Problems
False positive intergenic region:
2 annotated genes actually correspond to a single gene
False negative intergenic region:
One annotated gene structure actually contains 2 genes
False negative gene prediction:
Missing gene (no annotation)
Other:
partially incorrect gene annotation
missing annotation of alternative transcripts (alternative splicing)
Pseudo-genes
Promoter / Regulatory Elements
Estimated Gene Prediction
CATEGORY                               NUMBER
Predicted Genes                        301
TSS                                    294
Start Codon                            296
Stop Codon                             297
PAS signals 1)                         100
PolyA (≥ 7)                            296
Genes overlapping EST Clusters         148
Genes hitting multiple EST Clusters    61
Genes hitting single EST Clusters      87
Genes overlapping ESTs                 165
EST mapping Genes (≥ 2)                109
EST mapping Genes (= 1)                56
Genes hitting mRNAs                    6
Genes hitting Full-length cDNAs        20
1) hexamer signal: A(A/U)UAAA, the predicted polyadenylation signal (PAS) hexamers
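The footnote above refers to polyadenylation signal (PAS) hexamers. A minimal sketch of scanning a sequence for them follows, assuming the canonical AAUAAA / AUUAAA forms (written with T for DNA input). The function name `find_pas_signals` is hypothetical, and how the slide's own counts were produced is not stated.

```python
import re

# Lookahead so that overlapping hexamers are all reported.
PAS_PATTERN = re.compile(r"(?=(A[AT]TAAA))")


def find_pas_signals(seq):
    """Return (0-based start, hexamer) for each AATAAA/ATTAAA occurrence."""
    dna = seq.upper().replace("U", "T")  # accept RNA or DNA input
    return [(m.start(), m.group(1)) for m in PAS_PATTERN.finditer(dna)]
```

Calling `find_pas_signals("GGAATAAACCATTAAAGG")` reports one AATAAA and one ATTAAA hit with their start positions.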
Gene Structure Browser
[Screenshot; track labels: FGENESH, GENSCAN, Protein, Repeats / Domain, mRNA, dbESTs, TIGR TC, Unigene, Kazusa Full ESTs]
• Test BLAT/SIM4/GMAP/GeneSeqer
– BLAT – Fast/Inaccurate
– SIM4/GMAP/GeneSeqer – Approx. the same results
• KRIBB: Prefiltering ESTs by BLAT + GMAP
• Cutoff: Coverage > 80%, Identity > 90%
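The cutoff above can be sketched as a simple filter over alignment records. This is only an illustration: the `Alignment` record and the definitions of coverage (aligned EST bases over EST length) and identity (matching bases over aligned bases) are assumptions, since the slides do not say how either value was computed.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One EST-to-BAC alignment (hypothetical record, not a BLAT/GMAP format)."""
    est_id: str
    est_length: int    # total EST length in bases
    aligned_bases: int  # EST bases included in the alignment
    matches: int        # aligned bases identical to the genomic sequence


def coverage(a):
    """Percent of the EST covered by the alignment (assumed definition)."""
    return 100.0 * a.aligned_bases / a.est_length


def identity(a):
    """Percent of aligned bases that match (assumed definition)."""
    return 100.0 * a.matches / a.aligned_bases if a.aligned_bases else 0.0


def passes_cutoff(a, min_coverage=80.0, min_identity=90.0):
    """Apply the strict thresholds: coverage > 80% and identity > 90%."""
    return coverage(a) > min_coverage and identity(a) > min_identity


def filter_alignments(alignments):
    """Keep only alignments that pass both thresholds."""
    return [a for a in alignments if passes_cutoff(a)]
```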
Functional Annotation
Protein DB/ EC / GO
Functional Annotation
Protein DB / GO
TFBS / Promoter
Functional Annotation
TargetP/TMHMM
Enzyme / Pathway
Domain / Motif
Expression Annotation
(Digital Expression )
Principle of identifying differentially expressed genes by Hypergeometric Test
N: ESTs for all genes in all tissues,
n: ESTs for selected gene in all tissues,
K: ESTs for all genes in selected tissue,
k: ESTs for selected gene in selected tissue,
P: Significance of over- or under-expression in selected tissue
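With the symbols defined above, the hypergeometric tail probabilities can be computed directly. The sketch below uses only the standard library. The function name `hypergeom_tails` is hypothetical, and treating over-expression as P(X ≥ k) and under-expression as P(X ≤ k) is an assumption about how the slide's P is defined.

```python
from math import comb


def hypergeom_tails(N, n, K, k):
    """Return (p_over, p_under) for the hypergeometric test.

    N: ESTs for all genes in all tissues
    n: ESTs for the selected gene in all tissues
    K: ESTs for all genes in the selected tissue
    k: ESTs for the selected gene in the selected tissue
    p_over  = P(X >= k), p_under = P(X <= k)
    """
    lo = max(0, n - (N - K))  # smallest possible overlap
    hi = min(n, K)            # largest possible overlap
    if not lo <= k <= hi:
        raise ValueError("k is outside the feasible range")
    total = comb(N, n)

    def pmf(i):
        # Probability that exactly i of the gene's n ESTs fall in the tissue.
        return comb(K, i) * comb(N - K, n - i) / total

    p_over = sum(pmf(i) for i in range(k, hi + 1))
    p_under = sum(pmf(i) for i in range(lo, k + 1))
    return p_over, p_under
```

A small p_over suggests the gene is over-represented in the selected tissue, and a small p_under suggests it is under-represented.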
Expression Annotation
(ARRAY CHIP)
Expression Annotation
(Tissue Specific Genes)
Principle of identifying differentially expressed genes by Audic's Test
x: number of cognate ESTs of a given gene in the selected library
N1: total number of ESTs in the selected library
y: number of cognate ESTs of a given gene in the other library
N2: total number of ESTs in the other library
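The slide's formula did not survive extraction. Assuming "Audic's Test" means the Audic and Claverie (1997) statistic, and that N1 and N2 are the total EST counts of the two libraries, the probability of observing y ESTs in the other library given x in the selected library is p(y|x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)). A minimal sketch follows, with a hypothetical function name.

```python
from math import exp, lgamma, log


def audic_claverie_prob(x, y, n1, n2):
    """p(y | x) under the Audic-Claverie model (assumed to be the slide's test).

    x, y: EST counts of the gene in the selected and the other library
    n1, n2: total EST counts of the selected and the other library
    Computed in log space so that large counts do not overflow.
    """
    ratio = n2 / n1
    log_p = (
        y * log(ratio)
        + lgamma(x + y + 1)   # log (x+y)!
        - lgamma(x + 1)       # log x!
        - lgamma(y + 1)       # log y!
        - (x + y + 1) * log(1 + ratio)
    )
    return exp(log_p)
```

A significance value is then usually obtained by summing p(y'|x) over the y' at least as extreme as the observed y. That summation is left out here because the slides do not specify it.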
Pepper tissue-specific gene analysis
* 25 cycles, annealing temp. 55℃
* (# of ESTs)
[Figure labels: CaActin, CacnA (16), CacnB (18), CacnC (13), CacnD (10), CacnE (25), CacnF (31), CacnG (20); tissue labels: Flower, Pathogen, Fruit]
Annotation Results
Property                     Value   Unit
BAC (Annotated)              10      BAC
Length (Average)             120     kb
Putative Protein CDSs        301     gene
Gene Density                 4.2     kb/gene
Gene Length, Average         3.1     kb
Exon Length, Average         338     bp
Exons per Gene, Average      8.4     exon/gene
With ESTs                    165     gene
Protein Annotated            196     gene
Domain Annotated             213     gene
GO Annotated                 144     gene
Pathway Annotated            17      gene
EC Annotated                 17      gene
TFBS/Promoter Annotated      127     gene
Tissue specific Annotated    56      gene
Expression Annotated         18      gene
tRNA                         0       gene
Repeats                      144     kb
Thanks !!
Solanaceae 2006 BAC Annotation Test page
http://crop.kribb.re.kr/SOL-Test/
http://sol.kribb.re.kr/