Solanaceae 2006 BAC Annotation
2006. 07. 26
Plant Genome Research Center, KRIBB, KOREA

Development Environment
• Server 1
  – OS: SGI IRIX 6.5
  – CPU: MIPS 500MHz, 12 CPUs
  – MEM: 12288 MB
• Server 2
  – OS: SUSE Linux 9.0, version 2.6.11.4-21.11-bigsmp
  – CPU: Intel(R) Xeon(TM) CPU 2.80GHz
  – MEM: 6231 MB
• DBMS: MySQL-4.0.25
• Language: PHP 5.0.4, Apache 2.0.54, Perl-5.8.7

Data Sets
• BACs (SGN test BACs)
  – Annotated: 10
• ESTs: 200,015 (cf. 202,043 currently)
• Full-length mRNAs (GenBank): 596
• Protein DB (UniProt Release 7.7)
  – Swiss-Prot / trEMBL: 228,917 / 2,914,826
  – Swiss-Prot / trEMBL (plant): 15,203 / 219,361
• Arabidopsis Proteins
  – Proteins, Genomes (TAIR): 30,693
  – GO associated (TAIR): 28,812
  – Pathway/EC associated (KEGG): 1,521
• Tomato chip data: Tomato Expression Database (Cornell)

Structural Annotation
(Columns: Target; Analysis; Tools / Data under the SGN Guideline and at KRIBB)

Protein Coding Genes
• Computational Gene Prediction
  – SGN Guideline: GeneMark.hmm, FGENESH, GlimmerM, GENSCAN+, EuGene
  – KRIBB: FGENESH (N. tabacum), GENSCAN
• Experimental Gene Identification
  – SGN Guideline: GeneSeqer, SIM4, BLAST (Tomato cDNAs, ESTs, unigenes)
  – KRIBB: BLAT, SIM4, GMAP, GeneSeqer (dbEST, GenBank mRNAs), GeneWise2.0 (GenPept Proteins)
• Resolution of Conflict
  – SGN Guideline: PASA, GeneSeqer (Automatic); Apollo Genome Viewer (Manual)
  – KRIBB: Combined Modeller (Automatic); Apollo Genome Viewer (Manual)

tRNA
• Computational tRNA Prediction
  – SGN Guideline: tRNAscan-SE
  – KRIBB: tRNAscan-SE

Other RNAs
• Similarity-based RNA Identification (microRNAs, snoRNAs)
  – Cross-match (GenBank rRNA, Rfam)

Promoter
• TFBS/Promoter analysis
  – SGN Guideline: -
  – KRIBB: Transfac, MEME, Gibbs, Pratt

Repeats
• Repeat Scanning
  – SGN Guideline: -
  – KRIBB: RepeatMasker/Cross-match (RepBase/TIGR Plant Repeats)

Functional Annotation
(Columns: Target; Analysis; Tools / Data under the SGN Guideline and at KRIBB)

Function of Protein Coding Genes
• Conserved Functional Domains
  – SGN Guideline: InterProScan (InterPro Databases)
  – KRIBB: InterProScan (InterPro Databases)
• Homology to Proteins
  – SGN Guideline: BLASTx (Arabidopsis, rice, Medicago, Swiss-Prot, GenBank nr)
  – KRIBB: BLASTx, WU-BLAST-2.0 (Swiss-Prot, trEMBL, Arabidopsis)
• Gene Ontology assignment
  – SGN Guideline: -
  – KRIBB: BLASTx (Arabidopsis proteins associated with GOA, TAIR GO data)
• EC/Pathway
  – SGN Guideline: -
  – KRIBB: BLASTx (Arabidopsis proteins associated with KEGG EC/Pathway data)
• TFBS / Promoter
  – WU-BLAST2 (blastx) against Arabidopsis proteins associated with TFBS/Promoter
• Protein Location Predictions
  – SGN Guideline: Transmembrane Domains (TMHMM), Subcellular Location (TargetP)
  – KRIBB: Transmembrane Domains (TMHMM), Subcellular Location (TargetP)

Define gene structure by various types of evidence
• Full-length evidenced genes (mRNAs / Proteins)
• Full-length clue evidenced genes (full-length clue ESTs from the Kazusa full-length cDNA library)
• Partially evidenced genes (other partial ESTs)
• No-evidenced genes (prediction only)
[Figure labels: Predict, mRNA, Protein, EST]

1) Full-length Evidenced Genes
• Gene locus with full-length mRNA / Protein (GMAP, GeneWise)
• Almost complete gene structure: gene boundary (mRNA: TSS/poly-A, protein: CDS), exon/intron, (some alternative splicing structure)
• Requirement: more than 1 mRNA or protein
• Processing:
  – Merge the same AS forms
  – mRNA evidence: predict CDS (ESTscan etc.)
  – Protein evidence: mend gene boundary (TSS, poly-A)
[Sample figure; labels: Predicted Genes, Predict, mRNAs, mRNA, TIGR TC, Protein, stackPACK, ESTs]

2) Full-length Clue Evidenced Genes
• Gene locus with full-length clue ESTs from the Kazusa full-length cDNA library (GMAP)
• Gene boundary (TSS, poly-A), some exon/intron
• Requirement: more than 1 full-length clue EST
• Processing:
  – Merge the same AS forms
  – Link ESTs from the same clone
  – Mend incomplete portion with predicted model
  – CDS to be predicted (ESTscan / orfPredictor etc.)
[Sample figure; labels: Predicted Genes, Predict, ESTs, EST, Full-length Clue ESTs (Kazusa)]

3) Partially Evidenced Genes
• Gene locus with general ESTs (GMAP)
• Some exon/intron, poly-A
• More ESTs, more information expected
• Requirement: more than 2 ESTs with more than 2 pairs of overlapping hard edges
• Processing:
  – Merge the same AS forms
  – Link ESTs from the same clone
  – Mend incomplete portion with predicted model
  – CDS to be predicted (ESTscan / orfPredictor etc.)
[Sample figure; labels: Predicted Genes, ESTs]
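The "Merge the same AS forms" step that recurs in the processing lists above can be illustrated with a small sketch. The slides do not say how two alignments are judged to be the same alternative-splicing form, so this sketch assumes that two spliced alignments belong to the same form when they imply exactly the same ordered introns. The function names and the exon-tuple input format are hypothetical and are not taken from the KRIBB pipeline.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end) on the BAC, 1-based inclusive


def intron_chain(exons: List[Exon]) -> Tuple[Tuple[int, int], ...]:
    """Return the ordered introns implied by a list of aligned exons."""
    ordered = sorted(exons)
    return tuple(
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    )


def merge_same_as_forms(alignments: Dict[str, List[Exon]]) -> List[dict]:
    """Group alignments that share an intron chain (assumed meaning of 'same AS form').

    Single-exon alignments have no introns, so each one is kept in its own
    group instead of being merged with every other single-exon alignment.
    """
    groups = defaultdict(list)
    for name, exons in alignments.items():
        chain = intron_chain(exons)
        key = (chain, None if chain else name)
        groups[key].append(name)

    merged = []
    for (chain, _), names in groups.items():
        merged.append({
            "introns": chain,
            "members": sorted(names),
            # The merged form spans the outermost ends of its members.
            "start": min(s for n in names for s, _ in alignments[n]),
            "end": max(e for n in names for _, e in alignments[n]),
        })
    return merged


example = {
    "est1": [(100, 200), (300, 400)],
    "est2": [(150, 200), (300, 450)],  # same intron as est1, different ends
    "est3": [(100, 200), (320, 400)],  # different acceptor site
}
forms = merge_same_as_forms(example)
```

In this example `est1` and `est2` collapse into one form spanning 100 to 450, while `est3` remains a separate form because its intron differs.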
4) No-evidenced Genes
• Predicted model only (hypothetical gene)
• Predicted CDS
[Sample figure; labels: Predict, No Evidence !!]

Gene Structure Annotation - Problems
• False positive intergenic region: 2 annotated genes actually correspond to a single gene
• False negative intergenic region: one annotated gene structure actually contains 2 genes
• False negative gene prediction: missing gene (no annotation)
• Other:
  – Partially incorrect gene annotation
  – Missing annotation of alternative transcripts
• Alternative Splicing
• Pseudo-genes
• Promoter / Regulatory Elements

Estimated Gene Prediction
CATEGORY: NUMBER
• Predicted Genes: 301
• TSS: 294
• Start Codon: 296
• Stop Codon: 297
• PAS signals 1): 100
• PolyA (≥ 7): 296
• Genes overlapping EST Clusters: 148
  – Genes hitting multiple EST Clusters: 61
  – Genes hitting single EST Clusters: 87
• Genes overlapping ESTs: 165
  – EST mapping Genes (≥ 2): 109
  – EST mapping Genes (= 1): 56
• Genes hitting mRNAs: 6
• Genes hitting Full-length cDNAs: 20
1) Hexamer signal A(A/U)AAA; PASes (predicted polyadenylation signals) hexamers

Gene Structure Browser
[Figure; tracks: FGENESH, GENSCAN, Protein, Repeats / Domain, mRNA, dbESTs, TIGR TC, Unigene, Kazusa Full ESTs]

Test BLAT/SIM4/GMAP/GeneSeqer
– BLAT: fast but inaccurate
– SIM4/GMAP/GeneSeqer: approximately the same results
KRIBB: Prefiltering ESTs by BLAT + GMAP
Cutoff: Coverage > 80%, Identity > 90%
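The EST prefilter cutoff above (Coverage > 80%, Identity > 90%) can be sketched as a simple filter over alignment records. The slides do not define either quantity, so this sketch assumes that coverage is the percentage of the EST included in the alignment and identity is the percentage of aligned bases that match the BAC. The `Alignment` record and `prefilter` function are hypothetical names, and parsing of actual BLAT or GMAP output is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Alignment:
    est_id: str
    est_length: int     # full length of the EST in bases (assumed > 0)
    aligned_bases: int  # EST bases included in the alignment
    matches: int        # aligned bases identical to the BAC sequence

    @property
    def coverage(self) -> float:
        """Percentage of the EST covered by the alignment (assumed definition)."""
        return 100.0 * self.aligned_bases / self.est_length

    @property
    def identity(self) -> float:
        """Percentage of aligned bases that match (assumed definition)."""
        if self.aligned_bases == 0:
            return 0.0
        return 100.0 * self.matches / self.aligned_bases


def prefilter(alignments: Iterable[Alignment],
              min_coverage: float = 80.0,
              min_identity: float = 90.0) -> List[Alignment]:
    """Keep alignments strictly above both cutoffs, as on the slide."""
    return [a for a in alignments
            if a.coverage > min_coverage and a.identity > min_identity]


candidates = [
    Alignment("e1", 500, 450, 440),  # coverage 90%, identity about 97.8%
    Alignment("e2", 500, 300, 295),  # coverage 60%: rejected
    Alignment("e3", 500, 480, 400),  # identity about 83.3%: rejected
    Alignment("e4", 500, 400, 400),  # coverage exactly 80%: rejected (strict >)
]
kept = prefilter(candidates)
```

Both comparisons are strict because the slide states the cutoffs with ">"; an alignment sitting exactly on a threshold is dropped.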
Functional Annotation
[Figure panels: Protein DB / EC / GO]

Functional Annotation
[Figure panels: Protein DB / GO; TFBS / Promoter]

Functional Annotation
[Figure panels: TargetP/TMHMM; Enzyme / Pathway; Domain / Motif]

Expression Annotation (Digital Expression)
Principle of identifying differentially expressed genes by the Hypergeometric Test
• N: ESTs for all genes in all tissues
• n: ESTs for the selected gene in all tissues
• K: ESTs for all genes in the selected tissue
• k: ESTs for the selected gene in the selected tissue
• P: significance of over- or under-expression in the selected tissue

Expression Annotation (ARRAY CHIP)

Expression Annotation (Tissue Specific Genes)
Principle of identifying differentially expressed genes by Audic's Test
• x: number of cognate ESTs of a given gene in the selected library
• N1: total number of ESTs in the selected library
• y: number of cognate ESTs of a given gene in the other library
• N2: total number of ESTs in the other library

Pepper tissue-specific gene analysis
* 25 cycles, annealing temp. 55℃
* (# of ESTs)
[Figure; genes: CaActin, CacnA (16), CacnB (18), CacnC (13), CacnD (10), CacnE (25), CacnF (31), CacnG (20); tissue labels: Flower, Pathogen, Fruit]

Annotation Results
Property: Value Unit
• BAC (Annotated): 10 BAC
• Length (Average): 120 kb
• Putative Protein CDSs: 301 gene
  – Gene Density: 4.2 kb/gene
  – Gene Length, Average: 3.1 kb
  – Exon Length, Average: 338 bp
  – Exons per Gene, Average: 8.4 exon/gene
  – With ESTs: 165 gene
  – Protein Annotated: 196 gene
  – Domain Annotated: 213 gene
  – GO Annotated: 144 gene
  – Pathway Annotated: 17 gene
  – EC Annotated: 17 gene
  – TFBS/Promoter Annotated: 127 gene
  – Tissue specific Annotated: 56 gene
  – Expression Annotated: 18 gene
• tRNA: 0 gene
• Repeats: 144 kb

Thanks !!
Solanaceae 2006 BAC Annotation Test page
http://crop.kribb.re.kr/SOL-Test/
http://sol.kribb.re.kr/
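Supplementary note on the Digital Expression slide: the hypergeometric principle stated there can be written out as a short sketch using the same symbols (N, n, K, k). The slides give only the symbol definitions, so the one-sided tail sums below are an assumed reading of "significance of over- or under-expression", and the function names are hypothetical. Exact integer binomials from the standard library are used; for EST counts in the hundreds of thousands a statistics library would normally be faster.

```python
from math import comb
from typing import Tuple


def hypergeom_pmf(k: int, N: int, n: int, K: int) -> float:
    """P(X = k): probability that exactly k of the K ESTs drawn for the
    selected tissue belong to the selected gene, which has n ESTs among
    N ESTs in total."""
    return comb(n, k) * comb(N - n, K - k) / comb(N, K)


def expression_pvalues(k: int, N: int, n: int, K: int) -> Tuple[float, float]:
    """Return (P_over, P_under) for the selected gene in the selected tissue.

    P_over  = P(X >= k): chance of seeing at least k ESTs by chance alone.
    P_under = P(X <= k): chance of seeing at most k ESTs by chance alone.
    """
    lowest = max(0, K - (N - n))  # fewest gene ESTs the tissue can contain
    highest = min(n, K)           # most gene ESTs the tissue can contain
    if not lowest <= k <= highest:
        raise ValueError("k is outside the possible range for N, n, K")
    p_over = sum(hypergeom_pmf(i, N, n, K) for i in range(k, highest + 1))
    p_under = sum(hypergeom_pmf(i, N, n, K) for i in range(lowest, k + 1))
    return p_over, p_under


# Toy numbers: 10 ESTs overall, 4 from the gene, 5 from the tissue,
# and all 4 gene ESTs fall in that tissue.
p_over, p_under = expression_pvalues(k=4, N=10, n=4, K=5)
```

A small `p_over` suggests over-expression in the selected tissue and a small `p_under` suggests under-expression; any significance threshold is a separate choice not stated on the slides.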