Coding Domain Sequence Prediction and Alternative Splicing Detection in Human Malaria Gambiae

Jun Li1, Bing-Bing Wang2, Jose M. Ribeiro3, Kenneth D. Vernick1,4
1. Dept of Microbiology, University of Minnesota, St. Paul, MN.
2. Pioneer Hi-Bred International, Johnston, IA.
3. LMVR/NAID, NIH, MD.
4. UGGIV, Institut Pasteur, Paris, France

Introduction
• Nearly 2/3 of the world's population is at risk for malaria.
• 1.5 to 2.5 million children die annually.
• A. gambiae is the major malaria vector.
• Genome-wide research needs good CDS structure prediction and alternative splicing (AS) information.
• The A. gambiae CDS structures currently in use were predicted by comparative algorithms that are too conservative, so many genes are missing.
• Comparative gene prediction algorithms also have trouble predicting terminal exons; as a result, >40% of the CDS predicted this way lack start and/or stop codons.
• The purpose of this work is to create an A. gambiae-specific gene model, complete the incomplete CDS, and provide AS information.

Combinational Gene Prediction Algorithm
• Gold gene set to train GlimmerHMM: EST clusters with a perfect match to known proteins, perfectly mapped to the gambiae gold path.
• Exon-Gene-Union Algorithm: {x ∈ A} ∪ {x ∈ P} ⇒ x ∈ C, where x is a base pair, A is the ab initio predicted CDS, P is the comparative predicted CDS, and C is the combinational CDS.
• Open-Reading-Frame-Selection Algorithm:
[Flowchart, decision nodes in order: Union CDS; Any internal stop?; A frame spanning the whole region of the Union CDS?; Multiple CDS found by comparative algorithm?; Multiple CDS found by ab initio algorithm?; The longest transcript. Outputs: CDS set; Alternative Splicing.]

Combinational algorithm improves single algorithm prediction

Algorithm                 Sensitivity   Specificity   Complete Rate
GlimmerHMM                95%           90%           100%
ensembl                   92%           99%           60%
Combinational algorithm   96%           99%           95%

[Figure: Combinational vs Ensembl. Categories: Novel, Internal exon changed, Extension, Identical. Caption: Comparison of CDS structure from combinational algorithm and ensembl.]

Alternative splicing detection in A. gambiae

EST-aided AS detection algorithm
• Align ESTs to the genome, process the alignments, and extract exon/intron information.
• Upload to MySQL DB.
• Quality control: make EST clusters and merge introns and exons from individual alignments.
• Compare intron/intron and intron/exon pairs, find overlapping events, and classify each AS event.

[Figure: AS distribution in A. gambiae. Stacked percentage bars (0% to 100%) for the Raw and Curated sets. Categories: IntronR, AltA, AltD, AltP, ExonS, Others.]

Conclusion: 1512 CDS show alternative splicing. Most AS events occur within the CDS region, which enriches protein structure and function. Manual curation shows that the false-positive rate (due to EST contamination) is low (10%). The AS type distribution indicates that the mosquito is closer to plants than to mammals.

Software package and web presentation
The combinational CDS prediction and alternative splicing detection pipelines have been integrated into our open-source package (collaboration is welcome). Results are also accessible through the web.
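
As a rough illustration of the Exon-Gene-Union rule and the first check of the Open-Reading-Frame-Selection step described above, the following Python sketch takes the base-pair union of two exon sets and tests a candidate CDS for internal stop codons. It is not the poster's implementation: the function names (union_cds, has_internal_stop), the half-open interval convention, and the use of the standard stop codons TAA/TAG/TGA are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumption: exons are half-open [start, end) genomic intervals on one strand.
Interval = Tuple[int, int]

# Assumption: standard nuclear stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def union_cds(ab_initio: List[Interval], comparative: List[Interval]) -> List[Interval]:
    """Base-pair union C of ab initio exons A and comparative exons P.

    A base pair x belongs to C if it lies in any interval of A or of P.
    The result is returned as sorted, non-overlapping intervals.
    """
    merged: List[Interval] = []
    for start, end in sorted(ab_initio + comparative):
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def has_internal_stop(cds: str) -> bool:
    """Return True if an in-frame stop codon occurs before the final codon."""
    cds = cds.upper()
    # Start index of the last complete codon in frame 0.
    last_codon_start = len(cds) - len(cds) % 3 - 3
    return any(cds[i:i + 3] in STOP_CODONS for i in range(0, last_codon_start, 3))


if __name__ == "__main__":
    a = [(100, 200), (300, 400)]   # hypothetical ab initio exons
    p = [(150, 250), (500, 600)]   # hypothetical comparative exons
    print(union_cds(a, p))
    print(has_internal_stop("ATGTAAGGG"), has_internal_stop("ATGGGGTAA"))
```

With the hypothetical inputs above, the two overlapping exons collapse into one interval while the non-overlapping exons from either predictor are kept, which is how a union of this kind can recover exons that a single predictor misses. A union CDS that fails the internal-stop check would then move on to the later frame-selection steps of the flowchart, which this sketch does not cover.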