Gene-Boosted Assembly of a Novel Bacterial Genome from
Harry Presman

Overview
  Motivation
  Assembly
  Results
  Advantages/Limitations

Motivation
  Next-gen sequencers produce short read lengths
  Useful for polymorphism discovery
  Difficult to assemble whole genomes
  Current assembly algorithms produce highly fragmented results
  Sequencing P. aeruginosa (PAb1)
  Source of common in-hospital infections
  Chosen due to available comparators, PAO1 and PA14
  8,627,900 shotgun reads (Solexa)

Assembly Step 1: AMOScmp
  Comparative assembler
  Uses MUMmer
    ○ Alignment system based on suffix trees
  Referenced in “Comparative Genome Assembly”
  PA14 – 2053 contigs
  PAO1 – 2797 contigs

Assembly Step 2: multiple sequence alignment
  Align PAO1 and PA14 assemblies
  Use Minimus to fill gaps with contigs
    ○ AMOS component for small data sets
  Re-map reads using AMOScmp to clean assembly
  Closed 203 gaps

Assembly Step 3: gene-boosted assembly
  University of Maryland annotation pipeline
    ○ Based on BLAST and Glimmer
  Protein-coding genes used to fill gaps
    ○ Identify genes at contig edges and gaps
    ○ Extract AA sequences
    ○ tblastn identified potential filler reads
    ○ ABBA assembled reads into gaps
  Closed 185 gaps

Aside
  Tested gene-boosted analysis alone
  PAb1 assembled using PA14 proteins
  96% of PAb1 proteins assembled using only this method
  Lacks global genome structure information

Assembly Step 4: clean up
  SSAKE: “Short Sequence Assembly by K-mer search and 3’ read Extension”
  Edena: “Exact DE Novo Assembler”
  Velvet
  Closed 46 gaps

Results
  76 contigs containing 6,290,005 bp
  94% of bases in single scaffold
  5602 protein-coding genes identified
  Error rate per read = 1.04%
  Error with coverage > 20X is zero
  Slight bias toward high gene coverage regions

Results
  SNP analysis
  Aligned PA14 and PAb1
  5,537,508/5,568,550 bp agreed
  1157/5,568,550 possible sequence errors
  187/1104 indels in error
  Accuracy of assembly: > 99.97%

Advantages/Limitations
  Requires related genomes and protein sequences
  GenBank contains > 650 microbial genomes
  Genome size should not matter
  High speed and low cost
  ¼ of a single Solexa sequencing run in this case

Thank You
Questions?
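
Worked check of the quoted figures. The short Python sketch below only redoes the arithmetic behind the numbers on the assembly and results slides: the total number of gaps closed in steps 2 to 4, the PA14/PAb1 agreement rate, and the "> 99.97%" accuracy figure. The slides do not say whether that accuracy figure counts only the 1157 possible sequence errors or also the 187 indels in error, so both readings are computed; that interpretation, and all variable names, are assumptions made for illustration.

```python
# Figures quoted on the slides (nothing here is new data).
GAPS_CLOSED = {
    "step 2 (Minimus + AMOScmp re-mapping)": 203,
    "step 3 (gene-boosted assembly)": 185,
    "step 4 (clean up)": 46,
}
ALIGNED_BP = 5_568_550          # aligned PA14/PAb1 positions
AGREED_BP = 5_537_508           # positions where both genomes agree
POSSIBLE_SEQ_ERRORS = 1_157     # differences flagged as possible sequence errors
INDELS_IN_ERROR = 187           # indels (out of 1104) judged to be errors


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


total_gaps_closed = sum(GAPS_CLOSED.values())

# Share of aligned positions where PA14 and PAb1 agree.
identity_pct = percent(AGREED_BP, ALIGNED_BP)

# Assumption A: only the possible sequence errors count against accuracy.
accuracy_subst_pct = 100.0 - percent(POSSIBLE_SEQ_ERRORS, ALIGNED_BP)

# Assumption B: erroneous indels are counted as well.
accuracy_all_pct = 100.0 - percent(
    POSSIBLE_SEQ_ERRORS + INDELS_IN_ERROR, ALIGNED_BP
)

print(f"Gaps closed in steps 2-4: {total_gaps_closed}")
print(f"PA14/PAb1 agreement: {identity_pct:.2f}%")
print(f"Accuracy, sequence errors only: {accuracy_subst_pct:.4f}%")
print(f"Accuracy, sequence errors + indels: {accuracy_all_pct:.4f}%")
```

Under either assumption the computed accuracy stays above 99.97%, which is consistent with the figure on the results slide. The three gap-closing steps together account for 434 closed gaps.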