Workshop on FCP Accelerated NGS
Srinivas Aluru
Iowa State University

The Big Data Challenge
Then (2005): ABI 3700
• 96 reads of ~800 bp
• 76.8 × 10^3 bases
• ~$1 per kilobase
Now: Illumina HiSeq 2500
• 6 billion reads of 100 bp
• 600 × 10^9 bases
• ~$1 per 200 million bases

Many NGS Technologies

Why FCP?
• 1 NGS experiment = ~100 GB data
• Sequencing center a decade ago, small-budget individual investigator today
• Many FCP technologies are inexpensive and widely available

Genomes Galore – Big Data Analytics for High Throughput DNA Sequencing
Driving Grand Challenges
• Identification of complex disease traits
• Detection of biological threats
• Microbial studies and human health
• Plant genotype to phenotype
Research and Dissemination Approach
Vision and Goals
• Empower community migration to HPC
• Preserve ability to create new solutions
• Target researchers & software developers

The Team
• Srinivas Aluru (ISU)
• Jaroslaw Zola (Rutgers)
• Kunle Olukotun (Stanford)
• Wu Feng (V. Tech)
Domain Experts:
• Patrick Schnable (ISU)
• Charles Sing (U. of Michigan)

NGS Application: Assembly
Reconstruct longer original sequences from the high-coverage sampling of short fragments produced by NGS.
[Figure labels: multiple copies of the same genome source; randomly fragment the copies; sequence; unordered fragments]

NGS Application: Assembly
• resequencing: genome mapping
• de novo sequencing: genome assembly
• gene expression analysis: transcriptome assembly
• metagenomic sampling: metagenomic clustering and/or assembly

Graph Abstractions for Assembly
• Overlap graphs
– node: an NGS read
– edge: suffix-prefix alignment between a pair of reads
• De Bruijn graphs
– node: a kmer from an NGS read
– edge: length (k-1) suffix-prefix match between two kmers

Graph Operations for Assembly
• Graph construction from reads
• Collapsing chains
• Features in local neighborhood to identify errors
• Path walking subject to distance constraints on pairs of edges
• Operations on multiple assembly graphs, or multiple genomes in a combined graph

NGS Error Correction
• Hamming/Edit distance graphs
– Node: a kmer in an NGS read
– Edge: two kmers with short Hamming/edit distance
• Graph operations needed
– Concurrent access to many nodes for neighbor queries
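
The Hamming distance graph on the error-correction slide can be sketched in a few lines. This is only an illustrative, assumed implementation, not the method used by the presenters: the function names (`kmers`, `hamming`, `hamming_graph`) and the distance threshold `d` are hypothetical, and the all-pairs comparison is quadratic in the number of distinct kmers, so it only suits toy inputs.

```python
from itertools import combinations


def kmers(read, k):
    """Return every length-k substring (kmer) of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(reads, k, d=1):
    """Build an undirected graph: one node per distinct kmer,
    one edge per pair of kmers within Hamming distance d.

    Naive all-pairs construction, for illustration only.
    """
    nodes = sorted({km for read in reads for km in kmers(read, k)})
    adj = {km: set() for km in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) <= d:
            adj[a].add(b)
            adj[b].add(a)
    return adj


if __name__ == "__main__":
    graph = hamming_graph(["ACGT", "ACGA", "TTTT"], k=4, d=1)
    for node, neighbours in graph.items():
        print(node, sorted(neighbours))
```

A neighbor query is then a dictionary lookup such as `graph["ACGT"]`. At real NGS scale the node set is far too large for pairwise comparison, which is why the slide lists concurrent access to many nodes as a needed graph operation.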
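
The De Bruijn graph abstraction from the assembly slides (nodes are kmers, edges are length (k-1) suffix-prefix matches) can be illustrated with a similar sketch. The function name `de_bruijn` and the dictionary-of-sets representation are assumptions made for this example. Edges are added only between kmers that are consecutive in some read, which guarantees the (k-1) overlap by construction.

```python
def de_bruijn(reads, k):
    """Build a directed De Bruijn graph from reads.

    Nodes are kmers. An edge left -> right is added when the two kmers
    appear at consecutive positions in a read, so the (k-1)-suffix of
    left equals the (k-1)-prefix of right.
    """
    graph = {}
    for read in reads:
        for i in range(len(read) - k):
            left = read[i:i + k]
            right = read[i + 1:i + k + 1]
            graph.setdefault(left, set()).add(right)
            graph.setdefault(right, set())  # keep sink nodes visible
    return graph


if __name__ == "__main__":
    graph = de_bruijn(["ACGTA", "CGTAC"], k=3)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
```

Chains such as ACG -> CGT -> GTA, where each interior node has a single predecessor and a single successor, are the candidates for the "collapsing chains" operation listed on the graph operations slide.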