A Pervasive Technology Institute (pti.iu.edu) Center

NCGAS is a national service center funded by the National Science Foundation's Advances in Biological Informatics (ABI) program to provide scientists access to software and supercomputers for genomics research. http://ncgas.org

NCGAS provides
• Dedicated access to memory-rich supercomputers customized for genomics studies, including Mason and other XSEDE systems
• Distributions of hardened versions of popular codes
• Initially, support nucleated around genome assembly software such as:
  • de Bruijn graph methods: SOAPdeNovo, Velvet, ABySS
  • consensus methods: Celera, Arachne 2
• Expansion to other areas as users are recruited: now moving into phylogenetics and metagenomics
• We're especially interested in helping smaller institutions
• Funded only since Nov. 2011, NCGAS is actively seeking users!

Current participating institutions:
• IU's Mason, an HP ProLiant DL580 G7: 10GE interconnect; quad-socket nodes (8-core Xeon L7555, 1.87 GHz base frequency, 32 cores per node; 512 GByte of memory per node!); rated at 3.383 TFLOPs (G-HPL benchmark)
• Texas Advanced Computing Center (TACC)
• San Diego Supercomputer Center (SDSC); e.g., Dash
• NCGAS will support software running at IU, TACC and SDSC, as well as other supercomputers available as part of XSEDE, with the goal of creating a single allocation system that will transparently access all appropriate clusters
• NCGAS will further campus bridging integration

Early Users

Metagenomics Sequence Analysis
Yuzhen Ye Lab (IU Bloomington School of Informatics)
• Environmental sequencing: sampling DNA sequences directly from the environment. Because the sequences consist of DNA fragments from hundreds or even thousands of species, the analysis is far more difficult than traditional sequence analysis, which involves only one species.
• Assembling metagenomic sequences and deriving genes from the dataset
• Dynamic programming to optimally map consecutive contigs from the assembly
• Since the number of contigs is enormous for most metagenomic datasets, a large-memory computing system is required to perform the dynamic programming algorithm so that the task can be completed in polynomial time.

Genome Assembly and Annotation
Michael Lynch Lab (IU Bloomington, Department of Biology)
• Assembles and annotates genomes in the Paramecium aurelia species complex in order to eventually study the evolutionary fates of duplicate genes after whole-genome duplication. This project has also been performing RNAseq on each genome; the data currently aid genome annotation and will subsequently be used to detect expression differences between paralogs.
• The assembler used is based on an overlap-layout-consensus method instead of a de Bruijn graph method (like some of the newer assemblers). It is more memory intensive because it requires performing pairwise alignments between all pairs of reads.
• The annotation of the genome assemblies involves programs such as GMAP, GSNAP, PASA, and Augustus. To use these programs, we need to load millions of RNAseq and EST reads and map them back to the genome.

Genome Informatics for Animals and Plants
Genome Informatics Lab (IU Bloomington, Department of Biology)
• This project finds genes in animals and plants, using the vast amounts of new gene information coming from next-generation sequencing technology.
• These improvements are applied to newly deciphered genomes for an environmental sentinel animal, the waterflea (Daphnia); the agricultural pest insect, the pea aphid; the evolutionarily interesting jewel wasp (Nasonia); and the chocolate plant (Th. cacao), which will bring genomics to sustainable agriculture of cacao.
• Large-memory compute systems are needed for biological genome and gene transcript assembly because assembling genomic DNA or gene RNA sequence reads (billions of fragments) into full genomic or gene sequences requires a minimum of 128 GB of shared memory, and more depending on the data set. These programs build graph matrices of sequence alignments in memory.

Imputation of Genotypes and Sequence Alignment
Tatiana Foroud Lab (IU School of Medicine, Medical and Molecular Genetics)
• Studies complex disorders by using imputation of genotypes, typically for genome-wide association studies, as well as sequence alignment and post-processing of whole-genome and whole-exome sequencing.
• Requires analysis of markers in a genetic region (such as a chromosome) in several hundred representative individuals genotyped for the full reference panel of SNPs, with extrapolation of the inferred haplotype structures.
• More memory allows the imputation algorithms to evaluate haplotypes across much broader genomic regions, reducing or eliminating the need to partition the chromosomes into segments. This increases the accuracy and speed of genotype imputation, allowing for improved evaluation of detailed within-study results as well as communication and collaboration (including meta-analysis) with other researchers using the disease study results.

Daphnia Population Genomics
Michael Lynch Lab (IU Bloomington, Department of Biology)
• This project involves the whole-genome shotgun sequences of more than 20 diploid genomes with genome sizes >200 Megabases each.
• With each genome sequenced to over 30x coverage, the full project involves both the mapping of reads to a reference genome and the de novo assembly of each individual genome.
• The genome assembly of millions of small reads often requires excessive memory use, for which we once turned to Dash at SDSC. With Mason now online at IU, we have been able to run our assemblies and analysis programs here at IU.

Thomas G. Doak ([email protected]), Le-Shin Wu, Craig A. Stewart, Robert Henschel, William K. Barnett
http://ncgas.org
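As a rough illustration of the data volume implied by the Daphnia Population Genomics figures above (more than 20 genomes, more than 200 Megabases each, over 30x coverage), the arithmetic can be sketched as follows. This is a back-of-the-envelope sketch, not the project's own calculation: the function names are hypothetical, the inputs are the lower bounds quoted above, and the 100 bp read length is purely an assumption for illustration, since the poster does not state one.

```python
import math


def total_sequenced_bases(num_genomes: int, genome_size_bp: int, coverage: int) -> int:
    """Approximate bases sequenced: genomes x genome size x coverage depth."""
    return num_genomes * genome_size_bp * coverage


def reads_per_genome(genome_size_bp: int, coverage: int, read_length_bp: int) -> int:
    """Approximate number of reads needed to reach the given coverage of one genome."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


# Lower bounds taken from the project description.
NUM_GENOMES = 20
GENOME_SIZE_BP = 200_000_000
COVERAGE = 30

# Assumption for illustration only; the read length is not given in the text.
ASSUMED_READ_LENGTH_BP = 100

if __name__ == "__main__":
    bases = total_sequenced_bases(NUM_GENOMES, GENOME_SIZE_BP, COVERAGE)
    reads = reads_per_genome(GENOME_SIZE_BP, COVERAGE, ASSUMED_READ_LENGTH_BP)
    print(f"At least {bases:.2e} bases across the project")
    print(f"At least {reads:,} reads per genome at {ASSUMED_READ_LENGTH_BP} bp")
```

Under these assumptions the project covers at least 1.2 x 10^11 sequenced bases, or about 60 million reads per genome, which is consistent with the poster's point that assembling millions of small reads calls for large-memory systems.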