Download The evolution of life science methodologies: From single gene

Genomic Computing, DEIB, 16-20 March 2015

The evolution of life science methodologies: From single gene discovery to the ENCODE project and beyond
Heiko Muller, Computational Research, IIT@SEMM

Discovery of X-rays, radioactivity, quantum mechanics
Wilhelm Conrad Roentgen: X-rays, 1895
Marie & Pierre Curie: radium, polonium, 1898. Atoms are divisible!
Henry Bragg, Lawrence Bragg: X-ray diffraction
Werner Heisenberg, Erwin Schroedinger: established quantum mechanics, 1925-1926
Linus Pauling: understood the nature of chemical bonds; electronegativity, 1932

Significance and structure of DNA
Oswald Avery, 1944: nucleic acid is the transforming principle (the carrier of heredity)
Rosalind Franklin, James Watson, Francis Crick, 1953: DNA double helix
Central dogma of molecular biology (Francis Crick): DNA -> RNA -> protein

Discovery of proteins
Antoine François, comte de Fourcroy, 18th century: described proteins, distinguished by their ability to coagulate or flocculate under treatment with heat or acid, e.g. albumin from egg whites, blood serum albumin
1838: "On the composition of some animal substances", first use of the term protein (the “leader”)
Gerardus Johannes Mulder, Jöns Jacob Berzelius

Sequencing, first structures
Frederick Sanger: bovine insulin sequence, 1951. Led to the sequence hypothesis proposed by Francis Crick
John Kendrew: myoglobin X-ray structure, 1958
Max Perutz: hemoglobin X-ray structure, 1963
Figure panels: myoglobin, hemoglobin

Western blot

Protein profiling: 2D gels, mass spectrometry
2D gel electrophoresis
Protein and peptide mass spectrometry
Monoclonal antibodies (1975): César Milstein

Discovery of DNA, RNA
Friedrich Miescher, 1869: discovered “nuclein” in the kitchen of Tuebingen castle
Phoebus Levene
1909: ribose
1919: nucleic acid = base + sugar + phosphate, “nucleotide”
1929: deoxyribose

Manipulating DNA
Werner Arber, Hamilton Smith, 1970: restriction enzymes
Stanley Cohen, 1972: molecular cloning

Sequencing DNA
Frederick Sanger, 1977: dideoxy sequencing (chain-termination sequencing), φX174

Identification of
specific DNA fragments by hybridization, Southern blot
Sir Edwin Mellor Southern, Southern blot
http://science.bard.edu/biology/ferguson/course/bio310/student_presentations/Southern_1975.pdf

Amplification of DNA
Kary Mullis, 1983: polymerase chain reaction, amplification of genetic material
Quantitative PCR (TaqMan PCR), molecular beacons

HUGO
June 26, 2000
15th and 16th of February 2001

Spotted microarrays (1995)
Oligonucleotide microarrays (1996)

DNA and disease
Karyotype, spectral karyotyping
1960: Philadelphia chromosome, cause of chronic myelocytic leukemia
Figure: bladder cancer cell karyotype (by Robert Sanders, Media Relations, July 26, 2011)
ChIP-chip, ChIP-seq, copy number variation, SNPs (single nucleotide polymorphisms)

Discovery of RNA
1900-1950
Phoebus Levene
1909: ribose
1919: nucleic acid = base + sugar + phosphate, “nucleotide”
1929: deoxyribose
Differences in base composition and chemical stability
1950s
Microsome (ribosome) observation by electron microscopy and centrifugation
Radioactive amino acids are rapidly incorporated into microsomes, and microsomes have an RNA component (rRNA)
Radioactive amino acids bind to tRNA
Polysomes -> mRNA concept
1960s
Cracking the genetic code, tRNA sequence

Reverse transcription of RNA
Isolation of reverse transcriptase, 1970: David Baltimore, Howard Temin
Northern blot: identification of specific RNA molecules by hybridization

RNA processing
Late 1970s: Louise Chow and Sue Berget, Phil Sharp, Rich Roberts
mRNA splicing, exons, introns

RNA as a regulator: microRNA, RNA interference
RNA-induced silencing complex (RISC)

RNA profiling: differential display
Identify differentially expressed bands, clone and sequence them, validate

RNA profiling: SAGE, microarrays, RNA-seq
Spotted microarray: probes = cDNA
Oligonucleotide microarray: probes = oligonucleotides
RNA-seq

The genomic data surge
Timeline labels: Northern, differential display, SAGE, expression chips, SNP chips, SNP beads, Southern, 1980s, RFLP, PCR, 1990s, sequencer, AFLP
IFOM-IEO-CAMPUS, since 2009: 1800 samples, 16 TB raw data, 70 TB elaborated data
Timeline labels: genome draft, 2000s, genome browsers, 2010s

Sequencing technologies
First generation: Sanger sequencing
Second (next) generation: massively parallel sequencing
Roche 454
Illumina/Solexa
SOLiD (Applied Biosystems, Sequencing by Oligonucleotide Ligation and Detection)
Third (next-next) generation: single molecule sequencing
Helicos
IonTorrent
Pacific Biosciences
Oxford Nanopore Technologies
Figure panels: Roche 454, Illumina, IonTorrent

Work flow of conventional versus second-generation sequencing
(a) With high-throughput shotgun Sanger sequencing, genomic DNA is fragmented, then cloned to a plasmid vector and used to transform E. coli. For each sequencing reaction, a single bacterial colony is picked and plasmid DNA isolated. Each cycle sequencing reaction takes place within a microliter-scale volume, generating a ladder of ddNTP-terminated, dye-labeled products, which are subjected to high-resolution electrophoretic separation within one of 96 or 384 capillaries in one run of a sequencing instrument. As fluorescently labeled fragments of discrete sizes pass a detector, the four-channel emission spectrum is used to generate a sequencing trace.
(b) In shotgun sequencing with cyclic-array methods, common adaptors are ligated to fragmented genomic DNA, which is then subjected to one of several protocols that results in an array of millions of spatially immobilized PCR colonies or 'polonies'. Each polony consists of many copies of a single shotgun library fragment. As all polonies are tethered to a planar array, a single microliter-scale reagent volume (e.g., for primer hybridization and then for enzymatic extension reactions) can be applied to manipulate all array features in parallel. Similarly, imaging-based detection of fluorescent labels incorporated with each extension can be used to acquire sequencing data on all features in parallel.
Successive iterations of enzymatic interrogation and imaging are used to build up a contiguous sequencing read for each array feature.
Figure label: dideoxy ATP, chain termination
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008)

Clonal amplification of sequencing features in second-generation sequencing
Figure panels: 454, SOLiD; Illumina
(a) The 454, the Polonator and SOLiD platforms rely on emulsion PCR to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor-flanked shotgun library (shown as gold and turquoise adaptors flanking unique inserts) is PCR amplified (that is, multi-template PCR, not multiplex PCR, as only a single primer pair is used, corresponding to the gold and turquoise adaptors) in the context of a water-in-oil emulsion. One of the PCR primers is tethered to the surface (5'-attached) of micron-scale beads that are also included in the reaction. A low template concentration results in most bead-containing compartments having either zero or one template molecule present. In productive emulsion compartments (where both a bead and a template molecule are present), PCR amplicons are captured to the surface of the bead. After breaking the emulsion, beads bearing amplification products can be selectively enriched. Each clonally amplified bead will bear on its surface PCR products corresponding to amplification of a single molecule from the template library.
(b) The Solexa technology relies on bridge PCR (aka 'cluster PCR') to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor-flanked shotgun library is PCR amplified, but both primers densely coat the surface of a solid substrate, attached at their 5' ends by a flexible linker. As a consequence, amplification products originating from any given member of the template library remain locally tethered near the point of origin. At the conclusion of the PCR, each clonal cluster contains 1,000 copies of a single member of the template library.
Accurate measurement of the concentration of the template library is critical to maximize the cluster density while simultaneously avoiding overcrowding.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008)

Example: Illumina/Solexa (2006)
1. Determine first base
2. Image first base
3. Determine second base
4. Image second base
5. Sequence reads over multiple cycles
6. Align data
180 million clusters per flow cell lane, each with 1000 copies of the same template; 20 billion bases per run; 0.1% of the cost of the capillary-based method.
(From: http://www.illumina.com/downloads/SS_DNAsequencing.pdf)

Output: FASTQ files
“FASTA” is short for “FAST-All”: unlike the FASTP protein aligner described in 1985, it works with all alphabets (DNA:DNA, translated protein:DNA). FASTA is also a universal file format for sequences. Example:
>header1
acgtgatgc
>header2
cgtgatgca
...

FASTQ = “FASTA with Qualities”
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Four lines per sequence.
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
(Source: Wikipedia)

ChIP-seq work flow
• Cross-link proteins to DNA
• Fragment DNA
• Extract with antibody
• Reverse cross-links
• Sequence fragments
• DO CONTROLS!

Coverage vector
Figure label: aligned reads
A coverage (or “pile-up”) vector is an integer vector with one element per base pair in a chromosome, tallying the number of reads (or fragments) mapping onto each base pair. It can be viewed in genome browsers.
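The four-line FASTQ record layout described earlier in this section can be read with a few lines of Python. This is a minimal sketch, not a full parser: it assumes every record occupies exactly four lines (no wrapped sequences), and the `phred_scores` helper assumes the Sanger-style Phred+33 quality encoding, which the slides do not state. The names `FastqRecord`, `parse_fastq` and `phred_scores` are illustrative only.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str    # text after the leading '@' on line 1
    sequence: str  # raw sequence letters (line 2)
    quality: str   # one quality symbol per sequence letter (line 4)


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per sequence."""
    cleaned = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"line 1 of a record must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line 3 of a record must start with '+': {plus!r}")
        if len(seq) != len(qual):
            raise ValueError("quality line must be as long as the sequence line")
        yield FastqRecord(header[1:], seq, qual)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality symbols to integers.

    ASSUMPTION: Phred+33 (Sanger) encoding; other encodings use another offset.
    """
    return [ord(symbol) - offset for symbol in quality]


# The example record shown on the slide.
EXAMPLE_FASTQ = """@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
"""

records = list(parse_fastq(EXAMPLE_FASTQ.splitlines()))
first = records[0]
print(first.seq_id, len(first.sequence), phred_scores(first.quality)[:4])
```

The length check mirrors the rule that line 4 must contain the same number of symbols as line 2; a malformed record raises an error instead of being skipped silently.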
Figure label: coverage vector (Zhang, PLoS Comp Biol 2008)

RNA-seq

Important file formats: BED (textual)
A BED file (.bed) is a tab-delimited text file that defines a feature track. The file extension .bed is recommended. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat. Tracks in the UCSC Genome Browser (http://genome.ucsc.edu/) can be downloaded to BED files and loaded into IGB/IGV.
Notes: Zero-based index: start and end positions are identified using a zero-based index. The end position is excluded. For example, setting start-end to 1-2 describes exactly one base, the second base in the sequence (the C in ACGT).
track name=pairedReads description="Clone Paired Reads"
Chr22 1000 5000 cloneA
Chr22 2000 6000 cloneB

Important file formats: bedGraph (numerical)
The bedGraph format is line-oriented. bedGraph data are preceded by a track definition line, which adds a number of options for controlling the default display of this track. The track type is REQUIRED, and must be “bedGraph”. bedGraph track data values can be integer or real, positive or negative values. Chromosome positions are specified as 0-relative. The first chromosome position is 0. The last position in a chromosome of length N would be N - 1. Only positions specified have data. Positions not specified do not have data and will not be graphed. All positions specified in the input data must be in numerical order. The bedGraph format has four columns of data:
track type=bedGraph name="BedGraph Format"
chr19 49302000 49302300 10
chr19 49302300 49302600 20
chr19 49302600 49302900 25
Intervals can be of any length and overlapping.

Genome browser
Kent, W. James, et al. "The human genome browser at UCSC." Genome Research 12.6 (2002): 996-1006.

Visualizing genomic data
• User views clickable image (html map tag), not real data
• Precludes analytics

From HUGO to HUGE data knowledge
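The coverage vector and the BED and bedGraph conventions described above fit together naturally: aligned reads given as zero-based, end-excluded intervals are tallied per base pair, and the resulting vector is written out as bedGraph data lines. The following is a minimal sketch under stated assumptions, not the method used in the lecture: the function names, the toy chromosome name `chrT` and its length are hypothetical, and runs with zero coverage are simply left out of the output (so those positions carry no data).

```python
from itertools import groupby
from typing import List, Tuple

# (start, end): zero-based start, end position excluded, as in BED.
Interval = Tuple[int, int]


def coverage_vector(chrom_length: int, reads: List[Interval]) -> List[int]:
    """Return one integer per base pair: the number of reads covering it."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (chrom_length + 1)
    for start, end in reads:
        if not 0 <= start < end <= chrom_length:
            raise ValueError(f"interval out of range: {(start, end)}")
        diff[start] += 1
        diff[end] -= 1
    coverage: List[int] = []
    depth = 0
    for delta in diff[:chrom_length]:
        depth += delta  # running sum turns the differences into depths
        coverage.append(depth)
    return coverage


def to_bedgraph(chrom: str, coverage: List[int]) -> List[str]:
    """Run-length encode a coverage vector as four-column bedGraph lines."""
    lines: List[str] = []
    pos = 0
    for value, run in groupby(coverage):
        length = sum(1 for _ in run)
        if value != 0:  # design choice: positions with zero coverage are omitted
            lines.append(f"{chrom}\t{pos}\t{pos + length}\t{value}")
        pos += length
    return lines


# Toy example: three reads on a hypothetical 10 bp chromosome "chrT".
toy_reads = [(1, 2), (0, 4), (2, 6)]
toy_coverage = coverage_vector(10, toy_reads)
print(toy_coverage)
print('track type=bedGraph name="toy coverage"')
for line in to_bedgraph("chrT", toy_coverage):
    print(line)
```

Because the end position is excluded, the interval 1-2 contributes to exactly one element of the vector, the second base, which matches the BED note above. The emitted runs are sorted and non-overlapping, which keeps the positions in numerical order as bedGraph requires.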
