Genomic Computing, DEIB, 16‐20 March 2015
The evolution of life science methodologies: From single gene discovery to the ENCODE project and beyond
Heiko Muller
Computational Research IIT@SEMM
[email protected]
Discovery of X‐rays, radioactivity, quantum mechanics
Konrad Roentgen: X‐rays, 1895
Marie & Pierre Curie: Radium, Polonium, 1898. Atoms are divisible!!
Henry Bragg, Lawrence Bragg: X‐ray diffraction
Werner Heisenberg, Erwin Schroedinger: Established quantum mechanics, 1925‐1926
Linus Pauling: Understood nature of chemical bonds, 1932 electronegativity
Significance and structure of DNA
Oswald Avery 1944, nucleic acid is transforming (causing heredity)
Rosalind Franklin, James Watson, Francis Crick 1953: DNA double helix
Central dogma of Molecular Biology (Francis Crick)
DNA -> RNA -> Protein
Discovery of proteins
Antoine François, comte de Fourcroy, 18th century: described proteins, distinguished by their ability to coagulate or flocculate under treatment with heat or acid, e.g. albumin from egg whites, blood serum albumin
Gerardus Johannes Mulder, Jöns Jacob Berzelius, 1838: “On the composition of some animal substances”, first use of the term protein (the “leader”)
Sequencing, First structures
Frederick Sanger
Bovine insulin sequence
1951
Led to sequence hypothesis
Proposed by Francis Crick
John Kendrew
Myoglobin X‐ray structure
1958
Max Perutz
Hemoglobin X‐ray structure
1963
myoglobin
hemoglobin
Western blot
Protein profiling: 2D gels, mass spectrometry
2D gel electrophoresis
Protein and peptide mass spectrometry
Monoclonal antibodies (1975)
César Milstein
Discovery of DNA, RNA
Discovered 1869 by Friedrich Miescher in the kitchen of castle Tuebingen, “nuclein”
Phoebus Levene 1909: ribose
1919: nucleic acid = base + sugar + phosphate, “nucleotide”
1929: deoxyribose
Manipulating DNA
Werner Arber, Hamilton Smith, 1970, restriction enzymes
Stanley Cohen, 1972, Molecular cloning
Sequencing DNA
Frederick Sanger 1977, dideoxy sequencing or chain‐termination sequencing
φX174
Identification of specific DNA fragments by hybridization, Southern blot
Sir Edwin Mellor Southern, Southern blot
http://science.bard.edu/biology/ferguson/course/bio310/student_presentations/Southern_1975.pdf
Amplification of DNA
Kary Mullis 1983, polymerase chain reaction, amplification of genetic material
Quantitative PCR (Taqman PCR), molecular beacons
HUGO 15th and 16th of February 2001
June 26, 2000
Spotted microarrays (1995)
Oligonucleotide microarrays (1996)
DNA and disease
karyotype
Spectral karyotyping
1960, Philadelphia chromosome, Cause of chronic myelocytic leukemia
Bladder cancer cell karyotype
By Robert Sanders, Media Relations | July 26, 2011
ChIP‐chip, ChIP‐seq, copy number variation, SNPs (single nucleotide polymorphisms)
Discovery of RNA
1900 ‐ 1950
Phoebus Levene 1909: ribose
1919: nucleic acid = base + sugar + phosphate, “nucleotide”
1929: deoxyribose
differences in base composition and chemical stability
1950s
microsome (ribosome) observation by electron microscopy and centrifugation; radioactive amino acids are rapidly incorporated into microsomes, and microsomes have an RNA component (rRNA)
radioactive amino acids bind to tRNA
polysomes -> mRNA concept
1960s cracking the genetic code, tRNA sequence
Reverse transcription of RNA
Isolation of reverse transcriptase, 1970
David Baltimore
Howard Temin
Northern blot, identification of specific RNA molecules by hybridization
RNA processing
Late 1970s: Louise Chow, Sue Berget, Phil Sharp, Rich Roberts
mRNA splicing, exons, introns
RNA as a regulator: microRNA, RNA interference
RNA‐induced silencing complex (RISC)
RNA profiling: differential display
Identify differentially expressed bands,
Clone and sequence them, validate
RNA profiling, SAGE, microarrays, RNA‐seq
Spotted microarray
Probes = cDNA
oligonucleotide microarray
Probes = oligonucleotides
RNA‐seq
The genomic data surge
Northern
Differential Display SAGE
Expression chips
SNP chips
SNP beads
Southern
1980s
RFLP
PCR
1990s
Sequencer
AFLP
IFOM‐IEO‐CAMPUS
Since 2009:
1800 samples
16 TB raw data
70 TB elaborated data
Genome draft
2000s
Genome browsers
2010s
Sequencing technologies
First generation: Sanger‐sequencing
Second (next) generation: massively parallel sequencing
Roche 454
Illumina/Solexa
SOLiD (Applied Biosystems, Sequencing by Oligonucleotide Ligation and Detection)
Third (next‐next) generation: single molecule sequencing
Helicos
IonTorrent
Pacific Biosciences
Oxford Nanopore Technologies
Illumina
IonTorrent
Work flow of conventional versus second‐generation sequencing
(a) With high‐throughput shotgun Sanger sequencing, genomic DNA is fragmented, then cloned to a plasmid vector and used to transform E. coli. For each sequencing reaction, a single bacterial colony is picked and plasmid DNA isolated. Each cycle sequencing reaction takes place within a microliter‐scale volume, generating a ladder of ddNTP‐terminated, dye‐labeled products, which are subjected to high‐resolution electrophoretic separation within one of 96 or 384 capillaries in one run of a sequencing instrument. As fluorescently labeled fragments of discrete sizes pass a detector, the four‐channel emission spectrum is used to generate a sequencing trace.
(b) In shotgun sequencing with cyclic‐array methods, common adaptors are ligated to fragmented genomic DNA, which is then subjected to one of several protocols that results in an array of millions of spatially immobilized PCR colonies or 'polonies'. Each polony consists of many copies of a single shotgun library fragment. As all polonies are tethered to a planar array, a single microliter‐scale reagent volume (e.g., for primer hybridization and then for enzymatic extension reactions) can be applied to manipulate all array features in parallel. Similarly, imaging‐based detection of fluorescent labels incorporated with each extension can be used to acquire sequencing data on all features in parallel. Successive iterations of enzymatic interrogation and imaging are used to build up a contiguous sequencing read for each array feature.
dideoxy ATP,
chain termination
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 ‐ 1145 (2008)
Clonal amplification of sequencing features in second‐generation sequencing
454,
SOLiD
Illumina
(a) The 454, the Polonator and SOLiD platforms rely on emulsion PCR to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor‐flanked shotgun library (shown as gold and turquoise adaptors flanking unique inserts) is PCR amplified (that is, multi‐template PCR, not multiplex PCR, as only a single primer pair is used, corresponding to the gold and turquoise adaptors) in the context of a water‐in‐oil emulsion. One of the PCR primers is tethered to the surface (5'‐attached) of micron‐scale beads that are also included in the reaction. A low template concentration results in most bead‐containing compartments having either zero or one template molecule present. In productive emulsion compartments (where both a bead and template molecule is present), PCR amplicons are captured to the surface of the bead. After breaking the emulsion, beads bearing amplification products can be selectively enriched. Each clonally amplified bead will bear on its surface PCR products corresponding to amplification of a single molecule from the template library.
(b) The Solexa technology relies on bridge PCR (aka 'cluster PCR') to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor‐flanked shotgun library is PCR amplified, but both primers densely coat the surface of a solid substrate, attached at their 5' ends by a flexible linker. As a consequence, amplification products originating from any given member of the template library remain locally tethered near the point of origin. At the conclusion of the PCR, each clonal cluster contains 1,000 copies of a single member of the template library. Accurate measurement of the concentration of the template library is critical to maximize the cluster density while simultaneously avoiding overcrowding.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 ‐ 1145 (2008)
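The caption's remark that a low template concentration leaves most compartments with zero or one template can be illustrated numerically. The sketch below assumes that templates are distributed over emulsion compartments following a Poisson distribution; this model and the value of `lam` (mean templates per compartment) are illustrative assumptions, not taken from the slides or the cited paper.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k templates in a compartment (Poisson model)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 0.1  # hypothetical mean number of templates per compartment

p_zero = poisson_pmf(0, lam)        # empty compartments
p_one = poisson_pmf(1, lam)         # compartments with exactly one template
p_multi = 1.0 - p_zero - p_one      # compartments with two or more templates

# Among compartments that contain any template, the share that is clonal.
clonal_fraction = p_one / (p_one + p_multi)

print(f"P(0) = {p_zero:.4f}, P(1) = {p_one:.4f}, P(>=2) = {p_multi:.4f}")
print(f"clonal fraction of occupied compartments = {clonal_fraction:.3f}")
```

Under this assumed model, lowering `lam` raises the clonal fraction at the price of more empty compartments, which is one way to read the trade-off the caption describes.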
Example: Illumina/Solexa (2006)
1. Determine first base
2. Image first base
3. Determine second base
4. Image second base
5. Sequence reads over multiple cycles
6. Align data
180 million clusters/flow cell lane, each 1000 copies of the same template, 20 billion bases per run, 0.1% of the cost of capillary‐based method. (From: http://www.illumina.com/downloads/SS_DNAsequencing.pdf)
Output: FASTQ files
“FASTA” is short for “FAST‐All”: unlike the FASTP protein aligner described in 1985, it works with all alphabets (DNA:DNA, translated protein:DNA).
FASTA is also a universal file format for sequences. Example:
>header1
acgtgatgc
>header2
cgtgatgca
.
.
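A minimal Python sketch of reading the FASTA layout shown above. The function name `read_fasta` is hypothetical, and the sketch assumes that header lines start with `>` and that a sequence may wrap over several lines.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line    # sequences may span several lines
    return records


example = """>header1
acgtgatgc
>header2
cgtgatgca
"""

records = read_fasta(example.splitlines())
print(records)
```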
FASTQ = “FASTA with Qualities”
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Four lines per sequence.
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
Wikipedia
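The four-line layout described above can be parsed with a few lines of Python. This is a sketch under stated assumptions: the names `FastqRecord` and `read_fastq` are hypothetical, sequence and quality lines are not wrapped, and quality characters use the Phred+33 encoding (ASCII code minus 33), which the slide does not specify.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    qualities: List[int]


def read_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per four lines; offset=33 assumes Phred+33."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        title, seq, plus, qual = rows[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(title[1:], seq, [ord(c) - offset for c in qual])


example = """@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
"""

record = next(read_fastq(example.splitlines()))
print(record.seq_id, len(record.sequence), record.qualities[:5])
```

Older Illumina pipelines used a different offset, so the `offset` parameter is left adjustable rather than hard-coded.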
ChIP‐seq work flow
• Cross‐link proteins to DNA
• Fragment DNA
• Extract with antibody
• Reverse cross links
• Sequence fragments
• DO CONTROLS!
Coverage vector
aligned reads
A coverage (or: “pile‐up”) vector is an integer vector with one element per base pair in a chromosome, tallying the number of reads (or fragments) mapping onto each base pair. Can be viewed in genome browsers.
coverage vector
Zhang, Plos Comp Biol 2008
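The coverage vector defined above can be computed from aligned read intervals as sketched below. The function name `coverage_vector` is hypothetical, and reads are assumed to be given as (start, end) pairs with zero-based starts and excluded ends. The sketch uses a difference array followed by a running sum, one of several possible approaches.

```python
from itertools import accumulate
from typing import Iterable, List, Tuple


def coverage_vector(chrom_length: int,
                    reads: Iterable[Tuple[int, int]]) -> List[int]:
    """Per-base read depth for one chromosome of the given length."""
    diff = [0] * (chrom_length + 1)
    for start, end in reads:
        start = max(start, 0)            # clip reads to the chromosome
        end = min(end, chrom_length)
        if start < end:
            diff[start] += 1             # depth rises where the read begins
            diff[end] -= 1               # and falls just past its last base
    # The running sum of the differences gives the depth at every position.
    return list(accumulate(diff))[:chrom_length]


cov = coverage_vector(10, [(0, 4), (2, 6), (5, 10)])
print(cov)  # [1, 1, 2, 2, 1, 2, 1, 1, 1, 1]
```

The difference array keeps the cost proportional to the number of reads plus the chromosome length, instead of incrementing every covered base of every read.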
RNA seq
Important file formats: BED (textual)
A BED file (.bed) is a tab‐delimited text file that defines a feature track. File extension .bed is recommended. The BED file format is described on the UCSC Genome Bioinformatics web site: http://genome.ucsc.edu/FAQ/FAQformat. Tracks in the UCSC Genome Browser (http://genome.ucsc.edu/) can be downloaded to BED files and loaded into IGB/IGV.
Notes: Zero‐based index: Start and end positions are identified using a zero‐based index. The end position is excluded. For example, setting start‐end to 1‐2 describes exactly one base, the second base in the sequence (ACGT).
track name=pairedReads description="Clone Paired Reads"
Chr22 1000 5000 cloneA
Chr22 2000 6000 cloneB
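A short Python sketch of reading BED features and applying the zero-based, end-excluded convention from the notes above. `BedFeature` and `parse_bed` are hypothetical names. The sketch splits fields on any whitespace and skips lines starting with `track`, `browser` or `#`, which are simplifying assumptions.

```python
from typing import List, NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # zero-based, included
    end: int    # zero-based, excluded
    name: str


def parse_bed(text: str) -> List[BedFeature]:
    """Parse the first four BED columns, ignoring track/browser lines."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        name = fields[3] if len(fields) > 3 else ""
        features.append(BedFeature(fields[0], int(fields[1]), int(fields[2]), name))
    return features


example = """track name=pairedReads description="Clone Paired Reads"
Chr22 1000 5000 cloneA
Chr22 2000 6000 cloneB
"""

features = parse_bed(example)
# With an excluded end, the feature length is simply end - start.
lengths = [f.end - f.start for f in features]
# start-end of 1-2 selects exactly the second base, as Python slicing does.
second_base = "ACGT"[1:2]
print(features, lengths, second_base)
```

Python slices follow the same half-open convention as BED, so the coordinates can be used directly as slice bounds.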
Important file formats: BEDGraph (numerical)
The bedGraph format is line‐oriented. Bedgraph data are preceded by a track definition line, which adds a number of options for controlling the default display of this track. The track type is REQUIRED, and must be “bedGraph”.
Bedgraph track data values can be integer or real, positive or negative values. Chromosome positions are specified as 0‐relative. The first chromosome position is 0. The last position in a chromosome of length N would be N ‐ 1. Only positions specified have data. Positions not specified do not have data and will not be graphed. All positions specified in the input data must be in numerical order. The bedGraph format has four columns of data:
track type=bedGraph name="BedGraph Format"
chr19 49302000 49302300 10
chr19 49302300 49302600 20
chr19 49302600 49302900 25
Intervals can be of any length and overlapping.
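As an illustration of the four-column layout, the sketch below turns a per-base signal into bedGraph lines by merging runs of equal values. The function name `to_bedgraph` is hypothetical. The sketch assumes end-excluded intervals as in BED, and it omits zero-valued runs, which is a design choice of this example and not a requirement of the format.

```python
from itertools import groupby
from typing import List, Sequence


def to_bedgraph(chrom: str, values: Sequence[int]) -> List[str]:
    """Run-length encode a per-base vector into bedGraph data lines."""
    lines = []
    pos = 0
    for value, run in groupby(values):
        length = sum(1 for _ in run)     # number of consecutive equal values
        if value != 0:                   # skip zero runs (choice of this sketch)
            lines.append(f"{chrom}\t{pos}\t{pos + length}\t{value}")
        pos += length
    return lines


lines = to_bedgraph("chr19", [0, 0, 10, 10, 20, 0])
print('track type=bedGraph name="example"')
print("\n".join(lines))
```

Because positions are visited left to right, the emitted lines are already in numerical order, as the format requires.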
Genome browser
Kent, W. James, et al. "The human genome browser at UCSC." Genome research 12.6 (2002): 996‐1006.
Visualizing genomic data
• User views clickable image (html map tag), not real data
• Precludes analytics
From HUGO to HUGE data knowledge