TRANSCRIPTOMICS AND GENE ANNOTATION
UNIT 7

GENE EXPRESSION – A MISNOMER?
• In reality, gene expression can only be quantified by looking at protein products in the cell (via proteomic approaches).
• However, the term has been co-opted to describe differences in transcript (mRNA) levels.
• Transcripts may or may not be translated into protein and thus don’t necessarily reflect gene expression.

DIFFERENTIAL GENE EXPRESSION
• Responsible for differences between cell types of the same organism (e.g., kidney vs. brain cells)
• Means by which development is controlled
• Involves gene feedback loops and induction/repression initiated by external (environmental stimuli) and/or internal (transcription factors) forces

A TYPICAL EUKARYOTIC PROTEIN CODING GENE

TRANSCRIPTOMICS
• The study of the complete set of RNAs (transcriptome) encoded by the genome
  • 1 – of a specific cell or organism
  • 2 – at a specific time or
  • 3 – under a specific set of conditions
• Dependent on:
  • The organism
  • The cell, cell line, or tissue
  • The developmental stage
  • The condition/treatment
• Usually, we tend to ignore the rRNAs and tRNAs

TRANSCRIPTOME COMPLEXITY

HOW MUCH OF THE HUMAN GENOME IS TRANSCRIBED?
• ENCODE project (Nature, 2007)
• Examined 1% of the genome (~30 Mb)
• “The human genome is pervasively transcribed, such that the majority of its bases are associated with at least one primary transcript and many transcripts link distal regions to established protein coding loci.”
• “Many novel non-protein-coding transcripts have been identified, with many of these overlapping protein-coding loci and others located in regions of the genome previously thought to be transcriptionally silent.”
• “Numerous previously unrecognized transcription start sites have been identified, many of which show chromatin structure and sequence-specific protein binding properties similar to well-understood promoters.”
• 74% of bases are represented in a primary transcript, with evidence coming from 2 or more experimental technologies
• This project was published before more advanced techniques were developed, and its conclusions have been contested.

TRANSCRIPTOMICS
• The main/common tasks in a transcriptome analysis
  • 1 – Identify your targets/goals (protein coding, non-coding)
  • 2 – Transcriptome reconstruction
    • Identify the genes transcribed
    • Identify isoforms
  • 3 – Expression quantification
  • 4 – Identify differential expression
  • 5 – Transcription mapping
  • 6 – Identify gene variants within/among individuals
  • 7 – Identify allele-specific transcription

HISTORY OF GENE EXPRESSION ANALYSIS
• Northern blotting
• EST sequencing
• Microarrays
• RNA-Seq

RNA-SEQ: WHOLE TRANSCRIPTOME SHOTGUN SEQUENCING
• The current state-of-the-art
• Process
  • Isolate mRNA from a tissue
  • Reverse transcribe mRNA into cDNA
  • Sequence cDNA with a next-generation sequencer (e.g., Illumina)
  • Quantify number of different transcripts, copy number of each transcript, and identify SNPs, splicing variations, etc.
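
The quantification step in the process above (copy number of each transcript) can be sketched in a few lines of Python. This is an illustrative sketch only, not a method taken from these slides. It assumes every read has already been assigned to exactly one transcript, the transcript names are made up, and the counts-per-million (CPM) scaling is one common convention chosen here for simplicity. It ignores transcript length and reads that map to more than one transcript.

```python
from collections import Counter


def counts_per_million(read_assignments):
    """Count reads per transcript and scale the counts to CPM.

    read_assignments: iterable of transcript IDs, one entry per mapped read
    (a hypothetical input; real pipelines derive this from alignments).
    Returns (raw_counts, cpm) as two dictionaries keyed by transcript ID.
    """
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        return {}, {}
    # Scale so that the values of one sample sum to one million.
    cpm = {tx: n * 1_000_000 / total for tx, n in counts.items()}
    return dict(counts), cpm


# Toy data: eight reads assigned to three made-up transcripts.
reads = ["geneA", "geneA", "geneB", "geneA", "geneC", "geneB", "geneA", "geneA"]
counts, cpm = counts_per_million(reads)
for tx in sorted(counts):
    print(tx, counts[tx], round(cpm[tx]))
```

Scaling by the total number of reads makes samples sequenced to different depths roughly comparable, which is a prerequisite for the differential expression task listed earlier.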
[Figure: RNA-Seq workflow. Sample preparation; next generation sequencing (NGS); data analysis: mapping reads, visualization (Gbrowser), de novo assembly, quantification]

RNA-SEQ
• Pros
  • High sensitivity, quantitative, rapid
  • Less expensive than microarray development (but not microarray screening)
  • Not limited to detecting transcripts that correspond to known genomic sequence
  • Can provide single-nucleotide resolution for alternative splicing and exon boundaries
  • Large dynamic range for transcript detection
• Cons
  • May provide more information than you want
  • Not cost-effective if you are only looking at a handful of genes

RNA-SEQ
• Library construction challenges
• How to avoid rRNA?
  • Use oligo-dT enrichment
    • Bias toward 3’ end
  • Protocols to remove rRNA followed by random fragmentation
    • More even coverage but bias against the ends
  • “Not-so-random” (NSR) priming
    • Subtract the random hexamers and heptamers that are likely complementary to rRNAs before first round cDNA synthesis

RNA-SEQ
• Strand-specific library or not?
• Transcription can occur in both directions
  • Genes can be located on either DNA strand and sometimes overlap.
  • An RNA molecule complementary to a given mRNA can also be transcribed (antisense transcription); such transcripts are involved in regulatory mechanisms.
• Knowing which DNA strand an RNA molecule originates from is an important piece of information: it helps resolve annotation ambiguities for known and novel genes, provides hints to the function of the studied RNA, and helps with quantification
• Knowing the strand of origin can resolve questions about the gene of origin, the function of the RNA, and its expression level

RNA-SEQ
• Standard library example
[Figure: first strand synthesis, second strand synthesis, A addition, adapter ligation]

RNA-SEQ
• Major methods for strand-specific libraries
• 1. Differential adapter ligation to RNA
• 2. Differential adapter priming (RT method)
• 3.
‘Strand marking’ of the RNA or second-strand cDNA (dUTP)

The problem:
- Reconstruct full-length transcripts (1000s of bp) from short reads (100 bp)
- Read coverage highly variable
- Capture alternative isoforms
- Annotation? Expression differences? Novel non-coding?

Solution(?):
- Read-to-reference alignments, assemble transcripts
- Assemble transcripts directly

Read mapping (good reference) vs. de novo assembly (no genome)

Cufflinks Workflow
- Map reads to reference genome:
  - Disambiguate alignments
  - Allow for gaps (introns)
  - Use pairs (if available)
- Build sequence consensus:
  - Identify exons & boundaries
  - Identify alternative isoforms
  - Quantify isoform expression
- Differential expression:
  - Between isoforms
  - Between samples
  - Annotation-based and novel transcripts

Transcriptome assembly with Trinity: How it works
Brian Haas, Moran Yassour, Kerstin Lindblad-Toh, Aviv Regev, Nir Friedman, David Eccles, Alexie Papanicolaou, Michael Ott, …

Trinity Workflow
- Compress data (inchworm):
  - Cut reads into k-mers (k consecutive nucleotides)
  - Overlap and extend
  - Report all sequences (“contigs”)
- Build de Bruijn graph (chrysalis):
  - Collect all contigs that share k-1-mers
  - Build graph
  - Map reads to components
- Enumerate all consistent possibilities (butterfly):
  - Unwrap graph into linear sequences
  - Use reads and pairs to eliminate false sequences

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• ENCODE project (Nature, 2012)
• “…assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions.”
• “We define a functional element as a discrete genome segment that
  • 1 – encodes a defined product (e.g. protein or non-coding RNA)
  • 2 – or displays a reproducible biochemical signature (e.g. protein-binding or a specific chromatin structure).”
• Even more criticism of this work (and I think deservedly so).
• The definition above is much too loose and allows for just about anything to be considered ‘functional’.
• Essentially, anything that produces a transcript or is bound by a protein is ‘functional’.
• Criticized most soundly, in my opinion, by Dan Graur in: “On the immortality of television sets: ‘Function’ in the human genome according to the evolution-free gospel of ENCODE”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• On the immortality of television sets: ‘Function’ in the human genome according to the evolution-free gospel of ENCODE (GBE 2013)
• Main points – “This absurd conclusion was reached through…”
  • “employing the seldom used ‘causal role’ definition of biological function and then applying it inconsistently to different biochemical properties”
  • “committing a logical fallacy known as ‘affirming the consequent’”
  • “failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage DNA’”
  • “using analytical methods that yield biased errors and inflate estimates of functionality”
  • “favoring statistical sensitivity over specificity”
  • “emphasizing statistical significance rather than the magnitude of the effect.”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• Main points – “This absurd conclusion was reached through…”
• “Employing the seldom used ‘causal role’ definition of biological function…”
• What is the meaning of ‘function’?
• Selected effect definition is historical and evolutionary
  • For a trait, T, to have a proper biological function, F, it is necessary and sufficient that two conditions hold
    • 1 – T originated as a reproduction of some prior trait that performed F (or something similar) in the past
    • 2 – T exists because of F
  • The ‘selected effect’ function of a trait is the effect for which it was selected or is maintained
• Causal role definition
  • For a trait, Q, to have a causal role function, G, it is necessary and sufficient that Q performs G.
• The heart
  • Selected effect – to pump blood
  • Causal role – to add mass to the body

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• Main points – “This absurd conclusion was reached through…”
• “Employing the seldom used ‘causal role’ definition of biological function…”
• Two identical sequences (TATAAA) in the genome at distinct loci
  • Instance 1 has been selected for and maintained by natural selection with the effect of binding a transcription factor to initiate gene expression
  • Instance 2 has arisen by chance; because of its sequence it can also bind a transcription factor, but that binding probably has no impact on function
  • Instance 1 – selected effect, Instance 2 – causal role
• “ENCODE adopted a strong version of the causal role definition of function, according to which a functional element is a discrete genome segment that produces a protein or an RNA or displays a reproducible biochemical signature (e.g., protein binding). Oddly, ENCODE not only uses the wrong concept of functionality, it uses it wrongly and inconsistently.”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• “Committing a logical fallacy known as ‘affirming the consequent’”
• If P, then Q. Q. Therefore, P.
  • According to ENCODE, DNA segments that ‘function’ in a process (e.g. gene regulation) tend to display a certain property (e.g. transcription factor binding)
  • Another DNA segment displays said property (e.g. it binds a transcription factor)
  • Therefore, the DNA segment is functional (e.g. is involved in gene regulation)
• All of my ‘nopes’ apply.
• One of my favorite passages, “the ENCODE authors singled out transcription as a function, as if the passage of RNA polymerase through a DNA sequence is in some way more meaningful than other functions. But, what about DNA polymerase and DNA replication? Why make a big fuss about 74.7% of the genome that is transcribed, and yet ignore the fact that 100% of the genome takes part in a strikingly “reproducible biochemical signature” – it replicates!”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• “Committing a logical fallacy known as ‘affirming the consequent’”
  • “Transcription ≠ function”
  • “Histone modification ≠ function”
  • “Open chromatin ≠ function”
  • “Transcription factor binding ≠ function”
  • “DNA methylation ≠ function”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• “Failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage DNA’”
• Misconceptions about ‘junk DNA’
  • 1 – lack of knowledge of original definition
  • 2 – belief that evolution can always get rid of nonfunctional DNA
  • 3 – belief that ‘future potential’ constitutes ‘function’
• Original definition of junk DNA – a genomic segment on which selection does not operate.
• “This sense of the term “junk DNA” was used by Jacob (1977) in his famous paper “Evolution and Tinkering”: “[N]atural selection does not work as an engineer … It works like a tinkerer – a tinkerer who does not know exactly what he is going to produce but uses whatever he finds around him whether it be pieces of string, fragments of wood, or old cardboards … The tinkerer … manages with odds and ends.
What he ultimately produces is generally related to no special project, and it results from a series of contingent events, of all the opportunities he had to enrich his stock with leftovers.”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• “Failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage DNA’”
• Misconceptions about ‘junk DNA’
  • 1 – lack of knowledge of original definition
  • 2 – belief that evolution can always get rid of nonfunctional DNA
  • 3 – belief that ‘future potential’ constitutes ‘function’
• “Evolution can only produce a genome devoid of “junk” if and only if the effective population size is huge and the deleterious effects of increasing genome size are considerable.”
  • In bacteria, this generally applies. Generation time is correlated with genome size and effective population sizes are enormous.
  • In eukaryotes, not so much.

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• “Failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage DNA’”
• Misconceptions about ‘junk DNA’
  • 1 – lack of knowledge of original definition
  • 2 – belief that evolution can always get rid of nonfunctional DNA
  • 3 – belief that ‘future potential’ constitutes ‘function’
• Teleology – the philosophy that nature has goals
• “Junk DNA may, in fact, exhibit a very similar behavior to the regular junk in one's garage, which is kept for years and years, and then thrown out a day before it may become useful.”
• “Some years ago I noticed that there are two kinds of rubbish in the world and that most languages have different words to distinguish them. There is the rubbish we keep, which is junk, and the rubbish we throw away, which is garbage. The excess DNA in our genomes is junk, and it is there because it is harmless, as well as being useless, and because the molecular processes generating extra DNA outpace those getting rid of it.
Were the extra DNA to become disadvantageous, it would become subject to selection, just as junk that takes up too much space, or is beginning to smell, is instantly converted to garbage …” (Brenner 1998)

NORTHERN BLOTTING
• What is it?
  • Detection of RNA on a substrate via hybridization with a probe
• Pros
  • No amplification involved
  • Can study expression of multiple genes (e.g., 5-10) on same gel as long as they are of different molecular weights
  • Allows detection of some alternative splicing
• Cons
  • Must blot gels
  • Requires lots of starting mRNA
  • RNA highly vulnerable to degradation
  • Not high-throughput

NORTHERN BLOTTING
• Isolate mRNAs from multiple samples that differ with regard to tissue type, developmental stage, disease resistance, exposure to stimulus, etc.
• Place each mRNA population in its own well of a denaturing agarose gel (formaldehyde added to gel to keep inter- and intramolecular base pairing from occurring)
• Separate mRNAs by electrophoresis
• Blot mRNA onto membrane. Fix RNA to membrane.
• Hybridize labeled DNA probe(s) to membrane
• Quantify differences in transcript levels between samples

NORTHERN BLOTTING
• What is it?
  • Detection of RNA on a substrate via hybridization with a probe

HOW NORTHERN BLOTTING IS USED
• -actin expression in the brain of a mouse
• Can also look at changes in expression in multiple tissues as a function of time

EST SEQUENCING (SANGER)
• Sequence ESTs isolated from different tissues or different experimental trials
• Compare similarities and differences in EST expression patterns
• Dominance of certain transcripts can make EST sequencing an inefficient means of measuring changes in gene expression
  • For example, in estrogen-treated chicken oviduct, > 50% of transcripts in cell are product of same gene

EST SEQUENCING (SANGER)
• Pros
  • Lots of sequence information that can be used for lots of different purposes
  • Can study expression variation of whole transcriptome, not just a handful of genes
  • Allows detection of alternative splicing
• Cons
  • Expensive as dominant cDNAs will be sequenced over and over again
  • Not likely to be truly quantitative due to RT and cloning biases

MICROARRAYS
• DNAs are spotted onto a glass microscope slide or similar substrate
• Fluorescence techniques rather than radioisotopic techniques are used in visualization
• Spots are about the size of a typed period using 10 pt font.
• Each spot contains roughly the same amount of DNA
• The density of spots on a microarray depends upon the type of robot and the wishes of the scientist
• A typical microarray robot can make about 12 slides an hour
• Slides can be stored at room temperature
• Many PCR products can be spotted on a single array (currently up to 390,000 spots)
• For species with relatively few genes (e.g., yeast), it is possible to spot all the genes in an ordered manner onto a single array

MICROARRAY TWO–CHANNEL ARRAY EXPERIMENTS
• Expression patterns are compared by hybridizing both control and test cDNA populations to the same microarray(s)
• Each cDNA population is labeled with a different fluorophore (fluorescent tag)
• While relative changes in gene expression can be detected, using two cDNA populations (which can compete with each other) does not permit estimation of absolute expression levels

OVERVIEW OF TWO–CHANNEL EXPERIMENT

DETAIL OF TWO–CHANNEL EXPERIMENT

MICROARRAY
• Use the fluorescence data to determine exactly which genes are expressed differently between two tissue types
• Quantify differences in expression for individual genes
• Actually know which genes correspond to which spots
• Find genes that may be activated together (gene expression pathways)

MICROARRAYS
• Pros
  • Can study expression variation of whole transcriptome, not just a handful of genes
  • Definitely high-throughput
  • Rapid screening possible
• Cons
  • Very few once good slides/chips are made
  • Not truly as quantitative as qPCR
  • Not practical with poorly characterized genomes (expense in designing chip requires a commitment from a relatively large scientific community)
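
For the two-channel experiments described above, the relative change at each spot is commonly summarized as the log2 ratio of the test-channel intensity to the control-channel intensity. The following is a minimal sketch under stated assumptions, not a procedure from these slides. It assumes intensities are already background-corrected. The spot names, the intensity values, and the `floor` parameter (used only to avoid dividing by or taking the log of zero) are hypothetical. Real analyses also normalize between the two dyes, which is omitted here.

```python
import math


def log2_ratios(test, control, floor=1.0):
    """Return log2(test / control) for every spot present in both channels.

    test, control: dicts mapping spot ID -> fluorescence intensity.
    floor: hypothetical lower bound applied to each intensity so that
    zero or near-zero readings do not produce infinite ratios.
    """
    ratios = {}
    for spot in test.keys() & control.keys():
        t = max(test[spot], floor)
        c = max(control[spot], floor)
        ratios[spot] = math.log2(t / c)
    return ratios


# Made-up intensities for three spots.
test_channel = {"spot1": 4000.0, "spot2": 500.0, "spot3": 1200.0}
control_channel = {"spot1": 1000.0, "spot2": 2000.0, "spot3": 1200.0}

ratios = log2_ratios(test_channel, control_channel)
for spot in sorted(ratios):
    print(spot, round(ratios[spot], 2))
```

A positive value means the transcript is more abundant in the test population, a negative value means it is more abundant in the control, and zero means no change. Because only a ratio is reported, this reflects the point made above that two-channel arrays give relative rather than absolute expression levels.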