TRANSCRIPTOMICS UNIT 5

GENE EXPRESSION – A MISNOMER?
• In reality, gene expression can only be quantified by looking at protein products in the cell (via proteomic approaches).
• The term has been co-opted to describe differences in transcript (mRNA) levels.
• Transcripts may or may not be translated into protein and thus don’t necessarily reflect gene expression.

DIFFERENTIAL GENE EXPRESSION
• Responsible for differences between cell types of the same organism (e.g., kidney vs. brain cells)
• Means by which development is controlled
• Involves gene feedback loops and induction/repression initiated by external (environmental stimuli) and/or internal (transcription factors) forces

A TYPICAL EUKARYOTIC PROTEIN CODING GENE
[Figure]

TRANSCRIPTOMICS
• The study of the complete set of RNAs (transcriptome) encoded by the genome
  • 1 - of a specific cell or organism
  • 2 - at a specific time or
  • 3 - under a specific set of conditions
• Dependent on:
  • The organism
  • The cell, cell line, or tissue
  • The developmental stage
  • The condition/treatment
• Usually, we tend to ignore the rRNAs and tRNAs

TRANSCRIPTOME COMPLEXITY
[Figure]

HISTORY OF GENE EXPRESSION ANALYSIS
• Northern blotting
• EST sequencing
• Microarrays
• RNA-Seq

NORTHERN BLOTTING
• What is it?
  • Detection of RNA on a substrate via hybridization with a probe
• Pros
  • No amplification involved
  • Can study expression of multiple genes (e.g., 5-10) on the same gel as long as they are of different molecular weights
  • Allows detection of some alternative splicing
• Cons
  • Must blot gels (messy and time-consuming)
  • Requires LOTS of starting mRNA
  • RNA highly vulnerable to degradation
  • Not high-throughput

NORTHERN BLOTTING
• Isolate mRNAs from multiple samples that differ with regard to tissue type, developmental stage, disease resistance, exposure to stimulus, etc.
• Place each mRNA population in its own well of a denaturing agarose gel (formaldehyde added to gel to keep inter- and intramolecular base pairing from occurring)
• Separate mRNAs by electrophoresis
• Blot mRNA onto membrane. Fix RNA to membrane.
• Hybridize labeled DNA probe(s) to membrane
• Quantify differences in transcript levels between samples

HOW NORTHERN BLOTTING IS USED
• β-actin expression in the brain of a mouse
• Can also look at changes in expression in multiple tissues as a function of time

EST SEQUENCING (SANGER)
[Figure: mRNA pool → cDNA pool → cloned library → clone sequences]
• Expressed Sequence Tags are a set of single sequence reads from a library of cDNAs of a given sample

EST SEQUENCING (SANGER)
• Sequence ESTs isolated from different tissues or different experimental trials
• Compare similarities and differences in EST expression patterns
• Dominance of certain transcripts can make EST sequencing an inefficient means of measuring changes in gene expression
  • For example, in estrogen-treated chicken oviduct, more than 50% of the transcripts in a cell are the product of one gene

EST SEQUENCING (SANGER)
• Pros
  • Lots of sequence information that can be used for lots of different purposes
  • Can study expression variation of whole transcriptome, not just a handful of genes
  • Allows detection of alternative splicing
• Cons
  • Expensive/inefficient because dominant cDNAs will be sequenced over and over again
  • Not likely to be truly quantitative due to RT and cloning biases

MICROARRAYS
• DNAs are spotted onto a glass microscope slide or similar substrate
• RNAs from a tissue/cell culture are hybridized to the DNAs
• Fluorescence techniques rather than radioisotopic techniques are used in visualization
• Spots are about the size of a period typed in 10 pt font.
• Each spot contains roughly the same amount of DNA
• Use the fluorescence data to determine exactly which genes are expressed differently between two tissue types
• Quantify differences in expression for individual genes
• Actually know which genes correspond to which spots
• Find genes that may be activated together (gene expression pathways)

MICROARRAY
[Figure]

MICROARRAY QUANTITATION
[Figure]

MICROARRAY VISUALIZATION
[Figure]

MICROARRAYS
• Pros
  • Can study expression variation of whole transcriptome, not just a handful of genes
  • Definitely high-throughput
  • Rapid screening possible
  • Slides can be stored at room temperature
  • Many PCR products can be spotted on a single array (up to 390,000 spots)
  • For species with relatively few genes (e.g., yeast), it is possible to spot all the genes in an ordered manner onto a single array
• Cons
  • Very few once good slides/chips are made
  • Not practical with poorly characterized genomes (expense in designing chip requires a commitment from a relatively large scientific community)
  • Only as good as the genes you spot on the slide.

RNA-SEQ: WHOLE TRANSCRIPTOME SHOTGUN SEQUENCING
• The current state-of-the-art
• Process
  • Isolate mRNA from a tissue or tissues (replicates?)
  • Build a sequencing library
  • Sequence (e.g., Illumina)
  • Transcript identification and/or quantification

RNA-SEQ: RNA ISOLATION AND QUALITY
• RNA degrades quickly
• RIN – RNA integrity number
  • Calculated by identifying a combination of characteristics
    • Total RNA ratio – compares the ratio of rRNA to other RNAs; more intact RNA indicates little degradation (higher is better)
    • Height of 28S rRNA peak – 28S rRNA is typically degraded more quickly than 18S; more intact 28S rRNA indicates little degradation (higher is better)
    • Fast area ratio – indicates how much degradation has occurred (lower is better)
    • Marker height – a low value indicates that only small amounts of RNA have been degraded (lower is better)

RNA-SEQ: RNA LIBRARY PREP
• Standard cDNA library example
[Figure: first-strand synthesis with random hexamers, second-strand synthesis, A addition, adapter ligation]
• Major problem – you get ALL RNAs, including rRNA

RNA-SEQ: RNA LIBRARY PREP
• Library construction challenges
• How to avoid rRNA?
  • Use oligo-dT enrichment
    • Bias toward 3’ end
  • Protocols to remove rRNA followed by random fragmentation
    • More even coverage but bias against the ends
  • “Not-so-random” (NSR) priming
    • Subtract the random hexamers and heptamers that are likely complementary to rRNAs before first round cDNA synthesis
[Figure: examples from yeast]

RNA-SEQ: RNA LIBRARY PREP
• Strand-specific library or not?
  • Transcription can occur in both directions
  • Genes can be located on either DNA strand and sometimes overlap.
  • An RNA molecule complementary to a given mRNA can also be transcribed (antisense transcription); such transcripts are involved in regulatory mechanisms.
  • Knowing the strand of origin can resolve questions about the gene of origin, the function of the RNA, and its expression level

RNA-SEQ: RNA LIBRARY PREP
• Major methods for strand-specific libraries
  • 1. Differential adapter ligation to RNA
  • 2. ‘Strand marking’ of the RNA or second-strand cDNA (dUTP)
  • 3. Differential adapter priming (RT method)

RNA-SEQ: SEQUENCING DEPTH
• How much to sequence depends heavily on the goals and targeted starting material
• Detailed analysis of differential expression will require at least 10s of millions of reads
• Simple discovery of what is being transcribed requires as few as 50,000-100,000 reads

RNA-SEQ: ANALYSIS
• Heavily dependent on the goals and resources of the researcher
• Transcript Identification
  • Map to reference genome
  • Map to available transcriptome
  • Assemble transcriptome de novo
• Transcript Quantification
• Differential Expression Analysis
• Alternative Splicing Analysis
• Small RNAs

RNA-SEQ: TRANSCRIPT IDENTIFICATION
• Alignment
  • Map to available reference genome
  • Must deal with splice junctions
[Figure: spliced and unspliced reads from a mature (polyadenylated) mRNA mapped back to the gene]
  • Reads may map uniquely or be multi-mapped (pseudogenes, paralogs, repetitive sequences, etc.)
  • Use a gapped mapper (TopHat, STAR)

RNA-SEQ: TRANSCRIPT IDENTIFICATION
• Alignment
  • Map to available transcriptome
  • May need to deal with alternative splicing
[Figure: spliced and alternatively spliced reads mapped to Transcript 1 and Transcript 2 (with retained intron)]
  • Again, reads may map uniquely or be multi-mapped (pseudogenes, paralogs, repetitive sequences, etc.)
  • Use an ungapped mapper (Bowtie)
  • Reduced ability to identify new transcripts

RNA-SEQ: TRANSCRIPT IDENTIFICATION
• Alignment – no reference
  • Assemble the transcriptome de novo
    • Trinity package uses a de Bruijn graph approach

RNA-SEQ: TRANSCRIPT IDENTIFICATION
• Alignment – no reference
  • Assemble the transcriptome de novo
    • Other packages include Oases, SOAPdenovo-Trans, Trans-ABySS
  • Map reads to assembled transcriptome
  • Again, reads may map uniquely or be multi-mapped (pseudogenes, paralogs, repetitive sequences, etc.)
  • Use an ungapped mapper (Bowtie)
  • Increased ability to identify novel transcripts and isoforms

RNA-SEQ: TRANSCRIPT QUANTIFICATION
• The most common RNA-Seq task
• Basically, you count the number of reads that map to a particular locus
  • Assumes that library was constructed in such a way that reads are proportional to transcript abundance
• Simple counts won’t work because of differential gene and transcript lengths
  • Longer and more highly expressed transcripts are more likely to be represented among RNA-seq reads
  • Several measures normalize by transcript length and the total number of reads captured and mapped in the experiment

RNA-SEQ: TRANSCRIPT QUANTIFICATION
• The most common RNA-Seq task
• Basically, you count the number of reads that map to a particular locus
  • Assumes that library was constructed in such a way that reads are proportional to transcript abundance
• Simple counts won’t work because of differential gene and transcript lengths
• Standard measures
  • RPKM = reads per kilobase per million = [# of mapped reads]/[length of transcript in kilobases]/[millions of mapped reads]
  • FPKM = fragments per kilobase per million = [# of fragments]/[length of transcript in kilobases]/[millions of mapped reads]
  • FPKM is more appropriate for paired-end (PE) RNA-Seq experiments

RNA-SEQ: DIFFERENTIAL EXPRESSION
• Comparison of transcription levels among samples
• Objective: In samples that have been exposed to different treatments or in distinct tissues, identify what genes are being transcribed at higher or lower rates than others.
• Accomplished by mapping the reads to the genome or assembled transcriptome, then performing a statistical transformation of the data
• But there are problems

RNA-SEQ: DIFFERENTIAL EXPRESSION
• A simple scenario
  • Two samples, A and B, are sequenced
  • Every gene (n = 1000) that is expressed in B is expressed in A at the same level (same total number of transcripts)
  • However, in A there are 1000 additional genes that are also expressed but that are not expressed in B
  • Sample A has twice the number of transcripts, half of which are unique to A
  • If we sample each to the same depth (say 5,000,000 reads), the shared genes in sample A will have half the number of reads that they have in B
  • You should adjust (normalize) by a factor of 2
• Other factors can impact this as well: technical variation, random noise in the data, sequencing differences, etc.

RNA-SEQ: DIFFERENTIAL EXPRESSION
• How do we tell the noise from the ‘real’ differences?
• Most common method = TMM (trimmed mean of M-values)
  • A method to determine ‘global fold change’
  • Equates the overall expression levels of genes between samples under the assumption that the majority of them are not differentially expressed

RNA-SEQ: ALTERNATIVE SPLICING
• Often performed as part of the transcriptome assembly
• Can also use genome mapping to identify differential mapping to exons or mapping to introns

RNA-SEQ: FUNCTION PROFILING
• Gene Ontology (GO)
  • Suppose you find some differentially expressed genes. How do you find out what they do?
  • GO terms are associated with specific references that describe the work or analysis upon which the association between the term and gene product is based.
  • Each annotation includes an evidence code to indicate how the annotation to a particular term is supported.
    • Experimental evidence, computational evidence, author statements, curatorial statements, or inference from automated (electronic) annotation

RNA-SEQ: FUNCTION PROFILING
• Gene Ontology (GO) terms fall into three categories
  • Cellular component
    • A component of a cell, but with the proviso that it is part of some larger object; this may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).
    • Cytochrome c is a gene with the GO cellular component terms mitochondrial matrix and inner mitochondrial membrane associated with it

RNA-SEQ: FUNCTION PROFILING
• Gene Ontology (GO) terms fall into three categories
  • Molecular function
    • Activities, such as catalytic or binding activities, that occur at the molecular level. Does not specify where or when, or in what context, the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products.
    • Examples of functional terms are catalytic activity, transporter activity or binding, adenylate cyclase activity or Toll receptor binding.
    • The cytochrome c molecular function GO term is oxidoreductase activity

RNA-SEQ: FUNCTION PROFILING
• Gene Ontology (GO) terms fall into three categories
  • Biological process
    • A series of events accomplished by one or more ordered assemblies of molecular functions.
    • Examples of biological process terms are cellular physiological process or signal transduction, pyrimidine metabolism or alpha-glucoside transport.
    • Cytochrome c has the biological process GO terms oxidative phosphorylation and induction of cell death associated with it

RNA-SEQ: FUNCTION PROFILING
[Figure] Conesa et al. Genome Biology (2016) 17:13

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• ENCODE project (Nature, 2007)
  • Examined 1% of the genome (~30Mb)
  • “The human genome is pervasively transcribed, such that the majority of its bases are associated with at least one primary transcript and many transcripts link distal regions to established protein coding loci.”
  • “Many novel non-protein-coding transcripts have been identified, with many of these overlapping protein-coding loci and others located in regions of the genome previously thought to be transcriptionally silent.”
  • “Numerous previously unrecognized transcription start sites have been identified, many of which show chromatin structure and sequence-specific protein binding properties similar to well-understood promoters.”
  • 74% of bases are represented in a primary transcript with evidence coming from 2 or more experimental technologies
• This project was published before more advanced techniques were developed, and its conclusions have been contested.

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• ENCODE project (Nature, 2012)
  • “…assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions.”
  • “We define a functional element as a discrete genome segment that
    • 1 - encodes a defined product (e.g. protein or non-coding RNA)
    • 2 - or displays a reproducible biochemical signature (e.g. protein-binding or a specific chromatin structure).”
• Even more criticism of this work (and I think deservedly so).
  • The definition above is much too loose and allows for just about anything to be considered ‘functional’.
  • Essentially, anything that produces a transcript or is bound by a protein is ‘functional’.
• Criticized most soundly, in my opinion, by Dan Graur in: “On the immortality of television sets: ‘Function’ in the human genome according to the evolution-free gospel of ENCODE”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• On the immortality of television sets: ‘Function’ in the human genome according to the evolution-free gospel of ENCODE (GBE 2013)
• Main points – “This absurd conclusion was reached through…”
  • “employing the seldom used ‘causal role’ definition of biological function and then applying it inconsistently to different biochemical properties”
  • “committing a logical fallacy known as ‘affirming the consequent’”
  • “failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage DNA’”
  • “using analytical methods that yield biased errors and inflate estimates of functionality”
  • “favoring statistical sensitivity over specificity”
  • “emphasizing statistical significance rather than the magnitude of the effect.”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• Main points – “This absurd conclusion was reached through…”
  • “Employing the seldom used ‘causal role’ definition of biological function…”
• What is the meaning of ‘function’?
  • Selected effect definition is historical and evolutionary
    • For a trait, T, to have a proper biological function, F, it is necessary and sufficient that two conditions hold
      • 1 – T originated as a reproduction of some prior trait that performed F (or something similar) in the past
      • 2 – T exists because of F
    • The ‘selected effect’ function of a trait is the effect for which it was selected or is maintained
  • Causal role definition
    • For a trait, Q, to have a causal role function, G, it is necessary and sufficient that Q performs G.
• The heart
  • Selected effect – to pump blood
  • Causal role – to add mass to the body

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• Main points – “This absurd conclusion was reached through…”
  • “Employing the seldom used ‘causal role’ definition of biological function…”
• Two identical sequences (TATAAA) in the genome at distinct loci
  • Instance 1 has been selected for and maintained by natural selection with the effect of binding a transcription factor to initiate gene expression
  • Instance 2 has arisen by chance but, because of its sequence, can also bind a transcription factor; this binding probably has no impact on function
  • Instance 1 – selected effect, Instance 2 – causal role
• “ENCODE adopted a strong version of the causal role definition of function, according to which a functional element is a discrete genome segment that produces a protein or an RNA or displays a reproducible biochemical signature (e.g., protein binding). Oddly, ENCODE not only uses the wrong concept of functionality, it uses it wrongly and inconsistently.”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• Committing a logical fallacy known as ‘affirming the consequent’
  • If P, then Q. Q. Therefore, P.
  • According to ENCODE, DNA segments that ‘function’ in a process (e.g. gene regulation) tend to display a certain property (e.g. transcription factor binding)
  • Another DNA segment displays said property (e.g. it binds a transcription factor)
  • Therefore, the DNA segment is functional (e.g. is involved in gene regulation)
  • All ‘nopes’ apply.
• One of my favorite passages: “the ENCODE authors singled out transcription as a function, as if the passage of RNA polymerase through a DNA sequence is in some way more meaningful than other functions. But, what about DNA polymerase and DNA replication? Why make a big fuss about 74.7% of the genome that is transcribed, and yet ignore the fact that 100% of the genome takes part in a strikingly “reproducible biochemical signature” – it replicates!”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• Committing a logical fallacy known as ‘affirming the consequent’
  • “Transcription ≠ function”
  • “Histone modification ≠ function”
  • “Open chromatin ≠ function”
  • “Transcription factor binding ≠ function”
  • “DNA methylation ≠ function”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• “Failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage DNA’”
• Misconceptions about ‘junk DNA’
  • 1 – lack of knowledge of original definition
  • 2 – belief that evolution can always get rid of nonfunctional DNA
  • 3 – belief that ‘future potential’ constitutes ‘function’
• Original definition of junk DNA – a genomic segment on which selection does not operate.
• “This sense of the term “junk DNA” was used by Jacob (1977) in his famous paper “Evolution and Tinkering”: “[N]atural selection does not work as an engineer … It works like a tinkerer – a tinkerer who does not know exactly what he is going to produce but uses whatever he finds around him whether it be pieces of string, fragments of wood, or old cardboards … The tinkerer … manages with odds and ends. What he ultimately produces is generally related to no special project, and it results from a series of contingent events, of all the opportunities he had to enrich his stock with leftovers.”

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• “Failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage DNA’”
• Misconceptions about ‘junk DNA’
  • 1 – lack of knowledge of original definition
  • 2 – belief that evolution can always get rid of nonfunctional DNA
  • 3 – belief that ‘future potential’ constitutes ‘function’
• “Evolution can only produce a genome devoid of “junk” if and only if the effective population size is huge and the deleterious effects of increasing genome size are considerable.”
  • In bacteria, this generally applies. Generation time is correlated with genome size and effective population sizes are enormous.
  • In eukaryotes, not so much.

HOW MUCH OF THE HUMAN GENOME IS FUNCTIONAL?
• “Failing to appreciate the crucial difference between ‘junk DNA’ and ‘garbage DNA’”
• Misconceptions about ‘junk DNA’
  • 1 – lack of knowledge of original definition
  • 2 – belief that evolution can always get rid of nonfunctional DNA
  • 3 – belief that ‘future potential’ constitutes ‘function’
• Teleology – the philosophy that nature has goals
• “Junk DNA may, in fact, exhibit a very similar behavior to the regular junk in one's garage, which is kept for years and years, and then thrown out a day before it may become useful.”
• “Some years ago I noticed that there are two kinds of rubbish in the world and that most languages have different words to distinguish them. There is the rubbish we keep, which is junk, and the rubbish we throw away, which is garbage. The excess DNA in our genomes is junk, and it is there because it is harmless, as well as being useless, and because the molecular processes generating extra DNA outpace those getting rid of it. Were the extra DNA to become disadvantageous, it would become subject to selection, just as junk that takes up too much space, or is beginning to smell, is instantly converted to garbage … ”. Brenner 1998

AN EVOLUTIONARY CLASSIFICATION OF GENOMIC FUNCTION
D. GRAUR, Y. ZHENG, R.B.R. AZEVEDO
• The pronouncements of the ENCODE Project Consortium regarding “junk DNA” exposed the need for an evolutionary classification of genomic elements according to their selected-effect function. In the classification scheme presented here, we divide the genome into “functional DNA,” i.e., DNA sequences that have a selected-effect function, and “rubbish DNA,” i.e., sequences that do not. Functional DNA is further subdivided into “literal DNA” and “indifferent DNA.” In literal DNA, the order of nucleotides is under selection; in indifferent DNA, only the presence or absence of the sequence is under selection. Rubbish DNA is further subdivided into “junk DNA” and “garbage DNA.” Junk DNA neither contributes to nor detracts from the fitness of the organism and, hence, evolves under selective neutrality. Garbage DNA, on the other hand, decreases the fitness of its carriers. Garbage DNA exists in the genome only because natural selection is neither omnipotent nor instantaneous. Each of these four functional categories can be (1) transcribed and translated, (2) transcribed but not translated, or (3) not transcribed. The affiliation of a DNA segment to a particular functional category may change during evolution: functional DNA may become junk DNA, junk DNA may become garbage DNA, rubbish DNA may become functional DNA, and so on; however, determining the functionality or nonfunctionality of a genomic sequence must be based on its present status rather than on its potential to change (or not to change) in the future. Changes in functional affiliation are divided into pseudogenes, Lazarus DNA, zombie DNA, and Hyde DNA.
[Figure: decision tree with three yes/no questions: Selected effect function? Does sequence matter? Selectively neutral?]

RNA-SEQ: FUNCTION PROFILING
• Gene Ontology (GO)
• Evidence terms – Experimental evidence
  • Experimental (EXP)
  • Direct Assay (IDA)
  • Physical Interaction (IPI)
  • Mutant Phenotype (IMP)
  • Genetic Interaction (IGI)
  • Expression Pattern (IEP)

RNA-SEQ: FUNCTION PROFILING
• Gene Ontology (GO)
• Evidence terms – Computational evidence
  • Sequence or Structural Similarity (ISS)
  • Sequence Orthology (ISO)
  • Sequence Alignment (ISA)
  • Sequence Model (ISM)
  • Genomic Context (IGC)
  • Biological Aspect of Ancestor (IBA)
  • Biological Aspect of Descendant (IBD)
  • Key Residues (IKR)
  • Rapid Divergence (IRD)
  • Reviewed Computational Analysis (RCA)

RNA-SEQ: FUNCTION PROFILING
• Gene Ontology (GO)
• Author statement terms
  • Traceable Author Statement (TAS)
  • Non-traceable Author Statement (NAS)
• Curatorial statement terms
  • Inferred by Curator (IC)
  • No Biological Data (ND)
• Automatically assigned evidence term
  • Inferred from Electronic (automated) Annotation (IEA)
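
The evidence-code groupings listed on the slides above could be held in a small lookup table, for example to keep only experimentally supported annotations when profiling a list of differentially expressed genes. The following is a minimal Python sketch under stated assumptions: the dictionary layout, the function names `evidence_group` and `keep_annotations`, and the (gene, GO term, evidence code) tuple layout are all hypothetical, and the sample annotations are made up for illustration. It is not part of any official GO tooling.

```python
# Evidence codes grouped exactly as on the slides above.
# The grouping names and this dictionary layout are illustrative assumptions.
EVIDENCE_GROUPS = {
    "experimental": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "computational": ["ISS", "ISO", "ISA", "ISM", "IGC",
                      "IBA", "IBD", "IKR", "IRD", "RCA"],
    "author statement": ["TAS", "NAS"],
    "curatorial statement": ["IC", "ND"],
    "automatic": ["IEA"],
}

# Invert the table so that a single code can be looked up directly.
CODE_TO_GROUP = {
    code: group
    for group, codes in EVIDENCE_GROUPS.items()
    for code in codes
}


def evidence_group(code):
    """Return the evidence group for a code, or 'unknown' if it is not listed."""
    return CODE_TO_GROUP.get(code.strip().upper(), "unknown")


def keep_annotations(annotations, allowed_groups):
    """Keep (gene, go_term, evidence_code) tuples whose code is in an allowed group.

    The tuple layout is a hypothetical convenience, not a standard file format.
    """
    return [a for a in annotations if evidence_group(a[2]) in allowed_groups]


# Made-up annotations, for illustration only.
example_annotations = [
    ("geneA", "term1", "IDA"),
    ("geneA", "term2", "IEA"),
    ("geneB", "term3", "ISS"),
]

experimental_only = keep_annotations(example_annotations, {"experimental"})
print(experimental_only)
```

Filtering by group rather than by individual code keeps the choice of how much to trust each kind of evidence in one place; a stricter or looser analysis only changes the set passed as `allowed_groups`.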