CHIP-seq and RNA-seq
RNA-seq
Manpreet S. Katari

[Figure: DNA, RNA, protein, phenotype, with cDNA labeled]
Abundance of mRNA is what we try to measure

Microarrays vs Northern blots: from Gene to Genome Science
• Northern blot: limited by the number of lanes in a gel
• Microarray: a large number of DNA fragments are attached in a systematic way to a solid substrate; it can measure mRNA levels for thousands of genes (~ every gene in a genome) in parallel

Evolution of Sequence Technology

Transcriptomics using RNA-seq

Genome-wide expression analysis
• Goal: to measure RNA levels of all genes in a genome under various experimental conditions
• RNA levels vary with:
  – Cell type
  – Developmental stage
  – External stimuli
  – Disease state
• Time and location of expression provide information on genes' function and interactions, and can be useful for many purposes, including disease diagnostics and medical applications.

For High-Throughput Transcriptomics studies, comparisons are almost always across experiments
[Figure: two bar charts of expression (axis from 0 to 45) for Gene A and Gene B; sample labels: whole body, liver, lung, brain, kidney]

Questions that can be addressed with genome-wide expression analysis:
• What genes have similar function?
• What regulatory pathways exist?
• Can we subdivide experiments or genes into meaningful classes?
• Can we correctly classify an unknown experiment or gene into a known class?
• Can we make better treatment decisions for a cancer patient based on his or her gene expression profile?

First two basic tasks in generating meaningful data for transcriptomics analysis
• Normalize or scale all samples and replicates to each other
• Make a (statistical) statement about what changes are evident in the comparison

Microarrays
Provide the mRNA level of thousands of genes (sometimes almost all known genes in a genome) in a given sample
Sample = tissue (e.g., liver, brain), tissue in a specific environment or state (e.g., brain with cancer), etc.

Three types of arrays
• Spotted microarrays
  – Long dsDNA (typically genomic PCR products)
• On-chip oligonucleotide synthesis
  – Photolithography
    • Affymetrix (~25-mers)
  – Ink-jet printing
    • Agilent (~60-mers)

Sample labeling
Fluorescent cDNA
• cDNA made using reverse transcriptase
• Fluorescently labeled nucleotides added
• Labeled nucleotides incorporated into cDNA
cRNA + biotin
• cDNA made using reverse transcriptase
• Linker added with T7 RNA polymerase recognition site
• T7 polymerase added and biotin labeled RNA bases
• Biotin label incorporated into cRNA

Microarray hybridization
Spotted microarrays
  – Competitive hybridization: two labeled cDNA samples (experimental and control) hybridized to the same slide
  – Cy3 and Cy5 dye labeling, fluoresce at different wavelengths
Affymetrix GeneChips
  – One labeled RNA population per chip
  – Biotin labeling, binds to fluorescently labeled avidin
(Comparison made between hybridization intensities of the same oligonucleotides on different chips.)
[Figure: samples, mRNA, cDNA, DNA microarray]

Affymetrix system

What is the Affymetrix Signal?
1. Background subtraction:
   1. Microarray is divided into sectors
   2. Probe signal is ordered and the lowest 2% is taken as the noise level
   3. A weighted mean of the background is subtracted from the signal, such that closer sectors are weighted more heavily

Background Adjustment
Estimating background effect
PM = true signal + background

Quantile Normalization
Original values (rows are genes, columns are samples):
  Gene A      3    100    500
  Gene B     17     10    150
  Gene C     10   1000      3
1. Sort each column disregarding gene order
              3     10      3
             10    100    150
             17   1000    500
2. Calculate row averages: 5.3, 86.7, 505.7
3. Substitute average values for real ones
            5.3    5.3    5.3
           86.7   86.7   86.7
          505.7  505.7  505.7
4. Restore gene order
  Gene A    5.3   86.7  505.7
  Gene B  505.7    5.3   86.7
  Gene C   86.7  505.7    5.3

Normalizing the Data
• RPKM (Reads per Kilobase of exons per million reads)
  Score = R / (N × T)
  R = # of unique reads for the gene
  N = Size of the gene (sum of exons / 1000)
  T = total number of reads in the library mapped to the genome / 1,000,000

Reproducibility, linearity and sensitivity.

RNA-seq provides even more
Candidate new and revised exons

Comparison of platforms for detecting gene expression
(Columns on the slide: AFFY Gene Chip, Illumina. Where a row has a single X, the column it belongs to is not recoverable from this text.)
• All protein coding genes are represented: X
• Can detect all the different types of RNA: X
• Cost: X X
• Can determine gene regulation: X X
• Requires pre-existing knowledge of gene sequence: X
As the price of sequencing goes down, there will be almost no advantage of microarrays over RNA-seq.

Mapping Reads from RNA molecules
• What is the advantage of mapping reads from RNA to the sequenced genome instead of to a database of all predicted RNA molecules?
  – We are not depending on the quality of annotation.
  – We are not assuming that we know about all of the RNA molecules in the cell.
• How can we find reads mapping to splice junctions?
  – Create a separate database of all possible splice junctions
  – Split reads in half and map them separately.

Bowtie & TopHat

Cufflinks
• Starts with the output of any alignment tool, such as TopHat
• Assembles the isoforms by first identifying the reads that cannot be assembled together
• Then calculates abundance

Assembling the reads to identify transcripts.

CuffCompare
• The program cuffcompare helps you:
  – Compare your assembled transcripts to a reference annotation
  – Track Cufflinks transcripts across multiple experiments (e.g. across a time course)
• Output contains codes
  – =  match
  – c  contained
  – j  new isoform
  – u  unknown, intergenic transcript
  – i  single exon in intron region

Identification of splice junctions depends largely on the depth of sequence coverage.

Cuffdiff
• Can be used to find significant changes in transcript expression, splicing, and promoter use.
  – Inputs are:
    • Annotation to compare (can be output from cufflinks)
    • Tophat output from different samples
  – Options are similar to cufflinks; can also specify a different FDR cutoff.

Which comparison is more convincing that genes are different?
(Slide labels: GENE A, GENE B; Treatment, Control. Which label heads which column is not recoverable from this text.)
COMPARISON A
  Rep1   20    Rep1   30
  Rep2   21    Rep2   31
  Rep3   19    Rep3   29
  Mean   20    Mean   30
COMPARISON B
  Rep1   10    Rep1   20
  Rep2   20    Rep2   30
  Rep3   30    Rep3   40
  Mean   20    Mean   30

t test
t = (difference in the means) / (standard error of the difference)
Var = (sum of squares of the difference) / (n − 1)
Can use this test statistic to evaluate the probability that the two means are the same, using critical values of T:
Degrees of freedom = nt + nc − 2
where you select the probability of making a type I error, e.g., 0.05

Volcano plot: visualizing significance and fold change

Assumptions of the t-test
• Samples are drawn from normal distributions
  – i.e. our estimates of geneA and geneB are random samples from a normal distribution
• The variance of the two populations is equal
• There is no mean-variance relationship

RNA-seq data
• Count data (discrete)
• Possible to get zero
• Cannot get a negative number
• Each sequence read is a random event drawn from a larger population.
• Variance increases with the mean

RNA-seq data: variance > mean
RNA-seq data are consistent with an over-dispersed Poisson: variance = a*mean

Should we treat a difference between 9 vs 12 reads the same as 900 vs 1200?
t test does not account for scale of the data
  t = -3.6742, p-value = 0.02131
  t = -3.6742, p-value = 0.02131
Test using a negative binomial model [glm.nb()]
  p-value = 0.258
  p-value = 1.03e-05

RNA-seq pipeline
Manpreet S. Katari

The basic workflow
1. Perform quality control: fastqc
2. Trim low-quality sequence: trimmomatic
3. Map the reads to the genome
   a. Build the database: bowtie2
   b. Run the alignment: tophat
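
The four quantile normalization steps listed in the Affymetrix section (sort each column, average across rows, substitute the averages, restore gene order) can be sketched in plain Python. This is a minimal illustration, not the Affymetrix implementation: the function name `quantile_normalize` and the list-of-rows layout are assumptions, and ties are broken by row position, whereas real tools usually handle ties more carefully.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows.

    Hypothetical helper for illustration; ties within a column are
    broken by row position (a simplification).
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Step 1: sort each column, disregarding gene order.
    sorted_cols = [sorted(col) for col in columns]

    # Step 2: average across samples at each rank (the "row averages").
    rank_means = [
        sum(col[r] for col in sorted_cols) / n_samples for r in range(n_genes)
    ]

    # Steps 3 and 4: substitute the rank average for each value and put it
    # back at the gene's original position.
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene_idx in enumerate(order):
            normalized[gene_idx][j] = rank_means[rank]
    return normalized


# The Gene A / Gene B / Gene C values from the slide.
expression = [
    [3, 100, 500],    # Gene A
    [17, 10, 150],    # Gene B
    [10, 1000, 3],    # Gene C
]
for name, row in zip(["Gene A", "Gene B", "Gene C"], quantile_normalize(expression)):
    print(name, [round(v, 1) for v in row])
# Gene A [5.3, 86.7, 505.7]
# Gene B [505.7, 5.3, 86.7]
# Gene C [86.7, 505.7, 5.3]
```

After normalization every sample holds the same set of values (5.3, 86.7, 505.7); only the assignment of those values to genes differs between samples.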
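
The RPKM score defined under "Normalizing the Data" is a single division once N and T are put in the right units. The sketch below follows the slide's definitions of R, N and T; the function name `rpkm`, its parameter names, and the example numbers are hypothetical and chosen only to make the arithmetic easy to check.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exons per million mapped reads.

    read_count         -- R, unique reads for the gene
    exon_length_bp     -- summed exon length in base pairs (N = this / 1000)
    total_mapped_reads -- reads in the library mapped to the genome
                          (T = this / 1,000,000)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    n = exon_length_bp / 1000            # gene size in kilobases
    t = total_mapped_reads / 1_000_000   # library size in millions of reads
    return read_count / (n * t)


# Made-up example: 500 reads on a gene with 2,000 bp of exons,
# in a library of 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))  # 500 / (2 * 20) = 12.5
```

Dividing by N corrects for longer genes collecting more reads, and dividing by T corrects for deeper libraries collecting more reads overall.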
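
The t statistic from the "t test" slide can be computed directly for Comparison A and Comparison B. This is a sketch of the equal-variance (pooled) two-sample form, which matches the stated degrees of freedom nt + nc − 2; the function name `pooled_t_statistic` is an assumption, and p-values are left out because they need a t distribution that the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group1, group2):
    """Return (t, degrees of freedom) for a pooled two-sample t test."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # statistics.variance is the sample variance: sum of squares / (n - 1).
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error, df


# Replicate values from the slide: same means (20 vs 30), different spread.
comparison_a = ([20, 21, 19], [30, 31, 29])
comparison_b = ([10, 20, 30], [20, 30, 40])

t_a, df_a = pooled_t_statistic(*comparison_a)
t_b, df_b = pooled_t_statistic(*comparison_b)
print(f"Comparison A: t = {t_a:.4f}, df = {df_a}")
print(f"Comparison B: t = {t_b:.4f}, df = {df_b}")

# Multiplying every value by 100 leaves t unchanged, which is the sense in
# which the t test does not account for the scale of count data.
scaled_a = tuple([100 * x for x in grp] for grp in comparison_a)
t_scaled, _ = pooled_t_statistic(*scaled_a)
print(f"Comparison A x 100: t = {t_scaled:.4f}")
```

Comparison A gives a t of about -12.25 and Comparison B about -1.22 on 4 degrees of freedom, so the tight replicates in A make the same 10-unit difference far more convincing.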