Tag-based expression/function analysis

Data files are on the course web page (linked under today's date), and also at:
http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/

Where are we now?
• R to do statistics
• Genome browsers and Galaxy to visualize genes and genomics data
• Analyzing expression by microarrays + R and Bioconductor
• Tag analysis
• Proteomics

What we want in transcriptomics
• Know which transcripts are transcribed, and how much they are transcribed
  – Implicitly also which transcripts exist in the cell, and what they look like!
• Intuitively, we could get all this information by sequencing all mRNAs in one cell

General problems with cDNA sequencing:
Reverse transcriptase falls off
Hard to sequence long transcripts
Many cDNAs are identical, but some occur only once per cell (or less!)
Need to sequence MANY cDNAs
Very expensive if you want to sequence all molecules

Solutions:
1) Do not sequence: use probes and hybridization: microarrays and tiling arrays (this is where we are now!)
2) Only sequence parts of transcripts: tag sequencing (this is where we are heading)

Thought exercise
• What are the pros/cons of hybridization (micro/tiling arrays) vs sequencing?
2 minutes with your neighbor

Albin's take

Hybridization
• + Cheap (per "gene")
• + Mature methods
• + Standardized
• - Complex normalization needed
• - Cross-hybridization
• - Highly dependent on annotation of probes
• - Dependent on designed probes for genes
• - Cannot deal with repeats
• +/- Integrative signal (more on next slide)

Sequencing
• - Expensive (now, but changing)
• - "Unbiased": no designed probes
• - Non-standard computational methods
• - More demanding processing (now)
• - Much easier statistics in the end
• + Less noisy
• + Much higher resolution: up to nucleotide level
• + Location information
• +/- Sampled signal (more on next slides)

Hybridization: integrative
We have many identical probes.
Each time a probe gets a hybridization event, we add a little to the signal.
This includes non-optimal hybridization events: anything labeled that hybridizes will give some signal.

Sequencing: sampling
The number of cDNAs in a library is VERY LARGE
We pick only some of them for sequencing, at random
Blind sampling (it does not know anything about the RNAs)
We map sequences back to the genome (a kind of quality check)

Why is this interesting?
• Sequencing approaches generally give better quality than hybridization, and you can also do more diverse experiments
• New sequencers make it possible to do this almost as cheaply as with hybridization: normal research groups can now buy the capacity of an old sequencing centre
• It is basically the technology of the future

5 types of sequencing data for expression and functional studies
• Non-subtracted cDNA
• ESTs
• SAGE
• CAGE
• RNA-seq

Why so many techniques?
• Historical reasons: technology development over time
• Some of these technologies are only for expression; others also give other (and different) information
• Differences in cost and efficiency

Non-subtracted cDNA
• Theoretically possible to sequence all cDNAs in a cell
• Very, very expensive!
• Hard to get true expression, since amplification is length-dependent
• Is the whole cDNA really necessary for measuring expression?

Expressed sequence tags (ESTs)
Sequence from the 5' and 3' ends, until the reverse transcriptase falls off
Cheaper than full-length cDNAs
Problems:
many ESTs are simply trash, the result of overenthusiastic sequencing
For longer genes, no coverage of the middle part

How can we use ESTs?
• View each EST as a random sample from a pool of transcripts:
  – The number of ESTs found from a transcript should be proportional to the concentration of that transcript in the cell = the expression
• How do we know which transcript an EST comes from?

Unigene: clustering ESTs to "genes"
Back in the 90s, the idea was to use a lot of ESTs to find, and puzzle together, genes
The UniGene database is one of the outcomes of this.
Slightly obsolete, but useful at times
Basically, it tries to cluster ESTs and cDNAs into functional units: "genes"
Bonus: we can use this to look at the expression of these genes, because we can count ESTs from different libraries

Thought exercise: How?
• Say that we have two lung EST libraries (= two collections of tags) from two patients, one of whom has lung cancer
• How can we show that a given gene, like RARA, is significantly altered in expression in lung cancer?
• Think R! What do we need, and what tests should we use?
• 2 minutes with your neighbor

"Electronic Northern blot"
• In a nutshell: fill in the following contingency table for a given gene

                  ESTs from tissue A    ESTs from tissue B
  RARA
  Rest of ESTs

Fisher exact test situation!
We can do this within UniGene for single genes

Side-story for non-life-scientists: Northern what?
• The Northern blot is a classical method for detecting RNA molecules
• Related to the Southern and Western blots (DNA and protein detection methods)

However…
• "Electronic Northern" is just a clever name, although it has the same goal: finding RNAs
• It is nothing more than a statistical overrepresentation test of mRNAs, using ESTs

Unigene:
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene
• …or just google for unigene

Let's look at the tissue constraints of human RARA…
EST hits from different tissues
Public microarray data (nice for comparison, but not important now)
Note that the sample sizes are very different!
1 tag out of 282332 is not the same as 1 tag out of 131488

What is TPM?
TPM = Tags per million
A normalization that makes it possible to compare libraries of different sizes.
Used very often for tag-based expression.
"How many tags would my gene have if the sample size were 1 million?"
…so, TPM = 10^6 * (#tags in my gene) / (#total tags)

Challenge
• Is the RARA gene significantly different in expression in eye vs blood?
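
The TPM normalization described above can be sketched in R. This is only a minimal illustration, not part of the original exercise: the function name `tpm` is made up here, and the two library sizes are the ones quoted on the slide above (282332 and 131488 tags).

```r
# Tags per million: rescale a tag count to a library of one million tags.
# `tpm` is a hypothetical helper name, not a function from any package.
tpm <- function(tags_in_gene, total_tags) {
  1e6 * tags_in_gene / total_tags
}

# A single tag in each of the two libraries mentioned above
lib_sizes <- c(large = 282332, small = 131488)
one_tag_tpm <- tpm(1, lib_sizes)
print(round(one_tag_tpm, 2))
```

A single tag in the smaller library corresponds to roughly twice the TPM of a single tag in the larger one, which is why raw counts from libraries of different sizes should not be compared directly.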

                  ESTs from blood    ESTs from eye
  Gene X          12                 12
  Rest of ESTs    124139 - 12        210756 - 12

> # rows: Gene X, rest of ESTs; columns: blood, eye
> a <- matrix(c(12, 12, 124139 - 12, 210756 - 12), nrow = 2, byrow = TRUE)
> fisher.test(a)
Fisher's Exact Test for Count Data
data: a
p-value = 0.2078
# so, despite almost twice the TPM value in blood (about 97 vs 57), not significant

So ESTs are fantastic?
…not really! Sometimes useful, but:
There are too few of them, and the libraries are very diverse
…and they are way too expensive to make routinely in a normal lab
Basically, ESTs are rarely used now, but the data is worth considering

Modern tag sequencing
• SAGE, CAGE and RNA-seq
Underlying idea:
• Only sequence as much as you need: 5', 3' or whole cDNA (in pieces)
• Map tags to known cDNAs or the genome (Thought exercise: what is the difference?)

SAGE
• After sequencing:
  – Mask out adapters and primers
  – Make a database of all possible hits in mRNAs following the restriction site (white board demo)
  – Map tags to this database, or the genome
• Mapping is surprisingly tricky
  – We cannot use BLAST or BLAT alignments (sequences are too short)
  – Sequencing errors exist, as well as RNA editing
  – Some species have very few known mRNAs

Common approach
First identify all unique tags, and how many times we have seen them
  AAAGATGCTGC   67
  CAGTCGATCGAT  192
  …
Correlate these tags with our gene database.
Sum up all the tags for each gene
Make expression analysis!

How can we analyze count data?
• The difference from microarrays is that we deal with integers
• The more counts for a gene, the more expressed it is: theoretically a linear relation.
  We are theoretically counting actual RNA molecules
• Very much like the EST case, we can make statistics based on contingency tables if we have two samples

Data flow for tags
…is a bit too complex for this course to do in real life: it takes time and requires programming (and a big computer)
Mapping of tags to genes is complex, and no standard solutions are adopted (yet)
Statistical analysis often involves making multiple Fisher exact tests, which involves some R programming
To get a feeling for the data, we will instead use a website to do these things for us

Typical data after mapping:

  Tag          Frequency
  AAAAAAAAAA   173
  AAAAAAAAAG   1
  AAAAAAAAAT   1
  AAAAAAAATA   2
  AAAAAAACAA   1
  AAAAAAACTA   2
  AAAAAAATAA   1

We want to go from here to actual counts per gene: we will let a web system do this for us
• In the data directory, I have collected two such files: SAGE_Colon…, corresponding to normal and cancer colon
• These are linked in the web page, also here: http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
• Then, go to http://cgap.nci.nih.gov/SAGE/
• This page has many SAGE-related analyses. We will try the Digital Gene Expression Displayer (DGED)

Challenge
• Using DGED
• Use the "Two of your files" option to use the two colon samples. Select "short tags"
• Try to understand what the statistical test does (accept defaults)
• What types of genes are "over-expressed" in colon i) cancer tissue vs normal tissue, ii) normal tissue vs cancer tissue?

Thought exercise
• What are the limitations of SAGE?

Albin's take
• We can only measure expression: the location of tags in genes has no functional meaning
• Dependent on gene annotation: we can map to the genome, but such data is hard to interpret (what genes?)
• Compared to array data: very few standard analysis methods
• Limited sequencing depth

5' tagging
• Three methods that really do the same thing.
  The difference lies in chemistry, throughput and length of tags
  – CAGE
  – 5' SAGE
  – 5' Oligo-capping
• We will use CAGE as an example ("Cap Analysis of Gene Expression")

CAGE
Sequencing and mapping to the genome

CAGE vs …
• SAGE
  – Conceptually the same thing, but you catch the 5' end of the gene: the transcription start site and thereby the promoter, which is a functional entity
  – Higher number of tags
  – 5' ends give functional data apart from expression

Issues
• Only capped transcripts
  – Some real transcripts are not capped
  – Some capped transcripts are not full-length
• Associating 5' ends with gene products is sometimes problematic
  – We only know the starts of genes, not the length
• Tag length is borderline for mapping: 20-21 bp
• Not clear how to define cutoffs: how many tags make a "real biological promoter"?
• Under-sampling: we miss a lot of promoters because there are so many of them

Strengths
We are actually looking at promoters, not genes
Find novel promoters, sometimes within known genes
We can look at expression at promoter level, for instance to define "tissue-specific" promoters
We can get a first unbiased look at where promoters are, and how much they are used in a given cell

CAGE concepts
• The atomic unit in CAGE is the tag, mapped to the genome. The tag comes from a given experiment (and has a label)
• What positional information is the most relevant for analysis?
[Figure: the tag, 20-21 bp]

Only 5' ends are interesting!
• …since the 20 bp length is only for mapping purposes.
• What if we have many tags overlapping one another? How can we represent this?

Some soon-to-be-outdated terminology

So…
• Unlike SAGE, CAGE can be viewed as a "barplot" on the genome, at nucleotide level
• How to cluster nearby CAGE tags into a meaningful "promoter" is an open problem

Within a promoter…
• …we can do exactly the same Fisher exact tests as before (as SAGE or ESTs do for whole genes)
• What is the advantage/disadvantage of doing this on promoters instead of genes?
  (2 min)

The big answer: alternative promoters with different tissue usage

CAGE resources
• Genomic element viewer (very similar to the UCSC browser)
  – CAGE tags and cDNA landscapes
  – Easiest reached via the links on fantom.gsc.riken.jp/3
Clicking on CAGE clusters gives two options:
CAGE analysis viewer
CAGE basic viewer

CAGE resources
• Basic CAGE viewer
  – Comprehensive browser of CAGE tags and CAGE tag clusters, and library information

Challenge
• Look at the RARA gene in the MM5 assembly in the genomic elements viewer (browser) (so, NOT UCSC).
• How many alternative promoters does it have?
• Are any of these biased towards certain tissues?

Some points
• Not that easy to say which of these promoters are "significant"
• Easy to get overwhelmed by numbers when counting tags

Back to work…
• We can treat CAGE tag counts, or really TPMs, in a promoter as expression
• We can do the same analyses as with microarrays, including the typical heatmap
• We will do a small exploratory study of some CAGE data
• http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/

Walk-thru of CAGE exercise
• Also at http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
• …together with updated slides
• And linked from the web page
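
The "tag counts in a promoter as expression" idea above can be sketched in R as follows. This is only a rough illustration under stated assumptions: the promoter names, tag counts and library sizes are invented, `fisher_per_feature` is a hypothetical helper name, no real CAGE data is used, and no correction for multiple testing is applied.

```r
# Hypothetical tag counts per promoter in two CAGE libraries (made-up numbers)
counts <- matrix(
  c(120, 30,
     15, 18,
      4, 60),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("promoter_1", "promoter_2", "promoter_3"),
                  c("tissue_A", "tissue_B"))
)

# Hypothetical total number of mapped tags in each library
lib_sizes <- c(tissue_A = 200000, tissue_B = 150000)

# TPM: divide each column by its library size, then scale to one million
tpm <- sweep(counts, 2, lib_sizes, "/") * 1e6

# One Fisher exact test per promoter: tags in the promoter vs. all other tags
fisher_per_feature <- function(counts, lib_sizes) {
  apply(counts, 1, function(x) {
    tab <- rbind(feature = x, rest = lib_sizes - x)
    fisher.test(tab)$p.value
  })
}
p_values <- fisher_per_feature(counts, lib_sizes)

print(round(tpm, 1))
print(p_values)
```

With real data, the promoters themselves would first have to be defined by clustering nearby tags, which the slides above describe as an open problem. A matrix like `tpm` is the kind of input that the microarray-style analyses, such as a heatmap, would start from.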