Using genome browsers
Visualization and data repositories

Motivation
Aside from R, genome browsers are arguably the most important tool in computational genomics
…but they are not widely used in the experimental community
The browser gives you an immediate edge: you can look at data, form hypotheses, and upload and download data!

In this course
1: How to use the web interface; understanding the data types
2: How to download and upload data to the browser; interaction with R
3: How to make complex analyses between data types; Galaxy and R

Today's teaching:
• Lectures with genome browser examples
• Short discussions with your neighbour
• Exercises

Kick starting with a challenge
• You are a major sequencing center
• You have sequenced the killer whale (Orca) genome: you have the whole genome as a stretch of ACGTs
• How do you make sense of this and show it to others? What value does the data have in itself?
• 2 minutes with your neighbour

Jim Kent, assembly guru. Some profound words about the genome sequence:
“Well, it has a lot of G, C, A and Ts”

Genomes are worthless
• …without any annotation
• What type of annotations do we want to put on genomes?
• 2 minutes with your neighbour

Examples:
• 'DNA' annotation:
– Known genes
– Predicted genes
– Repeats, transposons, CpG islands
– Conservation across species
• 'Dynamic' annotation:
– Known transcripts
– Expression data
– DNA modifications

How to present this data?
• Plain text files are useless for most biologists
• Use the genome sequence as a frame, on which we map real data or predictions

The idea of the browser
• Based on the genome, we can
– Zoom in and out, and scroll sideways
– See the data in different representations
– Select WHAT data we want to see (way too much data to look at all at once)
• Important side effect: if we map all interesting data to the genome, all the data is in one place, so we can download what we are interested in and analyze it!
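The idea of using the genome as a frame can be sketched in a few lines of Python. This is only an illustration, not how any real browser is implemented: the `Block` type, the `blocks_in_window` function, the 0-based half-open coordinates and all the feature names and positions are assumptions made up for this example.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """One annotation mapped onto the genome (hypothetical layout)."""
    chrom: str
    start: int   # assumed 0-based, inclusive
    end: int     # assumed exclusive
    name: str
    strand: str  # "+" or "-"


def blocks_in_window(track, chrom, win_start, win_end):
    """Return the blocks on `chrom` that overlap the window [win_start, win_end).

    Changing the window is the equivalent of zooming or scrolling.
    """
    return [b for b in track
            if b.chrom == chrom and b.start < win_end and b.end > win_start]


# A made-up track with three features (not real data).
mrna_track = [
    Block("chr1", 1000, 5000, "mRNA_A", "+"),
    Block("chr1", 4500, 9000, "mRNA_B", "-"),
    Block("chr2", 1000, 2000, "mRNA_C", "+"),
]

# "Zoom in" on chr1:4000-4800 and list what would be drawn.
visible = blocks_in_window(mrna_track, "chr1", 4000, 4800)
print([b.name for b in visible])
```

Because every data set is reduced to blocks on the same coordinate system, selecting what to display and selecting what to download become the same kind of query.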
The three browsers
• UCSC genome browser
– http://genome.ucsc.edu
– Updated often, simple but powerful interface. Very simple underlying data formats
• Ensembl
– http://www.ensembl.org
– More complex web interface, with multiple zoom levels. Very complex underlying data formats
• The generic genome browser
– http://www.gmod.org/GBrowse
– Actually more of a software development platform, so that you can build your own. Resembles UCSC more than Ensembl

In this course…
• We will only use the UCSC browser due to
– Simplicity
– Lecturer bias
– The Galaxy tool: a very nifty web tool for power-user analysis of UCSC data (more later)
• If you know this browser, other browsers are easy to understand

Basic concepts
• Zooming
• Data tracks

Data tracks - the problem
Example: The road from Melby to Ølby
[Figure: the road from Melby to Ølby, with a 5 km / 10 km scale]

Data tracks - the solution
[Figure: the road from Melby to Ølby with labels: houses, trees, Monday, Sunday; scale 5 km / 10 km]

This is how genome browsers show the data
[Figure labels: chromosome position, gene track, mRNA track, exons, introns]

Annotation tracks
• A track is often one source of data, from a particular place, that is mapped to the genome
• Data can be viewed as “blocks” with a start and an end, expressed as chromosome coordinates
• It is important to know what the data is before trying to interpret it
• We will first look at the “human mRNA” track

Human mRNA track
• What the guys at UCSC did:
– Take all the known mRNAs in GenBank, and map these to the human genome using a program called BLAT (similar to BLAST). Everything that hits will be shown in this track.
– What are the pros and cons of this approach? What are the limitations? 2 minutes with your neighbour!
Example answers:
Pros
• Simple, and no filtering, leaving the interpretation to me
Cons
• Not real annotation, again leaving the interpretation to me
• Heavily reliant on the quality of the data source
• Limited by the extent of the data

A short non-interactive tour
• We will use the browser extensively from now on
• But first, I will guide you through a few key concepts; otherwise confusion ensues when trying the real thing

What version of the genome do you have?
• Genome sequences are based on many short sequenced reads, which are then assembled into a single sequence
• This is very tricky, and we get slightly updated genomes at regular intervals
• A version of the genome is called an assembly
• So, whenever you say that you are using a genome sequence to do something, you have to say what assembly you are working on!

More about assemblies
• The official naming system is
– [species abbreviation][assembly number]
• For instance hg17 (human no. 17), or mm8 (Mus musculus 8)
• There is an alternative way: the date of the release. So, hg17 is also called “Human May 2004”

Even more about assemblies
Rules of thumb:
• The newer an assembly, the “better”
• Some older assemblies have more data mapped to them (because they have been around longer)
• Some genomes are new and unstable: updates come often, with big differences between updates. Some are more mature (like human)

Selecting species & assembly
[Screenshot labels: Species; Where on the genome; Assembly: the genome “version”]

Looking at the genome, with mRNAs
[Screenshot labels: Chromosome overview; Different mRNAs (same gene); Direction of arrows shows strand]

Zooming in (We'll learn how later)
Some points:
• Transcription, in this case, runs right to left, on the minus strand, as shown by the arrows
• Two of the mRNAs start here, the others start even further upstream.
Probably alternative promoters
• The fat, two-colored blocks are predicted to be protein-coding parts
Note that
• There are parts of mRNAs that are not translated, so-called UTRs
• There is one mRNA that is clearly non-coding (might have a stop codon further upstream)

Zooming even further down - we see the actual DNA
[Screenshot label: Codons]
Clicking on any of these mRNAs takes you to the corresponding GenBank entry

Different data representations
Each data track has a selection 'box'
Use this to:
- turn tracks on or off
- change visualization
Examples: Full, Squished, Dense

Time to try it out..
• Important: the genome browser shows many tracks by default, some of which are named in a confusing way
• Don’t let this throw you. We will walk through them!
• Go to http://genome.ucsc.edu/
• Click 'Genome browser' to the left
We'll use the default position for now, so just click the 'Submit' button (which is on the right)

Overwhelmed?
Many types of data! We will only use some; others you can explore yourselves
Below the image, the data tracks are categorized for easier access
Let’s look only at the Human mRNA track as before
Challenge: Turn off all tracks, except “base position” and “human mRNA”!
(Expand/collapse the categories, then hide tracks. Use 'refresh' to update the image.)

Challenge
Using the following buttons, and what we already went through, find out:
• What is the DNA sequence of the first two codons of mRNA DQ892408?
• What is the “gene name” of the mRNAs we are looking at?
• Do the two longest RNAs start at exactly the same place?
• What are the neighboring genes?

Before we go any further…
What are all these data? What can we use them for?
Fast info on a given track:
• Click on the actual track name (over the box)
• What does the “RefSeq genes” track hold?
• How does it differ from “other RefSeq” or “Genscan genes”?
• When would you use each track?
• It is not realistic to go through all tracks in this course
• …and not meaningful, because new tracks are added over time
• We will go over the main types of tracks, and the relevant experimental methods for producing the tracks
• Understanding what we are looking at is necessary for meaningful interpretation

Big groups of things, summarized
• Sequence features
– CpG islands
– Repeats
• Transcripts or parts of transcripts
– mRNA, ESTs
• The so-called genes (predicted or experimental)
• Tiling array expression data
• ChIP-chip
• Variation within species (SNPs)
• Conservation and alignments between species
– net alignments, PhastCons scores
• The ENCODE dataset

Between transcription and translation – the modern RNA world
• After transcription, RNAs are immature (precursor mRNAs). Processing gives mature mRNAs, which can be exported to the cytoplasm and translated. As usual, we know only a small part of the mechanisms...
• 5' cap structure is added
• 3' polyA stretch is added
• Splicing (not always!)
• RNA editing (rare?)

Splicing

Problem: We want to know what mRNAs look like... but RNA is unstable and can't be sequenced directly
Solution: Turn them into cDNA first, and clone each cDNA into a plasmid, so we have a library of plasmids each carrying one cDNA
This is a “cDNA library” that can later be sequenced or used for other things

General problems with cDNA sequencing:
• Reverse transcriptase falls off
• Hard to sequence long transcripts
• Many cDNAs are identical
– Very expensive if you want to sequence all unique molecules

Solving the problem
Only sequence parts of cDNAs - these are called ESTs (more in a few slides)
Semi-recent development: sequencing of full-length cDNAs, using
– Cap-trapping
– PolyA primers
– Subtraction

Subtraction: how to only get RNAs you have not seen yet
• Simple concept:
• For a cDNA sample, we add an excess of abundant RNAs.
These will hybridize
• Then, we remove everything which hybridized
• …and sequence the rest

Discuss with your neighbour (2 min)
Say that we have two cDNA libraries: one is subtracted, one is not
What are they good for?
• Expression (how many transcripts of a certain gene)?
• Annotation and gene discovery?

Visualizing and annotating cDNAs in the genome browser
• The genome is actually needed to make sense of cDNAs, especially if they are not protein-coding
• A general approach is to map your cDNA to the genome using an alignment algorithm
• Here, we will use BLAT and the UCSC browser
• Should be straightforward, but... let's try it out: See the course page for 3 mouse sequences in the blat_seqs file. I will do one in real time
• Assume these are new sequences, and you must decide whether they are good enough to be included in the genome browser

Bottom line
• Mapping cDNA <-> genome is sometimes trivial, but can become very tricky. Bear this in mind when you look at genome mappings: this is the process by which they were annotated!
• cDNAs are often good quality, but always be sceptical unless there are multiple lines of evidence
• Biological knowledge helps here: sanity checks become easier

More on the problem of sequencing cDNAs
• Hard to sequence full-length cDNAs
• …and expensive to sequence many

If we cannot sequence the whole cDNAs…
Only sequence parts of cDNAs - these are called expressed sequence tags: ESTs

Expressed sequence tags (EST)
Cheaper, and easier to scale up
Problems:
• Many ESTs are simply trash, the result of over-enthusiastic sequencing
• For longer genes, no coverage of the middle part

Complementary information to cDNAs
• Can be used for expression studies (more later)
• Many MORE of them than full-length cDNAs: higher coverage
• If you only have ONE cDNA for a given isoform, ESTs can help to “validate” it

So-called “gene” tracks
• We have now seen that a “gene” often has many mRNAs, forming a “transcription unit”
• If you have many mRNAs, it is good to have summary tracks of genes or transcription units
• The UCSC browser has (at least) two of these:
– The RefSeq track
– The “Known genes” track

RefSeq
• RefSeq is actually a database of high-quality cDNAs, from NCBI. So, a RefSeq sequence always has at least one identical cDNA in GenBank.
• Good, because some individual cDNAs are trash, and we get a more manageable dataset
• Bad, because the criteria used are somewhat arbitrary. For example, “long cDNAs are better than short”

Known Genes
A track made by the UCSC people, which uses multiple databases (RefSeq, UniProt, etc.)
Horrible name, easy to misunderstand: it is NOT all known genes!
If you click on individual genes, you get very nice summaries, sometimes with expression information

Searching by gene name
• If you put a gene name or an accession number in the coordinate box, the browser will search the mRNA, RefSeq and Known Genes tracks (and some more) for this name, and give you a list if you get more than one hit
• This is usually easy. Here is an example: the Dicer1 gene (an important RNase)

CpG islands
• A CpG dinucleotide is simply a C followed by a G
• CpGs are uncommon (1%) in vertebrate genomes, because the C in the CG is easily methylated and then deaminated into a T
• However, there are CpG-rich stretches, called CpG islands
• These are correlated with promoters: around 50% of promoters have a CpG island. Function is unclear!
• In the UCSC browser, this is simply called the CpG island track

Repeats
Large portions of genomes are “repeats”, classified into two main types:
1) Tandem repeats
Two or more nucleotides are repeated, directly after each other: ATTCGATTCGATTCG
(the number of repeats is used in forensics and parentage tests)
2) Interspersed repeats
Results of RNA-mediated transposition (not in this course)

Repeats, cont.
• Generally, repeats are considered “uninformative”, and present problems when aligning things to the genome
• However, there are clear cases of functional repeats
• In the UCSC browser, all repeats can be turned on in the repeat track

Let's look at these things
• 5 minutes with your neighbour:
• Look at the RPS9 gene, and turn on RefSeqs, Known Genes, human mRNAs, ESTs, CpG islands and repeats
• How well do RefSeqs, ESTs and Known Genes correlate?
• Are there any CpG islands or repeats, and where are they located? What type of repeats are there?
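The definitions of a CpG dinucleotide and a tandem repeat given above are simple enough to check by hand on a short sequence. The Python sketch below does exactly that. It is only an illustration of the two definitions, not the method behind the UCSC CpG island or repeat tracks; the function names `cpg_fraction` and `tandem_copies` are made up for this example.

```python
def cpg_fraction(seq: str) -> float:
    """Fraction of dinucleotide positions that are a C followed by a G."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    n_positions = len(seq) - 1
    n_cpg = sum(1 for i in range(n_positions) if seq[i:i + 2] == "CG")
    return n_cpg / n_positions


def tandem_copies(seq: str, unit: str) -> int:
    """Count back-to-back copies of `unit`, starting at the first base of `seq`."""
    seq, unit = seq.upper(), unit.upper()
    copies = 0
    # The `unit and` guard avoids an endless loop on an empty unit.
    while unit and seq.startswith(unit, copies * len(unit)):
        copies += 1
    return copies


# The tandem repeat example from the slide: ATTCG repeated directly after itself.
print(tandem_copies("ATTCGATTCGATTCG", "ATTCG"))

# A toy sequence with two CpGs among its five dinucleotide positions.
print(cpg_fraction("ACGTCG"))
```

A stretch where this fraction is much higher than the genome-wide level is the intuition behind a CpG island; real island callers use additional criteria that are not covered here.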
