Download Alternative Splicing Analysis Tools Through the UCSC Genome

Tutorial on UCSC & GALAXY. March 2009. Schraga Schwartz & Eddo Kim
Browsing the genome with an eye on splicing: An introduction to the UCSC
genome browser, GALAXY and alternative splicing / 03.03.09
Hands-On session
Exercise 1: basic use of UCSC and of servers for finding/analyzing
splicing signals
Outline: analyze the SLC35B3 gene in terms of alternative splicing; focus on exon 5,
which is a cassette exon; examine the strength of the splicing signals flanking this
exon.
Examining the SLC35B3 gene in the UCSC genome browser:
1. Open the UCSC genome browser (http://genome.ucsc.edu/).
2. Choose the “Genomes” tab.
3. Select the human genome and the Mar. 2006 assembly.
4. Enter “SLC35B3” into the text box (this box is very versatile, and accepts
text terms as well as coordinates).
5. Click on the link to the RefSeq gene, accession NM_015948. What are the two
other RefSeq genes? (two additional alternatively spliced forms)
Overview of SLC35B3 – integrating information from different tracks
1. Scroll down to the bottom of the page, and load the following tracks using the “full”
option, changing the status of all other tracks to "hide":
• Under “Genes and Gene Prediction Tracks” select UCSC Genes, Alt Events,
RefSeq Genes.
• Under “mRNA and EST Tracks” select Human mRNAs, Spliced ESTs.
• Under “Comparative Genomics” select the 17-Way Cons track.
• Under “Variation and Repeats” select “RepeatMasker”.
2. Classify the 11 exons of the SLC35B3 RefSeq into alternative and constitutive
exons, based on evidence from the mRNA and spliced EST track. What types of
alternative splicing can you observe among the different exons? Can you find
exons that are not supported in the RefSeq annotation? (Hint: you should)
3. Compare your classifications to the annotations in the Alt Events track. Who did a
better job?
Extracting Sequence of Specific Exon: extracting the sequence of the alternatively
spliced exon 5 of SLC35B3
1. Click on the SLC35B3 RefSeq gene (the highlighted one)
2. Note the 100.0% identity indicated. What does it signify? (Hint: it means that
when you align the RefSeq sequence against the DNA, there is a perfect
match).
3. Click on the link to the mRNA/Genomic alignment
4. This screen contains a representation of the RefSeq cDNA, with mismatches in
comparison to the genome, and a representation of the genome, with
mismatches in comparison to the RefSeq. On the left side of the screen there
are links to 11 blocks of consecutive alignments; these blocks correspond to
exons. Make sure you understand the color coding (splice junctions, UTR,
etc.).
5. Click on “Block 5” (representing exon 5). Exon 5 now appears on the upper
part of the screen, in the right pane, in blue capital letters.
6. Select and copy the entire sequence of exon 5, along with 14 nucleotides in
the upstream intron and 6 nucleotides in the downstream intron (you’ll also be
copying the numbers, but don’t worry).
7. Getting rid of the numbers: Use sequence massager
(http://www.attotron.com/cybertory/analysis/seqMassager.htm) to get rid of the
numbers, white spaces and line breaks in the sequence of exon 5. You should
now have the full sequence of exon 5 (in upper case), along with 14
nucleotides of the upstream intronic sequence and 6 nucleotides of the
downstream intronic sequence (in lower case). Save this sequence in a
temporary text file, for your convenience.
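The clean-up in step 7 can also be done locally. The following is a minimal sketch, not the Sequence Massager's own code: the function name `massage_sequence` and the example string are made up for illustration, and the only assumption is that the copied text contains nucleotide letters mixed with position numbers, spaces and line breaks.

```python
import re


def massage_sequence(raw: str) -> str:
    """Remove digits, whitespace and any other non-nucleotide characters.

    Case is preserved, so intronic (lower case) and exonic (upper case)
    parts stay distinguishable.
    """
    return re.sub(r"[^ACGTNacgtn]", "", raw)


# Hypothetical copied text (NOT the real SLC35B3 exon 5 sequence).
copied = """
ttttctcttt ctag GATTACA 00001234
GATTACAGG gtaagt 00001260
"""

print(massage_sequence(copied))
```

Keeping the case intact matters for the next part of the exercise, where the intronic and exonic portions are treated differently.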
Analysis of Splicing Signals
1. Determining the strength of 3’ and 5’ splice sites of the SLC35B3 exon: Open
the webserver in http://ast.bioinfo.tau.ac.il/SpliceSiteFrame.htm, and
copy-paste the 15-nt 3’ss (the first 15 nt of your sequence) and 9-nt 5’ss (the
last 9 nt of your sequence) into the server.
2. Submit and examine the results. Note that you receive two sets of results for
the 5’ss, but only one for the 3’ss. What do they represent?
3. Finding regulatory sequences: Open the webserver in
http://rulai.cshl.edu/cgi-bin/tools/ESE3/esefinder.cgi?process=home, which
contains data on experimentally identified splicing regulatory sequences. Paste
the exon sequence into the server. Note the different thresholds. What do they
mean? Using the default thresholds, submit and examine the results.
4. (Optional step, if you’ve got time: Use the server in
http://ast.bioinfo.tau.ac.il/ESR.htm to find other types of potential regulatory
sequences. Compare the results with those found in step 3.)
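The splice-site windows used in step 1 can be cut out of the saved sequence programmatically. This is a small sketch under the layout described earlier (14 intronic nt, then the exon, then 6 intronic nt); the function name `split_splice_sites` and the example sequence are hypothetical and do not come from either web server.

```python
def split_splice_sites(seq: str, upstream: int = 14, downstream: int = 6):
    """Split an intron-exon-intron string into (3'ss, exon, 5'ss).

    3'ss = the upstream intronic bases plus the first exonic base (15 nt
           with the default of 14 intronic bases).
    5'ss = the last 3 exonic bases plus the downstream intronic bases
           (9 nt with the default of 6 intronic bases).
    """
    if len(seq) < upstream + downstream + 3:
        raise ValueError("sequence too short for the requested flanks")
    three_ss = seq[:upstream + 1]
    exon = seq[upstream:len(seq) - downstream]
    five_ss = seq[-(downstream + 3):]
    return three_ss, exon, five_ss


# Made-up example: 14 intronic nt + 9 exonic nt + 6 intronic nt.
example = "ttttttttttttag" + "GATTACAGG" + "gtaagt"
three_ss, exon, five_ss = split_splice_sites(example)
print(three_ss, exon, five_ss)
```

The exon returned here (upper case only) is also the part to paste into the regulatory-sequence servers and, later, into Blat.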
Examining an alternative exon in the UCSC genome browser:
1. Return to the UCSC genome browser main page. Click on Blat. Paste the
exonic sequence (without the intronic parts) into the box. Submit.
2. You obtain several possible alignments of your sequence, when aligned
against the genome. What is the difference between their score, identity and
span? Which measure or combination of measures serves as an indicator for
the “best” alignment?
3. Click on the “browser” link of the best alignment (i.e. the first one). The genome
browser is now focused on exon 5 of the SLC35B3 gene.
4. Scroll down to the conservation track. How conserved is this exon among the
17 species in the track?
5. Scroll back up, and click on the “zoom out x10” button. Examine the intronic
region flanking the exon. Is it conserved?
6. Examine the “Repeating Elements by RepeatMasker” track. Does the exon
overlap with any repetitive elements? Do its flanking introns overlap with such
elements?
Exercise 2: basic use of Galaxy
Goal
Finding skipped exons originating from Alu transposable elements on human
chromosome 1.
Outline of steps
• Extract coordinates of alternatively skipped exons.
• Extract coordinates of transposable elements.
• Intersect between the two.
Steps
1. Open the Galaxy web server (http://main.g2.bx.psu.edu/).
2. On the left, click on "Get Data" → "UCSC Main table browser".
3. On the right, select genome: "Human", assembly: "Mar. 2006",
group: "Genes and Gene Prediction Tracks", track:"Alt Events".
Click on "Describe table schema" to view the structure of the Alt Events track.
In the upper table, click on the "values" button of the "name" field to view the
possible types of annotations for alternatively spliced exons. Which one do you
need? (Hint: you are interested in cassetteExon). In the browser window (of
Explorer), hit "back" twice to return to the "Get Data" screen in galaxy.
4. Make sure that the "genome" is checked in "region". Create a "filter" by clicking
the "create" button. In the "chrom" field enter "does match" and "chr1" (to
obtain exons only from chromosome 1). In the "name" field enter "does match"
and "cassetteExon". Hit "submit".
5. Make sure that the "BED" format is selected as "output format" and that "Send
output to Galaxy" is checked (the BED output here consists of six columns:
chromosome, start, end, name, score, and strand).
6. Hit "get output", and on the following screen hit "Send query to Galaxy",
leaving the other parameters as they are.
7. You will be returned to the main page of Galaxy. On the right you can see your
request being processed. Hit "refresh" every now and then to check if it has
finished.
8. Next, we will repeat steps 2-7 to obtain a dataset of all Alu repeating elements
on Chr1. A mapping of the Alu elements is found in the "Variation and
Repeats" group, in the RepeatMasker track. Using the "Describe table
schema" identify how Alus are represented in this track and use it as a filtering
option. (Hint: Alu is the name of a family of repeating elements). In addition,
ensure that you request only repeating elements on chr1. (Hint: in this track
the chromosome field is called "genoName").
9. Quality Control: click on the eye icon in each of the two processes on the right,
to review the results (make sure both requests have finished). Select one result
at random from each file, and enter the coordinates into the UCSC genome
browser to ensure the results match your query.
10. Intersection between the two datasets: Click "Operate on Genomic
Intervals" → "Intersect the intervals of two queries". Intersect the exons data
with the repeats data (in this order!), using the "Overlapping Intervals" option.
Select a minimum overlap threshold ("for at least" option) of 1 (this way even a
1-nt overlap between an Alu element and an alternative exon is sufficient). Hit
"Execute".
11. Quality Control: click on the eye icon of the intersection process to review the
results (make sure it has finished). Select one result at random and enter the
coordinates into the UCSC genome browser to ensure the results are, indeed,
exons that overlap with Alus.
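The filtering and intersection steps of this exercise can be mirrored in a few lines of code, which may help when checking results by hand. This is a sketch only, not Galaxy's implementation: the `Interval` type, the function names and all coordinates are made up, and intervals are assumed to follow the BED convention (0-based start, exclusive end).

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive
    name: str


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of shared bases between two intervals (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def overlapping_intervals(first, second, min_overlap=1):
    """Keep whole intervals of `first` that overlap any interval of
    `second` by at least `min_overlap` bases."""
    return [a for a in first
            if any(overlap_length(a, b) >= min_overlap for b in second)]


# Toy data with invented coordinates.
events = [
    Interval("chr1", 100, 200, "cassetteExon"),
    Interval("chr1", 500, 600, "cassetteExon"),
    Interval("chr2", 100, 200, "cassetteExon"),
    Interval("chr1", 900, 950, "retainedIntron"),
]
alus = [
    Interval("chr1", 199, 400, "AluSx"),  # shares 1 base with the first exon
    Interval("chr1", 600, 700, "AluY"),   # only touches the second exon
]

# Same filter as in step 4: chromosome 1 and cassette exons only.
cassette_chr1 = [e for e in events
                 if e.chrom == "chr1" and e.name == "cassetteExon"]

alu_exons = overlapping_intervals(cassette_chr1, alus, min_overlap=1)
print(alu_exons)
```

The order of the arguments matters in the same way as in step 10: the output consists of exons (the first dataset), not of Alu elements.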
Exercise 3: more advanced use of Galaxy
Goal
Comparing the conservation level of intronic positions flanking alternatively spliced
exons to the conservation level in intronic positions flanking constitutive exons, on
chromosome 3.
Outline of steps
• Extract the coordinates of all RefSeq exons on chromosome 3.
• Extract the coordinates of alternatively spliced exons on chromosome 3.
• Remove from the first dataset (all RefSeq exons) every coordinate that
intersects with the second (alternatively spliced exons); what remains are the
constitutive exons.
• Extract the coordinates of alternatively skipped exons (cassette exons) of
RefSeq genes on chromosome 3.
• Calculate the coordinates of the 100bp in the downstream flanking introns of
constitutive and alternatively skipped exons, and compare their conservation
levels.
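The subtraction and flank calculation in this outline reduce to simple interval arithmetic. The sketch below assumes plus-strand exons given as (chrom, start, end) tuples in BED coordinates; the function names and the coordinates are invented for illustration and are not part of Galaxy.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def subtract_overlapping(all_exons, alt_exons):
    """Keep only exons with no overlap to any alternative exon
    (a stand-in for the 'Intervals with no overlap' option)."""
    return [e for e in all_exons
            if not any(overlaps(e, alt) for alt in alt_exons)]


def downstream_flank(exon, length=100):
    """Interval covering `length` bases of intron right after a
    plus-strand exon: it starts at the exon end and stops at end + length."""
    chrom, _start, end = exon
    return (chrom, end, end + length)


# Toy coordinates, plus strand only.
all_exons = [("chr3", 1000, 1100), ("chr3", 2000, 2150), ("chr3", 3000, 3080)]
alt_exons = [("chr3", 2000, 2150)]

constitutive = subtract_overlapping(all_exons, alt_exons)
flanks = [downstream_flank(e) for e in constitutive]
print(constitutive)
print(flanks)
```

On the minus strand the downstream intron lies before the exon start, which is why the exercise restricts itself to plus-strand genes.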
Steps
1. On the left, click on "Get Data" → "UCSC Main table browser".
2. On the right, select genome: "Human", assembly: "Mar. 2006",
group: "Genes and Gene Prediction Tracks", track:"RefSeq genes".
3. Use the filter to extract data for the plus (+) strand of chromosome 3 only (see
step 4 of ex. 2, and also use the "strand" field).
4. Click on "get output" and on the next screen make sure you mark "Exons plus
0 bases…" before sending the query to Galaxy.
5. Follow Exercise 2 from step 3 onward, with the necessary adjustments, to
extract a dataset of all alternatively spliced exons on the plus strand of
chromosome 3.
6. In order to create a dataset of constitutive exons you will now remove
coordinates from the dataset of all exons that overlap with coordinates of
alternative exons. To do that, use "Operate on Genomic Intervals" →
"Subtract". Select alternative exons as the first dataset (the one to subtract), and
all RefSeq exons as the second one (the one to subtract from). Make sure
"Intervals with no overlap" is selected, and hit Execute. You have now created a
dataset of constitutive exons.
7. You will now create a dataset of cassette exons using the alternative exons
dataset you created in step 5. Use "Filter and Sort" → "Filter". Select the
alternative dataset as input and use "c4 == 'cassetteExon'". Execute.
8. To extract the coordinates of the 100bp of intron downstream of each of the
exons in both datasets we will use "Text Manipulation" → "Compute". Use the
expression "c3+100" to add a new column to each of the two datasets, defining
the coordinate of the position 100bp downstream of the exon/intron junction
(Note: you must perform this step twice, once for each of the two datasets).
Make sure the "Round" option is set to "Yes".
9. You now need to redefine the "start" and "end" attributes so that Galaxy will
refer to the interval of the 100bp intronic region rather than to the exon. In
order to do this, click on the pencil icon on the right next to the dataset name
(between the eye and X icons). Click on "Auto-detect" for Galaxy to identify the
newly created column. Then, click again on the pencil icon and select the start
column as the previous end column (column 3), and the end column as the
newly created one (column 7). Repeat this step for the other dataset. For each
of the two datasets, you now have the coordinates of the 100bp flanking
introns as the start/end coordinates for further manipulations.
10. To compute the conservation level of these intronic regions for both
constitutive and cassette exons, we will now use "Get Genomic Scores" →
"Aggregate datapoints". Repeat this step for each of the datasets. (The
"Aggregate datapoints" function uses the 17-way conservation track in UCSC,
which contains a conservation score for each nucleotide in the genome.
Galaxy computes mean, maximum and minimum conservation scores for each
of the intervals supplied as input, and outputs this data as three new columns
in a new dataset.)
11. You are now one step away from accomplishing the mission. In order to
calculate and compare the mean conservation level of the intronic region
downstream of constitutive and cassette exons, we will now use "Statistics" →
"Summary Statistics". Use this on both new datasets and enter "c8"
(representing column 8, which is the column containing the mean conservation
scores of the intervals) as the requested column.
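What steps 10 and 11 compute can be illustrated with a short sketch. It is not Galaxy's code: the per-base scores are stored in a hypothetical dictionary (position to score), the function names are invented, and all numbers are toy values rather than real conservation scores.

```python
from statistics import mean


def aggregate_scores(scores, start, end):
    """Mean, minimum and maximum score over [start, end).

    `scores` maps a genomic position to its conservation score; positions
    without a score are skipped. Returns None if no position is scored.
    """
    window = [scores[pos] for pos in range(start, end) if pos in scores]
    if not window:
        return None
    return mean(window), min(window), max(window)


def mean_of_interval_means(scores, intervals):
    """Average the per-interval means, as done on column c8 in step 11."""
    results = [aggregate_scores(scores, s, e) for s, e in intervals]
    means = [r[0] for r in results if r is not None]
    return mean(means) if means else None


# Toy per-base scores and two tiny "flanking intron" interval sets.
toy_scores = {10: 0.5, 11: 1.0, 12: 0.0, 20: 0.25, 21: 0.25}
cassette_flanks = [(10, 12)]       # covers positions 10 and 11
constitutive_flanks = [(20, 22)]   # covers positions 20 and 21

print(mean_of_interval_means(toy_scores, cassette_flanks))
print(mean_of_interval_means(toy_scores, constitutive_flanks))
```

Comparing the two printed means corresponds to comparing the "Summary Statistics" outputs of the two datasets.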
What are your conclusions? Compare the insights you gained over the last 20 minutes
to those reported by Sorek and Ast, 2003, Genome Research, "Intronic sequences
flanking alternatively spliced exons are conserved between human and mouse".