Woods Hole – Zebrafish Genetics and Development
Bioinformatics/Genomics Lab
Ian Woods
Note: This document “wh_informatics_practical.doc” and supporting materials
can be downloaded from my website:
http://faculty.ithaca.edu/iwoods/docs/wh/
Setting the stage: These tasks each pertain to the mutation that we (virtually)
mapped in lab. The curved body axis and U-shaped somites observed in these
mutants are hallmarks of disrupted slow muscle development, and similar
phenotypes are observed in mutants with defects in Hedgehog signaling.
General descriptions of the tasks are provided below. Specific protocols can be
found following this introductory section. Each of you should choose (at least)
one task to accomplish, and collaboration is highly encouraged.
Task 0: High resolution mapping, sequencing, and expression
Overview: From a rough map position, refine the critical interval via (virtual) high
resolution mapping with additional markers. Query the critical interval in the
zebrafish genome for potential candidate genes. Locate expression patterns
online for these candidates. Design primers to sequence candidate genes for the
mutagenic lesion or for additional SNPs to use in mapping.
Task 1: Create a transgenic reporter line by cloning candidate
enhancer/promoter sequences
Overview: Identify the translational start site of a gene of interest. Obtain ~6 kb of
sequence upstream of this site. Design PCR primers that will amplify this region,
and clone it in-frame with GFP in a tol2 expression vector. Identify BACs for use
in creating reporter constructs via homologous recombination. Identify
evolutionarily conserved sequences from other organisms to uncover potential
regulatory regions around your gene of interest.
Task 2: Expression, Rescue, and Phenocopying
Overview: Identify the zebrafish ortholog of your favorite gene. Find its location
in the genome, locate the translational start site (ATG), and identify the exon-intron boundaries. Design two 25-mer morpholino sequences that target (1) the
ATG and (2) an exon-intron boundary. Identify an orthologous gene in another
fish species for use in rescue experiments to control for morpholino specificity.
Align this sequence with your morpholinos to determine degree of potential
activity. Obtain a full-length clone of the zebrafish gene (via RTPCR or clone
collections) for use in overexpression experiments or expression analyses via in
situ hybridization. Identify potential CRISPR targets within your gene.
Task 3: Batch BLAST and parsing with Python to identify zebrafish
transcripts related to a specific signaling pathway
Overview: Mine OMIM (Online Mendelian Inheritance in Man) for genes related
to Hedgehog signaling. Obtain amino acid sequences for these genes, and
identify putative zebrafish orthologs for these proteins via BLAST. Use a simple
script to parse the blast results to see where the genes are located in the
zebrafish genome. Finally, find out where a few of these genes are expressed
(via zfin).
Requirements: Terminal, python (both native on MacOSX)
Task 4: Visualization of enriched motifs in putative promoter / enhancer
regions.
Overview: From a file of unidentified sequences derived from a transcriptome
profiling experiment, identify the best matching Ensembl transcript via local
BLAST, batch download potential promoter sequences for each of these
transcripts, search through these promoter sequences for enriched motifs, and
visualize the location of the motifs on the promoters (just a bit advanced)
Requirements: Terminal, Python, Matplotlib (all native on MacOSX)
Protocols:
Task 0: High resolution mapping, sequencing, and expression
1.
The mutation we mapped in lab is flanked by SSLP/Z-markers Z11119 and
Z15270. Your first job will be to view the region of the genome that is flanked by
these two markers. Within this region, you can identify candidate genes and find
additional markers that can be used to refine your map position, and thereby narrow
the critical interval in the genome in which to look for the gene that is disrupted in the
mutant.
Start at zfin.
http://www.zfin.org
Enter ‘Z15270’ in the box at the top. On the page that follows, hit the link, and
then hit the link to GenBank. On the GenBank page, click FASTA and copy the
sequence onto the clipboard of the computer. To find the location of this marker
in the genome, we’ll go to the zebrafish genome browser hosted by EMBL:
http://ensembl.org
Follow the link for ‘BLAT’, paste the sequence of this marker into the window,
select ‘Danio_rerio’ from the species menu, and click ‘RUN’. On the following
page, click the link for the best matching chromosome region. This takes you to a
view of the genome, centered on this map marker. Note the physical location of
this marker (the numbers in the genome window). Zoom out a bit to get a sense
of the genomic region.
Repeat the above steps for Z11119. How many hits in the genome do you obtain
for Z11119? What does this mean? Choose the ‘best’ alignment. Where in the
genome is your mutation likely located (answer in terms of numbers)?
2.
Now let’s look at a candidate gene near one of your map markers. Find
primer sequences for one of these genes (calca). From your browser window for
Z15270 in #1, locate calca and click on it. This takes you to the Ensembl page for
this gene. Click on the ZFIN link, which takes you to the ZFIN page for this
zebrafish sequence. Scroll down to the RefSeq link under ‘Sequence Information’
and follow it. Locate the ‘FASTA’ link and click it, which takes you to a page
where the sequence is located. Copy this sequence to the clipboard on your
computer.
Now go to one of many websites for primer design:
http://bioinfo.ut.ee/primer3-0.4.0/
Paste in your sequence and select a length of 500-550 (the comfortable limit for
sequencing PCR products). Hit ‘Pick Primers’ and retrieve your primer
sequences. The next step would be to amplify gDNA from wildtype and mutant
embryos via PCR, sequence the PCR products, identify sequence differences,
and use this information to test for linkage between this gene and your mutation.
We’ll go over how to do that in more detail below.
3.
You collect hundreds of mutants for use in a high-resolution mapping
panel, and test them for linkage to numerous markers from your region. You find
that the SSLP Z15270 is the marker that is most tightly linked to your mutation,
but some recombinants remain. Query the zebrafish genome assembly to see a
model of your region of interest (the assembly is pretty good on a large scale, but
can be misleading in a local region). Go to the Ensembl website.
http://www.ensembl.org
Find the genomic location of Z15270 as above. Click on the “Configure this page”
link on the left hand side of the page. Here you’ll find all sorts of ‘tracks’ you can
turn on and off to show different kinds of information. Try turning some additional
features on. Save and close the configuration window by hitting the checkmark
in the upper right, and zoom out in the browser as far as is allowed.
4.
Exploring the genomic region – what do these genes do? Click on some
of the genes found in the region, taking you to the gene record page. Find and
click the ‘orthologues’ link on the left hand side of the page for each gene. What
kind of gene is PDE3B?
5.
Go back to the genomic view. Can you get a link to ZFIN for any of these
genes? Click on rras2, and follow through to ZFIN. Follow the link for Expression
Data. Your mutant has defects in muscle specification – is the expression pattern
of rras2 consistent with a role in muscle?
6.
You decide to sequence rras2 in wildtype and mutant embryos to see if (1)
you can find a SNP to use in mapping, so that you can rule this gene out via recombination, and (2) you
can find a change in the mutant sequence that might cause a loss-of-function
phenotype. Design primers that will amplify a 600 bp PCR product that contains
the first exon of rras2.
Find the rras2 entry in ZFIN (you are probably already there in step #5). Go to
the ZFIN homepage:
http://zfin.org
Click on Genes/Markers/Clones and enter rras2. On the ZFIN gene page, scroll
down and follow the link to the RefSeq RNA record. Scroll down and note the
coordinates of the coding sequence (CDS) in the entry. Copy the coding
sequence onto the clipboard.
Go to the UCSC genome browser (you can also do this on the Ensembl browser,
but the UCSC interface is a bit friendlier for this task):
http://genome.ucsc.edu
Click on the BLAT tab, and paste in your sequence. Select “Zebrafish” from the
Genome pulldown menu, and click “Submit”. Follow the link for “details” on the
first BLAT hit. Scroll up and down to check your results – what do the different
color-codings mean in your sequence?
Select about 600b of genomic sequence from which to design primers, then
head to the primer3 website:
http://bioinfo.ut.ee/primer3-0.4.0/
Paste in your sequence, choose a size range of 500-600b (about the limit of a
sequence trace from a PCR template), and click “Pick Primers”.
7.
You PCR from genomic DNA of wildtype and mutant embryos, and
sequence the PCR products. The sequencing results are as follows:
>wildtype_rras2_exon1
AGGCGGGAGTGTGAGCGCGCGCCCCCTCGCGCCCGCCGCGCGCACTGCCAGCACTGATTAGCCGTATCTTCCCCTCATCTTGCAGCACAGGCAGTCAGTCAGTGCCTGGTAGCGATTTG
GACGAGGGCGTATGGACTTGAAGCAGCAGTGTATGCATTTCCCACAGACTGTGGTCGTACTTTTCTCCTGTCGGACGGATTACCACTGAGTTGACACATAGCCCAAAAGCCGCTTCGCA
TTTTTTCCGCTGCATTTCTCTAACTGAAGGCCTGTCACAGAGTAAAGTGGCTCGGTGTGCGTGTGTTTAGACAGCGGAGCGAGAGCAGCAGTGTGTCCCCGATGGCTGGCTGGAAGGAC
GGCTCAGTGCAGGAGAAATATCGCCTGGTGGTCGTCGGAGGTGGTGGCGTCGGAAAATCAGCGTTAACCATCCAGTTTATCCAGGTAAGCGGATACATGGCGGAATGTTATGTGGTTTT
CGGCCCTTTAAAAAGATGTGAGGGTGTTGAGGAGAAATGCGTGGATCTTGCTCACAGAAATGGGGACCCCATGAGCGGAAAAGGGGGTTCAGGAATCCAAGCTAGGCCTGCGACACTTT
AAACC
>mutant_rras2_exon1
AGGCGGGAGTGTGAGCGCGCGCCCCCTCGCGCCCGCCGCGCGCACTGCCAGCACTGATTAGCCGTATCTTCCCCTCATCTTGCAGCACAGGCAGTCAGTCAGTGCCTGGTAGCGATTTG
GACGAGGGCGTATGGACTTGAAGCAGCAGTGTATGCATTTCCCACAGACTGTGGTCGTACTTTTCTCCTGTCGGACGGATTACCACTGAGTTGACACATAGCCCAAAAGCCGCTTCGCA
TTTTTTCCGCTGCATTTCTCTAACTGAAGGCCTGTCACAGAGTAAAGTGGCTCGGTGTGCGTGTGTTTAGACAGCGGAGCGAGAGCAGCAGTGTGTCCCCGATGGCTGGCTGGAAGGAC
GGCTCAGTGCAGGAGAAATATCGCCTGGTGGTCGTCGGAGGTGGTGGCGTCGGAAAATCAGCGTTAACCATCCAGTTTATCCAGGTAAGCGGATACATGGCGGAATGTTATGTGGTTTT
CGGCCCTTTAAAAAGATGTGAGGGTGTTGAGGAGAAATGAGTGGATCTTGCTCACAGAAATGGGGACCCCATGAGCGGAAAAGGGGGTTCAGGAATCCAAGCTAGGCATGCGACACTTT
AAACC
You wish to know if these sequences harbor any polymorphisms, and whether
you can use these polymorphisms to facilitate your high resolution mapping.
Align the two sequences via BLAST2:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Follow the link for ‘nucleotide blast’, and check the box for ‘Align Two or More
Sequences’. Note the points at which the two sequences differ.
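The same comparison can be sketched in a few lines of Python. This is only a minimal illustration: it assumes the two reads are the same length and differ only by substitutions (BLAST handles gaps, this does not). The function name find_mismatches is hypothetical, and the two short strings are excerpts copied from the wildtype and mutant reads above.

```python
def find_mismatches(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each difference.

    Assumes the sequences are the same length and already aligned without gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length for this simple comparison")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    ]


# Short excerpts taken from the wildtype and mutant sequences in step 7.
wildtype_excerpt = "GAGAAATGCGTGGATC"
mutant_excerpt = "GAGAAATGAGTGGATC"
print(find_mismatches(wildtype_excerpt, mutant_excerpt))
```

To compare the full reads, paste each complete sequence (with the line breaks removed) into the two variables.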
Next, you’d like to see if the polymorphisms can be distinguished via restriction
digest. Paste about 40b of wildtype and mutant sequence flanking the SNP into
the dCAPS website, leaving the “mismatches” field blank.
http://helix.wustl.edu/dcaps/dcaps.html
Are there enzymes available that will cut wildtype but not mutant sequence (or
vice versa)? If a SNP does not create or destroy a restriction site, try entering “1” in the
mismatch field – what does this accomplish?
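If you are curious what a site like dCAPS is doing with zero mismatches, here is a rough sketch: check whether an enzyme's recognition site is present in one allele but not the other. The enzyme table is a tiny illustrative subset (all four sites are palindromic, so scanning one strand is enough), the function name differential_cutters is hypothetical, and the two flanking strings are excerpts from the reads in step 7.

```python
# Recognition sites for a few common enzymes (an illustrative subset, not a full catalog).
# All four sites are palindromic, so checking one strand is sufficient.
ENZYME_SITES = {
    "EcoRI": "GAATTC",
    "HaeIII": "GGCC",
    "SphI": "GCATGC",
    "StuI": "AGGCCT",
}


def differential_cutters(wildtype, mutant, sites=ENZYME_SITES):
    """Report enzymes whose site is present in exactly one of the two sequences."""
    wildtype, mutant = wildtype.upper(), mutant.upper()
    result = {}
    for enzyme, site in sites.items():
        in_wt = site in wildtype
        in_mut = site in mutant
        if in_wt and not in_mut:
            result[enzyme] = "wildtype only"
        elif in_mut and not in_wt:
            result[enzyme] = "mutant only"
    return result


# About 20 bases around one of the SNPs, copied from the reads in step 7.
wildtype_flank = "CAAGCTAGGCCTGCGACAC"
mutant_flank = "CAAGCTAGGCATGCGACAC"
print(differential_cutters(wildtype_flank, mutant_flank))
```

This only looks at the short flank. In a real digest the enzyme may also cut elsewhere in the PCR product, which changes the fragment pattern, so check the whole amplicon before ordering enzymes.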
8.
Finally, do the SNPs result in changes in the coding sequence for rras2?
Try BLASTing the mutant sequence (from #7 above) vs. the amino acid
sequence (from the GenBank/NCBI page from #6), using ‘Align Two or More
Sequences’ and BLASTX.
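BLASTX translates the nucleotide query for you, but it can help to see the translation step on its own. The sketch below builds the standard genetic code and translates one reading frame. The function name translate is hypothetical, and the example string is copied from the wildtype read in step 7 on the assumption that the ATG shown there is the start codon (check this against the CDS coordinates in the RefSeq record).

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in frame 1, stopping at the first stop codon.

    Codons containing anything other than A, C, G, or T become 'X'.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 24 bases starting at the presumed ATG in the wildtype read (an assumption).
print(translate("ATGGCTGGCTGGAAGGACGGCTCA"))
```

Translating the same stretch from the wildtype and mutant reads and comparing the two peptides tells you whether a SNP changes the protein. A SNP that falls outside the translated region will not show up here at all.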
Task 1: Create a transgenic reporter line by cloning candidate
enhancer/promoter sequences
1.
Eventually you identify the mutation as a lesion in the gene scube2. You
wish to analyze the morphogenetic movements of cells expressing this gene
during development in live embryos. To accomplish this, you decide to make a
GFP reporter line that reflects the endogenous expression of this gene. First you
decide to try a quick-and-dirty approach: you plan to clone genomic sequences
upstream of the translational start site (ATG) of this gene and put them into a tol2
GFP expression vector.
Locate this gene in the genome and retrieve the coding sequence: go to the
ZFIN homepage, and enter scube2 in the search box at the top of the page.
http://zfin.org
Follow the “gene” link to the ZFIN record for this gene, and scroll down the page.
Where (which chromosome) does ZFIN say this gene is located?
2.
Next, you want to retrieve the nucleotide sequence of this gene to (1)
compare it with the genomic sequence, and (2) identify the translational start site.
Scroll down the ZFIN page until you find the link for “RNA”. Follow this to the
RefSeq record for this gene. Scroll down to the sequence information at the
bottom of the page. Where does the coding sequence (cds) begin and end
within the complete mRNA transcript? Find the ATG in the nucleotide sequence.
Beginning at the ATG, copy about 100b of nucleotide sequence to the clipboard
and head to the Ensembl Genome Browser for Zebrafish.
http://www.ensembl.org/Danio_rerio/
Enter ‘scube2’ into the search box. On the resulting page, click on “Location.”
Which direction is the gene transcribed (ie. which strand is the coding strand)?
By high-resolution genetic mapping, you localized the SSLP Z15270 to be 0.1 cM
from the mutation in scube2. Z15270 is on chromosome 7 at about 27,488,000.
The genetic map length of the zebrafish genome is 3000 cM total, and the total
physical length of the genome is 1.7 x 10^9 bp. Is the actual physical (basepair)
distance between Z15270 and scube2 surprising? What factors might account
for any differences in expected distance?
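The expected distance is simple arithmetic, sketched below. It uses only the two totals given above and assumes recombination is spread evenly across the genome, which is exactly the assumption the question asks you to think about. The function name is hypothetical.

```python
GENETIC_MAP_LENGTH_CM = 3000   # total genetic map length, from the text
GENOME_LENGTH_BP = 1.7e9       # total physical length, from the text


def expected_physical_distance(genetic_distance_cm):
    """Convert cM to bp using the genome-wide average (assumes uniform recombination)."""
    bp_per_cm = GENOME_LENGTH_BP / GENETIC_MAP_LENGTH_CM
    return genetic_distance_cm * bp_per_cm


print(f"{GENOME_LENGTH_BP / GENETIC_MAP_LENGTH_CM:,.0f} bp per cM on average")
print(f"{expected_physical_distance(0.1):,.0f} bp expected for 0.1 cM")
```

Compare this estimate with the distance you read off the browser between Z15270 and scube2.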
Zoom in and move the window so that the first exon encompasses the entire
view (you can do so by drawing a rectangle around the first exon or by pressing
the < and > buttons). Resize the window to include about 5 kb of upstream
sequence (just add 5000 to the righthand number in the location box). Would
grabbing 5 kb of upstream sequence be a good idea to make a reporter construct
for scube2? Why or why not?
You decide to retrieve all intergenic sequence and test various parts of it for
enhancer activity. First, resize the browser window to just include this intergenic
sequence. Click the link for “export data” on the left hand side of the page. Pull
down ‘soft’ repeat masking in the genomic FASTA options, and hit next. Then
click the ‘text’ link to get the sequence.
Copy the DNA on to the clipboard, then go to the Primer3 website to design
primers, trying to get as much of the input sequence as possible into the PCR
product.
http://bioinfo.ut.ee/primer3-0.4.0/
To clone this bit of DNA, you would add appropriate restriction enzyme (or
Gateway, or SLIC, or PIPE, or Gibson) sequences to the primers, PCR amplify,
and hop into your favorite GFP expression vector.
3.
You successfully make this vector and inject it into 1-cell stage embryos.
The GFP expression in injected fish (aka. ‘transient-transgenics’) is promising –
the pattern of GFP expression in a few fish roughly matches what is observed via
in situ hybridization. In addition, many other tissues express GFP. Encouraged
by this result, you raise the embryos to adulthood and cross them to identify
founders. You identify ten founders, but none of your lines express GFP in a
pattern consistent with the in situ data: expression in some tissues is absent,
and many tissues express GFP where the gene is not normally expressed. How
might you explain these results?
You decide to make a new reporter line by BAC recombination: you will obtain a
large (~200kb) chunk of genomic DNA that contains this gene, and replace the
first exon of your target gene with GFP. Why might this strategy result in GFP
expression that more accurately recapitulates the endogenous expression
pattern?
You can use at least two approaches to identify a BAC that contains your favorite
gene: (1) directly from the Ensembl genome browser, (2) via a BLAST search at
NCBI.
3a.
Go to the Ensembl home page for zebrafish:
http://www.ensembl.org/Danio_rerio/
Enter “scube2” in the search box and click “Go.” Follow the link for “Location”.
Look at the “Location” pane in the browser page – what is written in the blue bar
in the center of the page? If a region of the assembly is represented by a
sequenced BAC, there will be a GenBank accession number (eg. AL845363) in
this blue bar. By contrast, if the region is represented by whole-genome shotgun
traces, you will see something like “Zv9_scaffold12345” in the middle bar.
Turn on the BAC ends track (if not already on) by clicking “Configure this page”
(Simple Features) on the left hand side. Check the boxes for CHXXX and
DKEYXXX (where X = a series of numbers), and hit the check mark on the upper
right corner. Zoom out until you can see connected BAC ends (represented by
horizontal blue bars). Are there any good options for BACs that contain the
scube2 coding sequence and putative regulatory regions?
3b.
Another way to search for a BAC is via a BLAST query at NCBI/Genbank.
Retrieve the GenBank accession number for scube2 again from ZFIN, then go to
the NCBI BLAST homepage:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Click “nucleotide blast”, enter the accession number in the search box, select “nr”
from the pulldown menu, and type in “Danio rerio” in the organism box.
Hit BLAST. On the results page, genome sequence will be annotated as
“Zebrafish DNA sequence from clone….” Are there any BAC clones that cover
the entirety of the scube2 sequence? You next decide to align the coding
sequence with one BAC sequence to check for overlap. Note the accession
number of the BAC, and go to the BLAST2 page:
http://blast.ncbi.nlm.nih.gov/
=> select ‘nucleotide blast’ and click the ‘Align two sequences’ box
Enter the accession number for the coding sequence in the top box, and for the
BAC in the bottom box, and hit “Align”. Where does the coding sequence (ie.
query) begin and end in the BAC sequence? Hit the ‘Dot Matrix’ view for a
graphical look.
The next steps would involve creating a targeting vector for homologous
recombination. In this case, you could use ET recombination (or another
method) to replace the first exon with GFP (or whatever you’d like), and also
modify the BAC with tol2 transposon LTR sequences. BACs can be ordered from
two sources, depending on the library:
DKEY and DKEYp from ImaGenes (http://www.imagenes-bio.de/); CH211, CH1073
and CH73 from BacPAC Resources (http://bacpac.chori.org/)
4.
As a final step, you wish to identify candidate regulatory sequences by
comparing genomic sequences from multiple teleost species. This can be
accomplished via the VISTA webserver:
http://genome.lbl.gov/vista/index.shtml
First we will need to collect genomic sequences from other fish. In this example
we will use three fish in which both genomic sequences and chromosome
assignments are available: Tetraodon nigroviridis (Green-spotted pufferfish),
Gasterosteus aculeatus (3-spined stickleback), and Oryzias latipes (medaka).
The whole-genome duplication event in the teleost lineage can make definitive
orthology assignments a bit tricky. Clues to the correct ortholog can be gleaned
from analyzing conserved syntenies, in which gene content on particular
chromosomes has been retained after species divergence. A useful viewer of
conserved syntenies in multiple organisms can be found at the Oxgrid website:
http://oxgrid.angis.org.au/oxg_table.html
By selecting the appropriate species comparisons, you can view chromosomes
and chromosome segments in which gene content has been conserved. Which
regions of the stickleback, medaka, and pufferfish genomes most closely match
zebrafish chromosome 7?
Find the orthologs of Scube2 in these species by performing BLAT searches at
the UCSC genome website, using the peptide sequence of Scube2 as the query.
In the resulting browser page for each BLAT search, expand the window size to
include ~ 10kb of upstream and downstream flanking sequences. Note the
orientation of the gene (+ or – strand), and export the genomic DNA via the DNA
tab. Save these sequences on to your desktop. They may need to be edited to
retain FASTA format – you can do this in Notepad, TextEdit, or via a command-line editor such as emacs.
Note – I’ve collected these sequences for you here, if you don’t want to do all of
the searching:
http://faculty.ithaca.edu/iwoods/docs/wh/vista_scube2/
Next return to the VISTA homepage, choose mVISTA, select 4 sequences to
align, and upload your sequences to the VISTA server. View both the visual
alignments as well as the textual alignments. Since we collected about 10kb of
upstream sequence for each fish, the exons should begin to align at ‘10k’. Can
you see where the exon sequences are? Are there conserved noncoding
sequences present as well? You may want to relax (via shorter alignments, lower
%match) the conservation parameters a bit (‘Adjust Conservation Parameters’),
to see if you can get more sequences to show up as conserved.
Task 2: Expression, Rescue, and Phenocopying
In midline patterning, Hedgehog signals emanate from the notochord and ventral
neural tube. Although scube2 mutants display many defects observed in other
Hedgehog pathway mutants, scube2 expression in the neural tube is confined to
dorsal regions. This expression pattern is reminiscent of Boc, a gene involved in
Hedgehog signaling in mouse. You wish to analyze the zebrafish ortholog of Boc
at the level of expression and function.
1.
As a first step, you search for the zebrafish ortholog of Boc. Start at the
NCBI home page:
http://www.ncbi.nlm.nih.gov/
Select “Genes” from the Search menu, and type in “Boc.” Scroll down until you
see the first mouse record, and follow its link. On the resulting page, scroll down
to the bottom to find the link for the amino acid sequence. Follow this link to the
GenPept record for the protein. Scroll down and copy the amino acid sequence
into your clipboard.
Now go to the BLAST home page:
http://blast.ncbi.nlm.nih.gov
Follow the link for tblastn, paste in the sequence, select “nr” and type in Danio
rerio. While this search is running, hit the back button and select “Expressed
Sequence Tags” from the database menu. These two simultaneous BLASTs will
ensure that all available coding sequences from Zebrafish will be searched. You
can access all of your ongoing BLAST searches via the “Recent Results” tab.
Check the BLAST result pages; if you have hits on the ‘nr’ search with accession
numbers that start with ‘NM_’, you’re all set; these are typically full-length
sequences that have been subjected to some degree of annotation and
verification.
2.
Next you’d like to obtain a clone of zebrafish boc to use in expression
analysis via in situ hybridization. You can follow at least two avenues: (1) obtain
a clone from a commercial source or another laboratory, or (2) make a clone via
RTPCR. Clones that are commercially available are labeled with an IMAGE ID.
Ideally, you would like a full-length sequence that you could use for rescue or
overexpression experiments, but partial sequences are fine for generating in situ
probes. Search the UniGene database using the accession number of your top
‘NM’ hit.
http://www.ncbi.nlm.nih.gov/unigene/
Scroll down to locate sequences that are annotated with IMAGE ID’s. Do any of
the IMAGE clones represent full-length cDNAs (compare with the
‘NM’ sequences)? You can order IMAGE-ID’d clones from Open Biosystems:
http://www.openbiosystems.com
3.
Next, design primers that will allow you to amplify a full-length clone via
RT-PCR for mRNA overexpression/rescue experiments. Follow the “NM_XXX”
link from the UniGene page for boc. Scroll down, highlight and copy the
nucleotide sequence, then paste it into primer3.
http://bioinfo.ut.ee/primer3-0.4.0/
Choose a size range that is sufficient to include the entire cds. Do your primer
sequences flank the translational start and stop codons?
4.
You’d also like to design morpholino oligonucleotides (MOs) that target the
translational start site and a splice junction. First, compare the coding sequence
with the genomic sequence to find the ATG and the exon-intron boundaries.
There are several ways to do this, including (a) the GenBank record, (b)
exporting sequence from a genome browser (Ensembl or UCSC), and (c) BLAST
searches on genomic traces or sequenced BACs.
4a. The GenBank browser will often have 5’ UTR sequences that can be used to
design an ATG-binding MO. Where does coding sequence of boc begin? Check
the GenBank record for NM_001005393 and look for ‘cds’.
4b. Go to the Ensembl blast page and paste in the sequence for boc.
http://www.ensembl.org/Multi/Tools/Blast
In the ‘Search Against’ box, select Danio rerio. Examine the alignment overview
on the results page. Is the whole gene aligned to the genome? You can now
zoom in on the first exon of boc, extract the sequence, and design your
morpholinos. To get to the first exon, look for the lowest numbers in the ‘Query’
column on the blast result page. Or, zoom out to the whole gene and draw a
rectangle around the first exon and click ‘Jump to Region’. Similarly, you can
design splice-blocking morpholinos by finding the exon-intron boundaries in the
browser.
4c. You can repeat the procedure above to design a Morpholino that targets an
exon/intron junction. This time, we’ll use the UCSC browser so that you can get a
feel for its features:
http://genome.ucsc.edu
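Once you have the target sequence in hand, the morpholino itself is just the antisense (reverse complement) of a 25-base window. Here is a minimal sketch, assuming you have already chosen where the window starts. The function names are hypothetical and toy_mrna is a made-up sequence, not boc.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def antisense_oligo(mrna, start, length=25):
    """Return the antisense oligo that pairs with mrna[start:start + length]."""
    if start < 0:
        raise ValueError("start must be a non-negative index")
    target = mrna.upper().replace("U", "T")[start:start + length]
    if len(target) < length:
        raise ValueError("not enough sequence after start for the requested length")
    return reverse_complement(target)


# Toy sequence for illustration only (10 bases of made-up 5' UTR, then an ATG).
toy_mrna = "GGACTTCAAGATGGCTGGCTGGAAGGACGGCTCAGTGCAGG"
atg_index = toy_mrna.find("ATG")
print(antisense_oligo(toy_mrna, atg_index))
```

Here the window starts right at the ATG. Where to place the window (for example, shifting it to cover some 5’ UTR, or centering it on an exon-intron boundary for a splice-blocking MO) is a design choice, so adjust start accordingly.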
5.
Next, you would like to control for Morpholino specificity via rescue by
injection of an mRNA to which your morpholinos will not bind. One approach to
do this is to inject an mRNA from a different species. You head to the pet store
and acquire a Green-spotted pufferfish (Tetraodon nigroviridis), grind it up in
liquid nitrogen, and extract total RNA. Your plan is to identify the ortholog of boc
in Tetraodon, clone this sequence via high-fidelity PCR, generate mRNA via in
vitro transcription, and inject this mRNA into morpholino-treated zebrafish for
rescue.
First, return to the mouse GenPept page for Boc (from Step #1). Copy the
sequence into the clipboard, and return to the UCSC Blat page.
http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
Paste in the sequence, and select Tetraodon from the species menu. Hit Submit,
and then follow the link for browser view. Zoom out until the full Tetraodon
sequence is shown. How does your BLAT query compare with the Tetraodon
genome?
Click on the Tetraodon Gene within the browser window – this will take you to a
page in which it will be possible to export the predicted gene and peptide
sequences. Copy the amino acid sequence, and paste it into blast2, along with
the mouse Boc peptide sequence.
http://blast.ncbi.nlm.nih.gov/ (select blastp, and check box for ‘compare two sequences’)
Is the full mouse sequence matched by the Tetraodon sequence? How does the
Tetraodon sequence compare with zebrafish boc? Which chromosome contains
boc in Tetraodon? Does this make sense via patterns of conserved synteny
between Zebrafish and Tetraodon? Check the OxGrid website and find the
picture that depicts the orthology relationship between Zebrafish (drer) and
Tetraodon (tnig):
http://oxgrid.angis.org.au/oxg_table.html
The next step is to design primers to amplify Tetraodon boc via RTPCR. The
predicted gene sequence does not include 5’ and 3’ untranslated regions, which
does not leave much wiggle room for designing effective primers. You can
collect putative UTR sequences from the genome. Go back to the UCSC BLAT
page for Tetraodon, and paste in the predicted cDNA sequence. On the results
page, follow the link for “details” and scroll down until you see the alignment with
the genomic DNA. Collect about 80b of genomic sequence up and downstream of
the Tetraodon boc gene, and make a new sequence that includes these putative
UTR sequences. Enter this sequence into Primer3, and pick primers that will
flank your coding sequence. Next, you’ll amplify by high-fidelity PCR, clone into
an expression vector, and verify the clone by sequencing.
How do the morpholino sequences you designed match up with the Tetraodon
boc sequence? Compare them with BLAST two sequences:
Go to ncbi BLAST, click on blast nucleotide, and click on ‘align two sequences’.
Paste your morpholino in one box, and the Tetraodon boc sequence in the other.
Is there a significant match? Will Tetraodon boc mRNA likely escape morpholino-induced knockdown?
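A rough way to put a number on "how well does my morpholino match" is to slide the morpholino's target site along the other sequence and count matching bases at each position. The sketch below does this without gaps, so it is cruder than BLAST. The function names and both toy sequences are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def best_target_match(morpholino, mrna):
    """Return (matching bases, 0-based offset) for the best ungapped window in mrna.

    The morpholino is antisense, so its reverse complement is compared to the mRNA.
    Returns (0, -1) if the mRNA is shorter than the morpholino.
    """
    site = reverse_complement(morpholino)
    mrna = mrna.upper()
    best = (0, -1)
    for offset in range(len(mrna) - len(site) + 1):
        window = mrna[offset:offset + len(site)]
        matches = sum(a == b for a, b in zip(site, window))
        if matches > best[0]:
            best = (matches, offset)
    return best


# Toy example: a 25-mer morpholino and the (made-up) mRNA it was designed against.
toy_mo = "CTGAGCCGTCCTTCCAGCCAGCCAT"
toy_mrna = "GGACTTCAAGATGGCTGGCTGGAAGGACGGCTCAGTGCAGG"
print(best_target_match(toy_mo, toy_mrna))
```

Run the same function against the Tetraodon sequence. The more mismatches within the 25-base site, the less likely the morpholino is to bind the rescue mRNA.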
6. Finally, you wish to explore the possibility of targeting boc for knockout via
CRISPRs. One way to look for CRISPR targets is to take advantage of publicly available ‘tracks’ that can be added to genome browsers. To see one of these,
head to
http://genome.ucsc.edu/cgi-bin/hgHubConnect?hubSearchTerms=&hgHub_do_search=on
Scroll down to ‘ZebrafishGenomics’ and click ‘Connect’. Enter ‘boc’ in the search
box on the next page, and select the RefSeq Sequence. On the next page, scroll
down, turn on the CRISPRs track under ZebrafishGenomics, and hit Refresh.
Zoom in to the first exon to visualize potential targets. Lower scores indicate
targets that may potentially have higher specificity.
Task 3: Batch BLAST and parsing with Python to identify zebrafish
transcripts relating to a specific signaling pathway (just a bit advanced)
Note: the Python software to perform this step is included as part of Mac OSX. If
you run a PC, you’ll need to figure out how to install and run Python.
The mutation we have been studying has a phenotype similar to mutations in
genes encoding members of the Hedgehog pathway. We are going to find every
human protein that has been associated with Hedgehog signaling, identify their
putative zebrafish orthologs via a batch BLAST search, and determine whether
any of these putative zebrafish Hedgehog-associated genes is located near our
mapped mutation (on chromosome 7 at about 28.8 Mb).
A.
Collect sequences for human proteins with hints of hedgehog interactions
1.
Go to the NCBI website: http://www.ncbi.nlm.nih.gov/
2.
Pull down the OMIM database.
3.
Enter SHH in the search box – finds any record that mentions SHH.
4.
Select ‘Protein’ from the ‘Find Related Data’ menu on the right, and click
‘Find Items’. Limit the list to Human proteins by clicking ‘Homo sapiens’ in the
Top Organisms menu.
5.
Select “FASTA” from the Display Settings pull-down menu – this retrieves
the amino acid sequences.
6.
From the “Send to” menu, select “file”, and save by hitting the ‘Create File’
button.
7.
Open a Terminal window and move the sequence file to a new folder
called WH_bioInformatics on your Desktop.
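Before moving on, it is worth checking how many protein sequences you actually downloaded. A quick sketch: count the FASTA header lines in the file. The function name is hypothetical.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, count_fasta_records("shh_human_proteins.fa") would report the number of proteins, assuming you saved the download under the file name used in the BLAST command later in this task.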
B. Performing a local BLAST search on a batch of sequences
1. Go to the BLAST homepage at http://blast.ncbi.nlm.nih.gov/Blast.cgi
2. Click on the “help” tab
3. Follow the link for “Download BLAST Software and Databases”
4. Follow the link for the ftp site and connect as ‘Guest’ if necessary, then
click the "blast" link appropriate to your platform (eg. macosx-universal). The file
will be compressed; usually you can uncompress it by just double clicking it.
Or you can do it via the terminal with the commands shown in Task 4, step 3.
5. The download will result in a folder (ncbi-blast-2.2.31+) saved somewhere
on your computer (depending on your browser preferences). Move this folder to
the WH_bioInformatics folder, so that the commands below can find the BLAST
programs in ./ncbi-blast-2.2.31+/bin/.
6. Now we’ll download all current zebrafish transcripts (known and predicted)
from Ensembl. While in the same folder as your protein sequences, connect via
the terminal or your web browser to the ensembl ftp site.
ftp ftp.ensembl.org, or type ‘ftp.ensembl.org’ into a browser window
(login as “anonymous” with your email address as password)
>: cd pub/release-81/fasta/danio_rerio/cdna
7. Fetch all the sequences and disconnect:
>: mget Dan*
>: bye
(answer “y” at the prompts)
8. Unzip the sequences, concatenate them into one file named
zebrafishcDNAs.fa (the name used by the commands in the following steps), and
make sure this file is in your WH_bioInformatics folder.
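The unzip-and-concatenate step can be done by hand, but here is a minimal Python 3 sketch that does both at once. It assumes the downloaded files end in .fa.gz and start with "Danio_rerio"; adjust the pattern to whatever mget actually fetched. The function name is made up for illustration.

```python
import glob
import gzip
import shutil


def concatenate_gzipped_fasta(pattern, out_path):
    """Decompress every file matching `pattern` and append it to `out_path`.

    Returns the list of input files that were used, in sorted order.
    """
    paths = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out_handle:
        for path in paths:
            # gzip.open decompresses on the fly; copyfileobj streams the
            # data so large files are never held in memory all at once.
            with gzip.open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
    return paths


# Example call (run from inside the WH_bioInformatics folder):
# concatenate_gzipped_fasta("Danio_rerio*.fa.gz", "zebrafishcDNAs.fa")
```

Check the result with ‘less zebrafishcDNAs.fa’; you should see FASTA headers starting with ‘>’.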
9. Make a BLASTable database from these zebrafish sequences (type
‘./ncbi-blast-2.2.31+/bin/makeblastdb -help’ for options of makeblastdb):
./ncbi-blast-2.2.31+/bin/makeblastdb -in zebrafishcDNAs.fa -dbtype 'nucl'
10. Now you’re ready to do a blast search. You can always type ‘[command] -help’
(eg ‘./ncbi-blast-2.2.31+/bin/tblastn -help’) for blast options:
./ncbi-blast-2.2.31+/bin/tblastn -query shh_human_proteins.fa -db zebrafishcDNAs.fa -num_alignments 2 -num_descriptions 2 -evalue 1e-5 -out shhVzfCDNA.tblastn &
[It will take a while (~1.5 h) to compare our ~2600 human proteins with the entire
database of known and predicted zebrafish transcripts. To save time, get the
blast output from my website as part of task3_materials:
http://faculty.ithaca.edu/iwoods/docs/wh/informatics_problem/]
You can check on the progress of the blast search:
>: less shhVzfCDNA.tblastn
(type ‘q’ to quit out of less)
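Paging through the output tells you the search is alive, but not how far along it is. Here is a minimal Python 3 sketch that estimates progress by counting how many queries have a report so far. It assumes the default BLAST text output, in which each query's report begins with a line starting with "Query="; the function names are made up for illustration.

```python
def count_lines_starting_with(path, prefix):
    """Count the lines in a text file that begin with `prefix`."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(prefix):
                count += 1
    return count


def blast_progress(query_fasta, blast_output):
    """Return (queries started, total queries) for a running BLAST search."""
    # Every FASTA record begins with '>', so this is the number of queries.
    total = count_lines_starting_with(query_fasta, ">")
    # Each report in the default text output begins with 'Query='.
    started = count_lines_starting_with(blast_output, "Query=")
    return started, total


# Example call:
# print(blast_progress("shh_human_proteins.fa", "shhVzfCDNA.tblastn"))
```

The count is only an estimate, because the last report in the file may still be incomplete while BLAST is running.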
11. When the BLAST search is finished, parse it with one of my hacktastic
Python scripts (wh_blast.py), downloaded from my website as part of
task3_materials. To run the script, type the following:
>: python wh_blast.py shhVzfCDNA.tblastn > blast_results.csv
[sample results]
Import the results into Excel (comma-delimited) and sort by chromosome and
map position. Does anything map to Chr 7 near Z15270 (27,488,000 bp)?
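If you would rather not sort in a spreadsheet, here is a minimal Python 3 sketch that pulls out the hits within a window around a map position. The column names "chromosome" and "position" are assumptions, not the actual headers written by wh_blast.py; open blast_results.csv first and pass in the real names (or add a header row if the file has none). The window size is arbitrary.

```python
import csv


def hits_near(csv_path, chromosome, center_bp, window_bp,
              chrom_col="chromosome", pos_col="position"):
    """Return rows on `chromosome` within `window_bp` of `center_bp`.

    `chrom_col` and `pos_col` are hypothetical column names; change them
    to match the header line of your own results file.
    """
    hits = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row[chrom_col].strip() != str(chromosome):
                continue
            try:
                position = int(row[pos_col])
            except ValueError:
                continue  # skip rows without a numeric map position
            if abs(position - center_bp) <= window_bp:
                hits.append(row)
    # Sort the surviving rows by map position.
    hits.sort(key=lambda row: int(row[pos_col]))
    return hits


# Example call: everything on Chr 7 within 2 Mb of Z15270
# for row in hits_near("blast_results.csv", 7, 27488000, 2000000):
#     print(row)
```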
12. Choose ENSDART00000113313. Look at the Ensembl record:
http://www.ensembl.org/Danio_rerio/Transcript/Transcript?t=ENSDART00000113313
13. What is the cDNA sequence of this gene? What is the function of this
gene? Click on the ‘Gene’ record, and then follow the ‘Orthologs’ link (left of
page) for some clues.
14. Find expression data (if it exists): if you’re lucky, there will be a link to
ZFIN straight from the Ensembl Gene Record page.
Task 4: Visualization of enriched motifs in putative promoter / enhancer
regions.
Overview: From a file of unidentified sequences derived from a transcriptome
profiling experiment, identify the best matching Ensembl transcript via local
BLAST, batch download potential promoter sequences for each of these
transcripts, search through these promoter sequences for enriched motifs, and
visualize the location of the motifs on the promoters (just a bit advanced)
Note: the Python and Matplotlib software to perform this step is included as part
of Mac OSX. If you run a PC, you’ll need to figure out how to install and run
Python.
1. Open the Terminal application. Navigate to the Desktop of your computer via
the Terminal.
$ cd ~/Desktop
2. Download a file that contains task 4 materials from
http://faculty.ithaca.edu/iwoods/docs/wh/
Via the Finder/Windows GUI (graphical user interface), move this zip file to your
Desktop. You can also do this on the Terminal:
$ mv ~/Downloads/task4_materials.zip .
Uncompress this archive by double clicking on it in your GUI, or via the terminal:
$ unzip task4_materials.zip
This will create a folder of files on your Desktop. Move to this folder via the
Terminal and list its contents.
$ cd task4Materials
$ ls -1
task4sequences.fa = unidentified sequences obtained from a transcription profiling
experiment
task4BlastAndParse.py = a Python script that will BLAST the sequences vs. a database of
transcripts downloaded from Ensembl
kmerCounter.py = a Python script that will identify enriched motifs in a group of sequences
visualizeMotifs.py = a Python script that will show a picture representing the location
of motifs within a group of sequences
You can look at each of these files via a text editor, or from your Terminal using
the ‘less’ command. For example, try:
$ less task4sequences.fa
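For a quick summary instead of paging through the whole file, here is a minimal Python 3 sketch that reads a FASTA file and returns each header with its sequence. It is a generic reader, not part of the task 4 scripts, and the function name is made up for illustration.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Example call:
# for name, sequence in read_fasta("task4sequences.fa"):
#     print(name, len(sequence))
```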
Our next task is to determine the Ensembl Gene IDs that match these
sequences. To do so, we will BLAST these sequences against a database of
Ensembl Transcripts. First we have to download BLAST from NCBI, and then we
will download the Transcripts from Ensembl.
3. Downloading BLAST:
• Go to the BLAST homepage at http://blast.ncbi.nlm.nih.gov/Blast.cgi
• Click on the “help” tab
• Follow the link for “Download BLAST Software and Databases”
• Follow the link for the ftp site and connect as ‘Guest’ if necessary, then click the
"blast" link appropriate to your platform (eg. macosx-universal). The file will be
compressed; usually you can uncompress it by just double clicking it.
Or you can do it via the terminal:
$ mv ~/Downloads/ncbi-blast-2.2.31+-universal-macosx.tar.gz .
$ tar -xzf ncbi-blast-2.2.31+-universal-macosx.tar.gz
Via your GUI (Windows), rename the ncbi-blast-2.2.31+-universal …. BLAST
folder to something more manageable (call it BLAST).
4. Downloading Ensembl Transcripts. Connect via the terminal or your web
browser to the ensembl ftp site.
ftp ftp.ensembl.org, or type ‘ftp.ensembl.org’ into a browser window
(login as “anonymous” with your email address as password). Once you’re
logged in, type the following (after the ‘>’ prompt)
ftp> cd pub/release-81/fasta/danio_rerio/cdna
ftp> get Danio_rerio.GRCz10.cdna.all.fa.gz
ftp> bye
Uncompress the sequences by double clicking on them within your GUI, or via
the terminal:
$ gunzip Danio_rerio.GRCz10.cdna.all.fa.gz
Rename the sequences to something more manageable. Again, you can do this
within your GUI or via the terminal:
$ mv Danio_rerio.GRCz10.cdna.all.fa zfEnsemblCdna.fa
Make BLASTable databases for your Ensembl sequences with the following
command:
$ ./BLAST/bin/makeblastdb -in zfEnsemblCdna.fa -dbtype 'nucl'
5. BLAST the transcriptome sequences vs. the ensembl sequences and get the
top result. Both the BLAST and the parsing can be performed via one of my
Python scripts as follows. Note: this script requires that the files be named
exactly as specified here (i.e. the BLAST commands must be in a BLAST folder,
and the Ensembl Transcripts file must be named zfEnsemblCdna.fa). So when
you list (ls -1) the contents of your folder, you should see the following:
$ ls -1
BLAST
kmerCounter.py
task4blastAndParse.py
task4sequences.fa
task4sequences.txt
visualizeMotifs.py
zfEnsemblCdna.fa
zfEnsemblCdna.fa.nhr
zfEnsemblCdna.fa.nin
zfEnsemblCdna.fa.nsq
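Comparing the listing by eye works, but here is a minimal Python 3 sketch that reports anything missing. The names in REQUIRED are taken from the listing above (the Python scripts are left out); the function name is made up for illustration.

```python
import os

# Names taken from the folder listing in this step.
REQUIRED = [
    "BLAST",
    "task4sequences.fa",
    "zfEnsemblCdna.fa",
    "zfEnsemblCdna.fa.nhr",
    "zfEnsemblCdna.fa.nin",
    "zfEnsemblCdna.fa.nsq",
]


def missing_files(folder, required):
    """Return the entries of `required` that do not exist inside `folder`."""
    return [name for name in required
            if not os.path.exists(os.path.join(folder, name))]


# Example call (run from inside your working folder):
# print(missing_files(".", REQUIRED))
```

An empty list means every expected file and folder was found.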
If it looks good, run the blast and parsing as follows:
$ python task4BlastAndParse.py
This script creates several new files in your folder. You can visualize them with a
‘list’ command:
$ ls -1
You should see this (new files are the ones followed by a description):
BLAST
kmerCounter.py
task4SequencesTranscriptIDs.txt = a list of transcript IDs that match the sequences
task4SequencesgeneINFO.txt = Ensembl GeneIDs and Transcript IDs for all sequences
task4SequencesvCDNA.blastn = the BLAST results
task4blastAndParse.py
task4sequences.fa
task4sequences.txt
visualizeMotifs.py
zfEnsemblCdna.fa
zfEnsemblCdna.fa.nhr
zfEnsemblCdna.fa.nin
zfEnsemblCdna.fa.nsq
6. Now we will obtain potential promoter sequences for these transcripts by
downloading the 5’UTR and 2000 bp upstream (arbitrary) for these sequences
from BioMart at Ensembl.
First, copy the transcript IDs (from task4SequencesTranscriptIDs.txt) to your
computer’s clipboard.
Next, navigate to Ensembl
http://www.ensembl.org
and follow the link for BioMart. On the following page, select ‘Ensembl Genes’
from the pull-down menu, then select Danio rerio Genes.
To select for a set of Transcripts: click on ‘Filters’, check the plus sign for
‘Genes’, select ‘Ensembl Transcript IDs’, and paste in your list of transcripts.
To select information for download: click on ‘Attributes’, select ‘Sequences’, click
the plus sign for ‘SEQUENCES’, select ‘5-UTR’, check the box for ‘Upstream
flank’ and input 2000 into the box.
To download results: click on the ‘Results’ button. Check the box for ‘Unique
results only’. Hit the button for ‘GO’ next to Export all results to (leave ‘file’
selected in the pull-down).
A file called ‘mart_export.txt’ should appear in your Downloads folder. Move this
to your working folder via your GUI or Terminal, and rename it as
upstreamSequences.fa.
$ mv ~/Downloads/mart_export.txt upstreamSequences.fa
7. Identify enriched motifs in these sequences with a Python script. You can
change the parameters within the script if you’d like, but if you search for
anything more than 7 bases long, it will take quite some time.
$ python kmerCounter.py
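To get a feel for what a motif count involves, here is a minimal Python 3 sketch that counts every k-base word in a list of sequences. It is not kmerCounter.py and it only counts, without testing for enrichment; the function name is made up for illustration. Note that the number of possible words grows fourfold with each added base (4^7 = 16,384 for 7-mers), which may be part of why longer motifs take much longer to search.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-base word across a list of DNA sequences."""
    totals = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        # Slide a window of width k along the sequence, one base at a time.
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            # Skip words that contain N or other ambiguous bases.
            if set(kmer) <= set("ACGT"):
                totals[kmer] += 1
    return totals


# Example call: the ten most common 6-mers
# print(count_kmers(list_of_sequences, 6).most_common(10))
```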
8. Choose a motif from the list, and modify the visualizeMotifs.py script to look for
this motif. Run this script to visualize the motifs:
$ python visualizeMotifs.py
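If you want to check the picture against the raw positions, here is a minimal Python 3 sketch that finds every occurrence of a motif in a sequence and draws a one-line text map. It is not visualizeMotifs.py (which draws its picture with Matplotlib), and the function names are made up for illustration.

```python
def find_motif(sequence, motif):
    """Return 0-based start positions of every (possibly overlapping) match."""
    sequence, motif = sequence.upper(), motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping matches are found too.
        start = sequence.find(motif, start + 1)
    return positions


def text_track(length, positions, motif_length, width=60):
    """Draw a one-line map: '-' for sequence, '#' where the motif lies."""
    track = ["-"] * width
    for start in positions:
        # Scale sequence coordinates down to columns of the text map.
        first = start * width // length
        last = (start + motif_length - 1) * width // length
        for column in range(first, last + 1):
            track[column] = "#"
    return "".join(track)


# Example call for one upstream sequence:
# hits = find_motif(sequence, "GACCAC")
# print(text_track(len(sequence), hits, 6))
```

The motif in the example call is only a placeholder; substitute one from your own kmerCounter.py output.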