Woods Hole – Zebrafish Genetics and Development
Bioinformatics/Genomics Lab
Ian Woods

Note: This document “wh_informatics_practical.doc” and supporting materials can be downloaded at the MBL ZF course website.

Setting the stage: These tasks each pertain to the mutation that we (virtually) mapped in lab. The curved body axis and U-shaped somites observed in these mutants are hallmarks of disrupted slow muscle development, and similar phenotypes are observed in mutants with defects in Hedgehog signaling. General descriptions of the four tasks are provided below. Specific protocols can be found following this introductory section. Each of you should choose (at least) one task to accomplish, and collaboration is encouraged.

Task 1: High resolution mapping, sequencing, and expression
Overview: From a rough map position, refine the critical interval via (virtual) high resolution mapping with additional markers. Query the critical interval in the zebrafish genome for potential candidate genes. Locate expression patterns online for these candidates. Design primers to sequence candidate genes for the mutagenic lesion or for additional SNPs to use in mapping.

Task 2: Clone candidate enhancer/promoter sequences to create a transgenic reporter line
Overview: Identify the translational start site of a gene of interest.
Approach #1: obtain ~6 kb of sequence upstream of this site. Design PCR primers that will amplify this region, and clone it in-frame with GFP in a tol2 expression vector.
Approach #2: identify BACs for use in creating reporter constructs via homologous recombination or gap repair.
Approach #3: identify evolutionarily conserved sequences from other organisms to uncover potential regulatory regions around your gene of interest.

Task 3: Morpholinos, rescue, and expression analysis
Overview: Find the zebrafish ortholog of your favorite gene. Find its location in the genome, locate the ATG, and identify the exon-intron boundaries.
Design two 25-base morpholino sequences that target (1) the ATG and (2) an exon-intron boundary. Identify an orthologous gene in another species for use in rescue experiments to control for morpholino specificity. Align this sequence with your morpholinos to determine degree of potential activity. Obtain a full-length clone of the zebrafish gene (via RT-PCR or clone collections) for use in overexpression experiments or expression analyses via in situ hybridization.

Task 4: Identifying zebrafish transcripts via batch sequence retrieval and BLAST
Overview: Mine OMIM (Online Mendelian Inheritance in Man) for genes related to the Hedgehog signaling pathway. Get amino acid sequences for these genes, and identify (via batch BLAST) the zebrafish orthologs for these proteins. Use a simple Perl script to parse the BLAST results to see where the genes are located in the zebrafish genome. Finally, find out where a few of these genes are expressed (via ZFIN).

Protocols:

Task 1: High resolution mapping, sequencing, and expression

1. The mutation we mapped in lab is flanked by SSLP/Z-markers Z11119 and Z15270. Your first job will be to view the region of the genome that is flanked by these two markers. Within this region, you can identify candidate genes and find additional markers that can be used to refine your map position, and thereby narrow the critical interval in the genome in which to look for the gene that is disrupted in the mutant.
Start at the ZFIN homepage.
http://zfin.org
Follow the link for “Genes / Markers / Clones”, type in Z11119, and hit ‘SEARCH’. On the resulting page, follow the ‘GENBANK’ link to obtain the sequence. On the page that follows, hit the ‘FASTA’ link, and copy the sequence onto the clipboard of the computer.
To find the location of this marker in the genome, we’ll go to the zebrafish genome browser hosted by EMBL:
http://ensembl.org
Follow the link for ‘BLAT’, paste the sequence of this marker into the window, select ‘Danio_rerio’ from the species menu, and click ‘RUN’. On the following page, click the ‘C’ (for contig) link next to the top hit. This takes you to a view of the genome, centered on this map marker. Note the physical location of this marker (the numbers in the genome window). Zoom out a bit to get a sense of the genomic region.
Repeat the above steps for Z15270. Where in the genome is your mutation likely located (answer in terms of numbers)?

2. Now let’s look at a candidate gene near one of your map markers. Find primer sequences for one of these genes (calca).
From your browser window for Z15270 in #1, locate calca and click on it. This takes you to the Ensembl page for this gene. Click on the ZFIN link, which takes you to the ZFIN page for this zebrafish sequence. Scroll down to the GenBank link and follow it. Locate the ‘FASTA’ link and click it, which takes you to a page where the sequence is located. Copy this sequence to the clipboard on your computer.
Now go to one of the many Primer3 websites.
http://bioinfo.ut.ee/primer3-0.4.0/
Paste in your sequence and select a product length of 500-550 (the comfortable limit for sequencing PCR products). Hit ‘Pick Primers’ and retrieve your primer sequences.
The next step would be to amplify gDNA from wildtype and mutant embryos via PCR, sequence the PCR products, identify sequence differences, and use this information to test for linkage between this gene and your mutation. We’ll go over how to do that in more detail below.

3. You collect hundreds of mutants for use in a high-resolution mapping panel, and test them for linkage to numerous markers from your region. You find that the SSLP Z15270 is the marker that is most tightly linked to your mutation, but some recombinants remain.
Query the zebrafish genome assembly to see a model of your region of interest (the assembly is pretty good on a large scale, but can be misleading in a local region).
Go to the Ensembl website.
http://www.ensembl.org
Follow the link for zebrafish, and then for BLAT search. Paste in the sequence for Z15270 that you collected in step 1, and click ‘Run’. On the resulting page, scroll down until you see a link for ‘C’, which stands for Contig, and click the link. This takes you to a view of the chromosome.
Click on the “Configure this page” link on the left hand side of the page. Here you’ll find all sorts of ‘tracks’ you can turn on and off to show different kinds of information. Try turning some additional features on. Save and close the configuration window by hitting the checkmark in the upper right, and zoom out in the browser as far as is allowed.

4. Exploring the genomic region – what do these genes do?
Click on some of the genes found in the region, taking you to the gene record page. Find and click the orthologs link on the left hand side of the page for each gene. What kind of gene is PDE3B?

5. Go back to the genomic view. Can you get a link to ZFIN for any of these genes?
Click on rras2, and follow through to ZFIN. Follow the link for Expression Data. Your mutant has defects in muscle specification – is the expression pattern of rras2 consistent with a role in muscle?

6. You decide to sequence rras2 in wildtype and mutant embryos to see if (1) you can find a SNP to map, to rule this gene out via recombination, and (2) you can find a change in the mutant sequence that might cause a loss-of-function phenotype. Design primers that will amplify a 600 bp PCR product that contains the first exon of rras2.
Find the rras2 entry in ZFIN (you are probably already there from step #5). Otherwise, go to the ZFIN homepage:
http://zfin.org
Click on Genes/Markers/Clones and enter rras2. On the ZFIN gene page, scroll down and follow the link to the GenBank/RefSeq RNA record.
Scroll down and note the coordinates of the coding sequence (CDS) in the entry. Copy the coding sequence onto the clipboard.
Go to the UCSC genome browser (you can also do this on the Ensembl browser, but the UCSC interface is a bit friendlier for this task):
http://genome.ucsc.edu
Click on the BLAT tab, and paste in your sequence. Select “Zebrafish” from the Genome pulldown menu, and click “Submit”. Follow the link for “details” on the first BLAT hit. Scroll up and down to check your results – what do the different color-codings mean in your sequence?
Select about 600 b of genomic sequence from which to design primers, then head to the Primer3 website:
http://bioinfo.ut.ee/primer3-0.4.0/
Paste in your sequence, choose a size range of 500-600 b (about the limit of a sequence trace from a PCR template), and click “Pick Primers”.

7. You PCR from genomic DNA of wildtype and mutant embryos, and sequence the PCR products. The sequencing results are as follows:

>wildtype_rras2_exon1
AGGCGGGAGTGTGAGCGCGCGCCCCCTCGCGCCCGCCGCGCGCACTGCCAGCACTGATTAGCCGTATCTTCCCCTCATCTTGCAGCACAGGCAGTCAGTCAGTGCCTGGTAGCGATTTG
GACGAGGGCGTATGGACTTGAAGCAGCAGTGTATGCATTTCCCACAGACTGTGGTCGTACTTTTCTCCTGTCGGACGGATTACCACTGAGTTGACACATAGCCCAAAAGCCGCTTCGCA
TTTTTTCCGCTGCATTTCTCTAACTGAAGGCCTGTCACAGAGTAAAGTGGCTCGGTGTGCGTGTGTTTAGACAGCGGAGCGAGAGCAGCAGTGTGTCCCCGATGGCTGGCTGGAAGGAC
GGCTCAGTGCAGGAGAAATATCGCCTGGTGGTCGTCGGAGGTGGTGGCGTCGGAAAATCAGCGTTAACCATCCAGTTTATCCAGGTAAGCGGATACATGGCGGAATGTTATGTGGTTTT
CGGCCCTTTAAAAAGATGTGAGGGTGTTGAGGAGAAATGCGTGGATCTTGCTCACAGAAATGGGGACCCCATGAGCGGAAAAGGGGGTTCAGGAATCCAAGCTAGGCCTGCGACACTTT
AAACC

>mutant_rras2_exon1
AGGCGGGAGTGTGAGCGCGCGCCCCCTCGCGCCCGCCGCGCGCACTGCCAGCACTGATTAGCCGTATCTTCCCCTCATCTTGCAGCACAGGCAGTCAGTCAGTGCCTGGTAGCGATTTG
GACGAGGGCGTATGGACTTGAAGCAGCAGTGTATGCATTTCCCACAGACTGTGGTCGTACTTTTCTCCTGTCGGACGGATTACCACTGAGTTGACACATAGCCCAAAAGCCGCTTCGCA
TTTTTTCCGCTGCATTTCTCTAACTGAAGGCCTGTCACAGAGTAAAGTGGCTCGGTGTGCGTGTGTTTAGACAGCGGAGCGAGAGCAGCAGTGTGTCCCCGATGGCTGGCTGGAAGGAC
GGCTCAGTGCAGGAGAAATATCGCCTGGTGGTCGTCGGAGGTGGTGGCGTCGGAAAATCAGCGTTAACCATCCAGTTTATCCAGGTAAGCGGATACATGGCGGAATGTTATGTGGTTTT
CGGCCCTTTAAAAAGATGTGAGGGTGTTGAGGAGAAATGAGTGGATCTTGCTCACAGAAATGGGGACCCCATGAGCGGAAAAGGGGGTTCAGGAATCCAAGCTAGGCATGCGACACTTT
AAACC

You wish to know if these sequences harbor any polymorphisms, and whether you can use these polymorphisms to facilitate your high resolution mapping.
Align the two sequences via BLAST2:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Follow the link for ‘nucleotide blast’, and check the box for ‘Align Two or More Sequences’. Note the points at which the two sequences differ.
Next, you’d like to see if the polymorphisms can be distinguished via restriction digest. Paste about 40 b of wildtype and mutant sequence flanking the SNP into the dCAPS website, leaving the “mismatches” field blank.
http://helix.wustl.edu/dcaps/dcaps.html
Are there enzymes available that will cut wildtype but not mutant sequence (or vice versa)? If a SNP does not create a restriction-site polymorphism, try entering “1” in the mismatch field – what does this accomplish?

8. Finally, do the SNPs result in changes in the coding sequence for rras2?

Task 2: Clone candidate enhancer/promoter sequences to create a transgenic reporter line

1. Eventually you identify the mutation as a lesion in the gene scube2. You wish to analyze the morphogenetic movements of cells expressing this gene during development in live embryos. To accomplish this, you decide to make a GFP reporter line that reflects the endogenous expression of this gene. First you decide to do the quick-and-dirty approach: you plan to clone genomic sequences upstream of the ATG of this gene and put them into a tol2 GFP expression vector.
Locate this gene in the genome and retrieve the coding sequence: go to the ZFIN homepage, click on Genes/Markers/Clones, and enter scube2 in the search box.
http://zfin.org
Follow the “gene” link to the ZFIN record for this gene, and scroll down the page.
Where (which chromosome) does ZFIN say this gene is located?

2. Next, you want to retrieve the nucleotide sequence of this gene to (1) compare it with the genomic sequence, and (2) identify the translational start site.
Scroll down the ZFIN page until you find the link for “RNA”. Follow this to the GenBank record for this gene. Scroll down to the sequence information at the bottom of the page. Where does the coding sequence (cds) begin and end within the complete mRNA transcript? Find the ATG in the nucleotide sequence.
Beginning at the ATG, copy about 100 b of nucleotide sequence to the clipboard and head to the Ensembl Genome Browser for Zebrafish.
http://www.ensembl.org/Danio_rerio/
Enter ‘scube2’ into the search box. On the resulting page, click on “Location.” In which direction is the gene transcribed (i.e., which strand is the coding strand)?
By high-resolution genetic mapping, you localized the SSLP Z15270 to be 0.1 cM from the mutation in scube2. Z15270 is on chromosome 7 at about 28,880,000. The genetic map length of the zebrafish genome is 3000 cM total, and the total physical length of the genome is 1.7 x 10^9 bp. Is the actual physical (basepair) distance between Z15270 and scube2 surprising? What factors might account for any differences in expected distance?
Zoom in and move the window so that the first exon encompasses the entire view (you can do so by drawing a rectangle around the first exon or by pressing the < and > buttons). Resize the window to include about 5 kb of upstream sequence (just add 5000 to the right-hand number in the location box). Would grabbing 5 kb of upstream sequence be a good idea to make a reporter construct for scube2? Why or why not?
You decide to retrieve all intergenic sequence and test various parts of it for enhancer activity. First, resize the browser window to just include this intergenic sequence. Click the link for “export data” on the left hand side of the page.
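The expected physical distance asked about in this step can be estimated from the genome-wide average recombination rate. The short Python sketch below does that arithmetic using only the figures quoted above (3000 cM, 1.7 x 10^9 bp, 0.1 cM, and the approximate Z15270 position); the function and variable names are illustrative. It assumes recombination is uniform along the genome, which is exactly the assumption the question invites you to challenge.

```python
# Figures quoted in the handout.
GENETIC_MAP_CM = 3000        # total genetic map length (cM)
GENOME_BP = 1.7e9            # total physical genome length (bp)
MARKER_POS_BP = 28_880_000   # approximate position of Z15270 on chromosome 7


def expected_bp(distance_cm, map_cm=GENETIC_MAP_CM, genome_bp=GENOME_BP):
    """Convert a genetic distance (cM) to bp using the genome-wide average.

    Assumes a uniform recombination rate, which real chromosomes do not have.
    """
    return distance_cm * genome_bp / map_cm


avg_bp_per_cm = expected_bp(1.0)
distance_bp = expected_bp(0.1)

# The mutation could lie on either side of the marker.
window = (MARKER_POS_BP - distance_bp, MARKER_POS_BP + distance_bp)

print(f"Average: {avg_bp_per_cm:,.0f} bp per cM")
print(f"0.1 cM is roughly {distance_bp:,.0f} bp")
print(f"Expected window on chr 7: {window[0]:,.0f} to {window[1]:,.0f}")
```

With these inputs the average works out to roughly 567 kb per cM, so 0.1 cM corresponds to roughly 57 kb. Compare that with the distance you measure in the browser.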
Pull down ‘soft’ repeat masking in the genomic FASTA options, and hit next. Then click the ‘text’ link to get the sequence.
Copy the DNA onto the clipboard, then go to the Primer3 website to design primers, trying to get as much of the input sequence as possible into the PCR product.
http://bioinfo.ut.ee/primer3-0.4.0/
To clone this bit of DNA, you would add appropriate restriction enzyme (or Gateway, or SLIC) sequences to the primers, PCR amplify, and hop into your favorite GFP expression vector.

3. You successfully make this vector and inject it into 1-cell stage embryos. The GFP expression in injected fish (aka ‘transient transgenics’) is promising – the pattern of GFP expression in a few fish roughly matches what is observed via in situ hybridization. In addition, many other tissues express GFP. Encouraged by this result, you raise the embryos to adulthood and cross them to identify founders. You identify ten founders, but none of your lines express GFP in a pattern consistent with the in situ data: expression in some tissues is absent, and many tissues express GFP where the gene is not normally expressed. How might you explain these results?
You decide to make a new reporter line by BAC recombination: you will obtain a large (~200 kb) chunk of genomic DNA that contains this gene, and replace the first exon of your target gene with GFP. Why might this strategy result in GFP expression that more accurately recapitulates the endogenous expression pattern?
You can use at least two approaches to identify a BAC that contains your favorite gene: (1) directly from the Ensembl genome browser, (2) via a BLAST search at NCBI.

3a. Go to the Ensembl home page for zebrafish:
http://www.ensembl.org/Danio_rerio/
Enter “scube2” in the search box and click “Go.” Follow the link for “Region in detail”. Look at the “Location” pane in the browser page – what is written in the blue bar in the center of the page?
If a region of the assembly is represented by a sequenced BAC, there will be a GenBank accession number (e.g. AL845363) in this blue bar. By contrast, if the region is represented by whole-genome shotgun traces, you will see something like “Zv9_scaffold12345” in the middle bar.
Turn on the BAC ends track (if not already on) by clicking “Configure this page” (GRC Alignments) on the left hand side. Zoom out until you can see connected BAC ends (represented by horizontal blue bars). Are there any good options for BACs that contain the scube2 coding sequence and putative regulatory regions?

3b. Retrieve the GenBank accession number for scube2 again from ZFIN, then go to the NCBI BLAST homepage:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Click “nucleotide blast”, enter the accession number in the search box, select “nr” from the pulldown menu, and type in “Danio rerio” in the organism box. Hit BLAST.
On the results page, genome sequence will be annotated as “Zebrafish DNA sequence from clone….” Are there any BAC clones that cover the entirety of the scube2 sequence?
You next decide to align the coding sequence with one BAC sequence to check for overlap. Note the accession number of the BAC, and go to the BLAST2 page:
http://blast.ncbi.nlm.nih.gov/ => select ‘nucleotide blast’ and click the ‘Align two sequences’ box
Enter the accession number for the coding sequence in the top box, and for the BAC in the bottom box, and hit “Align”. Where does the coding sequence (i.e. query) begin and end in the BAC sequence? Hit the ‘Dot Matrix’ view for a graphical look.
The next steps would involve creating a targeting vector for homologous recombination. In this case, you could PCR sequences (1 – 1.5 kb) that flank the region you wish to replace with GFP (generally the first exon), and clone these into a vector containing GFP and a selectable marker (e.g. kan) that is not present in the destination BAC.

4.
As a final step, you wish to identify candidate regulatory sequences by comparing genomic sequences from multiple teleost species. This can be accomplished via the VISTA webserver:
http://genome.lbl.gov/vista/index.shtml
First we will need to collect genomic sequences from other fish. In this example we will use three fish in which both genomic sequences and chromosome assignments are available: Tetraodon nigroviridis (Green-spotted pufferfish), Gasterosteus aculeatus (3-spined stickleback), and Oryzias latipes (medaka).
The whole-genome duplication event in the teleost lineage can make definitive orthology assignments a bit tricky. Clues to the correct ortholog can be gleaned from analyzing conserved syntenies, in which gene content on particular chromosomes has been retained after species divergence. A useful viewer of conserved syntenies in multiple organisms can be found at the Oxgrid website:
http://oxgrid.angis.org.au/oxg_table.html
By selecting the appropriate species comparisons, you can view chromosomes and chromosome segments in which gene content has been conserved. Which regions of the stickleback, medaka, and pufferfish genomes most closely match zebrafish chromosome 7?
Find the orthologs of Scube2 in these species by performing BLAT searches at the UCSC genome website, using the peptide sequence of Scube2 as the query. In the resulting browser page for each BLAT search, expand the window size to include ~10 kb of upstream and downstream flanking sequences. Note the orientation of the gene (+ or – strand), and export the genomic DNA via the DNA tab. Save these sequences onto your desktop. They may need to be edited to retain FASTA format – you can do this in Notepad, TextEdit, or via a command-line editor such as emacs.
Note – I’ve collected these sequences for you here, if you don’t want to do all of the searching:
http://faculty.ithaca.edu/iwoods/docs/wh/vista_scube2/
Next return to the VISTA homepage, choose mVISTA, select 4 sequences to align, and upload your sequences to the VISTA server. View both the visual alignments as well as the textual alignments. Since we collected about 10 kb of upstream sequence for each fish, the exons should begin to align at ‘10k’. Can you see where the exon sequences are? Are there conserved noncoding sequences present as well? You may want to relax the conservation parameters a bit (‘Adjust Conservation Parameters’), to see if you can get more sequences to show up as conserved.

Task 3: Morpholinos, rescue, and expression

In midline patterning, Hedgehog signals emanate from the notochord and ventral neural tube. Though loss of function of scube2 recapitulates many defects observed in Hedgehog pathway mutants, scube2 expression in the neural tube is confined to dorsal regions. This expression pattern is reminiscent of Boc, a gene involved in Hedgehog signaling in mouse. You wish to analyze the zebrafish ortholog of Boc at the level of expression and function.

1. As a first step, you search for the zebrafish ortholog of Boc. Start at the NCBI home page:
http://www.ncbi.nlm.nih.gov/
Select “Genes” from the Search menu, and type in “Boc.” Scroll down until you see the first mouse record, and follow its link. On the resulting page, scroll down to the bottom to find the link for the amino acid sequence. Follow this link to the GenPept record for the protein. Scroll down and copy the amino acid sequence into your clipboard.
Now go to the BLAST home page:
http://blast.ncbi.nlm.nih.gov
Follow the link for tblastn, paste in the sequence, select “nr” and type in Danio rerio. While this search is running, hit the back button and select “est_others” from the database menu.
These two simultaneous searches will ensure that all available coding sequences will be searched. You can access all of your ongoing BLAST searches via the “Recent Results” tab.
Check the BLAST result pages; if you have hits on the ‘nr’ search with accession numbers that start with ‘NM_’, you’re all set; these are typically full-length sequences that have been subjected to some degree of annotation and verification.

2. Next you’d like to obtain a clone of zebrafish boc to use in expression analysis via in situ hybridization. You can follow at least two avenues: (1) obtain a clone from a commercial source or another laboratory, or (2) make a clone via RT-PCR. Clones that are commercially available are labeled with an IMAGE ID. Ideally, you would like a full-length sequence that you could use for rescue or overexpression experiments, but partial sequences are fine for generating in situ probes.
Search the UniGene database using the accession number of your top ‘NM’ hit.
http://www.ncbi.nlm.nih.gov/unigene/
Scroll down to locate sequences that are annotated with IMAGE IDs. Do any of the IMAGE clones represent full-length cDNAs (compare with the ‘NM’ sequences)? You can order IMAGE-ID’d clones from Open Biosystems:
http://www.openbiosystems.com

3. Next, design primers that will allow you to amplify a full-length clone via RT-PCR for mRNA overexpression/rescue experiments.
Follow the “NM_XXX” link from the UniGene page for boc. Scroll down, highlight and copy the nucleotide sequence, then paste it into Primer3.
http://bioinfo.ut.ee/primer3-0.4.0/
Choose a size range that is sufficient to include the entire cds. Do your primer sequences flank the translational start and stop codons?

4. You’d also like to design morpholino oligonucleotides (MOs) that target the translational start site and a splice junction. First, compare the coding sequence with the genomic sequence to find the ATG and the exon-intron boundaries.
There are several ways to do this, including (a) the GenBank record, (b) exporting sequence from a genome browser (Ensembl or UCSC), and (c) BLAST searches on genomic traces or sequenced BACs.

4a. The GenBank record will often have 5’ UTR sequences that can be used to design an ATG-binding MO. Where does the coding sequence of boc begin? Check the GenBank record for NM_001005393 and look for ‘cds’.

4b. Go to the Ensembl BLAST page and paste in the sequence for boc.
http://www.ensembl.org/Multi/blastview
In the ‘Select Species’ box, select Danio rerio. Examine the alignment overview on the results page. Is the whole gene aligned to the genome?
You can now zoom in on the first exon of boc, extract sequence, and design your morpholino. To get to the first exon, look for the lowest numbers in the ‘Query’ column on the BLAST result page. Or, zoom out to the whole gene, draw a rectangle around the first exon, and click ‘Jump to Region’. Similarly, you can design splice-blocking morpholinos by finding the exon-intron boundaries in the browser.

4c. Go back to the GenBank record for zebrafish boc (from Step 2), and copy the 5’ part of the coding sequence plus about 30 b of 5’ UTR sequence to the clipboard. Next, BLAST this sequence against the nr database:
http://blast.ncbi.nlm.nih.gov/ (select ‘nr’ and Danio rerio)
Examine the BLAST hits and attempt to find a gDNA hit. How are the exon-intron boundaries depicted in the BLAST results? Download a BAC that contains the third exon-intron boundary. Locate the coding sequence in this trace by BLASTing the two sequences against each other.
http://blast.ncbi.nlm.nih.gov/ (select blastn, and check box for ‘compare two sequences’)
If the orientation of the BAC is reversed compared to the coding sequence, you can create a reverse complement online:
http://www.bioinformatics.org/SMS/rev_comp.html
From this alignment, select a 25 b region from the WGS trace surrounding the exon-intron boundary and generate a MO.
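The reverse-complement step above is simple enough to do locally. The Python sketch below shows one way to flip a sequence and to turn a chosen 25-base target site into an antisense oligo sequence, on the assumption that a morpholino is written as the reverse complement of the RNA it binds. The function names and the toy sequence are illustrative placeholders, not boc sequence, and the sketch does not check GC content or self-complementarity.

```python
# Translation table for DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def antisense_oligo(target, start, length=25):
    """Return the antisense oligo for target[start:start + length].

    `target` is sense-strand sequence (mRNA written as DNA, or genomic
    sequence in the orientation of the transcript).
    """
    site = target[start:start + length]
    if len(site) != length:
        raise ValueError("target site runs past the end of the sequence")
    return reverse_complement(site)


# Toy sequence for illustration only (NOT real boc sequence).
toy_transcript = "GGATCCACCATGGCTAGCAAGGGCGAGGAGCTGTTC"
atg = toy_transcript.find("ATG")

mo = antisense_oligo(toy_transcript, atg)
print("target site:", toy_transcript[atg:atg + 25])
print("oligo 5'->3':", mo)
```

For a splice-blocking oligo, the same function can be pointed at a window that straddles the exon-intron boundary in the genomic sequence instead of starting at the ATG.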
How can you test to see if your morpholinos are successfully inhibiting function of your target gene?

5. Finally, you would like to control for specificity of your morphant phenotype by rescue via injection of an mRNA to which your morpholinos will not bind. You head to the pet store and acquire a Green-spotted pufferfish (Tetraodon nigroviridis), grind it up in liquid nitrogen, and extract total RNA. Your plan is to identify the ortholog of boc in Tetraodon, clone this sequence via high-fidelity PCR, generate mRNA via in vitro transcription, and inject this mRNA into morpholino-treated zebrafish.
First, return to the mouse GenPept page for Boc (from Step #1). Copy the sequence into the clipboard, and return to the UCSC BLAT page.
http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
Paste in the sequence, and select Tetraodon from the species menu. Hit Submit, and then follow the link for browser view. Zoom out until the full Tetraodon sequence is shown. How does your BLAT query compare with the Tetraodon genome?
Click on the Tetraodon Gene within the browser window – this will take you to a page in which it will be possible to export the predicted gene and peptide sequences. Copy the amino acid sequence, and paste it into blast2, along with the mouse Boc peptide sequence.
http://blast.ncbi.nlm.nih.gov/ (select blastp, and check box for ‘compare two sequences’)
Is the full mouse sequence matched by the Tetraodon sequence? How does the Tetraodon sequence compare with zebrafish boc?
Which chromosome contains boc in Tetraodon? Does this make sense via conserved synteny? Check the OxGrid website:
http://oxgrid.angis.org.au/oxg_table.html
The next step is to design primers to amplify Tetraodon boc via RT-PCR. The predicted gene sequence does not include 5’ and 3’ untranslated regions, which does not leave much wiggle room for designing effective primers. You can collect putative UTR sequences from the genome.
Go back to the UCSC BLAT page for Tetraodon, and paste in the predicted cDNA sequence. On the results page, follow the link for “details” and scroll down until you see the alignment with genomic cDNA. Collect about 80 b of genomic sequence up and downstream of the Tetraodon boc gene, and make a new sequence that includes these putative UTR sequences. Enter this sequence into Primer3, and pick primers that will flank your coding sequence. Next, you’ll amplify by high-fidelity PCR, clone into an expression vector, and verify the clone by sequencing.
How do the morpholino sequences you designed match up with the Tetraodon boc sequence? Compare with blast two sequences: go to NCBI BLAST, click on blast nucleotide, and click on ‘align two sequences’. Paste your morpholino in one box, and the Tetraodon boc sequence in the other. Is there a significant match? Will Tetraodon boc mRNA likely escape morpholino-induced knockdown?

Task 4: Batch sequence retrieval and BLAST (just a bit advanced)

The mutation we have been studying has a phenotype similar to mutations in genes encoding members of the Hedgehog pathway. We are going to find every human protein that has been associated with Hedgehog signaling, identify their putative zebrafish counterparts via a batch BLAST search, and determine whether any of these putative zebrafish Hedgehog-associated genes is located near our mapped mutation.

A. Collect sequences for human proteins with hints of hedgehog interactions

1. Go to the NCBI website:
http://www.ncbi.nlm.nih.gov/
2. Pull down the OMIM database.
3. Enter SHH in the search box – finds any record that mentions SHH.
4. Select “200” from the Display pulldown menu – finds all proteins mentioned in the OMIM descriptions.
5. Select ‘Protein’ from the ‘Find Related Data’ menu on the right, and click ‘Find Items’.
6. Limit the list to Human proteins by clicking ‘Homo sapiens’ in the Top Organisms menu.
7.
Select “FASTA (text)” from the Display menu – this retrieves the amino acid sequences.
8. Select “200” from the Display menu.
[you can skip steps 9-11 by downloading the sequence file from my website:]
http://faculty.ithaca.edu/iwoods/docs/wh/informatics_problem/
9. Build up a file of FASTA sequences and save to your desktop.
10. Repeat for the other pages of results (getting the sequences 200 at a time).
11. Open a terminal window and move the file to the desktop.
    ZEBRAIM02:~ mbl$ cd ~/Downloads
    ZEBRAIM02:Downloads mbl$ ls
    About Downloads.lpdf Dropbox 2.0.26.dmg shh_human_proteins.doc Dropbox 2-1.0.26.dmg jre-7u25-macosx-x64.dmg
    ZEBRAIM02:Downloads mbl$ mv shh_human_proteins.doc ~/Desktop/
    ZEBRAIM02:Downloads mbl$

B. Performing a local BLAST search on a batch of sequences

1. Go to the BLAST homepage at http://blast.ncbi.nlm.nih.gov/Blast.cgi
2. Click on the “help” tab.
3. Follow the link for “Download BLAST Software and Databases”.
4. Follow the link for the ftp site and connect as ‘Guest’, then click the "blast" link appropriate to your platform (e.g. macosx-universal).
5. The download will result in a folder saved somewhere on your computer (depending on your preferences).
[OPTIONAL, CAREFUL!]: update your command line path (in your bash .profile on Mac OS X) to point to the blast executables
    cd ~
    emacs .profile
[add a line at the end of the profile that says]
    PATH="path-to-blast-commands:$PATH"
    e.g. PATH="/pathToExecutables/:$PATH"
6. Now we’ll download all current zebrafish transcripts (known and predicted) from Ensembl. While in the same folder as your protein sequences, connect via the terminal or your web browser to the Ensembl ftp site.
    ftp ftp.ensembl.org
(or type ‘ftp.ensembl.org’ into a browser window; log in as “anonymous” with your email address as password)
    cd pub/release-72/fasta/danio_rerio/cdna
7. Fetch all the sequences and disconnect:
    mget Dan*
(answer “y” at the prompts)
    bye
8.
Unzip the sequences and concatenate them into one file, and move this file into your BLAST folder:
    gunzip Danio*
    cat Danio_* >> Zv9_release72_transcripts.fa
    mv Zv9_release72_transcripts.fa ~/Desktop/blast/
(change if needed to point to your blast folder)
9. Make a BLASTable database from these zebrafish sequences (type ‘./bin/makeblastdb -help’ for options of makeblastdb):
    cd ~/Desktop/blast/
(change as needed to point to your blast folder)
    ./bin/makeblastdb -in Zv9_release72_transcripts.fa -dbtype nucl
[you may need to type ./bin/makeblastdb ... depending on whether you’ve updated your PATH to point to the blast executables]
10. Now you’re ready to do a blast search. In your terminal app, navigate to the folder created when you downloaded blast. You can always type ‘[command] -help’ (e.g. ‘tblastn -help’) for blast options:
    mv ~/Desktop/shh_human_proteins.doc .
    ./bin/tblastn -query shh_human_proteins.doc -db Zv9_release72_transcripts.fa -num_descriptions 2 -num_alignments 2 -evalue 1e-5 -out shh_v_zv9transcripts.tblastn &
[It will take a while (~1h) to compare our ~2000 human proteins with the entire database of known and predicted zebrafish transcripts. To save time, get the blast output “shh_v_zv9transcripts.tblastn” from the course website]
You can check on the progress of the blast search:
    less shh_v_zv9transcripts.tblastn
(type ‘q’ to quit out of less)
11. When the BLAST search is finished, parse it with one of my hacktastic perl scripts (wh_blast.pl) – downloaded from my website.
    perl wh_blast.pl shh_v_zv9transcripts.tblastn > blast_output.csv
Import the results into Excel (comma-delimited) and sort by chromosome and map position – does anything map to Chr 7 near Z15270 (28,880,000)?
12. Choose ENSDART00000089574. Look at Ensembl record:
http://www.ensembl.org/Danio_rerio/Transcript/Transcript?t=ENSDART00000113313
13. What is the cDNA sequence of this gene (click on ‘cDNA’ on the left part of the page)? What is the function of this gene?
Click on the ‘Gene’ record, and then follow the ‘Orthologs’ link (left of page) for some clues.
14. Find expression data (if it exists): get the sequence from the Ensembl transcript page. Find the GenBank accession number for this sequence by BLAST (usually on nr, zebrafish; but you can do an EST BLAST if there is no RefSeq record). Look up the ZFIN record for this accession number. Or, if you’re lucky, there will be a link to ZFIN straight from the Ensembl Gene Record page.
Go to the BLAST homepage:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Select “nucleotide blast”. Select the “nr” button, and type Danio rerio in the organism box. Paste in your sequence and hit “BLAST”.
Take the accession number of the top hit and do a search at ZFIN.
http://www.zfin.org
Click on “Genes/Markers/Clones” and do the search.
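As an alternative to sorting in a spreadsheet in step 11 of Task 4B, the parsed hits can be filtered for proximity to Z15270 with a few lines of Python. This is a sketch under stated assumptions: the real column layout of blast_output.csv is not described in this handout, so the column names (query, transcript, chromosome, position), the 2 Mb window, and the four sample rows below are all hypothetical placeholders that would need to be adapted to the actual file.

```python
import csv
import io

MARKER_CHR = "7"           # chromosome of Z15270 (from the handout)
MARKER_POS = 28_880_000    # approximate position of Z15270 (from the handout)


def hits_near_marker(rows, chrom=MARKER_CHR, pos=MARKER_POS, window=2_000_000):
    """Return rows on `chrom` within `window` bp of `pos`, nearest first.

    Each row is a dict with (assumed) keys 'chromosome' and 'position'.
    """
    near = [
        r for r in rows
        if r["chromosome"] == chrom and abs(int(r["position"]) - pos) <= window
    ]
    return sorted(near, key=lambda r: abs(int(r["position"]) - pos))


# Placeholder rows, invented purely to exercise the function.
SAMPLE = """query,transcript,chromosome,position
protein_A,transcript_A,7,29100000
protein_B,transcript_B,7,61000000
protein_C,transcript_C,12,28900000
protein_D,transcript_D,7,28500000
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for hit in hits_near_marker(rows):
    offset = int(hit["position"]) - MARKER_POS
    print(hit["query"], hit["transcript"], f"{offset:+,} bp from Z15270")
```

To run it on real output, replace the `io.StringIO(SAMPLE)` handle with `open("blast_output.csv", newline="")` and rename the keys to match the file's header row.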