Download Blue Line Walk-through

Blue Line Walkthrough

A. Examining DNA Sequence

Example Sequences: rbcL sample 1
Tool(s): Sequence Viewer
Concept(s): DNA Barcoding, Sanger DNA Sequencing

DNA Barcoding: The process of identifying species by examining DNA sequence.
rbcL: A gene coding for the large subunit of the enzyme RuBisCo, and one of the important loci for species identification in plants.
Sanger DNA Sequencing: A method of DNA sequencing that uses fluorescently labeled dideoxynucleotide terminators to generate the sequence of a DNA sample.
Quality (Phred) Score: Nucleotide calls read from sequencing output files are assigned a quality score of 10, 20, 30, 40, or 50. A score of 50 means that the base is called with 99.999% accuracy. A score of 20 (99% accuracy) is the cut-off for high-quality sequence; scores below 20 indicate low-quality calls.

I. Create Project
1. Log in to DNA Subway (dnasubway.iplantcollaborative.org).
2. Click ‘Determine Sequence Relationships’ (Blue Square).
3. Select project type ‘Barcoding: rbcL.’
4. Select sample sequence ‘rbcL sample 1.’
5. Provide your project with a title, then click ‘Continue.’
   Alternatively, if you have sequenced your DNA using your Genewiz account, select ‘Import trace files from DNALC,’ then select the sequences to import.

II. View Sequence
6. Click ‘Sequence Viewer’ to show a list of your sequences.
7. Click on a sequence name to show that sequence’s trace file.

Questions:
Q.1: What do you notice about the electropherogram peaks and quality scores at nucleotide positions labeled “N”?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
Q.2: Where do the ‘N’s in the sequence tend to be distributed, and why?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Additional Investigation: Learn more about Sanger sequencing at:
http://www.dnalc.org/view/15479-Sanger-method-of-DNAsequencing-3D-animation-with-narration.html

B. Assembling and Editing DNA Sequence

Example Sequences: rbcL sample 1 from Part A
Tool(s): Sequence Trimmer, Pair Builder, Consensus Builder
Concept(s): Sanger DNA Sequencing, bidirectional reads

Bidirectional sequence: DNA sequence generated by sequencing a DNA strand in both the forward and reverse orientations.
Consensus sequence: A single sequence that summarizes the agreement between two or more DNA sequences.

I. Trim 5’/3’ ends
1. Click ‘Sequence Trimmer.’
2. Click ‘Sequence Trimmer’ again to examine the changes made to the sequence.

II. Pair Builder
1. Click ‘Pair Builder.’
2. Select the check boxes next to the sequences that represent bidirectional reads of the same sequence set. Alternatively, select the ‘Try Auto Pairing’ function and verify the pairs generated.
3. As necessary, reverse complement sequences that were sequenced in the reverse orientation by clicking the ‘F’ next to the sequence name. The ‘F’ will become an ‘R’ to indicate that the sequence has been reverse complemented.
4. Save the created pairs.

III. Consensus Builder
1. Click ‘Consensus Builder.’
2. Click ‘Consensus Builder’ again to examine the created consensus files. Any differences between the two reads will be highlighted in yellow in the consensus builder.
3. Make any needed edits, and save your changes.

Questions:
Q.3: Sequence identified by DNA Subway as low quality is marked by a symbol. What problems might it cause to generate a consensus sequence from low-quality DNA sequence?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

C. Matching Sequence to Databases

Example Sequences: rbcL sample 1 from Part B
Tool(s): BLAST, Upload Data, Reference Data
Concept(s): BLAST Searches, GenBank, BOLD Database

BLAST: Basic Local Alignment Search Tool (BLAST) is an algorithm that searches databases of biological sequence information (e.g. DNA, RNA, or protein sequence) and returns matches. The BLASTN program is specific to nucleotide data.
GenBank: The largest database of publicly available nucleotide sequences. As of 2011 the database contains well over 100 billion nucleotides of generated sequence data.
BOLD: Barcode of Life Online Database (BOLD) is an online repository for sequence data generated by DNA barcoding projects worldwide.

I. Check for matches in GenBank
1. Click ‘BLASTN.’
2. Click the ‘BLAST’ link for the sequence of interest.
3. Examine the BLAST matches for candidate identification. Clicking the species name given in a BLAST hit will also give additional information/photos of the listed species.
4. If desired, select the check box next to any hit, and select ‘Add BLAST hits to project’ to add the selected sequences to your project.

II. Upload Data (optional)
1. If desired, click ‘Upload Data’ to import additional data into your project. You will need to repeat the steps in the ‘Assemble Sequences’ stop on DNA Subway.

III. Reference Data (optional)
1. Click ‘Reference Data.’
2. Select one or more groups of sequences from the selected reference samples of rbcL sequence.
3. Select ‘Add ref data’ to add the data to your project.
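
Illustration (optional): The idea of scoring a query against database sequences and ranking the hits can be sketched in a few lines of Python. This is only a toy example under stated assumptions: it uses the exact Smith-Waterman local alignment recurrence rather than BLAST's faster heuristic search, the scoring values (match=2, mismatch=-1, gap=-2) are arbitrary choices, and the function names and the short sequences are hypothetical, not taken from DNA Subway or GenBank.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (not BLAST's heuristic)."""
    prev = [0] * (len(subject) + 1)   # previous row of the scoring matrix
    best = 0
    for q in query:
        curr = [0]                    # first column is always 0
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def rank_hits(query, database):
    """Return (score, name) pairs for every database entry, best score first."""
    scored = [(local_alignment_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    toy_database = {"hit_a": "TTACGTTT", "hit_b": "CCCC"}
    for score, name in rank_hits("ACGT", toy_database):
        print(name, score)
```

Note that the top-ranked entry is simply the best match among the sequences that happen to be in the database being searched.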

Questions:
Q.4: BLAST will return the closest matches present in GenBank. Will you be able to identify an unknown species using BLAST alone? Why or why not?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Additional Investigation: See the laboratory “Using Barcoding to Identify and Classify Living Things” (http://www.urbanbarcodeproject.org/files/Barcoding_Protocol.pdf).

D. Building Phylogenetic Trees

Example Sequences: rbcL sample 1 from Part C
Tool(s): Select Data, MUSCLE, PHYLIP NJ, PHYLIP ML
Concept(s): Sequence alignment, phylogenetics

Multiple Alignment: A (usually) computer-generated alignment of sequences. Under the assumption that all sequences within the alignment are similar (e.g. of a common genetic origin, from a common locus, in the same strand orientation), gaps are introduced where misalignments (e.g. insertions, deletions, or missing data) appear.

I. Select Data for Alignment
1. Click ‘Select Data.’
2. Select any and all sequences you wish to add to your tree.
3. Click ‘Save Selections.’

II. Generate Multiple Sequence Alignment
1. Click ‘MUSCLE.’

Phylogenetic tree: A diagram that represents inferred evolutionary relationships between organisms. As applied here, sequences are displayed with branch lengths that are proportional to the differences between the sequences.
PHYLIP NJ and PHYLIP ML: Tree-building algorithms based on the “Neighbor Joining” and “Maximum Likelihood” methods, respectively.
See: http://www.icp.be/~opperd/private/neighbor.html and http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html

2. Click ‘MUSCLE’ again to open the sequence alignment window.
3. Click ‘Trim Alignment.’
4. Examine the alignment to help answer Question 5.

III. Construct Phylogenetic Tree
1. Click either ‘PHYLIP NJ’ or ‘PHYLIP ML’ to run the tree construction algorithm.
2. Click the button for the algorithm you chose again to launch a viewer for the multiple alignment and tree.

Questions:
Q.5: What relationship do you see between sequences that have more mutations (align less well with the majority of sequences) in the alignment and the length of a sequence’s branch on the tree?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
Q.6: Do you see differences between the phylogenetic trees generated by the Neighbor Joining and Maximum Likelihood methods?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Biological Concepts

Genomes
• A genome is an organism’s entire complement of DNA.
• DNA is a directional molecule composed of two anti-parallel strands.
• The genetic code is read in a 5’ to 3’ direction, referring to the 5’ and 3’ carbons of deoxyribose.
• Eukaryotic genomes contain large amounts of repetitive DNA, including simple repeats and transposons.
• Transposons can be located in intergenic regions (between genes) or in introns (within genes).
• Genes and transposons are directional, and can be encoded on either DNA strand.
• Repeats are non-directional and, in effect, occur on both strands.
• Transposons can mutate like any other DNA sequence.

Genes
• Protein-coding information in DNA and RNA begins with a start codon, is followed by codons, and ends with a stop codon.
• Codons in mRNA (5’-AUG-3’, etc.) have sequence equivalents in DNA (5’-ATG-3’, etc.).
• The DNA strand that is equivalent to mRNA is called the “coding strand.” The complementary strand is called the “template strand,” because it serves as the template for synthesizing mRNA.
• Non-spliced genes, which are characteristic of prokaryotes, are also found in eukaryotes.
• Even in a spliced gene, the protein-coding information may be organized as an Open Reading Frame (ORF).
• Most eukaryotic genes are spliced, whereby intervening segments (introns) are removed and the remaining segments (exons) are spliced together.
• Splice sites (exon-intron boundaries) have sequence patterns that are recognized by the splicing apparatus (spliceosome).
• Gene prediction programs use consensus sequences around splice sites to predict exon-intron boundaries.
• Over 90% of eukaryotic introns have “canonical splice sites,” whereby introns begin with GT (mRNA: GU) and end in AG (mRNA: AG).
• The protein-coding sequence of a eukaryotic mRNA (or gene) is flanked by 5’- and 3’-untranslated regions (UTRs); introns can be located in UTRs.
• In most eukaryotic genes, transcripts are alternatively spliced, yielding different mRNAs and proteins.
• UTRs hold information for the half-lives of mRNAs and for regulatory purposes.
• Gene > mRNA > CDS.
• CDS = nucleotides that encode amino acid sequence.
• In mRNA: CDS = ORF.

BLAST Searches
• Basic Local Alignment Search Tool (BLAST) searches databases for matches to a query DNA or protein sequence.
• Gene or protein homologs share sequence similarities due to descent from a common ancestor.
• Biological evidence is needed to edit and confirm gene models predicted by computer algorithms.
• Biological evidence is most often derived from mRNA transcripts (ESTs, cDNAs, RNAseq).
  Protein sequence data are also available, but are much less common.
• Many ESTs and cDNAs are disrupted by “introns” when they are aligned against genomic DNA.
• ESTs and cDNAs may be incomplete.
• The BLAST algorithm does not resolve intron/exon boundaries.
• The BLAST algorithm is not restricted to detecting sequences that fully match a query (“global” matches) but, instead, matches query subsequences as well (“local” matches).
• The BLAST algorithm matches sequences to the fullest extent possible and, often, realigns the same sequence twice.

Web Resources for Genome Annotation

A. Major Plant Genome Hubs:
DOE JGI: http://www.phyotozme.net
University of Iowa: http://www.plantgdb.org/
CSHL: http://www.gramene.org/
ENSEMBL: http://plants.ensembl.org/index.html
NCBI: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
NCBI: http://www.ncbi.nlm.nih.gov/mapview/

B. Some Plant Genome Portals:
Arabidopsis, TAIR: http://www.arabidopsis.org/
Corn: http://www.maizesequence.org/index.html
Grape: http://www.cns.fr/externe/GenomeBrowser/Vitis/
Poplar: http://genome.jgi-psf.org/poplar/poplar.home.html
Rice: http://rice.plantbiology.msu.edu/
Tomato: http://solgenomics.net/about/tomato_sequencing.pl

C. Browsers:
Ensembl: http://www.ensembl.org
GBrowse: http://gmod.org/wiki/GBrowse
JBrowse: http://jbrowse.org/
UCSC Browser: http://genome.ucsc.edu
xGDB: http://brendelgroup.org/bioinformatics2go/bioinformatics2go.php

D. Other Resources:
Course download site: http://gfx.dnalc.org/files/evidence
DynamicGene: http://www.sanger.ac.uk/resources/software/artemis/
GeneBoy: http://www.dnai.org/geneboy/
BioServers: http://www.bioservers.org/bioserver/
mRNA/gDNA: http://www.ncbi.nlm.nih.gov/spidey/
mRNA/gDNA: http://pbil.univ-lyon1.fr/sim4.php
Splice site predictor: http://www.fruitfly.org/seq_tools/splice.html
Promoter predictor: http://www.fruitfly.org/seq_tools/promoter.html
