Bioinformatics - Sequences and Computers

Hort 503: Bioinformatics for Research
Exercise I

The information for the set-up of living organisms is stored in the sequences of nucleotides in DNA. DNA serves two purposes: to provide the information during the life cycle of a cell and to pass it on to offspring. The discovery of genes and the genetic code raised the hope of being able to read the information stored in our genes, and today we are able to do so: massive progress in sequencing technology has delivered entire genomes to our fingertips. The era of genomics and proteomics has opened up the opportunity to go beyond the analysis of single genes and proteins, towards understanding the interactions between all components of genomes and proteomes. From trying to comprehend life by cutting it into smaller and smaller pieces, we are beginning to unveil it in the same way it has been functioning since its beginning: as a whole.

Computer scientists are important allies for biologists in the struggle to understand the information in DNA. On the one hand, the massive amount of sequencing data requires new tools (computers and programs) to generate, proofread, store, and access these data. On the other hand, the deciphering of genomes necessitates the development of new hardware and software which allow us to detect genes, determine relationships between them, and study their expression to help us understand the basis of development and disease.

Bioinformatics provides the tools to understand the information in biological molecules - DNA, RNA, and proteins. The two major work routines of bioinformaticists are: (1) comparing sequences in order to identify similarities, and (2) analyzing sequence composition in order to identify patterns. The first routine is applied to identify relationships between genes, proteins, or organisms and regulatory regions in genomes. The second routine is applied to the prediction of genes and regulatory regions in genomes.
In protein analysis it is used to determine the structure and function of proteins, e.g. in motif identification.

Having read the above, go to the website listed below and click on the bioinformatics link: http://www.dnalc.org/bioinformatics/2003/. Read and complete the exercises using the web links provided.

Sequence Analysis: Introduction

Learn how information-bearing sequences are different from random sequences and become familiar with bioinformatics tools for the analysis of sequences.

Language and DNA use sequences to communicate information. The sequence elements in language are letters and punctuation; in DNA they are the nucleotides. Just as the letters in books contain information that is realized by readers, the sequence of nucleotides in DNA contains information that is realized by the gene expression machinery of cells. Just as documents may provide information that stimulates readers to act on something (like adopting a new lifestyle), genes contain the information for the building of proteins that perform specific actions within cells.

Living organisms are not the only source for DNA sequences. DNA sequencing and computer technologies have made it possible to "isolate" DNA from databases. In addition, many experimental routines such as restriction digests and hybridizations can now be performed electronically or "in silico". Just as restriction enzymes recognize specific target sequences, so do search algorithms. Algorithms then determine the lengths of the DNA stretches between restriction sites and display the result of the "digest" in a graphical form that resembles the banding pattern of a DNA gel. The electronic equivalents of Southern blotting and hybridization are search algorithms that search databases for sequences that are similar to the sequence of a DNA "probe". Matches and probe then "hybridize" by way of alignments. However, in silico molecular biology provides the advantage of data-linking.
With the click of a mouse button, the detection of a hybridizing fragment can be extended into identifying:
(1) Where exactly in a genome a match is located.
(2) Whether there are similar sequences elsewhere in the genome.
(3) Whether homologs exist in other organisms.
(4) Whether it has been found to be linked to disease.
(5) Whether it contains a gene, transposon, or other important features.

Whether a DNA sequence exists as a chemical or in a database does not affect its nucleotide order. Thus, the information stored in this sequence remains the same, whether it occurs in the form of a molecule or as a stretch of letters. It is the realization of this information that requires DNA in its molecular form, e.g. to synthesize proteins which then perform certain actions. In the following segments you will learn how DNA stores information and how the function of DNA sequences can be determined through electronic ("in silico") sequence analysis.

Sequence Analysis: Outline

The composition of sequences yields clues about their genesis as well as their potential function. This is true for languages, for secret codes, and for DNA sequences. This chapter will follow the outline provided on the new website of the Dolan DNA Learning Center, DNA Interactive or DNAi, and use Gene Boy, an intuitive bioinformatics tool that was developed by non-bioinformaticists for biologists.

1. Encoding Meaning
2. Random vs. Genic vs. Intergenic DNA
3. Patterns and Consensus

Bioinformatics - Sequence Analysis: Encoding Meaning

Language and computer codes provide great examples to examine the encoding of information using sequences - of 26 elements (letters) in language and of 2 elements (0, 1) in computer language. Work through the segment 'meaning' in the 'Genome Mining' section on DNAi to develop an understanding of how language encodes meaning and how the analysis of sequence composition can give valuable clues about the information content of a text or a piece of computer code.

1. Go to the DNAi website (http://www.dnai.org/index.html).
2. Open 'Genome'.
3. Select 'Genome mining'.
4. Read through the text giving you an overview.
5. Select 'meaning'.
6. Work through the consecutive pages, using the forward icon wherever necessary.

Respond to the questions below:

1. Click forward. Frame #2: Does the text convey meaning?
2. Click forward. Frame #3: Was there anything that may have kept you initially from answering this question correctly?
3. Stay on frame. Frame #3: Do you think that the text contains the letters of the English alphabet in equal proportions?
4. Click forward. Frame #4: Click on each of the four sample letters to find the actual counts: E= J= R= Y=
5. Stay on frame. Frame #4: Given that the total number of letters in this text is 1,064 (see frame #2), express your results in percent: E= J= R= Y=
6. Click forward. Frame #5: Do your results match the results on display?
7. Stay on frame. Frame #5: In what percentages would you assume the text contains the other 22 letters of the English alphabet?
8. Click forward. Frame #6: What can you say about the distribution of the ratios for each letter in this meaningful text? Would you call them rather equal or rather dispersed?
9. Click forward. Frame #7: How often would you think letters occur in random text, i.e. a sequence of letters that was composed under the rule that none of the letters should be treated preferentially?
10. Click forward. Frame #8: What can you say about the distribution of the ratios for each letter in this random text? Would you call them rather equal or rather "all over the place"?
11. Stay on frame. Compare the results of the frequency analysis for the two types of text. Can you think of a method to use frequency analysis to determine whether a code may contain meaning without trying to read it?
12. Click forward. Frame #9: Does this result confirm your suggestion? Or is it different? Discuss your answer with the rest of the class.
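The letter counts and percentages asked for in the questions above can also be computed with a few lines of code. The following Python sketch is only an illustration: the function names `letter_percentages` and `spread` and both sample texts are assumptions made here, and they are not part of DNAi or Gene Boy.

```python
import random
import string
from collections import Counter


def letter_percentages(text):
    """Return the percentage of each letter a-z among the letters in text."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return {
        ch: (100.0 * counts[ch] / total if total else 0.0)
        for ch in string.ascii_lowercase
    }


def spread(percentages):
    """Gap between the most and least frequent letter, in percentage points."""
    return max(percentages.values()) - min(percentages.values())


if __name__ == "__main__":
    # Hypothetical sample texts, not the ones shown on DNAi.
    english = ("Language and DNA use sequences to communicate information. "
               "The sequence elements in language are letters and punctuation.")
    rng = random.Random(0)  # fixed seed so the run is repeatable
    scrambled = "".join(rng.choice(string.ascii_lowercase) for _ in range(5000))

    for name, text in (("english", english), ("random", scrambled)):
        pct = letter_percentages(text)
        sample = {ch: round(pct[ch], 1) for ch in "ejry"}
        print(name, sample, "spread:", round(spread(pct), 1))
```

A large spread between the most and the least frequent letter hints at a text that follows rules, while a small spread is what one would expect from letters drawn without preference. As the next part of the exercise shows, single-letter counts alone can still be misleading.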
Sequence analysis is indeed a useful tool to determine the potential of sequences to contain information. However, just analyzing the occurrences of single letters can easily lead to erroneous conclusions. Work through the remaining frames of 'meaning', using the example of binary code, to understand how the analysis of single elements can be misleading and how to fix this problem.

1. Click forward. Frame #11: Can you see any patterns in this sequence? List them:
2. Stay on frame. Does the sequence follow any rules? List them:
3. Stay on frame. Can you make any predictions about the sequence? Which element would you think comes next?
4. Click forward. Frame #11: Check your answers against the answers provided at the bottom of the left frame.
5. Click forward. Frame #12: Count the numbers of 0's and 1's in this sequence. In what percentages do they occur?
6. Stay on frame. From working through the text examples earlier, what have you learned that a distribution of elements like this indicates? Randomness? Or meaning? So, do you think there is a problem here?
7. Click forward. Frame #14: Read the text.
8. Click forward. Frame #15: How many different pairs of elements can you form from 0's and 1's? In a perfectly random sequence, how often should each pair occur?
9. Click forward. Frame #16: Count the occurrences of these pairs in the actual sequence at hand. After you have determined each pair, move only one letter forward. This way each pair overlaps with the next and you should be able to find a total of 19 pairs in this sequence of 20 elements.
10. Stay on frame: Count the occurrences of each of the four different pairs in this sequence and calculate their percentages. 00 01 10 11
11. Stay on frame: Do these data show that the percentages for each pair are very similar? Or are they quite dispersed?
12. Stay on frame: Does this analysis confirm that the sequence is random? Or that it is meaningful and predictable?
13. Click forward. Frame #17: Read the text.
14. Click forward. Frame #18: Read the text.
15. Click forward. Frame #19: In order for frequency analysis to be accurate, it needs to include analyzing the occurrences of element combinations.

Bioinformatics - Sequence Analysis: Random vs. Genic vs. Intergenic DNA

DNA, just like languages or binary code, is a code. It consists of four different elements, and the fact that it is a molecule does not affect the capacity of nucleotide sequences to encode information. Given the sheer number of nucleotides in a DNA string, one could perceive DNA as a random sequence of nucleotides. Now test this possibility by applying nucleotide frequency analysis to random, genic, and intergenic DNA stretches. Use our new bioinformatics tool Gene Boy to calculate the percentages for the four different nucleotides in these three types of sequences. Then apply your conclusions about meaningful text and binary code to interpret your data about DNA.

1. Go to the DNAi website.
2. Open 'Genome'.
3. Select 'Genome mining'.
4. Read through the introductory and overview page.
5. Select 'DNA analysis'.
6. Following the instructions in DNAi, analyze the nucleotide composition of a random sequence and respond to the questions below. Use the Excel spreadsheet provided to record your data.
   o What did you find out about the distribution of A's, C's, G's, and T's in random sequences?
   o Based on these results and what you learned during text analysis, formulate a hypothesis about the distribution of nucleotides in the random DNA sequence.
   o Analyze the second random sequence on Gene Boy.
   o Does the result confirm or contradict your hypothesis? Make sure you check singles, pairs, and triplets.
7. Now analyze the sequence composition of a genic sequence.
   o What did you find out about the distribution of A's, C's, G's, and T's in the genic sequence?
   o How does it differ from the result for the random sequence?
   o Form a hypothesis about the difference between genic and random sequences.
   o Analyze the second genic sequence on Gene Boy.
   o Does the result confirm or contradict your hypothesis?
8. Why would you think genic DNA differs from random DNA?
9. Discuss your answer with the rest of the class.

Genomic DNA which does not contain genes is called intergenic DNA. Which sequence characteristics would you expect for intergenic DNA? It is non-coding, so would it look like random DNA? Like genic DNA? Or different from both? The answer to this question will provide you with first clues about how one can detect genes in DNA sequences.

1. Compare a random with an intergenic sequence. What differences can you observe?
2. Compare the second intergenic sequence with the second random sequence.
3. Are intergenic sequences random? Explain.
4. Are there any differences between genic and intergenic DNA?
5. Analyze the composition of genic sequence #1 and intergenic sequence #1. Can you identify any differences?
6. Determine the nucleotide distributions for the second set of genic and intergenic sequences. Can you identify any differences?
7. Repeat the previous analysis for genic and intergenic DNA, examining the ratios for nucleotide pairs. Can you identify any differences between the two sequence types?
8. Examine the occurrences of CG di-nucleotides in the two sequence types. Where do you find high frequencies of CG-pairs?
9. What are "CG-islands" associated with? They are often called "CpG-islands" to denote the fact that the C and the G are connected through a phosphate bridge. In other words, CpG refers to the fact that C and G are neighbors on the same strand and not base pairs which sit opposite each other.
10. From your previous answers, can you name anything in the nucleotide and di-nucleotide composition of genic and intergenic DNA that could be used to find genes in genomic DNA sequences? How would you go about doing that?
11. Make a comparative statement about the hemoglobin alpha gene cluster on chromosome 16 and a high CG content.
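The single, pair, and triplet percentages that Gene Boy reports can be mimicked with a sliding window that moves one letter at a time, the same overlapping count used for the binary pairs earlier. The Python sketch below is an assumption about how such a count could be done, not a description of Gene Boy's own code; the function names and the example sequence are made up for illustration.

```python
from collections import Counter
from itertools import product


def kmer_percentages(seq, k, alphabet="ACGT"):
    """Percentages of all overlapping k-letter words over the given alphabet."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Ignore windows containing symbols outside the alphabet (e.g. N).
    counts = Counter(w for w in windows if all(c in alphabet for c in w))
    total = sum(counts.values())
    words = ("".join(p) for p in product(alphabet, repeat=k))
    return {w: (100.0 * counts[w] / total if total else 0.0) for w in words}


def percent_range(percentages):
    """Return (lowest, highest) percentage, as in a 'Range (%)' column."""
    values = list(percentages.values())
    return min(values), max(values)


if __name__ == "__main__":
    # Hypothetical stretch of DNA, made up for illustration only.
    dna = "GCGCGGCGCCATGGCGCGCTAGCGCGATCGCGGCGC"
    for k in (1, 2, 3):
        low, high = percent_range(kmer_percentages(dna, k))
        print(f"k={k}: range {low:.1f} - {high:.1f} %")
    print("CG pairs:", round(kmer_percentages(dna, 2)["CG"], 1), "%")
```

Passing `alphabet="01"` applies the same count to binary code: for a strictly alternating sequence of twenty 0's and 1's, the single elements come out at 50% each, while the 19 overlapping pairs consist only of 01 and 10.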
Results of Composition Analysis

Composition Analysis Range (%)

Type            Mononucleotides   Dinucleotides   Trinucleotides
Random 1        24.6 - 25.3
Random 2
Genic 1
Genic 2
Intergenic 1
Intergenic 2

Mononucleotide Composition Analysis (%)

Type            A       C       G       T
Random 1        25.2
Random 2
Genic 1
Genic 2
Intergenic 1
Intergenic 2

Bioinformatics - Sequence Analysis: Patterns and Consensus

DNA sequences which serve the same function frequently share characteristics. Genes share many structural features with each other, such as promoters, start and stop codons, and coding sequences. Sometimes these features are identical in different genes, as is the case for start and stop codons. Sometimes they share an overall consensus, but the sequence of a feature may deviate somewhat from the consensus in different genes. Familiarize yourself with the most important of these features by working through the 'gene feature' segment in 'Genome Mining', answering the questions below:

1. Frame #1: What is the central dogma?
2. Frame #3: Which triplet is the start codon? On mRNA ______ On DNA ______
3. Frame #3: Which amino acid does the start codon encode?
4. Frame #5: What are the three stop codons? On mRNA ______, ______, ______ On DNA ______, ______, ______
5. Frame #5: Which amino acids do the stop codons encode?
6. Frame #7 and #8: What is meant by a reading frame (RF)?
7. Frame #9: What is meant by an open reading frame (ORF)?
8. Frame #10: What is a UTR?
9. Frame #10: What is the actual beginning and end of transcription?
10. Frame #10: What is the actual beginning and end of translation?
11. Frame #11: View the 3D animation "Promoting transcription".
12. Frame #11: What is a promoter?
13. Frame #11: What is the actual purpose of a promoter in gene expression?
14. Frame #13: A consensus sequence in the promoter of the six different genes is highlighted in yellow - what is it called? What is the actual consensus sequence?
15. Frames #14 and #15: Explain the process by which a computer program can be used to identify TATA boxes.
16. Frame #15: Do the "TATA"-boxes of all genes contain exactly the same sequence?
17. Frame #16: What is meant by "splicing"?
18. Frame #16: Does splicing occur on the DNA or the RNA level?
19. Frame #16: What are "spliced genes"?
20. Frame #16: What are introns?
21. Frame #16: What are exons?
22. Frame #16 and #17: What are splice sites?
23. Frame #17: Introns begin with the sequence ______ and end with the sequence ______.
24. Frame #18: What is a polyA-tail?
25. Frame #18: At which end of the RNA is the polyA-tail added?
26. Frame #18: What is a polyA-signal?
27. Frame #19 and #20: What is the consensus sequence for a polyA-signal?
28. Bonus question: How can these gene features be used in gene identification?

Hort 503: Bioinformatics for Research
Exercise II

Go to http://www.dnalc.org/bioinformatics/2003/. Click on Similarity, then read and complete the exercises.

Bioinformatics - Sequence Similarity & Alignments

According to the concept of evolution (and our own experience), living organisms can be traced back to forebears. Going back further and further in time allows us to connect currently distinct species through common ancestors until, ultimately, all life might spring from a few or even one ancestral "cell". Thus, all current life forms may actually be related to each other. Support for this hypothesis can be found in our genes and proteins. Could there be any organisms more different than E. coli, lettuce, yeast, worms, flies, and humans? Yet humans share genes and proteins with similar functions and sequences with other primates, rodents, worms, and even prokaryotes. Since life in its diversity might relate back to one ancestral life form, all our genomes might relate back to the genome of this very same organism. All differences that exist today among genomes were then introduced during the ensuing billions of years of genomic changes and evolution.
These changes then led to the appearance of new organisms. This evolutionary development of life is likened to a tree, with emerging new species and kingdoms representing the branching points of this tree of life.

The amount of similarity between two sequences is a measure of their relatedness. The relationship among nucleotide sequences can differ from the relationship among amino acid sequences which, in turn, may differ from the relationship in structure and function. Closely related sequences are more similar than more distantly related sequences. Thus, sequence similarity serves to estimate evolutionary distance, following the assumption that sequence similarity beyond what can be expected just by chance indicates relatedness. The determination of sequence similarity is not trivial, though. It requires sophisticated computer algorithms which attempt to align sequences with each other in order to determine and score identities and differences between them. Aligning sequences is the basis for many research objectives such as finding genes, determining relationships, and finding sequences in databases. This module will guide you through examples to understand how alignments work and lead to new findings.

Bioinformatics - Sequence Similarity: Outline

Identifying similarities among sequences is an important technique for many biological questions such as: are two organisms/genes/proteins related to each other? Or: can a drug that works against one type of cancer be applied to other cancers, too? Learn what sequence similarity is, how it is determined, and how it can be used to answer biological questions.

1. Similarity, identity, and homology
2. Sequence comparison through alignments
3. Local and global alignments
4. Pairwise sequence alignments and sequence searches
5. Pairwise sequence alignments and homology

Bioinformatics - Similarity, identity, and homology

Things that are related are often similar and things that are similar are often related. Relationships are defined by common ancestry: siblings are related to and through their parents, nieces and nephews through their grandparents. By the same token, different globin molecules are related: they all relate back to an ancestral heme-incorporating globin-like protein. Relationships among family members are determined by similarities: biologically through physical similarities, socially/legally by identical last names. Relationships of proteins and genes are determined by the degree of similarity among their sequences.

Everybody has their own, unique DNA sequence, just as everybody has their own, unique set of fingerprints. Any organism's DNA sequence has been provided by its forebears: parents for humans, and progenitor cells for bacteria and other single-cell organisms. Changes in DNA are introduced by two mechanisms: spontaneous mutations (infrequent) and recombination of paternal and maternal DNA during meiosis (frequent). Thus, a person's genome is usually not a copy of either her father's or her mother's genome; it is a combination of parts of both.

Homology is a term used for genes or proteins which are derived from the same ancestor. Homology cannot be expressed as a fraction: either two sequences are homologous or they are not. Scientists infer homology from the similarity among sequences. Similarity is a measure of relatedness. 100% similarity would be identity.

In order to find out whether things are similar, we usually compare them side-by-side, such as the two images below. Or we find some way to describe them, such as the formula that is used to describe the loops and whorls in fingerprints.
Searching for matches then does not require comparing images; instead, the formulas can be used to query databases.

Below are the images of regular hemoglobin and an artistically "mutated" hemoglobin molecule. Compare the two images and try to find the three defects in the right one that would render a person's blood unable to transport oxygen if this molecule really existed.

Click the left image. To preserve the large molecule in a new browser window, hold down the 'ctrl' key and press 'n'. Then click the right image. Re-size the windows so that they fit next to each other. Compare.

Bioinformatics - Sequence comparison by alignment

Just as you place images and many other things next to each other to compare them, sequences are compared by aligning them with each other. Alignments between nucleotide or amino acid sequences represent the evolutionary history of these sequences - their relatedness. Many recent advances in understanding the information in genomes and proteomes were derived through sequence comparisons, and alignment and search methods have become indispensable routines in bioinformatics and biological research.

An alignment between two sequences is simply a pairwise match between the characters of each sequence. While sequences are either homologs or not, similarity, the measure which is used to infer homology, can be expressed as a fraction or percentage. As an example, determine the similarity between the two sequences:

AATCTATA
AAGATA

Align the two sequences in various ways with paper and pencil. Or click anywhere on the box and do it online. Then score their similarity by rewarding matches with +1 and scoring mismatches with 0. In how many different ways can you align the two sequences? Which are the highest and lowest scores you can find? Do the alignments allow you to determine a relationship between the two sequences? Results here.

However, none of these three alignments is actually very satisfying.
There seems to be more similarity at the beginning and the end of the sequence than in the middle part - if only the second sequence weren't so short! Three kinds of changes can occur at any given position within a sequence: a change from one nucleotide to another, an insertion of one or more nucleotides, or a deletion of one or more nucleotides. This fact has prompted scientists to allow the insertion of gaps into sequences that they try to align. A gap would represent a deletion in one of the sequences, or an insertion in the other one. However, insertions and deletions (so-called 'indels') have been found to occur much less frequently than substitutions, i.e. changes of one nucleotide into another. In order to account for this fact, gaps are penalized with a negative score, the gap penalty.

Align the sequences above again. But this time you are permitted to insert gaps in the form of dashes into either sequence. (Don't delete or replace any nucleotides, though. Upon inserting a gap, the nucleotides which follow the gap are pushed forward.) Realign the two sequences, inserting gaps wherever it seems necessary. Then score the alignments: a match with '1', a mismatch with '0', and a gap with '-1'. Find the highest and the lowest scores possible. (Use paper and pencil or click here to do it online.) Do the alignments allow you to determine a relationship between the two sequences? Click here and compare your alignments.

Are two holes more than one? The third gapped alignment from the last activity contains two consecutive gaps. Should they be scored the same way as two isolated gaps or differently? Maybe consecutive gaps should be viewed as just one gap, regardless of the number of dashes following each other? In fact, insertions or deletions that sit next to each other are more likely to have occurred at once than due to distinct events. Several gaps in a row are therefore charged differently than several isolated gaps.
This is done by breaking up the gap penalty into a gap-opening penalty and a gap-extension penalty. The gap-opening penalty is charged for each isolated gap as well as for the first gap in a longer stretch of consecutive gaps. Each further gap in the stretch is then charged with the gap-extension penalty, which is smaller than the gap-opening penalty. Thus, alignment 3 would not score 3 but somewhat higher, depending on the gap-extension penalty. Re-score your alignments by setting the gap-extension penalty to 0.5.

Bioinformatics - Global and local alignments

So far we compared sequences looking at their entire length. Find out how this technique can lead to misinterpretations and what other technique you need to know in order to understand how to correctly use sequence alignments for the analysis of nucleotide and amino acid sequences. Determine the relationship between these two sequences by aligning them with paper and pencil or electronically:

ACGT
AACACGTGTCT

Global alignments try to match sequences by comparing them at each given position. Charging gap penalties does not take into account whether gaps are inserted within a sequence or at the beginning or the end of one or both sequences. However, in aligning the two sequences above, you may have found that the optimal alignment is to not split up ACGT but to leave the sequence in one piece. Here, several alignments are possible, yet only one contains the short sequence entirely within the long one. Gaps at the beginning and end of sequences are usually the result of incomplete data and are not based on biological circumstances. They should therefore be treated differently than internal gaps. This kind of alignment is called semi-global alignment. Repeat the previous alignment, charging gaps at the beginning and at the end of the sequence with '0'. Find the alignment yielding the highest score. (For the electronic version click here.)
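The scoring rules described so far (match +1, mismatch 0, gap-opening penalty -1, gap-extension penalty -0.5, and optionally free gaps at the sequence ends) can be collected in one small function. The Python sketch below scores an alignment that has already been written down with dashes; it does not search for the best alignment. The function name `score_alignment` and its parameters are assumptions made for this illustration.

```python
def score_alignment(top, bottom, match=1.0, mismatch=0.0,
                    gap_open=-1.0, gap_extend=-0.5, free_end_gaps=False):
    """Score a given gapped alignment; gaps are written as dashes."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    if any(x == "-" and y == "-" for x, y in zip(top, bottom)):
        raise ValueError("a column must not contain two gaps")

    start, end = 0, len(top)
    if free_end_gaps:
        # Semi-global scoring: skip gap columns at the beginning and the end.
        while start < end and "-" in (top[start], bottom[start]):
            start += 1
        while end > start and "-" in (top[end - 1], bottom[end - 1]):
            end -= 1

    score = 0.0
    previous_gap_row = None  # row that held a dash in the previous column
    for x, y in zip(top[start:end], bottom[start:end]):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # The first dash of a run opens a gap, the following ones extend it.
            score += gap_extend if previous_gap_row == row else gap_open
            previous_gap_row = row
        else:
            score += match if x == y else mismatch
            previous_gap_row = None
    return score


if __name__ == "__main__":
    # One possible gapped alignment of the two example sequences.
    print(score_alignment("AATCTATA", "AAG--ATA", gap_extend=-1.0))
    print(score_alignment("AATCTATA", "AAG--ATA"))
    # ACGT placed inside the longer sequence, with and without free end gaps.
    print(score_alignment("AACACGTGTCT", "---ACGT----", gap_extend=-1.0))
    print(score_alignment("AACACGTGTCT", "---ACGT----", free_end_gaps=True))
```

Setting `gap_extend` equal to `gap_open` reproduces the simple rule in which every dash costs the same, so the effect of the gap-extension penalty can be seen by scoring the same alignment both ways.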
Local alignments combine global and semi-global alignments in that they attempt to determine the best matching subsequences within two sequences. Align these two sequences with paper and pencil or electronically:

AACCTATAGCT
CCGATATA

Again, you could align these sequences in a variety of ways, yet alignments which leave larger chunks of sequence intact are favored over those that chop the entire sequence into its nucleotides and insert a whole lot of individual gaps into both sequences. Using a local alignment algorithm, scores of 1 for matches and 0 for mismatches, and a gap penalty of -1 would yield this alignment. The gap in the upper sequence could have been placed at a few other positions, but the point is that this alignment provides the smallest number of gaps while maintaining the integrity of as many subsequences as possible.

Find out more about scoring alignments in this animation. It's somewhat complex at times, but it provides a good overview of how optimal sequence alignments are determined using alignment matrices.

Bioinformatics - Pairwise alignments in searches

Searching databases for sequences is an extremely important bioinformatics routine. Learn how alignment programs are used to query databases for sequences.

Searching the Internet for information often necessitates the use of so-called search engines. Bioinformatics has its own search engines, which are specialized tools to search databases containing nucleotide and amino acid sequence data. These databases are located in a variety of different places around the world, most notably in the US at the National Center for Biotechnology Information (NCBI), in England at the European Molecular Biology Laboratory (EMBL), in Switzerland (SwissProt), and in Japan at the DNA Data Bank of Japan (DDBJ). The sheer number of sequences in these databases prevents direct sequence-to-sequence alignments, and search tools have to be quite sophisticated in order to complete searches in a reasonable amount of time.
BLAST is one of the better-known sequence search engines. BLAST stands for Basic Local Alignment Search Tool. BLAST finds sequences in databases that are similar and related to subsequences in a query sequence. It returns a brief title line describing the nature of the search hit, links to the database entry for the hit, shows the actual sequence alignment between query and hit sequence, and rates each search hit with a 'score' and a so-called 'E-value'. The two scoring parameters 'score' and 'E-value' refer to the quality of the search result. Scores are determined based on the number of matches, mismatches, and gaps in the sequence alignment. However, they can be somewhat misleading in that scores strongly depend on the length of the query sequence.

Perform a BLAST search with this sequence:

TTAACTCCACCATTAGCACC

1. Highlight and copy the sequence.
2. Open Gene Boy.
3. Click 'Your Sequence', paste the sequence into the central window, change 'Your Sequence' on top into a name of your choosing, and select 'Save'.
4. Open 'WWW Tools' and select 'Sequence Search'.
5. In the next window select 'Format' and wait for the results to come up.
6. Examine the different parts of the results page. Try to resolve questions first by consulting the help provided under 'BLAST FAQs'.
7. Clicking on a score moves further down the page to a view of the alignment between the match and the query.
8. Clicking on a 'gi|.....' number will link you to the database entry for the respective match. Use the browser's 'Back' button to move back to the result page.
9. Which organisms are the matching DNA sequences from?
10. What scores and E-values can you find for different matches? Read the definitions of score and E-value under 'BLAST FAQs'.
11. Check Genetic Origins - mtDNA - Recipes. What sequence was the primer derived from?

The two scoring parameters 'score' and 'E-value' are provided to judge the quality of the search result.
The score is determined based on the number of matches, mismatches, and gaps in the sequence alignment. However, the score can be somewhat misleading in that it strongly depends on the length of the query sequence. The E-value, on the other hand, is seen as a more informative parameter to judge the validity of a search result. It estimates how many matches of this quality one would expect to find just by chance in a database of the given size. The higher the E-value, the more likely it is that the result matches the query just by chance. The lower the E-value, the more significant the search result. (As if this weren't confusing enough, the E-value is often written in exponent notation, where 'e-56' stands for 10 to the power of -56. While the number after the 'e' can be quite big, e.g. 56, e-56 is a very small number and an E-value that you would want to see for meaningful search results.)

Bioinformatics - Pairwise alignments to determine homology

Sequence homology can be inferred by aligning sequences and scoring the alignment. If the score differs significantly from the degree of similarity one could expect if the sequences were just random, homology can be inferred. Homologous sequences share a common ancestor. Since they diverged from this ancestor, both sequences have undergone changes. The number of these changes and, therefore, their degree of similarity are correlated with the number of generations that have passed since the two sequences diverged. If sequences are very closely related, they are likely to be very similar. If sequences are very similar, they might be very closely related. If sequences diverged very far in the past, they might be quite different. In other words, sequences that are highly different might not be homologous at all. Or they might be homologous, except that one might not be able to determine this by examining sequence similarity.

In the following example, determine the similarity between genic sequences for proteins that have the same function but are derived from different organisms.
The sequences are human alpha globin 1, mouse alpha globin 1, and lupine leghemoglobin, a plant-derived oxygen-binding protein.

1. Determine the degree of identity between these two sequences by calculating the percentage of identical nucleotides:

Hs hba  >gcctggggtaaggtcggcgcgcacgctggcgagtatggtgcggaggccctggagagg<
         |||||||| ||| | || |  || | ||  || ||||| || || |||||||| |||
Mm hba  >gcctgggggaagattggtggccatggtgctgaatatggagctgaagccctggaaagg<

2. Do you think these two genes share a common ancestor?
3. Now calculate the identity between these two genic sequences:

Hs hba  >gcgagtatggtgcggaggccctggagaggtgaggctccctcccctgctcc<
           || | |  |||  |    |   | |       |    ||  |  ||
L lhb   >aagaatttaatgcaaatattcctaaaaacacccaccgtttcttcaccttg<

4. Do you think these two genes share a common ancestor?
5. How does it change your thinking to look at this alignment of the two proteins for the two genes?
6. The occurrence of hemoglobin is not limited to red blood cells. Legume plants such as clover, pea, beans, and many others are able to synthesize a form of hemoglobin when they undergo a symbiosis with nitrogen-fixing bacteria, Rhizobia. Rhizobia are able to capture atmospheric nitrogen, N2, by reducing it to ammonia, NH3. During the establishment of the symbiosis, the plant host develops new organs, nodule-like structures on its roots, within which it isolates and houses the bacteria. The plant receives nitrogenous compounds from the bacterial partner and can grow independently of fertilizer. The bacterial partner receives carbon compounds from the plant host. The bacterial enzyme that reduces atmospheric nitrogen is called nitrogenase; it is extremely oxygen-sensitive. On the other hand, nitrogen fixation requires a large amount of energy, ATP, the production of which depends on the availability of oxygen. In order to resolve this paradox, legumes synthesize in their nodules a form of hemoglobin, leading to a pinkish to dark red hue within the nodules.
This form of hemoglobin is called leghemoglobin since it is synthesized in legumes. Leghemoglobin binds the free oxygen around the bacteria in the nodules and protects their nitrogenase from being destroyed. At the same time, it delivers the oxygen to the bacterial symbionts, which use it to satisfy their ATP needs.
7. Comparing nucleotide sequences does not always give you a good idea about the relatedness of two different functional structures such as hemoglobin. Comparing the proteins gives a much more accurate clue about their similarity. Therefore, in order to understand the relatedness of proteins you have to look not only at the genes but also at the amino acid sequences. What is the percentage of identical amino acids in this alignment between mouse and human hemoglobin?

Human alpha globin 1  >GKVGAHAGEYGAEALER<
                       || | |  |||||||||
Mouse hemoglobin      >GKIGGHGAEYGAEALER<

8. Amino acids are different from nucleotides in that similarity and identity are distinguished, because amino acids can be grouped according to their physicochemical properties such as size, charge, hydrophobicity etc. (see image, web page). Just by looking at the image, it is obvious that leucine and valine are more similar to each other than either is to histidine. Thus, amino acid sequence alignments are analyzed by a) determining the percentage of identical amino acids as % identity, and then b) determining how many amino acids are identical plus how many represent substitutions by similar ones, and expressing the result as % similarity. Groups of similar amino acids are as follows (as provided by the ClustalW site at the European Bioinformatics Institute, EBI):
   Small + hydrophobic + aromatic: A,V,F,P,M,I,L,W
   Acidic: D,E
   Basic: R,H,K
   Hydroxyl + Amine + Basic: S,T,Y,H,C,N,G,Q
9. How similar are the two sequences if similar amino acids are labeled with a '+'?

Human alpha globin 1  >GKVGAHAGEYGAEALER<
                       ||+| |  |||||||||
Mouse hemoglobin      >GKIGGHGAEYGAEALER<

10. Now determine the degree of identity between the human and the legume sequence:

Hs Hba  >VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPH<
          |       ||  |    |                 |  |  |
L Lghb  >ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSF<

11. How similar are the two sequences if similar amino acids are labeled with a '+'?

Hs Hba  >VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPH<
         +|+ || | +|+ + + ++ + | | |
L Lghb  >ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSF<

12. Since human proteins are composed of 20 different amino acids, one would expect about 5% (1 in 20) of the positions in two entirely unrelated amino acid sequences to match by chance. The amino acid sequences of human and legume hemoglobins are significantly more similar than 5%, and it can be assumed that the two sequences are derived from a common ancestor; these two sequences are homologous. The amount of similarity between two sequences can be used to estimate the point in time when they split from each other: the more different two related sequences are, the longer ago they split.

Hort 503: Bioinformatics for Research Exercise III

Bioinformatics - Finding Genes, Intro

Genes are DNA sequence stretches that are transcribed into RNA. Some RNAs, messenger RNAs or mRNAs, serve as templates that are translated into amino acid sequences. Other types of RNA perform tasks as RNA molecules, such as transfer RNA or tRNA, ribosomal RNA or rRNA, small nuclear RNA or snRNA, small interfering RNA or siRNA, and small nucleolar RNA or snoRNA. For the sake of this course we will be looking at genes whose transcripts are translated into proteins. DNA makes RNA makes protein: this is the Central Dogma of molecular biology. It holds in general, even for retroviruses. Retroviruses carry an RNA genome, which is reverse transcribed into DNA by means of an enzyme called reverse transcriptase.
However, this is not a contradiction of the Central Dogma - it is just a necessary first step in the process of expressing the viral genes, because only after they have been reverse transcribed into DNA can these genes be used to synthesize proteins - by way of the Central Dogma. The transcription of RNA follows this schema:

DNA Watson strand  5'-nnnnnCATGCTGACGCAGTCGCTAGTCTGAAnnnnn-3'  (equivalent to RNA)
DNA Crick strand   3'-nnnnnGTACGACTGCGTCAGCGATCAGACTTnnnnn-5'  (template for RNA)
RNA                5'-nnnnnCAUGCUGACGCAGUCGCUAGUCUGAAnnnnn-3'  (transcript of DNA)

Bioinformatics provides two fundamentally different methods to find genes. The first one uses sequence alignments to search databases for a highly similar sequence which has already been annotated and contains a potential gene. This could result in the finding that the gene has already been identified. Or it could yield a similar gene, the finding of which may significantly speed up research. Finally, it may yield no significant match, in which case one would have to follow the second approach to finding genes in silico: finding genes from scratch.

Finding genes from scratch is called ab initio or de novo gene prediction. It is based on the observations and experience derived from comparing many known genes and identifying common sequence features. Once some common sequence features, motifs, or consensus sequences have been identified, these are searched for to identify new genes in not yet annotated DNA. Sequence features associated with functional aspects of genes are open reading frames (ORFs), promoters, splice sites, polyA-signals, and start and stop codons. Another feature, this one related to the structural aspect of genes, is the occurrence of specific nucleotides and nucleotide combinations in genic vs. non-genic regions.

Bioinformatics - ORF Finding: Finding a human gene

Purpose: to identify a gene in human DNA by finding ORFs, and to determine the nature of the potential gene.

ORF finding
1. Open Gene Boy at http://www.dnai.org/geneboy/index.html
2. Select 'Clear'
3. On the left, under 'Sequences', select 'Genic 1'
4. How long is the sequence?
   o On the right, under 'Operations', select 'Find Genes', then 'ORF'
   o Below the window select 'Reverse'
   o Explain what you see on the screen, answering these questions:
   o What do the three horizontal bars represent?
   o What do the yellow and green boxes represent?
   o What constitutes an ORF?
   o Why are you searching for ORFs?
   o What does 'Reverse' do?
   o Why is this function needed?
5. Transfer the data from both windows into the table below:

   Frame   From   To   Length   Organism   Gene

Gene verification and identification
1. Go to the DNAi website at http://www.dnai.org/index.htm
2. Select 'Genome', then 'Genome Mining', then 'Gene Finding'
3. Click the forward icon until you arrive at frame #9
4. Compare your ORFs with the map for the gene. Did you find the ORF shown in the map?
5. What are UTRs?
6. Why did the ORF finder not find the UTRs?
7. How could you find out what function the different genes have?

Gene function: Find out what the gene is by using Genic 1 for a BLAST search.
1. On Gene Boy select 'Clear', then select 'Genic 1'
2. Select 'WWW Tools', then 'Sequence Search'
3. Select 'Format' and wait for your results to come up
4. On the results page determine which matches are the best: the ones with low E-values or the ones with high E-values? (Use the BLAST FAQs link to answer this question)
5. Scroll down to results #7, 8, or 9 - what gene is mentioned there?
6. Clicking on the gi-number on the left side of a hit links you to the database (GenBank) entry for this hit. Determine the function of the gene.

Bioinformatics - Finding Spliced Genes

The majority of genes in eukaryotes are spliced genes. This means that the coding sequence is interrupted by non-coding sequences. These interspersed non-coding sequences are called introns; the expressed sequences they separate are called exons.
In order to restore an open reading frame that can be translated into a protein, the preliminary mRNA (pre-mRNA), consisting of the entire sequence between transcription start site and transcription stop site, undergoes a process called splicing. During splicing, so-called spliceosomes attach to the pre-mRNA molecule, often while it is still being generated by transcription. Then the introns are cut out and the exons are connected (spliced together). This process generates a molecule that may contain some untranslated regions (UTRs) at the beginning and the end, but internally it contains a continuous ORF, ready to be translated into an amino acid sequence. Thus, even spliced genes have ORFs - on the level of the mRNA molecule.

ORF finders are not suitable to identify spliced genes. In spliced genes the true start and stop codons are separated by introns and thus do not generally form part of a single ORF. The first coding exon does contain a start codon, but no stop codon that would follow the start codon in frame. Otherwise, translation would stop right there and the gene would not be a spliced gene. Internal exons may contain methionine codons, and they may also contain stop codons. But such a methionine codon would not code for a start but for a methionine which is internal to the final protein product. And, while internal exons may contain stop codons in two reading frames, the third one has to remain free of stop codons. Finally, terminal exons may or may not contain ATGs, but these would not be start codons. So, applying ORF finding to the identification of spliced genes is almost hopeless. Except ... when ORF finders are applied to mRNA or cDNA they should be able to pick up the coding sequence for a particular splice form of the gene. Once the ORF has been identified, the cDNA sequence can be aligned to the genomic DNA, revealing the position of introns (not aligning) and exons (aligning).
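
The point made above, that an ORF finder works on cDNA but is misled by introns in genomic DNA, can be tried out on a toy example. The Python sketch below is a minimal six-frame ORF scan under a simple assumption (an ORF runs from the first ATG to the next in-frame stop codon). It is not the algorithm used by Gene Boy or by NCBI's ORF Finder, and the function names and the two short sequences are invented for illustration only.

```python
# Minimal six-frame ORF scan (illustrative sketch, not Gene Boy's method).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return a list of (strand, frame, start, end, length) tuples.

    start/end are 0-based positions on the scanned strand, end is exclusive
    and includes the stop codon; frame is 1, 2, or 3. Coordinates on the
    '-' strand refer to the reverse complement, not to the input sequence.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, length))
                    start = None  # look for the next ATG in this frame
    return orfs


# Hypothetical toy gene: two exons separated by one intron (GT...AG).
exon1, intron, exon2 = "ATGAAA", "GTAAGTTAGCAG", "CCCTAG"
genomic = exon1 + intron + exon2  # what the ORF finder sees in genomic DNA
cdna = exon1 + exon2              # what it sees after splicing

print("cDNA ORFs:   ", find_orfs(cdna))
print("genomic ORFs:", find_orfs(genomic))
```

For the cDNA the scan reports a single ORF of 12 nucleotides that covers the whole coding sequence. For the genomic version it reports a 15-nucleotide ORF that runs from the start codon into the intron and ends at a stop codon there, so the second exon is missed. Pasted sequences that contain digits or spaces would have to be cleaned before being passed to find_orfs.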
A variety of tools is available to identify spliced genes. Gene prediction aims to assign to a raw DNA sequence a structure like this:

Upstream Intergenic Region - Promoter (e.g. TATA) - First Exon - Intron(s) - Exon(s) - Intron(s) - Last Exon - Downstream Intergenic Region

Features annotated along this structure include enhancer sites, the transcriptional start, the 5'-UTR, the translational start, CDS/ORF in the exons, frequent stop codons in the introns, the translational stop, the 3'-UTR, the polyA insertion site, and the transcriptional stop.

De novo gene prediction programs probe sequence features such as the distribution of nucleotide pairs (CpG), triplets, and heptamers to determine regions which are likely to contain genes. Other programs predict individual sequence features such as splice sites, promoters, polyA-signals, etc., providing data that can be used to piece the gene together. Alignment-based gene prediction methods search sequence databases for sequences which are similar to the sequence awaiting annotation.

Bioinformatics - Finding Spliced Genes: ORFs

Purpose: to identify a gene in human DNA by finding ORFs.

ORF Finding
1. Open Gene Boy at http://www.dnai.org/geneboy/index.html
2. Select 'Genic 2'
3. How long is this sequence?
4. Select 'Find Genes' and 'ORF'
5. Below the window select 'Reverse'
6. Explain what you see on the screen
7. Transfer the data from the screens into the table below:

   Frame   From   To   Length   Organism   Gene

8. Did you detect any gene in the sequence?

Gene verification and identification: Find out whether the identified ORF is in fact the gene in Genic 2 by translating the sequence into its amino acid equivalent and comparing it to the amino acid sequence of the gene.
1. Open Gene Boy, press 'Clear'
2. Select 'Genic 2'
3. Copy the 'Genic 2' sequence - numbers, empty spaces, nucleotides
4. Come back to the course website and open 'Translator' (in the bar to the right)
5. Paste the sequence into the window of the translation tool
6. Check 'Show Aminoacids'
7. Select 'go'
8. Print out the DNA/amino acid sequence, then reconstruct the gene/protein by cutting out the paper stretches and joining them head-to-tail using scissors and glue.
9. In order to include the complementary strand in this examination, use the 'Transform Sequence' function in Gene Boy to generate the reverse complement of the sequence. Then translate the reverse complement using the 'Translator' link on the right.
10. Identify the coding portion of the gene by highlighting, in the forward strand and/or the complementary strand of the reconstructed gene, the nucleotide stretches that code for the amino acid sequences in this view of the protein encoded by Genic 2
11. What strikes you about the structure of the gene?
12. Compare this map for the gene in Genic 2 with the result of 'Find ORFs' (repeat if necessary). Which portion of the gene did the ORF finder detect?
13. Why would ORF finders not be able to detect the entire gene?
14. Given the way cells process the information in genes, at what point could you have successfully applied ORF finding to identify the coding portion of the gene?
15. Click the forward icon until you arrive at frame #12

Gene features

Try to predict the gene in Genic 2 by identifying individual gene features.
1. Get the sequence: Open Gene Boy, press 'Clear', select 'Genic 2', then highlight and copy the 'Genic 2' sequence - numbers, empty spaces, nucleotides
2. Determine the start and the end of the gene by using bioinformatics tools that determine transcriptional start sites (beginning) and polyA-tail signals in DNA sequences.
   o Here is the output of a program which predicts transcriptional start sites. Viewing the output, determine where DNA transcription may start in your sequence, and thus the 5'-region of the gene. Determine the location of the transcriptional start sites which are predicted by both programs and which have the highest scores by comparing the nucleotide positions and the scores provided by the different programs. (If you would like to run the programs yourself, highlight and copy the DNA sequence. Then use the CSHL Human Core Promoter Predictor and the UC Berkeley Promoter Predictor to predict transcriptional start sites by pasting the sequence into the input window for each program and running the program. In order to examine the second strand of the sequence, generate its reverse complement through the 'Transform Sequence' function in Gene Boy.)
   o Here is the output of a program which predicts polyA-tail insertion signals. Viewing the output, determine the region where the polyA-tail may be inserted, and thus the 3'-region of the gene. (If you would like to run the program yourself, highlight and copy the DNA sequence. Then use the CSHL PolyASignal Predictor to predict polyA-insertion signals by pasting the sequence into the input window of the program and running the program.)
   o Mark in your worksheet the promoter region, and the prospective translational start and polyA-signal sites. Which direction is the gene most likely transcribed in?
3. Discern exons and introns by identifying splice sites (i.e. the borders between exons and introns):
   o Here is the output of a program which predicts splice sites. Viewing the output, determine the two models that would incorporate the findings above about the promoter and polyA-signal regions of the gene. (If you would like to run the program yourself, highlight and copy the DNA sequence. Then use the splice site prediction program at UC Berkeley. Paste the sequence into the input window of the program and change the donor score cutoff to 0.88 and the acceptor score cutoff to 0.94 (donor site at beginning of intron, acceptor site at end of intron). Then run the program.)
   o What does the output page tell you about the prediction mechanism?
   o Use the output to determine the predicted splice site positions and record the results in the worksheet.
   o From the total of six splice site predictions build a couple of alternative maps that show how exons and introns could follow each other in this gene.
4. Determine how the predicted splice sites align with open reading frames: Relate the predicted splice sites to ORFs. Try Gene Boy or NCBI's ORF Finder to complete this task.
5. Determine which of the predicted splice sites border exons and introns: For each splice site examine where exactly it is preceded (donor sites) or followed (acceptor sites) by a stop codon, utilizing the Translation Tool in Sequence Utilities and checking 'Show only start and stop codons'. Make sure you examine all three reading frames, +1, +2, and +3, for each splice site. (Hint: start out by identifying the open read-through/coding sequence - CDS - in the internal exon. Then determine the CDS in the last exon and the translational stop site. Then figure out the first exon and the translational start site ATG.)
6. Characterize the gene by determining its length, exons, introns, splice sites, promoter, etc.

Bioinformatics - Finding Spliced Genes: Gene Prediction

Purpose: to identify a gene in human DNA through gene predictions.

Gene prediction
1. On Gene Boy select 'Clear', then select 'Genic 2'
2. Highlight and copy the sequence, incl. digits and spaces
3. Select 'WWW Tools', then 'Gene Prediction'
4. Select FgenesH or GenScan
5. On the website of the prediction tool find the input window for the sequence
6. Paste the sequence into the window (Ctrl-v) and submit it for analysis by clicking the left button below the input window
7. Compare the result with the map in frame #12 of Genome Mining in DNAi. How accurate are the gene predictions?

Bioinformatics - Finding Spliced Genes: Function

Purpose: to identify the nature of a gene in human DNA by searching for similar sequences in sequence databases.
Gene function: In order to find out what the gene is, submit 'Genic 2' to a BLAST search.
1. On Gene Boy select 'Clear', then select 'Genic 2'
2. Select 'WWW Tools', then 'Sequence Search'
3. Select 'Format' and wait for your results to come up
4. On the results page determine which matches are the best: the ones with low E-values or the ones with high E-values? (Use the BLAST FAQs link to answer this question)
5. Scroll down to the match listing - what gene is mentioned there?
6. How could you find out which of these genes is the one in 'Genic 2'?
7. Clicking on the gi-number on the left side of a hit links you to the database (GenBank) entry for this hit. Determine the function of the gene.

Bioinformatics - Finding Spliced Genes: Location

Purpose: to learn about a human gene by studying its location in the human genome.

Gene location: Find 'Genic 2' in the human genome.
1. On Gene Boy select 'Clear', then select 'Genic 2'
2. Select 'WWW Tools', then 'Genome View'
3. Select 'Format' and wait for your results to come up
4. Select 'Genome View'
5. What chromosome has 'Genic 2' been located on?
6. For a closer view select the number underneath the respective chromosome.
7. Select 'Maps & Options', 'Remove' everything with the exception of 'Gene' from 'Maps displayed'
8. Select 'Gene', click 'Toggle Ruler'
9. Click 'Apply', close this window, and begin zooming into the chromosome.
10. What genes do you find surrounding the 'Genic 2' locus? What gene cluster can you find in there?
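
When comparing E-values in the BLAST steps of this exercise, it may help to see how values written in scientific notation compare as plain numbers. The short Python sketch below sorts a few hits by E-value. The hit labels and E-values are invented for illustration and do not come from any real search; the function name is likewise hypothetical.

```python
# Hypothetical BLAST hits: labels and E-value strings are made up for illustration.
hits = [
    ("hit A", "3e-05"),
    ("hit B", "1e-56"),
    ("hit C", "0.27"),
]


def rank_by_evalue(hits):
    """Sort (label, E-value string) pairs by the numeric size of the E-value."""
    # float() understands scientific notation: "1e-56" means 1 x 10^-56.
    return sorted(hits, key=lambda hit: float(hit[1]))


ranked = rank_by_evalue(hits)
for label, evalue in ranked:
    print(label, evalue)
```

The hit written as 1e-56 sorts ahead of 3e-05 and 0.27, because a larger negative exponent means a smaller number.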
