Sequence comparisons
Using the BLAST and FASTA tools

Introduction

Xeroderma Pigmentosum (XP) is a heterogeneous group of genetically determined skin disorders caused by unusual sensitivity to ultraviolet light. The sun-exposed areas of the skin have a strong tendency to develop tumours. The median age of onset of the first skin neoplasm in these patients is 8 years, compared with 50 years for sporadic skin tumour cases. The causes are different genetic defects of the DNA repair system. The cell uses nucleotide excision repair to remove so-called bulky lesions, which are typical of UV-induced DNA damage. Several enzymes are involved in this type of DNA repair. In XP patients, mutations have been found in at least seven different genes coding for such enzymes.

Links: For more information about the disease Xeroderma Pigmentosum look at this shortcut or search the OMIM database for number 278700, corresponding to the XPA gene.

Assignment

BLAST is a local alignment tool found at the NCBI website. Read about the BLAST functions and the different modes you can choose at the web page http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html. You may find the Query tutorial, BLAST tutorial and More information useful. Also read the overview and FAQs at the BLAST web site (http://www.ncbi.nlm.nih.gov/BLAST/).

Hints: If you have problems using BLAST, first look at the Frequently Asked Questions. I also recommend that you read the chapter about pairwise alignment techniques in the course literature "Introduction to Bioinformatics" by Attwood & Parry-Smith.

Please check the following things:
In some of the BLAST search modes (and also in other programs you will use on this course) you must load the sequence in the FASTA format.
Choose the correct database, usually nr (nr = non-redundant).
If the search is taking a very long time, you can use the option to get your results by email. This does not work with BLAST 2 sequences.

Please take your time in getting acquainted with BLAST.
You will make great use of it in this course, and most likely in your future work!

Now that you have gone through the general introduction to BLAST, you are ready to answer some questions. Please use your own words to formulate your answers.

1-1. What is the difference between global and local alignment?
1-2. Most bioinformatic tools take FASTA sequences as input. Describe the FASTA sequence format.
1-3. Nucleotide blast (blastn) is used to compare a DNA or RNA sequence against a database of nucleotide sequences. Describe the functions of the other modes (or programs) you can choose.
1-4. What is an E-value? (What does it stand for? How can we use it?)

There are two main tools for sequence comparisons: BLAST and FASTA. Make sequence comparisons using both tools with the following sequence as the query sequence. In both cases, search against the Swissprot database. Report the identifiers for the top 5 hits using:

1-5. FASTA (http://www.ebi.ac.uk/fasta3/)
1-6. BLAST (http://www.ncbi.nlm.nih.gov/BLAST)

MNDLSGKTVI ITGGARGLGA EAARQAVAAG ARVVLADVLD EEGAATAREL GDAARYQHLD
VTIEEDWQRV VAYAREEFGS VDGLVNNAGI STGMFLETES VERFRKVVDI NLTGVFIGMK
TVIPAMKDAG GGSIVNISSA AGLMGLALTS SYGASKWGVR GLSKLAAVEL GTDRIRVNSV
HPGMTYTPMT AETGIRQGEG NYPNTPMGRV GNEPGEIAGA VVKLLSDTSS YVTGAELAVD
GGWTTGPTVK YVMGQ

Now we shall look at the difference between searching with the complete sequence and searching with only a short segment of the sequence.
The test case is human plasminogen (below), whose sequence you will compare with all sequences in the Swissprot database using FASTA (http://www.ebi.ac.uk/fasta3).

MEHKEVVLLL LLFLKSGQGE PLDDYVNTQG ASLFSVTKKQ LGAGSIEECA AKCEEDEEFT 60
CRAFQYHSKE QQCVIMAENR KSSIIIRMRD VVLFEKKVYL SECKTGNGKN YRGTMSKTKN 120
GITCQKWSST SPHRPRFSPA THPSEGLEEN YCRNPDNDPQ GPWCYTTDPE KRYDYCDILE 180
CEEECMHCSG ENYDGKISKT MSGLECQAWD SQSPHAHGYI PSKFPNKNLK KNYCRNPDRE 240
LRPWCFTTDP NKRWELCDIP RCTTPPPSSG PTYQCLKGTG ENYRGNVAVT VSGHTCQHWS 300
AQTPHTHNRT PENFPCKNLD ENYCRNPDGK RAPWCHTTNS QVRWEYCKIP SCDSSPVSTE 360
QLAPTAPPEL TPVVQDCYHG DGQSYRGTSS TTTTGKKCQS WSSMTPHRHQ KTPENYPNAG 420
LTMNYCRNPD ADKGPWCFTT DPSVRWEYCN LKKCSGTEAS VVAPPPVVLL PDVETPSEED 480
CMFGNGKGYR GKRATTVTGT PCQDWAAQEP HRHSIFTPET NPRAGLEKNY CRNPDGDVGG 540
PWCYTTNPRK LYDYCDVPQC AAPSFDCGKP QVEPKKCPGR VVGGCVAHPH SWPWQVSLRT 600
RFGMHFCGGT LISPEWVLTA AHCLEKSPRP SSYKVILGAH QEVNLEPHVQ EIEVSRLFLE 660
PTRKDIALLK LSSPAVITDK VIPACLPSPN YVVADRTECF ITGWGETQGT FGAGLLKEAQ 720
LPVIENKVCN RYEFLNGRVQ STELCAGHLA GGTDSCQGDS GGPLVCFEKD KYILQGVTSW 780
GLGCARPNKP GVYVRVSRFV TWIEGVMRNN 810

a) Start by doing a search using the complete sequence. Store these results (e.g. in Emacs, Notepad or any word processor) so that you can compare them with the results in b) and c) below.
b) Do a search using only the segment between positions 121 and 240 as the query sequence.
c) Finally, do a third search using only the segment 601-780.

Quickly look through the descriptions for the top 50 results (a, b and c above).

1-7. Are they identical or not?
1-8. What kinds of protein did you find?
1-9. Are there any kinds of protein in b or c that were not found in a?
1-10. If so, can you find an explanation for this?

Use BLAST to identify the gene from mRNA sequence 1. (Be patient, it may take a while.)
From your result you can enter GenBank directly via the link, or search for the accession number at the GenBank website (http://www.ncbi.nlm.nih.gov/Genbank/index.html), to gain access to more information about the gene.

Hints: Do not remove empty spaces or numbering from the sequence manually; the program does that automatically. Sometimes your search can run very slowly, especially in the afternoon when the scientists in the big country in the West have started their work. If you get this problem, use the option to receive your answer via e-mail. Some programs also have a queuing system, where you get an identification number for your particular search. With this number you can enter the web site later and look at your result. Alternatively, get up early in the morning... You will probably have to work on every assignment on several different occasions and maybe also use different computers. It might be wise to save part of your results in a Word or Notepad document and save it to a disk.

1-11. This gene has several names. Which ones? (Either use the abbreviations or the full names.)
1-12. Give the accession number for the sequence of this gene (or its transcript).
1-13. At which position in the mRNA does the coding sequence (CDS) start?
1-14. Removed. Enter 'removed' into the assignment report form. (Updated 2007-11-15: Deprecated due to new evidence.)
1-15. This enzyme is part of a protein complex (NER), but what function does it have in the complex?

There are several tools available to convert DNA sequence to protein. At the ExPASy web site (http://www.expasy.ch/) they can be found under the headline Proteomics.
Use the following tools to verify the amino acid sequence you found in GenBank in the previous question (1-11):

Translate (http://www.expasy.ch/tools/dna.html)
Transeq in EMBOSS (http://www.ebi.ac.uk/emboss/transeq/)
Sixpack (available in SRS (http://srs.ebi.ac.uk/), tab Tools, drop down menu or http://humpback.bii.a-star.edu.sg/cgi-bin/emboss/emboss.pl?_action=input&_app=sixpack)

Hints: Remove the heading of the sequence in the FASTA format! The prefix may be interpreted as nucleotides.

1-16. Which are the first ten amino acids of the expressed protein?
1-17. Describe the differences between the search tools.
1-18. Which tool did you like best, and why?

PS Don't miss the Swiss jokes at the ExPASy site!

You have the recently cloned mRNA sequence 2, but you do not know the reading frame. There are three possible reading frames on each DNA strand. Only one of them is correct and gives a functional protein. Instead of using a translation program, as in questions 1-16 to 1-18, you can use one of the translated BLAST programs to directly translate and identify the gene product. Also use blastn to identify the gene.

1-19. Which BLAST program did you use to compare all six frames?
1-20. What are the effects on quality of performing the sequence comparisons in BLAST on the protein level instead of on the nucleotide level when you have a nucleotide query?
1-21. This gene also has (at least) two names. Which ones? (Again, either use the abbreviations or the full names.)
1-22. What is the enzymatic function of the protein that this gene codes for? Describe what the protein does.

Even though deficiency in DNA repair systems has been shown to contribute to tumour development, the most common cause of cancer is malfunction of the cell cycle control. Normal cells use cell cycle checkpoints as a mechanism to avoid accumulation of genomic errors before cell division.
Normal p53 protein blocks a mutated cell before it enters the S-phase of the cell cycle, and that block gives the cell an opportunity to repair its DNA or enter the pathway to controlled cell death. p53 is the most commonly mutated gene in human tumours. It is called a tumour suppressor gene because it is the inactivation of the gene that contributes to cancer. This type of inactivating mutation is often recessive, so that both copies of the gene have to be affected. Read more at the p53 homepage (http://p53.free.fr/), which is also a very good example of the more specific databases that are available on the internet. There are databases for an increasing number of different genes, chromosomes and diseases. See Genes and disease (http://www.ncbi.nlm.nih.gov/disease/) on the NCBI homepage and choose the chromosome or disease area that interests you.

In this part of the assignment, you are going to compare a normal and a mutated sequence by using the pairwise BLAST. Collect the p53 gene sequence from NCBI's homepage by searching in GenBank for accession number U94788. You will then get the complete CDS and part of the intronic sequences. To make it easier, look at the mRNA sequence that you find a link to in the GenBank document. Compare the normal p53 mRNA sequence with this tumour sequence. Use BLAST2seq in BLAST and identify the mutations.

1-23. Try performing the alignment with and without the filter. What is the difference in the result and why is there a difference? What can this kind of function be used for?
1-24. What are these mutations and how do you think they could influence the function of the protein? Be very precise and describe every mutation: whether it creates an amino acid change or not, what kind of amino acid change, and whether there will be some other change in the protein due to the mutation. You will need to use either the genetic code or some clever combination of the tools and tricks you have learned so far.
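The codon-by-codon reasoning asked for in 1-24 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the course tools: the function name classify_substitution and the toy sequences in the usage lines are invented for the example, the table is the standard genetic code, and only single-base substitutions inside a coding sequence are handled. Insertions and deletions shift the reading frame and need separate treatment.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(cds, position, new_base):
    """Classify a single-base substitution in a coding sequence.

    cds       -- coding sequence, starting at the first base of the start codon
    position  -- 1-based nucleotide position within the CDS
    new_base  -- the base observed in the mutated sequence

    Returns (codon_number, old_codon, new_codon, old_aa, new_aa, kind).
    """
    cds = cds.upper().replace("U", "T")
    new_base = new_base.upper().replace("U", "T")
    if new_base not in BASES:
        raise ValueError("new_base must be one of A, C, G, T")

    index = position - 1                 # 0-based index of the mutated base
    codon_start = index - index % 3      # 0-based index of the codon's first base
    old_codon = cds[codon_start:codon_start + 3]
    if len(old_codon) != 3:
        raise ValueError("position does not fall inside a complete codon")

    offset = index - codon_start
    new_codon = old_codon[:offset] + new_base + old_codon[offset + 1:]
    old_aa = CODON_TABLE[old_codon]
    new_aa = CODON_TABLE[new_codon]

    if old_aa == new_aa:
        kind = "silent"
    elif new_aa == "*":
        kind = "nonsense"      # premature stop codon
    elif old_aa == "*":
        kind = "stop-loss"     # stop codon replaced by an amino acid
    else:
        kind = "missense"      # amino acid replaced by another

    return codon_start // 3 + 1, old_codon, new_codon, old_aa, new_aa, kind


# Toy sequences (invented, not p53):
print(classify_substitution("ATGCGTTGA", 5, "A"))  # CGT -> CAT in codon 2
print(classify_substitution("ATGCGATGA", 4, "T"))  # CGA -> TGA in codon 2
```

To use it on real data, the position must be counted from the first base of the start codon, not from the first base of the mRNA record.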
From the NCBI home page you will also find a link to OMIM (Online Mendelian Inheritance in Man). Here you will find review summaries of diseases with genetic factors involved. There are also links to PubMed/Medline references for further reading. Search in OMIM for answers to the following questions:

1-25. In some areas of China and Africa very specific mutations of the p53 gene have been found in liver tumours. What environmental risk factors have been shown to contribute to these mutations?
1-26. Although most p53 mutations occur in sporadic cancer cases, some have been found as germline mutations. What is this familial syndrome called?
1-27. There are several allelic variants found in patients with this inherited cancer form. Which are the first four variants mentioned in patients with this syndrome, and in which codons and which exons are they located?

Hints: If you start from the first reference you used in GenBank (U94788) and click on the CDS link, you will get the nucleotide numbers for each exon, counted as if the first codon starts with nucleotide number one.
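With the numbering described in the hint (the first base of the first codon is nucleotide 1), converting between codon numbers and nucleotide positions is simple integer arithmetic. The sketch below is an illustration under that assumption. The helper names and the EXAMPLE_EXONS table are invented for the example; the exon boundaries in it are placeholders, not the real p53 values, which you should read from the CDS view in GenBank.

```python
def codon_number(cds_position):
    """Codon containing a 1-based CDS nucleotide position."""
    return (cds_position - 1) // 3 + 1


def codon_span(codon):
    """First and last 1-based CDS nucleotide positions of a codon."""
    return 3 * (codon - 1) + 1, 3 * codon


def exon_of(cds_position, exons):
    """Label of the exon whose CDS range contains the position, or None.

    exons -- list of (label, first_position, last_position) tuples,
             using the same 1-based CDS numbering.
    """
    for label, first, last in exons:
        if first <= cds_position <= last:
            return label
    return None


# Hypothetical exon ranges, for illustration only (NOT the p53 boundaries).
EXAMPLE_EXONS = [
    ("exon A", 1, 90),
    ("exon B", 91, 240),
    ("exon C", 241, 300),
]

first, last = codon_span(35)               # nucleotides covered by codon 35
print(first, last, exon_of(first, EXAMPLE_EXONS))
print(codon_number(742))                   # codon containing CDS position 742
```

A codon that straddles an exon boundary has its first and last nucleotide in different exons, so checking both ends of the span is a useful sanity test.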