Sequence comparisons
Using the BLAST and FASTA tools
Introduction
Xeroderma Pigmentosum (XP) is a heterogeneous group of genetically determined skin
disorders caused by unusual sensitivity to ultraviolet light. The sun-exposed areas of the skin
have a strong tendency to develop tumours. The median age of onset of the first skin neoplasm
in these patients is 8 years, as compared to 50 years for sporadic skin tumour cases. The
causes are different genetic defects of the DNA repair system. The cell uses nucleotide
excision repair to remove so-called bulky lesions, which are typical of UV-induced DNA damage.
Several enzymes are involved in this type of DNA repair. In XP patients, mutations have been
found in at least seven different genes coding for such enzymes.
Links: For more information about the disease Xeroderma Pigmentosum look at this shortcut or search the OMIM database for number 278700, corresponding to the XPA gene.
Assignment
BLAST is a local alignment tool found at the NCBI website. Read about the BLAST
functions and the different modes you can choose from on the web page
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html. You may find the
Query tutorial, BLAST tutorial and More information useful. Also read the overview and
FAQs at the BLAST web site (http://www.ncbi.nlm.nih.gov/BLAST/).
Hints: If you have problems using BLAST, first look at the Frequently Asked Questions. I
also recommend reading the chapter about pairwise alignment techniques in the course
literature “Introduction to Bioinformatics” by Attwood & Parry-Smith. Please check the
following things:
In some of the BLAST search modes (and also in other programs you will use in this
course) you must load the sequence in the FASTA format.
Choose the correct database, usually nr (nr = non-redundant).
If the search is taking a very long time, you can use the option to get your results by email. This does not work with BLAST 2 sequences.
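As an illustration of the first hint, the following Python sketch wraps a raw sequence into a FASTA record: one header line starting with ">" followed by the sequence on one or more lines. The function name `to_fasta`, the example identifier and the default line width of 60 characters are assumptions made for this example, not requirements of BLAST.

```python
def to_fasta(identifier, sequence, width=60):
    """Return a FASTA record: a '>' header line followed by wrapped sequence lines."""
    # Drop spaces, digits and line breaks that often come along when copying a sequence.
    cleaned = "".join(ch for ch in sequence if ch.isalpha()).upper()
    lines = [">" + identifier]
    for start in range(0, len(cleaned), width):
        lines.append(cleaned[start:start + width])
    return "\n".join(lines)


if __name__ == "__main__":
    # The identifier is made up; the residues are the start of the query sequence
    # used later in this assignment.
    print(to_fasta("query_1 example protein", "MNDLSGKTVI ITGGARGLGA EAARQAVAAG"))
```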
Please take your time getting acquainted with BLAST. You will make great use of it in this
course, and most likely in your future work! Now that you have gone through the general
introduction to BLAST, you are ready to answer some questions. Please use your own
words to formulate your answers.
1-1. What is the difference between global and local alignment?
1-2. Most bioinformatic tools take FASTA sequences as input. Describe the FASTA
sequence format.
1-3. Nucleotide blast (blastn) is used to compare a DNA or RNA sequence against a
database of nucleotide sequences. Describe the functions of the other modes (or
programs) you can choose.
1-4. What is an E-value? (What does it stand for? How can we use it?)
There are two main tools for sequence comparisons: BLAST and FASTA. Make
sequence comparisons using both tools with the following sequence as the query
sequence. In both cases, search against the Swissprot database. Report the
identifiers for the top 5 hits using:
1-5. FASTA (http://www.ebi.ac.uk/fasta3/)
1-6. BLAST (http://www.ncbi.nlm.nih.gov/BLAST)
MNDLSGKTVI ITGGARGLGA EAARQAVAAG ARVVLADVLD EEGAATAREL GDAARYQHLD
VTIEEDWQRV VAYAREEFGS VDGLVNNAGI STGMFLETES VERFRKVVDI NLTGVFIGMK
TVIPAMKDAG GGSIVNISSA AGLMGLALTS SYGASKWGVR GLSKLAAVEL GTDRIRVNSV
HPGMTYTPMT AETGIRQGEG NYPNTPMGRV GNEPGEIAGA VVKLLSDTSS YVTGAELAVD
GGWTTGPTVK YVMGQ
Now we shall look at the difference between searching with the complete sequence and
searching with only a short segment of the sequence. The test case is human
plasminogen (below), whose sequence you will compare with all sequences in the
Swissprot database using FASTA (http://www.ebi.ac.uk/fasta3).
MEHKEVVLLL LLFLKSGQGE PLDDYVNTQG ASLFSVTKKQ LGAGSIEECA AKCEEDEEFT 60
CRAFQYHSKE QQCVIMAENR KSSIIIRMRD VVLFEKKVYL SECKTGNGKN YRGTMSKTKN 120
GITCQKWSST SPHRPRFSPA THPSEGLEEN YCRNPDNDPQ GPWCYTTDPE KRYDYCDILE 180
CEEECMHCSG ENYDGKISKT MSGLECQAWD SQSPHAHGYI PSKFPNKNLK KNYCRNPDRE 240
LRPWCFTTDP NKRWELCDIP RCTTPPPSSG PTYQCLKGTG ENYRGNVAVT VSGHTCQHWS 300
AQTPHTHNRT PENFPCKNLD ENYCRNPDGK RAPWCHTTNS QVRWEYCKIP SCDSSPVSTE 360
QLAPTAPPEL TPVVQDCYHG DGQSYRGTSS TTTTGKKCQS WSSMTPHRHQ KTPENYPNAG 420
LTMNYCRNPD ADKGPWCFTT DPSVRWEYCN LKKCSGTEAS VVAPPPVVLL PDVETPSEED 480
CMFGNGKGYR GKRATTVTGT PCQDWAAQEP HRHSIFTPET NPRAGLEKNY CRNPDGDVGG 540
PWCYTTNPRK LYDYCDVPQC AAPSFDCGKP QVEPKKCPGR VVGGCVAHPH SWPWQVSLRT 600
RFGMHFCGGT LISPEWVLTA AHCLEKSPRP SSYKVILGAH QEVNLEPHVQ EIEVSRLFLE 660
PTRKDIALLK LSSPAVITDK VIPACLPSPN YVVADRTECF ITGWGETQGT FGAGLLKEAQ 720
LPVIENKVCN RYEFLNGRVQ STELCAGHLA GGTDSCQGDS GGPLVCFEKD KYILQGVTSW 780
GLGCARPNKP GVYVRVSRFV TWIEGVMRNN 810
a) Start by doing a search using the complete sequence. Store these results (e.g. in
Emacs, Notepad or any word processor) so that you can compare them with
the results in b) and c) below.
b) Do a search using only the segment between positions 121 and 240 as the query
sequence.
c) Finally, do a third search using only the segment between positions 601 and 780.
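The segments in b) and c) are given as positions counted from 1, with both end positions included. If you prefer to cut them out with a script rather than by hand, a minimal Python sketch could look like this. The helper names `clean_sequence` and `segment` are assumptions made for this example, and the pasted text is only the first 120 residues of the plasminogen sequence above.

```python
def clean_sequence(text):
    """Keep only letters, so spaces, position numbers and line breaks are removed."""
    return "".join(ch for ch in text if ch.isalpha()).upper()


def segment(sequence, first, last):
    """Return residues first..last, counted from 1 and including both ends."""
    if first < 1 or last > len(sequence) or first > last:
        raise ValueError("segment lies outside the sequence")
    return sequence[first - 1:last]


if __name__ == "__main__":
    # First 120 residues of the plasminogen sequence, pasted with its numbering.
    pasted = """
    MEHKEVVLLL LLFLKSGQGE PLDDYVNTQG ASLFSVTKKQ LGAGSIEECA AKCEEDEEFT 60
    CRAFQYHSKE QQCVIMAENR KSSIIIRMRD VVLFEKKVYL SECKTGNGKN YRGTMSKTKN 120
    """
    seq = clean_sequence(pasted)
    print(len(seq))              # number of residues after cleaning
    print(segment(seq, 61, 70))  # residues 61 to 70
```

With the complete sequence pasted in, `segment(seq, 121, 240)` and `segment(seq, 601, 780)` would give the query segments for b) and c).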
Quickly look through the descriptions for the top 50 results (a, b and c above).
1-7. Are they identical or not?
1-8. What kinds of protein did you find?
1-9. Are there any kinds of protein in b or c that were not found in a?
1-10. If so, can you find an explanation for this?
Use BLAST to identify the gene from mRNA sequence 1. (Be patient, it may take a
while.) From your result you can enter directly into GenBank via the link, or search for
the accession number at the GenBank website
(http://www.ncbi.nlm.nih.gov/Genbank/index.html), to gain access to more
information about the gene.
Hints:
Do not remove empty spaces or numbering from the sequence manually; the program
does that automatically.
Sometimes your search can run very slowly, especially in the afternoon when the
scientists in the big country in the West have started their work. If you run into this
problem, use the option to receive your answer via e-mail. Some programs also have a
queuing system, where you get an identification number for your particular search.
With this number you can return to the web site later and look at your result.
Alternatively, get up early in the morning...
You will probably have to work on every assignment on several different occasions and
maybe also use different computers. It might be wise to save part of your
results in a Word or Notepad document and save it to a disk.
1-11. This gene has several names. Which are they? (Use either the abbreviations or the
full names.)
1-12. Give the accession number for the sequence of this gene (or its transcript).
1-13. At which position in the mRNA does the coding sequence (CDS) start?
1-14. Removed. Enter 'removed' into the assignment report form. (Updated 2007-11-15:
Deprecated due to new evidence.)
1-15. This enzyme is a part of a protein complex (NER), but what function does it
have in the complex?
There are several tools available to convert DNA sequence to protein. At the ExPASy
web site (http://www.expasy.ch/) they can be found under the headline Proteomics.
Use the following tools to verify the amino acid sequence you found in GenBank in
the previous question (1-11):
Translate (http://www.expasy.ch/tools/dna.html)
Transeq in EMBOSS (http://www.ebi.ac.uk/emboss/transeq/)
Sixpack (available in SRS (http://srs.ebi.ac.uk/), tab Tools, drop down menu
or http://humpback.bii.a-star.edu.sg/cgi-bin/emboss/emboss.pl?_action=input&_app=sixpack)
Hint: Remove the header line of the FASTA-formatted sequence before translating!
Otherwise the header text may be interpreted as nucleotides.
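If you want to strip the header with a script rather than by hand, the following Python sketch shows one way to do it for a single FASTA record. The function name `strip_fasta_header` and the example record are made up for this illustration.

```python
def strip_fasta_header(record):
    """Return the sequence of a single FASTA record without its '>' header line."""
    lines = (line.strip() for line in record.splitlines())
    # Skip empty lines and the header line, then join the remaining sequence lines.
    return "".join(line for line in lines if line and not line.startswith(">"))


if __name__ == "__main__":
    # Made-up record for illustration only.
    record = ">example_mrna hypothetical header text\nATGGCGGCG\nGACTAA\n"
    print(strip_fasta_header(record))
```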
1-16. Which are the first ten amino acids of the expressed protein?
1-17. Describe the differences between the search tools.
1-18. Which tool did you like best - and why?
PS Don't miss the Swiss jokes at the ExPASy site!!
You have the recently cloned mRNA sequence 2, but you do not know the reading frame.
There are three possible reading frames on each DNA strand. Only one of them is correct and
gives a functional protein. Instead of using a translation program, as in questions 1-16 to 1-18,
you can use one of the translated BLAST programs to directly translate and identify the gene
product. Also use blastn to identify the gene.
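To make the idea of six reading frames concrete (three on each strand), here is a minimal Python sketch that translates a nucleotide sequence in all of them using the standard genetic code. It only illustrates the principle and is not how BLAST is implemented; the names `six_frames`, `translate` and `reverse_complement` are assumptions made for this example, and the test sequence is made up.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown bases become 'X'."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frames(sequence):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = sequence.upper().replace("U", "T")  # accept mRNA written with U
    frames = {}
    for strand, sign in ((dna, "+"), (reverse_complement(dna), "-")):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

A frame that translates into a long stretch without stop codons ('*') is usually the best candidate for the true reading frame.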
1-19. Which BLAST program did you use to compare all six frames?
1-20. What are the effects on quality of performing the sequence comparisons in BLAST on
the protein level instead of on the nucleotide level when you have a nucleotide query?
1-21. This gene also has (at least) two names. Which are they? (Again, use either the abbreviations
or the full names.)
1-22. What is the enzymatic function of the protein that this gene codes for? Describe what
the protein does.
Even though deficiency in DNA repair systems has been shown to be a contributor to tumour
development, the most common cause of cancer is malfunction of the cell cycle control.
Normal cells use cell cycle checkpoints as a mechanism to avoid accumulation of genomic
errors before cell division. Normal p53 protein blocks a mutated cell before it enters the cell
cycle S-phase and that block gives the cell an opportunity to repair its DNA or enter the
pathway to controlled cell death. p53 is the most commonly mutated gene in human tumours.
It is called a tumour suppressor gene because it is the inactivation of the gene that contributes
to cancer. This type of inactivating mutation is often recessive, so that both copies of the
gene have to be affected. Read more at the p53 homepage (http://p53.free.fr/), which is also
a very good example of the more specific databases that are available on the internet. There
are databases for an increasing number of different genes, chromosomes and diseases. See
Genes and disease (http://www.ncbi.nlm.nih.gov/disease/) on the NCBI homepage and
choose the chromosome or disease area that interests you.
In this part of the assignment, you are going to compare a normal and a mutated sequence by
using the pairwise BLAST.
Collect the p53 gene sequence from NCBI's homepage by searching in GenBank for
accession number U94788. You will then get the complete CDS and parts of the intronic
sequences. To make it easier, look at the mRNA sequence that you find a link to in the
GenBank document. Compare the normal p53 mRNA sequence with this tumour sequence.
Use BLAST2seq in BLAST and identify the mutations.
1-23. Try performing the alignment with and without filter. What is the difference in the
result and why is there a difference? What can this kind of function be used for?
1-24. What are these mutations and how do you think these mutations could influence the
function of the protein? Be very precise and describe every mutation: whether or not it causes
an amino acid change, what kind of amino acid change, and whether there will be some other
change in the protein due to the mutation. You will need to use either the genetic code or some
clever combination of the tools and tricks you have learned so far.
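One possible way to use the genetic code here is sketched below in Python. The sketch assumes that the two sequences are coding sequences of equal length that both start at the first base of the start codon and differ only by base substitutions; insertions and deletions are not handled. The function name `describe_substitutions` and the short test sequences are made up for this example and are not taken from p53.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def describe_substitutions(normal, mutant):
    """List (base position, old base, new base, codon number, old aa, new aa, kind)."""
    if len(normal) != len(mutant):
        raise ValueError("this sketch handles substitutions only, so lengths must match")
    if len(normal) % 3 != 0:
        raise ValueError("expected a whole number of codons")
    normal, mutant = normal.upper(), mutant.upper()
    changes = []
    for i, (old_base, new_base) in enumerate(zip(normal, mutant)):
        if old_base == new_base:
            continue
        start = (i // 3) * 3  # first base of the codon containing position i
        old_aa = CODON_TABLE.get(normal[start:start + 3], "X")
        new_aa = CODON_TABLE.get(mutant[start:start + 3], "X")
        if new_aa == old_aa:
            kind = "silent"
        elif new_aa == "*":
            kind = "nonsense"
        elif old_aa == "*":
            kind = "stop lost"
        else:
            kind = "missense"
        # Positions and codon numbers are reported counting from 1.
        changes.append((i + 1, old_base, new_base, i // 3 + 1, old_aa, new_aa, kind))
    return changes


if __name__ == "__main__":
    for change in describe_substitutions("ATGCGTTGGGAA", "ATGCATTGAGAG"):
        print(change)
```

A silent change leaves the amino acid unchanged, a missense change replaces it with another amino acid, and a nonsense change introduces a stop codon that truncates the protein.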
From the NCBI home page you can also find a link to OMIM (Online Mendelian Inheritance in
Man). Here you will find review summaries of diseases with genetic factors involved. There
are also links to Pubmed/Medline references for further reading. Search in OMIM for answers
to the following questions:
1-25. In some areas of China and Africa very specific mutations of the p53 gene have been
found in liver tumours. What environmental risk factors have been shown to contribute to
these mutations?
1-26. Although most p53 mutations occur in sporadic cancer cases, some have been found as
germline mutations. What is this familial syndrome called?
1-27. There are several allelic variants found in patients with this inherited cancer form.
Which are the first four variants mentioned in patients with this syndrome, and in which
codons and which exons are they located?
Hint: If you start from the first reference you used in GenBank (U94788) and click on
the CDS link there, you will get the nucleotide numbers for each exon, counted
as if the first codon starts with nucleotide number one.
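With this numbering, converting between nucleotide numbers and codon numbers is simple arithmetic, since codon 1 covers nucleotides 1 to 3, codon 2 covers nucleotides 4 to 6, and so on. A small Python sketch follows; the function names `codon_of` and `codon_range` are assumptions made for this example, and the numbers in the demonstration are arbitrary.

```python
def codon_of(cds_position):
    """Map a 1-based CDS nucleotide number to (codon number, position in codon)."""
    if cds_position < 1:
        raise ValueError("CDS positions are counted from 1")
    codon_number = (cds_position - 1) // 3 + 1
    position_in_codon = (cds_position - 1) % 3 + 1
    return codon_number, position_in_codon


def codon_range(codon_number):
    """Return the first and last 1-based CDS nucleotide numbers of a codon."""
    if codon_number < 1:
        raise ValueError("codons are counted from 1")
    first = 3 * (codon_number - 1) + 1
    return first, first + 2


if __name__ == "__main__":
    print(codon_of(10))     # codon number and position within the codon
    print(codon_range(4))   # nucleotide numbers covered by codon 4
```

To find the exon that holds a given codon, compare the range returned by `codon_range` with the exon boundaries listed under the CDS link.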