Exercise 6
Searching and analyzing protein sequences
using the Internet
A few of the exercises below involve the alignment of 2-4 different sequences. Useful sites for the
alignment of two sequences:
www-hto.usc.edu/software/seqaln/seqaln-query.html (select global alignment!)
genome.eerie.fr/bin/align-guess.cgi
Site for multiple sequence alignment:
www.medkem.gu.se/ln/molbio/gene/msf.html
1. Sequence analysis
Question # 3 of 8
From the previous exercise you learned that your DNA sequence has the following
potential protein products:
Forward strand:
5' CUGCCCUGUGCAGCUGUGGGUUGAUUCCACACUC 3'
L P C A A V G * F H T
C P V Q L W V D S T L
A L C S C G L I P H
Reverse strand:
3' GACGGGACACGUCGACACCCAACUAAGGUGUGAG 5'
Q G T C S H T S E V S
A R H L Q P N I G C E
G Q A A T P Q N W V
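The six reading frames listed above can be reproduced with a short script. The following is a minimal Python sketch, assuming the standard genetic code; the function names `translate`, `reverse_complement` and `six_frames` are illustrative and not part of any course software. Note that the listing above prints the reverse-strand peptides underneath the 3'-to-5' strand, so they read C-terminus first, whereas the sketch returns every peptide N-terminus first (the orientation a database search expects).

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order (UUU, UUC, UUA, UUG, UCU, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def translate(rna, frame=0):
    """Translate complete codons only, starting at offset `frame` (0, 1 or 2)."""
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def reverse_complement(rna):
    """Return the complementary strand written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def six_frames(rna):
    rev = reverse_complement(rna)
    return {
        "forward": [translate(rna, f) for f in range(3)],
        "reverse": [translate(rev, f) for f in range(3)],
    }


frames = six_frames("CUGCCCUGUGCAGCUGUGGGUUGAUUCCACACUC")
for strand, peptides in frames.items():
    for number, peptide in enumerate(peptides, start=1):
        print(strand, number, peptide)
```

Each printed peptide can be pasted into a blastp search as it is; a `*` marks a stop codon.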
You will now try to identify proteins in protein sequence databases that match any of the open reading
frames predicted from your DNA sequence. In the Basic BLAST Search, select the program 'blastp' and
'nr' as the database. Try BLAST searches with the different peptide sequences. For one of them you should
be able to identify a protein in the database with a nearly identical amino acid sequence.
What is the name of the database protein highly homologous to your sequence?
**********************************************************************************
2. Searching databases for sequence homology
Sequence homology. Connect to NCBI-BLAST and subject the amino acid sequence above to a
BLAST search. Go to www3.ncbi.nlm.nih.gov/BLAST/ and select "Basic BLAST search". Paste the
sequence into the window of BLAST. Select "blastp" as program and "swissprot" as database.
The result should be something like:
Sequences producing High-scoring Segment Pairs:               Score  P(N)      N
sp|P04637|P53_HUMAN CELLULAR TUMOR ANTIGEN P53 (PHOSPHOP...    1582  6.1e-209  1
sp|P13481|P53_CERAE CELLULAR TUMOR ANTIGEN P53                 1536  1.3e-202  1
sp|P41685|P53_FELCA CELLULAR TUMOR ANTIGEN P53                  994  8.3e-177  2
.
.
.
Examine the result carefully, including the alignments of query and database sequences.
FASTA.
1. Use FASTA to identify homologs of the SRP54 protein in bacteria. Select as query sequence
   the mouse SRP54 protein ("sw:sr54_mouse").
2. The expression from ferritin and transferrin messengers is regulated by a protein, the IRE
   (iron responsive element) binding protein. When the amino acid sequence of the protein was
   obtained from a cDNA clone there was an unexpected similarity to a previously identified
   protein. Use STRINGSEARCH to locate the sequence of IRE binding protein in the Swissprot
   database (Hint: use for instance iron,responsive as search string). Then use FASTA to
   compare it to the same database. Can you identify the protein related to IRE binding protein?
3. Consider the sequence em:Z82206. In the annotation section there is information about an
   exon (<20814..21617). Use FASTA (www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nphblast?Jform=0) to
   compare this sequence to the human section of the EMBL database. What
   seems to be the protein encoded by the exon?
4. Use the three DNA sequences 11e03t3.seq, 12c02t3.seq and 7f06t3.seq (in the directory
   ~/gcg/4). They are bovine sequences in "Bluescript" vectors. Perform a FASTA search in the
   database of rodent DNA sequences (rod:*) to see if there are any homologous sequences. For
   at least one of the sequences you should be able to identify protein homologues.
**********************************************************************************
3. Comparing two sequences
GAP. The two sequences 1.seq and 2.seq are present in the directory ~/gcg/2. Compare the two
sequences 1.seq and 2.seq using the "Gap" program. Do they look similar? Use the option "Generate
statistics from randomized alignments" (under "Options" in the GAP window) to answer the question.
BESTFIT. Compare the two sequences 1.seq and 2.seq using the "Bestfit" program. Do they look
similar? Use the option "Generate statistics from randomized alignments" (under "Options" in the
BESTFIT window) to answer the question.
Do you get different results from "Gap" and "Bestfit"? Why?
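The difference between the two programs comes from the kind of alignment they compute: GAP aligns the two sequences over their full length (a global, Needleman-Wunsch type alignment), whereas BESTFIT reports only the best matching segment (a local, Smith-Waterman type alignment). The sketch below illustrates this in Python. It is a simplification under stated assumptions: it uses toy match/mismatch scores and a single linear gap penalty instead of GCG's scoring matrices and separate gap creation and extension penalties, and the two test sequences are made up. The `shuffled_scores` helper only mimics the idea behind "Generate statistics from randomized alignments" (realign against shuffled copies of the second sequence); it is not the GCG implementation.

```python
import random
import statistics

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed toy scores, not GCG defaults


def align_score(a, b, local=False):
    """Score of the best global alignment, or of the best local one if local=True."""
    prev = [0 if local else j * GAP for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * GAP]
        for j in range(1, len(b) + 1):
            s = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            cell = max(prev[j - 1] + s, prev[j] + GAP, cur[j - 1] + GAP)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            cur.append(cell)
        prev = cur
    return best if local else prev[-1]


def shuffled_scores(a, b, n=20, seed=0, local=False):
    """Scores of `a` aligned against n shuffled copies of `b` (background)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        letters = list(b)
        rng.shuffle(letters)
        scores.append(align_score(a, "".join(letters), local))
    return scores


# Made-up sequences: a shared core (ACGT) embedded in unrelated flanks.
a = "GGGGACGTCCCC"
b = "TTTTACGTAAAA"
print("global:", align_score(a, b))
print("local: ", align_score(a, b, local=True))
background = shuffled_scores(a, b, local=True)
print("shuffled mean and sd:", statistics.mean(background), statistics.pstdev(background))
```

With these toy inputs the global score is dragged down by the mismatching flanks while the local score reflects only the shared core, which is the kind of discrepancy to look for when comparing the GAP and BESTFIT output.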
DOTPLOT. Compare the two sequences 1.seq and 2.seq by "dotplot" analysis. Run COMPARE with
the output directed to DOTPLOT (DOTPLOT uses the output from COMPARE to make a 2D plot).
Identification of repeats with DOTPLOT. COMPARE may be used to identify repeats in a
sequence. Analyze the sequence sw:prio_human (the human prion protein). In the Editor mode of
Seqlab, make a copy of it to create two identical sequences. Apply these as input to COMPARE. How
many repeats can you identify? Compare to the information in the annotation section (Use "Graphical
features" to display the repeat regions in the Editor mode of Seqlab).
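The principle behind a dot-matrix self-comparison can be sketched in a few lines of Python. This is an illustration only, assuming a simple identity count within a sliding window (a "window" and "stringency" in the spirit of COMPARE, not its actual scoring); the function names and the toy sequence, built from a made-up eight-residue unit repeated three times, are hypothetical. Repeats show up as runs of dots on diagonals parallel to the main diagonal, and the distance of such a diagonal from the main one gives the repeat spacing.

```python
from collections import Counter


def dot_matrix(a, b, window, stringency):
    """Return (i, j) pairs where a[i:i+window] and b[j:j+window]
    agree in at least `stringency` positions."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(x == y for x, y in zip(a[i:i + window], b[j:j + window]))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def repeat_offsets(seq, window, stringency):
    """Count the dots on each off-diagonal (j - i > 0) of a self-comparison."""
    dots = dot_matrix(seq, seq, window, stringency)
    return Counter(j - i for i, j in dots if j > i)


# Toy sequence (made up): an 8-residue unit repeated three times between flanks.
toy = "MMM" + "ACDEFGHK" * 3 + "WWW"
offsets = repeat_offsets(toy, window=8, stringency=8)
print(dict(offsets))
```

Three copies of the unit give one long diagonal at an offset of one unit length and a shorter one at two unit lengths; counting such diagonals is how the number of repeats is read off a dot plot.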
Alignment of genomic sequence with mRNA. Retrieve the nt sequences em:V00594 (Human mRNA
for metallothionein) and em:J00271 (corresponding genomic sequence). (Use the database browser).
Compare these sequences by doing an alignment with GAP. Based on the alignment, how many exons
are there in this gene? Compare your result to what's in the annotation section for J00271. The result of
GAP is in this case very much dependent on what gap penalty parameters you select. Try for instance
Gap creation penalty = 10 and Gap extension penalty = 0 (Under Options in the GAP window).
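Why these parameters matter can be seen from how a gap is charged. The Python sketch below assumes the usual affine form, cost = creation penalty + extension penalty x gap length, which should be checked against the GAP documentation; with an extension penalty of 0 an intron-sized gap costs no more than a one-base gap, so the mRNA can be split cleanly into exon blocks. The second helper counts such blocks in an aligned mRNA row. The gap character '.', the `min_intron` threshold and the toy row are assumptions made for illustration.

```python
import re


def gap_cost(length, creation, extension):
    """Affine gap cost (assumed form): one creation charge plus a per-position charge."""
    return creation + extension * length


def count_exon_blocks(aligned_mrna, gap_char=".", min_intron=5):
    """Count stretches of aligned residues separated by gap runs
    of at least `min_intron` characters."""
    row = aligned_mrna.strip(gap_char)  # ignore overhang at either end
    if not row:
        return 0
    long_gap = re.escape(gap_char) + "{%d,}" % min_intron
    return len([block for block in re.split(long_gap, row) if block])


print(gap_cost(1, creation=10, extension=0))      # one-base gap
print(gap_cost(1000, creation=10, extension=0))   # intron-sized gap, same cost
print(gap_cost(1000, creation=10, extension=1))   # heavily penalised

# Toy aligned mRNA row (made up): short gaps are kept inside a block.
toy_row = "...ATGGAC......TGCTCC.TGC........GCCTGA.."
print(count_exon_blocks(toy_row))
```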
**********************************************************************************
4. Multiple sequence alignment.
PILEUP
2.
Sequence analysis of Drosophila homeotic genes reveals a region highly conserved, the
homeobox. In the protein antennapedia (antp) this sequence is:
Arg Arg Arg Ile Glu Ile Ala His Ala Leu Cys Leu Thr Glu Arg Gln Ile Lys Ile Trp Phe Gln
Asn Arg Arg Met Lys
Enter this sequence with the SeqLab editor (one-letter symbols!) and use FASTA to identify
homologous sequences in the database. Then select 6-7 of these sequences and use PILEUP to
align the sequences. Look at "Graphical features" to see what's in the feature section of these
entries. Can you find the homeobox motif?
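Converting the three-letter codes to one-letter symbols by hand is error-prone, so a small helper may be useful. The following Python sketch uses the standard one-letter amino acid codes; the function name `to_one_letter` is illustrative.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_text):
    """Convert whitespace-separated three-letter codes to a one-letter string."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in three_letter_text.split())


# The antennapedia homeobox segment given in the exercise.
antp = ("Arg Arg Arg Ile Glu Ile Ala His Ala Leu Cys Leu Thr Glu Arg Gln "
        "Ile Lys Ile Trp Phe Gln Asn Arg Arg Met Lys")
print(to_one_letter(antp))
```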
Use the result from PILEUP with PRETTY to display the alignment.
Part of the result from PRETTY could be something like this:
           301                                                350
HMSC_DROME YPWMKRVHLG TSTVNANGET KRQ.RTSYTR YQTLELEKEF HFNRYLTRRR
HMSC_APIME .......... ..TVNANGEV KRQ.RTSYTR YQTLELEKEF HFNRYLTRRR
HMAA_DROME MGSPFERVVC GDFNGPNGCP RRRGRQTYTR FQTLELEKEF HFNHYLTRRR
HMAA_APIME .......... ...PGPNGCP RRRGRQTYTR FQTLELEKEF HYNHYLTRRR
HMAA_SCHGR .......... .....PNGCP RRRGRQTYTR FQTLELEKEF HFNHYLTRRR
HMUX_DROME .......... ....GTNG.L RRRGRQTYTR YQTLELEKEF HTNHYLTRRR
HXB6_HUMAN PVYPWMQRMN SCNSSSFGPS GRRGRQTYTR YQTLELEKEF HYNRYLTRRR

           351                                                400
HMSC_DROME RIEIAHALCL TERQIKIWFQ NRRMKWKKE. HKMASMNIVP YHMGPYGHPY
HMSC_APIME RIEIAHALCL TERQIKIWFQ NRRMKWKKE. HKMASMNIVP YHMSPYGHPY
HMAA_DROME RIEIAHALCL TERQIKIWFQ NRRMKLKKEL RAVKEINEQA RRDREEQEKM
HMAA_APIME RIEIAHALCL TERQIKIWFQ NRRMKLKKEL RAVKEIN... ..........
HMAA_SCHGR RIEIAHALCL TERQIKIWFQ NRRMKLKKEL RAVKEINEQA RREREEQDRL
HMUX_DROME RIEMAHALCL TERQIKIWFQ NRRMKLKKEI QAIKELNEQE KQAQAQKAAA
HXB6_HUMAN RIEIAHALCL TERQIKIWFQ NRRMKWKKES KLLSASQLSA EEEEEKQAE.
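To see which alignment columns are invariant, the rows can be compared column by column. The Python sketch below does this for positions 351-380 of the PRETTY output shown above (the first three groups of ten in each row of the second block). The function name `conservation_line` and the use of '-' for variable columns are choices made for this illustration, not PRETTY's own consensus output.

```python
# Positions 351-380 copied from the PRETTY output above.
block = {
    "HMSC_DROME": "RIEIAHALCL TERQIKIWFQ NRRMKWKKE.",
    "HMSC_APIME": "RIEIAHALCL TERQIKIWFQ NRRMKWKKE.",
    "HMAA_DROME": "RIEIAHALCL TERQIKIWFQ NRRMKLKKEL",
    "HMAA_APIME": "RIEIAHALCL TERQIKIWFQ NRRMKLKKEL",
    "HMAA_SCHGR": "RIEIAHALCL TERQIKIWFQ NRRMKLKKEL",
    "HMUX_DROME": "RIEMAHALCL TERQIKIWFQ NRRMKLKKEI",
    "HXB6_HUMAN": "RIEIAHALCL TERQIKIWFQ NRRMKWKKES",
}


def conservation_line(rows, variable="-"):
    """Return the residue for columns identical in every row, else `variable`."""
    seqs = [row.replace(" ", "") for row in rows.values()]
    return "".join(col[0] if len(set(col)) == 1 else variable for col in zip(*seqs))


print(conservation_line(block))
```

The long run of invariant columns can then be compared with the antennapedia homeobox segment entered earlier.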
**********************************************************************************
5. Protein families
The following four amino acid sequences are derived from human proteins that all bind and hydrolyze
GTP.
EF1-ALPHA
MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
DKLKAERERG ITIDISLWKF ETSKYYVTII DAPGHRDFIK NMITGTSQAD CAVLIVAAGV
GEFEAGISKN GQTREHALLA YTLGVKQLIV GVNKMDSTEP PYSQKRYEEI VKEVSTYIKK
IGYNPDTVAF VPISGWNGDN MLEPSANMPW FKGWKVTRKD GNASGTTLLE ALDCILPPTR
PTDKPLRLPL QDVYKIGGIG TVPVGRVETG VLKPGMVVTF APVNVTTEVK SVEMHHEALS
EF-2
MVNFTVDQIR AIMDKKANIR NMSVIAHVDH GKSTLTDSLV CKAGIIASAR AGETRFTDTR
KDEQERCITI KSTAISLFYE LSENDLNFIK QSKDGAGFLI NLIDSPGHVD FSSEVTAALR
VTDGALVVVD CVSGVCVQTE TVLRQAIAER IKPVLMMNKM DRALLELQLE PEELYQTFQR
IVENVNVIIS TYGEGESGPM GNIMIDPVLG TVGFGSGLHG WAFTLKQFAE MYVAKFAAKG
EGQLGPAERA KKVEDMMKKL WGDRYFDPAN GKFSKSATSP EGKKLPRTFC QLILDPIFKV
SRP54
KELVKLVDPG VKAWTPTKGK QNVIMFVGLQ GSGKTTTCSK LAYYYQRKGW KTCLICADTF
RAGAFDQLKQ NATKARIPFY GSYTEMDPVI IASEGVEKFK NENFEIIIVD TSGRHKQEDS
LFEEMLQVAN AIQPDNIVYV MDASIGQACE AQAKAFKDKV DVASVIVTKL DGHAKGGGAL
SAVAATKSPI IFIGTGEHID DFEPFKTQPF ISKLLGMGDI
SR-alpha
RRVDMLRDIM DAQRRQRPYV VTFCGVNGVG KSTNLAKISF WLLENGFSVL IAACDTFRAG
AVEQLRTHTR RLSALHPPEK HGGRTMVQLF EKGYGKDAAG IAMEAIAFAR NQGFDVVLVD
TAGRMQDNAP LMTALAKLIT VNTPDLVLFV GEALVGNEAV DQLVKFNRAL ADHSMAQTPR
LIDGIVLTKF DTIDDKVGAA ISMTYITSKP IVFVGTGQTY CDLRSLNAKA VVAALMKA
Use multiple sequence alignment to compare them. Which two proteins are most closely
related to each other? These two proteins form a separate class of GTP binding proteins. Can you
identify in the alignment the consensus sequence GXXXXGK(S/T) (the 'X' is any amino acid) that is
typical for GTP binding proteins? This sequence is part of a loop that binds the phosphate group of
GTP.
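The consensus GXXXXGK(S/T) translates directly into a regular expression, which gives a quick way to check candidate positions before looking at the alignment. The Python sketch below scans only the first line of each sequence as printed above (the N-terminal excerpts); the names `P_LOOP` and `find_p_loop` are illustrative. A strict match may be absent from an excerpt: in the EF-2 excerpt, for example, the corresponding segment reads AHVDHGKS, so a relaxed first position is worth checking by eye in the alignment.

```python
import re

P_LOOP = re.compile(r"G.{4}GK[ST]")  # GXXXXGK(S/T), X = any amino acid

# First line of each sequence as given above (N-terminal excerpts only).
n_terminal = {
    "EF1-ALPHA": "MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL",
    "EF-2": "MVNFTVDQIR AIMDKKANIR NMSVIAHVDH GKSTLTDSLV CKAGIIASAR AGETRFTDTR",
    "SRP54": "KELVKLVDPG VKAWTPTKGK QNVIMFVGLQ GSGKTTTCSK LAYYYQRKGW",
    "SR-alpha": "RRVDMLRDIM DAQRRQRPYV VTFCGVNGVG KSTNLAKISF WLLENGFSVL IAACDTFRAG",
}


def find_p_loop(seq):
    """Return (1-based start, matched segment) of the first strict match, or None."""
    match = P_LOOP.search(seq.replace(" ", ""))
    return (match.start() + 1, match.group()) if match else None


hits = {name: find_p_loop(seq) for name, seq in n_terminal.items()}
for name, hit in hits.items():
    print(name, hit)
```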
**********************************************************************************
Profile search
PROFILEMAKE and PROFILESEARCH
There is evidence from sequence comparison that asparagine synthetase is evolutionarily related to
aspartyl-tRNA synthetase (Hinchman, S.K. et al. 1992, J. Biol. Chem. 267: 144-149). The motif below
is from five different aspartyl-tRNA synthetases.
Syd2human  PPHAGGGIGLERVTML
Syd2rat    PPHAGGGIGLERVTML
Sydcyeast  PPHAGGGIGLERVVMF
Sydmyeast  PPHAGFAIGFDRMCAM
Sydecoli   PPHAGLAFGLDRLTML
Enter these sequences in the SeqLab editor and use PROFILEMAKE to create a profile from the
sequences. Finally, search E. coli proteins in Swissprot (sw:*_ecoli) with PROFILESEARCH using the
profile from PROFILEMAKE. Can you identify the relationship with asparagine synthetase?
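The idea of a profile can be illustrated with a much simpler construction than the one PROFILEMAKE uses. The Python sketch below builds a position-specific log-odds table from the five motif sequences, assuming a pseudocount of 1 and a uniform background of 1/20 per residue, and then slides it along a target sequence. This is a deliberately simplified stand-in, not the PROFILEMAKE/PROFILESEARCH algorithm (which also draws on an amino acid comparison table and gap penalties); the target sequence is made up.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
MOTIFS = [
    "PPHAGGGIGLERVTML",  # Syd2human
    "PPHAGGGIGLERVTML",  # Syd2rat
    "PPHAGGGIGLERVVMF",  # Sydcyeast
    "PPHAGFAIGFDRMCAM",  # Sydmyeast
    "PPHAGLAFGLDRLTML",  # Sydecoli
]


def build_profile(seqs, pseudocount=1.0):
    """One dict per column: residue -> log-odds versus a uniform background."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log((counts[aa] + pseudocount) / total * len(ALPHABET))
            for aa in ALPHABET
        })
    return profile


def score(profile, segment):
    return sum(col[aa] for col, aa in zip(profile, segment))


def best_hit(profile, seq):
    """Return (score, 0-based start) of the best-scoring ungapped window."""
    width = len(profile)
    return max((score(profile, seq[i:i + width]), i) for i in range(len(seq) - width + 1))


def consensus(profile):
    return "".join(max(col, key=col.get) for col in profile)


profile = build_profile(MOTIFS)
print(consensus(profile))

# Made-up target: a motif-like segment embedded between unrelated residues.
target = "MKV" + "PPHAGLGIGLERLTML" + "QQ"
print(best_hit(profile, target))
```

Because every column contributes to the score, a profile can pick up a distant relative that matches none of the five input sequences closely, which is the point of this exercise.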
**********************************************************************************
Pattern searches
1. The program FINDPATTERNS may be used to identify patterns in a nucleotide or amino acid
   sequence. Search Swissprot for the sequence "GDSGGP", typical of serine proteases. Click
   on "Patterns" in the FINDPATTERNS window. Select "Create new" and type the sequence
   above. Click on "Apply change", then "Close", and then click on "Run" to execute the search.
2. Use FINDPATTERNS to identify zinc finger proteins. Select as pattern:
   Cx{2,4}Cx12Hx{3,5}H
   (This means: a cysteine residue followed by any two to four amino acids, a cysteine
   residue, any 12 amino acids, histidine, any three to five amino acids and finally histidine.)
   In the result of FINDPATTERNS, can you find any proteins that are described as zinc finger
   proteins?
3. Identify cytochrome proteins that have exactly one methionine residue. Hints: First identify
   cytochromes using STRINGSEARCH. Then use the output from STRINGSEARCH with
   FINDPATTERNS. Select "M" (= methionine) as the pattern to search for. Click on
   "Options..." in the FINDPATTERNS window and set both "Minimum..." and "Maximum
   number of occurrences" = 1.
4. Search for protein motifs in the sequence of human tissue plasminogen activator
   (sw:urot_human). Use the MOTIFS program, which looks for motifs as specified in
   PROSITE. Compare the result from MOTIFS with the information in the annotation section
   for the Swissprot entry.
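The patterns used in these exercises map directly onto regular expressions, which makes it easy to test a pattern on a few sequences before running a database search. The Python sketch below is an illustration only: the regular expressions are hand translations of the patterns above (not FINDPATTERNS syntax), and the three toy sequences are made up.

```python
import re

SERINE_PROTEASE = re.compile(r"GDSGGP")
# Cx{2,4}Cx12Hx{3,5}H written as a regular expression.
ZINC_FINGER = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def has_exactly_one(seq, residue="M"):
    """True if `residue` occurs exactly once (minimum = maximum = 1)."""
    return seq.count(residue) == 1


# Made-up test sequences, not database entries.
toy_proteins = {
    "toy_zf": "MSYKCPECGKSFSQKSNLQKHQRTHTGE",
    "toy_protease": "AACQGDSGGPLVMCMK",
    "toy_plain": "MKKLLAAGG",
}

for name, seq in toy_proteins.items():
    zinc = ZINC_FINGER.search(seq)
    print(
        name,
        "zinc finger:", zinc.group() if zinc else None,
        "GDSGGP:", bool(SERINE_PROTEASE.search(seq)),
        "single Met:", has_exactly_one(seq),
    )

single_met = [name for name, seq in toy_proteins.items() if has_exactly_one(seq)]
print(single_met)
```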
**********************************************************************************
Miscellaneous programs
Protein secondary structure. HELICALWHEEL is used to display the arrangement of residues in an
α-helical structure. Create the sequence "LRKQF KEMKK MMKQM TNMS" with the SeqLab editor
and examine it with HELICALWHEEL.
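A helical wheel places successive residues 100 degrees apart, since an ideal α-helix has 3.6 residues per turn (360 / 3.6 = 100). The Python sketch below lists the angular position of each residue of the peptide from this exercise; it only computes the angles and does not draw the wheel, and the function name `wheel_positions` is illustrative.

```python
DEGREES_PER_RESIDUE = 100  # 360 degrees / 3.6 residues per turn (ideal alpha helix)


def wheel_positions(seq):
    """Return (position, residue, angle in degrees) for each residue."""
    return [(i + 1, aa, (i * DEGREES_PER_RESIDUE) % 360) for i, aa in enumerate(seq)]


peptide = "LRKQFKEMKKMMKQMTNMS"
positions = wheel_positions(peptide)
for pos, aa, angle in positions:
    print(f"{pos:2d} {aa} {angle:3d}")
```

Sorting the output by angle shows which residues end up on the same face of the helix, which is what the HELICALWHEEL plot displays graphically.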
PEPTIDESORT. PEPTIDESORT examines an amino acid sequence for proteolytic cleavage sites.
Exercise: Cleave sw:gag_rsvp with trypsin. What fragments are obtained?
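The trypsin digest can be sketched with the commonly quoted simplified rule: cleave on the C-terminal side of lysine (K) or arginine (R), except when the next residue is proline. The Python sketch below applies that rule; it is an assumption-laden simplification and may differ from the enzyme definitions PEPTIDESORT uses. Since the sequence of sw:gag_rsvp is not reproduced here, the example input is the short peptide from the HELICALWHEEL exercise.

```python
def trypsin_digest(seq):
    """Cleave after K or R, except when the next residue is P (simplified rule)."""
    fragments, start = [], 0
    for i, aa in enumerate(seq):
        is_last = i == len(seq) - 1
        if aa in "KR" and not is_last and seq[i + 1] != "P":
            fragments.append(seq[start:i + 1])
            start = i + 1
    fragments.append(seq[start:])  # C-terminal fragment
    return fragments


fragments = trypsin_digest("LRKQFKEMKKMMKQMTNMS")
print(fragments)
```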
Secondary structure prediction of nucleic acids
1. Run "mfold" and "plotfold" on ecrna.seq and mmrna.seq (in directory ~/gcg/7). Include the
   sequences in the squiggle plot. Run "bestfit" and compare the sequences. Discuss the results.
2. Try to find prokaryotic transcription terminator structures with the TERMINATOR program.
   Try the sequences "em:ssdestn", "em:ectrpx", and "em:bsrggad".