Bioinformatics
Adapted from a paper at http://carbon.cudenver.edu/~bstith by April Bednarski and
Himadri Pakrasi that was funded by a grant from the Howard Hughes Medical Institute of
Washington University.
Glossary
Genome – The entire amount of genetic information for an organism. The human genome
is organized into 23 pairs of chromosomes (46 chromosomes in a diploid cell).
Homologous – With regard to amino acids, a homologous amino acid is similar to a
reference amino acid in chemical properties and size. For example, glutamate can be
considered homologous to aspartate because both residues are roughly similar in size and
both residues contain a carboxylic acid moiety, which gives them similar chemical
properties.
Conserved – when talking about a position in a multiple sequence alignment, “conserved”
means the amino acid residues at that position are identical throughout the alignment.
Conservative residue change – when talking about a position in a multiple sequence
alignment, a “conservative change” is when there is a change to a homologous amino acid
residue.
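The definitions of "homologous", "conserved", and "conservative change" can be sketched in a few lines of Python. This is only an illustration: the similarity groups below are an assumption made for the example (only the aspartate/glutamate pairing comes from this glossary), and real programs use a substitution matrix such as BLOSUM instead.

```python
# Hypothetical similarity groups, for illustration only.
# Only the D/E (aspartate/glutamate) pairing is taken from the glossary text.
SIMILAR_GROUPS = [
    set("DE"),    # acidic
    set("KR"),    # basic
    set("ST"),    # small, hydroxyl-containing
    set("ILVM"),  # aliphatic hydrophobic
    set("FWY"),   # aromatic
]

def classify_column(residues):
    """Classify one alignment column given its residues (one-letter codes)."""
    unique = set(residues)
    if len(unique) == 1:
        return "conserved"
    # Every residue in the column falls inside a single similarity group.
    if any(unique <= group for group in SIMILAR_GROUPS):
        return "conservative change"
    return "non-conservative change"

print(classify_column("EEEE"))  # conserved
print(classify_column("EDEE"))  # conservative change
print(classify_column("EKEE"))  # non-conservative change
```

Gap characters are not treated specially in this sketch; a column containing "-" would simply be reported as a change.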
EC number - Enzyme Commission number - Assigned by the IUBMB (International Union
of Biochemistry and Molecular Biology); classifies enzymes according to the reaction
catalyzed. An EC number is composed of four numbers separated by dots. For example,
alcohol dehydrogenase has the EC number 1.1.1.1.
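As a small illustration of the four-number format, the sketch below splits an EC number into its fields. The function name is made up for this example, and it deliberately rejects anything that is not four plain numbers.

```python
def parse_ec_number(ec):
    """Split an EC number such as '1.1.1.1' into its four integer fields."""
    parts = ec.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-field EC number: {ec!r}")
    return tuple(int(p) for p in parts)

print(parse_ec_number("1.1.1.1"))   # alcohol dehydrogenase, from the entry above
print(parse_ec_number("3.1.4.11"))  # the number shown for PLC in the KEGG section
```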
BLOSUM – BLOcks SUbstitution Matrix – A type of substitution matrix that is
used by programs like BLAST to give sequences a score based on similarity to another
sequence. The scoring matrix gives a score to conservative substitutions of amino acids. A
conservative substitution is a substitution of an amino acid similar in size and chemical
properties to the amino acid in the query sequence. Discussed in the Berg text, p.175 – 178.
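To make the idea of a scoring matrix concrete, here is a minimal sketch that scores two ungapped sequences of equal length. The numbers in the matrix are toy values chosen for this example; they are NOT the real BLOSUM scores, and the tiny matrix covers only three amino acids.

```python
# Toy scores for illustration only; these are NOT real BLOSUM values.
TOY_MATRIX = {
    ("D", "D"): 5, ("E", "E"): 5, ("K", "K"): 5,   # identical residues score highest
    ("D", "E"): 2,                                 # conservative substitution
    ("D", "K"): -1, ("E", "K"): -1,                # acid-to-base change scores negative
}

def pair_score(a, b):
    """Look up a residue pair in either order."""
    if (a, b) in TOY_MATRIX:
        return TOY_MATRIX[(a, b)]
    return TOY_MATRIX[(b, a)]

def ungapped_score(query, subject):
    """Sum the substitution scores over two equal-length, ungapped sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(query, subject))

print(ungapped_score("DEK", "DEK"))  # 15: all identical
print(ungapped_score("DEK", "EDK"))  # 9: two conservative substitutions
print(ungapped_score("DEK", "KKK"))  # 3: two non-conservative substitutions
```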
BLAST – Basic Local Alignment Search Tool – can be accessed from the NCBI website,
blast.ncbi.nlm.nih.gov/Blast.cgi. A program that compares a given input sequence to all the
sequences in a specified database. This program aligns the most similar segments between
sequences. BLAST aligns sequences using a scoring matrix similar to BLOSUM (see entry).
This scoring method gives penalties for gaps and gives the highest score for identical
residues. Substitutions are scored based on how conservative the changes are. The output
is a list of sequences, with the highest scoring sequence at the top. The scoring output is
given as an E-value. The lower the E-value, the higher scoring the sequence is. E-values in
the range of 10^-100 to 10^-50 indicate very similar (or even identical) sequences. Sequences with
E-values of 10^-10 and higher need to be examined by other methods to determine
homology. An E-value of 10^-10 for a sequence can be interpreted as “a 1 in 10^10 chance that
the sequence was pulled from the database by chance alone (has no homology to the query
sequence).”
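The rule of thumb for reading E-values can be written as a short function. This sketch reads the ranges in the BLAST entry as powers of ten (1e-50 and 1e-10); the glossary gives no guidance for values between those two, so the function says so rather than guessing.

```python
def interpret_e_value(e_value):
    """Apply the glossary's rule-of-thumb ranges to a BLAST E-value."""
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    if e_value <= 1e-50:
        return "very similar or identical"
    if e_value >= 1e-10:
        return "check homology by other methods"
    return "between the two ranges given in the glossary"

for value in (1e-100, 1e-30, 1e-10, 0.5):
    print(value, "->", interpret_e_value(value))
```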
ClustalW – A program for making multiple sequence alignments.
www.ebi.ac.uk/clustalw/index.html
ExPASy – Expert Protein Analysis System - us.expasy.org/ A server maintained by the
Swiss Institute of Bioinformatics. Home of SWISS-PROT, the most extensive and annotated
protein database. The Swiss-Pdb Viewer protein-viewing program is also available at this
site for free download.
FASTA – short for “FAST-All” (since it works on both nucleotide and amino acid
sequences). Associated with this software is a way of formatting a nucleic acid or protein
sequence. It is important because many bioinformatics programs require that the
sequence be in FASTA format. The FASTA format has a title line for each sequence that
begins with a “>” followed by any needed text to name the sequence. The end of the
title line is signified by a line break (hit the return key). Bioinformatics
programs will know that the title line isn’t part of the sequence if you have it formatted
correctly. The sequence itself does NOT have any spaces or other formatting, although it
may be wrapped over several lines, as in the example below. The sequence is given in
one-letter code. An example of a protein in correct FASTA format is shown below:
>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK
HKEKMSKDGKKKKKKSKTKCVIM
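To show how a program separates the title line from the sequence, here is a minimal FASTA reader in Python. It is a sketch rather than a full parser (established libraries handle many more edge cases), and the function name is chosen just for this example.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines are joined, so wrapped sequences become one string.
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records

example = """>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK
HKEKMSKDGKKKKKKSKTKCVIM
"""

for title, sequence in parse_fasta(example):
    print(title)
    print(len(sequence), "residues, starting with", sequence[:5])
```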
GenBank - a database of nucleotide sequences from over 260,000 organisms.
www.ncbi.nlm.nih.gov/ This is the main database for nucleotide sequences. It is a
historical database, meaning it is redundant. When new or updated information is entered
into GenBank, it is given a new entry, but the older sequence information is also kept in the
database. GenBank belongs to an international collaboration of sequence databases, which
also includes EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of
Japan). In contrast, the RefSeq database (see entry) is non-redundant and contains only the
most current sequence information for genetic loci.
Gene – an NCBI database of genetic loci. It may be accessed through the NCBI homepage by
selecting “Gene” from the Search drop-down menu. This database used to be called
LocusLink. Entries provide links to RefSeqs, articles in PubMed, and other descriptive
information about genetic loci. The database also provides information on official
nomenclature, aliases, sequence accession numbers, phenotypes, EC numbers, OMIM
numbers, UniGene clusters, map information, and relevant web sites.
KEGG – Kyoto Encyclopedia of Genes and Genomes – http://www.genome.ad.jp/kegg/
This website is used for accessing metabolic pathways. At this website, you can search a
process, gene, protein, or metabolite and obtain diagrams of all the metabolic pathways
associated with your query. You will see a link to the KEGG entry at the end of the Gene
entry for a gene.
NCBI – National Center for Biotechnology Information – www.ncbi.nlm.nih.gov This center
was formed in 1988 as a division of the NLM (National Library of Medicine) at the NIH
(National Institutes of Health). As part of the NIH, NCBI is funded by the US government.
The main goal of the center is to provide resources for biomedical researchers as well as
the general public. The center is continually developing new materials and updating
databases. The entire human genome is freely available on this website and is updated
daily as new and better data become available. NCBI also maintains an extensive education
site, which offers online tutorials of its databases and programs:
www.ncbi.nlm.nih.gov/About/outreach/courses.html
OMIM - Online Mendelian Inheritance in Man –
www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM a continuously updated catalog of
human genes and genetic disorders, with links to associated literature references, sequence
records, maps, and related databases.
PubMed – www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed when writing a paper on a
particular science/medical topic, you should always check PubMed. It is a retrieval system
containing citations, abstracts, and indexing terms for journal articles in the biomedical
sciences. PubMed contains the complete contents of the MEDLINE and PREMEDLINE
databases. It also contains some articles and journals considered out of scope for
MEDLINE, based on either content or on a period of time when the journal was not indexed,
and therefore is a superset of MEDLINE.
RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including
genomic DNA contigs, mRNAs, proteins, and entire chromosomes. Accession numbers have
the format of two letters, an underscore bar, and six digits. Example: NT_123456. Code:
NT, NC, NG = genomic; NM = mRNA; NP = protein (for more of the two letter codes, see the
NCBI site map).
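The accession format described above lends itself to a quick check with a regular expression. In this sketch the prefix meanings come from the RefSeq entry; allowing more than six digits and an optional version suffix such as ".1" is an assumption added for the example.

```python
import re

# Prefix meanings as listed in the glossary entry.
REFSEQ_PREFIXES = {
    "NT": "genomic", "NC": "genomic", "NG": "genomic",
    "NM": "mRNA", "NP": "protein",
}

# Two letters, an underscore, then digits. Six digits follows the glossary;
# accepting more digits and an optional ".version" suffix is an assumption.
ACCESSION_PATTERN = re.compile(r"^([A-Z]{2})_(\d{6,})(\.\d+)?$")

def refseq_molecule_type(accession):
    """Return the molecule type for a RefSeq-style accession, or None if it does not match."""
    match = ACCESSION_PATTERN.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1), "other (see the NCBI site)")

print(refseq_molecule_type("NT_123456"))   # genomic (the example from the glossary)
print(refseq_molecule_type("NM_123456.1")) # mRNA (made-up accession)
print(refseq_molecule_type("AB287408.1"))  # None: a GenBank accession, not RefSeq
```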
Sequence Manipulation Suite – bioinformatics.org/sms/ a website that contains a
collection of web-based programs for analyzing and formatting DNA and protein
sequences.
Bioinformatics is a field of study that merges math, biology, and computer science.
Researchers in this field have developed a wide range of tools to help biomedical
researchers work with genomic, biochemical, and medical information. Some types of
bioinformatics tools include database storage and search programs as well as software
programs for analyzing genomic and proteomic data.
We will be working through a tutorial on web-based bioinformatics programs. The
tutorial is based on the enzymes phospholipase C-gamma (believed to be the major enzyme
of fertilization), and cyclooxygenase-2 (COX-2), which also has the name prostaglandin
synthase-2 (PTGS2). In this tutorial, the bioinformatics tools from the NCBI (National
Center for Biotechnology Information) website will be introduced. NCBI is a division of the
National Institutes of Health (NIH).
These tools include Gene, GenBank, RefSeq, and PubMed. Gene is a database of
genes in which each entry contains a brief summary, the common gene symbol, information
about the gene function, and links to websites, articles, and sequence information for that
gene.
GenBank is a historical database of gene sequences, which means it contains every
sequence that was published, even if the same sequence was published more than once.
Therefore, GenBank is considered a redundant database.
RefSeq is a database of sequences that is edited by NCBI and is NON-redundant,
meaning that it contains what NCBI determines is the strongest sequence data for each
gene.
Finally, we will be learning to use ClustalW, which is a multiple sequence alignment
program. It allows you to enter a series of gene or protein sequences that you believe are
similar and may be evolutionarily related. These sequences are usually obtained by
performing a BLAST search. ClustalW then aligns the sequences, so that the fewest gaps
are introduced and the largest number of similar residues is aligned with each other.
ClustalW uses a scoring matrix similar to BLOSUM-62, which is explained in your text and
will be presented in lecture.
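The trade-off between introducing gaps and lining up matching residues can be illustrated for just two sequences. The sketch below computes the best global alignment score by dynamic programming (the Needleman-Wunsch method). It is not ClustalW itself: it handles only a pair of sequences, and the simple match, mismatch, and gap scores are assumptions standing in for a matrix like BLOSUM-62.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch dynamic programming)."""
    # previous[j] holds the best score for aligning the processed prefix of a with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two residues, or put a gap in one of the sequences.
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]

print(global_alignment_score("MLAR", "MLAR"))  # 4: four matches, no gaps
print(global_alignment_score("MLAR", "MLR"))   # 1: three matches and one gap
```

Raising the gap penalty makes the method prefer mismatches over gaps, which is the same kind of choice an alignment program makes when it scores gaps.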
Introduction to Phospholipase C-gamma and COX-2 (PTGS2)
Phospholipase C-gamma is believed to be the major enzyme of fertilization. We
obtained a partial clone of the gene when we performed RT-PCR. Take a look at the paper
that Dr. Stith has put on our web site. We will go through the paper more thoroughly at
some point. For now, the pathway of fertilization in Xenopus laevis may be the following:
1) Sperm binds to the egg
2) This binding somehow activates the 1b form of phospholipase D (PLD1b)
3) The enzyme PLD1b breaks down a lipid (phosphatidylcholine) to phosphatidic acid
(PA) and choline.
4) PA stimulates a tyrosine kinase called Src. Tyrosine kinases are enzymes that
transfer a phosphate from ATP to other proteins. This “phosphorylation” turns on
(in this case, Src), or can turn off another protein.
5) Once turned on, Src phosphorylates the gamma form of Phospholipase C (PLC-γ).
6) PLC-γ breaks down a lipid called PIP2 to make IP3 and DAG. IP3 diffuses from the
membrane to release calcium from stores in the endoplasmic reticulum.
7) The calcium floods into the cytoplasm to cause the events of fertilization: the
calcium travels across the zygote from the sperm binding site, causing a wave of
cortical granule exocytosis, a wave of elevation of the fertilization envelope, a wave
of surface contraction (that we visualized), and initiation of other developmental
events leading to first cleavage (or cytokinesis). See our fertilization lecture for a
review of this.
PTGS2 is also called prostaglandin H2 synthase-2 and cyclooxygenase-2 (COX-2).
COX-2 has been thoroughly studied because of its role in prostaglandin synthesis.
Prostaglandins have a wide range of roles in our body from aiding in digestion to
propagating pain and inflammation. Aspirin is a general inhibitor of prostaglandin
synthesis and, therefore, helps reduce pain. However, aspirin also inhibits the synthesis of
prostaglandins that aid in digestion. Therefore, aspirin is a poor choice for pain and
inflammation management for those with ulcers or other digestion problems. Recent
advances in targeting specific prostaglandin-synthesizing enzymes have led to the
development of Celebrex, which is marketed as an arthritis therapy. Celebrex is a potent
and specific inhibitor of COX-2. Celebrex is considered specific because it doesn’t inhibit
COX-1, which is involved in synthesizing prostaglandins that aid in digestion. This is a
remarkable accomplishment given the great similarity between COX-1 and COX-2. This
achievement has paved the way for developing new therapies that bind more specifically to
their target and therefore have fewer side effects.
Understanding the enzyme structures of COX-1 and COX-2 helped researchers develop
a drug that would only bind and inhibit COX-2. Many of the types of information and tools
used by researchers for these types of studies are freely available on the web. In this
tutorial, and throughout this lab course, you will be introduced to the databases and freely
available software programs that are commonly used by professionals in research and
medicine to study genes, proteins, protein structure and function, and genetic disease.
Gene Database:
Follow these directions to access the entries for PTGS1 and PTGS2 in the “Gene” database
at the NCBI Website:
1) Go to the NCBI homepage: http://www.ncbi.nlm.nih.gov
2) Just after the word “Search,” select “Gene” from the database drop-down menu.
Enter “PTGS” in the “for” textbox, and click the Search button.
3) Scan the results for the “Homo sapiens” entries. There should be one called
“PTGS1” and one called “PTGS2.”
4) Select each entry by clicking on its name, then read the paragraph under the
Summary section for each entry.
Answer the following questions.
1. PTGS1 and PTGS2 are isozymes: Isozymes catalyze the same reaction, but are coded by
separate genes. Based on the summary, what types of reactions do PTGS enzymes
catalyze?
2. Which gene forms multiple transcript variants?
3. Which isozyme would you want to inhibit to stop inflammation?
4. According to the Pathways section, what KEGG pathways are listed for these enzymes
(other than “Metabolic pathways”)?
The next two questions are not discussed in the summaries- just read the questions and
think about the answers.
5. The drug Celebrex selectively inhibits PTGS2 while aspirin and other NSAIDs
inhibit both PTGS1 and PTGS2 in the same way. Why do you think researchers wanted
to discover a selective inhibitor to PTGS2?
6. Describe how studying 3-D structures of PTGS1 and PTGS2 could help researchers
design a drug that binds to PTGS1, but not to PTGS2.
7. Now enter “Phospholipase C-gamma” and search for this gene. Find the PLCG1 and
PLCG2 entries (case matters). On what chromosome are these found?
8. Go to the PLCG1 entry. From the summary, what do IP3 and PIP2 stand for (spell out
the complete chemical name):
9. What is the official symbol of phospholipase C, gamma 1?
HUGO is the acronym for the Human Genome Organization. The HUGO Gene Nomenclature
Committee’s acronym is HGNC. Click on the red HGNC:9065 link next to “Primary Source.”
This brings up the “Symbol Report” page; click on the link associated with the line: 17240
OMIM. OMIM stands for the Online Mendelian Inheritance in Man database. The OMIM
database was started at Johns Hopkins University and is now maintained by NCBI. The
OMIM database contains entries for both diseases with known genetic links and entries for
the genes that have been linked to a disease. Each OMIM entry is a summary of the
research that has been performed on the disease or gene and contains links to the research
articles that it summarizes. You will be able to read about the clinical and biochemical
research that has been performed related to the mutation you are studying. Is any
information available related to mutations or mutants for PLC gamma? YES NO
Each link in the OMIM entry will open an abstract from the PubMed database. PubMed is a
literature database, and is also maintained by NCBI. PubMed is a searchable database of
medical and life science journal articles. Most of the abstracts for these articles can be
accessed through PubMed, but in order to access the entire article, you need to go to each
individual journal website and have a subscription to the journal. The Troy University
library has subscriptions to electronic versions of many of these journals that you can
access through the E-journal link on the library home page. Most journals have their
articles available online as .pdf files for articles published from 1995 to the present.
However, the older articles must still be accessed through the paper versions stored in
libraries.
Go back to the “Symbol Report” page. Click on the GenBank link. An example of a GenBank
entry is shown below.
For PLC-gamma 1, fill in the following info:
Number of base pairs:
Gene sequence was obtained from “Molecular Type”
Date of latest modification:
Accession number (Very important number):
Both the AMINO ACID (beginning with “/translation”) and then the GENE sequences
(in ATGC) are listed. Amino acids have both a 3-letter and 1-letter abbreviation—
databases use the 1-letter abbreviations.
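Converting between the two abbreviation systems is a simple lookup. The sketch below uses the abbreviations from Table 1; the hyphen-separated input format (for example "Met-Thr-Glu") is an assumption made for this example.

```python
# 3-letter to 1-letter abbreviations, as listed in Table 1.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def to_one_letter(three_letter_sequence):
    """Convert a hyphen-separated 3-letter sequence, e.g. 'Met-Thr-Glu', to 'MTE'."""
    return "".join(THREE_TO_ONE[code] for code in three_letter_sequence.split("-"))

print(to_one_letter("Met-Thr-Glu-Tyr-Lys"))  # MTEYK
```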
Table 1. 1- and 3-Letter Abbreviations of Amino Acids.

Amino Acid       3-Letter   1-Letter
Alanine          Ala        A
Arginine         Arg        R
Asparagine       Asn        N
Aspartic acid    Asp        D
Cysteine         Cys        C
Glutamic acid    Glu        E
Glutamine        Gln        Q
Glycine          Gly        G
Histidine        His        H
Isoleucine       Ile        I
Leucine          Leu        L
Lysine           Lys        K
Methionine       Met        M
Phenylalanine    Phe        F
Proline          Pro        P
Serine           Ser        S
Threonine        Thr        T
Tryptophan       Trp        W
Tyrosine         Tyr        Y
Valine           Val        V

Go back to the original page on PLCG1 (the page with “Primary Source” and the
“HGNC:9065” link that you followed). In your browser use Ctrl-F to find “Src” on
that page. In the Bibliography section, find the first paper that links Src to
PLC-gamma; this might help our research in Xenopus fertilization since we believe that
Src turns on PLC-gamma. This paper is published in the Journal of Biological Chemistry.
Go to their website, www.jbc.org, find that paper, and print off the first page of the
paper as evidence that you have completed this section successfully. (You have to be on
a Troy University computer to get access to that journal.)
Then, continue the search for “src” through Interactions: what is listed as an
“interactant” with PLC-gamma?
Click on PubMed to obtain the paper that you find and then print off the first page of
the 1994 J. Biol. Chem. paper.
You have explored human forms of the enzyme and its gene. Next, in the Entrez Gene
database, search for a reference to the presence of the PLC-gamma enzyme in
Xenopus laevis. You have to go back to the original page that had “Gene” for the database
and “Phospholipase C-gamma Xenopus laevis” for the search string. How many references
for Xenopus PLC-gamma did you find?
What is the preferred name of the enzyme in each reference (how do they differ?)?
For the first reference that you find, under “Related Sequences,” note that there are three
listed:
        Nucleotide    Protein
mRNA    AB287408.1    BAF64273.1
mRNA    AF090111.1    AAD03594.1
mRNA    BC070837.1    AAH70837.1
The second column is a sequence of nucleotide bases; the third is the amino acid list for the
base sequence.
Go to the second Xenopus PLC gamma reference. Under General gene information, you
see “Pathways.” KEGG stands for the Kyoto Encyclopedia of Genes and Genomes. It is a
database of metabolic pathways that is maintained by a research institute in Japan. It
contains all the known metabolic and signaling pathways. Each protein in the pathway and
each small molecule metabolite (e.g., ATP) has its own entry in the database that can be
accessed by clicking on the protein or metabolite in the pathway figure. By using this
website, you can make predictions about what would happen to downstream events in the
pathway if the protein you are studying is either less active or more active. There are
several links to click on to show how PLC-gamma1b is involved in metabolism. Click on
the link related to inositol metabolism.
In the first link/path, the red arrow below shows where PLC gamma 1b is located- it
has a number of 3.1.4.11. PIP2 is to the right (1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate).
What is the full name of IP3 according to this metabolic pathway?
Click on the last KEGG link, about a signaling system; what is the name of this pathway?
Essentially, you now have two names for equivalent pathways involving PLC. Note that
they show PLC in red lettering and in a green box.
Locate PIP2 (top center; a substrate for PLC) and write how they abbreviate it here in this
second path:
Write down how they prefer to abbreviate IP3 (look for IP3 with some numbers in
parentheses):
NCBI – Gene
1. Go back to the “Gene” entry for Homo sapiens PTGS2. The section NCBI Reference
Sequences (RefSeq) gives the RefSeq accession number for the mRNA sequence of
Homo sapiens prostaglandin-endoperoxide synthase 2.
Write it here: __________________.
2. Open the RefSeq entry by clicking on the number (first link in the section), then click on
“FASTA” on the “Format:” line. Copy the nucleotide sequence (including the title line
designated by the “>” symbol) and paste it into a text or Word document.
3. Save the file as PTGS2rna.doc (or .txt) on your desktop. Review the entry for “FASTA”
in the Glossary: understanding the FASTA format will help in working with the
bioinformatics programs.
4. The amino acid sequence is conveniently obtained by first clicking on the “RefSeq
Protein Product” link, which is in the second column of the page, then selecting the
FASTA format again. Follow the steps given above to save the amino acid sequence in
FASTA format as a document called PTGS2prot.doc.
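The same FASTA records can also be requested by a program instead of by clicking through the web pages. The sketch below only builds the request address for NCBI's E-utilities "efetch" service; using E-utilities at all is an addition to this tutorial, and the accession number shown is a placeholder, not the answer to question 1.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fasta_url(accession, database="nuccore"):
    """Build an E-utilities URL that returns one record in FASTA format."""
    query = urlencode({
        "db": database,      # "nuccore" for nucleotide records, "protein" for proteins
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH_URL}?{query}"

# NM_123456 is a placeholder accession, not a real answer to the exercise.
print(fasta_url("NM_123456"))
```

Opening the printed address in a browser, or with urllib.request.urlopen, should return the record as plain FASTA text, which you can then save exactly as described in the steps above.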
Swiss-Prot Entry
1. Go to the Expasy website (http://us.expasy.org/). Under Databases select “UniProtKB”
(a protein knowledgebase). At the top of the page, click “Fields” next to the search box.
For the first field, select “Protein Name”, and enter, for the “Term”, Phospholipase C
gamma 1. Click “Add & Search”, then click “Fields” again, and for the field, “Organisms”,
use the term “Homo sapiens”. Click “Add & Search”, again. Select the one entry that has
been reviewed (the gold star).
2. What is the accession number of this protein?
3. Write at least three alternate names for this protein.
4. In which two areas of the cell is this protein found?
5. What is its cofactor (needed for the enzyme to function)?
6. What is the PLC gamma1 amino acid length and molecular weight?
7. Return to the home page of the ExPASy Proteomics Server; select the SWISS-2DPAGE
database. Enter the accession number in the search box. Has anyone reported 2-D gel
electrophoresis data?
Sequence Manipulation
1. Go to the Sequence Manipulation Suite (http://bioinformatics.org/sms/).
2. Click on “Translate” under “DNA Analysis” heading from the menu.
3. Clear the data entry box by hitting “Clear”.
4. Copy the mRNA sequence in FASTA format from your file (PTGS2rna.doc) and paste it
into the data entry box on the Sequence Manipulation website.
5. Select “Reading Frame 3” and “direct” from the pull-down menus, then click “Submit”.
6. When the Output window opens with your results, copy and paste the sequence into a
Word document and save it as, “translate.doc” on your desktop.
7. Compare this sequence in the “translate.doc” file with the sequence in the
“PTGS2prot.doc”.
What are the first residues that are the same in the sequences?
Do the sequences look like they are the same? (Hint: protein sequences should start
with a methionine, M.)
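What the "Translate" tool does with a reading frame can be sketched as follows. This is a minimal illustration using the standard genetic code, not the Sequence Manipulation Suite's own code; it expects the bare sequence (no FASTA title line) and handles only the direct strand.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order; '*' marks a stop codon.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=1):
    """Translate the direct strand in reading frame 1, 2, or 3."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    dna = dna.upper().replace("U", "T")  # accept mRNA written with U
    start = frame - 1
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    # Unknown codons (for example, containing N) are shown as X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

# A short made-up stretch of DNA that encodes the first five K-Ras residues.
dna = "ATGACTGAATATAAA"
for frame in (1, 2, 3):
    print(frame, translate(dna, frame))
```

Only one of the three frames gives a sensible protein here (it starts with M and has no stop codons), which is the same check the hint in question 7 asks you to make.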
Multiple Sequence Alignment with ClustalW
1. Go to the ClustalW2 website, http://www.ebi.ac.uk/Tools/clustalw2/index.html.
2. The following are 6 FASTA formatted sequences of PTGS2 from different organisms.
Copy and paste all of the FASTA formatted sequences into the data entry box. For
alignment select “Full”; for output format, select “aln w/numbers” so we can find
particular residues (amino acids) in the alignment; for the Output order select “input”.
Press “Run” located in the lower right.
>dog [Canis familiaris]
MLARALVLCAALAVVRAANPCCSHPCQNQGICMSTGFDQYKCDCTRTGFYGENCS
TPEFLTRIKLYLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHLIESPPTYNVNYGYKSW
EAFSNLSYYTRALPP
VPDDCPTPMGVKGKKELPDSKEIVEKFLLRRKFIPDPQGTNMMFAFFAQHFTHQF
FKTDHKRGPAFTKGL
GHGVDLNHVYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHV
PEHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTLQIDDQEYNFQQFIYNNSILLEHGL
TQFVESFSRQIAGRV
AGGRNVPAAVQQVAKASIDQSRQMKYQSLNEYRKRFRLKPYTSFEELTGEKEMAA
GLEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEMGAPFSLKGLMGNPICSPDYWKPSTFGGEVGFKII
NTASIQSLICNNVKG
CPFTAFSVQDGQLTKTVTINASSSHSGLDDINPTVLLKERSTEL
>cow [Bos taurus]
MLARALLLCAAVALSGAANPCCSHPCQNRGVCMSVGFDQYKCDCTRTGFYGENCT
TPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNKISFLRNMIMRYVLTSRSHLIESPPTYNVHYSYKSW
EAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVKKVLLRRKFIPDPQGTNLMFAFFAQHFTHQF
FKTDFERGPAFTKGK
NHGVDLSHIYGESLERQHKLRLFKDGKMKYQMINGEMYPPTVKDTQVEMIYPPHV
PEHLKFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDVFQIDGQEYNYQQFIYNNSVLLEHGL
TQFVESFTRQRAGRV
AGGRNLPVAVEKVSKASIDQSREMKYQSFNEYRKRFLVKPYESFEELTGEKEMAA
ELEALYGDIDAMEFY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKII
NTASIQSLICSNVKG
CPFTSFSVQDTHLTKTVTINASSSHSGLDDINPTVLLKERSTEL
>mouse [Mus musculus]
MLFRAVLLCAALGLSQAANPCCSNPCQNRGECMSTGFDQYKCDCTRTGFYGENCT
TPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNNIPFLRSLIMKYVLTSRSYLIDSPPTYNVHYGYKSW
EAFSNLSYYTRALPP
VADDCPTPMGVKGNKELPDSKEVLEKVLLRREFIPDPQGSNMMFAFFAQHFTHQF
FKTDHKRGPGFTRGL
GHGVDLNHIYGETLDRQHKLRLFKDGKLKYQVIGGEVYPPTVKDTQVEMIYPPHI
PENLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDILKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIASEFNTLYHWHPLLPDTFNIEDQEYSFKQFLYNNSILLEHGL
TQFVESFTRQIAGRV
AGGRNVPIAVQAVAKASIDQSREMKYQSLNEYRKRFSLKPYTSFEELTGEKEMAA
ELKALYSDIDVMELY
PALLVEKPRPDAIFGETMVELGAPFSLKGLMGNPICSPQYWKPSTFGGEVGFKII
NTASIQSLICNNVKG
CPFTSFNVQDPQPTKTATINASASHSRLDDINPTVLIKRRSTEL
>Rabbit
MLARALLLCAAVALSHAANPCCSNPCQNRGVCMTMGFDQYKCDCTRTGFYGENCS
TPEFLTRIKLLLKPT
PDTVHYILTHFKGVWNIVNSIPFLRNSIMKYVLTSRSHMIDSPPTYNVHYNYKSW
EAFSNLSYYTRALPP
VADDCPTPMGVKGKKELPDSKDVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQF
FKTDLKRGPAFTKGL
GHGVDLNHIYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHI
PAHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDEQLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTFQIDDQQYNYQQFLYNNSILLEHGL
TQFVESFTRQIAGRV
AGGRNVPPAVQKVAKASIDQSRQMKYQSLNEYRKRFLLKPYESFEELTGEKEMAA
ELEALYGDIDAVELY
PALLVERPRPDAIFGESMVEMGAPFSLKGLMGNPICSPNYWKPSTFGGEVGFKIV
NTASIQSLICNNVKG
CPFTSFNVPDPQLTKTVTINASASHSRLEDINPTVLLKGRSTEL
>pig [Sus scrofa]
MLARALLLCAAVSLCTAAKPCCSNPCQNRGICMSVGFDHYKCDCTRTGFYGENCT
TPEFLTRIKLFLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNAIMKYVLISRSHLIDSPPTYNMHYGYKSW
EAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQF
FKTDQKRGPAFTKGQ
GHGVDLSHVYGESLERQHKLRLFKDGKMKYQIIDGEMYPPTAKDTQVEMIYPPHT
PEHLRFAVGHEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDAFQIDGHEYNYQQFLYNNSILLEHGI
TQFVESFSRQIAGRV
AGGRNLPAAVQKVSKASIDQSREMRYQSFNEYRKRFLLKPYRSFEELTGEKEMAA
ELEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKII
NTASIQSLICNNVKG
CPFTSFSVQDPQLAKTVTINASSSHSGLDDINPTVLLKERSTEL
>coral [Gersemia fruticosa]
MVAKFVVFLGLQLILCSVVCEAVNPCCSFPCESGAVCVEDGDKYTCDCTRTGHYG
VNCEKPNWSTWFKAL
IAPSEETKHFILTHFKWFWWIVNNVPFIRNTVMKAAYFSRTDFVPVPHAYTSYHD
YATMEAHYNRSYFAR
TLPPVPKNCPTPFGVAGKKELPPAEEVANKFLKRGKFKTDHTSTSWLFMFFAQHF
THEFFKTIYHSPAFT
WGNHGVDVSHIYGQDMERQNKLRSFEDGKLKSQTINGEEWPPYLKDVDNVTMQYP
PNTPEDQKFALGHPF
YSMLPGLFMYASIWLREHNRVCTILRKEHPHWVDERLYQTGKLIITGELIKIVIE
DYVNHLANYNLKLTY
NPELVFDHGYDYDNRIHVEFNHMYHWHPFSPDEYNISGSTYSIQDFMYHPEIVVK
HGMSSFVDSMSKGLC
GQMSHHNHGAYTLDVAVEVIKHQRELRMQSFNNYRKHFALEPYKSFEELTGDPKM
SAELQEVYGDVNAVD
LYVGFFLEKGLTTSPFGITMIAFGAPYSLRGLLSNPVSSPTYWKPSTFGGDVGFD
MVKTASLEKLFCQNI
AGECPLVTFTVPDDIARETRKVLEARDEL
3. View the output- the SCORES table:
SeqA  Name   Len(aa)  SeqB  Name   Len(aa)  Score
===================================================
1     dog    604      2     cow    604      90
1     dog    604      3     mouse  604      89
Note that different specific combinations are examined; DOG TO COW for example. You
would expect a higher SCORE (right column; similarity of the gene sequence) between two
mammals than between a mouse and the coral. What is the similarity score for the same
gene found in mouse and coral? ________
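A pairwise score of this kind can be thought of as a percent identity: the number of identical positions divided by the number of positions compared. The sketch below works on two sequences that are already aligned (with "-" for gaps); skipping gapped columns is an assumption about how the score is defined, so the result may not match the ClustalW table exactly.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gapped columns are not counted
        compared += 1
        identical += (x == y)
    return 100.0 * identical / compared if compared else 0.0

# First ten residues of the dog and cow sequences given above.
print(percent_identity("MLARALVLCA", "MLARALLLCA"))  # 90.0
```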
View the cladogram at the bottom of the page. (To learn more about cladograms go to
en.wikipedia.org/wiki/Cladogram.) Switch to the phylogram view. Which two species are
most similar, based on this view? (Or can one even tell?)
Now for the most important part of this ClustalW analysis: an amino acid by amino acid
comparison of the same protein from different species. Go about half way down the web
page and find ALIGNMENT. A button labeled 'Show Colors' will be displayed in the
Alignment section of the results page. If you press this button the alignment will be shown in
color according to the table below- remember our earlier discussion of types of amino
acids. (This option only works when you have chosen ALN or GCG as the output format).
Residues    Color     Property
AVFPMILW    RED       Small (small + hydrophobic (including aromatic - Y))
DE          BLUE      Acidic
RHK         MAGENTA   Basic
STYHCNGQ    GREEN     Hydroxyl + Amine + Basic - Q
Others      Gray
CONSENSUS SYMBOLS: An alignment will display by default the following symbols
denoting the degree of conservation observed in each column:
Symbol    Meaning
*         The residues in that column are identical in all sequences in the alignment.
:         Conserved substitutions are present, according to the COLOR table above.
.         Semi-conserved substitutions are present.
(space)   ?
Figure 1. A Venn diagram showing the relationship of the 20 naturally occurring amino
acids to some physio-chemical properties. Exarchos et al. BMC Bioinformatics, 2009,
10:113 (Creative Commons Attribution License)
Copy the alignment of amino acids in various species and paste it into a Word
document. To make this file readable, do the following things:
a) Go to “Page Set-up” under “File” and change the page orientation to landscape.
b) Select all text and change to “Courier” font, size 10. Courier is the best font for
alignments because all the letters are the same width. This is one of the major
secrets of working with FASTA sequences.
c) Save this file to the desktop as “ClustalW.doc” and print it (send the file to yourself by
email or place it on a floppy or flash drive). Place a copy in your lab notebook.
4. Review the alignment. What does the presence of a space under a column in
the alignment indicate about the relation of the residues?
5. Find the longest string of conserved residues (watch out for strings at the ends of rows).
How many residues does it contain?
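If you want to check your count, the search for the longest conserved string can be sketched in Python. The code below marks only fully identical columns ("*") and ignores the ":" and "." symbols; because it works on whole, unwrapped sequences, strings that continue across the end of a printed row are counted correctly. The short sequences used are just the first 15 residues of the dog, cow, and mouse sequences, which happen to need no gaps.

```python
def conservation_line(aligned_sequences):
    """Return '*' under columns where every sequence has the same residue, else ' '."""
    columns = zip(*aligned_sequences)
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in columns
    )

def longest_conserved_run(aligned_sequences):
    """Length of the longest stretch of consecutive fully conserved columns."""
    runs = conservation_line(aligned_sequences).split(" ")
    return max(len(run) for run in runs)

# First 15 residues of the dog, cow, and mouse sequences (no gaps needed here).
fragment = [
    "MLARALVLCAALAVV",
    "MLARALLLCAAVALS",
    "MLFRAVLLCAALGLS",
]
print(repr(conservation_line(fragment)))
print(longest_conserved_run(fragment))  # 4
```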