How to Find a Specific Gene or Protein to Study
Searching with words that describe or label the sequence
Simple keyword searching
The initial search option, which is presented in the header as a text box with a "Go" button, is a keyword search
against the text of the data records. Thus, it suffers from the same limitations as all keyword searches, such as
misspellings and synonyms. Most genes and gene products can be described by several text strings. In this
example, we will try to find an enzyme in the folate biosynthesis pathway that has several common names, but one
specific EC number. The gene that encodes the target enzyme has been named independently by several groups
working on different organisms. Any of the following terms may be used to describe the target enzyme:
• 7,8-dihydro-6-hydroxymethylpterin-pyrophosphokinase
• hydroxymethylpterin pyrophosphokinase
• HPPK
• pyrophosphokinase
• sulD
• folK
• folate biosynthesis
• EC 2.7.6.3
• 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase
Use your favorite strategy to compose a keyword search in the box below (or in the NMPDR banner in a new
window). Some of these terms will return no hits, while others will return hundreds. Neither outcome is useful. A
new search form is presented at the bottom of the search results table so that you may revise your search. As with
all keyword searches, there is an appropriate subset of the above terms that will return the record of interest.
Please note that at this time, we are not curating gene names, so for example, a search for lacZ may not return all
instances of beta-galactosidase. (Use the back button on your browser to resume this tutorial.)
Quick search
keyword(s) or numerical id
Go
Keywords can include gene IDs (gi|16802272), gene names (folK), EC numbers (2.7.6.3), genus (Vibrio),
species (vulnificus), words contained in subsystem names (synthesis), functional assignments
(pyrophosphokinase), and subsystem classes (cofactors). You may also use attributes like iedb,
virulence, and essential. A list of protein encoding genes that match all of the keywords will be returned.
To search for genes matching only some of the keywords, surround the optional words with parentheses. For
example, 2.7.6.3 4.1.2.25 would match only bifunctional genes associated with both EC numbers 2.7.6.3
and 4.1.2.25, while (2.7.6.3) (4.1.2.25) would match the bifunctional genes as well as all single function
genes with either of those EC numbers. Use a minus sign to exclude genes matching a particular keyword. For
example, pyrophosphokinase -2-amino-4-hydroxy-6-hydroxymethyldihydropteridine would
match all pyrophosphokinases acting on substrates other than
2-amino-4-hydroxy-6-hydroxymethyldihydropteridine.
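The keyword rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not the NMPDR implementation: the function names, the case-insensitive substring matching, and the three example annotation strings are assumptions made for the sketch.

```python
def parse_query(query):
    """Split a query into required, optional, and excluded terms."""
    required, optional, excluded = [], [], []
    for token in query.split():
        if token.startswith("(") and token.endswith(")"):
            optional.append(token[1:-1].lower())   # (term): optional
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())     # -term: exclude
        else:
            required.append(token.lower())         # bare term: required
    return required, optional, excluded


def matches(record_text, query):
    """Assumed semantics: all required terms, at least one optional
    term (if any were given), and no excluded term must be present."""
    text = record_text.lower()
    required, optional, excluded = parse_query(query)
    if any(term in text for term in excluded):
        return False
    if not all(term in text for term in required):
        return False
    if optional and not any(term in text for term in optional):
        return False
    return True


# Made-up annotation strings, for illustration only.
BIFUNCTIONAL = ("Dihydroneopterin aldolase (EC 4.1.2.25) / "
                "2-amino-4-hydroxy-6-hydroxymethyldihydropteridine "
                "pyrophosphokinase (EC 2.7.6.3)")
SINGLE = ("2-amino-4-hydroxy-6-hydroxymethyldihydropteridine "
          "pyrophosphokinase (EC 2.7.6.3)")
OTHER = "Thiamin pyrophosphokinase (EC 2.7.6.2)"

print(matches(BIFUNCTIONAL, "2.7.6.3 4.1.2.25"))
print(matches(SINGLE, "(2.7.6.3) (4.1.2.25)"))
```

Under these assumptions, the query 2.7.6.3 4.1.2.25 accepts only the bifunctional record, while the parenthesized form also accepts the single-function record.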
Restricting keyword search to selected organisms or subsystems
There are several ways to limit the scope of a keyword search to organisms of interest to you. First, you may
simply include the organism genus and/or species and/or strain name among the keywords entered in the simple
keyword search box. Try, for example, searching for EC 2.7.6.3 listeria. Please note that if you enter only
the full name and strain of a sequenced organism without any additional search words, the search will return its
genome overview page (try, for example, Listeria monocytogenes EGD-e).
Second, if you start on one of the NMPDR organism summary pages, simple keyword searches are automatically
limited to that group of organisms. Try, for example, searching for EC 2.7.6.3 from the Campylobacter page,
which is directly accessible from the home page or through the table of NMPDR organisms on the NMPDR Organisms
page.
Third, from the menu of supporting organisms on the NMPDR Organisms page, you may select any single
organism and go to its overview page, which links to a Browser that includes a searchable and sortable table of all
features in the genome. The overview page also provides direct links to tables of features that have or have not
been included in subsystems by NMPDR curators.
Advanced Search
Finally, the Advanced Search form has a menu of genomes for limiting your keyword search and a menu of
subsystems that may be used to restrict your keyword search.
In the form, genomes are grouped with the NMPDR focus organisms listed first, followed by the Archaea (blue),
Bacteria (pink), and Eukarya (yellow). Within groups, genomes are alphabetized. Select a single genome directly
by clicking on its name in the list box. To select multiple genomes, hold down the CTRL key while clicking. To
select a range of genomes, hold down the SHIFT key while clicking. Selected genomes appear in the box below the
buttons as they are selected.
It is also possible to select all genomes whose name includes text you type into the form. For example, if you type
pneumoniae into the box and click the button, "Select genomes containing," all genomes that contain
"pneumoniae" in the name will be selected, including species of Streptococcus and Chlamydophila, as well as
Mycoplasma hyopneumoniae. You can also type an NCBI taxonomy ID into the box: 171101 will select
Streptococcus pneumoniae R6.
Use the Select All button to select all the genomes, Clear All to de-select all the genomes, or Select NMPDR
to select all the NMPDR focus genomes.
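The "Select genomes containing" behavior can be pictured as a substring filter that falls back to an exact match when the input is a number. The sketch below is a guess at that logic, not the form's actual code. The function name is hypothetical, and only the taxonomy ID 171101 comes from the text above; the other IDs are placeholders.

```python
def select_genomes(genomes, text):
    """Return genome names whose name contains the text, or whose
    taxonomy ID equals it when the text is purely numeric."""
    query = text.strip().lower()
    if query.isdigit():
        return [name for tax_id, name in genomes if str(tax_id) == query]
    return [name for tax_id, name in genomes if query in name.lower()]


# (taxonomy ID, genome name); IDs other than 171101 are placeholders.
GENOMES = [
    (171101, "Streptococcus pneumoniae R6"),
    (900001, "Chlamydophila pneumoniae CWL029"),
    (900002, "Mycoplasma hyopneumoniae 232"),
    (900003, "Listeria monocytogenes EGD-e"),
]

print(select_genomes(GENOMES, "pneumoniae"))
print(select_genomes(GENOMES, "171101"))
```

Because the match is a plain substring test, "pneumoniae" also picks up "hyopneumoniae", which is the behavior the text describes.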
Searching the sequence data directly
BLAST -- Sequence alignment searching
The BLAST family of tools uses local sequence alignments to search for matching sequences in the database.
BLAST uses a DNA or amino acid sequence as the query term instead of one or more keywords.
Suppose you did not know the EC number for our example enzyme, HPPK, and a search with your first choice of
common name returned no usable results. However, you have the amino acid sequence of the E. coli version:
>E.coli K12 HPPK
MTVAYIAIGSNLASPLEQVNAALKALGDIPESHILTVSSFYRTPPLGPQDQPDYLNAAVA
LETSLAPEELLNHTQRIELQQGRVRKAERWGPRTLDLDIMLFGNEVINTERLTVPHYDMK
NRGFMLWPLFEIAPELVFPDGEMLRQILHTRAFDKLNKW
Copy the sequence above and paste it into the sequence box on the Sequence Search page. Since this is an amino
acid sequence, set the tool to blastp. From the scrolling menu, choose any organism of interest to BLAST against.
Multiple genomes may be selected by holding down the control or shift key as you click. Buttons are also provided for
selecting all NMPDR focus genomes, or all of the supporting genomes. Click the button labeled "BLAST." The
table of BLAST results returned is ranked by score, with the most significant hits at the top of the results table.
The top entry in the table of returned results is most likely to be the target protein.
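Before choosing a tool, it can help to check what kind of sequence a FASTA record holds. The following is a rough sketch, assuming a single-record FASTA string and a simple alphabet heuristic (a sequence made only of A, C, G, T, U, or N is treated as nucleotide). The function names are hypothetical and the heuristic can be fooled by unusual proteins.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def read_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    header, chunks = "", []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].strip()
        elif line:
            chunks.append(line.upper())
    return header, "".join(chunks)


def sequence_kind(sequence):
    """Heuristic: only nucleotide letters means a nucleotide query."""
    if set(sequence) <= NUCLEOTIDE_LETTERS:
        return "nucleotide"
    return "protein"


PROTEIN_RECORD = """>E.coli K12 HPPK
MTVAYIAIGSNLASPLEQVNAALKALGDIPESHILTVSSFYRTPPLGPQDQPDYLNAAVA
LETSLAPEELLNHTQRIELQQGRVRKAERWGPRTLDLDIMLFGNEVINTERLTVPHYDMK
NRGFMLWPLFEIAPELVFPDGEMLRQILHTRAFDKLNKW
"""

header, sequence = read_fasta(PROTEIN_RECORD)
# A protein query goes to blastp.
print(header, sequence_kind(sequence))
```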
You may also use a nucleotide sequence to find your gene of interest:
>E.coli K12 HPPK gene
atgacagtggcgtatattgccataggcagcaatctggcctctccgctggagcaggtcaat
gctgccctgaaagcattaggcgatatccctgaaagccacattcttaccgtttcttcgttt
taccgcaccccaccgctggggccgcaagatcaacccgattacttaaacgcagccgtggcg
ctggaaacctctcttgcacctgaagagctactcaatcacacacagcgtattgaattgcag
caaggtcgcgtccgcaaagctgaacgctggggaccacgcacgctggatctcgacatcatg
ctgtttggtaatgaagtgataaatactgaacgcctgaccgttccgcactacgatatgaag
aatcgtggatttatgctgtggccgctgtttgaaatcgcgccggagttggtgtttcctgat
ggggagatgttgcgtcaaatcttacatacaagagcatttgacaaattaaacaaatggtaa
If you are interested in finding many orthologs of the query sequence, select the blastx tool, which translates the
nucleotide sequence and compares the result to protein sequences in the database to find matching genes.
If you want to find the data page for the exact sequence you entered, then select the blastn tool, which will match
the query (input) nucleotide sequence with nucleotide sequences in the database. Because the nucleotide alphabet
has only four characters and the genetic code is degenerate, blastn finds shorter matching sequences than blastx
finds with the same query.
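To see what the translation step in blastx amounts to, here is a minimal sketch that translates a nucleotide string in the first reading frame using the standard genetic code. It is an illustration only: blastx itself considers all six reading frames, and the names used here are not part of any NMPDR or BLAST interface. The two inputs are the first and last lines of the E. coli gene shown above.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))


def translate(dna):
    """Translate frame 1 of a DNA string, stopping at a stop codon."""
    seq = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # X = unknown
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


FIRST_LINE = "atgacagtggcgtatattgccataggcagcaatctggcctctccgctggagcaggtcaat"
LAST_LINE = "ggggagatgttgcgtcaaatcttacatacaagagcatttgacaaattaaacaaatggtaa"

print(translate(FIRST_LINE))  # MTVAYIAIGSNLASPLEQVN
print(translate(LAST_LINE))   # GEMLRQILHTRAFDKLNKW
```

The translated fragments agree with the start and end of the amino acid sequence given earlier, which is why a blastx search with the gene can recover the same protein hits as a blastp search with the protein.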
Scan -- Sequence pattern, or motif, searching
Protein motifs
Another way to search for proteins or genes is to make use of known sequence patterns, or motifs, that are
characteristic of a functional group of proteins. For example, a signature of HPPK enzymes has been defined by
ProSite as this: [KRHD]-x-[GA]-[PSAE]-R-x(2)-D-[LIV]-D-[LIVM](2). Such a sequence is more commonly
written in the text of a journal article, for example, as: (KRHD)X(GA)(PSAE)RXXD(LIV)D(LIVM)(LIVM).
The abstract instruction conveyed by the pattern is, "One of either lysine or arginine or histidine or aspartate,
followed by any single amino acid, followed by either glycine or alanine, then one of these four, then arginine, then
any two amino acids, then aspartate, then one of these three, then aspartate, then one of these four, then one of
the same four again." All of the following three examples of protScan patterns convey the same instruction:
any(KRHD) x any(GA) any(PSAE) RxxD any(LIV) D any(LIVM) any(LIVM)
any(KRHD) 1...1 any(GA) any(PSAE) R 2...2 D any(LIV) D any(LIVM) any(LIVM)
(K | (R | (H | D))) X (G | A) (P | (S | (A | E))) RXXD (L | (I | V)) D (L | (I | (V | M))) (L | (I | (V | M)))
The word "any" must be placed directly against the opening parenthesis, with no space, to indicate a choice among those within the set. The tool
is not sensitive to the case of amino acid letters. A space should separate elements of the pattern. The letter "X" is
the wild card and specifies any of the 20 amino acids. The choice of any amino acid may also be indicated by the
number of amino acids required and three dots to represent the ellipsis. For example, both "XX" and "2...2" mean
any two amino acids. However, "2...4" means any two or three or four amino acids. The third way to indicate a
choice is by the use of nested parentheses and the symbol "|", commonly used as "or" in computer science. This is
not a lower-case letter L nor an upper-case letter i. It is sometimes called a pipe, and is usually "SHIFT \" on the
keyboard.
Try copying any of the three patterns into the sequence box on the Sequence Search page (or try the pre-filled
form below). Since this is an amino acid sequence, select protScan from the tool menu. Use the genomes list to
select organisms to search, then click the Scan button. Please note that a ProSite pattern must be translated into
one of the three forms recognized by protScan, which does NOT recognize the ProSite syntax.
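One way to check your reading of a pattern is to convert it to a regular expression and test it locally. The sketch below handles only the first two protScan forms described above (any(...), the wild card X, and n...m ranges); it does not parse the nested-parentheses form and is not how protScan works internally. The function name is hypothetical. The test fragment is taken from the E. coli HPPK sequence shown earlier.

```python
import re


def protscan_to_regex(pattern):
    """Convert a space-separated protScan-style pattern to a regex."""
    parts = []
    for token in pattern.split():
        choice = re.fullmatch(r"any\(([A-Za-z]+)\)", token)
        span = re.fullmatch(r"(\d+)\.\.\.(\d+)", token)
        if choice:
            # any(KRHD) -> one residue from the set
            parts.append("[" + choice.group(1).upper() + "]")
        elif span:
            # 2...4 -> between two and four residues of any kind
            parts.append(".{%s,%s}" % span.groups())
        else:
            # Literal residues; X is the wild card
            parts.append("".join("." if ch in "xX" else ch.upper()
                                 for ch in token))
    return "".join(parts)


FORM_1 = "any(KRHD) x any(GA) any(PSAE) RxxD any(LIV) D any(LIVM) any(LIVM)"
FORM_2 = ("any(KRHD) 1...1 any(GA) any(PSAE) R 2...2 D any(LIV) D "
          "any(LIVM) any(LIVM)")
HPPK_FRAGMENT = "KAERWGPRTLDLDIMLFGNEV"  # from the E. coli sequence above

for form in (FORM_1, FORM_2):
    hit = re.search(protscan_to_regex(form), HPPK_FRAGMENT)
    print(protscan_to_regex(form), hit.group())
```

Both forms locate the same stretch, RWGPRTLDLDIM, in the fragment.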
DNA patterns
Nucleic acid patterns may be used as input with the dnaScan tool. Pattern rules for spacing are similar to those for
amino acid patterns. For a complete description of how to format complex patterns, such as hairpin loops, please
see the article, Search Pattern. Limited options in degenerate positions are indicated using the IUB single letter
code for degenerate sequences:
M   A or C             (aMino)
R   A or G             (puRine)
W   A or T             (Weak, 2 H-bonds)
S   C or G             (Strong, 3 H-bonds)
Y   C or T             (pYrimidine)
K   G or T             (Keto)
V   A or C or G        (not T; V > T)
H   A or C or T        (not G; H > G)
D   A or G or T        (not C; D > C)
B   C or G or T        (not A; B > A)
N   A or C or G or T
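The meaning of the degenerate codes can be made concrete by expanding each letter into the set of bases it stands for. The sketch below turns a plain run of IUB letters into a regular expression. It is an illustration of the code table only and assumes nothing about how dnaScan parses its input; the names are hypothetical.

```python
import re

# IUB single-letter codes and the bases each one allows.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def iub_to_regex(pattern):
    """Expand a string of IUB letters into an equivalent regex."""
    parts = []
    for letter in pattern.upper():
        bases = IUB_CODES[letter]  # KeyError for letters outside the code
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


regex = iub_to_regex("TAYR")
print(regex)
print(bool(re.fullmatch(regex, "TACG")), bool(re.fullmatch(regex, "TAGG")))
```

Here Y allows C or T and R allows A or G, so TACG fits the pattern while TAGG does not.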
Results
Search results are presented on a page with two tables and a search form. The download options table at the top
provides several different ways to save the search results to your local computer. The bottom table presents the
features that match your search parameters. A form for running the same type of search with different terms or
parameters is provided at the bottom of the page.
Download options
The search results table may be saved or downloaded to your local computer in several formats.
• To save a URL that will allow you to repeat the same search in the future, e.g. after a data
update, use the right mouse button or control-click to bookmark or copy the link called "Repeat" in the
first row of the table.
• Download the nucleotide sequences of the genes found by your search by clicking the download button
in the corresponding row of the table to create a text file containing all DNA sequences in FASTA format.
You may elect to append upstream and downstream sequences flanking each gene by typing a number in
the box. Leave the box empty to save the coding sequences without flanking sequence.
• Download the amino acid sequences of the proteins found by your search by clicking the download
button in the corresponding row of the table to create a text file containing all protein sequences in FASTA
format.
• Download the search results table as a tab-delimited text file, which may be opened in Excel, by
clicking the download button in the corresponding row of the table. The viewer buttons are not included
in the saved table. To save the results table with links to viewer pages, perform your search in Internet
Explorer, copy the table, then paste it into a new workbook in Excel.
• Download the search results in XML format by clicking the download button in the corresponding row of
the table.
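A tab-delimited download can also be read with a short script instead of Excel. The sketch below assumes the file has a header row. The column names (Organism, ID, Function) and the two sample rows are invented for illustration and may not match the real download.

```python
import csv
import io


def read_results(handle):
    """Read a tab-delimited results table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def count_by_organism(rows, column="Organism"):
    """Count result rows per organism (the column name is assumed)."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Invented sample standing in for a downloaded file.
SAMPLE = (
    "Organism\tID\tFunction\n"
    "Listeria monocytogenes EGD-e\texample-id-1\tpyrophosphokinase\n"
    "Listeria monocytogenes EGD-e\texample-id-2\tdihydroneopterin aldolase\n"
)

rows = read_results(io.StringIO(SAMPLE))
print(count_by_organism(rows))
```

For a real download, replace the io.StringIO(SAMPLE) handle with an opened file, for example open(path, newline="").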
Viewing options
In the top table on the results page, there is a button for viewing the search results in a sortable, expandable table.
Click this button to view the resulting features in a table to which columns may be added using a pair of
controlling lists at the top of the page. All possible parameters are available to select as columns in your table;
however, not all features have data for all listed parameters. Further, some of the parameters describe
organisms while some describe proteins. For a full explanation of this table, see interactive table.
The results table displays all features that match all the search terms and parameters. Different features are
displayed in each row of the table. Columns of the table provide the name of the organism, database ID of the
feature, functional annotation of the feature, a button for accessing the viewer for the feature, and a link to the
subsystem(s) for those features that have been included in a subsystem by a curator.
When results are found for more than one organism, the NMPDR focus organisms (if any) are listed before
supporting organisms, in alphabetical order. Within the same organism, results are listed in order of their
database ID numbers (fig| ...).
To have another ID number listed in the database ID column, use the form at the bottom of the page to re-run the
same search, but use the drop-down list to select an Identifier Type, for example, Locus Tag.
To add a column containing all ID types, or aliases, use the form at the bottom of the page to re-run the same
search, but use the check box to select "Show alias links." If there is an alias type you are particularly interested in
and you aren't sure what it is called, e.g. those that start with "gi|", enter that into the box, and that type of ID will
be first in the list of aliases if it is available.
Summary
The gene product that you want to study may be located in the NMPDR by searching for one or more text strings
in a keyword search, or by searching directly for the protein or nucleic acid sequence using BLAST or Scan. The
results of these searches are presented in a table with links to a new NMPDR Viewer environment, which provides
tools for comparative analysis with other genomes. Sequence search results are presented with a link to a Context
viewer, which localizes the pattern match within the chromosomal environment.