How to Find a Specific Gene or Protein to Study

Searching with words that describe or label the sequence

Simple keyword searching

The initial search option, presented in the header as a text box with a "Go" button, is a keyword search against the text of the data records. It therefore suffers from the same limitations as all keyword searches, such as sensitivity to misspellings and synonyms. Most genes and gene products can be described by several text strings. In this example, we will try to find an enzyme in the folate biosynthesis pathway that has several common names but one specific EC number. The gene that encodes the target enzyme has been named by several groups working on different organisms. Any of the following terms may be used to describe the target enzyme:

• 7,8-dihydro-6-hydroxymethylpterin-pyrophosphokinase
• hydroxymethylpterin pyrophosphokinase
• HPPK
• pyrophosphokinase
• sulD
• folK
• folate biosynthesis
• EC 2.7.6.3
• 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase

Use your favorite strategy to compose a keyword search in the box below (or in the NMPDR banner in a new window). Some of these terms will return no hits, while others return hundreds. Neither outcome is useful. A new search form is presented at the bottom of the search results table so that you may revise your search. As with all keyword searches, an appropriate subset of the above terms will return the record of interest. Please note that at this time we are not curating gene names, so, for example, a search for lacZ may not return all instances of beta-galactosidase. (Use the back button on your browser to resume this tutorial.)

Quick search: keyword(s) or numerical id [Go]

Keywords can include gene IDs (gi|16802272), gene names (folK), EC numbers (2.7.6.3), genus (Vibrio), species (vulnificus), words contained in subsystem names (synthesis), functional assignments (pyrophosphokinase), and subsystem classes (cofactors).
You may also use attributes like iedb, virulence, and essential. A list of protein-encoding genes that match all of the keywords will be returned. To search for genes matching only some of the keywords, surround the optional words with parentheses. For example, 2.7.6.3 4.1.2.25 would match only bifunctional genes associated with both EC numbers 2.7.6.3 and 4.1.2.25, while (2.7.6.3) (4.1.2.25) would match the bifunctional genes as well as all single-function genes with either of those EC numbers. Use a minus sign to exclude genes matching a particular keyword. For example, pyrophosphokinase -2-amino-4-hydroxy-6-hydroxymethyldihydropteridine would match all pyrophosphokinases acting on substrates other than 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine.

Restricting keyword search to selected organisms or subsystems

There are several ways to limit the scope of a keyword search to organisms of interest to you.

First, you may simply include the organism genus, species, or strain name among the keywords entered in the simple keyword search box. Try, for example, searching for EC 2.7.6.3 listeria. Please note that if you enter only the full name and strain of a sequenced organism without any additional search words, the search will return its genome overview page (try, for example, Listeria monocytogenes EGD-e).

Second, if you start on one of the NMPDR organism summary pages, simple keyword searches are automatically limited to that group of organisms. Try, for example, searching for EC 2.7.6.3 from the Campylobacter page, which is directly accessible from the home page or through the table of NMPDR organisms on the NMPDR Organisms page.

Third, from the menu of supporting organisms on the NMPDR Organisms page, you may select any single organism and go to its overview page, which links to a Browser that includes a searchable and sortable table of all features in the genome.
The overview page also provides direct links to tables of features that have or have not been included in subsystems by NMPDR curators.

Advanced Search

Finally, the Advanced Search form has a menu of genomes and a menu of subsystems, either of which may be used to restrict your keyword search. In the form, genomes are grouped with the NMPDR focus organisms listed first, followed by the Archaea (blue), Bacteria (pink), and Eukarya (yellow). Within groups, genomes are alphabetized.

Select a single genome directly by clicking on its name in the list box. To select multiple genomes, hold down the CTRL key while clicking. To select a range of genomes, hold down the SHIFT key while clicking. Selected genomes appear in the box below the buttons as they are selected.

It is also possible to select all genomes whose names include text you type into the form. For example, if you type pneumoniae into the box and click the button "Select genomes containing", all genomes that contain "pneumoniae" in the name will be selected, including species of Streptococcus and Chlamydophila, as well as Mycoplasma hyopneumoniae. You can also type an NCBI taxonomy ID into the box: 171101 will select Streptococcus pneumoniae R6. Use the buttons Select All to select all the genomes, Clear All to de-select all the genomes, or Select NMPDR to select all the NMPDR focus genomes.

Searching the sequence data directly

BLAST -- Sequence alignment searching

The BLAST family of tools uses local sequence alignments to search for matching sequences in the database. BLAST uses a DNA or amino acid sequence as the query term instead of one or more keywords. Suppose you did not know the EC number for our example enzyme, HPPK, and a search with your first choice of common name returned no usable results.
But you have the amino acid sequence of the E. coli version:

>E.coli K12 HPPK
MTVAYIAIGSNLASPLEQVNAALKALGDIPESHILTVSSFYRTPPLGPQDQPDYLNAAVA
LETSLAPEELLNHTQRIELQQGRVRKAERWGPRTLDLDIMLFGNEVINTERLTVPHYDMK
NRGFMLWPLFEIAPELVFPDGEMLRQILHTRAFDKLNKW

Copy the sequence above and paste it into the sequence box on the Sequence Search page. Since this is an amino acid sequence, set the tool to blastp. From the scrolling menu, choose any organism of interest to BLAST against. Multiple genomes may be selected by holding down the CTRL or SHIFT key as you click. Buttons are also provided for selecting all NMPDR focus genomes or all of the supporting genomes. Click the button labeled "BLAST."

The table of BLAST results is ranked by score, with the most significant hits at the top. The top entry in the table is most likely to be the target protein.

You may also use a nucleotide sequence to find your gene of interest:

>E.coli K12 HPPK gene
atgacagtggcgtatattgccataggcagcaatctggcctctccgctggagcaggtcaat
gctgccctgaaagcattaggcgatatccctgaaagccacattcttaccgtttcttcgttt
taccgcaccccaccgctggggccgcaagatcaacccgattacttaaacgcagccgtggcg
ctggaaacctctcttgcacctgaagagctactcaatcacacacagcgtattgaattgcag
caaggtcgcgtccgcaaagctgaacgctggggaccacgcacgctggatctcgacatcatg
ctgtttggtaatgaagtgataaatactgaacgcctgaccgttccgcactacgatatgaag
aatcgtggatttatgctgtggccgctgtttgaaatcgcgccggagttggtgtttcctgat
ggggagatgttgcgtcaaatcttacatacaagagcatttgacaaattaaacaaatggtaa

If you are interested in finding many orthologs of the query sequence, select the blastx tool, which translates the nucleotide sequence and compares the result to protein sequences in the database to find matching genes. If you want to find the data page for the exact sequence you entered, select the blastn tool, which matches the query (input) nucleotide sequence against nucleotide sequences in the database.
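To make the translation step behind blastx concrete, here is a minimal Python sketch that translates a coding sequence with the standard genetic code. It is an illustration only, not how the NMPDR tools are implemented: the function name `translate` and the table-building approach are assumptions, and it reads only the first forward reading frame, whereas blastx considers all six frames.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate frame 1 of a coding sequence, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 30 nucleotides of the E. coli K12 HPPK gene shown in the tutorial.
print(translate("atgacagtggcgtatattgccataggcagc"))  # MTVAYIAIGS

# Degeneracy: several codons encode the same amino acid (six for leucine).
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "L"))
```

The translated prefix matches the start of the amino acid sequence given earlier, which is why a blastx search with the gene can find the same protein records as a blastp search with the protein.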
The small number of characters in the nucleotide alphabet and the degeneracy of the genetic code cause blastn to find shorter matching sequences than blastx finds with the same query.

Scan -- Sequence pattern, or motif, searching

Protein motifs

Another way to search for proteins or genes is to make use of known sequence patterns, or motifs, that are characteristic of a functional group of proteins. For example, a signature of HPPK enzymes has been defined by ProSite as:

[KRHD]-x-[GA]-[PSAE]-R-x(2)-D-[LIV]-D-[LIVM](2).

In the text of a journal article, such a sequence is more commonly written as:

(KRHD)X(GA)(PSAE)RXXD(LIV)D(LIVM)(LIVM)

The abstract instruction conveyed by the pattern is, "One of either lysine or arginine or histidine or aspartate, followed by any single amino acid, followed by either glycine or alanine, then one of these four, then arginine, then any two amino acids, then aspartate, then one of these three, then aspartate, then one of these four, then one of the same four again."

All three of the following protScan patterns convey the same instruction:

any(KRHD) x any(GA) any(PSAE) RxxD any(LIV) D any(LIVM) any(LIVM)
any(KRHD) 1...1 any(GA) any(PSAE) R 2...2 D any(LIV) D any(LIVM) any(LIVM)
(K | (R | (H | D))) X (G | A) (P | (S | (A | E))) RXXD (L | (I | V)) D (L | (I | (V | M))) (L | (I | (V | M)))

The word "any" must directly abut the opening parenthesis, with no space between them, to indicate a choice among the residues in the set. The tool is not sensitive to the case of amino acid letters. A space should separate elements of the pattern. The letter "X" is the wild card and specifies any of the 20 amino acids.

The choice of any amino acid may also be indicated by a minimum and maximum number of amino acids joined by three dots (an ellipsis). For example, both "XX" and "2...2" mean any two amino acids. However, "2...4" means any two, three, or four amino acids.
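The translation from ProSite syntax can be sketched in a few lines of Python. This is a hedged illustration rather than an NMPDR utility: the function names `parse_prosite`, `to_regex`, and `to_protscan` are hypothetical, only the simple ProSite elements used in the HPPK signature are handled, and the protScan output follows just the "any(...)" and "n...m" forms shown in this tutorial. The regular expression is included so the pattern can be checked locally.

```python
import re

# One ProSite element: a residue set, a single residue, or the wild card x,
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"^(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?$")


def parse_prosite(pattern):
    """Split a simple ProSite pattern into (residues, min, max) tuples."""
    elements = []
    for token in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.match(token)
        if match is None:
            raise ValueError(f"unsupported ProSite element: {token!r}")
        residues = match.group(1).strip("[]")
        low = int(match.group(2) or 1)
        high = int(match.group(3) or low)
        elements.append((residues, low, high))
    return elements


def to_regex(elements):
    """Build an equivalent Python regular expression."""
    parts = []
    for residues, low, high in elements:
        if residues.lower() == "x":
            unit = "."
        else:
            unit = residues if len(residues) == 1 else f"[{residues}]"
        if (low, high) == (1, 1):
            repeat = ""
        elif low == high:
            repeat = f"{{{low}}}"
        else:
            repeat = f"{{{low},{high}}}"
        parts.append(unit + repeat)
    return "".join(parts)


def to_protscan(elements):
    """Build the space-separated 'any(...)' / 'n...m' form from the tutorial."""
    parts = []
    for residues, low, high in elements:
        if residues.lower() == "x":
            parts.append(f"{low}...{high}")
        elif low != high:
            # The tutorial shows ranges only for the wild card.
            raise ValueError("variable repeats of a residue set are not covered")
        else:
            unit = residues if len(residues) == 1 else f"any({residues})"
            parts.extend([unit] * low)
    return " ".join(parts)


hppk = parse_prosite("[KRHD]-x-[GA]-[PSAE]-R-x(2)-D-[LIV]-D-[LIVM](2).")
print(to_protscan(hppk))
print(to_regex(hppk))

# A fragment of the E. coli K12 HPPK protein sequence from this tutorial.
fragment = "RVRKAERWGPRTLDLDIMLFGNEV"
print(re.search(to_regex(hppk), fragment).group())  # RWGPRTLDLDIM
```

The protScan string produced for the HPPK signature is the same as the second of the three example patterns above, and the regular expression finds the motif in the E. coli sequence.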
The third way to indicate a choice is by the use of nested parentheses and the symbol "|", commonly used as "or" in computer science. This is neither a lower-case letter L nor an upper-case letter i. It is sometimes called a pipe, and is usually "SHIFT \" on the keyboard.

Try copying any of the three patterns into the sequence box on the Sequence Search page (or try the pre-filled form below). Since this is an amino acid pattern, select protScan from the tool menu. Use the genomes list to select organisms to search, then click the Scan button. Please note that a ProSite pattern must be translated into one of the three forms recognized by protScan, which does NOT recognize the ProSite syntax.

DNA patterns

Nucleic acid patterns may be used as input with the dnaScan tool. Pattern rules for spacing are similar to those for amino acid patterns. For a complete description of how to format complex patterns, such as hairpin loops, please see the article Search Pattern. Limited options in degenerate positions are indicated using the IUB single-letter code for degenerate sequences:

M    A or C (aMino)
R    A or G (puRine)
W    A or T (Weak, 2 H-bonds)
S    C or G (Strong, 3 H-bonds)
Y    C or T (pYrimidine)
K    G or T (Keto)
V    A or C or G (not T; V > T)
H    A or C or T (not G; H > G)
D    A or G or T (not C; D > C)
B    C or G or T (not A; B > A)
N    A or C or G or T

Results

Search results are presented on a page with two tables and a search form. The download options table at the top provides several different ways to save the search results to your local computer. The bottom table presents the features that match your search parameters. A form for running the same type of search with different terms or parameters is provided at the bottom of the page.

Download options

The search results table may be saved or downloaded to your local computer in several formats.

• To save a URL that will allow you to repeat the same search parameters in the future, e.g.
after a data update, use the right mouse button or control-click to bookmark or copy the link called "Repeat" in the first row of the table.
• Download the nucleotide sequences of the genes found by your search by clicking the download button in the corresponding row of the table to create a text file containing all DNA sequences in FASTA format. You may elect to append upstream and downstream sequences flanking each gene by typing a number in the box. Leave the box empty to save the coding sequences without flanking sequence.
• Download the amino acid sequences of the proteins found by your search by clicking the download button in the corresponding row of the table to create a text file containing all protein sequences in FASTA format.
• Download the search results table as a tab-delimited text file, which may be opened in Excel, by clicking the download button in the corresponding row of the table. The viewer buttons are not included in the saved table. To save the results table with links to viewer pages, perform your search in Internet Explorer, copy the table, then paste it into a new workbook in Excel.
• Download the search results in XML format by clicking the download button in the corresponding row of the table.

Viewing options

In the top table on the results page, there is a button for viewing the search results in a sortable, expandable table. Click this button to view the resulting features in a table to which columns may be added using a pair of controlling lists at the top of the page. All possible parameters are available to select as columns in your table; however, not all features have data for every listed parameter. Further, some of the parameters describe organisms while others describe proteins. For a full explanation of this table, see interactive table.

The results table displays all features that match all the search terms and parameters. Each row of the table displays a different feature.
Columns of the table provide the name of the organism, the database ID of the feature, the functional annotation of the feature, a button for accessing the viewer for the feature, and a link to the subsystem(s) for those features that have been included in a subsystem by a curator. When results are found for more than one organism, the NMPDR focus organisms (if any) are listed before supporting organisms, in alphabetical order. Within the same organism, results are listed in order of their database ID numbers (fig| ...).

To have another ID number listed in the database ID column, use the form at the bottom of the page to re-run the same search, but use the drop-down list to select an Identifier Type, for example, Locus Tag. To add a column containing all ID types, or aliases, use the form at the bottom of the page to re-run the same search, but use the check box to select "Show alias links." If there is an alias type you are particularly interested in and you aren't sure what it is called, e.g. those that start with "gi|", enter that prefix into the box, and that type of ID will be first in the list of aliases if it is available.

Summary

The gene product that you want to study may be located in the NMPDR by searching for one or more text strings in a keyword search, or by searching directly for the protein or nucleic acid sequence using BLAST or Scan. The results of these searches are presented in a table with links to a new NMPDR Viewer environment, which provides tools for comparative analysis with other genomes. Sequence search results are presented with a link to a Context viewer, which localizes the pattern match within the chromosomal environment.
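As a supplementary sketch for the DNA patterns section, the IUB degenerate codes listed there can be expanded into a regular expression for checking a pattern against a sequence on your own computer. This is an assumption-laden illustration, not the dnaScan implementation: the function name `iub_to_regex` is hypothetical, the example pattern is chosen only for demonstration, and the spacing and hairpin rules of dnaScan are not modeled.

```python
import re

# IUB single-letter codes from the DNA patterns section, plus the four
# unambiguous bases.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def iub_to_regex(pattern: str) -> str:
    """Expand each degenerate symbol into a regex character class."""
    parts = []
    for symbol in pattern.upper():
        bases = IUB[symbol]  # raises KeyError for symbols outside the table
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


# Illustrative pattern only; W is A or T, R is A or G.
regex = iub_to_regex("TATAWAWR")
print(regex)  # TATA[AT]A[AT][AG]
print(bool(re.fullmatch(regex, "TATAAATA")))  # True
print(bool(re.fullmatch(regex, "TATACATA")))  # False
```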