Download Glossary - ChristopherKing.name
Bioinformatics

Adapted from a paper (http://www.lifescied.org/cgi/content/full/4/3/207; http://www.nslc.wustl.edu/elgin/genomics/Bio3055/manual.pdf) by April Bednarski and Himadri Pakrasi that was funded by a grant from the Howard Hughes Medical Institute of Washington University.

Glossary

Genome – The entire amount of genetic information for an organism. The human genome is the set of 46 chromosomes.

Homologous – With regard to amino acids, homologous amino acids have similar chemical properties and sizes. For example, glutamate can be considered homologous to aspartate because both residues have similar sizes and both contain a carboxylic acid side chain.

Sequence alignment – A way of arranging the sequences present in DNA, RNA, or proteins so as to identify regions that are similar.

Multiple sequence alignment – A sequence alignment of three or more biological sequences.

Conserved – The amino acid residues at a position in a multiple sequence alignment are identical throughout the alignment.

Conservative residue change – The amino acid residues at a position in a multiple sequence alignment are homologous.

ClustalW – A program for making multiple sequence alignments. www.ebi.ac.uk/clustalw/index.html

EC number – Enzyme Commission number – Assigned by the IUBMB (International Union of Biochemistry and Molecular Biology); classifies enzymes according to the reaction catalyzed. An EC number is composed of four numbers separated by dots. For example, alcohol dehydrogenase has the EC number 1.1.1.1.

BLOSUM – BLOcks of Amino Acid SUbstitution Matrix – A type of substitution matrix that is used by programs like BLAST to give sequences a score based on similarity to another sequence. The scoring matrix gives a score to conservative substitutions of amino acids. A conservative substitution is a substitution of an amino acid similar in size and chemical properties to the amino acid in the query sequence.
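The "conserved" and "conservative residue change" definitions above can be illustrated with a short Python sketch that labels each column of an alignment. The residue groups below are a simplified assumption made for illustration only (they are not taken from this handout, and real programs score substitutions with a matrix such as BLOSUM); the function names are hypothetical.

```python
# Hypothetical grouping of amino acids by side-chain chemistry.
# This is an illustrative assumption, not a substitution matrix.
GROUPS = {
    "acidic": set("DE"),
    "basic": set("KRH"),
    "polar": set("NQST"),
    "aliphatic": set("AVLIM"),
    "aromatic": set("FWY"),
}


def classify_column(residues):
    """Label one alignment column (a string or list of one-letter codes)."""
    residues = [r.upper() for r in residues]
    if "-" in residues:
        return "gapped"        # gaps are reported separately, not scored
    if len(set(residues)) == 1:
        return "conserved"     # identical residue in every sequence
    for members in GROUPS.values():
        if all(r in members for r in residues):
            return "conservative"  # all residues fall in one chemical group
    return "variable"


def classify_alignment(rows):
    """Label every column of an alignment given as equal-length strings."""
    return [classify_column(column) for column in zip(*rows)]
```

For example, a column holding aspartate, glutamate, and aspartate (D, E, D) would be labelled "conservative" under these assumed groups, matching the glutamate/aspartate example in the Homologous entry.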
BLAST – Basic Local Alignment Search Tool – can be accessed from the NCBI website, blast.ncbi.nlm.nih.gov/Blast.cgi. A program that compares a given input sequence to all the sequences in a specified database. This program aligns the most similar segments between sequences. BLAST aligns sequences using a scoring matrix similar to BLOSUM (see entry). This scoring method gives penalties for gaps and gives the highest score for identical residues. Substitutions are scored based on how conservative the changes are. The output is a list of sequences, with the highest-scoring sequence at the top. The scoring output is given as an E-value. The lower the E-value, the higher scoring the sequence is. E-values in the range of 10^-100 to 10^-50 indicate very similar (or even identical) sequences. Sequences with E-values of 10^-10 and higher need to be examined by other methods to determine homology. An E-value of 10^-10 for a sequence can be interpreted as, “a 1 in 10^10 chance that the sequence was pulled from the database by chance alone (has no homology to the query sequence).”

ExPASy – Expert Protein Analysis System – us.expasy.org/ A server maintained by the Swiss Institute of Bioinformatics. Home of SWISS-PROT, the most extensive and annotated protein database. The Swiss-Pdb Viewer protein-viewing program is also available at this site for free download.

FASTA – Fast Alignment Search Tool-All (since it works on both nucleotide and amino acid sequences). Associated with this software is a way of formatting a nucleic acid or protein sequence. It is important because many bioinformatics programs require that the sequence be in FASTA format. The FASTA format has a title line for each sequence that begins with a “>” followed by any needed text to name the sequence. The end of the title line is signified by a paragraph mark (hit the return key). Bioinformatics programs will know that the title line isn’t part of the sequence if you have it formatted correctly.
The sequence itself does NOT have any returns, spaces, or formatting of any kind. The sequence is given in one-letter code. An example of a protein in correct FASTA format is shown below:

>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSKDGKKKKKKSKTKCVIM

GenBank – A database of nucleotide sequences from over 260,000 organisms. http://www.ncbi.nlm.nih.gov/genbank/ This is the main database for nucleotide sequences. It is a historical database, meaning it is redundant. When new or updated information is entered into GenBank, it is given a new entry, but the older sequence information is also kept in the database. GenBank belongs to an international collaboration of sequence databases, which also includes EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan). In contrast, the RefSeq database (see entry) is non-redundant and contains only the most current sequence information for genetic loci.

Gene – An NCBI database of genetic loci. It may be accessed through the NCBI homepage by selecting “Gene” from the Search drop-down menu. This database used to be called LocusLink. Entries provide links to RefSeqs, articles in PubMed, and other descriptive information about genetic loci. The database also provides information on official nomenclature, aliases, sequence accession numbers, phenotypes, EC numbers, OMIM numbers, UniGene clusters, map information, and relevant web sites.

KEGG – Kyoto Encyclopedia of Genes and Genomes – http://www.genome.ad.jp/kegg/ This website is used for accessing metabolic pathways. At this website, you can search a process, gene, protein, or metabolite and obtain diagrams of all the metabolic pathways associated with your query. You will see a link to the KEGG entry at the end of the Gene entry for a gene.
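To make the FASTA entry above concrete, here is a minimal Python sketch that splits FASTA text into title lines and sequences. The function name parse_fasta is hypothetical. The handout says the sequence has no returns or spaces, but files found in practice are often wrapped, so this sketch (as an assumption) joins wrapped lines and strips whitespace rather than rejecting them.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title = None   # title of the record currently being read
    chunks = []    # pieces of that record's sequence
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Remove any internal whitespace and normalise to upper case.
            chunks.append("".join(line.split()).upper())
        else:
            raise ValueError("sequence data found before the first title line")
    if title is not None:
        records.append((title, "".join(chunks)))
    return records
```

Calling parse_fasta on the K-Ras example would give one record whose title is "K-Ras protein Homo sapiens" and whose sequence is the one-letter string that follows the title line.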
NCBI – National Center for Biotechnology Information – www.ncbi.nlm.nih.gov This center was formed in 1988 as a division of the NLM (National Library of Medicine) at the NIH (National Institutes of Health). As part of the NIH, NCBI is funded by the US government. The main goal of the center is to provide resources for biomedical researchers as well as the general public. The center is continually developing new materials and updating databases. The entire human genome is freely available on this website and is updated daily as new and better data become available. NCBI also maintains an extensive education site, which offers online tutorials of its databases and programs: www.ncbi.nlm.nih.gov/About/outreach/courses.html

OMIM – Online Mendelian Inheritance in Man – www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM A continuously updated catalog of human genes and genetic disorders, with links to associated literature references, sequence records, maps, and related databases.

PubMed – http://www.ncbi.nlm.nih.gov/pubmed/ When writing a paper on a particular science/medical topic, you should always check PubMed. It is a retrieval system containing citations, abstracts, and indexing terms for journal articles in the biomedical sciences. PubMed contains the complete contents of the MEDLINE and PREMEDLINE databases. It also contains some articles and journals considered out of scope for MEDLINE, based either on content or on a period of time when the journal was not indexed, and therefore is a superset of MEDLINE.

RefSeq – NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs, proteins, and entire chromosomes. Accession numbers have the format of two letters, an underscore, and six digits. Example: NT_123456. Code: NT, NC, NG = genomic; NM = mRNA; NP = protein (for more of the two-letter codes, see the NCBI site map).
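The two-letter RefSeq codes in the entry above lend themselves to a small lookup. The Python sketch below covers only the five codes listed in this glossary; the function name is hypothetical. As an assumption beyond the handout, the pattern accepts any number of digits and an optional ".version" suffix, since accessions seen on the NCBI site are not always exactly six digits long.

```python
import re

# Only the prefixes named in the RefSeq glossary entry are included here.
REFSEQ_CODES = {
    "NT": "genomic",
    "NC": "genomic",
    "NG": "genomic",
    "NM": "mRNA",
    "NP": "protein",
}

# Two letters, an underscore, digits, and an optional version such as ".2"
# (the digit count and version suffix are assumptions, see lead-in).
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.\d+)?$")


def refseq_molecule_type(accession):
    """Return 'genomic', 'mRNA', or 'protein', or None if not recognised."""
    match = REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return None  # not in two-letters-underscore-digits form
    return REFSEQ_CODES.get(match.group(1))
```

With this sketch, the glossary example NT_123456 is reported as genomic, while a GenBank-style accession without an underscore is not recognised as a RefSeq entry.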
Sequence Manipulation Suite – bioinformatics.org/sms/ A website that contains a collection of web-based programs for analyzing and formatting DNA and protein sequences.

Bioinformatics is a field of study that merges math, biology, and computer science. Researchers in this field have developed a wide range of tools to help biomedical researchers work with genomic, biochemical, and medical information. Some types of bioinformatics tools include database storage and search programs as well as software programs for analyzing genomic and proteomic data.

We will be working through a tutorial on web-based bioinformatics programs. The tutorial is based on the enzymes phospholipase C-gamma (believed to be the major enzyme of fertilization) and cyclooxygenase-2 (COX-2), which also has the name prostaglandin synthase-2 (PTGS2). In this tutorial, the bioinformatics tools from the NCBI (National Center for Biotechnology Information) website will be introduced. NCBI is a division of the National Institutes of Health (NIH). These tools include Gene, GenBank, RefSeq, and PubMed. Gene is a database of genes in which each entry contains a brief summary, the common gene symbol, information about the gene function, and links to websites, articles, and sequence information for that gene. GenBank is a historical database of gene sequences, which means it contains every sequence that was published, even if the same sequence was published more than once. Therefore, GenBank is considered a redundant database. RefSeq is a database of sequences that is edited by NCBI and is NON-redundant, meaning that it contains what NCBI determines is the most reliable sequence data for each gene.

Finally, we will be learning to use ClustalW, which is a multiple sequence alignment program. It allows you to enter a series of gene or protein sequences that you believe are similar and may be evolutionarily related. These sequences are usually obtained by performing a BLAST search.
ClustalW then aligns the sequences so that the fewest gaps are introduced and the largest number of similar residues are aligned with each other. ClustalW uses a scoring matrix similar to BLOSUM-62, which will be presented in a lecture.

Introduction to Phospholipase C-gamma and COX-2 (PTGS2)

Phospholipase C-gamma is believed to be a major enzyme of fertilization. The pathway of fertilization in Xenopus laevis is thought to be the following:
1) Sperm binds to the egg.
2) This binding somehow activates the 1b form of phospholipase D (PLD1b).
3) PLD1b breaks the lipid phosphatidylcholine down into phosphatidic acid (PA) and choline.
4) PA stimulates a tyrosine kinase called Src. Tyrosine kinases are enzymes that transfer a phosphate from ATP to other proteins. This “phosphorylation” can turn another protein on or off.
5) The activated Src phosphorylates the gamma form of phospholipase C (PLC-γ).
6) PLC-γ breaks the lipid “PIP2” down to “IP3” and “DAG”. IP3 diffuses from the cell membrane to release calcium stored in the endoplasmic reticulum.
7) The calcium floods into the cytoplasm to cause the events of fertilization. The calcium travels across the zygote from the sperm binding site, causing a wave of cortical granule exocytosis, a wave of elevation of the fertilization envelope, a wave of surface contraction (that we visualized), and initiation of other developmental events leading to first cleavage (or cytokinesis).

PTGS2 is also called prostaglandin H2 synthase-2 and cyclooxygenase-2 (COX-2). COX-2 has been thoroughly studied because of its role in prostaglandin synthesis. Prostaglandins have a wide range of roles in our body, from aiding in digestion to propagating pain and inflammation. Aspirin is a general inhibitor of prostaglandin synthesis and, therefore, helps reduce pain. However, aspirin also inhibits the synthesis of prostaglandins that aid in digestion.
Therefore, aspirin is a poor choice for pain and inflammation management for those with ulcers or other digestion problems. Recent advances in targeting specific prostaglandin-synthesizing enzymes have led to the development of Celebrex, which is marketed as an arthritis therapy. Celebrex is a potent and specific inhibitor of COX-2. Celebrex is considered specific because it doesn’t inhibit COX-1, which is involved in synthesizing prostaglandins that aid in digestion. This is a remarkable accomplishment given the great similarity between COX-1 and COX-2. This achievement has paved the way for developing new therapies that bind more specifically to their target and therefore have fewer side effects. Understanding the enzyme structures of COX-1 and COX-2 helped researchers develop a drug that would only bind and inhibit COX-2.

Many of the types of information and tools used by researchers for these types of studies are freely available on the web. In this tutorial, and throughout this lab course, you will be introduced to the databases and freely available software programs that are commonly used by professionals in research and medicine to study genes, proteins, protein structure and function, and genetic disease.

Gene Database:

Follow these directions to access the entries for PTGS1 and PTGS2 in the “Gene” database at the NCBI website:
1) Go to the NCBI homepage: http://www.ncbi.nlm.nih.gov
2) Just after the word “Search,” select “Gene” from the database drop-down menu. Enter “PTGS” in the “for” textbox, and click the Search button.
3) Find the results for the “Homo sapiens” entries called “PTGS1” and “PTGS2.” (In Firefox, try Ctrl-F, and enter Homo sapiens.)
4) Select each entry by clicking on its name, then read the paragraph under the Summary section for each entry.

Answer the following questions.

1. PTGS1 and PTGS2 are isozymes: Isozymes catalyze the same reaction, but are coded by separate genes.
Based on the summary, what types of reactions do PTGS enzymes catalyze?

2. Which gene forms multiple transcript variants?

3. Which isozyme would you want to inhibit to stop inflammation?

4. According to the Pathways section, what KEGG pathways are listed for these enzymes (other than “Metabolic pathways”)?

The next two questions are not discussed in the summaries; just read the questions and think about the answers.

5. The drug Celebrex selectively inhibits PTGS2, while aspirin and other NSAIDs inhibit both PTGS1 and PTGS2 in the same way. Why do you think researchers wanted to discover a selective inhibitor of PTGS2?

6. Describe how studying 3-D structures of PTGS1 and PTGS2 could help researchers design a drug that binds to PTGS1, but not to PTGS2.

7. Now start over and search for the gene for “Phospholipase C-gamma” in Homo sapiens. Find the PLCG1 and PLCG2 entries (case matters). On what chromosome are these found?

8. Now, go to the PLCG1 entry. From the summary, what do IP3 and PIP2 stand for (spell out the complete chemical name):

9. What is the official symbol of phospholipase C, gamma 1?

HUGO is the acronym for the Human Genome Organization. The HUGO Gene Nomenclature Committee’s acronym is HGNC. Click on the HGNC:9065 link next to “Primary Source.” This brings up the “Symbol Report” page. Find the section “OMIM ID” and click on the link associated with the entry 172420. OMIM stands for the Online Mendelian Inheritance in Man database. The OMIM database was started at Johns Hopkins University and is now maintained by NCBI. The OMIM database contains entries both for diseases with known genetic links and for the genes that have been linked to a disease. Each OMIM entry is a summary of the research that has been performed on the disease or gene and contains links to the research articles that it summarizes. You will be able to read about the clinical and biochemical research that has been performed related to the mutation you are studying.
Is any information available related to mutations or mutants for PLC gamma? YES   NO

Each link in the OMIM entry will open an abstract from the PubMed database. PubMed is a literature database, and is also maintained by NCBI. PubMed is a searchable database of medical and life science journal articles. Most of the abstracts for these articles can be accessed through PubMed, but in order to access the entire article, you need to go to each individual journal website and have a subscription to the journal. The Troy University library has subscriptions to electronic versions of many of these journals that you can access through the E-journal link on the library home page. Most journals have their articles available online as .pdf files for articles published from 1995 to the present. However, the older articles must still be accessed through the paper versions stored in libraries.

Go back to the “Symbol Report” page. In the section “Accession Numbers”, click on the GenBank link. An example of a GenBank entry is shown below.

For PLC-gamma 1, fill in the following info:
Number of base pairs:
Gene sequence was obtained from “Molecule Type”:
Date of latest modification:
Accession number (Very important number):

Both the AMINO ACID (beginning with “/translation”) and then the GENE sequences (in ATGC) are listed. Amino acids have both a 3-letter and 1-letter abbreviation; databases use the 1-letter abbreviations.

Table 1. 1- and 3-Letter Abbreviations of Amino Acids.

Amino Acid       3-Letter   1-Letter
Alanine          Ala        A
Arginine         Arg        R
Asparagine       Asn        N
Aspartic acid    Asp        D
Cysteine         Cys        C
Glutamic acid    Glu        E
Glutamine        Gln        Q
Glycine          Gly        G
Histidine        His        H
Isoleucine       Ile        I
Leucine          Leu        L
Lysine           Lys        K
Methionine       Met        M
Phenylalanine    Phe        F
Proline          Pro        P
Serine           Ser        S
Threonine        Thr        T
Tryptophan       Trp        W
Tyrosine         Tyr        Y
Valine           Val        V

Go back to the original page on PLCG1 (the page with “Primary Source” and the “HGNC:9065” link that you followed). In your browser use Ctrl-F to find “SH3” in the Bibliography section of that page. Which journal published this entry?

Then, search for “RET9” in the Interactions section. Which journal published an article listed in PubMed about this entry?

You have explored human forms of the enzyme and its gene. Next, in the Entrez Gene database, search for a reference to the presence of the PLC-gamma enzyme in Xenopus laevis. You have to go back to the original page that had “Gene” for the database and “Phospholipase C-gamma Xenopus laevis” for the search string. How many references for Xenopus PLC-gamma did you find?

What is the preferred name (the name before the “Other Aliases” line) of the enzyme in each reference (how do they differ?)?

For the first reference that you find, under “Related Sequences,” note that there are three listed:

        Nucleotide    Protein
mRNA    AB287408.1    BAF64273.1
mRNA    AF090111.1    AAD03594.1
mRNA    BC070837.1    AAH70837.1

The second column is a sequence of nucleotide bases; the third is the amino acid list for the base sequence.

Go back and select the second Xenopus PLC gamma reference. Under General gene information, you see “Pathways.” KEGG stands for the Kyoto Encyclopedia of Genes and Genomes. It is a database of metabolic pathways that is maintained by a research institute in Japan. It contains all the known metabolic and signaling pathways. Each protein in the pathway and each small molecule metabolite (e.g., ATP) has its own entry in the database that can be accessed by clicking on the protein or metabolite in the pathway figure. By using this website, you can make predictions about what would happen to downstream events in the pathway if the protein you are studying is either less active or more active.

There are several links to click on to show how PLC-gamma1b is involved in metabolism. Click on the link related to inositol metabolism. In the first link/path, the red arrow below shows where PLC gamma 1b is located; it has an enzyme number of 3.1.4.11.
PIP2, the reactant, is to the right (1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate). What is the full name of the product IP3 according to this metabolic pathway?

Click on the next-to-last KEGG link, about a signaling system; what is the name of this pathway?

Essentially, you now have two names for equivalent pathways involving PLC. Note that they show PLC in red lettering and in a green box. Locate PIP2 (top center; a substrate for PLCγ) and write how they abbreviate it here in this second path:

Write down how they prefer to abbreviate IP3 (look for IP3 with some numbers in parentheses):
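
The enzyme number 3.1.4.11 quoted for PLC-gamma is an EC number of the kind defined in the glossary: four numbers separated by dots. The Python sketch below splits such a number and reports its top-level class. The function name is hypothetical, and the class names are the standard IUBMB top-level classes, which are not listed in this handout.

```python
# Top-level EC classes as defined by the IUBMB (not given in this handout).
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
    7: "translocases",
}


def parse_ec_number(ec):
    """Split an EC number such as '3.1.4.11' into four ints plus its class name."""
    parts = ec.strip().split(".")
    if len(parts) != 4 or not all(p.isdecimal() for p in parts):
        raise ValueError("expected four numbers separated by dots: %r" % ec)
    numbers = tuple(int(p) for p in parts)
    # The first number selects the broad reaction type.
    return numbers, EC_CLASSES.get(numbers[0], "unknown class")
```

Under this scheme 3.1.4.11 falls in class 3 (hydrolases), which fits PLC-gamma breaking PIP2 down to IP3 and DAG, and the glossary example 1.1.1.1 (alcohol dehydrogenase) falls in class 1 (oxidoreductases).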