BioInformatics (1)

What is Life All About: Self-compiling & self-assembling
Complementary surfaces
Watson-Crick base pair (Nature, April 25, 1953)

Life Science vs Computing: Where do parasites come from? (computer & biological viral codes)

Computer viruses: over $12 billion/year
LoveBug:
    Set dirtemp = fso.GetSpecialFolder(2)
    Set c = fso.GetFile(WScript.ScriptFullName)
    c.Copy(dirsystem&"\MSKernel32.vbs")
    c.Copy(dirwin&"\Win32DLL.vbs")
    c.Copy(dirsystem&"\LOVE-LETTER-FOR-YOU.TXT.vbs")
    regruns()
    html()
    spreadtoemail()
    listadriv()

AIDS - HIV-1: 20 M dead (worse than black plague & 1918 Flu)
Polymerase drug resistance mutations: M41L, D67N, T69D, L210W, T215Y, H208Y
    PISPIETVPV KLKPGMDGPK VKQWPLTEEK
    IKALIEICAE LEKDGKISKI GPVNPYDTPV
    FAIKKKNSDK WRKLVDFREL NKRTQDFCEV

Exciting Life ??

Concept         Computers          Organisms
Instructions    Program            Genome
Bits            0,1                a,c,g,t
Stable memory   ROM, disk, tape    DNA
Active memory   RAM                RNA
Processing      CPU/Compiler       enzyme/Ribosome
Editing         Editor             tRNA
Environment     Sockets, people    Water, salts, heat
I/O             AD/DA              proteins
Monomer         Minerals           Nucleotide
Polymer         chip               DNA, RNA, protein
Replication     Cut/Paste          DNA replication
Sensor/In       scanner            Chem/photo receptor

Elements of RNA-based life: C, H, N, O, P
Useful for many species: Na, K, Fe, Cl, Ca, Mg, Mo, Mn, S, Se, Cu, Ni, Co, Si

The Four Nucleosides of DNA
A nucleoside is a sugar, here deoxyribose, plus a base. dA = deoxyadenosine, etc.
Purines: dA, dG
Pyrimidines: dC, dT

Bases
Adenine, Thymine, Guanine, Cytosine, Uracil

Base Pairing
The monomeric units of nucleic acids are called nucleotides. A nucleotide is a phosphate, a sugar, and a purine or a pyrimidine base.
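As a small illustration of Watson-Crick base pairing, the sketch below computes the complementary strand of a DNA sequence. It is a minimal example, not part of the original slides: the function names `complement` and `reverse_complement` and the sample strand are illustrative assumptions, and only the four unambiguous DNA bases are handled.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
# PAIRS, complement and reverse_complement are illustrative names (assumptions).
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(PAIRS[base] for base in seq.upper())


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    strand = "GTCGTTCAAC"  # arbitrary sample strand
    print(strand, "->", reverse_complement(strand))
```

Reversing the complement reflects the fact that the two strands of a double helix run antiparallel, so the partner strand is conventionally written in the opposite direction.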
Chromosomes

Genome and gene
Entity    Definition                          Molecular mechanisms
Genome    Unit of information transmission    DNA replication
Gene      Unit of information expression      Transcription to RNA; translation to protein
(A gene is a special sequence of nucleotide bases whose sequence carries the information required for constructing a protein.)

Nucleic acid and proteins
Nucleic acid
    DNA: backbone phosphodiester bonds; repeating unit deoxyribonucleotides (A, C, G, T); length 10^3-10^8; role genome
    RNA: backbone phosphodiester bonds; repeating unit ribonucleotides (A, C, G, U); length 10^3-10^5 (genome), 10^3-10^4 (messenger), 10^2-10^3 (gene product)
Protein (structural components of cells/tissues/enzymes)
    backbone peptide bonds; repeating unit amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y); length 10^2-10^3; role gene product

Nucleotide codes
A   Adenine               W   Weak (A or T)
G   Guanine               S   Strong (G or C)
C   Cytosine              M   Amino (A or C)
T   Thymine               K   Keto (G or T)
U   Uracil                B   Not A (G or C or T)
R   Purine (A or G)       H   Not G (A or C or T)
Y   Pyrimidine (C or T)   D   Not C (A or G or T)
N   Any nucleotide        V   Not T (A or G or C)

Amino acid codes
Ala   A   Alanine
Arg   R   Arginine
Asn   N   Asparagine
Asp   D   Aspartic acid
Cys   C   Cysteine
Gln   Q   Glutamine
Glu   E   Glutamic acid
Gly   G   Glycine
His   H   Histidine
Ile   I   Isoleucine
Leu   L   Leucine
Lys   K   Lysine
Met   M   Methionine
Phe   F   Phenylalanine
Pro   P   Proline
Ser   S   Serine
Thr   T   Threonine
Trp   W   Tryptophan
Tyr   Y   Tyrosine
Val   V   Valine
Asx   B   Asn or Asp
Glx   Z   Gln or Glu
Sec   U   Selenocysteine
Unk   X   Unknown

Standard Genetic Code

Schematic illustration of a plant cell (Home for DNA)

History of structure determination for nucleic acids and proteins
1949   Edman degradation (technology development)
--     α-helix model (structure determination)
1953   DNA double helix model (structure determination)
--     Insulin primary structure (structure determination)
1954   Isomorphous replacement (technology development)
1960   Myoglobin tertiary structure (structure determination)
1962   Restriction enzyme (technology development)
1965   tRNAAla primary structure (structure determination)
1972   DNA cloning (technology development)
1973   tRNAPhe tertiary structure (structure determination)
1975   DNA sequencing (technology development)
1977   φX174 complete genome (structure determination)
1979   Z-DNA by single crystal diffraction (structure determination)
1984   Pulsed field gel electrophoresis (technology development)
1985   Polymerase chain reaction (technology development)
1986   Protein structure by 2D NMR (structure determination)
1987   YAC vector (technology development)
1988   Human Genome Project (technology development)
1993   DNA chip (technology development)
1995   H. influenzae complete genome (structure determination)

Human chromosomes: idiograms

X-linked recessive disorder
The inheritance pattern is shown for a recessive gene on the X chromosome, designated in bold.
[Figure labels: Male XY (normal), Male XY (normal), Male XY (affected), Female XX (normal), Female XX (normal), Female XX (normal)]

Reductionistic and synthetic approaches in biology
[Diagram: Biological System (Organism) and Building Blocks (Genes/Molecules), connected by the Reductionistic Approach (Experiments) and the Synthetic Approach (Bioinformatics)]

Basic principles in physics, chemistry and biology
Physics:    matter; elementary particles; principles known? Yes
Chemistry:  compound; elements; principles known? Yes
Biology:    organism; genes; principles known? No

[Figure: growth over time of MEDLINE records, MEDLINE G5 MeSH, transistors per chip, DNA sequences, mapped human genes, and 3-D structures. Y-axis: Amount (x1000), logarithmic from 0.001 to 100 000. X-axis: Year, 1965 to 2000.]

The addresses for the major databases
Database     Organization                                           Address
MEDLINE      National Library of Medicine                           www.nlm.nih.gov
GenBank      National Center for Biotechnology Information          www.ncbi.nlm.nih.gov
EMBL         European Bioinformatics Institute                      www.ebi.ac.uk
DDBJ         National Institute of Genetics, Japan                  www.ddbj.nig.ac.jp
SWISS-PROT   Swiss Institute of Bioinformatics                      www.expasy.ch
PIR          National Biomedical Research Foundation                www-nbrf.georgetown.edu
PRF          Protein Research Foundation, Japan                     www.prf.or.jp
PDB          Research Collaboratory for Structural Bioinformatics   www.rcsb.org
CSD          Cambridge Crystallographic Data Centre                 www.ccdc.cam.ac.uk

New generation of molecular biology databases
Information categories: compounds and reactions; protein families and sequence motifs; 3D fold classifications; orthologous genes; biochemical pathways; genome diversity.
Database        Address
LIGAND          www.genome.ad.jp/dbget/ligand.html
AAindex         www.genome.ad.jp/dbget/aaindex.html
PROSITE         www.expasy.ch/sprot/prosite.html
Blocks          www.blocks.fhcrc.org/
PRINTS          www.biochem.ucl.ac.uk.bsm.dbbrowser/PRINTS/
Pfam            www.sanger.ac.uk/Pfam/, pfam.wustl.edu/
ProDom          protein.toulouse.inra.fr/prodom.html
SCOP            scop.mrc-lmb.cam.ac.uk/scop/
CATH            www.biochem.ucl.ac.uk/bsm/cath/
COG             www.ncbi.nlm.nih.gov/COG/
KEGG            www.genome.ad.jp/kegg/
KEGG            www.genome.ad.jp/kegg/
WIT             www.mcs.anl.gov/WIT2/
EcoCyc          ecocyc.PangeaSystems.com/ecocyc/
UM-BBD          www.labmed.umn.edu/umbbd/
NCBI Taxonomy   www.ncbi.nlm.nih.gov/Taxonomy/
OMIM            www.ncbi.nlm.nih.gov/Omim/

Example of sequence database entry for GenBank
LOCUS       DRODPPC      4001 bp    INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryota; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ……………………
                     LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
      ………………………….
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//

Example of sequence database entry for SWISS-PROT
ID   DECA_DROME STANDARD; PRT; 588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ
SEQUENCE 588 AA; 65850 MW; 1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR

Functional classification of E. coli genes according to Monica Riley
I. Intermediary metabolism
    A. Degradation
    B. Central intermediary metabolism
    C. Respiration (aerobic and anaerobic)
    D. Fermentation
    E. ATP-proton motive force interconversions
    F. Broad regulatory functions
II. Biosynthesis of small molecules
    A. Amino acids
    B. Nucleotides
    C. Sugars and sugar molecules
    D. Cofactors, prosthetic groups, electron carriers
    E. Fatty acids and lipids
    F. Polyamines
III. Macromolecule metabolism
    A. Synthesis and modification
    B. Degradation of macromolecules
IV. Cell structure
    A. Membrane components
    B. Murein sacculus
    C. Surface polysaccharides and antigens
    D. Surface structures
V. Cellular processes
    A. Transport/binding proteins
    B. Cell division
    C. Chemotaxis and mobility
    D. Protein secretion
    E. Osmotic adaptation
VI. Other functions
    A. Cryptic genes
    B. Phage-related functions and prophages
    C. Colicin-related functions
    D. Plasmid-related functions
    E. Drug/analog sensitivity
    F. Radiation sensitivity
    G. DNA sites
    H. Adaptations to atypical conditions

The Protein Folding Problem
Protein Folding Problem (Sequence → 3D Structure)
1. Protein folding is thermodynamically determined (Anfinsen's thermodynamic principle): Protein + Environment
2. Protein folding is a reaction involving other interacting molecules (Principle of molecular interactions): Protein + Chaperonins + ...

Central Paradigm

Bioinformatics: A Long Journey (How far are we away from knowing the God ??)
Sequence to exon                                   80%    [Laub 98]
Exons to gene (without cDNA or homolog)            ~30%   [Laub 98]
Gene to regulation                                 ~10%   [Hughes 00]
Regulated gene to protein sequence                 98%    [Gesteland]
Sequence to secondary structure (α, β, c)          77%    [CASP]
Secondary structure to 3D structure                25%    [CASP]
3D structure to ligand specificity                 ~10%   [Johnson 99]
Expected accuracy overall ≈ 0.8 * 0.3 * 0.1 * 0.98 * 0.77 * 0.25 * 0.1 ≈ 0.0005 ?

Our Focus in Bioinformatics
[Diagram labels: Perturbation; Dynamic Response; Environment; Medication; Genetic Engineering; Gene Expression; Protein Expression; Virtual Cell; Analysis; BioChip; DataBase; Genotype/Phenotype; Biology; Molecular Biology; Bio Chemistry; Genetics; Symbolic Algorithms/Computing; Genome Sequencing]
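The overall-accuracy estimate on the "Long Journey" slide is a product of the per-step accuracies. The sketch below reproduces that arithmetic. It assumes, as the slide's multiplication implicitly does, that the steps fail independently; the names `STEP_ACCURACY` and `overall_accuracy` are illustrative and not from the slides, and the approximate values (~30%, ~10%) are taken at face value.

```python
from math import prod  # available in Python 3.8+

# Per-step accuracies as quoted on the slide (approximate values taken as-is).
STEP_ACCURACY = [
    ("Sequence to exon", 0.80),
    ("Exons to gene (without cDNA or homolog)", 0.30),
    ("Gene to regulation", 0.10),
    ("Regulated gene to protein sequence", 0.98),
    ("Sequence to secondary structure", 0.77),
    ("Secondary structure to 3D structure", 0.25),
    ("3D structure to ligand specificity", 0.10),
]


def overall_accuracy(steps):
    """Multiply the step accuracies, assuming the steps are independent."""
    return prod(accuracy for _, accuracy in steps)


if __name__ == "__main__":
    running = 1.0
    for name, accuracy in STEP_ACCURACY:
        running *= accuracy
        print(f"{name:45s} {accuracy:5.2f}  cumulative {running:.6f}")
    print(f"Overall: {overall_accuracy(STEP_ACCURACY):.6f}")
```

The product comes to about 0.00045, which rounds to the slide's figure of 0.0005. The cumulative column shows that the three steps quoted at roughly 10% to 30% dominate the loss.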