Proteus, a Grid based Problem Solving Environment (PSE) for Bioinformatics: Architecture and Experiments

Presenter: Michael Robinson
Agnostic: Javier Munoz
Advanced Topics in Software Engineering, CIS 6612
Florida International University
July 31, 2006

Authors: Mario Cannataro (1), Carmela Comito (2), Filippo Lo Schiavo (1), and Pierangelo Veltri (1) (February 2004)
(1) University of Magna Graecia of Catanzaro, Italy
(2) University of Calabria, Italy

Organization
  - Abstract
  - ~60% is about Bioinformatics
  - Proteus Architecture
  - First Test Implementation
  - Results of First Test
  - Conclusion and Future Work

Abstract
  - Life sciences
  - Bioinformatics
  - Computer Science
  - Data file sizes
  - Computer power

The Partners
  - What are the life sciences?
  - What is Bioinformatics?
  - Other sciences used in Bioinformatics
  - What is Computer Science?

Human Genome
  - The sum total of DNA in an organism is its genome.
  - The Human Genome Project (HGP), an international effort, began in October 1990; its completion is dated 1999, 2003, or 2004, depending on the source. (http://www.pbs.org/wgbh/nova/genome/program.html)
  - Project goals were to:
      - determine the complete sequence of the 3 billion DNA bases
      - identify all human genes
      - make them accessible for further biological study

Human Genome
  - The bacterium E. coli and other organisms were used to help develop the technology and interpret human gene function.
  - The Human Genome Project was sponsored by the U.S. Department of Energy and the U.S. National Institutes of Health.
  - http://www.preventiongenetics.com/edu/genetics_nutshell.htm

DNA (ACGT)
  - Humans have from 10 to 100 trillion cells.
  - Each human cell has about 3 billion nucleotides.
  - We have approximately 30,000 genes.
  - Of the three billion letters of DNA that we have, only 1 to 1.5 percent is gene; the rest is "stuff".
  - The functions of over 50% of known genes are unknown.

DNA (ACGT)
  Human genome:
    ~3,000,000,000  DNA bases
    ~30,000,000     bases in genes
    ~2,970,000,000  "stuff"
  - Adenine (A) forms a base pair with thymine (T).
  - Guanine (G) forms a base pair with cytosine (C).

Similarities to Human DNA
  - Another human? 99.9%. All humans have the same genes, but some of these genes contain sequence differences that make each person unique.
  - A chimpanzee? 98.5%. Chimpanzees are the closest living species to humans.
  - A mouse? 92.0%. All mammals are quite similar genetically.
  - A fruit fly? 44.0%. Studies of fruit flies have shown how shared genes govern the growth and structure of both insects and mammals.
  - Yeast? 26.0%. Yeasts are single-celled organisms, but they have many housekeeping genes that are the same as the genes in humans, such as those that enable energy to be derived from the breakdown of sugars.
  - A weed (thale cress)? 18.0%. Plants have many metabolic differences from humans; for example, they use sunlight to convert carbon dioxide gas to sugars. But they also have similarities in their housekeeping genes.

The gene sizes
  - The largest known human gene is dystrophin, at 2.4 million bases.
  - Chromosome 21 is the smallest human chromosome. Three copies of this autosome cause Down syndrome, the most frequent genetic disorder associated with significant mental retardation.
  - Academic groups from Germany and Japan mapped and sequenced it; it has 33,546,361 bp of DNA.
  - Analysis of the chromosome revealed 127 known genes, 98 predicted genes, and 59 pseudogenes.
  - Smallest bacterial genome: Mycoplasma genitalium, at 580 kbp.

Bioinformatics
  - DNA
  - RNA
  - Proteins
  - Mutations, illnesses
  - Medications
  - Cloning

DNA (ACGT)
  Pseudomonas aeruginosa PAO1: 6,264,403 bases, 5565 genes
  complement(6264226..6264360)
  6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
  6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
  6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
  6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg

RNA
  In RNA, thymine is replaced by uracil (U).
  DNA
  6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
  6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
  6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
  6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
  RNA
  6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
  6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
  6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
  6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg

Amino Acids
  Codon table: rows are grouped by first base (U, C, A, G), columns run by second base (U, C, A, G), and the third base runs U, C, A, G within each group.

  UUU F Phe Phenylalanine        UCU S Ser Serine      UAU Y Tyr Tyrosine        UGU C Cys Cysteine
  UUC F Phe Phenylalanine        UCC S Ser Serine      UAC Y Tyr Tyrosine        UGC C Cys Cysteine
  UUA L Leu Leucine              UCA S Ser Serine      UAA Stop                  UGA Stop
  UUG L Leu Leucine              UCG S Ser Serine      UAG Stop                  UGG W Trp Tryptophan

  CUU L Leu Leucine              CCU P Pro Proline     CAU H His Histidine       CGU R Arg Arginine
  CUC L Leu Leucine              CCC P Pro Proline     CAC H His Histidine       CGC R Arg Arginine
  CUA L Leu Leucine              CCA P Pro Proline     CAA Q Gln Glutamine       CGA R Arg Arginine
  CUG L Leu Leucine              CCG P Pro Proline     CAG Q Gln Glutamine       CGG R Arg Arginine

  AUU I Ile Isoleucine           ACU T Thr Threonine   AAU N Asn Asparagine      AGU S Ser Serine
  AUC I Ile Isoleucine           ACC T Thr Threonine   AAC N Asn Asparagine      AGC S Ser Serine
  AUA I Ile Isoleucine           ACA T Thr Threonine   AAA K Lys Lysine          AGA R Arg Arginine
  AUG M Met Methionine (Start)   ACG T Thr Threonine   AAG K Lys Lysine          AGG R Arg Arginine

  GUU V Val Valine               GCU A Ala Alanine     GAU D Asp Aspartic acid   GGU G Gly Glycine
  GUC V Val Valine               GCC A Ala Alanine     GAC D Asp Aspartic acid   GGC G Gly Glycine
  GUA V Val Valine               GCA A Ala Alanine     GAA E Glu Glutamic acid   GGA G Gly Glycine
  GUG V Val Valine               GCG A Ala Alanine     GAG E Glu Glutamic acid   GGG G Gly Glycine

Proteins (sequences)
  DNA
  6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
  6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
  6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
  6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
  RNA
  6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
  6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
  6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
  6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
  PROTEIN
  MKRTFQPSTLKRARVHGFRARMATKNGRQVLSRRRAKGRKRLTV

Proteins: Pattern Matching
  GHEGVGKVVKLGAGA
  GHEKKGYF-DRGPSA
  GHEGYGGRSRGGGYS
  GHEFEGPK-CGALYI
  GHELRGTTFMPALEC
  Pattern: G-H-E-X(2)-G-X(4,5)-[GA]

Proteins: Structures
  Chemical properties that distinguish the 20 different amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell.

Reality
  - Somewhere in this dense chemical forest are genes involved in deafness, Alzheimer's disease, cancer, cataracts, etc. But where?
  - This is such a maze that scientists need a map.
  - Out of three billion base pairs in our DNA, just one single letter can make a difference.
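The DNA, RNA, codon-table, protein, and pattern-matching slides can be traced end to end in a few lines of Python. The following is a minimal sketch, not the presenter's or the Proteus authors' code. It assumes the standard genetic code, and it reads `complement(6264226..6264360)` as meaning the gene lies on the complementary strand, so the region is reverse-complemented before T is replaced by U (the RNA rows on the slide simply show the forward strand with u in place of t). All function names are hypothetical, and the uppercase `X` in the slide's pattern is treated as the PROSITE wildcard.

```python
import re
from itertools import product

# Rows 6264181..6264360 of the excerpt, copied from the slide.
EXCERPT_START = 6264181
EXCERPT = (
    "gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg "
    "cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg "
    "ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat"
).replace(" ", "").upper()

# Standard genetic code, codons enumerated in U, C, A, G order ("*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read 5' to 3' (A<->T, C<->G)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def transcribe(dna: str) -> str:
    """Replace thymine with uracil, as on the RNA slide."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """Translate codon by codon until the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def gene_protein(start: int, end: int) -> str:
    """Protein of a complement-strand gene given 1-based genome coordinates."""
    region = EXCERPT[start - EXCERPT_START:end - EXCERPT_START + 1]
    return translate(transcribe(reverse_complement(region)))


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regex."""
    element_re = r"([A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?"
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(element_re, element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, count = m.groups()
        if residue in ("x", "X"):
            residue = "."                          # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # excluded residues
        parts.append(residue + ("{" + count + "}" if count else ""))
    return "".join(parts)


def matches(pattern: str, sequence: str) -> bool:
    """True if the sequence (alignment gaps removed) contains the pattern."""
    return re.search(prosite_to_regex(pattern), sequence.replace("-", "")) is not None


PATTERN = "G-H-E-X(2)-G-X(4,5)-[GA]"
ALIGNED = ["GHEGVGKVVKLGAGA", "GHEKKGYF-DRGPSA", "GHEGYGGRSRGGGYS",
           "GHEFEGPK-CGALYI", "GHELRGTTFMPALEC"]

if __name__ == "__main__":
    print(gene_protein(6264226, 6264360))
    print(prosite_to_regex(PATTERN), [matches(PATTERN, s) for s in ALIGNED])
```

Under these assumptions, the excerpt translates to the 44-residue protein string shown on the Proteins (sequences) slide, and all five aligned sequences satisfy the pattern once their gap characters are dropped.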
Data Locations
  - GenBank in the US, 1974: http://www.ncbi.nlm.nih.gov/
      1997 = 1.26 gigabases
      2004 = 39 gigabases
      2005 = 100 gigabases
  - EMBL in England, 1980: http://www.ebi.ac.uk/embl/
  - DDBJ in Japan, 1984: http://www.ddbj.nig.ac.jp/

Some Databases
  The Swiss Institute of Bioinformatics maintains the following databases:
  - Ashbya Genome Database
  - Cancer Immunome Database
  - Eukaryotic Promoter Database (EPD)
  - GermOnline
  - MyHits
  - PROSITE
  - Swiss-Prot and TrEMBL
  - SWISS-2DPAGE
  - SWISS-MODEL Repository

Specialization
  - PlasmoDB: http://www.plasmodb.org/plasmo/home.jsp
  - Covers the parasitic eukaryote Plasmodium, the causative agent of malaria.

Proteus General Architecture (figure)

Proteus’ Software Modules (figure)

Some Taxonomies of the Bioinformatics Ontology (figure)

Snapshot of the Ontology Browser (figure)

Human Protein Clustering Workflow (figure)

Snapshot of VEGA: Workspace 1 of the Data Selection Phase (figure)

Software Installed in the Example Grid
  - Software components: seqret, splitfasta, blastall, cat, Tribe-parse, Tribe-matrix, mcl, Tribe-families
  - Grid nodes: Minos, k3, k4
  - (per-node installation marks not recoverable)

Snapshot of the Ontology Browser (figure)

Snapshot of the Ontology Browser (figure)

Snapshot of the Ontology Browser (figure)

Snapshot of VEGA: Workspace 1 of the Pre-processing Phase (figure)

Conclusions and Future Work
  Execution Times of the Application

  TribeMCL Application     30 Proteins   All Proteins
  Data Selection           1’44”         1’41”
  Pre-Processing           2’50”         8h50’13”
  Clustering               1’40”         2h50’28”
  Results Visualization    1’14”         1’42”
  Total Execution Time     7’28”         11h50’53”

References
  In the paper, the authors cited 27 references.

Questions

Thank you
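As a worked check on the execution-time table in the Conclusions slide, the per-phase timings can be summed and compared with the reported totals. This is a small sketch with hypothetical names; the timings are copied from the table as (hours, minutes, seconds), and nothing else is assumed.

```python
# Phase timings from the execution-time table, as (hours, minutes, seconds).
PHASES = {
    "Data Selection":        {"30 proteins": (0, 1, 44), "all proteins": (0, 1, 41)},
    "Pre-Processing":        {"30 proteins": (0, 2, 50), "all proteins": (8, 50, 13)},
    "Clustering":            {"30 proteins": (0, 1, 40), "all proteins": (2, 50, 28)},
    "Results Visualization": {"30 proteins": (0, 1, 14), "all proteins": (0, 1, 42)},
}
REPORTED_TOTAL = {"30 proteins": (0, 7, 28), "all proteins": (11, 50, 53)}


def to_seconds(hms):
    """Convert an (hours, minutes, seconds) tuple to seconds."""
    hours, minutes, seconds = hms
    return hours * 3600 + minutes * 60 + seconds


def summed_total(column):
    """Sum the four phase timings of one table column, in seconds."""
    return sum(to_seconds(row[column]) for row in PHASES.values())


def format_hms(total_seconds):
    """Format seconds as e.g. 11h44'04\"."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h{minutes:02d}'{seconds:02d}\""


if __name__ == "__main__":
    for column, reported in REPORTED_TOTAL.items():
        computed = summed_total(column)
        gap = to_seconds(reported) - computed
        print(f"{column}: sum of phases {format_hms(computed)}, "
              f"reported {format_hms(to_seconds(reported))}, gap {gap} s")
```

For the 30-protein run the four phases add up exactly to the reported 7’28”. For the all-protein run they add up to about 11h44’04”, roughly seven minutes less than the reported 11h50’53”; the slides do not say what accounts for the gap. In both runs pre-processing and clustering dominate, which is where the two columns differ by hours rather than seconds.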