Data Mining

Thanh-Phuong Nguyen

Outline
• Introduction to in-silico databases
• Some in-silico databases
  - Major biological databases
  - Biological model databases
  - How to retrieve information from databases
  - Database integration
• Data mining and machine learning
• Applications, Tools and Software

Biological systems
[Diagram: data types describing biological systems: sequence data; protein folding and 3D structure; taxonomic data; literature; pathways and networks; protein families and domains; small molecules; ontologies (GO); whole genome data]

What is a database?
• A collection of data that is
  - structured
  - searchable (index) -> table of contents
  - updated periodically (release) -> new edition
  - cross-referenced (hyperlinks) -> links with other db
• Includes also associated tools (software) necessary for db access, db updating, db information insertion, db information deletion…
• Data storage management: flat files, relational databases, XML files, SBML files…

A brief history of biological databases
• 1965: M. O. Dayhoff et al. publish “Atlas of Protein Sequence and Structure”
• 1982: EMBL initiates DNA sequence database, followed within a year by GenBank (then at LANL) and in 1984 by DNA Database of Japan
• 1988: EMBL/GenBank/DDBJ agree on common format for data elements

The growth of GenBank (updates)
• Prediction: data size doubles every 14 months
• 44,575,745,176 bases, from 40,604,319 reported sequences (up to Dec. 15, 2004)

The growth of public domain bio-databases
[Chart: number of databases (y-axis, 0 to 800) by year (x-axis, 1999 to 2005). Source: The Molecular Biology Database Collection from Nucleic Acids Research]

Information Space, July 17, 1999
• Nucleotide sequences: 4,456,822
• Protein sequences: 706,862
• 3D structures: 9,780
• Human Unigene Clusters: 75,832
• Maps and Complete Genomes: 10,870
• Different species node: 52,889
• dbSNP: 6,377
• RefGenes: 515
• human contigs > 250 kb: 341 (4.9MB)
• PubMed records: 10,372,886
• OMIM records: 10,695

The challenge of the information space, Feb 10 2004
• Nucleotide records: 36,653,899
• Protein sequences: 4,436,362
• 3D structures: 19,640
• Interactions & complexes: 52,385
• Human Unigene Cluster: 118,517
• Maps and Complete Genomes: 6,948
• Different taxonomy Nodes: 283,121
• Human dbSNP: 13,179,601
• Human RefSeq records: 22,079
• bp in Human Contigs > 5,000 kb (116): 2,487,920,000
• PubMed records: 12,570,540
• OMIM records: 15,138

in-silico databases
• Sequence DB: EMBL, GenBank, DDBJ
• Structure DB: PDB, SCOP, CATH
• Genomic DB: Ensembl, Genome Browser, NCBI
• Network and pathway DB: HPRD, I2D, STRING, DIP, BIND, KEGG PATHWAY Database, Reactome
• Mathematical model DB: BioModels, CellML
• Medical DB: OMIM, MGD, FlyBase, SGD

Sequence databases
• Used for retrieving a known gene/protein sequence
• Useful for finding information on a gene/protein
• Can find out how many genes are available for a given organism
• Can compare your sequence to the others in the database

Protein Databases
• Protein sequence and other related information
• GenPept: CDS from GenBank entries
• TrEMBL (1996): Automatic CDS translations from EMBL
• SWISS-PROT (1986): Best annotated, least redundant
• PIR (Protein Information Resource)
  - More automated annotation
  - Collaborations with MIPS and JIPID
• UniProt (2003)
  - UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

Networks and Pathways Databases
• Networks of molecule interactions
  - Protein-protein interactions
• Biological Pathways
  - Metabolism pathways
  - Signal transduction pathways
• Known or predicted data

Networks and Pathways Databases: STRING

Biological model databases

Literature Databases
• Medline/PubMed
• OMIM
• CSULA Library
• BIOSIS
• Bookshelf (from NCBI)
• Melvyl (Books at UC Libraries)
• Other molecular life science databases
  - Science Direct
  - PubMed Central
  - Free Medical Journals
  - LinkOut Journals
  - Wiley InterScience

Literature databases: PubMed (MedLine)
1. It contains entries for more than 11 million abstracts of scientific publications.
2. It enables users to do keyword searches, provides links to a selection of full articles, and has text mining capabilities, e.g. it provides links to related articles and GenBank entries, among others.
3. Searching PubMed efficiently requires some skill. For example, searching with the keyword “interleukin” returns 108,366 matches.
Essential Bioinformatics and Biocomputing (LSM2104), NUS

PubMed web-site (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)

PubMed Search (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)
Cancer treatment by targeting blood supply:
• Cancer growth depends on blood supply (why?) and thus requires the growth of new blood vessels (angiogenesis)
• Proteins involved in angiogenesis may be potential anticancer targets
• You can find some of these targets by searching PubMed
• Key word “cancer angiogenesis enzyme drug” produces 856 entries

Key Word                           No. of Entries
Cancer                             1.45M
Cancer Blood supply                22K
Cancer Blood supply Protein        3.9K
Cancer Blood supply Enzyme         1.5K
Cancer Blood supply Enzyme Drug    500

Some examples of integrated biological database resources
• SRS (Sequence Retrieval System)
• Entrez Browser (at NCBI)
• ExPASy (home of SwissProt)
• Ensembl (Open Source based system)
• Human Genome Browser (Jim Kent's creation)

NCBI ENTREZ
• MedLine: Literature database
• OMIM: Database of human genes and genetic disorders
• GenBank: Database of all publicly available DNA sequences
• Protein databases: Database of amino acid sequences from SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq
• Genomes: Database of genomes from organisms and viruses
• PopSet: Database of DNA sequences that have been collected to analyze the evolutionary relatedness of a population
• Taxonomy: Database of names of organisms with sequences in GenBank or …

Database Integration
• Heterogeneous data types
• Accession and name confusion
• Different IDs for the same entity

DNA records:
Accession    Name                    Record type
NM_017442    toll-like receptor 9    RefSeq
BC032713     toll-like receptor 9    cDNA clone
NG_001066    toll-like receptor 7    chromosome X genomic
AF172169     toll-like receptor 7    gene

Protein records:
Accession    Name                    Record type
Q15399       toll-like receptor 1    Swiss-Prot
NP_067681    toll-like receptor 2    RefSeq
AAH33651     toll-like receptor 7    GenBank protein
1FYW         TIR domain of Tlr2      3D structure (PDB)

Swimming in Data Sources

What Is Data Mining?
• Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
• Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
  - Simple search and query processing
  - (Deductive) expert systems

Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
  - Data collection and data availability
      - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
      - Business: Web, e-commerce, transactions, stocks, …
      - Science: Remote sensing, bioinformatics, scientific simulation, …
      - Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets

Data Mining: Confluence of Multiple Disciplines
[Diagram: data mining at the confluence of machine learning, pattern recognition, statistics, visualization, high-performance computing, database technology, algorithms, and applications]
July 9, 2009, Data Mining: Concepts and Techniques

Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

Data Mining: Tasks
• Classification - Example: high risk for cancer or not
• Estimation - Example: household income
• Prediction - Example: credit card balance transfer average amount
• Affinity Grouping - Example: people who buy X often also buy Y with a probability of Z
• Clustering - similar to classification but with no predefined classes
• Description and Profiling - Identifying characteristics which explain behaviour - Example: “More men watch football on TV than women”

Types of Data Mining
• “Supervised” Methods (this DM course)
  - Training data has both predictor attributes & objective (to be predicted) attributes
  - Predict discrete classes -> classification
  - Predict continuous values -> regression
  - Duality: classification <-> regression
• “Unsupervised” Methods
  - Training data without objective attributes
  - Goal: find novel & interesting patterns
  - Cutting-edge research, fewer success stories
• Semi-supervised methods: market-basket, …
© 2008, Jaime G. Carbonell, December 2008

Some Definitions (KBS vs ML)
• Knowledge-Based Systems
  - Rules, procedures, semantic nets, Horn clauses
  - Inference: matching, inheritance, resolution
  - Acquisition: manually from human experts
• Machine Learning
  - Data: tables, relations, attribute lists, …
  - Inference: rules, trees, decision functions, …
  - Acquisition: automated from data
• Data Mining
  - Machine learning applied to large real problems
  - May be augmented with KBS

Machine Learning Application Process in a Nutshell
• Choose a problem where
  - Prediction is valuable and non-trivial
  - Sufficient historical data is available
  - The objective is measurable (incl. in past data)
• Prepare the data
  - Tabular form, clean, divide into training & test sets
• Select a Machine Learning algorithm
  - Human-readable decision function -> rules, trees, …
  - Robust with noisy data -> kNN, logistic regression, …

Machine Learning Techniques
• Technical basis for data mining: algorithms for acquiring structural descriptions from examples
• Methods originate from artificial intelligence, statistics, and research on databases
• Structural descriptions represent patterns explicitly and can be used to
  - predict the outcome in a new situation
  - understand and explain how the prediction is derived (maybe even more important)
© Copyright 2006, Natasha Balac

Symbolic Rule Induction Example (1)

Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
65   M       101   +       .23     USA  normal  strep
25   M       102   +       .00     CAN  normal  strep
65   M       102           .78     BRA  rash    dengue
36   F       99            .19     USA  normal  *none*
11   F       103   +       .23     USA  flush   strep
88   F       98    +       .21     CAN  normal  *none*
39   F       100   +       .10     BRA  normal  strep
12   M       101   +       .00     BRA  normal  strep
15   F       101   +       .66     BRA  flush   dengue
20   F       98    +       .00     USA  rash    *none*
81   M       98            .99     BRA  rash    ec-12
87   F       100           .89     USA  rash    ec-12
12   F       102   +       ??      CAN  normal  strep

New cases to classify:
14   F       101   +       .33     USA  normal
67   M       102   +       .77     BRA  rash

Symbolic Rule Induction Example (2)
Candidate Rules:

IF   age = [12,65]
     gender = *any*
     temp = [100,103]
     b-cult = +
     c-cult = [.00,.23]
     loc = *any*
     skin = (normal,flush)
THEN: strep

IF   age = (15,65)
     gender = *any*
     temp = [101,102]
     b-cult = *any*
     c-cult = [.66,.78]
     loc = BRA
     skin = rash
THEN: dengue

Disclaimer: These are not real medical records or rules

Why Validation?
• Validation type:
  - Within the existing data
  - With newly collected data
• Errors and uncertainties:
  - Systematic or random errors
  - Unknown variables: number of classes
  - Noise level: statistical confidence due to noise
  - Model validity: error measure, model over-fit or under-fit
  - Number of data points: measurement replicas
• Other issues
  - Experimental support of general theories
  - Exhaustive sampling is not feasible

Major Challenges in Data Mining
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, stream, and incremental mining methods
• Handling high-dimensionality
• Handling noise, uncertainty, and incompleteness of data
• Incorporation of constraints, expert knowledge, and background knowledge in data mining
• Pattern evaluation and knowledge integration
• Mining diverse and heterogeneous kinds of data: e.g., bioinformatics, Web, software/system engineering, information networks
• Application-oriented and domain-specific data mining
• Invisible data mining (embedded in other functional modules)
• Protection of security, integrity, and privacy in data mining

Where to find the databases
• Table of addresses for major databases and tools
• Nucleic Acids Research Database issue, January each year
• Nucleic Acids Research Software issue (new)
• Amos’s list of tools: http://www.expasy.ch/alinks.html

Finding data & tools
• Google: http://www.google.com
• Nucleic Acids Research, Database & Web Server issues (Jan 1 and July 1): http://nar.oupjournals.org
• UBC Bioinformatics Links Directory: http://bioinformatics.ubc.ca/resources/links_directory/

Application
• Genome Analysis
• Pipeline Analysis
• Genome Annotation
• SNP
• Data warehouse / Databases integration
• New Algorithm
• Literature Mining
• System Biology / Microarray Analysis
HEALTHCARE:
• Decision Support: optimal treatment choice
• Survivability Predictions
• Medical facility utilization predictions

Database Mining Tools
• SRS: Sequence Retrieval System
• Entrez: Search Engine at NCBI, US
• Bankit: World Wide Web sequence submission server
• Sequence Similarity Search Tools: BLAST & FASTA
  - Finding sequence homologs to deduce the identity of the query sequence
  - Identifying potential sequence homologs with known three-dimensional structure

Data Mining Tool Features
• Installation: Hardware and software requirements (operating system, virtual machine, application server, required memory capacity...); type and simplicity of installation and deployment; availability of installation guide, licensing, …
• Usage: Is there a Graphical User Interface (GUI) available? Is it intuitive and easy to use? Flexible and personalizable? Is there an Application Programming Interface (API) available? What is its learning curve like? Which programming languages and standards are supported? Modularity? What kind of documentation is available (tutorials, examples...)? Is technical support available?
• Input: Data pre-processing; input formats; connection with databases (JDBC, ODBC, ...)
• Output: Output formats; reusability of a model; available reports and graphs, ...
• Performance: Speed, scalability, memory usage, ...
• Features: Which algorithms are supported? Can new algorithms be added? Is there Geographic Information System (GIS) integration? Does it support standards like DMQL? Text mining features? Available plug-ins?
Some Tools
• WEKA: University of Waikato, New Zealand
• YALE: University of Dortmund, Germany
• MiningMart: University of Dortmund, Germany
• Orange: University of Ljubljana, Slovenia
• Rattle: Togaware
• Borgelt: University of Magdeburg, Germany
• Gnome Data Mine: Togaware
• Tanagra: University of Lyon 2
• Xelopes: Prudsys
• SpagoBI: ObjectWeb, Italy
• JasperIntelligence: JasperSoft
• AlphaMiner: University of Hong Kong
• Databionic ESOM Tools: University of Marburg, Germany
• MLC++: SGI, USA
• MLJ: Kansas State University
• Pentaho: Pentaho

Bioinformatics software: Major sources
• Software package at ExPASy Molecular Biology Server: http://www.expasy.org ; http://au.expasy.org
• Software at PBIL (Pôle Bio-Informatique Lyonnais): http://pbil.univ-lyon1.fr/
• Toolbox at EBI (European Bioinformatics Institute): http://www.ebi.ac.uk/Tools/index.html

Bioinformatics software: Major types of bioinformatics tools
• Sequence analysis tools
• Sequence comparison
• Pattern and domain search
• Evolutionary analysis
• Prediction of sequence structure and function
• Visualization of molecular structures
• Structure modeling
• Bibliographic and text searches
• Specialized and other tools

Bioinformatics software: Its role in research
• Hypothesis-driven research cycle in biology (From Kitano H. Systems biology: a brief overview. Science 2002, 295:1662-4)

Take home notes
Yes, if you train quickly, you can create a new database of databases, but first have your lunch!
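
Worked sketch: applying the candidate rules
The two candidate rules from the symbolic rule induction example can be written as simple checks and applied to the table rows. The Python sketch below is only an illustration under stated assumptions: the Patient class and the function names are hypothetical (they do not come from the slides), the interval (15,65) in the dengue rule is read as open at both ends, and a missing or "??" culture value is treated as matching no range. The rows are the toy records from the slides, not real medical data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    """One row of the toy table (hypothetical container, not from the slides)."""
    age: int
    gender: str
    temp: int
    b_cult: Optional[str]    # "+" or None when the table shows no value
    c_cult: Optional[float]  # None when the table shows "??"
    loc: str
    skin: str


def strep_rule(p: Patient) -> bool:
    """First candidate rule (THEN: strep); gender and loc are *any*."""
    return (
        12 <= p.age <= 65
        and 100 <= p.temp <= 103
        and p.b_cult == "+"
        and p.c_cult is not None
        and 0.00 <= p.c_cult <= 0.23
        and p.skin in ("normal", "flush")
    )


def dengue_rule(p: Patient) -> bool:
    """Second candidate rule (THEN: dengue); gender and b-cult are *any*.

    Assumption: (15,65) is an open interval, so ages 15 and 65 do not match.
    """
    return (
        15 < p.age < 65
        and 101 <= p.temp <= 102
        and p.c_cult is not None
        and 0.66 <= p.c_cult <= 0.78
        and p.loc == "BRA"
        and p.skin == "rash"
    )


def classify(p: Patient) -> Optional[str]:
    """Return the label of the first rule that fires, or None if no rule fires."""
    if strep_rule(p):
        return "strep"
    if dengue_rule(p):
        return "dengue"
    return None


# First labelled row of the table and the two unlabelled new cases.
first_row = Patient(65, "M", 101, "+", 0.23, "USA", "normal")
new_case_1 = Patient(14, "F", 101, "+", 0.33, "USA", "normal")
new_case_2 = Patient(67, "M", 102, "+", 0.77, "BRA", "rash")

print(classify(first_row))   # strep
print(classify(new_case_1))  # None
print(classify(new_case_2))  # None
```

Under these assumptions neither new case is covered by a rule: the first has a c-cult value outside [.00,.23], and the second is older than the dengue rule allows. Uncovered cases like these are one reason the rules need validation against data they were not derived from.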
