Data Mining
Thanh-Phuong Nguyen
Outline
• Introduction to in-silico databases
• Some in-silico databases
  - Major biological databases
  - Biological model databases
  - How to retrieve information from databases
  - Database integration
• Data mining and machine learning
• Applications, Tools and Software
Biological systems
[Figure: data types that describe biological systems: sequence data, protein folding and 3D structure, taxonomic data, literature, pathways and networks, protein families and domains, small molecules, ontologies (GO), and whole genome data]
What is a database?
• A collection of data that is
  - structured
  - searchable (index) -> table of contents
  - updated periodically (release) -> new edition
  - cross-referenced (hyperlinks) -> links with other db
• Also includes the associated tools (software) necessary for db access, db updating, db information insertion, db information deletion…
• Data storage management: flat files, relational databases, XML files, SBML files…
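The four properties listed above can be illustrated with a small sketch using Python's built-in sqlite3 module. The schema, the accession P00001, the name "example protein A" and the cross-reference target EXAMPLE_DB are all hypothetical, invented only for this illustration; they do not describe any real biological database.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and rows)."""
    con = sqlite3.connect(":memory:")
    # Structured: every record has the same named fields.
    con.execute(
        "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, release_tag TEXT)"
    )
    # Cross-referenced: each row links a protein to a record in another database.
    con.execute("CREATE TABLE xref (accession TEXT, other_db TEXT, other_id TEXT)")
    # Searchable: the index plays the role of a table of contents.
    con.execute("CREATE INDEX idx_protein_name ON protein (name)")
    con.execute("INSERT INTO protein VALUES ('P00001', 'example protein A', 'demo-1')")
    con.execute("INSERT INTO xref VALUES ('P00001', 'EXAMPLE_DB', 'X123')")
    con.commit()
    return con


def new_release(con, tag):
    """Updated periodically: stamp every record with a new release tag."""
    con.execute("UPDATE protein SET release_tag = ?", (tag,))
    con.commit()


def lookup(con, name):
    """Search by name and follow the cross-references."""
    cur = con.execute(
        "SELECT p.accession, p.release_tag, x.other_db, x.other_id "
        "FROM protein p LEFT JOIN xref x ON x.accession = p.accession "
        "WHERE p.name = ?",
        (name,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    new_release(db, "demo-2")
    print(lookup(db, "example protein A"))
```

A relational store is only one of the storage options named on the slide; flat files or XML would expose the same properties through different tools.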
A brief history of biological databases
• 1965: M. O. Dayhoff et al. publish the “Atlas of Protein Sequence and Structure”
• 1982: EMBL initiates its DNA sequence database, followed within a year by GenBank (then at LANL) and in 1984 by the DNA Data Bank of Japan (DDBJ)
• 1988: EMBL, GenBank and DDBJ agree on a common format for data elements
The growth of GenBank (updates)
Prediction: data size doubles every 14 months
44,575,745,176 bases from 40,604,319 reported sequences (as of Dec. 15, 2004)
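As a rough illustration of what a 14-month doubling time implies, the sketch below projects the size with the formula size(t) = size(0) * 2^(t / 14), with t in months. The function name and the chosen horizons are assumptions made for this example; the projection is simple arithmetic on the slide's figures, not a measured value.

```python
def projected_size(current_size, months_ahead, doubling_months=14.0):
    """Exponential growth: the size doubles once every `doubling_months`."""
    return current_size * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    bases_dec_2004 = 44_575_745_176  # figure quoted on the slide
    for months in (14, 28, 60):
        estimate = projected_size(bases_dec_2004, months)
        print(f"after {months} months: about {estimate:.3e} bases")
```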
The growth of public domain bio-databases
[Figure: number of databases (axis from 0 to 800) by year, 1999 to 2005. Source: The Molecular Biology Database Collection from Nucleic Acids Research]
Information Space (July 17, 1999)
• Nucleotide sequences: 4,456,822
• Protein sequences: 706,862
• 3D structures: 9,780
• Human Unigene Clusters: 75,832
• Maps and Complete Genomes: 10,870
• Different species node: 52,889
• dbSNP: 6,377
• RefGenes: 515
• human contigs > 250 kb: 341 (4.9MB)
• PubMed records: 10,372,886
• OMIM records: 10,695
The challenge of the information space (Feb 10, 2004)
• Nucleotide records: 36,653,899
• Protein sequences: 4,436,362
• 3D structures: 19,640
• Interactions & complexes: 52,385
• Human Unigene Cluster: 118,517
• Maps and Complete Genomes: 6,948
• Different taxonomy nodes: 283,121
• Human dbSNP: 13,179,601
• Human RefSeq records: 22,079
• bp in Human Contigs > 5,000 kb (116): 2,487,920,000
• PubMed records: 12,570,540
• OMIM records: 15,138
in-silico databases
• Sequence DB: EMBL, GenBank, DDBJ
• Structure DB: PDB, SCOP, CATH
• Genomic DB: Ensembl, Genome Browser, NCBI
• Network and pathway DB: HPRD, I2D, STRING, DIP, BIND, KEGG PATHWAY Database, Reactome
• Mathematical model DB: BioModels, CellML
• Medical DB: OMIM, MGD, FlyBase, SGD
Sequence databases
• Used for retrieving a known gene/protein sequence
• Useful for finding information on a gene/protein
• Can find out how many genes are available for a given organism
• Can compare your sequence to the others in the database
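To make "comparing your sequence to the others in the database" concrete, here is a deliberately simplified sketch. It scores percent identity between equal-length sequences position by position, which is an assumption made to keep the example short; real searches rely on alignment tools such as BLAST or FASTA. The toy database and its identifiers (seq1 to seq3) are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions with the same residue (equal lengths only)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this simplified sketch needs equal-length sequences")
    if not seq_a:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


def best_match(query, database):
    """Return (identity, id) of the closest same-length entry, or None."""
    scored = [
        (percent_identity(query, seq), seq_id)
        for seq_id, seq in database.items()
        if len(seq) == len(query)
    ]
    return max(scored) if scored else None


# Hypothetical toy "database": identifier -> nucleotide sequence.
TOY_DB = {"seq1": "ACGTACGT", "seq2": "ACGTTCGT", "seq3": "TTTTTTTT"}

if __name__ == "__main__":
    print(best_match("ACGTACGA", TOY_DB))
```

Position-by-position comparison ignores insertions and deletions, which is exactly what alignment-based tools are designed to handle.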
Protein Databases
• Protein sequence and other related information
  - GenPept: CDS from GenBank entries
  - TrEMBL (1996): automatic CDS translations from EMBL
  - SWISS-PROT (1986): best annotated, least redundant
  - PIR (Protein Information Resource)
    - More automated annotation
    - Collaborations with MIPS and JIPID
  - UniProt (2003)
    - UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
Networks and Pathways Databases
• Networks of molecule interactions
  - Protein-protein interactions
• Biological pathways
  - Metabolic pathways
  - Signal transduction pathways
• Known or predicted data
Networks and Pathways Databases
STRING
Biological model databases
Literature Databases
• Medline/PubMed
• OMIM
• CSULA Library
• BIOSIS
• Bookshelf (from NCBI)
• Melvyl (Books at UC Libraries)
• Other molecular life science databases
  - Science Direct
  - PubMed Central
  - Free Medical Journals
  - LinkOut Journals
  - Wiley InterScience
Literature databases – PubMed (MedLine)
1. It contains entries for more than 11 million abstracts of scientific publications.
2. It enables users to do keyword searches, provides links to a selection of full articles, and has text mining capabilities, e.g. it provides links to related articles and GenBank entries, among others.
3. Searching PubMed efficiently requires some skill. For example, searching with the keyword “interleukin” returns 108,366 matches.
Essential Bioinformatics and Biocomputing
(LSM2104), NUS
PubMed web-site
(http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed )
PubMed Search
(http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed )
Cancer treatment by targeting blood supply:
• Cancer growth depends on blood supply (why?) and thus requires the growth of new blood vessels (angiogenesis)
• Proteins involved in angiogenesis may be potential anticancer targets
• You can find some of these targets by searching PubMed
• The keyword query “cancer angiogenesis enzyme drug” produces 856 entries

Key words                             No. of entries
Cancer                                1.45M
Cancer, Blood supply                  22K
Cancer, Blood supply, Protein         3.9K
Cancer, Blood supply, Enzyme          1.5K
Cancer, Blood supply, Enzyme, Drug    500
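The narrowing effect of adding keywords can also be reproduced programmatically. The sketch below only builds query URLs for NCBI's E-utilities ESearch endpoint and sends no request. Joining the keywords with AND is an assumption about how the slide's searches were combined, and the helper names are invented for this example. The counts returned today will differ from the figures on the slide.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (a programmatic interface to Entrez).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(keywords):
    """Combine keywords so that every one of them must match (assumed AND)."""
    return " AND ".join(keywords)


def build_esearch_url(keywords, retmax=0):
    """Build the search URL; retmax=0 asks for the hit count without IDs."""
    params = {"db": "pubmed", "term": build_pubmed_query(keywords), "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    steps = [
        ["cancer"],
        ["cancer", "blood supply"],
        ["cancer", "blood supply", "enzyme"],
        ["cancer", "blood supply", "enzyme", "drug"],
    ]
    for keywords in steps:
        print(build_esearch_url(keywords))
```

Each added keyword can only keep or shrink the result set, which is the pattern the table shows.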
Some examples of integrated biological database resources
• SRS (Sequence Retrieval System)
• Entrez Browser (at NCBI)
• ExPASy (home of SwissProt)
• Ensembl (open-source based system)
• Human Genome Browser (Jim Kent's creation)
NCBI ENTREZ
• MedLine: literature database
• OMIM: database of human genes and genetic disorders
• GenBank: database of all publicly available DNA sequences
• Protein databases: database of amino acid sequences from SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq
• Genomes: database of genomes from organisms and viruses
• PopSet: database of DNA sequences that have been collected to analyze the evolutionary relatedness of a population
• Taxonomy: database of names of organisms with sequences in GenBank or ...
Database Integration
• Heterogeneous data types
• Accession and naming confusion
• Different IDs for the same entity

DNA records:
Accession    Description             Record type
NM_017442    toll-like receptor 9    RefSeq
BC032713     toll-like receptor 9    cDNA clone
NG_001066    toll-like receptor 7    chromosome X
AF172169     toll-like receptor 7    genomic gene

Protein records:
Accession    Description             Source
Q15399       toll-like receptor 1    Swiss-Prot
NP_067681    toll-like receptor 2    RefSeq
AAH33651     toll-like receptor 7    GenBank protein
1FYW         TIR domain of Tlr2      3D structure (PDB)
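The identifier problem can be made concrete with a small sketch that groups the records listed above by their description. The accession, description and source triples are paired in the order they appear on the slide; the grouping function and the use of the lower-cased description as the key are assumptions made for illustration.

```python
from collections import defaultdict

# (accession, description, source) triples, paired in slide order.
RECORDS = [
    ("NM_017442", "toll-like receptor 9", "RefSeq"),
    ("BC032713", "toll-like receptor 9", "cDNA clone"),
    ("NG_001066", "toll-like receptor 7", "chromosome X"),
    ("AF172169", "toll-like receptor 7", "genomic gene"),
    ("Q15399", "toll-like receptor 1", "Swiss-Prot"),
    ("NP_067681", "toll-like receptor 2", "RefSeq"),
    ("AAH33651", "toll-like receptor 7", "GenBank protein"),
    ("1FYW", "TIR domain of Tlr2", "3D structure (PDB)"),
]


def group_by_name(records):
    """Collect every accession that shares the same description."""
    groups = defaultdict(list)
    for accession, name, source in records:
        groups[name.lower()].append((accession, source))
    return dict(groups)


if __name__ == "__main__":
    for name, entries in group_by_name(RECORDS).items():
        print(name, "->", entries)
```

Grouping by name gathers three different accessions for toll-like receptor 7, yet it leaves the "TIR domain of Tlr2" structure separate from "toll-like receptor 2". Name matching alone is therefore not enough; integration needs curated cross-references.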
Swimming in Data Sources
What Is Data Mining?
• Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
• Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
  - Simple search and query processing
  - (Deductive) expert systems
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the centre, drawing on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, high-performance computing, and applications]
(Data Mining: Concepts and Techniques, July 9, 2009)
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web
Data Mining – Tasks
• Classification - Example: high risk for cancer or not
• Estimation - Example: household income
• Prediction - Example: credit card balance transfer average amount
• Affinity Grouping - Example: people who buy X often also buy Y, with a probability of Z
• Clustering - similar to classification but with no predefined classes
• Description and Profiling - identifying characteristics which explain behaviour - Example: “More men watch football on TV than women”
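The affinity grouping task above ("people who buy X often also buy Y, with a probability of Z") comes down to estimating a conditional frequency. A minimal sketch follows; the basket contents are invented for illustration.

```python
def rule_confidence(transactions, x, y):
    """Estimate P(y in basket | x in basket) from a list of item sets."""
    with_x = [basket for basket in transactions if x in basket]
    if not with_x:
        return 0.0  # no evidence: x never occurs
    return sum(1 for basket in with_x if y in basket) / len(with_x)


# Hypothetical shopping baskets.
BASKETS = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

if __name__ == "__main__":
    z = rule_confidence(BASKETS, "bread", "butter")
    print(f"buyers of bread also buy butter with probability {z:.2f}")
```

Here three baskets contain bread and two of those also contain butter, so Z is 2/3.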
Types of Data Mining
• “Supervised” methods (this DM course)
  - Training data has both predictor attributes and objective (to be predicted) attributes
  - Predict discrete classes -> classification
  - Predict continuous values -> regression
  - Duality: classification <-> regression
• “Unsupervised” methods
  - Training data without objective attributes
  - Goal: find novel & interesting patterns
  - Cutting-edge research, fewer success stories
• Semi-supervised methods: market-basket, …
© 2008, Jaime G. Carbonell, December 2008
Some Definitions (KBS vs ML)
• Knowledge-Based Systems
  - Rules, procedures, semantic nets, Horn clauses
  - Inference: matching, inheritance, resolution
  - Acquisition: manually from human experts
• Machine Learning
  - Data: tables, relations, attribute lists, …
  - Inference: rules, trees, decision functions, …
  - Acquisition: automated from data
• Data Mining
  - Machine learning applied to large real problems
  - May be augmented with KBS
Machine Learning Application Process in a Nutshell
• Choose a problem where
  - Prediction is valuable and non-trivial
  - Sufficient historical data is available
  - The objective is measurable (including in past data)
• Prepare the data
  - Tabular form, clean, divide into training & test sets
• Select a machine learning algorithm
  - Human-readable decision function -> rules, trees, …
  - Robust with noisy data -> kNN, logistic regression, …
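The "prepare the data" and "select an algorithm" steps can be sketched in a few lines of plain Python: split tabular data into training and test sets, then classify with k-nearest neighbours (kNN), one of the methods the slide names. The toy data set, the split fraction and k=3 are assumptions chosen for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, features, k=3):
    """Majority label among the k training rows closest to `features`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of held-out rows whose label is predicted correctly."""
    hits = sum(1 for feats, label in test if knn_predict(train, feats, k) == label)
    return hits / len(test)


# Hypothetical toy data: (feature vector, class label).
DATA = [
    ((0.0, 0.1), "low"), ((0.2, 0.0), "low"), ((0.1, 0.3), "low"), ((0.3, 0.2), "low"),
    ((5.0, 5.1), "high"), ((5.2, 4.9), "high"), ((4.8, 5.3), "high"), ((5.1, 5.0), "high"),
]

if __name__ == "__main__":
    train, test = train_test_split(DATA)
    print("held-out accuracy:", accuracy(train, test))
```

Measuring accuracy only on held-out rows is what makes the objective "measurable" in the sense of the first bullet.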
Machine Learning Techniques
• Technical basis for data mining: algorithms for acquiring structural descriptions from examples
• Methods originate from artificial intelligence, statistics, and research on databases
• Structural descriptions represent patterns explicitly and can be used to
  - predict the outcome in a new situation
  - understand and explain how a prediction is derived (maybe even more important)
© Copyright 2006, Natasha Balac
Symbolic Rule Induction Example (1)
Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
65   M       101   +       .23     USA  normal  strep
25   M       102   +       .00     CAN  normal  strep
65   M       102   -       .78     BRA  rash    dengue
36   F       99    -       .19     USA  normal  *none*
11   F       103   +       .23     USA  flush   strep
88   F       98    +       .21     CAN  normal  *none*
39   F       100   +       .10     BRA  normal  strep
12   M       101   +       .00     BRA  normal  strep
15   F       101   +       .66     BRA  flush   dengue
20   F       98    +       .00     USA  rash    *none*
81   M       98    -       .99     BRA  rash    ec-12
87   F       100   -       .89     USA  rash    ec-12
12   F       102   +       ??      CAN  normal  strep

Cases without a disease label:
Age  Gender  Temp  b-cult  c-cult  loc  Skin
14   F       101   +       .33     USA  normal
67   M       102   +       .77     BRA  rash
Symbolic Rule Induction Example (2)
Candidate Rules:

IF   age = [12,65]
     gender = *any*
     temp = [100,103]
     b-cult = +
     c-cult = [.00,.23]
     loc = *any*
     skin = (normal,flush)
THEN: strep

IF   age = (15,65)
     gender = *any*
     temp = [101,102]
     b-cult = *any*
     c-cult = [.66,.78]
     loc = BRA
     skin = rash
THEN: dengue

Disclaimer: These are not real medical records or rules
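To show how such candidate rules are applied, the sketch below encodes the two rules as data and tests them against a few cases from the example. Two assumptions are made for the code: every numeric range is treated as closed (the slide writes the dengue age range with parentheses), and gender is omitted because both rules accept any value. As the disclaimer says, neither the records nor the rules are real medical knowledge.

```python
# Candidate rules from the slide; None stands for *any*.
RULES = [
    {"label": "strep", "age": (12, 65), "temp": (100, 103), "b_cult": "+",
     "c_cult": (0.00, 0.23), "loc": None, "skin": {"normal", "flush"}},
    {"label": "dengue", "age": (15, 65), "temp": (101, 102), "b_cult": None,
     "c_cult": (0.66, 0.78), "loc": "BRA", "skin": {"rash"}},
]


def in_range(value, bounds):
    """Closed-interval test (an assumption, see the note above)."""
    low, high = bounds
    return low <= value <= high


def matches(rule, case):
    """True when the case satisfies every condition of the rule."""
    return (
        in_range(case["age"], rule["age"])
        and in_range(case["temp"], rule["temp"])
        and rule["b_cult"] in (None, case["b_cult"])
        and in_range(case["c_cult"], rule["c_cult"])
        and rule["loc"] in (None, case["loc"])
        and case["skin"] in rule["skin"]
    )


def classify(case, rules=RULES):
    """Return the label of the first matching rule, or None."""
    for rule in rules:
        if matches(rule, case):
            return rule["label"]
    return None


# First row of the example table and the two cases without a label.
KNOWN = {"age": 65, "temp": 101, "b_cult": "+", "c_cult": 0.23, "loc": "USA", "skin": "normal"}
NEW_1 = {"age": 14, "temp": 101, "b_cult": "+", "c_cult": 0.33, "loc": "USA", "skin": "normal"}
NEW_2 = {"age": 67, "temp": 102, "b_cult": "+", "c_cult": 0.77, "loc": "BRA", "skin": "rash"}

if __name__ == "__main__":
    for case in (KNOWN, NEW_1, NEW_2):
        print(case["age"], case["loc"], "->", classify(case))
```

Under these assumptions neither of the two cases without a label matches a rule, which illustrates how rules induced from a handful of examples can be too specific to generalize.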
Why Validation?
• Validation type:
  - Within the existing data
  - With newly collected data
• Errors and uncertainties:
  - Systematic or random errors
    - Unknown variables - number of classes
    - Noise level - statistical confidence due to noise
    - Model validity - error measure, model over-fit or under-fit
    - Number of data points - measurement replicas
• Other issues
  - Experimental support of general theories
  - Exhaustive sampling is not feasible
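Validation "within the existing data" is commonly done with k-fold cross-validation, a technique the slides do not name explicitly. A minimal sketch follows. The model, which always predicts the mean of the training values, and the data are assumptions chosen to keep the example short. The rows are assumed to be in random order already, so no shuffling is done.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def cross_validated_error(values, k):
    """Mean held-out squared error of a 'predict the training mean' model."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(values), k):
        mean = sum(values[i] for i in train_idx) / len(train_idx)
        squared = [(values[i] - mean) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / len(fold_errors)


if __name__ == "__main__":
    measurements = [2.0, 2.5, 1.5, 2.0, 3.0, 1.0]  # hypothetical replicas
    print("3-fold error estimate:", cross_validated_error(measurements, 3))
```

Because every point is held out exactly once, the error estimate is less sensitive to a single lucky or unlucky split, which helps detect over-fitting when data points are scarce.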
Major Challenges in Data Mining
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, stream, and incremental mining methods
• Handling high-dimensionality
• Handling noise, uncertainty, and incompleteness of data
• Incorporation of constraints, expert knowledge, and background knowledge in data mining
• Pattern evaluation and knowledge integration
• Mining diverse and heterogeneous kinds of data: e.g., bioinformatics, Web, software/system engineering, information networks
• Application-oriented and domain-specific data mining
• Invisible data mining (embedded in other functional modules)
• Protection of security, integrity, and privacy in data mining
Where to find the databases
• Table of addresses for major databases and tools
• Nucleic Acids Research Database issue, January each year
• Nucleic Acids Research Software issue (new)
• Amos’s list of tools: http://www.expasy.ch/alinks.html
Finding data & tools
• Google: http://www.google.com
• Nucleic Acids Research, Database & Web Server issues (Jan 1 and July 1): http://nar.oupjournals.org
• UBC Bioinformatics Links Directory: http://bioinformatics.ubc.ca/resources/links_directory/
Application
• Genome Analysis
  - Pipeline Analysis
  - Genome Annotation
  - SNP
• Data warehouse / database integration
• New algorithms
• Literature mining
• Systems biology / microarray analysis
HEALTHCARE:
• Decision support: optimal treatment choice
• Survivability predictions
• Medical facility utilization predictions
Database Mining Tools
• SRS: Sequence Retrieval System
• Entrez: search engine at NCBI, US
• BankIt: World Wide Web sequence submission server
• Sequence similarity search tools: BLAST & FASTA
  - Finding sequence homologs to deduce the identity of a query sequence
  - Identifying potential sequence homologs with known three-dimensional structure
Data Mining Tool Features
• Installation: hardware and software requirements (operating system, virtual machine, application server, required memory capacity...); type and simplicity of installation and deployment; availability of an installation guide; licensing, …
• Usage: Is there a Graphical User Interface (GUI) available? Is it intuitive and easy to use? Flexible and personalizable? Is there an Application Programming Interface (API) available? What is its learning curve like? Which programming languages and standards are supported? Modularity? What kind of documentation is available (tutorials, examples...)? Is technical support available?
• Input: data pre-processing; input formats; connection with databases (JDBC, ODBC, ...)
• Output: output formats; reusability of a model; available reports and graphs, ...
• Performance: speed, scalability, memory usage, ...
• Features: Which algorithms are supported? Can new algorithms be added? Is there Geographic Information System (GIS) integration? Does it support standards like DMQL? Text mining features? Available plug-ins?
Some Tools
• WEKA: University of Waikato, New Zealand
• YALE: University of Dortmund, Germany
• MiningMart: University of Dortmund, Germany
• Orange: University of Ljubljana, Slovenia
• Rattle: Togaware
• Borgelt: University of Magdeburg, Germany
• Gnome Data Mine: Togaware
• Tanagra: University of Lyon 2
• Xelopes: Prudsys
• SpagoBI: ObjectWeb, Italy
• JasperIntelligence: JasperSoft
• AlphaMiner: University of Hong Kong
• Databionic ESOM Tools: University of Marburg, Germany
• MLC++: SGI, USA
• MLJ: Kansas State University
• Pentaho: Pentaho
Bioinformatics software
Major sources
• Software package at ExPASy Molecular Biology Server: http://www.expasy.org ; http://au.expasy.org
• Software at PBIL Bio-Informatique Lyonnais: http://pbil.univ-lyon1.fr/
• Toolbox at EBI European Bioinformatics Institute: http://www.ebi.ac.uk/Tools/index.html
Bioinformatics software
• Major types of bioinformatics tools
  - Sequence analysis tools
  - Sequence comparison
  - Pattern and domain search
  - Evolutionary analysis
  - Prediction of sequence structure and function
  - Visualization of molecular structures
  - Structure modeling
  - Bibliographic and text searches
  - Specialized and other tools
Bioinformatics software
Its role in research:
• Hypothesis-driven research cycle in biology (from Kitano H. Systems biology: a brief overview. Science 2002, 295:1662-4)
Take home notes
Yes, if you train quickly, you can create a new database of databases, but first have your lunch!