Databases and Data Mining
Lecture 1: Introduction to Data Mining for Bioinformatics
Fall 2005
Peter van der Putten
(putten_at_liacs.nl)
Course Outline
• Objective
– Understand the basics of data mining
– Gain understanding of the potential for applying it in
the bioinformatics domain
– Limited hands-on experience
• Schedule
Date       Time           Room     Topic
4-Nov-05   13.45 - 15.30  174      Lecture: Introduction
18-Nov-05  13.45 - 15.30  174      Lecture: Predictive Data Mining
           15.45 - 17.30  306/308  Practical Assignments
25-Nov-05  13.45 - 15.30  403      Lecture: Descriptive Data Mining & Search
2-Dec-05   13.45 - 15.30  174      Lecture: Bioinformatics Data Mining Cases
           15.45 - 17.30  306/308  Practical Assignments
• Evaluation
– Practical assignment (2nd) plus take home exercise
Agenda Today
• What is data mining?
• A short summary of life
• Data mining revisited
What is data mining?
Genomic Microarrays – Case Study
• Problem:
– Leukemia (different types of Leukemia cells look very
similar)
– Given data for a number of samples (patients), can
we
• Accurately diagnose the disease?
• Predict outcome for given treatment?
• Recommend best treatment?
• Solution
– Data mining on micro-array data
Example: ALL/AML data
• 38 training patients, 34 test patients, ~ 7,000 patient attributes (micro
array gene data)
• 2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid
Leukemia (AML)
• Use train data to build diagnostic model
[Figure: ALL and AML samples]
Results on test data:
33/34 correct; the 1 error may be a mislabeled sample
Sources of (artificial) intelligence
• Reasoning versus learning
• Learning from data
– Patient data
– Customer records
– Stock prices
– Piano music
– Criminal mug shots
– Websites
– Robot perceptions
– Etc.
Some working definitions….
• ‘Data Mining’ and ‘Knowledge Discovery in
Databases’ (KDD) are used interchangeably
• Data mining =
– The process of discovery of interesting, meaningful
and actionable patterns hidden in large amounts of
data
• Multidisciplinary field originating from artificial
intelligence, pattern recognition, statistics,
machine learning, bioinformatics, econometrics,
….
A short summary of life
Bio Building Blocks
Biotech Data Mining Applications
The Promise….
DNA, Proteins, Cells
From DNA to Proteins
Discovering the structure of DNA
James Watson & Francis Crick
- Rosalind Franklin
The structure of DNA
DNA Trivia
• DNA stores instructions for the cell to perform its
functions
• Double helix, two interwoven strands
• Each strand is a sequence of so-called
nucleotides
• Deoxyribonucleic acid (DNA) comprises 4
different types of nucleotides (bases): adenine
(A), thymine (T), cytosine (C) and guanine (G)
– Nucleotide uracil (U) doesn’t occur in DNA
• Each strand is the reverse complement of the other
• Complementary bases
– A with T
– C with G
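The pairing rules above are enough to derive one strand from the other. A minimal Python sketch follows; the function name and the example sequence are illustrative assumptions, not part of the lecture.

```python
# Complementary base pairs as listed above: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the opposite strand: complement every base, then read it backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.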
DNA Trivia
• Each nucleus contains 3 x 10^9 nucleotides
• Human body contains roughly 3 x 10^13 cells
• Human DNA contains 26k expressed genes,
each gene codes for a protein in principle
• DNA of different persons varies 0.2% or less
• Human DNA contains 3.2 x 10^9 base pairs
– ΦX174 virus: 5,386
– Salamander: 100 x 10^9
– Amoeba dubia: 670 x 10^9
Primary Protein Structure
• Proteins are built out of peptides, which are polymer chains of amino
acids
• Twenty amino acids are encoded by the standard genetic code
shared by nearly all organisms and are called standard amino acids
(more than 100 amino acids exist in nature)
Protein Structure
from Primary to Quaternary
Proteins: 3D Structure
A representation of the 3D structure of myoglobin, showing coloured alpha helices.
This protein was the first to have its structure solved by X-ray crystallography by
Max Perutz and Sir John Cowdery Kendrew in 1958, which led to them receiving a
Nobel Prize in Chemistry. http://en.wikipedia.org/wiki/Protein
Proteins: 3D Structure
Molecular surface of several proteins showing their comparative sizes.
From left to right are: Antibody (IgG), Hemoglobin, Insulin (a hormone),
Adenylate Kinase (an enzyme), and Glutamine Synthetase (an enzyme).
Proteins: 3D Structure
G Protein-Coupled Receptors (GPCR) represent more than half the current drug targets
DNA Codes for Proteins
but Proteins also Control Gene Expression
• Protein regulation occurs at each step of
synthesis
Repressor Protein Switching
Genes On and Off
Regulatory Protein Coordinating
Gene Expression
Importance of Combinatorial Gene
Control
• combinations of a few gene regulatory proteins
can generate many different cell types during
development
Some working definitions….
• Bioinformatics =
– Bioinformatics is the research, development, or
application of computational tools and approaches for
expanding the use of biological, medical, behavioral
or health data, including those to acquire, store,
organize, archive, analyze, or visualize such data
[http://www.bisti.nih.gov/].
– Or more pragmatic: Bioinformatics or computational
biology is the use of techniques from applied
mathematics, informatics, statistics, and computer
science to solve biological problems [Wikipedia Nov
2005]
• NCBI Tools for data mining:
– Nucleotide sequence analysis
– Protein sequence analysis
– Structures
– Genome analysis
– Gene expression
• Data mining or not?
Bioinformatics and data mining
• From sequence to structure to function
• Genomics (DNA), Transcriptomics (RNA), Proteomics
(proteins), Metabolomics (metabolites)
• Pattern matching and search
• Sequence matching and alignment
• Structure prediction
– Predicting structure from sequence
– Protein secondary structure prediction
• Function prediction
– Predicting function from structure
– Protein localization
• Expression analysis
– Genes: micro array data analysis etc.
– Proteins
• Regulation analysis
Bioinformatics and data mining
• Classical medical and clinical studies
• Medical decision support tools
• Text mining on medical research literature (MEDLINE)
• Spectrometry, Imaging
• Systems biology and modeling biological systems
• Population biology & simulation
• Spin-off: Biologically inspired computational learning
– Evolutionary algorithms, neural networks, artificial immune
systems
Examples of my related research
• Topology preserving property of self-organizing maps
– Neural network for clustering & classification inspired by cortical
maps
• Benchmarking Artificial Immune Systems
• Predicting throat cancer survival rate
– Value of fusing data from various sources for this purpose
• Automated recognition of sick yeast cells in images (with
prof. Verbeek)
• Recommender systems in bioinformatics
– Amazon.com style recommendations
Data mining revisited
Some working definitions….
• Concepts: kinds of things that can be learned
– Aim: intelligible and operational concept description
– Example: the relation between patient characteristics
and the probability to be diabetic
• Instances: the individual, independent examples of
a concept
– Example: a patient, candidate drug etc.
• Attributes: measuring aspects of an instance
– Example: age, weight, lab tests, microarray data etc.
Pattern or attribute space
Data mining tasks
• Predictive data mining
– Classification: classify an instance into a category
– Regression: estimate some continuous value
• Descriptive data mining
– Matching & search: finding instances similar to x
– Clustering: discovering groups of similar instances
– Association rule extraction: if a & b then c
– Summarization: summarizing group descriptions
– Link detection: finding relationships
– …
Data Mining Tasks: Search
Finding best matching instances
Every instance is a point in pattern space. Attributes are the
dimensions of an instance, e.g. age, weight, gender etc.
Pattern spaces may be high dimensional (10 to thousands of
dimensions)
[Figure: instances as points in pattern space; axes e.g. age and weight]
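Finding the best matching instances can be sketched as a nearest-neighbour search in pattern space. The Python below is a minimal illustration that assumes Euclidean distance as the similarity measure (the slide does not name one); the patient values and function names are made up.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points in pattern space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matches(query, instances, k=1):
    """Return the k instances closest to the query point."""
    return sorted(instances, key=lambda inst: euclidean(query, inst))[:k]

# Hypothetical instances with two attributes: (age, weight in kg).
patients = [(25, 60), (47, 82), (63, 90), (70, 55)]
print(best_matches((50, 80), patients, k=2))  # [(47, 82), (63, 90)]
```

In practice attributes are usually rescaled first, since otherwise the attribute with the largest numeric range dominates the distance.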
Data Mining Tasks: Clustering
Clustering is the discovery of groups in a set of instances
Groups are different, instances in a group are similar
[Figure: groups of instances in pattern space; axes e.g. age and weight]
In 2 to 3 dimensional pattern space you could just visualise
the data and leave the recognition to a human end user
In >3 dimensions this is not
possible
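When the data cannot be inspected visually, an algorithm has to find the groups. The slides do not name a clustering method; the sketch below uses k-means as one common choice, with made-up (age, weight) data and hand-picked starting centers, all of which are assumptions for illustration.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: alternate between assigning points and moving centers."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (a center with an empty cluster stays where it is).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical instances: (age, weight in kg).
points = [(20, 55), (22, 60), (25, 58), (60, 85), (65, 90), (62, 88)]
centers, clusters = kmeans(points, centers=[(20, 55), (65, 90)])
print(clusters)
```

The same code works unchanged for instances with more than three attributes, which is exactly the case where visual inspection fails.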
Data Mining Tasks: Classification
The goal of a classifier is to separate classes on the basis of
known attributes
The classifier can then be applied to an instance with unknown
class
For instance, classes are healthy (circle) and sick (square);
attributes are age and weight
[Figure: healthy and sick instances in pattern space; axes age and weight]
Examples of Classification Techniques
• Majority class vote
• Machine learning & AI
• Decision trees
• Nearest neighbor
• Neural networks
• Genetic algorithms / evolutionary computing
• Artificial Immune Systems
• Good old statistics
• …..
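The first technique in the list is useful as a baseline. A minimal Python sketch, assuming "majority class vote" means always predicting the most frequent class seen in the training data; the labels and counts are made up.

```python
from collections import Counter

def train_majority(labels):
    """Return the most frequent class label in the training data."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical training labels: 90 healthy patients, 10 sick patients.
training_labels = ["healthy"] * 90 + ["sick"] * 10
predicted = train_majority(training_labels)

# The baseline predicts the same class for every instance.
accuracy = sum(label == predicted for label in training_labels) / len(training_labels)
print(predicted, accuracy)  # healthy 0.9
```

Any other classifier should beat this accuracy to be worth using, which matters when one class is much more common than the other.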
Example Classification Algorithm 1
Decision Trees
20000 patients: age > 67?
  yes: 1200 patients: weight > 85kg?
    yes: 400 patients, Diabetic (50%)
    no: 800 patients, Diabetic (10%)
  no: 18800 patients: gender = male?
    no: etc.
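A decision tree is applied by following its tests from the root to a leaf. The Python sketch below encodes one reading of the example tree (the 50% leaf for weight above 85 kg, the 10% leaf otherwise); the function name is hypothetical, and the branch for patients aged 67 or younger is left open because the slide only says "etc.".

```python
def diabetic_probability(age, weight_kg):
    """Follow the branch of the example tree that the slide spells out."""
    if age > 67:
        if weight_kg > 85:
            return 0.50  # leaf with 400 patients
        return 0.10      # leaf with 800 patients
    return None          # age <= 67: the tree continues with a gender split ("etc.")

# Expected number of diabetics among the 1200 patients older than 67.
expected = 400 * 0.50 + 800 * 0.10
print(diabetic_probability(70, 90), diabetic_probability(70, 80), expected)  # 0.5 0.1 280.0
```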
Decision Trees in Pattern Space
The goal of the classifier is to separate classes (circle, square)
on the basis of the attributes age and weight
Each line corresponds to a split in the tree
Decision areas are ‘tiles’ in pattern space
[Figure: pattern space divided into tiles by the tree splits; axes age and weight]
Example classification algorithm 3:
Neural Networks
• Inspired by neuronal computation in the brain (McCulloch & Pitts
1943 (!))
[Figure: network diagram; input: e.g. customer characteristics, output: e.g. response]
• Input (attributes) is coded as activation on the input layer neurons,
activation feeds forward through network of weighted links between
neurons and causes activations on the output neurons (for instance
diabetic yes/no)
• Algorithm learns to find optimal weights using the training instances
and a general learning rule.
Neural Networks
• Example simple network (2 layers)
– Inputs: age, body_mass_index
– Weights: weight_age, weight_body_mass_index
– Output: probability of being diabetic
• Probability of being diabetic = f (age * weight_age + body_mass_index
* weight_body_mass_index)
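The formula above can be evaluated directly once f is chosen. The Python sketch below assumes f is the logistic (sigmoid) function, which is a common choice but is not stated on the slide; the weight values are made up rather than learned.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def diabetic_probability(age, body_mass_index, weight_age, weight_bmi):
    """Two-layer network: weighted sum of the inputs passed through f."""
    return sigmoid(age * weight_age + body_mass_index * weight_bmi)

# Hypothetical weights; a learning rule would normally fit these to training data.
p = diabetic_probability(age=50, body_mass_index=30, weight_age=0.02, weight_bmi=0.05)
print(round(p, 3))
```

Training means adjusting weight_age and weight_bmi until the outputs match the known classes of the training instances as closely as possible.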
Neural Networks in Pattern Space
Classification
Simple network: only a line available (why?) to separate
classes
Multilayer network: any classification boundary possible
[Figure: class boundaries in pattern space; axes e.g. age and weight]
Descriptive data mining:
association rules
• Discovery of interesting patterns
• Rule format: if A (and B and C etc) then Z
• Example:
– If customer buys potatoes (A) and sauerkraut (B) then customer
buys sausage (Z)
• Important measures
– Support of the condition: how often do potatoes and sauerkraut occur
together (A,B)
– Confidence of the rule: how often does sausage then also occur, i.e.
support of (A,B,Z) / support of the condition (is A,B → Z always true?)
• Could be used for instance for mining gene expression
data
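Both measures can be computed by counting. The Python sketch below uses the potatoes, sauerkraut and sausage rule from the example; the five shopping baskets (and the extra item "beer") are made up for illustration.

```python
# Hypothetical transactions: each basket is the set of items one customer bought.
baskets = [
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut"},
    {"potatoes", "beer"},
    {"sausage", "beer"},
]

def support(items, baskets):
    """Fraction of baskets that contain all the given items."""
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(condition, consequence, baskets):
    """Of the baskets matching the condition, the fraction that also match the consequence."""
    return support(condition | consequence, baskets) / support(condition, baskets)

condition, consequence = {"potatoes", "sauerkraut"}, {"sausage"}
print(support(condition, baskets))                  # 0.6 (3 of 5 baskets)
print(confidence(condition, consequence, baskets))  # 2 of the 3 matching baskets
```

For gene expression data the baskets would be samples and the items would be, for example, genes that are over-expressed in that sample.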
Quiz Question
What have we learned today
• An introduction into applying data mining for
bioinformatics
• A short history of life
• Basic data mining concepts