Advanced Algorithms for Biological
Data Analysis
Center for Bioinformation Technology (CBIT) &
Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
http://bi.snu.ac.kr/
http://cbit.snu.ac.kr/
Lecture Schedule





Day 1: Introduction to Machine Learning
Day 2: Neural Networks
Day 3: Hidden Markov Models
Day 4: Principal Component Analysis
Day 5: Clustering Analysis
Introduction to Machine Learning
Algorithms in Bioinformatics
Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT) &
Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
E-mail: [email protected]
http://bi.snu.ac.kr./
http://cbit.snu.ac.kr/
Outline

Part I
Concept of Machine Learning (ML)
Machine Learning Algorithms and Applications
Applications in Bioinformatics

Part II
Version Space Learning
Decision Tree Learning
What is Artificial Intelligence (AI)?

Design and study of computer programs that behave intelligently.
Designing computer programs to make computers smarter.
Study of how to make computers do things at which, at the moment, people are better.
(No satisfactory definition of AI)
Research Areas and Approaches
Artificial Intelligence
Research: Learning Algorithms, Inference Mechanisms, Knowledge Representation, Intelligent System Architecture
Application: Intelligent Agents, Information Retrieval, Electronic Commerce, Data Mining, Bioinformatics, Natural Language Processing, Expert Systems
Paradigm: Rationalism (Logical), Empiricism (Statistical), Connectionism (Neural), Evolutionary (Genetic), Biological (Molecular)
Concept of Machine Learning
Context
Machine Learning draws on Computer Science (AI), Cognitive Science, Statistics, and Information Theory.
Why Machine Learning?




Recent progress in algorithms and theory
Growing flood of online data
Computational power is available
Budding industry
Three niches for machine learning
 Data mining: using historical data to improve decisions
 Medical records --> medical knowledge

Software applications we can’t program by hand
 Autonomous driving
 Speech recognition

Self-customizing programs
 Newsreader that learns user interests
Brief History of Machine Learning





1950s: Samuel's checkers player
1960s: Neural networks, perceptron; pattern recognition; learning in the limit theory; Minsky & Papert.
1970s: Symbolic concept induction; Winston's arch learner; knowledge acquisition bottleneck; Quinlan's ID3; Michalski's AQ and soybean diagnosis results; scientific discovery with BACON; mathematical discovery with AM.
1980s: Continued progress on decision-tree and rule learning; explanation-based learning; speedup learning; utility problem; analogy; resurgence of connectionism (PDP, ANN); Valiant's PAC learning; experimental evaluation.
1990s: Data mining; adaptive software agents & IR; reinforcement learning; theory refinement; inductive logic programming; voting, bagging, boosting, and stacking; learning Bayesian networks.
Learning: Definition

Definition
Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment.
  the improvement of behavior
  through acquisition of knowledge
  on some performance task
  based on partial task experience
A Learning Problem: EnjoySport
Sky | Temp | Humid | Wind | Water | Forecast | EnjoySport
Sunny | Warm | Normal | Strong | Warm | Same | Yes
Sunny | Warm | High | Strong | Warm | Same | Yes
Rainy | Cold | High | Strong | Warm | Change | No
Sunny | Warm | High | Strong | Cool | Change | Yes
What is the general concept?
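
One way to explore this question is with the Find-S procedure, which is not named on this slide and is used here only as an illustration. The sketch assumes the four training rows read (Sunny, Warm, Normal, Strong, Warm, Same, Yes), (Sunny, Warm, High, Strong, Warm, Same, Yes), (Rainy, Cold, High, Strong, Warm, Change, No), and (Sunny, Warm, High, Strong, Cool, Change, Yes). It keeps the most specific conjunctive hypothesis consistent with the positive examples; the names `find_s` and `matches` are hypothetical.

```python
# Find-S sketch: start from the first positive example and replace an
# attribute value by "?" whenever a later positive example disagrees.
ATTRIBUTES = ["Sky", "Temp", "Humid", "Wind", "Water", "Forecast"]

# Training examples as (attribute values, EnjoySport label).
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def find_s(examples):
    """Return the most specific hypothesis covering all positive examples."""
    hypothesis = None
    for values, label in examples:
        if not label:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(values)
        else:
            hypothesis = [h if h == v else "?" for h, v in zip(hypothesis, values)]
    return hypothesis


def matches(hypothesis, values):
    """True if the hypothesis classifies the example as positive."""
    return all(h == "?" or h == v for h, v in zip(hypothesis, values))


hypothesis = find_s(EXAMPLES)
print(dict(zip(ATTRIBUTES, hypothesis)))
```

Under these assumptions the learned concept constrains only Sky, Temp, and Wind, and it rejects the single negative example.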
Possible Uses of Machine Learning
configuration and design
diagnostic reasoning
planning and scheduling
data mining and knowledge discovery
language understanding
execution and control
vision and speech
Metaphors and Methods
Neurobiology: Connectionist Learning
Biological Evolution: Genetic Learning
Heuristic Search: Tree / Rule Induction
Memory and Retrieval: Case-Based Learning
Statistical Inference: Probabilistic Induction
Learning: Components

Components of a learning system
Performance: accuracy, efficiency, understandability
Environment: external setting to the learner
Knowledge: internal data structure
Experience: perception, action, mental traces
Improvement: desirable change in performance
Learning System
Performance
problem
solution
Environment
get data
improve behavior
Knowledge
get knowledge
acquired knowledge
Learning
What is the Learning Problem?

Learning = improving with experience at some
task
Improve over task T,
With respect to performance measure P,
Based on experience E.
E.g., Learn to play checkers
 T: Play checkers
 P: % of games won in world tournament
 E: opportunity to play against self
Machine Learning: Tasks

Supervised Learning
  Estimate an unknown mapping from known input-output pairs
  Learn $f_w$ from training set $D = \{(x, y)\}$ s.t. $f_w(x) \approx y = f(x)$
  Classification: y is discrete
  Regression: y is continuous

Unsupervised Learning
  Only input values are provided
  Learn $f_w$ from $D = \{x\}$ s.t. $f_w(x) \approx x$
  Compression
  Clustering

Reinforcement Learning
Machine Learning: Strategies









Rote learning
Concept learning
Learning from examples
Learning by instruction
Inductive learning
Deductive learning
Explanation-based learning (EBL)
Learning by analogy
Learning by observation
Supervised Learning

Given a sequence of input/output pairs of the form <x_i, y_i>, where x_i is a possible input and y_i is the output associated with x_i.
Learn a function f that accounts for the examples seen so far, f(x_i) = y_i for all i, and that makes a good guess for the outputs of the inputs that it has not seen.
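
As a minimal sketch of this definition, a nearest-neighbour learner memorizes the pairs, so f(x_i) = y_i holds for every training input, and it guesses the output of an unseen input from the closest stored example. The office data, the distance measure, and the names `learn` and `mismatches` are hypothetical and chosen only for illustration.

```python
# Supervised learning sketch: memorize <x_i, y_i> pairs and answer
# queries with the output of the most similar stored input (1-NN).

def mismatches(a, b):
    """Number of attribute positions where two inputs differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def learn(pairs):
    """Return a function f built from the training pairs."""
    memory = list(pairs)

    def f(x):
        # The stored input with the fewest mismatches decides the output.
        _, nearest_y = min(memory, key=lambda pair: mismatches(pair[0], x))
        return nearest_y

    return f


# Hypothetical training pairs: (floor, office type) -> recycling bin?
TRAINING_PAIRS = [
    ((3, "prof"), "Yes"),
    ((3, "student"), "No"),
    ((4, "prof"), "Yes"),
    ((5, "student"), "No"),
]

f = learn(TRAINING_PAIRS)
print(all(f(x) == y for x, y in TRAINING_PAIRS))  # consistent with seen pairs
print(f((5, "prof")))                             # a guess for an unseen input
```

Ties between equally close stored inputs are broken by storage order here, which is an arbitrary design choice of the sketch.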
Examples of Input-Output Pairs
Task | Inputs | Outputs
Recognition | Descriptions of objects | Classes that the objects belong to
Action | Descriptions of situations | Actions or predictions
Janitor robot problem | Descriptions of offices (floor, prof's office) | Yes or No (indicating whether or not the office contains a recycling bin)
Classification and Concept
Learning

Classification
If the function is discrete valued, then the
outputs are called classes

Concept learning
Learned function has only two possible outputs
Unsupervised Learning

Clustering
A clustering algorithm partitions the inputs into a fixed
number of subsets or clusters so that inputs in the same
cluster are close to one another.

Discovery learning
The objective is to uncover new relations in the data.

Reinforcement learning
Uses a feedback signal (not the target output) that gives
the learning program an indication of whether or not
what it has learned is correct.
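
The clustering item above can be sketched with k-means, one common algorithm that partitions inputs into a fixed number k of clusters; the slide does not name a specific algorithm, so this choice is an assumption. The points, the naive initialization from the first k inputs, and the function names are hypothetical.

```python
# k-means sketch: alternate between assigning each input to its closest
# center and moving each center to the mean of its assigned inputs.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=20):
    centers = list(points[:k])  # naive initialization: first k inputs
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return centers, assignment


# Hypothetical 2-D inputs forming two well-separated groups.
POINTS = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.9, 1.1), (4.8, 5.1)]

centers, assignment = k_means(POINTS, k=2)
print(assignment)
```

The result of k-means depends on the initial centers; the data here are ordered so that the first two inputs come from different groups.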
Online and Batch Learning

Batch methods
Process large sets of examples all at once.

Online (incremental) methods
Process examples one at a time.
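
The difference can be sketched with the simplest possible learner, an estimator of the mean of the examples. The batch version needs the whole set at once, while the online version updates its estimate after each example and never stores the data. The class and function names are hypothetical.

```python
# Batch versus online (incremental) estimation of a mean.

def batch_mean(examples):
    """Batch method: process the full set of examples at once."""
    return sum(examples) / len(examples)


class OnlineMean:
    """Online method: update the estimate one example at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        # Move the estimate toward x by a step of size 1/count.
        self.mean += (x - self.mean) / self.count
        return self.mean


EXAMPLES = [2.0, 4.0, 9.0, 1.0, 4.0]  # hypothetical data

learner = OnlineMean()
for x in EXAMPLES:
    learner.update(x)

print(batch_mean(EXAMPLES), learner.mean)
```

Both versions arrive at the same estimate here; online methods matter when examples arrive as a stream or the full set does not fit in memory.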
Machine Learning Algorithms and
Applications
Machine Learning Algorithms (1/2)

Symbolic Learning (covered on Day 1)
 Version Space Learning
 Case-Based Learning

Neural Learning (covered on Day 2)
 Multilayer Perceptrons (MLPs)
 Self-Organizing Maps (SOMs)
 Support Vector Machines (SVMs)

Evolutionary Learning (very briefly explained on Day 1)
 Evolution Strategies
 Evolutionary Programming
 Genetic Algorithms
 Genetic Programming
Machine Learning Algorithms (2/2)

Probabilistic Learning (covered on Days 3 and 5)
 Bayesian Networks (BNs)
 Helmholtz Machines (HMs)
 Latent Variable Models (LVMs)
 Generative Topographic Mapping (GTM)

Other Machine Learning Methods (partially covered on
Days 1 and 4)
 Decision Trees (DTs)
 Reinforcement Learning (RL)
 Boosting Algorithms
 Mixture of Experts (ME)
 Independent Component Analysis (ICA)
Example Applications of ML (1/2)

Banking & Investment
 Credit card fraud
 Delinquent accounts
 Authorization of purchases
 Predict stock market

Health Care
 Disease diagnosis
 Managing resources
 Look for causal relationships between environment and disease

Marketing
 Credit card applications
 Use past buying habits to predict likelihood of customer
purchasing some new product

Textual Data Mining
Example Applications of ML (2/2)









Astronomy
Bioinformatics
Chemistry
Human resources: evaluating job performance
Insurance & Finance
Manufacturing: process control
Signal and image processing
Speech recognition
…
Neural Nets for Handwritten Digit
Recognition
[Figure: handwritten digits are pre-processed and fed to the input units of a network with hidden units and output units for the digits 0 to 9; the figure shows the training and test phases.]
ALVINN System: Neural Network Learning to Steer
an Autonomous Vehicle
Learning to Navigate a Vehicle by
Observing a Human Expert (1/2)

Inputs
The images produced by a camera mounted on
the vehicle

Outputs
The actions taken by the human driver to steer
the vehicle or adjust its speed.

Result of learning
A function mapping images to control actions
Learning to Navigate a Vehicle by
Observing a Human Expert (2/2)
Data Recorrection by a Hopfield
Network
corrupted input data
original target data
Recorrected data after 10 iterations
Recorrected data after 20 iterations
Fully recorrected data after 35 iterations
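
The idea behind the figure can be sketched with a tiny Hopfield network: a target pattern is stored with the Hebbian rule, and a corrupted copy is pulled back to it by repeated unit updates. The 8-unit pattern, the choice of flipped units, and the function names are hypothetical; the network in the figure is much larger and needs more iterations.

```python
# Hopfield network sketch with +1/-1 units and asynchronous updates.

def train(patterns):
    """Hebbian weights: w_ij = sum over patterns of p_i * p_j, no self-links."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights


def recall(weights, state, sweeps=5):
    """Update units one at a time until no unit changes."""
    state = list(state)
    n = len(state)
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            activation = sum(weights[i][j] * state[j] for j in range(n))
            new_value = 1 if activation >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


target = [1, -1, 1, -1, 1, 1, -1, -1]   # hypothetical stored pattern
weights = train([target])

corrupted = list(target)
corrupted[0] = -corrupted[0]            # flip two units to corrupt the input
corrupted[5] = -corrupted[5]

restored = recall(weights, corrupted)
print(restored == target)
```

With a single stored pattern and only two of eight units flipped, one sweep is enough to restore the target.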
Predicting the Sunspot Number with
Neural Networks
ANN for Face Recognition
A 960 x 3 x 4 network is trained on gray-level images of faces to predict
whether a person is looking to their left, right, ahead, or up.
Data Mining
Database/data warehouse
  Selection & Sampling -> Target data
  Preprocessing & Cleaning -> Cleaned data
  Transformation & Reduction -> Transformed data
  Data Mining -> Patterns/model
  Interpretation/Evaluation -> Knowledge
Performance system
Customer Relationship Management
(CRM)






Increased Customer Lifetime Value
Increased Wallet Share
Improved Customer Retention
Segmentation of Customers by Profitability
Segmentation of Customers by Risk of Default
Integrating Data Mining into the Full Marketing Process
Hot Water Flashing Nozzle with
Evolutionary Algorithms
Hans-Paul Schwefel performed the original experiments
Start
Hot water entering
Steam and droplet at exit
At throat: Mach 1 and onset of flashing
Case-Based Reasoning
(Aamodt & Plaza, 1994)
Input: New Problem
1. Retrieve: Retrieved Cases (from the Case Base and General Knowledge)
2. Reuse: Retrieved Solution
3. Revise: Retrieved Solution
4. Retain: Learned Case
Output
Machine Learning Applications in
Bioinformatics
Bioinformatics

What is Bioinformatics?
Bioinformatics is a new term referring to the discipline that employs computers to store, retrieve, analyze and assist in understanding biological information.

The application of information technology and computer science to the study of biological systems.
The analysis of the massive (and constantly increasing) amount of genetic information.
Sophisticated computer technologies to enable discovery in all fields of life sciences.


Problems in Bioinformatics
Sequence analysis
 Sequence alignment
 Structure and function prediction
 Gene finding
Structure analysis
 Protein structure comparison
 Protein structure prediction
 RNA structure modeling
Expression analysis
  Gene expression analysis
  Gene clustering
Pathway analysis
 Metabolic pathway
 Regulatory networks
Applications of Bioinformatics






Drug design
Identification of genetic risk factors
Gene therapy
Genetic modification of food crops and animals
Forensics
Biological warfare

Personalized Medicine
 E-Doctor
Machine Learning and
Bioinformatics
[Figure: machine learning extracts knowledge from the Bio DB; the knowledge feeds drug development, medical therapy research, pharmacology, and ecology.]
Machine Learning Techniques for Bio
Data Mining

Sequence Alignment
 Simulated Annealing
 Genetic Algorithms

Structure and Function Prediction
 Hidden Markov Models
 Multilayer Perceptrons
 Decision Trees

Molecular Clustering and Classification
 Support Vector Machines
 Nearest Neighbor Algorithms

Expression (DNA Chip Data) Analysis
 Self-Organizing Maps
 Bayesian Networks
Structure and Function Prediction
Protein structure prediction
Protein modeling
Gene finding and gene prediction
Effect and Applications of Biological
Data Mining
Biological Data Mining: store, retrieve, analyze and assist in understanding biological information
Biocomputing
Increase and Improvement of Farm Products
Renewable Energy
Diagnosis with Chip
SNP (Single Nucleotide Polymorphism)
Customized Drug
Hidden Markov Models
for Protein Modeling

Alphabet of 20 letters (the 20 amino acids)
  m_0: start state, m_5: end state, m_k: match states
  i_k: insertion states, d_k: deletion states
  T(s_2|s_1): transition probabilities
  P(x|m_k): letter-generating (emission) probabilities, where x is a letter (an amino acid)
A Simple Example of Hidden Markov
Models
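[Figure: a simple HMM with start state S and end state E, annotated with transition probabilities (0.25, 0.5) and emission probabilities (0.25 each for one state; 0.1, 0.1, 0.1, 0.7 for another), generating the sequence ATCCTTTTTTTCA.]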
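
A minimal sketch of how such a model scores a sequence is the forward algorithm, which sums the probability of the sequence over all state paths. The exact wiring of the figure is not recoverable from the transcript, so the two-state model below is an assumption that only loosely echoes the numbers shown: one state emits all four bases uniformly, the other favours T with probability 0.7. All names and transition values are hypothetical.

```python
# Forward algorithm sketch for a two-state DNA HMM with an explicit end state.

STATES = ["background", "t_rich"]

# Probability of leaving the start state S into each state (assumed).
START = {"background": 0.5, "t_rich": 0.5}

# Transition probabilities; each row sums to 1 (assumed).
TRANS = {
    "background": {"background": 0.5, "t_rich": 0.25, "end": 0.25},
    "t_rich": {"background": 0.25, "t_rich": 0.5, "end": 0.25},
}

# Emission probabilities per state.
EMIT = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "t_rich": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
}


def forward(sequence):
    """Return P(sequence), summed over all state paths that end in E."""
    alpha = {s: START[s] * EMIT[s][sequence[0]] for s in STATES}
    for symbol in sequence[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha[s] * TRANS[s]["end"] for s in STATES)


print(forward("ATCCTTTTTTTCA"))
```

For long sequences the probabilities become very small, so practical implementations work with logarithms or rescale the forward variables at each step.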
Clustering of Related Gene
Expressions
Non-negative Matrix Factorization
Clustering Gene Expression Data
[Figure: the gene expression matrix G (7,129 genes x 38 samples) is factorized as G ≈ W x H, where W is 7,129 genes x 2 factors and H, the encoding, is 2 factors x 38 samples.]
Factors can capture the correlations between the genes using the values of expression level.
Cluster training samples into 2 groups by NMF
  Assign each sample to the factor (class) which has the higher encoding value.
  Accuracy: 0 to 1 errors on the training data set
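
The clustering step can be sketched on a toy matrix using the multiplicative update rules of Lee and Seung; the slide does not say which NMF algorithm was used, so that choice is an assumption. The 4 x 4 matrix stands in for the 7,129 x 38 data, and the hand-picked starting values are only there to make the run reproducible (random initialization is more common). All names and numbers are hypothetical.

```python
# NMF sketch: factorize V (genes x samples) as W (genes x factors) times
# H (factors x samples), then label each sample by its largest encoding.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, w, h, iterations=200, eps=1e-9):
    """Multiplicative updates that keep W and H non-negative."""
    for _ in range(iterations):
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * numer[k][j] / (denom[k][j] + eps)
              for j in range(len(h[0]))] for k in range(len(h))]
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * numer[i][k] / (denom[i][k] + eps)
              for k in range(len(w[0]))] for i in range(len(w))]
    return w, h


# Toy expression matrix: 4 genes (rows) x 4 samples (columns).
V = [[5.0, 4.0, 0.3, 0.2],
     [4.0, 5.0, 0.1, 0.3],
     [0.2, 0.3, 3.0, 4.0],
     [0.4, 0.1, 6.0, 5.0]]

# Hand-picked positive starting values (an assumption of this sketch).
W0 = [[1.0, 0.2], [0.9, 0.3], [0.2, 1.0], [0.3, 0.8]]
H0 = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]

W, H = nmf(V, W0, H0)

# Assign each sample to the factor with the higher encoding value.
labels = [max(range(len(H)), key=lambda k: H[k][j]) for j in range(len(V[0]))]
print(labels)
```

Because W and H can trade scale between them, comparing encoding values across factors is only meaningful when the factors end up on similar scales, as they do in this toy case.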
Bayesian Networks
for Gene Expression Analysis

Learning
  Data -> Preprocessing -> Processed data -> Learning algorithm -> Bayesian network over Gene A, Gene B, Gene C, Gene D, and Target

Inference
  1. The values of Gene C and Gene B are given.
  2. Belief propagation
  3. Probability for the target is computed.
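
The inference step can be sketched on a small network over the same five variables. The edges and probability tables below are hypothetical, since the structure in the figure is not recoverable from the transcript, and exact inference by enumeration stands in for belief propagation because the network is tiny. Each variable is binary (expressed or not).

```python
# Bayesian network sketch. Assumed structure:
#   Gene A -> Gene B, Gene A -> Gene C, Gene C -> Gene D,
#   Gene B -> Target, Gene D -> Target
from itertools import product

P_A = {True: 0.3, False: 0.7}            # P(A)
P_B_GIVEN_A = {True: 0.8, False: 0.1}    # P(B = True | A)
P_C_GIVEN_A = {True: 0.6, False: 0.2}    # P(C = True | A)
P_D_GIVEN_C = {True: 0.7, False: 0.3}    # P(D = True | C)
P_T_GIVEN_BD = {                         # P(Target = True | B, D)
    (True, True): 0.9, (True, False): 0.6,
    (False, True): 0.5, (False, False): 0.05,
}


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d, t):
    """Joint probability as the product of the local tables."""
    return (P_A[a]
            * bernoulli(P_B_GIVEN_A[a], b)
            * bernoulli(P_C_GIVEN_A[a], c)
            * bernoulli(P_D_GIVEN_C[c], d)
            * bernoulli(P_T_GIVEN_BD[(b, d)], t))


def posterior_target(b, c):
    """P(Target | Gene B = b, Gene C = c), summing out Gene A and Gene D."""
    scores = {True: 0.0, False: 0.0}
    for a, d, t in product([True, False], repeat=3):
        scores[t] += joint(a, b, c, d, t)
    total = scores[True] + scores[False]
    return {t: s / total for t, s in scores.items()}


posterior = posterior_target(b=True, c=True)
print(posterior[True])
```

Enumeration grows exponentially with the number of unobserved variables, which is why message-passing schemes such as belief propagation are used on larger networks.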
Multilayer Perceptrons for Gene
Finding and Prediction
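[Figure: a multilayer perceptron takes sequence features (coding potential value, GC composition, length, donor, acceptor, intron vocabulary) and outputs a discrete exon score between 0 and 1 for the bases along the sequence.]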
Self-Organizing Maps for DNA
Microarray Data Analysis
[Figure: the input is connected by a bundle of synaptic connections to a two-dimensional array of postsynaptic neurons, in which the winning neurons are highlighted.]
Biological Information Extraction
[Figure: information extraction pipeline. Labels: Text Data; Data Analysis & Field Identification; Data Classification & Field Extraction; Field Property Identification & Learning; Database Template (fields such as Location, Date); DB Record Filling; Information Extraction DB.]
Biomolecular Computing
011001101010001
ATGCTCGAAGCT
More information on biological data mining and related research can be found at
http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/