Machine learning projects
Andrea Passerini
[email protected]
Machine Learning
How projects work
Steps
1. understanding the problem to be addressed
2. preprocessing the available data to generate features for the learning algorithm(s)
3. training the learning algorithm(s), including tuning model parameters on a validation set (e.g. number of nearest neighbours, kernel parameters)
4. evaluating the performance of the algorithm(s) on a separate test set (supervised case)
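As an illustration of how these steps fit together in code, the following is a minimal sketch in Python with scikit-learn; the data X, y and the choice of a k-nearest-neighbour model are placeholders, not requirements of any specific project.

    # Minimal sketch of the generic project pipeline (X, y and the k-NN model
    # are illustrative placeholders, not prescribed by the project descriptions).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    X, y = np.random.rand(200, 10), np.random.randint(0, 2, 200)  # stand-in data

    # 60/20/20 split into training, validation and test sets
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

    # tune a model parameter (here: number of nearest neighbours) on the validation set
    best_k, best_acc = None, -1.0
    for k in (1, 3, 5, 7, 9):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc > best_acc:
            best_k, best_acc = k, acc

    # final evaluation on the separate test set
    final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))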
Groups
Projects are conceived for groups of two students
It is always possible to assign a project to a single student
Students can also make their own proposals concerning
task and/or algorithm
How projects work
Report
A final report will contain the details of your work.
The report should contain (at least) sections on:
1. Description of the problem addressed
2. Description of all the processing phases conducted to prepare data for the learning algorithm(s)
3. Description of the learning algorithm(s) employed
4. Description of the experimental setting:
   training, testing and validation (when applicable) phases
   alternatives tried (e.g. BN structure learning techniques, hyperparameters of SVM, etc)
5. Experimental results:
   proper performance measures (e.g. accuracy, cluster quality, precision recall curves, etc)
6. Conclusions highlighting main findings of the project
How projects work
Software
1. Any programming language can be used to create appropriate programs for processing data, managing training and testing phases, evaluating results etc.
2. Simple learning algorithms (such as hierarchical clustering) will be implemented during the project. Such a requirement will be explicitly written in the project description.
3. For more complex learning algorithms (such as support vector machines or Bayesian Networks), you will use available implementations.
Note
It is advisable to contact the professor after an initial draft
of the report/experiments is ready in order to:
1. clarify if the direction taken is correct
2. verify if further work is needed
(1) Clustering genes
Problem
Genes are responsible for protein synthesis in living cells.
Genes interact with each other forming complex
gene-networks.
The function of most genes is still unknown.
Grouping together genes with similar behaviour can help in understanding their function.
(1) Clustering genes
Data
The expression level (i.e. amount of proteins synthesized
in a certain time) of genes can be measured by DNA
microarrays.
Expression level of genes from pathological cells are often
measured relative to healthy cells (the control).
The leukemia dataset consists of expression levels for
5147 genes in 72 patients: 47 affected by acute
lymphoblastic leukemia (ALL), 25 by acute myeloid
leukemia (AML).
(1) Clustering genes
Task
Expression levels depend strongly on the gene considered
⇒ need to be normalized
Use information gain (see decision tree lesson) to choose
the best threshold for binarizing each gene (classes are
ALL vs AML).
Select the 100 genes with highest information gain
Cluster the 100 (binarized) genes by agglomerative
hierarchical clustering.
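A possible starting point for this pipeline is sketched below; it assumes the expression data is already loaded as a genes-by-patients NumPy array expr with the ALL/AML patient labels in labels (both placeholder names), and uses SciPy for the agglomerative clustering.

    # Sketch: normalize genes, binarize each at its information-gain-optimal threshold
    # (ALL vs AML), keep the 100 most informative genes, cluster them hierarchically.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_threshold(values, y):
        """Threshold and information gain of the best binary split of one gene."""
        base = entropy(y)
        best_t, best_gain = None, -1.0
        for t in np.unique(values)[:-1]:
            left, right = y[values <= t], y[values > t]
            cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if base - cond > best_gain:
                best_t, best_gain = t, base - cond
        return best_t, best_gain

    # normalize each gene, then compute its best threshold and information gain
    expr = (expr - expr.mean(axis=1, keepdims=True)) / (expr.std(axis=1, keepdims=True) + 1e-8)
    thr_gain = [best_threshold(g, labels) for g in expr]
    gains = np.array([g for _, g in thr_gain])
    top = np.argsort(gains)[::-1][:100]                         # 100 most informative genes
    binarized = np.array([(expr[i] > thr_gain[i][0]).astype(int) for i in top])

    # agglomerative hierarchical clustering of the 100 binarized genes
    Z = linkage(binarized, method="average", metric="hamming")
    clusters = fcluster(Z, t=5, criterion="maxclust")           # e.g. 5 clusters (arbitrary choice)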
(2) Classifying pathology by gene expression profiles
Problem
Genes are responsible for protein synthesis in living cells.
Genes interact with each other forming complex
gene-networks.
The behaviour of genes in a certain cell can indicate the
presence of a certain pathology.
Detecting pathologies by analyzing gene behaviour can help diagnosis and treatment
(2) Classifying pathology by gene expression profiles
Data
The expression level (i.e. amount of proteins synthesized
in a certain time) of genes can be measured by DNA
microarrays.
Expression level of genes from pathological cells are often
measured relative to healthy cells (the control).
The leukemia dataset consists of expression levels for
5147 genes in 72 patients: 47 affected by acute
lymphoblastic leukemia (ALL), 25 by acute myeloid
leukemia (AML).
(2) Classifying pathology by gene expression profiles
Task
Expression levels depend strongly on the gene considered
⇒ need to be normalized
Use information gain (see decision tree lesson) to choose
the best threshold for binarizing each gene (classes are
ALL vs AML). Can be iteratively applied to discretize in
more than two values.
Most of the genes will probably be uncorrelated with the
pathologies ⇒ uninformative features.
Perform feature selection on genes according to their
information gain (setting a threshold on it, or choosing a
fixed number of genes sorted by infogain)
(2) Classifying pathology by gene expression profiles
Task (cont.)
Once discretized, train an SVM classifier for discriminating
between ALL and AML patients. Examples are patients,
inputs are discretized gene expression levels.
Randomly split the available dataset into
train/validation/test set (e.g. 60% 20% 20%)
Use training set for learning, validation set for parameter
tuning (i.e. kernel parameters, C regularization), test set for
final evaluation
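A minimal sketch of this train/validate/test procedure with scikit-learn follows; X and y stand for the selected, discretized gene features (patients as rows) and the ALL/AML labels, and the parameter grids are illustrative only.

    # Sketch: SVM on discretized expression profiles with a stratified 60/20/20 split.
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

    # tune the regularization parameter C and the RBF kernel width on the validation set
    best_params, best_acc = None, -1.0
    for C in (0.1, 1, 10, 100):
        for gamma in ("scale", 0.1, 0.01):
            clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
            acc = accuracy_score(y_val, clf.predict(X_val))
            if acc > best_acc:
                best_params, best_acc = (C, gamma), acc

    # retrain with the selected parameters and evaluate once on the test set
    final = SVC(kernel="rbf", C=best_params[0], gamma=best_params[1]).fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))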
(3) Building a Bayesian model of leukemia pathologies
Problem
Genes are responsible for protein synthesis in living cells.
Genes interact with each other forming complex
gene-networks.
The behaviour of genes in a certain cell can indicate the
presence of a certain pathology.
Discovering interactions among genes which correlate with a pathology can help elucidate its characteristics
(3) Building a Bayesian model of leukemia pathologies
Data
The expression level (i.e. amount of proteins synthesized
in a certain time) of genes can be measured by DNA
microarrays.
Expression level of genes from pathological cells are often
measured relative to healthy cells (the control).
The leukemia dataset consists of expression levels for
5147 genes in 72 patients: 47 affected by acute
lymphoblastic leukemia (ALL), 25 by acute myeloid
leukemia (AML).
(3) Building a Bayesian model of leukemia pathologies
Task
Expression levels depend strongly on the gene considered
⇒ need to be normalized
Use information gain (see decision tree lesson) to choose
the best threshold for binarizing each gene (classes are
ALL vs AML). Can be iteratively applied to discretize in
more than two values.
Most of the genes will probably be uncorrelated with the
pathologies ⇒ uninformative features.
Perform feature selection on genes according to their
information gain, choosing a small subset of the best
genes when sorted by infogain (max. 50 genes).
(3) Building a Bayesian model of leukemia pathologies
Task (cont.)
Once discretized, build a Bayesian network modeling the
data.
Split the dataset into a training and a test set (e.g. 80/20),
keeping the same proportion of ALL and AML in both sets.
Learn structure and parameters of the Bayesian network
on the training set
Evaluate performance of the learned network on the test
set.
Compare different networks:
hugin-lite structure learning (statistical-test based)
b-course structure learning (score based)
naive bayes classifier (simple fixed structure)
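The hugin-lite and b-course structure learning runs are done in those tools; for the fixed-structure naive Bayes comparison point, a baseline can be obtained directly in Python. The sketch below assumes the discretized gene matrix X (patients as rows, at most 50 selected genes) and the ALL/AML labels y, both placeholder names.

    # Sketch: Naive Bayes baseline on the discretized genes with a stratified 80/20 split.
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import CategoricalNB
    from sklearn.metrics import accuracy_score

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)   # same ALL/AML proportion in both sets

    nb = CategoricalNB(alpha=1.0)                           # Laplace smoothing
    nb.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, nb.predict(X_test)))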
(4) Protein subcellular localization
Problem
Proteins are sequences of small molecules (amino-acids)
which arrange in a certain three dimensional structure
(fold)
Knowing the 3D structure of a protein is much harder than knowing its 1D structure (the sequence of amino-acids, represented as a string of letters, one for each amino-acid)
It would be highly desirable to predict characteristics of
proteins from their 1D structure alone.
Once synthesized, proteins are targeted to the location in
the cell (or secreted outside) where they will perform their
function.
(4) Protein subcellular localization
Data
The dataset is made of protein sequences from animals
(different Kingdoms of Life have different subcellular
localizations)
Proteins are classified in one of four possible localizations:
Cytoplasm
Mitochondrion
Nucleus
Secretory
(4) Protein subcellular localization
Task
Multiclass classification task: assign a protein to one of the
four possible localizations
SVM classifier with a one-vs-all approach:
Train a classifier for each class, discriminating it from all the
other classes
Test a protein with all classifiers, and assign it the class with
highest confidence
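scikit-learn can realize this scheme explicitly, as in the short sketch below; X_train, X_test hold the k-mer count vectors described on the next slide and y_train the four localization labels (all placeholder names).

    # Sketch: one-vs-rest linear SVM for the four localizations. OneVsRestClassifier
    # trains one binary classifier per class and predicts the most confident one.
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    ova = OneVsRestClassifier(LinearSVC(C=1.0))
    ova.fit(X_train, y_train)
    predictions = ova.predict(X_test)   # argmax over the per-class decision functions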
(4) Protein subcellular localization
Task (cont.)
string kernel (trivial implementation):
1. choose one or more possible values of k (e.g. k=3)
2. collect all possible k-mers in the whole dataset and sort them
3. assign a numerical id to each k-mer
4. represent each protein with the sparse vector of its k-mer ids, where the value for each id is the number of occurrences of that k-mer in the protein.
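A minimal sketch of this representation follows, assuming the protein sequences are available as a list of amino-acid strings named sequences (a placeholder).

    # Sketch of the k-mer count representation (k=3 as in the example above).
    from collections import Counter
    from scipy.sparse import lil_matrix

    k = 3
    def kmers(seq, k):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    # collect all k-mers in the whole dataset, sort them and assign numerical ids
    vocab = sorted({km for seq in sequences for km in kmers(seq, k)})
    kmer_id = {km: i for i, km in enumerate(vocab)}

    # sparse matrix: one row per protein, one column per k-mer, entries are counts
    X_kmers = lil_matrix((len(sequences), len(vocab)))
    for row, seq in enumerate(sequences):
        for km, count in Counter(kmers(seq, k)).items():
            X_kmers[row, kmer_id[km]] = count
    X_kmers = X_kmers.tocsr()   # convert to an efficient format for SVM training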
Randomly split the available dataset into
train/validation/test set (e.g. 60% 20% 20%)
Use training set for learning, validation set for parameter
tuning (i.e. kernel parameters, C regularization), test set for
final evaluation
(5) Disulphide bonding state
Problem
A particular amino-acid called cysteine (C or CYS) can
bind to another cysteine in the same protein forming a
disulphide bridge
Disulphide bridges help stabilize the 3D structure of a protein; predicting their presence provides valuable insight for predicting the 3D structure
Not all cysteines perform this function; other cysteines are free or bind different molecules (e.g. metal ions).
(5) Disulphide bonding state
Data
Dataset of 1D representations of proteins, together with labels for their cysteines (disulfide-bonded, non-disulfide-bonded)
Proteins change through evolution: a single protein has a
number of evolutionary related proteins in other organisms.
Parts of proteins (e.g. disulphide bonded cysteines) which are
crucial for their function should be better conserved through
evolution
Evolutionary information can be added to a protein
representation through multiple alignment profiles:
1. The protein sequence is aligned (with gaps) with similar proteins chosen from a large database of protein sequences
2. The alignment is used to compute a profile at each position in the sequence: the amino-acid is replaced by a profile of the frequency of occurrence of each possible amino-acid in that position in the aligned sequences
(5) Disulphide bonding state
Task
Binary classification at the cysteine level (disulfide-bonded,
non-disulfide-bonded)
Generate multiple alignment profiles of the proteins by running PsiBlast
Represent each position in a protein with the generated
profile (20 values, one for each possible amino-acid)
Represent each cysteine with its profile, plus the profiles of
neighbouring amino-acids up to a certain width (e.g. 5
positions to the left, 5 positions to the right)
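A sketch of this windowed representation is given below; it assumes the profile of each protein is already available as a (sequence length x 20) NumPy array, for instance parsed from the PsiBlast PSSM output, and all variable names are illustrative.

    # Sketch: represent each cysteine by its own profile column plus the profiles of the
    # 5 neighbouring positions on each side, zero-padding at the sequence ends.
    import numpy as np

    def cysteine_windows(sequence, profile, half_width=5):
        """sequence: amino-acid string; profile: (len(sequence) x 20) array of frequencies."""
        n_aa = profile.shape[1]
        padded = np.vstack([np.zeros((half_width, n_aa)),
                            profile,
                            np.zeros((half_width, n_aa))])
        examples = []
        for pos, aa in enumerate(sequence):
            if aa == "C":                                       # one example per cysteine
                window = padded[pos:pos + 2 * half_width + 1]   # indices shifted by the padding
                examples.append(window.flatten())               # (2*5+1)*20 = 220 features
        return np.array(examples)

Each returned row is then paired with the bonded/non-bonded label of the corresponding cysteine and fed to the SVM described on the next slide.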
(5) Disulphide bonding state
Task
Train an SVM classifier
Randomly split the available dataset into
train/validation/test set (e.g. 60% 20% 20%) at the protein
level (all cysteines in a certain protein go to the same set)
Use training set for learning, validation set for parameter
tuning (i.e. kernel parameters, C regularization), test set for
final evaluation
(6) Comparing learning algorithms
Data
Dataset from the “UCI Machine Learning Repository”, a
collection of datasets commonly used to evaluate learning
algorithms
Optdigits dataset:
automatic recognition of handwritten digits
each handwritten digit represented as an 8x8 bitmap, each
pixel by an integer in 0..16.
(6) Comparing learning algorithms
Task
Multiclass classification task (classify each digit as an
integer in 0..9)
Evaluate whether two classifiers perform significantly differently on the task.
10-fold cross validation with t-test for significance of
difference (instead of the single train-test split described in
the dataset)
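This comparison can be sketched with scikit-learn and SciPy as below; X and y stand for the flattened Optdigits bitmaps and digit labels (loading them from the UCI files is omitted), and the two classifiers are only examples.

    # Sketch: compare two classifiers with 10-fold cross-validation and a paired t-test.
    from scipy.stats import ttest_rel
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    acc_knn = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=cv)
    acc_svm = cross_val_score(SVC(kernel="rbf", C=10), X, y, cv=cv)

    # paired t-test over the 10 per-fold accuracies
    t_stat, p_value = ttest_rel(acc_knn, acc_svm)
    print(acc_knn.mean(), acc_svm.mean(), p_value)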
(6) Comparing learning algorithms
Task (cont.)
Learning algorithms are:
k-Nearest neighbour
SVM with one-vs-all strategy for multiclass classification:
Train a classifier for each class, discriminating it from all the
other classes
Test an example with all classifiers, and assign it the class
with highest confidence
model selection (e.g. kernel parameters, C regularization,
number of neighbours):
For each fold of the cross validation, split its training set
into train and validation set, use validation set for
parameter tuning, test the resulting classifier on the test
set.
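One way to realize this per-fold model selection is sketched below: within each fold of the outer 10-fold cross-validation, GridSearchCV splits that fold's training data once into train and validation parts, tunes the hyperparameters on the validation part, and the resulting classifier is scored on the fold's test set. The parameter grid and the use of an RBF SVM are illustrative assumptions.

    # Sketch: hyperparameter tuning on a single train/validation split inside each CV fold.
    from sklearn.model_selection import GridSearchCV, KFold, ShuffleSplit, cross_val_score
    from sklearn.svm import SVC

    inner = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)        # one train/validation split
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.001]}  # illustrative grid
    tuned_svm = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner)

    outer = KFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(tuned_svm, X, y, cv=outer)                     # X, y as in the previous sketch
    print(scores.mean(), scores.std())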
(7) Generative vs discriminative learning
Problem
Evaluate two simple learning algorithms, one generative
and one discriminative.
Linear classifiers:
Naive Bayes
Perceptron
on a binary classification task
(7) Generative vs discriminative learning
Data
The “UCI Machine Learning Repository” is a collection of
datasets commonly used to evaluate learning algorithms:
http://archive.ics.uci.edu/ml/datasets.html
Choose two datasets with Categorical attributes and
Binary classification task
(7) Generative vs discriminative learning
Task
Implement Naive Bayes and Perceptron
Run 10-fold cross validation on the two datasets
Report cross validated accuracies of the two classifiers for
each dataset
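As a starting point, minimal from-scratch versions of the two classifiers might look as follows; the categorical Naive Bayes uses Laplace smoothing, the perceptron expects numeric (e.g. one-hot encoded) attributes and labels in {0, 1}, and all of these design choices are assumptions rather than requirements.

    # Sketch: minimal categorical Naive Bayes and perceptron implementations.
    # X: (n_samples x n_features) array of categorical values encoded as integers,
    # y: binary labels in {0, 1}; both are assumed inputs.
    import numpy as np

    class NaiveBayes:
        def fit(self, X, y):
            self.classes = np.unique(y)
            self.priors = {c: np.mean(y == c) for c in self.classes}
            # P(value | class) per feature, with Laplace smoothing
            self.likelihoods = []
            for j in range(X.shape[1]):
                values = np.unique(X[:, j])
                table = {}
                for c in self.classes:
                    col = X[y == c, j]
                    table[c] = {v: (np.sum(col == v) + 1) / (len(col) + len(values))
                                for v in values}
                self.likelihoods.append(table)
            return self

        def predict(self, X):
            preds = []
            for x in X:
                scores = {}
                for c in self.classes:
                    logp = np.log(self.priors[c])
                    for j, v in enumerate(x):
                        logp += np.log(self.likelihoods[j][c].get(v, 1e-9))  # unseen-value fallback
                    scores[c] = logp
                preds.append(max(scores, key=scores.get))
            return np.array(preds)

    class Perceptron:
        def fit(self, X, y, epochs=20):
            Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias feature
            yb = 2 * np.asarray(y) - 1                  # map labels to {-1, +1}
            self.w = np.zeros(Xb.shape[1])
            for _ in range(epochs):                     # mistake-driven updates
                for xi, yi in zip(Xb, yb):
                    if yi * np.dot(self.w, xi) <= 0:
                        self.w += yi * xi
            return self

        def predict(self, X):
            Xb = np.hstack([X, np.ones((len(X), 1))])
            return (Xb @ self.w > 0).astype(int)

Both classes can then be evaluated with a 10-fold cross-validation loop (e.g. sklearn.model_selection.KFold) to obtain the cross-validated accuracies to report for each dataset.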