Machine learning projects
Andrea Passerini
[email protected]
Machine Learning

How projects work

Steps
1. Understanding the problem to be addressed
2. Preprocessing the available data to generate features for the learning algorithm(s)
3. Training the learning algorithm(s), including tuning model parameters on a validation set (e.g. number of nearest neighbours, kernel parameters)
4. Evaluating the performance of the algorithm(s) on a separate test set (supervised case)

Groups
- Projects are conceived for groups of two students.
- It is always possible to assign a project to a single student.
- Students can also make their own proposals concerning task and/or algorithm.

Report
A final report will contain the details of your work. The report should contain (at least) sections on:
1. Description of the problem addressed
2. Description of all the processing phases conducted to prepare data for the learning algorithm(s)
3. Description of the learning algorithm(s) employed
4. Description of the experimental setting:
   - training, testing and validation (when applicable) phases
   - alternatives tried (e.g. BN structure learning techniques, hyperparameters of SVM, etc.)
5. Experimental results: proper performance measures (e.g. accuracy, cluster quality, precision-recall curves, etc.)
6. Conclusions highlighting the main findings of the project

Software
1. Any programming language can be used to create appropriate programs for processing data, managing training and testing phases, evaluating results, etc.
2. Simple learning algorithms (such as hierarchical clustering) will be implemented during the project. This requirement will be stated explicitly in the project description.
3. For more complex learning algorithms (such as support vector machines or Bayesian networks), you will use available implementations.
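As a rough illustration of steps 3 and 4 under "Steps" (tuning a model parameter on a validation set, then evaluating once on a separate test set), here is a minimal sketch using a k-nearest-neighbour classifier. The toy data, the 60/20/20 split, the candidate values of k, and all function names are assumptions made for this example, not part of any project specification.

```python
import random


def split_60_20_20(examples, seed=0):
    """Shuffle and split into train/validation/test (60/20/20 is an assumed ratio)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(0.6 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
    )[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def accuracy(train, data, k):
    """Fraction of examples in `data` classified correctly using `train`."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


# Toy data (made up): two Gaussian blobs in 2D, one per class.
rng = random.Random(1)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(100)]
data += [((rng.gauss(3, 1), rng.gauss(3, 1)), 1) for _ in range(100)]

train, val, test = split_60_20_20(data)

# Step 3: choose k on the validation set only.
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, val, k))

# Step 4: report performance once, on the untouched test set.
test_acc = accuracy(train, test, best_k)
print("chosen k:", best_k, "test accuracy:", test_acc)
```

The point of the design is that the test set plays no role in choosing k, so the reported accuracy is not biased by the tuning.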
Note
It is advisable to contact the professor after an initial draft of the report/experiments is ready in order to:
1. clarify whether the direction taken is correct
2. verify whether further work is needed

(1) Clustering genes

Problem
- Genes are responsible for protein synthesis in living cells.
- Genes interact with each other, forming complex gene networks.
- The function of most genes is still unknown.
- Grouping together genes with similar behaviour can help in understanding their function.

Data
- The expression level (i.e. amount of proteins synthesized in a certain time) of genes can be measured by DNA microarrays.
- Expression levels of genes from pathological cells are often measured relative to healthy cells (the control).
- The leukemia dataset consists of expression levels for 5147 genes in 72 patients: 47 affected by acute lymphoblastic leukemia (ALL), 25 by acute myeloid leukemia (AML).

Task
- Expression levels depend strongly on the gene considered ⇒ they need to be normalized.
- Use information gain (see decision tree lesson) to choose the best threshold for binarizing each gene (classes are ALL vs AML).
- Select the 100 genes with the highest information gain.
- Cluster the 100 (binarized) genes by agglomerative hierarchical clustering.

(2) Classifying pathology by gene expression profiles

Problem
- Genes are responsible for protein synthesis in living cells.
- Genes interact with each other, forming complex gene networks.
- The behaviour of genes in a certain cell can indicate the presence of a certain pathology.
- Detecting pathologies by analyzing gene behaviour can help diagnosis and treatment.

Data
- The expression level (i.e. amount of proteins synthesized in a certain time) of genes can be measured by DNA microarrays.
- Expression levels of genes from pathological cells are often measured relative to healthy cells (the control).
- The leukemia dataset consists of expression levels for 5147 genes in 72 patients: 47 affected by acute lymphoblastic leukemia (ALL), 25 by acute myeloid leukemia (AML).

Task
- Expression levels depend strongly on the gene considered ⇒ they need to be normalized.
- Use information gain (see decision tree lesson) to choose the best threshold for binarizing each gene (classes are ALL vs AML). This can be applied iteratively to discretize into more than two values.
- Most of the genes will probably be uncorrelated with the pathologies ⇒ uninformative features.
- Perform feature selection on genes according to their information gain (setting a threshold on it, or choosing a fixed number of genes sorted by infogain).

Task (cont.)
- Once the data are discretized, train an SVM classifier for discriminating between ALL and AML patients.
- Examples are patients, inputs are discretized gene expression levels.
- Randomly split the available dataset into train/validation/test sets (e.g. 60%/20%/20%).
- Use the training set for learning, the validation set for parameter tuning (e.g. kernel parameters, C regularization), and the test set for final evaluation.

(3) Building a Bayesian model of leukemia pathologies

Problem
- Genes are responsible for protein synthesis in living cells.
- Genes interact with each other, forming complex gene networks.
- The behaviour of genes in a certain cell can indicate the presence of a certain pathology.
- Discovering interactions among genes which correlate with a pathology can help elucidate its characteristics.

Data
- The expression level (i.e. amount of proteins synthesized in a certain time) of genes can be measured by DNA microarrays.
- Expression levels of genes from pathological cells are often measured relative to healthy cells (the control).
- The leukemia dataset consists of expression levels for 5147 genes in 72 patients: 47 affected by acute lymphoblastic leukemia (ALL), 25 by acute myeloid leukemia (AML).

Task
- Expression levels depend strongly on the gene considered ⇒ they need to be normalized.
- Use information gain (see decision tree lesson) to choose the best threshold for binarizing each gene (classes are ALL vs AML). This can be applied iteratively to discretize into more than two values.
- Most of the genes will probably be uncorrelated with the pathologies ⇒ uninformative features.
- Perform feature selection on genes according to their information gain, choosing a small subset of the best genes when sorted by infogain (max. 50 genes).

Task (cont.)
- Once the data are discretized, build a Bayesian network modeling the data.
- Split the dataset into a training and a test set (e.g. 80/20), keeping the same proportion of ALL and AML in both sets.
- Learn the structure and parameters of the Bayesian network on the training set.
- Evaluate the performance of the learned network on the test set.
- Compare different networks:
  - hugin-lite structure learning (statistical-test based)
  - b-course structure learning (score based)
  - naive Bayes classifier (simple fixed structure)

(4) Protein subcellular localization

Problem
- Proteins are sequences of small molecules (amino-acids) which arrange in a certain three-dimensional structure (fold).
- Determining the 3D structure of a protein is much harder than determining its 1D structure (the sequence of amino-acids, represented as a string of letters, one for each amino-acid).
- It would be highly desirable to predict characteristics of proteins from their 1D structure alone.
- Once synthesized, proteins are targeted to the location in the cell (or secreted outside) where they will perform their function.

Data
- The dataset is made of protein sequences from animals (different Kingdoms of Life have different subcellular localizations).
- Proteins are classified into one of four possible localizations:
  - Cytoplasm
  - Mitochondrion
  - Nucleus
  - Secretory

Task
- Multiclass classification task: assign a protein to one of the four possible localizations.
- SVM classifier in one-vs-all approach:
  - Train a classifier for each class, discriminating it from all the other classes.
  - Test a protein with all classifiers, and assign it the class with the highest confidence.

Task (cont.)
- String kernel (trivial implementation):
  1. Choose one or more possible values of k (e.g. k=3).
  2. Collect all possible k-mers in the whole dataset and sort them.
  3. Assign a numerical id to each k-mer.
  4. Represent each protein with the sparse vector of its k-mer ids, where the value for each id is the number of occurrences of that k-mer in the protein.
- Randomly split the available dataset into train/validation/test sets (e.g. 60%/20%/20%).
- Use the training set for learning, the validation set for parameter tuning (e.g. kernel parameters, C regularization), and the test set for final evaluation.

(5) Disulphide bonding state

Problem
- A particular amino-acid called cysteine (C or CYS) can bind to another cysteine in the same protein, forming a disulphide bridge.
- Disulphide bridges help stabilize the 3D structure of a protein, so predicting their presence is a valuable insight for predicting 3D structure.
- Not all cysteines perform this function: other cysteines are free or bind different molecules (e.g. metal ions).
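Returning briefly to project (4): the four steps of the trivial string kernel (choose k, collect and sort all k-mers, assign ids, count occurrences per protein) can be sketched as follows. The two sequences are made up for illustration, and the function names, including the final dot-product helper showing how two sparse vectors could be compared, are assumptions of this sketch and not part of the project description.

```python
from collections import Counter


def kmers(sequence, k):
    """All overlapping substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocabulary(sequences, k):
    """Steps 2 and 3: collect every k-mer in the dataset, sort, assign ids."""
    all_kmers = sorted({km for seq in sequences for km in kmers(seq, k)})
    return {km: idx for idx, km in enumerate(all_kmers)}


def sparse_vector(sequence, vocab, k):
    """Step 4: map k-mer id -> number of occurrences in this protein."""
    counts = Counter(kmers(sequence, k))
    # k-mers missing from the vocabulary are skipped (an assumed policy).
    return {vocab[km]: c for km, c in counts.items() if km in vocab}


def dot_product(u, v):
    """Dot product of two sparse vectors (hypothetical helper)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(value * v.get(idx, 0) for idx, value in u.items())


# Made-up toy sequences, not taken from the project dataset.
proteins = ["MKVLAAGMKV", "MKVAAGG"]
k = 3  # step 1: the example value suggested in the task

vocab = build_vocabulary(proteins, k)
vectors = [sparse_vector(p, vocab, k) for p in proteins]
similarity = dot_product(vectors[0], vectors[1])
print(len(vocab), vectors[0], similarity)
```

Storing only non-zero counts keeps the representation small, since with 20 amino-acids there are 20^k possible k-mers and each protein contains only a few of them.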
Data
- Dataset of 1D representations of proteins, together with labels for their cysteines (disulfide-bonded, non-disulfide-bonded).
- Proteins change through evolution: a single protein has a number of evolutionarily related proteins in other organisms.
- Parts of proteins (e.g. disulphide-bonded cysteines) which are crucial for their function should be better conserved through evolution.
- Evolutionary information can be added to a protein representation through multiple alignment profiles:
  1. The protein sequence is aligned (with gaps) with similar proteins chosen from a large database of protein sequences.
  2. The alignment is used to compute a profile at each position in the sequence: the amino-acid is replaced by a profile of the frequency of occurrence of each possible amino-acid in that position in the aligned sequences.

Task
- Binary classification at the cysteine level (disulfide-bonded, non-disulfide-bonded).
- Generate multiple alignment profiles of proteins by running PsiBlast.
- Represent each position in a protein with the generated profile (20 values, one for each possible amino-acid).
- Represent each cysteine with its profile, plus the profiles of neighbouring amino-acids up to a certain width (e.g. 5 positions to the left, 5 positions to the right).

Task (cont.)
- Train an SVM classifier.
- Randomly split the available dataset into train/validation/test sets (e.g. 60%/20%/20%) at the protein level (all cysteines in a certain protein go to the same set).
- Use the training set for learning, the validation set for parameter tuning (e.g. kernel parameters, C regularization), and the test set for final evaluation.

(6) Comparing learning algorithms

Data
- Dataset from the “UCI Machine Learning Repository”, a collection of datasets commonly used to evaluate learning algorithms.
- Optdigits dataset: automatic recognition of handwritten digits.
- Each handwritten digit is represented as an 8x8 bitmap, each pixel by an integer in 0..16.

Task
- Multiclass classification task (classify each digit as an integer in 0..9).
- Evaluate whether two classifiers are significantly different in the task.
- Use 10-fold cross validation with a t-test for significance of the difference (instead of the single train-test split described in the dataset).

Task (cont.)
- Learning algorithms are:
  - k-Nearest neighbour
  - SVM with one-vs-all strategy for multiclass classification:
    - Train a classifier for each class, discriminating it from all the other classes.
    - Test an example with all classifiers, and assign it the class with the highest confidence.
- Model selection (e.g. kernel parameters, C regularization, number of neighbours): for each fold of the cross validation, split its training set into train and validation sets, use the validation set for parameter tuning, and test the resulting classifier on the test set.

(7) Generative vs discriminative learning

Problem
- Evaluate two simple learning algorithms, one generative and one discriminative.
- Linear classifiers, on a binary classification task:
  - Naive Bayes
  - Perceptron

Data
- The “UCI Machine Learning Repository” is a collection of datasets commonly used to evaluate learning algorithms: http://archive.ics.uci.edu/ml/datasets.html
- Choose two datasets with categorical attributes and a binary classification task.

Task
- Implement Naive Bayes and Perceptron.
- Run 10-fold cross validation on the two datasets.
- Report the cross-validated accuracies of the two classifiers for each dataset.
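To make the shape of project (7) concrete, here is a minimal sketch of a categorical Naive Bayes, a Perceptron and a 10-fold cross-validation loop. Several choices are assumptions of this sketch and not stated in the slides: Laplace smoothing with alpha=1, a one-hot encoding of categorical attributes for the Perceptron, labels coded as +1/-1, 10 training epochs, and a made-up toy dataset standing in for a UCI dataset.

```python
import math
import random
from collections import defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """Count classes and per-class attribute values (alpha: assumed smoothing)."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)  # (class, attribute index, value) -> count
    values = defaultdict(set)        # attribute index -> values seen in training
    for x, y in examples:
        class_counts[y] += 1
        for j, v in enumerate(x):
            value_counts[(y, j, v)] += 1
            values[j].add(v)
    return class_counts, value_counts, values, alpha


def predict_naive_bayes(model, x):
    """Return the class with the highest smoothed log-posterior."""
    class_counts, value_counts, values, alpha = model
    total = sum(class_counts.values())

    def log_posterior(y):
        score = math.log(class_counts[y] / total)
        for j, v in enumerate(x):
            num = value_counts.get((y, j, v), 0) + alpha
            den = class_counts[y] + alpha * len(values[j])
            score += math.log(num / den)
        return score

    return max(class_counts, key=log_posterior)


def one_hot(x):
    """Active features: one per (attribute index, value) pair, plus a bias."""
    return [(j, v) for j, v in enumerate(x)] + ["bias"]


def train_perceptron(examples, epochs=10):
    """Classic mistake-driven updates; labels must be +1 or -1."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in examples:
            feats = one_hot(x)
            if y * sum(w[f] for f in feats) <= 0:
                for f in feats:
                    w[f] += y
    return w


def predict_perceptron(w, x):
    return 1 if sum(w.get(f, 0.0) for f in one_hot(x)) > 0 else -1


def cross_validated_accuracy(examples, train_fn, predict_fn, n_folds=10, seed=0):
    """Average test accuracy over n_folds disjoint folds."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i, test_fold in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        correct = sum(predict_fn(model, x) == y for x, y in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / n_folds


# Made-up toy dataset: the label depends only on the first attribute.
rng = random.Random(0)
toy = []
for _ in range(200):
    colour = rng.choice(["red", "green", "blue"])
    shape = rng.choice(["round", "square"])
    toy.append(((colour, shape), 1 if colour == "red" else -1))

nb_acc = cross_validated_accuracy(toy, train_naive_bayes, predict_naive_bayes)
perc_acc = cross_validated_accuracy(toy, train_perceptron, predict_perceptron)
print("Naive Bayes:", nb_acc, "Perceptron:", perc_acc)
```

Both classifiers go through the same cross-validation function with the same seed, so they are evaluated on identical folds, which keeps the reported accuracies comparable.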