Introduction to Bioinformatics: Lecture VIII
Classification and Supervised Learning
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org

Outline of the lecture
Motivating story: correlating inputs and outputs
Learning with a teacher
Regression and classification problems
Model selection, feature selection and generalization
k-nearest neighbors and some other classification algorithms
Phenotype fingerprints and their applications in medicine

Web watch: an on-line biology textbook by JW Kimball
Dr. J. W. Kimball's Biology Pages: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/
Story #1: B-cells and DNA editing; Apolipoprotein B and RNA editing
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene
Story #2: ApoB, cholesterol uptake, LDL and its endocytosis
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl
Complex patterns of mutations in genes related to cholesterol transport and uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.

Correlations and fingerprints
Instead of an often difficult-to-decipher underlying molecular model, one may simply try to find correlations between inputs and outputs. If measurements of certain attributes correlate with molecular processes, underlying genomic structures, phenotypes, disease states, etc., one can use such attributes as indicators of these “hidden” states and make predictions for new cases. Consider, for example, elevated levels of low density lipoprotein (LDL) particles in the blood as an indicator (fingerprint) of atherosclerosis.
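The fingerprint idea above can be sketched numerically: check whether a measured attribute correlates with an outcome before using it as an indicator. A minimal sketch in plain Python, with entirely hypothetical LDL values and outcomes (not the study data):

```python
# Pearson correlation between an attribute (LDL) and a binary outcome,
# as a crude check that the attribute is a useful "fingerprint".
# All numbers below are made up for illustration.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

ldl = [120, 130, 150, 180, 230, 250, 90, 110, 240, 200]
event = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]  # 1 = heart attack/stroke within 5 years

r = pearson_r(ldl, event)
print(round(r, 2))  # strongly positive: high LDL co-occurs with events
```

A value of r near +1 or -1 suggests the attribute separates the outcomes well; near 0, it carries little signal on its own.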
Correlations and fingerprints: LDL example
[Figure: 3-D scatter plot of simulated patient data. Healthy cases: blue; heart attack or stroke within 5 years from the exam: red; x – LDL; y – HDL; z – age. See the study by Westendorp et al., Arch Intern Med. 2003, 163(13):1549.]

LDL example: 2D projection
[Figure: scatter plot of HDL vs. LDL for the same simulated data.]

LDL example: regression with binary output and 1D projection for classification
[Figure: class label (0/1) plotted against LDL, with a fitted regression curve.]

Unsupervised vs. supervised learning
In the case of unsupervised learning, the goal is to “discover” structure in the data and to group (cluster) similar objects, given a similarity measure. In the case of supervised learning (or learning with a teacher), a set of examples with class assignments (e.g. healthy vs. diseased) is given, and the goal is to find a representation of the problem in some feature (attribute) space that provides a proper separation of the imposed classes. Such representations, with the resulting decision boundaries, may subsequently be used to make predictions for new cases.
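The "regression with binary output" idea above can be sketched as follows: fit an ordinary least-squares line to the 0/1 class labels as a function of LDL alone, then classify by thresholding the fitted value at 0.5. A minimal sketch with hypothetical numbers:

```python
# Least-squares regression on a binary output (0 = healthy, 1 = event),
# then classification by thresholding at 0.5. Data are hypothetical.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # (w1, w0) for y = w1*x + w0

ldl = [90, 110, 120, 130, 150, 180, 200, 230, 240, 250]
label = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
w1, w0 = fit_line(ldl, label)

def classify(x):
    return 1 if w1 * x + w0 > 0.5 else 0

print(classify(240), classify(100))  # prints: 1 0
```

This is the 1-D projection of the slide: all the classifier sees is where a patient's LDL falls relative to the learned decision point.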
[Figure: three classes (Class 1, Class 2, Class 3) separated by decision boundaries in a 2-D feature space.]

Choice of the model, problem representation and feature selection: another simple example
[Figure: adults vs. children separated by height and weight; females (F) vs. males (M) separated by estrogen and testosterone levels.]

Gene expression example again: JRA clinical classes
Individuals (33 patients + 12 controls); each record lists patient ID, JRA onset type, course, MTX treatment (0/1) and X-ray finding:
11113.3, Pauci, 1_Pauci, MTX 0, XR_unknown
878, Pauci, 2_Poly, MTX 1, XR_erosions
976, Pauci, 1_Pauci, MTX 1, XR_normal
9161, Pauci, 1_Pauci, MTX 1, XR_normal
7177, Spond, JSPA, MTX 1, XR_space narrowing
8003, Pauci, 1_Pauci, MTX 0, XR_unknown
993, Syst, 2_Poly, MTX 1, XR_space narrowing
9137, Poly, 2_Poly, MTX 0, XR_space narrowing
19, Pauci, 2_Poly, MTX 1, XR_space narrowing
912, Syst, 2_Poly, MTX 1, XR_space narrowing
1081, Poly, 2_Poly, MTX 0, XR_space narrowing
9272, Poly, 2_Poly, MTX 1, XR_unknown
1087PB, Poly, 2_Poly, MTX 1, XR_space narrowing
1073, Syst, 2_Poly, MTX 0, XR_space narrowing
872, Poly, 2_Poly, MTX 1, XR_space narrowing
850, Poly, 2_Poly, MTX 1, XR_erosions
831, Poly, 2_Poly, MTX 1, XR_erosions
894, Pauci, 2_Poly, MTX 1, XR_normal
7029, Pauci, 1_Pauci, MTX 0, XR_normal
18036, Pauci, 1_Pauci, MTX 0, XR_normal
18057, Spond, JAS, MTX 0, XR_sclerosis
845, Spond, JAS, MTX 0, XR_space narrowing
...
7113.3, Control, na, MTX 0, XR_na
9150, Spond, JAS, MTX 0, XR_sclerosis
7108, Syst, 3_Systemic, MTX 1, XR_normal
813.3, Control, na, MTX 0, XR_na
1084, Control, na, MTX 0, XR_na
7021.31, Control, na, MTX 0, XR_na
18042, Spond, JAS, MTX 0, XR_sclerosis
7118.3, Control, na, MTX 0, XR_na
9264, Pauci, 1_Pauci, MTX 0, XR_erosions
824, Poly, 2_Poly, MTX 0, XR_normal
9245, Spond, JSPA, MTX 0, XR_space narrowing
1087ctrl, Control, na, MTX 0, XR_na
801, Pauci, 2_Poly, MTX 1, XR_erosions
7149.3, Control, na, MTX 0, XR_na
1095, Control, na, MTX 0, XR_na
1089, Control, na, MTX 0, XR_na
1085, Control, na, MTX 0, XR_na
1082, Control, na, MTX 0, XR_na
817, Pauci, 1_Pauci, MTX 0, XR_space narrowing
7206, Pauci, 1_Pauci, MTX 0, XR_unknown
7145, Pauci, 1_Pauci, MTX 0, XR_normal
1083, Control, na, MTX 0, XR_na
[Figure: heat map of 242 differentially expressed genes across individuals; 105 genes with significantly lower and 137 genes with significantly higher expression in poly-articular JRA. Picture: courtesy of B. Aronow.]

Advantages of prior knowledge; on the other hand, problems with class assignment (e.g. in clinical practice)
[Figure: GLOBINS, FixL and PYP structures.]
Prior knowledge: the same class despite low sequence similarity. This suggests that a distance based on sequence similarity alone is not sufficient; adding structure-derived features might help (the “good model” question again).

Three phases in supervised learning protocols
Training data: examples with class assignment are given
Learning: (i) an appropriate model (or representation) of the problem needs to be selected, in terms of attributes, distance measure and classifier type; (ii) adaptive parameters in the model need to be optimized to provide correct classification of the training examples (e.g.
minimizing the number of misclassified training vectors)
Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding the trade-off between accuracy and generalization is not trivial)

Training set: LDL example again
A set of objects (here patients) xi, i=1, …, N, is given. For each patient, a set of features (attributes and the corresponding measurements on these attributes) is given too. Finally, for each patient we are given the class Ck, k=1, …, K, to which he/she belongs: { xi, Ck }, i=1, …, N.

Age | LDL | HDL | Sex | Class
41 | 230 | 60 | F | healthy (0)
32 | 120 | 50 | M | stroke within 5 years (1)
45 | 90 | 70 | M | heart attack within 5 years (1)

Optimizing adaptable parameters in the model
Find a model y(x;w) that describes the objects of each class as a function of the features x and the adaptive parameters (weights) w. Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C=? (e.g. if y(x;w) > 0.5 then C=1, i.e. likely to suffer from a stroke or heart attack in the next 5 years).

Examples of machine learning algorithms for classification and regression problems
Linear perceptron, Least Squares
LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)
SVM (Support Vector Machines) (optimal, wide-margin linear cuts, kernel non-linear generalizations)
Decision trees (logical rules)
k-NN (k-Nearest Neighbors) (simple, non-parametric)
Neural networks (general non-linear models, adaptivity, “artificial brain”)
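The three phases described above (training data, learning by minimizing misclassified training vectors, validation on held-out cases) can be sketched with the simplest possible classifier, a single threshold on one feature. All numbers are hypothetical:

```python
# Three-phase protocol on 1-D toy data: (LDL, class) pairs, class 1 = event.
train = [(90, 0), (110, 0), (130, 0), (150, 0),
         (180, 1), (200, 1), (230, 1), (250, 1)]
valid = [(100, 0), (140, 0), (190, 1), (240, 1)]  # held out, never trained on

def errors(threshold, data):
    # Count misclassifications when predicting class 1 for x > threshold.
    return sum((x > threshold) != bool(y) for x, y in data)

# Learning: choose the candidate threshold with the fewest training errors.
best_t = min((x for x, _ in train), key=lambda t: errors(t, train))

# Validation: accuracy on examples not seen during training.
val_acc = 1 - errors(best_t, valid) / len(valid)
print(best_t, val_acc)  # prints: 150 1.0
```

Here training and validation agree because the toy classes are cleanly separable; with noisy real data, the gap between training and validation accuracy is exactly the generalization problem discussed next.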
Training accuracy vs. generalization
[Figure]

Model complexity, training set size and generalization
[Figure: the same data fitted with linear, cubic and 7th-degree polynomials.]

Similarity measures
[Figure]

k-nearest neighbors as a simple algorithm for classification
Given a training set of N objects with known class assignments and k < N, find an assignment of new objects (not included in the training set) to one of the classes, based on the assignments of their k nearest neighbors
A simple, non-parametric method that works surprisingly well, especially in the case of low-dimensional problems
Note, however, that the choice of the distance measure may again have a profound effect on the results
The optimal k is found by trial and error

k-nearest neighbor algorithm
Step 1: Compute pairwise distances and take the k closest neighbors
Step 2: Assign the class by simple majority voting: the new point belongs to the class to which most of its k neighbors belong
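The two steps above can be sketched in a few lines of Python, assuming Euclidean distance as the similarity measure and a small hypothetical (LDL, HDL) training set:

```python
# Minimal k-NN classifier: Step 1 computes distances and keeps the k closest
# training points; Step 2 takes a majority vote over their class labels.
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical training set: ((LDL, HDL), class), class 1 = event.
train = [((230, 60), 1), ((250, 45), 1), ((210, 50), 1),
         ((110, 70), 0), ((120, 80), 0), ((100, 65), 0)]

def knn_predict(x, train, k=3):
    # Step 1: distances to all training points; take the k closest neighbors.
    neighbors = sorted(train, key=lambda p: dist(x, p[0]))[:k]
    # Step 2: simple majority vote over the neighbors' classes.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_predict((240, 55), train), knn_predict((105, 75), train))  # prints: 1 0
```

Swapping `dist` for another metric (e.g. Manhattan distance, or a weighted distance after feature scaling) changes the neighborhoods, which is why the choice of distance measure matters so much for this method.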