Data Mining:
Discovering Information
From Bio-Data
Presented by:
Hongli Li & Nianya Liu
University of Massachusetts Lowell
Introduction

- Data Mining Background
  – Process
  – Functionalities
  – Techniques
- Two examples
  – Short Peptide
  – Clinical Records
- Conclusion
Data Mining Background – Process

Data Collection & Selection → Data Cleaning → Data Enrichment → Encoding → Data Mining → Representations
Functionalities

- Classification
- Cluster Analysis
- Outlier Analysis
- Trend Analysis
- Association Analysis
Techniques

- Decision Tree
- Bayesian Classification
- Hidden Markov Models
- Support Vector Machines
- Artificial Neural Networks
Technique 1 – Decision Tree

[Decision tree diagram: the root node tests A1 > a; its Yes/No branches lead to tests A2 > b and A2 > c; the four leaves are Class 1, Class 2, Class 3, and Class 4.]
Technique 2 – Bayesian Classification

- Based on Bayes' theorem:
  P(H | X) = P(X | H) P(H) / P(X)
- Simple, but comparable to decision tree and neural network classifiers in many applications.
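As a minimal sketch, the theorem can be evaluated directly; all probabilities below are made up for illustration:

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Example: hypothesis H = "instance belongs to class C",
# evidence X = some observed feature (invented numbers).
p = posterior(p_x_given_h=0.8, p_h=0.01, p_x=0.1)
print(p)  # ≈ 0.08
```

A Bayesian classifier evaluates this posterior for each class H and picks the class with the largest value.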
Technique 3 – Hidden Markov Model

[Diagram: a hidden Markov model with Start and End states; hidden states along the paths emit symbols such as y, a, c, c.]
Technique 4 – Support Vector Machine

- SVM finds the maximum-margin hyperplane that separates the classes.
  – The hyperplane can be represented as a linear combination of training points.
  – The algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space.
  – So we can locate a separating hyperplane in the feature space and classify points in that space simply by defining a kernel function.
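The kernel idea can be illustrated with a toy check: a quadratic kernel computed purely from the input-space dot product equals a dot product in an explicit degree-2 feature space. The feature map and sample points below are chosen only for illustration:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    # Explicit degree-2 feature map for a 2-D input.
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    # Quadratic kernel, computed without ever building phi(x).
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
print(kernel(x, y))            # ≈ 16.0
print(dot(phi(x), phi(y)))     # same value, via the explicit feature space
```

This is why an SVM never needs the feature-space coordinates: every dot product in the feature space is obtained from the kernel function on input-space vectors.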
Example 1 – Short Peptides

- Problem
  – Identify T-cell epitopes from melanoma antigens
  – Training set: 602 HLA-DR4 binding peptides, 713 non-binding peptides
- Solution: neural networks
Neural Networks – Single Computing Element

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feed one unit that outputs y = f(net).]

y = f(net) = 1 / (1 + e^(−net)), where net = Σi xi wi
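The single computing element above can be sketched in a few lines; the inputs and weights are made up:

```python
import math

def neuron(xs, ws):
    # net = sum_i x_i * w_i ; y = f(net) = 1 / (1 + e^(-net))
    net = sum(x * w for x, w in zip(xs, ws))
    return 1.0 / (1.0 + math.exp(-net))

# net = 0.2 + 0.2 - 0.1 = 0.3, so y = sigmoid(0.3)
print(neuron([1.0, 0.5, -1.0], [0.2, 0.4, 0.1]))  # ≈ 0.574
```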
Neural Networks – Classifier

[Diagram: inputs X1, X2, X3, X4, X5, …, Xn−1, Xn feed a network with output Y.]

- Sparse coding: each amino acid is a 20-bit vector, e.g. Alanine = 10000000000000000000
- 9 residues × 20 bits = 180 bits per input
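A sketch of the sparse (one-hot) coding, assuming the conventional 20-letter amino-acid alphabet; the alphabet ordering and the 9-residue peptide below are invented for illustration:

```python
# 20 standard amino acids, one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode(peptide):
    """Encode each residue as a 20-bit one-hot vector; concatenate them."""
    bits = []
    for residue in peptide:
        vec = [0] * 20
        vec[AMINO_ACIDS.index(residue)] = 1
        bits.extend(vec)
    return bits

code = encode("AAYLQHIKF")  # a made-up 9-residue peptide
print(len(code))            # 9 * 20 = 180 bits per input
print(code[:20])            # Alanine -> 1 followed by nineteen 0s
```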
Neural Networks – Error Back-Propagation

[Diagram: inputs x1, x2, x3 feed a first layer through weights v1,1 … v3,2; the first-layer outputs z_j feed the output unit through weights w1, w2, producing y.]

Output: y = f(Σj wj f(Σi xi vij))
Squared error: E = (t − y)² / 2
Adjustment:
Δwj = −η ∂E/∂wj = η (δ)(y)(1 − y)(zj)
Δvij = η (δ)(y)(1 − y)(wj)(zj)(1 − zj)(xi)
where η is a fixed learning rate, zj is the output of the j-th computing element of the first layer, and δ is the difference between the correct output t and the output y.
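The update rules above can be sketched for one hidden layer trained on a single made-up example; the network sizes, inputs, target, and learning rate are all illustrative:

```python
import math
import random

def f(net):
    # Logistic activation, as in the single-element slide.
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, V, w):
    # z_j = f(sum_i x_i v_ij); y = f(sum_j w_j z_j)
    z = [f(sum(x[i] * V[i][j] for i in range(len(x)))) for j in range(len(w))]
    y = f(sum(w[j] * z[j] for j in range(len(w))))
    return z, y

def train_step(x, t, V, w, eta=0.5):
    z, y = forward(x, V, w)
    delta = t - y                         # delta = t - y
    grad_out = delta * y * (1 - y)        # shared factor delta * y(1-y)
    for j in range(len(w)):
        # Delta v_ij = eta * delta * y(1-y) * w_j * z_j(1-z_j) * x_i
        for i in range(len(x)):
            V[i][j] += eta * grad_out * w[j] * z[j] * (1 - z[j]) * x[i]
        # Delta w_j = eta * delta * y(1-y) * z_j
        w[j] += eta * grad_out * z[j]
    return (t - y) ** 2 / 2               # squared error before this update

random.seed(0)
V = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
w = [random.uniform(-1, 1) for _ in range(2)]
errors = [train_step([1.0, 0.5], 1.0, V, w) for _ in range(200)]
print(errors[0] > errors[-1])  # the squared error shrinks as training proceeds
```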
Result & Remarks

- Success rate: 60%
- A systematic experimental study is very expensive
- A highly accurate prediction method can reduce the cost
- Other alternatives exist
Data Mining: Discovering Information from Clinical Records

Problem

- Problem: use already known data (clinical records) → predict unknown data
- How to analyze the known data? → training data
- How to test unknown data? → prediction data
Problem

- The data has many attributes.
  Example: 2300 combinations of attributes with 8 attributes for one class.
- It is impossible to calculate all of them manually.
Problem

- One example: eight attributes for diabetic patients:
  (1) Number of times pregnant
  (2) Plasma glucose
  (3) Diastolic blood pressure
  (4) Triceps skin fold thickness
  (5) Two-hour serum insulin
  (6) Body mass index
  (7) Diabetes pedigree
  (8) Age
CAEP – Classification by Aggregating Emerging Patterns

A classification (known data) and prediction (unknown data) algorithm.

- Definition:
  (1) Training data → discover all the emerging patterns.
  (2) Training data → sum and normalize the differentiating weight of these emerging patterns.
  (3) Training data → choose the class with the largest normalized score as the winner.
  (4) Test data → compute the score of the test instance and make a prediction.
CAEP: Emerging Pattern

- Emerging pattern
  Definition: an emerging pattern is a pattern over some attributes whose frequency increases significantly from one class to another.
  Example (mushrooms):

  Attribute     Poisonous   Edible
  Smell         odor        none
  Surface       wrinkled    smooth
  Ring number   1           3
CAEP: Classification

Definition:
(1) Discover the factors that differentiate the two groups.
(2) Find a way to use these factors to predict to which group a new patient should belong.
CAEP: Method

Method:
- Discretize the dataset into a binary one.
  – item: an (attribute, interval) pair, e.g. (age, >45)
  – instance t: a set of items such that an item (A, v) is in t if and only if the value of attribute A of t is within the interval v
- Clinical records:
  – 768 women
  – 21% diabetic instances: 161
  – 71% non-diabetic instances: 546
CAEP: Support

Support of X (an item set)
Definition: the ratio of the number of instances in a class that contain X to the total number of instances in that class.
Formula: suppD(X) = |{t ∈ D | X ⊆ t}| / |D|
Meaning: if suppD(X) is high, the item set X occurs in many instances of this class.
Example: how many people in the diabetic class are older than 60? (item: age > 60)
148/161 = 91%
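The support formula can be checked on a toy set of instances; all items and values below are invented for illustration:

```python
def supp(X, D):
    """supp_D(X) = |{t in D : X subset of t}| / |D|."""
    return sum(1 for t in D if X <= t) / len(D)

# Each instance is a set of (attribute, interval) items.
diabetic = [
    {("age", ">60"), ("glucose", ">140")},
    {("age", ">60")},
    {("age", ">60"), ("bmi", ">30")},
    {("glucose", ">140")},
]
print(supp({("age", ">60")}, diabetic))  # 3 of 4 instances -> 0.75
```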
CAEP: Growth Rate

The growth rate of X
Definition: the ratio of the supports of the same item set in two classes.
Formula: growthD(X) = suppD(X) / suppD'(X)
Meaning: if growthD(X) is high, X is far more likely to occur in class D than in class D'.
Example: 91% of the diabetic class is older than 60, versus 10% of the non-diabetic class:
growth(>60) = 91% / 10% ≈ 9
CAEP: Likelihood

likelihoodD(X)
Definition: the ratio of the number of instances containing X in one class to the number of instances containing X in both classes.
Formula 1: likelihoodD(X) = suppD(X) · |D| / (suppD(X) · |D| + suppD'(X) · |D'|)
Formula 2 (if D and D' are roughly equal in size): likelihoodD(X) = suppD(X) / (suppD(X) + suppD'(X))
Example (Formula 1): (91% · 223) / (91% · 223 + 10% · 545) = 203 / 257 ≈ 79%
Example (Formula 2): 91% / (91% + 10%) = 91% / 101% ≈ 90.1%
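Growth rate and likelihood follow directly from the two classes' supports; the sketch below reuses the slide's >60 percentages and class sizes:

```python
def growth(supp_d, supp_d2):
    """growth_D(X) = supp_D(X) / supp_D'(X)."""
    return supp_d / supp_d2

def likelihood(supp_d, n_d, supp_d2, n_d2):
    """Formula 1: supports weighted by the class sizes |D| and |D'|."""
    return supp_d * n_d / (supp_d * n_d + supp_d2 * n_d2)

print(growth(0.91, 0.10))                # ≈ 9.1
print(likelihood(0.91, 223, 0.10, 545))  # ≈ 0.79
```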
CAEP: Evaluation

Sensitivity: the ratio of the number of correctly predicted diabetic instances to the number of diabetic instances.
Example: 60 correctly predicted / 100 diabetic = 60%
Specificity: the ratio of the number of correctly predicted diabetic instances to the number of instances predicted diabetic.
Example: 60 correctly predicted / 120 predicted = 50%
Accuracy: the percentage of instances correctly classified.
Example: 60 correctly predicted / 180 = 33%
CAEP: Evaluation

- Using one attribute for class prediction:
  high accuracy, but low sensitivity: it identifies only 30% of the diabetic instances.
CAEP: Prediction

- Consider all attributes: accumulate the scores of all emerging patterns of class D that the instance contains.
  Formula: score(t, D) = Σ(X ⊆ t) likelihoodD(X) · suppD(X)
- Prediction: score(t, D) > score(t, D') → t belongs to class D.
CAEP: Normalize

- If the numbers of emerging patterns differ significantly between classes (one class D has more emerging patterns than another class D'), instances tend to receive higher scores for D than for D'.
- score(t, D) = Σ(X ⊆ t) likelihoodD(X) · suppD(X)
- Normalize the score: norm_score(t, D) = score(t, D) / base_score(D)
- Prediction: if norm_score(t, D) > norm_score(t, D') → t belongs to class D.
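A sketch of the aggregate scoring and normalized prediction described above. The pattern lists, likelihoods, supports, and base scores below are all invented, and base_score stands in for whatever per-class baseline (e.g. a median training score) is used:

```python
def score(t, patterns):
    """score(t, D): sum likelihood * support over patterns of D contained in t.

    patterns: list of (item_set, likelihood, support) triples.
    """
    return sum(lik * sup for items, lik, sup in patterns if items <= t)

def predict(t, eps_by_class, base):
    """Pick the class with the largest normalized score."""
    norm = {c: score(t, eps) / base[c] for c, eps in eps_by_class.items()}
    return max(norm, key=norm.get)

# Hypothetical emerging patterns per class (all numbers made up).
eps_by_class = {
    "diabetic":     [({("age", ">60")}, 0.79, 0.91), ({("bmi", ">30")}, 0.70, 0.50)],
    "non-diabetic": [({("age", "<=60")}, 0.85, 0.80)],
}
base = {"diabetic": 1.2, "non-diabetic": 0.8}  # hypothetical base scores
t = {("age", ">60"), ("bmi", ">30"), ("glucose", ">140")}
print(predict(t, eps_by_class, base))  # "diabetic"
```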
CAEP: Comparison

Comparison with C4.5 and CBA (diabetic / non-diabetic):

Method   Sensitivity      Specificity      Accuracy
C4.5     –                –                71.1%
CBA      –                –                73.0%
CAEP     70.5% / 63.3%    77.4% / 83.1%    75%
CAEP: Modify

- Problem: CAEP produces a very large number of emerging patterns.
  Example: with 8 attributes, 2300 emerging patterns.
CAEP: Modify

- Reduce the number of emerging patterns.
  Method: prefer strong emerging patterns over their weaker relatives.
  Example: X1 has infinite growth but very small support; X2 has less growth but much larger support, say 30 times that of X1. In such a case X2 is preferred because it covers many more cases than X1.
- There is no loss in prediction performance from this reduction of emerging patterns.
CAEP: Variations

- JEP: uses exclusively emerging patterns whose supports increase from zero to nonzero, called jump emerging patterns. Performs well when there are many jump emerging patterns.
- DeEP: an instance-based variant whose training phase is customized for each test instance. Slightly better accuracy, and it incorporates new training data easily.
Relevance Analysis

- Data mining algorithms are in general exponential in complexity.
- Relevance analysis: exclude the attributes that do not contribute to the classification process.
- Makes it possible to deal with much higher-dimensional datasets.
- Not always useful for lower-ranking dimensions.
Conclusion

- Classification and prediction aspects of data mining.
- Methods include decision trees, mathematical formulas, artificial neural networks, and emerging patterns.
- They are applicable in a large variety of classification applications.
- CAEP has good predictive accuracy on all data sets.