Data Mining in Bioinformatics

September 10, 2002
Peter Bajcsy, PhD
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
[email protected]

Outline
• Introduction
  — Interdisciplinary Problem Statement
  — Microarray Problem Overview
• Microarray Data Processing
  — Image Analysis and Data Mining
  — Prior Knowledge
  — Data Mining Methods
  — Database and Optimization Techniques
  — Visualization
• Validation
• Summary

Introduction: Recommended Literature
1. Bioinformatics – The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001
2. Data Mining – Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001
3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001

Bioinformatics, Computational Biology, Data Mining
• Bioinformatics is an interdisciplinary field concerned with the information processing problems in computational biology and with a unified treatment of the data mining methods for solving these problems.
• Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g.,
  — Genomes (viruses, bacteria, fungi, plants, insects, …)
  — Proteins and Proteomes
  — Biological Sequences
  — Molecular Function and Structure
• Data Mining is searching for knowledge in data, also described as:
  — Knowledge mining from databases
  — Knowledge extraction
  — Data/pattern analysis
  — Data dredging
  — Knowledge Discovery in Databases (KDD)

Introduction: Problems in the Bioinformatics Domain
• Data production at the levels of molecules, cells, organs, organisms, and populations
• Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, …
• Prediction of Molecular Function and Structure
• Computational biology: synthesis (simulations) and analysis (machine learning)

MICROARRAY PROBLEM

Microarray Problem: Major Objective
• Major Objective: Discover a comprehensive theory of life's organization at the molecular level
  — The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA)
  — The central dogma of molecular biology: genetic information flows from DNA to RNA (transcription) and from RNA to proteins (translation)
  — Proteins are very complicated molecules built from 20 different amino acids
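To make the central dogma concrete as a data transformation, here is a minimal Python sketch, not from the original slides: it transcribes a hypothetical DNA fragment into mRNA and translates it into amino acids. The codon table is deliberately abbreviated to the codons used in the example.

```python
# Central dogma as data flow: DNA -> mRNA (transcription) -> protein
# (translation). The codon table is abbreviated; a real translator
# would cover all 64 codons.

CODON_TABLE = {
    "AUG": "Met",  # start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "UGC": "Cys",
    "UAA": "STOP",
}

def transcribe(dna: str) -> str:
    """Transcription: the RNA copy replaces thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")

def translate(mrna: str):
    """Translation: read the mRNA three bases (one codon) at a time."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "???")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

dna = "ATGTTTGGCTGCTAA"   # hypothetical coding-strand DNA fragment
mrna = transcribe(dna)    # "AUGUUUGGCUGCUAA"
print(translate(mrna))    # ['Met', 'Phe', 'Gly', 'Cys']
```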
Input and Output of Microarray Data Analysis
• Input: Laser image scans (data) and the underlying experiment hypotheses or experiment designs (prior knowledge)
• Output:
  — Conclusions about the input hypotheses, or knowledge about the statistical behavior of measurements
  — The theory of biological systems learned automatically from data (machine learning perspective): model fitting, inference process

Overview of Microarray Problem
[Slide diagram: Biology Application Domain (microarray experiment; experiment design and hypothesis) feeding Data Analysis (image analysis, data warehouse, data mining, validation), with contributing communities: Statistics, Artificial Intelligence (AI), and Knowledge Discovery in Databases (KDD)]

Statistics Community
• Random Variables
• Statistical Measures
• Probability and Probability Distribution
• Confidence Interval Estimations
• Test of Hypotheses
• Goodness of Fit
• Regression and Correlation Analysis

Example: GeneFilter Comparison Report (GeneFilter 1 = O2#1 8-20-99adjfinal, GeneFilter 2 = N2#1finaladj; abbreviated GF1, GF2)

ORF NAME  GENE NAME  CHRM  F G R    RAW GF1  RAW GF2  NORM GF1   NORM GF2   DIFFERENCE  RATIO
YAL001C   TFC3       1     1 A 1 2  12.03    7.38     403.83     209.79     194.04      1.92
YBL080C   PET112     2     1 A 1 3  53.21    35.62    1,786.11   1,013.13   772.98      1.76
YBR154C   RPB5       2     1 A 1 4  79.26    78.51    2,660.73   2,232.86   427.87      1.19
YCL044C              3     1 A 1 5  53.22    44.66    1,786.53   1,270.12   516.41      1.41
YDL020C   SON1       4     1 A 1 6  23.80    20.34    799.06     578.42     220.64      1.38
YDL211C              4     1 A 1 7  17.31    35.34    581.00     1,005.18   -424.18     -1.73
YDR155C   CPH1       4     1 A 1 8  349.78   401.84   11,741.98  11,428.10  313.88      1.03
YDR346C              4     1 A 1 9  64.97    65.88    2,180.87   1,873.67   307.21      1.16

Artificial Intelligence (AI) Community
• Design cycle of predictive modeling: Collect Data → Choose Features → Choose Model → Train Classifier → Evaluate Classifier
• Issues:
  — Prior knowledge (e.g., invariance)
  — Model deviation from the true model
  — Sampling distributions
  — Computational complexity
  — Model complexity (overfitting)

Knowledge Discovery in Databases (KDD) Community
• Database view of the same microarray measurements (GeneFilter comparison report as shown above)
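The DIFFERENCE and RATIO columns of the report above can be recomputed from the normalized intensities. Here is a minimal sketch, not from the slides, with hypothetical column names; it assumes the GeneFilter-style sign convention suggested by the table, where down-regulated genes get a negative fold-change ratio.

```python
# Recompute DIFFERENCE and RATIO from normalized GF1/GF2 intensities.
# Column names are hypothetical.
import pandas as pd

report = pd.DataFrame({
    "orf":      ["YAL001C", "YDL211C", "YDR155C"],
    "norm_gf1": [403.83, 581.00, 11741.98],
    "norm_gf2": [209.79, 1005.18, 11428.10],
})

report["difference"] = report["norm_gf1"] - report["norm_gf2"]

# Assumed ratio convention: fold change >= 1, with a negative sign
# when the gene is down-regulated (GF2 > GF1).
up = report["norm_gf1"] >= report["norm_gf2"]
report["ratio"] = (report["norm_gf1"] / report["norm_gf2"]).where(
    up, -(report["norm_gf2"] / report["norm_gf1"])
)

print(report.round(2))  # reproduces 1.92, -1.73, and 1.03 for these rows
```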
Microarray Data Mining and Image Analysis Steps
• Image Analysis
  — Normalization
  — Grid Alignment
  — Spot Quality Assurance Control
  — Feature construction (selection and extraction)
• Data Mining
  — Prior knowledge
  — Statistics
  — Machine learning
  — Pattern recognition
  — Database techniques
  — Optimization techniques
  — Visualization
• Validation
  — Issues
  — Cross validation techniques
[Slide background: GeneFilter comparison report screenshot, as above]

MICROARRAY IMAGE ANALYSIS

Microarray Image Analysis
[Slide figures: example microarray image scans]

DATA MINING OF MICROARRAY DATA

Why Data Mining? A Sequence Example
• Biology: Language and Goals
  — A gene can be defined as a region of DNA.
  — A genome is one haploid set of chromosomes with the genes they contain.
  — Goal: perform competent comparison of gene sequences across species and account for inherently noisy biological sequences due to random variability amplified by evolution.
  — Assumption: if a gene has high similarity to another gene, then the two genes perform the same function.
• Analysis: Language and Goals
  — A feature is an extractable attribute or measurement (e.g., gene expression, location).
  — Pattern recognition tries to characterize data patterns (e.g., similar gene expressions, equidistant gene locations).
  — Data mining is about uncovering patterns, anomalies, and statistically significant structures in data (e.g., find two similar gene expressions with confidence > x).

Types of Expected Data Mining and Analysis Results
Hypothetical examples:
• Binary answers using tests of hypotheses
  — Drug treatment is successful with a confidence level x.
• Statistical behavior (probability distribution functions)
  — A class of genes with functionality X follows a Poisson distribution.
• Expected events
  — As the amount of treatment increases, the gene expression level will decrease.
• Relationships
  — The expression level of gene A is correlated with the expression level of gene B under varying treatment conditions (genes A and B are part of the same pathway).
• Decision trees
  — Classification of a new gene sequence by a "domain expert".

PRIOR KNOWLEDGE

Prior Knowledge: Experiment Design
Pipeline: Collect Data → Data Cleaning and Transformations → Choose Features → Choose Model and Data Mining Method
• Microarray sources of systematic and random errors
• Feature selection and variability
• Expectations and hypotheses
• Data cleaning and transformations
• Data mining method selection
• Interpretation

Prior Knowledge from Experiment Design
Complexity levels of microarray experiments (a minimal t-test sketch for level 1 follows below):
1. Compare a single gene in a control situation versus a treatment situation
   • Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
   • Methods: t-test, Bayesian approach
2. Find multiple genes that share common functionalities
   • Example: Find related genes that are co-dependent.
   • Methods: Clustering (hierarchical, k-means, self-organizing maps, neural networks, support vector machines)
3. Infer the underlying gene and protein networks that are responsible for the observed patterns and functional pathways
   • Example: What is the gene regulation at the system level?
   • Directions: mining regulatory regions, modeling regulatory networks on a global scale
Goal of future experiment designs: understand biology at the system level, e.g., gene networks, protein networks, signaling networks, metabolic networks, the immune system, and neuronal networks.
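As a concrete illustration of complexity level 1, here is a minimal sketch with hypothetical replicate measurements for one gene; `scipy.stats.ttest_ind` with `equal_var=False` performs Welch's two-sample t-test, a common variant of the t-test named on the slide.

```python
# Complexity level 1: is a single gene's expression different under
# control vs. treatment? The replicate measurements are hypothetical.
from scipy import stats

control   = [403.8, 389.2, 421.5, 398.7, 410.1]
treatment = [209.8, 230.4, 198.6, 215.3, 224.9]

# Welch's t-test (equal_var=False) does not assume equal variances.
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)

alpha = 0.05  # chosen significance level
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: expression differs between control and treatment.")
else:
    print("Cannot reject H0 at the chosen significance level.")
```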
Data Mining Techniques
Data mining techniques draw from:
• Statistics
• Machine learning
• Pattern recognition
• Database techniques
• Optimization techniques
• Visualization

STATISTICS

Statistics
• Descriptive statistics: describe data
• Inductive statistics: make forecasts and inferences (e.g., are two sample sets identically distributed?)

Statistical t-test
• Gene expression level in control and treatment situations
• Is the behavior of a single gene different in the control situation than in the treatment situation?
• Normalized distance between the two sample sets:

  t = (m_1 - m_2) / sqrt(s_1^2/n_1 + s_2^2/n_2)

  where m_i is the sample mean, s_i^2 the sample variance, and n_i the sample size of population i.
• The normalized distance t follows a Student distribution with f degrees of freedom. If t > threshold, then the control and treatment data populations are considered to be different.

MACHINE LEARNING AND PATTERN RECOGNITION

Machine Learning
• Unsupervised learning: "natural groupings"
• Supervised learning: learning from examples
• Reinforcement learning

Pattern Recognition
• Statistical models: linear correlation and regression
• Locally weighted learning
• Decision trees
• Neural networks: NN representation with gradient-based optimization or genetic-algorithm-based optimization
• k-nearest neighbors, support vector machines

Unsupervised Learning and Clustering
• A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters ("natural groupings").
• Examples of data objects: gene expression levels, sets of co-regulated genes (pathways), protein structures
• Categories of clustering methods:
  — Partitioning methods
  — Hierarchical methods
  — Density-based methods

Unsupervised Clustering: Partitioning Methods
Example: Centroid-Based Technique
• The k-means algorithm partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high while the inter-cluster similarity is low.
• Input: the number of desired clusters k
• Output: k labels assigned to n objects
• Steps (a minimal sketch follows below):
  1. Select k initial cluster centers.
  2. Compute similarity as a distance between an object and each cluster center.
  3. Assign a label to an object based on the minimum distance.
  4. Repeat for all objects.
  5. Re-compute the cluster centers as the mean of all objects assigned to a given cluster.
  6. Repeat from Step 2 until objects do not change their labels.
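Here is a minimal k-means sketch, an assumed implementation that mirrors the six steps above, applied to hypothetical gene expression profiles:

```python
# K-means sketch mirroring the slide's six steps. Data are hypothetical
# expression profiles (rows = genes, columns = conditions).
import numpy as np

def kmeans(data, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 1: select k initial cluster centers (random objects).
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Steps 2-4: distance of every object to every center;
        # each object's label is its nearest center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # Step 6: no object changed its label
        labels = new_labels
        # Step 5: re-compute each center as the mean of its objects.
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

profiles = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
labels, centers = kmeans(profiles, k=2)
print(labels)  # e.g., [0 0 1 1]; cluster numbering depends on initialization
```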
Unsupervised Clustering: Partitioning Methods
Example: Representative-Based Technique
• The k-medoids algorithm partitions a set of n objects into k clusters so that it minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
• Input: the number of desired clusters k
• Output: k labels assigned to n objects
• Steps:
  1. Select k initial objects as the initial medoids.
  2. Compute similarity as a distance between an object and each cluster medoid.
  3. Assign a label to an object based on the minimum distance.
  4. Repeat for all objects.
  5. Randomly select a non-medoid object and swap it with the current medoid if the swap would decrease the intra-cluster squared error.
  6. Repeat from Step 2 until objects do not change their labels.

Unsupervised Clustering: Hierarchical Clustering
• Hierarchical clustering partitions a set of n objects into a tree of clusters.
• Types of hierarchical clustering:
  — Agglomerative hierarchical clustering: bottom-up strategy of building clusters
  — Divisive hierarchical clustering: top-down strategy of building clusters

Unsupervised Agglomerative Hierarchical Clustering
• Agglomerative hierarchical clustering partitions a set of n objects into a tree of clusters with a bottom-up strategy.
• Steps:
  1. Assign a unique label to each data object, forming n clusters.
  2. Find the nearest clusters and merge them.
  3. Repeat Step 2 until the desired number of clusters is reached.
• Types of agglomerative hierarchical clustering:
  — Nearest-neighbor algorithms (minimum or single-linkage algorithm, minimal spanning tree)
  — Farthest-neighbor algorithms (maximum or complete-linkage algorithm)

Unsupervised Clustering: Density-Based Clustering
• Density-Based Spatial Clustering with Noise aggregates objects into clusters if the objects are density connected.
• Density-connected objects:
  — Simplified explanation: P and Q are density connected if there is an object O such that both P and Q are density-reachable from O.
  — Aggregate P and Q if they are density connected with respect to an R-radius neighborhood and a minimum-number-of-objects criterion.

Supervised Learning or Classification
• Classification is a two-step process consisting of learning classification rules followed by assignment of a classification label.

Supervised Learning: Decision Tree
• A decision tree algorithm constructs a tree structure in a top-down, recursive, divide-and-conquer manner.
• Example: car insurance risk assessment (attributes → answers, with a visualization of the decision boundaries).

  Age  Car Type  Risk
  23   family    High
  17   sports    High
  43   sports    High
  68   family    Low
  32   truck     Low
  20   family    High

  Learned tree: Age < 25? yes → Risk: High; no → Sports car? yes → Risk: High; no → Risk: Low.

Supervised Learning: Bayesian Classification
• Bayesian classification is based on Bayes' theorem, and it can predict class membership probabilities.
• Bayes' theorem (X = data sample, H = hypothesis of the data label):

  P(H|X) = P(X|H) P(H) / P(X)

  where P(H|X) is the posterior probability and P(H) is the prior probability.
• Classification chooses the maximum a posteriori hypothesis.

Statistical Models: Linear Discriminant
• Linear discriminant functions form boundaries between data classes: a linear discriminant function g(x) = w^T x + w_0, or a quadratic discriminant function with second-order terms.
• Finding linear discriminant functions is achieved by minimizing a criterion error function.
• Finding the w coefficients: gradient descent procedures, Newton's algorithm.

Neural Networks
• A neural network is a set of connected input/output units where each connection has a weight associated with it.
• Phase I: learning – adjust the weights so that the network accurately predicts the class labels of the input samples.
• Phase II: classification – assign a label by passing an unknown sample through the network.
• Issue: interpretation of the trained network.
• Steps (a minimal sketch follows below):
  1. Initialize weights from [-1, 1].
  2. Propagate the inputs forward.
  3. Backpropagate the error.
  4. Terminate learning (training) if (a) the weight change is below a threshold, or (b) the percentage of misclassified samples is below a threshold, or (c) the maximum number of iterations has been exceeded.
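The following is a minimal backpropagation sketch, an assumed implementation that mirrors the four training steps above: one hidden layer of sigmoid units trained on the classic XOR toy problem with hypothetical data.

```python
# Backpropagation sketch following the slide's steps 1-4:
# one hidden layer, sigmoid units, squared-error gradient.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(a):
    # Append a constant-1 column so each layer learns a bias weight.
    return np.hstack([a, np.ones((a.shape[0], 1))])

rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])  # XOR labels

# Step 1: initialize weights from [-1, 1].
W1 = rng.uniform(-1, 1, size=(3, 4))  # 2 inputs + bias -> 4 hidden units
W2 = rng.uniform(-1, 1, size=(5, 1))  # 4 hidden + bias -> 1 output

lr, tol, max_iter = 0.5, 1e-5, 20000
for _ in range(max_iter):                 # Step 4(c): iteration cap
    # Step 2: propagate the inputs forward.
    h = sigmoid(add_bias(X) @ W1)
    out = sigmoid(add_bias(h) @ W2)
    # Step 3: backpropagate the error.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2[:-1].T) * h * (1 - h)  # drop the bias row of W2
    dW2, dW1 = add_bias(h).T @ d_out, add_bias(X).T @ d_h
    W2 -= lr * dW2
    W1 -= lr * dW1
    # Step 4(a): stop when the weight updates become negligible.
    if lr * max(np.abs(dW1).max(), np.abs(dW2).max()) < tol:
        break

print((out > 0.5).astype(int).ravel())  # ideally [0 1 1 0]
```

Training is sensitive to the random initialization; a different seed or more iterations may be needed for the network to escape a poor local minimum on XOR.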
Support Vector Machines (SVM)
• The SVM algorithm finds a separating hyperplane with the largest margin and uses it for classification of new samples.

DATABASE TECHNIQUES AND OPTIMIZATION TECHNIQUES

Database Techniques
• Database design and modeling (tables, procedures, functions, constraints)
• Database interface to the data mining system
• Efficient import and export of data
• Database data visualization
• Database clustering for access efficiency
• Database performance tuning (memory usage, query encoding)
• Database parallel processing (multiple servers and CPUs)
• Distributed information repositories (data warehouse)

Optimization Techniques
• Highly nonlinear search space (global versus local maxima):
  — Gradient-based optimization
  — Genetic-algorithm-based optimization
  — Optimization with sampling
• Large search space
  — Example: A genome with N genes can encode 2^N states (active or inactive states; regulation is not considered). Human genome: ~2^30,000 patterns; nematode genome: ~2^20,000 patterns.

VISUALIZATION

Visualization
• Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
• Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution
[Slide figures: pie chart, parallel coordinates, temporal evolution]

Novel Visualization of Features
• Feature selection and visualization
[Slide figures: feature selection, mean feature image]

Novel Visualization of Clustering Results
• Class labeling and visualization with Isodata (k-means) clustering
[Slide figures: mean feature image, label image]

VALIDATION

Why Validation?
• Validation type:
  — Within the existing data
  — With newly collected data
• Errors and uncertainties:
  — Systematic or random errors
  — Unknown variables: number of classes
  — Noise level: statistical confidence due to noise
  — Model validity: error measure, model over-fit or under-fit
  — Number of data points: measurement replicas
• Other issues:
  — Experimental support of general theories
  — Exhaustive sampling is not feasible

Error Detection: Example of Spot Screening
[Slide figures: mask image with no screening, mask image after location and size screening, mask image after SNR screening]

Cross Validation: Example
• One-tier cross validation
  — Train on different data than the test data.
• Two-tier cross validation (a minimal sketch appears at the end of the deck)
  — The score from one-tier cross validation is used by the bias optimizer to select the best learning-algorithm parameters (e.g., the number of control points). The more you optimize, the more you over-fit; the second tier measures the level of over-fit (an unbiased measure of accuracy).
  — Useful for comparing learning algorithms whose control parameters are optimized.
  — The number of folds is not optimized.
• Computational complexity:
  — (# folds of top tier) × (# folds of bottom tier) × (# control points) × (CPU time of the algorithm)

Summary
• Bioinformatics and the microarray problem
  — Interdisciplinary challenges: terminology
  — Understanding biology and computer science
• Data mining and image analysis steps
  — Image analysis
  — Experiment design as prior knowledge
  — Expected results of data mining
  — Which data mining technique to use?
  — Data mining challenges: complexity, data size, search space
• Validation
  — Confidence in obtained results?
  — Error screening
  — Cross validation techniques

Backup
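As supplementary material, here is a minimal sketch of the two-tier cross-validation scheme described above, assuming scikit-learn and synthetic data: the bottom tier tunes a control parameter, and the top tier yields an unbiased accuracy estimate of the tuned model.

```python
# Two-tier (nested) cross-validation sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Bottom tier: optimize the control parameter C by cross validation.
inner = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=3)

# Top tier: measure the over-fit of the tuned model on held-out folds.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Unbiased accuracy: {outer_scores.mean():.2f} "
      f"+/- {outer_scores.std():.2f}")

# Cost ~ (#top-tier folds) x (#bottom-tier folds) x (#control points)
# x (CPU time of the algorithm), matching the slide's estimate.
```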