What Data Mining Methods May Help Bio-Informatics?

Jiawei Han
Database Systems Research Lab
Department of Computer Science
University of Illinois at Urbana-Champaign, U.S.A.
http://www.cs.uiuc.edu/~hanj
Data Mining & Bio-Informatics
May 22, 2017

Bio-informatics and Data Mining

* Data mining: search for or discovery of patterns and knowledge hidden in data
* Biomedical/DNA data mining
  * Biological data is abundant and information rich (e.g., gene chips, bio-testing data)
  * It is critical to find correlations, linkages between disease and gene sequences, classification, clustering, outliers, etc.
  * Lots of challenges and new techniques can be developed: a field yet to be explored

Biomedical Data Mining and DNA Analysis

* DNA sequences
  * Four basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T)
  * Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
  * Humans have around 30,000 genes
  * Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
* DNA micro-arrays and protein arrays have accumulated a tremendous amount of data related to patients and diseases

What Data Mining Methods May Help Bio-Informatics?

* Semantic integration of heterogeneous, distributed genome databases
* Discovery of tandem repeats: BLAST and beyond
* Similarity search in genome databases
* Association, correlation, and linkage analysis
* Fault-tolerant sequential and structured pattern mining
* Advanced classification techniques
* Cluster analysis and outlier detection
* Multi-dimensional data mining environments
* Visual data mining
* Invisible data mining

Semantic Integration of Heterogeneous, Distributed Genome Databases

* Current situation: highly distributed, uncontrolled generation and use of a wide variety of DNA data
* Semantic integration of different genome databases is a critical task
* It is highly desirable to build Web-based, integrated, multi-dimensional genome databases
* Data cleaning and data integration methods developed in data mining/data warehousing will help

Discovery and Comparison of DNA Sequences

* Finding tandem repeats
  * Fault-tolerant sequential patterns (Is BLAST enough?)
  * Example: CACAC CACAC CACAC CACAC AC
* Similarity search and comparison among DNA sequences
  * Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
  * Query-based: identify gene sequence patterns that play roles in various diseases
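The tandem-repeat example on the DNA sequence slide (CACAC repeated four times, followed by AC) can be located with a simple exact scan. The sketch below is only an illustration under stated assumptions: it finds exact repeats, so it does not address the fault-tolerant patterns the slide asks about, and it is not how BLAST works. The name find_tandem_repeats and the parameters max_unit and min_copies are hypothetical.

```python
def find_tandem_repeats(seq, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats in seq.

    Hypothetical helper: at each position, try every unit length up to
    max_unit and keep the repeat that covers the longest span.
    """
    hits = []
    i, n = 0, len(seq)
    while i < n:
        best = None
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:          # ran off the end of the sequence
                break
            copies = 1
            # Count how many consecutive copies of unit follow position i.
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies:
                if best is None or copies * k > best[2] * len(best[1]):
                    best = (i, unit, copies)
        if best:
            hits.append(best)
            i += best[2] * len(best[1])   # skip past the reported repeat
        else:
            i += 1
    return hits


slide_example = "CACAC" * 4 + "AC"
print(find_tandem_repeats(slide_example))
```

On the slide's string this reports one repeat of the unit CACAC with four copies starting at position 0; the trailing AC is too short to count. Tolerating mismatches would need an approximate comparison in place of the `==` test.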
Similarity Search in Multimedia Data

* Description-based retrieval systems
  * Build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation
  * Labor-intensive if performed manually
  * Results are typically of poor quality if automated
* Content-based retrieval systems
  * Support retrieval based on the image content, such as color histogram, texture, shape, objects, and wavelet transforms

Approaches Based on Image Signature

* Color histogram-based signature
  * The signature includes color histograms based on the color composition of an image, regardless of its scale or orientation
  * No information about shape, location, or texture
  * Two images with similar color composition may contain very different shapes or textures, and thus could be completely unrelated in semantics
* Multifeature composed signature
  * Define different distance functions for color, shape, location, and texture, and subsequently combine them to derive the overall result

One Signature for the Entire Image?

* WALRUS [NRS99], by Natsev, Rastogi, and Shim
  * Similar images may contain similar regions, but a region in one image could be a translation or scaling of a matching region in the other
  * Wavelet-based signature with region-based granularity
    * Define regions by clustering signatures of windows of varying sizes within the image
    * The signature of a region is the centroid of the cluster
  * Similarity is defined in terms of the fraction of the area of the two images covered by matching pairs of regions from the two images

Similarity Search in Time-Series Analysis

* A normal database query finds exact matches
* Similarity search finds data sequences that differ only slightly from the given query sequence
* Two categories of similarity queries
  * Whole matching: find sequences that are similar, as a whole, to the query sequence
  * Subsequence matching: find the subsequences of longer sequences that are similar to the query sequence
* Typical applications
  * Financial market
  * Market basket data analysis
  * Scientific databases
  * Medical diagnosis

Similar time series analysis

[Figure]

Similar time series analysis

[Figure: VanEck International Fund and Fidelity Selective Precious Metal and Mineral Fund, two similar mutual funds in different fund groups]
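The two kinds of similarity query on time series can be sketched in a few lines. This is a minimal illustration, assuming plain Euclidean distance and a user-chosen tolerance eps; real systems typically add normalization and indexing, which are not shown. All function names and the sample numbers are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def whole_matches(query, series_db, eps):
    """Names of stored series of the query's length that lie within eps of it."""
    return [name for name, s in series_db.items()
            if len(s) == len(query) and euclidean(query, s) <= eps]


def subsequence_matches(query, series, eps):
    """Start offsets where a window of a longer series lies within eps of the query."""
    m = len(query)
    return [i for i in range(len(series) - m + 1)
            if euclidean(query, series[i:i + m]) <= eps]


# Made-up numbers, for illustration only.
query = [1, 2, 3]
print(whole_matches(query, {"a": [1, 2, 3.1], "b": [5, 5, 5]}, eps=0.5))
print(subsequence_matches(query, [0, 1, 2, 3, 0, 1.1, 2, 2.9], eps=0.2))
```

Whole matching compares sequences end to end, while subsequence matching slides the query along a longer sequence and reports every window that is close enough.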
Rule Measures: Support and Confidence

* Find all the rules X ^ Y => Z with minimum confidence and support
  * support, s: probability that a transaction contains {X, Y, Z}
  * confidence, c: conditional probability that a transaction having {X, Y} also contains Z

[Figure: Venn diagram of "customer buys beer", "customer buys diaper", and "customer buys both"]

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

* Let minimum support be 50% and minimum confidence be 50%; then we have
  * A => C (50%, 66.6%)
  * C => A (50%, 100%)

Association Rule Mining: A Road Map

* Boolean vs. quantitative associations (based on the types of values handled)
  * buys(x, “SQLServer”) ^ buys(x, “DMBook”) => buys(x, “DBMiner”) [0.2%, 60%]
  * age(x, “30..39”) ^ income(x, “42..48K”) => buys(x, “PC”) [1%, 75%]
* Single-dimensional vs. multi-dimensional associations (see the examples above)
* Single-level vs. multiple-level analysis
  * What brands of beers are associated with what brands of diapers?
* Various extensions
  * Correlation, causality analysis
    * Association does not necessarily imply correlation or causality
  * Maxpatterns and closed itemsets
  * Constraints enforced
    * E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

Construct FP-tree from a Transaction DB

min_support = 0.5

TID   Items bought                 (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o}              {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

Steps:

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency-descending order
3. Scan DB again, construct the FP-tree

Header table:

Item   Frequency
f      4
c      4
a      3
b      3
m      3
p      3

FP-tree (children indented under their parent):

{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Classification of Constraints

* Monotone
* Antimonotone
* Succinct
* Strongly convertible
* Convertible anti-monotone
* Convertible monotone
* Inconvertible

Association and Path Analysis in Biomedical and DNA Data Mining

* Association analysis: identification of co-occurring gene sequences
  * Most diseases are not triggered by a single gene but by a combination of genes acting together
  * Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
* Path analysis: linking genes to different disease development stages
  * Different genes may become active at different stages of the disease
  * Develop pharmaceutical interventions that target the different stages separately
* Visualization tools and genetic data analysis

What Is Sequential Pattern Mining?

* Given a set of sequences, find the complete set of frequent subsequences
* A sequence: <(ef)(ab)(df)cb>
* A sequence database:

SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

* An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
* <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
* Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

Pair-wise Checking Using S-matrix

* SDB: the sequence database shown above (SIDs 10, 20, 30, 40)
* Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
* S-matrix:

     a          b          c          d          e          f
a    2
b    (4, 2, 2)  1
c    (4, 2, 1)  (3, 3, 2)  3
d    (2, 1, 1)  (2, 2, 0)  (1, 3, 0)  0
e    (1, 2, 1)  (1, 2, 0)  (1, 2, 0)  (1, 1, 0)  0
f    (2, 1, 1)  (2, 2, 0)  (1, 2, 1)  (1, 1, 1)  (2, 0, 1)  1

* Reading the entries: the diagonal cell for a is 2 because <aa> happens twice; the cell in row c, column a is (4, 2, 1) because <ac> happens 4 times, <ca> happens twice, and <(ac)> happens once
* All length-2 sequential patterns are found in the S-matrix

Constraint-Based Sequential Pattern Mining

* Constraint-based sequential pattern mining
  * Constraints: user-specified, for focused mining of desired patterns
  * How to explore efficient mining with constraints? Optimization
* Classification of constraints
  * Anti-monotone: e.g., value_sum(S) < 150, min(S) > 10
  * Monotone: e.g., count(S) > 5, S ⊇ {PC, digital_camera}
  * Succinct: e.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office, MS/Money}
  * Convertible: e.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) - min(S) > 5
  * Inconvertible: e.g., avg(S) - median(S) = 0

From Sequential Patterns to Structured Patterns

* Sets, sequences, trees and other structures
  * Transaction DB: sets of items: {{i1, i2, …, im}, …}
  * Seq. DB: sequences of sets: {<{i1, i2}, …, {im, in, ik}>, …}
  * Sets of sequences: {{<i1, i2>, …, <im, in, ik>}, …}
  * Sets of trees (each element being a tree): {t1, t2, …, tn}
* Applications: mining structured patterns in XML documents
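The subsequence and support definitions used in the sequential pattern slides can be checked mechanically against the four-sequence database. The sketch below is a hedged illustration, not a mining algorithm: it only counts the support of a pattern that is already given. The helper names seq, is_subsequence, and support are hypothetical, and single-letter items are assumed so that a string such as "abc" can stand for the element (abc).

```python
def seq(*elements):
    """Build a sequence: each string is one element (an unordered item set)."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern, sequence):
    """True if every pattern element is contained in a later, distinct element."""
    pos = 0
    for element in pattern:
        # Greedily advance to the first element of sequence that contains it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database.values())


# The sequence database from the slides.
SDB = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

print(is_subsequence(seq("a", "bc", "d", "c"), SDB[10]))  # <a(bc)dc> in SID 10
print(support(seq("ab", "c"), SDB))                       # support of <(ab)c>
```

With min_sup = 2, <(ab)c> qualifies because it is contained in sequences 10 and 30. The same function gives the S-matrix counts for a and c: 2 for <aa>, 4 for <ac>, 2 for <ca>, and 1 for <(ac)>.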
Classification Methods

* Decision tree induction
* Bayesian classification
* Classification by neural networks
* Classification by Support Vector Machines (SVM)
* Classification based on concepts from association rule mining
* Other classification methods

Output: A Decision Tree for “buys_computer”

age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes

Classification in MultiMediaMiner

[Figure]

Bayesian Belief Network: An Example

* Variables in the network: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea
* The conditional probability table for the variable LungCancer shows the conditional probability for each possible combination of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid \mathrm{Parents}(Z_i))$

Multi-Layer Perceptron

[Figure: the input vector x_i feeds the input nodes, which connect through weights w_ij to the hidden nodes and then to the output nodes, which produce the output vector]

* Net input to unit j: $I_j = \sum_i w_{ij} O_i + \theta_j$
* Output of unit j: $O_j = \frac{1}{1 + e^{-I_j}}$
* Error at an output node: $Err_j = O_j (1 - O_j)(T_j - O_j)$
* Error at a hidden node: $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$
* Weight update: $w_{ij} = w_{ij} + (l)\, Err_j O_i$
* Bias update: $\theta_j = \theta_j + (l)\, Err_j$

Linear Classification

[Figure: points of class ‘x’ and class ‘o’ separated by a red line]

* Binary classification problem
* The data above the red line belongs to class ‘x’
* The data below the red line belongs to class ‘o’
* Examples: SVM, Perceptron, Winnow, probabilistic classifiers

SVM – Support Vector Machines

[Figure: a separating line with a small margin, a separating line with a large margin, and the support vectors]

Association-Based Classification

* Several methods for association-based classification
  * ARCS: quantitative association mining and clustering of association rules (Lent et al’97)
    * It beats C4.5 in (mainly) scalability and also accuracy
  * Associative classification (Liu et al’98)
    * It mines high-support and high-confidence rules in the form of “cond_set => y”, where y is a class label
  * CAEP (Classification by Aggregating Emerging Patterns) (Dong et al’99)
    * Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another
    * Mine EPs based on minimum support and growth rate

The k-Nearest Neighbor Algorithm

* All instances correspond to points in the n-D space
* The nearest neighbors are defined in terms of Euclidean distance
* The target function could be discrete- or real-valued
* For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to xq
* Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

[Figure: a query point xq surrounded by training examples labeled + and _]
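The k-nearest neighbor rule for a discrete-valued target can be written directly from the description in the k-NN slide. This is a minimal sketch under stated assumptions: Euclidean distance, a brute-force scan of all training examples, and ties between labels broken by whichever label was counted first. The function name knn_classify and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(xq, examples, k=3):
    """Return the most common label among the k training examples nearest to xq.

    examples is a list of (point, label) pairs; points are numeric tuples.
    """
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], xq))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training points in 2-D, for illustration only.
training = [
    ((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
    ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-"),
]

print(knn_classify((1.5, 1.5), training, k=3))
print(knn_classify((4, 4), training, k=1))
```

With k = 1 the function reduces to the 1-NN rule, whose decision regions are the cells of the Voronoi diagram of the training points. For a real-valued target, the vote would be replaced by an average of the neighbors' values.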
Cluster Analysis and Outlier Detection

* Partitioning methods
  * K-means and k-medoids algorithms
* Hierarchical methods
* Density-based methods
* Grid-based methods
* Model-based clustering methods
* Constraint-based clustering
* Outlier analysis

The K-Means Clustering Method: Example (K = 2)

1. Arbitrarily choose K objects as the initial cluster centers
2. Assign each object to the most similar center
3. Update the cluster means
4. Reassign the objects and update the cluster means again, repeating until no assignment changes

[Figure: scatter plots on 0 to 10 axes showing the clusters after each step]

Typical k-medoids algorithm (PAM), K = 2

1. Arbitrarily choose k objects as the initial medoids
2. Assign each remaining object to the nearest medoid (total cost = 20 in the example)
3. Randomly select a non-medoid object, O_random
4. Compute the total cost of swapping a medoid O with O_random (total cost = 26 in the example)
5. Swap O and O_random if the quality is improved
6. Loop over steps 3 to 5 until there is no change

[Figure: scatter plots on 0 to 10 axes showing the clusters at each step]

Hierarchical Clustering

* Uses a distance matrix as the clustering criterion
* This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: agglomerative clustering (AGNES) runs from step 0 to step 4, merging a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde; divisive clustering (DIANA) runs the same steps in reverse, from step 4 to step 0]

CF Tree

[Figure: a CF tree with B = 7 and L = 6. The root holds entries CF1, CF2, CF3, ..., CF6, each with a child pointer (child1 to child6); a non-leaf node holds CF1, CF2, CF3, ..., CF5 with child pointers (child1 to child5); leaf nodes hold CF entries and are chained by prev and next pointers]

CURE (Clustering Using REpresentatives)

* CURE: proposed by Guha, Rastogi & Shim, 1998
  * Stops the creation of a cluster hierarchy if a level consists of k clusters
  * Uses multiple representative points to evaluate the distance between clusters, adjusts well to arbitrarily shaped clusters, and avoids the single-link effect

Overall Framework of CHAMELEON

Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partition -> Final Clusters

DBSCAN: Density Based Spatial Clustering of Applications with Noise

* Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
* Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points, with Eps = 1cm and MinPts = 5]

[Figure: reachability-distance (possibly undefined) plotted against the cluster-order of the objects]

Density-Based Cluster Analysis: OPTICS & Its Applications

[Figure]

Clustering and Distribution Density Functions: Density Attractor

[Figure]

Center-Defined and Arbitrary Shaped

[Figure]

[Figure: plots of salary (10,000) and vacation (week) against age]

STING: A Statistical Information Grid Approach

* Wang, Yang and Muntz (VLDB’97)
* Each cell stores the statistical distribution of a measure at a low level
* Multi-level resolution

WaveCluster

* G. Sheikholeslami, et al. (1998)
* Multiple wavelet transformation-based cluster analysis

Constraint-Based Clustering: Planning ATM Locations

[Figure, left panel: spatial data with obstacles (a river and a mountain). Right panel: clusters C1 to C4 obtained by clustering without taking obstacles into consideration]

Clustering with Spatial Obstacles

[Figure, left panel: not taking obstacles into account. Right panel: taking obstacles into account]

Multidimensional Data and Data Cubes

* Sales volume as a function of product, month, and region
* Dimensions: Product, Location, Time
* Hierarchical summarization paths
  * Product: Industry > Category > Product
  * Location: Region > Country > City > Office
  * Time: Year > Quarter > Month or Week > Day

Mining Multimedia Databases in

[Figure]

Mining and Explorative Analysis of Data Cubes (and Multi-Dimensional Databases)

* Efficient computation of data or iceberg cubes
* Discovery-driven data cube analysis
* Cube-gradient analysis
  * What are the changes in the average house value in Silicon Valley in 2001 compared with 2000?
  * Under what conditions did the average house value increase by 10% per year in the Chicago area in the 1990s?
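The cube-gradient questions above combine two operations: rolling facts up to a coarser level of a dimension hierarchy, and then comparing an aggregate between two cells. The sketch below shows both on a tiny in-memory table. It is an illustration only: the facts are made-up numbers, the dimension names and the functions roll_up and gradient are hypothetical, and no efficient cube computation is attempted.

```python
from collections import defaultdict

# Hypothetical toy facts: (region, city, year, house_value). Values are made up.
FACTS = [
    ("West", "San Jose", 2000, 500),
    ("West", "San Jose", 2001, 560),
    ("West", "Palo Alto", 2000, 700),
    ("West", "Palo Alto", 2001, 760),
    ("Midwest", "Chicago", 2000, 300),
    ("Midwest", "Chicago", 2001, 309),
]
DIMENSIONS = ("region", "city", "year")


def roll_up(facts, dims):
    """Average house_value grouped by the named dimensions (one cuboid)."""
    idx = [DIMENSIONS.index(d) for d in dims]
    sums = defaultdict(lambda: [0.0, 0])
    for row in facts:
        key = tuple(row[i] for i in idx)
        sums[key][0] += row[3]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def gradient(cuboid, year_from, year_to):
    """Relative change of the average between two years.

    Assumes year is the last dimension of every key in the cuboid.
    """
    changes = {}
    for key, avg in cuboid.items():
        *rest, year = key
        base_key = (*rest, year_from)
        if year == year_to and base_key in cuboid:
            changes[tuple(rest)] = (avg - cuboid[base_key]) / cuboid[base_key]
    return changes


by_region_year = roll_up(FACTS, ("region", "year"))   # city rolled up to region
print(gradient(by_region_year, 2000, 2001))
```

Rolling up from city to region first and then taking the gradient answers a question of the first kind (how did the regional average change between two years). The second kind of question would search the cells for those whose gradient exceeds a threshold such as 10%.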
Visual Data Mining & Data Visualization

* Integration of visualization and data mining
  * Data visualization
  * Data mining result visualization
  * Data mining process visualization
  * Interactive visual data mining
* Data visualization
  * Data in a database or data warehouse can be viewed
    * at different levels of abstraction
    * as different combinations of attributes or dimensions
  * Data can be presented in various visual forms

Data Mining Result Visualization

* Presentation of the results or knowledge obtained from data mining in visual forms
* Examples
  * Scatter plots and boxplots (obtained from descriptive data mining)
  * Decision trees
  * Association rules
  * Clusters
  * Outliers
  * Generalized rules

Boxplots from Statsoft: Multiple Variable Combinations

[Figure]

Visualization of Data Mining Results in SAS Enterprise Miner: Scatter Plots

[Figure]

Visualization of Association Rules in SGI/MineSet 3.0

[Figure]

Visualization of a Decision Tree in SGI/MineSet 3.0

[Figure]

Visualization of Cluster Grouping in IBM Intelligent Miner

[Figure]

Data Mining Process Visualization

* Presentation of the various processes of data mining in visual forms so that users can see
  * The data extraction process
  * Where the data is extracted
  * How the data is cleaned, integrated, preprocessed, and mined
  * The method selected for data mining
  * Where the results are stored
  * How they may be viewed

Visualization of Data Mining Processes by Clementine

* See your solution discovery process clearly
* Understand variations with visualized data

[Figure]

Interactive Visual Data Mining

* Using visualization tools in the data mining process to help users make smart data mining decisions
* Example
  * Display the data distribution in a set of attributes using colored sectors or columns (depending on whether the whole space is represented by a circle or a set of columns)
  * Use the display to decide which sector should be selected first for classification and where a good split point for this sector may be

Interactive Visual Mining by Perception-Based Classification (PBC)

[Figure]

Audio Data Mining

* Uses audio signals to indicate the patterns of data or the features of data mining results
* An interesting alternative to visual mining
* The inverse task of mining audio (such as music) databases, which is to find patterns from audio data
* Visual data mining may disclose interesting patterns using graphical displays, but requires users to concentrate on watching patterns
* Instead, transform patterns into sound and music and listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual
Invisible Data Mining

* Embed mining functions into information services
  * Web search engine (link analysis, authoritative pages, user profiles), adaptive web sites, etc.
  * Improvement of query processing: history + data
  * Making service smart and efficient
* Benefits from/to data mining research
  * Data mining research has produced many scalable, efficient, novel mining solutions
  * Applications feed new challenge problems to research
* Can we make bio-informatics based data mining invisible?

Conclusions

* Data mining and bio-informatics: both are young and promising disciplines
* Data mining: a confluence of multiple disciplines: database, data warehouse, machine learning, statistics, high performance computing, bio-technology, etc.
* Lots of research issues: need biologists and computer scientists working together

http://www.cs.uiuc.edu/~hanj

Thank you!