DATA MINING: Algorithms, Applications and Beyond
Chandan K. Reddy
Department of Computer Science, Wayne State University, Detroit, MI 48202

Organization
- Introduction
- Basic Components
- Fundamental Topics: Classification, Clustering, Association Analysis
- Research Topics: Probabilistic Graphical Models, Boosting Algorithms, Active Learning, Mining under Constraints
- Teaching

Lots of Data ...
- Customer transactions, bioinformatics, banking, Internet / Web, biomedical imaging

So What?
- Computers have become cheaper and more powerful, so storage is not an issue.
- There is often information "hidden" in the data that is not readily evident.
- Human analysts may take weeks to discover useful information.
- Much of the data is never analyzed at all.
- "We are drowning in data, but starving for knowledge!"

Data Mining is ...
- "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data"
- "the science of extracting useful information from large data sets or databases" - Wikipedia.org
- A more appropriate term would be Knowledge Discovery in Databases (KDD).

Steps in the KDD Process
1. Data Cleaning (removal of noise and inconsistent records)
2. Data Integration (combining multiple sources)
3. Data Selection (only data relevant for the task are retrieved from the database)
4. Data Transformation (converting data into a form more appropriate for mining)
5. Data Mining (application of intelligent methods in order to extract data patterns)
6. Model Evaluation (identification of truly interesting patterns representing knowledge)
7. Knowledge Presentation (visualization or other knowledge presentation techniques)

What can Data Mining do?
- Figure out intelligent ways of handling the data.
- Find valuable information hidden in large volumes of data.
- Analyze the data and find patterns and regularities in it.
- Mining analogy: in a mining operation, large amounts of low-grade material are sifted through in order to find something of value.
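The KDD steps above can be sketched as a minimal pipeline. Every function name and the toy records below are illustrative inventions, not from the talk:

```python
from collections import Counter

# Minimal, illustrative sketch of the KDD pipeline.
def clean(records):
    # Data Cleaning: drop records with missing values (noise removal).
    return [r for r in records if None not in r.values()]

def select(records, fields):
    # Data Selection: keep only the attributes relevant to the task.
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    # Data Transformation: bucket age in years into decades for mining.
    return [{**r, "age": r["age"] // 10} for r in records]

def mine(records):
    # Data Mining: a trivial "pattern" -- the most common age decade.
    return Counter(r["age"] for r in records).most_common(1)[0]

raw = [{"age": 34, "city": "Detroit"},
       {"age": 37, "city": None},
       {"age": 31, "city": "Ann Arbor"}]
pattern = mine(transform(select(clean(raw), ["age"])))
print(pattern)  # (decade, count) -> (3, 2): two cleaned records in their 30s
```

The record with a missing city is removed in the cleaning step, so only two records reach the mining step.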
- Identify abnormal or suspicious activities.
- Provide guidelines to humans: what to look for in a dataset.

Related CS Topics
- Data mining draws on Pattern Recognition, Database Systems, Artificial Intelligence, Machine Learning, Visualization, Optimization, Algorithms, and Statistics.

Typical Data Mining Tasks
- Prediction methods (you know what to look for): use some variables to predict unknown or future values of other variables.
- Description methods (you don't know what to look for): find human-interpretable patterns that describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

Basic Components
- Data Pre-processing, Data Visualization, Model Evaluation, Classification, Clustering, Association Analysis

Different Kinds of Data
- Record data: data matrix, document data, transaction data
- Graph data
- Ordered data: temporal data, sequence data, spatio-temporal data

Record Data
- Data that consists of a collection of records, each of which consists of a fixed set of attributes.

Document Data
- Each document becomes a 'term' vector: each term is a component (attribute) of the vector, and the value of each component is the number of times the corresponding term occurs in the document.

Transaction Data
- A special type of record data, where each record (transaction) involves a set of items.
- Example: the set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph Data
- Data with relationships among objects. Examples: (a) generic web data, (b) citation data analysis.

Ordered Data
- Time-series data: a series of measurements taken over a certain time frame, e.g. financial data.
- Sequence data: no time stamps, but order is still important, e.g.
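Transaction data like the TID table above is commonly encoded as the binary data matrix described under record data: one row per transaction, one column per item. A minimal sketch (the column ordering is an illustrative choice):

```python
# Encode the five example transactions as a binary item matrix.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
# The fixed attribute set: every item seen in any transaction, sorted.
items = sorted(set().union(*transactions))
# matrix[j][i] is 1 if transaction j contains item i, else 0.
matrix = [[int(item in t) for item in items] for t in transactions]
print(items)      # ['Beer', 'Bread', 'Coke', 'Diaper', 'Milk']
print(matrix[0])  # transaction 1 (Bread, Coke, Milk) -> [0, 1, 1, 0, 1]
```

This turns set-valued transaction data into ordinary record data with a fixed set of attributes, at the cost of a sparse matrix when the item catalog is large.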
genome data:

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

- Spatio-temporal data: average monthly temperature of land and ocean collected for a variety of geographical locations (a total of 250,000 data points).

Data Pre-Processing
- Removal of noise and outliers.
- Sampling is employed for data selection (processing the entire data set might be expensive).
- Curse of dimensionality: dealing with high-dimensional data.
- Data normalization: different features have different value ranges, e.g. human age, height, weight.
- Feature selection: remove unnecessary (redundant or irrelevant) features; this will improve the performance of mining.

Data Visualization
- Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
- Examples: histograms, pie charts, scatter plots (array of Iris attributes), contour plots (e.g. temperature in Celsius), parallel coordinate plots for the Iris data, Chernoff faces for the Iris data (Setosa, Versicolour, Virginica).
- A sample data cube: sales by date (1Qtr-4Qtr), country (U.S.A., Canada, Mexico), and product (TV, PC, VCR); e.g. total annual sales of TVs in the U.S.A.

Classification
- Training phase: a training algorithm learns a model from existing data.
- Testing phase: the learned model is applied to new data.
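The data normalization step above can be sketched with min-max scaling, which rescales each feature to [0, 1] so that age, height, and weight contribute on comparable scales. The feature values below are made up for illustration:

```python
# Min-max normalization of (age, height in cm, weight in kg) records.
rows = [(25, 180, 80.0), (40, 165, 60.0), (65, 172, 90.0)]

def min_max(column):
    # Rescale one feature column to the range [0, 1].
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

columns = list(zip(*rows))                        # one tuple per feature
scaled = list(zip(*[min_max(list(c)) for c in columns]))
print(scaled[0])  # person 1: youngest (0.0), tallest (1.0), mid weight
```

Without this step, a distance-based method would be dominated by height in centimeters simply because its raw numbers are larger.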
Classification Models
An example decision tree:

Outlook?
- Sunny -> Humidity? (High -> No; Normal -> Yes)
- Overcast -> Yes
- Rainy -> Windy? (True -> No; False -> Yes)

Metrics for Performance Evaluation

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     a (TP)       b (FN)
CLASS     Class=No      c (FP)       d (TN)

The most widely used metric:

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Evaluating Data Mining Techniques
- Predictive accuracy (ability of a model to predict the future) or descriptive quality (ability of a model to find meaningful descriptions of the data, e.g. clusters)
- Speed (computational cost involved in generating and using the model)
- Robustness (ability of a model to work well even with noisy or missing data)
- Scalability (ability of a model to scale up well with large amounts of data)
- Interpretability (level of understanding and insight provided by the model)

Clustering
- No class labels, so no prediction; finds groupings in the data (descriptive).
- Can be used to summarize the data and can help in removing outliers and noise.
- Applications: image segmentation, document clustering, gene expression data, etc.

Association Analysis
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} -> {Beer}, {Milk, Bread} -> {Eggs, Coke}, {Beer, Bread} -> {Milk}
Implication means co-occurrence, not causality!
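Rules such as {Diaper} -> {Beer} are usually ranked by support and confidence, the standard association-analysis measures (they are not defined on the slide itself). A minimal sketch on the market-basket transactions above:

```python
# Support and confidence for the rule {Diaper} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diaper", "Beer"}))       # 3 of 5 transactions -> 0.6
print(confidence({"Diaper"}, {"Beer"}))  # 3 of the 4 Diaper baskets (~0.75)
```

High confidence still only measures co-occurrence: buying diapers and buying beer appearing together says nothing about one causing the other.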
Probabilistic Graphical Models
- Real-world data is very complicated; we would like to understand the underlying distribution that generated the data.
- If the distribution is unimodal, the problem is easy to solve; usually, however, it is multimodal.
- Parameter estimation: modeling with probabilistic graphical models such as mixture models, hidden Markov models, mixture-of-experts, Bayesian networks, mixtures of factor analyzers, neural networks, and so on.
- We do not want sub-optimal models: finding a good one is like "searching for a needle in a haystack".

Problems with Local Optimization
- Local methods suffer from limited "fine-tuning" capability; there is a need for a method that explores a subspace in a systematic manner.
- TRUST-TECH approach: systematic tier-by-tier search.

Mixture Models
Let x = [x_1, x_2, ..., x_d]^T be the d-dimensional feature vector, and assume K components in the mixture model:

    p(x | Θ) = Σ_{i=1}^{K} α_i p(x | θ_i),   α_i ≥ 0,   Σ_{i=1}^{K} α_i = 1,

where Θ = {α_1, ..., α_K, θ_1, ..., θ_K} is the collection of parameters. For Gaussian components with θ_i = (μ_i, σ_i²), shown here in one dimension:

    p(x | θ_i) = (1 / √(2π σ_i²)) exp( -(x - μ_i)² / (2σ_i²) ),   i = 1, 2, ..., K.

Maximum Likelihood Estimation
Let X = {x^(1), x^(2), ..., x^(n)} be a set of n i.i.d. samples. The log-likelihood is

    log p(X | Θ) = Σ_{j=1}^{n} log p(x^(j) | Θ) = Σ_{j=1}^{n} log Σ_{i=1}^{K} α_i p(x^(j) | θ_i).

Goal: find the Θ that maximizes the likelihood function,

    Θ̂_MLE = argmax_Θ log p(X | Θ).

Difficulty: (i) there is no closed-form solution, and (ii) the likelihood surface is highly nonlinear.

EM Algorithm
- Initialization: set the initial parameters Θ^(0).
- Iteration: iterate the following until convergence.
  - E-step: compute the Q-function, i.e. the expectation of the complete-data log-likelihood given the current parameters:

        Q(Θ, Θ^(t)) = E_Z[ log p(X, Z | Θ) | X, Θ^(t) ]

  - M-step: maximize the Q-function with respect to Θ:

        Θ^(t+1) = argmax_Θ Q(Θ, Θ^(t))

Nonlinear Transformation
Minimize f(x), f: R^n -> R, f ∈ C². The original function corresponds to the dynamical system

    dx(t)/dt = -∇f(x),

with a one-to-one correspondence of the critical points: local minimum <-> stable equilibrium point; saddle point <-> decomposition point; local maximum <-> source; likelihood function <-> energy function. [JCB '06]

Experimental Results [IEEE PAMI '08]

Finding Motifs using Probabilistic Models
Position count matrix (column k = b is the background; k = 1, ..., l are the motif positions):

        k=b    k=1    k=2    k=3    k=4    ...   k=l
{A}     C0,1   C1,1   C2,1   C3,1   C4,1   ...   Cl,1
{T}     C0,2   C1,2   C2,2   C3,2   C4,2   ...   Cl,2
{G}     C0,3   C1,3   C2,3   C3,3   C4,3   ...   Cl,3
{C}     C0,4   C1,4   C2,4   C3,4   C4,4   ...   Cl,4

Results
[Figure: alignment scores (roughly 120-200) for the motifs (20,6), (17,5), (15,4), (13,3), (11,2), comparing the original random-start results with the Tier-1 and Tier-2 improvements.]
Different motifs and the average score using random starts, with the first-tier and second-tier improvements. [BMC AMB '06]

Neural Network
- Inputs x_i, output y, weights w_ij, biases b_i, targets t.
- Architecture: n input nodes, 1 hidden layer with k hidden nodes, 1 output node.
- Cost function over Q training examples:

    C(w) = (1/Q) Σ_{i=1}^{Q} ( t^(i) - y(i, w, x) )²

Results: Classification Error (%) [IJCNN '07]

Dataset     | Train: Best BP | Train: TT+BP | Train: Improv.(%) | Test: Best BP | Test: TT+BP | Test: Improv.(%)
Cancer      |  2.21          |  1.74        |  27.01            |  3.95         |  2.63       |  50.19
Image       |  9.37          |  8.04        |  16.54            | 11.08         |  9.74       |  13.76
Ionosphere  |  2.35          |  0.57        | 312.28            | 10.25         |  7.96       |  28.77
Iris        |  1.25          |  1.00        |  25.00            |  3.33         |  2.67       |  24.72
Diabetes    | 22.04          | 20.69        |   6.52            | 23.83         | 20.58       |  15.79
Sonar       |  1.56          |  0.72        | 116.67            | 19.17         | 12.98       |  47.69
Wine        |  4.56          |  3.58        |  27.37            | 14.94         |  6.73       | 121.99

(TT+BP = TRUST-TECH followed by backpropagation; Best BP = best backpropagation-only result.)

Boosting Algorithms for Biomedical Imaging
- Training phase: training sets T_1, T_2, ..., T_S produce learned models h_1, h_2, ..., h_S, which are combined as
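For Gaussian mixtures, the E- and M-steps above have closed-form updates. A minimal one-dimensional, two-component sketch; the data, initialization, and iteration count are made up for illustration:

```python
import math
import random

def em_gmm_1d(xs, n_iter=50):
    # Fit p(x) = a[0]*N(m[0], v[0]) + a[1]*N(m[1], v[1]) by EM.
    a = [0.5, 0.5]          # mixing weights alpha_i
    m = [min(xs), max(xs)]  # crude mean initialization
    v = [1.0, 1.0]          # variances sigma_i^2

    def pdf(x, mu, var):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    for _ in range(n_iter):
        # E-step: responsibilities r[j][i] = P(component i | x_j).
        r = []
        for x in xs:
            w = [a[i] * pdf(x, m[i], v[i]) for i in range(2)]
            s = sum(w)
            r.append([wi / s for wi in w])
        # M-step: closed-form maximization of the Q-function.
        for i in range(2):
            ni = sum(r[j][i] for j in range(len(xs)))
            a[i] = ni / len(xs)
            m[i] = sum(r[j][i] * xs[j] for j in range(len(xs))) / ni
            v[i] = sum(r[j][i] * (xs[j] - m[i]) ** 2 for j in range(len(xs))) / ni
            v[i] = max(v[i], 1e-6)  # guard against variance collapse
    return a, m, v

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(200)] + \
     [random.gauss(5.0, 1.0) for _ in range(200)]
alphas, means, variances = em_gmm_1d(xs)
print(sorted(means))  # roughly the true component means, 0 and 5
```

Note that this finds one local maximum of the likelihood; the point of the tier-by-tier TRUST-TECH search described above is precisely to escape such local solutions, which plain EM cannot do.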
h* = F(h_1, h_2, ..., h_S).
- Testing phase: a new example (x, ?) is assigned the predicted label y* by the combined model h*.
- Tumor detection and tumor tracking must be performed in almost real time.
- Wavelet features are good, but on their own they do not make very good classifiers.

Medical Image Retrieval using Boosting Methods
- Retrieving similar medical images is very valuable for diagnosis (automated diagnosis systems).
- Each category is trained separately, and different models are learned.
- Given a query image, the most similar images are displayed.

Identification of Microbes
- Segment the objects by accurately identifying their boundaries.
- Semi-automated methods perform very well.
- Apply active learning methods for labeling the pixels. Results [JMA '04].

Active Learning for Biomedical Imaging
- Labeling/annotating images is a daunting task.
- We need to help medical doctors label the images efficiently.
- Rather than showing the images in random order, active learning can pick the hardest ones.

Mining Under Constraints
- Business problems pose many real-world constraints; models trained without knowledge of these constraints obviously do not perform well. [submitted]
- Standard setup: constraints feed into the training phase (learn model), and the model is applied in the testing phase.
- Proposed setup: learn a constraints model alongside the main model during training, and apply both during testing.

Conclusion
- Different data-mining tasks were discussed in general, and core data-mining algorithms were illustrated.
- Data mining helps existing technologies, but it does not override them.
- A few challenges remain unsolved: parameter estimation and automated parameter selection are still ongoing research tasks, as are handling real-world constraints and incorporating domain knowledge during the training phase.

Teaching
- Fall 2007: CSC 5991 Data Mining I – Fundamentals of Data Mining
  http://www.cs.wayne.edu/~reddy/Courses/CS5991/
- Winter 2008: CSC 7991 Data Mining II – Topics in Data Mining
  http://www.cs.wayne.edu/~reddy/Courses/CSC7991/

Data Mining I (Fall 2007)
This course introduces the fundamental principles, algorithms and
applications of data mining. Topics covered in this course include: data pre-processing, data visualization, model evaluation, predictive modeling, association analysis, clustering, and anomaly detection.

Data Mining II (Winter 2008)
This will be a continuation course; data-mining problems that arise in various application domains will be discussed. (No special prerequisite classes.) The following topics will be covered: data warehousing, mining data streams, probabilistic graphical models, frequent pattern mining, multi-relational data mining, graph mining, text mining, visual data mining, sequence pattern mining, mining time-series data, privacy-preserving data mining, and high-dimensional data clustering.

Thank You! Questions and comments welcome.
Contact Information:
Office: 452 State Hall
Email: [email protected]
WWW: http://www.cs.wayne.edu/~reddy/