What is data mining?

Włodzisław Duch
Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland
http://www.phys.uni.torun.pl/~duch
ISEP Porto, 8-12 July 2002

What is it about?
• Data used to be precious! Now it is overwhelming ...
• In many areas of science, business and commerce people are drowning in data.
• Ex: astronomy super-telescope – data mining in existing databases.
• Database technology makes it possible to store and retrieve large amounts of data of any kind.
• There is knowledge hidden in data.
• Data analysis requires intelligence.

Ancient history
• 1960: first databases, collections of data.
• 1970: RDBMS, the relational data model (most popular today), large centralized systems.
• 1980: application-oriented data models, specialized for scientific, geographic and engineering data, time series, text; object-oriented models, distributed databases.
• 1990: multimedia and Web databases, data warehousing (subject-oriented DB for decision support) and on-line analytical processing (OLAP), deduction and verification of hypothetical patterns.
• Data mining: first conference in 1989, book in 1996; discover something useful!

Data Mining History
• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro and W. Frawley 1991)
• 1991-1994 Workshops on KDD
• 1996 Advances in Knowledge Discovery and Data Mining (Fayyad et al.)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
• 1997 Journal of Data Mining and Knowledge Discovery
• 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
• Many conferences on data mining: PAKDD, PKDD, SIAM Data Mining, (IEEE) ICDM, etc.
References, papers
KDD WWW Resources:
http://www.kdd.org
http://www.kdnuggets.com
http://www.the-data-mine.com
http://www.acm.org/sigkdd/
ResearchIndex: http://citeseer.nj.nec.com/cs
AI & ML aspects: http://www.phys.uni.torun.pl/kmk
NN & Statistics: http://www.phys.uni.torun.pl/kmk
Comparison of results on many datasets: http://www.phys.uni.torun.pl/kmk

Data Mining and statistics
• Statisticians deal with data: what’s new in DM?
• Many DM methods have roots in statistics.
• Statistics used to deal with small, controlled experiments, while DM deals with large, messy collections of data.
• Statistics is based on analytical probabilistic models, DM is based on algorithms that find patterns in data.
• Many DM algorithms came from other sources and are only slowly getting some statistical justification.
• Key factor for DM is the computer cost/performance.
• Sometimes DM is more art than science …

Types of Data
• Statistical data – clean, numerical, controlled experiments, vector space model.
• Relational data – marketing, finances.
• Textual data – Web, NLP, search.
• Complex structures – chemistry, economics.
• Sequence data – bioinformatics.
• Multimedia data – images, video.
• Signals – dynamic data, biosignals.
• AI data – logical problems, games, behavior …

What is DM?
• Discovering interesting patterns, finding useful summaries of large databases.
• DM is more than database technology and On-Line Analytical Processing (OLAP) tools.
• DM is more than statistical analysis, although it includes classification, association, clustering, outlier and trend analysis, decision rules, prototype cases, multidimensional visualization etc.
Understanding of data has not been an explicit goal of statistics, which has focused on predictive data models.

DM applications
• Many applications, but spectacular new knowledge is rarely discovered. Some examples:
– “Diapers and beer” correlation: place them close and put potato chips in between.
– Mining astronomical catalogs (Skycat, Sloan Sky survey): a new subtype of stars has been discovered!
– Bioinformatics: more precise characterization of some diseases, many discoveries to be made?
– Credit card fraud detection (HNC company).
– Discounts of air/hotel for frequent travelers.

Important issues in data mining
• Use of statistical and CI methods for KDD.
• What makes an interesting pattern?
• Handling uncertainty in the data.
• Handling noise, outliers and missing or unknown data.
• Finding linguistic variables, discretization of continuous data, presentation and evaluation of knowledge.
• Knowledge representation for structural data, heterogeneous information, textual databases & NLP.
• Performance, scalability, distributed data, incremental or “on-line” processing.
• Best form of explanation depends on the application.

DM dangers
• If too many conclusions are drawn from too small a data sample, some inferences will be true by chance alone (Bonferroni’s theorem).
Example 1: Joseph B. Rhine (Duke Univ) ESP tests. 1 person in 1000 correctly guessed the color (red or black) of 10 cards: is this evidence for ESP? Pure guessing succeeds with probability 2^-10, i.e. about 1 in 1000. Retesting of these people gave average results. Rhine’s conclusion: telling people that they have ESP interferes with their ability …
Example 2: using m letters to form a random sequence of length N, all possible subsequences of length log_m N are found => Bible code!

Data Mining process
Knowledge discovery in databases (KDD): a search process for understandable and useful patterns in data.
[Figure: KDD pipeline. Operational Databases => Clean, Collect, Summarize (most effort) => Data Warehouse => Data Preparation => Training Data => Data Mining => Model, Patterns => Verification, Evaluation]

Stages of DM process
• Data gathering, data warehousing, Web crawling.
• Preparation of the data: cleaning, removing outliers and impossible values, removing wrong records, finding missing data.
• Exploratory data analysis: visualization of different aspects of data.
• Finding relevant features for questions that are asked, preparing data structures for predictive methods, converting symbolic values to numerical representation.
• Pattern extraction, discovery, rules, prototypes.
• Evaluation of knowledge gained, finding useful patterns, consultation with experts.

Multidimensional Data Cuboids
• Data warehouses use a multidimensional data model.
• Projections (views) of data on different dimensions (attributes) form “data cuboids”.
• In DB warehousing literature:
– base cuboid: original data, N-D;
– apex cuboid: 0-D cuboid, highest-level summary;
– data cube: lattice of cuboids.
• Ex: Sales data cube, viewed in multiple dimensions
– Dimension tables, ex. item (item_name, brand, type), or time (day, week, month, quarter, year)
– Fact tables, measures (such as cost), and keys to each of the related dimension tables

Data Cube: A Lattice of Cuboids
0-D (apex) cuboid: none
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier

Forms of useful knowledge
AI/Machine Learning camp: Neural nets are black boxes. Unacceptable! Symbolic rules forever.
But ... knowledge accessible to humans is in:
• symbols,
• similarity to prototypes,
• images, visual representations.
What type of explanation is satisfactory? Interesting question for cognitive scientists. Different answers in different fields.

Forms of knowledge
• Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbors methods do.
• Humans create prototypes out of many examples – as Gaussian classifiers, RBF networks, neurofuzzy systems do.
• Logical rules are the highest form of summarization of knowledge.
Types of explanation:
• exemplar-based: prototypes and similarity;
• logic-based: symbols and rules;
• visualization-based: exploratory data analysis, maps, diagrams, relations ...

Computational Intelligence
[Diagram: Computational Intelligence (Data => Knowledge) at the center, drawing on Neural networks, Evolutionary algorithms, Fuzzy logic, Soft computing, Pattern Recognition, Expert systems, Artificial Intelligence, Machine learning, Probabilistic methods, Visualization, Multivariate statistics]

CI methods for data mining
• Provide non-parametric (“universal”), predictive models of data.
• Classify new data to pre-defined categories, supporting diagnosis & prognosis.
• Discover new categories, clusters, patterns.
• Discover interesting associations, correlations.
• Help to understand the data, creating fuzzy or crisp logical rules, or prototypes.
• Help to visualize multi-dimensional relationships among data samples.

Association rules
• Classification rules: X => C(X)
• Association rules: looking for correlation between components of X, i.e. probability $p(X_i \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$.
• “Market basket” problem: many items selected from an available pool to a basket; what are the correlations?
• Only frequent items are interesting: itemsets with high support, i.e. appearing together in many baskets. Search for rules with support above a threshold, e.g. > 1%.

Association rules - related
• Problems related to the market basket: correlation between documents – high for plagiarism; phrases in documents – high for semantically related documents.
• Causal relations matter, although they may be difficult to determine: lower the price of diapers, keep the beer price high, or try the reverse – what will happen?
• More general approach: Bayesian belief networks, causal networks, graphical models.

Clustering
• Given points in a multidimensional space, divide them into groups that are “similar”.
• Ex: if an epidemic breaks out, look for the location of cases on the map (cholera in London). Documents in the space of words cluster according to their topics.
• How to measure similarity?
• Hierarchical approaches: start from single cases, join them forming clusters; ex: dendrogram.
• Centroid approaches: assume a few centers and adapt their position; ex: k-means, LVQ, SOM.

Neural networks
• Inspired by neurobiology: simple elements cooperate, changing internal parameters.
• Large field, dozens of different models, over 500 papers on NN in medicine each year.
• Supervised networks: heteroassociative mapping X=>Y, symptoms => diseases, universal approximators.
• Unsupervised networks: clusterization, competitive learning, autoassociation.
• Reinforcement learning: modeling behavior, playing games, sequential data.

Unsupervised NN example
Clustering and visualization of the quality of life index (UN data) by SOM map.
Poor classification, inaccurate visualization.

Real and artificial neurons
[Figure: biological neuron (dendrites, synapses, axon, signals) next to an artificial network (nodes – artificial neurons, synapses – weights)]

Neural network for MI diagnosis
[Figure: network with inputs Sex, Age, Smoking, Pain Intensity, Pain Duration, ECG: ST Elevation (example values -1, 65, 1, 5, 3, 1), input weights, output weights, and one output ~ p(MI|X) = 0.7 for Myocardial Infarction]

MI network function
Training: setting the values of weights and thresholds, efficient algorithms exist.
Effect: non-linear regression function
$$F_{MI}(X) = \sigma\left(\sum_{j=1}^{5} W^{o}_{j}\,\sigma\left(\sum_{k=1}^{6} W^{i}_{jk} X_k\right)\right)$$
where $W^{i}$ are the input weights, $W^{o}$ the output weights and $\sigma$ the node transfer function.
Such networks are universal approximators: they may learn any mapping X => Y

Knowledge from networks
Simplify networks: force most weights to 0, quantize remaining parameters, be constructive!
• Regularization: mathematical technique improving predictive abilities of the network.
• Result: MLP2LN neural networks that are equivalent to logical rules.

MLP2LN
Converts MLP neural networks into a network performing logical operations (LN).
[Figure: network layers. Input layer; Linguistic units: windows, filters; Aggregation: better features; Rule units: threshold logic; Output: one node per class]

Learning dynamics
Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.
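The non-linear regression function from the "MI network function" slide can be illustrated with a small forward pass. This is only a sketch: the logistic transfer function, the number of hidden nodes, the input scaling and all weight values below are assumptions made for illustration, not the parameters of the trained MI network.

```python
import math


def sigmoid(a):
    """Logistic transfer function, maps any real activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def network_output(x, w_in, w_out, theta_hidden=None, theta_out=0.0):
    """Two-layer network: sigma(sum_j w_out[j] * sigma(sum_k w_in[j][k] * x[k])).

    w_in[j][k] is the (hypothetical) weight from input k to hidden node j,
    w_out[j] the weight from hidden node j to the single output node;
    thresholds default to zero.
    """
    if theta_hidden is None:
        theta_hidden = [0.0] * len(w_in)
    hidden = [
        sigmoid(sum(w * xk for w, xk in zip(row, x)) - theta)
        for row, theta in zip(w_in, theta_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) - theta_out)


# Hypothetical example: 6 inputs rescaled to [-1, 1], 5 hidden nodes.
x = [-1.0, 0.3, 1.0, 0.5, -0.2, 1.0]
w_in = [[0.1 * (j + 1)] * 6 for j in range(5)]
w_out = [0.5, -0.25, 0.75, 0.1, -0.4]
print(network_output(x, w_in, w_out))
```

The output always lies in (0, 1), which is why it can be read as an estimate of p(MI|X) once the weights have been trained.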
Neurofuzzy systems
Fuzzy: $\mu(x) \in \{0, 1\}$ (no/yes) replaced by a degree $\mu(x) \in [0, 1]$.
Triangular, trapezoidal, Gaussian ... membership functions (MF).
MFs in many dimensions: Feature Space Mapping (FSM) neurofuzzy system.
Neural adaptation, estimation of the probability density function (PDF) using a single hidden layer network (RBF-like) with nodes realizing separable functions:
$$G(X; P) = \prod_{i=1}^{N} G_i(X_i; P_i)$$
where N is the number of features.

GhostMiner Philosophy
GhostMiner, data mining tools from our lab. http://www.fqspl.com.pl/ghostminer/
• Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer.
• There is no free lunch – provide different types of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, committees.
• Provide tools for visualization of data.
• Support the process of knowledge discovery/model building and evaluating, organizing it into projects.

Heterogeneous systems
Homogeneous systems: one type of “building blocks”, same type of decision borders. Ex: neural networks, SVMs, decision trees, kNNs ….
Committees combine many models together, but lead to complex models that are difficult to understand.
• Discovering the simplest class structure, its inductive bias, requires heterogeneous adaptive systems (HAS).
• Ockham razor: simpler systems are better.
HAS examples:
• NN with many types of neuron transfer functions.
• k-NN with different distance functions.
• DT with different types of test criteria.

Wine data example
Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars.
Task: recognize the source of wine sample.
13 quantities measured, continuous features:
• alcohol content
• malic acid content
• ash content
• alkalinity of ash
• magnesium content
• total phenols content
• flavanoids content
• nonflavanoid phenols content
• proanthocyanins content
• color intensity
• hue
• OD280/OD315 of diluted wines
• proline.
Exploration and visualization
General info about the data

Exploration: data
Inspect the data

Exploration: data statistics
Distribution of feature values
Proline has very large values, the data should be standardized before further processing.

Exploration: data standardized
Standardized data: unit standard deviation, about 2/3 of all data should fall within [mean-std, mean+std]
Other options: normalize to fit in [-1,+1], or normalize rejecting some extreme values.

Exploration: 1D histograms
Distribution of feature values in classes
Some features are more useful than the others.

Exploration: 1D/3D histograms
Distribution of feature values in classes, 3D

Exploration: 2D projections
Projections (cuboids) on selected 2D

Visualize data
Relations in more than 3D are hard to imagine.
SOM mappings: popular for visualization, but rather inaccurate, no measure of distortions.
Measure of topographical distortions: map all points $X_i$ from $R^n$ to points $x_i$ in $R^m$, m < n, and ask:
How well are the distances $R_{ij} = D(X_i, X_j)$ reproduced by the distances $r_{ij} = d(x_i, x_j)$?
Use m = 2 for visualization, use higher m for dimensionality reduction.

Visualize data: MDS
Multidimensional scaling: invented in psychometry by Torgerson (1952), re-invented by Sammon (1969) and myself (1994) …
Minimize a measure of topographical distortions by moving the x coordinates.
MDS:
$$S_1(x) = \frac{1}{\sum_{i>j} R_{ij}^2} \sum_{i>j} \left(R_{ij} - r_{ij}(x)\right)^2$$
Sammon:
$$S_2(x) = \frac{1}{\sum_{i>j} R_{ij}} \sum_{i>j} \frac{\left(R_{ij} - r_{ij}(x)\right)^2}{R_{ij}}$$
MDS, more local:
$$S_3(x) = \frac{1}{\sum_{i>j} R_{ij}} \sum_{i>j} \left(1 - \frac{r_{ij}(x)}{R_{ij}}\right)^2$$

Visualize data: Wine
3 clusters are clearly distinguished, 2D is fine.
The green outlier can be identified easily.

Decision trees
Simplest things first: use decision tree to find logical rules.
Test single attribute, find good point to split the data, separating vectors from different classes.
DT advantages: fast, simple, easy to understand, easy to program, many good algorithms.
4 attributes used, 10 errors, 168 correct, 94.4% correct.
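The step "test single attribute, find good point to split the data" can be sketched as an exhaustive search over candidate thresholds. This is a minimal illustration only: it scores a split by the number of misclassified vectors when each side predicts its majority class, which is a generic criterion chosen here for simplicity and not the separability criterion used by the SSV tree; the function name and the toy data are hypothetical.

```python
from collections import Counter


def best_split(values, labels):
    """Find the threshold a for the test x < a that gives the fewest errors.

    Each side of the split predicts its majority class. Returns a tuple
    (threshold, errors); the threshold is None if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, n + 1)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = Counter(label for _, label in pairs[:i])
        right = Counter(label for _, label in pairs[i:])
        errors = (i - max(left.values())) + (n - i - max(right.values()))
        if errors < best[1]:
            best = ((lo + hi) / 2.0, errors)  # midpoint between neighbors
    return best


# Toy data (hypothetical values, not taken from the Wine dataset).
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["class 2", "class 2", "class 2", "class 1", "class 1", "class 1"]
print(best_split(values, labels))
```

A univariate tree repeats this search for every attribute, keeps the best one, and then recurses on the two resulting subsets.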
Decision borders
Univariate trees: test the value of a single attribute x < a.
Multivariate trees: test on combinations of attributes, hyperplanes.
Result: feature space is divided into cuboids.
Wine data: univariate decision tree borders for proline and flavanoids

Logical rules
Crisp logic rules: for continuous x use linguistic variables (predicate functions).
$s_k(x) \equiv \mathrm{True}\ [X_k \le x \le X'_k]$, for example:
small(x) = True{x | x < 1}
medium(x) = True{x | x ∈ [1,2]}
large(x) = True{x | x > 2}
Linguistic variables are used in crisp (propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...

Crisp logic decisions
Crisp logic is based on rectangular membership functions: True/False values jump from 0 to 1.
Step functions are used for partitioning of the feature space.
Very simple hyper-rectangular decision borders.
Severe limitation on the expressive power of crisp logical rules!

Logical rules - advantages
Logical rules, if simple enough, are preferable.
• Rules may expose limitations of black box solutions.
• Only relevant features are used in rules.
• Rules may sometimes be more accurate than NN and other CI methods.
• Overfitting is easy to control, rules usually have a small number of parameters.
• Rules forever !?
A logical rule about logical rules is:
IF the number of rules is relatively small AND the accuracy is sufficiently high THEN rules may be an optimal choice.

Logical rules - limitations
Logical rules are preferred but ...
• Only one class is predicted, p(Ci|X,M) = 0 or 1; a black-and-white picture may be inappropriate in many applications.
• Discontinuous cost function allows only non-gradient optimization.
• Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules.
• Reliable crisp rules may reject some cases as unclassified.
• Interpretation of crisp rules may be misleading.
• Fuzzy rules are not so comprehensible.

Rules - choices
Simplicity vs. accuracy.
Confidence vs. rejection rate.
Confusion matrix, rows = true class, columns = predicted class (+, -, or rejected r), last column = row sums:
$$P(\text{true}, \text{predicted} \mid M) = \begin{pmatrix} p_{++} & p_{+-} & p_{+r} \\ p_{-+} & p_{--} & p_{-r} \end{pmatrix}, \qquad \begin{pmatrix} p_{+} \\ p_{-} \end{pmatrix}$$
$p_{++}$ is a hit; $p_{-+}$ a false alarm; $p_{+-}$ is a miss.
Accuracy (overall): $A(M) = p_{++} + p_{--}$
Error rate: $L(M) = p_{+-} + p_{-+}$
Rejection rate: $R(M) = p_{+r} + p_{-r} = 1 - L(M) - A(M)$
Sensitivity: $S_{+}(M) = p_{+|+} = p_{++} / p_{+}$
Specificity: $S_{-}(M) = p_{-|-} = p_{--} / p_{-}$

Rules – error functions
The overall accuracy is equal to a combination of sensitivity and specificity weighted by the a priori probabilities:
$$A(M) = p_{+} S_{+}(M) + p_{-} S_{-}(M)$$
Optimization of rules for the C+ class; large $\gamma$ means no errors but high rejection rate.
$$E(M; \gamma) = \gamma L(M) - A(M) = \gamma\,(p_{+-} + p_{-+}) - (p_{++} + p_{--})$$
$$\min_M E(M; \gamma) \Leftrightarrow \min_M \{(1 + \gamma) L(M) + R(M)\}$$
Optimization with different costs of errors:
$$\min_M E(M; \alpha) = \min_M \{p_{+-} + \alpha\, p_{-+}\} = \min_M \{p_{+}(1 - S_{+}(M)) - p_{+r}(M) + \alpha\,[p_{-}(1 - S_{-}(M)) - p_{-r}(M)]\}$$
ROC (Receiver Operating Characteristic) curve: $p_{++}(p_{-+})$, hit rate as a function of false alarm rate.

Wine example – SSV rules
Decision trees provide rules of different complexity.
Simplest tree: 5 nodes, corresponding to 3 rules; 25 errors, mostly Class2/3 wines mixed.

Wine – SSV 5 rules
Lower pruning leads to a more complex tree.
7 nodes, corresponding to 5 rules; 10 errors, mostly Class2/3 wines mixed.

Wine – SSV optimal rules
What is the optimal complexity of rules? Use crossvalidation to estimate generalization.
Various solutions may be found, depending on the search: 5 rules with 12 premises, making 6 errors; 6 rules with 16 premises and 3 errors; 8 rules, 25 premises, and 1 error.
if OD280/OD315 > 2.505 AND proline > 726.5 AND color > 3.435 then class 1
if OD280/OD315 > 2.505 AND proline > 726.5 AND color < 3.435 then class 2
if OD280/OD315 < 2.505 AND hue > 0.875 AND malic-acid < 2.82 then class 2
if OD280/OD315 > 2.505 AND proline < 726.5 then class 2
if OD280/OD315 < 2.505 AND hue < 0.875 then class 3
if OD280/OD315 < 2.505 AND hue > 0.875 AND malic-acid > 2.82 then class 3

Wine – FSM rules
SSV: hierarchical rules
FSM: density estimation with feature selection.
Complexity of rules depends on desired accuracy.
Use rectangular functions for crisp rules.
Optimal accuracy may be evaluated using crossvalidation.
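The quantities defined on the "Rules - choices" and "Rules – error functions" slides can be checked numerically. The sketch below is illustrative only: the function name, argument names and the example probabilities are hypothetical, and the six arguments are the joint probabilities p(true, predicted) with "r" standing for rejected cases.

```python
def rule_metrics(p_pp, p_pm, p_pr, p_mp, p_mm, p_mr, gamma=1.0):
    """Compute accuracy, error, rejection, sensitivity, specificity and cost.

    Arguments are joint probabilities p(true class, prediction):
    p_pm is a true '+' case predicted as '-', p_pr a true '+' case rejected.
    """
    p_plus = p_pp + p_pm + p_pr      # a priori probability of class +
    p_minus = p_mp + p_mm + p_mr     # a priori probability of class -
    accuracy = p_pp + p_mm           # A(M)
    error = p_pm + p_mp              # L(M)
    rejection = p_pr + p_mr          # R(M)
    return {
        "p_plus": p_plus,
        "p_minus": p_minus,
        "accuracy": accuracy,
        "error": error,
        "rejection": rejection,
        "sensitivity": p_pp / p_plus,    # S+(M)
        "specificity": p_mm / p_minus,   # S-(M)
        "cost": gamma * error - accuracy,  # E(M; gamma)
    }


# Hypothetical confusion matrix whose entries sum to 1.
m = rule_metrics(0.40, 0.05, 0.05, 0.10, 0.35, 0.05, gamma=2.0)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these numbers the identity A(M) = p+ S+(M) + p- S-(M) holds, and E(M;γ) differs from (1+γ)L(M) + R(M) only by the constant 1, which is why both expressions have the same minimum.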
FSM discovers simpler rules, for example:
if proline > 929.5 then class 1 (48 cases, 45 correct, 2 recovered by other rules).
if color < 3.79285 then class 2 (63 cases, 60 correct)

Examples of interesting knowledge discovered!
The most famous example of knowledge discovered by data mining: correlation between beer, milk and diapers.
Other examples: 2 subtypes of galactic spectra forced astrophysicists to reconsider star evolutionary processes.
Several examples of knowledge found by us in medical and other datasets follow.

Mushrooms
The Mushroom Guide: no simple rule for mushrooms; no rule like: ‘leaflets three, let it be’ for Poisonous Oak and Ivy.
8124 cases, 51.8% are edible, the rest non-edible.
22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or $2^{118} \approx 3 \cdot 10^{35}$ possible input vectors.
Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy
Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow.
Safe rule for edible mushrooms:
odor = (almond OR anise OR none) AND spore-print-color = NOT green
48 errors, 99.41% correct
This is why animals have such a good sense of smell! What does it tell us about odor receptors?

Mushrooms rules
To eat or not to eat, this is the question! Not any more ...
A mushroom is poisonous if:
R1) odor = NOT (almond OR anise OR none); 120 errors, 98.52%
R2) spore-print-color = green; 48 errors, 99.41%
R3) odor = none AND stalk-surface-below-ring = scaly AND stalk-color-above-ring = NOT brown; 8 errors, 99.90%
R4) habitat = leaves AND cap-color = white; no errors!
R1 + R2 are quite stable, found even with 10% of data;
R3 and R4 may be replaced by other rules, ex:
R'3) gill-size = narrow AND stalk-surface-above-ring = (silky OR scaly)
R'4) gill-size = narrow AND population = clustered
Only 5 of 22 attributes used! Simplest possible rules?
100% in CV tests - structure of this data is completely clear.

Recurrence of breast cancer
Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia.
286 cases, 201 no recurrence (70.3%), 85 recurrence cases (29.7%)
Example record: no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes
9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1,2,3), breast, breast quad, radiation.

Rules for breast cancer
Many systems used, 65-78% accuracy reported. Single rule:
IF nodes-involved ∉ [0,2] AND degree-malignant = 3 THEN recurrence, ELSE no-recurrence
76.2% accuracy, only trivial knowledge in the data: highly malignant breast cancer involving many nodes is likely to strike back.

Recurrence - comparison
Method                        10xCV accuracy
MLP2LN, 1 rule                76.2
SSV DT, stable rules          75.7 ± 1.0
k-NN, k=10, Canberra          74.1 ± 1.2
MLP+backprop.                 73.5 ± 9.4 (Zarndt)
CART DT                       71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes           71.7 ± 6.8
Naive Bayes                   69.3 ± 10.0 (Zarndt)
Other decision trees          < 70.0

Breast cancer diagnosis
Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
699 cases, 9 features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses
Task: distinguish benign from malignant cases.

Breast cancer rules
Simplest rule from MLP2LN, large regularization:
If uniformity of cell size < 3 Then benign Else malignant
Sensitivity = 0.97, Specificity = 0.85
More complex NN solutions, from 10CV estimate:
Sensitivity = 0.98, Specificity = 0.94

Breast cancer comparison
Method                        10xCV accuracy
k-NN, k=3, Manhattan          97.0 ± 2.1 (GM)
FSM, neurofuzzy               96.9 ± 1.4 (GM)
Fisher LDA                    96.8
MLP+backprop.                 96.7 (Ster, Dobnikar)
LVQ                           96.6 (Ster, Dobnikar)
IncNet (neural)               96.4 ± 2.1 (GM)
Naive Bayes                   96.4
SSV DT, 3 crisp rules         96.0 ± 2.9 (GM)
LDA (linear discriminant)     96.0
Various decision trees        93.5-95.6

Melanoma skin cancer
Collected in the Outpatient Center of Dermatology in Rzeszów, Poland.
Four types of Melanoma: benign, blue, suspicious, or malignant.
250 cases, with almost equal class distribution.
Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5).
TDS (Total Dermatoscopy Score) - single index
Goal: hardware scanner for preliminary diagnosis.

Melanoma rules
R1: IF TDS ≤ 4.85 AND C-BLUE IS absent THEN MELANOMA IS Benign-nevus
R2: IF TDS ≤ 4.85 AND C-BLUE IS present THEN MELANOMA IS Blue-nevus
R3: IF TDS > 5.45 THEN MELANOMA IS Malignant
R4: IF TDS > 4.85 AND TDS < 5.45 THEN MELANOMA IS Suspicious
5 errors (98.0%) on the training set, 0 errors (100%) on the test set.
Feature aggregation is important! Without TDS 15 rules are needed.

Melanoma results
Method                          Rules   Training %    Test %
MLP2LN, crisp rules             4       98.0 (all)    100
SSV Tree, crisp rules           4       97.5 ± 0.3    100
FSM, rectangular f.             7       95.5 ± 1.0    100
knn + prototype selection       13      97.5 ± 0.0    100
FSM, Gaussian f.                15      93.7 ± 1.0    95 ± 3.6
knn k=1, Manh, 2 features       --      97.4 ± 0.3    100
LERS, rough rules               21      --            96.2

Summary
Data mining is a large field; only a few issues have been mentioned here.
DM involves many steps; here only those related to pattern recognition were stressed, but in practice scalability and efficiency issues may be most important.
Neural networks are still used mostly for building predictive data models, but they may also provide a simplified description in the form of rules.
Rules are not the only form of data understanding.
Rules may be a beginning for a practical application.
Some interesting knowledge has been discovered.
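Because crisp rules are plain threshold tests, a rule set such as R1-R4 from the "Melanoma rules" slide translates directly into code. The sketch below is only an illustration: the function name and the boolean encoding of the C-BLUE attribute are assumptions, while the thresholds are the ones quoted on the slide.

```python
def melanoma_rule(tds, c_blue_present):
    """Apply rules R1-R4 to a TDS value and the C-BLUE attribute.

    Returns the predicted class, or None when no rule fires
    (the four rules as stated do not cover TDS equal to 5.45).
    """
    if tds <= 4.85:
        # R1 and R2 differ only in the C-BLUE attribute
        return "Blue-nevus" if c_blue_present else "Benign-nevus"
    if tds > 5.45:
        return "Malignant"       # R3
    if 4.85 < tds < 5.45:
        return "Suspicious"      # R4
    return None                  # case left unclassified


# Hypothetical inputs, not records from the Rzeszów database.
for tds, blue in [(4.0, False), (4.0, True), (5.0, False), (6.0, False)]:
    print(tds, blue, melanoma_rule(tds, blue))
```

The unclassified branch shows in miniature the point made on the "Logical rules - limitations" slide: reliable crisp rules may reject some cases.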
Challenges
Fully automatic universal data analysis systems: press the button and wait for the truth …
• Discovery of theories rather than data models
• Integration with image/signal analysis
• Integration with reasoning in complex domains
• Combining expert systems with neural networks
We are slowly getting there. More & more computational intelligence tools (including our own) are available.

Disclaimer
A few slides/figures were taken from various presentations found on the Internet; unfortunately I cannot identify the original authors at the moment, since these slides went through different iterations. I have to apologize for that.