Three Challenges in Data Mining
Anne Denton
Department of Computer Science
NDSU

Why Data Mining?
  Parkinson’s Law of Data
    Data expands to fill the space available for storage
  Disk-storage version of Moore’s law
    Capacity ~ 2^(t / 18 months)
  Available data grows exponentially!

Outline
  Motivation of 3 challenges
    More records (rows)
    More attributes (columns)
    New subject domains
  Some answers to the challenges
    Thesis work
      Generalized P-Tree structure
      Kernel-based semi-naïve Bayes classification
    KDD-cup 02/03 and with Csci 366 students
      Data with graph relationship
      Outlook: Data with time dependence

Examples
  More records
    Many stores save each transaction
    Data warehouses keep historic data
    Monitoring network traffic
    Micro sensors / sensor networks
  More attributes
    Items in a shopping cart
    Keywords in text
    Properties of a protein (multi-valued categorical)
  New subject domains
    Data mining hype increases audience

Algorithmic Perspective
  More records
    Standard scaling problem
  More attributes
    Different algorithms needed for 1000 vs. 10 attributes
  New subject domains
    New techniques needed
    Joining of separate fields
    Algorithms should be domain-independent
      Need for experts does not scale well
      Twice as many data sets: twice as many domain experts??
    Ignore domain knowledge? No! Formulate it systematically

Some Answers to Challenges
  Large data quantity (Thesis)
    Many records
      P-Tree concept and its generalization to non-spatial data
    Many attributes
      Algorithm that defies curse of dimensionality
  New techniques / Joining separate fields
    Mining data on a graph
    Outlook: Mining data with time dependence

Challenge 1: Many Records
  Typical question
    How many records satisfy given conditions on attributes?
  Typical answer
    In record-oriented database systems
      Database scan: O(N)
    Sorting / indexes?
      Unsuitable for most problems
  P-Trees
    Compressed bit-column-wise storage
    Bit-wise AND replaces database scan

P-Trees: Compression Aspect

P-Trees: Ordering Aspect
  Compression relies on long sequences of 0 or 1
  Images
    Neighboring pixels are probably similar
    Peano-ordering
  Other data?
    Peano-ordering can be generalized
    Peano-order sorting

Peano-Order Sorting

Impact of Peano-Order Sorting
  Speed improvement especially for large data sets
  Less than O(N) scaling for all algorithms
  [Figure: Impact of Sorting on Execution Speed. Time in seconds for unsorted, simple sorting, and generalized Peano sorting on the data sets adult, spam, mushroom, function, and crop.]
  [Figure: Time per test sample in milliseconds vs. number of training points (0 to 30000).]

So Far
  Answer to challenge 1: Many records
    P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan)
    Introduced effective generalization to non-spatial data (thesis)
  Challenge 2: Many attributes
    Focus: Classification
    Curse of dimensionality
      Some algorithms suffer more than others

Curse of Dimensionality
  Many standard classification algorithms
    E.g., decision trees, rule-based classification
  For each attribute 2 halves: relevant, irrelevant
  How often can we divide by 2 before small size of “relevant” part makes results insignificant?
    Inverse of “Double number of rice grains for each square of the chess board”
  Many domains have hundreds of attributes
    Occurrence of terms in text mining
    Properties of genes

Possible Solution
  Additive models
    Each attribute contributes to a sum
    Techniques exist (statistics)
    Computationally intensive
  Simplest: Naïve Bayes
    P(x | C = c_i) = \prod_{k=1}^{M} P(x^{(k)} | C = c_i)
    x^{(k)} is value of kth attribute
    Considered additive model
      Logarithm of probability additive

Semi-Naïve Bayes Classifier
  Correlated attributes are joined
  Has been done for categorical data
    Kononenko ’91, Pazzani ’96
    Previously: Continuous data discretized
  New (thesis)
    Kernel-based evaluation of correlation
    Corr(a, b) = \frac{\sum_{t=1}^{N} \prod_{k=a,b} K^{(k)}(x^{(k)}, x_t^{(k)})}{\prod_{k=a,b} \sum_{t=1}^{N} K^{(k)}(x^{(k)}, x_t^{(k)})} - 1
  [Figure: distribution function, kernel density estimate, and data points.]

Results
  Error decrease in units of standard deviation for different parameter sets
  Improvement for wide range of correlation thresholds: 0.05 (white) to 1 (blue)
  [Figure: Semi-Naive Classifier Compared with P-Tree Naive Bayes. Decrease in error rate for parameter sets (a), (b), and (c) on the data sets spam, crop, adult, sick-euthyroid, mushroom, gene-function, and splice.]

So Far
  Answer to challenge 1: More records
    Generalized P-tree structure
  Answer to challenge 2: More attributes
    Additive algorithms
    Example: Kernel-based semi-naïve Bayes
  Challenge 3: New subject domains
    Data on a graph
    Outlook: Data with time dependence

Standard Approach to Data Mining
  Conversion to a relation (table)
    Domain knowledge goes into table creation
    Standard table can be mined with standard tools
  Does that solve the problem?
    To some degree, yes
    But we can do better

“Everything should be made as simple as possible, but not simpler”
  Albert Einstein
  Claim: Representation as single relation is not rich enough
  Example: Contribution of a graph structure to standard mining problems
    Genomics
      Protein-protein interactions
    WWW
      Link structure
    Scientific publications
      Citations
  Scientific American 05/03

Data on a Graph: Old Hat?
  Common Topics
    Analyze edge structure
      Google
      Biological Networks
    Sub-graph matching
      Chemistry
    Visualization
      Focus on graph structure
  Our work
    Focus on mining node data
    Graph structure provides connectivity

Protein-Protein Interactions
  Protein data
    From Munich Information Center for Protein Sequences (also KDD-cup 02)
    Hierarchical attributes
      Function
      Localization
      Pathways
    Gene-related properties
  Interactions
    From experiments
    Undirected graph

Questions
  Prediction of a property (KDD-cup 02: AHR*)
    Which properties in neighbors are relevant?
    How should we integrate neighbor knowledge?
  What are interesting patterns?
    Which properties say more about neighboring nodes than about the node itself?
  *AHR: Aryl Hydrocarbon Receptor Signaling Pathway
  But not:

Possible Representations
  OR-based
    At least one neighbor has property
    Example: Neighbor essential true
  AND-based
    All neighbors have property
    Example: Neighbor essential false
  Path-based (depends on maximum hops)
    One record for each path
    Classification: weighting?
    Association Rule Mining: Record base changes
  [Figure labels: AHR essential; AHR essential; AHR not essential]

Association Rule Mining
  OR-based representation
  Conditions
    Association rule involves AHR
    Support across a link greater than within a node
    Conditions on minimum confidence and support
  Top 3 with respect to support:
    AHR essential
    AHR nucleus (localization)
    AHR transcription (function)
  (Results by Christopher Besemann, project CSci 366)

Classification Results
  Problem (especially path-based representation)
    Varying amount of information per record
    Many algorithms unsuitable in principle
      E.g., algorithms that divide domain space
  KDD-cup 02
    Very simple additive model
    Based on visually identifying relationship
    Number of interacting essential genes adds to probability of predicting protein as AHR

KDD-Cup 02: Honorable Mention
  NDSU Team

Outlook: Time-Dependent Data
  KDD-cup 03
    Prediction of citations of scientific papers
    Old: Time-series prediction
    New: Combination with similarity-based prediction

Conclusions and Outlook
  Many exciting problems in data mining
  Various challenges
    Scaling of existing algorithms (more records)
    Different types of algorithms gain importance (more attributes)
    Identifying and solving new challenges in a domain-independent way (new subject areas)
  Examples of general structural components that apply to many domains
    Graph-structure
    Time-dependence
    Relationships between attributes
  Software design of scientific applications
    Rows vs. columns
    Software engineering aspects
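
The "rows vs. columns" point, together with the Challenge 1 statement that a bit-wise AND replaces a database scan, can be illustrated with a small sketch. It is not the P-Tree implementation from the thesis: it stores each bit column as a plain Python integer and leaves out the compression and tree structure entirely. The function names `build_bit_columns` and `count_equal`, and the toy attributes `a` and `b`, are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List, Tuple

BitColumns = Dict[Tuple[str, int], int]


def build_bit_columns(records: List[Dict[str, int]],
                      bits: Dict[str, int]) -> BitColumns:
    """Store the data bit-column-wise (assumed layout, uncompressed).

    Bit t of columns[(attr, j)] is 1 when bit j of records[t][attr] is set.
    """
    columns: BitColumns = {}
    for attr, width in bits.items():
        for j in range(width):
            mask = 0
            for t, rec in enumerate(records):
                if (rec[attr] >> j) & 1:
                    mask |= 1 << t
            columns[(attr, j)] = mask
    return columns


def count_equal(columns: BitColumns, n_records: int,
                bits: Dict[str, int], conditions: Dict[str, int]) -> int:
    """Count records with attr == value for every condition.

    Only bit-wise AND (and complement) on whole columns is used;
    no individual record is inspected.
    """
    all_ones = (1 << n_records) - 1
    result = all_ones
    for attr, value in conditions.items():
        for j in range(bits[attr]):
            col = columns[(attr, j)]
            if not (value >> j) & 1:
                col = all_ones & ~col  # the condition needs a 0 in this bit
            result &= col
    return bin(result).count("1")


# Toy data: attribute a uses 2 bits, attribute b uses 1 bit.
records = [{"a": 3, "b": 1}, {"a": 2, "b": 1}, {"a": 3, "b": 0}, {"a": 3, "b": 1}]
bits = {"a": 2, "b": 1}
columns = build_bit_columns(records, bits)
print(count_equal(columns, len(records), bits, {"a": 3, "b": 1}))  # prints 2
```

Without compression the AND still touches every record bit, so this sketch only shows the column-wise counting step. The better-than-O(N) scaling reported on the slides depends on compressing long runs of 0 or 1, which is where the Peano-order sorting comes in.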