Download Query Processing, Resource Management and Approximate in a

Three Challenges in Data Mining Anne Denton Department of Computer Science NDSU Why Data Mining?  Parkinson’s Law of Data Data expands to fill the space available for storage  Disk-storage version of Moore’s law Capacity  2 t / 18 months  Available data grows exponentially! Outline  Motivation of 3 challenges  More records (rows)  More attributes (columns)  More subject domains  Some answers to the challenges  Thesis work  Generalized P-Tree structure  Kernel-based semi-naïve Bayes classification  KDD-cup 02/03 and with Csci 366 students  Data with graph relationship  Outlook: Data with time dependence Examples  More records     Many stores save each transaction Data warehouses keep historic data Monitoring network traffic Micro sensors / sensor networks  More attributes  Items in a shopping cart  Keywords in text  Properties of a protein (multi-valued categorical)  More subject domains  Data mining hype increases audience Algorithmic Perspective  More records  Standard scaling problem  More attributes  Different algorithms needed for 1000 vs. 10 attributes  More subject domains   New techniques needed  Joining of separate fields Algorithms should be domain-independent  Need for experts does not scale well  Twice as many data sets  Twice as many domain experts??  Ignore domain knowledge?  No! Formulate it systematically Some Answers to Challenges  Large data quantity (Thesis)  Many records  P-Tree concept and its generalization to non-spatial data  Many attributes  Algorithm that defies curse of dimensionality  New techniques / Joining separate fields  Mining data on a graph  Outlook: Mining data with time dependence Challenge 1: Many Records  Typical question  How many records satisfy given conditions on attributes?  Typical answer  In record-oriented database systems  Database scan: O(N)  Sorting / indexes?  Unsuitable for most problems  P-Trees  Compressed bit-column-wise storage  Bit-wise AND replaces database scan P-Trees: Compression Aspect P-Trees: Ordering Aspect  Compression relies on long sequences of 0 or 1  Images  Neighboring pixels are probably similar  Peano-ordering  Other data?  Peano-ordering can be generalized  Peano-order sorting Peano-Order Sorting Impact of Peano-Order Sorting  Speed 120 Unsorted 100 80 Simple Sorting 60 40 Generalized Peano Sorting cr op  Less than O(N) scaling for all algorithms improvement especially for large data sets 80 Time per Test Sample in Milliseconds fu nc tio n us hr oo m m sp am 20 0 ad ul t Time in Seconds Impact of Sorting on Execution Speed 60 40 20 0 0 5000 10000 15000 20000 Num ber of Training Points 25000 30000 So Far  Answer to challenge 1: Many records  P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan)  Introduced effective generalization to non-spatial data (thesis)  Challenge 2: Many attributes  Focus: Classification  Curse of dimensionality  Some algorithms suffer more than others Curse of Dimensionality  Many standard classification algorithms  E.g., decision trees, rule-based classification  For each attribute 2 halves: relevant  irrelevant  How often can we divide by 2 before small size of “relevant” part makes results insignificant?  Inverse of  Double number of rice grains for each square of the chess board  Many domains have hundreds of attributes  Occurrence of terms in text mining  Properties of genes Possible Solution  Additive models  Each attribute contributes to a sum  Techniques exist (statistics)  Computationally intensive  Simplest: Naïve Bayes  x(k) is value P(x | C  ci )   P( x ( k ) | C  ci ) of kth attribute k 1  Considered additive model M  Logarithm of probability additive Semi-Naïve Bayes Classifier  Correlated attributes are joined  Has been done for categorical data  Kononenko ’91, Pazzani ’96  Previously: Continuous data discretized  New (thesis) 0.1  Kernel-based evaluation of correlation distribution function 0.08 0.06 N Corr (a, b)   K (k )  K (k ) t 1 k  a ,b N k  a ,b t 1 ( x ( k ) , xt( k ) ) 0.04 1 (k ) kernel density estimate data points 0.02 (k ) t (x , x ) 0 Results  Error decrease in units of standard deviation for different parameter sets  Improvement for wide range of correlation thresholds: 0.05 (white) to 1 (blue) Semi-Naive Classifier Compard with P-Tree Naive Bayes Decrease in Error Rate 25 20 15 Parameters (a) 10 Parameters (b) Parameters (c) 5 0 spam -5 crop adult sickeuthyroid mushroom genefunction splice So Far  Answer to challenge 1: More records  Generalized P-tree structure  Answer to challenge 2: More attributes  Additive algorithms  Example: Kernel-based semi-naïve Bayes  Challenge 3: More subject domains  Data on a graph  Outlook: Data with time dependence Standard Approach to Data Mining  Conversion to a relation (table)  Domain knowledge goes into table creation  Standard table can be mined with standard tools  Does that solve the problem?  To some degree, yes  But we can do better “Everything should be made as simple as possible, but not simpler” Albert Einstein Claim: Representation as single relation is not rich enough  Example: Contribution of a graph structure to standard mining problems  Genomics  Protein-protein interactions  WWW  Link structure  Scientific publications  Citations Scientific American 05/03 Data on a Graph: Old Hat?  Common Topics  Analyze edge structure  Google  Biological Networks  Sub-graph matching  Chemistry  Visualization  Focus on graph structure  Our work  Focus on mining node data  Graph structure provides connectivity Protein-Protein Interactions  Protein data  From Munich Information Center for Protein Sequences (also KDD-cup 02)  Hierarchical attributes  Function  Localization  Pathways  Gene-related properties  Interactions  From experiments  Undirected graph Questions  Prediction of a property (KDD-cup 02: AHR*)  Which properties in neighbors are relevant?  How should we integrate neighbor knowledge?  What are interesting patterns?  Which properties say more about neighboring nodes than about the node itself? *AHR: Aryl Hydrocarbon Receptor Signaling Pathway But not: Possible Representations  OR-based  At least one neighbor has property  Example: Neighbor essential true  AND-based  All neighbors have property  Example: Neighbor essential false  Path-based (depends on maximum hops)  One record for each path  Classification: weighting?  Association Rule Mining: Record base changes AHR essential AHR essential AHR not essential Association Rule Mining  OR-based representation  Conditions  Association rule involves AHR  Support across a link greater than within a node  Conditions on minimum confidence and support  Top 3 with respect to support: AHR  essential AHR  nucleus (localization) AHR  transcription (function) (Results by Christopher Besemann, project CSci 366) Classification Results  Problem (especially path-based representation)  Varying amount of information per record  Many algorithms unsuitable in principle  E.g., algorithms that divide domain space  KDD-cup 02  Very simple additive model  Based on visually identifying relationship  Number of interacting essential genes adds to probability of predicting protein as AHR KDD-Cup 02: Honorable Mention NDSU Team Outlook: Time-Dependent Data  KDD-cup 03  Prediction of citations of scientific papers  Old: Time-series prediction  New: Combination with similarity-based prediction Conclusions and Outlook  Many exciting problems in data mining  Various challenges  Scaling of existing algorithms (more records)  Different properties in algorithms become relevant (more attributes)  Identifying and solving new domainindependent challenges (more subject areas)  Examples of general structural components that apply to many domains  Graph-structure  Time-dependence  Relationships between attributes

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Query Processing, Resource Management and Approximate in a