Database Technologies For Electronic Commerce
Rakesh Agrawal, Ramakrishnan Srikant, Yirong Xu

Searching with Numbers

[Figure: a Catalog Database holding "Dell Computer 700 MHz Celeron, 256 MB SDRAM, …" and "IBM Thinkpad 750 MHz Pentium 3, 196 MB DRAM, …"; the queries "800 200" and "800 200 3 lb"; results IBM Thinkpad (750 MHz, 196 MB) … and Dell (700 MHz, 256 MB)]

Reflectivity
• If we get a close match on numbers, how likely is it that we have correctly matched attribute names?
  – The likelihood depends on the non-reflectivity of the data.
• Let
  – D: dataset, n_i: co-ordinates of point x_i,
  – reflections(x_i): permutations of n_i
  – η(n_i): # of points within distance r of n_i
  – ρ(n_i): # of reflections within distance r of n_i

  $\mathrm{Reflectivity} = 1 - \frac{1}{|D|} \sum_{x_i \in D} \frac{\eta(n_i)}{\rho(n_i)}$

• Non-overlapping attributes → Non-reflective.
  – Memory: 64 - 512 Mb, Disk: 10 - 40 Gb
• Correlations or Clustering → Low reflectivity.
  – Memory: 64 - 512 Mb, Disk: 10 - 100 Gb

[Figure: three scatter plots titled Non-Reflective, Low Reflectivity and High Reflectivity; both axes run from 0 to 50]

Empirical Results
[Figure: Precision (70 to 100) vs. Query Size (1 to 5); legend: Trans, Wine, Auto, DRAM, Credit, LCD, Glass, Proc, Housing]

R. Agrawal and R. Srikant, "Searching with Numbers", WWW 2002

Storage & Querying of eCommerce Data

[Figure: horizontal table H, "Data stored in conventional way", with columns name, monitor, recharge, output, scan, …; rows for PANL75 and KLH 221, with "-" marking absent values; queried with SELECT name, output FROM H]

1. Problem with Conventional Schema
• Large number of columns
• Sparsity
• Constant schema evolution
• Performance

Vertical Table (V)

  oid  key       val
  0    name      PANL75
  0    monitor   7 inch
  0    recharge  Built-in
  0    output    Digital
  …    …         …
  1    name      KLH 221
  1    output    S-Video

[Figure: eCommerce Applications issue queries that pass through Query Parsing, the Query Mapping Layer, Transformation, and Optimized Operator Implementation]

Pure SQL-92 Transform:

  SELECT V1.val, V2.val
  FROM V V1, V V2
  WHERE V1.key = 'name'
    AND V2.key = 'output'
    AND V1.oid = V2.oid
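The pure SQL-92 transform over the vertical table can be tried out with a small sketch. This is an illustration under assumptions, not the authors' implementation: it uses Python's standard `sqlite3` module with an in-memory database, the rows are the sample values shown for V, the helper names `build_vertical_table` and `name_and_output` are hypothetical, and the `ORDER BY` clause is added only to make the output deterministic.

```python
import sqlite3

def build_vertical_table(conn):
    """Create the vertical table V(oid, key, val) with the poster's sample rows."""
    conn.execute("CREATE TABLE V (oid INTEGER, key TEXT, val TEXT)")
    rows = [
        (0, "name", "PANL75"),
        (0, "monitor", "7 inch"),
        (0, "recharge", "Built-in"),
        (0, "output", "Digital"),
        (1, "name", "KLH 221"),
        (1, "output", "S-Video"),
    ]
    conn.executemany("INSERT INTO V VALUES (?, ?, ?)", rows)

# The horizontal query "SELECT name, output FROM H" rewritten as a
# self-join on V. ORDER BY is an addition for a stable row order.
TRANSFORMED_QUERY = """
SELECT V1.val, V2.val
FROM V V1, V V2
WHERE V1.key = 'name'
  AND V2.key = 'output'
  AND V1.oid = V2.oid
ORDER BY V1.oid
"""

def name_and_output(conn):
    """Run the transformed query and return (name, output) pairs."""
    return conn.execute(TRANSFORMED_QUERY).fetchall()

conn = sqlite3.connect(":memory:")
build_vertical_table(conn)
result = name_and_output(conn)
print(result)  # [('PANL75', 'Digital'), ('KLH 221', 'S-Video')]
```

Because the rewrite is an inner self-join, an object that lacks either key drops out of the result. Each further projected attribute adds one more copy of V to the join, which gives a sense of why hand-writing SQL against the vertical table gets tedious.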
2. Advantages of Vertical Schema
• Objects can have large number of attributes
• Handles sparseness well
• Easy schema evolution

But …
• Writing SQL is painful

3. Solution: Query Mapping Layer
• Horizontal View (HV):
    SELECT name, output FROM HV
• Hides complexity of vertical representation
• Fast performance

Recommendations for Database Vendors:
• Partial Indices
• Enhanced Table Functions (TF)
• First Class treatment of TF
• Native Support for v2h operation

Other Applications:
• Stores for XML, RDF, LDAP and Data Mining

R. Agrawal, A. Somani and Y. Xu, "Storage and Querying of E-Commerce Data", VLDB 2001

Catalog Integration

• B2B electronics portal: 2000 categories, 200K datasheets

Problem Statement
• Given
  – master categorization M:
    • categories C1, C2, …, Cn
    • set of documents in each category
  – new categorization N:
    • categories S1, S2, …, Sn
    • set of documents in each category

[Figure: Master Catalog with ICs subdivided into DSP, Mem. and Logic, holding documents a to f; New Catalog with categories Cat1 and Cat2, holding documents x, y, z, w; "After integration:" the master catalog with the new documents placed into its categories]

Goal
• Use affinity information in new catalog.
  – Products in same category are similar.
• Accuracy boost depends on match between two categorizations.

Enhanced Naïve Bayes classifier
• Standard Alg: Compute Pr(Ci | d)
• Enhanced Alg: Compute Pr(Ci | d, S)

  $\Pr(C_i \mid d, S) = \frac{\Pr(C_i \mid S)\,\Pr(d \mid C_i)}{\Pr(d \mid S)}$

  $\Pr(C_i \mid S) = \frac{|C_i| \times (\#\text{ docs in } S \text{ predicted to be in } C_i)^{w}}{\sum_j \left( |C_j| \times (\#\text{ docs in } S \text{ predicted to be in } C_j)^{w} \right)}$

• Use tuning set to determine w.
  – Defaults to standard Naïve Bayes if w = 0.
• Only affects classification of borderline documents.

Accuracy Improvement on Pangea Data
[Figure: Accuracy (60 to 100) vs. Weight (1, 2, 5, 10, 25, 50, 100, 200); series: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base]

R. Agrawal and R. Srikant, "On Integrating Catalogs", WWW 2001
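The enhanced prior Pr(Ci | S) and its effect on a borderline document can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' code: the category sizes, the first-pass prediction counts for S, and the likelihoods Pr(d | Ci) are made-up numbers, and the function names `enhanced_priors` and `classify` are hypothetical. A first pass with the standard classifier is assumed to have already produced the per-category prediction counts.

```python
def enhanced_priors(category_sizes, predicted_counts, w):
    """Pr(Ci | S) proportional to |Ci| * (# docs in S predicted to be in Ci) ** w.

    With w = 0 every count is raised to the power 0, so the prior reduces to
    the standard one, proportional to |Ci|.
    """
    scores = {
        c: size * (predicted_counts.get(c, 0) ** w)
        for c, size in category_sizes.items()
    }
    total = sum(scores.values())
    if total == 0:
        # Degenerate case (no predictions at all): fall back to the standard prior.
        scores = dict(category_sizes)
        total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def classify(likelihoods, priors):
    """Pick argmax over Ci of Pr(Ci | S) * Pr(d | Ci).

    Pr(d | S) is the same for every category, so it can be dropped.
    """
    return max(priors, key=lambda c: priors[c] * likelihoods[c])

# Hypothetical master categories of equal size.
category_sizes = {"DSP": 100, "Mem.": 100, "Logic": 100}
# Hypothetical first-pass result: of the 10 documents in source category S,
# 8 were predicted to be DSP and 2 to be Mem.
predicted_counts = {"DSP": 8, "Mem.": 2, "Logic": 0}
# Hypothetical likelihoods Pr(d | Ci) for one borderline document d from S.
likelihoods = {"DSP": 0.010, "Mem.": 0.012, "Logic": 0.001}

standard = classify(likelihoods, enhanced_priors(category_sizes, predicted_counts, w=0))
priors_w1 = enhanced_priors(category_sizes, predicted_counts, w=1)
enhanced = classify(likelihoods, priors_w1)

print(standard)  # Mem.
print(enhanced)  # DSP
```

In this toy setting the document narrowly favors Mem. on its own text, but most of its neighbors in S went to DSP, so the weighted prior tips it over. A document with a clear-cut likelihood gap would keep its original label, which matches the observation that only borderline documents are affected.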