Download N - Rakesh Agrawal

Database Technologies For Electronic Commerce
Rakesh Agrawal, Ramakrishnan Srikant, Yirong Xu

Searching with Numbers

Catalog Database
• Dell Computer: 700 MHz Celeron, 256 MB SDRAM, …
• IBM Thinkpad: 750 MHz Pentium 3, 196 MB DRAM, …

[Figure: the queries "800 200" and "800 200 3 lb" shown with the results IBM Thinkpad (750 MHz, 196 MB) … and Dell (700 MHz, 256 MB)]

Reflectivity
• If we get a close match on numbers, how likely is it that we have correctly matched attribute names?
  – Likelihood depends on the non-reflectivity (of the data)
• Let
  – D: dataset, n_i: co-ordinates of point x_i
  – reflections(x_i): permutations of n_i
  – η(n_i): # of points within distance r of n_i
  – ρ(n_i): # of reflections within distance r of n_i

$\mathrm{Reflectivity} = 1 - \frac{1}{|D|} \sum_{x_i \in D} \frac{\eta(n_i)}{\rho(n_i)}$

• Non-overlapping attributes ⇒ Non-reflective.
  – Memory: 64 - 512 MB, Disk: 10 - 40 GB
• Correlations or Clustering ⇒ Low reflectivity.
  – Memory: 64 - 512 MB, Disk: 10 - 100 GB

Empirical Results
[Figure: three panels (Non-Reflective, Low Reflectivity, High Reflectivity); axis labels Precision and Query Size; legend: Trans, Wine, Auto, DRAM, Credit, LCD, Glass, Proc, Housing]

R. Agrawal and R. Srikant, “Searching with Numbers”, WWW 2002

Storage & Querying of eCommerce Data

• B2B electronics portal: 2000 categories, 200K datasheets

Data stored in conventional way (table H):

  name     monitor  recharge  output   scan  …
  PANL75   7 inch   Built-in  Digital  …     …
  KLH 221  -        -         S-Video  -     …

  SELECT name, output FROM H

1. Problem with Conventional Schema
• Large number of columns
• Sparsity
• Constant schema evolution
• Performance

2. Advantages of Vertical Schema

Vertical Table (V):

  oid  key       val
  0    name      PANL75
  0    monitor   7 inch
  0    recharge  Built-in
  0    output    Digital
  …    …         …
  1    name      KLH 221
  1    output    S-Video

• Objects can have large number of attributes
• Handles sparseness well
• Easy schema evolution

But …
• Writing SQL is painful

3. Solution: Query Mapping Layer
• Hides complexity of vertical representation
• Fast performance

[Diagram labels: eCommerce Applications; Horizontal View (HV); Query Mapping Layer (Query Parsing, Transformation); Optimized Operator Implementation; Pure SQL-92; Vertical Table (V)]

Query against the Horizontal View (HV):

  SELECT name, output FROM HV

Transform:

  SELECT V1.val, V2.val
  FROM V V1, V V2
  WHERE V1.key = 'name'
    AND V2.key = 'output'
    AND V1.oid = V2.oid

Recommendations for Database Vendors:
• Partial Indices
• Enhanced Table Functions (TF)
• First Class treatment of TF
• Native Support for v2h operation

Other Applications:
• Stores for XML, RDF, LDAP and Data Mining

R. Agrawal, A. Somani and Y. Xu, “Storage and Querying of E-Commerce Data”, VLDB 2001

Catalog Integration

[Figure: Master Catalog (ICs, with subcategories DSP, Mem. and Logic) holding products a to f; New Catalog (Cat1, Cat2) holding products w to z; after integration the new products appear under the master categories]

Problem Statement
• Given
  – master categorization M:
    • categories C1, C2, …, Cn
    • set of documents in each category
  – new categorization N:
    • categories S1, S2, …, Sn
    • set of documents in each category

Goal
• Use affinity information in new catalog.
  – Products in same category are similar.
• Accuracy boost depends on match between two categorizations.

Enhanced Naïve Bayes classifier
• Standard Alg: Compute Pr(Ci | d)
• Enhanced Alg: Compute Pr(Ci | d, S)

$\Pr(C_i \mid d, S) = \frac{\Pr(C_i \mid S)\,\Pr(d \mid C_i)}{\Pr(d \mid S)}$

$\Pr(C_i \mid S) = \frac{|C_i| \times (\text{\# docs in } S \text{ predicted to be in } C_i)^w}{\sum_j |C_j| \times (\text{\# docs in } S \text{ predicted to be in } C_j)^w}$

• Use tuning set to determine w.
  – Defaults to standard Naïve Bayes if w = 0.
• Only affects classification of borderline documents.

Accuracy Improvement on Pangea Data
[Figure: Accuracy (60 to 100) versus Weight (1, 2, 5, 10, 25, 50, 100, 200); legend: Perfect, 90-10, 80-20, GaussianA, GaussianB, Base]

R. Agrawal and R. Srikant, “On Integrating Catalogs”, WWW 2001
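The reflectivity definition in the Searching with Numbers panel can be sketched in a few lines. This is only an illustration under stated assumptions: distance is taken to be Euclidean, a point's reflections are all permutations of its coordinates (the identity included), η counts the points within distance r, and ρ is read as the number of points that have at least one reflection within distance r. The function name `reflectivity` and the sample data are hypothetical.

```python
from itertools import permutations
from math import dist


def reflectivity(points, r):
    """Reflectivity = 1 - (1/|D|) * sum over points of eta / rho (assumed reading)."""
    total = 0.0
    for n_i in points:
        # eta: points of the dataset lying within distance r of n_i (n_i itself counts)
        eta = sum(1 for p in points if dist(p, n_i) <= r)
        # rho: points with at least one coordinate permutation within distance r of n_i
        rho = sum(
            1 for p in points
            if any(dist(q, n_i) <= r for q in permutations(p))
        )
        total += eta / rho
    return 1.0 - total / len(points)


# Memory and disk ranges do not overlap, so no reflection lands near a real point.
non_overlapping = [(64, 10), (128, 20), (512, 40)]
# Each point here is an exact reflection of the other.
mirrored = [(10, 20), (20, 10)]

print(reflectivity(non_overlapping, r=5))
print(reflectivity(mirrored, r=1))
```

Under this reading, data whose attribute ranges do not overlap scores 0, while data in which every point has a mirrored twin scores higher, which matches the intuition in the bullets on non-overlapping attributes.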
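The transform from a horizontal-view query to a self-join over the vertical table can be tried out with Python's built-in `sqlite3` module. The sketch below is not the query mapping layer from the poster: the helper `horizontal_query` is a hypothetical name, the keys are stored in lowercase, and an `ORDER BY` is added only to make the output order predictable. The rows are the ones shown in the vertical table.

```python
import sqlite3


def horizontal_query(attrs):
    """Build a self-join over V with one alias (V1, V2, ...) per requested attribute."""
    aliases = [f"V{i + 1}" for i in range(len(attrs))]
    select = ", ".join(f"{a}.val" for a in aliases)
    tables = ", ".join(f"V {a}" for a in aliases)
    conds = [f"{a}.key = ?" for a in aliases]
    conds += [f"{aliases[0]}.oid = {a}.oid" for a in aliases[1:]]
    return (
        f"SELECT {select} FROM {tables} "
        f"WHERE {' AND '.join(conds)} ORDER BY {aliases[0]}.oid"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE V (oid INTEGER, key TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO V VALUES (?, ?, ?)",
    [
        (0, "name", "PANL75"),
        (0, "monitor", "7 inch"),
        (0, "recharge", "Built-in"),
        (0, "output", "Digital"),
        (1, "name", "KLH 221"),
        (1, "output", "S-Video"),
    ],
)

# Equivalent of: SELECT name, output FROM HV
attrs = ["name", "output"]
rows = conn.execute(horizontal_query(attrs), attrs).fetchall()
print(rows)

# An inner self-join drops objects that lack one of the requested attributes.
attrs2 = ["name", "monitor"]
rows2 = conn.execute(horizontal_query(attrs2), attrs2).fetchall()
print(rows2)
```

The second query illustrates one reason hand-written SQL over the vertical table is painful: a plain self-join silently omits sparse objects, so a faithful horizontal view would need outer joins or a dedicated operator.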
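The two formulas of the enhanced Naïve Bayes classifier can be illustrated with made-up numbers. The sketch assumes the per-category likelihoods Pr(d | Ci) are already available from a standard classifier and that Pr(d | S) is obtained by normalizing over the categories; the function names `enhanced_prior` and `posterior` and all counts are hypothetical.

```python
def enhanced_prior(master_sizes, predicted_counts, w):
    """Pr(Ci | S): |Ci| times (# docs in S predicted to be in Ci)^w, normalized over i."""
    scores = [size * count ** w for size, count in zip(master_sizes, predicted_counts)]
    total = sum(scores)
    return [s / total for s in scores]


def posterior(prior, likelihoods):
    """Pr(Ci | d, S): prior times Pr(d | Ci), normalized so the values sum to 1."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


master_sizes = [100, 100]   # |C1|, |C2| in the master catalog
predicted_counts = [9, 1]   # docs of new category S predicted into C1 and C2
likelihoods = [0.4, 0.6]    # Pr(d | C1), Pr(d | C2) for a borderline document d

standard = posterior(enhanced_prior(master_sizes, predicted_counts, w=0), likelihoods)
enhanced = posterior(enhanced_prior(master_sizes, predicted_counts, w=1), likelihoods)
print(standard)
print(enhanced)
```

With w = 0 the exponent removes the influence of S and the prior reduces to the category sizes, so the borderline document goes to C2. With w = 1 the fact that most of its siblings in S were placed in C1 pulls it into C1.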
