Database Technologies for E-Commerce
Rakesh Agrawal, IBM Almaden Research Center

Outline
- Overview of the Pangea B2B marketplace
- Technology dives: catalog integration; searching specification documents; interactive parametric search; storage and retrieval of e-commerce data
- Research opportunities

Pangea Architecture
- Product selection; payment and delivery; brand and loyalty creation
- Preference system (user provided / system mined)
- Data acquisition: wrappers, cleaning, data transfer
- User profiling and market research
- Registration, accounts
- Web infrastructure: WebSphere/DB2

Product Selection
- Simple products
  - Search: parametric, description, category, substitute products
  - Browse: category tree, neighborhood
- Composite products
  - Product lists with aggregate constraints (e.g. bill of materials)
  - Configuration products with compatibility constraints
- Comparisons: price, simple attribute-value comparison, detailed descriptions (e.g. data sheets), reviews, experiences
- Bookmarking (shopping basket)
- Preference system: specification/specialization/customization
- Multicriteria auctions

Brand and Loyalty Creation
- Personal database
- Resource center: forums; tools for extracting, organizing, and maintaining information
- Career center: resume bank, job bank, AskExpert, matchmaker (query/browse), headhunters, salary surveys, job trends
- Micro communities: technology, products, chats, events
- News center: current, archive
- Preference system: specification/specialization/customization for businesses

Features (1)
- Eureka: interactive parametric search
  - Searches based on example products
  - Similarity search for product substitution
- Dynamic content generation/presentation (on the fly from the database):
  - Catalog hierarchy pages with product counts
  - Side-by-side product comparison
  - Product listings/details
  - Shopper groups through a preference layer

Features (2)
- Server-side searching:
  - Functional description search using classification
  - Keyword and part-number search
  - Restriction of the search to subtrees of the category hierarchy (pushed down to the database)
- Real-time price and availability engine:
  - Crawls, interrogates, and extracts price and availability data from various suppliers/distributors in real time, presented in side-by-side comparison format
  - New distributors/suppliers are easy to add
  - Completely client-side implementation to prevent blocking

Features (3)
- Design worksheet (generalized shopping cart):
  - List of items: an item can be a specific part, an alternative set of parts, a specification, or another design worksheet (nesting)
  - Completely integrated with all relevant components (search, content, price, etc.)
  - Aggregate constraints (e.g. total price)
  - Multiple projects saved on the server
  - Projects can be shared with other (authorized) users
  - Mining for suggestions

Features (4)
- Design data warehouse creation and maintenance
- Crawling technology for extracting part information from the websites of manufacturers/distributors/suppliers
- Data extraction from manufacturer spec sheets (PDF files)
- Classification technology to build, merge, manage, and populate category hierarchies

Features (5)
- Soft content creation and maintenance
- Crawling to acquire articles/news/postings from web news sources, Usenet newsgroups, etc.
- Agents to cluster, classify, extract concepts, and identify associations
- Personalized news/data presentation (based on user-defined profiles, channels, etc.)
- Complete interleaving and integration with part content
- Automatic generation of summary pages for manufacturers (e.g. related news articles, URLs, product categories)

Prototype
- Data for nearly 2 million parts in 2,000 categories
- Focused crawling of more than 175 manufacturers for datasheets, application notes, and manuals (160,000 documents)
- 1/4-terabyte database

Catalog Integration
R. Agrawal, R. Srikant: WWW-10

Catalog Integration Problem
- Integrate products from a new catalog into a master catalog.
- [Figure: example category trees — the master catalog has ICs split into DSP (a, b), Mem. (c), and Logic (d, e, f); the new catalog has ICs split into Cat 1 (x, y) and Cat 2 (z). After integration, x and y join DSP and z joins Logic in the master catalog.]

Desired Solution
- Automatically integrate products: little or no effort on the part of the user; domain independent.
- Problem size: millions of products, thousands of categories.

Model
- Product descriptions consist of words.
- Products live in the leaf-level categories.

Basic Algorithm
- Build a classification model using the product descriptions in the master catalog.
- Use the classification model to predict categories for products in the new catalog.
- [Figure: example classification of a new-catalog product between DSP and Logic (95% vs. 5%).]

National Semiconductor Files
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
  Part_Id: DS14185
  Manufacturer: National
  Title: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
  Description: The DS14185 is a three-driver, five-receiver device which conforms to the EIA/TIA-232-E standard. The flow-through pinout facilitates simple non-crossover board layout. The DS14185 provides a one-chip solution for the common 9-pin serial RS-232 interface between data terminal and data communications equipment.
Part: LM3940 1A Low Dropout Regulator
Part: Wide Adjustable Range PNP Voltage Regulator
Part: LM2940/LM2940C 1A Low Dropout Regulator
...

National Semiconductor Files with Categories
Part: DS14185 EIA/TIA-232 3 Driver x 5 Receiver
  Pangea category choices: 1. Transceiver; 2. Line Receiver; 3. Line Driver; 4. General-Purpose Silicon Rectifier; 5. Tapped Delay Line
Part: LM3940 1A Low Dropout Regulator
  Pangea category choices: 1. Positive Fixed Voltage Regulator; 2. Voltage-Feedback Operational Amplifier; 3. Voltage Reference; 4. Voltage-Mode SMPS Controller; 5. Positive Adjustable Voltage Regulator
...

Accuracy on Pangea Data
- B2B portal for electronic components: 1,200 categories, 40K training documents; 500 categories with fewer than 5 documents.
- Accuracy: 72% for the top choice; 99.7% for the top 5 choices.

Enhanced Algorithm: Intuition
- Use the affinity information in the catalog to be integrated (the new catalog): products in the same category are similar.
- Bias the classifier to incorporate this information.
- The accuracy boost depends on the quality of the new catalog: use a tuning set to determine the amount of bias.

Algorithm
- An extension of the Naive Bayes classifier that incorporates the affinity information.

Naive Bayes Classifier
- Pr(Ci | d) = Pr(Ci) Pr(d | Ci) / Pr(d)   (Bayes' rule)
- Pr(d): same for all categories (ignore).
- Pr(Ci) = (#docs in Ci) / (#docs total)
- Pr(d | Ci) = Π_{w ∈ d} Pr(w | Ci) — words occur independently (unigram model)
- Pr(w | Ci) = (n(Ci, w) + λ) / (n(Ci) + λ|V|) — maximum-likelihood estimate smoothed with Lidstone's law of succession

Enhanced Algorithm
- Compute Pr(Ci | d, S), where d existed in category S of the new catalog:
  Pr(Ci | d, S)
    = Pr(Ci, d, S) / Pr(d, S)
    = Pr(Ci) Pr(S, d | Ci) / Pr(S, d)
    = Pr(Ci) Pr(S | Ci) Pr(d | Ci) / Pr(S, d)   (assuming d and S independent given Ci)
    = Pr(S) Pr(Ci | S) Pr(d | Ci) / Pr(S, d)    (since Pr(S | Ci) Pr(Ci) = Pr(Ci | S) Pr(S))
    = Pr(Ci | S) Pr(d | Ci) / Pr(d | S)         (since Pr(S, d) = Pr(S) Pr(d | S))
- Same as Naive Bayes except that Pr(Ci | S) replaces Pr(Ci); Pr(d | S) is the same for all classes and can be ignored.

Computing Pr(Ci | S)
- Pr(Ci | S) = |Ci| (#docs in S predicted to be in Ci)^w / Σ_{j ∈ [1,n]} |Cj| (#docs in S predicted to be in Cj)^w
- |Ci| = #docs in Ci in the master catalog; w determines the weight given to the new catalog.
- Use a tune set of documents in the new catalog for which the correct categorization in the master catalog is known; choose one weight for the entire new catalog or different weights for different sections.

Superiority of the Enhanced Algorithm
- Theorem: the highest possible accuracy achievable with the enhanced algorithm is no worse than what can be achieved with the basic algorithm.
- Catch: the optimum value of the weight for which the enhanced algorithm achieves its highest accuracy is data dependent. The tune-set method attempts to select a good value for the weight, but there is no guarantee of success.
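The enhanced algorithm above can be sketched in a few lines. This is a hypothetical minimal implementation, not the paper's code: documents are pre-tokenized word lists, λ-smoothing uses a fixed ALPHA, and Pr(Ci | S) is computed from a first basic-classification pass over all documents of one new-catalog category S, using the class prior as a stand-in proportional to |Ci|.

```python
from collections import Counter
from math import log

ALPHA = 1.0  # Lidstone smoothing parameter (lambda in the slides)

def train(master):
    """master: {category: [docs]}, each doc a list of words."""
    vocab = {w for docs in master.values() for d in docs for w in d}
    total = sum(len(docs) for docs in master.values())
    model = {}
    for c, docs in master.items():
        counts = Counter(w for d in docs for w in d)
        model[c] = {"prior": len(docs) / total,   # Pr(Ci)
                    "counts": counts, "n": sum(counts.values())}
    return model, vocab

def log_likelihood(model, vocab, c, doc):
    # log Pr(d|Ci) with Lidstone smoothing: (n(Ci,w)+a) / (n(Ci)+a|V|)
    m = model[c]
    return sum(log((m["counts"][w] + ALPHA) / (m["n"] + ALPHA * len(vocab)))
               for w in doc)

def classify(model, vocab, doc, prior=None):
    # `prior` overrides Pr(Ci) with Pr(Ci|S) in the enhanced algorithm
    def score(c):
        p = prior[c] if prior is not None else model[c]["prior"]
        return log(p) + log_likelihood(model, vocab, c, doc) if p > 0 else float("-inf")
    return max(model, key=score)

def enhanced_classify(model, vocab, section_docs, w=5.0):
    """Classify all docs of one new-catalog category S together."""
    # First pass: basic predictions give the per-category counts for S.
    first = [classify(model, vocab, d) for d in section_docs]
    hits = Counter(first)
    # Pr(Ci|S) proportional to |Ci| * (#docs of S predicted in Ci)^w
    weights = {c: model[c]["prior"] * (hits[c] ** w) for c in model}
    z = sum(weights.values())
    if z == 0:
        return first
    prior_s = {c: weights[c] / z for c in weights}
    return [classify(model, vocab, d, prior=prior_s) for d in section_docs]
```

With w = 0 the weights reduce to the plain priors and the enhanced classifier degenerates to basic Naive Bayes, which mirrors the tuning knob described on the slides.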
Empirical Evaluation
- Start with a real catalog M; remove n products from M to form the new catalog N.
- In the new catalog N, assign f*n products to the same category as in M, and assign the rest to other categories as per some distribution (but remember their true categories).
- Accuracy: the fraction of products in N assigned to their true categories.

[Figure: improvement in accuracy vs. weight (1 to 200) on Pangea — curves for Perfect, 90-10, 80-20, GaussianA, and GaussianB assignments against the Base accuracy.]
[Figure: the corresponding accuracy-vs-weight curves for Reuters and for Google.Outdoors (weights 1 to 1000).]
[Figure: accuracy vs. tune-set size (0 to 50) on Pangea; similar results for Reuters and Google.]
[Figure: percentage of errors for the standard vs. enhanced algorithm at different purities — number of classes and their distribution: 71-22-6, 79-21, 100.]

Summary
- Classification accuracy can be improved by factoring in the affinity information implicit in the data to be categorized.
- Open question: how to apply these ideas to other types of classifiers?

Searching Specification Documents
R. Agrawal, R. Srikant: WWW-2002

Specification Documents
- Consist of <attribute name, value> pairs embedded in text.
- Examples: data sheets for electronic parts; classifieds.

Sources of Problems
- Synonyms for attribute names and units: "lb" and "pounds", but no "lbs" or "pound".
- Attribute names are often missing: no "Speed", just "MHz Pentium III"; no "Memory", just "MB SDRAM".
- Accurate data extraction is hard, e.g. this partial datasheet for the Cypress CY7C225A PROM:
    High Speed: 18 ns address set-up, 12 ns clock to output
    Low Power: 495 mW (commercial), 660 mW (military)

An End Run!
- Use a simple regular-expression extractor to get numbers.
- Do simple data extraction to get hints, e.g.:
  - Hint for the unit: the word following the number.
  - Hint for the attribute name: the k words following the number.
  - "256 MB SDRAM memory" → unit hint: MB; attribute hints: SDRAM, Memory.
- Use only numbers in the queries; treat any attribute name in the query also as a hint.

Documents and Queries
- With hints: document D = {<ni, Hi> | ni ∈ N, Hi ⊆ A, 1 ≤ i ≤ m}; query Q = {<qi, Ai> | qi ∈ N, Ai ⊆ A, 1 ≤ i ≤ k}, where the Hi and Ai are hints.
  Example: D = {<256, {MB, SDRAM, Memory}>, <700, {MHz, CPU}>}; Q = <200 MB, 750 MHz>.
- Without hints: D = {ni | ni ∈ N, 1 ≤ i ≤ m}; Q = {qi | qi ∈ N, 1 ≤ i ≤ k}.
  Example: D = {256, 700}; Q = {200, 750}.

Can It Work?
- Yes — provided the data is non-reflective.
- Reflectivity can be computed a priori for a given data set and provides an estimate of the expected accuracy.

Demo

Reflectivity
- A data set is reflective when, for each point <x=i, y=j>, the reflected point <x=j, y=i> is also present.
[Figure: scatter plots contrasting non-reflective, low-reflectivity, and high-reflectivity two-dimensional data sets.]

Non-Reflectivity in Real Life
- Non-overlapping attributes: Memory 64-512 MB, Disk 10-40 GB.
- Correlations: Memory 64-512 MB, Disk 10-100 GB is still fine.
- Clusters.

Basic Idea
- Consider the coordinates of a point. If there is no data point at the permutations of its coordinates, the point is non-reflective; if there are only a few data points at the permutations of its coordinates, the point is largely non-reflective.
- Compute reflectivity as this ratio summed over all points, considering the neighborhood of each point in the calculation.

Reflectivity (Definition)
- D: a set of m-dimensional points; n: the coordinates of point x; h(n): the number of points within distance r of n; reflections(x): the permutations of n; q(n): the number of points in D that have at least one reflection within distance r of n.
- Reflectivity(m, r) = 1 - (1/|D|) Σ_{x ∈ D} h(n)/q(n); non-reflectivity = 1 - reflectivity.
- See the paper for the reflectivity of D over k-dimensional subspaces.

Algorithms
- How to compute the match score (rank) of a document for a given query?
- How to limit the number of documents for which the match score is computed?

Match Score of a Document
- Select k numbers from D yielding the minimum distance between Q and D:
  F(Q, D) = (Σ_{i=1..k} w(qi, nji)^p)^(1/p)
- Map the problem to bipartite matching in a graph G:
  - k source nodes, corresponding to the query numbers;
  - m target nodes, corresponding to the document numbers;
  - an edge from each source to its k nearest targets, with weight w(qi, nj)^p on edge (qi, nj).

Bipartite Matching
- The optimum solution to the minimum-weight bipartite graph matching problem matches each number in Q with a distinct number in D such that the distance score F(Q, D) is minimized. The minimum score gives the rank of document D for query Q.
- [Figure: example matching, taking F to be L1 and w(qi, nj) := |qi - nj| / (|qi| + e), between a query and a document containing the numbers 10, 20, 60, and 75, with the resulting edge weights shown.]

Limiting the Set of Documents
- Similar to the score aggregation problem [Fagin, PODS 96]; the proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01].
- Make k conceptual sorted lists, one for each query term (using documents = index(number)).
  [Figure: example sorted lists for two query terms (60 and 20), each entry a number with its distance and its posting list of documents, e.g. 66/.1 → D2, D7; 25/.25 → D1, D5, D7, D9.]
- Do a round-robin access to the lists. For each document found, compute its distance F(D, Q).
- Let ni be the number last looked at for query term qi, and let the threshold be t = (Σ_{i=1..k} w(qi, ni)^p)^(1/p).
- Halt once the required number of documents have been found whose distance is at most the threshold t.

Evaluation Metric
- Benchmark: the set of answers when attribute names are precisely known in both the document and the query.
- Metric: what fraction of the top 10 "true" answers are present in the top 10 answers when attribute names are unknown in both document and query?
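The match score F(Q, D) above can be sketched as follows. This is an illustrative toy implementation, not the paper's: for the small query sizes involved, the minimum-weight bipartite matching is found by brute force over injective assignments of query numbers to document numbers, whereas a real implementation would build the k-nearest-edge graph and prune candidate documents with the TA-style threshold described above. All names are made up for the example.

```python
from itertools import permutations

# w(q, n) = |q - n| / (|q| + eps), the per-pair distance from the slides
EPS = 1e-9

def w(q, n):
    return abs(q - n) / (abs(q) + EPS)

def match_score(query, doc, p=1):
    """F(Q,D): minimum over assignments of (sum of w(qi, nji)^p)^(1/p)."""
    best = float("inf")
    # Each query number must match a *distinct* document number.
    for assign in permutations(doc, len(query)):
        s = sum(w(q, n) ** p for q, n in zip(query, assign))
        best = min(best, s)
    return best ** (1 / p)

def rank_documents(query, docs, top=10):
    """Rank documents by ascending match score (best match first)."""
    scored = sorted(docs.items(), key=lambda kv: match_score(query, kv[1]))
    return [name for name, _ in scored[:top]]
```

For example, with documents {256, 700} and {128, 350}, the query {200, 750} ranks the first document higher, matching 200 with 256 and 750 with 700.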
Accuracy Results
Data set   Records   Fields
DRAM       3,800     10
LCD        1,700     12
Proc       1,100     12
Trans      22,273    24
Auto       205       16
Credit     666       6
Glass      214       10
Housing    506       14
Wine       179       14
[Figure: accuracy vs. query size (1 to 5) for the nine data sets.]

Reflectivity Estimates Accuracy
- Non-reflectivity closely tracked accuracy for all nine data sets.
- Non-reflectivity arises due to clustering and correlations in real data. (Randomized non-reflectivity: the value obtained after permuting the values within the columns.)
[Figure: accuracy, non-reflectivity, and randomized non-reflectivity vs. query size.]

Incorporating Hints
- L1 per-pair score: w(qi, ni) + B v(Ai, Hi)
  - v(Ai, Hi): the distance between the attribute name (or unit) for qi and the set of hints for ni.
  - B: the relative importance of the number match vs. the name/unit match.

Balance Between Number Match and Hint Match
- The weight given to hints should depend on the accuracy of the hints; use a tune set to determine B on a per-dataset basis.
[Figure: accuracy on Wine vs. the weight of the hint match (0.01 to 10), for hint accuracies 0.99, 0.9, 0.8, 0.6, and no hints.]

Effectiveness of Hints
- The improvement in accuracy depends on how good the hints are.
[Figure: accuracy on Wine vs. query size (1 to 10), for hint accuracies 0.99, 0.9, 0.8, 0.6, and no hints.]

Effectiveness of Indexing
- 1 million documents: about 1 second for query size 5; about 0.03 seconds for query size 1.
[Figure: scan vs. indexed query time on DRAM as a function of query size.]

Summary
- Allows querying using only numbers, or numbers plus hints.
- An end run around data extraction: use a simple extractor to generate hints.
- Can ascertain a priori when the technique will be effective.

Future Work
- Integration with classic IR (keyword search), e.g. "PROM speed 20 ns power 500 mW".
- Extension to non-numeric values.

Parametric Search of E-Commerce Data
J. Shafer, R. Agrawal: WWW-9

Conventional Wisdom
- The user enters search criteria into an HTML form.
- The user's query is received by the web server and submitted to a server-side database (usually as SQL).
- The result set is returned to the user as an HTML page (or a series of pages).

Search Problem in E-Commerce
- Users often don't know exactly what they are looking for.
  - Search is a process, not a single operation; an individual query is simply a means to an end.
  - Too many or too few results: try a different query.
- There is no universal ranking function.

Essential Ideas of Eureka
- Incrementally fetch a superset of the tuples of interest into the client (e.g. all products in the product category of interest).
- A user interface and fast indexing structures for interactive exploration of the cached tuples: restriction by example, ranking, fuzzy restrictions.

Architecture Overview
[Figure: client-side Eureka components (ListRenderer, DataPump, DataGroup, DataColumn #1 ... DataColumn #N) communicating over HTTP with a server-side servlet, which talks to the database via JDBC.]

Comments
- Used Eureka in several situations with very positive feedback.
- Used Eureka on data sets with 100K records with no visible deterioration in performance.
- Performance is excellent, even in Java.

Storage and Retrieval of E-Commerce Data
R. Agrawal, A. Somani, Y.
Xu: VLDB-2001

Typical E-Commerce Data Characteristics
- An experimental e-marketplace for computer components: nearly 2 million components, more than 2,000 leaf-level categories.
- Large number of attributes (5,000); constantly evolving schema.
- Sparsely populated data (about 50-100 attributes per component).

Conventional Horizontal Representation (one n-ary relation)
Name         Monitor  Height  Recharge  Output   Playback      Smooth Scan  Progressive Scan
PAN DVD-L75  7 inch   -       Built-in  Digital  -             -            -
KLH DVD221   -        3.75    -         S-Video  -             -            No
SONY S-7000  -        -       -         -        -             -            -
SONY S-560D  -        -       -         -        Cinema Sound  Yes          -
- Database catalogs do not support thousands of columns (DB2/Oracle limit: 1,012 columns).
- Storage overhead of NULL values: NULLs increase the index size, and they sort high in the DB2 B+ tree index.
- Hard to load/update; schema evolution is expensive.
- Querying is straightforward.

Binary Representation (n 2-ary relations)
Monitor: (PAN DVD-L75, 7 inch). Height: (KLH DVD221, 3.75). Output: (PAN DVD-L75, Digital), (KLH DVD221, S-Video).
- Dense representation, but manageability is hard because of the large number of tables, and schema evolution is expensive.
- Decomposition storage model [Copeland et al., SIGMOD 85], [Khoshafian et al., ICDE 87]; Monet binary attribute tables [Boncz et al., VLDB Journal 99]; attribute approach for storing XML data [Florescu et al., INRIA Tech Report 99].

Vertical Representation (one 3-ary relation)
Columns: Oid (object identifier), Key (attribute name), Val (attribute value).
Oid  Key               Val
0    Name              PAN DVD-L75
0    Monitor           7 inch
0    Recharge          Built-in
0    Output            Digital
1    Name              KLH DVD221
1    Height            3.75
1    Output            S-Video
1    Progressive Scan  No
2    Name              SONY S-7000
...
- Objects can have a large number of attributes; handles sparseness well; schema evolution is easy.
- Used in the implementation of SchemaSQL [LSS 99] and in the edge approach for storing XML data [FK 99].

Querying over the Vertical Representation
- A simple query on a horizontal schema:
    SELECT Monitor FROM H WHERE Output = 'Digital'
  becomes quite complex:
    SELECT v1.Val
    FROM vtable v1, vtable v2
    WHERE v1.Key = 'Monitor' AND v2.Key = 'Output'
      AND v2.Val = 'Digital' AND v1.Oid = v2.Oid
- Writing applications becomes much harder. What can we do?

Solution: Query Mapping
- A translation layer maps relational algebraic operations on the horizontal view H (Attr1 ... Attrk) to operations on the vertical table V (Oid, Key, Val).

Key Issues
- Can we define a translation algebra? The standard database view mechanism does not work: attribute values become column names (we need higher-order views).
- Can the vertical representation support fast querying?
- What are the new requirements on the database engines?

Transformation Algebra
- Defined an algebra for transforming expressions over horizontal views into expressions over the vertical representation. Two key operators:
  - v2h operation: convert from vertical to horizontal by joining per-attribute slices onto the object identifiers:
    v2h_k(V) = Π_Oid(V) outer-join Π_{Oid,Val}(σ_{Key='A1'}(V)) outer-join ... outer-join Π_{Oid,Val}(σ_{Key='Ak'}(V))
  - h2v operation: convert from horizontal to vertical as a union of per-attribute (Oid, 'Ai', Ai) projections over the non-null values, with an extra term to preserve objects whose attributes are all null:
    h2v_k(H) = ∪_{i=1..k} Π_{Oid,'Ai',Ai}(σ_{Ai ≠ null}(H)) ∪ ...
  - Similar to Unfold/Fold [LSS 99] and Gather/Scatter [STA 98].

Implementation Strategies
- VerticalSQL: uses only SQL-92 level capabilities.
- VerticalUDF: exploits user-defined functions and table functions.
- Binary: a 2-ary representation with one relation per attribute (using only SQL-92 transforms).
- SchemaSQL: addresses a more general problem; performed 2-3x worse than the vertical representation because of temporary tables.

Performance: Experimental Setup
- 600 MHz dual-processor Intel Pentium machine, 512 MB RAM, Windows NT 4.0, two 30 GB IDE drives.
- IBM DB2 UDB 7.1; buffer pool size 50 MB; all reported numbers are cold.
- Synthetic data, e.g. 200x100K at r = 10%: 200 columns, 100K rows, non-null density 10%.

Clustering by Key Outperformed Clustering by Oid
[Figure: join execution time (seconds) for VerticalSQL clustered by Oid vs. clustered by Key, at join selectivities 0.1%, 1%, and 5%, on a 1000-column x 20K-row table with density 10%.]

VerticalSQL is comparable to Binary and outperforms
Horizontal.
[Figure: execution time (seconds) of HorizontalSQL, VerticalSQL, and Binary for a projection of 10 columns at density 10%, on tables from 200x100K to 1000x20K (#cols x #rows).]

VerticalUDF Is the Best Approach
[Figure: execution time (seconds) of VerticalSQL, Binary, and VerticalUDF for the same 10-column projection at density 10%.]

Wish List from Database Engines
- The VerticalUDF approach showed the need for enhanced table functions and first-class treatment of table functions.
- Native support for the v2h and h2v operations.
- Partial indices.

Summary
Criterion       Horizontal  Vertical (w/ mapping)  Binary (w/ mapping)
Manageability   +           +                      -
Flexibility     -           +                      -
Querying        +           +                      +
Performance     -           +                      +

Opportunities

Semiautomatic Catalog Creation
- Technology for quickly assembling a catalog from semi-structured data.
[Figure: semiautomatic catalog extraction — a proxy between the web browser and the supplier web sites (Supplier 1 ... Supplier n) captures target pages and logs; data mining over the logs produces fetch rules and extract rules in a rule repository, which populate the WCS catalog.]
[Figure: processing architecture — a web crawler and other sources feed a data store; PDF-to-text conversion and OCR feed an extractor, a categorizer, and an indexer (IM for Text).]

Improving the Catalog Hierarchy
- Identify sets of nodes (or a single "kitchen sink" node) for reorganization.
- Cluster the set of nodes, and compare the resulting hierarchy with the original.
- Identify nodes that fit better elsewhere in the hierarchy.
- Metrics: classification accuracy, tightness of clusters.

Integration of Real-Time Operational Data
[Figure: a client request passes through the server, which fetches the target page(s) from remote sites 1..n, queries and extracts the responses in real time, and returns the result to the client.]

Privacy-Preserving Data Mining
- Insight: preserve privacy at the individual level while still building accurate data mining models at the aggregate level.
- Add random noise to individual values to protect privacy; this can dramatically change the distribution of values. For example, an age of 30 might become 65 (30 + 35).
- An EM algorithm estimates the original distribution of values given the randomized values and the randomization function. The estimate is only accurate over thousands of values, which preserves privacy.
- Algorithms for building classification models and discovering association rules on top of privacy-preserved data with only a small loss of accuracy.
[Figure: Alice's record (age 30, salary 70K) and Bob's record (age 50, salary 40K) pass through randomizers (becoming, e.g., 65 | 20K and 25 | 60K) before reaching the server, which reconstructs the distributions of age and salary and feeds them to the data mining algorithms to produce the data mining model.]

Seems to Work Well!
[Figure: number of people vs. age for the original, randomized, and reconstructed distributions — the reconstruction closely tracks the original.]

Accuracy vs. Privacy
[Figure: classification accuracy (Fn 3) of the original, randomized, and reconstructed data as the randomization level increases from 10 to 200.]

Summary
- E-Commerce is a rich application domain for database technology.
- Research opportunities abound.
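The randomize-then-reconstruct idea from the privacy-preserving data mining discussion can be sketched as follows. This is a simplified, hypothetical illustration — uniform additive noise and a histogram-based, EM-style Bayesian update over bins — not the exact algorithm from the work; all constants and names are invented for the example.

```python
import random

random.seed(42)

NOISE = 40  # noise is uniform in [-NOISE, +NOISE]; its distribution is public

def randomize(values):
    """Each individual reports value + random noise (e.g. age 30 -> 65)."""
    return [v + random.uniform(-NOISE, NOISE) for v in values]

def noise_density(y):
    return 1.0 / (2 * NOISE) if -NOISE <= y <= NOISE else 0.0

def reconstruct(randomized, bins, iters=50):
    """Estimate Pr(X in bin) from randomized values via iterative Bayes updates."""
    centers = [(lo + hi) / 2 for lo, hi in bins]
    f = [1.0 / len(bins)] * len(bins)  # start from a uniform prior
    for _ in range(iters):
        new = [0.0] * len(bins)
        for z in randomized:
            # posterior over bins for this single randomized observation
            post = [noise_density(z - c) * f[a] for a, c in enumerate(centers)]
            s = sum(post)
            if s == 0:
                continue
            for a in range(len(bins)):
                new[a] += post[a] / s
        f = [n / len(randomized) for n in new]
    return f

# True ages are concentrated in 20-40; the randomized values spread far
# wider, but the reconstructed histogram recovers the aggregate shape
# without revealing any individual's true age.
ages = [random.uniform(20, 40) for _ in range(2000)]
bins = [(b, b + 20) for b in range(-40, 100, 20)]
estimate = reconstruct(randomize(ages), bins)
```

Note that the reconstruction is only trustworthy in aggregate: with a handful of records the posterior per observation stays diffuse, which is exactly the property the slides rely on for privacy.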