Challenges in Scalable Data Mining: Support for DB-Backed Web Sites
Raghu Ramakrishnan
Professor, UW-Madison

Talk Outline
Introduction
– Personalization
– User interactivity
– DB implications
Background
– Tracking users
– Content management and delivery
Challenges

Introduction

Evolution of Websites
Standard Content
Passive Users
Personalized Content
Active Users

Fundamental Shift
User-centric design of websites.
– Web, unlike print, phones, TV, and other media, offers a unique opportunity to present each customer with a customized experience.
– Exploiting this potential is becoming a key differentiator across sites.
See http://www.personalization.com/resources/vendors/

Personalization
Adapt the site to each user, even each visit.
– Bugs Bunny is different from Michael Jordan
– Last time, Bugs was shopping for himself, this time he’s looking for a gift for Michael

Personalization
Technical Implications:
– Need to know something about user, current visit
– Need to dynamically alter requested page
Privacy concerns:
– Will an individual user’s profile be disclosed/sold to others? Will the profile information be used in ways other than to improve that user’s site-experience in ways that the user approves?

User Interactions
Traditionally: Searches, purchases.
– Doesn’t leverage a site’s biggest asset: its users. Site itself is not changed by users in this model.
Richer interactions: Web communities.
– Put up for auction, bid
– Comment, rate
– Form groups and work together
– Ask, answer
– User-generated content

Web Communities
Site content driven by users, and changes rapidly.
Viral growth patterns lead to high volumes of traffic.
Need to validate, review for quality.
– Must track user activity
Greater need for personalization, push technologies.
– Again, need to track users, dynamic pages

DBs, Mining, and Websites
Personalization and increased user interactivity both lead to websites that deliver dynamically constructed pages, based on data in a DB.
Ergo, we have a vast new application domain for database management systems.
Ergo, we have a challenge: How best to adapt each page to the current user and context.

Background

Tracking Users: Cookies
GET
– Browser issues this command to retrieve a doc; includes all cookies visible to target server
– Server responds with header info, including doc size, server location, cookie directives, etc., plus document
GET … Cookie: visits=10
… Set-Cookie: visits=11

Cookies
Server can set following parameters for cookies:
– Name and value
– When cookie expires
– Which pages on server “see” the cookie
– Which servers can “see” the cookie
    E.g., Doubleclick servers can see cookies set at many sites
Alternative to cookies:
– Carry request history along: modify each requested page to “attach” history to every link on page!
– Allows session tracking, but not across sessions.

Vignette StoryServer
A platform for developing dynamic web sites:
– Content personalization and delivery
– Content Management
An elaborate gateway that sits between web servers and DBMSs (and file systems).
Spin-off from CNET’s efforts to develop their own site.

Vignette StoryServer
A page is assembled dynamically from components:
– Adaptive navigation bars
– Summary components (e.g., top-ten lists)
– Personalized elements (e.g., selected news); integration with recommendation engines such as Net Perceptions’ GroupLens is supported
Caching support for components provides ability to trade-off degree of dynamism (and customization)

Data Mining Challenges

A List of Challenges
Similarity (real-time)
Matching (real-time)
Trends (off-line)
Correlation (off-line)

The Similarity Problem
Find users with similar tastes, in context.
– Joe’s looking at an Athlon processor; which users are similar to Joe in their PC tastes? Whose recommendations is Joe likely to follow?
Find similar content, in context.
– Which processors are similar in that they appeal to the same groups of people?
– Which processors are similar in that they have similar performance characteristics?
– Which articles appeal to the same people?

The Matching Problem
Match user to data, in context.
– What related information should you recommend to Joe when he is looking at the Athlon PC product?
    Related products: graphics cards, monitors
    Related reviews, discussions
    If Joe’s been looking only at AMD products, other AMD chips; if not, show alternatives from Intel
Match data to user, in context.
– Which expert is best qualified to answer Joe’s question?

The Trends Problem
Identify trends in sales.
Identify trends in overall user preferences, user segmentation.
Identify trends for individual users.
Identify trends in overall product popularity, product segmentation.
Identify trends for specific products.

The Correlations Problem
Given a set of trends (e.g., in pricing) track the impact on other trends.
– Are there correlated trends?
– Are there causal relationships?
Note that correlating a given trend to an overall trend is hard enough, but trying to find all other individual or product-specific trends that happen to be correlated is much harder!

Problem Characteristics
Large datasets: Many users, huge activity levels, lots of products, lots of documents, …
Real-time recommendations: “In context”
Constantly evolving data: Data mining models can get outdated, want to find trends.
Variations:
– Attach recommendation engine to a user’s browser, rather than to the web server. (Purple Swami)
– Look for similar documents across sites and extract relevant metadata. (Whizbang)

Summary
Lots of challenges.
Lots of players.
– Companies that provide applications and integrate data mining into the application logic. E.g., ATG, BroadVision, QUIQ, Vignette
– Companies that provide data mining tools. E.g., Blaze, Broadbase, DataSage, Engage, E.piphany, Net Perceptions, Manna
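
Sketch: the cookie exchange from "Tracking Users: Cookies"
The GET exchange on that slide (request carries Cookie: visits=10, response carries Set-Cookie: visits=11) can be illustrated with a small Python sketch. This is only one possible rendering, not how any particular server does it. The function name next_visit_cookie, the path "/" and the one-day lifetime are assumptions made for illustration; only the visits counter comes from the slide.

```python
from http.cookies import SimpleCookie


def next_visit_cookie(cookie_header: str) -> str:
    """Build a Set-Cookie value for one GET (hypothetical helper)."""
    jar = SimpleCookie()
    jar.load(cookie_header)  # parse the request's Cookie header, e.g. "visits=10"
    try:
        visits = int(jar["visits"].value)
    except (KeyError, ValueError):
        visits = 0  # first visit, or an unreadable counter

    reply = SimpleCookie()
    reply["visits"] = str(visits + 1)    # name and value
    reply["visits"]["max-age"] = 86400   # when the cookie expires (assumed: one day)
    reply["visits"]["path"] = "/"        # which pages on the server "see" it (assumed)
    return reply["visits"].OutputString()


if __name__ == "__main__":
    print("Set-Cookie:", next_visit_cookie("visits=10"))
```

The "domain" attribute, not set here, is what would control which servers can "see" the cookie.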
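
Sketch: "similar users, in context"
One way to read the Similarity Problem is as nearest-neighbor search over user ratings restricted to the items relevant to the current context (for Joe, PC products). The sketch below uses cosine similarity, which is a common choice but not one the talk prescribes. The ratings, user names and item names are invented for illustration, and a real site would need something far more scalable than this linear scan.

```python
from math import sqrt


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_users(target: str, ratings: dict, context_items: set, top_n: int = 2) -> list:
    """Rank other users by similarity to target, using only in-context items."""
    def in_context(user):
        return {i: r for i, r in ratings[user].items() if i in context_items}

    me = in_context(target)
    scored = [(cosine(me, in_context(u)), u) for u in ratings if u != target]
    scored.sort(reverse=True)
    return [u for score, u in scored[:top_n] if score > 0]


if __name__ == "__main__":
    # Hypothetical ratings on a 1-5 scale.
    ratings = {
        "joe": {"athlon": 5, "k6": 4, "novel": 1},
        "ann": {"athlon": 4, "k6": 5, "novel": 5},
        "bob": {"pentium": 5, "novel": 1},
    }
    pc_items = {"athlon", "k6", "pentium"}
    print(similar_users("joe", ratings, pc_items))
```

Restricting the vectors to pc_items is what makes the comparison "in context": Ann and Joe disagree about the novel, but that does not count against her as a source of PC recommendations.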