Data Cloud
Yury Lifshits
Yahoo! Research
http://yury.name

My Beliefs
The key challenge in web search is structured search
– Part 1: What is structured search?
The key challenge in structured search is collecting data
– Part 2: Data distribution & idea of Data Cloud
– Part 3: Demo: numeric data distribution
The key challenge in collecting data is incentive design
– Part 4: Economics of data distribution

Structured Search Data
Data = data of entities + data of content

Structured data. Entity unit:
• Identifier
• Metadata:
– Explicit key-value pairs
– Relational properties
– Evaluation

Semi-structured data. Content unit:
• Body: text, video, audio, or image
• Metadata:
– Explicit key-value pairs
– Relational properties
– Evaluation

Structured Search
Factoid search
“what's the value of property X of object Y”
– Entity hubs
– Domain hubs
Structured object search
"all concerts this weekend in SF under $20 sorted by popularity"
– Time focus
– Ranking focus
– Relations focus
Structured content search
"all videos with Tom Brady"
"all comments and blog posts about Bing"

Yury’s Wishlist
Business-generated data
• Products, services, news, wishlists, contact data
Reality stream, sensors
• What has happened, and where
Expert knowledge
• Glossary, issues, typical solutions, object databases, related objects graph
Events
• Sport, concerts, education, corporate, community, private
Market graph & signals
• Like, interested, use, following, want to buy; votes and ratings

Search as a Platform
[Diagram labels: Query analysis; Classic search; Web index; Post analysis; App 1; App 2; App 3; App 4; Structured Data]

Data Cloud
How to collect all structured data in one place?
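To make the "structured object search" example from Part 1 concrete, here is a minimal sketch of the query "all concerts this weekend in SF under $20 sorted by popularity" run over entity units (identifier plus key-value metadata). Everything specific in it is an assumption for illustration: the `Entity` class, the `search_entities` helper, the metadata keys, the sample events, and the fixed weekend dates are all made up and do not come from the talk.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Entity:
    """Entity unit: an identifier plus explicit key-value metadata."""
    identifier: str
    metadata: dict = field(default_factory=dict)


def search_entities(entities, predicate, sort_key):
    """Keep entities whose metadata satisfies the predicate,
    then sort them by a metadata key, largest value first."""
    matches = [e for e in entities if predicate(e.metadata)]
    return sorted(matches, key=lambda e: e.metadata[sort_key], reverse=True)


# Hypothetical weekend used for the "this weekend" time filter.
WEEKEND = (date(2009, 6, 20), date(2009, 6, 21))


def cheap_sf_concert_this_weekend(m):
    """Type, location, price, and time constraints of the example query."""
    return (
        m["type"] == "concert"
        and m["city"] == "SF"
        and m["price"] < 20
        and WEEKEND[0] <= m["date"] <= WEEKEND[1]
    )


# Made-up sample data.
EVENTS = [
    Entity("concert-1", {"type": "concert", "city": "SF", "price": 15,
                         "date": date(2009, 6, 20), "popularity": 80}),
    Entity("concert-2", {"type": "concert", "city": "SF", "price": 45,
                         "date": date(2009, 6, 20), "popularity": 95}),
    Entity("concert-3", {"type": "concert", "city": "SF", "price": 10,
                         "date": date(2009, 6, 21), "popularity": 90}),
    Entity("concert-4", {"type": "concert", "city": "NYC", "price": 10,
                         "date": date(2009, 6, 20), "popularity": 99}),
    Entity("concert-5", {"type": "concert", "city": "SF", "price": 12,
                         "date": date(2009, 6, 27), "popularity": 70}),
    Entity("lecture-1", {"type": "lecture", "city": "SF", "price": 0,
                         "date": date(2009, 6, 20), "popularity": 50}),
]

results = search_entities(EVENTS, cheap_sf_concert_this_weekend, "popularity")
result_ids = [e.identifier for e in results]
```

The point of the sketch is that such a query needs explicit metadata (price, date, location, a popularity signal) for every object, which is why the talk treats collecting structured data as the bottleneck.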
Data Producers
• People: forums, wiki, mail groups, blogs, social networks
• Enterprises: product profiles, corporate news, professional content
• Sensors: GPS modules, web cameras, traffic sensors, RFID
• Transactional data

Data Distributors
A data distributor is any technical solution to accumulate, organize, and provide access to structured and semi-structured data
Data publisher: the original distributor of some data
Data retailer: a consumer-facing distributor of some data

Data Consumers
• Humans
– Email
– Aggregators: news, friend feeds, RSS readers
– Search
– Browsing / random walks
• Intelligence projects
– Recommendation systems
– Trend mining

Data Cloud
A Data Cloud is a centralized, fully functional data distribution service
Success metric for a data cloud strategy = the total “value” of data on the cloud

To-Cloud Solutions
• Extraction
– DBpedia.org, “web tables”
• Semantic markup, data APIs
– Yahoo! SearchMonkey
• Feeds
– Yahoo! Shopping
– Disqus.com, js-kit.com, Facebook Connect
• Direct publishing

On-Cloud Solutions
• Ontology maintenance
– Freebase
• Normalization, de-duplication, antispam
• Named entity recognition, metadata inference, ranking
• Data recycling (cross-references)
– Amazon Public Data Sets
– Viral license
• Hosted search
– Yahoo! BOSS

From-Cloud Solutions
• Search, audience
– Y! SearchMonkey, Google Base
• Data API, dump access, update stream
• Custom notifications
– Gnip.com
• Data cloud as a primary backend
• Access control
– Ad distribution (AT&T and Yahoo! Local deal)

Demo: webNumbr.com
Joint work with Paul Tarjan

webNumbr.com: Import
• Crawl numbers from the web: URL + XPath + regex
• Create “numbr pages”
• Update their values every hour
• Keep the history
Anyone can create a numbr: http://webnumbr.com/create

webNumbr.com: Export
• Embed code
• Graphs
• Search & browse
• RSS

Economics of Data Distribution
Joint work with Ravi Kumar and Andrew Tomkins

Network Effect in Two-Sided Markets
Two-sided market = every product serves consumers of two types, A and B
Cross-side network effect: the more type-A users product X has, the more attractive it is for type-B consumers, and vice versa
Examples: operating systems, credit cards, e-commerce marketplaces
Reference: "Two-sided network effects: A theory of information product design", G. Parker, M.W. Van Alstyne, N. Bulkley, M. Van Alstyne

Basic model
• Distributors D1, …, Dk
• Each producer/consumer joins only one distributor
• Initial shares (p1,c1) … (pk,ck)
• A new consumer selects a distributor with probability proportional to pi
• A new producer selects a distributor with probability proportional to ci

Basic model
[Diagram labels: a1, a2, a3, a4]

Market Shares Dynamics
Theorem 1: Market shares will stabilize
Theorem 2: With a super-linear preference rule, the market will tip toward one of the distributors
Theorem 3: With a sub-linear preference rule, market shares will flatten

External Factor
Preference rule with external factor: ei + ci/(c1 + … + ck)
Theorem 4: Market shares will stabilize on e1 : e2 : … : ek
[Diagram labels: Coalition; Data Cloud]

Coalitions
Theorem 5: If all market shares are below 1/sqrt(k), a coalition (sharing data) is profitable for all distributors
Corollary: Coalitions are not monotone
Example: 5 : 4 : 1 : 1

Model Variations
• Same-side network effect
• Different p-to-c and c-to-p rules
• Multi-homing (overlapping audiences)
• n^2 vs. n log n revenue models
• Mature market: newcomer rate = departing rate
• Diverse market (many types of producers and consumers)
• Newcoming and departing distributors
• Directed coalitions

Challenges
Marketing
• Data demand?
• Data offerings?
• Requirements for distribution technology?
Incentive design
• Incentives for data sharing?
• Centralized or distributed?
– For profit or non-profit?
• Data licensing and ownership?
• Monetizing data cloud?

More Challenges
Prototyping:
• Data marketplace: open data & data demand
• Search plugins: related objects, glossaries, object timelines
• Publishing tools for structured data
• Data client: structured news, bookmarking, notifications
Tech design:
• Access management
• Namespace design
User interface:
• Structured search UI
• Discovery UI

Thanks!
Follow my research:
http://twitter.com/yurylifshits
http://yury.name/blog
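Appendix sketch: the basic model from the Economics of Data Distribution part (distributors D1, …, Dk, each newcomer joining exactly one distributor in proportion to the other side's presence there) is easy to simulate. The code below is a minimal illustration, not the analysis behind the theorems. The function name `simulate_market`, the `alpha` exponent used to express super-linear (`alpha > 1`) or sub-linear (`alpha < 1`) preference, the choice of one new consumer and one new producer per step, and the seed are all assumptions introduced here.

```python
import random


def simulate_market(producers, consumers, steps, alpha=1.0, seed=0):
    """Simulate the two-sided basic model.

    producers[i], consumers[i]: initial positive counts at distributor i.
    alpha: assumed preference exponent; a count x gets weight x ** alpha.
    Each step adds one new consumer and one new producer.
    Returns (producer_shares, consumer_shares) after the last step.
    """
    rng = random.Random(seed)
    p = list(producers)
    c = list(consumers)
    k = len(p)
    for _ in range(steps):
        # New consumer: probability proportional to p_i ** alpha.
        i = rng.choices(range(k), weights=[x ** alpha for x in p])[0]
        # New producer: probability proportional to c_i ** alpha.
        # Both choices use the counts from before this step's arrivals.
        j = rng.choices(range(k), weights=[x ** alpha for x in c])[0]
        c[i] += 1
        p[j] += 1
    p_total = sum(p)
    c_total = sum(c)
    return [x / p_total for x in p], [x / c_total for x in c]


# Starting point taken from the 5 : 4 : 1 : 1 example in the slides,
# used here for both sides of the market (an assumption).
START = [5, 4, 1, 1]

linear_p, linear_c = simulate_market(START, START, steps=2000, alpha=1.0)
super_p, super_c = simulate_market(START, START, steps=2000, alpha=2.0)
sub_p, sub_c = simulate_market(START, START, steps=2000, alpha=0.5)
```

Comparing `super_p` and `sub_p` against `linear_p` gives an informal feel for the tipping and flattening behavior stated in Theorems 2 and 3. A single random run proves nothing, so treat the output as an illustration only.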