Mirrors and Crystal Balls
A Personal Perspective on Data Mining
Raghu Ramakrishnan
ACM SIGKDD Innovation Award
Outline
• This award recognizes the work of many
people, and I represent the many
– A warp-speed tour of some earlier work
• What’s a data mining talk without
predictions?
– Some exciting directions for data mining that
we’re working on at Yahoo!
A Look in the Mirror …
(and the faces I found there:
unfortunately, couldn’t find photos for some people)
(and apologies in advance for not discussing the related
work that provided context and, often, tools and motivation)
[Photo timeline: 1987 – 2007]
Sequences, Streams
• SEQ
  – Sequence Data Processing. P. Seshadri, M. Livny, R. Ramakrishnan. SIGMOD 1994
  – SEQ: A Model for Sequence Databases. P. Seshadri, M. Livny, R. Ramakrishnan. ICDE 1995
  – The Design and Implementation of a Sequence Database System. P. Seshadri, M. Livny, R. Ramakrishnan. VLDB 1996
• SRQL
  – SRQL: Sorted Relational Query Language. R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, M. Krishnaprasad. SSDBM 1998
Scalable Clustering
• BIRCH
  – BIRCH: A Clustering Algorithm for Large Multidimensional Datasets. T. Zhang, R. Ramakrishnan, M. Livny. SIGMOD 1996
  – Fast Density Estimation Using CF-Kernels. T. Zhang, R. Ramakrishnan, M. Livny. KDD 1999
  – Clustering Large Databases in Arbitrary Metric Spaces. V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, J. French. ICDE 1999
• Clustering Categorical Data
  – CACTUS: A Scalable Clustering Algorithm for Categorical Data. V. Ganti, J. Gehrke, R. Ramakrishnan. KDD 1999
Scalable Decision Trees
• RainForest
  – RainForest: A Framework for Fast Decision Tree Construction of Large Datasets. J. Gehrke, R. Ramakrishnan, V. Ganti. VLDB 1998
• BOAT
  – BOAT: Optimistic Decision Tree Construction. J. Gehrke, V. Ganti, R. Ramakrishnan, W-Y. Loh. SIGMOD 1999
Streaming and Evolving Data,
Incremental Mining
• FOCUS
  – FOCUS: A Framework for Measuring Changes in Data Characteristics. V. Ganti, J. Gehrke, R. Ramakrishnan, W-Y. Loh. PODS 1999
• DEMON
  – DEMON: Mining and Monitoring Evolving Data. V. Ganti, J. Gehrke, R. Ramakrishnan. ICDE 1999
Mass Collaboration
• The QUIQ Engine: A Hybrid IR-DB System. N. Kabra, R. Ramakrishnan, V. Ercegovac. ICDE 2003
• Mass Collaboration: A Case Study. R. Ramakrishnan, A. Baptist, V. Ercegovac, M. Hanselman, N. Kabra, A. Marathe, U. Shaft. IDEAS 2004
[Diagram: a customer's QUESTION flows to a KNOWLEDGE BASE that powers SELF SERVICE; ANSWERs from support agents, partner experts, customer champions, and employees are added back to power self service.]
OLAP, Hierarchies,
and Exploratory Mining
• Prediction Cubes. B-C. Chen, L. Chen, Y. Lin, R. Ramakrishnan. VLDB 2005
• Bellwether Analysis: Predicting Global Aggregates from Local Regions. B-C. Chen, R. Ramakrishnan, J.W. Shavlik, P. Tamma. VLDB 2006
Hierarchies Redux
• OLAP Over Uncertain and Imprecise Data. D. Burdick, P. Deshpande, T.S. Jayram, R. Ramakrishnan, S. Vaithyanathan. VLDB 2005
• Efficient Allocation Algorithms for OLAP Over Imprecise Data. D. Burdick, P.M. Deshpande, T.S. Jayram, R. Ramakrishnan, S. Vaithyanathan.
• Learning from Aggregate Views. B-C. Chen, L. Chen, D. Musicant, R. Ramakrishnan. ICDE 2006
• Mondrian: Multidimensional K-Anonymity. K. LeFevre, D.J. DeWitt, R. Ramakrishnan. ICDE 2006
• Workload-Aware Anonymization. K. LeFevre, D.J. DeWitt, R. Ramakrishnan. KDD 2006
• Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge. B-C. Chen, R. Ramakrishnan, K. LeFevre. VLDB 2007
• Composite Subset Measures. L. Chen, R. Ramakrishnan, P. Barford, B-C. Chen, V. Yegneswaran. VLDB 2006
Many Other Connections …
• Scalable Inference
  – Optimizing MPF Queries: Decision Support and Probabilistic Inference. H. Corrada Bravo, R. Ramakrishnan. SIGMOD 2007
• Relational Learning
  – View Learning for Statistical Relational Learning, with an Application to Mammography. J. Davis, E.S. Burnside, I. Dutra, D. Page, R. Ramakrishnan, V. Santos Costa, J.W. Shavlik.
Community Information
Management
• Efficient Information Extraction over Evolving Text Data. F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE 2008
• Toward Best-Effort Information Extraction. W. Shen, P. DeRose, R. McCann, A. Doan, R. Ramakrishnan. SIGMOD 2008
• Declarative Information Extraction Using Datalog with Embedded Extraction Predicates. W. Shen, A. Doan, J.F. Naughton, R. Ramakrishnan. VLDB 2007
• Source-aware Entity Matching: A Compositional Approach. W. Shen, P. DeRose, L. Vu, A. Doan, R. Ramakrishnan. ICDE 2007
… Through the Looking Glass
Prediction is very hard, especially about the future.
Yogi Berra
Information Extraction
… and the challenge of managing it
DBLife
• Integrated information about a (focused) real-world community
• Collaboratively built and maintained by the community
• CIMple software package
Search Results of the Future
yelp.com
Gawker
babycenter
New York Times
epicurious
LinkedIn
answers.com
webmd
(Slide courtesy Andrew Tomkins)
Opening Up Yahoo! Search
Phase 1: Giving site owners and developers control over the appearance of Yahoo! Search results.
Phase 2: BOSS takes Yahoo!'s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.
(Slide courtesy Prabhakar Raghavan)
Custom Search Experiences
Social Search
Vertical Search
Visual Search
(Slide courtesy Prabhakar Raghavan)
Economics of IE
• Data $, Supervision $
– The cost of supervision, especially large,
high-quality training sets, is high
– By comparison, the cost of data is low
• Therefore
– Rapid training set construction/active learning
techniques
– Tolerance for low- (or low-quality) supervision
– Take feedback and iterate rapidly
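The "rapid training-set construction / active learning" point above can be made concrete with a minimal uncertainty-sampling step: ask for a label on the unlabeled example the current binary model is least sure about. This is a generic sketch, not the talk's actual tooling; all names are illustrative.

```python
# Uncertainty-sampling sketch: pick the unlabeled example whose
# predicted P(positive) is closest to 0.5, i.e. maximally uncertain
# for a binary classifier, then ask an editor/labeler for its label.

def uncertainty_sample(probs):
    """Return the index of the most uncertain example."""
    return min(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))

# Model confidences on a small unlabeled pool: index 1 is nearest 0.5.
probs = [0.95, 0.51, 0.10, 0.70]
assert uncertainty_sample(probs) == 1
```

In a loop, the labeled example is added to the training set, the model is retrained, and the process repeats, which is what makes training-set construction fast relative to labeling everything.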
Example: Accepted Papers
• Every conference comes with a slightly
different format for accepted papers
– We want to extract accepted papers directly
(before they make their way into DBLP etc.)
• Assume
– Lots of background knowledge (e.g., DBLP
from last year)
– No supervision on the target page
• What can you do?
Down the Page a Bit
Record Identification
• To get started, we need to identify records
– Hey, we could write an XPath, no?
– So, what if no supervision is allowed?
• Given a crude classifier for paper records,
can we recursively split up this page?
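The recursive-splitting idea above can be sketched as follows, assuming a crude `is_record` test and a `split` routine over page blocks. Both are toy stand-ins (nested lists in place of a DOM), not the talk's actual extractor.

```python
# Recursive page-splitting sketch: given a crude classifier that says
# whether a block looks like a single paper record, split a page
# region until every remaining piece is record-like.

def split_records(block, is_record, split):
    """Return the list of record-like leaves under `block`."""
    if is_record(block):
        return [block]
    children = split(block)
    if not children:          # cannot split further; give up on this block
        return []
    out = []
    for child in children:
        out.extend(split_records(child, is_record, split))
    return out

# Toy page: nested lists stand in for a DOM; a "record" is a
# (title, authors) tuple.
page = [[("Paper A", "X, Y"), ("Paper B", "Z")], [("Paper C", "W")]]
is_record = lambda b: isinstance(b, tuple)
split = lambda b: b if isinstance(b, list) else []

assert split_records(page, is_record, split) == [
    ("Paper A", "X, Y"), ("Paper B", "Z"), ("Paper C", "W")]
```

The first-level splits and "after more splits" screenshots on the following slides correspond to successive recursion levels of this procedure.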
First Level Splits
After More Splits …
Now Get the Records
• Goal: To extract fields of individual records
• We need training examples, right?
– But these papers are new
• The best we can do without supervision is
noisy labels.
– From having seen other such pages
Partial, Noisy Labels
Extracted Records
Refining Results via Feedback
• Now let's shift slightly, to extraction of publications from academic home pages
  – Must identify publication sections of faculty home pages and extract paper citations from them
• Underlying data model for extracted data:
  – A flexible graph-based model (similar to RDF or an ER conceptual model)
  – "Confidence" scores per attribute or relationship
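A minimal sketch of such a graph-based model, with a confidence score attached to every attribute and relationship. The class and field names here are illustrative, not the system's actual schema.

```python
# Graph data model sketch: extracted entities are nodes, extracted
# relationships are edges, and every attribute value and edge carries
# a confidence score from the extractor.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                   # e.g., "person", "publication"
    attrs: dict = field(default_factory=dict)   # name -> (value, confidence)

@dataclass
class Edge:
    src: Node
    dst: Node
    label: str                                  # e.g., "authorOf"
    confidence: float

prof = Node("person", {"name": ("J. Doe", 0.98)})
paper = Node("publication", {"title": ("Some Extracted Title", 0.62)})
wrote = Edge(prof, paper, "authorOf", 0.75)

# Low-confidence attributes/edges are exactly the "dubious" extractions
# that feedback and lineage tracking (next slides) help to correct.
assert wrote.confidence == 0.75
```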
Extracted Publication Titles
A Dubious Extracted Publication…
PSOX provides declarative lineage
tracking over operator executions
Where’s the Problem?
Use lineage to find the source of the problem.
Source Page
Hmm, not a publication page…
(but may have looked like one to a classifier)
Feedback
User corrects the classification of that section.
Faculty or Student?
• NLP
• Build a classifier
• Or…
…Another Clue…
…Stepping Back…
• Leads to large-scale, partially-labeled relational learning
• Involving different types of entities and links
[Diagram: Prof-List and Student-List pages with Prof and Student entities, AdvisorOf links, and papers p1, p2, p3.]
Maximizing the Value of What
You Select to Show Users
Content Optimization
• PROBLEM: Match-making between content, user, and context
  – Content: programmed (e.g., by editors); acquired (e.g., RSS feeds, UGC)
  – User: individual (e.g., b-cookie) or user segment
  – Context: e.g., Y! or non-Y! property; device; time period
• APPROACH: Scalable algorithms that select content to show, using editorially determined content mixes, and respecting editorially set constraints and policies.
Team from Y! Research
Bee-Chung Chen, Pradheep Elango, Deepak Agarwal, Raghu Ramakrishnan, Wei Chu, Seung-Taek Park
Team from Y! Engineering
Nitin Motgi, Joe Zachariah, Scott Roy, Todd Beaupre, Kenneth Fox
Yahoo! Home Page Featured Box
• It is the top-center part of the Y! Front Page
• It has four tabs: Featured, Entertainment, Sports, and Video
Traditional Role of Editors
• Strict quality control
– Preserve “Yahoo! Voice”
• E.g., typical mix of content
– Community standards
– Quality guidelines
• E.g., Topical articles shown for limited time
• Program articles periodically
– New ones pushed, old ones taken out
• A few tens of unique articles per day
  – 16 articles at any given time; editors keep up with novel articles and remove fading ones
  – Choose which articles appear in which tabs
Content Optimization Approach
• Editors continue to determine content sources, program some content, determine policies to ensure quality, and specify business constraints
  – But we use a statistically based machine-learning algorithm to determine which articles to show, and where, when a user visits the Front Page
Modeling Approach
• Pure feature-based (did not work well):
  – Article: URL, keywords, categories
  – Build offline models to predict CTR when an article is shown to users
  – Models considered: logistic regression with feature selection; decision trees; feature segments through clustering
• Track CTR per article in user segments through online models
  – This worked well; the approach we eventually took
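The online per-(article, segment) CTR tracking can be sketched as simple Beta-Bernoulli counting: each view and click updates a posterior whose mean is the CTR estimate. A sketch of the idea, assuming illustrative prior values, not the production model.

```python
# Online CTR tracking sketch: per-(article, user segment) click/view
# counts with a weak Beta prior, so new articles start near the prior
# CTR and converge to their observed rate as data arrives.
from collections import defaultdict

class CtrTracker:
    def __init__(self, prior_clicks=1.0, prior_views=20.0):
        # Every (article, segment) pair starts at prior_clicks/prior_views.
        self.clicks = defaultdict(lambda: prior_clicks)
        self.views = defaultdict(lambda: prior_views)

    def update(self, article, segment, clicked):
        key = (article, segment)
        self.views[key] += 1
        if clicked:
            self.clicks[key] += 1

    def ctr(self, article, segment):
        key = (article, segment)
        return self.clicks[key] / self.views[key]

t = CtrTracker()
for _ in range(80):
    t.update("a1", "sports_fans", clicked=False)
for _ in range(20):
    t.update("a1", "sports_fans", clicked=True)
# (1 + 20) clicks over (20 + 100) views under the prior above.
assert abs(t.ctr("a1", "sports_fans") - 21 / 120) < 1e-9
```

This static estimator motivates the next two slides: real CTRs are non-stationary, so the counts must be decayed or filtered rather than accumulated forever.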
Challenges
• Non-stationary CTR
• To ensure webpage stability, we show the same article until we find a better one
  – CTR decays over time, sharply at position F1
  – Time-of-day and day-of-week effects in CTR
Modeling Approach
• Track item scores through dynamic linear models (fast Kalman filter algorithms)
• We model decay explicitly in our models
• We include a global time-of-day curve explicitly in our online models
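A scalar sketch of a dynamic linear model for a drifting article score: a random-walk state with a Kalman-style update and an explicit decay factor. The variances and the decay value are illustrative stand-ins, not the system's actual parameters.

```python
# Dynamic-linear-model sketch for a non-stationary CTR: the hidden
# score decays, drifts as a random walk, and is corrected by each
# observed CTR sample via a Kalman gain.

def kalman_step(mean, var, obs, obs_var, drift_var=1e-4, decay=0.99):
    """One filter step: decay the state, inflate uncertainty for the
    random walk, then fold in the observed CTR sample."""
    mean *= decay                 # explicit decay of the article score
    var += drift_var              # state drifts between observations
    gain = var / (var + obs_var)  # Kalman gain
    mean = mean + gain * (obs - mean)
    var = (1 - gain) * var
    return mean, var

mean, var = 0.05, 1.0             # vague initial belief about the CTR
for obs in [0.04, 0.03, 0.035, 0.03]:
    mean, var = kalman_step(mean, var, obs, obs_var=0.01)

# The estimate has moved toward the observed ~3% CTR and tightened.
assert 0.02 < mean < 0.05
assert var < 1.0
```

A time-of-day curve, as in the slide, would enter as a known offset subtracted from `obs` before the update and added back when scoring.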
Explore/Exploit
• What is the best strategy for new articles?
  – If we show one and it's bad: lose clicks
  – If we delay and it's good: lose clicks
• Solution: Show an article while we don't have much data, if it looks promising
  – A classical multi-armed-bandit-type problem
  – Our setup is different from the ones studied in the literature; a new ML problem
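The bandit with a changing arm set can be sketched with a simple epsilon-greedy policy over articles that are added and retired over time. This is only a minimal stand-in: the talk's actual scheme computes optimal designs in batch rather than updating after each pull.

```python
# Epsilon-greedy sketch with a dynamic arm set: unlike the classical
# fixed-arm bandit, articles (arms) arrive and retire over time.
import random

class DynamicBandit:
    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.stats = {}                  # arm -> [clicks, views]
        self.rng = random.Random(seed)

    def add_arm(self, arm):
        self.stats.setdefault(arm, [0, 0])

    def retire_arm(self, arm):
        self.stats.pop(arm, None)

    def choose(self):
        if self.rng.random() < self.epsilon:        # explore
            return self.rng.choice(list(self.stats))
        # Exploit: smoothed CTR estimate (clicks+1)/(views+2).
        return max(self.stats,
                   key=lambda a: (self.stats[a][0] + 1) / (self.stats[a][1] + 2))

    def record(self, arm, clicked):
        self.stats[arm][1] += 1
        self.stats[arm][0] += int(clicked)

b = DynamicBandit()
b.add_arm("a1"); b.add_arm("a2")
b.record("a1", True); b.record("a2", False)
assert b.choose() in {"a1", "a2"}
b.retire_arm("a2")               # article expired; arm disappears
assert b.choose() == "a1"
```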
Novel Aspects
• Classical: Arms assumed fixed over time
  – We gain and lose arms over time
    • Some theoretical work by Whittle in the '80s (operations research)
• Classical: Serving rule updated after each pull
  – We compute an optimal design in batch mode
• Classical: CTR generally assumed stationary
  – We have highly dynamic, non-stationary CTRs
Some Other Complications
• We run multiple, possibly correlated experiments simultaneously; effective-sample-size calculation is a challenge
• Serving bias: it is incorrect to learn from data collected under serving scheme A and apply it to serving scheme B
  – Need an unbiased quality score
  – Bias sources: positional effects, time effects, the set of articles shown together
• Incorporating feature-based techniques
  – Regression-style, e.g., logistic regression
  – Tree-based (hierarchical bandit)
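One standard way to get an unbiased quality score from biased serving logs is inverse-propensity weighting: if scheme A showed an article with probability p, reweighting each logged impression by 1/p removes A's preferences from the estimate. This is offered as context for the serving-bias point above, not as the talk's exact method.

```python
# Inverse-propensity sketch for serving bias: estimate an article's CTR
# from logs collected under a serving scheme that showed it with
# per-impression probability p, by weighting each impression by 1/p.

def ips_ctr(logs):
    """logs: list of (clicked, serve_probability) for one article."""
    num = sum(int(clicked) / p for clicked, p in logs)
    den = sum(1.0 / p for _, p in logs)
    return num / den

# Article over-served (p=0.9) where it does poorly and under-served
# (p=0.1) where it does well: the raw CTR understates its quality.
logs = [(False, 0.9)] * 9 + [(True, 0.1)]
raw = sum(int(c) for c, _ in logs) / len(logs)   # biased estimate
assert ips_ctr(logs) > raw
```

Positional and time effects, as listed above, would enter the same way: the propensity p becomes the probability of that article appearing in that position at that time.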
System Challenges
• Highly dynamic system characteristics:
  – Short article lifetimes; the pool is constantly changing; the user population is dynamic; CTRs are non-stationary
  – Quick adaptation is key to success
• Scalability:
  – Thousands of page views/sec; data collection, model training, and article scoring done under tight latency constraints
Results
• We built an experimental infrastructure to test new content-serving schemes
  – Ran side-by-side experiments on live traffic
  – Experiments ran for several months; we consistently outperformed the old system
  – Results showed we get more clicks by engaging more users
  – Editorial overrides did not reduce lift numbers substantially
Comparing buckets
Experiments
• Daily CTR Lift relative to editorial serving
Lift is Due to Increased Reach
• Lift in fraction of clicking users
Related Work
• Amazon, Netflix, Y! Music, etc.:
  – Collaborative filtering with a large content pool
  – Achieve lift by eliminating bad articles
  – We have a small number of high-quality articles
• Search, Advertising
  – Matching problem with a large content pool
  – Match through feature-based models
Summary of Approach
• Offline models to initialize online models
• Online models to track performance
• Explore/exploit to converge fast
• Study user visit patterns and behavior;
program content accordingly
Summary
• There are some exciting “grand challenge”
problems that will require us to bring to
bear ideas from data management,
statistics, learning, and optimization
– i.e., data mining problems!
• Our field is too young to think about
growing old, but the best is yet to be …