Optimizing Online Yield via
Predictive Modeling of
Individual Site Visitors
David Lapayowker
Marissa Quitt
Elaine Shaver (PM)
Devin Smith
Magnify360 Liaisons:
Olivier Chaine, Jim Healy,
Nate Pool, Gilles ?????
HMC Advisor:
Zachary Dodds
Magnify360
Designs multiple websites for clients with each site customized
to meet the needs of different types of users.
Analyzes clickstream data from site visitors in order to provide
the website that will best suit each one.
The result: tailored sites convert a larger share of visitors than a single page would.
old Facebook
new Facebook
System Overview
Navigates to a site
Tailored interactions
"Conversion"
User
Actions
[email protected]
Dataflow
clickstream data
Our
system
classify user
Musician
Pasadena
resident
User
groups
Insomniac
Musician
choose
page
Online classifier
serve page
results
• user data
• pages served
• conversion data
clustering
Musician
Bioengineer
Pachyphile
Offline analysis
Problem Statement
Navigates to a site
User
Actions
Tailored interactions
"Conversion"
Detailed problem statement here
[email protected]
Clickstream Data
example
columns…
Database
80 tables
110,000,000 rows
ethics ~ anonymous ~ no purchased data!
13 GB
User profiles
A profile is a binary attribute that captures a
specific combination of data values.
Currently 42 of them, hand-specified
from Mag360's site
insomniac
something
something
Tradeoffs:
+ captures experienced intuition about what is important
+ takes advantage of Magnify360's site-design expertise
- binary attributes
- may miss patterns not captured by the user profiles
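As a rough illustration of how hand-specified binary profiles could turn one visitor into a vector of profile attributes, here is a minimal sketch. The three rules and the field names (`hour`, `city`, `referrer`) are made up for illustration; they are not Magnify360's actual 42 profiles or its clickstream schema.

```python
from typing import Callable, Dict, List

Visitor = Dict[str, object]

# Hypothetical profile rules: each maps a visitor record to True/False.
PROFILES: Dict[str, Callable[[Visitor], bool]] = {
    "insomniac": lambda v: 1 <= v["hour"] < 5,
    "pasadena_resident": lambda v: v["city"] == "Pasadena",
    "musician": lambda v: "guitar" in v["referrer"],
}


def profile_vector(visitor: Visitor) -> List[int]:
    """Return one 0/1 entry per profile, in the fixed order of PROFILES."""
    return [int(rule(visitor)) for rule in PROFILES.values()]


# Made-up visitor record.
visitor = {"hour": 3, "city": "Pasadena", "referrer": "search?q=piano"}
print(profile_vector(visitor))  # [1, 1, 0]
```

Because every attribute is 0 or 1, a visitor who almost matches a rule looks the same as one who is nowhere near it, which is the "binary attributes" drawback listed above.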
Conversion data
The site yield, or conversion, is client-specified
Amount of transaction(s)
Time spent on (a part of) the site
Contact information
presence and/or time of an email address
table
Goal: to determine those clusters of
visitors who will be best served (convert)
via a particular version of a client site
3% conversion
Offline analysis ~ user clustering
one big cluster ~ "best page"
hand-tuned clusters
hierarchical clustering
growing neural gas
decision-tree clustering
fuzzy k-means clustering
support vector machines
Visitors ~ vectors of
profile attributes
Support vector machine example
Can we get one of the real data pages?
From clusters to sites
Training data from each cluster determines the best site:
Page A: yields 7, 1, 1 over 3 visits
page A score ~ $\frac{7 + 1 + 1}{3} = 3.0$
Page B: yields 7, 8, 3 over 3 visits
page B score ~ $\frac{7 + 8 + 3}{3} = 6.0$
This cluster of six people responds better to site B,
Time-based site choice
Magnify360 wants to adapt quickly to new preferences:
Time-weighted average yields:
t ~ age of data; each visit is weighted by $2^{-t}$
Page A: (yield 7, t = 0), (yield 1, t = 3), (yield 1, t = 4)
page A score ~ $\frac{2^{0} \cdot 7 + 2^{-3} \cdot 1 + 2^{-4} \cdot 1}{2^{0} + 2^{-3} + 2^{-4}} \approx 6.05$
Page B: (yield 7, t = 4), (yield 8, t = 5), (yield 3, t = 1)
page B score ~ $\frac{2^{-4} \cdot 7 + 2^{-5} \cdot 8 + 2^{-1} \cdot 3}{2^{-4} + 2^{-5} + 2^{-1}} \approx 3.68$
but site A has had better recent performance.
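The two scoring rules from the last two slides can be checked with a few lines of Python. This is a minimal sketch using the example numbers above; the function names are hypothetical, and the weight $2^{-t}$ is read off the slide's formulas.

```python
from typing import List, Tuple

Visit = Tuple[float, float]  # (yield, age t of the data point)


def mean_yield(visits: List[Visit]) -> float:
    """Plain average yield per visit."""
    return sum(y for y, _ in visits) / len(visits)


def time_weighted_yield(visits: List[Visit]) -> float:
    """Average yield where a visit of age t gets weight 2**-t."""
    weights = [2.0 ** -t for _, t in visits]
    total = sum(w * y for w, (y, _) in zip(weights, visits))
    return total / sum(weights)


page_a = [(7, 0), (1, 3), (1, 4)]
page_b = [(7, 4), (8, 5), (3, 1)]

print(mean_yield(page_a), mean_yield(page_b))     # 3.0 6.0
print(round(time_weighted_yield(page_a), 2))      # 6.05
print(round(time_weighted_yield(page_b), 2))      # 3.68
```

The ranking flips because page A's single large yield is the newest data point, while page B's large yields are old and heavily discounted.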
Online classification
procedure
Possible results…
Results ~ Packet 8
all on one graph
what about hand-tuned system results?
comments
A closer look…
talk about SVM parameters here?
comments
Sensitivity to scoring parameters?
David's charts
comments
Software structure
Diagram
What's done and not done…
comments
Perspective
Concluding comments
Questions?
Clickstream Data
The Good:
We have DATA!
The Bad:
Too much?
The Ugly:
What is this data!?
~ 80 tables
~ 13 GB
One of our tables…
ID, anyone?
Fun Statistics
Data: To do
Understand the purpose of each table / column
Understand relationships between tables
Create a single table (or file) of relevant information in order
to test and evaluate our clustering algorithms.
(table demodularization, against all design principles)
Clustering Algorithms
k-Means: Choose k centroids at random and assign each point to its nearest
centroid. Recalculate each centroid as the mean of its points, and repeat
until the assignments stop changing.
Fuzzy k-Means: Similar, but every data point belongs to every
cluster to some degree, rather than being strictly in or out.
Hierarchical Clustering: Uses a bottom-up approach
that repeatedly merges the points and clusters that are
closest together.
FuzME's best 10-cluster
results ~ synthetic data
Bottom line: These clustering algorithms are simple and effective techniques
for categorizing data, but they cannot exist in a vacuum; we are investigating
other techniques that may be used in parallel or instead.
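For reference, plain k-means fits in a few lines of standard-library Python. This is an illustrative sketch on made-up 2-D points, not the team's code and not FuzME; the function names, seed, and data are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: Sequence[Sequence[float]], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # steady state reached
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 depends on the random start, so only the grouping is meaningful, not the label numbers.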
Growing Neural Gas

A clustering algorithm masquerading as a neural network

Given a data distribution, dynamically determines nodes or
“centroids” to represent the data
Representative Nodes
User Profiles
“Dynamic” because it adds or deletes nodes as necessary, as well as
adapting nodes toward changes in the data.
How it works…
Given some input x:
Find the closest node, s, and the next closest, t.
Increase the error of s by its squared distance to x, |s – x|².
Shift s toward x by a fraction εw and its neighbors by a fraction εn, and
increment the age of all edges leaving s.
If s and t are adjacent, set the age of that edge to 0.
Otherwise, create that edge.
Remove edges that are too old, and decrease the error of all
nodes by a small amount.
Add a node every λ generations, putting it between the node with
the largest error and its largest-error neighbor.
Repeat!
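The loop above can be sketched in plain Python. This is a minimal illustration, not the GUI tool shown in the talk. It follows Fritzke's standard formulation, in which the winner's error grows by the squared distance and εw, εn scale the moves; the default parameter values, the class and method names, and the synthetic data are all assumptions.

```python
import random
from typing import Dict, FrozenSet, List, Sequence


def dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class GrowingNeuralGas:
    def __init__(self, dim: int, lam: int = 100, max_age: int = 50,
                 eps_w: float = 0.05, eps_n: float = 0.006,
                 alpha: float = 0.5, beta: float = 0.995, seed: int = 0):
        rng = random.Random(seed)
        # Start with two random nodes and no edges.
        self.nodes: Dict[int, List[float]] = {
            i: [rng.random() for _ in range(dim)] for i in range(2)}
        self.error: Dict[int, float] = {0: 0.0, 1: 0.0}
        self.edges: Dict[FrozenSet[int], int] = {}  # edge -> age
        self.lam, self.max_age = lam, max_age
        self.eps_w, self.eps_n = eps_w, eps_n
        self.alpha, self.beta = alpha, beta
        self.next_id, self.steps = 2, 0

    def neighbors(self, n: int) -> List[int]:
        return [m for e in self.edges if n in e for m in e if m != n]

    def _move(self, n: int, x: Sequence[float], eps: float) -> None:
        self.nodes[n] = [w + eps * (xi - w) for w, xi in zip(self.nodes[n], x)]

    def step(self, x: Sequence[float]) -> None:
        # Closest node s and next closest node t.
        s, t = sorted(self.nodes, key=lambda n: dist2(self.nodes[n], x))[:2]
        self.error[s] += dist2(self.nodes[s], x)
        # Shift s and its neighbors toward x; age the edges leaving s.
        self._move(s, x, self.eps_w)
        for n in self.neighbors(s):
            self._move(n, x, self.eps_n)
            self.edges[frozenset((s, n))] += 1
        # Reset (or create) the edge between s and t.
        self.edges[frozenset((s, t))] = 0
        # Drop edges that are too old, then nodes left without any edge.
        self.edges = {e: a for e, a in self.edges.items() if a <= self.max_age}
        connected = {n for e in self.edges for n in e}
        for n in [n for n in self.nodes if n not in connected]:
            del self.nodes[n], self.error[n]
        # Decay every node's error by the factor beta.
        for n in self.error:
            self.error[n] *= self.beta
        # Every lam inputs, insert a node between the node with the largest
        # error (q) and its largest-error neighbor (f).
        self.steps += 1
        if self.steps % self.lam == 0:
            q = max(self.error, key=self.error.get)
            f = max(self.neighbors(q), key=self.error.get)
            r, self.next_id = self.next_id, self.next_id + 1
            self.nodes[r] = [(a + b) / 2
                             for a, b in zip(self.nodes[q], self.nodes[f])]
            del self.edges[frozenset((q, f))]
            self.edges[frozenset((q, r))] = 0
            self.edges[frozenset((f, r))] = 0
            self.error[q] *= self.alpha
            self.error[f] *= self.alpha
            self.error[r] = self.error[q]


# Synthetic data: two square blobs centered at (0.2, 0.2) and (0.8, 0.8).
rng = random.Random(1)
gng = GrowingNeuralGas(dim=2)
for _ in range(2000):
    c = rng.choice([0.2, 0.8])
    gng.step([c + rng.uniform(-0.1, 0.1), c + rng.uniform(-0.1, 0.1)])
print(len(gng.nodes), "nodes,", len(gng.edges), "edges")
```

Storing each edge as a `frozenset` of two node ids keeps the graph undirected without bookkeeping for both directions.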
A Few Parameters…
(Making sense of the GUI)






λ: Controls how frequently new nodes are inserted (one every λ generations)
Max Edge Age: The age beyond which an edge is deleted
εw: Fraction of the distance to the input by which the “winning” (closest) node moves
εn: Fraction of the distance to the input by which the winner's neighbors move
α: Scale factor for decreasing the error of parent nodes when a new node is inserted
β: Scale factor for decreasing error of all nodes
… and the difference they make.
λ= 100
• Smaller λ, nodes inserted more often
• Leaves straggler nodes that don’t
accurately match data
λ= 1000
• Larger λ, nodes inserted less often
• Takes longer, but yields more accurate
placement of nodes
Support Vector Machines
Clearly planar
Planar in feature space
Support Vector Regression (Machine?)
Goal: Minimize error between hyper-plane and data points.
SVM
Maximize cluster separation
SVR
Minimize plane-to-data distance
Getting the correct page…
What do we want from a technique?
CLASSIFICATION:
Input: User data.
Output: Page to serve.
REGRESSION:
Input: User data and possible page.
Output: Predicted Success.
Both require multiple SVMs.
Using Classification via SVMs
DATA → classifier outputs: C, B, C
Predicted Page: C
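One common way to merge the outputs of several binary classifiers into a single page is majority voting. The sketch below assumes that scheme; the slide does not say how the team combines its SVMs, and the votes shown are illustrative.

```python
from collections import Counter
from typing import List


def predicted_page(votes: List[str]) -> str:
    """Return the page named by the most classifiers."""
    return Counter(votes).most_common(1)[0][0]


# Illustrative votes from three hypothetical pairwise classifiers.
votes = ["C", "B", "C"]
print(predicted_page(votes))  # C
```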
Using Regression via SVRs
DATA → Page A Predictor: 0.42
DATA → Page B Predictor: 0.24
DATA → Page C Predictor: 0.78
Predicted Page: C
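With one regressor per page, choosing the page reduces to taking the highest predicted success. In this minimal sketch the per-page predictors are stand-in functions that return the numbers from the slide; in a real system each would be a trained SVR, and all names here are hypothetical.

```python
from typing import Callable, Dict, Sequence

Predictor = Callable[[Sequence[float]], float]

# Stand-in predictors (assumption): each ignores its input and returns the
# slide's example score instead of evaluating a trained model.
predictors: Dict[str, Predictor] = {
    "A": lambda user: 0.42,
    "B": lambda user: 0.24,
    "C": lambda user: 0.78,
}


def best_page(user: Sequence[float], models: Dict[str, Predictor]) -> str:
    """Score the user against every page and return the top-scoring page."""
    scores = {page: model(user) for page, model in models.items()}
    return max(scores, key=scores.get)


user = [1, 0, 1]  # made-up vector of profile attributes
print(best_page(user, predictors))  # C
```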
Goal Breakdown
Short-term Plan
Plan for Algorithm Comparison
Schedule and Conclusion

Friday November 14


Friday November 21



Initial testing on real data
Meeting with Magnify360
Friday December 5


Prototype algorithm comparison method
Initial composition of classification algorithms
Friday December 12

Midyear Report
Questions?
SVM vs SVR
SVM
Maximize Distance
SVR
Minimize Distance
Data
The Bad, or, The Challenges:
Lots of SQL data
Some Data Tables
80 tables total…
Data Size
Problem Statement
Officially: Develop an innovative predictive modeling system to predict shopping
cart abandonment based on profiles, clusters, shopping cart contents
Most importantly: GRAB from email ! Research and implement
various AI techniques to optimize the process of matching users
with websites
Individualized Online Experiences
Classifying Users
Unsupervised clustering: points are clustered without knowledge of the results
Supervised clustering: clusters are built using prior knowledge of the results
Ethical concerns?
Recap: What Magnify360 Does
Individualize a website for different types of users
Collect data on users from their clickstream, and give them the
site that will appeal to them best
Appeal to a larger base of users by making the site more
interesting to a larger group
old Facebook
serving both!