Download Posterior uncertainty for rank data aggregation and a priori plans for

Document related concepts

Microevolution wikipedia , lookup

Ridge (biology) wikipedia , lookup

Designer baby wikipedia , lookup

RNA-Seq wikipedia , lookup

Gene expression profiling wikipedia , lookup

Transcript
Posterior uncertainty
for rank data aggregation
and
a priori plans for BigInsight
Arnoldo Frigessi
[email protected]
BigInsight
RANKS
3
5
1
2
4
Ranked data is everywhere.
Rankings arise when …
•
•
•
•
users express preferences about products and services,
voters cast ballots in elections,
research projects are evaluated based on their merits,
genes are ordered based on their expression levels under various
experimental conditions.
A ranking represents a statement about the relative quality or
relevance of the items being ranked.
Assessors rank items.
Designed or observed
Panel, volunteers, users….
1
3
4
2
Tasks
Aggregate, merge, summarise multiple rankings to discover
shared patterns and structure.
Assessors
?
?
?
?
1
3
4
2
2
1
3
4
1
2
4
3
4
3
1
2
3
1
4
2
3
1
4
2
1
3
4
2
Consensus ranking
Tasks
Predict individual ratings, when only partial ratings are made.
(not all items rated)
1
3
4
2
2
1
3
4
1
2
4
3
4
3
1
2
?
1
?
2
?
1
?
2
1
?
?
2
Some rankings are
missing in some samples
Tasks
Predict individual ratings, when only partial ratings are made.
(not all items rated)
1
3
4
2
2
1
3
4
1
2
4
3
4
3
1
2
1
1
2
1
2
2
UNCERTAINTY
Tasks
Partition assessors in classes and predict class membership of
new assessors.
Population subtypes
1
3
4
2
4
3
1
2
3
1
4
2
3
1
4
2
1
3
4
2
2
1
3
4
1
2
4
3
4
3
1
2
3
1
4
2
3
1
4
2
2
1
3
4
1
3
4
2
1
2
4
3
1
3
4
2
Classification
of new samples
1
2
MovieLens is a movie recommendation website.
You tell us what movies you love and hate.
We use that information to generate personalized recommendations
for other movies.
Prob ( member Nils likes movie A better than movie B? )
Prob ( for member Nils movie A will be among his top 5 preferences ?)
MovieLens uses collaborative filtering to generate recommendations.
It matches users with similar opinions about movies.
Each user has a 'neighbourhood' of other like-minded users.
Ratings from these neighbours are used to create personalized
recommendations for the target user.
Hundreds of thousands of users.
Started in 1997.
University of Minnesota.
M E TA A N A LY S I S O F G E N E E X P R E S S I O N S A C R O S S L A B S
Gene expression is a measure of the activity of a gene in a sample
~ 20000 genes measured in a few hundred patients (prostate cancer)
Repeated in various cohorts and labs with different technologies.
Absolute measures are hard to compare. Ranks easier.
Each lab produces a ranked list of genes, hard to analyse together
Genes = items to be ranked
Labs = assessors
Merge the studies to produce
a consensus list
Prob (P57 is among top 10?)
Mallows model
Bayesian inference
MCMC algorithm
Applications
What we shall do next
NP-hard; more complicated for α
The Kendall distance measures the minimum number of pairwise
adjacent switches which convert R into ρ.
The computation of the normalizing constant in the Mallows model
when using other distance measures than Kendall's is NP-complete.
Bayes!
Sampling from the posterior by Markov Chain Monte Carlo
for ρ
• Choose an item u at random in {1,2,…,n}.
Its current rank is
• Choose a new rank r for item u, uniformly in
- L, …,
+L
Now two items have rank r and one item (u) has no rank.
• Shift by one all the items of ranks between r and
ρ=(1,2,3,…, n)
= Eq{
}
(Pseudolikelihood)
Need a theorem….
n=26 students
posterior marginal
distribution for the rank
of each potato
true rank
heaviest potato
lightest potato
Represents uncertainty: The trace is the posterior expectation of the
number of correctly ranked potatoes
Central potatoes are the one ranked with highest uncertainty
by looking
by touching
Less uncertainty
Convergence of MCMC with imprecise
Longer MCMC
More Imp Sampl samples
Only a subset of the items have been ranked.
Ranks can be missing at random, or the assessors may only
have ranked, say, the top-5 items.
Can be handled easily in the Bayesian framework, by
applying data augmentation techniques: estimating the
lacking ranks consistently with the partial observations.
Cases vs. Controls
n=89 items
N=5 assessors
(89 genes in total)
VERY UNCERTAIN!
WEAK CONSENSUS!
N=5 too small
1. Find the gene with highest posterior probability of having
rank 1.
2. Among the remaining genes, find the gene with highest
posterior probability of having rank 1 or 2.
3. Etc.  cumulative probability
• The probability of being among the top-10 for each gene.
• “stationary distribution”, level of consensus
• No precise interpretation.
Assessors not one homogeneous group, but C groups
We use a mixture of Mallows models to cluster a sample of N assessors
according to how they rank the n items.
We estimate a latent ranking of the items for each cluster of assessors.
The variables
assign each assessor to one of
the C clusters.
Prior: Dirichlet distribution on the probabilities that an assessor is in
each class
Label switching is not explicitly handled inside the MCMC to ensure full
convergence of the chain (Jasra et al., 2005; Celeux et al., 2000).
MCMC iterations are reordered after convergence is achieved,
using the re-ordering approaches in (Papastamoulis, 2015).
N = 5000 people (assessors) were interviewed, each giving his/her
complete ranking of n = 10 sushi variants (items):
ebi (shrimp),
anago (sea eel),
maguro (tuna),
ika (squid),
uni (sea urchin),
sake (salmon roe),
tamago (egg),
toro (fatty tuna),
tekka-maki (tuna roll),
kappa-maki (cucumber roll).
within cluster distance
of each rank to the
cluster centroid
Elbow rule
DIC
clusters
posterior probabilities for being assigned to each cluster
assessors
• most assessors have posterior probabilities concentrated on some preferred
value of c, indicating a reasonably stable behaviour in the cluster assignments.
MCMC: we need to propose augmented ranks which obey the
partial ordering constraints given by the assessor.
Assume coherent pair comparisons
• perfect stochastic orderings between most of the teams
• … but not all
P (Team A < Team B | all data )
N=5891 assessors (users), n=200 movies
Mean number of movies rated per user = 30.2
Ratings transformed to pair comparisons (as in Lu & Boutilier 2011)
14 classes of users (age and gender) – for simplicity fixed, in real
application would be estimated
Normalising constant approximated as in Mukherjee 2013,
as importance sampling inefficient with n=200
is the posterior predicted probability of the
full ranking for assessor j, consistent with given preferences and
relative to the class j belongs to.
Personalised recommendation.
To test method, we discarded one rated movie per user
Use all other data to estimate
Read off posterior probability of the given (but hidden) preference
Median such probability over all assessors = 0.812
If we decide to predict the preference between two movies by taking
the preference with posterior predictive probability >0.5, then
we make an error in 12.7% of cases.
BIG INSIGHT
2015-2023
Sometimes, it is not enough to crunch data!
VentureBeat
Mike Loukides & VB
MODEL-BASED STATISTICS
Model-based methods exploit knowledge
and structure in the new data,
To understand, discover, predict, control.
BIG INSIGHT
Personalised solutions
Forecasting the transient
Personalised solutions
Forecasting the transient
Telenor: mobile phones
Gjensidige: policy-holders
DNB: customers
Folkehelsa: infected and susceptible
individuals
OUS: patients
NAV: people on sick leave
Skatteetaten: tax-payers
ABB & DNV GL: sensors on a ship
DNV GL & OUS: sensors in healthcare.
Aims:
•personalised marketing,
•personalised products,
•personalised prices,
•personalised risk assessments,
•personalised fraud assessment,
•personalised screening,
•personalised therapy,
•individualised sensor monitoring,
•individualised maintenance schemes,
High frequency data allow to measure
processes in time while they are not in a
stable situation, not in equilibrium.
Personalised solutions
Forecasting the transient
OUS: patient receiving treatment
ABB & DNV GL: sensor on a ship at sea
Telenor, Gjensidige, DNB : customer
NAV: worker who lost the job
HYDRO: electricity prices
Aims:
•Predict the dynamics
•Optimal intervention
•Causal understanding of the factors which
affect the process.
6 Innovation Objectives
1. Changepoint detection
2. Changepoint prediction
• Multivariate time series with dependence
1. some structure of the dependence is known
2. some is not, and must be estimated
Structure = network
• flow
• feed-back
• vicinity in many ways
(spatial, function, type,…)
• conditional independence
1. Changepoint detection
2. Changepoint prediction
The changepoint process must be understood and modelled.
A.from a (longish) history of changepoints, estimate the
rhythm of their arrivals, and use it for prediction
B.understand/assume/estimate the causes of changepoints,
and use these as early warnings.
•variability or extremes?
•slow or fast changes?
•surprise prediction! (absence of alternative hypothesis)
•few sensors fail or many? (sparsity)
•real time prediction: hours, days, minutes ? (what intervention?)
•adaptive resolution of data!
•master (real or in-silico) sensors?
Annals of statistics 2013
N sensors, each giving observations
yn,t
n=1,2,…, N
t=1,2,…
At certain time K, there are changes in the distributions of observations
from a subset M of the sensors. This changepoint K, the subset M and
its size #M are unknown.
Goal: to detect K as soon as possible after it occurs (minimizing
Expected Detection Delay EDD) while keeping the frequency of false
alarms as low as possible.
N is large, #M is relatively small
JRSS-B 2015
www.BigInsight.no