Download Posterior uncertainty for rank data aggregation and a priori plans for

Posterior uncertainty for rank data aggregation and a priori plans for BigInsight Arnoldo Frigessi [email protected] BigInsight RANKS 3 5 1 2 4 Ranked data is everywhere. Rankings arise when … • • • • users express preferences about products and services, voters cast ballots in elections, research projects are evaluated based on their merits, genes are ordered based on their expression levels under various experimental conditions. A ranking represents a statement about the relative quality or relevance of the items being ranked. Assessors rank items. Designed or observed Panel, volunteers, users…. 1 3 4 2 Tasks Aggregate, merge, summarise multiple rankings to discover shared patterns and structure. Assessors ? ? ? ? 1 3 4 2 2 1 3 4 1 2 4 3 4 3 1 2 3 1 4 2 3 1 4 2 1 3 4 2 Consensus ranking Tasks Predict individual ratings, when only partial ratings are made. (not all items rated) 1 3 4 2 2 1 3 4 1 2 4 3 4 3 1 2 ? 1 ? 2 ? 1 ? 2 1 ? ? 2 Some rankings are missing in some samples Tasks Predict individual ratings, when only partial ratings are made. (not all items rated) 1 3 4 2 2 1 3 4 1 2 4 3 4 3 1 2 1 1 2 1 2 2 UNCERTAINTY Tasks Partition assessors in classes and predict class membership of new assessors. Population subtypes 1 3 4 2 4 3 1 2 3 1 4 2 3 1 4 2 1 3 4 2 2 1 3 4 1 2 4 3 4 3 1 2 3 1 4 2 3 1 4 2 2 1 3 4 1 3 4 2 1 2 4 3 1 3 4 2 Classification of new samples 1 2 MovieLens is a movie recommendation website. You tell us what movies you love and hate. We use that information to generate personalized recommendations for other movies. Prob ( member Nils likes movie A better than movie B? ) Prob ( for member Nils movie A will be among his top 5 preferences ?) MovieLens uses collaborative filtering to generate recommendations. It matches users with similar opinions about movies. Each user has a 'neighbourhood' of other like-minded users. Ratings from these neighbours are used to create personalized recommendations for the target user. Hundreds of thousands of users. Started in 1997. University of Minnesota. M E TA A N A LY S I S O F G E N E E X P R E S S I O N S A C R O S S L A B S Gene expression is a measure of the activity of a gene in a sample ~ 20000 genes measured in a few hundred patients (prostate cancer) Repeated in various cohorts and labs with different technologies. Absolute measures are hard to compare. Ranks easier. Each lab produces a ranked list of genes, hard to analyse together Genes = items to be ranked Labs = assessors Merge the studies to produce a consensus list Prob (P57 is among top 10?) Mallows model Bayesian inference MCMC algorithm Applications What we shall do next NP-hard; more complicated for α The Kendall distance measures the minimum number of pairwise adjacent switches which convert R into ρ. The computation of the normalizing constant in the Mallows model when using other distance measures than Kendall's is NP-complete. Bayes! Sampling from the posterior by Markov Chain Monte Carlo for ρ • Choose an item u at random in {1,2,…,n}. Its current rank is • Choose a new rank r for item u, uniformly in - L, …, +L Now two items have rank r and one item (u) has no rank. • Shift by one all the items of ranks between r and ρ=(1,2,3,…, n) = Eq{ } (Pseudolikelihood) Need a theorem…. n=26 students posterior marginal distribution for the rank of each potato true rank heaviest potato lightest potato Represents uncertainty: The trace is the posterior expectation of the number of correctly ranked potatoes Central potatoes are the one ranked with highest uncertainty by looking by touching Less uncertainty Convergence of MCMC with imprecise Longer MCMC More Imp Sampl samples Only a subset of the items have been ranked. Ranks can be missing at random, or the assessors may only have ranked, say, the top-5 items. Can be handled easily in the Bayesian framework, by applying data augmentation techniques: estimating the lacking ranks consistently with the partial observations. Cases vs. Controls n=89 items N=5 assessors (89 genes in total) VERY UNCERTAIN! WEAK CONSENSUS! N=5 too small 1. Find the gene with highest posterior probability of having rank 1. 2. Among the remaining genes, find the gene with highest posterior probability of having rank 1 or 2. 3. Etc.  cumulative probability • The probability of being among the top-10 for each gene. • “stationary distribution”, level of consensus • No precise interpretation. Assessors not one homogeneous group, but C groups We use a mixture of Mallows models to cluster a sample of N assessors according to how they rank the n items. We estimate a latent ranking of the items for each cluster of assessors. The variables assign each assessor to one of the C clusters. Prior: Dirichlet distribution on the probabilities that an assessor is in each class Label switching is not explicitly handled inside the MCMC to ensure full convergence of the chain (Jasra et al., 2005; Celeux et al., 2000). MCMC iterations are reordered after convergence is achieved, using the re-ordering approaches in (Papastamoulis, 2015). N = 5000 people (assessors) were interviewed, each giving his/her complete ranking of n = 10 sushi variants (items): ebi (shrimp), anago (sea eel), maguro (tuna), ika (squid), uni (sea urchin), sake (salmon roe), tamago (egg), toro (fatty tuna), tekka-maki (tuna roll), kappa-maki (cucumber roll). within cluster distance of each rank to the cluster centroid Elbow rule DIC clusters posterior probabilities for being assigned to each cluster assessors • most assessors have posterior probabilities concentrated on some preferred value of c, indicating a reasonably stable behaviour in the cluster assignments. MCMC: we need to propose augmented ranks which obey the partial ordering constraints given by the assessor. Assume coherent pair comparisons • perfect stochastic orderings between most of the teams • … but not all P (Team A < Team B | all data ) N=5891 assessors (users), n=200 movies Mean number of movies rated per user = 30.2 Ratings transformed to pair comparisons (as in Lu & Boutilier 2011) 14 classes of users (age and gender) – for simplicity fixed, in real application would be estimated Normalising constant approximated as in Mukherjee 2013, as importance sampling inefficient with n=200 is the posterior predicted probability of the full ranking for assessor j, consistent with given preferences and relative to the class j belongs to. Personalised recommendation. To test method, we discarded one rated movie per user Use all other data to estimate Read off posterior probability of the given (but hidden) preference Median such probability over all assessors = 0.812 If we decide to predict the preference between two movies by taking the preference with posterior predictive probability >0.5, then we make an error in 12.7% of cases. BIG INSIGHT 2015-2023 Sometimes, it is not enough to crunch data! VentureBeat Mike Loukides & VB MODEL-BASED STATISTICS Model-based methods exploit knowledge and structure in the new data, To understand, discover, predict, control. BIG INSIGHT Personalised solutions Forecasting the transient Personalised solutions Forecasting the transient Telenor: mobile phones Gjensidige: policy-holders DNB: customers Folkehelsa: infected and susceptible individuals OUS: patients NAV: people on sick leave Skatteetaten: tax-payers ABB & DNV GL: sensors on a ship DNV GL & OUS: sensors in healthcare. Aims: •personalised marketing, •personalised products, •personalised prices, •personalised risk assessments, •personalised fraud assessment, •personalised screening, •personalised therapy, •individualised sensor monitoring, •individualised maintenance schemes, High frequency data allow to measure processes in time while they are not in a stable situation, not in equilibrium. Personalised solutions Forecasting the transient OUS: patient receiving treatment ABB & DNV GL: sensor on a ship at sea Telenor, Gjensidige, DNB : customer NAV: worker who lost the job HYDRO: electricity prices Aims: •Predict the dynamics •Optimal intervention •Causal understanding of the factors which affect the process. 6 Innovation Objectives 1. Changepoint detection 2. Changepoint prediction • Multivariate time series with dependence 1. some structure of the dependence is known 2. some is not, and must be estimated Structure = network • flow • feed-back • vicinity in many ways (spatial, function, type,…) • conditional independence 1. Changepoint detection 2. Changepoint prediction The changepoint process must be understood and modelled. A.from a (longish) history of changepoints, estimate the rhythm of their arrivals, and use it for prediction B.understand/assume/estimate the causes of changepoints, and use these as early warnings. •variability or extremes? •slow or fast changes? •surprise prediction! (absence of alternative hypothesis) •few sensors fail or many? (sparsity) •real time prediction: hours, days, minutes ? (what intervention?) •adaptive resolution of data! •master (real or in-silico) sensors? Annals of statistics 2013 N sensors, each giving observations yn,t n=1,2,…, N t=1,2,… At certain time K, there are changes in the distributions of observations from a subset M of the sensors. This changepoint K, the subset M and its size #M are unknown. Goal: to detect K as soon as possible after it occurs (minimizing Expected Detection Delay EDD) while keeping the frequency of false alarms as low as possible. N is large, #M is relatively small JRSS-B 2015 www.BigInsight.no

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Posterior uncertainty for rank data aggregation and a priori plans for