Hierarchies in Data Mining
Raghu Ramakrishnan ([email protected])
Chief Scientist for Audience and Cloud Computing, Yahoo!
Joint work with Bee-Chung Chen, Jude Shavlik, and Pradeep Tamma

About this Talk
• Common theme: the multidimensional view of data
  – Reveals patterns that emerge at coarser granularity
    • Widely recognized, e.g., generalized association rules
  – Helps handle imprecision
    • Analyzing imprecise and aggregated data
  – Helps handle data sparsity
    • Even with massive datasets, sparsity is a challenge!
  – Defines a candidate space of subsets for exploratory mining
    • Forecasting query results over "future data"
    • Using predictive models as summaries
    • Potentially, a space of "mining experiments"?

Background: The Multidimensional Data Model (Cube Space)

Star Schema
• "Fact" table: SERVICE(pid, timeid, locid, repair)
• Dimension tables: PRODUCT(pid, pname, category, model), TIME(timeid, date, week, year), LOCATION(locid, country, region, state)

Dimension Hierarchies
• For each dimension, the set of values can be organized in a hierarchy:
  – PRODUCT: automobile → category → model
  – TIME: year, quarter, month, week, date
  – LOCATION: country → region → state

Multidimensional Data Model
• One fact table D = (X, M)
  – X = X1, X2, ...
(dimension attributes)
  – M = M1, M2, ... (measure attributes)
• Domain hierarchy for each dimension attribute Xi:
  – A collection of domains Hier(Xi) = (DXi(1), ..., DXi(t))
  – The extended domain: EXi = ∪_{1 ≤ k ≤ t} DXi(k)
• Value mapping function γ_{D1→D2}(x)
  – E.g., γ_{month→year}(12/2005) = 2005
  – The mappings form the value hierarchy graph
  – Stored as a dimension-table attribute (e.g., week for a time value) or as conversion functions (e.g., month, quarter)

Multidimensional Data (Example)
• LOCATION hierarchy: ALL → region (East, West) → state (NY, MA, TX, CA)
• AUTOMOBILE hierarchy: ALL → category (Truck, Sedan) → model (F150, Sierra, Camry, Civic)
• Fact table:
  FactID | Auto   | Loc | Repair
  p1     | F150   | NY  | 100
  p2     | Sierra | NY  | 500
  p3     | F150   | MA  | 100
  p4     | Sierra | MA  | 200

Cube Space
• Cube space: C = EX1 × EX2 × ... × EXd
• Region: a hyper-rectangle in cube space
  – c = (v1, v2, ..., vd), vi ∈ EXi
  – E.g., c1 = (NY, Camry); c2 = (West, Sedan)
• Region granularity:
  – gran(c) = (d1, d2, ..., dd), where di = Domain(c.vi)
  – E.g., gran(c1) = (State, Model); gran(c2) = (Region, Category)
• Region coverage:
  – coverage(c) = all facts in c
• Region set: all regions with the same granularity

OLAP Over Imprecise Data
with Doug Burdick, Prasad Deshpande, T.S. Jayram, and Shiv Vaithyanathan
In VLDB 05 and 06; joint work with IBM Almaden
Imprecise Data
• p5 is imprecise: its Auto value (Truck) is at the Category level, not the Model level
  FactID | Auto   | Loc | Repair
  p1     | F150   | NY  | 100
  p2     | Sierra | NY  | 500
  p3     | F150   | MA  | 100
  p4     | Sierra | MA  | 200
  p5     | Truck  | MA  | 100

Querying Imprecise Facts
• Query: Auto = F150, Loc = MA, SUM(Repair) = ???
• How do we treat p5? It is a Truck in MA, but we do not know whether it is an F150 or a Sierra.

Allocation (1)
• Idea: divide each imprecise fact among its possible completions

Allocation (2)
• (Why 0.5 / 0.5? Hold on to that thought.)
  ID | FactID | Auto   | Loc | Repair | Weight
  1  | p1     | F150   | NY  | 100    | 1.0
  2  | p2     | Sierra | NY  | 500    | 1.0
  3  | p3     | F150   | MA  | 100    | 1.0
  4  | p4     | Sierra | MA  | 200    | 1.0
  5  | p5     | F150   | MA  | 100    | 0.5
  6  | p5     | Sierra | MA  | 100    | 0.5

Allocation (3)
• Query the extended data model!
• Auto = F150, Loc = MA: SUM(Repair) = 100 + 0.5 × 100 = 150
Allocation Policies
• The procedure for assigning allocation weights is referred to as an allocation policy
  – Each allocation policy uses different information to assign allocation weights
• Key contributions:
  – An appropriate characterization of the large space of allocation policies (VLDB 05)
  – Efficient algorithms for allocation policies that take into account the correlations in the data (VLDB 06)

Motivating Example
• Query: COUNT
• We propose desiderata that enable an appropriate definition of query semantics for imprecise data

Desideratum I: Consistency
• Consistency specifies the relationship between answers to related queries on a fixed data set

Desideratum II: Faithfulness
• Faithfulness specifies the relationship between answers to a fixed query on related data sets

Possible Worlds
• Imprecise facts lead to many possible worlds [Kripke63, ...]
  – Each world assigns every imprecise fact to one of its possible completions

Query Semantics
• Given all possible worlds together with their probabilities, queries are easily answered using expected values
  – But the number of possible worlds is exponential!
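The allocation scheme above (p5 split 0.5/0.5 between F150 and Sierra) can be made concrete. This is a toy sketch with invented helper names, implementing only the uniform allocation policy; it builds the extended data model and answers the weighted SUM query:

```python
# A toy sketch (invented helper names; uniform allocation only) of the
# extended data model: the imprecise fact p5 is split over its possible
# completions with equal weight, and queries use the weights.

FACTS = [  # (FactID, Auto, Loc, Repair); p5's Auto is a category, not a model
    ("p1", "F150", "NY", 100), ("p2", "Sierra", "NY", 500),
    ("p3", "F150", "MA", 100), ("p4", "Sierra", "MA", 200),
    ("p5", "Truck", "MA", 100),
]
COMPLETIONS = {"Truck": ["F150", "Sierra"]}  # category -> possible models

def allocate(facts):
    """Build the extended data model: one weighted row per completion."""
    extended = []
    for fid, auto, loc, repair in facts:
        models = COMPLETIONS.get(auto, [auto])
        w = 1.0 / len(models)                  # uniform allocation policy
        for model in models:
            extended.append((fid, model, loc, repair, w))
    return extended

def weighted_sum(extended, auto, loc):
    """SUM(Repair) over the extended model, weighting each row."""
    return sum(r * w for _, a, l, r, w in extended if a == auto and l == loc)

ext = allocate(FACTS)
print(weighted_sum(ext, "F150", "MA"))  # 100 + 0.5 * 100 = 150.0
```

Note that the extended table grows only linearly in the number of completions of imprecise facts, matching the size argument on the next slide.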
• Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
  – The size increase is linear in the number of (completions of) imprecise facts
  – Queries operate over this extended version

Dealing with Data Sparsity
Deepak Agarwal, Andrei Broder, Deepayan Chakrabarti, Dejan Diklic, Vanja Josifovski, Mayssam Sayyadian
Estimating Rates of Rare Events at Multiple Resolutions, KDD 2007

Motivating Application: The Content Match Problem
• Problem: which ads are good on which pages?
  – Pages: no control; ads: can control
• First simplification: a (page, ad) pair is completely characterized by a set of high-dimensional features
• Naïve approach: experiment with all possible pairs several times and estimate CTR
  – Of course, this doesn't work: most (ad, page) pairs have very few impressions, if any, and even fewer clicks
  – Severe data sparsity

Estimation in the "Tail"
• Use an existing, well-understood hierarchy
  – Categorize ads and webpages to leaves of the hierarchy
  – CTR estimates of siblings are correlated
  – The hierarchy allows us to aggregate data
• Coarser resolutions
  – provide reliable estimates for rare events,
  – which then influence estimation at finer resolutions
• Similar "coarsening", different motivation: Mining Generalized Association Rules, Ramakrishnan Srikant and Rakesh Agrawal, VLDB 1995
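One simple way to realize "coarser resolutions influence finer ones" is to shrink each leaf's CTR estimate toward its parent's aggregate. This is an illustrative sketch only, not the KDD 2007 paper's model; the tree, counts, and smoothing strength are all invented:

```python
# Toy hierarchy-based CTR smoothing: aggregate clicks/impressions bottom-up,
# then shrink sparse leaf estimates toward the parent's aggregate CTR.
# All data and the smoothing constant are invented for illustration.

TREE = {"root": ["sports", "finance"],
        "sports": ["leaf_a", "leaf_b"], "finance": ["leaf_c"]}
CLICKS = {"leaf_a": 2, "leaf_b": 0, "leaf_c": 30}
IMPR = {"leaf_a": 40, "leaf_b": 10, "leaf_c": 1000}

def totals(node):
    """Aggregate (clicks, impressions) for a node from its leaves."""
    if node not in TREE:                       # leaf node
        return CLICKS[node], IMPR[node]
    c = i = 0
    for child in TREE[node]:
        cc, ci = totals(child)
        c, i = c + cc, i + ci
    return c, i

def smoothed_ctr(leaf, parent, strength=100.0):
    """Shrink the leaf CTR toward the parent CTR (Beta-prior-like smoothing)."""
    pc, pi = totals(parent)
    prior = pc / pi                            # parent's aggregate CTR
    return (CLICKS[leaf] + strength * prior) / (IMPR[leaf] + strength)

# leaf_b has zero clicks, yet inherits a non-zero estimate from "sports"
print(smoothed_ctr("leaf_b", "sports"))
```

The point of the sketch is the direction of information flow: reliable coarse aggregates stabilize estimates at sparse leaves, exactly the role the hierarchy plays on this slide.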
Sampling of Webpages
• Naïve strategy: sample at random from the set of URLs
  – Sampling errors in both impression volume and click volume
• Instead, we propose:
  – crawling all URLs with at least one click, and
  – a sample of the remaining URLs
  – Variability is then only in impression volume

Imputation of Impression Volume
• Region node = (page node, ad node)
• Build a region hierarchy Z(0), Z(1), ..., Z(i), ...: a cross-product of the page hierarchy and the ad hierarchy, down to leaf regions

Exploiting Taxonomy Structure
• Consider the bottom two levels of the taxonomy
• Each cell corresponds to a (page, ad)-class pair
• Key point: children under a parent node are alike and expected to have similar CTRs (i.e., they form a cohesive block)

Imputation of Impression Volume
• For any level Z(i), with page classes as rows and ad classes as columns, each cell ij has
  – nij: impressions in the clicked pool
  – mij: impressions in the sampled non-clicked pool
  – xij: excess impressions (to be imputed)
  – #impressions = nij + mij + xij
• Row constraint: each row sums to Σ nij + K · Σ mij
• Column constraint: each column sums to the number of impressions on ads of that ad class
• Overall, the cells sum to the total impressions (known)
• Block constraint: the cells of each block sum to their parent region's total
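The row and column constraints above can be enforced by alternating proportional rescaling. Here is a minimal sketch on an invented 2×2 example; the deck's actual method (iterative proportional fitting, next slide) additionally rescales blocks to their parent totals at every hierarchy level:

```python
# Minimal iterative proportional fitting: alternately rescale rows and
# columns of x until both sets of marginal totals are matched.
# The matrix and totals are invented toy values.

def ipf(x, row_totals, col_totals, iters=50):
    """Alternately rescale rows, then columns, of x to match the totals."""
    for _ in range(iters):
        for i, rt in enumerate(row_totals):          # rescale each row
            s = sum(x[i])
            x[i] = [v * rt / s for v in x[i]]
        for j, ct in enumerate(col_totals):          # rescale each column
            s = sum(row[j] for row in x)
            for i in range(len(x)):
                x[i][j] *= ct / s
    return x

# Initialize x_ij (e.g., to n_ij + m_ij) and fit to the known marginals
x = ipf([[1.0, 1.0], [1.0, 1.0]], row_totals=[40, 60], col_totals=[30, 70])
```

After fitting, each row and column of `x` sums to its required total while the cells stay proportional to the initialization.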
Imputing xij: Iterative Proportional Fitting [Darroch+/1972]
• Initialize xij = nij + mij
• Top-down, for every level Z(i):
  – Scale all xij in every block in Z(i+1) to sum to its parent in Z(i)
  – Scale all xij in Z(i+1) to sum to the row totals
  – Scale all xij in Z(i+1) to sum to the column totals
• Bottom-up: similar

Imputation: Summary
• Given
  – nij (impressions in the clicked pool)
  – mij (impressions in the sampled non-clicked pool)
  – the number of impressions on ads of each ad class in the ad hierarchy
• We get
  – an estimated impression volume Ñij = nij + mij + xij in each region ij of every level Z(.)

Dealing with Data Sparsity
Deepak Agarwal, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, Joe Zachariah
Real-time Content Optimization through Active User Feedback, NIPS 2008

Yahoo! Home Page Featured Box
• The top-center part of the Y! front page
• It has four tabs: Featured, Entertainment, Sports, and Video

Novel Aspects
• Classical: arms assumed fixed over time
  – We gain and lose arms over time
  – Some theoretical work by Whittle in the 80's (operations research)
• Classical: serving rule updated after each pull
  – We compute an optimal design in batch mode
• Classical: CTR generally assumed stationary
  – We have highly dynamic, non-stationary CTRs

Bellwether Analysis: Global Aggregates from Local Regions
with Bee-Chung Chen, Jude Shavlik, and Pradeep Tamma
In VLDB 06
Motivating Example
• A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie)
  – By looking at features and profits of previous (similar) movies, we predict the expected total profit (1-year US sales) for the new movie
  – Wait a year and write a query! If you can't wait, stay awake ...
• The most predictive "features" may be based on sales data gathered by releasing the new movie in many "regions" (different locations over different time periods)
  – Example "region-based" features: 1st-week sales in Peoria, week-to-week sales growth in Wisconsin, etc.
  – Gathering this data has a cost (e.g., marketing expenses, waiting time)
• Problem statement: find the most predictive region features that can be obtained within a given "cost budget"

Key Ideas
• Large datasets are rarely labeled with the targets that we wish to learn to predict
  – But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st-week sales in Peoria) and even targets (e.g., profit) for mining
• We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
  – The central problem is to find data subsets ("bellwether regions") that lead to predictive features which can be gathered at low cost for a new case

Motivating Example: Database Schema
• Profit Table (Time, Location, CustID, ItemID, Profit)
• Ad Table (Time, Location, ItemID, AdExpense, AdSize)
• Item Table (ItemID, Category, R&D Expense)
• In each table, the combination of the underlined attributes forms a key
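The "use OLAP queries to generate features" idea can be illustrated with a toy aggregate over a hypothetical Profit table. The rows and helper names below are invented; the point is that a region feature such as Profit[1wk, HK] is just an aggregate grouped by item:

```python
# Toy OLAP-style feature generation over an invented Profit table:
# Sum(Profit) per item, restricted to the region [weeks, location].
from collections import defaultdict

PROFIT = [  # (Time, Location, CustID, ItemID, Profit); invented rows
    (1, "HK", "c1", 1, 200), (1, "HK", "c2", 1, 300),
    (2, "HK", "c1", 1, 150), (1, "US", "c3", 1, 900),
    (1, "HK", "c4", 2, 50),
]

def region_feature(rows, weeks, location):
    """Sum(Profit) per item over the region [weeks, location]."""
    feat = defaultdict(int)
    for t, loc, _, item, profit in rows:
        if t in weeks and loc == location:
            feat[item] += profit
    return dict(feat)

# Profit[1wk, HK] as a feature column keyed by ItemID
print(region_feature(PROFIT, weeks={1}, location="HK"))
```

The same pattern, with a different aggregate, produces the target (e.g., total first-year profit), which is why no manual labeling is needed.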
A Straightforward Approach
• Build a regression model to predict item profit
• By joining and aggregating tables in the historical database we can create a training set:
  ItemID | Category | R&D Expense | Profit (target)
  1      | Laptop   | 500K        | 12,000K
  2      | Desktop  | 100K        | 8,000K
  ...    | ...      | ...         | ...
• An example regression model:
  Profit = β0 + β1 Laptop + β2 Desktop + β3 RdExpense
• There is much room for accuracy improvement!

Using Regional Features
• Example region: [1st week, HK]
• Regional features:
  – Regional profit: the 1st-week profit in HK
  – Regional ad expense: the 1st-week ad expense in HK
• A possibly more accurate model:
  Profit[1yr, All] = β0 + β1 Laptop + β2 Desktop + β3 RdExpense + β4 Profit[1wk, HK] + β5 AdExpense[1wk, HK]
• Problem: which region should we use?
  – The smallest region that improves the accuracy the most
  – We give each candidate region a cost
  – The most "cost-effective" region is the bellwether region

Basic Bellwether Problem
• For each region r (e.g., r = [1-2, USA]), aggregate over data records to obtain
  – features φi,r(DB), e.g., Profit[1-2, USA] for item i (say, 45K for a desktop), and
  – the target ψi(DB), e.g., the total profit of item i in [1-52, All] (say, 2,000K)
• For each region r, build a predictive model hr(x); then choose as bellwether the region for which
  – Coverage(r) ≥ the minimum coverage support (fraction of all items in the region),
  – Cost(r, DB) ≤ the cost threshold, and
  – Error(hr) is minimized
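The basic bellwether search just stated can be sketched as a scan over candidate regions. This is a hedged sketch: the helper functions and the toy region data are hypothetical stand-ins for model training and cost/coverage accounting:

```python
# Hypothetical sketch of basic bellwether search: among regions satisfying
# the coverage and cost constraints, return the one with lowest model error.

def find_bellwether(regions, budget, min_coverage,
                    cost, coverage, train_and_error):
    best_region, best_err = None, float("inf")
    for r in regions:
        if cost(r) > budget or coverage(r) < min_coverage:
            continue                       # infeasible region: skip it
        err = train_and_error(r)           # build h_r(x), estimate Error(h_r)
        if err < best_err:
            best_region, best_err = r, err
    return best_region, best_err

# Invented toy candidates: cost, coverage, and model error per region
costs = {"[1-2, USA]": 10, "[1-8, MD]": 20, "[1-52, All]": 99}
covs = {"[1-2, USA]": 0.9, "[1-8, MD]": 0.8, "[1-52, All]": 1.0}
errs = {"[1-2, USA]": 9000, "[1-8, MD]": 6000, "[1-52, All]": 4000}

best, err = find_bellwether(list(costs), budget=50, min_coverage=0.5,
                            cost=costs.get, coverage=covs.get,
                            train_and_error=errs.get)
```

The slide's iceberg-cube optimization would push the cost/coverage tests into the cube computation itself, pruning infeasible regions before any model is trained, instead of filtering one region at a time as here.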
Experiment on a Mail Order Dataset: Error-vs-Budget Plot
• [Plot: RMSE vs. budget (5-85) for three methods; the bellwether region found is [1-8 month, MD]]
• Bel Err: the error of the bellwether region found using a given budget
• Avg Err: the average error of all the cube regions with costs under a given budget
• Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget
• (RMSE: root mean square error)

Experiment on a Mail Order Dataset: Uniqueness Plot
• [Plot: fraction of indistinguishable regions vs. budget (5-85); the bellwether region is [1-8 month, MD]]
• Y-axis: the fraction of regions that are as good as the bellwether region
  – i.e., the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the bellwether region's error
• We have 99% confidence that [1-8 month, MD] is a quite unusual bellwether region

Basic Bellwether Computation
• OLAP-style bellwether analysis
  – Candidate regions: regions in a data cube
  – Queries: OLAP-style aggregate queries, e.g., Sum(Profit) over a region
• Efficient computation:
  – Use iceberg-cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01)
    • Infeasible regions: regions with cost > B or coverage < C
  – Share computation by generating the features and target values for all the feasible regions together
    • Exploit distributive and algebraic aggregate functions
    • Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
Subset-Based Bellwether Prediction
• Motivation: different subsets of items may have different bellwether regions
  – E.g., the bellwether region for laptops may differ from the bellwether region for clothes
• Two approaches:
  – Bellwether tree: a decision tree that splits on item attributes (e.g., Category; R&D Expense > 50K) and stores a bellwether region (e.g., [1-2, WI], [1-1, NY], [1-3, MD]) at each leaf
  – Bellwether cube: a cube over item attributes (e.g., Category × R&D Expense level: Low/Medium/High), storing a bellwether region (e.g., [1-3, CA], [1-4, MD]) in each cell

Characteristics of Bellwether Trees and Cubes
• Dataset generation: use a random tree to generate different bellwether regions for different subsets of items
• Parameters: noise; concept complexity (number of tree nodes)
• Results: [plots of RMSE vs. noise level (0.05-1, at 15 nodes) and RMSE vs. number of nodes (2-31, at noise level 0.5), for basic, cube, and tree]
  – Bellwether trees and cubes have better accuracy than basic bellwether search
  – Increasing noise increases error
  – Increasing complexity increases error

Efficiency Comparison
• [Plot: seconds vs. thousands of examples (100-300)]
• Naïve computation methods (naïve cube, naïve tree) are much slower than our computation techniques (RF tree, single-scan cube, optimized cube)

Scalability
• [Plots: seconds vs. millions of examples (2.5-10) for single-scan cube, optimized cube, and RF tree]
Exploratory Mining: Prediction Cubes
with Bee-Chung Chen, Lei Chen, and Yi Lin
In VLDB 05

The Idea
• Build OLAP data cubes in which cell values represent decision/prediction behavior
  – In effect, build a tree for each cell/region in the cube; observe that this is not the same as a collection of trees used in an ensemble method!
  – The idea is simple, but it leads to promising data mining tools
  – Ultimate objective: exploratory analysis of the entire space of "data mining choices"
    • Choice of algorithms, data conditioning parameters, ...

Example (1/7): Regular OLAP
• Z: dimensions (Location, Time); Y: measure (number of applications)
• Goal: look for patterns of unusually high numbers of applications
• Location hierarchy: All → Country (Japan, USA, Norway, ...) → State (AL, ..., WY)
• Time hierarchy: All → Year (85, 86, ..., 04) → Month (Jan 86, ..., Dec 86)
• E.g., (AL, USA; Dec, 04): 2 applications; ...; (WY, USA; Dec, 04): 3 applications

Example (2/7): Regular OLAP
• Cell value: number of loan applications
• Roll up to coarser regions (e.g., [USA, 2004]); drill down to finer regions (e.g., [AL, Dec 2004])
• [Tables of cell values at the [Country, Year], [Country, Month], and [State, Month] granularities]

Example (3/7): Decision Analysis
• Goal: analyze a bank's loan decision process w.r.t.
two dimensions: Location and Time
• Fact table D: Z = dimensions (Location, Time); X = predictors (Race, Sex, ...); Y = class (Approval)
  – E.g., (AL, USA; Dec, 04; White; M; ...; Yes), ..., (WY, USA; Dec, 04; Black; F; ...; No)
• For a cube subset Z(D), build a model h(X; Z(D)), e.g., a decision tree

Example (3/7): Decision Analysis (contd.)
• Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
  – Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, "Are the predictions of this classifier closely correlated with race?"
• Are there branches and times with decision making reminiscent of 1950s Alabama?
  – This requires comparing classifiers trained using different subsets of the data.

Example (4/7): Prediction Cubes
1. Build a model using the data in a cell, e.g., [USA, Dec 04](D)
2. Evaluate that model
• Measure in a cell:
  – Accuracy of the model
  – Predictiveness of Race, measured based on that model
  – Similarity between that model and a given model
• [Cube of per-cell measures over [Country, Month] cells]
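The "build a model per cell, record a measure" step can be sketched in miniature. Everything here is invented for illustration: the data, the trivial majority-class stand-in for a real learner, and the per-cell measure (training accuracy):

```python
# Toy prediction-cube construction: partition the fact table by the
# dimension attributes, fit a (stub) model per cell, record a measure.
from collections import Counter, defaultdict

DATA = [  # (Location, Time, Race, Sex, Approval); invented rows
    ("USA", "Dec-04", "White", "M", "Yes"),
    ("USA", "Dec-04", "Black", "F", "No"),
    ("USA", "Dec-04", "White", "F", "Yes"),
    ("USA", "Nov-04", "Black", "M", "Yes"),
]

def prediction_cube(data, measure):
    """One cell per (Location, Time) pair; cell value = measure(cell data)."""
    cells = defaultdict(list)
    for row in data:
        cells[(row[0], row[1])].append(row)
    return {cell: measure(rows) for cell, rows in cells.items()}

def majority_accuracy(rows):
    """Fit the trivial majority-class model and score it on its own cell."""
    labels = [r[-1] for r in rows]
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)

cube = prediction_cube(DATA, majority_accuracy)
```

A real prediction cube would plug in a genuine learner (e.g., a decision tree) and one of the slide's measures; the cube-building skeleton stays the same.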
Example (5/7): Model-Similarity
• Given: data table D, a target model h0(X), and a test set (without labels)
• For each cell, e.g., at level [Country, Month]: build a model on the cell's data, then measure its similarity to h0 on the test set
• [Cube of per-cell similarity values]
• Example finding: the loan decision process in the USA during Dec 04 was similar to a discriminatory decision model

Example (6/7): Predictiveness
• Given: data table D, a set of attributes V, and a test set (without labels)
• For each cell: build models h(X) and h(X − V); the predictiveness of V is the degree to which their predictions on the test set differ
• [Cube of per-cell predictiveness values]
• Example finding: Race was an important predictor of the loan approval decision in the USA during Dec 04

Example (7/7): Prediction Cube
• Cell value: predictiveness of Race
• Roll up from [Country, Month] to [Country, Year]; drill down from [Country, Month] to [State, Month]
• [Cubes of predictiveness values at the three granularities]
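The predictiveness measure just described can be sketched concretely: train a model with and without the attribute set V and count how often the two disagree on an unlabeled test set. The "model" below is a toy majority-class-per-feature-combination stub, and all data are invented:

```python
# Toy attribute-predictiveness: compare h(X) with h(X - V) on a test set.
# V = {Race}; disagreement rate serves as the predictiveness of Race.
from collections import Counter, defaultdict

TRAIN = [  # ((Race, Sex), Approval); invented, deliberately race-correlated
    (("White", "M"), "Yes"), (("White", "M"), "Yes"), (("White", "M"), "Yes"),
    (("White", "F"), "Yes"), (("White", "F"), "Yes"),
    (("Black", "M"), "No"), (("Black", "F"), "No"),
]

def fit(rows, keep):
    """Stub learner: majority class per combination of the kept predictors."""
    by_key = defaultdict(Counter)
    for x, y in rows:
        by_key[tuple(x[i] for i in keep)][y] += 1
    default = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: (by_key[tuple(x[i] for i in keep)].most_common(1)[0][0]
                      if tuple(x[i] for i in keep) in by_key else default)

TEST = [("White", "M"), ("White", "F"), ("Black", "M"), ("Black", "F")]
h_full = fit(TRAIN, keep=[0, 1])     # h(X): uses Race and Sex
h_minus = fit(TRAIN, keep=[1])       # h(X - V): Race dropped
disagree = sum(h_full(x) != h_minus(x) for x in TEST) / len(TEST)
```

Here dropping Race changes half of the predictions, so V = {Race} is highly predictive in this cell; computing `disagree` per cube cell yields the prediction cube of Example (7/7).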
Efficient Computation
• Reduce prediction cube computation to data cube computation
  – Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied

Bottom-Up Data Cube Computation
• Cell values: numbers of loan applications
           1985  1986  1987  1988 | All
  Norway     10    30    20    24 |  84
  ...        23    45    14    32 | 114
  USA        14    32    42    11 |  99
  All        47   107    76    67 | 297
• Coarser aggregates are computed from finer aggregates, not from the raw data

Functions on Sets
• Bottom-up computable functions: functions that can be computed using only summary information
• Distributive function: α(X) = F({α(X1), ..., α(Xn)})
  – where X = X1 ∪ ... ∪ Xn and Xi ∩ Xj = ∅
  – E.g., Count(X) = Sum({Count(X1), ..., Count(Xn)})
• Algebraic function: α(X) = F({G(X1), ..., G(Xn)})
  – G(Xi) returns a fixed-length vector of values
  – E.g., Avg(X) = F({G(X1), ..., G(Xn)}), where
    • G(Xi) = [Sum(Xi), Count(Xi)]
    • F({[s1, c1], ..., [sn, cn]}) = Sum({si}) / Sum({ci})

Scoring Function
• Represent a model as a function of sets
• Conceptually, a machine-learning model h(x; Z(D)) is a scoring function Score(y, x; Z(D)) that gives each class y a score on test example x
  – h(x; Z(D)) = argmax_y Score(y, x; Z(D))
  – Score(y, x; Z(D)) ≈ p(y | x, Z(D))
  – Z(D): the set of training examples (a cube subset of D)
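The algebraic Avg example above can be written out directly: G produces a fixed-length summary per partition, and F combines summaries without revisiting the raw data. A minimal sketch:

```python
# Avg as an algebraic function: per-partition summaries [sum, count]
# are combined bottom-up, exactly as in the slide's definition.

def G(xs):
    """Per-partition summary: a fixed-length vector [Sum(Xi), Count(Xi)]."""
    return [sum(xs), len(xs)]

def F(summaries):
    """Combine summaries: Sum({si}) / Sum({ci}); no raw data needed."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

parts = [[1, 2, 3], [4, 5]]            # X = X1 ∪ X2, disjoint partitions
avg = F([G(p) for p in parts])         # equals Avg over all of X
```

Making a model's Score behave like G and F in this way is precisely what lets prediction cubes reuse bottom-up data-cube machinery.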
Machine-Learning Models
• Naïve Bayes: the scoring function is algebraic
• Kernel-density-based classifier: the scoring function is distributive
• Decision tree, random forest: neither distributive nor algebraic
• PBE: probability-based ensemble (new)
  – Makes any machine-learning model distributive
  – An approximation

Probability-Based Ensemble
• Decision trees are built on the lowest-level cells (e.g., [WA, Jan 85], ..., [WA, Dec 85])
• The PBE version of a decision tree on [WA, 85] combines these per-cell trees

Efficiency Comparison
• [Plot: execution time (sec) vs. number of records (40K-200K)]
• Exhaustive methods (RFex, KDCex, NBex, J48ex) are much slower than bottom-up score computation (NB, KDC, RF-PBE, J48-PBE)

Conclusions
Related Work: Building Models on OLAP Results
• Multi-dimensional regression [Chen, VLDB 02]
  – Goal: detect changes of trends
  – Build linear regression models for cube cells
• Step-by-step regression in stream cubes [Liu, PAKDD 03]
• Loglinear-based quasi cubes [Barbara, J. IIS 01]
  – Use a loglinear model to approximately compress dense regions of a data cube
• NetCube [Margaritis, VLDB 01]
  – Build a Bayes net on the entire dataset to approximately answer count queries
• Cubegrades [Imielinski, J. DMKD 02]
  – Extend cubes with ideas from association rules
  – How does the measure change when we roll up or drill down?
• Constrained gradients [Dong, VLDB 01]
  – Find pairs of similar cell characteristics associated with big changes in measure
• User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]
  – Help users find the most informative unvisited regions in a data cube using the maximum-entropy principle
• Multi-structural DBs [Fagin et al., PODS 05, VLDB 05]
• Experiment databases: Towards an Improved Experimental Methodology in Machine Learning [Blockeel & Vanschoren, PKDD 2007]

Take-Home Messages
• A promising exploratory data analysis paradigm:
  – Can use models to identify interesting subsets
  – Concentrate only on subsets in cube space; those are meaningful subsets, and tractable
  – Precompute results and provide users with an interactive tool
• A simple way to plug "something" into cube-style analysis: try to describe/approximate "something" by a distributive or algebraic function

Conclusion
• Hierarchies are widely used, and a promising tool to help us deal with
  – Data sparsity
  – Data imprecision and uncertainty
  – Exploratory analysis
  – "Experiment" planning and management
• The area is as yet under-appreciated
  – There is lots of work on taxonomies and how to use them, but many novel ways of using them have not received enough attention