Learning Relational Probability Trees
Jennifer Neville, David Jensen, Lisa Friedland, Michael Hay
Presented by Andrew Tjang

About the authors
– David Jensen – Director of KDL, research assistant professor @ UMass
  – Research focuses on the statistical aspects and architecture of data mining systems, and on the evaluation of those systems for government and business applications

Authors, cont'd
– Lisa Friedland – graduate student, KDL; interested in databases/data mining
– Michael Hay – graduate student, KDL
– Jennifer Neville – master's student @ UMass in the Knowledge Discovery Laboratory (KDL investigates how to find useful patterns in large and complex databases)

Learning Relational Probability Trees (RPTs)
– Classification trees are widely used in data mining: easily interpretable
– Conventional trees assume homogeneous, statistically independent data
– Much research has gone into carrying classification trees over to relational data, so that the added variance can be safely accounted for and relational effects put on the front burner

Why is relational data so complicated?
– The traditional assumption of instance independence is violated
– Three characteristics of relational data:
  – Concentrated linkage
  – Degree disparity
  – Relational autocorrelation
– All three potentially produce overly complex models with excess structure
  – This leads to factually incorrect relations and greater space/computation requirements

What did they do?
– Conventional techniques/models assume independence
  – RPTs are a class of algorithms adjusted for relational characteristics
– RPTs are the only statistical model that adjusts for these relational characteristics
– Applied these algorithms to a set of databases (IMDb, Cora, WebKB, human genome)
  – Then evaluated the accuracy of their RPT applications

Examples of relational data
– Social networks, genomic data, and data on interrelated people, places, and things/events drawn from text documents
– The first step was to apply an RPT to WebKB (a collection of web pages classified into categories: course, faculty, staff, student, research project)
  – Approximately 4,000 pages and 8,000 hyperlinks
  – The database contains attributes associated with each page/hyperlink (path/domain, direction of hyperlink)

Probability (sub)tree of WebKB
– [figure: an example probability estimation subtree learned on WebKB]

RPTs
– Estimate probability distributions over class label values
– These algorithms differ from conventional ones: they consider the effects of related objects (the movie example)
– Aggregation is used to "proportionalize" relational data
  – E.g., some movies have 10 actors and some have 1,000, so aggregate actor age as an average
  – Aggregation can happen in preprocessing and/or dynamically during the learning process

The algorithm
– The RPT algorithm takes a collection of subgraphs as input (a single target object to be classified, plus the other objects/links in its relational neighborhood)
– It constructs a probability estimation tree (like the previous figure) to predict the target class label given:
  – The attributes of the target object
  – The attributes of the other objects in the neighborhood
  – Degree attributes counting the objects in the neighborhood

Algorithm (cont'd)
– Searches over a space of binary relational features to split the data, e.g. MODE(actor.gender)=female
– Constructs features based on the attributes of the different objects in the subgraphs and aggregations of their values
– Feature scores are calculated from the class distributions after the split, using chi-squared analysis

Algorithm, part trois
– The p-value from the chi-squared test reveals which features are significantly correlated with the class (a sketch of this scoring step follows)
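A minimal sketch of this scoring step, assuming a toy data layout in which each subgraph is a dict holding the attribute values of a movie's actors; the MODE aggregation and chi-squared test come from the slides, while the data fields, helper names, and labels are hypothetical.

```python
# Sketch only: score one candidate relational feature,
# MODE(actor.gender) = female, against the class labels of the target objects.
from collections import Counter
from scipy.stats import chi2_contingency

def mode(values):
    """Aggregate a multiset of attribute values to its most common value."""
    return Counter(values).most_common(1)[0][0]

def score_feature(subgraphs, labels):
    """Return (chi2, p_value) for the binary split induced by the feature."""
    feature = [mode(g["actor_genders"]) == "female" for g in subgraphs]
    if len(set(feature)) < 2:          # constant feature: uninformative split
        return 0.0, 1.0
    # 2 x 2 contingency table: feature value (rows) x class label (columns).
    table = [[sum(1 for f, y in zip(feature, labels) if f == fv and y == yv)
              for yv in (False, True)]
             for fv in (False, True)]
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value  # a low p-value means the feature tracks the class

# Toy usage: four movies labeled, say, "grossed > $2M on opening weekend".
movies = [{"actor_genders": ["female", "female", "male"]},
          {"actor_genders": ["male", "male"]},
          {"actor_genders": ["female"]},
          {"actor_genders": ["male", "female", "male"]}]
print(score_feature(movies, [True, False, True, False]))
```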
– The feature with the highest score is included in the tree
– The algorithm does this greedily and recursively until the class distributions no longer change significantly
– A "Bonferroni adjustment" plus a Laplace correction are used to improve the probability estimates

Relational features
– Traditional features – e.g. Genre=comedy: an attribute ID, an operator, and a value
– Relational features – e.g. Movie(x), Y = {y | ActedIn(x, y)} : Max(Age(Y)) > 65
  – These identify a relation that links an object x to a set of other objects Y

Feature selection biases
– Concentrated linkage and autocorrelation combine to reduce the effective sample size and increase variance
– Degree disparity can cause spurious elements to be selected and useful elements to be missed

Randomization test
– Used on relational data sets, where conventional hypothesis testing is not as useful
  – Computationally intensive
  – Used to account for the biases above
– Pseudo-samples are generated by permuting variables; in this case, attribute values are permuted before aggregation
– The permuted vector preserves the intrinsic links/relations but removes the correlations between attributes and the class label (see the sketch at the end)

Tests
– Use IMDb to predict whether a movie takes in $2 million in box-office receipts on its opening weekend
– 1) IMDb with class labels correlated only with degree features
– 2) Use the structure of IMDb as well as its attributes
– 3) From Cora: whether a paper is on the topic "neural networks"
– 4) Genome project: whether a gene is located in the nucleus
– 5) WebKB: whether a page is a student web page

Tests (cont'd)
– To each of these tasks a set of classification techniques was applied:
  – RPT with chi-squared feature scores (CT)
  – RPT with randomization testing (RT)
  – RPT (pseudo) without randomization testing (C4.5)
  – Baseline relational Bayesian classifier (RBC)
– * marks statistically significant differences

Results
– [results table]

Results explained
– On no task except the random one did their algorithm perform any better than the other algorithms
– For the random task they show:
  – RPTs can adjust for linkage and autocorrelation
  – RTs perform better than the other three models
  – Biased models select random attributes that hinder performance

Conclusions (theirs and mine)
– It is possible to extend conventional probability estimation tree algorithms to relational data
– RPT with randomization and RPT without randomization perform about the same (with smaller trees in the former)
– Randomization adjusts for biases (but is this so important if they perform the same?)
– Non-selective models don't have these biases, but they lose interpretability

Future work
– Enrichments to the RPT
  – Better ways of calculating ranking scores
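To make the randomization test concrete, here is a minimal sketch that reuses score_feature and movies from the earlier example; the pool-shuffle-redistribute scheme is one plausible reading of "permute attribute values before aggregation", and every name in it is hypothetical.

```python
# Sketch only: randomization test via pseudo-samples. Actor attribute values
# are pooled, shuffled, and dealt back out according to each subgraph's
# original degree, so the link structure (and degree disparity) is preserved
# while any attribute/class correlation is destroyed.
# Assumes score_feature() and movies from the previous sketch are in scope.
import random

def permute_attributes(subgraphs, rng):
    pool = [v for g in subgraphs for v in g["actor_genders"]]
    rng.shuffle(pool)
    pseudo, i = [], 0
    for g in subgraphs:
        k = len(g["actor_genders"])
        pseudo.append({"actor_genders": pool[i:i + k]})
        i += k
    return pseudo

def randomized_p_value(subgraphs, labels, n_permutations=1000, seed=0):
    """Fraction of pseudo-samples whose feature score matches or beats the
    score on the real data: an empirical, bias-adjusted p-value."""
    rng = random.Random(seed)
    observed, _ = score_feature(subgraphs, labels)
    exceed = sum(
        score_feature(permute_attributes(subgraphs, rng), labels)[0] >= observed
        for _ in range(n_permutations))
    return exceed / n_permutations

print(randomized_p_value(movies, [True, False, True, False]))
```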