ICDM10
Stratified Sampling for Data
Mining on the Deep Web
Tantan Liu, Fan Wang, Gagan Agrawal
{liut, wangfa, agrawal}@cse.ohio-state.edu
Dec. 16, 2010
Outline
• Introduction
• Background Knowledge
– Association Rule Mining
– Differential Rule Mining
• Basic Formulation
• Main Technical Approach
– A Greedy Stratification Method
• Experimental Results
• Conclusion
Introduction
• Deep Web
– Query interface vs. backend database
– Input attribute vs. Output attribute
• Data mining on the deep web
– High level summary of the data
– Challenge
• Databases cannot be accessed directly
– Sampling
• Deep web querying is time consuming
– Efficient Sampling Method
Background Knowledge – Association Rule Mining
• Aim: find co-occurrence patterns for items
• Frequent itemset: an itemset whose support (the fraction of transactions containing it) is larger than a threshold
• Rule: X → Y
– X ∪ Y is a frequent itemset
– Confidence conf(X → Y) = support(X ∪ Y) / support(X) is larger than a threshold
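The two quantities above can be sketched directly in Python; the toy transactions and item names below are illustrative, not from the talk:

```python
# Toy transaction database: each transaction is a set of items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """conf(X -> Y) = support(X ∪ Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"milk", "bread"}, transactions))        # 0.5
print(confidence({"milk"}, {"bread"}, transactions))   # ~0.667
```

A rule {milk} → {bread} is reported when both its support and its confidence clear the chosen thresholds.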
Background Knowledge – Differential Rule Mining
• Aim: find differences between two deep web data sources
– E.g., the price of the same hotels on two web sites
• Identical attributes vs. differential attributes
– Same vs. different values across the sources
• Rule: X → t
– X: frequent itemset composed of identical attributes
– t: differential (target) attribute, whose values differ between the sources for records satisfying X
– D1, D2: the two data sources
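A minimal sketch of evaluating such a rule, assuming hypothetical hotel records with invented field names (`city`, `price`) that are not taken from the slides:

```python
# Two hypothetical data sources describing the same hotels.
d1 = [{"city": "NYC", "price": 180},
      {"city": "NYC", "price": 120},
      {"city": "LA",  "price": 150}]
d2 = [{"city": "NYC", "price": 210},
      {"city": "NYC", "price": 125},
      {"city": "LA",  "price": 155}]

def mean_target(records, x, target):
    """Mean of `target` over records whose identical attributes match x."""
    vals = [r[target] for r in records
            if all(r[k] == v for k, v in x.items())]
    return sum(vals) / len(vals)

x = {"city": "NYC"}  # frequent itemset over identical attributes
diff = mean_target(d2, x, "price") - mean_target(d1, x, "price")
print(diff)  # 17.5 — mean price gap for NYC hotels between the sources
```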
Basic Formulation – Problem Formulation
• Two-step sampling procedure
– A pilot sample, randomly drawn from the deep web, from which interesting rules are identified
– An additional sample to verify the identified rules
• For association rules X → Y and differential rules
– Sample more data records satisfying X
– If X contains only input attributes: easy
– If X contains output attributes: random sampling is not efficient, so how should we sample?
Basic Formulation – Problem Formulation in Detail
• Consider rules with X = {A = a}
– A single output attribute A on the left-hand side
• Association rule
– Estimate the support p(A = a, Y) or the confidence p(Y | A = a)
• Differential rule
– Estimate the mean of the target attribute t given A = a
• Goal of sampling
– High estimation accuracy
– Low sampling cost
Basic Formulation – Stratified Sampling
• Sampling separately from strata
– Heterogeneous across strata & homogeneous within each stratum
• Estimating the mean value of y
– ȳ_st = Σ_h (N_h / N) · ȳ_h
– N_h, ȳ_h: the size and sampled mean value of stratum h; N = Σ_h N_h
• Association rule mining
– y: whether an itemset is contained in a transaction
– If an itemset is contained in a transaction, y = 1; otherwise y = 0
• Differential rule mining
– y: the value of the target attribute
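The stratified estimator ȳ_st = Σ_h (N_h / N) · ȳ_h can be sketched as follows; the three strata are synthetic, chosen only so that they are homogeneous within and heterogeneous across:

```python
import random

random.seed(0)

# Hypothetical population split into three strata of different sizes,
# each with its own mean and spread.
strata = {
    "h1": [random.gauss(10, 1) for _ in range(500)],
    "h2": [random.gauss(20, 2) for _ in range(300)],
    "h3": [random.gauss(40, 5) for _ in range(200)],
}

def stratified_mean(strata, n_per_stratum):
    """ȳ_st = Σ_h (N_h / N) · ȳ_h, with ȳ_h from a within-stratum sample."""
    total = sum(len(pop) for pop in strata.values())
    est = 0.0
    for pop in strata.values():
        sample = random.sample(pop, n_per_stratum)
        est += (len(pop) / total) * (sum(sample) / len(sample))
    return est

# Close to the weighted true mean 0.5*10 + 0.3*20 + 0.2*40 = 19.
print(stratified_mean(strata, 50))
```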
Background – Neyman Allocation
• Sample allocation
– Determining the sample size for each stratum
– Fixed total sample size
• Neyman allocation
– Minimizes the variance of the stratified estimate: n_h ∝ N_h · σ_h, where σ_h is the within-stratum standard deviation
• Problem when applied to the deep web
– The probability of A = a in each stratum is not considered
– Possibly large sampling cost
• Sampling cost: the number of queries submitted to the deep web
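Neyman allocation itself is a one-liner; the stratum sizes and standard deviations below are made up for illustration:

```python
def neyman_allocation(sizes, stds, n_total):
    """n_h = n_total · N_h·σ_h / Σ_j N_j·σ_j, rounded to integers.
    Allocates more samples to large or high-variance strata."""
    weights = [N * s for N, s in zip(sizes, stds)]
    total = sum(weights)
    return [round(n_total * w / total) for w in weights]

sizes = [500, 300, 200]   # N_h: stratum sizes
stds  = [1.0, 2.0, 5.0]   # σ_h: within-stratum standard deviations
print(neyman_allocation(sizes, stds, 100))  # [24, 29, 48]
```

Rounding can make the allocated sizes sum to slightly more or less than `n_total`; a simple fix is to absorb the difference into one stratum. Note how the small but high-variance third stratum receives nearly half the budget, while the deep-web problem above remains: nothing here accounts for how hard it is to actually find A = a records in each stratum.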
Sampling Cost
• Sampling cost on the deep web
– Aim: obtain n_h data records with A = a from stratum h
– Sampling cost: n_h / p_h
• n_h: the number of data records with A = a to obtain
• p_h: the probability of finding a data record with A = a in stratum h
• Integrated cost
– Combines sampling cost and estimation variance
– Two adjustable weights
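A sketch of this cost model, with notation assumed from context rather than taken verbatim from the slides: if each query on a stratum returns an A = a record with probability p_h, collecting n_h such records costs about n_h / p_h queries in expectation.

```python
def sampling_cost(n, p):
    """Expected number of queries to obtain n records with A = a,
    when each query yields such a record with probability p."""
    return n / p

def integrated_cost(variance, cost, w_var, w_cost):
    """Weighted combination of estimation variance and sampling cost."""
    return w_var * variance + w_cost * cost

queries = sampling_cost(n=30, p=0.25)   # 120.0 expected queries
print(integrated_cost(variance=0.8, cost=queries, w_var=0.7, w_cost=0.3))
```

Raising `w_var` favors low-variance (Neyman-like) allocations; raising `w_cost` favors strata where A = a records are cheap to find.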
Main Technical Approach – Stratification Process
• Stratification by a tree over the query space
– Constructed in a top-down manner
– Best split to create child nodes: the input attribute with the smallest integrated cost
– The splitting process stops when the integrated cost at each leaf node is small
– Leaf nodes: the final strata for sampling
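The steps above can be sketched as a short recursive procedure. This is only an illustration: the split score below is a stand-in for the paper's integrated cost (here just the variance of a target value `y`), and the attribute names are invented.

```python
def leaf_score(records):
    """Stand-in split score: variance of the target value y (lower is
    better). The paper's actual integrated cost also weighs sampling cost."""
    vals = [r["y"] for r in records]
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def greedy_stratify(records, input_attrs, max_leaf_cost=0.1):
    """Top-down: split on the input attribute whose partition has the
    smallest total score; stop when a node's score is small enough.
    The leaf nodes are returned as the final strata."""
    if not input_attrs or leaf_score(records) <= max_leaf_cost:
        return [records]                          # leaf node = one stratum
    best = None
    for attr in input_attrs:                      # try every candidate split
        parts = {}
        for r in records:
            parts.setdefault(r[attr], []).append(r)
        if len(parts) < 2:
            continue                              # attribute does not split
        cost = sum(leaf_score(p) for p in parts.values())
        if best is None or cost < best[0]:
            best = (cost, attr, parts)
    if best is None:                              # nothing splits this node
        return [records]
    _, attr, parts = best
    rest = [a for a in input_attrs if a != attr]
    strata = []
    for part in parts.values():
        strata.extend(greedy_stratify(part, rest, max_leaf_cost))
    return strata

# Hypothetical records: `state`/`edu` as input attributes, `y` as target.
records = [
    {"state": "OH", "edu": "HS", "y": 1.0},
    {"state": "OH", "edu": "BS", "y": 5.0},
    {"state": "CA", "edu": "HS", "y": 1.2},
    {"state": "CA", "edu": "BS", "y": 5.1},
]
print(len(greedy_stratify(records, ["state", "edu"])))  # 2
```

On this data the split on `edu` wins, since it yields two nearly homogeneous groups, and the recursion stops with two strata.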
Experimental Results
• Data set: US Census
– The income of US households from the 2008 US Census
– 40,000 data records
– 7 categorical and 2 numerical attributes
• Two metrics
– Variance of estimation
– Sampling cost
Experimental Results – Settings
• Five sampling procedures
– Four stratified procedures with different weights on estimation variance vs. sampling cost:
• Full_Var
• Var7
• Var5
• Var3
– Rand: simple random sampling
Experimental Results – Variance of Estimation
• Association rule mining
• The variance of estimation increases as the weight on variance decreases
• Random sampling has a higher variance of estimation
Experimental Results – Sampling Cost
• Association rule mining
• The sampling cost decreases as the weight on variance decreases
• Random sampling has a higher sampling cost
Conclusion
• Stratified sampling for data mining on the deep web
• Considers both estimation accuracy and sampling cost
• A tree model for the relation between input attributes and output attributes
• A greedy stratification method that maximally reduces an integrated cost metric
• Our experiments show
– Higher sampling accuracy and lower sampling cost compared with simple random sampling
– Sampling cost can be reduced further by trading off a fraction of estimation error
Questions & Comments?