ICDM10
Stratified Sampling for Data
Mining on the Deep Web
Tantan Liu, Fan Wang, Gagan Agrawal
{liut, wangfa, agrawal}@cse.ohio-state.edu
Dec. 16, 2010
Outline
• Introduction
• Background Knowledge
– Association Rule Mining
– Differential Rule Mining
• Basic Formulation
• Main Technical Approach
– A Greedy Stratification Method
• Experimental Results
• Conclusion
Introduction
• Deep Web
– Query interface vs. backend database
– Input attribute vs. Output attribute
• Data mining on the deep web
– High level summary of the data
– Challenge
• Databases cannot be accessed directly
– Sampling
• Deep web querying is time consuming
– Efficient Sampling Method
Background Knowledge – Association Rule Mining
• Aim: find co-occurrence patterns for items
• Frequent itemset: an itemset whose support (the fraction of transactions containing it) is larger than a threshold
• Rule: X → Y
– X ∪ Y is a frequent itemset
– Confidence conf(X → Y) = support(X ∪ Y) / support(X) is larger than a threshold
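The two quantities above can be sketched directly in Python; the toy transactions and item names below are illustrative, not from the talk:

```python
# Toy transaction database: each transaction is a set of items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """conf(X -> Y) = support(X ∪ Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"milk", "bread"}, transactions))        # 0.5
print(confidence({"milk"}, {"bread"}, transactions))   # ~0.667
```

A rule {milk} → {bread} is reported when both its support and its confidence clear the chosen thresholds.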
Background Knowledge – Differential Rule Mining
• Aim: find differences between two deep web data sources
– E.g., the price of the same hotels on two web sites
• Identical attributes vs. differential attributes
– Same vs. different values across the sources
• Rule: X → t
– X: frequent itemset composed of identical attributes
– t: differential (target) attribute, whose values differ between the sources for records satisfying X
– D1, D2: the two data sources
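A minimal sketch of evaluating such a rule, assuming hypothetical hotel records with invented field names (`city`, `price`) that are not taken from the slides:

```python
# Two hypothetical data sources describing the same hotels.
d1 = [{"city": "NYC", "price": 180},
      {"city": "NYC", "price": 120},
      {"city": "LA",  "price": 150}]
d2 = [{"city": "NYC", "price": 210},
      {"city": "NYC", "price": 125},
      {"city": "LA",  "price": 155}]

def mean_target(records, x, target):
    """Mean of `target` over records whose identical attributes match x."""
    vals = [r[target] for r in records
            if all(r[k] == v for k, v in x.items())]
    return sum(vals) / len(vals)

x = {"city": "NYC"}  # frequent itemset over identical attributes
diff = mean_target(d2, x, "price") - mean_target(d1, x, "price")
print(diff)  # 17.5 — mean price gap for NYC hotels between the sources
```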
Basic Formulation – Problem Formulation
• Two-step sampling procedure
– A pilot sample, randomly drawn from the deep web, from which interesting rules are identified
– An additional sample to verify the identified rules
• For association rules X → Y and differential rules
– Sample more data records satisfying X
– If X contains only input attributes: easy
– If X contains output attributes: random sampling is not efficient, so how should we sample?
Basic Formulation – Problem Formulation in Detail
• Consider rules with X = {A = a}
– A single output attribute A on the left-hand side
• Association rule
– Estimate the support p(A = a, Y) or the confidence p(Y | A = a)
• Differential rule
– Estimate the mean of the target attribute t given A = a
• Goal of sampling
– High estimation accuracy
– Low sampling cost
Basic Formulation – Stratified Sampling
• Sampling separately from strata
– Heterogeneous across strata & homogeneous within each stratum
• Estimating the mean value of y
– ȳ_st = Σ_h (N_h / N) · ȳ_h
– N_h, ȳ_h: the size and sampled mean value of stratum h; N = Σ_h N_h
• Association rule mining
– y: whether an itemset is contained in a transaction
– If an itemset is contained in a transaction, y = 1; otherwise y = 0
• Differential rule mining
– y: the value of the target attribute
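The stratified estimator ȳ_st = Σ_h (N_h / N) · ȳ_h can be sketched as follows; the three strata are synthetic, chosen only so that they are homogeneous within and heterogeneous across:

```python
import random

random.seed(0)

# Hypothetical population split into three strata of different sizes,
# each with its own mean and spread.
strata = {
    "h1": [random.gauss(10, 1) for _ in range(500)],
    "h2": [random.gauss(20, 2) for _ in range(300)],
    "h3": [random.gauss(40, 5) for _ in range(200)],
}

def stratified_mean(strata, n_per_stratum):
    """ȳ_st = Σ_h (N_h / N) · ȳ_h, with ȳ_h from a within-stratum sample."""
    total = sum(len(pop) for pop in strata.values())
    est = 0.0
    for pop in strata.values():
        sample = random.sample(pop, n_per_stratum)
        est += (len(pop) / total) * (sum(sample) / len(sample))
    return est

# Close to the weighted true mean 0.5*10 + 0.3*20 + 0.2*40 = 19.
print(stratified_mean(strata, 50))
```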
Background – Neyman Allocation
• Sample allocation
– Determining the sample size for each stratum
– Fixed total sample size
• Neyman allocation
– Minimizes the variance of the stratified estimate: n_h ∝ N_h · σ_h, where σ_h is the within-stratum standard deviation
• Problem when applied to the deep web
– The probability of A = a in each stratum is not considered
– Possibly large sampling cost
• Sampling cost: the number of queries submitted to the deep web
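Neyman allocation itself is a one-liner; the stratum sizes and standard deviations below are made up for illustration:

```python
def neyman_allocation(sizes, stds, n_total):
    """n_h = n_total · N_h·σ_h / Σ_j N_j·σ_j, rounded to integers.
    Allocates more samples to large or high-variance strata."""
    weights = [N * s for N, s in zip(sizes, stds)]
    total = sum(weights)
    return [round(n_total * w / total) for w in weights]

sizes = [500, 300, 200]   # N_h: stratum sizes
stds  = [1.0, 2.0, 5.0]   # σ_h: within-stratum standard deviations
print(neyman_allocation(sizes, stds, 100))  # [24, 29, 48]
```

Rounding can make the allocated sizes sum to slightly more or less than `n_total`; a simple fix is to absorb the difference into one stratum. Note how the small but high-variance third stratum receives nearly half the budget, while the deep-web problem above remains: nothing here accounts for how hard it is to actually find A = a records in each stratum.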
Sampling Cost
• Sampling cost on the deep web
– Aim: obtain n_h data records with A = a from stratum h
– Sampling cost: n_h / p_h
• n_h: the number of data records with A = a to obtain
• p_h: the probability of finding a data record with A = a in stratum h
• Integrated cost
– Combines sampling cost and estimation variance
– Two adjustable weights
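A sketch of this cost model, with notation assumed from context rather than taken verbatim from the slides: if each query on a stratum returns an A = a record with probability p_h, collecting n_h such records costs about n_h / p_h queries in expectation.

```python
def sampling_cost(n, p):
    """Expected number of queries to obtain n records with A = a,
    when each query yields such a record with probability p."""
    return n / p

def integrated_cost(variance, cost, w_var, w_cost):
    """Weighted combination of estimation variance and sampling cost."""
    return w_var * variance + w_cost * cost

queries = sampling_cost(n=30, p=0.25)   # 120.0 expected queries
print(integrated_cost(variance=0.8, cost=queries, w_var=0.7, w_cost=0.3))
```

Raising `w_var` favors low-variance (Neyman-like) allocations; raising `w_cost` favors strata where A = a records are cheap to find.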
Main Technical Approach – Stratification Process
• Stratification by a tree over the query space
– Constructed in a top-down manner
– Best split to create child nodes: the input attribute with the smallest integrated cost
– The splitting process stops when the integrated cost at each leaf node is small
– Leaf nodes: the final strata for sampling
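The steps above can be sketched as a short recursive procedure. This is only an illustration: the split score below is a stand-in for the paper's integrated cost (here just the variance of a target value `y`), and the attribute names are invented.

```python
def leaf_score(records):
    """Stand-in split score: variance of the target value y (lower is
    better). The paper's actual integrated cost also weighs sampling cost."""
    vals = [r["y"] for r in records]
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def greedy_stratify(records, input_attrs, max_leaf_cost=0.1):
    """Top-down: split on the input attribute whose partition has the
    smallest total score; stop when a node's score is small enough.
    The leaf nodes are returned as the final strata."""
    if not input_attrs or leaf_score(records) <= max_leaf_cost:
        return [records]                          # leaf node = one stratum
    best = None
    for attr in input_attrs:                      # try every candidate split
        parts = {}
        for r in records:
            parts.setdefault(r[attr], []).append(r)
        if len(parts) < 2:
            continue                              # attribute does not split
        cost = sum(leaf_score(p) for p in parts.values())
        if best is None or cost < best[0]:
            best = (cost, attr, parts)
    if best is None:                              # nothing splits this node
        return [records]
    _, attr, parts = best
    rest = [a for a in input_attrs if a != attr]
    strata = []
    for part in parts.values():
        strata.extend(greedy_stratify(part, rest, max_leaf_cost))
    return strata

# Hypothetical records: `state`/`edu` as input attributes, `y` as target.
records = [
    {"state": "OH", "edu": "HS", "y": 1.0},
    {"state": "OH", "edu": "BS", "y": 5.0},
    {"state": "CA", "edu": "HS", "y": 1.2},
    {"state": "CA", "edu": "BS", "y": 5.1},
]
print(len(greedy_stratify(records, ["state", "edu"])))  # 2
```

On this data the split on `edu` wins, since it yields two nearly homogeneous groups, and the recursion stops with two strata.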
Experimental Results
• Data set: US Census
– The income of US households from the 2008 US Census
– 40,000 data records
– 7 categorical and 2 numerical attributes
• Two metrics
– Variance of estimation
– Sampling cost
Experimental Results – Settings
• Five sampling procedures
– Four stratified procedures with different weights on estimation variance vs. sampling cost:
• Full_Var
• Var7
• Var5
• Var3
– Rand: simple random sampling
Experimental Results – Variance of Estimation
• Association rule mining
• The variance of estimation increases as the weight on variance decreases
• Random sampling has a higher variance of estimation
Experimental Results – Sampling Cost
• Association rule mining
• The sampling cost decreases as the weight on variance decreases
• Random sampling has a higher sampling cost
Conclusion
• Stratified sampling for data mining on the deep web
• Considers both estimation accuracy and sampling cost
• A tree model for the relation between input attributes and output attributes
• A greedy stratification method that maximally reduces an integrated cost metric
• Our experiments show
– Higher sampling accuracy and lower sampling cost compared with simple random sampling
– Sampling cost can be reduced further by trading off a fraction of estimation error
Questions & Comments?