Data Mining over Hidden Data Sources
Tantan Liu
Advisor: Gagan Agrawal
Dept. of Computer Science & Engineering
Ohio State University
July 23, 2012
Outline
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD, 2012)
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source (submitted to ICDM, 2012)
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012)
  – An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
  – Differential Rule Mining (ICDM Workshops, 2010)
  – Stratified Sampling for Deep Web Mining (ICDM, 2010)
• Conclusion and Future Work
Deep Web
• Data sources on the Internet that are hidden from search engines
  – Online query interface vs. database
  – The database is accessible only through the online interface
  – Input attributes vs. output attributes
• An example of the deep web (figure)
Data Mining over the Deep Web
• Goal: a high-level summary of the data
  – Scenario 1: a user wants to relocate to a county
    • What do the residences in the county look like?
      – Age, price, square footage
      – The county property assessor's web site only allows simple queries
  – Scenario 2: a user is thinking about his or her career path
    • High-level knowledge about the job posts in the market
      – Job type, salary, education, experience, skills, ...
      – Job web sites, e.g., LinkedIn and MSN Careers, provide millions of job posts
Challenges
• The database cannot be accessed directly
  – Sampling methods are needed for deep web mining
• Obtaining data through the query interface is time consuming
  – Efficient sampling methods
  – High accuracy with low sampling cost
Contributions
• Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD, 2012)
• Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source (submitted to ICDM, 2012)
• Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012)
• An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
• Differential Rule Mining (ICDM Workshops, 2010)
• Stratified Sampling for Deep Web Mining (ICDM, 2010)
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
An Example of the Deep Web for Real Estate (figure)
K-means Clustering over a Deep Web Data Source
• Goal: estimate k centers for the underlying clusters, so that the k centers estimated from the sample are close to the k true centers of the whole population.
Overview of Method
• (Figure) Pipeline: stratification partitions the population into subpopulations 1..n; sample allocation draws sample 1..n from them; stratification-based k-means clustering on the combined sample produces the clusters.
Stratification on the Deep Web
• Partition the entire population into strata
  – Stratify on the query space of the input attributes
  – Goal: homogeneous query subspaces
  – Radius of a query subspace: measures how spread out the records in the subspace are
  – Rule: choose the input attribute that most decreases the radius of a node
  – For an input attribute, compute the decrease of radius it yields when used to split the node (a sketch follows)
• (Figure) Stratification tree: the root (NULL) splits on Year of construction (Y=1980, Y=1990, Y=2000, ..., Y=2008), and nodes split further on Bedroom (B=3, B=4, ...)
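The splitting rule can be sketched as follows. This is a minimal illustration rather than the dissertation's exact formulas: it assumes the radius of a query subspace is measured on the output-attribute vectors of pilot records, and that the decrease of radius for an input attribute is the parent radius minus the weighted average radius of the child subspaces. Function and variable names (`radius`, `best_split`, `records`, `outputs`) are hypothetical.

```python
import numpy as np

def radius(points):
    """Radius of a query subspace: the maximum distance from the subspace's
    mean output-attribute vector to any (pilot) record in the subspace."""
    if len(points) == 0:
        return 0.0
    center = points.mean(axis=0)
    return float(np.max(np.linalg.norm(points - center, axis=1)))

def best_split(records, outputs, input_attrs):
    """Pick the input attribute whose split most decreases the radius of a
    node: decrease = radius(node) - weighted average radius of the children."""
    outputs = np.asarray(outputs, dtype=float)
    base = radius(outputs)
    best_attr, best_gain = None, 0.0
    for attr in input_attrs:
        values = {r[attr] for r in records}
        child_radius = 0.0
        for v in values:
            idx = [i for i, r in enumerate(records) if r[attr] == v]
            child_radius += len(idx) / len(records) * radius(outputs[idx])
        gain = base - child_radius
        if gain > best_gain:
            best_attr, best_gain = attr, gain
    return best_attr, best_gain

# toy usage: two input attributes (year, bedrooms), two output attributes
recs = [{"year": 1980, "bed": 3}, {"year": 1980, "bed": 4},
        {"year": 2000, "bed": 3}, {"year": 2000, "bed": 4}]
outs = [[100, 900], [110, 950], [300, 2000], [320, 2100]]  # price (k$), sqft
print(best_split(recs, outs, ["year", "bed"]))  # splitting on "year" wins
```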
Partition on the Space of Output Attributes
• (Figure) Scatter plot over the output attributes Price vs. Square Feet, with points grouped by year of construction (1980, 1990, 2000, 2008)
Sampling Allocation Methods
• We create c*k partitions (subspaces) of the output-attribute space
  – A pilot sample is drawn first
  – c*k-means clustering on the pilot sample generates the c*k partitions
• Representative sampling
  – Good estimation of the statistics of the c*k subspaces
    • Centers
    • Proportions
Representative Sampling-Centers
• Center of a subspace
  – The mean vector of all data points belonging to the subspace
• Let the sample be S = {DR_1, DR_2, ..., DR_n}
  – For the i-th subspace, the center on output attribute O_m is estimated as
    sc_{i,m} = (1 / m_i) * Σ_{j=1..m_i} DR_{i,j}(O_m)
    where m_i is the number of sampled records in the i-th subspace and DR_{i,j}(O_m) is the O_m value of the j-th such record (see the sketch below)
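A small sketch of the center estimate reconstructed above: each subspace's center is the mean of the output-attribute vectors of the sampled records assigned to it. Function and variable names are illustrative.

```python
import numpy as np

def subspace_centers(sample_outputs, subspace_ids, num_subspaces):
    """sc_{i,m}: for each of the c*k subspaces, the mean vector (over the
    output attributes) of the sampled records DR_{i,j} assigned to it."""
    sample_outputs = np.asarray(sample_outputs, dtype=float)
    subspace_ids = np.asarray(subspace_ids)
    centers = np.zeros((num_subspaces, sample_outputs.shape[1]))
    for i in range(num_subspaces):
        members = sample_outputs[subspace_ids == i]
        if len(members) > 0:
            centers[i] = members.mean(axis=0)   # (1/m_i) * sum_j DR_{i,j}(O_m)
    return centers

# toy usage: 5 sampled records with 2 output attributes, assigned to 2 subspaces
print(subspace_centers([[1, 2], [3, 4], [10, 10], [12, 14], [2, 3]],
                       [0, 0, 1, 1, 0], num_subspaces=2))
```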
Distance Function
• Defined between the c*k estimated centers and the true centers
  – Using Euclidean distance
• Integrated variance
  – Expressed in terms of subspaces, strata, and output attributes
  – Computed from the pilot sample
  – n_j: the number of samples drawn from the j-th stratum
Optimized Sample Allocation
• Goal: choose the n_j that minimize the integrated variance under a fixed total sampling budget
• Solved in closed form using Lagrange multipliers
• The result samples more heavily from strata with large variance
  – Their data are spread over a wide area, so more samples are needed to represent the population (see the sketch below)
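A hedged sketch of what a Lagrange-multiplier solution typically looks like in stratified sampling (Neyman-style allocation: samples proportional to stratum size times standard deviation). The dissertation's objective is the integrated variance over subspaces and output attributes, so its exact weights may differ; this only illustrates why high-variance strata receive more samples.

```python
import numpy as np

def optimized_allocation(stratum_sizes, stratum_stddevs, budget):
    """Allocate a fixed sampling budget across strata so that strata with
    larger (size-weighted) variance receive more samples -- the closed form
    obtained from a Lagrange-multiplier argument (Neyman-style allocation)."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    stds = np.asarray(stratum_stddevs, dtype=float)
    weights = sizes * stds
    if weights.sum() == 0:
        weights = np.ones_like(weights)
    alloc = budget * weights / weights.sum()
    return np.round(alloc).astype(int)

# e.g. three strata with very different spreads: the first gets the most samples
print(optimized_allocation([1000, 1000, 2000], [5.0, 1.0, 2.0], budget=200))
```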
Active Learning Based Sampling Method
• In machine learning
  – Passive learning: data are chosen randomly
  – Active learning: particular data are selected to help build a better model, because obtaining data is costly and/or time consuming
• For each stratum i, we estimate the decrease of the distance function that sampling it would produce
• Iterative sampling process (see the sketch below)
  – At each iteration, the stratum with the largest estimated decrease of the distance function is selected for sampling
  – The integrated variance is then updated
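A sketch of the iterative process under illustrative assumptions: `score` stands in for the estimated decrease of the distance function (here a simple variance-per-sample proxy), and `draw` stands for issuing queries to the stratum's query subspace. The dissertation's exact scoring expression differs.

```python
import numpy as np

def active_allocation(strata, score, draw, budget, batch=10):
    """Iterative, active-learning style sampling: at each iteration the
    stratum with the largest estimated decrease of the distance function
    (approximated here by `score`) is selected, and a batch is drawn from it."""
    samples = {s: [] for s in strata}
    spent = 0
    while spent < budget:
        gains = {s: score(s, samples[s], batch) for s in strata}
        best = max(gains, key=gains.get)         # largest estimated decrease
        samples[best].extend(draw(best, batch))  # query the deep web source
        spent += batch
    return samples

# toy usage with synthetic strata of different spreads
rng = np.random.default_rng(0)
sigma = {0: 5.0, 1: 1.0, 2: 2.0}
draw = lambda s, k: list(rng.normal(0.0, sigma[s], size=k))
# proxy score: high variance and few samples so far => large expected gain
score = lambda s, drawn, k: (np.var(drawn) if len(drawn) > 1 else 1e9) / (len(drawn) + k)
alloc = active_allocation([0, 1, 2], score, draw, budget=200)
print({s: len(v) for s, v in alloc.items()})   # the high-variance stratum gets more
```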
Representative Sampling-Proportion
• Proportion of a subspace
  – The fraction of data records belonging to the subspace
  – Depends on the proportion of the subspace within each stratum (the j-th stratum contributes according to its population share)
• Risk function
  – The distance between the estimated fractions and their true values
• Iterative sampling process
  – At each iteration, the stratum with the largest decrease of the risk function is chosen for sampling
  – The parameters are then updated (a sketch of the proportion estimate follows)
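A minimal sketch of the stratified proportion estimate: the proportion of a subspace is the sum, over strata, of the stratum's population share times the fraction of that stratum's sample falling in the subspace. Variable names are illustrative; the risk function and its update are not shown.

```python
import numpy as np

def subspace_proportions(subspace_ids, stratum_ids, stratum_pop_shares, num_subspaces):
    """Stratified estimate of the proportion of each subspace:
    sum over strata j of (population share of stratum j) *
    (fraction of stratum j's sample that falls in the subspace)."""
    props = np.zeros(num_subspaces)
    subspace_ids = np.asarray(subspace_ids)
    stratum_ids = np.asarray(stratum_ids)
    for j, share in enumerate(stratum_pop_shares):
        in_j = subspace_ids[stratum_ids == j]
        if len(in_j) == 0:
            continue
        for i in range(num_subspaces):
            props[i] += share * np.mean(in_j == i)
    return props

# toy usage: two strata of equal population share, two subspaces
print(subspace_proportions(subspace_ids=[0, 0, 1, 1, 1, 0],
                           stratum_ids=[0, 0, 0, 1, 1, 1],
                           stratum_pop_shares=[0.5, 0.5],
                           num_subspaces=2))
```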
Stratified K-means Clustering
• Weight for data records in the i-th stratum
  – Proportional to the stratum's population size divided by its sample size, so the weighted sample represents the whole population
• Similar to standard k-means clustering
  – The center of the i-th cluster is the weighted mean of the sampled records assigned to it (see the sketch below)
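A minimal sketch of stratified (weighted) k-means under the weighting described above: each sampled record carries a weight of roughly N_i / n_i for its stratum, and cluster centers are weighted means. Function names and the initialization scheme are illustrative.

```python
import numpy as np

def stratified_kmeans(points, weights, k, iters=20, seed=0):
    """K-means where each sampled point carries a stratum weight
    w_i ~ N_i / n_i, so centers are weighted means of the assigned points."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to the nearest current center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # weighted mean update per cluster
        for j in range(k):
            m = labels == j
            if weights[m].sum() > 0:
                centers[j] = np.average(points[m], axis=0, weights=weights[m])
    return centers, labels

# toy usage: two clusters, with heavier weights on the first stratum's records
pts = np.array([[0, 0], [1, 1], [0, 1], [10, 10], [11, 10], [10, 11]], float)
w = np.array([2.0, 2.0, 2.0, 1.0, 1.0, 1.0])   # stratum weights ~ N_i / n_i
centers, labels = stratified_kmeans(pts, w, k=2)
print(centers)
```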
Contribution
• Sampling methods for k-means clustering over a deep web data source
• Representative sampling
  – Partition on the space of output attributes
• Centers
  – Optimized sampling method
  – Active learning based sampling method
• Proportions
  – Active learning based sampling method
Experiment Results
• Data sets:
  – Noisy synthetic data set:
    • 4,000 data records with 4 input attributes and 2 output attributes
    • 400 noise data points are added
  – Yahoo! data set:
    • Data on used cars
    • 8,000 data records
• Evaluation metric: average distance (AvgDist) between the estimated centers and the true centers
Representative Sampling-Noisy Data Set
• Benefit of stratification
  – Compared with rand, the decreases in AvgDist are 35.5%, 37.4%, 38.6%, and 26.9%
• Benefit of representative sampling
  – Compared with rand_st, the decreases in AvgDist are 11.8%, 14.4%, and 16.1%
• Center-based sampling methods have better performance
• The optimized sampling method has better performance in the long run
Representative Sampling-Yahoo! Data Set
• Benefit of stratification
  – Compared with rand, the decreases in AvgDist are 7.2%, 13.2%, 15.0%, and 16.8%
• Benefit of representative sampling
  – Compared with rand_st, the decreases in AvgDist are 6.6%, 8.5%, and 10.5%
• Center-based sampling methods have better performance
• The optimized sampling method has better performance in the long run
Scalability
• The execution time of each method is linear in the size of the data set
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
Outlier Detection
• Outlier
  – An observation that deviates greatly from the other observations
• DB(p; D) outlier
  – At least a fraction p of the objects lie at a distance greater than D from it (see the sketch below)
• Challenges for outlier detection over a deep web data source
  – Recall: find as large a fraction of the outliers as possible
  – Precision: accurately identify outliers from the sampled data
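A direct, brute-force check of the DB(p; D) definition on an in-memory sample, just to make the definition concrete. The dissertation works with fractions estimated from stratified samples rather than full pairwise distances over the population.

```python
import numpy as np

def db_outliers(points, p, D):
    """DB(p; D) outliers: objects for which at least a fraction p of the
    other objects lie at distance greater than D."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # for each object, the fraction of *other* objects farther than D
    far = (dist > D).sum(axis=1) / (n - 1)
    return np.where(far >= p)[0]

# toy example: one point far away from a tight cluster
pts = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)), [[20, 20]]])
print(db_outliers(pts, p=0.95, D=5.0))   # -> index 50, the isolated point
```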
Two-phase Stratified Sampling Method
• Neighborhood sampling
  – Aims at improving recall
  – Query subspaces with a high probability of containing outliers are explored
• Uncertain driven sampling
  – Aims at improving precision
Outliers in Stratified Sampling
• Stratified sampling has good performance
• Stratification
  – Similar to the stratification used for k-means clustering over a deep web data source
  – The number of strata is controlled
• Outlier detection
  – For a data object, let f denote the fraction of data objects at distance greater than D
  – f is estimated from the stratified sample using per-stratum counts
Neighbor Nodes
• Similar data objects tend to come from the same query subspace or from neighboring query subspaces
• Neighbor nodes of a node
  – Its left and right cousins that share the same parent node
Neighborhood Sampling
• (Figure) Example stratification tree: the Root splits on Year (Y=1980, Y=1990, Y=2000, Y=2010), and nodes split further on Bedroom (B=1 to B=4) and Bathroom (Ba=1, Ba=2)
Post-Stratification
• The original strata are further stratified after the additional sampling
• New stratum: leaf nodes with the same sample rate under the same original stratum
• Each sampled data record has an estimated value and a variance for
  – the fraction of data objects at distance greater than D
  – the probability of being an outlier
Uncertain Driven Sampling
• For a sampled data record, based on its estimated outlier probability and two thresholds:
  – Outlier: the estimate exceeds the upper threshold
  – Normal data object: the estimate falls below the lower threshold
  – Otherwise: uncertain data object (see the sketch below)
• Task: obtain an additional sample for identifying the uncertain data objects
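A hedged sketch of the three-way labeling: assuming each record has an estimated fraction and a variance from post-stratification, two thresholds separate outliers, normal objects, and uncertain objects that receive further sampling. The probability model (a normal approximation) and the threshold values are illustrative choices, not the dissertation's exact rule.

```python
from math import erf, sqrt

def classify(f_hat, var, p, theta1=0.9, theta2=0.1):
    """Label a sampled record from its estimated fraction f_hat (of objects
    farther than D) and its variance: 'outlier' if the rough probability
    that f >= p exceeds theta1, 'normal' if it falls below theta2, and
    'uncertain' otherwise. Thresholds and the approximation are assumptions."""
    sd = sqrt(var) if var > 0 else 1e-9
    prob_outlier = 0.5 * (1.0 - erf((p - f_hat) / (sd * sqrt(2))))  # P(f >= p)
    if prob_outlier > theta1:
        return "outlier"
    if prob_outlier < theta2:
        return "normal"
    return "uncertain"

print(classify(f_hat=0.97, var=1e-4, p=0.95))   # -> outlier
print(classify(f_hat=0.90, var=4e-3, p=0.95))   # -> uncertain (needs more samples)
```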
Sample Allocation
• For the uncertain data objects, additional samples are allocated to obtain a better estimate of f
• Using a Lagrange multiplier, the allocation minimizes the estimation variance under a fixed sampling budget
Outlier in Stratified Sampling
• A sampled data record is labeled an outlier if its estimated fraction exceeds the threshold, and a normal data object otherwise
• The distance between each pair of sampled data objects is computed
• Whether a record is an outlier or a normal data object is decided from the fraction of sampled objects that fall within its D-neighborhood
Efficient Outlier Detection
• It can be shown that a sufficient condition exists for labeling a sampled record
  – If the condition holds, the record is classified directly as a normal data object or an outlier
  – Else, the record is classified as a normal data object or an outlier by the full test
Experiment Results
• Data set:
  – Yahoo! data set:
    • Data on used cars
    • 8,000 data records
• Evaluation
  – Precision: the fraction of records identified as outliers that are true outliers
  – Recall: the fraction of true outliers that are included in the sample
Recall
• Benefit of stratification
  – Increase over SRS: 108.2%, 116.7%, and 74.7%
• Benefit of neighborhood sampling
  – Increase over SSTS: 19.1% and 28.1%
• Uncertain sampling decreases recall by 3.7%
Precision
• All four methods have good performance
  – The average precision is over 0.9
• The stratified sampling methods have lower precision
  – Compared with SRS, the decreases are 1.7%, 4.3%, and 0.68%
• Benefit of uncertain sampling
  – Compared with NS, the increase is 2.7%
Trade-off between Precision and Recall
• There is a trade-off between precision and recall
• Benefit of stratification
  – TPS, NS, and SSTS improve recall for precision in the range 0.75-0.975
• Benefit of neighborhood sampling
  – TPS and NS improve recall for precision in the range 0.75-0.975
• Benefit of uncertain sampling
  – TPS improves recall for precision in the range 0.92-1.0
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
Stratification Based Hierarchical Clustering on a Deep Web Data Source
• Hierarchical clustering based on stratified sampling
  – Stratification
  – Sample allocation
• Representative sampling
  – The mean values of the output attributes are close to their true values
• Uncertain sampling
  – Sample more heavily on the boundaries between clusters
An Active Learning Based Frequent Itemset Mining
• Frequent itemset mining
  – Estimating the support of itemsets
  – The number of itemsets could be huge, so we consider 1-itemsets (a sketch follows)
• Bayesian network
  – Models the relationship between input attributes and output attributes
  – A risk function is defined on the estimated parameters
• Active learning based sampling
  – Data records are selected step by step
  – Sample the query subspaces with the greatest decrease of the risk function
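As a concrete (and much simplified) illustration of estimating 1-itemset supports from a weighted sample of a deep web source. The Bayesian-network model and the risk-driven selection of query subspaces are not shown, and the names here are hypothetical.

```python
from collections import Counter

def frequent_one_itemsets(sampled_records, weights, min_support):
    """Estimate the support of 1-itemsets (attribute, value) from a weighted
    sample and keep those whose estimated support reaches min_support."""
    counts = Counter()
    total = sum(weights)
    for record, w in zip(sampled_records, weights):
        for item in record.items():      # each (attribute, value) pair
            counts[item] += w
    return {item: c / total for item, c in counts.items() if c / total >= min_support}

# toy usage: three sampled job posts with stratum weights
sample = [{"type": "full-time", "edu": "BS"},
          {"type": "full-time", "edu": "MS"},
          {"type": "part-time", "edu": "BS"}]
print(frequent_one_itemsets(sample, weights=[1.0, 1.0, 2.0], min_support=0.5))
```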
Differential Rule Mining
• The same data object may have different values in different data sources
  – e.g., the prices of commodities
• Goal: analyze the differences between data sources
• Differential rule
  – Left-hand side: a frequent itemset
  – Right-hand side: the behavior of the differential attribute
• Differential rule mining
  – Apriori algorithm
  – Statistical hypothesis testing (an illustrative test follows)
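An illustrative statistical test for the right-hand side of a differential rule: for the records matching a frequent itemset, test whether the differential attribute (here, a price gap between two sources) differs significantly from zero. The one-sample t-test and the data are example choices, not necessarily the dissertation's exact test.

```python
import numpy as np
from scipy import stats

def differential_rule_test(differences, alpha=0.05):
    """For records matching a frequent itemset (the rule's left-hand side),
    test whether the differential attribute (e.g., the price difference
    between two sources) deviates significantly from zero."""
    differences = np.asarray(differences, dtype=float)
    t_stat, p_value = stats.ttest_1samp(differences, popmean=0.0)
    return {"mean_diff": float(differences.mean()),
            "p_value": float(p_value),
            "significant": p_value < alpha}

# e.g. price gaps (source A minus source B) for cars matching a frequent itemset
print(differential_rule_test([310, 280, 450, 390, 260, 500, 330, 410]))
```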
Stratified Sampling for Association Rule Mining and Differential Rule Mining
• Data mining tasks
  – Association rule mining and differential rule mining
• Stratified sampling
• Stratification
  – Combining estimation variance and sampling cost
  – A tree is recursively built on the query space
• Sample allocation
  – An optimized method that minimizes an integrated cost over variance and sampling cost
Conclusion
• Data mining on the deep web is challenging
• We proposed methods for data mining on the deep web
  – A stratified k-means clustering method
  – A two-phase sampling based outlier detection method
  – A stratified hierarchical clustering method
  – An active learning based frequent itemset mining method
  – A stratified sampling method for data mining on the deep web
  – Differential rule mining
• The experimental results show the efficiency of our methods
Future Work
• Outlier detection over a deep web data source
  – Consider the problem of statistical-distribution based outlier detection
• Mining multiple deep web data sources
  – Instance-based schema matching
    • Efficiently sampling instances from the deep web to facilitate schema matching
  – Mining the data coverage of multiple deep web data sources
    • Efficient sampling methods for estimating the data coverage of multiple data sources
Questions?