Data Mining over Hidden Data Sources
Tantan Liu
Advisor: Gagan Agrawal
Dept. of Computer Science & Engineering
Ohio State University
July 23, 2012
Outline
• Introduction
  – Deep Web
  – Data Mining on the deep web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD, 2012)
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source (submitted to ICDM, 2012)
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012)
  – An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
  – Differential Rule Mining (ICDM Workshops, 2010)
  – Stratified Sampling for Deep Web Mining (ICDM, 2010)
• Conclusion and Future work
Deep Web
• Data sources hidden behind online query interfaces
  – Online query interface vs. database
  – The database is accessible only through the online interface
  – Input attributes vs. output attributes
• An example of the Deep Web
Data Mining over the Deep Web
• High-level summary of data
  – Scenario 1: a user wants to relocate to a county
    • Summary of the residences in the county?
      – Age, Price, Square Footage
      – The county property assessor's web site only allows simple queries
  – Scenario 2: a user is thinking about his or her career path
    • High-level knowledge about the job posts in the market
      – Job type, salary, education, experience, skills, ...
      – Job web sites, e.g., LinkedIn and MSN Careers, provide millions of job posts
Challenges
• Databases cannot be accessed directly
– Sampling method for Deep web mining
• Obtaining data is time consuming
– Efficient sampling method
– High accuracy with low sampling cost
Contributions
• Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD,
2012)
• Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
(submitted to ICDM, 2012)
• Stratification Based Hierarchical Clustering on a Deep Web Data Source
(SDM, 2012)
• An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
• Differential Rule Mining (ICDM Workshops, 2010)
• Stratified Sampling for Deep Web Mining (ICDM, 2010)
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the deep web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future work
An Example of the Deep Web for Real Estate
k-means Clustering over a Deep Web Data Source
• Goal: estimate k centers for the underlying clusters, so that the k centers
  estimated from the sample are close to the k true centers of the whole population
Overview of Method
[Figure: the population is stratified into subpopulations 1..n; sample allocation draws Sample 1..n from the subpopulations; the combined sample is fed to stratification-based k-means clustering, which outputs the clusters.]
Stratification on the deep web
• Partitioning the entire population into strata
  – Stratifies on the query space of the input attributes
  – Goal: homogeneous query subspaces
  – Radius of a query subspace
  – Rule: choose the input attribute that most decreases the radius of a node
    (see the sketch below)
  – For an input attribute, the decrease of radius obtained by splitting on it
[Figure: stratification tree on the input attributes, with the root split on Year of construction (NULL, Y=1980, Y=1990, Y=2000, ..., Y=2008) and a further split on Bedroom (B=3, B=4, ...).]
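The greedy splitting rule can be illustrated with a short sketch. This is a minimal illustration rather than the paper's implementation: it assumes a node's radius is the maximum distance of its pilot records' output attributes from their mean, and it splits on whichever input attribute most reduces the size-weighted radius of the resulting children. Applied recursively, this builds the stratification tree whose leaves are the strata.

```python
import numpy as np

def radius(outputs):
    """Radius of a query subspace: the maximum distance of its pilot
    records' output attributes from their mean (an assumed definition;
    the slide's exact formula is not reproduced here)."""
    if len(outputs) == 0:
        return 0.0
    X = np.asarray(outputs, dtype=float)
    center = X.mean(axis=0)
    return float(np.max(np.linalg.norm(X - center, axis=1)))

def best_split_attribute(node_outputs, input_values):
    """Choose the input attribute whose split most decreases the radius.

    node_outputs : list of output-attribute vectors of the node's pilot records
    input_values : dict mapping an input attribute name to that attribute's
                   value for each pilot record (same order as node_outputs)
    """
    base = radius(node_outputs)
    best_attr, best_decrease = None, 0.0
    for attr, values in input_values.items():
        # One child node per distinct value of the input attribute.
        children = {}
        for rec, v in zip(node_outputs, values):
            children.setdefault(v, []).append(rec)
        # Size-weighted radius of the children after the split.
        child_radius = sum(len(c) / len(node_outputs) * radius(c)
                           for c in children.values())
        decrease = base - child_radius
        if decrease > best_decrease:
            best_attr, best_decrease = attr, decrease
    return best_attr, best_decrease
```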
Partition on Space of Output Attributes
[Figure: scatter plot of the output attributes, Price vs. Square Feet, with points labeled by year of construction (1980, 1990, 2000, 2008).]
Sampling Allocation Methods
• We create c·k partitions, i.e., c·k subspaces of the output-attribute space
  – A pilot sample is drawn first
  – c·k-means clustering on the pilot sample generates the c·k partitions (see the sketch below)
• Representative sampling
  – Good estimation of the statistics of the c·k subspaces
    • Centers
    • Proportions
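As a concrete sketch of this step (the slides do not name an implementation; scikit-learn's KMeans is used here, and c=3 is only a placeholder for the constant c):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_output_subspaces(pilot_outputs, k, c=3, seed=0):
    """Partition the output-attribute space into c*k subspaces by running
    k-means with c*k clusters on the pilot sample's output attributes."""
    X = np.asarray(pilot_outputs, dtype=float)
    km = KMeans(n_clusters=c * k, n_init=10, random_state=seed).fit(X)
    # cluster_centers_ are pilot estimates of the subspace centers;
    # labels_ assigns each pilot record to one of the c*k subspaces.
    return km.cluster_centers_, km.labels_
```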
Representative Sampling-Centers
• Center of a subspace
  – Mean vector of all data points belonging to the subspace
• Let the sample be S = {DR_1, DR_2, ..., DR_n}
  – For the i-th subspace, the center on output attribute O_m is estimated as
    $sc_{i,m} = \frac{1}{m_i} \sum_{j=1}^{m_i} DR_{i,j}(O_m)$,
    where DR_{i,j} is the j-th sampled record in the i-th subspace and m_i is the
    number of sampled records in that subspace
Distance Function
• Measures how far the c·k estimated centers are from the true centers
  – Using the Euclidean distance between each estimated center and its true center
• Its expectation is expressed through an integrated variance
  – Defined in terms of sub-space, stratum and output attribute
  – Computed based on the pilot sample
  – n_j: the number of samples drawn from the j-th stratum
Optimized Sample Allocation
• Goal: minimize the distance function (integrated variance) for a fixed total sample size
• Using Lagrange multipliers to obtain the optimal allocation across strata (see the sketch below)
• Strata with large variance receive more samples
  – Their data are spread over a wide area, and more data are needed to represent the population
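The Lagrange-multiplier solution has the familiar Neyman-style form: if V_j denotes the j-th stratum's integrated variance from the pilot sample and n is the total budget, minimizing sum_j V_j / n_j subject to sum_j n_j = n gives n_j proportional to sqrt(V_j). A minimal sketch under that assumed formulation:

```python
import numpy as np

def optimized_allocation(stratum_variances, total_budget):
    """Allocate a total sample budget across strata proportionally to the
    square root of each stratum's (pilot-estimated) integrated variance,
    the closed form given by the Lagrange-multiplier condition for
    minimizing sum_j V_j / n_j subject to sum_j n_j = total_budget."""
    V = np.asarray(stratum_variances, dtype=float)
    weights = np.sqrt(V)
    n = total_budget * weights / weights.sum()   # assumes at least one V_j > 0
    # Round to integers while preserving the total budget.
    n_int = np.floor(n).astype(int)
    leftover = total_budget - n_int.sum()
    order = np.argsort(n - n_int)[::-1]          # largest fractional parts first
    n_int[order[:leftover]] += 1
    return n_int
```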
Active Learning based Sampling Method
• In machine learning
  – Passive learning: data are randomly chosen
  – Active learning: certain data are selected to help build a better model,
    because obtaining data is costly and/or time-consuming
• Choosing stratum i yields an estimated decrease of the distance function
• Iterative sampling process (see the sketch below)
  – At each iteration, the stratum with the largest decrease of the distance function
    is selected for sampling
  – The integrated variance is then updated
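A sketch of the iterative loop follows. The exact decrease formula did not survive this transcript, so the decrease of the distance function for stratum j is approximated here by the variance reduction V_j/n_j - V_j/(n_j + b) from drawing b more samples; that surrogate is an assumption, not the paper's expression.

```python
def active_allocation(stratum_variances, pilot_counts, total_budget, batch=1):
    """Iteratively assign the next batch of samples to the stratum whose
    additional samples most decrease the (estimated) distance function."""
    V = [float(v) for v in stratum_variances]
    n = [int(c) for c in pilot_counts]          # pilot counts must be >= 1
    drawn = 0
    while drawn < total_budget:
        b = min(batch, total_budget - drawn)
        # Estimated decrease of the distance function for each stratum if
        # b more samples were drawn from it (assumed surrogate).
        decrease = [v / nj - v / (nj + b) for v, nj in zip(V, n)]
        j = max(range(len(V)), key=decrease.__getitem__)
        n[j] += b        # in practice: query the deep web source for b records here
        drawn += b
        # The real method would re-estimate the integrated variance V[j]
        # from the enlarged sample at this point.
    return n
```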
Representative Sampling-Proportion
• Proportion of a sub-space
  – Fraction of data records belonging to the sub-space
  – Depends on the proportion of the sub-space within each stratum, e.g. the j-th stratum
    (a stratified estimate is sketched below)
• Risk function
  – Distance between the estimated fractions and their true values
• Iterative sampling process
  – At each iteration, the stratum with the largest decrease of the risk function
    is chosen for sampling
  – Parameters are updated
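A stratified estimate of the sub-space proportions can be sketched as follows; it simply weights each stratum's observed fraction by the stratum's population share (the names stratum_pop_sizes and subspace_labels_by_stratum are illustrative, not from the slides):

```python
import numpy as np

def estimate_proportions(subspace_labels_by_stratum, stratum_pop_sizes, n_subspaces):
    """Estimate each sub-space's proportion in the population as a
    population-weighted average of the per-stratum sample fractions."""
    N = float(sum(stratum_pop_sizes))
    props = np.zeros(n_subspaces)
    for labels, N_j in zip(subspace_labels_by_stratum, stratum_pop_sizes):
        labels = np.asarray(labels)
        if labels.size == 0:
            continue
        for s in range(n_subspaces):
            props[s] += (N_j / N) * np.mean(labels == s)
    return props
```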
Stratified K-means Clustering
• Weight for data records in the i-th stratum
  – The ratio of the stratum's population size to its sample size
• Similar to k-means clustering
  – The center of the i-th cluster is the weighted mean of the sampled records assigned to it
    (see the sketch below)
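Finally, the clustering step itself can be sketched as a weighted k-means on the pooled sample. Assuming, as one natural reading of the slide, that a record from the i-th stratum carries weight N_i/n_i (stratum population size over stratum sample size), the centers become weighted means so that heavily sampled strata do not dominate:

```python
import numpy as np

def stratified_kmeans(points, weights, k, iters=50, seed=0):
    """k-means on a weighted sample: each center is the weighted mean of
    the points assigned to it."""
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sampled record to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update each center as the weighted mean of its records.
        new_centers = centers.copy()
        for c in range(k):
            mask = assign == c
            if w[mask].sum() > 0:
                new_centers[c] = np.average(X[mask], axis=0, weights=w[mask])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```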
Contribution
• Sampling methods for solving the problem of k-means clustering
over a deep web data source
• Representative Sampling
– Partition on the space of output attributes
• Centers
– Optimized Sampling method
– Active learning based sampling method
• Proportions
– Active learning based sampling method
Experiment Result
• Data sets
  – Noisy synthetic data set
    • 4,000 data records with 4 input attributes and 2 output attributes
    • 400 noise data points added
  – Yahoo! data set
    • Data on used cars
    • 8,000 data records
• Evaluation metric: average distance (AvgDist) between the estimated and true centers
Representative Sampling-Noisy Data Set
• Benefit of stratification
  – Compared with rand, the decreases of AvgDist are 35.5%, 37.4%, 38.6% and 26.9%
• Benefit of representative sampling
  – Compared with rand_st, the decreases of AvgDist are 11.8%, 14.4% and 16.1%
• Center based sampling methods have better performance
• The optimized sampling method has better performance in the long run
Representative Sampling-Yahoo! Data Set
• Benefit of stratification
  – Compared with rand, the decreases of AvgDist are 7.2%, 13.2%, 15.0% and 16.8%
• Benefit of representative sampling
  – Compared with rand_st, the decreases of AvgDist are 6.6%, 8.5% and 10.5%
• Center based sampling methods have better performance
• The optimized sampling method has better performance in the long run
Scalability
• The execution time of each method is linear in the size of the data set
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the deep web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future work
Outlier Detection
• Outlier
  – An observation that deviates greatly from the other observations
• DB(p, D) outlier
  – An object for which at least a fraction p of the objects lie at a distance greater than D
    (see the sketch below)
• Challenges for outlier detection over a deep web data source
  – Recall: finding as large a fraction of the outliers as possible
  – Precision: accurately identifying outliers among the sampled data
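For reference, the DB(p, D) definition can be checked directly when the whole data set is available; over a deep web source this full pass is exactly what cannot be done, which motivates the sampling methods that follow. A minimal sketch:

```python
import numpy as np

def db_outliers(X, p, D):
    """Indices of DB(p, D) outliers: objects for which at least a fraction p
    of the other objects lie at distance greater than D."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    far_fraction = (dist > D).sum(axis=1) / (n - 1)   # self-distance 0 never counts
    return np.where(far_fraction >= p)[0]
```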
Two-phase Stratified Sampling Method
• Neighborhood Sampling
– Aiming at improving recall
– Query spaces with high probability of containing outliers
are explored.
• Uncertain Driven Sampling
– Aiming at improving precision
Outliers in Stratified Sampling
• Stratified sampling has good performance
• Stratification
  – Similar to the stratification in k-means clustering over a deep web data source
  – Controls the number of strata
• Outlier detection
  – For a data object, let f denote the fraction of data objects at distance greater than D
  – f is estimated from the stratified sample by combining the per-stratum fractions
    (a sketch follows below)
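A sketch of that stratified estimate (the slide's exact expression did not survive the transcript; here the fraction is the population-share-weighted average of the per-stratum sampled fractions, and the argument names are illustrative):

```python
import numpy as np

def estimate_far_fraction(x, samples_by_stratum, stratum_pop_sizes, D):
    """Estimate, for a data object x, the fraction f of the population lying
    at distance greater than D, from a stratified sample."""
    N = float(sum(stratum_pop_sizes))
    f_hat = 0.0
    for S_j, N_j in zip(samples_by_stratum, stratum_pop_sizes):
        S_j = np.asarray(S_j, dtype=float)
        if len(S_j) == 0:
            continue
        far = np.linalg.norm(S_j - np.asarray(x, dtype=float), axis=1) > D
        f_hat += (N_j / N) * far.mean()   # stratum's population share times its sampled fraction
    return f_hat
```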
Neighbor Nodes
• Similar data objects tend to come from the same query subspace or from neighboring query subspaces
• Neighbor nodes of a node
  – Its left and right cousins with the same parent nodes
Neighborhood Sampling
[Figure: example stratification tree with Root split on year (Y=1980, Y=1990, Y=2000, Y=2010), then on bedrooms (B=1..B=4), then on bathrooms (Ba=1, Ba=2).]
Post-Stratification
• The original strata are further stratified after the additional sampling
• New stratum: leaf nodes with the same sample rate under the same original stratum
• Each sampled data record has an estimate and a variance for
  – The fraction of data objects at distance greater than D
  – The probability of being an outlier
Uncertain Driven Sampling
• For a sampled data record, based on its estimated outlier probability
  – Outlier: the estimate is above an upper threshold
  – Normal data object: the estimate is below a lower threshold
  – Otherwise: uncertain data object (see the sketch below)
• Task: obtain a sample for identifying the uncertain data objects
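With an estimated outlier probability per sampled record, the three-way split can be sketched as below; the threshold values theta1 and theta2 are placeholders, since the slide's actual values did not survive the transcript:

```python
def classify_records(outlier_probs, theta1=0.9, theta2=0.1):
    """Split sampled records into outliers, normal objects and uncertain
    objects; the uncertain ones are the targets of the second sampling phase."""
    outliers, normal, uncertain = [], [], []
    for idx, prob in enumerate(outlier_probs):
        if prob > theta1:          # confidently an outlier
            outliers.append(idx)
        elif prob < theta2:        # confidently a normal data object
            normal.append(idx)
        else:                      # uncertain: needs more samples to decide
            uncertain.append(idx)
    return outliers, normal, uncertain
```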
Sample Allocation
• For the uncertain data objects
  – Goal: a better estimate of the fraction of data objects at distance greater than D
  – Minimize the estimation variance for a fixed sample budget, using a Lagrange multiplier
Outlier in Stratified Sampling
• For a sampled data record
  – Outlier: the estimated fraction exceeds the threshold
  – Otherwise: a normal data object
• The distance between each pair of sampled data objects is computed
  – A record is classified as an outlier or a normal data object by comparing the
    fraction of its sampled neighbors within the D-neighborhood against the threshold
Efficient Outlier Detection
• It can be shown that
• Sufficient condition
  – If
    • A normal data object
    • An outlier
  – Else
    • A normal data object
    • An outlier
Experiment Result
• Data set
  – Yahoo! data set
    • Data on used cars
    • 8,000 data records
• Evaluation
  – Precision: the fraction of the outliers identified in the sample that are true outliers
  – Recall: the fraction of the true outliers that are sampled
Recall
• Benefit of stratification
  – Increase over SRS: 108.2%, 116.7% and 74.7%
• Benefit of neighborhood sampling
  – Increase over SSTS: 19.1% and 28.1%
• Uncertain sampling decreases recall by 3.7%
Precision
• All four methods have
good performance
– Average precision is
over 0.9
• Stratified sampling
methods have lower
precision
– Compared with SRS,
decrease: 1.7%, 4.3%,
and 0.68%.
• Benefits of uncertain
sampling
– Compared with NS,
increase: 2.7%
Trade-off between Precision and Recall
• Trade-off between precision and recall
• Benefit of stratification
  – TPS, NS and SSTS improve recall for precision in 0.75-0.975
• Benefit of neighborhood sampling
  – TPS and NS improve recall for precision in 0.75-0.975
• Benefit of uncertain sampling
  – TPS improves recall for precision in 0.92-1.0
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the deep web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future work
Stratification Based Hierarchical Clustering on a Deep Web Data Source
• Hierarchical clustering based on stratified sampling
  – Stratification
  – Sample allocation
• Representative sampling
  – Mean values of the output attributes are close to their true values
• Uncertain sampling
  – Sample heavily on the boundaries between clusters
An Active Learning Based Frequent Itemset Mining
• Frequent itemset mining
  – Estimating the support of itemsets
  – The number of itemsets could be huge, so only 1-itemsets are considered
• Bayesian network
  – Models the relationship between the input attributes and the output attributes
  – A risk function on the estimated parameters
• Active learning based sampling
  – Data records are selected step by step
  – Sample the query subspaces with the greatest decrease of the risk function
Differential Rule Mining
• Different data sources report different values for the same data object
  – e.g., prices of commodities
• Goal: analyzing the differences between data sources
• Differential rule
  – Left-hand side: a frequent itemset
  – Right-hand side: the behavior of the differential attribute
• Differential rule mining
  – Apriori algorithm
  – Statistical hypothesis test
Stratified Sampling for Association Rule Mining and Differential Rule Mining
• Data mining tasks
  – Association rule mining and differential rule mining
• Stratified sampling
• Stratification
  – Combining estimation variance and sampling cost
  – A tree recursively built on the query space
• Sample allocation
  – An optimized method for minimizing an integrated cost on variance and sampling cost
Conclusion
• Data mining on the deep web is challenging
• We proposed methods for data mining on the deep web
  – A stratified k-means clustering method
  – A two-phase sampling based outlier detection method
  – A stratified hierarchical clustering method
  – An active learning based frequent itemset mining method
  – A stratified sampling method for data mining on the deep web
  – Differential rule mining
• The experimental results show the efficiency of our methods
Future Work
• Outlier detection over a deep web data source
  – Consider the problem of statistical-distribution-based outlier detection
• Mining multiple deep web data sources
  – Instance-based schema matching
    • Efficiently sampling instances from the deep web to facilitate schema matching
  – Mining the data coverage of multiple deep web data sources
    • Efficient sampling methods for estimating the data coverage of multiple data sources
Questions?