Data Mining over Hidden Data Sources
Tantan Liu
Advisor: Gagan Agrawal
Dept. of Computer Science & Engineering
Ohio State University
July 23, 2012
Outline
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD, 2012)
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source (submitted to ICDM, 2012)
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012)
  – An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
  – Differential Rule Mining (ICDM Workshops, 2010)
  – Stratified Sampling for Deep Web Mining (ICDM, 2010)
• Conclusion and Future Work
Deep Web
• Data sources on the Internet that are hidden from search engines
  – Online query interface vs. database
  – The database is accessible only through the online interface
  – Input attributes vs. output attributes
• An example of the deep web (figure)
Data Mining over the Deep Web
• Goal: a high-level summary of the data
  – Scenario 1: a user wants to relocate to a county
    • What do the residences in the county look like?
      – Age, price, square footage
      – The county property assessor's web site only allows simple queries
  – Scenario 2: a user is thinking about his or her career path
    • High-level knowledge about the job posts in the market
      – Job type, salary, education, experience, skills, ...
      – Job web sites, e.g., LinkedIn and MSN Careers, provide millions of job posts
Challenges
• The database cannot be accessed directly
  – Sampling methods are needed for deep web mining
• Obtaining data through the query interface is time consuming
  – Efficient sampling methods
  – High accuracy with low sampling cost
Contributions
• Stratified K-means Clustering Over A Deep Web Data Source (SIGKDD, 2012)
• Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source (submitted to ICDM, 2012)
• Stratification Based Hierarchical Clustering on a Deep Web Data Source (SDM, 2012)
• An Active Learning Based Frequent Itemset Mining (ICDE, 2011)
• Differential Rule Mining (ICDM Workshops, 2010)
• Stratified Sampling for Deep Web Mining (ICDM, 2010)
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
An Example of the Deep Web for Real Estate (figure)
K-means Clustering over a Deep Web Data Source
• Goal: estimate k centers for the underlying clusters, so that the k centers estimated from the sample are close to the k true centers of the whole population.
Overview of Method
• (Figure) Pipeline: stratification partitions the population into subpopulations 1..n; sample allocation draws sample 1..n from them; stratification-based k-means clustering on the combined sample produces the clusters.
Stratification on the Deep Web
• Partition the entire population into strata
  – Stratify on the query space of the input attributes
  – Goal: homogeneous query subspaces
  – Radius of a query subspace: measures how spread out the records in the subspace are
  – Rule: choose the input attribute that most decreases the radius of a node
  – For an input attribute, compute the decrease of radius it yields when used to split the node (a sketch follows)
• (Figure) Stratification tree: the root (NULL) splits on Year of construction (Y=1980, Y=1990, Y=2000, ..., Y=2008), and nodes split further on Bedroom (B=3, B=4, ...)
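The splitting rule can be sketched as follows. This is a minimal illustration rather than the dissertation's exact formulas: it assumes the radius of a query subspace is measured on the output-attribute vectors of pilot records, and that the decrease of radius for an input attribute is the parent radius minus the weighted average radius of the child subspaces. Function and variable names (`radius`, `best_split`, `records`, `outputs`) are hypothetical.

```python
import numpy as np

def radius(points):
    """Radius of a query subspace: the maximum distance from the subspace's
    mean output-attribute vector to any (pilot) record in the subspace."""
    if len(points) == 0:
        return 0.0
    center = points.mean(axis=0)
    return float(np.max(np.linalg.norm(points - center, axis=1)))

def best_split(records, outputs, input_attrs):
    """Pick the input attribute whose split most decreases the radius of a
    node: decrease = radius(node) - weighted average radius of the children."""
    outputs = np.asarray(outputs, dtype=float)
    base = radius(outputs)
    best_attr, best_gain = None, 0.0
    for attr in input_attrs:
        values = {r[attr] for r in records}
        child_radius = 0.0
        for v in values:
            idx = [i for i, r in enumerate(records) if r[attr] == v]
            child_radius += len(idx) / len(records) * radius(outputs[idx])
        gain = base - child_radius
        if gain > best_gain:
            best_attr, best_gain = attr, gain
    return best_attr, best_gain

# toy usage: two input attributes (year, bedrooms), two output attributes
recs = [{"year": 1980, "bed": 3}, {"year": 1980, "bed": 4},
        {"year": 2000, "bed": 3}, {"year": 2000, "bed": 4}]
outs = [[100, 900], [110, 950], [300, 2000], [320, 2100]]  # price (k$), sqft
print(best_split(recs, outs, ["year", "bed"]))  # splitting on "year" wins
```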
Partition on the Space of Output Attributes
• (Figure) Scatter plot over the output attributes Price vs. Square Feet, with points grouped by year of construction (1980, 1990, 2000, 2008)
Sampling Allocation Methods
• We create c*k partitions (subspaces) of the output-attribute space
  – A pilot sample is drawn first
  – c*k-means clustering on the pilot sample generates the c*k partitions
• Representative sampling
  – Good estimation of the statistics of the c*k subspaces
    • Centers
    • Proportions
Representative Sampling-Centers
• Center of a subspace
  – The mean vector of all data points belonging to the subspace
• Let the sample be S = {DR_1, DR_2, ..., DR_n}
  – For the i-th subspace, the center on output attribute O_m is estimated as
    sc_{i,m} = (1 / m_i) * Σ_{j=1..m_i} DR_{i,j}(O_m)
    where m_i is the number of sampled records in the i-th subspace and DR_{i,j}(O_m) is the O_m value of the j-th such record (see the sketch below)
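A small sketch of the center estimate reconstructed above: each subspace's center is the mean of the output-attribute vectors of the sampled records assigned to it. Function and variable names are illustrative.

```python
import numpy as np

def subspace_centers(sample_outputs, subspace_ids, num_subspaces):
    """sc_{i,m}: for each of the c*k subspaces, the mean vector (over the
    output attributes) of the sampled records DR_{i,j} assigned to it."""
    sample_outputs = np.asarray(sample_outputs, dtype=float)
    subspace_ids = np.asarray(subspace_ids)
    centers = np.zeros((num_subspaces, sample_outputs.shape[1]))
    for i in range(num_subspaces):
        members = sample_outputs[subspace_ids == i]
        if len(members) > 0:
            centers[i] = members.mean(axis=0)   # (1/m_i) * sum_j DR_{i,j}(O_m)
    return centers

# toy usage: 5 sampled records with 2 output attributes, assigned to 2 subspaces
print(subspace_centers([[1, 2], [3, 4], [10, 10], [12, 14], [2, 3]],
                       [0, 0, 1, 1, 0], num_subspaces=2))
```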
Distance Function
• Defined between the c*k estimated centers and the true centers
  – Using Euclidean distance
• Integrated variance
  – Expressed in terms of subspaces, strata, and output attributes
  – Computed from the pilot sample
  – n_j: the number of samples drawn from the j-th stratum
Optimized Sample Allocation
• Goal: choose the n_j that minimize the integrated variance under a fixed total sampling budget
• Solved in closed form using Lagrange multipliers
• The result samples more heavily from strata with large variance
  – Their data are spread over a wide area, so more samples are needed to represent the population (see the sketch below)
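A hedged sketch of what a Lagrange-multiplier solution typically looks like in stratified sampling (Neyman-style allocation: samples proportional to stratum size times standard deviation). The dissertation's objective is the integrated variance over subspaces and output attributes, so its exact weights may differ; this only illustrates why high-variance strata receive more samples.

```python
import numpy as np

def optimized_allocation(stratum_sizes, stratum_stddevs, budget):
    """Allocate a fixed sampling budget across strata so that strata with
    larger (size-weighted) variance receive more samples -- the closed form
    obtained from a Lagrange-multiplier argument (Neyman-style allocation)."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    stds = np.asarray(stratum_stddevs, dtype=float)
    weights = sizes * stds
    if weights.sum() == 0:
        weights = np.ones_like(weights)
    alloc = budget * weights / weights.sum()
    return np.round(alloc).astype(int)

# e.g. three strata with very different spreads: the first gets the most samples
print(optimized_allocation([1000, 1000, 2000], [5.0, 1.0, 2.0], budget=200))
```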
Active Learning Based Sampling Method
• In machine learning
  – Passive learning: data are chosen randomly
  – Active learning: particular data are selected to help build a better model, because obtaining data is costly and/or time consuming
• For each stratum i, we estimate the decrease of the distance function that sampling it would produce
• Iterative sampling process (see the sketch below)
  – At each iteration, the stratum with the largest estimated decrease of the distance function is selected for sampling
  – The integrated variance is then updated
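A sketch of the iterative process under illustrative assumptions: `score` stands in for the estimated decrease of the distance function (here a simple variance-per-sample proxy), and `draw` stands for issuing queries to the stratum's query subspace. The dissertation's exact scoring expression differs.

```python
import numpy as np

def active_allocation(strata, score, draw, budget, batch=10):
    """Iterative, active-learning style sampling: at each iteration the
    stratum with the largest estimated decrease of the distance function
    (approximated here by `score`) is selected, and a batch is drawn from it."""
    samples = {s: [] for s in strata}
    spent = 0
    while spent < budget:
        gains = {s: score(s, samples[s], batch) for s in strata}
        best = max(gains, key=gains.get)         # largest estimated decrease
        samples[best].extend(draw(best, batch))  # query the deep web source
        spent += batch
    return samples

# toy usage with synthetic strata of different spreads
rng = np.random.default_rng(0)
sigma = {0: 5.0, 1: 1.0, 2: 2.0}
draw = lambda s, k: list(rng.normal(0.0, sigma[s], size=k))
# proxy score: high variance and few samples so far => large expected gain
score = lambda s, drawn, k: (np.var(drawn) if len(drawn) > 1 else 1e9) / (len(drawn) + k)
alloc = active_allocation([0, 1, 2], score, draw, budget=200)
print({s: len(v) for s, v in alloc.items()})   # the high-variance stratum gets more
```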
Representative Sampling-Proportion
• Proportion of a subspace
  – The fraction of data records belonging to the subspace
  – Depends on the proportion of the subspace within each stratum (the j-th stratum contributes according to its population share)
• Risk function
  – The distance between the estimated fractions and their true values
• Iterative sampling process
  – At each iteration, the stratum with the largest decrease of the risk function is chosen for sampling
  – The parameters are then updated (a sketch of the proportion estimate follows)
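A minimal sketch of the stratified proportion estimate: the proportion of a subspace is the sum, over strata, of the stratum's population share times the fraction of that stratum's sample falling in the subspace. Variable names are illustrative; the risk function and its update are not shown.

```python
import numpy as np

def subspace_proportions(subspace_ids, stratum_ids, stratum_pop_shares, num_subspaces):
    """Stratified estimate of the proportion of each subspace:
    sum over strata j of (population share of stratum j) *
    (fraction of stratum j's sample that falls in the subspace)."""
    props = np.zeros(num_subspaces)
    subspace_ids = np.asarray(subspace_ids)
    stratum_ids = np.asarray(stratum_ids)
    for j, share in enumerate(stratum_pop_shares):
        in_j = subspace_ids[stratum_ids == j]
        if len(in_j) == 0:
            continue
        for i in range(num_subspaces):
            props[i] += share * np.mean(in_j == i)
    return props

# toy usage: two strata of equal population share, two subspaces
print(subspace_proportions(subspace_ids=[0, 0, 1, 1, 1, 0],
                           stratum_ids=[0, 0, 0, 1, 1, 1],
                           stratum_pop_shares=[0.5, 0.5],
                           num_subspaces=2))
```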
Stratified K-means Clustering
• Weight for data records in the i-th stratum
  – Proportional to the stratum's population size divided by its sample size, so the weighted sample represents the whole population
• Similar to standard k-means clustering
  – The center of the i-th cluster is the weighted mean of the sampled records assigned to it (see the sketch below)
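A minimal sketch of stratified (weighted) k-means under the weighting described above: each sampled record carries a weight of roughly N_i / n_i for its stratum, and cluster centers are weighted means. Function names and the initialization scheme are illustrative.

```python
import numpy as np

def stratified_kmeans(points, weights, k, iters=20, seed=0):
    """K-means where each sampled point carries a stratum weight
    w_i ~ N_i / n_i, so centers are weighted means of the assigned points."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to the nearest current center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # weighted mean update per cluster
        for j in range(k):
            m = labels == j
            if weights[m].sum() > 0:
                centers[j] = np.average(points[m], axis=0, weights=weights[m])
    return centers, labels

# toy usage: two clusters, with heavier weights on the first stratum's records
pts = np.array([[0, 0], [1, 1], [0, 1], [10, 10], [11, 10], [10, 11]], float)
w = np.array([2.0, 2.0, 2.0, 1.0, 1.0, 1.0])   # stratum weights ~ N_i / n_i
centers, labels = stratified_kmeans(pts, w, k=2)
print(centers)
```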
Contribution
• Sampling methods for k-means clustering over a deep web data source
• Representative sampling
  – Partition on the space of output attributes
• Centers
  – Optimized sampling method
  – Active learning based sampling method
• Proportions
  – Active learning based sampling method
Experiment Results
• Data sets:
  – Noisy synthetic data set:
    • 4,000 data records with 4 input attributes and 2 output attributes
    • 400 noise data points are added
  – Yahoo! data set:
    • Data on used cars
    • 8,000 data records
• Evaluation metric: average distance (AvgDist) between the estimated centers and the true centers
Representative Sampling-Noisy Data Set
• Benefit of stratification
  – Compared with rand, the decreases in AvgDist are 35.5%, 37.4%, 38.6%, and 26.9%
• Benefit of representative sampling
  – Compared with rand_st, the decreases in AvgDist are 11.8%, 14.4%, and 16.1%
• Center-based sampling methods have better performance
• The optimized sampling method has better performance in the long run
Representative Sampling-Yahoo! Data Set
• Benefit of stratification
  – Compared with rand, the decreases in AvgDist are 7.2%, 13.2%, 15.0%, and 16.8%
• Benefit of representative sampling
  – Compared with rand_st, the decreases in AvgDist are 6.6%, 8.5%, and 10.5%
• Center-based sampling methods have better performance
• The optimized sampling method has better performance in the long run
Scalability
• The execution time of each method is linear in the size of the data set
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
Outlier Detection
• Outlier
  – An observation that deviates greatly from the other observations
• DB(p; D) outlier
  – At least a fraction p of the objects lie at a distance greater than D from it (see the sketch below)
• Challenges for outlier detection over a deep web data source
  – Recall: find as large a fraction of the outliers as possible
  – Precision: accurately identify outliers from the sampled data
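A direct, brute-force check of the DB(p; D) definition on an in-memory sample, just to make the definition concrete. The dissertation works with fractions estimated from stratified samples rather than full pairwise distances over the population.

```python
import numpy as np

def db_outliers(points, p, D):
    """DB(p; D) outliers: objects for which at least a fraction p of the
    other objects lie at distance greater than D."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # for each object, the fraction of *other* objects farther than D
    far = (dist > D).sum(axis=1) / (n - 1)
    return np.where(far >= p)[0]

# toy example: one point far away from a tight cluster
pts = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)), [[20, 20]]])
print(db_outliers(pts, p=0.95, D=5.0))   # -> index 50, the isolated point
```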
Two-phase Stratified Sampling Method
• Neighborhood sampling
  – Aims at improving recall
  – Query subspaces with a high probability of containing outliers are explored
• Uncertain driven sampling
  – Aims at improving precision
Outliers in Stratified Sampling
• Stratified sampling has good performance
• Stratification
  – Similar to the stratification used for k-means clustering over a deep web data source
  – The number of strata is controlled
• Outlier detection
  – For a data object, let f denote the fraction of data objects at distance greater than D
  – f is estimated from the stratified sample using per-stratum counts
Neighbor Nodes
• Similar data objects tend to come from the same query subspace or from neighboring query subspaces
• Neighbor nodes of a node
  – Its left and right cousins that share the same parent node
Neighborhood Sampling
• (Figure) Example stratification tree: the Root splits on Year (Y=1980, Y=1990, Y=2000, Y=2010), and nodes split further on Bedroom (B=1 to B=4) and Bathroom (Ba=1, Ba=2)
Post-Stratification
• The original strata are further stratified after the additional sampling
• New stratum: leaf nodes with the same sample rate under the same original stratum
• Each sampled data record has an estimated value and a variance for
  – the fraction of data objects at distance greater than D
  – the probability of being an outlier
Uncertain Driven Sampling
• For a sampled data record, based on its estimated outlier probability and two thresholds:
  – Outlier: the estimate exceeds the upper threshold
  – Normal data object: the estimate falls below the lower threshold
  – Otherwise: uncertain data object (see the sketch below)
• Task: obtain an additional sample for identifying the uncertain data objects
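A hedged sketch of the three-way labeling: assuming each record has an estimated fraction and a variance from post-stratification, two thresholds separate outliers, normal objects, and uncertain objects that receive further sampling. The probability model (a normal approximation) and the threshold values are illustrative choices, not the dissertation's exact rule.

```python
from math import erf, sqrt

def classify(f_hat, var, p, theta1=0.9, theta2=0.1):
    """Label a sampled record from its estimated fraction f_hat (of objects
    farther than D) and its variance: 'outlier' if the rough probability
    that f >= p exceeds theta1, 'normal' if it falls below theta2, and
    'uncertain' otherwise. Thresholds and the approximation are assumptions."""
    sd = sqrt(var) if var > 0 else 1e-9
    prob_outlier = 0.5 * (1.0 - erf((p - f_hat) / (sd * sqrt(2))))  # P(f >= p)
    if prob_outlier > theta1:
        return "outlier"
    if prob_outlier < theta2:
        return "normal"
    return "uncertain"

print(classify(f_hat=0.97, var=1e-4, p=0.95))   # -> outlier
print(classify(f_hat=0.90, var=4e-3, p=0.95))   # -> uncertain (needs more samples)
```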
Sample Allocation
• For the uncertain data objects, additional samples are allocated to obtain a better estimate of f
• Using a Lagrange multiplier, the allocation minimizes the estimation variance under a fixed sampling budget
Outlier in Stratified Sampling
• A sampled data record is labeled an outlier if its estimated fraction exceeds the threshold, and a normal data object otherwise
• The distance between each pair of sampled data objects is computed
• Whether a record is an outlier or a normal data object is decided from the fraction of sampled objects that fall within its D-neighborhood
Efficient Outlier Detection
• It can be shown that a sufficient condition exists for labeling a sampled record
  – If the condition holds, the record is classified directly as a normal data object or an outlier
  – Else, the record is classified as a normal data object or an outlier by the full test
Experiment Results
• Data set:
  – Yahoo! data set:
    • Data on used cars
    • 8,000 data records
• Evaluation
  – Precision: the fraction of records identified as outliers that are true outliers
  – Recall: the fraction of true outliers that are included in the sample
Recall
• Benefit of stratification
  – Increase over SRS: 108.2%, 116.7%, and 74.7%
• Benefit of neighborhood sampling
  – Increase over SSTS: 19.1% and 28.1%
• Uncertain sampling decreases recall by 3.7%
Precision
• All four methods have good performance
  – The average precision is over 0.9
• The stratified sampling methods have lower precision
  – Compared with SRS, the decreases are 1.7%, 4.3%, and 0.68%
• Benefit of uncertain sampling
  – Compared with NS, the increase is 2.7%
Trade-off between Precision and Recall
• There is a trade-off between precision and recall
• Benefit of stratification
  – TPS, NS, and SSTS improve recall for precision in the range 0.75-0.975
• Benefit of neighborhood sampling
  – TPS and NS improve recall for precision in the range 0.75-0.975
• Benefit of uncertain sampling
  – TPS improves recall for precision in the range 0.92-1.0
Roadmap
• Introduction
  – Deep Web
  – Data Mining on the Deep Web
• Contributions
  – Stratified K-means Clustering Over A Deep Web Data Source
  – Two-phase Sampling Based Outlier Detection Over A Deep Web Data Source
  – Stratification Based Hierarchical Clustering on a Deep Web Data Source
  – An Active Learning Based Frequent Itemset Mining
  – Differential Rule Mining
  – Stratified Sampling for Deep Web Mining
• Conclusion and Future Work
Stratification Based Hierarchical Clustering on a Deep Web Data Source
• Hierarchical clustering based on stratified sampling
  – Stratification
  – Sample allocation
• Representative sampling
  – The mean values of the output attributes are close to their true values
• Uncertain sampling
  – Sample more heavily on the boundaries between clusters
An Active Learning Based Frequent Itemset Mining
• Frequent itemset mining
  – Estimating the support of itemsets
  – The number of itemsets could be huge, so we consider 1-itemsets (a sketch follows)
• Bayesian network
  – Models the relationship between input attributes and output attributes
  – A risk function is defined on the estimated parameters
• Active learning based sampling
  – Data records are selected step by step
  – Sample the query subspaces with the greatest decrease of the risk function
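As a concrete (and much simplified) illustration of estimating 1-itemset supports from a weighted sample of a deep web source. The Bayesian-network model and the risk-driven selection of query subspaces are not shown, and the names here are hypothetical.

```python
from collections import Counter

def frequent_one_itemsets(sampled_records, weights, min_support):
    """Estimate the support of 1-itemsets (attribute, value) from a weighted
    sample and keep those whose estimated support reaches min_support."""
    counts = Counter()
    total = sum(weights)
    for record, w in zip(sampled_records, weights):
        for item in record.items():      # each (attribute, value) pair
            counts[item] += w
    return {item: c / total for item, c in counts.items() if c / total >= min_support}

# toy usage: three sampled job posts with stratum weights
sample = [{"type": "full-time", "edu": "BS"},
          {"type": "full-time", "edu": "MS"},
          {"type": "part-time", "edu": "BS"}]
print(frequent_one_itemsets(sample, weights=[1.0, 1.0, 2.0], min_support=0.5))
```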
Differential Rule Mining
• The same data object may have different values in different data sources
  – e.g., the prices of commodities
• Goal: analyze the differences between data sources
• Differential rule
  – Left-hand side: a frequent itemset
  – Right-hand side: the behavior of the differential attribute
• Differential rule mining
  – Apriori algorithm
  – Statistical hypothesis testing (an illustrative test follows)
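An illustrative statistical test for the right-hand side of a differential rule: for the records matching a frequent itemset, test whether the differential attribute (here, a price gap between two sources) differs significantly from zero. The one-sample t-test and the data are example choices, not necessarily the dissertation's exact test.

```python
import numpy as np
from scipy import stats

def differential_rule_test(differences, alpha=0.05):
    """For records matching a frequent itemset (the rule's left-hand side),
    test whether the differential attribute (e.g., the price difference
    between two sources) deviates significantly from zero."""
    differences = np.asarray(differences, dtype=float)
    t_stat, p_value = stats.ttest_1samp(differences, popmean=0.0)
    return {"mean_diff": float(differences.mean()),
            "p_value": float(p_value),
            "significant": p_value < alpha}

# e.g. price gaps (source A minus source B) for cars matching a frequent itemset
print(differential_rule_test([310, 280, 450, 390, 260, 500, 330, 410]))
```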
Stratified Sampling for Association Rule Mining and Differential Rule Mining
• Data mining tasks
  – Association rule mining and differential rule mining
• Stratified sampling
• Stratification
  – Combining estimation variance and sampling cost
  – A tree is recursively built on the query space
• Sample allocation
  – An optimized method that minimizes an integrated cost over variance and sampling cost
Conclusion
• Data mining on the deep web is challenging
• We proposed methods for data mining on the deep web
  – A stratified k-means clustering method
  – A two-phase sampling based outlier detection method
  – A stratified hierarchical clustering method
  – An active learning based frequent itemset mining method
  – A stratified sampling method for data mining on the deep web
  – Differential rule mining
• The experimental results show the efficiency of our methods
Future Work
• Outlier detection over a deep web data source
  – Consider the problem of statistical-distribution based outlier detection
• Mining multiple deep web data sources
  – Instance-based schema matching
    • Efficiently sampling instances from the deep web to facilitate schema matching
  – Mining the data coverage of multiple deep web data sources
    • Efficient sampling methods for estimating the data coverage of multiple data sources
Questions?