Download Data Mining over the Deep Web - Computer Science and Engineering

Data Mining over the Deep Web Tantan Liu, Gagan Agrawal {liut,agrawal}@cse.ohio-state.edu Ohio State University April 12, 2011 1 Outline • Introduction – Deep Web – Data Mining on the deep web • Frequent itemset mining over the deep web – Bayesian network – Active learning based sampling method • Experiment Result • Conclusion 2 Deep Web • Data sources hidden from the Internet – Online query interface vs. Database – Database accessible through online Interface – Input attribute vs. Output attribute • An example of Deep Web 3 Data Mining over the Deep Web • High level summary of data – Scenario 1: A student wants to find a job as a software Engineer • • • • Will a master degree help? Which language to learn: Java, C, or C#? Try MSN careers – to much information! Frequent itemset mining! 4 Challenges • Databases cannot be accessed directly – Sampling method for Deep web mining • Obtaining data is time consuming – Efficient sampling method – High accuracy with low sampling cost 5 Roadmap • Introduction – Deep Web – Data Mining • Frequent Itemset mining over the deep web – Bayesian Network – Active learning based sampling method • Experiment Result • Conclusion 6 Frequent Itemset Mining • • • Itemset: a set of attributes with instantiations, e.g I={Brand=benz, Age>5} Support(Brand=Benz, Age>5)=2/8=0.25 Frequent Itemset: Support is larger than a threshold 7 Frequent Itemset Mining on Deep Web • Challenges – Support of itemsets is unavailable – The size of itemsets could be huge • Considering 1-itemsets – Simple random sample – Inefficient • Support of itemsets of input attributes is known – # of data records satisfying the query is provided 8 Main Idea • Task: Estimating the support of itemsets of output attributes • Questions – Can we use information about input attributes? • Bayesian Network – Relation between input attributes and output attributes – Compute support for itemsets of output attributes – How to quickly build the model • Active learning based sample method 9 Bayesian Network • Relation between input and output attributes • Graphical model – Random variables • Input and output attributes – Conditional dependencies • Output attributes depend on input attributes 10 Active Learning • In machine learning – Passive learning: data are randomly chosen – Active Learning • Certain data are selected, to help build a better model • Active Learning – Obtaining data is costly and/or time-consuming • Frequent Itemset Mining on Deep Web 11 An Example of Bayesian Network Age Brand <=5, >5 H, B known Mileage Estimate! Price Price Brand • Support of Itemsets depends on parameters in the Bayesian network H • Parameters are estimated based on Sample H ‒ Parameter: p(price<=5000|H,<=5) • 2 data records satisfying brand=H, Age<=5 • 1 data records satisfying brand=H, Age<=5, price<=5000 B B Age p <=5000 >5000 <=5 0.25 >5 0.25 0.5 0.5 0.5 0.5 <=5 0.25 >5 0.25 0.0 1.0 0.0 1.0 [0.125 0.125 0.0 0.0] ‒ p(price<=5000|H,<=5)=1/2=0.5 Support(Price<=5000)= 0.25 12 Example of Active learning on Deep Web Deep Web Data Source Qi, i=1,…, 4 Q1 B=H& Age<=5 Sampled Data Q2 Q3 B=H& Age>5 Q4 B=B& Age>5 B=B& Age<=5 Price <=5000 >5000 Price Q1 Q2 Q3 Q4 p11 p21 p31 p41 p12 p22 p32 p42 13 An Example of Active Learning Based Sampling • Hidden idea – Sampling heavily on query spaces with high impurity Deep Web Data Source Q1(B=H) Q1 0.5 0.5 Q2(B=B) Price Q2 0.01 0.99 14 Detailed Formulation • Support for output attributes ‒ – : an instantiation of input attributes, or a query : prior probability • Known – , • Conditional probability • Parameters in conditional table • Unknown, need to estimate 15 Parameters in Bayesian Network • are estimated based on a sample • Difference between estimated values and true values • Consider as statistical variables • Conjugate distribution – After observing data D, is in the same family with • Hyper parameter – Expectation: , where • Estimation for support of output attributes – Expectation on the distribution 16 Active Learning on Deep Web • Risk Function – Risk with the estimation for 1-itemsets composed of output attributes – Based on the hyper parameter in the Bayesian Network, • Data Selection – Data are obtained by queries : query selection – Data records are selected step by step – Choosing the query with most reduction on risk function • Updating Model – For where , and sample denotes the number of data records containing 17 Support for n-itemsets(n>1) • Estimation based on the Bayesian network – • Support value of in the query space 18 Roadmap • Introduction – Deep Web – Data Mining • Frequent Itemset mining over the deep web – Bayesian Network – Active learning based sampling method • Experiment Result • Conclusion 19 Experiment Result • Data set: US census – 2008 US Census on the income of US households – 40,000 data records • Three Methods – Dir: • Random Sample • Direct Computation – Bay • Random Sample • Computation Based on Bayesian Network – Act: our proposed method • Active Learning based Sample • Computation Based on Bayesian Network 20 US census • Square Error Rate: • Absolute Error Rate (AER): 21 Conclusion • • • • • Data mining on the deep web is challenging Frequent itemset mining over the deep web Bayesian network is used to model the deep web A active learning based sampling method The experiment results show the efficiency of our work 22 Questions? 23

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data Mining over the Deep Web - Computer Science and Engineering