Classifying and Searching "Hidden-Web" Text Databases
Panos Ipeirotis
Computer Science Department, Columbia University

Motivation: "Surface" Web vs. "Hidden" Web
- "Surface" Web: link structure; crawlable; documents indexed by search engines
- "Hidden" Web: no link structure; documents "hidden" in databases; documents not indexed by search engines; need to query each collection individually

Hidden-Web Databases: Examples
- Search on the U.S. Patent and Trademark Office (USPTO) database for [wireless network]: 29,051 matches (the USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html)
- Search on Google restricted to the USPTO site, [wireless network site:patft.uspto.gov]: 0 matches

Database            | Query            | Database matches | Site-restricted Google matches
USPTO               | wireless network | 29,051           | 0
Library of Congress | visa regulations | >10,000          | 0
PubMed              | thrombopenia     | 26,887           | 221
(as of July 6th, 2004)

Interacting With Hidden-Web Databases
- Browsing: Yahoo!-like directories (InvisibleWeb.com, SearchEngineGuide.com), populated manually (e.g., Root → Legal → Patents → USPTO)
- Searching: metasearchers

Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- Modeling and Managing Changes in Hidden-Web Databases

Hierarchically Classifying the ACM Digital Library
- Which categories describe the ACM DL? Root: Arts? Computers? Health? Science? Sports? Under Computers: Software? Hardware? Under Programming: C/C++?
Perl? Java? Visual Basic?

Text Database Classification: Definition
- For a text database D and a category C:
  - Coverage(D,C) = number of documents in D about C
  - Specificity(D,C) = fraction of documents in D about C
- Assign a text database to a category C if:
  - Database coverage for C is at least Tc (coverage threshold, e.g., > 100 docs in C)
  - Database specificity for C is at least Ts (specificity threshold, e.g., > 40% of docs in C)

Brute-Force Classification "Strategy"
1. Extract all documents from the database
2. Classify the documents by topic (using state-of-the-art classifiers: SVMs, C4.5, RIPPER, …)
3. Classify the database according to its topic distribution
- Problem: no direct access to the full contents of Hidden-Web databases

Classification: Goal & Challenges
- Goal: discover the database topic distribution
- Challenges: no direct access to the full contents of Hidden-Web databases; only limited search interfaces available; should not overload databases
- Key observation: only queries "about" the database topic(s) generate a large number of matches

Query-based Database Classification: Overview
1. Train a document classifier
2. Extract queries from the classifier (e.g., Sports: +nba +knicks; Health: +sars)
3. Adaptively issue the queries to the database (e.g., +sars → 1,254 matches)
4. Identify the topic distribution based on the adjusted number of query matches
5. Classify the database

Training a Document Classifier
1. Get a training set (a set of pre-classified documents)
2. Select the best features to characterize documents (Zipf's law + information-theoretic feature selection) [Koller and Sahami 1996]
3. Train the classifier (SVM, C4.5, RIPPER, …)
- Output: a "black-box" model for classifying documents into the hierarchy (Root: Arts, Computers, Legal, Science, Sports)
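The coverage/specificity classification rule can be sketched as follows (a minimal illustration; the category counts below are hypothetical, and the slide's thresholds Tc > 100 and Ts > 40% are used as defaults):

```python
def classify_database(coverage, num_docs, t_c=100, t_s=0.4):
    """Assign the database to every category C with
    Coverage(D, C) >= t_c and Specificity(D, C) >= t_s."""
    assigned = []
    for category, count in coverage.items():
        specificity = count / num_docs
        if count >= t_c and specificity >= t_s:
            assigned.append(category)
    return assigned

# Hypothetical topic distribution for a 1,000-document database:
coverage = {"Computers": 700, "Science": 250, "Arts": 50}
print(classify_database(coverage, num_docs=1000))  # -> ['Computers']
```

"Science" has enough coverage (250 > 100) but too little specificity (25% < 40%), so only "Computers" survives both thresholds.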
Extracting Query Probes [ACM TOIS 2003]
- Transform the classifier model into queries:
  - Trivial for rule-based classifiers (RIPPER)
  - Easy for decision-tree classifiers (C4.5), for which rule generators exist (C4.5rules)
  - Trickier for other classifiers: we devised rule-extraction methods for linear classifiers (linear-kernel SVMs, Naïve Bayes, …)
- Example query for Sports: +nba +knicks

Querying Database with Extracted Queries [SIGMOD 2001, ACM TOIS 2003]
- Issue each query to the database to obtain the number of matches without retrieving any documents
- Increase the coverage of the rule's category accordingly (e.g., #Sports = #Sports + 706)

Identifying Topic Distribution from Query Results
- Query-based estimates of the topic distribution are not perfect:
  - Document classifiers are not perfect: rules for one category match documents from other categories
  - Querying is not perfect: queries for the same category might overlap, and queries do not match all documents in a category
- Solution: learn to adjust the results of the query probes
Confusion Matrix Adjustment of Query Probe Results

Confusion matrix M (rows: assigned class, columns: correct class):

          comp   sports  health
comp      0.80   0.10    0.00
sports    0.08   0.85    0.04
health    0.02   0.15    0.96

(e.g., 10% of "sports" documents match queries for "computers")

Real coverage X = (1000, 5000, 50) yields the incorrect topic distribution derived from query probing:
- comp: 800 + 500 + 0 = 1,300
- sports: 80 + 4,250 + 2 = 4,332
- health: 20 + 750 + 48 = 818

This "multiplication" can be inverted to get a better estimate of the real topic distribution from the probe results:

Coverage(D) ≈ M⁻¹ · ECoverage(D)

- ECoverage(D): probing results; Coverage(D): adjusted estimate of the topic distribution
- M is usually diagonally dominant for "reasonable" document classifiers, hence invertible
- Compensates for errors in the query-based estimates of the topic distribution

Classification Algorithm (Again)
1. Train a document classifier (one-time process)
2. Extract queries from the classifier
3. Adaptively issue queries to the database           } steps 2–5 run
4. Identify the topic distribution based on the       } for every
   adjusted number of query matches                   } database
5. Classify the database
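The confusion-matrix inversion can be sketched directly with the numbers from the slide:

```python
import numpy as np

# Confusion matrix from the slide: M[i, j] = fraction of documents of
# correct class j that match the query probes for class i.
M = np.array([
    [0.80, 0.10, 0.00],   # comp
    [0.08, 0.85, 0.04],   # sports
    [0.02, 0.15, 0.96],   # health
])

# Probe results (estimated coverage) for the slide's example database:
ecoverage = np.array([1300.0, 4332.0, 818.0])

# Invert the confusion process: Coverage(D) ~ M^-1 . ECoverage(D)
coverage = np.linalg.solve(M, ecoverage)
print(np.round(coverage))  # -> [1000. 5000.   50.]
```

Using `np.linalg.solve` instead of explicitly forming the inverse is numerically safer; it recovers the real distribution (1000, 5000, 50) from the distorted probe counts.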
Experimental Setup
- 72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes)
- 500,000 Usenet articles (April–May 2000); newsgroups assigned by hand to hierarchy nodes (e.g., comp.hardware, rec.music.classical, rec.photo.*)
- RIPPER trained with 54,000 articles (1,000 per leaf); 27,000 articles used to construct the confusion matrix
- 500 "controlled" databases built from the remaining 419,000 newsgroup articles (to run detailed experiments)
- 130 real Web databases picked from InvisibleWeb (the first 5 under each topic)

Experimental Results: Controlled Databases
- Accuracy (using F-measure): above 80% for most <Tc, Ts> threshold combinations tried; degrades gracefully with hierarchy depth; confusion-matrix adjustment helps
- Efficiency: relatively small number of probes (<500) needed for most <Tc, Ts> combinations tried

Experimental Results: Web Databases
- Accuracy (using F-measure): ~70% for the best <Tc, Ts> combination; learned thresholds that reproduce human classification; tested the threshold choice using 3-fold cross-validation
- Efficiency: 120 queries per database needed on average for that choice of thresholds, with no documents retrieved; only a small part of the hierarchy is "explored"; queries are short (1.5 words on average, 4 words maximum), easily handled by most Web databases

Other Experiments [ACM TOIS 2003]
- Effect of the choice of document classifier: RIPPER, C4.5, Naïve Bayes, SVM
- Benefits of feature selection
- Effect of search-interface heterogeneity: Boolean vs. vector-space retrieval models
- Effect of the query-overlap elimination step
- Over crawlable databases: query-based classification is orders of magnitude faster than "brute-force" crawling-based classification [IEEE Data Engineering Bulletin 2003]

Hidden-Web Database Classification: Summary
- Handles autonomous Hidden-Web databases accurately and efficiently: ~70% F-measure, with only 120 queries issued on average and no documents retrieved
- Handles a large family of document classifiers (and can hence exploit future advances in machine learning)

Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- Modeling and Managing Changes in Hidden-Web Databases

Interacting With Hidden-Web Databases
- Browsing: Yahoo!-like directories
- Searching: metasearchers, for content not accessible through Google (a query goes through the metasearcher to PubMed, USPTO, NYTimes Archives, Library of Congress, …)

Metasearchers Provide Access to Distributed Databases
- Database selection relies on simple content summaries: vocabulary and word frequencies
- Example content summary for PubMed (11,868,552 documents): aids 123,826; cancer 1,598,896; heart 706,537; hepatitis 124,320; thrombopenia 26,887
- For the query [thrombopenia]: PubMed 26,887 matches; NYTimes Archives 0; USPTO 42

Extracting Content Summaries from Autonomous Hidden-Web Databases [Callan & Connell 2001]
1. Send random queries to the database
2. Retrieve the top matching documents
3. If 300 documents have been retrieved, stop; otherwise go to Step 1
- The content summary contains the words in the sample and the document frequency of each word
- Problems:
  1. Random sampling retrieves non-representative documents
  2. Frequencies in the summary are "compressed" to the sample size range
  3. Summaries from small samples are highly incomplete

Extracting a Representative Document Sample (Problem 1)
1. Train a document classifier
2. Create queries from the classifier
3. Adaptively issue the queries to the database:
   - Retrieve the top-k matching documents for each query
   - Save the number of matches for each one-word query
4. Identify the topic distribution based on the adjusted number of query matches
5. Categorize the database
6. Generate the content summary from the document sample
- Sampling retrieves documents only from "topically dense" areas of the database

Sample Frequencies vs. Actual Frequencies (Problem 2)
- Frequencies in the summary are "compressed" to the sample size range: PubMed (11,868,552 docs) has cancer 1,562,477 and heart 691,360, while a 300-document PubMed sample has cancer 45 and heart 16
- Key observation: query matches reveal frequency information

Adjusting Document Frequencies [VLDB 2002]
- Zipf's law empirically connects word frequency f and rank r: f = A (r + B)^c
- We know the document frequency and rank r of the words in the sample (e.g., cancer rank 1, liver rank 12, kidneys rank 78)
- We know the real document frequency f of some words from one-word queries (e.g., 140,000, 60,000, and 20,000 matches)
- We use curve fitting to estimate the absolute frequency of all words in the sample

Actual PubMed Content Summary
- Extracted automatically, with fewer than 200 queries and at most 4 documents retrieved per query
- Number of documents: 8,868,552 (actual: 12,349,932)
- ~27,500 words in the extracted content summary
- Category: Health, Diseases
- cancer 1,562,477; heart 581,506 (actual: 706,537); aids 121,491; hepatitis 73,481 (actual: 124,320); basketball 907 (actual: 1,094); cpu 598
- (heart, hepatitis, and basketball were not among the one-word probes)
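The Zipf-based frequency adjustment can be sketched as follows. The ranks and match counts are the illustrative values from the slides; fixing B by grid search and fitting log A and c by least squares in log space is my own simplification of the curve-fitting step:

```python
import numpy as np

# Known (rank, frequency) pairs: ranks come from the sample ordering,
# frequencies from the match counts of one-word queries.
ranks = np.array([1.0, 12.0, 78.0])
freqs = np.array([140000.0, 60000.0, 20000.0])

def fit_zipf(ranks, freqs, b_grid=np.linspace(0.0, 50.0, 501)):
    """Fit f = A * (r + B)^c: grid search over B, linear least
    squares for log A and c in log space."""
    best = None
    for b in b_grid:
        X = np.column_stack([np.ones_like(ranks), np.log(ranks + b)])
        coef, *_ = np.linalg.lstsq(X, np.log(freqs), rcond=None)
        err = np.sum((X @ coef - np.log(freqs)) ** 2)
        if best is None or err < best[0]:
            best = (err, np.exp(coef[0]), b, coef[1])
    _, a, b, c = best
    return a, b, c

a, b, c = fit_zipf(ranks, freqs)
# Estimate the absolute frequency of any sampled word from its rank:
estimate = lambda r: a * (r + b) ** c
print(int(estimate(40)))  # estimated frequency for the word ranked 40th
```

Once A, B, and c are fitted from the few words whose true frequencies are known, the same curve extrapolates absolute frequencies for every other word in the sample.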
Sampling and Incomplete Content Summaries (Problem 3)
- Summaries from small samples are highly incomplete
- [Chart: log(frequency) vs. rank for PubMed vs. a 300-document sample; e.g., endocarditis is among the 10% most frequent words in PubMed (~9,000 docs, ~0.1%), yet is missed by the sample]
- Many words appear in "relatively few" documents (Zipf's law)
- Low-frequency words are often important
- Small document samples miss many low-frequency words

Sample-based Content Summaries
- Challenge: improve content summary quality without increasing the sample size
- Main idea: database classification helps. Similar topics ↔ similar content summaries, so extracted content summaries complement each other

Databases with Similar Topics
- CANCERLIT (148,944 documents) contains "metastasis", but it was not found during sampling: breast 121,134; cancer 91,688; thrombopenia 11,344; metastasis <not found>
- CancerBACUP (17,328 documents) contains "metastasis": breast 12,546; cancer 9,735; thrombopenia <not found>; metastasis 3,569
- Databases under the same category have similar vocabularies and can complement each other

Content Summaries for Categories
- Databases under the same category share similar vocabulary
- Higher-level category content summaries provide additional useful estimates; we can use all estimates along the category path
- Fraction of sample containing "metastasis": Root 0.2%; Health 5%; Cancer 9.2%; PubMed 4%; CANCERLIT 0%; CancerBACUP 12%

Enhancing Summaries Using "Shrinkage" [SIGMOD 2004]
- Category: Root (|sample| = 30,000): metastasis 0.2% (± 0.01%)
- Category: Health (|sample| = 8,000): metastasis 5% (± 0.1%)
- Category: Cancer (|sample| = 1,200): metastasis 9.2% (± 2%)
- Database: D (|sample| = 300 docs): metastasis 0% (± 12%)
- Estimates from database content summaries can be unreliable
- Category content summaries are more reliable (based on larger samples) but less specific to the database
- By combining estimates from category and database content summaries, we get better estimates

Shrinkage-based Estimations
- Adjust the probability estimates: Pr[metastasis | D] = λ1·0.002 + λ2·0.05 + λ3·0.092 + λ4·0.000
- Select the λi weights to maximize the probability that the summary of D is from a database under all its parent categories
- Avoids the "sparse data" problem and decreases estimation risk

Shrinkage Weights and Summary
- For CANCERLIT: λroot = 0.02, λhealth = 0.13, λcancer = 0.20, λcancerlit = 0.65

word       | Root | Health | Cancer | CANCERLIT (old) | shrinkage-based (new)
metastasis | 0.2% | 5%     | 9.2%   | 0%              | 2.5%
aids       | 0.8% | 7%     | 2%     | 20%             | 14.3%
football   | 2%   | 1%     | 0%     | 0%              | 0.17%

- Shrinkage increases the estimates for underestimated words (e.g., metastasis) and decreases them for overestimated words (e.g., aids)
- …but it also introduces (with small probabilities) spurious words (e.g., football)

Adaptive Application of Shrinkage
- Database selection algorithms assign a score to each database for each query
- Unreliable score estimate: when the frequency estimates are uncertain, the assigned score is uncertain → use shrinkage
- Reliable score estimate: when confident about the assigned score, shrinkage is unnecessary and might hurt
- [Charts: probability distributions of the database score for a query, narrow (reliable) vs. wide (unreliable)]
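The shrinkage-based estimate is just a convex mixture of the word's frequency in the database sample and in its ancestor-category samples. A minimal sketch with the CANCERLIT numbers from the slides:

```python
# Shrinkage: combine the database estimate with its ancestor categories'
# estimates, weighted by lambdas that sum to 1.
def shrink(estimates, weights):
    assert abs(sum(weights) - 1.0) < 1e-9, "lambdas must sum to 1"
    return sum(w * p for w, p in zip(weights, estimates))

# lambda_root, lambda_health, lambda_cancer, lambda_cancerlit (from the slide):
weights = [0.02, 0.13, 0.20, 0.65]
# Pr[metastasis | Root], ... | Health], ... | Cancer], ... | CANCERLIT]:
metastasis = [0.002, 0.05, 0.092, 0.0]
print(round(shrink(metastasis, weights), 4))  # -> 0.0249
```

The mixture pulls the database's unreliable 0% estimate for "metastasis" up to ~2.5%, matching the "new estimate" column in the slide's table.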
Extracting Content Summaries: Problems Solved
- Problem 1: random sampling may retrieve non-representative documents. Solution: focus the querying on "topically dense" areas of the database
- Problem 2: frequencies are "compressed" to the sample size range. Solution: exploit the number of matches per query and adjust the estimates using curve fitting
- Problem 3: summaries based on small samples are highly incomplete. Solution: exploit the database classification and augment summaries using samples from topically similar databases

Searching Algorithm
- One-time process: (1) classify the databases and extract document samples; (2) adjust the frequencies in the samples
- For every query Q:
  - Assign a score to each database D (using its extracted content summary)
  - Examine the uncertainty of the score; if the uncertainty is high, apply shrinkage and assign a new score
  - Query only the top-K scoring databases

Experimental Setup [SIGMOD 2004]
- Two standard testbeds from TREC ("Text Retrieval Conference"): 200 databases; 100 queries with associated human-assigned document relevance judgments
- Two sets of experiments:
  - Content summary quality; metrics: precision, recall, Spearman correlation coefficient, KL-divergence
  - Database selection accuracy; metric: fraction of relevant documents for the queries in the top-scored databases

Experimental Results
- Content summary quality: shrinkage improves the quality of content summaries without increasing the sample size; frequency estimation gives accurate (within ±30%) estimates of the actual frequencies
- Database selection accuracy: focused sampling improves performance by 20%–40%; adaptive application of shrinkage improves performance by up to 50%; shrinkage is robust, improving performance consistently across many different configurations

Results: Database Selection
- Metric: R(K) = X / Y, where X = number of relevant documents in the selected K databases and Y = number of relevant documents in the best K databases
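The R(K) metric can be computed as follows (the per-database relevant-document counts are hypothetical):

```python
# R(K): relevant documents reachable through the K selected databases,
# relative to the best possible choice of K databases.
def r_at_k(relevant_per_db, selected, k):
    best = sorted(relevant_per_db.values(), reverse=True)[:k]
    chosen = [relevant_per_db[db] for db in selected[:k]]
    return sum(chosen) / sum(best)

# Hypothetical relevant-document counts for one query:
counts = {"db1": 50, "db2": 30, "db3": 20, "db4": 0}
print(r_at_k(counts, selected=["db2", "db3"], k=2))  # -> 0.625
```

Here the best two databases hold 80 relevant documents, the selected two hold 50, so R(2) = 50/80 = 0.625; a perfect selection scores 1.0.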
- [Chart: R(K) for K = 1…20, with and without shrinkage; CORI, with stemming, TREC4 testbed]

Other Experiments [SIGMOD 2004]
- Additional data set: 315 real Web databases
- Choice of database selection algorithm (CORI, bGlOSS, Language Modeling)
- Effect of stemming; effect of stop-word elimination
- Comparison with the VLDB 2002 hierarchical database selection algorithm
- Universal vs. adaptive application of shrinkage

Search: Contributions
- Developed a strategy to automatically summarize the contents of Hidden-Web text databases; the strategy assumes no cooperation from the databases
- Improves content summary quality by exploiting topical similarity and the number of query matches, with no increase in the document sample size required
- Developed an adaptive database selection strategy that decides whether to apply shrinkage in a database- and query-specific way

Outline of Talk
- Classification of Hidden-Web Databases
- Search over Hidden-Web Databases
- Modeling and Managing Changes in Hidden-Web Databases

Do Content Summaries Change Over Time?
- Databases are not static; their content changes. Should we refresh the content summary?
- Examined summaries of 152 real Web databases over 52 weeks
- Summary quality declines as age increases

Updating Content Summaries
- Summaries change, so we need to refresh them to capture the changes
- To devise an update policy, we need to know the frequency of "change"
- A summary changes at time T if dist(current, old(T)) > τ, where τ is a change-sensitivity threshold
- Survival analysis estimates the probability S(t) that T > t; a common model is S(t) = e^(−λt), where λ defines the frequency of change
- Problems: no access to the content summaries; even if we know the summaries, it takes a long time to estimate λ

Cox Proportional Hazards Regression
- We want to estimate the frequency of change for each database
- The Cox PH model examines the effect of database characteristics on the frequency of change (e.g., "if you double the size of a database, it changes twice as fast")
- The Cox PH model effectively uses "censored" data (i.e., the database did not change within time T)

Cox PH Regression Results
- Examined the effect of: the change-sensitivity threshold τ; topic domain; size; number of words; differences between summaries extracted in consecutive weeks
- Devised a concrete "change model" based on database characteristics (formula in thesis)

Scheduling Updates
- Each database D gets an estimated change rate λ and a corresponding average time between updates: e.g., a fast-changing database such as Tom's Hardware (λ = 0.088) is updated every few weeks (5–12), while a slow-changing one such as USPS (λ = 0.023) can wait much longer (34–46 weeks)
- Using our change model, we schedule updates according to the available resources (using the Lagrange-multiplier method)

Scheduling Results
- With clever scheduling, we improve the quality of the summaries according to a variety of metrics (precision, recall, KL-divergence)

Updating Content Summaries: Contributions
- Extensive experimental study showing that summary quality deteriorates with increasing summary age
- A change-frequency model that uses database characteristics to predict the frequency of change
- Scheduling algorithms that set the update frequency for each database according to the available resources
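The exponential change model S(t) = e^(−λt) can be sketched as follows; the two λ values are the example rates from the scheduling slide, interpreted here as changes per week:

```python
import math

# S(t) = exp(-lambda * t): probability that a summary has NOT changed
# by time t, under the exponential survival model from the slides.
def survival(lam, t):
    return math.exp(-lam * t)

fast, slow = 0.088, 0.023  # changes per week (example rates)
print(round(survival(fast, 10), 3))  # chance the fast-changing summary survives 10 weeks
print(round(survival(slow, 10), 3))  # chance the slow-changing summary survives 10 weeks
```

A larger λ makes the survival probability drop faster, which is why databases with large λ are scheduled for more frequent updates.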
Overall Contributions
- Support for browsing, searching, and updating autonomous Hidden-Web databases
- Browsing: an algorithm for automatic classification of Hidden-Web databases; accuracy ~70% (F-measure), with only 120 queries issued on average and no documents retrieved
- Searching: a content-summary construction technique that samples "topically dense" areas of the database; database selection algorithms (hierarchical and shrinkage-based) that improve on existing algorithms by exploiting topical similarity
- Updating: a change model that uses database characteristics to predict the frequency of change; scheduling algorithms that exploit the model and set the update frequency for each database according to the available resources

Thank you!
Classification and content summary extraction implemented and available for download at: http://sdarts.cs.columbia.edu

Publications
Panos Ipeirotis, http://www.cs.columbia.edu/~pirot

Classification and Search of Hidden-Web Databases:
- P. Ipeirotis, L. Gravano. When one Sample is not Enough: Improving Text Database Selection using Shrinkage. [SIGMOD 2004]
- L. Gravano, P. Ipeirotis, M. Sahami. QProber: A System for Automatic Classification of Hidden-Web Databases. [ACM TOIS 2003]
- E. Agichtein, P. Ipeirotis, L. Gravano. Modelling Query-Based Access to Text Databases. [WebDB 2003]
- P. Ipeirotis, L. Gravano. Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection. [VLDB 2002]
- P. Ipeirotis, L. Gravano, M. Sahami. Query- vs. Crawling-based Classification of Searchable Web Databases. [DEB 2002]
- P. Ipeirotis, L. Gravano, M. Sahami. Probe, Count, and Classify: Categorizing Hidden-Web Databases. [SIGMOD 2001]

Approximate Text Matching:
- L. Gravano, P. Ipeirotis, N. Koudas, D. Srivastava. Text Joins in an RDBMS for Web Data Integration. [WWW 2003]
- L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava. Approximate String Joins in a Database (Almost) for Free. [VLDB 2001]
- L. Gravano, P. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava, L. Pietarinen. Using q-grams in a DBMS for Approximate String Processing. [DEB 2001]

SDARTS: Protocol & Toolkit for Metasearching:
- N. Green, P. Ipeirotis, L. Gravano. SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching. [JCDL 2001]
- P. Ipeirotis, T. Barry, L. Gravano. Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with the Open Archives Initiative. [JCDL 2002]

Future Work: Integrated Access to Hidden-Web Databases
- Query: [good indie movies playing in Manhattan tomorrow]
- Current top Google result (Feb 17th, 2004): a story at the "Seattle Times" about 9-year-old drummer Rachel Trachtenburg

Future Work: Integrated Access to Hidden-Web Databases (cont.)
- Query: [good indie movies playing in New York now] → query review databases, movie databases, and ticket databases
- All the information is already available on the Web: review databases (Rotten Tomatoes, NY Times, TONY, …); movie databases (All Movie Guide, IMDB); tickets (Moviefone, Fandango, …)
- Challenges:
  - Short term: learn to interface with different databases; adapt database selection algorithms
  - Long term: understand the semantics of the query; extract "query plans" and optimize for distributed execution; personalization; security and privacy

SDARTS: Protocol and Toolkit for Metasearching
- [Diagram: queries are dispatched through SDARTS to local collections (unstructured text documents; XML documents such as the DLI2 corpus) and Web databases (Harrison's Online, British Medical Journal, PubMed)]

SDARTS: Accomplishments
- Combines the strengths of existing Digital Library protocols (SDLIP, STARTS)
- Enables indexing and wrapping of "local" collections of text and XML documents
- Enables "declarative" wrapping of Hidden-Web databases, with no programming
- Extracts the content summary, topical focus, and technical level of each database
- Interfaces with the Open Archives Initiative, an emerging Digital Library interoperability protocol
- Critical building block for the search component of Columbia's PERSIVAL project (a 5-year, $5M NSF Digital Libraries Phase 2 project) [ACM+IEEE JCDL 2001, 2002]
- Open source, available at http://sdarts.cs.columbia.edu; ~1,000 downloads since Jan 2003
- Supervised and coordinated eight students during development

Current Work: Updating Content Summaries
- Databases are not static; their content changes. When should we refresh the content summary?
- Examined 150 real Web databases over 52 weeks
- Modeled changes using "survival analysis" techniques (Cox proportional hazards model)
- Currently developing updating algorithms: contact a database only when necessary; improve summary quality by exploiting history
- Joint work with Junghoo Cho and Alex Ntoulas (UCLA)

Other Work: Approximate Text Matching [VLDB 2001, WWW 2003]
- Matching similar strings within a relational DBMS is important: the data resides there
- Example: Service A has "Jenny Stamatopoulou", "Panos Ipirotis", "John Paul McDougal", "Jonh Smith", "Aldridge Rodriguez"; Service B has "Stamatopulou, Jenny", "Panos Ipeirotis", "John P. McDougal", "John Smith", "Al Dridge Rodriguez"
- Exact joins are not enough: typing mistakes, abbreviations, different conventions
- Introduced algorithms for mapping approximate text joins into SQL:
  1. No need for import/export of data
  2. Provides a crucial building block for data-cleaning applications
  3. Identifies many interesting matches
- Joint work with Divesh Srivastava, Nick Koudas (AT&T Labs-Research), and others
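The q-gram idea behind these approximate joins can be sketched outside SQL as follows (a minimal illustration; the cited papers implement the q-gram decomposition and matching with plain SQL over q-gram tables):

```python
def qgrams(s, q=3):
    """Pad the string and slide a window of length q over it."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def jaccard(a, b, q=3):
    """Similarity of two strings as set overlap of their q-grams."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

# Name pair from the slide: similar despite the transposition typo.
print(jaccard("Jonh Smith", "John Smith"))          # high overlap
print(jaccard("Jonh Smith", "Al Dridge Rodriguez"))  # near zero
```

Because q-gram sets are robust to local typos, "Jonh Smith" still shares most of its 3-grams with "John Smith", while unrelated names share almost none; this is what lets an approximate join surface matches that an exact equi-join misses.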
No Good Category for Database
- A general "problem" with supervised learning (example: English vs. Chinese databases)
- Devised a technique to analyze whether we can work with a given database:
  1. Find candidate text fields
  2. Send top-level queries
  3. Examine the results and construct a similarity matrix
  4. If the "matrix rank" is small, many "similar" pages are being returned, meaning that: the Web form is not a search interface, the text field is not a "keyword" field, the database is in a different language, or the database is on an "unknown" topic

Database not Category-Focused
- Extract one content summary per topic: focused queries retrieve documents about a known topic
- Each database is represented multiple times in the hierarchy

Near Future Work: Definition and Analysis of Query-based Algorithms [WebDB 2003]
- Currently, query-based algorithms are evaluated only empirically
- It is possible to model the querying process using random graph theory and: analyze thoroughly the properties of the algorithms; understand better why, when, and how the algorithms work
- Interested in exploring similar directions: adapt hyperlink-based ranking algorithms; use results in graph theory to design sampling algorithms

Database Selection (CORI, TREC6)
- [Chart: R(k) for k = 1…20 databases selected, comparing QBS-Shrinkage vs. QBS-Plain]
- More results: stemming/no stemming, CORI/LM/bGlOSS, QBS/FPS/RS/CMPL, stopwords

3-Fold Cross-Validation
- [Charts: F-measure vs. specificity threshold and F-measure vs. coverage threshold for the three folds (F-1, F-2, F-3)]
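The "matrix rank" test from the interface-analysis slide can be sketched as follows (the similarity values are hypothetical; the point is that near-identical result pages make the similarity matrix nearly rank-one):

```python
import numpy as np

# Similarity matrix between the result pages of a few top-level probe
# queries. Distinct result sets keep the rows independent; if every query
# returns the same page, the rows become linearly dependent.
responsive = np.array([
    [1.0, 0.1, 0.2],
    [0.1, 1.0, 0.1],
    [0.2, 0.1, 1.0],
])
broken = np.ones((3, 3))  # every query returned the same page

print(np.linalg.matrix_rank(responsive))  # full rank: a real keyword search field
print(np.linalg.matrix_rank(broken))      # rank 1: not a usable search interface
```

A small rank therefore signals that the form is not a real keyword-search interface (or the database is in another language or on an unknown topic), so query-based classification should not be attempted on it.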
Crawling- vs. Query-based Classification for CNN Sports
Efficiency statistics:
- Crawling-based: 1,325 min, 270,202 files, 8 GB
- Query-based: 2 min (-99.8%), 112 queries, 357 KB (-99.9%)
Accuracy statistics (IEEE DEB, March 2002): crawling-based classification classifies CNN Sports correctly only after downloading 70% of its documents.

Experiments: Precision of Database Selection Algorithms (VLDB 2002, extended version)
CORI precision by content summary generation technique (hierarchical, flat):
- FP-SVM-Documents: 0.270, 0.170
- FP-SVM-Snippets: 0.200, 0.183
- Random Sampling: 0.177
- QPilot (backlinks + front page): 0.050

F-measure vs. Hierarchy Depth (ACM TOIS 2003)
[Figure: F1-measure vs. hierarchy depth (0-3) for QP-RIPPER and QP-SVM]

Real Confusion Matrix for the Top Node of the Hierarchy

             Health  Sports  Science  Computers  Arts
Health       0.753   0.018   0.124    0.021      0.017
Sports       0.006   0.171   0.021    0.016      0.064
Science      0.016   0.024   0.255    0.047      0.018
Computers    0.004   0.042   0.080    0.610      0.031
Arts         0.004   0.024   0.027    0.031      0.298

Overlap Elimination
[Figure: F1-measure vs. Tes (0-1) for QP-RIPPER with and without overlap elimination]
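The confusion matrix above can be turned into per-category precision, recall, and F1 scores. A minimal sketch, assuming rows are the true categories and columns the assigned ones (the slide does not state the orientation, so that reading is an assumption):

```python
categories = ["Health", "Sports", "Science", "Computers", "Arts"]

# Confusion matrix from the slide; entry M[i][j] is read here as the mass of
# true category i assigned to category j -- an assumed orientation.
M = [
    [0.753, 0.018, 0.124, 0.021, 0.017],
    [0.006, 0.171, 0.021, 0.016, 0.064],
    [0.016, 0.024, 0.255, 0.047, 0.018],
    [0.004, 0.042, 0.080, 0.610, 0.031],
    [0.004, 0.024, 0.027, 0.031, 0.298],
]

def per_class_scores(matrix, labels):
    """Compute (precision, recall, F1) per class from a confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        recall = tp / sum(matrix[i])                    # diagonal over row mass
        precision = tp / sum(row[i] for row in matrix)  # diagonal over column mass
        f1 = 2 * precision * recall / (precision + recall)
        scores[label] = (precision, recall, f1)
    return scores

for label, (p, r, f1) in per_class_scores(M, categories).items():
    print(f"{label:<10} P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Under this reading, Health is both the best-recognized category and the one most often confused with Science, matching the large off-diagonal 0.124 entry.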
No Support for Conjunctive Queries (Boolean vs. Vector-space)
[Figure: F1-measure vs. Tes (0-1) for QP-RIPPER and QP-SVM under Boolean and vector-space query models]

CORI – Stemming
[Figures: Rk vs. k (databases selected, k = 1..20) for TREC4 QBS, TREC4 FPS, TREC6 QBS, and TREC6 FPS, comparing Shrinkage, Hierarchical, and Plain variants]

bGlOSS – Stemming
[Figures: Rk vs. k for TREC4 QBS, TREC4 FPS, TREC6 QBS, and TREC6 FPS, comparing Shrinkage, Hierarchical, and Plain variants]

LM – Stemming
[Figures: Rk vs. k for TREC4 QBS, TREC4 FPS, TREC6 QBS, and TREC6 FPS, comparing Shrinkage, Hierarchical, and Plain variants]
CORI – No Stemming
[Figures: Rk vs. k for TREC4 QBS, TREC4 FPS, TREC6 QBS, and TREC6 FPS, comparing Shrinkage, Hierarchical, and Plain variants]

bGlOSS – No Stemming
[Figures: Rk vs. k for TREC4 QBS, TREC4 FPS, TREC6 QBS, and TREC6 FPS, comparing Shrinkage, Hierarchical, and Plain variants]

LM – No Stemming
[Figures: Rk vs. k for TREC4 QBS, TREC4 FPS, TREC6 QBS, and TREC6 FPS, comparing Shrinkage, Hierarchical, and Plain variants]
Frequency Estimation – TREC4 – CORI
[Figures: Rk vs. k, with and without stemming, for QBS and FPS, comparing Shrinkage and Plain variants with and without frequency estimation (FreqEst vs. NoFreqEst)]

Frequency Estimation – TREC6 – CORI
[Figures: Rk vs. k, with and without stemming, for QBS and FPS, comparing Shrinkage and Plain variants with and without frequency estimation]
Universal Application of Shrinkage – TREC4 – CORI
[Figures: Rk vs. k for QBS and FPS, comparing Plain, Universal, and Shrinkage variants]

Universal Application of Shrinkage – TREC4 – bGlOSS
[Figures: Rk vs. k for QBS and FPS, comparing Plain, Universal, and Shrinkage variants]

Results: Content Summary Quality
Recall: how many of the words in the database are also in the summary?
Shrinkage-based summaries include 10-90% more words than unshrunk summaries.
[Figure: recall (%) for Shrinkage vs. No Shrinkage over the Web, TREC4, and TREC6 datasets]
Precision: how many of the words in the summary are also in the database? Shrinkage-based summaries include 5-15% words that are not in the actual database.
[Figure: precision (%) for Shrinkage vs. No Shrinkage over the Web, TREC4, and TREC6 datasets]

Results: Content Summary Quality
Rank correlation: is the word ranking in the summary similar to the ranking in the database? Shrinkage-based summaries demonstrate better word rankings than unshrunk summaries.
[Figure: rank correlation for Shrinkage vs. No Shrinkage over the Web, TREC4, and TREC6 datasets]
Kullback-Leibler divergence: is the probability distribution in the summary similar to the distribution in the database? Shrinkage improves the bad cases but makes the very good ones worse, which motivates adaptive application of shrinkage.

Model: Querying Graph
A bipartite graph between words (t1..t5) and documents (d1..d5): an edge from word t to document d means that the query [t] retrieves d.

Model: Reachability Graph
A graph over words derived from the querying graph: word t1 reaches word t2 if t1 retrieves a document d1 that contains t2.
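The reachability graph above can be explored with a breadth-first search over the word-document structure. A minimal sketch, where the `retrieves` and `contains` maps are hypothetical toy data: the slides only specify that t1 retrieves document d1, which contains t2.

```python
from collections import deque

# Hypothetical querying graph: documents retrieved by each single-word query.
retrieves = {
    "t1": ["d1"],
    "t2": ["d2", "d3"],
    "t3": ["d3"],
    "t4": ["d4"],
    "t5": ["d5"],
}

# Hypothetical document contents (the words each document contains).
contains = {
    "d1": ["t1", "t2"],  # t1 retrieves d1, which contains t2 (as on the slide)
    "d2": ["t2", "t3"],
    "d3": ["t3", "t5"],
    "d4": ["t4"],
    "d5": ["t4", "t5"],
}

def reachable_words(seed):
    """BFS over the reachability graph: from a seed query word, follow
    word -> retrieved documents -> contained words, repeatedly."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        word = queue.popleft()
        for doc in retrieves.get(word, []):
            for other in contains.get(doc, []):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return seen

print(sorted(reachable_words("t1")))
```

In this toy graph, querying with t1 eventually reaches every word, while t4 reaches only itself: a concrete instance of why the structure of the reachability graph matters for analyzing query-based sampling algorithms.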