Data Quality Challenges in Community Systems
AnHai Doan, University of Wisconsin-Madison
Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron Gao, Fei Chen, Yoonkyong Lee, Raghu Ramakrishnan, Jeff Naughton

Numerous Web Communities
• Academic domains
  - database researchers, bioinformaticians
• Infotainment
  - movie fans, mountain climbers, fantasy football
• Scientific data management
  - biomagnetic databank, E. Coli community
• Business
  - enterprise intranets, tech support groups, lawyers
• CIA / homeland security
  - Intellipedia

Many Efforts to Build Community Portals
• Initially taxonomy based (e.g., Yahoo style)
• But now many structured data portals
  - capture key entities and relationships of the community
• No general solution yet on how to build such portals

Cimple Project @ Wisconsin / Yahoo! Research
• Develops such a general solution using extraction + integration + mass collaboration
[Figure: Cimple architecture. Data sources (researcher homepages, group pages, conference pages, mailing list, DBworld, DBLP; Web pages, text documents); an ER graph (e.g., Jim Gray give-talk SIGMOD-04); services (keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary); mass collaboration; "Maintain and add more sources".]

Prototype System: DBLife
• Integrate data of the DB research community
• 1164 data sources
• Crawled daily, 11000+ pages = 160+ MB / day

Data Extraction
[Figure only]

Data Integration
[Figure: Raghu Ramakrishnan, co-authors = A. Doan, Divesh Srivastava, ...]

Resulting ER Graph
[Figure: ER graph with nodes such as the paper “Proactive Re-optimization”, Shivnath Babu, Pedro Bizarro, Jennifer Widom, David DeWitt, and SIGMOD 2005, connected by write, coauthor, advise, PC-member, and PC-Chair edges.]

Provide Services
• DBLife system

Mass Collaboration: Voting
• Picture is removed if enough users vote “no”.

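The voting rule on the last slide above ("removed if enough users vote no") could be sketched roughly as follows. This is only an illustration, not DBLife's actual logic: the slides do not say what counts as "enough", so the threshold `min_no_votes`, the requirement that "no" votes outnumber "yes" votes, and the function name `should_remove` are all assumptions.

```python
from collections import Counter


def should_remove(votes, min_no_votes=3):
    """Decide whether a picture should be removed from a page.

    votes: iterable of "yes"/"no" strings, one per user.
    min_no_votes: hypothetical threshold for "enough" no votes.
    """
    counts = Counter(votes)
    # Assumed rule: at least `min_no_votes` users said "no",
    # and "no" votes outnumber "yes" votes.
    return counts["no"] >= min_no_votes and counts["no"] > counts["yes"]


if __name__ == "__main__":
    print(should_remove(["no", "no", "no", "yes"]))
    print(should_remove(["no", "yes", "yes"]))
```

A simple count like this treats every user equally; the talk later raises noisy users and vote combination as open challenges, which a fixed threshold does not address.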
Mass Collaboration via Wiki
[Figure only]

Summary: Community Systems
• Data integration systems + extraction + Web 2.0
  - manage both data and users in a synergistic fashion
• In sync with current trends
  - manage unstructured data (e.g., text, Web pages)
  - get more structure (IE, Semantic Web)
  - engage more people (Web 2.0)
  - best-effort data integration, data spaces, pay-as-you-go
• Numerous potential applications
• But raises many difficult data quality challenges

Rest of the Talk
• Data quality challenges in
  1. Source selection
  2. Extraction and integration
  3. Detecting problems and providing feedback
  4. Mass collaboration
• Conclusions & ways forward

1. Source Selection
[Figure: Cimple architecture diagram (data sources, ER graph, services, mass collaboration; "Maintain and add more sources").]

Current Solutions vs. Cimple
• Current solutions
  - find all relevant data sources (e.g., using focused crawling, search engines)
  - maximize coverage
  - include many noisy sources
• Cimple
  - starts with a small set of high-quality “core” sources
  - incrementally adds more sources
  - only from “high-quality” places
  - or as suggested by users (mass collaboration)

Start with a Small Set of “Core” Sources
• Key observation: communities often follow an 80-20 rule
  - 20% of sources cover 80% of interesting activities
• An initial portal over these 20% is often already quite useful
• How to select these 20%
  - select as many sources as possible
  - evaluate and select the most relevant ones

Evaluate the Relevance of Sources
• Use PageRank + virtual links across entities + TF/IDF
[Figure: example with the mentions "Gerhard Weikum" and "G. Weikum".]
See [VLDB-07a]

Add More Sources over Time
• Key observation: the most important sources will eventually be mentioned within the community
  - so monitor certain “community channels” to find them
    Message type: conf. ann.
    Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
    Call for Participation
    Workshop on "Management of Uncertain Data" in conjunction with VLDB 2007
    http://mud.cs.utwente.nl
    ...
• Also allow users to suggest new sources
  - e.g., the Silicon Valley Database Society

Summary: Source Selection
• Sharp contrast to current work
  - start with highly relevant sources
  - expand carefully
  - minimize “garbage in, garbage out”
• Need a notion of source relevance
• Need a way to compute this

2. Extraction and Integration
[Figure: Cimple architecture diagram (data sources, ER graph, services, mass collaboration; "Maintain and add more sources").]

Extracting Entity Mentions
• Key idea: reasonable plan, then patch
• Reasonable plan:
  - collect person names, e.g., David Smith
  - generate variations, e.g., D. Smith, Dr. Smith, etc.
  - find occurrences of these variations
[Figure: execution plan with operators ExtractMbyName and Union over s1 … sn.]
• Works well, but can’t handle certain difficult spots

Handling Difficult Spots
• Example
  - R. Miller, D. Smith, B. Jones
  - if “David Miller” is in the dictionary, will flag “Miller, D.” as a person name
• Solution: patch such spots with stricter plans
[Figure: execution plan with operators ExtractMStrict, ExtractMbyName, Union over s1 … sn, and FindPotentialNameLists.]

Matching Entity Mentions
• Key idea: reasonable plan, then patch
• Reasonable plan
  - if mention names are the same (modulo some variation), then match
  - e.g., David Smith and D. Smith
[Figure: execution plan with operators MatchMbyName, Extract Plan, and Union over s1 … sn.]
• Works well, but can’t handle certain difficult spots

Handling Difficult Spots
[Figure: DBLP page for Chen Li, listing
  ···
  41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.
  ···
  38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
  ···
alongside an execution plan with operators MatchMStrict, MatchMbyName, Extract Plan, Union over {s1 … sn} \ DBLP, and Extract Plan over DBLP.]
• Estimate the semantic ambiguity of data sources
  - use social networking techniques [see ICDE-07a]
• Apply stricter matchers to more ambiguous sources

Going Beyond Sources: Difficult Data Spots Can Cover Any Portion of Data
[Figure: execution plan with operators MatchMStrict2, MatchMStrict, "Mentions that Match “J. Han”", MatchMbyName, Extract Plan (twice), Union over {s1 … sn} \ DBLP, and DBLP.]

Summary: Extraction and Integration
• Most current solutions
  - try to find a single good plan, applied to all of the data
• Cimple solution: reasonable plan, then patch
• So the focus shifts to:
  - how to find a reasonable plan?
  - how to detect problematic data spots?
  - how to patch those?
• Need a notion of semantic ambiguity
• Different from the notion of source relevance

3. Detecting Problems and Providing Feedback
[Figure: Cimple architecture diagram (data sources, ER graph, services, mass collaboration; "Maintain and add more sources").]

How to Detect Problems?
• After extraction and matching, build services
  - e.g., superhomepages
• Many such homepages contain minor problems
  - e.g., X graduated in 19998
    X chairs SIGMOD-05 and VLDB-05
    X published 5 SIGMOD-03 papers
• Intuitively, something is semantically incorrect
• To fix this, let's build a Semantic Debugger
  - learns what is a normal profile for a researcher, paper, etc.
  - alerts the builder to potentially buggy superhomepages
  - so feedback can be provided

What Types of Feedback?
• Say that a certain data item Y is wrong
• Provide the correct value for Y, e.g., Y = SIGMOD-06
• Add domain knowledge
  - e.g., no researcher has ever published 5 SIGMOD papers in a year
• Add more data
  - e.g., X was advised by Z
  - e.g., here is the URL of another data source
• Modify the underlying algorithm
  - e.g., pull out all data involving X; match using names and co-authors, not just names

How to Make Providing Feedback Very Easy?
• “Providing feedback” for the masses
  - in sync with current trends of empowering the masses
• Extremely crucial in the DBLife context
• If feedback can be provided easily
  - can get more feedback
  - can leverage the mass of users
• But this turned out to be very difficult

How to Make Providing Feedback Very Easy?
• Say that a certain data item Y is wrong
• Provide the correct value for Y, e.g., Y = SIGMOD-06
• Add domain knowledge
• Add more data
• Modify the underlying algorithm
[Slide annotations: "Provide form interfaces"; "Provide a Wiki interface"; "Critical in our experience, but unsolved"; "Unsolved, some recent interest in how to mass customize software".]
See our IEEE Data Engineering Bulletin paper on user-centric challenges, 2007

What Feedback Would Make the Most Impact?
• I have one spare hour and would like to “teach” DBLife
  - what problems should I work on?
  - what feedback should I provide?
• Need a Feedback Advisor
  - define a notion of system quality Q(s)
  - define questions q1, ..., qn that DBLife can ask users
  - for each qi, evaluate its expected improvement in Q(s)
  - pick the question with the highest expected quality improvement
• Observations
  - a precise notion of system quality is now crucial
  - this notion should model the expected usage

Summary: Detection and Feedback
• How to detect problems?
  - Semantic Debugger
• What types of feedback & how to easily provide them?
  - critical, largely unsolved
• What feedback would make the most impact?
  - crucial in large-scale systems
  - need a Feedback Advisor
  - need a precise notion of system quality

4. Mass Collaboration
[Figure: Cimple architecture diagram (data sources, ER graph, services, mass collaboration; "Maintenance and expansion").]

Mass Collaboration: Voting
• Can be applied to numerous problems
• Example: Matching
    Dell laptop X200 with mouse ...
    Mouse for Dell laptop 200 series ...
    Dell X200; mouse at reduced price ...
• Hard for machine, but easy for human

Challenges
• How to detect and remove noisy users?
  - evaluate them using questions with known answers
• How to combine user feedback?
  - # of yes votes vs. # of no votes
See [ICDE-05a, ICDE-08a]

Mass Collaboration: Wiki
[Figure: diagram labeled Data Sources, M, G, T, V1, W1, V2, W2, V3, W3, V3’, W3’, T3’, and user u1.]
• Community wikipedia
  - built by machine + human
  - backed up by a structured database

Mass Collaboration: Wiki
Machine:
    <# person(id=1){name}=David J. DeWitt #>
    <# person(id=1){title}=Professor #>
    Interests: <# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
  Rendered as:
    David J. DeWitt
    Professor
    Interests: Parallel Database
Human:
    <# person(id=1){name}=David J. DeWitt #>
    <# person(id=1){title}=John P. Morgridge Professor #>
    <# person(id=1){organization}=UW #> since 1976
    Interests: <# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
Machine:
    <# person(id=1){name}=David J. DeWitt #>
    <# person(id=1){title}=John P. Morgridge Professor #>
    <# person(id=1){organization}=UW-Madison #> since 1976
    Interests: <# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
               <# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>
  Rendered as:
    David J. DeWitt
    John P. Morgridge Professor
    UW-Madison since 1976
    Interests: Parallel Database, Privacy

Sample Data Quality Challenges
• How to detect noisy users?
  - no clear solution yet
  - for now, limit editing to trusted editors
  - modify the notion of system quality to account for this
• How to combine feedback, handle inconsistent data?
  - user vs. user
  - user vs. machine
• How to verify claimed ownership of data portions?
  - e.g., this superhomepage is about me
  - only I can edit it
See [ICDE-08b]

Summary: Mass Collaboration
• What can users contribute?
• How to evaluate user quality?
• How to reconcile inconsistent data?

Additional Challenges
• Dealing with evolving data (e.g., matching)
• Iterative code development
• Lifelong quality improvement
• Querying over inconsistent data
• Managing provenance and uncertainty
• Generating explanations
• Undo

Conclusions
• Community systems:
  - data integration + IE + Web 2.0
  - potentially very useful in numerous domains
• Such systems raise myriad data quality challenges
  - subsume many current challenges
  - suggest new ones
• Can provide a unifying context for us to make progress
  - building systems has been a key strength of our field
  - we need a community effort, as always
See “cimple wisc” for more detail
Let us know if you want code/data
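
Appendix sketch: the "reasonable plan, then patch" idea from the Extracting Entity Mentions slides can be illustrated with a small Python example. It is a rough sketch under stated assumptions, not Cimple's implementation: the function names loosely echo the operators named on the slides (ExtractMbyName, FindPotentialNameLists, ExtractMStrict), but the set of name variations, the regular expression for name lists, and the strict rule (drop "Last, F." variants inside a name list) are all hypothetical choices made for illustration.

```python
import re


def name_variations(full_name):
    """Generate a few variations of a 'First Last' name (assumed set)."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {
        full_name,                 # David Smith
        f"{first[0]}. {last}",     # D. Smith
        f"Dr. {last}",             # Dr. Smith
        f"{last}, {first[0]}.",    # Smith, D.
    }


def extract_by_name(text, dictionary):
    """Reasonable plan: report every occurrence of any variation."""
    mentions = []
    for name in dictionary:
        for variant in name_variations(name):
            for m in re.finditer(re.escape(variant), text):
                mentions.append((m.start(), variant, name))
    return sorted(mentions)


# Hypothetical pattern for lists such as "R. Miller, D. Smith, B. Jones".
NAME_LIST = re.compile(r"[A-Z]\. [A-Z][a-z]+(?:, [A-Z]\. [A-Z][a-z]+)+")


def find_potential_name_lists(text):
    """Return (start, end) spans that look like comma-separated name lists."""
    return [m.span() for m in NAME_LIST.finditer(text)]


def extract_with_patch(text, dictionary):
    """Patch: inside name lists, reject 'Last, F.' variants (stricter rule)."""
    spans = find_potential_name_lists(text)
    kept = []
    for start, variant, name in extract_by_name(text, dictionary):
        in_list = any(s <= start < e for s, e in spans)
        if in_list and "," in variant:
            continue  # likely straddles two adjacent names in the list
        kept.append((start, variant, name))
    return kept


if __name__ == "__main__":
    text = "R. Miller, D. Smith, B. Jones"
    print(extract_by_name(text, ["David Miller"]))     # false positive
    print(extract_with_patch(text, ["David Miller"]))  # patched
```

On the slide's own example, the plain plan flags "Miller, D." because that string spans two neighboring names, while the patched plan drops it and still keeps a genuine match such as "D. Smith" for "David Smith".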
