DBpedia: Querying Wikipedia like a Database
Nucleus for a Web of Open Data
Songqing Liu
5/23/2017
CSCI8986: DBpedia

DBpedia is an effort to
• Extract structured information from Wikipedia
• Make this information available on the Web under an open license
• Interlink the DBpedia dataset with other datasets on the Web

Outline
• Extracting Structured Information from Wikipedia
• DBpedia Dataset
• Accessing the DBpedia Dataset over the Web
• Use Cases
  • Improving Wikipedia Search
  • Royalty-Free Data Source for other Applications
  • Nucleus for the Emerging Web of Data

• Title
• Abstract
• Infoboxes
• Geo-coordinates
• Categories
• Images
• Links
  • Other languages
  • Other wiki pages
  • To the web
• Redirects
• Disambiguations

Extracting Structured Information from Wikipedia
• Wikipedia consists of 12.379 million articles
  • In 275 languages (285 in total)
  • 35 million users
  • Monthly growth rate: 4%
• Wikipedia articles contain structured information
  • Infoboxes, which use a template mechanism
  • Images depicting the article's topic
  • Categorization of the article
  • Links to external webpages
  • Intra-wiki links to other articles
  • Inter-language links to articles about the same topic in different languages

Overview of the components
[Diagram: Wikipedia dumps (article texts, DB tables) feed the extraction of articles, categories and infoboxes; the resulting DBpedia datasets are loaded into Virtuoso and MySQL and published via the SPARQL endpoint, the Linked Data interface and the query builder; these serve Web 2.0 mashups, traditional web browsers, Semantic Web browsers and the SNORQL browser.]

Infobox template
[Figure: infobox template]

Extracting Infobox Data (RDF)
Webpage: http://en.wikipedia.org/wiki/Calgary
DBpedia resource: http://dbpedia.org/page/Calgary
    dbpedia:native_name "Calgary" ;
    dbpedia:altitude "1048" ;
    dbpedia:population_city "1096833" ;
    dbpedia:population_metro "1214839" ;
    dbpedia:mayor_name dbpedia:Naheed_Nenshi ;
    dbpedia:governing_body dbpedia:Calgary_City_Council .

Question

Extracted information
• Short and long abstracts in different languages
    dbpedia:Calgary dbpedia:abstract "Calgary is the largest ..."@en ;
                    dbpedia:abstract "Calgary ist eine Stadt ..."@de .
• Categorization information
    dbpedia:Calgary skos:subject dbpedia:Category_Cities_in_Alberta ;
                    skos:subject dbpedia:Host_cities_Olympic_Games .
• Links to the original Wikipedia articles, pictures and relevant external web pages
    dbpedia:Calgary foaf:page <http://en.wikipedia.org/wiki/Calgary> ;
                    dbpedia:wikipage-de <http://de.wikipedia.org/wiki/Calgary> ;
                    foaf:depiction <http://upload.wikimedia.org/thumb/3/32> ;
                    dbpedia:reference <http://www.calgary.ca> ;
                    dbpedia:reference <http://www.tourismcalgary.com> .

DBpedia Basics
• Structured information can be extracted from Wikipedia
  • It serves as a basis for enabling sophisticated queries against Wikipedia content
• The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model
  • For representing the extracted information and for publishing it on the Web
• The SPARQL query language is used to query this data
  • The Developers Guide to Semantic Web Toolkits lists development toolkits, by programming language, that can process DBpedia data

The DBpedia Dataset
• Describes 20.8 million things, of which 10.5 million overlap with concepts from the English DBpedia
• The full DBpedia dataset features labels and abstracts for 10.3 million unique things in 111 different languages
• 8.0 million links to images and 24.4 million HTML links to external web pages
• 27.2 million data links into external RDF data sets
• 55.8 million links to Wikipedia categories and 8.2 million links to YAGO categories
• Consists of 1.89 billion pieces of information (RDF triples), of which 400 million were extracted from the English edition
• English version: 3.77 million things, of which 2.35 million are classified in a consistent ontology
  • 764,000 persons
  • 573,000 places
  • 333,000 creative works: music albums, films and video games
  • 192,000 organizations: companies and educational institutions
  • 202,000 species and 5,500 diseases

Multi-Lingual Abstracts
• The dataset contains a short and a long English abstract for each concept
• Short abstracts by language
  • English: 3,770,000
  • German: 1,244,000
  • French: 1,197,000
  • Dutch: 993,000
  • Italian: 882,000
  • Spanish: 879,000
  • Polish: 848,000
  • Japanese: 781,000
  • Portuguese: 699,000
  • Swedish: 457,000
  • Chinese: 445,000

Accessing the DBpedia Dataset over the Web
• SPARQL Endpoint
• Linked Data Interface
• DB Dumps for Download

SPARQL
• SPARQL is a query language for RDF
• RDF is a directed, labeled graph data format for representing information on the Web
• The SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF
• SPARQL can be used to express queries across diverse data sources
  • Whether the data is stored natively as RDF or viewed as RDF via middleware

DBpedia SPARQL Endpoint
• http://dbpedia.org/sparql
• Hosted on an OpenLink Virtuoso server
• Can answer SPARQL queries such as
  • Give me all sitcoms that are set in NYC
  • All tennis players from Moscow
  • All films by Quentin Tarantino
  • All German musicians who were born in Berlin in the 19th century

Interesting Example
Goal: find everything Bart wrote on the blackboard in season 12 of The Simpsons.
• The Wikipedia pages for the Simpsons episodes are the identified "things" that we would consider as the subjects of our RDF triples.
• The content of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12".
• The episode's DBpedia page tells us that dbpedia2:blackboard is the property name for the Wikipedia infobox "Chalkboard" field.

    SELECT ?episode ?chalkboard_gag
    WHERE {
      ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
      ?episode dbpedia2:blackboard ?chalkboard_gag
    }

[Table: query results]

Linked Data Interface
• A large body of information and knowledge is already available in structured form, yet is not accessible on the Web
• Integrating open data provides real value
• Linked Data on the Web can be accessed using Semantic Web browsers
• Semantic Web browsers enable users to navigate between different data sources
• Linked Data also allows the robots of Semantic Web search engines to follow these links to crawl the Semantic Web

Linked Data Interface
• The project follows the Linked Data principles
• All concepts are identified using Uniform Resource Identifier (URI) references; a URI is a compact string of characters used to identify or name a resource
• The Linked Data interface can be used by
  • Semantic Web browsers, like
    • DISCO Hyperdata Browser
    • Tabulator Browser
    • OpenLink RDF Browser
  • Semantic Web crawlers, like
    • Zitgist (Zitgist LLC, USA)
    • SWSE (DERI, Ireland)
    • Swoogle (UMBC, USA)

DBpedia Use Cases
• Improving Wikipedia Search
• Royalty-Free Data Source for other Applications
• Nucleus for the Emerging Web of Data

Improving Wikipedia Search (Various Interfaces)

Royalty-Free Data Source for other Applications
• DBpedia is published under the GNU Free Documentation License
• Example use case: SPARQL-generated tables within webpages

Nucleus for the Emerging Web of Data
• W3C SWEO Linking Open Data Project
• 295 data sets consisting of over 31 billion RDF triples, interlinked by 504 million RDF links

DBpedia User Applications
• AboutThisDay.com: search engine for births and deaths of people, etc.
• DBpedia Mobile: map view annotated with DBpedia entities
• RelFinder: connections between objects
• SemLens: uses scatter plots and semantic lenses to analyze DBpedia data
• DBpedia Navigator
• Alumis: answer engine based on DBpedia

How can I support DBpedia?
• Develop another cool user interface to DBpedia
• Publish more RDF datasets with dereferenceable URIs
• Interlink your datasets with DBpedia

Discussions?
• DBpedia website: http://wiki.dbpedia.org/About
• Linking Open Data Project website: http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
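
Appendix: sending the chalkboard query from a program

The chalkboard-gag query from the SPARQL endpoint slides can also be sent from code instead of a browser. The following is a minimal Python sketch, not the deck's own method. It assumes the endpoint accepts a standard SPARQL protocol GET request with a `query` parameter and returns SPARQL JSON results when asked via the `Accept` header. The prefix URIs for `skos` and `dbpedia2`, and the helper names `build_query`, `build_request`, `bindings_to_rows` and `run_query`, are assumptions made for illustration. The vocabulary on the live endpoint may differ from what the slides show, so the query may need adjusting before it returns rows.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint URL given on the slides

# Assumed prefix URIs (not stated on the slides).
PREFIXES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dbpedia2": "http://dbpedia.org/property/",
}


def build_query(category_uri, prop="dbpedia2:blackboard"):
    """Return the chalkboard-gag query with explicit PREFIX declarations."""
    prefix_lines = "\n".join(
        f"PREFIX {name}: <{uri}>" for name, uri in PREFIXES.items()
    )
    return (
        f"{prefix_lines}\n"
        "SELECT ?episode ?chalkboard_gag WHERE {\n"
        f"  ?episode skos:subject <{category_uri}> .\n"
        f"  ?episode {prop} ?chalkboard_gag .\n"
        "}"
    )


def build_request(query, endpoint=ENDPOINT):
    """Wrap the query in a SPARQL protocol GET request asking for JSON."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def bindings_to_rows(data):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query over the network and return the result rows."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as resp:
        return bindings_to_rows(json.load(resp))


if __name__ == "__main__":
    category = (
        "http://dbpedia.org/resource/"
        "Category:The_Simpsons_episodes%2C_season_12"
    )
    # Only prints the query text; call run_query(...) to contact the endpoint.
    print(build_query(category))
```

Keeping the query construction, the HTTP request and the result flattening in separate functions means the first and last can be checked without network access; only `run_query` contacts the endpoint.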