Inconsistent Data on the Semantic Web
A Theoretical Approach
Brian Goodrich

The Problem
• A computer application has a set of inputs and produces a set of outputs based on those inputs and its internal logic.
• If an application is given input that puts it in a conflicted state when deciding its output, it will crash unless it has some logic for resolving that conflict.
• The Semantic Web is based on being able to parse human intent from structured, semistructured, and unstructured data on the Web.
• Human intent is frequently conflicting.

Conflicting Data Sources
• Malicious – deceptive or rerouting attempts, or just ignorantly incorrect information
• Incomplete Information – having insufficient context or simply unfinished data
• Humor – especially sarcasm, satire and exaggeration (e.g. political cartoons)
• Time – what once was one thing is now another (e.g. quality of service, price, etc.)
• Ontological Deficiency – when the extraction ontology lacks sufficient vividness to separate data appropriately

Solution
• Fast – Maintains the current speed of the Web.
• Accurate – Makes correct decisions about which data to rely on.
• Dynamic – Keeps pace with change on the Web.

Thesis
To propose a method for simplifying the task of dealing with conflicting data on the Semantic Web in a fast, accurate and dynamic way by supplying each web source with a derived indicator of its communal usage, called a Consensual Reliability Score (CRS).

Methods
CRSBot
• Site Type (a)
• Incoming Index (b)
• Usage Mining (c)
• Direct Survey (d)
• Formula for deriving CRS from inputs a, b, c, & d.
• With weighted constants z, y, x, & w.

Site Type Mining (a * z)…
Five types of Web pages:
• Head Pages
• Navigation Pages
• Content Pages
• Look-up Pages
• Personal Pages

Incoming Index …(b * y)…
• Distributed web crawler that counts hyperlinks, then traverses the unique hyperlink paths, looking for additional links.
• Link counts are stored in a hash indexed by the destination of the hyperlinks.
• Provides a dynamic count of how often the internet as a whole points to a given web source, and therefore an indication of how often people use that source.
• Excludes orphan sites (mostly personal sites and spam pop-ups).
• Based on the success of the Google search engine.

Usage Mining …(c * x)…
• Most straightforward approach to testing how often people use a web source.
• Query the site's number of hits: how many people have seen this site?
• Problem: unlike the Incoming Index method, does not exclude orphan sites.
• Further experimentation needed to determine x's weight.

Direct Survey …(d * w)…
• Most reliable method of determining reliability.
• Manually query users directly.
• Too slow and costly to be considered a whole solution, but can assist in CRS derivation.
• Hopefully offsets frequently visited sites with no true info (onion.com, humor, etc.).
• More experimentation needed to determine w's weight.

Review
Semantic Agents
Semantic Browsers
CRSBot
• Site Type (a)
• Incoming Index (b)
• Usage Mining (c)
• Direct Survey (d)

“Classical content data mining is not applicable in this case (CRS derivation) because it is the content of the web sources that is in question.”
- Brian Goodrich

Storage
Global Index
• Fast access
• Centralized storage for CRSBot
• Centralized vulnerability
• Vital non-distributed resource in a distributed system
Local Storage
• Non-centralized vulnerability
• Non-unified derivation formula (disrupts trust algorithm)
Local Derivation
• Too slow to be useful (problem size too large)

Related Work
Tim Berners-Lee:
There is a choice here, and I am not sure right now which appeals to me most. One is to say precisely, "whatever any document says of the form xxxx is a member of W3C so long as it is signed with key 32457934759432".
The other is to say, "whatever is of form xxxx and can be inferred from information signed with key 32457934759432".
• Both choices have problems: both use static references in a dynamic environment (the Web).

Contributions
• CRS provides a fast and accurate measure of community consensus on the Web.
• Allows reliable decisions between conflicting data on the Web, fine-tuning the results from the Semantic Web.

Limitations
• Totally reliant on usage patterns of the internet, which may not always reflect which data is more correct.
• Reflects only consensus about a data source, not the actual data contained in it.
• Cannot express complex or compound relationships or extract partial truths.

Questions?
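
Illustrative sketch (not part of the original slides)
The slides describe the Incoming Index as link counts stored in a hash indexed by hyperlink destination, and the CRS as derived from inputs a, b, c, d with weighted constants z, y, x, w. The slides show only the individual terms (a * z), (b * y), (c * x), (d * w) and do not state how they are combined, so the plain weighted sum below is an assumption. The function names, the in-memory link graph standing in for a distributed crawler, the numeric encoding of site type, and all sample values are likewise hypothetical.

```python
from collections import Counter


def count_incoming_links(link_graph):
    """Count hyperlinks per destination.

    link_graph maps a source page to the list of pages it links to.
    This in-memory dict is a stand-in (assumption) for the distributed
    crawler described in the slides.
    """
    incoming = Counter()  # the "hash indexed by destination"
    for destinations in link_graph.values():
        for destination in destinations:
            incoming[destination] += 1
    return incoming


def crs(a, b, c, d, z, y, x, w):
    """Combine the four inputs with their weights.

    ASSUMPTION: the terms are simply added; the slides only show the
    terms (a * z), (b * y), (c * x), (d * w), not the full formula.
    """
    return a * z + b * y + c * x + d * w


# Hypothetical three-page web: a.example is an orphan (nothing links to it).
link_graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": [],
}
incoming = count_incoming_links(link_graph)

# Hypothetical inputs for c.example: site-type score a, incoming-link
# count b, hit count c, survey score d, with made-up weights z, y, x, w.
score = crs(a=1.0, b=incoming["c.example"], c=10, d=0.5,
            z=0.5, y=2.0, x=0.1, w=4.0)
print(incoming["c.example"], score)
```

Orphan pages never appear as a destination, so their incoming count is zero, which matches the slide's point that the Incoming Index excludes them. The weights here are placeholders; the slides note that further experimentation is needed to determine them.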