Data and text mining: the search for unknown knowns
Geoffrey Bilder
UKSG, 2007
[email protected]

"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns - the ones we don't know we don't know."

The Mining Metaphor
• Gold Mining
• Diamond Mining
• Data Mining

Data Mining - What it isn't
≠ Information Retrieval
≠ Information Extraction
≠ Information Analysis

Information Retrieval
+ Information Extraction
+ Information Analysis
Data Mining: new, previously unknown information

And so what is text data mining?

Text Mining
Information Retrieval
+ Information Extraction
+ Information Analysis

The crucial question for publishers is: "If 'hiding' information in unstructured text is a problem, then shouldn't we be exploring new ways to 'publish'?"

So how did we get here?
• The word tobacco originates from the Taino Indians.
• There is no I in the word Team.
• The book captured the zeitgeist of the time.
• I am sure that I turned the gas off.

The book captured the <foreign_phrase lang="DE">zeitgeist</foreign_phrase> of the time.
I am <emphasis>sure</emphasis> that I turned the gas off.

Semantic Web "Light"

But we can do more...

The web as a database

The Relational Model

Title        Author               ISBN-13          Publisher
Labyrinths   Jorge Luis Borges    9780811200127    New Directions
Hopscotch    Julio Cortazar       9780394752846    Pantheon
The Aleph    Jorge Luis Borges    9780140286809    Penguin
...          ...                  ...              ...

• Rows represent things
• Columns are properties
• The thing's property

The book has an author "Jorge Luis Borges"

Subject      Predicate        Object
The book     has an author    "Jorge Luis Borges"

Subject URI                                  Predicate        Object URI
http://www.amazon.com/isbn/978-0140286809    has an author    http://www.wikipedia.com/borges

RDF: Resource Description Framework

Blog, Journal A, Journal B, Wiki, Personal Website, OPAC

SPARQL

http://api.ingentaconnect.com/content/cabi/nrr/latest?format=rss

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name

• Creative Commons
• FOAF
• Geo
• RSS 1.0
• FRBR
• SKOS

The Early Modern Internet

Data Mining = Information retrieval + Information extraction + Information analysis...
With the goal of discovering new, previously unknown information

Text Data Mining = Complex data extraction layer + data mining

Why do we publish text?

Thank You
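
Appendix sketch (not part of the talk): the step from the Relational Model slides to the Subject / Predicate / Object slides can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions: the function names `row_to_triples` and `match` are hypothetical, the `urn:isbn:` prefix is used here as a stand-in subject identifier instead of the Amazon URL shown on the slide, and the pattern matching only imitates the shape of the SPARQL query (select distinct values, order them); it is not a SPARQL engine.

```python
# Illustration only: turn the rows of the book table into
# (subject, predicate, object) triples, then query them by pattern.
# All names below are hypothetical and not taken from the talk.

BOOKS = [
    {"Title": "Labyrinths", "Author": "Jorge Luis Borges",
     "ISBN-13": "9780811200127", "Publisher": "New Directions"},
    {"Title": "Hopscotch", "Author": "Julio Cortazar",
     "ISBN-13": "9780394752846", "Publisher": "Pantheon"},
    {"Title": "The Aleph", "Author": "Jorge Luis Borges",
     "ISBN-13": "9780140286809", "Publisher": "Penguin"},
]


def row_to_triples(row, key="ISBN-13"):
    """Each row is a thing; each remaining column becomes one triple."""
    # Assumption: an ISBN URN identifies the book (the slide uses a URL).
    subject = f"urn:isbn:{row[key]}"
    return [(subject, column, value)
            for column, value in row.items() if column != key]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


triples = [t for row in BOOKS for t in row_to_triples(row)]

# Rough analogue of "SELECT DISTINCT ?name ... ORDER BY ?name":
# collect the distinct objects of every "Author" triple, sorted.
authors = sorted({o for _, _, o in match(triples, predicate="Author")})
print(authors)
```

Because every fact is a self-describing triple rather than a cell in one particular table, triples gathered from different sources (a blog, a journal, an OPAC) could be pooled into one list and queried with the same `match` call, which is the point the "web as a database" slides appear to be making.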