Ontology-Guided Search and Text Mining for Intelligence Gathering
Kurt Godden, Ph.D.
MSR Lab, R&D
[email protected]

Outline
• Definitions of terms
• Customers (Who cares?)
• Finding Text – ontology-guided search
• Text Processing
  – Content extraction
  – Text Mining
• Temporal Data Mining at GM
• Multi-Lingual Text Processing
• Summary

What is Text Mining?
• Data Mining:
  – The process of analyzing data to discover new patterns or relationships
  – 1st International Conference was KDD-95
  – http://www-aig.jpl.nasa.gov/public/kdd95/
• Text Mining is a Subfield of Data Mining
  – As such, ideally TM is the process of analyzing unstructured text to discover new patterns or relationships
  – In practice, TM often refers simply to the Content Extraction (CE) of structured data from unstructured text, usually by means of finite-state parsers.

Content Extraction: Structured Data from Unstructured Text
“Company XYZ is known to ship products through the port of Dubai.”
From Text to Actionable Knowledge:
• Automatic multi-language scanning
• Entity and Relation extraction/distillation
• Filtering
[Figure: network graph whose nodes are extracted locations, e.g. DubaiUAE, ShanghaiPort, Hamburg, Karachi]
<XYZ-Corp, exports-through, Dubai>

Who Cares?
• Government
  – NSA, CIA, DIA, DHS, DARPA
• Industry
  – Automotive
  – Chemical
  – Pharmaceutical
  – Legal
  – Consumer goods
  – Aerospace

Why do they care?
• Intelligence and Security
  – Valdis E. Krebs was able to manually map much of the 9/11 terrorist cell from public documents.
    • http://vlado.fmf.uni-lj.si/pub/networks/doc/Seminar/Krebs.pdf
• Industrial
  – Urban Legend: (Is it true?) “80% of all corporate knowledge is in text.”
  – Market research
  – Fraud detection
  – Root cause analysis
  – Document clustering and categorization
  – Competitive intelligence
  – Patent analysis
  – etc.

Before Mining Must Come Text
• How to find it?

Ontology-Guided Search (OGS)
• Oft-cited definition of ontology by T.R. Gruber:
  – An ontology is a formal specification of a shared conceptualization.
• www.vivisimo.com clusters search results according to semantic categories
• OGS: use an ontology to guide the search for documents to include not only keywords of interest, but also terms that are semantically related to those keywords

What ontology to use?
• Public
  – WordNet: http://wordnet.princeton.edu/
    • Organizes content words (N, V, Adj, Adv) into sets of semantically related concepts connected by relations
    • Currently 207k word-sense pairs
      – <bank1, monetary institution>
      – <bank2, land adjacent to river>
• Custom
  – Parts
  – Products
  – Processes
• Tool: Protégé at http://protege.stanford.edu/

Ontology-Guided Search (OGS)
[Diagram: clusters of related search terms]
• avoids, avoiding, avoided
• neighborhood, neighborhoods, suburb, suburbs
• riot, riots, “civil unrest”
• “driving through”, “drive through”, “drove through”
• Use ontology to search not only on keywords, but on semantically-related keywords

Pitfalls of OGS
• Beware of semantically related terms
• Simulation of OGS using WordNet
  – Original query:
    • Which neighborhoods of Paris are safe?
  – One of several transformed queries was:
    • Which suburbs of Paris are condoms?
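The query expansion behind OGS, and the pitfall just described, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_ONTOLOGY`, its sense labels, and the helper names `expand_term` and `expand_query` are hypothetical stand-ins, not WordNet's actual data or API and not the tool used in the simulation.

```python
from itertools import product

# Hypothetical toy ontology: term -> {sense label: related terms}.
# A real system would draw these from WordNet or a custom ontology.
TOY_ONTOLOGY = {
    "neighborhoods": {"district": ["suburbs"]},
    "safe": {
        "free from danger": ["secure"],
        "contraceptive device": ["condoms"],
    },
}


def expand_term(term, allowed_senses=None):
    """Return the term plus its related terms, optionally limited to given senses."""
    alternatives = [term]
    for sense, related in TOY_ONTOLOGY.get(term, {}).items():
        if allowed_senses is None or sense in allowed_senses:
            alternatives.extend(related)
    return alternatives


def expand_query(query, allowed_senses=None):
    """Generate every query variant obtained by swapping in related terms."""
    options = [expand_term(token, allowed_senses) for token in query.split()]
    return [" ".join(choice) for choice in product(*options)]


query = "which neighborhoods of paris are safe"

# Naive expansion follows every sense, including the wrong one.
naive = expand_query(query)

# Restricting expansion to the intended senses avoids the bad variant.
filtered = expand_query(query, allowed_senses={"district", "free from danger"})
```

Here `naive` contains the unwanted variant "which suburbs of paris are condoms", while `filtered` does not. Choosing the intended sense (word-sense disambiguation) is the hard part in practice; the sketch simply assumes the senses are supplied by hand.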
Content Extraction Technology
• Regular Expressions Mapped to Semantic Templates
• Regular Expression for Passives:
  NP1 BE TV [by NP2]
  “The lecture was presented by Kurt Godden”
• Mapping of Match Registers to Template
  <NP2:agent, TV:relation, NP1:object>
  <kg, presented, lecture>
• Post-Processing Rule: if NP2 is empty string, then use ‘someone’:agent

Content Extraction Example
“Some 40 vehicles were torched in the Val d'Oise area NW of Paris.”
http://www.breitbart.com/news/2005/11/04/D8DLFA780.html
For pattern: NP1 BE TV [by NP2]
• ‘vehicles’ matches NP1
• ‘were’ matches BE
• ‘torched’ matches TV
• No match for NP2
• Canonicalize tokens via a domain ontology (e.g. vehicles→vehicle, torched→burn)
  <someone, burn, vehicle>
• Additional triples can be matched by other RegExp patterns, giving:
  <vehicle, count, 40>
  <vehicle, located-in, val-d’oise>
  <val-d’oise, near, paris>

Why Only Regular Expressions?
• Computational Efficiency
• Practical Adequacy
• Workaround for lack of recursion: Lots of RE’s!
  NP → NP and NP
  becomes
  NP → CN and CN
  NP → CN and CN and CN
  NP → NAME and NAME
  NP → NAME and NAME and NAME

After Text Must Come Mining
• Temporal Data Mining research by K.P. Unnikrishnan (GM R&D) and P.S. Sastry (IISc, Bangalore)
• TDMiner
  – Proprietary tool
  – Discovers frequent sequences of events from symbolic data

For More Info:
• 4th Workshop on Temporal Data Mining: Network Reconstruction from Dynamic Data
  – http://www.kdd2006.com/workshops.html
• Laxman, Sastry and Unnikrishnan. “Discovering Frequent Episodes and Learning Hidden Markov Models: a Formal Connection.” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1505-1517. 2005

Network Reconstruction
• How to determine directed, acyclic graphs from sequential event data
[Figure: example graph with nodes z, x, a, g, n, p]

Multilingual Problem
• What if source text is not in English?

Machine Translation (MT)
• Free, web-based tools not state-of-the-art
  e.g. http://babelfish.altavista.com/
• LanguageWeaver uses Statistical-Based MT
  Spin-off of USC Information Sciences Institute
  www.languageweaver.com

Hypothesis
• Effective Content Extraction rules can be custom-developed for raw machine-translated text.

Summary
• Text Mining Can Offer Real Value
  – Used Extensively by Gov’t Intel Agencies
  – Several COTS tools available for Content Extraction:
    • SAS Text Miner
    • AeroText (Lockheed Martin)
    • ClearForest
    • Attensity
    • etc.
  – GATE – Univ. of Sheffield, open-source
  – http://gate.ac.uk/
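As a rough illustration of the passive pattern NP1 BE TV [by NP2] from the Content Extraction slides, the mapping from match registers to a semantic template might look like the Python sketch below. The regular expression is heavily simplified (a single-word NP1, a verb ending in "ed", a capitalized NP2), and `CANONICAL`, `canon` and `extract_passive` are hypothetical names standing in for a domain ontology and a real finite-state parser; none of them come from the tools listed in the Summary.

```python
import re

# Assumed toy lexicon standing in for a domain ontology.
CANONICAL = {"vehicles": "vehicle", "torched": "burn"}

# Simplified passive pattern: NP1 BE TV [by NP2].
PASSIVE = re.compile(
    r"\b(?P<np1>\w+)\s+(?:is|are|was|were)\s+(?P<tv>\w+ed)\b"
    r"(?:\s+by\s+(?P<np2>[A-Z]\w*(?:\s+[A-Z]\w*)*))?"
)


def canon(token):
    """Lower-case a token and map it to its canonical form if one is known."""
    lowered = token.lower()
    return CANONICAL.get(lowered, lowered)


def extract_passive(sentence):
    """Return an (agent, relation, object) triple, or None if nothing matches."""
    match = PASSIVE.search(sentence)
    if match is None:
        return None
    # Post-processing rule: an unmatched NP2 becomes the agent 'someone'.
    agent = match.group("np2") or "someone"
    return (canon(agent), canon(match.group("tv")), canon(match.group("np1")))


lecture = extract_passive("The lecture was presented by Kurt Godden")
riot = extract_passive(
    "Some 40 vehicles were torched in the Val d'Oise area NW of Paris."
)
```

With these definitions, `lecture` is `("kurt godden", "presented", "lecture")` and `riot` is `("someone", "burn", "vehicle")`. The additional triples on the example slide (count, located-in, near) would each need their own pattern, in line with the "lots of RE's" workaround.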