Information Retrieval course ::: Information Management Technologies
Kalliopi Zervanou

Overview
- The need for information processing
- Structured vs. unstructured data (text)
- The challenges of text
- Textual information processing technologies

The need for info processing
- Large amounts of data in electronic form
- Need for large scale & fast info processing
- Most information to be found in text

Types of Data
- Structured data
- Semi-structured data
- Unstructured, free-text data

Structured Data: e.g. Databases
Title:      Introduction to Information Retrieval
Author:     C.D. Manning, P. Raghavan, H. Schütze
Doc type:   Book
Publisher:  Cambridge University Press
Pub date:   2008
Id:         CM20B
Location:   Computer Science section
Keywords:   Information Retrieval, Indexing, …

Semi-Structured Data (e.g. XML)
    <?xml version="1.0" encoding="utf-8" ?>
    <cmsbwsa_iisg_nl>
      <bwsa>
        <path> bios/bymholt.html </path>
        <voornaam> Berend </voornaam>
        <achternaam> Bymholt </achternaam>
        <geboortejaar> 1864 </geboortejaar>
        <geboortedatum>07-09</geboortedatum>
        <sterfjaar>1947</sterfjaar>
        <sterfdatum>05-27</sterfdatum>
        <extrainfo> socialistisch en anarchistisch publicist en auteur van de Geschiedenis der Arbeidersbeweging in Nederland</extrainfo>
        <id>77</id>
      </bwsa>
      ...
    </cmsbwsa_iisg_nl>

Free-Text / Unstructured data
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices. Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider.
Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.

Data Mining
- analysis of structured data
- detection of unknown interesting patterns:
  - groups of data records (cluster analysis)
  - unusual records (anomaly detection)
  - data dependencies (association rule mining)

Text Mining / Text Analytics
- analysis of text (semi-/unstructured data)
- detection of unknown, interesting information:
  - group documents (classification/clustering)
  - extract information (content descriptors, concepts of interest)
  - associate/link information (e.g. concept relations)
  - discover previously unknown facts

The challenges of text
- Full text understanding beyond current technology
- Human understanding based on context
- Context: text, but also world knowledge
- Text: ambiguity (syntactic, semantic, lexical, pragmatic)

[Figure: overview of text processing technologies. Unstructured data: Doc Collection. Processes: IR, Summarisation (or Abstracting), IE (NE … EVENT …), (Indexing) ATR, Data Mining, Reasoning, etc… Outputs: Relevant Docs, Important Info, Relevant Info, Index Terms, Terminology, Derived Info, Structured Info. Structured data resources: Data Bases, Thesauri, Lexicons, Ontologies, Gazetteers. Legend: Process, Resource.]

IR: Select relevant documents
- Query: “query term”
- Relevant: Documents containing the “term”
- Methods: Indexing or Automatic Term Recognition

Automatic Term Recognition
- Objective: detect words or phrases denoting specialised concepts, i.e. terms
- supervised/unsupervised task
- Methods: rule based, statistics-based, machine learning, hybrid

ATR: example
C-value     Candidate term      Variants
338.13958   trade union         [trade union, Trades Union,…]
213.127     ernst papanek       [Ernst Papanek]
200.55471   new york            [New York]
143.48147   press clipping      [Press clippings, press-clippings,…]
139.07053   world war           [world war, world wars, World Wars,…]
134.47055   print material      [printed materials, Printed material,…]
131.19386   executive committee [executive committee, …]
124.91502   communist party     [Communist party,…]
94.48066    second world war    [Second World War, …]
91.18482    spanish civil war   [Spanish Civil War, …]
90.80228    great britain       [Great Britain, Great-Britain]

Document clustering
- Objective: group documents based on their content / semantic similarities
- unsupervised task
- “clusters”, group categories unknown
- machine learning and statistics-based approaches

Document classification
- Objective: classify documents based on their content / semantics
- supervised task
- we know the classes/categories
- use of machine learning, or statistics-based methods

[Figure: overview of text processing technologies (repeated).]

Summarisation or Abstracting
Bertelsmann 9-mth profit slips on start-up losses
FRANKFURT, Nov 10 (Reuters) - Media conglomerate Bertelsmann posted a slight decline in nine-month operating profit due to start-up losses related to new businesses. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year. It had cut its outlook in August due to costs for new projects and rising energy prices. Bertelsmann owns publishers Gruner + Jahr and Random House as well as European TV broadcaster RTL Group and Arvato, an outsourcing service provider.
Operating earnings before interest and tax (EBIT) eased by 1.1 percent to 1.03 billion euros ($1.4 billion) in the first nine months of 2011, Bertelsmann said.

Information Extraction
- Objective: detect specific types of info in documents, e.g. names, events, relations
- supervised, or unsupervised/generic task
- Methods: rule-based, machine learning

IE tasks
- Named Entity (NE): recognise entities/concepts of interest, e.g. persons, organisations, dates & times
- Co-reference (CO): recognise mentions of the same entity
- Template Relation (TR) & Scenario Template (ST): recognise relations among concepts, e.g. concept properties & entities involved in facts & events of interest

IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros.
Entity labels on the slide: ORGANISATION, PERCENT, DATE, AMOUNT

Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.
Annotations on the slide: ORGANISATION=“Bertelsmann”, DATE=“2011-11-10”

IE Tasks
Bertelsmann said operating earnings before interest and tax (EBIT) rose 35 percent to 215 million euros ($272.1 million) compared with 2005, and sales were up 17.3 percent at 4.5 billion euros. Europe's largest media group on Thursday said it still expects its 2011 operating profit to decline slightly year-on-year.

Event_type:         sales
Organisation_type:  Company
SALES_of
Organisation_name:  Bertelsmann
Sector:             media
Sales_mode:         increase
Sales_amount:       4.500.000.000
Currency:           euros
Period:             ??
Date:               ??
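The Named Entity task above can be illustrated with a minimal rule-based sketch. It is not the method used in the course: the patterns, the one-entry gazetteer `ORGANISATIONS` and the function name `recognise_entities` are all assumptions made for illustration, and the rules only cover the example sentence from the slide.

```python
import re

# Hypothetical mini-gazetteer; a real system would load a much larger list.
ORGANISATIONS = ["Bertelsmann"]

# Hand-written patterns, one per entity type (illustrative, not exhaustive).
PATTERNS = {
    "ORGANISATION": re.compile(
        r"\b(?:" + "|".join(map(re.escape, ORGANISATIONS)) + r")\b"
    ),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?\s+percent\b"),
    "AMOUNT": re.compile(
        r"\$\d+(?:\.\d+)?\s+(?:million|billion)"
        r"|\b\d+(?:\.\d+)?\s+(?:million|billion)\s+euros\b"
    ),
    # Assumption: a bare four-digit year counts as a DATE.
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def recognise_entities(text):
    """Return (start, end, label, surface) tuples sorted by text position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


SENTENCE = (
    "Bertelsmann said operating earnings before interest and tax (EBIT) "
    "rose 35 percent to 215 million euros ($272.1 million) compared with "
    "2005, and sales were up 17.3 percent at 4.5 billion euros."
)

ENTITIES = recognise_entities(SENTENCE)
for _start, _end, label, surface in ENTITIES:
    print(f"{label:13} {surface}")
```

Rules like these are quick to write but brittle: a new currency, a date written as "Nov 10", or an organisation missing from the gazetteer is silently skipped, which is one reason the slides also list machine learning as an IE method.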
Sentiment analysis/Opinion mining
- Polarity classification (positive/negative)
- Objectivity/Subjectivity detection

[Figure: overview of text processing technologies (IR, Summarisation, IE, Indexing/ATR, Data Mining and the structured resources: Data Bases, Thesauri, Lexicons, Ontologies, Gazetteers), repeated.]

Structured Data: e.g. Databases
Title:      Introduction to Information Retrieval
Author:     C.D. Manning, P. Raghavan, H. Schütze
Doc type:   Book
Publisher:  Cambridge University Press
Pub date:   2008
Id:         CM20B
Location:   Computer Science section
Keywords:   Information Retrieval, Indexing, …

Structured Data: Ontologies
- Structure of concepts:
  - Entities (concepts, objects)
  - Properties (concept properties)
  - Relations (links between concepts)
  - Domain specific relations, e.g., “has_capital”
- Objective: describe domain knowledge and reason about concepts & relations

Einstein's riddle
Source: http://en.wikipedia.org/wiki/Zebra_puzzle
- we have five houses in a row
- each house is painted with a different colour
- each house has a single inhabitant
- each inhabitant is of a different nationality, drinks a different beverage, owns a different pet, smokes a different brand of cigarettes

1. There are five houses.
2. The Englishman lives in the red house.
3. The Spaniard owns the dog.
4. Coffee is drunk in the green house.
5. The Ukrainian drinks tea.
6. The green house is immediately to the right of the ivory house.
7. The Old Gold smoker owns snails.
8. Kools are smoked in the yellow house.
9. Milk is drunk in the middle house.
10. The Norwegian lives in the first house.
11. The man who smokes Chesterfields lives in the house next to the man with the fox.
12. Kools are smoked in a house next to the house where the horse is kept.
13. The Lucky Strike smoker drinks orange juice.
14. The Japanese smokes Parliaments.
15. The Norwegian lives next to the blue house.

Who drinks water? Who owns a zebra?

Ontology: hierarchical structure
Thing/Root
  House:       House-1, House-2, House-3, House-4, House-5
  Inhabitant:  Englishman, Spaniard, Japanese, Norwegian, Ukrainian
  Colour:      Red, Green, Blue, Ivory, Yellow
  Pet:         Dog, Horse, Snails, Fox, Zebra
  Beverage

Ontology
- “is-a” or taxonomic relationships
- Denote the “kind” of a concept
- But ontologies: more than taxonomic relationships!
Thing/Root
  House:       House-1, House-2, ...
  Inhabitant:  Englishman, Spaniard, ...
  Colour:      Red, Green, ...
  Pet:         Dog, Horse, ...
  Beverage
  Brand

Ontology: properties
Top-level classes: House, Inhabitant, Colour, Pet, Beverage, Brand
House-1
  Has_colour:      [Colour]      (Colour>Is_ColourOf: [House])
  Has_inhabitant:  [Inhabitant]  (Inhabitant>LivesIn: [House])
  Is_rightTo:      [House]

Ontology: properties
Top-level classes: House, Inhabitant, Colour, Pet, Beverage, Brand
Spaniard
  LivesIn:     [House]     (House>Has_inhabitant: [Inhabitant])
  Has_pet:     [Pet]       (Pet>Has_owner: [Inhabitant])
  Drinks:      [Beverage]  (Beverage>Drunk_by: [Inhabitant])
  Uses_brand:  [Brand]     (Brand>Used_by: [Inhabitant])
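The "is-a" hierarchy and the property/inverse-property pairs above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ontology` class and its method names are hypothetical, not a real ontology library, and, as on the slides, individuals such as House-1 are simply placed under their class. The asserted facts are riddle statements 3, 5, 10 and 14.

```python
from collections import defaultdict


class Ontology:
    """Toy ontology: an "is-a" tree plus properties with declared inverses."""

    def __init__(self):
        self.parent = {}               # concept -> its "is-a" parent
        self.inverse = {}              # property -> inverse property
        self.facts = defaultdict(set)  # (subject, property) -> objects

    def add_concept(self, name, parent=None):
        self.parent[name] = parent

    def is_a(self, name, ancestor):
        # Walk "is-a" links upwards until the ancestor or the root is reached.
        while name is not None:
            if name == ancestor:
                return True
            name = self.parent.get(name)
        return False

    def add_property(self, name, inverse):
        self.inverse[name] = inverse
        self.inverse[inverse] = name

    def assert_fact(self, subject, prop, obj):
        # Store the fact and, as a tiny reasoning step, its inverse.
        self.facts[(subject, prop)].add(obj)
        self.facts[(obj, self.inverse[prop])].add(subject)

    def query(self, subject, prop):
        return self.facts.get((subject, prop), set())


onto = Ontology()
onto.add_concept("Thing")
for cls in ["House", "Inhabitant", "Colour", "Pet", "Beverage", "Brand"]:
    onto.add_concept(cls, "Thing")
for name, cls in [("House-1", "House"), ("Spaniard", "Inhabitant"),
                  ("Ukrainian", "Inhabitant"), ("Norwegian", "Inhabitant"),
                  ("Japanese", "Inhabitant"), ("Dog", "Pet"),
                  ("Tea", "Beverage"), ("Parliaments", "Brand")]:
    onto.add_concept(name, cls)

# Property pairs as named on the slides.
onto.add_property("Has_colour", "Is_ColourOf")
onto.add_property("Has_inhabitant", "LivesIn")
onto.add_property("Has_pet", "Has_owner")
onto.add_property("Drinks", "Drunk_by")
onto.add_property("Uses_brand", "Used_by")

onto.assert_fact("Spaniard", "Has_pet", "Dog")            # statement 3
onto.assert_fact("Ukrainian", "Drinks", "Tea")            # statement 5
onto.assert_fact("Norwegian", "LivesIn", "House-1")       # statement 10
onto.assert_fact("Japanese", "Uses_brand", "Parliaments") # statement 14

print(onto.query("Dog", "Has_owner"))
print(onto.query("House-1", "Has_inhabitant"))
print(onto.is_a("Spaniard", "Thing"))
```

Only the direct facts were entered, yet the inverse queries (`Has_owner`, `Has_inhabitant`) return answers as well. Solving the full riddle would need further reasoning over constraints such as "next to" and "immediately to the right of", which this sketch does not attempt.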