LECTURE 2: INTRODUCTION TO TEXT MINING

2.1 INFORMATION RETRIEVAL

IR is concerned with retrieving textual records: not with data items, as in relational databases, nor (specifically) with finding patterns, as in data mining. Examples:

- SQL: Find rows where the text column is LIKE "%information retrieval%".
- DM: Find a model in order to classify document topics.
- IR: Find documents whose text contains the word Information adjacent to Retrieval, Protocol or SRW, but not Google.

IR focuses on finding the records most appropriate or relevant to the user's request. The supremacy of Google can be attributed primarily to its PageRank algorithm for ranking web pages in order of relevance to the user's query. A share price of $741.79 (on 2007-11-06, up from $471.80 on 2006-11-03) says this topic is important to understand!

IR also focuses on finding these records as quickly as possible. Not only does Google find relevant pages, it finds them fast, for many thousands (maybe millions?) of concurrent users.

Is IR = Google? So is "Google" the answer to the question of "Information Retrieval"? No! Google has a good answer for how to search the web, but there are many more sources of data, and many more interesting questions. Many other examples exist, including:

- Library catalogues
- XML searching
- Distributed searching
- Query languages

@ St. Paul's University

IR Processes: Discovery
IR Processes: Ingestion

Compare these to the KDD process we looked at last time!

2.2 DOCUMENT INDEXING

What information do we need to store? Consider the query: documents containing Information and Retrieval but not Protocol. You need to find which documents contain which words, so the query could be performed using a document/term matrix.

Also useful to know is the frequency of each term in each document. Each row in the matrix is then a vector, which is useful for data mining functions, as the document has been reduced to a series of numbers rather than words.
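As a concrete sketch, a document/term frequency matrix and the example query can be built in a few lines of Python. The three documents here are invented for illustration:

```python
# Sketch: build a document/term frequency matrix and answer the example
# query "Information AND Retrieval AND NOT Protocol".
# The three documents are invented for illustration.
from collections import Counter

docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "data mining finds patterns in data",
    "d3": "protocol design for information retrieval systems",
}

# One Counter per document: term -> frequency within that document
counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# The full vocabulary gives the matrix its columns
vocab = sorted({term for c in counts.values() for term in c})

# Each row reduces a document to a vector of numbers (term frequencies)
matrix = {doc_id: [counts[doc_id][t] for t in vocab] for doc_id in docs}

# Information AND Retrieval AND NOT Protocol
hits = [d for d, c in counts.items()
        if c["information"] and c["retrieval"] and not c["protocol"]]
```

Here only d1 matches: d2 lacks the query terms, and d3 is excluded because it contains "protocol". Note how sparse each row already is, even with three tiny documents.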
The new matrix would record term frequencies in place of simple presence/absence flags.

Evaluation

Common evaluation measures for IR relevance ranking are Precision and Recall:

- Precision = (Number Relevant and Retrieved) / (Number Retrieved)
- Recall = (Number Relevant and Retrieved) / (Number Relevant)
- F score = (recall * precision) / ((recall + precision) / 2), i.e. the harmonic mean of the two

The ideal situation is that all, and only, relevant documents are retrieved. The same measures are also used in Data Mining evaluation.

Topics of Interest

- Format Processing: extraction of text from different file formats
- Indexing: efficient extraction and storage of terms from text
- Query Languages: formulation of queries against those indexes
- Protocols: transporting queries from client to server
- Relevance Ranking: determining the relevance of a document to the user's query
- Metasearch: cross-searching multiple document sets with the same query
- GridIR: using the grid (or other massively parallel infrastructure) to perform IR processes
- Multimedia IR: IR techniques on multimedia objects, compound digital objects, ...

2.3 DATA MINING ON TEXT

All of the Data Mining functions can be applied to textual data, using the term as the attribute and its frequency as the value:

- Classification: classify a text into subjects, genres, quality, reading age, ...
- Clustering: cluster together similar texts
- Association Rule Mining: find words that frequently appear together, or texts that are frequently cited together

The key challenge is the very large number of terms (e.g. the number of different words across all documents).

2.4 TEXT MINING

So, we've looked at Data Mining and IR... what's Text Mining, then? Good question. There is no canonical definition yet, but a definition similar to that of Data Mining could be applied: the nontrivial extraction of previously unknown, interesting facts from an (invariably large) collection of texts. So it sounds like a combination of IR and Data Mining, but the process actually involves many other steps too. Before we look at what actually happens, let's look at why it's different...
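The Precision, Recall and F score definitions from the Evaluation section above compute directly from set sizes. A minimal sketch, with invented sets of relevant and retrieved document IDs:

```python
# Sketch: Precision, Recall and F score for a single query.
# The document ID sets are invented ground truth / system output.
relevant = {"d1", "d3", "d5", "d7"}    # judged relevant to the query
retrieved = {"d1", "d2", "d3", "d4"}   # returned by the system

both = relevant & retrieved            # relevant AND retrieved

precision = len(both) / len(retrieved) # 2 / 4 = 0.5
recall = len(both) / len(relevant)     # 2 / 4 = 0.5

# F score: harmonic mean of precision and recall
f_score = (recall * precision) / ((recall + precision) / 2)
```

In the ideal situation, relevant and retrieved are the same set, and all three measures equal 1.0.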
Text Mining vs Data Mining

Data Mining finds a model for the data based on the attributes of the items. The only attributes of a text are the words that make it up and, as we saw for IR, this creates a very sparse matrix. Even if we create that matrix, what sort of patterns could we find?

- Classification: classify texts into pre-defined classes (e.g. spam / not spam)
- Association Rule Mining: find frequent sets of words (e.g. if 'computer' appears 3+ times, then 'data' appears at least once)
- Clustering: find groups of similar documents (IR?)

None of these fit our definition of Text Mining.

Text Mining vs IR

Information Retrieval finds documents that match the user's query. Even if we matched at the sentence level rather than the document level, all we would do is retrieve matching sentences; we would not be discovering anything new. Relevance ranking is important, but it still just matches information we already knew... it merely orders it appropriately. IR (typically) treats a document as a big bag of words: it doesn't care about the meaning of the words, just whether they exist in the document. So IR doesn't fit our definition of Text Mining either.

2.5 TEXT MINING PROCESS

How would one find previously unknown facts from a collection of text? We need to understand the meaning of the text:

- The part of speech of each word
- The role of each phrase: Subject/Verb/Object/Preposition/Indirect Object
- We need to determine that two mentions refer to the same entity.
- We need to find correlations involving the same entity.
- We can then form logical chains: Milk contains Magnesium. Magnesium stimulates receptor activity. Inactive receptors cause Headaches -> Milk is good for Headaches. (A fictional example!)

Part of Speech Tagging

First we need to tag the text with the part of speech of each word, e.g.:

Rob/noun teaches/verb the/article course/noun

How could we do this? By learning a model for the language! This is essentially a data mining classification problem: should the system classify each word as a noun, a verb, an adjective, etc.?
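To make the classification view of tagging concrete, here is a minimal sketch of a unigram tagger: it learns each word's most frequent tag from a tiny, invented tagged corpus, and backs off to 'noun' for unseen words. Real taggers also use the surrounding context; this is only the simplest baseline.

```python
# Sketch: part-of-speech tagging as a classification problem.
# A unigram "model" assigns each word its most frequent tag from a
# tiny, invented tagged corpus, backing off to 'noun' for unseen words.
from collections import Counter, defaultdict

tagged_corpus = [
    [("rob", "noun"), ("teaches", "verb"), ("the", "article"), ("course", "noun")],
    [("the", "article"), ("course", "noun"), ("covers", "verb"),
     ("text", "noun"), ("mining", "noun")],
]

# Learn the model: count how often each word carries each tag
tag_counts = defaultdict(Counter)
for sentence in tagged_corpus:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

def tag_sentence(sentence):
    """Classify each word with its most frequent training tag."""
    return [(w, tag_counts[w].most_common(1)[0][0] if w in tag_counts else "noun")
            for w in sentence.lower().split()]
```

Applied to "Rob teaches the mining course", this reproduces the noun/verb/article tagging from the example above.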
There are many different tag sets, often based on the Penn Treebank set (NN = noun, VB = verb, JJ = adjective, RB = adverb, etc.).

Parsing and Entity Recognition

Entity Recognition identifies the real-world objects referred to: people, organisations, products, locations, ... This is typically done via lookups in very large thesauri or 'ontologies' specific to the domain being processed (e.g. medical, historical, current events, etc.). Named Entity Recognisers use entity lexicons, contextual clues (e.g. "Mr.", capitalisation), and lots of training data!

Popular approaches:

- Hidden Markov Models (HMMs)
- Conditional Random Fields (CRFs)

Applications:

- Web search engines
- Social network extraction
- Recommendation systems
- Sentiment/Opinion analysis
- Financial time series prediction
- Social media analysis (e.g. Twitter)
- Text summarisation systems
- Machine translation systems
- Medical term extraction
- Adverse drug reaction detection
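The lexicon-lookup and contextual-clue ideas above can be sketched in a few lines. This is only a toy recogniser, not an HMM or CRF: the lexicon entries and the sentence are invented, and the single contextual rule ("a capitalised word after Mr. is a Person") stands in for the many clues a real system learns from training data.

```python
# Sketch: entity recognition by lexicon lookup plus one contextual clue
# (a capitalised word following "Mr." is tagged as a Person).
# The lexicon entries and the example sentence are invented.
lexicon = {
    "google": "Organisation",
    "london": "Location",
    "magnesium": "Product",
}

def recognise(text):
    entities = []
    tokens = text.split()
    for i, token in enumerate(tokens):
        word = token.strip(".,").lower()
        if word in lexicon:                      # thesaurus/ontology lookup
            entities.append((token.strip(".,"), lexicon[word]))
        elif i > 0 and tokens[i - 1] == "Mr." and token[:1].isupper():
            entities.append((token.strip(".,"), "Person"))  # contextual clue
    return entities
```

For "Mr. Smith moved from Google to London", this yields Smith as a Person (from context) and Google and London from the lexicon.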