LECTURE 2:
INTRODUCTION TO TEXT MINING
2.1 INFORMATION RETRIEVAL
IR is concerned with retrieving textual records, rather than data items as in relational databases or (specifically) patterns as in data mining.
Examples:
• SQL: Find rows where the text column LIKE “%information retrieval%”
• DM: Find a model in order to classify document topics.
• IR: Find documents with text that contains the words Information adjacent to Retrieval, Protocol or SRW, but not Google.
IR focuses on finding the most appropriate or relevant records to the user's request.
The supremacy of Google can be attributed primarily to its PageRank algorithm for ranking
web pages in order of relevance to the user's query. A share price of $741.79 (on 2007-11-06,
up from $471.80 on 2006-11-03) says this topic is important to understand!
IR also focuses on finding these records as quickly as possible.
Not only does Google find relevant pages, it finds them fast, for many thousands (maybe
millions?) of concurrent users.
Is IR = Google??
So is “Google” the answer to the question of “Information Retrieval”?
No! Google has a good answer for how to search the web, but there are many more sources of
data, and many more interesting questions.
Many other examples, including:
• Library catalogues
• XML searching
• Distributed searching
• Query languages
@ St. Paul’s University
IR Processes: Discovery
IR Processes: Ingestion
Compare to the KDD process we looked at last time!
2.2 DOCUMENT INDEXING
What information do we need to store?
Query: Documents containing Information and Retrieval but not Protocol. To answer it, you need to
know which documents contain which words. One way to perform this query is with a document/term
matrix, with one row per document and one column per term.
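As a rough sketch of the idea, the query above can be run against a tiny document/term matrix. The three documents and their ids are invented for illustration, and splitting on whitespace stands in for real term extraction.

```python
# Toy collection: document ids and texts are invented for illustration.
docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "a protocol for information retrieval",
    "d3": "data mining finds patterns",
}

# Vocabulary: every distinct term in the collection, in sorted order.
terms = sorted({word for text in docs.values() for word in text.split()})

# Document/term matrix: one row per document, one 0/1 entry per term.
matrix = {
    doc_id: [1 if term in text.split() else 0 for term in terms]
    for doc_id, text in docs.items()
}


def contains(doc_id, term):
    """Look up one cell of the matrix (False for terms not in the vocabulary)."""
    return term in terms and matrix[doc_id][terms.index(term)] == 1


# Query: Information AND Retrieval AND NOT Protocol.
hits = [
    d for d in docs
    if contains(d, "information")
    and contains(d, "retrieval")
    and not contains(d, "protocol")
]
print(hits)
```

Only d1 satisfies all three conditions: d2 mentions "protocol" and d3 contains neither query word.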
It is also useful to know the frequency of each term in the document. Each row in the matrix is
then a vector, which is useful for data mining functions because the document has been reduced to
a series of numbers rather than words.
The new matrix stores these frequencies in place of simple presence or absence.
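A minimal sketch of such a frequency matrix, assuming two invented documents and whitespace tokenisation. Each row is the term-frequency vector for one document.

```python
from collections import Counter

# Invented example documents.
docs = {
    "d1": "information retrieval is retrieval of information",
    "d2": "data mining finds patterns in data",
}

# Count how often each term occurs in each document.
counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Column order of the matrix: all distinct terms, sorted.
terms = sorted(set().union(*counts.values()))

# One vector of frequencies per document (a Counter gives 0 for absent terms).
tf_matrix = {doc_id: [c[t] for t in terms] for doc_id, c in counts.items()}

print(terms)
for doc_id, row in tf_matrix.items():
    print(doc_id, row)
```

Most entries are zero even in this tiny example, which hints at how sparse the matrix becomes for a real collection.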
Evaluation
Common evaluation measures for IR relevance ranking: Precision and Recall
• Precision: Number Relevant and Retrieved / Number Retrieved
• Recall: Number Relevant and Retrieved / Number Relevant
• F Score: recall * precision / ((recall + precision) / 2)
The ideal situation is that all, and only, the relevant documents are retrieved. These measures
are also used in Data Mining evaluation.
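The three measures can be computed directly from the set of retrieved documents and the set of relevant documents. This is a small sketch; the function name `evaluate` and the document ids are made up for the example.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f_score) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    both = len(retrieved & relevant)  # number relevant AND retrieved

    precision = both / len(retrieved) if retrieved else 0.0
    recall = both / len(relevant) if relevant else 0.0

    # F score as given above; guard against dividing by zero.
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = recall * precision / ((recall + precision) / 2)
    return precision, recall, f_score


# Hypothetical query result: 4 documents retrieved, 3 relevant overall.
precision, recall, f_score = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(precision, recall, f_score)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3), giving an F score of 4/7.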
Topics of Interest
• Format Processing: Extraction of text from different file formats
• Indexing: Efficient extraction/storage of terms from text
• Query Languages: Formulation of queries against those indexes
• Protocols: Transporting queries from client to server
• Relevance Ranking: Determining the relevance of a document to the user's query
• Metasearch: Cross-searching multiple document sets with the same query
• GridIR: Using the grid (or other massively parallel infrastructure) to perform IR processes
• Multimedia IR: IR techniques on multimedia objects, compound digital objects...

2.3 DATA MINING ON TEXT
All of the Data Mining functions can be applied to textual data, using term as the attribute
and frequency as the value.
Classification:
Classify a text into subjects, genres, quality, reading age, ...
Clustering:
Cluster together similar texts
Association Rule Mining:
Find words that frequently appear together, or texts that are frequently cited together.
The key challenge is the very large number of terms (e.g. the number of different words across all
documents).
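To make "similar texts" concrete, here is a sketch that compares term-frequency vectors. It assumes cosine similarity as the measure, which is one common choice and is not named in the lecture; the three texts are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented texts: term is the attribute, frequency is the value.
texts = {
    "t1": "text mining and data mining",
    "t2": "data mining on text",
    "t3": "library catalogue search",
}
vectors = {name: Counter(text.split()) for name, text in texts.items()}

sim_12 = cosine(vectors["t1"], vectors["t2"])
sim_13 = cosine(vectors["t1"], vectors["t3"])
print(round(sim_12, 3), sim_13)
```

A clustering algorithm would group t1 with t2, since they share most of their terms, and leave t3 apart.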
2.4 TEXT MINING
So, we've looked at Data Mining and IR... What's Text Mining then? Good question. There is no
canonical definition yet, but a definition similar to that of Data Mining could be applied: the
non-trivial extraction of previously unknown, interesting facts from an (invariably large)
collection of texts.
So it sounds like a combination of IR and Data Mining, but actually the process involves
many other steps too. Before we look at what actually happens, let's look at why it's
different...
Text Mining vs Data Mining
Data Mining finds a model for the data based on the attributes of the items. The only
attributes of text are the words that make up the text. As we looked at for IR, this creates a
very sparse matrix.
Even if we create that matrix, what sort of patterns could we find?
• Classification: We could classify texts into pre-defined classes (e.g. spam / not spam)
• Association Rule Mining: Finding frequent sets of words (e.g. if 'computer' appears 3+ times, then 'data' appears at least once)
• Clustering: Finding groups of similar documents (IR?)
None of these fit our definition of Text Mining.
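The association rule example above (if 'computer' appears 3+ times, then 'data' appears at least once) can be checked against a collection by counting. The sketch below uses four invented documents and reports support and confidence, the usual measures for association rules, which the lecture does not define here.

```python
from collections import Counter

# Invented documents, already reduced to lower-case words.
docs = [
    "computer computer computer data storage",
    "computer computer computer computer graphics",
    "computer data data",
    "computer computer computer data data network",
]
counts = [Counter(d.split()) for d in docs]

# Documents where the "if" part holds: 'computer' appears 3+ times.
antecedent = [c for c in counts if c["computer"] >= 3]

# Of those, documents where the "then" part also holds: 'data' at least once.
both = [c for c in antecedent if c["data"] >= 1]

# Support: fraction of all documents where the whole rule holds.
support = len(both) / len(counts)
# Confidence: fraction of antecedent documents where the consequent holds.
confidence = len(both) / len(antecedent) if antecedent else 0.0

print(support, confidence)
```

The rule holds in 2 of the 4 documents (support 0.5) and in 2 of the 3 documents that mention 'computer' three or more times (confidence 2/3).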
Text Mining vs IR
Information Retrieval finds documents that match the user's query. Even if we matched at a
sentence level rather than document level, all we would do is retrieve matching sentences; we're
not discovering anything new.
The relevance ranking is important, but it still just matches information we already knew... it
just orders it appropriately.
IR (typically) treats a document as a big bag of words... but doesn't care about the meaning of
the words, just if they exist in the document.
IR doesn't fit our definition of Text Mining either.
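A short illustration of the bag-of-words point: two sentences with different meanings reduce to exactly the same bag, because word order is thrown away. The sentences and the helper name `bag_of_words` are made up for the example.

```python
from collections import Counter


def bag_of_words(text):
    """Reduce a text to word counts, ignoring case and word order."""
    return Counter(text.lower().split())


a = bag_of_words("Rob teaches the course")
b = bag_of_words("the course teaches Rob")

# Same words, same counts: the two bags are indistinguishable.
same_bag = a == b
print(same_bag)
```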
2.5 TEXT MINING PROCESS
How would one find previously unknown facts from a bunch of text?
• Need to understand the meaning of the text!
• Part of speech of words
• Subject/Verb/Object/Preposition/Indirect Object
• Need to determine that two entities are the same entity.
• Need to find correlations of the same entity.
• Form logical chains: Milk contains Magnesium. Magnesium stimulates receptor activity. Inactive receptors cause Headaches -> Milk is good for Headaches. (fictional example!)
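The chaining step can be sketched as a search over extracted (subject, relation, object) facts. This assumes the earlier steps have already produced such triples and have resolved "receptor activity" and "inactive receptors" to a single entity, here called "receptors". The triples restate the fictional example above, and `find_chain` is a hypothetical helper.

```python
from collections import deque

# Fictional facts from the example, as (subject, relation, object) triples.
facts = [
    ("Milk", "contains", "Magnesium"),
    ("Magnesium", "stimulates", "receptors"),
    ("receptors", "when inactive cause", "Headaches"),
]


def find_chain(facts, start, goal):
    """Breadth-first search for a chain of facts linking start to goal."""
    graph = {}
    for subj, rel, obj in facts:
        graph.setdefault(subj, []).append((rel, obj))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == goal:
            return path
        for rel, obj in graph.get(entity, []):
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(entity, rel, obj)]))
    return None  # no chain connects the two entities


chain = find_chain(facts, "Milk", "Headaches")
for subj, rel, obj in chain:
    print(subj, rel, obj)
```

The search only shows that a chain connects Milk to Headaches. Deciding that the link means "is good for" still requires interpreting the relations along the chain.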
Part of Speech Tagging
First we need to tag the text with the parts of speech for each word.
eg: Rob/noun teaches/verb the/article course/noun
How could we do this? By learning a model for the language! Essentially a data mining
classification problem -- should the system classify the word as a noun, a verb, an adjective,
etc.
Lots of different tags, often based on a set called the Penn Treebank. (NN = Noun, VB =
Verb, JJ = Adjective, RB = Adverb, etc)
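To show tagging as a classification problem, here is a deliberately simple baseline: learn the most frequent tag for each word from tagged sentences, then apply it. The training sentences are invented and use coarse tags (NN, VB, JJ, plus DT for determiners); real taggers use more context and the full tag set.

```python
from collections import Counter, defaultdict

# Tiny invented training set of (word, tag) pairs.
training = [
    [("Rob", "NN"), ("teaches", "VB"), ("the", "DT"), ("course", "NN")],
    [("the", "DT"), ("students", "NN"), ("read", "VB"), ("the", "DT"), ("book", "NN")],
    [("the", "DT"), ("good", "JJ"), ("course", "NN")],
]


def train(tagged_sentences):
    """Learn the most frequent tag seen for each (lower-cased) word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="NN"):
    """Tag each word; unseen words fall back to the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train(training)
tagged = tag(model, "The students read the good book".split())
print(" ".join(f"{w}/{t}" for w, t in tagged))
```

Guessing noun for unknown words is an assumption of this sketch, chosen because nouns are the most common open class.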
Parsing
Entity Recognition
Identify the real-world objects referred to in the text. This is typically done via lookups in very
large thesauri or 'ontologies', specific to the domain being processed (e.g. medical, historical,
current events, etc.)
Typical entity types: People, Organisations, Products, Locations, …
Named Entity Recognisers use entity lexicons, contextual clues (e.g. Mr.~, capitalisation),
and lots of training data!
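A minimal rule-based sketch of the first two ingredients, a lexicon lookup and a contextual clue (a title such as "Mr." followed by a capitalised word). The lexicon entries, the title list and the sentence are invented; a real recogniser would learn from training data rather than rely on hand-written rules.

```python
# Hypothetical entity lexicon mapping known names to entity types.
LEXICON = {"Google": "Organisation", "London": "Location"}

# Titles used as a contextual clue that a person's name follows.
TITLES = {"Mr.", "Mrs.", "Dr."}


def recognise(tokens):
    """Return (token, entity type) pairs found by lexicon or context rules."""
    entities = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            # Rule 1: direct lookup in the lexicon.
            entities.append((tok, LEXICON[tok]))
        elif i > 0 and tokens[i - 1] in TITLES and tok[:1].isupper():
            # Rule 2: capitalised word straight after a title.
            entities.append((tok, "Person"))
    return entities


tokens = "Mr. Smith visited Google in London".split()
entities = recognise(tokens)
print(entities)
```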
Popular approaches
• Hidden Markov Models (HMMs)
• Conditional Random Fields (CRFs)
Applications
• Web search engines
• Social network extraction
• Recommendation systems
• Sentiment/Opinion Analysis
• Financial time series prediction
• Social media analysis (e.g. Twitter)
• Text summarisation systems
• Machine translation systems
• Medical term extraction
• Adverse drug reaction detection