Introduction to Text Mining
Pavel Brazdil
LIAAD INESC Porto LA
FEP, Univ. of Porto
Summer School: Aspects of Natural Language Processing
Faculty of Arts (F. Letras), Univ. of Porto, 4th June 2009
http://www.liaad.up.pt
Overview
1. Introduction
1.1 What is Text Mining
1.2 Documents and Document Collections
1.3 Representation of Documents and Features
1.4 Basic Text Mining Tasks
1.5 Advanced Text Mining Tasks
2. Document Classification
3. Clustering of Documents
4. Information Extraction
5. Discovery of Patterns and Trends
Bibliography
S. Weiss, N. Indurkhya, T. Zhang, F. Damerau: Text Mining: Predictive Methods for Analyzing
Unstructured Information, Springer, 2005.
R. Feldman, J. Sanger: The Text Mining Handbook: Advanced Approaches in Analyzing
Unstructured Data, Cambridge Univ. Press, 2007
F. Sebastiani: Machine Learning in Automated Text Categorization, ACM Computing
Surveys, Vol. 34, No. 1, 2002.
F. Colas, P. Brazdil: On the Behavior of SVM and Some Older Algorithms in Binary Text
Classification Tasks, in Text, Speech and Dialog, LNCS, Vol. 4188, pp. 45-52, 2006.
Cordeiro, J., Brazdil, P.: Learning Text Extraction Rules without Ignoring Stop Words, in
the 4th International Workshop on Pattern Recognition in Information Systems (PRIS 2004),
pp. 128-138.
Patil, K. and Brazdil, P., SumGraph: Text Summarization using Centrality in the Pathfinder
Network, International Journal on Computer Science and Information Systems, 2(1),
pp. 18-32, 2007.
Ingo Feinerer: Introduction to the tm Package: Text Mining in R, http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Luís Torgo: A Linguagem R, programação para a Análise de Dados, Escolar Editora, 2009.
20Newsgroups, http://people.csail.mit.edu/jrennie/20Newsgroups/
1.1 What is Text Mining
Text mining (TM) seeks to extract useful information
from a collection of documents.
It is similar to data mining (DM), but the data sources
are unstructured or semi-structured documents.
TM methods involve:
- Basic pre-processing / TM operations, such as the
identification and extraction of representative features
(this can be done in several phases)
- Advanced text mining operations, involving the
identification of complex patterns
(e.g. relationships between previously identified concepts)
TM exploits techniques and methodologies from
data mining, machine learning, information retrieval,
and corpus-based computational linguistics.
1.2 Documents and Document Collections
A document collection is
a grouping of text-based documents.
It can be
either static or dynamic (growing over time).
A document is
a unit of discrete textual data within a collection,
usually representing some real-world document, such as
a business report, memorandum, email, research paper, news story etc.
A document can be a member of different document collections
(e.g. legal affairs and computing equipment, if it falls under both).
1.2 Document Collection PubMed
An example of a real-world document collection:
PubMed is the National Library of Medicine’s on-line repository of
citation-related information for biomedical research papers.
This on-line service contains abstracts for > 12 million research papers.
About 40,000 new abstracts are added every month.
Search for a particular document:
Keyword search is not very useful,
as protein or gene returns 2,800,000 documents.
Even a more specific term, epidermal growth factor receptor, returned 10,000 documents.
Identification of relationships across documents:
Manual attempts are labour-intensive or impossible to carry out.
Automatic methods enhance the speed and efficiency of research activities.
1.2 Document Structure
Text documents can be:
- unstructured,
i.e. free-style text
(but from a linguistic perspective they are really structured objects)
- weakly structured,
adhering to some pre-specified format,
like most scientific papers, business reports, legal memoranda,
news stories etc.
- semi-structured,
exploiting heavy document templating or style sheets.
1.3 Document Representation and Features
An irregular and implicitly structured representation
is transformed into an explicitly structured representation.
We can distinguish:
- feature-based representation,
- relational representation.
In a feature-based representation,
documents are represented by a set of features.
Examples of some commonly used features:
- Characters
make it possible to recognize e.g. morphological features.
So-called bigrams (trigrams) are sequences of 2 (or 3) characters.
- Words
The term word-level tokens is often used instead.
Tokens can be annotated (e.g. with labels representing noun, verb etc.).
The bag-of-words representation uses the words but ignores their order.
A word stem represents a group of related words stripped of their suffixes.
- Terms
may represent single words or multiword units, such as “White House”.
- Concepts
For example, the concept identifier “car” can represent
different words in the text, such as automobile, car, sport-car etc.
Concepts are useful to represent synonyms and help to resolve polysemy.
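The feature types listed above can be illustrated with a minimal sketch. The function names and the small concept dictionary are hypothetical, chosen only for this example; real systems use proper tokenizers and much larger concept resources.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bag_of_words(text):
    """Count word-level tokens; the order of the words is discarded."""
    return Counter(text.lower().split())

# Hypothetical concept dictionary: surface word -> concept identifier.
CONCEPTS = {"automobile": "car", "car": "car", "sport-car": "car"}

def concept_counts(text):
    """Count concept identifiers for the words covered by the dictionary."""
    tokens = text.lower().split()
    return Counter(CONCEPTS[t] for t in tokens if t in CONCEPTS)

sentence = "the car passed the automobile"
print(char_ngrams("text", 2))      # character bigrams
print(bag_of_words(sentence))      # word features
print(concept_counts(sentence))    # concept features
```

Both "car" and "automobile" are counted under the single concept identifier "car", which is how concepts absorb synonyms.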
1.3 Problem of High Dimensionality
Structured representations of natural language documents
usually lead to a very large number of features.
For instance, one small Reuters collection of 15,000 documents
contains 25,000 non-trivial features (word stems).
Some algorithms do not deal very well with large numbers of features
and hence it is necessary to employ feature reduction techniques.
Another problem is feature sparsity:
Each document contains only a small number of all potential features.
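A toy sketch of both problems, assuming three invented one-line documents. It measures the sparsity of the document-by-feature matrix and applies one simple reduction technique, a document-frequency threshold (`min_df` is an illustrative parameter, not a value from the slides).

```python
from collections import Counter

docs = [  # invented toy documents
    "oil prices rise",
    "oil exports fall",
    "wheat prices fall",
]

# Tokenize and build the vocabulary (one feature per distinct word).
tokenized = [d.split() for d in docs]
vocabulary = sorted({w for doc in tokenized for w in doc})

# Sparsity: fraction of zero cells in the document-by-feature matrix.
nonzero = sum(len(set(doc)) for doc in tokenized)
sparsity = 1 - nonzero / (len(docs) * len(vocabulary))

# Feature reduction: keep only words that occur in at least min_df documents.
min_df = 2
doc_freq = Counter(w for doc in tokenized for w in set(doc))
reduced = sorted(w for w, df in doc_freq.items() if df >= min_df)

print(len(vocabulary), sparsity, reduced)
```

Even in this tiny collection half of the matrix cells are empty; in realistic collections the proportion is far higher.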
1.4 Basic Text Mining Tasks
• Document classification (categorization)
• Information Retrieval
• Clustering / organization of documents
• Information extraction
More information to follow
1.4 Information Retrieval
Retrieval of documents in response to a “query document”
(as a special case, the query document can consist of a few keywords)
[Diagram: Document Collection + Query Document -> Document Matcher -> Retrieved Documents]
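One possible document matcher, sketched under the assumption that documents are compared by cosine similarity of raw word counts. The slides do not prescribe a matching method, and the collection below is invented.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector with raw term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, collection, top_k=2):
    """Return up to top_k documents, most similar to the query first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), doc) for doc in collection]
    scored = [pair for pair in scored if pair[0] > 0]  # drop non-matching documents
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

collection = [  # invented toy collection
    "growth factor receptor signalling",
    "stock market report",
    "epidermal growth factor receptor mutations",
]
results = retrieve("epidermal growth factor receptor", collection)
print(results)
```

The query is treated exactly like a short document, which matches the "query document" view above.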
1.4 Document Classification
Classification of documents into predefined categories (classes)
[Diagram: New Document -> Classification -> Agriculture, Finance, Legislation, etc.]
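As an illustration only, a multinomial Naive Bayes classifier with add-one smoothing is sketched below. The slides do not name a classification algorithm at this point, so this choice is an assumption, and the training examples are invented.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """labelled_docs: list of (text, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocabulary.add(w)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log prior + log likelihoods with add-one (Laplace) smoothing
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [  # invented training examples
    ("wheat harvest and crop yields", "Agriculture"),
    ("dairy farm subsidies for crop growers", "Agriculture"),
    ("bank interest rates and bond markets", "Finance"),
    ("stock markets and bank lending", "Finance"),
]
model = train(training)
print(classify("crop prices at the farm", model))
```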
1.4 Clustering / Organizing Documents
Unsupervised process through which documents are classified
into groups called clusters.
[Diagram: Document Collection -> Document Organizer -> Group 1, Group 2, Group 3, etc.]
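A very simple document organizer can be sketched as single-pass clustering on word overlap. This is one of many possible methods and is not taken from the slides; the Jaccard measure, the `threshold` value and the documents are assumptions made for the example.

```python
def jaccard(a, b):
    """Word-set overlap: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.2):
    """Single-pass clustering: a document joins the first cluster whose
    seed (first member) is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of (text, word_set) pairs
    for text in docs:
        words = set(text.lower().split())
        for group in clusters:
            if jaccard(words, group[0][1]) >= threshold:
                group.append((text, words))
                break
        else:
            clusters.append([(text, words)])
    return [[text for text, _ in group] for group in clusters]

docs = [  # invented toy documents
    "wheat crop harvest",
    "bank interest rates",
    "crop harvest forecast",
    "interest rates rise",
]
groups = cluster(docs)
print(groups)
```

No category labels are supplied; the groups emerge from the similarity between documents alone.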
1.4 Information Extraction
IE involves identification of certain entities in the text,
their extraction and representation in a pre-specified format (e.g. a table).
T5 Duplex em Gaia
Data: 2002-05-10 15:01:24 PST
Excelente localização no centro da cidade.
2 WC, despensa, terraço com marquise
com 70 m2; 119700 euros; Tel. 966969663
Apartamento pouco usada T4, 2 wc´s, 3º andar
com vista panorámica. Excelente localização,
a poucos metros da zona central de Loulé.
Perto metros do tribunal, biblioteca, piscinas,
e diversos estabelecimentos comerciais.
Preço: 132.180 Euros (negociavel)
936109097
Output: Filled in Template / Table
Price     Type   Location   Area
119 700   T5     Gaia       70
132.180   T4     Loulé      ?
...
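A rough sketch of how the template above could be filled with regular expressions, using the two advertisements as input. The patterns and the small list of place names (`KNOWN_PLACES`) are assumptions made for this example; they are hand-written, not learned extraction rules.

```python
import re

KNOWN_PLACES = ["Gaia", "Loulé"]  # hypothetical gazetteer of locations

def extract(ad):
    """Fill the Price / Type / Location / Area template for one advertisement."""
    type_m = re.search(r"\bT\d\b", ad)                            # e.g. T4, T5
    price_m = re.search(r"(\d[\d.]*)\s*euros", ad, re.IGNORECASE)
    area_m = re.search(r"(\d+)\s*m2", ad)
    location = next((p for p in KNOWN_PLACES if p in ad), None)
    return {
        "price": int(price_m.group(1).replace(".", "")) if price_m else None,
        "type": type_m.group(0) if type_m else None,
        "location": location,
        "area": int(area_m.group(1)) if area_m else None,  # None plays the role of "?"
    }

ad1 = """T5 Duplex em Gaia
Data: 2002-05-10 15:01:24 PST
Excelente localização no centro da cidade.
2 WC, despensa, terraço com marquise
com 70 m2; 119700 euros; Tel. 966969663"""

ad2 = """Apartamento pouco usada T4, 2 wc´s, 3º andar
com vista panorámica. Excelente localização,
a poucos metros da zona central de Loulé.
Perto metros do tribunal, biblioteca, piscinas,
e diversos estabelecimentos comerciais.
Preço: 132.180 Euros (negociavel)
936109097"""

rows = [extract(ad1), extract(ad2)]
print(rows)
```

Prices are normalized to plain integers here, so "132.180" becomes 132180.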
1.5 Advanced Text Mining Tasks
• Concept co-occurrence
  Quantification of co-occurrence
• Identification of trends in data
  Identification of new topics
• Summarization
1.5 Concept Co-occurrence
Detection of concept co-occurrence in documents, e.g.:
Disease – Medical Drug (based on BioWorld articles)
Rheumatoid Arthritis - Rituximab
Rheumatoid Arthritis - Infliximab
Prostate Carcinoma - APC8015 Vaccine
etc.
Quantification of frequency of co-occurrence
can be expressed in numerical form, or using a graphical representation
(e.g. a circle graph; the width of the line indicates the strength of the connection)
[Graph: Diseases (Rheumatoid Arthritis, Prostate Carcinoma) linked to Medical Drugs (Rituximab, Infliximab, APC8015 Vaccine)]
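Co-occurrence can be quantified by counting the documents in which a disease and a drug are both mentioned. The sketch below assumes the concepts have already been identified as plain strings; the sentences are invented for illustration and are not BioWorld data.

```python
from collections import Counter
from itertools import product

DISEASES = ["rheumatoid arthritis", "prostate carcinoma"]
DRUGS = ["rituximab", "infliximab"]

def cooccurrence(docs):
    """Count, for each (disease, drug) pair, the documents mentioning both."""
    counts = Counter()
    for doc in docs:
        text = doc.lower()
        for disease, drug in product(DISEASES, DRUGS):
            if disease in text and drug in text:
                counts[(disease, drug)] += 1
    return counts

docs = [  # invented sentences, for illustration only
    "Article mentioning rituximab and rheumatoid arthritis.",
    "Article mentioning rheumatoid arthritis, infliximab and rituximab.",
    "Article mentioning prostate carcinoma only.",
]
counts = cooccurrence(docs)
for (disease, drug), n in counts.most_common():
    print(disease, "-", drug, ":", n)
```

The counts could serve as line widths in a graphical representation of the connections.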
1.5 Identification of Trends in Data
Identification of trends in data
How does the news concerning a particular disease (e.g. Rheumatoid Arthritis)
and a particular medical drug (e.g. Rituximab) change over time?
How does the news concerning a particular company
and a particular product (e.g. medical drug Rituximab) change over time?
Identification of new topics in the data
Did any new articles appear concerning a certain type of company
(e.g. a pharmaceutical company)
and a particular type of product
(e.g. a medical drug useful for treating lung cancer)?
Identification of disappearing topics in the data
Identification of a period covered by a certain topic
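A minimal sketch of trend detection, assuming each document carries a date in `YYYY-MM-DD` form. It counts per month the documents that mention both terms and reports the period covered by the topic. The dated headlines are invented.

```python
from collections import defaultdict

def trend(dated_docs, term_a, term_b):
    """Count, per month, the documents that mention both terms."""
    per_month = defaultdict(int)
    for date, text in dated_docs:
        lowered = text.lower()
        if term_a in lowered and term_b in lowered:
            per_month[date[:7]] += 1  # key is 'YYYY-MM'
    return dict(sorted(per_month.items()))

def topic_period(series):
    """First and last month in which the topic appears, or None."""
    months = list(series)
    return (months[0], months[-1]) if months else None

dated_docs = [  # invented (date, text) pairs, for illustration only
    ("2008-01-12", "Rituximab trial for rheumatoid arthritis announced"),
    ("2008-01-20", "Rheumatoid arthritis patients and rituximab"),
    ("2008-02-03", "Quarterly results of a pharmaceutical company"),
    ("2008-03-02", "New rituximab data in rheumatoid arthritis"),
]
series = trend(dated_docs, "rheumatoid arthritis", "rituximab")
print(series, topic_period(series))
```

A topic whose first month is recent can be flagged as new, and one whose last month is long past as disappearing.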
1.5 Summarization
Summarization of a single document
Selection of some sentences, summarizing the document
D1: S1, S2, S3, S4
S4
Summarization of several documents
Selection of a single representative document
D1, D2, D3, D4, D5
D6
Selection of representative sentences from different documents
D1: S1, S2, S3, S4
D2: S5, S6, S7
D3: S8, S9, S10
Dsummar: S1, S6, S4
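Sentence selection can be sketched with a simple word-frequency score: sentences whose words are frequent in the document are preferred. This heuristic is an assumption made for the example (it is not the centrality-based SumGraph method cited in the bibliography), and the stop-word list is a tiny illustrative one.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in", "it"}  # tiny illustrative list

def content_words(sentence):
    """Lower-cased alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def summarize(text, n_sentences=1):
    """Select the n highest-scoring sentences, returned in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

text = ("Text mining extracts information from documents. "
        "The weather was nice. "
        "Text mining uses documents.")
print(summarize(text, 1))
```

For several documents, the same scoring can be applied to the sentences of all documents pooled together.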