Introduction to Text Mining
Mandar Mitra
Indian Statistical Institute
M. Mitra (ISI)
Text Mining
1 / 29
Outline
1. Preliminaries
2. Preprocessing
3. Mining word associations
4. Opinion mining
What is Text Mining?
Strict definition
The nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data
OR
Loose definition
The science of extracting useful information from large [textual] data sets
Old wine in a new bottle?
Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition)
Why is it interesting?
Growth of Web / electronic information sources
Multidisciplinary nature
E-commerce potential
“Electronic commerce is emerging as the killer domain for
data-mining technology” — RONNY KOHAVI
Data sources
World Wide Web
unstructured and semi-structured text
“deep” web: pages that do not exist until they are created
dynamically as the result of a specific search
social networks
Intranet
internal correspondence, memos, presentations
white papers, technical reports
customer email, customer forums, product reviews
news wires
...
No structure / general schema / tabular form that fits text
Indexing
Any text item (“document”) is represented as a list of terms and associated weights
$D = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle)$
Term = keyword or content descriptor
Weight = measure of the importance of a term in representing the information contained in the document
Indexing
Tokenization: identify individual words
Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.
⇓
Sachin | Tendulkar | made | a | tearful | but | ...
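A minimal sketch of the tokenization step, assuming a simple regular-expression rule that keeps hyphenated words such as "self-effacing" together. The function name `tokenize` and the pattern are illustrative assumptions, not the tokenizer used in the lecture.

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping internal hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)

sentence = ("Sachin Tendulkar made a tearful but self-effacing farewell as his "
            "glittering 24-year career came to an end on Saturday at his home "
            "ground of Wankhede Stadium.")
tokens = tokenize(sentence)
print(tokens[:6])  # ['Sachin', 'Tendulkar', 'made', 'a', 'tearful', 'but']
```

Whether "24-year" should be one token or two is a design choice; this sketch keeps it whole.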
Indexing
Stopword removal: eliminate common words, e.g. and, of, the, etc.
Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.
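Stopword removal on the example sentence can be sketched as follows. The stopword list here is a tiny illustrative assumption; practical lists typically contain a few hundred words.

```python
# Illustrative stopword list only (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "as", "at", "but", "his", "of", "on", "the", "to"}

def remove_stopwords(tokens):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ("Sachin Tendulkar made a tearful but self-effacing farewell as his "
          "glittering 24-year career came to an end on Saturday at his home "
          "ground of Wankhede Stadium").split()
content_terms = remove_stopwords(tokens)
print(content_terms)
```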
Indexing
Stemming: reduce words to a common root
e.g. resignation, resigned, resigns → resign
analysis, analyze, analyzing → analy
use standard algorithms (Porter)
Indexing
Thesaurus: find synonyms for words in the document
Phrases: find multi-word terms, e.g. computer science, data mining
use syntax/linguistic methods or “statistical” methods
Named entities: identify names of people, organizations, places; dates; monetary or other amounts, etc.
Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.
Indexing: Term Weights
Term frequency (tf): repeated words are strongly related to content
Inverse document frequency (idf): an uncommon term is more important
Normalization by document length
long documents contain many distinct words
long documents contain the same word many times
term weights for long documents should be reduced
use # bytes, # distinct words, Euclidean length, etc.
Weight = tf × idf / normalization
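One possible instance of Weight = tf × idf / normalization is sketched below. It assumes raw term counts for tf, log(N/df) for idf, and Euclidean length for normalization; these are only one of several reasonable choices, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        # Euclidean length of the raw vector (fall back to 1.0 if all weights are 0).
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

docs = [
    "text mining finds patterns in text".split(),
    "data mining finds patterns in tables".split(),
    "cricket career ends in mumbai".split(),
]
weights = tfidf_weights(docs)
print(weights[0])
```

In this toy collection "in" occurs in every document, so its idf (and weight) is zero, while "text" gets the largest weight in the first document because it is both repeated and rare.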
Commonly used weighting schemes
Pivoted normalization [Singhal et al., SIGIR 96]
$$\frac{\dfrac{1+\log(tf)}{1+\log(\text{average } tf)} \times \log\left(\dfrac{N}{df}\right)}{(1.0 - slope) \times pivot + slope \times \#\,\text{unique terms}}$$
BM25 (probabilistic model) [Robertson and Zaragoza, FTIR 2009]
$$\frac{tf \times \log\left(\dfrac{N - df + 0.5}{df + 0.5}\right)}{k_1\left((1-b) + b\,\dfrac{dl}{avdl}\right) + tf}$$
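The two formulas translate directly into code. The sketch below follows the expressions as written on this slide; the default parameter values (k1 = 1.2, b = 0.75, slope = 0.2) are commonly quoted settings assumed here for illustration, not values given in the lecture.

```python
import math

def bm25_weight(tf, df, n_docs, dl, avdl, k1=1.2, b=0.75):
    """BM25 term weight: tf, df, N (n_docs), document length dl, average length avdl."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf * idf / (k1 * ((1 - b) + b * dl / avdl) + tf)

def pivoted_weight(tf, avg_tf, df, n_docs, n_unique, pivot, slope=0.2):
    """Pivoted-normalization term weight; n_unique is the # of unique terms in the document."""
    tf_part = (1 + math.log(tf)) / (1 + math.log(avg_tf))
    idf = math.log(n_docs / df)
    return tf_part * idf / ((1.0 - slope) * pivot + slope * n_unique)

# A term occurring 3 times in an average-length document, present in 10 of 1000 documents.
print(bm25_weight(tf=3, df=10, n_docs=1000, dl=100, avdl=100))
print(pivoted_weight(tf=3, avg_tf=1.5, df=10, n_docs=1000, n_unique=50, pivot=50))
```

In both schemes a longer document (larger dl, or more unique terms) increases the denominator and so reduces the weight of each of its terms.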
Searching
Measure vocabulary overlap between user query and documents.
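$Q = (q_1, \ldots, q_n)$ and $D = (d_1, \ldots, d_n)$ over the terms $t_1, \ldots, t_n$
$\mathrm{Sim}(Q, D) = \vec{Q} \cdot \vec{D} = \sum_i q_i \times d_i$
Use inverted list (index).
$\text{Term}_i \rightarrow (D_{i1}, w_{i1}), \ldots, (D_{ik}, w_{ik})$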
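The inverted list makes the dot product cheap: only documents that share at least one term with the query are ever touched. A minimal sketch, with hypothetical document ids and weights:

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """doc_vectors: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def search(query_vector, index):
    """Accumulate Sim(Q, D) = sum_i q_i * d_i by walking one posting list per query term."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

doc_vectors = {
    "D1": {"text": 0.8, "mining": 0.6},
    "D2": {"data": 0.7, "mining": 0.7},
    "D3": {"cricket": 1.0},
}
index = build_inverted_index(doc_vectors)
ranking = search({"text": 1.0, "mining": 0.5}, index)
print(ranking)  # D1 ranks above D2; D3 shares no term with the query and is absent
```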
Stemming
YASS
[Majumder et al., ACM TOIS 25(4), 2007]
Stemming ≡ grouping morphologically related words together
e.g. { analysis, analyze, analyzing }
Try clustering
distance measure: edit distance, or
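$$D(X, Y) = \begin{cases} \dfrac{n-m+1}{m} \times \displaystyle\sum_{i=m}^{n} \dfrac{1}{2^{i-m}} & \text{if } m > 0, \\ \infty & \text{otherwise} \end{cases}$$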
clustering algorithm: hierarchical agglomerative
(single link / complete link / average link)
Stemming
Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
X:        a s t r o n o m e r x  x  x  x
Y:        a s t r o n o m i c a  l  l  y
Edit distance = 6
$D = \frac{6}{8} \times \left(\frac{1}{2^0} + \ldots + \frac{1}{2^{13-8}}\right) = 1.4766$

Position: 0 1 2 3 4 5 6 7 8 9
X:        a s t r o n o m e r
Y:        a s t o n i s h x x
Edit distance = 5
$D = \frac{7}{3} \times \left(\frac{1}{2^0} + \ldots + \frac{1}{2^{9-3}}\right) = 4.6302$
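The two worked values can be checked with a short function. This sketch assumes, as the examples suggest, that the shorter word is padded to the length of the longer one, that n is the last character position after padding, and that m is the position of the first mismatch. The function name is hypothetical.

```python
import math

def yass_distance(x, y):
    """D(X, Y) from the slide: small when the words share a long common prefix."""
    length = max(len(x), len(y))
    n = length - 1  # last position after padding the shorter word
    # Position of the first mismatch; running past the end of the shorter
    # word counts as a mismatch because padding never matches a letter.
    m = next((i for i in range(length)
              if i >= len(x) or i >= len(y) or x[i] != y[i]), length)
    if m == 0:
        return math.inf  # no common prefix at all
    return (n - m + 1) / m * sum(1 / 2 ** (i - m) for i in range(m, n + 1))

print(round(yass_distance("astronomer", "astronomically"), 4))  # 1.4766
print(round(yass_distance("astronomer", "astonish"), 4))        # 4.6302
```

Note how the measure separates the two pairs far more sharply than edit distance does (6 versus 5), because it rewards the long shared prefix of the first pair.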
Stemming
Clustering:
[Courtesy: http://espin086.files.wordpress.com/2011/02/2-variable-clustering.png]
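A minimal sketch of hierarchical agglomerative clustering with single linkage, using plain edit distance as the distance measure. The stopping threshold of 3 and the word list are assumptions chosen for illustration; this is not the YASS implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def single_link_clusters(words, distance, threshold):
    """Repeatedly merge the two closest clusters until none is within threshold."""
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster distance is the closest pair of members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

words = ["analysis", "analyze", "analyzing", "resign", "resigned", "resigns"]
clusters = single_link_clusters(words, edit_distance, threshold=3)
print(clusters)
```

Each resulting cluster plays the role of one stem class. Swapping `min` for `max` in the linkage line would give complete link.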
Word Relations
Motivation:
Manual thesauri are:
general purpose (Roget’s Thesaurus, WordNet) – difficult to use for
document retrieval
retrieval-oriented (INSPEC, MeSH) – expensive to build and
maintain
Construct an automatic thesaurus (based on information about
co-occurrence of words in a collection)
Word Relations
Association: if two terms co-occur within the same paragraph,
they constitute an association
$\langle term_1, term_2, \text{association frequency} \rangle$
Gather data about term-associations over a large amount of text
Refine associations:
Discard associations with frequency 1
Discard terms that are associated with too many other terms
(people, state, company, etc.)
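The gathering and refinement steps above can be sketched as follows, assuming paragraphs have already been tokenized and preprocessed. The function name, the `max_partners` cutoff, and the toy paragraphs are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mine_associations(paragraphs, min_freq=2, max_partners=3):
    """paragraphs: list of token lists. Returns {(term1, term2): frequency}."""
    pair_freq = Counter()
    for tokens in paragraphs:
        # Every pair of distinct terms in the same paragraph is one association.
        for t1, t2 in combinations(sorted(set(tokens)), 2):
            pair_freq[(t1, t2)] += 1
    # Discard associations with frequency 1 (below min_freq).
    kept = {p: f for p, f in pair_freq.items() if f >= min_freq}
    # Discard terms that are associated with too many other terms.
    partners = Counter(t for pair in kept for t in pair)
    return {p: f for p, f in kept.items()
            if all(partners[t] <= max_partners for t in p)}

paragraphs = [
    ["immigration", "amnesty", "state"],
    ["immigration", "amnesty", "state"],
    ["cricket", "stadium", "state"],
    ["cricket", "stadium", "state"],
    ["immigration", "cricket"],
]
associations = mine_associations(paragraphs)
print(associations)
```

Here "state" is dropped because it pairs with four other terms, and the one-off pair (cricket, immigration) is dropped for having frequency 1.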
Word Relations
Each term is represented by a vector of associated terms
$T = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle)$
⇒ term = pseudo document
Compare query to the term vectors (instead of document vectors)
$\mathrm{Sim}(Q, T) = \sum_i wt(q_i) \times wt(t_i)$
Most “similar” terms are added to the query
Example: 1986 US Immigration Law
similar terms: illegal immigration, amnesty program,
simpson-mazzoli
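Query expansion with term vectors can be sketched as below. The association weights are made-up numbers for illustration only, and `expand_query` is a hypothetical helper.

```python
def expand_query(query_weights, term_vectors, top_k=2):
    """Rank candidate terms by Sim(Q, T) and return the top_k most similar ones."""
    scores = {}
    for term, vector in term_vectors.items():
        if term in query_weights:
            continue  # never "expand" with a term already in the query
        score = sum(w * vector.get(t, 0.0) for t, w in query_weights.items())
        if score > 0:
            scores[term] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Each term is a pseudo document: a vector of its associated terms (weights invented).
term_vectors = {
    "amnesty": {"immigration": 0.9, "law": 0.4},
    "simpson-mazzoli": {"immigration": 0.7, "law": 0.5},
    "stadium": {"cricket": 0.8},
}
query = {"immigration": 1.0, "law": 1.0}
print(expand_query(query, term_vectors))  # ['amnesty', 'simpson-mazzoli']
```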
Word Relations
Experimental results:
Data: 500,000 documents (news, computer abstracts, govt.
documents); 50 queries
Baseline average precision: 37%
Improves by 6-30% by using thesaurus
2 weeks to generate association data!
Processing time can be reduced without major loss in
performance by using a subset of the document collection
Challenges
Does a document contain an opinion? In which portion?
sites with a review component — easy
e.g. CNET, Amazon, Epinions
blogs — harder
Sentiment classification
overall (polarity) / specific
free form / grades or stars
quotations
Presentation
highlighting
aggregation
community identification
estimating reliability
Query classification: is the user looking for an opinion?
Opinion Mining
Feature-based opinion summarization
Identify the features of the product that customers have expressed
opinions on (called opinion features)
For each feature, identify how many customer reviews are positive
/ negative
Examples:
The pictures are very clear.
Overall a fantastic, very compact, camera.
While light, it will not easily fit in pockets. (HARD!)
Opinion Mining
Feature identification
1. POS tagging + chunking: identify nouns, verbs, adjectives, simple noun groups, verb groups
2. Transaction creation for each sentence: item ≡ normalized nouns / noun phrases
3. Association rule mining: all itemsets with > 1% support are candidate frequent features
4. Feature pruning:
keep features that have some compact occurrences
keep singleton itemsets only if they occur enough times in isolation
e.g. manual vs. manual mode, manual setting
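Step 3 can be illustrated with a brute-force support count over small itemsets. This is a simplified stand-in for a real association rule miner such as Apriori; the transactions are invented, and the support threshold is raised from 1% to 30% only because the toy data has five sentences.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.01, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n > min_support}

# One transaction per review sentence: its normalized nouns / noun phrases.
transactions = [
    ["picture"],
    ["camera"],
    ["manual", "mode"],
    ["manual", "mode", "picture"],
    ["battery"],
]
candidates = frequent_itemsets(transactions, min_support=0.3)
print(candidates)
```

In this toy data "manual" never occurs without "mode", which is the situation the pruning step addresses: a singleton is kept only if it occurs often enough in isolation.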
Opinion Mining
Sentiment / orientation identification
1. Examine each sentence in the review database
2. If it contains a frequent feature, extract all the adjective words as opinion words
3. For each feature in the sentence, the nearby adjective is recorded as its effective opinion
4. Look up adjective in a list of adjectives with known orientation, or consult WordNet (discard unknowns)
adjectives arranged in bipolar structures
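The four steps can be sketched end to end, assuming the sentences arrive already POS-tagged. The tag names, the small `ORIENTATION` lexicon, and the feature set are all hypothetical stand-ins; a real system would use a tagger and a resource such as WordNet.

```python
from collections import defaultdict

# Hypothetical seed lexicon: +1 = positive orientation, -1 = negative.
ORIENTATION = {"clear": 1, "fantastic": 1, "compact": 1, "blurry": -1, "heavy": -1}

def summarize_opinions(tagged_sentences, features):
    """Count positive / negative sentences per feature using the nearest adjective."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for sentence in tagged_sentences:
        words = [w.lower() for w, _ in sentence]
        adjectives = [i for i, (_, tag) in enumerate(sentence) if tag == "ADJ"]
        for pos, word in enumerate(words):
            if word not in features or not adjectives:
                continue
            # The adjective closest to the feature is its effective opinion.
            nearest = min(adjectives, key=lambda i: abs(i - pos))
            orientation = ORIENTATION.get(words[nearest])
            if orientation is None:
                continue  # discard adjectives of unknown orientation
            summary[word]["positive" if orientation > 0 else "negative"] += 1
    return dict(summary)

reviews = [
    [("The", "DET"), ("pictures", "NOUN"), ("are", "VERB"), ("very", "ADV"), ("clear", "ADJ")],
    [("Overall", "ADV"), ("a", "DET"), ("fantastic", "ADJ"), ("very", "ADV"),
     ("compact", "ADJ"), ("camera", "NOUN")],
    [("The", "DET"), ("pictures", "NOUN"), ("look", "VERB"), ("blurry", "ADJ")],
]
summary = summarize_opinions(reviews, features={"pictures", "camera"})
print(summary)
```

A sentence like "While light, it will not easily fit in pockets" defeats this approach, since the negative opinion is carried by negation rather than by an adjective next to a feature noun.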
Datasets
Blog06 (25GB): University of Glasgow
http://ir.dcs.gla.ac.uk/test_collections/access_to_data.htm
Congressional floor-debate transcripts
http://www.cs.cornell.edu/home/llee/data/convote.html
Cornell movie-review datasets
http://www.cs.cornell.edu/people/pabo/movie-review-data/
References
Untangling Text Data Mining. M. Hearst. Proceedings of ACL’99.
www.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
An Introduction to Information Retrieval. Manning, Raghavan, Schütze.
www-csli.stanford.edu/~schuetze/information-retrieval-book.html
Tutorial on Web Content Mining. Bing Liu. WWW 2005.
www.cs.uic.edu/~liub
Web Data Mining. Bing Liu. Springer, 2006.
Opinion Mining and Sentiment Analysis. B. Pang and L. Lee. Foundations and Trends in Information Retrieval, 2(1-2), 2008.
Sentiment Analysis and Opinion Mining. Bing Liu. Morgan & Claypool, 2012.
www.morganclaypool.com/doi/abs/10.2200/S00416ED1V01Y201204HLT016?journalCode=hlt