Text Preprocessing For
Unsupervised Learning: Why It
Matters, When It Misleads, And
What To Do About It
Matthew J. Denny¹
Penn State University
Arthur Spirling
New York University
October 15, 2016
¹ Work supported by NSF Grant DGE-1144860
Text-As-Data Research
1. Awesome Research Design!
2. Collect Awesome Text Data!
3. ...
4. Perform Awesome Analysis!
5. Publish Awesome Paper!
Raw Text → Preprocessing → Document-Term Matrix

             Doc 1   Doc 2   …
amend          56      24    …
federal        34      13    …
section        20      41    …
spending       75       0    …
…               …       …    …
Common Preprocessing Decisions

P – Punctuation Removal
N – Number Removal
L – Lowercasing
S – Stemming
W – Stopword Removal
I – Infrequent Term Removal
‘3’ – n-gram Inclusion

7 binary choices → 2⁷ = 128 specifications.
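As a quick illustration (not code from the paper), the full factorial grid of specifications can be enumerated in R with expand.grid(); the column names here are just labels for the seven binary choices.

# Seven binary preprocessing choices from the slide (names illustrative)
specs <- expand.grid(
  punctuation = c(FALSE, TRUE),
  numbers     = c(FALSE, TRUE),
  lowercase   = c(FALSE, TRUE),
  stemming    = c(FALSE, TRUE),
  stopwords   = c(FALSE, TRUE),
  infrequent  = c(FALSE, TRUE),
  ngrams      = c(FALSE, TRUE))
nrow(specs)  # 128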
Supervised Learning
Unsupervised Learning
What Could Possibly Go Wrong?
Motivating Example
• UK Manifestos Corpus (1918–2001)
• Labour, Liberal, Conservative parties
• Wordfish
  • Place documents in ideological space
• Process:
  1. Select preprocessing specification
  2. Run Wordfish (see the sketch below)
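A minimal sketch of step 2 under stated assumptions: textmodel_wordfish() from the quanteda.textmodels package is one standard Wordfish implementation (the slides do not name the software used), and dfm_uk stands in for a document-feature matrix built under one preprocessing specification.

library(quanteda.textmodels)

# dfm_uk: quanteda dfm of the manifestos under one specification (assumed)
wf <- textmodel_wordfish(dfm_uk, dir = c(1, 2))  # dir fixes left/right polarity
wf$theta  # estimated ideological position of each document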
1983 Labour Manifesto
A-Priori Rankings
• Focus on 8 manifestos:
  1. Four general elections (1983–1997)
  2. Labour and Conservative parties
• Lab 1983: “longest suicide note in history”, extremely left-wing.

Lab 1983 < Lab 1987 < Lab 1992 < Lab 1997 < Con 1992 < Con 1997 < Con 1987 < Con 1983
Wordfish Rankings
[Figure: Wordfish rankings of the eight manifestos (Lab 1983, Lab 1987, Lab 1992, Lab 1997, Con 1992, Con 1997, Con 1987, Con 1983) across preprocessing specifications]
Forking Paths
• 12 unique document rankings.
• Substantially different conclusions.

Specification    Most Left    Most Right
P-N-S-W-I-3      Lab 1983     Cons 1983
N-S-W-3          Lab 1987     Cons 1987
N-L-3            Lab 1992     Cons 1987
N-L-S            Lab 1983     Cons 1992
Another Example: Topic Models
• Senate Press Releases (Grimmer, 2010)
• Sample of 1,000 documents
• 100 documents from each of 10 Senators.
• Note: no n-grams (computational cost).
• Procedure:
  1. Find the optimal number of topics for each specification (perplexity).
  2. Run topic model (LDA).
Sen. Sanders, April 1, 2008
Perplexity to Select Number of Topics
• 10-fold cross-validation.
• Split data into train/test sets (80/20).
• Find minimum perplexity over the number of topics (see the sketch below).
• topics ∈ {25, 50, 75, 100, 125, 150, 175, 200}
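A minimal sketch of this selection step using the topicmodels package (a standard LDA implementation; the slides do not name the software used). For brevity it uses a single 80/20 split rather than full 10-fold cross-validation, and dtm stands in for the DocumentTermMatrix of one preprocessing specification.

library(topicmodels)  # provides LDA() and perplexity()

# dtm: DocumentTermMatrix for one preprocessing specification (assumed)
candidate_k <- c(25, 50, 75, 100, 125, 150, 175, 200)

set.seed(42)
train_idx <- sample(nrow(dtm), floor(0.8 * nrow(dtm)))  # 80/20 split
train <- dtm[train_idx, ]
test  <- dtm[-train_idx, ]

perplexities <- sapply(candidate_k, function(k) {
  fit <- LDA(train, k = k, control = list(seed = 123))
  perplexity(fit, newdata = test)  # held-out perplexity
})

optimal_k <- candidate_k[which.min(perplexities)]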
Optimal Number of Topics
[Figure: optimal number of topics (0–200, chosen by perplexity) for each of the 128 preprocessing specifications, plotted against the number of preprocessing steps (0–6); each point is labeled with its specification, e.g. P−N−S−W−I]
Key Terms Example
• Select five “key terms”:
  iraq, terror(ism), (al) qaeda, insur(ance), stem (cell)
• In how many topics’ top terms do they appear? (See the sketch below.)
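One way to do this count, sketched under assumptions: fit is a fitted LDA model from topicmodels, and each key term is matched as a stem against the top 10 terms of every topic.

library(topicmodels)

# fit: fitted LDA model (assumed); terms() returns a 10 x k matrix
top_terms <- terms(fit, 10)
key_stems <- c("iraq", "terror", "qaeda", "insur", "stem")

# For each key stem, count the topics whose top terms include a match
sapply(key_stems, function(s)
  sum(apply(top_terms, 2, function(topic) any(startsWith(topic, s)))))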
Key Terms in Topic Top-Terms
[Figure: share of topic top-terms containing each key term (Iraq, Terrorism, Al Qaeda, Insurance, Stem Cell) for every preprocessing specification, shown with and without infrequent term removal (I); averaged over 40 initializations; shading bins from 0% to 10%+]
Forking Paths
• Different preprocessing → different conclusions.
• Are we doomed?
Our Solution: preText
• Assess consequences of preprocessing choices.
• Characterize a number of corpora.
• Easy-to-use R package!
Overview: Movements in Pairwise Document Distances
• No preprocessing as the base case.
• Compare how pairwise document distances change with preprocessing.
• Measure how unusual these changes are (see the sketch below).
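A minimal sketch of the underlying idea, not the package’s internal code: compute pairwise document distances on the raw DTM and on a preprocessed DTM, then rank document pairs by how much they moved. raw_dtm and prep_dtm are assumed documents-by-terms matrices for the same corpus; Euclidean distance is used purely for illustration.

raw_dist  <- as.matrix(dist(raw_dtm))    # pairwise distances, raw DTM (assumed)
prep_dist <- as.matrix(dist(prep_dtm))   # pairwise distances, preprocessed DTM
movement  <- abs(raw_dist - prep_dist)   # how far each document pair moved

# List document pairs from largest to smallest mover
pairs <- which(upper.tri(movement), arr.ind = TRUE)
ord   <- order(movement[pairs], decreasing = TRUE)
head(cbind(pairs[ord, ], movement = movement[pairs][ord]))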
Example With Three Documents
[Figure: toy example showing the three pairwise distances among Doc1, Doc2, and Doc3, computed from the original DTM and from two preprocessing specifications; each specification shifts the distances]
Ranking Distance Changes
[Figure: pairwise distances under the original DTM versus preprocessing specification 2, with the absolute difference for each pair; for example, d(1,3) moves from 3 to 1, so ∆d(1,3) = 2, making (1,3) the largest mover]
Comparing Preprocessing Specifications
• Each specification will have a largest mover.
• Where does that pair rank in the other specifications (M1, ..., M127)?

  vM1 = (2M2, 14M3, 2M4, 3M5, ..., 15M127)

• Average of vMi → how unusual.
preText Scores
• Consider the top k largest-moving document pairs.
• Average across vMi → vMi(k)
• Normalize by n(n−1)/2 (n = number of documents):

  preText score_i = 2 · vMi(k) / (n(n−1))
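A one-line encoding of that normalization (a hypothetical helper, not the package’s API): given the average rank of specification i’s top-k largest movers and the number of documents, rescale to [0, 1].

# avg_rank_top_k: average rank of specification i's top-k movers across
# the other specifications (assumed input); n_docs: number of documents
pretext_score <- function(avg_rank_top_k, n_docs) {
  2 * avg_rank_top_k / (n_docs * (n_docs - 1))
}

pretext_score(avg_rank_top_k = 350, n_docs = 50)  # ~0.286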
Interpreting preText Scores
• preText scores range between 0 and 1.
• Lower score → “typical” changes in document distances.
• Higher score → “atypical” changes in document distances.
preText Scores for Press Releases
[Figure: preText score (0.0–0.2) for each preprocessing combination, press releases corpus]
Regression Analysis

preText score_i = β0 + β1·Punctuation_i + β2·Numbers_i + β3·Lowercase_i
                + β4·Stem_i + β5·Stop Words_i + β6·N-Grams_i
                + β7·Infrequent Terms_i + ε_i
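A sketch of how this regression could be run in R (the data frame scores and its column names are illustrative; the preText package reports these coefficients itself):

# scores: one row per specification, with its preText score and seven
# binary indicators for the preprocessing steps applied (assumed)
fit <- lm(score ~ punctuation + numbers + lowercase + stem +
            stopwords + ngrams + infrequent,
          data = scores)
summary(fit)  # significant coefficients flag steps that "matter"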
Regression Analysis Results
[Figure: regression coefficients with confidence intervals for each preprocessing step (Remove Punctuation, Remove Numbers, Lowercase, Stemming, Remove Stopwords, Use NGrams, Remove Infrequent Terms) across eight corpora: UK Manifestos, SOTU Speeches, Trump Tweets, House Bills, NYT Articles, Press Releases, Death Row Statements, Indian Treaties]

Different preprocessing steps “matter” for different corpora.
What To Do About It
1. Significant parameter estimates serve as
an “early warning”.
2. Conservative approach: average results
over all specifications.
3. Depends on how good your “theory” is.
4. A priori reasons for selecting a
particular specification.
Three Cases
1. No parameter estimates are significantly different from zero.
2. Strong theory; some parameter estimates are significantly different from zero.
3. Weak theory; some parameter estimates are significantly different from zero.
Returning To The UK Wordfish Example
• Weak “theory” → P-N-L-S-W-I
• 2³ = 8 combinations of choices to average over.
Model Averaging
[Figure: Wordfish scores (−1.5 to 1.0) for Con 1983–1997 and Lab 1983–1997 under the theoretical specification (P−N−L−S−W−I) and averaged over the 8 specifications]
• Theoretical specification: “wrong”!
• Averaged: less “wrong”!
Summary
• Preprocessing matters.
• Forking paths of inference.
• Our solution: preText.
• General advice:
  • Represent uncertainty.
  • Always check, and tell the reader!
Software and Paper
• install.packages("preText") (see the usage sketch below)
• ssrn.com/abstract=2849145
• github.com/matthewjdenny/preText
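A usage sketch based on the package README (argument names may differ across versions; documents stands in for a character vector of raw texts):

library(preText)

# documents: character vector of raw texts (assumed)
# Build DTMs under the factorial grid of preprocessing specifications
preprocessed <- factorial_preprocessing(
  documents,
  use_ngrams = TRUE,
  infrequent_term_threshold = 0.01)

# Compute preText scores and regression diagnostics
results <- preText(
  preprocessed,
  dataset_name = "UK Manifestos",
  distance_method = "cosine",
  num_comparisons = 20)

preText_score_plot(results)           # score for each specification
regression_coefficient_plot(results)  # which steps "matter"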