Text Preprocessing For
Unsupervised Learning: Why It
Matters, When It Misleads, And
What To Do About It
Matthew J. Denny¹
Penn State University
Arthur Spirling
New York University
October 15, 2016
¹ Work supported by NSF Grant DGE-1144860
Text-As-Data Research
1. Awesome Research Design!
2. Collect Awesome Text Data!
3. ...
4. Perform Awesome Analysis!
5. Publish Awesome Paper!
Raw Text → Preprocessing → Document-Term Matrix

             Doc 1   Doc 2   …
amend          56      24    …
federal        34      13    …
section        20      41    …
spending       75       0    …
…               …       …    …
Common Preprocessing Decisions

P – Punctuation Removal
N – Number Removal
L – Lowercasing
S – Stemming
W – Stopword Removal
I – Infrequent Term Removal
‘3’ – n-gram Inclusion

7 binary choices → 2⁷ = 128 specifications.
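As a quick illustration (not code from the paper), the full factorial grid of specifications can be enumerated in R with expand.grid(); the column names here are just labels for the seven binary choices.

# Seven binary preprocessing choices from the slide (names illustrative)
specs <- expand.grid(
  punctuation = c(FALSE, TRUE),
  numbers     = c(FALSE, TRUE),
  lowercase   = c(FALSE, TRUE),
  stemming    = c(FALSE, TRUE),
  stopwords   = c(FALSE, TRUE),
  infrequent  = c(FALSE, TRUE),
  ngrams      = c(FALSE, TRUE))
nrow(specs)  # 128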
Supervised Learning
Unsupervised Learning
What Could Possibly Go Wrong?
Motivating Example
• UK Manifestos Corpus (1918–2001)
• Labour, Liberal, Conservative parties
• Wordfish
  • Place documents in ideological space
• Process:
  1. Select preprocessing specification
  2. Run Wordfish (see the sketch below)
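A minimal sketch of step 2 under stated assumptions: textmodel_wordfish() from the quanteda.textmodels package is one standard Wordfish implementation (the slides do not name the software used), and dfm_uk stands in for a document-feature matrix built under one preprocessing specification.

library(quanteda.textmodels)

# dfm_uk: quanteda dfm of the manifestos under one specification (assumed)
wf <- textmodel_wordfish(dfm_uk, dir = c(1, 2))  # dir fixes left/right polarity
wf$theta  # estimated ideological position of each document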
1983 Labour Manifesto
A-Priori Rankings
• Focus on 8 manifestos:
  1. Four general elections (1983–1997)
  2. Labour and Conservative parties
• Lab 1983: “longest suicide note in history”, extremely left-wing.

Lab 1983 < Lab 1987 < Lab 1992 < Lab 1997 < Con 1992 < Con 1997 < Con 1987 < Con 1983
Wordfish Rankings
[Figure: Wordfish rankings of the eight manifestos (Lab 1983, Lab 1987, Lab 1992, Lab 1997, Con 1992, Con 1997, Con 1987, Con 1983) across preprocessing specifications]
Forking Paths
• 12 unique document rankings.
• Substantially different conclusions.

Specification    Most Left    Most Right
P-N-S-W-I-3      Lab 1983     Cons 1983
N-S-W-3          Lab 1987     Cons 1987
N-L-3            Lab 1992     Cons 1987
N-L-S            Lab 1983     Cons 1992
Another Example: Topic Models
• Senate Press Releases (Grimmer, 2010)
• Sample of 1,000 documents
• 100 documents from each of 10 Senators.
• Note: no n-grams (computational cost).
• Procedure:
  1. Find the optimal number of topics for each specification (perplexity).
  2. Run topic model (LDA).
Sen. Sanders, April 1, 2008
Perplexity to Select Number of Topics
• 10-fold cross-validation.
• Split data into train/test sets (80/20).
• Find minimum perplexity over the number of topics (see the sketch below).
• topics ∈ {25, 50, 75, 100, 125, 150, 175, 200}
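A minimal sketch of this selection step using the topicmodels package (a standard LDA implementation; the slides do not name the software used). For brevity it uses a single 80/20 split rather than full 10-fold cross-validation, and dtm stands in for the DocumentTermMatrix of one preprocessing specification.

library(topicmodels)  # provides LDA() and perplexity()

# dtm: DocumentTermMatrix for one preprocessing specification (assumed)
candidate_k <- c(25, 50, 75, 100, 125, 150, 175, 200)

set.seed(42)
train_idx <- sample(nrow(dtm), floor(0.8 * nrow(dtm)))  # 80/20 split
train <- dtm[train_idx, ]
test  <- dtm[-train_idx, ]

perplexities <- sapply(candidate_k, function(k) {
  fit <- LDA(train, k = k, control = list(seed = 123))
  perplexity(fit, newdata = test)  # held-out perplexity
})

optimal_k <- candidate_k[which.min(perplexities)]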
Optimal Number of Topics
[Figure: optimal number of topics (0–200, chosen by perplexity) for each of the 128 preprocessing specifications, plotted against the number of preprocessing steps (0–6); each point is labeled with its specification, e.g. P−N−S−W−I]
Key Terms Example
• Select five “key terms”:
  iraq, terror(ism), (al) qaeda, insur(ance), stem (cell)
• In how many topics’ top terms do they appear? (See the sketch below.)
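One way to do this count, sketched under assumptions: fit is a fitted LDA model from topicmodels, and each key term is matched as a stem against the top 10 terms of every topic.

library(topicmodels)

# fit: fitted LDA model (assumed); terms() returns a 10 x k matrix
top_terms <- terms(fit, 10)
key_stems <- c("iraq", "terror", "qaeda", "insur", "stem")

# For each key stem, count the topics whose top terms include a match
sapply(key_stems, function(s)
  sum(apply(top_terms, 2, function(topic) any(startsWith(topic, s)))))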
Key Terms in Topic Top-Terms
[Figure: share of topic top-terms containing each key term (Iraq, Terrorism, Al Qaeda, Insurance, Stem Cell) for every preprocessing specification, shown with and without infrequent term removal (I); averaged over 40 initializations; shading bins from 0% to 10%+]
Forking Paths
• Different preprocessing → different conclusions.
• Are we doomed?
Our Solution: preText
• Assess consequences of preprocessing choices.
• Characterize a number of corpora.
• Easy-to-use R package!
Overview: Movements in Pairwise Document Distances
• No preprocessing as the base case.
• Compare how pairwise document distances change with preprocessing.
• Measure how unusual these changes are (see the sketch below).
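A minimal sketch of the underlying idea, not the package’s internal code: compute pairwise document distances on the raw DTM and on a preprocessed DTM, then rank document pairs by how much they moved. raw_dtm and prep_dtm are assumed documents-by-terms matrices for the same corpus; Euclidean distance is used purely for illustration.

raw_dist  <- as.matrix(dist(raw_dtm))    # pairwise distances, raw DTM (assumed)
prep_dist <- as.matrix(dist(prep_dtm))   # pairwise distances, preprocessed DTM
movement  <- abs(raw_dist - prep_dist)   # how far each document pair moved

# List document pairs from largest to smallest mover
pairs <- which(upper.tri(movement), arr.ind = TRUE)
ord   <- order(movement[pairs], decreasing = TRUE)
head(cbind(pairs[ord, ], movement = movement[pairs][ord]))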
Example With Three Documents
[Figure: toy example showing the three pairwise distances among Doc1, Doc2, and Doc3, computed from the original DTM and from two preprocessing specifications; each specification shifts the distances]
Ranking Distance Changes
[Figure: pairwise distances under the original DTM versus preprocessing specification 2, with the absolute difference for each pair; for example, d(1,3) moves from 3 to 1, so ∆d(1,3) = 2, making (1,3) the largest mover]
Comparing Preprocessing Specifications
• Each specification will have a largest mover.
• Where does that pair rank in the other specifications (M1, ..., M127)?

  vM1 = (2M2, 14M3, 2M4, 3M5, ..., 15M127)

• Average of vMi → how unusual.
preText Scores
• Consider the top k largest-moving document pairs.
• Average across vMi → vMi(k)
• Normalize by n(n−1)/2 (n = number of documents):

  preText score_i = 2 · vMi(k) / (n(n−1))
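A one-line encoding of that normalization (a hypothetical helper, not the package’s API): given the average rank of specification i’s top-k largest movers and the number of documents, rescale to [0, 1].

# avg_rank_top_k: average rank of specification i's top-k movers across
# the other specifications (assumed input); n_docs: number of documents
pretext_score <- function(avg_rank_top_k, n_docs) {
  2 * avg_rank_top_k / (n_docs * (n_docs - 1))
}

pretext_score(avg_rank_top_k = 350, n_docs = 50)  # ~0.286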
Interpreting preText Scores
• preText scores range between 0 and 1.
• Lower score → “typical” changes in document distances.
• Higher score → “atypical” changes in document distances.
preText Scores for Press Releases
[Figure: preText score (0.0–0.2) for each preprocessing combination, press releases corpus]
Regression Analysis

preText score_i = β0 + β1·Punctuation_i + β2·Numbers_i + β3·Lowercase_i
                + β4·Stem_i + β5·Stop Words_i + β6·N-Grams_i
                + β7·Infrequent Terms_i + ε_i
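A sketch of how this regression could be run in R (the data frame scores and its column names are illustrative; the preText package reports these coefficients itself):

# scores: one row per specification, with its preText score and seven
# binary indicators for the preprocessing steps applied (assumed)
fit <- lm(score ~ punctuation + numbers + lowercase + stem +
            stopwords + ngrams + infrequent,
          data = scores)
summary(fit)  # significant coefficients flag steps that "matter"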
Regression Analysis Results
[Figure: regression coefficients with confidence intervals for each preprocessing step (Remove Punctuation, Remove Numbers, Lowercase, Stemming, Remove Stopwords, Use NGrams, Remove Infrequent Terms) across eight corpora: UK Manifestos, SOTU Speeches, Trump Tweets, House Bills, NYT Articles, Press Releases, Death Row Statements, Indian Treaties]

Different preprocessing steps “matter” for different corpora.
What To Do About It
1. Significant parameter estimates serve as
an “early warning”.
2. Conservative approach: average results
over all specifications.
3. Depends on how good your “theory” is.
4. A priori reasons for selecting a
particular specification.
Three Cases
1. No parameter estimates are significantly different from zero.
2. Strong theory; some parameter estimates are significantly different from zero.
3. Weak theory; some parameter estimates are significantly different from zero.
Returning To The UK Wordfish Example
• Weak “theory” → P-N-L-S-W-I
• 2³ = 8 combinations of choices to average over.
Model Averaging
[Figure: Wordfish scores (−1.5 to 1.0) for Con 1983–1997 and Lab 1983–1997 under the theoretical specification (P−N−L−S−W−I) and averaged over the 8 specifications]
• Theoretical specification: “wrong”!
• Averaged: less “wrong”!
Summary
• Preprocessing matters.
• Forking paths of inference.
• Our solution: preText.
• General advice:
  • Represent uncertainty.
  • Always check, and tell the reader!
Software and Paper
• install.packages("preText") (see the usage sketch below)
• ssrn.com/abstract=2849145
• github.com/matthewjdenny/preText
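A usage sketch based on the package README (argument names may differ across versions; documents stands in for a character vector of raw texts):

library(preText)

# documents: character vector of raw texts (assumed)
# Build DTMs under the factorial grid of preprocessing specifications
preprocessed <- factorial_preprocessing(
  documents,
  use_ngrams = TRUE,
  infrequent_term_threshold = 0.01)

# Compute preText scores and regression diagnostics
results <- preText(
  preprocessed,
  dataset_name = "UK Manifestos",
  distance_method = "cosine",
  num_comparisons = 20)

preText_score_plot(results)           # score for each specification
regression_coefficient_plot(results)  # which steps "matter"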