Constructing Structured Information Networks from Massive Text Corpora
Part I: Quality Phrase Mining
Effort-Light StructMine: Methodology
[Figure: the Effort-Light StructMine pipeline. A text corpus undergoes data-driven text segmentation (SIGMOD'15, WWW'16) to produce entity names & context units; matching against knowledge bases yields a partially-labeled corpus; a corpus-specific model is then learned (KDD'15, KDD'16, EMNLP'16, WWW'17) to extract structures from the remaining unlabeled data.]
Quality Phrase Mining
• Quality phrase mining seeks to extract a list of phrases, ranked in decreasing order of quality, from a large collection of documents
• Examples:
  • Scientific papers, expected results: data mining; machine learning; information retrieval; …; support vector machine; …; the paper; …
  • News articles, expected results: US President; Anderson Cooper; Barack Obama; …; Obama administration; …; a town; …
Why Phrase Mining?
• Phrase: a minimal, unambiguous semantic unit; the basic building block for information networks and knowledge bases
• Unigrams vs. phrases
  • Unigrams (single words) are ambiguous
    • E.g., "United": United States? United Airlines? United Parcel Service?
  • A phrase is a natural, meaningful, unambiguous semantic unit
    • E.g., "United States" vs. "United Airlines"
• Mining semantically meaningful phrases
  • Transforms text data from word granularity to phrase granularity
  • Enhances the power and efficiency of manipulating unstructured data using database technology
Application Scenarios
• Natural Language Processing (NLP)
  • Document analysis
• Information Retrieval (IR)
  • Indexing in search engines
• Text Mining
  • Keyphrases for topic modeling
What Kind of Phrases Are of "High Quality"?
• Popularity
  • "information retrieval" > "cross-language information retrieval"
• Concordance
  • "strong tea" > "powerful tea"
  • "active learning" > "learning classification"
• Informativeness
  • "this paper" (frequent but not discriminative, hence not informative)
• Completeness
  • "support vector machine" > "vector machine"
Three Families of Methods
Supervised (linguistic analyzers)
Unsupervised (statistical signals)
Weakly / Distantly Supervised
Supervised Phrase Mining
• Phrase mining originated in the NLP community
• How to use linguistic analyzers to extract phrases?
  • Parsing (e.g., Stanford NLP parsers)
  • Noun Phrase (NP) Chunking
• How to rank extracted phrases?
  • C-value [Frantzi et al.'00]
  • TextRank [Mihalcea et al.'04]
  • TF-IDF
Linguistic Analyzer – Parsing
• Minimal grammatical segments ↔ phrases
[Figure: full-text parsing maps a raw text sentence (a string) to a full parse tree (grammatical analysis), e.g., for "The chef cooks the soup."]
• Phrases: "the chef", "the soup"
Inefficiencies of Parsing
• Difficult to directly apply pre-trained parsers to new domains (e.g., Twitter, biomedical text, Yelp)
  • Unless sophisticated, manually curated, domain-specific training data are provided
• Computationally slow
  • Cannot be applied on web-scale data to support emerging applications
• We need "shallow" phrase mining techniques
Linguistic Analyzer – Chunking
• Noun phrase chunking is a lightweight version of parsing
  1. Apply tokenization and part-of-speech (POS) tagging to each sentence
  2. Search for noun phrase chunks (see the sketch below)
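A minimal sketch of this two-step pipeline using NLTK's off-the-shelf tokenizer, POS tagger, and regular-expression chunker; the chunk grammar below is an illustrative choice, not the one used in any particular paper:

```python
import nltk

# First use requires NLTK's pre-trained models (an assumption about your setup):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

# Illustrative NP grammar: optional determiner, any adjectives, one or more nouns.
GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(sentence):
    tokens = nltk.word_tokenize(sentence)   # step 1a: tokenization
    tagged = nltk.pos_tag(tokens)           # step 1b: POS tagging
    tree = chunker.parse(tagged)            # step 2: NP chunking
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]

print(noun_phrases("The chef cooks the soup."))  # ['The chef', 'the soup']
```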
Drawbacks of NP Chunking
• Pre-trained models may not transfer to new domains
  • Scientific domains, query logs, social media (e.g., Yelp, Twitter)
• Does not use corpus-level information
• NPs sometimes cannot meet the requirements of quality phrases
Ranking – C-value
• Given a set of candidate phrases, for a given phrase 𝑝:
  • 𝑓(𝑝) is the raw frequency
  • |𝑝| is the number of tokens in 𝑝
• If no other phrase contains 𝑝 as a substring:
  • C-value(𝑝) = log₂|𝑝| ⋅ 𝑓(𝑝)
• Else:
  • C-value(𝑝) = log₂|𝑝| ⋅ (𝑓(𝑝) − avg_{𝑞⊃𝑝} 𝑓(𝑞)), averaging over the phrases 𝑞 that contain 𝑝
• Prefers "maximal" phrases (a sketch follows below)
• Popularity & Completeness
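A minimal sketch of the C-value computation defined above; candidate phrases are stored as token tuples, and the quadratic containment scan is purely illustrative (real implementations index substrings):

```python
import math
from collections import defaultdict

def c_value(freq):
    """C-value ranking. `freq` maps each candidate phrase, stored as a
    tuple of tokens, to its raw corpus frequency f(p)."""
    phrases = list(freq)
    # For each phrase p, collect f(q) for every longer phrase q containing p.
    containers = defaultdict(list)
    for p in phrases:
        for q in phrases:
            if len(q) > len(p) and any(
                    q[i:i + len(p)] == p for i in range(len(q) - len(p) + 1)):
                containers[p].append(freq[q])
    scores = {}
    for p in phrases:
        adjusted = freq[p]
        if containers[p]:  # p is nested: subtract the average container frequency
            adjusted -= sum(containers[p]) / len(containers[p])
        # log2|p| = 0 for unigrams; C-value targets multi-word terms.
        scores[p] = math.log2(len(p)) * adjusted
    return scores

freq = {("vector", "machine"): 95, ("support", "vector", "machine"): 90}
print(c_value(freq))
# ("vector", "machine") is heavily penalized because it mostly occurs
# nested inside "support vector machine".
```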
Ranking – TextRank
• Construct a network of phrases & unigrams
• Compute the importance of vertices
  • Similar to PageRank (a sketch follows below)
• Popularity & Informativeness
[Figure: word graph built from a sample abstract: "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. …"]
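A minimal TextRank-style sketch using networkx; the window size is an illustrative assumption, not prescribed by the slide:

```python
import networkx as nx

def textrank(tokens, window=4, top_k=5):
    """Rank words by PageRank over a co-occurrence graph (TextRank's core)."""
    graph = nx.Graph()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:   # link words within the window
            if word != other:
                graph.add_edge(word, other)
    scores = nx.pagerank(graph)                  # PageRank over the word graph
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = ("compatibility of systems of linear constraints over the set of "
          "natural numbers").split()
print(textrank(tokens))
```

Full TextRank restricts vertices to nouns and adjectives and collapses adjacent top-ranked words into multi-word keyphrases; both refinements are omitted in this sketch.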
Ranking – TF-IDF
• Term Frequency (TF)
  • E.g., raw frequency
  • Rewards frequent phrases
• Inverse Document Frequency (IDF)
  • E.g., log((# of all documents) / (# of documents containing the phrase))
  • Rewards "rare" phrases
• Popularity & Informativeness (a sketch follows below)
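A tiny self-contained sketch that scores phrases per document exactly as the TF and IDF bullets above describe; the toy documents are hypothetical:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of documents, each a list of phrase strings.
    Returns {(doc_index, phrase): score}."""
    n_docs = len(docs)
    # Document frequency: in how many documents each phrase occurs.
    df = Counter(phrase for doc in docs for phrase in set(doc))
    scores = {}
    for i, doc in enumerate(docs):
        tf = Counter(doc)  # raw frequency within this document
        for phrase, freq in tf.items():
            # log(N / df) = 0 when a phrase occurs in every document
            scores[(i, phrase)] = freq * math.log(n_docs / df[phrase])
    return scores

docs = [["data mining", "data mining", "this paper"],
        ["machine learning", "this paper"]]
print(tf_idf(docs))  # "this paper" scores 0 everywhere: frequent but not informative
```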
Three Families of Methods
Supervised (linguistic analyzers)
Unsupervised (statistical signals)
Weakly / Distantly Supervised
Unsupervised Phrase Mining
• Statistics based on massive text corpora
• Popularity
  • Raw frequency
  • Frequency distribution based on Zipfian ranks [Deane'05]
• Concordance
  • Significance score [Church et al.'91] [El-Kishky et al.'14]
• Completeness
  • Comparison to super-/sub-sequences [Parameswaran et al.'10]
Raw Frequency
• Frequent contiguous pattern mining
  • If "AB" is frequent, "AB" is likely to be a phrase
• It prefers
  • "Stop phrases"
  • Shorter phrases
    • E.g., freq(vector machine) ≥ freq(support vector machine)
• Raw frequency alone cannot reflect the quality of phrases (see the counting sketch below)
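A minimal contiguous n-gram counter illustrating why raw frequency over-counts sub-phrases: every occurrence of "support vector machine" also increments "vector machine":

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count all contiguous n-grams up to length max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

tokens = "we train a support vector machine on a support vector machine".split()
counts = ngram_counts(tokens)
# A sub-phrase is always at least as frequent as its super-phrase:
assert counts["vector machine"] >= counts["support vector machine"]
```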
Raw Frequency (improved)
• Combine with topic modeling
  • Merge adjacent unigrams of the same topic [Blei & Lafferty'09]
  • Frequent pattern mining within the same topic [Danilevsky et al.'14]
• Limitations
  • Tokens in the same phrase may be assigned to different topics
    • E.g., "knowledge discovery using least squares support vector machine classifiers …"
Frequency Distribution
• Idea: ranks in a Zipfian frequency distribution are more reliable than raw frequencies
• Heuristic: Actual Rank / Expected Rank (a loose sketch follows below)
• Example:
  • Given a phrase like "east end"
  • Actual Rank: the rank of "east end" among all occurrences of "east" (e.g., "east end", "east side", "the east", "towards the east", etc.)
  • Expected Rank: the rank of "__ end" among all contexts of "east" (e.g., "__ end", "__ side", "the __", "towards the __", etc.)
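A loose sketch under one reading of the slide: the actual rank of a context among the contexts observed with a given word, versus the rank that context has across the whole corpus. The exact statistic in [Deane'05] is more involved, so treat this as illustrative only; all counts below are hypothetical:

```python
from collections import Counter

def rank(item, counter):
    """1-based rank of `item` when keys are sorted by descending frequency."""
    ordered = [k for k, _ in counter.most_common()]
    return ordered.index(item) + 1

def rank_ratio(context, word_contexts, global_contexts):
    """word_contexts: Counter of contexts observed with one word (e.g., "east").
       global_contexts: Counter of the same contexts over all words."""
    actual = rank(context, word_contexts)      # rank among "east ..." patterns
    expected = rank(context, global_contexts)  # rank among all such contexts
    return actual / expected

east = Counter({"__ end": 40, "__ side": 25, "the __": 90})          # hypothetical
everywhere = Counter({"__ end": 1000, "__ side": 5000, "the __": 90000})
print(rank_ratio("__ end", east, everywhere))
```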
Significance score
• Significance score [Church et al.'91]
  • A.k.a. Z-score
• ToPMine [El-Kishky et al.'15]
  • If a phrase can be decomposed into two parts P = P₁ ⊕ P₂:
    • α(P₁, P₂) ≈ (𝑓(P₁⊕P₂) − μ₀(P₁, P₂)) / √𝑓(P₁⊕P₂)
    • where μ₀(P₁, P₂) is the expected frequency of P₁⊕P₂ under the null hypothesis that P₁ and P₂ occur independently
  • Phrases with high significance scores are kept as quality phrases
Significance score (cont’d)
• Merge adjacent unigrams greedily as long as their significance score is above the threshold (see the sketch below).
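A compact sketch of the significance score and the greedy merging loop, assuming counts over a corpus with `total` token positions; μ₀ is estimated as total · p(P₁) · p(P₂), which is one standard independence estimate and not necessarily ToPMine's exact choice:

```python
import math

def significance(f1, f2, f12, total):
    """alpha(P1, P2) = (f(P1+P2) - mu0) / sqrt(f(P1+P2)),
    with mu0 = expected frequency if P1 and P2 occurred independently."""
    mu0 = total * (f1 / total) * (f2 / total)
    return (f12 - mu0) / math.sqrt(f12) if f12 else float("-inf")

def greedy_merge(tokens, freq, total, threshold=2.0):
    """Repeatedly merge the adjacent pair with the highest significance."""
    parts = list(tokens)
    while len(parts) > 1:
        best_i, best_a = None, threshold
        for i in range(len(parts) - 1):
            merged = parts[i] + " " + parts[i + 1]
            a = significance(freq[parts[i]], freq[parts[i + 1]],
                             freq.get(merged, 0), total)
            if a > best_a:
                best_i, best_a = i, a
        if best_i is None:          # no pair clears the threshold: stop
            break
        parts[best_i:best_i + 2] = [parts[best_i] + " " + parts[best_i + 1]]
    return parts

freq = {"support": 900, "vector": 800, "machine": 700,          # hypothetical counts
        "support vector": 600, "vector machine": 650,
        "support vector machine": 580}
print(greedy_merge(["support", "vector", "machine"], freq, total=1_000_000))
# ['support vector machine']
```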
Comparison to super/subsequences
• Frequency ratio between an n-gram phrase and its two (n−1)-gram sub-phrases
• Example (raw frequencies):
  • San: 14,585
  • Antonio: 2,855
  • San Antonio: 2,385
• Pre-confidence of "San Antonio": 2385 / 14585
• Post-confidence of "San Antonio": 2385 / 2855
• Expand / terminate based on thresholds (see the sketch below)
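A short sketch of the pre-/post-confidence computation above; the decision rule and threshold are illustrative assumptions, since [Parameswaran et al.'10] defines its own expansion procedure:

```python
def confidences(freq, phrase):
    """Pre-/post-confidence of an n-gram vs. its two (n-1)-gram sub-phrases."""
    tokens = phrase.split()
    prefix = " ".join(tokens[:-1])       # drop the last token
    suffix = " ".join(tokens[1:])        # drop the first token
    pre = freq[phrase] / freq[prefix]    # e.g., 2385 / 14585
    post = freq[phrase] / freq[suffix]   # e.g., 2385 / 2855
    return pre, post

freq = {"San": 14585, "Antonio": 2855, "San Antonio": 2385}
pre, post = confidences(freq, "San Antonio")
THRESHOLD = 0.5  # illustrative assumption
print(pre, post, "expand" if max(pre, post) >= THRESHOLD else "terminate")
```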
Comparison to super/subsequences (cont’d)
• Assumption: for an n-gram quality phrase, at least one of its two (n−1)-gram sub-phrases is not a quality phrase
• Anti-example
  • "relational database system" is a quality phrase
  • Yet both "relational database" and "database system" can be quality phrases
Limitations of Statistical Signals
• The thresholds must be carefully chosen
• Each signal only considers a subset of the quality-phrase requirements
• Combining different signals in an unsupervised manner is difficult
• Introducing some supervision may help!
Three Families of Methods
Supervised (linguistic analyzers)
Unsupervised (statistical signals)
Weakly / Distantly Supervised
Weakly / Distantly Supervised Phrase Mining Methods
• SegPhrase [Liu et al.'15]
  • Weakly supervised
• AutoPhrase [Shang et al.'17]
  • Distantly supervised
SegPhrase
• Outperforms all of the above methods on domain-specific corpora (e.g., Yelp reviews)
[Figure: the SegPhrase pipeline. An input raw corpus goes through Phrase Mining to produce Quality Phrases; Phrasal Segmentation then produces a Segmented Corpus, which is used to re-estimate phrase quality. Example documents:
  • Document 1: "Citation recommendation is an interesting but challenging research problem in data mining area."
  • Document 2: "In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique."
  • Document 3: "Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications."]
Quality Estimation
• Weakly supervised
  • Labels: whether a phrase is a quality one or not
    • "support vector machine": 1
    • "the experiment shows": 0
  • For a ~1GB corpus, only 300 labels are needed
• Pros
  • Binary annotations are easy
• Cons
  • The selection of hundreds of varying-quality phrases from millions of candidates must be done carefully
Phrasal Segmentation
• Phrasal segmentation can tell which phrase occurrence is appropriate (a toy sketch follows below)
  • Ex: A standard ⌈feature vector⌋ ⌈machine learning⌋ setup is used to describe …
  • Here "vector machine" straddles a segment boundary, so it is not counted towards the rectified frequency
• Effects on quality re-estimation (real data):
  • "np hard in the strong sense"
  • "np hard in the strong"
  • "database management system"
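A toy left-to-right segmenter that greedily prefers the longest known quality phrase at each position. SegPhrase actually learns a probabilistic segmentation model jointly with phrase quality, so this only illustrates why rectified counts differ from raw counts:

```python
def segment(tokens, quality_phrases, max_len=3):
    """Greedily cover the sentence with the longest known quality phrases."""
    segments, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in quality_phrases:
                segments.append(candidate)
                i += n
                break
    return segments

quality = {"feature vector", "machine learning"}
sent = "a standard feature vector machine learning setup is used".split()
print(segment(sent, quality))
# ['a', 'standard', 'feature vector', 'machine learning', 'setup', 'is', 'used']
# "vector machine" never appears as a segment, so it contributes nothing
# to the rectified frequency.
```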
From the Titles and Abstracts of SIGMOD
Query: SIGMOD

Rank | SegPhrase | Chunking (TF-IDF & C-value)
1 | database | data base
2 | database system | database system
3 | relational database | query processing
4 | query optimization | query optimization
5 | query processing | relational database
… | … | …
51 | sql server | database technology
52 | relational data | database server
53 | data structure | large volume
54 | join query | performance study
55 | web service | …
… | … | …
201 | high dimensional data | efficient implementation
202 | location based service | sensor network
203 | xml schema | large collection
204 | two phase locking | important issue
205 | deep web | frequent itemset
… | … | …

Only in SegPhrase: web service, …
Only in Chunking: …
From the Titles and Abstracts of SIGKDD
Query: SIGKDD

Rank | SegPhrase | Chunking (TF-IDF & C-value)
1 | data mining | data mining
2 | dataset | association rule
3 | association rule | knowledge discovery
4 | knowledge discovery | frequent itemset
5 | time series | decision tree
… | … | …
51 | association rule mining | search space
52 | rule set | domain knowledge
53 | concept drift | important problem
54 | knowledge acquisition | concurrency control
55 | gene expression data | conceptual graph
… | … | …
201 | web content | …
202 | frequent subgraph | semantic relationship
203 | intrusion detection | effective way
204 | categorical attribute | space complexity
205 | user preference | small set
… | … | …

Only in SegPhrase: optimal solution
Only in Chunking: …
Reported by TripAdvisor (Find "Interesting" Collections of Hotels)
AutoPhrase
• No label selection and annotation effort
• Smoothly supports multiple languages
AutoPhrase vs. Previous Work
[Figure: AutoPhrase compared with previous work across different domains and different languages.]
AutoPhrase’s Example Results
References
Deane, P., 2005, June. A nonparametric method for extraction of candidate phrasal terms. In Proceedings of the
43rd Annual Meeting on Association for Computational Linguistics (pp. 605-613). Association for Computational
Linguistics.
Koo, T., Carreras Pérez, X. and Collins, M., 2008. Simple semi-supervised dependency parsing. In 46th Annual
Meeting of the Association for Computational Linguistics (pp. 595-603).
Xun, E., Huang, C. and Zhou, M., 2000, October. A unified statistical model for the identification of English
baseNP. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (pp. 109-116).
Association for Computational Linguistics.
Zhang, Z., Iria, J., Brewster, C. and Ciravegna, F., 2008, May. A comparative evaluation of term recognition
algorithms. In LREC.
Park, Y., Byrd, R.J. and Boguraev, B.K., 2002, August. Automatic glossary extraction: beyond terminology
identification. In Proceedings of the 19th international conference on Computational linguistics-Volume 1 (pp. 1-7).
Association for Computational Linguistics.
Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G., 1999, August. KEA: Practical
automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries (pp. 254-255).
ACM.
Liu, Z., Chen, X., Zheng, Y. and Sun, M., 2011, June. Automatic keyphrase extraction by bridging vocabulary gap.
In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (pp. 135-144).
Association for Computational Linguistics.
Evans, D.A. and Zhai, C., 1996, June. Noun-phrase analysis in unrestricted text for information retrieval. In
Proceedings of the 34th annual meeting on Association for Computational Linguistics (pp. 17-24). Association for
Computational Linguistics.
Frantzi, K., Ananiadou, S. and Mima, H., 2000. Automatic recognition of multi-word terms: the C-value/NC-value
method. International Journal on Digital Libraries, 3(2), pp.115-130.
Mihalcea, R. and Tarau, P., 2004, July. TextRank: Bringing order into texts. Association for Computational
Linguistics.
Blei, D.M. and Lafferty, J.D., 2009. Topic models. Text mining: classification, clustering, and applications, 10(71),
p.34.
Danilevsky, M., Wang, C., Desai, N., Ren, X., Guo, J. and Han, J., 2014, April. Automatic construction and ranking
of topical keyphrases on collections of short documents. In Proceedings of the 2014 SIAM International
Conference on Data Mining (pp. 398-406). Society for Industrial and Applied Mathematics.
Church, K., Gale, W., Hanks, P. and Hindle, D., 1991. Using statistics in lexical analysis. Lexical acquisition:
exploiting on-line resources to build a lexicon, 115, p.164.
El-Kishky, A., Song, Y., Wang, C., Voss, C.R. and Han, J., 2014. Scalable topical phrase mining from text corpora.
Proceedings of the VLDB Endowment, 8(3), pp.305-316.
Parameswaran, A., Garcia-Molina, H. and Rajaraman, A., 2010. Towards the web of concepts: Extracting
concepts from large datasets. Proceedings of the VLDB Endowment, 3(1-2), pp.566-577.
Liu, J., Shang, J., Wang, C., Ren, X. and Han, J., 2015, May. Mining quality phrases from massive text corpora. In
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1729-1744). ACM.
Shang, J., Liu, J., Jiang, M., Ren, X., Voss, C.R. and Han, J., 2017. Automated Phrase Mining from Massive Text
Corpora. arXiv preprint arXiv:1702.04457.