Constructing Structured Information Networks from Massive Text Corpora
Part I: Quality Phrase Mining

Effort-Light StructMine: Methodology
• Text corpus → data-driven text segmentation (SIGMOD'15, WWW'16) → entity names & context units
• Knowledge bases → partially-labeled corpus → learning corpus-specific models (KDD'15, KDD'16, EMNLP'16, WWW'17) → structures from the remaining unlabeled data

Quality Phrase Mining
• Quality phrase mining seeks to extract a ranked list of phrases, in decreasing order of quality, from a large collection of documents
• Examples:
  • Scientific papers → expected results: "data mining", "machine learning", "information retrieval", "support vector machine", ... (but not "the paper")
  • News articles → expected results: "US President", "Anderson Cooper", "Barack Obama", "Obama administration", ... (but not "a town")

Why Phrase Mining?
• Phrase: a minimal, unambiguous semantic unit; the basic building block for information networks and knowledge bases
• Unigrams vs. phrases
  • Unigrams (single words) are often ambiguous
    • E.g., "United": United States? United Airlines? United Parcel Service?
  • Phrase: a natural, meaningful, unambiguous semantic unit
    • E.g., "United States" vs. "United Airlines"
• Mining semantically meaningful phrases
  • Transforms text data from word granularity to phrase granularity
  • Enhances the power and efficiency of manipulating unstructured data using database technology

Application Scenarios
• Natural Language Processing (NLP)
  • Document analysis
• Information Retrieval (IR)
  • Indexing in search engines
• Text Mining
  • Keyphrases for topic modeling

What Kind of Phrases Are of "High Quality"?
• Popularity
  • "information retrieval" > "cross-language information retrieval"
• Concordance
  • "strong tea" > "powerful tea"
  • "active learning" > "learning classification"
• Informativeness
  • "this paper" (frequent but not discriminative, hence not informative)
• Completeness
  • "support vector machine" > "vector machine"

Three Families of Methods
• Supervised (linguistic analyzers)
• Unsupervised (statistical signals)
• Weakly / distantly supervised

Supervised Phrase Mining
• Phrase mining originated in the NLP community
• How to use linguistic analyzers to extract phrases?
  • Parsing (e.g., Stanford NLP parsers)
  • Noun phrase (NP) chunking
• How to rank extracted phrases?
  • C-value [Frantzi et al.'00]
  • TextRank [Mihalcea et al.'04]
  • TF-IDF

Linguistic Analyzer – Parsing
• Minimal grammatical segments ⇔ phrases
• Raw text sentence (string) → full-text parsing → full parse tree (grammatical analysis)
  • E.g., "The chef cooks the soup." → phrases: "the chef", "the soup"

Inefficiencies of Parsing
• Difficult to apply pre-trained parsers directly to new domains (e.g., Twitter, biomedical text, Yelp)
  • Unless sophisticated, manually curated, domain-specific training data are provided
• Computationally slow
  • Cannot be applied to web-scale data to support emerging applications
• We need "shallow" phrase mining techniques

Linguistic Analyzer – Chunking
• Noun phrase chunking is a lightweight version of parsing
  1. Apply tokenization and part-of-speech (POS) tagging to each sentence
  2. Search for noun phrase chunks

Drawbacks of NP Chunking
• Pre-trained models may not be transferable to new domains
  • Scientific domains, query logs, social media (e.g., Yelp, Twitter)
• Makes no use of corpus-level information
• NPs sometimes cannot meet the requirements of quality phrases

Ranking – C-value
• Given a set of extracted phrases, for a given phrase p:
  • f(p) is the raw frequency
  • |p| is the number of tokens in p
• If no extracted phrase contains p as a substring:
  • C-value(p) = log₂|p| · f(p)
• Else:
  • C-value(p) = log₂|p| · ( f(p) − avg over phrases q containing p of f(q) )
• Prefers "maximal" phrases
• Captures popularity & completeness

Ranking – TextRank
• Construct a network of phrases & unigrams
• Compute the importance of vertices
  • Similar to PageRank
• Captures popularity & informativeness
• Example text: "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. ..."

Ranking – TF-IDF
• Term frequency (TF)
  • E.g., raw frequency
  • Rewards frequent phrases
• Inverse document frequency (IDF)
  • E.g., log( (# of all documents) / (# of documents in which the phrase occurs) )
  • Rewards "rare" phrases
• Captures popularity & informativeness

Unsupervised Phrase Mining
• Statistics computed over massive text corpora
• Popularity
  • Raw frequency
  • Frequency distribution based on Zipfian ranks [Deane'05]
• Concordance
  • Significance score [Church et al.'91] [El-Kishky et al.'14]
• Completeness
  • Comparison to super-/sub-sequences [Parameswaran et al.'10]

Raw Frequency
• Frequent contiguous pattern mining
  • If "AB" is frequent, "AB" is likely to be a phrase
• It prefers
  • "Stop phrases"
  • Shorter phrases
    • E.g., freq(vector machine) ≥ freq(support vector machine)
• Raw frequency alone cannot reflect the quality of phrases

Raw Frequency (improved)
• Combine with topic modeling
  • Merge adjacent unigrams of the same topic [Blei & Lafferty'09]
  • Frequent pattern mining within the same topic [Danilevsky et al.'14]
• Limitations
  • Tokens in the same phrase may be assigned to different topics
    • E.g., "knowledge discovery using least squares support vector machine classifiers ..."

Frequency Distribution
• Idea: ranks in a Zipfian frequency distribution are more reliable than raw frequencies
• Heuristic: actual rank / expected rank
• Example:
  • Given a phrase like "east end"
  • Actual rank: the rank of "east end" among all occurrences of "east" (e.g., "east end", "east side", "the east", "towards the east", etc.)
  • Expected rank: the rank of "__ end" among all contexts of "east" (e.g., "__ end", "__ side", "the __", "towards the __", etc.)

Significance Score
• Significance score [Church et al.'91]
  • A.k.a. Z-score
• ToPMine [El-Kishky et al.'15]
  • If a phrase P can be decomposed into two parts, P = P1 ⊕ P2:
    • α(P1, P2) ≈ ( f(P1 ⊕ P2) − μ₀(P1, P2) ) / √f(P1 ⊕ P2)
    • where μ₀(P1, P2) is the expected frequency of the pair; high-significance pairs are quality phrases
• Merge adjacent unigrams greedily as long as their significance score is above a threshold

Comparison to Super-/Sub-sequences
• Frequency ratio between an n-gram phrase and its two (n−1)-gram sub-phrases
• Example:

  Phrase         Raw frequency
  San            14585
  Antonio        2855
  San Antonio    2385

  • Pre-confidence of "San Antonio": 2385 / 14585
  • Post-confidence of "San Antonio": 2385 / 2855
• Expand / terminate based on thresholds
• Assumption: for an n-gram quality phrase, at least one of its two (n−1)-gram sub-phrases is not a quality phrase
• Anti-example
  • "relational database system" is a quality phrase
  • Both "relational database" and "database system" can also be quality phrases

Limitations of Statistical Signals
• The thresholds must be carefully chosen
• Each signal considers only a subset of the quality-phrase requirements
• Combining different signals in an unsupervised manner is difficult
  • Introducing some supervision may help!
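The ToPMine-style significance score and greedy merging described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy frequency tables, the threshold of 2.0, and the independence estimate μ₀ = f(P1)·f(P2)/N for the expected pair frequency are assumptions made for the sketch.

```python
import math
from collections import Counter

def significance(f1, f2, f12, total):
    """Z-like significance score for merging two adjacent units.

    f1, f2 : corpus frequencies of the two units
    f12    : frequency of the contiguous pair "unit1 unit2"
    total  : total number of tokens in the corpus
    Under an independence assumption the expected pair frequency is
    mu0 = f1 * f2 / total; the score is (f12 - mu0) / sqrt(f12).
    """
    if f12 == 0:
        return float("-inf")
    mu0 = f1 * f2 / total
    return (f12 - mu0) / math.sqrt(f12)

def greedy_merge(tokens, freq, pair_freq, total, threshold=2.0):
    """Repeatedly merge the adjacent pair with the highest significance
    until no pair scores above the threshold (ToPMine-style sketch).

    freq and pair_freq should return 0 for unseen keys (e.g., Counter).
    """
    units = list(tokens)
    while len(units) > 1:
        scores = [
            significance(freq[units[i]], freq[units[i + 1]],
                         pair_freq[(units[i], units[i + 1])], total)
            for i in range(len(units) - 1)
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break  # no remaining pair is significant enough
        units[best:best + 2] = [units[best] + " " + units[best + 1]]
    return units

# Toy counts (illustrative, not from a real corpus):
freq = Counter({"support": 150, "vector": 120, "machine": 130,
                "support vector": 80, "vector machine": 70})
pair_freq = Counter({("support", "vector"): 80,
                     ("vector", "machine"): 70,
                     ("support vector", "machine"): 60})
print(greedy_merge(["support", "vector", "machine"], freq, pair_freq,
                   total=10000))  # → ['support vector machine']
```

Because the observed pair frequencies far exceed what independence predicts, both merges clear the threshold and the three unigrams collapse into one phrase; with a pair like "the paper", μ₀ would be close to the observed count and the merge would be rejected.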
Weakly / Distantly Supervised Phrase Mining Methods
• SegPhrase [Liu et al.'15]
  • Weakly supervised
• AutoPhrase [Shang et al.'17]
  • Distantly supervised

SegPhrase
• Outperforms all of the above methods on domain-specific corpora (e.g., Yelp reviews)
• Pipeline: input raw corpus → phrase mining → quality phrases → phrasal segmentation → segmented corpus
• Example raw corpus:
  • Document 1: Citation recommendation is an interesting but challenging research problem in the data mining area.
  • Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining techniques.
  • Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.

Quality Estimation
• Weakly supervised
  • Labels: whether a phrase is a quality one or not
    • "support vector machine": 1
    • "the experiment shows": 0
  • For a ~1 GB corpus, only 300 labels
• Pros
  • Binary annotations are easy
• Cons
  • Selecting hundreds of varying-quality phrases from millions of candidates must be done carefully

Phrasal Segmentation
• Phrasal segmentation can tell which phrase is more appropriate
  • Ex: A standard ⌈feature vector⌋ ⌈machine learning⌋ setup is used to describe ...
  • Under this segmentation, "vector machine" is not counted towards the rectified frequency
• Effects on quality re-estimation (real data):
  • "np hard in the strong sense"
  • "np hard in the strong"
  • "database management system"

From the Titles and Abstracts of SIGMOD (query: SIGMOD)

  Rank  SegPhrase                Chunking (TF-IDF & C-value)
  1     database                 data base
  2     database system          database system
  3     relational database      query processing
  4     query optimization       query optimization
  5     query processing         relational database
  ...
  51    sql server               database technology
  52    relational data          database server
  53    data structure           large volume
  54    join query               performance study
  55    web service              ...
  ...
  201   high dimensional data    efficient implementation
  202   location based service   sensor network
  203   xml schema               large collection
  204   two phase locking        important issue
  205   deep web                 frequent itemset
  ...

  (Legend: some phrases, e.g., "web service", appear in only one of the two lists.)

From the Titles and Abstracts of SIGKDD (query: SIGKDD)

  Rank  SegPhrase                Chunking (TF-IDF & C-value)
  1     data mining              data mining
  2     dataset                  association rule
  3     association rule         knowledge discovery
  4     knowledge discovery      frequent itemset
  5     time series              decision tree
  ...
  51    association rule mining  search space
  52    rule set                 domain knowledge
  53    concept drift            important problem
  54    knowledge acquisition    concurrency control
  55    gene expression data     conceptual graph
  ...
  201   web                      content
  202   frequent subgraph        semantic relationship
  203   intrusion detection      effective way
  204   categorical attribute    space complexity
  205   user preference          small set
  ...

  (Legend: some phrases, e.g., "optimal solution", appear in only one of the two lists.)

Reported by TripAdvisor (Find "Interesting" Collections of Hotels)

AutoPhrase
• No label selection and annotation effort
• Smoothly supports multiple languages

AutoPhrase vs. Previous Work
• Works across different domains
• Works across different languages

AutoPhrase's Example Results

References
Deane, P., 2005, June. A nonparametric method for extraction of candidate phrasal terms. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 605–613). Association for Computational Linguistics.
Koo, T., Carreras Pérez, X. and Collins, M., 2008. Simple semi-supervised dependency parsing. In 46th Annual Meeting of the Association for Computational Linguistics (pp. 595–603).
Xun, E., Huang, C. and Zhou, M., 2000, October. A unified statistical model for the identification of English baseNP. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (pp. 109–116). Association for Computational Linguistics.
Zhang, Z., Iria, J., Brewster, C. and Ciravegna, F., 2008, May. A comparative evaluation of term recognition algorithms. In LREC.
Park, Y., Byrd, R.J. and Boguraev, B.K., 2002, August. Automatic glossary extraction: beyond terminology identification. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1 (pp. 1–7). Association for Computational Linguistics.
Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G., 1999, August. KEA: Practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries (pp. 254–255). ACM.
Liu, Z., Chen, X., Zheng, Y. and Sun, M., 2011, June. Automatic keyphrase extraction by bridging vocabulary gap. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (pp. 135–144). Association for Computational Linguistics.
Evans, D.A. and Zhai, C., 1996, June. Noun-phrase analysis in unrestricted text for information retrieval. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics (pp. 17–24). Association for Computational Linguistics.
Frantzi, K., Ananiadou, S. and Mima, H., 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2), pp.115–130.
Mihalcea, R. and Tarau, P., 2004, July. TextRank: Bringing order into texts. Association for Computational Linguistics.
Blei, D.M. and Lafferty, J.D., 2009. Topic models. Text Mining: Classification, Clustering, and Applications, 10(71), p.34.
Danilevsky, M., Wang, C., Desai, N., Ren, X., Guo, J. and Han, J., 2014, April. Automatic construction and ranking of topical keyphrases on collections of short documents. In Proceedings of the 2014 SIAM International Conference on Data Mining (pp. 398–406). Society for Industrial and Applied Mathematics.
Church, K., Gale, W., Hanks, P. and Hindle, D., 1991. Using statistics in lexical analysis. Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, 115, p.164.
El-Kishky, A., Song, Y., Wang, C., Voss, C.R. and Han, J., 2014. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3), pp.305–316.
Parameswaran, A., Garcia-Molina, H. and Rajaraman, A., 2010. Towards the web of concepts: Extracting concepts from large datasets. Proceedings of the VLDB Endowment, 3(1-2), pp.566–577.
Liu, J., Shang, J., Wang, C., Ren, X. and Han, J., 2015, May. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1729–1744). ACM.
Shang, J., Liu, J., Jiang, M., Ren, X., Voss, C.R. and Han, J., 2017. Automated phrase mining from massive text corpora. arXiv preprint arXiv:1702.04457.