Current Research in Data Mining Research Group

Jiawei Han
Data Mining Research Group, Department of Computer Science
University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, ARO, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo! Labs, LinkedIn, HP Lab & Boeing
May 4, 2017

Outline
• An Introduction to Data Mining Research Group
• Pattern Discovery Methods
• Mining Heterogeneous Information Networks
• Construction of Heterogeneous Information Networks from Unstructured Data
• TextCube and OLAP heterogeneous networks
• Mining Cyber-Physical Systems and Networks
• Conclusions

Data Mining and Data Warehousing: Jiawei Han’s Group at CS, UIUC
• Mining patterns and knowledge discovery from massive data
• Data mining in heterogeneous information networks
• Exploring broad applications of data mining
• Developed popular data mining algorithms: FPgrowth, gSpan, PrefixSpan, RankingCube, TruthFinder, NetClus, RankClass, …
• 600+ research papers, most cited author/group in data mining
• ACM Fellow, IEEE Fellow, ACM SIGKDD Innovation Award, W. McDowell Award; Students: ACM KDD Dissertation Awards (2008, 2013), …
• Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
• Funded as NSCTA (Network Science Collaborative Technology Alliance) by ARL [09-14, 15-19], ARO, NIH KnowEnG, NSF, Boeing, MSR, Google, Yahoo!, HP Labs, …
• Graduated 40+ Ph.D.’s: joined Google, Microsoft Research, Yahoo! Labs, Facebook, Twitter, as well as professors (14)
• Supervising 17 Ph.D., 4 M.S. students & 5 visitors/postdocs

Data Mining Research Group in CS, Univ. Illinois
• Student Prominent Awards
  – SIGKDD or SIGMOD Ph.D. Dissertation Awards/Runner-Ups
  – 10-year impact paper awards
  – Best student paper awards, best papers, best posters, …
  – KDDCUP 2013 Runner Up Award
  – IBM/Microsoft/NSF/NDSEG Ph.D. Fellowships
• Graduation:
  – Professors at UVA, UCSB, PSU, U. Buffalo, Northeastern, FSU, MSU, Notre Dame, CUHK, …
  – Researchers at IBM, MSR, Google Research, Yahoo! Labs, Facebook, Twitter, NEC, etc.

Mining Sequential Patterns from Shopping Sequences
• Sequential pattern mining: Given a set of (shopping) sequences, find the complete set of frequent subsequences
• A sequence database:

  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>

• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
• Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
• Idea of PrefixSpan: project s = <a(abc)(ac)d(cf)> on a prefix
  – s|<a>: ( , 2), i.e., <(abc)(ac)d(cf)>
  – s|<ab>: ( , 4), i.e., <(_c)(ac)d(cf)>
• Idea of CloSpan [figure not preserved]
• Our innovation:
  (1) PrefixSpan (TKDE’04): 1598 citations
  (2) CloSpan (SDM’03): 568 (reduce redundancy)
  (3) FPgrowth (SIGMOD’00): 4956
• Difficulty to generalize it to biosequence mining: approximate patterns & noise

Mining Frequent Subgraph Patterns from Graph DBs
• Graph pattern mining: Given a set of graphs, find the complete set of frequent subgraphs
• [Figure: a graph dataset (e.g., chemical compound database) and its frequent patterns, with min support = 2]
• Idea of gSpan: graph pattern growth + completeness of right-most extension
• Our innovation:
  (1) gSpan (ICDM’02): 1319 citations
  (2) CloseGraph (KDD’03): 520 (not to mine subgraphs covered by their super-patterns)
• [Figure: NCI/NIH AIDS antiviral screen compound data, minsup = 5%]
• CloseGraph (growing k-edge patterns into (k+1)-edge patterns): under what condition can we stop searching their children, i.e., early termination?
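
Looking back at the sequential-pattern definitions earlier in this section, the two worked claims (that <a(bc)dc> is a subsequence of sequence 10, and that <(ab)c> reaches min_sup = 2) can be checked mechanically. The sketch below is not PrefixSpan itself; it is a brute-force containment and support count over the slide's four-sequence database. The helper names `parse`, `contains`, and `support` are hypothetical and introduced only for illustration.

```python
from typing import Dict, FrozenSet, List

Element = FrozenSet[str]  # one itemset of a sequence, e.g. (abc)


def parse(seq: str) -> List[Element]:
    """Turn slide notation such as 'a(abc)(ac)d(cf)' into a list of itemsets."""
    elements: List[Element] = []
    i = 0
    while i < len(seq):
        if seq[i] == "(":
            j = seq.index(")", i)  # closing parenthesis of this itemset
            elements.append(frozenset(seq[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(seq[i]))  # a bare letter is a 1-item set
            i += 1
    return elements


def contains(sequence: List[Element], pattern: List[Element]) -> bool:
    """True if `pattern` is a subsequence of `sequence` (greedy left-to-right match)."""
    pos = 0
    for wanted in pattern:
        # Skip itemsets that do not include every item of the wanted element.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a strictly later itemset
    return True


def support(db: Dict[int, List[Element]], pattern: List[Element]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in db.values())


# The sequence database from the slide, keyed by SID.
DB = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(contains(DB[10], parse("a(bc)dc")))  # True
print(support(DB, parse("(ab)c")))         # 2
```

Only sequences 10 and 30 have an itemset containing both a and b followed later by c, which is why the support is exactly 2. PrefixSpan avoids this kind of repeated full scan by recursing on projected suffixes instead.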
• Graph pattern mining is extended to mine structures in large single networks (VLDB’11)

Graph Indexing and Graph Similarity Search
• Graph search: Given a query graph Q, find all the graphs in graph DB containing Q
• gIndex key idea: index on frequent and discriminative substructures (mined)
• [Figure: a graph index helps search a graph DB for a query graph. Plots: number of indices vs. DB size (1k to 16k) for Path, Frequent Structure, and Discriminative Frequent Structure; number of candidates vs. query size (4 to 24) for GraphGrep, gIndex, and Actual Match]
• grafil key idea: explore feature similarity (approximate features shared by query Q and graph G)
• Our innovation:
  – gIndex (SIGMOD’04): 419 citations
  – grafil (SIGMOD’05): similarity search

Mining Heterogeneous Information Networks
• Heterogeneous networks: Multiple object types and/or multiple link types
• Examples: DBLP Bibliographic Network (Venue, Paper, Author); The IMDB Movie Network (Actor, Movie, Director, Studio); The Facebook Network
• Homogeneous networks are information-losing projections of heterogeneous networks!
• Directly mining information-richer heterogeneous networks
• Current work: Mining DBLP (CS bibliographic DB), PubMed, news, tweets, data.gov, …

Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
• DBLP: A Computer Science bibliographic database (>2 M papers, >0.7 M authors, >10 K venues), …
• [Figure: a sample publication record in DBLP]

  Knowledge hidden in DBLP Network                                      Mining Functions
  How are CS research areas structured?                                 Clustering
  Who are the leading researchers on Web search?                        Ranking
  What are the most essential terms, venues, authors in AI?             Classification + Ranking
  Who are the peer researchers of Jure Leskovec?                        Similarity Search
  Whom will Christos Faloutsos collaborate with?                        Relationship Prediction
  Which types of relationships are most influential for an
    author to decide her topics?                                        Relation Strength Learning
  How has the field of Data Mining emerged and evolved?                 Network Evolution
  Which authors are rather different from their peers in IR?            Outlier/anomaly detection

• Power of het. network modeling: Treat Author, Venue, Term, Paper all as first-class citizens!

RankClus: Rank-Based Clustering
• RankClus (EDBT’09)/NetClus (KDD’09): Integrate ranking & clustering for mining heterogeneous info networks
• [Figure: NetClus on the Computer Science DBLP schema, where Research Paper (P) links to Venue (V) via Publish, Author (A) via Write, and Term (T) via Contain; resulting clusters include Database, Hardware, Theory, ……]
• Rank treatments for AIDS from MEDLINE
• RankCompete: Organize your photo album automatically!

RankClass: Integration of Ranking and Classification
• Knowledge propagation via multi-typed heterogeneous networks
• Our innovation (ECMLPKDD'10/KDD’11): integrate ranking and classification; small training set; knowledge propagation across typed links; efficient and scalable
• Potential applications: Biological network mining
• DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network
• Rank objects within each class (with extremely limited label information)
• Obtain high classification accuracy and excellent rankings within each class

  Top-5 ranked conf.s
  Database   Data Mining   AI      IR
  VLDB       KDD           IJCAI   SIGIR
  SIGMOD     SDM           AAAI    ECIR
  ICDE       ICDM          ICML    CIKM
  PODS       PKDD          CVPR    WWW
  EDBT       PAKDD         ECML    WSDM

  Top-5 ranked terms
  Database   Data Mining      AI          IR
  data       mining           learning    retrieval
  database   data             knowledge   information
  query      clustering       reasoning   web
  system     classification   logic       search
  xml        frequent         cognition   text

Meta-Path Guided Similarity Search in Networks
• Similarity search: Find similar objects in networks
• Who are most similar to AnHai Doan?
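
One way to make a "who is most similar" query concrete is a path-count similarity over a symmetric meta-path such as Author-Paper-Venue-Paper-Author. The sketch below assumes the PathSim definition as given in the VLDB'11 paper, s(x, y) = 2·|paths x→y| / (|paths x→x| + |paths y→y|); the slides only describe it as a balanced measure. The function names and the toy venue counts for authors "A", "B", "C", and "D" are hypothetical.

```python
from typing import Dict

# Each author is summarized by how many papers they published per venue.
# Under the APVPA meta-path, the number of path instances between two authors
# is the dot product of these venue-count vectors.
VenueCounts = Dict[str, int]


def path_count(x: VenueCounts, y: VenueCounts) -> int:
    """Number of A-P-V-P-A path instances between two authors."""
    return sum(n * y.get(venue, 0) for venue, n in x.items())


def pathsim(x: VenueCounts, y: VenueCounts) -> float:
    """Assumed PathSim form: 2 * paths(x, y) / (paths(x, x) + paths(y, y))."""
    denom = path_count(x, x) + path_count(y, y)
    return 2 * path_count(x, y) / denom if denom else 0.0


# Hypothetical toy data, not taken from DBLP.
authors = {
    "A": {"SIGMOD": 2, "VLDB": 1},
    "B": {"SIGMOD": 50, "VLDB": 20},  # same venues as A, far more papers
    "C": {"SIGMOD": 2, "VLDB": 1},    # same venues and same volume as A
    "D": {"SIGIR": 3},                # no venue in common with A
}

for name in ("B", "C", "D"):
    print(name, round(pathsim(authors["A"], authors[name]), 4))
```

The normalization by self-path counts is what makes the measure balanced in this sketch: author B shares many path instances with A but scores low because B's own visibility is much larger, while C, a peer with the same publication profile, scores 1.0.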
• [Figure: DBLP network schema]
• Meta-path: Meta-level description of a path between two objects
• Different meta-paths carry rather different semantics
• Meta-path: Author-Paper-Venue-Paper-Author (APVPA)
• Query author and researchers shown:
  – AnHai Doan: CS, Wisconsin; Database area; PhD: 2002
  – Jignesh Patel: CS, Wisconsin; Database area; PhD: 1998
  – Amol Deshpande: CS, Maryland; Database area; PhD: 2004
  – Jun Yang: CS, Duke; Database area; PhD: 2001
• Our innovation: PathSim (VLDB’11): similarity search in heterogeneous networks; a balanced similarity measure; user guidance by selecting different meta-paths
• Application in biomedical domain: IBM: search for close relationships among disease, drugs, treatments, side-effects, and explanations

PathPredict: Meta-Path Based Relationship Prediction
• Who will be your new coauthors?
• [Figure: network schema with node types venue, paper, author, topic and link types publish/publish-1, write/write-1, mention/mention-1, cite/cite-1, contain/contain-1]
• Our contribution: PathPredict (ASONAM’11): co-author prediction (A-P-A) using topological features encoded by meta-paths, e.g., (A-P→P-A)
• Which meta-path is more important?
• Applications: meta-path-guided prediction: infer or predict new relationships among multi-typed links
• Different meta-paths have different prediction power: p-values obtained from the DBLP data
• Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Trained on data collected in [1996, 2002]; testing period: [2003, 2009])

Truth Analysis: Enhancing the Quality of Heterogeneous Information Networks
• Motivation: Info. provided can be untrustworthy, error-prone, missing, …
• Application: handling conflicting claims on biomedical properties
• Our contribution:
  – TruthFinder (TKDE’08): mutual enhancement of trustworthiness of info providers and claims
  – Latent Truth Model (VLDB’12): modeling two-sided truth
• Experimental datasets: large and real datasets
  – Book Authors from abebooks.com (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled)
  – Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled)
• [Figure: info providers w1 to w4 make claims f1 to f4 about objects o1 and o2]
• [Figure: multiple facts, two-sided claims. Sources IMDB, Netflix, and BadSource make positive and negative, correct and incorrect claims about Harry Potter; source quality labels shown are High Precision/High Recall, High Precision/Low Recall, and Low Precision/Low Recall]

Hierarchical Relationship Discovery
• From partially ordered objects to hierarchy (tree)
• Based on NLP or other techniques to extract partially ordered objects
• Using constraints to discover relationships
• Discovery of the Kenny Family Tree

  Singleton Potential
  Type                Cognitive description
  Homophile           Parent and child are similar
  Polarity            Parent is superior to child
  Support pattern     Patterns frequently occurring with child-parent pairs
  Forbidden pattern   Patterns rarely occurring with child-parent pairs

  Pairwise Potential Function: Cases
  Type                Cognitive description
  Attribute augment   Use inherited attributes from parents or children
  Label propagate     Similar nodes share similar parents (or children)
  Reciprocity         Patterns altering in child-parent & parent-child pairs
  Constraints         Restrict certain patterns

  [The "Potential definition" column of both tables is not preserved]

Recursive Construction of a Topical Hierarchy by Phrase Mining
• The Framework of CATHY (Constructing A Topical HierarchY): term co-occurrence network; topic discovery; topical phrase mining and ranking; recursive construction
• [Figure: example topical phrases: information retrieval, question answering, relevance feedback, web search, search engine, world wide web, semantic web; learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees]

Growing Parallel Paths (WWW 2011)
• [Figure: parallel HTML tag paths (HTML, DIV, P, UL, LI, TABLE, TR, TD) across Pages A to F, and the extracted result numbered 1 to 6]

WinaCS: Web Information Network Analysis for Computer Science
• Mappings from web pages to structured data:

  Name              Zipcode    URL
  Tarek Abdelzaher  --------   --------
  Sarita Adve                  rsim.cs.illinois.edu/~sadve/
  Vikram Adve                  llvm.cs.uiuc.edu/~vadve/Home.html
  Gul Agha
  Eyal Amir
  Dan Roth                     l2r.cs.uiuc.edu/~danr/
  Jiawei Han                   www.cs.illinois.edu/homes/hanj/

• Database records can be found on link paths!
• [Figure: link paths from / (root) [cs.illinois.edu]:
  – People (/people) → Faculty (/people/faculty) → Vikram Adve (/people/faculty/vikramadve) → Personal Site llvm.cs.uiuc.edu/~vadve/Home.html
  – People → Faculty → Dan Roth (/people/faculty/dan-roth) → Personal Site l2r.cs.uiuc.edu/~danr/
  – People → Faculty → Jiawei Han (/people/faculty/jiawei-han) → Personal Site www.cs.illinois.edu/homes/hanj/
  – Research (/research) → Data Mining (/research/areas/data) → Dan Roth, Jiawei Han]

Research-Insight [SIGMOD’13 Demo]
• Query on “Jim Gray”
• Query on “Machine Learning”
• Advisor-Advisee result for “Kevin Chang”
• Potential collaborators for “Jiawei Han”

Event Cube: An Overview
• Funded by NASA (2008-2010)
• Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
• Analysis support for the analyst: multidimensional OLAP, ranking, cause analysis, ……; topic summarization/comparison
• [Figure: Event Cube representation built on a multidimensional text database. Topic values: turbulence, birds, undershoot, overshoot (under Encounter and Deviation); Location values: LAX, SJC, MIA, AUS, rolling up to CA, FL, TX; time values: 98.01, 98.02, 99.01, 99.02, rolling up to 1998, 1999; operations: drill-down and roll-up]

Text/Topic Cube: General Idea
• Heterogeneous: categorical attributes + unstructured text
• Event report: categorical attributes (ACN, Time, Location, Place, Environment, ……) plus text data
• How to combine?
• Our solution:
  – Cube: categorical attributes
  – Text/topic model: unstructured text
  – Measure: term/topic weight (T1: W1, T2: W2, T3: W3, …)

Effective OLAP Exploration
• TopCells (ICDE’10): Ranking aggregated cells (objects) in TextCube
• TEXplorer (CIKM’11): Integrating keyword-based ranking and OLAP exploration
• [Figure: example query “Healthcare Reform”]

EventCube Snapshot: Query Result
• [Figure: EventCube query result screenshot]

MoveMine: Mining Moving Object Databases
• A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases”, SIGMOD’10 (system demo)

Mining Spatiotemporal and Mobility Data
• [Figure: raw movement data (time series view of longitude and latitude over time) and a density map with four spots, plotted against time (hour)]
• Spot #1: Office; Spot #2: Commuting city; Spot #3: Home; Spot #4: Vacation place

Mining Periodicity in Sparse Data [KDD12]
• Event has a period of 20. Occurrences of the event happen between 20k+5 to 20k+10.
• [Figure: time axis with marks at 5, 13, 18, 26, 29, 48, 50, 62, 67, 79]
• Segment the data using length 20 and overlay the segments: observations are clustered in the [5,10] interval.
• Segment the data using length 16 and overlay the segments: observations are scattered.

GeoTopic Discovery: Mining Spatial Text
• Z. Yin, et al., GeoTopic Discovery and Comparison, WWW'11
• Geo-tagged photos w. landscape (coast vs. desert vs. mountain)
• [Figure: results for LDM, TDM, GeoFolk, LGTA]

LPTA: Latent Periodic Topic Analysis: Discovery of Temporal Patterns of Topics
• Periodic topic: repeating in regular intervals
• Background topic: covered uniformly over the entire period
• Bursty topic: a transient topic that is intensively covered only in a certain time period
• Time distribution of topics
• Integration of both text and time in analysis

Social Relationship Mining from Sensor Trace Data
• T-Motif: a time interval [S,T] in which many positive pairs meet and few negative pairs meet
• Ex.: MIT Reality mining dataset: 94 people tracked for 10 months
• Use only spatiotemporal info
• Algs. for efficient mining of T-motifs and effective classification

Mining RFID Data to Explore Trajectories
• Warehousing and mining RFID data
• [Figure: trajectory records such as (Factory, T1, T2), (Shelf, T7, T8), (Checkout, T9, T10)]

Conclusions
• An Introduction to Data Mining Research Group
• Pattern Discovery Methods
• Mining Heterogeneous Information Networks
• Construction of Heterogeneous Information Networks from Unstructured Data
• TextCube and OLAP heterogeneous networks
• Mining Cyber-Physical Systems and Networks
• Lots to be done in this promising research frontier!
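
As a closing illustration of the segment-and-overlay idea from the periodicity slide: folding timestamps by a candidate period and measuring how tightly the folded offsets cluster separates a good period from a poor one. This is a minimal sketch, not the scoring used in the KDD'12 work. The covering-arc measure, the function names, and the event timestamps are assumptions chosen so that every event falls in 20k+5 to 20k+10.

```python
from typing import List


def fold(timestamps: List[int], period: int) -> List[int]:
    """Overlay the segments: map each timestamp to its offset within the period."""
    return sorted(t % period for t in timestamps)


def covering_arc(timestamps: List[int], period: int) -> int:
    """Length of the shortest circular interval that covers all folded offsets."""
    offsets = fold(timestamps, period)
    if not offsets:
        return 0
    # Gaps between consecutive offsets, plus the wrap-around gap.
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    gaps.append(offsets[0] + period - offsets[-1])
    # Everything except the largest empty gap must be covered.
    return period - max(gaps)


# Hypothetical sparse observations of an event with true period 20.
events = [5, 26, 29, 48, 50, 67]

for candidate in (20, 16):
    arc = covering_arc(events, candidate)
    print(candidate, fold(events, candidate), arc, round(arc / candidate, 4))
```

With period 20 the folded offsets are 5 through 10, so a window of length 5 (a quarter of the period) covers them all; with period 16 the offsets spread over 11 of 16 positions. Normalizing the arc by the period keeps candidates of different lengths comparable.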