Unsupervised Streaming Feature
Selection in Social Media
Jundong Li1, Xia Hu2, Jiliang Tang3 and Huan Liu1
1Arizona State University
2Texas A&M University
3Yahoo! Labs
Arizona State University
Data Mining and Machine Learning Lab
Unsupervised Streaming Feature Selection in Social Media
CIKM 2015
Outline
• Background and Motivation
• Problem Statement
• Proposed USFS Framework
• Experimental Results
• Conclusions and Future Work
Social Media
• Rapid growth of social media provides a platform for
people to perform online social activities
• Massive amounts of high dimensional data are user
generated and quickly disseminated
• It is desirable to reduce the dimensionality of social media data because of the curse of dimensionality
Feature Selection
• Feature selection is effective for preparing high-dimensional data: it selects a subset of relevant features for a compact and accurate representation
Feature Selection in Social Media
• Traditional feature selection assumes that all
features are static and known in advance
• Features in social media are usually generated
dynamically in a streaming fashion
– Twitter produces more than 500 million tweets every day, and a large number of slang words (features) are continuously generated by users
– In disaster relief, topics (features) like “Chile Earthquake” quickly become hot
Streaming Feature Selection
• It is more appealing to perform streaming feature selection, which captures relevant features in a timely manner
Challenges, Opportunities and Target
• Challenges
– Label information is costly
– Data are not i.i.d.
• Opportunities
– Link information is abundant and may be helpful
• Target
– No unsupervised streaming feature selection algorithm exists yet!
– Propose an unsupervised streaming feature selection algorithm for social media data
Problem Statement
• Given n linked instances, let the adjacency matrix M denote their link information. Assume that features arrive dynamically, one at a time. At time step t, each instance is associated with a set of streaming features X(t) = {f1, f2, …, ft}
• We want to select a subset of relevant features at each time step, effectively and efficiently, by using the link information M and the content information X(t)
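As a minimal sketch of this setup, the streaming view X(t) can be mimicked by revealing one more feature column at each time step. The toy matrices and the helper name `stream_features` are illustrative assumptions, not part of the original slides.

```python
from typing import Iterator, List, Tuple


def stream_features(X: List[List[float]]) -> Iterator[Tuple[int, List[List[float]]]]:
    """Yield (t, X_t), where X_t holds the first t feature columns of X."""
    n_features = len(X[0]) if X else 0
    for t in range(1, n_features + 1):
        yield t, [row[:t] for row in X]


# Toy data: 3 linked instances, 4 features arriving one per time step.
X = [[1.0, 0.0, 2.0, 0.5],
     [0.0, 1.0, 2.1, 0.4],
     [1.0, 1.0, 0.0, 0.0]]

# Symmetric adjacency matrix M: instance 0 links to 1, and 1 links to 2.
M = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

views = list(stream_features(X))
```

Each element of `views` pairs a time step t with the content matrix seen so far. The link matrix M stays fixed while the features stream in.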
Illustration
[Figure: features arrive one per time step (t, t+1, …, t+i). For each arriving feature the framework asks “Accept the new feature?”; accepted features join the selected feature set, after which the framework asks “Reject existing feature?”]
Modeling Link Information
• Social media users connect for a variety of reasons, e.g., as movie fans, sports enthusiasts, or colleagues
• Users with similar hidden factors tend to be similar to each other
• Hidden factors can help steer unsupervised streaming feature selection
• We use mixed membership stochastic blockmodel
(MMSB) [Blei+NIPS2009] to extract hidden social
factors from link information
Modeling Link Information (cont’d)
• At time step t:
• Hidden social factors as regression targets
• L1-norm can be used for feature selection
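To illustrate why an L1 penalty performs feature selection when a hidden factor is the regression target, here is a small sketch of L1-regularized least squares solved by cyclic coordinate descent. The solver, the toy data, and the names `lasso_cd`, `pi`, and `alpha` are assumptions made for illustration. The slides do not preserve the exact formulation at time step t.

```python
from typing import List


def soft_threshold(rho: float, alpha: float) -> float:
    """Shrink rho toward zero by alpha; values in [-alpha, alpha] become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_cd(X: List[List[float]], y: List[float], alpha: float,
             sweeps: int = 50) -> List[float]:
    """Minimise 0.5 * ||X w - y||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][l] * w[l] for l in range(d) if l != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            z = sum(X[i][j] ** 2 for i in range(n))
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w


# Toy example: the hidden factor depends only on the first feature.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
pi = [2.0, 4.0, 6.0, 8.0]  # hypothetical hidden-factor values (one column)
w = lasso_cd(X, pi, alpha=1.0)
```

In this toy run the weight of the irrelevant second feature is driven to exactly zero, while the first feature keeps a slightly shrunken weight. Features with nonzero weights are the selected ones.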
Modeling Content Information
• If two users are similar in the original feature
space, the two users are also similar in the
selected feature space.
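One common way to encode this kind of similarity-preserving assumption is graph regularization, sketched below. The affinity matrix `A` (similarity in the original feature space) and the values `z` (each user's representation in the selected space, shown for a single dimension) are hypothetical inputs. The exact term used on the slide is not preserved, so this is an assumed form.

```python
from typing import List


def laplacian(A: List[List[float]]) -> List[List[float]]:
    """Unnormalised graph Laplacian L = D - A for a symmetric affinity matrix A."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]


def pairwise_penalty(A: List[List[float]], z: List[float]) -> float:
    """0.5 * sum_ij A_ij * (z_i - z_j)^2: large when similar users end up far apart."""
    n = len(z)
    return 0.5 * sum(A[i][j] * (z[i] - z[j]) ** 2
                     for i in range(n) for j in range(n))


def quadratic_form(L: List[List[float]], z: List[float]) -> float:
    """z^T L z, which equals the pairwise penalty above."""
    n = len(z)
    return sum(z[i] * L[i][j] * z[j] for i in range(n) for j in range(n))


# Toy affinity graph over 3 users and their 1-D representations.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
z = [1.0, 2.0, 4.0]

penalty = pairwise_penalty(A, z)
penalty_via_laplacian = quadratic_form(laplacian(A), z)
```

Both expressions give the same value, which is why such a penalty is usually written compactly with the Laplacian.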
Optimization Formulation at Time t
• By combining network information and
content information
• Decompose into a set of sub-problems
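The formulas on this slide did not survive extraction. As a hedged sketch only, one objective consistent with the four terms referred to on the next slide (a loss, an L1 penalty, and two further regularizers) would be the following. Here Π (an n×k matrix of hidden social factors), the coefficient matrix W^(t) with columns (w^(t))^i, a graph Laplacian L built from content similarity, and the weights α, β, γ are all assumed symbols rather than notation recovered from the slides.

```latex
\min_{W^{(t)}} \;
  \frac{1}{2}\,\bigl\| X^{(t)} W^{(t)} - \Pi \bigr\|_F^2
  \;+\; \alpha \sum_{i=1}^{k} \bigl\| (\mathbf{w}^{(t)})^{i} \bigr\|_1
  \;+\; \frac{\beta}{2}\,\bigl\| W^{(t)} \bigr\|_F^2
  \;+\; \frac{\gamma}{2}\,\bigl\| (X^{(t)} W^{(t)})^{\top} L^{1/2} \bigr\|_F^2
```

Under this assumed form every term separates over the k columns of W^(t), which is what would allow the decomposition into k sub-problems, one per hidden factor.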
Testing New Feature
• At time step t+1 when the new feature arrives:
• The objective function decreases if the reduction in the 1st, 3rd, and 4th terms outweighs the increase in the 2nd term
• Therefore, the condition to accept the new feature is
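The acceptance condition itself is not preserved in the extracted slides. A simplified, grafting-style version of such a test can be sketched as follows, under these assumptions: only the least-squares loss is considered, `residual` stands for the current prediction minus the hidden-factor target, and `alpha` is the L1 weight. All names are hypothetical. A full test would also include the gradient contributions of the other smooth terms.

```python
from typing import List


def accept_new_feature(f_new: List[float], residual: List[float],
                       alpha: float) -> bool:
    """Accept when the loss gradient w.r.t. the new weight (at zero) exceeds alpha.

    For the loss 0.5 * ||X w - pi||^2, the derivative with respect to the new
    feature's weight, evaluated at zero, is f_new . (X w - pi). Making the weight
    nonzero costs alpha per unit through the L1 penalty, so the objective can
    only decrease when the gradient magnitude is larger than alpha.
    """
    grad = sum(f * r for f, r in zip(f_new, residual))
    return abs(grad) > alpha


# Toy example with 3 instances.
residual = [0.5, -0.5, 1.0]       # hypothetical X w - pi under the current model
strong_feature = [1.0, 0.0, 1.0]  # correlates with the residual
weak_feature = [1.0, 1.0, 0.0]    # uncorrelated with the residual

accept_strong = accept_new_feature(strong_feature, residual, alpha=1.0)
accept_weak = accept_new_feature(weak_feature, residual, alpha=1.0)
```

A feature that explains part of the remaining residual passes the test. A feature uncorrelated with it is discarded without re-solving the optimization problem, which is where the efficiency of a stagewise scheme comes from.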
Testing Existing Features
• Test existing features when a new feature is added
• When a new feature is accepted, we optimize the following w.r.t. the current variables, which forces some feature coefficients to be zero
• This is a convex optimization problem; we solve it with Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Feature Selection by USFS
• If the new feature is accepted, we obtain a sparse coefficient matrix by solving all sub-problems
• For each feature j, if any of its k corresponding feature weights is nonzero, the feature is included in the final model. The feature score is defined as
• Features are ranked in descending order of their feature scores
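The score formula is missing from the extracted slide. The sketch below assumes a common choice, the largest absolute weight of a feature across its k sub-problems, and should be read as an illustration rather than the authors' definition. The matrix `W` and the function names are hypothetical.

```python
from typing import List


def feature_scores(W: List[List[float]]) -> List[float]:
    """Score feature j by max_i |W[j][i]| over its k weights (assumed definition)."""
    return [max(abs(weight) for weight in row) for row in W]


def rank_selected(W: List[List[float]]) -> List[int]:
    """Keep features with at least one nonzero weight, ranked by descending score."""
    scores = feature_scores(W)
    selected = [j for j, score in enumerate(scores) if score > 0.0]
    return sorted(selected, key=lambda j: scores[j], reverse=True)


# Toy sparse coefficient matrix: 3 features (rows), k = 2 sub-problems (columns).
W = [[0.0, 0.0],
     [0.3, -0.7],
     [0.0, 0.2]]

scores = feature_scores(W)
ranking = rank_selected(W)
```

Feature 0 has only zero weights and is dropped. Features 1 and 2 are kept and ordered by score.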
Questions to Investigate
• Q1: What is the quality of the features selected by the USFS framework?
• Q2: How efficient is the proposed USFS framework?
Datasets
• BlogCatalog (social blog directory)
• Flickr (image sharing website)
• Assume features arrive in a random order,
take {20%,30%,…,90%,100%} of all features
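The arrival protocol above can be simulated as in the following sketch. It assumes that each percentage refers to the earliest-arriving share of features. The seed and the helper name `arrival_checkpoints` are illustrative assumptions.

```python
import random
from typing import Dict, List, Tuple


def arrival_checkpoints(n_features: int,
                        seed: int = 0) -> Tuple[List[int], Dict[int, List[int]]]:
    """Shuffle feature indices into an arrival order and slice it at 20%, ..., 100%."""
    order = list(range(n_features))
    random.Random(seed).shuffle(order)  # fixed seed keeps the run reproducible
    checkpoints = {pct: order[: n_features * pct // 100]
                   for pct in range(20, 101, 10)}
    return order, checkpoints


order, checkpoints = arrival_checkpoints(10)
sizes = [len(checkpoints[pct]) for pct in range(20, 101, 10)]
```

Each checkpoint lists the features that have arrived by that point, so a selection method can be evaluated at every stage of the stream.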
Experimental Settings
• Evaluation
– Clustering: K-means
– Metrics: Accuracy and NMI
• Baseline batch-mode methods
– Laplacian Score [He et al., NIPS 2005]
– SPEC [Zhao and Liu, ICML 2007]
– NDFS [Li et al., AAAI 2012]
– LUFS [Tang and Liu, KDD 2012]
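For reference, the NMI metric listed under Evaluation can be computed as in this sketch. Several normalizations of mutual information are in use. The geometric-mean variant below is an assumption, since the slides do not state which one was applied, and returning 0.0 for a single-cluster labeling is likewise a convention chosen here.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def nmi(labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]) -> float:
    """Normalised mutual information: I(U; V) / sqrt(H(U) * H(V))."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information between the two labelings.
    mi = sum((c / n) * math.log(c * n / (count_true[a] * count_pred[b]))
             for (a, b), c in joint.items())
    # Entropy of each labeling.
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    if h_true == 0.0 or h_pred == 0.0:
        return 0.0  # convention assumed here for degenerate labelings
    return mi / math.sqrt(h_true * h_pred)


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI ignores how clusters are named, so a K-means result that matches the ground truth up to a relabeling scores 1, while an unrelated partition scores 0.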
Performance on Flickr
Performance on BlogCatalog
Cumulative Running Time
• In BlogCatalog, USFS is 7x, 20x, 29x, and 76x faster than the four batch-mode baselines
• In Flickr, USFS is 5x, 11x, 20x, and 75x faster than the four batch-mode baselines
Conclusion
• Goals:
– Perform unsupervised streaming feature selection
for social media data
• Solutions:
– Leverage link information as constraints
– Stagewise algorithm for streaming features
• Results:
– Achieve better feature selection performance in
terms of clustering
– Reduce running time compared with batch-mode
methods
Future Work
• In this work, we assume that the link information is relatively stable compared with the dynamic content information; we will investigate streaming feature selection in dynamic networks
• Streaming features can come from different sources; we will investigate how to fuse heterogeneous feature sources for streaming feature selection
Questions
Acknowledgement: This material is, in part, supported by the National Science Foundation (NSF) under grant number IIS-1217466. Comments and suggestions from DMML members and reviewers are greatly appreciated.