Big Signal Processing for Multi-Aspect Data Mining
Evangelos E. Papalexakis, Carnegie Mellon University
http://www.cs.cmu.edu/˜epapalex
What does a social graph between people who call each other look like? How does it differ from one where people instant-message or e-mail each other? Social interactions, along with many other real-world processes and phenomena, have different aspects, such as the means of communication. In the above example, the activity of people calling each other will likely differ from the activity of people instant-messaging each other. Nevertheless, each aspect of the interaction is a signature of the same underlying social phenomenon: the formation of social ties and communities. Taking into account all aspects of social interaction results in more accurate social models (e.g., communities). The main thesis of my work is that many real-world problems, such as the aforementioned, benefit from jointly modeling and analyzing the multi-aspect data associated with the underlying phenomenon we seek to uncover. My research bridges Signal Processing and Data Science through designing and developing scalable and interpretable algorithms for mining big multi-aspect data, to address high-impact real-world applications.
THESIS WORK
My thesis work is broken down into algorithms, with contributions mostly in tensor analysis as well as in other fields such as control system identification [10], and into multi-aspect data mining applications.
A) Algorithms
The primary computational tool in my work is tensor decomposition. Tensors are multi-dimensional matrices, where each aspect of the data is mapped to one of the dimensions or modes: matrices record dyadic relationships, like “people recommending products”, while tensors are their n-mode generalizations, capturing 3- and higher-way relationships (for example, “subject-verb-object” triplets naturally lead to a 3-mode tensor). In order to analyze a tensor, we compute a decomposition or factorization (henceforth we use the terms interchangeably), which gives a low-dimensional embedding of all the aspects. I focused on the Canonical or PARAFAC decomposition, which decomposes the tensor into a sum of outer products of latent factors (see also Figure 1):

\mathbf{X} \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,

where \circ denotes the outer product, i.e., [\mathbf{a} \circ \mathbf{b} \circ \mathbf{c}](i, j, k) = \mathbf{a}(i)\mathbf{b}(j)\mathbf{c}(k).

Figure 1: Canonical or PARAFAC decomposition into a sum of R rank-one components, reminiscent of the rank-R singular value decomposition of a matrix.

Informally, each latent factor is a co-cluster of the aspects of the tensor, and each of its vectors may be viewed as a soft clustering indicator. For instance, a “subject-verb-object” tensor can be decomposed into a sum of a (usually) small number of triplets of vectors, where each triplet corresponds to a different latent concept, e.g., “politicians”, “countries”, or “tools”: if a, b, c are the vectors of the “politicians” triplet, then a indicates the membership of all the subjects in the “politicians” cluster, and b and c do so for the verbs and objects. In such a tensor from the NELL Knowledge Base, for example, subjects with high score on a_1 might be leaders like “obama” and “merkel”, and the corresponding high-scoring objects might be organizations like “usa” and “germany”. The advantages of this decomposition over other existing ones are interpretability (each factor corresponds to a co-cluster) and strong uniqueness guarantees for the latent factors. Tensor decompositions are very powerful tools, with numerous applications (see details below). There is increasing interest in their application to Big Data problems, both from academia and industry. However, algorithmically, there exist challenges which limit their applicability to real-world, big data problems, pertaining to scalability and quality assessment of the results. Below I outline how my work addresses those challenges, towards a broad adoption of tensor decompositions in big data science.
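To make the model concrete, the following is a minimal sketch of the alternating least squares (ALS) procedure commonly used to fit the PARAFAC model, written in plain NumPy for a small, dense 3-way tensor; the function names are illustrative, and a real implementation for the data sizes discussed below must exploit sparsity rather than materialize dense arrays.

    import numpy as np

    def khatri_rao(U, V):
        # Column-wise Kronecker product: (J x R), (K x R) -> (J*K x R)
        R = U.shape[1]
        return np.einsum('jr,kr->jkr', U, V).reshape(-1, R)

    def cp_als(X, R, n_iter=50, seed=0):
        """Fit X (I x J x K) with a rank-R PARAFAC model via alternating
        least squares; returns the factor matrices A, B, C."""
        rng = np.random.default_rng(seed)
        I, J, K = X.shape
        A = rng.standard_normal((I, R))
        B = rng.standard_normal((J, R))
        C = rng.standard_normal((K, R))
        # Mode-wise unfoldings, matching X[i,j,k] = sum_r A[i,r]B[j,r]C[k,r]
        X1 = X.reshape(I, J * K)
        X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)
        X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)
        for _ in range(n_iter):
            A = X1 @ np.linalg.pinv(khatri_rao(B, C)).T  # solve for A, fixing B, C
            B = X2 @ np.linalg.pinv(khatri_rao(A, C)).T
            C = X3 @ np.linalg.pinv(khatri_rao(A, B)).T
        return A, B, C

Each iteration solves a linear least squares problem for one factor while the other two are held fixed; the uniqueness of the recovered factors (up to scaling and permutation) is what makes the latent concepts interpretable.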
A1) Parallel and Scalable Tensor Decompositions
Consider a multi-aspect tensor dataset that is too big to fit in the main memory of a single machine. The data may have large “ambient” dimension (e.g., a social network can have billions of users); however, the observed interactions are very sparse, resulting in extremely sparse data. This data sparsity can be exploited for efficiency. In [9] we formalize the above statement by proposing the concept of a triple-sparse algorithm, where 1) the input data are sparse, 2) the intermediate data that the algorithm is manipulating or creating are sparse, and 3) the output is sparse. Sparsity in the intermediate data is crucial for scalability. In [15] we show that the intermediate data created by a tensor decomposition algorithm designed for dense data can be many orders of magnitude larger than the original data, rendering the analysis prohibitive. Sparsity in the results is a great advantage, both in terms of storage and, most importantly, in terms of interpretability, since sparse models are easier for humans to inspect. None of the existing state-of-the-art algorithms, before [9], fulfilled all three requirements for sparsity.
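As a minimal illustration of why the input representation matters, consider storing a tensor in coordinate (COO) form, keeping only the observed nonzeros; the numbers here are made up for illustration:

    import math
    import numpy as np

    # Observed (i, j, k) interactions and their values; everything else is zero.
    coords = np.array([[0, 2, 1],
                       [3, 0, 0],
                       [1, 4, 2]])
    vals = np.array([5.0, 1.0, 2.0])
    shape = (1_000_000, 1_000_000, 3)  # huge ambient dimensions, tiny data

    # Memory scales with the number of nonzeros, not with the ambient size:
    print(len(vals), "nonzeros versus", math.prod(shape), "cells in the dense array")

A triple-sparse algorithm preserves this invariant through the intermediate computations and in the output factors as well.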
In [9] we propose ParCube, the first triple-sparse, parallel algorithm for tensor decomposition. Figure 2(a) depicts a high-level overview of ParCube. Suppose that we have computed a weight of how important every row, column, and “fiber” (the third-mode index) of the tensor is. Given that weight, ParCube takes biased samples of rows, columns, and fibers, extracting a small tensor from the full data. This is done repeatedly, with each sub-tensor effectively exploring different parts of the data. Subsequently, ParCube decomposes all those sub-tensors in parallel, generating partial results. Finally, ParCube merges the partial results, ensuring that partial results corresponding to the same latent component are merged together. The power behind ParCube is that, even though the tensor itself might not fit in memory, we can choose the sub-tensors appropriately so that they fit in memory, and we can compensate by extracting many independent sub-tensors. ParCube converges to the same level of sparsity as [13] (the first tensor decomposition with latent sparsity) and, furthermore, ParCube's approximation error converges to that of the full decomposition (in cases where we are able to run the full decomposition). This demonstrates that ParCube's sparsity maintains the useful information in the data. In [11], we extend the idea of [9], introducing Turbo-SMT for the case of Coupled Matrix-Tensor Factorization (CMTF), where a tensor and a matrix share one of the aspects of the data, achieving up to 200 times faster execution with comparable accuracy to the baseline, on a single machine. Subsequently, in [17] we propose ParaComp, a novel parallel architecture for tensor decomposition in a similar spirit to ParCube. Instead of sampling, ParaComp uses random projections to compress the original tensor into multiple smaller tensors. Thanks to compression, in [17] we prove that ParaComp can guarantee the uniqueness of the results (cf. [17] for exact bounds and conditions). This is a very strong guarantee on the correctness and quality of the result.

Figure 2: (a) The main idea behind ParCube: using biased sampling, extract small representative sub-sampled tensors, decompose them in parallel, and carefully merge the final results into a set of sparse latent factors. (b) Computing the decomposition quality for tensors two orders of magnitude larger than the state of the art (I, J, K are the tensor dimensions).
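The sampling step at the heart of this approach can be sketched in a few lines; here the importance weight of each row, column, and fiber is taken to be its marginal sum of absolute values, which is one natural choice (the exact weighting and the merging logic of ParCube are more involved; see [9]):

    import numpy as np

    def biased_sample(weights, size, rng):
        """Sample distinct indices with probability proportional to weights."""
        p = weights / weights.sum()
        return np.sort(rng.choice(len(weights), size=size, replace=False, p=p))

    def sample_subtensor(X, s, rng):
        """Extract a sub-tensor whose sides are s times smaller, by biased
        sampling of rows, columns, and fibers of the 3-way array X."""
        I, J, K = X.shape
        rows = biased_sample(np.abs(X).sum(axis=(1, 2)), I // s, rng)
        cols = biased_sample(np.abs(X).sum(axis=(0, 2)), J // s, rng)
        fibs = biased_sample(np.abs(X).sum(axis=(0, 1)), K // s, rng)
        # Each such sub-tensor is decomposed independently (and in parallel);
        # partial factors are then merged on the shared sampled indices.
        return X[np.ix_(rows, cols, fibs)], (rows, cols, fibs)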
In addition to [9, 11, 17], which introduce a novel paradigm for parallelizing and scaling up tensor decomposition, in [15] we developed the first scalable algorithm for tensor decompositions on Hadoop, which was able to decompose problems at least two orders of magnitude larger than the state of the art. Subsequently, in [4] we developed a Distributed Stochastic Gradient Descent method for Hadoop that scales to billions of parameters.
A2) Unsupervised Quality Assessment of Tensor Decompositions
Real-world exploratory analysis of multi-aspect data is, to a great extent, unsupervised. Obtaining ground truth is a very expensive and slow process, or in the worst case impossible; for instance, in Neurosemantics, where we research how language is represented in the brain, most of the subject matter is uncharted territory and our analysis drives the exploration. Nevertheless, we would like to assess the quality of our results in the absence of ground truth. There is a very effective heuristic in the Signal Processing and Chemometrics literature by the name of “Core Consistency Diagnostic” (cf. Bro and Kiers, Journal of Chemometrics, 2003), which assigns a “quality” number to a given tensor decomposition and gives information about the data being inappropriate for such analysis, or about the number of latent factors being incorrect. However, this diagnostic has been specifically designed for fully dense and small datasets, and is not able to scale to large and sparse data. In [8], exploiting sparsity, we introduce a provably exact algorithm that operates on data at least two orders of magnitude larger than the state of the art (as shown in Figure 2(b)), which enables quality assessment on large real datasets for the first time.
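For intuition, the diagnostic can be computed as follows for a small dense tensor: fit the least squares Tucker core G given the PARAFAC factors, and measure how close G is to the ideal superdiagonal core. This dense formulation, sketched below, is exactly the one that does not scale; the contribution of [8] is an equivalent computation that exploits sparsity.

    import numpy as np

    def core_consistency(X, A, B, C):
        """Core Consistency Diagnostic (Bro & Kiers, 2003), dense sketch.
        100 is ideal; low or negative values suggest a wrong number of
        components, or data unsuitable for a PARAFAC model."""
        R = A.shape[1]
        # Least-squares Tucker core given the factors:
        # G = X  x1 pinv(A)  x2 pinv(B)  x3 pinv(C)
        G = np.einsum('ri,sj,tk,ijk->rst',
                      np.linalg.pinv(A), np.linalg.pinv(B),
                      np.linalg.pinv(C), X)
        T = np.zeros((R, R, R))
        T[np.arange(R), np.arange(R), np.arange(R)] = 1.0  # ideal core
        return 100.0 * (1.0 - ((G - T) ** 2).sum() / R)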
Impact - Algorithms
• ParCube [9] is the most cited paper of ECML-PKDD 2012, with 46 citations at the time of writing (the median number of citations for ECML-PKDD 2012 papers is 5). Additionally, ParCube has already been downloaded more than 80 times by universities and organizations from 23 countries.
• Turbo-SMT [11] was selected as one of the best papers of SDM 2014, and will appear in a special issue of the Statistical Analysis and Data Mining journal.
• ParaComp [17] has appeared in the prestigious IEEE Signal Processing Magazine.
B) Applications
My work has focused on: 1) Multi-Aspect Graph Mining, 2) Neurosemantics, and 3) Knowledge on the Web.
B1) Multi-Aspect Graph Mining
In [5] we introduce GraphFuse, a tensor-based community detection algorithm which uses different aspects of social interaction and outperforms the state of the art in community extraction accuracy. Figure 3 shows GraphFuse at work, identifying communities in Reality Mining, a real dataset of multi-aspect interactions between students and faculty at MIT. The communities are consistent across aspects and agree with ground truth. Another aspect of a social network is time. In [3] we introduce Com2, a tensor-based temporal community detection algorithm, which identifies social communities and their behavior over time. In [7] we consider language as another aspect, where we identify topical and temporal communities in a discussion forum of Turkish immigrants in the Netherlands, and in [12] we consider location (which has become extremely pervasive recently), analyzing data from Foursquare and identifying spatial and temporal patterns of users' activity. Not necessarily restricted to social networks, in my work I have also analyzed multi-aspect computer network measurements and graphs, detecting anomalies and attacks [6, 16].

Figure 3: Results on the four views of the Reality Mining multi-graph: (a) calls, (b) proximity, (c) sms, (d) friends. Red dashed lines outline the clustering found by GraphFuse.

Impact - Multi-Aspect Graph Mining
• Com2 [3] won the best student paper award at PAKDD 2014.
• GraphFuse [5] has been downloaded more than 80 times from 21 countries.
• Our work in [16] is deployed by the Institute for Information Industry in Taiwan, detecting real network intrusion attempts.
• Our work in [6] was selected, as one of the best papers of ASONAM 2012, to appear in Springer's Encyclopedia of Social Network Analysis and Mining.
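The data representation behind this family of methods is simple to sketch: stack one adjacency matrix per interaction aspect into a node x node x view tensor and read communities off the factors of its decomposition. This is the general tensor-based recipe, not the exact GraphFuse algorithm, and the edge lists below are toy examples:

    import numpy as np

    # Toy multi-view graph: one edge list per aspect of social interaction.
    views = {"calls": [(0, 1), (1, 2)],
             "sms": [(0, 1), (2, 3)],
             "proximity": [(1, 2), (2, 3)]}

    n_nodes = 4
    X = np.zeros((n_nodes, n_nodes, len(views)))
    for v, edges in enumerate(views.values()):
        for i, j in edges:
            X[i, j, v] = X[j, i, v] = 1.0  # symmetric adjacency per view

    # A rank-R decomposition of X (e.g., the cp_als sketch above) gives
    # factors a_r scoring node membership in community r, while c_r shows
    # how strongly each view (calls, sms, ...) expresses that community.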
B2) Neurosemantics
How is knowledge represented in the human brain? Which regions have high activity and information flow when a concept such as “food” is shown to a human subject? Do all human subjects' brains behave similarly in this context? Consider the following experimental setting, where multiple human subjects are shown a set of concrete English nouns (e.g., “dog”, “tomato”), and we measure each person's brain activity using various techniques (e.g., fMRI or MEG). In this experiment, human subjects, semantic stimuli (i.e., the nouns), and measurement methods are all different aspects of the same underlying phenomenon: the semantic mechanisms that the brain uses to process language.
In [11], we seek to identify coherent regions of the brain that are activated for a semantically coherent set of stimuli. To this end we combine fMRI measurements with semantic features (in the form of simple questions, such as “Can you pick it up?”) for the same set of nouns, which provide useful information to the decomposition that might be missing from the fMRI data, as well as constitute a human-readable description of the semantic context of each latent group. A very exciting example of our results can be seen in Figure 4(a), where all the nouns in the “cluster” are small objects, the corresponding questions reflect holding or picking such objects up and, most importantly, the brain region that was highly active for this set of nouns and questions was the premotor cortex, which is associated with holding or picking small items up. This result is entirely unsupervised and agrees with Neuroscience. This gives us confidence that the same technique can be used in more complex tasks (cf. future research) and drive neuroscientific discovery.
Figure 4: Overview of results of the Neurosemantics application. (a) Latent groups of nouns and questions (e.g., “glass - can you pick it up?”, “tomato - can you hold it in one hand?”); the premotor cortex, having high activation here, is associated with motions such as holding and picking small items up. (b) Given MEG measurements (time series of the magnetic activity of a particular region of the brain), GeBM learns a graph between physical brain regions; furthermore, GeBM simulates the real MEG activity very accurately.
In a similar experimental setting, where the human subjects are also asked to answer a simple yes/no question about the noun they are reading, in [10] we seek to discover the functional connectivity of the brain for the particular task. Functional connectivity is an information flow graph between different regions of the brain, indicating a high degree of interaction between (not necessarily physically connected) regions while the person is reading the noun and answering the question. In [10] we propose GeBM, a novel model for the functional connectivity which views the brain as a control system, and we propose a sparse system identification algorithm which solves the model. Modeling the dynamics of the observed sensor measurements directly turns out to capture the temporal dynamics of the recorded brain activity rather poorly; instead, GeBM posits n hidden brain regions whose activity vector x(t) evolves under the stimulus s(t) and is observed, through a linear measurement process, as the m-dimensional sensor vector y(t):

x(t + 1) = \mathbf{A}_{[n \times n]} \, x(t) + \mathbf{B}_{[n \times s]} \, s(t)
y(t) = \mathbf{C}_{[m \times n]} \, x(t)

where the hidden functional connectivity matrix A is required to be sparse because, intuitively, not all hidden regions of the brain interact directly with each other. Figure 4(b) shows an overview of GeBM: given MEG measurements (time series of the magnetic activity of a particular region of the brain), we learn a model that describes a graph between physical brain regions, and simulates real MEG activity very accurately.
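To illustrate the model (not the identification algorithm of [10], which estimates the sparse A, B, C from data), the forward dynamics can be simulated directly:

    import numpy as np

    def simulate(A, B, C, s, x0):
        """Forward-simulate x(t+1) = A x(t) + B s(t), y(t) = C x(t).
        A: n x n (sparse connectivity), B: n x s_dim, C: m x n,
        s: s_dim x T stimulus, x0: initial hidden state of size n.
        Returns the m x T matrix of predicted sensor activity."""
        n, T = A.shape[0], s.shape[1]
        X = np.zeros((n, T))
        X[:, 0] = x0
        for t in range(T - 1):
            X[:, t + 1] = A @ X[:, t] + B @ s[:, t]
        return C @ X

Figure 4(b) compares such simulated activity against the real MEG recordings.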
Impact - Neurosemantics
• GeBM [10] is taught in class CptS 595 at Washington State University.
• Our work in [11] was selected as one of the best papers of SDM 2014.
B3) Knowledge on the Web
Knowledge on the Web has multiple aspects: real-world entities such as Barack Obama and USA are usually linked in multiple ways, e.g., is president of, was born in, and lives in. Modeling those multi-aspect relations as a tensor, and computing a low-rank decomposition of the data, results in embeddings of those entities in a lower dimension, which can help discover semantically and contextually similar entities, as well as discover missing links. In [9] we discover semantically similar noun-phrases in a Knowledge Base coming from the Read the Web project at CMU: http://rtw.ml.cmu.edu/rtw/. Language is another aspect: many web-pages have parallel content in different languages; however, some languages have higher representation than others. How can we learn a high-quality joint latent representation of entities and words, where we combine information from all languages? In [14] we introduce the notion of translation-invariant word embeddings, where we compute multi-lingual embeddings, forcing translations to be “close” in the embedding space. Our approach outperforms the state of the art.
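As a small sketch of this modeling step (with made-up triples; a real Knowledge Base has millions), an entity x entity x relation tensor can be built directly from (subject, relation, object) facts and then decomposed to obtain entity embeddings:

    import numpy as np

    triples = [("barack_obama", "is_president_of", "usa"),
               ("barack_obama", "lives_in", "usa"),
               ("angela_merkel", "was_born_in", "germany")]

    entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    relations = sorted({r for _, r, _ in triples})
    e_idx = {e: i for i, e in enumerate(entities)}
    r_idx = {r: i for i, r in enumerate(relations)}

    # Dense for brevity; at Knowledge Base scale this must stay sparse.
    X = np.zeros((len(entities), len(entities), len(relations)))
    for s, r, o in triples:
        X[e_idx[s], e_idx[o], r_idx[r]] = 1.0

    # The rows of the first factor matrix of a low-rank decomposition of X
    # serve as entity embeddings: nearby rows are semantically similar
    # entities, and large reconstructed values at zero cells suggest
    # missing links.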
Yet another aspect of a real-world entity on the web is the set of search engine results for that entity, which is the biased view of each search engine, as a result of its crawling, indexing, ranking, and potential personalization algorithms, for that query. In [1] we introduce TensorCompare, a tool which measures the overlap in the results of different search engines. We conduct a case study on Google and Bing, finding high overlap. Given this high overlap, how can we use different signals, potentially coming from social media, in order to provide diverse and useful results? In [2] we follow up by designing a Twitter-based web search engine, using tweet popularity as a ranking function, which does exactly that. This result has huge potential for the future of web search, paving the way for the use of social signals in the determination and ranking of results.
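The overlap measurement itself is easy to illustrate: the toy function below computes the Jaccard overlap of two engines' top-k result sets for a single query. This is only the one-query intuition, not the actual TensorCompare method, which analyzes results across many queries and engines jointly with tensor tools [1]; the result lists are hypothetical.

    def result_overlap(results_a, results_b, k=10):
        """Jaccard overlap between two ranked result lists, truncated at k."""
        top_a, top_b = set(results_a[:k]), set(results_b[:k])
        return len(top_a & top_b) / len(top_a | top_b)

    # Hypothetical result URLs for one query:
    engine_a = ["url1", "url2", "url3", "url4"]
    engine_b = ["url2", "url3", "url5", "url6"]
    print(result_overlap(engine_a, engine_b, k=4))  # -> 0.333...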
Impact - Knowledge on the Web
• In [14] we are the first to propose the concept of translation-invariant word embeddings.
• Our work in [1] was selected to appear in a special issue of the Journal of Web Science, as one of the best papers in the Web
Science Track of WWW 2015.
FUTURE RESEARCH DIRECTIONS
As more field sciences are incorporating information technologies (with Computational Neuroscience being a prime example from
a long list of disciplines), the need for scalable, efficient, and interpretable multi-aspect data mining algorithms will only increase.
1) Long-Term Vision: Big Signal Processing for Data Science
The process of extracting useful and novel knowledge from big data in order to drive scientific discovery is the holy grail of
data science. Consider the case where we view the knowledge extraction process through a signal processing lens: suppose
that a transmitter (a physical or social phenomenon) is generating signals which describe aspects of the underlying phenomenon.
An example signal is a time-evolving graph (a time-series of graphs) which can be represented as a tensor. The receiver (the
data scientist in our case), combines all those signals with ultimate goal the reconstruction (and general understanding) of their
generative process. The “communication channel” wherein the data are transmitted can play the role of the measurement process
where loss of data occurs, and thus we have to account the channel estimation into our analysis, in order to reverse its effect. We
may consider that the best way to transmit all the data to the receiver is to find the best compression or dimensionality reduction
of those signals (e.g., Compressed Sensing). There may be more than one transmitters (if we consider a setting where the data are
distributed across data centers) and multiple receivers, in which case privacy considerations come into play. In my work so far I
have established two connections between Signal Processing and Data Science (Tensor Decompositions & System Identification),
contributing new results in both communities and have already had significant impact, demonstrated, for instance, by the amount
of researchers using and extending my work. These two aforementioned connections are but instances of a vast landscape of
opportunities for cross-pollination which will advance the state of the art of both fields and drive scientific discovery. I envision
my future work as a three-way bridge between Signal Processing, Data Science, and high impact real-world applications.
2) Mid-Term Research Plan
With respect to my shorter term plan (first 3-5 years), I provide a more detailed outline of the thrusts and challenges that I am
planning to tackle, both regarding algorithms and multi-aspect data applications.
Algorithms: Triple-Sparse Multi-Aspect Data Exploration & Analysis with Quality Assessment
In addition to improving the state of the art for tensor decompositions, I plan to explore alternatives for multi-aspect data modeling
and representation learning, which can be used with tensor decompositions. Some of the challenges are:
Algorithms, models, and problem formulation: What is the appropriate (factorization) model? Are there any noisy aspects that may hurt performance? As the ambient dimensions of the data grow and the number of aspects increases, the data become much sparser. Sparsity, as we saw, is a blessing when it comes to scalability; however, it can also be a curse when it comes to detecting meaningful, relatively dense patterns in subspaces within the data.
Scalability & Efficiency: Data size will continue to grow and we need faster and more scalable algorithms to handle the size and
complexity of the data. This will involve adjusting existing algorithms or proposing new algorithms that adhere to the triple-sparse
paradigm. Furthermore, the particular choice of a distributed system can make a huge difference depending on the application. For
instance, Map/Reduce works well for batch tasks, whereas it has well-known weaknesses for iterative algorithms. Other systems, e.g., Spark, could be more appropriate for such tasks, and in the future I plan to research the capabilities of current high-end distributed systems in relation to the algorithms I develop.
(Unsupervised) Quality Assessment: In addition to purely algorithmic approaches, it is instrumental to work in collaboration with field scientists and experts, and to incorporate into the quality assessment elements that experts indicate as important. Finally, there are
a lot of exciting routes of collaboration with Human-Computer Interaction and Crowdsourcing experts, harnessing the “wisdom
of the crowds” in assessing the quality of our analysis, especially in applications such as knowledge on the web and multi-aspect
social networks, where non-experts may be able to provide high quality judgements.
Application: Neurosemantics
In the search for understanding how semantic information is processed by the brain, I am planning to broaden the scope of the
Neurosemantics applications, considering aspects such as language: are the same concepts in different languages mapped in the same way in the brain? Are cultural idiosyncrasies reflected in the way that speakers of different languages represent information?
Furthermore, I will consider more complex forms of stimuli (such as phrases, images, and video) and richer sources of semantic
information, e.g., from Knowledge on the Web. There is profound scientific interest in answering the above research questions, a fact also reflected in how well funded this research area is (e.g., see the BRAIN Initiative, http://braininitiative.nih.gov/).
Application: Urban & Social Computing
Social and physical interaction of people in an urban environment is an inherently multi-aspect process that ties together the physical and the on-line domains of human interaction. I plan to investigate human mobility patterns, e.g., through check-ins in on-line social networks, in combination with people's social interactions and the content they create on-line, with specific emphasis on multi-lingual content, which is becoming very prevalent due to population mobility. I also intend to develop anomaly detection algorithms which can point to fraud (e.g., people buying fake followers for their accounts). Improving user experience through identifying normal and anomalous patterns in human geo-social activity is an extremely important problem, both in terms of funding and research interest, as well as in terms of its implications for revolutionizing modern societies.
Application: Knowledge on the Web
Knowledge bases are ubiquitous, providing taxonomies of human knowledge and facilitating web search. Many knowledge bases
are extracted automatically, and as such they are noisy and incomplete. Furthermore, web content exists in multiple languages,
which is inherently imbalanced and may result in imbalanced knowledge bases, consequently leading to skewed views of web
knowledge per language. I plan to continue my work on web knowledge, devising and developing techniques that combine multiple,
multilingual, structured (e.g., knowledge bases) and unstructured (e.g., plain text) sources of information on the web, aiming for
high quality knowledge representation, as well as enrichment and curation of knowledge bases.
3) Funding, Collaborations, and Parting Thoughts
During my studies, I have partaken in grant proposal writing, contributing significantly to a successful NSF/NIH BIGDATA grant
proposal (NSF IIS-1247489 & NIH 1R01GM108339-1) which resulted in $894,892 to CMU ($1.6M total) and has funded most of
my PhD studies. In particular, I worked on outlining and describing the important algorithmic challenges as well as the applications
we proposed to tackle. Being a major contributor to the proposal was a unique opportunity for me, giving me freedom to shape
my research agenda. I have also been fortunate to have collaborated with a number of stellar researchers both in academia and
industry, many of whom have been my mentors in research. A research agenda is very often shaped in wonderful ways through
such collaborations, as well as through mentoring and advising graduate students. To that end, I intend to keep nurturing and
strengthening my on-going collaborations, and pursue collaboration with scholars within and outside my field, always following
my overarching theme: Bridging Signal Processing and Big Data Science for high-impact, real-world applications.
Selected References
[1] Rakesh Agrawal, Behzad Golshan, and Evangelos E. Papalexakis. A
study of distinctiveness in web results of two search engines. In WWW’15
Web Science Track, (author order is alphabetical).
[2] Rakesh Agrawal, Behzad Golshan, and Evangelos E. Papalexakis.
Whither social networks for web search? In ACM KDD’15, (author order
is alphabetical).
[3] Miguel Araujo, Spiros Papadimitriou, Stephan Günnemann, Christos
Faloutsos, Prithwish Basu, Ananthram Swami, Evangelos E. Papalexakis, and Danai Koutra. Com2: Fast automatic discovery of temporal
(’comet’) communities. In Advances in Knowledge Discovery and Data
Mining. Springer, 2014.
[4] Alex Beutel, Abhimanu Kumar, Evangelos E. Papalexakis, Partha Pratim Talukdar, Christos Faloutsos, and Eric P. Xing. FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop. In SIAM SDM'14, 2014.
[5] Evangelos E. Papalexakis, Leman Akoglu, and Dino Ienco. Do more views of a graph help? Community detection and clustering in multi-graphs. In IEEE FUSION'13.
[6] Evangelos E. Papalexakis, Alex Beutel, and Peter Steenkiste. Network
anomaly detection using co-clustering. In Encyclopedia of Social Network
Analysis and Mining. Springer, 2014.
[7] Evangelos E. Papalexakis and A. Seza Doğruöz. Understanding multilingual social networks in online immigrant communities. WWW ’15
Companion.
[8] Evangelos E. Papalexakis and C. Faloutsos. Fast, efficient, and scalable core consistency diagnostic for the PARAFAC decomposition for big sparse tensors. In IEEE ICASSP'15.
[9] Evangelos E. Papalexakis, Christos Faloutsos, and Nicholas D. Sidiropoulos. ParCube: Sparse parallelizable tensor decompositions. In ECML-PKDD'12.
[10] Evangelos E. Papalexakis, Alona Fyshe, Nicholas D. Sidiropoulos, Partha Pratim Talukdar, Tom M. Mitchell, and Christos Faloutsos. Good-enough brain model: Challenges, algorithms and discoveries in multi-subject experiments. In ACM KDD'14.
[11] Evangelos E. Papalexakis, Tom M. Mitchell, Nicholas D. Sidiropoulos, Christos Faloutsos, Partha Pratim Talukdar, and Brian Murphy. Turbo-SMT: Accelerating coupled sparse matrix-tensor factorizations by 200x. In SIAM SDM'14.
[12] Evangelos E. Papalexakis, Konstantinos Pelechrinis, and Christos Faloutsos. Location based social network analysis using tensors and signal processing tools. In IEEE CAMSAP’15.
[13] Evangelos E. Papalexakis, Nicholas D Sidiropoulos, and Rasmus Bro.
From k-means to higher-way co-clustering: multilinear decomposition
with sparse latent factors. IEEE Transactions on Signal Processing, 2013.
[14] Matt Gardner, Kejun Huang, Evangelos E. Papalexakis, Xiao Fu, Partha Talukdar, Christos Faloutsos, Nicholas Sidiropoulos, and Tom Mitchell. Translation invariant word embeddings. In EMNLP'15.
[15] U Kang, Evangelos E. Papalexakis, Abhay Harpale, and Christos Faloutsos. GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries. In ACM KDD'12.
[16] Ching-Hao Mao, Chung-Jung Wu, Evangelos E. Papalexakis, Christos Faloutsos, and Tien-Cheu Kao. MalSpot: Multi² malicious network behavior patterns analysis. In PAKDD'14.
[17] N. Sidiropoulos, Evangelos E. Papalexakis, and C. Faloutsos. Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition. IEEE Signal Processing Magazine, 2014.