IBM Research
Information Flow Prediction
and People Mining
Ching-Yung Lin
IBM T. J. Watson Research Center
May 27, 2007
5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin
© 2007 IBM Corporation
Data Flow through an Internet Gateway
■ 10 Gbit/s continuous feed coming into the system
■ Types of data
• Speech, text, moving images, still images, coded application data, machine-to-machine binary communication
■ System mechanisms
• Telephony: 9.6 Gbit/s (including VoIP)
• Internet
  - Email: 250 Mbit/s (about 500 pieces per second)
  - Dynamic web pages: 50 Mbit/s
  - Instant messaging: 200 Kbit/s
  - Static web pages: 100 Kbit/s
  - Transactional data: TBD
• TV: 40 Mbit/s (equivalent to about 10 stations)
• Radio: 2 Mbit/s (equivalent to about 20 stations)
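As a quick sanity check on the figures above, the per-source rates can be summed and compared with the 10 Gbit/s feed, and the email rate can be turned into an average message size. This is only a sketch: it assumes decimal prefixes (1 Gbit = 1000 Mbit), reads "Mb" as Mbit, and leaves out transactional data because the slide lists it as TBD. The names `RATES_MBIT_S`, `total_gbit_s` and `avg_email_bytes` are illustrative.

```python
# Per-source rates from the slide, in Mbit/s (decimal prefixes assumed).
# "Transactional data: TBD" is omitted because no figure is given.
RATES_MBIT_S = {
    "telephony": 9600.0,
    "email": 250.0,
    "dynamic_web": 50.0,
    "instant_messaging": 0.2,
    "static_web": 0.1,
    "tv": 40.0,
    "radio": 2.0,
}


def total_gbit_s(rates):
    """Sum the per-source rates and convert Mbit/s to Gbit/s."""
    return sum(rates.values()) / 1000.0


def avg_email_bytes(email_mbit_s, emails_per_s):
    """Average size of one email in bytes implied by a rate and a message count."""
    return email_mbit_s * 1e6 / 8 / emails_per_s


if __name__ == "__main__":
    print(f"total: {total_gbit_s(RATES_MBIT_S):.4f} Gbit/s")
    print(f"average email: {avg_email_bytes(250.0, 500):.0f} bytes")
```

Under these assumptions the listed sources add up to roughly 9.94 Gbit/s, consistent with the 10 Gbit/s feed, and 250 Mbit/s at 500 emails per second corresponds to about 62.5 kB per email.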
Network Monitoring and Stream Analysis
[Figure: dataflow graph of processing elements (PEs). Labels: inputs; packet content analysis (ip, tcp, udp, http, ftp, ntp, rtp, rtsp, sessions); interest routing; interest filtering (keywords, ids, interested multimedia streams); advanced content analysis (audio, video). Annotated per-PE rates: 200-500 MB/s, ~100 MB/s, 10 MB/s.]
By IBM Dense Information Gliding Team
Borrow this from Hoover...
One of the issues – Speech Recognition, Speaker & Social
Network Detection
[Figure: four streams (A to D) pass through speaker detection, then denoising and social network analysis. The result after denoising is a "talks to" network among Olivier, Mihalis, Upendra, Ching-Yung and Deepak. Listed ingredients: social network, fusion technique, iterative method.]
What can be achieved by combining
content analysis and social network analysis?
Challenge – every node in the network is unique
Photo Source: New York Times, 3/2/2005
Part I: Dynamic Probabilistic Complex
Network and Information Flow
The Most Difficult Challenge: State of the Art?
■ Our objective: find important people, community structures, or information flow in a network that is dynamic, probabilistic and complex, in order to allocate resources in a large-scale mining system.
■ Social networks in the sociological and statistical fields: focus on (1) overall network characteristics, (2) dynamic random graphs, (3) binary edges, etc.
→ Do not consider probabilistic nodes/edges or individual nodes/edges.
■ Epidemic networks and computer virus networks: focus on (1) overall network characteristics, such as when an outbreak will occur, and (2) regular/random graphs.
→ Do not focus on individual nodes/edges.
■ (Computer) communication networks: focus on (1) packet transmission, where information is not duplicated, or (2) broadcasting, which does not consider individual nodes/edges or complex network topology.
■ WWW: focus on (1) topology description, (2) binary edges and ranked nodes (e.g., Google PageRank). → Does not consider probabilistic edges.
What is a Dynamic Probabilistic Complex Network?
Modeling a Dynamic Probabilistic Complex Network
■ [Assumption] A DPCN can be represented by a Dynamic Transition Matrix P(t), a Dynamic Vertex Status Random Vector Q(t), and two dependency functions f_M and g_M:
\[
P(t) =
\begin{bmatrix}
p_{1,1}(t) & p_{2,1}(t) & \cdots & p_{N,1}(t) \\
p_{1,2}(t) & p_{2,2}(t) & \cdots & p_{N,2}(t) \\
\vdots & \vdots & \ddots & \vdots \\
p_{1,N}(t) & p_{2,N}(t) & \cdots & p_{N,N}(t)
\end{bmatrix},
\qquad
Q(t) =
\begin{bmatrix}
q_1(t) \\ q_2(t) \\ \vdots \\ q_N(t)
\end{bmatrix},
\]
\[
P(t+\Delta t) = f_M\big(Q(t), P(t)\big),
\quad\text{and}\quad
Q(t+\Delta t) = g_M\big(P(t+\Delta t), Q(t), P(t)\big),
\]
where
\[
p_{i,j}(t) =
\begin{bmatrix}
\Pr(y_{i,j}(t) = SE_1) \\
\Pr(y_{i,j}(t) = SE_2) \\
\vdots \\
\Pr(y_{i,j}(t) = SE_E)
\end{bmatrix},
\qquad
q_i(t) =
\begin{bmatrix}
\Pr(x_i(t) = SV_1) \\
\Pr(x_i(t) = SV_2) \\
\vdots \\
\Pr(x_i(t) = SV_V)
\end{bmatrix},
\]
\[
\sum_{e=1}^{E} \Pr(y_{i,j}(t) = SE_e) = 1,
\qquad
\sum_{v=1}^{V} \Pr(x_i(t) = SV_v) = 1,
\]
and x_i(t) is the status value of vertex i at time t; y_{i,j}(t) is the status value of edge i → j at time t.
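The representation above can be sketched as a small container that holds one probability vector per edge and per vertex and applies the two dependency functions in the stated order (edges first, then vertices using the already-updated edges). This is a minimal illustration, not the authors' implementation: the class name `DPCN`, the nested-list layout `P[i][j]` for edge i → j, and the toy update functions in the usage section are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

Dist = List[float]  # one probability per state; entries should sum to 1


def is_distribution(v: Dist, tol: float = 1e-9) -> bool:
    """True if v is non-negative and sums to 1 within tol."""
    return all(x >= 0.0 for x in v) and abs(sum(v) - 1.0) <= tol


@dataclass
class DPCN:
    P: List[List[Dist]]  # P[i][j]: state distribution of edge i -> j
    Q: List[Dist]        # Q[i]: state distribution of vertex i
    f_M: Callable        # f_M(Q, P) -> new P
    g_M: Callable        # g_M(new_P, Q, P) -> new Q

    def step(self) -> None:
        """Advance one time step: P(t+dt) first, then Q(t+dt) from the new P."""
        new_P = self.f_M(self.Q, self.P)
        new_Q = self.g_M(new_P, self.Q, self.P)
        self.P, self.Q = new_P, new_Q


if __name__ == "__main__":
    # Toy example with one vertex, one self-edge, and two states each.
    flip_edges = lambda Q, P: [[[1.0 - p for p in dist] for dist in row] for row in P]
    copy_edge_to_vertex = lambda new_P, Q, P: [list(new_P[0][0])]
    net = DPCN(P=[[[1.0, 0.0]]], Q=[[0.5, 0.5]],
               f_M=flip_edges, g_M=copy_edge_to_vertex)
    net.step()
    print(net.P, net.Q)
```

The toy `g_M` copies the updated edge distribution into the vertex, which makes it visible that Q(t+Δt) is computed from P(t+Δt) rather than from P(t).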
Information Flow in Dynamic Probabilistic Complex Network
(Let's call it: Behavioral Information Flow (BIF) Model)
■ [Assumption] Edges can be represented by a four-state S-D-A-R (Susceptible-Dormant-Active-Removed) Markov model. Nodes can be represented by a three-state S-A-I (Susceptible-Active-Informed) model.
\[
P(t) =
\begin{bmatrix}
p_{1,1}(t) & p_{2,1}(t) & \cdots & p_{N,1}(t) \\
p_{1,2}(t) & p_{2,2}(t) & \cdots & p_{N,2}(t) \\
\vdots & \vdots & \ddots & \vdots \\
p_{1,N}(t) & p_{2,N}(t) & \cdots & p_{N,N}(t)
\end{bmatrix},
\qquad
Q(t) =
\begin{bmatrix}
q_1(t) \\ q_2(t) \\ \vdots \\ q_N(t)
\end{bmatrix},
\]
\[
P(t+\Delta t) = f\big(M, Q(t), P(t)\big),
\quad\text{and}\quad
Q(t+\Delta t) = g\big(P(t+\Delta t), Q(t), P(t)\big),
\]
where
\[
p_{i,j}(t) =
\begin{bmatrix}
\Pr(y_{i,j}(t) = S) \\
\Pr(y_{i,j}(t) = D) \\
\Pr(y_{i,j}(t) = A) \\
\Pr(y_{i,j}(t) = R)
\end{bmatrix},
\qquad
q_i(t) =
\begin{bmatrix}
\Pr(x_i(t) = S) \\
\Pr(x_i(t) = A) \\
\Pr(x_i(t) = I)
\end{bmatrix},
\]
and the four components of p_{i,j}(t) sum to 1, as do the three components of q_i(t).
[The slide also gives each component a one-letter Greek shorthand; those glyphs did not survive extraction.]
Major Difference between BIF and Prior Modeling Methods in Epidemic Research and Computer Virus Fields
■ Prior models:
▪ Model human nodes as S-I-R (Susceptible, Infected, and Removed).
▪ Did not consider how an individual node's behavior differs with network structure/topology → did not consider edge status.
■ We propose to model edge status as an (autonomous) S-D-A-R Markov model (Susceptible, Dormant, Active, Removed).
■ We propose to model human node behavior as S-A-I (Susceptible, Active, and Informed).
Edges are Markov State Machines, Nodes are not
■ State transitions of edges: S-D-A-R model (Susceptible, Dormant, Active, and Removed). This describes how the state of an edge changes over time.
[Figure, edge view: states S, D, A and R drawn as a chain, with the transition out of S marked "trigger". The transition-probability labels are not recoverable from this transcript.]
■ States of nodes: S-A-I model (Susceptible, Active, and Informed). A trigger occurs when the start node of the edge changes from state S to state I.
[Figure, node view: states S, A and I, with the trigger marked; a network view combines the node and edge views.]
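To make the edge view concrete, here is a small simulation of one edge moving through S, D, A and R. It is a sketch under stated assumptions: the transcript does not preserve the transition probabilities, so `p_wake` (D to A per step) and `p_finish` (A to R per step) are hypothetical parameters, and the S to D move is taken to happen with certainty on the step where the trigger fires, whereas the slides appear to weight that move by the edge parameters.

```python
import random


def step_edge(state, triggered, p_wake, p_finish, rng):
    """Advance one edge by one time step in the S-D-A-R chain."""
    if state == "S":
        # The edge leaves S only when its start node triggers it.
        return "D" if triggered else "S"
    if state == "D":
        return "A" if rng.random() < p_wake else "D"
    if state == "A":
        return "R" if rng.random() < p_finish else "A"
    return "R"  # R is absorbing


def run_edge(trigger_step, steps, p_wake, p_finish, seed=0):
    """Return the list of states visited, starting from S."""
    rng = random.Random(seed)
    state = "S"
    history = [state]
    for k in range(steps):
        state = step_edge(state, k == trigger_step, p_wake, p_finish, rng)
        history.append(state)
    return history


if __name__ == "__main__":
    print(run_edge(trigger_step=2, steps=12, p_wake=0.5, p_finish=0.3))
```

With `p_wake` and `p_finish` below 1 the edge lingers in D and A for a random number of steps, which is one simple way to model a waiting time and a transmission duration.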
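Edge State Probability and Network Configuration Model
■ Nodes and edges:
\[
P(t+\Delta t) = f\big(M, Q(t), P(t)\big)
\]
■ Network Configuration Model M (learned by training). It includes the network topology information, the long-term edge probability, and the delay parameters.
\[
M =
\begin{bmatrix}
(\alpha_{1,1}, \lambda_{1,1}, \gamma_{1,1}) & (\alpha_{2,1}, \lambda_{2,1}, \gamma_{2,1}) & \cdots & (\alpha_{N,1}, \lambda_{N,1}, \gamma_{N,1}) \\
(\alpha_{1,2}, \lambda_{1,2}, \gamma_{1,2}) & (\alpha_{2,2}, \lambda_{2,2}, \gamma_{2,2}) & \cdots & (\alpha_{N,2}, \lambda_{N,2}, \gamma_{N,2}) \\
\vdots & \vdots & \ddots & \vdots \\
(\alpha_{1,N}, \lambda_{1,N}, \gamma_{1,N}) & (\alpha_{2,N}, \lambda_{2,N}, \gamma_{2,N}) & \cdots & (\alpha_{N,N}, \lambda_{N,N}, \gamma_{N,N})
\end{bmatrix}
\]
[α is the long-term edge probability. The glyphs for the other two parameters were lost in extraction; they are written here as λ and γ.]
■ α_{i,j} = 0 → no edge between i and j
■ Our KDD 2005 paper is a special case in which α_{i,j} = 1 or 0 and (λ_{i,j}, γ_{i,j}) is not modeled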
Define Edge State Probability Update Function
■ Edge state probability update function f(·) such that:
\[
P(t+\Delta t) = f\big(M, Q(t), P(t)\big)
\]
■ Given three different cases:
1. On trigger: x_i(t+\Delta t) = I, x_i(t) \neq I
\[
p_{i,j}(t+\Delta t) = F \cdot p_{i,j}(t),
\]
where F is a 4×4 transition matrix over the edge states (S, D, A, R) whose entries are built from 0, 1 and the edge parameters in M. [The individual entries of F are not recoverable from this transcript.]
2. No trigger, node not informed yet: x_i(t+\Delta t) \neq I, x_i(t) \neq I
\[
p_{i,j}(t+\Delta t) = p_{i,j}(t)
\]
3. No trigger, node has been informed: x_i(t+\Delta t) = I, x_i(t) = I
\[
p_{i,j}(t+\Delta t) = F \cdot p_{i,j}(t)
\]
■ Therefore, taking the probabilities of the node states into account, we get f(·):
\[
p_{i,j}(t+\Delta t) = \iota_i \cdot F \cdot p_{i,j}(t) + (1 - \iota_i) \cdot p_{i,j}(t)
\]
[ι_i stands in for a node-state probability of source node i whose glyph was lost; given the three cases, it is presumably the probability that node i is informed.]
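The final update rule is a convex mix of "apply F" and "leave unchanged", weighted by a node-state probability. The sketch below shows that mix in plain Python. Several things here are assumptions rather than the slide's content: the weight is read as the probability that the source node is informed, the example matrix `F_EXAMPLE` is made up for illustration (the slide's actual entries are not recoverable), and the function names are hypothetical.

```python
def mat_vec(F, p):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(F[r][c] * p[c] for c in range(len(p))) for r in range(len(F))]


def update_edge(p, F, w_informed):
    """Return w * (F p) + (1 - w) * p for one edge-state vector p = (S, D, A, R)."""
    Fp = mat_vec(F, p)
    return [w_informed * a + (1.0 - w_informed) * b for a, b in zip(Fp, p)]


# Illustrative column-stochastic matrix (each column sums to 1), NOT the slide's F:
# S -> D with probability 1; D stays 0.5 or moves to A 0.5;
# A stays 0.75 or moves to R 0.25; R is absorbing.
F_EXAMPLE = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.75, 0.0],
    [0.0, 0.0, 0.25, 1.0],
]

if __name__ == "__main__":
    p = [1.0, 0.0, 0.0, 0.0]  # edge certainly susceptible
    print(update_edge(p, F_EXAMPLE, w_informed=0.5))
```

Because F is column-stochastic and the weight lies between 0 and 1, the updated vector remains a probability distribution over the four edge states.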
Nodes: State Transitions Determined by Incoming Edges
\[
Q(t+\Delta t) = g\big(P(t+\Delta t), Q(t), P(t)\big)
\]
■ Node state probability update function g(·):
[The slide writes q_i(t+\Delta t) as a 3×3 matrix applied to q_i(t). The matrix entries are 0, 1 and products over n ∈ V_{·,i} of terms built from the incoming-edge state probabilities, with one product defined through the joint event on y_{n,i}(t+\Delta t) and y_{n,i}(t) involving states R and A. The exact entries and relation symbols are not recoverable from this transcript.]
where V_{·,i} is the set of all source nodes of the incoming edges of node i:
\[
V_{\cdot,i} = \{\, n \mid n \in \{1, \dots, N\},\ \alpha_{n,i} \neq 0 \,\}
\]
[Figure: node view (S, A, I with trigger) and network view.]
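An Application of Information Flow Prediction – find important people
■ Who are the most likely people to talk about this information at a specific time, given the current observation?
\[
(m, n) = \arg\max_{m,n \in \{1, \dots, N\}} \Pr\big(y_{m,n}(t+\tau) = A\big)
\quad\text{given } Q(t) \text{ or } \big(P(t), Q(t)\big)
\]
[The slide uses a one-letter shorthand for the maximized edge-state probability; the glyph is lost, and the Active state is assumed here because the question asks who is talking.]
■ For a given concrete observation, the values in the given priors P(t), Q(t) are either 0 or 1.
■ For speaker recognition results, the priors can be confidence values between 0 and 1.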
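The selection step above is an arg-max over directed node pairs. A minimal sketch, assuming the predicted per-edge probabilities are already available as a dictionary keyed by `(m, n)`; the function name `most_likely_pair` and the exclusion of self-edges are illustrative choices, not taken from the slides.

```python
def most_likely_pair(active_prob):
    """Return the (m, n) pair with the highest predicted probability.

    active_prob maps (m, n) -> predicted probability for edge m -> n.
    Self-edges are skipped; ties go to the smallest pair in sorted order.
    """
    candidates = {edge: p for edge, p in active_prob.items() if edge[0] != edge[1]}
    if not candidates:
        return None
    return max(sorted(candidates), key=lambda edge: candidates[edge])


if __name__ == "__main__":
    predicted = {(0, 1): 0.2, (1, 2): 0.7, (2, 0): 0.4}
    print(most_likely_pair(predicted))
```

In practice the same dictionary could be sorted in full to obtain a ranked list of likely conversations rather than only the top pair.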
Case Study I – Switchboard data from 679 people
■ Monte Carlo method: simulate each DPCN information flow 1000 times.
■ Predicting the process with MC simulation takes 12 seconds. (For a given model, testing all 679 nodes, i.e. computing the probabilities when the information flow starts from each of the 679 seeds in turn, takes a PC about 130 minutes.)
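The Monte Carlo procedure can be illustrated with a deliberately simplified simulation: repeat a random spread from one seed many times and count how often each node is reached. This sketch assumes each edge i → j passes the information with a fixed probability `alpha[i][j]` and ignores the delay states (D, A) and timing entirely, so it is a stand-in for the idea, not the BIF model itself. All names are hypothetical.

```python
import random


def simulate_once(alpha, seed_node, rng):
    """One random spread from seed_node; returns the set of nodes reached."""
    informed = {seed_node}
    frontier = [seed_node]
    while frontier:
        i = frontier.pop()
        for j, a in alpha.get(i, {}).items():
            # Each edge gets one chance to pass the information on.
            if j not in informed and rng.random() < a:
                informed.add(j)
                frontier.append(j)
    return informed


def receive_probabilities(alpha, seed_node, n_nodes, trials=1000, seed=0):
    """Estimate, per node, the probability of receiving the information."""
    rng = random.Random(seed)
    counts = [0] * n_nodes
    for _ in range(trials):
        for node in simulate_once(alpha, seed_node, rng):
            counts[node] += 1
    return [c / trials for c in counts]


if __name__ == "__main__":
    alpha = {0: {1: 0.5, 2: 0.1}, 1: {2: 0.5}}
    print(receive_probabilities(alpha, seed_node=0, n_nodes=3))
```

Running this once per seed node mirrors the "679 seeds" experiment in structure: the cost grows linearly with the number of seeds and the number of trials.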
[Figure: "The Probabilities of the Nodes Receiving Information" for seed node ID 100. x-axis: node index (1 to 679); y-axis: probability (0 to 0.3).]
The distribution histogram of the alpha values of the edges in the Enron dataset.
[Figure: histogram with a logarithmic count axis (1 to 100000) over alpha bins 0, 0+~0.1, 0.1+~0.2, ..., 0.9+~1. Series: All Topics, Market Opportunity, California Market, North America Product.]
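Noise Factor I – Impact of Classification Error from Speaker Recognition
■ Assume the classification precision rate on speaker (node) i is f_i, and the false alarm rate on speaker i is φ_i.
■ Then the expected number of times that the node is counted is:
\[
\hat{K} = f_i K + \varphi_i \cdot 2Z
\]
■ And the expected number of times that the link is counted is:
\[
\hat{L} = f_i f_j L + \varphi_i \varphi_j Z
\]
■ Therefore,
\[
\alpha^{\text{truth}}_{i,j} = \frac{L}{K},
\qquad
\alpha^{\text{detected}}_{i,j} = \frac{f_i f_j L + \varphi_i \varphi_j Z}{f_i K + \varphi_i \cdot 2Z}
\]
■ If we assume a universal precision f and false alarm rate φ at all speakers, then:
\[
\alpha^{\text{detected}}_{i,j} = \frac{f^2 L + \varphi^2 Z}{f K + \varphi \cdot 2Z}
\]
■ Assume the average waiting time of links and the average transmission duration of links are the same regardless of the links observed. Then the two delay parameters (written λ and γ here, since their glyphs were lost) are unchanged:
\[
\lambda^{\text{detected}}_{i,j} = \lambda^{\text{truth}}_{i,j}
\quad\text{and}\quad
\gamma^{\text{detected}}_{i,j} = \gamma^{\text{truth}}_{i,j}
\]
■ If we assume the false alarm rate is small and can be neglected when the number of nodes is large, then
\[
\alpha^{\text{detected}}_{i,j} \approx f \cdot \alpha^{\text{truth}}_{i,j}
\]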
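A short numeric check of the relations above may help. The sketch below computes the detected edge probability from the true counts and the recognizer's rates, and confirms that with no false alarms and a universal precision f the result reduces to f times the true value. The symbols L, K and Z are used exactly as on the slide (Z is not defined in the transcript); the function name and the example numbers are hypothetical.

```python
def detected_alpha(L, K, Z, f_i, f_j, phi_i, phi_j):
    """Detected edge probability under precision f and false alarm rate phi.

    Uses the slide's expressions:
      link count: f_i * f_j * L + phi_i * phi_j * Z
      node count: f_i * K + phi_i * 2 * Z
    """
    link_count = f_i * f_j * L + phi_i * phi_j * Z
    node_count = f_i * K + phi_i * 2 * Z
    return link_count / node_count


if __name__ == "__main__":
    L, K, Z = 20, 100, 1000   # made-up counts for illustration
    true_alpha = L / K
    print(true_alpha, detected_alpha(L, K, Z, 0.8, 0.8, 0.0, 0.0))
    print(detected_alpha(L, K, Z, 0.8, 0.8, 0.01, 0.01))
```

With f = 0.8 and no false alarms, a true value of 0.2 is detected as 0.16; a non-zero false alarm rate changes both the numerator and the denominator, so the simple f-scaling only holds approximately.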
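Speaker Recognition Accuracy can be Improved by Fusion of Original Speaker Recognition and Predicted Node Probability
■ We can use this fusion method to combine both the speaker recognition result and the estimated node probability:
\[
f_i' = \frac{f_i \cdot \rho_i}{f_i \cdot \rho_i + \sum_{k} f_{i,k} \cdot \rho_k}
\]
which is guaranteed to be increasing when \rho_i > \rho_k.
[ρ stands in for the estimated node probability, whose glyph was lost in extraction; f_{i,k} is the recognizer score for confusing speaker i with speaker k.]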
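The fusion rule can be read as re-weighting each recognizer score by the predicted node probability and renormalizing. A minimal sketch of that reading follows; the name `fuse`, the use of `rho` for the predicted node probability, and the example numbers are assumptions for illustration.

```python
def fuse(f_i, rho_i, confusions):
    """Fused score for speaker i.

    f_i        : recognizer score for the true speaker i
    rho_i      : predicted probability for node i (e.g. from the flow model)
    confusions : list of (f_ik, rho_k) pairs, one per competing speaker k
    """
    numerator = f_i * rho_i
    denominator = numerator + sum(f_ik * rho_k for f_ik, rho_k in confusions)
    return numerator / denominator


if __name__ == "__main__":
    # Recognizer alone gives 0.5 to speaker i and 0.3 / 0.2 to two competitors.
    print(fuse(0.5, 0.6, [(0.3, 0.2), (0.2, 0.2)]))  # prediction favors i
    print(fuse(0.5, 0.4, [(0.3, 0.4), (0.2, 0.4)]))  # prediction is neutral
```

When the prediction favors speaker i over every competitor the fused score exceeds the original score, and when all predicted probabilities are equal (and the recognizer scores sum to 1) the score is unchanged.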
[Figure: before fusion, the speaker i recognizer outputs f_i, f_{i,k1}, f_{i,k2}, f_{i,k3}. After fusion with the BIF prediction, the recognizer output f_i is combined with the predicted node probability to give the fused score.]
Recognition Result from Switchboard-2 Telephone
Conversation Set
[Figure: recognition accuracy (y-axis, 0 to 1.2) against model update time (x-axis, 0 to 5) for six test cases: Node 218 and Node 164 with no false alarm; Node 218 and Node 164 mutually confused; Node 218 and Node 164 with false alarm probability 0.3.]
■ Improvement in recognition accuracy on Node 171. The x-axis is the time at which the model is updated based on the recognition result after fusion. The y-axis represents the recognition accuracy. In the six test cases, Node 171 is usually confused with Node 218 or Node 164. In the first two cases, there is no false alarm from the classification of Node 218 or 164. In the next two cases, they are usually confused with each other. In the last two cases, the false alarm rate from Node 218 or 164 is 0.3.
Case Study (II) – our experiments on Enron Emails
Modeling and Predicting Topic-Related Personal Information Flow
■ Content-Time-Relation Model: combines content, time and social relation information with Dirichlet allocations and a causal Bayesian network. [Song et al., KDD, August 2005] (first paper combining content analysis and social network analysis)
[Figure: graphical model over the variables a, S, z, w, r and t, with plates labeled N, D, T and T_m and a legend marking which variables are observations. Hyperparameter labels are not recoverable from this transcript.]
Given the sender and the time of an email:
1. Get the probability of a topic given the sender
2. Get the probability of the receiver given the sender and the topic
3. Get the probability of a word given the topic
a: sender/author, z: topic, S: social network (Exponential Random Graph Model / p* model), D: document/email, r: receivers, w: content words, N: word set, T: topic
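Steps 1 and 2 above can be chained to predict likely receivers for a sender: weight each topic by its probability given the sender, then sum the receiver probabilities over topics. The sketch below shows only that marginalization with hand-made probability tables; it does not learn anything, and the table contents, topic labels and function name are made up for illustration.

```python
def receiver_distribution(p_topic_given_sender, p_receiver_given_sender_topic):
    """P(r | a) = sum over z of P(z | a) * P(r | a, z), for one fixed sender a.

    p_topic_given_sender          : {topic: P(z | a)}
    p_receiver_given_sender_topic : {topic: {receiver: P(r | a, z)}}
    """
    scores = {}
    for topic, p_topic in p_topic_given_sender.items():
        for receiver, p_recv in p_receiver_given_sender_topic[topic].items():
            scores[receiver] = scores.get(receiver, 0.0) + p_topic * p_recv
    return scores


if __name__ == "__main__":
    topics = {"power": 0.75, "meeting": 0.25}
    receivers = {
        "power": {"bob": 0.5, "carol": 0.5},
        "meeting": {"bob": 1.0},
    }
    print(receiver_distribution(topics, receivers))
```

Step 3 would be used the same way in the other direction: given the words of a new email, the word-given-topic probabilities update the topic weights before the receivers are scored.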
[Figure note: boxes represent iteration.]
Corporate Topic Trend Analysis Example: Yearly repeating events
[Figure: "Topic Trend Comparison". Popularity (0 to 0.03) by month (Jan to Nov) for Topic 45 and Topic 19, each plotted for 2000 and 2001.]
Topic 45, which concerns a scheduling issue, peaks from June to September. Topic 19 concerns a meeting issue. The trend repeats from year to year.
Topic Detection and Key People Detection of “California Power” Match
Their Real-Life Roles
[Figure (a): "Topic Analysis for Topic 61". Popularity (0 to 0.018) over time, Jan-00 to Oct-01.]
Key Words: power 0.089361, California 0.088160, electrical 0.087345, price 0.055940, energy 0.048817, generator 0.035345, market 0.033314, until 0.030681
Key People: Jeff_Dasovich 0.249863, James_Steffes 0.139212, Richard_Shapiro 0.096179, Mary_Hain 0.078131, Richard_Sanders 0.052866, Steven_Kean 0.044745, Vince_Kaminski 0.035953
The event "California Energy Crisis" occurred in exactly this time period. The key people were active in this event, except Vince_Kaminski …
Social Network of Enron Managers
■ Finding social networks based on all communications is difficult.
Information Flow in Enron – California Market
■ Actor 151 (Rosalee Fleming, assistant to the Enron CEO Ken L.) is the key information spreader on this issue.
Information Flow in Enron – Market Opportunities
■ Rosalee Fleming also played an important role in "Market Opportunities." She received information from Actor 119 (Mike Carson) and Actor 23 (James Steffes, VP of Gov. Affairs of Enron).
■ Actor 68 (Rod Hayslett, CFO) is also a major information spreader.
Information Flow in Enron – North American Products
■ Two disjoint communities can be observed. Actor 21 (Keith Holst) and Actor 142 (Dan Hyvl) are the main bridges between the two communities.
This kind of analysis is wonderful, but...
■ We cannot wait until our company has a scandal and goes bankrupt...
■ What kinds of valuable applications can come out of network analysis?
Part II: Small Blue
Social Network -- A key differentiator for corporate performance
The informal social network within a formal organization is a major factor affecting a company's performance:
■ Krackhardt (CMU, 2005) showed that companies with strong informal networks perform five or six times better than those with weak networks.
■ Brydon (VisiblePath, 2006) reported the following performance gains for companies utilizing social networks:
• 16x in sales
• 4x in marketing
• 10x in hiring
We hope social network and expertise mining can dramatically
increase our colleagues’ knowledge and collaboration
Social Networks -- Beyond the organizational chart
■ Organization charts are not the best indicator of how work gets done
■ Senior people are not always central; peripheral people can represent untapped knowledge
■ Making the network visible makes it actionable, and it becomes the basis for a collaboration action plan
Source: Cross, R., Parker, A., Prusak, L. & Borgatti, S.P. 2001. Knowing What We Know: Supporting Knowledge Creation and Sharing in Social Networks. Organizational Dynamics 30(2): 100-120.
Provided by Drs. Tony Mobbs and Kate Ehrlich, IBM
Group and Roles
Central people
■ Sam. Could be a bottleneck or could be holding the group together
Peripheral people
■ Earl. Goes to others, but no one goes to him for information. At risk of leaving. Potentially unrealized expertise
Sub-groups
■ Group split by function. Very little information shared across groups
[Figure: network of Andy, Frank, Indojit, Carl, Karen, Darren, Bob, Sam, Ming, Neo, Leo, Earl, Gerry, Harry and Jeff, grouped into Marketing, Finance and Manufacturing.]
This slide is excerpted from SNA Theory, Concepts and Practice by Dr. T. Mobbs, BCS and Dr. K. Ehrlich, Research
Some Roles are especially critical
What happens if Sam leaves the group through layoffs, job reassignment, attrition, merger, or retirement?
[Figure: the same network without Sam: Andy, Frank, Indojit, Carl, Karen, Darren, Bob, Ming, Neo, Leo, Earl, Gerry, Harry and Jeff, grouped into Marketing, Finance and Manufacturing.]
This slide is excerpted from SNA Theory, Concepts and Practice by Dr. T. Mobbs, BCS and Dr. K. Ehrlich, Research
Relationships are multi-dimensional and (traditionally)
uncovered through network questions
Dimension groups: Awareness, Actions, Emotional
• Communication: How often do you communicate with this person?
• Awareness: I am aware of this person's knowledge and skills
• Trust: I believe there is a high personal cost in seeking advice or support from this person
• Innovation: How often do you turn to this person for new ideas
• Valued Expertise: How likely are you to turn to this person for specialized expertise
• Access: I believe this person will respond to my request in a reasonable and timely manner
• Advice: How often do you seek advice from this person before making an important decision?
• Learning: How likely are you to rely on this person for advice on new methods and processes
• Energy: I generally feel energized when I interact with this person
Provided by Drs. Tony Mobbs and Kate Ehrlich, IBM
Personal Network preferred source for information and collaboration
GBS Practitioner with task in project / delivery environment
Forces:
• Time constrained
• Delivery activity focus
• What gets measured gets done
• Expedience
• Perceived value (return on time investment)
Personal Network (preferred / primary mode):
• fast turnaround of request
• specific response
• small # relevant items returned
• recommendation of quality
• ability to quickly understand the supplied resource & determine relevant parts
• additional context / value-add info not available in electronic materials
High reliance on:
• 50% ~ 75%: Personal networks (Gartner Report, 2006)
• Hard-drive materials
• What has worked for them previously (personal experience)
Existing Resources Provided (W3 stubs / clients): Project Tools, Project Repositories, Knowledge View, PSN, Methods, Education, Communities, Other w3 content, Collaboration
Standalone, disparate, poor integration, large number of sources, steep learning curve (identify, understand & synthesise into specific work context), difficult to locate, choose & use.
This leads to:
• Under utilisation of electronic products and services.
• Content has lower performance impact / not realising full potential benefits.
• Widely inconsistent working practices.
■ Who knows what? How to reach them? Who plays what hidden roles?
Mining Expertise, Interests and Social Network
■ People can be "known" by:
▪ public resources:
• publications
• personal webpages
• blogs
• presentations
• wiki
▪ organizational resources:
• patent applications
• bluepages
▪ personal resources:
• emails
• instant messaging
• meetings
• phone calls
• face-to-face interactions
[Figure annotation: the resources range from public to private; timely and abundant resources for expertise modeling.]
■ A person's expertise can also be inferred from her friends' recommendations or expertise.
SmallBlue Clients (Distributed Automatic Social Sensors)
■ My personal network (Ego net) inferred from my Notes emails in server/local/archive and SameTime chats
■ Inference of my understanding of my friends' expertise
■ Other IBMers' EgoNets
■ Other IBMers' Expertise Inferences
■ I cannot see their communications, EgoNets nor Expertise Inferences
External Data: Bluepages, BlueGroups, CommunityMap, BlogCentral, IBM Forum, KnowledgeView, Social Bookmark
SmallBlue Inference Engines and Servers
SmallBlue Find
■ user searches for experts or a person
■ Corporate-wide ranked experts
■ Ranked experts in my extended personal network, in a business unit and/or in a country
■ Only public information is shown
SmallBlue Connect
■ social network analysis of Top-K experts
■ social network analysis of a list of people
■ social network analysis (SNA): who are the key persons in this network? who are the major hubs? who are the major bridges?
■ SNA of a formal group, a bluegroup or a community
SmallBlue Ego
■ My friends' social values to me
■ Evolution of my Ego net
SmallBlue Reach
■ how to reach a person
■ social network info
■ My social paths to her: which friends can introduce her, which friends work with her, .. → trust, awareness, collaboration.
■ Her public postings, profiles, and communities to judge whether she is the right person.
SmallBlue Expand
■ Who I may want to know..
■ Which communities I may want to join..
■ Which documents I may want to look at
[Figure labels on the applications: Private & Personalized; Public; Public & Personalized.]
Major Use of SmallBlue Find
■ Find out who the experts are for any search term. (Right now, zillions of possible terms.)
■ Rank them based on collaborative expert recommendation
■ Can show experts based on:
▪ the whole corporation
▪ business unit
▪ country
▪ my personal proximity
Collaborative Expert Recommendation
■ Combine everyone's knowledge of the expertise of our colleagues.
■ The more recommendations from more colleagues, the higher the score.
■ The more recommendations from my trusted colleagues, the higher the score.
■ The higher the recommendation score from colleagues, the higher the overall score.
Combining all IBMers' knowledge, we can make an advanced expert finding search engine.
Utilizing the expert search engine, we can enhance all IBMers' knowledge and social connections.
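The three scoring principles above (more recommenders, more trusted recommenders, stronger recommendations) can be captured by a weighted sum. The sketch below is only one way to satisfy them and is not SmallBlue's actual scoring; the function name, the `(recommender, candidate, strength)` tuple layout and the `trust` weights are all hypothetical.

```python
def expert_scores(recommendations, trust, default_trust=1.0):
    """Rank candidates by trust-weighted recommendation strength.

    recommendations : iterable of (recommender, candidate, strength)
    trust           : {recommender: weight} from the searcher's point of view
    Returns a list of (candidate, score), highest score first.
    """
    scores = {}
    for recommender, candidate, strength in recommendations:
        weight = trust.get(recommender, default_trust)
        scores[candidate] = scores.get(candidate, 0.0) + weight * strength
    # Sort by descending score, then by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    recs = [("ann", "dave", 0.5), ("bob", "dave", 0.5), ("cat", "erin", 0.75)]
    print(expert_scores(recs, trust={}))            # dave wins on volume
    print(expert_scores(recs, trust={"cat": 2.0}))  # erin wins once cat is trusted
```

Because the trust weights come from the searcher, two people issuing the same query can see different rankings, which matches the "my personal proximity" view of SmallBlue Find.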
SmallBlue Reach Paths help users to reach another person
■ SmallBlue Reach Paths show the shortest paths for me to reach a person up to 6 degrees away.
■ SmallBlue Reach Paths can be initiated from any one of three SmallBlue applications.
■ Can be used for:
▪ Access -- knowing who can help introduce me to this person.
▪ Trust -- knowing who in my social networks knows this person.
▪ Get Familiar with -- knowing what kinds of people are in contact with this person.
▪ Initiate Communication -- knowing who we have in common.
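A shortest path with a cap of six degrees can be found with a breadth-first search that stops expanding at the depth limit. The sketch below returns one shortest path (the slide speaks of paths in the plural, so a full version would enumerate all of them); the graph format and the name `reach_path` are assumptions.

```python
from collections import deque


def reach_path(graph, source, target, max_degrees=6):
    """Return one shortest path from source to target, or None if none exists
    within max_degrees hops. graph maps each person to an iterable of contacts."""
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_degrees:
            continue  # do not look beyond the degree limit
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                if neighbor == target:
                    # Walk back through the parents to rebuild the path.
                    path = [neighbor]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                queue.append((neighbor, depth + 1))
    return None


if __name__ == "__main__":
    contacts = {"me": ["ann"], "ann": ["me", "bob"], "bob": ["ann", "alice"], "alice": ["bob"]}
    print(reach_path(contacts, "me", "alice"))
```

The intermediate people on the returned path are the candidates for an introduction, which is the "Access" use listed above.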
SmallBlue Ego
■ How healthy is my personal social capital?
■ What is the social value of Alice to me?
■ What are the changes and trends in the evolution of my social capital?
■ For instance, I have to talk to Alice soon. She is valuable to me in terms of social connections, and she is drifting out of my Ego net circle.
SmallBlue Connect
■ Enterprise Social Network Analysis Tool
■ Showing social networks of people based on:
▪ expertise key words
▪ formal hierarchy
▪ any list of emails
■ Utilizing Social Network Analysis to show:
▪ who are the important hubs among experts
▪ who are the important bridges linking groups
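Hubs and bridges can be approximated with two simple graph measures: degree for hubs, and "removing this person disconnects the network" for bridges. The sketch below uses those definitions on a small undirected graph; they are one common reading of the terms, not necessarily the measures SmallBlue Connect uses, and all function names are hypothetical.

```python
def undirected(edges):
    """Build an adjacency dict of sets from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def count_components(graph, skip=None):
    """Number of connected components, optionally ignoring one node."""
    remaining = {n for n in graph if n != skip}
    count = 0
    while remaining:
        count += 1
        stack = [remaining.pop()]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    stack.append(neighbor)
    return count


def hubs(graph, k):
    """The k people with the most direct connections."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))[:k]


def bridge_people(graph):
    """People whose removal splits the network into more pieces."""
    base = count_components(graph)
    return [n for n in sorted(graph) if count_components(graph, skip=n) > base]


if __name__ == "__main__":
    g = undirected([("a", "b"), ("b", "c"), ("a", "c"), ("c", "x"),
                    ("x", "d"), ("d", "e"), ("e", "f"), ("d", "f")])
    print(hubs(g, 2), bridge_people(g))
```

The brute-force bridge check re-counts components once per person, which is fine for a few hundred people; larger networks would call for a linear-time articulation-point algorithm.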
Privacy Consideration – Bottom Line
■ Employees' communications (e.g., time, from, to, cc, subject, content of emails, SameTime, etc.) are NOT searchable by or retrievable to anyone.
■ Employees' knowledge of other employees is INFERRED. Only the aggregated inferred knowledge is searchable. It is NOT possible to guess which part of the aggregated inferred knowledge was contributed by whom.
■ In the social network analysis graphs, relationships between people are modeled by their multimodal generic relationships. There is NO clue to their communication content.
■ Only the employees' outgoing emails and instant messages, and the portion that was authored by the employee, are utilized.
■ Anyone can suggest keywords that should not be searched or search terms that should not find him, or ask to be removed from the system.
Preliminary User Evaluation
Scores: 5 = very satisfied

                        5     4     3     2     1
Capability             24%   42%   17%   17%    0%
Usability              28%   33%    5%   25%   10%
Search                 10%   43%   23%   22%    2%
Reliability            28%   38%   17%   12%    5%
Performance            15%   45%   25%   13%    3%
Privacy                29%   34%   34%    3%    0%
Personal Network       15%   50%   13%   23%    0%
Overall Satisfaction   17%   49%   17%   15%    2%
Demo
Coincidence?? ☺
SmallBlue Ego Trial Release (8/21)
SmallBlue Find and Connect Trial Release (9/20)
SmallBlue on TAP (11/07)
Acknowledgements
■ Thanks to the SmallBlue Team Members:
Vicky Griffits-Fisher, Kate Ehrlich, Christopher Desforges, Michael Ackerbaruer, Reynold Khachatourian, Irina Fedulova, Ekaterina Zaytseva, Jeffrey Borden, Jennifer Xu, Yi Gu, Jie Lu, Dima Rekesh, Belle Tseng, Xiaodan Song
■ Contact: Ching-Yung Lin ([email protected])
( http://www.research.ibm.com/people/c/cylin )