The Power of Data Mining and Machine
Learning Techniques for Network
Construction and Analysis
Reda Alhajj
University of Calgary, Calgary, Alberta, Canada
Global University, Beirut, Lebanon
General Overview

The network model provides a powerful platform
to study a group of entities and their relationships

The semantics of the links in the network is determined
by considering the application domain to be investigated

A network can be constructed by considering pairwise correlation
between entities or by investigating the correlation between two
entities based on a global view of the data

Data mining and machine learning techniques allow for better
investigation by taking a global view of the data to derive the strength
of pairwise links

The combination of data mining, machine learning and network
analysis would lead to a comprehensive and robust framework for
data analysis.
Reda Alhajj, University of Calgary
BYU, Provo, USA, March 2013
Outline of the talk

Background on ARM, Clustering, Network Model, fuzziness

From FPM, ARM and clustering to network

Some Application Domains:
- database design
- web mining
- terror network analysis
- outlier detection
- disease biomarker
- database search

Conclusions and research directions
Overview of Association Rules Mining

A general model for mining domains where there is a
many-to-many relationship between two sets of entities, e.g.,
baskets and items, documents and words, etc.

Consider a set of items I = {I1, I2, I3, …, Im}

Consider a database of transactions D where each transaction
T is a set of items such that T ⊆ I

So, if A is a set of items, a transaction T is said to contain A if
and only if A ⊆ T

An association rule is an implication or correlation of the
form:
A ⇒ B where A ⊂ I, B ⊂ I, and A ∩ B = ∅

Support and confidence are the measures generally used to
filter the rules
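
As a minimal sketch of the two measures, assuming support is the fraction of transactions containing an itemset and confidence of A ⇒ B is support(A ∪ B) / support(A). The toy baskets and the function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A) for the rule A => B."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical toy database of four baskets
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

A rule is kept only if both values reach the chosen minimum thresholds.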
Association Rules Mining: Two Steps


In general, association rules mining can be reduced to the
following two steps:
1. Find all frequent itemsets
   - Each itemset will occur at least as frequently as a
     minimum support count
2. Generate strong association rules from the frequent
   itemsets
   - These rules will satisfy minimum support and confidence
     measures
We use the outcome from the first step in part of the research
and the outcome from the second step in another part of the
research
Association Rules Mining: Apriori Algorithm

Any subset of a frequent itemset must be frequent
- Apriori pruning principle: if any itemset is infrequent, its
  supersets should not be generated or tested

Minimum support = 2
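
The pruning principle can be sketched as a level-wise search. This is a compact illustration rather than an optimized implementation; the four transactions are a hypothetical example, mined with a minimum support count of 2.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_support_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(candidates).items()
                    if n >= min_support_count}
        result.update(frequent)
        k += 1
    return result


# Hypothetical transactions
transactions = [frozenset(t) for t in (
    {"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"})]
frequent_itemsets = apriori(transactions, min_support_count=2)
```

Here {A, B} is infrequent, so {A, B, C} is pruned without being counted.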
Association Rule Mining
Frequent Closed Itemset

A frequent itemset X is closed if none of its immediate
supersets has the same support as the itemset X
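
A short sketch of this definition, assuming the input maps every frequent itemset to its support count (so any superset with equal support is present). The counts below are hypothetical.

```python
def closed_itemsets(frequent):
    """Keep itemsets that have no immediate superset with the same support."""
    closed = {}
    for itemset, count in frequent.items():
        has_equal_superset = any(
            len(other) == len(itemset) + 1 and itemset < other and n == count
            for other, n in frequent.items())
        if not has_equal_superset:
            closed[itemset] = count
    return closed


# Hypothetical support counts for illustration
frequent = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}
closed = closed_itemsets(frequent)
```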

Example
Image Reference:
http://www.siam.org/meetings/sdm06/proceedings/038lucchesec.pdf
Clustering

It is an unsupervised learning process

It is the process of distributing a given set of data instances into
groups such that the similarity of instances is high within each
group and low between the groups.


Similarity within a cluster (intra-cluster) is measured using
variance, average variance, or TWCV (total within-cluster variation)
Similarity across clusters (inter-cluster) is measured based on
linkage.

For clustering we need to know at least the characteristics of the
instances and the similarity measure to be used in the process

Various algorithms exist for clustering, e.g., k-means and DBSCAN

Each algorithm has its advantages and disadvantages
Clustering


Example 1
Example 2
Overview of Social Network Analysis

A social network is a set of entities called actors and the links
connecting them.


Examples: students enrolled in the same courses, people and their likes, etc.
A social network is usually represented as a graph called a sociogram

Social Network Analysis (SNA) is powerful because it has
foundations in math/graph theory

SNA provides a set of tools to empirically extend our theoretical
intuition of the patterns that compose a social structure.

SNA provides a set of relational methods for systematically
understanding and identifying connections among actors.

SNA embodies a range of theories relating types of observable
social spaces and their relation to individual and group behavior.
Social Network Analysis
Centrality Measures

Degree
- Sum of connections (sum of the weights of connections in
  the case of weighted graphs) from or to an actor

Closeness
- Distance of one actor to all others in the network

Betweenness
- The number of shortest paths that pass through an actor

Eigenvector
- Measures the importance of an actor; an actor is central to the
  extent that its neighbors are central
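
Two of these measures can be sketched for an unweighted, connected, undirected graph. The normalization of closeness as (n - 1) divided by the sum of shortest-path lengths is one common convention and is an assumption here; the five-node graph is hypothetical.

```python
from collections import deque


def degree_centrality(adj):
    """Number of neighbors of each actor."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """(n - 1) / sum of shortest-path lengths, computed with BFS."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = (len(adj) - 1) / sum(dist.values())
    return result


# Hypothetical graph with edges a-b, b-c, b-d, d-e
adj = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b"},
    "d": {"b", "e"},
    "e": {"d"},
}
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
```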
Social Network Analysis
Centrality Measures (example)
Example 2
Example 1


The red nodes have the highest degree centrality
The blue node has the highest closeness and betweenness centrality
Image Reference: http://www.biomedcentral.com/

Node 7 has the highest degree centrality
Node 8 has the highest betweenness centrality
Nodes 4 and 5 have the highest closeness centrality
Image Reference: http://mande.co.uk/special-issues/network-models/
Social Network Analysis
Graph Clustering Algorithms

MST based clustering
- First finds a Minimum Spanning Tree (MST) of the graph
- Removes edges with the highest weight from the MST to
  form clusters of vertices (actors)

Edge Betweenness clustering
- The betweenness of an edge is defined as the extent to
  which the edge lies along shortest paths
- First computes edge betweenness for all edges in the current
  graph
- Removes edges having the highest betweenness from the
  graph
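
The MST based variant can be sketched with Kruskal's algorithm and a union-find structure. The sketch assumes edge weights behave like distances (larger means less related) and that the desired number of clusters is given; the toy graph is hypothetical.

```python
def mst_clusters(nodes, edges, n_clusters):
    """edges: list of (weight, u, v). Returns a list of clusters (sets of nodes)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Kruskal: scan edges by increasing weight, keep those joining two components
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))

    # Drop the (n_clusters - 1) heaviest MST edges, then rebuild the components
    kept = sorted(mst)[:len(mst) - (n_clusters - 1)]
    parent = {v: v for v in nodes}
    for _, u, v in kept:
        parent[find(u)] = find(v)
    clusters = {}
    for v in nodes:
        clusters.setdefault(find(v), set()).add(v)
    return list(clusters.values())


# Hypothetical weighted graph
nodes = ["a", "b", "c", "d", "e"]
edges = [(1, "a", "b"), (2, "b", "c"), (9, "c", "d"), (1, "d", "e"), (3, "a", "c")]
clusters = mst_clusters(nodes, edges, n_clusters=2)
```

If link weights express similarity instead, the weights would need to be inverted before building the tree.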
One Mode versus Two Mode Networks
Queries (users) versus Tables is a two mode network
Folding is used to produce one mode networks from a two mode network
Folding is simply the multiplication of the adjacency matrix of the
two mode network by its transpose

Adjacency matrix of the two mode network:
     X  Y  Z
A    1  0  0
B    1  0  1
C    1  1  0
D    1  0  1

Its transpose:
     A  B  C  D
X    1  1  1  1
Y    0  0  1  0
Z    0  1  0  1
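
Folding can be illustrated on the matrix from this slide (rows A to D, columns X to Z). The sketch below uses plain lists; the function name `fold` is only illustrative.

```python
def fold(matrix):
    """Multiply a two-mode (rows x columns) matrix by its transpose."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
            for row_i in matrix]


# Rows A-D, columns X-Z, as in the two mode matrix of the slide
two_mode = [
    [1, 0, 0],  # A
    [1, 0, 1],  # B
    [1, 1, 0],  # C
    [1, 0, 1],  # D
]
one_mode = fold(two_mode)
```

Entry (B, D) of the result is 2 because B and D are both linked to X and Z; diagonal entries count how many columns each row is linked to.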
Fuzzy Sets

Generalizes the classical set theory by a characteristic
membership function.

A membership function introduces a grey area between the
black and white areas

Consider fuzzy set A, its domain D, and object x.

Membership function µ specifies the degree of membership of
x in A:

µA(x): D → [0, 1].

µA(x)= 0 means x does not belong to A.

µA(x)= 1 means x completely belongs to A.

Intermediate values 0 < µA(x) < 1 represent varying degrees
of membership.
Example on Membership
The ranges of fuzzy sets (income in thousands of dollars)

Income        Range         Centroid
Quite poor    10-10-30
Poor          10-30-70      30
Moderate      30-70-120     70
Rich          70-120-120

[Figure: membership (0 to 1.0) plotted against income ($) at 0, 10K, 30K, 70K and 120K, with one curve each for quite poor, poor, moderate and rich]
The membership functions found according to the centroids
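
A possible reading of the three-value ranges is a triangular membership function (start, peak, end). That reading, and the treatment of the outer sets as shoulders that stay at 1.0, are assumptions; the sketch evaluates an income of 50K against the four ranges.

```python
def triangular(x, a, b, c):
    """Membership rising from a to a peak at b and falling to c.

    When a == b (or b == c) the set is treated as a shoulder that stays
    at 1.0 on that side; this handling is an assumption.
    """
    if x <= b:
        if a == b or x == b:
            return 1.0
        return max(0.0, (x - a) / (b - a))
    if b == c:
        return 1.0
    return max(0.0, (c - x) / (c - b))


# Ranges from the table, in thousands of dollars
income_sets = {
    "quite poor": (10, 10, 30),
    "poor": (10, 30, 70),
    "moderate": (30, 70, 120),
    "rich": (70, 120, 120),
}
memberships = {name: triangular(50, *abc) for name, abc in income_sets.items()}
```

Under these assumptions an income of 50K belongs equally to "poor" and "moderate".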
From FPM to Network Construction

Given a data set of M instances and N features per instance

Prepare the data for FPM by deciding on the baskets and items.
Keep in mind that items are the actors in the network

Apply the FPM algorithm of your choice to find Frequent sets of
items; it is possible to narrow down to closed or maximal FP

Construct the network by considering the frequent sets as
follows:
- Add a link between two actors i and j iff i and j exist together in
  at least one FP; the weight of the link is set to the number of
  common FPs
- It is possible to normalize the weights and/or remove some links
  based on a certain criterion, such as below-average weight or
  a weight below a predefined threshold.
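
The construction can be sketched as pair counting over the frequent patterns. The patterns below are hypothetical, and pruning at the average weight is just one of the criteria mentioned.

```python
from collections import Counter
from itertools import combinations


def network_from_itemsets(itemsets):
    """Weight of edge (i, j) = number of itemsets containing both i and j."""
    weights = Counter()
    for itemset in itemsets:
        for i, j in combinations(sorted(itemset), 2):
            weights[(i, j)] += 1
    return weights


def prune_below_average(weights):
    """Drop links whose weight is below the average link weight."""
    average = sum(weights.values()) / len(weights)
    return {edge: w for edge, w in weights.items() if w >= average}


# Hypothetical frequent patterns over items a-d
patterns = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"c", "d"}]
weights = network_from_itemsets(patterns)
strong = prune_below_average(weights)
```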
From FPM to Network Construction
From ARM to Network Construction

Given a data set of M instances and N features per instance

Prepare the data for ARM by deciding on the baskets and
items. Keep in mind that items are the actors in the network;
they will form the antecedents and consequents of the rules

Apply the ARM algorithm of your choice to find all AR’s that
satisfy certain criteria

Construct the network by considering the ARs as follows:
- Add a link between two actors i and j iff i and j exist together in
  at least one AR; the weight of the link is set to the number of
  common ARs. It is possible to concentrate on the antecedent, the
  consequent, or both.
- It is possible to normalize the weights and/or remove some links
  based on a certain criterion, such as below-average weight or
  a weight below a predefined threshold.
From ARM to Network Construction
From Clustering to Network Construction

Given a data set of M instances and N features per instance

Prepare the data for clustering by deciding on the features to
consider in computing the similarity measure

Apply either one clustering algorithm several times with different
input parameters, or a number of different clustering algorithms,
to obtain one clustering solution per run.

Construct the network by considering the clusters as follows:
- Add a link between two actors i and j iff i and j exist together in
  the same cluster in at least one clustering solution; the weight
  of the link is set to the number of common clusters across the
  solutions.
- It is possible to normalize the weights and/or remove some
  links based on a certain criterion, such as below-average weight or
  a weight below a predefined threshold.
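
A sketch of this counting step, assuming each clustering solution is given as a list of clusters. The three solutions are hypothetical, and dividing by the number of solutions is only one way to normalize.

```python
from collections import Counter
from itertools import combinations


def coassociation(solutions):
    """solutions: list of clusterings, each a list of clusters (sets of actors)."""
    weights = Counter()
    for clustering in solutions:
        for cluster in clustering:
            for i, j in combinations(sorted(cluster), 2):
                weights[(i, j)] += 1
    return weights


# Hypothetical clustering solutions over four actors
solutions = [
    [{"g1", "g2", "g3"}, {"g4"}],
    [{"g1", "g2"}, {"g3", "g4"}],
    [{"g1", "g2"}, {"g3"}, {"g4"}],
]
weights = coassociation(solutions)
normalized = {edge: w / len(solutions) for edge, w in weights.items()}
```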
Network Construction
Multiple clustering
solutions
From the Data to Network Construction

Given a data set of M instances and N features per instance

Prepare the data for processing by deciding on the P features to
consider in the analysis

Construct an M×P matrix A by considering every instance as a
row and every feature as a column

Find the transpose of matrix A

Multiply matrix A by its transpose to get the adjacency
matrix for the target network.

It is possible to normalize the weights and/or remove some
links based on a certain criterion, such as below-average weight or
a weight below a predefined threshold.
NetDriller : A Powerful Social Network Analysis Tool*
Negar Koochakzadeh, Atieh Sarraf, Keivan Kianmehr, Jon Rokne, Reda Alhajj
{nkoochak, sarrafsa}@ucalgary.ca, [email protected], {alhajj, rokne}@ucalgary.ca
Social Network Analysis (SNA) is a technique first used in sociology.
Recently computer scientists have realized that this model is general enough to be applied to any domain where the entities and their
interconnections can be separated into actors and their links, respectively.
Data Mining techniques can strengthen SNA
Raw Dataset: People and their attributes

age  work class        education  marital status      occupation       relationship   race   sex     hours/week  native country
39   State-gov         Bachelors  Never-married       Adm-clerical     Not-in-family  White  Male    40          US
50   Self-emp-not-inc  Bachelors  Married-civ-spouse  Exec-managerial  Husband        White  Male    13          Canada
52   Self-emp-not-inc  HS-grad    Married-civ-spouse  Exec-managerial  Husband        White  Male    45          US
30   State-gov         Bachelors  Married-civ-spouse  Prof-specialty   Husband        Black  Male    40          India
25   Self-emp-not-inc  HS-grad    Never-married       Farming-fishing  Own-child      White  Male    35          Iran
43   Self-emp-not-inc  Masters    Divorced            Exec-managerial  Unmarried      White  Female  45          US
…

1. Network Construction
2. Searching in the Network:
Example 1: Find individuals who could monitor the information flow in an organization better than most
others.
Example 2: Find individuals who have the best picture of what is happening in the network as a whole.
Closeness centrality reveals how long it takes information to spread from one individual to others in the
network. Individuals who score high in closeness have the shortest paths to all others in the network.
Betweenness centrality indicates the extent to which an individual is a broker of indirect connections among
all others in a network. Someone with high betweenness could be thought of as a gatekeeper of
information flow. People who occur on many shortest paths among other people have the highest
betweenness values.
Degree centrality indicates the extent to which an individual sends information to or receives it from the neighbors.
Eigenvector centrality calculates the principal eigenvector of the network. A node is central to the extent
that its neighbors are central.
Social Network: Based on community detection
Fuzzy Query Example: Find individuals with high centralities
Fuzzy Sets: Based on multi-objective GA optimization
* ICDM 2011 IEEE International Conference on Data Mining
Fuzzy Query Result: Color hue shows DofM
http://cpsc.ucalgary.ca/~nkoochak/NetDriller/
IMPROVING DATABASE PERFORMANCE BY
BUILDING AND ANALYZING NETWORK OF
TABLES FROM QUERY ACCESS PATTERNS
Problem Definition

Response time in a distributed or parallel database system is
largely determined by how data is organized and stored on
different machines/sites.

The goal is to place related data on nearby, or preferably the
same, sites to minimize the response time.

The study of data distribution requires solving two problems:
1. The partitioning problem
2. The allocation problem
Queries (users) versus Tables
Overview of the analysis process

Three main steps:
1. Considering tables as items and queries as transactions,
   extract frequent closed itemsets
   - A kind of fuzzy sets can be built from the closed itemsets
     in this step
2. Use the extracted itemsets from the previous step to build
   the network of tables
3. Use network analysis to extract information about the
   tables from the network of tables
Step1
Items and Transactions

Sample database
- EMPLOYEE (Ssn, Fname, Lname, Dno)
- DEPARTMENT (Dnumber, Dname)
- PROJECT (Pnumber, Pname, Plocation, Dno)

Sample query (Q1)
  SELECT Lname
  FROM EMPLOYEE, DEPARTMENT
  WHERE Dno = Dnumber
  AND Dname = 'Research'

Items
- EMPLOYEE, DEPARTMENT, PROJECT

Transactions
- Q1: EMPLOYEE, DEPARTMENT
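
Turning a query into a transaction can be sketched with a regular expression over the FROM clause. This hypothetical helper only handles simple comma-separated table lists like the one in Q1; JOIN syntax and subqueries would need a real SQL parser.

```python
import re


def tables_in_query(sql):
    """Return the set of table names listed in a simple FROM clause."""
    match = re.search(r"\bFROM\b(.*?)(?:\bWHERE\b|$)", sql,
                      re.IGNORECASE | re.DOTALL)
    if not match:
        return set()
    # Each comma-separated entry may carry an alias; keep only the first word
    return {part.split()[0].upper()
            for part in match.group(1).split(",") if part.strip()}


q1 = ("SELECT Lname FROM EMPLOYEE, DEPARTMENT "
      "WHERE Dno = Dnumber AND Dname = 'Research'")
transaction = tables_in_query(q1)
```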
Step 1
Example (Sample Database)

Sample database schema from Fundamentals of Database
Systems, Elmasri/Navathe
Step 1
Example (List of Queries)
List of Queries in Transaction Format
Q1    EMPLOYEE, DEPARTMENT
Q2    EMPLOYEE, DEPARTMENT
Q3    EMPLOYEE, DEPARTMENT
Q4    EMPLOYEE, DEPARTMENT, WORKS_ON
Q5    EMPLOYEE, WORKS_ON, PROJECT
Q6    EMPLOYEE, DEPARTMENT, WORKS_ON
Q7    EMPLOYEE, DEPENDENT
Q8    EMPLOYEE, WORKS_ON
Q9    EMPLOYEE, DEPENDENT
Q10   EMPLOYEE, DEPENDENT
Q11   EMPLOYEE, DEPARTMENT
Q12   EMPLOYEE, DEPARTMENT
Q13   WORKS_ON, PROJECT
Q14   WORKS_ON, PROJECT
Q15   EMPLOYEE, WORKS_ON, PROJECT
PROJECT
PROJECT
PROJECT
PROJECT
Step 1
Example (Closed Itemsets)


List of frequent closed itemsets with
min-support-threshold = 2

Itemset                                     Frequency
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT     2
EMPLOYEE, WORKS_ON, PROJECT                 5
EMPLOYEE, DEPARTMENT, PROJECT               3
EMPLOYEE, PROJECT                           6
WORKS_ON, PROJECT                           7
EMPLOYEE, DEPARTMENT                        7
EMPLOYEE, DEPENDENT                         3

Note: 1-itemsets are omitted from the results
Step1
Example (Fuzzy Sets)
Itemset                                     Frequency
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT     2
EMPLOYEE, WORKS_ON, PROJECT                 5
EMPLOYEE, DEPARTMENT, PROJECT               3
EMPLOYEE, PROJECT                           6
WORKS_ON, PROJECT                           7
EMPLOYEE, DEPARTMENT                        7
EMPLOYEE, DEPENDENT                         3

Fuzzy Sets
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115, DEPENDENT: 1.000}
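
The slides do not state how the membership degrees are computed. The listed values are, however, reproduced by dividing the frequency of a closed itemset by the total frequency of all closed itemsets that contain the table; the sketch below checks that reading and should be taken as an inference, not as the authors' stated formula.

```python
# Closed itemsets and frequencies from the table above
closed_itemsets = {
    ("EMPLOYEE", "DEPARTMENT", "WORKS_ON", "PROJECT"): 2,
    ("EMPLOYEE", "WORKS_ON", "PROJECT"): 5,
    ("EMPLOYEE", "DEPARTMENT", "PROJECT"): 3,
    ("EMPLOYEE", "PROJECT"): 6,
    ("WORKS_ON", "PROJECT"): 7,
    ("EMPLOYEE", "DEPARTMENT"): 7,
    ("EMPLOYEE", "DEPENDENT"): 3,
}

# Total frequency of all closed itemsets that mention each table
totals = {}
for itemset, freq in closed_itemsets.items():
    for table in itemset:
        totals[table] = totals.get(table, 0) + freq

# Membership of a table in an itemset's fuzzy set (inferred formula)
fuzzy_sets = {
    itemset: {table: round(freq / totals[table], 3) for table in itemset}
    for itemset, freq in closed_itemsets.items()
}
```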
Example (Fuzzy Sets)
Fuzzy Sets
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115, DEPENDENT: 1.000}
SUGGESTED ALLOCATION, NO REPLICATION CASE
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115}
Example (Fuzzy Sets)
SUGGESTED ALLOCATION, REPLICATION CASE; AT MOST
THREE REPLICAS ALLOWED
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261, DEPARTMENT: 0.250}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115, DEPENDENT: 1.000}
Step2
Building the Network

Each item (table) is a node in the network

An edge exists between two nodes if they appear together in
at least one frequent closed itemset

The weight of an edge between two nodes is related to the
number of frequent closed itemsets in which the corresponding
tables appear together

Weight is normalized
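
Applied to the closed itemsets of the running example, the construction can be sketched as follows. Normalizing by the largest weight is an assumption, since the slides do not say which normalization is used.

```python
from collections import Counter
from itertools import combinations

# Frequent closed itemsets from Step 1 of the example
closed_itemsets = [
    {"EMPLOYEE", "DEPARTMENT", "WORKS_ON", "PROJECT"},
    {"EMPLOYEE", "WORKS_ON", "PROJECT"},
    {"EMPLOYEE", "DEPARTMENT", "PROJECT"},
    {"EMPLOYEE", "PROJECT"},
    {"WORKS_ON", "PROJECT"},
    {"EMPLOYEE", "DEPARTMENT"},
    {"EMPLOYEE", "DEPENDENT"},
]

# Edge weight = number of closed itemsets in which both tables appear
weights = Counter()
for itemset in closed_itemsets:
    for a, b in combinations(sorted(itemset), 2):
        weights[(a, b)] += 1

# One possible normalization (an assumption): divide by the largest weight
max_weight = max(weights.values())
normalized = {edge: w / max_weight for edge, w in weights.items()}

# Unweighted degree of each table in the resulting network
degree = Counter()
for a, b in weights:
    degree[a] += 1
    degree[b] += 1
```

The resulting network has seven edges, with EMPLOYEE linked to every other table.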
Step 2
Example


Network of tables
Note: Table DEPT_LOCATIONS is not included in the graph since this table did not appear
in any of the queries
Step 3
Applying Network Analysis

Various network analysis techniques can be used to extract
relationships of tables from the social network

Centrality measures can be used to identify the tables that
are in relationship with many other tables and consequently
play a key role in linking data from different tables together

Graph clustering algorithms can be applied to find groups
of tables that are frequently accessed together in queries
Step 3
Example (Centrality Measures)
Tables        Degree (unweighted)   Closeness   Betweenness
EMPLOYEE      4                     0.40        6
DEPARTMENT    3                     0.27        4
WORKS_ON      3                     0.25        4
PROJECT       3                     0.36        4
DEPENDENT     1                     0.18        4
Step 3
Example (Clustering Results)

Edge betweenness clusters
- C1: EMPLOYEE, PROJECT, DEPARTMENT
- C2: WORKS_ON
- C3: DEPENDENT

MST clusters
- C1: DEPENDENT
- C2: EMPLOYEE, WORKS_ON, PROJECT
- C3: DEPARTMENT

Clustering results may seem meaningless since in this
example we have 5 highly correlated nodes in the graph
Experiment 1
Centrality Measures

This experiment has been done on a synthetic dataset of 14 tables
(T0 to T13) and 20 queries, min-support-threshold = 2

High degree nodes
- T10: 6
- T14: 4

High closeness nodes
- T10: 0.25
- T14: 0.20

High betweenness nodes
- T10: 86
- T14: 49
Experiment 1
Clustering Result

Edge betweenness clusters
- C1: T11, T12, T13, T14
- C2: T1, T0, T2
- C3: T4, T5, T10, T8, T3

MST clusters
- C1: T11
- C2: T4, T3
- C3: T5, T10, T12, T13, T8, T14, T1, T0, T2
Experiment 2
Centrality Measures

The experiment has been done on a synthetic dataset of 14
tables (T0 to T13) and 30 queries, min-support-threshold = 1

High degree nodes
- T7: 12
- T10: 11

High closeness nodes
- T10: 0.20
- T7: 0.19

High betweenness nodes
- T7: 43
- T10: 31
Experiment 2
Clustering Result

Edge betweenness clusters
- C1: T6
- C2: T8
- C3: T4, T5, T3, T2
- C4: T1, T0
- C5: T7, T10, T11, T12, T13, T14, T9

MST clusters
- C1: T6, T8
- C2: T11
- C3: T7, T9
- C4: T10, T12, T13, T14, T1, T0, T2
- C5: T4, T5, T3

To further demonstrate the effectiveness of the proposed
approach in practice, we conducted another experiment using a
synthetic query set of 1000 queries on 50 tables.

Finding real data is very hard because this type of data is very
sensitive and hence highly confidential.

We generated the data by restricting the number of
tables that could appear in the same query to at most 20.
- One query may require accessing at most 20 different
  tables, though in practice it is not more than four or five
  tables.

These are four example communities:

{T6, T8, T9, T22, T23, T24, T33}
{T6, T9, T21, T37, T42, T45}
{T5, T6, T11, T13, T14, T16, T19}
{T6, T7, T9, T10, T12, T13, T19}
From Frequent Patterns to
Network construction
Overview

Given a dataset, e.g., emails exchanged between a group of
people, like employees in the same company

Partition the dataset into groups based on a certain criterion to
be studied
- To study the employees, all emails are grouped such that emails
  of the same employee form one group

Decide on the items to be considered in the analysis
- E.g., each email could be a transaction and words/e-mail addresses
  within the header/text could be items

Mine FP within each group and globally

Find relevant features for each group based on the entropy
The Proposed Framework
[Figure: the proposed framework. Feature Extraction Model: mine frequent closed patterns, select suitable features based on entropy ranking, then calculate weights of features to create feature vectors. Other components shown: Network Creation Model, Statistical Analysis Model, and Front End Interface and Visualization Tool.]
Feature Extraction Model:
The Feature Vector

The feature vector related to entity ej with m features is
represented as Fj = ( w(f1), w(f2), …, w(fm) ),
where w(fk) is the weight of the k-th feature, fk in entity ej.
Feature Extraction Model: Weight of a Feature
The weight of each feature is calculated using the following
formula:

wDj(fk) = supDj(fk) / supD(fk)

where
wDj(fk) is the weight of feature fk for entity ej,
supDj(fk) is the frequency of feature fk across dataset Dj of entity ej,
and
supD(fk) is the frequency of fk across dataset D of all entities E.
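
The weight formula can be sketched as follows, assuming the global frequency supD(fk) is the sum of the per-entity frequencies (that is, D is the union of the datasets Dj). The entity names and counts are hypothetical.

```python
from collections import Counter


def feature_weights(entity_counts, features):
    """entity_counts: {entity: Counter(feature -> frequency in that entity's dataset)}."""
    global_counts = Counter()
    for counts in entity_counts.values():
        global_counts.update(counts)
    # One feature vector per entity: local frequency over global frequency
    return {
        entity: [counts[f] / global_counts[f] if global_counts[f] else 0.0
                 for f in features]
        for entity, counts in entity_counts.items()
    }


# Hypothetical frequencies of two features in the datasets of two entities
entity_counts = {
    "e1": Counter({"f1": 30, "f2": 10}),
    "e2": Counter({"f1": 10, "f2": 30}),
}
vectors = feature_weights(entity_counts, ["f1", "f2"])
```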
Experimental Results:
Enron E-mail dataset description

The dataset contains 500,000 e-mail messages across 150 Enron
employees.

For this analysis, only inboxes having more than 1000 e-mails were
considered.

From each user’s inbox we chose 1000 e-mails at random; these
make up the e-mail dataset for the
corresponding user.
Experimental Results:
Processing Enron E-mail dataset

Identify itemsets from the e-mail dataset:

The stem words appearing in the body and the subject line of the e-mails are considered as items.

E-mail addresses inside the e-mails are identified as items as well.

The items appearing in a single e-mail are considered as a single
transaction.

This way, for each user we build a transactional database of 1000 e-mail transactions, one for each of the 1000 e-mails in the inbox.

From these transactional databases we identify the globally frequent
closed itemsets (corresponding to a support of 10%).

Based on entropy ranking, we chose the top 100 closed itemsets as our
feature set.
Experimental Results: Euclidean Distance Matrix for Enron Users
Distance cutoff point 0.30

Columns follow the same order as the rows: buy, dean, ermis, jones, kaminski, keavey, lokey, may, sager, saibi, salisbury, shackleton, thomas, whalley, ybarbo

buy         0.00  0.65  0.57  0.26  0.43  0.41  0.43  0.35  0.32  0.36  0.25  0.22  0.65  0.60  0.59
dean        0.65  0.00  0.13  0.50  0.28  0.50  0.27  0.68  0.40  0.44  0.73  0.64  0.08  0.10  0.13
ermis       0.57  0.13  0.00  0.44  0.22  0.44  0.21  0.61  0.33  0.38  0.65  0.56  0.15  0.14  0.16
jones       0.26  0.50  0.44  0.00  0.27  0.35  0.29  0.38  0.19  0.26  0.36  0.21  0.50  0.47  0.44
kaminski    0.43  0.28  0.22  0.27  0.00  0.31  0.16  0.47  0.17  0.28  0.51  0.39  0.28  0.25  0.25
keavey      0.41  0.50  0.44  0.35  0.31  0.00  0.38  0.25  0.30  0.41  0.45  0.38  0.51  0.47  0.50
lokey       0.43  0.27  0.21  0.29  0.16  0.38  0.00  0.50  0.22  0.25  0.52  0.41  0.27  0.25  0.24
may         0.35  0.68  0.61  0.38  0.47  0.25  0.50  0.00  0.40  0.45  0.35  0.33  0.69  0.65  0.67
sager       0.32  0.40  0.33  0.19  0.17  0.30  0.22  0.40  0.00  0.25  0.44  0.28  0.40  0.36  0.36
saibi       0.36  0.44  0.38  0.26  0.28  0.41  0.25  0.45  0.25  0.00  0.45  0.34  0.43  0.41  0.41
salisbury   0.25  0.73  0.65  0.36  0.51  0.45  0.52  0.35  0.44  0.45  0.00  0.30  0.75  0.70  0.70
shackleton  0.22  0.64  0.56  0.21  0.39  0.38  0.41  0.33  0.28  0.34  0.30  0.00  0.63  0.60  0.59
thomas      0.65  0.08  0.15  0.50  0.28  0.51  0.27  0.69  0.40  0.43  0.75  0.63  0.00  0.09  0.13
whalley     0.60  0.10  0.14  0.47  0.25  0.47  0.25  0.65  0.36  0.41  0.70  0.60  0.09  0.00  0.11
ybarbo      0.59  0.13  0.16  0.44  0.25  0.50  0.24  0.67  0.36  0.41  0.70  0.59  0.13  0.11  0.00
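
Turning such a distance matrix into network links can be sketched as thresholding at the cutoff. Treating the cutoff as inclusive (distance at most 0.30 creates a link) is an assumption; the distances below are a handful of entries copied from the matrix.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def links_within_cutoff(distances, cutoff):
    """distances: {(a, b): distance}. Keep pairs whose distance is at most `cutoff`."""
    return sorted(pair for pair, d in distances.items() if d <= cutoff)


# A few entries copied from the distance matrix
distances = {
    ("buy", "dean"): 0.65,
    ("buy", "shackleton"): 0.22,
    ("buy", "thomas"): 0.65,
    ("dean", "shackleton"): 0.64,
    ("dean", "thomas"): 0.08,
    ("shackleton", "thomas"): 0.63,
}
links = links_within_cutoff(distances, cutoff=0.30)
```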
Experimental Results: The Enron E-mail users’ social network
based on e-mail usage
Experimental Results: The Enron E-mail users’ social network
based on e-mail usage
Five clusters of Enron e-mail:
1 saibi
2 buy, salisbury, shackleton, jones
3 dean, ermis, jones, kaminski, lokey, sager, thomas, whalley, ybarbo
4 keavey
5 may
From Association rules to
Network
Basic Steps

Given a website

The mining process can be applied on three dimensions:
content, structure and log

Actors in the network are the pages.

Construct the adjacency matrix by mining association rules from the
transactional database obtained after preprocessing the web log
data:
- Each transaction is a set of pages accessed together in one
  session.
- An FPM algorithm, e.g., Apriori or FP-growth, is applied on the derived
  transactional data and association rules are derived.
Basic Steps

Determine frequent Itemsets

Find association rules

Add items in the rule as nodes in the graph and connect items
on the left side to items on the right side (directed edges)

Use support and confidence to find a combined weight for each
added edge

If an edge already exists then add the new weight to the existing
weight of the edge

Analyze the graph using SNA techniques
From Association Rules to Social Network
From Association Rules to Social Network

Analyze weblog

Determine frequent sets of
pages based on frequency
of pages accessed
together

Determine rules and keep
only those satisfying
minimum confidence

Construct network of
pages based on rules
From Association Rules to Network

Each rule is reflected in the adjacency matrix by incrementing
every entry (i, j) such that pages i and j exist in the
antecedent and consequent of the rule, respectively.

Entries in the adjacency matrix are normalized by dividing
each value by the overall average of the values that exist in
the matrix.

The network is analyzed to rank the pages by considering their
in-degrees, out-degrees, betweenness, and eigenvector
centrality.

Pages with high betweenness centrality are considered
important for linking pages from different communities.
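
The matrix construction can be sketched as below. Taking the average over the non-zero entries only is an assumption about what "values that exist in the matrix" means; the pages and rules are hypothetical.

```python
def adjacency_from_rules(rules, pages):
    """rules: list of (antecedent, consequent) pairs, each a set of pages."""
    index = {p: k for k, p in enumerate(pages)}
    matrix = [[0.0] * len(pages) for _ in pages]
    for antecedent, consequent in rules:
        for i in antecedent:
            for j in consequent:
                matrix[index[i]][index[j]] += 1
    # Assumption: the average is taken over the non-zero entries only
    nonzero = [v for row in matrix for v in row if v]
    average = sum(nonzero) / len(nonzero) if nonzero else 1.0
    return [[v / average for v in row] for row in matrix]


# Hypothetical pages and rules mined from a web log
pages = ["home", "news", "contact"]
rules = [
    ({"home"}, {"news"}),
    ({"home"}, {"news", "contact"}),
    ({"news"}, {"home"}),
]
matrix = adjacency_from_rules(rules, pages)
# Row sums give weighted out-degrees, column sums weighted in-degrees
out_degree = {p: sum(matrix[k]) for k, p in enumerate(pages)}
```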
From Association Rules to Social Network

Analysis was done using the software Visone (http://visone.info/)

Betweenness Centrality measure
From Association Rules to Social Network

Closeness Centrality measure
From Association Rules to Social Network

Eigenvector Centrality measure
From Multi-objective GA based clustering
to Network Construction
The case of Genes/Proteins
Motivation

In most traditional clustering algorithms, the number of
clusters is given a priori.

In fact, the clustering criterion depends on more than
one objective!

Cluster validation is used to assess the number of clusters.

Multi-objective clustering must work on small and large
data sets.
Objective Functions For Clustering

Three objectives:

F1 : minimize the number of clusters

F2 : maximize the heterogeneity between clusters

F3 : maximize the within cluster homogeneity
Objective functions
Divide and Conquer
Basic Steps:

If the dataset to be clustered is of manageable size then it is clustered
as a whole set.

Otherwise, repeat the following steps:
- Partition the dataset (or set of centroids after the first iteration)
  into subsets of manageable size
- Cluster each subset individually by applying multi-objective GA
  combined with validity analysis to get the centroids of the
  obtained clusters
- If the set of all centroids is of manageable size then cluster the
  whole set of centroids and exit the loop

Backtrack to merge clusters that have their centroids ending up in
the same final cluster
Unique Solution of Compact Clusters
From Alternative Solutions to Adjacency Matrix

[Figure: Genes × Genes adjacency matrix derived from the alternative clustering solutions]
Entry (i, j) specifies the number of solutions where Gene i and Gene j
occurred in the same cluster
From Adjacency Matrix to Network
Criminal and Terror Network Analysis
Terror Network Analysis by Clustering

We developed a framework that employs clustering, frequent
pattern mining and some social network analysis measures to
determine the effectiveness of a network.

The clustering and frequent pattern mining techniques start
with the adjacency matrix of the network.

For clustering, we utilize entries in the table by considering
each row as an object and each column as a feature.

The features of a network member are his/her direct neighbors.

We maintain the weights of links in the case of weighted
networks.
Multi-Objective GA based Clustering

We applied multi-objective GA-based clustering
Terror Network Analysis by Clustering & FPM

For clustering, we consider each row as an instance and each column
as a feature

We cluster instances to find important groups and individuals within
the network

For frequent pattern mining, we consider each row of the adjacency
matrix as a transaction and each column as an item.

We map entries into a 0/1 scale such that every entry whose value is
greater than zero is assigned the value one; entries keep the value
zero otherwise.

This way we can apply frequent pattern mining algorithms to
determine the most influential members in a network as well as the
effect of removing some members or even links between members of
a network.
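
The mapping from adjacency matrix to transactions can be sketched as follows. The matrix, member names, and support threshold are toy values, and the brute-force enumeration stands in for a real frequent pattern mining algorithm such as Apriori or FP-growth; it is only meant to show the data preparation.

```python
from itertools import combinations

def binarize(adjacency):
    """Map every entry greater than zero to 1; other entries stay 0."""
    return [[1 if w > 0 else 0 for w in row] for row in adjacency]

def transactions(binary, names):
    """One transaction per row: the set of columns holding a 1."""
    return [{names[j] for j, v in enumerate(row) if v} for row in binary]

def frequent_itemsets(txns, min_support, max_size=2):
    """Brute-force support counting (illustrative, not Apriori)."""
    items = sorted(set().union(*txns))
    result = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in txns if set(combo) <= t)
            if count >= min_support:
                result[combo] = count
    return result

names = ["a", "b", "c", "d"]  # hypothetical network members
adjacency = [
    [0, 3, 1, 2],
    [3, 0, 2, 5],
    [1, 2, 0, 0],
    [2, 5, 0, 0],
]
txns = transactions(binarize(adjacency), names)
patterns = frequent_itemsets(txns, min_support=2)
print(patterns)
```

Members that appear in many frequent itemsets are neighbors of many other members, which is one way to read "influential" here. Re-running the mining after deleting a row and column, or zeroing an entry, shows the effect of removing a member or a link.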
Terror Network Analysis

We investigate the effect of adding some links between
members.

We are able to study how the various members in the network
change role as the network evolves.

This is measured by applying some SNA measures to the
network at each stage of its development.

We report some interesting results on various benchmark
networks, including the 9/11 and Madrid bombing networks.
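
To illustrate tracking roles as links are added, the sketch below recomputes one SNA measure after each stage. Degree centrality is an assumed choice (the slides only say "some SNA measures"), and the four-member network and its stages are toy data, not the benchmark networks.

```python
def degree_centrality(n, edges):
    """Degree of each node divided by the maximum possible degree n - 1."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return [d / (n - 1) for d in deg]

def centrality_over_time(n, stages):
    """Return the centrality vector after each batch of added links."""
    edges, history = set(), []
    for added in stages:
        # Store undirected links in a canonical order to avoid duplicates.
        edges |= {tuple(sorted(e)) for e in added}
        history.append(degree_centrality(n, sorted(edges)))
    return history

stages = [[(0, 1), (1, 2)], [(2, 3), (0, 3)], [(1, 3)]]
history = centrality_over_time(4, stages)
for step, scores in enumerate(history, start=1):
    print(step, [round(s, 2) for s in scores])
```

In this toy run, member 3 starts isolated and ends up tied with member 1 as the most connected member, which is the kind of role change the analysis looks for.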
Database Search
Problem Definition

You tell the computer what
you want in terms that mean
something to you, using
fuzzy sets

You ask the computer your
question using the fuzzy
term
- The computer tells you how accurate your results are
[Figure: degree of membership]
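
A small sketch of the idea: a fuzzy term is defined by a membership function, and each returned row carries its degree of membership as an indication of how well it matches. The triangular shape, the term "middle-aged", its breakpoints, and the sample rows are all assumptions made for illustration.

```python
def triangular(a, b, c):
    """Membership function that peaks at b and is zero outside (a, c)."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

def fuzzy_select(rows, column, mu, min_degree=0.0):
    """Return (degree, row) pairs with degree above min_degree, best first."""
    scored = [(mu(r[column]), r) for r in rows]
    kept = [(d, r) for d, r in scored if d > min_degree]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

people = [
    {"name": "Al", "age": 25},
    {"name": "Bo", "age": 40},
    {"name": "Cy", "age": 45},
    {"name": "Di", "age": 57},
]
middle_aged = triangular(30, 45, 60)  # hypothetical fuzzy term
results = fuzzy_select(people, "age", middle_aged)
for degree, row in results:
    print(row["name"], round(degree, 2))
```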
Related Work: Database Search

Fuzzy Data Representation

Disadvantages:
- Existing databases need to be re-structured
- Prevents traditional users from executing standard
(non-fuzzy) queries

Extending a Query Language to support fuzzy
querying without changing the database itself

Disadvantages:
- Commercially available DBMSs need to support a
new query language
- Requires users to learn the new query language
Motivation

Proposing an independent intermediate translation layer to
incorporate fuzziness in:
- The interface/querying facility of database systems to
retrieve more accurate facts
- Groups within a social network may share the same
intermediate layer
- Recommendation system based on SNA to help users in
building their intermediate layer

The intermediate layer provides the mapping between
fuzziness expected by the user and the actual crisp values
stored in the data repository
Methodology

Fuzziness can be specified:

- Manually: by a human expert
- Semi-automatically:
  - A human expert decides on the number of fuzzy sets
  - The intermediate layer defines the fuzzy sets
- Fully automatically: by the intermediate layer

The intermediate layer uses the fuzzy set
specifications to map between fuzziness expected
by the user and the actual crisp values stored in
the data repository
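
One way such a mapping could work is sketched below for the semi-automatic case: the user fixes the number of fuzzy sets (through the labels), and the layer places the sets and translates a fuzzy term into a crisp range the DBMS already understands. Evenly spaced triangular sets and the alpha-cut translation are assumptions of this sketch; the talk derives the sets from clustering and GA optimization instead. The table and column names are hypothetical.

```python
def even_fuzzy_sets(values, labels):
    """Place one triangular set (a, b, c) per label, evenly over the range.

    Assumed stand-in for the clustering-based definition; needs >= 2 labels.
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / (len(labels) - 1)
    return {lab: (lo + (k - 1) * step, lo + k * step, lo + (k + 1) * step)
            for k, lab in enumerate(labels)}

def alpha_cut(tri, alpha):
    """Crisp interval where the triangular membership is at least alpha."""
    a, b, c = tri
    return (a + alpha * (b - a), c - alpha * (c - b))

def to_crisp_sql(table, column, tri, alpha):
    """Translate a fuzzy term into a standard range query.

    `table` and `column` are trusted constants here, not user input.
    """
    lo, hi = alpha_cut(tri, alpha)
    return f"SELECT * FROM {table} WHERE {column} BETWEEN {lo:g} AND {hi:g}"

salaries = [20, 50, 80, 100]  # toy crisp values from the repository
sets = even_fuzzy_sets(salaries, ["low", "medium", "high"])
query = to_crisp_sql("employees", "salary", sets["medium"], alpha=0.5)
print(sets["medium"])
print(query)
```

Because the translation produces an ordinary range query, the database and its query language stay unchanged, which is the point of keeping the layer independent.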
Intelligent Database Search
AskFuzzy: Attractive Visual Fuzzy Query Builder*
[Figure: Fuzzy Query, Fuzzy Layer, DBMS]

1 Data Fuzzification
Transferring numeric values to fuzzy sets:
- Manual: number of fuzzy sets by user; fuzzy set functions by user
- Semi-automated: number of fuzzy sets by user; fuzzy set functions by system
- Full-automated: number of fuzzy sets by system; fuzzy set functions by system
Number of fuzzy sets by system: optimization process
(min number of clusters, max cluster quality)
Fuzzy set functions by system: initial fuzzy sets based on
clustering result; optimized fuzzy sets based on genetic
algorithm optimization

2 Fuzzy Query Construction

3 Fuzzy Query Execution

http://cpsc.ucalgary.ca/~nkoochak/AskFuzzy/
* ICDE 2011 IEEE International Conference on Knowledge Engineering
Conclusions

Data mining and machine learning techniques could be
integrated with network-based analysis.

The combination would lead to
- A strong framework for data analysis from various
perspectives.
- Global correlations within the data are considered and
hence lead to more realistic results

A variety of application domains could benefit from the
integrated setup
The End!
Thank you for your attention
Reda Alhajj
[email protected]