The Power of Data Mining and Machine Learning Techniques for Network Construction and Analysis
Reda Alhajj
University of Calgary, Calgary, Alberta, Canada
Global University, Beirut, Lebanon
[email protected]

General Overview
The network model provides a powerful platform for studying a group of entities and their relationships.
The semantics of the links in the network are determined by the application domain under investigation.
A network can be constructed by considering the pairwise correlation between entities, or by deriving the correlation between two entities from a global view of the data.
Data mining and machine learning techniques support the second approach: they take a global view of the data to derive the strength of pairwise links.
Combining data mining, machine learning and network analysis leads to a comprehensive and robust framework for data analysis.

Reda Alhajj, University of Calgary. BYU, Provo, USA, March 2013

Outline of the talk
Background on ARM, clustering, the network model and fuzziness
From FPM, ARM and clustering to network
Some application domains:
  database design
  web mining
  terror network analysis
  outlier detection
  disease biomarkers
  database search
Conclusions and research directions

Overview of Association Rules Mining
A general model for mining domains where there is a many-to-many relationship between two sets of entities, e.g., baskets and items, or documents and words.
Consider a set of items I = {I1, I2, I3, …, Im}.
Consider a database of transactions D, where each transaction T is a set of items such that T ⊆ I.
If A is a set of items, a transaction T is said to contain A if and only if A ⊆ T.
An association rule is an implication or correlation of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅.
Support and confidence are the measures generally used to filter the rules.

Association Rules Mining: Two Steps
In general, association rules mining can be reduced to the following two steps:
1. Find all frequent itemsets. Each of these itemsets occurs at least as frequently as a minimum support count.
2. Generate strong association rules from the frequent itemsets. These rules satisfy the minimum support and confidence thresholds.
We use the outcome of the first step in one part of the research and the outcome of the second step in another part.

Association Rules Mining: Apriori Algorithm
Any subset of a frequent itemset must be frequent.
Apriori pruning principle: if an itemset is infrequent, its supersets should not be generated or tested.
[Figure: Apriori example with minimum support = 2]

Association Rule Mining: Frequent Closed Itemset
A frequent itemset X is closed if none of its immediate supersets has the same support as X.
[Figure: example]
Image reference: http://www.siam.org/meetings/sdm06/proceedings/038lucchesec.pdf

Clustering
Clustering is an unsupervised learning process.
It distributes a given set of data instances into groups such that the similarity of instances is high within each group and low between groups.
Similarity within a cluster (intra-cluster) is measured using variance, average variance, or total within-cluster variation (TWCV).
Similarity across clusters (inter-cluster) is measured based on linkage.
Clustering requires at least the characteristics of the instances and the similarity measure to be used.
Various clustering algorithms exist, e.g., k-means and DBSCAN; each has its advantages and disadvantages.

Clustering
[Figures: Example 1 and Example 2]

Overview of Social Network Analysis
A social network is a set of entities, called actors, and the links connecting them, e.g., students enrolled in the same courses, or people and their likes.
A social network is usually represented as a graph called a sociogram.
Social Network Analysis (SNA) is powerful because it has foundations in mathematics and graph theory.
SNA provides a set of tools to empirically extend our theoretical intuition of the patterns that compose a social structure.
SNA provides a set of relational methods for systematically understanding and identifying connections among actors.
SNA embodies a range of theories relating types of observable social spaces to individual and group behavior.

Social Network Analysis: Centrality Measures
Degree: the sum of connections from or to an actor (the sum of the connection weights in a weighted graph).
Closeness: the distance from one actor to all others in the network.
Betweenness: the number of shortest paths that pass through an actor.
Eigenvector: measures the importance of an actor.

Social Network Analysis: Centrality Measures (example)
[Figures: Example 1 and Example 2]
The red nodes have the highest degree centrality. The blue node has the highest closeness and betweenness centrality.
Image reference: http://www.biomedcentral.com/
Node 7 has the highest degree centrality. Node 8 has the highest betweenness centrality. Nodes 4 and 5 have the highest closeness centrality.
Image reference: http://mande.co.uk/special-issues/network-models/

Social Network Analysis: Graph Clustering Algorithms
MST-based clustering
  First finds a Minimum Spanning Tree (MST) of the graph.
  Then removes the edges with the highest weight from the MST to form clusters of vertices (actors).
Edge betweenness clustering
  The betweenness of an edge is the extent to which the edge lies along shortest paths.
  First computes the edge betweenness of all edges in the current graph.
  Then removes the edges with the highest betweenness from the graph.

One Mode versus Two Mode Networks
Queries (users) versus tables is a two-mode network.
Folding is used to produce one-mode networks from a two-mode network.
Folding is simply the multiplication of the adjacency matrix of the two-mode network by its transpose.

     X  Y  Z
A    1  0  0
B    1  0  1
C    1  1  0
D    1  0  1

     A  B  C  D
X    1  1  1  1
Y    0  0  1  0
Z    0  1  0  1

Fuzzy Sets
Fuzzy sets generalize classical set theory by means of a characteristic membership function.
A membership function introduces a grey area between the black and white areas.
Consider a fuzzy set A, its domain D, and an object x. The membership function µA specifies the degree of membership of x in A:
  µA(x): D → [0, 1]
  µA(x) = 0 means x does not belong to A.
  µA(x) = 1 means x completely belongs to A.
  Intermediate values 0 < µA(x) < 1 represent varying degrees of membership.

Example on Membership
The ranges of the fuzzy sets:

Income        Range
Quite poor    10-10-30
Poor          10-30-70
Moderate      30-70-120
Rich          70-120-120

Centroids listed on the slide: 30, 70

[Figure: membership (0 to 1.0) versus income ($) with ticks at 10K, 30K, 70K and 120K, for the sets quite poor, poor, moderate and rich. The membership functions are found according to the centroids.]

From FPM to Network Construction
Given a data set of M instances and N features per instance:
Prepare the data for FPM by deciding on the baskets and items. Keep in mind that the items are the actors in the network.
Apply the FPM algorithm of your choice to find the frequent sets of items; it is possible to narrow these down to closed or maximal frequent patterns (FPs).
Construct the network from the frequent sets as follows:
  Add a link between two actors i and j iff i and j occur together in at least one FP; the weight of the link is the number of FPs they share.
  It is possible to normalize the weights and/or remove some links based on a criterion such as below-average weight or a predefined weight threshold.

From FPM to Network Construction
[Figure]

From ARM to Network Construction
Given a data set of M instances and N features per instance:
Prepare the data for ARM by deciding on the baskets and items.
Keep in mind that the items are the actors in the network; they form the antecedents and consequents of the rules.
Apply the ARM algorithm of your choice to find all association rules (ARs) that satisfy certain criteria.
Construct the network from the ARs as follows:
  Add a link between two actors i and j iff i and j occur together in at least one AR; the weight of the link is the number of ARs they share.
  It is possible to concentrate on the antecedent, the consequent, or both.
  It is possible to normalize the weights and/or remove some links based on a criterion such as below-average weight or a predefined weight threshold.

From ARM to Network Construction
[Figure]

From Clustering to Network Construction
Given a data set of M instances and N features per instance:
Prepare the data for clustering by deciding on the features to consider in computing the similarity measure.
Either apply one clustering algorithm several times with different input parameters, or apply a number of clustering algorithms, to obtain one clustering solution per run.
Construct the network from the clusters as follows:
  Add a link between two actors i and j iff i and j fall in the same cluster in at least one clustering solution; the weight of the link is the number of solutions in which they share a cluster.
  It is possible to normalize the weights and/or remove some links based on a criterion such as below-average weight or a predefined weight threshold.
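
The clustering-to-network construction above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the three clustering solutions, the actor names and the function names are hypothetical, and the below-average pruning is just one of the filtering options the slide mentions.

```python
from collections import Counter
from itertools import combinations

def coassociation_network(solutions):
    """Weight of link (i, j) = number of solutions that put i and j in one cluster."""
    weights = Counter()
    for labels in solutions:                      # labels: actor -> cluster id
        clusters = {}
        for actor, cluster_id in labels.items():
            clusters.setdefault(cluster_id, []).append(actor)
        for members in clusters.values():
            for i, j in combinations(sorted(members), 2):
                weights[(i, j)] += 1
    return dict(weights)

def prune_below_average(weights):
    """Drop links whose weight is below the average link weight."""
    avg = sum(weights.values()) / len(weights)
    return {edge: w for edge, w in weights.items() if w >= avg}

# Hypothetical output of three clustering runs over actors a..d.
solutions = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 1, "b": 1, "c": 2, "d": 2},
]
network = coassociation_network(solutions)
strong_links = prune_below_average(network)
```

Pairs that never share a cluster, such as a and d here, simply get no link.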

Network Construction: Multiple clustering solutions
[Figure]

From the Data to Network Construction
Given a data set of M instances and N features per instance:
Prepare the data by deciding on the P features to consider in the analysis.
Construct an M×P matrix A by taking every instance as a row and every feature as a column.
Find the transpose of matrix A.
Multiply matrix A by its transpose to get the adjacency matrix of the target network.
It is possible to normalize the weights and/or remove some links based on a criterion such as below-average weight or a predefined weight threshold.

NetDriller: A Powerful Social Network Analysis Tool*
Negar Koochakzadeh, Atieh Sarraf, Keivan Kianmehr, Jon Rokne, Reda Alhajj
{nkoochak, sarrafsa}@ucalgary.ca, [email protected], {alhajj, rokne}@ucalgary.ca
Social Network Analysis (SNA) is a technique first used in sociology. Recently, computer scientists have realized that this model is general enough to be applied to any domain where the entities and their interconnections can be separated into actors and their links, respectively.
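
The "multiply A by its transpose" step from the data-to-network slide above is the same folding operation shown earlier for two-mode networks. As a small worked sketch, the code below folds the 4×3 matrix from the One Mode versus Two Mode Networks slide (rows A to D, columns X to Z). The helper names are illustrative only.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

rows = ["A", "B", "C", "D"]
A = [
    [1, 0, 0],  # A
    [1, 0, 1],  # B
    [1, 1, 0],  # C
    [1, 0, 1],  # D
]

# Entry (i, j) counts the columns (X, Y, Z) that rows i and j share.
adjacency = matmul(A, transpose(A))
```

The diagonal entries count each row's own features, so they are usually ignored when the result is read as a network.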
Data mining techniques can strengthen SNA.

1. Network Construction
Raw dataset: people and their attributes

age  work class        education  marital status      occupation       relationship   race   sex     hours/week  native country
39   State-gov         Bachelors  Never-married       Adm-clerical     Not-in-family  White  Male    40          US
50   Self-emp-not-inc  Bachelors  Married-civ-spouse  Exec-managerial  Husband        White  Male    13          Canada
52   Self-emp-not-inc  HS-grad    Married-civ-spouse  Exec-managerial  Husband        White  Male    45          US
30   State-gov         Bachelors  Married-civ-spouse  Prof-specialty   Husband        Black  Male    40          India
25   Self-emp-not-inc  HS-grad    Never-married       Farming-fishing  Own-child      White  Male    35          Iran
43   Self-emp-not-inc  Masters    Divorced            Exec-managerial  Unmarried      White  Female  45          US
…

2. Searching in the Network
Example 1: Find individuals who could monitor the information flow in an organization better than most others.
Example 2: Find individuals who have the best picture of what is happening in the network as a whole.
Closeness centrality reveals how long it takes information to spread from one individual to others in the network. Individuals who score high in closeness have the shortest paths to all others in the network.
Betweenness centrality indicates the extent to which an individual is a broker of indirect connections among all others in a network. Someone with high betweenness can be thought of as a gatekeeper of information flow. People who lie on many shortest paths between other people have the highest betweenness.
Degree centrality indicates the extent to which an individual sends information to, or receives it from, its neighbors.
Eigenvector centrality is calculated from the principal eigenvector of the network. A node is central to the extent that its neighbors are central.
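
To make the centrality definitions above concrete, here is a minimal pure-Python sketch for an unweighted, undirected graph. It is not NetDriller's code. The five-node graph is hypothetical, closeness is normalized as (reachable nodes) / (sum of distances), which is one common convention, and betweenness uses Brandes' accumulation scheme with each unordered pair counted once.

```python
from collections import deque

def degree(adj):
    """Number of neighbors of each node."""
    return {v: len(neigh) for v, neigh in adj.items()}

def closeness(adj):
    """(number of reachable nodes) / (sum of shortest-path distances) per node."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total else 0.0
    return result

def betweenness(adj):
    """Shortest paths through each node (Brandes' algorithm, unnormalized)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:                      # BFS that also counts shortest paths
            u = queue.popleft()
            stack.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        while stack:                      # accumulate dependencies back to s
            w = stack.pop()
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}   # undirected: halve the totals

# Hypothetical network: a - b - c, with d and e both attached to c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "e"], "d": ["c"], "e": ["c"]}
deg, clo, btw = degree(adj), closeness(adj), betweenness(adj)
```

In this toy graph node c is the gatekeeper in the sense described above: every path between the {a, b} side and d or e runs through it.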
Social network: based on community detection
Fuzzy query example: find individuals with high centralities
Fuzzy sets: based on multi-objective GA optimization
Fuzzy query result: color hue shows DofM
* ICDM 2011, IEEE International Conference on Data Mining
http://cpsc.ucalgary.ca/~nkoochak/NetDriller/

IMPROVING DATABASE PERFORMANCE BY BUILDING AND ANALYZING NETWORK OF TABLES FROM QUERY ACCESS PATTERNS

Problem Definition
Response time in a distributed or parallel database system is largely determined by how data is organized and stored on different machines/sites.
The goal is to place related data on nearby, or preferably the same, sites to minimize the response time.
The study of data distribution requires solving two problems:
1. The partitioning problem
2. The allocation problem

Queries (users) versus Tables
[Figure]

Overview of the analysis process
Three main steps:
1. Considering tables as items and queries as transactions, extract frequent closed itemsets. A kind of fuzzy set can be built from the closed itemsets in this step.
2. Use the extracted itemsets from the previous step to build the network of tables.
3. Use network analysis to extract information about the tables from the network of tables.

Step 1: Items and Transactions
Sample database:
  EMPLOYEE (Ssn, Fname, Lname, Dno)
  DEPARTMENT (Dnumber, Dname)
  PROJECT (Pnumber, Pname, Plocation, Dno)
Sample query (Q1):
  SELECT Lname FROM EMPLOYEE, DEPARTMENT WHERE Dno = Dnumber AND Dname = 'Research'
Items: EMPLOYEE, DEPARTMENT, PROJECT
Transactions: Q1: EMPLOYEE, DEPARTMENT

Step 1 Example (Sample Database)
[Figure: sample database schema from Fundamentals of Database Systems, Elmasri/Navathe]

Step 1 Example (List of Queries)
List of queries in transaction format:

Q1   EMPLOYEE, DEPARTMENT
Q2   EMPLOYEE, DEPARTMENT
Q3   EMPLOYEE, DEPARTMENT
Q4   EMPLOYEE, DEPARTMENT, WORKS_ON
Q5   EMPLOYEE, WORKS_ON, PROJECT
Q6   EMPLOYEE, DEPARTMENT, WORKS_ON
Q7   EMPLOYEE, DEPENDENT
Q8   EMPLOYEE, WORKS_ON
Q9   EMPLOYEE, DEPENDENT
Q10  EMPLOYEE, DEPENDENT
Q11  EMPLOYEE, DEPARTMENT
Q12  EMPLOYEE, DEPARTMENT
Q13  WORKS_ON, PROJECT
Q14  WORKS_ON, PROJECT
Q15  EMPLOYEE, WORKS_ON, PROJECT

Four further PROJECT entries belong to queries in this list; the rows they belong to are not identified here.

Step 1 Example (Closed Itemsets)
List of frequent closed itemsets with min-support-threshold = 2:

Itemset                                   Frequency
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT   2
EMPLOYEE, WORKS_ON, PROJECT               5
EMPLOYEE, DEPARTMENT, PROJECT             3
EMPLOYEE, PROJECT                         6
WORKS_ON, PROJECT                         7
EMPLOYEE, DEPARTMENT                      7
EMPLOYEE, DEPENDENT                       3

Note: 1-itemsets are omitted from the results.

Step 1 Example (Fuzzy Sets)

Itemset                                   Frequency  Fuzzy set
WORKS_ON, PROJECT                         7          {WORKS_ON: 0.500, PROJECT: 0.304}
EMPLOYEE, WORKS_ON, PROJECT               5          {EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
EMPLOYEE, DEPARTMENT, PROJECT             3          {EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
EMPLOYEE, PROJECT                         6          {EMPLOYEE: 0.231, PROJECT: 0.261}
EMPLOYEE, DEPARTMENT                      7          {EMPLOYEE: 0.269, DEPARTMENT: 0.583}
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT   2          {EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
EMPLOYEE, DEPENDENT                       3          {EMPLOYEE: 0.115, DEPENDENT: 1.000}

Example (Fuzzy Sets)
Suggested allocation, no replication case:
  {WORKS_ON: 0.500, PROJECT: 0.304}
  {EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
  {EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
  {EMPLOYEE: 0.231, PROJECT: 0.261}
  {EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}
  {EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
  {EMPLOYEE: 0.115}

Example (Fuzzy Sets)
Suggested allocation, replication case; at most three replicas allowed:
  {WORKS_ON: 0.500, PROJECT: 0.304}
  {EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
  {EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
  {EMPLOYEE: 0.231, PROJECT: 0.261, DEPARTMENT: 0.250}
  {EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}
  {EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
  {EMPLOYEE: 0.115, DEPENDENT: 1.000}

Step 2: Building the Network
Each item (table) is a node in the network.
An edge exists between two nodes if they appear together in at least one frequent closed itemset.
The weight of an edge between two nodes is related to the number of frequent closed itemsets in which the corresponding tables appear together.
The weight is normalized.

Step 2 Example
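
Steps 1 and 2 can be replayed on the seven closed itemsets listed above. The sketch below makes two assumptions that the slides do not state. First, the membership of a table in an itemset is taken as the itemset's frequency divided by the total frequency of all closed itemsets containing that table; this is inferred from the numbers and happens to reproduce every value on the fuzzy-set slide. Second, edge weights are normalized by the largest weight, since the slides only say that the weight is normalized. Function names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Frequent closed itemsets and frequencies, as listed on the slide.
closed_itemsets = [
    (("EMPLOYEE", "DEPARTMENT", "WORKS_ON", "PROJECT"), 2),
    (("EMPLOYEE", "WORKS_ON", "PROJECT"), 5),
    (("EMPLOYEE", "DEPARTMENT", "PROJECT"), 3),
    (("EMPLOYEE", "PROJECT"), 6),
    (("WORKS_ON", "PROJECT"), 7),
    (("EMPLOYEE", "DEPARTMENT"), 7),
    (("EMPLOYEE", "DEPENDENT"), 3),
]

def memberships(itemsets):
    """Assumed rule: frequency / total frequency of itemsets containing the table."""
    totals = Counter()
    for items, freq in itemsets:
        for table in items:
            totals[table] += freq
    return [{t: round(freq / totals[t], 3) for t in items} for items, freq in itemsets]

def build_network(itemsets):
    """Edge weight = number of closed itemsets in which both tables appear."""
    weights = Counter()
    for items, _freq in itemsets:
        for a, b in combinations(sorted(items), 2):
            weights[(a, b)] += 1
    return dict(weights)

def normalize(weights):
    """Assumed normalization: divide by the largest edge weight."""
    top = max(weights.values())
    return {edge: w / top for edge, w in weights.items()}

fuzzy_sets = memberships(closed_itemsets)
network = build_network(closed_itemsets)
scaled = normalize(network)
```

The resulting graph has seven edges, and EMPLOYEE is linked to all four other tables.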
[Figure: network of tables]
Note: Table DEPT_LOCATIONS is not included in the graph because it did not appear in any of the queries.

Step 3: Applying Network Analysis
Various network analysis techniques can be used to extract relationships between tables from the network.
Centrality measures can identify the tables that are related to many other tables and consequently play a key role in linking data from different tables.
Graph clustering algorithms can find groups of tables that are frequently accessed together in queries.

Step 3 Example (Centrality Measures)

Table        Degree (unweighted)  Closeness  Betweenness
EMPLOYEE     4                    0.40       6
DEPARTMENT   3                    0.27       4
WORKS_ON     3                    0.25       4
PROJECT      3                    0.36       4
DEPENDENT    1                    0.18       4

Step 3 Example (Clustering Results)
Edge betweenness clusters:
  C1: EMPLOYEE, PROJECT, DEPARTMENT
  C2: WORKS_ON
  C3: DEPENDENT
MST clusters:
  C1: DEPENDENT
  C2: EMPLOYEE, WORKS_ON, PROJECT
  C3: DEPARTMENT
The clustering results may seem meaningless because the five nodes in this example graph are highly correlated.

Experiment 1: Centrality Measures
This experiment was run on a synthetic dataset of 14 tables (T0 to T13) and 20 queries, with min-support-threshold = 2.
High degree nodes: T10: 6, T14: 4
High closeness nodes: T10: 0.25, T14: 0.20
High betweenness nodes: T10: 86, T14: 49

Experiment 1: Clustering Result
Edge betweenness clusters:
  C1: T11, T12, T13, T14
  C2: T1, T0, T2
  C3: T4, T5, T10, T8, T3
MST clusters:
  C1: T11
  C2: T4, T3
  C3: T5, T10, T12, T13, T8, T14, T1, T0, T2

Experiment 2: Centrality Measures
This experiment was run on a synthetic dataset of 14 tables (T0 to T13) and 30 queries, with min-support-threshold = 1.
High degree nodes: T7: 12, T10: 11
High closeness nodes: T10: 0.20, T7: 0.19
High betweenness nodes: T7: 43, T10: 31

Experiment 2: Clustering Result
Edge betweenness clusters:
  C1: T6
  C2: T8
  C3: T4, T5, T3, T2
  C4: T1, T0
  C5: T7, T10, T11, T12, T13, T14, T9
MST clusters:
  C1: T6, T8
  C2: T11
  C3: T7, T9
  C4: T10, T12, T13, T14, T1, T0, T2
  C5: T4, T5, T3

To further demonstrate the effectiveness of the proposed approach in practice, we conducted another experiment using a synthetic query set of 1000 queries on 50 tables; finding real data is very hard because this type of data is sensitive and hence highly confidential.
We generated the data by restricting the number of tables that can appear in the same query to at most 20. That is, one query may access at most 20 different tables, though in practice a query rarely accesses more than four or five.

[Figure]

These are four example communities:
  {T6, T8, T9, T22, T23, T24, T33}
  {T6, T9, T21, T37, T42, T45}
  {T5, T6, T11, T13, T14, T16, T19}
  {T6, T7, T9, T10, T12, T13, T19}
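
The MST clusters reported in the experiments above follow the procedure from the Graph Clustering Algorithms slide: build a minimum spanning tree, then remove its heaviest edges. A minimal sketch of that procedure, assuming edge weights are distances (a co-occurrence weight would first have to be turned into a distance) and that the number of clusters k is chosen by the analyst. The graph and its weights are hypothetical.

```python
def mst_clusters(nodes, edges, k):
    """Kruskal's MST, then drop the k-1 heaviest MST edges; return the components."""
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = []
    for w, u, v in sorted(edges):            # lightest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                          # edge joins two components
            parent[ru] = rv
            mst.append((w, u, v))

    kept = sorted(mst)[: max(len(mst) - (k - 1), 0)]
    parent = {v: v for v in nodes}            # rebuild components from kept edges
    for _w, u, v in kept:
        parent[find(u)] = find(v)

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical table network; each edge is (distance, table, table).
nodes = ["T0", "T1", "T2", "T3", "T4", "T5"]
edges = [
    (1.0, "T0", "T1"), (2.0, "T1", "T2"), (3.0, "T0", "T2"),
    (9.0, "T2", "T3"), (1.5, "T3", "T4"), (7.0, "T4", "T5"), (8.0, "T3", "T5"),
]
clusters = mst_clusters(nodes, edges, k=3)
```

Single-node clusters, as in several of the results above, appear whenever a node is attached to the tree only by one of the removed edges.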

From Frequent Patterns to Network Construction

Overview
Given a dataset, e.g., emails exchanged between a group of people such as employees of the same company:
Partition the dataset into groups based on the criterion to be studied.
  To study the employees, all emails are grouped such that the emails of the same employee form one group.
Decide on the items to be considered in the analysis.
  E.g., each email could be a transaction, and the words and e-mail addresses within the header/text could be items.
Mine FPs within each group and globally.
Find the relevant features for each group based on entropy.

The Proposed Framework
[Diagram: the Feature Extraction Model mines frequent closed patterns, selects suitable features based on entropy ranking, and calculates the weights of the features to create feature vectors. The other components are the Network Creation Model, the Statistical Analysis Model, and the Front End Interface and Visualization Tool.]

Feature Extraction Model: The Feature Vector
The feature vector of entity ej with m features is represented as Fj = (w(f1), w(f2), …, w(fm)), where w(fk) is the weight of the k-th feature fk in entity ej.

Feature Extraction Model: Weight of a Feature
The weight of each feature is calculated using the following formula:
  wDj(fk) = supDj(fk) / supD(fk)
where wDj(fk) is the weight of feature fk for entity ej, supDj(fk) is the frequency of fk across the dataset Dj of entity ej, and supD(fk) is the frequency of fk across the dataset D of all entities E.

Experimental Results: Enron E-mail dataset description
The dataset contains 500,000 e-mail messages across 150 Enron employees.
For this analysis, only inboxes holding more than 1000 e-mails were considered.
From each user's inbox we chose 1000 e-mails at random; these make up the e-mail dataset of the corresponding user.

Experimental Results: Processing Enron E-mail dataset
Identify itemsets from the e-mail dataset:
  The stem words appearing in the body and the subject line of the e-mails are considered as items.
  E-mail addresses inside the e-mails are identified as items as well.
  The items appearing in a single e-mail are considered as a single transaction.
This way, for each user we build a transactional database of 1000 transactions, one for each of the 1000 sampled e-mails in the inbox.
From these transactional databases we identify the globally frequent closed itemsets (corresponding to a support of 10%).
Based on entropy ranking, we chose the top 100 closed itemsets as our feature set.

Experimental Results: Euclidean Distance Matrix for Enron Users
Distance cutoff point: 0.30

            buy   dean  ermis  jones  kaminski  keavey  lokey  may   sager  saibi  salisbury  shackleton  thomas  whalley  ybarbo
buy         0.00  0.65  0.57   0.26   0.43      0.41    0.43   0.35  0.32   0.36   0.25       0.22        0.65    0.60     0.59
dean        0.65  0.00  0.13   0.50   0.28      0.50    0.27   0.68  0.40   0.44   0.73       0.64        0.08    0.10     0.13
ermis       0.57  0.13  0.00   0.44   0.22      0.44    0.21   0.61  0.33   0.38   0.65       0.56        0.15    0.14     0.16
jones       0.26  0.50  0.44   0.00   0.27      0.35    0.29   0.38  0.19   0.26   0.36       0.21        0.50    0.47     0.44
kaminski    0.43  0.28  0.22   0.27   0.00      0.31    0.16   0.47  0.17   0.28   0.51       0.39        0.28    0.25     0.25
keavey      0.41  0.50  0.44   0.35   0.31      0.00    0.38   0.25  0.30   0.41   0.45       0.38        0.51    0.47     0.50
lokey       0.43  0.27  0.21   0.29   0.16      0.38    0.00   0.50  0.22   0.25   0.52       0.41        0.27    0.25     0.24
may         0.35  0.68  0.61   0.38   0.47      0.25    0.50   0.00  0.40   0.45   0.35       0.33        0.69    0.65     0.67
sager       0.32  0.40  0.33   0.19   0.17      0.30    0.22   0.40  0.00   0.25   0.44       0.28        0.40    0.36     0.36
saibi       0.36  0.44  0.38   0.26   0.28      0.41    0.25   0.45  0.25   0.00   0.45       0.34        0.43    0.41     0.41
salisbury   0.25  0.73  0.65   0.36   0.51      0.45    0.52   0.35  0.44   0.45   0.00       0.30        0.75    0.70     0.70
shackleton  0.22  0.64  0.56   0.21   0.39      0.38    0.41   0.33  0.28   0.34   0.30       0.00        0.63    0.60     0.59
thomas      0.65  0.08  0.15   0.50   0.28      0.51    0.27   0.69  0.40   0.43   0.75       0.63        0.00    0.09     0.13
whalley     0.60  0.10  0.14   0.47   0.25      0.47    0.25   0.65  0.36   0.41   0.70       0.60        0.09    0.00     0.11
ybarbo      0.59  0.13  0.16   0.44   0.25      0.50    0.24   0.67  0.36   0.41   0.70       0.59        0.13    0.11     0.00

Experimental Results: The Enron E-mail users' social network based on e-mail usage
[Figure]

Experimental Results: The Enron E-mail users' social network based on e-mail usage
Five clusters of Enron e-mail users:
  1: saibi
  2: buy, salisbury, shackleton, jones
  3: dean, ermis, jones, kaminski, lokey, sager, thomas, whalley, ybarbo
  4: keavey
  5: may

From Association Rules to Network

Basic Steps
Given a website, the mining process can be applied on three dimensions: content, structure and log.
The actors in the network are the pages.
Construct the adjacency matrix by mining association rules from the transactional database obtained after preprocessing the web log data:
  Each transaction is a set of pages accessed together in one session.
  An FPM algorithm, e.g., Apriori or FP-growth, is applied to the derived transactional data, and association rules are derived.
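
As an illustration of the mining step above, the following is a compact Apriori-style sketch over session transactions, followed by rule generation filtered by confidence. The five sessions and the page names are hypothetical, and the code favors clarity over efficiency; a real web log would call for an optimized FPM library.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search: a k-itemset is tested only if all its (k-1)-subsets are frequent."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        # Join frequent k-itemsets into (k+1)-candidates, then apply Apriori pruning.
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, len(c) - 1))]
    return frequent

def association_rules(frequent, n_transactions, min_conf):
    """Return (antecedent, consequent, support, confidence) for each strong rule."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]      # lhs is frequent by the Apriori property
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, count / n_transactions, conf))
    return rules

# Hypothetical sessions: each one is the set of pages visited together.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
    {"home", "products", "cart"},
]
frequent = apriori(sessions, min_count=2)
strong = association_rules(frequent, len(sessions), min_conf=0.9)
```

With these sessions, only rules whose consequent is the products page reach 90% confidence.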

Basic Steps
Determine the frequent itemsets.
Find the association rules.
Add the items in each rule as nodes in the graph, and connect the items on the left side to the items on the right side (directed edges).
Use support and confidence to compute a combined weight for each added edge.
If an edge already exists, add the new weight to its existing weight.
Analyze the graph using SNA techniques.

From Association Rules to Social Network
[Figure]

From Association Rules to Social Network
Analyze the web log.
Determine frequent sets of pages based on how often pages are accessed together.
Determine the rules and keep only those satisfying the minimum confidence.
Construct the network of pages based on the rules.

From Association Rules to Network
Each rule is reflected in the adjacency matrix by incrementing every entry (i, j) such that pages i and j occur in the antecedent and consequent of the rule, respectively.
Entries in the adjacency matrix are normalized by dividing each value by the overall average of the values in the matrix.
The network is analyzed to rank the pages by their in-degree, out-degree, betweenness and eigenvector centrality.
Pages with high betweenness centrality are considered important for linking pages from different communities.
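
The rule-to-matrix step above can be sketched as follows. The rules and page names are hypothetical. The slide's "values that exist in the matrix" is read here as the non-zero entries, which is an assumption; averaging over all entries, zeros included, would be the other reading.

```python
def rules_to_adjacency(pages, rules):
    """Increment entry (i, j) for every antecedent page i and consequent page j."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    matrix = [[0.0] * n for _ in range(n)]
    for antecedent, consequent in rules:
        for i in antecedent:
            for j in consequent:
                matrix[index[i]][index[j]] += 1
    return matrix

def normalize_by_average(matrix):
    """Divide every entry by the average of the non-zero entries (assumed reading)."""
    nonzero = [v for row in matrix for v in row if v > 0]
    if not nonzero:
        return matrix
    avg = sum(nonzero) / len(nonzero)
    return [[v / avg for v in row] for row in matrix]

# Hypothetical rules as (antecedent, consequent) pairs of page sets.
pages = ["home", "products", "cart"]
rules = [
    ({"home"}, {"products"}),
    ({"home", "products"}, {"cart"}),
    ({"cart"}, {"products"}),
    ({"home"}, {"products", "cart"}),
]
adjacency = rules_to_adjacency(pages, rules)
normalized = normalize_by_average(adjacency)
out_degree = {p: sum(1 for v in row if v > 0) for p, row in zip(pages, adjacency)}
```

Because the edges are directed, the matrix is generally not symmetric: here the home page points to both other pages but nothing points back to it.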

From Association Rules to Social Network
The analysis was done using the software Visone (http://visone.info/).
[Figure: betweenness centrality measure]

From Association Rules to Social Network
[Figure: closeness centrality measure]

From Association Rules to Social Network
[Figure: eigenvector centrality measure]

From Multi-objective GA based Clustering to Network Construction
The case of genes/proteins

Motivation
In most traditional clustering algorithms, the number of clusters is given a priori.
In fact, the clustering criterion depends on more than one objective.
Cluster validation is used to assess the number of clusters.
Multi-objective clustering must work on both small and large data sets.

Objective Functions for Clustering
Three objectives:
  F1: minimize the number of clusters
  F2: maximize the heterogeneity between clusters
  F3: maximize the within-cluster homogeneity

Objective functions
[Figure]

Divide and Conquer
Basic steps:
If the dataset to be clustered is of manageable size, it is clustered as a whole.
Otherwise, repeat the following steps:
  Partition the dataset (or the set of centroids after the first iteration) into subsets of manageable size.
  Cluster each subset individually by applying the multi-objective GA combined with validity analysis to get the centroids of the obtained clusters.
  If the set of all centroids is of manageable size, cluster the whole set of centroids and exit the loop.
Backtrack to merge the clusters whose centroids end up in the same final cluster.

Unique Solution of Compact Clusters
[Figure]

From Alternative Solutions to Adjacency Matrix
[Figure: genes × genes matrix]
Entry (i, j) specifies the number of solutions in which Gene i and Gene j occurred in the same cluster.

From Adjacency Matrix to Network
[Figure]

Criminal and Terror Network Analysis

Terror Network Analysis by Clustering
We developed a framework that employs clustering, frequent pattern mining and some social network analysis measures to determine the effectiveness of a network.
The clustering and frequent pattern mining techniques start with the adjacency matrix of the network.
For clustering, we use the entries of the matrix by considering each row as an object and each column as a feature.
  The features of a network member are his/her direct neighbors.
  We maintain the link weights in the case of a weighted network.

Multi-Objective GA based Clustering
We applied multi-objective GA based clustering.

Terror Network Analysis by Clustering & FPM
For clustering, we consider each row as an instance and each column as a feature.
We cluster instances to find important groups and individuals within the network.
For frequent pattern mining, we consider each row of the adjacency matrix as a transaction and each column as an item.
We map entries onto a 0/1 scale: every entry whose value is greater than zero is assigned the value one; all other entries keep the value zero.
This way we can apply frequent pattern mining algorithms to determine the most influential members in a network, as well as the effect of removing some members or even links between members of a network.

Terror Network Analysis
We investigate the effect of adding some links between members.
We are able to study how the various members in the network change roles as the network evolves.
This is measured by applying some SNA measures to the network at each stage of its development.
We report some interesting results on various benchmark networks, including the 9/11 and Madrid bombing networks.
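
The 0/1 mapping and the row-as-transaction view described above can be sketched as follows. This is a minimal illustration under assumptions: the member names, the weights, and all function names are hypothetical, and the brute-force support counting stands in for whichever frequent pattern mining algorithm (for example Apriori) is actually used.

```python
from itertools import combinations


def binarize(adjacency):
    """Map every entry greater than zero to 1; all other entries stay 0."""
    return [[1 if w > 0 else 0 for w in row] for row in adjacency]


def to_transactions(binary, members):
    """Each row becomes a transaction holding the members it links to."""
    return [frozenset(members[j] for j, bit in enumerate(row) if bit)
            for row in binary]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force support counting (a stand-in for a real FPM algorithm)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result


def remove_member(adjacency, members, name):
    """Drop one member's row and column to study the effect of removal."""
    k = members.index(name)
    kept = [m for m in members if m != name]
    reduced = [[w for j, w in enumerate(row) if j != k]
               for i, row in enumerate(adjacency) if i != k]
    return reduced, kept


# Toy weighted network with four members (weights are made up).
members = ["A", "B", "C", "D"]
adjacency = [
    [0, 3, 1, 2],
    [3, 0, 2, 1],
    [1, 2, 0, 0],
    [2, 1, 0, 0],
]

before = frequent_itemsets(to_transactions(binarize(adjacency), members), 2)
reduced, kept = remove_member(adjacency, members, "D")
after = frequent_itemsets(to_transactions(binarize(reduced), kept), 2)
print(before)
print(after)
```

In this view the support of a single item equals the number of members linked to it, and a frequent pair such as (A, B) means that A and B have at least min_support neighbors in common. Comparing the two results shows how removing D changes which members and pairs remain frequent.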

Database Search

Problem Definition
You tell the computer what you want in terms that mean something to you, using fuzzy sets.
You ask the computer your question using the fuzzy term.
The computer tells you how accurate your results are.
[Figure axis: degree of membership]

Related Work: Database Search
Fuzzy Data Representation
  Disadvantages:
    Existing databases need to be restructured
    Prevents traditional users from executing standard (non-fuzzy) queries
Extending a Query Language to support fuzzy querying without changing the database itself
  Disadvantages:
    Commercially available DBMSs need to support a new query language
    Requires users to learn the new query language

Motivation
Proposing an independent intermediate translation layer to incorporate fuzziness in the interface/querying facility of database systems, to retrieve more accurate facts
Groups within a social network may share the same intermediate layer
Recommendation system based on SNA to help users build their intermediate layer
The intermediate layer provides the mapping between the fuzziness expected by the user and the actual crisp values stored in the data repository

Methodology
Fuzziness can be specified:
  Manually: by a human expert
  Semi-automatically: a human expert decides on the number of fuzzy sets; the intermediate layer defines the fuzzy sets
  Fully automatically: by the intermediate layer
The intermediate layer uses the fuzzy set specifications to map between the fuzziness expected by the user and the actual crisp values stored in the data repository

Intelligent Database Search

AskFuzzy: Attractive Visual Fuzzy Query Builder*
[Diagram labels: Fuzzy Query, Fuzzy Layer, DBMS, Data]
1 Data Fuzzification
  • Transferring numeric values to fuzzy sets
  Modes: Manual, Semi-automated, Full-automated
  Number of fuzzy sets: by user, or by system (optimization process: min number of clusters, max cluster quality)
  Fuzzy set functions: by user, or by system (initial fuzzy sets: based on clustering result; optimized fuzzy sets: based on genetic algorithm optimization)
2 Fuzzy Query Construction
3 Fuzzy Query Execution
http://cpsc.ucalgary.ca/~nkoochak/AskFuzzy/
* ICDE 2011 IEEE International Conference on Knowledge Engineering

Conclusions
Data mining and machine learning techniques could be integrated with network-based analysis.
The combination would lead to a strong framework for data analysis from various perspectives.
Global correlations within the data are considered and hence lead to more realistic results.
A variety of application domains could benefit from the integrated setup.

The End!
Thank you for your attention
Reda Alhajj