Pattern Analysis & Machine Intelligence Research Group
UNIVERSITY OF WATERLOO

LORNET Theme 4
Data Mining and Knowledge Extraction for LO

TL: Mohamed Kamel
PI's: O. Basir, F. Karray, H. Tizhoosh
Assoc PI's: A. Wong, C. DiMarco


Knowledge Extraction and LO Mining

GOAL: Develop data mining and knowledge extraction techniques and tools for learning object repositories. These tools can provide context and facilitate interactions, efficient organization, efficient delivery, navigation and retrieval.

PAMI Research Group, University of Waterloo


Theme Overview

Knowledge Extraction
• From Text
  - Syntactic: keyword, keyphrase-based
  - Semantic: concept-based
• From Images
  - Image features, shape features
• From Text + Images
  - Describing images with text
  - Enriching text with images

LO Mining
• Classification (MCS, Data Partitioning, Imbalanced Classes)
• LO Similarity and Ranking
• Clustering (Parallel/Distributed Clustering, Cluster Aggregation)
• Association Rules / Social Networks
• Reinforcement Learning
• Specialized / Personalized Search

Tagging and Organizing; Matching and Ranking


Types of Data in LORNET

[Diagram: the TELOS semantic layer over three data sources]
• LCMS (Course, Resource, Module, Lesson): Subject Matter. Text, Images, Flash, Applets, Metadata, Interaction Logs
• Discussion Board (Board, Thread, Post): Discussions. Text, Interaction Logs
• LOR (Record, LO, Metadata): Resources, LO Descriptors. Metadata, Semantic References


Mining Scenarios

Tasks by environment:

TELOS
  Knowledge Extraction: Ontology Construction
  Tagging / Organizing: Grouping Components
  Matching / Ranking:   Finding & Ranking Components

E-Learning Design Environment (LMS)
  Knowledge Extraction: Extracting LO Summary; Extracting LO Concepts; Extracting Image Description
  Tagging / Organizing: Grouping LOs
  Matching / Ranking:   Finding Similar LOs; Ranking LOs

Learning Object Content MS (LCMS)
  Knowledge Extraction: Summarizing Documents; Extracting Concepts from Documents
  Tagging / Organizing: Grouping Documents; Tagging Documents
  Matching / Ranking:   Finding Similar Topics; Finding Similar Profiles; Building Social Networks; Detecting Plagiarism

LO Repository
  Knowledge Extraction: Extracting Metadata; Extracting Ontologies
  Tagging / Organizing: Classifying LOs; Building LO Clusters
  Matching / Ranking:   Detecting Duplicate LOs; Ranking LOs; Metadata Matching


LO Mining and Knowledge Extraction

Applications / Services
• LO Automatic Tagging
• LO Grouping / Ranking
• LO Similarity
• LO Summarization
• LO Recommendation
• ...

Data Mining Algorithms
• Text Mining: Parsing, Tokenization, Keyword/phrase Extraction
• Semantic Analysis: NLP, Ontologies, Knowledge Rep.
• Categorization: Classification, Clustering
• Learning from Interactions: Reinforcement Learning, Multi-Agent Systems
• Image Mining: Feature Extraction, Shape Analysis, Indexing and Retrieval

Data Mining Foundations
• Math & Statistics: Vectors, Matrices, Statistics
• Data Representation: Features, Feature Types, Normalization, Discretization
• Data Structures: Arrays, Lists, Trees, Graphs
• Data Access: Data Sources, Data Readers/Writers, Data Converters


Projects Overview

Information Extraction: Analyzing content to extract relevant information
  Input: Text Document
  • Keyword Extraction
  • Summarization
  • Concept Extraction
  • Social Network Analysis

Categorization: Organizing LOs according to their content
  Input: Text Document
  • Classification: Traditional, MCS, Imbalanced
  • Clustering: Traditional, Ensembles, Distributed

Personalization: Providing user-specific results
  Input: Interaction Logs
  • Reinforcement Learning: Traditional, Opposition-based

Image Mining: Describing and finding relevant images
  Input: Image
  • CBIR: Traditional, Fusion-based

Integration and Applications
  • Software Components
  • Theme and Industry Collaboration
  • Publications

Legend: In Progress


Information Extraction: Summarization

LO Content Package Summarization

Learning objects stored in IMS content packages are loaded and parsed.
Textual content files are extracted for analysis. Statistical term weighting and sentence ranking are performed on each document and on the whole collection. The top relevant sentences are extracted for each document.

Planned functionality: Summarization of whole modules or lessons (as opposed to single documents).

Benefits
• Provide a summarized overview of learning objects for quick browsing and access to learning material.

Scenarios
• Learning Management Systems can call the summarization component to produce summaries for content packages.

Data is courtesy of the University of Saskatchewan.


Information Extraction: Concept Extraction

[Diagram: concept-based model. Text passes through a Text Pre-processor (Sentence Separator); language-dependent Natural Language Processing (POS Tagger, Syntax Parser, Semantic Parser / Semantic Role Labeler); and a language-independent Concept-based Statistical Analyzer, producing Concepts and a Conceptual Ontological Graph (COG) Representation. ctf: conceptual term frequency; tf: term frequency.]

F-measure of Hierarchical Clustering
  Dataset   Single-Term   Concept-based   Improvement
  Reuters   0.723         0.925           +27.94%
  ACM       0.697         0.918           +31.70%
  Brown     0.581         0.906           +55.93%

Entropy of Hierarchical Clustering
  Dataset   Single-Term   Concept-based   Improvement
  Reuters   0.251         0.012           -95.21%
  ACM       0.317         0.043           -86.43%
  Brown     0.385         0.018           -95.32%

Conceptual Ontological Graph (COG) Ranking

Precision of Search Result
  Dataset   Single-Term   Concept-based   Improvement
  Cran      0.536         0.901           +68.09%
  Reuters   0.591         0.897           +51.77%

Recall of Search Result
  Dataset   Single-Term   Concept-based   Improvement
  Cran      0.486         0.827           +70.16%
  Reuters   0.452         0.841           +86.06%


Information Extraction: Keyword Extraction

Semantic Keyword Extraction

Tasks
• Developing tools and techniques to extract semantic keywords, in order to facilitate metadata generation
• Developing algorithms to enrich metadata (tags), which can be applied in index-based multimedia retrieval

Progress
• Proposed a new information-theoretic inclusion index to measure the asymmetric dependency between terms (and concepts), which can be used in term selection (keyword extraction) and taxonomy extraction (pseudo-ontology). Makrehchi, M. and Kamel, ICDM07, WI 07


Information Extraction: Keyword Extraction

Rule-based Keyword Extraction
• Learn rules to find keywords in English sentences
• Rules represent sentence fragments
  - Specific enough for reliable keyword extraction
  - General enough to be applied to unseen sentences

Rule generalization
• Begin with an exact sentence fragment
• Merge it with another by moving the words that differ to the lowest common level in the part-of-speech hierarchy
• Keep the merged rule if it does not reduce the precision and recall of keyword extraction; keep the original rules otherwise

Keyword extraction
• Find the sequence of rules that best covers an unseen sentence
• Extract keywords according to the rules

• Rule base size shows quick initial growth, followed by slow and irregular growth and rule elimination
  - Learns 20 rules from the first 50 training rules
  - Learns 13 additional rules from the next 220 training rules
• Both precision and recall values increase during training
  - Precision (blue) increases 10%
  - Recall (red) shows a slight upward trend


Categorization: Ensemble-based Clustering

Consensus Clustering

Categorization of learning objects using proposed consensus clustering algorithms. The goal of consensus clustering is to find a clustering of the data objects that optimally summarizes an ensemble of multiple clusterings. Consensus clustering can offer several advantages over a single data clustering: improved clustering accuracy, better scalability of clustering algorithms to large volumes of data objects, and greater robustness through reduced sensitivity to outlier data objects or noisy attributes.
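As a rough illustration of the consensus idea, the sketch below combines several base clusterings through a co-association vote: two objects end up in the same consensus cluster when more than half of the base clusterings place them together. This is a common textbook approach used here for illustration only, not necessarily the consensus algorithm proposed by the group; the function name `consensus_labels` and the `threshold` parameter are assumptions.

```python
from itertools import combinations


def consensus_labels(clusterings, threshold=0.5):
    """Merge objects that co-occur in more than `threshold` of the clusterings.

    `clusterings` is a list of label lists, one per base clustering, all of
    the same length. Label values only need to be consistent within a single
    clustering, so differently named labels across clusterings are fine.
    """
    n = len(clusterings[0])
    parent = list(range(n))  # union-find forest over object indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        votes = sum(1 for labels in clusterings if labels[i] == labels[j])
        if votes / len(clusterings) > threshold:
            parent[find(j)] = find(i)

    # Relabel the union-find roots as 0, 1, 2, ... in order of appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three base clusterings of four objects; the second uses swapped label names.
ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
print(consensus_labels(ensemble))
```

Because the vote only asks whether two objects share a label within each base clustering, this particular sketch does not need the label names of different clusterings to agree.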
Tasks
• Develop techniques for producing ensembles of multiple data clusterings in which diverse information about the structure of the data is likely to occur.
• Develop consensus algorithms to aggregate the individual clusterings.
• Develop solutions for the cluster symbolic-label matching problem.
• Perform empirical analysis on real-world data and validate the proposed method.


Categorization using cluster ensemble

  Dataset                     # samples   # attributes   # classes   K-means mean error rate (%)   Ensemble mean error rate (%)
  Synthetic1                  1000        8              5           17.41                         0
  Yahoo! (text)               2340        1458           6           38.23                         16.24
  Texture (image)             5500        40             11          37.99                         11.54
  Optical Digit Recognition   500         64             10          27.31                         16.40


Categorization: Distributed Clustering

Hierarchical P2P Document Clustering

HP2PC Architecture
• Peer nodes are arranged into groups called "neighborhoods".
• Multiple neighborhoods are formed at each level of the hierarchy. The size of each neighborhood is determined through a network partitioning factor.
• Each neighborhood has a designated supernode. Supernodes of level h form the neighborhoods for level h+1.
• Clustering is done within neighborhood boundaries, then merged up the hierarchy through the supernodes.

[Figure: HP2PC architecture. Levels run from h = 0 up through h = 1, h = 2, ..., h = H-1 to the root at h = H; each neighborhood has a SuperNode (S). Top level of the example network: P(2) = {p1(2), p2(2)}, Q(2) = {Q1(2)}.]

Benefits
• Significant speedup over centralized clustering and flat peer-to-peer clustering.
• Multiple levels of clusters.
• Distributed summarization of clusters using CorePhrase keyphrase extraction.

Scenarios
• Distributed knowledge discovery in hierarchical organizations.
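The neighborhood and supernode structure of HP2PC can be sketched in a few lines. The code below only builds the topology (it performs no clustering) for the 3-level, 16-node HP2PC example, taking the number of neighborhoods per level (4, 2, 1) directly as input. The function names, the even contiguous split, and the choice of the first peer in each neighborhood as its supernode are all assumptions made for illustration; the slides do not say how supernodes are designated.

```python
def split_evenly(items, parts):
    """Split items into `parts` contiguous groups whose sizes differ by at most one."""
    if not 1 <= parts <= len(items):
        raise ValueError("parts must be between 1 and len(items)")
    base, extra = divmod(len(items), parts)
    groups, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def build_hierarchy(peers, neighborhoods_per_level):
    """Return one list of neighborhoods per level, starting at level h = 0."""
    levels = []
    current = list(peers)
    for count in neighborhoods_per_level:
        neighborhoods = split_evenly(current, count)
        levels.append(neighborhoods)
        # Assumption: the first peer of each neighborhood acts as its supernode.
        # The supernodes of level h become the peers of level h + 1.
        current = [hood[0] for hood in neighborhoods]
    return levels


peers = [f"p{i}" for i in range(1, 17)]      # 16 peer nodes
levels = build_hierarchy(peers, [4, 2, 1])   # hypothetical input: counts per level
for h, neighborhoods in enumerate(levels):
    print(f"h={h}: {neighborhoods}")
```

In a full system, each neighborhood would cluster its own documents and its supernode would pass the merged result up to the next level.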
[Figure: HP2PC example, a 3-level network with 16 nodes, showing neighborhoods (Q) at each level.
  h = 2 (β = 0):    top level
  h = 1 (β = 0.33): P(1) = {p1(1), p2(1), p3(1), p4(1)}, Q(1) = {Q1(1), Q2(1)}
  h = 0 (β = 0.2):  P(0) = {p1(0), ..., p16(0)}, Q(0) = {Q1(0), ..., Q4(0)}]


Categorization: Multiple Classifier Systems

Tasks
• To investigate various aspects of cooperation in Multiple Classifier Systems (Classifier Ensembles)
• To develop evaluation measures in order to estimate various types of cooperation in the system
• To gain insight into the impact of changes in the cooperative components on system performance, using the proposed evaluation measures
• To apply these findings to optimize existing ensemble methods
• To apply these findings to develop novel ensemble methods, with the goal of improving classification accuracy and reducing computational complexity

Progress
• Proposed a set of evaluation measures to select sub-optimal training partitions for training classifier ensembles.
• Proposed an ensemble training algorithm called Clustering, Declustering, and Selection (CDS).
• Proposed and optimized a cooperative training algorithm called Cooperative Clustering, Declustering, and Selection (CO-CDS).
• Investigated the application of the proposed training methods (CDS and CO-CDS) to LO classification.


Categorization: Imbalanced Class Distribution

Objective
• Advance classification of multi-class imbalanced data

Tasks
• To develop the cost-sensitive boosting algorithm AdaC2.M1
• To improve the identification performance on the important classes
• To balance classification performance among several classes


Categorization: Imbalanced Class Distribution

Class Distribution
  Class Ind.   Size   Dist.
  C1           49     7.84%
  C2           288    46.08%
  C3           288    46.08%

Performance of Base Classification and AdaBoost
                   C4.5                 HPWR (Od=3)
  Class   Meas.    Base     AdaBoost    Base     AdaBoost
  C1      R        0        5.11        10.70    44.06
          P        N/A      6.5         11.82    32.89
          F        N/A      5.84        10.83    35.84
  C2      R        73.21    92.28       88.31    87.43
          P        69.53    88.75       86.79    91.99
          F        72.29    90.38       87.43    89.64
  C3      R        67.94    91.36       87.63    88.42
          P        73.89    87.88       87.07    89.91
          F        71.91    89.42       86.99    89.03
  G-mean           0        11.46       33.32    68.50

Balanced performance among classes, evaluated by G-mean
                   C4.5                            HPWR (Od=3)
  Class   Meas.    Base     AdaBoost   AdaC2.M1    Base     AdaBoost   AdaC2.M1
  C1      R        0        5.11       77.58       10.70    44.06      65.72
          P        N/A      6.50       14.12       11.82    32.89      30.83
  C2      R        73.21    92.28      64.73       88.31    87.43      83.12
          P        69.53    88.75      97.24       86.79    91.99      91.38
  C3      R        67.94    91.36      65.23       87.63    88.42      83.95
          P        73.89    87.88      93.22       87.07    89.91      90.81
  G-mean           0        11.46      68.42       33.32    68.50      76.08


Personalization

Opposition-based Reinforcement Learning for Personalizing Image Search
• Developing a reliable technique to assist users and to facilitate and enhance the learning process
• The personalized ORL tool helps users view the search-result images they find desirable
• The tool gathers the images returned by the search and selects a sample of them
• By presenting the sample and interacting with the user, it learns the user's preferences


Image Mining: CBIR

Content-based image retrieval
• Build an IR system that can retrieve images based on: textual cues, image content, NL queries

[Diagram: Image Retrieval Tool Set operating on rich documents and images.
  Queries: Query Image QI, Query Text QT, Query Document
  Results: documents that contain QI; images that contain QT; images that match QI; NL description of image; automated image tagging]


Illustrative Example

[Figure: retrieval results in four panels labeled IZM, FD, MTAR, and "The proposed approach", with accuracies of 70%, 60%, 95%, and 55% shown in the figure.]


Experimental Results (Cont'd)

[Figure: the performance of the proposed approach]


Integration and Applications
Progress
• Finished core parts of the common data mining framework.
• Built components and services from theme researchers' work around the data mining framework.
• Provided documentation for the data mining framework and software components.
• Launched web site to host components and documentation from Theme 4: http://pami.uwaterloo.ca/projects/lornet/software/


Integration and Applications

Progress

Core parts of the common data mining framework are available, including:
• Vector and matrix manipulation.
• Document parsing and tokenization.
• Statistical term and sentence analysis.
• Similarity calculation using multiple distance functions.
• IMS Content Package compliant parser.

Components and tools built around the common data mining framework:
• Metadata extraction from single documents; supports Dublin Core encoding.
• Document similarity calculation using cosine similarity.
• Single document and content package summarization.
• Building of standard text datasets from large document collections.

Integration with TELOS:
• Developed C# TELOS connector for integrating Theme 4 components.
• Worked on component manifest specification with Theme 6.
• Provided metadata extraction as part of a complete scenario for TELOS components integration.
• The following components were wrapped for use by TELOS through the C# connector: Automatic Metadata Extractor, Document Similarity, and Document Summarizer.


Industry Collaboration
• Pattern Discovery Software (PDS) provided data mining software tools for use by researchers.
• Vestech provided opportunities for researchers to work on speech technologies.
• Desire2Learn opened job opportunities for LORNET researchers.
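The document similarity component listed under Integration and Applications is described as using cosine similarity. A minimal sketch of that measure over raw term-frequency vectors follows. The whitespace tokenizer and the function names are assumptions for illustration; the actual component's tokenization and term weighting are not described in these slides.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count lowercase, whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("data mining tools", "data mining"))
print(cosine_similarity("learning objects", "image retrieval"))
```

Identical texts score 1 (up to floating-point rounding) and texts with no shared terms score 0, which makes the measure convenient for ranking candidate documents against a query document.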
Software Components

Overview of Components

[Diagram: components arranged under the group labels General Tools, Standard Text Mining Tools, Metadata Enrichment Tools, Concept-based and Semantic Text Mining Tools, Categorization Tools, and Image Mining Tools. Legend: Integrated, Ready, In Progress.]

Components shown:
• C# Connector for TELOS
• Common Data Mining Framework
• Metadata Extractor
• Document Summarizer
• Content Package Summarizer
• Document Similarity
• LO Recommender
• Metadata Harvester
• Keyword Extractor
• Taxonomy Extractor
• LO Classifier
• LO Multiple Classifier
• LO Clusterer
• LO Ensemble Clusterer
• LO Consensus Clusterer
• LO Distributed Clusterer
• Semantic-based Ontology Representation
• Semantic Metadata Matching
• POS Rule-Learning System
• Triplet Representation System
• Content-based Image Search
• Consensus-based Fusion for Image Retrieval
• Personalized Image Search


Scenarios for Use of Software Components

TELOS
  Data Types: Metadata, Ontology
  Tasks: Ontology construction and unification; Finding relations between components; Ranking components; Grouping components; Tagging components

Learning Object Repository
  Data Types: Metadata, Structured Text, Categorical
  Tasks: Automatic metadata extraction; LO automatic classification; LO organization through clustering; Multiple organization strategies through cluster ensembles

e-Learning Environment
  Data Types: Structured Text, Images, Object Relationships, Context, User-centric
  Tasks: Extracting concepts from LOs; Summarizing Documents; Grouping LOs; Tagging LOs; Discovering Similar Topics; Discovering Similar Peers; Building Social Networks; Detecting Plagiarism; LO recommendation using similarity ranking; Personalization / Specialization through reinforcement learning

Tools shown: Metadata Extractor, LO Search Engine, Document Similarity, Document Classifier, Document Clusterer, Personalized Image Search Engine, Social Network Learner


Year 5 Publications

  Project                                            Papers                   Papers                   Theses
                                                     (accepted / published)   (submitted / in prep)    (completed / in progress)
  4.1 Information Extraction from Text               11                       7                        3/2
  4.2 Semantic Knowledge Synthesis from Text         10                       4                        4/1
  4.3 Knowledge Discovery through Categorization     12                       10                       4/1
  4.4 Knowledge from Interaction                     8                        3                        1/2
  4.5 Knowledge from Image Mining                    10                       3                        2/1
  Total                                              51                       27                       14/7 (= 21)


Theme 4 Team

Leader: M. Kamel
PI's: Dr. Basir, Dr. Tizhoosh, Dr. Karray
Assoc PI's: Wong, DiMarco

Researchers
H. Ayad, R. Kashef, A. Ghazel, Dr. Makrehchi, M. Shokri, S. Hassan, A. Farahat, Dr. R. Khoury

Graduated
R. Khoury, PhD 07
L. Chen, PhD 07
M. Makrehchi, PhD 07
K. Hammouda, PhD 07
R. Dara, PhD 07
Y. Sun, PhD 07
K. Shaban, PhD 06
Y. Sun, PhD 06
M. Hussin, PhD 05
Jan Bakus, PhD 05
A. Adegorite, MASc 04
A. Khandani, MASc 05
S. Podder, MASc 04

Funding
CRC/CFI/OIT, NSERC, PAMI Lab
PDS, Vestech, Desire2Learn


Pattern Analysis and Machine Intelligence Lab
Electrical and Computer Engineering
University of Waterloo
Canada

www.pami.uwaterloo.ca
www.pami.uwaterloo.ca/projects/lornet/software/
www.pami.uwaterloo.ca/kamel.html