Safarii Walkthrough
version 2.0

This slide show presents in detail the different features of the Multi-Relational Data Mining package Safarii.

Copyright © Kiminkii, 2007. Parts of this presentation may be reused as long as clear reference is made to Safarii and http://www.kiminkii.com/safarii.html.

Key Features of Safarii
• Multi-Relational Data Mining
• Mine relational data directly, no unnatural flattening
• User-friendly graphical user interface
• Versatile, allows many analytical settings (predictive, exploratory, associative)
• Platform and RDBMS-independent
• Mining process done ‘in-database’
• Scalable

Contents
• Tell me about Multi-Relational Data Mining first...
• Go straight to the Safarii Walkthrough...
  – Subgroup Discovery
  – Filtering & Pattern Teams
  – Building Classifiers from Collections of Patterns
  – Decision Lists
  – Pre-processing with ProSafarii
• Frequently Asked Questions
• Contact information

Multi-Relational Data Mining
Safarii is the only available commercial implementation of a new data analysis paradigm called Multi-Relational Data Mining (MRDM). It allows the efficient and automated analysis of structured data stored in relational databases.

Mainstream Data Mining packages require the data to fit into a single table, where rows describe cases, and columns describe the different features. But how would you deal with more complex domains such as web-logs, social networks, extensive customer descriptions, molecules etc., if such a restriction exists?

[Figure: example molecules, including histamine, each labelled active or inactive]

The solution from relational database theory is to use multiple tables. MRDM generalises traditional Data Mining by working with cases spread over multiple tables.
In other words, cases are structured, and consist of multiple related parts (such as atoms within a molecule, or contracts within a customer description).

Structured Data
Structured data consists of parts that appear as records in different tables. MRDM creates patterns based on the properties of parts, but also on the existence of particular parts and the relations between parts.

[Figure: structured cases built from related parts: accounts, owners, credit cards and transactions 1 ... n]

Structured Patterns
MRDM builds models based on patterns that are structured, so-called Selection Graphs. These graphs capture structural, as well as nominal and numeric properties of structured cases. Multiple patterns are combined into predictive models.

[Figure: the data, and a pattern that occurs in some of the cases]

Example pattern: ‘all accounts that have at least one transaction for which the resulting balance does not exceed 303.0’

Safarii Methodology
The different tools and techniques in Safarii are organised in the Safarii methodology. The following slides help to explain the structure of this methodology, and hence how to work with Safarii.

[Diagram: Preprocessing (ProSafarii), Classifier (Decision List), Interesting Patterns, Pattern Team, Closed Pattern Team, Classifier]

The Safarii methodology consists of two main streams: building classifiers based on multi-relational structure directly, and finding interesting multi-relational patterns and potentially building classifiers from those.

Build Classifiers Directly: Safarii lets you analyse (possibly pre-processed) multi-relational data by inducing classifiers that capture predictive structural features of the data.
Find Interesting Patterns: the second stream is based on the discovery of interesting (predictive) patterns or rules. In this mode, you will find more regularities compared to building classifiers directly, because alternative or embedded structural features are not overlooked. This enables a more explorative survey of the dependencies in the database. The important predictive patterns can be combined to form classifiers in a number of ways.

As an option, an accompanying tool, called ProSafarii, helps you pre-process the data in a number of multi-relational-sensitive ways. Optimised datasets can be mined directly using Safarii.

Subgroup Discovery
Subgroup Discovery finds collections of subgroups within the database, identified by Selection Graphs, that show significant deviation from the whole database. Several parameters can be set, in order to define subgroups of interest:
• Find subgroups of molecules, where mutagenicity is common
• 188 molecules appear, of which 66.5% are mutagenic
• Search deep for 1 minute, and find subgroups of at least 47 molecules (= 25%)
• Judge, and report subgroups on the basis of a range of interestingness measures (Novelty balances accuracy and coverage)

Subgroup Discovery: Search for Patterns
Contingency table for the current pattern (S = subgroup, compl = its complement, T/F = target true/false):

          T     F
S        .42   .13   .55
compl    .12   .33
         .54         1.0

novelty(ST) = p(ST) − p(S)p(T) = .42 − .297 = .123
(novelty lies between −.25 and .25; 0 means uninteresting)

A user-specified interestingness measure is used to guide the search for predictive patterns. In this example Novelty is used.
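The novelty figure quoted on the slide can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Safarii code: the function name `novelty` and the variable names are illustrative, and the four cell probabilities are the ones shown in the contingency table.

```python
def novelty(p_st, p_s, p_t):
    """Novelty of subgroup S with respect to target T: p(ST) - p(S)p(T)."""
    return p_st - p_s * p_t


# Cell probabilities from the slide's contingency table.
p_st = 0.42          # in subgroup, target true
p_s_not_t = 0.13     # in subgroup, target false
p_not_s_t = 0.12     # outside subgroup, target true

# Marginals are sums over the corresponding row and column.
p_s = p_st + p_s_not_t   # 0.55
p_t = p_st + p_not_s_t   # 0.54

score = novelty(p_st, p_s, p_t)
print(round(score, 3))   # prints 0.123
```

The measure is zero when subgroup and target are independent, and reaches its maximum of .25 when the subgroup covers exactly half the cases and coincides with the target.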
Note the high numbers along the diagonal of the contingency table, indicating a positive dependency between the current pattern and the target.

All subgroups that satisfy the search conditions are reported, and details of the subgroups can be inspected.

Turn Selection Graphs into SQL
Each subgroup (Selection Graph) corresponds to a SQL statement that can be saved for future deployment, or turned into a database view, such that the interesting subgroup is always virtually present, even after updates of the original data.

ROC Space Analysis
Plot the set of discovered patterns in ROC space, and find optimal patterns. Safarii will report patterns that lie on the convex hull.

Each dot represents a single subgroup. Interesting subgroups appear in the top left corner. Uninteresting subgroups appear on the diagonal. Dots on the convex hull represent subgroups that perform better than those under the hull. A line in the plot represents the minimum support threshold: all subgroups of more than 30 molecules appear above it.

Creating Pattern Teams
Filter the initial set of discovered patterns to obtain a small team of patterns that are interesting as well as non-redundant.

Subgroup Discovery will typically produce an abundance of interesting subgroups. Within this collection of subgroups there will be a lot of overlap and redundancy. Apply filtering to obtain Pattern Teams: small collections of subgroups that are predictive and each add unique expertise to the team.
Multiple mechanisms allow for teams with different properties:
• Joint Entropy maximises the independence of subgroups
• Exclusive Coverage selects subgroups that are mutually exclusive, but cover a lot of the database
• DTM Purity selects those subgroups that together lead to the best performing Decision Table Majority classifier
• SVM Purity works in the same way, but uses a Support Vector Machine

Pattern Team
Four patterns that appear often, are correlated with the target (positively or negatively), and are relatively uncorrelated with each other.

[Figure: two plots of subgroups in a 2-dimensional domain]
The diagram on the left shows 82 subgroups discovered in a 2-dimensional domain. On the right, a Pattern Team of 4 subgroups captures most of the discriminatory power.

Bayesian Network of Patterns
A Bayesian Network captures the relationships between patterns. Similar patterns are connected. The Pattern Team (blue nodes) can be indicated in this network. The highlighted patterns tend to end up in separate clusters since, by definition, there is little redundancy in Pattern Teams.

Building Classifiers from Collections of Patterns
Turn the set of relevant binary features formed by the discovered patterns into a predictive model.

Two classifiers are available for turning patterns into classifiers:
• The Decision Table Majority classifier works on Pattern Teams only
• The Support Vector Machine works on both Pattern Teams and the whole collection of patterns

Pattern Team Decision Table
Four patterns lead to potentially 16 groups of molecules. A decision table shows how many molecules appear for each combination.

Decision Table Majority classifier
A Decision Table Majority classifier uses this decision table to classify new cases. Any Pattern Team can be used to build a classifier, regardless of the filtering mechanism used.
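To make the idea of a Decision Table Majority classifier concrete, here is a minimal Python sketch. It is not Safarii's implementation: the function names `fit_dtm` and `predict_dtm`, the toy data, and the fallback to the overall majority class for unseen combinations are all assumptions made for illustration. Each case is described by four binary features, one per pattern in the team.

```python
from collections import Counter, defaultdict


def fit_dtm(rows, labels):
    """Count class labels for each combination of binary pattern features."""
    table = defaultdict(Counter)
    for features, label in zip(rows, labels):
        table[tuple(features)][label] += 1
    # Assumed fallback for combinations never seen in training.
    default = Counter(labels).most_common(1)[0][0]
    return table, default


def predict_dtm(model, features):
    """Predict the majority class of the matching decision table row."""
    table, default = model
    counts = table.get(tuple(features))
    if not counts:
        return default
    return counts.most_common(1)[0][0]


# Hypothetical training data: four patterns, so up to 2**4 = 16 rows.
rows = [(1, 0, 0, 1), (1, 0, 0, 1), (1, 0, 0, 1),
        (0, 1, 0, 0), (0, 1, 0, 0),
        (0, 0, 0, 0), (0, 0, 0, 0)]
labels = ["mutagenic", "mutagenic", "non-mutagenic",
          "non-mutagenic", "non-mutagenic",
          "mutagenic", "mutagenic"]

model = fit_dtm(rows, labels)
print(predict_dtm(model, (1, 0, 0, 1)))  # majority of that row
print(predict_dtm(model, (1, 1, 1, 1)))  # unseen combination, uses fallback
```

Only combinations that actually occur in the training data get a row, so the table stays small even though 16 combinations are possible.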
Results on the test-set can be compared to the actual class.

A Decision Table Majority table can be stored in a database as a separate table. Simply classify future cases or data sets by joining over the first k columns. Alternatively, the DTM classifier can be applied to the currently defined test set, with results being stored in the database.

Support Vector Machine
Support Vector Machines find linear hyperplanes in the space of binary features resulting from the discovered patterns. The hyperplane attempts to separate the positive and negative cases as well as possible. The hyperplane is determined by assigning weights to patterns. The weights indicate the influence of individual patterns on the overall model.

Propositionalisation
Use the interesting patterns discovered to create a single table where the original cases are described in terms of binary features, each corresponding to a pattern: is a case covered or not?

id  f1  f2  f3
1   1   0   0
2   1   1   0
3   0   0   0

With a simple press of a button, a list of discovered subgroups can be stored as a table of binary data. Join this table with the original target table to obtain a rich propositional table that captures structural information. Any propositional Data Mining tool, as well as Safarii, can now be used to mine the flattened information.

Graph Mining Features
Make Safarii mimic the behaviour of Graph Mining systems, on a relational database.

Object Identity
Interpretation: ‘all molecules that contain a carbon atom and an atom of low charge (potentially the same atom)’. Can these two atoms actually be one and the same atom?

‘all molecules that contain at least one carbon atom’ or ‘all molecules that contain at least two carbon atoms’

Safarii lets you choose between two alternative semantics of Selection Graphs: traditional or object identity.
This last mode is common in Graph Mining, and allows for a rudimentary form of counting substructures.

‘Closed’ Patterns
In many cases, patterns capture the minimal requirements to select an interesting subgroup. Often one is interested in finding out what other features typically hold for such a pattern. Safarii lets you blow up a Selection Graph to get a closed pattern: no extra constraints can be added without reducing the size of the subgroup covered by the pattern. The result is a more complex Selection Graph, but the same subgroup.

Building Decision Lists
Build a Decision List classifier from the data directly, rather than first building a Pattern Team and deriving a classifier from this.

Decision List
Decision Lists are induced by a method called Separate & Conquer (also known as the covering approach): run Subgroup Discovery to find a good subgroup, then ‘remove’ this subgroup from the database, and continue in the same way with the remainder, etc. The result is a Decision List, an ordered list of Selection Graphs with associated predictions.

Decision List Characteristics
Decision Lists can be plotted in ROC space, in much the same way as collections of rules. The area under the ROC curve is a good measure of the quality of a Decision List. Other graphs are available to analyse the quality of each individual decision that appears in the Decision List.

Testing Decision List
Decision Lists are not only models of the data, they are also classifiers: they can be used to predict held-out data (test set) for validation, or to predict future data for which the target is unknown. Safarii lets you compare predictions made on the test-set to the actual target of those cases.
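The Separate & Conquer loop behind Decision Lists can be illustrated with a small single-table sketch in Python. This is a deliberate simplification and not Safarii's algorithm: the Subgroup Discovery step is reduced to scoring a fixed set of candidate conditions by novelty, and the attribute names (`logp`, `rings`), the candidate conditions and the toy data are all hypothetical.

```python
from collections import Counter


def novelty(cases, condition, target):
    """p(ST) - p(S)p(T), estimated on the given cases."""
    n = len(cases)
    covered = [c for c in cases if condition(c)]
    p_s = len(covered) / n
    p_t = sum(c["label"] == target for c in cases) / n
    p_st = sum(c["label"] == target for c in covered) / n
    return p_st - p_s * p_t


def learn_decision_list(cases, conditions, target):
    """Repeatedly pick the best subgroup, then remove the cases it covers."""
    remaining = list(cases)
    rules = []
    while remaining:
        name, cond = max(conditions.items(),
                         key=lambda item: novelty(remaining, item[1], target))
        if novelty(remaining, cond, target) <= 0:
            break  # no candidate is positively associated with the target
        rules.append((name, cond, target))
        remaining = [c for c in remaining if not cond(c)]
    # Default prediction: majority label of whatever is left uncovered.
    pool = remaining or cases
    default = Counter(c["label"] for c in pool).most_common(1)[0][0]
    return rules, default


def predict(decision_list, case):
    """Return the prediction of the first rule that covers the case."""
    rules, default = decision_list
    for _name, cond, prediction in rules:
        if cond(case):
            return prediction
    return default


cases = [
    {"logp": 3.1, "rings": 3, "label": "active"},
    {"logp": 2.8, "rings": 3, "label": "active"},
    {"logp": 0.5, "rings": 3, "label": "active"},
    {"logp": 2.5, "rings": 1, "label": "active"},
    {"logp": 0.4, "rings": 1, "label": "inactive"},
    {"logp": 0.9, "rings": 1, "label": "inactive"},
    {"logp": 1.0, "rings": 0, "label": "inactive"},
    {"logp": 2.2, "rings": 0, "label": "inactive"},
]
conditions = {
    "logp > 2": lambda c: c["logp"] > 2,
    "rings >= 3": lambda c: c["rings"] >= 3,
}

decision_list = learn_decision_list(cases, conditions, "active")
print([name for name, _cond, _pred in decision_list[0]], decision_list[1])
```

The order of the rules matters: a case is classified by the first rule that covers it, and only falls through to the default when no rule applies.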
The classification score tells you what score to expect on future data.

Deploying Decision List
In order to deploy the Decision List as a classifier for use in other applications or on-line scoring systems, Safarii allows the model to be exported as a series of SQL statements. Alternatively, if the data to be classified appears in the same table as the original data (identified as a different sample), the classifier can be applied directly, as shown before.

Pre-processing using ProSafarii
Use ProSafarii, the companion of Safarii, to pre-process the initial data, using a number of pre-defined transformations that are specifically relevant for MRDM. Opportunities for improvement are identified by ProSafarii automatically.

Pre-processing with ProSafarii
ProSafarii is a separate software package for pre-processing in a multi-relational domain. It considers any given relational database, and identifies opportunities for transformation. Selected operations can be executed automatically. The resulting modified database is visible to Safarii, and can be mined directly.
• Classes of transformation ProSafarii supports
• Opportunities for transformation
• Original or modified data model

Aggregation
Enrich the information available in individual tables, by applying aggregate functions to the relationships between pairs of tables.

ProSafarii: Aggregation
Enrich the data in one table by aggregating over a one-to-many relationship with a neighbouring table.

Discretisation
Transform or enrich numeric data appearing in any table, by applying a number of MRDM-sensitive discretisation procedures.

ProSafarii: Discretisation
• Different discretisation procedures, sensitive to the multi-relational structure of the data
• Two representations for the discretised attributes
• The original numeric columns can be kept alongside the new nominal attributes

Two Alternative Representations
• Nominal (1 attribute)
• Cumulative binary (n-1 attributes)

Sampling
Use ProSafarii to define multiple samples of your database, or use existing definitions of samples, in order to distinguish training and test sets, and apply cross-validation.

ProSafarii: Sampling
Divide the target table up into two or more subsets by defining a new attribute that identifies each sample. Either create two samples with a specified probability, or split into n samples with uniform distribution.

Use sampling in Safarii to compare different samples, classify test-sets, and do cross-validation. The training set is either the currently selected sample (positive), or all samples except the current one (negative).

Frequently Asked Questions (1)
• What operating systems does Safarii run on?
  – The software is written in Java, so it will run on most modern operating systems.
• I only have single table data. Can I still use Safarii?
  – Yes. Even though you will not be using Safarii to its full power, it still offers many features that are useful, and that cannot be found in competing single table tools.
• Can I deploy models created by Safarii in an operational environment?
  – Yes. Safarii allows you to define different data samples, such that you can apply your models to held-out data. Alternatively, you can save any model as a collection of SQL statements, to be applied to new data, or specific cases.

Frequently Asked Questions (2)
• Can we use Safarii under an academic license?
  – Yes, we offer an academic license at highly reduced rates. Please contact us for conditions and pricing.
• What database management systems can I use with Safarii?
  – Safarii is able to mine most modern relational data sources.
• Do you offer consulting services or support with Safarii?
  – Yes. We can help you set up a successful MRDM project with Safarii. We also provide a data analysis service where we analyse data for you, without the need to work with the system yourself or to purchase a license.

Contact Information
For all information about features of Safarii, pricing or consulting, please contact us here:

Kiminkii
P.O. Box 171
3995 DD Houten
the Netherlands
+31 6 24 61 25 60
[email protected]
www.kiminkii.com/safarii.html