Self-Organised Data Mining – 20 Years after GUHA-80
Martin Kejkula
KEG, 8th April 2004
http://gama.vse.cz/keg/

Agenda
• Idea of Self-Organised Data Mining
• GUHA-80 revival
• Process of Self-Organised Data Mining
• Key factors for Self-Organised Data Mining
• Metabase, Knowledge Base, etc.
• Proposed EverMiner system for Self-Organised Data Mining

Introduction
• Motivation: support X-Miner users
  • Best practices, known problems collection
• Mueller, Lemke: Self-Organising Data Mining (2000)
• My thesis:
  • Design/test strings of jobs for EverMiner
  • Formalization/using heuristics

References (1)
• Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134
• Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag 1982, pp. 54-60

References (2)
• Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982
• Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/

References (3)
• Rauch, J.: EverMiner – studie projektu [EverMiner project study]. Dokumentace projektu LISp-Miner [LISp-Miner project documentation], 2003.
• Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.

GUHA-80: Main Features
• Application of artificial intelligence to exploratory data analysis
• To generate interesting views onto given empirical data (recognize interesting logical patterns)
• Views: relevant, useful

GUHA-80 Sources (1)
• GUHA
  • Automatically generate all interesting hypotheses
• Lenat’s AM
  • Jobs (tasks)
  • Agenda of jobs
  • Hundreds of heuristic rules
  • Concepts

GUHA-80 Sources (2)
GUHA-80 vs. Lenat’s AM
• Data
• Data-processing procedures
• Statistical program packages
• Effective modules

GUHA-80 Paradigm
• Open-ended data analysis
• To maximize interestingness value
• Hundreds of heuristic rules
• Guide to define and study next step
  • Access potentially relevant rules
  • Find truly relevant rules
  • Follow truly relevant rules

Interestingness in GUHA-80
• No explicit definition
• Determined by interplay of
  • Heuristic rules
  • Weighting mechanisms
• Testing in practice (adequate behaviour?)
• No algorithm, but constraints

Principles of GUHA-80
• Domain dependence (…exploratory data analysis)
• Join human possibilities with machine
• More heuristics are relevant
• Interactivity with user
• Non-routine (GUHA-80 not for every-day data processing)

GUHA-80 Structure (1)
[Figure not preserved]

GUHA-80 Structure (2)
• Input empirical data
• Input parameters
  • How “interestingness” is understood
• Effective modules (system’s knowledge)
  • Clustering procedures
  • GUHA procedures
• Agenda of jobs (priority/weight)

GUHA-80 Structure (3)
• Heuristics: optimal way to realize a job
• Changing system of concepts
• Hierarchy of concepts (applicability)
• Possible unification of heuristics, jobs,…

GUHA-80 Input Data
• Input information
  • Decompositions/orderings of sets of quantities
  • Help understand “interestingness”

GUHA-80 Effective modules
• Evaluation of usual statistical characteristics,…
• Complicated procedures
• Synthesis of parameters (“job on job”)

GUHA-80
• Hundreds of heuristic rules
• No explicit definition of interestingness (exploration in a space)
• Interactivity with the user
• Non-routine character

Process of S-O Data Mining
• Empirical Data
• Domain Knowledge,…
• Chains of Data & Knowledge Processing Tasks
• All Interesting Views, Patterns
• DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …
[Figure not preserved]

Key Factors of S-O Data Mining
• Data Preparation
• Modeling
• Evaluation
• Knowledge Base
• Domain Knowledge

Data Preparation
• Discretization
  • Attribute type dependent: Nominal/Ordinal/Interval/Ratio
  • Type of coefficient dependent
• Discretization-Modeling Cycle (KL, 4ft, CF,…)
• Known problem with intervals of categories without values
• Usually not one target attribute

Attribute type dependent discretization
• Nominal
  • Classes of values
• Ordinal
  • Extreme/missing values
• Type of coefficient
• Usually not one target attribute

Intervals of Categories without Values
[Figure not preserved]
• Solution:
  • Statistics – extreme values
  • 4ft Task: correlations, implications
• Potentially interesting patterns

Extreme/Missing Values
• 4ft: Find associations between extreme/missing values (impl/correl)
• CF, KL: Find patterns with extreme/missing values

Data Preparation
• Classes of attributes
• Partial cedents
• Associations between attributes in one class
• Associations between partial cedents

Evaluation-Modeling
• Input information for partial cedents
• Mining for Interesting Patterns
  • Exceptions
  • Missing values
  • Extreme values
• Discovered hypotheses
  • Groups of hypotheses
  • Coverage hypotheses/input data

Heuristic Rules (1)
Examples:
• IF more extreme/missing values found THEN search for associations with extreme/missing values
• IF 0 hypotheses found THEN set up weaker quantifier (p, Base) values
• IF a subset of input data is not covered by hypotheses THEN search for associations covering these data

Heuristic Rules (2)
Examples:
• IF nominal type of column (input data matrix) AND no associated table for discretization THEN each value is one category (attribute creation)
• Use “subset” coefficient type for nominal attributes

Metabase, Knowledge Base
• Metadata (Knowledge):
  • Results of Previous X-Miner Tasks
  • Domain Knowledge
  • Interaction with User (learning?)

GUHA-80 vs. X-Miner (1)
• Task parameters (partial cedents, …)
• SW, HW
• Experience with LM applications,…

GUHA-80 vs. X-Miner (2)
• More complex heuristics

EverMiner – Features
• Based on LISp-Miner (X-Miners)
• Agenda of jobs, priority/strings
• Heuristics
• Interaction with user
• Enables repeating the process on new data (“check” vs. new KDD process)

EverMiner – where we are
• Experience (medicine, traffic, shares, sociology,…)
• Heuristics collection (www, brainstorming)
• Co-operation with data preparation experts (FEL, SumatraTT)
• Testing “Strings of jobs” (learning)

Discussion
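
Illustration: agenda of jobs driven by heuristic rules

The slides describe an agenda of jobs ordered by priority and IF/THEN heuristic rules that react to a job's results by scheduling further jobs. The Python sketch below is one minimal way to read that idea; it is not the GUHA-80 or EverMiner implementation. The class `Agenda`, the rule functions, the job name `4ft-uncovered`, the result keys (`hypotheses`, `uncovered_rows`), the relaxation step sizes and the stand-in `toy_miner` are all hypothetical names and values chosen for illustration. Only two of the example rules from the "Heuristic Rules (1)" slide are encoded.

```python
import heapq
from itertools import count


class Agenda:
    """Priority queue of jobs; a lower priority number runs earlier."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker: equal priorities keep insertion order

    def add(self, priority, name, **params):
        heapq.heappush(self._heap, (priority, next(self._seq), name, params))

    def pop(self):
        _, _, name, params = heapq.heappop(self._heap)
        return name, params

    def __bool__(self):
        return bool(self._heap)


def relax_quantifier(name, params, result):
    """IF 0 hypotheses found THEN retry with weaker (p, Base) values."""
    if name == "4ft" and result["hypotheses"] == 0 and params["p"] > 0.5:
        weaker = {"p": round(params["p"] - 0.1, 2),      # assumed step size
                  "base": max(1, params["base"] // 2)}   # assumed step size
        return [(1, "4ft", weaker)]
    return []


def cover_rest(name, params, result):
    """IF some input rows are not covered by hypotheses THEN mine those rows."""
    if name == "4ft" and result["hypotheses"] > 0 and result["uncovered_rows"]:
        return [(2, "4ft-uncovered", {"rows": result["uncovered_rows"]})]
    return []


RULES = [relax_quantifier, cover_rest]


def run_agenda(agenda, run_job, rules=RULES, max_jobs=20):
    """Run jobs by priority; each rule may schedule follow-up jobs."""
    log = []
    while agenda and len(log) < max_jobs:  # max_jobs guards against endless chains
        name, params = agenda.pop()
        result = run_job(name, params)
        log.append((name, params))
        for rule in rules:
            for priority, new_name, new_params in rule(name, params, result):
                agenda.add(priority, new_name, **new_params)
    return log


def toy_miner(name, params):
    """Stand-in for a real mining procedure; the counts are made up."""
    if name == "4ft":
        found = 3 if params["p"] <= 0.7 else 0
        return {"hypotheses": found, "uncovered_rows": [4, 9] if found else []}
    return {"hypotheses": 1, "uncovered_rows": []}


agenda = Agenda()
agenda.add(0, "4ft", p=0.9, base=20)
log = run_agenda(agenda, toy_miner)
for name, params in log:
    print(name, params)
```

With the toy miner above, the first 4ft job finds nothing, so the relaxation rule reschedules it twice with weaker settings until hypotheses appear; the coverage rule then adds one follow-up job for the uncovered rows. Keeping the rules as plain functions that return new jobs makes it easy to add or reorder heuristics without touching the loop, which seems close in spirit to the "strings of jobs" idea, though the real system may be organised quite differently.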