Download Data mining with GUHA – Part 2 GUHA produces hypothesis

TAM002 Data mining with GUHA – Part 2 GUHA produces hypothesis Esko Turunen1 1 Tampere University of Technology Esko Turunen http://www.vrtuosi.com GUHA does not test hypothesis, instead GUHA produces them In this chapter we learn the outlines of GUHA e.g. that • GUHA is suitable for exploratory analysis of large data • GUHA goes through 2 × 2 contingency tables and picks up the interesting ones • in logic terms, GUHA is based on first order monoidal logic whose models are finite • a fundamental reference, the ‘GUHA book’, is free to be downloaded from www.cs.cas.cz/hajek/guhabook/ Esko Turunen http://www.vrtuosi.com ♣ 1. GUHA (General Unary Hypotheses Automaton) is a method of automatic generation of hypotheses based on empirical data, thus a method of data mining. • GUHA is one of the oldest methods of data mining – GUHA was introduced by Hájek, Havel and Chytil in The GUHA method of automatic hypotheses determination published in Computing 1 (1966) 293–308. • GUHA still develops: there are dozens of new features added to GUHA during the past ten years. • GUHA is a kind of automated exploratory data analysis. Instead of testing given hypothesis supported by a data, GUHA procedure generates systematically hypotheses supported by the data. By GUHA it is possible to find all interesting dependences hidden in a data. Esko Turunen http://www.vrtuosi.com ♣ 2. GUHA is suitable for exploratory analysis of large data. • The processed data form a rectangle matrix, where rows corresponds to objects belonging to the sample and each column corresponds to one investigated variable. A typical data matrix processed by GUHA has hundreds or thousands of rows and tens of columns. • Cells of the analyzed data matrix can contain whatever symbols, however, before a concrete data mining task the data must be categorized to contain only 0s or 1s (or are empty). • Exploratory analysis means that there is no single specific hypothesis that should be tested by our data; rather, our aim is to get orientation in the domain of investigation, analyze the behavior of chosen variables, interactions among them etc. Such inquiry is not blind but directed by some general (possibly vague) direction of research (some general problem). Esko Turunen http://www.vrtuosi.com ♣ 3. GUHA systematically creates all hypotheses interesting from the point of view of a given general problem and on the base of given data. • This is the main principle: all interesting hypotheses. Clearly, this contains a dilemma: all means most possible, only interesting means not too many. To cope with this dilemma, one may use different GUHA procedures, implemented in LISp–Miner system, and having selected one, by fixing in various ways its numerous parameters. • The LISp–Miner system leads the user and makes the selection of parameters relatively easy but somewhat laborious. LISp–Miner cannot be as automatized as much as e.g. B–course is automatized, however, the results of LISp–Miner are much more illustrative and detailed. Esko Turunen http://www.vrtuosi.com Three remarks: • GUHA procedures polyfactorial hypotheses i.e. not only hypotheses relating one variable with another one, but expressing relations among single variables, pairs, triples, quadruples of variables etc. • GUHA offers hypotheses. Exploratory character implies that the hypotheses produced by the computer (numerous in number: typically tens or hundreds of hypotheses) are just supported by the data, not verified. You are assumed to use this offer as inspiration, and possibly select some few hypotheses for further testing. • GUHA is not suitable for testing a single hypothesis: routine packages are good for this. Esko Turunen http://www.vrtuosi.com ♣ 4. 4ft–miner procedure generates hypotheses (or observational statements) on association between complex Boolean formulae (attributes). These formulae are constructed from unary predicates (corresponding to the columns of the processed 0/1–data matrix) by logical connectives ∧, ∨, ¬ (conjunction, disjunction, negation). Examples of predicates are TEMPERATURE : ≥ 38◦ C , PRESURE: high , DIAGNOSIS: infection Examples of formulae are [TEMPERATURE : ≥ 38◦ C] ∧ [PRESURE: high] and [DIAGNOSIS: infection] An example of a hypothesis is [TEMPERATURE : ≥ 38◦ C] ∧ [PRESURE: high] ≈ [DIAGNOSIS: infection] More generally, hypotheses are of form φ ≈ ψ. Notice that our terminology differs from the original GUHA approach: we want to keep things as simple as possible! Esko Turunen http://www.vrtuosi.com Given the 0/1–data matrix, each pair of Boolean attributes φ , ψ determines its four-fold frequency table; the association of with is defined by choosing an associational or generalized quantifier i.e. a function assigning to each four-fold table either 1 (associated or true in the data) or 0 (not associated or false in the data) and satisfying some natural monotonicity conditions. ♣ The four–fold table has the form: φ ¬φ ψ a c a+c =k ¬ψ b d b+d =l a+b =r c+d =s m where a + b + c + d = m and • a is the number of objects satisfying both φ and ψ, • b is the number of objects satisfying φ but not ψ, • c is the number of objects not satisfying φ but satisfying ψ, • d is the number of objects not satisfying φ nor ψ. Esko Turunen http://www.vrtuosi.com ♣ 6. There are various types of generalized quantifiers formalizing various kinds of associations: • Implicational quantifiers formalize the association many φ are ψ, they do not depend on the values c, d. • Comparative quantifiers formalize the association φ makes ψ more likely than ¬ψ. • Some quantifiers just express observations on the data, some others serve as tests of statistical hypotheses on unknown probabilities. • Some quantifiers ones symmetric: φ ≈ ψ implies ψ ≈ φ, some admit negation: φ ≈ ψ implies ¬φ ≈ ¬ψ. 4ft-Miner procedure contains dozen of generalized quantifiers, a novel procedure Ac4ft–Miner offers also action quantifiers. An advantage of GUHA is that new quantifiers can be defined and their properties can be analyzed in a well establish logic framework. Esko Turunen http://www.vrtuosi.com ♣ 7. After preparing the original data matrix to a 0/1-format, the user can start the 4ft–Miner procedure. The input of such a procedure consists of (1) the 0/1–data matrix (2) parameters determining symbolic restriction to the pairs ψ, φ of Boolean attributes to be generated; ψ is called antecedent and φ is called succedent (3) the quantifier to be used and some parameters of the quantifier (4) some other determinations, for example the user has to declare predicates that can occur in the antecedent and the succedent, the use of logic connectives, minimal and maximal length of antecedent and succedent, way to process of missing data etc. Esko Turunen http://www.vrtuosi.com 8. 4ft–Miner produces all associations φ ≈ ψ satisfying the syntactic restrictions and true in the data. The generation is not done blindly but uses various techniques serving to avoid exhaustive search. The found associations together with various parameters are not mechanically printed but saved in a solution file for further processing. 9. 4ft–Result module for interpretation of results enables the user to browse the associations’ format, sort them according to various criteria, select reasonably defined subsets and output concise information of various kinds. There are however other procedures implemented in the LISp–Miner system that do mine for other kinds of patterns, even for more complex one than associational rules. Esko Turunen http://www.vrtuosi.com 10. The GUHA method has deep logical and statistical foundations. GUHA is being further developed at the institute of Computer Science of the Academy of Sciences of the Czech Republic (Petr Hájek and his group), at the Prague University of Economics (Jan Rauch and his group) and at Tampere University of Technology (Esko Turunen and his group). Esko Turunen http://www.vrtuosi.com Example Assume we are observing children who have an allergic reaction to, say, tomato, apple, orange, cheese or milk. These observations are presented in the following table: Child Anna Aina Naima Rauha Kai Kille Lempi Ville Ulle Dulle Dof Kinge Laade Koff Olvi Tomato 1 1 1 0 0 1 0 1 1 1 1 0 0 1 0 Apple 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 Esko Turunen Orange 0 1 1 1 0 0 1 0 0 1 1 1 0 0 1 Cheese 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 http://www.vrtuosi.com Milk 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 Thus, we have observations as Child x is allergic to milk and Child y is allergic to cheese, We write shorter Milk(x) and Cheese(y ). • Milk(-), Cheese(-), Tomato(-), Orange(-) and Apple(-) are unary predicates of our observational language and x, y , z, · · · are variables. • Expressions like Milk(x) or Cheese(y ) are atomic (open) formulae. Combine formulae by logical connectives ¬ (not), ∧ (and) ∨ (or). E.g. Milk(x)∧¬Cheese(x) would mean Child x is allergic to milk and is not allergic to cheese. • However, in stead of open formulae, we are more interested in universal closed formulae, e.g. All children are allergic to milk, Most children are not allergic to orange, There is a child allergic to tomato, In most cases if a child is allergic to milk then she/he is allergic to cheese, too. Esko Turunen http://www.vrtuosi.com Everyone who passed a basic course in logic would know that statements like All children are allergic to milk and There is a child allergic to tomato are expressible by ∀xMilk(x) and ∃xTomato(x). Unfortunately classical mathematical quantifiers ∀ and ∃ are rather useless in the real world. Much more valuable are generalized quantifiers like In most cases, e.g. in the statement In most cases if a child is allergic to milk then she/he is allergic to cheese, too. GUHA method is a logic formalism of generalized quantifiers for data mining purposes. Esko Turunen http://www.vrtuosi.com

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data mining with GUHA – Part 2 GUHA produces hypothesis