IV. MODELS FROM DATA: Data mining
Marko Debeljak

Outline
A) THEORETICAL BACKGROUND
1. Knowledge discovery in databases (KDD)
2. Data mining
• Data
• Patterns
• Data mining algorithms
B) PRACTICAL IMPLEMENTATIONS
3. Applications:
• Equations
• Decision trees
• Rules

Methods
[Diagram contrasting the two approaches: classic predictive modelling starts from a hypothesis and applies methods to data to build a model of that hypothesis; knowledge discovery in data starts from the data and applies methods to extract new knowledge.]

Knowledge discovery in databases (KDD)
Frawley et al. (1991): "KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."
How do we find patterns in data?
Data mining (DM) is the central step of the KDD process: it is concerned with applying computational techniques to actually find patterns in the data (about 15-25 % of the effort of the overall KDD process). The surrounding steps are:
- step 1: preparing data for DM (data preprocessing),
- step 3: evaluating the discovered patterns (the results of DM).

When can patterns be treated as knowledge?
Frawley et al. (1991): "A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user's criteria) is called knowledge."
Condition 1: Discovered patterns should be valid on new data with some degree of certainty (typically prescribed by the user).
Condition 2: The patterns should potentially lead to some useful actions (according to user-defined utility criteria).

What can KDD and data mining contribute to the environmental sciences (e.g. agronomy, forestry, ecology, …)?
Environmental sciences deal with complex, unpredictable natural systems (e.g. arable, forest and water ecosystems) in order to answer complex questions, and the amount of collected environmental data is increasing exponentially. KDD was purposefully designed to cope with such complex questions about complex systems, for example:
- understanding the domain/system studied (e.g. gene flow, seed bank, life cycle, …),
- predicting future values of system variables of interest (e.g. the rate of out-crossing with GM plants at location x at time y, seed-bank dynamics, …).

What is data mining?
Data mining focuses on the discovery of previously unknown knowledge and integrates machine learning. Machine learning focuses on description and prediction, based on properties learned from empirical training data (examples) using computer algorithms. Learning from examples is called inductive learning. If the goal of inductive learning is to obtain a model that predicts the value of a target variable from the learning examples, it is called predictive or supervised learning.

Data mining (DM)
From a technical perspective, data mining is the process of automatically searching large volumes of data for patterns using algorithms; it is the application of machine learning techniques to data analysis problems.
The most relevant notions of data mining are:
1. Data
2. Patterns
3. Data mining algorithms

Data mining (DM) – data
1. What is data?
• PROPOSITIONAL data mining: data are stored in one flat table; each example (object) is described by a fixed number of attributes (properties of objects). Aggregating the data into a single table can lose information.
• RELATIONAL data mining: data are stored in the original tables or relations; there is no loss of information due to aggregation.
Example of a propositional (flat-table) dataset: rows are objects (sampling points), columns are their properties.

Distance (m)   Wind direction (°)   Wind speed (m/s)   Out-crossing rate (%)
10             123                  3                  8
12             88                   4                  7
14             121                  6                  3
18             147                  2                  4
20             93                   1                  5
22             115                  3                  1
…              …                    …                  …

• DATA STREAM mining: the data are not stored at all but flow continuously through an algorithm; each example can be propositional or relational.

Data mining (DM) – patterns
2. What is a pattern?
A pattern is defined as "a statement (expression) in a given language, that describes (relationships among) the facts in a subset of the given data and is (in some sense) simpler than the enumeration of all facts in the subset" (Frawley et al. 1991, Fayyad et al. 1996).
Classes of patterns considered in data mining:
A. equations,
B. decision trees, relational decision trees,
C. association, classification and regression rules.

A. Equations
To predict the value of a target (dependent) variable as a linear or nonlinear combination of the input (independent) variables:
- algebraic equations.
To predict the behaviour of dynamic systems, whose state changes over time:
- difference equations,
- differential equations.

B. Decision trees
To predict the value of one or several target (dependent) variables from the values of other (independent) variables with a decision tree. A decision tree has a hierarchical structure, where:
- each internal node contains a test on an independent variable (or variables),
- each branch corresponds to an outcome of the test (critical values of the independent variable(s)),
- each leaf gives a prediction for the value of the dependent (predicted) variable(s).

A decision tree is called:
- a classification tree when the class value in a leaf is discrete (a finite set of nominal values), e.g. (yes, no), (spec. A, spec. B, …);
- a regression tree when the class value in a leaf is a constant (from an infinite set of values), e.g. 120, 220, 312, …;
- a model tree when each leaf contains a linear model predicting the class value (a piecewise linear function), e.g.
  out-crossing rate = 12.3 × distance − 0.123 × wind speed + 0.00123 × wind direction
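To make the tree patterns concrete, here is a minimal sketch (illustrative only; the case studies below use WEKA, CLUS and related tools, not this code) that fits a regression tree to a toy out-crossing table like the one shown above and prints the induced tree:

# Illustrative only: a regression tree induced from a toy out-crossing dataset
# (values mirror the small propositional table shown earlier).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

data = pd.DataFrame({
    "distance_m":         [10, 12, 14, 18, 20, 22],
    "wind_direction_deg": [123, 88, 121, 147, 93, 115],
    "wind_speed_ms":      [3, 4, 6, 2, 1, 3],
    "outcrossing_pct":    [8, 7, 3, 4, 5, 1],
})
X = data.drop(columns="outcrossing_pct")
y = data["outcrossing_pct"]

# Each internal node tests one attribute; each leaf predicts a constant
# out-crossing rate (i.e. a regression tree, not a model tree).
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=2).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))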
Data mining (DM) – patterns
C. Rules
Association analysis between attributes is expressed as rules. A rule denotes a pattern of the form "IF conjunction of conditions THEN conclusion."
- For classification rules, the conclusion assigns one of the possible discrete values to the class (a finite set of nominal values), e.g. (yes, no), (spec. A, spec. B, spec. D).
- For predictive rules, the conclusion gives a prediction for the value of the target (class) variable (from an infinite set of values), e.g. 120, 220, 312, …

Data mining (DM) – algorithms
3. What is a data mining algorithm?
An algorithm in general is a procedure (a finite set of well-defined instructions) for accomplishing some task, which terminates in a defined end state.
A data mining algorithm is a computational process, defined by a Turing machine (Gurevich et al. 2000), for finding patterns in data.

Which algorithms do we use for discovering patterns? The selection of the algorithm depends on the problem at hand:
1. Equations – linear and multiple regression, equation discovery
2. Decision trees – top-down induction of decision trees
3. Rules – rule induction

1. Linear and multiple regression
• Bivariate linear regression: the predicted variable C (the "class" in machine-learning terms, which may be continuous or discrete) is expressed as a linear function of a single attribute A:
  C = α + β × A
• Multiple regression: the predicted variable C is expressed as a linear function of a multi-dimensional attribute vector (A1, …, An):
  C = Σ (i = 1 … n) βi × Ai

2. Top-down induction of decision trees
A decision tree is induced with the Top-Down Induction of Decision Trees (TDIDT) algorithm (Quinlan, 1986). Tree construction proceeds recursively, starting with the entire set of training examples (the entire table). At each step, an attribute is selected as the root of the (sub)tree and the current training set is split into subsets according to the values of the selected attribute.

3. Rule induction
A rule that correctly classifies some examples is constructed first. The positive examples covered by this rule are removed from the training set, and the process is repeated until no more examples remain.
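This rule-induction loop is often called sequential covering. The sketch below is a simplified illustration (not the specific rule learner used in the course): it induces one-condition IF-THEN rules over nominal attributes, each time choosing the condition with the highest precision and removing the positive examples it covers.

# Illustrative sequential covering: repeatedly pick the attribute-value
# condition that best separates the remaining positive examples, turn it
# into a rule and remove the examples it covers.
def covering(examples, target_attr, target_value):
    rules = []
    positives = [e for e in examples if e[target_attr] == target_value]
    negatives = [e for e in examples if e[target_attr] != target_value]
    while positives:
        best, best_precision = None, -1.0
        for attr in examples[0]:
            if attr == target_attr:
                continue
            for value in {e[attr] for e in examples}:
                pos = sum(1 for e in positives if e[attr] == value)
                neg = sum(1 for e in negatives if e[attr] == value)
                if pos == 0:
                    continue
                precision = pos / (pos + neg)
                if precision > best_precision:
                    best, best_precision = (attr, value), precision
        if best is None:
            break
        rules.append(best)
        positives = [e for e in positives if e[best[0]] != best[1]]
    return rules

weather = [
    {"outlook": "sunny",    "windy": "TRUE",  "play": "no"},
    {"outlook": "sunny",    "windy": "FALSE", "play": "no"},
    {"outlook": "overcast", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy",    "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy",    "windy": "TRUE",  "play": "no"},
]
for attr, value in covering(weather, "play", "yes"):
    print(f"IF {attr} = {value} THEN play = yes")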
Data mining vs. statistics
Common to both approaches: reasoning FROM the properties of a data sample TO the properties of a population.
• Statistics: hypothesis testing, when certain theoretical assumptions (data distribution, independence, random sampling, sample size, etc.) are satisfied. Main approach: best fit to all the available data.
• Data mining: automated construction of understandable patterns and structured models. Main approach: structuring the data space; heuristic search for decision trees, rules, … covering (parts of) the data space.

DATA MINING – CASE STUDIES

DATA MINING – Hands-on exercises
1. Data preprocessing

Applications – ecological modelling
Propositional and relational supervised data mining:
- simple data mining,
- data mining of time series,
- spatial data mining.
1. Equations:
• algebraic equations,
• differential equations.
2. Single- and multi-target decision trees:
• classification trees,
• regression trees,
• model trees (single target only).
Application areas: population dynamics, habitat modelling, gene flow modelling, risk modelling.

Algebraic equations: SOIL WATER FLOW
Problem: prediction of drainage water
Type of pattern: algebraic equation
Algorithm: CIPER

PCQE database
• Experimental site La Jaillière, western France, owned by ARVALIS
• Shallow silty clay soils
• 11 fields observed, field sizes about 0.3–1 ha
• Agricultural practices: fertilization, irrigation, phytochemical protection, harvesting, tillage
• Slope
• Water flow: drainage, runoff
• 25 campaigns (1987–2011); a campaign is defined as the period starting on 1 September and ending on 31 August of the following year

DRAINAGE predictive model – CIPER
• Polynomial equations induced on the data of a whole campaign with the CIPER algorithm
• Evaluation: "leave-one-out" approach

Results:

Fields   Test field   Std. dev.   RMSE     RRSE      Corr. coeff. (r)
All      T3           3.187       2.1119   66.26 %   0.7855
All      T4           3.188       1.7220   54.01 %   0.8273
All      T5           3.163       2.2478   71.06 %   0.7467
All      T6           3.096       2.1784   70.36 %   0.8096
All      T7           3.229       1.3286   41.15 %   0.7812
All      T8           3.210       1.3813   43.03 %   0.7839
All      T9           3.210       1.5927   49.62 %   0.7434
All      T10          3.130       1.6672   53.26 %   0.7766
All      T11          3.108       1.6274   52.36 %   0.7841
T3       T6           3.056       2.1251   69.54 %   0.8125
T6       T3           3.600       2.0961   58.22 %   0.7745

Predictive model (All/T4):
Drainage = 0.0196445 * RainfallA1 * Temp
         + 0.33246 * DrainageN1
         + 0.000662861 * CDCoef^2 * RainfallA1^2 * Slope
         + 0.00000107253 * Runoff * DrainageN1 * Temp^2 * RainfallA1^2
         - 0.00115983 * Runoff^2 * DrainageN1 * Slope^3
         - 0.00114057 * Temp^2 * RainfallA1
         + 0.00153725 * RainfallA1^2
         + 1.63563 * Runoff * Slope
         - 1.90622 * Runoff
         - 0.0231748 * Slope^2 * RainfallA1
         + 0.0654042 * RainfallA1 * Slope
         - 0.00755737 * Slope^3 * CDCoef^3 * Runoff^2
         + 0.0675951 * Slope
         - 0.146702

Constraint algebraic equations: GENE FLOW
Problem: prediction of gene flow
Type of pattern: constraint algebraic equation
Algorithm: Lagramge
[Field design diagram (Federal Biological Research Centre, BBA, Germany; field design 2000): a transgenic donor field (GM maize) next to a non-transgenic recipient field, separated by access paths; 96 sampling points laid out on a coordinate system; at each sampling point 60 cobs (about 2500 kernels) are scored for the percentage of out-crossing.]

Differential equations: COMMUNITY STRUCTURE
Problem: time-dependent ecosystem processes
Type of pattern: differential equations
Algorithm: Lagramge
Data: 1995 to 2002
Modelled processes:
• Phosphorus: water inflow, outflow, respiration, growth
• Phytoplankton: growth, respiration, sedimentation, grazing
• Zooplankton: feeding on phytoplankton, respiration, mortality

EQUATIONS – summary
• Algebraic equations – CIPER
• Constraint algebraic equations – Lagramge
• Differential equations – Lagramge
SUITABLE for predictions, UNSUITABLE for interpretation
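To illustrate the kind of differential-equation pattern such tools search for, the sketch below integrates a small, hypothetical phosphorus-phytoplankton-zooplankton model (generic rate terms chosen for illustration only, not the equations discovered by Lagramge in this case study):

# Illustrative only: a hypothetical nutrient-phytoplankton-zooplankton ODE system,
# not the model discovered from the 1995-2002 data in this case study.
import numpy as np
from scipy.integrate import solve_ivp

def community(t, state):
    phosphorus, phyto, zoo = state
    growth = 0.8 * phosphorus * phyto        # phytoplankton growth consumes phosphorus
    grazed = 0.4 * phyto * zoo               # zooplankton feeds on phytoplankton
    d_phosphorus = 0.05 - growth - 0.02 * phosphorus   # inflow, uptake, outflow
    d_phyto = growth - grazed - 0.05 * phyto            # growth, grazing, respiration/sedimentation
    d_zoo = 0.5 * grazed - 0.1 * zoo                    # assimilated food, respiration/mortality
    return [d_phosphorus, d_phyto, d_zoo]

solution = solve_ivp(community, t_span=(0.0, 100.0), y0=[1.0, 0.1, 0.05],
                     t_eval=np.linspace(0.0, 100.0, 201))
print(solution.y[:, -1])    # state of the three variables at the end of the run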
Decision trees: HABITAT MODELS
Problem: classification of habitats (suitable, unsuitable)
Type of pattern: classification decision trees
Algorithm: J4.8

[Map: observed locations of brown bears across the study area.]

The training dataset
• Positive examples:
- locations of bear sightings (Hunting Association; telemetry),
- females only,
- home-range (HR) areas used instead of the "raw" locations,
- a narrower HR for the optimal habitat, a wider one for the maximal habitat.
• Negative examples:
- sampled from the unsuitable part of the study area,
- stratified random sampling,
- the different land cover types are equally accounted for.
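A minimal sketch of the negative-example construction just described, assuming a hypothetical table of grid cells from the unsuitable part of the study area with a land_cover column (the file name, column names and sample size are illustrative, not taken from the original study):

# Illustrative sketch: stratified random sampling of negative (absence) examples,
# so that each land cover type is equally represented. All names are hypothetical.
import pandas as pd

unsuitable_cells = pd.read_csv("unsuitable_area_cells.csv")   # hypothetical input file
per_class = 200                                               # illustrative sample size

negatives = (unsuitable_cells
             .groupby("land_cover", group_keys=False)
             .apply(lambda grp: grp.sample(n=min(per_class, len(grp)), random_state=0)))
negatives["present"] = 0          # class label: bear absent
print(negatives["land_cover"].value_counts())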
Decision trees: HABITAT MODELS
Propositional dataset (example rows; class labels: present = 1, absent = 0):
1,73,26,0,0,1,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,0,0,0,4123,0,0,0,0,63,211,11,11,11,83,213,213,0,0,4155.
1,62,37,0,0,2,88,0,2,70,7,20,1,0,0,1,0,60,0,0,0,0,0,2,1,53,0,3640,0,0,0,-1347,63,211,11,11,11,83,213,213,11,89,3858.
2,0,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,6,82,0,10404,0,2074,-309,48,0,0,11,11,11,83,83,83,0,20,3862.
2,0,100,0,0,1,76,0,16,71,0,12,0,0,0,0,0,0,0,0,0,0,0,1,6,82,0,7500,0,1661,-319,-942,0,0,11,11,11,0,0,0,0,20,4088.
1,8,91,0,0,1,52,0,59,41,0,0,0,0,0,0,0,4,0,0,0,0,5,1,6,82,0,6500,0,1505,-166,879,9,57,11,11,11,281,281,281,0,20,3199.
4,3,0,86,9,0,75,0,33,67,0,0,0,0,0,0,1,2,0,0,0,0,0,1,2,54,0,0,0,465,-66,-191,4,225,11,31,31,41,72,272,60,619,4013.
1,34,65,0,0,2,51,9,76,9,5,1,4,1,0,1,0,29,0,0,0,0,0,1,2,54,0,3000,0,841,-111,-264,34,220,11,41,41,151,141,112,60,619,3897.
1,100,0,0,0,3,52,0,86,6,3,5,9,6,7,38,40,0,0,0,0,0,0,1,17,64,0,8062,0,932,-603,-71,100,337,11,41,41,171,232,202,4,24,3732.
…
Two classification trees were induced: a model for the maximal habitat and a model for the optimal habitat.
• Map of the maximal habitat: 39 % of the Slovenian territory.
• Map of the optimal habitat: 13 % of the Slovenian territory.

Multi-target predictions: COMMUNITY STRUCTURE
Problem: temporal prediction of the height and cover of crop and weeds
Type of pattern: (a) multi-target regression trees / time series, (b) constraint multi-target regression clustering trees
Algorithm: CLUS

Data
• 130 sites, monitored every 7 to 14 days for 5 months (2665 samples: 1322 conventional and 1333 HT OSR observations)
• Each sample (observation) is described by 65 attributes
• Original data collected by the Centre for Ecology and Hydrology, Rothamsted Research and SCRI within the Farm Scale Evaluation programme (2000, 2001, 2002)

Results, scenario A: multi-target regression tree
Targets: Avg Crop Cover, Avg Weed Cover
Excluded attributes: /
Constraints: MinimalInstances = 64.0; MaxSize = 15
Predictive power (for the two targets, respectively): corr. coef. 0.8513 and 0.3746; RMSE 16.504 and 12.6038; RRMSE 0.5248 and 0.9301

Results, scenario B: constraint predictive clustering trees for time series (scenario 3.9)
Target: Avg Weed Cover (time series)
Constraints: syntactic, MinInstances = 32
Predictive power: TSRMSExval 4.98; TSRMSEtrain 4.86; ICVtrain 30.44
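A multi-target tree predicts several target variables with a single tree. The course uses CLUS for this; the sketch below only illustrates the idea with scikit-learn's DecisionTreeRegressor, which accepts a multi-output target matrix (the data are synthetic placeholders, not the Farm Scale Evaluation dataset):

# Illustrative only: one regression tree predicting two targets at once
# (synthetic stand-in data, not the FSE data used in the case study).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
days_after_sowing = rng.uniform(0, 150, size=200)
temperature = rng.uniform(5, 25, size=200)
X = np.column_stack([days_after_sowing, temperature])

avg_crop_cover = np.clip(0.6 * days_after_sowing + rng.normal(0, 5, 200), 0, 100)
avg_weed_cover = np.clip(0.1 * days_after_sowing + temperature + rng.normal(0, 5, 200), 0, 100)
Y = np.column_stack([avg_crop_cover, avg_weed_cover])     # two targets per example

# min_samples_leaf plays a role similar to the MinimalInstances constraint above.
tree = DecisionTreeRegressor(min_samples_leaf=64, max_depth=4).fit(X, Y)
print(tree.predict([[60.0, 15.0]]))   # -> [[predicted crop cover, predicted weed cover]]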
Relational data mining: GENE FLOW MODELLING
(WP2 contribution to the SIGMEA Final Meeting, Cambridge, UK, 16-19 April 2007)
Problem: classification of fields as above or below 0.9 % out-crossing
Type of pattern: relational classification decision tree
Algorithm: TILDE
Spatio-temporal relations:
• 2004: 40 GM fields, 7 non-GM fields, 181 sampling points
• 2005: 17 GM fields, 4 non-GM fields, 127 sampling points
• 2006: 43 GM fields, 4 non-GM fields, 112 sampling points

The data are scattered over several tables or relations:
• a table storing general information on each field (e.g. area),
• a table storing the cultivation techniques for each field and each year,
• a table storing the relations (e.g. distances) between fields.
Relational database system: PostGIS.

Building the model – relational data analysis:
• Algorithm TILDE (Blockeel and De Raedt, 1998; De Raedt et al., 2001), an upgrade of the C4.5 algorithm for classification decision trees (Quinlan, 1993) to the relational setting
• The algorithm is included in the ACE-ilProlog data mining system

Results: relational classification trees induced for out-crossing thresholds of 0.01 %, 0.45 % and 0.9 %.

Multi-target regression model: RISK MODELLING
Problem: prediction of the response of soil exposed to disturbances
Type of pattern: multi-target regression trees (resistance, resilience)
Algorithm: CLUS

The dataset: soil samples taken at 26 locations throughout Scotland, stored in a flat table (26 rows with 18 data entries each).
• Physical properties: soil texture (sand, silt, clay)
• Chemical properties: pH, C, N, SOM (soil organic matter)
• FAO soil classification: order and suborder
• Physical resilience: resistance to compression (1/Cc), recovery from compression (Ce/Cc), recovery from overburden stress (e.g. after two-day cycles: eg2dc)
• Biological resilience: heat, copper

Macaulay Institute (Aberdeen): soils data – attributes and maps
• Approximately 13 000 soil profiles held in the database
• Descriptions of over 40 000 soil horizons

Decision trees – summary
• Single or multiple (multi-target) decision trees
• Classification, regression and model trees
• Propositional and relational data
• Temporal and spatial data
SUITABLE for predictions, SUITABLE for interpretation

Conclusions
What can data mining do for you? Knowledge discovered by analysing data with DM techniques can help to:
• understand the domain studied,
• make predictions and classifications,
• support decision processes in environmental management.

What can data mining not do for you?
• The law of information conservation holds (garbage in, garbage out): the knowledge we are seeking to discover has to come from the combination of data and background knowledge.
• If we have very little data of very low quality and no background knowledge, no form of data analysis will help.

Side-effects?
• Discovering problems with the data during the analysis:
- missing values,
- erroneous values,
- inappropriately measured variables.
• Identifying new opportunities:
- new problems to be addressed,
- recommendations on what data to collect and how.

DATA MINING – data preprocessing
DATA FORMAT
• File extension: .arff
• This is a plain-text format; the files should be edited with editors such as Notepad, TextPad or WordPad (which do not add extra formatting information)
• The file consists of:
- the title: @relation NameOfDataset
- the list of attributes: @attribute AttName AttType, where AttType can be 'numeric' or a nominal list of categorical values, e.g. '{red, green, blue}'
- the data: @data (on a separate line), followed by the actual data in comma-separated value (CSV) format

Example:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
…

DATA MINING – data preprocessing (Excel)
• Put attributes (variables) in columns and cases in rows
• Use a decimal POINT, not a decimal COMMA, for numbers
• Save the Excel sheet as a CSV file

DATA MINING – data preprocessing (TextPad, Notepad)
• Open the CSV file
• Delete the " " at the beginning of lines and save (just save, do not change the format)
• Change all ; to ,
• The numbers must have a decimal dot (.) and not a decimal comma (,)
• Save the file as a CSV file (do not change the format)

DATA MINING – Hands-on exercises
2. Data mining
• WEKA: http://www.cs.waikato.ac.nz/ml/weka/index.html
• Open the CSV file in WEKA
• Select the algorithm and the attributes
• Perform data mining
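The CSV preparation and ARFF format described above can also be scripted instead of edited by hand. The sketch below is an illustrative helper for the weather example (the file names, the semicolon delimiter and the nominal/numeric split are assumptions, not part of WEKA):

# Illustrative helper: turn a semicolon-delimited export with decimal commas
# into the ARFF format described above. Names and types are assumptions for
# the weather example; adapt them to your own data.
import csv

nominal = {"outlook": "{sunny, overcast, rainy}",
           "windy": "{TRUE, FALSE}",
           "play": "{yes, no}"}      # every other attribute is written as numeric

with open("weather.csv", newline="") as src, open("weather.arff", "w") as dst:
    rows = list(csv.reader(src, delimiter=";"))
    header, data = rows[0], rows[1:]
    dst.write("@relation weather\n")
    for name in header:
        dst.write(f"@attribute {name} {nominal.get(name, 'numeric')}\n")
    dst.write("@data\n")
    for row in data:
        # replace decimal commas with decimal points, join values with commas
        dst.write(",".join(value.strip().replace(",", ".") for value in row) + "\n")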
How to select the "best" classification tree?
Performance of the classification tree:
• classification accuracy = (correctly classified examples) / (all examples),
• [ROC curve: true positive rate plotted against the false positive rate],
• the confusion matrix, showing actual versus predicted classifications.

Classification trees: J48
Interpretable size:
- pruned or unpruned,
- minimal number of objects per leaf.
The numbers shown at each leaf are the number of instances classified in that leaf and the number of instances INCORRECTLY classified in it. They can appear as:
- (13): no incorrectly classified instances,
- (3.5/0.5): instances are split into fractions because of missing values (?),
- (0.0): a split on a nominal attribute where one or more of its values do not occur in the subset of instances at the node in question.

How to select the "best" regression / model tree?
Performance and interpretable size of the regression / model tree:
- pruned or unpruned,
- minimal number of objects per leaf.
Each leaf shows the number of instances that REACH that leaf, together with the root mean squared error (RMSE) of the predictions of the leaf's linear model for those instances, expressed as a percentage of the global standard deviation of the class attribute (i.e. the standard deviation of the class attribute computed from all the training data). These percentages do not sum to 100 %; the smaller the value, the better.

Accuracy and error
Avoid overfitting the data by tree pruning. Pruned trees are:
- less accurate (percentage of correct classifications) on the training data,
- more accurate when classifying unseen data.

How to prune optimally?
• Pre-pruning: stop growing the tree, e.g. when a data split is not statistically significant or when too few examples remain in a split (minimum number of objects per leaf).
• Post-pruning: grow the full tree, then post-prune (confidence factor for classification trees).

Optimal accuracy
10-fold cross-validation is a standard classifier evaluation method in machine learning:
- break the data into 10 sets of size n/10,
- train on 9 of the sets and test on the remaining one,
- repeat 10 times and take the mean accuracy.
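A minimal sketch of this 10-fold cross-validation scheme, using scikit-learn's cross_val_score with a classification tree on a bundled example dataset (WEKA offers the same evaluation option built in; this code is only an illustration):

# Illustrative only: 10-fold cross-validation of a classification tree
# on a bundled example dataset (not the course data).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
classifier = DecisionTreeClassifier(min_samples_leaf=2, random_state=0)

# Break the data into 10 folds, train on 9 and test on the remaining one,
# repeat for every fold and report the mean accuracy.
scores = cross_val_score(classifier, X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())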