Trend Analysis in Stulong Data

Jiří Kléma, Lenka Nováková, Filip Karel, Olga Štěpánková
The Gerstner Laboratory for Intelligent Decision Making and Control
Department of Cybernetics, Czech Technical University, Prague
PKDD 2004, Discovery Challenge

Outline
▪ Previous CTU entry
  – subgroup discovery (ENTRY), general CVD model
  – trend analysis: global approach vs. windowing
▪ Role of windowing in mining trends
  – KM, Cox models in medicine
  – (symbolic) temporal trends in data mining
▪ Development of windowing approach
  – temporal CVD definition
  – role of the window length
  – multi-feature interactions
▪ Ordinal association rules
  – processing of the windowed features

STULONG Data
▪ Four tables: Entry, Control, Letter, Death
▪ Dependent variable: (static) CVD – CardioVascular Diseases
  – Boolean attribute derived from the A2 questionnaire (Control table)
  – CVD = false: the patient has no coronary disease
  – CVD = true: the patient has one of these attributes true (Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14)
    • positive angina pectoris
    • (silent) myocardial infarction
    • cerebrovascular accident
    • ischemic heart disease
  – We remove patients who have diabetes (Hodn4) or cancer (Hodn15) only.

ENTRY - subgroup discovery
▪ AQ no.6: Are there any differences in the ENTRY examination for different CVD groups?
▪ Statistica 6.0
  – module for interactive decision tree induction
  – two-tailed t-test or chi-square test to assess significance of subgroups
▪ Dependencies are relatively weak
▪ Interesting dependencies found
  – social characteristics: derived attribute AGE_of_ENTRY
  – alcohol: “positive effect” of beer, no effect of wine
  – sugar consumption increases CVD risk
  – well-known dependencies are not mentioned (smoking, BMI, cholesterol)

ENTRY - general model
▪ General CVD model (in WEKA)
  – feature selection + modeling (e.g., decision trees)
  – tends to generate trivial models (always predicting false)
  – asymmetric error-cost matrix does not help
▪ Predict CVD risk
  – Identify principal variables (Chi-squared test)
  – Naïve Bayes + ROC evaluation
  – three independent variables
    • discretized AGE_of_ENTRY
    • discretized BMI
    • Cholrisk - derived from CHLST
  – AUC = 0.66

CONTROL - trend analysis
▪ AQ no.7: Are there any differences in development of risk factors for different CVD groups?
  – increasing BMI makes a contribution to CVD appearance
ENTRY table: ICO (primary key), Year of birth, Year of entry, Smoking, Alcohol, Cholesterol, Body Mass Index, Blood pressure
CONTR table: ICO, risk factors followed during 20 years

Motivation
▪ focus on development – trend gradients
▪ possibilities
  – contemporary statistical methods used in medicine
    • KM, Cox models – analyze something other than what we want
    • ANOVA etc.
      – features have to be developed anyway, lack of data
  – complex sequential data mining
    • introduction of structural patterns and then, e.g., association rules
    • interesting but again needs more data
▪ our approach
  – introduction of simple aggregates
  – application of windowing
  – statistical evaluation for simple dependencies
  – ordinal association rules for more complex relations

Survival curves
▪ Kaplan-Meier or Cox method
  – typical example of temporal analysis in medicine
  – regards survival period, BUT disregards development of risk factors (RFs)
  – typical scenario
    • distinguish groups of patients (ENTRY table)
    • follow their “survival” periods (DEATH or CONTROL table)

Derived trend attributes
[Figure: observed variable y plotted against x (decimal time ~ year + 1/12 month, with 1975 as the referential time); derived attributes: Intercept, Gradient, Correlation coefficient, Mean, Standard deviation.]

Global Approach
▪ Risk factors to be observed are selected
  – SYST, DIAST, TRIGL, BMI, CHLSTMG
▪ Selected control examinations are transformed
  – pivoting
▪ Patients with no control entries are removed
  – about 60 patients
▪ Trend aggregates are calculated
[Diagram: pivoted table with one row per patient (ICO_1, ICO_2, ...) and columns ICO, Entry, Contr1, Contr2, ..., ContrM, Aggr1, ..., AggrN.]

Windowing Approach
▪ Constant number of examinations for all individuals
▪ Issues:
  – window length
    • time period vs. number of checkups
    • how many checkups to select? 5, 8, 10 tested
  – single distinct window or sliding window?
    • entry is used as the first examination
    • more records per patient ⇒ records are not independent
  – temporal CVD definition
    • CVDi - time from the last examination to CVD
    • yes/no (yes = CVD in the next year or CVD in future)
  – missing values treatment

Windowing – missing values
  – approach 1: shift the series
  – approach 2: introduce a new value

Window length selection

Window length effects
▪ 3 different lengths tested, 5 risk factors considered
▪ compared with the global approach
▪ test used:
  – null hypothesis: independence of trends and CVD
  – p-values are shown
▪ windowing: CVD1 vs. nonCVD group
▪ global: CVD vs. nonCVD group
  – global approach is completely misleading
  – prefer shorter windows: down-up effect
  – prefers longer windows: only long term changes may have effect

ControlCount vs. CVD
▪ ControlCount – number of examinations
  – strong relation with CVD
  – AUC = 0.35
  – ControlCount ↑ ⇒ CVD risk ↓
  – anachronistic attribute
  – introduced by the design of the study
▪ ControlCount has influence on the trend aggregates
  – e.g., gradients tend to be steeper depending on ControlCount
▪ Conclusion: global approach cannot be applied (at least with the selected aggregates)

Influence of SYSTGrad (W5)
▪ 122 individual CVD1 observations in total
▪ SYSTGrad (W5) equi-depth binned in 5 groups
▪ representation of the CVD1 group significantly increases with increasing group number of SYSTGrad
[Figure: CVD rate per SYSTGrad group (groups 1 to 5, equi-depth binning) against the average rate; the bars are labelled 17, 18, 25, 28 and 34.]

Averaged blood pressure
▪ striking difference in CVD1 and nonCVD groups
  – linear vs. down-up development
  – can also be observed for the individuals – see the next slide
  – cannot be distinguished by longer windows
[Figure: two panels, average systolic and average diastolic blood pressure [mm Hg] versus time to last examination [years] (9 to 0); series SystCVD, SystHealthy, DiastCVD, DiastHealthy.]

Averaged body mass index
▪ difference in CVD1 and nonCVD groups
  – increasing BMI in the CVD1 group
  – longer windows express this trend better
  – this graph shows that W10 may benefit from the increase between examinations 9 and 8
  – steady BMI in the nonCVD group
[Figure: average body mass index versus time to last examination [years] (9 to 0); series BMICVD, BMIHealthy.]

Trend factors – hypothesis testing
▪ Influence of trend aggregates on CVD
  – 9 gradients considered: SYST, DIAST, CHLSTMG, TRIGLMG, BMI, HDL, LDL, POCCIG and MOC
▪ Identified relations
  – decreasing HDL cholesterol level relates to the increasing risk of CVD (p=0.001)
  – decreasing POCCIG (the average number of cigarettes smoked per day) relates to the increasing risk of CVD (p=0.0001)
▪ Again: correlation vs. causality
  – statement 1 makes sense: HDL is a ‘good’ cholesterol
  – statement 2 suggests spurious dependency
[Diagram: patient state (cause) leads to both smoking habits (effect 1) and CVD onset (effect 2).]

Overview of association rules (AR) found
▪ Group a – relations among trend factors
  – a great prevalence of the rules joining together either blood pressures (DIASTGrad and SYSTGrad) or cholesterol attributes (HDLGrad, LDLGrad and CHLSTGrad)
▪ Group b - hypothesis to be verified by experts
  – insufficient target groups, 6% transactions makes 26 individuals, i.e., instead of 10 prospective diseased patients we actually observe 19

Conclusions
▪ The main scope
  – AQ no.7: Are there any differences in development of risk factors for different CVD groups?
▪ Contributions
  – Pitfalls of the global approach revealed
  – Windowing enabling multivariate temporal analysis proposed, effects of various window lengths studied
  – Development of the following risk factors may influence future CVD occurrence:
    • DIAST, SYST, BMI, (HDL) cholesterol, (POCCIG)
  – Other trends may have or intensify their influence under specific conditions (BMI trend and overweight, etc.) – we lack data to prove it
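
The derived trend attributes and the windowing idea can be illustrated with a small sketch. The slides name the aggregates (gradient, intercept, correlation coefficient, mean, standard deviation), the decimal time scale, and the tested window lengths, but not the fitting procedure. The code below therefore rests on assumptions: an ordinary least-squares line, the sample standard deviation, and a single window made of the last `window` examinations. The names `decimal_time` and `trend_aggregates`, the dictionary keys, and the sample blood pressure values are hypothetical, and missing values are not handled.

```python
import math

REFERENCE_YEAR = 1975  # referential time named in the slides


def decimal_time(year, month):
    """Decimal time in years since the reference year (year + month/12)."""
    return (year - REFERENCE_YEAR) + month / 12.0


def trend_aggregates(times, values, window):
    """Trend aggregates over the last `window` examinations of one patient.

    Assumption: an ordinary least-squares line is fitted to (time, value).
    Returns None when fewer than `window` examinations are available, so
    every retained patient contributes the same number of points.
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    if window < 2 or len(times) < window:
        return None

    xs, ys = times[-window:], values[-window:]
    n = window
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        return None  # identical time stamps: no slope can be estimated

    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    # A constant series has no defined correlation; report 0.0 by convention.
    correlation = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return {
        "gradient": gradient,
        "intercept": intercept,
        "correlation": correlation,
        "mean": mean_y,
        "std": math.sqrt(syy / (n - 1)),  # sample standard deviation
    }


if __name__ == "__main__":
    # Hypothetical patient: six yearly systolic blood pressure readings.
    times = [decimal_time(1976 + i, 6) for i in range(6)]
    syst = [130, 132, 134, 136, 138, 140]
    for window in (5, 8):
        print(window, trend_aggregates(times, syst, window))
```

With a window of 5 the sketch uses only the five most recent readings; with a window of 8 the hypothetical patient is skipped because only six examinations exist, which mirrors the requirement of a constant number of examinations for all individuals.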