Oracle Data Mining and Epidemiological Analysis
Scott A. Rappoport, OCP
MTS Technologies
OracleWorld 2003, San Francisco, CA
paper #63144

Presentation Goals
* Short intros
* Vocabulary
  * Present Basic Medical Terms
  * Describe Data Mining Models and Terms
* Synthesize
  * What questions are we asking?
  * Applying DM to Epidemiological issues
* Demonstrate the DM4J components
* The future:
  * Challenges
  * 10g features

The DM Dimension
* Data Mining capability that is readily accessible to end users opens a whole new dimension of what can be performed in medicine.
* New questions are being generated based on the availability of these new techniques.
* This is a cutting-edge (bleeding-edge) advanced technique.

A few disclaimers…
Medical data is highly sensitive information… Thus:
* No personally identifiable info is presented
* No specific aggregated information on disease types, locations, or time is provided
* Scaled-back list of attributes in demos
However, the demos will give an indicative application of the technology.

About you
What percentage of the audience:
* Has a medical background?
  * Physician
  * Epidemiology/research/academic
* Has an IT background?
  * DBA
  * Developer
* Knows a lot about Data Mining? Statistics?
* Has at least two of the above? Three?

About me
* Oracle Certified DBA and Developer
* ASQ Certified Quality Engineer
* Principal Architect, supporting the Naval Health Research Center in San Diego, CA
* Instructor of Oracle, Data Warehouse, and Web Services courses at UCSD-Extension
* Papers on Java, Data Warehousing (IOUG/ODTUG)
* Biochemistry degree; worked in a diagnostics firm
* Son of a clinical pathologist

Let’s get at it…
The Medical Side

Medical Lexicon
* Epidemiology: Study of the relationships of various factors determining the frequency and outbreak of disease.
* Nosocomial: Outbreaks originating within a hospital.
* Nosology: Study of the classification of diseases.
* ICD9/10: International Classification of Diseases, v9 or 10.
  Classification of disease by major category, represented by a three-digit code, followed by a specific type, represented by a two-digit code.
* DNBI: Disease Non-Battle Injury. Military classification of disease types.

Nosology/ICD9 Disease Classification
* Over 12,000 separate diseases
* Classified into 13 areas
* Further sub-classed
* Set off by a 3-digit code, then an additional 2-digit descriptor for better granularity
* DNBI: military designations

Epidemiological/Medical Practice Questions
* What factors affect the onset of disease within a population?
* What is the likelihood that a patient will require follow-up treatment, hospitalization, or that the case will worsen?
* Are there particular clusters of patients that are more likely to develop a certain disease?
* How often is a case mis-diagnosed?
* Is a particular treatment likely to cure the ailment?

Summarizing the Concerns
* Predictive concerns
* Classification of risks and subjects
* Attribute ranking concerns
* Multi-factor relevance
* Dealing with large numbers of attributes
* Clustering questions
* Unknown associations

Epidemiological techniques
* Statistical packages
* Chi-square
* ANOVA / ANCOVA / MANOVA
* Multi-variate Analysis (Attribute Scoring):
  * Multiple Logistic Regression (binomial/dichotomous)
  * Multiple Linear Regression (multiple/category)
* Covariance
* 2x2 matrix

Risk factors/classification
* Environmental: exposure, location, job risks, diet
* Genetic: Genetic markers present?
* Clinical: Blood/other diagnostics data
* Familial: Other family members? Who, what?
* History: Past illnesses? What? When? How often?
* Socio-economic: Job, married, education, age, gender
* Lifestyle: Exercise, smoker, alcohol
* Ethnic/National/Geographical

Patient Data Universe
[Diagram: "Total Patient Description", drawing on: Geographic factors; Treatment Fac/Personnel; Patient history; Physician's note; Family History; Diagnostic data; Drug interactions; Genomic; Ethnic/race/national; Lifestyle factors]
A vast amount of data can potentially be collected and mined in the patient data universe!

The Data Mining Side

Reporting techniques/hierarchies
[Diagram: hierarchy of reporting techniques, ordered by increasing user sophistication, built on Data]
* Data Mining: What hidden associations or clusters of attributes may exist?
* On-Line Analytical Processing (OLAP): What is likely to happen tomorrow (based on past trends/aggregations)?
* Ad Hoc Queries: Why did that happen yesterday?
* Operational Reporting: What (specific events) happened yesterday?

Reporting Examples
* Operational reporting
  Reporting needs: Basic information on an event
  Example: Find the diagnosis of patient #A1234 on this date.
* Ad-hoc
  Reporting needs: User-defined queries to help understand an event
  Example: Does the specific patient have a past history of such a diagnosis?
* OLAP
  Reporting needs: Summarized data of events across many dimensions
  Example: What is the incidence rate of this disease among this patient type? For this area, season, hospital, etc.? Is this becoming more prevalent?
* Data Mining
  Reporting needs: Attribute associations, predictive modeling, clustering of populations by attribute sets, across many attributes and records
  Example: What are the risk factors for this disease? What is the likelihood a treatment will succeed for a patient? What specific populations are at risk?

Data Mining Techniques
* Classification: Seeks the attributes that best predict a dependent variable
* Clustering: Seeks groupings of attributes in populations
* Association: What is the likelihood that event A will lead to or occur with event B, C, or D…
* Attribute Importance: Ranking of attributes based on their effects on a given dependent variable
* Lift Model: Measures how well a model can identify a given target

Data Mining Terms
* Confusion Matrix: Tests model accuracy. Actual values are compared with predicted values, and the model is scored by the incidence of false positives and false negatives.
  * False negative: disease present, but the result does not show it
  * False positive: disease not present, but the result shows it
* Supervised learning: The target value is specified. Classification / regression
* Unsupervised learning: Relations/target attributes are not known. Clusters / associations

Data Mining Terms (cont’d)
* Support: The measure of how often the collection of items in an association occur together, as a percentage of all the transactions.
* Confidence: Confidence of rule "B given A" is the fraction of the transactions containing A that also contain B, that is, how likely it is that B occurs when A has occurred.
* ROC: Receiver Operating Characteristic. Used in Lift models to determine how well the model identifies targets as opposed to random selection.

Supervised/Unsupervised
Supervised: prediction, odds of success
* Classification
  * Model
  * Test (obtain false positives/negatives)
  * Apply
  * Lift
* Attribute Importance
  * Determine the attributes with the most effect on the result
  * Want to split on this attribute

Supervised/Unsupervised
Unsupervised: no a priori knowledge; find hidden relations/associations/groupings
* Clustering: What groups of subjects share values of attributes that are closely related?
* Associations: Find events that are related; i.e., if A (and/or B) happens, what are the odds that C will happen?
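
The confusion-matrix terms defined earlier in this section can be made concrete with a small sketch. It is not the ODM implementation; the function names and the ten sample cases are hypothetical and invented purely for illustration (1 = disease present, 0 = disease not present).

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels.

    A label of 1 (or True) means "disease present".
    """
    cm = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            cm["tp"] += 1          # disease present, result shows it
        elif p:
            cm["fp"] += 1          # false positive: not present, result shows it
        elif a:
            cm["fn"] += 1          # false negative: present, result does not show it
        else:
            cm["tn"] += 1          # disease not present, result agrees
    return cm


def error_rates(cm):
    """Derive accuracy and the two error rates from the four counts."""
    total = sum(cm.values())
    negatives = cm["fp"] + cm["tn"]   # cases where disease is actually absent
    positives = cm["fn"] + cm["tp"]   # cases where disease is actually present
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total if total else 0.0,
        "false_positive_rate": cm["fp"] / negatives if negatives else 0.0,
        "false_negative_rate": cm["fn"] / positives if positives else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical test set: actual diagnosis vs. model prediction.
    actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    cm = confusion_matrix(actual, predicted)
    print(cm)
    print(error_rates(cm))
```

In a medical setting the two error rates are usually weighed differently: a false negative (a missed disease) is often costlier than a false positive, which is one reason to look at the full matrix rather than accuracy alone.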

Classification Modeling
* Used to build a model that predicts the outcome of a dependent attribute from a set of independent attributes
* Algorithms: Naïve Bayes, Adaptive Bayes Network
[Diagram: tree of attributes with pruned branches; axis label: Number of Levels]

Classification Model (cont’d)
Replaces:
* Multi-variate Analysis
* Multiple Logistic Regression (binomial/dichotomous)
* Multiple Linear Regression (multiple/category)
Questions:
* Given a set of factors, what is the likelihood that a disease will be expressed?
* What is the likelihood the disease will lead to a more severe ailment?
* What category (multi-option) of health is expected, based on the inputs?

Classification Model: To Do’s
1. Create a model: Classification Build
2. Refine: Run an Attribute Importance Model to help define the best attributes to “split”
3. Test the model: Classification Test
4. Predict results: Classification Apply
5. Targeting: Classification Lift

Clustering
* Unsupervised model that attempts to find groups within the population that share similar attributes
* Algorithms: k-means, O-Cluster
[Figure: clusters C1 and C2 plotted on AGE and INCOME axes, with centroids and per-attribute histograms (Age, Income, Rank). Courtesy Charlie Berger, Oracle]

Clustering (cont’d)
* k-means only takes numeric values, and requires the number of clusters to be specified. Good for smaller datasets with fewer attributes.
* O-Cluster: more robust than k-means
Questions:
* What groups of people are present in a population, and what are their common attributes?
* How are the members distributed along those attributes?
* Are there given clusters of people related to a specific disease family? Are members more or less susceptible?

Association Models
* Unsupervised model that returns a set of rules determining if one or more attributes are associated with other attributes.
* Scored by support/confidence
* What is the likelihood of A happening if B happens?
* Often used with sparsely populated data sets.
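
The support/confidence scoring mentioned above can be sketched in a few lines. This is a minimal illustration, not the ODM association algorithm: the function names and the eight sample records are hypothetical, and confidence is computed here in the usual way, as the fraction of records containing the antecedent that also contain the consequent.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    if not transactions:
        return 0.0
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent (rule: antecedent -> consequent)."""
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for t in transactions if a <= set(t))
    if n_a == 0:
        return 0.0                      # rule never fires; no evidence
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_a


if __name__ == "__main__":
    # Hypothetical records, one set of observed attributes per subject.
    records = [
        {"smoker", "overweight", "attrition"},
        {"smoker", "attrition"},
        {"smoker"},
        {"overweight"},
        {"overweight", "attrition"},
        {"smoker", "overweight"},
        set(),
        {"smoker", "overweight", "attrition"},
    ]
    rule_if = {"smoker", "overweight"}
    rule_then = {"attrition"}
    print("support:", support(records, rule_if | rule_then))
    print("confidence:", confidence(records, rule_if, rule_then))
```

Support filters out rules that rest on too few records, while confidence measures how reliably the rule holds when it applies; a rule is normally reported only if it clears a minimum threshold on both.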
Questions:
* What is the relationship between overweight recruits, smoking, and attrition in boot camp?

Applications/Demos
* Review of the parts of the process:
  * JDeveloper9i layout, model wizards, creation, run
  * ODM Browser: task review, navigation, results
* Creation of models in JDeveloper9i with DM4J Wizards
  * Clustering Model: Build and analyze histograms
  * Association Model: Build, analyze rules
  * Classification Model: Build, Test, Apply, Lift
  * Attribute Importance

Challenges
* Most data sources have not been modeled to collect the range of data needed.
* Bio-informatics opens a whole new range of study not even imagined a few years ago.
* Data stores are inconsistent.
* Doctors' notes are not uniform.
* Legacy apps are a mess (COBOL, poorly documented, personnel retired…).

More challenges
* Vast amounts of data/processing
* Confusion matrix on attributes with large categories
* Structuring questions “to peel away” masking factors, and being sensitive to subtle associations
* Bringing it to the masses
* Overcoming resistance to change

New Native 10g Features
* Text Mining: to help us search through physicians’ notes
* Support Vector Machines (SVM): “Neural Networks on Steroids.”
* Non-negative Matrix Factorization (NMF): Algorithm to help “boil down” many attributes into a manageable set.
* Enhanced Bio-informatics support in the DB.
* Transformation creation (currently alpha)

Summary
* Covered a multi-disciplinary topic
* Attempted to show how DM is uniquely suited to Epidemiological study
* Showed the ease with which models can be made
* Still, model creation requires trained personnel
* Many challenges remain to fully exploit this technology.

Questions?

Special Thanks to…
* Mark Kelly, Oracle Data Mining
* Robert Haberstoh, Oracle Data Mining
* Charlie Berger, Director, Oracle Data Mining

Follow-up
* Please fill out the on-line survey: Session #63144
* Feel free to contact me:
  Scott Rappoport, OCP
  Principal Technical Staff Member
  MTS Technologies
  619-725-5082
  [email protected]