Data, Databases, and Discovery
Andy Novobilski, PhD
UT Chattanooga Computer Science
Nuts and Bolts Research Methods Symposium
UT College of Medicine Chattanooga
September 29, 2006

An Introduction to Knowledge Discovery
• Data Collection
• Data Validation
• Preprocessing of Data
• Mining the Data
• Comparing Methods

Data Collection …
• Paper or Electronic?
  – Fingernet
• Continuous or Discrete?
• And the Understatement of the Year …

Health Insurance Portability and Accountability Act of 1996
The HIPAA website http://www.hipaa.org/ links to the government’s website http://aspe.hhs.gov/admnsimp/ which states “Administrative Simplification in the Health Care Industry”

… And Raw Storage …
• Alphanumeric Data
  – Excel Worksheets
  – Comma/Tab Delimited Text Files
  – XML: The Extensible Markup Language
    • http://www.xml.com/
• Binary Data
  – Images
    • GIF, BMP, EPS
  – Streaming Data
    • HL7 - http://www.hl7.org/ (http://en.wikipedia.org/wiki/HL7)
    • DICOM - http://medical.nema.org/

… Stored in a Relational Manner
• Relational Databases
  – Inexpensive
    • MS Access
  – Expensive
    • MS SQL Server, Oracle, Sybase, …
  – Free (sort of … open source)
    • MySQL, PostgreSQL
• Licensing Varies by Usage

Data Validation

Id    Gender   Age   Months Pregnant   Temp   Smoker
001   M        55    0                 98.3   Yes
002   M        55    9                 9.82   .

• Patient 002 is a …
  – Pregnant Male (hit the 9 instead of 0)
  – With Ice Water in His Veins (misplaced decimal)
  – Who Might or Might Not Smoke (missing data)

Preprocessing the Data
• Clean-up
  – Out of Scope vs. Out of Family
• Feature Extraction
  – Data Aggregation
• Feature Transformation
  – Normalization
  – Principal Component Analysis

Turning Data into Information
• Data Mining …
  – Clustering
  – Decision Trees
  – Neural Networks
  – Bayesian Networks

Clustering
K-Means
[Figure: scatter of points labeled Y and N, grouped into clusters]

Decision Trees
• Division of Data Based on Information Gain
• White Box
[Figure: decision tree splitting on Gender (M/F), Smoker (Y/N), and Age, with Y/N leaves]

Neural Networks
• Functional Approximation to Data
  – Black Box
  – Most Common is Feed Forward, Back Propagation
• Considerations in Training the Network
  – Many Types of Neural Networks
  – Difficulties with Discrete Data
  – Missing Data Requires Careful Consideration
[Figure: Case Data in, Forecast out]

Bayesian Networks
• Belief Networks
  – White Box
• Causal Orientation
• Beliefs are Updated Based on New Information
• Nodes Can Serve as Both Evidence and Query Points
• Handles Missing Data Gracefully

An Example
Novobilski, Andrew, F. Fesmire, D. Sonnemaker. "Mining Bayesian Networks to Forecast Adverse Outcomes Related to Acute Coronary Syndrome." The 17th International FLAIRS Conference, 2004.

Comparing Models – The ROC Curve
• The Receiver Operating Characteristic (ROC) Curve
  – Plots the Percentage of True Positives against the Percentage of False Positives as the Cutoff Value is varied from everyone classified as ill to everyone classified as healthy.
  – Provides a consistent measure of model fitness that varies between 0 and 100.

An Illustration
[Figure: Healthy and Ill distributions separated by a Cutoff Value]

Comparing Multiple Classifiers
[Figure: ROC curves for multiple classifiers]

In Summary …
• A Process to Consider …
  – Collect, Validate, Preprocess, Mine, Compare
• Excellent Software is Available
  – Both Commercial and Open Source
• Sample Data Is Available

Thank You!
• Questions and/or Comments are Welcome …

Dr. Andy Novobilski
UT Chattanooga Computer Science
615 McCallie Ave., Dept. 2302
Chattanooga, TN 37403
(423) 425-4202
[email protected]
http://www.utc.edu/Faculty/Andy-Novobilski
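Supplementary sketch: Data Validation. The three problems with Patient 002 on the Data Validation slide (a pregnant male, an implausible temperature, a missing smoker value) can be caught with simple rule checks. The following is a minimal illustration in Python, not a method from the talk. The field names, the `validate` function, and the 90 to 110 degree plausibility range are all assumptions chosen for this example and are not clinical guidance.

```python
# Two records transcribed from the Data Validation slide.
# None stands in for the missing value shown as "." on the slide.
records = [
    {"id": "001", "gender": "M", "age": 55, "months_pregnant": 0,
     "temp": 98.3, "smoker": "Yes"},
    {"id": "002", "gender": "M", "age": 55, "months_pregnant": 9,
     "temp": 9.82, "smoker": None},
]


def validate(record):
    """Return a list of problems found in one record (empty if none).

    The rules and thresholds are illustrative assumptions only.
    """
    problems = []
    # Cross-field consistency: keying error such as 9 typed for 0.
    if record["gender"] == "M" and record["months_pregnant"] > 0:
        problems.append("male patient with nonzero months pregnant")
    # Range check: catches a misplaced decimal such as 9.82 for 98.2.
    if not 90.0 <= record["temp"] <= 110.0:
        problems.append("temperature outside plausible range")
    # Completeness check: missing data must be handled explicitly.
    if record["smoker"] is None:
        problems.append("smoker status missing")
    return problems


report = {r["id"]: validate(r) for r in records}
for patient_id, problems in report.items():
    print(patient_id, problems or "ok")
```

Checks like these flag records for review; deciding whether to correct, impute, or discard a flagged value is a separate step.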
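Supplementary sketch: the ROC Curve. The ROC slide describes plotting the percentage of true positives against the percentage of false positives as the cutoff value is varied. The sketch below shows one way to compute those points and the area under the curve on a 0 to 100 scale. The scores and labels are made-up toy data, the function names are hypothetical, and the code assumes a higher score means "more likely ill". It sweeps the cutoff from "everyone healthy" to "everyone ill", which traces the same curve as the opposite direction.

```python
def roc_points(scores, labels):
    """Return (false-positive %, true-positive %) pairs as the cutoff is lowered.

    scores: model outputs, higher meaning more likely ill (an assumption).
    labels: 1 for ill, 0 for healthy.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Cutoff above every score: everyone is classified healthy.
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((100.0 * fp / negatives, 100.0 * tp / positives))
    # The final point is (100, 100): everyone is classified ill.
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points, rescaled to the range 0..100."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / 100.0


# Toy data, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
auc = area_under_curve(curve)
print(curve)
print(auc)
```

On this scale an area near 50 corresponds to guessing at random and an area near 100 to near-perfect separation, so two classifiers evaluated on the same data can be compared by their areas.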