College of Health and Human Services
Fall 2016 Syllabus

Course information
HAP 780: Data Mining in Health Care, Fall 2016
Time: Mondays, 7:20pm–10pm
Location: Aquia Building 219

Course placement
( ) Core (X) Concentration (X) Elective ( ) Pre-requisite(s)
(X) Course(s) recommended before taking this course: HAP 700, HAP 709, HAP 602
It is impossible to mine and analyze data without good knowledge of database systems. HAP 709 or other relational database courses (with SQL) are strongly recommended before taking this course.

Instructor
Janusz Wojtusiak, PhD
[email protected]
Office hours by appointment, Wednesdays 1-4pm (Northeast Module, Room 108, Fairfax Campus)

Course description
An introductory course to data mining and knowledge discovery in health care. Methods for mining health care databases and synthesizing task-oriented knowledge from computer data and prior knowledge are emphasized. Topics include fundamental concepts of data mining, data preprocessing, classification and prediction (decision trees, attributional rules, Bayesian networks), constructive induction, cluster and association analysis, knowledge representation and visualization, and an overview of practical tools for discovering knowledge from medical data. These topics are illustrated by examples of practical applications in health care.

Course objectives
Upon completion of the course, students will be able to:
1. Understand and describe data mining techniques and their use in knowledge discovery as it applies to health related fields.
2. Define a health related problem to be solved by means of data mining.
3. Apply data preprocessing techniques to clean and prepare data sets for analysis.
4. Build and assess predictive models using various techniques such as decision trees, decision rules, Bayesian networks, and clustering.
5. Develop skills in using recent data mining software for solving practical problems in health services research and other medical and public health related fields.
6. Use methods for presenting knowledge in natural language and other understandable forms.
7. Review and critique current research papers on data mining algorithms and implementations.

Required textbook(s) and/or materials
Required Text: Class notes and slides.
Recommended Readings:
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann.
Witten, I.H., Frank, E., Hall, M.A. (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition. Morgan Kaufmann.
Black, K. (2008). Business Statistics for Contemporary Decision Making. New Jersey: John Wiley & Sons.

Course requirements
Computer requirements
This is a computationally intensive course and you are expected to access databases, software tools, and other contents. You will need:
• Fast computer (multicore PC or Mac) with at least 100GB of free disk space and at least 4GB RAM (4GB+ recommended), Windows 7 or newer. Mac users may require more powerful computers to enable virtualization to run Windows.
• Fast internet connection
• Microsoft Office for viewing and preparing files
• Other software will be provided in class (SQL Server, Weka, R, Genie, Python)
If you do not have a sufficiently powerful computer, you can request access to the Health Informatics Learning Lab, located in Northeast Module, or use one of the computer labs at GMU. It is the responsibility of students to configure and maintain their own computers, and to make sure that they are set up correctly and that installed software (e.g., security software) does not interfere with software used in class.

Expectations: Students are responsible for assigned readings, class content, and material. Students are also responsible for finding the right computer equipment that allows accessing the course materials and using data and software tools, and for checking email/Blackboard on a daily basis.
Data mining is a very broad topic, which is condensed here into a one-semester course.
This course requires students to participate in lectures and spend at least another 6 hours per week on assignments, reading, and the project.

Evaluation Methods: If you are taking this course as part of a graduate level course, you will receive a grade. Your grade will depend on your participation, the quality of your project work, and your team work. Assignments and projects are graded based on multiple criteria that will be discussed in detail. Always write all answers in your own words. Do not copy-and-paste.
You can ask questions by sending email to the instructor. In most cases you will receive a response within 48 hours.

Participation Outside Classroom
You should attend a meeting (conference, seminar, local chapter meeting, etc.) and write about a one-page description of what you learned and how the attended event relates to this course. It is not sufficient to simply pay the membership fee for a professional organization without participating in the organization in any way. The report is due on the last day of classes. Look for a meeting early in the semester. In-person meetings are strongly suggested.

Data Mining Topic Presentation
You will need to prepare a 10-minute presentation about a topic related to the class. I strongly recommend finding a journal article, analyzing it in detail, and presenting it. Do not prepare presentations that repeat topics covered in class. Do not repeat topics presented by other students.

Final Project
Data mining requires combining theoretical knowledge with practical skills. In order to develop skills in the context of health care applications, the semester-long project is the most important component of the grade.
The project topics should be related to analyzing health care data in order to solve clinical or administrative problems. The project should include, but not be limited to: (1) problem description; (2) data selection; (3) data pre-processing; (4) selection of DM methods; (5) application of methods; (6) analysis of results; (7) review of available literature and related work; (8) conclusions and description of impact on health care. Include a brief description of what you learned in the project.
Direct application of existing software to publicly available datasets is not sufficient. The projects must demonstrate significant effort in data manipulation, processing, and mining. Projects must also illustrate understanding of the applied techniques as well as the health care problem addressed.
In the past, some student projects were revised, extended, and submitted for conference presentations.

Teaching methods
(X) Lecture ( ) Group work ( ) Independent research ( ) Field work
( ) Papers ( ) Guest speakers ( ) Student presentations ( ) Case Studies
(X) Lab ( ) Class discussion ( ) Other ___________

Evaluation
Weekly Assignments                35%
DM topic presentation             10%
Participation Outside Classroom    5%
Semester-long project             50%

Grading Scale
96+      A
90-95    A-
86-89    B+
80-85    B
75-79    B-
70-74    C
0-70     F

Mason Honor Code
The complete Honor Code is as follows:
To promote a stronger sense of mutual responsibility, respect, trust, and fairness among all members of the George Mason University community and with the desire for greater academic and personal achievement, we, the student members of the university community, have set forth this honor code: Student members of the George Mason University community pledge not to cheat, plagiarize, steal, or lie in matters related to academic work.
(From the 2016-17 Catalog – catalog.gmu.edu)

Individuals with Disabilities
The university is committed to providing equal access to employment and educational opportunities for people with disabilities. Mason recognizes that individuals with disabilities may need reasonable accommodations to have equally effective opportunities to participate in or benefit from the university educational programs, services, and activities, and have equal employment opportunities. The university will adhere to all applicable federal and state laws, regulations, and guidelines with respect to providing reasonable accommodations as necessary to afford equal employment opportunity and equal access to programs for qualified people with disabilities. Applicants for admission and students requesting reasonable accommodations for a disability should call the Office of Disability Services at 703-993-2474. Employees and applicants for employment should call the Office of Equity and Diversity Services at 703-993-8730. Questions regarding reasonable accommodations and discrimination on the basis of disability should be directed to the Americans with Disabilities Act (ADA) coordinator in the Office of Equity and Diversity Services.
(From the 2016-17 Catalog – catalog.gmu.edu)

E-mail Policy
Web: masonlive.gmu.edu
Mason uses electronic mail to provide official information to students. Examples include notices from the library, notices about academic standing, financial aid information, class materials, assignments, questions, and instructor feedback. Students are responsible for the content of university communication sent to their Mason e-mail account and are required to activate that account and check it regularly. Students are also expected to maintain an active and accurate mailing address in order to receive communications sent through the United States Postal Service.
(From the 2016-17 Catalog – catalog.gmu.edu)

I plan to videotape selected lectures for the future use of online students. The recorded videos will be posted online for students to view. The camera will be facing the screen and instructor. Because live interaction with the class is recorded, some of your questions and your voice may also be recorded. If you do not wish to be on the final recording, please let me know. Then, I will ask for your help to review the final versions of recordings to ensure that you are completely edited out.

Tentative Weekly Schedule
The schedule below is approximate and may be changed to adapt to students' needs and requests, new material, and for other reasons. Due dates and assignments are subject to change and will be provided weekly.

Date          Topics
8/29          Introduction to data mining in health care; Review of databases; Introduction to software
9/5           No Class – Labor Day
9/12          Measuring/Describing the world; Data preprocessing - part 1
9/19          Data preprocessing - part 2; Knowledge representation
9/26          Data preprocessing – part 3: Exploratory data analysis, simple statistics; Review of types of health data
10/3          Mining Frequent Patterns/Associations
10/11 (Tue)   Classification and Regression: Basics
10/17         Classification 2
10/24         Cluster Analysis
10/31         Outlier Detection
11/7          Time and Space
11/14         No class meeting – online material assigned on healthcare applications
11/21         Text and Image Mining
11/28         Genomic data; BIG DATA Analysis
12/5          Final Project Presentations

Assignments (in order): "What do you know?" for the first class; "What do you know / prepare sample data" for the next two; "What do you know / analyze data" for the remaining weekly assignments; "All Missing Assignments Due" as the final entry.
Due dates (in order): 9/11, 9/18, 9/25, 10/2, 10/9, 10/16, 10/23, 10/30, 11/6, 11/13, 11/21, 11/21, 12/5.

Sample Assignments
Below are draft instructions for some of the assignments. They are for information purposes only, to help students better plan their time and understand course content. The actual assignments will be posted on Blackboard and may be different from the ones presented here.

Assignment 1 – Introduction to Databases and Data Mining
When answering questions: (1) use your own words; (2) discuss answers; (3) do not copy-and-paste; (4) provide enough details, so I can help if an answer is incorrect.
1. How is a data warehouse different from an operational database? Are there any similarities?
2. Why are outliers particularly important in healthcare applications? Explain and give examples.
3. How does the data mining process for very large data differ from that for very small data? Describe challenges related to both types of data.
4. You are a consultant hired by a hospital. The hospital is engaged in a quality improvement process to reduce medical errors. You are asked to analyze data to learn why some reported incidents result in lawsuits or claims, while others don’t. Describe how you would approach this problem. What type of data mining is involved? What do you need in the data in order to perform this task?
5. Load the attached file hepatitis.csv into SQL Server, MySQL, PostgreSQL, or another relational database (do not use MS Access). Some data description is available at: http://archive.ics.uci.edu/ml/datasets/Hepatitis
Prepare SQL queries to answer the following questions:
- what is the average value of bilirubin?
- what is the average value of bilirubin for patients that live?
- what is the average value of bilirubin for patients that are dead?
- how many patients have a value of bilirubin higher than the average in the data?
- how many patients are at least 50 years old? How many of them are dead?
- how many patients are younger than 50 years? How many of them are dead?
- is the average age of dead patients higher than the average age of alive patients?
Present both queries and results. Do not build the queries by pointing and clicking; write them in SQL.
6. Decide on the topic and date of your presentation. Submit the date and topic of your presentation to Doodle.
7. Propose the topic of your project. Write 1-2 paragraphs describing what you want to do. The topic may evolve later, but you need to start thinking about it now.
8. Download “SQL Server 2016 Enterprise 64-bit (English)” from DreamSpark. Install the server yourself, or keep the downloaded files on your laptop for installation in class. We will proceed with step-by-step installation.
http://e5.onthehub.com/WebStore/Welcome.aspx?vsro=8&ws=058b3ace-2512-e111a703-f04da23e67f6
It is free for HAP students for academic use! We will do the installation in class next week. If you are a Mac user, you need to install it in a virtual machine running Windows. Windows can also be downloaded from DreamSpark. You need virtualization software such as Parallels or VirtualBox.

Assignment 2 – Preprocessing Part 1
1. What types of dirty data can one expect when starting a data mining project? List at least five potential problems and give examples.
2. Why do you have to specify field types when loading data into a database or data warehouse? Why not make everything a text field?
3. Why are semantic/analytic data types important in data mining?
4. Use the “claims” data from the HAP 709 class (you can get Excel files at http://gunston.gmu.edu/healthscience/709/QuerriesReports.asp#Analyze%20data ).
- How many patients have chronic conditions?
- What is the most common chronic condition?
- What is the maximum number of body systems involved for a single patient? Which patient(s)?
You should use the definitions of chronic conditions and body systems from the website of the Agency for Healthcare Research and Quality.
The mapping is available in the CCI2012.CSV file at the website: http://www.hcup-us.ahrq.gov/toolssoftware/chronic/chronic.jsp
You will need to load the claims data into SQL Server (preferred) or Access. Then you need to load the chronic condition indicators (CCI2012.CSV file). Then you need to link the two files in queries that answer the questions above. Please send screenshots of the different steps of your work, the final results, and any SQL code you used.

Assignment 3 – Preprocessing Part 2
1. Why is it important to explore data before executing any DM algorithms? What can go wrong? What can be discovered in an exploratory analysis?
2. What are the three types of missing values? Give examples.
3. Suppose you are given two datasets: a survey result about satisfaction with a clinic visit and patient records. For privacy reasons, personal information (name, record number, address) has been removed from both datasets. Your goal is to find out if there is any relationship between the survey results and severity of cases (severity can be obtained from medical records).
- How do you approach the problem of linking the two datasets? Speculate on what attributes you would use to link them.
- Should you assume that a medical record is found for every patient who completed the survey?
- Should you assume that a survey is found for every patient who was treated at the clinic?
4. Using the “Hepatitis” dataset from Assignment 1:
- Write SQL queries to calculate the arithmetic mean, standard deviation, median, mode, and all three quartiles for SEX, AGE, and BILIRUBIN. Note: make calculations only for values that make sense.
- Using SQL, prepare data for a Q-Q plot of BILIRUBIN levels for male vs. female patients. You do not need to make the actual plot (although the prepared data can be easily copied and plotted as a scatterplot in Excel).
Prepare the data in the form of a table with 2 columns, in which the first column corresponds to male and the second to female patients, and rows correspond to selected quantiles.
Note: even if you fail at some details here and something does not work, describe the procedure for how you would approach the problem.
5. Install Weka software (http://www.cs.waikato.ac.nz/ml/weka/) on your computer. It runs on most platforms (Windows, Mac, Linux) and requires Java (JRE). Installation is simple. We will be using Weka in class. Send screenshots.

Assignment 4 – Data Transformation
1. Why is it important to select the right attributes for DM algorithms? Why not keep all attributes?
2. Give an example (different than in lecture) of when creation of new attributes may be useful.
3. When should stratified sampling be used?
4. Using hepatitis data, write SQL code for:
- Sampling 50% of data without repetitions
- Sampling 50% of data with repetitions
- Stratified sampling, to make the frequency of both classes equal
5. In Weka, using the hepatitis data:
- Compare results of three different attribute selection methods
- Compare results of three different sampling methods (show plotted data distributions).

Assignment 5 – Association Rule Mining
1. Describe why association rules and classification rules are not the same. Give examples.
2. Why is the FP-Tree algorithm usually faster than Apriori? Give some intuitive explanation.
3. List at least four metrics of quality for association rules. Provide formulas.
4. Using heritage data (release 1) in SQL:
a. Find support for all single itemsets
b. List all itemsets with 2 elements and support of at least 0.2
c. List all itemsets with 3 elements and support of at least 0.2
5. In Weka:
a. Load heritage data (release 1)
b. Apply at least two association rule generation algorithms and compare results
c. Apply the FP-Tree algorithm with at least two measures of rule metrics

Assignment 6 – Classification and Regression
1. Describe the process of preparing data for classification learning.
2. Why is it important to select the correct type of model? List at least three reasons.
3. Suppose you are asked to create a model to predict hospitalization. What you have is claims data. Describe the process of preparing data, constructing the model, and testing the model. Should you use regression learning or classification learning for this problem? Why?
4. In SQL/Weka:
a. Prepare heritage data for classification learning
b. Load heritage data release 3 (preprocessed to binary representation, including demographics and output attribute(s))
c. Perform exploratory analysis
d. Create at least three classification models for predicting hospitalization based on Year 1 data.
e. Which model performs the best on Year 2 data?
f. Create a regression model for predicting hospitalization days.
g. What is the difference between regression and classification models?
h. Present your results in the form of a short report that includes screenshots, tables, and needed description.

Assignment 7 – Classification Part 2
1. When can ROC be used as a measure of quality of models? How does it differ from other measures such as accuracy, precision, or recall?
2. When designing a system for detecting life-threatening events for ICU patients, what is more important: precision or recall? Why?
3. Describe how multiple classification models can be combined.
4. Using the heritage release 3 data prepared last week:
a. Include drug information in the data
b. Include laboratory information in the data
c. Import the newly created data into Weka and run classification algorithms
d. Does inclusion of the information improve predictions?
There are many ways to complete question 4, so you need to make different decisions. Try not to overcomplicate the problem.

Assignment 8 – Cluster Analysis
1. Describe differences and similarities between clustering and classification. Use examples.
2. Suppose you are given a dataset with 1000 binary attributes representing presence of diagnoses. Describe potential problems with clustering data with that many dimensions.
3. Using the data table shown below:
a. Calculate the distance between all points in the 1-norm, 2-norm, and infinity-norm. Show the dissimilarity matrix.
b. Is there any need to preprocess the data to be more suitable for clustering? If so, describe the operations and show the resulting data table.
c. Apply the k-means clustering algorithm with k=2.

ID   Age   BMI   Gender   Total Cholesterol
1    30    24    M        180
2    70    19    M        190
3    65    26    M        220
4    40    32    F        260

4. In Weka, using the heritage 3 dataset:
a. Apply the k-means algorithm for k=2, 3, 5, 10
b. Apply the EM algorithm. What is the optimal number of clusters obtained by EM?
c. Compare the created clusters to classification based on hospitalization in year 2.

Assignment 9 – Outlier Detection
1. How do outliers differ from noise? Provide examples of both.
2. Why are collective outliers harder to find than individual outliers?
3. What are the issues in applying classification methods for outlier detection? What are the requirements?
4. Using Heritage data, are there any patients that can be considered outliers in the data? Why?

Assignment 10 – Text Mining
1. Why is text mining important in health care applications?
2. Describe pre-processing of clinical notes before classification learning can be applied. What methods can be used at each step of pre-processing?
3. Write a regular expression to:
a. Detect zip codes in text
b. Find last names of all patients whose first name is John
(note that regular expressions may have some false positives/false negatives).
4. List challenges in automatically retrieving ICD-9 codes from clinical notes. Search the literature to find relevant published work. Also, include your own observations and comments.
5. Using the SMS data:
a. Split data into training (80%) and testing (20%) sets
b. Build a naïve Bayes classifier for detecting spam based on bag of words
i. List all words in the documents
ii. Count occurrences in spam and ham
iii. Assign likelihoods P(word|spam) and P(word|ham) for all words
iv. Convert test data into a list of words. For each message you need 2 columns: message id and word
v. Classify test data. This can be done by a series of joins with the data prepared in (iii).
vi. Evaluate your model (accuracy, precision, recall)
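The naïve Bayes steps i-vi in Assignment 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the expected solution: the messages are hypothetical toy examples rather than the SMS data, the training/testing split is given directly instead of being sampled 80/20, add-one smoothing is an added assumption to avoid zero likelihoods, and the helper names (tokenize, likelihood, classify, evaluate) are made up for this sketch.

```python
import math
from collections import Counter

# Hypothetical toy messages (label, text); the real SMS data is not used here.
train = [
    ("spam", "win cash now"),
    ("spam", "free cash prize"),
    ("ham", "see you at lunch"),
    ("ham", "call me now"),
]
test = [("spam", "free cash"), ("ham", "see you now")]


def tokenize(text):
    """Split a message into lowercase words (bag of words)."""
    return text.lower().split()


# Steps i-ii: list all words and count their occurrences in spam and ham.
counts = {"spam": Counter(), "ham": Counter()}
n_messages = Counter()
for label, text in train:
    counts[label].update(tokenize(text))
    n_messages[label] += 1
vocabulary = set(counts["spam"]) | set(counts["ham"])


# Step iii: P(word | class), with add-one smoothing (an assumption of this sketch).
def likelihood(word, label):
    total = sum(counts[label].values())
    return (counts[label][word] + 1) / (total + len(vocabulary))


# Steps iv-v: break a test message into words and pick the higher-scoring class.
def classify(text):
    scores = {}
    for label in counts:
        score = math.log(n_messages[label] / sum(n_messages.values()))
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log(likelihood(word, label))
        scores[label] = score
    return max(scores, key=scores.get)


# Step vi: accuracy, precision, and recall, treating spam as the positive class.
def evaluate(examples):
    pairs = [(label, classify(text)) for label, text in examples]
    tp = sum(1 for actual, pred in pairs if actual == pred == "spam")
    fp = sum(1 for actual, pred in pairs if actual == "ham" and pred == "spam")
    fn = sum(1 for actual, pred in pairs if actual == "spam" and pred == "ham")
    accuracy = sum(1 for actual, pred in pairs if actual == pred) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


print(evaluate(test))
```

Summing log-likelihoods per message plays the same role as the series of joins mentioned in step v: each test word is looked up in the table of likelihoods built in step iii, and the per-class scores are aggregated by message.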