College of Health and Human Services
Fall 2016
Syllabus

Course information
HAP 780: Data Mining in Health Care
Time: Mondays, 7:20pm–10pm
Location: Aquia Building 219

Course placement
( ) Core (X) Concentration
(X) Elective ( ) Pre-requisite(s)
(X) Course(s) recommended before taking this course: HAP 700, HAP 709, HAP 602
It is impossible to mine and analyze data without good knowledge of database
systems. HAP 709 or other relational database courses (with SQL) are strongly
recommended before taking this course.

Instructor
Janusz Wojtusiak PhD
[email protected]
Office Hours by appointment Wednesdays 1-4pm (Northeast Module, Room 108,
Fairfax Campus)

Course description
An introductory course to data mining and knowledge discovery in health care.
Methods for mining health care databases and synthesizing task-oriented knowledge
from computer data and prior knowledge are emphasized. Topics include
fundamental concepts of data mining, data preprocessing, classification and
prediction (decision trees, attributional rules, Bayesian networks), constructive
induction, cluster and association analysis, knowledge representation and
visualization, and an overview of practical tools for discovering knowledge from
medical data. These topics are illustrated by examples of practical applications in
health care.

Course objectives
Upon completion of the course, students will be able to:
1. Understand and describe data mining techniques and their use in knowledge
discovery as it applies to health related fields.
2. Define a health related problem to be solved by means of data mining.
3. Apply data preprocessing techniques to clean and prepare data sets for analysis.
4. Build and assess predictive models using various techniques such as decision
trees, decision rules, Bayesian networks and clustering.
5. Develop skills in using recent data mining software for solving practical
problems in health services research and other medical and public health related
fields.
6. Use methods for presenting knowledge in natural language and other
understandable forms.
7. Review and critique current research papers on data mining algorithms and
implementations.

Required textbook(s) and/or materials
Required Text:
Class notes and slides.
Recommended Readings:
Han, J., Kamber, M., Pei, J. (2011), Data Mining: Concepts and Techniques, 3rd
edition, Morgan Kaufmann.
Witten, I.H., Frank, E., Hall, M.A. (2011), Data Mining: Practical Machine
Learning Tools and Techniques, 3rd edition, Morgan Kaufmann.
Black, K. (2008), Business Statistics for Contemporary Decision Making, John
Wiley & Sons, New Jersey.

Course requirements
Computer requirements
This is a computationally intensive course, and you are expected to access databases,
software tools, and other content. You will need:
• A fast computer (multicore PC or Mac) with at least 100GB of free disk space
and at least 4GB of RAM (4GB+ recommended), running Windows 7 or newer. Mac
users may need more powerful computers to run Windows in a virtual machine.
• A fast internet connection
• Microsoft Office for viewing and preparing files
• Other software will be provided in class (SQL Server, Weka, R, Genie,
Python)
If you do not have a sufficiently powerful computer, you can request access to the
Health Informatics Learning Lab, located in the Northeast Module, or use one of the
computer labs at GMU. It is the responsibility of students to configure and maintain
their own computers, and to make sure that they are set up correctly and that installed
software (e.g., security software) does not interfere with software used in class.
Expectations: Students are responsible for assigned readings, class content, and
material. Students are also responsible for finding the right computer equipment
that allows accessing the course materials and using data and software tools, and
for checking email/Blackboard on a daily basis.
Data mining is a very broad topic, which is condensed here into a one-semester
course. This course requires students to participate in lectures and spend at
least another 6 hours per week on assignments, readings, and the project.
Evaluation Methods:
If you are taking this course as part of a graduate-level course, you will receive a
grade. Your grade will depend on your participation, the quality of your project work,
and your teamwork. Assignments and projects are graded based on multiple criteria
that will be discussed in detail.
Always write all answers in your own words. Do not copy-and-paste.
You can ask questions by sending email to the instructor. In most cases you will
receive a response within 48 hours.
Participation Outside Classroom
You should attend a meeting (conference, seminar, local chapter meeting, etc.) and
write a description of about one page covering what you learned and how the event
relates to this course. It is not sufficient to simply pay the membership fee for a
professional organization and not participate in the organization in any
way. The report is due on the last day of classes. Look for a meeting early in the
semester. In-person meetings are strongly suggested.
Data Mining Topic presentation
You will need to prepare a 10-minute presentation about a topic related to the class.
I strongly recommend finding a journal article, analyzing it in detail, and presenting
it. Do not prepare presentations that repeat topics covered in class. Do not repeat
topics presented by other students.
Final Project
Data mining requires combining theoretical knowledge with practical skills. In order
to develop skills in the context of health care applications, a semester-long project
is the most important component of the grade.
The project topics should be related to analyzing health care data in order to solve
clinical or administrative problems. The project should include, but not be limited
to: (1) problem description; (2) data selection; (3) data pre-processing; (4) selection
of DM methods; (5) application of methods; (6) analysis of results; (7) review of
available literature and related work; (8) conclusions and description of impact on
health care. Include a brief description of what you learned in the project.
Direct application of existing software to publicly available datasets is not
sufficient. The projects must demonstrate significant efforts in data manipulation,
processing, and mining. Projects must also illustrate understanding of applied
techniques as well as the health care problem addressed.
In the past, some student projects were revised, extended, and submitted for
conference presentations.

Teaching methods
(X) Lecture ( ) Group work ( ) Independent research ( ) Field work
( ) Papers ( ) Guest speakers ( ) Student presentations ( ) Case Studies
(X) Lab ( ) Class discussion ( ) Other ___________
Evaluation
Weekly Assignments: 35%
DM topic presentation: 10%
Participation Outside Classroom: 5%
Semester-long project: 50%

Grading Scale
96+: A
90-95: A-
86-89: B+
80-85: B
75-79: B-
70-74: C
0-70: F

Mason Honor Code
The complete Honor Code is as follows:
To promote a stronger sense of mutual responsibility, respect, trust, and fairness
among all members of the George Mason University community and with the desire
for greater academic and personal achievement, we, the student members of the
university community, have set forth this honor code: Student members of the
George Mason University community pledge not to cheat, plagiarize, steal, or lie
in matters related to academic work.
(From the 2016-17 Catalog – catalog.gmu.edu)

Individuals with Disabilities
The university is committed to providing equal access to employment and
educational opportunities for people with disabilities. Mason recognizes that
individuals with disabilities may need reasonable accommodations to have equally
effective opportunities to participate in or benefit from the university educational
programs, services, and activities, and have equal employment opportunities. The
university will adhere to all applicable federal and state laws, regulations, and
guidelines with respect to providing reasonable accommodations as necessary to
afford equal employment opportunity and equal access to programs for qualified
people with disabilities. Applicants for admission and students requesting
reasonable accommodations for a disability should call the Office of Disability
Services at 703-993-2474. Employees and applicants for employment should call
the Office of Equity and Diversity Services at 703-993-8730. Questions regarding
reasonable accommodations and discrimination on the basis of disability should be
directed to the Americans with Disabilities Act (ADA) coordinator in the Office of
Equity and Diversity Services.
(From the 2016-17 Catalog – catalog.gmu.edu)

E-mail Policy
Web: masonlive.gmu.edu
Mason uses electronic mail to provide official information to students. Examples
include notices from the library, notices about academic standing, financial aid
information, class materials, assignments, questions, and instructor feedback.
Students are responsible for the content of university communication sent to their
Mason e-mail account and are required to activate that account and check it
regularly. Students are also expected to maintain an active and accurate mailing
address in order to receive communications sent through the United States Postal
Service.
(From the 2016-17 Catalog – catalog.gmu.edu)

I plan to videotape selected lectures for the future use of online students. The
recorded videos will be posted online for students to view. The camera will be
facing the screen and instructor. Because live interaction with class is recorded,
some of your questions and voice may be also recorded. If you do not wish to be
on the final recording, please let me know. Then, I will ask for your help to review
the final versions of recordings to ensure that you are completely edited out.

Tentative Weekly Schedule
The schedule below is approximate and may be changed to adapt to students' needs and
requests, new material, and for other reasons. Due dates and assignments are subject to
change and will be provided weekly.
Wk     Date          Topics
1      8/29          Introduction to data mining in health care; Review of databases; Introduction to software
2      9/5           No Class – Labor Day
       9/12          Measuring/Describing the world; Data Preprocessing - part 1
3      9/19          Data preprocessing - part 2; Knowledge representation
4      9/26          Data preprocessing – part 3: Exploratory data analysis, simple statistics; Review of types of health data
5      10/3          Mining Frequent Patterns/Associations
6      10/11 (Tue)   Classification and Regression: Basics
7      10/17         Classification 2
8      10/24         Cluster Analysis
9      10/31         Outlier Detection
10     11/7          Time and Space
11     11/14         No class meeting – online material assigned on healthcare applications
12-13  11/21, 11/28  Text and Image Mining; Genomic data; BIG DATA Analysis
14     12/5          Final Project Presentations

Assignments, in schedule order: What do you know? (week 1); What do you know/prepare
sample data (twice); What do you know/analyze data (ten times); All Missing
Assignments Due.
Due dates, in schedule order: 9/11, 9/18, 9/25, 10/2, 10/9, 10/16, 10/23, 10/30,
11/6, 11/13, 11/21, 11/21, 12/5.

Sample Assignments
Below are draft instructions for some of the assignments. They are for information
purposes only to help students better plan time and understand course content. The
actual assignments will be posted on Blackboard and may be different than ones
presented here.
Assignment 1 – Introduction to Databases and Data Mining
When answering questions: (1) use your own words; (2) discuss your answers; (3) do not
copy-and-paste; (4) provide enough detail so that I can help if an answer is incorrect.
1. How is a data warehouse different from an operational database? Are there any
similarities?
2. Why are outliers particularly important in healthcare applications? Explain and give
examples.
3. How does the data mining process differ for very large data and very small data?
Describe challenges related to both types of data.
4. You are a consultant hired by a hospital. The hospital is engaged in a quality
improvement process to reduce medical errors. You are asked to analyze data to learn
why some reported incidents result in lawsuits or claims, while others don’t. Describe
how you would approach this problem. What type of data mining is involved? What do
you need in the data in order to perform this task?
5. Load the attached file hepatitis.csv into SQL Server, MySQL, PostgreSQL, or another
relational database (do not use MS Access). Some data description is available at:
http://archive.ics.uci.edu/ml/datasets/Hepatitis
Prepare SQL queries to answer the following questions:
- What is the average value of bilirubin?
- What is the average value of bilirubin for patients that live?
- What is the average value of bilirubin for patients that are dead?
- How many patients have a value of bilirubin higher than the average in the data?
- How many patients are at least 50 years old? How many of them are dead?
- How many patients are younger than 50 years? How many of them are dead?
- Is the average age of dead patients higher than the average age of alive patients?
Present both queries and results. Do not build the queries by clicking in a graphical
tool. Write them in SQL.
6. Decide on the topic and date of your presentation. Submit the date and topic of
your presentation via Doodle.
7. Propose the topic of your project. Write 1-2 paragraphs describing what you want to
do. The topic may evolve later, but you need to start thinking about it now.
8. Download “SQL Server 2016 Enterprise 64-bit (English)” from DreamSpark. Install
the server yourself, or keep the downloaded files on your laptop for installation in class.
We will proceed with step-by-step installation.
http://e5.onthehub.com/WebStore/Welcome.aspx?vsro=8&ws=058b3ace-2512-e111a703-f04da23e67f6
It is free for HAP students for academic use! We will do the installation in class next
week. If you are a Mac user, you need to install it in a virtual machine running
Windows. Windows can also be downloaded from DreamSpark. You need virtualization
software such as Parallels or VirtualBox.
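
For the SQL part of question 5, the kinds of queries involved can be tried out before a full database server is installed. The following is only a minimal sketch using Python's built-in sqlite3 module. The table name, the column names (age, bilirubin, outcome) and the five rows are made up for illustration and are not taken from hepatitis.csv; the SQL dialect of SQL Server or MySQL may differ in small details.

```python
import sqlite3

# Hypothetical toy rows (age, bilirubin, outcome); NOT taken from hepatitis.csv.
rows = [
    (30, 1.0, "LIVE"),
    (50, 0.9, "LIVE"),
    (78, 0.7, "LIVE"),
    (31, 4.6, "DIE"),
    (62, 2.8, "DIE"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hepatitis (age INTEGER, bilirubin REAL, outcome TEXT)")
conn.executemany("INSERT INTO hepatitis VALUES (?, ?, ?)", rows)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Plain aggregate over the whole table.
avg_all = scalar("SELECT AVG(bilirubin) FROM hepatitis")

# Aggregate restricted to one outcome class.
avg_live = scalar("SELECT AVG(bilirubin) FROM hepatitis WHERE outcome = 'LIVE'")

# A subquery supplies the overall average used in the filter.
above_avg = scalar(
    "SELECT COUNT(*) FROM hepatitis "
    "WHERE bilirubin > (SELECT AVG(bilirubin) FROM hepatitis)"
)

# Two conditions combined with AND.
older_dead = scalar(
    "SELECT COUNT(*) FROM hepatitis WHERE age >= 50 AND outcome = 'DIE'"
)

print(avg_all, avg_live, above_avg, older_dead)
```

The subquery pattern in `above_avg` is the part most worth noting: the average is computed inside the query rather than typed in by hand.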
Assignment 2 – Preprocessing Part 1
1. What types of dirty data can one expect when starting a data mining project? List at
least five potential problems and give examples.
2. Why do you have to specify field types when loading data into a database or data
warehouse? Why not make everything a text field?
3. Why are semantic/analytic data types important in data mining?
4. Use the “claims” data from HAP 709 class (you can get Excel files at
http://gunston.gmu.edu/healthscience/709/QuerriesReports.asp#Analyze%20data ).
- How many patients have chronic conditions?
- What is the most common chronic condition?
- What is the maximum number of body systems involved for a single patient? Which
patient(s)?
You should use definitions of chronic conditions and body systems from the website of
the Agency for Healthcare Research and Quality. The mapping is available in the
CCI2012.CSV file at the website: http://www.hcup-us.ahrq.gov/toolssoftware/chronic/chronic.jsp
You will need to load claims data to SQL Server (preferred) or Access. Then you need to
load chronic conditions indicators (CCI2012.CSV file). Then you need to link the two
files in queries that answer the questions above. Please send screenshots of different steps
of your work, the final results, and any SQL code you used.
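
Linking the two files in question 4 amounts to a join on the diagnosis code. The sketch below illustrates the idea with Python's built-in sqlite3 module. The table layouts, column names (patient_id, icd9, is_chronic, body_system) and the placeholder codes 'A', 'B', 'C' are assumptions made for illustration; the real claims file and CCI2012.CSV have their own column names and values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical layouts; real files use different column names.
CREATE TABLE claims (patient_id INTEGER, icd9 TEXT);
CREATE TABLE cci (icd9 TEXT, is_chronic INTEGER, body_system INTEGER);
INSERT INTO claims VALUES (1, 'A'), (1, 'B'), (2, 'C'), (3, 'A');
INSERT INTO cci VALUES ('A', 1, 7), ('B', 1, 3), ('C', 0, 8);
""")

# Patients with at least one diagnosis flagged as chronic.
chronic_patients = conn.execute("""
    SELECT COUNT(DISTINCT c.patient_id)
    FROM claims AS c
    JOIN cci AS m ON m.icd9 = c.icd9
    WHERE m.is_chronic = 1
""").fetchone()[0]

# Number of distinct body systems with chronic conditions, per patient,
# largest first.
systems_per_patient = conn.execute("""
    SELECT c.patient_id, COUNT(DISTINCT m.body_system) AS n_systems
    FROM claims AS c
    JOIN cci AS m ON m.icd9 = c.icd9
    WHERE m.is_chronic = 1
    GROUP BY c.patient_id
    ORDER BY n_systems DESC, c.patient_id
""").fetchall()

print(chronic_patients, systems_per_patient)
```

With real data, code formatting (for example, codes stored with or without a decimal point or surrounding quotes) often has to be harmonized before the join matches anything.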
Assignment 3 – Preprocessing Part 2
1. Why is it important to explore data before executing any DM algorithms? What can go
wrong? What can be discovered in an exploratory analysis?
2. What are the three types of missing values? Give examples.
3. Suppose you are given two datasets: a survey result about satisfaction with a clinic
visit and patient records. For privacy reasons personal information (name, record number,
address) has been removed from both datasets. Your goal is to find out if there is any
relationship between the survey results and severity of cases (severity can be obtained
from medical records).
- How do you approach the problem of linking the two datasets? Speculate on what
attributes you would use to link them.
- Should you assume that a medical record is found for every patient who completed
the survey?
- Should you assume that a survey is found for every patient who was treated at the clinic?
4. Using the “Hepatitis” dataset from Assignment 1:
- Write SQL queries to calculate arithmetic mean, standard deviation, median, mode, and
all three quartiles for SEX, AGE, and BILIRUBIN. Note: make calculations only for
values that make sense.
- Using SQL, prepare data for a Q-Q plot of BILIRUBIN levels for male vs. female patients.
You do not need to make the actual plot (although the prepared data can be easily copied
and plotted as a scatterplot in Excel). Prepare data in the form of a table with 2 columns,
in which the first column corresponds to male and the second to female patients, and rows
correspond to selected quantiles. Note: even if you fail at some details here and
something does not work, describe the procedure for how you would approach the problem.
5. Install Weka software http://www.cs.waikato.ac.nz/ml/weka/ on your computer.
It runs on most platforms (Windows, Mac, Linux) and requires Java (JRE). Installation is
simple. We will be using Weka in class. Send screenshots.
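
The summary statistics and the Q-Q table in question 4 can be cross-checked outside the database. The sketch below uses Python's standard statistics module rather than SQL, with made-up bilirubin values for two groups; the function names `summarize` and `qq_rows` are hypothetical, and the "inclusive" quantile method is one of several conventions, so results from SQL may differ slightly.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "quartiles": (q1, q2, q3),
    }


def qq_rows(group_a, group_b, n=4):
    """Pair the same quantiles of two groups; each pair is one Q-Q point."""
    qa = statistics.quantiles(group_a, n=n, method="inclusive")
    qb = statistics.quantiles(group_b, n=n, method="inclusive")
    return list(zip(qa, qb))


# Made-up values for illustration only.
male = [0.7, 0.9, 1.0, 1.2, 2.0]
female = [0.6, 0.8, 1.0]

male_summary = summarize(male)
table = qq_rows(male, female)   # rows: (male quantile, female quantile)
for row in table:
    print(row)
```

Because both columns are evaluated at the same quantile levels, the groups do not need to have the same number of patients.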
Assignment 4 – Data Transformation
1. Why is it important to select the right attributes for DM algorithms? Why not keep all
attributes?
2. Give an example (different than in lecture) of when creation of new attributes may be
useful.
3. When should stratified sampling be used?
4. Using hepatitis data, write SQL code for:
- Sampling 50% of data without repetitions
- Sampling 50% of data with repetitions
- Stratified sampling, to make frequency of both classes equal
5. In Weka, using the hepatitis data
- Compare results of three different attribute selection methods
- Compare results of three different sampling methods (show plotted data distributions).
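
The three sampling schemes in question 4 are easier to reason about once seen side by side. The sketch below expresses them in Python rather than SQL, purely as an illustration of the ideas. The function names are hypothetical, and equalizing class frequencies by downsampling every class to the size of the smallest one is just one possible choice (oversampling the smaller class is another).

```python
import random
from collections import defaultdict


def sample_without_replacement(rows, fraction, rng):
    """Each row can be picked at most once."""
    k = int(len(rows) * fraction)
    return rng.sample(rows, k)


def sample_with_replacement(rows, fraction, rng):
    """The same row may be picked several times."""
    k = int(len(rows) * fraction)
    return rng.choices(rows, k=k)


def balanced_sample(rows, label_of, rng):
    """Stratify by class label and draw the same number of rows per class."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    size = min(len(group) for group in groups.values())
    sample = []
    for group in groups.values():
        sample.extend(rng.sample(group, size))
    return sample


# Made-up data: 3 rows of class DIE and 7 rows of class LIVE.
data = [(i, "DIE" if i < 3 else "LIVE") for i in range(10)]
rng = random.Random(0)  # fixed seed so the run is repeatable

half_unique = sample_without_replacement(data, 0.5, rng)
half_repeat = sample_with_replacement(data, 0.5, rng)
balanced = balanced_sample(data, lambda row: row[1], rng)
print(half_unique, half_repeat, balanced, sep="\n")
```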
Assignment 5 – Association Rule Mining
1. Describe why association rules and classification rules are not the same. Give
examples.
2. Why is the FP-Tree algorithm usually faster than Apriori? Give some intuitive
explanation.
3. List at least four metrics of quality for association rules. Provide formulas.
4. Using Heritage data (release 1) in SQL
   a. Find support for all single itemsets
   b. List all itemsets with 2 elements and support of at least 0.2
   c. List all itemsets with 3 elements and support of at least 0.2
5. In Weka
   a. Load Heritage data (release 1)
   b. Apply at least two association rule generation algorithms and compare
      results
   c. Apply the FP-Tree algorithm with at least two measures of rule metrics
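
The support computations in question 4 can be checked on a small example before writing the SQL. The sketch below is a brute-force Python illustration, not Apriori or FP-Tree: it counts every itemset of a given size directly. The function name `itemset_support` and the five transactions are made up and are not Heritage data.

```python
from itertools import combinations


def itemset_support(transactions, size, min_support=0.0):
    """Support of every itemset of the given size.

    Support = (number of transactions containing the itemset) / (all transactions).
    """
    n = len(transactions)
    counts = {}
    for transaction in transactions:
        # Sorting makes each itemset appear under one canonical key.
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] = counts.get(itemset, 0) + 1
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


# Made-up transactions: each set lists the conditions recorded for one patient.
transactions = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "asthma"},
    {"asthma"},
    {"hypertension"},
    {"diabetes"},
]

singles = itemset_support(transactions, 1)
frequent_pairs = itemset_support(transactions, 2, min_support=0.3)
print(singles)
print(frequent_pairs)
```

Brute-force counting grows quickly with itemset size and number of distinct items, which is the cost that Apriori-style pruning and FP-Tree are designed to avoid.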
Assignment 6 – Classification and Regression
1. Describe the process of preparing data for classification learning.
2. Why is it important to select the correct type of model? List at least three reasons.
3. Suppose you are asked to create a model to predict hospitalization. What you have
is claims data. Describe the process of preparing data, constructing the model, and
testing the model. Should you use regression learning or classification learning for
this problem? Why?
4. In SQL/Weka:
   a. Prepare Heritage data for classification learning
   b. Load Heritage data release 3 (preprocessed to binary representation,
      including demographics and output attribute(s))
   c. Perform exploratory analysis
   d. Create at least three classification models for predicting hospitalization
      based on Year 1 data.
   e. Which model performs the best on Year 2 data?
   f. Create a regression model for predicting hospitalization days.
   g. What is the difference between regression and classification models?
   h. Present your results in the form of a short report that includes screenshots,
      tables, and needed description.
Assignment 7 – Classification Part 2
1. When can ROC be used as a measure of quality of models? How does it differ from
other measures such as accuracy, precision or recall?
2. When designing a system for detecting life-threatening events for ICU patients, what
is more important: precision or recall? Why?
3. Describe how multiple classification models can be combined.
4. Using Heritage release 3 data prepared last week
   a. Include drug information into data
   b. Include laboratory information into data
   c. Import newly created data into Weka and run classification algorithms
   d. Does inclusion of the information improve predictions?
There are many ways to complete question 4, so you need to make different decisions.
Try not to overcomplicate the problem.
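
Questions 1 and 2 refer to accuracy, precision and recall. As a reminder of how these three measures are computed from the same predictions, here is a minimal Python sketch. The function name `classification_metrics` and the example labels are made up; returning 0.0 when a denominator is zero is a convention chosen here, not the only option.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of the cases flagged positive, how many really were positive?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the truly positive cases, how many were flagged?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 1 = event occurred, 0 = no event.
metrics = classification_metrics([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])

# A model that never predicts the rare event still looks accurate.
rare = classification_metrics([1] + [0] * 9, [0] * 10)
print(metrics, rare)
```

The second call shows why accuracy alone can mislead when the positive class is rare: accuracy is high while recall is zero.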
Assignment 8 – Cluster Analysis
1. Describe differences and similarities between clustering and classification. Use
examples.
2. Suppose you are given a dataset with 1000 binary attributes representing presence
of diagnoses. Describe potential problems with clustering data with that many
dimensions.
3. Using the data table shown below.
   a. Calculate the distance between all points in the 1-norm, 2-norm, and infinity
      norm. Show the dissimilarity matrix.
   b. Is there any need to preprocess the data to be more suitable for
      clustering? If so, describe the operations and show the resulting data
      table.
   c. Apply the k-means clustering algorithm with k=2.

ID  Age  BMI  Gender  Total Cholesterol
1   30   24   M       180
2   70   19   M       190
3   65   26   M       220
4   40   32   F       260

4. In Weka using the Heritage 3 dataset
   a. Apply the k-means algorithm for k=2,3,5,10
   b. Apply the EM algorithm. What is the optimal number of clusters obtained by
      EM?
   c. Compare the created clusters to classification based on hospitalization in
      year 2.
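
For question 3a, the three norms differ only in how the per-attribute differences are combined. The sketch below computes dissimilarity matrices in Python for the numeric columns of the table (Age, BMI, Total Cholesterol) on the raw, unscaled values. Leaving Gender out and skipping any rescaling are simplifying assumptions made here for illustration, not a recommendation; question 3b asks whether such preprocessing is needed.

```python
import math


def minkowski(a, b, p):
    """Distance between two points in the p-norm (p may be math.inf)."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs)          # infinity norm: largest single difference
    return sum(d ** p for d in diffs) ** (1.0 / p)


def dissimilarity_matrix(points, p):
    """Square matrix of pairwise distances."""
    return [[minkowski(a, b, p) for b in points] for a in points]


# (Age, BMI, Total Cholesterol) for IDs 1-4 from the table above.
points = [
    (30, 24, 180),
    (70, 19, 190),
    (65, 26, 220),
    (40, 32, 260),
]

d1 = dissimilarity_matrix(points, 1)
d2 = dissimilarity_matrix(points, 2)
dinf = dissimilarity_matrix(points, math.inf)

for row in d1:
    print(row)
```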
Assignment 9 – Outlier Detection
1. How do outliers differ from noise? Provide examples of both.
2. Why are collective outliers harder to find than individual outliers?
3. What are the issues in applying classification methods for outlier detection? What are
the requirements?
4. Using Heritage data, are there any patients that can be considered outliers in the
data? Why?
Assignment 10 – Text Mining
1. Why is text mining important in healthcare applications?
2. Describe pre-processing of clinical notes before classification learning can be
applied. What methods can be used at each step of pre-processing?
3. Write regular expressions to:
   a. Detect zip codes in text
   b. Find last names of all patients whose first name is John
   (note that regular expressions may have some false positives/false negatives).
4. List challenges in automatically retrieving ICD-9 codes from clinical notes. Search
the literature to find relevant published work. Also, include your own observations and
comments.
5. Using the SMS data
   a. Split data into training (80%) and testing (20%) sets
   b. Build a naïve Bayes classifier for detecting spam based on bag of words
      i. List all words in the documents
      ii. Count occurrences in spam and ham
      iii. Assign likelihoods P(word|spam) and P(word|ham) for all words
      iv. Convert test data into a list of words. For each message you need 2
          columns: message id and word
      v. Classify test data. This can be done by a series of joins with the
         data prepared in (iii).
      vi. Evaluate your model (accuracy, precision, recall)
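
For question 3, the sketch below shows one possible pair of patterns using Python's re module. Both patterns are assumptions about the text format (US-style 5-digit or ZIP+4 codes, and a capitalized last name directly following "John"), and the sample sentence is made up. As the question notes, such patterns produce false positives and false negatives, for example any other 5-digit number, or a middle name mistaken for a last name.

```python
import re

# 5 digits, optionally followed by a hyphen and 4 more digits (ZIP+4).
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")

# The word "John", whitespace, then a capitalized word captured as the last name.
JOHN_PATTERN = re.compile(r"\bJohn\s+([A-Z][A-Za-z'-]+)")

# Made-up text for illustration.
text = (
    "John Smith lives in Fairfax, VA 22030. "
    "Mary Johnson and John O'Neil moved to 20110-1234."
)

zip_codes = ZIP_PATTERN.findall(text)
last_names = JOHN_PATTERN.findall(text)
print(zip_codes, last_names)
```

Requiring whitespace after "John" is what keeps "Johnson" from matching.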
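
Steps i-v of question 5b in Assignment 10 can be sketched compactly in Python before implementing them with SQL joins. This is only an illustration under stated assumptions: the four training messages are made up rather than taken from the SMS data, tokenization is a plain lowercase split, and add-one (Laplace) smoothing is added so that unseen words do not produce zero probabilities. The assignment text does not prescribe smoothing or a particular tokenizer.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple bag-of-words tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train(messages):
    """messages: list of (label, text) pairs, label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def log_likelihood(word, label, word_counts, vocab):
    """log P(word | label) with add-one smoothing (an assumption of this sketch)."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + 1) / (total + len(vocab)))


def classify(text, model):
    """Pick the label with the larger log prior + sum of log likelihoods."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(doc_counts[label] / n_docs)  # needs both labels in training
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += log_likelihood(word, label, word_counts, vocab)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training messages for illustration only.
training = [
    ("spam", "win cash now"),
    ("spam", "free cash prize"),
    ("ham", "see you at lunch"),
    ("ham", "lunch at noon"),
]
model = train(training)
print(classify("free cash", model), classify("lunch at noon", model))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long messages; in SQL the same effect can be had by summing LOG values after the join.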