HAP 780 - CHHS - George Mason University

College of Health and Human Services
Fall 2015 Syllabus
HAP 780: Data Mining in Health Care

Course information
Time: Mondays, 7:20pm – 10pm
Location: Research Hall 202

Course placement
( ) Core (X) Concentration (X) Elective ( ) Pre-requisite(s)
(X) Course(s) recommended before taking this course: HAP 700, HAP 709, HAP 602
It is impossible to mine and analyze data without good knowledge of database systems. HAP 709 or other relational database courses (with SQL) are strongly recommended before taking this course.

Instructor
Janusz Wojtusiak, PhD
[email protected]
Office hours: by appointment, Wednesdays 1-4pm (Northeast Module, Room 108, Fairfax Campus)

Course description
An introductory course to data mining and knowledge discovery in health care.
Methods for mining health care databases and synthesizing task-oriented knowledge
from computer data and prior knowledge are emphasized. Topics include
fundamental concepts of data mining, data preprocessing, classification and
prediction (decision trees, attributional rules, Bayesian networks), constructive
induction, cluster and association analysis, knowledge representation and
visualization, and an overview of practical tools for discovering knowledge from
medical data. These topics are illustrated by examples of practical applications in
health care.

Course objectives
Upon completion of the course, students will be able to:
1. Understand and describe data mining techniques and their use in knowledge
discovery as it applies to health related fields.
2. Define a health related problem to be solved by means of data mining.
3. Apply data preprocessing techniques to clean and prepare data sets for analysis.
4. Build and assess predictive models using various techniques such as decision
trees, decision rules, Bayesian networks and clustering.
5. Develop skills in using recent data mining software for solving practical
problems in health services research and other medical and public health related
fields.
6. Use methods for presenting knowledge in natural language and other
understandable forms.
7. Review and critique current research papers on data mining algorithms and
implementations.

Required textbook(s) and/or materials
Required Text:
Class notes and slides.
Assigned Readings:
Han, J., Kamber, M., Pei, J. (2011), Data Mining: Concepts and Techniques, 3rd
edition, Morgan Kaufmann.
Witten I.H., Frank E. (2005). Data Mining: Practical Machine Learning Tools and
Techniques, second edition. Morgan Kaufmann.
Black K. (2008). Business Statistics for Contemporary Decision Making. New
Jersey: John Wiley & Sons.

Course requirements
Computer requirements
This is a computationally intensive course and you are expected to access databases,
software tools, and other content. You will need:
- Fast computer (multicore PC or Mac) with at least 50GB of free disk space
and at least 2GB RAM (4GB+ recommended), Windows 7 or newer. Mac
users may require more powerful computers to enable virtualization to run
Windows.
- Fast internet connection
- Microsoft Office for viewing and preparing files
- Other software will be provided in class (most likely: SQL Server, Weka, R,
Genie, Python)
If you do not have a sufficient computer, you can request access to the Health Informatics
Learning Lab, located in Northeast Module, or use one of the computer labs at GMU. It
is the responsibility of students to configure and maintain their own computers and to make sure
that they are set up correctly and that installed software (e.g., security software) does not interfere with
software used in class.
Expectations: Students are responsible for assigned readings, class content, and
material. Students are also responsible for finding the right computer equipment
that allows accessing the course materials and using data and software tools, and
for checking email/Blackboard on a daily basis.
Data mining is a very broad topic, which is condensed here into a one-semester
course. This course requires students to participate in lectures and spend at
least another 3 hours per week on assignments, reading, and the project.
Evaluation Methods:
If you are taking this course as part of a graduate level course, you will receive a
grade. Your grade will depend on your participation, quality of your project work
and your team work. Assignments and projects are graded based on multiple criteria
that will be discussed in detail.
Always write all answers in your own words. Do not copy-and-paste.
You can ask questions by sending email to the instructor. In most cases you will
receive a response within 48 hours.
Participation Outside Classroom
You should attend a meeting (conference, seminar, local chapter meeting, etc.) and
write a description, about one page long, of what you learned and how the attended event
relates to this course. It is not sufficient to simply pay the membership fee for a
professional organization and not participate in the organization in any
way. The report is due on the last day of classes. Look for a meeting early in the semester.
Data Mining Topic Presentation
You will need to prepare a 10-minute presentation about a topic related to the class. I strongly recommend finding a journal article, analyzing it in detail, and presenting it. Do not prepare presentations that repeat topics covered in class.

Final Project
Data mining requires combining theoretical knowledge with practical skills. In order to develop skills in the context of health care applications, a semester-long project is the most important component of the grade. The project topics should be related to analyzing healthcare data in order to solve clinical or administrative problems. The project should include, but not be limited to: (1) problem description; (2) data selection; (3) data pre-processing; (4) selection of DM methods; (5) application of methods; (6) analysis of results; (7) review of available literature and related work; (8) conclusions and description of impact on healthcare. Also include a brief description of what you learned in the project.

Direct application of existing software to publicly available datasets is not sufficient. The projects must demonstrate significant effort in data manipulation, processing, and mining. Projects must also illustrate understanding of the applied techniques as well as the healthcare problem. In the past, some student projects were revised, extended, and submitted for conference presentations.

Teaching methods
(X) Lecture ( ) Group work ( ) Independent research ( ) Field work ( ) Papers ( ) Guest speakers
( ) Student presentations ( ) Case Studies (X) Lab ( ) Class discussion ( ) Other ___________

Evaluation
Weekly Assignments                 35%
DM topic presentation              10%
Participation Outside Classroom     5%
Semester-long project              50%

Grading Scale
96+      A
90-95    A-
86-89    B+
80-85    B
75-79    B-
70-74    C
0-70     F

Mason Honor Code
The complete Honor Code is as follows: To promote a stronger sense of mutual responsibility, respect, trust, and fairness among all members of the George Mason University community and with the desire for greater academic and personal achievement, we, the student members of the university community, have set forth this honor code: Student members of the George Mason University community pledge not to cheat, plagiarize, steal, or lie in matters related to academic work.
(From the 2015-16 Catalog – catalog.gmu.edu)

Individuals with Disabilities
The university is committed to providing equal access to employment and educational opportunities for people with disabilities. Mason recognizes that individuals with disabilities may need reasonable accommodations to have equally effective opportunities to participate in or benefit from the university educational programs, services, and activities, and have equal employment opportunities. The university will adhere to all applicable federal and state laws, regulations, and guidelines with respect to providing reasonable accommodations as necessary to afford equal employment opportunity and equal access to programs for qualified people with disabilities. Applicants for admission and students requesting reasonable accommodations for a disability should call the Office of Disability Services at 703-993-2474. Employees and applicants for employment should call the Office of Equity and Diversity Services at 703-993-8730. Questions regarding reasonable accommodations and discrimination on the basis of disability should be directed to the Americans with Disabilities Act (ADA) coordinator in the Office of Equity and Diversity Services.
(From the 2015-16 Catalog – catalog.gmu.edu)

E-mail Policy
Web: masonlive.gmu.edu
Mason uses electronic mail to provide official information to students. Examples include notices from the library, notices about academic standing, financial aid information, class materials, assignments, questions, and instructor feedback. Students are responsible for the content of university communication sent to their Mason e-mail account and are required to activate that account and check it regularly. Students are also expected to maintain an active and accurate mailing address in order to receive communications sent through the United States Postal Service.
(From the 2015-16 Catalog – catalog.gmu.edu)

I plan to videotape selected lectures for the future use of online students. The recorded videos will be posted online for students to view. The camera will be facing the screen and instructor. Because live interaction with the class is recorded, some of your questions and your voice may also be recorded. If you do not wish to be on the final recording, please let me know. I will ask for your help to review the final versions of recordings to ensure that you are completely edited out.

Tentative Weekly Schedule
The schedule below is approximate and may be changed to adapt to students' needs and
requests, new material, and for other reasons. Due dates and assignments are subject to
change and will be provided weekly.
Wk     Date          Topics                                         Assignments                            Due Date
1      8/31          Introduction to data mining in health care;    What do you know?                      9/5
                     Review of databases;
                     Introduction to software
2      9/7           No Class – Labor Day
       9/14          Measuring/Describing the world;                What do you know/prepare sample data   9/20
                     Data preprocessing - part 1
3      9/21          Data preprocessing - part 2;                   What do you know/prepare sample data   9/26
                     Knowledge representation
4      9/28          Data preprocessing - part 3: Exploratory       What do you know/analyze data          10/3
                     data analysis, simple statistics;
                     Review of types of health data
5      10/5          Mining Frequent Patterns/Associations          What do you know/analyze data          10/10
6      10/13 (Tue)   Classification and Regression: Basics          What do you know/analyze data          10/17
7      10/19         Classification 2                               What do you know/analyze data          10/24
8      10/26         Cluster Analysis                               What do you know/analyze data          10/30
9      11/2          Outlier Detection                              What do you know/analyze data          11/7
10     11/9          Time and Space                                 What do you know/analyze data          11/14
11     11/16         No class meeting – online material             What do you know/analyze data          11/21
                     assigned on healthcare applications
12-13  11/23, 11/30  Text and Image Mining; Genomic data;           What do you know/analyze data          11/21, 12/5
                     BIG DATA Analysis
14     12/7          Final Project Presentations                    All Missing Assignments Due
Sample Assignments
Below are draft instructions for some of the assignments. They are for information purposes only, to help students better plan time and understand course content. The actual assignments will be posted on Blackboard and may be different from the ones presented here.

Assignment 1 – Introduction to Databases and Data Mining
1. How is a data warehouse different from a production database? Are there any
similarities?
2. Why are outliers particularly important in healthcare applications? Explain and give
examples.
3. How does the data mining process for very large data differ from that for very small data? Describe
challenges related to both types of data.
4. You are a consultant hired by a hospital. The hospital is engaged in a quality
improvement process to reduce medical errors. You are asked to analyze data to learn
why some reported incidents result in lawsuits or claims, while others don’t. Describe
how you would approach this problem. What type of data mining is involved? What do
you need in the data in order to perform this task?
5. Load the attached file hepatitis.csv into MS Access. Some data description is available at:
http://archive.ics.uci.edu/ml/datasets/Hepatitis
Prepare SQL queries to answer the following questions:
- what is the average value of bilirubin?
- what is the average value of bilirubin for patients that live?
- what is the average value of bilirubin for patients that are dead?
- how many patients are at least 50 years old? How many of them are dead?
- how many patients are younger than 50 years? How many of them are dead?
- is average age of dead patients higher than average age of alive patients?
Present both queries and results.
6. Decide on the topic and date of your presentation.
7. Propose the topic of your project. Give a one paragraph description of what you want
to do.
8. Download “Microsoft SQL Server 2012 Enterprise 32/64-bit (English)” from
DreamSpark. Keep the downloaded files on your laptop for installation in class.
http://e5.onthehub.com/WebStore/Welcome.aspx?vsro=8&ws=058b3ace-2512-e111a703-f04da23e67f6
It is free for HAP students for academic use! If you are a Mac user, you need to install it in
a Windows VM. Windows can also be downloaded from DreamSpark.
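
The kind of aggregate queries asked for in question 5 can be sketched as follows. This is only an illustration, run through Python's built-in sqlite3 module rather than MS Access; the table name, column names (age, bilirubin, status), and the five rows are made up for the example and are not taken from hepatitis.csv.

```python
import sqlite3

# Toy stand-in for hepatitis.csv; column names and values are assumptions.
rows = [
    # (age, bilirubin, status)
    (30, 1.0, "LIVE"),
    (50, 0.9, "LIVE"),
    (78, 0.7, "LIVE"),
    (34, 2.5, "DIE"),
    (62, 4.1, "DIE"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hepatitis (age INTEGER, bilirubin REAL, status TEXT)")
conn.executemany("INSERT INTO hepatitis VALUES (?, ?, ?)", rows)

# Average bilirubin over all patients.
avg_all = conn.execute("SELECT AVG(bilirubin) FROM hepatitis").fetchone()[0]

# Average bilirubin per outcome group.
avg_by_status = dict(
    conn.execute("SELECT status, AVG(bilirubin) FROM hepatitis GROUP BY status")
)

# Patients aged 50 or older, and how many of them died.
older, older_dead = conn.execute(
    "SELECT COUNT(*), SUM(CASE WHEN status = 'DIE' THEN 1 ELSE 0 END) "
    "FROM hepatitis WHERE age >= 50"
).fetchone()

print(avg_all, avg_by_status, older, older_dead)
```

The same SELECT statements should carry over to Access or SQL Server with small dialect changes; missing values in the real file would need to be handled before averaging.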
Assignment 2 – Preprocessing Part 1
1. What types of dirty data can one expect when starting a data mining project? List at
least five potential problems and give examples.
2. Why do you have to specify field types when loading data into a database or data
warehouse? Why not make everything a text field?
3. Why are semantic/analytic data types important in data mining?
4. Use the “claims” data from HAP 709 class (you can get Excel files at
http://gunston.gmu.edu/healthscience/709/QuerriesReports.asp#Analyze%20data ).
- How many patients have chronic conditions?
- What is the most common chronic condition?
- What is the maximum number of body systems involved for a single patient? Which
patient(s)?
You should use definitions of chronic conditions and body systems from the website of
the Agency for Healthcare Research and Quality. The mapping is available in the CCI2012.CSV file
at the website: http://www.hcup-us.ahrq.gov/toolssoftware/chronic/chronic.jsp
You will need to load claims data to SQL Server (preferred) or Access. Then you need to
load chronic conditions indicators (CCI2012.CSV file). Then you need to link the two
files in queries that answer the questions above. Please send screenshots of different steps
of your work, the final results, and any SQL code you used.
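
Linking claims to a chronic condition mapping, as described above, comes down to a join on the diagnosis code. The sketch below uses Python's sqlite3 module with two hypothetical tables (claims and cci); the column names and all values are illustrative and do not reflect the actual layout of the claims files or CCI2012.CSV.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schemas and values, for illustration only.
CREATE TABLE claims (patient_id INTEGER, icd9 TEXT);
CREATE TABLE cci (icd9 TEXT PRIMARY KEY, chronic INTEGER, body_system INTEGER);
INSERT INTO claims VALUES (1, '25000'), (1, '4019'), (2, '4659'), (3, '4019');
INSERT INTO cci VALUES ('25000', 1, 3), ('4019', 1, 7), ('4659', 0, 8);
""")

# Number of patients with at least one diagnosis flagged as chronic.
n_chronic = conn.execute("""
    SELECT COUNT(DISTINCT c.patient_id)
    FROM claims AS c JOIN cci AS m ON c.icd9 = m.icd9
    WHERE m.chronic = 1
""").fetchone()[0]

# Distinct body systems with chronic conditions per patient, largest first.
systems = conn.execute("""
    SELECT c.patient_id, COUNT(DISTINCT m.body_system) AS n_systems
    FROM claims AS c JOIN cci AS m ON c.icd9 = m.icd9
    WHERE m.chronic = 1
    GROUP BY c.patient_id
    ORDER BY n_systems DESC, c.patient_id
""").fetchall()

print(n_chronic, systems)
```

An inner join silently drops claims whose codes are missing from the mapping, so it is worth checking how many codes fail to match before trusting the counts.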
Assignment 3 – Preprocessing Part 2
1. Why is it important to explore data before executing any DM algorithms? What can go
wrong? What can be discovered in an exploratory analysis?
2. What are the three types of missing values? Give examples.
3. Suppose you are given two datasets: a survey result about satisfaction with a clinic
visit and patient records. For privacy reasons personal information (name, record number,
address) has been removed from both datasets. Your goal is to find out if there is any
relationship between the survey results and severity of cases (severity can be obtained
from medical records).
- How do you approach the problem of linking the two datasets? Speculate on what
attributes you would use to link them.
- Should you assume that a medical record is found for every patient who completed
the survey?
- Should you assume that a survey is found for every patient who was treated at the clinic?
4. Using the “Hepatitis” dataset from Assignment 1:
- Write SQL queries to calculate arithmetic mean, standard deviation, median, mode, and
all three quartiles for SEX, AGE, and BILIRUBIN. Note: make calculations only for
values that make sense.
- Using SQL, prepare data for a Q-Q plot of BILIRUBIN levels for male vs. female patients.
You do not need to make the actual plot (although the prepared data can be easily copied
and plotted as a scatterplot in Excel). Prepare data in the form of a table with 2 columns,
in which the first column corresponds to male and the second to female patients, and rows
correspond to selected quantiles. Note: even if you fail at some details here and
something does not work, describe the procedure for how you would approach the problem.
5. Install Weka software http://www.cs.waikato.ac.nz/ml/weka/ on your computer.
It runs on most platforms (Windows, Mac, Linux) and requires Java (JRE). Installation is
simple. We will be using Weka in class. Send screenshots.
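
The shape of the Q-Q table described in question 4 can be illustrated with a short sketch. It uses Python's statistics.quantiles instead of SQL, and the bilirubin values for the two groups are invented for the example; the choice of quartiles and of the "inclusive" interpolation method are assumptions, not requirements of the assignment.

```python
from statistics import quantiles

# Hypothetical bilirubin values for two groups (not from the real dataset).
male = [0.7, 0.9, 1.0, 1.2, 1.5, 2.3, 4.6]
female = [0.6, 0.8, 1.0, 1.3, 2.0]

def qq_table(a, b, n=4):
    """Return rows of (probability, quantile of a, quantile of b)."""
    qa = quantiles(a, n=n, method="inclusive")
    qb = quantiles(b, n=n, method="inclusive")
    probs = [i / n for i in range(1, n)]
    return list(zip(probs, qa, qb))

table = qq_table(male, female)
for p, m, f in table:
    # One row per selected quantile: probability, male value, female value.
    print(f"{p:.2f}\t{m:.2f}\t{f:.2f}")
```

Because both columns are evaluated at the same probabilities, the groups can have different sizes; plotting the second column against the third gives the Q-Q scatterplot.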
Assignment 4 – Data Transformation
1. Why is it important to select the right attributes for DM algorithms? Why not keep all
attributes?
2. Give an example (different from the one in lecture) of when creation of new attributes may be
useful.
3. When should stratified sampling be used?
4. Using hepatitis data, write SQL code for:
- Sampling 50% of data without repetitions
- Sampling 50% of data with repetitions
- Stratified sampling, to make frequency of both classes equal
5. In Weka, using the hepatitis data
- Compare results of three different attribute selection methods
- Compare results of three different sampling methods (show plotted data distributions).
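
The three sampling schemes named in question 4 differ only in how records are drawn. The following is a minimal sketch in Python rather than SQL, using made-up records with a hypothetical class label; equalizing classes by downsampling to the smallest class is one possible reading of "make frequency of both classes equal".

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical records: (patient_id, class_label); 4 DIE and 16 LIVE.
records = [(i, "DIE" if i % 5 == 0 else "LIVE") for i in range(1, 21)]
half = len(records) // 2

# 50% sample without repetitions: each record appears at most once.
without_rep = random.sample(records, half)

# 50% sample with repetitions: the same record may be drawn more than once.
with_rep = random.choices(records, k=half)

# Stratified sample with equal class frequencies: draw the same number
# of records from each class, limited by the smallest class.
by_class = defaultdict(list)
for rec in records:
    by_class[rec[1]].append(rec)
per_class = min(len(group) for group in by_class.values())
stratified = [rec for group in by_class.values()
              for rec in random.sample(group, per_class)]

print(len(without_rep), len(with_rep), len(stratified))
```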
Assignment 5 – Association Rule Mining
1. Describe why association rules and classification rules are not the same. Give examples.
2. Why is the FP-Tree algorithm usually faster than Apriori? Give some intuitive explanation.
3. List at least four metrics of quality for association rules. Provide formulas.
4. Using heritage data (release 1) in SQL:
   a. Find support for all single itemsets
   b. List all itemsets with 2 elements and support of at least 0.2
   c. List all itemsets with 3 elements and support of at least 0.2
5. In Weka:
   a. Load heritage data (release 1)
   b. Apply at least two association rule generation algorithms and compare results
   c. Apply the FP-Tree algorithm with at least two measures of rule metrics

Assignment 6 – Classification and Regression
1. Describe the process of preparing data for classification learning.
2. Why is it important to select the correct type of model? List at least three reasons.
3. Suppose you are asked to create a model to predict hospitalization. What you have is claims data. Describe the process of preparing data, constructing the model, and testing the model. Should you use regression learning or classification learning for this problem? Why?
4. In SQL/Weka:
   a. Prepare heritage data for classification learning
   b. Load heritage data release 3 (preprocessed to binary representation, including demographics and output attribute(s))
   c. Perform exploratory analysis
   d. Create at least three classification models for predicting hospitalization based on Year 1 data.
   e. Which model performs the best on year 2 data?
   f. Create a regression model for predicting hospitalization days.
   g. What is the difference between regression and classification models?
   h. Present your results in the form of a short report that includes screenshots, tables, and needed description.

Assignment 7 – Classification Part 2
1. When can ROC be used as a measure of quality of models? How does it differ from other measures such as accuracy, precision or recall?
2. When designing a system for detecting life-threatening events for ICU patients, what is more important: precision or recall? Why?
3. Describe how multiple classification models can be combined.
4. Using heritage release 3 data prepared last week:
   a. Include drug information into data
   b. Include laboratory information into data
   c. Import newly created data into Weka and run classification algorithms
   d. Does inclusion of the information improve predictions?
There are many ways to complete question 4, so you need to make different decisions.
Try not to overcomplicate the problem.
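
For reference, the measures named in questions 1 and 2 can all be computed from the four confusion-matrix counts. The sketch below uses invented labels (1 = event, 0 = no event); the function name and the convention of returning 0.0 for an empty denominator are choices made for this illustration.

```python
def evaluate(actual, predicted, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Precision: of the cases flagged positive, how many really were.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the real positive cases, how many were flagged.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels: 3 real events among 10 cases, 2 of them detected.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
accuracy, precision, recall = evaluate(actual, predicted)
print(accuracy, precision, recall)
```

With rare events, accuracy can stay high even when recall is poor, which is why the three measures are reported separately.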
Assignment 8 – Cluster Analysis
1. Describe differences and similarities between clustering and classification. Use examples.
2. Suppose you are given a dataset with 1000 binary attributes representing presence of diagnoses. Describe potential problems with clustering data with that many dimensions.
3. Using the data table shown below:
   a. Calculate the distance between all points in 1-norm, 2-norm and infinity-norm. Show the dissimilarity matrix.
   b. Is there any need to preprocess the data to be more suitable for clustering? If so, describe the operations and show the resulting data table.
   c. Apply the k-means clustering algorithm with k=2.

ID   Age   BMI   Gender   Total Cholesterol
1    30    24    M        180
2    70    19    M        190
3    65    26    M        220
4    40    32    F        260

4. In Weka, using the heritage 3 dataset:
   a. Apply the k-means algorithm for k=2, 3, 5, 10
   b. Apply the EM algorithm. What is the optimal number of clusters obtained by EM?
   c. Compare the created clusters to classification based on hospitalization in year 2.

Assignment 9 – Outlier Detection
1. How do outliers differ from noise? Provide examples of both.
2. Why are collective outliers harder to find than individual outliers?
3. What are the issues in applying classification methods for outlier detection? What are the requirements?
4. Using Heritage data, are there any patients that can be considered outliers in the data? Why?

Assignment 10 – Text Mining
1. Why is text mining important in healthcare applications?
2. Describe pre-processing of clinical notes before classification learning can be applied. What methods can be used at each step of pre-processing?
3. Write a regular expression to:
   a. Detect zip codes in text
   b. Find last names of all patients whose first name is John (note that regular expressions may have some false positives/false negatives).
4. List challenges in automatically retrieving ICD-9 codes from clinical notes. Search the literature to find relevant published work. Also, include your own observations and comments.
5. Using the SMS data:
   a. Split data into training (80%) and testing (20%) sets
   b. Build a naïve Bayes classifier for detecting spam based on bag of words
      i. List all words in the documents
      ii. Count occurrences in spam and ham
      iii. Assign likelihoods P(word|spam) and P(word|ham) for all words
      iv. Convert test data into a list of words. For each message you need 2 columns: message id and word
      v. Classify test data. This can be done by a series of joins with the data prepared in (iii).
      vi. Calculate the quality of your model (accuracy, precision, recall)
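
Steps (i) through (vi) of question 5b can be sketched in a few lines. This illustration is in Python rather than SQL joins, uses a handful of invented messages in place of the SMS data, and applies add-one (Laplace) smoothing, which is an assumption made here so that words unseen in one class do not produce a zero probability.

```python
import math
from collections import Counter

# Tiny made-up messages; the real SMS data is not reproduced here.
train = [
    ("spam", "win cash now"),
    ("spam", "free cash prize"),
    ("ham", "see you at lunch"),
    ("ham", "call me at noon"),
]
test = [("spam", "free cash"), ("ham", "lunch at noon")]

# (i)-(ii) List words and count occurrences per class.
counts = {"spam": Counter(), "ham": Counter()}
docs = Counter()
for label, text in train:
    counts[label].update(text.split())
    docs[label] += 1
vocab = set(counts["spam"]) | set(counts["ham"])

# (iii) Likelihood P(word | class) with add-one smoothing (an assumption).
def likelihood(word, label):
    total = sum(counts[label].values())
    return (counts[label][word] + 1) / (total + len(vocab))

# (iv)-(v) Score a message as log prior plus summed log likelihoods.
def classify(text):
    scores = {}
    for label in counts:
        score = math.log(docs[label] / sum(docs.values()))
        for word in text.split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log(likelihood(word, label))
        scores[label] = score
    return max(scores, key=scores.get)

# (vi) Accuracy on the test messages.
predictions = [classify(text) for _, text in test]
accuracy = sum(p == label for p, (label, _) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Summing logarithms instead of multiplying probabilities avoids numeric underflow on long messages; in SQL the same idea corresponds to joining the word list from (iv) with the likelihood table from (iii) and aggregating per message.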