Download Building Scalable Predictive Modeling Platform for Healthcare

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Scalable Predictive Modeling Methods
for Healthcare Applications
Jimeng Sun
[email protected]
Jimeng Sun)[email protected],) ))sunlab.org
Jimeng Sun
1 Research Scientist: M. Bahadori
8 PhD students: R. Chen, I. Perros, H. Su, E. Choi, Z. Wu, K.
Malhotra, Y. Zhang, S. An
many MS/undergrad students
Collaborators & Sponsors
Government
Provider
University
Jimeng Sun)[email protected],) ))sunlab.org
Company
Research in
using
Predictive Modeling
Electronic health records
(EHR)
Jimeng Sun)[email protected],) ))sunlab.org
PREDICTIVE MODELING
using
ELECTRONIC HEALTH RECORDS (EHR)
EHR++
Predic(ve+model++
2500"
3500"
Publica(on+count+in+PubMed+
Publica(on+count+in+PubMed+
4000"
3000"
2500"
2000"
1500"
1000"
2000"
1500"
1000"
500"
500"
0"
1990"
1995"
2000"
2005"
2010"
0"
1990"
1995"
2000"
2005"
2010"
Explosion in interest
Jimeng Sun)[email protected],) ))sunlab.org
Requirements: Healthcare Predictive Modeling
Efficiency
Interpretation
Accuracy
• Parallel predictive modeling pipeline
• Nonnegative tensor factorization for phenotyping
• Deep learning for heart failure onset prediction
Jimeng Sun)[email protected],) ))sunlab.org
5
Algorithm
Application
RESEARCH PORTFOLIO
Predictive Computational
Modeling Phenotyping
Big data
system
Patient
Similarity
Disease
progression
Public health
informatics
Tensor Distance metric Deep learning Search &
learning
recommendation
factorization
Jimeng Sun)[email protected],)))sunlab.org
Jimeng Sun)[email protected],) ))sunlab.org
6
SpaceShip
PARALLEL PREDICTIVE MODELING PLATFORM
Hang)Su
Yuyu Zhang
Sungtae An
Jimeng Sun)[email protected],) ))sunlab.org
Kunal Malhotra
Epilepsy Treatment Prediction
Can)you)predict)which)treatment)will)work?
Chris) Clark
Early)Responders)– 32%
Achieved)seizure)freedom)with)first)drug
Late)Responders)– 24%
Achieved)seizure)freedom)after)2)to)5)
years
Recommend the right treatment earlier
NonQResponders)– 44%
Continued)seizures)despite)treatment
Begley' CE,'et'al.'Epilepsia. 2000041(3):3427351.
Jimeng Sun)[email protected],) ))sunlab.org
11
Performance Comparison in Treatment
Recommendation
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
10% to 160% improvement
over human
Doctor)accuracy
Doctor)accuracy
Algorithm)accuracy
• Accuracy in terms of positive predictive value (PPV) aka precision
• Drug works = treatment stability
Jimeng Sun)[email protected],) ))sunlab.org
17
Challenges for Epilepsy Treatment
Recommendation
Big Data
# of patients: 30 million
Diagnosis Claims : (2010-2013), 1.2 billion records
Medication Claims : (2010-2013), 2.3 billion records
Inpatient claims: 2.7 million
Outpatient claims: 8.3 million
Number of anti-epilepsy drugs = 27
300 GB data
Many models
•
•
Over 100 different treatment combinations
Potentially multiple outcomes
– Stable treatment
– Hospitalization
•
Many modeling choices
Jimeng Sun)[email protected],) ))sunlab.org
Predictive Modeling Pipeline
Many
–
–
–
–
–
models to be built and evaluated
Different patient cohorts
Different targets
Different features
Different algorithms
Multiple training and testing splits in cross-validation
Jimeng Sun)[email protected],) ))sunlab.org
19
Keys: Efficient Computation over Pipelines
Parallel
computing
Smart
Sampling
• dependency)graph
• parallel)execution)
• cloud)computing)engine
• TwoQlevel)Bayesian)optimization
• algorithm)selection
• parameter)tuning
Jimeng Sun)[email protected],) ))sunlab.org
20
Parallel Computation on Dependency Graph
Feature)Selection)
InfoGain
Feature)Construction) ))))
Medication
Feature)Construction) )))))))
Diagnosis
Start
Cohort)
Construction
Feature)Construction) ))))))))
Demographics
Feature)Construction) )))))
Feature)Construction) ))))
Hospitalization
Feature)Selection)
Fisher)Score
Feature)Selection)
InfoGain
Classification) Random)
Forest
Output
Classification)
Log) Regression
Output
Classification)
Naïve)Bayes
Output
Classification) Random)
Forest
Output
Classification)
Log) Regression
Output
Classification)
Naïve)Bayes
Output
Classification) Random)
Forest
Output
Classification)
Log) Regression
Output
Classification)
Naïve)Bayes
Output
Feature)Construction) ))))
Procedures
Cross) Validation
Feature)Selection)
Fisher)Score
Jimeng Sun)[email protected],) ))sunlab.org
21
Runtime for Predictive
modeling (38K patients)
12
10
8
Runtime+ 6
(hours)
4
10)minutes
60X
faster
78+X+
faste
r
2
0
Parallel
Sequential
Jimeng Sun)[email protected],) ))sunlab.org
Smart Sampling via Bayesian
Optimization
Classification)
Random) Forest
Feature)
Selection)
InfoGain
Feature)
Construction) ))))
Medication
Classification)
Naïve)Bayes
Feature)Construction) )))))))
Diagnosis
Start
Cohort)
Construction
Feature)
Construction) ))))))))
Demographics
Classification)
Log) Regression
Feature)
Construction) )))))
Feature)
Selection)
Fisher)Score
Classification)
Random) Forest
Classification)
Log) Regression
Feature)
Construction) ))))
Hospitalization
Feature)
Selection)
InfoGain
Feature)
Construction) ))))
Procedures
Classification)
Naïve)Bayes
Classification)
Random) Forest
Cross)
Validation
Feature)
Selection)
Fisher)Score
Classification)
Log) Regression
Classification)
Naïve)Bayes
Jimeng Sun)[email protected],) ))sunlab.org
Outpu
t
Outpu
t
Outpu
t
Outpu
t
Outpu
t
Outpu
t
Outpu
t
Outpu
t
Outpu
t
23
Smart Sampling via Bayesian
Optimization
The)state)of)the)art
Our) method
ideal
Y.)Zhang,)M.)Bahadori,)H.)Su,)J.)Sun.)FLASH:)Fast)Bayesian)Optimization)for)Data)Analytic)
Pipelines.)KDD’16
Jimeng Sun)[email protected],) ))sunlab.org
24
Limestone
How to handle high-dimensionality of EHR data
UNSUPERVISED PHENOTYPE GENERATION
VIA TENSOR FACTORIZATION
Joyce)Ho
Joydeep Ghosh Steve)Steinhubl
Buzz)Stewart
Josh)Denny
Brad)Malin
Ho,)Joyce)C.,)Joydeep Ghosh,) Steve)R.)Steinhubl,) Walter)F.)Stewart,)Joshua) C.)Denny,)Bradley)A.)Malin,)and)Jimeng)Sun.)“Limestone:)
HighQThroughput)Candidate)Phenotype)Generation)via)Tensor)Factorization.”)Journal)of)Biomedical)Informatics.)2014)
Jimeng Sun)[email protected],) ))sunlab.org
COMPUTATIONAL PHENOTYPING
Demographics
Diagnoses
Procedures
Lab
result
Medications
Clinical
notes
Raw Data
Phenotyping
Medical Concepts
(Phenotypes)
A phenotype = a cluster of patients that share similar characteristics
Jimeng Sun)[email protected],) ))sunlab.org
PHENOTYPING ALGORITHM
Typ e 1
D i a g no se s
Type 2 Diabetes Cases
EHR
NO
YES
Typ e 2
D i a g no se s
Typ e 1 R x
NO
NO
Typ e 2 R x
NO
YES
Typ e 2 R x
YES
Typ e 2 R x
T y pe 2 Rx
Pr ec edes
T y pe 1 Rx
YES
YES
NO
YES
YES
Ab n o rma l
Lab
T y pe 2 Dx
by phy s .
> = 2
Case
Jimeng Sun)[email protected],) ))sunlab.org
YES
USE TENSOR TO MODEL EHR DATA
• Tensor is a generalization of matrix
Data element types:
• Binary
• Count (integer)
• Continuous (numeric)
Mode
Jimeng Sun)[email protected],) ))sunlab.org
39
PHENOTYPING THROUGH NONNEGATIVE
TENSOR FACTORIZATION
Medication factor
Phenotype importance
Factor elements
sum to 1
Diagnosis factor
λ1
Medication
≈
Patient
λR
+
Patients
factor
…
+
Elements
sum to 1
Phenotype 1
Phenotype R
Diagnosis
Jimeng Sun)[email protected],) ))sunlab.org
40
EXAMPLE PHENOTYPE
Medication factor
λk
Diagnosis factor
Candidate Phenotype+k
(40%+of+patients)
Hypertension
Patients
factor
Beta)Blockers)CardioQSelective
Thiazides)and)ThiazideQLike)Diuretics
HMG)CoA)Reductase Inhibitors
a phenotype = a group of patients that share common
characteristics
Jimeng Sun)[email protected],) ))sunlab.org
41
CP ALTERNATING POISSON REGRESSION (CP-APR)
•Nonnegative input tensor
•Nonnegative constraints
•Stochastic column constraints on factor matrices
KL divergence
Chi, E. C., & Kolda, T. G. (2012). On tensors, sparsity, and nonnegative factorizations. SIAM Journal on
Matrix Analysis and Applications,
33(4), 1272–1299.
Jimeng Sun)[email protected],)
))sunlab.org
Limestone
•Nonnegative input tensor
•Nonnegative constraints
•Stochastic column constraints on factor matrices
•Hard thresholding on elements in factor matrices
Ho,)Joyce)C.,)Joydeep Ghosh,)Steve)R.)Steinhubl,)Walter)F.)Stewart,)Joshua)C.)Denny,)Bradley)A.)Malin,)and)Jimeng)Sun.)“Limestone:) HighQ
Throughput)Candidate)Phenotype)Generation)via)Tensor)Factorization.”)Journal#of#Biomedical#Informatics.)2014
Jimeng Sun)[email protected],) ))sunlab.org
EXPERIMENTS OF LIMESTONE
Jimeng Sun)[email protected],) ))sunlab.org
TENSOR CONSTRUCTION
• Medication orders from Geisinger dataset
• 31,816 patients x 169 diagnose categories
x 471 medication classes
Patient X
Diabetes
Sulfonylureas
Coronary
Atherosclerosis
Beta Blockers
Cardio-selective
Index Date
Medication
Diabetes
Sulfonylureas
Diabetes
Sulfonylureas
Coronary
Atherosclerosis
Loop Diuretics
Hypertension
ACE Inhibitors
t0
t1
t2
Coronary
Atherosclerosis
Nitrates
Coronary
Atherosclerosis
Loop Diuretics
Congestive
Heart Failure
Loop Diuretics
t3
t4
Time
Patient
Observation Window = 2 years
Diagnosis
Jimeng Sun)[email protected],) ))sunlab.org
46
HEART FAILURE (HF) PREDICTION
• Task: predict patients with HF
• Model: logistic regression with ℓ1 regularization
• Evaluation : Cross-validation
• Methods for feature construction:
1. Baseline using source independence matrix
2. Principal Component Analysis (PCA)
3. Nonnegative Matrix Factorization (NMF)
4. Limestone
Jimeng Sun)[email protected],) ))sunlab.org
47
PREDICTIVE PERFORMANCE
Small # of features outperforms 640
features
0.72
●
●
●
●
●
●
●
●
●
●
●
●
●
0.70
AUC
●
Baseline
0.68
Features
PCA
●
NMF
0.66
● Limestone
0
40
80
120
160
Number of Phenotypes
Jimeng Sun)[email protected],) ))sunlab.org
48
MAJOR DISEASE PHENOTYPES
Uncomplicated Diabetes
Phenotype+3+
(17.6% of+patients)
Diabetes) with)No)or)
Unspecified)Complications
Sulfonylureas
Biguanides
Diagnostic) Tests
Insulin)Sensitizing)Agents
Diabetic) Supplies
Meglitinide Analogues
Antidiabetic Combinations
Mild Hypertension
Phenotype+4+
(31.1% of+patients)
Hypertension
ACE)Inhibitors
Thiazides) and)ThiazideQLike)
Diuretics
Chronic Respiratory
Inflammation/Infection
Phenotype 5+
(36.7%+of+patients)
Other)Ear,)Nose,)Throat,)and)Mouth)Disorders
Viral)and)Unspecified) Pneumonia,)Pleurisy
Significant)Ear,)Nose,)and)Throat)Disorders
Cough/Cold/Allergy)Combinations
Azithromycin
Fluoroquinolones
Sympathomimetics
Penicillin) Combinations
Antitussives
Glucocorticosteroids
Tetracyclines
AntiQinfective)Misc.)Q Combinations
Clarithromycin
Cephalosporins Q 2nd)Generation
Cephalosporins Q 1st)Generation
Expectorants
Jimeng Sun)[email protected],) ))sunlab.org
49
DISEASE SUBTYPES CAN BE IDENTIFIED
Mild Hypertension
Moderate Hypertension
Phenotype+4+
(31.1% of+patients)
Hypertension
ACE)Inhibitors
Thiazides) and)ThiazideQLike)
Diuretics
Phenotype+2
(31.5% of+patients)
Hypertension
Beta)Blockers)CardioQSelective
Angiotensin)II)Receptor)
Antagonists
Loop)Diuretics
Potassium
Nitrates
AlphaQBeta)Blockers
Vasodilators
Severe Hypertension
Phenotype+6+
(24.3% of+patients)
Hypertension
Calcium) Channel)Blockers
Antihypertensive)Combinations
Antiadrenergic) Antihypertensives
Potassium)Sparing)Diuretics
Over 80% phenotype factors are clinically meaningful
Jimeng Sun)[email protected],) ))sunlab.org
50
Limestone
Advantages
• Unsupervised
• Intuitive phenotypes
• Predictive
Jimeng Sun)[email protected],) ))sunlab.org
51
USING RECURSIVE NEURAL NETWORK MODELS FOR
EARLY DETECTION OF HEART FAILURE ONSET
How to model temporal relations in the EHR data
Edward)Choi
Andy)Schuetz
Buzz)Stewart
Edward)Choi,)Andy)Schuetz,)Walter)Stewart,)Jimeng)Sun.)Using)Recursive)Neural)Network)Models)for)Early)Detection)
of)Heart)Failure)Onset,)to)appear)on)JAMIA
Jimeng Sun)[email protected],) ))sunlab.org
MOTIVATIONS FOR EARLY DETECTION
OF HEART FAILURE
Heart failure is
a complex
disease.
Improves
existing clinical
guidelines of HF
prevention.
Reduces cost and
hospitalization.
Early intervention
can slow down
disease progression.
Jimeng Sun)[email protected],) ))sunlab.org
PREDICTIVE MODELING PIPELINE
1.
Prediction
Target
6. Performance
Evaluation
2.
Cohort
Construction
5.
Predictive
Model
3.
Feature
Construction
4.
Feature
Selection
Jimeng Sun)[email protected],) ))sunlab.org
Deep Learning Pipeline
1.
Prediction
Target
6. Performance
Evaluation
2.
Cohort
Construction
5.
Predictive
Model
3.
Feature
Construction
Deep learning:
4.
Representation
learning
Feature
Selection
Jimeng Sun)[email protected],) ))sunlab.org
Input: Longitudinal EHR
Pneumonia
Benzonatate
Amoxicillin
Time
Cough
0
0
0
0
..
..
0
0
1
Fever
0
0
0
0
..
..
0
1
0
1
0
0
0
..
..
0
0
0
Chest X-ray
0
0
0
1
..
..
0
0
0
0
0
1
0
..
..
0
0
0
Jimeng Sun)[email protected],) ))sunlab.org
0
0
0
0
..
..
1
0
0
Representation learning: Word2vec
Input at time t: xt
(a) One-hot encoding
(b) Word2vec vectors
N-dimensional vector
Bronchitis: [1, 0, 0, 0, 0, …, 0, 0, 0]
Pneumonia: [0, 1, 0, 0, 0, …, 0, 0, 0]
Obesity:
[0, 0, 1, 0, 0, …, 0, 0, 0]
D-dimensional vector
Bronchitis: [0.4, -0.2, …, 0.2]
Pneumonia: [0.3, -0.3, …, 0.1]
Obesity:
[-0.7, 1.4, …, 1.2]
N diagnoses
Cataract:
[0, 0, 0, 0, 0, …, 0, 0, 1]
Cataract:
Jimeng Sun)[email protected],) ))sunlab.org
[1.2, 0.8, …, 1.5]
Temporal model: RNN
•
•
•
•
•
xt: one-hot coded Dx, Rx, Proc at time t
ht: hidden state at time t
y: binary outcome of HF prediction
T: total length of the medical codes
Red box: a single unit of RNN
y
Logistic Regression
h0
h1
hT-1
hT
x0
x1
xT-1
xT
Jimeng Sun)[email protected],) ))sunlab.org
EVALUATION
• Models
– Baselines: logistic regression, SVM, MLP, KNN
– Training data: one-hot, grouped, Skip-gram
• Performance measure
– AUC of 6-fold cross validation
• Vary Observation window & prediction window
Index Date
Prediction
Window
Observation
Window
Diagnosis
Date
Time
Jimeng Sun)[email protected],) ))sunlab.org
PREDICTION PERFORMANCE OF RNN
One-hot encoding
1
RNN
0.95
0.9
0.85
AUC
0.8
0.75
0.7
0.65
0.6
0.55
0.5
Logistic
Regression
SVM
MLP
KNN
GRU w/o
duration
GRU w/
duration
• RNN#model#achieves#over#10%#improvement#on#AUC
Jimeng Sun)[email protected],) ))sunlab.org
PREDICTION PERFORMANCE OF RNN
One-hot encoding
Grouped code vectors
1
0.95
0.9
0.85
AUC
0.8
0.75
0.7
0.65
0.6
0.55
0.5
Logistic
Regression
SVM
MLP
KNN
GRU w/o
duration
GRU w/
duration
• RNN)model)achieves)over)10%)improvement)on)AUC
• Representation#matters
Jimeng Sun)[email protected],) ))sunlab.org
PREDICTION PERFORMANCE OF RNN
One-hot encoding
Grouped code vectors
Medical concept vectors
1
0.95
0.9
0.85
AUC
0.8
0.75
0.7
0.65
0.6
0.55
0.5
Logistic
Regression
SVM
MLP
KNN
GRU w/o
duration
GRU w/
duration
• RNN)model)achieves)over)10%)improvement)on)AUC
• Data#rep.#(word2vec)#>#knowledge#rep.#(medical#groupers)
Jimeng Sun)[email protected],) ))sunlab.org
Summary: Healthcare Predictive Modeling
Efficiency
Interpretation
Accuracy
• Parallel predictive modeling pipeline
• Nonnegative tensor factorization for phenotyping
• Deep learning for heart failure onset prediction
Jimeng Sun)[email protected],) ))sunlab.org
68
Related documents