Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Scalable Predictive Modeling Methods for Healthcare Applications Jimeng Sun [email protected] Jimeng Sun)[email protected],) ))sunlab.org Jimeng Sun 1 Research Scientist: M. Bahadori 8 PhD students: R. Chen, I. Perros, H. Su, E. Choi, Z. Wu, K. Malhotra, Y. Zhang, S. An many MS/undergrad students Collaborators & Sponsors Government Provider University Jimeng Sun)[email protected],) ))sunlab.org Company Research in using Predictive Modeling Electronic health records (EHR) Jimeng Sun)[email protected],) ))sunlab.org PREDICTIVE MODELING using ELECTRONIC HEALTH RECORDS (EHR) EHR++ Predic(ve+model++ 2500" 3500" Publica(on+count+in+PubMed+ Publica(on+count+in+PubMed+ 4000" 3000" 2500" 2000" 1500" 1000" 2000" 1500" 1000" 500" 500" 0" 1990" 1995" 2000" 2005" 2010" 0" 1990" 1995" 2000" 2005" 2010" Explosion in interest Jimeng Sun)[email protected],) ))sunlab.org Requirements: Healthcare Predictive Modeling Efficiency Interpretation Accuracy • Parallel predictive modeling pipeline • Nonnegative tensor factorization for phenotyping • Deep learning for heart failure onset prediction Jimeng Sun)[email protected],) ))sunlab.org 5 Algorithm Application RESEARCH PORTFOLIO Predictive Computational Modeling Phenotyping Big data system Patient Similarity Disease progression Public health informatics Tensor Distance metric Deep learning Search & learning recommendation factorization Jimeng Sun)[email protected],)))sunlab.org Jimeng Sun)[email protected],) ))sunlab.org 6 SpaceShip PARALLEL PREDICTIVE MODELING PLATFORM Hang)Su Yuyu Zhang Sungtae An Jimeng Sun)[email protected],) ))sunlab.org Kunal Malhotra Epilepsy Treatment Prediction Can)you)predict)which)treatment)will)work? Chris) Clark Early)Responders)– 32% Achieved)seizure)freedom)with)first)drug Late)Responders)– 24% Achieved)seizure)freedom)after)2)to)5) years Recommend the right treatment earlier NonQResponders)– 44% Continued)seizures)despite)treatment Begley' CE,'et'al.'Epilepsia. 2000041(3):3427351. Jimeng Sun)[email protected],) ))sunlab.org 11 Performance Comparison in Treatment Recommendation 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 10% to 160% improvement over human Doctor)accuracy Doctor)accuracy Algorithm)accuracy • Accuracy in terms of positive predictive value (PPV) aka precision • Drug works = treatment stability Jimeng Sun)[email protected],) ))sunlab.org 17 Challenges for Epilepsy Treatment Recommendation Big Data # of patients: 30 million Diagnosis Claims : (2010-2013), 1.2 billion records Medication Claims : (2010-2013), 2.3 billion records Inpatient claims: 2.7 million Outpatient claims: 8.3 million Number of anti-epilepsy drugs = 27 300 GB data Many models • • Over 100 different treatment combinations Potentially multiple outcomes – Stable treatment – Hospitalization • Many modeling choices Jimeng Sun)[email protected],) ))sunlab.org Predictive Modeling Pipeline Many – – – – – models to be built and evaluated Different patient cohorts Different targets Different features Different algorithms Multiple training and testing splits in cross-validation Jimeng Sun)[email protected],) ))sunlab.org 19 Keys: Efficient Computation over Pipelines Parallel computing Smart Sampling • dependency)graph • parallel)execution) • cloud)computing)engine • TwoQlevel)Bayesian)optimization • algorithm)selection • parameter)tuning Jimeng Sun)[email protected],) ))sunlab.org 20 Parallel Computation on Dependency Graph Feature)Selection) InfoGain Feature)Construction) )))) Medication Feature)Construction) ))))))) Diagnosis Start Cohort) Construction Feature)Construction) )))))))) Demographics Feature)Construction) ))))) Feature)Construction) )))) Hospitalization Feature)Selection) Fisher)Score Feature)Selection) InfoGain Classification) Random) Forest Output Classification) Log) Regression Output Classification) Naïve)Bayes Output Classification) Random) Forest Output Classification) Log) Regression Output Classification) Naïve)Bayes Output Classification) Random) Forest Output Classification) Log) Regression Output Classification) Naïve)Bayes Output Feature)Construction) )))) Procedures Cross) Validation Feature)Selection) Fisher)Score Jimeng Sun)[email protected],) ))sunlab.org 21 Runtime for Predictive modeling (38K patients) 12 10 8 Runtime+ 6 (hours) 4 10)minutes 60X faster 78+X+ faste r 2 0 Parallel Sequential Jimeng Sun)[email protected],) ))sunlab.org Smart Sampling via Bayesian Optimization Classification) Random) Forest Feature) Selection) InfoGain Feature) Construction) )))) Medication Classification) Naïve)Bayes Feature)Construction) ))))))) Diagnosis Start Cohort) Construction Feature) Construction) )))))))) Demographics Classification) Log) Regression Feature) Construction) ))))) Feature) Selection) Fisher)Score Classification) Random) Forest Classification) Log) Regression Feature) Construction) )))) Hospitalization Feature) Selection) InfoGain Feature) Construction) )))) Procedures Classification) Naïve)Bayes Classification) Random) Forest Cross) Validation Feature) Selection) Fisher)Score Classification) Log) Regression Classification) Naïve)Bayes Jimeng Sun)[email protected],) ))sunlab.org Outpu t Outpu t Outpu t Outpu t Outpu t Outpu t Outpu t Outpu t Outpu t 23 Smart Sampling via Bayesian Optimization The)state)of)the)art Our) method ideal Y.)Zhang,)M.)Bahadori,)H.)Su,)J.)Sun.)FLASH:)Fast)Bayesian)Optimization)for)Data)Analytic) Pipelines.)KDD’16 Jimeng Sun)[email protected],) ))sunlab.org 24 Limestone How to handle high-dimensionality of EHR data UNSUPERVISED PHENOTYPE GENERATION VIA TENSOR FACTORIZATION Joyce)Ho Joydeep Ghosh Steve)Steinhubl Buzz)Stewart Josh)Denny Brad)Malin Ho,)Joyce)C.,)Joydeep Ghosh,) Steve)R.)Steinhubl,) Walter)F.)Stewart,)Joshua) C.)Denny,)Bradley)A.)Malin,)and)Jimeng)Sun.)“Limestone:) HighQThroughput)Candidate)Phenotype)Generation)via)Tensor)Factorization.”)Journal)of)Biomedical)Informatics.)2014) Jimeng Sun)[email protected],) ))sunlab.org COMPUTATIONAL PHENOTYPING Demographics Diagnoses Procedures Lab result Medications Clinical notes Raw Data Phenotyping Medical Concepts (Phenotypes) A phenotype = a cluster of patients that share similar characteristics Jimeng Sun)[email protected],) ))sunlab.org PHENOTYPING ALGORITHM Typ e 1 D i a g no se s Type 2 Diabetes Cases EHR NO YES Typ e 2 D i a g no se s Typ e 1 R x NO NO Typ e 2 R x NO YES Typ e 2 R x YES Typ e 2 R x T y pe 2 Rx Pr ec edes T y pe 1 Rx YES YES NO YES YES Ab n o rma l Lab T y pe 2 Dx by phy s . > = 2 Case Jimeng Sun)[email protected],) ))sunlab.org YES USE TENSOR TO MODEL EHR DATA • Tensor is a generalization of matrix Data element types: • Binary • Count (integer) • Continuous (numeric) Mode Jimeng Sun)[email protected],) ))sunlab.org 39 PHENOTYPING THROUGH NONNEGATIVE TENSOR FACTORIZATION Medication factor Phenotype importance Factor elements sum to 1 Diagnosis factor λ1 Medication ≈ Patient λR + Patients factor … + Elements sum to 1 Phenotype 1 Phenotype R Diagnosis Jimeng Sun)[email protected],) ))sunlab.org 40 EXAMPLE PHENOTYPE Medication factor λk Diagnosis factor Candidate Phenotype+k (40%+of+patients) Hypertension Patients factor Beta)Blockers)CardioQSelective Thiazides)and)ThiazideQLike)Diuretics HMG)CoA)Reductase Inhibitors a phenotype = a group of patients that share common characteristics Jimeng Sun)[email protected],) ))sunlab.org 41 CP ALTERNATING POISSON REGRESSION (CP-APR) •Nonnegative input tensor •Nonnegative constraints •Stochastic column constraints on factor matrices KL divergence Chi, E. C., & Kolda, T. G. (2012). On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4), 1272–1299. Jimeng Sun)[email protected],) ))sunlab.org Limestone •Nonnegative input tensor •Nonnegative constraints •Stochastic column constraints on factor matrices •Hard thresholding on elements in factor matrices Ho,)Joyce)C.,)Joydeep Ghosh,)Steve)R.)Steinhubl,)Walter)F.)Stewart,)Joshua)C.)Denny,)Bradley)A.)Malin,)and)Jimeng)Sun.)“Limestone:) HighQ Throughput)Candidate)Phenotype)Generation)via)Tensor)Factorization.”)Journal#of#Biomedical#Informatics.)2014 Jimeng Sun)[email protected],) ))sunlab.org EXPERIMENTS OF LIMESTONE Jimeng Sun)[email protected],) ))sunlab.org TENSOR CONSTRUCTION • Medication orders from Geisinger dataset • 31,816 patients x 169 diagnose categories x 471 medication classes Patient X Diabetes Sulfonylureas Coronary Atherosclerosis Beta Blockers Cardio-selective Index Date Medication Diabetes Sulfonylureas Diabetes Sulfonylureas Coronary Atherosclerosis Loop Diuretics Hypertension ACE Inhibitors t0 t1 t2 Coronary Atherosclerosis Nitrates Coronary Atherosclerosis Loop Diuretics Congestive Heart Failure Loop Diuretics t3 t4 Time Patient Observation Window = 2 years Diagnosis Jimeng Sun)[email protected],) ))sunlab.org 46 HEART FAILURE (HF) PREDICTION • Task: predict patients with HF • Model: logistic regression with ℓ1 regularization • Evaluation : Cross-validation • Methods for feature construction: 1. Baseline using source independence matrix 2. Principal Component Analysis (PCA) 3. Nonnegative Matrix Factorization (NMF) 4. Limestone Jimeng Sun)[email protected],) ))sunlab.org 47 PREDICTIVE PERFORMANCE Small # of features outperforms 640 features 0.72 ● ● ● ● ● ● ● ● ● ● ● ● ● 0.70 AUC ● Baseline 0.68 Features PCA ● NMF 0.66 ● Limestone 0 40 80 120 160 Number of Phenotypes Jimeng Sun)[email protected],) ))sunlab.org 48 MAJOR DISEASE PHENOTYPES Uncomplicated Diabetes Phenotype+3+ (17.6% of+patients) Diabetes) with)No)or) Unspecified)Complications Sulfonylureas Biguanides Diagnostic) Tests Insulin)Sensitizing)Agents Diabetic) Supplies Meglitinide Analogues Antidiabetic Combinations Mild Hypertension Phenotype+4+ (31.1% of+patients) Hypertension ACE)Inhibitors Thiazides) and)ThiazideQLike) Diuretics Chronic Respiratory Inflammation/Infection Phenotype 5+ (36.7%+of+patients) Other)Ear,)Nose,)Throat,)and)Mouth)Disorders Viral)and)Unspecified) Pneumonia,)Pleurisy Significant)Ear,)Nose,)and)Throat)Disorders Cough/Cold/Allergy)Combinations Azithromycin Fluoroquinolones Sympathomimetics Penicillin) Combinations Antitussives Glucocorticosteroids Tetracyclines AntiQinfective)Misc.)Q Combinations Clarithromycin Cephalosporins Q 2nd)Generation Cephalosporins Q 1st)Generation Expectorants Jimeng Sun)[email protected],) ))sunlab.org 49 DISEASE SUBTYPES CAN BE IDENTIFIED Mild Hypertension Moderate Hypertension Phenotype+4+ (31.1% of+patients) Hypertension ACE)Inhibitors Thiazides) and)ThiazideQLike) Diuretics Phenotype+2 (31.5% of+patients) Hypertension Beta)Blockers)CardioQSelective Angiotensin)II)Receptor) Antagonists Loop)Diuretics Potassium Nitrates AlphaQBeta)Blockers Vasodilators Severe Hypertension Phenotype+6+ (24.3% of+patients) Hypertension Calcium) Channel)Blockers Antihypertensive)Combinations Antiadrenergic) Antihypertensives Potassium)Sparing)Diuretics Over 80% phenotype factors are clinically meaningful Jimeng Sun)[email protected],) ))sunlab.org 50 Limestone Advantages • Unsupervised • Intuitive phenotypes • Predictive Jimeng Sun)[email protected],) ))sunlab.org 51 USING RECURSIVE NEURAL NETWORK MODELS FOR EARLY DETECTION OF HEART FAILURE ONSET How to model temporal relations in the EHR data Edward)Choi Andy)Schuetz Buzz)Stewart Edward)Choi,)Andy)Schuetz,)Walter)Stewart,)Jimeng)Sun.)Using)Recursive)Neural)Network)Models)for)Early)Detection) of)Heart)Failure)Onset,)to)appear)on)JAMIA Jimeng Sun)[email protected],) ))sunlab.org MOTIVATIONS FOR EARLY DETECTION OF HEART FAILURE Heart failure is a complex disease. Improves existing clinical guidelines of HF prevention. Reduces cost and hospitalization. Early intervention can slow down disease progression. Jimeng Sun)[email protected],) ))sunlab.org PREDICTIVE MODELING PIPELINE 1. Prediction Target 6. Performance Evaluation 2. Cohort Construction 5. Predictive Model 3. Feature Construction 4. Feature Selection Jimeng Sun)[email protected],) ))sunlab.org Deep Learning Pipeline 1. Prediction Target 6. Performance Evaluation 2. Cohort Construction 5. Predictive Model 3. Feature Construction Deep learning: 4. Representation learning Feature Selection Jimeng Sun)[email protected],) ))sunlab.org Input: Longitudinal EHR Pneumonia Benzonatate Amoxicillin Time Cough 0 0 0 0 .. .. 0 0 1 Fever 0 0 0 0 .. .. 0 1 0 1 0 0 0 .. .. 0 0 0 Chest X-ray 0 0 0 1 .. .. 0 0 0 0 0 1 0 .. .. 0 0 0 Jimeng Sun)[email protected],) ))sunlab.org 0 0 0 0 .. .. 1 0 0 Representation learning: Word2vec Input at time t: xt (a) One-hot encoding (b) Word2vec vectors N-dimensional vector Bronchitis: [1, 0, 0, 0, 0, …, 0, 0, 0] Pneumonia: [0, 1, 0, 0, 0, …, 0, 0, 0] Obesity: [0, 0, 1, 0, 0, …, 0, 0, 0] D-dimensional vector Bronchitis: [0.4, -0.2, …, 0.2] Pneumonia: [0.3, -0.3, …, 0.1] Obesity: [-0.7, 1.4, …, 1.2] N diagnoses Cataract: [0, 0, 0, 0, 0, …, 0, 0, 1] Cataract: Jimeng Sun)[email protected],) ))sunlab.org [1.2, 0.8, …, 1.5] Temporal model: RNN • • • • • xt: one-hot coded Dx, Rx, Proc at time t ht: hidden state at time t y: binary outcome of HF prediction T: total length of the medical codes Red box: a single unit of RNN y Logistic Regression h0 h1 hT-1 hT x0 x1 xT-1 xT Jimeng Sun)[email protected],) ))sunlab.org EVALUATION • Models – Baselines: logistic regression, SVM, MLP, KNN – Training data: one-hot, grouped, Skip-gram • Performance measure – AUC of 6-fold cross validation • Vary Observation window & prediction window Index Date Prediction Window Observation Window Diagnosis Date Time Jimeng Sun)[email protected],) ))sunlab.org PREDICTION PERFORMANCE OF RNN One-hot encoding 1 RNN 0.95 0.9 0.85 AUC 0.8 0.75 0.7 0.65 0.6 0.55 0.5 Logistic Regression SVM MLP KNN GRU w/o duration GRU w/ duration • RNN#model#achieves#over#10%#improvement#on#AUC Jimeng Sun)[email protected],) ))sunlab.org PREDICTION PERFORMANCE OF RNN One-hot encoding Grouped code vectors 1 0.95 0.9 0.85 AUC 0.8 0.75 0.7 0.65 0.6 0.55 0.5 Logistic Regression SVM MLP KNN GRU w/o duration GRU w/ duration • RNN)model)achieves)over)10%)improvement)on)AUC • Representation#matters Jimeng Sun)[email protected],) ))sunlab.org PREDICTION PERFORMANCE OF RNN One-hot encoding Grouped code vectors Medical concept vectors 1 0.95 0.9 0.85 AUC 0.8 0.75 0.7 0.65 0.6 0.55 0.5 Logistic Regression SVM MLP KNN GRU w/o duration GRU w/ duration • RNN)model)achieves)over)10%)improvement)on)AUC • Data#rep.#(word2vec)#>#knowledge#rep.#(medical#groupers) Jimeng Sun)[email protected],) ))sunlab.org Summary: Healthcare Predictive Modeling Efficiency Interpretation Accuracy • Parallel predictive modeling pipeline • Nonnegative tensor factorization for phenotyping • Deep learning for heart failure onset prediction Jimeng Sun)[email protected],) ))sunlab.org 68