Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Mining High-Throughput Biological Data David Page Dept. Biostatistics and Medical Informatics Dept. Computer Sciences University of Wisconsin-Madison Copyrighted © 2006 by David Page Some Data Types We’ll Discuss Gene expression microarray Single-nucleotide polymorphisms Mass spectrometry proteomics and metabolomics Protein-protein interactions (from coimmunoprecipitation) High-throughput screening of potential drug molecules image from the DOE Human Genome Program http://www.ornl.gov/hgmis How Microarrays Work Probes (DNA) Labeled Sample (RNA) Hybridization Gene Chip Surface Two Views of Microarray Data Data points are genes Represented by expression levels across different samples (ie, features=samples) Goal: categorize new genes Data points are samples (eg, patients) Represented by expression levels of different genes (ie, features=genes) Goal: categorize new samples Two Ways to View The Data Person Gene A28202_ac AB00014_at AB00015_at ... Person 1 1142.0 321.0 2567.2 ... Person 2 586.3 586.1 759.0 ... Person 3 105.2 559.3 3210.7 ... Person 4 42.8 692.1 812.0 ... . . . . . . ... . . . . . . ... . . . . . . ... Data Points are Genes Person Gene A28202_ac AB00014_at AB00015_at ... Person 1 1142.0 321.0 2567.2 ... Person 2 586.3 586.1 759.0 ... Person 3 105.2 559.3 3210.7 ... Person 4 42.8 692.1 812.0 ... . . . . . . ... . . . . . . ... . . . . . . ... Data Points are Samples Person Gene A28202_ac AB00014_at AB00015_at ... Person 1 1142.0 321.0 2567.2 ... Person 2 586.3 586.1 759.0 ... Person 3 105.2 559.3 3210.7 ... Person 4 42.8 692.1 812.0 ... . . . . . . ... . . . . . . ... . . . . . . ... Supervision: Add Class Values Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 1142.0 321.0 2567.2 ... normal Person 2 586.3 586.1 759.0 ... cancer Person 3 105.2 559.3 3210.7 ... normal Person 4 42.8 692.1 812.0 ... cancer . . . . . . ... . . . . . . ... . . . . . . ... Class Supervised Learning Task Given: a set of microarray experiments, each done with mRNA from a different patient (same cell type from every patient) Patient’s expression values for each gene constitute the features, and patient’s disease constitutes the class Do: Learn a model that accurately predicts class based on features Location in Task Space Data Points are: Genes Samples Clustering Supervised Data Mining Predict the class value for a patient based on the expression levels for his/her genes Leukemia (Golub et al., 1999) Classes Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML) Approach Weighted voting (essentially naïve Bayes) Cross-Validated Accuracy Of 34 samples, declined to predict 5, correct on other 29 Cancer vs. Normal Relatively easy to predict accurately, because so much goes “haywire” in cancer cells Primary barrier is noise in the data… impure RNA, cross-hybridization, etc Studies include breast, colon, prostate, lymphoma, and multiple myeloma X-Val Accuracies for Multiple Myeloma (74 MM vs. 31 Normal) Trees 98.1 Boosted Trees 99.0 SVMs 100.0 Vote 100.0 Bayes Nets 97.0 More MM (300), Benign Condition MGUS (Hardin et al., 2004) ROC Curves: Cancer vs. Normal ROC: Cancer vs. Benign (MGUS) Work by Statisticians Outside of Standard Classification/Clustering Methods to better convert Affymetrix’s low-level intensity measurements into expression levels: e.g., work by Speed, Wong, Irrizary Methods to find differentially expressed genes between two samples, e.g. work by Newton and Kendziorski But the following is most related… Ranking Genes by Significance Some biologists don’t want one predictive model, but a rank-ordered list of genes to explore further (with estimated significance) For each gene we have a set of expression levels under our conditions, say cancer vs. normal We can do a t-test to see if the mean expression levels are different under the two conditions: p-value Multiple comparisons problem: if we repeat this test for 30,000 genes, some will pop up as significant just by chance alone Could do a Bonferoni correction (multiply p-values by 30,000), but this is drastic and might eliminate all False Discovery Rate (FDR) [Storey and Tibshirani, 2001] Addresses multiple comparisons but is less extreme than Bonferoni Replaces p-value by q-value: fraction of genes with this p-value or lower that really don’t have different means in the two classes (false discoveries) Publicly available in R as part of Bioconductor package Recommendation: Use this in addition to your supervised data mining… your collaborators will want to see it FDR Highlights Difficulties Getting Insight into Cancer vs. Normal Using Benign Condition Instead of Normal Helps Somewhat Question to Anticipate You’ve run a supervised data mining algorithm on your collaborator’s data, and you present an estimate of accuracy or an ROC curve (from X-val) How did you adjust this for the multiple comparisons problem? Answer: you don’t need to because you commit to a single predictive model before ever looking at the test data for a fold—this is only one comparison Prognosis and Treatment Features same as for diagnosis Rather than disease state, class value becomes life expectancy with a given treatment (or positive response vs. no response to given treatment) Breast Cancer Prognosis (Van’t Veer et al., 2002) Classes good prognosis (no metastasis within five years of initial diagnosis) vs. poor prognosis Algorithm Ensemble of voters Results 83% cross-validated accuracy on 78 cases A Lesson Previous work selected features to use in ensemble by looking at the entire data set Should have repeated feature selection on each cross-val fold Authors also chose ensemble size by seeing which size gave highest cross-val result Authors corrected this in web supplement; accuracy went from 83% to 73% Remember to “tune parameters” separately for each cross-val fold! Prognosis with Specific Therapy (Rosenwald et al., 2002) Data set contains gene-expression patterns for 160 patients with diffuse large B-cell lymphoma, receiving anthracycline chemotherapy Class label is five-year survival One test-train split 80/80 True positive rate: 60% False negative rate: 39% Some Future Directions Using gene-chip data to select therapy Predict which therapy gives best prognosis for patient Combining Gene Expression Data with Clinical Data such as Lab Results, Medical and Family History Multiple relational tables, may benefit from relational learning Unsupervised Learning Task Given: a set of microarray experiments under different conditions Do: cluster the genes, where a gene described by its expression levels in different experiments Location in Task Space Data Points are: Genes Samples Clustering Supervised Data Mining Group genes into clusters, where all members of a cluster tend to go up or down together Example Genes (Green = up-regulated, Red = down-regulated) Experiments (Samples) Visualizing Gene Clusters (eg, Sharan and Shamir, 2000) Normalized expression Gene Cluster 1, size=20 Time (10-minute intervals) Gene Cluster 2, size=43 Unsupervised Learning Task 2 Given: a set of microarray experiments (samples) corresponding to different conditions or patients Do: cluster the experiments Location in Task Space Data Points are: Genes Samples Clustering Supervised Data Mining Group samples by gene expression profile Examples Cluster samples from mice subjected to a variety of toxic compounds (Thomas et al., 2001) Cluster samples from cancer patients, potentially to discover different subtypes of a cancer Cluster samples taken at different time points Some Biological Pathways Regulatory pathways Nodes are labeled by genes Arcs denote influence on transcription G1 codes for P1, P1 inhibits G2’s transcription Metabolic pathways Nodes are metabolites, large biomolecules (eg, sugars, lipids, proteins and modified proteins) Arcs from biochemical reaction inputs to outputs Arcs labeled by enzymes (themselves proteins) Metabolic Pathway Example H 20 HSCoA cis-Aconitate Citrate Acetyl CoA citrate synthase aconitase Oxaloacetate NADH NAD+ MDH Malate H20 fumarase (Krebs Cycle, Citric Acid Cycle) IDH NAD+ NADH + CO2 a-Ketoglutarate a-KDGH succinate thikinase Succinate FAD Isocitrate TCA Cycle, Fumarate FADH2 H 20 Succinyl-CoA GTP GDP + Pi + HSCoA NAD+ + HSCoA NADH + CO2 Regulatory Pathway (KEGG) Using Microarray Data Only Regulatory pathways Nodes are labeled by genes Arcs denote influence on transcription G1 codes for P1, P1 inhibits G2’s transcription Metabolic pathways Nodes are metabolites, large biomolecules (eg, sugars, lipids, proteins, and modified proteins) Arcs from biochemical reaction inputs to outputs Arcs labeled by enzymes (themselves proteins) Supervised Learning Task 2 Given: a set of microarray experiments for same organism under different conditions Do: Learn graphical model that accurately predicts expression of some genes in terms of others Some Approaches to Learning Regulatory Networks Bayes Net Learning (started with Friedman & Halpern, 1999, we’ll see more) Boolean Networks (Akutsu, Kuhara, Maruyama & Miyano, 1998; Ideker, Thorsson & Karp, 2002) Related Graphical Approaches (Tanay & Shamir, 2001; Chrisman, Langley, Baay & Pohorille, 2003) Bayesian Network (BN) Data geneA P(geneA) geneB Expt1 Expt2 Expt3 Expt4 Note: direction of arrow indicates dependence not causality 0.5 0.5 geneA parent node child node P(geneB) 1.0 0.0 0.5 0.5 geneB Problem: Not Causality A B A is a good predictor of B. But is A regulating B?? Ground truth might be: B A C A B A C B B C A Or a more complicated variant Approaches to Get Causality Use “knock-outs” (Pe’er, Regev, Elidan and Friedman, 2001). But not available in most organisms. Use time-series data and Dynamic Bayesian Networks (Ong, Glasner and Page, 2002). But even less data typically. Use other data sources, eg sequences upstream of genes, where transcription regulators may bind. (Segal, Barash, Simon, Friedman and Koller, 2002; Noto and Craven, 2005) A Dynamic Bayes Net gene 2 gene gene 1 3 gene N gene 2 gene gene 1 3 gene N Problem: Not Enough Data Points to Construct Large Network Fortunate to get 100s of chips But have 1000s of genes E. coli: ~4000 Yeast: ~6000 Human: ~30,000 Want to learn causal graphical model over 1000s of variables with 100s of examples (settings of the variables) Advance: Module Networks [Segal, Pe’er, Regev, Koller & Friedman, 2005] Cluster genes by similarity over expression experiments All genes in a cluster are “tied together”: same parents and CPDs Learn structure subject to this tying together of genes Iteratively re-form clusters and re-learn network, in an EM-like fashion Problem: Data are Continuous but Models are Discrete Gene chips provide a real-valued mRNA measurement Boolean networks and most practical Bayes net learning algorithms assume discrete variables May lose valuable information by discretizing Advance: Use of Dynamic Bayes Nets with Continuous Variables [Segal, Pe’er, Regev, Koller & Friedman, 2005] Expression measurements used instead of discretized (up, down, same) Assume linear influence of parents on children (Michaelis-Menten assumption) Work so far constructed the network from literature and learned parameters Problem: Much Missing Information mRNA from gene 1 doesn’t directly alter level of mRNA from gene 2 Rather, the protein product from gene 1 may alter level of mRNA from gene 2 (e.g., transcription factor) Activation of transcription factor might not occur by making more of it, but just by phosphorylating it (post-translational modification) Example: Transcription Regulation Operon Operon P O geneR T Operon Operon P O geneA geneB geneC T R DNA mRNA mRNA R Approach: Measure More Stuff Mass spectrometry (later) can measure protein rather than mRNA Doesn’t measure all proteins Not very quantitative (presence/absence) 2D gels can measure post-translational modifications, but still low-throughput because of current analysis Co-immunoprecipitation (later), Yeast 2Hybrids can measure protein interactions, but noisy Another Way Around Limitations Identify smaller part of the task that is a step toward a full regulatory pathway Part of a pathway Classes or groups of genes Examples: Chromatin remodelers Predicting the operons in E. coli Chromatin Remodelers and Nucleosome [Segal et al. 2006] Previous DNA picture oversimplified DNA double-helix is wrapped in further complex structure DNA is accessible only if part of this structure is unwound Can we predict what chromatin remodelers act on what parts of DNA, also what activates a remodeler? The E. Coli Genome Finding Operons in E. coli (Craven, Page, Shavlik, Bockhorst and Glasner, 2000) g2 g3 g4 g1 Given: Do: known operons and other E. coli data predict all operons in E. coli Additional Sources of Information gene-expression data functional annotation g5 Comparing Naive Bayes and Decision Trees (C5.0) Using Only Individual Features Single-Nucleotide Polymorphisms SNPs: Individual positions in DNA where variation is common Roughly 2 million known SNPs in humans New Affymetrix whole-genome scan measures 500,000 of these Easier/faster/cheaper to measure SNPs than to completely sequence everyone Motivation … If We Sequenced Everyone… Susceptible to Disease D or Responds to Treatment T Not Susceptible or Not Responding Example of SNP Data Person SNP 1 2 3 ... CLASS Person 1 C T A G T T ... old Person 2 C C A G C T ... young Person 3 T T A A C C ... old Person 4 C T G G T T ... young . . . . . . ... . . . . . . . ... . . . . . . . ... . Phasing (Haplotyping) Advantages of SNP Data Person’s SNP pattern does not change with time or disease, so it can give more insight into susceptibility Easier to collect samples (can simply use blood rather than affected tissue) Challenges of SNP Data Unphased Algorithms exist for phasing (haplotyping), but they make errors and typically need related individuals, dense coverage Missing values are more common than in microarray data (though improving substantially, down to around 1-2% now) Many more measurements. For example, Affymetrix human SNP chip at a half million SNPs. Supervised Learning Task Given: a set of SNP profiles, each from a different patient. Phased: nucleotides at each SNP position on each copy of each chromosome constitute the features, and patient’s disease constitutes the class Unphased: unordered pair of nucleotides at each SNP position constitute the features, and patient’s disease constitutes the class Do: Learn a model that accurately predicts class based on features Waddell et al., 2005 Multiple Myeloma, Young (susceptible) vs. Old (less susceptible), 3000 SNPs, best at 64% acc (training) SVM with feature selection (repeated on every fold of cross-validation): 72% accuracy, also naïve Bayes. Significantly better than chance. Old Young Old 31 9 Young 14 26 Actual Listgarten et al., 2005 SVMs from SNP data predict lung cancer susceptibility at 69% accuracy Naïve Bayes gives similar performance Best single SNP at less than 60% accuracy (training) Lessons Supervised data mining algorithms can predict disease susceptibility at rates better than chance and better than individual SNPs Accuracies much lower than we see with microarray data, because we’re predicting who will get disease, not who already has it Future Directions Pharmacogenetics: predicting drug response from SNP profile Drug Efficacy Adverse Reaction Combining SNP data with other data types, such as clinical (history, lab tests) and microarray Proteomics Microarrays are useful primarily because mRNA concentrations serve as surrogate for protein concentrations Like to measure protein concentrations directly, but at present cannot do so in same high-throughput manner Proteins do not have obvious direct complements Could build molecules that bind, but binding greatly affected by protein structure Time-of-Flight (TOF) Mass Spectrometry (thanks Sean McIlwain) Detector Measures the time for an ionized particle, starting from the sample plate, to hit the detector Laser Sample +V Time-of-Flight (TOF) Mass Spectrometry 2 Matrix-Assisted Laser Desorption-Ionization Detector (MALDI) Crystalloid structures made using protonrich matrix molecule Hitting crystalloid with laser causes molecules Sample to ionize and “fly” +V towards detector Laser Time-of-Flight Demonstration 0 Sample Plate Time-of-Flight Demonstration 1 Matrix Molecules Time-of-Flight Demonstration 2 Protein Molecules Time-of-Flight Demonstration 3 Laser Detector +10KV Positive Charge Time-of-Flight Demonstration 4 Laser pulsed directly onto sample Proton kicked off matrix molecule onto another molecule + +10KV Time-of-Flight Demonstration 5 Lots of protons kicked off matrix ions, giving rise to more positively charged molecules + + + + +10KV + Time-of-Flight Demonstration 6 The high positive potential under sample plate, causes positively charged molecules to accelerate towards detector + + + + +10KV + Time-of-Flight Demonstration 7 + + + + + +10Kv + Smaller mass molecules hit detector first, while heavier ones detected later Time-of-Flight Demonstration 8 + + The incident time measured from when laser is pulsed until molecule hits detector + + + +10KV + Time-of-Flight Demonstration 9 + + + + + + Experiment repeated a number of times, counting frequencies of “flight-times” +10KV Example Spectra from a Competition by Lin et al. at Duke Intensity These are different fractions from the same sample. M/Z Frequency Trypsin-Treated Spectra M/Z Many Challenges Raised by Mass Spectrometry Data Noise: extra peaks from handling of sample, from machine and environment (electrical noise), etc. M/Z values may not align exactly across spectra (resolution ~0.1%) Intensities not calibrated across spectra: quantification is difficult Cannot get all proteins… typically only several hundred. To improve odds of getting the ones we want, may fractionate our sample by 2D gel electrophoresis or liquid chromatography. Challenges (Continued) Better results if partially digest proteins (break into smaller peptides) first Can be difficult to determine what proteins we have from spectrum Isotopic peaks: C13 and N15 atoms in varying numbers cause multiple peaks for a single peptide Handling Noise: Peak Picking Want to pick peaks that are statistically significant from the noise signal Want to use these as features in our learning algorithms. Many Supervised Learning Tasks Learn to predict proteins from spectra, when the organism’s proteome is known Learn to identify isotopic distributions Learn to predict disease from either proteins, peaks or isotopic distributions as features Construct pathway models Using Mass Spectrometry for Early Detection of Ovarian Cancer [Petricoin et al., 2002] Ovarian cancer difficult to detect early, often leading to poor prognosis Trained and tested on mass spectra from blood serum 100 training cases, 50 with cancer Held-out test set of 116 cases, 50 with cancer 100% sensitivity, 95% specificity (63/66) on held-out test set Not So Fast Data mining methodology seems sound But Keith Baggerly argues that cancer samples were handled differently than normal samples, and perhaps data were preprocessed differently too If we run cancer samples Monday and normals Wednesday, could get differences from machine breakdown or nearby electrical equipment that’s running on Monday but not Wed Lesson: tell collaborators they must randomize samples for the entire processing phase and of course all our preprocessing must be same Debate is still raging… results not replicated in trials Other Proteomics: 3D Structures Other Proteomics: Interactions Figure from Ideker et al., Science 292(5518):929-934, 2001 each node represents a gene product (protein) blue edges show direct protein-protein interactions yellow edges show interactions in which one protein binds to DNA and affects the expression of another Protein-Protein Interactions Yeast 2-Hybrid Immunoprecipitation Antibodies (“immuno”) are made by combinatorial combinations of certain proteins Millions of antibodies can be made, to recognize a wide variety of different antigens (“invaders”), often by recognizing protein specific proteins antibody Protein-Protein Interactions Immunoprecipitation antibody Co-Immunoprecipitation antibody Many Supervised Learning Tasks Learn to predict protein-protein interactions: protein 3D structures may be critical Use protein-protein interactions in construction of pathway models Learn to predict protein function from interaction data ChIP-Chip Data Immunoprecipitation can also be done to identify proteins interacting with DNA rather than other proteins Chromatin immunoprecipitation (ChIP): grab sample of DNA bound to a particular protein (transcription factor) ChIP-Chip: run this sample of DNA on a microarray to see which DNA was bound Example of analysis of such new data: Keles et al., 2006 Metabolomics Measures concentration of each lowmolecular weight molecule in sample These typically are “metabolites,” or small molecules produced or consumed by reactions in biochemical pathways These reactions typically catalyzed by proteins (specifically, enzymes) This data typically also mass spectrometry, though could also be NMR Lipomics Analogous to metabolomics, but measuring concentrations of lipids rather than metabolites Potentially help induce biochemical pathway information or to help disease diagnosis or treatment choice To Design a Drug: Identify Target Protein Determine Target Site Structure Synthesize a Molecule that Will Bind Knowledge of proteome/genome Relevant biochemical pathways Crystallography, NMR Difficult if Membrane-Bound Imperfect modeling of structure Structures may change at binding And even then… Molecule Binds Target But May: Bind too tightly or not tightly enough. Be toxic. Have other effects (side-effects) in the body. Break down as soon as it gets into the body, or may not leave the body soon enough. It may not get to where it should in the body (e.g., crossing blood-brain barrier). And Every Body is Different: Even if a molecule works in the test tube and works in animal studies, it may not work in people (will fail in clinical trials). A molecule may work for some people but not others. A molecule may cause harmful sideeffects in some people but not others. Typical Practice when Target Structure is Unknown High-Throughput Screening (HTS): Test many molecules (1,000,000) to find some that bind to target (ligands). Infer (induce) shape of target site from 3D structural similarities. Shared 3D substructure is called a pharmacophore. Perfect example of a machine learning task with spatial target. Inactive Active An Example of Structure Learning Common Data Mining Approaches Represent a molecule by thousands to millions of features and use standard techniques (e.g., KDD Cup 2001) Represent each low-energy conformer by feature vector and use multiple-instance learning (e.g., Jain et al., 1998) Relational learning Inductive logic programming (e.g., Finn et al., 1998) Graph mining Supervised Learning Task Given: a set of molecules, each labeled by activity -- binding affinity for target protein -- and a set of low-energy conformers for each molecule Do: Learn a model that accurately predicts activity (may be Boolean or real-valued) ILP as Illustration: The Logical Representation of a Pharmacophore Active(X) if: has-conformation(X,Conf), has-hydrophobic(X,A), has-hydrophobic(X,B), distance(X,Conf,A,B,3.0,1.0), has-acceptor(X,C), distance(X,Conf,A,C,4.0,1.0), distance(X,Conf,B,C,5.0,1.0). This logical clause states that a molecule X is active if it has some conformation Conf, hydrophobic groups A and B, and a hydrogen acceptor C such that the following holds: in conformation Conf of molecule X, the distance between A and B is 3 Angstroms (plus or minus 1), the distance between A and C is 4, and the distance between B and C is 5. Background Knowledge I Information about atoms and bonds in the molecules atm(m1,a1,o,3,5.915800,-2.441200,1.799700). atm(m1,a2,c,3,0.574700,-2.773300,0.337600). atm(m1,a3,s,3,0.408000,-3.511700,-1.314000). bond(m1,a1,a2,1). bond(m1,a2,a3,1). Background knowledge II Definition of distance equivalence dist(Drug,Atom1,Atom2,Dist,Error):number(Error), coord(Drug,Atom1,X1,Y1,Z1), coord(Drug,Atom2,X2,Y2,Z2), euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),Dist1), Diff is Dist1-Dist, absolute_value(Diff,E1), E1 =< Error. euc_dist(p(X1,Y1,Z1),p(X2,Y2,Z2),D):Dsq is (X1-X2)^2+(Y1-Y2)^2+(Z1-Z2)^2, D is sqrt(Dsq). Central Idea: Generalize by searching a lattice Lattice of Hypotheses active(X) active(X) if has-hydrophobic(X,A) active(X) if has-hydrophobic(X,A), has-donor(X,B), distance(X,A,B,5.0) active(X) if has-donor(X,A) active(X) if has-acceptor(X,A) active(X) if active(X) if has-donor(X,A), has-acceptor(X,A), has-donor(X,B), has-donor(X,B), distance(X,A,B,4.0) distance(X,A,B,6.0) etc. Conformational model Conformational flexibility modelled as multiple conformations: Sybyl randomsearch Catalyst Pharmacophore description Atom and site centred Hydrogen bond donor Hydrogen bond acceptor Hydrophobe Site points (limited at present) User definable Distance based Example 1: Dopamine agonists Agonists taken from Martin data set on QSAR society web pages Examples (5-50 conformations/molecule) OH H3C OH N OH OH OH H2N H2N H3C OH OH OH OH H2N OH OH N HN OH CH3 Pharmacophore identified Molecule A has the desired activity if: in conformation B molecule A contains a hydrogen acceptor at C, and in conformation B molecule A contains a basic nitrogen group at D, and the distance between C and D is 7.05966 +/- 0.75 Angstroms, and in conformation B molecule A contains a hydrogen acceptor at E, and the distance between C and E is 2.80871 +/- 0.75 Angstroms, and the distance between D and E is 6.36846 +/- 0.75 Angstroms, and in conformation B molecule A contains a hydrophobic group at F, and the distance between C and F is 2.68136 +/- 0.75 Angstroms, and the distance between D and F is 4.80399 +/- 0.75 Angstroms, and the distance between E and F is 2.74602 +/- 0.75 Angstroms. Example II: ACE inhibitors 28 angiotensin converting enzyme inhibitors taken from literature D. Mayer et al., J. Comput.-Aided Mol. Design, 1, 3-16, (1987) O HS N O CH3 HO N O N COOH P N H OH N O COOH COOH CH3 N H N O COOH ACE pharmacophore Molecule A is an ACE inhibitor if: molecule A contains a zinc-site B, molecule A contains a hydrogen acceptor C, the distance between B and C is 7.899 +/- 0.750 A, molecule A contains a hydrogen acceptor D, the distance between B and D is 8.475 +/- 0.750 A, the distance between C and D is 2.133 +/- 0.750 A, molecule A contains a hydrogen acceptor E, the distance between B and E is 4.891 +/- 0.750 A, the distance between C and E is 3.114 +/- 0.750 A, the distance between D and E is 3.753 +/- 0.750 A. Pharmacophore discovered Distance Progol Mayer A 4.9 5.0 B 3.8 3.8 C 8.5 8.6 Zinc site H-bond acceptor B A C Additional Finding Original pharmacophore rediscovered plus one other different zinc ligand position similar to alternative proposed by CibaGeigy 4.0 3.9 7.3 Example III: Thermolysin inhibitors 10 inhibitors for which crystallographic data is available in PDB Conformationally challenging molecules Experimentally observed superposition Key binding site interactions Asn112-NH O OH O=C Asn112 S2’ NH Arg203-NH O S1’ O Zn P NH O R O=C Ala113 Interactions made by inhibitors Interaction Asn112-NH S2’ Asn112 C=O Arg 203 NH S1’ Ala113-C=O Zn 1HYT 1THL 1TLP 1TMN 2TMN 4TLN 4TMN 5TLN 5TMN 6TMN Pharmacophore Identification Structures considered 1HYT 1THL 1TLP 1TMN 2TMN 4TLN 4TMN 5TLN 5TMN 6TMN Conformational analysis using “Best” conformer generation in Catalyst 98-251 conformations/molecule Thermolysin Results 10 5-point pharmacophore identified, falling into 2 groups (7/10 molecules) 3 “acceptors”, 1 hydrophobe, 1 donor 4 “acceptors, 1 donor Common core of Zn ligands, Arg203 and Asn112 interactions identified Correct assignments of functional groups Correct geometry to 1 Angstrom tolerance Thermolysin results Increasing tolerance to 1.5Angstroms finds common 6-point pharmacophore including one extra interaction Example IV: Antibacterial peptides [Spatola et al., 2000] Dataset of 11 pentapeptides showing activity against Pseudomonas aeruginosa 6 actives <64mg/ml IC50 5 inactives Pharmacophore Identified A Molecule M is active against Pseudomonas Aeruginosa if it has a conformation B such that: M has a hydrophobic group C, M has a hydrogen acceptor D, the distance between C and D in conformation B is 11.7 Angstroms M has a positively-charged atom E, the distance between C and E in conformation B is 4 Angstroms the distance between D and E in conformation B is 9.4 Angstroms M has a positively-charged atom F, the distance between C and F in conformation B is 11.1 Angstroms the distance between D and F in conformation B is 12.6 Angstroms the distance between E and F in conformation B is 8.7 Angstroms Tolerance 1.5 Angstroms Clinical Databases of the Future (Dramatically Simplified) PatientID Gender Birthdate P1 M PatientID Date P1 P1 3/22/63 Lab Test 1/1/01 blood glucose 1/9/01 blood glucose PatientID Date Physician Symptoms P1 P1 Result 42 45 PatientID Date Prescribed Date Filled P1 5/17/98 5/18/98 1/1/01 2/1/03 Smith Jones Diagnosis palpitations hypoglycemic fever, aches influenza PatientID SNP1 SNP2 … SNP500K P1 P2 AA AB AB BB BB AA Physician Medication Dose Jones prilosec 10mg Duration 3 months Final Wrap-up Molecular biology collecting lots and lots of data in post-genome era Opportunity to “connect” molecular-level information to diseases and treatment Need analysis tools to interpret Data mining opportunities abound Hopefully this tutorial provided solid start toward applying data mining to highthroughput biological data Thanks To Jude Shavlik John Shaughnessy Bart Barlogie Mark Craven Sean McIlwain Jan Struyf Arno Spatola Paul Finn Beth Burnside Michael Molla Michael Waddell Irene Ong Jesse Davis Soumya Ray Jo Hardin John Crowley Fenghuang Zhan Eric Lantz If Time Permits… some of my group’s directions in the area Clinical Data (with Jesse Davis, Beth Burnside, M.D.) Addressing another problem with current approaches to biological network learning (with Soumya Ray, Eric Lantz) Using Machine Learning with Clinical Histories: Example We’ll use example of Mammography to show some issues that arise These issues arise here even with just one relational table These issues are even more pronounced with data in multiple tables Supervised Learning Task Given: a database of mammogram abnormalities for different patients (same cell type from every patient) Radiologist-entered values describing the abnormality constitute the features, and abnormality’s biopsy result as benign or malignant constitutes the class Do: Learn a model that accurately predicts class based on features Mammography Database [Davis et al, 2005; Burnside et al, 2005] Patient Abnormality Date Mass Shape … P1 P1 P1 … 1 2 3 … 5/02 5/04 5/04 … Spic Var Spic … Mass Size 0.03 0.04 0.04 … Loc RU4 RU4 LL3 … Be/Mal B M B … Original Expert Structure Level 1: Parameters Be/Mal Shape Size Given: Features (node labels, or fields in database), Data, Bayes net structure Learn: Probabilities. Note: probabilities needed are Pr(Be/Mal), Pr(Shape|Be/Mal), Pr (Size|Be/Mal) Level 2: Structure Be/Mal Shape Size Given: Features, Data Learn: Bayes net structure and probabilities. Note: with this structure, now will need Pr(Size|Shape,Be/Mal) instead of Pr(Size|Be/Mal). Mammography Database Patient Abnormality Date Mass Shape … P1 P1 P1 … 1 2 3 … 5/02 5/04 5/04 … Spic Var Spic … Mass Size 0.03 0.04 0.04 … Loc RU4 RU4 LL3 … Be/Mal B M B … Mammography Database Patient Abnormality Date Mass Shape … P1 P1 P1 … 1 2 3 … 5/02 5/04 5/04 … Spic Var Spic … Mass Size 0.03 0.04 0.04 … Loc RU4 RU4 LL3 … Be/Mal B M B … Mammography Database Patient Abnormality Date Mass Shape … P1 P1 P1 … 1 2 3 … 5/02 5/04 5/04 … Spic Var Spic … Mass Size 0.03 0.04 0.04 … Loc RU4 RU4 LL3 … Be/Mal B M B … Level 3: Aggregates Avg size this date Be/Mal Shape Size Given: Features, Data, Background knowledge – aggregation functions such as average, mode, max, etc. Learn: Useful aggregate features, Bayes net structure that uses these features, and probabilities. New features may use other rows/tables. Mammography Database Patient Abnormality Date Mass Shape … P1 P1 P1 … 1 2 3 … 5/02 5/04 5/04 … Spic Var Spic … Mass Size 0.03 0.04 0.04 … Loc RU4 RU4 LL3 … Be/Mal B M B … Mammography Database Patient Abnormality Date Mass Shape … P1 P1 P1 … 1 2 3 … 5/02 5/04 5/04 … Spic Var Spic … Mass Size 0.03 0.04 0.04 … Loc RU4 RU4 LL3 … Be/Mal B M B … Mammography Database Patient Abnormality Date Mass Shape … P1 P1 P1 … 1 2 3 … 5/02 5/04 5/04 … Spic Var Spic … Mass Size 0.03 0.04 0.04 … Loc RU4 RU4 LL3 … Be/Mal B M B … Level 4: View Learning Shape change in abnormality at this location Increase in average size of abnormalities Avg size this date Be/Mal Shape Size Given: Features, Data, Background knowledge – aggregation functions and intensionally-defined relations such as “increase” or “same location” Learn: Useful new features defined by views (equivalent to rules or SQL queries), Bayes net structure, and probabilities. Example of Learned Rule is_malignant(A) IF: 'BIRADS_category'(A,b5), 'MassPAO'(A,present), 'MassesDensity'(A,high), 'HO_BreastCA'(A,hxDCorLC), in_same_mammogram(A,B), 'Calc_Pleomorphic'(B,notPresent), 'Calc_Punctate'(B,notPresent). ROC: Level 2 (TAN) vs. Level 1 Precision-Recall Curves All Levels of Learning 1 Level 4 (View) Precision 0.9 0.8 Level 3 (Aggregate) 0.7 Level 2 (Structure) 0.6 Level 1 (Parameter) 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 Recall 0.6 0.7 0.8 0.9 1 SAYU-View Improved View Learning approach SAYU: Score As You Use For each candidate rule, add it to the Bayesian network and see if it improves the network’s score Only add a rule (new field for the view) if it improves the Bayes net Relational Learning Algorithms 1 SAYU-View 0.9 Precision 0.8 Initial Level 4 (View) 0.7 Level 3 (Aggregates) 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 Recall 0.7 0.8 0.9 1 Average Area Under PR Curve For Recall >= 0.5 0.12 SAYU-View Initial Level 4 AURPR 0.10 0.08 0.06 0.04 0.02 0.00 Level 3 Clinical Databases of the Future (Dramatically Simplified) PatientID Gender Birthdate P1 M PatientID Date P1 P1 3/22/63 Lab Test 1/1/01 blood glucose 1/9/01 blood glucose PatientID Date Physician Symptoms P1 P1 Result 42 45 PatientID Date Prescribed Date Filled P1 5/17/98 5/18/98 1/1/01 2/1/03 Smith Jones Diagnosis palpitations hypoglycemic fever, aches influenza PatientID SNP1 SNP2 … SNP500K P1 P2 AA AB AB BB BB AA Physician Medication Dose Jones prilosec 10mg Duration 3 months Another Problem with Current Learning of Regulatory Models Current techniques all use greedy heuristic Bayes net learning algorithms use “sparse candidate” approach: to be considered as a parent of gene 1, another gene 2 must be correlated with gene 1 CPDs often represented as trees: use greedy tree learning algorithms All can fall prey to functions such as exclusive-or – do these arise? Skewing Example [Page & Ray, 2003; Ray & Page, 2004; Rosell et al., 2005; Ray & Page, 2005] Female 0 0 1 1 SxL Gene Gene3 … Gene100 Survival 0 0…0000000 0…0000001 … 1…1111111 0 1 0…0000000 0…0000001 … 1…1111111 1 0 0…0000000 0…0000001 … 1…1111111 1 1 0…0000000 0…0000001 … 1…1111111 0 Drosophila survival based on gender and Sxl gene activity Hard Functions Our Definition: those functions for which no attribute has “gain” according to standard purity measures (GINI, Entropy) NOTE: “Hard” does not refer to size of representation Example: n-variable odd parity Many others Learning Hard Functions Standard method of learning hard functions (e.g. with decision trees): depth-k Lookahead O(mn2 k+1-1 ) for m examples in n variables We devise a technique that allows learning algorithms to efficiently learn hard functions Key Idea Hard functions are not hard for all data distributions We can skew the input distribution to simulate a different one By randomly choosing “preferred values” for attributes Accumulate evidence over several skews to select a split attribute Example: Uniform Distribution x1 x2 0 0 0 1 1 0 1 1 x3 0 1 0 1 0 1 0 1 f 0 1 1 0 GINI ( f ) 0.25 GINI ( f ; xi 0) 0 .25 GINI ( f ; xi 1) 0 .25 GAIN ( xi ) 0 Example: Skewed Distribution (“Sequential Skewing”, Ray & Page, ICML 2004) x1 0 0 1 1 x2 0 1 0 1 x3 0 1 0 1 0 1 0 1 f 0 1 1 0 Weight 1 4 1 4 3 4 3 4 1 4 1 4 3 4 3 4 GINI ( f ) 0.25 GINI ( f ; x1 1) 0.19 GINI ( f ; x2 1) 0.25 GINI ( f ; x3 1) 0.25 GAIN ( x1 ) 0 GAIN ( x2 ) 0 GAIN ( x3 ) 0