Learning rule-based models from gene expression time profiles annotated with Gene Ontology terms
Jan Komorowski and Astrid Lägreid
Joint work with
• Torgeir R. Hvidsten, Herman Midelfart, Astrid Lægreid and Arne K. Sandvik
J. Komorowski and A. Lägreid

Selected Challenges in Gene Expression Analysis
• Function similarity corresponds to expression similarity, but:
  – Functionally correlated genes may be dissimilar in expression (e.g. anti-coregulated)
  – Genes usually have multiple functions
  – Measurements may be approximate and contradictory
• Can we obtain clusters of biologically related genes?
• Can we build models that classify unknown genes into functional classes, that are human-legible, and that handle approximate and often contradictory data?
• How can we re-use biological knowledge?

Data
• Data material
  – Serum-starved fibroblasts, 8,613 genes
    • Added serum to medium at time = 0
    • Used starved fibroblasts as reference
    • Measured gene activity at various time points
  – 493 genes found to be differentially expressed
• Results
  – 278 genes known (3 repeats)
  – 212 genes unknown (uncharacterized)
  – 211 genes given hypothetical function with 88% quality

Fibroblast - serum response
[Figure: serum is added at time 0; samples for microarray analysis are taken along the time axis 0, 1, 4, 8, 24. Cell states: quiescent, non-proliferating, proliferating.]

Processes
[Figure: processes placed along the time axis 0, 1, 4, 8, 24 (quiescent, non-proliferating, proliferating): stress response, protein synthesis, transcription, organelle biogenesis, lipid synthesis, re-entry cell cycle, cell motility.]

Dynamic processes
[Figure: response classes along the time axis 0, 1, 4, 8, 24 (quiescent, non-proliferating, proliferating): immediate early, delayed immediate early, intermediate, late; primary, secondary, tertiary.]

Protein appears after the transcript
[Figure: time axis 0, 1, 4, 8, 24 (quiescent, non-proliferating, proliferating); primary, secondary, tertiary.]

Protein dynamics are not always similar to transcript dynamics
[Figure: gene, transcript and protein along the time axis 0, 1, 4, 8, 24.]

Molecular mechanisms of transcriptional response
[Figure labels, in source order: serum = signal; effectors; secondary transcription factors = cellular response; immediate early response factors; immediate early response genes; delayed immediate early response genes; intermediate/late response genes.]

The dynamics of cellular processes
[Figure: processes along the time axis 1, 4, 8, 24 (quiescent, non-proliferating, proliferating): stress response, cell motility, cell adhesion, DNA synthesis, energy metabolism, protein synthesis, cell cycle regulation, DNA synthesis, cell motility, lipid synthesis, cell proliferation (negative regulation).]

Methodology
Ontology (Process):
  – Defense response: g2, ...
  – Transport: g2, ...
  – Positive control of cell proliferation: g4, g5, ...
  – Cell cycle control: g3, ...

Gene expression time series:
Gene  0HR    15MIN  30MIN  1HR    2HR    4HR    6HR    8HR    12HR   16HR   20HR   24HR
g1    0.00   -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  -0.94
g2    0.00    0.66   0.07   0.29  -0.45  -0.29  -0.45  -0.04  -0.15  -0.30  -0.38  -0.81
g3    0.00    0.14   0.20  -0.89  -0.29  -0.15  -0.42   0.00  -0.58  -0.18  -0.49  -1.12
g4    0.00   -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  -0.62
g5    0.00    0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  -0.74
...   ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...

1. Mining functional classes from an ontology
   Gene  Process
   g1    Unknown
   g2    Transport and defense response
   g3    Cell cycle control
   g4    Positive control of cell proliferation
   g5    Positive control of cell proliferation
   ...   ...
2. Extracting features for learning
3. Inducing minimal decision rules using rough sets
   0 - 4(Increasing) AND 6 - 10(Decreasing) AND 14 - 18(Constant) => GO(cell proliferation)
   [Figure: expression profiles, time axis 0 to 24, expression axis -2 to 1.5.]
4. The function of unknown genes is predicted using the rules!
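
The rule shown in step 3 of the methodology can be read as a conjunction of (interval, template) conditions that implies one or more GO processes. The following is a minimal sketch of that reading, not the authors' implementation: the `Condition` and `Rule` classes, the `rule_fires` helper and the two gene feature sets are hypothetical, and the interval labels are copied from the slide as opaque strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    interval: str  # interval label copied verbatim from the slide, e.g. "0 - 4"
    template: str  # "Increasing", "Decreasing" or "Constant"


@dataclass(frozen=True)
class Rule:
    conditions: frozenset  # all conditions must hold (logical AND)
    go_terms: tuple        # processes predicted when the rule fires


def rule_fires(rule, gene_features):
    """A rule fires when every one of its conditions is among the gene's features."""
    return rule.conditions <= gene_features


# The rule from the methodology slide.
rule = Rule(
    conditions=frozenset({
        Condition("0 - 4", "Increasing"),
        Condition("6 - 10", "Decreasing"),
        Condition("14 - 18", "Constant"),
    }),
    go_terms=("cell proliferation",),
)

# Hypothetical feature sets for two genes (illustration only, not real data).
gene_a = frozenset({
    Condition("0 - 4", "Increasing"),
    Condition("6 - 10", "Decreasing"),
    Condition("14 - 18", "Constant"),
    Condition("0 - 2", "Increasing"),
})
gene_b = frozenset({
    Condition("0 - 4", "Increasing"),
    Condition("14 - 18", "Constant"),
})

for name, features in (("gene_a", gene_a), ("gene_b", gene_b)):
    if rule_fires(rule, features):
        print(name, "=>", " OR ".join(f"GO({t})" for t in rule.go_terms))
    else:
        print(name, "is not covered by the rule")
```

Representing the conditions as a set makes the AND of the rule a simple subset test; how the rules themselves are induced with rough sets is outside this sketch.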

Gene Ontology
[Figure: Gene Ontology tree. Labels: GENE FUNCTION with branches FUNCTION, PROCESS and CELLULAR COMPARTMENT. Process terms shown, in figure order:]
Cell growth and maintenance; Metabolism; Energy pathways; Nucleotide and nucleic acid metabolism; DNA metabolism; Mutagenesis; DNA repair; DNA packaging; Transcription; Protein metabolism and modification; Amino-acid and derivative metabolism; Protein targeting; Lipid metabolism; Transport; Ion homeostasis; Intracellular protein traffic; Cell death; Cell motility; Stress response; Organelle organization and response; Oncogenesis; Cell proliferation; Cell cycle; Cell communication; Cell adhesion; Signal transduction; Cell surface receptor linked signal transduction; Intracellular signalling cascade; Developmental processes; Physiological processes; Blood coagulation; Circulation

Biological processes from GO
• Energy pathways
• DNA metabolism
• Amino acid and derivative metabolism
• Protein targeting
• Lipid metabolism
• Transport
• Ion homeostasis
• Intracellular traffic
• Cell death
• Cell motility
• Stress response
• Oncogenesis
• Cell cycle
• Cell adhesion
• Cell surface receptor linked signal transduction
• Developmental processes
• Blood coagulation
• Circulation
• Intracellular signaling cascade
• Organelle organization and biogenesis

Hierarchical Clustering of the Fibroblast Data
It's not a cluster!

Gene Ontology vs. Clusters found by Iyer et al.

Template-based feature synthesis
• Templates: Increasing, Decreasing, Constant
• Inputs: all possible subintervals in the time series + gene expression time series data
• MATCH
• Output: groups containing genes matching the same templates over the same subinterval
• 12 measurement points, 55 possible intervals of length >2 (i.e. spanning at least 3 measurement points)

Examples of template definitions
[Figure: Increasing-template over 2HR, 4HR, 6HR, 8HR, 12HR, annotated with MIN. and MAX thresholds (values shown: 0, 0.1, 0.2, 0.5, 0.6, 1.0); Constant-template over 4HR, 6HR, 8HR, 12HR, annotated with MEAN and MIN. 0.2.]
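
The template-based feature synthesis described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' definitions: the exact template thresholds are not recoverable from the slides, so `min_change` and `tolerance` are hypothetical parameters, and the `matches` logic is one simple way to define the three templates. The count of 55 subintervals follows directly from requiring at least 3 of the 12 measurement points. The example profile is the g4 row from the methodology slide.

```python
from itertools import combinations

TIME_POINTS = ["0HR", "15MIN", "30MIN", "1HR", "2HR", "4HR",
               "6HR", "8HR", "12HR", "16HR", "20HR", "24HR"]
TEMPLATES = ("Increasing", "Decreasing", "Constant")


def subintervals(n_points, min_points=3):
    """All (start, end) index pairs spanning at least min_points measurements."""
    return [(s, e) for s, e in combinations(range(n_points), 2)
            if e - s + 1 >= min_points]


def matches(template, values, min_change=0.5, tolerance=0.2):
    """Hypothetical template definitions; the thresholds are assumptions."""
    steps = [b - a for a, b in zip(values, values[1:])]
    if template == "Increasing":
        # Overall rise of at least min_change, no step dropping more than tolerance.
        return (values[-1] - values[0] >= min_change
                and all(d >= -tolerance for d in steps))
    if template == "Decreasing":
        # Overall fall of at least min_change, no step rising more than tolerance.
        return (values[0] - values[-1] >= min_change
                and all(d <= tolerance for d in steps))
    if template == "Constant":
        # Every value stays within tolerance of the window mean.
        mean = sum(values) / len(values)
        return all(abs(v - mean) <= tolerance for v in values)
    raise ValueError(f"unknown template: {template}")


def extract_features(profile):
    """Set of (start, end, template) features matched by one expression profile."""
    features = set()
    for s, e in subintervals(len(profile)):
        window = profile[s:e + 1]
        for template in TEMPLATES:
            if matches(template, window):
                features.add((s, e, template))
    return features


# g4 row from the methodology slide.
g4 = [0.00, -0.04, 0.00, -0.23, -0.25, -0.47, -0.60, -0.56, -1.09, -0.71, -0.76, -0.62]

print(len(subintervals(len(TIME_POINTS))), "candidate subintervals")
for s, e, template in sorted(extract_features(g4)):
    print(f"{TIME_POINTS[s]} - {TIME_POINTS[e]} ({template})")
```

Genes that share a feature, i.e. match the same template over the same subinterval, would then form the groups named on the slide.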

Rule example 1
Rule: 0 - 4(Constant) AND 0 - 10(Increasing) => GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)
Covered genes: M35296, J02783, D13748, X05130, X60957, D13748, U90918 (unknown)
[Figure: expression profiles of the covered genes, time axis 0 to 24, expression axis -1 to 3.]

Rule example 2
Rule: 0 - 4(Increasing) AND 6 - 10(Decreasing) AND 14 - 18(Constant) => GO(cell proliferation) OR GO(cell-cell signaling) OR GO(intracellular signaling cascade) OR GO(oncogenesis)
Covered genes: Y07909, X58377, U66468, X58377, X85106, Y07909
[Figure: expression profiles of the covered genes, time axis 0 to 24, expression axis -2 to 1.5.]

Classification using template-based rules
• The profile of gene X60957 is matched against the full rule set (IF … THEN …)
  [Figure: expression profile of X60957, time axis 0 to 24, expression axis -1 to 3.]
• Matching rule: IF 0 - 4(Constant) AND 0 - 10(Increasing) THEN GO(prot. met. and mod.) OR … (+4)
• Votes:
  Process                               Votes
  protein metabolism and modification   6
  protein amino acid phosphorylation    3
  proteolysis and peptidolysis          2
  transcription                         1
  transport                             1
  vision                                1
  …
• Votes are normalized, and processes with vote fractions higher than a selection threshold are chosen as predictions

Cross validation estimates: Iyer et al.
PROCESS                                 AUC   SE
Ion homeostasis                         1.00  0.00
Protein targeting                       0.99  0.03
Blood coagulation                       0.96  0.08
DNA metabolism                          0.94  0.09
Intracellular signaling cascade         0.94  0.06
Energy pathways                         0.93  0.12
Cell cycle                              0.93  0.04
Oncogenesis                             0.92  0.11
Circulation                             0.91  0.11
Cell death                              0.90  0.10
Developmental processes                 0.90  0.07
Transcription                           0.88  0.11
Defense (immune) response               0.88  0.05
Cell adhesion                           0.87  0.09
Stress response                         0.86  0.15
Protein metabolism and modification     0.85  0.10
Cell motility                           0.84  0.11
Cell surface rec linked signal transd   0.82  0.15
Lipid metabolism                        0.81  0.14
Transport                               0.79  0.17
Cell organization and biogenesis        0.79  0.11
Cell proliferation                      0.79  0.06
Amino acid and derivative metabolism    0.69  0.06
AVERAGE                                 0.88  0.09

Selection thresholds:
  A: Coverage: 84%  Precision: 50%
  B: Coverage: 71%  Precision: 60%
  C: Coverage: 39%  Precision: 90%
Coverage = TP/(TP+FN)
Precision = TP/(TP+FP)

Cross validation estimates: Cho et al.
Process                                 GO          AUC   SE
apoptosis*                              GO:0006915  0.81  0.01
carbohydrate metabolism                 GO:0005975  0.72  0.02
cell adhesion*                          GO:0007155  0.77  0.02
cell cycle control*                     GO:0000074  0.83  0.01
cell motility*                          GO:0006928  0.81  0.01
cell proliferation                      GO:0008283  0.80  0.01
cell surface rec linked signal transd   GO:0007166  0.79  0.01
cell-cell signaling                     GO:0007267  0.80  0.01
DNA metabolism                          GO:0006259  0.78  0.02
energy pathways                         GO:0006091  0.76  0.02
humoral immune response                 GO:0006959  0.77  0.02
immune response                         GO:0006955  0.81  0.01
intracellular signaling cascade         GO:0007242  0.81  0.02
lipid metabolism                        GO:0006629  0.71  0.02
mesoderm development                    GO:0007498  0.77  0.02
mitotic cell cycle*                     GO:0000278  0.84  0.01
neurogenesis                            GO:0007399  0.78  0.01
oncogenesis                             GO:0007048  0.77  0.01
phototransduction                       GO:0007602  0.85  0.01
physiological processes                 GO:0007582  0.77  0.01
protein biosynthesis                    GO:0006412  0.80  0.02
protein metabolism and modification     GO:0006411  0.77  0.01
protein amino acid phosphorylation      GO:0006468  0.82  0.01
proteolysis and peptidolysis            GO:0006508  0.80  0.02
transcription                           GO:0006350  0.71  0.01
transport                               GO:0006810  0.71  0.01
vision                                  GO:0007601  0.83  0.01
AVERAGE                                             0.78  0.01

Coverage: 58%  Precision: 61%
Coverage = TP/(TP+FN)
Precision = TP/(TP+FP)

Protein Metabolism and Modification
[Figure legend: A – annotations; B – false negatives; C – false positives; D – true positives; E – pred. unknown gene]

Re-classification of the Known Genes

Co-classifications for the Unknown Genes

Conclusions
• Our methodology
  – Incorporates background biological knowledge
  – Handles the noise and incompleteness in the microarray data well
  – Can be objectively evaluated
  – Predicts multiple functions per gene
  – Can reclassify known genes and suggest possible new functions for them
  – Can provide hypotheses about the function of unknown genes
• Experimental work needs to be done to confirm our predictions

Genomic ROSETTA: http://www.idi.ntnu.no/~aleks/rosetta
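
The voting step from the classification slide and the coverage and precision measures from the cross-validation slides can be sketched as follows. This is a minimal illustration, not the authors' code: the slides do not say how votes are normalized, so dividing by the total number of votes is an assumption, and the threshold value 0.2 and the function names are hypothetical. The vote counts are those shown for gene X60957.

```python
def select_predictions(votes, threshold):
    """Normalize vote counts and keep processes whose fraction exceeds the threshold.

    Assumption: normalization divides each count by the total number of votes.
    """
    total = sum(votes.values())
    if total == 0:
        return {}
    fractions = {process: count / total for process, count in votes.items()}
    return {process: frac for process, frac in fractions.items() if frac > threshold}


def coverage(tp, fn):
    """Coverage = TP / (TP + FN), as defined on the slides."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Precision = TP / (TP + FP), as defined on the slides."""
    return tp / (tp + fp)


# Vote counts from the classification slide (gene X60957).
votes = {
    "protein metabolism and modification": 6,
    "protein amino acid phosphorylation": 3,
    "proteolysis and peptidolysis": 2,
    "transcription": 1,
    "transport": 1,
    "vision": 1,
}

# Hypothetical selection threshold; the slides report results for three
# thresholds (A, B, C) without giving their values.
for process, frac in select_predictions(votes, threshold=0.2).items():
    print(f"{process}: {frac:.2f}")
```

Raising the threshold keeps fewer, better-supported predictions, which is the trade-off visible in the A, B and C rows: precision goes up as coverage goes down.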