A stepwise procedure for conditional testing of GO term overrepresentation
Constantin Georgescu
© Intelligent Systems and Bioinformatics Laboratory

The human genome
• The whole hereditary information of an organism: instructions providing all the information necessary for a living organism to grow and live.
• The instructions are encoded in the form of DNA molecules. DNA encodes a detailed set of plans, like a blueprint, for building different parts of a cell.
• They reside in the nucleus of every cell, on 23 pairs of chromosomes.
• The DNA molecule forms a double helix, a string built with the four-letter DNA alphabet A, C, T, G. A DNA strand is made of letters that make words that make sentences called "genes".
• Genes: segments of chromosomal DNA that encode and direct the synthesis of a protein; proteins carry out most cellular functions.
• Sequenced by 2003; 2 meters of DNA; 3 billion bp; about 25,000 genes.
• About 97% is "junk" DNA (it does not code for protein).

Differential expression
• Cells: the fundamental working units of every living organism.
• Each cell contains a complete copy of the organism's genome.
• Cells are of many different types and states, e.g. blood, nerve, and skin cells, dividing cells, cancerous cells, etc.
• What makes the cells different?
• Differential gene expression, i.e., when, where, and how much each gene is expressed.
• On average, 40% of our genes are expressed at any given time.

Central dogma
• The expression of the genetic information stored in the DNA molecule occurs in two stages:
  – (i) transcription, during which DNA is transcribed into mRNA;
  – (ii) translation, during which mRNA is translated to produce a protein.
  DNA -> mRNA -> protein
• Other important aspects of gene regulation: methylation, alternative splicing, etc.

Examining Gene Expression
• Understanding the functions of genes depends on knowing when and in which cells each of them is expressed.
• Microarray chips (developed in the 1990s) allow the expression of thousands of genes to be examined simultaneously.
• Microarray chips are glass slides spotted with many rows containing tiny amounts of probe DNA, one for each of thousands of genes.
• They measure the amount of mRNA transcribed from a gene in a particular cell type through complementary binding.
• They give rapid and sensitive tests in a variety of experimental studies on different cell types: cancer cells versus normal cells, liver cells versus kidney cells, etc.

A. RNA is isolated from cells from two samples (in this illustration, infected and uninfected plant cells).
B. The mRNA from both samples is copied to a more stable form, called cDNA, using reverse transcriptase.
C. At the same time, the cDNA is labeled with fluorescent tags (a different color tag for each sample).
D. The tagged cDNA is placed on the microarray chip, where it binds to the corresponding DNA that makes up the genes that have been previously spotted on the chip.
E. The chip is placed in a laser scanner, which identifies the genes that hybridize to each sample (uninfected = green; infected = red; both samples = yellow).
F. The data are displayed on a computer screen where expression of the individual genes can be identified.

Combining data across slides
Data on G genes for n hybridizations result in a G x n gene-by-array data matrix:

        Array1  Array2  Array3  Array4  Array5  …
Gene1    0.46    0.30    0.80    1.51    0.90   …
Gene2   -0.10    0.49    0.24    0.06    0.46   …
Gene3    0.15    0.74    0.04    0.10    0.20   …
Gene4   -0.45   -1.03   -0.79   -0.56   -0.32   …
Gene5   -0.06    1.06    1.35    1.09   -1.09   …
…
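
A minimal sketch of how such a gene-by-array matrix might be screened for differentially expressed genes. It assumes, purely for illustration, that each entry is a log-ratio between the two samples hybridized on one array (so 0 means no change), and it ranks genes by a one-sample t statistic. The names `expr`, `one_sample_t` and `ranked` are hypothetical; a real analysis would use all G genes and a multiple-testing correction.

```python
from math import sqrt
from statistics import mean, stdev

# Excerpt of the gene-by-array matrix: one row per gene, columns Array1..Array5.
expr = {
    "Gene1": [0.46, 0.30, 0.80, 1.51, 0.90],
    "Gene2": [-0.10, 0.49, 0.24, 0.06, 0.46],
    "Gene3": [0.15, 0.74, 0.04, 0.10, 0.20],
    "Gene4": [-0.45, -1.03, -0.79, -0.56, -0.32],
    "Gene5": [-0.06, 1.06, 1.35, 1.09, -1.09],
}


def one_sample_t(values):
    """t statistic for H0: mean log-ratio = 0 (assumption: entries are log-ratios)."""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))


# Rank genes by the absolute value of their t statistic, largest first.
ranked = sorted(expr, key=lambda g: abs(one_sample_t(expr[g])), reverse=True)

for gene in ranked:
    print(gene, round(mean(expr[gene]), 3), round(one_sample_t(expr[gene]), 2))
```

On this five-gene excerpt Gene4 ranks first; with only five arrays the ranking is indicative at best.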
Preprocessing -> normalization -> summarization -> testing => list of differentially expressed genes

Gene Groups
• Challenge: go from sequence to function, i.e., define the role of each gene and understand how the genome functions as a whole.
• The complete genome sequence doesn't tell us much about how the organism functions as a biological system.
• We need to study how different gene products interact to produce various components.
• Most important activities are not the result of a single molecule but depend on the coordinated effects of multiple molecules.

Gene Ontology
• A common set of terms and descriptions for basic biological functions, processes and entities. (A mechanism for representing a community's domain knowledge in a form accessible to humans and amenable to computation.)
• GO provides a restricted vocabulary and a clear description of the relationships between terms.
• The Gene Ontology Consortium produces 3 independent ontologies:
  – Biological Process: "biological objective to which the gene product contributes"; accomplished via one or more ordered assemblies of molecular functions.
    Ex: cell growth; signal transduction ("almost a pathway").
  – Molecular Function: "biochemical activity or action of the gene product". Ex: "enzyme", "transporter", "ligand".
  – Cellular Component: component of a cell that is part of some larger object or structure. Ex: chromosome, nucleus, ribosome.

Gene Ontology
• Organized as a DAG with many-to-many relationships.
• Child terms are more specific than their parents.
• Is-a / part-of ("has a") relationships.
• Mapping of genes to GO terms is carried out separately (e.g. chip meta-data, GOA).
• Mapping is as specific as possible.
• Annotations propagate up through the hierarchy.
• "Across dependences": one gene mapped to several GO terms.

Gene set analysis
Given:
• a directed acyclic graph (GO graph) and a set of items (genes) such that:
  – each node in the graph contains some genes
  – the parent of a node contains all the genes of its child
  – a node can contain genes that are not found in its children
• a subset of genes that we call significant genes (differentially expressed genes)
Goal:
• find the nodes of the graph (biological functions) that best represent the significant genes with respect to some scoring function (some test statistic)
Over-representation analysis (ORA) is based on Fisher's (hypergeometric) test:
  – most popular method: easy; exact; works for small sets; stable
  – implemented in GOstats, Onto-Express, GOMiner, Ontologizer, FatiGO, MAPPfinder, …

Fisher's exact test
The score for a GO term is the degree of independence between the two properties:
A = {gene is in the list of significant genes}
B = {gene is found in the GO term}
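
The 2x2 contingency table for A and B, and the test built on it, can be sketched in code. The block below is a minimal illustration, not the implementation used by any of the tools listed above. The symbols N (genes on the array), M (genes in the GO term), K (significant genes) and x (significant genes in the term) follow the hypergeometric formulation of the test; the function names are hypothetical.

```python
from math import comb


def contingency_table(x, M, K, N):
    """2x2 counts: rows = significant / not significant, columns = in term / not in term."""
    return [
        [x, K - x],              # significant genes
        [M - x, N - M - K + x],  # non-significant genes
    ]


def hypergeom_pmf(x, M, K, N):
    """Probability of exactly x term genes among K genes drawn from N (M in the term)."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


def fisher_upper_tail(x, M, K, N):
    """One-sided p-value: probability of x or more term genes among the K significant."""
    return sum(hypergeom_pmf(i, M, K, N) for i in range(x, min(M, K) + 1))


if __name__ == "__main__":
    print(contingency_table(10, 400, 40, 1000))
    for n_genes in (1000, 5000):
        print(n_genes, fisher_upper_tail(10, 400, 40, n_genes))
```

For x=10, M=400, K=40 the p-value depends strongly on the reference size N, which is the point of the Parent-Child discussion: the slides quote 0.98 for N=1000 and 0.0009145082 for N=5000.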
• Testing the independence of the two groups in the 2x2 contingency table of A against B corresponds to Fisher's exact test [Khatri and Draghici, 2005].

Fisher's exact test
For computing the significance of a gene set, we can use a hypergeometric test:
• N genes are on the microarray
• Bio is a GO term
  – M genes in Bio
  – N − M genes not in Bio
• Let K be the number of significant genes
• What is the probability of having exactly x genes of type Bio among the K?
  P(X = x) = C(M, x) C(N − M, K − x) / C(N, K), where C(n, k) is the binomial coefficient
• This is the probability of getting exactly x by chance (not what we want); the p-value is the probability of getting x or more, P(X >= x).

Parent-Child method
• What is the proper N?
• x=10, M=400, K=40: N=1000 => pval=0.98; N=5000 => pval=0.0009145082
• Need unspecific prefiltering (remove genes not expressed in any sample)
• Remove genes not present in any GO term
• The Parent-Child method (Grossmann) proposes N = number of genes in the parent of the current GO term

Complex test dependence
• Gene annotations propagate through the DAG
• Genes are annotated to multiple unrelated GO terms (across dependence)
• Implicit propagation of GO term significance
• No reasonable p-value correction method available

Elim method
The main idea: test how enriched node x is if we do not consider the genes from its significant children (Alexa A. 2006).
• The nodes are processed bottom-up. This ensures that all children of node x are investigated before node x itself.
• The p-value for node x is computed using Fisher's exact test.
• If node x is found significant, remove all the genes mapped to this node from all its ancestors.
• Elimw: uses some heuristic to ease gene removal
• Essentially the Parent-Child method at the other end of the DAG

Step method
• First attempt: do both.
  Good ordering but (very) little test power
• Need to reduce conditioning as much as possible (to recover test power) => stepwise feature selection
• Asymptotically: hypergeometric test -> binomial -> normal -> chi-square -> likelihood ratio (information criteria test)
• Feature selection uses AIC/BIC = f(information criteria): AIC = IC − d; BIC = IC − d*log(N)/2
• Translate BIC back in terms of the hypergeometric => Fisher test with an adaptive p-value threshold
• Develop closed-form solutions, specific to this particular situation, for the difference in deviances of two models

Step methods
• Reduces to Parent-Child / elim for nodes at the bottom / top of the GO DAG
• Adaptive threshold: no need to choose a cutoff for the p-value
• Results in independent tests (makes p-value correction methods valid)
• Developed in terms of the hypergeometric test: fast, applicable to small GO terms

Simulation results
1000 iterations; 3 nodes enriched; 1/20 vs 1/100

                 tpr     fpr     tsel   sel
enriched nodes   1.000   0.000   3000   3000
sigN             0.063   0.936   2832   44696
sigNc            0.078   0.921   1036   13237
Grossman         0.110   0.889   2034   18416
selGlobGO        0.608   0.391   1262   2073
selectsGOi       0.516   0.483   1252   2422
selectsGOi2      0.289   0.710   1441   4978
selectsGOih      0.472   0.527   1261   2669

- Use of Affymetrix U133 gene arrays
- Explored the APC-induced gene expression in the lung of baboons challenged with lethal doses of E. coli at 8 hrs. Expression pattern and biological significance of the differentially expressed genes were explored using Gene Ontology (GO) and pathway analysis.
- 6 samples (3 control, 3 lethal E. coli)
- 8700 expressed genes
- 294 differentially expressed genes (at 0.01 FDR)
- 44 BP GO terms (<0.01)

GOBPID      mark  Term
GO:0009607  10    response to biotic stimulus
GO:0010038  8     response to metal ion
GO:0006508  9     proteolysis
GO:0045185        maintenance of protein localization
GO:0019363        pyridine nucleotide biosynthesis
GO:0042327  8     positive regulation of phosphorylation
GO:0008624  10    induction of apoptosis by extracellular signals

Significant GO terms with Step

GOBPID      Pvalue  ExpCount  Count  Size  Term
GO:0009607  0.0000  16.0550   39     519   response to biotic stimulus
GO:0010038  0.0090   0.1547    2       5   response to metal ion
GO:0006508  0.0012   9.4969   20     307   proteolysis
GO:0045185  0.0040   0.3403    3      11   maintenance of protein localization
GO:0019363  0.0055   0.1237    2       4   pyridine nucleotide biosynthesis
GO:0042327  0.0180   0.2165    2       7   positive regulation of phosphorylation
GO:0008624  0.0291   0.6806    3      22   induction of apoptosis by extracellular signals

GOBPID      markGO  pvlw    W  pvlGsm  G  pvlGlb  S  pvlGih  I
GO:0009607  10      1.0000  0  0.0002  1  0.0000  1  0.0000  1
GO:0010038  8       0.0090  1  0.4762  0  0.0047  1  0.0053  1
GO:0006508  9       0.0890  0  0.0012  1  0.0008  1  0.0014  1
GO:0045185  0       1.0000  0  0.0084  1  0.0008  1  0.0010  1
GO:0019363  0       0.0055  1  0.0084  1  0.0029  1  0.0032  1
GO:0042327  8       0.1448  0  0.2000  0  0.0015  1  0.0016  1
GO:0008624  10      0.0291  0  0.0307  0  0.0059  1  0.0069  1

GO:0009607 response to biotic stimulus: "A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a biotic stimulus, a stimulus caused or produced by a living organism."
GO:0010038 response to metal ion: "A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a metal ion stimulus."
GO:0006508 proteolysis: "The hydrolysis of a peptide bond or bonds within a protein."
GO:0045185 maintenance of protein localization: "The processes by which a protein is maintained in a location and prevented from moving elsewhere. These include sequestration, stabilization to prevent transport elsewhere and the active retrieval of proteins that do move away."
GO:0019363 pyridine nucleotide biosynthesis: "The chemical reactions and pathways resulting in the formation of a pyridine nucleotide, a nucleotide characterized by a pyridine derivative as a nitrogen base."
GO:0042327 positive regulation of phosphorylation: "Any process that activates or increases the frequency, rate or extent of addition of phosphoric groups to a molecule."
GO:0008624 induction of apoptosis by extracellular signals: "Any process induced by extracellular signals that directly activates any of the steps required for cell death by apoptosis."

First Connected Component

GOBPID      Term                                     mark  I
GO:0030218  erythrocyte differentiation              3     0
GO:0030099  myeloid cell differentiation             9     0
GO:0006857  oligopeptide transport                   4     0
GO:0015833  peptide transport                        0     0
GO:0045185  maintenance of protein localization      0     1
GO:0006621  protein retention in ER                  5     0
GO:0019363  pyridine nucleotide biosynthesis         0     1
GO:0007259  JAK-STAT cascade                         10    0
GO:0018108  peptidyl-tyrosine phosphorylation        0     0
GO:0042327  positive regulation of phosphorylation   8     1
GO:0008624  induction apoptosis by extcell signals   10    1

GOBPID      Term
GO:0006508  proteolysis
GO:0006511  ubiquitin-dependent protein catabolism
GO:0006568  tryptophan metabolism
GO:0006569  tryptophan catabolism
GO:0006576  biogenic amine metabolism
GO:0006586  indolalkylamine metabolism
GO:0006725  aromatic compound metabolism
GO:0009056  catabolism
GO:0009072  aromatic amino acid family metabolism
GO:0009074  aromatic amino acid family catabolism
GO:0019439  aromatic compound catabolism
GO:0019941  modification-dependent protein catabolism
GO:0030163  protein catabolism
GO:0042219  amino acid derivative catabolism
GO:0042402  biogenic amine catabolism
GO:0042430  indole and derivative metabolism
GO:0042434  indole derivative metabolism
GO:0042436  indole derivative catabolism
GO:0043285  biopolymer catabolism
GO:0043632  modification-dependent macromlc catabolism
GO:0046218  indolalkylamine catabolism

Selection with Bayesian network
[Figure: network graph whose nodes are a DIFF node and GO term IDs, e.g. GO:0006508, GO:0009607, GO:0045185]

The acute lymphoblastic leukemia (ALL) microarray dataset of Chiaretti et al. (2004)
Differential gene expression between B-cell ALL with the BCR/ABL fusion (37 samples) and cytogenetically normal NEG B-cell ALL (42 samples)

The BCR/ABL fusion (Dudoit 2006)
A number of recent articles have investigated the prognostic relevance of the BCR/ABL fusion in adult ALL of the B-cell lineage (Gleissner et al., 2002). The BCR/ABL fusion is the molecular analogue of the Philadelphia chromosome, one of the most frequent cytogenetic abnormalities in human leukemias. This t(9;22) translocation leads to a head-to-tail fusion of the v-abl Abelson murine leukemia viral oncogene homolog 1 (ABL1) from chromosome 9 with the 5' half of the breakpoint cluster region (BCR) on chromosome 22 (Figure 4). The ABL1 proto-oncogene encodes a cytoplasmic and nuclear protein tyrosine kinase that has been implicated in processes of cell differentiation, cell division, cell adhesion, and stress response. Although the BCR/ABL fusion protein, encoded by sequences from both the ABL1 and BCR genes, has been extensively studied, the function of the normal product of the BCR gene is not clear.
The BCR/ABL proto-oncogene has been found to be highly expressed in chronic myeloid leukemia (CML) and acute myeloid leukemia (AML) cells (Mukhopadhyay et al., 2002). (See Figure 4 in Dudoit paper)

GO:0007155 cell adhesion: "The attachment of a cell, either to another cell or to an underlying substrate such as the extracellular matrix, via cell adhesion molecules."
GO:0007154 cell communication: "Any process that mediates interactions between a cell and its surroundings. Encompasses interactions such as signaling or attachment between one cell and another cell, between a cell and an extracellular matrix, or between a cell and any other aspect of its environment."
GO:0008283 cell proliferation: "The multiplication or reproduction of cells, resulting in the rapid expansion of a cell population."
GO:0007165 signal transduction: "The cascade of processes by which a signal interacts with a receptor, causing a change in the level or activity of a second messenger or other downstream target, and ultimately effecting a change in the functioning of the cell."
GO:0007166 cell surface receptor linked signal transduction: "Any series of molecular signals initiated by the binding of an extracellular ligand to a receptor on the surface of the target cell."

BCR vs NEG, ALL file

GOBPID      Pvalue  ExpCount  Count  Size  Term
GO:0007155  0.0000   6.8198   19     114   cell adhesion
GO:0008283  0.0856  11.1271   16     186   cell proliferation
GO:0007154  0.0001  42.4743   65     710   cell communication
GO:0007165  0.0002  40.4404   61     676   signal transduction
GO:0007166  0.0006  12.7423   25     213   cell surface receptor linked signal transduction
GO:0043067  0.0093   8.4949   16     142   regulation of programmed cell death
GO:0042981  0.0093   8.4949   16     142   regulation of apoptosis
GO:0048519  0.0021  17.7076   30     296   negative regulation of biological process
GO:0043118  0.0023  16.2120   28     271   negative regulation of physiological process
GO:0051243  0.0041  16.0924   27     269   negative regulation of cellular physiological process
GO:0048523  0.0083  16.9299   27     283   negative regulation of cellular process
GO:0009653  0.0005   8.4350   19     141   morphogenesis
GO:0007275  0.0008  21.4166   36     358   development
GO:0000902  0.0019   4.6662   12      78   cellular morphogenesis
GO:0007420  0.0019   0.2991    3       5   brain development
GO:0048731  0.0025   4.1876   11      70   system development
GO:0007399  0.0025   4.1876   11      70   nervous system development
GO:0031175  0.0042   0.7179    4      12   neurite development
GO:0009887  0.0051   2.7519    8      46   organ morphogenesis
GO:0048513  0.0066   7.4779   15     125   organ development
GO:0048468  0.0067   1.2563    5      21   cell development
GO:0048666  0.0077   0.8375    4      14   neuron development
GO:0007611  0.0036   0.1196    2       2   learning and/or memory
GO:0030036  0.0097   3.0510    8      51   actin cytoskeleton organization and biogenesis

GOBPID      pvlw    W  pvlGsm  G  pvlGlb  S  pvlGih  Ih
GO:0007155  0.0002  1  0.0000  1  0.0114  1  0.0002  1
GO:0008283  0.2002  0  0.0382  0  0.0012  1  0.0032  1
GO:0007154  0.0153  0  0.0000  1  0.0000  1  0.0000  1
GO:0007165  0.3324  0  0.8081  0  0.7354  0  0.9084  0
GO:0007166  0.5743  0  0.0656  0  0.7487  0  0.6235  0
GO:0043067  1.0000  0  0.0080  1  0.1348  0  0.1414  0
GO:0042981  0.3622  0  0.1605  0  0.1348  0  0.1414  0
GO:0048519  1.0000  0  0.0033  1  0.1178  0  0.0703  0
GO:0043118  0.4735  0  0.0014  1  0.2156  0  0.1252  0
GO:0051243  0.2914  0  0.0025  1  0.2096  0  0.1211  0
GO:0048523  0.6374  0  0.0166  0  0.2096  0  0.1211  0
GO:0009653  1.0000  0  0.0613  0  0.1760  0  0.0105  0
GO:0007275  0.8927  0  0.0008  1  0.0055  1  0.0069  0
GO:0000902  0.6335  0  0.0038  1  0.1496  0  0.0215  0
GO:0007420  0.0019  1  0.0103  0  0.0350  0  0.0099  0
GO:0048731  1.0000  0  0.0672  0  0.2145  0  0.0241  0
GO:0007399  0.3264  0  1.0000  0  0.2145  0  0.0241  0
GO:0031175  1.0000  0  0.4945  0  0.2947  0  0.1582  0
GO:0009887  0.0082  1  0.1553  0  0.0666  0  0.0084  0
GO:0048513  0.8719  0  0.2363  0  0.1892  0  0.0115  0
GO:0048468  1.0000  0  0.0443  0  0.1343  0  0.0416  0
GO:0048666  1.0000  0  0.3801  0  0.2947  0  0.1582  0
GO:0007611  0.0036  1  0.0077  1  1.0000  0  1.0000  0
GO:0030036  0.0073  1  0.7184  0  0.0680  0  0.1294  0

- disagreement about whether or not to include development
- cell proliferation not significant initially, very significant after conditioning

REFERENCES
1. http://www.learner.org/channel/courses/biology/support/1_genom.pdf
2. Tarca AL, Romero R, Draghici S. Analysis of microarray experiments of gene expression profiling. American Journal of Obstetrics and Gynecology, 195(2):373-388, August 2006.
3. Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21(18):3587-3595, September 2005.
4. Alexa A, et al. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 13, 2006.
5. Grossmann S, Bauer S, Robinson PN, Vingron M. An improved statistic for detecting over-represented Gene Ontology annotations in gene sets. Lecture Notes in Computer Science 3909, pp. 85-98, March 2006.
6. Drăghici S, et al. Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res., 31:3775-3781, 2003.
7. Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics, 23, Jan 15, 2007.
8. Zhu H, et al. Genomic and structural analysis of the protective effects of activated protein C in a baboon model of E. coli sepsis. ISTH 2007 Congress, 2007.
9. Chiaretti S, et al. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103:2771-2778, 2004.
10. Dudoit S. Multiple Tests of Association with Biological Annotation Metadata.