Detecting Phenotype-Specific Interactions Between Biological Processes
Nadeem A. Ansari
Department of Computer Science
Wayne State University
Detroit, MI 48202

Outline
• Biological background
• Motivation and problem description
• Challenges and limitations
• Mathematical background
• Detecting changed interactions between biological processes in a phenotype
• Improvements
• Results
• Summary

Cells, proteins, and DNA
• Cells: fundamental units of life that contain all the working machinery necessary for their functioning
• Proteins: the main contributors of this working machinery
• Deoxyribonucleic acid (DNA): contains the blueprint for making the working machinery
• Gene expression: the process of making the working machinery

DNA
• Linear molecule of two strands, each composed of subunits called nucleotides
• Nucleotide types:
  – Adenine – A
  – Cytosine – C
  – Guanine – G
  – Thymine – T

DNA
• Base pairing:
    … A A C G G A T …
    … T T G C C T A …

Transcription
• Information stored in DNA letters is transcribed into ribonucleic acid (RNA)
• RNA: a chain of nucleotides - A, C, G, U (uracil)
    … G T G C A T …   DNA
    … C A C G U A …   RNA

Translation
• Information stored in RNA is translated into chains of amino acids - proteins

Gene expression
• The process of making the working machinery of a cell
• Regions of DNA that are synthesized into functional RNA and proteins are known as genes
• An observable characteristic (or trait) of an organism caused by gene expression is known as a phenotype

Gene expression measurement – why?
• All cells contain the same DNA but express genes selectively
• Various stimuli cause changes in gene expression
• A change in expression level results in under- or over-production of the working machinery – diseases / phenotypes
• Measuring gene expression can help us understand the underlying biological phenomena

Gene expression measurements
• Typically researchers measure gene expression in two different tissues or cell samples
  – Cells treated with a drug vs. untreated cells
• Genes expressed differently than in a control sample are called differentially expressed (DE) genes
• High-throughput technologies like DNA microarrays measure the expression levels of thousands of genes

Genes and annotations
• Functional characteristics of gene products are stored in annotation databases such as the Gene Ontology
• Gene Ontology (GO): a controlled and structured vocabulary
  – Molecular functions, biological processes, and cellular components
• Structured as directed acyclic graphs (DAGs)
  – Nodes represent terms
  – Edges represent relationships
• Parent-child relations (a term can have more than one parent)
  – Is-a, part-of, and regulates (negatively, positively)

Biological processes – GO subset
• GO is a set of terms and their definitions organized in a structure that reflects their relationships
• GO also provides a set of annotations describing what is known about each gene (product)

Motivation and problem description
• Various stimuli cause differential gene expression, which results in the over- and under-production of proteins
• Over- and under-production of proteins can result in the expression of a disease and a disease-specific phenotype
• Understanding gene behavior can help us understand diseases in new ways
  – e.g. drug targets for curing diseases

Motivation and problem description
• Current approaches look for the biological functions that are under- or over-represented in the phenotype-specific gene expression patterns
• However, life is complex and biological functions also interact
• These interactions change in a phenotype
• Understanding changed interactions between biological functions is important in understanding the underlying biological mechanism that resulted in the phenotype

Goals
• Our goal is to detect the interactions between biological functions that have changed significantly in a given phenotype
• We detect these interactions between the GO biological processes annotated with the genes differentially expressed in a phenotype

Challenges and limitations
• There is no simple way to establish which biological functions are important
  – No universally accepted statistical model exists
• Finding relationships between biological processes using mathematical models is challenging
• No known statistical model detects changed interactions in a given phenotype
• Using GO annotations presents its own challenges

Challenges and limitations
• GO is incomplete and continuously updated
  – Missing information regarding gene annotations
• GO contains inconsistencies
  – New research may make previous annotations obsolete
• The GO hierarchy poses the challenge of dependencies
  – Genes annotated with a specific term are assumed to be annotated with all the ancestors of that term
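
The hierarchy dependency noted in the challenges above (a gene annotated with a term is implicitly annotated with every ancestor of that term) can be illustrated with a small Python sketch. The term names, the `parents` mapping, and the helper functions `ancestors` and `propagate` are hypothetical and only serve the illustration; they are not taken from GO or from the authors' software.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given as child -> list of parents."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(annotations, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


# Hypothetical toy DAG: term_c has two parents, as GO allows.
parents = {
    "term_a": ["root"],
    "term_b": ["root"],
    "term_c": ["term_a", "term_b"],
    "term_d": ["term_b"],
}
direct = {"gene1": {"term_c"}, "gene2": {"term_d"}}

for gene, terms in sorted(propagate(direct, parents).items()):
    print(gene, sorted(terms))
```

After propagation, `term_b` is annotated with both genes even though neither was annotated with it directly, which is why parent and child terms tend to look correlated for purely structural reasons.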
Information retrieval (IR)
• Problem: given a query, find the relevant documents in a collection
• Vector space model (VSM)
  – Represent documents and keywords in a matrix
    • Documents as columns with keywords as components: the columns are document vectors
  – Represent the query as a (column) vector
  – Find the document vectors closest to the query vector
    • These documents are relevant to the query

Example – document retrieval
Document collection:
  D1  How to bake bread without recipes
  D2  The classic art of Viennese pastry
  D3  Numerical recipes: the art of scientific computing
  D4  Breads, pastries, pies, and cakes: quality baking recipes
  D5  Pastry: a book of best French recipes
Terms:
  T1 bake   T2 recipe   T3 bread   T4 cake   T5 pastry   T6 pie
Term-by-document matrix A:
              D1  D2  D3  D4  D5
  T1 bake      1   0   0   1   0
  T2 recipe    1   0   1   1   1
  T3 bread     1   0   0   1   0
  T4 cake      0   0   0   1   0
  T5 pastry    0   1   0   1   1
  T6 pie       0   0   0   1   0
Example taken from Berry et al., SIAM Review 41, 2 (1999)

Example (IR VSM)
• Document vector: $D_1 = (1\ 1\ 1\ 0\ 0\ 0)^T$
• User searching for documents related to "baking bread"
• Query vector: $Q = (1\ 0\ 1\ 0\ 0\ 0)^T$

Finding relevant (similar) documents
For an $m \times n$ term-by-document matrix $A = \{a_{ij}\}$ with terms $T_1, \dots, T_m$ as rows and documents $D_1, \dots, D_n$ as columns:
$$D_j = (a_{1j}\ a_{2j}\ \dots\ a_{mj})^T, \qquad Q = (q_1\ q_2\ \dots\ q_m)^T$$
$$\mathrm{similarity}(Q, D_j) = \frac{Q^T D_j}{\|Q\|\,\|D_j\|}$$
$$Q^T D_j = q_1 a_{1j} + q_2 a_{2j} + \dots + q_m a_{mj}, \qquad \|Q\| = \sqrt{Q^T Q} = \sqrt{q_1^2 + q_2^2 + \dots + q_m^2}$$

Correlation
• Determines whether two random variables vary together
• Linear correlation between X and Y:
  – Positive correlation - X increases as Y increases
  – Negative correlation - X decreases as Y increases
  – No linear correlation - no linear relationship
• For $X = (x_1, x_2, \dots, x_m)$ and $Y = (y_1, y_2, \dots, y_m)$, the Pearson correlation coefficient is
$$r_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_{XX}\,\sigma_{YY}}} = \frac{\dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{m}}{\sqrt{\dfrac{\sum (X - \bar{X})^2}{m} \cdot \dfrac{\sum (Y - \bar{Y})^2}{m}}}$$

Pearson correlation coefficient – geometric interpretation
$$r_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$
$$\sum (X - \bar{X})(Y - \bar{Y}) = (x_1 - \bar{X})(y_1 - \bar{Y}) + \dots + (x_m - \bar{X})(y_m - \bar{Y}) = x_{c1} y_{c1} + \dots + x_{cm} y_{cm} = X_c^T Y_c$$
$$r_{XY} = \frac{X_c^T Y_c}{\|X_c\|\,\|Y_c\|} = \mathrm{similarity}(X_c, Y_c)$$

Detecting interactions that have changed significantly in the phenotype
• Represent the genes differentially expressed in a phenotype, and their biological functions, as a matrix
  – Vector space model with biological processes as column vectors
• Find associations between pairs of biological processes
• Compare these associations with the corresponding associations in the absence of the phenotype
• Detect associations that are significantly different in the phenotype

Data inputs - genes and functions
• Reference genes and functions set (R)
  – M genes on a microarray
  – N GO terms annotated with the M genes
• In a biological condition under study (E)
  – m < M differentially expressed (DE) genes
  – n ≤ N GO terms annotated with the m DE genes

Gene function matrix – reference data
The matrix GF has genes $g_1, \dots, g_M$ as rows and functions $f_1, \dots, f_N$ as columns:
$$GF = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1N} \\ a_{21} & a_{22} & \dots & a_{2N} \\ \vdots & \vdots & & \vdots \\ a_{M1} & a_{M2} & \dots & a_{MN} \end{pmatrix}$$

Gene function matrix – reference data
$$GF^R_{M \times N} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
[Figure: example gene-function matrix]

Gene function matrix – experiment data
$$GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
[Figure: example gene-function matrix]

Gene function matrix – reference and experiment data
• The experiment gene-function matrix is a sub-part of the reference gene-function matrix

Our approach to solve challenges
• Use singular value decomposition (SVD)
• SVD can find missing relationships between genes and annotations in the latent semantic space and also remove noise from the data
• Noise: multiple words describing the same concepts
• SVD is a factorization of a matrix into three matrices consisting of the singular vectors and singular values of the original matrix

Singular value decomposition (SVD)
• SVD of a GF matrix: $GF = G\,S\,F^T$
• Columns of matrix G (F) are the left (right) singular vectors of GF
• S is a diagonal matrix of singular values $s_i$
  – The values on the main diagonal are ordered in non-increasing order and represent the variability in the data

Matrix approximation – dimensionality reduction
• An approximated matrix can be computed by keeping only the k largest singular values
• We select the k that retains the desired data variance (say x%) using the equation:
  [equation not recoverable from the source]

Approximated matrix – column view
• We approximate both the reference and the experiment matrices
• The approximated experiment gene-function matrix is not a sub-part of the approximated reference gene-function matrix

Correlation between functions
• Indicates the strength and direction of a linear relationship between two biological processes
• The Pearson correlation coefficient $r_{f_i, f_j}$ between a pair of functions $f_i$ and $f_j$ (as mean-centered column vectors, following the geometric interpretation of the Pearson coefficient) is computed as:
$$r_{f_i, f_j} = \frac{f_i^T f_j}{\|f_i\|\,\|f_j\|}$$
• Matrices of correlation coefficients, $R^R_{N \times N}$ and $R^E_{n \times n}$, are computed for the reference and experiment data respectively

Pair-wise correlation coefficients for reference and experiment data
• $R^R_{n \times n}$ contains the pair-wise correlation coefficients between the first n functions in the absence of the phenotype

Fisher z-transform – correlation coefficient to z-values
• Correlation coefficients from samples of a large population can be mapped to z-values using the Fisher z-transform; the transformed values are approximately normally distributed
• For a correlation coefficient r, the Fisher z-transform $Z_r$ can be computed as:
$$Z_r = \frac{1}{2} \ln\!\left(\frac{1 + r}{1 - r}\right)$$
• Compute $Z^R_r$ from $R^R_{N \times N}$ and $Z^E_r$ from $R^E_{n \times n}$

Detecting changes between functional interactions
• Hypothesis: the correlation between two biological processes in the given phenotype differs from the correlation in the reference data
• Hypothesis and test statistic: [equations not recoverable from the source]

Improvements
• The dependencies between GO terms can be partly removed by using weights in our matrix.
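
Before the weighting schemes, the core steps described so far (gene-function matrices, SVD truncation, pair-wise correlation of function columns, Fisher z-transform, and a test for changed correlations) can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names are hypothetical; the rank k is chosen from the cumulative share of squared singular values, which is one common criterion and may differ from the slides' equation; and the test statistic is the standard two-sample comparison of Fisher z-values with variance 1/(n-3) per sample, using the number of genes as the sample size. The data are random and purely illustrative.

```python
import math

import numpy as np


def truncate_svd(gf, variance=0.9):
    """Rank-k approximation of gf; k is the smallest rank whose squared
    singular values reach the requested share (assumed criterion)."""
    G, s, Ft = np.linalg.svd(gf, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(energy, variance)) + 1, len(s))
    return (G[:, :k] * s[:k]) @ Ft[:k, :], k


def column_correlations(gf):
    """Pearson correlation between columns, computed as the cosine
    similarity of mean-centered column vectors."""
    centered = gf - gf.mean(axis=0)
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = 1.0  # constant columns get correlation 0
    unit = centered / norms
    return unit.T @ unit


_erfc = np.vectorize(math.erfc, otypes=[float])


def changed_interaction_pvalues(r_ref, r_exp, n_ref, n_exp):
    """Two-sided p-values for r_exp != r_ref (assumed two-sample z-test
    on Fisher-transformed correlations)."""
    z_ref = np.arctanh(np.clip(r_ref, -0.999999, 0.999999))
    z_exp = np.arctanh(np.clip(r_exp, -0.999999, 0.999999))
    stat = (z_exp - z_ref) / math.sqrt(1.0 / (n_ref - 3) + 1.0 / (n_exp - 3))
    return _erfc(np.abs(stat) / math.sqrt(2.0))


# Illustrative data: 200 reference genes, 6 functions, 40 DE genes.
rng = np.random.default_rng(0)
gf_ref = (rng.random((200, 6)) < 0.3).astype(float)
de_genes = rng.choice(200, size=40, replace=False)
gf_exp = gf_ref[de_genes]  # all columns kept here for simplicity (n = N)

ref_approx, _ = truncate_svd(gf_ref)
exp_approx, _ = truncate_svd(gf_exp)
p = changed_interaction_pvalues(
    column_correlations(ref_approx),
    column_correlations(exp_approx),
    n_ref=gf_ref.shape[0],
    n_exp=gf_exp.shape[0],
)
for i, j in zip(*np.triu_indices(p.shape[0], k=1)):
    print(f"f{i + 1}-f{j + 1}: p = {p[i, j]:.3f}")
```

Pairs with a small p-value are the candidate "changed interactions". In a real analysis the experiment matrix would keep only the n functions annotated with DE genes, and the reference correlations would be restricted to the same functions before the comparison.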

Scheme 1-1
$$GF^R_{M \times N} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
$$GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
• This is a binary scheme and was discussed while describing our main method

Scheme 1-e
$$GF^R_{M \times N} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
$$GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} e_i & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
• $e_i$ is the normalized log-transformed fold-change measured for gene $g_i$ in the given condition

Scheme IR 1-1
$$GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 \cdot w_{ij} & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
$$w_{ij} = gb_j \times iab_i, \qquad gb_j = \frac{1}{\#\text{ of genes annotated with } f_j}, \qquad iab_i = \ln\!\left(\frac{\#\text{ of annotations for } g_i}{\text{total annotations}}\right)$$
• gb: gene (annotation) bias – GO DAG related
• iab: inverse annotation bias – experiment related

Scheme IR 1-e
$$GF^R_{M \times N} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 \cdot w_{ij} & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
$$GF^E_{m \times n} = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} e_i \cdot w_{ij} & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
• $e_i$ is the normalized log-transformed fold-change and $w_{ij} = gb_j \times iab_i$

Breast cancer data set
• van 't Veer et al. (2002) found some differentially expressed genes in breast cancer
  – 24,000 reference genes on the microarray
  – 13,201 annotated biological processes from GO
  – 231 genes were found to be differentially expressed
  – 246 annotated biological processes with the DE genes
• Since then no satisfactory prediction has been made in this regard

Breast cancer data set results
A subset of predicted biological pairs with significant interaction change:

  Scheme        GO Term 1                    GO Term 2                                       p-value
  1-1, IR 1-e   Proteolysis                  Positive regulation of apoptosis                0.0001
  1-1           Transcription                DNA replication initiation                      0.026
  1-1           DNA repair                   Regulation of transcription, DNA-dependent      0.033
  IR 1-1        Vesicle-mediated transport   Transcription from RNA polymerase II promoter   0.002
  IR 1-1        DNA replication initiation   Phosphoinositide-mediated signaling             0.00001

Breast cancer data set results summary
Number of predicted biological pairs with significant interaction change:

  Scheme    Cat. 1   Cat. 2   Cat. 3   Accuracy
  1-1         10        5        1      93.7%
  1-e         16        6        2      91.6%
  IR 1-1       9        7        2      88.8%
  IR 1-e      15        9        2      92.3%
  Total       50       27        7      91.6%

Cat. 1: Known interactions and trivial
Cat. 2: Known interactions and non-trivial
Cat. 3: Unknown

Lung cancer data set
• Beer et al. (2002) found some differentially expressed genes in lung cancer
  – 5541 reference genes on the microarray
  – 2908 annotated biological processes from GO
  – 87 genes were found to be differentially expressed
  – 248 annotated biological processes with the DE genes

Lung cancer data set results summary
Number of predicted biological pairs with significant interaction change:

  Scheme    Cat. 1   Cat. 2   Cat. 3   Accuracy
  1-1         16        3        2      90.4%
  1-e         39        3        2      95.4%
  IR 1-1      29        2        0     100.0%
  IR 1-e      38        9        3      94.0%
  Total      122       17        7      95.21%

Summary
• Various stimuli cause differential gene expression, which results in the expression of a disease and a disease-specific phenotype
• Biological processes interact, and their interactions change in a given phenotype
• We proposed methods to detect such significantly changed interactions in the observed phenotype
• We used a vector space model, matrix approximation, and statistical hypothesis testing to find changed interactions between biological processes from GO
• Results showed 89% or more accuracy for our proposed methods

References
• Ansari, N. A., Bao, R., and Drăghici, S. Detecting phenotype-specific interactions between biological processes from microarray data and annotations. Bioinformatics, under revision.
• Drăghici, S. Data Analysis Tools for DNA Microarrays. Chapman and Hall/CRC Press, 2003 (first print), 2006 (second print).
• Berry, M. W., Drmac, Z., and Jessup, E. R. Matrices, vector spaces, and information retrieval. SIAM Review 41, 2 (1999), 335-362.
• Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391-407.
• Done, B., Khatri, P., Done, A., and Drăghici, S. Predicting novel human Gene Ontology annotations using semantic analysis. IEEE/ACM Transactions on CBB (2009).

Special thanks to
• Dr. Sorin Drăghici

Thank You