Pattern Detection and Co-methylation Analysis of Epigenetic Features in Human Embryonic Stem Cells

Ben Niu, Qiang Yang, Jinyan Li, Hong Xue, Simon Chikeung Shiu, Weichuan Yu, Huiqing Liu, Sankar Kumar Pal
HKPolyU

Computational Epigenetics
An emerging and exciting area that combines state-of-the-art machine learning with molecular biology.
It aims to understand the epigenetic processes in gene transcriptional regulation and to add this knowledge to the medical arsenal for treating human diseases.

The Research
Human Epigenome Project (HEP): the next wave after the Human Genome Project (HGP).
Started in 2003, after completion of the Human Genome Project.
HEP aims to identify the epigenetic markers associated with human diseases.
The ‘Journal of Epigenetics’ has been released: the first journal dedicated to communications in epigenetics, started in 2006.
Series of publications in highly cited journals (Nature, Cell) in 2005-07:
Focus issue on epigenetics, Nature Reviews Genetics, April 2007.
Special issue on epigenetics, Cell, February 2007.
J. Bioinformatics: we have been jointly invited to write a review paper on computational epigenetics for the Journal of Bioinformatics.

The Industry
Epigenetics opens a rapidly growing market for epigenetic medical services (diagnostics, drugs).
According to a 2007 report by MarketResearch, as shown in the figure, the global market for epigenetic applications (i.e., drugs + diagnostic services) will reach 4 billion US$ by 2012; the annual growth rate at present is 60.4%.
[Figure: global market (million US$) by year]
Promising direction!

What we know
Basically:
Genes can be turned on or off through cytosine methylation or histone modifications, a reversible process.
Epigenetic events are heritable and can change a cell's phenotype without altering its DNA sequence.
Functionally:
Epigenetic events dominate the growth of cancer cells and embryonic stem cells.
These two types of cells are of great medical interest: cancer is a leading cause of human death, and hESCs hold the answer to regenerative treatments.
For these two points see: Nature Insight: Epigenetics, Vol. 447, 2007.

What we don't know
The logic of DNA methylation underlying cell behavior remains unclear.
How DNA methylation orchestrates the molecular machinery behind cell functions is not understood.
In the context of epigenetics, we need to address two issues:
1. What are the rules of DNA methylation that distinguish cancer cells, normal cells, and human ES cells from each other?
2. What are the interaction patterns of the genes in these cells, i.e., what role does methylation play in coordinating gene activity?

State of the art in Methylation Analysis
SVMs and ANNs have been successfully applied to predict epigenetic events, for example:
Methylation status of CpG sites
CpG islands / promoter regions in DNA sequence:
‘CpG island mapping by epigenome prediction’, PLoS Computational Biology, Vol. 3(6), 2007.
‘Promoter prediction analysis on the whole human genome’, Nature Biotechnology, Vol. 22, 2004.
Cancers:
‘Computational prediction of methylation status in human genomic sequence’, PNAS, Vol. 103(28), 2006.
‘Tumour class prediction and discovery by microarray-based DNA methylation analysis’, NAR, Vol. 30, 2002.
Co-regulation analysis through clustering
Clustering of methylation arrays:
Marjoram P, Chang J, Laird PW, Siegmund KD: Cluster analysis for DNA methylation profiles having a detection threshold. BMC Bioinformatics, Vol. 7, 2006.

Two problems
1. Traditional methods such as SVMs and ANNs are ‘black box’ models. The knowledge they extract is encoded in connection weights and support vectors, which is hard for biologists to understand.
2. The co-methylation patterns in cancer cells and human embryonic stem cells (hESCs) need to be investigated. Co-methylation analysis can help uncover hidden pathways, leading to new drug design.

Methodology
Two computational methods are proposed:
1. Adaptive Cascade Sharing Trees (ACS4) for problem 1: to learn human-understandable DNA methylation rules.
2. Adaptive clustering for problem 2: to highlight how genes are orchestrated for function through the methylation mechanism.

ACS4 method (1)
Promoters are regulatory elements upstream (on the 5’ side) of the TSS.
Methylation of promoter CpGs remodels the chromatin structure for gene expression.
[Figure labels: methylated CpG, methyl-binding proteins (MeCP), methyltransferase, histone deacetylases (HDAC)]

ACS4 method (2)
Methylation levels of promoters can be measured using microarrays.
Each spot on the array corresponds to a promoter CpG site.
The methylation intensity is a numerical value between 0 and 1.

ACS4 method (3)
Objective: learn human-understandable rules that define the epigenetic process in cancer and embryonic stem cells.
Idea: adaptively partition the numeric attributes into a set of linguistic domains, e.g., ‘Very High’, ‘High’, ‘Medium’, ‘Low’, ‘Very Low’.
Train a committee of trees to select the most salient features and predict through voting.

ACS4 method (4)

ACS4 method (5)

ACS4 method (6)

ACS4 method (7)
We have learned k rules.
Given a testing sample, compute pi.
Rules are weighted according to their coverage, i.e., the number of matched samples.
The overall prediction is made by voting across the rules.

ACS4 method (8)
Dataset: 37 hESC, 33 non-hESC, 24 cancer cell lines, 9 normal cell lines; 1,536 attributes.
Result:
Just 2 attributes are enough to separate the 3 cell types, instead of the 40 attributes selected with Fisher’s score in [1].
Wet-lab cost can be reduced by testing on 2 attributes only, instead of 40.
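The coverage-weighted voting described in ACS4 method (7) can be sketched as follows. This is only an illustration, not the authors' implementation: the attribute names (`cpg_a`, `cpg_b`), the coverage counts, and the fixed 0.5 cut point are all hypothetical, and the real method partitions each attribute adaptively rather than with a fixed threshold.

```python
from collections import defaultdict


def to_label(value, cut=0.5):
    """Map a methylation intensity in [0, 1] to a linguistic label.

    Assumption: a single fixed cut point stands in for the adaptive
    partition used by the actual method.
    """
    return "High" if value >= cut else "Low"


# Each rule is (conditions, predicted class, coverage).
# Attribute names and coverage counts are made up for illustration.
RULES = [
    ({"cpg_a": "High"}, "hESC", 30),
    ({"cpg_a": "Low", "cpg_b": "Low"}, "Normal", 8),
    ({"cpg_a": "Low", "cpg_b": "High"}, "Cancer", 20),
]


def predict(sample, rules=RULES):
    """Return the class with the largest coverage-weighted vote.

    Only rules whose conditions all match the sample get to vote;
    None is returned when no rule matches.
    """
    labels = {name: to_label(value) for name, value in sample.items()}
    votes = defaultdict(float)
    for conditions, cls, coverage in rules:
        if all(labels.get(attr) == want for attr, want in conditions.items()):
            votes[cls] += coverage
    if not votes:
        return None
    return max(votes, key=votes.get)


print(predict({"cpg_a": 0.9, "cpg_b": 0.1}))  # prints hESC
```

Weighting by coverage means that when several rules match the same sample, the rule supported by more training samples decides the outcome.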
Accuracy is better than that of the compared methods, except SVM, but SVM cannot tell us ‘why’.
The rules are easy for biologists to understand and can guide new wet-lab experiments seeking proof.
[1] ‘Human embryonic stem cells have a unique epigenetic signature’, Genome Research, Vol. 16, 2006.

ACS4: Biological interpretation (1)
Example:
IF PI3-504 is ‘High’ THEN hESC
IF PI3-504 is ‘Low’ AND NPY-1009 is ‘Low’ THEN Normal
IF PI3-504 is ‘Low’ AND NPY-1009 is ‘High’ THEN Cancer

ACS4: Biological interpretation (2)
The two marker genes:
PI3 (PI 3-kinases): activate cell growth, proliferation, differentiation, motility, and intracellular trafficking. Down-regulated in hESCs to maintain a stable state, keeping the cells from growth, proliferation, differentiation...
Neuropeptide Y (NPY): a signal protein produced by nerves [Immunology: Stress and Immunity, Science, Vol. 311, 2006]. Experiments show that NPY deficiency causes immune defects, which is consistent with our computational result.

ACS4: Biological interpretation (3)
Example:
IF PI3-504 is ‘High’ THEN hESC
The PI3 gene is silenced to maintain a stable cell context in hESCs.
IF PI3-504 is ‘Low’ AND NPY-1009 is ‘Low’ THEN Normal
Normal cells can grow, and grow safely with immune defenses.
IF PI3-504 is ‘Low’ AND NPY-1009 is ‘High’ THEN Cancer
Cancer cells grow, and grow out of control, due to the immune deficiency.

Adaptive clustering (1)
Co-methylation of genes is important because we want to know how genes work together in the epigenetic framework.
Clustering should reflect the true distribution of the gene space, assuming the data are normally distributed, which is usually the case in real-world applications.
Fisher’s criterion is computed to validate each clustering result and to choose the best one.

Adaptive clustering (2)
For embryonic and cancer cells, we cluster the 1,536 genes optimally: each round of k-means clustering starts from a different number of initial centers.
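The selection loop described so far (k-means run with different numbers of centers, with a Fisher-type criterion choosing the best run) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code. The slides do not define the criterion or the initialization, so the score here is assumed to be a variance ratio (between-cluster over within-cluster scatter, each divided by its degrees of freedom), and the centers are seeded with a deterministic farthest-point rule. Comparing profiles by Pearson correlation is likewise a choice made for this sketch, implemented by standardizing each profile before ordinary k-means. All function names are hypothetical.

```python
import math


def standardize(profile):
    """Center a profile and scale it to unit length.

    For standardized profiles x and y, ||x - y||^2 = 2 * (1 - r), where r
    is their Pearson correlation, so Euclidean k-means on standardized
    rows groups profiles by correlation.
    """
    mean = sum(profile) / len(profile)
    centered = [v - mean for v in profile]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0:
        raise ValueError("a constant profile has no defined correlation")
    return [c / norm for c in centered]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def kmeans(rows, k, n_iter=50):
    """Plain k-means; farthest-point seeding is an assumption of this sketch."""
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous center
                centers[j] = mean_vector(members)
    return labels, centers


def variance_ratio(rows, labels, centers):
    """Assumed Fisher-type score; requires 2 <= k < number of rows."""
    n, k = len(rows), len(centers)
    grand = mean_vector(rows)
    within = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    between = sum(labels.count(j) * sq_dist(centers[j], grand) for j in range(k))
    if within == 0:
        return math.inf
    return (between / (k - 1)) / (within / (n - k))


def adaptive_clustering(profiles, candidate_ks):
    """Cluster once per candidate k and keep the highest-scoring result."""
    rows = [standardize(p) for p in profiles]
    best = None
    for k in candidate_ks:
        labels, centers = kmeans(rows, k)
        score = variance_ratio(rows, labels, centers)
        if best is None or score > best[2]:
            best = (k, labels, score)
    return best
```

Calling `adaptive_clustering(profiles, range(2, 61))` on a list of per-gene methylation profiles would return the chosen number of clusters, the cluster label of each gene, and the score; the upper bound of 60 is only an example value.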
The candidate clustering with the largest Fisher’s discriminant score qualifies for further analysis.
The genes in each cluster can be functionally related and participate in the same DNA methylation pathway.
By further analysing the sequences, we can find the characteristic binding sites for each cluster of genes and discover previously unknown epigenetic binding factors.

Adaptive clustering (3)
For cancer cells and hESCs, 41 and 59 clusters, respectively, give the best separation.
So 41 and 59 functional domains are thought to underlie the 1,536 genes.

Adaptive clustering (4)
In experiments: the distance measure d is based on Pearson’s correlation score. N = 60.

Adaptive clustering (5)
For hESCs, the clusters of co-methylated genes, e.g., MAGEA1, STK23, EFNB1, MKN3, TMEFF2, AR, FMR1, are mostly related to hESC differentiation, self-renewal, and migration.

Adaptive clustering (6)
For cancer cells, the clusters of co-methylated genes, e.g., RASGRF1, MYC, and CFTR, are highly involved in cell apoptosis, DNA repair, tumour suppression, and ion transport, which are typically the immunological activities of cells against DNA damage.

Adaptive clustering (7)
In particular, we discover that the gene CFTR (7q31), long a focus of medical research, is co-methylated with MT1A (16q13) and KCNK4 (11q13).
CFTR defects contribute to cystic fibrosis (CF). One in twenty-two people of European descent carries one gene for CF, making it the most common lethal genetic disease among such people, and there is still no cure.
The CFTR and KCNK4 proteins form ion channels across cell membranes, while MT1A proteins bind ions as transporters.
All three are involved in transporting ions across the cell membrane, so they are functionally related.
They can participate in the same pathway, the breakdown of which can explain the process of tumorigenesis.

Adaptive clustering (8)
To summarize:
Co-methylation occurs widely across the whole genome.
It dominates the growth and development of various types of cells.
Different cells exhibit different patterns of co-methylation.
Our adaptive clustering algorithm can naturally capture the group-wise activities in these cells.

Conclusion
Genome-wide epigenetic analysis is a promising direction for research and industry.
The logic of DNA methylation can be learned and interpreted using our proposed ACS4 algorithm:
Just 2 attributes are enough to separate the 3 cell types, instead of the 40 attributes selected with Fisher’s score in the Genome Research paper [1].
Testing on just 2 attributes instead of 40 significantly reduces wet-lab cost.
Adaptively partitioning the attribute domain gives higher accuracy.
The knowledge learned is human-understandable and helps biologists design wet-lab tests for further investigation.
Adaptive clustering:
Epigenetic events are highly active in cancer cells and hESCs.
Functionally related genes are co-methylated.
Patterns of co-methylation differ greatly between cancer cells and hESCs, highlighting the versatile roles of epigenetic events in cell function.

Thanks!