A graph-based integration of multiple layers of cancer genomics data (Progress Report)
2010.03.12
Do Kyoon Kim

Outline
• Introduction
• Databasing TCGA Data
• Graph-based Semi-Supervised Learning with Gene Expression Data

Introduction
• Although microarray technology allows the investigation of the transcriptomic make-up of a tumor, the transcriptome does not completely reflect the underlying biology due to alternative splicing and post-translational modification
• This increases the importance of integrating more than one source of genome-wide data, such as the genome, transcriptome, proteome, and epigenome
• The current increase in the amount of available omics data emphasizes the need for a methodological integration framework
• Data integration: different points of view
  – Heterogeneous data from different sources are analyzed sequentially
  – The term data integration has also been used as a synonym for data merging, in which different data sets are concatenated at the database level by cross-referencing the identifiers
  – Multiple layers of experimental data are integrated into one mathematical model for the development of more homogeneous classifiers in clinical decision support
Daemen et al., 2009, Genome Medicine

The Cancer Genome Atlas (TCGA)
• Mission
  – The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale sequencing
• Goal
  – To improve our ability to diagnose, treat and prevent cancer
  – A pilot project developed and tested the research framework needed to systematically explore the entire spectrum of genomic changes involved in human cancer
• Focus on three selected cancer types
  – Serous cystadenocarcinoma (ovarian)
  – Squamous carcinoma (lung)
  – Glioblastoma multiforme (brain)
• 500 samples per tumor type

TCGA data
How to integrate?
TCGA Research Network, 2008, Nature

The second page of TCGA project
Specific Goal
• Problem: Prediction of recurrence in GBM patients using multiple types of genomic data
[Figure: data types: PHENOTYPE, SNP, SEQUENCE, EXPRESSION, METHYLATION, COPY NUMBER, miRNA]

Biological Organization
[Figure: from TF binding sites (TFbs) and genes (TF binding, SNP, methylation, CNV, LOH, Del) through TRANSCRIPTION (alternative splicing), EXPRESSION (microRNA, mRNA) and TRANSLATION (post modification, glucosylation, phosphorylation) to protein FUNCTION and Phenotype. TF: transcription factor; TFbs: transcription factor binding site]

Graph-based Learning
• Recently, to integrate multiple data sources, a semidefinite programming (SDP) based SVM method was introduced
• In SDP/SVM, multiple kernel matrices, one for each data source, are combined
• However, when trying to apply SDP/SVM to large problems, the computational cost can become prohibitive, since both converting the data to a kernel matrix for the SVM and solving the SDP are time and memory demanding

Graph-based Learning
• Graph-based semi-supervised learning methods have made significant progress in the machine learning community
• One important problem in graph-based learning, which has not yet been addressed, is the combination of multiple graphs
• Each vectorial data set can be incorporated after conversion into a network
• Due to the sparsity of network edges, the computation time is nearly linear in the number of edges of the combined network

Graph-based integration
[Figure: expression, miRNA, Methylation, CNV]

Databasing TCGA data

Data release
• Data Levels I and II correspond to raw and processed data, respectively, for each sample
• Level III data are the output of basic analyses of Level I/II data, such as mutational calls of sequenced genes, copy number and LOH calls of genomic regions of aberrations, and expression level of a gene for each sample
• Level IV data represent interpretations of the data, such as what genes are significantly mutated, or
altered in copy number, DNA methylation, or expression across multiple samples and data types
• For protection of patient privacy, access to Level I and/or II data for certain platforms (e.g. SNP genotyping) or data types (e.g. germ-line mutations) is restricted to qualified researchers and requires approval of a TCGA Data Access Committee

Download directory structure and URL construction

Retrieving available TCGA Data: Done
• Cancer type: GBM
• Time: about 10 days
• Size: about 230 GB

Databasing TCGA data
Python scripts for inserting data into database

Databasing annotation files from multiple types of platforms
• Multiple types of Annotation Files -> Database
  – Annotation (each platform)
  – ADF files (same genome build)

Insert new platforms and Experiment data
• Column-wise queries, row-wise queries
• Theoretically, queries are possible
  – Select all data with level 3 where gene symbol is ‘ERBB2’
  – Select expression, CGH, methylation with level 2 where chromosome position is ‘17q36.1’

Statistics of data
• expression: P: 22277, S: 273
• miRNA: P: 1510, S: 276
• Methylation: P: 1498, S: 238
• CNV: P: 235829, S: 441
• Overlap samples with tumor type = ‘solid tumor’
  – 195 samples

Graph-based Semi-Supervised Learning with Gene Expression Data

Gene expression data
• Data reduction
  – expression: P: 22277, S: 273
  – expression with output variable: P: 22277, S: 258
  – expression after gene summarization: P: 12043, S: 258
• Class (Procedure_Type from Phenotype)
  – 1: Surgical Resection
  – -1: Secondary Surgery for tumor recurrence/progression: locoregional procedure

Data plot
• http://xperanto.snubi.org/php/display.php?mode=S_Experiment&&S_Experiment_ID=416

Graph-based SSL
• Without feature selection: (258 x 12043)
• W matrix: K-NN + exp-weighted graphs
  – K = 5
• SSL
  – Mu = 1
• 5-fold cross validation
• ROC score: 0.5264

Feature Selection
• Identify differentially expressed genes between the two phenotypes
  – T-test
  – Using mattest in MATLAB:
    • http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref/mattest.html
  – p_value < 0.05: 768
  – p_value < 0.01: 181
  – p_value < 0.001: 23

Graph-based SSL
• With feature selection (p_value < 0.05)
  – 258 x 768
• W matrix: K-NN + exp-weighted graphs
  – K = 20
• SSL
  – Mu = 10
• 5-fold cross validation
• ROC score: 0.7077

Future work
• Systematically control parameters
• Other methods for making W matrix
  – Correlation
  – Tanh-weighted graphs
  – Any good method with large-scale features?
• Experiment with other data types
  – miRNA
  – Methylation
  – CNV
• Combine multiple types of genomics data
  – ROC score improved?
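
Sketch: graph-based SSL
The graph-based SSL setup reported above (W matrix from a K-NN graph with exp-weighted edges, parameter Mu) could look roughly like the following. This is a minimal sketch under stated assumptions, not the code behind the reported ROC scores: the slides do not give the kernel or the objective, so the Gaussian-style weight exp(-d^2 / sigma^2), the median-based default for sigma, and the closed-form solution f = (I + Mu * L)^-1 y with graph Laplacian L = D - W are all assumed here. The names knn_exp_graph and ssl_scores are hypothetical, and the data are a toy example rather than TCGA expression values.

```python
import numpy as np


def knn_exp_graph(X, k=5, sigma=None):
    """Build a symmetric k-NN graph with weights exp(-d^2 / sigma^2).

    X is an (n_samples, n_features) array. The kernel form and the
    default sigma are assumptions, not taken from the slides.
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped at 0 for round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(d2, np.inf)  # a sample is never its own neighbour
    if sigma is None:
        # Assumed heuristic: sigma^2 = median off-diagonal squared distance.
        sigma = np.sqrt(np.median(d2[np.isfinite(d2)]))
    nn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest samples
    rows = np.repeat(np.arange(n), k)
    cols = nn.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-d2[rows, cols] / sigma ** 2)
    return np.maximum(W, W.T)  # keep an edge if either end selected it


def ssl_scores(W, y, mu=1.0):
    """Solve (I + mu * L) f = y with L = D - W (assumed objective)."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(np.eye(len(y)) + mu * L, y)


# Toy data: two well-separated groups of 10 samples on a line.
X = np.concatenate([np.arange(10.0), 100.0 + np.arange(10.0)])[:, None]
y = np.zeros(20)
y[0], y[10] = 1.0, -1.0  # one labelled sample per group, rest unlabelled

W = knn_exp_graph(X, k=2)
scores = ssl_scores(W, y, mu=1.0)
print(np.sign(scores))
```

For a cross-validation run like the one in the slides, the labels of the held-out fold would be set to 0 before solving, and the resulting scores compared against the true classes to obtain a ROC score. The dense solve is only practical for small sample counts such as the 258 samples here; a sparse solver would be the natural choice for larger graphs.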