A graph-based integration of multiple
layers of cancer genomics data
(Progress Report)
2010.03.12
Do Kyoon Kim
Outline
• Introduction
• Databasing TCGA Data
• Graph-based Semi-Supervised Learning with Gene Expression Data
Introduction
• Although microarray technology allows the investigation of the
transcriptomic make-up of a tumor
– the transcriptome does not completely reflect the underlying biology because of
alternative splicing and post-translational modification
• This increases the importance of integrating more than one source of
genome-wide data, such as the genome, transcriptome, proteome,
and epigenome
• The current increase in the amount of available omics data
emphasizes the need for a methodological integration framework
• Data integration: different points of view
– Heterogeneous data from different sources are analyzed sequentially
– The term data integration has also been used as a synonym for data
merging, in which different data sets are concatenated at the database
level by cross-referencing the identifiers
– Multiple layers of experimental data are integrated into one mathematical
model for the development of more homogeneous classifiers in clinical
decision support
Daemen et al., 2009, Genome Medicine
The Cancer Genome Atlas (TCGA)
• Mission
– The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to
accelerate our understanding of the molecular basis of cancer through the
application of genome analysis technologies, including large-scale sequencing
• Goal
– To improve our ability to diagnose, treat and prevent cancer
– A pilot project developed and tested the research framework needed to
systematically explore the entire spectrum of genomic changes involved in human
cancer
• Focus on three selected cancer types
– Serous cystadenocarcinoma (ovarian)
– Squamous carcinoma (lung)
– Glioblastoma multiforme (brain)
• 500 samples per tumor type
TCGA data
How to integrate?
TCGA Research Network, 2008, Nature
The second page of TCGA project
Specific Goal
• Problem: Prediction of recurrence in GBM patients using multiple types of
genomic data
PHENOTYPE
SNP
SEQUENCE
EXPRESSION
METHYLATION
COPY NUMBER
miRNA
Biological Organization
[Figure: flow from Gene to Phenotype. Gene level: TF binding at TFbs, SNP, methylation, CNV/LOH/Del. TRANSCRIPTION (alternative splicing) gives mRNA (EXPRESSION), with microRNA acting on mRNA. TRANSLATION (post modification: glucosylation, phosphorylation) gives Protein and TF Protein. FUNCTION leads to Phenotype.]
TF: transcription factor
TFbs: transcription factor binding site
Graph-based Learning
• Recently, to integrate multiple data sources, a semidefinite programming
(SDP) based SVM method was introduced
• In SDP/SVM, multiple kernel matrices, one per data source, are combined
• However, when trying to apply SDP/SVM to large problems, the
computational cost can become prohibitive, since both converting the data to
a kernel matrix for SVM and solving the SDP are time and memory
demanding
• Graph-based semi-supervised learning methods have made significant progress
in the machine learning community
• One important problem in graph-based learning, which has not yet been
addressed, is the combination of multiple graphs
• Each vectorial data set can be incorporated after conversion into a network
• Due to the sparsity of network edges, the computation time is nearly linear in
the number of edges of the combined network
Graph-based integration
expression
miRNA
Methylation
CNV
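A minimal sketch of how several sparse graphs might be merged into one combined network. The slides do not say how the graphs are weighted, so the weighted sum, the `combine_graphs` name, and the example weights are assumptions made for illustration only. Each graph is stored as a dictionary of edges, so the work grows with the number of edges rather than with the square of the number of samples.

```python
def combine_graphs(graphs, weights):
    """Weighted sum of undirected sparse graphs.

    graphs  : list of dicts mapping (i, j) sample-index pairs to edge weights
    weights : one non-negative coefficient per graph (assumed, not from the slides)
    """
    combined = {}
    for graph, w in zip(graphs, weights):
        for (i, j), value in graph.items():
            key = (min(i, j), max(i, j))  # treat (i, j) and (j, i) as the same edge
            combined[key] = combined.get(key, 0.0) + w * value
    return combined


# Toy graphs over four samples; the numbers are made up.
expression_graph = {(0, 1): 1.0, (1, 2): 0.5}
mirna_graph = {(1, 0): 0.4, (2, 3): 0.8}
combined = combine_graphs([expression_graph, mirna_graph], [0.5, 0.5])
```

The loop touches every edge once, which is the sense in which the cost is close to linear in the number of edges.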
Databasing TCGA data
Data release
• Data Levels I and II correspond to raw and processed data, respectively, for
each sample
• Level III data are the output of basic analyses of Level I/II data, such as
mutational calls of sequenced genes, copy number and LOH calls of genomic
regions of aberrations, and expression level of a gene for each sample
• Level IV data represent interpretations of the data, such as what genes are
significantly mutated, or altered in copy number, DNA methylation, or
expression across multiple samples and data types
• For protection of patient privacy, access to Level I and/or II data for certain
platforms (e.g. SNP genotyping) or data types (e.g. germ-line mutations) is
restricted to qualified researchers and requires approval of a TCGA Data
Access Committee
Download directory structure and URL construction
Retrieving available TCGA Data: Done
• Cancer type: GBM
• Time: about 10 days
• Size: About 230 GB
Databasing TCGA data
Python scripts for inserting data into database
Databasing annotation files from multiple types of platforms
• Multiple types of Annotation Files -> Database
– Annotation (each platform)
– ADF files (same genome build)
Insert new platforms and Experiment data
• Column-wise queries
• Row-wise queries
• In principle, queries such as the following are possible
– Select all data with level 3 where gene symbol is ‘ERBB2’
– Select expression, CGH, methylation with level 2 where chromosome position is ‘17q36.1’
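The two example queries can be sketched with Python's built-in sqlite3 module. The slides do not show the actual schema, so the `annotation` and `measurement` tables, their columns, the probe and sample identifiers, and all values below are hypothetical and serve only to make the queries concrete.

```python
import sqlite3

# Hypothetical two-table layout: probe annotation plus per-sample measurements.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    probe_id    TEXT PRIMARY KEY,
    platform    TEXT,
    gene_symbol TEXT,
    cytoband    TEXT
);
CREATE TABLE measurement (
    probe_id  TEXT REFERENCES annotation(probe_id),
    sample_id TEXT,
    data_type TEXT,     -- e.g. expression, CGH, methylation
    level     INTEGER,  -- TCGA data level
    value     REAL
);
""")

# Made-up rows for illustration.
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", [
    ("p1", "expr_array", "ERBB2", "17q12"),
    ("p2", "cgh_array", "ERBB2", "17q12"),
    ("p3", "expr_array", "TP53", "17p13.1"),
])
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?, ?)", [
    ("p1", "s1", "expression", 3, 2.5),
    ("p2", "s1", "CGH", 3, 0.8),
    ("p2", "s1", "CGH", 2, 0.7),
    ("p3", "s1", "expression", 3, -1.1),
])


def by_gene(conn, symbol, level):
    """All measurements at one data level for one gene symbol."""
    return conn.execute(
        "SELECT m.sample_id, m.data_type, m.value "
        "FROM measurement m JOIN annotation a ON a.probe_id = m.probe_id "
        "WHERE a.gene_symbol = ? AND m.level = ? "
        "ORDER BY m.data_type",
        (symbol, level),
    ).fetchall()


def by_region(conn, cytoband, level, data_types):
    """Measurements of the given data types at one level for one cytoband."""
    marks = ", ".join("?" for _ in data_types)
    return conn.execute(
        "SELECT m.sample_id, m.data_type, m.value "
        "FROM measurement m JOIN annotation a ON a.probe_id = m.probe_id "
        "WHERE a.cytoband = ? AND m.level = ? "
        f"AND m.data_type IN ({marks}) ORDER BY m.data_type",
        (cytoband, level, *data_types),
    ).fetchall()


erbb2_level3 = by_gene(conn, "ERBB2", 3)
region_level2 = by_region(conn, "17q12", 2, ["expression", "CGH", "methylation"])
```

Keeping annotation separate from measurements lets a single gene symbol or cytoband select rows across platforms, which is what both example queries rely on.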
Statistics of data
• expression: P: 22277, S: 273
• miRNA, Methylation: P: 1510, S: 276; P: 1498, S: 238
• CNV: P: 235829, S: 441
• Overlap samples with tumor type = ‘solid tumor’
– 195 samples
Graph-based Semi-Supervised Learning with Gene Expression Data
Gene expression data
• Data reduction
– expression: P: 22277, S: 273
– with output variable: P: 22277, S: 258
– Gene summarization: P: 12043, S: 258
• Class (Procedure_Type from Phenotype)
– 1: Surgical Resection
– -1: Secondary Surgery for tumor recurrence/progression: locoregional procedure
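A minimal sketch of the two reduction steps: keeping only samples that have an output variable, then collapsing probes to genes. The slides do not say which summarization rule was used; averaging all probes mapped to the same gene is an assumption here, and the function names, probe names, and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def drop_unlabelled(samples, labels):
    """Keep only samples that have a class label (the output variable)."""
    return {s: values for s, values in samples.items() if s in labels}


def summarize_by_gene(probe_values, probe_to_gene):
    """Average all probes that map to the same gene (assumed rule)."""
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # probes without a gene annotation are dropped
            by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


# Made-up data: three probes, two samples, one of which has no label.
probe_to_gene = {"probe_a": "ERBB2", "probe_b": "ERBB2", "probe_c": "TP53"}
samples = {
    "s1": {"probe_a": 2.0, "probe_b": 4.0, "probe_c": 1.0},
    "s2": {"probe_a": 1.0, "probe_b": 1.0, "probe_c": 0.5},
}
labels = {"s1": 1}  # 1: surgical resection, -1: secondary surgery

reduced = {
    s: summarize_by_gene(values, probe_to_gene)
    for s, values in drop_unlabelled(samples, labels).items()
}
```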
Data plot
• http://xperanto.snubi.org/php/display.php?mode=S_Experiment&&S_Experiment_ID=416
Graph-based SSL
• Without Feature selection: (258 x 12043)
• W matrix: K-NN + exp-weighted graphs
– K=5
• SSL
– Mu = 1
• 5-fold cross validation
• ROC score: 0.5264
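A minimal sketch of graph-based SSL with a K-NN, exp-weighted graph. The slides give only K and Mu, so the rest is assumed: the score vector f is taken to minimise ||f - y||^2 + Mu * f' L f with the unnormalised Laplacian L = D - W, which gives the linear system (I + Mu * L) f = y. The kernel width `sigma` and all function names are hypothetical. The dense solver is for illustration only; a sparse solver would be the natural choice at real scale.

```python
import math


def knn_exp_graph(X, k, sigma=1.0):
    """Symmetric k-NN graph with weights exp(-squared distance / sigma^2)."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        d2 = [sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: d2[j])[:k]
        for j in nearest:
            # An edge is kept if either endpoint selects the other.
            W[i][j] = W[j][i] = math.exp(-d2[j] / sigma ** 2)
    return W


def solve(A, b):
    """Gaussian elimination with partial pivoting (dense, illustration only)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for cc in range(c, n + 1):
                M[r][cc] -= factor * M[c][cc]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][cc] * x[cc] for cc in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ssl_scores(W, y, mu):
    """Solve (I + mu * L) f = y, with y = +1/-1 for labelled and 0 for unlabelled."""
    n = len(y)
    A = [[(1.0 if i == j else 0.0)
          + mu * ((sum(W[i]) if i == j else 0.0) - W[i][j])
          for j in range(n)] for i in range(n)]
    return solve(A, [float(v) for v in y])


# Toy data: two well-separated groups, one labelled sample in each.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
y = [1, 0, 0, -1, 0, 0]
f = ssl_scores(knn_exp_graph(X, k=2), y, mu=1.0)
```

On the toy data the unlabelled samples take the sign of the labelled sample in their own group, which is the behaviour the graph is meant to produce.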
Feature Selection
• Identify differentially expressed genes between the two phenotypes
– T-test
– Using mattest in MATLAB:
• http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref/mattest.html
– p_value < 0.05: 768
– p_value < 0.01: 181
– p_value < 0.001: 23
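The selection step can be sketched in Python rather than MATLAB. This is not a reimplementation of mattest: it assumes a pooled-variance two-sample t statistic and, to stay within the standard library, replaces the t distribution with a normal approximation for the two-sided p-value. That approximation is reasonable at a few hundred samples but not for small groups like the toy data below. All names and values are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def approx_p_value(t):
    """Two-sided p-value using a normal approximation to the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


def select_genes(X, y, alpha):
    """Return the column indices of X (samples x genes) with p-value below alpha."""
    selected = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 1]
        b = [row[g] for row, label in zip(X, y) if label == -1]
        if approx_p_value(t_statistic(a, b)) < alpha:
            selected.append(g)
    return selected


# Made-up data: gene 0 differs between the classes, gene 1 does not.
X = [[5.0, 1.0], [5.2, 1.3], [4.9, 0.8], [5.1, 1.1],
     [1.0, 1.2], [1.1, 0.9], [0.9, 1.0], [1.2, 1.1]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
kept = select_genes(X, y, alpha=0.05)
```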
Graph-based SSL
• With Feature selection (p_value < 0.05)
– 258 x 768
• W matrix: K-NN + exp-weighted graphs
– K = 20
• SSL
– Mu = 10
• 5-fold cross validation
• ROC score: 0.7077
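A minimal sketch of how a ROC score and a 5-fold split might be computed. The slides do not describe the evaluation code, so this is an assumption: the ROC score is taken to be the area under the ROC curve, computed as the fraction of (positive, negative) pairs that the scores rank correctly, and the folds are a seeded random partition. Function names are hypothetical.

```python
import random


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise ranking; ties count as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


folds = k_fold_indices(258, 5)
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, -1, -1])
partial = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, -1, -1])
```

In a semi-supervised setting, one plausible protocol is to set the labels of the held-out fold to 0 before propagation and then compare the resulting scores for that fold with the hidden labels; whether the reported scores were obtained this way is not stated in the slides.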
Future work
• Systematically control parameters
• Other methods for making W matrix
– Correlation
– Tanh-weighted graphs
– Any good method with large-scale features?
• Experiment with other data types
– miRNA
– Methylation
– CNV
• Combine multiple types of genomics data
– ROC score improved?