Development of pathway analysis tools
Institute of Medical Science, Hanyang University / Korea Institute of Ocean Science and Technology
Kiejung Park
2014. 06. 12.

HiFI (Human integrated Functional Interaction)
YEONGJUN JANG, Kiejung Park
[email protected], [email protected]

Background
• High-throughput genomic experiments, including association studies, examinations of sequence mutations/copy number variants, and expression experiments, typically generate multiple candidate genes that are involved in cancer-causing cellular processes
• However, these data sets are noisy and contain false positives
• How can we extract true positive candidate genes and reveal functional relationships among these genes with confidence, for use in further experimental analysis?

Methods to distinguish drivers from passengers
• Examine the rate of synonymous versus non-synonymous mutations
• Predict the functional consequence of mutations
• Assess the overall rate of recurrence, based on combined rates of sequence mutation and copy number alteration
• Identify cancer drivers by identifying an enrichment of rare cancer mutations within network modules

Pathway-driven approach
• It marks the genes associated with the disease or other phenotype
• It separates them (drivers) from innocent bystanders (passengers) caught in the general instability of the malignant genome, and from other false positive hits
• It identifies and extends the biological pathways affected by the genes

HiFI
• Functional interaction network built on pathway context
• Yu N, Seo J, Rho K, Jang Y, Park J, Kim WK, Lee S. (2012) hiPathDB: a human-integrated pathway database with facile visualization. Nucleic Acids Res. 40(Database issue), D797-802

Naive Bayes Classifier
• The naive Bayes classifier selects the most likely classification v given the attribute values a_1, a_2, ..., a_n:

$$ v_{NB} = \arg\max_{v \in V} P(v) \prod_i P(a_i \mid v) $$

• We generally estimate P(a_i | v) using m-estimates:

$$ P(a_i \mid v) = \frac{n_c + m\,p}{n + m} $$

  – n = the number of training examples with class v
  – n_c = the number of those examples in which a_i also occurs
  – p = a priori estimate for P(a_i | v)
  – m = the equivalent sample size

Naive Bayes Classifier: Training

No.  Interact?  EXP1  EXP2  EXP3  EXP4
1    Yes        Yes   No    Yes   No
2    Yes        No    Yes   No    No
3    Yes        No    Yes   No    No
4    No         Yes   Yes   No    Yes
5    No         Yes   No    Yes   No
6    No         No    Yes   Yes   No
?    ?          Yes   No    No    Yes

• Calculate P(EXP1=Yes | Interact?=Yes) using p = 0.5, n = 3, n_c = 1 and m = 3:

$$ P(\mathrm{EXP1{=}Yes} \mid \mathrm{Interact?{=}Yes}) = \frac{1 + 3 \times 0.5}{3 + 3} = \frac{2.5}{6} \approx 0.417 $$

• For v = Yes: P(a_i=Yes | v=Yes) = π_i and P(a_i=No | v=Yes) = 1 − π_i, for each attribute a_1, a_2, ..., a_i
• For v = No: P(a_i=Yes | v=No) = π_i and P(a_i=No | v=No) = 1 − π_i, with π_i estimated separately for each class
• Calculate v using the prior probabilities P(v=Yes) = P(v=No) = 0.5

Avoid violations of the strong independence assumption
• One requirement for a successful NBC is that the features used in the classifier be independent
• Human PPIs and gene co-expression data sets were generated experimentally
• Many human protein interactions in interaction databases, including IntAct, BioGrid, and HPRD, are not generated experimentally but are human curated from the literature
• Many of the GO annotations and domain interactions are predicted based on sequence similarities among proteins in different species
• Hence, there is a potential dependency among these data types, since they all rely on the same phylogenetic trees

Experimental methods as new features for NBC
• PSI-MI Ontology MI:0045 (experimental interaction detection)
• 125 experiment types
  – Biochemical
  – Biophysical
  – Genetic interference
  – Imaging technique
  – Phenotype-based detection assay
  – Post transcriptional interference
  – Protein complementation assay

Random Naïve Bayes
[Figure: each classifier in the ensemble outputs a label]
• Generalization of Random Forest to Naïve Bayes (Naïve Bayes + Bagging)

Random Naïve Bayes
• Conditional independence test: mutual information
• Implementation:
  – Used a stratified approach: 24 classifiers at feature size m = 5
    [Figure: feature strata and the number of features in each: Human PPIs (3), Human PPIs (2), Human PPIs (2), Gene co-expressions: microarray (1), Gene co-expressions: RNA-seq (2); choosing one feature per stratum would give 3 × 2 × 2 × 1 × 2 = 24 combinations]
  – Used majority voting with equal weights: if more than 12 classifiers agree, a functional interaction is called

PPI Data Sources
• HIPPIE: Integrating Protein Interaction Networks with Experiment Based Quality S…
  Schaefer MH, et al. PLoS ONE 7(2): e31826. (2012)
• PSICQUIC: a community standard for programmatic access to molecular interactions
  Bruno Aranda, et al. Nature Methods 8, 528–529 (2011) http://psicquic.googlecode.com
• IntAct: open source resource for molecular interaction data
  Kerrien S, et al. NAR, Database issue, D561-5 (2007) http://www.ebi.ac.uk/intact
• EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in …
  Vilella AJ, et al. Genome Res, 327-35 (2009) http://ensembl.org/info/docs/compara/index.html

Gene Co-expressions: COXPRESdb v5.0
• Uses microarray data downloaded from ArrayExpress
• Normalize raw data using the RMA method
• Calculate the PCC for every pair of genes
• Calculate the MR (Mutual Rank) from the PCC

Species       Platform (GeneChip)  # of genes  # of Microarrays  Release Date
Homo sapiens  HG-U133_Plus_2       19,803      73,083            2012.08.29

• COXPRESdb: a database of comparative gene coexpression networks of eleven species for mammals. Obayashi T, et al. (2013) Nucleic Acids Res. 41, D1014-20

Gene Co-expressions: RNA-seq
• Hong SJ, et al. Canonical Correlation Analysis for RNA-seq Co-expression Networks. Nucleic Acids Research. 2013.
• Traditional statistical methods designed for microarray data do not use all of the information contained in RNA-seq data, such as expression at the exon, single-nucleotide polymorphism (SNP) and positional level; splicing; post-transcriptional RNA editing across the entire gene; and isoform- and allele-specific expression
• R package for canonical-correlation-based RNA-seq co-expression networks
• TCGA LUSC RNA-seq dataset

Interaction Data Sources

Training and validation of the functional interaction classifier
• Build training and test FI sets from hiPathDB
• What is a functional interaction (FI)? Two proteins are involved in the same biochemical reaction as an input, catalyst, activator, or inhibitor, or as two members of the same protein complex
• 10-fold cross-validation

10-fold cross-validation

Method                        Positive/Negative Training Data Ratio  Link Threshold  Accuracy
Naïve Bayes classifier        10:1                                   0.5             92.15%
Naïve Bayes classifier        100:1                                  0.5             97.72%
Random naïve Bayes            10:1                                   0.5             96.47%
Random naïve Bayes            100:1                                  0.5             99.99%

Results
Number of interactions (probability cut-off: 0.5)

HiFI         183716
Reactome FI  169988

Sharing rate of GO annotations

             BP     CC     MF
HiFI         0.815  0.904  0.753
Reactome FI  -      0.962  -

HiFI Web Interface

Case study
• Point mutations and CNV genes: TCGA glioblastoma multiforme (GBM) data set
• Extract a subnetwork around these genes in HiFI
• Find modules by applying a clustering algorithm to the subnetwork
• Evaluate the statistical significance of the modularity
• Test whether mutated genes in modules are significantly distributed over multiple samples
• Annotate modules using pathways and GO terms via over-representation analysis

Glioblastoma multiforme (GBM)
• The most common and aggressive brain tumor in humans
• The first cancer type to undergo comprehensive genomic characterization by The Cancer Genome Atlas (TCGA) project
• The TCGA GBM project has cataloged somatic mutations and recurrent copy number alterations in GBM, and has identified frequent alterations in the p53, RB, PI3-kinase (PI3K) and receptor tyrosine kinase (RTK) signaling pathways
• Goal: identify frequently altered network modules and candidate driver mutations in GBM

GBM data set
• Cerami E, et al., “Automated Network Analysis Identifies Core Pathways in Glioblastoma”, PLoS One. 2010 Feb 12;5(2):e8918
• 84 GBM cases with both sequence mutation and copy number data
• Each gene was considered altered if modified by a validated non-synonymous somatic nucleotide substitution, a homozygous deletion or a multi-copy amplification
• Only genes that were altered in two or more of the final 84 cases were kept
• 517 altered genes, of which 259 had interactions in HiFI

Extract GBM-specific functional network
• For each pair of the 259 genes, found all shortest paths in HiFI (threshold = 2)
• To identify statistically significant linker genes, we used the hypergeometric distribution to assess the probability that the linker gene would connect to the altered genes
• FDR correction via Benjamini-Hochberg, p-value threshold = 0.05
• Result: 96 GBM altered genes + 6 linker genes
• Captures the majority of proteins and interactions in a human-curated map of the molecular pathways involved in GBM: 96% of proteins (70 of 73) and 69% of interactions (129 of 187)

Network clustering
• MCL algorithm: van Dongen S. Graph Clustering by Flow Simulation. PhD thesis. University of Utrecht; 2000.
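The MCL step cited above can be illustrated with a minimal pure-Python sketch of Markov clustering in its textbook form: alternate expansion (squaring the flow matrix) and inflation (element-wise power followed by column normalization) until the matrix stops changing. This is not the implementation used for the GBM network; the function names, the inflation value of 2.0, the tolerance, and the toy graph are all assumptions made for illustration.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph given as a 0/1 adjacency matrix.

    Hypothetical helper: parameter defaults are illustrative assumptions.
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(adjacency)
    # Add self-loops (customary for MCL) so that no column is empty.
    flow = normalize_columns(
        [[1.0 if i == j else float(adjacency[i][j]) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        # Expansion: two-step random walk (matrix squared).
        expanded = [[sum(flow[i][k] * flow[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - flow[i][j])
                     for i in range(n) for j in range(n))
        flow = inflated
        if change < tol:
            break
    # Rows with a positive diagonal are attractors; the nodes whose
    # columns send flow to an attractor form that attractor's cluster.
    clusters = set()
    for i in range(n):
        if flow[i][i] > 1e-6:
            clusters.add(frozenset(j for j in range(n) if flow[i][j] > 1e-6))
    return sorted(sorted(cluster) for cluster in clusters)


# Toy graph (assumption): two disconnected triangles.
triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(triangles))  # [[0, 1, 2], [3, 4, 5]]
```

A higher inflation value generally yields more, smaller clusters, so any real analysis would need to choose it for the network at hand.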
• Calculate the overall modularity of the partitioned GBM network
• Result: a total of 10 modules, with an overall network modularity of 0.519
• The statistical significance of the observed network modularity is assessed in relation to a null model of random networks of the same size and same degree distribution

Functional annotation
• p53 tumor suppressor pathway: prevents the propagation of unstable genomes and is frequently altered in glioblastoma
• RB pathway: glioblastomas also nearly universally circumvent cell cycle inhibition through genetic alterations to the RB pathway
• Phosphatidylinositol 3-kinase-AKT (PI3K/AKT) pathway: major downstream effects of PI3K/AKT activation include cell growth, proliferation and survival

Summary
• The network-based approach of HiFI identifies many of the same candidates as the original frequency-based approach used to assess mutational significance
• Furthermore, this approach can automatically identify and extract biologically relevant GBM modules, which correspond closely to previously known GBM biology
http://hifi.kobic.re.kr
[email protected]

PAMES (PAthway Mapping & Editing System)

Example of a KEGG pathway

Extracting KEGG pathway information
• Entry: description of each object
  – Map, enzyme, compound, gene, etc.
  – Check the object type, size, and position information
• Reaction: substrate and product of each enzyme reaction
  – Edge information on the map
• Relation: extract connection information such as map/link

<pathway name="path:aae00010" org="aae" number="00010" title="Glycolysis / Gluconeogenesis" image="http://www.genome.ad.jp/kegg/pathway/aae/aae00010.gif" link="http://www.genome.ad.jp/dbget-bin/show_pathway?aae00010">
<entry id="1" name="aae:aq_186" type="product" reaction="rn:R00710" link="http://www.genome.ad.jp/dbget-bin/www_bget?aae+aq_186">
<graphics name="aldH1" fgcolor="#000000" bgcolor="#BFFFBF" type="rectangle" x="170" y="1018" width="45" height="17" />
</entry>
<entry id="2" name="aae:aq_2103 aae:aq_2104" type="product" reaction="rn:R00235" link="http://www.genome.ad.jp/dbget-bin/www_bget?aae+aq_2103+aq_2104">
<graphics name="acs'..." fgcolor="#000000" bgcolor="#BFFFBF" type="rectangle" x="102" y="916" width="46" height="17" />
</entry>

Extracting entry positions

Edge generation and semi-auto layout

Editing pathway
• Object: Gene, Compound, Map
  – Move, Resize
  – Delete
  – Create: using the ToolBox
• Path: edges
  – Move, Resize
  – Delete
  – Create: using the ToolBox
  – Use pivot points
  – Add and remove arrows
  – Rotation

Image insertion and editing
• Images (jpg, jpeg) can be added
• Image editing
  – Move, resize
• The images needed for each map are prepared by hand

Label editing

Pathway Editing
KEGG reference map – creating PAT files
• 178 XML files
  – Reference maps
• Extracting the required images
  – Crop and erase the required images from the KEGG image files
• Manual editing after semi-auto layout
  – Fix edges where edge collisions could not be avoided
  – Fix labels
  – Fix overlapping edges
• Save to the local pathway DB
  – Saved as 178 maps

Pathway Mapping
• Mapping genes
• Applying a microarray profile => RNA-seq processing
• Down-regulated genes
• Up-regulated genes
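
Returning to the KEGG information extraction step of PAMES: reading each entry's object type and position from the pathway XML can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration, not the PAMES code. The sample string is a trimmed copy of the XML excerpt shown in the slides, with a closing `</pathway>` tag added so that the fragment parses; the function name `extract_entry_positions` and the output field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide excerpt; the closing </pathway> tag is an
# assumption added so the fragment is well-formed.
KGML_SAMPLE = """<pathway name="path:aae00010" org="aae" number="00010" title="Glycolysis / Gluconeogenesis">
  <entry id="1" name="aae:aq_186" type="product" reaction="rn:R00710">
    <graphics name="aldH1" type="rectangle" x="170" y="1018" width="45" height="17" />
  </entry>
  <entry id="2" name="aae:aq_2103 aae:aq_2104" type="product" reaction="rn:R00235">
    <graphics name="acs'..." type="rectangle" x="102" y="916" width="46" height="17" />
  </entry>
</pathway>"""


def extract_entry_positions(kgml_text):
    """Return one dict per <entry> with its type and graphics geometry."""
    root = ET.fromstring(kgml_text)
    entries = []
    for entry in root.iter("entry"):
        graphics = entry.find("graphics")
        if graphics is None:
            continue  # entries without drawing information are skipped
        entries.append({
            "id": entry.get("id"),
            "genes": entry.get("name", "").split(),  # space-separated IDs
            "type": entry.get("type"),
            "reaction": entry.get("reaction"),
            "label": graphics.get("name"),
            "shape": graphics.get("type"),
            "x": int(graphics.get("x")),
            "y": int(graphics.get("y")),
            "width": int(graphics.get("width")),
            "height": int(graphics.get("height")),
        })
    return entries


entries = extract_entry_positions(KGML_SAMPLE)
for item in entries:
    print(item["id"], item["label"], item["x"], item["y"])
```

The resulting positions and sizes are the kind of information a layout step could use to place objects before edges are generated and edited.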