Development and Implementation of Tools, Software and Databases for Transcriptome Characterization
FIRB-LIBI
CBM - Trieste

FROM CLUSTERS TO MODULES TO NETWORKS
GENECLUSTERS / TFBS-ANALYSIS

Scientific Activities
1. Population and use of MATS (MicroArray TrieSte database) for the recognition of gene modules among clusters and of co-regulated vs. co-expressed genes (Schneider Unit/LNCIB-Ts)
2. Search for transcription factor binding sites (TFBS) in the genomic neighbourhood of genes of interest (Policriti/UniUD + Schneider Units)
3. Construction of DB-ncRNA, a database of non-coding RNAs extracted from genome analysis and catalogued in a project underway for the identification of the corresponding cDNAs (Fabris/UniTS + Policriti Units)
4. Study of the network architecture of transcriptomic data and of its relation to the regulatory networks (Pongor/ICGEB-Ts Unit)

Ovarian cancer
• The fifth most common malignancy
• The leading cause of death among gynecologic malignancies
• Late diagnosis (70% in FIGO stage III)
• High short-term curability (surgery + first-line chemotherapy)
• Poor long-term curability (intrinsic/acquired chemoresistance)

Scientific Activity 1 - Implementation: Aims of the Study
• Combine data from different tumor collections in order to increase the number of patients.
• Compare resistant vs. sensitive patients using supervised approaches.
• Optimize existing tools to extract the functional themes (MODULES) underlying biological/clinical properties within the analyzed tumors.
• Confirm the obtained gene lists on independent collections and extract the corresponding lists of co-regulated genes as in Activity 2.
• Proof of principle: in vitro models to recapitulate in vivo findings.

DATASET MERGING
Each matrix was normalized and centered separately before merging on IMAGE identifiers.
• Chieti: 30 tumors / ~13000 genes
• INT: 43 tumors / ~4500 genes
• Combined dataset: 73 tumors / 2864 genes
Class comparison using the dataset of origin as the test variable yielded no significant genes, so the merged dataset can be used for further analysis.

Scientific Activity 1 - Implementation: Study Design
• Data processing: normalization and standardization of gene expression data (gene filtering, print-tip loess, "library" loess, scaling, variance stabilization within and between datasets)
• Dataset merging: using IMAGE clone IDs we could compare the INT data and the Chieti data (73 advanced tumors, about 2600 genes)
• SAM (Significance Analysis of Microarrays): selection of genes associated with the investigated phenotype
• SVM (support vector machine): evaluation of the classification performance of the selected genes
• EASE annotation survey: statistical evaluation of the functional themes associated with the gene list retrieved by SAM

Scientific Activity 1 - Results
Identification of a significant variable that affects patient survival: chemotherapy response

Merged Dataset Analysis summary
• T-test to extract genes associated with chemotherapy resistance (p-value < 0.01, Bonferroni correction)
• SVM with cross-validation to assess classification performance
• T-test to expand the gene list associated with chemotherapy resistance (p-value < 0.025)
• Annotation survey with EASE for GO terms (p-value < 0.05, usually < 0.01)
• Functional classes of genes associated with chemotherapy resistance

T-Test and SVM results
• T-test results: 43 genes (Bonferroni-corrected p-value < 0.01) are associated with first-line chemotherapy resistance; 27 genes are up-regulated in sensitive patients and 16 genes are up-regulated in resistant patients.
• SVM results: 63 of 73 patients (~86.3%) were correctly classified by the SVM (leave-one-out cross-validation).

Scientific Activity 1 - Results: EASE Results
• Small gene list: no statistically relevant results were obtained (as expected)
• Expanded gene list: 121 genes (t-test p-value < 0.025)
• Large gene list, up-regulated in resistant patients:
  • ECM remodeling
  • Transcription factor activity
  • Proteoglycan
• Large gene list, down-regulated in resistant patients:
  • M phase and chromatin complex remodeling
  • Acetylation
  • Ligase activity

Scientific Activity 1 - Results: T-Test and GO analysis

GO term                                       p-value
chromatin remodeling complex                  1.46E-03
M phase                                       2.38E-03
regulation of mitosis                         2.49E-03
mitotic cell cycle                            4.47E-03
mitosis                                       6.30E-03

GO term                                       p-value
extracellular matrix structural constituent   1.44E-05
extracellular matrix                          3.05E-05
perception of sound                           3.85E-04
hearing                                       3.85E-04
structural molecule activity                  7.52E-04
collagen                                      1.81E-03

• T-test results: 121 genes (t-test p-value < 0.025) are associated with first-line chemotherapy resistance; 59 genes are up-regulated in sensitive patients and 62 genes are up-regulated in resistant patients.

Scientific Activity 1 - Results
Some of the genes up-regulated in sensitive patients, grouped by functional class according to GO

Symbol  Function                      Name
BIRC5   Apoptosis                     Baculoviral IAP repeat-containing 5 (survivin)
CASP3   Apoptosis                     Caspase 3, apoptosis-related cysteine protease
ASF1B   chromatin remodeling complex  ASF1 anti-silencing function 1 homolog B (S. cerevisiae)
BARD1   DNA                           BRCA1 associated RING domain 1
POLD2   DNA                           Polymerase (DNA directed), delta 2, regulatory subunit 50kDa
H2AFX   DNA                           H2A histone family, member X
BUB1    M phase                       BUB1 budding uninhibited by benzimidazoles 1 homolog (yeast)
CENPF   M phase                       Centromere protein F, 350/400ka (mitosin)
KPNA2   M phase                       Karyopherin alpha 2 (RAG cohort 1, importin alpha 1)
PREI3   M phase                       Preimplantation protein 3

Scientific Activity 1 - Results
Some of the genes up-regulated in resistant patients, grouped by functional class according to GO

Symbol  Function    Name
CCL15   Chemokines  Chemokine (C-C motif) ligand 14
CXCL12  Chemokines  Chemokine (C-X-C motif) ligand 12 (stromal cell-derived factor 1)
CXCL2   Chemokines  Chemokine (C-X-C motif) ligand 2
SDF2    Chemokines  Stromal cell-derived factor 2
COL3A1  Collagens   Collagen, type III, alpha 1
PLOD1   Collagens   Procollagen-lysine 1, 2-oxoglutarate 5-dioxygenase 1
BGN     ECM         Biglycan
COMP    ECM         Cartilage oligomeric matrix protein
FBLN2   ECM         Fibulin 2
LUM     ECM         Lumican
OGN     ECM         Osteoglycin (osteoinductive factor, mimecan)
FGFR4   Signalling  Fibroblast growth factor receptor 4
TGFBR2  Signalling  Transforming growth factor, beta receptor II (70/80kDa)
BDNF    Signalling  Brain-derived neurotrophic factor

Scientific Activity 1 - Conclusion
• The retrieved gene lists account for relevant mechanisms involved in the development of chemotherapy resistance.
• The ECM-mesodermal signature is associated with chemotherapy resistance in ovarian cancer.
• Genes associated with chemotherapy sensitivity are related to the M-phase checkpoint (BUB1, survivin) and to chromatin remodelling (BARD1, H2AX).
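
The leave-one-out procedure used to score the classifier in Activity 1 can be illustrated with a small sketch. It is not the project's code: a nearest-centroid rule stands in for the SVM so that the example needs only the Python standard library, the expression values are invented toy data, and the per-gene Bonferroni threshold assumes that the correction was applied over all 2864 genes of the merged dataset, which the slides do not state.

```python
from statistics import mean


def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to `sample`.

    Stand-in classifier (assumption): the slides use an SVM.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [mean(column) for column in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, count correct calls."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Invented toy data: rows are tumors, columns are two hypothetical genes.
expression = [[2.1, 0.3], [1.9, 0.1], [2.3, 0.4],
              [0.2, 1.8], [0.1, 2.2], [0.4, 2.0]]
response = ["sensitive"] * 3 + ["resistant"] * 3
toy_accuracy = leave_one_out_accuracy(expression, response)

# Figures taken from the slides.
reported_accuracy = 63 / 73            # 63 of 73 patients correctly classified
# Assumption: Bonferroni correction over all 2864 merged genes.
bonferroni_per_gene = 0.01 / 2864
```

With 73 tumors, each cross-validation round trains on 72 samples and tests on the one left out, so the reported 63/73 is the share of held-out tumors that were called correctly.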
Future Work - Activity 1
• Completion of the tools for MATS
• Extend the analysis to new classifiers within internal and external datasets, using unsupervised algorithms to search for gene modules
• Exploit and validate the Activity 2 results to find co-regulated genes within modules and within the gene lists/classifiers produced in Activity 1

Scientific Activity 2 - Objectives
1. Search for transcription factor binding sites (TFBS) in groups of co-expressed genes, a valuable resource for understanding the mechanisms involved in co-regulation. Application to the gene classifier defined during ovarian cancer transcriptional profiling, to identify the TFs responsible for its signatures.
2. Search for common TFBS in genes discarded during the analysis because of the differential-expression threshold.
3. Perform a blind scan on wide regions of the human genome, looking for transcriptional islands with common TFBS, in combination with complementary approaches such as phylogenetic footprinting.

Scientific Activity 2 - Implementation
While many approaches already exist to find TFBS in prokaryotes, their identification in eukaryotes, and in humans in particular, is difficult because of genomic complexity:
1. the distances between different modules belonging to the same promoter are much greater
2. regulatory elements can be located in non-canonical regions
3. regulatory elements are usually associated with multiple regulatory factors
(Cawley et al., Cell, 2004 Feb 20; 116(4):499-509)

Scientific Activity 2 - Implementation: ScanPro
ScanPro is a new algorithm for TFBS search (Zantoni M. and Dalla E., recently presented at the Cold Spring Harbor Systems Biology Meeting, March 2005), based on a "brute force" approach. The pattern encoding is based on integers and is coupled to a genomic information retrieval system that uses EnsEMBL to filter the results.
ScanPro finds conserved elements, putatively corresponding to TFBS, without any prior knowledge of the sites to be found, thus preventing biased results.

Scientific Activity 2 - Implementation: ScanPro
• Look for common subsequences at most m nucleotides long (usually m = 7-15) with at most k errors (k = 1-4).
• Encode each m-nucleotide window as an integer in a unique way.
• Associate each encoding with a known sequence and position, so that the corresponding subsequence can be retrieved.
• Finally, filter the results to remove false positives.

Scientific Activity 2 - Implementation: ScanPro - EnsEMBL Combined Data Handling
• Genomic localization of cDNAs
• Recovery of the genomic surroundings to be analyzed
• EnsEMBL data + ScanPro output: merged data

Scientific Activity 2 - Results. Testing Phase: the HIF-1 dataset
Many processes involved in oxygen homeostasis are mediated by hypoxia-inducible factors (HIFs), which transcriptionally regulate the expression of several dozen target genes. HIFs thus represent the link between oxygen sensors and effectors at the cellular, local, and systemic level.
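
The window-encoding steps listed for ScanPro above can be sketched as follows. This is an illustration only, not ScanPro itself: the slides say only that each window is encoded as an integer "in a unique way", so the 2-bits-per-base scheme, the function names and the toy sequences are all assumptions, and the EnsEMBL-based filtering step is omitted.

```python
from collections import defaultdict

# Assumed encoding: 2 bits per base, so every window maps to a distinct integer.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_window(window):
    """Map a nucleotide window to a unique integer (2 bits per base)."""
    value = 0
    for base in window:
        value = (value << 2) | BASE_CODE[base]
    return value


def mismatches(code_a, code_b, m):
    """Count differing bases between two encoded windows of length m."""
    diff = code_a ^ code_b
    count = 0
    for _ in range(m):
        if diff & 0b11:          # any set bit in this 2-bit group = mismatch
            count += 1
        diff >>= 2
    return count


def index_windows(sequences, m):
    """Associate each encoding with the (sequence id, position) pairs holding it."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - m + 1):
            window = seq[pos:pos + m]
            if all(base in BASE_CODE for base in window):  # skip e.g. 'N'
                index[encode_window(window)].append((seq_id, pos))
    return index


def shared_windows(sequences, m, k):
    """Brute force: for each window, the sequences with a copy within k mismatches."""
    index = index_windows(sequences, m)
    codes = list(index)
    hits = {}
    for code in codes:
        found = set()
        for other in codes:
            if mismatches(code, other, m) <= k:
                found.update(seq_id for seq_id, _ in index[other])
        hits[code] = found
    return hits


# Invented toy sequences, each carrying a near-copy of ACGTGA.
toy = {"s1": "TTACGTGAAT", "s2": "GGACGTGCCA", "s3": "CCACCTGATT"}
toy_hits = shared_windows(toy, m=6, k=1)
```

Comparing integers with XOR keeps the mismatch count cheap, and the index preserves the sequence and position of every window so the original subsequence can be looked up again, as the slide describes.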
Scientific Activity 2 - Results. Testing Phase: the HIF-1 dataset
• HIF-1 DNA binding site: A/(G)CGTG
• Subset of the first 6 examined genes: analysis performed on the -500/-1 region with respect to the TSS

Scientific Activity 2 - Results. Testing Phase: the HIF-1 dataset
Output generated by existing TFBS search tools (Ann-Spec, Transfac, YMF) on the 6 selected 5' promoter regions:
TTACATCA TGTGGT ACACAC TTCTTCCT AACCACA CGGGGC TCACCACG ATAAAT AAAAGGAG ACGTGC AAGAGCCT ACGCAC
Only one of the tools identifies, among other recurrences, the expected TFBS.
When the analysis was performed on the entire 26-promoter dataset, no significant recurrence was found by any of the tested tools except YMF: 12/26.

Scientific Activity 2 - Results. Testing Phase: the HIF-1 dataset
Output generated by ScanPro:
• ~ -80 nt region: 8 sequences contain the TFBS
• ~ -190 nt region: 10 sequences contain the TFBS
• ~ -260 nt region: 11 sequences contain the TFBS
• ~ -320 nt region: 9 sequences contain the TFBS
• ~ -410 nt region: 9 sequences contain the TFBS
• ~ -480 nt region: 8 sequences contain the TFBS
The expected consensus sequence is identified in 24 of the 26 analyzed sequences, thus greatly improving the number of true positive results.
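
As a point of comparison for the counts above, a direct search for the known HIF-1 site in a set of promoter regions can be sketched in a few lines. This is a hedged illustration rather than the procedure used for the slides: it assumes the forward strand only, a promoter string whose last base sits at position -1 relative to the TSS, and positions reported at the first base of the match; the toy promoters are invented.

```python
import re

# A/(G)CGTG from the slide, written as a character class; forward strand only.
HIF1_SITE = re.compile(r"[AG]CGTG")


def site_positions(promoter):
    """Start position of each match, counted so that the last base is -1."""
    length = len(promoter)
    return [match.start() - length
            for match in HIF1_SITE.finditer(promoter.upper())]


def fraction_with_site(promoters):
    """Share of promoter sequences that contain at least one match."""
    hits = sum(1 for seq in promoters.values() if site_positions(seq))
    return hits / len(promoters)


# Invented toy promoters (hypothetical gene names).
toy_promoters = {
    "gene_1": "TTACGTGAAGCGTGT",
    "gene_2": "TTTTTTTTTTTTTTT",
}
toy_positions = site_positions(toy_promoters["gene_1"])
toy_fraction = fraction_with_site(toy_promoters)

# Figure taken from the slides: 24 of the 26 sequences contain the site.
reported_fraction = 24 / 26
```

A search like this requires the motif to be known in advance, whereas ScanPro is described as recovering the site without prior knowledge.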
Scientific Activity 2 - Results
Output generated by ScanPro on the entire 26-promoter dataset

Expected consensus sequence: ACGTGA
Exact occurrences (with position):
Seq  2: -243 -133
Seq  3: -87
Seq  4: -193
Seq  5: -310
Seq  6: -139
Seq  8: -407 -84
Seq  9: -330
Seq 10: -481 -40 -20
Seq 11: -443
Seq 12: -84
Seq 14: -13
Seq 15: -312 -258 -107
Seq 16: -413 -228
Seq 19: -226
Seq 20: -345
Seq 21: -366 -181 -83 -46
Seq 22: -306 -225 -50
Seq 23: -402 -91
Seq 26: -264 -195

Expected consensus sequence: GCGTGA
Exact occurrences (with position):
Seq  2: -46
Seq  3: -412 -253 -189
Seq  4: -481 -415 -5
Seq  5: -493 -489 -477 -264 -125
Seq  6: -494 -13
Seq  7: -410 -406 -257
Seq  9: -335
Seq 11: -500 -113
Seq 12: -475 -271 -89
Seq 14: -323 -219 -73
Seq 15: -469 -186
Seq 16: -421 -293 -199 -168 -20
Seq 17: -488 -223 -194 -139 -125 -118 -52
Seq 18: -421 -341 -207 -78
Seq 19: -392 -327 -235 -77
Seq 20: -373 -361 -207 -198
Seq 22: -404
Seq 23: -495 -263 -150
Seq 24: -309
Seq 25: -196 -162 -17

Scientific Activity 2 - Results. Testing Phase: the HIF-1 dataset
Output generated by ScanPro - Additional New Conserved Regions

First conserved element: TCxCCGCC
Gene                    5' coord.  Genomic coord.  Repeated sequence
EPO                     118        99962692        None
EPO                     379        99962953        None
Transferrin Receptor 1  273        197297524       None
VEGF-A                  123        43845547        None
VEGF-A                  402        43845826        GC_rich
FLT-1                   82         27967062        GC_rich
FLT-1                   159        27967139        None
FLT-1                   246        27967226        None

Second conserved element: CGAGCxTC
Gene                    5' coord.  Genomic coord.  Repeated sequence
EPO                     300        99962874        None
VEGF-A                  303        43845727        None
FLT-1                   264        27967244        None

Scientific Activity 2 - Results. Testing Phase: the HIF-1 dataset
Output generated by ScanPro - New Conserved Regions
1. Conserved elements are located at similar positions relative to the TSS.
2. Only 2 of the 11 identified conserved regions match a Repeated Sequence (as defined in the EnsEMBL Genome Browser).
3. Only 1 of the 11 identified conserved regions matches a known TFBS: CGAGCxTC (EPO), 300 (99962874), a Pax-4 binding site.
4. The other conserved elements identified are currently being investigated.

Scientific Activity 2 - Future Work
• Validation of ScanPro on the Tompa benchmark
• Exact definition of the new conserved regions
• Investigation of other TF datasets (p53, c-myc, NFkB)
• Biological assays to demonstrate the TFBS experimentally (ChIP)
• Analysis of the gene classifier observed in ovarian cancers, as described in Activity 1
• Integration of ScanPro with other data structures to boost its performance (e.g. integration with BuST, see Activity 3)

Scientific Activity 3 - Objectives
• Development and implementation of a new algorithm for the identification of significantly conserved motifs, at the level of primary and secondary structure, in a set of non-aligned RNA sequences
• Identification of specific regulatory regions in non-coding RNA sequence collections (e.g. UTRs)
• Submission to experimental confirmation
Major problem: the search for approximated interspersed patterns inside the analyzed sequences

Scientific Activity 3 - Implementation
The algorithm is based on a new data structure called the Bundled Suffix Tree (BuST, recently presented at BiTS 2005, Milan), a generalization of the well-known suffix tree.
It is expected to allow faster queries for approximated patterns (e.g. TFBS) than those currently available, exploiting the ability of BuST to structurally consider the distance between the tree patterns.
Another advantage is the possibility of using string distances oriented towards biological aspects (thus not only mathematical distances such as the Hamming or edit distance).

Scientific Activity 3 - Expected Results
1. Precise definition of the theoretical capabilities of BuST
2. Production of a software prototype for BuST creation and approximated-pattern search
3. Completion of the Pevzner benchmark analysis markedly faster than the methods cited in the literature
4. Start of the analysis of the Tompa benchmark for TFBS search
Subsequent work: integration of ScanPro and BuST

Scientific Activity 4 - Objectives
• Few reliable regulatory networks exist; among these are those of E. coli and yeast.
• Regulatory network models suggest that partial inhibition of a surprisingly small number of gene targets can be more efficient than the complete inhibition of a single target, which raises the possibility that transcriptomics data can be used in designing multitarget drug strategies.
• Transcriptome data can be analyzed in terms of topological units that may be correlated with the stability/robustness of an organism's signal processing systems.
[Figure: regulatory networks of S. cerevisiae and E. coli; links marked + (up) or - (down).]
Agoston, V., Csermely, P. and Pongor, S. (2005). Physical Review E, 71.
Csermely, P., Agoston, V. and Pongor, S. (2005). Trends in Pharmacological Sciences, 26, 178-182.

Scientific Activity 4 - Implementation
Regulatory network models suggest that partial inhibition of a surprisingly small number of gene targets can be more efficient than the complete inhibition of a single target.
Simulation: by calculating some numeric property of the network (such as the size of the largest connected cluster, or the so-called communication efficiency), it is possible to follow how this property changes as links are deleted from the network.
[Figure: network integrity measure* plotted against the number of successive attacks. The decrease caused by targeted attack (red line) is greater than the decrease caused by random attack (mutation; cyan line).]
* communication efficiency, largest fragment size, etc.
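
The simulation described above, following an integrity measure while links are deleted, can be sketched with the standard library alone. Several simplifications are assumptions of this sketch and not the authors' method: edges are treated as undirected and unweighted, the integrity measure is the size of the largest connected cluster only, the targeted rule removes the link whose endpoints have the highest combined degree, and the five-node network is invented.

```python
import random
from collections import defaultdict


def largest_component(nodes, edges):
    """Size of the largest connected cluster (edge direction ignored)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                      # depth-first walk of one cluster
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def attack(nodes, edges, pick_edge, steps):
    """Delete one link per step; record the integrity measure after each."""
    remaining = list(edges)
    trace = [largest_component(nodes, remaining)]
    for _ in range(min(steps, len(remaining))):
        remaining.remove(pick_edge(remaining))
        trace.append(largest_component(nodes, remaining))
    return trace


def busiest_edge(edges):
    """Targeted attack (assumed rule): link with the highest combined degree."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return max(edges, key=lambda edge: degree[edge[0]] + degree[edge[1]])


def random_edge(rng):
    """Random attack: any remaining link, chosen uniformly."""
    return lambda edges: rng.choice(edges)


# Invented toy network: a hub H plus one peripheral link.
toy_nodes = ["H", "a", "b", "c", "d"]
toy_edges = [("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"), ("a", "b")]
targeted_trace = attack(toy_nodes, toy_edges, busiest_edge, steps=5)
random_trace = attack(toy_nodes, toy_edges, random_edge(random.Random(0)), steps=5)
```

Plotting the two traces against the step number gives curves of the kind the figure describes; on a network this small the two strategies need not differ much, so the toy run only shows the mechanics.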
Scientific Activity 4 - Expected Results
• Algorithms will be developed and implemented for the study of regulatory and other biological networks, using transcriptomics and functional genomics data deposited in publicly available databases.
• Graph algorithms included in the BOOST package, as well as algorithms developed/implemented locally, will be used to analyze and visualize the architecture of the networks.
• As a first approximation, these methods will be used to analyze published microarray data; in a second step they will be used to analyze data generated within this project.

Scientific Activity 4 - Implementation
Despite the progress in genome-based high-throughput screening and rational pharmacon design, the number of successful single-target drugs did not increase appreciably during the past decade.
Modelling drug attacks in networks (deletion of interactions)
[Figure panels: A. Complete knockout; B1. Partial knockout; B2. Attenuation; C1. Distributed knockout; C2. Distributed attenuation. X = removal of a link; dashed line = attenuation of a link (increasing the resistance to signal/metabolite flow).]
The effects of drugs can be modelled in terms of the removal or attenuation of links in weighted directed networks: a high-affinity drug is one that inactivates a certain protein node (fully inhibits an enzyme). A low-affinity drug will only attenuate the effects of the protein (B2). Alternatively, a very specific drug may inhibit only a single interaction within the entire system.

Scientific Activity 4 - Results
Multiple attacks are not only an efficient strategy for confusing a regulatory system; they also point to a central core of the system.
[Figure: A. E. coli; B. S. cerevisiae (labelled nodes: STE12, IME1, MMG1, GCN4, Dal80). Colour legend: single target (A), partial knockout (B1), attenuation (B2); partial knockout (B1), attenuation (B2); partial knockout; attenuation; distributed edge-knockout (C1), distributed edge-attenuation (C2); distributed edge-attenuation (C2).]
Several attack strategies (different colours) were tried to pinpoint vulnerable points of the regulatory network, and they seem to involve roughly the same core component of the network.
Agoston, V., Csermely, P. and Pongor, S. (2005). Physical Review E, 71.
Csermely, P., Agoston, V. and Pongor, S. (2005). Trends in Pharmacological Sciences, 26, 178-182.
A central core of regulatory networks is sensitive to multiple attacks.

Scientific Activity 1 - Objectives
1. Population of the gene expression database MATS (MicroArray TrieSte db) with gene expression data from cDNA microarrays generated by the same research unit and with data from external databases (GEO, ArrayExpress, SMD).
2. Gene expression profiling of different datasets of advanced-stage ovarian cancer to build classifiers, with subsequent validation; this requires avoiding bias due to external sources of systematic error such as hospital-dependent clinical information, surgery methodologies, etc.
3. Characterization of gene classifiers significantly associated with response to therapy and disease recurrence, composed of genes associated with ECM remodeling/mesenchymal plasticity on one side and with mitosis checkpoint/chromatin remodeling on the other.
4. Unsupervised expression data analysis with different algorithms to identify genes that show interesting behaviour with respect to LNCIB or public datasets.

Pevzner Benchmark
20 sequences, each 600 nt long, each containing a 15 nt string at Hamming distance ≤ 4 from a given string that is itself not present in the dataset.
[Figure: sequences a, b, c, d, e; dist(c, x) ≤ 4]

Tompa Benchmark
• 52 datasets based on real TRANSFAC data (6 fly, 26 human, 12 mouse, 8 yeast) + 4 spurious datasets
• 1-35 sequences per dataset (average: 7)
• Length: 500-3000 bp
• Dataset dimension: 1-70 Kb
Tompa et al., Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, vol. 23, n. 1, Jan 2005

Scientific Activity 1 - Future Work
• Gene expression profiling of the Torino samples to confirm the signature
• Stimulation of h-Tert cells with various factors to recapitulate the ECM-mesodermal signature in vitro (FGF2, TGFB, EGF, PDGF, LPS, BIO…)
• HDAC inhibition in h-Tert cells to evaluate whether epigenetic changes are involved in determining the ECM-mesodermal signature in vitro
• TaqMan of selected genes
• Marker selection to perform IHC staining
• Hopefully, gene expression profiling of specimens from 2nd-look surgery/relapses

Scientific Activity 1 - Implementation
Biological Evaluation of Ovarian Cancer Prognostic Factors
• Clinical chemotherapy response (clinical evaluation, CT scan, ultrasound imaging, absence of relapse within 6 months from surgery):
  • Complete or Absent
• Pathological chemotherapy response (2nd-look surgery positive for cancer cells):
  • Complete or Partial
• Residual disease after surgery (as self-assessed by the surgeon):
  • Optimal de-bulking: currently < 1 cm (< 2 cm in the past)
  • Sub-optimal de-bulking: currently > 1 cm (> 2 cm in the past)

Scientific Activity 1 - Implementation
Ovarian Cancer Pathological Model
[Figure labels: Surgery; Chemotherapy; Sensitive tumors; Resistant tumors; Outcome: Very Good, Bad-Medium, Bad-Medium, Very Bad]
Is it possible to predict chemotherapy resistance by gene expression analysis of primary tumor specimens?