Development and Implementation of
Tools, Software and Databases
for Transcriptome Characterization
FIRB-LIBI
CBM - Trieste
FROM CLUSTERS TO MODULES TO
NETWORKS
GENECLUSTERS
TFBS-ANALYSIS
Scientific Activities
1. Population and use of MATS (MicroArray TrieSte database) for
the recognition of gene modules among clusters and of co-regulated vs. co-expressed genes (Schneider Unit/LNCIB-Ts)
2. Search for transcription factor binding sites (TFBS) in the
genomic neighbourhood of genes of interest (Policriti/UniUD
+ Schneider Units)
3. Construction of DB-ncRNA, a database of non-coding RNAs
extracted from genome analysis and catalogued in a project
underway for the identification of the corresponding cDNAs
(Fabris/UniTS + Policriti Units)
4. Study of the network architecture of transcriptomic data as well as
their relation to the regulatory networks (Pongor/ICGEB-Ts Unit)
Ovarian cancer
• The fifth most common malignancy
• The leading cause of death among gynecologic malignancies
• Diagnosis: late (70% in FIGO stage III)
• Short-term curability: high (surgery + first-line chemotherapy)
• Long-term curability: poor (intrinsic/acquired chemoresistance)
Scientific Activity 1 - Implementation
Aims of the Study
• Combine data from different tumor collections in order to increase
the patient number.
• Compare Resistant Vs Sensitive patients by using supervised approaches
• Optimize existing tools to extract functional themes underlying
biological/clinical properties, MODULES, within analyzed tumors.
• Confirm the obtained gene lists on independent collections and extract
the corresponding lists of co-regulated genes as in Activity 2.
• Proof of principle: in vitro models to recapitulate in vivo findings
DATASET MERGING
Each matrix was normalized and centered separately before merging using IMAGE identifiers
Chieti: 30 tumors/~13000 genes
INT: 43 tumors/~4500 genes
Combined dataset: 73 tumors/2864 genes
Class comparison using dataset of origin as the testing variable yielded no significant genes:
the merged dataset can be used for further analysis.
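The merging step can be sketched as follows. This is a minimal illustration, assuming each dataset is a mapping from IMAGE clone identifier to a list of per-tumor expression values. The function names (`center_rows`, `merge_on_image_ids`) and the toy values are hypothetical, and the per-gene mean-centering only stands in for the full normalization pipeline.

```python
from statistics import mean

def center_rows(matrix):
    """Center each gene (row) on its own mean within a single dataset."""
    centered = {}
    for image_id, values in matrix.items():
        m = mean(values)
        centered[image_id] = [v - m for v in values]
    return centered

def merge_on_image_ids(first, second):
    """Keep only IMAGE ids present in both datasets and concatenate tumors."""
    shared = sorted(set(first) & set(second))
    return {image_id: first[image_id] + second[image_id] for image_id in shared}

# Toy data (hypothetical values): IMAGE id -> expression per tumor.
chieti = {"IMAGE:1": [1.0, 3.0], "IMAGE:2": [2.0, 4.0], "IMAGE:3": [0.0, 2.0]}
int_set = {"IMAGE:2": [10.0, 12.0, 14.0], "IMAGE:3": [5.0, 5.0, 8.0]}

# Each matrix is centered separately, then merged on the shared identifiers.
merged = merge_on_image_ids(center_rows(chieti), center_rows(int_set))
```

Only genes spotted on both platforms survive the intersection, which is why the combined dataset has fewer genes than either source collection.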
Scientific Activity 1 - Implementation
Study Design
• Data processing: normalization and standardization of gene expression
data (gene filtering, print-tip loess, library loess, scaling, variance
stabilization within and between datasets)
• Dataset merging: using IMAGE clone IDs we could compare the INT
data and the Chieti data (73 advanced tumors, about 2600 genes)
• SAM (Significance Analysis of Microarrays): selection of genes associated with
the investigated phenotype.
• SVM (support vector machine): to evaluate classification performance
of the selected genes
• EASE annotation survey: statistical evaluation of functional themes
associated with the gene list retrieved by SAM
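The kind of over-representation statistic behind such an annotation survey can be sketched as follows. The example uses a plain one-sided hypergeometric test as a stand-in; it is not EASE's own scoring, and the function name and counts are hypothetical.

```python
from math import comb

def enrichment_p_value(list_hits, list_size, population_hits, population_size):
    """P(X >= list_hits) when drawing list_size genes without replacement."""
    total = comb(population_size, list_size)
    upper = min(list_size, population_hits)
    tail = sum(
        comb(population_hits, k) * comb(population_size - population_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 4 of 10 listed genes carry a term found in 5 of 50 genes.
p = enrichment_p_value(4, 10, 5, 50)
```

A small p-value indicates that the term occurs in the gene list more often than expected from its frequency on the whole array.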
Scientific Activity 1 - Results
Identification of a Significant Variable that Affects Patient Survival:
Chemotherapy Response
Merged Dataset Analysis summary
T-Test to extract genes associated with chemotherapy resistance (p-value < 0.01, Bonferroni correction)
SVM with cross-validation to assess classification performance
T-test to expand the gene list associated with chemotherapy resistance (p-value < 0.025)
Annotation survey with EASE for GO terms
(P-value < 0.05 - usually < 0.01)
Functional classes of genes associated with chemotherapy resistance
T-Test and SVM results
• T-Test results: 43 genes (Bonferroni-corrected p-value < 0.01) are associated with first-line
chemotherapy resistance; 27 genes are up-regulated in sensitive patients and 16 genes are up-regulated in resistant patients
• SVM results: 63 of 73 patients (~86.3%) were correctly classified
using SVM (leave-one-out cross-validation)
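The leave-one-out procedure can be sketched as follows. To keep the example dependency-free, a nearest-centroid classifier is swapped in for the SVM; the function names and toy expression profiles are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Assign the label whose class centroid is closest (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and count correct calls."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Toy expression profiles (hypothetical): two genes per patient.
samples = [[2.0, 0.1], [1.8, 0.3], [2.2, 0.2], [0.2, 1.9], [0.1, 2.1], [0.3, 1.7]]
labels = ["sensitive"] * 3 + ["resistant"] * 3
accuracy = leave_one_out_accuracy(samples, labels)
```

With 73 patients the same loop runs 73 times, and the reported accuracy is simply the fraction of held-out patients assigned to the correct class (63/73).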
Scientific Activity 1 - Results
EASE Results
• Small gene list: no statistically relevant results were obtained (as expected)
• Expanded gene list: 121 genes (T-Test p-value < 0.025)
  • Large gene list, up-regulated in resistant patients:
    • ECM remodeling
    • Transcription factor activity
    • Proteoglycan
  • Large gene list, down-regulated in resistant patients:
    • M-Phase and chromatin complex remodeling
    • Acetylation
    • Ligase activity
Scientific Activity 1 - Results
T-Test and GO analysis
GO term | p-value
chromatin remodeling complex | 1.46E-03
M phase | 2.38E-03
regulation of mitosis | 2.49E-03
mitotic cell cycle | 4.47E-03
mitosis | 6.30E-03

GO term | p-value
extracellular matrix structural constituent | 1.44E-05
extracellular matrix | 3.05E-05
perception of sound | 3.85E-04
hearing | 3.85E-04
structural molecule activity | 7.52E-04
collagen | 1.81E-03
• T-Test results: 121 genes (T-Test p-value < 0.025) are associated with first-line chemotherapy
resistance; 59 genes up-regulated in sensitive patients, while 62 genes up-regulated in
resistant patients
Scientific Activity 1 - Results
Some of the genes up-regulated in sensitive patients grouped by their
functional classes according to GO
Name | Symbol | Function
Baculoviral IAP repeat-containing 5 (survivin) | BIRC5 | Apoptosis
Caspase 3, apoptosis-related cysteine protease | CASP3 | Apoptosis
ASF1 anti-silencing function 1 homolog B (S. cerevisiae) | ASF1B | chromatin remodeling complex
BRCA1 associated RING domain 1 | BARD1 | DNA
Polymerase (DNA directed), delta 2, regulatory subunit 50kDa | POLD2 | DNA
H2A histone family, member X | H2AFX | DNA
BUB1 budding uninhibited by benzimidazoles 1 homolog (yeast) | BUB1 | M phase
Centromere protein F, 350/400ka (mitosin) | CENPF | M phase
Karyopherin alpha 2 (RAG cohort 1, importin alpha 1) | KPNA2 | M phase
Preimplantation protein 3 | PREI3 | M phase
Scientific Activity 1 - Results
Some of the genes up-regulated in resistant patients grouped by their
functional classes according to GO
Name | Symbol | Function
Chemokine (C-C motif) ligand 14 | CCL15 | Chemokines
Chemokine (C-X-C motif) ligand 12 (stromal cell-derived factor 1) | CXCL12 | Chemokines
Chemokine (C-X-C motif) ligand 2 | CXCL2 | Chemokines
Stromal cell-derived factor 2 | SDF2 | Chemokines
Collagen, type III, alpha 1 | COL3A1 | Collagens
Procollagen-lysine 1, 2-oxoglutarate 5-dioxygenase 1 | PLOD1 | Collagens
Biglycan | BGN | ECM
Cartilage oligomeric matrix protein | COMP | ECM
Fibulin 2 | FBLN2 | ECM
Lumican | LUM | ECM
Osteoglycin (osteoinductive factor, mimecan) | OGN | ECM
Fibroblast growth factor receptor 4 | FGFR4 | Signalling
Transforming growth factor, beta receptor II (70/80kDa) | TGFBR2 | Signalling
Brain-derived neurotrophic factor | BDNF | Signalling
Scientific Activity 1 - Conclusion
• The retrieved gene lists account for relevant mechanisms involved in the
development of chemotherapy resistance
• The ECM-mesodermal signature is associated with chemotherapy
resistance in ovarian cancer
• Genes associated with chemotherapy sensitivity are related to the M-phase
checkpoint (BUB1, Survivin) and to chromatin remodelling (BARD1, H2AX).
Future work Activity 1
• Completion of tools for MATS
• Extend the analysis to new classifiers within internal and external data-sets
using unsupervised algorithms to search for Gene-Modules
• Exploit and validate Activity 2 results to find co-regulated genes within
Modules and gene-lists/classifiers produced in Activity 1
Scientific Activity 2 - Objectives
1. Search for transcription factor binding sites (TFBS) in groups of co-expressed genes as a valuable resource for understanding the mechanisms
involved in co-regulation -> application to the gene classifier
defined during ovarian cancer transcriptional profiling, to identify
the TFs responsible for its signatures.
2. Search for common transcription factor binding sites (TFBS) in genes
discarded during the analysis because of the differential expression
threshold.
3. Perform a blind scan on wide regions of the human genome, looking
for transcriptional islands with common TFBS, in combination with
complementary approaches such as phylogenetic footprinting.
Scientific Activity 2 - Implementation
While many approaches already exist to find TFBS in prokaryotes, their
identification in eukaryotes and in particular in human is very tricky due to
genomic complexity:
1. the distances between different modules belonging to the same promoter
greatly increase
2. regulatory elements can be located in non-canonical regions
3. regulatory elements are usually associated with multiple regulatory
factors
(Cawley et al. Cell. 2004 Feb 20; 116(4):499-509)
Scientific Activity 2 - Implementation
ScanPro
ScanPro is a new algorithm for TFBS search (Zantoni M. and Dalla E., recently
presented at the Cold Spring Harbor Systems Biology Meeting, March 2005) based
on a “brute force” approach.
The pattern encoding is based on integers and is coupled to a genomic information
retrieval system that uses EnsEMBL to filter the results.
ScanPro finds conserved elements, putatively corresponding to TFBS,
without any prior knowledge of the sites to be found,
thus preventing biased results.
Scientific Activity 2 - Implementation
ScanPro
• Look for common subsequences at most m nucleotides long (usually
m = 7-15) with a maximum of k errors (k = 1-4).
• Encode each m-nucleotide window into an integer in a unique way.
• Associate each encoding with a known sequence and position, so that
the respective subsequence can be retrieved.
• Finally, filter the results to remove false positives.
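The window-encoding idea can be sketched as follows. This is not the ScanPro implementation: the base-4 integer encoding, the choice to anchor the search on the first sequence, and all function names are assumptions made for illustration, and the EnsEMBL-based filtering step is left out.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_windows(sequence, m):
    """Yield (integer code, position) for every m-nucleotide window."""
    for pos in range(len(sequence) - m + 1):
        value = 0
        for base in sequence[pos:pos + m]:
            value = value * 4 + CODE[base]
        yield value, pos

def decode(value, m):
    """Recover the m-nucleotide string from its base-4 integer code."""
    bases = []
    for _ in range(m):
        bases.append("ACGT"[value % 4])
        value //= 4
    return "".join(reversed(bases))

def mismatches(a, b, m):
    """Count positions where two encoded windows differ (2 bits per base)."""
    diff, count = a ^ b, 0
    for _ in range(m):
        if diff & 3:
            count += 1
        diff >>= 2
    return count

def common_windows(sequences, m, k):
    """Windows of the first sequence found in all others with <= k errors."""
    encoded = [list(encode_windows(s, m)) for s in sequences]
    hits = {}
    for code, pos in encoded[0]:
        if all(any(mismatches(code, other, m) <= k for other, _ in rest)
               for rest in encoded[1:]):
            hits.setdefault(decode(code, m), []).append(pos)
    return hits

# Toy promoters (hypothetical sequences) sharing a site with one mismatch.
promoters = ["TTACGTGCAA", "GGACGTGCTT", "CCACGAGCGG"]
found = common_windows(promoters, 6, 1)
```

Because each window is a single integer, comparing two windows reduces to an XOR and a count of non-zero 2-bit groups, and the stored position lets the original subsequence be retrieved.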
Scientific Activity 2 - Implementation
ScanPro – EnsEMBL Combined Data Handling
[Diagram: genomic localization of cDNAs and recovery of the genomic surroundings to be analyzed; EnsEMBL data and ScanPro output are combined into merged data]
Scientific Activity 2 - Results
Testing Phase: the HIF-1 dataset
Many processes involved in oxygen homeostasis are mediated by hypoxia-inducible factors
(HIFs), which transcriptionally regulate the expression of several dozens of target genes. Thus,
HIFs represent the link between oxygen sensors and effectors at the cellular, local, and
systemic level.
Scientific Activity 2 - Results
Testing Phase: the HIF-1 dataset
HIF-1 DNA binding site: A/(G)CGTG
Subset of first 6 examined genes:
analysis performed on
-500/-1 region with respect to TSS
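A direct scan for the known site in such an upstream region can be sketched as follows. The example assumes the region is given as a forward-strand string ending at position -1; the function name and the 20-nt fragment are hypothetical.

```python
import re

# Lookahead pattern so that overlapping occurrences would also be reported.
HIF1_SITE = re.compile(r"(?=([AG]CGTG))")

def hif1_sites(upstream):
    """Return (position relative to TSS, matched site) on the forward strand.

    `upstream` is assumed to end at position -1, so index i maps to
    i - len(upstream).
    """
    return [(m.start() - len(upstream), m.group(1))
            for m in HIF1_SITE.finditer(upstream)]

# Hypothetical 20-nt upstream fragment covering positions -20..-1.
fragment = "TTACGTGAAAGGGCGTGCTT"
sites = hif1_sites(fragment)
```

For a full -500/-1 region the same function applies unchanged; a reverse-strand scan would need the reverse complement of the input.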
Scientific Activity 2 - Results
Testing Phase: the HIF-1 dataset
Output generated by existing TFBS search tools
on the 6 selected 5’-promoter regions
Ann-Spec
Transfac
YMF
TTACATCA
TGTGGT
ACACAC
TTCTTCCT
AACCACA
CGGGGC
TCACCACG
ATAAAT
AAAAGGAG
ACGTGC
AAGAGCCT
ACGCAC
Only one of the tools identifies, among other
recurrences, the expected TFBS.
When the analysis was performed on the entire 26-promoter dataset,
no significant recurrence was found by any of the tools tested above
except YMF (12/26).
Scientific Activity 2 - Results
Testing Phase: the HIF-1 dataset
Output generated by ScanPro
~ -80nt region : 8 sequences contain the TFBS
~ -190nt region: 10 sequences contain the TFBS
~ -260nt region: 11 sequences contain the TFBS
~ -320nt region: 9 sequences contain the TFBS
~ -410nt region: 9 sequences contain the TFBS
~ -480nt region: 8 sequences contain the TFBS
The expected Consensus Sequence is identified in 24 of the 26
analyzed sequences,
thus greatly improving the number of true positive results.
Scientific Activity 2 - Results
Output generated by ScanPro on the entire 26 promoter data set
Expected Consensus Sequence: ACGTGA
Exact Occurrences (with position):
Seq 2: -243 -133
Seq 3: -87
Seq 4: -193
Seq 5: -310
Seq 6: -139
Seq 8: -407 -84
Seq 9: -330
Seq 10: -481 -40 -20
Seq 11: -443
Seq 12: -84
Seq 14: -13
Seq 15: -312 -258 -107
Seq 16: -413 -228
Seq 19: -226
Seq 20: -345
Seq 21: -366 -181 -83 -46
Seq 22: -306 -225 -50
Seq 23: -402 -91
Seq 26: -264 -195

Expected Consensus Sequence: GCGTGA
Exact Occurrences (with position):
Seq 2: -46
Seq 3: -412 -253 -189
Seq 4: -481 -415 -5
Seq 5: -493 -489 -477 -264 -125
Seq 6: -494 -13
Seq 7: -410 -406 -257
Seq 9: -335
Seq 11: -500 -113
Seq 12: -475 -271 -89
Seq 14: -323 -219 -73
Seq 15: -469 -186
Seq 16: -421 -293 -199 -168 -20
Seq 17: -488 -223 -194 -139 -125 -118 -52
Seq 18: -421 -341 -207 -78
Seq 19: -392 -327 -235 -77
Seq 20: -373 -361 -207 -198
Seq 22: -404
Seq 23: -495 -263 -150
Seq 24: -309
Seq 25: -196 -162 -17
Scientific Activity 2 - Results
Testing Phase: the HIF-1 dataset
Output generated by ScanPro - Additional New Conserved Regions
First conserved element: TCxCCGCC
Gene | 5’ Coord. | Genomic Coord. | Repeated Sequence
EPO | 118 | 99962692 | None
EPO | 379 | 99962953 | None
Transferrin Receptor 1 | 273 | 197297524 | None
VEGF-A | 123 | 43845547 | None
VEGF-A | 402 | 43845826 | GC_rich
FLT-1 | 82 | 27967062 | GC_rich
FLT-1 | 159 | 27967139 | None
FLT-1 | 246 | 27967226 | None

Second conserved element: CGAGCxTC
Gene | 5’ Coord. | Genomic Coord. | Repeated Sequence
EPO | 300 | 99962874 | None
VEGF-A | 303 | 43845727 | None
FLT-1 | 264 | 27967244 | None
Scientific Activity 2 - Results
Testing Phase: the HIF-1 dataset
Output generated by ScanPro - New Conserved Regions
1. Conserved elements are located at similar positions from the TSS.
2. Only 2 out of 11 identified conserved regions match a Repeated
Sequence (as defined in the EnsEMBL Genome Browser).
3. Only 1 out of 11 identified conserved regions matches a known TFBS:
CGAGCxTC (EPO), 300 (99962874) -> Pax-4 binding site.
4. The other conserved elements identified are currently being investigated.
Scientific Activity 2 – Future Work
• Validation of ScanPro on Tompa benchmark
• Exact Definition of the New Conserved Regions
• Investigation of other TF datasets (p53, c-myc, NFkB)
• Biological assays to experimentally demonstrate TFBS (ChIP)
• Analysis of the gene classifier observed in ovarian cancers as
described in Activity 1
• Integration between ScanPro and other data structures
for boosting its performance (i.e. integration with BuST, see
Activity 3)
Scientific Activity 3 - Objectives
Development and implementation of a new algorithm for the identification
of significantly conserved motifs at the level of primary and secondary
structure in a set of non-aligned RNA sequences
Identify specific regulatory regions in
non-coding RNA sequence collections (e.g. UTRs)
Submission to experimental confirmation
Major problem: search for approximate interspersed
patterns inside the analyzed sequences
Scientific Activity 3 - Implementation
The algorithm is based on a new data structure called the Bundled
Suffix Tree (BuST, recently presented at BiTS 2005, Milan), a
generalization of the well-known suffix tree.
It is expected to allow faster approximate-pattern queries (e.g. for
TFBS) than those currently available, exploiting BuST's ability to
structurally account for the distance between the patterns in the tree.
Another advantage is the possibility of using string distances
oriented towards biological aspects (thus not only mathematical
distances such as Hamming or edit distance).
Scientific Activity 3 – Expected Results
1. Precise definition of BuST's theoretical capabilities
2. Production of a software prototype for BuST creation and approximate-pattern search
3. Completion of the Pevzner benchmark analysis substantially faster
than the methods cited in the literature
4. Start of the analysis of the Tompa benchmark for TFBS search
Subsequent work: integration between ScanPro and BuST
Scientific Activity 4 - Objectives
Few reliable regulatory networks exist. Among these: E. coli and yeast
Regulatory network models suggest that partial inhibition of a surprisingly small
number of gene targets can be more efficient than the complete inhibition of a single
target, which raises the possibility that transcriptomics data can be used in designing
multitarget drug strategies
Transcriptome data can be analyzed in terms of topological units that may be
correlated with the stability/robustness of an organism’s signal processing systems,
which raises the possibility that transcriptomics data can be used in designing
multitarget drug strategies
[Figure: regulatory networks of S. cerevisiae and E. coli; legend: + (up), - (down)]
Ágoston, V., Csermely, P. and Pongor, S. (2005) Physical Review E, 71,
Csermely, P., Ágoston, V. and Pongor, S. (2005) Trends in
Pharmacological Sciences, 26, 178-182.
Scientific Activity 4 - Implementation
Regulatory network models suggest that partial inhibition of a surprisingly
small number of gene targets can be more efficient than the
complete inhibition of a single target
Simulation: by calculating some numeric property of the network (such as
the size of the largest connected cluster, or the so called communication
efficiency) it is then possible to follow how this property changes as links are
deleted from the network, as shown here.
[Figure: network integrity measure* plotted against the number of successive attacks. The decrease caused by targeted attack (red line) is greater than the decrease caused by random attack (mutation; cyan line).]
*Communication efficiency, largest fragment size, etc.
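The simulation described above can be sketched as follows, using the size of the largest connected cluster as the integrity measure. The toy network, the function names, and the targeting rule (links of the best-connected nodes are deleted first) are assumptions for illustration only.

```python
import random

def largest_cluster(nodes, edges):
    """Size of the largest connected component (undirected graph)."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        best = max(best, size)
    return best

def attack(nodes, edges, order):
    """Delete links in the given order and record integrity after each step."""
    remaining = list(edges)
    trace = [largest_cluster(nodes, remaining)]
    for edge in order:
        remaining.remove(edge)
        trace.append(largest_cluster(nodes, remaining))
    return trace

def targeted_order(nodes, edges):
    """Assumed targeting rule: hit links of the best-connected nodes first."""
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(edges, key=lambda e: degree[e[0]] + degree[e[1]], reverse=True)

# Toy network (hypothetical): a hub "h" with spokes plus a short chain.
nodes = ["h", "a", "b", "c", "d", "e"]
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("c", "d"), ("d", "e")]
targeted = attack(nodes, edges, targeted_order(nodes, edges)[:3])
shuffled = edges[:]
random.Random(0).shuffle(shuffled)
randomised = attack(nodes, edges, shuffled[:3])
```

Plotting `targeted` and `randomised` against the number of deleted links gives the two curves of the figure; any other integrity measure can be substituted for `largest_cluster`.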
Scientific Activity 4 – Expected Results
• Algorithms will be developed and implemented for the study of regulatory
and other biological networks using transcriptomics and functional genomics data
deposited in publicly available databases
• Graph algorithms included in the BOOST package and developed/implemented
locally will be used to analyze and visualize the architecture of the networks
• As a first approximation, these methods will be used to analyze published
microarray data; in a second step we will use them for the analysis of data
generated within this project
Scientific Activity 4 - Implementation
Despite the progress in genome-based high-throughput screening and rational pharmacon
design, the number of successful single target drugs did not increase appreciably during the
past decade
[Figure: modelling drug attacks in networks (deletion of interactions). Panels: A. Complete knockout; B1. Partial knockout; B2. Attenuation; C1. Distributed knockout; C2. Distributed attenuation. X = removal of a link; dashed line = attenuation of a link (increasing the resistance to signal/metabolite flow).]
The effects of drugs can be modelled in terms of removal or attenuation of links in weighted
directed networks: a high-affinity drug is one that inactivates a certain protein node (fully
inhibits an enzyme). A low-affinity drug will only attenuate the effects of the protein (B2).
Alternatively, a very specific drug may inhibit only a single interaction within the entire
system.
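The difference between knockout and attenuation can be sketched on a weighted directed network as follows. The efficiency definition used here (mean of 1/shortest-path length over ordered node pairs, with link weights read as resistances), the function names, and the three-node toy network are assumptions for illustration.

```python
import heapq

def shortest_lengths(graph, source):
    """Dijkstra over a dict {node: {neighbour: resistance}}."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue
        for nxt, resistance in graph[node].items():
            candidate = d + resistance
            if candidate < dist.get(nxt, float("inf")):
                dist[nxt] = candidate
                heapq.heappush(queue, (candidate, nxt))
    return dist

def efficiency(graph):
    """Mean of 1/distance over ordered node pairs (unreachable pairs add 0)."""
    nodes = list(graph)
    total = 0.0
    for source in nodes:
        dist = shortest_lengths(graph, source)
        total += sum(1.0 / d for target, d in dist.items() if target != source)
    return total / (len(nodes) * (len(nodes) - 1))

def knockout(graph, node):
    """Complete knockout: remove every link entering or leaving one node."""
    return {n: ({} if n == node else {t: w for t, w in out.items() if t != node})
            for n, out in graph.items()}

def attenuate(graph, links, factor):
    """Attenuation: multiply the resistance of the chosen links by `factor`."""
    return {n: {t: w * factor if (n, t) in links else w for t, w in out.items()}
            for n, out in graph.items()}

# Toy directed network (hypothetical), unit resistance on every link.
network = {"a": {"b": 1.0}, "b": {"c": 1.0}, "c": {"a": 1.0}}
baseline = efficiency(network)
after_knockout = efficiency(knockout(network, "b"))
after_attenuation = efficiency(attenuate(network, {("a", "b"), ("b", "c")}, 2.0))
```

In this toy case the complete knockout of one node lowers efficiency more than doubling the resistance of two links; comparing such values across strategies on a real network is what the attack models do.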
Scientific Activity 4 – Results
Multiple attacks are not only an efficient strategy for confusing a regulatory
system; they also point to a central core of the system, as shown here:
[Figure: panel A, E. coli; panel B, S. cerevisiae; labelled nodes: STE12, IME1, MMG1, GCN4, Dal80]
Colour legend of attack strategies:
● Single target (A), partial knockout (B1), attenuation (B2);
● Partial knockout (B1), attenuation (B2);
● Partial knockout;
● Attenuation;
● Distributed edge-knockout (C1), distributed edge-attenuation (C2);
● Distributed edge-attenuation (C2);
Several attack strategies (different colours) were tried to pinpoint vulnerable points of the
regulatory network, and they seem to involve roughly the same core component of the network
Ágoston, V., Csermely, P. and Pongor, S. (2005) Physical Review E, 71,
Csermely, P., Ágoston, V. and Pongor, S. (2005) Trends in
Pharmacological Sciences, 26, 178-182.
A central core of regulatory networks is sensitive to multiple attacks
Scientific Activity 1 - Objectives
1. Population of the gene expression database MATS (MicroArray TrieSte
db) with data on gene expression inferred from cDNA microarrays
generated by the same research unit and also with data from external
databases (GEO, ArrayExpress, SMD)
2. Gene expression profiling of different datasets of advanced stages of
ovarian cancer to build classifiers, and subsequent validation -> necessity
to avoid bias due to external causes of systematic errors such as hospital-dependent clinical information, surgery methodologies, etc.
3. Characterization of gene classifiers, significantly associated with
response to therapy and disease recurrence, composed of genes associated
with ECM-remodeling/mesenchymal-plasticity on one side and with
Mitosis-checkpoint/chromatin-remodeling on the other.
4. Unsupervised expression data analysis with different algorithms to
identify genes that show interesting behaviour with respect to LNCIB or
public datasets
Pevzner Benchmark
20 sequences 600nt long, each containing a 15nt string with Hamming
distance ≤ 4 from a given string, and not present in the dataset
[Figure: schematic with strings labelled a-e; dist(c,x) ≤ 4]
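A dataset of this shape can be generated as sketched below. The function names and the random seed are hypothetical, and the sketch does not enforce the condition that the unmutated motif itself is absent from the dataset.

```python
import random

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def planted_motif_dataset(rng, n_seqs=20, length=600, motif_len=15, max_dist=4):
    """Build random sequences, each hiding a mutated copy of one motif."""
    motif = "".join(rng.choice("ACGT") for _ in range(motif_len))
    sequences = []
    for _ in range(n_seqs):
        variant = list(motif)
        for pos in rng.sample(range(motif_len), max_dist):
            variant[pos] = rng.choice("ACGT")  # may redraw the same base
        background = [rng.choice("ACGT") for _ in range(length)]
        start = rng.randrange(length - motif_len + 1)
        background[start:start + motif_len] = variant
        sequences.append("".join(background))
    return motif, sequences

def closest_window(sequence, motif):
    """Smallest Hamming distance between the motif and any window."""
    m = len(motif)
    return min(hamming(sequence[i:i + m], motif)
               for i in range(len(sequence) - m + 1))

motif, dataset = planted_motif_dataset(random.Random(42))
```

A motif finder is then judged on whether it recovers the planted string from `dataset` alone; `closest_window` only verifies that each sequence really contains a copy within the allowed distance.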
Tompa Benchmark
52 datasets based on real TRANSFAC data (6 fly, 26 human, 12 mouse,
8 yeast) + 4 spurious datasets
1-35 sequences per dataset (average: 7)
Length: 500-3000 bp
Dataset dimension: 1-70 Kb
Tompa et al., Assessing computational tools for the discovery of
transcription factor binding sites, Nature Biotechnology, vol.23, n.1,
Jan 2005
Scientific Activity 1 – Future Work
• Gene expression profiling of the Torino samples to confirm the signature
• Stimulation of the h-Tert cells with various factors to recapitulate
the ECM-mesodermal signature in vitro (FGF2, TGFB, EGF,
PDGF, LPS, BIO…)
• HDAC inhibition in h-Tert cells to evaluate whether epigenetic changes
are involved in the determination of the ECM-mesodermal
signature in vitro
• TaqMan of selected genes
• Marker selection to perform IHC staining
• Hopefully, gene expression profiling of specimens from 2nd
look/relapses
Scientific Activity 1 - Implementation
Biological Evaluation of Ovarian Cancer Prognostic Factors
• Clinical Chemotherapy Response (clinical evaluation, CT-scan, ultra-sound
imaging, absence of relapse within 6 months from surgery):
• Complete or Absent
• Pathological Chemotherapy Response (2nd look surgery positive for cancer cells):
• Complete or Partial
• Residual disease after surgery (as self-assessed by the surgeon):
• Optimal de-bulking: currently < 1 cm (< 2 cm in the past)
• Sub-optimal de-bulking: currently > 1 cm (> 2 cm in the past)
Scientific Activity 1 - Implementation
Ovarian Cancer Pathological Model
[Diagram: surgery followed by chemotherapy separates sensitive from resistant tumors; outcome labels: Very Good, Medium, Bad, Very Bad]
Is it possible to predict chemotherapy resistance by
gene expression analysis of primary tumor specimens?