Functional genomics and gene
expression data analysis
Joaquín Dopazo
Bioinformatics Unit,
Centro Nacional de Investigaciones Oncológicas (CNIO), Spain.
http://bioinfo.cnio.es
The use of high-throughput methodologies allows us to query
our systems in a new way but, at the same time, generates new
challenges for data analysis and requires us to change
our data management habits
National Institute of Bioinformatics,
Functional Genomics node
Now: 23531 (NCBI 34 assembly 02/04)
Recent estimates: 20,000 to 100,000.
50% mRNAs do not code for proteins (mouse)
50% display alternative splicing
Genes in the
DNA...
25%-60% unknown
…whose final
effect can be
different because
of the variability.
>protein kinase
acctgttgatggcgacagggactgtatgctgatct
atgctgatgcatgcatgctgactactgatgtgggg
gctattgacttgatgtctatc....
…are expressed and
constitute the
transcriptome...
A typical tissue
expresses between
5,000 and 10,000
genes
More than 4 million
SNPs have been
mapped
From genotype
to phenotype.
(only the genetic component)
… which accounts for the function
provided they are expressed at
the proper time and place...
…conforming complex
interaction networks
(metabolome)...
…in cooperation
with other proteins
(interactome) …
...and code for
proteins (proteome)
that...
Each protein has an average
of 8 interactions
Pre-genomics scenario in the lab
>protein kinase
acctgttgatggcgacagggactgtatgctga
tctatgctgatgcatgcatgctgactactgatg
tgggggctattgacttgatgtctatc....
Bioinformatics tools for pre-genomic
sequence data analysis
Phylogenetic
tree
Information
Sequence
Molecular
databases
Motif
databases
Search results
Motif
Conserved
region
The aim:
Extracting as much
information as
possible from one
single piece of data
alignment
Secondary and tertiary
protein structure
Post-genomic vision
Who?
Genome
sequencing
Literature,
databases
2-hybrid systems
Mass spectrometry for
protein complexes
What do
we know?
And who else?
SNPs
Expression
Arrays
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Where, when and how much?
In what way?
Post-genomic vision
genes
Information
The new tools:
interactions
Clustering
Feature selection
Data integration
Information mining
Information
Databases
polymorphisms
Gene
expression
Gene expression profiling.
The rationale, what we would like and related problems
Differences at the phenotype level are the visible consequence of differences at the molecular
level which, in many cases, can be detected by measuring the levels of gene
expression. The same holds for different experiments, treatments, etc.
• Classification of phenotypes / experiments (Can I distinguish among classes,
values of variables, etc. using molecular gene expression data?)
• Selection of differentially expressed genes among the phenotypes / experiments
(did I select the relevant genes, all the relevant genes and nothing but the relevant
genes?)
• Biological roles the genes are playing in the cell (what general biological roles
are really represented in the set of relevant genes?)
A note of caution:
Genome-wide technologies allow us to produce vast
amounts of data.
But... data is not knowledge
Misunderstanding this has led to “new” (not
necessarily good) ways of asking (scientific) questions
Question
Experiment
test
Is gene A involved in process B?
Experiment
(sometimes) test
Question
Is there any gene (or set of genes) involved in any process?
Gene expression analysis using DNA microarrays
There are two
dominant
technologies:
spotted arrays
and oligo arrays
although new
players are
entering the
arena
Cy5
Cy3
cDNA arrays
Oligonucleotide arrays
Transforming images into data
Test sample labeled red (Cy5)
Reference sample labeled
green (Cy3)
Red : gene overexpressed in test
sample
Green : gene underexpressed in
test sample
Yellow - equally expressed
red/green - ratio of expression
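The red/green ratio is usually handled on a log2 scale, so that over- and underexpression are symmetric around zero. A minimal sketch, assuming background-corrected, strictly positive intensities (the spot names and values below are hypothetical):

```python
import math

def log2_ratio(cy5, cy3):
    """Return log2(Cy5/Cy3) for one spot; both intensities must be > 0."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

# Hypothetical (Cy5, Cy3) intensities for three spots.
spots = {
    "geneA": (4000.0, 1000.0),  # red: overexpressed in the test sample
    "geneB": (500.0, 2000.0),   # green: underexpressed in the test sample
    "geneC": (1500.0, 1500.0),  # yellow: equally expressed
}
ratios = {name: log2_ratio(red, green) for name, (red, green) in spots.items()}
for name, value in ratios.items():
    print(name, value)
```

A 4-fold increase and a 4-fold decrease give log-ratios of the same size and opposite sign, which is why most downstream methods work on this scale.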
Normalisation
A
There are many sources of error that can
affect and seriously bias the
interpretation of the results: differences in
the efficiency of labelling, the
hybridisation, local effects, etc.
B
Normalisation is a necessary step before
proceeding with the analysis
C
Before (left) and after (right) normalization. A) BoxPlots, B)
BoxPlots of subarrays and C) MA plots (ratio versus intensity)
(a) After normalization by average (b) after print-tip lowess
normalization (c) after normalization taking into account spatial
effects
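The MA representation and the simplest global normalisation can be sketched as follows. This is only an illustration, assuming positive intensities and a constant dye bias; it centres the log-ratios on their median and does not implement print-tip lowess or spatial corrections. The intensities are hypothetical.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return M = log2(R/G) and A = 0.5 * log2(R*G) for paired intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def median_center(m):
    """Global normalisation: shift the log-ratios so that their median is 0."""
    shift = median(m)
    return [x - shift for x in m]

# Hypothetical array where the red channel is systematically twice as bright.
green = [100.0, 200.0, 400.0, 800.0, 1600.0]
red = [2 * g for g in green]
m, a = ma_values(red, green)
m_norm = median_center(m)
print(m)       # biased log-ratios
print(m_norm)  # centred log-ratios
```

When the bias depends on intensity (a curved MA plot), a constant shift is not enough, which is the motivation for intensity-dependent methods such as lowess.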
The data
...
A
Genes
(thousands)
B C
Different classes
of experimental
conditions, e.g.
Cancer types,
tissues, drug
treatments, time
survival, etc.
Expression
profile of all the
genes for a
experimental
condition (array)
Expression profile
of a gene across the
experimental
conditions
Experimental conditions
(from tens up to no more than a few hundred)
Characteristics of the data:
• Many more variables (genes) than
measurements (experiments / arrays)
• Low signal to noise ratio
• High redundancy and inter-gene
correlations
• Most of the genes are not
informative with respect to the trait
we are studying (they account for unrelated
physiological conditions, etc.)
• Many genes have no annotation!!
Multiple array experiments.
Can we find groups of
experiments with
similar gene expression
profiles?
Unsupervised
Different phenotypes...
Supervised
Reverse engineering
Molecular classification
of samples
Co-expressing genes...
What genes are
responsible for?
What do they
have in common?
B
Genes interacting in a
network (A,B,C..)...
How is the
network?
A
C
D
E
Unsupervised clustering methods:
Useful for class discovery (we do not have
any a priori knowledge on classes)
Non hierarchical
K-means, PCA
SOM
hierarchical
UPGMA
SOTA
Different levels
of information
quick and
robust
An unsupervised problem:
clustering of genes.
•Gene clusters are
unknown beforehand
•Distance function
•Cluster gene expression
patterns based uniquely
on their similarities.
•Results are subject to
further interpretation (if
possible)
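The two ingredients named above, a distance function and a grouping rule based only on similarity, can be sketched as follows. This is a minimal illustration of average-linkage (UPGMA) clustering with a Pearson correlation distance, not the implementation of any particular tool; gene names and profiles are hypothetical.

```python
from math import sqrt

def pearson_distance(x, y):
    """1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / sqrt(sxx * syy)

def upgma(profiles):
    """Average-linkage agglomerative clustering.

    profiles: dict mapping gene name -> list of expression values.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of gene names.
    """
    clusters = [(gene,) for gene in profiles]

    def avg_dist(c1, c2):
        total = sum(pearson_distance(profiles[a], profiles[b])
                    for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        pairs = [(avg_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)  # closest pair of clusters
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical profiles: g1/g2 rise together, g3/g4 fall together.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.5],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
merges = upgma(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

The merge history is the tree; where to cut it, and what the resulting groups mean, is the interpretation step that the method itself does not provide.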
Clustering of experiments:
The rationale
If enough genes have their
expression levels altered in the
different experiments, we might
be able to find these classes by
comparing gene expression
profiles.
Distinctive gene expression patterns in human
mammary epithelial cells and breast cancers
Overview of the combined in vitro and breast tissue
specimen cluster diagram. A scaled-down representation of
the 1,247-gene cluster diagram. The black bars show the
positions of the clusters discussed in the text: (A)
proliferation-associated, (B) IFN-regulated, (C) B
lymphocytes, and (D) stromal cells.
Perou et al., PNAS 96 (1999)
Clustering of experiments:
The problems
Any gene (regardless of its relevance for
the classification) has the same weight
in the comparison. If relevant genes
are not in the overwhelming majority, this
produces:
Noise
and/or
irrelevant trends
Supervised analysis.
If we already have information on the classes, our question
to the data should use it.
Class prediction based on gene expression profiles:
A
B C
Problems:
How can classes A, B, C... be
distinguished based on the corresponding
profiles of gene expression?
Genes
(thousands)
Predictor
How can a continuous phenotypic trait
(resistance to drugs, survival, etc.)
be predicted?
And
Which genes among the thousands
analysed are relevant for the
classification?
Experimental conditions
(from tens up to no more than a few hundred)
Gene
selection
Gene selection.
We are interested in selecting those genes showing
differential expression among the classes studied.
• Contingency table (Fisher's test)
For discrete data
(presence/absence, etc).
• T-test
We could compare gene expression
data between two types of patients.
• ANOVA
Analysis of variance. We compare
the value of an interval-scale variable
between two or more groups.
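For a single gene and two classes, the comparison could look like the sketch below. It computes Welch's t statistic and, as an assumption made here to avoid external libraries, obtains the p-value by permuting the class labels rather than from the t distribution. The expression values are hypothetical.

```python
import random
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic for two independent samples."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    return (mean(x) - mean(y)) / se

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: fraction of random relabellings with |t| >= observed."""
    rng = random.Random(seed)
    observed = abs(welch_t(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(x)], pooled[len(x):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical log-expression values of one gene in two types of patients.
tumour = [8.1, 7.9, 8.4, 8.0, 8.3]
normal = [6.0, 6.2, 5.9, 6.1, 6.3]
t_stat = welch_t(tumour, normal)
p_value = permutation_p_value(tumour, normal)
print(round(t_stat, 2), round(p_value, 4))
```

On a real array this test is repeated once per gene, so the resulting p-values cannot be read one by one without an adjustment for multiple testing.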
The pomelo tool
Gene selection and class
discrimination
10
cases
10
controls
Genes differentially expressed
among classes (t-test or
ANOVA), with p-value < 0.05
Sorry... the data was a collection of
random numbers labelled for two classes
This is a multiple-testing
statistical contrast.
Adjusted p-values must be used!
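The random-numbers experiment is easy to reproduce. The sketch below is an illustration under stated assumptions: values are drawn from N(0, 1), and because the variance is then known, a two-sided z-test is swapped in for the t-test so that only the standard library is needed. The number of genes and the seed are arbitrary.

```python
import random
from statistics import NormalDist, mean

def null_p_values(n_genes=2000, n_per_class=10, seed=1):
    """p-values for genes whose 'cases' and 'controls' are pure noise."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = (2.0 / n_per_class) ** 0.5  # sd of the difference of two means
    pvals = []
    for _ in range(n_genes):
        cases = [rng.gauss(0, 1) for _ in range(n_per_class)]
        controls = [rng.gauss(0, 1) for _ in range(n_per_class)]
        z = (mean(cases) - mean(controls)) / se
        pvals.append(2 * (1 - std_normal.cdf(abs(z))))
    return pvals

pvals = null_p_values()
n_raw = sum(p < 0.05 for p in pvals)
n_bonferroni = sum(p < 0.05 / len(pvals) for p in pvals)
print("genes with unadjusted p < 0.05:", n_raw)
print("genes passing a Bonferroni threshold:", n_bonferroni)
```

Roughly 5% of the genes are expected to pass the unadjusted threshold although none of them is truly differentially expressed; with thousands of genes that is already a long and entirely spurious list.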
Gene selection
NE
between normal endometrium
(ne) and endometrioid
endometrial carcinomas (eec)
Hierarchical Clustering of 86 genes with
different expression patterns between
Normal Endometrium and Endometrioid
Endometrial Carcinoma (p<0.05) selected
among the ~7000 genes in the CNIO
oncochip
Moreno et al., BREAST AND
GYNAECOLOGICAL CANCER LABORATORY,
Molecular Pathology Programme, CNIO
NE
EEC
EEC
G Symbol A Number
And, genes are not only related to
discrete classes...
Pomelo: a tool for
finding differentially
expressed genes
• Among classes
• Survival
• Related to a continuous
parameter
Of predictors and molecular
signatures
A B
1 Training
Model, or
classifier
(with internal and/or
external CV)
A/B?
Unknown sample
A
CV
A/B?
2. Classification /
prediction
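The train / cross-validate / predict scheme can be sketched as follows. This is a toy illustration, not the method of any specific tool: it assumes two classes, ranks genes by the difference of class means, and uses a nearest-centroid rule. The important detail is that gene selection is repeated inside every cross-validation fold, so the left-out sample never influences which genes are chosen. Data are hypothetical.

```python
from statistics import mean

def select_genes(samples, labels, k):
    """Indices of the k genes with the largest difference of class means
    (two classes assumed)."""
    first, second = sorted(set(labels))
    a = [s for s, l in zip(samples, labels) if l == first]
    b = [s for s, l in zip(samples, labels) if l == second]
    n_genes = len(samples[0])
    score = [abs(mean(s[g] for s in a) - mean(s[g] for s in b))
             for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: score[g], reverse=True)[:k]

def nearest_centroid(train, labels, genes, sample):
    """Predict the class whose centroid on the selected genes is closest."""
    best, best_d = None, float("inf")
    for c in sorted(set(labels)):
        members = [s for s, l in zip(train, labels) if l == c]
        centroid = [mean(s[g] for s in members) for g in genes]
        d = sum((sample[g] - m) ** 2 for g, m in zip(genes, centroid))
        if d < best_d:
            best, best_d = c, d
    return best

def loocv_accuracy(samples, labels, k=1):
    """Leave-one-out CV with gene selection redone in every fold."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, train_labels, k)
        if nearest_centroid(train, train_labels, genes, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Hypothetical data: 6 samples x 4 genes; only gene 0 separates A from B.
samples = [[5.0, 1.0, 0.2, 3.1], [5.2, 0.9, 0.1, 2.9], [4.9, 1.2, 0.3, 3.0],
           [1.0, 1.1, 0.2, 3.2], [1.2, 0.8, 0.1, 2.8], [0.9, 1.0, 0.3, 3.0]]
labels = ["A", "A", "A", "B", "B", "B"]
accuracy = loocv_accuracy(samples, labels, k=1)
print(accuracy)
```

Selecting the genes once on all samples and cross-validating only the classifier would give an optimistically biased accuracy estimate.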
Predictor of clinical outcome in
breast cancer
Genes are arranged according to
their correlation with
the prognostic groups
Prognostic classifier
with optimal accuracy
van’t Veer et al., Nature, 2002
Information mining
My data...
How are they
structured?
What are
these
groups?
What is
this gene?
?
Cell cycle...
DBs Information
Clustering
Information mining
Links
Information mining applications.
1) use of biological information
as a validation criterion
Information mining of DNA array data.
Allows quick assignment of function, biological role
and subcellular location to groups of genes.
Used to understand why genes differ in their
expression between two different conditions
Sources of information:
• Free text
• Curated terms (ontologies, etc.)
Gene Ontology Consortium
http://www.geneontology.org
• The objective of GO is to provide controlled vocabularies
for the description of the molecular function, biological
process and cellular component of gene products.
• These terms are to be used as attributes of gene products
by collaborating databases, facilitating uniform queries
across them.
• The controlled vocabularies of terms are structured to
allow both attribution and querying to be at different levels
of granularity.
FatiGO: GO-driven data analysis
The aim: to develop a statistical framework
able to deal with multiple-testing questions
GO: source of information. A reduced number of curated terms
The Gene Ontology Consortium. 2000. Gene Ontology: tool for the unification of biology.
Nature Genetics 25: 25-29
How does FatiGO work?



Compares two sets of genes (query and reference)
Has Ontology information [Process, Function and Component] on
different organisms
Select level [2-5]. Important: annotations are upgraded to the level
chosen. This increases the power of the test: there are fewer terms to be
tested and more genes per term.
[Flow diagram: repeated genes are removed within the query cluster, within the reference cluster and between the two clusters. For each clean cluster, GO terms are searched in the GO database at the selected level and ontology. The distribution of GO terms in the query cluster is compared with that in the reference cluster, giving a p-value per term that is adjusted for multiple testing.]
Important: since we are performing as many tests as
GO terms, multiple-testing adjustment must be used
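For one GO term, comparing its frequency in the query and reference sets amounts to a test on a 2x2 table. The sketch below uses a two-sided Fisher's exact test; this is an illustration of the kind of test involved, and the counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: query genes with the term       b: query genes without it
    c: reference genes with the term   d: reference genes without it
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x annotated genes in the query set.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: 20 of 50 query genes and 10 of 200 reference genes
# are annotated with the term.
p_term = fisher_exact_two_sided(20, 30, 10, 190)
print(p_term)
```

One such p-value is produced per GO term at the chosen level, and it is this collection of p-values that has to be adjusted.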
FatiGO Results
The application extracts biologically
relevant terms (showing a
significant differential distribution)
for a set of genes
Number of genes with GO term at the level
and ontology selected for each cluster
Unadjusted p-value
Step-down min p adjusted p-value
FDR (indep.) adjusted p-value
FDR (arbitrary depend.) adjusted p-value
Tables
GO Term – Genes
Genes of old versions (Unigene)
Genes without result
Repeated Genes
GO Tree with different levels of
information
C
PTL
LB
Understanding why genes differ
in their expression between two
different phenotypes
Lymphomas from mature lymphocytes (LB)
and precursor T-lymphocyte (PTL).
Genes differentially expressed, selected
among the ~7000 genes in the CNIO oncochip
Genes differentially expressed between the two
groups were mainly related to immune
response (activated in mature lymphocytes)
Martinez et al., Human Genetics Laboratory.
Molecular Pathology Programme, CNIO
Biological processes shown by the genes differentially
expressed among PTL-LB
Martinez et al., Human Genetics Laboratory. Molecular Pathology Programme, CNIO
Looking for significant differences.
Statistical approaches
Don’t worry, be happy
2-fold increase/decrease
Hundreds of
differentially
expressed genes
Individual test
Hardly a few
differentially
expressed genes
(or even none)
Panic
Bonferroni
FWER
Looking for more heuristic and/or realistic
ways of finding differentially expressed genes
False Discovery Rate (FDR) controls the expected
proportion of false rejections among the rejected hypotheses
(differentially expressed genes), instead of the more
conservative FWER, which controls the probability that one
or more of the rejected hypotheses is a false rejection.
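The difference between the two criteria shows up directly in the adjusted p-values. A minimal sketch of the Bonferroni (FWER) and Benjamini-Hochberg (FDR, independent tests) adjustments, with hypothetical p-values:

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests (cap at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """FDR control (Benjamini-Hochberg step-up); results keep the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p-values for five genes.
pvals = [0.001, 0.008, 0.012, 0.041, 0.6]
p_fwer = bonferroni(pvals)
p_fdr = benjamini_hochberg(pvals)
print(sum(p < 0.05 for p in p_fwer), "genes pass with Bonferroni")
print(sum(p < 0.05 for p in p_fdr), "genes pass with FDR")
```

On these values the FDR adjustment keeps one more gene than Bonferroni does; the gap usually widens as the number of tests grows.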
Use of external information
1. Use of biological information as a validation criterion
2. Use of biological information as part of the
algorithm
Necessity of a tool and the appropriate statistical
framework for the management of the information
Applications
2) Use of biological information as a
threshold criterion
The problem:
We might be interested in
understanding, e.g., which genes
differ between tissues, diseases,
etc.
A
B
B
Typically:
We examine each gene selecting
only those that show significant
differences using an appropriate
statistical model, and correcting for
multiple testing.
Use biological
information as a
validation criterion
Metabolism
Transport
...
Reproduction
The threshold, thus, is based on
expression values in absence of
any other information.
Conventional levels (e.g., Type I
error rate of 0.05) based
exclusively on statistical criteria are
used.
A
Use of biological information as a
threshold criterion
Information-driven
approach
We examine the GO terms
associated with each gene
and see, correcting for
multiple testing, if some of
them are overrepresented
A
B
B
A
Metabolism
Transport
...
Reproduction
The threshold is based on
levels (e.g., Type I error
rate of 0.05) of
distribution of GO terms
The rationale: genes are
differentially expressed
for some biological
reason
GO
terms
The procedure becomes more sensitive
Present
Absent
Comparing genes differentially
expressed between organs
testis
kidney
Díaz-Uriarte et al., CAMDA 02
Other approaches
that include
information in the
algorithm: GSEA
Figure 1: Schematic overview of GSEA.
The goal of GSEA is to determine whether any a priori defined gene sets
(step 1) are enriched at the top of a list of genes ordered on the basis of
expression difference between two classes (for example, highly expressed
in individuals with NGT versus those with DM2). Genes R1,...RN are
ordered on the basis of expression difference (step 2) using an appropriate
difference measure (for example, SNR). To determine whether the
members of a gene set S are enriched at the top of this list (step 3), a
Kolmogorov-Smirnov (K-S) running sum statistic is computed: beginning
with the top-ranking gene, the running sum increases when a gene
annotated to be a member of gene set S is encountered and decreases
otherwise. The ES for a single gene set is defined as the greatest positive
deviation of the running sum across all N genes. When many members of S
appear at the top of the list, ES is high. The ES is computed for every gene
set using actual data, and the MES achieved is recorded (step 4). To
determine whether one or more of the gene sets are enriched in one
diagnostic class relative to the other (step 5), the entire procedure (steps 2–
4) is repeated 1,000 times, using permuted diagnostic assignments and
building a histogram of the maximum ES achieved by any pathway in a
given permutation. The MES achieved using the actual data is then
compared to this histogram (step 6, red arrow), providing us with a global P
value for assessing whether any gene set is associated with the diagnostic
categorization.
Mootha et al., Nat Genet. 2003 Jul;34(3):267-73
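The running-sum statistic of step 3 can be sketched as follows. The step sizes used here (an increment of sqrt((N-G)/G) for members of S and a decrement of sqrt(G/(N-G)) otherwise, with G the number of members of S in the list) are an assumption chosen so that the running sum returns to zero at the end of the list; gene names are hypothetical, and the label-permutation part (steps 5 and 6) is not shown.

```python
from math import sqrt

def enrichment_score(ranked_genes, gene_set):
    """Greatest positive deviation of the running sum over the ranked list."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        return 0.0  # the score is undefined for empty or all-inclusive sets
    hit = sqrt((n - g) / g)    # step up for a member of the set
    miss = -sqrt(g / (n - g))  # step down otherwise
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit if gene in gene_set else miss
        best = max(best, running)
    return best

# Hypothetical list of 10 genes ordered by expression difference.
ranked = ["g%d" % i for i in range(1, 11)]
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})      # set at the top
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})  # set at the bottom
print(round(es_top, 3), round(es_bottom, 3))
```

A set concentrated at the top of the list gets a high score, while the same number of genes at the bottom scores close to zero, since only positive deviations are counted.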
ISW applied to a dataset for which
no genes differentially expressed
could be found
ISW detects 5 pathways
arrangement
Pathways over- and
underrepresented
Mootha et al., Nat Genet. 2003
17 NGT vs.
8 IGT 18 DM2
No differentially expressed
genes between both conditions
were found.
Gene Set Enrichment Analysis
detects Oxidative
phosphorylation
IGT + Diabetic
Normal tolerance to glucose
Algorithms are used if they are available in programs.
External tools
GEPAS, a package for DNA array
data analysis
EP
In silico CGH
HAPI
Scanning,
Array
Image processing
Two-conditions
comparison
Normalization
Gene selection
DNMAD
Two-classes
Multiple classes
Unsupervised
clustering
Hierarchical
SOM
SOTA
SomTree
Preprocessor
+ hub
Continuous variable
Categorical variable
survival
Predictor
tnasas
Supervised
clustering
SVM
Datamining
FatiGO
FatiWise
Viewers
SOTATree
TreeView
SOMplot
A
G
E
F
B
C
D
Bioinformatics Group, CNIO
From left to right: Lucía Conde, Joaquín Dopazo, Alvaro Mateos, Fátima Al-Shahrour,
Víctor Calzado, Hernán Dopazo, Javier Herrero, Javier Santoyo, Ramón Díaz, Michal
Karzinstky & Juanma Vaquerizas
http://bioinfo.cnio.es
http://gepas.bioinfo.cnio.es
http://fatigo.bioinfo.cnio.es