Detecting Phenotype-Specific
Interactions Between Biological
Processes
Nadeem A. Ansari
Department of Computer Science
Wayne State University
Detroit, MI 48202
1
Outline
• Biological background
• Motivation and problem description
• Challenges and limitations
• Mathematical background
• Detecting changed interactions between biological processes in a phenotype
• Improvements
• Results
• Summary
2
Cells, proteins, and DNA
• Cells: fundamental units of life that contain all
the working machinery necessary for their
functioning
• Proteins: the main contributors of this working
machinery
• Deoxyribonucleic acid (DNA): contains the
blueprint for making the working machinery
• Gene expression: the process of making the
working machinery
4
DNA
• Linear molecule of two strands, each composed of subunits called nucleotides
• Nucleotide types:
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)
5
DNA
• Base pairing:
… A A C G G A T …
… T T G C C T A …
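The pairing rule on this slide (A with T, C with G) can be sketched in a few lines of Python. The function name `complement` is chosen here for illustration only.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-paired partner of each nucleotide, position by position."""
    return "".join(PAIR[base] for base in strand)

# The upper strand from the slide yields the lower strand.
print(complement("AACGGAT"))  # TTGCCTA
```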
6
Transcription
• Information stored in DNA letters is transcribed
into Ribonucleic acid (RNA)
• RNA: a chain of nucleotides - A, C, G, U (uracil)
… G T G C A T … DNA
… C A C G U A … RNA
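As a minimal sketch, transcription of the DNA strand shown above can be written as a position-wise mapping in which A on the DNA pairs with U on the RNA. The name `transcribe` is hypothetical, and the sketch ignores strand direction and all biological detail.

```python
# DNA template base -> complementary RNA base (U replaces T in RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template: str) -> str:
    """Map each DNA base on the template strand to its RNA partner."""
    return "".join(DNA_TO_RNA[base] for base in template)

print(transcribe("GTGCAT"))  # CACGUA, as on the slide
```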
7
Translation
• Information stored in RNA is translated into
chains of amino acids - proteins
8
Gene expression
• The process of making the working machinery of
a cell.
9
• Regions of DNA that are synthesized into functional RNA and proteins are known as genes
• An observable characteristic (or trait) of an organism caused by gene expression is known as a phenotype.
10
Gene expression measurement – why?
• All cells contain the same DNA but express genes selectively
• Various stimuli cause changes in gene expression
• A change in expression level results in under- or over-production of working machinery
– diseases / phenotypes
• Measuring gene expression can help us understand the underlying biological phenomena
11
Gene expression measurements
• Typically researchers measure gene expression in
two different tissues or cell samples
– Cells treated with a drug vs. untreated cells
• Genes expressed differently than in a control sample are called differentially expressed (DE) genes
• High throughput technologies like DNA
microarrays measure expression levels of
thousands of genes
12
Genes and annotations
• Functional characteristics of gene products are stored in annotation databases such as the Gene Ontology
• Gene Ontology (GO): a controlled and structured
vocabulary
– Molecular functions, biological processes, and
cellular components
• Structured as directed acyclic graphs (DAGs)
– Nodes represent terms
– Edges represent relationships
• Parent-child relations (more than one parent)
– Is-a, part-of, and regulates (negatively, positively)
13
Biological processes – GO subset
• GO is a set of terms and their definitions organized in a structure that reflects their relationships
• GO also provides a set of annotations describing what is known about each gene (product)
14
Motivation and problem description
• Various stimuli cause differential gene expression,
which results in the over and under production of
proteins
• Over and under production of proteins can result
in the expression of a disease and disease-specific
phenotype
• Understanding the behavior of genes can help us understand diseases in ways not previously possible – e.g. finding drug targets for curing diseases
16
Motivation and problem description
• Current approaches look for the biological
functions that are under or over represented in
the phenotype-specific gene expression patterns
• However, life is complex and biological functions
also interact
• These interactions change in a phenotype
• Understanding changed interactions between
biological functions is important in understanding
the underlying biological mechanism that
resulted in the phenotype
17
Goals
• Our goal is to detect the interactions between
biological functions that have changed
significantly in a given phenotype
• We detect these interactions between the
biological processes from GO annotated with
differentially expressed genes in a phenotype
19
Challenges and limitations
• There is no simple way to establish which
biological functions are important
– No universally accepted statistical model exists
• Finding relationships between biological processes using mathematical models is challenging
• No known statistical model exists that detects
changed interactions in a given phenotype
• Using GO annotations presents its own
challenges
20
Challenges and limitations
• GO is incomplete and continuously updated
– Missing information regarding gene annotations
• GO contains inconsistencies
– New research may make previous annotations obsolete
• GO hierarchy poses the challenge of dependencies
– Genes annotated with a specific term are assumed to be annotated with all the ancestors of that term
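The dependency described in the last bullet can be illustrated with a small sketch that propagates a gene's direct annotations to every ancestor term in the DAG. The term names, the `PARENTS` mapping, and the helper names are hypothetical placeholders, not real GO identifiers.

```python
# Hypothetical toy DAG: term -> list of parent terms (a term may have several parents).
PARENTS = {
    "term_c": ["term_b", "term_x"],
    "term_b": ["term_root"],
    "term_x": ["term_root"],
    "term_root": [],
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(direct, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents)
        full[gene] = closure
    return full

direct = {"gene1": {"term_c"}}
full = propagate(direct, PARENTS)
print(sorted(full["gene1"]))
```

A gene annotated with one specific term therefore contributes to several columns of any gene-by-term matrix, which is why the columns are not independent.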
21
Information retrieval (IR)
• Problem: Given a query, find relevant documents
from a collection
• Vector space model (VSM)
– Represent document and keywords in a matrix
• Documents as columns with keywords as
components – columns are document vectors
– Represent query as a (column) vector
– Find document vectors closer to query vector
• Documents are relevant to query
23
Example – document retrieval
Document collection
D1: How to bake bread without recipes
D2: The classic art of Viennese pastry
D3: Numerical recipes: the art of scientific computing
D4: Breads, pastries, pies, and cakes: quality baking recipes
D5: Pastry: a book of best French recipes
Example taken from Berry et al., SIAM Review 41, 2 (1999)
24
Example – document retrieval
Document collection
D1: How to bake bread without recipes
D2: The classic art of Viennese pastry
D3: Numerical recipes: the art of scientific computing
D4: Breads, pastries, pies, and cakes: quality baking recipes
D5: Pastry: a book of best French recipes
Terms
T1: bake
T2: recipe
T3: bread
T4: cake
T5: pastry
T6: pie
25
Example – document retrieval
Document collection
D1: How to bake bread without recipes
D2: The classic art of Viennese pastry
D3: Numerical recipes: the art of scientific computing
D4: Breads, pastries, pies, and cakes: quality baking recipes
D5: Pastry: a book of best French recipes
Term by document matrix A
Terms         D1  D2  D3  D4  D5
T1 bake        1   0   0   1   0
T2 recipe      1   0   1   1   1
T3 bread       1   0   0   1   0
T4 cake        0   0   0   1   0
T5 pastry      0   1   0   1   1
T6 pie         0   0   0   1   0
26
Example (IR VSM)
• Document vector: $D_1 = (1\ 1\ 1\ 0\ 0\ 0)^T$
• User searching for documents related to “baking bread”
• Query vector: $Q = (1\ 0\ 1\ 0\ 0\ 0)^T$
Term by document matrix A and query vector
Terms         D1  D2  D3  D4  D5   Query
T1 bake        1   0   0   1   0     1
T2 recipe      1   0   1   1   1     0
T3 bread       1   0   0   1   0     1
T4 cake        0   0   0   1   0     0
T5 pastry      0   1   0   1   1     0
T6 pie         0   0   0   1   0     0
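A minimal Python sketch of the retrieval step for this example: each document column is compared with the query vector by the cosine of the angle between them, and documents with a higher score are treated as more relevant. The helper name `cosine` is an assumption made for illustration.

```python
import math

# Term-by-document matrix A from the example (rows T1..T6, columns D1..D5).
A = [
    [1, 0, 0, 1, 0],  # T1 bake
    [1, 0, 1, 1, 1],  # T2 recipe
    [1, 0, 0, 1, 0],  # T3 bread
    [0, 0, 0, 1, 0],  # T4 cake
    [0, 1, 0, 1, 1],  # T5 pastry
    [0, 0, 0, 1, 0],  # T6 pie
]
Q = [1, 0, 1, 0, 0, 0]  # query "baking bread"

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

documents = [list(col) for col in zip(*A)]  # one vector per document
scores = [cosine(Q, d) for d in documents]
for name, score in zip(["D1", "D2", "D3", "D4", "D5"], scores):
    print(name, round(score, 4))
```

Only D1 and D4 share terms with the query, so they are the only documents with a non-zero score, and D1 ranks first because it contains fewer unrelated terms.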
27
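Finding relevant (similar) documents
Term by document matrix $A$ (rows $T_1, \dots, T_m$; columns $D_1, \dots, D_n$):
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$
$$D_j = (a_{1j}\ a_{2j}\ \dots\ a_{mj})^T, \qquad Q = (q_1\ q_2\ \dots\ q_m)^T$$
$$\text{similarity}(Q, D_j) = \frac{Q^T D_j}{\|Q\|\,\|D_j\|}$$
$$Q^T D_j = q_1 a_{1j} + q_2 a_{2j} + \dots + q_m a_{mj}$$
$$\|Q\| = \sqrt{Q^T Q} = \sqrt{q_1^2 + q_2^2 + \dots + q_m^2}$$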
28
Correlation
• Determines if two random variables vary
together
• Linear correlation between X and Y:
– Positive correlation - X increases as Y increases
– Negative correlation - X decreases as Y increases
– No linear correlation - no linear relationship
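$$X = (x_1, x_2, \dots, x_m), \qquad Y = (y_1, y_2, \dots, y_m)$$
$$r_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_{XX}\,\sigma_{YY}}} = \frac{\frac{1}{m}\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\frac{1}{m}\sum (X - \bar{X})^2}\,\sqrt{\frac{1}{m}\sum (Y - \bar{Y})^2}}$$
(Pearson correlation coefficient)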
29
Pearson correlation coefficient – geometric interpretation
$$r_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}}$$
$$\sum (X - \bar{X})(Y - \bar{Y}) = (x_1 - \bar{X})(y_1 - \bar{Y}) + \dots + (x_m - \bar{X})(y_m - \bar{Y}) = x_{c1} y_{c1} + \dots + x_{cm} y_{cm} = X_c^T Y_c$$
$$r_{XY} = \frac{X_c^T Y_c}{\|X_c\|\,\|Y_c\|} = \text{similarity}(X_c, Y_c)$$
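The equivalence on this slide can be checked numerically. The sketch below computes the Pearson coefficient once from covariance and variances and once as the cosine similarity of the mean-centered vectors. The function names and sample data are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson r from the covariance and the two variances."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / m
    var_x = sum((a - mean_x) ** 2 for a in x) / m
    var_y = sum((b - mean_y) ** 2 for b in y) / m
    return cov / math.sqrt(var_x * var_y)

def centered_cosine(x, y):
    """Cosine similarity of the mean-centered vectors X_c and Y_c."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    xc = [a - mean_x for a in x]
    yc = [b - mean_y for b in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    return dot / (math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc)))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
print(pearson(x, y), centered_cosine(x, y))  # the two values agree up to rounding
```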
30
Detecting interactions that have changed
significantly in the phenotype
• Represent differentially expressed genes, in a
phenotype, and their biological functions as a
matrix – vector space model with biological
processes as column vectors
• Find associations between pairs of biological
processes
• Compare these associations with the
corresponding associations in the absence of
such phenotype
• Detect associations that are significantly different in the phenotype
32
Data inputs - genes and functions
• Reference genes and functions set (R)
– M genes on a microarray
– N GO terms annotated with M genes
• In a biological condition under study (E)
– m < M differentially expressed (DE) genes
– n <= N GO terms annotated with m DE genes
33
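Gene function matrix – reference data
$$GF = \begin{array}{c|cccc} & f_1 & f_2 & \cdots & f_N \\ \hline g_1 & a_{11} & a_{12} & \cdots & a_{1N} \\ g_2 & a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ g_M & a_{M1} & a_{M2} & \cdots & a_{MN} \end{array}$$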
34
Gene function matrix – reference data
$$GF^R_{M \times N} = \{a_{ij}\} = \begin{cases} 1 & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
Example gene-function matrix
35
Gene function matrix – experiment data
$$GF^E_{m \times n} = \{a_{ij}\} = \begin{cases} 1 & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
Example gene-function matrix
36
Gene function matrix – reference and experiment data
• The experiment gene-function matrix is a sub-part of the reference gene-function matrix
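A minimal sketch of how the two binary matrices might be built from an annotation table. The gene names, term names, and the helper `gene_function_matrix` are hypothetical; real input would come from the microarray and GO annotations.

```python
# Hypothetical annotations: gene -> set of GO terms it is annotated with.
annotations = {
    "g1": {"f1", "f2"},
    "g2": {"f2"},
    "g3": {"f3"},
    "g4": {"f1", "f3"},
}
de_genes = ["g1", "g2"]  # differentially expressed genes (assumed given)

def gene_function_matrix(genes, terms, annotations):
    """Binary matrix: entry is 1 if the gene is annotated with the term, else 0."""
    return [[1 if term in annotations[gene] else 0 for term in terms] for gene in genes]

# Reference matrix: all M genes and all N terms.
ref_genes = sorted(annotations)
ref_terms = sorted(set().union(*annotations.values()))
GF_R = gene_function_matrix(ref_genes, ref_terms, annotations)

# Experiment matrix: only the m DE genes and the n terms annotated with them.
exp_terms = [t for t in ref_terms if any(t in annotations[g] for g in de_genes)]
GF_E = gene_function_matrix(de_genes, exp_terms, annotations)

print(GF_R)  # 4 x 3
print(GF_E)  # 2 x 2, rows and columns drawn from GF_R
```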
37
Challenges and limitations
• GO is incomplete and continuously updated
– Missing information regarding gene annotations
• GO contains inconsistencies
– New research may make previous annotations obsolete
• GO hierarchy poses the challenge of dependencies
– Genes annotated with a specific term are assumed to be annotated with all the ancestors of that term
38
Our approach to solve challenges
• Use singular value decomposition (SVD)
• SVD can find missing relationships between genes
and annotations in the latent semantic space and
also remove noise from data
• Noise: multiple words describing the same
concepts
• SVD is a factorization of a matrix into three
matrices consisting of singular vectors and
singular values corresponding to the original
matrix
39
Singular value decomposition (SVD)
• SVD of a GF matrix: $GF = G\,S\,F^T$
• Columns of matrix G (F) are left (right) singular vectors of GF
• S is a diagonal matrix of singular values $s_i$
– The values on the main diagonal are in non-increasing order and represent variability in the data
40
Matrix approximation – dimensionality
reduction
• An approximated matrix can be computed by
keeping only the first k largest singular values
• We select k that retains the desired data variance
(say x%) using the equation:
41
Approximated matrix – column view
• We approximate both reference and
experiment matrices
• The approximated experiment gene-function
matrix is not a sub-part of the approximated
reference gene-function matrix
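The choice of k described above can be sketched as follows. The sketch assumes a common criterion, namely that the first k squared singular values must account for at least the desired fraction of the sum of all squared singular values. The criterion actually used in this work may differ, and the function name `choose_k` and the sample values are illustrative only.

```python
def choose_k(singular_values, fraction):
    """Smallest k whose leading squared singular values reach `fraction` of the total.

    Assumes `singular_values` is sorted in non-increasing order, as in S.
    """
    total = sum(s * s for s in singular_values)
    if total == 0:
        raise ValueError("all singular values are zero")
    running = 0.0
    for k, s in enumerate(singular_values, start=1):
        running += s * s
        if running / total >= fraction:
            return k
    return len(singular_values)

s = [4.0, 2.0, 1.0, 0.5]   # made-up singular values
print(choose_k(s, 0.90))   # 2
print(choose_k(s, 0.95))   # 3
```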
42
Correlation Between Functions
• Indicates the strength and direction of a linear
relationship between two biological processes
• Pearson correlation coefficient $r_{f_i,f_j}$ between a pair of functions $f_i$ and $f_j$ is computed as:
$$r_{f_i,f_j} = \frac{f_i^T f_j}{\|f_i\|\,\|f_j\|}$$
• Matrices ($R^R_{N \times N}$ and $R^E_{n \times n}$) of correlation coefficients are computed for reference and experiment data (respectively)
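A minimal sketch of building such a correlation matrix from the columns of an (approximated) gene-function matrix. It assumes each column is mean-centered before the normalized dot product is taken, in line with the geometric interpretation of the Pearson coefficient. The function name and the toy matrix are hypothetical.

```python
import math

def column_correlations(matrix):
    """Pairwise Pearson correlations between the columns of `matrix`."""
    columns = [list(col) for col in zip(*matrix)]
    centered = []
    for col in columns:
        mean = sum(col) / len(col)
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    n = len(columns)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # A constant column has zero norm; its correlation is left at 0.0.
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                R[i][j] = dot / (norms[i] * norms[j])
    return R

# Toy matrix: rows are genes, columns are functions f1, f2, f3.
M = [[1.0, 2.0, 1.0],
     [2.0, 4.0, 0.0],
     [3.0, 6.0, -1.0]]
R = column_correlations(M)
print(R[0][1], R[0][2])  # close to 1 and -1
```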
43
Pair-wise Correlation Coefficients for Reference and Experiment data
• $R^R_{n \times n}$ contains the pair-wise correlation coefficients between the first n functions in the absence of phenotype
44
Fisher Z Transform – Correlation
Coefficient To Z-values
• Correlation coefficients computed from samples of a large population can be mapped to z-values using the Fisher z-transform; the transformed values are approximately normally distributed
• For a correlation coefficient r, the Fisher z-transform $Z_r$ can be computed as:
$$Z_r = \frac{1}{2} \ln \frac{1 + r}{1 - r}$$
• Compute $Z^R_r$ from $R^R_{N \times N}$ and $Z^E_r$ from $R^E_{n \times n}$
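The Fisher z-transform is the inverse hyperbolic tangent of r. A short sketch, with a function name chosen here for illustration:

```python
import math

def fisher_z(r):
    """Fisher z-transform of a correlation coefficient; undefined for |r| = 1."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

for r in (0.0, 0.5, -0.9):
    # math.atanh computes the same quantity and serves as a cross-check.
    print(r, fisher_z(r), math.atanh(r))
```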
45
Detecting Changes Between Functional
Interactions
• Hypothesis: Correlation between two biological
processes in the given phenotype differs from the
correlation in the reference data
Hypothesis
Test statistic
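One textbook way to test whether two correlation coefficients differ is to compare their Fisher z-values against a standard normal distribution. The sketch below shows that standard two-sample test. It is an assumption that it matches the hypothesis and test statistic used in this work, and the sample sizes `n_e` and `n_r` (taken here as the number of genes behind each correlation) are likewise assumed.

```python
import math

def correlation_change_test(r_e, n_e, r_r, n_r):
    """Two-sided test of H0: the experiment and reference correlations are equal.

    r_e, r_r: correlation coefficients in experiment and reference data.
    n_e, n_r: sample sizes behind each coefficient (assumed: number of genes).
    Returns the z statistic and the two-sided p-value.
    """
    z_e, z_r = math.atanh(r_e), math.atanh(r_r)   # Fisher z-transforms
    std_err = math.sqrt(1.0 / (n_e - 3) + 1.0 / (n_r - 3))
    z = (z_e - z_r) / std_err
    p = math.erfc(abs(z) / math.sqrt(2.0))        # 2 * (1 - Phi(|z|))
    return z, p

z, p = correlation_change_test(0.9, 231, 0.1, 24000)
print(z, p)  # large |z|, very small p: the correlation changed
```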
46
Improvements
• The dependencies between GO terms can be partially removed by using weights in our matrix.
48
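Scheme 1-1
$$GF^R_{M \times N} = \{a_{ij}\} = \begin{cases} 1 & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
$$GF^E_{m \times n} = \{a_{ij}\} = \begin{cases} 1 & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
• This is a binary scheme and was discussed while describing our main method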
49
Scheme 1-e
$$GF^R_{M \times N} = \{a_{ij}\} = \begin{cases} 1 & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
$$GF^E_{m \times n} = \{a_{ij}\} = \begin{cases} e_i & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
• $e_i$ is the normalized log-transformed fold-change measured for gene $g_i$ in the given condition
50
Scheme IR 1-1
$$GF^E_{m \times n} = \{a_{ij}\} = \begin{cases} 1 \times w_{ij} & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
$$w_{ij} = gb_j \times iab_i$$
$$gb_j = \frac{1}{\#\text{ of genes annotated with } f_j}$$
$$iab_i = -\ln \frac{\#\text{ of annotations for } g_i}{\text{Total annotations}}$$
gb: Gene (annotation) bias – GO DAG related
iab: Inverse annotation bias – experiment related
51
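Scheme IR 1-e
$$GF^R_{M \times N} = \{a_{ij}\} = \begin{cases} 1 \times w_{ij} & \text{if gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
$$GF^E_{m \times n} = \{a_{ij}\} = \begin{cases} e_i \times w_{ij} & \text{if DE gene } g_i \text{ is annotated with GO term } f_j \\ 0 & \text{otherwise} \end{cases}$$
$e_i$ is the normalized log-transformed fold-change and $w_{ij} = gb_j \times iab_i$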
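A small sketch of the weight $w_{ij} = gb_j \times iab_i$ under one reading of the definitions: gb is the reciprocal of the number of genes annotated with the term, and iab is the negative log of the gene's share of all gene-term annotations. Treating "total annotations" as the total number of gene-term pairs is an assumption, and the annotation table is made up.

```python
import math

# Hypothetical annotations: gene -> set of GO terms.
annotations = {
    "g1": {"f1", "f2"},
    "g2": {"f2"},
    "g3": {"f1", "f2", "f3"},
}
# Assumed meaning of "total annotations": number of gene-term pairs.
total_annotations = sum(len(terms) for terms in annotations.values())

def gb(term):
    """Gene (annotation) bias: 1 / number of genes annotated with the term."""
    return 1.0 / sum(1 for terms in annotations.values() if term in terms)

def iab(gene):
    """Inverse annotation bias: -ln(share of all annotations held by the gene)."""
    return -math.log(len(annotations[gene]) / total_annotations)

def weight(gene, term):
    """w_ij = gb_j * iab_i for an annotated pair, 0 otherwise."""
    return gb(term) * iab(gene) if term in annotations[gene] else 0.0

print(weight("g2", "f2"))  # (1/3) * ln(6)
print(weight("g1", "f3"))  # 0.0, g1 is not annotated with f3
```

Under this reading, widely used terms and heavily annotated genes both receive smaller weights, much like inverse document frequency in information retrieval.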
52
Breast cancer data set
• van 't Veer et al. (2002) found some differentially expressed genes in breast cancer
– 24,000 reference genes on the microarray
– 13,201 annotated biological processes from GO
– 231 genes were found to be differentially
expressed
– 246 annotated biological processes with the DE
genes
• Since then no satisfactory prediction has been
made in this regard
54
Breast Cancer Data Set Results
A subset of predicted biological pairs with significant interaction change
Scheme        GO Term 1                    GO Term 2                                       p-value
1-1, IR 1-e   Proteolysis                  Positive regulation of apoptosis                .0001
1-1           Transcription                DNA replication initiation                      .026
1-1           DNA repair                   Regulation of transcription, DNA-dependent      .033
IR 1-1        Vesicle-mediated transport   Transcription from RNA polymerase II promoter   .002
IR 1-1        DNA replication initiation   Phosphoinositide-mediated signaling             .00001
55
Breast Cancer Data Set Results Summary
Number of predicted biological pairs with significant interaction change
Scheme    Cat. 1   Cat. 2   Cat. 3   Accuracy
1-1       10       5        1        93.7%
1-e       16       6        2        91.6%
IR 1-1    9        7        2        88.8%
IR 1-e    15       9        2        92.3%
Total     50       27       7        91.6%
Cat. 1: Known interactions and trivial
Cat. 2: Known interactions and non-trivial
Cat. 3: Unknown
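The accuracy figures appear consistent with reading accuracy as the share of predicted pairs that fall into Cat. 1 or Cat. 2 (known interactions). That reading is inferred from the numbers rather than stated on the slide. A short check under that assumption:

```python
# Breast cancer counts per scheme: (Cat. 1, Cat. 2, Cat. 3).
counts = {
    "1-1": (10, 5, 1),
    "1-e": (16, 6, 2),
    "IR 1-1": (9, 7, 2),
    "IR 1-e": (15, 9, 2),
}

def accuracy(cat1, cat2, cat3):
    """Assumed definition: percentage of predictions that are known interactions."""
    return 100.0 * (cat1 + cat2) / (cat1 + cat2 + cat3)

for scheme, row in counts.items():
    print(scheme, round(accuracy(*row), 2))

totals = tuple(sum(col) for col in zip(*counts.values()))
print("Total", totals, round(accuracy(*totals), 2))
```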
56
Lung cancer data set
• Beer et al. (2002) found some differentially
expressed genes in lung cancer
– 5541 reference genes on the microarray
– 2908 annotated biological processes from GO
– 87 genes were found to be differentially expressed
– 248 annotated biological processes with the DE
genes
57
Lung Cancer Data Set Results Summary
Number of predicted biological pairs with significant interaction change
Scheme    Cat. 1   Cat. 2   Cat. 3   Accuracy
1-1       16       3        2        90.4%
1-e       39       3        2        95.4%
IR 1-1    29       2        0        100.0%
IR 1-e    38       9        3        94.0%
Total     122      17       7        95.21%
58
Summary
• Various stimuli cause differential gene expression,
which results in the expression of a disease and
disease-specific phenotype
• Biological processes interact and their interactions change in a given phenotype
• We proposed methods to detect such significantly changed interactions in the observed phenotype
• We used a vector space model, matrix approximation, and statistical hypothesis testing to find changed interactions between biological processes from GO
• Results showed 89% or more accuracy for our proposed methods
59
References:
• Ansari, N. A., Bao, R., and Drăghici, S. Detecting phenotype-specific interactions between biological processes from microarray data and annotations. Bioinformatics, under revision.
• Drăghici, S. Data Analysis Tools for DNA Microarrays. Chapman and Hall/CRC Press, 2003 (first print), 2006 (second print)
• Berry, M. W., Drmac, Z., and Jessup, E. R. Matrices, vector spaces, and information retrieval. SIAM Review 41, 2 (1999), 335-362
• Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391-407
• Done, B., Khatri, P., Done, A., and Drăghici, S. Predicting novel human Gene Ontology annotations using semantic analysis. IEEE/ACM Transactions on CBB (2009)
60
Special Thanks to
• Dr. Sorin Draghici
61
Thank You
62