PPI network construction and
false positive detection
Jin Chen
CSE891-002
2011 Spring
Layout
• Protein-protein interaction (PPI) networks
• PPI network construction
• PPI network false-positive detection
Background
• Study of interactions between proteins is fundamental to the
understanding of biological systems
• PPIs have been studied through a number of high-throughput experiments
• PPIs have also been predicted through an array of computational methods
that leverage the vast amount of sequence data generated
• Comparative genomics at the sequence level has indicated that species
differences are due more to differences in the interactions between
component proteins than to the individual genes themselves *
* Valencia A, Pazos F: Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 2002, 12:368-373.
PPI at different levels
3D structure
Protein folding
Protein docking
Domain
Nidhi et al. DSiMB 2009
PPI at different levels
• Node – protein: every node represents a unique protein
• Edge – protein interaction: physical interaction or functional interaction
Hawoong Jeong
PPI Identification
• The concept of PPI ranges from direct physical interactions
inferred from experimental methods (e.g. yeast two-hybrid) to
functional linkages predicted by computational analysis of
protein sequences and structures
• Given the difficulties in experimentally identifying PPIs, a wide
range of computational methods has been used to identify
functional PPIs
Domain Fusion
Hypothesis: if domains A and B exist fused in a single polypeptide AB in
another organism, then A and B are functionally linked
Marcotte EM et al. Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science, 285(5428) 751-753 1999
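The hypothesis above can be sketched in a few lines. This is a minimal illustration, not the procedure of Marcotte et al.: it assumes each protein has already been annotated with a set of domain identifiers, and the function name `domain_fusion_links` and the toy inputs are hypothetical.

```python
from itertools import combinations


def domain_fusion_links(query, reference):
    """Predict functional links between query proteins.

    query, reference: dict mapping protein name -> set of domain names
    (assumed to come from an upstream domain-annotation step).
    Two query proteins are linked when one carries domain A, the other
    carries domain B, and some reference protein carries both A and B.
    """
    links = set()
    for ref_domains in reference.values():
        for a, b in combinations(sorted(ref_domains), 2):
            # Query proteins that carry only one of the two fused domains
            only_a = [p for p, d in query.items() if a in d and b not in d]
            only_b = [p for p, d in query.items() if b in d and a not in d]
            for pa in only_a:
                for pb in only_b:
                    links.add(tuple(sorted((pa, pb))))
    return links


# Toy data: domains A and B are fused in reference protein R1
query = {"P1": {"A"}, "P2": {"B"}, "P3": {"C"}}
reference = {"R1": {"A", "B"}, "R2": {"C"}}
print(domain_fusion_links(query, reference))
```

On the toy data only P1 and P2 are linked, because R1 is the only reference protein that carries two domains.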
Domain Fusion
• Further inclusion of eukaryotic sequences increased the robustness of
domain fusion predictions *
• Eukaryotic cells, with their larger volume, cannot afford to keep A and B
as separate proteins: the concentrations of A and B required to reach the
same equilibrium concentration of AB would be prohibitively high
• Limitation: low coverage
* Veitia RA: Rosetta Stone proteins: "chance and necessity"? Genome Biol 2002, 3(2):interactions1001.1-1001.3.
Conserved Neighborhood
Hypothesis: If the genes that encode two proteins are neighbors on the
chromosome in several genomes, the corresponding proteins are likely to be
functionally linked
Dandekar T et al. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23(9):324-328
Conserved Neighborhood
• The method has been reported to identify high-quality functional
relationships
• The method suffers from low coverage, due to the dual requirement of
identifying orthologues in another genome and then finding those
orthologues that are adjacent on the chromosome
Marcotte EM: Computational genetics: finding protein function by nonhomology methods. Curr Opin Struct Biol 2000, 10:359-365
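The dual requirement (orthologues found, then adjacent) can be sketched as follows. This is an illustrative simplification: it assumes each genome is given as an ordered list of orthologue-group identifiers, ignores strand and chromosome circularity, and uses hypothetical names (`conserved_neighbors`, `min_genomes`, `max_gap`).

```python
def conserved_neighbors(genomes, min_genomes=2, max_gap=1):
    """Return gene pairs that are chromosomal neighbors in several genomes.

    genomes: dict mapping genome name -> list of orthologue-group IDs in
    chromosomal order (assumed input from an orthology-mapping step).
    A pair counts once per genome if the two genes lie within max_gap
    positions of each other.
    """
    counts = {}
    for order in genomes.values():
        seen = set()
        for i, g in enumerate(order):
            for h in order[i + 1 : i + 1 + max_gap]:
                if g != h:
                    seen.add(frozenset((g, h)))
        for pair in seen:
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, n in counts.items() if n >= min_genomes}


# Toy gene orders: only a and b are adjacent in more than one genome
genomes = {
    "g1": ["a", "b", "c", "d"],
    "g2": ["c", "a", "b"],
    "g3": ["b", "a", "d"],
}
print(conserved_neighbors(genomes))
```

Raising `min_genomes` trades coverage for reliability, which is the low-coverage limitation noted above.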
Phylogenetic Profiles
• Hypothesis: functionally linked proteins would co-occur in genomes
• Phylogenetic profile of a protein can be represented as a 'bit string',
encoding the presence or absence of the protein in each of the genomes
considered
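A minimal sketch of the bit-string representation, assuming presence/absence calls are already available for each protein. The genome labels G1 to G4 and the proteins X, Y, Z are placeholders, and Hamming distance is only one of several possible ways to compare profiles.

```python
def profile(present_in, genomes):
    """Encode presence (1) or absence (0) of a protein across genomes."""
    return "".join("1" if g in present_in else "0" for g in genomes)


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


genomes = ["G1", "G2", "G3", "G4"]
presence = {
    "X": {"G1", "G3", "G4"},
    "Y": {"G1", "G3", "G4"},
    "Z": {"G2"},
}
profiles = {name: profile(s, genomes) for name, s in presence.items()}

print(profiles["X"], profiles["Y"], profiles["Z"])  # 1011 1011 0100
print(hamming(profiles["X"], profiles["Y"]))        # 0: candidate functional link
print(hamming(profiles["X"], profiles["Z"]))        # 4: never co-occur
```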
Co-evolution
• Hypothesis: Co-evolution requires the existence of mutual selective
pressure on two or more species
• The in silico two-hybrid (i2h) method has been proposed, based on the study
of correlated mutations in multiple sequence alignments
Software: Protein Link Explorer (PLEX)
Date, S.V. and E.M. Marcotte, Protein function prediction using the Protein Link EXplorer (PLEX). Bioinformatics, 2005. 21(10): p. 2558-2559.
Biological Problem → Algorithm → Knowledge
1. Biological hypothesis
2. Mathematical representation
3. Algorithm design
4. Biological verification
High-throughput PPI Detection
• Booming of biotechnology
– Yeast two-hybrid / split-ubiquitin system
– Mass spectrometry
– Protein microarrays
– etc.
• Limitations of computational prediction
– Low coverage
– Locally optimized (pair-wise)
– Super-high negative PPI rates
Yeast Two-Hybrid
• Two hybrid proteins are generated with transcription
factor domains
• Both fusions are expressed in a yeast cell that carries a
reporter gene whose expression is under the control of
binding sites for the DNA-binding domain
[Figure: yeast two-hybrid schematic showing the activation domain, prey protein, bait protein, binding domain and reporter gene]
Yeast Two-Hybrid
• Interaction of bait and prey proteins localizes the
activation domain to the reporter gene, thus activating
transcription
• Since the reporter gene typically codes for a survival
factor, yeast colonies will grow only when an interaction
occurs
Mating-based Split-ubiquitin System
Lalonde S et al. Plant J 2008
[Figure: Yeast Cell Growth Rate (biomass); the trends for yeast cell growth over time]
PPI Databases
• STRING – PPIs derived from high-throughput experimental data, mining of
databases and literature, analyses of co-expressed genes, and
computational predictions
• HPRD – Human Protein Reference Database. It integrates information
relevant to the function of human proteins in health and disease
• DIP – Experimentally derived PPIs with assessments. DIP is generally
considered a valuable benchmark for verifying the performance of any new
method for prediction of PPIs
• Others: MIPS, YGD, BIND, TAIR…
False-Positive Detection in PPI Networks
• Background: PPI networks generated with high-throughput
methods contain a sizeable number of false-positives and
their reproducibility is not satisfactory*
• Central to the understanding of PPI is the definition of
“interaction” itself
– Binding energy / Interaction / Complex
– We need to define what we mean by interaction
* von Mering C et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399-403
Useful Data for False-Positive Detection
• Functional and localization data (Gene Ontology)
• Indirect high-throughput data (gene and protein
expression)
• Sequence-related data (protein domains (domain fusion),
interologs)
• Structure data (protein 3D structure)
• Network topological features (connectivity, network motif)
Different Hypotheses for Different Data
Data: Example of Hypothesis
• Gene Ontology: two proteins that share a similar annotation are more likely
to interact than proteins with different or null annotations
• Gene Expression: two proteins whose genes have similar expression patterns
are more likely to interact
• Domain Interaction: if two domains are often found in PPIs, two proteins
containing such domains are more likely to interact
• PPI network topological analysis: PPIs whose topologies fit the spoke or
matrix models are more likely to be true
Other hypotheses include: synthetic lethality, interologs, linear motifs, etc.
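As one illustration, the gene expression hypothesis can be turned into an edge score by correlating expression profiles. The sketch below assumes Pearson correlation as the similarity measure (other measures are possible) and uses made-up expression values; `score_edges` is a hypothetical helper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def score_edges(edges, expression):
    """Attach an expression-similarity score to each candidate PPI."""
    return {(u, v): pearson(expression[u], expression[v]) for u, v in edges}


# Made-up expression levels over four conditions
expression = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],   # rises with A
    "C": [8.0, 6.0, 4.0, 2.0],   # falls as A rises
}
scores = score_edges([("A", "B"), ("A", "C")], expression)
for edge, r in scores.items():
    print(edge, round(r, 3))
```

Under this hypothesis the A-B interaction would be kept and A-C would be flagged as a likely false positive; where to place the cutoff is a separate design decision.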
Gold Standard for PPI Networks
• For algorithm evaluation and comparison
• To train a model as positive training data
• Manually annotated databases such as DIP
• Interactions from low-throughput experiments
• True negative set is equally important
– Co-localized? No?
Estimate PPI Network Reliability
• Overall index of reliability of a PPI network
$\text{precision} = \dfrac{TP}{TP + FP}$
$\text{recall} = \dfrac{TP}{TP + FN}$
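A small sketch of how the two indices could be computed against a gold standard. It assumes undirected interactions and treats every predicted pair missing from the gold standard as a false positive, which understates precision when the gold standard is incomplete. The pair lists are toy data.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted PPI set against a gold standard.

    Interactions are treated as undirected, so (A, B) equals (B, A).
    """
    pred = {frozenset(e) for e in predicted}
    ref = {frozenset(e) for e in gold}
    tp = len(pred & ref)   # predicted and in the gold standard
    fp = len(pred - ref)   # predicted but not in the gold standard
    fn = len(ref - pred)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


predicted = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
gold = [("B", "A"), ("C", "B"), ("E", "F")]
print(precision_recall(predicted, gold))  # TP=2, FP=2, FN=1
```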
Estimate PPI Network Reliability
• “Capture-recapture” model: goes back to the raw counts of observed
bait–prey clones in yeast two-hybrid experiments
Huang et al. Where Have All the Interactions Gone? Estimating the Coverage of Two-Hybrid Protein Interaction Maps. PLoS Computational Biology 2007
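To convey the capture-recapture idea, the sketch below uses the classical two-sample Lincoln-Petersen estimator. This is a stand-in chosen for simplicity, not the model of Huang et al., which works on raw clone counts. It assumes two independent screens of the same interactome; the pair lists are toy data.

```python
def lincoln_petersen(screen1, screen2):
    """Estimate the total number of detectable interactions.

    Classical estimator N = n1 * n2 / m, where n1 and n2 are the numbers
    of interactions seen in each screen and m is the number seen in both.
    """
    s1 = {frozenset(e) for e in screen1}
    s2 = {frozenset(e) for e in screen2}
    overlap = len(s1 & s2)
    if overlap == 0:
        raise ValueError("no shared interactions: estimate is undefined")
    return len(s1) * len(s2) / overlap


screen1 = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
screen2 = [("B", "A"), ("C", "B"), ("E", "F")]
print(lincoln_petersen(screen1, screen2))  # 4 * 3 / 2 = 6.0
```

A small overlap between screens pushes the estimate up, which is how low reproducibility translates into low estimated coverage.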
PPI Filtering
• GOAL: To identify reliable protein complexes from two
existing mass spectrometry (MS) datasets
• Analyze the data with a purification enrichment (PE) scoring
system
• Using gold standard PPIs, the consolidated dataset is of
greater accuracy than the original sets and is comparable to
PPIs defined using more conventional small-scale methods
Collins et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics. 2007 Mar;6(3):439-50
PPI Filtering
$e_{\text{observation}} = \log_{10} \dfrac{P(\text{observation} \mid \text{TruePPI})}{P(\text{observation} \mid \text{FalsePPI})}$
• $e = 0$ → no evidence for or against the validity of a particular
interaction was collected
• Two types of observations: bait-prey observations and prey-prey
observations
$\mathrm{PE}_{ij} = \sum_k e_{ijk} + \sum_k e_{jik} + M_{ij}$
• i and j are two proteins (bait & prey). k indicates a distinct
purification. $M_{ij}$ measures indirect evidence due to co-occurrence of
proteins i and j as preys in the same purifications
PPI Filtering
$e_{ijk} = \log_{10} \dfrac{r + (1 - r)\, p_{ijk}}{p_{ijk}}$
where r is the probability that a true association will be preserved
and detected in a purification experiment, and $p_{ijk}$ is the probability
that a bait-prey pair will be observed for nonspecific reasons
$p_{ijk} = 1 - \exp\left(-f_j \, n_{ik}^{\text{prey}} \, n_i^{\text{bait}}\right)$
where $n_{ik}^{\text{prey}}$ is the number of preys identified in purification k with bait i, $n_i^{\text{bait}}$
is the number of times protein i was used as bait, and $f_j$ is an estimate of the
nonspecific frequency of occurrence of prey j in the dataset
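The bait-prey part of the PE score can be sketched directly from these definitions. The sketch assumes the exponent in the nonspecific-observation term is negative, so that p is a valid probability, and takes the indirect term M as a precomputed number. The values of r, f and the counts below are illustrative, not taken from Collins et al.

```python
from math import exp, log10


def p_nonspecific(f_j, n_prey_ik, n_bait_i):
    """Probability that prey j shows up with bait i in purification k by chance."""
    return 1.0 - exp(-f_j * n_prey_ik * n_bait_i)


def e_score(r, p):
    """Log-odds evidence of one bait-prey observation.

    r: probability that a true association is preserved and detected.
    p: probability of observing the pair for nonspecific reasons.
    """
    return log10((r + (1.0 - r) * p) / p)


def pe_score(e_ij, e_ji, m_ij=0.0):
    """PE score: evidence with i as bait, with j as bait, plus indirect term M."""
    return sum(e_ij) + sum(e_ji) + m_ij


# Illustrative numbers only
p = p_nonspecific(f_j=0.01, n_prey_ik=10, n_bait_i=2)
e = e_score(r=0.5, p=p)
print(round(p, 4), round(e, 3))
print(round(pe_score([e, e], [e]), 3))  # seen twice with i as bait, once with j as bait
```

A prey that is rarely seen nonspecifically (small p) contributes more evidence per observation, while p = 1 contributes none.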
PPI Filtering
• PPI topological analysis
– The first student presentation covers a topological measure
called “FS-weight” and compares it with other
topological measures
– Suitable for large PPI networks rather than preliminary
networks
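To show what a topological reliability score looks like, the sketch below scores each edge by the Jaccard overlap of the two proteins' neighborhoods. This is a simpler stand-in, not FS-weight, and the edge list is toy data. Each protein is included in its own neighborhood, which is a design choice of this sketch.

```python
def jaccard_edge_scores(edges):
    """Score each interaction by how many interaction partners its ends share."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {}
    for u, v in edges:
        nu = adj[u] | {u}   # closed neighborhood of u
        nv = adj[v] | {v}   # closed neighborhood of v
        scores[(u, v)] = len(nu & nv) / len(nu | nv)
    return scores


edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
for edge, s in jaccard_edge_scores(edges).items():
    print(edge, s)
```

Edges inside the A-B-C triangle score higher than the dangling C-D edge. Scores of this kind need enough neighbors to be informative, which is why such measures suit large networks better than preliminary ones.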