Download 33511-33521

Document related concepts

Endomembrane system wikipedia , lookup

Protein (nutrient) wikipedia , lookup

Proteasome wikipedia , lookup

G protein–coupled receptor wikipedia , lookup

Magnesium transporter wikipedia , lookup

Protein phosphorylation wikipedia , lookup

Signal transduction wikipedia , lookup

Protein structure prediction wikipedia , lookup

SR protein wikipedia , lookup

Bacterial microcompartment wikipedia , lookup

Protein wikipedia , lookup

Protein moonlighting wikipedia , lookup

List of types of proteins wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

JADE1 wikipedia , lookup

Intrinsically disordered proteins wikipedia , lookup

Chemical biology wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Proteolysis wikipedia , lookup

Protein mass spectrometry wikipedia , lookup

Proteomics wikipedia , lookup

Transcript
INTERNATIONAL COLLABORATION IN
PROTEOMICS AND INFORMATICS
Bibliotheca Alexandrina, 9 October, 2007
Gilbert S. Omenn, M.D., Ph.D.
Center for Computational Medicine & Biology
Chair, HUPO Plasma Proteome Project
University of Michigan, Ann Arbor, MI, USA
1
It Is Such A Great Pleasure to Visit
The Bibliotheca Alexandrina
One of the Wonders of the Modern World!
“The First Digital Library, from its Birth”
Facilitating International Collaboration in
Science and Technology
2
Nearly-Complete Human Genome Sequence, 15-16 Feb 2001
3
We Live in a New World of Life Sciences
New Biology---New Technology: a “parts list”
Genome Expression Microarrays
Comparative Genomics + CNV + miRNA
Proteomics and Metabolomics
Bioinformatics & Computational Biology
• Mechanism- & Evidence-Based Medicine:
“What were you doing up to now?!”
• Predictive, personalized, preventive,
participatory healthcare and community
health services
4
Key Components of the Vision of
Biology As An Information Science
• An avalanche of genomic information: validated
SNPs, haplotype blocks, candidate genes/alleles,
proteins, & metabolites--associated with disease risk
• Powerful computational methods
• Effective linkages with better environmental and
behavioral datasets for eco-genetic analyses
• Credible privacy and confidentiality protections
• Breakthrough tests, vaccines, drugs, behaviors, and
regulatory actions to reduce health risks and costeffectively treat patients globally.
5
A Golden Age for the
Public Health Sciences
Sequencing and analyzing the human genome is
generating genetic information that must be linked with
information about:
• Nutrition and metabolism
• Lifestyle behaviors
• Diseases and medications
• Microbial, chemical, physical exposures
Every discipline of public health sciences needed.
6
Definitions
Genetics is the scientific study of genes and their
roles in health and disease, physiology, and
evolution.
Genomics is a modern subset of the broader field of
genetics, made feasible by remarkable advances in
molecular biology, biotechnology, and
computational sciences, to examine the entire
complement of genes and their actions.
Global analyses permit us and require us to go
beyond the known “lamp-posts” of individual gene
associations and effects.
7
Proteins are the action molecules of the cell and the
leading candidates for biomarkers—in tissues and
in the blood. Proteins are coded for by genes.
Understanding one protein can be a lifetime’s work!
Proteomics is the global analysis of proteins in cells or
body fluids. Techniques for global analysis of
proteins are advancing rapidly, especially for
discovery of biomarkers for diagnosis, treatment,
and prevention.
Metabolomics is the global analysis of metabolites.
Proteomics + metabolomics + epigenomics =
“functional genomics”
8
Protein
DNA
9
Rationale for Proteomics
Proteins are much closer to the pathophysiologic
changes and molecular targets for drugs than are
mRNAs.
Changes in mRNAs are clues, but changes in
corresponding proteins often are not highly
correlated.
Advances in fractionation of complex tissue and
plasma protein mixtures, in mass spectrometry,
and in curated databases of proteins help address
complexity, dynamic range, and uncertainty of
protein identifications.
10
A Vision For Proteomics
Multiple protein biomarkers discovered
Biomarkers combined on diagnostic chips
Detect organ location of cancers, for surgery
or radiation
Detect mechanism of disease for
chemotherapy, even if location unknown
Mechanistic, rather than “geographic”
classification
Better efficacy/less toxicity for all types of
patients
11
Status of Proteomics Assays
• Many technology platforms of increasing
sensitivity and resolution
• Patterns or specific proteins still just
biomarker candidates —most lack
independent confirmation and coefficient of
variation, let alone “validation” with
standard clinical chemistry parameters of
sensitivity, specificity, and especially positive
predictive value
• Approaches of clinical chemistry needed to
guide further development of the field
12
Barriers for Proteomic Cancer Biomarker
Discovery in Plasma
Human cancers are very heterogeneous
Tumor proteins are in low abundance for
early detection of cancers
Tumor proteins are greatly diluted upon
release to ECF and blood
Plasma is an extraordinarily complex
specimen dominated by high abundance
proteins (50% by weight is albumin)
Knowledge of the plasma proteome is still
limited
13
Outline of Lecture
1. Review of the vision, strategy, and
output of the HUPO Human Plasma
Proteome Project Pilot Phase
2. Objectives for the New Phase of the
Plasma Proteome Project
3. Example of the power of computational
tools and collaborations (if time)
14
HUPO
The international Human Proteome
Organization (HUPO) was founded in
2001. Its aims are:
1. To advance the science of proteomics
2. To enhance training in proteomics
3. To build international initiatives by
organ (liver, brain, kidney), biofluid
(plasma, urine, CSF, saliva), and
disease (cardiovascular, cancers), plus
antibodies and data standards.
15
Proteomics Interaction Map
Ruth McNally, sociologist
16
Samir Hanash, founding President of HUPO
Gil Omenn, leader of HUPO PPP
17
THE PLASMA PROTEOME
Advantages: The most available human
specimen; the most comprehensive sample of
tissue-derived proteins; the basis for a Disease
Biomarkers Initiative tied to organ proteomes.
Specific Disadvantages:
Extreme complexity/enormous dynamic range
High risk of ex vivo modifications
Lack of highly standardized protocols
General Challenges: Inadequate appreciation of
incomplete sampling by MS/MS; evolving
annotations and unstable databases
18
Long-Term Scientific Goals of the HUPO
Human Plasma Proteome Project
1. Comprehensive analysis of plasma and serum
protein constituents in people
2. Identification of biological sources of variation
within individuals over time, with validation of
biomarkers
Physiological: age, sex/menstrual cycle, exercise
Pathological: selected diseases/special cohorts
Pharmacological: common medications
3. Determination of the extent of variation across
populations and within populations
19
Scheme Showing Aims and Linkages of the
HUPO Plasma Proteome Project, Pilot Phase
Serum vs
Plasma
Reference
Specimens
HUPO PPP
Participating Labs
Technology Platforms-Separation and
Identification
HUPO HUMAN
PLASMA PROTEOME
PROJECT (PPP)
Technology
Vendors
Development &
Validation of
Biomarkers
Liver and Brain
Proteome, Antibody,
Protein Stds Projects
Omenn GS. The Human Proteome Organization Plasma Proteome Project Pilot Phase: Reference Specimens, Technology
20
Platform Comparisons, and Standardized Data Submissions and Analyses. Proteomics 2004;4:1235-1240.
OUTPUT FROM PPP Pilot Phase
Special Issue Aug 2005, Proteomics, “Exploring
the Human Plasma Proteome”: 28 papers—
collaborative analyses and annotations, plus
lab-specific analyses, and Wiley book (2006)
Publicly-accessible datasets:
www.ebi.ac.uk/pride [EBI]
www.peptideatlas.org/repository [ISB]
www.bioinformatics.med.umich.edu/hupo/ppp
Additional papers are encouraged:
Nature Biotechnology 2006; 24:333-338 (States et al)
Genome Biology 2006;7:R35 (Fermin et al)
Proteomics 2006; 6: 5662-5673 (Omenn)
Numerous citations/comparisons of datasets
21
22
SERUM AND PLASMA REFERENCE SPECIMENS
1. BD: specially prepared male/female pooled samples,
divided into EDTA-, Heparin-, and Citrate-anticoagulated Plasma and Serum (250 ul x4 of each).
BD clot activator. No protease inhibitors. Three
separate ethnic pools prepared. Shipped frozen.
2. Chinese Academy of Medical Sciences: Sets of three
plasmas + serum, similar to BD protocol.
3. National Institute for Biological Standards & Control,
UK: citrate-anti-coagulated, freeze-dried plasma, from
25 donors, prepared for Intl Soc Thrombosis &
Hemostasis, 1 ml aliquots/ampoules.
23
Specifications for Data Submission
Each of 55 labs agreed (July, 2003 Workshop) to
provide, and 31 labs did provide:
a) a detailed experimental protocol, to “push the
limits” to detect low-abundance proteins
b) peptide sequences, rated as “high” or “lower”
confidence, based on MS/MS criteria
c) protein IDs from IPI 2.21 (July 2003) and
search engine parameters used to align peptide
sequences with proteins in human database
Later, we obtained m/z peak lists and raw spectra
(by DVD) for independent analyses.
24
From Peptides to Genome Annotation
extraction
digestion
database
search
LC-MS/MS
200 400 600 80010001200
m/z
Sample
Proteins
Peptides
Mass Spectrum
Spectrum
Peptide Probability
Spectrum 1 LGEYGH
1.0
…
…
…
Spectrum N EIQKKF
0.3
SBEAMS
BLAST
protein
database
Map to
genome
Peptides
statistical
filtering
Peptide
… Chrom Start_Coord End_Coord …
PAp00007336 … X
132217318 132217368 …
…
… …
…
…
…
visualization
25
Genome Browser
PeptideAtlas Database
Numbers of Proteins Identified
(LC-MS/MS or FTICR-MS, 18 labs)
From 15,519 reported distinct protein IDs in IPI
2.21, we chose one representative/cluster:
(a) 9504 = 1 or more peptide matches
(b) 3020 = 2+ peptide matches (Core Dataset)
(c) 1274 = 3 or more peptide matches
(d) 889 = follow-up high-stringency analysis
with adjustments for protein length and
multiple (43,000) comparisons in IPI v2.21
(Nature Biotech 2006; 24:333-338)
26
GREATEST RESOLUTION AND
SENSITIVITY
The most extensive high-confidence yield was
from combined methods of immunoaffinity
(“top-6”) depletion, 2 or 3-D highresolution fractionation, and then ESIMS/MS with ion-trap LTQ instrument.
LTQ gave several fold more IDs (1168) than
did LCQ (271) in same hands (B1-serum vs
B1-heparin) and obtained multiple peptides
for many proteins which had just one hit
with LCQ.
27
SPECIFIC OBSERVATIONS: DEPLETION
• Many investigators depleted albumin
and/or immunoglobulins
• Several were provided Agilent
immunoaffinity column to remove “top6” proteins
• Much higher numbers of identifications
after depletion if sufficient fractionation
• Inadvertent removal of other proteins;
“sponge” effect of albumin
• Assay both flow-through & bound
fractions
28
SPECIMEN VARIABLES
What evidence have we developed for choice of
specimens for analysis?
Plasma preferred over serum—more consistent, less
degradation
EDTA-plasma preferred over heparin interferences
and citrate dilution
Clot activator? necessary only for serum
Minimize freeze/thaw cycles (archives)
Minimal evidence of platelet activation [4C]
Protease inhibitors desirable, but alter proteins
29
INFLUENCE OF ABUNDANCE
Using quantitative immunoassays and
microarrays (generally unknown
epitopes), we have found very high rates
of detection of the more abundant
proteins, less in the mid-range, and
occasional detection of very low
abundance proteins, as expected.
High correlation (r=0.9) between # peptides
and measured concentrations
30
Least Abundant Proteins Identified
with two distinct peptides
(pg/ml: range 200 pg/ml to 20 ng/ml)
Alpha fetoprotein
2.9E+-02
TNF-R-8
3.3E+02
TNF-ligand-6
1.5E+03
PDGF-R alpha
4.6E+03
Leukemia inhibitory factor receptor 5.0E+03
MMP-2/gelatinase
8.8E+03
EGFR
1.1E+04
TIMP-1
1.4E+04
IGFBP-2
1.5E+04
Activated leukocyte adhesion mol 1.6E+04
Selectin L [five labs;10 peptides]
1.7E+04
31
BIOLOGICAL INSIGHTS
The proteins identified can be annotated by
many methods. We have searched multiple
databases, including Gene Ontology, Novartis
Atlas, Online Mendelian Inheritance in Man
(OMIM), incomplete or unidentified
sequences in the human genome, microbial
genomes, InterPro protein domains,
transmembrane domains, secretion signals.
See Proteomics 2005; 5:3226-3519; Wiley, 2006
32
GENE ONTOLOGY SPECIFIC TERMS
Over-represented in PPP 3020 (vs whole
genome): “extracellular”, “immune response”,
“blood coagulation”, “lipid transport”,
“complement activation”, “regulation of blood
pressure”, as expected; also: cytoskeletal
proteins, receptors and transporters.
Proteins from most cellular locations and
molecular processes are recognized.
Under-represented: “perception of smell” (1 vs
25 exp); cation transporters, ribosomal
proteins, G-protein coupled receptors, and
nucleic acid binding proteins.
33
InterPro Protein Domain Analysis
Compared with the whole human genome, the
3020 PPP proteins are:
Over-represented for EGF, intermediate filament
protein, sushi, thrombospondin, complement
C1q, and cysteine protease inhibitor.
Under-represented: Zinc finger (C2H2, B-box,
RING), tyrosine protein phosphatase, tyrosine
and serine/threonine protein kinases, helixturn-helix motif, and IQ calmodulin binding
region domains.
34
TRANSMEMBRANE AND SECRETED
PROTEIN FEATURES
1297 of 3020:
SwissProt Annotated
Transmembrane
230
Secretion signal
1723 of 3020:
TM domain(s)
Secretion signal
373
ProFun
151
Both
104
420
358
ProFun Predicted
137
255
35
Cardiovascular-Related Proteins Biomarker
Candidates in the PPP Database
Proteins characterized in eight groups:
Inflammation
Vascular
Signaling
Growth and differentiation
Cytoskeletal
Transcription factors
Channels
Receptors
36
Comparison of Five Search Algorithms
Using PPP data, Kapp et al (Proteomics
2005) found Sequest and Spectrum Mill
more sensitive and MASCOT, Sonar, and
X!Tandem more specific for peptide
identifications at specified false-positive
rates.
Some investigators have reported using
combinations of two or more search
engines. Decision rules are necessary.
37
Can We Overcome the Idiosyncrasies of
Individual Instruments and Laboratories?
Several informatics investigators approached
the human PPP with an offer to re-analyze
the complete MS/MS datasets using their
own software and criteria from the raw
spectra (or peaklists).
These analyses eliminated the heterogeneity
of search algorithms, search parameters,
and idiosyncrasies of individual labs.
The results are hard to compare, given
different extent of analysis. However, each
can be compared with the Core Dataset. 38
Independent Analyses from Raw Spectra
(#IDs with 2+ peptides)
Core Dataset (18 datasets, 3020)
• PepMiner (Beer, 8 large datasets, 2895)
[1051 in 3020 dataset, + 700 in the 9504]
• X!Tandem (Beavis/States, 18 datasets,
2678) [577 in the 3020; 218 in the 889]
• PeptideProphet/ProteinProphet
(Deutsch, 7+ datasets, 960)[479 in 3020]
• Mascot/Digger (Kapp, Australia, 14
datasets, 513 with 1.4% error rate;
39
ongoing analysis
What is Required and Feasible to Enhance the
Statistical Robustness of Findings?
Many complex proteomics analyses are done
once, without replicates required to estimate
coefficient of variation or other standard
parameters for clinical chemistry use.
“Five to ten independent repetitions of the
experiments are a must” [Hamacher et al,
Proteomics in Drug Discovery, 2006].
How should we determine how similar or
different are samples A and B, or the results of
methods X and Y? What decision rules apply?
We have a long way to go from discovery
research to clinical applications.
40
Comparison of 5 Published Reports on Plasma
Proteins with HUPO PPP Datasets
Report #IDs #IPI in 3020
Anderson 1175 990 316
Shen
[1682] 1842 213
Chan
1444 1019 257
Zhou
210 148 51
Rose
405 287 142
in 9504
471
526
402
88
159
41
Comparison of New Biofluid Proteome
Findings with HUPO PPP-3020 Proteins
Proteome
Urine
tears
semen
# Proteins
1543
491
923
IPI 2.21
910
313
560
PPP-3020
293
117
180
Refs from Matthias Mann Lab, Genome Biology,
2007, different IPI versions.
Comparison, Omenn, Proteomics-Clinical
Applications (2007).
42
NEXT PHASE OF PPP (PPP-2)
1. Standard operating procedures (SOPs),
including EDTA-plasma as standard
specimen; replication and confirmation of
results
2. Quantitation and subproteomes, using
new methods and advanced instruments
3. Databases and robust bioinformatics
4. Clinical chem/disease-related studies
43
PPP-2 Research & Technology Thrusts
Learned a lot from Pilot Phase—plasma is a very
complex specimen; no single platform sufficient;
analyses currently far from comprehensive, let
alone reproducible; now have improved data
quality and informatics resources.
PPP-2: use multiple methods; focus on biomarker
discovery; build upon already-funded
laboratories and repositories.
44
Specific Technology Recommendations
N-Glycosite (proteotypic) peptide resource is a
special subproteome likely to have high biomarker
relevance.
Capture glycoproteins, digest with trypsin and
PNGase F to yield N-linked glycopeptides. Choose
one unique to each protein; a finite number; not
all proteins. Use complementary lectin approach to
characterize glycans.
Prepare isotope-labeled N-glyco-peptides for
multiple uses as standards and to spike specimens.
45
N-Glycosites
Glycoproteins are enriched on cell surface, in
secreted proteome and in plasma
Glycoproteins tend to be stable
Only few glycosites per protein: reduction in
sample complexity (excludes albumin)
Inherent validation of N-glycosite by fragment
ion spectrum
N-glycosite subproteome is probably the one
easiest to completely map
46
Glycopeptide Isolation
Capture
Wash
Non-glycoproteins
Trypsin digestion
Non-glyco-peptides
Wash
 Asn  Asp
PNGaseF Digestion
N-linked
glycopeptides
47
Zhang H., Li X.-J., Martin D.B. & Aebersold R. (2003) Nat Biotech 21: 660-666
Flow chart of process
Tissue Samples
Normal & Disease
Plasma Samples
Capture / Digestion
'Glycopeptide' Fract.
'Glycopeptide' Fract.
LC-MS
LC/MS Maps
Target peptides
Data Analysis
MRM LC/MS/MS
Targeted LC/MS/MS
Data Analysis
48
Reducing Complexity: Glycoprotein-Enriched
Subproteomes
Methods
Lab 2
Lab 11
Enrichment hydrazide chem lectin chrom’y
Peptide Fxn
SCX + RP
RP
Mass Spec
qtof
deca-xp
Search engine Seq/ProteinProphet Sequest
Protein IDs
222
83
in B1-serum
[51 in common]
Of total 254, 164 found among data from 11
other labs without glycoprotein enrichment.
49
Technology Recommendations (cont’d)
Orbitrap and other advanced instruments with high
mass accuracy and increased throughput
Multiple Reaction Monitoring (Q-Trap, triple quad--LOD <50 amol, 5 logs range, probably ng/ml
range for GP.
Extensive fractionation and newer labeling methods.
Recruit several major labs; be open to volunteers.
Determine interest in reference specimen.
Make peptide standards available through PPP-2:
post lists and make labeled compounds.
50
Multiple Reaction Monitoring (MRM)
Source
MS-1
Fixed
Set precursor m/z
Peptide (M)
CID
MS-2
Fixed
time
Set fragment m/z
Fragment (m)
High selectivity ~ two levels of mass selection (increased
S/N)
High sensitivity because of high duty cycle
(Q1 and Q3 are static)
Only known peptides (candidates) are detected
51
Technology Recommendations (cont’d)
Compare pooled samples from disease and
control; high throughput not essential for
discovery phase
Continue to build the catalog
Do longitudinal repeat measures on individuals to
establish CoV—must reliably tell whether two
samples are the same or different, including
PTMs
Pay attention to precursor ions
Known interested labs: Aebersold, Paik, Smith,
Speicher, Hancock, Mann; probably Chinese,
Michigan, FHCRC, Japanese/glycomics.
52
Issues for PPP Bioinformatics
What are imperatives for project design?
How can many more spectra be interpreted?
How can more confident protein IDs be
generated?
How do we add value and benefit from
EBI/PRIDE and ISB/PeptideAtlas repositories?
What is required to make the datasets more
useful for other investigators?
Can quantitation, including of PTMs, be achieved
with statistical robustness?
53
A Robust Bioinformatics Architecture
PRIDE
Peptide
Atlas
Dissemination
Genome
annotation
Level I repository
Individual
labs
54
Repositories and Resources for Proteomics
Informatics
PRIDE at EBI, repository for protein
identifications (Martens)
PeptideAtlas, repository for raw data processed
through TransProteomics Pipeline at ISB
(Deutsch), plus SpectraST barcodes from NIST
Tranche Distributed File System/DFS (Andrews,
UM) at ProteomeCommons.org, National
Resource for Proteomics and Pathways
CPAS, developed as part of Mouse Models of
Human Cancers Consortium, at Fred
Hutchinson (McIntosh)
GPMdb, developed by Beavis (Canada)
55
Tranche Distributed (P2P) File System
Open, simple, cross-platform protocols
– e-Commerce-grade encryption makes it appropriate
for scientific research (peer-review and traceability)
– Can easily grow to accommodate very large
amounts of data and users
• Commodity hardware @ $0.37 per GB storage
~16 TB over 12 servers (30 additional TB
ordered) and funding for additional 20TB
Documentation, tools, code, credits:
http://www.proteomecommons.org/dev/dfs
Data sets: GPM, PNNL, Aurum, QqTOF vs
QSTAR, sPRG ABRF 2006, HUPO PPP
– Links with PeptideAtlas, OPD, HPRD, TheGPM
56
Can We Identify More High Confidence Peptides
from the MS/MS spectra?
The spectra, not protein lists, are the raw data.
<20% of spectra are confidently assigned to
peptide sequences; the rest are typically
discarded.
More high quality spectra can be mined
(Nesvizhskii et al, MCP 2006).
Higher mass accuracy greatly enhances results
(with some complications---Eric Deutsch).
Error estimates and thresholds should be routine
for peptide IDs and protein matches.
TransProteomicPipeline (TPP) from ISB has
been designed for this purpose.
57
Mining Un-assigned High Quality Spectra
(Nesvizhskii)
Typical search:
SEQUEST, IPI database
semi-constrained (tryptic on one end)
Met + 16
+/- 3 Da, average mass
Average numbers (LCQ/LTQ data): 10-15% of all
spectra assigned peptide with high
confidence
20-25 % of all high quality spectra are not
assigned
58
Why Are Spectra Not Assigned?
Possible causes of failure to assign peptide:
• Imperfect scoring scheme
• Constrained search (PTM, not tryptic etc.)
• Incorrect mass/ charge state
• Low spectrum quality / contaminant ion
• Correct sequence may not be in the database
searched (e.g., SNP)
• Novel sequence (splice variants, fusion peptides?)
Use MS/MS data for genome annotation
59
Finding and Mining High Quality Unassigned
Spectra (Nesvizhskii)
60
Further Analyses at the Peptide Level
The PPP, GPM, and PeptideAtlas databases are
rich with peptide-level findings, which can be
analyzed for many questions---e.g., which
peptides are most likely to be detected from
among the predicted tryptic peptides of various
proteins, and why? Can peptides be used
directly to identify sequences of splice isoforms
and SNPs? Can PTMs be identified more
readily? Answers: Yes to all three questions.
Proteotypic peptides will be a major feature of
Next Phase PPP.
61
What Kinds of Biological Insights Emerge from
Annotation?
The aim of proteomics analyses is not just
to create lists of peptides and proteins,
but to advance our understanding of
complex biological processes in health
and disease. Going forward,
quantitation of proteins and their PTMs
will be increasingly important---and
feasible.
62
High Throughput Proteomics
and Systems Biology
condition 1
condition 2
Understanding and
modeling cell
behavior
Systems Biology
condition 3
Integration of genomic, transcriptomic, proteomic,
metabolomic data
63
SUMMARY
Enthusiasm for continuing and expanding Plasma
Proteome Project, confirmed at Seoul, Korea,
World Congress of Proteomics Oct 2007
Commitment to combine PPP with concept of
Disease Biomarker Initiative
Interest in linking with and absorbing datasets
from other Biofluid Proteomes (saliva, urine,
CSF, organ-related proximal fluids)
64
Biology as an Information Science: NIH Roadmap
National Centers for Biomedical Computing
Physics-Based Simulation of
Biological Structures (SIMBIOS)
Russ Altman, PI
National Center for Integrative
Biomedical Informatics (NCIBI)
Brian Athey, PI
Informatics for Integrating
Biology and the Bedside (i2b2)
Isaac Kohane, PI
National Alliance for Medical
Imaging Computing (NA-MIC)
Ron Kikinis, PI
The National Center For
Biomedical Ontology (NCBO)
Mark Musen, PI
Multiscale Analysis of Genomic
and Cellular Networks (MAGNet)
Andrea Califano, PI
Center for Computational Biology
(CCB)
Arthur Toga, PI
65
A Bioinformatics Approach to Discover
Candidate Oncogenes
Few causal cancer genes have been discovered using
gene expression microarrays
Oncogenic events are often heterogeneous
– ERBB2/HER2 amplification in 20% of breast CA
– Activating Ras mutations in 25% of melanomas
– E2A-PBX1 translocation in 5-10% of leukemias
Chromosomal aberrations that result in marked
over-expression of an oncogene should be
detectable in transcriptome data
Protein products then may be identified in tumor,
biofluids, and plasma
66
67
COPA of microarray data revealed ETV1 and ERG as outlier
genes across multiple prostate cancer gene expression data sets
[Tomlins et al., Science 2005, 310: 644 -648]
68
COPA Unveils Androgen-Responsive TF Fusion Genes
69
The Molecular Concept Map Project [Chinnaiyan, Rhodes]
70
71
Our Genetic Future
“Mapping the human genetic terrain may rank with
the great expeditions of Lewis and Clark, Sir
Edmund Hillary, and the Apollo Program.” -Francis Collins, Director
National Human Genome Research Institute, 1999
Next:
Understand gene and protein expression
Elucidate genetic, environmental, and
behavioral interactions in health and disease
Engage scientists globally
72
Acknowledgements
HUPO PPP: Ruedi Aebersold and Young-Ki Paik,
co-chairs; Eric Deutsch, Lennart Martens, Alexey
Nesvizhskii, David States, bioinformatics; lab
leaders and sponsors (see Proteomics 2005)
UM Proteomics Alliance for Cancer Research: Phil
Andrews, David States, Alexey Nesvizhskii,
George Michailidis, Mike Pisano, Arul
Chinnaiyan, Dan Rhodes, Scott Tomlins, Arun
Sreekumar, Adai Vellaichamy, Brian Haab
UM National Center for Integrative Biomedical
Informatics: Brian Athey, David States, HV
Jagadish, Jignesh Patel, Peter Woolf, Biaoyang
73
Lin