[Title graphic: Medical Informatics (diseases) and Genomics and Bioinformatics (genes, gene annotation databases), combined with anatomy and physiology, lead to novel relationships & deeper insights]
Identification and Prioritization
of Novel Disease Candidate Genes
Systems Biology Based Integrative Approaches
Bioinformatics to Systems Biology
November 16, 2007
Anil Jegga
Division of Biomedical Informatics,
Cincinnati Children’s Hospital Medical Center (CCHMC)
Department of Pediatrics, University of Cincinnati
Cincinnati, Ohio - 45229
[email protected]
http://anil.cchmc.org
Acknowledgements
• Jing Chen
• Eric Bardes
• Bruce Aronow
Support
• All the publicly available gene annotation resources, especially NCBI, MGI and UCSC
Cincinnati Children’s Hospital Medical Center
Computational Medical Center, Cincinnati
Mouse Models of Human Cancers Consortium
University of Cincinnati College of Medicine
Two Separate Worlds…..
Disease World (Medical Informatics)
• Patient Records
• Clinical Trials
• Disease Database: →Name, →Synonyms, →Related/Similar Diseases, →Subtypes, →Etiology, →Predisposing Causes, →Pathogenesis, →Molecular Basis, →Population Genetics, →Clinical findings, →System(s) involved, →Lesions, →Diagnosis, →Prognosis, →Treatment, →Clinical Trials……
Bioinformatics & the “omes”
• Genome, Regulome, Transcriptome, miRNAome, Proteome, Interactome, Metabolome, Variome, Pharmacogenome, Physiome, Pathome
• 382 “omes” so far (as on November 15, 2007): http://omics.org/index.php/Alphabetically_ordered_list_of_omics
• and there is “UNKNOME” too: genes with no function known
Also shown: PubMed, OMIM Clinical Synopsis
With Some Data Exchange… the Ultimate Goal…….
Disease World (Medical Informatics)
• Patient Records
• Clinical Trials
• Disease Database: →Name, →Synonyms, →Related/Similar Diseases, →Subtypes, →Etiology, →Predisposing Causes, →Pathogenesis, →Molecular Basis, →Population Genetics, →Clinical findings, →System(s) involved, →Lesions, →Diagnosis, →Prognosis, →Treatment, →Clinical Trials……
Bioinformatics
• Genome, Regulome, Transcriptome, miRNAome, Proteome, Interactome, Metabolome, Physiome, Pathome, Variome, Pharmacogenome
Also shown: OMIM, PubMed
Integrative Genomics / Biomedical Informatics
Personalized Medicine
►Decision Support System
►Course/Outcome Predictor
►Diagnostic Test Selector
►Clinical Trials Design
►Hypothesis Generator
►Novel Gene/Drug Targets…..
No Integrative Genomics is Complete without Ontologies
Gene World
• Gene Ontology (GO)
Biomedical World
• Unified Medical Language System (UMLS)
The 3 Gene Ontologies
• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are
carbohydrate binding and ATPase activity
– What a product ‘does’, precise activity
• Biological Process = biological goal or objective
– broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
– Biological objective, accomplished via one or more ordered assemblies of functions
• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular
complexes; examples include nucleus, telomere, and RNA
polymerase II holoenzyme
– ‘is located in’ (‘is a subcomponent of’ )
http://www.geneontology.org
Example: Gene Product = hammer
Function (what) | Process (why)
Drive a nail into wood | Carpentry
Drive a stake into soil | Gardening
Smash a bug | Pest Control
A performer’s juggling object | Entertainment
Unified Medical Language System Knowledge Source Server (UMLSKS)
• The UMLS Metathesaurus contains information about biomedical
concepts and terms from many controlled vocabularies and
classifications used in patient records, administrative health data,
bibliographic and full-text databases, and expert systems.
• The Semantic Network, through its semantic types, provides a
consistent categorization of all concepts represented in the UMLS
Metathesaurus. The links between the semantic types provide the
structure for the Network and represent important relationships in
the biomedical domain.
• The SPECIALIST Lexicon is an English language lexicon with many
biomedical terms, containing syntactic, morphological, and
orthographic information for each term or word.
http://umlsks.nlm.nih.gov/kss
Unified Medical Language System Metathesaurus
• Over 1 million biomedical concepts.
• About 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases and expert systems.
• The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together.
• Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.
• Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.
• Uses:
– linking between different clinical or biomedical vocabularies
– information retrieval from databases with human assigned subject index terms
and from free-text information sources
– linking patient records to related information in bibliographic, full-text, or factual
databases
– natural language processing and automated indexing research
Open biomedical ontologies
http://obo.sourceforge.net/
Mammalian Phenotype Ontology
1. The Mammalian Phenotype (MP)
Ontology enables robust annotation of
mammalian phenotypes in the context
of mutations, quantitative trait loci
and strains that are used as models of
human biology and disease.
2. Each node in MPO represents a category of phenotypes. Each MP ontology term has a unique identifier, a definition and synonyms, and is associated with gene variants causing these phenotypes in genetically engineered or mutagenesis experiments.
3. In the current version of MPO, there are >4250 terms associated with >4300 unique Entrez mouse genes (extrapolated to ~4300 orthologous human genes).
http://www.informatics.jax.org
Disease Gene Identification and
Prioritization
Hypothesis: The majority of genes that impact or cause disease share membership in any of several functional relationships. Put differently, functionally similar or related genes cause similar phenotypes.
Functional Similarity – Common/shared
•Gene Ontology term
•Pathway
•Phenotype
•Chromosomal location
•Expression
•Cis regulatory elements (Transcription factor binding sites)
•miRNA regulators
•Interactions
•Other features…..
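The "common/shared" idea above can be sketched in a few lines. This is only an illustration under stated assumptions: each gene carries a set of terms per feature type, and similarity is measured with the Jaccard index averaged over feature types. The gene names, the terms and the choice of Jaccard are all hypothetical and do not come from the slides.

```python
# Toy annotations: gene -> feature type -> set of terms (all values are made up).
ANNOTATIONS = {
    "GENE_A": {"go": {"DNA repair", "cell cycle"}, "pathway": {"p53 signaling"}},
    "GENE_B": {"go": {"DNA repair"}, "pathway": {"p53 signaling"}},
    "GENE_C": {"go": {"ion transport"}, "pathway": set()},
}


def jaccard(a, b):
    """Shared terms divided by all terms; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def functional_similarity(gene1, gene2, annotations=ANNOTATIONS):
    """Average Jaccard similarity over the feature types known for either gene."""
    features = sorted(set(annotations[gene1]) | set(annotations[gene2]))
    scores = [
        jaccard(annotations[gene1].get(f, set()), annotations[gene2].get(f, set()))
        for f in features
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(functional_similarity("GENE_A", "GENE_B"))  # shares GO and pathway terms
    print(functional_similarity("GENE_A", "GENE_C"))  # shares nothing
```

Averaging over feature types treats every feature equally; a real tool would likely weight features by how informative they are.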
Background, Problems & Issues
1. Most of the common diseases are multifactorial and modified by genetically and mechanistically complex polygenic interactions and environmental factors.
2. High-throughput genome-wide studies, such as linkage analysis and gene expression profiling, tend to be most useful for classification and characterization but do not provide sufficient information to identify or prioritize specific disease causal genes.
3. Since multiple genes are associated with the same or similar disease phenotypes, it is reasonable to expect the underlying genes to be functionally related.
4. Such functional relatedness (common pathway, interaction, biological process, etc.) can be exploited to aid in the finding of novel disease genes. For example, genetically heterogeneous hereditary diseases such as Hermansky-Pudlak syndrome and Fanconi anaemia have been shown to be caused by mutations in different interacting proteins.
Background, Problems & Issues
Disease candidate gene studies
Example: dilated cardiomyopathy (Ellinor et al. J Am Coll Cardiol 2006)
• Linkage analysis, gene expression → potential candidate genes (too many!)
• Linkage analysis: locus region 10q25-26, ~9.5Mb with 68 genes
• Narrowing the list: fine mapping, hand/cherry picking, or a prioritization approach
• Biological experiments (expensive, time consuming)
• 7 candidates selected by experts; ADRB1 missing
Background, Problems & Issues
Current candidate gene prioritization tools
Assumption: genes involved in the same complex disease will have similar functions
Example: dilated cardiomyopathy
Approach without training
• Input: multiple locus regions
• Enriched functions
• Prioritize genes based on the functions
Approach with training
• Training: known disease genes (10 from OMIM)
• Test: 68 genes at 10q25-26
• Score test genes based on their similarity to training set
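The training-based approach can be sketched as follows. The sketch assumes a one-sided hypergeometric test to find terms over-represented in the training genes, and scores each test gene by summing -log10(p) over the enriched terms it carries. Both choices are simple stand-ins chosen for illustration, and the gene names and annotations are hypothetical; none of this should be read as the scoring any particular tool uses.

```python
from math import comb, log10


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): n genes drawn from N, of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enriched_terms(training, annotations, alpha=0.05):
    """Return {term: p-value} for terms over-represented in the training genes."""
    N, n = len(annotations), len(training)
    term_counts = {}
    for terms in annotations.values():
        for term in terms:
            term_counts[term] = term_counts.get(term, 0) + 1
    enriched = {}
    for term, K in term_counts.items():
        k = sum(1 for g in training if term in annotations[g])
        if k == 0:
            continue
        p = hypergeom_pvalue(k, n, K, N)
        if p <= alpha:
            enriched[term] = p
    return enriched


def score_test_genes(test, training, annotations, alpha=0.05):
    """Rank test genes by the summed -log10(p) of the enriched terms they carry."""
    enriched = enriched_terms(training, annotations, alpha)
    scores = {
        g: sum(-log10(enriched[t]) for t in annotations[g] if t in enriched)
        for g in test
    }
    return sorted(test, key=lambda g: scores[g], reverse=True), scores


def toy_annotations():
    """20 hypothetical genes: 4 training genes, 2 candidates, 14 background genes."""
    ann = {f"T{i}": {"DNA repair"} for i in range(1, 5)}
    ann["CAND1"] = {"DNA repair", "transport"}
    ann["CAND2"] = {"transport"}
    ann.update({f"BG{i}": {"transport"} for i in range(14)})
    return ann


if __name__ == "__main__":
    ranked, scores = score_test_genes(
        ["CAND2", "CAND1"], ["T1", "T2", "T3", "T4"], toy_annotations()
    )
    print(ranked, scores)
```

In the toy data the candidate sharing the enriched "DNA repair" term is ranked first, while the candidate carrying only the common background term gets a score of zero.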
TOPPGene
Transcriptome Ontology Pathway based Prioritization of Genes
http://toppgene.cchmc.org
Chen J, Xu H, Aronow BJ, Jegga AG. 2007. Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 8(1): 392 [Epub ahead of print]
Applications:
1. For functional enrichment
2. For candidate gene prioritization
Why another gene prioritization method?
Comparison with other related approaches
Tools (year): POCUS (2003), Prospectr (2005), SUSPECTS (2006), ENDEAVOUR (2006), ToppGene (2007)
Feature types compared: Sequence Features, GO Annotations, Transcript Features, Protein Features, Literature, Phenotype Annotations, Training set
Comparison with other related approaches
Feature Details
Feature type | POCUS | Prospectr | SUSPECTS | ENDEAVOUR | ToppGene
Year | 2003 | 2005 | 2006 | 2006 | 2007
Sequence Features & Annotations | | Gene length, Homology, Base composition | Gene length, Homology, Base composition | Blast, cis-element | Cytoband, cis-element, miRNA targets, GeneSets
Gene Annotations | Gene Ontology | | Gene Ontology | Gene Ontology | Gene Ontology, Mouse Phenotype
Transcript Features | Gene expression | | Gene expression | EST expression | Gene expression
Protein Features | Protein domains | | domains | domains, interactions, pathways | domains, interactions, pathways
Literature | | | | Keywords | Co-citation
Training set | No | No | Yes | Yes | Yes
Mammalian Phenotype Ontology
We do not check whether the human ortholog of a mouse gene causes a similar phenotype. Rather, we assume that orthologous genes cause an “orthologous phenotype” and test the potential of the extrapolated mouse phenotype terms as a similarity measure to prioritize human disease candidate genes.
Mammalian Phenotype Ontology
77 human genes explicitly associated
with “heart development” (GO:0007507)
Mouse orthologs cause
various types of cardiac
phenotype (MPO)
ToppGene – General Schema
TOPPGene - Data Sources
1. Gene Ontology: GO and NCBI Entrez Gene
2. Mouse Phenotype: MGI (used for the first
time for human disease gene prioritization)
3. Pathways: KEGG, BioCarta, BioCyc,
Reactome, GenMAPP, MSigDB
4. Domains: UniProt (Pfam, Interpro,etc.)
5. Interactions: NCBI Entrez Gene (Biogrid,
Reactome, BIND, HPRD, etc.)
6. Pubmed IDs: NCBI Entrez Gene
7. Expression: GEO
8. Cytoband: MSigDB (new feature added)
9. Cis-Elements: MSigDB (new feature added)
10. miRNA Targets: MSigDB (new feature added)
TOPPGene - Validation
• Random-gene cross-validation
– Disease-gene relations from OMIM
and GAD databases
– Training set: disease genes with one
gene (“target”) removed
– Test set: 100 genes = “target” gene +
99 random genes
– Rank of “target” gene
– Control: random training sets
– AUC and Sensitivity/Specificity
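The procedure listed above can be sketched as a leave-one-out loop. This is a minimal illustration, assuming a pluggable `score(training, gene)` function (a hypothetical interface) and toy annotations. Ties are resolved pessimistically, which is a design choice of this sketch and not something stated on the slide.

```python
import random


def random_gene_cross_validation(disease_genes, genome, score, n_random=99, seed=0):
    """For each disease gene, hide it, mix it with random genes, and record its rank."""
    rng = random.Random(seed)
    known = set(disease_genes)
    pool = [g for g in genome if g not in known]
    ranks = {}
    for target in disease_genes:
        training = [g for g in disease_genes if g != target]
        test = [target] + rng.sample(pool, n_random)
        scores = {g: score(training, g) for g in test}
        # Pessimistic rank: every gene scoring at least as high counts against the target.
        ranks[target] = sum(1 for g in test if scores[g] >= scores[target])
    return ranks


def shared_term_score(annotations):
    """Toy scorer: number of a gene's terms that also occur in the training genes."""
    def score(training, gene):
        training_terms = set().union(*(annotations[g] for g in training))
        return len(annotations[gene] & training_terms)
    return score


if __name__ == "__main__":
    annotations = {f"D{i}": {"repair"} for i in range(5)}
    annotations.update({f"BG{i}": {"other"} for i in range(200)})
    disease = [f"D{i}" for i in range(5)]
    print(random_gene_cross_validation(disease, list(annotations), shared_term_score(annotations)))
```

Passing a scorer that ignores the training set (for example a constant) plays the role of a control run: every "target" then lands at the bottom of its 100-gene test set.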
TOPPGene - Validation
Random-gene cross-validation: breast cancer example
Disease genes (15): ATM, BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53
Training set (ATM removed as “target”): BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53
Test set (ATM + 99 random genes): KIAA1333, PQLC3, RBMY2OP, ZNF133, LOC402643, FBL, SLEB4, FAM32A, AACSL, ATM, NDUFB5, DENND4A, C14orf106, …, KCNJ16
Ranked list after prioritization:
1. ATM
2. KIAA1333
3. PQLC3
4. RBMY2OP
5. ZNF133
6. LOC402643
7. FBL
8. SLEB4
9. FAM32A
10. AACSL
11. NDUFB5
12. DENND4A
13. C14orf106
…
100. KCNJ16
Random-gene cross-validation result
[ROC curve: Sensitivity (true positive rate) vs. 1 - specificity (false positive rate), both axes 0.0 to 1.0]
• Training: 19 diseases with 693 genes
• Control: 20 random sets of 35 genes each
• AUC: 0.916
• Sensitivity/Specificity: 77/90
Sensitivity: frequency of “target” genes that are ranked above a particular threshold position
Specificity: the percentage of genes ranked below the threshold
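The rank-based definitions of sensitivity and specificity above can be turned into a small calculation. The sketch assumes test sets of 100 genes and a trapezoidal rule for the area under the curve; the integration method is an assumption, since the slide does not say how the AUC was computed. Under these definitions, a value such as 77/90 is read here as the target being ranked in the top 10% of genes in 77% of the prioritizations.

```python
def sensitivity_at(ranks, cutoff):
    """Fraction of prioritizations whose target is ranked within the top `cutoff`."""
    return sum(1 for r in ranks if r <= cutoff) / len(ranks)


def specificity_at(cutoff, n_genes=100):
    """Fraction of test-set positions that lie below the cutoff."""
    return (n_genes - cutoff) / n_genes


def rank_roc_auc(ranks, n_genes=100):
    """Trapezoidal area under sensitivity vs. (1 - specificity) over all cutoffs."""
    auc = 0.0
    prev_x, prev_y = 0.0, 0.0
    for cutoff in range(1, n_genes + 1):
        x = 1.0 - specificity_at(cutoff, n_genes)
        y = sensitivity_at(ranks, cutoff)
        auc += (x - prev_x) * (y + prev_y) / 2
        prev_x, prev_y = x, y
    return auc


if __name__ == "__main__":
    # Hypothetical ranks: 77 targets ranked first, 23 ranked in the middle.
    ranks = [1] * 77 + [50] * 23
    print(sensitivity_at(ranks, 10), specificity_at(10), rank_roc_auc(ranks))
```

A set of targets whose ranks are spread evenly from 1 to 100 gives an area of 0.5, which is what a random control would be expected to approach.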
Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization
Random-gene cross-validation with only one feature
[Bar chart: AUC of different feature sets. Series: AUC (random control), AUC (p-value score), Coverage. Feature sets: All, GO:MF, GO:BP, MP, Pathway, Domain, Pubmed, Interaction, Expression. Left axis: AUC, 0 to 1; right axis: Coverage, 0% to 100%]
Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization
Random-gene cross-validation by leaving one feature out
Overall performance
• All features: 0.913
• All – MP: 0.893
• All – MP – PubMed: 0.888
[ROC curves for All, All – MP and All – MP – Pubmed: Sensitivity (true positive rate) vs. 1-specificity (false positive rate), x-axis shown from 0.00 to 0.20]
Sensitivity: true positive rate at a cutoff score
Specificity: true negative rate at the same cutoff
Locus-region cross-validation using different feature sets
Features | Average rank ratio of “target” genes | Number of times “target” genes were ranked top 5% | Number of times “target” genes were ranked top 10%
All | 7.39% | 118 | 125
GO + MP + PubMed | 7.50% | 118 | 126
MP + PubMed | 7.08% | 121 | 126
Without GO | 6.84% | 117 | 123
Without Pathway | 7.66% | 118 | 124
Without Domain | 6.71% | 118 | 124
Without Interaction | 7.17% | 120 | 124
Without Expression | 7.28% | 118 | 128
Without MP | 9.77% | 110 | 117
Without Pubmed | 9.91% | 100 | 111
Without MP & Pubmed | 22.61% | 71 | 80
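The three columns of the table can be reproduced from per-gene results with a short calculation. The sketch assumes that the rank ratio is the rank of the "target" gene divided by the number of genes in its locus-region test set; that definition is an assumption made here for illustration, and the input numbers are made up.

```python
def summarize_locus_validation(results):
    """results: list of (rank_of_target, number_of_genes_in_test_set) pairs."""
    ratios = [rank / n_genes for rank, n_genes in results]
    return {
        "average_rank_ratio": sum(ratios) / len(ratios),
        "top_5_percent": sum(1 for r in ratios if r <= 0.05),
        "top_10_percent": sum(1 for r in ratios if r <= 0.10),
    }


if __name__ == "__main__":
    # Hypothetical outcomes of four locus-region prioritizations.
    print(summarize_locus_validation([(1, 100), (8, 100), (30, 100), (3, 50)]))
```

A lower average rank ratio means the hidden disease gene tends to sit closer to the top of its locus, which is how the rows of the table can be compared.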
ToppGene web server (http://toppgene.cchmc.org)
For functional enrichment analysis
PPI - Predicting Disease Genes
1. Direct protein–protein interactions (PPI) are
one of the strongest manifestations of a
functional relation between genes.
2. Hypothesis: Interacting proteins lead to the same or similar disease phenotypes when mutated.
3. Several genetically heterogeneous hereditary diseases have been shown to be caused by mutations in different interacting proteins, for example Hermansky-Pudlak syndrome and Fanconi anaemia. Hence, protein–protein interactions might in principle be used to identify potentially interesting disease gene candidates.
Mining human interactome (HPRD, BioGrid)
Known Disease Genes → Direct Interactants of Disease Genes → Indirect Interactants of Disease Genes
[Figure: network schematic with node counts 7, 66 and 778]
Which of these interactants are potential new candidates?
Prioritize candidate genes in the interacting partners of the disease-related genes
• Training sets: disease-related genes
• Test sets: interacting partners of the training genes
Example: Breast cancer
OMIM genes (level 0): 15
Directly interacting genes (level 1): 342
Indirectly interacting genes (level 2): 2469!
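Collecting level 1 and level 2 interactants amounts to a breadth-first walk over the interaction network starting from the known disease genes. The sketch below assumes the interactome is available as a list of undirected gene pairs; the gene names are placeholders and do not represent real HPRD or BioGrid records.

```python
from collections import deque


def interactants_by_level(seeds, interactions, max_level=2):
    """Breadth-first search over undirected pairs; returns {level: set of genes}."""
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    level_of = {gene: 0 for gene in seeds}
    queue = deque(seeds)
    while queue:
        gene = queue.popleft()
        if level_of[gene] == max_level:
            continue  # do not expand beyond the requested depth
        for partner in neighbors.get(gene, ()):
            if partner not in level_of:
                level_of[partner] = level_of[gene] + 1
                queue.append(partner)
    levels = {lvl: set() for lvl in range(max_level + 1)}
    for gene, lvl in level_of.items():
        levels[lvl].add(gene)
    return levels


if __name__ == "__main__":
    pairs = [
        ("SEED1", "SEED2"), ("SEED1", "P1"), ("SEED2", "P2"),
        ("P1", "Q1"), ("P2", "Q2"), ("Q1", "R1"),
    ]
    print(interactants_by_level(["SEED1", "SEED2"], pairs))
```

The level 1 set would serve as the test set and the seeds as the training set. The jump from 342 to 2469 genes in the breast cancer example shows why going to level 2 quickly becomes unwieldy.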
ToppGene web server (http://toppgene.cchmc.org)
For candidate gene prioritization
Example: Breast cancer study. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007 May 27.
rs id | Location | Gene | Training set | Test set
rs2981582 | 10q26 | FGFR2 | 15 OMIM genes | 83 genes in the region
Prioritization result:
Rank | Gene | Description | P-value
1 | BUB3 | budding uninhibited by benzimidazoles 3 homolog | 0.003865
2 | FGFR2 | fibroblast growth factor receptor 2 | 0.018906
3 | BCCIP | BRCA2 and CDKN1A interacting protein | 0.04784
Example: Breast cancer study. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007 May 27.
ToppGene Prioritization
Example: Breast cancer
Training set | Test set
15 OMIM genes | 342 interacting genes
Ranked Interactants
Rank | Gene | Description
1 | ATR | ataxia telangiectasia and Rad3 related
2 | FANCD2 | Fanconi anemia, complementation group D2
3 | NBN (NBS1) | Nibrin
Limitations
General limitations of any training-test strategy:
• Requires prior knowledge of disease-gene associations.
• Assumes that the disease genes yet to be discovered will be consistent with what is already known about a disease.
• Depends on the accuracy and completeness of the functional annotations.
– Only one-fifth of the known human genes have pathway or phenotype annotations, and the functions of more than 40% of genes are still not defined!
Chen et al., 2007; BMC Bioinformatics
Mouse Phenotype - Limitations
1. MP is not a disease-centric ontology, and the phenotype of the same gene mutation can vary depending on the specific mouse strain or its genetic background.
2. Orthologous genes need not necessarily result in orthologous phenotypes.
Possible Solutions - Future Directions
More efficient cross-species phenome extrapolation, wherein the mouse phenotype terms are mapped semantically to human phenotype concepts (from UMLS) (“orthologous phenotype”) and the orthologous genes associated with an orthologous phenotype are identified.
Chen et al., 2007; BMC Bioinformatics
PPIs for disease gene identification
Limitations
1. Noisy interactome data
• In vitro vs. in vivo (for example, only 5.8% of yeast two-hybrid predicted interactions were confirmed by HPRD)
• Extrapolation of interactions from one species to another
• Bias towards “well-studied” genes/proteins
2. Too many interactants! Hub proteins
3. Two interacting proteins need not lead to a similar phenotype when mutated
4. Disease proteins may lie at different points in a pathway and need not interact directly
5. Lastly, disease mutations need not always involve proteins
Oti et al., 2006; J Med Gen
http://anil.cchmc.org
(under presentations)
And PRIORITIZATION too!
Thank You!
http://sbw.kgi.edu/