Ontologies, Databases, Knowledgebases:
How should they interoperate?
Judith Blake, Ph.D.
The Jackson Laboratory
Thesis
• The Mouse Genome Informatics (MGI) system
– provides a model for interoperability that
– incorporates the use of ontologies,
– depends upon the interconnection among databases, and
– supports integration of data from multiple data sources
• This may provide a model for PRO objectives to support connections between PRO and disease representations
Mouse Genome Informatics (MGI)
MGI’s primary mission is to facilitate the use of mouse as a model for human
biology by providing integrated access to data on the genetics, genomics, and
biology of the laboratory mouse.
[Figure: MGI data types: variants & polymorphisms, expression, sequence, genome location, gene function, strain genealogy, tumors, mouse/human orthologs & maps. Example shown: Hermansky-Pudlak syndrome, mouse model & human phenotype.]
Information content spans from sequence to phenotype/disease
Automated (mostly) Data Integration (Loads)
[Figure: data sources loaded into MGI. Labels on the diagram: EG mouse, UniProt, Associations, DFCI, DoTS, NIA, Unigene, TreeFam, Gene traps, Clones, RPCI, MGC, GenBank, EG chimp, EG dog, EG rat, EG human, HCOP, Homologene, Non-mouse, SNP db, GO, MP, Vocabularies, Anatomy, Interpro, OMIM, PIRSF, Annotation, RefSeq, Sequences, DFCIseq, DoTSseq, NIAseq, dbSNP, UniSTS, microRNAs, and NCBI, VEGA, Ensembl gene models and coordinates.]
Mouse Genome Informatics
Controlled vocabularies and ontologies
• GO - Gene Ontology
• PRO - Protein Ontology
• MP - Mammalian Phenotype Ontology
• MA - Mouse Anatomy (GXD)
• CL - Cell Type Ontology
• Mouse gene and strain nomenclature
• SO - Sequence Ontology
• RO - Relations Ontology
• ECO - Evidence Code Ontology
Integration: Controlled Vocabularies and Ontologies
MGI Operating Principles
• Data integration is key to comprehensive access to mouse genome,
functional, mouse model, and comparative data; it allows the data to be
evaluated in new contexts
– Supports robust access to comprehensive information
– Permits efficient access to related resources
• Standards are key to data integration
– Nomenclature
• Standardized gene nomenclature, keywords, etc.
– Knowledge representation
• Gene Ontology (GO)
• Mammalian Phenotype Ontology
• Integration of Multi-Source Data
– Depends on consistent entity tagging
– Requires improvement of data storage structures
– Necessitates ontology updates for data categories and context
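The dependence of multi-source integration on consistent entity tagging can be illustrated with a minimal sketch. Everything here is hypothetical: the record layout, the source names, and the identifiers are made up for illustration and are not MGI's actual schema or load process.

```python
from collections import defaultdict

# Hypothetical records from three sources. Integration only works because
# each source tags its data with the same gene identifier.
records = [
    {"source": "nomenclature", "gene_id": "GENE:1", "symbol": "Abc1"},
    {"source": "function", "gene_id": "GENE:1", "function_term": "TERM:A"},
    {"source": "phenotype", "gene_id": "GENE:2", "phenotype_term": "TERM:B"},
]


def integrate(records):
    """Merge per-source records into one entry per shared gene identifier."""
    merged = defaultdict(dict)
    for rec in records:
        fields = {k: v for k, v in rec.items() if k not in ("source", "gene_id")}
        merged[rec["gene_id"]].update(fields)
    return dict(merged)


integrated = integrate(records)
print(integrated["GENE:1"])
```

If one source used a different tag for the same gene, its data would end up in a separate entry, which is the failure that consistent tagging prevents.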
Mouse Phenotypes and Disease Models
Connects mouse and human phenotypes in studies of human disease processes
Mouse Crebbptm1Sis/Crebbp+ mutants showing skeletal formation defects.
Human Rubinstein-Taybi Syndrome 1 (OMIM:180849), caused by CREBBP mutation.
° mental retardation
° postnatal growth deficiency
° microcephaly
° broad thumbs & halluces
° dysmorphic facial features (beaked nose, high arched palate, characteristic grimacing)
° increased tumor risk
Diseases and Phenotypes
• Diseases are described by signs and symptoms
– Signs – things you can measure
– Symptoms – things the patient notices
• Signs are phenotypes
• Diseases are characterized by phenotypes, including the order, severity and duration with which they occur. A full model of disease takes into account dimensions of anatomy, time, severity, therapeutic responsiveness, outcomes, etc. There is also a probabilistic element to an instance of the disease and a probabilistic association between phenotypic elements in one instance.
• Diseases are not phenotypes (although predisposition may be considered as such), but single-phenotype diseases may be viewed as phenotypes, e.g. osteoarthritis.
Paul Schofield, 2013
Status of Phenotype & Disease Data
                                         May 2012   May 2013   May 2014   change this yr.
Mutant alleles cataloged: total           743,813    748,960    754,256    +5,296
  in mice                                  32,299     33,659     39,241    +5,582
  number of genes represented              20,937     21,442     21,786      +344
  targeted alleles                         46,822     51,119     55,640   +14,521
  number of genes targeted                 15,488     16,221     16,358      +137
Alleles w/ phenotype (MP) annotation       29,064     32,095     34,625    +2,530
Genotypes with MP annotation               43,579     47,790     51,720    +3,930
Total MP annotations                      223,125    249,460    268,577   +19,117
Phenotype terms in MP ontology              8,775      9,034     10,190    +1,156
Mouse genotypes modeling human disease      3,687      4,084      4,365      +281
Human diseases w/ ≥1 mouse model(s)         1,153      1,239      1,310       +71
QTLs                                        4,696      4,715      4,835      +120
Objective
…make phenotype and disease model data robust and
accessible to researchers and computational biologists
• semantic consistency to enable complete data retrieval
• integrated access to all phenotypic variation sources
(single-gene and genomic mutations, engineered mutations, QTLs, strains)
• data on human disease correlation
• access to mouse models from various approaches
- Genetic
- Phenotypic
- Genomic localization
- Computational
Annotating Disease to Genotype
• Different alleles of a gene on the same background may or may not be disease models
• The same alleles of a gene on different genetic backgrounds may or may not be disease models
• Disease models are attached to genotype “objects”
• Disease annotation consists of an OMIM term, the data reference/source, and an association type
Example annotation:
– genotype: Fgfr2tm1Schl / Fgfr2+ on 129S1/Sv
– OMIM term: Crouzon Syndrome
– association type: phenotypic similarity to human disease associated with ortholog
– source: Eswarakumar VP et al., PNAS USA 2006;103:18603-8
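The annotation structure described on this slide (a disease annotation made of an OMIM term, a source, and an association type, attached to a genotype object) could be sketched as follows. The class and field names are illustrative assumptions, not MGI's actual data model; only the example values come from the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DiseaseAnnotation:
    """One disease annotation: OMIM term, association type, and source."""
    omim_term: str
    association_type: str
    source: str


@dataclass
class Genotype:
    """A genotype 'object': allele pair plus genetic background."""
    allele_pair: str
    background: str
    disease_annotations: list = field(default_factory=list)


# Values taken from the Crouzon Syndrome example on the slide.
genotype = Genotype(allele_pair="Fgfr2tm1Schl / Fgfr2+", background="129S1/Sv")
genotype.disease_annotations.append(
    DiseaseAnnotation(
        omim_term="Crouzon Syndrome",
        association_type="phenotypic similarity to human disease associated with ortholog",
        source="Eswarakumar VP et al., PNAS USA 2006;103:18603-8",
    )
)
print(genotype.disease_annotations[0].omim_term)
```

Attaching the annotation to the genotype rather than to the gene or allele reflects the first two bullets: the same allele on a different background would be a separate `Genotype` instance and might carry no disease annotation at all.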
MGI: 4,084 MGI Mouse Models associated with 1,239 OMIM diseases
Each associated human disease links to a Human Disease and Mouse Model Detail Page
Note chicken and zebrafish
Biological knowledge and
attributes in MGI
Mouse Genome Informatics:
Integrate Sequence with Biology
[Figure: sequence data (nucleotide sequences, protein sequences, genome features, gene predictions, genome variation) integrated with biological knowledge:]
• Nomenclature
• Genome location
• Strains
• Polymorphisms
• Orthology
• Expression
• Alleles
• Mutant phenotypes
• Function of gene products
• Literature
[Figure: Disease, Cell, Anatomy. Adapted from Schriml and Kibbe: ICBO submission 2013]
Now with annotation extensions
[Figure: annotation graph linking sty1 and pap1 to GO terms through <anonymous description> nodes. Terms: protein localization to nucleus [GO:0034504]; cellular response to oxidative stress [GO:0034599]; positive regulation of transcription from pol II promoter in response to oxidative stress [GO:0036091]. Relations: happens during, has input, has regulation target.]

DB        Object                 Term         Ev    Ref            Extension
PomBase   sty1 (SPAC24B11.06c)   GO:0034504   IMP   PMID:9585505   happens_during(GO:0034599), has_input(SPAC1783.07c)
PomBase   pap1 (SPAC1783.07c)    GO:0036091   IMP   PMID:9585505   has_regulation_target(…)
Annotation Extensions
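As a rough illustration of how an extension string such as the one on this slide breaks down into relation/target pairs, here is a minimal parsing sketch. The function name and the regular expression are assumptions made for this example; this is not a full parser for any official annotation file format.

```python
import re

# Matches relation(target), e.g. happens_during(GO:0034599).
EXTENSION_PATTERN = re.compile(r"(\w+)\(([^()]+)\)")


def parse_extension(text):
    """Split an annotation-extension string into (relation, target) pairs."""
    return [(rel, target.strip()) for rel, target in EXTENSION_PATTERN.findall(text)]


# Extension string from the sty1 annotation on the slide.
parsed = parse_extension("happens_during(GO:0034599), has_input(SPAC1783.07c)")
for relation, target in parsed:
    print(relation, target)
```

Each pair adds context to the base annotation: here, sty1's annotation to GO:0034504 is qualified by when it happens (GO:0034599) and what it acts on (SPAC1783.07c, i.e. pap1).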
MGI Modular Annotation Example
– http://amigo2.berkeleybop.org
Xirp1 is involved in the organization of the sarcomere
in a cardiac muscle cell (CL:0000746) of the myocardium (MA:0000080)
Total number of MGI modular annotation units to proteins: 22,866
This does not include annotations to permanent cell lines
Summary of MGI Modular Annotations
Source   Relation                                              Annotations
MGI      part_of                                               9013
MGI      occurs_in                                             6298
MGI      regulates_o_occurs_in                                 2884
MGI      regulates_o_acts_on_population_of                     1017
MGI      regulates_o_results_in_acquisition_of_features_of     967
MGI      results_in_acquisition_of_features_of                 723
MGI      regulates_o_has_agent                                 537
MGI      regulates_o_has_participant                           275
MGI      acts_on_population_of                                 259
MGI      results_in_movement_of                                232
MGI      results_in_development_of                             173
MGI      regulates_o_results_in_movement_of                    156
MGI      results_in_specification_of                           96
MGI      results_in_maturation_of                              61
MGI      results_in_morphogenesis_of                           48
MGI      has_agent                                             34
MGI      results_in_commitment_to                              32
MGI      results_in_division_of                                21
MGI      regulates_o_results_in_commitment_to                  14
MGI      results_in_determination_of                           13
MGI      regulates_o_results_in_specification_of               7
MGI      has_output_o_axis_of                                  5
MGI      regulates_o_results_in_development_of                 1
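The relation counts in this summary can be tallied with a short script, which also confirms that they add up to the 22,866 modular annotation units quoted on the example slide. Pairing each relation with its count in listed order is an assumption based on the slide layout.

```python
# Relation counts from the summary slide, paired in listed order (assumed).
relation_counts = {
    "part_of": 9013,
    "occurs_in": 6298,
    "regulates_o_occurs_in": 2884,
    "regulates_o_acts_on_population_of": 1017,
    "regulates_o_results_in_acquisition_of_features_of": 967,
    "results_in_acquisition_of_features_of": 723,
    "regulates_o_has_agent": 537,
    "regulates_o_has_participant": 275,
    "acts_on_population_of": 259,
    "results_in_movement_of": 232,
    "results_in_development_of": 173,
    "regulates_o_results_in_movement_of": 156,
    "results_in_specification_of": 96,
    "results_in_maturation_of": 61,
    "results_in_morphogenesis_of": 48,
    "has_agent": 34,
    "results_in_commitment_to": 32,
    "results_in_division_of": 21,
    "regulates_o_results_in_commitment_to": 14,
    "results_in_determination_of": 13,
    "regulates_o_results_in_specification_of": 7,
    "has_output_o_axis_of": 5,
    "regulates_o_results_in_development_of": 1,
}

total = sum(relation_counts.values())
# Composed relations are those prefixed with "regulates_o_".
composed = sum(n for rel, n in relation_counts.items() if rel.startswith("regulates_o_"))

print(total)     # 22866
print(composed)  # 5858
```

The two most frequent relations, part_of and occurs_in, account for 15,311 of the 22,866 units, so most extensions add location context rather than regulatory context.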
Interaction Data in MGI
…from catalog to context
• Relationships among markers project
– Explicit representation of relationships among genome
features
– Interaction explorer
• Project initially focused on microRNAs
– microRNA cluster membership
– Predicted and validated microRNA targets
• Curation of interaction data from the literature
(Gene Ontology) and from specialized external
informatics resources
Mouse_CCO is an application ontology built on experimental evidence-based
annotations. The data drives the structure allowing a user to ‘discover’ connections.
This diagram illustrates the generic template for the ontology.
[Figure: generic Mouse_CCO template (Mary Dolan).
Node types and sources: Gene_human (NCBI); Gene_mouse (MGI); CCO_human (BioPortal); Cell_type (CL); Function (GO MF); Component (GO CC); Process (GO BP); Allele_mouse (MGI); Genotype_mouse (MGI); Protein_mouse (PRO, UniProtKB); Pathway_mouse (MouseCyc); Phenotype_mouse (MGI); Anatomy_mouse (GXD, EMAP); Disease_human (OMIM, DO).
Relations: orthologous_to, described_by, participates_in, located_in, part_of, encodes, has_variant, associated_with, expressed_in, effects.]
Mouse_CCO is populated using 1017 mouse genes annotated to GO ‘cell cycle’ along
with all their annotations from MGI and several additional data resources. Here we
show how the generic template is populated for Brca1.
[Figure: Mouse_CCO template populated for Brca1 (Mary Dolan).
Mouse gene: Brca1 (breast cancer 1)
orthologous_to Gene_human: BRCA1
described_by: id: CCO:B0001598, name: BRCA1_HUMAN
Process, participates_in: DNA repair
Function, participates_in: damaged DNA binding
Component, located_in: BRCA1-BARD1 complex
Allele: Brca1tm1Thl
Genotype: Brca1tm1Thl/Brca1tm1Thl Waptm1(cre)Arge/0, 129S1/Sv * C57BL/6J
Protein_mouse: VEGA model OTTMUSP00000002773
associated_with: mammary adenocarcinoma
expressed_in: TS28: mammary gland
effects: OMIM:114480 Breast Cancer
Other relation labels on the diagram: has_variant, part_of, encodes, associated_with.]
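One way to picture how a populated template lets a user "discover" connections is to hold the relationships as subject/relation/object triples and query them. This is a simplified sketch, not the Mouse_CCO implementation: the `neighbors` helper is hypothetical, and attaching every edge directly to the Brca1 gene node is an assumption made for brevity (in the diagram, some edges likely connect through the protein, allele, or genotype nodes).

```python
# Triples built from labels on the Brca1 slide; the subject of each
# edge is assumed to be the Brca1 gene node for simplicity.
triples = [
    ("Brca1", "orthologous_to", "BRCA1"),
    ("Brca1", "participates_in", "DNA repair"),
    ("Brca1", "participates_in", "damaged DNA binding"),
    ("Brca1", "located_in", "BRCA1-BARD1 complex"),
    ("Brca1", "expressed_in", "TS28: mammary gland"),
]


def neighbors(triples, subject, relation):
    """Return all objects linked to `subject` by `relation`, in listed order."""
    return [o for s, r, o in triples if s == subject and r == relation]


print(neighbors(triples, "Brca1", "participates_in"))
print(neighbors(triples, "Brca1", "expressed_in"))
```

Because every node type in the template comes from an existing ontology or database, the same query pattern works across genes once their annotations are loaded.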
Keys to Interoperability
self-help mantras
• Start where you are: from silos to networks
• Identify shared interests: educated self-promotion
• Develop shared processes/applications
• Discuss the ideal, implement the practical
Acknowledgements
• Gene Ontology: Mike Cherry, Suzi Lewis, Paul Sternberg, Paul Thomas
• Mouse Genome Informatics: Carol Bult, Janan Eppig, Jim Kadin, Joel Richardson, Martin Ringwald
• MGI-GO-PRO team: Karen Christie, Mary Dolan, Harold Drabkin, David Hill, Li Ni, Dmitry Sitnikov
Funding: NIH_NHGRI