Download Lecture slides

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

List of types of proteins wikipedia , lookup

Magnesium transporter wikipedia , lookup

Gene desert wikipedia , lookup

Promoter (genetics) wikipedia , lookup

Protein moonlighting wikipedia , lookup

Gene expression wikipedia , lookup

Molecular evolution wikipedia , lookup

Community fingerprinting wikipedia , lookup

Genome evolution wikipedia , lookup

RNA-Seq wikipedia , lookup

Endogenous retrovirus wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Interactome wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Gene expression profiling wikipedia , lookup

Gene nomenclature wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Gene regulatory network wikipedia , lookup

Transcript
Disease Gene Candidate Prioritization by Integrative Biology
Table of contents:
Where are we in the course pipeline?
Maturation of the project and the project description
Background
Networks – deducing functional relationships from PPI data networks
Protein interaction networks
Functional modules / network clusters
Phenotype association
Grouping disorders based on their phenotype.
Biological implications of phenotype clusters.
Method and examples
Integrating protein interaction data and phenotype associations in an automated
large scale disease gene finding platform
The DNA Array Analysis Pipeline
Question
Experimental Design
Array design
Probe design
Sample Preparation
Hybridization
Buy Chip/Array
Image analysis
Normalization
Expression Index
Calculation
Comparable
Gene Expression Data
Statistical Analysis
Fit to Model (time series)
Advanced Data Analysis
Clustering
Meta analysis
PCA
Classification
Survival analysis
Promoter Analysis
Regulatory Network
Project description
Masters thesis sept 2003
Niels Tommerup Professor, Centre
Director, Dr. Med.
Søren Brunak, Professor, center
director, Dr. Phil, PhD
Project description
Project:
Find disease
genes
Niels Tommerup Professor, Centre
Director Dr. Med.
Using
Bioinformatics
Søren Brunak, Professor, center
director, Dr. Phil, PhD
Project description
Project:
Disease gene candidate prioritization by
integrating protein interaction and phenotype
association data.
Niels Tommerup Professor, Centre
Director Dr. Med.
Søren Brunak, Professor, center
director, Dr. Phil, PhD
Project description
Abstract
The availability of the first draft of the human genome in 2001 (Venter, Adams et al. 2001) led to an
increase in the number of methods for disease gene identification. However, the general number of
candidates in most loci linked to a particular phenotype is in the hundreds (McCarthy, Smedley et al.
2003; van Driel, Cuelenaere et al. 2003), and the underlying genes in over 900 of the ~ 2550 loci
associated with a phenotype in the “Online Mendelian Inheritance in Man” (OMIM) database, have not
yet been identified (Hamosh, Scott et al. 2005). Evidently disease gene identification continues to be a
very strenuous challenge, since mutational analysis of hundreds of candidates in a critical interval using
methods currently available is extremely resource demanding. Thus, prioritising the candidates based on
different criteria followed by an extensive investigation of promising candidates, is a logical step in the
disease gene finding process.
With the advent of proteomics, we are now able to retrieve information on gene functions in a large-scale
manner, thus bridging the gap between genotype and phenotype, a possibility with significant interest for
disease gene candidate prioritization.
We propose that automated correlation of phenotype association networks, with interolog data (the
transfer of protein interactions between orthologous protein pairs in different organisms), is a powerful
way of identifying good disease gene candidates in a large list of genes in loci associated with a
phenotype. Our method automatically identifies potential functional modules consisting of protein
components, where at least one of the components is a disease related protein. When such incriminated
modules are identified, the remaining protein components of the module are correlated with loci in the
genome associated with a similar phenotype. A hit is reported if other protein components of the
incriminated module are the product of genes in loci associated with an identical or overlapping
phenotype. Using this large scale approach we show that a gene in a locus is a heavily incriminated
candidate, if the protein product of the gene interacts with a protein involved in a similar or identical
phenotype, and publish a list of 60 likely candidates in various disorders.
Project description
Abstract
The availability of the first draft of the human genome in 2001 (Venter, Adams et al. 2001) led to an
increase in the number of methods for disease gene identification. However, the general number of
candidates in most loci linked to a particular phenotype is in the hundreds (McCarthy, Smedley et al.
2003; van Driel, Cuelenaere et al. 2003), and the underlying genes in over 900 of the ~ 2550 loci
associated with a phenotype in the “Online Mendelian Inheritance in Man” (OMIM) database, have not
yet been identified (Hamosh, Scott et al. 2005). Evidently disease gene identification continues to be a
very strenuous challenge, since mutational analysis of hundreds of candidates in a critical interval using
methods currently available is extremely resource demanding. Thus, prioritising the candidates based on
different criteria followed by an extensive investigation of promising candidates, is a logical step in the
disease gene finding process.
With the advent of proteomics, we are now able to retrieve information on gene functions in a large-scale
manner, thus bridging the gap between genotype and phenotype, a possibility with significant interest for
disease gene candidate prioritization.
We propose that automated correlation of phenotype association networks, with interolog data (the
transfer of protein interactions between orthologous protein pairs in different organisms), is a powerful
way of identifying good disease gene candidates in a large list of genes in loci associated with a
phenotype. Our method automatically identifies potential functional modules consisting of protein
components, where at least one of the components is a disease related protein. When such incriminated
modules are identified, the remaining protein components of the module are correlated with loci in the
genome associated with a similar phenotype. A hit is reported if other protein components of the
incriminated module are the product of genes in loci associated with an identical or overlapping
phenotype. Using this large scale approach we show that a gene in a locus is a heavily incriminated
candidate, if the protein product of the gene interacts with a protein involved in a similar or identical
phenotype, and publish a list of 60 likely candidates in various disorders.
Background
Background
Finding genes responsible for major genetic disorders can lead to
diagnostics, potential drug targets, treatments and large amounts of
information about molecular cell biology in general.
Background
Methods for disease gene finding post genome era (>2001):
Mircodeletions
http://www.med.cmu.ac.th/dept/pediatrics/06interest-cases/ic-39/case39.html
Translocations
http://www.rscbayarea.com/image
s/reciprocal_translocation.gif
Linkage analysis
Fagerheim et al 1996.
1q21-1q23.1
chr1:141,600,00-155,900,000
Background
Automated methods for disease gene finding int the post genome era (>2001):
?
Grouping:
Tissues, Gene Ontology, Gene
Expression, MeSH terms …….
(Perez-Iratxeta, Bork et al. 2002)
(Freudenberg and Propping 2002)
(van Driel, Cuelenaere et al. 2005)
(Hristovski, Peterlin et al. 2005)
Disease Gene Finding.
Summery
Background
Why do we want to find disease genes, how has it been done until now?
Networks – deducing functional relationships from network theory
Protein interactionnetworks
Functional modules / network clusters
Phenotype association
Grouping disorders based on their phenotype.
Biological implications of phenotype clusters.
Method and examples
Combining network theory and phenotype associations
in an automated large scale disease gene finding platform
proof of concept.
Status of pipeline / infrastructure
Networks and functional modules
Deducing functional relationships from protein interaction networks
Networks and functional modules
Deducing functional relationships from network theory
Network theory is boooooooooring
Networks
Text mining of full text corpora e.g PubMed Central
http://www.biosolveit.de/ToPNet/screenshots/fig1.html
Networks
Protein interaction networks of physical interactions.
(Barabasi and Oltvai 2004).
Networks
daily
Social Networks, The CBS interactome
weekly
monthly
(de Licthenberg et al.)
Networks
daily
Social Networks, The CBS interactome
weekly
monthly
(de Licthenberg et al.)
Extracting functional data from protein interaction networks
The Ach receptor
involved in
Myasthenic
Syndrome.
InWeb
Homo Sapiens
Dynamic
funcional
module:
Eg:
Cell cycle
regulation
Metabolism
Trans-organism protein interaction network
Orthologs?
Orthologous genes are direct descendants of a gene in a common
ancestor:
S.Cerevisiae
D. Melanogaster
H.Sapiens
(O'Brien K, Remm et al. 2005)
Trans-organism protein interaction network
H.Sapiens MOSAIC
D. Melanogaster Experim.
C. Elegans Experim.
S. Cerevisiae Experim.
Infrastructure status
BIND
IntAct
DIP
Web server
Opis
MINT
HPRD
Command line
Inweb.pl
GRID
Handcurated sets
PPI – pred.
Trans-organism ppi
pipeline
InWeb
Homo Sapiens
>122.000 int.
> 22.000 genes
CBS Datawarehouse
Download/reformat db’s
Scoring
A)
Topological
B)
No publ.
Extraction
perl modules
Direct SQL access
XML or SIF output
Disease Gene Finding.
Summery
Background
Why do we want to find disease genes, how has it been done until now?
Networks – deducing functional relationships from network theory
Protein interactionnetworks
Functional modules / network clusters
Phenotype association
Grouping disorders based on their phenotype.
Biological implications of phenotype clusters.
Method and examples
Combining network theory and phenotype associations
in an automated large scale disease gene finding platform
proof of concept.
Status of pipeline / infrastructure
Phenotype association
Phenotype association
Zelwegger syndrome
Absent liver peroxisomes
Hepatomegaly
Intrahepatic biliary dysgenesis
Prolonged neonatal jaundice
Pyloric hypertrophy
Patent ductus arteriosus
Ventricular septal defects
Bell-shaped thorax
Small adrenal glands
Absent renal peroxisomes
Clitoromegaly
Cryptorchidism
Hydronephrosis
Hypospadias
Renal cortical microcysts
Failure to thrive
Abnormal electroretinogram
Abnormal helices
Anteverted nares
Brushfield spots
Cataracts
Corneal clouding
Epicanthal folds
Flat facies
Flat occiput
Glaucoma
High arched palate
High forehead
Hypertelorism
Large fontanelles
Macrocephaly
Micrognathia
Nystagmus
Pale optic disk
Pigmentary retinopathy
Posteriorly rotated ears
Protruding tongue
Redundant skin folds of
neck
Round facies
Sensorineural deafness
Turribrachycephaly
Upward slanting
Hyporeflexia or areflexia
Hypotonia
Polymicrogyria
Seizures
Severe mental
retardation
Subependymal cysts
Pulmonary hypoplasia
Cubitus valgus
Delayed bone age
Metatarsus adductus
Rocker-bottom feet
Stippled epiphyses
(especially patellar and
acetabular regions)
Talipes equinovarus
Transverse palmar
crease
Ulnar deviation of hands
Wide cranial sutures
Transverse palmar
crease
Heterotopias/abnormal
migration
Hypoplastic olfactory
lobes
palpebral fissures
Autosomal recessive
Albuminuria
Aminoaciduria
Decreased dihydroxyacetone
phosphate acyltransferase (DHAPAT) activity
Decreased plasmologen
Elevated long chain fatty acids
Elevated serum iron and iron binding
capacity
Increased phytanic acid
Pipecolic acidemia
Breech presentation
Death usually in first year of life
Genetic heterogeneity
Infants occasionally mistaken as
having Down syndrome
Agenesis/hypoplasic corpus collosum
Phenotype association
Word vectors
Reference : Zelwegger Syndrome (214100)
214100
202370
Phenotype
Sim. Score
Adrenoleukodystrophy (202370)
0.781
Hyperpipecolatemia (239400)
0.703
Cerebrohepatorenal Syndr. (214110)
0.682
Refsum Disease (266510)
0.609
A relationship between the infantile form of Refsum
disease and Zellweger syndrome was suggested by
the observations of Poulos et al. (1984) in 2 patients.
In the infantile form of Refsum disease, as in Zellweger
syndrome, peroxisomes are deficient and peroxisomal
functions are impaired (Schram et al., 1986). Clinically,
infantile Refsum disease, ZWS, and adrenoleukodystrophy have several overlapping features.
(Stokke et al., 1984).
(http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=266510)
Phenotype association
Phenotype
association
network
Word vectors
CerebroHepatorenal
Zelwegger
Refsum
Adrenoleuko
-dystrophy
Disease Gene Finding.
Summery
Background
Why do we want to find disease genes, how has it been done until now?
Networks – deducing functional relationships from network theory
Protein interactionnetworks
Functional modules / network clusters
Phenotype association
Grouping disorders based on their phenotype.
Biological implications of phenotype clusters.
Method and examples
Combining network theory and phenotype associations
in an automated large scale disease gene finding platform
proof of concept.
Method –
Proof of concept
Method
InWeb
Word vectors
Homo Sapiens
Phenotype clustering
Results - Benchmark
DE SANCTIS-CACCHIONE
SYNDROME
Gene map locus 10q11
>12MB area, 103 ranked genes
CLINICAL FEATURES
De Sanctis and Cacchione (1932)
reported a condition, which they
called 'xerodermic idiocy,' in which
patients had xeroderma
pigmentosum, mental deficiency,
progressive neurologic
deterioration, dwarfism, and
gonadal hypoplasia.
http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=
278800
MIM
RANK
GENE
P-value
TRUE
278800
278800
1
2
ENSG00000032514
ENSG00000188611
0.300326793109544
0.0125655342047565
*
278800
278800
278800
278800
278800
278800
278800
.
.
.
.
.
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
278800
2
2
3
3
4
4
4
.
.
.
.
.
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
5
6
ENSG00000138297
ENSG00000165406
ENSG00000196693
ENSG00000185532
ENSG00000197910
ENSG00000165383
ENSG00000172538
.
.
.
.
.
ENSG00000165511
ENSG00000182354
ENSG00000172661
ENSG00000165507
ENSG00000178440
ENSG00000138299
ENSG00000197704
ENSG00000012779
ENSG00000197354
ENSG00000189090
ENSG00000107551
ENSG00000126542
ENSG00000198364
ENSG00000185849
ENSG00000150165
ENSG00000128815
ENSG00000178645
ENSG00000138293
ENSG00000176833
ENSG00000179251
ENSG00000169826
ENSG00000172678
ENSG00000197752
ENSG00000107643
ENSG00000165733
0.0125655342047565
0.0125655342047565
0.0121357313793756
0.0121357313793756
0.00680983722337082
0.00680983722337082
0.00680983722337082
.
.
.
.
.
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00680983722337082
0.00412573091718715
0.000263885640603109
278800
7
ENSG00000169813
6,63E+07
Results – Benchmarking
DE SANCTIS-CACCHIONE
SYNDROME
Ranked 1
P-value 0.300326793109544
#278800 DE SANCTISCACCHIONE SYNDROME
DNA excision repair
protein ERCC-6
*126340 DNA REPAIR DEFECT EM9 OF
CHINESE HAMSTER OVARY CELLS,
COMPLEMENTATION OF; EM9
#133540 COCKAYNE SYNDROME CKN2
#278730 XERODERMA PIGMENTOSUM,
COMPLEMENTATION GROUP D
DNA excision repair protein
ERCC-2
#601675 TRICHOTHIODYSTROPHY
Eukaryotic initiation
factor 4A-I (eIF4A-I)
Eukaryotic translation initiation
factor 4E (eIF4E)
Results – Benchmarking
DE SANCTIS-CACCHIONE
SYNDROME
Ranked 2
P-value 0.0125655342047565
Disease Gene Finding.
Summery
Background
Why do we want to find disease genes, how has it been done until now?
Networks – deducing functional relationships from network theory
Protein interactionnetworks
Functional modules / network clusters
Phenotype association
Grouping disorders based on their phenotype.
Biological implications of phenotype clusters.
Method and examples
Combining network theory and phenotype associations
in an automated large scale disease gene finding platform
proof of concept.
Method - status
In Silico:
Two in silico proof of concept hits:
BOR Syndrome
Dilated Cardiomyopathy
In Vitro:
5 Candidates being tested in the lab by mutational screening in patient material:
BOR Syndrome
Sensorineural Deafness
Holoprosencephaly
Obesity
Anhidrosis
Benchmarking:
Benchmarking by unbiased prioritizing genes in ~ 1000 critical intervals where the
actual disease gene is known.
Project description
Abstract
The availability of the first draft of the human genome in 2001 (Venter, Adams et al. 2001) led to an
increase in the number of methods for disease gene identification. However, the general number of
candidates in most loci linked to a particular phenotype is in the hundreds (McCarthy, Smedley et al.
2003; van Driel, Cuelenaere et al. 2003), and the underlying genes in over 900 of the ~ 2550 loci
associated with a phenotype in the “Online Mendelian Inheritance in Man” (OMIM) database, have not
yet been identified (Hamosh, Scott et al. 2005). Evidently disease gene identification continues to be a
very strenuous challenge, since mutational analysis of hundreds of candidates in a critical interval using
methods currently available is extremely resource demanding. Thus, prioritising the candidates based on
different criteria followed by an extensive investigation of promising candidates, is a logical step in the
disease gene finding process.
With the advent of proteomics, we are now able to retrieve information on gene functions in a large-scale
manner, thus bridging the gap between genotype and phenotype, a possibility with significant interest for
disease gene candidate prioritization.
We propose that automated correlation of phenotype association networks, with interolog data (the
transfer of protein interactions between orthologous protein pairs in different organisms), is a powerful
way of identifying good disease gene candidates in a large list of genes in loci associated with a
phenotype. Our method automatically identifies potential functional modules consisting of protein
components, where at least one of the components is a disease related protein. When such incriminated
modules are identified, the remaining protein components of the module are correlated with loci in the
genome associated with a similar phenotype. A hit is reported if other protein components of the
incriminated module are the product of genes in loci associated with an identical or overlapping
phenotype. Using this large scale approach we show that a gene in a locus is a heavily incriminated
candidate, if the protein product of the gene interacts with a protein involved in a similar or identical
phenotype, and publish a list of 60 likely candidates in various disorders.
Project description
Project:
Find disease
genes
Niels Tommerup Professor, Centre
Director Dr. Med.
Using
Bioinformatics
Søren Brunak, Professor, center
director, Dr. Phil, PhD, physicist
Acknowledgments
Disease Gene Finding :
Olga Rigina
Olof Karlberg
Zenia M. Størling
Páll Ísólfur Ólason
Kasper Lage
Anders Gorm
Anders Hinsby
Yves Moreau
Søren Brunak