Computational tools for
disease gene identification
Sonia ABDELHAK, PhD
Molecular Investigation of Genetic Orphan Disorders
Institut Pasteur de Tunis
Summary
• How could we identify genes involved in human disorders?
• Positional cloning in the pre-genomic era.
• Monogenic/multifactorial diseases.
• Computational tools: positional cloning in the post-genomic era.
Monogenic versus Complex Diseases: Genes & Environment
[Figure: environmental effect versus genetic component]
S.K. Brahmachari, GENOMED-HEALTH meeting
What could we learn from disease gene
identification?
Better understanding of the underlying biology
of the trait in question
 Serve as direct targets for better treatments

Pharmacogenetics
 Interventions

Predictions of susceptibility to the disease
 Predictions of the course of the disease
 Knowledge for treatment or prevention

“SIMPLE” MENDELIAN
GENETIC DISEASES

Diseases of Simple Genetic Architecture
Can tell how trait is passed in a family: follows
a recognizable pattern (Mendelian disease)
 One gene altered per family (exceptions)
 Usually quite rare in population (exceptions)
 “Causative” gene

Some examples of deleterious mutations
Stop codon creation: CAG (Gln) → TAG (stop)
Modes of inheritance
• X-linked: Duchenne muscular dystrophy
• Autosomal dominant: Huntington disease
• Autosomal recessive: Cystic fibrosis
• Mitochondrial: Leber optic atrophy
Functional cloning versus positional cloning of genes
Functional cloning: Disease → Function/Protein → Gene → Chromosomal localisation
Positional cloning: Disease → Chromosomal localisation → Gene → Function/Protein

Position-Independent Methods
Gene-specific oligonucleotides: hemophilia A Factor VIII gene (most common form of hemophilia, X-linked)
• The clotting factor was purified from pig, and its N-terminal amino acids were sequenced.
• This allowed a group of oligonucleotides to be synthesized.
• These probes were used for colony hybridization against a cDNA library.
Positional cloning of genes
Positional cloning: Disease → Chromosomal localisation → Gene → Function/Protein
(Functional cloning: Disease → Function/Protein → Gene → Chromosomal localisation)
Identification of informative families
Genetic mapping
Physical mapping
Identification of coding sequences (candidate genes)
Mutation screening
Functional analysis
Normal:  ... CCT GAG GAG ...  (... Pro Glu Glu ...)
Mutated: ... CCT GTG GAG ...  (... Pro Val Glu ...)
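The effect of the single-base change shown above (GAG to GTG, Glu to Val) can be checked by translating both sequences. The following is a minimal sketch, assuming the standard genetic code; the function name `translate` is illustrative and does not come from the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA coding sequence (reading frame 1) into one-letter amino acids."""
    dna = dna.upper().replace(" ", "")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


normal = "CCT GAG GAG"   # Pro Glu Glu
mutated = "CCT GTG GAG"  # Pro Val Glu
print(translate(normal), translate(mutated))
```

The same function reproduces the stop-codon example from the earlier slide: `translate("CAG")` gives `Q` (Gln) and `translate("TAG")` gives `*` (stop).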
Genetic mapping
Which markers are used for genetic mapping?
Polymorphisms used in gene mapping
• 1980s: RFLP marker maps
• 1990s: microsatellite marker maps
Identification of microsatellite polymorphisms by sequence analysis:
IL-12p35AC F
tggtggcagaaatcattgtctgaaaagtaattgttttacttttattcttttcgtgtgtgtgtgtgt
gtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgcatgtgccagatttcttgtttgaaaggcaat
gagcttcatccaagtatcaa
78.57%
IL-12p35AC R
IL-12p40AC F
atttcaggtgtgagccactgtgcctggccagaactttttcaatgaatattcaagataattgtata
cacattttatatatatatatatatatacacacacacacacacacacatatgtatacacaca
ttatatatataatccatgttatatacatctctacattatatatatccactatatatattttacttataca
tatagattttatttttatgaactaggatcaaattgta
69.23%
IL-12p40AC R
[Figure: microsatellite genotyping, lanes 1 to 5; band labels 174, 170, 166]
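Dinucleotide repeats such as the (GT)n and (AT)n/(AC)n runs visible in the two sequences above can be located with a simple pattern search. This is a minimal sketch, not the method used by the authors; the function name and the threshold of 8 repeat units are assumptions chosen for illustration.

```python
import re


def find_dinucleotide_repeats(seq, min_units=8):
    """Return (start, unit, n_units) for each perfect dinucleotide run
    containing at least min_units copies of the repeat unit."""
    seq = seq.lower()
    # A two-base unit followed by at least (min_units - 1) further copies of itself.
    pattern = re.compile(r"([acgt]{2})\1{%d,}" % (min_units - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        if unit[0] != unit[1]:  # skip mononucleotide runs such as "aaaa..."
            hits.append((m.start(), unit, len(m.group(0)) // 2))
    return hits


# Toy sequence (not from the slides): ten GT units flanked by unique sequence.
toy = "aaac" + "gt" * 10 + "ccaa"
print(find_dinucleotide_repeats(toy))
```

Flanking sequence on either side of each hit is where PCR primers (such as the F and R primers named above) would be placed.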
SNPs in Genetic Analysis
Abundance – lots
 Position – throughout genome
 Haplotype patterns – groups of SNPs may
provide exploitable diversity
 Rapid and efficient to genotype
 Increased stability over other types of
mutation

Gene mapping: Linkage analysis
Do marker alleles co-segregate with the disease by chance,
or are they linked to the underlying gene?
Crossing over and Recombination
Recombination Fraction
$\theta = 1/2$ : independent assortment (Mendel)
$\theta < 1/2$ : linked loci
$\theta = 0$ : tightly linked loci (no recombination)
LOD Score Analysis
The likelihood ratio as defined by Morton (1955):
$$\mathrm{L.R.} = \frac{L(\text{pedigree} \mid \theta = x)}{L(\text{pedigree} \mid \theta = 0.50)}$$
where $\theta$ represents the recombination fraction and where $0 \le x \le 0.49$.
When all meioses are "scorable", with $R$ recombinants and $NR$ non-recombinants among $N = R + NR$ meioses, the L.R. is constructed as:
$$\mathrm{L.R.} = \frac{\theta^{R}\,(1-\theta)^{NR}}{(0.5)^{N}}$$
The LOD score (z) is $\log_{10}(\mathrm{L.R.})$.
H1: Linkage
$z(\theta)$ is the lod score at a particular value of the recombination fraction.
$z(\hat{\theta})$ is the maximum lod score, which occurs at the MLE of the recombination fraction.
H0: Exclusion =0
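The LOD score formula for fully scorable meioses can be evaluated directly. The sketch below is illustrative only: the function names and the example counts (1 recombinant, 9 non-recombinants) are assumptions, and it covers only the simple phase-known case described on the slide.

```python
import math


def lod_score(theta, recombinants, non_recombinants):
    """LOD score z(theta) = log10( theta^R (1-theta)^NR / 0.5^N )."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    n = recombinants + non_recombinants
    likelihood = theta ** recombinants * (1.0 - theta) ** non_recombinants
    if likelihood == 0.0:
        # e.g. theta = 0 while recombinants were observed
        return float("-inf")
    return math.log10(likelihood / 0.5 ** n)


def max_lod(recombinants, non_recombinants):
    """Maximum LOD score; for scorable meioses the MLE of theta is R/N, capped at 0.5."""
    n = recombinants + non_recombinants
    theta_hat = min(recombinants / n, 0.5)
    return theta_hat, lod_score(theta_hat, recombinants, non_recombinants)


# Hypothetical family data: 1 recombinant among 10 scorable meioses.
theta_hat, z_max = max_lod(1, 9)
print(f"theta_hat = {theta_hat:.2f}, z_max = {z_max:.2f}")
```

At $\theta = 0.5$ the ratio is exactly 1, so the LOD score is 0, which is why independent assortment serves as the reference point.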
Identification of informative families
Cytogenetic anomalies
Animal model
Genetic mapping
Physical mapping
Identification of coding sequences (candidate genes)
Functional candidate genes
Mutation screening
Functional analysis
Normal:  ... CCT GAG GAG ...  (... Pro Glu Glu ...)
Mutated: ... CCT GTG GAG ...  (... Pro Val Glu ...)
1 to 10 years!
Branchio-oto-renal syndrome
Clinical features: deafness, renal anomalies, cervical cysts…
Mapped to 8q13.
[Figure: PAC contig; labels 11083, 9480, 4405, 10910]
cDNA library screening, cDNA selection and exon trapping
PAC (P1-derived artificial chromosome)
Sonication or partial digestion
Subcloning in pBCSK+
Selection of clones
Sequencing from the T7 and T3 primers
Sequence assembly and analysis
The different steps used for sequence analysis
• Quality assessment
• Elimination of contaminating sequences: blastn against vector, bacteria, yeast… databases
• Assembly using Phred, Phrap and Consed
• Identification of candidate genes by blastx and tblastx
• Gene prediction tool: GRAIL
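The quality assessment step relies on per-base quality values. A Phred quality score Q relates to the base-calling error probability P by Q = -10 log10(P). The sketch below illustrates that relation together with a deliberately simplified end-trimming rule; the Q20 threshold and both function names are assumptions for illustration, not the behaviour of Phred or Phrap themselves.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def trim_low_quality(bases, quals, min_q=20):
    """Remove leading and trailing bases whose quality is below min_q.

    Simplified illustration: internal low-quality bases are left untouched.
    """
    start, end = 0, len(bases)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


# Hypothetical read with poor quality at both ends.
read, quals = "AGCTAT", [5, 30, 30, 30, 30, 8]
print(trim_low_quality(read, quals))
print(phred_to_error_prob(20))  # error probability at Q20
```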
BLASTX 1.4.7 [19-Dec-94] [Build 07:11:56 Jun 16 1995]
Query= w1g9t7.Seq
(743 letters)
Translating both strands of query sequence in all 6 reading frames
Database: ../../databases/fasta/nrprot
244,544 sequences; 71,258,360 total letters.
Searching..................................................done
                                                                      Smallest
                                                                        Sum
                                                      Reading  High  Probability
Sequences producing High-scoring Segment Pairs:         Frame Score  P(N)      N
pir|S|A45174 eyes absent (eya) protein (alternatively...   -2   173  5.6e-15   1
>pir|S|A45174 eyes absent (eya) protein (alternatively spliced) - fruit fly
(Drosophila melanogaster) >gp||DRONOEYE_
Length = 760
Minus Strand HSPs:
Score = 173 (79.6 bits), Expect = 5.6e-15, P = 5.6e-15
Identities = 29/36 (80%), Positives = 34/36 (94%), Frame = -2
Query: 169 LCLPXGVRGGVDWMRKLAFRYRRVKEIYNTYKNNVG 62
LCLP GVRGGVDWMRKLAFRYR++K+IYN+Y+ NVG
Sbjct: 586 LCLPTGVRGGVDWMRKLAFRYRKIKDIYNSYRGNVG 621
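The "Identities = 29/36" figure in the alignment above can be recomputed from the two aligned strings. This is a minimal sketch; the function name is illustrative, and the query coordinates (169 down to 62, frame -2) are taken from the BLASTX output shown.

```python
def percent_identity(query, subject):
    """Count identical positions in two equal-length aligned sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(q == s for q, s in zip(query, subject))
    return matches, len(query), 100.0 * matches / len(query)


query = "LCLPXGVRGGVDWMRKLAFRYRRVKEIYNTYKNNVG"  # translated query, frame -2
sbjct = "LCLPTGVRGGVDWMRKLAFRYRKIKDIYNSYRGNVG"  # Drosophila eya protein, residues 586-621

matches, length, pct = percent_identity(query, sbjct)
print(f"Identities = {matches}/{length} ({pct:.1f}%)")

# The nucleotide span of the hit corresponds to three bases per aligned residue.
hsp_nucleotides = 169 - 62 + 1
print(hsp_nucleotides // 3 == length)
```

The unrounded value is slightly above the 80% that BLAST prints, which suggests this BLAST version truncates the percentage rather than rounding it.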
EYA1 gene structure
[Figure: EYA1 gene map; labels -1, 1, 1', 2 to 16 and -I, I, I', II to XV]
Identification of a new gene family: EYA1, EYA2, EYA3, …
COMPLEX (MULTIFACTORIAL)
GENETIC DISEASE

Diseases of Complex Genetic Architecture
No clear pattern of inheritance
 Moderate to strong evidence of being
inherited
 Common in population: cancer, heart disease,
dementia etc.
 Involves many genes and environment
 “Susceptibility” genes

Complex disease loci mapping: study designs
• Linkage analysis: large families, small families
• Association studies: family-based, case-control
TDT calculation
[Figure: pedigree with genotype labels 12, 12 and 11]
                      Non-transmitted
                      1       2
Transmitted    1      A       B
               2      C       D
$$\mathrm{TDT} = \frac{(B - C)^2}{B + C}$$
With > 5 per cell, this follows a $\chi^2$ distribution with 1 df.
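The TDT statistic uses only the two discordant cells, B and C, of the transmission table. A minimal sketch follows; the function name and the example counts are hypothetical, and the p-value uses the identity that for a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).

```python
import math


def tdt(b, c):
    """Transmission disequilibrium test from the two discordant cell counts.

    Returns the statistic (B - C)^2 / (B + C) and its chi-square (1 df) p-value.
    """
    if b + c == 0:
        raise ValueError("no informative transmissions")
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: allele 1 transmitted 30 times and not transmitted 15 times.
stat, p = tdt(30, 15)
print(f"TDT = {stat:.2f}, p = {p:.4f}")
```

The chi-square approximation is the one stated on the slide, so it should only be trusted when the cell counts exceed 5.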
Example: Alzheimer's disease and ApoE
             E4 present   E4 absent
Patients     58           33
Controls     16           55
The E4 allele appears to be positively associated with Alzheimer's disease:
Odds Ratio = (58/16)/(33/55) ≈ 6
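The odds ratio above can be reproduced from the four cell counts. The sketch below also adds an approximate 95% confidence interval using Woolf's logit method; that interval is an addition for illustration and does not appear on the slide, and the function name is hypothetical.

```python
import math


def odds_ratio(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Odds ratio with an approximate confidence interval (Woolf's logit method)."""
    estimate = (case_exposed / ctrl_exposed) / (case_unexposed / ctrl_unexposed)
    se_log = math.sqrt(
        1 / case_exposed + 1 / case_unexposed + 1 / ctrl_exposed + 1 / ctrl_unexposed
    )
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Counts from the ApoE table: patients 58 / 33, controls 16 / 55.
estimate, lower, upper = odds_ratio(58, 33, 16, 55)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval whose lower bound stays above 1 is consistent with the positive association stated on the slide.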
February 2001
« Finished » sequence
April 1953-April 2003
Past and present tools
• Genetic mapping
• Physical mapping
• Cytogenetic abnormalities
• Animal models
• Positional and functional candidates
• Genome databases and genome browsers
• Comparative genomic hybridization
• Comparative genomics
• Microarray analysis
NCBI genome browser
Visualize all the genes in an interval
UCSC genome browser
Ensembl genome browser
NCBI genome browser showing candidate region for EV

How to collect and interpret all the data?

How to choose the best “candidate” gene?
Strategies and adapted tools for gene
selection are urgently needed!

Find candidate genes for the trait (time and cost!)
• WHAT genes are there?
• WHAT do they do?
• HOW could they play a role in the disease?
• = Data mining and integration!!


Visualization of the whole picture
Global view
 Option to zoom into detail

http://www.esat.kuleuven.be/endeavour.
Disease Gene Finding
(Center for Biological Sequence Analysis)
Combining network theory and phenotype associations in an automated large
scale disease gene finding platform
Networks – deducing functional relationships from network
theory
Phenotype association
Grouping disorders based on their phenotype.
Phenotype association
Phenotype clustering:
Word vectors




Each arrow represents a
KEYWORD vector.
The components in a
keyword vector correspond to
terms in the document.
Vectors that point in the
same direction are more
alike.
Ordering phenotypes in
“syndrome families” could tell
us about the relationships of
the underlying genes.

(Brunner and van Driel 2004)
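The idea that keyword vectors pointing in the same direction are more alike is usually measured with cosine similarity. The following is a minimal sketch, assuming simple word counts as vector components; the phenotype descriptions are toy strings, and the function names are illustrative rather than taken from any of the tools discussed.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Build a keyword vector: each component is the count of one term."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two keyword vectors (1 = same direction)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy phenotype descriptions (hypothetical wording).
bor = keyword_vector("deafness renal anomalies cervical cysts")
bo = keyword_vector("deafness cervical cysts")
other = keyword_vector("skin warts papillomavirus")

print(round(cosine_similarity(bor, bo), 3))
print(cosine_similarity(bor, other))
```

Disorders whose vectors score close to 1 would be grouped into the same "syndrome family", which is the clustering step the slide describes.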

Disease gene identification.
Clues to gene interactions
pathways and functions.
%608389 BRANCHIOOTIC SYNDROME 3
14q23.1
SIX1
SIX1 mutations cause branchio-oto-renal syndrome by
disruption of EYA1-SIX1-DNA complexes.
Ruf RG, Xu PX, Silvius D, Otto EA, Beekmann F, Muerb UT, Kumar S, Neuhaus TJ, Kemper MJ, Raymond
RM Jr, Brophy PD, Berkman J, Gattas M, Hyland V, Ruf EM, Schwartz C, Chang EH, Smith RJ, Stratakis CA,
Weil D, Petit C, Hildebrandt F.
Department of Pediatrics, University of Michigan, Ann Arbor, MI 48109, USA.
Urinary tract malformations constitute the most frequent cause of chronic renal failure in the first two decades of
life. Branchio-otic (BO) syndrome is an autosomal dominant developmental disorder characterized by hearing loss.
In branchio-oto-renal (BOR) syndrome, malformations of the kidney or urinary tract are associated.
Haploinsufficiency for the human gene EYA1, a homologue of the Drosophila gene eyes absent (eya), causes BOR
and BO syndromes. We recently mapped a locus for BOR/BO syndrome (BOS3) to human chromosome 14q23.1.
Within the 33-megabase critical genetic interval, we located the SIX1, SIX4, and SIX6 genes, which act within a
genetic network of EYA and PAX genes to regulate organogenesis. These genes, therefore, represented excellent
candidate genes for BOS3. By direct sequencing of exons, we identified three different SIX1 mutations in four
BOR/BO kindreds, thus identifying SIX1 as a gene causing BOR and BO syndromes. To elucidate how these
mutations cause disease, we analyzed the functional role of these SIX1 mutations with respect to protein-protein
and protein-DNA interactions. We demonstrate that all three mutations are crucial for Eya1-Six1 interaction, and
the two mutations within the homeodomain region are essential for specific Six1-DNA binding. Identification of SIX1
mutations as causing BOR/BO offers insights into the molecular basis of otic and renal developmental diseases in
humans.
PMID: 15141091 [PubMed - indexed for MEDLINE]
Computational tools for disease gene identification
Application to EV and T2D
Olfa MESSAOUD and Manel BALI
• GeneSeeker
• DGP
• PROSPECTR
• SUSPECTS
• G2D
• TOM
GeneSeeker
http://www.cmbi.ru.nl/geneseeker/

Web tool

Gathers and combines data from several databases
(MIMMAP, MGD, GDB etc.)

Selects positional candidate genes according to their
expression and phenotypic data from both human and
mouse.
A general overview of the
GeneSeeker program
Output of the GeneSeeker program
G2D= Genes to Diseases
http://www.ogic.ca/projects/g2d_2/

Scoring all terms in GO according to their relevance
to each disease using MEDLINE and RefSeq.

Identifying candidate genes by performing BLASTX
searches.
[Screenshot: G2D query form. Values shown: 131244; band(s) q13.2; 63950000, 73950000; databases used; 3667, 3630, 3767]
DGP= Disease Gene Prediction
http://cgg.ebi.ac.uk/services/dgp/

A decision tree-based model built based on sequence
properties.

This model is then applied to all the genes in the
disease loci analysed in order to obtain a probability
score for these proteins to be involved in hereditary
disease.
[Screenshot: DGP query. Values shown: 22500000, 33200000]
PROSPECTR
http://www.genetics.med.ed.ac.uk/prospectr/

Automatic classifier based on sequence features
using the alternating decision tree algorithm which
ranks genes in the order of likelihood of involvement
in disease

Score > 0.5: likely to be involved
Score < 0.5: unlikely to be involved
SUSPECTS
http://www.genetics.med.ed.ac.uk/suspects/

Web-based server.

Builds on PROSPECTR (sequence features)
and combines annotation data (from GO,
InterPro and expression libraries).
[Screenshot: SUSPECTS query form; band q21.1]
TOM= Transcriptomics of OMIM
http://www-micrel.deis.unibo.it/~tom
An automated pipeline for the extraction
of the best candidate genes for a given
genetic disease.
Global description of the process
The second option (two loci option) is designed for poorly characterized diseases
when no specific gene is a priori known. At least 2 linkage areas need to be present.
(Looks for pairs that have similar expression and functional profiles)
The results page (genes and GO annotation)
Application
- A monogenic disorder:
Epidermodysplasia verruciformis
- A multifactorial disorder:
Type 2 diabetes
Epidermodysplasia
verruciformis (EV)
Genetic skin disease (genodermatosis)
 Predisposition to skin cancer
 High susceptibility to human papillomavirus
(HPV)

Genomic organisation of EV1 locus
(Ramoz et al., 2002)
Haplotypic analysis of microsatellites
(A) Sources of input data for each method, (B) number of genes in the starting
candidate set and number of genes selected by each method
Methods: GeneSeeker, DGP, Prospectr, Suspects, G2D, TOM
(A) Input
PubMed abstracts: X
Sequence data: X X X
GO annotation: X X X X X X X
Protein data: X X
Expression libraries: X X
Orthologous mouse genes: X
OMIM: X X X
(B) Number of genes selected
                                  GeneSeeker  DGP   Prospectr  Suspects  G2D   TOM
EV   Starting set of candidates   85          85    85         85        85    85
EV   Selected genes               11          37    40         45        20    54
T2D  Starting set of candidates   260         260   260        260       260   ?
T2D  Selected genes               24          76    14         26        3     ?
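The counts of selected genes can be turned into the fraction of the starting candidate set that each tool retains. This sketch assumes that the numbers are listed in the same order as the methods (GeneSeeker, DGP, PROSPECTR, SUSPECTS, G2D, TOM) and that the first block refers to EV and the second to T2D; the "?" entries for TOM on T2D are kept as missing values.

```python
# Counts transcribed from the table; the column order is an assumption.
STARTING = {"EV": 85, "T2D": 260}
SELECTED = {
    "EV": {"GeneSeeker": 11, "DGP": 37, "PROSPECTR": 40,
           "SUSPECTS": 45, "G2D": 20, "TOM": 54},
    "T2D": {"GeneSeeker": 24, "DGP": 76, "PROSPECTR": 14,
            "SUSPECTS": 26, "G2D": 3, "TOM": None},  # None = "?" in the table
}


def retained_fraction(disease):
    """Fraction of the starting candidates kept by each tool (missing values skipped)."""
    total = STARTING[disease]
    return {tool: n / total for tool, n in SELECTED[disease].items() if n is not None}


def most_selective(disease):
    """Tool that retains the smallest fraction of the candidate set."""
    fractions = retained_fraction(disease)
    return min(fractions, key=fractions.get)


for disease in STARTING:
    for tool, frac in retained_fraction(disease).items():
        print(f"{disease:4s} {tool:10s} {frac:6.1%}")
    print(disease, "most selective:", most_selective(disease))
```

A small retained fraction only indicates how aggressively a tool filters; it says nothing about whether the true disease gene is among the genes kept.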
Personal
annotation
GeneSeeker
DGP
PROSPECTR
SUSPECTS
G2D
TOM
(SLC30A6)
ALK
HADHB
SPG4
OTOF
BFSP2
LBH
BIRC6
CARD12
SNX17
HADHA
KCNK3
KRT19
KHK
SLC5A6
MSH2
NULL
OTOF
XDH
KRT12
FOSL2
GTF3C2
PDE1C
LBH
CAD
SLC5A6
KRT18
KRT18
PREB
POMC
BIRC6
GALNTM4
CAD
GFAP
PPP1CB
KCNK3
PPM1G
SLC5A6
KCNK3
HADHB
NEF3
GTF3C2
NRBP1
SDC1
POMC
SLC5A6
SPG4
KRT23
HADHA
SELI
(SLC23A3)
HADHA
KIF3C
NP056477
KRT33B
KRTCAP3
RAB10
SMARCAD1
XDH
RNF30
HADHA
KRT1
FLJ20254
SOS1
SRD5A2
MAPRE3
RAB10
KRT14
XDH
SRD5A2
EIF2B4
DPSYSL5
CENPA
KRT35
HADHB
OTOF
ALK
EHD3
KRT14
PPP1CB
SPG4
XDH
GALNT14
KRT15
HIBCH
Comparison between Results obtained by each method
Conclusion
Several promising computational tools
Need for more accurate
methods
Thank you!
Some References and H-References
• For a good review see: Nucleic Acids Res. 2006 Jun 6;34(10):3067-81.
• kc.vanderbilt.edu/quant/Seminar/StatGen02-2006.ppt
• http://www.cbs.dtu.dk/
• http://www.bios.niu.edu/johns/humgen/Finding_Disease_Genes.ppt
