Finding disease genes: A challenge for Medicine, Mathematics and Computer Science
• Andrew Collins, Professor of Genetic Epidemiology and Bioinformatics
1
The human genome
• 22 pairs of autosomes + the sex chromosomes X and Y
• Sequence of 3,200 million base pairs
(of A,T,G,C)
• Codes for ~30,000 genes
• Possibly 1000s of genes contain mutations contributing to disease ‘phenotypes’
2
Single nucleotide polymorphisms
(SNPs or ‘snips’)
• DNA sequence variation in one nucleotide: A, T, C, or G
• ~15 million+ SNPs – 90% of genetic variation
• Two forms (alleles): a C/T SNP has ‘genotypes’ C,C or C,T or T,T
• How to link genotype(s) with disease phenotype(s)? Look for SNP variants shared by affected relatives in families, or differing in frequency between cases and controls
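The case vs control comparison can be illustrated with a simple allelic association test. This is only a sketch: the genotype counts are made up, the function names are hypothetical, and a Pearson chi-square on a 2x2 table of allele counts is one common choice, not necessarily the method used in the studies described in this talk.

```python
def allele_counts(genotypes, allele):
    """Count copies of `allele` and the total number of alleles.

    Each genotype is a two-letter string such as 'CT' (one C, one T).
    """
    copies = sum(g.count(allele) for g in genotypes)
    return copies, 2 * len(genotypes)


def allelic_chi_square(cases, controls, allele):
    """Pearson chi-square (1 degree of freedom) on a 2x2 allele-count table."""
    a, n_case = allele_counts(cases, allele)      # allele copies in cases
    c, n_ctrl = allele_counts(controls, allele)   # allele copies in controls
    b, d = n_case - a, n_ctrl - c                 # copies of the other allele
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no test is possible
    return n * (a * d - b * c) ** 2 / denom


# Made-up data for a C/T SNP: 100 cases and 100 controls.
cases = ["TT"] * 30 + ["CT"] * 50 + ["CC"] * 20
controls = ["TT"] * 10 + ["CT"] * 40 + ["CC"] * 50

chi2 = allelic_chi_square(cases, controls, "T")
print(f"chi-square = {chi2:.2f}")
```

A value above 3.84 corresponds to p < 0.05 for a single test with 1 degree of freedom. With a million SNPs tested, a far stricter threshold is needed to allow for multiple testing.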
3
Disease gene mapping - timeline
• 1990-1998: ‘linkage mapping’ of rare genes causing severe disease in families (Cystic Fibrosis, Huntington’s disease)
• 1998-2010: ‘association mapping’ of common genes involved in common disease (asthma, heart disease, diabetes), using case-control studies (~1 million SNPs)
• 2010 onwards: ‘next generation sequencing’ to test all 15 million+ SNPs, including low-frequency variants with intermediate effect on common disease
4
Human genome project - timeline
• 1990: Start of ‘Human Genome Project’
(to generate one genome sequence)
• 2003: One sequence completed: cost
$300 million
• 2010: 3,000 sequences now completed
• 2011: 30,000 sequences expected: cost
~$5000 each
5
Breast cancer genetics
• Rare genes (clear inheritance in families): <25% of inherited risk
• Common genes (low-risk, association mapping): <5% of inherited risk
• ~70% of risk is not explained by all breast cancer genes found so far, so many genes are ‘missing’…
7
Susceptibility genes in breast
cancer: more is less?
“The large number of anticipated
susceptibility factors, their low
predictive value and the high frequency
of these variants … make these findings
of limited use in clinical practice”
• Ref: Willems PJ (2007) Clin Genet 72:493-496
8
• 183,000 samples: found 180 ‘height’ genes
– enriched for genes in shared biological pathways
– and genes involved in skeletal growth defects
• Genes found explain only 10% of variation in height
• Many genes missing…
9
Linking genes with disease
10
Next generation sequencing
“1000 Genomes”
- a deep catalog of human variation
• July 2010
– Sequenced 6 people (two families, each with parents and a daughter)
– Sequenced the genomes of 179 people
– Sequenced the exons of 700 people (‘exomes’: the protein-coding part of the genome)
• Ongoing
– 2,500 DNA samples from 27 populations around the world
11
Next generation sequence data analysis
12
Copyright ©2009 American Association for Clinical Chemistry
Exome data
• Sequence of protein-coding exons: one ‘exome’ contains the coding regions of all ~30,000 genes
• Exome contains 30 megabases of DNA (whole genome has 3,200 megabases)
• Detect all SNP variation in a person: align ‘short reads’ (millions of sequences of ~100 bases) against the reference genome
• Requires 40X ‘depth’ to reliably identify all DNA variation
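The figures above imply a rough read count. As a back-of-envelope sketch, mean depth is approximately (number of reads × read length) / target size. The function name is hypothetical, and the estimate assumes perfectly even coverage with no duplicate or off-target reads, so real experiments need more reads than this.

```python
import math


def reads_needed(target_bases, depth, read_length):
    """Reads required for a given mean depth.

    Uses depth = reads * read_length / target_bases, rearranged for reads.
    """
    return math.ceil(target_bases * depth / read_length)


EXOME_BASES = 30_000_000        # 30 megabases, as quoted above
GENOME_BASES = 3_200_000_000    # 3,200 megabases, as quoted above

exome_reads = reads_needed(EXOME_BASES, depth=40, read_length=100)
genome_reads = reads_needed(GENOME_BASES, depth=40, read_length=100)

print(f"Exome:  {exome_reads:,} reads")
print(f"Genome: {genome_reads:,} reads")
```

Under these assumptions an exome at 40X needs about 12 million reads of 100 bases, and a whole genome at the same depth needs over a hundred times more.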
13
Sequence data – the ‘filtering’
problem
• Each person has 250-300 mutations that could affect protein function and 50-100 mutations implicated in inherited disorders. Most variants have no effect on health
• To find disease gene(s), filter out ‘normal’ variation (reference data: 1000 Genomes, web databases)
• Common disease may involve complex interactions between networks of 100s of genes
• Machine learning and other mathematical tools are required to interpret complex phenotype/sequence data
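The filtering step can be sketched as a lookup against reference allele frequencies. Everything here is illustrative: the `Variant` type, the `filter_rare` function, the 1% frequency cut-off and the example variants are assumptions, not taken from the 1000 Genomes data or from any particular tool.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def filter_rare(patient_variants, population_freq, max_freq=0.01):
    """Keep variants that are absent from the reference panel or rarer than max_freq.

    Variants missing from `population_freq` are treated as frequency 0.0 (novel).
    """
    return [v for v in patient_variants
            if population_freq.get(v, 0.0) < max_freq]


# Hypothetical reference panel: variant -> allele frequency in healthy people.
population_freq = {
    Variant("1", 1000, "A", "G"): 0.35,    # common, so likely 'normal' variation
    Variant("2", 5000, "C", "T"): 0.002,   # rare
}

# Hypothetical variants found in one patient.
patient = [
    Variant("1", 1000, "A", "G"),
    Variant("2", 5000, "C", "T"),
    Variant("7", 42, "G", "A"),            # not in the panel at all
]

candidates = filter_rare(patient, population_freq)
for v in candidates:
    print(v)
```

Filtering of this kind only shortens the candidate list. Most rare variants are still harmless, which is why further evidence (gene function, family data, phenotype) is needed.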
14
“The production of billions of NGS reads has
also challenged the infrastructure of existing
information technology systems in terms of data
transfer, storage and quality control,
computational analysis to align or assemble
read data….”
“Advances in bioinformatics are ongoing, and
improvements are needed if these systems are
to keep pace with the continuing developments
in NGS technologies. It is possible that the costs
associated with downstream data handling and
analysis could match or surpass the data-production costs…” (Metzker, Nat Rev Genet 2010, 11, 31-46.)
17
Some applications of DNA
sequence data
• Disease gene mapping
• Disease diagnosis/disease sub-types
• Differences between populations, migration patterns
• Biotechnology (bacterial genomes, genetic
engineering)
• Infectious disease control
• Evolution/Taxonomy/classification
• Archaeology
• Forensic science
18
Machine learning to identify
genetic factors in breast cancer
• 3000 cases with early-onset breast cancer (Southampton data), genotyped for 1000s of SNPs
• Identify new breast cancer genes: integrate phenotypic data (tumour sub-types, survival, response to treatment) with genotypes/sequence and gene functional information (web databases)
• Machine learning models: test gene-gene and gene-phenotype interactions. New genes? Groups of genes distinguishing sub-types of disease?
19
Web-based tools to improve
diagnosis of ‘dosage’ diseases
• ‘Dosage’: number of copies of a gene (more or fewer than 2 due to duplication or deletion)
• Gene(s) in a duplicated/deleted region might cause disease if present at an abnormal ‘dose’. Which gene(s)? Identification influences patient treatment.
• Data-mine for known gene function in literature/databases.
• Prediction of disease-causing genes: machine learning models (integrate gene function, expression, known ‘dosage genes’)
• Web-based tools for mining/querying/presentation of data for clinicians to improve diagnosis.
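The first step of such a tool, listing the genes inside a duplicated or deleted region and ordering them by prior evidence, might look like the sketch below. The gene names, coordinates and scores are invented placeholders. In practice the score would come from a trained model or curated database, neither of which is specified here.

```python
def dosage_state(copy_number, expected=2):
    """Classify a copy number relative to the expected 2 copies (autosomal gene)."""
    if copy_number < expected:
        return "deletion"
    if copy_number > expected:
        return "duplication"
    return "normal"


def genes_in_region(genes, chrom, start, end):
    """Return names of genes overlapping the interval [start, end] on `chrom`.

    `genes` maps gene name -> (chromosome, gene_start, gene_end).
    """
    return [name for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and g_start <= end and g_end >= start]


# Hypothetical gene annotation and hypothetical 'dosage sensitivity' scores.
genes = {
    "GENE_A": ("17", 100, 500),
    "GENE_B": ("17", 450, 900),
    "GENE_C": ("17", 2000, 2500),
    "GENE_D": ("3", 100, 500),
}
dosage_score = {"GENE_A": 0.1, "GENE_B": 0.9}

# A patient with one copy of chromosome 17, positions 400-1000.
state = dosage_state(1)
hits = genes_in_region(genes, "17", 400, 1000)
ranked = sorted(hits, key=lambda g: dosage_score.get(g, 0.0), reverse=True)

print(state, ranked)
```

Genes with no score default to 0.0 and sort last, so unstudied genes are not excluded from the list shown to the clinician.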
20
Conclusions
• Majority of genetic variation underlying human disease
is unknown
• Next-generation sequencing will, in time, reveal all of
these genes
• But…finding missing disease genes in DNA sequence
presents huge challenges for medicine, mathematics
and computer/web science
• Exome and whole genome sequence analysis will
transform all ‘bioscience’ research fields
• NGS is now generating vast data sets - novel and
multidisciplinary approaches to management,
visualisation, analysis and interpretation are urgently
needed
21