Finding disease genes: A challenge for Medicine, Mathematics and Computer Science
• Andrew Collins, Professor of Genetic Epidemiology and Bioinformatics
• [email protected]

The human genome
• 22 autosomes + X and Y
• Sequence of 3,200 million base pairs (of A, T, G, C)
• Codes for ~30,000 genes
• 1000s? of genes contain mutations contributing to disease ‘phenotypes’

Single nucleotide polymorphisms (SNPs or ‘snips’)
• DNA sequence variation in one nucleotide: A, T, C, or G
• ~15 million+ SNPs – 90% of genetic variation
• Two forms (alleles) – a C/T SNP has ‘genotypes’ C,C or C,T or T,T
• How to link genotype(s) with disease phenotype(s)? Look for shared SNP mutations in families, cases vs controls

Disease gene mapping - timeline
• 1990-1998: ‘linkage mapping’ – rare genes causing severe disease in families – Cystic Fibrosis, Huntington’s disease
• 1998-2010: ‘association mapping’ – common genes involved in common disease (asthma, heart disease, diabetes) – case-control studies (~1 million SNPs)
• 2010 onwards: ‘next generation sequencing’ – test all 15 million+ SNPs. Low-frequency variants with intermediate effect on common disease

Human genome project - timeline
• 1990: Start of ‘Human Genome Project’ (to generate one genome sequence)
• 2003: One sequence completed: cost $300 million
• 2010: 3,000 sequences now completed
• 2011: 30,000 sequences expected: cost ~$5000 each

Breast cancer genetics
• Rare genes (clear inheritance in families): <25% of inherited risk
• Common genes (low-risk, association mapping): <5% of inherited risk
• ~70% of risk not explained by all breast cancer genes found so far – so, many genes are ‘missing’…

Susceptibility genes in breast cancer: more is less?
“The large number of anticipated susceptibility factors, their low predictive value and the high frequency of these variants … make these findings of limited use in clinical practice”
• Ref: Willems PJ (2007) Clin Genet 72:493-496

• 183,000 samples: found 180 ‘height’ genes
– enriched for genes in shared biological pathways
– and genes involved in skeletal growth defects
• Genes found only explain 10% of variation in height
• Many genes missing…

Linking genes with disease

Next generation sequencing
“1000 Genomes” - a deep catalog of human variation
• July 2010
– sequenced 6 people (two families - parents and a daughter)
– sequenced genomes of 179 people
– sequenced exons of 700 people (‘exomes’ - protein-coding genome)
• Ongoing
– 2,500 DNA samples from 27 populations around the world

Next generation sequence data analysis
Copyright ©2009 American Association for Clinical Chemistry

Exome data
• Sequence of protein-coding exons – one ‘exome’ contains coding regions of all ~30,000 genes
• Exome contains 30 megabases of DNA (whole genome has 3200 megabases)
• Detect all SNP variation in a person. Align ‘short reads’ (millions of sequences of ~100 bases) against the reference genome
• Requires 40X ‘depth’ to reliably identify all DNA variation

Sequence data – the ‘filtering’ problem
• Each person has 250-300 mutations that could affect protein function and 50-100 mutations implicated in inherited disorders.
Most variants have no effect on health
• To find disease gene(s), filter out ‘normal’ variation (reference data: 1000 Genomes, web databases)
• Common disease may involve complex interactions between networks of 100s of genes
• Machine learning and other mathematical tools required to interpret complex phenotype/sequence data

“The production of billions of NGS reads has also challenged the infrastructure of existing information technology systems in terms of data transfer, storage and quality control, computational analysis to align or assemble read data…”
“Advances in bioinformatics are ongoing, and improvements are needed if these systems are to keep pace with the continuing developments in NGS technologies. It is possible that the costs associated with downstream data handling and analysis could match or surpass the data-production costs…” (Metzker, Nat Rev Genet 2010, 11, 31-46.)

Some applications of DNA sequence data
• Disease gene mapping
• Disease diagnosis/disease sub-types
• Differences between populations, migration patterns
• Biotechnology (bacterial genomes, genetic engineering)
• Infectious disease control
• Evolution/Taxonomy/classification
• Archaeology
• Forensic science

Machine learning to identify genetic factors in breast cancer
• 3000 cases with early-onset breast cancer (Southampton data), genotyped with 1000s of SNPs
• Identify new breast cancer genes – integrate phenotypic data (tumour sub-types, survival, response to treatment) with genotypes/sequence and gene functional information (web databases)
• Machine learning models: test gene : gene and gene : phenotype interactions. New genes? Groups of genes distinguishing sub-types of disease?

Web-based tools to improve diagnosis of ‘dosage’ diseases
• ‘Dosage’ – number of copies of a gene (more or fewer than 2 due to duplication or deletion)
• Gene(s) in the duplicated/deleted region might cause disease if present at an abnormal ‘dose’. Which gene(s)? Identification influences patient treatments.
• Data-mine for known gene function in literature/databases.
• Prediction of disease-causing genes – machine learning models (integrate gene function, expression, known ‘dosage genes’)
• Web-based tools for mining/querying/presentation of data for clinicians to improve diagnosis.

Conclusions
• Majority of genetic variation underlying human disease is unknown
• Next-generation sequencing will, in time, reveal all of these genes
• But… finding missing disease genes in DNA sequence presents huge challenges for medicine, mathematics and computer/web science
• Exome and whole genome sequence analysis will transform all ‘bioscience’ research fields
• NGS is now generating vast data sets – novel and multidisciplinary approaches to management, visualisation, analysis and interpretation are urgently needed
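
Worked sketch: exome depth and the ‘filtering’ problem
The figures on the ‘Exome data’ slide (30 megabases, ~100-base reads, 40X depth) and the filtering step on the ‘Sequence data’ slide can be illustrated with a short Python sketch. It is not the pipeline used in the talk. The read-count arithmetic assumes an idealised case in which every read maps uniformly onto the target. The `Variant` record, the `reference_freq` lookup table, the gene labels and the 1% frequency cut-off (`max_freq`) are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

EXOME_BASES = 30_000_000       # 30 megabases (from the 'Exome data' slide)
GENOME_BASES = 3_200_000_000   # 3200 megabases (same slide)
READ_LENGTH = 100              # ~100 bases per short read
TARGET_DEPTH = 40              # 40X depth


def reads_needed(target_bases, depth=TARGET_DEPTH, read_length=READ_LENGTH):
    """Reads needed so each base is covered `depth` times on average.

    Idealised: assumes all reads map uniformly and on-target.
    """
    return depth * target_bases // read_length


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not a real file format)."""
    gene: str
    position: int
    alt_allele: str
    affects_protein: bool  # assumed to come from a prior annotation step


def filter_candidates(variants, reference_freq, max_freq=0.01):
    """Filter out 'normal' variation.

    Keeps variants that are rare (or absent) in the reference data and
    that are annotated as possibly affecting protein function.
    `max_freq` is an illustrative threshold, not a recommended value.
    """
    kept = []
    for v in variants:
        key = (v.gene, v.position, v.alt_allele)
        freq = reference_freq.get(key, 0.0)  # unseen in reference: treat as rare
        if freq <= max_freq and v.affects_protein:
            kept.append(v)
    return kept


if __name__ == "__main__":
    print("Reads for a 40X exome: ", reads_needed(EXOME_BASES))
    print("Reads for a 40X genome:", reads_needed(GENOME_BASES))

    # Made-up example data: three variants in one person.
    person = [
        Variant("GENE_A", 100, "T", True),   # rare, may affect protein
        Variant("GENE_B", 200, "G", True),   # common in the reference data
        Variant("GENE_C", 300, "A", False),  # rare, no predicted effect
    ]
    reference = {("GENE_B", 200, "G"): 0.25}
    for v in filter_candidates(person, reference):
        print("Candidate:", v.gene, v.position, v.alt_allele)
```

Under these assumptions a 40X exome needs about 12 million reads against about 1,280 million for a 40X whole genome, which is one reason exomes are attractive. The filter is deliberately naive: it treats each variant independently, whereas the slides stress that common disease may involve interactions between many genes.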