Comparative genomics and Target discovery
Maarten Sollewijn Gelpke
MDI, Organon

  What is comparative genomics?
  What can we learn from comparative genomics?
  What is target discovery?
  What are the implications of comparative genomics for target discovery?
  What issues in target discovery can be addressed by comparative genomics?

Overview
  Introduction to genomes and sequencing.
  Comparative genomics aspects.
  Phylogenomics concepts.
  Examples of comparative genomics.

Sequence availability
  Availability of gene and protein sequences has increased enormously during the last two decades.
  Current capacity of the main sequencing centers is >3 Gb per month per center.
  This will increase again dramatically with the development of new superfast sequencing techniques.
  Currently >100 Gbases.

Genomes sequenced
  [Figure: timeline, 1995-2005]
  1995: first bacterial genomes sequenced, H. influenzae and M. genitalium
  1996: the yeast genome
  1997: E. coli K12
  1998: C. elegans
  1999: full sequence of chr. 22
  2000: D. melanogaster genome and chr. 21; A. thaliana
  2001: human draft
  2002-2005: mouse, Ciona, rice, Fugu, Anopheles; human finished, rat, chicken; chimpanzee, Xenopus, zebrafish

Genome sequencing
  [Figure: tree with clade labels primates, rodents, carnivores, bovidae, mammalia, aves, amniota, amphibia, tetrapods, fish, vertebrates, tunicates, chordates, insects, nematodes, metazoans]
  Evolutionary relationship between metazoans (multicellular animals) that have been sequenced or are due for sequencing.

Genome sequencing
  BAC fingerprinting, shotgun approach
  Accurate but laborious!

Shotgun sequencing (WGS)
  [Figure labels: BAC clone: 100-200 kb; whole genome: 30 Mb to 3 Gb; sheared DNA: 1.0-2.0 kb; sequencing templates; random reads; assembly; consensus sequence; finishing; low base quality; single-stranded region; gap; mis-assembly (inverted)]

Genome sequencing
  Current state of sequenced 'organisms':
  >316 Prokaryotes
  >27 Archaea
  >280 Eukaryotes (complete, in assembly or in progress)
  >1600 Viruses and >500 mitochondria/chloroplasts
  Some ongoing genome sequencing projects: poplar, gibbon, platypus, Drosophila species, a variety of pathogenic fungi and bacteria, etc.
  Meta-genomic projects on environmental samples (soil, deep sea, waste sites)

Future of genome sequencing?
  New complete genomes.
  New low-redundancy genomes.
  New (low-redundancy) genome areas.
  Meta-genomics. Sequencing of microbial communities.
  Sequencing of extinct species.
  40000 year old cave bear: 26k, 21 genes.
  45000 year old Neanderthal: 75k; diverged from human lineage ~315000 years ago

Comparative genomics
  Discover what lies hidden in genomic sequence by comparing sequence information.
  Main areas: whole genome alignment; gene prediction; regulatory element prediction; phylogenomics; pharmacogenetics
  Affected by evolutionary aspects:
  Mutational forces (introduce random mutations)
  Selection pressures
  Ratio of non-synonymous to synonymous substitutions
  Mutation rates lower or higher than neutral

Comparing sequences, methods.
  Pairwise comparison of sequences (alignments): proteins or genes
  Variety of local alignment tools like BLAST, Smith-Waterman etc.
  Multiple sequence comparisons (ClustalW, Muscle etc.)
  Results may be dependent on alignment settings

Comparing sequences, methods.
  Whole genome comparisons
  Large stretches of sequence
  Divergence up to 450 Mya (fugu-human) with sufficient similarity remaining.
  BLAT, BLASTZ, Phusion/BlastN
  Seeding strategy → alignment extension → gapped alignments

Whole genome comparison
  Conservation of synteny!
  Cross-reference of any genetic traits (diseases!) from one organism (e.g. mouse) to genes in the syntenic regions in the other organism (e.g. human).
  Genome expansion and contraction
  Genome duplications, segmental duplications: important mechanism for generating new genes.
  (G+C) content, CpG islands
  Reflect different mutational or DNA repair processes?
  Repeats
  Transposable elements are a main force in reshaping genomes.
  TEs (or remainders thereof) can be used to measure evolutionary forces acting on the genome. Neutral mutation rate.

Gene prediction
  Comparing sequences has contributed enormously to the accuracy of gene prediction.
  Evidence-based method.
  Use cDNAs, ESTs and proteins from various organisms.
  Apply gene feature rules.
  [Figure labels: gene model; proteins; clustered ESTs; cDNAs]

Gene prediction
  De novo methods.
  Alignment of genomic sequences
  Splicing rules and other gene features
  De novo gene prediction by comparing sequences attempts to model negative selection against mutations. Areas with fewer mutations are conserved because the mutations were detrimental for the organism.
  Prediction of similar proteins in both genomes.
  Newly predicted protein in mouse and human, similar to the disease-related gene dystrophin.

Regulatory element prediction
  The complexity of higher eukaryotes and their relatively low number of genes can be explained partially through the importance of transcriptional regulation.
  Identification of REs will have an extensive impact on understanding gene expression patterns (expression intensity, tissue specificity), relations within expression patterns, and inferring biological systems or networks.

Regulatory element prediction
  No formal models for regulatory motifs
  Attempt to find conserved regions or motifs based on the global alignment of similar sequences of different organisms (phylogenetic footprinting).
  Which species to compare? Evolutionary distance?
  What regions around gene models to investigate? 5' and 3' flanking regions, introns?
  Take expression patterns into account?
  How does evolution affect REs?

Phylogenomics
  Comparison of genes and gene products across a number of species (whole genomes), characterizing homologues and gaining insight into the evolutionary process itself.
  Pharmacophylogenomics is the use of phylogenomics in aid of drug discovery, through improved target selection and validation.

Orthology and paralogy
  [Figure: phylogenetic tree of gene X]
  Orthologs: genes in different species that arose from a single gene in the most recent common ancestor, by speciation.
  Paralogs: genes in the same species that arose from a single gene in an ancestral species, by a process of gene duplication.

Target orthology
  Species differences frequently affect progression of targets and compounds.
  Orthology maps in combination with expression studies may explain these differences.

Establishing orthology
  Reciprocal highest-scoring BLAST hit.
  Conservation of synteny.
  Gene loss or rate of evolution issues.
  Orthology does not guarantee common function (functional shift).
  Extensive sequence divergence
  High non-synonymous over synonymous nucleotide substitution ratios.
  Comparison of regulatory regions?

Target paralogy
  Key insights into large pharmacologically relevant families (NRs, GPCRs) can be gained from paralogy analysis.
  Paralogy is inter-related with several other gene-to-function occurrences that can seriously affect the suitability of genes as drug targets.
  [Figure labels: gene, protein, function; paralogy, alternative transcription, pleiotropy, redundancy, heteromery, crosstalk]
  Schematic representation of various mappings of genes to functions.

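The reciprocal highest-scoring BLAST hit criterion listed under "Establishing orthology" can be sketched in a few lines. This is a minimal illustration, not a full orthology pipeline: it assumes the hits have already been parsed into (query, subject, score) tuples, and the gene identifiers and scores in the usage example are made up. It ignores the caveats from the slide (gene loss, rate differences, synteny).

```python
def best_hits(hits):
    """Return {query: subject} keeping only the highest-scoring hit per query.

    hits is an iterable of (query, subject, score) tuples; this layout is an
    assumption for illustration, not a BLAST output format.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)   # species A genes searched against species B
    reverse = best_hits(b_vs_a)   # species B genes searched against species A
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in forward.items()
        if reverse.get(gene_b) == gene_a
    )


# Hypothetical hit tables (identifiers and scores are invented).
human_vs_mouse = [
    ("hA", "mA", 900), ("hA", "mB", 400),
    ("hB", "mB", 850), ("hC", "mB", 700),
]
mouse_vs_human = [
    ("mA", "hA", 880), ("mB", "hB", 860),
    ("mB", "hC", 690), ("mC", "hA", 300),
]

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

In this toy data hC finds mB as its best hit, but mB's best hit is hB, so hC gets no ortholog call; such unpaired genes are candidates for paralogy or gene loss and need the other evidence listed on the slide.
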
Pleiotropy
  Suggested to precede paralogy
  Relaxed substrate or ligand specificity
  Multiple protein domains
  Tissue or cellular localization

Redundancy
  Total or partial redundancy of function
  Directly linked to paralogy
  Robustness against gene knock-outs (target validation)
  PPAR-δ / PPAR-α in skeletal muscle; PXR / FXR in bile acid signaling; dopamine transporters / serotonin transporters in adjacent neurons.

Heteromery
  Formation of heteromers between paralogs
  Known examples in major classes of drug targets
  GPCRs: GABA-B receptors
  NRs: formation of heterodimers with retinoid X receptors (RXR)
  Ion channels

Crosstalk
  Combination of pleiotropy and redundancy
  May be regulated in time and space (expression and localization)
  Action of cytokines (interleukins) on immune cell types.

Alternative transcription
  Intermediate between paralogy and pleiotropy. 'Paralogy in place'
  Increases effective size of the genome (estimated >30% of human genes show alternative transcription!)

Effects on drug discovery
  Functional shifts, pleiotropy and redundancy can each be good or bad news for drug discovery.
  Functional shifts: misleading or unavailable animal model (bad); animal toxicity irrelevant for humans (good)
  Pleiotropy: unintended drug effects (bad); opportunities for multiple indications (good)
  Redundancy: disease resistant to treatment (multi-functionality) (bad); highly selective treatment for complex diseases (good)

Pharmacogenetics
  Within-species comparative genomics: single nucleotide polymorphisms (SNPs)
  Current focus on coding regions, expected to expand to sites of transcription regulation.
  Determine the site of a SNP and the allele frequencies from ethnic or multi-ethnic panels of individuals (e.g. 100)
  Pharmacogenetics (PGx): relate SNP information to efficacy and safety issues during the drug development process.
  Efficacy PGx: select/predict drug responders, increase confidence in a certain drug in development.
  Safety PGx: identification of individuals with adverse effects to a drug

Examples
  New genes and REs from yeast genomes.
  Multi-species comparisons from targeted genomic regions.
  Comparative genomics at the vertebrate extremes.
  Pharmacogenetics in drug efficacy

Comparison of yeast species to identify genes and regulatory elements.
  (Kellis et al, Nature 2003)
  Saccharomyces cerevisiae and 3 related species
  7x coverage WGS of each species
  Assembly of draft genome sequence
  S. cerevisiae genome aligned to others using ORFs as seeds
  Most ORFs have 1:1 matches. Considerable conserved synteny.
  Most genomic rearrangements clustered in telomeric regions.
  Local gene family expansion/contraction, creating phenotypic diversity over evolutionary time.
  Balance between conservation and divergence allows for accurate gene identification and recognition of REs as well!

Identification of genes
  Original S. cerevisiae genome (1996): 6275 ORFs
  Re-analysis and other evidence (2002): 6062 ORFs
  This study validates all ORFs using a reading frame conservation score (very sensitive).
  5538 ORFs, 20 unresolved, 504 rejected ORFs!
  In addition to gene recognition, also largely improved gene structure definitions (start, stop, intron).

Identification of regulatory elements
  REs are difficult to identify: short (6-15 bp), sequence variation, few known rules
  De novo discovery of REs directly from genomic sequence.
  Develop a motif conservation score system based on known motifs
  78 motifs discovered, overlapping with 36 of 55 known motifs
  Putative annotation of motifs using adjacent genes (GO)
  25 of 42 new motifs show high category annotation correlation
  Discovery of combinatorial control of REs
  Applications to human genome? Increase number of species in comparison to enrich the low signal-to-noise ratio in humans.

Multi-species comparisons from targeted genomic regions.
  (Thomas et al, Nature 2003)
  Comparing targeted regions in multiple evolutionarily diverse vertebrates (less probable for conservation to occur by chance)
  ENCODE project: 44 genomic regions (14 manually selected, of which some disease related; 30 random) of diverse gene density and nonexonic conservation
  Primates, bat, alligator, elephant, cat, emu, leopard, salmon etc.
  Initial analysis: 1.8 Mb on chromosome 7 containing 10 genes, including CFTR, from 12 species.
  Detection of ~1000 multi-species conserved sequences, of which >60% would not be detected by a 2-species comparison.

Comparative genomics at the vertebrate extremes
  (Boffelli et al, Nature Reviews Genetics 2004)
  What can be learned from comparisons of genomes that are distant or closely related in evolution?
  Distant comparisons reveal the most constrained sequence elements.
  Most of the conserved human-fish non-coding sequences are found near genes with roles in embryonic development.
  Mutations can have an important role in human disease
  Human-Fugu conservation of non-coding sequence in the DACH gene area (development of brain, limbs, sensory organs).
  Validation of identified enhancer regions by driving expression of a reporter in mouse embryos.

Comparative genomics at the vertebrate extremes
  Intraspecies sequence comparisons allow identification of species-specific sequences
  Phylogenetic shadowing
  Requires high rate of polymorphism
  Comparison among primates shows human-specific sequences
  Analysis of regulatory sequence of ApoA (involved in human heart disease)
  A. Mutation rate analysis of Ciona intestinalis 5' region of the forkhead gene.
  B. Validation of identified potential regulatory elements in Ciona larvae.

Pharmacogenetics in drug efficacy
  Efficacy PGx for an obesity drug.
  Compare genotypes 1-1, 1-2 and 2-2
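
For a biallelic SNP with genotypes 1-1, 1-2 and 2-2, the allele frequencies mentioned under Pharmacogenetics follow directly from genotype counts in a panel. The sketch below shows only that arithmetic. The function name is hypothetical and the counts are invented for a panel of 100 individuals; they are not data from the obesity drug example.

```python
def allele_frequencies(n11, n12, n22):
    """Return (freq_allele_1, freq_allele_2) from genotype counts.

    Each individual carries two alleles, so a 1-1 homozygote contributes
    two copies of allele 1 and a 1-2 heterozygote contributes one of each.
    """
    total_alleles = 2 * (n11 + n12 + n22)
    if total_alleles == 0:
        raise ValueError("panel is empty")
    freq_1 = (2 * n11 + n12) / total_alleles
    freq_2 = (2 * n22 + n12) / total_alleles
    return freq_1, freq_2


# Invented counts for a 100-person panel: 49 x 1-1, 42 x 1-2, 9 x 2-2.
freq_1, freq_2 = allele_frequencies(49, 42, 9)
print(f"allele 1: {freq_1:.2f}, allele 2: {freq_2:.2f}")
```

Relating these genotype groups to drug response (the efficacy PGx step) would need outcome data per group, which the slide does not give.
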