sequence conservation of vertebrate gene components
  [figure labels: CDS, exon, intron, 5' end, 3' end, untranslated region (UTR)]
  International Chicken Genome Sequencing Consortium. 2004. Nature 432: 695-716

popular methods for finding exons in protein coding genes
  ab initio computer predictions
    PRO: can identify genes expressed at low levels or under rare conditions
    CON: tradeoffs between false positives and negatives
  reverse transcribe mRNA into cDNA and sequence
    PRO: the “gold standard” even if getting full length cDNAs is problematic
    CON: genes expressed at low levels or under rare conditions are missed
  hybridize cDNA to tiling array
    PRO: no need to wade through highly expressed genes
    CON: genes expressed at low levels or under rare conditions are missed
    CON: determining start/end of transcript is problematic

16 of the largest human genes
based on cDNA alignments to BAC-end consistent genomic contigs
  Genomic size   Functional description
  2,009,201      Homo sapiens dystrophin (muscular dystrophy, Duchenne and Becker
  1,775,997      Homo sapiens alpha-catenin-like protein (VR22), mRNA.
  1,646,716      Homo sapiens contactin associated protein-like 2 (CNTNAP2), mRNA.
  1,468,271      Homo sapiens glypican 5 (GPC5), mRNA.
  1,467,842      Homo sapiens glutamate receptor, ionotropic, delta 2 (GRID2), mRNA.
  1,463,738      Homo sapiens discs, large homolog 2, chapsyn-110 (Drosophila)
  1,458,699      Homo sapiens neurexin 3 (NRXN3), transcript variant alpha, mRNA.
  1,434,566      Homo sapiens atrophin-1 interacting protein 1; activin receptor
  1,224,569      Homo sapiens protein kinase, cGMP-dependent, type I (PRKG1), mRNA.
  1,200,828      Homo sapiens interleukin 1 receptor accessory protein-like 2
  1,176,853      Homo sapiens glypican 6 (GPC6), mRNA.
  1,140,268      Homo sapiens amiloride-sensitive cation channel 1, neuronal
  1,117,166      Homo sapiens protein tyrosine phosphatase, receptor type, T
  1,113,822      Homo sapiens WW domain containing oxidoreductase (WWOX), transcript
  1,103,531      Homo sapiens neuregulin 1 (NRG1), transcript variant GGF2, mRNA.
  1,098,403      Homo sapiens cadherin 12, type 2 (N-cadherin 2) (CDH12), mRNA.

16 of the largest human introns
based on cDNA alignments to BAC-end consistent genomic contigs
  Largest intron   Functional description
  1,040,458        Homo sapiens amiloride-sensitive cation channel 1, neuronal
  955,172          Homo sapiens neuregulin 1 (NRG1), transcript variant GGF2, mRNA.
  779,656          Homo sapiens WW domain containing oxidoreductase (WWOX), transcript
  721,292          Homo sapiens glypican 5 (GPC5), mRNA.
  677,142          Homo sapiens phosphodiesterase 4D, cAMP-specific (phosphodiesterase
  593,993          Homo sapiens protocadherin 9 (PCDH9), mRNA.
  540,026          Homo sapiens fibroblast growth factor 12B (FGF12B), mRNA.
  536,481          Homo sapiens interleukin 1 receptor accessory protein-like 2
  494,708          Homo sapiens glutamate receptor, ionotropic, delta 2 (GRID2), mRNA.
  483,412          Homo sapiens catenin (cadherin-associated protein), alpha 2
  479,087          Homo sapiens neurexin 3 (NRXN3), transcript variant alpha, mRNA.
  457,155          Homo sapiens potassium voltage-gated channel, Shal-related
  445,813          Homo sapiens atrophin-1 interacting protein 1; activin receptor
  404,792          Homo sapiens alpha-catenin-like protein (VR22), mRNA.
  404,673          Homo sapiens RAD51-like 1 (S. cerevisiae) (RAD51L1), transcript
  400,181          Homo sapiens heparanase-like protein (HPA2), mRNA.

3.6 Mbp intron in the dynein gene DhDhc7(Y) on the heterochromatic Drosophila hydei Y chromosome
  3.6 Mb intron full of microsatellites
  Reugels AM, et al. Genetics 154: 759-769 (2000)

how much transcribed DNA is attributable to genes over 100 Kb
  lower cutoff size for the gene     100 Kb   250 Kb   500 Kb
  fraction (based on gene number)    16.5%    6.2%     2.8%
  fraction (based on gene length)    70.4%    48.7%    31.5%
  based on estimates published in Wong GK, et al. 2001. Most of the human genome is transcribed.
  Genome Res 11: 1975-1977

large genes are attributable to more introns and to bigger introns
  most exons are 150 bp except for the 3' terminal exon with the UTR

information used by ab initio algorithms for exon prediction
  signal terms = short sequence motifs like splice sites, branch points, polypyrimidine tracts, start codons, and stop codons
    almost enough to define the genes when the introns are small, as in yeast
    BUT not adequate when the introns are large, as in human
  content terms = patterns of codon usage that are unique to a species, and optionally, cross species sequence conservation
    algorithms must be trained by presenting them with known coding sequences
  caution 1: untranslated regions (UTRs) cannot be detected
  caution 2: non-protein-coding RNA genes cannot be detected
  caution 3: alternatively spliced isoforms are not considered
  ab initio method is used when there is no full length cDNA (or protein)

Ensembl process in absence of full length cDNA (or protein)
  [flowchart labels: genome sequence; ab initio prediction; cDNA or protein in another species; EST in the chosen species; reject unmatched exons; reject unmatched genes; novel genes; putative genes; false positives but false gene deserts; not in final gene counts]
  based on Curwen V, … Clamp M (2004) Genome Res 14: 942
  but modified according to reviews by Wang J, … Wong GK (2003) Nat Rev Genet 4: 741

example of what can go wrong in transition from ab initio to Ensembl
  [figure labels: false gene desert; over-prediction; gene fragment; false positive]
  Refseq = full length cDNA; Genscan/FgeneSH = ab initio algorithms; Ensembl = final annotation

size dependencies of FP and FN
  Genscan prediction fails at both extremes in size; lower sizes correspond to single-exon genes; upper sizes are due to large introns
  Ensembl does everything to minimize the FP rate but in doing so it increases the FN rate to almost 50%

in contrast to FP and FN, over-predictions are size independent
  over-predictions arise when the ab initio algorithms fail to detect the start and stop codons at the ends of a gene; most performance assessments confuse this issue with FP, but it is a distinct phenomenon because, unlike FP, the probability of an over-prediction is independent of size

false gene deserts from Ensembl
  a complete miss (CM) is a gene where fewer than 100 bp of the total protein coding sequence is correctly predicted
  a false desert (FD) is the fraction of a gene's sequence that is not covered by any gene predictions; note that the definition of FD must exclude CM genes

gene size distribution with and without full length cDNA support
  [figure label: gene fragments]
  Refseq is a curated set of full length cDNAs; the right panel shows what is left after removing cDNA derived genes from Ensembl (human-32.35e 7/21/2005)

ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799-816
[excerpt of abbreviations from box 1]
  CDS        Coding sequence: a region of a cDNA or genome that encodes proteins
  CS         Constrained sequence: a genomic region associated with evidence of negative selection (that is, rejection of mutations relative to neutral regions)
  GENCODE    Integrated annotation of existing cDNA and protein resources to define transcripts with both manual review and experimental testing procedures
  PET        A short sequence that contains both the 5' and 3' ends of a transcript
  RACE       Rapid amplification of cDNA ends: a technique for amplifying cDNA sequences between a known internal position in a transcript and its 5' end
  RxFrag     Fragment of a RACE reaction: a genomic region found to be present in a RACE product by an unbiased tiling-array assay
  TxFrag     Fragment of a transcript: a genomic region found to be present in a transcript by an unbiased tiling-array assay
  Un.TxFrag  A TxFrag that is not associated with any other functional annotation
  UTR        Untranslated region: part of a cDNA either at the 5' or 3' end that does not encode a protein sequence

experimental annotation of a genome using tiling microarrays
  Shoemaker DD, et al. 2001. Nature 409: 922-927
  [nonrepetitive half of human genome requires 150 million probes if tiled at 10 bp steps]
  weakness of the method is that it cannot determine the ends of the gene

annotated and unannotated TxFrags vs number of cell lines
  most TxFrags (63.5%) do not concur with the GENCODE exons and are observed in intronic (40.9%) and intergenic (22.6%) regions; annotated TxFrags are more likely to be seen in multiple cell lines; more disturbingly, these unannotated TxFrags contain little evidence of encoding proteins

extension of annotated genes based on the RACE experiments
  mean gene size was 27 kb in the 2001 human genome papers
  RACE (rapid amplification of cDNA ends) is a way to get the ends of a gene by priming off the incomplete cDNA; using 399 protein-coding loci and mRNA for 12 tissues they found that 90% of these loci contain at least one novel RxFrag that extends well beyond the annotated TSS

multiple lines of evidence for the fusion of two adjacent genes
  330-kb interval of human chromosome 21 with 4 annotated genes: DONSON, CRYZL1, ITSN1 and ATP5O; 5' RACE products generated from small intestine RNA and detected by tiling-array analyses (RxFrags) are shown along the top; magnified along the bottom is a cloned and sequenced RT-PCR product with 2 exons from the DONSON gene and 3 exons from the ATP5O gene connected by a single large 300 kb intron; PET tags show the termini of a transcript that is consistent with this RT-PCR product
  in fact approximately 50% of RACE-positive loci appear to have incorporated at least 1 exon from an upstream gene

most of the human genome is converted into primary transcripts
  GENCODE annotations, RACE-array experiments, and PET tags were used to assess the presence of a nucleotide in a primary transcript; the proportion of genomic bases detected can be classified into the following scenarios: all three technologies, two of the three technologies, one technology but with multiple observations, and one technology with only one observation; also indicated are genomic bases without any detectable coverage of primary transcripts

ENCODE confirmed previous studies in human and mouse showing extensive transcription beyond the official annotations
  93% of bases are represented in a primary transcript identified by at least 2 independent observations, some by the same technology
  many of the resulting transcripts are neither traditional protein-coding genes nor explainable by structural non-coding RNAs
  the rest of the paper shows extensive amounts of regulatory factors around the novel transcription start sites, as is to be expected

compared to other annotated features, unannotated transcripts show weaker (i.e. almost neutral) evolutionary conservation
  biological relevance of unannotated transcripts remains unanswered

conservation patterns comparing human to other vertebrate genomes
  Thomas JW, et al. 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788-793
  Q: what is the optimal species or combination of species to use?

evolutionarily constrained regions are not always ENCODE annotated
  evolutionarily constrained regions are computed for 28 vertebrate species and defined to have a false discovery rate of 5%; the median length of the constrained sequences is 19 bp, and the minimum length is 8 bp, or about the size of a typical transcription factor binding site

ENCODE annotated regions are not always evolutionarily constrained
  the increase in significance from the bases definition to the regions definition indicates that tiny islands of constrained sequence exist within the experimentally defined functional elements, while the surrounding bases seem not to be constrained

identifying functional elements from genome sequence is one challenge, but what biological roles (if any) do the elements serve?
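
The two comparisons above (constrained regions that are not annotated, and annotated regions that are not constrained) are both questions about overlap between sets of genomic intervals. Here is a minimal sketch of that bookkeeping, assuming half-open (start, end) coordinates on a single chromosome. The function names and the toy coordinates are hypothetical and are not ENCODE data; this is not the method used in the ENCODE paper.

```python
def merge_intervals(intervals):
    """Sort half-open (start, end) intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(query, reference):
    """Count the bases of `query` that also fall inside `reference`."""
    q = merge_intervals(query)
    r = merge_intervals(reference)
    total = 0
    i = j = 0
    while i < len(q) and j < len(r):
        lo = max(q[i][0], r[j][0])
        hi = min(q[i][1], r[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if q[i][1] <= r[j][1]:
            i += 1
        else:
            j += 1
    return total


def fraction_covered(query, reference):
    """Fraction of the bases in `query` that are covered by `reference`."""
    query_length = sum(end - start for start, end in merge_intervals(query))
    if query_length == 0:
        return 0.0
    return covered_bases(query, reference) / query_length


# Hypothetical toy coordinates (not real data): three short constrained
# regions and two longer annotated elements.
constrained = [(100, 119), (500, 508), (900, 930)]
annotated = [(90, 300), (920, 1000)]

frac_constrained_annotated = fraction_covered(constrained, annotated)
frac_annotated_constrained = fraction_covered(annotated, constrained)
print(f"constrained bases that are annotated: {frac_constrained_annotated:.2f}")
print(f"annotated bases that are constrained: {frac_annotated_constrained:.2f}")
```

In this toy case the two fractions differ widely because the annotated elements are much longer than the constrained islands inside them, which is the situation the slide describes.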
  sequence similarity to previously characterized genes and proteins is commonly used to infer biological roles, but no one has ever quantified how reliable these inferences might be
  ascertainment of biological roles is extremely difficult, as most knockouts have no phenotype even for indisputably reliable genes

does orthology necessarily imply functional equivalence?
  [figure: genes S1 and S2 in species S, genes B1 and B2 in species B; relationships labeled ortholog and paralog]
  http://www.treefam.org/ is a curated database of animal gene family trees with reliable assignments of orthologs and paralogs

evolution, language, and analogy in functional genomics
  Benner SA, Gaucher EA. 2001. Trends Genet 17: 414-418
  Homologous enzymes catalyze four different reactions that are involved in (a) central metabolism, i.e. the citric acid cycle, (b) amino acid degradation, (c) nucleic acid biosynthesis, and (d) amino acid biosynthesis. There is NO question that the four enzymes are homologous, but their biological roles are arguably quite different.

chicken SNPs corresponding to mutations in human disease genes
  chicken genome and chicken SNP map: 2.83 million variant sites
  1065 human genes taken from OMIM
  995 chicken orthologs
  520 cSNPs in 245 genes
  6 cSNPs in disease site
  1 cSNP intolerant in SIFT; 5 cSNPs tolerant in SIFT
  if orthologs were functionally equivalent, no SNPs would survive the process, but a few do
  from Wong GK, … Yang H. 2004. Nature 432: 717-722

G188R substitution is associated with hyperammonemia in humans
  ornithine transcarbamylase (OTC), alignment around residue 188
  rows in the original figure: Human, Pig, Mouse, Rat, Chicken RJF, Chicken B/L
  Human                                        HYSSLKGLTLSWIGDGN
  Pig, Mouse, Rat (rows run together)          --GA--------------G---------------G--------------
  Chicken RJF, Chicken B/L (rows run together) --GG-N---IA-------GG-NR--IA------
  the mutation associated with hyperammonemia in humans turns out to be a common polymorphism in healthy chickens, with the deleterious variant observed in 65% of layers and 75% of broilers

nitrogenous waste processing in mammals versus birds and reptiles
  mammals: UREA waste; birds and reptiles: URIC acid waste
  every human urea cycle gene (including OTC) is found in chicken

Q: could we have predicted OTC's lack of functional equivalence?
  preliminary OTC Ka/Ks
  [tree figure; species: Human, Chimp, Rat, Mouse, Cattle, Dog, Alligator, Chicken, Lizard, Western Frog, Bull Frog, African Frog, Fugu, Green Puffer, Zebrafish]
  [branch labels as extracted, tree positions not recoverable: Inf(2/0) 0.08(14/62) 0.38(19/6) 0(0/0) 0.14(12/30) 0.1(41/142) 0.1(84/275) 0.06(6/38) 0.03(30/300) 0.04(6/61) 0(0/18) 0.1(111/370) 0.13(28/74) 0.31(123/132) Inf(24/0) 0.19(119/210) 0.19(47/84) 0.19(153/270) 0.21(29/46) 0.1(78/261) Inf(47/0) 0.22(63/98) 0.14(89/208) 0.14(172/419) 0.11(45/134) 0.07(133/640) 0.18(43/78) 0(35/37914)]

“genetic uncertainty principle” explains why a gene's biological role is so difficult to ascertain
  hypothesis by Tautz D. 2000. Trends Genet 16: 475-477
  2 N_e Δt > 1 / Δw
  N_e is the effective population size, Δt is the number of generations, and Δw is the differential fitness
  if we think of one generation for one member of a population as a single evolutionary experiment, then we can never hope to duplicate the number of experiments that nature conducted in order to decide what will survive
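
The inequality above can be rearranged to Δw > 1 / (2 N_e Δt), which makes the comparison between a laboratory experiment and a natural population easy to put in numbers. Here is a minimal sketch of that rearrangement only. The function names and the population sizes and generation counts are hypothetical illustrations, not values from Tautz (2000), and the biological interpretation of crossing the threshold is left to that paper.

```python
def fitness_difference_threshold(effective_population_size, generations):
    """Return 1 / (2 * Ne * dt), the value of dw at which 2*Ne*dt = 1/dw.

    Any fitness difference larger than this value satisfies the slide's
    inequality 2 * Ne * dt > 1 / dw.
    """
    if effective_population_size <= 0 or generations <= 0:
        raise ValueError("Ne and the number of generations must be positive")
    return 1.0 / (2.0 * effective_population_size * generations)


def generations_threshold(effective_population_size, fitness_difference):
    """Return 1 / (2 * Ne * dw), the value of dt at which 2*Ne*dt = 1/dw."""
    if effective_population_size <= 0 or fitness_difference <= 0:
        raise ValueError("Ne and the fitness difference must be positive")
    return 1.0 / (2.0 * effective_population_size * fitness_difference)


# Hypothetical numbers chosen only to illustrate the scaling.
lab_threshold = fitness_difference_threshold(effective_population_size=100, generations=10)
wild_threshold = fitness_difference_threshold(effective_population_size=10_000, generations=1_000)

print(f"lab experiment threshold:     dw > {lab_threshold:.1e}")
print(f"natural population threshold: dw > {wild_threshold:.1e}")
```

Because the threshold scales as 1 / (N_e Δt), every tenfold increase in population size or in generations lowers it tenfold, which is the sense in which a bench experiment cannot match the number of "experiments" that nature has run.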