proposed redefinition of “gene” requires it to have a biological role
Gerstein MB, …, Snyder M. 2007. Genome Res 17: 669-681

example of complexities observed by ENCODE
(A) annotated exons (black rectangles), novel transcriptionally active regions or TARs (hollow rectangles); conventional annotation identifies only 4 genes or just a fraction of the transcripts reported (dashed lines are introns)
(B) observed transcripts are shown alongside the sequences that regulate them (gray circles); note that some of the enhancers are actually promoters for novel splice isoforms

a redefinition of the “gene”
1. a gene is a genomic sequence directly encoding functional product molecules, either RNAs or proteins
2. when there are several functional products that share overlapping regions, take the union of all overlapping genomic sequences encoding them
3. this union must be coherent, done separately for protein and RNA products, but it does not require that all the products necessarily share a common subsequence
concisely summarized as a union of genomic sequences encoding a coherent set of potentially overlapping functional products

4 genes defined in this one locus
there are three primary transcripts, two of which encode five proteins, while the third encodes a noncoding RNA; two primary transcripts share a 5’ untranslated region, but they are considered different genes because the translated regions (D and E) do not overlap; there is a noncoding RNA, but the fact it shares its genomic sequence (X and Y) with the protein-coding genomic segments A and E does not make it a coproduct of these genes; there are four genes in this one locus by the new definition

gene number estimates as a function of time and methodology
[figure; labels: genome is sequenced, genes, observed transcripts, dark matter, sequence annotation, time]

dark matter is reproducible, but it’s poorly transcribed, poorly conserved, non protein coding, and outnumbers validated microRNAs by ~1000 fold

cDNA sequencing reveals an abundance of non-coding genes
FANTOM categories for mouse cDNAs

                      coding1       coding2       non-coding1   non-coding2
number of cDNAs       14,317        3,277         11,526        4,280
size of transcript    2146 (1061)   2174 (1091)   1939 (1019)   1790 (996)
size of best ORFs     1107 (742)    550 (578)     206 (91)      194 (80)
% as single exon      13.4%         35.4%         68.7%         73.1%

[figure: number of cDNAs for coding1, coding2, non-coding1, non-coding2]
mouse cDNAs by Okazaki Y, …, Hayashizaki Y. 2002. Nature 420: 563 or human cDNAs by Imanishi T, …, Sugano S. 2004. PLoS Biol 2: e162

neutral evolution of non-coding cDNAs from mouse transcriptome
[figure, two panels: BlastZ to HUMAN at 25% threshold and BlastZ to RAT at 25% threshold; x axis: sequence identity [%]; series: coding1, coding1-CDS, coding2, non-coding1, non-coding2, ncRNAs, intron1, intergenic]
ncRNAs are known RNA genes; intron1 and intergenic are negative controls
communications arising Wang J, …, Wong GK. 2004. Nature 431: after p757

tiling array data are riddled with unexplained signal anomalies too
do not assume that non-coding cDNAs are tiling arrays
[figure; labels: exons, mystery BURST]
human thymus polyA+ cDNAs profiled at locus of Ewing sarcoma breakpoint region 1 gene; from Johnson JM, …, Schadt EE. 2005. Trends Genet 21: 93

indications of biological relevance: transcription, conservation, both lines of evidence, or neither?
[figure: transcription (poorly transcribed, highly transcribed) against conservation (highly conserved, poorly conserved), with regions labeled “most biology” and “dark matter”]
possible dark matter explanations:
1. biological noise, i.e. real transcripts with no biological roles
2. RNA genes unique to a species
3. long RNAs are precursors for short (and conserved) RNAs
NB: dark matter based on tiling arrays with 150 bp exons is not equivalent to cDNA sequences with 1800 bp exons

hypothesis is unannotated long RNAs are precursors for short RNAs
Kapranov P, …, Gingeras TR. 2007. Science 316: 1484-1488
nuclear and cytosolic polyadenylated RNAs longer than 200 nt (long RNAs, lRNAs) and whole-cell RNAs less than 200 nt (short RNAs, sRNAs) for non-repetitive portion of human genome; 64% of poly(A)+ transcription (nucleus and cytosol) do not align with annotated exons but of these 265,237 annotated exons some 80% are detected

lRNAs that overlap with sRNAs are more PhastCons conserved (i)
PhastCons identifies evolutionarily conserved elements from a multi-species sequence alignment, given their phylogenetic tree, and based on a statistical model of evolution called a phylogenetic hidden Markov model (phylo-HMM)

lRNAs that overlap with sRNAs are more PhastCons conserved (ii)
quantile-quantile plot of PhastCons scores for long RNAs that do (x axis) and do not (y axis) overlap with short RNAs; conservatively, 3.1% of HepG2 and 2.4% of HeLa nuclear lRNA transfrags might be parts of precursors of sRNAs

sRNAs associate with 5’ and 3’ boundaries of annotated transcripts
enrichment over random expectation is plotted as function of distance from 5’ and 3’ termini for sRNAs on same (sense) or opposite (antisense) strand as the annotated transcripts; comparison is made against random regions with matched G+C content
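
worked sketch of the gene redefinition
The three-rule redefinition of the “gene” given earlier in these slides reads like a small interval-clustering procedure, so a rough Python sketch may help make it concrete. It is only an illustration under stated assumptions, not the method of Gerstein et al.: the function names, the product names (p1 to p5, r1) and all coordinates are hypothetical, segments are half-open (start, end) pairs, and "coherent" is approximated by clustering products of the same kind whose product-encoding segments overlap. Only product-encoding segments are compared, so a shared untranslated region alone does not merge two genes.

```python
from collections import defaultdict


def overlaps(segs_a, segs_b):
    """True if any half-open segment in segs_a overlaps any in segs_b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in segs_a for s2, e2 in segs_b)


def merge_intervals(segs):
    """Union of intervals as a sorted list; abutting intervals are coalesced."""
    merged = []
    for start, end in sorted(segs):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def define_genes(products):
    """products: list of (name, kind, segments), kind being 'protein' or 'rna'.

    Rule 3: clustering is done separately for protein and RNA products.
    Rule 2: products whose segments overlap, directly or through a chain of
    other products, are placed in one gene, and the gene is the union of
    their segments. No single segment has to be shared by all products.
    """
    by_kind = defaultdict(list)
    for name, kind, segs in products:
        by_kind[kind].append((name, segs))

    genes = []
    for kind, items in by_kind.items():
        clusters = []  # list of (product names, segments); kept mutually disjoint
        for name, segs in items:
            names_new, segs_new, rest = [name], list(segs), []
            for names_c, segs_c in clusters:
                if overlaps(segs_c, segs):
                    names_new = names_c + names_new
                    segs_new = segs_c + segs_new
                else:
                    rest.append((names_c, segs_c))
            clusters = rest + [(names_new, segs_new)]
        for names_c, segs_c in clusters:
            genes.append({"kind": kind, "products": names_c,
                          "segments": merge_intervals(segs_c)})
    return genes


# Hypothetical coordinates, loosely modeled on the locus described in the slides.
A, B, C = (100, 200), (300, 400), (500, 600)
D, E = (800, 900), (1000, 1100)
X, Y = (150, 250), (1050, 1150)  # ncRNA segments overlapping A and E

products = [
    ("p1", "protein", [A, B]),
    ("p2", "protein", [A, C]),
    ("p3", "protein", [B, C]),  # p1, p2, p3 share no single common segment
    ("p4", "protein", [D]),
    ("p5", "protein", [E]),
    ("r1", "rna", [X, Y]),
]

genes = define_genes(products)
for gene in genes:
    print(gene["kind"], gene["products"], gene["segments"])
```

With these toy inputs the five proteins fall into three genes and the noncoding RNA forms a fourth, even though X and Y overlap protein-coding segments, which matches the count of four genes stated for the example locus.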