Computing non-coding cis-regulatory DNAs
Michele Markstein
IEEE CSB 2003, Stanford University
August 11, 2003

OUTLINE (first-half)
1. Brief Review of Central Dogma (DNA -> RNA -> Protein): base-pairing, gene architecture, transcription, translation
2. Landscape of the Human Genome
3. Cis-regulation: Enhancers, Insulators, Chromatin Boundaries

BASE PAIRING
DNA serves as a template for DNA and RNA
The Building Block of DNA is the NUCLEOTIDE
[Diagram: two antiparallel strands with 5’ and 3’ ends, built from units labeled P, S, and BASE; base pairs A-T, C-G, T-A, C-G; each strand is marked "Template Strand"]

Gene Architecture and the Central Dogma
DNA: TATA, exon 1, intron 1, exon 2, intron 2, exon 3
Transcription -> mRNA
Splicing -> mature mRNA (AUG ... UAA)
Introns stay in the nucleus; exons exit the nucleus
Nucleus: transcription and splicing
Cytoplasm: translation -> protein -> protein folding

Another View of Exon/Intron Structure
Exon 1, Intron 1, Exon 2, Intron 2, Exon 3 (E1, E2, E3)
GGGTGTTTCCAAAAATACTCGGGTGTTTCCAAAAATACTCGAGTGGTCTCGTAGGTAGTGA
GTCAAATGGCGCCATACATAATGATTGTTGAGTTCTTGTGTCTTTGGTCCAGTGTCTCGGC
TGTTAATTGCGTCTGTTACGATGCAATTACTAGCTTGTTAGGATTCAGTATTATTTGGAGT
CCAAAGGAAAAGGTCACAATAATGGCGAAGCGGCTGATTTCGTTAAAAATTTTTACCCTTC
ATTTCTTATACCCGTCACGCTTCCACCCATACAAATTTTAGGCGTACAAAAAATGACCAGA
GAACTGCAGCCCGCATACAAAAAATGACCTGCGGCAGATCGTTGACTGTGCGTCCACTCAC
CCATACGGCTCTTGCGCAGCAGGCCTCGGGTGGTTTTTTTACTAGTAAATTGCCCCGCCCC
CCAACGGTTACGATGCAATTACTAGCTTGTTAGGATTCAGTATTATTTGGAAGCCAAAGGA
AAAGGTCACAATAATGGCAGAAGCGGCTGATTTCGTTAAAAATAAAATTAACAATGGAACA
TACTCAGTTGCCAATAAACATAAAGGAAAAAGTGTTATTTGGTGCATTTTATGTGACATTT
TAAAGGAAGATGAAACTGTTCTGACGGATGGCTGCAGCCCGCATACAAAAAATGACCTGCG
GCCGATCGTTGACTGTGCGTCCACTCACCCATACGGCTCTTGCGCAGCAGGCCTCTTGCGC
GTCAGGCCTCGTACATAATGATTGTTGAGTTCTTGTGTCTTTGGTCCAGTGTCTCGGCTGT
TAATTGCCCTTTGTACGATGCAATTACTAGCTTGTTAGGATTCAGTATTATTTGGAAGCCA
AAGGAAAAGGTCCCAAAACACAATAATGGCGAAGCGGCTGATTTCGTTAAAAATTCCCTAC
CCTTCATTTCTTATACCCGTCACGCTTCCACCCATACAAATTTTAGGCGTACAAAAAATGA
CCACAATAATGGCAGAAGCGGCTGATTTCGTTAAAAATAAAATTAACAATGGAACATACTC
AGTTGCCAATAAACCAGAGAACTGCAGCCCGCAGGTGGTTTTTTTACTCGTAAATTGCCCC
ACGATGCAGTTACTAGCTTGTTAGGATTCAGTATTATTTGGAAGCCAAAGGAAAAGGTCAC
AATAATGGCAGAAGCGGCTGATTAGGTTAAAAATAAAATTAACAATGGAACATACTCAGTT
GCCAATAAACATAAAGG

Snapshot of RNA transcription
[Image]

Puzzle: how do you translate a 4-letter alphabet (nucleotides) into a 20-letter alphabet (amino acids)?

The Triplet Code
64 combinations (4 x 4 x 4)
Each triplet is called a Codon

The “Genetic Code”
[Table: codons -> amino acids]
[Diagram: generic tRNA carrying an amino acid (Pro, Gly), with anti-codon positions 1, 2, 3 pairing with mRNA codon positions 1, 2, 3; codons shown include GGU and CCU]

The Ribosome sets the reading frame
[Diagram: mRNA read in triplets; amino acids shown are Met and His]

Anatomy of mRNA
5’ UTR, AUG ... UAA, 3’ UTR
UTR = untranslated region
Translation of the AUG-to-UAA region -> protein
mRNA is composed of EXONS
Not all of the mRNA necessarily serves as template for protein synthesis (hence 5’ and 3’ UTRs)
Therefore not all EXONS or parts of EXONS necessarily serve as template for protein synthesis

The Human Genome
Estimated to have 25,000 to 30,000 genes
The estimate of 100,000 genes was a “back of the envelope” guess by a Harvard Professor in the mid-80’s
gene = 30,000 bp
genome = 3 billion bp
Table from Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug 2;412(6846):565. Nature 2001 Jun 7;411(6838):720. PMID: 11237011

Genome size does not correlate with complexity
YEAST    0.012 x 10^9 bp    ~5,500 genes
HUMAN    3 x 10^9 bp        ~30,000 genes
AMOEBA   600 x 10^9 bp      ?

1-2% of the human genome encodes proteins
[Pie chart: 50% REPEATS; 25% GENES (exons, introns, cis-regulation?); remaining segments of 15% and 10% labeled "?" and "H"]
H = largely unsequenced heterochromatin

The human genome is AT-rich
G + C content = 41%
CG di-nucleotides expected at a frequency of 0.2 x 0.2 = 0.04
BUT, observed only 1/5 as frequently as expected
Why?
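The arithmetic on this slide can be sketched in a few lines. This is only an illustration: the function name `cpg_observed_expected` is an assumption made here, not something from the talk, and the expected count uses the same independence assumption as the slide (frequency of C times frequency of G).

```python
def cpg_observed_expected(seq: str) -> float:
    """Ratio of observed CG dinucleotides to the number expected from base composition."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    freq_c = seq.count("C") / n
    freq_g = seq.count("G") / n
    # Expected CG count over the n - 1 adjacent base pairs, assuming C and G
    # occur independently (the slide's 0.2 x 0.2 = 0.04 reasoning).
    expected = freq_c * freq_g * (n - 1)
    if expected == 0:
        raise ValueError("sequence has no C or no G")
    # "CG" cannot overlap with itself, so str.count gives the exact number.
    observed = seq.count("CG")
    return observed / expected


# Slide numbers: with G + C = 41%, C and G are each roughly 0.2 of all bases.
expected_frequency = 0.2 * 0.2                # 0.04, as on the slide
observed_frequency = expected_frequency / 5   # "1/5 as frequently as expected"
```

A ratio near 1 means CG occurs about as often as base composition predicts; the slide's figure corresponds to a genome-wide ratio of about 0.2.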
CG is often methylated, and spontaneous deamination converts the C to T
CpG islands are associated with the beginning of genes
From: Lander et al., Nature 2001

2 Major Classes of Repeats:
1. Transposons: 45% of our genome
2. Simple Repeats: 3% of our genome
   (A)n or (CA)n or (CGG)n where n=1 to 11
   generally microsatellites, which exhibit great variation

Junk or “rich paleontological record”?
1 in 600 mutations in humans is due to transposons
10% of mutations in mouse are due to transposons
Why?

4 TYPES OF TRANSPOSONS
LINES = long interspersed elements (L1 still active)
SINES = short interspersed elements (ALU sequences)
Diagram from Lander et al., Nature 2001

LINES = long interspersed elements (L1 still active)
Spread by “copy & paste”
[Diagram: DNA and mRNA in the cell nucleus; mRNA in the cell cytoplasm]
Full-length LINE = 6 kb
Encodes 2 ORFs:
1. Reverse Transcriptase
2. Endonuclease
About 60-100 LINES still mobile
New L1 jump in every 10-250 people born

SINES
Do not encode proteins
They take advantage of the LINE machinery to move

Retrovirus-like transposons
Like LINES, except they make the double-stranded DNA in the cytoplasm
Encode 2 proteins: Reverse Transcriptase and Integrase
HIV and other Retroviruses have 2 extra genes: coat protein and envelope protein

DNA Transposons
A dying breed. They require virgin genomes to survive because they don’t have the advantage of “cis-preference”.

CREATIVE or DESTRUCTIVE FORCE?
3’ transduction: LINEs have a tendency to transcribe DNA beyond their 3’ end and thereby move host DNA

[Diagram: MER85 5’ end, ORF (1.7 kb, novel protein), MER85 3’ end]
Closest sequence is the insect piggyBac transposon
Expressed in fetal brain and cancer cells
Maintained for 40-50 Myr
Other candidates: intronless genes

Most LINES are found in AT-rich, gene-poor regions: they integrate at TTTT/A
Alus accumulate in GC-rich, gene-rich regions! Why?
Increased loss at AT regions?
Selective benefit to retaining Alus near genes?
May be used in the stress response to mediate QUICK responses; e.g. they have been shown to promote translation
Graph from Lander et al., Nature 2001

Alu sequences are evenly spread out across most chromosomes (the exception is Chr. 19)
Graph from Lander et al., Nature 2001

Gene Regulation
Odorant receptor (neurons)
Drosomycin anti-microbial peptide (liver, secreted into blood)

Genomic Equivalence
All cells have the same DNA but they express only a subset of available genes
Berkeley Drosophila Genome Browser at www.fruitfly.org

[Figure from Felsenfeld G, Groudine M. Nature, vol. 421, 23 January 2003, www.nature.com/nature; also in Alberts’ textbook Molecular Biology of the Cell]

Simplified anatomy of a gene
Slide from Mike Levine

Changes in regulatory DNA cause changes in morphology
Slide from Mike Levine

In vivo assay for enhancer activity
Slide from Mike Levine

Regulatory DNA is modular
Slide courtesy of Mike Levine

Enhancers can also be intronic
THE EXPERIMENT: Above are the results of an in situ hybridization. This in situ shows mRNA localization in fly embryos.
The embryo on the left shows sog mRNA in blue. The embryo on the right shows lacZ mRNA in blue. Both patterns are about the same, indicating that the Dorsal cluster is sufficient to drive the sog pattern of expression.
A 263 bp cluster of Dorsal binding sites in the intron of a gene called “sog” was cloned and fused to a lacZ reporter. This fusion construct was injected into the fly germline to make transgenic flies.
Markstein et al., Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci U S A. 2002 Jan 22;99(2):763-8. Epub 2001 Dec 18.

Gene Regulation: Trafficking Problem
Promoter competition
Tethering Element
Insulator
Butler and Kadonaga, Genes and Development 2002

Gene Regulation: Trafficking Problem
Promoter competition
Human: over half of transcription start sites are associated with CpG islands
Ohler U, Liao GC, Niemann H, Rubin GM. Computational analysis of core promoters in the Drosophila genome. Genome Biology 3, RESEARCH0087. Epub 2002 Dec 20. http://genomebiology.com/2002/3/12/research/0087.1
Calhoun VC, Stathopoulos A, Levine M. Promoter-proximal tethering elements regulate enhancer-promoter specificity in the Drosophila Antennapedia complex. PNAS July 9, 2002 vol. 99 no. 14 9243-9247.

Microarray Experiment
Involves RNA-DNA base pairing on spotted DNA chips
Learn all about microarrays at Pat Brown’s Homepage: http://cmgm.stanford.edu/pbrown/
Spellman PT, Rubin GM. Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol. 2002;1(1):5. Epub 2002 Jun 18.
Genes are organized into co-expression domains
On average about 10 genes per 100,000 bp (in flies)
We don’t know what determines the boundaries or if they are functional
Weitzman JB. Transcriptional territories in the genome. J Biol. 2002;1(1):2. Epub 2002 Jun 25.

OUTLINE (second-half)
1. Identifying regulatory regions by phylogenetic comparisons in yeast
2. Phylogenetic comparisons in mouse-human
3. Ab initio predictions of enhancers in flies

PHYLOGENETIC APPROACH IN YEAST
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003 May 15;423(6937):241-54.
[Figure from Kellis et al. 2003]

PHYLOGENETIC APPROACH IN MAMMALS
Loots GG, Locksley RM, Blankespoor CM, Wang ZE, Miller W, Rubin EM, Frazer KA. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science. 2000 Apr 7;288(5463):136-40.

Ab initio Method of predicting enhancers
Scan the Genome for Clusters of Binding Sites
Cis-Analyst: http://rana.lbl.gov/cis-analyst/
Fly Enhancer: http://flyenhancer.org
Cluster Buster: http://sullivan.bu.edu/cluster-buster/

Defining TF binding sites
SELEX = systematic evolution of ligands by exponential enrichment
1. Mix your TF with a pool of all possible 25-mers
2. Isolate 25-mers that bind your TF
3. Cut 25-mers out of gel and sequence
[Gel diagram: lanes without (-) and with (+) TF; bands for bound 25-mers and free 25-mers]

SELEX Results for Dorsal
GGGAATTCCC
GGGAATTCCC
GGGTTATCCC
GGGAATTCCA
Analyze about 30 independently obtained sequences: consensus?

Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A. 2002 Jan 22;99(2):757-62.
[Figures: Berman et al., 2003; Markstein M., unpublished data 2003]

REFERENCES
Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug 2;412(6846):565. Nature 2001 Jun 7;411(6838):720. PMID: 11237011
Felsenfeld G, Groudine M. Controlling the double helix. Nature. 2003 Jan 23;421(6921):448-53. Review. PMID: 12540921
Spellman PT, Rubin GM. Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol. 2002;1(1):5. Epub 2002 Jun 18. PMID: 12144710
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003 May 15;423(6937):241-54. PMID: 12748633
Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A. 2002 Jan 22;99(2):757-62. PMID: 11805330
Levine M, Tjian R. Transcription regulation and animal diversity. Nature. 2003 Jul 10;424(6945):147-51. Review.

A Final Look at the Central Dogma
? = Promoter/enhancer prediction and enhancer trafficking
This figure (minus the arrow and question mark) is from Alberts’ Molecular Biology of the Cell, 4th edition
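
The ab initio idea from the second half of the talk (scan the genome for clusters of binding sites) can be sketched as follows. This is a minimal illustration and not the method used by Cis-Analyst, Fly Enhancer, or Cluster Buster. Sites are matched as exact strings taken from the SELEX slide for Dorsal, and reverse complements are ignored. The function names, the 400 bp window, and the minimum of 3 sites are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

# The distinct 10-mers listed on the "SELEX Results for Dorsal" slide.
DORSAL_SITES = ("GGGAATTCCC", "GGGTTATCCC", "GGGAATTCCA")


def find_sites(seq: str, sites: Sequence[str] = DORSAL_SITES) -> List[int]:
    """Return sorted start positions of exact matches to any of the given sites."""
    seq = seq.upper()
    hits = set()
    for site in sites:
        start = seq.find(site)
        while start != -1:
            hits.add(start)
            start = seq.find(site, start + 1)
    return sorted(hits)


def find_clusters(positions: List[int], window: int = 400,
                  min_sites: int = 3) -> List[Tuple[int, int]]:
    """Return merged (first_start, last_start) spans in which at least
    min_sites site starts lie within `window` bp of each other.
    `window` and `min_sites` are illustrative defaults (assumptions)."""
    clusters: List[Tuple[int, int]] = []
    for i in range(len(positions) - min_sites + 1):
        j = i + min_sites - 1
        if positions[j] - positions[i] <= window:
            first, last = positions[i], positions[j]
            if clusters and first <= clusters[-1][1]:
                # Overlaps the previous cluster: extend it.
                clusters[-1] = (clusters[-1][0], max(clusters[-1][1], last))
            else:
                clusters.append((first, last))
    return clusters


# Toy sequence (made up for this example): three sites close together,
# then one isolated site 1,000 bp downstream.
spacer = "A" * 50
toy_genome = (spacer + "GGGAATTCCC" + spacer + "GGGTTATCCC" + spacer
              + "GGGAATTCCA" + "A" * 1000 + "GGGAATTCCC")
toy_positions = find_sites(toy_genome)
toy_clusters = find_clusters(toy_positions)
```

On the toy sequence only the three closely spaced sites form a cluster; the isolated downstream site is ignored. A real scan would typically score matches against a weight matrix built from all SELEX sequences rather than rely on exact strings.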