Genome Organization

... it all started with heating up DNA (or raising the pH)

Renaturation kinetics

Upon heating (or treatment with alkali), the two strands of the DNA double helix separate from one another into two single-stranded molecules. This process, called DNA denaturation or melting, can be assayed by several different methods, but most commonly one monitors the 40% increase in optical density (termed hyperchromicity) that accompanies denaturation. When optical density is measured as the DNA is heated, and absorbance is plotted against temperature, a melting curve is obtained.

[Figure: UV absorption maxima of the nucleic acid bases (λ max 253, 259, 262, 267 and 271 nm) compared with tryptophan (λ max 280 nm)]

Which chemical groups are involved in H-bonding?

Nitrogenous bases DO NOT base-pair in solution

Effects of temperature

One of the simplest ways of denaturing nucleic acids is to raise the temperature until all the hydrogen bonds and stacking forces have been broken. The strands then dissociate or "melt". Single-stranded nucleic acids absorb UV light (maximum at 260 nm) more strongly than ds-DNA, a phenomenon called hyperchromicity. The absorption of DNA itself is some 40% less than would be displayed by a mixture of free nucleotides of the same composition. The increase in absorption is proportional to the extent of denaturation, so dissociation can be followed by monitoring absorbance as the temperature is raised.

Thermal melting curve for DNA. As the DNA is heated, the weak forces holding the ds-DNA together are broken. The midpoint of the increase in absorption is defined as the melting temperature (Tm) or transition temperature.

DNA denaturation/renaturation experiments
⇒ effect of temperature or pH
⇒ hyperchromic effect (max. 40%)
⇒ free nitrogenous bases versus stacked nitrogenous bases
⇒ simple and cheap experiment
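The definition of Tm as the midpoint of the absorbance increase can be sketched in a few lines. This is only an illustration: the function name `melting_temperature` and the A260 readings are hypothetical and do not come from the slides.

```python
def melting_temperature(temps, absorbances):
    """Estimate Tm as the temperature at which absorbance is halfway
    between its lowest and highest value (linear interpolation)."""
    low, high = min(absorbances), max(absorbances)
    midpoint = (low + high) / 2.0
    points = list(zip(temps, absorbances))
    # Walk along the curve and interpolate inside the first rising
    # segment that contains the midpoint absorbance.
    for (t0, a0), (t1, a1) in zip(points, points[1:]):
        if a0 <= midpoint <= a1 and a1 != a0:
            return t0 + (midpoint - a0) * (t1 - t0) / (a1 - a0)
    raise ValueError("absorbance never crosses its midpoint")


# Hypothetical melting curve: A260 rises by 40% (1.00 -> 1.40) on melting.
temps = [60, 70, 80, 85, 90, 95, 100]          # degrees Celsius
a260 = [1.00, 1.00, 1.05, 1.20, 1.35, 1.40, 1.40]

tm = melting_temperature(temps, a260)
hyperchromicity = (max(a260) - min(a260)) / min(a260)
print(f"Tm ~ {tm:.1f} C, hyperchromic effect ~ {hyperchromicity:.0%}")
```

Real melting curves are noisier, so in practice the midpoint is usually taken from a fitted or smoothed curve rather than from raw readings.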
The effect of temperature on nucleic acids and proteins (reversible versus irreversible denaturation)
⇒ small amounts of DNA can be detected
⇒ in addition to A260, ethidium bromide can also be used to monitor DNA
⇒ salt concentration affects renaturation; 10X SSC (1.5 M NaCl; 0.3 M Na-citrate, pH 7.0) is the standard buffer
⇒ use of sonicated DNA (fragments of 400 - 500 bp)

Analysis of the GC content of the DNA extracted from different organisms
⇒ No conclusive results on the organization of GENOMES (e.g. E. coli 50%, S. cerevisiae 70%)

[Figure: melting curves of Pneumococcus DNA (40% GC content) and Serratia marcescens DNA (60% GC content)]

It is more difficult (it takes a higher temperature, that is, a higher energy input) to separate a GC base pair than an AT base pair.

Denatured DNA will renature to re-form the duplex structure if the denaturing conditions are removed (that is, if the solution is cooled, the pH is returned to neutrality, or the denaturants are diluted out). Renaturation requires reassociation of the DNA strands into a double helix, a process termed reannealing. For this to occur, the strands must realign themselves so that their complementary bases are once again in register and the helix can be zippered up. Many of the realignments are imperfect, and thus the strands must dissociate again to allow proper pairings to form.

Renaturation and Hybridization

The reverse of nucleic acid denaturation is renaturation, also referred to as reassociation or hybridization. When heat-denatured DNA is cooled slowly, the complementary strands reassociate to form a double-stranded molecule. This is a two-step process:
- random collisions
- rapid zippering

Single strands of DNA undergo numerous random collisions until eventually a collision occurs that places complementary sequences in the correct register. This event is known as NUCLEATION.
Formation of a base-paired duplex quickly follows.

At the Tm of a long nucleic acid, 50% of the base pairs have been broken, and the molecules contain both single-stranded and double-stranded regions.

Heat denaturation of hybrids containing oligonucleotides. These hybrids are too short to contain both single-stranded and double-stranded regions, and they melt over a very sharp temperature range. At the Tm, half the hybrids have dissociated.

Satellite Sequences (1961)

At the time, several laboratories were carrying out equilibrium density gradient centrifugation studies on DNA from higher organisms. Two of them simultaneously found that a number of such organisms, including two species of crab, guinea pig and mouse, had DNA that, unlike prokaryotic DNA, did not behave as a single entity upon centrifugation. The DNA showed a major band, similar to that of prokaryotic DNA, and a minor band (or bands) that had a different density. The minor band was termed "satellite".

Main band (92%): mouse DNA, GC content = 42% (typical for a mammal)
Minor band (8%, lower density): mouse satellite DNA, GC content = 30%

Satellite DNA in Drosophila melanogaster

Satellite                  Sequence                    % of genome
1.672 g/cc, 1.686 g/cc     (A A T A T)n                3.0%
                           (A A T A C)n                <0.5%
                           (A A T A G)n                <0.5%
                           (A A T A A C A T A G)n      2.0%
                           (A A G A C)n                2.4%
1.688 g/cc                 359 base pair repeat        6.8%
1.705 g/cc                 (A A G A G)n                6.0%
                           (A A G A G A G)n            1.5%

SCIENCE, Carnegie Institution of Washington

Time-course of an ideal reaction: Cot curves

DNA denaturation/renaturation experiments (renaturation at 25°C below Tm)

C/C0 = 1 / (1 + k·C0·t)

C = concentration of ssDNA at time t
C0 = initial concentration of ssDNA (t = 0)
k = rate constant
t = incubation time

When C/C0 = ½:
½ = 1 / (1 + k·C0·t½)
1 + k·C0·t½ = 2 → k·C0·t½ = 1

The meaning of Cot

C0t½ = 1/k

A greater Cot½ implies a slower reaction. Cot is expressed in nucleotide-moles × sec / liter. The renaturation of the DNA of any genome should display a Cot½ that is proportional to its complexity. [second-order kinetics]

1966: renaturation experiments were carried out with mouse DNA.
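The ideal second-order time course C/C0 = 1 / (1 + k·C0·t) and the relation C0t½ = 1/k can be checked numerically. The sketch below is a minimal illustration; the function names and the values of k, C0 and t are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fraction_single_stranded(k, c0, t):
    """C/C0 for an ideal second-order renaturation: 1 / (1 + k*C0*t)."""
    return 1.0 / (1.0 + k * c0 * t)


def cot_half(k):
    """Cot value at which half of the DNA has renatured: C0*t_half = 1/k."""
    return 1.0 / k


# Hypothetical rate constant; a larger Cot1/2 means a slower reaction.
k = 0.25                     # liter / (nucleotide-mole * sec)
half = cot_half(k)           # 4.0 nucleotide-moles * sec / liter

# Any combination of C0 and t with C0*t == Cot1/2 leaves 50% single-stranded.
c0, t = 2.0, 2.0
remaining = fraction_single_stranded(k, c0, t)
print(f"Cot1/2 = {half}, fraction single-stranded at C0*t = {c0 * t}: {remaining}")
```

Because only the product C0·t enters the equation, doubling the DNA concentration halves the time needed to reach the same extent of renaturation, which is why the curves are plotted against Cot rather than against time.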
If mouse DNA was denatured and then subjected to renaturation, a fraction of the DNA (about 8 - 10%) renatured in very short times, many times shorter than the simplest natural DNA known at the time (red line in the figure).

Conclusion: this fraction of the mouse genome was present in all mouse tissues examined, and was a simple sequence of about 200 - 400 nucleotides (ultimately it was found to be a sequence of 234 bp). It was present in tandem arrays and was repeated about a million times in the mouse genome. Subsequent analysis showed that this DNA formed a satellite upon density gradient centrifugation.

Satellite DNA in general has the following characteristics:
- It is found in almost all eukaryotes.
- It consists of short sequences, typically 2 to 200 bp in length, repeated many times in one or more tandem arrays.
- The bulk of the sequences seem to be located in a fraction of chromatin called heterochromatin. When its location is determined, it is found mostly around the centromeres and at the telomeres.
- The satellite sequences are seldom transcribed.

Some of these satellites are cryptic; that is, they do not have a density that is different from main-band DNA and so are not visible as distinct bands after CsCl centrifugation. They become visible when actinomycin D is added to the density gradients. This and similar antibiotics bind at GC sites and shift the density of the DNA.

Renaturation rates depend on temperature, salt concentration and the length of the DNA molecules involved. If these factors are held constant, then the rate of renaturation (how much of the DNA forms a duplex per time period) depends only on the concentration of the two strands of DNA.

DNA fingerprinting in Forensic Science

Minisatellites: 10-100 bp repeated several times in tandem (VNTR = Variable Number Tandem Repeats)
Microsatellites: 2-4 bp repeated several times in tandem (STR = Short Tandem Repeat Polymorphisms). They represent 3% of the human genome.
Tandem nucleotide repeats are generated by slippage mutations occurring during DNA replication.

In most cases, DNA fingerprinting is based on minisatellite or microsatellite markers. When different probes are used to make several fingerprints, the likelihood that any two individuals chosen at random will have identical matches in all of them is extremely small (less than 1 in 1 trillion).

Nucleic Acid Hybridization

If DNA from two different species is mixed, denatured, and allowed to cool slowly so that reannealing can occur, artificial hybrid duplexes may form, provided the DNA from one species is similar in nucleotide sequence to the DNA of the other. The degree of hybridization is a measure of the sequence similarity or relatedness between the two species.

Used for:
- Evolutionary relationships (Zuckerkandl, E. & Pauling, L. Molecules as documents of evolutionary history. J. Theor. Biol. (1965), 8:357-366)
- Identification of specific genes, using oligonucleotide or polynucleotide probes
- Quantitative expression of genes (amount of mRNA synthesized)

[Figure: ssDNA and DNA-RNA hybrid]
Conclusion: each RNA molecule is derived from a specific DNA sequence.

Re-Annealing or Hybridization

Works with:
• DNA - DNA
• DNA - RNA
• RNA - RNA
It is the basis of many techniques in molecular biology.

[Figure: DNA renaturation curve for mouse genomic DNA; axes: % of dsDNA, % of ssDNA; components: 10%, 15%, 75%]

Chemical and kinetic complexity

C0t½ (DNA of any genome) / C0t½ (E. coli) = Complexity of any genome / 4.2 × 10^6 bp

This genome has a haploid DNA content of 7 × 10^8 bp (chemical complexity).

Assuming C0t½ (E. coli) = 4:
Kinetic complexity, fast component = 0.0013 × 0.25 × 4.2 × 10^6 bp / 4 = 340 bp
Kinetic complexity, intermediate component = 1.9 × 0.30 × 4.2 × 10^6 bp / 4 = 6 × 10^5 bp
Kinetic complexity, slow component = 630 × 0.45 × 4.2 × 10^6 bp / 4 = 3 × 10^8 bp

Repetition frequency = total DNA content (bp) × fraction of the component / kinetic complexity (bp)

Genes are expressed at widely differing levels

Abundant mRNAs consist of a small number of individual species, each present in a large number of copies per cell.
Scarce mRNA (complex mRNA) consists of a large number of individual mRNA species, each present in very few copies per cell. This accounts for most of the sequence complexity in RNA.

• The first component is the ovalbumin mRNA, which indeed occupies about 50% of the messenger mass in oviduct tissue.
• The next component provides 15% of the reaction, with a total complexity of 15 kb. This corresponds to 7-8 mRNA species of average length 2000 bases.
• The last component provides 35% of the reaction, which corresponds to a complexity of 26 Mb. This corresponds to ~13,000 mRNA species of average length 2000 bases.

Hybridization between excess mRNA and cDNA identifies several components in chick oviduct cells, each characterized by the Rot½ of reaction.

Elliot Volkin (1957), ORNL (Oak Ridge National Laboratory, TN)

Infection of E. coli with bacteriophage T2 + 32PO4 labeling
Alkaline hydrolysis, then paper chromatography
Quantitative measurement of C, G, U, A (Do you remember G. Mendel and E. Chargaff?)

Nucleotide composition of the host E. coli RNA (C:G:U:A ratios of 1 : 1.4 : 1 : 1)
Nucleotide composition of the infecting phage RNA (C:G:U:A ratios of 1 : 1 : 1.9 : 1.9)
Nucleotide composition of the infecting phage DNA (C:G:T:A ratios of 1 : 1 : 1.9 : 1.9)

This molecule was originally called "DNA-like RNA".

Messenger RNA (mRNA) ... the metabolic intermediate X
Features: 1) it has the same base composition as the DNA; 2) it turns over very quickly

F. Jacob, M. Meselson, J. Monod, S. Brenner, F. Crick (1960)

1) Bacteria are grown for several generations on heavy-isotope labeled media ⇒ "heavy"-labeled bacteria
2) Bacteria are infected with T4 virus, which destroys the bacterial DNA and substitutes viral DNA.
Bacterial cells are simultaneously transferred to a "light" (non heavy-isotope labeled) medium
3) A radioactive RNA precursor (14C-uracil) is added and time is allowed for virus-directed synthesis to proceed
4) Bacteria are lysed, and ribosomes are centrifuged on a CsCl gradient
5) Only heavy ribosomes are present after centrifugation
Conclusion: ribosomes are not carriers of genetic information; instead they are protein synthesis machines
6) Any newly synthesized RNA is detected because it is radioactive. New radioactive [14C] RNA is associated with "old" bacterial ribosomes
7) New radioactive RNA is removed from the ribosomes. It hybridizes with viral single-stranded DNA and not with bacterial ss-DNA.

These experiments established that:
1. The expression of the viral genes is associated with the formation of new virus-specific RNA molecules (mRNA);
2. Ribosomes are NOT involved in viral gene expression except as passive sites of synthesis;
3. The new messenger RNA has a base sequence complementary to DNA, and presumably originated there;
4. The new mRNA can be isolated complexed to ribosomes.
It follows that these new RNA molecules are indeed the genetic messenger, carrying information from DNA to ribosome, as envisaged by Francis Crick.

(M.I.T. & CSHL, New York)

Human adenovirus type 41 (Ad41) is involved in the aetiology of some enteric diseases. The major capsid protein (hexon) contributes to the pseudo-hexagonal shape of the virus (icosahedral type).

R-looping experiments

The electron microscope photo shows the formation of a DNA-mRNA hybrid. This hybrid contains loops.

A TESTABLE HYPOTHESIS

As a consequence of the discovery that genes are often split, it seems likely that higher organisms, in addition to undergoing mutations, may use another mechanism to speed up evolution: rearrangement (or shuffling) of gene segments into new functional units. This can take place in the germ cells through crossing-over during pairing of chromosomes.
This hypothesis seems even more attractive following the discovery that individual exons in several cases correspond to building modules in proteins, so-called domains, to which specific functions can be attributed. An exon in the genome would thus correspond to a particular subfunction in the protein, and the rearrangement of exons could result in a new combination of subfunctions in a protein. This kind of process could drive evolution considerably by rearranging modules with specific functions.

Structure of eukaryotic genes

Genomic DNA (20 - 50 kbp)
→ (RNA pol) heterogeneous nuclear RNA (hnRNA), the precursor RNA
→ mature messenger RNA (average length = 2.2 kb in human; approx. 5% of the length of the gene)

Notes:
1. Alternative splicing from intronic sequences
2. Mini-RNAs with regulatory function(s) are produced

Prokaryotic and Eukaryotic Genes

Uninterrupted genes
Interrupted genes are expressed via a precursor RNA. Introns are removed when the exons are spliced together. The mRNA contains only the sequences of the exons.

The exons coding for stretches of protein tend to be fairly small. In higher eukaryotes, the average exon codes for ~50 amino acids, and the general distribution fits well with the idea that genes have evolved by the slow addition of units that code for small, individual domains of proteins.

The introns vary widely in size. There are no very long introns in worms, but flies contain a significant proportion. In vertebrates, the size distribution is much wider, extending from approximately the same length as the exons (<200 bp) to lengths measured in tens of kb, and extending up to 50-60 kb in extreme cases.

Eukaryotic genome organization

Very long genes are the result of very long introns, not the result of coding for longer products.

Species            Average exon number   Average gene length (kb)   Average mRNA length (kb)
S. cerevisiae      1                     1.6                        1.6
Fungi              3                     1.5                        1.5
C. elegans         4                     4.0                        3.0
D. melanogaster    4                     11.3                       2.7
Chicken            9                     13.9                       2.4
Mammals            7                     16.6                       2.2
(slide annotation: 10 times)

There are virtually no S. cerevisiae genes with more than 4 exons.

Uninterrupted genes

BACTERIAL GENOMES

Species                                 Genome size (kb)   No. of genes   Gene density (per 1 kb)   Fraction of noncoding DNA
Aeropyrum pernix                        1670               1688           1.01                      0.14
Sulfolobus solfataricus                 2592               3012           1.16                      0.15
Sulfolobus tokodaii                     2695               2956           1.10                      0.15
Methanococcus jannaschii                1665               1828           1.10                      0.12
Methanobacterium thermoautotrophicum    1751               1917           1.09                      0.09
Pyrococcus horikoshii                   1739               1796           1.03                      0.09
Pyrococcus abyssi                       1765               1802           1.02                      0.08
Archaeoglobus fulgidus                  2178               2467           1.13                      0.07
Thermoplasma acidophilum                1565               1528           0.98                      0.12
Thermoplasma volcanium                  1585               1548           0.98                      0.14
Halobacterium sp.                       2570               2640           1.03                      0.14
Escherichia coli K12                    4639               4375           0.94                      0.12
Buchnera sp.                            641                610            0.95                      0.12
Salmonella typhi                        4809               4696           0.98                      0.13
Vibrio cholerae                         4033               3949           0.98                      0.13
Yersinia pestis                         4654               4096           0.88                      0.19
Haemophilus influenzae                  1830               1746           0.96                      0.11

Organism           GC %    Number of genes annotated
Buchnera           26.2    564
B. burgdorferi     28.6    857
C. jejuni          30.6    1654
M. jannaschii      31.4    1715
M. genitalium      31.7    483
H. influenzae      38.0    1754
H. pylori          38.9    1593
A. aeolicus        43.3    1517
B. subtilis        43.5    4220
Synechocystis      47.6    3169
Y. pestis          47.6    4043
E. coli            50.8    4290
D. radiodurans     67.0    2622
R. solanacearum    67.0    3442
S. coelicolor      72.1    7851

The C-value paradox

The total amount of DNA in the (haploid) genome is a characteristic of each living species, known as its C-value. The C-value paradox describes the lack of relationship between the DNA content (C-value) of an organism and its coding potential. In some phyla (insects, amphibians, and plants) there are extremely large variations in DNA content between organisms that do not vary very much in complexity.

... a short tour into GENOMICS

"GENOM" (Hans Winkler, Univ. Hamburg, 1920): gen "gene" + (chromos)om "chromosome"
He referred to the complete set of chromosomes, intended as carriers of hereditary factors.
1. WHAT IS GENOMICS

GENOMICS: the study of genomes, starting from the determination of the nucleotide sequence (G-A-T-C) of the chromosome(s) of a given organism.

GENOMES:
- How big are they?
- How are they organized?
- Why sequence them?

2. OBJECTIVES

- Construct physical and genetic maps
- Determine the DNA sequence
- Identify all the genes and their regulatory sequences
- Characterize non-coding DNA sequences
- Recognize the molecular bases of cellular processes and the differences among organisms

- "Holistic approach"
- The amount of data obtained requires the use of algorithms and computers that are able to assemble, organize and analyze all the information.
- Results should be made available to the scientific community.

3. STRATEGIES USED FOR GENOMIC SEQUENCING PROJECTS

What is DNA sequencing?
Genomes and genes are "restricted" to small fragments. Subsequently, the chemical composition of these fragments is determined, as a sequence of bases. [Reductionist approach]

Genetic Map → measuring recombination frequencies of "linked markers" (genes or polymorphisms whose pattern of transmission can be tracked) → low resolution
Restriction Map → alignment of 1-2 Mbp DNA fragments → medium resolution
Libraries → 40-400 kbp DNA fragments inserted into artificial chromosomes (YAC, BAC, cosmids) → high resolution
Nucleotide Sequence → "is the ultimate physical map"

Translation initiation factor IF-1 [Escherichia coli str. K-12 substr. MG1655]

>EG10504 infA Protein chain initiation factor IF1
925665  5' ATGGCCAAAGAAGAC AATATTGAAATGCAA GGTACCGTTCTTGAA ACGTTGCCTAATACC
           ATGTTCCGCGTAGAG TTAGAAAACGGTCAC GTGGTTACTGCACAC ATCTCCGGTAAAATG
           CGCAAAAACTACATC CGCATCCTGACGGGC GACAAAGTGACTGTT GAACTGACCCCGTAC
925466     GACCTGAGCAAAGGC CGCATTGTCTTCCGT AGTCGCTGA 3'
219 nucleotides

>EG10504 infA Protein chain initiation factor IF1
ATG GCC AAA GAA GAC AAT ATT GAA ATG CAA
 M   A   K   E   D   N   I   E   M   Q
GGT ACC GTT CTT GAA ACG TTG CCT AAT ACC
 G   T   V   L   E   T   L   P   N   T
ATG TTC CGC GTA GAG TTA GAA AAC GGT CAC
 M   F   R   V   E   L   E   N   G   H
GTG GTT ACT GCA CAC ATC TCC GGT AAA ATG
 V   V   T   A   H   I   S   G   K   M
CGC AAA AAC TAC ATC CGC ATC CTG ACG GGC
 R   K   N   Y   I   R   I   L   T   G
GAC AAA GTG ACT GTT GAA CTG ACC CCG TAC
 D   K   V   T   V   E   L   T   P   Y
GAC CTG AGC AAA GGC CGC ATT GTC TTC CGT
 D   L   S   K   G   R   I   V   F   R
AGT CGC TGA
 S   R  (stop)

MAKEDNIEMQ GTVLETLPNT MFRVELENGH VVTAHISGKM RKNYIRILTG DKVTVELTPY DLSKGRIVFR SR
calculated_mol_wt = 8118

[Figure: 3D structure; labels: fMet-tRNA, IF2, IF3, mRNA, rRNA, tRNA]

Escherichia coli genome: ds circular DNA (4 639 221 bp)

Genome: 12.1 Mbp; number of estimated genes = 6,100 [pie-chart label: 1. Not classified]

Genome: 167 Mbp; 5 chromosomes (2N = 10); number of estimated genes = 25,706

Working Draft Sequence of the Human Genome (published in Feb. 2001)
[pie-chart label: Unknown functions]
► Human genome: 3.4 Gbp; number of estimated genes: 32,000-35,000
≅ 2% of the genome corresponds to EXONS
- 23% INTRONS (intervening sequences)
- 75% intergenic DNA
(< 2%)
Red bars → repeated sequences
Blue bars → EXONS

GenBank is an international archive where researchers can submit and retrieve partial or complete genomic sequences.
Seats:
- NCBI (Bethesda, Maryland, USA). Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information.
- EMBL (Heidelberg, Germany)
- DDBJ, DNA Data Bank of Japan (Mishima, Japan)

BLAST is a program used to compare a sequence of interest against all known sequences available.

Types of DNA markers present in genomic DNA
- SNP (Single Nucleotide Polymorphism) [> 1% in a population]
- RFLP

5' GCTCTATCGTT 3'      5' GCTCTACCGTT 3'
3' CGAGATAGCAA 5'      3' CGAGATGGCAA 5'

The SNP defines two alleles, for which there can be three genotypes among individuals in the population. In the example shown above:
- Homozygous, with T-A in both homologous chromosomes
- Homozygous, with C-G in both homologous chromosomes
- Heterozygous, with T-A in one chromosome and C-G in the homologous chromosome

► SNPs need not be in a coding sequence, or even in a gene.
► They are the most common form of genetic difference among people.
► They are distributed approximately uniformly along the chromosomes (> 4 × 10^6 identified).

A marker is like a signpost on the genetic highway: a spot that is observed in everyone and that can be used as a reference point among people. (http://snp.cshl.org/ http://www.hgvbaseg2p.org/)

The marker itself (a SNP, for example) may or may not cause the disease, medicine response or other phenotype that is being examined. In some cases it may be directly linked to the phenotype, but it is useful as a signpost in either case.

Using the information SNPs provide, it may be possible to predict your genetic risk of developing a certain disease, to diagnose a disease more accurately, or to predict how you will most likely respond to a medicine.

How might a doctor's knowledge of your genetic data affect your everyday life in the future? Just as you carry your medical insurance card with you, you may also one day carry a wallet-sized card that has your genetic data coded on it. Doctors would be able to use this data to predict your risk of developing a disease and your likely response to a medicine before they prescribe it for you → towards personalized medicine.
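The SNP example earlier in this section (two 11-nucleotide top strands that differ at one position, giving two alleles and three genotypes) can be reproduced with a short sketch. The helper names `snp_positions` and `genotypes` are hypothetical and serve only as illustration.

```python
from itertools import combinations_with_replacement


def snp_positions(seq_a, seq_b):
    """Return the 0-based positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def genotypes(alleles):
    """List every unordered pair of alleles a diploid individual can carry."""
    return ["/".join(pair)
            for pair in combinations_with_replacement(sorted(alleles), 2)]


# Top strands of the two variants shown in the slide.
variant_1 = "GCTCTATCGTT"
variant_2 = "GCTCTACCGTT"

positions = snp_positions(variant_1, variant_2)
alleles = {variant_1[positions[0]], variant_2[positions[0]]}
possible = genotypes(alleles)

print("SNP at position(s):", positions)   # single differing site
print("Alleles:", sorted(alleles))        # T and C on the top strand
print("Genotypes:", possible)             # two homozygous, one heterozygous
```

With n alleles at a site there are n(n + 1)/2 unordered genotypes, which gives the three genotypes listed for a biallelic SNP.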
Connecting phenotype with genotype is the fundamental aim of Genetics

Some polymorphic variants

Relative risk is the ratio of the disease rate in exposed persons to that in people who are unexposed. This ratio can be < 1 or > 1.

Cytogenetic location of the APOE gene: 19q13.2
Molecular location on chromosome 19: base pairs 50,100,878 to 50,104,489

Lipoproteins are responsible for packaging cholesterol and other fats and carrying them through the bloodstream. Apolipoprotein E is a major component of a specific type of lipoprotein called very low-density lipoproteins (VLDLs). VLDLs remove excess cholesterol from the blood and carry it to the liver for processing. Maintaining normal levels of cholesterol is essential for the prevention of disorders that affect the heart and blood vessels (cardiovascular diseases), including heart attack and stroke.

The ε4 version of the APOE gene (Cys112 → Arg112; Cys158 → Arg158) increases an individual's risk of developing late-onset Alzheimer disease. People who inherit one copy of the APOE ε4 allele have an increased chance of developing the disease; those who inherit two copies of the allele are at even greater risk. The APOE ε4 allele may also be associated with an earlier onset of memory loss and other symptoms.

5' TGT 3' → 5' CGT 3'
Cys → Arg

The number of genes in a eukaryote varies from 6,000 to 40,000 but correlates neither with the genome size nor with the complexity of the organism.

S. cerevisiae → 6,000 genes
S. pombe → 5,000 genes
C. elegans → 18,500 genes
D. melanogaster → 13,600 genes
Arabidopsis thaliana → 25,000 genes
Oryza sativa → 40,000 genes
Mouse → 33,000 genes
Man → 35,000 genes

Mycoplasma genitalium → 470 genes
Archaea → 1,500-2,700 genes
H. influenzae → 1,743 genes
E. coli → 4,288 genes
Nitrogen-fixing bacteria → 6,000 genes

The bacteria with genome sizes below 1.5 Mb are obligate intracellular parasites. Their genomes identify the minimum number of functions required to construct a cell.
The MINIMUM GENE NUMBER required for any type of organism increases with its complexity.

Key words: Genome size - Gene density - Complexity of the genome - Complexity of the organism

Complexity of a genome is defined as the total length (bp) of different DNA sequences.
Gene density is the average number of genes per Mb of genomic DNA.
There is an inverse correlation between organism complexity and gene density: the less complex the organism, the higher the gene density.
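The definition of gene density (genes per Mb) can be applied to figures quoted earlier in these slides: E. coli (4,639,221 bp, 4,288 genes) and the human genome (3.4 Gbp, taking the upper estimate of 35,000 genes). The function name `gene_density_per_mb` is hypothetical, and the sketch is only a rough illustration of the inverse trend.

```python
def gene_density_per_mb(gene_count, genome_size_bp):
    """Average number of genes per megabase of genomic DNA."""
    return gene_count / (genome_size_bp / 1_000_000)


# Figures as quoted in the slides; the human gene count is the upper estimate.
genomes = {
    "E. coli": (4_288, 4_639_221),
    "Human": (35_000, 3_400_000_000),
}

densities = {name: gene_density_per_mb(genes, size)
             for name, (genes, size) in genomes.items()}

for name, density in densities.items():
    print(f"{name}: {density:.1f} genes per Mb")
```

On these numbers the bacterial genome carries roughly ninety times more genes per Mb than the human genome, in line with the inverse correlation stated above.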