
BB30055: Genes and genomes
Genomes - Dr. MV Hejmadi (bssmvh)

BB30055: Genomes - MVH
3 broad areas
(A) Genomes, transcriptomes, proteomes
(B) Applications of the human genome project
(C) Genome evolution

Why sequence the genome?
3 main reasons
• A description of the sequence of every gene is valuable. It includes the regulatory regions, which help in understanding not only the molecular activities of the cell but also the ways in which they are controlled.
• To identify and characterise important heritable disease genes or bacterial genes (for industrial use)
• To understand the role of intergenic sequences, e.g. satellites, intronic regions, etc.

History of the Human Genome Project (HGP)
1953 – DNA structure (Watson & Crick)
1972 – Recombinant DNA (Paul Berg)
1977 – DNA sequencing (Maxam, Gilbert and Sanger)
1985 – PCR technology (Kary Mullis)
1986 – Automated sequencing (Leroy Hood & Lloyd Smith)
1988 – IHGSC established (NIH, DOE); Watson leads
1990 – IHGSC scaled up; BLAST published (Lipman + Myers)
1992 – Watson quits; Venter sets up TIGR
1993 – F Collins heads IHGSC; Sanger Centre (Sulston)
1995 – cDNA microarray
1998 – Celera Genomics (J Craig Venter)
2001 – Working draft of human genome sequence published
2003 – Finished sequence announced

Human Genome Project (HGP)
Goal: obtain the entire DNA sequence of the human genome
Players:
(A) International Human Genome Sequencing Consortium (IHGSC)
- public funding, free access to all, started earlier
- used the method of mapping overlapping clones
(B) Celera Genomics
- private funding, pay to view
- started in 1998
- used the whole-genome shotgun strategy

Whose genome is it anyway?
(A) IHGSC: a composite from several different people, generated from 10-20 primary samples taken from numerous anonymous donors across racial and ethnic groups
(B) Celera Genomics: 5 different donors (one of whom was J Craig Venter himself!)

Strategies for sequencing the human genome
Sequencing larger genomes: mapping phase, then sequencing phase

Result…
~30,000-40,000 protein-coding genes estimated, based on known genes and predictions

          Definite genes   Possible genes
IHGSC     24,500           5,000
Celera    26,383           12,000

Organisation of the human genome
• Nuclear genome (3.2 Gbp): 24 types of chromosomes, from Y (51 Mbp) to chr1 (279 Mbp)
• Mitochondrial genome

General organisation of the human genome
• Coding and regulatory (mRNA)
• Non-polypeptide-coding: RNA

Gene organisation
Rare bicistronic transcription units
E.g. UBA52 transcription generates ubiquitin and ribosomal protein L40

Pseudogenes (ψ)
Processed pseudogenes
• Non-functional copies of the exonic sequences of an active gene
• Thought to arise by genomic insertion of a cDNA as a result of retroposition
• Contribute to the overall repetitive element content (<1%)
Example: pseudogenes in the globin gene cluster

Gene fragments or truncated genes
• Gene fragments: small segments of a gene (e.g. a single exon from a multi-exon gene)
• Truncated genes: short components of functional genes (e.g. the 5' or 3' end)
• Thought to arise through unequal crossover or exchange

Repetitive elements
Main classes based on origin
• Tandem repeats
• Interspersed repeats
• Segmental duplications

1) Tandem repeats
Blocks of tandem repeats at
• subtelomeres
• pericentromeres
• short arms of acrocentric chromosomes
• ribosomal gene clusters

Tandem / clustered repeats
Broadly divided into 4 types based on size

Class            Size of repeat   Repeat block   Major chromosomal location
Satellite        5-171 bp         > 100 kb       Centromeric heterochromatin
Minisatellite    9-64 bp          0.1–20 kb      Telomeres
Microsatellite   1-13 bp          < 150 bp       Dispersed

HMG3 by Strachan and Read pp 265-268

Satellites
Large arrays of repeats
Some examples
• Satellites 1, 2 & 3
• α satellite (alphoid DNA) - found in all chromosomes
• β satellite

Minisatellites
Moderate-sized arrays of repeats
Some examples
• Hypervariable minisatellite DNA
  - core of GGGCAGGAXG
  - found in telomeric regions
  - used in the original DNA fingerprinting technique by Alec Jeffreys

Microsatellites
• VNTRs (Variable Number of Tandem Repeats), SSRs (Simple Sequence Repeats)
• 1-13 bp repeats, e.g. (A)n; (AC)n
• 2% of genome (dinucleotides - 0.5%)
• Used as genetic markers (especially for disease mapping)
Individual genotype

Microsatellite genotyping
• Design PCR primers unique to one locus in the genome
• A single pair of PCR primers will produce different-sized products for each of the different-length microsatellites

2) Interspersed repeats
• A.k.a. transposon-derived repeats
• 45% of genome
• Arise mainly as a result of transposition, through either a DNA or an RNA intermediate

Interspersed repeats (transposon-derived): major types

Class            Family             Size       Copy number    % genome*
LINE             L1 (Kpn family)    ~6.4 kb    0.5 x 10^6     16.9
LINE             L2                            0.3 x 10^6     3.2
SINE             Alu                ~0.3 kb    1.1 x 10^6     10.6
LTR              e.g. HERV          ~1.3 kb    0.3 x 10^6     8.3
DNA transposon   mariner            ~0.25 kb   1-2 x 10^4     2.8

* Updated from HGP publications
HMG3 by Strachan & Read pp268-272

LINEs (long interspersed elements)
• Among the most ancient elements of eukaryotic genomes
• Autonomous transposition (reverse transcriptase)
• ~6-8 kb long
• Internal polymerase II promoter and 2 ORFs
• 3 related LINE families in humans: LINE-1, LINE-2, LINE-3
• Believed to be responsible for retrotransposition of SINEs and creation of processed pseudogenes
Nature (2001) pp879-880

SINEs (short interspersed elements)
• Non-autonomous (successful freeloaders! They 'borrow' RT from other sources such as LINEs)
• ~100-300 bp long
• Internal polymerase III promoter
• No proteins
• Share 3' ends with LINEs
• 3 related SINE families in humans: active Alu, inactive MIR and Ther2/MIR3
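
The microsatellite genotyping idea above (one primer pair, product size set by the number of repeats) can be illustrated with a small calculation. This is only a sketch: the flank lengths, the (AC)n repeat unit and the allele repeat counts are made-up values for illustration, not data from the lecture.

```python
# Hypothetical locus: primers sit in unique flanking sequence on either side
# of an (AC)n microsatellite, so the PCR product length depends only on n.

def pcr_product_length(left_flank_bp, right_flank_bp, repeat_unit, repeat_count):
    """Length in bp of the PCR product spanning the repeat block."""
    return left_flank_bp + len(repeat_unit) * repeat_count + right_flank_bp


def genotype(allele_repeat_counts, left_flank_bp=60, right_flank_bp=60, repeat_unit="AC"):
    """Product sizes (bp) for the two alleles of a diploid individual, sorted."""
    return sorted(
        pcr_product_length(left_flank_bp, right_flank_bp, repeat_unit, n)
        for n in allele_repeat_counts
    )


# An assumed heterozygote carrying 12 and 17 copies of the AC repeat.
sizes = genotype((12, 17))
print(sizes)  # [144, 154]
```

Two bands of different size indicate a heterozygote at this locus; a homozygote would give a single size, which is why such markers are informative for mapping.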
LINEs and SINEs have preferred insertion sites
• In this example, yellow represents the distribution of mys (a type of LINE) over a mouse genome where chromosomes are orange. There are more mys inserted in the sex (X) chromosomes.

Try the link below to do an online experiment which shows how an Alu insertion polymorphism has been used as a tool to reconstruct the human lineage:
http://www.geneticorigins.org/geneticorigins/pv92/intro.html

Long Terminal Repeats (LTR)
Repeats in the same orientation on both sides of the element, e.g. ATATATNNNNNNNATATAT
• Contain sequences that serve as transcription promoters as well as terminators
• These sequences allow the element to code for an mRNA molecule that is processed and polyadenylated
• At least two genes are coded within the element to supply essential activities for the retrotransposition mechanism
• The RNA contains a specific primer binding site (PBS) for initiating reverse transcription
• A hallmark of almost all mobile elements is that they form small direct repeats at the site of integration

Long Terminal Repeats (LTR)
• Autonomous or non-autonomous
• Autonomous retroposons encode gag and pol genes, which encode the protease, reverse transcriptase, RNase H and integrase
Nature (2001) pp879-880
HMG3 by Strachan & Read pp268-272

DNA transposons (lateral transfer?)
Inverted repeats on both sides of the element, e.g. ATGCNNNNNNNNNNNCGTA
Nature (2001) pp879-880
From Genes VII by Lewin

3) Segmental duplications
• Closely related sequence blocks at different genomic loci
• Transfer of 1-200 kb blocks of genomic sequence
• Segmental duplications can occur within the same chromosome (intrachromosomal) or between non-homologous chromosomes (interchromosomal)
• Not always tandemly arranged
• Relatively recent

Segmental duplications
• Interchromosomal: segments duplicated among non-homologous chromosomes
• Intrachromosomal: duplications occur within a chromosome / arm
Examples: segmental duplications in chromosome 22; segmental duplications in chromosome 7
Nature Reviews Genetics 2, 791-800 (2001)

Major insights from the HGP
1) Gene size, content and distribution
2) Proteome content
3) SNP identification
4) Distribution of GC content
5) CpG islands
6) Recombination rates
7) Repeat content
Nature (2001) 15th Feb Vol 409 special issue; pgs 814 & 875-914

1) Gene size

Gene content
• More genes: twice as many as Drosophila / C. elegans
• Uneven gene distribution: gene-rich and gene-poor regions
• More paralogs: some gene families have expanded the number of paralogs, e.g. the olfactory gene family has 1000 genes
• More alternative transcripts: increased RNA splice variants, thereby expanding the primary proteins 5-fold (e.g. neurexin genes)

Gene distribution
• Genes generally dispersed (~1 gene per 100 kb)
• Class III complex at HLA 6p21.3
• Overlapping genes (transcribed from 2 DNA strands) - rare
• Genes within genes, e.g. NF1 gene
HMG3 Fig 9.8

Uneven gene distribution
• Gene-rich regions, e.g. MHC on chromosome 6 has 60 genes with a GC content of 54%
• Gene-poor regions: 82 gene deserts identified
  ? Large or unidentified genes
What is the functional significance of these variations?
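
As a rough check on the "~1 gene per 100 kb" figure given under Gene distribution, the genome size (3.2 Gbp) and the estimated gene count (30,000-40,000) quoted earlier in these notes can simply be divided. The sketch below does only that arithmetic; it assumes genes are spread evenly, which the gene-rich and gene-poor regions above show is not the case.

```python
GENOME_BP = 3_200_000_000  # nuclear genome size quoted in the notes (3.2 Gbp)


def kb_per_gene(gene_count, genome_bp=GENOME_BP):
    """Average spacing between genes in kb, assuming an even distribution."""
    return genome_bp / gene_count / 1000


for estimate in (30_000, 40_000):
    print(f"{estimate} genes -> about {kb_per_gene(estimate):.0f} kb per gene")
```

Both ends of the estimate land close to one gene per 100 kb, so the slide's figure is a genome-wide average rather than a description of any particular region.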
2) Proteome content
Proteome more complex than invertebrates
Protein domains (sections with identifiable shape/function)
Domain arrangements in humans
• Largest total number of domains is 130
• Largest number of domain types per protein is 9
• Mostly identical arrangement of domains
Protein X: A A B B B C C C C C

Proteome more complex than invertebrates…
• No huge difference in domain number in humans
• BUT the frequency of domain sharing is very high in human proteins (structural proteins and proteins involved in signal transduction and immune function)
• However, there are only 3 cases where a combination of 3 domain types is shared by human and yeast proteins, e.g. carbamoyl-phosphate synthase (involved in the first 3 steps of de novo pyrimidine biosynthesis) has 7 domain types; this arrangement occurs once in human and yeast but twice in Drosophila

3) SNPs (single nucleotide polymorphisms)
Sites that result from point mutations in individual base pairs
• Biallelic
• ~60,000 SNPs lie within exons and untranslated regions (85% of exons lie within 5 kb of a SNP)
• May or may not affect the ORF
• Most SNPs may be regulatory
• More than 1.4 million SNPs identified
  - one every 1.9 kb on average
  - densities vary over regions and chromosomes, e.g. the HLA region has a high SNP density, reflecting maintenance of diverse haplotypes over many million years
Nature (2001) 15th Feb Vol 409 special issue; pgs 821-823 & 928

How does one distinguish sequence errors from polymorphisms?
Sequence errors
• Each piece of genome is sequenced at least 10 times to reduce the error rate (0.01%)
Polymorphisms
• Sequence variation between individuals is 0.1%
• To be defined as a polymorphism, the altered sequence must be present in a significant proportion of the population
• Rate of polymorphisms in the diploid human genome is about 1 in 500 bp
Nature (2001) 15th Feb Vol 409 special issue; pgs 821-823 & 928

SNPs and disease

3) SNPs… and risk of disease
N(291)S

3) SNPs… and risk of disease
Late-onset Alzheimer's disease (LOAD)
• Apolipoprotein ε4 haplotype is a genetic risk factor
• 3 major alleles (APO E2, E3 and E4)
  - APO E2: Cys112 / Cys158
  - APO E3: Cys112 / Arg158
  - APO E4: Arg112 / Arg158

3) SNPs… and pharmacogenomics

4) Distribution of GC content
• Genome-wide average of 41%
• Huge regional variations exist, e.g. the distal 48 Mb of chromosome 1p has 47% but chromosome 13 has only 36%
• Confirms cytogenetic staining with G-bands (Giemsa)
  - dark G-bands: low GC content (37%)
  - light G-bands: high GC content (45%)
Nature (2001) 15th Feb Vol 409 special issue; pg 876-877

5) CpG islands
CpG -> methyl-CpG (methylated at C) -> TpG (by deamination)
CpG islands show no methylation

Significance of CpG islands
1) Non-methylated CpG islands are associated with the 5' ends of genes
2) Aberrant methylation of CpG islands is one mechanism of inactivating tumor suppressor genes (TSGs) in neoplasia
http://www.sanger.ac.uk/HGP/cgi.shtml

CpG islands
• CpG dinucleotides are greatly under-represented in the human genome
• ~28,890 CpG islands in number
• Variable density, e.g. Y has 2.9/Mb but chromosomes 16, 17 & 22 have 19-22/Mb
• Average is 10.5/Mb
Nature (2001) 15th Feb Vol 409 special issue; pg 877-888

6) Recombination rates
2 main observations
• Recombination rate increases with decreasing arm length
• Recombination rate is suppressed near the centromeres and increases towards the distal 20-35 Mb

7) Repeat content
a) Age distribution
b) Comparison with other genomes
c) Variation in distribution of repeats
d) Distribution by GC content
e) Y chromosome
Nature (2001) 409: pp 881-891

Repeat content…
a) Age distribution
• Most interspersed repeats predate the eutherian radiation (confirms the slow rate of clearance of nonfunctional sequence from vertebrate genomes)
• LINEs and SINEs have extremely long lives
• 2 major peaks of transposon activity
• No DNA transposition in the past 50 Myr
• LTR retroposons teetering on the brink of extinction

a) Age distribution
• Overall decline in interspersed repeat activity in the hominid lineage in the past 35-40 Myr, compared to the mouse genome, which shows a younger and more dynamic genome

b) Comparison with other genomes
• Higher density of transposable elements in the euchromatic portion of the genome
• Higher abundance of ancient transposons
• 60% of interspersed repeats (IR) are made up of LINE1 and Alu repeats, whereas DNA transposons represent only 6%
• (A few human genes appear likely to have resulted from horizontal transfer from bacteria!)

c) Variation in distribution of repeats
Some regions show either
• High repeat density, e.g. chromosome Xp11, where a 525 kb region shows 89% repeat density
• Low repeat density, e.g. HOX homeobox gene cluster (<2% repeats) (indicative of regulatory elements which have low tolerance for insertions)

d) Distribution by GC content
• High GC: gene-rich; high AT: gene-poor
• LINEs abundant in AT-rich regions
• SINEs lower in AT-rich regions
• Alu repeats in particular retained in actively transcribed GC-rich regions, e.g. chromosome 19 has 5% Alus compared to Y chromosome

e) The Y chromosome!
• Unusually young genome (high tolerance to gaining insertions)
• Mutation rate is 2.1X higher in male germline
• Possibly due to cell division rates or different repair mechanisms

• Working draft published – Feb 2001
• Finished sequence – April 2003
• Annotation of genes going on (refer: International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 21 October 2004, doi: 10.1038/nature03001)

Other genomes sequenced

Date          Organism          Genes
1997                            4,200 genes
1998                            19,099 genes
2002                            38,000 genes
2002          Mus musculus      36,000 genes
Sept 2003     Canis             18,473 human orthologs
31 Aug 2005   Pan troglodytes   28% identical human orthologs

Science (26 Sep 2003) Vol 301 (5641) pp 1854-1855

References
Text: HMG 3 by Strachan and Read, Chapter 9, pp 265-268
Nature (2001) 409: pp 879-891
Batzer MA, Deininger PL. Alu repeats and human genomic diversity. Nature Rev Genet 3 (5): 370-379, May 2002
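
The points made under "4) Distribution of GC content" and "5) CpG islands" can be tried out on any sequence with a few lines of code. The sketch below computes the GC fraction and the observed/expected CpG ratio. That ratio is a commonly used measure of CpG depletion and is an addition here, not something taken from these slides; the two test sequences are invented for illustration.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from base composition.

    Expected count is (number of C * number of G) / sequence length.
    Values well below 1 indicate CpG depletion, as expected where methylated
    C has been lost to deamination (CpG -> TpG).
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


# Invented examples: a CpG-dense stretch and a stretch with no CpG at all.
for name, seq in (("island-like", "CGCGCGCG"), ("depleted", "CATGCATG")):
    print(name, gc_fraction(seq), cpg_obs_exp(seq))
```

On real data one would slide a window along the sequence and look for stretches where both values are high, which is the general idea behind CpG island detection.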
