MGY428 - Genomes

John Parkinson
15-704 TMDT East Tower
Hospital for Sick Children

- History and availability of genomes
- What's in a genome
- Genome comparisons
- Prokaryotes / Eukaryotes
- Genome analyses

Finding information on genome projects

http://www.genomesonline.org/

Timeline of genome sequencing

[Figure: cumulative number of genomes (y axis, 0 to 600) against time (x axis: 1974, 1984, 1995, 1998, 2000, today), plotted for Bacteria, Archaea and Eukarya. Labelled milestones: Fred Sanger's method of sequencing DNA; HIV-1 genome sequenced; H. influenzae (first bacterium); S. cerevisiae (first eukaryote); M. tuberculosis (minimum genome); C. elegans (first multicellular organism); D. melanogaster (first eukaryote via whole-genome shotgun, W.G.S.); H. sapiens draft; H. sapiens "finished".]

Many bacteria, fewer eukaryotes

Number of published genomes:
- Bacteria: 526+
- Eukaryotes: 66
- Archaea: 47

[Figure: breakdown of the eukaryote genomes by group, including Metazoa and nematodes. The per-group counts cannot be matched to their labels.]

Where do I find genome data?

Sources run from well annotated 'finished' sequence to poorly annotated 'draft' sequence.

- Organism-specific sites
    - Yeast: SGD - yeastgenome.org
    - Plasmodium: plasmodb.org
    - C. elegans: wormbase.org
    - Drosophila: flybase.org
- Generic genome sites
    - ENSEMBL: ensembl.org
    - NCBI: ncbi.nlm.nih.gov/Genomes/
- Sequencing centers
    - Sanger: sanger.ac.uk
    - TIGR: tigr.org

WormBase - Caenorhabditis elegans
- Focuses on nematodes
- Among the best annotated
- Updated bi-monthly by expert curators

ENSEMBL
- Mainly focuses on vertebrates

NCBI - http://www.ncbi.nlm.nih.gov/Genomes/
- Not as well annotated
- Covers a wider spectrum of organisms

Genomes as hypotheses

[Figure: number of annotated proteins in WormBase (y axis, 0 to 25,000).]

The C. elegans genome will continue to be annotated as more data are generated (e.g. Marc Vidal's ORFeome project).

Which organism has the largest genome?

Organism               Genome size
H. influenzae          1.8 Mbp
Mudpuppy               50 Gbp
Ancylostoma            100 Mbp
David and Victoria     3 Gbp (each)
Fern                   160 Gbp
Elephant               3 Gbp
Amoeba                 670 Gbp

Distribution of genome size

[Figure: distribution of genome size across groups of organisms, ordered by increasing size.]

- More 'complex' organisms do not necessarily have larger genomes
- C-value paradox: due to 'junk' (repetitive) DNA
- C-value enigma: what causes the accumulation of junk?
- Smaller genomes may reflect a parasitic lifestyle

Genome comparisons – gene function in bacteria

- For certain functions (e.g. translation / transcription) a basic complement of proteins is required
- For other functions, the number of proteins can vary and may be related to genome size

[Figure: functional gene complements compared across Mycoplasma genitalium, Treponema pallidum, Borrelia burgdorferi, Helicobacter pylori, Methanococcus jannaschii, Haemophilus influenzae and Archaeoglobus fulgidus.]

Analysing gene complements informs on their biology

e.g. bacterial transporters. H. influenzae and M. pneumoniae both occupy the respiratory tract of humans, but they follow two very different strategies:
- H. influenzae employs a battery of redundant processes that allow it to optimize survival.
- M. pneumoniae uses a generalist strategy of maintaining proteins that are more versatile because of their broad substrate range.

What can we find in a genome?

- Genes!
    - Protein coding
    - Ribosomal RNAs (rRNAs), tRNAs, microRNAs, snoRNAs, snRNAs, ...
- Introns / exons
- 5' and 3' UTRs
- Regulatory regions: promoters, enhancers, repressors, etc.
- Single nucleotide polymorphisms (SNPs)

What can we find in a genome? Junk!?

- Pseudogenes
- Transposons
    - > 50% of human genome
    - Parasitic: spread through the genome
    - Viral origins (most are inactive)
    - 47 types found in human genome
- Tandem repeats
    - microsatellites (1-7 bp), e.g. cacacacacacaca....
    - minisatellites (typically <40 bp)
    - satellites (140-360 bp)

Prokaryotes / Eukaryotes

Feature                 Prokaryotes (526 genomes)    Eukaryotes (66 genomes)
Protein coding          80-90%                       < 2-70%
Genome size             500 Kbp-10 Mbp               5 Mbp-600 Gbp
tRNAs (typical)         40-80                        200-600
ORFs                    600-8000                     4000-40000
Average ORF size        925 bp                       1.3-25 kbp

Prokaryotes:
- Many bacteria have been sequenced because of their importance in medicine, agriculture and the food industry
- Introns virtually absent; few repetitive sequences; short intergenic sequences (< Kbp); genes organised as operons

Eukaryotes:
- Selected for sequencing on the basis of genome size as well as importance in fundamental research
- Many genes have introns; many repetitive sequences; large intergenic regions (up to many Mbp); few operons

Bacterial Genome Features – Chromosome organization

- Typically possess a single circular chromosome
- A few possess more than one chromosome (e.g. Vibrio)
- Some possess linear chromosomes (e.g. Streptomyces)
- Some contain both a linear and a circular chromosome (e.g. Agrobacterium)
- Some plasmids have also been sequenced (e.g. the megaplasmid of D. radiodurans)

Streptomyces coelicolor

- Actinomycete (soil bacterium)
- Produces > 2/3 of all natural antibiotics
- Sequenced 2002
- Linear chromosome
- Large for a bacterium (8.7 Mb, 7825 genes)

[Figure: chromosome map with tracks for contingency genes, essential genes, all genes and G + C content.]

Bacterial Genomes – Streptomyces coelicolor

- Comparison with other related organisms suggests the chromosome arms are novel compared with the rest of the sequence
- Secondary metabolite "factories" are associated with the ends of the chromosome arms
- These have arisen through gene duplications
- They create products aimed at knocking out other bacteria (the soil environment is highly competitive)

Bacterial Genomes – Gene organization

Transcription units are often organised as operons (25% of genes in E. coli).

Bacterial Genome Features – GC Content

Biases in GC content:
- Different bacterial species show different G/C biases
- A G-C pair has an extra H-bond compared with an A-T pair: is there a biological role? For example, thermophilic bacteria may require a higher GC content to withstand higher temperatures
- Coding regions also tend to have higher G/C biases, which can be exploited to find genes

Species             GC content
C. botulinum        26%
H. influenzae       38%
E. coli             50%
T. thermophilus     69%

[Figure: G + C content plot.]

Bacterial Genomes – Repetitive elements

Repeat sequences are rarer in prokaryotes than in eukaryotes. In E. coli, for example:

Type                                          Number in genome    Size       % of genome
IS (simple insertion sequence)                50                  <2 kb      1.5
Rhs (recombinational hotspot)                 5                   6-10 kb    0.8
REP (repetitive palindromic sequence)         581                 38 bp      0.5
Chi (cross-over hotspot)                      ~1000               8 bp       0.2
IRU (intergenic repeat unit)                  19                  126 bp     0.05

The function of some of these repeats has been identified:
- Chi sequences are implicated in homologous recombination
- REP elements are palindromes and have been implicated in supercoiling

Some of these sequences have been identified in other bacteria:
- IS elements are common
- REP elements have been found in N. meningitidis

Eukaryotic Genomes – Chromosome organization

Genomes are organised into chromosomes. The number varies between species, with little if any correlation with complexity.

Species           Chromosomes
S. cerevisiae     16
C. elegans        6
Humans            23
Crayfish          100

Complementary strands typically carry similar numbers of genes, but there are striking exceptions in Leishmania major and related trypanosomes (protozoan parasites). This is thought to be related to specialised transcriptional processes: is gene expression regulated primarily by posttranslational mechanisms?

Eukaryotic Genomes – Some examples

The first eukaryotic genome (S. cerevisiae)
- 13.4 Mb, 70% coding
- 16 chromosomes, 5570 genes (6300 originally)
- 2 kb per gene
- 275 tRNAs
- Relatively compact: small intergenic regions
- 32% of genes have a homolog to another gene within yeast (paralog)
- S. cerevisiae underwent a whole genome duplication event
- Yeast genes are less dense at the ends of chromosomes

The first multicellular genome (C. elegans)
- 100 Mb, 25% coding
- 6 chromosomes, 20000 genes
- 5 kb per gene
- 584 tRNAs
- 25% of genes are organized as operons
- Almost 60% of genes appear 'specific' to nematodes
- Essential genes are at the center of chromosomes

The human genome
- 3.2 Gb, 2% coding
- 23 chromosomes, 25000 genes (?)
- 100 kb per gene
- 497 tRNAs
- Genes appear more 'complex': more domains, more domain architectures
- More introns and hence more alternative splicing: responsible for our biological complexity?

Eukaryotic Genomes – Centromeres and telomeres

Centromeres mediate interactions between sister chromatids and the kinetochore during replication.
- In budding yeast, centromeres are 125 bp in length and contain specific sites for binding kinetochore proteins.
- In humans, the centromere is composed of hundreds of thousands of copies of a 171 bp repeat that directs heterochromatin assembly, which replaces sequence-specific binding sites.

Telomeres are found at the ends of chromosomes and are composed of simple tandem repeats that protect the integrity of the ends.
- They are dynamic: in many cell types they shrink during every round of replication. This limits the number of times the cell can divide.

Eukaryotic Genomes – Introns

The number of introns varies between species:
- Yeast: only 4% of genes have introns
- C. elegans: 'most' genes have introns
- Human: all genes have introns (except histones)

Comparisons of conserved genes across different eukaryotes reveal interesting patterns of intron gain and loss.
- ~80% of introns in humans are conserved in sea anemone
- Flies, worms and sea squirts have lost ~50-90% of their ancestral introns

[Figure legend: ancient intron conserved; ancient intron lost in flies/worms; ancient intron lost in flies, worms, fungi and sea squirt; animal intron lost in flies and worms.]

Eukaryotic Genomes – Repeats

Eukaryotes contain a high proportion of repeated sequences. These include transposons and related elements.

Transposons are elements that can move around the genome, potentially leading to:
- mutations (insertions in genes)
- an increase (or decrease) in the amount of DNA

Class I (retrotransposons) use RNA as an intermediary:
- LINEs: long interspersed elements
- SINEs: short interspersed elements
- HIV

Class II (transposons) use only DNA:
- P elements (Drosophila)

Eukaryotic Genomes – Repeats

Incidence of repeats (the organism label for each row is missing from the source):

Row    LINE/SINE    Retrovirus-type sequences    Transposon-type sequences    Total
1      0.5%         4.8%                         5.1%                         10.5%
2      0.4%         0                            5.3%                         6.5%
3      4.7%         6.4%                         3.6%                         14.9%
4      34%          8%                           3%                           45%

Retrotransposons and the C-value Paradox

- The genome of Arabidopsis thaliana contains 125 Mbp of DNA. This includes a small number of retrotransposons and about 25,000 functional genes.
- The maize (corn) genome contains 20 times more DNA (2.4 Gbp). 50% of the corn genome is made up of retrotransposons.
- Most of the 250 Gbp of DNA in the genome of the fern Psilotum nudum is presumably "junk" DNA.
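
The arithmetic behind the Arabidopsis, maize and fern comparison can be checked with a short Python sketch. The genome sizes and the 50% retrotransposon share are the values quoted above; the dictionary and the function names `fold_difference` and `retrotransposon_bp` are hypothetical, introduced only for illustration.

```python
# Genome sizes in base pairs, as quoted on the slide above.
GENOME_SIZE_BP = {
    "Arabidopsis thaliana": 125_000_000,      # 125 Mbp
    "maize": 2_400_000_000,                   # 2.4 Gbp
    "Psilotum nudum": 250_000_000_000,        # 250 Gbp
}

# Fraction of the maize genome made up of retrotransposons (slide value).
MAIZE_RETROTRANSPOSON_FRACTION = 0.5


def fold_difference(larger: str, smaller: str) -> float:
    """Return how many times bigger one genome is than another."""
    return GENOME_SIZE_BP[larger] / GENOME_SIZE_BP[smaller]


def retrotransposon_bp(organism: str, fraction: float) -> float:
    """Return the number of base pairs attributed to retrotransposons."""
    return GENOME_SIZE_BP[organism] * fraction


if __name__ == "__main__":
    arabidopsis = GENOME_SIZE_BP["Arabidopsis thaliana"]
    ratio = fold_difference("maize", "Arabidopsis thaliana")
    retro = retrotransposon_bp("maize", MAIZE_RETROTRANSPOSON_FRACTION)
    print(f"maize / Arabidopsis genome size: {ratio:.1f}x")
    print(f"maize retrotransposon DNA: {retro / 1e9:.1f} Gbp")
    print(f"retrotransposon DNA / whole Arabidopsis genome: {retro / arabidopsis:.1f}x")
```

With these figures the maize genome is 19.2 times the size of the Arabidopsis genome, which the slide rounds to 20, and the retrotransposon fraction of maize alone (1.2 Gbp) is 9.6 times the entire Arabidopsis genome. This illustrates the slide's point that repetitive DNA, rather than gene number, accounts for much of the difference in genome size.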
