Download ppt

The Human Genome The International Human Genome Consortium Initial sequencing and analysis of the human genome Nature, 409, February 15, 860-921 (2001) Venter et al. (Celera) The Sequence of the Human Genome Science, 291, February 16, 1304-1351 (2001) HC LEE January 8, 2002 Computational Biology Lab National Central University 1984 to 1986 – first proposed at US DOE meetings 1988 – endorsed by US National Research Council - creation of genetic, physical and sequence maps of the human genome - parallel efforts in key model organisms: bacteria, yeast, worms, flies and mice; - develop of supporting technology - ethical, legal and social issues (ELSI) 1990 – Human Genome Project (NHGRI) Later – UK, France, Japan, Germany, China Time-line large scale genomic analysis 1995 – First complete bacterial genomes Completed sequences 2002 – About 35 bacterial genomes; 0.5-5 Mb; hundreds to 2000 genes 1996 April – Yeast (Saccharomyces cerevisiae) 12 Mb, 5,500 genes 1998 Dec. -Worm (Caenorhabditis elegans) 97 Mb, 19,000 genes 2000 March - Fly (Drosophila melanogaster) 137 Mb, 13,500 genes 2000 Dec. - Mustard (Arabidopsis thaliana) 125 Mb, 25,498 genes 2000 June – Human (Homo sapiens) 1st rough draft 2001 Feb 15/16 – Human, “working draft” 3000 Mb, 35,000~40,000 genes Nature, 409, February 15, 860-921 (2001) IHGCS paper Science, 291, February 16, 1304-1351 (2001) Celera paper Sequencing BAC: Bacterial Artificial Chromosome clone Contig: joined overlapping collection of sequences or clones. C-value paradox: Genome size does C-value paradox not correlate well with organismal complexity. Human Homo sapiens 3000 Mb Yeast S. cerevisiae 12 Mb Amoeba dubia 600,000 Mb Genomes can contain a large quantity of repetitive sequence, far in excess of that devoted to protein-coding genes Global properties • Pericentromeric and subtelomeric regions of chromosomes filled with large recent transposable elements • Marked decline in the overall activity of transposable elements or transposons • Male mutation rate about twice female – most mutation occurs in males • Recombination rates much higher in distal regions of chromosomes and on shorter chromosome arms – > one crossover per chromosome arm in each meiosis Important features of Human proteome • 30,000–40,000 protein-coding genes • Proteome (full set of proteins) more complex than those of invertebrates. – pre-existing components arranged into a richer architectures. • Hundreds of genes seem to come from horizontal transfer from bacteria • Dozens of genes seem to come from transposable elements. Human proteome is complex • Gene codes proteins (also RNAs) • Number of genes does not reflect complexity of organism Org’nism no. genes no. proteins Worm Fly Human 20,000 13,500 ~40,000 ~20,000 >>20,000 >>100,000 The Human Genome Human genome content Total length 3000 Mb ~ 40,000 genes (coding seq) Gene sequences < 5% Exons ~ 1.5% (coding) Introns ~ 3.5% (noncoding) Intergenic regions (junk) > 95% Repeats > 50% Gene codes proteins (also RNAs) (transcription & translation) Procaryotes (single cell): one gene, one protein Eucaryote (multicell): gene = intron + exon; one gene, many proteins Fig 35a Size distributions of exons in Human, Worm and Fly. Human have shorter exons. Fig 35c Size distributions of intons in Human, Worm and Fly. Human have longer introns. Fig 35b Gene recognition • Coding region and non-coding region have different sequence profiles – coding region is “protected” from mutation and is less random • Gene recognition by sequence alignment • Gene prediction by Hidden Markov Model trained by set of known genes • Many genes are homologs – similar in vastly different organisms Gene recog’n difficult for Human • Easy for procaryotes (single cell) – one gene, one protein • More difficult for eukaryotes (multicell) – one gene, many proteins • Very difficult for Human – short exons separated by non-coding long introns Genes predicted in Human Genome Int’l Consortium Celera known genes 14,882 novel genes 16,896 17,764 21,350 Total 39,114 31,778 Two predictions disagree John B. Hogenesch, et al Cell, Vol. 106, 413–415 August 24, 2001 “…predicted transcripts collectively contain partial matches to nearly all know genes, but the novel genes predicted by both groups are largely non-overlapping Global properties with evolutionary implications • Long-range variation in GC content not random • CpG islands protected by genes • Genetic and physical distance nonlinear • > 50% genome composed of repeats GC-rich and GC-poor regions have different biological properties, such as gene density, composition of repeat sequences, correspondence with cytogenetic bands and recombination rate. Standard deviation 15 times wider than random distrib’n GC content is correlated with coding regions GC content in introns (exons) vs introns (exons) length. Fig 14 CpG islands CpG islands and genes are correlated. CpG dinucleotides are methylated; methyl-CpG steadily mutate to TpG. Hence CpG is greatly under-represented in human DNA. Except in CpG islands near genes. Recombination rate vs Physical position from centromere of genes. Rate higher in distal regions. Fig 15 recomb rate (distal) Recombination rate higher on shorter chromosome arms Fig 16 recomb rate (short arm) The genome mutates and copies itself • 50%, probably much more, of genome composed of repeats – Many traces of repeats obliterated by mutation – Lower organisms may have longer genomes • Five types of repeats – transposable elements; processed pseudogenes; simple k-mer repeats; segmental duplications (10300 kb); (large) blocks of tandemly repeated sequences Interspersed repeats: fixed transposable elements copied to non-homologous regions. Fig 17 transposables Total 45% Classes of transposable elements. LINE, long interspersed element. SINE short interspersed element. Genes are sometimes protected from repeats Fig 21 Two regions of about 1 Mb on chromosomes 2 and 22. Red bars, interspersed repeats; blue bars, exons of known genes. Note the deficit of repeats in the HoxD cluster, which contains a collection of genes with complex, interrelated regulation. Simple sequence (k-mers) repeats: SSR Tab 14 SSR content Fig 32b Mosaic patterns of duplications. For each region top horizon line: segment of sequence (100–500 kb) with interchromosomal (red) and intrachromosomal (blue) duplications displayed. Lower lines with a distinct colours: separate sequence duplication. y axis: per cent nucleotide identity. b. An ancestral region from Xq28 that has contributed various 'genic' segments to pericentromeric regions. Fig 30 Fig 32a An active pericentromeric region on chromosome 21. Fig 32c c. A pericentromeric region from chromosome 11. Fig 32d d. A subtelomeric region from chromosome 7p. Fig 33 Finished HG has 1.5% interchromosomal 2% intrachromosomal segmental duplications. The duplications are 10–50 kb long and highly homologous. Structure in similarity may indicate that interchromosomal duplications occurred in a punctuated manner. Human Proteome • Number of human genes (~40,000) only twice that of worm or fly • Many more transcripts (combination of exons in one gene) • Many more proteins, perhaps >> 100,000 • Most proteins are still homologs of non-human proteins • Homologs (from a common ancestor gene) – orthologs – derived through speciation – paralogs: derived through duplication Completed eukaryotic proteomes Human Fly Worm Identified genes 32,000 13,338 18,266 Annotated domain families 1,262 1,035 1,014 Distinct domain architectures 1,695 1,036 1,018 Yeast Mustard weed 6,144 25,706 861 1,010 310 - Functional categories of eukaryote proteomes Distribution of homologues of predicted human proteins Fig 38 distribution of homologs Simplified cladogram (relationship tree) of the 'many-to-many' relationships of classical nuclear receptors. Triangles indicate expansion within one lineage; bars represent single members. Numbers in parentheses indicate the number of paralogues in each group. Fig 42 domain accretion Domain accretion in chromatin proteins in various lineages before the animal divergence, in the apparent coelomate lineage and the vertebrate lineage are shown using schematic representations of domain architectures (not to scale). Asterisks, mobile domains that have participated in theaccretion. Species in which a domain architecture has been identified are indicated (Y, yeast; W, worm; F, fly; V, vertebrate). Fig 45 domain expansion Lineage-specific expansions of domains and architectures of transcription factors Conserved segments in human and mouse genome Colour code: Mouse genome Applications to medicine and biology • Disease genes – human genomic sequence in public databases allows rapid identification of disease genes in silico • Drug targets – pharmaceutical industry has depended upon a limited set of drug targets to develop new therapies – now can find new target in silico • Basic biology – basic physiology, cell biology… The next steps • Finishing the human sequence • Developing the IGI (integrated gene index) and IPI (protein) • Large-scale identification of regulatory regions • Sequencing of additional large genomes – mouse, super-rice, pig, fish… • Completing the catalogue of human variation – Single nucleotide polymorphism – nasal and throat cancer… • From sequence to function

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt