SUPPLEMENTAL INFORMATION

Supplemental Methods.

Mapping and sequencing at the Broad Institute. Methods for generation of the clone path and sequence map at the Broad Institute (Whitehead/MIT Center for Genome Research) have been previously described (ref. 1).

Mapping and sequencing at Keio University. BAC clones from the 30 Mb region spanning 8q22-q24.1 were initially identified from the Keio BAC library using the 8-bit digital hybridization (DH) method (ref. 2) with 1,207 STS/EST markers mapped to Chromosome 8. To fill gaps in the BAC contig, chromosome walking was performed by selecting neighboring clones with the two-step 4-dimensional PCR screening method (ref. 3). The Keio BAC library was also used to fill several gaps in other regions of Chromosome 8.

Gene annotation. The gene catalog was produced as previously described (ref. 1). Databases of transcribed sequences, including RefSeq (release 1), the Mammalian Gene Collection (MGC) (February 3, 2003), dbEST, and GenBank mRNA (December 29, 2002), were aligned to the genome assembly using BLAT (ref. 4). GenPept protein sequences (February 3, 2003) from many species were aligned using BLASTX (ref. 5) and GeneWise (ref. 6). Gene models were then created by manual annotation based on this evidence, with the models classified according to the HAWK2 transcript type conventions (www.sanger.ac.uk/Info/workshops/hawk2); see reference 1 for details. We note that 77% of ‘known’ genes have evidence of CpG islands in the region from 2 kb upstream to 1 kb downstream of the annotated 5’ end; this is somewhat higher than previous reports in the range of 61-66% (see references 1, 7-10). Portions of the chromosome underwent independent manual annotation by the groups at Keio and Jena. Results were very similar (see section below on comparison of manual annotations of selected regions). Gene symbols were assigned by the HUGO Gene Nomenclature Committee for biologically characterized loci; a complete list of gene symbols from this paper is found in Table S8.
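
The CpG-island figure quoted in the gene annotation methods depends on a strand-aware window around the annotated 5' end (2 kb upstream to 1 kb downstream). The following minimal Python sketch shows one way such a window test could be computed. The coordinate convention (1-based, inclusive), the function names and the toy inputs are assumptions made for illustration; the source does not describe how the test was actually implemented.

```python
from typing import List, Tuple

# (start, end) on one chromosome; 1-based, inclusive, start <= end (assumed convention)
Interval = Tuple[int, int]


def promoter_window(five_prime: int, strand: str,
                    upstream: int = 2000, downstream: int = 1000) -> Interval:
    """Return the window from `upstream` bp upstream to `downstream` bp
    downstream of the annotated 5' end, taking strand into account."""
    if strand == "+":
        start, end = five_prime - upstream, five_prime + downstream
    elif strand == "-":
        # On the minus strand, "upstream" means higher coordinates.
        start, end = five_prime - downstream, five_prime + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, start), end  # clamp at the chromosome start


def has_cpg_island(window: Interval, islands: List[Interval]) -> bool:
    """True if any island overlaps the window by at least one base."""
    w_start, w_end = window
    return any(i_start <= w_end and i_end >= w_start
               for i_start, i_end in islands)


def fraction_with_island(genes: List[Tuple[int, str]],
                         islands: List[Interval]) -> float:
    """Fraction of genes, given as (5' end, strand), whose window overlaps an island."""
    if not genes:
        return 0.0
    hits = sum(has_cpg_island(promoter_window(pos, strand), islands)
               for pos, strand in genes)
    return hits / len(genes)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the Chromosome 8 annotation.
    toy_genes = [(10_000, "+"), (50_000, "-"), (90_000, "+")]
    toy_islands = [(7_900, 8_100), (60_000, 60_500)]
    print(fraction_with_island(toy_genes, toy_islands))
```

The linear scan over islands keeps the sketch short; for a whole chromosome, a sorted island list with binary search or an interval tree would be the more usual choice.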
Our annotations are available from the Vertebrate Genome Annotation database (VEGA) (http://vega.sanger.ac.uk/Homo_sapiens).

Defensin gene sequencing. Regions spanning the two exons of two defensin genes, DEFA1 and DEFB105, were amplified, subcloned and sequenced from a set of primates and dogs (see Supplemental Table S9). PCR reactions were carried out in a 10 µl volume containing 15 ng template DNA, primers (0.2 µM each), MgCl2 (2.0 mM), dNTPs (0.5 mM) and 2.5 U AmpliTaq Gold polymerase in 1X PCR buffer (Applied Biosystems, Foster City, CA). Amplification was carried out with an initial treatment at 94 °C for 12 minutes, followed by 30 cycles of 94 °C for 30 seconds, 54 °C for 30 seconds and 70 °C for 1 minute. The resultant PCR products were ligated into either pCR 2.1-TOPO or pCR II-TOPO (Invitrogen, Carlsbad, CA), as directed by the manufacturer. The cloned PCR products were sequenced using BigDye v3.1 reagents on ABI 3730xl instruments (Applied Biosystems, Foster City, CA).

Supplemental Results.

Gaps in the clone path. The Human Chromosome 8 clone path contains 6 euchromatic gaps, a gap at the 8p telomere and a gap of centromeric heterochromatin. Proceeding from 8ptel to 8qtel, the gaps were sized as described below. See Table S1 for more details.

1. The 8ptel gap is captured in the half-YAC YRM2205, which was sized at 300 kb. We were unable to grow this YAC in sufficient quantities to generate a subclone library. Sequences from cosmid clones derived from this YAC are found 220 kb from the end of the sequence contig, so we estimate the amount of missing sequence at 80 kb.
2. The gap at 7.5 Mb contains the -defensin cluster, which is known to have significant copy number polymorphism in the population. Copy numbers from 1 to 8 have been reported (ref. 11), which would correspond to a gap size of between 0 and ~560 kb, depending on haplotype. This gap is captured by two CEPH YACs. Fibre-FISH was repeatedly attempted to size this gap, without success.
A default estimate of 100 kb for this gap is used in calculating the size of the chromosome.
3. The gap at 12.1 Mb is flanked by segmental duplications, and is apparently refractory to cloning. We were unable to determine the size of this gap using conserved synteny of the human genome to mouse or dog, or by using the TNG radiation hybrid map. A default estimate of 100 kb is used in calculating the size of the chromosome.
4. The gap at 47.6 Mb contains sequence that persistently deletes from large insert clones. The size of this gap was estimated at 60 kb from the physical distance between radiation hybrid markers on the TNG map.
5. The gap at 86.2 Mb is apparently refractory to cloning. The size of this gap was estimated at 87.3 kb by taking advantage of conserved synteny of markers between the human and mouse genomes. Because this gap is flanked by segmental duplications in the human genome, this sizing may be unreliable.

Extreme genes. The longest gene on Chromosome 8 is CSMD1 (see also section on CSMD gene family, below). It contains 70 exons spanning just over 2 Mb, making it the fourth-largest human gene. The longest mature transcript is Plectin1 (PLEC1) at 14,805 bp; mutations in plectin are associated with epidermolysis bullosa with muscular dystrophy (ref. 12). The longest single exon is a coding exon of 7,015 bp in Testis Expressed Sequence 15 (TEX15). The gene with the most reported splice forms is EEF1D (eukaryotic translation elongation factor 1 delta), with 33 reported splice forms, of which two are represented by RefSeq transcripts.

CSMD gene family. The second longest gene on Chromosome 8 is CSMD3 (ref. 13) on 8q23.3, a paralog of CSMD1 on 8p23. CSMD3 contains 71 exons spanning over 1.2 Mb. The CSMD3 and CSMD1 genes are located in the largest (1 gene/5.4 Mb) and the second largest (1 gene/4.2 Mb) gene-poor regions on Chromosome 8, respectively.

Structure of the chromosome 8 pericentromeric region.
The finished sequence of Chromosome 8 includes completion of the pericentromeric region (Figure 2 in the main text). Proximal sequence contigs on both arms of Chromosome 8 reach from euchromatic arm sequences through pericentromeric satellites and into the higher order alpha satellite array. The most proximal sequence contig on 8p spans 14 Mb. On 8q, the most proximal contig spans 1.3 Mb but is ordered and oriented to the next distal contig (spanning over 40 Mb) via a BAC-end captured gap. Dotter analysis (Figure 2 in the main text) reveals that ~18 kb on each arm consists of 9 copies of a 1.9 kb repeat unit. On Chromosome 8, the higher order array, D8Z2, consists of 3 predominant higher-order repeat units (3.9, 2.5 and 1.9 kb) that occur in clusters within the array (ref. 14). Each of these 3 units is a length variant based upon a common unit and created by unequal crossover. The smallest of these (1.9 kb) was predicted to occupy ~41% of the 2 ± 0.2 Mb array. Alpha satellite has a directional quality, with monomers and higher-order repeat units arranged in a tandem, head-to-tail orientation. The p and q arm higher order sequences contained within the Chromosome 8 assembly are not only highly identical to each other (96-98%) but also occur in the same orientation, indicating that these sequences sample the distal edges of the continuous array. The pericentromeric satellite regions flanking the higher order alpha satellite stretch 476 kb on 8p (UCSC Genome Browser; chr8:43,481,669-43,958,293; May 2004 assembly) and >1 Mb on 8q. The 8p satellites include both monomeric alpha satellite and gamma satellite. The proximal 622 kb of 8q (chr8:46,956,700-47,579,000; May 2004 assembly) consists of large stretches of monomeric alpha satellite and gamma satellite, while the distal 461 kb region (chr8:47,579,000-48,040,000; May 2004 assembly) is highly LINE dense and contains interspersed beta satellite blocks and segmental duplications.
Pericentromeric satellites differ from array sequences in that they lack higher order structure, are quite divergent and contain interspersed elements. Despite the highly repetitive nature of satellite regions, these features provide a great deal of sequence variation, which facilitates sequence assembly.

Supplemental discussion of annotation rules and methods.

We emphasized aligned mRNA and protein sequences when analyzing the gene content of the finished 142.3 Mb euchromatic portion of human Chromosome 8. In accord with HAWK2 (www.sanger.ac.uk/Info/workshops/hawk2) conventions, gene models were grouped into the following categories:

1. Known - identical to a human mRNA sequence, usually from RefSeq or MGC.
2. Novel_CDS - identical to spliced human ESTs and coding for a protein with non-identical homology to a protein in the public databases.
3. Novel_Transcript - identical to a spliced human EST with canonical splice junctions, no homology to known mRNAs or proteins, at least one exon without repeat content and the longest ATG to stop codon open reading frame (ORF) spans more than one exon. This category also includes genes predicted by ab initio tools, including Genscan (ref. 15) and FGENESH (Softberry Inc., Mount Kisco, NY), which were annotated when one or more exons overlapped a mouse-human BLASTN alignment (evolutionarily conserved regions, or ecores; ref. 16). The comparative version of FGENESH was not used.
4. Putative - identical to a spliced human EST with canonical splice junctions, no homology to known mRNAs or proteins, at least one exon without repeat content and the longest ATG to stop codon open reading frame (ORF) is entirely contained in one exon.
5. Pseudogene - gene models that contain a disrupted ORF containing more than 50% of the ORF of a known protein at greater than 50% identity. This category is further subdivided into single exon processed pseudogenes and multi-exon unprocessed pseudogenes.
6. Putative novel transcript - gene models containing an ORF with less than 50% of the ORF of a known protein and greater than 50% identity.

Annotation of Novel and Putative loci. The few novel and putative loci that we annotated were required to align over at least 30% of their length to either the mouse or dog genome with at least 70% identity. Some lineage-specific and rarely detected genes will be lost using this method, as will gene models with a very high ratio of UTR to coding sequence.

Comparison of manual annotations of selected regions of Human Chromosome 8 from multiple sites. The full sequence of Human Chromosome 8 underwent manual annotation at the Broad Institute. In addition, selected regions were annotated at Keio University and the Institute of Molecular Biology in Jena. To compare the efficacy of these different annotation systems, we used BLAT to align the Keio and Jena gene models against the full sequence of the chromosome. Greater than 99% of the gene models from Keio and Jena aligned at greater than 95% identity and 50% query coverage. There were 108 Keio and 190 Jena transcripts that did not overlap the Broad annotated genes. These fall into the following two categories.

Category 1: Loci with fewer than three spliced ESTs with no homology to any proteins, or spliced EST exons with no match to any conserved sequence from syntenic mouse or dog regions. The majority of these transcripts were identified by Broad annotation, but fell below the threshold for calling a gene. There were 73 Keio transcripts and 146 Jena transcripts in this category.

Category 2: Loci with no expressed sequence as evidence, or with short single exon ESTs or repeat elements. We found 35 Keio transcripts and 44 Jena transcripts in this category. These annotations probably represent de novo gene calls, and would be excluded from Broad annotation in the absence of other evidence.

Alternative explanations for apparent rapid divergence on 8p22-23.
Genome-wide analysis has identified the distal portion of 8p as a region with unusually high human-chimpanzee divergence and diversity, the highest found on any autosome. There are at least 3 possible explanations of this high observed divergence and diversity: artifacts of assembly errors in duplicated regions, long coalescence time and high mutation rates. We conclude that high local mutation rate is the most likely explanation for the high rates of diversity and divergence in the distal 8p region. Other possible explanations are balancing selection, long coalescence time since the last common ancestor in the region or an artifact of unresolved duplications in the map. These are discussed below.

Artifacts: Undetected duplications can lead to misassembled sequences and artificially high divergence/diversity levels. 8p contains several highly duplicated regions that are unlikely to be assembled well: subtelomeric duplications at 0-200 kb, a smaller duplicated region at ~2.1 Mb, a large low copy repeat (LCR) cluster spanning ~6.1-8.1 Mb (ref. 17) and a smaller LCR cluster at ~11.8-12.5 Mb. However, these clusters cannot by themselves explain the high rates. The estimated divergence and diversity in the 1 Mb non-overlapping windows spanning each of these clusters is ~2%, which is not higher than nearby 1 Mb windows containing no known duplicated sequence. This is likely in part because sequences in the LCRs are poorly represented in the aligned genome sequence, as well as in the aligned SNP reads. The 2.9% divergence peak spans two 1 Mb windows roughly corresponding to 8p23.2, which do not contain any of the repeats and appear to be relatively unique in the genome. Rare duplications of 8p23.2 have been described without apparent deleterious consequences (ref. 17), but they would have to be present in both the human and chimpanzee libraries along with more proximal duplications to explain the data.
Low-quality chimpanzee sequence cannot explain this, since the human diversity data are independent of these data. Also, there are a number of other chromosomal regions with LCR clusters that do not show equivalently elevated rates, suggesting that a potential artifact would not be due to a systematic problem in dealing with duplicated regions.

Long coalescence time and balancing selection: Divergence and diversity depend in part on the time since the last common ancestor (TLCA) between the genomes or haplotypes compared. Balancing selection (as seen at HLA loci) or suppressed recombination (as seen at BRCA1) can both increase the TLCA. The immunological importance of the defensin cluster situated between 6 and 8 Mb suggests the possibility of strong balancing selection in this region, but the divergence/diversity peak overlaps a gene desert (approximately 2.5 Mb proximal) with only one large gene: CSMD1, a putative tumor suppressor expressed mainly in the brain. Due to the high estimated recombination rate in the region (2.5-3.2 cM/Mb), strong hitchhiking effects seem unlikely to extend as far as 5 Mb on both sides of the defensin cluster.

Supplemental references.

1. Nusbaum, C., et al. (2005) DNA sequence and analysis of human Chromosome 18. Nature 437:551-555.
2. Asakawa, S. and Shimizu, N. (1998) High-fidelity Digital Hybridization Screening. Genomics 49:209-217.
3. Asakawa, S., et al. (1997) Human BAC Library: Construction and Rapid Screening. Gene 191:69-79.
4. Kent, W.J. (2002) BLAT--the BLAST-like alignment tool. Genome Res. 12:656-664.
5. Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
6. Birney, E., Clamp, M. and Durbin, R. (2004) GeneWise and Genomewise. Genome Res. 14:988-995.
7. International Human Genome Sequencing Consortium. (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931-945.
8. Grimwood, J., et al. (2004) The DNA sequence and biology of human Chromosome 19. Nature 428:529-535.
9. Deloukas, P., et al. (2004) The DNA sequence and comparative analysis of human Chromosome 10. Nature 429:375-381.
10. Martin, J., et al. (2004) The sequence and analysis of duplication-rich human Chromosome 16. Nature 432:988-994.
11. Taudien, S., et al. (2004) Polymorphic segmental duplications at 8p23.1 challenge the determination of individual defensin gene repertoires and the assembly of a contiguous human reference sequence. BMC Genomics 5:92.
12. McLean, W.H., et al. (1996) Loss of plectin causes epidermolysis bullosa with muscular dystrophy: cDNA cloning and genomic organization. Genes Dev. 10:1724-1735.
13. Shimizu, A., et al. (2003) A novel giant gene CSMD3 encoding a protein with CUB and sushi multiple domains: a candidate gene for benign adult familial myoclonic epilepsy on human chromosome 8q23.3-q24.1. Biochem. Biophys. Res. Commun. 309:143-154.
14. Ge, Y., Wagner, M.J., Siciliano, M. and Wells, D.E. (1992) Sequence, higher order repeat structure, and long-range organization of alpha satellite DNA specific to human chromosome 8. Genomics 13:585-593.
15. Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94.
16. Roest Crollius, H., et al. (2000) Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 25:235-238.
17. Harada, N., et al. (2002) Duplication of 8p23.2: a benign cytogenetic variant? Am. J. Med. Genet. 111:285-288.