SUPPLEMENTAL INFORMATION
Supplemental Methods.
Mapping and sequencing at the Broad Institute. Methods for generation of the clone
path and sequence map at the Broad Institute (Whitehead/MIT Center for Genome
Research) have been previously described [1].
Mapping and sequencing at Keio University. BAC clones from the 30 Mb region
spanning 8q22-q24.1 were initially identified from the Keio BAC library using the 8-bit
digital hybridization (DH) method [2] with 1,207 STS/EST markers mapped to
Chromosome 8. To fill gaps in the BAC contig, chromosome walking was performed by
selecting neighboring clones with the two-step 4-dimensional PCR screening method [3].
The Keio BAC library was also used to fill several gaps in other regions of Chromosome
8.
Gene annotation. The gene catalog was produced as previously described [1]. Databases of
transcribed sequences, including RefSeq (release 1), Mammalian Gene Collection (MGC)
(February 3, 2003), dbEST, and GenBank mRNA (December 29, 2002), were aligned to
the genome assembly using BLAT [4]. GenPept protein sequences (February 3, 2003) from
many species were aligned using BLASTX [5] and GeneWise [6]. Gene models were then
created by manual annotation based on this transcriptional evidence, with the models
classified according to the HAWK2 transcript type conventions
(www.sanger.ac.uk/Info/workshops/hawk2); see reference 1 for details. We note that
77% of ‘known’ genes have evidence of CpG islands in the region from 2 kb upstream to
1 kb downstream of the annotated 5’-end; this is somewhat higher than previous reports
in the range of 61-66% (see references 1, 7-10). Portions of the chromosome underwent
independent manual annotation by the groups at Keio and Jena. Results were very similar
(see section below on comparison of manual annotations of selected regions). Gene
symbols were assigned by the HUGO Gene Nomenclature Committee for biologically
characterized loci; a complete list of gene symbols from this paper is found in Table S8.
Our annotations are available from the Vertebrate Genome Annotation database (VEGA)
(http://vega.sanger.ac.uk/Homo_sapiens).
Defensin gene sequencing. Regions spanning the two exons of two defensin genes,
DEFA1 and DEFB105, were amplified, subcloned and sequenced from a set of primates
and dogs (see Supplemental Table S9). PCR reactions were carried out in a 10 µl volume
containing 15 ng template DNA, primers (0.2 µM each), MgCl2 (2.0 mM), dNTPs
(0.5 mM) and 2.5 U AmpliTaq Gold polymerase in 1X PCR buffer (Applied Biosystems,
Foster City, CA). The amplification was carried out with an initial treatment at 94 °C for
12 minutes, followed by 30 cycles of 94 °C for 30 seconds, 54 °C for 30 seconds and 70 °C
for 1 minute. The resultant PCR products were ligated into either pCR 2.1-TOPO or pCR
II-TOPO (Invitrogen, Carlsbad, CA), as directed by the manufacturer. The cloned PCR
products were sequenced using BigDye v3.1 reagents on ABI 3730xl instruments
(Applied Biosystems, Foster City, CA).
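The reaction composition and thermal profile above reduce to simple arithmetic. The following Python sketch works out per-reaction amounts and the minimum programmed cycling time. All names are illustrative, ramp times between temperatures are ignored, and the dNTP figure is taken at face value because the text does not say whether 0.5 mM is per nucleotide or total.

```python
# Sketch: per-reaction amounts and minimum hold time for the PCR described above.
# Names are illustrative; ramp times are ignored, and no final extension step
# is included because none is described in the text.

REACTION_VOLUME_UL = 10.0

def amount_pmol(conc_um: float, volume_ul: float) -> float:
    """Amount in pmol for a concentration in µM and a volume in µl (µM x µl = pmol)."""
    return conc_um * volume_ul

primer_pmol = amount_pmol(0.2, REACTION_VOLUME_UL)         # each primer
mgcl2_pmol = amount_pmol(2.0 * 1000, REACTION_VOLUME_UL)   # 2.0 mM = 2000 µM
dntp_pmol = amount_pmol(0.5 * 1000, REACTION_VOLUME_UL)    # 0.5 mM = 500 µM (as stated)

# Thermal profile as (temperature in °C, hold time in seconds)
initial_step = (94, 12 * 60)
cycle = [(94, 30), (54, 30), (70, 60)]
n_cycles = 30

hold_seconds = initial_step[1] + n_cycles * sum(seconds for _, seconds in cycle)
print(f"primer: {primer_pmol} pmol each, MgCl2: {mgcl2_pmol / 1000} nmol, "
      f"dNTPs: {dntp_pmol / 1000} nmol, programmed time: {hold_seconds / 60} min")
```

Under these assumptions the program holds for 72 minutes in total; actual run time on an instrument would be longer because of temperature ramping.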
Supplemental Results.
Gaps in the clone path. The Human Chromosome 8 clone path contains 6 euchromatic
gaps, a gap at the 8p telomere and a gap of centromeric heterochromatin. Proceeding
from 8ptel to 8qtel, the gaps were sized as described below. See Table S1 for more
details.
1. The 8ptel gap is captured in the half-YAC YRM2205, which was sized at 300 kb.
We were unable to grow this YAC in sufficient quantities to generate a subclone
library. Sequences from cosmid clones derived from this YAC are found 220 kb
from the end of the sequence contig, so we estimate the amount of missing
sequence at 80 kb.
2. The gap at 7.5 Mb contains the β-defensin cluster, which is known to have
significant copy number polymorphism in the population. Copy numbers from 1-8
have been reported [11], which would correspond to a gap size of between 0 and
~560 kb, depending on haplotype. This gap is captured by two CEPH YACs.
Fibre-FISH was repeatedly attempted to size this gap, without success. A default
estimate of 100 kb for this gap is used in calculating the size of the chromosome.
3. The gap at 12.1 Mb is flanked by segmental duplications, and is apparently
refractory to cloning. We were unable to determine the size of this gap using
conserved synteny of the human genome to mouse or dog, or by using the TNG
radiation hybrid map. A default estimate of 100 kb is used in calculating the size
of the chromosome.
4. The gap at 47.6 Mb contains sequence that persistently deletes from large insert
clones. The size of this gap was estimated at 60 kb by estimating the
physical distance between radiation hybrid markers on the TNG map.
5. The gap at 86.2 Mb is apparently refractory to cloning. The size of this gap was
estimated at 87.3 kb by taking advantage of conserved synteny of markers between
the human and mouse genomes. Because this gap is flanked by segmental
duplications in the human genome, this sizing may be unreliable.
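The sizing of the first two gaps in the list above follows from short calculations, sketched here in Python. The function and variable names are illustrative. The linear relationship between defensin copy number and gap size is an assumption made for this sketch; the text only gives the endpoints (copy numbers 1-8, gap size 0 to ~560 kb).

```python
# Sketch of the gap-size arithmetic for gaps 1 and 2; names are illustrative.

def missing_sequence_kb(clone_size_kb: float, covered_kb: float) -> float:
    """Uncaptured sequence = clone size minus the part already placed in the contig."""
    return clone_size_kb - covered_kb

# Gap 1: the half-YAC was sized at 300 kb, and cosmid-derived sequence is
# found 220 kb from the end of the sequence contig.
gap1_kb = missing_sequence_kb(300, 220)

# Gap 2: copy numbers of 1-8 are said to correspond to 0 to ~560 kb.
# ASSUMPTION: gap size grows linearly with each copy beyond the first,
# which implies roughly 80 kb per additional copy (560 / 7).
per_copy_kb = 560 / (8 - 1)
gap2_by_copy_number = {n: (n - 1) * per_copy_kb for n in range(1, 9)}

print(gap1_kb, per_copy_kb, gap2_by_copy_number)
```

Under the linear assumption, the default estimate of 100 kb used for gap 2 falls between the values implied for two and three copies.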
Extreme genes. The longest gene on Chromosome 8 is CSMD1 (see also section on
CSMD gene family, below). It contains 70 exons spanning just over 2 Mb, making it the
fourth-largest human gene. The longest mature transcript is Plectin1 (PLEC1) at 14,805
bp; mutations in plectin are associated with epidermolysis bullosa with muscular
dystrophy [12]. The longest single exon is a coding exon of 7,015 bp in Testis Expressed
Sequence 15 (TEX15). The gene with the most reported splice forms is EEF1D
(eukaryotic translation elongation factor 1 delta), with 33 reported splice forms, of which
two are represented by RefSeq transcripts.
CSMD gene family. The second longest gene on chromosome 8 is CSMD3 [13] on 8q23.3,
a paralog of CSMD1 on 8p23. CSMD3 contains 71 exons spanning over 1.2 Mb.
The CSMD3 and CSMD1 genes are located in the largest (1 gene/5.4 Mb) and the
second largest (1 gene/4.2 Mb) gene-poor regions on chromosome 8, respectively.
Structure of the chromosome 8 pericentromeric region. The finished sequence of
Chromosome 8 includes completion of the pericentromeric region (Figure 2 in main text).
Proximal sequence contigs on both arms of Chromosome 8 reach from euchromatic arm
sequences through pericentromeric satellites and into the higher order alpha satellite
array. The most proximal sequence contig on 8p spans 14 Mb. On 8q, the most proximal
contig spans 1.3 Mb but is ordered and oriented to the next distal contig (spanning over
40 Mb) via a BAC-end captured gap. Dotter analysis (Figure 2 in the main text) reveals
that ~18 kb on each arm is comprised of 9 copies of a 1.9 kb repeat unit. On chromosome
8, the higher order array, D8Z2, is comprised of 3 predominant higher-order repeat units
(3.9, 2.5 and 1.9 kb) that occur in clusters within the array [14]. Each of these 3 units is a
length variant based upon a common unit and created by unequal crossover. The smallest
of these (1.9 kb) was predicted to occupy ~41% of the 2 +/- 0.2 Mb array. Alpha satellite
has a directional quality with monomers and higher-order repeat units arranged in a
tandem, head-to-tail orientation. The p and q arm higher order sequences contained
within the Chromosome 8 assembly are not only highly identical to each other (96-98%)
but also occur in the same orientation, indicating that these sequences sample the distal edges
of the continuous array.
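The repeat-unit figures in this paragraph can be cross-checked with a few lines of arithmetic. This Python sketch uses illustrative names and treats 1 Mb as 1,000 kb. The implied copy counts are derived here for illustration and are not stated in the source.

```python
# Sketch: arithmetic behind the repeat-unit figures quoted above.
# Variable names are illustrative and not taken from the source.

REPEAT_UNIT_KB = 1.9      # smallest higher-order repeat unit
COPIES_PER_ARM = 9        # copies seen at each assembled edge of the array

# 9 x 1.9 kb is about 17 kb, in line with the "~18 kb" quoted per arm.
edge_span_kb = REPEAT_UNIT_KB * COPIES_PER_ARM

ARRAY_MB = 2.0            # estimated size of the D8Z2 array
ARRAY_ERR_MB = 0.2        # quoted uncertainty
FRACTION_1_9_KB = 0.41    # predicted share of the array made of 1.9 kb units

def unit_copies(array_mb: float) -> float:
    """Approximate number of 1.9 kb units in an array of the given size."""
    return array_mb * 1000 * FRACTION_1_9_KB / REPEAT_UNIT_KB

low, mid, high = (unit_copies(ARRAY_MB + delta)
                  for delta in (-ARRAY_ERR_MB, 0.0, ARRAY_ERR_MB))
print(round(edge_span_kb, 1), round(low), round(mid), round(high))
```

On these figures the 1.9 kb unit would occupy roughly 0.8 Mb of the array, or on the order of 400 to 500 copies, of which only 9 per arm are sampled in the assembly.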
The pericentromeric satellite regions flanking the higher order alpha satellite
stretch 476 kb on 8p (UCSC Genome Browser; chr8:43,481,669-43,958,293; May 2004
assembly) and >1 Mb on 8q. 8p satellites include both monomeric alpha satellite and
gamma satellite. The proximal 622 kb of 8q (chr8:46,956,700-47,579,000; May 2004
assembly) is comprised of large stretches of monomeric alpha satellite and gamma
satellite while the distal 461 kb region (chr8:47,579,000-48,040,000; May 2004
assembly) is highly LINE dense and contains interspersed beta satellite blocks and
segmental duplications. Pericentromeric satellites differ from array sequences in that
they lack higher order structure, are quite divergent and contain interspersed elements.
Despite the highly repetitive nature of satellite regions, these features provide a great
deal of sequence variation, which facilitates sequence assembly.
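The region sizes quoted above can be recomputed directly from the browser-style coordinates. The Python sketch below parses a `chrN:start-end` string and returns its length. The helper name is hypothetical, and treating both endpoints as inclusive is an assumption of this sketch.

```python
import re

def span_bp(region: str) -> int:
    """Length in bp of a 'chrN:start-end' region.

    ASSUMPTION: both coordinates are inclusive (1-based display convention).
    """
    match = re.fullmatch(r"(chr\w+):([\d,]+)-([\d,]+)", region)
    if match is None:
        raise ValueError(f"unrecognized region string: {region!r}")
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if end < start:
        raise ValueError("end coordinate precedes start coordinate")
    return end - start + 1

regions = {
    "8p pericentromeric satellites": "chr8:43,481,669-43,958,293",
    "8q proximal satellites": "chr8:46,956,700-47,579,000",
    "8q distal LINE-dense region": "chr8:47,579,000-48,040,000",
}
for label, region in regions.items():
    print(f"{label}: {span_bp(region) // 1000} kb")
```

The three spans come out at 476, 622 and 461 kb, matching the sizes given in the text.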
Supplemental discussion of annotation rules and methods
We emphasized aligned mRNA and protein sequences when analyzing the gene content
of the finished 142.3 Mb euchromatic portion of human Chromosome 8. In accord with the
HAWK2 (www.sanger.ac.uk/Info/workshops/hawk2) conventions, gene models were
grouped into the following categories:
1. Known - identical to a human mRNA sequence, usually from RefSeq or MGC.
2. Novel_CDS - identical to spliced human ESTs and coding for a protein with
non-identical homology to a protein in the public databases.
3. Novel_Transcript - identical to a spliced human EST with canonical splice junctions,
no homology to known mRNAs or proteins, at least one exon without repeat content and
the longest ATG to stop codon open reading frame (ORF) spans more than one exon.
This category also includes genes predicted by ab initio tools, including Genscan [15] and
FGENESH (Softberry Inc., Mount Kisco, NY), which were annotated when one or more exons
overlapped a mouse-human BLASTN alignment (evolutionarily conserved regions, or
ecores [16]). The comparative version of FGENESH was not used.
4. Putative - identical to a spliced human EST with canonical splice junctions, no
homology to known mRNAs or proteins, at least one exon without repeat content and the
longest ATG to stop codon open reading frame (ORF) is entirely contained in one exon.
5. Pseudogene - gene models that contain a disrupted ORF containing more than 50% of
the ORF of a known protein at greater than 50% identity. This category is further
subdivided into single exon processed pseudogenes and multi-exon unprocessed
pseudogenes.
6. Putative novel transcript - gene models containing an ORF with less than 50% of the
ORF of a known protein and greater than 50% identity.
Annotation of Novel and Putative loci. The few novel and putative loci that we annotated
were required to align over at least 30% of their length to either the mouse or dog
genome with at least 70% identity. Some lineage-specific and rarely detected genes will
be lost using this method, as will gene models with a very high ratio of UTR to coding
sequence.
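The conservation requirement for novel and putative loci amounts to a simple threshold test. The Python sketch below is one way to express it. The `Alignment` record and function name are hypothetical, and the source does not describe how the alignments themselves were produced or summarized.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Alignment:
    """Hypothetical summary of one gene-model-to-genome alignment."""
    species: str             # "mouse" or "dog"
    aligned_fraction: float  # fraction of the model's length that aligns (0-1)
    identity: float          # fraction identity over the aligned part (0-1)

def passes_conservation_filter(alignments: Iterable[Alignment],
                               min_fraction: float = 0.30,
                               min_identity: float = 0.70) -> bool:
    """True if any mouse or dog alignment meets both thresholds from the text."""
    return any(
        a.species in {"mouse", "dog"}
        and a.aligned_fraction >= min_fraction
        and a.identity >= min_identity
        for a in alignments
    )

# A model aligning over 45% of its length to dog at 82% identity is kept;
# one with mostly UTR that aligns over only 12% of its length is dropped.
kept = passes_conservation_filter([Alignment("dog", 0.45, 0.82)])
dropped = passes_conservation_filter([Alignment("mouse", 0.12, 0.90)])
print(kept, dropped)
```

The second example illustrates the caveat in the text: a model with a very high ratio of UTR to coding sequence can fail the length threshold even when its coding portion is well conserved.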
Comparison of manual annotations of selected regions of Human Chromosome 8
from multiple sites. The full sequence of Human Chromosome 8 underwent manual
annotation at the Broad Institute. In addition, selected regions were annotated at Keio
University and the Institute of Molecular Biology in Jena. In order to compare the
efficacy of these different annotation systems, we used BLAT to align the Keio and Jena
gene models against the full sequence of the chromosome. Greater than 99% of the gene models from
Keio and Jena aligned at greater than 95% identity and 50% query coverage. There were
108 Keio and 190 Jena transcripts that did not overlap the Broad annotated genes. These
fall into the following two categories.
Category 1:
Loci with fewer than three spliced ESTs with no homology to any proteins or spliced EST
exons with no match to any conserved sequence from syntenic mouse or dog regions.
The majority of these transcripts were identified by Broad annotation, but fell below
the threshold for calling a gene. There were 73 Keio transcripts and 146 Jena transcripts
under this category.
Category 2:
Loci with no expressed sequence as evidence or with short single exon ESTs or repeat
elements. We found 35 Keio transcripts and 44 Jena transcripts in this category. These
annotations probably represent de novo gene calls, and would be excluded from Broad
annotation in the absence of other evidence.
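The counts reported for the two categories can be checked against the totals of non-overlapping transcripts. This Python sketch, with illustrative names, confirms that the categories partition each site's total and reports the share in each.

```python
# Sketch: consistency check on the transcript counts reported above.
# Dictionary and function names are illustrative.

unmatched = {"Keio": 108, "Jena": 190}    # transcripts not overlapping Broad genes
category_1 = {"Keio": 73, "Jena": 146}    # weak EST or conservation support
category_2 = {"Keio": 35, "Jena": 44}     # no expressed-sequence evidence

def category_shares(site: str) -> tuple[float, float]:
    """Fractions of a site's unmatched transcripts falling in categories 1 and 2."""
    total = unmatched[site]
    if category_1[site] + category_2[site] != total:
        raise ValueError(f"category counts for {site} do not sum to {total}")
    return category_1[site] / total, category_2[site] / total

for site in unmatched:
    share_1, share_2 = category_shares(site)
    print(f"{site}: category 1 {share_1:.1%}, category 2 {share_2:.1%}")
```

For both sites the two categories account for every unmatched transcript, with category 1 making up roughly two thirds to three quarters of them.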
Alternative explanations for apparent rapid divergence on 8p22-23. Genome-wide
analysis has identified the distal portion of 8p as a region with unusually high
human-chimpanzee divergence and diversity, the highest found on any autosome. There are at
least three possible explanations for this high observed divergence and diversity: artifacts of
assembly errors in duplicated regions, long coalescence time (possibly maintained by
balancing selection) and high mutation rates. We conclude that a high local mutation rate is
the most likely explanation for the high rates of diversity and divergence in the distal 8p
region. The alternative explanations (an artifact of unresolved duplications in the map, and
long coalescence time since the last common ancestor in the region, including balancing
selection) are discussed below.
Artifacts: Undetected duplications can lead to misassembled sequences and artificially
high divergence/diversity levels. 8p contains several highly duplicated regions that are
unlikely to be well assembled: subtelomeric duplications at 0-200 kb, a smaller
duplicated region at ~2.1 Mb, a large low copy repeat (LCR) cluster spanning ~6.1-8.1
Mb [17] and a smaller LCR cluster at ~ 121.8-12.5 Mb. However, these clusters cannot by
themselves explain the high rates. The estimated divergence and diversity in the 1 Mb
non-overlapping windows spanning each of these clusters is ~ 2%, which is not higher
than nearby 1 Mb windows containing no known duplicated sequence. This is likely in
part because sequences in the LCRs are poorly represented in the aligned genome
sequence, as well as in the aligned SNP reads. The 2.9% divergence peak spans two 1 Mb
windows roughly corresponding to 8p23.2, which do not contain any of the repeats and
appear to be relatively unique in the genome. Rare duplications of 8p23.2 have been
described without apparent deleterious consequences [17], but they would have to be present
in both the human and chimp libraries along with more proximal duplications to explain
the data. Low quality chimpanzee sequence cannot explain this, since the human diversity
data are independent of these data. Also, there are a number of other chromosomal
regions with LCR clusters that do not show equivalently elevated rates, suggesting that a
potential artifact would not be due to a systematic problem in dealing with duplicated
regions.
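The comparison above rests on divergence estimates in non-overlapping 1 Mb windows. The source does not describe how these were computed, so the following Python sketch shows only one straightforward way to do it: count mismatches and aligned bases per window and take the ratio. The function name and input representation are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable

def windowed_divergence(mismatch_positions: Iterable[int],
                        aligned_positions: Iterable[int],
                        window: int = 1_000_000) -> Dict[int, float]:
    """Mismatches per aligned base in each non-overlapping window.

    Positions are 0-based chromosome coordinates. Only aligned bases count
    toward the denominator, so poorly aligned duplicated sequence contributes
    little to a window's estimate. Windows with no aligned bases are omitted.
    """
    mismatches = Counter(pos // window for pos in mismatch_positions)
    aligned = Counter(pos // window for pos in aligned_positions)
    return {w: mismatches[w] / n for w, n in sorted(aligned.items())}

# Toy example with 100 bp windows instead of 1 Mb:
toy = windowed_divergence(
    mismatch_positions=[5, 50, 150],
    aligned_positions=range(200),
    window=100,
)
print(toy)
```

In the toy example the first window has two mismatches in 100 aligned bases and the second has one, giving 2% and 1% divergence respectively. Because the denominator counts only aligned bases, a window dominated by unaligned LCR sequence is estimated mostly from its unique portion, which is the effect the text describes.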
Long coalescence time and balancing selection: Divergence and diversity depend in part
on the time since the last common ancestor (TLCA) between the genomes or haplotypes
compared. Balancing selection (as seen at HLA loci) or suppressed recombination (as
seen at BRCA1) can both increase the TLCA. The immunological importance of the
defensin cluster situated between 6-8 Mb suggests the possibility of strong balancing
selection in this region, but the divergence/diversity peak overlaps a gene desert
(approximately 2.5 Mb proximal) with only one large gene: CSMD1, a putative tumor
suppressor expressed mainly in the brain. Due to the high estimated recombination rate in
the region (2.5-3.2 cM/Mb) strong hitchhiking effects seem unlikely to extend as far as 5
Mb on both sides of the defensin cluster.
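The hitchhiking argument can be made concrete by converting the quoted recombination rate into a genetic distance over 5 Mb. The Python sketch below does this and, as an added illustration not taken from the source, applies the Haldane map function to express that distance as a per-generation recombination fraction. The function names are hypothetical.

```python
import math

def genetic_distance_cm(rate_cm_per_mb: float, physical_mb: float) -> float:
    """Genetic distance in cM, assuming a uniform rate across the interval."""
    return rate_cm_per_mb * physical_mb

def haldane_recombination_fraction(distance_cm: float) -> float:
    """Haldane map function: r = 0.5 * (1 - exp(-2d)), with d in Morgans."""
    d_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

PHYSICAL_MB = 5.0                  # distance from the defensin cluster
for rate in (2.5, 3.2):            # quoted range, cM/Mb
    d_cm = genetic_distance_cm(rate, PHYSICAL_MB)
    r = haldane_recombination_fraction(d_cm)
    print(f"{rate} cM/Mb over {PHYSICAL_MB} Mb: {d_cm:.1f} cM, r = {r:.3f}")
```

At these rates, 5 Mb corresponds to about 12.5 to 16 cM, or a recombination fraction above 10% per generation under the Haldane model, which is consistent with the view that linkage to a selected site would decay quickly over that distance.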
Supplemental references.
1. Nusbaum, C., et al. (2005) DNA sequence and analysis of human Chromosome
18. Nature 437:551-555.
2. Asakawa, S. and Shimizu, N. (1998) High-fidelity Digital Hybridization
Screening. Genomics, 49:209-217.
3. Asakawa, S., et al. (1997) Human BAC Library: Construction and Rapid
Screening. Gene 191:69-79.
4. Kent, W.J. (2002) BLAT--the BLAST-like alignment tool. Genome Res. 12:656-664.
5. Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res. 25:3389.
6. Birney, E., Clamp, M. and Durbin, R. (2004) GeneWise and Genomewise.
Genome Res. 14:988-95.
7. International Human Genome Sequencing Consortium. (2004) Finishing the
euchromatic sequence of the human genome. Nature 431:931-945.
8. Grimwood, J., et al. (2004) The DNA sequence and Biology of human
Chromosome 19. Nature 428:529-535.
9. Deloukas, P., et al. (2004) The DNA sequence and comparative analysis of
human Chromosome 10. Nature 429:375-381.
10. Martin, J., et al. (2004) The sequence and analysis of duplication-rich human
Chromosome 16. Nature 432:988-994.
11. Taudien, S., et al. (2004) Polymorphic segmental duplications at 8p23.1 challenge
the determination of individual defensin gene repertoires and the assembly of a
contiguous human reference sequence. BMC Genomics 5:92.
12. McLean, W.H., et al. (1996) Loss of plectin causes epidermolysis bullosa with
muscular dystrophy: cDNA cloning and genomic organization. Genes Dev.
10:1724-1735.
13. Shimizu, A., et al. (2003) A novel giant gene CSMD3 encoding a protein with
CUB and sushi multiple domains: a candidate gene for benign adult familial
myoclonic epilepsy on human chromosome 8q23.3-q24.1. Biochem. Biophys. Res.
Commun. 309:143-154.
14. Ge, Y., Wagner, M.J., Siciliano, M. and Wells, D.E. (1992) Sequence, higher
order repeat structure, and long-range organization of alpha satellite DNA
specific to human chromosome 8. Genomics 13:585-593.
15. Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human
genomic DNA. J. Mol. Biol. 268:78-94.
16. Roest Crollius, H., et al. (2000) Estimate of human gene number provided by
genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet.
25:235-238.
17. Harada, N., et al. (2002) Duplication of 8p23.2: a benign cytogenetic variant? Am.
J. Med. Genet. 111:285-288.