Supplementary Information Text
Alternative Splicing
An additional 296 genes have one or more EST sequences overlapping non-annotated
exons that are flanked by canonical splice sites. This supports potential alternative
splicing for a minimum of 712 of the genes (77%), with 212 loci showing no evidence of
alternative splicing. As most of our conclusions are based on existing transcribed
sequence data, the depth of the EST database is the limiting factor in positively
identifying alternatively spliced loci. Unlike chromosome 19, where there is ample
evidence for multiple instances of low-level alternative splicing, chromosome 5 has
only 15 loci where two or more ESTs confirm a low-level alternative splice site.
This disparity may indicate that more attempts have been made to isolate low-level
alternative transcripts for the loci on chromosome 19, or that the genes on
chromosome 5 as a whole are less likely to have rare splice variants.
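The two criteria used above (an EST overlapping a candidate exon, and canonical
splice sites flanking that exon) can be illustrated with a minimal Python sketch. It is
not the pipeline used for the annotation: the function names, the 0-based half-open
coordinates, the forward-strand assumption, and the restriction to the AG/GT
dinucleotides are all assumptions made for illustration.

```python
def has_canonical_splice_sites(genome: str, exon_start: int, exon_end: int) -> bool:
    """Check a forward-strand internal exon for flanking AG ... GT dinucleotides.

    Coordinates are 0-based and half-open: the exon is genome[exon_start:exon_end].
    (Hypothetical helper; the coordinate convention is an assumption.)
    """
    if exon_start < 2 or exon_end + 2 > len(genome):
        return False  # not enough flanking sequence to judge
    acceptor = genome[exon_start - 2:exon_start].upper()  # last 2 bases of upstream intron
    donor = genome[exon_end:exon_end + 2].upper()         # first 2 bases of downstream intron
    return acceptor == "AG" and donor == "GT"


def est_supports_exon(est_blocks, exon_start: int, exon_end: int) -> bool:
    """True if any aligned EST block (start, end) overlaps the candidate exon."""
    return any(start < exon_end and end > exon_start for start, end in est_blocks)


# Toy example: lowercase intronic sequence around an uppercase candidate exon.
genome = "ttttcagATGGCCAAGgtaagt"
candidate = (7, 16)
supported = (
    est_supports_exon([(10, 20)], *candidate)
    and has_canonical_splice_sites(genome, *candidate)
)
```

A real analysis would also need to handle reverse-strand exons, first and last exons
(which have only one flanking splice site), and spliced EST alignments.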
Pseudogenes
445 of the 577 pseudogenes have at least one frameshift or stop codon mutation
compared with their parent genes. 43 of the 479 processed pseudogenes that lack
introns present at the parent locus and display poly-A stretches in adjacent genomic
sequence were identified by manual validation of the collection of human processed
pseudogenes determined by Zhang et al.1. An additional 62 processed and 28
non-processed pseudogenes were identified via further manual inspection of the candidate
gene loci. The accumulation of insults to the open reading frames and intron loss of these
genomic sequences confirm these as pseudogenes and suggest that they have lost their
original function. However, 47 of these pseudogenes have at least one overlapping EST
sequence available, indicating they are occasionally transcribed and may have adopted an
alternative functional role to the parent gene.
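The two kinds of open-reading-frame disruption mentioned above, frameshifts and
premature stop codons, can be detected from a parent/pseudogene alignment roughly
as follows. This is a simplified Python sketch rather than the procedure used for the
annotation: the function names are hypothetical, the alignment is assumed to start in
frame at the first coding base, and only the standard stop codons are considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds: str) -> list[int]:
    """Return 0-based codon indices of in-frame stops before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


def frameshift_gaps(parent_aln: str, pseudo_aln: str) -> list[int]:
    """Return lengths of gap runs (in either row) that are not multiples of 3."""
    if len(parent_aln) != len(pseudo_aln):
        raise ValueError("aligned sequences must have equal length")
    shifts = []
    for row in (parent_aln, pseudo_aln):
        run = 0
        for ch in row + "X":  # sentinel flushes a gap run at the end of the row
            if ch == "-":
                run += 1
            else:
                if run % 3:
                    shifts.append(run)
                run = 0
    return shifts


def is_disabled(parent_aln: str, pseudo_aln: str) -> bool:
    """True if the pseudogene row carries a frameshift or a premature stop."""
    ungapped = pseudo_aln.replace("-", "")
    return bool(frameshift_gaps(parent_aln, pseudo_aln)) or bool(premature_stops(ungapped))


# Toy example: a single-base deletion in the pseudogene copy.
parent = "ATGGCCAAGTTTTAA"
pseudo = "ATGGC-AAGTTTTAA"
disabled = is_disabled(parent, pseudo)
```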
Protocadherin Gene Family
The largest gene family on chromosome 5 is the protocadherin (PCDH) gene cluster,
which consists of 53 tandemly arrayed, single-exon paralogous genes organized into
three subclusters, designated α, β and γ (ref. 2). Each protocadherin exon encodes an
extracellular domain consisting of six cadherin-like ectodomain repeats, a transmembrane
domain and a short cytoplasmic tail. At the 3' end of both the α and γ subclusters are an
additional three short exons that are alternatively cis-spliced to each α and γ exon,
providing a "constant" cytoplasmic region2-4. Each protocadherin gene is transcribed
from its own promoter, and all protocadherin cluster promoters share a highly conserved
core motif5, 6. Promoter choice appears to determine the splicing of a particular α or γ
variable exon to the first constant region exon, in that the splice donor site of the
transcribed variable exon is used in cis-splicing3.
Each neuron appears to express a distinct combination of protocadherin genes7.
Protocadherin proteins are thought to form homophilic interactions at synapses, providing
a molecular means to distinguish subsets of neurons based on the combinations of
protocadherins they express7, 8. Protocadherin clusters are present in many vertebrate
species, although the sequence content greatly differs between mammals and other
vertebrates. Protocadherin cluster genes in humans and other species also undergo
frequent gene conversion events. These events are restricted to specific ectodomains,
resulting in some ectodomains becoming nearly identical among paralogs while other
ectodomains remain diverse. This process also generates allelic variants of human
protocadherin cluster genes.
Comparative Methods
Regions of evolutionary conservation are detected using the program PEAK-VISTA (S.
Prabhakar, unpublished work), which takes MLAGAN9 alignments as input. PEAK-VISTA
goes through a three-step process to identify statistically significant slowly evolving
regions. First, noncoding regions in the input alignment are used to estimate the
approximate local neutral mutation rates between all pairs of aligned sequences. The
method is adapted from Cooper et al.10. The estimated rates are then used to derive a
log-likelihood score for slow vs. neutral evolution at each aligned position, similar to the
strategy of Boffelli et al.11. Conserved regions show up as high-scoring segments, which
are assigned p-values relative to random permutations of the alignment columns. The
statistical formalism for computing p-values is identical to that of the NCBI BLAST
algorithm.
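The general idea of per-column log-likelihood scoring followed by high-scoring
segment detection can be sketched in Python as below. This is not PEAK-VISTA, which
is unpublished; it is a toy pairwise version under stated assumptions: a Jukes-Cantor
mismatch probability, a hypothetical `slow_fraction` parameter giving the slow rate as
a fraction of the neutral rate, and an empirical permutation p-value in place of the
BLAST-style analytic formalism described above.

```python
import math
import random


def jc_mismatch_prob(d: float) -> float:
    """Probability that two sequences differ at a site under Jukes-Cantor distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def column_scores(matches, d_neutral: float, slow_fraction: float = 0.3):
    """Log-likelihood ratio (slow vs. neutral) for each match/mismatch column."""
    p_neutral = jc_mismatch_prob(d_neutral)
    p_slow = jc_mismatch_prob(slow_fraction * d_neutral)  # assumed slow-evolution rate
    s_match = math.log((1.0 - p_slow) / (1.0 - p_neutral))
    s_mismatch = math.log(p_slow / p_neutral)
    return [s_match if m else s_mismatch for m in matches]


def best_segment(scores):
    """Return (score, start, end) of the maximal-scoring segment, end exclusive."""
    best = (0.0, 0, 0)
    run, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run <= 0:
            run, run_start = 0.0, i  # restart the segment at this column
        run += s
        if run > best[0]:
            best = (run, run_start, i + 1)
    return best


def permutation_p_value(scores, n_perm: int = 1000, seed: int = 0) -> float:
    """Fraction of column shuffles whose best segment scores at least as high."""
    rng = random.Random(seed)
    observed = best_segment(scores)[0]
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_segment(shuffled)[0] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy alignment: a fully conserved 40-column block inside diverged flanks.
matches = [True, False] * 15 + [True] * 40 + [False, True] * 15
scores = column_scores(matches, d_neutral=0.5)
segment_score, seg_start, seg_end = best_segment(scores)
p_value = permutation_p_value(scores, n_perm=200)
```

A multi-species version would replace the match/mismatch indicator with a likelihood
computed over the phylogeny, using the locally estimated pairwise neutral rates.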
To generate substitution rates for the chimp/human comparison, we constructed four-way
alignments of human, mouse, chimpanzee, and rat using MLAGAN and, using PEAK-VISTA
(S. Prabhakar, unpublished work), limited our analysis to conserved regions with
p-values below a cutoff. Simple Perl scripts calculated the
substitution rates of sequence falling in each category based on our internal annotation of
chromosome 5. To ensure comparison of high-quality alignments and truly orthologous
sequences, we limited our analysis to aligned segments with reasonable levels of
nucleotide diversity (0.5 between primates and rodents, 0.05 between primates, and 0.25
between rodents) encompassing approximately 130 Mb or 70% of the finished
chromosome. It should be noted that the observed non-coding/non-conserved substitution
rate was consistent among the variety of simple repeats, repetitive elements, and
non-repeat sequences encompassed by this category. Finally, all calculations employed the
Jukes-Cantor substitution model without corrections due to the extremely high sequence
similarity between humans and chimpanzees.
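For reference, the standard Jukes-Cantor distance is d = -(3/4) ln(1 - 4p/3), where p
is the observed proportion of differing sites. The Python sketch below computes both
quantities for a pair of aligned sequences and shows why they nearly coincide at very
low divergence. It is an illustration only; the function names are hypothetical and the
exact procedure used in the original Perl scripts is not specified in the text.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned, ungapped, unambiguous columns that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    compared = diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:  # skip gaps and ambiguity codes
            compared += 1
            diffs += a != b
    if compared == 0:
        raise ValueError("no comparable columns")
    return diffs / compared


def jukes_cantor_distance(p: float) -> float:
    """Expected substitutions per site under the Jukes-Cantor model."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC formula")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy example: one difference in 100 aligned bases.
human = "ACGT" * 25
chimp = "T" + human[1:]
p = p_distance(human, chimp)
d = jukes_cantor_distance(p)
```

At this level of divergence the Jukes-Cantor distance exceeds the raw proportion of
differences by well under one part in a hundred, which is why the choice of model
matters little for closely related sequences.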
Additional tiling set information
The distance from the most distal cosmid to the true telomere is estimated to be 63 kb for
the p-telomere and 20 kb for the q-telomere (Harold Riethman, pers. comm.). A
5q-telomere-containing 'half-YAC' (Riethman 2001) has been identified and sequenced by
subcloning into a cosmid vector. No half-YAC has been identified for the 5p-telomere.
The boundary between euchromatin and heterochromatin at the centromere was identified
by the presence of centromere specific alpha satellite repeats.
Supplementary Finishing Information
The tiling set of chromosome 5 consists of 1763 finished clones. 1685 of these clones
were drafted at the Joint Genome Institute and finished at the Stanford Human Genome
Center while 71 clones were drafted and finished at Lawrence Berkeley National
Laboratories. The following clones were drafted and/or finished elsewhere.
AC022493    Human Genome Sequencing Center, Baylor
AC025156    Human Genome Sequencing Center, Baylor
AC009757    Whitehead Institute/MIT Center for Genome Research
AC020728    Washington University, Genome Sequencing Center
AC002428    Washington University, Genome Sequencing Center
AC022217    Washington University, Genome Sequencing Center
AP006257    Riken, Genomic Sciences Center
Supplementary references
1. Zhang, Z., Harrison, P. M., Liu, Y. & Gerstein, M. Millions of years of evolution preserved:
a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res.
13, 2541-2558 (2003).
2. Wu, Q. & Maniatis, T. A striking organization of a large family of human neural cadherin-like
cell adhesion genes. Cell 97, 779-790 (1999).
3. Tasic, B. et al. Promoter choice determines splice site selection in protocadherin α and γ
pre-mRNA splicing. Mol. Cell 10, 21-33 (2002).
4. Wang, X., Su, H. & Bradley, A. Molecular mechanisms governing Pcdh-γ gene expression:
evidence for a multiple promoter and cis-alternative splicing model. Genes Dev. 16,
1890-1905 (2002).
5. Wu, Q. et al. Comparative DNA sequence analysis of mouse and human protocadherin
clusters. Genome Res. 11, 389-404 (2001).
6. Noonan, J. P. et al. Extensive linkage disequilibrium, a common 16.7 kilobase deletion, and
evidence of balancing selection in the human protocadherin α cluster. Am. J. Hum. Genet. 72,
621-635 (2003).
7. Kohmura, N. et al. Diversity revealed by a novel family of cadherins expressed in neurons at
a synaptic complex. Neuron 20, 1137-1151 (1998).
8. Obata, S. et al. Protocadherin Pcdh2 shows properties similar to, but distinct from, those of
classical cadherins. J. Cell Sci. 108, 3765-3773 (1995).
9. Brudno, M. et al. LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple
Alignment of Genomic DNA. Genome Res. 13, 721-731 (2003).
10. Cooper, G. M. et al. Characterization of evolutionary rates and constraints in three
Mammalian genomes. Genome Res. 14, 539-548 (2004).
11. Boffelli, D. et al. Phylogenetic shadowing of primate sequences to find functional regions of
the human genome. Science 299, 1391-1394 (2003).