Supplemental Information
Methods:
Generating the sequence data:
Our experiment started with ~30,000 human genes annotated on Celera’s human genome
version R26k. Of these thirty thousand genes, we de-selected 2,925 pseudogenes and 905 genes
in the MHC region on chromosome 6 or on the Y chromosome; we selected 25,605 genes to
undergo primer design. Primers were designed using Primer3 (Steve Rozen and Helen J.
Skaletsky (2000) Primer3 on the WWW for general users and for biologist programmers. In:
Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in Molecular
Biology. Humana Press, Totowa, NJ, pp 365-386). We required an amplicon size of 200-550 bp and that the target coding sequence be at least 30 bp from the ends of the amplicon.
Of the approximately twenty-five thousand genes selected for primer design, we were able to
cover an average of 92 ± 2% of the coding sequence across 30 different molecular functions.
We were not able to design amplicons targeting the coding sequence of 2,242 genes (8.8%).
Genes failing primer design were not clustered among the different molecular functions. This left
us with 202,078 primer pairs covering some portion of 23,363 human coding regions. The
breakdown of primer pairs per gene is given in Table I. The identity and source of the DNA
samples amplified are given in Table II (end of supplementary section). All Coriell samples were
from the “Apparently healthy non-fetal tissue” NIGMS collection. The DNAs sequenced are
unique and distinct from DNA panels 1 and 2 used by the Seattle SNPs group
(http://pga.gs.washington.edu/platemaps.html#table1).
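The two amplicon design constraints stated above (a size window and a minimum distance between the target coding sequence and each amplicon end) can be expressed as a simple check. The following is a minimal sketch, not the authors' pipeline: the `Amplicon` record, its coordinate convention, and the function name are hypothetical, and the size window is read here as 200 to 550 bp.

```python
from dataclasses import dataclass

MIN_AMPLICON_BP = 200  # lower size bound, as read from the text
MAX_AMPLICON_BP = 550  # upper size bound, as read from the text
MIN_FLANK_BP = 30      # minimum distance from target to either amplicon end


@dataclass(frozen=True)
class Amplicon:
    """Hypothetical record; coordinates are 0-based and end-exclusive."""
    start: int         # first base of the amplicon
    end: int           # one past the last base of the amplicon
    target_start: int  # first base of the targeted coding sequence
    target_end: int    # one past the last base of the targeted coding sequence


def meets_design_rules(a: Amplicon) -> bool:
    """Return True if the amplicon satisfies the size and flank rules."""
    size_ok = MIN_AMPLICON_BP <= (a.end - a.start) <= MAX_AMPLICON_BP
    left_flank_ok = (a.target_start - a.start) >= MIN_FLANK_BP
    right_flank_ok = (a.end - a.target_end) >= MIN_FLANK_BP
    return size_ok and left_flank_ok and right_flank_ok
```

For example, a 300 bp amplicon whose target spans positions 30 to 270 passes, while the same amplicon with a target starting at position 20 fails the flank rule.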
Table I. Number of primer pairs per gene

Percent of 23,363 genes    Number of primer pairs per gene
45%                        Between 1-5
25%                        Between 6-10
11%                        Between 11-15
17%                        Between 16-70
2%                         More than 70

Sequencing coverage:
Forward and reverse sequencing traces for each human amplicon were base called and
assembled using the Phred/Phrap/Consed package (Ewing B, Green P: Basecalling of automated
sequencer traces using phred. II. Error probabilities. Genome Research 8:186-194 (1998)). If at
least 10 traces gave high quality sequencing data (>200 contiguous bases of phred>20 scores)
and assembled with the reference amplicon sequence, the amplicon set entered the SNP detection
pipeline. Table III shows the total number of coding base pairs that were screened in different
numbers of individuals. Two-thirds of the coding bases in the 23,363 genes were screened in at least 35
DNAs; 82% of the coding bases were screened in at least 10 DNAs. Approximately 3,000 genes
did not yield any good quality sequence data. On average 29 individuals were screened for each
coding base; an average of 35 individuals were screened for each base that was sequenced
successfully in at least 5 DNAs. One third of the genes had >90% of their coding sequence
covered in at least 35 individuals (Figure 1). 80% of the genes had >70% of their coding
sequence covered in at least 10 individuals (Figure 1). Based on these numbers we could
calculate the probability of detecting a SNP of a particular allele frequency in the population.
For bases screened in at least 35 individuals we had 75% power to detect minor allele
frequencies >2% and >99% power to detect alleles greater than 5%. For bases screened in at
least 10 individuals we had 30% power to detect minor allele frequencies >2%, 60% power to
detect alleles greater than 5%, and >99% power to detect minor allele frequencies >25%.
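The text does not state how these power figures were computed. A minimal sketch under one simple assumption is given below: a SNP is detected whenever at least one copy of the minor allele occurs among the 2n chromosomes sampled from n diploid individuals, with chromosomes drawn independently. The function name is hypothetical. Under this assumption the power is about 0.76 for 35 individuals and about 0.33 for 10 individuals at a 2% minor allele frequency, close to the 75% and 30% quoted above.

```python
def detection_power(maf: float, n_individuals: int) -> float:
    """Probability of sampling at least one minor allele.

    Assumes 2 * n_individuals independently drawn chromosomes and that a
    single observed copy is enough to call the SNP (an assumption, not a
    statement of the authors' method).
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - maf) ** chromosomes


if __name__ == "__main__":
    for n in (35, 10):
        for maf in (0.02, 0.05, 0.25):
            print(f"n={n:2d}  MAF={maf:.2f}  power={detection_power(maf, n):.3f}")
```

Because imperfect base calling lowers the chance of calling a singleton, this model gives an upper bound on the real detection rate at each frequency.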
Table III. Number of human individuals screened for SNPs

                                             base pairs    % of coding bases
coding bases                                 29,975,619
coding bases with primers                    27,583,680    92%
Coding bases screened in at least 10 DNAs    24,609,384    82%
Coding bases screened in at least 20 DNAs    23,572,449    79%
Coding bases screened in at least 30 DNAs    21,719,876    72%
Coding bases screened in at least 35 DNAs    19,732,672    66%

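The percentages in Table III follow directly from the base-pair counts. As a quick consistency check, the short sketch below divides each count by the total number of coding bases; the dictionary keys are shortened labels chosen for illustration.

```python
TOTAL_CODING_BASES = 29_975_619  # "coding bases" row of Table III

# Base-pair counts from Table III, keyed by shortened, illustrative labels.
SCREENED_BASES = {
    "with primers": 27_583_680,
    "at least 10 DNAs": 24_609_384,
    "at least 20 DNAs": 23_572_449,
    "at least 30 DNAs": 21_719_876,
    "at least 35 DNAs": 19_732_672,
}


def percent_of_coding_bases(count: int, total: int = TOTAL_CODING_BASES) -> int:
    """Return count as a whole-number percentage of the total coding bases."""
    return round(100 * count / total)


PERCENTAGES = {label: percent_of_coding_bases(n) for label, n in SCREENED_BASES.items()}

if __name__ == "__main__":
    for label, pct in PERCENTAGES.items():
        print(f"{label:18s} {pct}%")
```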
Chimpanzee traces were analyzed independently of the human traces. 80% of amplicons
that gave product in human were successfully sequenced in the chimpanzee.
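The entry criterion for the SNP detection pipeline described in this section (at least 10 traces, each with more than 200 contiguous bases of phred >20) can be sketched as follows. This is an illustration only: the function names are hypothetical, each trace is assumed to be available as a list of per-base phred scores, and the additional requirement that traces assemble with the reference amplicon sequence is not modeled.

```python
from typing import Iterable, Sequence


def longest_high_quality_run(quals: Iterable[int], min_phred: int = 20) -> int:
    """Length of the longest run of consecutive bases with phred > min_phred."""
    best = current = 0
    for q in quals:
        current = current + 1 if q > min_phred else 0
        best = max(best, current)
    return best


def trace_is_high_quality(quals: Iterable[int], min_run: int = 200,
                          min_phred: int = 20) -> bool:
    """True if the trace has more than min_run contiguous high-quality bases."""
    return longest_high_quality_run(quals, min_phred) > min_run


def amplicon_enters_pipeline(traces: Sequence[Sequence[int]],
                             min_traces: int = 10) -> bool:
    """True if at least min_traces traces pass the quality criterion."""
    good = sum(1 for t in traces if trace_is_high_quality(t))
    return good >= min_traces
```

Both thresholds are read as strict inequalities here (more than 200 bases, scores above 20), which is one interpretation of the wording.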
[Figure 1: plot of % of genes screened in x samples (y-axis, 0% to 60%) against % of coding sequence analyzed for SNPs (x-axis, 0% to 100%), with series min5, min10, min20, min30, min35 and min39.]
Figure 1. Gene distribution of sequence coverage. The 10% bin refers to values >0 and <10% of
coding sequence analyzed for SNPs.
Detecting the SNPs:
Human polymorphisms were detected automatically from assembled sequencing traces
using PolyPhred 4.0 (Nickerson DA, Tobe VO, Taylor SL. (1997) PolyPhred: automating the
detection and genotyping of single nucleotide substitutions using fluorescence-based
resequencing. Nucleic Acids Res. 25: 2745-51) and RuleGen, a decision-tree based method
(Stephen Glanowski, unpublished) that uses several sequencing trace/base parameters (signal
strength, %GC, amplicon size, etc.). Manual calls were employed if potential SNPs were not
flagged by both programs. Validation of the automated pipeline using a set of 250 manually
called SNPs in the same set of traces (96 amplicons) showed a sensitivity of 85% for all SNPs
and up to 100% for SNPs with more than 3 minor alleles observed. Independent verification of
several hundred SNPs using TaqMan assays indicated a validation rate of 95% for common
SNPs and 90% for SNPs with only one minor allele observed.
Table IV. Comparison of Data Set 1 (this study) to Data Set 2 (described in text).

             avg no of DNA sequenced      % of coding bases sequenced in at least X number of DNAs
gene name    across the coding region     5       10      20      30      35      39
TFPI         24                           64%     64%     64%     64%     64%     20%
F3           28                           77%     77%     77%     66%     66%     54%
CYP4F3       30                           82%     82%     82%     65%     65%     44%
C3           31                           85%     85%     85%     77%     65%     34%
IL4          38                           100%    100%    100%    100%    100%    78%
IL1F6        39                           100%    100%    100%    100%    100%    75%
MC1R         39                           100%    100%    100%    100%    100%    86%
TNF          39                           100%    100%    100%    100%    100%    92%

No of SNPs with >2 minor alleles in Data Set 2: present in both Data Sets / not present in Data Set 1
No of SNPs with <3 minor alleles in Data Set 2: present in both Data Sets / not present in Data Set 1
0
0
0
0
0
1
1
1
3
1
1
3
11
1
5
2
1
0
0
0
2
0
1
0
5
1
4
1
0
0
0
0
False Negative Rate:
We compared the coding SNPs from 8 genes that had been sequenced by the SeattleSNPs
PGA (http://pga.gs.washington.edu/finished_genes.html) as an independent assessment of our
false negative rate. Although the DNAs sequenced are different, the number of individuals
sequenced is comparable: 39 humans in this paper (referred to as data set 1) compared to 48 by
SeattleSNPs (data set 2). Furthermore, both data sets contain roughly equal numbers of African
American and European American individuals. In data set 1, four genes (IL4, IL1F6, MC1R, and
TNF) were screened in ~38 DNAs for each coding base; the remaining 4 genes (TFPI, F3,
CYP4F3, and C3) were screened in ~29 DNAs (data set 1 average). In summary, 91% (41/45) of
SNPs with >2 minor alleles observed in data set 2 were also identified in data set 1.
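The summary figure above is a simple proportion, and the false negative rate is its complement. A minimal sketch of that arithmetic, using the 41 and 45 counts from the text and a hypothetical helper name:

```python
from typing import Tuple


def detection_rates(found: int, total: int) -> Tuple[float, float]:
    """Return (fraction detected, fraction missed) for a set of known SNPs."""
    if total <= 0 or not 0 <= found <= total:
        raise ValueError("need 0 <= found <= total and total > 0")
    detected = found / total
    return detected, 1.0 - detected


if __name__ == "__main__":
    detected, missed = detection_rates(41, 45)
    print(f"{detected:.0%} detected, {missed:.0%} missed")  # 91% detected, 9% missed
```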
All genes analyzed in this study have a corresponding entry in the RefSeq 9.0 database
with at least 95% nucleotide sequence identity and co-localization to build 34 of the Human
Genome Project. SNPs occurring in regions of the Celera gene for which there was no RefSeq
support were omitted from mkprf analysis, as were fixed nucleotide differences between all
humans and the chimpanzee in these regions. Furthermore, 12.9% of genes have at least one
synonymous or nonsynonymous SNP (total of 2,359 SNPs) in the Celera gene database in
regions with RefSeq support that were excluded from our analysis. (We include details as to
which genes have more SNPs than we used in our analysis in the supplementary on-line
information.) The mkprf algorithm was run with and without the omitted SNPs, and genes in
which omitting non-synonymous SNPs would lead to erroneous inference of positive selection
were excluded from the final set of genes reported and used in the molecular function and
biological processes section. Omitting non-synonymous SNPs from genes that show a signal of
weak negative selection is conservative, since it reduces the posterior probability that the
selection coefficient is below 0. Such genes were not removed from the analysis and are noted in the
supplementary table with McDonald-Kreitman cell entries and estimated selection coefficients.
The omitted SNPs do not alter the main conclusions of the paper, nor do they lead to gross differences in
overall percentages.
Table II. DNAs sequenced for polymorphisms.

DNA name    ethnicity            taxon              source                                        gender
4X0033      not known            Pan troglodytes    Southwest National Primate Research Center    male
NA00131     European American    Homo sapiens       Coriell Cell Repositories                     female
NA14548     European American    Homo sapiens       Coriell Cell Repositories                     female
NA14448     European American    Homo sapiens       Coriell Cell Repositories                     female
NA12593     European American    Homo sapiens       Coriell Cell Repositories                     female
NA10959     European American    Homo sapiens       Coriell Cell Repositories                     female
NA09947     European American    Homo sapiens       Coriell Cell Repositories                     female
NA08587     European American    Homo sapiens       Coriell Cell Repositories                     female
NA05920     European American    Homo sapiens       Coriell Cell Repositories                     female
NA02254     European American    Homo sapiens       Coriell Cell Repositories                     female
NA01990     European American    Homo sapiens       Coriell Cell Repositories                     female
NA01954     European American    Homo sapiens       Coriell Cell Repositories                     female
NA01953     European American    Homo sapiens       Coriell Cell Repositories                     female
NA01814     European American    Homo sapiens       Coriell Cell Repositories                     female
NA01805     European American    Homo sapiens       Coriell Cell Repositories                     female
NA00946     European American    Homo sapiens       Coriell Cell Repositories                     female
NA00893     European American    Homo sapiens       Coriell Cell Repositories                     female
NA00607     European American    Homo sapiens       Coriell Cell Repositories                     female
NA00546     European American    Homo sapiens       Coriell Cell Repositories                     female
NA00333     European American    Homo sapiens       Coriell Cell Repositories                     female
NA10924     European American    Homo sapiens       Coriell Cell Repositories                     female
NA14672     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14665     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14663     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14661     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14649     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14632     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14535     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14532     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14529     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14511     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14508     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14503     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14501     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14480     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14476     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14474     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14464     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14454     African American     Homo sapiens       Coriell Cell Repositories                     female
NA14439     African American     Homo sapiens       Coriell Cell Repositories                     female