Download COGENT_SequencingProposal_ShortProposal_17Oct2011

Whole exome sequencing in >20,000 well-phenotyped African Americans to identify genetic
risk alleles associated with cardiovascular disease-related traits
Proposed study PIs: Jim Wilson, David Reich, Leslie Lange
Current co-investigators: Herman Taylor, Alex Reiner, Charles Kooperberg, Ethan Lange, Ervin Fox, Yun Li;
(Others to be named)
Introduction and rationale:
The genetics community has recently begun to use next-generation sequencing technologies such as whole
exome sequencing to seek uncommon and rare variants that may contribute to disease. The cost of exome
sequencing has so far limited these efforts to perhaps 20,000 samples total, including all that have completed
or are awaiting sequencing. These samples are heavily weighted toward European ancestry based on
availability and funding sources. A technological breakthrough in the laboratory of Dr. David Reich, a long-time
collaborator of Drs. Wilson and Taylor, now makes it possible to conduct whole exome sequencing in >20,000
DNA samples, within the budget of an R01 grant (albeit at >$500,000 per year). This will provide a sample size
that will allow analysis in African American populations to have analytic power approaching that in populations
of European ancestry, and may be expected to lead to discoveries that are particularly relevant to African
Americans. It is essential that this work be pursued within the COGENT Consortium in order to obtain large
samples of African Americans.
We propose an ambitious study to sequence the exomes of >20,000 African Americans from the WHI and
other cohorts to discover genetic risk alleles for a range of phenotypes related to cardiovascular disease
(CVD). This is a unique experiment that is enabled by a technical breakthrough in the laboratory of joint PI
David Reich. This breakthrough means the effective cost of exome sequencing is reduced by more than
10-fold per sample for both single-variant and burden-of-rare-variant association analyses, making it possible to
propose an exome sequencing study on a scale of >20,000 samples. This large sample size is expected to
provide power to detect risk factors for CVD that would be undetectable in smaller sample sizes. Our proposal
also leverages the COGENT consortium, a group of cohorts with excellent phenotyping for heart, lung, and
blood traits, and in most cases with GWAS genotyping already available. Organizational approaches will
resemble those of the CARe consortium, with active phenotype working groups involving participation by
investigators from all contributing cohorts (a portion of effort in the budget will be dedicated to analysts and
investigators within each cohort to support manuscript development).
From the standpoint of WHI, there are several features of this collaborative proposal that are particularly
attractive. (1) It will be especially valuable to obtain rare variant/exome sequence data on the 3,683 African
American women eligible for the WHI In-Person Visit, since this sub-cohort will be examined in even greater
detail during the 2010-2015 WHI contract extension, not only through the In-Person Visit but also through other
BAA and ancillary study mechanisms. (2) The recent measurement of baseline CVD biomarkers in all African
American WHI-SHARe participants through core study W54 greatly enhances the ability to provide very large
sample sizes for assessment of rare variants across a broad range of quantitative traits and CVD intermediate
phenotypes. (3) The large sample size, richness of available phenotypes and the non-trivial statistical and
analytical requirements for this project ensure not only great opportunities for WHI investigator participation in
phenotypic working and writing groups as content experts across a range of phenotypes, but also robust
involvement of WHI statistical staff and genetics expertise during the 2010-2015 extension and beyond.
Study questions or hypotheses and specific aims:
We hypothesize that exome sequencing in a large number of well-characterized African Americans will lead to
the discovery of uncommon and rare variants that contribute to CVD, its risk factors, and intermediate
phenotypes. We therefore will pursue the following specific aims:
AIM 1: Produce DNA sequencing libraries for 20,000 individuals. We have developed a technology for
automated library preparation that allows us to prepare samples for sequencing at a tenth of the
commercial cost. We will make libraries for up to 20,000 consenting African American participants from
large observational studies who have adequate amounts of DNA.
AIM 2: Exome enrichment and sequencing. We will capture the exomes of batches of 24 samples per
preparation and sequence them to 10x average coverage on the Illumina platform.
AIM 3: Detect low frequency (0.1-5%) and common (>5%) variation and perform association
testing. We will detect genetic variation and perform association analysis for CVD, its risk factors, and
intermediate phenotypes. The 10x sequencing data will provide >93% of the per-sample power of 100x
sequencing.
AIM 4: Burden-of-rare-variants testing. We will annotate variants and perform gene burden tests on
nonsynonymous variants to test for association between cumulative variant allele counts and prevalent and
incident CVD, its risk factors, and intermediate phenotypes. The 10x sequencing data will provide >80% of
the per-sample power of 100x sequencing.
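The burden test in AIM 4 can be illustrated with a minimal Python sketch. It assumes a simple statistic (the sum of alternate-allele counts per individual across a gene's rare nonsynonymous variants) and a permutation test on case/control status. The proposal does not specify which burden statistic or test will be used, so the function names, the toy genotypes, and the choice of test are hypothetical.

```python
import random

def burden_scores(genotypes):
    """Cumulative alternate-allele count per individual.

    genotypes: one row per individual, one column per rare variant in a gene;
    entries are alternate-allele counts (0, 1, or 2).
    """
    return [sum(row) for row in genotypes]

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(scores, is_case, n_perm=2000, seed=0):
    """Two-sided permutation test on the case/control difference in mean burden.

    This is an illustrative choice of test, not the proposal's stated method.
    """
    def abs_diff(labels):
        cases = [s for s, c in zip(scores, labels) if c]
        controls = [s for s, c in zip(scores, labels) if not c]
        return abs(mean(cases) - mean(controls))

    observed = abs_diff(is_case)
    rng = random.Random(seed)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # keeps the number of cases fixed
        if abs_diff(labels) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): 6 individuals, 3 rare variants in one gene
genotypes = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
is_case = [True, True, False, False, True, False]

scores = burden_scores(genotypes)
p_value = permutation_p_value(scores, is_case)
print("burden scores:", scores)
print("permutation p-value:", p_value)
```

For quantitative traits and intermediate phenotypes, the same burden score could be used as a predictor in a regression model instead of a case/control comparison.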
Proposed samples:
We plan to select ~20,000 samples from COGENT cohorts agreeing to participate (we are also in discussions
with JHS, MESA, Health-ABC, REGARDS, and GENOA, and plan to contact additional cohorts) to provide maximum
power for analysis of common, low frequency, and rare variants. Samples that are well phenotyped and have
prior GWAS genotyping data will be preferentially selected for sequencing. In general, we plan to exclude
participants who have already been selected for exome sequencing in other projects (e.g. ESP), although we
may include a small number of samples for quality control.
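To give a sense of scale for the sample size, the following Python sketch estimates how many copies and carriers of a variant would be expected among ~20,000 samples across the 0.1-5% frequency range named in AIM 3. It assumes Hardy-Weinberg proportions, which is a simplification introduced here and not a claim made in the proposal; the function names are hypothetical.

```python
def expected_allele_copies(n_samples, allele_freq):
    """Expected alternate-allele copies among the 2N sampled chromosomes."""
    return 2 * n_samples * allele_freq

def expected_carriers(n_samples, allele_freq):
    """Expected individuals with at least one copy (assumes Hardy-Weinberg)."""
    return n_samples * (1 - (1 - allele_freq) ** 2)

N_SAMPLES = 20_000  # approximate target sample size from the proposal

for freq in (0.001, 0.01, 0.05):
    copies = expected_allele_copies(N_SAMPLES, freq)
    carriers = expected_carriers(N_SAMPLES, freq)
    print(f"freq={freq:.3f}: ~{copies:.0f} allele copies, ~{carriers:.0f} carriers")
```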
Exome sequencing at $100 per sample:
Whole exome sequencing is a promising tool for disease gene discovery, but cost is a challenge. The prices
currently projected by academic sequencing centers and companies are $1,500-$2,000 per exome. Here we
describe how we will use three ideas to reduce exome sequencing costs by more than 10-fold, allowing us to
achieve a scale of 20,000 samples:
(1) $30 sample preparation cost. Currently, the costs charged commercially or by academic service centers for
library construction are $300-$800 per sample. However, over the last two years with NHLBI and NCI funding,
post-doctoral fellow Nadin Rohland, who works in the laboratory of one of the joint PIs (David Reich), has developed
a high-throughput library construction protocol with low reagent costs, which allows barcoding of samples so
that they can be captured in pools (this paper is in submission at Nature Methods). The Budget Justification
projects $30 per sample for library production.
(2) $22 exome enrichment cost. Another major cost is the process of enriching a DNA sample for the subset of
the genome that lies in protein coding regions. Because of the way that most library protocols are designed,
enriching more than one sample simultaneously results in a high proportion of “off-target” DNA sequences that
are not in genes, and hence most protocols use a single aliquot of enrichment reagent per sample at a cost of
around $500 per sample. However, our libraries are specifically built for pooled target enrichment—with short
internal barcodes that do not interfere with the exome enrichment process—and hence we can pool 24
samples in a reaction, resulting in a 24-fold decrease in reagent costs.
(3) $46 exome sequencing cost. Current exome sequencing protocols have been developed in large part for
studies of Mendelian disease, where it is important to sequence to high coverage with the goal of finding all
possible causal variants. However, with 10x coverage, one can confidently detect >80% of even the rarest
variants (those that occur only once). Thus, we can reduce sequencing costs per exome by a factor of ten
while retaining at least 80% of the power for disease gene mapping per sample. Combined with current prices
for Next-Gen sequencing, this means that the sequencing cost per sample can be reduced to the same order
of magnitude as the sample preparation and sample enrichment cost.
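As a rough check on items (1)-(3), the following Python sketch adds up the three per-sample costs quoted above and evaluates an idealized model of variant detection at 10x coverage. The dollar figures are taken from the text. The Poisson depth model, the equal sampling of both alleles at a heterozygous site, and the `min_alt_reads` threshold are assumptions made for illustration only, not the proposal's variant-calling method.

```python
import math

# Per-sample costs in USD, as quoted in items (1)-(3)
LIBRARY_PREP = 30
ENRICHMENT = 22
SEQUENCING = 46
N_SAMPLES = 20_000

def per_sample_cost():
    """Sum of the three quoted cost components."""
    return LIBRARY_PREP + ENRICHMENT + SEQUENCING

def pooled_reagent_cost(cost_per_reaction, pool_size):
    """Reagent cost per sample when one reaction is shared by a pool."""
    return cost_per_reaction / pool_size

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean ** k / math.factorial(k)

def prob_detect_het(mean_coverage, min_alt_reads):
    """P(at least min_alt_reads reads carry the alternate allele).

    Assumes read depth is Poisson(mean_coverage) and each read samples either
    allele with probability 1/2, so the alternate-read count is
    Poisson(mean_coverage / 2). min_alt_reads is a hypothetical call threshold.
    """
    alt_mean = mean_coverage / 2
    return 1 - sum(poisson_pmf(k, alt_mean) for k in range(min_alt_reads))

cost = per_sample_cost()
print(f"per-sample cost: ${cost}")
print(f"total for {N_SAMPLES:,} samples: ${cost * N_SAMPLES:,}")
print(f"fold reduction vs $1,500-$2,000: {1500 / cost:.1f}x to {2000 / cost:.1f}x")
print(f"enrichment reagent at $500 per reaction, pool of 24: "
      f"${pooled_reagent_cost(500, 24):.2f} per sample")
print(f"P(detect singleton het) at 10x, >=3 alt reads: {prob_detect_het(10, 3):.3f}")
```

The three components sum to $98, in line with the $100 per sample in the heading. Under the stated assumptions and a threshold of three alternate reads, the detection probability at 10x is roughly 0.88; the result is sensitive to the threshold and to real-world coverage variation, so it should be read as a plausibility check only.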
Amount and type of sample (ml, µl, µg, etc.) required for proposed study:
5 µg genomic DNA per sample.