Detecting Allelic Effects
December 10, 2004
PI: Fernando Pardo Manuel, PhD (UNC/Genetics)
Programmer Supervisor: Patrick Sullivan, MD (UNC/Genetics)
Background. The locations of millions of single nucleotide polymorphisms (SNPs) are
known to high precision. However, we very rarely know whether a particular SNP is
functional: that is, whether it is “silent”, changes the amount of messenger RNA or
protein produced, or alters the function of the protein. There is an urgent
need for high-throughput methods to catalog SNP functional variation.
Dr. Pardo Manuel has developed a method to investigate the functionality of SNPs within
virtually any gene in the mouse genome. The pilot studies are promising and he wishes
to scale the project upward.
Doing this requires more sophisticated and integrated data management than is now
available. Without improved data management, error and inefficiency could hinder the
project.
Précis.
- Assume there is a mouse gene for which one wishes to catalog genetic variation. As
an example, the mouse gene Il9r (interleukin 9 receptor) on chromosome 11 is
shown above. The transcript is about 12,000 bases.
- The first step is to conduct DNA sequencing of a considerable portion of the gene
including all exons and many introns. This is done in tiles of 500-700 bases
strategically chosen across the gene from both right-to-left and from left-to-right.
Existing software is used to design primers and to create the DNA sequence tiles.
- Sequencing is done in a sizeable panel of diverse inbred mouse strains (about 25).
So, DNA sequence has to be stored for Il9r from 25 different mice.
- Next, the sequence across the strains has to be compared to note the presence of a
variant position in one or more strains and to classify the type of variant (e.g.,
SNP, insertion/deletion, microsatellite).
- Several standard indices, such as nucleotide diversity, need to be computed.
- A scheme to detect the consequences of a variant is then designed and carried out in
multiple different mouse tissues, initially at the mRNA level and later at the protein
level.
- At every step there are critical quality control steps to be conducted.
- All data entries and changes to existing data have to be recorded.
- This process will be scaled up to consider many hundreds (perhaps thousands) of
genes.
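The comparison and summary steps in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's pipeline: it assumes the tiles from each strain have already been aligned to equal length with `-` marking gaps, it labels any gapped column as an indel, and it does not attempt to recognize microsatellites. The function names and the toy strain labels are hypothetical.

```python
from itertools import combinations


def classify_columns(aligned):
    """Return (position, kind, alleles) for every variable alignment column.

    `aligned` maps a strain name to its aligned sequence; all sequences must
    have the same length and use '-' for gaps (an assumption of this sketch).
    """
    seqs = [s.upper() for s in aligned.values()]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    variants = []
    for pos in range(length):
        alleles = {s[pos] for s in seqs}
        if len(alleles) < 2:
            continue  # every strain agrees at this position
        kind = "indel" if "-" in alleles else "SNP"
        variants.append((pos, kind, sorted(alleles)))
    return variants


def nucleotide_diversity(aligned):
    """Average proportion of differing sites over all pairs of strains.

    Columns where either member of a pair has a gap are skipped for that pair.
    """
    seqs = [s.upper() for s in aligned.values()]
    total = 0.0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        compared = 0
        diffs = 0
        for x, y in zip(a, b):
            if x == "-" or y == "-":
                continue
            compared += 1
            diffs += x != y
        if compared:
            total += diffs / compared
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


# Toy alignment; strain labels and sequences are made up for illustration.
toy = {"strain_A": "ACGTAC", "strain_B": "ACGTAC", "strain_C": "ATGT-C"}
variants = classify_columns(toy)
pi = nucleotide_diversity(toy)
```

In a production setting these values would be written back to the database together with the quality-control status of the underlying reads.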
Need. An experienced database programmer is required to work under supervision to
develop a relational database to record, track, and manipulate the data from this project.
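As a rough idea of what such a database might contain, the sketch below uses Python's built-in `sqlite3` module to create a few tables (genes, strains, sequencing tiles, per-strain tile sequences) plus an audit table that triggers fill on every insert or change. All table names, column names, and the choice of SQLite are assumptions made for illustration; the actual schema and database engine are left to the project.

```python
import sqlite3

SCHEMA = """
CREATE TABLE gene (
    gene_id    INTEGER PRIMARY KEY,
    symbol     TEXT NOT NULL UNIQUE,
    chromosome TEXT NOT NULL
);
CREATE TABLE strain (
    strain_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE tile (
    tile_id  INTEGER PRIMARY KEY,
    gene_id  INTEGER NOT NULL REFERENCES gene(gene_id),
    start_bp INTEGER NOT NULL,
    end_bp   INTEGER NOT NULL,
    CHECK (end_bp > start_bp)
);
CREATE TABLE tile_sequence (
    tile_id   INTEGER NOT NULL REFERENCES tile(tile_id),
    strain_id INTEGER NOT NULL REFERENCES strain(strain_id),
    bases     TEXT NOT NULL,
    qc_passed INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (tile_id, strain_id)
);
-- Every data entry and every change to a stored sequence is logged here.
CREATE TABLE audit_log (
    audit_id   INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL,
    row_key    TEXT NOT NULL,
    action     TEXT NOT NULL,
    old_value  TEXT,
    new_value  TEXT,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER tile_sequence_after_insert AFTER INSERT ON tile_sequence
BEGIN
    INSERT INTO audit_log (table_name, row_key, action, old_value, new_value)
    VALUES ('tile_sequence', NEW.tile_id || ':' || NEW.strain_id,
            'INSERT', NULL, NEW.bases);
END;
CREATE TRIGGER tile_sequence_after_update AFTER UPDATE OF bases ON tile_sequence
BEGIN
    INSERT INTO audit_log (table_name, row_key, action, old_value, new_value)
    VALUES ('tile_sequence', NEW.tile_id || ':' || NEW.strain_id,
            'UPDATE', OLD.bases, NEW.bases);
END;
"""


def create_database(path=":memory:"):
    """Open a SQLite database and create the illustrative schema."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce REFERENCES constraints
    conn.executescript(SCHEMA)
    return conn
```

Keeping the audit trail in triggers rather than in application code means that changes are recorded no matter which program edits the data, which is one possible way to meet the requirement that all entries and changes be logged.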
A. Specific Aims
Several large-scale studies indicate that allelic variation in gene expression is common
and may account for much of the phenotypic variation within and among species. These
observations, together with the repeated finding in human studies that susceptibility
alleles at candidate genes often lack changes in the coding sequence, suggest that allelic
variation in gene expression may play a central role in the etiology of complex genetic
traits including common human diseases. Regulatory variation may be due to
differences in trans-acting factors or cis-acting elements and may lead to differences in
the level of gene expression. The challenges facing the identification of cis-regulatory
variants include our limited capacity to identify regulatory elements and to evaluate the
functional consequences of sequence variants based solely on sequence data.
Functional annotation of the predicted 10^7 common variants present in the human
genome (and similar numbers in model organisms) is an essential step to increase our
understanding of basic biological processes and to identify the causative genetic variants
responsible for common human diseases.
Genes that harbor regulatory variants in cis-acting elements can be identified by
differential allelic expression in heterozygous carriers of sequence variants within the
transcript. This sensitive approach measures the ratio of the two alleles of a gene in the
same cellular environment (therefore accounting for trans-acting factors and
environmental variation) and may be used to identify genes harboring regulatory
variants. Application of this and other methods should provide a long and interesting list
of genes harboring cis-regulatory variants. However, these studies would fail to identify
the causative variant in many, if not most, of the genes because the presence of multiple
genetic variants in complete linkage disequilibrium (LD) makes it very difficult to
distinguish between causative and nearby neutral variation. Wild-derived mouse inbred
strains provide a unique opportunity to overcome this critical limitation and to increase
the likelihood of detecting cis-regulatory variation in a systematic and genome-wide
manner. The mouse is an exceptional model for three reasons: it harbors the highest
level of genetic diversity described in a mammalian species, multiple sequence variants
are found in the transcript of up to 95% of genes, and LD between nearby variants is
limited. These three characteristics stem from the phylogenetic history and short
generation time of this rodent, for which available inbred strains have captured a large
fraction of the genetic variation. Our analyses, based on >50,000 genotypes obtained by
sequencing ≈2000 new genetic variants in a panel of 25 inbred strains, indicate that the
level and distribution of genetic variants and LD observed among inbred strains vary
widely (by one or two orders of magnitude) depending on the strains
considered. Therefore, it is possible to identify a set of strains with very high levels of
diversity (>25 variants per kb) that can be used to generate a panel of F1 mice in which
the majority of genes can be screened for cis-regulatory variation using the differential
allelic expression method. Then, the association between the allelic ratio and the
patterns of alleles observed at different variants among the strains can be used to
discriminate between causative and neutral variation. In our preliminary work we
reduced the number of candidate variants in the Il9r gene from several hundred to, on
average, 2.5 variants located less than 500 bp apart. We hypothesize that mapping
resolution may be increased further by selecting the optimal strains.
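The logic of using strain allele patterns to narrow down candidate variants can be illustrated with a deliberately simplified sketch. It assumes a single biallelic cis-acting causal variant with a noise-free effect, so that an F1 hybrid shows allelic imbalance exactly when its two parental strains carry different alleles at that variant. The function names, the fixed tolerance of 0.1 around a 50:50 ratio, and the toy data are hypothetical; the statistical methods actually used in the project are not described here.

```python
def allelic_fraction(count_a, count_b):
    """Fraction of the expression signal that comes from the first allele."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("no signal for either allele")
    return count_a / total


def shows_imbalance(count_a, count_b, tolerance=0.1):
    """Call differential allelic expression when the fraction departs from 0.5.

    The fixed tolerance is an illustrative stand-in for a proper statistical test.
    """
    return abs(allelic_fraction(count_a, count_b) - 0.5) > tolerance


def consistent_variants(genotypes, f1_observations):
    """Keep variants whose strain pattern matches the observed imbalance.

    genotypes:       {variant_id: {strain: allele}}
    f1_observations: list of (parent_strain_1, parent_strain_2, imbalanced)
    A variant is retained only if the parents differ at it in every F1 that
    shows imbalance and agree at it in every F1 that does not.
    """
    kept = []
    for variant, alleles in genotypes.items():
        if all((alleles[p1] != alleles[p2]) == imbalanced
               for p1, p2, imbalanced in f1_observations):
            kept.append(variant)
    return kept


# Toy data: three strains, three candidate variants, three F1 crosses.
genotypes = {
    "v1": {"S1": "A", "S2": "G", "S3": "A"},
    "v2": {"S1": "C", "S2": "C", "S3": "T"},
    "v3": {"S1": "T", "S2": "G", "S3": "G"},
}
observations = [
    ("S1", "S2", shows_imbalance(70, 30)),  # imbalanced
    ("S1", "S3", shows_imbalance(52, 48)),  # balanced
    ("S2", "S3", shows_imbalance(28, 72)),  # imbalanced
]
candidates = consistent_variants(genotypes, observations)
```

In this toy example only `v1` has a strain distribution that matches all three crosses. Adding strains with different allele patterns excludes more variants, which is why the choice of strain panel affects mapping resolution.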
To establish proof of principle for a scalable method by which to identify causative allelic
variants influencing gene expression, we propose the following Specific Aims:
1. Identify an optimal panel of inbred strains for identification of cis-regulatory
variants in a genome-wide manner. To accomplish this aim we propose to:
1.1. Estimate the mapping resolution in our panel of 25 strains for a
previously described, but not yet identified, regulatory variant
responsible for differential allelic expression of the Il9r gene.
1.2. Estimate the level of genetic diversity, the fraction of informative genes,
and the level and extent of LD in our initial panel of inbred strains.
1.3. Define the optimal panel of strains for high-resolution mapping of cis-regulatory variants.
2. Provide proof of principle for this approach by the identification and
validation of the causal variants for multiple genes (e.g., Il9r, Comt,
Ccnf, Uros). To accomplish this aim we propose to:
2.1. Generate the mapping panel(s). Establish priority criteria for high-resolution mapping.
2.2. Statistical methods.
2.3. Comprehensive analysis of cis-regulatory variation in the Il9r gene,
including identification of the genetic variant(s) responsible for the
differential allelic expression in the spleen.
2.4. High-resolution mapping of cis-regulatory variants in high-priority genes.
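For the LD estimates mentioned in Aim 1.2, one common summary is the squared correlation r^2 between two biallelic variants. The sketch below assumes fully inbred (homozygous) strains, so each strain contributes one haplotype and the strain genotype can be used directly. The function name and inputs are hypothetical, and this is only one of several LD measures the project might adopt.

```python
def r_squared(alleles_a, alleles_b):
    """Squared correlation (r^2) between two biallelic variants.

    Each argument lists the allele carried by every strain, in the same
    strain order. Strains are assumed to be homozygous at both variants.
    """
    if len(alleles_a) != len(alleles_b) or not alleles_a:
        raise ValueError("need the same non-empty set of strains for both variants")
    n = len(alleles_a)
    ref_a, ref_b = alleles_a[0], alleles_b[0]  # arbitrary reference alleles
    p_a = sum(x == ref_a for x in alleles_a) / n
    p_b = sum(y == ref_b for y in alleles_b) / n
    p_ab = sum(x == ref_a and y == ref_b
               for x, y in zip(alleles_a, alleles_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both variants must be polymorphic in the panel")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


# Two toy variant pairs scored in four strains.
complete_ld = r_squared(list("AAGG"), list("CCTT"))
no_ld = r_squared(list("AAGG"), list("CTCT"))
```

Variant pairs with r^2 close to 1 cannot be separated by association across that panel, whereas pairs with low r^2 can, so a panel with limited LD leaves fewer candidate variants per gene.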
Successful completion of these Specific Aims will simplify, accelerate and reduce the
cost of sorting through hundreds or thousands of neutral polymorphisms in the
identification of causative cis-regulatory variants. The method proposed here may be
applied to any autosomal gene subject to regulatory variation and allows prioritization of
genes based on the probability of success. It will complement ongoing efforts to estimate
the contribution of trans and cis regulation to phenotypic variation. Because the mapping
method is independent of any prior knowledge of functional elements, it is likely to
identify novel regulatory sequences. The data obtained may be integrated with predicted
regulatory elements identified in comparative genomic analyses. Furthermore, the
variants uncovered in this study may be used to generate highly informative and robust
microarray-based assays to test allele specific gene expression.