S1. SUPPLEMENTARY NOTES
ARRAY DESIGN DETAILS
First Phase Design
1. Source Sequence. Retrieved human chromosome sequences from UCSC (Hinrichs,
Karolchik et al. 2006) and used RepeatMasker (Smit, Hubley et al. 1999-2004) to identify
repeat elements.
2. Import the target region list. Ensured that the chromosome name and coordinates
specified for every region were valid. Also tested that there were no overlaps between
regions in the input list. Overlaps and duplicates in the initial lists were resolved and
resulted in a final list of 11,000 target chromosomal regions, mostly corresponding to exons.
3. Add flanking sequence to a uniform length of 1200 bp. Flanking regions were
added to all regions less than 600 bp so that their final size would be 1200 bp, to accommodate
>8 50-60mer probes per region. Regions that were already 1200 bp or greater were left
untouched. The resulting region is always centered on the original target (i.e. the same
amount of flank is added to each side).
4. Probe extraction. Extracted probe sequences at 5 bp intervals from within each target
region. Checked for ambiguous bases and counted the number of RepeatMasked bases.
At each 5 bp interval, the length of the extracted probe was varied to be 54 bp +/- 10 bp to
achieve a target Tm of 76 °C. If the resulting Tm was still more than 5 °C from the target Tm,
the probe was discarded. Also tested for simple low complexity elements (homopolymers,
dipolymers, etc.) at this stage and discarded probes if they were encountered. Finally, if
more than 25% of the bases of a probe were repeat-masked, the probe was discarded. Of
~9.4 million possible probes, a total of ~3.6 million probes passed these simple tests and
were stored as the complete ‘unfiltered’ probe set. The most common reason for failure of a
probe at this early stage was the presence of a repeat element. Probe Tm was calculated
according to the ‘Nearest Neighbour’ approach described by Breslauer, Frank et al. (1986)
using the thermodynamic measurements of Sugimoto, Nakano et al. (1996).
5. Random sequence generation. Random probes were generated to be used as negative
controls and to estimate background hybridization. They were selected to uniformly cover
the range of probe Tm and length represented by the actual region probes extracted above.
Initially, ~1.9 million random probes were generated. These probes were subjected to the
same quality tests as the experimental probes and were additionally tested to ensure that they had
minimal homology to the human genome.
6. Probe folding. All probes were folded by MultiRNAFold (www.rnasoft.ca) to identify
sequences which form hairpins or duplexes (Andronescu, Aguirre-Hernandez et al. 2003;
Andronescu, Fejes et al. 2004).
7. Low complexity testing. The ‘mdust’ algorithm (Hancock and Armstrong 1994) was used
to identify low complexity elements which were not previously identified by searching for
homopolymers, dipolymers, etc.
8. Specificity testing. Each probe was mapped to the complete human genome sequence
using BLAST 2.2.15 (with a word size of 20). The probe needed to be successfully mapped
back to its source location and the number of hits to other regions was noted.
9. Cycle calculation. The number of cycles required by NimbleGen to synthesize each probe
was calculated by a modification of their cycle calculator tool (written in C). I modified this
tool to accept a more convenient file format as input. The algorithm was not altered.
10. Summary statistics. At this point various figures and statistics were generated to
represent the distribution of values resulting from the tests conducted in steps 5-8 for all
probes. These statistics were used to choose reasonable cutoffs for filtering probes in the
following step.
11. Filtering region probes. Probes which did not meet particular cutoffs for the values
calculated in steps 5-9 were removed from consideration for the final array. Specifically, a
probe was required to have: (a) a probe length of 54 bp +/- 10 bp, (b) a Tm of 76 °C +/- 4.5 °C,
(c) less than 10% of its length representing repeat-masked sequence, (d) a free energy of
hairpin folding greater than -10.0 kcal/mol, (e) a free energy of dimerization greater than
-26.0 kcal/mol, (f) low complexity bases occupying 10% or less of the probe length, (g) no
non-specific BLAST hits of 80% of the probe length or greater, (h) no more than 1 non-specific
BLAST hit of 75% of the probe length or greater, (i) no more than 4 non-specific BLAST hits of
50% of the probe length or greater and (j) no more than 178 cycles required for synthesis
according to NimbleGen’s cycle calculator. After filtering for these 10 criteria, a pool of 2.8
million region probes remained for potential inclusion on the final array design.
12. Filtering random control probes. Filtering of random sequence probes was done exactly
as for region probes except that no BLAST hits of any length to the human genome were
allowed. After filtering, a pool of 1.5 million random probes remained for potential inclusion
on the final array design.
13. Probe selection. The probe selection process involved cycling through all 11,000 target
regions and selecting probes for each region until 99% of 385,000 probes were identified.
At every cycle for each region, the ‘best’ probe was determined by considering its Tm and
length as well as its distance from probes already selected within the region. The selection
algorithm attempted to maximize the distance between probes, promote even coverage of
each region, and minimize probe overlap. The remaining 1% of the array was filled by
selecting random control probes. These were selected to uniformly represent the range of
Tm and probe length of all region probes selected.
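The ten filtering criteria of step 11 can be sketched as a single pass/fail test per probe. This is a minimal illustration only: the Probe record, its field names and the function name are hypothetical and are not taken from the original scripts. The dimerization cutoff is exposed as a parameter and defaults to -26.0 kcal/mol, on the assumption that the cutoff is negative like the hairpin cutoff.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Probe:
    """Hypothetical record of the per-probe values computed in the earlier steps."""
    length: int                      # probe length in bp
    tm: float                        # melting temperature in degrees C
    repeat_fraction: float           # fraction of bases that are repeat-masked
    hairpin_dg: float                # free energy of hairpin folding, kcal/mol
    dimer_dg: float                  # free energy of dimerization, kcal/mol
    low_complexity_fraction: float   # fraction of bases flagged as low complexity
    synthesis_cycles: int            # cycles needed to synthesize the probe
    # Length of each non-specific BLAST hit as a fraction of the probe length.
    offtarget_hits: List[float] = field(default_factory=list)


def passes_region_filter(p: Probe, dimer_dg_cutoff: float = -26.0) -> bool:
    """Return True if the probe meets criteria (a) to (j) of step 11."""
    def hits_at_least(fraction: float) -> int:
        # Count non-specific hits covering at least this fraction of the probe.
        return sum(1 for h in p.offtarget_hits if h >= fraction)

    return (
        44 <= p.length <= 64                    # (a) 54 bp +/- 10 bp
        and abs(p.tm - 76.0) <= 4.5             # (b) 76 C +/- 4.5 C
        and p.repeat_fraction < 0.10            # (c) less than 10% repeat-masked
        and p.hairpin_dg > -10.0                # (d) hairpin folding
        and p.dimer_dg > dimer_dg_cutoff        # (e) dimerization (assumed sign)
        and p.low_complexity_fraction <= 0.10   # (f) low complexity bases
        and hits_at_least(0.80) == 0            # (g) no hits >= 80%
        and hits_at_least(0.75) <= 1            # (h) at most 1 hit >= 75%
        and hits_at_least(0.50) <= 4            # (i) at most 4 hits >= 50%
        and p.synthesis_cycles <= 178           # (j) synthesis cycles
    )
```

Under the same assumptions, the stricter rule for random control probes in step 12 would add the requirement that offtarget_hits is empty.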
Design Outcome

The 11,000 regions targeted by the design represent ~1.5% of the human genome and the
majority of these regions encompass 1 or more exons.

At least 1 probe was selected for 10,625 (97%) of the total 11,000 regions. 96% of regions
have 6 or more probes. Most regions are ~1200 bp in size and typically have 25-35 probes.
Larger regions may have as many as 62 probes.

375 regions have 0 probes, but in general other exons in the corresponding genes were more
successful. Furthermore, 50 of these regions with 0 probes were Y chromosome controls
and only 225 were exons of primary gene targets.

The final microarray design consists of 385,000 probes, of which 99% correspond to the
10,625 successful targeted regions. The remaining 1% of the array consists of random
sequence probes.

This 385k design will be used to profile a series of test samples and will ultimately be used
to generate a 72k design compatible with NimbleGen’s ‘4-plex’ format. The 5-6 probes from
each region that are determined to have the best performance during the test phase will
be selected for the 72k design.
References
Andronescu, M., R. Aguirre-Hernandez, et al. (2003). "RNAsoft: A suite of RNA secondary
structure prediction and design software tools." Nucleic Acids Res 31(13): 3416-22.
Andronescu, M., A. P. Fejes, et al. (2004). "A new algorithm for RNA secondary structure
design." J Mol Biol 336(3): 607-24.
Breslauer, K. J., R. Frank, et al. (1986). "Predicting DNA duplex stability from the base
sequence." Proc Natl Acad Sci U S A 83(11): 3746-50.
Hancock, J. M. and J. S. Armstrong (1994). "SIMPLE34: an improved and enhanced
implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of
clustered repetitive motifs in nucleotide sequences." Comput Appl Biosci 10(1): 67-70.
Hinrichs, A. S., D. Karolchik, et al. (2006). "The UCSC Genome Browser Database: update
2006." Nucleic Acids Res 34(Database issue): D590-8.
Smit, A. F. A., R. Hubley, et al. (1999-2004). RepeatMasker Open-3.0.
Sugimoto, N., S. Nakano, et al. (1996). "Improved thermodynamic parameters and helix
initiation factor to predict stability of DNA duplexes." Nucleic Acids Res 24(22): 4501-5.
Testing Of First Phase Design
- Identified 10 individuals with known CNVs covered by probes on the designed array.
- Hybridized patient DNA to arrays according to the manufacturer’s specifications (see below for
protocol).
- Ensured that probes within known CNVs identified the CNV; if not, the probe was eliminated.
- The remaining probes outside of the known CNVs were used to assess probe performance.
- The mean log2 ratio for each probe (taken from all 10 individual hybs) and the standard
deviation (SD) of the log2 ratio were used as selection criteria. The following are criteria I used
to select probes to eliminate:
a) Probes with a mean log2 ratio >0.25 and any SD
b) Probes with a mean log2 ratio >0.2 and a SD >0.15
c) Probes with a mean log2 ratio >0.1 and a SD >0.25
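The three elimination criteria can be sketched as a small function. This is an illustration only, with hypothetical function names. Two assumptions are made that the notes do not state: the thresholds are applied to the absolute value of the mean log2 ratio (so that probes skewed in either direction are removed), and the SD is the sample standard deviation across the 10 hybridizations.

```python
from statistics import mean, stdev


def probe_stats(log2_ratios):
    """Mean and sample SD of one probe's log2 ratios across hybridizations."""
    return mean(log2_ratios), stdev(log2_ratios)


def should_eliminate(mean_log2, sd):
    """Apply criteria a) to c); the absolute value of the mean is an assumption."""
    m = abs(mean_log2)
    return (
        m > 0.25                      # a) large mean shift, any SD
        or (m > 0.2 and sd > 0.15)    # b) moderate shift with moderate noise
        or (m > 0.1 and sd > 0.25)    # c) small shift with high noise
    )
```

For example, a probe with a mean log2 ratio of 0.22 is kept when its SD is 0.10 but eliminated when its SD is 0.20.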
Second Phase Design
- Used a script to select the best performing probe (log2 ratio closest to 0 with low SD) for each
exon within each target gene until all exons were covered. A second round selected the next
best probe for each exon within each gene and the cycle continued until all 135,000 probes for
the final 12-plex 135K array were filled.
- Manually scanned all 135,000 probes selected for the second phase design for genes that did
not have any probes and exons/regions that had fewer than 5 probes.
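The round-by-round selection described above can be sketched as follows. The original script is not available, so this is only an illustration under stated assumptions: the function name and input layout are hypothetical, exons are treated as a flat collection rather than grouped by gene, and probes are ranked by absolute mean log2 ratio first and SD second (the notes do not say how the two measures were combined).

```python
def select_probes(probes_by_exon, capacity):
    """Pick probes round by round until the array capacity is reached.

    probes_by_exon maps an exon id to a list of (probe_id, mean_log2, sd).
    Round 0 takes the best probe of every exon, round 1 the next best, etc.
    """
    # Rank each exon's probes: closest to 0 first, ties broken by lower SD.
    ranked = {
        exon: sorted(probes, key=lambda p: (abs(p[1]), p[2]))
        for exon, probes in probes_by_exon.items()
    }
    selected = []
    round_index = 0
    while len(selected) < capacity:
        added = False
        for probes in ranked.values():
            if round_index < len(probes):
                selected.append(probes[round_index][0])
                added = True
                if len(selected) == capacity:
                    return selected
        if not added:
            break  # every exon has run out of probes
        round_index += 1
    return selected


# Usage with toy data: exon e1 has two probes, exon e2 has one.
toy = {"e1": [("a", 0.10, 0.1), ("b", 0.01, 0.2)], "e2": [("c", -0.05, 0.1)]}
first_three = select_probes(toy, 3)
```

With the 135K design, capacity would be 135,000; exons that end up with few or no probes would still need the manual review described above.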
NIMBLEGEN HYB PROTOCOL AND NEXUS ANALYSIS PROTOCOL
500 ng of DNA was labeled according to the manufacturer’s specifications (Roche NimbleGen
CHG Analysis User’s Guide V5.1 Mar 16, 2009). Arrays were hybridized in the NimbleGen
Hybridization 12-plex System and washed manually. The arrays were scanned with a Molecular
Devices GenePix 4000B at 5 µm. Sample Tracking Control was used in all 12-plex reactions.
Data was extracted using NimbleScan v2.5. Only those arrays passing all QC measures
outlined in NimbleScan v2.5 Software User Guide were analysed for CNVs. SegMNT files were
generated using the following settings: Min segment difference = 0.1, Min segment length = 5,
Acceptance percentiles = 0.999, Averaging window = 1X, including non-unique probes, spatial
correction, normalized. SegMNT files were imported into Nexus v4.0. Gender of each sample was
included to allow for probe correction on the sex chromosomes. CNVs were identified with the
following settings: log2 ratio >0.2 (duplications) or < -0.2 (deletions) with a minimum of 5 probes
affected, and a maximum probe distance of <10 Kb (probes >10 Kb apart were treated as separate CNVs).
All CNVs identified in each hybridization were visually assessed for confidence and to determine if
they were present in the other parental hybridization but not called by the software. De novo CNVs
were identified when a set of probes identified by the platform algorithm was called as a deletion
in the child relative to both parents or as a duplication in the child relative to both parents on the
same platform. Males who inherited an X-chromosome CNV from the mother were also
identified by the software because of the correction.
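The calling thresholds above can be illustrated with a simplified sketch. This is not how Nexus or SegMNT work internally: it only groups consecutive probes on one chromosome that exceed the same threshold, using the stated minimum of 5 probes and the 10 Kb maximum probe distance. Deletions are taken as log2 ratios below -0.2 (a negative threshold is assumed), and the function name and input layout are hypothetical.

```python
def call_cnvs(probes, gain=0.2, loss=-0.2, min_probes=5, max_gap=10_000):
    """Group probes on one chromosome into candidate CNV calls.

    probes is a list of (position_bp, log2_ratio). Returns a list of
    (start, end, state, n_probes) tuples.
    """
    calls = []
    run, run_state = [], None  # positions in the current run and its state
    for pos, ratio in sorted(probes):
        if ratio > gain:
            state = "duplication"
        elif ratio < loss:
            state = "deletion"
        else:
            state = None
        # A probe extends the run only if it has the same state and is close.
        extends = (
            state is not None
            and state == run_state
            and pos - run[-1] < max_gap
        )
        if not extends:
            if run_state is not None and len(run) >= min_probes:
                calls.append((run[0], run[-1], run_state, len(run)))
            run, run_state = [], state
        if state is not None:
            run.append(pos)
    # Flush the final run.
    if run_state is not None and len(run) >= min_probes:
        calls.append((run[0], run[-1], run_state, len(run)))
    return calls
```

Calls produced this way would still need the visual assessment and the comparison against both parental hybridizations described above.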
QUANTITATIVE PCR VALIDATION
Primer Design
1. Selecting target sequence from the UCSC browser. Go to the UCSC genome browser, human
genome, and select the appropriate build. Go to the target sequence, i.e., if a gene, enter the gene name in
the search box and jump to the gene. If needed, zoom in to the exon of interest. Obtain the DNA sequence of
the selected region on the browser. On the ‘Get DNA’ page, make sure ‘mask repeats’ is checked and get
the DNA sequence with repeats visualized. Examine the sequence for repeats. Make a selection of
sequence that does not have repeats. Sequence length can range from one exon (about 200 bp) up to
1000 bp. Copy and paste this sequence into a .txt file and save.
2. Designing primers using PrimerExpress. Open PrimerExpress.
Open a DNA PCR document and import the relevant .txt file with the DNA sequence. Select design
parameters of min length of 100 bp and max length of 150 bp and run the program. Obtain the list of
primer pairs.
3. Ensuring primers are specific. Use in-silico PCR
(http://genome.ucsc.edu/cgi-bin/hgPcr?command=start). Make sure the appropriate genome version is selected and then load
your forward and reverse primer sequences and submit. The output generated should be a single
amplicon of the same length as denoted in PrimerExpress, with 100% sequence match. Click on
the browser location link to look at where the primers are positioned on the genome and make sure
the primers map to the exact exon input to the design software. Discard primers that return more
than one unique hit or do not generate a 100% match hit.
4. Screening primers for secondary structure. Use Beacon Design
(http://www.premierbiosoft.com/qpcr/index.html). Select the ‘SYBR green’ qPCR oligo
analysis page. Paste in the forward and reverse primer sequences, leave all other parameters
as default, and analyze. Check results for primer cross binding, primer self-self binding and
primer hairpins. Choose primers without secondary structure noted; however, in the event
secondary structure is found, choose primers with a delta G (Gibbs free energy) score
for each reaction greater than -3.0, indicating a low likelihood of the structure forming.
Only primers that are unique (in-silico PCR) and pass the secondary structure test (Beacon Design)
are used for qPCR.
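The acceptance rules in steps 3 and 4 can be summarized as one check per primer pair. This sketch does not call in-silico PCR or Beacon Design; it assumes their results have already been transcribed by hand into simple values, and the function name and argument layout are hypothetical.

```python
def primer_pair_ok(insilico_hits, expected_amplicon_len, delta_g_values,
                   min_delta_g=-3.0):
    """Return True if a primer pair passes the specificity and structure checks.

    insilico_hits: list of (amplicon_length_bp, percent_identity), one entry
        per amplicon reported by in-silico PCR.
    expected_amplicon_len: amplicon length reported by the design software.
    delta_g_values: delta G scores for any cross binding, self-self binding
        or hairpin structures reported; empty if none were found.
    """
    # Step 3: exactly one amplicon, of the expected length, with a 100% match.
    if len(insilico_hits) != 1:
        return False
    length, identity = insilico_hits[0]
    if length != expected_amplicon_len or identity < 100.0:
        return False
    # Step 4: every reported structure must have delta G greater than -3.0.
    return all(dg > min_delta_g for dg in delta_g_values)
```

For example, a pair with a single 120 bp perfect-match amplicon and one hairpin at delta G = -1.2 passes, while the same pair with a hairpin at delta G = -4.0 is rejected.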
Quantitative PCR protocol
Primers were designed within ID candidate genes within the identified CNVs, avoiding benign
polymorphisms listed in the DGV (version variation.hg18.v10.nov.2010), as described above.
Primers were designed using Primer Express (Applied Biosystems) and purchased from
Integrated DNA Technologies (www.idtdna.com) in lab-ready format. The patient's DNA was
diluted in PCR-grade water, and the quality and concentration were assessed using a
spectrophotometer (Nanodrop, Thermo Scientific). Primers were optimized for qPCR by
standard PCR amplification (50 ng/μl sample DNA concentration) on a positive and negative
control. PCR product was visualized on a 2% agarose gel stained with ethidium bromide. The
presence of only a single band of the expected size in the control DNA, the absence of
primer-dimers, and the absence of any amplification in the blank were considered indicative of a primer
set that could be used for qPCR.
CNVs were validated by qPCR (ΔΔCt method) using SYBR Green (Applied Biosystems).
Testing was performed in triplicate on child, mother, father and pooled Promega reference
sample, on an endogenous control gene (H6PD) and target gene. Promega pooled male and
female sample (catalogue#: G3041) was used for autosomal CNVs and Promega pooled
female-only sample (catalogue#: G1521) was used for X-linked CNVs. Sample DNAs were
diluted to 30 ng/µl and concentration and quality were re-assessed using a spectrophotometer
(Nanodrop, Thermo Scientific). 30 ng of sample DNA was combined in a 10 µl reaction mixture
with 5 nM forward and reverse primer and SYBR green master mix solution (Quanta
Biosciences). qPCR thermal cycle parameters were programmed for each test based on the
preceding standard PCR amplification protocol optimization (see above). Testing was performed
on an ABI7500 fast DNA sequencer and melt curve analysis was also conducted as an
additional QC metric (amplifications where the melt curve did not show the expected single peak
corresponding to the melting temperature of the amplicon were discarded). Results were
visualized using Applied Biosystems software (7500fast SDS software, Applied Biosystems) as
outlined in the software user guide. PCR amplification curves were examined visually for
amplification efficiency and the software was programmed to only use runs with 100% PCR
amplification efficiency for result generation. A heterozygous deletion was confirmed with the
following settings: RQ=0.5, range 0.3<RQ<0.7; a normal two-copy state was considered when
RQ=1, range 0.8<RQ<1.3; and a heterozygous duplication was considered when RQ=1.5,
range 1.3<RQ<2. Only those results that were replicated by all triplicates were considered true
positives.
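The ΔΔCt calculation and the RQ ranges above can be sketched as follows. The standard relation RQ = 2^(-ΔΔCt) is used, which assumes a doubling of product per cycle (100% amplification efficiency, as required above). The function names are hypothetical, the analysis in the study was done in the 7500 Fast SDS software rather than with a script like this, and the handling of RQ values that fall outside all three ranges (labelled "indeterminate" here) is an assumption.

```python
def relative_quantity(ct_target_sample, ct_control_sample,
                      ct_target_ref, ct_control_ref):
    """RQ by the delta-delta-Ct method, assuming 100% PCR efficiency."""
    delta_sample = ct_target_sample - ct_control_sample  # target vs H6PD, test DNA
    delta_ref = ct_target_ref - ct_control_ref           # target vs H6PD, reference DNA
    return 2.0 ** -(delta_sample - delta_ref)


def classify_rq(rq):
    """Map an RQ value to a copy-number state using the ranges in the text."""
    if 0.3 < rq < 0.7:
        return "heterozygous deletion"
    if 0.8 < rq < 1.3:
        return "normal two-copy state"
    if 1.3 < rq < 2.0:
        return "heterozygous duplication"
    return "indeterminate"  # outside every stated range


def confirmed_call(replicate_rqs):
    """Return the call only if every replicate gives the same state."""
    calls = {classify_rq(rq) for rq in replicate_rqs}
    if len(calls) == 1 and "indeterminate" not in calls:
        return calls.pop()
    return None
```

As a worked example, if the target gene crosses threshold 2 cycles after the control gene in the child but only 1 cycle after it in the reference, then ΔΔCt = 1 and RQ = 2^(-1) = 0.5, which falls in the heterozygous deletion range.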