Sequence capture work flow

Last updated 2014-10-07. Document created by Lovisa Gustafsson. Any feedback is welcome, and can be sent by email.
Lovisa Gustafsson, PhD
Department of Biological & Environmental Sciences
[email protected]
Phone: +46 31 7866665
Visiting Address: Carl Skottsbergs Gata 22 B, Göteborg
Postal Address: Box 463, 40530 Göteborg
Sequence Capture (hybrid enrichment) – an overview
Sequence capture is a DNA-RNA hybridisation†-based gene enrichment technique for precise
targeting of genomic regions of interest. In practice this means that instead of sequencing a
whole genome or transcriptome ‡, one can target specific regions of the DNA, and thereby
significantly reduce the sequencing costs. The use of NGS in phylogenetics has so far been
directed towards anonymous loci (e.g. restriction-site-associated DNA (RAD) tags), but because
sequence capture enables precise targeting of regions of interest, the quality of NGS data for
evolutionary studies can be maximised. Hybrid enrichment approaches currently offer the
greatest promise for high-throughput and cost-efficient data collection for phylogenetic
studies compared to the other genomic partitioning strategies available today (Lemmon and
Lemmon 2013).
Researchers interested in this technique should be aware that, as for all NGS projects,
i) a large amount of bioinformatics work is needed both before and after the actual lab work
and sequencing (bioinformatics expertise is required to develop probe sets and especially to
analyse raw sequence data), and ii) you need prior knowledge of the genome/transcriptome of
the organism you are working with, or at least of its close relatives.
However, there are many benefits of using this approach:
Ø You can select specific loci of interest that are suitable for your type of analysis,
for example relatively conserved exons for deep (old) questions or genes with
many introns for shallow questions.
Ø The number of targeted sequences can be very high (e.g., the smallest bait set from
MyBaits captures 1 Mb of target sequence, i.e., 500 genes of 2 kb each).
Ø Coupled with indexed DNA libraries (individual samples with unique barcode
sequences), multiplexing of individual samples upon capture is possible (e.g., 8
samples can easily be captured using one MyBaits capture reaction).
Ø Sequence capture can be performed on NGS platforms such as Illumina, which
reduces sequencing costs (e.g., 48 samples on a single MiSeq run produces
consistently good results).
Ø The method has a high sequencing efficiency compared to other genomic
partitioning strategies, which results in high sequence depths (see below).
Ø This method usually results in low formation of recombinants and chimeric
sequences.
Ø Also works on highly fragmented genomic DNA, which is often the result when
extracting from e.g. herbarium material.
† In this case, hybridization is the process of establishing a sequence-specific interaction
between two complementary strands of nucleic acids, joining them into a single complex.
‡ A transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and other
non-coding RNA, produced in one cell or a population of cells. For more info visit link.
Procedure:
The following example gives an overview of how to develop genetic markers for a group of
organisms, such as a genus or family, using a sequence capture approach and next generation
sequencing. This document is meant as an overview; for a more detailed description (i.e. the
exact lab-work procedure) please see the document “Sequence capture lab procedure”.
•
Gene selection - In order to maximise the outcome of a sequence capture analysis, it is
important to consider what gene regions best fit your analysis. Also, you need to have
prior knowledge of the genome and/or transcriptome of the organism you are working
with or its close relatives. For example, in angiosperms a single representative
transcriptome from the same family, compared against publicly available
reference genomes, is sufficient to design a probe set for several hundred genes.
However, the more distant the related transcriptome/genome is, and the lower
the coverage of that reference, the more genes may end up unsuitable in the final
analysis. So as a rule of thumb, plan to capture 3 times as many genes as you might
need, unless the closely related sequences are a near-complete representation of that
transcriptome or genome.
Step one is to download available sequence data that will be used to select gene
regions of interest. In many cases it is not possible to find a reference genome from
the study group, but a dataset from a related organism outside of the study group is
usually sufficient (e.g. Solanum lycopersicum [tomato], Solanaceae, was used as
reference genome in a recent study of the plant genus Antirrhinum in another plant
family, namely Plantaginaceae).
If an annotated genome is available from within your group, then it is sufficient to work
only with that (e.g. Medicago truncatula was used as reference genome in a recent study of
the plant genus Medicago; Sousa et al. PLoS ONE, in press).
o Which gene regions to work with depends largely on the questions asked, on each person’s
preferences, and on what genetic data are available. You can think about:
Ø Copy number
Ø Conserved regions
Ø Overall length, length of specific parts of the locus (e.g., exon versus
intron)
Ø Expressed genes or non-genic regions
Ø Introns, exons, or both
Ø Unlinked or closely linked
Ø Etc
o Examples of webpages where you can search for available genomic data:
§ GenBank (USA)
§ EMBL (Europe)
§ DNA Data Bank of Japan (Japan)
§ European Nucleotide Archive: http://www.ebi.ac.uk/ena/
§ Phytozome: http://www.phytozome.net/ (only plants)
§ JGI: http://genome.jgi-psf.org/
§ The UCSC Genome Browser: https://genome.ucsc.edu/
§ Ensembl: http://www.ensembl.org/index.html
§ KEGG GENES Database: http://www.genome.jp/kegg/genes.html
Following is an imaginary example of one possible way to identify gene regions for sequence
capture:
Task: You want to perform a phylogenetic analysis on the clade marked with a red dot
using a sequence capture approach. You want to have 20 genetic regions (i.e. 20
markers) to do the full analysis with.
Genetic data available: You have two transcriptomes available inside the clade of
interest (blue dots) and one annotated reference genome outside the clade of
interest (green dot).
Genetic region of interest: Single copy, unlinked, 2 kbp long, etc.
Procedure:
i) BLAST the two transcriptomes against each other. Recover the conserved regions.
ii) BLAST the annotated reference genome against itself. Recover the regions that
occur only in low copy number.
iii) BLAST the conserved regions from step i) against the low-copy regions from step ii).
Recover the low-copy, conserved regions.
iv) You may now narrow down your BLAST results even more. Let’s say you have recovered 500
regions in step iii). Now you want to add restrictions such as a minimum fragment
length. This returns e.g. 350 regions. You can keep adding your preferred restrictions until
you have reached a reasonable number of genetic regions to work with.
If you have decided you want to develop 20 markers to be used in your final analysis
you should initially select about 200-300 regions, since many of the regions you have
selected won’t be appropriate to use in the end.
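Once the BLAST searches have been run, the filtering in steps ii) to iv) can be scripted. The Python sketch below is only an illustration of that idea, not a published pipeline. It assumes the BLAST results were saved in the tabular format (-outfmt 6), of which only the first four columns (query, subject, percent identity, alignment length) are used. All function names, gene names, thresholds and the toy tables are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_hits(handle):
    """Parse BLAST tabular output (-outfmt 6); only columns 1-4 are used."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hits.append({"query": row[0], "subject": row[1],
                     "identity": float(row[2]), "length": int(row[3])})
    return hits


def low_copy_regions(self_hits, max_copies=1):
    """Step ii: regions of the reference that match at most max_copies
    regions (counting the hit to themselves) in a self-against-self search."""
    partners = defaultdict(set)
    for hit in self_hits:
        partners[hit["query"]].add(hit["subject"])
    return {region for region, subjects in partners.items()
            if len(subjects) <= max_copies}


def select_candidates(conserved_hits, low_copy, min_length=0):
    """Steps iii and iv: keep reference regions that are hit by a conserved
    transcript region, are low-copy, and pass a minimum alignment length."""
    return sorted({hit["subject"] for hit in conserved_hits
                   if hit["subject"] in low_copy and hit["length"] >= min_length})


def toy_table(rows):
    """Build a small tab-separated table for the example (made-up values)."""
    return io.StringIO("\n".join("\t".join(map(str, r)) for r in rows))


# Hypothetical self-BLAST of the reference genome: geneB and geneC hit each other.
self_hits = read_hits(toy_table([
    ("geneA", "geneA", 100.0, 2100),
    ("geneB", "geneB", 100.0, 1900),
    ("geneB", "geneC", 91.5, 1700),
    ("geneC", "geneC", 100.0, 1800),
    ("geneC", "geneB", 91.5, 1700),
    ("geneD", "geneD", 100.0, 900),
]))
low_copy = low_copy_regions(self_hits)

# Hypothetical hits of conserved transcript regions against the reference.
conserved_hits = read_hits(toy_table([
    ("transcript1", "geneA", 88.0, 2050),
    ("transcript2", "geneB", 90.0, 1850),
    ("transcript3", "geneD", 93.0, 850),
]))
candidates = select_candidates(conserved_hits, low_copy, min_length=1000)
print(candidates)
```

In this toy data only geneA is both low-copy and long enough. In a real project you would tighten or relax min_length and similar restrictions until the candidate list has a workable size.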
•
Probe design - The hybridization probe is an artificially generated fragment of RNA
that hybridizes with the complementary single-stranded DNA sequence and thereby targets
the DNA fragment of interest.
o To ensure that you cover the full gene region of interest the probe is designed in
a tiling manner, i.e. the probes are partially overlapping.
o For example, 3x tiling with 90-mer (90 bp) probes means that the first probe
starts at the beginning of the sequence, the second at 30 bp and the third at
60 bp (see figure), and so on until the whole region of interest is covered.
                                                            TGATCGGTTCCAAAATGATCGGTGATCGGGTCGGTTCCAAAATGATCGGTGATCGGGATCGGTGATCGGGTCGGTTCCAAAATGATCGGT
                              CCAAAATTGCCAAAAAGAAACCTTGATCGGTGATCGGTTCCAAAATGATCGGTGATCGGGTCGGTTCCAAAATGATCGGTGATCGGGATC
AAGAAACCCAAAAAGAAACCTTGATCGGTTCCAAAATTGCCAAAAAGAAACCTTGATCGGTGATCGGTTCCAAAATGATCGGTGATCGGG
AAGAAACCCAAAAAGAAACCTTGATCGGTTCCAAAATTGCCAAAAAGAAACCTTGATCGGTGATCGGTTCCAAAATGATCGGTGATCGGGTCGGTTCCAAAATGATCGGTGATCGGGATCGGTGATCGGGTCGGTTCCAAAATGATCGGT
o A company designs the probes (e.g. MYcroarray). You just send them the
sequence of the gene region of interest and tell them what probe density you
want. Shorter probes tolerate less divergence (higher specificity): e.g., 90-mer
probes allow for about 5% difference between target and probe, whereas 120-mer
probes allow for more than 10% difference. Note: The company can take
up to two months (45-60 days) to prepare your probes after you have sent them
the sequences.
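The tiling idea is easy to reproduce in a few lines of Python. This is only a sketch of the principle: the company does the actual probe design and will take more into account than this. The function name tile_probes is made up, and the target sequence is the 150 bp example sequence from the tiling figure.

```python
def tile_probes(target, probe_len=90, tiling=3):
    """Return (start, probe) pairs that cover target with overlapping probes.

    With probe_len=90 and tiling=3 a new probe starts every 30 bp.
    """
    if len(target) < probe_len:
        raise ValueError("target is shorter than one probe")
    step = probe_len // tiling
    starts = list(range(0, len(target) - probe_len + 1, step))
    # If the regular grid stops short of the end, add one probe flush with the end.
    last_start = len(target) - probe_len
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(start, target[start:start + probe_len]) for start in starts]


target = (
    "AAGAAACCCAAAAAGAAACCTTGATCGGTTCCAAAATTGCCAAAAAGAAACCTTGATCGG"
    "TGATCGGTTCCAAAATGATCGGTGATCGGGTCGGTTCCAAAATGATCGGTGATCGGGATC"
    "GGTGATCGGGTCGGTTCCAAAATGATCGGT"
)

probes = tile_probes(target)
print(target)
for start, probe in probes:
    # Indent each probe by its start position so the overlap is visible.
    print(" " * start + probe)
```

For this 150 bp target the probes start at 0, 30 and 60 bp, so the middle of the region is covered by three probes while the two ends are covered by fewer.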
•
DNA extraction
o Use your preferred DNA extraction method - Extraction kits from Qiagen are
commonly used.
o Check the length of your DNA fragments on an agarose gel: fragmented DNA
is a prerequisite for sequence capture methods. If you have worked with e.g.
herbarium material the DNA is usually already fragmented, but if you have fresh
material you will usually have longer DNA fragments that must be sheared.
o Use the NanoDrop instrument to measure your DNA concentration.
•
Shearing DNA – If your DNA fragments are too long, use the Covaris S220
instrument at the Sahlgrenska Genomics Core Facility to shear your DNA to an
appropriate length (the fragment length you want depends on your analysis).
o You need an introductory course at the core facility to be allowed to operate the
machine. Contact Ellen Hanson for booking and details. At the moment you can
ask Filipe De Sousa or José Luis Blanco Pastor for help as they have
permission to use it.
o Your samples must be transferred into special Covaris glass tubes that are provided
by the core facility. Talk to Ellen so that you can pick them up and prepare your
samples before using the machine.
•
Library construction – To generate high quality NGS data you need high quality
libraries that have the desired insert size and proper adapter ligation. In this step you
add an adapter with a unique barcode to each sample so that you can still identify the
samples after they have been pooled for NGS. There is also a fragment size selection step
where you select the fragments of preferred length; the selected fragments
are then amplified in a PCR reaction and your library is done!
Genomic DNA fragments are transformed into libraries ready for sequence
capture/sequencing using e.g. the NEXTflex™ DNA Rapid Sequencing Kits. These
kits are designed to prepare single, paired-end and multiplexed genomic DNA libraries
for sequencing using Illumina® platforms. The steps for DNA library construction are
as follows (see figure below):
1. End-repair (blunt end formation). After the shearing of the DNA some
fragments may not terminate in a base pair but rather have an overhang of
unpaired nucleotides (“sticky end”). This needs to be corrected so that
only blunt-ended fragments remain.
[Figure: example of blunt-ended DNA]
2. Adenylation (adding an “A” base). You add a single “A” base to the 3' ends of the
blunt fragments so that the adapters with unique barcodes can be attached.
[Figure: example of an A-overhang]
3. Ligation of adapters. You add adapters with unique barcodes (e.g. NEXTflex™
Barcodes) to your DNA fragments so that you can later pool your samples and
still be able to identify the specific samples. You can have up to 96 unique
barcodes (one unique adapter for each DNA sample), but using 48 barcodes is the
most cost-efficient.
4. Fragment size selection. You have probably decided to work only with
fragments of a specific size. You can use Agencourt AMPure XP magnetic
beads to select fragments by size. Which fragment size to select depends on
how large your targets are, and on whether your probes will capture the
whole target. E.g., to capture introns up to 500 bp long with exon-based
probes, most fragments should be at least 400 bp long (we expect 100 bp to be
bound to the probe, and 300 bp to extend across the intron, thus overlapping with
the fragments bound to the next exon). The read length will also play a role
in determining fragment lengths. Illumina MiSeq read length can vary (e.g.,
150 bp or 300 bp paired-end reads [see below]). It is not very
efficient to use 300 bp fragments with 300 bp paired-end reads, because both reads
would then cover the same 300 bases.
5. PCR of indexed DNA fragments. For better yield of the selected DNA
fragments a PCR is run for e.g. 14 cycles: 98°C, 2’; 14 x (98°C, 30”; 65°C,
30”; 72°C, 60”); 72°C, 4’. The necessary reagents are provided with the
NEXTflex kit.
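The fragment-length reasoning in step 4 is simple arithmetic and can be checked with a few lines of Python. The sketch below only restates that rule of thumb; the function names are made up, the 100 bp bound to the probe is the value from the example above, and real capture behaviour will vary.

```python
def intron_is_spanned(intron_len, fragment_len, bound_len=100):
    """Check whether fragments captured from the two flanking exons can
    together reach across an intron.

    bound_len is the part of each fragment assumed to hybridize to the
    exon-based probe; the rest of the fragment extends into the intron.
    Returns (spanned, overlap), where overlap is the number of intron bases
    reached from both sides (a negative value is an uncovered gap).
    """
    overhang = fragment_len - bound_len
    overlap = 2 * overhang - intron_len
    return overlap >= 0, overlap


def paired_read_overlap(fragment_len, read_len):
    """Number of fragment bases that are sequenced by both reads of a pair."""
    return max(0, min(fragment_len, 2 * read_len - fragment_len))


print(intron_is_spanned(500, 400))    # (True, 100)
print(intron_is_spanned(500, 300))    # (False, -100)
print(paired_read_overlap(300, 300))  # 300: both reads cover the whole fragment
print(paired_read_overlap(300, 150))  # 0: the two reads just meet in the middle
```

With 400 bp fragments the two sides of a 500 bp intron overlap by 100 bp, whereas 300 bp fragments leave a 100 bp gap. The second function shows why 300 bp fragments are a poor match for 300 bp paired-end reads: every base is simply read twice.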
[Figure: steps of DNA library construction, with adapters carrying unique barcodes]
•
Measure DNA concentrations of libraries. You need to check the concentration of
your DNA library. This is usually done on a NanoDrop instrument. If you want more
accurate measurements, you can use the Agilent 2200 TapeStation instrument.
Optimally you should have about 500 ng of library DNA.
•
Pooling of samples for sequence capture. You can pool some of your DNA libraries
before sequence capture. If you used 48 unique barcodes in the library construction,
you could in theory pool all 48 samples in one capture, but then you might not recover
enough DNA from each sample. How many samples to pool depends on your total target
size (i.e. the total length of all selected regions). Previous work has shown that pooling 8
samples works well for a total target size of about 150-200 kbp.
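Planning the capture pools is mostly bookkeeping, and a short script helps avoid mistakes. The Python sketch below splits 48 barcoded libraries into pools of 8 and works out how much of each library to add for an equal-mass pool. Note the assumptions: equal-mass pooling is a choice made for this illustration, and the sample names, concentrations and the 200 ng total are made-up values, not recommendations from the MYBaits protocol.

```python
def capture_pools(samples, pool_size=8):
    """Split barcoded libraries into groups of pool_size for capture."""
    return [samples[i:i + pool_size] for i in range(0, len(samples), pool_size)]


def volumes_for_equal_mass(conc_ng_per_ul, total_ng):
    """Volume (in µl) of each library so that all contribute the same mass
    and the pool contains total_ng of DNA in total."""
    ng_each = total_ng / len(conc_ng_per_ul)
    return [ng_each / conc for conc in conc_ng_per_ul]


samples = [f"sample_{n:02d}" for n in range(1, 49)]
pools = capture_pools(samples)
print(len(pools), "capture reactions needed")

# Hypothetical concentrations (ng/µl) measured for four libraries in one pool.
concentrations = [10.0, 20.0, 25.0, 50.0]
print(volumes_for_equal_mass(concentrations, total_ng=200))
```

For 48 libraries and 8 samples per capture this gives 6 capture reactions. In the toy pool each library should contribute 50 ng, which corresponds to 5, 2.5, 2 and 1 µl respectively.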
•
Sequence capture. There are basically 5 steps in the actual sequence capture
procedure, see figure below: you are interested in the red DNA strands – those are
your targeted DNA sequences. This procedure follows the MYBaits target enrichment
system (read this document for further details).
1. Denature DNA. From the library preparation you have your size selected
DNA fragments with unique barcoded adapters. The first step is to denature
the DNA to get single stranded DNA.
2. Hybridize baits to target. The biotinylated baits are the predesigned RNA
probes that will hybridize with the targeted complementary single-stranded
DNA sequence. Biotinylated means that there is a biotin molecule attached
to the RNA strand. (Step one and two are done in the same reaction in the
lab)
3. Capture target on beads. Streptavidin is a protein with a very high affinity
for biotin. When you add the magnetic streptavidin beads to your sample,
the beads will bind to your biotinylated probes.
4. Recover captured targets. With a magnetic separation you can now recover
the DNA fragments to which the probes have hybridized. After washing of
the probes you are left with your targeted DNA fragments.
5. Amplify. Before sending for sequencing you need to amplify the selected
fragments to ensure you have a high concentration of your targeted DNA
fragments. Since you still have the unique barcode adaptor on the selected
fragments you use the primers that are complementary to the adaptors. It is
important to limit the number of PCR cycles to get just enough material for
sequencing while minimizing PCR amplification bias.
•
Pooling of samples for NGS sequencing. You can pool as many samples as the number
of barcodes you have used in the library preparation, i.e. if you have used 48 unique
barcodes you pool all 48 samples in one tube.
•
Measure DNA concentration and fragment size. Use the Agilent 2200 TapeStation
instrument to validate your sample’s DNA concentration and fragment size.
•
NGS sequencing. There are several different techniques for next generation
sequencing. Prof Elaine Mardis from the Genome Institute at Washington University,
USA, has given two excellent educational lectures on “Next generation sequencing
technologies” that are well worth viewing. Links: Lecture in 2012, lecture in 2014. The
lectures were given as part of Current Topics in Genome Analysis, a lecture series at the
National Human Genome Research Institute that has been held every other year since 2003.
Other lectures are uploaded on the website.
o At BioEnv, the Illumina sequencing platform has been used successfully. This
follows a “Sequencing by Synthesis” approach and the number of cycles
determines the length of the read. This video explains the process.
o You have to decide what read length and what sequencing depth your analysis
requires.
§ Read length: The number of bases in each read. The longer the read
length the easier it might be to assemble the reads, but the quality of the
reads will be lower due to stochastic errors during sequencing.
§ Sequence depth: Basically the number of reads covering each position. The more
reads you get the more accurate the sequence assembly, and the deeper you sequence
the more info you can get out of your data, e.g. you will be able to
search for polymorphic alleles (see “phasing” below).
[Figure: low versus high sequence depth]
§ Paired-end reads: Paired ends refer to the two ends of the same DNA
fragment. With paired-end sequencing you sequence from both ends of
the same DNA fragment, with one forward read and one reverse read.
E.g. if a DNA fragment is 300 bp long you cover the whole fragment by
sequencing 150 bp from each end. Check the Illumina video above for
details.
When you look at a pair of reads, the sequences you see are,
conceptually, pointing towards each other on opposite strands. When
you align them to the genome, one read should align to the forward
strand, and the other should align to the reverse strand, at a higher base
pair position than the first one so that they are pointed towards one
another. This is known as an “FR” read – forward/reverse read.
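[Figure: a 300 bp fragment sequenced from both ends; read 1 (150 bp) and read 2 (150 bp) each run 5' to 3' on opposite strands, pointing towards each other]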
o For the Illumina MiSeq platform you can have paired-end reads up to 300 bp
(i.e. total coverage of 600 bp), but the longer the reads, the lower the sequence
quality you will receive (see figure to the right, or the online MiSeq System
Performance Parameters).
o If you want to focus on maximum sequence depth, the new Illumina NextSeq 500 might
be considered. You can get a sequence depth of up to 800 million paired-end reads,
which is considerably higher than for the MiSeq platform (up to 30 million
reads). However, the read length is limited to 150 bp, and pricing is higher.
o The Sahlgrenska Genomics Core Facility has the commonly used MiSeq
platform as well as the new NextSeq 500 platform. A large benefit of using this
facility is the personal service. You have direct contact with the people who
work there and can be part of the whole procedure. However, pricing is higher
than at e.g. SciLifeLab, and you have to pay for all labour hours (whereas at
SciLifeLab this is covered by public research funding).
Price example for a MiSeq run 2*150bp (7 Oct 2014):
Sahlgrenska Genomics Core Facility – 10 500 SEK,
SciLifeLab – 7 789 SEK
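Before choosing a platform it can help to make a rough estimate of the depth and cost per sample. The Python sketch below does this back-of-the-envelope calculation. The function names are made up, and the on-target fraction of 0.5 is purely an assumption for illustration (the real value depends on your probes and samples); the read numbers, target size and prices are the examples given in this document.

```python
def mean_depth(total_reads, read_len, n_samples, target_bp, on_target=0.5):
    """Rough expected mean depth per target base, per sample.

    total_reads counts every read of the run (forward and reverse reads each
    count once); on_target is the assumed fraction of bases that map to the
    targeted regions. Reads are assumed to be split evenly between samples.
    """
    on_target_bases = total_reads * read_len * on_target
    return on_target_bases / (n_samples * target_bp)


def cost_per_sample(run_price, n_samples):
    """Sequencing cost per sample when n_samples share one run."""
    return run_price / n_samples


# 24 million reads of 150 bp, 48 samples, 200 kbp total target size.
print(mean_depth(24_000_000, 150, 48, 200_000))   # 187.5
print(cost_per_sample(10_500, 48))                # 218.75 (SEK)
```

Under these assumptions each sample gets a mean depth of roughly 190x at a sequencing cost of about 220 SEK. Uneven pooling and uneven capture between loci mean that individual regions can end up far below the mean.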
•
Sequence handling. From the high-throughput sequencing you will receive millions of
reads covering the specific gene regions of interest.
o If you have for instance used the Illumina MiSeq platform with 2*150bp paired-end
sequencing you will get back between 24 and 30 million forward and reverse reads.
o The sequences come in FASTQ format and are separated into forward and
reverse reads for each DNA sample (you provide the sequence facility with the
unique adapter barcode sequences so that the different DNA samples can be
identified). The files are usually named like this:
“sample_nameR1.fastq” for the forward read
“sample_nameR2.fastq” for the reverse read
o Each FASTQ file includes the nucleotide sequences and their corresponding Phred
quality scores. The scores are named after Phred, a base-calling program for DNA
sequence traces that reads DNA sequence chromatogram files and analyses the peaks
to call bases, assigning quality scores ("Phred scores") to each base call. A Phred
score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 20
means a 1 in 100 chance that the base call is wrong. You use this score to
evaluate the quality of the read.
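To get a feeling for what the quality line in a FASTQ file means, you can decode it yourself. The Python sketch below reads a FASTQ record and converts the quality characters to Phred scores and error probabilities. It assumes four lines per record and the Phred+33 encoding (the quality character's ASCII code minus 33); check which encoding your sequencing facility delivers. The function names and the toy read are made up.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a FASTQ file with
    exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to a list of Phred scores."""
    return [ord(char) - offset for char in qual]


def error_probability(score):
    """Estimated probability that a base call with this score is wrong."""
    return 10 ** (-score / 10)


toy_file = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(read_fastq(toy_file))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, seq, scores)  # read1 ACGT [40, 40, 20, 2]
print([round(error_probability(s), 4) for s in scores])
```

In the toy read the first two bases have a score of 40 (1 error in 10 000), the third has 20 (1 in 100) and the last has only 2, which would normally be trimmed or filtered away.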
•
Bioinformatics. You now have millions of reads to work with and you need to use
bioinformatics to handle all the data. You can use the Albiorix Cluster at BioEnv; talk to
Mats Töpel to get an account. Handling of data involves different steps that can be
performed using several scripts, e.g. the scripts developed by Yann Bertrand and Filipe
De Sousa (under publication). Please contact either of them to ask for permission to
use the scripts and request instructions on how to use them. As they have not yet been
published, depending on the case you should discuss co-authorship or formal
acknowledgments in resulting papers. The different steps involved in data handling are
the following:
o The first thing to do is to strip the adapter sequences from the reads.
o Second, you filter the reads for quality with a Phred score threshold of 20.
o Reads are mapped to the reference sequence using CLC Genomics.
[Figure: reads mapped to the reference sequence]
o Phasing of alleles with SAMtools Phase. Some may just make a consensus
sequence out of the mapped reads. However, you may then miss out on important
information. Instead you should perform a phasing of alleles.
§ Diploid organisms have two copies of each gene, one (and, therefore, one
allele) on each of the two homologous chromosomes. If both alleles are the
same, they and the organism are homozygous with respect to that gene. If the
alleles are different, they and the organism are heterozygous with
respect to that gene.
§ With the phasing of alleles you will find the sequences that have
heterozygous regions. Below is a simplified example. Let’s
say you thought you were working with a diploid individual, but
thanks to the allele phasing you have now discovered that this
individual in fact is a polyploid (e.g., between 3x and 6x). Without
the phasing you would not have discovered this!
§ For higher polyploids the phasing could just be continued, if there
is sufficient read depth to support accurate phasing for each allele.
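[Figure: simplified allele phasing example. Mapped reads 1-12 are first separated into two groups (reads 1, 2, 3, 5, 8, 10, 12 and reads 4, 6, 7, 9, 11); the second group is then separated again (reads 4, 7, 9 and reads 6, 11).]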
o Retrieve the consensus sequences (FASTA format) using e.g. SAMtools
mpileup.
o Now you have your sequences in hand and can perform your preferred analyses!
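The first two data handling steps (adapter stripping and quality filtering) can be illustrated with a small Python sketch. This is not the script set mentioned above, which is unpublished, and it is much simpler than dedicated trimming tools: it only finds exact, complete adapter matches and filters on the mean Phred score of a read, which is one possible reading of "a threshold of 20". The adapter sequence, function names and toy reads are made up, and Phred+33 encoding is assumed.

```python
def strip_adapter(seq, qual, adapter):
    """Remove the adapter and everything after it, if the adapter occurs."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def mean_quality(qual, offset=33):
    """Mean Phred score of a read (0.0 for an empty read)."""
    if not qual:
        return 0.0
    return sum(ord(char) - offset for char in qual) / len(qual)


def clean_reads(records, adapter, min_mean_q=20, min_len=1):
    """Yield reads that survive adapter stripping and quality filtering."""
    for name, seq, qual in records:
        seq, qual = strip_adapter(seq, qual, adapter)
        if len(seq) >= min_len and mean_quality(qual) >= min_mean_q:
            yield name, seq, qual


ADAPTER = "GGGTTTCCC"  # made-up placeholder; use the adapter of your own kit

toy_reads = [
    ("r1", "ACGTACGT" + ADAPTER + "AA", "I" * 19),  # good read with adapter
    ("r2", "ACGTACGT", "########"),                 # low quality (score 2)
    ("r3", ADAPTER + "AC", "I" * 11),               # nothing left after stripping
]

kept = list(clean_reads(toy_reads, ADAPTER))
print(kept)  # [('r1', 'ACGTACGT', 'IIIIIIII')]
```

Only the first toy read survives: the second fails the quality threshold and the third consists of adapter only. For paired-end data you would also need to keep forward and reverse reads in sync, which this sketch does not do.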
Acknowledgment: Special thanks to Filipe de Sousa, who gave constructive input in
creating this document. Thanks to Alexandre Antonelli, Bernard Pfeil and Mats Töpel for
commenting and giving valuable feedback on the text.
Disclaimer: Although the author has made every effort to ensure that the information in
this document is correct, the author disclaims any liability to any party for any loss,
damage, or disruption caused by errors or omissions, whether such errors or omissions
result from negligence, accident, or any other cause.
References:
Lemmon EM and Lemmon AR (2013) High-throughput genomic data in systematics and
phylogenetics. Annual Review of Ecology, Evolution, and Systematics, 44, 99-121.
Sousa F, Bertrand YJK, Nylinder S, Oxelman B, Eriksson JS, and Pfeil BE (2014)
Phylogenetic properties of 50 nuclear loci in Medicago (Leguminosae) generated using
multiplexed sequence capture and Next-Generation Sequencing. PLoS ONE, in press.