Last updated 2014-10-07

Document created by Lovisa Gustafsson. Any feedback is welcome, and can be sent by email.

Lovisa Gustafsson, PhD
Department of Biological & Environmental Sciences
[email protected]
Phone: +46 31 7866665
Visiting Address: Carl Skottsbergs Gata 22 B, Göteborg
Postal Address: Box 463, 40530 Göteborg

Sequence Capture (hybrid enrichment) - an overview

Sequence capture is a DNA-RNA hybridisation†-based gene enrichment technique for precise targeting of genomic regions of interest. In practice this means that instead of sequencing a whole genome or transcriptome‡, one can target specific regions of the DNA and thereby significantly reduce sequencing costs. The use of NGS in phylogenetics has so far been directed towards anonymous loci (e.g. restriction-site-associated DNA (RAD) tags), but since sequence capture enables precise targeting of regions of interest, the quality of NGS data for evolutionary studies can be maximised. Currently, hybrid enrichment approaches offer the greatest promise for high-throughput and cost-efficient data collection for phylogenetic studies compared to other available genomic partitioning strategies (Lemmon and Lemmon 2013).

Researchers interested in this technique should be aware that, as for all NGS projects, i) a large amount of bioinformatics work needs to be invested both before and after the actual lab work and sequencing (bioinformatics expertise is required to develop probe sets and especially to analyse raw sequence data), and ii) you need prior knowledge of the genome/transcriptome of the organism you are working with, or at least of its close relatives.

However, there are many benefits of using this approach:
Ø You can select specific loci of interest that are suitable for your type of analysis, for example relatively conserved exons for deep (old) questions or genes with many introns for shallow questions.
Ø The number of targeted sequences can be very high (e.g., the smallest bait set from MyBaits captures 1 Mb of target sequence, i.e., 500 genes of 2 kb each).
Ø Coupled with indexed DNA libraries (individual samples with unique barcode sequences), multiplexing of individual samples during capture is possible (e.g., 8 samples can easily be captured using one MyBaits capture reaction).
Ø Sequence capture can be performed on NGS platforms such as Illumina, which reduces sequencing costs (e.g., 48 samples on a single MiSeq run consistently produce good results).
Ø The method has a high sequencing efficiency compared to other genomic partitioning strategies, which results in high sequence depths (see below).
Ø The method usually results in low formation of recombinants and chimeric sequences.
Ø It also works on highly fragmented genomic DNA, which is often the result when extracting from e.g. herbarium material.

† In this case, hybridization is the process of establishing a sequence-specific interaction between two complementary strands of nucleic acids into a single complex.
‡ A transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and other non-coding RNA, produced in one cell or a population of cells. For more info visit link.

Procedure:
The following example gives an overview of how to develop genetic markers for a group of organisms, such as a genus or family, using a sequence capture approach and next generation sequencing. This document is meant as an overview; for a more detailed description (i.e. the exact lab-work procedure) please see the document “Sequence capture lab procedure”.

• Gene selection - In order to maximise the outcome of a sequence capture analysis, it is important to consider which gene regions best fit your analysis. You also need prior knowledge of the genome and/or transcriptome of the organism you are working with or of its close relatives.
As an example, in angiosperms a single representative transcriptome from the same family is sufficient, when compared to publicly available reference genomes, to design a probe set for several hundred genes. However, the more distant the related transcriptome/genome is, and the lower the coverage of that reference, the more genes may end up unsuitable in the final analysis. So as a rule of thumb, plan to capture 3 times as many genes as you might need, unless the closely related sequences are a near-complete representation of that transcriptome or genome.

Step one is to download available sequence data that will be used to select gene regions of interest. In many cases it is not possible to find a reference genome from within the study group, but a dataset from a related organism outside of the study group is usually sufficient (e.g. Solanum lycopersicum [tomato], Solanaceae, was used as reference genome in a recent study of the plant genus Antirrhinum in another plant family, namely Plantaginaceae). If the annotated genome belongs to your group then it is sufficient to work only with that (e.g. Medicago truncatula was used as reference genome in a recent study of the plant genus Medicago; Sousa et al. PLoS ONE, in press).

o Which gene regions to work with depends largely on the questions asked and each person’s preferences, and also on what genetic data is available.
You can think about:
Ø Copy number
Ø Conserved regions
Ø Overall length, length of specific parts of the locus (e.g., exon versus intron)
Ø Expressed genes or non-genic regions
Ø Introns, exons, or both
Ø Unlinked or closely linked
Ø Etc.

o Examples of webpages where you can search for available genomic data:
§ GenBank (USA)
§ EMBL (Europe)
§ DNA Data Bank of Japan (Japan)
§ European Nucleotide Archive: http://www.ebi.ac.uk/ena/
§ Phytozome: http://www.phytozome.net/ (only plants)
§ JGI: http://genome.jgi-psf.org/
§ The UCSC Genome Browser: https://genome.ucsc.edu/
§ Ensembl: http://www.ensembl.org/index.html
§ KEGG GENES Database: http://www.genome.jp/kegg/genes.html

The following is an imaginary example of one possible way to identify gene regions for sequence capture:

Task: You want to perform a phylogenetic analysis on the clade marked with a red dot using a sequence capture approach. You want to have 20 genetic regions (i.e. 20 markers) for the full analysis.
Genetic data available: You have two transcriptomes available inside the clade of interest (blue dots) and one annotated reference genome outside the clade of interest (green dot).
Genetic region of interest: Single copy, unlinked, 2 kbp long, etc.

Procedure:
i) BLAST the two transcriptomes against each other. Recover the conserved regions.
ii) BLAST the annotated reference genome against itself. Recover the regions that occur only in low copy number.
iii) BLAST the conserved regions from step i) against the low-copy regions from step ii). Recover the low-copy, conserved regions.
iv) You may now narrow down your BLAST results even further. Let's say you have recovered 500 regions in step iii). Now you want to add restrictions such as a minimum fragment length. This returns e.g. 350 regions. You can then add further restrictions until you have reached a reasonable number of genetic regions to work with.
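The stepwise narrowing described in step iv) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Region` record, its field names, the toy values and the thresholds are all hypothetical and are not part of BLAST or of any published pipeline. In practice the lengths and copy numbers would come from your own BLAST results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical candidate region recovered in step iii)."""
    name: str
    length_bp: int  # length of the conserved region in base pairs
    copies: int     # number of copies found in the reference genome


def filter_regions(regions, min_length_bp=0, max_copies=None):
    """Keep only the regions that satisfy every restriction that is set."""
    kept = []
    for region in regions:
        if region.length_bp < min_length_bp:
            continue  # too short
        if max_copies is not None and region.copies > max_copies:
            continue  # too many copies
        kept.append(region)
    return kept


# Toy candidates (made-up values, for illustration only).
candidates = [
    Region("locus_A", 2400, 1),
    Region("locus_B", 800, 1),
    Region("locus_C", 2100, 3),
    Region("locus_D", 1900, 1),
]

# Add one restriction at a time and watch the candidate list shrink.
long_enough = filter_regions(candidates, min_length_bp=1500)
single_copy = filter_regions(long_enough, max_copies=1)
selected_names = [region.name for region in single_copy]
print(selected_names)
```

Applying the restrictions one at a time, rather than all at once, makes it easy to see how many candidates each criterion removes and to relax a criterion if too few regions remain.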
If you have decided that you want to develop 20 markers for your final analysis, you should initially select about 200-300 regions, since many of the regions you have selected will turn out not to be appropriate to use in the end.

• Probe design - The hybridization probe is an artificially generated fragment of RNA that is used to target the specific DNA fragment of interest. The probe will hybridize with the complementary single-stranded DNA sequence and thus target the DNA fragment of interest.
o To ensure that you cover the full gene region of interest, the probes are designed in a tiling manner, i.e. the probes are partially overlapping.
o For example, a 3x tiling means that for 90-mer (90 bp) probes the first probe starts at the beginning of the sequence, the second starts at 30 bp and the third at 60 bp (see figure), and so on until the whole region of interest is covered.
[Figure: three overlapping 90-mer probes tiled along the target sequence]
o A company designs the probes (e.g. MYcroarray). You just send them the sequence of the gene region of interest and tell them what probe density you want. Shorter probes allow for less divergence (higher specificity), e.g., 90-mer probes allow for about 5% difference between target and probe, whereas 120-mer probes allow for more than 10% difference. Note: The company can take up to two months (45-60 days) to prepare your probes after you have sent them the sequences.

• DNA extraction
o Use your preferred DNA extraction method. Extraction kits from Qiagen are commonly used.
o Check the length of your DNA fragments on an agarose gel. Fragmented DNA is a prerequisite for sequence capture methods. If you have worked with e.g. herbarium material, the DNA is usually already fragmented, but if you have fresh material you will usually have longer DNA fragments that must be sheared.
o Use the NanoDrop instrument to measure your DNA concentration.

• Shearing DNA - If your DNA fragments are too long, use the Covaris S220 instrument at the Sahlgrenska Genomics Core Facility to shear your DNA to an appropriate length (the desired fragment length depends on your analysis).
o You need to take an introductory course at the core facility to be allowed to operate the machine. Contact Ellen Hanson for booking and details. At the moment you can ask Filipe De Sousa or José Luis Blanco Pastor for help, as they have permission to use it.
o Your samples must be transferred into special Covaris glass tubes that are provided by the core facility. Talk to Ellen so that you can pick them up and prepare your samples before using the machine.

• Library construction - To generate high quality NGS data you need high quality libraries that have the desired insert size and proper adapter ligation. In this step you add an adapter with a unique barcode to each sample so that you can still identify the samples after they have been pooled for NGS. There is a fragment size selection step where you select the fragments of preferred length; the selected fragments are then amplified in a PCR reaction and your library is done! Genomic DNA fragments are transformed into libraries ready for sequence capture/sequencing using e.g. the NEXTflex™ DNA Rapid Sequencing Kits. These kits are designed to prepare single, paired-end and multiplexed genomic DNA libraries for sequencing using Illumina® platforms. The steps for DNA library construction are as follows (see figure below):
1. End-repair (blunt end formation).
After the shearing of the DNA, some fragments may not terminate in a base pair but rather have an overhang of unpaired nucleotides (“sticky end”). This needs to be corrected so that only blunt-ended fragments remain.
[Figure: example of blunt-ended DNA]
2. Adenylation (adding an “A” base). You add an “A” base to be able to attach the adapters with unique barcodes.
[Figure: example of an A-overhang]
3. Ligation of adapters. You add adapters with unique barcodes (e.g. NEXTflex™ Barcodes) to your DNA fragments so that you can later pool your samples and still be able to identify the specific samples. You can have up to 96 unique barcodes (one unique adapter for each DNA sample), but it is most cost-efficient to use 48 barcodes.
4. Fragment size selection. You have probably decided to work only with fragments of a specific size. You can use Agencourt AMPure XP magnetic beads to select fragments by size. The fragment size matters in relation to how large your targets are, and whether your probes will capture the whole target. E.g., to capture introns up to 500 bp long with exon-based probes, most fragments should be at least 400 bp long (we expect 100 bp to be bound to the probe and 300 bp to extend across the intron, thus overlapping with the fragments bound to the next exon). The read length will also play a role in determining fragment lengths. Illumina MiSeq read lengths can vary (e.g., 150 bp or 300 bp paired-end reads [see below]). It is not very efficient to use 300 bp fragments with 300 bp paired-end reads, since both reads would then cover the same bases.
5. PCR of indexed DNA fragments. For better yield of the selected DNA fragments a PCR is run for e.g. 14 cycles (98°C for 2 min; 14 cycles of 98°C for 30 s, 65°C for 30 s and 72°C for 60 s; 72°C for 4 min). The necessary reagents are provided with the NEXTflex kit.
[Figure: adapters with unique barcodes]

• Measure DNA concentrations of libraries. You need to check the concentration of your DNA library.
This is usually done on a NanoDrop instrument. If you want more accurate measurements, you can use the Agilent 2200 TapeStation instrument. Optimally you should have about 500 ng of library DNA.

• Pooling of samples for sequence capture. You can pool some of your DNA libraries before sequence capture. If you used 48 unique barcodes in the library construction, you could in theory pool all 48 samples in one capture, but then you might not get enough of each sample. How many samples to pool depends on your total target size (i.e. the total length of all selected regions). Previous work has shown that pooling 8 samples works well for a total target size of about 150-200 kbp.

• Sequence capture. There are basically 5 steps in the actual sequence capture procedure (see figure below). You are interested in the red DNA strands; those are your targeted DNA sequences. This procedure follows the MyBaits target enrichment system (read this document for further details).
1. Denature DNA. From the library preparation you have your size-selected DNA fragments with uniquely barcoded adapters. The first step is to denature the DNA to get single-stranded DNA.
2. Hybridize baits to target. The biotinylated baits are the predesigned RNA probes that will hybridize with the targeted complementary single-stranded DNA sequence. Biotinylated means that there is a biotin molecule attached to the RNA strand. (Steps one and two are done in the same reaction in the lab.)
3. Capture target on beads. Streptavidin is a protein with a very high affinity for biotin. When you add the magnetic streptavidin beads to your sample, the beads will bind to your biotinylated probes.
4. Recover captured targets. With a magnetic separation you can now recover the DNA fragments to which the probes have hybridized. After washing of the probes you are left with your targeted DNA fragments.
5. Amplify.
Before sending your samples for sequencing you need to amplify the selected fragments to ensure you have a high concentration of your targeted DNA fragments. Since the selected fragments still carry the adapters with unique barcodes, you use primers that are complementary to the adapters. It is important to limit the number of PCR cycles to get just enough material for sequencing while minimizing PCR amplification bias.

• Pooling of samples for NGS sequencing. You can pool as many samples as the number of barcodes you have used in the library preparation, i.e. if you have used 48 unique barcodes you can pool all 48 samples in one tube.

• Measure DNA concentration and fragment size. Use the Agilent 2200 TapeStation instrument to validate your sample’s DNA concentration and fragment size.

• NGS sequencing. There are several different techniques for next generation sequencing. Prof Elaine Mardis from the Genome Institute at Washington University, USA, has given two excellent educational lectures on “Next generation sequencing technologies” that are well worth viewing. Links: Lecture in 2012, lecture in 2014. The lectures were given in the Current Topics in Genome Analysis series at the National Human Genome Research Institute, which has been held every other year since 2003. Other lectures are available on the website.
o At BioEnv, the Illumina sequencing platform has been used successfully. It follows a “Sequencing by Synthesis” approach, and the number of cycles determines the length of the read. This video explains the process.
o You have to decide what read length and what sequencing depth your analysis requires.
§ Read length: The number of bases in each read. The longer the reads, the easier it may be to assemble them, but the quality of the reads will be lower due to stochastic errors during sequencing.
§ Sequence depth: Basically the number of reads.
The more reads you get, the more accurate the sequence assembly, and the deeper you sequence, the more information you can get out of your data, e.g. you will be able to search for polymorphic alleles (see “phasing” below).
[Figure: low sequence depth versus high sequence depth]
§ Paired-end reads: Paired ends refer to the two ends of the same DNA fragment. With paired-end sequencing you sequence from both ends of the same DNA fragment, with one forward read and one reverse read. For example, if a DNA fragment is 300 bp long you cover the whole fragment by sequencing 150 bp from each end. Check the Illumina video above for details. When you look at a pair of reads, the sequences you see are, conceptually, pointing towards each other on opposite strands. When you align them to the genome, one read should align to the forward strand, and the other should align to the reverse strand, at a higher base pair position than the first one so that they point towards one another. This is known as an “FR” read (forward/reverse read).
[Figure: Read 1 (150 bp) and Read 2 (150 bp) sequenced from opposite ends of a 300 bp fragment]
o For the Illumina MiSeq platform you can have paired-end reads of up to 300 bp (i.e. a total coverage of 600 bp), but the longer the reads, the lower the sequence quality (see figure to the right, or the online MiSeq System Performance Parameters).
o If you want to focus on maximum sequence depth, the new Illumina NextSeq 500 might be considered. You can get a sequence depth of up to 800 million paired-end reads, which is considerably higher than for the MiSeq platform (up to 30 million reads). However, the read length is limited to 150 bp, and pricing is higher.
o The Sahlgrenska Genomics Core Facility has the commonly used MiSeq platform as well as the new NextSeq 500 platform. A large benefit of using this facility is the personal service. You have direct contact with the people who work there and can be part of the whole procedure. However, pricing is higher than at e.g.
SciLifeLab, and you have to pay for all labour hours (whereas at SciLifeLab this is covered by public research funding). Price example for a MiSeq run 2*150 bp (7 Oct 2014): Sahlgrenska Genomics Core Facility: 10 500 SEK; SciLifeLab: 7 789 SEK.

• Sequence handling. From the high-throughput sequencing you will receive millions of reads covering the specific gene regions of interest.
o If you have, for instance, used the Illumina MiSeq platform with 2*150 bp paired-end sequencing, you will get back 24-30 million forward and reverse reads.
o The sequences come in FASTQ format and are separated into forward and reverse reads for each DNA sample (you provide the sequencing facility with the unique adapter barcode sequences so that the different DNA samples can be identified). The files are usually named like this:
“sample_nameR1.fastq” for the forward read
“sample_nameR2.fastq” for the reverse read
o Each FASTQ file includes the nucleotide sequences and their corresponding Phred quality scores. Phred is a base-calling program for DNA sequence traces that reads DNA sequence chromatogram files and analyses the peaks to call bases, assigning quality scores ("Phred scores") to each base call. You use these scores to evaluate the quality of the reads.

• Bioinformatics. You now have millions of reads to work with and you need to use bioinformatics to handle all the data. You can use the Albiorix cluster at BioEnv; talk to Mats Töpel to get an account. Handling of the data involves different steps that can be performed using several scripts, e.g. the scripts developed by Yann Bertrand and Filipe De Sousa (under publication). Please contact either of them to ask for permission to use the scripts and to request instructions on how to use them. As the scripts have not yet been published, depending on the case you should discuss co-authorship or formal acknowledgments in resulting papers.
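To make the FASTQ quality information described above more concrete, here is a minimal Python sketch of how the quality line of a read can be decoded. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base call is wrong. The sketch assumes the Phred+33 ASCII encoding and simple four-line FASTQ records. These are assumptions about your files, not something guaranteed by this document, and the function names and the toy record are invented for illustration. It is not the script set mentioned above.

```python
import io


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes the Phred+33 encoding (offset 33); adjust if your data differ.
    """
    return [ord(char) - offset for char in quality_line]


def error_probability(score):
    """Convert a Phred score Q into an error probability, 10^(-Q/10)."""
    return 10 ** (-score / 10)


def read_fastq(handle):
    """Yield (read_id, sequence, quality_line) from four-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file
        sequence = handle.readline().strip()
        handle.readline()  # separator line starting with "+"
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


# Toy record (made up): "I" encodes Q40 and "5" encodes Q20 at offset 33.
example = io.StringIO("@read1\nACGT\n+\nII55\n")
records = list(read_fastq(example))
read_id, sequence, quality = records[0]
scores = phred_scores(quality)
mean_quality = sum(scores) / len(scores)
print(read_id, sequence, scores, mean_quality)
```

For a real file you would pass an opened file handle, e.g. `open("sample_nameR1.fastq")`, to `read_fastq` instead of the in-memory example.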
The different steps involved in data handling are the following:
o First, strip the adapter sequences from the reads.
o Second, filter the reads for quality using a Phred score threshold of 20.
o Map the reads to the reference sequence using CLC Genomics.
[Figure: reads mapped to the reference sequence]
o Phase the alleles with SAMtools Phase. Some may just make a consensus sequence out of the mapped reads. However, you may then miss out on important information. Instead you should perform a phasing of alleles.
§ Diploid organisms have one copy of each gene (and, therefore, one allele) on each of the two homologous chromosomes. If both alleles are the same, they and the organism are homozygous with respect to that gene. If the alleles are different, they and the organism are heterozygous with respect to that gene.
§ With the phasing of alleles you will find the sequences that have regions with heterozygous alleles. Below is a simplified example. Let's say you thought you were working with a diploid individual, but thanks to the allele phasing you have now discovered that this individual is in fact a polyploid (e.g., between 3x and 6x). Without the phasing you would not have discovered this!
§ For higher polyploids the phasing could simply be continued, if there is sufficient read depth to support accurate phasing for each allele.
[Figure: simplified example of allele phasing, in which reads numbered 1-12 are sorted into separate alleles]
o Retrieve the consensus sequences (FASTA format) using e.g. SAMtools mpileup.
o Now you have your sequences in hand and can perform your preferred analyses!

Acknowledgment: Special thanks to Filipe de Sousa, who has given constructive input in creating this document. Thanks to Alexandre Antonelli, Bernard Pfeil and Mats Töpel for commenting and giving valuable feedback on the text.
Disclaimer: Although the author has made every effort to ensure that the information in this document is correct, the author disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause.

References:
Lemmon EM, Lemmon AR (2013) High-throughput genomic data in systematics and phylogenetics. Annual Review of Ecology, Evolution, and Systematics 44: 99-121.
Sousa F, Bertrand YJK, Nylinder S, Oxelman B, Eriksson JS, Pfeil BE (2014) Phylogenetic properties of 50 nuclear loci in Medicago (Leguminosae) generated using multiplexed sequence capture and Next-Generation Sequencing. PLoS ONE, in press.