Next-Generation Sequencing applied to aDNA
Hands-on session
June 13, 2014
Ludovic Orlando - [email protected]
Mikkel Schubert - [email protected]
Aurélien Ginolhac - [email protected]
Hákon Jónsson - [email protected]
Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen
http://geogenetics.ku.dk/research/research_groups/palaeomix_group/
1 Introduction and outline
The exercise will consist of the following parts:
1. Run example PALEOMIX projects
   (a) Process and align synthetic data against human mitochondria
   (b) Generate phylogeny based on protein coding genes
2. Setup and run BAM pipeline for modern and ancient horses
   (a) Examine the results from the mapping
   (b) Examine patterns of post-mortem damage in the ancient sample
   (c) Optimize procedure for the ancient sample
3. Setup and run Phylogenetic pipeline for modern and ancient horses
   (a) Visualize the resulting phylogeny
2 Pre-requisites
Before attempting to complete any of the analyses in this exercise, it is necessary to
“source” the following script in order to set up the environment variables needed for the
pipelines:
source /home/local/27626/exercises/paleomix/setup_env
set MY_HOME="/home/local/ngs_course/${USER}"
If using BASH, use the following commands instead (if in doubt, just do both):
source /home/local/27626/exercises/paleomix/setup_env.sh
MY_HOME="/home/local/ngs_course/${USER}"
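Before continuing, it can be worth confirming that the environment was actually set up. The following is a minimal sketch for BASH-like shells, not part of the course material; the helper name "missing_tools" is hypothetical, and the tool names are simply the commands used later in this exercise:

```shell
# Print the name of every listed tool that cannot be found on the PATH.
# (Hypothetical helper, written for this sketch only.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands used later in this exercise.
missing=$(missing_tools bam_pipeline phylo_pipeline nw_display)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
fi

# MY_HOME is used as the base folder throughout the exercise.
if [ -z "${MY_HOME:-}" ]; then
    echo "MY_HOME is not set"
fi
```

If nothing is printed, the commands were found and MY_HOME is set.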
3 Running the example projects
The example project consists of synthetic data generated using primate mitochondrial sequences, which is mapped onto the revised Cambridge Reference Sequence (rCRS). Copy the
example project to your home folder and run the read processing and mapping steps using
the following commands:
mkdir $MY_HOME/my_ancient_dna
cd $MY_HOME/my_ancient_dna
cp -a /home/local/27626/exercises/paleomix/examples/phylo_pipeline .
cd phylo_pipeline/alignment
nice bam_pipeline run 000_makefile.yaml
The resulting files are located in the same folder; consider the following files (for the
Orangutan mitochondrial genome):
sumatran_orangutan.rCRS.realigned.bam: The Orangutan reads aligned against the human
mitochondrial genome; the “.realigned” suffix signifies that local realignment has
been carried out around indels.
sumatran_orangutan.rCRS.coverage: Table of average coverage for each chromosome / contig.
sumatran_orangutan.rCRS.depths: Table containing a depth-of-coverage histogram for each
chromosome / contig.
sumatran_orangutan.summary: Summary of the alignment(s), including information about
read trimming, the percentage of reads mapped, and the fraction of reads filtered as PCR
duplicates.
sumatran_orangutan/: This folder contains trimmed reads (organized as in the makefile)
and other intermediate files; these are often useful in other analyses, but may be
deleted to save space.
Next, genotype the coding genes on the mitochondrial genome and build a phylogeny
using the following commands. This will generate a maximum likelihood phylogeny with 10
bootstraps (reduced to save time for the exercise).
cd $MY_HOME/my_ancient_dna/phylo_pipeline/phylogeny
nice phylo_pipeline genotype+msa+phylogeny 000_makefile.yaml
The results are located in subfolders in the “results/ExampleProject” folder:
genotypes/: (Filtered) genotyping calls in VCF format, and FASTA sequences generated for
each region of interest (i.e. gene), for each sample.
alignments/: Multiple sequence alignments for each region of interest (i.e. gene); the folder
contains both the unaligned sequences (*.fasta) and the aligned sequences (*.afa).
phylogenies/: Super-matrices (combined multiple-sequence alignments) used in, and Newick
trees resulting from, the phylogenetic inference.
The resulting phylogeny can be visualized using the “nw_display” command:
cd results/ExampleProject/phylogenies/ProteinCodingGenes/
nw_display replicates.support.newick
4 Setup and run BAM pipeline
In the following, we will map the sequencing data derived from four horses: three modern
and one prehistoric. To keep things manageable, we are restricting ourselves to 1 Mb of horse
chromosome 1 (namely chr1:10,000,000-11,000,000), and have generated a set of FASTQ files
containing only reads that map to this region, to avoid having to map hundreds of GB of reads
for this exercise. Create a project directory for the analyses; the following file structure is
used as it makes subsequent steps simpler:
cd $MY_HOME/my_ancient_dna
mkdir -p horses/alignments
cd horses/alignments
Copy the data used for this exercise (sequencing reads and a FASTA sequence for part of
chromosome 1):
cp -a /home/local/27626/exercises/paleomix/data/alignments/* .
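As a quick sanity check of the copied data, the number of reads in each FASTQ file can be counted. This is a minimal sketch, assuming standard four-line FASTQ records; the helper name "count_reads" and the demo file are hypothetical:

```shell
# Count the reads in a gzip-compressed FASTQ file,
# assuming every record spans exactly four lines.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Demonstration on a throwaway two-read FASTQ file (hypothetical content).
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmpdir/demo.fastq.gz"
count_reads "$tmpdir/demo.fastq.gz"
rm -rf "$tmpdir"
```

For paired-end data, the mate 1 and mate 2 files of a lane should report the same number of reads.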
Create an empty makefile:
bam_pipeline mkfile > makefile.yaml
The makefile is specified using YAML, a human-readable markup language that is visually
similar to Python code; in other words, the structure is defined using indentation. Note
that tabs cannot be used for indentation when editing this file: always use spaces! A copy
of the final makefile (“final_makefile.yaml”) was included in the data you copied, which may
be used for comparison with your own.
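Since stray tabs are easy to introduce and hard to see, a simple check can be run after each edit. This is a minimal sketch using standard tools; the helper name "has_no_tabs" and the two demo files are hypothetical:

```shell
# Succeed (exit status 0) only if the given file contains no tab characters.
has_no_tabs() {
    ! grep -q "$(printf '\t')" "$1"
}

# Demonstration on two throwaway files (hypothetical content).
tmpdir=$(mktemp -d)
printf 'Prefixes:\n  Name:\n' > "$tmpdir/spaces.yaml"
printf 'Prefixes:\n\tName:\n' > "$tmpdir/tabs.yaml"
has_no_tabs "$tmpdir/spaces.yaml" && echo "spaces.yaml: OK"
has_no_tabs "$tmpdir/tabs.yaml" || echo "tabs.yaml: contains tabs"
rm -rf "$tmpdir"
```

In the exercise folder, the same helper can be pointed at "makefile.yaml".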
Open the “makefile.yaml” file in your favorite editor and add the following lines to the end
of the file:
Przewalski:
  Przewalski:
    Library1:
      Lane1: reads/Przewalski_R{Pair}.fastq.gz
The first line specifies that the name of this project is “Przewalski”. This means that all
resulting files will start with “Przewalski”. The second line defines the sample name; this is
used to tag the resulting alignment data, and is typically the same as the project name.
The third line names a single library, which we have chosen to call “Library1”, while the
fourth line names a single lane (i.e. a run on an NGS machine) and specifies the location of
the de-multiplexed files. The “{Pair}” part of the path signifies that these are paired-end
reads, and is replaced with “1” and “2” by the pipeline, representing mate 1 and mate 2 reads
respectively.
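The “{Pair}” substitution described above can be illustrated as follows. This is only a sketch of the naming convention, not the pipeline's own code; the helper name "expand_pair" is hypothetical:

```shell
# Replace the "{Pair}" placeholder in a lane path with a mate number.
# $1: path template, $2: mate number (1 or 2)
expand_pair() {
    printf '%s\n' "$1" | sed "s/{Pair}/$2/"
}

expand_pair 'reads/Przewalski_R{Pair}.fastq.gz' 1
expand_pair 'reads/Przewalski_R{Pair}.fastq.gz' 2
```

This prints the two file names that must exist for the lane: reads/Przewalski_R1.fastq.gz and reads/Przewalski_R2.fastq.gz.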
A library can contain any number of lanes, and a sample can contain any number of
libraries, and so forth, but in this example we have chosen to limit ourselves to a single lane
of a single library.
Add each of the following samples below the entry for the Przewalski’s horse, using the
same structure, and the given path for the lane:
Standardbred:
  Standardbred:
    Library1:
      Lane1: reads/Standardbred_R{Pair}.fastq.gz
Donkey:
  Donkey:
    Library1:
      Lane1: reads/Donkey_R{Pair}.fastq.gz
Next, add the following two lanes for the ancient horse, one paired-end and one
single-end:
ThistleCreek:
  ThistleCreek:
    Library1:
      Lane1: reads/ThistleCreek_R{Pair}.fastq.gz
      Lane2: reads/ThistleCreek.fastq.gz
Finally, update the “Prefixes” section earlier in the file, in order to specify which FASTA
files we will be mapping against. Here we map against a fragment of chromosome 1, which we
choose to call EquCab20Chr1frag; this name will be used in the resulting files. The pipeline
will take care of indexing the reference using the chosen aligner (BWA by default). The file
was copied along with the data above. Note that we intentionally use the name of the FASTA
file (without the .fasta extension) for the prefix; this is expected by the second pipeline:
Prefixes:
  EquCab20Chr1frag:
    Path: prefixes/EquCab20Chr1frag.fasta
Run the alignment:
nice bam_pipeline run makefile.yaml
A mapDamage profile is generated for each library in each sample by default, including
a plot of the post-mortem damage patterns (Fragmisincorporation_plot.pdf); compare the
ancient sample with any of the modern samples, for example:
• ThistleCreek.EquCab20Chr1frag.mapDamage/Library1/Fragmisincorporation_plot.pdf
• Standardbred.EquCab20Chr1frag.mapDamage/Library1/Fragmisincorporation_plot.pdf
Also compare the read length distribution:
• ThistleCreek.EquCab20Chr1frag.mapDamage/Library1/Length_plot.pdf
• Standardbred.EquCab20Chr1frag.mapDamage/Library1/Length_plot.pdf
Finally, while working on the ancient ThistleCreek sample we noticed a high proportion of
DNA originating from a Pseudomonas bacterium; therefore, add the P. fluorescens genome to
the list of prefixes, to also map against that genome:
Prefixes:
  EquCab20Chr1frag:
    Path: prefixes/EquCab20Chr1frag.fasta
  Pseudomonas:
    Path: prefixes/pfluorescens.fasta
Run the alignment again, in order to also map against the new Pseudomonas genome:
nice bam_pipeline run makefile.yaml
Compare the coverage for this mapping (using either the *.summary or *.Pseudomonas.coverage
tables) between the modern and ancient samples, and compare the mapDamage plots between
the horse genome and the Pseudomonas genome for the ancient horse (ThistleCreek):
• Standardbred.Pseudomonas.coverage
• ThistleCreek.Pseudomonas.coverage
The DNA for the Thistle Creek sample is highly fragmented, so we expect that any fragment
which cannot be collapsed is modern in origin; let's therefore exclude paired-end reads where
both mates passed the quality filters, but were not collapsed (i.e. did not overlap). This
is done simply by repeating the “Options” section from the top of the file, but writing only
the part that needs to be overwritten (namely the “ExcludeReads” section):
ThistleCreek:
  Options:
    ExcludeReads:
      - Paired
  ThistleCreek:
    Library1:
      Lane1: reads/ThistleCreek_R{Pair}.fastq.gz
      Lane2: reads/ThistleCreek.fastq.gz
Remove the old ThistleCreek results and run the alignment again:
rm -rv ThistleCreek*
nice bam_pipeline run makefile.yaml
If you look at the *.summary or *.Pseudomonas.coverage files, you’ll find that this has
excluded ~90% of the modern DNA from this particular bacterium.
Finally, a problem with using BWA is that it assumes that the 5’ region of reads (the first
32bp by default) contains fewer mismatches than the rest of the read. This speeds up the
alignment, but at the cost of some loss of genuine alignments for ancient DNA, as damage
tends to be localized to the 5’ region. To disable the use of the seed region, override the
“UseSeed” option for that project.
ThistleCreek:
  Options:
    Aligners:
      BWA:
        UseSeed: no
    ExcludeReads:
      - Paired
  ThistleCreek:
    Library1:
      Lane1: reads/ThistleCreek_R{Pair}.fastq.gz
      Lane2: reads/ThistleCreek.fastq.gz
First, make a note of the current number of hits in ThistleCreek.EquCab20Chr1frag.coverage,
and then remove the old results and run the alignment again:
rm -rv ThistleCreek*
nice bam_pipeline run makefile.yaml
Inspect the newly generated ThistleCreek.EquCab20Chr1frag.coverage file; you should
see a gain of about 2.5%; not a lot, but every bit matters when only a couple of percent of
the DNA sequenced belongs to the sample itself.
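The gain can be computed from the two hit counts you noted down. This is a minimal sketch; the helper name "percent_change" is hypothetical, and the counts in the example are made-up placeholders, not results from this data set:

```shell
# Relative change, in percent, from an old count to a new count.
# $1: old count, $2: new count
percent_change() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", 100 * (new - old) / old }'
}

# Hypothetical hit counts before and after disabling the seed.
percent_change 2000 2050
```

With these placeholder counts the result is 2.5, i.e. a gain of 2.5%.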
5 Setup and run the Phylogenetic pipeline
Create a project directory for the analyses:
cd $MY_HOME/my_ancient_dna
mkdir -p horses/phylogeny
cd horses/phylogeny
Copy the data files (a BED file containing the coordinates of genes) and set up symbolic links
to the previous analyses; we create a link to the folder containing the BAM files, and a link
to the folder containing the FASTA files; this will allow the pipeline to (semi-)automatically
locate the files we used:
mkdir data
cd data
cp -a /home/local/27626/exercises/paleomix/data/phylogeny/* .
ln -s ../../alignments samples
ln -s ../../alignments/prefixes prefixes
A copy of the final makefile (“final_makefile.yaml”) was included in the data you copied
above, which may be used for comparison with your own. Now create an empty makefile:
cd $MY_HOME/my_ancient_dna/horses/phylogeny
phylo_pipeline mkfile > makefile.yaml
Open this makefile (“makefile.yaml”), update the project title (which determines the folder
in which results are placed), and list the four samples we used before:
Project:
  Title: my_horses
  Samples:
    Donkey:
      Gender: Male
    Standardbred:
      Gender: Male
    Przewalski:
      Gender: Male
    ThistleCreek:
      Gender: Male
      GenotypingMethod: Random Sampling
Due to the low coverage (1x), it is not possible to genotype the ThistleCreek sample in
the normal way (here, using SAMTools), so instead we will simply sample a random base at
each site.
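The idea behind random sampling can be sketched as follows: at each site, one of the observed bases is picked with equal probability. This is an illustration of the principle only, not the pipeline's implementation; the helper name "sample_base" and the example bases are hypothetical:

```shell
# Pick one base uniformly at random from the bases observed at a site.
# $1: observed bases as a string (e.g. "AAGA"), $2: seed, for reproducibility
sample_base() {
    awk -v bases="$1" -v seed="$2" 'BEGIN {
        srand(seed)
        i = int(rand() * length(bases)) + 1   # index in 1..length(bases)
        print substr(bases, i, 1)
    }'
}

sample_base "AAGA" 1   # prints either A or G
sample_base "T" 1      # a single read: always prints T
```

At 1x coverage most sites are covered by a single read, so any error or damage-induced misincorporation in that read is carried directly into the called sequence.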
Next, update the “RegionsOfInterest” section just below the “Samples” section, to specify
which parts of the genome we want to genotype:
  RegionsOfInterest:
    ProteinCodingGenes:
      Prefix: EquCab20Chr1frag
      Realigned: yes
      ProteinCoding: yes
      IncludeIndels: yes
      HomozygousContigs:
        Female:
          - chrM
        Male:
          - chrX
          - chrY
          - chrM
The name “ProteinCodingGenes” together with the Prefix (“EquCab20Chr1frag”) tells the
pipeline to look for a BED file containing the coordinates of regions we are interested in at
./data/regions/EquCab20Chr1frag.ProteinCodingGenes.bed (by default).
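The default location described above follows directly from the two names. This is a minimal sketch of that naming rule; the helper name "bed_path" is hypothetical:

```shell
# Build the default BED path from a prefix name and a regions-of-interest name.
# $1: prefix name, $2: regions-of-interest name
bed_path() {
    printf './data/regions/%s.%s.bed\n' "$1" "$2"
}

bed_path EquCab20Chr1frag ProteinCodingGenes
```

Running "ls" on the printed path from the phylogeny folder is a quick way to confirm that the file is where the pipeline will look for it.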
Finally, we need to specify that we wish to build a phylogeny using this set of genes; find
the “PhylogeneticInference” section, and replace “PHYLOGENY_NAME” with “my_phylogeny”,
replace “REGIONS_NAME” with “ProteinCodingGenes” (which we specified above), and
add “Donkey” to “RootTreesOn” (see comments in makefile or below):
PhylogeneticInference:
  my_phylogeny:
    RootTreesOn:
      - Donkey
    PerGeneTrees: no
    RegionsOfInterest:
      ProteinCodingGenes:
        Partitions: "111"
    ExaML:
      Replicates: 1
      Bootstraps: 100
      Model: GAMMA
Run the genotyping, multiple sequence alignment, and phylogeny:
nice phylo_pipeline genotype+msa+phylogeny makefile.yaml
Visualize the resulting phylogeny using Newick Utilities:
cd results/my_horses/phylogenies/my_phylogeny/
nw_display replicates.support.newick
Notice the extreme length of the branch leading to the ThistleCreek sample; this
is a result of the higher error rate resulting not only from the post-mortem damage, but also
from the fact that we randomly sampled bases at each site.
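To compare branch lengths numerically rather than by eye, the length of a terminal branch can be pulled out of the Newick text. This is a minimal sketch, assuming leaves are written as "name:length"; the helper name "leaf_length" is hypothetical, and the tree below is a made-up example, not the result of this exercise:

```shell
# Print the branch length recorded for a leaf in a Newick string.
# $1: leaf name, $2: Newick string
leaf_length() {
    printf '%s\n' "$2" | grep -o "$1:[0-9.eE+-]*" | cut -d: -f2
}

# Hypothetical tree with made-up branch lengths.
tree='((Przewalski:0.01,Standardbred:0.02):0.005,ThistleCreek:0.30,Donkey:0.05);'
leaf_length ThistleCreek "$tree"
leaf_length Standardbred "$tree"
```

For the real tree, the contents of the Newick file can be passed in as the second argument, e.g. "$(cat replicates.support.newick)".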