Next Generation Sequencing
Bioinformatics small variants Data Analysis
Guidelines
genomescan.nl
GenomeScan’s Guidelines for Small Variant Analysis on NGS Data
Using our own proprietary data analysis pipelines
Dear customer,
As of the beginning of 2015 ServiceXS became a trademark of GenomeScan B.V. GenomeScan
focuses exclusively on Molecular Diagnostics whereas our ServiceXS trademark is intended for your
R&D projects.
GenomeScan is dedicated to helping you design and perform Next Generation Sequencing (NGS)
experiments that generate high quality results. This guide provides information on our data analysis
services, along with resources and tools for further analysis of your sequencing data. NGS experiments
result in vast amounts of data and therefore data analysis can be challenging. Our ability to assist in
the analysis of your results can be the key factor leading to a successful project.
Our experience in the past years is that even state-of-the-art NGS software is not always able to
fulfill the data analysis needs of our customers. To alleviate this problem our experienced team of
bioinformaticians and molecular biologists can provide standard or custom bioinformatics solutions
to get the most out of your project.
GenomeScan provides a comprehensive package of bioinformatics services for our NGS customers,
which enable them to utilise all the applications that are possible with billions of bases of sequence
data per run. GenomeScan can advise and assist you in every step of the data analysis. Do not
hesitate to contact us if you have any questions after reading this guideline!
On behalf of the Bioinformatics team,
Thomas Chin-A-Woeng
Project Manager
GenomeScan Guidelines- Page 2 of 14
Small Variant Analysis v3.0
Document Outline
1 Introduction
2 Application Description
2.1 Quality Filtering and Trimming
2.2 Alignments
2.3 SNP Detection
2.4 SNP Filtering
2.5 Indel detection
2.6 Export Files
2.7 Consensus Sequence (optional)
2.8 SNP Effect Analysis
3 Analysis Results
3.1 Raw Sequencing Files
3.2 Alignment Files
3.3 Main SNP File
3.4 Human Readable SNP File
3.5 Genotype Summary
3.6 Assay Design File
3.7 Combined.tab
3.8 IUPAC and variant references
3.9 Visualisation
4 File Formats
4.1 Variant Analysis
4.2 Structural Variation
4.3 Reference Genomes
4.4 Assay Design
Changes to Previous Version (2.0)
-Lay-out changes
© 2015 GenomeScan B.V. All rights reserved ServiceXS™ | GenomeScan B.V. | Plesmanlaan 1D | 2333 BZ Leiden | The Netherlands Telephone: +31 (0)71 568 1050 | Fax: +31 (0)71 568 1055 [email protected] | [email protected] | www.genomescan.nl | www.servicexs.com GenomeScan Guidelines- Page 3 of 14
Chapter 1 Introduction
Most organisms within a particular species differ very little in their genomic structure. These
variations are referred to as allele changes. A single nucleotide polymorphism or SNP is a DNA
sequence variation occurring when a single nucleotide - A, T, C, or G - in the genome differs
between members of a species (or between paired chromosomes in an individual). Each individual
has many single nucleotide polymorphisms that together create a unique DNA pattern for that
individual. Typically, SNPs commonly observed in a population exhibit two alleles, a major allele,
which is more prevalent, and a relatively rarely occurring minor allele. The study of single
nucleotide polymorphisms is also important in genotyping in crop and livestock breeding.
Single nucleotide polymorphisms may fall within coding sequences of genes, non-coding regions, or
in the intergenic regions between genes. SNPs sometimes have very deleterious effects: a change in
only one nucleotide can cause a codon to be misread, so that a wrong protein is formed. SNPs within
a coding sequence will not necessarily change the amino acid sequence of the protein that is
produced, due to degeneracy of the genetic code. A SNP in which both forms lead to the same
polypeptide sequence is termed synonymous; if a different polypeptide sequence is produced, the
SNP is termed non-synonymous. SNPs that are not in protein coding regions may still have
consequences for gene splicing, transcription factor binding, or the sequence of non-coding RNA.
SNPs located in regulatory regions (promoters, UTRs) may have a significant influence on the
expression level of a gene.
Next-generation sequencing (NGS) allows SNP identification without prior target information. The
high coverage possible in NGS also facilitates discovery of rare alleles within population studies. SNP
detection algorithms compare the nucleotides present on aligned reads against the reference at each
position (Fig. 1). Based on the distribution of As, Ts, Gs, and Cs at that position and the likelihood
of a sequencing error, a judgement is made as to the existence of a SNP. Further downstream in the
SNP analysis, the potential effects of SNPs associated with the DNA sequence can be evaluated.
Fig.1. Alignment against a reference sequence
This guideline describes the workflow for detection of small variants in a sample genome in
comparison to a reference genome. The main steps are (1) quality filtering and adapter trimming,
(2) alignment, (3) SNP detection, (4) filtering of significant SNPs, and (5) optionally SNP effect
analysis and clustering.
Chapter 2 Sequencing Applications
The following section describes the main steps for SNP/Indel analysis (Fig. 2). The most common
workflow step in preparation for SNP analysis is to filter the reads and retain only those with high
mapping and base qualities. After calling SNPs and choosing the appropriate thresholds for filtering,
a VCF file is generated. From this VCF file various export formats that can be interpreted by the
customer are derived.
Fig. 2. SNP detection workflow.
2.1 Quality Filtering and Trimming
The SNP/indel pipeline starts with quality filtering and trimming of the sequence reads. For filtering
a set of standard thresholds is used which are optimised for the SNP/indel analysis pipeline. The
main parameter defaults are:
Table 1. Read filtering
Filter | Default | Description
Adapters | On | Illumina sequencing adapters are removed
Minimal Q-score | 22 | All bases in the read should have at least a Q-score of 22 (corresponding to a chance of one error in 160 bases); bases with lower qualities are trimmed off
Minimal read length | 36 | After trimming, reads should be at least 36 bp to be kept in the data set
Treat paired-end | On | For paired-end reads both reads should be kept or removed altogether
5' or 3' trim | Off | The 5' and 3' ends of reads can optionally be trimmed for adapter sequences or other unwanted bases indicated by the customer
Presumed adapter sequences are removed from the read when the bases match a sequence in the
adapter sequence set (Illumina TruSeq adapters) with two or fewer mismatches and an alignment
quality of at least 12. To remove noise introduced by sequencing errors, reads are filtered and
clipped by quality. By default, the reads are filtered using a phred score of Q22 as a minimum
threshold. Bases with phred scores below this level are removed and, as a consequence, reads are
split. If the resulting reads are shorter than the minimal read length (36 bp by default), they
are removed altogether (both reads of a pair when paired-end mode is enforced).
The filtered reads are written to FASTQ format and filtering statistics are calculated and reported.
The filtered reads are used for the next stage of the pipeline.
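The quality rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's own code: the function names `phred_to_error_probability` and `split_and_filter` are hypothetical, qualities are assumed to be already decoded to integers, and adapter removal and paired-end bookkeeping are left out.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def split_and_filter(sequence, qualities, min_q=22, min_length=36):
    """Split a read at bases below min_q and keep fragments of at least min_length.

    sequence is a string, qualities a list of integer Phred scores of the
    same length. Returns a list of (sequence, qualities) fragments.
    """
    runs = []
    start = None  # start index of the current run of acceptable bases
    for i, q in enumerate(qualities):
        if q >= min_q:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(qualities)))
    # Discard fragments that are shorter than the minimal read length.
    return [
        (sequence[a:b], qualities[a:b])
        for a, b in runs
        if b - a >= min_length
    ]
```

With the default Q22, `phred_to_error_probability(22)` is about 0.0063, that is roughly one error in 160 bases, which matches the figure quoted in Table 1.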
2.2 Alignments
The next step of the pipeline consists of aligning the filtered reads to the genome reference
provided by the customer or generated using de novo assembly. The filtered reads are aligned to
the reference sequence with a short read aligner based on Burrows–Wheeler Transform. A mismatch
rate of 4% (4 mismatches in a read of 100 bases) is used by default. This step lays the foundation for
finding the SNPs and variations. The alignment files (BAM files, sorted and indexed with the
samtools v0.1.18 package) containing the mapped read information are provided on the hard disk in
the Alignments folder.
2.3 SNP Detection
The pipeline performs SNP/Indel identification using Bayesian statistics similar to other commonly
used software tools for SNP detection. It uses the nucleotide values taken by each read covering the
location, as well as its associated base quality, and calculates a consensus genotype. Issues that a
SNP caller has to be able to consider are quality of reads, mapping quality, coverage,
homopolymeric tracts, and ploidy. The caller takes the following factors into consideration:
- A sequencer outputs a sequence of nucleotides corresponding to each read and assigns a
quality value based on the confidence with which a particular base is called. The base
quality values add weight to the called nucleotides.
- Misaligned reads create false positive SNPs or incorrect frequencies. Most alignment
algorithms assign quality scores to a mapping based on the read alignment with the
reference. These mapping scores indicate the likelihood of a read originating from the
suggested position on the reference. The mapping quality score takes into account the
inserts, deletions, and substitutions necessary for alignment at a particular position.
- The number of reads at a genomic position also determines the confidence of a found SNP.
Greater sequencing depth leads to higher SNP calling accuracy.
- The ploidy of the sample determines the number of nucleotide inferences necessary to
conclude the underlying genotype. When haploid, the algorithm does not assume the
probability of seeing a heterozygote.
- Some sequencers exhibit inaccurate representations of homopolymers (e.g. AAAAAA) and
their immediate neighbors due to limitations in the technology. Such regions are also
handled by the SNP detection algorithm.
The SNP/Indel pipeline is capable of detecting three types of variants: substitutions or mutations,
deletions, and insertions. Substitutions consist of one or more nucleotide substitutions occurring at
certain genomic positions. Deletions are one or more nucleotide deletions occurring at a given
location. A deletion event is represented as a change from one or more consecutive nucleotides to a
gap (no bases). Insertions are one or multiple consecutive nucleotide insertions occurring at a given
location.
The pipeline can process data in single- or multi-sample mode. In the default multi-sample mode,
low-confidence calls occurring in multiple samples increase the confidence of the SNP call.
An associated Phred quality is output along with the consensus genotype; this score represents the
confidence in the variant call. Higher scores correspond to a lower probability of error in the call.
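To make the idea of a Bayesian consensus genotype concrete, the following Python sketch computes genotype posteriors at a single position from base calls and their base qualities. It is a textbook-style simplification and not the proprietary caller described here: it assumes a diploid sample, a uniform prior over the ten possible genotypes, independent reads, and it ignores mapping quality and homopolymer effects. The function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_posteriors(observations):
    """Posterior probability of each diploid genotype at one position.

    observations is a list of (base, phred_quality) pairs, one per read
    covering the position. A uniform genotype prior is assumed.
    """
    log_likelihoods = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        total = 0.0
        for base, q in observations:
            e = 10 ** (-q / 10)  # probability that this base call is wrong
            p1 = 1 - e if base == a1 else e / 3
            p2 = 1 - e if base == a2 else e / 3
            # Each read is drawn from either chromosome with probability 0.5.
            total += math.log(0.5 * p1 + 0.5 * p2)
        log_likelihoods[a1 + a2] = total
    # Normalise in a numerically stable way by subtracting the maximum first.
    top = max(log_likelihoods.values())
    weights = {g: math.exp(v - top) for g, v in log_likelihoods.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


def call_genotype(observations):
    """Return (genotype, phred_score) for the most probable genotype."""
    posteriors = genotype_posteriors(observations)
    best = max(posteriors, key=posteriors.get)
    p_error = max(1 - posteriors[best], 1e-30)  # guard against log10(0)
    return best, -10 * math.log10(p_error)
```

For example, ten reads showing A and ten showing G, all at Q30, yield the heterozygous call `AG`, while twenty A reads yield `AA`. The returned score grows with depth, which mirrors the statement above that greater sequencing depth increases calling confidence.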
2.4 SNP Filtering
Genomic positions are reported as potential SNP sites if they satisfy a set of predefined criteria
that may be set by the customer or bioinformatician and that may depend on the experimental
setup. These criteria include the minimal read depth, minimal quality score, and minimal variant
frequency. For each of these criteria the observed value must exceed the defined threshold.
A VCF file is generated from the positions passing the filter. The results are reported in
filtered.snps.vcf in VCF file v4.1 format in the Variants directory. From this VCF file various export
formats that can be interpreted by the customer are derived. The following filters can be applied to
the SNP list:
- Read depth: The deeper the sequencing, the more reliably the SNP detection can determine
whether a candidate is a true SNP. A minimal threshold can be set to ensure a minimal coverage
before a SNP is reported.
- Quality score: SNPs are filtered based on their quality score. All SNPs with quality scores
less than the defined threshold are filtered out. This ensures that SNPs with low quality are
discarded, but when these should also be included the threshold can be lowered.
- Variant frequency: The threshold is set according to prior expectations about the data set,
among which are the ploidy and whether a pool of samples was analysed.
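The three filters can be expressed as a simple predicate over candidate sites. The sketch below is illustrative only: the `CandidateSnp` record and `filter_candidates` function are hypothetical names, the thresholds are passed in explicitly because the appropriate values depend on the experiment, and a value equal to the threshold is assumed to pass.

```python
from dataclasses import dataclass


@dataclass
class CandidateSnp:
    position: int
    depth: int                # number of reads covering the position
    quality: float            # Phred-scaled SNP quality
    variant_frequency: float  # percentage of reads supporting the variant


def filter_candidates(candidates, min_depth, min_quality, min_frequency):
    """Keep only candidates that meet all three thresholds."""
    return [
        c for c in candidates
        if c.depth >= min_depth
        and c.quality >= min_quality
        and c.variant_frequency >= min_frequency
    ]
```

A call such as `filter_candidates(candidates, min_depth=20, min_quality=20, min_frequency=30.0)` would retain a site with depth 40, quality 60 and a variant frequency of 48%, and reject sites that fail any single criterion. The numbers in this call are placeholders chosen for the example.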
2.5 Indel Detection
Small insertions-deletions (indels), up to 30 bases, are detected by the indel caller using in-read
information (in contrast to mate or pair information). Aligners typically introduce gaps into reads
for better mapping that may represent deletions. Similar to a base, a gap (deletion) is significant
when the missing base(s) meet the filter criteria. Since deletions do not have an associated quality
score, the surrounding base qualities are used for computation of a confidence score.
The indels are provided in VCF format and tab-delimited format.
2.6 Human-readable files
The SNP list is stored in snps.tab in tab-delimited format (in the Export folder). This file can be
opened directly in a spreadsheet application such as MS Excel or LibreOffice, provided the number of
rows does not exceed the limitations of the application.
From this file the genotype columns are extracted into the summary.tab file.
A SNP assay design file is generated for the SNPs reported in snps.tab and written to design.tab.
This file contains the contig sequence 75 bases upstream and downstream of the identified SNP
position.
Optionally, a file with the combined information of snps.tab, summary.tab, and design.tab is
provided in the file combined.tab. This file also includes additional columns with the distance to
the closest previous and next SNP and the average sequence depth for all samples.
Small indels are output in the indels.tab file in the Export folder.
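The neighbour distances reported in combined.tab can be derived directly from the sorted SNP positions of a contig. The following sketch shows one way to do this; the function name `neighbour_distances` is hypothetical, and how the actual file represents a missing neighbour (for the first and last SNP on a contig) is not specified in this guideline, so `None` is used here as a placeholder.

```python
def neighbour_distances(positions):
    """Distances to the previous and next SNP for sorted positions on one contig.

    Returns a list of (distance_to_previous, distance_to_next) tuples;
    a missing neighbour is reported as None.
    """
    result = []
    last = len(positions) - 1
    for i, pos in enumerate(positions):
        previous = pos - positions[i - 1] if i > 0 else None
        following = positions[i + 1] - pos if i < last else None
        result.append((previous, following))
    return result
```

Short distances are relevant for follow-up assay design, since a neighbouring SNP inside a primer or probe region may interfere with the assay.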
2.7 Consensus Sequence (optional)
Based on the consensus call and the reference sequence a new reference sequence may be derived
which includes the found SNPs and genotypes. The resulting file is in FastA format and may be
coded in different ways.
2.8 SNP Effect Analysis (optional)
SNP Effect Analysis processes the list of SNPs and reports the effect that these SNPs have on the
genes in a given context. Using the genome feature information the SNPs are classified.
The following classifications are detected and reported.
Table 2. SNP effect classifications
Classification | Description
Intergenic | A variant that does not fall within the neighborhood of any gene in the annotation
Synonymous | Variant in an exon. Synonymous: mutation has no effect on the final amino acid sequence
Non-synonymous | Variant in an exon. Non-synonymous: mutation has an effect on the final amino acid sequence
Stop gain | Results in a STOP codon
Stop loss | STOP codon lost
Intronic | A mutation occurring in intronic regions
Upstream | A variant occurring upstream of the transcript
Downstream | A variant occurring downstream of the transcript
Essential splice site | Mutations to the donor and acceptor sites of the intron
Splice site | Mutations to locations of splicing signals (i.e. 3-8 bases into the intron from either side, 1-3 bases into neighboring exon)
5' UTR | A variant in the 5' UTR region
3' UTR | A variant in the 3' UTR region
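The coding-region classes in this table (synonymous, non-synonymous, stop gain, stop loss) follow from translating the reference and the variant codon. The Python sketch below illustrates this for a single substitution. It assumes the standard genetic code and a codon given on the coding strand; the function name `classify_coding_snp` is hypothetical, and the position-based classes (intronic, UTR, splice site, and so on) require gene annotation and are not covered.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop).
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def classify_coding_snp(codon, offset, alt_base):
    """Classify a substitution at a 0-based offset within a reference codon."""
    ref_aa = CODON_TABLE[codon]
    alt_codon = codon[:offset] + alt_base + codon[offset + 1:]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "Synonymous"
    if alt_aa == "*":
        return "Stop gain"
    if ref_aa == "*":
        return "Stop loss"
    return "Non-synonymous"
```

For instance, changing GAA to GAG leaves glutamate unchanged and is synonymous, whereas changing GAA to TAA introduces a stop codon.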
Chapter 3 Analysis Results
3.1 Raw Sequence Files
The raw sequence files output by the Illumina pipeline are used as input for the SNP
detection. These sequence files are provided to the customer in FASTQ format in the 'Raw data'
directory. The quality-filtered reads produced in the first step of the pipeline are optionally
provided to the customer.
3.2 Alignment files
The alignment files are provided in sorted BAM format with an accompanying index file. See our
Next-generation data analysis guideline for a full description of BAM files.
3.3 Main SNP file (snps.vcf)
The main output of the SNP/indel pipeline is a text file in VCF format formatted according to the
VCF 4.1 specification. VCF stands for Variant Call Format, and was originally used by the 1000
Genomes project to encode structural genetic variants. A short overview is given in Section 4.1.1.
3.4 Human readable SNP file (snps.tab)
This text file contains information in tab delimited format. It is both human- and machine readable.
Fig. 3. Layout of snps.tab and summary.tab files.
The format specification of this file is defined in Section 4.1. Columns 1 to 4 are general columns
applicable to all samples. Columns 5 to 7 contain SNP information for individual samples. Columns 8
to 10 contain genotype information. Columns 11 to 15 provide raw statistics on coverage and base
composition. The layout of the columns is described in Section 4.1 (Table 4, Fig. 3).
3.5 Genotype summary (summary.tab)
This tab-delimited file contains the consensus columns in the snps.tab. It is both human- and
machine readable (Fig. 3 inset).
3.6 Assay design file (design.tab)
This tab-delimited file shows the flanking sequences of each position in the SNP file. It is both
human- and machine readable. Neighbouring SNPs, which may be of importance for the design of
follow-up assays, are indicated in the flanking regions.
3.7 Combined.tab (optional)
This tab-delimited file combines the information from snps.tab, summary.tab, and design.tab and
includes additional information about the distances to neighbouring SNPs and the total coverage
over all samples.
3.8 IUPAC and variant references
The construction of the IUPAC reference is depicted in Fig. 4.
Fig. 4. Generation of an IUPAC or variant reference
A new IUPAC reference, or a reference with variant alleles, is generated using the original
reference, read information, and variant tables. After alignment of the reads onto the reference
sequence, each base position is evaluated for its variants, coverage, and quality. Regions or bases
with no coverage are flagged in the new references with 'n'. Regions with coverage below a preset
read depth (default <=2) or doubtful alignment quality are flagged with lowercase bases to indicate
low quality. Variant alleles are depicted by their IUPAC codes in the IUPAC reference or by the
variant allele in the variant reference.
The IUPAC reference in FASTQ format has the additional advantage that the genotype call score is
encoded as a quality score, similar to the Sanger phred score encoding. An offset of 33 is used when
translating the ASCII encoding to the numerical score. The genotype call score is calculated as
$Q = -10 \log_{10} P$, where P represents the probability that a polymorphism exists at the given
location.
Whether or not a variant allele is reported in the derived reference depends on a set of key
threshold values, including variant frequency (default 30% for heterozygous diploid organisms or 80%
for haploid genomes), coverage or read depth (default 20), and mapping quality.
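The encoding rules for the derived reference can be illustrated with a small Python sketch. It is a simplified reading of the description above and not the actual generator: the function names are hypothetical, only read depth is considered (alignment quality and the variant frequency thresholds are left out), and scores are capped at 93, the highest value representable with the Sanger offset of 33.

```python
# Standard IUPAC nucleotide codes, keyed by the set of alleles they represent.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def reference_base(alleles, depth, low_depth=2):
    """Return the character written to the IUPAC reference for one position.

    alleles is an iterable of the bases called at the position, e.g. "AG".
    """
    if depth == 0:
        return "n"  # no coverage
    code = IUPAC_CODES[frozenset(alleles)]
    # Low coverage is flagged with a lowercase base.
    return code.lower() if depth <= low_depth else code


def quality_char(score, offset=33):
    """Encode a genotype call score as a single FASTQ quality character."""
    clipped = min(max(int(round(score)), 0), 93)
    return chr(clipped + offset)
```

A heterozygous A/G position at depth 25 is written as `R`, a homozygous A at depth 2 as `a`, and a score of 40 is stored as the character `I`.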
3.9 Visualisation
Aligned reads, pileups, and SNPs can be viewed in numerous software packages for NGS. Using the
reference file and alignment files, this can easily be done in the IGV browser. See our Next-generation data analysis guideline for how this can be performed.
Chapter 4 File Formats
This chapter describes the file formats specifically used for SNP and indel analysis. For other
common formats such as sequence and alignment files, please refer to our NGS data analysis
guideline.
4.1 Variant Analysis
The FASTQ sequence files output by the Illumina sequencers are saved compressed in the commonly
used GNU zip format. This is indicated by the .gz file extension. Most downstream data analysis
tools automatically decompress the files when they are used as input, and most decompression
software packages can inflate this format.
VCF files
The Variant Call Format (VCF) is a flexible format used to store any type of DNA polymorphism data
such as SNPs, insertions, deletions and structural variants, together with rich annotations, by listing
both the reference haplotype (the REF column) and the alternate haplotypes (the ALT column). The
format was developed for the 1000 Genomes Project, and has been widely adopted by many
scientists and software tools.
The VCF format is a text file format which contains meta-information lines, a header line, and then
data lines each containing information about a position in the genome. The specification for the
format can be found at http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 and has been published (Danecek et al. 2011. The variant call format and
VCFtools. Bioinformatics 27:2156–2158). The full VCF specification also includes a set of
recommended practices for describing complex variants.
The header contains an arbitrary number of meta-information lines, each starting with characters
‘##’, and a tab-delimited field definition line, starting with a single ‘#’ character. The meta-information header lines provide a standardised description of tags and annotations used in the data
section. The use of meta-information allows the information stored within a VCF file to be tailored
to the dataset in question. It can be also used to provide information about the means of file
creation, date of creation, version of the reference sequence, software used and any other
information relevant to the history of the file.
The field definition line names eight mandatory columns, corresponding to data columns
representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique
identifiers of the variant (ID), the reference allele (REF), a comma separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER), and
a semicolon separated list of additional, user extensible annotation (INFO).
In addition, if samples are present in the file, the mandatory header columns are followed by a
FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF
file. The FORMAT column is used to define the information contained within each subsequent
genotype column, which consists of a colon separated list of fields. E.g., the FORMAT field
GT:GQ:DP in the fourth data entry of Fig. 5 indicates that the subsequent entries contain
information regarding the genotype, genotype quality, and read depth for each sample. All data
lines are tab-delimited and the number of fields in each data line must match the number of fields
in the header line.
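The column layout described above maps naturally onto a small parser. The following Python sketch splits one VCF data line into the eight fixed columns, the INFO annotations, and the per-sample fields named by the FORMAT column. It is a minimal illustration that performs no validation; the function name `parse_vcf_line` is hypothetical, and the sample names are assumed to have been read from the field definition line beforehand.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line, sample_names):
    """Parse one tab-delimited VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # comma separated alleles

    # INFO is a semicolon separated list of KEY=VALUE pairs or bare flags.
    info = {}
    for entry in record["INFO"].split(";"):
        if not entry or entry == ".":
            continue
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
        else:
            info[entry] = True
    record["INFO"] = info

    # FORMAT names the colon separated fields of every sample column.
    samples = {}
    if len(fields) > 8:
        keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            samples[name] = dict(zip(keys, column.split(":")))
    record["samples"] = samples
    return record
```

Values are kept as strings except for POS, so that missing values written as a dot survive unchanged. For production use an established library is the safer choice.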
Fig. 5. VCF file.
The VCF specification includes several common keywords with standardised meaning. The following
table gives some examples of the reserved tags.
Table 3. Reserved VCF tags
Abbreviation | Description
Genotype columns
GT | Genotype, encodes alleles as numbers: 0 for the reference allele, 1 for the first allele listed in ALT column, 2 for the second allele listed in ALT and so on. The number of alleles suggests ploidy of the sample and the separator indicates whether the alleles are phased (‘|’) or unphased (‘/’) with respect to other data lines.
PS | Phase set, indicates that the alleles of genotypes with the same PS value are listed in the same order.
DP | Read depth at this position.
GL | Genotype likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
GQ | Genotype quality, probability that the genotype call is wrong, conditional on the site being variant. Note that the QUAL column gives an overall quality score for the assertion made in ALT that the site is variant or not variant.
INFO column
DB | dbSNP membership.
H3 | Membership in HapMap3.
VALIDATED | Validated by follow-up experiment.
AN | Total number of alleles in called genotypes.
AC | Allele count in genotypes, for each ALT allele, in the same order as listed.
SVTYPE | Type of structural variant (DEL for deletion, DUP for duplication, INV for inversion, etc.) as described in the specification.
END | End position of the variant.
IMPRECISE | Indicates that the position of the variant is not known accurately.
CIPOS/CIEND | Confidence interval around POS and END positions for imprecise variants.
Missing values are represented with a dot. For practical reasons, the VCF specification requires that
the data lines appear in their chromosomal order.
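The GT encoding in the table can be decoded with a few lines of Python. The sketch below translates the allele indices back into bases and reports whether the genotype is phased. The function name `decode_genotype` is hypothetical, and missing alleles (written as a dot) are returned as `None`.

```python
import re


def decode_genotype(gt, ref, alts):
    """Translate a GT string such as '0/1' into alleles.

    Returns (alleles, phased), where alleles is a list with one entry
    per chromosome copy and phased is True when '|' is the separator.
    """
    alleles = [ref] + list(alts)  # index 0 is REF, 1.. are the ALT alleles
    phased = "|" in gt
    called = []
    for index in re.split(r"[|/]", gt):
        called.append(None if index == "." else alleles[int(index)])
    return called, phased
```

For a site with REF `G` and ALT `A,T`, the genotype `1|2` decodes to the phased alleles A and T, and `0/1` to the unphased alleles G and A.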
VCF files can be stored compressed with bgzip, a program which utilizes the zlib-compatible BGZF
library (Li et al., 2009). Files compressed by bgzip can be decompressed by the standard gunzip
and zcat utilities. Fast random access retrieval of variants from a range of positions on the
reference genome can be achieved by indexing genomic position using tabix, a generic indexer for
tab-delimited files. Both programs, bgzip and tabix, are part of the samtools software package and
can be downloaded from the SAMtools web site (http://samtools.sourceforge.net).
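Because bgzip output is gzip-compatible, a compressed VCF file can be read sequentially with Python's standard `gzip` module, without tabix or any third-party package. The sketch below iterates over the data lines of a plain or compressed file; the function name `iter_vcf_data_lines` is hypothetical, and the choice of opener is based only on the `.gz` extension. Random access by genomic region still requires a tabix index and a tool that understands it.

```python
import gzip


def iter_vcf_data_lines(path):
    """Yield the data lines of a VCF file, skipping all header lines.

    Files ending in .gz (including bgzip output) are decompressed on the fly.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                yield line.rstrip("\n")
```

Both the meta-information lines (starting with `##`) and the field definition line (starting with `#`) are skipped by the same test.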
BCF (Binary Call Format)
Binary format used by samtools/bcftools for efficient storing and parsing of genotype likelihoods. A
description can be found at http://vcftools.sourceforge.net/bcf.pdf
SNP/Genotype
The snps.tab file is a proprietary human-readable file with all information regarding SNPs and
genotypes. The file is also machine-readable. Columns 1 to 4 are general columns applicable to all
samples. Columns 5 to 7 contain SNP information for individual samples. Columns 8 to 10 contain
genotype information. Columns 11 to 15 provide raw statistics on coverage and base composition.
The layout of the columns is as follows (Table 4, Fig. 3):
Table 4. SNP/genotype file
Column | Format | Description
1 | Text | Chromosome or contig
2 | Numerical | 1-based genomic position within chromosome or contig
3 | Nucleotide base | Reference allele
4 | Nucleotide base | Detected alleles
5 | Nucleotide base | Detected SNP, empty if no SNP or below significance
6 | Numerical | Variant frequency (%)
7 | Numerical | Quality score for SNP
8 | IUPAC base | Genotype
9 | Numerical | Genotype quality score
10 | Numerical | Depth to calculate genotype or SNP
11 | Numerical | Fraction of A
12 | Numerical | Fraction of C
13 | Numerical | Fraction of T
14 | Numerical | Fraction of G
15 | Numerical | Total depth (alignment)
16-26 | | Same as 5-15 for sample 2, etc
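Since the file repeats an 11-column block for every sample, a row can be unpacked generically regardless of the number of samples. The Python sketch below follows the column order of Table 4. The key names are hypothetical labels chosen for readability, the function name `parse_snps_row` is likewise an assumption, values are kept as strings apart from the position, and whether the file starts with a header row is not stated here, so any header would have to be skipped by the caller.

```python
GENERAL_COLUMNS = ["contig", "position", "reference", "alleles"]
SAMPLE_COLUMNS = [
    "snp", "variant_frequency", "snp_quality",   # columns 5-7
    "genotype", "genotype_quality", "depth",     # columns 8-10
    "fraction_a", "fraction_c", "fraction_t",    # columns 11-13
    "fraction_g", "total_depth",                 # columns 14-15
]


def parse_snps_row(line):
    """Split one snps.tab row into general columns and per-sample blocks."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GENERAL_COLUMNS, fields[:4]))
    record["position"] = int(record["position"])
    block = len(SAMPLE_COLUMNS)
    record["samples"] = [
        dict(zip(SAMPLE_COLUMNS, fields[start:start + block]))
        for start in range(4, len(fields), block)
    ]
    return record
```

An empty `snp` value indicates that no SNP was detected for that sample or that it fell below significance, as noted for column 5.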
Summary file
The summary.tab file contains only the consensus genotypes of the samples.
Table 5. Summary file
Column | Format | Description
1 | Text | Chromosome or contig
2 | Numerical | 1-based genomic position within chromosome or contig
3 | Nucleotide base | Reference allele
4 .. n | IUPAC call | Consensus genotype for sample
Insertions/deletions file
The indels.tab file contains short indel information in the following format:
Table 6. Indel file format
Column | Name | Format | Description
1 | chr1 | Text | Chromosome or contig
2 | pos1 | Numerical | 1-based genomic position within chromosome or contig chr1
3 | reference | Nucleotide base | Reference allele
4 | sequence | Nucleotide base(s) | Detected variation
5 | chr2 | Text | Not used
6 | pos2 | Numerical | Not used
7 | type | Text | Variation class: INS (insertion) or DEL (deletion)
8 | size | Numerical | Size of the variation
9 | varfreq | Numerical | Frequency at which the variation is observed (%)
10 | score | Numerical | Quality score
11 | depth | Numerical | Coverage at the indicated position
12..n | | | Same as 8-11 for sample 2, etc
4.2 Structural Variation
SV file
The filtered.sv.tab file contains structural variation data (large insertions, deletions, duplications,
and interchromosomal and intrachromosomal translocations).
Table 7. Structural variation file format
Column | Name | Format | Description
1 | chr1 | Text | Chromosome or contig
2 | pos1 | Numerical | 1-based genomic position within chromosome or contig chr1
3 | reference | Nucleotide base | Reference allele
4 | sequence | Nucleotide base(s) | Detected variation
5 | chr2 | Text | Chromosome or contig
6 | pos2 | Numerical | Second position in genome. 1-based genomic position within chromosome or contig chr2
7 | type | Text | Variation class: INS (insertion), DEL (deletion), CTX (interchromosomal translocation), ITX (intrachromosomal translocation)
8 | size | Numerical | Size of the variation
9 | varfreq | Numerical | Frequency at which the variation is observed (%)
10 | score | Numerical | Quality score
11 | depth | Numerical | Coverage at the indicated position
12 | sample | Text | Optional sample id
4.3 Reference genomes
IUPAC references
An IUPAC reference describes a heterozygous genome in which the alleles are indicated using the
standard IUPAC codes for DNA.
The file may be in sequence file format such as FastA format or FASTQ format.
4.4 Assay Design
Assay Design File
The design.csv file contains all information required to design follow-up assays such as qPCR and
genotyping assays.
Table 8. Assay design file
Column | Name | Format | Description
1 | contig | Text | Chromosome or contig
2 | position | Numerical | 1-based genomic position within chromosome or contig
3 | reference | Nucleotide base | Reference allele
4 | sequence | Sequence | DNA sequences left and right flanking the variant position. Any neighbouring SNVs are encoded in IUPAC. The actual SNV position is indicated using bracket notation ([A/T]).
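The bracket notation of the sequence column can be produced from the contig sequence and the 1-based SNV position as sketched below in Python. This is an illustration only: the function name `design_sequence` is hypothetical, the default flank of 75 bases follows the description of the design file in Section 2.6, and the IUPAC encoding of neighbouring SNVs within the flanks is omitted for brevity.

```python
def design_sequence(contig_sequence, position, alleles, flank=75):
    """Return the flanking sequence with the SNV in bracket notation.

    position is 1-based; alleles is a sequence of bases such as ["A", "T"].
    Flanks are truncated at the contig ends.
    """
    index = position - 1  # convert to a 0-based string index
    left = contig_sequence[max(0, index - flank):index]
    right = contig_sequence[index + 1:index + 1 + flank]
    return f"{left}[{'/'.join(alleles)}]{right}"
```

For example, `design_sequence("ACGTACGTAC", 5, ["A", "T"], flank=3)` yields `CGT[A/T]CGT`.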