Download Computational prediction of disease- causing CNVs - UiO

Document related concepts
no text concepts found
Computational prediction of diseasecausing CNVs from exome sequence data
Pubudu Saneth Samarakoon
Department of Medical Genetics, Faculty of Medicine
June 2016
© Pubudu Saneth Samarakoon, 2017
Series of dissertations submitted to the
Faculty of Medicine, University of Oslo
ISBN 978-82-8333-382-4
All rights reserved. No part of this publication may be
reproduced or transmitted, in any form or by any means, without permission.
Cover: Hanne Baadsgaard Utigard.
Print production: Reprosentralen, University of Oslo.
First and foremost I want to thank my supervisors; Robert Lyle and Torbjørn Rognes, you
have been tremendous mentors for me. Robert, thank you very much for listening to all my
questions and ideas, and guiding me during this period. Your input on academic writing and
our discussions on various scientific topics truly helped me to grow as a research scientist. I
am grateful to Torbjørn for introducing me to advance computational methods and for
sharing your programming knowledge. I am thankful to Norwegian quota scholarship fund
for providing financial support for the PhD.
I would like to express my gratitude to our research group, Hanne Sorte, Asbjørg StrayPedersen and Olaug Rødningen. Our fruitful discussions were invaluable and clearly helped
me to expand my skills and expertise to new horizons.
A warm thank to all my colleagues and friends at the department of Medical Genetics,
University of Oslo. I owe a special thanks to my dear colleagues Arvind Sundaram and
Marius Bjørnstad for sharing the office, serious discussions and for always lightening up my
I am truly thankful to my family and friends in Norway and Sri Lanka, especially my wife
Nisanka for sharing all my ups and downs. Also baby Sayon, who joined my life in the latter
part of my PhD.
List of publications
Paper 1
Samarakoon PS, Sorte HS, Kristiansen BE, Skodje T, Sheng Y, Tjønnfjord GE, Stadheim B,
Stray-Pedersen A, Rødningen OK and Lyle R. Identification of copy number variants from
exome sequence data. BMC Genomics. 2014;15:661. doi: 10.1186/1471-2164-15-661.
Paper 2
Samarakoon PS, Sorte HS, Stray-Pedersen A, Rødningen OK, Rognes T and Lyle R.
cnvScan: a CNV screening and annotation tool to improve the clinical utility of
computational CNV prediction from exome sequencing data. BMC Genomics 2016;
17(1):51. doi: 10.1186/s12864-016-2374-2.
Paper 3
Stray-Pedersen A, Sorte HS, Samarakoon PS, Gambin T, Chinn IK, Akdemir ZHC and et al.
Primary immunodeficiency diseases - genomic approaches delineate heterogeneous
Mendelian disorders. Journal of Allergy and Clinical Immunology 2016;
Table of content
Introduction ......................................................................................................................... 1
1.1 The human genome................................................................................................................... 1
1.2 Variation in the human genome ................................................................................................ 2
1.3 Genetic diseases ........................................................................................................................ 3
1.3.1 Disease-causing genetic variants ....................................................................................... 4
1.4 Diagnosis of genetic disease ..................................................................................................... 7
1.4.1 Genetic tests to detect SNVs and indels ............................................................................ 7
1.4.2 Genetic tests to detect chromosomal abnormalities and CNVs ......................................... 8
1.5 Exome sequencing .................................................................................................................... 9
1.5.1 Exome-based disease diagnosis ......................................................................................... 9
Disease-causing CNV detection from exome data ............................................................ 12
Aims of the study............................................................................................................... 20
Methodological considerations .......................................................................................... 21
CNV prediction ....................................................................................................................... 12
Coverage-based CNV prediction ............................................................................................ 14
CNV annotation ...................................................................................................................... 17
Causal CNV detection ............................................................................................................ 18
Validation of disease-causing CNVs ...................................................................................... 18
Exome sequence analysis pipeline.......................................................................................... 21
CNV prediction from exome sequence data ........................................................................... 23
Resolution of microarray platforms ........................................................................................ 23
Considerations when identifying true positive, false positive and false negative CNVs ....... 24
Summary of papers ............................................................................................................ 28
5.1 Paper 1 Identification of copy number variants from exome sequence data .......................... 28
5.2 Paper 2 cnvScan: a CNV screening and annotation tool to improve the clinical utility of
computational CNV prediction from exome sequencing data .......................................................... 30
5.3 Paper 3 Primary immunodeficiency diseases - genomic approaches delineate heterogeneous
Mendelian disorders.......................................................................................................................... 31
Discussion.......................................................................................................................... 33
6.1 Performance comparison of CNV prediction programs ......................................................... 33
6.2 Effect of cnvScan in-house database in CNV analysis ........................................................... 35
6.3 Characterization of true positives excluded in cnvScan implementation ............................... 37
6.4 Comparison with other disease-causing CNV detection methods .......................................... 40
6.4.1 Advantages of cnvScan annotation compared to other annotation programs.................. 40
6.4.2 Rate of disease-causing CNV detection .......................................................................... 41
6.4.3 Short CNV prediction ...................................................................................................... 43
6.4.4 Utility of exaCGH in detecting short CNVs .................................................................... 44
Future directions ................................................................................................................ 45
Optimizations to reduce the batch effect in cnvScan implementation ................................... 45
Whole-genome CNV analysis ................................................................................................ 45
Abbreviations ........................................................................................................................... 47
Commonly used terms ............................................................................................................. 48
References ................................................................................................................................ 50
1 Introduction
1.1 The human genome
The human genome contains ~3.2 billion bases packaged into 23 pairs of chromosomes.
Since the completion of the human genome project [1, 2], scientists have been motivated to
develop high-resolution maps and annotation datasets representing a broad spectrum of
genomic elements in the human genome. These maps and genome annotation datasets include
genes and other functional elements, repetitive elements, mobile elements and genomic
In 2004, with the availability of the complete euchromatic sequence of the human genome,
scientists estimated that the human genome contains 20000-25000 protein-coding genes [3],
which was recently revised downwards to 19,815 (GENCODE version 24; August 2015
freeze, GRCh38) [4]. These genes contain exons (protein-coding regions) that account for
~1.8% of the genome, introns (non-coding regions) and regulatory regions.
RefSeq [5, 6], ENCODE [7], HAVANA [8] and CCDS [9] were the early initiatives, which
are still involved in providing comprehensive genomic annotations. For example, the
ENCODE project was originally launched to identify and characterize functional elements in
1% of the human genome [7, 10], and later it was scaled up to provide integrated high-quality
annotations for the entire genome (GENCODE) [11, 12]. Similarly, HAVANA, initiated by
Ensembl, is dedicated to providing high-quality manually curated genes.
In order to effectively utilize this vast amount of information, various visualization solutions
were developed for scientists to browse the entre genome with relevant annotations. NCBI
[13], Ensembl [14], UCSC genome browser [15] were a few of the key efforts initiated to
provide broader access to this information.
Today, various methods (eg. genetic, evolutionary and biochemical approaches) are used to
improve our knowledge on structure and content of the human genome, propelling a broad
range of current research and applications [16]. In particular, this has fuelled the development
of systematic approaches that identify genes and genetic variants underlying human diseases.
1.2 Variation in the human genome
Variants in the human genome are the genetic differences between individuals, within and
among populations. Genetic variants are changes in DNA sequence or structural alterations of
the genome and are classified into three groups: sequence variants (eg. single nucleotide
variants, insertions and deletions), structural variants (eg. copy number variants,
translocations and inversions) and changes in chromosome number (eg. aneuploidy).
Single nucleotide variants (SNVs) are single base pair changes in the genome of an organism.
Insertions and deletions (indels) are changes in DNA sequence where short DNA fragments
are inserted or deleted [17]. Chemical and physical mutagens, and errors occurring during
DNA replication can induce SNV and indel formation.
Copy number variants are a class of structural variants containing deletions and duplications
[18]. Errors in DNA repair and DNA replication pathways contribute to CNV formation via
several mechanisms (eg. non-allelic homologous recombination - NAHR [19], nonhomologous end joining - NHEJ [20], microhomology mediated end joining - MMEJ [20],
fork stalling and template switching - FoSTeS [21] and microhomology mediated break
induced replication - MMBIR [22]). The majority of these mechanisms are mediated by
homologous or microhomologous regions [23-26]. Thus, most of the CNVs in human
genome are enriched within and near these regions (eg. 70% of the deletion breakpoints are
in microhomologous regions) [27, 28].
Our current understanding of genetic variation comes from analysing the genomes of a large
number of individuals. The human genome project [3], the SNP consortium (dbSNP) [29]
and the international HapMap project [30, 31] were the early efforts that identified ~10
million variants, primarily SNVs. Early large-scale efforts focused on detecting CNVs
include the HapMap project phase III [32] and Wellcome Trust Sanger CNV project [33, 34].
Collectively these projects identified a vast amount of common variants (frequency >5%)
found in a limited number of populations.
While these initial efforts were limited by the available technology, recent studies, fuelled by
more advanced variant detection methods (section 1.5) characterized a large amount of
common and rare variants in a broad range of populations worldwide. For example, a total of
88 million variants, including 84.7 million SNVs, 3.6 million indels and 60000 structural
variants, were characterized with the completion of the 1000 genomes project [35-37]. In
addition to this, the NHLBI GO Exome Sequencing Project (ESP) [38, 39] and UK10K
project [40] were other efforts initiated to identify common and rare genetic variants.
Considering all the variants identified from these projects, scientists have estimated the
average genome-wide variant counts (SNVs: ~3.7 million and indels: ~35000 [36]) and the
average genome-wide coverage of CNVs (contributing to 217.1Mbp, 7.01% of the genome)
[41]. Based on these figures, SNVs are the most frequent type of variant and structural
variants affect more bases than other types of genetic variants in the genome of an individual
With the goal of improving personal health, these vast amounts of population-specific
genomic variants were made available in public data repositories [42]. Current diagnostic
approaches use these datasets to help distinguish disease and non-disease causing variants in
patients with genetic diseases.
1.3 Genetic diseases
Today, the vast majority of human diseases are believed to have some genetic contribution.
Based on the nature of this genetic contribution, these diseases are grouped into Mendelian
diseases, complex diseases and chromosomal diseases.
Mendelian diseases [43] are caused by mutations at a single genetic locus that segregates in
families according to Mendel’s theory of inheritance [44]. Therefore Mendelian diseases are
also known as single-gene diseases. In contrast, complex diseases are associated with
multiple factors (gene and environmental factors) and do not follow Mendel’s laws [45].
While the majority of diseases are caused by alterations in genes, chromosomal aberrations
can change the structure of the genome and cause chromosomal diseases including
aneuploidy and a range of cancer types [46].
1.3.1 Disease-causing genetic variants
About 4000 disease-related genes have been identified to date [47]. These genes carry
variants (SNVs, indels or CNVs) that are either associated with or cause human diseases.
SNVs in exonic regions may cause genetic diseases by affecting the structure of proteins
(primary and secondary) [48], mRNA stability [49] or gene expression (eg. missense and
nonsense mutations). Indels in exonic regions also cause diseases when a variant alters the
reading frame of a gene (eg. frame shift mutation) [17]. SNVs and indels in gene regulatory
regions may affect gene expression and cause diseases. Today, a broad spectrum of diseases
ranging from infectious diseases to cancer have been associated with SNVs and indels [5052].
While SNVs and indels account for the majority of disease-causing variants identified to date
[53], CNVs represent an important subset of variants that cause genetic diseases. CNVs can
cause a broad range of common and complex diseases [54, 55] via three mechanisms (gene
interruption, dosage effect and position effect) that affect either the structure of a gene or
gene expression (Fig. 1, Table 1).
Partial heterozygous deletion /
Complete heterozygous
deletion / duplication
Partial deletion that unmasks
a recessive mutation
Fusion gene formation
Regulatory element
Recessive mutation
in an exon
Fig. 1 Gene interruption
a. CNVs affecting single genes (partial or complete, deletions or duplications); b. Unmasking
a recessive mutation by a heterozygous deletion; c. Gene interruption and fusion gene
Disease causing mechanism
Gene interruption
Change the gene structures and cause diseases.
Partial gene deletion or duplication
Cause diseases by interrupting a single gene. Partial deletion
(Fig. 1A)
of PDXDC1 is associated with hearing loss [56]. Muscular
dystrophy and lipoprotein lipase deficiency are associated
with partial duplications [57].
Heterozygous deletion can create
Cause diseases via loss of function mechanism. Opsoclonus-
hemizygous locus with a recessive
myoclonus ataxia-like syndrome [58], Charcot–Marie–Tooth
mutation (Fig. 1B)
disease type 1 [59] and Walker–Warburg syndrome [60] are
associated with compound heterozygous changes.
CNV with multiple genes can
Cause diseases via gain of function mechanism. In recent
create a fusion gene that encodes a
years, CNV breakpoints were examined to identify fusion
mRNA with an open reading frame
genes that affect intellectual disability, developmental delay,
(Fig. 1C)
autism spectrum disorders, congenital anomalies and
dysmorphic features [61, 62].
Combination of multiple CNVs
Cause complex diseases. Autism, mental retardation [63] and
that do not individually produce
developmental disorders like Smith-Magenis Syndrome [64]
are associated with these complex events.
Gene dosage
CNVs that change the number of copies of a gene present in
the genome can alter the gene expression and cause diseases.
Deletion of dosage-sensitive gene
Gene deficiency in haploinsufficient genes can cause diseases
(gene deficiency)
[65, 66].
Duplication of a dosage-sensitive
Increase in gene dosage can create imbalances due to excess
gene product and cause diseases [67].
Position effect
CNVs can alter the gene environment (enhancers, silencers
and local chromatin organization), which may change the
expression of neighbouring copy number neutral genes and
cause diseases.
Table 1 Disease causing mechanisms of CNVs
In addition to the direct involvement of genomic variants in genetic diseases, variants in
modifier genes can modulate the severity of disease-related phenotypes [68-70]. For example,
increased copy number of SMN2 is correlated with a milder phenotype of spinal muscular
atrophy caused by a mutation in SMN1 [71].
Today, linkage studies, genome-wide association studies (GWAS) [72], microarray-based
studies and high-throughput sequencing-based approaches have discovered a large number of
clinically-relevant variants that are catalogued in a number of different databases. For
example, the human genome mutation database (HGMD) [73] lists ~175,000 mutations that
are associated with or cause genetic diseases [53]. Similarly, ClinVar [74] archives ~126,000
genomic variants with clinically relevant phenotypes and DECIPHER database archives a
large number of clinically relevant CNVs [75]. Additionally, OMIM provides a detailed
gene-centric and disease-centric view of clinically relevant variants [76, 77].
While HGMD, ClinVar, DECIPHER and OMIM are the most commonly used databases with
variants contributing to a broad range of human diseases, a number of disease-centered
databases are available with a wealth of information on certain phenotypes [78]. All these
datasets improve the understanding of the mutational spectrum of a particular gene or disease
and they are widely used in diagnosing genetic diseases.
1.4 Diagnosis of genetic disease
Genetic diagnosis is the identification of genomic variants, genes or gene products that are
associated with human diseases [79]. The classical genetic diagnostic decision-making
process is guided by a detailed family history study that reveals the mode of inheritance of
the disease. Then, genetic tests are performed to confirm the diagnosis of a symptomatic
individual or individual with a family history of a specific disease. These genetic tests are
designed to detect a broad range of genomic variants [80].
1.4.1 Genetic tests to detect SNVs and indels
Since the advent of genetic testing in clinical laboratories, various techniques have been
introduced to detect disease-causing SNVs and indels [79, 81]. For example, restriction
fragment analysis [82], DNA sequencing (Sanger sequencing) [83] and allele-specific
oligonucleotide ligation assays [84, 85] are some of the initial tests developed. While these
techniques are not widely used in current protocols, a few remain useful in causal variant
identification. Particularly, Sanger sequencing is still used in diagnosing well-characterized
Mendelian disease genes (eg. cystic fibrosis: CFTR, Rett syndrome: MECP2) and to confirm
disease-causing SNVs identified by other methods [86].
The genetic diagnostic tests discussed above are designed to target specific loci, generally
identified by accurate phenotyping of patients and prior knowledge of the genetics of the
disease. Clinical and genetic heterogeneity can complicate the diagnostic process by making
it difficult to identify a likely causal gene from many candidates. Hence, targeted genetic
testing for these genes are time consuming, cost ineffective and have low success rates.
Furthermore, these tests have limited applicability in detecting rare variants and novel
variants. However, as discussed in section 1.5, current diagnostic protocols use highthroughput sequencing based methods to detect disease-causing variants.
1.4.2 Genetic tests to detect chromosomal abnormalities and CNVs
Karyotyping was introduced as the first diagnostic test to detect chromosomal abnormalities
[87]. Then, fluorescence in situ hybridisation (FISH) was introduced as a targeted method,
which detects submicroscopic anomalies in chromosomes [88, 89]. Next, comparative
genomic hybridisation (CGH) techniques [90] marked a turning point in genetic diagnosis by
enabling genome-wide CNV detection. CGH was further improved using microarrays to
perform high-resolution genome-wide CNV detection [91]. These microarrays contained 1-4
million 45-60bp long probes that target specific sequences spread over the entire human
genome sequence [92]. Then, SNV genotypes were used as genetic markers [93] in
developing genotype-arrays as an alternative platform to perform genome-wide CNV
detection [94, 95].
When microarray platforms were introduced, array results were validated using alternative
targeted methods. Initially, quantitative fluorescence PCR (QF-PCR) was used to validate
aCGH results and later multiplex ligation-dependent probe amplification (MLPA) was also
used to validate CNVs [96]. However, with recent developments in aCGH probe design and
advance bioinformatics algorithms, aCGH has become the principal CNV detection platform
in current genetic diagnostic protocols [92, 97].
Microarrays offer reliable and efficient methods for genome-wide population-scale CNV
detection [98]. However, the sensitivity and breakpoint precision of these platforms vary
greatly depending on their probe density distributions. Therefore they have variable
performance in detecting CNVs [99, 100]. For example, a microarray benchmark study has
shown that large CNVs (>1Mb) were detected with high accuracy even in arrays that have
different probe distributions, but these arrays have shown variable efficiencies in detecting
short CNVs [101]. With the availability of high-throughput sequencing data, alternative
sequencing-based approaches have been developed which can provide higher resolution CNV
1.5 Exome sequencing
With the introduction of next generation high-throughput sequencing technologies [102],
whole genomes could rapidly be characterized at single base pair resolution. The continuous
development of these technologies lead to the rapid decrease in sequencing cost and therefore
massive amounts of sequence data were generated with ever-decreasing effort [103-105].
About six years ago, when whole-genome sequencing was not available at an affordable
price, exome sequencing was introduced as a cost-effective approach to sequencing coding
regions of the human genome [106, 107].
When exome sequencing was first introduced, it was estimated that ~85% of clinically
relevant variants were in exonic regions [108]. While this figure is likely an overestimate due
to the scarcity of data on disease causality, scientists still expect the majority of causal
variants to be in the exome. Today, exome sequencing is used as an effective genetic
diagnostic test that has revolutionized the detection of novel, de novo and rare variants across
a wide range of diseases [108-112].
1.5.1 Exome-based disease diagnosis
The genome of a healthy individual differs from the human reference genome at 4.1-5.0
million sites [37]. These differences are due to genetic variants and exome sequencing has the
potential to reveal these variants that are in coding regions. For example, variant calling
programs identify 30000-50000 SNVs and 2000-6000 indels per exome (passing the quality
thresholds used in bioinformatics pipelines). Therefore identifying disease-causing mutations
from thousands of normal variants is the most challenging task in current genetic diagnostic
approaches. In order to address this issue, several different diagnostic strategies and
guidelines were introduced in recent years, laying the foundation to identify disease-causing
variants [113-115].
As detailed in these diagnostic guidelines, current pipelines use a number of different
prioritization approaches to select potentially disease-causing variant from patient samples.
For example, disease-causing variants can be found in candidate genes common to a group of
affected unrelated individuals or the same causal variant could be found in affected family
members. Similarly, de novo causal variants can be found by using a family trio (motherfather-child). Therefore exome-based diagnostic methods can be designed to increase the
probability of identifying the causal variant. When studying diseases that are known for
genetic heterogeneity, candidate genes shared among subsets of individuals can be selected
during variant prioritization.
Variant prioritization can be further improved by using population-specific datasets and
clinically relevant datasets. Here, population-specific datasets are used to filter-out common
variants and clinically relevant datasets are used to identify already known disease-relevant
variants (Table 2). Following prioritization, the next step is to identify a pathogenic variant
from the prioritized list. Here, a broad range of datasets including scoring systems that
improve variant interpretation [115], published literature, functional genomic annotations and
model organism phenotypes can be used to assess the pathogenicity of a genetic variant.
Current variant calling programs [116] detect SNVs [117] and indels [118, 119] with high
accuracy, and annotation programs provide a wealth of annotations to assist the diseasecausing variant detection [120, 121]. Today, exome-based diagnostic approaches have shown
a success rate of 25%-40% in detecting disease-causing SNVs and indels [122, 123].
However, only a 2%-3% CNV detection rate has been published for heterogeneous
Mendelian cohorts [124, 125].
Population-specific datasets
Used to identify known:
1000 genomes data
SNVs, indels and CNVs
[35-37, 42]
NHLBI GO Exome Sequencing
SNVs and indels
Database of Genomic Variants
Sanger CNV project
Project (ESP)
Clinically-relevant datasets
Used to identify disease-relevant:
Genes containing causal
SNVs, indels and CNVs
SNVs, indels and CNVs
DECIPHER database (Database
SNVs, indels and CNVs
of Chromosomal Imbalance and
Phenotype in Humans Using
Ensembl Resources)
Deciphering Developmental
Disorders (DDD)
Table 2 Public datasets used in exome-based disease diagnosis
2 Disease-causing CNV detection from
exome data
Since the introduction of exome sequencing as an effective diagnostic tool, various CNV
prediction and annotation programs have been developed to detect disease-causing CNVs
from patient exomes. These disease-causing CNV detection methods can be discussed in four
main steps: CNV prediction, annotation, causal CNV detection and validation.
2.1 CNV prediction
Computational programs use paired-end or coverage data to predict CNVs from sequence
data. Based on the most commonly used CNV prediction strategies, these programs are
grouped into three categories: paired-end-, split-read- and coverage-based CNV prediction.
In short-read sequencing (eg. Illumina sequencing [129]), paired-end reads are generated with
approximately known distances. Read alignment and mapping programs are expected to
maintain these distances [130]. If CNV breakpoints are internal to paired-end intervals, these
reads are mapped in wrong orientation or mapped at distances that are substantially different
from the expected distances [131, 132]. Therefore paired-end data is used in computational
programs to predict CNVs from sequence data (Fig. 2) [133].
Split-read- based CNV prediction programs detect CNV breakpoints using unmapped or
partially mapped (soft-clipped) reads (Fig. 3). If unmapped or partially mapped reads contain
CNV breakpoints, these programs use split-mapping algorithm to identify CNV breakpoints
at single base-pair resolution [131, 132].
Paired-end- and split-read-based methods are widely implemented to call CNVs in wholegenome sequence (WGS) data. For example, BreakDancer [134], PEMer [135],
VariationHunter [136], commonLAW [137], GASV [138], AGE [139], Pindel [118], SLOPE
[140], SRiC [141] and STRiP [142] predict CNVs in whole-genome sequences. However,
when predicting CNVs in exome sequences, paired-end information is only used in rare
instances (eg. SPLITREAD [143]).
Read 1 and read 2
Distance between
read 1 and read 2
Fig. 2 Paired-end-based CNV identification
a. Deletion (reads are mapped too far apart); b. Insertions (reads are mapped too close
together); c. Inversions (the orientation of one read in the pair is changed); d. Duplications
(reads change their relative order while retaining their original orientation)
Split read
Distance between
2 parts of split read
Fig. 3 Split-read-based CNV identification
a. Deletion (continuous stretch of gaps in the split-read); b. Insertion (continuous stretch of
gaps in the reference); c. Inversion (section of the split-read is mapped in wrong orientation);
d. Duplication (both sections of split-read alter their relative order during the alignment)
Both paired-end- and split-read-based CNV prediction methods depend on the availability of
paired-end reads encompassing or containing CNV breakpoints (Fig. 2, 3). Therefore, when
CNV breakpoints are in regions that are not captured in the exome (eg. non-coding regions),
these two methods are not capable of detecting CNVs. Therefore coverage-based methods
have been implemented to overcome this challenge, when predicting CNVs in exomes.
2.2 Coverage-based CNV prediction
In short-read sequencing, copy number variable regions generate higher or lower read counts
compared to diploid regions in the genome [144]. Therefore when calling CNVs, prediction
programs analyse the coverage distribution of exonic regions, expecting that the coverage of
an exon is proportional to the copy number of the exon (Fig. 4).
Exon 1
Exon 2
Exon 3
Control dataset
Mapped reads in
target exome1
Mapped reads in
target exome2
Increase or decrease in mapped read count
Fig. 4 Coverage-based CNV prediction
a. Copy number neutral control dataset; b. Increase in coverage of exon 2 (duplication)
compared to control dataset; c. Decrease in coverage in exon 3 compared to control dataset.
Coverage-based programs follow three main stages when predicting CNVs from exome
sequence data: coverage extraction, normalization and CNV calling. In the first stage,
alignment files (BAM file) are used to extract coverage data for target regions defined in an
exome definition file. The majority of programs extract coverage data from multiple exomes,
which are used in the second stage, coverage normalization. Here, extracted coverage data is
normalized using different statistical approaches to create a reference (background) dataset.
Finally, this normalized reference (background) dataset is used in variant calling algorithms
to predict CNVs from target exomes.
The efficiency of coverage-based programs depends on various factors that are associated
with the exome sequencing process. In exome sequencing, short exonic regions (about 200bp
[145]) and off-target regions (eg. segmental duplicates) are captured with variable
efficiencies [146]. This can affect CNV prediction by introducing bias into the statistical
models used to call CNVs. In addition to these sequencing artifacts, factors including GC
percentage, mappability, duplicated reads and ambiguously mapped reads [147] influence the
coverage distribution of target exomes and thereby affect the efficiency of CNV prediction.
As discussed in Table 3, a number of different techniques have been implemented in
coverage-based programs to minimize the effects from these issues.
Issues affecting CNV prediction
Technique used to minimize the effect
Duplicated reads
SAMtools rmdup [148] and Picard
Duplicated reads inflate the coverage of genomic
MarkDuplicates [149] can identify duplicated
regions and affect the CNV prediction quality.
reads in alignment files. Once identified,
duplicated reads can be excluded during
coverage data extraction.
Ambiguously mapped reads
Ambiguously or inaccurately mapped reads
Inaccurate read mapping can place reads in genomic
have low mapping quality scores [150].
regions regardless of the copy number state. This can
Therefore, quality score threshold can be used
introduce bias to the CNV prediction.
to exclude these reads.
GC content
Local coverage correction methods can reduce
Short-read sequencing platforms have shown a
the effect of GC content [153].
positive correlation between coverage and GC
percentage [151, 152]. Thus, GC rich regions can
have higher coverage than the AT rich regions.
Local mappability score correction methods
Reads mapped to genomic regions with low
can introduce to reduce these effects [147,
mappability scores (eg. repeat rich regions [154])
may not represent their original locations. Thus
coverage data from regions with low mappability
scores can introduce bias to the CNV prediction.
Technical artefacts in sequencing
Advance statistical methods can be
Technical artefacts could occur during the
implemented to normalize coverage data
sequencing process and these effects may vary
across the sample collection used for CNV
depending on capture technology and sequencing
prediction. For example, principal component
analysis based methods can be used to reduce
these biases [156, 157].
Table 3 Coverage-based CNV prediction: issues and strategies
Coverage-based methods were initially introduced to predict CNVs in paired case-control
samples. For example, to predict CNVs in a tumour (case) sample compared to a healthy
control [158, 159]. Soon, it was implemented in population scale CNV prediction [153].
Table 4 provides a summary of coverage-based programs that can predict CNVs in exome
sequence data.
(Google citations)
Use paired case-control exome samples to call CNVs.
Create a reference dataset from multiple exomes and call
CNVs using a method similar to ExomeCNV.
Use a negative binomial model to normalize coverage
data according to tile length and GC content. CNV
calling is performed based on a hidden markov model
Uses a robust strategy to model coverage data and to
build an optimized reference set that maximize the CNV
detection power.
Performs a singular value decomposition (SVD) based
normalization of reads per thousand bases per million
reads sequenced (RPKM).
Uses principal-component analysis (PCA) to normalize
coverage data and CNV calling is performed by a HMM.
Identify disease-associated CNVs and generate absolute
copy number genotypes at putatively associated loci.
Graphical software package to call CNVs using principal
component analysis based method.
Models coverage data using a negative binomial
distribution and creates a reference dataset when calling
Estimate CNVs based on coverage variability patterns
summarized from a set of reference samples.
Estimate CNVs using a singular value decomposition
based normalization framework for off-target coverage
Implements a three-step normalization procedure to
mitigate the effects of sequencing biases and use HMMbased segmentation algorithm to call CNVs.
Use HMM based approach to call CNVs.
Normalize coverage data to reduce the effect of
mappability and GC content.
Highly scalable copy number estimation tool based on
lattice-aligned mixture models.
Includes a stringent quality control (QC) metric that flags
low-quality exons when calling CNVs.
Table 4 Exome sequence-based CNV prediction programs
* Exome-based programs that are specially optimized to detect CNVs in cancer patients
include, VarScan 2 [173], exome2cnv [174], CoNVEX [175], CEQer [176] and ControlFREEC [177].
While a number of exome-based CNV prediction programs are currently available, only a
few are commonly used and cited in the published literature. For example, a Google citation
count of each CNV prediction program (Table 4) indicates that only a few programs are
commonly cited by the genomics community. Two benchmark studies have tested these
commonly used programs and demonstrated the potential utility in detecting disease-causing
CNVs [178, 179].
Despite the potential clinical utility of exonic CNV prediction programs, there are several
pitfalls that affect widespread implementation. For example, these programs predict
drastically different CNV counts when tested with the same sample collection. Similarly,
CNV length distributions of these programs also vary greatly from one program to another.
Moreover, some of these programs are not optimized to detect short CNVs (CNVs containing
1-4 exons) and show high false positive CNV counts [178-181]. Additionally, high stringent
CNV calling programs may have high false negative CNV counts. While these issues are not
yet resolved, the clinical importance of exome-based CNV prediction was repeatedly
highlighted in recent years [124, 178, 179, 182].
2.3 CNV annotation
Annotation programs provide information needed to understand the functional effects and
clinical relevance of predicted variants. ANNOVAR [120] and VEP [121] are the most
commonly used programs that can annotate CNVs and other genomic variants. Additionally,
CNVAnnotator [183], SG-ADVISER CNV [184] and DeAnnCNV [185] provide web-portals
for CNV annotation. These programs use different source datasets and provide a broad range
of information including gene and functional effect information, information on already
known and common CNVs, and clinically relevant information. These annotations are
effectively used to identify disease-causing CNVs in the next stage. However, since the
choice of source dataset has a major effect on the variant interpretation [186], efficiency in
disease-causing CNV identification may depend on the selected annotation program.
2.4 Causal CNV detection
Following prediction and annotation, CNVs are filtered and prioritized (using approaches
discussed in section 1.5.1) to detect potentially disease-causing variants. Then, the
pathogenicity of each of the prioritized variant is assessed to identify disease-causing CNVs
that could explain patient phenotypes.
As discussed in section 2.2, there are a number of limitations to the use of exome-based CNV
prediction programs. These limitations have a direct effect on disease-causing CNV
identification. For example, a high false positive count increases the risk of selecting a false
positive CNV as the disease-causing variant. Similarly, high false negative counts (predicting
few CNVs) increase the risk of not detecting the causal variant. Additionally, if thousands of
CNVs (eg. CONTRA and ExomeCopy) are predicted, this greatly increases the time and
effort involved in disease-causing CNV identification.
Following the identification of disease-causing CNVs, current diagnostic approaches
implement a secondary experimental step to validate computationally predicted causal
2.5 Validation of disease-causing CNVs
As discussed in section 2.2, CNV prediction programs that use exome sequence data have
shown high false positive CNV counts. Thus, in order to confirm the molecular diagnosis,
computationally detected disease-causing variants can be validated using various
experimental techniques. For example, aCGH can be used to confirm CNV predictions and
when these platforms do not contain probes covering the causal variant (eg. CNVs in
pseudogenes), targeted techniques (MLPA, QF-PCR) can be used.
The majority of commercially available aCGH platforms have low probe counts over exonic
regions. Therefore these aCGH platforms have limited applicability in detecting short exonic
CNV (with at least 3 consecutive probes). For example, the median probe spacing of Agilent
high-resolution 1x1M array is 2.1kb and only 5.3% of the array is composed of probes
covering exonic regions. Hence, scientists were motivated in developing custom solutions to
detect short exonic CNV. Recent studies have demonstrated the utility of these platforms in
confirming computationally predicted disease-causing CNVs [178, 179].
In summary, disease-causing CNVs can be detected from exome sequence data in four main
steps: CNV prediction, annotation, causal variant detection and validation. While the clinical
utility of these approaches has been demonstrated in recent years, a number of challenges that
are associated with these steps limits their widespread applicability. Therefore the main
objective of this thesis is to overcome these challenges and develop an improved method to
detect disease-causing CNVs from exome sequence data.
3 Aims of the study
The main objective of the work presented in this thesis was to develop a complete pipeline to
improve the detection of disease-causing CNVs in exome sequence data. Specific aims were
1. Evaluate the performance of commonly-used CNV prediction programs for exome
sequence data.
2. Study the ability of CNV prediction programs to predict short CNVs (containing 1-4
exons) and to develop new methods that improve short CNV detection.
3. Develop a custom aCGH to experimentally validate computational predictions and to
improve short CNV detection compared to existing arrays.
4. Study CNV prediction quality and to develop methods that can assess CNV prediction
quality scores.
5. Improve CNV annotation using multiple source databases.
4 Methodological considerations
4.1 Exome sequence analysis pipeline
Computational programs tested in our study use exome sequence data as the source for CNV
prediction. DNA samples for exome sequencing were obtained from primary
immunodeficiency diseases (PIDD) patients referred to Oslo University Hospital, Oslo
Norway. Exome capture was performed using the Agilent SureSelect Human All Exome
capture kit v.5 and Illumina rapidcapture exome. Then captured libraries were sequenced on
an Illumina HiSeq 2500 at the Norwegian Sequencing Centre (
Sequence data (fastq files) were analysed using an in-house developed exome pipeline (Fig.
5) combining NovoAlign (v2.07.17) [187], Picard (v1.74) [149], GATK (v2.4 and v2.8) [117,
188, 189] and ANNOVAR (2013Aug23)[120]. Alignment (BAM) files generated from the
pipeline showed an average coverage ranging from 124x-210x (Illumina rapidcapture exome:
124x-146x and Agilent SureSelect v5: 160x-210x) and they were used as the input for CNV
prediction programs tested in our study (section 4.2).
As shown in Fig. 5, our pipeline uses an in-house developed database and scripts to improve
variant interpretation. For example, ANNOVAR annotated SNV and indel calls were further
annotated to provide population-specific information (using an in-house variant database) and
disease-relevant information (using in-house software). During disease-relevant annotation,
SNVs and indels affecting primary immunodeficiency diseases (PIDD) candidate genes were
identified and then each variant was annotated with the phenotype and inheritance pattern.
Finally, these variants were filtered and prioritized using Filtus [190] when identifying PIDDcausing SNVs and indels.
Fig. 5 Exome sequence analysis pipeline
4.2 CNV prediction from exome sequence data
Six CNV prediction programs (ExomeCNV [160], CONTRA [161], ExomeCopy [162],
ExomeDepth [163], CoNIFER [157] and XHMM [156]) were tested on alignment files,
which were generated from the pipeline discussed in section 4.1. Additionally, a custom
algorithm - ExCopyDepth was developed using modules from ExomeCopy (coverage data
extraction module) and ExomeDepth (CNV calling module). Then, in order to validate CNVs
predicted from these programs, a custom aCGH (exaCGH) focusing exonic regions was
developed using Agilent eArray system [191]. Paper 1 and various sections of this thesis
provide detailed descriptions of steps involved in CNV prediction, exaCGH development and
CNV validation.
4.3 Resolution of microarray platforms
Since their introduction, microarray platforms have been continuously improved to increase
the coverage and resolution of genome-wide CNV detection [192]. Today, high-resolution
microarray platforms offer genome-wide CNV profiling with high accuracy [98, 193, 194].
However, due to the low probe coverage over exonic regions, these platforms show
restrictions in short exonic CNV detection. Thus, we developed a custom CGH array
(exaCGH) to perform high-resolution CNV detection in exonic regions [179]. exaCGH
contain 1 million probes and cover ~80% of exons (Havana or Ensemble exons in Gencode
v.15) with at least 4 probes. Exonic coverage of exaCGH increases up to ~90%, when
considering Gencode v.15 exons that are longer than 100bp. The proportion of Gencode v.15
exons covered by 4 or more probes is highest in exaCGH compared to Agilent 1x1M and
CytoScan HD array platforms [179].
The resolution of a microarray platform is mainly defined by the mean spacing of array
elements [100]. However, the mean probe spacing of an array does not represent variation in
local resolution, especially when array elements are not uniformly distributed throughout the
genome [100]. The characterization of local variation in probe spacing can provide more
insight into the resolution of an array and subsequently better understanding of its utility.
Hence, we compared the local probe spacing in exaCGH, Agilent 1x1M and CytoScan HD to
study the resolution of these three array platforms.
During this comparison, we first calculated the probe count of each Gencode exon and CDS
region (including 800bp flanking regions in 5’ and 3’ ends [179]). In order to compare the
variation of probe spacing in regions (Gencode exonic and CDS) that differ in length, these
regions were grouped into three categories: 1-2kb, 2-3kb and over 3kb. Then, we calculated
the mean probe spacing in each group (Fig. 6).
As expected, exaCGH had the lowest probe spacing in each group ranging ~300bp to
~350bp. Probe spacing in CytoScan HD was ranging from ~400bp to 1.2kb and Agilent
1x1M was ranging from ~1kb to ~2.4kb. This analysis confirmed that exaCGH has the
highest resolution compared to two of the tested commercially available high-resolution
arrays and optimized to detect short exonic CNVs. As a secondary verification of our array
results, several exaCGH CNV calls were randomly selected and manually visualized using
Agilent genome workbench.
Fig. 6 Resolution of microarray platforms
4.4 Considerations when identifying true positive, false
positive and false negative CNVs
Following the evaluation of exaCGH design, array experiments were performed and exaCGH
results were used to validate computationally predicted CNVs. Here, computational
predictions were compared to exaCGH results and CNVs detected by both methods were
identified as true positives. False positives CNVs were the computational predictions, which
were not detected in exaCGH. False negatives were the CNVs identified by the array but not
by the prediction program.
Prediction programs and exaCGH use different CNV breakpoint detection strategies. For
example, CNV breakpoint estimation in computational programs depends on the
segmentation algorithm used [147], while CNV breakpoints in microarrays were mainly
determined the probe density distribution [195]. Therefore computationally predicted CNV
breakpoints may differ from breakpoints identified from exaCGH results.
Due to the CNV breakpoint inconsistencies in computational predictions and exaCGH
results, a certain degree of overlap between these two datasets needs to be considered when
identifying true positives. Additionally, CNVs predicted from different computational
programs, have shown very different length distributions [179]. Therefore, the degree of
overlap between CNV predictions and exaCGH results may differ from program to program
(Fig. 7).
As shown in Fig. 7, computational predictions and exaCGH results overlap to a variable
degree depending on the length of CNVs. Therefore when identifying true positive and false
positive CNVs, it is challenging to determine the required threshold overlap. Hence, we used
1 base-pair overlap as the threshold to be less stringent in true positive identification.
Using 1 base-pair overlap between computational predictions and exaCGH results can
improve the true positive yield while minimizing the false positive CNV count. However,
selecting CNVs with very low overlap can potentially classify false positives as exaCGH
confirmed variants and thereby introduce bias into our true positive set. Therefore, true
positive CNVs were further analysed to study how true positive count (in each program)
change with the degree of overlap (between exaCGH results and computational predictions).
As shown in the cumulative frequency distributions (Fig. 8), only a small fraction of CNVs
(~0.01%-0.02%) had less than 100bp overlap between the two datasets used. Further decrease
in overlap (<100bp) clearly reduces the cumulative frequency, highlighting the presence of
few true positive CNVs. This analysis confirmed that the threshold overlap used in CNV
validation stage (1bp overlap), did not inflate the true positive count and thus, did not
introduce a significant bias into our study.
Fig. 7 Manual examination of CNVs
a. Computationally predicted CNVs and the exaCGH detection overlap to a variable degree;
b. Computational CNV predictions are internal to the exaCGH detection; c. The exaCGH
detection is internal to computationally predicted CNVs. * exaCGH parameters: minimum
number of probes – 3, minimum average absolute log ratio for deletion and duplication 0.25,
Fig. 8 Cumulative frequency distribution for true positive CNVs
5 Summary of papers
5.1 Paper 1 Identification of copy number variants from
exome sequence data
In this study, we evaluated six commonly used CNV prediction programs (ExomeCNV,
CONTRA, ExomeCopy, ExomeDepth, CoNIFER, XHMM) and also developed an in-house
modification to ExomeCopy and ExomeDepth (ExCopyDepth) to optimize short CNV
(containing 1-4 exons) prediction. Additionally, we studied how multiple prediction program
combinations can be used effectively to detect CNVs. Crucially, we developed a custom
aCGH (exaCGH) focussing on exonic regions to validate computational CNV predictions. In
this analysis, CNV prediction was performed on 30 exomes from the 1000 genomes project
and 9 exomes from patients with primary immunodeficiency diseases (PIDD). Nine samples
from each group were run in exaCGH and both computational predictions and array results
were used to identify true positive (TP) and false positive (FP) CNVs.
During the analysis, we compared CNV programs in terms of CNV prediction length, count
and the number of exons internal to predicted variants. Then we calculated the sensitivity and
false positive rate of these programs.
The comparison of computational programs showed a striking variation in the length and
number of predicted CNVs (Paper 1: Fig. 1a, b). These programs also showed a clear
variation in the number of exons internal to predicted CNVs (Paper 1: Fig. 2a, b). Compared
to individual programs, a narrow range was shown for the number exons internal to CNVs
that were predicted from multiple programs (Paper 1: Fig. 2c, d).
The number of TP CNVs identified from different programs also showed clear variation
(Paper 1: Table 2). All the programs except CoNIFER and ExomeDepth had low TP/FP
count ratios indicating high FP CNV counts (Paper 1: Table 2). Comparison of short CNVs
(CNVs containing 1-4 exons) indicated improved performance for ExomeCopy,
ExCopyDepth and intersection of ExomeCopy and ExCopyDepth compared to CoNIFER and
XHMM (Paper 1: Fig. 3).
1000 genomes exomes used in the study had higher coverage variance compared to the PIDD
patient exomes. Additionally 1000 genomes exomes were analysed using a relatively small
reference sample collection compared to PIDD patient exomes [179]. When analysing our
results, it was possible to observe how these factors (differences in two exome collections)
affect the CNV prediction. For example, CONIFER, XHMM, ExomeDepth were not able to
predict short TP CNVs from 1000 genomes exomes. However these programs predicted 4-50
short TP CNVs with relatively high TP/FP ratios (0.17-0.67) in PIDD patient exomes. With
1000 genomes exomes, intersection of ExomeCopy and ExCopyDepth showed an increased
performance (highest TP/FP ratio) compared to other programs (Paper 1: Table 3).
We then studied the consistency in assigning copy number state (CNS) when a CNV is
predicted from different programs. Here we showed that the same CNSs were assigned to
CNVs predicted from all the programs or programs that use similar variant calling methods
(Paper 1: Table 4).
In order to benchmark the accuracy of CNV prediction, we compared the sensitivity and false
positive rate (FPR) of programs tested in our study. ExomeCopy had the highest FPR (38.9%
of the samples showed FPR > 0.95) and CoNIFER had the lowest FPR. ExomeDepth,
CoNIFER and XHMM were optimized to detect rare CNVs. Therefore these programs had
high false negative counts affecting the sensitivity. For example, sensitivity of ExomeDepth,
CoNIFER and XHMM was lower than the other tested programs.
Next, the clinical utility of these programs was tested by searching for PIDD-causing
variants. All the programs except XHMM identified a PIDD-causing deletion in FANCA
(exon 26–37), demonstrating the potential clinical importance of CNV prediction programs
that use exome sequence data.
Finally, we proposed a protocol with two stages, especially applicable in candidate gene
studies. Stage one: selecting the program or program combination to predict CNVs depending
on the target exome collection and user requirements (Paper 1 provides a detailed discussion
on conditions need to be considered at this stage). Stage two: validation of computational
predictions using exaCGH experiments
In conclusion, we introduced a customized algorithm to predict short CNVs (ExCopyDepth)
and a custom aCGH (exaCGH) focussing exonic regions. Using computational programs and
exaCGH, we highlighted that programs are capable of predicting disease-causing CNVs
despite of the relatively high false positive counts. Finally, we presented a protocol that
guides the CNV prediction and validation especially in candidate gene studies.
5.2 Paper 2 cnvScan: a CNV screening and annotation tool
to improve the clinical utility of computational CNV
prediction from exome sequencing data
This paper describes a new software tool (cnvScan) that considerably improves the clinical
utility of CNV prediction from exome sequencing data. cnvScan can read input from any
prediction program, which may be affected by issues discussed in section 2.2. Amongst these
issues, high false positive CNV count is one of the main challenges in clinically relevant
CNV identification [178-180]. Therefore in Paper II, cnvScan was introduced to address this
Firstly, it was demonstrated that the default CNV prediction quality (assigned by prediction
programs) was useful as a stable measure to detect likely false positive predictions (Paper 2:
Fig. 1, Additional file 1: Fig. S2-S5). Then, a variant screening method was introduced to
calculate an additional CNV quality score. The new quality score (CNVQ) was designed to
help assess the effects of sequencing artefacts that influence FP CNV prediction.
Sequencing artefacts (eg. high coverage variance across exomes) can affect the coverage
normalization step in prediction programs and contribute to false positive CNVs. Although it
is not possible to detect and measure sequencing artefacts in normalized data, effects could be
observed using an in-house CNV database. For example, a considerable proportion of
overlapping false positives, which were predicted from multiple exomes were observed when
considering CNVs in an in-house database (Paper 2: Fig. 1, Additional file 1: Fig. S6). As
reasoned in Paper 2, these overlapping false positive predictions are likely the effects of
sequencing artefacts. Further, we showed that the common TP CNVs also have low quality
scores when prediction programs are optimized to detect rare variants. (Paper 2: Fig. 2).
Following the development of a variant screening method, cnvScan was extended with two
additional modules: CNV annotation and filtration. CNV annotation module provides
different types of functionally and clinically relevant information, and annotation on known
CNVs using a broad range of source datasets (Paper 2: Table 1). The filtration module is
capable of excluding low quality known non-disease causing CNVs using cnvScan
annotations. Clinical utility of these methods were demonstrated in Paper 2 by implementing
cnvScan in a PIDD patient cohort (n=64) and detecting three patients with PIDD-causing
CNVs (Paper 2: Fig. 4).
Overall, cnvScan can read input from any prediction program and perform a quality
assessment on predicted CNVs. Then cnvScan provides a broad range of annotations using
multiple source datasets. Finally, cnvScan filtration prioritizes potentially clinically relevant
CNVs by excluding false positive, known and non-disease causing variants. Our method
considerably improves the efficiency of disease-causing CNV detection by reducing the time
and effort in variant analysis.
5.3 Paper 3 Primary immunodeficiency diseases - genomic
approaches delineate heterogeneous Mendelian disorders
In Paper 3, the utility of exome sequencing as a diagnostic test was evaluated using a group
of patients with primary immunodeficiency diseases (PIDD). In this study, exome sequencing
was performed on a large cohort (n=278) comprising samples collected from two centres:
Oslo university hospital, Oslo, Norway (78 probands) and Baylor College of Medicine,
Texas, US (200 probands). All the patients in this cohort had been evaluated through
conventional methods (section 1.4.1, 1.4.2) and lacked an established molecular diagnosis
when included in the study. Therefore, Paper 3 describes the implementation of a broad range
of exome-sequence- based diagnostic methods to assess all known PIDD relevant genes in
these patients.
All the Oslo samples were run on the exome sequence analysis pipeline discussed in this
thesis (section 4.1). The pipeline run on Baylor samples [196] used Atlas2 suite for SNV and
indel identification [197, 198]. Variants identified in both centres were filtered using same
criterion during the prioritization, and potentially PIDD-causing variants were further studied
to detect causal variants. Identified PIDD-causing SNVs and indels were confirmed by
Sanger sequencing.
The pipeline discussed in section 6.4.2 was used to analyse CNVs from all the Oslo samples
and 80 samples from Baylor. While in Baylor, ExCopyDepth and an in-house developed
program (HMZDelFinder) were used to predict and annotate CNVs. When disease-causing
CNVs were identified, they were confirmed by a custom array (exaCGH or Baylor custom
array) or MLPA.
Seven PIDD-causing CNVs were identified with the computational methods used in both
centres. The pipeline discussed in section 6.4.2 was able to detect five PIDD-causing CNVs
(FANCA, NCF1, TERC, MAGT1, IKZF1) and three were detected in the Baylor pipeline
(PGM3, SMARCAL1, MAGT1). Exome samples containing PGM3 and SMARCAL1 CNVs
were not tested in Oslo. Thus, the ability to detect these two CNVs was not tested in the
pipeline discussed in section 6.4.2.
Our cohort study, revealed the potential molecular basis for the disease in 40% (n=110) of the
probands. Clinical diagnosis was revised in about half (60/110) and management was directly
altered in nearly a quarter (26/110) of probands based on the molecular findings. Further, we
found evidence for reduced penetrance (in autosomal dominant traits), founder mutations in
outbred populations (in autosomal recessive traits), mutation burden effect and, somatic and
revertant mosaicism contributing to disease variability.
This study resulted in the expansion of the phenotypic spectrum associated with known
disease genes, and the identification of several new PIDD-causing genes. The majority of
computationally predicted disease-causing CNVs identified in this study, were found by the
CNV pipeline discussed in this thesis.
6 Discussion
This thesis presents a complete pipeline providing a systematic approach to identify clinically
relevant CNVs. Paper 1 provides a detailed study on commonly used exome-based CNV
prediction programs while introducing a customized algorithm (ExCopyDepth) and a custom
aCGH (exaCGH). Both ExCopyDepth and exaCGH were developed to predict and
experimentally validate short exonic CNVs. Paper 2 describes a new software package
(cnvScan) that improves the clinical utility of CNV prediction programs. Finally, in Paper 3,
the clinical importance of ExCopyDepth, exaCGH and cnvScan were confirmed by detecting
disease-causing CNVs in a large patient cohort.
6.1 Performance comparison of CNV prediction
In Paper 1, we compared the performance of six commonly used CNV prediction programs.
We showed that these programs are capable of predicting CNVs with high sensitivity, which
was further highlighted by predicting disease-causing CNVs in PIDD patients. Despite these
advantages, we demonstrated that exome-based CNV prediction programs have high false
positive counts. These findings were confirmed by increasing number of recent benchmark
studies [178, 180, 181].
Although the conclusions of our performance comparison were consistent with recent
publications, it is possible to identify several key features that are not addressed in Paper 1.
For example, 1000 genomes exomes used in our study showed a high coverage variance
(mean coverage: 49.3x-199.7x), which could affect the CNV prediction efficiency. Moreover,
during the benchmark study, we had not considered the effect of CNV prediction quality
scores. Additionally, we had not provided precision and recall of CNV prediction programs,
although they are considered as the most relevant descriptors for performance, especially
when true negative counts are not available [199].
In order to address these issues, an additional analysis was designed (ExCopyDepth
prediction and exaCGH experiments) using 15 PIDD patient exomes with mean coverage
ranging from 160x-210x. Then a set of true positive and false positive CNVs was identified
to calculate precision and recall of CNV prediction programs. Since CNV counts vary
depending on the thresholds used in exaCGH experiments, array results were analysed in two
stringency levels when calculating precision and recall scores (Fig. 9).
Fig. 9 Performance comparison of CNV prediction programs
a, b, c. Precision and recall were calculated with low stringency exaCGH calls; d, e, f.
Precision and recall were calculated with high stringency exaCGH calls; a, d. Precision vs
default CNV quality score; b, e. Recall vs default CNV quality score; c, f Precision vs recall
of prediction programs; *CoNIFER does not provide CNV prediction quality by default, thus
it was not considered in this analysis; * Low stringency parameters: AMD-2 score threshold 5, minimum number of probes - 3, minimum average absolute log ratio - 0.20; * High
stringency parameters: AMD-2 score threshold - 6, minimum number of probes - 4, minimum
average absolute log ratio - 0.25
Due to the high false positive counts in CNV prediction programs [179] low precision (<0.5)
was shown for low CNV scores (<20; Fig. 9a, d). However, precision increased with
increasing CNV quality scores for all 4 CNV prediction programs. The majority of CNV
prediction programs are optimized to detect rare variants. Thus, these programs have high
false negative counts, as they are not capable of detecting all the exome-wide CNVs
identified in exaCGH experiments. Consequently, CNV programs had low recall scores (Fig.
9b, e). However, with increasing exaCGH stringency (Fig. 9e), recall of CNV prediction
programs increased drastically (0-0.06 to 0-0.13). This indicated that CNV prediction
programs have high recall (sensitivity) in detecting high-stringency CNV calls.
When considering both precision and recall (Fig. 9c, f), XHMM showed a lower performance
compared to ExomeCopy, ExCopyDepth and ExomeDepth. With high-stringency CNV calls,
ExomeCopy, ExCopyDepth and ExomeDepth showed a similar performance (Fig. 9f).
CoNIFER dose not provide quality scores in its default setting, therefore it was not included
in the comparison presented in Fig. 9. However, based on recently published literature,
CoNIFER can be considered as a program with very low false positive counts [178-180].
Precision and recall plots presented in this thesis and other similar benchmark studies suggest
that CNV prediction programs are subject to high false positive counts even though they are
capable of predicting disease-causing CNVs [178-181]. This highlights the importance of
careful evaluation and additional validation of computational CNV predictions.
6.2 Effect of cnvScan in-house database in CNV analysis
Current exome sequence analysis pipelines use in-house SNV databases to detect sequencing
artifacts in SNV calls [200, 201]. Similarly, cnvScan uses an in-house database and provides
a CNV quality assessment strategy to improve the detection of sequencing and coverage
artefacts. Here, we recommend using the same exome collection for CNV prediction and
cnvScan in-house database creation [202]. Alternatively, when prediction programs are run in
several batches of exome samples, CNVs from all these exomes can be pooled together to
create a relatively large in-house database. Changes in the content and size of the in-house
database affect quality assessment, filtration and thereby CNV interpretation. However in
Paper 2, we have not discussed how the content and size of cnvScan in-house database affect
the CNV analysis.
In order to study this effect, two in-house databases were created from two different datasets:
CNV predictions from 18 PIDD patient exomes (db-18) and CNV predictions from two
batches containing 18 and 47 PIDD exomes (db-65). Then, cnvScan was implemented using
these two in-house databases to detect PIDD-causing CNVs in exomes represented in db-18.
Two PIDD-causing CNVs (two patients from the same family with a deletion in MAGT1 and
one patient with a deletion in NCF1) were identified when db-18 was used for the cnvScan
implementation. However, only MAGT1 CNV was detected with the db-65 (NCF1 variant
was not detected). This analysis showed that the in-house database creation approach affects
the disease-causing CNV identification.
IGV visualization of NCF1 variant showed the presence of low quality overlapping database
CNVs in db-65 (Fig. 10). Therefore in the db-65 implementation, a low cnvScan quality
score (CNVQ) was calculated and NCF1 variant was filtered out. However, low quality
overlapping database CNVs were not observed in db-18 (Fig. 10). Thus, the prediction
quality of the NCF1 CNV was not affected and it was not filtered out during cnvScan db-18
Fig. 10 Effect of in-house database creation approach
a. db-18 implementation (no low quality overlapping database CNVs in db-18); b. db-65
implementation (low quality overlapping database CNVs were observed in db-65) * Bands
with higher and lower colour intensities in a and b represent CNV predictions with higher
and lower quality scores.
As mentioned earlier, CNVs in db-65 were predicted in two different batches with 18 and 47
exomes in each collection. CNV prediction quality of these two batches could differ from
each other due to various technical artifacts. Pooling CNVs in these two batches will
introduce batch effect to the cnvScan db-65 database and affect the CNV quality assessment
and filtration. For example, Table 5 shows that the numbers of CNVs identified in cnvScan
db-65 (filtration thresholds:15-65) are slightly lower than that of db-18. This highlights the
need for further optimization techniques that can minimize the batch effect. For example,
SNV prediction approaches implement quality score recalibration strategies to minimize
various types of biases [117, 189]. The current version of cnvScan avoids the batch effect by
using the same exome collection for CNV prediction and cnvScan implementation.
cnvScan filtration
Number of CNVs identified in cnvScan
db-18 in-house database
db-65 in-house database
Table 5 CNVs identified in cnvScan implementation
6.3 Characterization of true positives excluded in
cnvScan implementation
Current SNV analysis pipelines implement different filtration criteria to exclude technical
artifacts and population specific variants when searching for clinically relevant variant in
exome sequence data. For instance, variant calling programs (samtools [148] and GATK
[117, 189]), variant analysis packages (Filtus [190]) and annotation programs (VEP [121])
provide a number of options to apply various filters on SNVs and indels to find important or
interesting results.
Similarly, cnvScan and the recently published CoNVaDING [172] provide filtration
strategies to exclude low quality CNV predictions. cnvScan provides a filtration strategy for
CNVs predicted from any coverage-based prediction program. CoNVaDING deviates from
cnvScan by combining both CNV prediction and filtration in a single program.
CoNVaDING performs a coverage analysis and selects a control dataset (exomes with most
similar coverage patterns) for each target exome when calling CNVs. During this coverage
analysis, a set of quality matrices is created to describe how coverage of exons varies within
the target exome and control dataset. Then, these matrices are used in the filtration step to
improve the sensitivity and specificity. However, the clinical utility of CoNVaDING has not
yet been studied in a large patient cohort. The CNV filtration efficiency of cnvScan is
discussed in Paper 2 and the applicability of cnvScan in a large patient cohort is discussed in
Paper 3.
While cnvScan was effectively used to identify disease-causing CNVs, our tool can be biasprone due to the implemented hard-filtering step. Particularly, rare disease-causing CNVs can
be excluded when high cnvScan filtration thresholds are applied. A detailed characterization
of excluded CNVs would assist cnvScan users to select the optimum threshold, which could
be used to minimize the hard-filtration bias. Therefore an additional analysis was designed to
study the quality scores of filtered-out or excluded CNVs, especially as this was not
discussed in Paper 2.
Here, CNV prediction (ExCopyDepth) and cnvScan implementation was performed using 17
PIDD exomes. During cnvScan implementation, relatively high quality score threshold
(default CNV quality and CNVQ: 40) was used to increase the number of filter-out CNVs
(n=912). Following the filtration, CNVs excluded in cnvScan implementation were selected
for further evaluation. In parallel, exaCGH experiments were performed (on the 17 PIDD
exomes) to identify true positive and false positive CNVs within excluded variants.
The quality score analysis of excluded variants (Fig. 11) showed that true positives have
higher quality scores compared to false positive CNVs (Fig. 11a). These excluded true
positives may contain disease-causing rare variants as well as known variants that are
reported in public databases. Therefore these CNVs were further examined to identify known
and novel variant within excluded true positives. Here, CNVs reported in public databases
(manually curated DGV, 1000 genomes CNVs and sanger CNVs) were identified as known
variants, while CNVs not yet reported in public datasets were identified as novel variants.
Then we compared the default quality score and CNVQ distributions of these identified
CNVs. Both known and novel true positive CNVs were spread over lower and higher ends of
default quality score spectrum (Fig. 11b). However, when considering CNVQ distributions,
known variants were spread over a wider range compared to novel CNVs (Fig. 11c).
Fig. 11 Analysis of CNVs excluded in cnvScan
a. Default quality scores of false positive (FP) and true positive (TP) CNVs filtered out
(excluded) in cnvScan implementation; b. Default quality score distribution of known and
novel TP CNVs excluded in cnvScan; c. CNVQ distribution of known and novel TP CNVs
excluded in cnvScan.
The number of novel true positive CNVs excluded in cnvScan was 32 (~3.5% of the total
CNVs excluded) and they were found in at least 4 patient samples. These CNVs could either
be population-specific common variants that are not yet reported in public databases or
disease-causing CNVs that are present in a subset of patients (n≥4).
Previous studies have demonstrated that CNVs present in <10% of the population can cause
or be associated with a broad range of diseases [203-205]. Thus, it is important not to exclude
these variants during filtration. Based on Fig. 11, low cnvScan thresholds can be used to
retain the majority of these novel CNVs. However, there is a risk of excluding a small
fraction of novel CNVs. Therefore, we updated the cnvScan filtration script to produce two
results files: a primary result file containing high quality CNVs identified in cnvScan and a
secondary file containing low quality excluded novel variants. Hence, cnvScan users can
examine the primary result file when searching for disease-causing CNVs. If a causal variant
was not found, users can search the secondary results file.
6.4 Comparison with other disease-causing CNV
detection methods
6.4.1 Advantages of cnvScan annotation compared to other annotation
Today, a number of programs are available for scientists to annotate CNV predictions; for
example, VEP [121], ANNOVAR [120], CNVannotator [183], DeAnnCNV [185]. Despite
the availability of these tools, cnvScan was extended to provide additional information on
CNVs, especially using sources that are not available in other methods. For example,
clinically relevant information from DECIPHER DDD [128] and, known and common CNVs
from high quality manually curated database of genomic variants (DGV) [206] were not
provided in the majority of other CNV annotation programs. Similarly, these programs
provide a limited set of information on novel CNVs when pre-computed scores such as
haploinsufficiency scores [66] and gene intolerance scores [207] are useful in functional
interpretation of novel CNVs.
A recent study has shown that variant interpretation can be affected by the source transcript
dataset used in annotation programs [186]. For example, CNVAnnotator uses a combination
of Refseq, UCSC and ENCODE data to annotate genes and other genomic features [183],
which could increase the risk of misinterpreting annotated CNVs. cnvScan avoids this issue
by using Gencode dataset and provides high quality, consistent and accurate CNV
6.4.2 Rate of disease-causing CNV detection
Paper 3 discussed in the thesis uses a PIDD patient cohort and describes the utility of our
method in identifying disease-causing CNVs. Patient exomes tested in our method were
collected from the Department of Medical Genetics, Oslo University Hospital (n=78) and
Baylor College of Medicine, Texas (n=80).
ExCopyDepth and cnvScan were able to identify five disease-causing CNVs from these
patient populations: four PIDD-causing CNVs (FANCA, MAGT1, NCF1, TREC) in five
patients from the Oslo cohort (positive detection rate: 6.4%) and two PIDD-causing CNVs
(MAGT1, IKZF1) in the Baylor patient cohort (positive detection rate: 3.75%). All these
CNVs were validated by either exaCGH or MLPA experiments.
Our cohort study reported higher positive detection rates (Oslo: 6.4%, Baylor: 3.75%)
compared to other heterogeneous Mendelian cohorts and recently published large clinical
studies [123-125]. This was mainly due to the carefully designed steps of our complete
pipeline: CNV prediction, cnvScan implementation, pathogenicity assessment and exaCGH
validation (Fig. 12).
Fig. 12 Disease-causing CNV detection pipeline
a. Computational steps: CNV prediction (using alignment files and exome definition files as
main inputs) and cnvScan implementation (using cnvScan quality thresholds, in-house
database and various annotation datasets); b. Manual step: pathogenicity assessment of each
prioritized variant using cnvScan annotations and other relevant resources; c. Experimental
step: if disease-causing CNVs are identified in the previous step, they are confirmed using
exaCGH experiments.
During the implementation of the pipeline (Fig. 12), a low stringency prediction program
(ExCopyDepth) was used to minimize the false negative count when calling CNVs. Then
during cnvScan implementation, predicted variants were filtered and prioritized to reduce
false positive CNV count. Following the prioritization, cnvScan annotations and other
relevant information were used for effective CNV interpretation and pathogenicity
assessment. If disease-causing CNVs were identified, exaCGH experiments were performed
to confirm computational predictions.
When comparing positive detection rates of Oslo and Baylor samples, the Oslo group had
higher detection rates than the Baylor group. This could be due to the batch-wise
implementation of the Oslo samples. For example, ExCopyDepth prediction and cnvScan
implementation was performed on 20-30 exomes sequenced in 1-3 consecutive batches.
However, all the Baylor exomes (n=80) were pooled together, potentially increasing the
coverage variance across the collection and thereby affecting the CNV prediction efficiency.
Based on these observations, we suggest that ExCopyDepth and ExomeDepth CNV
prediction can be further optimized by pooling exomes depending on their sequencing
batches. However, when using XHMM and CoNIFER, CNV prediction efficiency can
increase with large exome collections [156, 157, 208]
6.4.3 Short CNV prediction
As shown in Table 6, our CNV pipeline detected PIDD-causing variants with a broad length
spectrum. For example, the FANCA CNV contained 11 exons and only one exon was internal
to the TERC variant. Moreover, 40% of the PIDD-causing CNVs detected in our study
contained 1-2 exons. Therefore, we have demonstrated the ability to detect disease-causing
short CNVs (containing < 4 exons), when the majority of CNV prediction programs are not
optimized to predict short CNVs [156, 157, 166, 208].
Short CNV detection efficiency in prediction programs depends on various algorithmic
features. In ExomeCopy and ExCopyDepth, short CNV prediction was improved due to the
implemented tile length normalization step [162, 179].
When considering CoNIFER and XHMM, both programs use principal component analysis
based normalization methods to remove experimental and systematic artefacts. CoNIFER
CNV calling algorithm search for consecutive runs of three or more targets with normalized
scores that are higher than a specified hard threshold [157, 179]. Therefore, CoNIFER
restricts its ability to predict CNVs with 1-3 exons. XHMM deviates from CoNIFER by
implementing a hidden markov model (HMM) to assess the quality of each CNV signal [156,
179, 208], as opposed to hard thresholds. Thus XHMM shows improved performance in
detecting short CNVs compared to CoNIFER.
Gene (region)
Number of
exons (Length)
(22273 bp)
(16253 bp)
(15219 bp)
(3600 bp)
(15350 bp)
Oslo and
Table 6 PIDD-causing variants identified by using the pipeline discussed in section 6.4.2
6.4.4 Utility of exaCGH in detecting short CNVs
As discussed in methodological considerations, exaCGH was optimized to detect short
exonic CNVs compared to commercial arrays. This directly contributed to the diseasecausing short CNV yield of our study. For example, a PIDD-causing single exon deletion in
TERC was detected in exaCGH with 7 probes, whereas CytoScan HD and Agilent 1x1M
arrays did not contain probes covering the variant. Identification of TERC deletion highlights
the advantage of using exaCGH over other commercial arrays. In recent years, custom arrays
were used in number of studies to confirm computationally predicted short exonic CNVs with
variable degree of success [124, 178, 209].
In conclusion, this thesis discusses a complete pipeline optimized to identify disease-causing
CNVs in exome sequence data while solving problems faced by other commonly used
methods. Our pipeline is adaptable to user requirements and greatly improves the time and
effort in clinically relevant CNV identification [179, 202, 210]. The pipeline was proven to
be effective in identifying CNVs with a broad length spectrum and have a positive detection
rate of 3.7%-6.4%. Therefore this initiative will provide necessary resources for exome-based
CNV detection in current genetic diagnostic protocols and improve the diagnosis of patients
with genetic diseases.
7 Future directions
7.1 Optimizations to reduce the batch effect in cnvScan
As discussed in section 6.2, when CNVs were predicted from several batches of exomes,
CNV prediction quality could differ from one batch to another (batch effect). Therefore when
implementing the current version of cnvScan on multiple batches of exomes, separate inhouse databases need to be created and a filtration threshold needs to be selected per batch.
In order to minimize the bias introduced by the batch effect, default CNV quality scores
assigned by prediction programs need to be recalibrated using an advance statistical method.
Here, special measures needed to be introduced, so that the recalibrated scores do not vary
depending on the exome collection used in the prediction program.
With these recalibrated CNV scores, all the in-house CNVs can be pooled to create a
relatively large cnvScan database without introducing the batch effect. A larger in-house
database would lead to the development of an advance CNVQ (cnvScan quality score)
calculation method that could improve the confidence in CNV quality assessment. Both
recalibrated quality scores and an advanced cnvScan quality assessment strategy, would help
select an optimum cnvScan filtration threshold during variant prioritization.
7.2 Whole-genome CNV analysis
Whole-genome sequencing (WGS) has the potential to capture genome-wide SNVs, indels
and CNVs in one experiment. Therefore, WGS will be an attractive alternative to exome
sequencing when available at an affordable price [211-214].
Prediction programs tested in our study use exomes as the source to detect CNVs. Therefore
these programs will be replaced when WGS become a widespread genetic diagnostic tool.
However, cnvScan accept CNV calls from any coverage-based prediction program as the
input [202]. Thus, cnvScan can be extended to analyse coverage-based CNV calls from WGS
data. Moreover, paired-end and split-read approaches [215] can be introduced in cnvScan as a
secondary CNV evaluation step. For example, when evaluating a CNV, cnvScan can search
for reads with insert sizes that significantly deviate from the expected length [216] or
partially mapped reads to identify CNV breakpoints [141]. Thus, cnvScan can be extended to
assess, further evaluate, annotate and filter CNVs predicted from WGS data.
Comparative genomic hybridization
Copy number state
Copy number variant
cnvScan quality score
Database of chromosomal imbalance and phenotype in Humans using
ensembl resources
Exon focussed CGH array
Fluorescence in situ hybridization
False positives
False positive rate
Genome-wide association studies
Human gene mutation database
Hidden markov model
Insertion or deletion
Multiplex ligation-dependent probe amplification
Online Mendelian Inheritance in Man
Principal component analysis
Primary immunodeficiency diseases
Quantitative fluorescence PCR (polymerase chain reaction)
Reads per thousand bases per million reads sequenced
Single nucleotide variant
Singular value decomposition
True positive
Commonly used terms
Allele: Variant forms of the same gene, in the same locus on homologous chromosomes.
Copy number variant (CNV): A DNA sequence that is 1kb or larger and is present at a
variable copy number in comparison with a reference genome.
Gene: A DNA sequence or locus that encode a functional RNA or protein product and is the
molecular unit of heredity.
Genetic loci: Specific regions that are mapped within a genome.
Genome: Complete set of genetic material present in a cell.
Genotype: The genetic constitution of the individual.
High-throughput sequencing (next-generation sequencing): Coverall term used to
describe a number of different modern sequencing technologies (eg. Illumina sequencing
/Short-read sequencing). These technologies enable DNA and RNA sequencing much more
quickly and cheaply than the previously used Sanger sequencing technology.
Inversion: A segment of DNA that is reversed in orientation with respect to the rest of the
Karyotype: An ordered array of metaphase chromosomes from a photomicrograph of a
single cell nucleus arranged in pairs in descending order of size and according to the position
of the centromere.
Paired-end reads: paired reads originating from the same template molecule and separated
by a known distance.
Precision (positive detection rate): true positive count / (true positive count + false positive
Recall (sensitivity): true positive count / (true positive count + false negative count)
Segmental duplication or low-copy repeat: A segment of DNA >1kb in size that occurs in
two or more copies per haploid genome, with the different copies sharing >90% sequence
identity. They are often variable in copy number and can therefore also be CNVs.
Soft-clipped reads: Partially mapped reads with bases that are not part of the alignment.
Structural variant (SV): Term that encompasses a group of microscopic or submicroscopic
genomic alterations involving segments of DNA. Structural variants include CNVs
comprising deletions and duplications, translocations and inversions.
Translocation: A change in position of a chromosomal segment within a genome.
Translocations can be intra- or inter-chromosomal.
International Human Genome Sequencing Consortium: Initial sequencing and
analysis of the human genome. Nature 2001, 409(6822):860-921.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell
M, Evans CA, Holt RA et al: The sequence of the human genome. Science 2001,
International Human Genome Sequencing Consortium: Finishing the euchromatic
sequence of the human genome. Nature 2004, 431(7011):931-945.
Ezkurdia I, Juan D, Rodriguez JM, Frankish A, Diekhans M, Harrow J, Vazquez J,
Valencia A, Tress ML: Multiple evidence strands suggest that there may be as few
as 19,000 human protein-coding genes. Hum Mol Genet 2014, 23(22):5866-5878.
Pruitt KD, Katz KS, Sicotte H, Maglott DR: Introducing RefSeq and LocusLink:
curated human genome resources at the NCBI. Trends Genet 2000, 16(1):44-47.
Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources.
Nucleic Acids Res 2001, 29(1):137-140.
ENCODE Project Consortium: The ENCODE (ENCyclopedia Of DNA Elements)
Project. Science 2004, 306(5696):636-640.
Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T, Harrow JL: The
vertebrate genome annotation (Vega) database. Nucleic Acids Res 2008,
36(Database issue):D753-760.
Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell
CM, Loveland JE, Ruef BJ et al: The consensus coding sequence (CCDS) project:
Identifying a common protein-coding gene set for the human and mouse
genomes. Genome research 2009, 19(7):1316-1323.
ENCODE Project Consortium: Identification and analysis of functional elements
in 1% of the human genome by the ENCODE pilot project. Nature 2007,
Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J,
Gilbert JG, Storey R, Swarbreck D et al: GENCODE: producing a reference
annotation for ENCODE. Genome Biol 2006, 7(Suppl 1):S4 1-9.
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken
BL, Barrell D, Zadissa A, Searle S et al: GENCODE: the reference human genome
annotation for The ENCODE Project. Genome research 2012, 22(9):1760-1774.
Maglott DR, Katz KS, Sicotte H, Pruitt KD: NCBI's LocusLink and RefSeq.
Nucleic Acids Res 2000, 28(1):126-128.
Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J,
Curwen V, Down T et al: The Ensembl genome database project. Nucleic Acids
Res 2002, 30(1):38-41.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D:
The human genome browser at UCSC. Genome research 2002, 12(6):996-1006.
Kellis M, Wold B, Snyder MP, Bernstein BE, Kundaje A, Marinov GK, Ward LD,
Birney E, Crawford GE, Dekker J et al: Defining functional DNA elements in the
human genome. Proc Natl Acad Sci U S A 2014, 111(17):6131-6138.
Mullaney JM, Mills RE, Pittard WS, Devine SE: Small insertions and deletions
(INDELs) in human genomes. Hum Mol Genet 2010, 19(R2):R131-136.
Feuk L, Carson AR, Scherer SW: Structural variation in the human genome.
Nature reviews Genetics 2006, 7(2):85-97.
Inoue K, Lupski JR: Molecular mechanisms for genomic disorders. Annu Rev
Genomics Hum Genet 2002, 3:199-242.
Hastings PJ, Lupski JR, Rosenberg SM, Ira G: Mechanisms of change in gene copy
number. Nature reviews Genetics 2009, 10(8):551-564.
Lee JA, Carvalho CM, Lupski JR: A DNA replication mechanism for generating
nonrecurrent rearrangements associated with genomic disorders. Cell 2007,
Hastings PJ, Ira G, Lupski JR: A microhomology-mediated break-induced
replication model for the origin of human copy number variation. PLoS genetics
2009, 5(1):e1000327.
Hurles M, Lupski J: Recombination Hotspots in Nonallelic Homologous
Recombination. Genomic Disorders, 2006: 341-355.
Perry GH, Ben-Dor A, Tsalenko A, Sampas N, Rodriguez-Revenga L, Tran CW,
Scheffer A, Steinfeld I, Tsang P, Yamada NA et al: The fine-scale and complex
architecture of human copy-number variation. American journal of human
genetics 2008, 82(3):685-695.
Kidd JM, Graves T, Newman TL, Fulton R, Hayden HS, Malig M, Kallicki J, Kaul R,
Wilson RK, Eichler EE: A human genome structural variation sequencing
resource reveals insights into mutational mechanisms. Cell 2010, 143(5):837-847.
Arlt MF, Wilson TE, Glover TW: Replication stress and mechanisms of CNV
formation. Curr Opin Genet Dev 2012, 22(3):204-210.
Conrad DF, Bird C, Blackburne B, Lindsay S, Mamanova L, Lee C, Turner DJ,
Hurles ME: Mutation spectrum revealed by breakpoint sequencing of human
germline CNVs. Nat Genet 2010, 42(5):385-391.
Ottaviani D, LeCain M, Sheer D: The role of microhomology in genomic structural
variation. Trends Genet 2014, 30(3):85-94.
Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry
S, Mullikin JC, Mortimore BJ, Willey DL et al: A map of human genome sequence
variation containing 1.42 million single nucleotide polymorphisms. Nature 2001,
International HapMap Consortium: The International HapMap Project. Nature
2003, 426(6968):789-796.
International HapMap Consortium: A haplotype map of the human genome. Nature
2005, 437(7063):1299-1320.
International HapMap Consortium: Integrating common and rare genetic variation
in diverse human populations. Nature 2010, 467(7311):52-58.
Wellcome Trust Case Control Consortium: Genome-wide association study of
CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature
2010, 464(7289):713-720.
Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD,
Barnes C, Campbell P et al: Origins and functional impact of copy number
variation in the human genome. Nature 2010, 464(7289):704-712.
The 1000 Genomes Project Consortium: A map of human genome variation from
population-scale sequencing. Nature 2010, 467(7319):1061-1073.
The 1000 Genomes Project Consortium: An integrated map of genetic variation
from 1,092 human genomes. Nature 2012, 491(7422):56-65.
The 1000 Genomes Project Consortium: A global reference for human genetic
variation. Nature 2015, 526(7571):68-74.
Exome Sequencing Project (ESP) 6500 exome data
Fu W, O'Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, Gabriel S, Rieder MJ,
Altshuler D, Shendure J et al: Analysis of 6,515 exomes reveals the recent origin of
most human protein-coding variants. Nature 2013, 493(7431):216-220.
The UK10K Consortium: The UK10K project identifies rare variants in health
and disease. Nature 2015, 526(7571):82-90.
Sudmant PH, Mallick S, Nelson BJ, Hormozdiari F, Krumm N, Huddleston J, Coe
BP, Baker C, Nordenfelt S, Bamshad M et al: Global diversity, population
stratification, and selection of human copy-number variation. Science 2015,
Clarke L, Zheng-Bradley X, Smith R, Kulesha E, Xiao C, Toneva I, Vaughan B,
Preuss D, Leinonen R, Shumway M et al: The 1000 Genomes Project: data
management and community access. Nat Methods 2012, 9(5):459-462.
Griffiths AJF: Modern Genetic Analysis: W.H. Freeman; 1999.
Mendel G, Bateson W: Experiments in plant-hybridisation / By Gregor Mendel.
Harvard University Press; 1925.
Jin W, Qin P, Lou H, Jin L, Xu S: A systematic characterization of genes
underlying both complex and Mendelian diseases. Hum Mol Genet 2012,
Youssoufian H, Pyeritz RE: Mechanisms and consequences of somatic mosaicism
in humans. Nature reviews Genetics 2002, 3(10):748-758.
OMIM Entry Statistics []
Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and
survey. Nucleic Acids Res 2002, 30(17):3894-3900.
Capon F, Allen MH, Ameen M, Burden AD, Tillman D, Barker JN, Trembath RC: A
synonymous SNP of the corneodesmosin gene leads to increased mRNA stability
and demonstrates association with psoriasis across diverse ethnic groups. Hum
Mol Genet 2004, 13(20):2361-2368.
Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K: A comprehensive review of
genetic association studies. Genet Med 2002, 4(2):45-61.
Thomas PD, Kejariwal A: Coding single-nucleotide polymorphisms associated
with complex vs. Mendelian disease: evolutionary evidence for differences in
molecular effects. Proc Natl Acad Sci U S A 2004, 101(43):15398-15403.
Farh KK, Marson A, Zhu J, Kleinewietfeld M, Housley WJ, Beik S, Shoresh N,
Whitton H, Ryan RJ, Shishkin AA et al: Genetic and epigenetic fine mapping of
causal autoimmune disease variants. Nature 2015, 518(7539):337-343.
Stenson PD, Mort M, Ball EV, Shaw K, Phillips A, Cooper DN: The Human Gene
Mutation Database: building a comprehensive mutation repository for clinical
and molecular genetics, diagnostic testing and personalized genomic medicine.
Hum Genet 2014, 133(1):1-9.
Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic
traits: closer to the resolution of phenotypic to genotypic variability. Nature
reviews Genetics 2007, 8(8):639-646.
Girirajan S, Campbell CD, Eichler EE: Human copy number variation and
complex genetic disease. Annu Rev Genet 2011, 45:203-226.
Haraksingh RR, Jahanbani F, Rodriguez-Paris J, Gelernter J, Nadeau KC, Oghalai JS,
Schrijver I, Snyder MP: Exome sequencing and genome-wide copy number
variant mapping reveal novel associations with sensorineural hereditary hearing
loss. BMC genomics 2014, 15:1155.
Hu X, Worton RG: Partial gene duplication as a cause of human disease. Hum
Mutat 1992, 1(1):3-12.
Blumkin L, Kivity S, Lev D, Cohen S, Shomrat R, Lerman-Sagie T, Leshinsky-Silver
E: A compound heterozygous missense mutation and a large deletion in the
KCTD7 gene presenting as an opsoclonus-myoclonus ataxia-like syndrome. J
Neurol 2012, 259(12):2590-2598.
Abe A, Nakamura K, Kato M, Numakura C, Honma T, Seiwa C, Shirahata E, Itoh A,
Kishikawa Y, Hayasaka K: Compound heterozygous PMP22 deletion mutations
causing severe Charcot-Marie-Tooth disease type 1. J Hum Genet 2010,
Czeschik JC, Hehr U, Hartmann B, Ludecke HJ, Rosenbaum T, Schweiger B,
Wieczorek D: 160 kb deletion in ISPD unmasking a recessive mutation in a
patient with Walker-Warburg syndrome. Eur J Med Genet 2013, 56(12):689-694.
Holt R, Sykes NH, Conceicao IC, Cazier JB, Anney RJ, Oliveira G, Gallagher L,
Vicente A, Monaco AP, Pagnamenta AT: CNVs leading to fusion transcripts in
individuals with autism spectrum disorder. Eur J Hum Genet 2012, 20(11):11411147.
Newman S, Hermetz KE, Weckselblatt B, Rudd MK: Next-generation sequencing of
duplication CNVs reveals that most are tandem and some create fusion genes at
breakpoints. American journal of human genetics 2015, 96(2):208-220.
Ullmann R, Turner G, Kirchhoff M, Chen W, Tonge B, Rosenberg C, Field M,
Vianna-Morgante AM, Christie L, Krepischi-Santos AC et al: Array CGH identifies
reciprocal 16p13.1 duplications and deletions that predispose to autism and/or
mental retardation. Hum Mutat 2007, 28(7):674-682.
Stankiewicz P, Bi W, Lupski J: Smith-Magenis Syndrome Deletion, Reciprocal
Duplication dup(17)(p11.2p11.2), and Other Proximal 17p Rearrangements.
Genomic Disorders 2006, 179-191.
Dang VT, Kassahn KS, Marcos AE, Ragan MA: Identification of human
haploinsufficient genes and their genomic proximity to segmental duplications.
Eur J Hum Genet 2008, 16(11):1350-1357.
Huang N, Lee I, Marcotte EM, Hurles ME: Characterising and predicting
haploinsufficiency in the human genome. PLoS genetics 2010, 6(10):e1001154.
Lee C, Scherer SW: The clinical context of copy number variation in the human
genome. Expert Rev Mol Med 2010, 12:e8.
Cutting GR: Modifier genes in Mendelian disorders: the example of cystic
fibrosis. Ann N Y Acad Sci 2010, 1214:57-69.
Artuso R, Papa FT, Grillo E, Mucciolo M, Yasui DH, Dunaway KW, Disciglio V,
Mencarelli MA, Pollazzon M, Zappella M et al: Investigation of modifier genes
within copy number variations in Rett syndrome. J Hum Genet 2011, 56(7):508515.
Jiang Q, Ho YY, Hao L, Nichols Berrios C, Chakravarti A: Copy number variants
in candidate genes are genetic modifiers of Hirschsprung disease. PLoS One 2011,
Mailman MD, Heinz JW, Papp AC, Snyder PJ, Sedra MS, Wirth B, Burghes AH,
Prior TW: Molecular analysis of spinal muscular atrophy and modification of the
phenotype by SMN2. Genet Med 2002, 4(1):20-26.
Orr N, Chanock S: Common genetic variation and human disease. Adv Genet
2008, 62:1-32.
Krawczak M, Ball EV, Fenton I, Stenson PD, Abeysinghe S, Thomas N, Cooper DN:
Human gene mutation database-a biomedical information and research
resource. Hum Mutat 2000, 15(1):45-51.
Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR:
ClinVar: public archive of relationships among sequence variation and human
phenotype. Nucleic Acids Res 2014, 42(Database issue):D980-985.
Bragin E, Chatzimichali EA, Wright CF, Hurles ME, Firth HV, Bevan AP,
Swaminathan GJ: DECIPHER: database for the interpretation of phenotypelinked plausibly pathogenic sequence and copy-number variation. Nucleic Acids
Res 2014, 42(Database issue):D993-D1000.
Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian
Inheritance in Man (OMIM), a knowledgebase of human genes and genetic
disorders. Nucleic Acids Res 2005, 33(Database issue):D514-517.
McKusick VA: Mendelian Inheritance in Man and its online version, OMIM.
American journal of human genetics 2007, 80(4):588-604.
Cotton RG, Auerbach AD, Axton M, Barash CI, Berkovic SF, Brookes AJ, Burn J,
Cutting G, den Dunnen JT, Flicek P et al: GENETICS. The Human Variome
Project. Science 2008, 322(5903):861-862.
Landegren U, Kaiser R, Caskey CT, Hood L: DNA diagnostics--molecular
techniques and automation. Science 1988, 242(4876):229-237.
McPherson E: Genetic diagnosis and testing in clinical practice. Clin Med Res
2006, 4(2):123-129.
Glickman RM, Phillips MA, Glickman BW: Introduction to DNA-Based Genetic
Diagnostics. Can Fam Physician 1988, 34:883-889.
Southern EM: Detection of specific sequences among DNA fragments separated
by gel electrophoresis. J Mol Biol 1975, 98(3):503-517.
Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating
inhibitors. Proc Natl Acad Sci U S A 1977, 74(12):5463-5467.
Conner BJ, Reyes AA, Morin C, Itakura K, Teplitz RL, Wallace RB: Detection of
sickle cell beta S-globin allele by hybridization with synthetic oligonucleotides.
Proc Natl Acad Sci U S A 1983, 80(1):278-282.
Landegren U, Kaiser R, Sanders J, Hood L: A ligase-mediated gene detection
technique. Science 1988, 241(4869):1077-1080.
Sanders SJ, Murtha MT, Gupta AR, Murdoch JD, Raubeson MJ, Willsey AJ, ErcanSencicek AG, DiLullo NM, Parikshak NN, Stein JL et al: De novo mutations
revealed by whole-exome sequencing are strongly associated with autism. Nature
2012, 485(7397):237-241.
Drets ME, Shaw MW: Specific banding patterns of human chromosomes. Proc
Natl Acad Sci U S A 1971, 68(9):2073-2077.
Bauman JG, Wiegant J, Borst P, van Duijn P: A new method for fluorescence
microscopical localization of specific DNA sequences by in situ hybridization of
fluorochromelabelled RNA. Exp Cell Res 1980, 128(2):485-490.
Van Prooijen-Knegt AC, Van Hoek JF, Bauman JG, Van Duijn P, Wool IG, Van der
Ploeg M: In situ hybridization of DNA sequences in human metaphase
chromosomes visualized by an indirect fluorescent immunocytochemical
procedure. Exp Cell Res 1982, 141(2):397-407.
Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel
D: Comparative genomic hybridization for molecular cytogenetic analysis of
solid tumors. Science 1992, 258(5083):818-821.
Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL,
Chen C, Zhai Y et al: High resolution analysis of DNA copy number variation
using comparative genomic hybridization to microarrays. Nature genetics 1998,
Haraksingh RR, Abyzov A, Gerstein M, Urban AE, Snyder M: Genome-wide
mapping of copy number variation in humans: comparative analysis of high
resolution array platforms. PLoS One 2011, 6(11):e27859.
Schlotterer C: The evolution of molecular markers--just a matter of fashion?
Nature reviews Genetics 2004, 5(1):63-69.
Chen X, Sullivan PF: Single nucleotide polymorphism genotyping: biochemistry,
protocol, cost and throughput. Pharmacogenomics J 2003, 3(2):77-96.
Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK: A high-resolution
survey of deletion polymorphism in the human genome. Nat Genet 2006, 38(1):7581.
Lee C, Iafrate AJ, Brothman AR: Copy number variations and clinical cytogenetic
diagnosis of constitutional disorders. Nature genetics 2007, 39(7 Suppl):S48-54.
Hayes JL, Tzika A, Thygesen H, Berri S, Wood HM, Hewitt S, Pendlebury M, Coates
A, Willoughby L, Watson CM et al: Diagnosis of copy number variation by
Illumina next generation sequencing is comparable in performance to
oligonucleotide array comparative genomic hybridisation. Genomics 2013,
Miller DT, Adam MP, Aradhya S, Biesecker LG, Brothman AR, Carter NP, Church
DM, Crolla JA, Eichler EE, Epstein CJ et al: Consensus statement: chromosomal
microarray is a first-tier clinical diagnostic test for individuals with
developmental disabilities or congenital anomalies. Am J Hum Genet 2010,
Baumbusch LO, Aaroe J, Johansen FE, Hicks J, Sun H, Bruhn L, Gunderson K,
Naume B, Kristensen VN, Liestol K et al: Comparison of the Agilent,
ROMA/NimbleGen and Illumina platforms for classification of copy number
alterations in human breast tumors. BMC genomics 2008, 9:379.
Coe BP, Ylstra B, Carvalho B, Meijer GA, Macaulay C, Lam WL: Resolving the
resolution of array CGH. Genomics 2007, 89(5):647-653.
Curtis C, Lynch AG, Dunning MJ, Spiteri I, Marioni JC, Hadfield J, Chin SF,
Brenton JD, Tavare S, Caldas C: The pitfalls of platform comparison: DNA copy
number array technologies assessed. BMC genomics 2009, 10:588.
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J,
Braverman MS, Chen YJ, Chen Z et al: Genome sequencing in microfabricated
high-density picolitre reactors. Nature 2005, 437(7057):376-380.
Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol 2008,
Hayden EC: Technology: The $1,000 genome. Nature 2014, 507(7492):294-295.
Metzker ML: Sequencing technologies - the next generation. Nature reviews
Genetics 2010, 11(1):31-46.
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong
M, Bhattacharjee A, Eichler EE et al: Targeted capture and massively parallel
sequencing of 12 human exomes. Nature 2009, 461(7261):272-276.
Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure
J: Exome sequencing as a tool for Mendelian disease gene discovery. Nature
reviews Genetics 2011, 12(11):745-755.
Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloglu A,
Ozen S, Sanjad S et al: Genetic diagnosis by whole exome capture and massively
parallel DNA sequencing. Proc Natl Acad Sci U S A 2009, 106(45):19096-19101.
Biesecker LG: Exome sequencing makes medical genomics a reality. Nat Genet
2010, 42(1):13-14.
Mardis ER: A decade's perspective on DNA sequencing technology. Nature 2011,
Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, Braxton A,
Beuten J, Xia F, Niu Z et al: Clinical whole-exome sequencing for the diagnosis of
mendelian disorders. N Engl J Med 2013, 369(16):1502-1511.
Rabbani B, Tekin M, Mahdieh N: The promise of whole-exome sequencing in
medical genetics. J Hum Genet 2014, 59(1):5-15.
Ball MP, Thakuria JV, Zaranek AW, Clegg T, Rosenbaum AM, Wu X, Angrist M,
Bhak J, Bobe J, Callow MJ et al: A public resource facilitating clinical use of
genomes. Proc Natl Acad Sci U S A 2012, 109(30):11920-11927.
MacArthur DG, Manolio TA, Dimmock DP, Rehm HL, Shendure J, Abecasis GR,
Adams DR, Altman RB, Antonarakis SE, Ashley EA et al: Guidelines for
investigating causality of sequence variants in human disease. Nature 2014,
Koboldt DC, Larson DE, Sullivan LS, Bowne SJ, Steinberg KM, Churchill JD, Buhr
AC, Nutter N, Pierce EA, Blanton SH et al: Exome-based mapping and variant
prioritization for inherited Mendelian disorders. American journal of human
genetics 2014, 94(3):373-384.
Hwang S, Kim E, Lee I, Marcotte EM: Systematic comparison of variant calling
pipelines using gold standard personal exome variants. Sci Rep 2015, 5:17875.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis
AA, del Angel G, Rivas MA, Hanna M et al: A framework for variation discovery
and genotyping using next-generation DNA sequencing data. Nature genetics
2011, 43(5):491-498.
Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth
approach to detect break points of large deletions and medium sized insertions
from paired-end short reads. Bioinformatics 2009, 25(21):2865-2871.
Ye K, Wang J, Jayasinghe R, Lameijer EW, McMichael JF, Ning J, McLellan MD,
Xie M, Cao S, Yellapantula V et al: Systematic discovery of complex insertions
and deletions in human cancers. Nat Med 2016, 22(1):97-104.
Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic
variants from high-throughput sequencing data. Nucleic Acids Res 2010,
McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F: Deriving the
consequences of genomic variants with the Ensembl API and SNP Effect
Predictor. Bioinformatics 2010, 26(16):2069-2070.
Sawyer SL, Schwartzentruber J, Beaulieu CL, Dyment D, Smith A, Warman Chardon
J, Yoon G, Rouleau GA, Suchowersky O, Siu V et al: Exome sequencing as a
diagnostic tool for pediatric-onset ataxia. Hum Mutat 2014, 35(1):45-49.
Yang Y, Muzny DM, Xia F, Niu Z, Person R, Ding Y, Ward P, Braxton A, Wang M,
Buhay C et al: Molecular findings among patients referred for clinical wholeexome sequencing. JAMA 2014, 312(18):1870-1879.
Retterer K, Scuffins J, Schmidt D, Lewis R, Pineda-Alvarez D, Stafford A, Schmidt
L, Warren S, Gibellini F, Kondakova A et al: Assessing copy number from exome
sequencing and exome array CGH based on CNV spectrum in a large clinical
cohort. Genet Med 2015, 17(8):623-629.
Epilepsy Phenome/Genome Project & Epi4K Consortium: Copy number variant
analysis from exome data in 349 patients with epileptic encephalopathy. Ann
Neurol 2015, 78(2):323-328.
MacDonald JR, Ziman R, Yuen RK, Feuk L, Scherer SW: The Database of Genomic
Variants: a curated collection of structural variation in the human genome.
Nucleic Acids Res 2014, 42(Database issue):D986-992.
Firth HV, Richards SM, Bevan AP, Clayton S, Corpas M, Rajan D, Van Vooren S,
Moreau Y, Pettett RM, Carter NP: DECIPHER: Database of Chromosomal
Imbalance and Phenotype in Humans Using Ensembl Resources. Am J Hum
Genet 2009, 84(4):524-533.
Firth HV, Wright CF, Study DDD: The Deciphering Developmental Disorders
(DDD) study. Dev Med Child Neurol 2011, 53(8):702-703.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall
KP, Evers DJ, Barnes CL, Bignell HR et al: Accurate whole human genome
sequencing using reversible terminator chemistry. Nature 2008, 456(7218):53-59.
Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling
variants using mapping quality scores. Genome Res 2008, 18(11):1851-1858.
Medvedev P, Stanciu M, Brudno M: Computational methods for discovering
structural variation with next-generation sequencing. Nat Methods 2009, 6(11
Alkan C, Coe BP, Eichler EE: Genome structural variation discovery and
genotyping. Nat Rev Genet 2011, 12(5):363-376.
Rausch T, Zichner T, Schlattl A, Stutz AM, Benes V, Korbel JO: DELLY:
structural variant discovery by integrated paired-end and split-read analysis.
Bioinformatics 2012, 28(18):i333-i339.
Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD,
Wendl MC, Zhang Q, Locke DP et al: BreakDancer: an algorithm for highresolution mapping of genomic structural variation. Nat Methods 2009, 6(9):677681.
Korbel JO, Abyzov A, Mu XJ, Carriero N, Cayting P, Zhang Z, Snyder M, Gerstein
MB: PEMer: a computational framework with simulation-based error models
for inferring genomic structural variants from massive paired-end sequencing
data. Genome Biol 2009, 10(2):R23.
Hormozdiari F, Hajirasouliha I, Dao P, Hach F, Yorukoglu D, Alkan C, Eichler EE,
Sahinalp SC: Next-generation VariationHunter: combinatorial algorithms for
transposon insertion discovery. Bioinformatics 2010, 26(12):i350-357.
Hormozdiari F, Hajirasouliha I, McPherson A, Eichler EE, Sahinalp SC:
Simultaneous structural variation discovery among multiple paired-end
sequenced genomes. Genome research 2011, 21(12):2203-2212.
Sindi S, Helman E, Bashir A, Raphael BJ: A geometric approach for classification
and comparison of structural variants. Bioinformatics 2009, 25(12):i222-230.
Abyzov A, Gerstein M: AGE: defining breakpoints of genomic structural variants
at single-nucleotide resolution, through optimal alignments with gap excision.
Bioinformatics 2011, 27(5):595-603.
Abel HJ, Duncavage EJ, Becker N, Armstrong JR, Magrini VJ, Pfeifer JD: SLOPE: a
quick and accurate method for locating non-SNP structural variation from
targeted next-generation sequence data. Bioinformatics 2010, 26(21):2684-2688.
Zhang ZD, Du J, Lam H, Abyzov A, Urban AE, Snyder M, Gerstein M:
Identification of genomic indels and structural variations using split reads. BMC
genomics 2011, 12:375.
Handsaker RE, Korn JM, Nemesh J, McCarroll SA: Discovery and genotyping of
genome structural polymorphism by sequencing on a population scale. Nat Genet
2011, 43(3):269-276.
Karakoc E, Alkan C, O'Roak BJ, Dennis MY, Vives L, Mark K, Rieder MJ,
Nickerson DA, Eichler EE: Detection of structural variants and indels within
exome data. Nat Methods 2012, 9(2):176-178.
Dalca AV, Brudno M: Genome variation discovery with high-throughput
sequencing data. Brief Bioinform 2010, 11(1):3-14.
Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch
MJ, Albert TJ, Hannon GJ et al: Genome-wide in situ exon capture for selective
resequencing. Nat Genet 2007, 39(12):1522-1527.
Shigemizu D, Momozawa Y, Abe T, Morizono T, Boroevich KA, Takata S,
Ashikawa K, Kubo M, Tsunoda T: Performance comparison of four commercial
human whole-exome capture platforms. Sci Rep 2015, 5:12742.
Magi A, Tattini L, Pippucci T, Torricelli F, Benelli M: Read count approach for
DNA copy number variants detection. Bioinformatics 2012, 28(4):470-478.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G,
Durbin R, Genome Project Data Processing S: The Sequence Alignment/Map
format and SAMtools. Bioinformatics 2009, 25(16):2078-2079.
Picard []
Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 2009, 25(14):1754-1760.
Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short
read data sets from high-throughput DNA sequencing. Nucleic Acids Res 2008,
Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork
NJ, Murray SS, Topol EJ, Levy S et al: Evaluation of next generation sequencing
platforms for population targeted sequencing studies. Genome Biol 2009,
Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of
copy number variants using read depth of coverage. Genome research 2009,
Derrien T, Estelle J, Marco Sola S, Knowles DG, Raineri E, Guigo R, Ribeca P: Fast
computation and applications of genome mappability. PLoS One 2012,
Magi A, Tattini L, Cifola I, D'Aurizio R, Benelli M, Mangano E, Battaglia C, Bonora
E, Kurg A, Seri M et al: EXCAVATOR: detecting copy number variants from
whole-exome sequencing data. Genome Biol 2013, 14(10):R120.
Fromer M, Moran JL, Chambert K, Banks E, Bergen SE, Ruderfer DM, Handsaker
RE, McCarroll SA, O'Donovan MC, Owen MJ et al: Discovery and statistical
genotyping of copy-number variation from whole-exome sequencing depth. Am J
Hum Genet 2012, 91(4):597-607.
Krumm N, Sudmant PH, Ko A, O'Roak BJ, Malig M, Coe BP, Project NES, Quinlan
AR, Nickerson DA, Eichler EE: Copy number variation detection and genotyping
from exome sequence data. Genome Res 2012, 22(8):1525-1532.
Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings
LA, Leroy C, Edkins S, Hardy C et al: Identification of somatically acquired
rearrangements in cancer using genome-wide massively parallel paired-end
sequencing. Nat Genet 2008, 40(6):722-729.
Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C,
Meyerson M, Lander ES: High-resolution mapping of copy-number alterations
with massively parallel sequencing. Nat Methods 2009, 6(1):99-103.
Sathirapongsasuti JF, Lee H, Horst BA, Brunner G, Cochran AJ, Binder S,
Quackenbush J, Nelson SF: Exome sequencing-based copy-number variation and
loss of heterozygosity detection: ExomeCNV. Bioinformatics 2011, 27(19):26482654.
Li J, Lupat R, Amarasinghe KC, Thompson ER, Doyle MA, Ryland GL, Tothill RW,
Halgamuge SK, Campbell IG, Gorringe KL: CONTRA: copy number analysis for
targeted resequencing. Bioinformatics 2012, 28(10):1307-1313.
Love MI, Mysickova A, Sun R, Kalscheuer V, Vingron M, Haas SA: Modeling read
counts for CNV detection in exome sequencing data. Stat Appl Genet Mol Biol
2011, 10(1).
Plagnol V, Curtis J, Epstein M, Mok KY, Stebbings E, Grigoriadou S, Wood NW,
Hambleton S, Burns SO, Thrasher AJ et al: A robust model for read count data in
exome sequencing experiments and implications for copy number variant
calling. Bioinformatics 2012, 28(21):2747-2754.
Coin LJ, Cao D, Ren J, Zuo X, Sun L, Yang S, Zhang X, Cui Y, Li Y, Jin X et al: An
exome sequencing pipeline for identifying and genotyping common CNVs
associated with disease with application to psoriasis. Bioinformatics 2012,
Shi Y, Majewski J: FishingCNV: a graphical software package for detecting rare
copy number variations in exome-sequencing data. Bioinformatics 2013,
Backenroth D, Homsy J, Murillo LR, Glessner J, Lin E, Brueckner M, Lifton R,
Goldmuntz E, Chung WK, Shen Y: CANOES: detecting rare copy number
variants from whole exome sequencing data. Nucleic Acids Res 2014, 42(12):e97.
Wang C, Evans JM, Bhagwate AV, Prodduturi N, Sarangi V, Middha M, Sicotte H,
Vedell PT, Hart SN, Oliver GR et al: PatternCNV: a versatile tool for detecting
copy number changes from exome sequencing data. Bioinformatics 2014,
Bellos E, Coin LJ: cnvOffSeq: detecting intergenic copy number variation using
off-target exome sequencing data. Bioinformatics 2014, 30(17):i639-645.
Miyatake S, Koshimizu E, Fujita A, Fukai R, Imagawa E, Ohba C, Kuki I, Nukui M,
Araki A, Makita Y et al: Detecting copy-number variations in whole-exome
sequencing data using the eXome Hidden Markov Model: an 'exome-first'
approach. J Hum Genet 2015, 60(4):175-182.
Jiang Y, Oldridge DA, Diskin SJ, Zhang NR: CODEX: a normalization and copy
number variation detection method for whole exome sequencing. Nucleic Acids
Res 2015, 43(6):e39.
Packer JS, Maxwell EK, O'Dushlaine C, Lopez AE, Dewey FE, Chernomorsky R,
Baras A, Overton JD, Habegger L, Reid JG: CLAMMS: a scalable algorithm for
calling common and rare copy number variants from exome sequencing data.
Bioinformatics 2016, 32(1):133-135.
Johansson LF, van Dijk F, de Boer EN, van Dijk-Bos KK, Jongbloed JD, van der
Hout AH, Westers H, Sinke RJ, Swertz MA, Sijmons RH et al: CoNVaDING: Single
Exon Variation Detection in Targeted NGS Data. Hum Mutat 2016, 37(5):457-64.
Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis
ER, Ding L, Wilson RK: VarScan 2: somatic mutation and copy number
alteration discovery in cancer by exome sequencing. Genome research 2012,
Valdes-Mas R, Bea S, Puente DA, Lopez-Otin C, Puente XS: Estimation of copy
number alterations from exome sequencing data. PLoS One 2012, 7(12):e51422.
Amarasinghe KC, Li J, Halgamuge SK: CoNVEX: copy number variation
estimation in exome sequencing data using HMM. BMC Bioinformatics 2013, 14
(Suppl 2):S26.
Piazza R, Magistroni V, Pirola A, Redaelli S, Spinelli R, Redaelli S, Galbiati M,
Valletta S, Giudici G, Cazzaniga G et al: CEQer: a graphical tool for copy number
and allelic imbalance detection from whole-exome sequencing data. PLoS One
2013, 8(10):e74825.
Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, JanoueixLerosey I, Delattre O, Barillot E: Control-FREEC: a tool for assessing copy
number and allelic content using next-generation sequencing data. Bioinformatics
2012, 28(3):423-425.
de Ligt J, Boone PM, Pfundt R, Vissers LE, Richmond T, Geoghegan J, O'Moore K,
de Leeuw N, Shaw C, Brunner HG et al: Detection of clinically relevant copy
number variants with whole-exome sequencing. Hum Mutat 2013, 34(10):14391448.
Samarakoon PS, Sorte HS, Kristiansen BE, Skodje T, Sheng Y, Tjonnfjord GE,
Stadheim B, Stray-Pedersen A, Rodningen OK, Lyle R: Identification of copy
number variants from exome sequence data. BMC genomics 2014, 15:661.
Guo Y, Sheng Q, Samuels DC, Lehmann B, Bauer JA, Pietenpol J, Shyr Y:
Comparative study of exome copy number variation estimation tools using array
comparative genomic hybridization as control. Biomed Res Int 2013, 2013:915636.
Tan R, Wang Y, Kleinstein SE, Liu Y, Zhu X, Guo H, Jiang Q, Allen AS, Zhu M: An
evaluation of copy number variation detection tools from whole-exome
sequencing data. Hum Mutat 2014, 35(7):899-907.
Jo HY, Park MH, Woo HM, Han M, Kim BY, Choi BO, Chung KW, Koo SK:
Application of whole-exome sequencing for detecting copy number variants in
CMT1A/HNPP. Clin Genet 2015, doi: 10.1111/cge.12714.
Zhao M, Zhao Z: CNVannotator: a comprehensive annotation server for copy
number variation in the human genome. PLoS One 2013, 8(11):e80170.
Erikson GA, Deshpande N, Kesavan BG, Torkamani A: SG-ADVISER CNV: copynumber variant annotation and interpretation. Genet Med 2015, 17(9):714-718.
Zhang Y, Yu Z, Ban R, Zhang H, Iqbal F, Zhao A, Li A, Shi Q: DeAnnCNV: a tool
for online detection and annotation of copy number variations from wholeexome sequencing data. Nucleic Acids Res 2015, 43(W1):W289-94.
McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier JB, Donnelly P:
Choice of transcripts and software has a large effect on variant annotation.
Genome Med 2014, 6(3):26.
Novocraft Novoalign []
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella
K, Altshuler D, Gabriel S, Daly M et al: The Genome Analysis Toolkit: a
MapReduce framework for analyzing next-generation DNA sequencing data.
Genome Res 2010, 20(9):1297-1303.
Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, LevyMoonshine A, Jordan T, Shakir K, Roazen D, Thibault J et al: From FastQ data to
high confidence variant calls: the Genome Analysis Toolkit best practices
pipeline. Curr Protoc Bioinformatics 2013, 11(1110):11.10.11-11.10.33.
Vigeland MD, Gjotterud KS, Selmer KK: FILTUS: a desktop GUI for fast and
efficient detection of disease-causing variants, including a novel autozygosity
detector. Bioinformatics 2016, doi:10.1093/bioinformatics/btw046.
eArray []
Gentien D, Reyes C: Microarrays-Based Molecular Profiling to Identify Genomic
Alterations. Pan-cancer Integrative Molecular Portrait Towards a New Paradigm in
Precision Medicine 2015, Chapter 4: 31-45.
Banerjee D: Array comparative genomic hybridization: an overview of protocols,
applications, and technology trends. Methods Mol Biol 2013, 973:1-13.
Uddin M, Thiruvahindrapuram B, Walker S, Wang Z, Hu P, Lamoureux S, Wei J,
MacDonald JR, Pellecchia G, Lu C et al: A high-resolution copy-number variation
resource for clinical and population genetics. Genet Med 2015, 17(9):747-752.
Carter NP: Methods and strategies for analyzing copy number variation using
DNA microarrays. Nat Genet 2007, 39(7 Suppl):S16-21.
Reid JG, Carroll A, Veeraraghavan N, Dahdouli M, Sundquist A, English A,
Bainbridge M, White S, Salerno W, Buhay C et al: Launching genomics into the
cloud: deployment of Mercury, a next generation sequence analysis pipeline.
BMC Bioinformatics 2014, 15:30.
Shen Y, Wan Z, Coarfa C, Drabek R, Chen L, Ostrowski EA, Liu Y, Weinstock GM,
Wheeler DA, Gibbs RA et al: A SNP discovery method to assess variant allele
probability from next-generation resequencing data. Genome Res 2010,
Challis D, Yu J, Evani US, Jackson AR, Paithankar S, Coarfa C, Milosavljevic A,
Gibbs RA, Yu F: An integrative variant analysis suite for whole exome nextgeneration sequencing data. BMC Bioinformatics 2012, 13:8.
Saito T, Rehmsmeier M: The precision-recall plot is more informative than the
ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One
2015, 10(3):e0118432.
Lemke JR, Riesch E, Scheurenbrand T, Schubach M, Wilhelm C, Steiner I, Hansen J,
Courage C, Gallati S, Burki S et al: Targeted next generation sequencing as a
diagnostic tool in epileptic disorders. Epilepsia 2012, 53(8):1387-1398.
Vissers LE, Fano V, Martinelli D, Campos-Xavier B, Barbuti D, Cho TJ, Dursun A,
Kim OH, Lee SH, Timpani G et al: Whole-exome sequencing detects somatic
mutations of IDH1 in metaphyseal chondromatosis with D-2-hydroxyglutaric
aciduria (MC-HGA). Am J Med Genet A 2011, 155A(11):2609-2616.
Samarakoon PS, Sorte HS, Stray-Pedersen A, Rodningen OK, Rognes T, Lyle R:
cnvScan: a CNV screening and annotation tool to improve the clinical utility of
computational CNV prediction from exome sequencing data. BMC genomics
2016, 17(1):51.
Beaudet AL: The utility of chromosomal microarray analysis in developmental
and behavioral pediatrics. Child Dev 2013, 84(1):121-132.
Lopes LR, Murphy C, Syrris P, Dalageorgou C, McKenna WJ, Elliott PM, Plagnol V:
Use of high-throughput targeted exome-sequencing to screen for copy number
variation in hypertrophic cardiomyopathy. Eur J Med Genet 2015, 58(11):611616.
Hoyer H, Braathen GJ, Eek AK, Nordang GB, Skjelbred CF, Russell MB: Copy
number variations in a population-based study of Charcot-Marie-Tooth disease.
Biomed Res Int 2015, 2015:960404.
Zarrei M, MacDonald JR, Merico D, Scherer SW: A copy number variation map of
the human genome. Nature reviews Genetics 2015, 16(3):172-183.
Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB: Genic intolerance to
functional variation and the interpretation of personal genomes. PLoS genetics
2013, 9(8):e1003709.
Fromer M, Purcell SM: Using XHMM Software to Detect Copy Number Variation
in Whole-Exome Sequencing Data. Curr Protoc Hum Genet 2014, 81:7.23.2127.23.21.
de Ligt J, Boone PM, Pfundt R, Vissers LE, de Leeuw N, Shaw C, Brunner HG,
Lupski JR, Veltman JA, Hehir-Kwa JY: Platform comparison of detecting copy
number variants with microarrays and whole-exome sequencing. Genom Data
2014, 2:144-146.
Stray-Pedersen A, Sorte HS, Samarakoon PS, Gambin T, Chinn IK, et al: Primary
immunodeficiency diseases - genomic approaches delineate heterogeneous
Mendelian disorders. . J. Allergy Clin. Immunol 2016.
Ali-Khan SE, Daar AS, Shuman C, Ray PN, Scherer SW: Whole genome scanning:
resolving clinical diagnosis and management amidst complex data. Pediatr Res
2009, 66(4):357-363.
Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, Nazareth L,
Bainbridge M, Dinh H, Jing C, Wheeler DA et al: Whole-genome sequencing in a
patient with Charcot-Marie-Tooth neuropathy. N Engl J Med 2010, 362(13):11811191.
Herdewyn S, Zhao H, Moisse M, Race V, Matthijs G, Reumers J, Kusters B,
Schelhaas HJ, van den Berg LH, Goris A et al: Whole-genome sequencing reveals a
coding non-pathogenic variant tagging a non-coding pathogenic hexanucleotide
repeat expansion in C9orf72 as cause of amyotrophic lateral sclerosis. Hum Mol
Genet 2012, 21(11):2412-2419.
Belkadi A, Bolze A, Itan Y, Cobat A, Vincent QB, Antipenko A, Shang L, Boisson B,
Casanova JL, Abel L: Whole-genome sequencing is more powerful than wholeexome sequencing for detecting exome variants. Proc Natl Acad Sci U S A 2015,
Pirooznia M, Goes FS, Zandi PP: Whole-genome CNV analysis: advances in
computational approaches. Front Genet 2015, 6:138.
Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM,
Palejev D, Carriero NJ, Du L et al: Paired-end mapping reveals extensive
structural variation in the human genome. Science 2007, 318(5849):420-426.