Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Computational prediction of diseasecausing CNVs from exome sequence data Pubudu Saneth Samarakoon Department of Medical Genetics, Faculty of Medicine UNIVERSITY OF OSLO June 2016 © Pubudu Saneth Samarakoon, 2017 Series of dissertations submitted to the Faculty of Medicine, University of Oslo ISBN 978-82-8333-382-4 All rights reserved. No part of this publication may be reproduced or transmitted, in any form or by any means, without permission. Cover: Hanne Baadsgaard Utigard. Print production: Reprosentralen, University of Oslo. Acknowledgements First and foremost I want to thank my supervisors; Robert Lyle and Torbjørn Rognes, you have been tremendous mentors for me. Robert, thank you very much for listening to all my questions and ideas, and guiding me during this period. Your input on academic writing and our discussions on various scientific topics truly helped me to grow as a research scientist. I am grateful to Torbjørn for introducing me to advance computational methods and for sharing your programming knowledge. I am thankful to Norwegian quota scholarship fund for providing financial support for the PhD. I would like to express my gratitude to our research group, Hanne Sorte, Asbjørg StrayPedersen and Olaug Rødningen. Our fruitful discussions were invaluable and clearly helped me to expand my skills and expertise to new horizons. A warm thank to all my colleagues and friends at the department of Medical Genetics, University of Oslo. I owe a special thanks to my dear colleagues Arvind Sundaram and Marius Bjørnstad for sharing the office, serious discussions and for always lightening up my day. I am truly thankful to my family and friends in Norway and Sri Lanka, especially my wife Nisanka for sharing all my ups and downs. Also baby Sayon, who joined my life in the latter part of my PhD. List of publications Paper 1 Samarakoon PS, Sorte HS, Kristiansen BE, Skodje T, Sheng Y, Tjønnfjord GE, Stadheim B, Stray-Pedersen A, Rødningen OK and Lyle R. Identification of copy number variants from exome sequence data. BMC Genomics. 2014;15:661. doi: 10.1186/1471-2164-15-661. Paper 2 Samarakoon PS, Sorte HS, Stray-Pedersen A, Rødningen OK, Rognes T and Lyle R. cnvScan: a CNV screening and annotation tool to improve the clinical utility of computational CNV prediction from exome sequencing data. BMC Genomics 2016; 17(1):51. doi: 10.1186/s12864-016-2374-2. Paper 3 Stray-Pedersen A, Sorte HS, Samarakoon PS, Gambin T, Chinn IK, Akdemir ZHC and et al. Primary immunodeficiency diseases - genomic approaches delineate heterogeneous Mendelian disorders. Journal of Allergy and Clinical Immunology 2016; Table of content 1 Introduction ......................................................................................................................... 1 1.1 The human genome................................................................................................................... 1 1.2 Variation in the human genome ................................................................................................ 2 1.3 Genetic diseases ........................................................................................................................ 3 1.3.1 Disease-causing genetic variants ....................................................................................... 4 1.4 Diagnosis of genetic disease ..................................................................................................... 7 1.4.1 Genetic tests to detect SNVs and indels ............................................................................ 7 1.4.2 Genetic tests to detect chromosomal abnormalities and CNVs ......................................... 8 1.5 Exome sequencing .................................................................................................................... 9 1.5.1 Exome-based disease diagnosis ......................................................................................... 9 2 Disease-causing CNV detection from exome data ............................................................ 12 2.1 2.2 2.3 2.4 2.5 3 4 Aims of the study............................................................................................................... 20 Methodological considerations .......................................................................................... 21 4.1 4.2 4.3 4.4 5 CNV prediction ....................................................................................................................... 12 Coverage-based CNV prediction ............................................................................................ 14 CNV annotation ...................................................................................................................... 17 Causal CNV detection ............................................................................................................ 18 Validation of disease-causing CNVs ...................................................................................... 18 Exome sequence analysis pipeline.......................................................................................... 21 CNV prediction from exome sequence data ........................................................................... 23 Resolution of microarray platforms ........................................................................................ 23 Considerations when identifying true positive, false positive and false negative CNVs ....... 24 Summary of papers ............................................................................................................ 28 5.1 Paper 1 Identification of copy number variants from exome sequence data .......................... 28 5.2 Paper 2 cnvScan: a CNV screening and annotation tool to improve the clinical utility of computational CNV prediction from exome sequencing data .......................................................... 30 5.3 Paper 3 Primary immunodeficiency diseases - genomic approaches delineate heterogeneous Mendelian disorders.......................................................................................................................... 31 6 Discussion.......................................................................................................................... 33 6.1 Performance comparison of CNV prediction programs ......................................................... 33 6.2 Effect of cnvScan in-house database in CNV analysis ........................................................... 35 6.3 Characterization of true positives excluded in cnvScan implementation ............................... 37 6.4 Comparison with other disease-causing CNV detection methods .......................................... 40 6.4.1 Advantages of cnvScan annotation compared to other annotation programs.................. 40 6.4.2 Rate of disease-causing CNV detection .......................................................................... 41 6.4.3 Short CNV prediction ...................................................................................................... 43 6.4.4 Utility of exaCGH in detecting short CNVs .................................................................... 44 7 Future directions ................................................................................................................ 45 7.1 7.2 Optimizations to reduce the batch effect in cnvScan implementation ................................... 45 Whole-genome CNV analysis ................................................................................................ 45 Abbreviations ........................................................................................................................... 47 Commonly used terms ............................................................................................................. 48 References ................................................................................................................................ 50 1 Introduction 1.1 The human genome The human genome contains ~3.2 billion bases packaged into 23 pairs of chromosomes. Since the completion of the human genome project [1, 2], scientists have been motivated to develop high-resolution maps and annotation datasets representing a broad spectrum of genomic elements in the human genome. These maps and genome annotation datasets include genes and other functional elements, repetitive elements, mobile elements and genomic variants. In 2004, with the availability of the complete euchromatic sequence of the human genome, scientists estimated that the human genome contains 20000-25000 protein-coding genes [3], which was recently revised downwards to 19,815 (GENCODE version 24; August 2015 freeze, GRCh38) [4]. These genes contain exons (protein-coding regions) that account for ~1.8% of the genome, introns (non-coding regions) and regulatory regions. RefSeq [5, 6], ENCODE [7], HAVANA [8] and CCDS [9] were the early initiatives, which are still involved in providing comprehensive genomic annotations. For example, the ENCODE project was originally launched to identify and characterize functional elements in 1% of the human genome [7, 10], and later it was scaled up to provide integrated high-quality annotations for the entire genome (GENCODE) [11, 12]. Similarly, HAVANA, initiated by Ensembl, is dedicated to providing high-quality manually curated genes. In order to effectively utilize this vast amount of information, various visualization solutions were developed for scientists to browse the entre genome with relevant annotations. NCBI [13], Ensembl [14], UCSC genome browser [15] were a few of the key efforts initiated to provide broader access to this information. Today, various methods (eg. genetic, evolutionary and biochemical approaches) are used to improve our knowledge on structure and content of the human genome, propelling a broad 1 1 range of current research and applications [16]. In particular, this has fuelled the development of systematic approaches that identify genes and genetic variants underlying human diseases. 1.2 Variation in the human genome Variants in the human genome are the genetic differences between individuals, within and among populations. Genetic variants are changes in DNA sequence or structural alterations of the genome and are classified into three groups: sequence variants (eg. single nucleotide variants, insertions and deletions), structural variants (eg. copy number variants, translocations and inversions) and changes in chromosome number (eg. aneuploidy). Single nucleotide variants (SNVs) are single base pair changes in the genome of an organism. Insertions and deletions (indels) are changes in DNA sequence where short DNA fragments are inserted or deleted [17]. Chemical and physical mutagens, and errors occurring during DNA replication can induce SNV and indel formation. Copy number variants are a class of structural variants containing deletions and duplications [18]. Errors in DNA repair and DNA replication pathways contribute to CNV formation via several mechanisms (eg. non-allelic homologous recombination - NAHR [19], nonhomologous end joining - NHEJ [20], microhomology mediated end joining - MMEJ [20], fork stalling and template switching - FoSTeS [21] and microhomology mediated break induced replication - MMBIR [22]). The majority of these mechanisms are mediated by homologous or microhomologous regions [23-26]. Thus, most of the CNVs in human genome are enriched within and near these regions (eg. 70% of the deletion breakpoints are in microhomologous regions) [27, 28]. Our current understanding of genetic variation comes from analysing the genomes of a large number of individuals. The human genome project [3], the SNP consortium (dbSNP) [29] and the international HapMap project [30, 31] were the early efforts that identified ~10 million variants, primarily SNVs. Early large-scale efforts focused on detecting CNVs include the HapMap project phase III [32] and Wellcome Trust Sanger CNV project [33, 34]. Collectively these projects identified a vast amount of common variants (frequency >5%) found in a limited number of populations. 2 While these initial efforts were limited by the available technology, recent studies, fuelled by more advanced variant detection methods (section 1.5) characterized a large amount of common and rare variants in a broad range of populations worldwide. For example, a total of 88 million variants, including 84.7 million SNVs, 3.6 million indels and 60000 structural variants, were characterized with the completion of the 1000 genomes project [35-37]. In addition to this, the NHLBI GO Exome Sequencing Project (ESP) [38, 39] and UK10K project [40] were other efforts initiated to identify common and rare genetic variants. Considering all the variants identified from these projects, scientists have estimated the average genome-wide variant counts (SNVs: ~3.7 million and indels: ~35000 [36]) and the average genome-wide coverage of CNVs (contributing to 217.1Mbp, 7.01% of the genome) [41]. Based on these figures, SNVs are the most frequent type of variant and structural variants affect more bases than other types of genetic variants in the genome of an individual [37]. With the goal of improving personal health, these vast amounts of population-specific genomic variants were made available in public data repositories [42]. Current diagnostic approaches use these datasets to help distinguish disease and non-disease causing variants in patients with genetic diseases. 1.3 Genetic diseases Today, the vast majority of human diseases are believed to have some genetic contribution. Based on the nature of this genetic contribution, these diseases are grouped into Mendelian diseases, complex diseases and chromosomal diseases. Mendelian diseases [43] are caused by mutations at a single genetic locus that segregates in families according to Mendel’s theory of inheritance [44]. Therefore Mendelian diseases are also known as single-gene diseases. In contrast, complex diseases are associated with multiple factors (gene and environmental factors) and do not follow Mendel’s laws [45]. While the majority of diseases are caused by alterations in genes, chromosomal aberrations 3 3 can change the structure of the genome and cause chromosomal diseases including aneuploidy and a range of cancer types [46]. 1.3.1 Disease-causing genetic variants About 4000 disease-related genes have been identified to date [47]. These genes carry variants (SNVs, indels or CNVs) that are either associated with or cause human diseases. SNVs in exonic regions may cause genetic diseases by affecting the structure of proteins (primary and secondary) [48], mRNA stability [49] or gene expression (eg. missense and nonsense mutations). Indels in exonic regions also cause diseases when a variant alters the reading frame of a gene (eg. frame shift mutation) [17]. SNVs and indels in gene regulatory regions may affect gene expression and cause diseases. Today, a broad spectrum of diseases ranging from infectious diseases to cancer have been associated with SNVs and indels [5052]. While SNVs and indels account for the majority of disease-causing variants identified to date [53], CNVs represent an important subset of variants that cause genetic diseases. CNVs can cause a broad range of common and complex diseases [54, 55] via three mechanisms (gene interruption, dosage effect and position effect) that affect either the structure of a gene or gene expression (Fig. 1, Table 1). 4 Partial heterozygous deletion / duplication a. Complete heterozygous deletion / duplication b. Partial deletion that unmasks a recessive mutation c. Deletion Duplication Fusion gene formation Exon Regulatory element CNV Deletion Recessive mutation in an exon Fig. 1 Gene interruption a. CNVs affecting single genes (partial or complete, deletions or duplications); b. Unmasking a recessive mutation by a heterozygous deletion; c. Gene interruption and fusion gene formation. 5 5 Disease causing mechanism Comments Gene interruption Change the gene structures and cause diseases. i. Partial gene deletion or duplication Cause diseases by interrupting a single gene. Partial deletion (Fig. 1A) of PDXDC1 is associated with hearing loss [56]. Muscular dystrophy and lipoprotein lipase deficiency are associated with partial duplications [57]. ii. Heterozygous deletion can create Cause diseases via loss of function mechanism. Opsoclonus- hemizygous locus with a recessive myoclonus ataxia-like syndrome [58], Charcot–Marie–Tooth mutation (Fig. 1B) disease type 1 [59] and Walker–Warburg syndrome [60] are associated with compound heterozygous changes. iii. CNV with multiple genes can Cause diseases via gain of function mechanism. In recent create a fusion gene that encodes a years, CNV breakpoints were examined to identify fusion mRNA with an open reading frame genes that affect intellectual disability, developmental delay, (Fig. 1C) autism spectrum disorders, congenital anomalies and dysmorphic features [61, 62]. iv. Combination of multiple CNVs Cause complex diseases. Autism, mental retardation [63] and that do not individually produce developmental disorders like Smith-Magenis Syndrome [64] phenotypes are associated with these complex events. Gene dosage CNVs that change the number of copies of a gene present in the genome can alter the gene expression and cause diseases. i. ii. Deletion of dosage-sensitive gene Gene deficiency in haploinsufficient genes can cause diseases (gene deficiency) [65, 66]. Duplication of a dosage-sensitive Increase in gene dosage can create imbalances due to excess gene gene product and cause diseases [67]. Position effect CNVs can alter the gene environment (enhancers, silencers and local chromatin organization), which may change the expression of neighbouring copy number neutral genes and cause diseases. Table 1 Disease causing mechanisms of CNVs In addition to the direct involvement of genomic variants in genetic diseases, variants in modifier genes can modulate the severity of disease-related phenotypes [68-70]. For example, increased copy number of SMN2 is correlated with a milder phenotype of spinal muscular atrophy caused by a mutation in SMN1 [71]. 6 Today, linkage studies, genome-wide association studies (GWAS) [72], microarray-based studies and high-throughput sequencing-based approaches have discovered a large number of clinically-relevant variants that are catalogued in a number of different databases. For example, the human genome mutation database (HGMD) [73] lists ~175,000 mutations that are associated with or cause genetic diseases [53]. Similarly, ClinVar [74] archives ~126,000 genomic variants with clinically relevant phenotypes and DECIPHER database archives a large number of clinically relevant CNVs [75]. Additionally, OMIM provides a detailed gene-centric and disease-centric view of clinically relevant variants [76, 77]. While HGMD, ClinVar, DECIPHER and OMIM are the most commonly used databases with variants contributing to a broad range of human diseases, a number of disease-centered databases are available with a wealth of information on certain phenotypes [78]. All these datasets improve the understanding of the mutational spectrum of a particular gene or disease and they are widely used in diagnosing genetic diseases. 1.4 Diagnosis of genetic disease Genetic diagnosis is the identification of genomic variants, genes or gene products that are associated with human diseases [79]. The classical genetic diagnostic decision-making process is guided by a detailed family history study that reveals the mode of inheritance of the disease. Then, genetic tests are performed to confirm the diagnosis of a symptomatic individual or individual with a family history of a specific disease. These genetic tests are designed to detect a broad range of genomic variants [80]. 1.4.1 Genetic tests to detect SNVs and indels Since the advent of genetic testing in clinical laboratories, various techniques have been introduced to detect disease-causing SNVs and indels [79, 81]. For example, restriction fragment analysis [82], DNA sequencing (Sanger sequencing) [83] and allele-specific oligonucleotide ligation assays [84, 85] are some of the initial tests developed. While these techniques are not widely used in current protocols, a few remain useful in causal variant identification. Particularly, Sanger sequencing is still used in diagnosing well-characterized 7 7 Mendelian disease genes (eg. cystic fibrosis: CFTR, Rett syndrome: MECP2) and to confirm disease-causing SNVs identified by other methods [86]. The genetic diagnostic tests discussed above are designed to target specific loci, generally identified by accurate phenotyping of patients and prior knowledge of the genetics of the disease. Clinical and genetic heterogeneity can complicate the diagnostic process by making it difficult to identify a likely causal gene from many candidates. Hence, targeted genetic testing for these genes are time consuming, cost ineffective and have low success rates. Furthermore, these tests have limited applicability in detecting rare variants and novel variants. However, as discussed in section 1.5, current diagnostic protocols use highthroughput sequencing based methods to detect disease-causing variants. 1.4.2 Genetic tests to detect chromosomal abnormalities and CNVs Karyotyping was introduced as the first diagnostic test to detect chromosomal abnormalities [87]. Then, fluorescence in situ hybridisation (FISH) was introduced as a targeted method, which detects submicroscopic anomalies in chromosomes [88, 89]. Next, comparative genomic hybridisation (CGH) techniques [90] marked a turning point in genetic diagnosis by enabling genome-wide CNV detection. CGH was further improved using microarrays to perform high-resolution genome-wide CNV detection [91]. These microarrays contained 1-4 million 45-60bp long probes that target specific sequences spread over the entire human genome sequence [92]. Then, SNV genotypes were used as genetic markers [93] in developing genotype-arrays as an alternative platform to perform genome-wide CNV detection [94, 95]. When microarray platforms were introduced, array results were validated using alternative targeted methods. Initially, quantitative fluorescence PCR (QF-PCR) was used to validate aCGH results and later multiplex ligation-dependent probe amplification (MLPA) was also used to validate CNVs [96]. However, with recent developments in aCGH probe design and advance bioinformatics algorithms, aCGH has become the principal CNV detection platform in current genetic diagnostic protocols [92, 97]. 8 Microarrays offer reliable and efficient methods for genome-wide population-scale CNV detection [98]. However, the sensitivity and breakpoint precision of these platforms vary greatly depending on their probe density distributions. Therefore they have variable performance in detecting CNVs [99, 100]. For example, a microarray benchmark study has shown that large CNVs (>1Mb) were detected with high accuracy even in arrays that have different probe distributions, but these arrays have shown variable efficiencies in detecting short CNVs [101]. With the availability of high-throughput sequencing data, alternative sequencing-based approaches have been developed which can provide higher resolution CNV detection. 1.5 Exome sequencing With the introduction of next generation high-throughput sequencing technologies [102], whole genomes could rapidly be characterized at single base pair resolution. The continuous development of these technologies lead to the rapid decrease in sequencing cost and therefore massive amounts of sequence data were generated with ever-decreasing effort [103-105]. About six years ago, when whole-genome sequencing was not available at an affordable price, exome sequencing was introduced as a cost-effective approach to sequencing coding regions of the human genome [106, 107]. When exome sequencing was first introduced, it was estimated that ~85% of clinically relevant variants were in exonic regions [108]. While this figure is likely an overestimate due to the scarcity of data on disease causality, scientists still expect the majority of causal variants to be in the exome. Today, exome sequencing is used as an effective genetic diagnostic test that has revolutionized the detection of novel, de novo and rare variants across a wide range of diseases [108-112]. 1.5.1 Exome-based disease diagnosis The genome of a healthy individual differs from the human reference genome at 4.1-5.0 million sites [37]. These differences are due to genetic variants and exome sequencing has the potential to reveal these variants that are in coding regions. For example, variant calling programs identify 30000-50000 SNVs and 2000-6000 indels per exome (passing the quality 9 9 thresholds used in bioinformatics pipelines). Therefore identifying disease-causing mutations from thousands of normal variants is the most challenging task in current genetic diagnostic approaches. In order to address this issue, several different diagnostic strategies and guidelines were introduced in recent years, laying the foundation to identify disease-causing variants [113-115]. As detailed in these diagnostic guidelines, current pipelines use a number of different prioritization approaches to select potentially disease-causing variant from patient samples. For example, disease-causing variants can be found in candidate genes common to a group of affected unrelated individuals or the same causal variant could be found in affected family members. Similarly, de novo causal variants can be found by using a family trio (motherfather-child). Therefore exome-based diagnostic methods can be designed to increase the probability of identifying the causal variant. When studying diseases that are known for genetic heterogeneity, candidate genes shared among subsets of individuals can be selected during variant prioritization. Variant prioritization can be further improved by using population-specific datasets and clinically relevant datasets. Here, population-specific datasets are used to filter-out common variants and clinically relevant datasets are used to identify already known disease-relevant variants (Table 2). Following prioritization, the next step is to identify a pathogenic variant from the prioritized list. Here, a broad range of datasets including scoring systems that improve variant interpretation [115], published literature, functional genomic annotations and model organism phenotypes can be used to assess the pathogenicity of a genetic variant. Current variant calling programs [116] detect SNVs [117] and indels [118, 119] with high accuracy, and annotation programs provide a wealth of annotations to assist the diseasecausing variant detection [120, 121]. Today, exome-based diagnostic approaches have shown a success rate of 25%-40% in detecting disease-causing SNVs and indels [122, 123]. However, only a 2%-3% CNV detection rate has been published for heterogeneous Mendelian cohorts [124, 125]. 10 Dataset Application Reference Population-specific datasets Used to identify known: 1000 genomes data SNVs, indels and CNVs [35-37, 42] NHLBI GO Exome Sequencing SNVs and indels [38] Database of Genomic Variants CNVs [126] Sanger CNV project CNVs [34] Project (ESP) Clinically-relevant datasets OMIM Used to identify disease-relevant: Genes containing causal [76,77] variants HGMD SNVs, indels and CNVs [73] ClinVar SNVs, indels and CNVs [74] DECIPHER database (Database CNVs [127] SNVs, indels and CNVs [128] of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources) Deciphering Developmental Disorders (DDD) Table 2 Public datasets used in exome-based disease diagnosis 11 11 2 Disease-causing CNV detection from exome data Since the introduction of exome sequencing as an effective diagnostic tool, various CNV prediction and annotation programs have been developed to detect disease-causing CNVs from patient exomes. These disease-causing CNV detection methods can be discussed in four main steps: CNV prediction, annotation, causal CNV detection and validation. 2.1 CNV prediction Computational programs use paired-end or coverage data to predict CNVs from sequence data. Based on the most commonly used CNV prediction strategies, these programs are grouped into three categories: paired-end-, split-read- and coverage-based CNV prediction. In short-read sequencing (eg. Illumina sequencing [129]), paired-end reads are generated with approximately known distances. Read alignment and mapping programs are expected to maintain these distances [130]. If CNV breakpoints are internal to paired-end intervals, these reads are mapped in wrong orientation or mapped at distances that are substantially different from the expected distances [131, 132]. Therefore paired-end data is used in computational programs to predict CNVs from sequence data (Fig. 2) [133]. Split-read- based CNV prediction programs detect CNV breakpoints using unmapped or partially mapped (soft-clipped) reads (Fig. 3). If unmapped or partially mapped reads contain CNV breakpoints, these programs use split-mapping algorithm to identify CNV breakpoints at single base-pair resolution [131, 132]. Paired-end- and split-read-based methods are widely implemented to call CNVs in wholegenome sequence (WGS) data. For example, BreakDancer [134], PEMer [135], VariationHunter [136], commonLAW [137], GASV [138], AGE [139], Pindel [118], SLOPE [140], SRiC [141] and STRiP [142] predict CNVs in whole-genome sequences. However, when predicting CNVs in exome sequences, paired-end information is only used in rare instances (eg. SPLITREAD [143]). 12 b. a. Reference c. d. Reference Read 1 and read 2 Distance between read 1 and read 2 CNV Fig. 2 Paired-end-based CNV identification a. Deletion (reads are mapped too far apart); b. Insertions (reads are mapped too close together); c. Inversions (the orientation of one read in the pair is changed); d. Duplications (reads change their relative order while retaining their original orientation) a. b. Reference c. d. Reference Split read Distance between 2 parts of split read CNV Fig. 3 Split-read-based CNV identification a. Deletion (continuous stretch of gaps in the split-read); b. Insertion (continuous stretch of gaps in the reference); c. Inversion (section of the split-read is mapped in wrong orientation); d. Duplication (both sections of split-read alter their relative order during the alignment) 13 13 Both paired-end- and split-read-based CNV prediction methods depend on the availability of paired-end reads encompassing or containing CNV breakpoints (Fig. 2, 3). Therefore, when CNV breakpoints are in regions that are not captured in the exome (eg. non-coding regions), these two methods are not capable of detecting CNVs. Therefore coverage-based methods have been implemented to overcome this challenge, when predicting CNVs in exomes. 2.2 Coverage-based CNV prediction In short-read sequencing, copy number variable regions generate higher or lower read counts compared to diploid regions in the genome [144]. Therefore when calling CNVs, prediction programs analyse the coverage distribution of exonic regions, expecting that the coverage of an exon is proportional to the copy number of the exon (Fig. 4). Exon 1 a. Exon 2 Exon 3 Reference Control dataset b. Reference Mapped reads in target exome1 c. Reference Mapped reads in target exome2 Increase or decrease in mapped read count Exon Fig. 4 Coverage-based CNV prediction a. Copy number neutral control dataset; b. Increase in coverage of exon 2 (duplication) compared to control dataset; c. Decrease in coverage in exon 3 compared to control dataset. Coverage-based programs follow three main stages when predicting CNVs from exome sequence data: coverage extraction, normalization and CNV calling. In the first stage, alignment files (BAM file) are used to extract coverage data for target regions defined in an exome definition file. The majority of programs extract coverage data from multiple exomes, 14 which are used in the second stage, coverage normalization. Here, extracted coverage data is normalized using different statistical approaches to create a reference (background) dataset. Finally, this normalized reference (background) dataset is used in variant calling algorithms to predict CNVs from target exomes. The efficiency of coverage-based programs depends on various factors that are associated with the exome sequencing process. In exome sequencing, short exonic regions (about 200bp [145]) and off-target regions (eg. segmental duplicates) are captured with variable efficiencies [146]. This can affect CNV prediction by introducing bias into the statistical models used to call CNVs. In addition to these sequencing artifacts, factors including GC percentage, mappability, duplicated reads and ambiguously mapped reads [147] influence the coverage distribution of target exomes and thereby affect the efficiency of CNV prediction. As discussed in Table 3, a number of different techniques have been implemented in coverage-based programs to minimize the effects from these issues. Issues affecting CNV prediction Technique used to minimize the effect Duplicated reads SAMtools rmdup [148] and Picard Duplicated reads inflate the coverage of genomic MarkDuplicates [149] can identify duplicated regions and affect the CNV prediction quality. reads in alignment files. Once identified, duplicated reads can be excluded during coverage data extraction. Ambiguously mapped reads Ambiguously or inaccurately mapped reads Inaccurate read mapping can place reads in genomic have low mapping quality scores [150]. regions regardless of the copy number state. This can Therefore, quality score threshold can be used introduce bias to the CNV prediction. to exclude these reads. GC content Local coverage correction methods can reduce Short-read sequencing platforms have shown a the effect of GC content [153]. positive correlation between coverage and GC percentage [151, 152]. Thus, GC rich regions can have higher coverage than the AT rich regions. Mappability Local mappability score correction methods Reads mapped to genomic regions with low can introduce to reduce these effects [147, mappability scores (eg. repeat rich regions [154]) 155]. may not represent their original locations. Thus 15 15 coverage data from regions with low mappability scores can introduce bias to the CNV prediction. Technical artefacts in sequencing Advance statistical methods can be Technical artefacts could occur during the implemented to normalize coverage data sequencing process and these effects may vary across the sample collection used for CNV depending on capture technology and sequencing prediction. For example, principal component batch. analysis based methods can be used to reduce these biases [156, 157]. Table 3 Coverage-based CNV prediction: issues and strategies Coverage-based methods were initially introduced to predict CNVs in paired case-control samples. For example, to predict CNVs in a tumour (case) sample compared to a healthy control [158, 159]. Soon, it was implemented in population scale CNV prediction [153]. Table 4 provides a summary of coverage-based programs that can predict CNVs in exome sequence data. Program (Google citations) ExomeCNV (190) CONTRA (94) ExomeCopy (32) ExomeDepth (99) CoNIFER (180) XHMM (138) ExoCNVTest (17) FishingCNV (26) CANOES (9) PatternCNV (7) cnvOffSeq (2) EXCAVATOR Features Reference Use paired case-control exome samples to call CNVs. [160] Create a reference dataset from multiple exomes and call CNVs using a method similar to ExomeCNV. Use a negative binomial model to normalize coverage data according to tile length and GC content. CNV calling is performed based on a hidden markov model (HMM). Uses a robust strategy to model coverage data and to build an optimized reference set that maximize the CNV detection power. Performs a singular value decomposition (SVD) based normalization of reads per thousand bases per million reads sequenced (RPKM). Uses principal-component analysis (PCA) to normalize coverage data and CNV calling is performed by a HMM. Identify disease-associated CNVs and generate absolute copy number genotypes at putatively associated loci. Graphical software package to call CNVs using principal component analysis based method. Models coverage data using a negative binomial distribution and creates a reference dataset when calling CNVs. Estimate CNVs based on coverage variability patterns summarized from a set of reference samples. Estimate CNVs using a singular value decomposition based normalization framework for off-target coverage data. Implements a three-step normalization procedure to [161] 16 [162] [163] [157] [156] [164] [165] [166] [167] [168] [155] (42) eXome mitigate the effects of sequencing biases and use HMMbased segmentation algorithm to call CNVs. Use HMM based approach to call CNVs. [169] (3) CODEX (7) CLAMMS CoNVaDING Normalize coverage data to reduce the effect of mappability and GC content. Highly scalable copy number estimation tool based on lattice-aligned mixture models. Includes a stringent quality control (QC) metric that flags low-quality exons when calling CNVs. [170] [171] [172] Table 4 Exome sequence-based CNV prediction programs * Exome-based programs that are specially optimized to detect CNVs in cancer patients include, VarScan 2 [173], exome2cnv [174], CoNVEX [175], CEQer [176] and ControlFREEC [177]. While a number of exome-based CNV prediction programs are currently available, only a few are commonly used and cited in the published literature. For example, a Google citation count of each CNV prediction program (Table 4) indicates that only a few programs are commonly cited by the genomics community. Two benchmark studies have tested these commonly used programs and demonstrated the potential utility in detecting disease-causing CNVs [178, 179]. Despite the potential clinical utility of exonic CNV prediction programs, there are several pitfalls that affect widespread implementation. For example, these programs predict drastically different CNV counts when tested with the same sample collection. Similarly, CNV length distributions of these programs also vary greatly from one program to another. Moreover, some of these programs are not optimized to detect short CNVs (CNVs containing 1-4 exons) and show high false positive CNV counts [178-181]. Additionally, high stringent CNV calling programs may have high false negative CNV counts. While these issues are not yet resolved, the clinical importance of exome-based CNV prediction was repeatedly highlighted in recent years [124, 178, 179, 182]. 2.3 CNV annotation Annotation programs provide information needed to understand the functional effects and clinical relevance of predicted variants. ANNOVAR [120] and VEP [121] are the most commonly used programs that can annotate CNVs and other genomic variants. Additionally, 17 17 CNVAnnotator [183], SG-ADVISER CNV [184] and DeAnnCNV [185] provide web-portals for CNV annotation. These programs use different source datasets and provide a broad range of information including gene and functional effect information, information on already known and common CNVs, and clinically relevant information. These annotations are effectively used to identify disease-causing CNVs in the next stage. However, since the choice of source dataset has a major effect on the variant interpretation [186], efficiency in disease-causing CNV identification may depend on the selected annotation program. 2.4 Causal CNV detection Following prediction and annotation, CNVs are filtered and prioritized (using approaches discussed in section 1.5.1) to detect potentially disease-causing variants. Then, the pathogenicity of each of the prioritized variant is assessed to identify disease-causing CNVs that could explain patient phenotypes. As discussed in section 2.2, there are a number of limitations to the use of exome-based CNV prediction programs. These limitations have a direct effect on disease-causing CNV identification. For example, a high false positive count increases the risk of selecting a false positive CNV as the disease-causing variant. Similarly, high false negative counts (predicting few CNVs) increase the risk of not detecting the causal variant. Additionally, if thousands of CNVs (eg. CONTRA and ExomeCopy) are predicted, this greatly increases the time and effort involved in disease-causing CNV identification. Following the identification of disease-causing CNVs, current diagnostic approaches implement a secondary experimental step to validate computationally predicted causal variants. 2.5 Validation of disease-causing CNVs As discussed in section 2.2, CNV prediction programs that use exome sequence data have shown high false positive CNV counts. Thus, in order to confirm the molecular diagnosis, computationally detected disease-causing variants can be validated using various experimental techniques. For example, aCGH can be used to confirm CNV predictions and 18 when these platforms do not contain probes covering the causal variant (eg. CNVs in pseudogenes), targeted techniques (MLPA, QF-PCR) can be used. The majority of commercially available aCGH platforms have low probe counts over exonic regions. Therefore these aCGH platforms have limited applicability in detecting short exonic CNV (with at least 3 consecutive probes). For example, the median probe spacing of Agilent high-resolution 1x1M array is 2.1kb and only 5.3% of the array is composed of probes covering exonic regions. Hence, scientists were motivated in developing custom solutions to detect short exonic CNV. Recent studies have demonstrated the utility of these platforms in confirming computationally predicted disease-causing CNVs [178, 179]. In summary, disease-causing CNVs can be detected from exome sequence data in four main steps: CNV prediction, annotation, causal variant detection and validation. While the clinical utility of these approaches has been demonstrated in recent years, a number of challenges that are associated with these steps limits their widespread applicability. Therefore the main objective of this thesis is to overcome these challenges and develop an improved method to detect disease-causing CNVs from exome sequence data. 19 19 3 Aims of the study The main objective of the work presented in this thesis was to develop a complete pipeline to improve the detection of disease-causing CNVs in exome sequence data. Specific aims were to: 1. Evaluate the performance of commonly-used CNV prediction programs for exome sequence data. 2. Study the ability of CNV prediction programs to predict short CNVs (containing 1-4 exons) and to develop new methods that improve short CNV detection. 3. Develop a custom aCGH to experimentally validate computational predictions and to improve short CNV detection compared to existing arrays. 4. Study CNV prediction quality and to develop methods that can assess CNV prediction quality scores. 5. Improve CNV annotation using multiple source databases. 20 4 Methodological considerations 4.1 Exome sequence analysis pipeline Computational programs tested in our study use exome sequence data as the source for CNV prediction. DNA samples for exome sequencing were obtained from primary immunodeficiency diseases (PIDD) patients referred to Oslo University Hospital, Oslo Norway. Exome capture was performed using the Agilent SureSelect Human All Exome capture kit v.5 and Illumina rapidcapture exome. Then captured libraries were sequenced on an Illumina HiSeq 2500 at the Norwegian Sequencing Centre (www.sequencing.uio.no). Sequence data (fastq files) were analysed using an in-house developed exome pipeline (Fig. 5) combining NovoAlign (v2.07.17) [187], Picard (v1.74) [149], GATK (v2.4 and v2.8) [117, 188, 189] and ANNOVAR (2013Aug23)[120]. Alignment (BAM) files generated from the pipeline showed an average coverage ranging from 124x-210x (Illumina rapidcapture exome: 124x-146x and Agilent SureSelect v5: 160x-210x) and they were used as the input for CNV prediction programs tested in our study (section 4.2). As shown in Fig. 5, our pipeline uses an in-house developed database and scripts to improve variant interpretation. For example, ANNOVAR annotated SNV and indel calls were further annotated to provide population-specific information (using an in-house variant database) and disease-relevant information (using in-house software). During disease-relevant annotation, SNVs and indels affecting primary immunodeficiency diseases (PIDD) candidate genes were identified and then each variant was annotated with the phenotype and inheritance pattern. Finally, these variants were filtered and prioritized using Filtus [190] when identifying PIDDcausing SNVs and indels. 21 21 Fig. 5 Exome sequence analysis pipeline 22 4.2 CNV prediction from exome sequence data Six CNV prediction programs (ExomeCNV [160], CONTRA [161], ExomeCopy [162], ExomeDepth [163], CoNIFER [157] and XHMM [156]) were tested on alignment files, which were generated from the pipeline discussed in section 4.1. Additionally, a custom algorithm - ExCopyDepth was developed using modules from ExomeCopy (coverage data extraction module) and ExomeDepth (CNV calling module). Then, in order to validate CNVs predicted from these programs, a custom aCGH (exaCGH) focusing exonic regions was developed using Agilent eArray system [191]. Paper 1 and various sections of this thesis provide detailed descriptions of steps involved in CNV prediction, exaCGH development and CNV validation. 4.3 Resolution of microarray platforms Since their introduction, microarray platforms have been continuously improved to increase the coverage and resolution of genome-wide CNV detection [192]. Today, high-resolution microarray platforms offer genome-wide CNV profiling with high accuracy [98, 193, 194]. However, due to the low probe coverage over exonic regions, these platforms show restrictions in short exonic CNV detection. Thus, we developed a custom CGH array (exaCGH) to perform high-resolution CNV detection in exonic regions [179]. exaCGH contain 1 million probes and cover ~80% of exons (Havana or Ensemble exons in Gencode v.15) with at least 4 probes. Exonic coverage of exaCGH increases up to ~90%, when considering Gencode v.15 exons that are longer than 100bp. The proportion of Gencode v.15 exons covered by 4 or more probes is highest in exaCGH compared to Agilent 1x1M and CytoScan HD array platforms [179]. The resolution of a microarray platform is mainly defined by the mean spacing of array elements [100]. However, the mean probe spacing of an array does not represent variation in local resolution, especially when array elements are not uniformly distributed throughout the genome [100]. The characterization of local variation in probe spacing can provide more insight into the resolution of an array and subsequently better understanding of its utility. Hence, we compared the local probe spacing in exaCGH, Agilent 1x1M and CytoScan HD to study the resolution of these three array platforms. 23 23 During this comparison, we first calculated the probe count of each Gencode exon and CDS region (including 800bp flanking regions in 5’ and 3’ ends [179]). In order to compare the variation of probe spacing in regions (Gencode exonic and CDS) that differ in length, these regions were grouped into three categories: 1-2kb, 2-3kb and over 3kb. Then, we calculated the mean probe spacing in each group (Fig. 6). As expected, exaCGH had the lowest probe spacing in each group ranging ~300bp to ~350bp. Probe spacing in CytoScan HD was ranging from ~400bp to 1.2kb and Agilent 1x1M was ranging from ~1kb to ~2.4kb. This analysis confirmed that exaCGH has the highest resolution compared to two of the tested commercially available high-resolution arrays and optimized to detect short exonic CNVs. As a secondary verification of our array results, several exaCGH CNV calls were randomly selected and manually visualized using Agilent genome workbench. Fig. 6 Resolution of microarray platforms 4.4 Considerations when identifying true positive, false positive and false negative CNVs Following the evaluation of exaCGH design, array experiments were performed and exaCGH results were used to validate computationally predicted CNVs. Here, computational 24 predictions were compared to exaCGH results and CNVs detected by both methods were identified as true positives. False positives CNVs were the computational predictions, which were not detected in exaCGH. False negatives were the CNVs identified by the array but not by the prediction program. Prediction programs and exaCGH use different CNV breakpoint detection strategies. For example, CNV breakpoint estimation in computational programs depends on the segmentation algorithm used [147], while CNV breakpoints in microarrays were mainly determined the probe density distribution [195]. Therefore computationally predicted CNV breakpoints may differ from breakpoints identified from exaCGH results. Due to the CNV breakpoint inconsistencies in computational predictions and exaCGH results, a certain degree of overlap between these two datasets needs to be considered when identifying true positives. Additionally, CNVs predicted from different computational programs, have shown very different length distributions [179]. Therefore, the degree of overlap between CNV predictions and exaCGH results may differ from program to program (Fig. 7). As shown in Fig. 7, computational predictions and exaCGH results overlap to a variable degree depending on the length of CNVs. Therefore when identifying true positive and false positive CNVs, it is challenging to determine the required threshold overlap. Hence, we used 1 base-pair overlap as the threshold to be less stringent in true positive identification. Using 1 base-pair overlap between computational predictions and exaCGH results can improve the true positive yield while minimizing the false positive CNV count. However, selecting CNVs with very low overlap can potentially classify false positives as exaCGH confirmed variants and thereby introduce bias into our true positive set. Therefore, true positive CNVs were further analysed to study how true positive count (in each program) change with the degree of overlap (between exaCGH results and computational predictions). As shown in the cumulative frequency distributions (Fig. 8), only a small fraction of CNVs (~0.01%-0.02%) had less than 100bp overlap between the two datasets used. Further decrease in overlap (<100bp) clearly reduces the cumulative frequency, highlighting the presence of few true positive CNVs. This analysis confirmed that the threshold overlap used in CNV 25 25 validation stage (1bp overlap), did not inflate the true positive count and thus, did not introduce a significant bias into our study. Fig. 7 Manual examination of CNVs a. Computationally predicted CNVs and the exaCGH detection overlap to a variable degree; b. Computational CNV predictions are internal to the exaCGH detection; c. The exaCGH detection is internal to computationally predicted CNVs. * exaCGH parameters: minimum number of probes – 3, minimum average absolute log ratio for deletion and duplication 0.25, 26 Fig. 8 Cumulative frequency distribution for true positive CNVs 27 27 5 Summary of papers 5.1 Paper 1 Identification of copy number variants from exome sequence data In this study, we evaluated six commonly used CNV prediction programs (ExomeCNV, CONTRA, ExomeCopy, ExomeDepth, CoNIFER, XHMM) and also developed an in-house modification to ExomeCopy and ExomeDepth (ExCopyDepth) to optimize short CNV (containing 1-4 exons) prediction. Additionally, we studied how multiple prediction program combinations can be used effectively to detect CNVs. Crucially, we developed a custom aCGH (exaCGH) focussing on exonic regions to validate computational CNV predictions. In this analysis, CNV prediction was performed on 30 exomes from the 1000 genomes project and 9 exomes from patients with primary immunodeficiency diseases (PIDD). Nine samples from each group were run in exaCGH and both computational predictions and array results were used to identify true positive (TP) and false positive (FP) CNVs. During the analysis, we compared CNV programs in terms of CNV prediction length, count and the number of exons internal to predicted variants. Then we calculated the sensitivity and false positive rate of these programs. The comparison of computational programs showed a striking variation in the length and number of predicted CNVs (Paper 1: Fig. 1a, b). These programs also showed a clear variation in the number of exons internal to predicted CNVs (Paper 1: Fig. 2a, b). Compared to individual programs, a narrow range was shown for the number exons internal to CNVs that were predicted from multiple programs (Paper 1: Fig. 2c, d). The number of TP CNVs identified from different programs also showed clear variation (Paper 1: Table 2). All the programs except CoNIFER and ExomeDepth had low TP/FP count ratios indicating high FP CNV counts (Paper 1: Table 2). Comparison of short CNVs (CNVs containing 1-4 exons) indicated improved performance for ExomeCopy, ExCopyDepth and intersection of ExomeCopy and ExCopyDepth compared to CoNIFER and XHMM (Paper 1: Fig. 3). 28 1000 genomes exomes used in the study had higher coverage variance compared to the PIDD patient exomes. Additionally 1000 genomes exomes were analysed using a relatively small reference sample collection compared to PIDD patient exomes [179]. When analysing our results, it was possible to observe how these factors (differences in two exome collections) affect the CNV prediction. For example, CONIFER, XHMM, ExomeDepth were not able to predict short TP CNVs from 1000 genomes exomes. However these programs predicted 4-50 short TP CNVs with relatively high TP/FP ratios (0.17-0.67) in PIDD patient exomes. With 1000 genomes exomes, intersection of ExomeCopy and ExCopyDepth showed an increased performance (highest TP/FP ratio) compared to other programs (Paper 1: Table 3). We then studied the consistency in assigning copy number state (CNS) when a CNV is predicted from different programs. Here we showed that the same CNSs were assigned to CNVs predicted from all the programs or programs that use similar variant calling methods (Paper 1: Table 4). In order to benchmark the accuracy of CNV prediction, we compared the sensitivity and false positive rate (FPR) of programs tested in our study. ExomeCopy had the highest FPR (38.9% of the samples showed FPR > 0.95) and CoNIFER had the lowest FPR. ExomeDepth, CoNIFER and XHMM were optimized to detect rare CNVs. Therefore these programs had high false negative counts affecting the sensitivity. For example, sensitivity of ExomeDepth, CoNIFER and XHMM was lower than the other tested programs. Next, the clinical utility of these programs was tested by searching for PIDD-causing variants. All the programs except XHMM identified a PIDD-causing deletion in FANCA (exon 26–37), demonstrating the potential clinical importance of CNV prediction programs that use exome sequence data. Finally, we proposed a protocol with two stages, especially applicable in candidate gene studies. Stage one: selecting the program or program combination to predict CNVs depending on the target exome collection and user requirements (Paper 1 provides a detailed discussion on conditions need to be considered at this stage). Stage two: validation of computational predictions using exaCGH experiments 29 29 In conclusion, we introduced a customized algorithm to predict short CNVs (ExCopyDepth) and a custom aCGH (exaCGH) focussing exonic regions. Using computational programs and exaCGH, we highlighted that programs are capable of predicting disease-causing CNVs despite of the relatively high false positive counts. Finally, we presented a protocol that guides the CNV prediction and validation especially in candidate gene studies. 5.2 Paper 2 cnvScan: a CNV screening and annotation tool to improve the clinical utility of computational CNV prediction from exome sequencing data This paper describes a new software tool (cnvScan) that considerably improves the clinical utility of CNV prediction from exome sequencing data. cnvScan can read input from any prediction program, which may be affected by issues discussed in section 2.2. Amongst these issues, high false positive CNV count is one of the main challenges in clinically relevant CNV identification [178-180]. Therefore in Paper II, cnvScan was introduced to address this challenge. Firstly, it was demonstrated that the default CNV prediction quality (assigned by prediction programs) was useful as a stable measure to detect likely false positive predictions (Paper 2: Fig. 1, Additional file 1: Fig. S2-S5). Then, a variant screening method was introduced to calculate an additional CNV quality score. The new quality score (CNVQ) was designed to help assess the effects of sequencing artefacts that influence FP CNV prediction. Sequencing artefacts (eg. high coverage variance across exomes) can affect the coverage normalization step in prediction programs and contribute to false positive CNVs. Although it is not possible to detect and measure sequencing artefacts in normalized data, effects could be observed using an in-house CNV database. For example, a considerable proportion of overlapping false positives, which were predicted from multiple exomes were observed when considering CNVs in an in-house database (Paper 2: Fig. 1, Additional file 1: Fig. S6). As reasoned in Paper 2, these overlapping false positive predictions are likely the effects of sequencing artefacts. Further, we showed that the common TP CNVs also have low quality scores when prediction programs are optimized to detect rare variants. (Paper 2: Fig. 2). 30 Following the development of a variant screening method, cnvScan was extended with two additional modules: CNV annotation and filtration. CNV annotation module provides different types of functionally and clinically relevant information, and annotation on known CNVs using a broad range of source datasets (Paper 2: Table 1). The filtration module is capable of excluding low quality known non-disease causing CNVs using cnvScan annotations. Clinical utility of these methods were demonstrated in Paper 2 by implementing cnvScan in a PIDD patient cohort (n=64) and detecting three patients with PIDD-causing CNVs (Paper 2: Fig. 4). Overall, cnvScan can read input from any prediction program and perform a quality assessment on predicted CNVs. Then cnvScan provides a broad range of annotations using multiple source datasets. Finally, cnvScan filtration prioritizes potentially clinically relevant CNVs by excluding false positive, known and non-disease causing variants. Our method considerably improves the efficiency of disease-causing CNV detection by reducing the time and effort in variant analysis. 5.3 Paper 3 Primary immunodeficiency diseases - genomic approaches delineate heterogeneous Mendelian disorders In Paper 3, the utility of exome sequencing as a diagnostic test was evaluated using a group of patients with primary immunodeficiency diseases (PIDD). In this study, exome sequencing was performed on a large cohort (n=278) comprising samples collected from two centres: Oslo university hospital, Oslo, Norway (78 probands) and Baylor College of Medicine, Texas, US (200 probands). All the patients in this cohort had been evaluated through conventional methods (section 1.4.1, 1.4.2) and lacked an established molecular diagnosis when included in the study. Therefore, Paper 3 describes the implementation of a broad range of exome-sequence- based diagnostic methods to assess all known PIDD relevant genes in these patients. All the Oslo samples were run on the exome sequence analysis pipeline discussed in this thesis (section 4.1). The pipeline run on Baylor samples [196] used Atlas2 suite for SNV and indel identification [197, 198]. Variants identified in both centres were filtered using same criterion during the prioritization, and potentially PIDD-causing variants were further studied 31 31 to detect causal variants. Identified PIDD-causing SNVs and indels were confirmed by Sanger sequencing. The pipeline discussed in section 6.4.2 was used to analyse CNVs from all the Oslo samples and 80 samples from Baylor. While in Baylor, ExCopyDepth and an in-house developed program (HMZDelFinder) were used to predict and annotate CNVs. When disease-causing CNVs were identified, they were confirmed by a custom array (exaCGH or Baylor custom array) or MLPA. Seven PIDD-causing CNVs were identified with the computational methods used in both centres. The pipeline discussed in section 6.4.2 was able to detect five PIDD-causing CNVs (FANCA, NCF1, TERC, MAGT1, IKZF1) and three were detected in the Baylor pipeline (PGM3, SMARCAL1, MAGT1). Exome samples containing PGM3 and SMARCAL1 CNVs were not tested in Oslo. Thus, the ability to detect these two CNVs was not tested in the pipeline discussed in section 6.4.2. Our cohort study, revealed the potential molecular basis for the disease in 40% (n=110) of the probands. Clinical diagnosis was revised in about half (60/110) and management was directly altered in nearly a quarter (26/110) of probands based on the molecular findings. Further, we found evidence for reduced penetrance (in autosomal dominant traits), founder mutations in outbred populations (in autosomal recessive traits), mutation burden effect and, somatic and revertant mosaicism contributing to disease variability. This study resulted in the expansion of the phenotypic spectrum associated with known disease genes, and the identification of several new PIDD-causing genes. The majority of computationally predicted disease-causing CNVs identified in this study, were found by the CNV pipeline discussed in this thesis. 32 6 Discussion This thesis presents a complete pipeline providing a systematic approach to identify clinically relevant CNVs. Paper 1 provides a detailed study on commonly used exome-based CNV prediction programs while introducing a customized algorithm (ExCopyDepth) and a custom aCGH (exaCGH). Both ExCopyDepth and exaCGH were developed to predict and experimentally validate short exonic CNVs. Paper 2 describes a new software package (cnvScan) that improves the clinical utility of CNV prediction programs. Finally, in Paper 3, the clinical importance of ExCopyDepth, exaCGH and cnvScan were confirmed by detecting disease-causing CNVs in a large patient cohort. 6.1 Performance comparison of CNV prediction programs In Paper 1, we compared the performance of six commonly used CNV prediction programs. We showed that these programs are capable of predicting CNVs with high sensitivity, which was further highlighted by predicting disease-causing CNVs in PIDD patients. Despite these advantages, we demonstrated that exome-based CNV prediction programs have high false positive counts. These findings were confirmed by increasing number of recent benchmark studies [178, 180, 181]. Although the conclusions of our performance comparison were consistent with recent publications, it is possible to identify several key features that are not addressed in Paper 1. For example, 1000 genomes exomes used in our study showed a high coverage variance (mean coverage: 49.3x-199.7x), which could affect the CNV prediction efficiency. Moreover, during the benchmark study, we had not considered the effect of CNV prediction quality scores. Additionally, we had not provided precision and recall of CNV prediction programs, although they are considered as the most relevant descriptors for performance, especially when true negative counts are not available [199]. In order to address these issues, an additional analysis was designed (ExCopyDepth prediction and exaCGH experiments) using 15 PIDD patient exomes with mean coverage 33 33 ranging from 160x-210x. Then a set of true positive and false positive CNVs was identified to calculate precision and recall of CNV prediction programs. Since CNV counts vary depending on the thresholds used in exaCGH experiments, array results were analysed in two stringency levels when calculating precision and recall scores (Fig. 9). Fig. 9 Performance comparison of CNV prediction programs a, b, c. Precision and recall were calculated with low stringency exaCGH calls; d, e, f. Precision and recall were calculated with high stringency exaCGH calls; a, d. Precision vs default CNV quality score; b, e. Recall vs default CNV quality score; c, f Precision vs recall of prediction programs; *CoNIFER does not provide CNV prediction quality by default, thus it was not considered in this analysis; * Low stringency parameters: AMD-2 score threshold 5, minimum number of probes - 3, minimum average absolute log ratio - 0.20; * High stringency parameters: AMD-2 score threshold - 6, minimum number of probes - 4, minimum average absolute log ratio - 0.25 Due to the high false positive counts in CNV prediction programs [179] low precision (<0.5) was shown for low CNV scores (<20; Fig. 9a, d). However, precision increased with increasing CNV quality scores for all 4 CNV prediction programs. The majority of CNV prediction programs are optimized to detect rare variants. Thus, these programs have high false negative counts, as they are not capable of detecting all the exome-wide CNVs identified in exaCGH experiments. Consequently, CNV programs had low recall scores (Fig. 34 9b, e). However, with increasing exaCGH stringency (Fig. 9e), recall of CNV prediction programs increased drastically (0-0.06 to 0-0.13). This indicated that CNV prediction programs have high recall (sensitivity) in detecting high-stringency CNV calls. When considering both precision and recall (Fig. 9c, f), XHMM showed a lower performance compared to ExomeCopy, ExCopyDepth and ExomeDepth. With high-stringency CNV calls, ExomeCopy, ExCopyDepth and ExomeDepth showed a similar performance (Fig. 9f). CoNIFER dose not provide quality scores in its default setting, therefore it was not included in the comparison presented in Fig. 9. However, based on recently published literature, CoNIFER can be considered as a program with very low false positive counts [178-180]. Precision and recall plots presented in this thesis and other similar benchmark studies suggest that CNV prediction programs are subject to high false positive counts even though they are capable of predicting disease-causing CNVs [178-181]. This highlights the importance of careful evaluation and additional validation of computational CNV predictions. 6.2 Effect of cnvScan in-house database in CNV analysis Current exome sequence analysis pipelines use in-house SNV databases to detect sequencing artifacts in SNV calls [200, 201]. Similarly, cnvScan uses an in-house database and provides a CNV quality assessment strategy to improve the detection of sequencing and coverage artefacts. Here, we recommend using the same exome collection for CNV prediction and cnvScan in-house database creation [202]. Alternatively, when prediction programs are run in several batches of exome samples, CNVs from all these exomes can be pooled together to create a relatively large in-house database. Changes in the content and size of the in-house database affect quality assessment, filtration and thereby CNV interpretation. However in Paper 2, we have not discussed how the content and size of cnvScan in-house database affect the CNV analysis. In order to study this effect, two in-house databases were created from two different datasets: CNV predictions from 18 PIDD patient exomes (db-18) and CNV predictions from two batches containing 18 and 47 PIDD exomes (db-65). Then, cnvScan was implemented using these two in-house databases to detect PIDD-causing CNVs in exomes represented in db-18. 35 35 Two PIDD-causing CNVs (two patients from the same family with a deletion in MAGT1 and one patient with a deletion in NCF1) were identified when db-18 was used for the cnvScan implementation. However, only MAGT1 CNV was detected with the db-65 (NCF1 variant was not detected). This analysis showed that the in-house database creation approach affects the disease-causing CNV identification. IGV visualization of NCF1 variant showed the presence of low quality overlapping database CNVs in db-65 (Fig. 10). Therefore in the db-65 implementation, a low cnvScan quality score (CNVQ) was calculated and NCF1 variant was filtered out. However, low quality overlapping database CNVs were not observed in db-18 (Fig. 10). Thus, the prediction quality of the NCF1 CNV was not affected and it was not filtered out during cnvScan db-18 implementation. Fig. 10 Effect of in-house database creation approach a. db-18 implementation (no low quality overlapping database CNVs in db-18); b. db-65 implementation (low quality overlapping database CNVs were observed in db-65) * Bands with higher and lower colour intensities in a and b represent CNV predictions with higher and lower quality scores. As mentioned earlier, CNVs in db-65 were predicted in two different batches with 18 and 47 exomes in each collection. CNV prediction quality of these two batches could differ from 36 each other due to various technical artifacts. Pooling CNVs in these two batches will introduce batch effect to the cnvScan db-65 database and affect the CNV quality assessment and filtration. For example, Table 5 shows that the numbers of CNVs identified in cnvScan db-65 (filtration thresholds:15-65) are slightly lower than that of db-18. This highlights the need for further optimization techniques that can minimize the batch effect. For example, SNV prediction approaches implement quality score recalibration strategies to minimize various types of biases [117, 189]. The current version of cnvScan avoids the batch effect by using the same exome collection for CNV prediction and cnvScan implementation. cnvScan filtration Number of CNVs identified in cnvScan threshold db-18 in-house database db-65 in-house database 10 788 791 15 357 344 20 220 205 25 159 151 30 128 113 35 86 83 40 68 62 45 65 55 50 56 46 55 53 37 60 41 26 65 33 23 Table 5 CNVs identified in cnvScan implementation 6.3 Characterization of true positives excluded in cnvScan implementation Current SNV analysis pipelines implement different filtration criteria to exclude technical artifacts and population specific variants when searching for clinically relevant variant in exome sequence data. For instance, variant calling programs (samtools [148] and GATK [117, 189]), variant analysis packages (Filtus [190]) and annotation programs (VEP [121]) provide a number of options to apply various filters on SNVs and indels to find important or interesting results. 37 37 Similarly, cnvScan and the recently published CoNVaDING [172] provide filtration strategies to exclude low quality CNV predictions. cnvScan provides a filtration strategy for CNVs predicted from any coverage-based prediction program. CoNVaDING deviates from cnvScan by combining both CNV prediction and filtration in a single program. CoNVaDING performs a coverage analysis and selects a control dataset (exomes with most similar coverage patterns) for each target exome when calling CNVs. During this coverage analysis, a set of quality matrices is created to describe how coverage of exons varies within the target exome and control dataset. Then, these matrices are used in the filtration step to improve the sensitivity and specificity. However, the clinical utility of CoNVaDING has not yet been studied in a large patient cohort. The CNV filtration efficiency of cnvScan is discussed in Paper 2 and the applicability of cnvScan in a large patient cohort is discussed in Paper 3. While cnvScan was effectively used to identify disease-causing CNVs, our tool can be biasprone due to the implemented hard-filtering step. Particularly, rare disease-causing CNVs can be excluded when high cnvScan filtration thresholds are applied. A detailed characterization of excluded CNVs would assist cnvScan users to select the optimum threshold, which could be used to minimize the hard-filtration bias. Therefore an additional analysis was designed to study the quality scores of filtered-out or excluded CNVs, especially as this was not discussed in Paper 2. Here, CNV prediction (ExCopyDepth) and cnvScan implementation was performed using 17 PIDD exomes. During cnvScan implementation, relatively high quality score threshold (default CNV quality and CNVQ: 40) was used to increase the number of filter-out CNVs (n=912). Following the filtration, CNVs excluded in cnvScan implementation were selected for further evaluation. In parallel, exaCGH experiments were performed (on the 17 PIDD exomes) to identify true positive and false positive CNVs within excluded variants. The quality score analysis of excluded variants (Fig. 11) showed that true positives have higher quality scores compared to false positive CNVs (Fig. 11a). These excluded true positives may contain disease-causing rare variants as well as known variants that are reported in public databases. Therefore these CNVs were further examined to identify known 38 and novel variant within excluded true positives. Here, CNVs reported in public databases (manually curated DGV, 1000 genomes CNVs and sanger CNVs) were identified as known variants, while CNVs not yet reported in public datasets were identified as novel variants. Then we compared the default quality score and CNVQ distributions of these identified CNVs. Both known and novel true positive CNVs were spread over lower and higher ends of default quality score spectrum (Fig. 11b). However, when considering CNVQ distributions, known variants were spread over a wider range compared to novel CNVs (Fig. 11c). Fig. 11 Analysis of CNVs excluded in cnvScan a. Default quality scores of false positive (FP) and true positive (TP) CNVs filtered out (excluded) in cnvScan implementation; b. Default quality score distribution of known and novel TP CNVs excluded in cnvScan; c. CNVQ distribution of known and novel TP CNVs excluded in cnvScan. The number of novel true positive CNVs excluded in cnvScan was 32 (~3.5% of the total CNVs excluded) and they were found in at least 4 patient samples. These CNVs could either be population-specific common variants that are not yet reported in public databases or disease-causing CNVs that are present in a subset of patients (n≥4). Previous studies have demonstrated that CNVs present in <10% of the population can cause or be associated with a broad range of diseases [203-205]. Thus, it is important not to exclude these variants during filtration. Based on Fig. 11, low cnvScan thresholds can be used to retain the majority of these novel CNVs. However, there is a risk of excluding a small fraction of novel CNVs. Therefore, we updated the cnvScan filtration script to produce two 39 39 results files: a primary result file containing high quality CNVs identified in cnvScan and a secondary file containing low quality excluded novel variants. Hence, cnvScan users can examine the primary result file when searching for disease-causing CNVs. If a causal variant was not found, users can search the secondary results file. 6.4 Comparison with other disease-causing CNV detection methods 6.4.1 Advantages of cnvScan annotation compared to other annotation programs Today, a number of programs are available for scientists to annotate CNV predictions; for example, VEP [121], ANNOVAR [120], CNVannotator [183], DeAnnCNV [185]. Despite the availability of these tools, cnvScan was extended to provide additional information on CNVs, especially using sources that are not available in other methods. For example, clinically relevant information from DECIPHER DDD [128] and, known and common CNVs from high quality manually curated database of genomic variants (DGV) [206] were not provided in the majority of other CNV annotation programs. Similarly, these programs provide a limited set of information on novel CNVs when pre-computed scores such as haploinsufficiency scores [66] and gene intolerance scores [207] are useful in functional interpretation of novel CNVs. A recent study has shown that variant interpretation can be affected by the source transcript dataset used in annotation programs [186]. For example, CNVAnnotator uses a combination of Refseq, UCSC and ENCODE data to annotate genes and other genomic features [183], which could increase the risk of misinterpreting annotated CNVs. cnvScan avoids this issue by using Gencode dataset and provides high quality, consistent and accurate CNV annotations. 40 6.4.2 Rate of disease-causing CNV detection Paper 3 discussed in the thesis uses a PIDD patient cohort and describes the utility of our method in identifying disease-causing CNVs. Patient exomes tested in our method were collected from the Department of Medical Genetics, Oslo University Hospital (n=78) and Baylor College of Medicine, Texas (n=80). ExCopyDepth and cnvScan were able to identify five disease-causing CNVs from these patient populations: four PIDD-causing CNVs (FANCA, MAGT1, NCF1, TREC) in five patients from the Oslo cohort (positive detection rate: 6.4%) and two PIDD-causing CNVs (MAGT1, IKZF1) in the Baylor patient cohort (positive detection rate: 3.75%). All these CNVs were validated by either exaCGH or MLPA experiments. Our cohort study reported higher positive detection rates (Oslo: 6.4%, Baylor: 3.75%) compared to other heterogeneous Mendelian cohorts and recently published large clinical studies [123-125]. This was mainly due to the carefully designed steps of our complete pipeline: CNV prediction, cnvScan implementation, pathogenicity assessment and exaCGH validation (Fig. 12). 41 41 Fig. 12 Disease-causing CNV detection pipeline a. Computational steps: CNV prediction (using alignment files and exome definition files as main inputs) and cnvScan implementation (using cnvScan quality thresholds, in-house database and various annotation datasets); b. Manual step: pathogenicity assessment of each prioritized variant using cnvScan annotations and other relevant resources; c. Experimental step: if disease-causing CNVs are identified in the previous step, they are confirmed using exaCGH experiments. During the implementation of the pipeline (Fig. 12), a low stringency prediction program (ExCopyDepth) was used to minimize the false negative count when calling CNVs. Then during cnvScan implementation, predicted variants were filtered and prioritized to reduce false positive CNV count. Following the prioritization, cnvScan annotations and other relevant information were used for effective CNV interpretation and pathogenicity assessment. If disease-causing CNVs were identified, exaCGH experiments were performed to confirm computational predictions. When comparing positive detection rates of Oslo and Baylor samples, the Oslo group had higher detection rates than the Baylor group. This could be due to the batch-wise 42 implementation of the Oslo samples. For example, ExCopyDepth prediction and cnvScan implementation was performed on 20-30 exomes sequenced in 1-3 consecutive batches. However, all the Baylor exomes (n=80) were pooled together, potentially increasing the coverage variance across the collection and thereby affecting the CNV prediction efficiency. Based on these observations, we suggest that ExCopyDepth and ExomeDepth CNV prediction can be further optimized by pooling exomes depending on their sequencing batches. However, when using XHMM and CoNIFER, CNV prediction efficiency can increase with large exome collections [156, 157, 208] 6.4.3 Short CNV prediction As shown in Table 6, our CNV pipeline detected PIDD-causing variants with a broad length spectrum. For example, the FANCA CNV contained 11 exons and only one exon was internal to the TERC variant. Moreover, 40% of the PIDD-causing CNVs detected in our study contained 1-2 exons. Therefore, we have demonstrated the ability to detect disease-causing short CNVs (containing < 4 exons), when the majority of CNV prediction programs are not optimized to predict short CNVs [156, 157, 166, 208]. Short CNV detection efficiency in prediction programs depends on various algorithmic features. In ExomeCopy and ExCopyDepth, short CNV prediction was improved due to the implemented tile length normalization step [162, 179]. When considering CoNIFER and XHMM, both programs use principal component analysis based normalization methods to remove experimental and systematic artefacts. CoNIFER CNV calling algorithm search for consecutive runs of three or more targets with normalized scores that are higher than a specified hard threshold [157, 179]. Therefore, CoNIFER restricts its ability to predict CNVs with 1-3 exons. XHMM deviates from CoNIFER by implementing a hidden markov model (HMM) to assess the quality of each CNV signal [156, 179, 208], as opposed to hard thresholds. Thus XHMM shows improved performance in detecting short CNVs compared to CoNIFER. 43 43 Gene (region) Number of CNV exons (Length) Cohort study FANCA 10 Heterozygous (chr16:89811272-89833745) (22273 bp) deletion MAGT1 5 Deletion (chrX:77096742-77112995) (16253 bp) IKZF1 2 Heterozygous (chr7:50435853-50451072) (15219 bp) deletion TERC 1 Heterozygous (chr3:169482168-169485768) (3600 bp) deletion NCF1 11 Homozygous (chr7:74188309-74203659) (15350 bp) deletion Oslo Oslo and Baylor Baylor Oslo Oslo Table 6 PIDD-causing variants identified by using the pipeline discussed in section 6.4.2 6.4.4 Utility of exaCGH in detecting short CNVs As discussed in methodological considerations, exaCGH was optimized to detect short exonic CNVs compared to commercial arrays. This directly contributed to the diseasecausing short CNV yield of our study. For example, a PIDD-causing single exon deletion in TERC was detected in exaCGH with 7 probes, whereas CytoScan HD and Agilent 1x1M arrays did not contain probes covering the variant. Identification of TERC deletion highlights the advantage of using exaCGH over other commercial arrays. In recent years, custom arrays were used in number of studies to confirm computationally predicted short exonic CNVs with variable degree of success [124, 178, 209]. In conclusion, this thesis discusses a complete pipeline optimized to identify disease-causing CNVs in exome sequence data while solving problems faced by other commonly used methods. Our pipeline is adaptable to user requirements and greatly improves the time and effort in clinically relevant CNV identification [179, 202, 210]. The pipeline was proven to be effective in identifying CNVs with a broad length spectrum and have a positive detection rate of 3.7%-6.4%. Therefore this initiative will provide necessary resources for exome-based CNV detection in current genetic diagnostic protocols and improve the diagnosis of patients with genetic diseases. 44 7 Future directions 7.1 Optimizations to reduce the batch effect in cnvScan implementation As discussed in section 6.2, when CNVs were predicted from several batches of exomes, CNV prediction quality could differ from one batch to another (batch effect). Therefore when implementing the current version of cnvScan on multiple batches of exomes, separate inhouse databases need to be created and a filtration threshold needs to be selected per batch. In order to minimize the bias introduced by the batch effect, default CNV quality scores assigned by prediction programs need to be recalibrated using an advance statistical method. Here, special measures needed to be introduced, so that the recalibrated scores do not vary depending on the exome collection used in the prediction program. With these recalibrated CNV scores, all the in-house CNVs can be pooled to create a relatively large cnvScan database without introducing the batch effect. A larger in-house database would lead to the development of an advance CNVQ (cnvScan quality score) calculation method that could improve the confidence in CNV quality assessment. Both recalibrated quality scores and an advanced cnvScan quality assessment strategy, would help select an optimum cnvScan filtration threshold during variant prioritization. 7.2 Whole-genome CNV analysis Whole-genome sequencing (WGS) has the potential to capture genome-wide SNVs, indels and CNVs in one experiment. Therefore, WGS will be an attractive alternative to exome sequencing when available at an affordable price [211-214]. Prediction programs tested in our study use exomes as the source to detect CNVs. Therefore these programs will be replaced when WGS become a widespread genetic diagnostic tool. However, cnvScan accept CNV calls from any coverage-based prediction program as the 45 45 input [202]. Thus, cnvScan can be extended to analyse coverage-based CNV calls from WGS data. Moreover, paired-end and split-read approaches [215] can be introduced in cnvScan as a secondary CNV evaluation step. For example, when evaluating a CNV, cnvScan can search for reads with insert sizes that significantly deviate from the expected length [216] or partially mapped reads to identify CNV breakpoints [141]. Thus, cnvScan can be extended to assess, further evaluate, annotate and filter CNVs predicted from WGS data. 46 Abbreviations CGH Comparative genomic hybridization CNS Copy number state CNV Copy number variant CNVQ cnvScan quality score DECIPHER Database of chromosomal imbalance and phenotype in Humans using ensembl resources exaCGH Exon focussed CGH array FISH Fluorescence in situ hybridization FP False positives FPR False positive rate GWAS Genome-wide association studies HGMD Human gene mutation database HMM Hidden markov model indel Insertion or deletion MLPA Multiplex ligation-dependent probe amplification OMIM Online Mendelian Inheritance in Man PCA Principal component analysis PIDD Primary immunodeficiency diseases QF-PCR Quantitative fluorescence PCR (polymerase chain reaction) RPKM Reads per thousand bases per million reads sequenced SNV Single nucleotide variant SVD Singular value decomposition TP True positive 47 47 Commonly used terms Allele: Variant forms of the same gene, in the same locus on homologous chromosomes. Copy number variant (CNV): A DNA sequence that is 1kb or larger and is present at a variable copy number in comparison with a reference genome. Gene: A DNA sequence or locus that encode a functional RNA or protein product and is the molecular unit of heredity. Genetic loci: Specific regions that are mapped within a genome. Genome: Complete set of genetic material present in a cell. Genotype: The genetic constitution of the individual. High-throughput sequencing (next-generation sequencing): Coverall term used to describe a number of different modern sequencing technologies (eg. Illumina sequencing /Short-read sequencing). These technologies enable DNA and RNA sequencing much more quickly and cheaply than the previously used Sanger sequencing technology. Inversion: A segment of DNA that is reversed in orientation with respect to the rest of the chromosome. Karyotype: An ordered array of metaphase chromosomes from a photomicrograph of a single cell nucleus arranged in pairs in descending order of size and according to the position of the centromere. Paired-end reads: paired reads originating from the same template molecule and separated by a known distance. Precision (positive detection rate): true positive count / (true positive count + false positive count) 48 Recall (sensitivity): true positive count / (true positive count + false negative count) Segmental duplication or low-copy repeat: A segment of DNA >1kb in size that occurs in two or more copies per haploid genome, with the different copies sharing >90% sequence identity. They are often variable in copy number and can therefore also be CNVs. Soft-clipped reads: Partially mapped reads with bases that are not part of the alignment. Structural variant (SV): Term that encompasses a group of microscopic or submicroscopic genomic alterations involving segments of DNA. Structural variants include CNVs comprising deletions and duplications, translocations and inversions. Translocation: A change in position of a chromosomal segment within a genome. Translocations can be intra- or inter-chromosomal. 49 49 References 1. International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860-921. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA et al: The sequence of the human genome. Science 2001, 291(5507):1304-1351. International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature 2004, 431(7011):931-945. Ezkurdia I, Juan D, Rodriguez JM, Frankish A, Diekhans M, Harrow J, Vazquez J, Valencia A, Tress ML: Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Hum Mol Genet 2014, 23(22):5866-5878. Pruitt KD, Katz KS, Sicotte H, Maglott DR: Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet 2000, 16(1):44-47. Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001, 29(1):137-140. ENCODE Project Consortium: The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004, 306(5696):636-640. Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T, Harrow JL: The vertebrate genome annotation (Vega) database. Nucleic Acids Res 2008, 36(Database issue):D753-760. Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ et al: The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome research 2009, 19(7):1316-1323. ENCODE Project Consortium: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799-816. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al: GENCODE: producing a reference annotation for ENCODE. Genome Biol 2006, 7(Suppl 1):S4 1-9. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S et al: GENCODE: the reference human genome annotation for The ENCODE Project. Genome research 2012, 22(9):1760-1774. Maglott DR, Katz KS, Sicotte H, Pruitt KD: NCBI's LocusLink and RefSeq. Nucleic Acids Res 2000, 28(1):126-128. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al: The Ensembl genome database project. Nucleic Acids Res 2002, 30(1):38-41. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome research 2002, 12(6):996-1006. Kellis M, Wold B, Snyder MP, Bernstein BE, Kundaje A, Marinov GK, Ward LD, Birney E, Crawford GE, Dekker J et al: Defining functional DNA elements in the human genome. Proc Natl Acad Sci U S A 2014, 111(17):6131-6138. Mullaney JM, Mills RE, Pittard WS, Devine SE: Small insertions and deletions (INDELs) in human genomes. Hum Mol Genet 2010, 19(R2):R131-136. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nature reviews Genetics 2006, 7(2):85-97. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 50 19. Inoue K, Lupski JR: Molecular mechanisms for genomic disorders. Annu Rev Genomics Hum Genet 2002, 3:199-242. Hastings PJ, Lupski JR, Rosenberg SM, Ira G: Mechanisms of change in gene copy number. Nature reviews Genetics 2009, 10(8):551-564. Lee JA, Carvalho CM, Lupski JR: A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell 2007, 131(7):1235-1247. Hastings PJ, Ira G, Lupski JR: A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS genetics 2009, 5(1):e1000327. Hurles M, Lupski J: Recombination Hotspots in Nonallelic Homologous Recombination. Genomic Disorders, 2006: 341-355. Perry GH, Ben-Dor A, Tsalenko A, Sampas N, Rodriguez-Revenga L, Tran CW, Scheffer A, Steinfeld I, Tsang P, Yamada NA et al: The fine-scale and complex architecture of human copy-number variation. American journal of human genetics 2008, 82(3):685-695. Kidd JM, Graves T, Newman TL, Fulton R, Hayden HS, Malig M, Kallicki J, Kaul R, Wilson RK, Eichler EE: A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 2010, 143(5):837-847. Arlt MF, Wilson TE, Glover TW: Replication stress and mechanisms of CNV formation. Curr Opin Genet Dev 2012, 22(3):204-210. Conrad DF, Bird C, Blackburne B, Lindsay S, Mamanova L, Lee C, Turner DJ, Hurles ME: Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat Genet 2010, 42(5):385-391. Ottaviani D, LeCain M, Sheer D: The role of microhomology in genomic structural variation. Trends Genet 2014, 30(3):85-94. Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL et al: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001, 409(6822):928-933. International HapMap Consortium: The International HapMap Project. Nature 2003, 426(6968):789-796. International HapMap Consortium: A haplotype map of the human genome. Nature 2005, 437(7063):1299-1320. International HapMap Consortium: Integrating common and rare genetic variation in diverse human populations. Nature 2010, 467(7311):52-58. Wellcome Trust Case Control Consortium: Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 2010, 464(7289):713-720. Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P et al: Origins and functional impact of copy number variation in the human genome. Nature 2010, 464(7289):704-712. The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature 2010, 467(7319):1061-1073. The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature 2012, 491(7422):56-65. The 1000 Genomes Project Consortium: A global reference for human genetic variation. Nature 2015, 526(7571):68-74. Exome Sequencing Project (ESP) 6500 exome data [http://evs.gs.washington.edu/EVS/] 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 51 51 39. Fu W, O'Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, Gabriel S, Rieder MJ, Altshuler D, Shendure J et al: Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 2013, 493(7431):216-220. The UK10K Consortium: The UK10K project identifies rare variants in health and disease. Nature 2015, 526(7571):82-90. Sudmant PH, Mallick S, Nelson BJ, Hormozdiari F, Krumm N, Huddleston J, Coe BP, Baker C, Nordenfelt S, Bamshad M et al: Global diversity, population stratification, and selection of human copy-number variation. Science 2015, 349(6253):aab3761. Clarke L, Zheng-Bradley X, Smith R, Kulesha E, Xiao C, Toneva I, Vaughan B, Preuss D, Leinonen R, Shumway M et al: The 1000 Genomes Project: data management and community access. Nat Methods 2012, 9(5):459-462. Griffiths AJF: Modern Genetic Analysis: W.H. Freeman; 1999. Mendel G, Bateson W: Experiments in plant-hybridisation / By Gregor Mendel. Harvard University Press; 1925. Jin W, Qin P, Lou H, Jin L, Xu S: A systematic characterization of genes underlying both complex and Mendelian diseases. Hum Mol Genet 2012, 21(7):1611-1624. Youssoufian H, Pyeritz RE: Mechanisms and consequences of somatic mosaicism in humans. Nature reviews Genetics 2002, 3(10):748-758. OMIM Entry Statistics [http://www.omim.org/statistics/entry] Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res 2002, 30(17):3894-3900. Capon F, Allen MH, Ameen M, Burden AD, Tillman D, Barker JN, Trembath RC: A synonymous SNP of the corneodesmosin gene leads to increased mRNA stability and demonstrates association with psoriasis across diverse ethnic groups. Hum Mol Genet 2004, 13(20):2361-2368. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K: A comprehensive review of genetic association studies. Genet Med 2002, 4(2):45-61. Thomas PD, Kejariwal A: Coding single-nucleotide polymorphisms associated with complex vs. Mendelian disease: evolutionary evidence for differences in molecular effects. Proc Natl Acad Sci U S A 2004, 101(43):15398-15403. Farh KK, Marson A, Zhu J, Kleinewietfeld M, Housley WJ, Beik S, Shoresh N, Whitton H, Ryan RJ, Shishkin AA et al: Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 2015, 518(7539):337-343. Stenson PD, Mort M, Ball EV, Shaw K, Phillips A, Cooper DN: The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet 2014, 133(1):1-9. Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature reviews Genetics 2007, 8(8):639-646. Girirajan S, Campbell CD, Eichler EE: Human copy number variation and complex genetic disease. Annu Rev Genet 2011, 45:203-226. Haraksingh RR, Jahanbani F, Rodriguez-Paris J, Gelernter J, Nadeau KC, Oghalai JS, Schrijver I, Snyder MP: Exome sequencing and genome-wide copy number variant mapping reveal novel associations with sensorineural hereditary hearing loss. BMC genomics 2014, 15:1155. Hu X, Worton RG: Partial gene duplication as a cause of human disease. Hum Mutat 1992, 1(1):3-12. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 52 58. Blumkin L, Kivity S, Lev D, Cohen S, Shomrat R, Lerman-Sagie T, Leshinsky-Silver E: A compound heterozygous missense mutation and a large deletion in the KCTD7 gene presenting as an opsoclonus-myoclonus ataxia-like syndrome. J Neurol 2012, 259(12):2590-2598. Abe A, Nakamura K, Kato M, Numakura C, Honma T, Seiwa C, Shirahata E, Itoh A, Kishikawa Y, Hayasaka K: Compound heterozygous PMP22 deletion mutations causing severe Charcot-Marie-Tooth disease type 1. J Hum Genet 2010, 55(11):771-773. Czeschik JC, Hehr U, Hartmann B, Ludecke HJ, Rosenbaum T, Schweiger B, Wieczorek D: 160 kb deletion in ISPD unmasking a recessive mutation in a patient with Walker-Warburg syndrome. Eur J Med Genet 2013, 56(12):689-694. Holt R, Sykes NH, Conceicao IC, Cazier JB, Anney RJ, Oliveira G, Gallagher L, Vicente A, Monaco AP, Pagnamenta AT: CNVs leading to fusion transcripts in individuals with autism spectrum disorder. Eur J Hum Genet 2012, 20(11):11411147. Newman S, Hermetz KE, Weckselblatt B, Rudd MK: Next-generation sequencing of duplication CNVs reveals that most are tandem and some create fusion genes at breakpoints. American journal of human genetics 2015, 96(2):208-220. Ullmann R, Turner G, Kirchhoff M, Chen W, Tonge B, Rosenberg C, Field M, Vianna-Morgante AM, Christie L, Krepischi-Santos AC et al: Array CGH identifies reciprocal 16p13.1 duplications and deletions that predispose to autism and/or mental retardation. Hum Mutat 2007, 28(7):674-682. Stankiewicz P, Bi W, Lupski J: Smith-Magenis Syndrome Deletion, Reciprocal Duplication dup(17)(p11.2p11.2), and Other Proximal 17p Rearrangements. Genomic Disorders 2006, 179-191. Dang VT, Kassahn KS, Marcos AE, Ragan MA: Identification of human haploinsufficient genes and their genomic proximity to segmental duplications. Eur J Hum Genet 2008, 16(11):1350-1357. Huang N, Lee I, Marcotte EM, Hurles ME: Characterising and predicting haploinsufficiency in the human genome. PLoS genetics 2010, 6(10):e1001154. Lee C, Scherer SW: The clinical context of copy number variation in the human genome. Expert Rev Mol Med 2010, 12:e8. Cutting GR: Modifier genes in Mendelian disorders: the example of cystic fibrosis. Ann N Y Acad Sci 2010, 1214:57-69. Artuso R, Papa FT, Grillo E, Mucciolo M, Yasui DH, Dunaway KW, Disciglio V, Mencarelli MA, Pollazzon M, Zappella M et al: Investigation of modifier genes within copy number variations in Rett syndrome. J Hum Genet 2011, 56(7):508515. Jiang Q, Ho YY, Hao L, Nichols Berrios C, Chakravarti A: Copy number variants in candidate genes are genetic modifiers of Hirschsprung disease. PLoS One 2011, 6(6):e21219. Mailman MD, Heinz JW, Papp AC, Snyder PJ, Sedra MS, Wirth B, Burghes AH, Prior TW: Molecular analysis of spinal muscular atrophy and modification of the phenotype by SMN2. Genet Med 2002, 4(1):20-26. Orr N, Chanock S: Common genetic variation and human disease. Adv Genet 2008, 62:1-32. Krawczak M, Ball EV, Fenton I, Stenson PD, Abeysinghe S, Thomas N, Cooper DN: Human gene mutation database-a biomedical information and research resource. Hum Mutat 2000, 15(1):45-51. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 53 53 74. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR: ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014, 42(Database issue):D980-985. Bragin E, Chatzimichali EA, Wright CF, Hurles ME, Firth HV, Bevan AP, Swaminathan GJ: DECIPHER: database for the interpretation of phenotypelinked plausibly pathogenic sequence and copy-number variation. Nucleic Acids Res 2014, 42(Database issue):D993-D1000. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005, 33(Database issue):D514-517. McKusick VA: Mendelian Inheritance in Man and its online version, OMIM. American journal of human genetics 2007, 80(4):588-604. Cotton RG, Auerbach AD, Axton M, Barash CI, Berkovic SF, Brookes AJ, Burn J, Cutting G, den Dunnen JT, Flicek P et al: GENETICS. The Human Variome Project. Science 2008, 322(5903):861-862. Landegren U, Kaiser R, Caskey CT, Hood L: DNA diagnostics--molecular techniques and automation. Science 1988, 242(4876):229-237. McPherson E: Genetic diagnosis and testing in clinical practice. Clin Med Res 2006, 4(2):123-129. Glickman RM, Phillips MA, Glickman BW: Introduction to DNA-Based Genetic Diagnostics. Can Fam Physician 1988, 34:883-889. Southern EM: Detection of specific sequences among DNA fragments separated by gel electrophoresis. J Mol Biol 1975, 98(3):503-517. Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 1977, 74(12):5463-5467. Conner BJ, Reyes AA, Morin C, Itakura K, Teplitz RL, Wallace RB: Detection of sickle cell beta S-globin allele by hybridization with synthetic oligonucleotides. Proc Natl Acad Sci U S A 1983, 80(1):278-282. Landegren U, Kaiser R, Sanders J, Hood L: A ligase-mediated gene detection technique. Science 1988, 241(4869):1077-1080. Sanders SJ, Murtha MT, Gupta AR, Murdoch JD, Raubeson MJ, Willsey AJ, ErcanSencicek AG, DiLullo NM, Parikshak NN, Stein JL et al: De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 2012, 485(7397):237-241. Drets ME, Shaw MW: Specific banding patterns of human chromosomes. Proc Natl Acad Sci U S A 1971, 68(9):2073-2077. Bauman JG, Wiegant J, Borst P, van Duijn P: A new method for fluorescence microscopical localization of specific DNA sequences by in situ hybridization of fluorochromelabelled RNA. Exp Cell Res 1980, 128(2):485-490. Van Prooijen-Knegt AC, Van Hoek JF, Bauman JG, Van Duijn P, Wool IG, Van der Ploeg M: In situ hybridization of DNA sequences in human metaphase chromosomes visualized by an indirect fluorescent immunocytochemical procedure. Exp Cell Res 1982, 141(2):397-407. Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D: Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 1992, 258(5083):818-821. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y et al: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature genetics 1998, 20(2):207-211. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 54 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. Haraksingh RR, Abyzov A, Gerstein M, Urban AE, Snyder M: Genome-wide mapping of copy number variation in humans: comparative analysis of high resolution array platforms. PLoS One 2011, 6(11):e27859. Schlotterer C: The evolution of molecular markers--just a matter of fashion? Nature reviews Genetics 2004, 5(1):63-69. Chen X, Sullivan PF: Single nucleotide polymorphism genotyping: biochemistry, protocol, cost and throughput. Pharmacogenomics J 2003, 3(2):77-96. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK: A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 2006, 38(1):7581. Lee C, Iafrate AJ, Brothman AR: Copy number variations and clinical cytogenetic diagnosis of constitutional disorders. Nature genetics 2007, 39(7 Suppl):S48-54. Hayes JL, Tzika A, Thygesen H, Berri S, Wood HM, Hewitt S, Pendlebury M, Coates A, Willoughby L, Watson CM et al: Diagnosis of copy number variation by Illumina next generation sequencing is comparable in performance to oligonucleotide array comparative genomic hybridisation. Genomics 2013, 102(3):174-181. Miller DT, Adam MP, Aradhya S, Biesecker LG, Brothman AR, Carter NP, Church DM, Crolla JA, Eichler EE, Epstein CJ et al: Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. Am J Hum Genet 2010, 86(5):749-764. Baumbusch LO, Aaroe J, Johansen FE, Hicks J, Sun H, Bruhn L, Gunderson K, Naume B, Kristensen VN, Liestol K et al: Comparison of the Agilent, ROMA/NimbleGen and Illumina platforms for classification of copy number alterations in human breast tumors. BMC genomics 2008, 9:379. Coe BP, Ylstra B, Carvalho B, Meijer GA, Macaulay C, Lam WL: Resolving the resolution of array CGH. Genomics 2007, 89(5):647-653. Curtis C, Lynch AG, Dunning MJ, Spiteri I, Marioni JC, Hadfield J, Chin SF, Brenton JD, Tavare S, Caldas C: The pitfalls of platform comparison: DNA copy number array technologies assessed. BMC genomics 2009, 10:588. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437(7057):376-380. Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol 2008, 26(10):1135-1145. Hayden EC: Technology: The $1,000 genome. Nature 2014, 507(7492):294-295. Metzker ML: Sequencing technologies - the next generation. Nature reviews Genetics 2010, 11(1):31-46. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE et al: Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009, 461(7261):272-276. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J: Exome sequencing as a tool for Mendelian disease gene discovery. Nature reviews Genetics 2011, 12(11):745-755. Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloglu A, Ozen S, Sanjad S et al: Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A 2009, 106(45):19096-19101. Biesecker LG: Exome sequencing makes medical genomics a reality. Nat Genet 2010, 42(1):13-14. 55 55 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. Mardis ER: A decade's perspective on DNA sequencing technology. Nature 2011, 470(7333):198-203. Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, Braxton A, Beuten J, Xia F, Niu Z et al: Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N Engl J Med 2013, 369(16):1502-1511. Rabbani B, Tekin M, Mahdieh N: The promise of whole-exome sequencing in medical genetics. J Hum Genet 2014, 59(1):5-15. Ball MP, Thakuria JV, Zaranek AW, Clegg T, Rosenbaum AM, Wu X, Angrist M, Bhak J, Bobe J, Callow MJ et al: A public resource facilitating clinical use of genomes. Proc Natl Acad Sci U S A 2012, 109(30):11920-11927. MacArthur DG, Manolio TA, Dimmock DP, Rehm HL, Shendure J, Abecasis GR, Adams DR, Altman RB, Antonarakis SE, Ashley EA et al: Guidelines for investigating causality of sequence variants in human disease. Nature 2014, 508(7497):469-476. Koboldt DC, Larson DE, Sullivan LS, Bowne SJ, Steinberg KM, Churchill JD, Buhr AC, Nutter N, Pierce EA, Blanton SH et al: Exome-based mapping and variant prioritization for inherited Mendelian disorders. American journal of human genetics 2014, 94(3):373-384. Hwang S, Kim E, Lee I, Marcotte EM: Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci Rep 2015, 5:17875. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M et al: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 2011, 43(5):491-498. Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009, 25(21):2865-2871. Ye K, Wang J, Jayasinghe R, Lameijer EW, McMichael JF, Ning J, McLellan MD, Xie M, Cao S, Yellapantula V et al: Systematic discovery of complex insertions and deletions in human cancers. Nat Med 2016, 22(1):97-104. Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010, 38(16):e164. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F: Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 2010, 26(16):2069-2070. Sawyer SL, Schwartzentruber J, Beaulieu CL, Dyment D, Smith A, Warman Chardon J, Yoon G, Rouleau GA, Suchowersky O, Siu V et al: Exome sequencing as a diagnostic tool for pediatric-onset ataxia. Hum Mutat 2014, 35(1):45-49. Yang Y, Muzny DM, Xia F, Niu Z, Person R, Ding Y, Ward P, Braxton A, Wang M, Buhay C et al: Molecular findings among patients referred for clinical wholeexome sequencing. JAMA 2014, 312(18):1870-1879. Retterer K, Scuffins J, Schmidt D, Lewis R, Pineda-Alvarez D, Stafford A, Schmidt L, Warren S, Gibellini F, Kondakova A et al: Assessing copy number from exome sequencing and exome array CGH based on CNV spectrum in a large clinical cohort. Genet Med 2015, 17(8):623-629. Epilepsy Phenome/Genome Project & Epi4K Consortium: Copy number variant analysis from exome data in 349 patients with epileptic encephalopathy. Ann Neurol 2015, 78(2):323-328. 56 126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141. 142. MacDonald JR, Ziman R, Yuen RK, Feuk L, Scherer SW: The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res 2014, 42(Database issue):D986-992. Firth HV, Richards SM, Bevan AP, Clayton S, Corpas M, Rajan D, Van Vooren S, Moreau Y, Pettett RM, Carter NP: DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. Am J Hum Genet 2009, 84(4):524-533. Firth HV, Wright CF, Study DDD: The Deciphering Developmental Disorders (DDD) study. Dev Med Child Neurol 2011, 53(8):702-703. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456(7218):53-59. Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18(11):1851-1858. Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 2009, 6(11 Suppl):S13-20. Alkan C, Coe BP, Eichler EE: Genome structural variation discovery and genotyping. Nat Rev Genet 2011, 12(5):363-376. Rausch T, Zichner T, Schlattl A, Stutz AM, Benes V, Korbel JO: DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012, 28(18):i333-i339. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP et al: BreakDancer: an algorithm for highresolution mapping of genomic structural variation. Nat Methods 2009, 6(9):677681. Korbel JO, Abyzov A, Mu XJ, Carriero N, Cayting P, Zhang Z, Snyder M, Gerstein MB: PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol 2009, 10(2):R23. Hormozdiari F, Hajirasouliha I, Dao P, Hach F, Yorukoglu D, Alkan C, Eichler EE, Sahinalp SC: Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics 2010, 26(12):i350-357. Hormozdiari F, Hajirasouliha I, McPherson A, Eichler EE, Sahinalp SC: Simultaneous structural variation discovery among multiple paired-end sequenced genomes. Genome research 2011, 21(12):2203-2212. Sindi S, Helman E, Bashir A, Raphael BJ: A geometric approach for classification and comparison of structural variants. Bioinformatics 2009, 25(12):i222-230. Abyzov A, Gerstein M: AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision. Bioinformatics 2011, 27(5):595-603. Abel HJ, Duncavage EJ, Becker N, Armstrong JR, Magrini VJ, Pfeifer JD: SLOPE: a quick and accurate method for locating non-SNP structural variation from targeted next-generation sequence data. Bioinformatics 2010, 26(21):2684-2688. Zhang ZD, Du J, Lam H, Abyzov A, Urban AE, Snyder M, Gerstein M: Identification of genomic indels and structural variations using split reads. BMC genomics 2011, 12:375. Handsaker RE, Korn JM, Nemesh J, McCarroll SA: Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet 2011, 43(3):269-276. 57 57 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157. 158. 159. Karakoc E, Alkan C, O'Roak BJ, Dennis MY, Vives L, Mark K, Rieder MJ, Nickerson DA, Eichler EE: Detection of structural variants and indels within exome data. Nat Methods 2012, 9(2):176-178. Dalca AV, Brudno M: Genome variation discovery with high-throughput sequencing data. Brief Bioinform 2010, 11(1):3-14. Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch MJ, Albert TJ, Hannon GJ et al: Genome-wide in situ exon capture for selective resequencing. Nat Genet 2007, 39(12):1522-1527. Shigemizu D, Momozawa Y, Abe T, Morizono T, Boroevich KA, Takata S, Ashikawa K, Kubo M, Tsunoda T: Performance comparison of four commercial human whole-exome capture platforms. Sci Rep 2015, 5:12742. Magi A, Tattini L, Pippucci T, Torricelli F, Benelli M: Read count approach for DNA copy number variants detection. Bioinformatics 2012, 28(4):470-478. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078-2079. Picard [http://broadinstitute.github.io/picard/] Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754-1760. Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 2008, 36(16):e105. Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S et al: Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 2009, 10(3):R32. Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome research 2009, 19(9):1586-1592. Derrien T, Estelle J, Marco Sola S, Knowles DG, Raineri E, Guigo R, Ribeca P: Fast computation and applications of genome mappability. PLoS One 2012, 7(1):e30377. Magi A, Tattini L, Cifola I, D'Aurizio R, Benelli M, Mangano E, Battaglia C, Bonora E, Kurg A, Seri M et al: EXCAVATOR: detecting copy number variants from whole-exome sequencing data. Genome Biol 2013, 14(10):R120. Fromer M, Moran JL, Chambert K, Banks E, Bergen SE, Ruderfer DM, Handsaker RE, McCarroll SA, O'Donovan MC, Owen MJ et al: Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am J Hum Genet 2012, 91(4):597-607. Krumm N, Sudmant PH, Ko A, O'Roak BJ, Malig M, Coe BP, Project NES, Quinlan AR, Nickerson DA, Eichler EE: Copy number variation detection and genotyping from exome sequence data. Genome Res 2012, 22(8):1525-1532. Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C et al: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 2008, 40(6):722-729. Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods 2009, 6(1):99-103. 58 160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. 174. Sathirapongsasuti JF, Lee H, Horst BA, Brunner G, Cochran AJ, Binder S, Quackenbush J, Nelson SF: Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 2011, 27(19):26482654. Li J, Lupat R, Amarasinghe KC, Thompson ER, Doyle MA, Ryland GL, Tothill RW, Halgamuge SK, Campbell IG, Gorringe KL: CONTRA: copy number analysis for targeted resequencing. Bioinformatics 2012, 28(10):1307-1313. Love MI, Mysickova A, Sun R, Kalscheuer V, Vingron M, Haas SA: Modeling read counts for CNV detection in exome sequencing data. Stat Appl Genet Mol Biol 2011, 10(1). Plagnol V, Curtis J, Epstein M, Mok KY, Stebbings E, Grigoriadou S, Wood NW, Hambleton S, Burns SO, Thrasher AJ et al: A robust model for read count data in exome sequencing experiments and implications for copy number variant calling. Bioinformatics 2012, 28(21):2747-2754. Coin LJ, Cao D, Ren J, Zuo X, Sun L, Yang S, Zhang X, Cui Y, Li Y, Jin X et al: An exome sequencing pipeline for identifying and genotyping common CNVs associated with disease with application to psoriasis. Bioinformatics 2012, 28(18):i370-i374. Shi Y, Majewski J: FishingCNV: a graphical software package for detecting rare copy number variations in exome-sequencing data. Bioinformatics 2013, 29(11):1461-1462. Backenroth D, Homsy J, Murillo LR, Glessner J, Lin E, Brueckner M, Lifton R, Goldmuntz E, Chung WK, Shen Y: CANOES: detecting rare copy number variants from whole exome sequencing data. Nucleic Acids Res 2014, 42(12):e97. Wang C, Evans JM, Bhagwate AV, Prodduturi N, Sarangi V, Middha M, Sicotte H, Vedell PT, Hart SN, Oliver GR et al: PatternCNV: a versatile tool for detecting copy number changes from exome sequencing data. Bioinformatics 2014, 30(18):2678-2680. Bellos E, Coin LJ: cnvOffSeq: detecting intergenic copy number variation using off-target exome sequencing data. Bioinformatics 2014, 30(17):i639-645. Miyatake S, Koshimizu E, Fujita A, Fukai R, Imagawa E, Ohba C, Kuki I, Nukui M, Araki A, Makita Y et al: Detecting copy-number variations in whole-exome sequencing data using the eXome Hidden Markov Model: an 'exome-first' approach. J Hum Genet 2015, 60(4):175-182. Jiang Y, Oldridge DA, Diskin SJ, Zhang NR: CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res 2015, 43(6):e39. Packer JS, Maxwell EK, O'Dushlaine C, Lopez AE, Dewey FE, Chernomorsky R, Baras A, Overton JD, Habegger L, Reid JG: CLAMMS: a scalable algorithm for calling common and rare copy number variants from exome sequencing data. Bioinformatics 2016, 32(1):133-135. Johansson LF, van Dijk F, de Boer EN, van Dijk-Bos KK, Jongbloed JD, van der Hout AH, Westers H, Sinke RJ, Swertz MA, Sijmons RH et al: CoNVaDING: Single Exon Variation Detection in Targeted NGS Data. Hum Mutat 2016, 37(5):457-64. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK: VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research 2012, 22(3):568-576. Valdes-Mas R, Bea S, Puente DA, Lopez-Otin C, Puente XS: Estimation of copy number alterations from exome sequencing data. PLoS One 2012, 7(12):e51422. 59 59 175. 176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 190. 191. Amarasinghe KC, Li J, Halgamuge SK: CoNVEX: copy number variation estimation in exome sequencing data using HMM. BMC Bioinformatics 2013, 14 (Suppl 2):S26. Piazza R, Magistroni V, Pirola A, Redaelli S, Spinelli R, Redaelli S, Galbiati M, Valletta S, Giudici G, Cazzaniga G et al: CEQer: a graphical tool for copy number and allelic imbalance detection from whole-exome sequencing data. PLoS One 2013, 8(10):e74825. Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, JanoueixLerosey I, Delattre O, Barillot E: Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics 2012, 28(3):423-425. de Ligt J, Boone PM, Pfundt R, Vissers LE, Richmond T, Geoghegan J, O'Moore K, de Leeuw N, Shaw C, Brunner HG et al: Detection of clinically relevant copy number variants with whole-exome sequencing. Hum Mutat 2013, 34(10):14391448. Samarakoon PS, Sorte HS, Kristiansen BE, Skodje T, Sheng Y, Tjonnfjord GE, Stadheim B, Stray-Pedersen A, Rodningen OK, Lyle R: Identification of copy number variants from exome sequence data. BMC genomics 2014, 15:661. Guo Y, Sheng Q, Samuels DC, Lehmann B, Bauer JA, Pietenpol J, Shyr Y: Comparative study of exome copy number variation estimation tools using array comparative genomic hybridization as control. Biomed Res Int 2013, 2013:915636. Tan R, Wang Y, Kleinstein SE, Liu Y, Zhu X, Guo H, Jiang Q, Allen AS, Zhu M: An evaluation of copy number variation detection tools from whole-exome sequencing data. Hum Mutat 2014, 35(7):899-907. Jo HY, Park MH, Woo HM, Han M, Kim BY, Choi BO, Chung KW, Koo SK: Application of whole-exome sequencing for detecting copy number variants in CMT1A/HNPP. Clin Genet 2015, doi: 10.1111/cge.12714. Zhao M, Zhao Z: CNVannotator: a comprehensive annotation server for copy number variation in the human genome. PLoS One 2013, 8(11):e80170. Erikson GA, Deshpande N, Kesavan BG, Torkamani A: SG-ADVISER CNV: copynumber variant annotation and interpretation. Genet Med 2015, 17(9):714-718. Zhang Y, Yu Z, Ban R, Zhang H, Iqbal F, Zhao A, Li A, Shi Q: DeAnnCNV: a tool for online detection and annotation of copy number variations from wholeexome sequencing data. Nucleic Acids Res 2015, 43(W1):W289-94. McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier JB, Donnelly P: Choice of transcripts and software has a large effect on variant annotation. Genome Med 2014, 6(3):26. Novocraft Novoalign [http://www.novocraft.com/] McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M et al: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010, 20(9):1297-1303. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, LevyMoonshine A, Jordan T, Shakir K, Roazen D, Thibault J et al: From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 2013, 11(1110):11.10.11-11.10.33. Vigeland MD, Gjotterud KS, Selmer KK: FILTUS: a desktop GUI for fast and efficient detection of disease-causing variants, including a novel autozygosity detector. Bioinformatics 2016, doi:10.1093/bioinformatics/btw046. eArray [https://earray.chem.agilent.com/earray/] 60 192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. Gentien D, Reyes C: Microarrays-Based Molecular Profiling to Identify Genomic Alterations. Pan-cancer Integrative Molecular Portrait Towards a New Paradigm in Precision Medicine 2015, Chapter 4: 31-45. Banerjee D: Array comparative genomic hybridization: an overview of protocols, applications, and technology trends. Methods Mol Biol 2013, 973:1-13. Uddin M, Thiruvahindrapuram B, Walker S, Wang Z, Hu P, Lamoureux S, Wei J, MacDonald JR, Pellecchia G, Lu C et al: A high-resolution copy-number variation resource for clinical and population genetics. Genet Med 2015, 17(9):747-752. Carter NP: Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007, 39(7 Suppl):S16-21. Reid JG, Carroll A, Veeraraghavan N, Dahdouli M, Sundquist A, English A, Bainbridge M, White S, Salerno W, Buhay C et al: Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline. BMC Bioinformatics 2014, 15:30. Shen Y, Wan Z, Coarfa C, Drabek R, Chen L, Ostrowski EA, Liu Y, Weinstock GM, Wheeler DA, Gibbs RA et al: A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Res 2010, 20(2):273-280. Challis D, Yu J, Evani US, Jackson AR, Paithankar S, Coarfa C, Milosavljevic A, Gibbs RA, Yu F: An integrative variant analysis suite for whole exome nextgeneration sequencing data. BMC Bioinformatics 2012, 13:8. Saito T, Rehmsmeier M: The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015, 10(3):e0118432. Lemke JR, Riesch E, Scheurenbrand T, Schubach M, Wilhelm C, Steiner I, Hansen J, Courage C, Gallati S, Burki S et al: Targeted next generation sequencing as a diagnostic tool in epileptic disorders. Epilepsia 2012, 53(8):1387-1398. Vissers LE, Fano V, Martinelli D, Campos-Xavier B, Barbuti D, Cho TJ, Dursun A, Kim OH, Lee SH, Timpani G et al: Whole-exome sequencing detects somatic mutations of IDH1 in metaphyseal chondromatosis with D-2-hydroxyglutaric aciduria (MC-HGA). Am J Med Genet A 2011, 155A(11):2609-2616. Samarakoon PS, Sorte HS, Stray-Pedersen A, Rodningen OK, Rognes T, Lyle R: cnvScan: a CNV screening and annotation tool to improve the clinical utility of computational CNV prediction from exome sequencing data. BMC genomics 2016, 17(1):51. Beaudet AL: The utility of chromosomal microarray analysis in developmental and behavioral pediatrics. Child Dev 2013, 84(1):121-132. Lopes LR, Murphy C, Syrris P, Dalageorgou C, McKenna WJ, Elliott PM, Plagnol V: Use of high-throughput targeted exome-sequencing to screen for copy number variation in hypertrophic cardiomyopathy. Eur J Med Genet 2015, 58(11):611616. Hoyer H, Braathen GJ, Eek AK, Nordang GB, Skjelbred CF, Russell MB: Copy number variations in a population-based study of Charcot-Marie-Tooth disease. Biomed Res Int 2015, 2015:960404. Zarrei M, MacDonald JR, Merico D, Scherer SW: A copy number variation map of the human genome. Nature reviews Genetics 2015, 16(3):172-183. Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB: Genic intolerance to functional variation and the interpretation of personal genomes. PLoS genetics 2013, 9(8):e1003709. 61 61 208. 209. 210. 211. 212. 213. 214. 215. 216. Fromer M, Purcell SM: Using XHMM Software to Detect Copy Number Variation in Whole-Exome Sequencing Data. Curr Protoc Hum Genet 2014, 81:7.23.2127.23.21. de Ligt J, Boone PM, Pfundt R, Vissers LE, de Leeuw N, Shaw C, Brunner HG, Lupski JR, Veltman JA, Hehir-Kwa JY: Platform comparison of detecting copy number variants with microarrays and whole-exome sequencing. Genom Data 2014, 2:144-146. Stray-Pedersen A, Sorte HS, Samarakoon PS, Gambin T, Chinn IK, et al: Primary immunodeficiency diseases - genomic approaches delineate heterogeneous Mendelian disorders. . J. Allergy Clin. Immunol 2016. Ali-Khan SE, Daar AS, Shuman C, Ray PN, Scherer SW: Whole genome scanning: resolving clinical diagnosis and management amidst complex data. Pediatr Res 2009, 66(4):357-363. Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, Nazareth L, Bainbridge M, Dinh H, Jing C, Wheeler DA et al: Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med 2010, 362(13):11811191. Herdewyn S, Zhao H, Moisse M, Race V, Matthijs G, Reumers J, Kusters B, Schelhaas HJ, van den Berg LH, Goris A et al: Whole-genome sequencing reveals a coding non-pathogenic variant tagging a non-coding pathogenic hexanucleotide repeat expansion in C9orf72 as cause of amyotrophic lateral sclerosis. Hum Mol Genet 2012, 21(11):2412-2419. Belkadi A, Bolze A, Itan Y, Cobat A, Vincent QB, Antipenko A, Shang L, Boisson B, Casanova JL, Abel L: Whole-genome sequencing is more powerful than wholeexome sequencing for detecting exome variants. Proc Natl Acad Sci U S A 2015, 112(17):5473-5478. Pirooznia M, Goes FS, Zandi PP: Whole-genome CNV analysis: advances in computational approaches. Front Genet 2015, 6:138. Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L et al: Paired-end mapping reveals extensive structural variation in the human genome. Science 2007, 318(5849):420-426. 62