Fusion gene detection

Henrik Edgren

Introduction

For the purpose of this white paper, it is assumed that fusion genes will be identified from paired-end RNA sequencing (RNA-seq) data. This has the advantage that, by definition, only expressed fusion genes will be found. Whole genome sequencing data can also be used to search for gene fusions but, as it cannot tell which fusions are expressed, it is less efficient at finding potentially oncogenic gene fusions.

Here, fusion gene detection is divided into two use cases:

1) Full fusion gene search: identification of all gene fusions in a sample.
2) Targeted fusion gene search: specific identification of only those gene fusions in which one of the partner genes belongs to a short list of predefined genes. If the gene of interest is e.g. ABL1, this use case would identify all fusions XYZ-ABL1 and ABL1-XYZ, in which XYZ can be any gene other than ABL1 in the human genome.

The reason for distinguishing the two use cases is the difference in the demands they place on computational resources, in terms of both running time and disk usage. When analyzing only a small number of samples, say in the low hundreds, the sensible thing is to run a full fusion gene search, as the computational resources and time required are manageable. However, if the aim is to identify specific fusion genes from thousands of samples, e.g. all fusions including ABL1, it is significantly faster to run a targeted fusion gene search, as thousands of full fusion gene searches demand considerable computational resources and, more importantly, a lot of time.

Methods

The full fusion gene detection method is based on Edgren et al. (2011), with added improvements to increase the sensitivity of fusion gene detection (Kangaspeska et al., 2012). In brief, the full search first uses the paired-end data to identify candidate fusion gene-gene pairs.
This is achieved by aligning all sequence reads against the human genome and transcriptome (Ensembl) and identifying those read pairs in which the two reads align to different genes. These read pairs, after a number of filtering steps, define the candidate gene-gene pairs for all following steps. Next, all possible exon-exon combinations between the two genes of each candidate pair are constructed, and alignment of sequence reads against these synthetic fusion junctions identifies the final fusion gene candidate list. Our method achieves its high specificity and sensitivity by analyzing the pattern of sequence read alignments across the fusion junction. For details, see Edgren et al.

Helsinki 24.5.2013

The targeted fusion gene search starts off with alignment of sequence reads to a reference created from all transcripts of the genes on the predefined list (genes of interest). All read pairs in which only one of the reads aligns to a gene of interest are retained. As a result, most sequence reads are discarded after the first alignment round, significantly reducing the time required in subsequent steps. Next, all retained sequence reads are aligned to the full human transcriptome to create a list of candidate gene-gene pairs. Subsequent steps are the same as in the full fusion gene search.

The main points at which this targeted approach saves time and disk space are:

- The reference used in the first alignment step is small, provided the predefined gene list is kept reasonably short. This speeds up the step in which all sequence reads are aligned, which is one of the most time-consuming steps in the analysis pipeline.
- After the first alignment step, only those read pairs in which only one end aligns to a gene of interest are retained. This excludes well over 99% of all sequence reads from subsequent alignment steps.
  This speeds up subsequent alignment steps very significantly and also reduces disk space usage.

Discussion

The data and the fusion genes identified from it by the described approach (Edgren et al., Kangaspeska et al.) have been used as a "gold-standard" test set by more than 15 publications describing new fusion gene detection algorithms. These publications provide an extensive validation of the very high sensitivity and specificity of our methodology.

One limitation of our method is that it does not find fusions that involve a gene that is not annotated by Ensembl. It also does not find fusion genes that involve an unannotated exon. In practice, however, this is not a significant limitation, since most fusion genes involve known exons of two known genes.

In addition to RNA-seq data, DNA copy number data can be used to support fusion gene identification. In practice, it has been observed (see e.g. Edgren et al.) that the genomic breakpoints at which fusion genes occur frequently contain copy number alterations (CNAs). Therefore, if copy number data is available for a sample, the list of candidate fusion genes from that sample can be compared against a list of identified CNAs to obtain further support for fusion gene predictions. However, CNAs are not always observed at known fusion gene breakpoints. Lack of a CNA at a predicted breakpoint should therefore not be seen as disproving a fusion gene candidate. CNAs at fusion gene associated genomic breakpoints also seem to be rarer in e.g. leukemias and other cancer types that, overall, contain fewer CNAs.

References

Edgren H, Murumägi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M, Børresen-Dale AL, Kallioniemi O. Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biol. 2011;12(1):R6.

Kangaspeska S, Hultsch S, Edgren H, Nicorici D, Murumägi A, Kallioniemi O. Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms. PLoS One. 2012;7(10):e48745.
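
As an illustration of the read-pair selection rules described under Methods, the following Python sketch shows the two decisions in miniature: the targeted search keeps a read pair only if exactly one of its reads hits a gene of interest, and candidate gene-gene pairs are then tallied from read pairs whose two reads fall in different genes. It is a minimal sketch and not the published implementation. The input representation (one gene name per read, None for a read with no gene assignment), the function names and the min_support threshold are all assumptions made for this example; the actual filtering steps are described in Edgren et al.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical input: one (gene_of_read_1, gene_of_read_2) tuple per read pair,
# standing in for real alignment results. None means the read was not
# assigned to any annotated gene.
ReadPair = Tuple[Optional[str], Optional[str]]


def keep_for_targeted_search(pair: ReadPair, genes_of_interest: set) -> bool:
    """Retain a pair only if exactly one of its two reads hits a gene of interest."""
    first_hit = pair[0] in genes_of_interest
    second_hit = pair[1] in genes_of_interest
    return first_hit != second_hit


def candidate_gene_pairs(pairs: Iterable[ReadPair], min_support: int = 2) -> dict:
    """Count read pairs whose two reads align to different genes.

    Gene pairs are stored unordered (sorted by name), a simplification made
    here; the orientation of a fusion is not decided at this stage.
    min_support is an illustrative threshold, not a value from the paper.
    """
    counts = Counter()
    for gene_a, gene_b in pairs:
        # Skip pairs with an unassigned read and pairs within a single gene.
        if gene_a is None or gene_b is None or gene_a == gene_b:
            continue
        counts[tuple(sorted((gene_a, gene_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Toy data: gene labels chosen only for illustration.
read_pairs = [
    ("BCR", "ABL1"),
    ("ABL1", "BCR"),
    ("ABL1", "ABL1"),
    ("TP53", "TP53"),
    ("BCR", None),
    ("ABL1", None),
    ("GAPDH", "ACTB"),
]

retained = [p for p in read_pairs if keep_for_targeted_search(p, {"ABL1"})]
candidates = candidate_gene_pairs(retained)
print(retained)
print(candidates)
```

With ABL1 as the only gene of interest, three of the seven toy read pairs are retained, and the only candidate gene-gene pair that remains is ABL1 with BCR, supported by two read pairs. Calling candidate_gene_pairs on the unfiltered list corresponds to the full search, which would also consider pairs that do not involve a gene of interest.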