Fusion gene detection
Henrik Edgren
Introduction
For the purpose of this white paper, it is assumed that fusion genes will be
identified from paired-end RNA sequencing (RNA-seq) data. This has the
advantage that, by definition, only expressed fusion genes will be found.
Whole genome sequencing data can also be used to search for gene fusions
but, as it cannot tell which fusions are expressed, it is less efficient at finding
potentially oncogenic gene fusions.
Here, fusion gene detection is divided into two different use cases:
1) Full fusion gene search: identification of all gene fusions in a sample.
2) Targeted fusion gene search: specific identification of only those gene
fusions, in which one of the partner genes belongs to a short list of
predefined genes. If the gene of interest is e.g. ABL1, this use case would
identify all fusions XYZ-ABL1 and ABL1-XYZ, in which XYZ can be any gene
other than ABL1 in the human genome.
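The targeted use case can be illustrated with a minimal sketch. It assumes
fusions are represented as (5' gene, 3' gene) tuples; the function name and the
input list are hypothetical and serve only to show which partner combinations
the targeted search would report.

```python
def targeted_fusions(fusions, genes_of_interest):
    """Keep fusions in which at least one partner is a gene of interest.

    fusions: iterable of (five_prime_gene, three_prime_gene) tuples.
    genes_of_interest: set of gene symbols on the predefined list.
    The tuple representation is an assumption made for this sketch.
    """
    return [
        (five_prime, three_prime)
        for five_prime, three_prime in fusions
        # A gene fused to itself is not an XYZ-ABL1 or ABL1-XYZ fusion.
        if five_prime != three_prime
        and (five_prime in genes_of_interest or three_prime in genes_of_interest)
    ]


# Illustrative input only; not results from any sample.
example_fusions = [("BCR", "ABL1"), ("ABL1", "NUP214"), ("TMPRSS2", "ERG")]
abl1_fusions = targeted_fusions(example_fusions, {"ABL1"})
```

Both orientations are kept, so ABL1 may be either the 5' or the 3' partner.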
The reason for distinguishing the two use cases is the difference in the
demands they place on computational resources, both in terms of running time
and disk usage, during next-generation sequencing data analysis. When analyzing
only a small number of samples, say in the low hundreds, the sensible thing
is to run a full fusion gene search, as the computational resources and time
required are manageable. However, if the aim is to identify specific fusion
genes from thousands of samples, e.g. all fusions including ABL1, it is
significantly faster to run a targeted fusion gene search, as thousands of
full fusion gene searches demand considerable computational resources and,
more importantly, a lot of time.
Methods
The full fusion gene detection method is based on Edgren et al. (2011), with
added improvements to increase the sensitivity of fusion gene detection
(Kangaspeska et al., 2012). In brief, the full search relies on first using the paired-end
data to identify candidate fusion gene-gene pairs. This is achieved by
aligning all sequence reads against the human genome and transcriptome
(Ensembl), and identifying those paired-end read pairs, in which the reads
align to different genes. These read pairs, after a number of filtering steps,
define the candidate gene-gene pairs for all following steps.
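The candidate-pair step can be sketched as follows. This is a simplified
illustration, not the published pipeline: it assumes each read pair has already
been reduced to the gene each mate aligned to, and the `min_pairs` threshold is
a hypothetical stand-in for the filtering steps, which are not specified here.

```python
from collections import Counter


def candidate_gene_pairs(read_pairs, min_pairs=2):
    """Count read pairs whose two mates align to different genes.

    read_pairs: iterable of (gene_of_mate1, gene_of_mate2); None marks a
        mate that did not align to any annotated gene.
    min_pairs: hypothetical minimum number of supporting read pairs,
        standing in for the real filtering steps.
    """
    counts = Counter()
    for gene1, gene2 in read_pairs:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue  # not evidence for a gene-gene fusion
        # Sort so that A/B and B/A support the same candidate pair.
        counts[tuple(sorted((gene1, gene2)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_pairs}


# Illustrative input only.
example_pairs = [
    ("BCR", "ABL1"),
    ("ABL1", "BCR"),
    ("BCR", "BCR"),
    ("TP53", None),
    ("ERBB2", "ACACA"),
]
candidates = candidate_gene_pairs(example_pairs)
```

Here only the BCR/ABL1 pair reaches the assumed threshold of two supporting
read pairs; the single ERBB2/ACACA pair is filtered out.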
Next, these gene-gene pairs are used to construct all possible exon-exon
combinations between the two genes, and alignment of sequence reads
against these synthetic fusion junctions identifies the final fusion gene
candidate list. Our method achieves its high specificity and sensitivity by
analyzing the pattern of sequence read alignments across the fusion
junction. For details, see Edgren et al. (2011).
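Construction of the synthetic junctions can be sketched as below. The exon
dictionaries, the `flank` length, and the choice to emit both gene orders are
assumptions made for illustration; the analysis of read alignment patterns
across the junction is not shown.

```python
from itertools import product


def synthetic_junctions(exons_a, exons_b, flank=50):
    """Build every exon-exon junction sequence between two genes.

    exons_a, exons_b: dict mapping exon id -> exon sequence.
    flank: number of bases taken from each side of the junction
        (hypothetical parameter; must be positive).
    Returns a dict mapping (upstream_exon, downstream_exon) -> sequence.
    """
    if flank <= 0:
        raise ValueError("flank must be positive")
    junctions = {}
    for (id_a, seq_a), (id_b, seq_b) in product(exons_a.items(), exons_b.items()):
        # End of the upstream exon joined to the start of the downstream exon,
        # in both possible gene orders.
        junctions[(id_a, id_b)] = seq_a[-flank:] + seq_b[:flank]
        junctions[(id_b, id_a)] = seq_b[-flank:] + seq_a[:flank]
    return junctions


# Toy sequences, far shorter than real exons.
gene_a = {"A1": "AAAACCCC", "A2": "GGGGTTTT"}
gene_b = {"B1": "ACGTACGT"}
junctions = synthetic_junctions(gene_a, gene_b, flank=4)
```

Sequence reads would then be aligned against the values of `junctions`, and
reads spanning the join would support the corresponding exon-exon combination.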
Helsinki 24.5.2013
The targeted fusion gene search starts with alignment of sequence reads
to a reference created from all transcripts of the genes on the predefined list
(genes of interest). All read pairs in which only one of the reads aligns to a
gene of interest are retained. As a result, most sequence reads
are discarded after the first alignment round, significantly reducing the time
required in subsequent steps. Next, all retained sequence reads are aligned
to the full human transcriptome to create a list of candidate gene-gene
pairs. Subsequent steps are the same as in the full fusion gene search.
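The read-pair retention rule of the targeted search can be sketched as follows.
The sketch assumes the first alignment round has been summarized as one record
per read pair, giving the gene of interest each mate aligned to (or None); the
record layout and function name are hypothetical.

```python
def retain_for_targeted_search(read_pairs, genes_of_interest):
    """Return ids of read pairs in which only one mate hits a gene of interest.

    read_pairs: iterable of (pair_id, gene_of_mate1, gene_of_mate2), where a
        gene is None if the mate did not align to the small reference.
    genes_of_interest: set of gene symbols on the predefined list.
    """
    retained = []
    for pair_id, gene1, gene2 in read_pairs:
        hit1 = gene1 in genes_of_interest
        hit2 = gene2 in genes_of_interest
        # Keep the pair only if exactly one of the two mates aligned.
        if hit1 != hit2:
            retained.append(pair_id)
    return retained


# Illustrative input only.
first_round = [
    ("r1", "ABL1", None),    # one mate in ABL1: possible fusion, keep
    ("r2", "ABL1", "ABL1"),  # both mates in ABL1: ordinary transcript, drop
    ("r3", None, None),      # neither mate aligned: drop
    ("r4", None, "ABL1"),    # one mate in ABL1: keep
]
kept = retain_for_targeted_search(first_round, {"ABL1"})
```

Only the retained pairs are passed on to the alignment against the full
transcriptome.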
The main points at which this targeted approach saves time and disk space
are:
- The reference used in the first alignment step is small, provided the
predefined gene list is kept reasonably short. This speeds up the step in
which all sequence reads are aligned, which is one of the most time-consuming
steps in the analysis pipeline.
- After the first alignment step, only those sequence read pairs in which only
one end aligns to a gene of interest are retained. This excludes well over
99% of all sequence reads from subsequent alignment steps, which both speeds
up those steps considerably and reduces disk space usage.
Discussion
The data and fusion genes identified from it by the described approach
(Edgren et al., 2011; Kangaspeska et al., 2012) have been used as a "gold-standard" test set
by more than 15 publications describing new fusion gene detection
algorithms. These publications provide an extensive validation of the very
high sensitivity and specificity of our methodology.
One limitation of our method is that it does not find fusions that involve a
gene that is not annotated by Ensembl. It also does not find fusion genes that
involve an unannotated exon. In practice, however, this is not a significant
limitation, since most fusion genes involve known exons of two known
genes.
In addition to RNA-seq data, DNA copy number data can be used to support
fusion gene identification. In practice, it has been observed (see e.g. Edgren
et al.) that the genomic breakpoints at which fusion genes occur frequently
contain copy number alterations (CNAs). Therefore, if copy number data is
available for a sample, the list of candidate fusion genes from that sample
can be compared against a list of identified CNAs, to obtain further support
for fusion gene predictions. However, CNAs are not always observed at
known fusion gene breakpoints. Lack of CNAs at a predicted breakpoint
should therefore not be seen as disproving a fusion gene candidate. The
occurrence of CNAs at fusion gene associated genomic breakpoints also
seems to be rarer in e.g. leukemias and other cancer types that, overall,
contain fewer CNAs.
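The comparison of fusion candidates against CNAs can be sketched as a simple
interval check. The data layout, the `window` size, and all coordinates below
are hypothetical; a candidate is marked as supported if any CNA segment lies
within the window around one of its predicted breakpoints.

```python
def cna_support(candidates, cnas, window=100_000):
    """Flag fusion candidates that have a CNA near a predicted breakpoint.

    candidates: dict mapping fusion name -> list of (chrom, position).
    cnas: list of (chrom, start, end) copy number alteration segments.
    window: assumed tolerance in bases around each breakpoint.
    """
    support = {}
    for name, breakpoints in candidates.items():
        support[name] = any(
            chrom == cna_chrom
            and cna_start <= position + window
            and cna_end >= position - window
            for chrom, position in breakpoints
            for cna_chrom, cna_start, cna_end in cnas
        )
    return support


# Made-up names and coordinates, for illustration only.
fusion_candidates = {
    "GENE_A-GENE_B": [("chr17", 1_000_000), ("chr17", 5_000_000)],
    "GENE_C-GENE_D": [("chr9", 2_000_000), ("chr22", 3_000_000)],
}
cna_segments = [("chr17", 900_000, 1_050_000), ("chr9", 10_000_000, 12_000_000)]
supported = cna_support(fusion_candidates, cna_segments)
```

In line with the caveat above, a False value only means that no supporting CNA
was found; it should not be used to reject a fusion gene candidate.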
References
Edgren H, Murumägi A, Kangaspeska S, Nicorici D, Hongisto V, Kleivi K, Rye IH, Nyberg S, Wolf M,
Børresen-Dale AL, Kallioniemi O. Identification of fusion genes in breast cancer by paired-end
RNA-sequencing. Genome Biol. 2011;12(1):R6.
Kangaspeska S, Hultsch S, Edgren H, Nicorici D, Murumägi A, Kallioniemi O. Reanalysis of
RNA-sequencing data reveals several additional fusion genes with multiple isoforms. PLoS One.
2012;7(10):e48745.