Considerations for Analyzing Targeted NGS Data – HLA
Tim Hague, CTO
Introduction
• Human leukocyte antigen (HLA) is the major histocompatibility complex (MHC) in humans.
• A group of genes (a 'superregion') on chromosome 6.
• Essentially encodes cell-surface antigen-presenting proteins.
Functions
HLA genes have functions in:
combating infectious diseases
graft/transplant rejection
autoimmunity
cancer
Alleles
• Large number of alleles (and proteins).
• Many alleles are already known.
The number of known alleles is increasing
HLA Class I
Gene       A      B      C
Alleles    2013   2605   1551
Proteins   1448   1988   1119
HLA Class II
Gene       DRA   DRB*   DQA1   DQB1   DPA1   DPB1
Alleles    7     1260   47     176    34     155
Proteins   2     901    29     126    17     134
HLA Class II - DRB Alleles
Gene       DRB1   DRB3   DRB4   DRB5
Alleles    1159   58     15     20
Proteins   860    46     8      17
Analysis Challenges
HLA genes have specific analysis challenges regardless of the sequencing technology.
High Polymorphism
High rate of polymorphism – up to 100 times the average human mutation rate.
The HLA-DRB1 and HLA-B loci have the highest sequence variation rate within the human genome.
High degree of heterozygosity – homozygotes are the exception in this region.
Duplications
• High level of segmental duplications
• Lots of similar genes and lots of very similar pseudogenes.
• Duplicated segments can be more similar to each other within an individual than they are to the corresponding segments of the reference genome.
Complex Genetics
• Particularly HLA-DRB*
• The DR β-chain is encoded by 4 loci; however, no more than 3 functional loci are present in a single individual, and no more than 2 per chromosome.
Mitigating Factors
It's not all bad news:
Many HLA alleles are already well known – both in terms of sequence and frequencies within the population.
The HLA region is fairly small, so there is a high degree of linkage disequilibrium and therefore lots of known haplotypes.
Traditional Typing
• SSO (sequence-specific oligonucleotide probes) – low resolution, high throughput, cheap
• SSP (sequence-specific primers) – very fast results, low resolution
• SBT – sequence-based typing, high resolution, usually done by Sanger sequencing.
NGS Typing
High resolution, an alternative to Sanger-based SBT
Why is it needed?
Sanger and HLA
• Sanger data is still the gold standard in the genomic sequencing industry, even though it is very expensive compared to NGS.
• Error rate of 1 in 1'000 bases; if forward and reverse typing are both done, the error rate drops to 1 in 1'000'000.
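The drop from 1 in 1'000 to 1 in 1'000'000 follows if the forward and reverse reads err independently, which is an assumption made here for illustration and is not stated on the slide. A minimal sketch of that arithmetic (the function name is hypothetical):

```python
from fractions import Fraction


def combined_error_rate(per_read_error: Fraction, n_reads: int = 2) -> Fraction:
    """Probability that all n_reads are wrong at the same base,
    assuming the reads fail independently of each other."""
    return per_read_error ** n_reads


single_read = Fraction(1, 1000)          # 1 error in 1'000 bases
both_reads = combined_error_rate(single_read)
print(both_reads)                        # 1/1000000
```

Exact fractions are used instead of floats so the result prints as a ratio that can be compared directly with the figures on the slide.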
So why is it bad for HLA?
Phase Resolution
• Two copies of chromosome 6
• Many loci, many alleles
• Lots of heterozygosity
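Allele Phasing problem
[Figure: reference sequence and consensus sequence with two heterozygous positions, G/T and T/A. Which bases lie on the same allele? Either allele 1 = T and allele 2 = A, or allele 1 = A and allele 2 = T.]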
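To make the ambiguity concrete, the sketch below enumerates every allele pair that is consistent with a set of unphased heterozygous positions. It uses the two positions from the slide (G/T and T/A); the function name and the input layout are assumptions chosen for illustration.

```python
from itertools import product


def possible_phasings(het_sites):
    """Enumerate the distinct allele pairs consistent with unphased sites.

    het_sites: list of (base_a, base_b) tuples, one per heterozygous position.
    Returns a sorted list of (allele_1, allele_2) tuples.
    """
    phasings = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        allele_1 = "".join(site[c] for site, c in zip(het_sites, choice))
        allele_2 = "".join(site[1 - c] for site, c in zip(het_sites, choice))
        # A pair is unordered: (x, y) and (y, x) are the same genotype.
        phasings.add(frozenset((allele_1, allele_2)))
    return sorted(tuple(sorted(pair)) for pair in phasings)


print(possible_phasings([("G", "T"), ("T", "A")]))
# [('GA', 'TT'), ('GT', 'TA')]
```

With n heterozygous positions there are 2^(n-1) candidate pairs, so the number of possibilities doubles with every additional heterozygous site.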
The Problem with Sanger
• There is only one signal
• High degree of heterozygosity = high degree of ambiguity
• Requires statistical techniques based on known allele frequencies, plus manual intervention by trained operators
• Ambiguity can only be resolved statistically, which can lead to wrong assignment for rare types
HLA typing by Sanger method
GGACSGGRASACACGGAAWGTGAAGGCCCACTCACAGACTSACCGAGYGRACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGMCGGT
[Figure: plot of the number of potential alleles; axis range 0 to 550.]
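The letters S, R, W, Y and M in the Sanger consensus above are IUPAC ambiguity codes, each standing for two possible bases at a heterozygous position. The sketch below locates such positions and checks whether a candidate allele pair could produce a given consensus. The function names are hypothetical, and only the two-base IUPAC codes are handled.

```python
# Standard IUPAC nucleotide codes (single bases and two-base ambiguity codes).
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def heterozygous_positions(consensus):
    """Return the 0-based indices where the consensus shows two bases."""
    return [i for i, code in enumerate(consensus) if len(IUPAC[code]) == 2]


def consistent(consensus, allele_1, allele_2):
    """True if the two alleles together would yield exactly this consensus."""
    if not (len(consensus) == len(allele_1) == len(allele_2)):
        return False
    return all(
        {a, b} == IUPAC[code]
        for code, a, b in zip(consensus, allele_1, allele_2)
    )


fragment = "GGACSGGRAS"  # first ten bases of the consensus on the slide
print(heterozygous_positions(fragment))  # [4, 7, 9]
```

Several different allele pairs pass `consistent` for the same consensus, which is the ambiguity the slide describes: the single Sanger signal cannot say which bases belong together.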
NGS Advantages
• Can reduce ambiguity
• Phase resolution - two signals, but lots of short reads
• Cheaper and faster than Sanger
• Less manual intervention required
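A read that spans two heterozygous positions shows directly which bases sit on the same allele, which is how short reads can resolve phase. The sketch below counts that evidence. It assumes gap-free alignments given as (start, sequence) tuples, and the function name and toy reads are illustrative only.

```python
from collections import Counter


def phase_evidence(reads, pos_1, pos_2):
    """Count the base combinations seen on reads covering both positions.

    reads: list of (start, sequence) tuples, aligned without gaps (assumption).
    pos_1, pos_2: 0-based reference coordinates of two heterozygous sites.
    """
    counts = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if start <= pos_1 < end and start <= pos_2 < end:
            counts[(seq[pos_1 - start], seq[pos_2 - start])] += 1
    return counts


toy_reads = [
    (0, "ACGAT"),  # covers both sites: G at position 2, T at position 4
    (1, "CTAA"),   # covers both sites: T at position 2, A at position 4
    (2, "GAT"),    # covers both sites: G at position 2, T at position 4
    (3, "AT"),     # covers only position 4, so it carries no phase information
]
print(phase_evidence(toy_reads, 2, 4))
# Counter({('G', 'T'): 2, ('T', 'A'): 1})
```

Reads that cover only one of the two sites contribute nothing, which is why short reads limit how far apart two positions can be and still be phased.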
NGS Data - Unphased
NGS Data - Phased
NGS Approaches
• HLA*IMP – chip-based imputation engine
• Reference-based alignment, followed by an HLA call based on the variants detected during alignment
• Search against a database of known alleles
NGS Reference-based
• Fraught with difficulties
• Very hard to align reads to this region
• The variant/HLA call is only as good as the alignment
• No coverage = no call
Has been attempted by the Broad Institute (HLA Caller) and Roche.
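Since no coverage means no call, a useful first check on any reference-based alignment is to list the parts of the target that no read covers. The sketch below does this for half-open (start, end) alignment intervals. The function name and the interval representation are assumptions, not taken from any of the tools mentioned.

```python
def uncovered_intervals(alignments, region_start, region_end):
    """Return the half-open sub-intervals of the region with zero coverage.

    alignments: iterable of (start, end) half-open read intervals.
    """
    covered = [False] * (region_end - region_start)
    for start, end in alignments:
        for pos in range(max(start, region_start), min(end, region_end)):
            covered[pos - region_start] = True

    gaps, gap_start = [], None
    for offset, is_covered in enumerate(covered):
        if not is_covered and gap_start is None:
            gap_start = region_start + offset           # a gap opens here
        elif is_covered and gap_start is not None:
            gaps.append((gap_start, region_start + offset))  # the gap closes
            gap_start = None
    if gap_start is not None:                           # gap runs to region end
        gaps.append((gap_start, region_end))
    return gaps


print(uncovered_intervals([(100, 150), (140, 180), (220, 260)], 100, 300))
# [(180, 220), (260, 300)]
```

Any exon that overlaps one of the returned intervals cannot contribute variants, so it cannot contribute to the HLA call either.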
Alignment Efforts
RainDance provides a targeted HLA amplification kit called HLAseq.
Target: the whole MHC superregion (except for some tandem repeat regions)
Goal: align this data before making the variant/HLA call.
Diverse variant “density” in the MHC superregion (based on a single sample)
Default BWA alignment – no coverage at an exon of HLA-DMB
Low coverage and orphaned reads at an HLA-DRB1 exon
BWA vs more permissive alignment: higher coverage = higher noise
Large targeted region without usable coverage
NGS Reference-based
Not providing enough coverage everywhere
What about de novo?
De novo assembly (MIRA)
287 contigs (longest contig: 2199 bp)
Mean contig size: 268 bp
Median contig size: 209 bp
Total consensus: 77084 bp
RainDance target: ~ 3800000 bp
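The figures above can be put side by side to see how little of the target the assembly recovered. The sketch below uses only the numbers on the slide; the helper name is hypothetical, and the target size is the approximate value given there.

```python
def fraction_of_target(assembled_bp: int, target_bp: int) -> float:
    """Share of the target region represented in the assembled consensus."""
    return assembled_bp / target_bp


total_consensus_bp = 77_084   # total consensus reported for the MIRA assembly
target_bp = 3_800_000         # approximate RainDance target size
n_contigs = 287

print(f"Mean contig size: {total_consensus_bp / n_contigs:.1f} bp")
print(f"Fraction of target assembled: "
      f"{fraction_of_target(total_consensus_bp, target_bp):.1%}")
```

The assembled consensus comes to roughly 2% of the targeted region, which is the shortfall that makes de novo assembly on its own unusable here.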
NGS De Novo Alignment
Not enough contigs produced, not enough coverage of the target region.
What about a hybrid approach?
De novo assembly with “backbone”
First, alignment to the backbone, then de novo assembly
Backbone: 2220 contigs from hg19 chr6 (total: 3554852 bp), covering almost the whole RainDance target
Results:
Max reads / backbone contig: 197
Max coverage: 71
NGS Typing - Alignment Based
We tried:
Burrows-Wheeler Aligner (BWA)
A more sensitive seed-and-extend aligner
De novo assembler (MIRA)
'Hybrid' de novo assembler
The variant/HLA call is only as good as the alignment.
The alignments were not good enough.
NGS Database Based
• Search against a 'database' of known alleles
• Such as the IMGT/HLA database, available from the EBI web site
Stanford, Connexio, JSI Medical, BC Cancer Agency and Omixon have all tried this approach.
DB Based Approach
Advantages
Fewer mapping headaches
Unambiguous results
Potential to be fast
Difficulties
Novel allele detection
Homozygous alleles
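As a rough illustration of the database idea, the sketch below scores every pair of known alleles by how many reads it explains and reports the best pair. This is a toy under stated assumptions, not the method of any tool named in this talk: reads must match an allele exactly, and the allele names and sequences are invented.

```python
from itertools import combinations_with_replacement


def best_allele_pair(reads, allele_db):
    """Return ((allele_a, allele_b), n_reads_explained) for the best pair.

    allele_db: dict mapping allele name to sequence (toy data, hypothetical).
    A read counts as explained if it is an exact substring of either allele.
    """
    best_pair, best_score = None, -1
    for a, b in combinations_with_replacement(sorted(allele_db), 2):
        score = sum(1 for r in reads if r in allele_db[a] or r in allele_db[b])
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair, best_score


toy_db = {
    "toy*01": "ACGTACGTAA",
    "toy*02": "ACGTTCGTAA",
    "toy*03": "ACCTACGTCA",
}
toy_reads = ["GTACG", "GTTCG", "ACGT", "CGTAA"]
print(best_allele_pair(toy_reads, toy_db))
# (('toy*01', 'toy*02'), 4)
```

Both listed difficulties show up even in this toy. A read from a novel allele matches nothing in the database, and for a homozygous sample the true pair scores no better than that allele paired with any other, so extra evidence is needed to separate them.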
Results with Exome data
Exon level detail
Detailed results - short read pileup
Conclusions
• The DB-based approach to HLA typing is new but very promising
• NGS approaches can resolve much of the ambiguity of Sanger SBT
• The DB-based approach can also overcome the limitations of NGS reference-based alignment
Conclusions
Available DB based HLA typing tools differ in:
Speed
Sequencers supported
Types of sequencing data supported (targeted, exome, whole genome)
Ease of use
Ambiguity of results
Degree of manual intervention required
Novel allele detection capabilities