Visual Analytics for
Genomics
Cydney Nielsen
BC Cancer Agency
Vancouver, BC, Canada
Outline
Part 1
Introduction to Genomics
Part 2
Visual Design for Genomics
Part 3
Hands-On Design Exercise
Part 1
Introduction to Genomics
Genomics Workflow
genome:
the complete genetic material of a cell
Part 1. Intro to Genomics
Sequencing Experiment
G - C
T - A
Genomics Workflow
sample
experiment (sequencing technology)
data
+
analysis (visualization, computation)
insight
Genomics Workflow
molecular biology: sample, experiment (sequencing technology), data
computational biology / bioinformatics, visual analytics: data + analysis (visualization, computation), insight
Sequencing Experiment
TACACCGATACACCAGA
ACCAGATGGATTAGATGTA
AAAAAAAAAAAAAAGATGT
AAAGATGTATACCACCAG
CACCAGTACACCGATA
Sequencing machine
Millions of short sequences (“reads”)
e.g. 75 nt each compared to >3 billion nt in human genome
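To put the two numbers on this slide side by side, here is a rough back-of-the-envelope sketch. The function name and the 30-fold coverage value are illustrative assumptions, not figures from the talk; only the 75 nt read length and the (rounded) 3 billion nt genome size come from the slide.

```python
def reads_needed(genome_nt, read_nt, coverage=1):
    """Smallest number of reads whose total length is at least coverage * genome_nt."""
    total_nt = genome_nt * coverage
    return -(-total_nt // read_nt)  # ceiling division on integers


HUMAN_GENOME_NT = 3_000_000_000  # ">3 billion nt" on the slide, rounded down
READ_NT = 75                     # read length quoted on the slide

# Reads required to cover every base once on average.
print(reads_needed(HUMAN_GENOME_NT, READ_NT))

# Reads required at an assumed 30-fold average coverage.
print(reads_needed(HUMAN_GENOME_NT, READ_NT, coverage=30))
```

Even a single pass over the genome already needs tens of millions of reads, which is consistent with the "millions of short sequences" on the slide.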
Sequencing Experiment
~$5,000 in 2001
~10¢ in 2011
Sequencing Experiments
De novo assembly
AGCTTCAGATGGACAGATAA
GGCATACAGACTTAGACATA
CCAGACAAGACAGACACAGTA
TACAAGACATAAGCAATACAGA
CCAGACAAGACAGACACAGTA
Genome Assembly
Re-sequencing
GGCATACAGACTTAGACATA
AGCTTCAGATGGACAGATAA
CCAGACAAGACAGACACAGTA
CCAGACAAGACAGACACAGTA
TACAAGACATAAGCAATACAGA
Reference Genome
Enrichment
CCAGACAAGACAGACACAGTA
AGCTTCAGATGGACAGATAA
GGCATACAGACTTAGACATA
CCAGACAAGACAGACACAGTA
TACAAGACATAAGCAATACAGA
Reference Genome
Sequencing Experiments
•  What sequence variations appear in cancer patients, but not in unaffected individuals?
•  Are these variations predictive of survival outcome?
•  Are these variations causal for the disease (driver mutations) or not?
Part 1 - Summary
1.  Large and ever increasing volume of sequencing data
2.  Improved analysis techniques are essential for biologists and clinicians to make the most of these data
3.  Great potential for visual analytics to facilitate insight and understanding
Part 2
Visual Design for Genomics
Challenge 1
Large number of samples for comparison
“To systematically characterize the genomic changes in hundreds of tumors… and thousands of samples over the next five years”
The Cancer Genome Atlas
www.cancergenome.nih.gov
Part 2. Visual Design for Genomics
Genome Browsers
Stacked data tracks along a common genome x-axis
[figure: data samples (rows) against genome coordinate (x-axis)]
Genome Browsers
UCSC Cancer Genomics Heatmaps
Glioblastoma Copy Number Abnormality, Agilent 244A array (n=200)
Zhu et al., Nature Methods, 2009
[figure: heatmap of data samples against genome coordinate, with sample annotations Gender and Tumor vs normal]
Challenge 1
Large number of samples for comparison
Critically consider what you need to display
e.g. replace primary data with a biologically meaningful summary, such as significant changes between samples
Challenge 2
Genomic features are small and sparse
Genome Browsers
LOCAL VIEW
Human chr1, 1 pt corresponds to 480 kb, which is larger than 98% of all human genes
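The scale problem can be made concrete with a small sketch. The 480 kb per point figure is from the slide; the function names and the 50 kb example gene are hypothetical and chosen only for illustration.

```python
def point_for_position(pos_nt, bases_per_point=480_000):
    """0-based horizontal point that a genomic position falls into."""
    return pos_nt // bases_per_point


def drawn_width_points(start_nt, end_nt, bases_per_point=480_000):
    """Number of distinct points touched by the half-open feature [start_nt, end_nt)."""
    first = point_for_position(start_nt, bases_per_point)
    last = point_for_position(end_nt - 1, bases_per_point)
    return last - first + 1


# Hypothetical 50 kb gene (coordinates invented for illustration).
gene_start, gene_end = 1_000_000, 1_050_000

# At whole-chromosome scale the gene collapses into a single point.
print(drawn_width_points(gene_start, gene_end))

# Zoomed in to an assumed 1 kb per point, the same gene spans many points.
print(drawn_width_points(gene_start, gene_end, bases_per_point=1_000))
```

At the chromosome-wide scale, neighbouring genes also land on the same point, which is why an overview alone cannot show individual features.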
Hilbert Curve
GLOBAL VIEW
[figure: a, folding steps of the curve; b, Chromosome 3L colored by chromatin state (1 to 9), with labeled regions: heterochromatin-like domain, PcG domains, open chromatin domain, cluster of small expressed genes, pericentromeric heterochromatin]
Figure 2 | Visualization of spatial scales and organization using compact folding. a, The chromosome is folded using a geometric pattern (Hilbert space-filling curve) that maintains spatial proximity of nearby regions. An illustration of the first four folding steps is shown. Note that although this compact curve is optimal for preserving proximity relationships, some distal sites appear adjacent along the fold axis (green dots). b, Chromosome 3L in S2 cells.
Kharchenko et al., Nature, 2011
Anders, Bioinformatics, 2009
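The folding idea can be sketched in a few lines. This is a minimal illustration using the commonly published index-to-coordinate conversion for the Hilbert curve, not the code behind the figures cited above. The helper `position_to_cell` and its parameters are assumptions introduced here.

```python
def d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order by 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before rotating it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def position_to_cell(pos_nt, chrom_len_nt, order):
    """Grid cell for a genomic position (hypothetical helper; 0 <= pos_nt < chrom_len_nt)."""
    cells = 4 ** order
    return d2xy(order, pos_nt * cells // chrom_len_nt)


# The first folding step visits the four cells of a 2 x 2 grid in order.
print([d2xy(1, d) for d in range(4)])
```

Consecutive indices always land in neighbouring cells, which is the proximity property the caption describes. The converse does not hold: two neighbouring cells can be far apart along the curve, matching the caveat about distal sites appearing adjacent.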
Challenge 2
Genomic features are small and sparse
Connect overview and detail
Challenge 3
Genomic features involve non-adjacent positions
Challenge 3
Structural rearrangements
…points’, span a wide range of distances and affect sequence segments of varying size. For example, tandem duplications may involve a localized repetition of only a few kilobases, whereas the breakpoints of translocations are located on nonhomologous chromosome arms and may result in the rearrangement of large genomic chunks. Finding a representation that enables one to track breakpoints across this scale can be challenging. This is exacerbated by the fact that variant genomic fragments can be…
[figure: panels a to e; segments J, Jʹ, K, Kʹ shown in Reference and Variant arrangements]
Figure 1 | Representations of a translocation. (a,b) Linear (a) and circular (b) reference genome layouts with an arc to depict a translocation between two chromosomes (pink and blue). (c) Translocation illustrated as reference-sequence segments with chromosome colors corresponding to those in a.
Cydney Nielsen & Bang Wong
Challenge 3
Structural rearrangements
Circos, Martin Krzywinski
Challenge 3
Structural rearrangements
VISTA-Dot
Supplementary Figure 1 Global dot-plot of Sorghum bicolor and Or… assemblies displayed using VISTA-Dot.
Challenge 3
All these representations use a genomic coordinate system, which emphasizes base-pair distance between points.
Is this the best use of positional information?
Challenge 3
Match the encoding method to the data
Data can be encoded using visual properties such as position (scatter plot), length (bar plot), angle … (heat map). Figure 35 (adapted from Figure 15 in [7]), ranks different encodings according to their … representing quantitative (numbers), ordinal (categories with implied order, such as “best”, “better” …) … (categories without an order, such as brands of cars) variables.
[figure encoding-schemes.eps]
Figure 35. Many encoding schemes exist and should be selected based on the type of encoded variable.
M. Krzywinski adapted from Mackinlay J (1986) ACM Trans Graph 5: 110-141.
In any encoding, simpler visual forms are preferable. Consider the Venn diagram in Figure 36A (ad… [59]). The Venn diagram demonstrates a nested data set: all values in Z are in Y and all values in Y … need to show 4 of the 7 intersections. These data are shown better as a set of concentric circles (Fig…
Challenge 3
Genomic features involve non-adjacent positions
Encode important information in position
Challenge 4
Large number of data types
Genomic rearrangement in cancer
[figure: A, SNU-C1 (colorectal): Chr 15; B, 8505C (thyroid): Chr 9. Rearrangements (deletion-type, tandem dup-type, tail-to-tail inverted, head-to-head inverted; non-inverted and inverted orientation) drawn with copy number and allelic ratio tracks against genomic location (Mb)]
Stephens et al., Cell, 2011
17 mouse genomes
[figure: a, per-chromosome (1 to 19 and X) tracks of SNPs, SVs, TEs and uncallable regions for mouse strains including CAST/EiJ, WSB/EiJ, PWK/PhJ and SPRET/EiJ; b, counts of SNPs, indels, SV deletions and TE insertions]
Keane et al., Nature, 2011
Challenge 4
Large number of data types
Exploit domain-specific details in your design
Challenge 5
No longer one genome but many
Single nucleotide variation
Ossowski et al., Genome Research, 2008
Single nucleotide variation
Integrative Genomics Viewer (IGV)
Robinson et al., Nature Biotechnology, 2011
Challenge 5
No longer one genome but many
Be open to change (genomics is evolving quickly)
Part 2 - Summary
1. Critically consider what you need to display
2. Connect overview and detail
3. Encode important information in position
4. Exploit domain-specific details in your design
5. Be open to change (genomics is evolving quickly)
Part 3
Hands-On Design Exercise
Genome Assembly
Input
TACACCGATACACCAGA
ACCAGATGGATTAGATGTA
AAAAAAAAAAAAAAGATGT
AAAGATGTATACCACCAG
CACCAGTACACCGATA
Aligned
AAAAAAAAAAAAAAGATGT
           AAAGATGTATACCACCAG
                       CACCAGTACACCGATA
                             TACACCGATACACCAGA
                                        ACCAGATGGATTAGATGTA
Consensus
AAAAAAAAAAAAAAGATGTATACCACCAGTACACCGATACACCAGATGGATTAGATGTA
Part 3. Hands-On Design Exercise
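The step from the aligned reads to the consensus can be sketched as follows. This is a minimal illustration that assumes the reads are already in the left-to-right order of the "Aligned" panel and overlap exactly; it is not how a real assembler works, and the function names are invented for this example.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_in_order(reads):
    """Chain reads that are already sorted left to right into one sequence."""
    consensus = reads[0]
    for read in reads[1:]:
        # Append only the part of the read that extends past the overlap.
        consensus += read[overlap(consensus, read):]
    return consensus


# The five reads from the slide, in the order of the "Aligned" panel.
reads = [
    "AAAAAAAAAAAAAAGATGT",
    "AAAGATGTATACCACCAG",
    "CACCAGTACACCGATA",
    "TACACCGATACACCAGA",
    "ACCAGATGGATTAGATGTA",
]

print(merge_in_order(reads))
```

The hard part of assembly is finding the order in the first place; this sketch deliberately skips that, since ordering the reads is the point of the hands-on exercise.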
Sequence Alignment Rules
1. Maximize sequence overlap:

This overlap is BETTER…

AAAAAAAAAAAAAAGATGTATACCACCAGTACACCGATACACCAGATG
              GATGTATACCACCAGTACACCGATACACCAGATGGATTAGATGTAGGGG

…than this overlap:

AAAAAAAAAAAAAAGTATGTATACCACCAGTACACCGATACACCAGATG
                                             GATGTATACCACCAGTACAC

2. Align letters right-side-up, reading left to right (just like written English):

NOT a valid overlap:

CACCAGTACATTTTTAAAGGG
GTACACCGGGAAATTTTTA

Valid overlap:

CACCAGTACATTTTTAAAGGG
         ATTTTTAAAGGGCCACATG
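Both rules can be checked mechanically. The sketch below scores an overlap as the longest suffix of the left read that equals a prefix of the right read; the `min_len` cutoff of 3 is an assumption added here so that a chance match of one or two letters does not count, and it is not part of the exercise rules.

```python
def overlap(left, right, min_len=3):
    """Longest suffix of `left` equal to a prefix of `right`, or 0 if shorter than min_len."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


# Rule 1: the first pairing shares far more letters than the second.
better = overlap(
    "AAAAAAAAAAAAAAGATGTATACCACCAGTACACCGATACACCAGATG",
    "GATGTATACCACCAGTACACCGATACACCAGATGGATTAGATGTAGGGG",
)
worse = overlap(
    "AAAAAAAAAAAAAAGTATGTATACCACCAGTACACCGATACACCAGATG",
    "GATGTATACCACCAGTACAC",
)
print(better, worse)

# Rule 2: the same read only overlaps when it is read left to right.
left = "CACCAGTACATTTTTAAAGGG"
right = "ATTTTTAAAGGGCCACATG"
flipped = right[::-1]  # the read written backwards, as in the invalid example
print(overlap(left, right), overlap(left, flipped))
```

The backwards read is exactly the valid read with its letters reversed, so the invalid example fails only because of orientation, not because the sequences are unrelated.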
Sequence Alignments
Yellow set
AGCAGATC…AAAAAAAA
AAAAAAAA…AAAAAAAA
AAAAAAAA…TACTTACA
…TACTTACA…GGGGGGGG
GGGGGGGG…GGGGGGGG
GGGGGGGG…GACAGATA
Sequence Alignments
Blue set
GATAGA…AAAAAA
AAAAAA…CAGATG
…CAGATG…GGGGGG
GGGGGG…GGGGGG
GGGGGG…ATAGAC
…ATAGAC…AAAAAA
AAAAAA…GGACAT
AAAAAA…AAAAAA
Sequence Alignments
Both sets together (pretend you don’t know colour)
AGCAGA…AAAAAA
AAAAAA…CTTACA
…CTTACA…GGGGGG
AAAAAA…AAAAAA
GGGGGG…CAGATA
GATAGA…AAAAAA
GGGGGG…GGGGGG
AAAAAA…CAGATG
…CAGATG…GGGGGG
GGGGGG…ATAGAC
…ATAGAC…AAAAAA
AAAAAA…GGACAT
Ambiguous, could belong to multiple sequences:
AAAAAA…AAAAAA
GGGGGG…GGGGGG
Sequence Alignments
AGCAGA…AAAAAA
AAAAAA…CTTACA
…CTTACA…GGGGGG
GGGGGG…CAGATA
GATAGA…AAAAAA
AAAAAA…CAGATG
…CAGATG…GGGGGG
GGGGGG…ATAGAC
…ATAGAC…AAAAAA
AAAAAA…GGACAT
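One way to see where the remaining reads are still ambiguous is to link each read to every read that could follow it. The sketch below is an illustration only: it reduces each read to its visible start and end labels (the elided middle is ignored) and links reads whose end label equals another read's start label. The data structure and names are assumptions made for this example.

```python
from collections import defaultdict

# (start label, end label) for the ten reads listed above.
reads = [
    ("AGCAGA", "AAAAAA"),
    ("AAAAAA", "CTTACA"),
    ("CTTACA", "GGGGGG"),
    ("GGGGGG", "CAGATA"),
    ("GATAGA", "AAAAAA"),
    ("AAAAAA", "CAGATG"),
    ("CAGATG", "GGGGGG"),
    ("GGGGGG", "ATAGAC"),
    ("ATAGAC", "AAAAAA"),
    ("AAAAAA", "GGACAT"),
]


def successors(reads):
    """For each read index, list the reads whose start label matches its end label."""
    by_start = defaultdict(list)
    for i, (start, _end) in enumerate(reads):
        by_start[start].append(i)
    return {i: by_start.get(end, []) for i, (_start, end) in enumerate(reads)}


graph = successors(reads)

# A read with more than one possible successor is a branch point.
ambiguous = [i for i, nxt in graph.items() if len(nxt) > 1]
for i in ambiguous:
    print(reads[i], "->", [reads[j] for j in graph[i]])
```

Every read ending in the repeated AAAAAA or GGGGGG label branches, which is the kind of structure a graph-based representation makes visible.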
Choosing a representation
ABySS-Explorer
Nielsen et al. 2009
ABySS-Explorer: visualizing genome sequence assemblies.
IEEE Trans Vis Comput Graph
VisWeek Proceedings
(Best paper award)
[figure: (a) reference human genome; (b) inversion event in a human lymphoma genome; (c)]
Resources
The Cartoon Guide to Genetics
Larry Gonick and Mark Wheelis (1991)
The Processes of Life: An Introduction to Molecular Biology
Lawrence E. Hunter (2009)
Nature Methods special issue on Visualizing Biological Data (2010)
http://www.nature.com/nmeth/journal/v7/n3s
Bang Wong’s monthly Points of View column
http://bang.clearscience.info