Download 20 IMPERATIVES OF INFORMATION DESIGN Martin Krzywinski

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Public health genomics wikipedia , lookup

Metagenomics wikipedia , lookup

Genome (book) wikipedia , lookup

Gene expression programming wikipedia , lookup

Pathogenomics wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Helitron (biology) wikipedia , lookup

Microevolution wikipedia , lookup

Gene expression profiling wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Designer baby wikipedia , lookup

RNA-Seq wikipedia , lookup

Genome evolution wikipedia , lookup

Transcript
Getting Into Visualization of Large Biological Data Sets
BC Cancer Agency
2 0 I M P E R AT I V E S O F I N F O R M AT I O N D E S I G N
Martin Krzywinski, Inanc Birol, Steven Jones, Marco Marra
CARE
&
RESEARCH
An agency of the Provincial Health Services Authority
ENSURE LEGIBILITY AND FOCUS ON THE MESSAGE
USE EFFECTIVE VISUAL ENCODINGS TO ORGANIZE INFORMATION.
USE EFFECTIVE DESIGN PRINCIPLES TO EMPHASIZE AND COMMUNICATE PATTERNS.
Create legible visualizations with a strong message. Make elements
large enough to be resolved comfortably. Bin dense data to avoid
sacrificing clarity.
Match the visual encoding to the hypothesis. Use encodings specific
and sensitive to important patterns. Dense annotations should be
independent of the core data in distinct visual layers.
Well-designed figures illustrate complex concepts and patterns that
may be difficult to express concisely in words. Figures that are clear,
concise and attractive are effective – they form a strong connection
with the reader and communicate with immediacy. These qualities
can be achieved with methods of graphic design, which are based on
theories of how we perceive, interpret and organize visual information.
Distinguish between exploration and communication.
0.50
width
samples pt kb
2 6.8
225
RESOLVING COLOR DIFFERENCES
900
0.5 1.7
1,800 0.25 0.85
S. cerevisiae chrIV 1,531 kb · scale length 6.22 in, 158 mm, 448 pt · scale resolution 246 kb/in, 9.6 kb/mm, 3.4 kb/pt
Rendering human chromosome 1 (249 Mb) with 450 divisions provides a resolution of 550 kb, which is
larger than 98% of all human genes. To depict 50% of genes without magnification we would need to use
15 kb bins, which would have to be 1/16,600 of the figure (0.0004 in, 10 microns)!
50
RAD54L
G>A
rs121908690
CANCER
CENSUS
100
BCL9
ARNT
TPM3
RIT1
PRCC
NTRK1
PBX1
PRRX1
ABL2
LHX4
TPR
PRG4
CDC73
CSF1
RBM15
CSDE1
BCL10
LCK
SFPQ
MYCL1
MUTYH
TAL1
PAX7
EPHB2
Approaches to encoding min/avg/max values of downsampled data. In the top hi-low trace, the vertical
bars are perceived as a separate layer and effectively show variance without obscuring trends in the
average.
chr 1
barely
comfortable
generous
Visual acuity limits the perception of color and spatial differences. For
standard reading distance, we can comfortably resolve objects separated by 1
pt. For larger viewing distances (e.g. this poster), this limit is proportionately
higher. The standard unit of distance in print is 1 pt = 1/72 inch.
150
Mb
200
RNASEL
C>T
rs74315365
SNP
OMIM
size (kb) <10
10-30 30-50 50-100 100-200 >200
When drawing the position and size of densely packed genes, encode the gene’s size using a non-linear
mapping. When the number of data values is large, such as in the OMIM gene track, hollow glyphs are
effective. For even greater number of points, a density map is preferred.
Aggregate data for focused theme.
A strong visual message has no
uncertainty in its interpretation.
Focus on a single theme by
aggregating unnecessary detail.
Show density maps and outliers.
Establishing context is helpful
when emergent patterns in the data
provide a useful perspective on the
message. When data sets are large,
it is difficult to maintain detail in
the context layer because the density
of points can visually overwhelm
the area of interest. In this case,
consider showing only the outliers in
the data set.
A
Data on large genomes must be
downsampled. Depict variation with
min/max plots and consider hiding it when
it is within noise levels. Help the reader
notice significant outliers.
Do not draw small elements to scale.
Map size of elements onto clearly legible
symbols. Legibility and clarity are more
important than precise positioning and
sizing. Discretize sizes and positions to
facilitate making meaningful comparisons.
B
0-30
31-60
61-100
D
54
30
30
60
What is communicated? (A) The raw data imparts no clear message.(B). Binning indicates ranges, not individual values,
are important. (C). Frequency distribution suggests that there is a shortage of medium-sized values. (D) Individual data
points can be removed to emphasize trend and significance.
Consider whether showing the full data set is useful.
The reader’s attention can be focused by increasing the salience
of interesting patterns. Other complex data sets, such as
networks, are shown more effectively when context is carefully
edited or even removed.
NO MESSAGE
MESSAGE IN CONTEXT
G2
G3
G1
S1
S1
ONLY OUTLIERS SHOWN
B
S4
S2
F
G2 G3
A
C
B
C
D
G4
MESSAGE IN ISOLATION
G1
A
ALL DATA SHOWN
*
60
S2
G8
G4
E
S3
E
Y
11
69
Y
29
Z
11
29
29
X 69 11
Y 40
Z 11
29
29
X 69
Y
40
11
Z
X
29
40
11
Reduce unnecessary variation.
D
X
40
Z
11
69
X
Y
Z
Use perceptual palettes.
Selecting perceptually favorable colors is difficult because most software does not support the
required color spaces. Brewer palettes [BRE11]
exist for the full range of colors to help us make
useful choices. Qualitative palettes have no perceived order of importance. Sequential palettes
are suitable for heat maps because they have a
natural order and the perceived difference between adjacent colors is constant. Twin hue diverging palettes, are useful for two-sided quantitative encodings, such as immunofluorescence
and copy number.
SEQUENTIAL
DIVERGING
set1
blues
spectral
set2
greens
rdylbu
pastel2
reds
rdylgn
dark2
ylorbr
piyg
Crop scale to reveal fine structure in data.
Biological data sets are typically high-resolution
(changes at base pair level can meaningful),
sparse (distances between changes are orders of
magnitude greater than the affected areas) and
connect distant regions by adjacency relationships (gene fusions and other rearrangements).
It is difficult to take these properties into account on a fixed linear scale, the kind used by
traditional genome browsers. To mitigate this,
crop and order axis segments arbitrarily and
apply a scale adjustment to a segment or portion thereof.
INS 1
DISEASE
Hue does not communicate relative change in values
because we perceive hue categorically (blue, green,
yellow, etc). Changes within one category have less perceptual impact than transitions between categories. For
example, variations across the green/yellow boundary
are perceived to be larger than variations across the
same sized hue interval in other parts of the spectrum.
uniform HSV hue
G3
GENES IN
COMPLEX
B
CONNECTING
GENE
DAF 2
200% 150%
AAP 1
AAP 1
DAF 8
DAF 14
DAF 9
Use encodings that are robust and comparable.
In addition to being transparent and predictable, visualizations must be robust with respect to the data.
Changes in the data set should be reflected by proportionate changes in the visualization. Be wary of forcedirected network layouts, which have low spatial autocorrelation. In general, these are neither sensitive nor
specific to patterns of interest.
ENTIRE NETWORK
manual
layout
NODE N1 REMOVED
In an audience of 8 men and 8 women,
chances are 50% that at least one has some
degree of color blindness. Use a palette that
is color-blind safe. In the palette below the
15 colors appear as 5-color tone progressions
to those with color blindness. Additional
encodings can be achieved with symbols or
line thickness.
8
14
12
3
5
[STE07]
DAF 16
Respect natural hierarchies.
When the data set embodies a natural hierarchy, use
an encoding that emphasizes it clearly and memorably.
The use hierarchy in layout (e.g. tabular form) and
encoding can significantly improve a muddled figure.
NO HIERARCHY
HIERARCHICAL
LAYOUT AND ENCODING
gene
gene
in situ
non-specific in literature
Be aware of color blindness.
4
DAF 3 DAF 5
DAF 12
specific in literature
[GRE10] [WON10]
1
9
16
non-selective in situ result
1
luminance
AKT 1
AKT 2
16
AKT 1
AKT 2
selective in situ result
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
2
AGE 1
specific in situ result
0
7
AGE 1
all genes
in situ
in lit
specific
specific
selective
selective
non-selective
non-selective
non-specific
non-specific
in lit
Adapted from [NEL03].
Use consistent alignment. Center on theme.
Establish equivalence using consistent alignment.
Awkward callouts can be avoided if elements are
logically placed.
15-COLOR PALETTE FOR COLOR BLINDNESS
NORMAL VISION
DEUTERANOPIA
common (6%)
TRITANOPIA
very rare (<1%)
PROTANOPIA
rare (2%)
N1
spring
embedded
1
6
11
2
7
12
3
8
13
4
9
14
5
10
15
RGB
N1
RSPO3
REFERENCE
6q22.3
1
1 0 0
2 0 73
3 0 146
4 255 109
5 255 182
0
73
146
182
119
6
7
8
9
10
73
0
182
109
182
0
109
109
182
219
146
219
255
255
255
11
12
13
14
15
146
146
219
36
255
0
73
209
255
255
0
0
0
36
109
2
PTPRK
802kb
3
3
1 2
2
1
3
TCTCCTGGGATCGGCCCAAGGCCAGTTCTCCGCAG TGCATCCTAACGTTAGTCAAGGCTGCCAAGGAGGCTGTGCAACATGCTCA
SAMPLE
hive plot
NORMAL VISION
red-green palette
F
N1
magenta-green palette
GENE COMPLEX
CONNECTING GENE
Circos image of patterns in densely distributed breakpoint positions between
chromosomes 1, 2, 9 and 17 are emphasized when scale is zoomed. (A) full genome (B)
chrs 1, 2, 9, 17 (C) 31 regions on chrs 1, 2, 9, 17 totaling 223 kb [KRZ09].
DAF 4
28
LAYOUT
G7
SYSTEM GENE COMPLEX
S3
11
DAF 16
Color is a useful encoding – the eye can
distinguish about 450 levels of gray, 150 hues,
and 10-60 levels of saturation, depending
on the color – but our ability to perceive
differences varies with context.
Adjacent tones with different luminance
values can interfere with discrimination, in
interaction known as the luminance effect.
G8
GENE
N
DAF 11
Be aware of the luminance effect.
S4
G6
DAF 7
DAF N
You can count the objects on the left almost instantly, without conscious effort.
This is called pre-attentive cognition. This example uses the concept of proximity
to partition the dots into pre-attentive groupings. Elements can be effectively
organized by reducing spatial variation, here achieved by symmetrical layout and
two levels of spacing,
D
G5
DAF 28
DAF 1
Never use hue to encode magnitude.
[HEE10]
FULL ENCAPSULATION
INCONSISTENT AND INCOMPLETE ENCAPSULATION
11
29
29
Including details not relevant to the core message of
the figure can create confusion. Encapsulation should
be done to the same level of detail and to the simplest
visual form. Duplication in labels should be avoided
unless required to preserve semantic forms.
INS 1
11
69
40
11
The reader does not know what is important
in a figure and will assume that any spatial
or color variation is meaningful. The figure’s
variation should come solely from data or act
to organize information.
Encapsulate details.
HOW MANY DOTS?
F
2
G5 G6 G7
S3
Z
Y
69
QUALITATIVE
82
79
67
61
C
40
Y
0
0
1.5
Show variation with statistics.
29
25
23
22
12
0
X
X
Y
Z
2.5
C
11
E
Ensure data elements are at least 1 pt on a
two-column Nature figure (6.22 in), 4 pixels
on a 1920 horizontal resolution display, or
2 pixels on a typical LCD projector. These
restrictions become challenges for large
genomes.
12 54 82 29 25 22 67 61 23 79
Z
29
69
40
0.75 2.6
no
0
Accuracy and speed in detecting differences
in visual forms depends on how information
is presented. We judge relative lengths more
accurately than areas, particularly when
elements are aligned and adjacent. Our
judgment of area is poor because we use length
as a proxy, which causes us to systematically
underestimate.
log2(error)
3.4
600
resolvable?
SYSTEM
1
B
X
Help the reader judge accurately.
Use no more than ~500 scale intervals.
1.5 5.1
A
Choose concise encodings over elaborate ones.
RESOLVING DATA TRACKS
Our acuity is ~50 cycles/degree or about 1/200 (0.3 pt) at
10 inches. Ensure the reader can comfortably see detail by
limiting resolution to no more than 50% of acuity. Where
possible, elements that require visual separation should be
at least 1 pt part.
450
Use the simplest encoding.
2
29
Do not exceed resolution of visual acuity.
300
1.5
0.25
0.50
0.75
1
1.5
2
line separation (pt)
Use exploratory tools (e.g. genome browsers) to discover
patterns and validate hypotheses. Avoid using screenshots
from these applications for communication – they are
typically too complex and cluttered with navigational
elements to be an effective static figure.
0.25
RESOLVING DETAIL
line thickness (pt)
0.75
1
Hive Plots are a robust and quantitative visual form of networks based on meaningful properties
chosen by the user. Unlike in a conventional layout, the removal of a node from a network can be
easily spotted in a hive plot [KRZ11].
DEUTERANOPIA
PROTANOPIA
Some Brewer
palettes,
such as the
pink-yellow-green
diverging palette,
are color-blind safe.
Use them for images
where red/green
discrimination
is critical for
comprehension.
Complex information can be organized by consistent alignment. Notice how placing the gene fusion
product in the center emphasizes the subject of the figure. Explaining the concept of gene fusion
does not require the internal structure of the two genes, whose size and exon count/distribution
should be removed to improve clarity. Original figure from [SES12].
[BRE11] C Brewer (2011) Color Brewer. http://www.colorbrewer.org
[GRE10] HE Grecco et al. (2010) In situ analysis of tyrosine phosphorylation networks by FLIM on cell arrays. Nat Methods 7:
467-472.
[HEE10] J Heer et al. (2010) Crowdsourcing graphical perception: using mechanical turk to assess visualization design. Proceedings
of the 28th international conference on Human factors in computing systems. Atlanta, Georgia, USA: ACM. pp. 203-212.
[KRZ09] M Krzywinski et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res, vol. 19, pp. 1639-45
[KRZ11] M Krzywinski et al. (2011) Hive plots—rational approach to visualizing networks, Brief Bioinform, Dec 9 2011.
[NEL03] S Nelander et al. (2003) Prediction of cell type-specific gene modules: identification and initial characterization of a core
set of smooth muscle-specific genes. Genome Res 13: 1838-1854.
[SES12] S Seshagiri et al. (2012) Recurrent R-spondin fusions in colon cancer. Nature 488: 660-664.
[STE07] SE Von Stetina et al. (2007) Cell-specific microarray profiling experiments reveal a comprehensive picture of gene expression
in the C. elegans nervous system. Genome Biol 8: R135.
[WON10] B Wong (2010) Points of view: Color coding. Nat Methods 7: 573.
Content of this poster is
drawn from my book chapter
Visualization Principles for
Scientific Communication
(Martin Krzywinski & Jonathan
Corum) in the upcoming open
access Cambridge Press book
Visualizing biological data
- a practical guide (Seán I.
O'Donoghue, James B. Procter,
Kate Patterson, eds.), a
survey of best practices and
unsolved problems in biological
visualization. This book project
was conceptualized and
initiated at the Vizbi 2011
conference.
If you are interested in
guidelines for data encoding
and visualization in biology, see
the Points of View column by
Bang Wong in Nature Methods.
v2 16 oct 2012
download poster