Visual Analytics for Genomics
Cydney Nielsen
BC Cancer Agency
Vancouver, BC, Canada

Outline
Part 1: Introduction to Genomics
Part 2: Visual Design for Genomics
Part 3: Hands-On Design Exercise

Part 1: Introduction to Genomics

Genomics Workflow
genome: the complete genetic material of a cell

Sequencing Experiment
Base pairing: G - C, T - A

Genomics Workflow
sample -> experiment (sequencing technology) -> data -> analysis (visualization + computation) -> insight

Genomics Workflow: molecular biology
sample -> experiment (sequencing technology) -> data

Sequencing Experiment
Sequencing machine output: millions of short sequences ("reads"), e.g. 75 nt each, compared to >3 billion nt in the human genome.
Example reads:
TACACCGATACACCAGA
ACCAGATGGATTAGATGTA
AAAAAAAAAAAAAAGATGT
AAAGATGTATACCACCAG
CACCAGTACACCGATA

Sequencing Experiment
~$5,000 in 2001
~10¢ in 2011

Genomics Workflow: computational biology / bioinformatics, visual analytics
data -> analysis (visualization + computation) -> insight
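To put the read figures quoted above in perspective (75 nt reads against a genome of more than 3 billion nt), here is a rough back-of-the-envelope sketch. It assumes the usual definition of average coverage, (number of reads x read length) / genome length. The function name and the 30x depth are illustrative assumptions, not values from the slides.

```python
import math


def reads_for_coverage(genome_nt: int, read_nt: int, coverage: float) -> int:
    """Smallest read count N with N * read_nt / genome_nt >= coverage."""
    return math.ceil(coverage * genome_nt / read_nt)


GENOME_NT = 3_000_000_000  # ">3 billion nt" on the slide, rounded down
READ_NT = 75               # read length quoted on the slide

one_x = reads_for_coverage(GENOME_NT, READ_NT, 1.0)
thirty_x = reads_for_coverage(GENOME_NT, READ_NT, 30.0)  # 30x is an assumed, illustrative depth

print(f"reads needed for  1x coverage: {one_x:,}")
print(f"reads needed for 30x coverage: {thirty_x:,}")
```

Even a single pass over the genome already takes tens of millions of reads, which is why the analysis half of the workflow matters so much.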
Sequencing Experiments
De novo assembly: reads -> genome assembly
Re-sequencing: reads -> reference genome
Enrichment: reads -> reference genome
Example reads (shown in each panel):
AGCTTCAGATGGACAGATAA
GGCATACAGACTTAGACATA
CCAGACAAGACAGACACAGTA
TACAAGACATAAGCAATACAGA
CCAGACAAGACAGACACAGTA

Sequencing Experiments
• What sequence variations appear in cancer patients, but not in unaffected individuals?
• Are these variations predictive of survival outcome?
• Are these variations causal for the disease (driver mutations) or not?

Part 1 - Summary
1. Large and ever-increasing volume of sequencing data
2. Improved analysis techniques are essential for biologists and clinicians to make the most of these data
3. Great potential for visual analytics to facilitate insight and understanding

Part 2: Visual Design for Genomics

Challenge 1
Large number of samples for comparison
"To systematically characterize the genomic changes in hundreds of tumors… and thousands of samples over the next five years"
The Cancer Genome Atlas, www.cancergenome.nih.gov

Genome Browsers
Stacked data tracks along a common genome x-axis
[Figure axes: data samples (rows) vs. genome coordinate (x-axis)]
Genome Browsers
UCSC Cancer Genomics Heatmaps: Glioblastoma Copy Number Abnormality, Agilent 244A array (n=200)
Zhu et al., Nature Methods, 2009
[Figure axes: data samples (rows) vs. genome coordinate (x-axis); annotation columns: gender, tumor vs normal]

Challenge 1
Large number of samples for comparison
Critically consider what you need to display, e.g. replace primary data with a biologically meaningful summary, such as significant changes between samples.

Challenge 2
Genomic features are small and sparse

Genome Browsers: LOCAL VIEW
Human chr1: 1 pt corresponds to 480 kb, which is larger than 98% of all human genes.

Hilbert Curve: GLOBAL VIEW
Kharchenko et al., Nature, 2011
Anders, Bioinformatics, 2009
[Figure panel labels: Chromosome 3L; heterochromatin-like domain; PcG domains; open chromatin domain; cluster of small expressed genes; chromatin states 1-9]
Figure 2 | Visualization of spatial scales and organization using compact folding. a, The chromosome is folded using a geometric pattern (Hilbert space-filling curve) that maintains spatial proximity of nearby regions. An illustration of the first four folding steps is shown.
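The folding described in the caption above can be sketched with the standard index-to-coordinate conversion for a Hilbert curve. This is a minimal illustration, not necessarily the procedure used by the cited authors. The function names and the idea of binning a chromosome of length `chrom_len` into 4**order cells are assumptions made for the example.

```python
def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Map index d along a Hilbert curve to (x, y) on a 2**order x 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so consecutive sub-curves join up.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def position_to_cell(pos: int, chrom_len: int, order: int) -> tuple[int, int]:
    """Bin a 0-based genome position into one of 4**order cells along the curve.

    chrom_len is a hypothetical chromosome length supplied by the caller.
    """
    index = pos * (4 ** order) // chrom_len
    return hilbert_d2xy(order, index)


# Four folding steps, as in the caption: a 16 x 16 grid of 256 cells.
curve = [hilbert_d2xy(4, d) for d in range(4 ** 4)]
print(curve[:4])
```

Because consecutive indices always land in neighbouring cells, features that are close on the chromosome stay close in the picture, which is the property the global view relies on.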
Note that although this compact curve is optimal for preserving proximity relationships, some distal sites appear adjacent along the fold axis (green dots). b, Chromosome 3L in S2 cells. A domain of […]
[Figure panel label: Pericentromeric heterochromatin]

Challenge 2
Genomic features are small and sparse
Connect overview and detail

Challenge 3
Genomic features involve non-adjacent positions

Challenge 3: Structural rearrangements
Excerpt (Cydney Nielsen & Bang Wong): "[…]points’, span a wide range of distances and affect sequence segments of varying size. For example, tandem duplications may involve a localized repetition of only a few kilobases, whereas the breakpoints of translocations are located on nonhomologous chromosome arms and may result in the rearrangement of large genomic chunks. Finding a representation that enables one to track breakpoints across this scale can be challenging. This is exacerbated by the fact that variant genomic fragments can be […]"
Figure 1 | Representations of a translocation. (a,b) Linear (a) and circular (b) reference genome layouts with an arc to depict a translocation between two chromosomes (pink and blue). (c) Translocation illustrated as reference-sequence segments with chromosome colors corresponding to those in a. […]
[Figure panels a-e: segments J, Jʹ, K, Kʹ shown in reference and variant arrangements]
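As a small illustration of what a breakpoint is, the translocation in Figure 1 can be written down as ordered lists of segments and compared by adjacency rather than by base-pair coordinate. This sketch assumes a reciprocal translocation in which the reference chromosomes J-J' and K-K' become J-K' and K-J'. That arrangement is read off the garbled figure labels and is an assumption, as are all names in the code. Segment orientation is ignored for simplicity.

```python
def adjacencies(chromosomes):
    """Return the set of unordered neighbouring-segment pairs over all chromosomes."""
    pairs = set()
    for segments in chromosomes:
        for left, right in zip(segments, segments[1:]):
            pairs.add(frozenset((left, right)))
    return pairs


# Assumed layout: two reference chromosomes, each split at one breakpoint.
reference = [["J", "J'"], ["K", "K'"]]
# Assumed variant after a reciprocal translocation between them.
variant = [["J", "K'"], ["K", "J'"]]

novel = adjacencies(variant) - adjacencies(reference)  # joins created by the rearrangement
lost = adjacencies(reference) - adjacencies(variant)   # reference joins that were broken

print("novel adjacencies:", sorted(sorted(pair) for pair in novel))
print("lost adjacencies:", sorted(sorted(pair) for pair in lost))
```

Nothing in this comparison depends on how many base pairs separate the segments, which is the point of asking whether position should encode something other than genomic distance.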
Challenge 3: Structural rearrangements
Circos, Martin Krzywinski

Challenge 3: Structural rearrangements
VISTA-Dot
Supplementary Figure 1: Global dot-plot of Sorghum bicolor and Or[…] assemblies displayed using VISTA-Dot.

Challenge 3
All these representations use a genomic coordinate system, which emphasizes base-pair distance between points.
Is this the best use of positional information?
Challenge 3
Excerpt (M. Krzywinski): Match the encoding method to the data
Data can be encoded using visual properties such as position (scatter plot), length (bar plot), angle […] (heat map). Figure 35 (adapted from Figure 15 in [7]) ranks different encodings according to their […] representing quantitative (numbers), ordinal (categories with implied order, such as "best", "better" […]) and [nominal] (categories without an order, such as brands of cars) variables.
[Figure: encoding-schemes.eps]
Figure 35. Many encoding schemes exist and should be selected based on the type of encoded variable.
M. Krzywinski, adapted from Mackinlay J (1986) ACM Trans Graph 5: 110-141.
In any encoding, simpler visual forms are preferable. Consider the Venn diagram in Figure 36A (ad[…] [59]). The Venn diagram demonstrates a nested data set: all values in Z are in Y and all values in Y […] need to show 4 of the 7 intersections. These data are shown better as a set of concentric circles (Fig[…]
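The idea of selecting an encoding by variable type can be sketched as a simple lookup. The table below lists only the leading entries of the ranking attributed to Mackinlay (1986), as it is commonly reproduced. The table is abbreviated, and the function name, variable names and the one-channel-per-variable rule are all assumptions made for this illustration.

```python
# Abbreviated ranking, most effective first (assumed from the commonly
# reproduced form of Mackinlay's ranking; not copied from the slide).
RANKING = {
    "quantitative": ["position", "length", "angle", "slope", "area"],
    "ordinal": ["position", "density", "color saturation", "color hue"],
    "nominal": ["position", "color hue", "texture"],
}


def assign_channels(variables):
    """Give each (name, kind) variable its best still-unused visual channel.

    Variables are processed in the order given, so list the most important first.
    """
    used = set()
    assignment = {}
    for name, kind in variables:
        for channel in RANKING[kind]:
            if channel not in used:
                used.add(channel)
                assignment[name] = channel
                break
        else:
            raise ValueError(f"no free channel left for {name!r}")
    return assignment


# Hypothetical variables for a rearrangement plot.
plan = assign_channels([
    ("rearrangement partner", "nominal"),
    ("copy number", "quantitative"),
    ("chromosome", "nominal"),
])
print(plan)
```

Position heads every list, so whichever variable is named first claims it. That is the design decision the next takeaway asks you to make deliberately.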
Challenge 3
Genomic features involve non-adjacent positions
Encode important information in position

Challenge 4
Large number of data types

Genomic rearrangement in cancer
Stephens et al., Cell, 2011
[Figure: panel A, SNU-C1 (colorectal), Chr 15; panel B, 8505C (thyroid), Chr 9. Tracks: copy number, allelic ratio. Rearrangement types: deletion-type, tandem dup-type, tail-to-tail inverted, head-to-head inverted (non-inverted vs. inverted orientation). x-axis: genomic location (Mb)]

17 mouse genomes
Keane et al., Nature, 2011
[Figure: panel a, circular plot of chromosomes 1-19 and X with tracks for SNPs, SVs, TEs and uncallable regions, for strains including CAST/EiJ, WSB/EiJ, PWK/PhJ and SPRET/EiJ; panel b, counts of SNPs, indels, SV deletions and TE insertions]

Challenge 4
Large number of data types
Exploit domain-specific details in your design

Challenge 5
No longer one genome but many

Single nucleotide variation
Ossowski et al., Genome Research, 2008

Single nucleotide variation
Integrative Genomics Viewer (IGV)
Robinson et al., Nature Biotechnology, 2011

Challenge 5
No longer one genome but many
Be open to change (genomics is evolving quickly)

Part 2 - Summary
1. Critically consider what you need to display
2. Connect overview and detail
3. Encode important information in position
4. Exploit domain-specific details in your design
5. Be open to change (genomics is evolving quickly)

Part 3: Hands-On Design Exercise

Genome Assembly
Input:
TACACCGATACACCAGA
ACCAGATGGATTAGATGTA
AAAAAAAAAAAAAAGATGT
AAAGATGTATACCACCAG
CACCAGTACACCGATA

Aligned (each read overlaps the next):
AAAAAAAAAAAAAAGATGT
AAAGATGTATACCACCAG
CACCAGTACACCGATA
TACACCGATACACCAGA
ACCAGATGGATTAGATGTA

Consensus:
AAAAAAAAAAAAAAGATGTATACCACCAGTACACCGATACACCAGATGGATTAGATGTA
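The input, aligned and consensus steps above can be reproduced with a small greedy sketch that repeatedly merges the two sequences sharing the longest suffix-to-prefix overlap. This is a toy illustration for the five reads on the slide, not how a production assembler works. The function names are made up for the example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str]) -> str:
    """Merge reads pairwise, always taking the largest remaining overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            raise ValueError("remaining sequences do not overlap")
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs[0]


# The five input reads from the slide, in the order shown.
READS = [
    "TACACCGATACACCAGA",
    "ACCAGATGGATTAGATGTA",
    "AAAAAAAAAAAAAAGATGT",
    "AAAGATGTATACCACCAG",
    "CACCAGTACACCGATA",
]

consensus = greedy_assemble(READS)
print(consensus)
```

Greedy merging happens to work here because every read has one clearly best partner. Once repeats are involved it can choose the wrong join.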
Sequence Alignment Rules

1. Maximize sequence overlap:

This overlap is BETTER…
AAAAAAAAAAAAAAGATGTATACCACCAGTACACCGATACACCAGATG
              GATGTATACCACCAGTACACCGATACACCAGATGGATTAGATGTAGGGG

…than this overlap:
AAAAAAAAAAAAAAGTATGTATACCACCAGTACACCGATACACCAGATG
                                             GATGTATACCACCAGTACAC

2. Align letters right-side-up, reading left to right (just like written English):

NOT a valid overlap (the second read matches only when read right to left):
CACCAGTACATTTTTAAAGGG
GTACACCGGGAAATTTTTA

Valid overlap:
CACCAGTACATTTTTAAAGGG
         ATTTTTAAAGGGCCACATG

Sequence Alignments: Yellow set
AGCAGATC…AAAAAAAA
AAAAAAAA…AAAAAAAA
AAAAAAAA…TACTTACA
…TACTTACA…GGGGGGGG
GGGGGGGG…GGGGGGGG
GGGGGGGG…GACAGATA

Sequence Alignments: Blue set
GATAGA…AAAAAA
AAAAAA…CAGATG
…CAGATG…GGGGGG
GGGGGG…GGGGGG
GGGGGG…ATAGAC
…ATAGAC…AAAAAA
AAAAAA…GGACAT
AAAAAA…AAAAAA

Sequence Alignments: Both sets together (pretend you don't know colour)
Annotation on slide: Ambiguous, could belong to multiple sequences
AGCAGA…AAAAAA
AAAAAA…CTTACA
…CTTACA…GGGGGG
AAAAAA…AAAAAA
GGGGGG…CAGATA
GATAGA…AAAAAA
GGGGGG…GGGGGG
AAAAAA…CAGATG
…CAGATG…GGGGGG
GGGGGG…ATAGAC
…ATAGAC…AAAAAA
AAAAAA…GGACAT

Sequence Alignments
AGCAGA…AAAAAA
AAAAAA…CTTACA
…CTTACA…GGGGGG
GGGGGG…CAGATA
GATAGA…AAAAAA
AAAAAA…CAGATG
…CAGATG…GGGGGG
GGGGGG…ATAGAC
…ATAGAC…AAAAAA
AAAAAA…GGACAT
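One way to see why the combined set is hard to assemble is to treat each read's two end labels as nodes and the read itself as an edge from its start label to its end label. The sketch below uses the ten reads of the final list above and keeps only the visible end labels, since the middle of each read is elided on the slide. It assumes the leading "…" on some reads is a rendering artifact. All names are illustrative.

```python
from collections import defaultdict

# (start label, end label) for each read in the final list; the middles are elided.
READ_ENDS = [
    ("AGCAGA", "AAAAAA"),
    ("AAAAAA", "CTTACA"),
    ("CTTACA", "GGGGGG"),
    ("GGGGGG", "CAGATA"),
    ("GATAGA", "AAAAAA"),
    ("AAAAAA", "CAGATG"),
    ("CAGATG", "GGGGGG"),
    ("GGGGGG", "ATAGAC"),
    ("ATAGAC", "AAAAAA"),
    ("AAAAAA", "GGACAT"),
]


def degree_tables(read_ends):
    """Count reads leaving (out) and entering (in) each end label."""
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for start, end in read_ends:
        out_degree[start] += 1
        in_degree[end] += 1
    return out_degree, in_degree


def branching_labels(read_ends):
    """Labels where more than one read arrives or departs, i.e. ambiguous joins."""
    out_degree, in_degree = degree_tables(read_ends)
    labels = set(out_degree) | set(in_degree)
    return sorted(
        label for label in labels
        if out_degree.get(label, 0) > 1 or in_degree.get(label, 0) > 1
    )


out_degree, in_degree = degree_tables(READ_ENDS)
for label in branching_labels(READ_ENDS):
    print(label, "in:", in_degree.get(label, 0), "out:", out_degree.get(label, 0))
```

Only the two repeat labels branch. A read ending in AAAAAA has three possible successors and a read ending in GGGGGG has two, so overlap alone cannot separate the yellow path from the blue one.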
Choosing a representation

ABySS-Explorer
[Figure: (a) reference human genome; (b) inversion event in a human lymphoma genome; (c) […]]
Nielsen et al. 2009. ABySS-Explorer: visualizing genome sequence assemblies. IEEE Trans Vis Comput Graph, VisWeek Proceedings (Best paper award).

Resources
The Cartoon Guide to Genetics. Larry Gonick and Mark Wheelis (1991)
The Processes of Life: An Introduction to Molecular Biology. Lawrence E. Hunter (2009)
Nature Methods special issue on Visualizing Biological Data (2010): http://www.nature.com/nmeth/journal/v7/n3s
Bang Wong's monthly Points of View column: http://bang.clearscience.info