2-D and 3-D Coordinates For
M-Mers And
Dynamic Graphics For
Representing Associated Statistics
By
Daniel B. Carr
George Mason University
Overview
• Background
• Encoding and self-similar coordinates
• Examples
• Rendering software – GLISTEN
• Closing remarks
Background
• Task
– Visualize statistics indexed by a sequence of letters
• Letter-Indexing
– Nucleotides: AAGTAC
– Amino Acids: KTLPLCVTL
– Terminology: blocks of m letters called m-mers
• Statistics: counts or likelihoods for
– Short DNA sequence motifs for transcription factor
binding: gene regulation
– Peptide docking on immune system molecules
Graphical Design Goals
• Provide an overview and selective focus
• Use geometric structures to
– Organize statistics
– Reveal patterns
– Provide cognitive accessibility
• Incorporate scientific knowledge in layout
choices
– Enhance patterns and simplify comparisons
Common Practice - Tables
• Published tables – a linear list
– Sorted by values of a statistic
– Indexing letter sequences shown as row labels
– Only a few items shown out of thousands to millions
Common Practice - Graphics
• 1-D histograms – some examples
– Nucleotides: Distribution of promoters by
distance upstream from the start codon
– Amino acids:
• Sequence alignment logo plots are one variant
• Docking counts by position
• Cell-colored matrices?
– More commonly used for microarray data and
correlation matrices
[Figure: HLA-A2 Molecule, Peptide Docking Counts By Amino Acid Given Position. One panel per position (Pos 1 to Pos 9), showing counts for each amino acid A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y.]
Graphical Encoding Ideas:
Use Points For M-Mers
• Represent m-mers using coordinates
– A point stands for an m-mer
– A glyph at the point represents statistics for that
m-mer, for example through point color, size, or shape
• Challenge
– The domain of all letter sequences is
exponential in sequence length
– Display space is limited
Self-Similar Coordinates
• Self-similarity helps us keep oriented
– Parallel coordinate plots are increasingly familiar
• Coordinates from 3-D geometry
– 4 Nucleotides => tetrahedron
– 20 Amino acids
• Icosahedron face centers
• Familiar coordinates => hemisphere
• Two kinds of self-similarity
– At different scales => fractals
– At the same scale => shells, surfaces
Self-Similarity At Different Scales:
Nucleotide Example
• Represent each 6-mer as a 3-D point
– 4^6 = 4096 points (4 nucleotides, 6 positions)
• Attractor: tetrahedron vertices
– A=(1,1,1), C=(1,-1,-1), G=(-1,1,-1), T=(-1,-1,1)
• Computation:
– Hexamer position weights: 2^(5,4,3,2,1,0)/63 = (32, 16, 8, 4, 2, 1)/63
– ACGTTC -> (.555, .270, .206)
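The computation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the GLISTEN code: the function name `mmer_to_point` is hypothetical, and the weights are generalized to any length m as 2^(m-1-i)/(2^m - 1), which reduces to the slide's hexamer weights.

```python
# Tetrahedron vertices from the slide, one per nucleotide.
VERTICES = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def mmer_to_point(mmer):
    """Map an m-mer to a 3-D point as a weighted sum of vertex coordinates.

    Position i (0-based, left to right) gets weight 2^(m-1-i) / (2^m - 1),
    so the first letter dominates and the weights sum to 1.
    """
    m = len(mmer)
    total = 2 ** m - 1
    point = [0.0, 0.0, 0.0]
    for i, letter in enumerate(mmer):
        weight = 2 ** (m - 1 - i) / total
        for axis in range(3):
            point[axis] += weight * VERTICES[letter][axis]
    return tuple(point)


# ACGTTC gives (35/63, 17/63, 13/63), the slide's values up to rounding.
print(mmer_to_point("ACGTTC"))
```

Because the first letter carries the largest weight, all hexamers sharing a prefix land in the same sub-tetrahedron, which is the self-similarity at different scales.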
Application:
Gene Regulation Studies
• Cluster genes based on
– Gene expression levels in different situations
– Other criteria such as gene family
• For each cluster look in gene regulation regions
for recurrent nucleotide patterns
– Over-expressed m-mers: potential transcription factor
docking sites
• Show frequencies (or multinomial likelihoods)
Nucleotides Example
Yeast Gene Regulation
29 Genes in a cluster
– YBL072c
– YDL130w
– YDR025w
– …
– YCL054w
Sliding hexamer window
300 letters upstream from
open reading frames
– 300 ATATGA
– 299 TATGAG
– 298 ATGAGT
– 297 TGAGTA
Statistics
• Number of genes with hexamer
– TTTTTC 22
– GAAAAA 21
– TTTTTT 19
– AAAAAT 19
– TTTTCA 18
– ATTTTT 17
• Total number of appearances, etc.
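The sliding window and the two statistics named on this slide could be computed roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the toy sequences are invented for illustration and are not real yeast upstream regions.

```python
from collections import Counter


def sliding_windows(sequence, m=6):
    """Yield every length-m window, advancing one letter at a time."""
    for start in range(len(sequence) - m + 1):
        yield sequence[start:start + m]


def genes_with_mmer(upstream_by_gene, m=6):
    """For each m-mer, count the genes whose upstream region contains it."""
    counts = Counter()
    for sequence in upstream_by_gene.values():
        # A set makes each gene count at most once per m-mer.
        counts.update(set(sliding_windows(sequence, m)))
    return counts


def total_appearances(upstream_by_gene, m=6):
    """For each m-mer, count every occurrence across all genes."""
    counts = Counter()
    for sequence in upstream_by_gene.values():
        counts.update(sliding_windows(sequence, m))
    return counts


# Hypothetical toy data; real input would be 300 letters per gene.
toy = {"gene1": "ATATGAGTA", "gene2": "TTATGAGT", "gene3": "ATATGATATGA"}
print(genes_with_mmer(toy).most_common(3))
print(total_appearances(toy)["ATATGA"])
```

The first toy sequence reproduces the window order shown on the slide: ATATGA, TATGAG, ATGAGT, TGAGTA.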
Extensions
• 2-D version (projected gasket)
– 10-mers => 1024 x 1024 pixel display (4^10 = 1024 x 1024)
• Wild card and dimer counts
– TACC……GGAA
• Include more scientific knowledge
– Special representations for known transcription factors
• More interactivity
– Filtering for regions upstream
– Mouseovers, etc.
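One way to read the 2-D version is to drop the z coordinate of the tetrahedron vertices A=(1,1,1), C=(1,-1,-1), G=(-1,1,-1), T=(-1,-1,1), which leaves the four corners of a square. Whether this is the exact projection used in the talk is an assumption. Under it, each letter contributes one binary digit per axis, so a 10-mer maps to exactly one cell of a 1024 x 1024 grid. The name `mmer_to_pixel` is hypothetical.

```python
from itertools import product

# x and y parts of the tetrahedron vertices; z is dropped (assumed projection).
XY = {"A": (1, 1), "C": (1, -1), "G": (-1, 1), "T": (-1, -1)}


def mmer_to_pixel(mmer):
    """Return (column, row) in a 2^m x 2^m grid for an m-mer.

    Equivalent to rescaling the weighted sum with weights 2^(m-1-i)/(2^m - 1)
    from the range [-1, 1] to the integers 0 .. 2^m - 1.
    """
    col = row = 0
    for letter in mmer:
        sx, sy = XY[letter]
        # First letter is the most significant binary digit on each axis.
        col = 2 * col + (sx + 1) // 2
        row = 2 * row + (sy + 1) // 2
    return col, row


print(mmer_to_pixel("A" * 10))  # corner cell of the 1024 x 1024 grid
# Every 5-mer gets its own cell in a 32 x 32 grid (4^5 = 32 * 32).
pixels = {mmer_to_pixel("".join(p)) for p in product("ACGT", repeat=5)}
print(len(pixels))
```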
Self-Similarity At Different Scales:
Amino Acids Sequence Coordinates
• Represent each 3-mer as a 3-D point
– 20^3 = 8000 points (20 amino acids, 3 positions)
• Attractor: icosahedron face centers
– Let x1= .539, x2=.873, x3=1.412
– A=(x1,x3,0), C=(0,x1,x3), … Y=(-x3,0,-x1)
• Computation
– Position weights: 3.8^(2,1,0) scaled to sum to 1
– Letters HIT => (-1.26, -1.08, .180)
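A rough sketch of the ingredients for the amino acid coordinates follows. The slide assigns letters to only three face centers (A, C, Y), so the code generates the 20 points without a full letter map. Using x2 for the eight cube-corner points is an inference from the geometry of icosahedron face centers, not something the slide states, and all function names are hypothetical.

```python
from itertools import product

X1, X2, X3 = 0.539, 0.873, 1.412  # values from the slide


def face_centers():
    """Return the 20 icosahedron face centers built from x1, x2, x3."""
    # Eight cube corners (+-x2, +-x2, +-x2): assumed role of x2.
    points = [(sx * X2, sy * X2, sz * X2)
              for sx, sy, sz in product((1, -1), repeat=3)]
    # Twelve points from cyclic shifts of (+-x1, +-x3, 0).
    for s1, s3 in product((1, -1), repeat=2):
        a, b = s1 * X1, s3 * X3
        points += [(a, b, 0.0), (0.0, a, b), (b, 0.0, a)]
    return points


def position_weights(m, base=3.8):
    """Weights base^(m-1), ..., base^0 scaled to sum to 1."""
    raw = [base ** (m - 1 - i) for i in range(m)]
    total = sum(raw)
    return [r / total for r in raw]


def combine(points, base=3.8):
    """Weighted sum of one face center per letter, first letter heaviest."""
    weights = position_weights(len(points), base)
    return tuple(sum(w * p[k] for w, p in zip(weights, points))
                 for k in range(3))


print(len(face_centers()))
print(position_weights(3))
```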
Graphical Encoding Ideas:
Paths
• Use paths connecting m-mer points to represent
longer sequences
– Path features, thickness and color can encode statistics
indexed by the concatenated m-mers
– Can reuse the m-mers keeping a common framework
– Three 3-mers -> two-segment path -> 9-mer
• Challenges
– Overplotting, path ambiguity, prime sequence lengths
– Using translucent triangles for triples is poor, etc.
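The path idea can be illustrated by cutting a longer sequence into consecutive m-mers and pairing neighbors into segments. This is a minimal sketch with hypothetical function names; it raises an error when the length is not a multiple of m, which is one face of the sequence-length challenge listed above.

```python
def split_into_mmers(sequence, m=3):
    """Cut a sequence into consecutive, non-overlapping m-mers."""
    if len(sequence) % m != 0:
        raise ValueError("sequence length is not a multiple of m")
    return [sequence[i:i + m] for i in range(0, len(sequence), m)]


def path_segments(sequence, m=3):
    """Return the (start m-mer, end m-mer) pairs that make up the path."""
    nodes = split_into_mmers(sequence, m)
    return list(zip(nodes, nodes[1:]))


# The 9-mer from the Background slide becomes a two-segment path.
print(path_segments("KTLPLCVTL"))
```

Each m-mer in a segment would then be placed with the same 3-mer coordinates used for the point display, so the framework is shared.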
Letter x Position Coordinates
And Paths
• Merits
– Few points and simple structure
• 20 amino acids by 9 positions = 180 points
• Challenges
– Path overplotting => filtering
– Avoiding path interpretation ambiguity in
higher dimensional tables => 3-D layouts
Self-Similarity At The Same Scale:
Amino Acids Coordinates
• Each point represents a letter and position pair
– 9-mers: 20 letter x 9 positions = 180 points
• Geometry: icosahedron face centers
– Let x1= .539, x2=.873, x3=1.412
– A=(x1,x3,0), C=(0,x1,x3), … Y=(-x3,0,-x1)
• Use scale factor for a given position
– Scale factors for 9-mers: 2.2, 2.4, 2.6, …, 3.6
– A1 => 2.2*(x1,x3,0), C2 => 2.4*(0,x1,x3)
• Problem: overplotting of paths
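The shell construction can be sketched as a face center multiplied by a position-dependent scale factor. The sketch assumes a start of 2.2 and a step of 0.2, which matches the A1 and C2 examples. With nine positions that rule gives 3.8 for position 9, whereas the slide's list ends at 3.6, so the exact factors used in the talk remain uncertain. Only the three letters whose coordinates appear on the slide are included, and the function names are hypothetical.

```python
X1, X3 = 0.539, 1.412  # values from the slide

# Only the face centers the slide assigns explicitly.
BASE = {
    "A": (X1, X3, 0.0),
    "C": (0.0, X1, X3),
    "Y": (-X3, 0.0, -X1),
}


def scale_factor(position, first=2.2, step=0.2):
    """Scale for a 1-based position (assumed linear rule)."""
    return first + step * (position - 1)


def letter_position_point(letter, position):
    """Place a (letter, position) pair on the shell for that position."""
    s = scale_factor(position)
    return tuple(s * c for c in BASE[letter])


print(letter_position_point("A", 1))  # 2.2 * (x1, x3, 0)
print(letter_position_point("C", 2))  # 2.4 * (0, x1, x3)
```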
Self-Similarity At The Same Scale:
Amino Acids Example
• Each point represents a letter and position pair
– 9-mers: 20 letter x 9 positions = 180 points
• Geometry: hemisphere
– Amino acid: longitude, Position: latitude
– Amino acid ordering
• Group by chemical properties: hydrophobic, etc.
• Order to minimize path length in given application
– Include gaps for perceptual grouping
• Path overplotting still a problem, need filtering
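A hemisphere layout along these lines might look like the sketch below. Everything beyond "amino acid = longitude, position = latitude" is an assumption: the alphabetical letter order is a placeholder for a chemically motivated ordering, the spacing is uniform with no perceptual gaps, and the latitude range of 10 to 80 degrees is arbitrary.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # placeholder ordering (alphabetical)


def hemisphere_point(letter, position, n_positions=9,
                     lat_min=10.0, lat_max=80.0):
    """Return a unit-sphere point: longitude from letter, latitude from position."""
    lon = 2 * math.pi * AMINO_ACIDS.index(letter) / len(AMINO_ACIDS)
    frac = (position - 1) / (n_positions - 1)
    lat = math.radians(lat_min + frac * (lat_max - lat_min))
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


# 20 letters x 9 positions = 180 points, all on the upper hemisphere.
points = {(a, p): hemisphere_point(a, p)
          for a in AMINO_ACIDS for p in range(1, 10)}
print(len(points))
```

Swapping `AMINO_ACIDS` for an ordering grouped by chemical property, or one chosen to shorten paths, changes the layout without touching the statistics.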
Peptide Docking Example
• Immune system molecules combine with peptides
to form a complex recognized by T-cell receptors
– Problems:
• Failure to dock foreign peptides
• Docking with “self” peptides
• Molecule specific databases of docking peptides
– MHCPEP 1997, Brusic, Rudy, and Harrison
– Human leukocyte antigen (HLA) A2, class 1 molecule
• Small: about 500 peptides of 20^9 ≈ ½ trillion possibilities
• Mostly 9-mers (483)
• Positions related to asymmetric docking groove
Peptide Docking Interests
• Which amino acids appear in which
position?
• Characterize the space of docking, not-docking, unknown
• Prediction of unknowns
• Focused questions
– Is there a docking peptide in a key protein common
to all 23 HIV strains?
Docking Statistics
Number of the 483 peptides with the amino acid in position 2
Amino acid:   M   Q   P   S   T   F   V   A    L   G   I   K   R   H   E   D   C   W   N   Y
Count:       45   4   1   1  23   2  16  14  294   1  71   5   2   0   2   1   1   0   0   1
Cells from the collection of all 4-position tables:
126 tables of potentially 20^4 = 160000 cells each
G4 F5 V6 F7: 35
L2 A7 A8 V9: 29
…
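Both kinds of statistic on this slide can be tallied with counters, as in the sketch below. The function names are hypothetical and the toy peptides are made up for illustration; they are not MHCPEP entries. The 126 tables correspond to the ways of choosing 4 of the 9 positions.

```python
from collections import Counter
from itertools import combinations
from math import comb


def position_counts(peptides, position):
    """Count amino acids at a 1-based position across peptides."""
    return Counter(p[position - 1] for p in peptides)


def k_position_cells(peptides, k=4):
    """Count cells such as ('G4', 'F5', 'V6', 'F7') over all k-position tables."""
    cells = Counter()
    for pep in peptides:
        for positions in combinations(range(len(pep)), k):
            key = tuple(f"{pep[i]}{i + 1}" for i in positions)
            cells[key] += 1
    return cells


print(comb(9, 4), 20 ** 4)  # number of tables, potential cells per table

toy = ["KTLPLCVTL", "ALAKAAAAV", "KLLPLAVTV"]  # hypothetical 9-mers
print(position_counts(toy, 2))
print(k_position_cells(toy).most_common(3))
```

Only non-empty cells are stored, which matters because a few hundred peptides can fill only a tiny fraction of the potential cells.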
Graphics Software
• GLISTEN
– Geometric Letter-Indexed Statistical Table Encoding
– Swap out coordinates at will with tables unchanged
– NSF research: second generation version in progress
• Available partial alternatives
– CrystalVision ftp://www.galaxy.gmu.edu/pub/software/
– Ggobi www.ggobi.org/download.html
Hemisphere Plot Versus
Parallel Coordinate Plots
• PC plots are
– Better for the many scientists preferring flatland
– Straightforward to publish
– Ambiguous when connecting non-adjacent axes
• Hemisphere plots
– 3-D curvature reduces line ambiguity and provides a
general framework for tables involving non-adjacent
positions
– 3-D provides more neighbor options to group amino
acids based on chemical properties: non-polar, etc.
Closing Remarks
• Docking applications are still evolving
– New procedures for inference and better
databases
• Graphics still need work
– More scientific structure
– Work on cognitive optimization
• GLISTEN can address many other
applications
Graphics Reference
• Lee, et al. 2002, “The Next Frontier for
Bio- and Cheminformatics Visualization,”
IEEE Computer Graphics and Applications,
Sept/Oct, pp. 6-11.
Related Scientific References (1)
• Spellman, et al. 1998. “Comprehensive Identification of
Cell Cycle-regulated Genes of the Yeast
Saccharomyces cerevisiae by Microarray
Hybridization,” Molecular Biology of the Cell, Vol. 9,
pp. 3273-3297.
• Keles, van der Laan, and Eisen. 2002. “Identification of
regulatory elements using a feature selection method,”
Bioinformatics, Vol. 18, No. 9, pp. 1167-1175.
Related Scientific References (2)
• Segal, Cummings, and Hubbard. 2001.
“Relating Amino Acid Sequences to
Phenotypes: Analysis of Peptide-Binding
Data,” Biometrics 57, pp. 632-643.