Visualizing multivariate data with clustering and heatmaps
Reija Autio
School of Health Sciences
University of Tampere
Visualization in R
R is powerful software for visualization
The limitation is often between the computer and the chair ;)
More advanced illustration options require R packages
Some packages also require installing additional software;
this is not the case with clustering and heatmaps
Visualization in R
Publication quality graphics
Completely programmable and reproducible
Several packages available
Resulting figures can be viewed and saved as
On-screen graphics
PostScript, PDF, SVG, JPEG, PNG, TIFF, …
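As a minimal sketch of saving a figure to a file: base R opens a graphics device, draws, and closes the device. The file names are hypothetical, and the built-in mtcars data is used only for convenience.

```r
# Write the same scatter plot to a PDF and a PostScript file.
# File names are illustrative assumptions; files go to a temporary directory.
pdf_file <- file.path(tempdir(), "scatter.pdf")
ps_file  <- file.path(tempdir(), "scatter.ps")

draw_plot <- function() {
  plot(mtcars$wt, mtcars$mpg,
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
       main = "mtcars: mpg vs. weight")
}

pdf(pdf_file, width = 6, height = 4)  # size in inches
draw_plot()
invisible(dev.off())                  # closing the device writes the file

postscript(ps_file)
draw_plot()
invisible(dev.off())
```

The devices svg(), jpeg(), png() and tiff() follow the same open, draw, close pattern, but may need extra system support on some installations.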
Graphic environments
Low-level
R base graphics
Bar plots, scatter plots, line plots, pie charts, boxplots, etc.
Grid
High-level
Lattice
ggplot2
In this presentation the focus is on ggplot2
Clustering
Clustering is the classification/partitioning of data objects into groups
(clusters) based on their similarity.
This similarity is computed from a distance between the variables
It is used in many fields, such as data mining, machine learning, pattern
recognition, image analysis, genomics, systems biology, etc.
In machine learning clustering is defined as a form of unsupervised learning.
Why cluster?
Clustering efficiently reveals trends and similarities between the variables.
Several clustering methods are available.
Clustering is a standard data analysis method in many fields.
[Figure: the data before clustering (Pre-Clustering) and after clustering (Post-Clustering)]
Types of clustering
Today we focus on hierarchical clustering
Distance methods
Several distance measures can be used in clustering
Euclidean distance
Binary
Cityblock (Manhattan),…
Correlation-based distances: 1-R
Pearson, Spearman,…
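The distance measures listed above can be sketched in base R as follows. The toy matrix and the threshold used to make binary data are assumptions for illustration only.

```r
# Small toy matrix: rows are the objects to be compared.
x <- rbind(a = c(1, 2, 3, 4),
           b = c(2, 4, 6, 8),
           c = c(4, 3, 2, 1))

d_euclid    <- dist(x, method = "euclidean")
d_manhattan <- dist(x, method = "manhattan")          # city-block distance
d_binary    <- dist((x > 2) * 1, method = "binary")   # on 0/1 data (threshold is arbitrary)

# Correlation-based distance: 1 - R between rows.
# cor() correlates columns, so the matrix is transposed first.
d_pearson  <- as.dist(1 - cor(t(x), method = "pearson"))
d_spearman <- as.dist(1 - cor(t(x), method = "spearman"))

print(round(as.matrix(d_euclid), 3))
print(round(as.matrix(d_pearson), 3))
```

Rows a and b are far apart in Euclidean terms but have correlation distance 0, because b is a scaled copy of a; this is why the choice of distance can change the clustering.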
Cluster linkage
Single linkage
Complete linkage
Average linkage
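A short sketch of the three linkage methods using hclust() from base R. Using mtcars with Euclidean distance here is an assumption made for illustration.

```r
d <- dist(mtcars)  # Euclidean distance between cars (rows)

hc_single   <- hclust(d, method = "single")    # distance between the closest members
hc_complete <- hclust(d, method = "complete")  # distance between the farthest members
hc_average  <- hclust(d, method = "average")   # mean distance over all member pairs

# The height of the last merge shows how differently the linkages
# measure the distance between the final two clusters.
top_heights <- c(single   = max(hc_single$height),
                 complete = max(hc_complete$height),
                 average  = max(hc_average$height))
print(round(top_heights, 1))
```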
Hierarchical clustering
Hierarchical clustering (HC) is a straightforward method to illustrate the
groupings within the data
HC can be used for different types of data:
Examples:
Car data
Gene data
Car data
This is the example data set mtcars
Here Euclidean distance and complete linkage are
used in clustering (the defaults in R)
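A dendrogram of this kind could be produced roughly as follows. This is a sketch, not necessarily the exact code behind the slide; the output file name is hypothetical.

```r
d  <- dist(mtcars)   # method = "euclidean" is the default
hc <- hclust(d)      # method = "complete" is the default

# Draw the dendrogram to a temporary PDF so the sketch also runs without a display.
out_file <- file.path(tempdir(), "mtcars_dendrogram.pdf")
pdf(out_file, width = 10, height = 6)
plot(hc, main = "mtcars: Euclidean distance, complete linkage",
     xlab = "", sub = "")
invisible(dev.off())
```

In an interactive session, calling plot(hc) alone draws the dendrogram on screen.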
Gene data
Data from the article: Tuomela et al. (2013). Gene Expression Profiling of Immune-Competent Human
Cells Exposed to Engineered Zinc Oxide or Titanium Dioxide Nanoparticles. PLoS ONE 8(7): e68415
Human Jurkat samples exposed to nanoparticles
[Figures: clustering with Pearson correlation distance; clustering with Euclidean distance]
Hierarchical clustering
HC works iteratively:
1. Identify the clusters or variables with the shortest distance
2. Group them into a new cluster
3. Compute the distances between the clusters/variables (the new cluster is now
treated as a variable)
4. Continue from step 1
Iterate until all the clusters are joined into one big cluster
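The iteration above can be sketched as a naive implementation. The function naive_hc and the toy data are hypothetical, complete linkage is chosen as an assumption, and hclust() should be used in practice.

```r
# Naive agglomerative clustering with complete linkage.
# Returns the merge heights in the order the merges happen.
naive_hc <- function(d) {
  m <- as.matrix(d)
  clusters <- as.list(seq_len(nrow(m)))  # each object starts as its own cluster
  heights <- numeric(0)
  while (length(clusters) > 1) {
    best <- c(NA, NA)
    best_dist <- Inf
    # Find the pair of clusters with the shortest distance
    for (i in seq_len(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        # complete linkage: largest distance between any two members
        dij <- max(m[clusters[[i]], clusters[[j]]])
        if (dij < best_dist) {
          best_dist <- dij
          best <- c(i, j)
        }
      }
    }
    # Group the pair into a new cluster; distances to it are
    # recomputed from the members on the next pass of the loop
    merged <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters <- c(clusters[-best], list(merged))
    heights <- c(heights, best_dist)
  }
  heights
}

x <- matrix(c(0, 1, 3, 6.5, 10.5), ncol = 1)  # five toy points on a line
d <- dist(x)
print(naive_hc(d))                            # merge heights 1, 3, 4, 10.5
print(hclust(d, method = "complete")$height)  # same heights from base R
```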
Agglomerative clustering (step by step)
[Figure sequence: at each step the two closest nodes or clusters are joined into a cluster; the values 0.2, 0.3, 0.4 and 0.6 are shown in the figures]
All the nodes are now in one cluster
Clustering is ready
STOP
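The sequence of joins can also be read from the object returned by hclust(). This is a sketch with hypothetical toy data, not the data shown in the figures.

```r
x <- matrix(c(0, 1, 3, 6.5, 10.5), ncol = 1,
            dimnames = list(c("A", "B", "C", "D", "E"), NULL))
hc <- hclust(dist(x), method = "complete")

# One row per join: negative numbers are single objects,
# positive numbers refer to clusters formed in an earlier row.
print(hc$merge)
print(hc$height)  # distance at which each join happened
```

With n objects there are always n - 1 joins, after which all the nodes are in one cluster.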
Heatmaps
Many different colormaps are available
You can also create your own colormap
# creates a custom color palette from red through yellow to green
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
Heatmaps
mtcars
Clustering
The parameters strongly affect the
resulting heatmap
Scaling rows vs. scaling columns
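The effect of scaling can be tried with heatmap() from base R. This is a sketch and not necessarily the code behind the figures; the output file name is hypothetical.

```r
x <- as.matrix(mtcars)

out_file <- file.path(tempdir(), "mtcars_heatmaps.pdf")
pdf(out_file, width = 8, height = 8)

# Each row (car) is centered and scaled:
# colors compare the variables within a car.
hm_row <- heatmap(x, scale = "row", main = "scale = \"row\"")

# Each column (variable) is centered and scaled:
# colors compare the cars within a variable.
hm_col <- heatmap(x, scale = "column", main = "scale = \"column\"")

invisible(dev.off())
```

In heatmap(), scaling changes only the colors; the dendrograms are computed from the unscaled matrix, so the data must be scaled before the call if the clustering should use scaled values as well.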
heatmap.2
heatmap.2 function includes more than 40 visible
arguments that can be used to tune the resulting
figure. These arguments include scaling, selecting
clustering method, labeling, showing density info,
handling missing values etc.
Left: correlation distance; right: Euclidean distance with the colormap swapped.
Both are standardized by rows and use complete linkage
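A heatmap with correlation distance, complete linkage and row scaling might be set up as below. This is a sketch assuming the gplots package is installed; the helper names cor_dist and complete_hclust and the file name are hypothetical, and mtcars stands in for the data shown in the figures.

```r
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)

# Correlation-based distance between rows: 1 - Pearson R.
cor_dist <- function(m) as.dist(1 - cor(t(m), method = "pearson"))
complete_hclust <- function(d) hclust(d, method = "complete")

x <- as.matrix(mtcars)

# heatmap.2() lives in the gplots package; skip the plot if it is missing.
if (requireNamespace("gplots", quietly = TRUE)) {
  out_file <- file.path(tempdir(), "mtcars_heatmap2.pdf")
  pdf(out_file, width = 8, height = 10)
  gplots::heatmap.2(x,
                    distfun = cor_dist,           # distance measure
                    hclustfun = complete_hclust,  # linkage method
                    scale = "row",                # standardize by rows
                    col = my_palette,
                    trace = "none",
                    density.info = "none",
                    na.rm = TRUE,
                    margins = c(6, 10))
  invisible(dev.off())
}
```

Replacing cor_dist with dist gives the Euclidean version, and rev(my_palette) swaps the colormap.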
Thank you for your attention!
Further reading:
Kaufman, L., & Rousseeuw, P. J. (2005). Finding Groups in Data. An
Introduction to Cluster Analysis (p. 342). John Wiley & Sons Inc.
Maechler, M. (2013). Cluster Analysis Extended Rousseeuw et al. CRAN.