Visualizing multivariate data with clustering and heatmaps
Reija Autio
School of Health Sciences, University of Tampere

Visualization in R
R is powerful software for visualization.
The limitation is often between the computer and the chair ;)
More advanced illustration options require R packages.
Some of them require installing additional software; this is not the case with clustering and heatmaps.

Visualization in R
Publication-quality graphics
Completely programmable and reproducible
Several packages available
Resulting figures can be viewed as on-screen graphics and saved as PostScript, PDF, SVG, JPEG, PNG, TIFF, …

Graphic environments
Low-level: R base graphics (bar plots, scatter plots, line plots, pie charts, boxplots, etc.), Grid
High-level: Lattice, ggplot2
In this presentation the focus is on ggplot2.

Clustering
Clustering is the classification/partitioning of data objects into groups (clusters) based on their similarity.
This similarity is computed from a distance between the variables.
It is used in many fields, such as data mining, machine learning, pattern recognition, image analysis, genomics, systems biology, etc.
In machine learning, clustering is defined as a form of unsupervised learning.

Why cluster?
[Figure: pre-clustering]
Clustering efficiently reveals trends and similarities between the variables.
Several clustering methods are available.
Clustering is a standard data analysis method in many fields.
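The statement that similarity is computed from a distance can be made concrete with a few lines of base R. The following is only a minimal sketch under assumptions that are not on the slides: it uses the built-in mtcars data, standardizes the columns with scale(), and picks three cars purely to keep the output small.

```r
# Illustrative sketch only: the data set (mtcars), the standardization and
# the three cars picked are assumptions for this example.
data(mtcars)

# Standardize the columns so that no single variable dominates the distance.
cars_scaled <- scale(mtcars)

# Pick three cars to keep the printed distance table small.
some_cars <- cars_scaled[c("Mazda RX4", "Mazda RX4 Wag", "Cadillac Fleetwood"), ]

# Pairwise Euclidean distances between the selected cars.
d <- dist(some_cars, method = "euclidean")
print(round(d, 2))
```

A small distance means that two cars are similar, so the two Mazda models should come out much closer to each other than either is to the Cadillac. Calling dist() on t(cars_scaled) instead would compare the variables rather than the cars.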
[Figure: post-clustering]

Types of clustering
Today we focus on hierarchical clustering.

Distance methods
Several distance methods can be used in clustering:
Euclidean distance
Binary
City block (Manhattan), …
Correlation-based distances, 1 - r: Pearson, Spearman, …

Cluster linkage
Single linkage
Complete linkage
Average linkage

Hierarchical clustering
Hierarchical clustering (HC) is a straightforward method for illustrating the groupings within the data.
HC can be used for different types of data. Examples:
Car data
Gene data

Car data
This is the example data set mtcars.
Here Euclidean distance and complete linkage are used in the clustering (the defaults in R).

Gene data
Data from the article: Tuomela, et al. (2013) Gene Expression Profiling of Immune-Competent Human Cells Exposed to Engineered Zinc Oxide or Titanium Dioxide Nanoparticles, PLoS ONE 2013 Jul 22;8(7):e68415
Human Jurkat samples exposed to nanoparticles
[Figure panels: with Pearson correlation distance; with Euclidean distance]

Hierarchical clustering
HC works iteratively:
1. Identify the clusters or variables with the shortest distance.
2. Group them into a new cluster.
3. Compute the distances between the clusters/variables (the new cluster is now treated as a variable).
4. Continue from step 1.
Iterate until all the clusters are joined into one big cluster.

Agglomerative clustering (step by step)
[Figure sequence: at each step the two closest nodes are joined into a cluster; distances shown in the figure: 0.2, 0.3, 0.4, 0.6]
All the nodes are now in one cluster.
Clustering is ready. STOP

Heatmaps
Many different colormaps are available.
You can also create your own colormap:

    # create a custom color palette from red through yellow to green
    my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)

Heatmaps
mtcars
Clustering parameters change the resulting heatmap a lot.
Scaling rows vs. scaling columns

heatmap.2
The heatmap.2 function includes more than 40 visible arguments that can be used to tune the resulting figure.
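The pieces covered so far (a distance method, a linkage, a custom palette and a clustered heatmap) can be combined in one short sketch. This is not the code behind the figures on the slides. It assumes the built-in mtcars data, standardized by column, and uses the base R heatmap() function so that no extra package is needed. To the best of my knowledge heatmap.2 from the gplots package accepts arguments with the same names (distfun, hclustfun, scale, col), but treat that as an assumption to check against its documentation. The helper names cor_dist and complete_link are made up for this example.

```r
# Illustrative sketch only: base R heatmap() stands in for heatmap.2, and the
# data set, distance and linkage choices are assumptions for this example.
data(mtcars)

# Standardize the columns so that all variables are on a comparable scale.
x <- scale(as.matrix(mtcars))

# Correlation-based distance between the rows of m: 1 - Pearson correlation.
cor_dist <- function(m) as.dist(1 - cor(t(m), method = "pearson"))

# Hierarchical clustering with complete linkage.
complete_link <- function(d) hclust(d, method = "complete")

# Custom palette from red through yellow to green, as on the slide.
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)

# The merge heights trace the agglomerative steps: each value is the
# distance at which two clusters were joined.
car_tree <- complete_link(cor_dist(x))
print(round(head(car_tree$height), 4))

# Clustered heatmap. heatmap() applies distfun to the rows and, on the
# transposed matrix, to the columns. x is already standardized by column,
# so no further scaling is requested; scale = "row" or scale = "column"
# would rescale the displayed values instead.
hm <- heatmap(x,
              distfun   = cor_dist,
              hclustfun = complete_link,
              scale     = "none",
              col       = my_palette,
              margins   = c(5, 8))
```

Swapping cor_dist for dist, or "complete" for "single" or "average", is an easy way to see how strongly these choices change both the dendrograms and the ordering of the heatmap.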
The arguments of heatmap.2 include scaling, selection of the clustering method, labeling, showing density information, handling of missing values, etc.
[Figure: left, correlation distance; right, Euclidean distance; colormap swapped; both standardized by rows; complete linkage]

Thank you for your attention!

Further reading:
Kaufman, L., & Rousseeuw, P. J. (2005). Finding Groups in Data: An Introduction to Cluster Analysis (p. 342). John Wiley & Sons Inc.
Maechler, M. (2013). Cluster Analysis Extended Rousseeuw et al. CRAN.