Integrating Statistical Analysis with Visualization
CS 4390/5390 Data Visualization
Shirley Moore, Instructor
October 8, 2014

Descriptive Statistics
• Basic statistics such as mean, mode, median, and standard deviation
• The correlation coefficient expresses the strength and direction of an assumed linear relationship between two random variables on a scale from -1 to 1.
• A probability density describes the relative likelihood of a random variable taking on a given value.

Pearson’s Correlation Coefficient

Scatterplots with Correlation Coefficients

Normal Distribution

Box Plots
• Graphically display data according to their quartiles
• Indicate dispersion and skewness
• Show outliers
• Examples
  – Box plots in D3:
    http://bl.ocks.org/mbostock/4061502
    http://bl.ocks.org/jensgrubert/7789216

Inferential Statistics
• Draw conclusions that reach beyond the immediate data
• Regression modeling
  – linear
  – nonlinear
  – multiple
• Clustering
  – centroid-based
  – hierarchical

Linear Regression
• Least squares regression calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
• The coefficient of determination (R² value) indicates what fraction of the total variation in y is explained by the linear relationship between x and y.
• Linear regression in D3: http://bl.ocks.org/benvandyke/8459843

Anscombe’s Quartet

Cluster Analysis
• The task of grouping a set of objects so that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters)
• Many different algorithms

Centroid-based Clustering
• Clusters are represented by a central vector, which is not necessarily a member of the data set.
• When the number of clusters is fixed to k, k-means clustering finds k cluster centers and assigns each object to the nearest cluster center, such that the sum of the squared distances from the objects to their cluster centers is minimized.
• Since the problem is NP-hard, approximate (heuristic) algorithms are often used.

k-means Clustering Formal Definition

Lloyd’s Algorithm
• See science.js for an implementation by Jason Davies.
• See also http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Hierarchical Clustering
• Agglomerative and divisive methods
• Distance-based agglomerative method repeatedly merges the “closest” clusters
• Result usually represented as a dendrogram

Hierarchical Clustering Example
• See science.js for D3 hcluster code

Preparation for Next Class
• Finish Lab 3
  – upload files prior to class
• Review for the Quest (quiz/test) that will be Wed., Oct 15
  – Bring questions to class
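
Worked example for the Pearson’s Correlation Coefficient and Linear Regression slides. This is a minimal sketch in plain Python, not code from the course. The function names are hypothetical, and the data are the values commonly published for the first data set of Anscombe’s quartet, which is an assumption since the slides show the quartet only as a figure.

```python
import math

# First data set of Anscombe's quartet (commonly published values;
# an assumption here, not copied from the slides).
ANSCOMBE_X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE_Y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def _centered_sums(xs, ys):
    """Return (mean_x, mean_y, Sxx, Syy, Sxy) for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy


def pearson_r(xs, ys):
    """Pearson's r = Sxy / sqrt(Sxx * Syy), a value between -1 and 1."""
    _, _, sxx, syy, sxy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept minimizing the sum of squared vertical deviations."""
    mean_x, mean_y, sxx, _, sxy = _centered_sums(xs, ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    slope, intercept = least_squares_line(ANSCOMBE_X, ANSCOMBE_Y)
    print("r   =", round(pearson_r(ANSCOMBE_X, ANSCOMBE_Y), 3))
    print("fit = %.3f * x + %.3f" % (slope, intercept))
    print("R^2 =", round(r_squared(ANSCOMBE_X, ANSCOMBE_Y, slope, intercept), 3))
```

For this data set r is about 0.816, the fitted line is about y = 0.500x + 3.00, and R² is about 0.67. For a simple linear fit R² equals r squared, which the sketch can be used to check.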
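
Worked example for the Centroid-based Clustering and Lloyd’s Algorithm slides. This is a rough sketch of Lloyd’s algorithm in plain Python, not the science.js implementation. The function names, the caller-supplied starting centers, and the toy points are all assumptions made for illustration. Lloyd’s algorithm alternates an assignment step and an update step until assignments stop changing, and it only finds a local minimum of the sum of squared distances.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, initial_centers, max_iter=100):
    """Run Lloyd's algorithm from the given starting centers.

    Returns (centers, assignment), where assignment[i] is the index of
    the center nearest to points[i].
    """
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


def within_cluster_ss(points, centers, assignment):
    """The k-means objective: sum of squared distances to assigned centers."""
    return sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))


if __name__ == "__main__":
    # Toy data (made up for illustration): two visually separate groups.
    pts = [(1, 1), (1.5, 2), (1, 0.6), (8, 8), (9, 9.5), (8.5, 9)]
    centers, labels = lloyd_kmeans(pts, initial_centers=[pts[0], pts[3]])
    print(centers, labels, within_cluster_ss(pts, centers, labels))
```

Because the result depends on the starting centers, a common practice is to run the algorithm from several starting configurations and keep the run with the smallest objective value.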
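
Worked example for the Hierarchical Clustering slides. This is a small sketch of the distance-based agglomerative method, assuming single linkage (the distance between two clusters is the distance between their closest members). The slides do not say which linkage is meant, so that choice, the function name, and the toy points are assumptions. The returned merge list is the information a dendrogram draws.

```python
import math


def single_linkage(points):
    """Agglomerative clustering with single linkage.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of indices into points.
    """
    clusters = [(i,) for i in range(len(points))]  # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: closest pair of members across the two clusters.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    # Toy data (made up for illustration).
    pts = [(0, 0), (1, 0), (5, 0), (6, 0), (20, 0)]
    for left, right, dist in single_linkage(pts):
        print(left, "+", right, "at distance", dist)
```

Each merge distance becomes the height of a join in the dendrogram. This sketch recomputes all pairwise cluster distances on every pass, which is fine for small examples but slow for large data sets.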