Integrating Statistical Analysis with Visualization
CS 4390/5390 Data Visualization
Shirley Moore, Instructor
October 8, 2014

Descriptive Statistics
• Basic statistics such as mean, median, mode, and standard deviation
• The correlation coefficient expresses the strength of an assumed linear relationship between two random variables on a scale from -1 to 1.
• Probability density describes the relative likelihood of a random variable taking on a given value.

Pearson’s Correlation Coefficient

Scatterplots with Correlation Coefficients

Normal Distribution

Box Plots
• Graphically display data according to their quartiles
• Indicate dispersion and skewness
• Show outliers
• Examples
  – Box plots in D3
    http://bl.ocks.org/mbostock/4061502
    http://bl.ocks.org/jensgrubert/7789216

Inferential Statistics
• Draw conclusions that reach beyond the immediate data
• Regression modeling
  – linear
  – nonlinear
  – multiple
• Clustering
  – centroid-based
  – hierarchical

Linear Regression
• Least squares regression calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
• The coefficient of determination (R² value) indicates how much of the total variation in y can be explained by the linear relationship between x and y.
• Linear regression in D3: http://bl.ocks.org/benvandyke/8459843

Anscombe’s Quartet

Cluster Analysis
• Task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters)
• Many different algorithms

Centroid-based Clustering
• Clusters are represented by a central vector, which need not be a member of the data set.
• When the number of clusters is fixed to k, k-means clustering finds k cluster centers and assigns each object to the nearest cluster center, such that the sum of the squared distances from each object to its cluster center is minimized.
• Since the problem is NP-hard, approximation algorithms are often used.

k-means Clustering Formal Definition

Lloyd’s Algorithm
See science.js for an implementation by Jason Davies.
See also http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Hierarchical Clustering
• Agglomerative and divisive methods
• Distance-based agglomerative method repeatedly merges the “closest” clusters
• Result usually represented as a dendrogram

Hierarchical Clustering Example
See science.js for D3 hcluster code

Preparation for Next Class
• Finish Lab 3
  – Upload files prior to class
• Review for the Quest (quiz/test) that will be Wed., Oct 15
  – Bring questions to class
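
Review sketch: Lloyd’s algorithm for k-means
The centroid-based clustering slides describe k-means as assigning each object to its nearest center so that the sum of squared distances is minimized. The sketch below shows one common way to approximate that with Lloyd’s algorithm, alternating an assignment step and an update step until the assignments stop changing. It is written in Python for review purposes only and is not the science.js code referenced above. The names `sq_dist` and `lloyd_kmeans`, the random seeding of the initial centers, and the sample points are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Hypothetical helper: Lloyd's algorithm on a list of numeric tuples.

    Returns (centers, labels), where labels[i] is the index of the
    center assigned to points[i].
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct data points picked at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centers, labels


# Made-up sample data: two well-separated groups of three points each.
sample_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sample_centers, sample_labels = lloyd_kmeans(sample_points, k=2)
# Within-cluster sum of squares, the quantity k-means tries to minimize.
sample_wcss = sum(
    sq_dist(p, sample_centers[lab])
    for p, lab in zip(sample_points, sample_labels)
)
```

Because the result depends on the initial centers, Lloyd’s algorithm is usually run several times with different seeds and the run with the lowest within-cluster sum of squares is kept; it finds a local optimum, not necessarily the global one.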