Integrating Statistical Analysis
with Visualization
CS 4390/5390 Data Visualization
Shirley Moore, Instructor
October 8, 2014
1
Descriptive Statistics
• Basic statistics such as mean, mode, median,
standard deviation
• Correlation coefficient expresses the strength and
direction of an assumed linear relationship between
two random variables on a scale from -1 to 1.
• Probability density describes the relative likelihood
that a continuous random variable takes on a given value.
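The basic statistics listed above can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)        # arithmetic average
median = statistics.median(data)    # middle value of the sorted data
mode = statistics.mode(data)        # most frequent value
pop_sd = statistics.pstdev(data)    # population standard deviation (divides by n)
sample_sd = statistics.stdev(data)  # sample standard deviation (divides by n - 1)

print(mean, median, mode, pop_sd, sample_sd)
```

The two standard deviations differ only in the denominator; which one applies depends on whether the data are the whole population or a sample from it.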
2
Pearson’s Correlation Coefficient
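Pearson's r is commonly defined as the covariance of the two variables divided by the product of their standard deviations. A minimal sketch of that definition in plain Python follows; the function name `pearson_r` is an illustrative choice, not taken from the slides.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's r for two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when either variable is constant")
    return cov / math.sqrt(var_x * var_y)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

The common 1/n (or 1/(n - 1)) factors cancel between numerator and denominator, so they are omitted here.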
3
Scatterplots with Correlation
Coefficients
4
Normal Distribution
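As a small illustration of probability density for the normal distribution, the sketch below uses `statistics.NormalDist` from the Python standard library. The choice of the standard normal (mean 0, standard deviation 1) is an assumption made for the example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard = NormalDist(mu=0.0, sigma=1.0)

# Density at the mean, equal to 1 / sqrt(2 * pi).
peak_density = standard.pdf(0.0)

# Probability mass within one and two standard deviations of the mean.
within_one_sd = standard.cdf(1.0) - standard.cdf(-1.0)
within_two_sd = standard.cdf(2.0) - standard.cdf(-2.0)

print(peak_density, within_one_sd, within_two_sd)
```

Roughly 68% of the probability mass lies within one standard deviation of the mean and roughly 95% within two.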
5
Box Plots
• Graphically display data according to their
quartiles
• Indicate dispersion and skewness
• Show outliers
• Examples
– Box plots in D3
http://bl.ocks.org/mbostock/4061502
http://bl.ocks.org/jensgrubert/7789216
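The numbers behind a box plot can be sketched in a few lines of Python. The sample data are hypothetical, and the rule that flags points more than 1.5 times the interquartile range from the box as outliers is a widely used convention assumed here, not something the slides specify.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

# Quartiles; the "inclusive" method treats the data as the full population.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, the height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inliers = [x for x in data if lower_fence <= x <= upper_fence]

# Whiskers extend to the most extreme values that are not outliers.
whisker_low, whisker_high = min(inliers), max(inliers)

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Different quartile methods give slightly different box edges on small samples, so plotting libraries may not agree exactly with these values.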
6
Inferential Statistics
• Draw conclusions that reach beyond the
immediate data
• Regression modeling
– linear
– nonlinear
– multiple
• Clustering
– centroid-based
– hierarchical
7
Linear Regression
• Least squares regression calculates the best-fitting
line for the observed data by minimizing
the sum of the squares of the vertical deviations
from each data point to the line.
• Coefficient of determination (R² value) indicates
how much of the total variation in y can be
explained by the relationship between x and y.
• Linear regression in D3:
http://bl.ocks.org/benvandyke/8459843
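The least squares fit and the coefficient of determination described above can be sketched in plain Python. The function name `fit_line` and the sample points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: squared vertical deviations from the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: squared deviations of y from its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical data points.
slope, intercept, r_squared = fit_line([1, 2, 3], [1, 2, 2])
print(slope, intercept, r_squared)
```

The sketch assumes the x values are not all identical and the y values are not all identical; otherwise one of the divisions is by zero.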
8
Anscombe’s Quartet
9
Cluster Analysis
• Task of grouping a set of objects in such a way
that objects in the same group (called a
cluster) are more similar in some sense to
each other than to those in other groups
(clusters)
• Many different algorithms
10
Centroid-based Clustering
• Clusters are represented by a central vector,
which may not necessarily be a member of the
data set.
• When the number of clusters is fixed to k, k-means
clustering finds k cluster centers and
assigns the objects to the nearest cluster center,
such that the sum of the squared distances from
the centers is minimized.
• Since the problem is NP-hard, heuristic algorithms
that find approximate (locally optimal) solutions
are often used.
11
k-means Clustering Formal Definition
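A commonly used statement of the k-means objective, written with assumed notation (observations x, a partition S = {S_1, ..., S_k}, and cluster means μ_i), is to choose the partition that minimizes the within-cluster sum of squared distances:

```latex
\[
\underset{S}{\operatorname{arg\,min}}
\sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i}
\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^{2},
\qquad
\boldsymbol{\mu}_i = \frac{1}{\lvert S_i \rvert} \sum_{\mathbf{x} \in S_i} \mathbf{x}
\]
```

Here each μ_i is the mean of the points assigned to cluster S_i, which is the "central vector" of a centroid-based cluster.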
12
Lloyd’s Algorithm
See science.js for an implementation by Jason Davies.
See also
http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
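Lloyd's algorithm alternates two steps: assign each point to its nearest center, then move each center to the mean of its assigned points. The Python sketch below is a minimal illustration under stated assumptions, not the science.js implementation: the function name `lloyd_kmeans` is hypothetical, the initial centers are supplied by the caller, and points are tuples of numbers.

```python
import math

def lloyd_kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving.

    Assumes max_iter >= 1. Returns (centers, clusters).
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centers.append(center)  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # converged: the assignment matches the returned centers
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups and two rough starting centers.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = lloyd_kmeans(points, [(0, 0), (10, 10)])
print(centers)
```

The result depends on the starting centers, and the algorithm converges to a local optimum that is not guaranteed to be the global minimum.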
13
Hierarchical Clustering
• Agglomerative and divisive methods
• Distance-based agglomerative method
repeatedly merges “closest” clusters
• Result usually represented as a dendrogram
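The distance-based agglomerative method can be sketched as follows. This is a naive illustration that assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; the function name `single_linkage` is hypothetical, and other linkage rules are equally valid.

```python
import math

def single_linkage(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain.

    Returns (clusters, merges), where merges records each merge in order.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[p] for p in points]  # start with every point in its own cluster
    merges = []  # (cluster_a, cluster_b, distance) for each merge
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Hypothetical data: two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = single_linkage(points, 2)
print(clusters)
```

The recorded merge order and distances are the information a dendrogram draws: each merge becomes a junction whose height is the merge distance.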
14
Hierarchical Clustering Example
See science.js for D3 hcluster code
15
Preparation for Next Class
• Finish Lab 3
– upload files prior to class
• Review for the Quest (quiz/test) that will be
held on Wed., Oct. 15
– Bring questions to class
16