Integrating Statistical Analysis with Visualization
CS 4390/5390 Data Visualization
Shirley Moore, Instructor
October 8, 2014
Descriptive Statistics
• Basic statistics such as the mean, median, mode, and
standard deviation
• The correlation coefficient expresses the strength and direction
of an assumed linear relationship between two random
variables on a scale from -1 to 1.
• A probability density describes the relative likelihood that a
continuous random variable takes a value near a given point.
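As a minimal sketch of the basic statistics listed above, the following uses Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration, and the population form of the standard deviation (`pstdev`) is an assumption; `stdev` would give the sample form instead.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # arithmetic average
median = statistics.median(data)   # middle value of the sorted data
mode = statistics.mode(data)       # most frequent value
std_dev = statistics.pstdev(data)  # population standard deviation

print(mean, median, mode, std_dev)
```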
Pearson’s Correlation Coefficient
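The formula shown on this slide did not survive in the transcript. As a reference, the standard sample form of Pearson's correlation coefficient for paired observations (x_i, y_i), i = 1..n, with sample means x̄ and ȳ, is given below; the slide may have used a different but equivalent form.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Values of r near 1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.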
Scatterplots with Correlation Coefficients
Normal Distribution
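The plot from this slide is not in the transcript. For reference, the probability density function of a normal distribution with mean μ and standard deviation σ is commonly written as:

```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}}
       \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

About 68% of the probability mass lies within one standard deviation of the mean, and about 95% within two.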
Box Plots
• Graphically display data according to their
quartiles
• Indicate dispersion and skewness
• Show outliers
• Examples
– Box plots in D3
http://bl.ocks.org/mbostock/4061502
http://bl.ocks.org/jensgrubert/7789216
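The numbers a box plot draws can be sketched as follows. This is a rough illustration rather than the method used by the linked D3 examples: the function name `box_plot_summary`, the sample values, the "inclusive" quartile convention, and the 1.5 × IQR outlier rule (Tukey's convention) are all assumptions.

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    # Quartiles; the "inclusive" interpolation method is an assumption,
    # other tools may use slightly different conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed outlier rule: points beyond 1.5 * IQR from the box.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inliers = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (inliers[0], inliers[-1]),  # extreme non-outlier values
        "outliers": outliers,
    }

# Hypothetical data with one unusually large value.
summary = box_plot_summary([1, 3, 4, 5, 5, 6, 7, 8, 9, 30])
print(summary)
```

The box spans q1 to q3 (dispersion), the position of the median inside the box hints at skewness, and the outliers are drawn as individual points.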
Inferential Statistics
• Draw conclusions that reach beyond the
immediate data
• Regression modeling
– linear
– nonlinear
– multiple
• Clustering
– centroid-based
– hierarchical
Linear Regression
• Least squares regression calculates the best-fitting line for the observed data by minimizing
the sum of the squares of the vertical deviations
from each data point to the line.
• The coefficient of determination (R² value) indicates
the fraction of the total variation in y that is
explained by the linear relationship between x and y.
• Linear regression in D3:
http://bl.ocks.org/benvandyke/8459843
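A minimal sketch of simple least squares regression and the R² value, in plain Python. It is not taken from the linked D3 example; the function name `linear_regression` and the sample data are hypothetical.

```python
def linear_regression(xs, ys):
    """Fit y = intercept + slope * x by least squares; return slope, intercept, R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: squared vertical deviations from the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: total variation of y around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept, r_squared = linear_regression(xs, ys)
print(slope, intercept, r_squared)
```

For these values the fitted line is roughly y = 2.2 + 0.6x with R² of about 0.6, meaning about 60% of the variation in y is accounted for by the line.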
Anscombe’s Quartet
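The figure for this slide is not in the transcript. Anscombe's quartet consists of four small data sets with nearly identical summary statistics but very different shapes when plotted, which is the usual argument for visualizing data before trusting a regression. The sketch below computes the summaries; the values are the commonly published ones for the quartet (check them against a trusted source before reuse), and the helper name `summarize` is hypothetical.

```python
from math import sqrt

# Commonly published values of Anscombe's quartet (assumed correct here).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return mean of y, least squares slope, and Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_y, sxy / sxx, sxy / sqrt(sxx * syy)

summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (mean_y, slope, r) in summaries.items():
    print(f"{name}: mean(y)={mean_y:.2f} slope={slope:.2f} r={r:.2f}")
```

All four data sets give almost the same mean, slope, and correlation, yet a scatterplot of each shows a linear trend, a curve, a line with one outlier, and a vertical stack with one high-leverage point, respectively.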
Cluster Analysis
• Task of grouping a set of objects in such a way
that objects in the same group (called a
cluster) are more similar in some sense to
each other than to those in other groups
(clusters)
• Many different algorithms
Centroid-based Clustering
• Clusters are represented by a central vector,
which need not be a member of the
data set.
• When the number of clusters is fixed to k, k-means clustering finds k cluster centers and
assigns each object to the nearest cluster center,
such that the sum of the squared distances from
the objects to their assigned centers is minimized.
• Since the problem is NP-hard, heuristic algorithms that
converge to a local optimum are often used.
k-means Clustering Formal Definition
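The definition shown on this slide is missing from the transcript. A standard formulation, which may differ in notation from the slide, is: given observations x_1, ..., x_n in R^d, partition them into k sets S = {S_1, ..., S_k} so as to minimize the within-cluster sum of squares, where μ_i is the mean of the points in S_i.

```latex
\operatorname*{arg\,min}_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```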
Lloyd’s Algorithm
See science.js for an implementation by Jason Davies.
See also
http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
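The algorithm itself is not spelled out in the transcript. A minimal sketch of Lloyd's algorithm, which alternates an assignment step and a center update step until the assignments stop changing, might look like the following. This is not the science.js implementation; the function names, the random choice of initial centers, and the sample points are assumptions.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd_kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct data points at random.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return centers, assignment

# Hypothetical data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = lloyd_kmeans(points, k=2)
print(centers, assignment)
```

The result is a local optimum that depends on the initial centers, so in practice the algorithm is often run several times with different seeds and the best result kept.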
Hierarchical Clustering
• Agglomerative and divisive methods
• Distance-based agglomerative method
repeatedly merges “closest” clusters
• Result usually represented as a dendrogram
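The distance-based agglomerative method can be sketched as below. This is an illustration under assumptions: "closest" is taken to mean single linkage (the minimum distance between any two members), the function name `single_linkage` and the sample points are hypothetical, and the naive search over all pairs is only suitable for small inputs.

```python
from math import dist

def single_linkage(points):
    """Merge the two closest clusters repeatedly; return the merge history.

    Each history entry is (members_a, members_b, distance), where members
    are indices into `points`. The distances are the heights at which a
    dendrogram would join the two branches.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []

    def cluster_distance(c1, c2):
        # Single linkage (assumed): distance between the closest pair of members.
        return min(dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical one-dimensional points, written as 1-tuples.
points = [(1.0,), (2.0,), (6.0,), (7.0,), (15.0,)]
merges = single_linkage(points)
heights = [d for _, _, d in merges]
print(merges)
```

Reading the merge history from first to last entry gives the dendrogram from the leaves up to the root.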
Hierarchical Clustering Example
See science.js for D3 hcluster code
Preparation for Next Class
• Finish Lab 3
– upload files prior to class
• Review for the Quest (quiz/test) on
Wed., Oct. 15
– Bring questions to class