CIS 2033
Based on Textbook: A Modern Introduction to Probability and Statistics. 2007
Slides: QUINCY R WALKER
Modified by the instructor: Dr. Longin Jan Latecki
Chapter 16
Exploratory data analysis: numerical summaries
16.1 The Center of the Data Set
Center of the data = sample mean:
$$\bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where n is the sample size.
Example:
The sample mean of the following data is 44.7 (to one decimal place):
43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41
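As a quick check of the example, the sample mean can be computed directly from its definition, (x_1 + ... + x_n) / n. This is only a minimal sketch; the function name `sample_mean` is our own and does not come from the slides.

```python
# Data from the example above.
data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]


def sample_mean(xs):
    """Return the arithmetic mean (x_1 + ... + x_n) / n."""
    return sum(xs) / len(xs)


mean = sample_mean(data)
print(round(mean, 1))  # 44.7
```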
Outliers
An outlier is an observation that is numerically distant from the rest of the data.
The sample median is more robust than the sample mean in the presence of outliers.
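The robustness claim can be illustrated on the example data. The sketch below assumes, purely for illustration, that the two observations equal to 58 are the outliers, and compares the mean and the median with and without them using Python's standard `statistics` module.

```python
from statistics import mean, median

data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]

# Assumption for illustration: treat the two 58s as the outliers.
without_outliers = [x for x in data if x != 58]

# How far each summary moves when the outliers are included.
mean_shift = mean(data) - mean(without_outliers)
median_shift = median(data) - median(without_outliers)

print(f"mean shifts by   {mean_shift:.2f}")
print(f"median shifts by {median_shift}")
```

On this dataset the median moves by a single unit while the mean moves by almost three, which is the sense in which the median is the more robust summary.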
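Variability in a Data Set
Sample variance:
$$s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$$
Sample standard deviation:
$$s_n = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2}$$
where n is the number of observations in the sample and $\bar{x}_n$ is the sample mean.
Why we choose the factor 1/(n−1) instead of 1/n will be explained later (in Chapter 19).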
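A minimal sketch of the sample variance and standard deviation with the 1/(n−1) factor, applied to the data from the sample-mean example. The helper names `sample_variance` and `sample_std` are our own; Python's `statistics.variance` and `statistics.stdev` use the same 1/(n−1) convention and serve here as a cross-check.

```python
import math
import statistics

data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


var = sample_variance(data)
std = sample_std(data)

# Cross-check against the standard library, which also divides by n - 1.
print(math.isclose(var, statistics.variance(data)))
print(math.isclose(std, statistics.stdev(data)))
```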
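Variability cont.
Median of absolute deviations (MAD):
The median of the absolute deviations of a sample from its median.
$$\mathrm{MAD}(x_1, \ldots, x_n) = \mathrm{Med}(|x_1 - \mathrm{Med}_n|, \ldots, |x_n - \mathrm{Med}_n|)$$
Med_n = median of the sample
Absolute deviation:
The absolute value of the distance of a point x_i in the data set from the median, |x_i − Med_n|.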
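The MAD can be sketched in two steps: take the sample median, then take the median of the absolute distances to it. The function name `mad` is our own, and the data are those of the sample-mean example.

```python
from statistics import median

data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]


def mad(xs):
    """Median of the absolute deviations of xs from its median."""
    med = median(xs)
    absolute_deviations = [abs(x - med) for x in xs]
    return median(absolute_deviations)


print(mad(data))  # 1
```

The two observations at 58 have absolute deviation 16, yet the MAD stays at 1, so it is far less sensitive to them than the standard deviation.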
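Empirical quantiles
The order statistics consist of the same elements as the original dataset x_1, x_2, x_3, …, x_n, but in ascending order. Denote by x_(k) the kth element in the ordered list. Then, for 0 < p < 1, the pth empirical quantile is
$$q_n(p) = x_{(k)} + \alpha \, (x_{(k+1)} - x_{(k)}),$$
where k is the integer part of p(n+1) and α = p(n+1) − k.
The pth empirical quantile corresponds to the pth quantile of a cdf: F^inv(p), where F is the cumulative distribution function of the distribution underlying the data.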
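A minimal sketch of an empirical quantile, assuming the textbook's interpolation rule q_n(p) = x_(k) + α (x_(k+1) − x_(k)), with k the integer part of p(n+1) and α = p(n+1) − k. The function name `empirical_quantile` is our own, and other software may use a different interpolation convention.

```python
import math


def empirical_quantile(xs, p):
    """Interpolate between order statistics at position p * (n + 1)."""
    ordered = sorted(xs)           # order statistics x_(1) <= ... <= x_(n)
    n = len(ordered)
    position = p * (n + 1)
    k = math.floor(position)       # integer part of p(n+1)
    alpha = position - k           # fractional part of p(n+1)
    # The rule needs x_(k) and, when alpha > 0, x_(k+1) to exist.
    if k < 1 or k > n or (k == n and alpha > 0):
        raise ValueError("p(n+1) must lie between 1 and n")
    if alpha == 0:
        return ordered[k - 1]      # lists are 0-based, x_(k) is ordered[k-1]
    return ordered[k - 1] + alpha * (ordered[k] - ordered[k - 1])


data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
print(empirical_quantile(data, 0.5))  # 42
```

With n = 11 and p = 0.5 the position p(n+1) is exactly 6, so the quantile is the sixth order statistic, which is the sample median.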
Quartiles
• Lower quartile: q_n(0.25)
• Upper quartile: q_n(0.75)
• Interquartile range (IQR): IQR = q_n(0.75) − q_n(0.25)
• Median (middle quartile): q_n(0.50)
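The quartiles and the IQR can be sketched with Python's standard `statistics.quantiles`. The choice of `method="exclusive"` is an assumption on our part: it places the pth quantile at position p(n+1) among the ordered data, which appears to match the q_n notation used here, whereas other tools may default to different rules and return slightly different quartiles.

```python
from statistics import quantiles

data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]

# n=4 splits the data into four groups and returns the three cut points.
lower, med, upper = quantiles(data, n=4, method="exclusive")
iqr = upper - lower

print("lower quartile:", lower)
print("median:        ", med)
print("upper quartile:", upper)
print("IQR:           ", iqr)
```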
The box-and-whisker plot
• Advantages:
• Compact representation of statistical data
• Shows quartiles, median and outliers
• Disadvantages:
• Poor graphical display of a single dataset
• A histogram or kernel density estimate is a more informative display of a single dataset
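The numbers a boxplot draws can be sketched without any plotting library. The code below assumes the common convention that whiskers extend to the most extreme observations within 1.5 × IQR of the box and that points beyond them are marked as outliers; the slides do not state which convention they use, and the function name `boxplot_summary` and its `whisker` parameter are our own.

```python
from statistics import quantiles


def boxplot_summary(xs, whisker=1.5):
    """Return the values a boxplot displays (assumed 1.5 * IQR whisker rule)."""
    ordered = sorted(xs)
    q1, med, q3 = quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "lower_whisker": min(inside),
        "lower_quartile": q1,
        "median": med,
        "upper_quartile": q3,
        "upper_whisker": max(inside),
        "outliers": outliers,
    }


data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
summary = boxplot_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under this assumed rule the two observations at 58 fall outside the upper fence and would be drawn as individual outlier points.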
Using boxplots to compare several datasets
Boxplots become useful if we want to compare several sets of data in a simple graphical display.