CIS 2033
Based on the textbook: A Modern Introduction to Probability and Statistics. 2007
Instructor: Dr. Longin Jan Latecki
Chapter 16
Exploratory data analysis: numerical summaries
16.1 The Center of the Data Set
Center of the data = sample mean:
$$\bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where n is the sample size.
Example:
The sample mean of the following data is 44.7:
43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41
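As a quick check of the example, the sample mean can be computed directly from its definition. This is a minimal Python sketch; the variable names are illustrative and not part of the slides.

```python
# Data from the example above.
data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]

n = len(data)                # sample size
sample_mean = sum(data) / n  # (x_1 + ... + x_n) / n

print(f"n = {n}, sample mean = {sample_mean:.1f}")  # n = 11, sample mean = 44.7
```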
Outliers
An outlier is an observation that is numerically distant from the rest of the data.
The sample median is more robust than the sample mean in the presence of outliers.
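The robustness claim can be illustrated on the example data. The sketch below compares mean and median with and without the two values of 58. Treating those two values as outliers and dropping them is an assumption made here for illustration only; the slides do not prescribe it.

```python
from statistics import mean, median

data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]

# Illustrative assumption: regard the two readings of 58 as outliers and drop them.
trimmed = [x for x in data if x != 58]

print(f"all data:    mean = {mean(data):.1f}, median = {median(data)}")
print(f"without 58s: mean = {mean(trimmed):.1f}, median = {median(trimmed)}")

# How far each summary moves when the suspected outliers are removed.
mean_shift = abs(mean(data) - mean(trimmed))
median_shift = abs(median(data) - median(trimmed))
```

On this sample the mean moves by roughly three units while the median moves by one, which is the sense in which the median is the more robust summary.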
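Variability in a Data Set
Sample variance:
$$s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$$
Sample standard deviation:
$$s_n = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2}$$
where n is the sample size and \bar{x}_n is the sample mean.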
Why we choose the factor 1/(n−1) instead of 1/n will be
explained later (in Chapter 19).
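A short sketch of both quantities on the example data from Section 16.1, using Python's standard `statistics` module. `variance` and `stdev` use the factor 1/(n−1); `pvariance` uses 1/n and is included only to show the difference between the two factors.

```python
from statistics import pvariance, stdev, variance

data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
n = len(data)

s2 = variance(data)    # sample variance, factor 1/(n - 1)
s = stdev(data)        # sample standard deviation, square root of s2
v_n = pvariance(data)  # factor 1/n, shown for comparison only

print(f"sample variance (1/(n-1)): {s2:.2f}")
print(f"standard deviation:        {s:.2f}")
print(f"variance with factor 1/n:  {v_n:.2f}")
```

The two variances differ exactly by the ratio n/(n−1), so the gap shrinks as the sample grows.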
Variability cont.
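Median of Absolute Deviations (MAD):
The median of the absolute deviations of a sample from its median.
$$\mathrm{MAD}(x_1, \ldots, x_n) = \mathrm{Med}(|x_1 - \mathrm{Med}_n|, \ldots, |x_n - \mathrm{Med}_n|)$$
Med_n = median of the sample
Absolute deviation:
The absolute value |x_i − Med_n| of the distance of a point x_i in the data set from the median.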
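The definition translates almost word for word into code. A minimal sketch, where the function name `mad` is chosen here for illustration:

```python
from statistics import median

def mad(sample):
    """Median of the absolute deviations of the sample from its median."""
    med = median(sample)
    return median(abs(x - med) for x in sample)

data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
print("MAD =", mad(data))
```

For the example data the median is 42, nine of the eleven absolute deviations equal 1, and the two large deviations of 16 do not affect the result.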
Empirical quantiles
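The order statistics consist of the same elements as the original
dataset x_1, x_2, x_3, …, x_n, but in ascending order. Denote by x_(k) the kth
element in the ordered list. Then the pth empirical quantile is
$$q_n(p) = x_{(k)} + \alpha \, (x_{(k+1)} - x_{(k)})$$
where k = ⌊p(n+1)⌋ and α = p(n+1) − k.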
The pth empirical quantile q_n(p) corresponds to the
pth quantile of a cdf:
F^inv(p), where F is the
cumulative distribution
function underlying the data
Quick exercise 16.5. Compute the 55th empirical percentile for the Wick
temperature data.
The order statistics of the Wick temperature data (the sample from the example in Section 16.1) are:
41, 41, 41, 41, 41, 42, 43, 43, 43, 58, 58
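The exercise can be worked through in code. The sketch below assumes the textbook's interpolation rule q_n(p) = x_(k) + α(x_(k+1) − x_(k)) with k = ⌊p(n+1)⌋ and α = p(n+1) − k, and assumes that the Wick temperature data are the eleven values from the sample-mean example in Section 16.1. The function name `empirical_quantile` is illustrative.

```python
import math

def empirical_quantile(sample, p):
    """pth empirical quantile by linear interpolation between order statistics."""
    xs = sorted(sample)   # order statistics x_(1) <= ... <= x_(n)
    n = len(xs)
    pos = p * (n + 1)
    k = math.floor(pos)   # index of the lower order statistic (1-based)
    alpha = pos - k       # fractional part used for interpolation
    if k < 1 or k > n or (k == n and alpha > 0):
        raise ValueError("p is outside the range covered by the order statistics")
    if k == n:
        return xs[n - 1]
    return xs[k - 1] + alpha * (xs[k] - xs[k - 1])

# Assumed to be the (uncorrected) Wick temperature data.
wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
q55 = empirical_quantile(wick, 0.55)
print(f"55th empirical percentile: {q55:.1f}")
```

Here p(n+1) = 0.55 · 12 = 6.6, so k = 6 and α = 0.6, and the result interpolates between x_(6) = 42 and x_(7) = 43.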
Quartiles
• Lower quartile: qn(0.25)
• Median (middle quartile): qn(0.50)
• Upper quartile: qn(0.75)
• Interquartile range (IQR): IQR = qn(0.75) − qn(0.25)
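The quartiles and the IQR can also be obtained from Python's standard library. This sketch uses `statistics.quantiles` with `method="exclusive"`, which, as far as its documentation describes, places the ith of m sorted points at the fraction i/(m+1) and so should agree with a p(n+1) interpolation rule. The data are the example values from Section 16.1.

```python
from statistics import quantiles

data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]

# n=4 asks for the three cut points that split the data into quarters.
lower, med, upper = quantiles(data, n=4, method="exclusive")
iqr = upper - lower

print(f"lower quartile = {lower}, median = {med}, upper quartile = {upper}")
print(f"IQR = {iqr}")
```

Other software packages may use different quantile conventions, so quartiles of small samples can differ slightly between tools.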
Quick exercise 16.6. Compute the five-number summary for the
(uncorrected) Wick temperature data.
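A sketch of the five-number summary (minimum, lower quartile, median, upper quartile, maximum). It assumes the uncorrected Wick temperature data are the eleven values from the example in Section 16.1, and it relies on `statistics.quantiles` with `method="exclusive"` for the quartiles. The helper name `five_number_summary` is illustrative.

```python
from statistics import quantiles

def five_number_summary(sample):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    lower, med, upper = quantiles(sample, n=4, method="exclusive")
    return min(sample), lower, med, upper, max(sample)

# Assumed to be the (uncorrected) Wick temperature data.
wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
summary = five_number_summary(wick)
print(summary)
```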
The box-and-whisker plot
• Advantages:
• Compact summary of a dataset
• Shows quartiles, median and outliers
• Disadvantages:
• Poor graphical display of a single dataset
• Histogram and kernel density estimate are more informative displays of a single dataset
Up from the upper quartile we measure out
a distance of 1.5 times the IQR and draw
a so-called whisker up to the largest observation
that lies within this distance, where we put a horizontal line.
Similarly, down from the lower quartile we measure out a distance
of 1.5 times the IQR and draw a whisker to the smallest observation
that lies within this distance, where we also put a horizontal line.
All other observations beyond the whiskers are marked by ◦.
Such an observation is called an outlier.
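The whisker rule described above can be sketched as follows. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which is an assumption about the quantile convention; the function name and the use of the Section 16.1 example data are illustrative.

```python
from statistics import quantiles

def whiskers_and_outliers(sample):
    """Return (lower whisker end, upper whisker end, outliers) for a boxplot."""
    lower, _, upper = quantiles(sample, n=4, method="exclusive")
    reach = 1.5 * (upper - lower)  # 1.5 times the IQR
    low_limit, high_limit = lower - reach, upper + reach
    inside = [x for x in sample if low_limit <= x <= high_limit]
    outliers = [x for x in sample if x < low_limit or x > high_limit]
    # Whiskers end at the most extreme observations within the limits.
    return min(inside), max(inside), outliers

data = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
low_end, high_end, outliers = whiskers_and_outliers(data)
print(f"whiskers from {low_end} to {high_end}, outliers: {outliers}")
```

With quartiles 41 and 43 the IQR is 2, so the whiskers may extend at most 3 units beyond the box; both values of 58 fall outside and are marked as outliers.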
Using boxplots to compare several datasets
Boxplots become useful if we want to compare several sets of
data in a simple graphical display.