Lecture 2 notes, Dr. Jason P. Turner

Descriptive Measures
MARE 250
Dr. Jason Turner
Descriptive Measures
Descriptive Measures – numbers that are
used to describe datasets
Parts of Descriptive Statistics
Used to summarize raw data
Descriptive Measures
Measures of Center
Measures of Variation – how data are
distributed around center
5-number summary – used to construct
visual representation - Boxplot
Measures of Center
Measure of Central Tendency – indicates
where the center or most typical value of a data set
lies
Mean, Median, Mode
Measures of Center
Mean – of a dataset is the sum of the
observations divided by the number of
observations; Arithmetic Average
10 + 20 + 30 + 40 + 50 + 60 + 70 + 80 + 90 + 100 = 550
550 / 10 = 55
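The arithmetic above can be checked with a few lines of Python. This is only a small sketch using the standard library; the variable names are chosen here for illustration.

```python
from statistics import mean

observations = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Mean = sum of the observations / number of observations
total = sum(observations)
n = len(observations)
manual_mean = total / n

# statistics.mean computes the same arithmetic average
library_mean = mean(observations)

print(total, n, manual_mean, library_mean)
```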
Measures of Center
Median – the number that divides the bottom
50% of the data from the top 50%
1) Arrange data in increasing order
2) If number of observations is ODD, the
median is the observation exactly in the
middle
3) If the number of observations is EVEN,
median is the mean of the middle two
observations
Measures of Center
Position of the median = (n+1)/2 in the ordered data
10,20,30,40,50,60,70,80,90,100,110
(ODD, n = 11); position = 6; Median = 60
10,20,30,40,50,60,70,80,90,100
(EVEN, n = 10); position = 5.5; Median = (50+60)/2 = 55
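The three steps for finding the median can be sketched in Python as follows. The helper name median_by_position is made up for this example; statistics.median from the standard library should give the same answers.

```python
from statistics import median

def median_by_position(values):
    """Follow the slide's steps: sort, then take the middle value(s)."""
    ordered = sorted(values)          # 1) arrange in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # 2) ODD: the middle observation
        return ordered[mid]
    # 3) EVEN: mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
even_data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(median_by_position(odd_data), median(odd_data))
print(median_by_position(even_data), median(even_data))
```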
Measures of Center
Mode – the value that occurs most often in
the data set; found from the frequency of each value
If no value occurs more than once – No
Mode; 10,20,30,40,50,60,70,80,90,100
Otherwise – any value with greatest
frequency is Mode; 10,20,30,40,50,50,
60,70,80,90,100…Mode is 50
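A short sketch of the mode rule in Python, counting how often each value occurs. The function name modes is hypothetical, and returning an empty list for "No Mode" is a convention chosen for this example to match the slide.

```python
from collections import Counter

def modes(values):
    """Return the value(s) with the greatest frequency, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)              # frequency of each value
    highest = max(counts.values())
    if highest == 1:                      # no value occurs more than once
        return []
    return [value for value, count in counts.items() if count == highest]

no_repeat = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
one_repeat = [10, 20, 30, 40, 50, 50, 60, 70, 80, 90, 100]

print(modes(no_repeat))
print(modes(one_repeat))
```

Because the function returns a list, a bimodal data set would come back with two values.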
Measures of Center
The mode is useful if the distribution is
skewed or bimodal (having two very
pronounced values around which data are
concentrated)
[Figure: histogram; y-axis: Number of Individuals]
You are so totally skewed!
The mean is sensitive to extreme (very large or
small) observations and the median is not
Therefore – you can determine how skewed your
data is by looking at the relationship between
median and mean
Mean is greater than the median – right-skewed
Mean and median are equal – roughly symmetric
Mean is less than the median – left-skewed
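The mean versus median comparison can be turned into a rough check in Python. This is only a heuristic sketch: the function name skew_direction and the example data are invented for illustration, and a plot of the data is still the better guide.

```python
from math import isclose
from statistics import mean, median

def skew_direction(values):
    """Compare mean and median to guess the direction of skew."""
    centre_mean = mean(values)
    centre_median = median(values)
    if isclose(centre_mean, centre_median):
        return "roughly symmetric"
    if centre_mean > centre_median:
        return "right-skewed"      # large values pull the mean upward
    return "left-skewed"           # small values pull the mean downward

print(skew_direction([1, 2, 3, 4, 100]))
print(skew_direction([1, 97, 98, 99, 100]))
print(skew_direction([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]))
```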
Resistant Measures
A resistant measure is not sensitive to the
influences of a few extreme observations
Median – resistant measure of center
Mean – not resistant
Outliers have little effect on the Median
Outliers DO affect the Mean
Resistant Measures
Resistance of Mean can be improved by
using –
Trimmed Means – a specified percentage
of the smallest and largest observations are
removed before computing the mean
Will do something like this later when
exploring the data and evaluating
outliers…(their effects upon the mean)
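A trimmed mean can be sketched in Python like this. The function name trimmed_mean, the percent argument, and the data set with one extreme value are all assumptions made for the example; statistical packages may round the number of trimmed observations differently.

```python
def trimmed_mean(values, percent):
    """Drop `percent` % of the observations from each end, then average the rest."""
    if not 0 <= percent < 50:
        raise ValueError("percent must be at least 0 and below 50")
    ordered = sorted(values)
    k = (len(ordered) * percent) // 100      # observations removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Same data as the earlier slides, but with one extreme observation
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]

plain_mean = sum(data) / len(data)           # pulled upward by 1000
ten_percent_trimmed = trimmed_mean(data, 10) # drops 10 and 1000

print(plain_mean, ten_percent_trimmed)
```

In this example the plain mean is 145, while the 10% trimmed mean is 55, which sits much closer to the bulk of the data.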
Measures of Variation
Measures of Variation (Spread) – amount of
variability in the data set
Range, Standard Deviation, Variance
Range = Maximum Observation – Minimum
Observation
10,20,30,40,50,60,70,80,90,100;
Range = 100-10 = 90
Measures of Variation
Standard Deviation - (±SD) measures the
variation by indicating how far (on average)
the observations are from the mean
Large deviation – observations far from the mean
Small deviation – observations close to the mean
Measures of Variation
Variance - (measure used by statistical
formulas) square of the standard deviation
“Equal Variance” is one of the assumptions of
parametric means testing…(we will learn this later)
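The three measures of variation can be computed in Python as below. One assumption to flag: the slides do not say whether the sample (n - 1) or population (n) formula is meant, so this sketch uses the sample versions from the standard library (statistics.stdev and statistics.variance).

```python
from statistics import stdev, variance

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Range = maximum observation - minimum observation
data_range = max(data) - min(data)

# Sample standard deviation and sample variance (divide by n - 1)
sample_sd = stdev(data)
sample_variance = variance(data)

print(data_range)
print(round(sample_sd, 2), round(sample_variance, 2))

# Variance is the square of the standard deviation
print(abs(sample_sd ** 2 - sample_variance) < 1e-9)
```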
Measures of Variation
Three Standard Deviations Rule – almost
all observations in any data set lie within
three standard deviations to either side of the
mean; “almost all” is defined in 2 ways by stats
nerds…
Measures of Variation
Three Standard Deviations Rule –
Chebychev’s Rule – at least 89% of data within 3
Standard Deviations, for any data set (1 - 1/3^2 ≈ 0.89)
Empirical Rule – about 99.7% of observations are
within 3 Standard Deviations, if data are
approximately bell-shaped
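The 89% figure comes from the general Chebychev bound 1 - 1/k^2 with k = 3. A small Python sketch can compute that bound and count how much of a data set actually falls within k standard deviations; the function names are hypothetical and the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def chebychev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed fraction of values within k sample standard deviations of the mean."""
    centre = mean(values)
    spread = stdev(values)
    inside = [v for v in values if abs(v - centre) <= k * spread]
    return len(inside) / len(values)

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(round(chebychev_lower_bound(3), 3))   # the "89%" on the slide
print(fraction_within(data, 3))
```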
5 Number Summary
Percentiles – data set is divided into
hundredths (100 equal parts)
Why?..Percentiles are not sensitive to the
influence of a few extreme observations
(outliers)
5 Number Summary
Quartiles – data set is divided into quarters
(4 equal parts); most typically used
Data set has 3 Quartiles: Q1, Q2, Q3
Q1 – is the number that divides the bottom 25%
from top 75%
Q2 – is the median; bottom 50% from top 50%
Q3 – is the number that divides the bottom 75%
from top 25%
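Quartiles can be computed with statistics.quantiles from the Python standard library (Python 3.8 or later). Note that software packages and textbooks use slightly different quartile conventions, so the values below reflect Python's default "exclusive" method and may not match a hand calculation or another package exactly.

```python
from statistics import median, quantiles

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = quantiles(data, n=4)

print(q1, q2, q3)

# Q2 is the median
print(q2 == median(data))
```

With this method the sketch gives Q1 = 27.5, Q2 = 55 and Q3 = 82.5 for the example data.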
5 Number Summary
Interquartile Range (IQR) – the difference
between the first and third quartiles
IQR = Q3 – Q1
The IQR gives you the range of the middle
50% of the data
Outlier, Outlier
Outliers – observations that fall well outside
the overall pattern of the data
Requires special attention
May be the result of:
Measurement or Recording Error
Observation from a different population
Unusual Extreme observation
Pants on Fire!
Must deal with outliers: (Yes, really!)
If error – can delete; otherwise judgment call
Can use quartiles and IQR to identify
potential outliers
The Outer Limits
Lower and Upper Limits:
Lower limit – is the number that lies 1.5
IQR’s below the first quartile
Lower Limit = Q1 - 1.5 * IQR
Upper limit – is the number that lies 1.5
IQR’s above the third quartile
Upper Limit = Q3 + 1.5 * IQR
The Outer Limits
If a value is outside the “Outer Limits” of
a dataset it is an…
OUTLIER!
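The limit formulas can be applied in Python as in the sketch below. The function names and the example data (the earlier values plus two extreme observations) are made up for illustration, and the quartiles come from Python's default statistics.quantiles method, which may differ slightly from other software.

```python
from statistics import quantiles

def outer_limits(values):
    """Lower Limit = Q1 - 1.5 * IQR, Upper Limit = Q3 + 1.5 * IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1                      # IQR = Q3 - Q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def potential_outliers(values):
    """Values that fall outside the lower and upper limits."""
    lower, upper = outer_limits(values)
    return [v for v in sorted(values) if v < lower or v > upper]

data = [-200, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 250]

print(outer_limits(data))
print(potential_outliers(data))
```

Here the limits work out to -75 and 185, so -200 and 250 are flagged as potential outliers.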
5 Number Summary
5-Number Summary:
Min, Q1, Q2, Q3, Max
Written in increasing order
Provides information on Center and Variation
Is used to construct Boxplots
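A minimal Python sketch of the 5-number summary follows. The function name five_number_summary is hypothetical, and the quartiles again use Python's default statistics.quantiles method, so other software may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (Min, Q1, Q2, Q3, Max) in increasing order."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(five_number_summary(data))
```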
Boxplot
Boxplot (Box-and-Whisker-Design):
based on the 5-number summary
provide graphic display of the center and
variation
[Figure: boxplot on a 0 to 70 axis, labeled Min, Q1, Q2, Q3, Max]
Boxplot
Modified Boxplot – includes outliers
[Figure: modified boxplot on a 0 to 70 axis, with a potential outlier marked *]
Note that Min & Max are determined after
outliers are removed!
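The numbers behind a modified boxplot can be sketched in Python. This assumes the common convention that each whisker ends at the most extreme observation still inside the 1.5 * IQR limits; the function name, dictionary keys and data are invented for the example, and no plot is drawn.

```python
from statistics import quantiles

def modified_boxplot_stats(values):
    """Quartiles, whisker ends (after removing outliers) and the outliers themselves."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_limit <= v <= upper_limit]
    outliers = [v for v in ordered if v < lower_limit or v > upper_limit]
    return {
        "whisker_low": min(inside),    # Min once outliers are removed
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_high": max(inside),   # Max once outliers are removed
        "outliers": outliers,          # plotted individually, e.g. as *
    }

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 250]
stats = modified_boxplot_stats(data)
print(stats)
```

For this data the upper whisker stops at 100 and 250 is reported separately as a potential outlier.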
Boxplot
Boxplots summarize information about the
shape, dispersion, and center of your data
They can also help you spot outliers
Boxplot
Left edge of the box represents the first
quartile (Q1), while the right edge represents
the third quartile (Q3)
Box portion of the plot represents the
interquartile range (IQR) - middle 50% of data
[Figure: boxplot on a 0 to 70 axis, labeled Lower Limit, Q1, Q2, Q3, Upper Limit]
Boxplot
The line drawn through the box represents
the median of the data
The lines extending from the box are called
whiskers
The whiskers extend outward to the lowest and
highest values in the data set that lie within the
Lower and Upper limits (excluding outliers)
Boxplot
Extreme values, or outliers, are represented
by dots
A value is considered an outlier if it is outside
of the box (greater than Q3 or less than Q1)
by more than 1.5 times the IQR
[Figure: modified boxplot on a 0 to 70 axis, with a potential outlier marked *]
Boxplot
Use the boxplot to assess the symmetry of
the data:
If the data are fairly symmetric, the median
line will be roughly in the middle of the IQR
box and the whiskers will be similar in length
[Figure: boxplot of fairly symmetric data on a 0 to 70 axis]
Boxplot
Use the boxplot to assess the symmetry of
the data:
If the data are skewed, the median may not
fall in the middle of the IQR box, and one
whisker will likely be noticeably longer than
the other
[Figure: boxplot of skewed data on a 0 to 70 axis]