Descriptive Statistics
August 27, 2012
Overview of Descriptive Statistics
- Descriptive statistics are used to describe the basic features of
  the data gathered from an experimental study.
- They provide simple summaries about the sample and the measures.
- Together with simple graphical analysis, they form the basis of
  virtually every quantitative analysis of data.
Basic Features of Data
- The size of the sample is usually denoted n.
- The measures of central tendency are the mean, median, and mode.
- The measures of spread or variation are the min, max, variance
  and standard deviation, the fourths spread $f_s$ (IQR), and outliers.
- The measures of position are percentiles, deciles, and fourths
  (quartiles).
Measures of the Center of the Data
Overview
- What is normal for this population?
- We try to understand this question by asking another question.
- What is the central value of the data?
µ vs x̄
- The mean µ is the average value of a population.
- The sample mean x̄ is the average value of the sample:

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \]

where n is the size of the sample $x_1, x_2, \dots, x_n$.
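
The formula for the sample mean translates directly into code. The following is a minimal sketch; the function name `sample_mean` is an illustrative choice, not something defined in these notes.

```python
def sample_mean(sample):
    """Return the sum of the observations divided by the sample size n."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must contain at least one observation")
    return sum(sample) / n


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(sample_mean(data))  # 5.0
```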
The Median x̃ (or simply M)
- x̃ denotes the median, which is a number that splits the ranked
  data into two parts of equal size.
- If the number of observations is odd, then x̃ is the central number.
- If the number of observations is even, then x̃ is the midpoint of
  the two central numbers.
The Median x̃
e.g. For 3; 3; 4; 5; 7; 8; 10 the median is x̃ = 5.
e.g. For 2.1; 4.2; 4.3; 4.9; 5.0; 5.2 the median is

\[ \tilde{x} = \frac{4.3 + 4.9}{2} = 4.6 \]
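
The odd and even cases can be sketched in a few lines. This is an illustrative implementation under the definition above; the name `median` is ours, and the data are sorted first because the rule applies to ranked data.

```python
def median(sample):
    """Return the middle value of the ranked sample (midpoint of the two
    central values when the sample size is even)."""
    ranked = sorted(sample)
    n = len(ranked)
    if n == 0:
        raise ValueError("sample must contain at least one observation")
    mid = n // 2
    if n % 2 == 1:
        # Odd sample size: a single central observation exists.
        return ranked[mid]
    # Even sample size: average the two central observations.
    return (ranked[mid - 1] + ranked[mid]) / 2


print(median([3, 3, 4, 5, 7, 8, 10]))                      # 5
print(round(median([2.1, 4.2, 4.3, 4.9, 5.0, 5.2]), 2))    # 4.6
```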
Mode
- The mode is the most common value in a sample.
- e.g. the mode is 7 for S = {3, 5, 7, 7, 7, 9}
Compare the various measures of the “center”
Consider the sample 100; 100; 100; 100; 500; 600; 600; 700; 800
- The mode is 100
- The median is x̃ = 500
- The sample mean is

\[ \bar{x} = \frac{100 + 100 + 100 + 100 + 500 + 600 + 600 + 700 + 800}{9} = 400 \]
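
The three measures can be checked side by side with Python's standard `statistics` module. This is only a quick verification sketch of the numbers on this slide; the dictionary name `center` is an illustrative choice.

```python
import statistics

sample = [100, 100, 100, 100, 500, 600, 600, 700, 800]

# Collect the three measures of center for the same sample.
center = {
    "mode": statistics.mode(sample),
    "median": statistics.median(sample),
    "mean": statistics.mean(sample),
}
print(center)
```

The three values differ considerably here, which is the point of the comparison: each measure answers "what is the center?" in a different way.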
Measuring the Spread of Data
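Limitations of max and min
Consider the following data sets.
- 1; 5; 5; 5; 5; 5; 5; 5; 9
- 1; 2; 3; 4; 5; 6; 7; 8; 9
Both have min 1 and max 9, yet the values in the second are spread
much more evenly across that range, so min and max alone do not
describe the spread.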
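
A short check makes the limitation concrete. The sketch below compares max minus min for the two data sets with their sample standard deviations (a measure defined later in these notes, computed here with the standard library's `statistics.stdev`); the names `a`, `b`, and `data_range` are illustrative.

```python
import statistics

a = [1, 5, 5, 5, 5, 5, 5, 5, 9]
b = [1, 2, 3, 4, 5, 6, 7, 8, 9]


def data_range(sample):
    """Distance between the largest and smallest observation."""
    return max(sample) - min(sample)


for name, sample in (("a", a), ("b", b)):
    # Same range for both, but a different standard deviation.
    print(name, data_range(sample), round(statistics.stdev(sample), 2))
```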
Deviation
How far away from the average is a given piece of data?
- We could just take the absolute value of its difference from the mean...
- 1; 2; 3; 4; 5; 6; 7; 8; 9
- $\mu = 5$, $x_3 = 3$, $|\mu - x_3| = 2$...
- so $x_3$ is 2 away from the average.
- Similarly $x_8 = 8$ is 3 from the average.
- These examples illustrate the absolute deviations of $x_3$ and
  $x_8$ from the mean.
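
The same calculation can be done for every observation at once. A minimal sketch, with the helper name `absolute_deviations` chosen for illustration; note that Python lists are indexed from 0, so $x_3$ is `devs[2]`.

```python
def absolute_deviations(sample):
    """Return |x_i - mean| for every observation in the sample."""
    mean = sum(sample) / len(sample)
    return [abs(x - mean) for x in sample]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
devs = absolute_deviations(data)
print(devs[2])  # x_3 = 3 is 2.0 from the mean
print(devs[7])  # x_8 = 8 is 3.0 from the mean
```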
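Standard Deviation
- The standard deviation is another way to measure spread: it
  summarizes how far the data typically lie from the mean, which
  gives a scale for judging how far away one data point is relative
  to the other data.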
Calculating the Standard Deviation of a sample
- We first calculate the sample variance

\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} \]

- The sample standard deviation is the square root of the sample variance

\[ s = \sqrt{s^2} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} \]
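
The two formulas can be written out as a short sketch. The function names `sample_variance` and `sample_std` are illustrative; the divisor is n − 1, as in the formula above.

```python
import math


def sample_variance(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(sample)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - 1)


def sample_std(sample):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(sample))


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(sample_variance(data))       # 7.5
print(round(sample_std(data), 2))  # 2.74
```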
Using the Standard Deviation
1; 2; 3; 4; 5; 6; 7; 8; 9
- Which data are at least 1 standard deviation from the mean?
- We want those data $x_i$ such that $|x_i - \bar{x}| \ge s$
- $|x_i - 5| \ge 2.74$ for $x_1 = 1$, $x_2 = 2$, $x_8 = 8$ and $x_9 = 9$
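
The selection can be reproduced by filtering the sample. This sketch uses `statistics.mean` and `statistics.stdev` from the standard library (the latter divides by n − 1); the variable name `far` is an illustrative choice.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
mean = statistics.mean(data)
s = statistics.stdev(data)

# Keep the observations whose distance from the mean is at least s.
far = [x for x in data if abs(x - mean) >= s]
print(far)  # [1, 2, 8, 9]
```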
Calculating the Standard Deviation of a sample
- We first calculate the sample variance for the sample
  1; 2; 3; 4; 5; 6; 7; 8; 9, whose mean is 5

\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} \]

\[ = \frac{(1-5)^2 + (2-5)^2 + (3-5)^2 + (4-5)^2 + (5-5)^2 + (6-5)^2 + (7-5)^2 + (8-5)^2 + (9-5)^2}{9 - 1} = 7.50 \]

- The sample standard deviation is the square root of the sample variance

\[ s = \sqrt{s^2} = \sqrt{7.50} \approx 2.74 \]
Measures of Position
Percentiles
- The percentile of an observation is the percent of the data
  less than (or equal to) that observation.
- The median is the 50th percentile.
- The lower fourth (first quartile) $Q_1$ is the 25th percentile.
- The upper fourth (third quartile) $Q_3$ is the 75th percentile.
Calculating Percentiles
- The pth percentile for a ranked data set consisting of n
  observations is found by a two-step procedure.
- Compute the index

\[ i = \frac{p}{100}\, n \]

- If i is not an integer, the next integer greater than i locates
  the position of the pth percentile in the ranked data set. If i is
  an integer, the pth percentile is the average of the
  observations in positions i and i + 1 in the ranked data set.
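
The two-step procedure can be sketched as follows. This is one possible implementation of the rule stated above, not a standard library routine; the function name `percentile` is ours, and restricting p to whole numbers strictly between 0 and 100 is an assumption made so that the index test can be done exactly with integer arithmetic. Other software often uses different percentile conventions and may return slightly different values.

```python
def percentile(sample, p):
    """Return the pth percentile using the index rule i = (p / 100) * n.

    Assumes p is a whole number with 0 < p < 100.
    """
    if not isinstance(p, int) or not 0 < p < 100:
        raise ValueError("p must be an integer strictly between 0 and 100")
    ranked = sorted(sample)
    n = len(ranked)
    if n == 0:
        raise ValueError("sample must contain at least one observation")
    # q is the whole part of i, r is nonzero exactly when i is not an integer.
    q, r = divmod(p * n, 100)
    if r != 0:
        # i is not an integer: use position q + 1 (1-based), index q (0-based).
        return ranked[q]
    # i is an integer: average the observations in positions i and i + 1.
    return (ranked[q - 1] + ranked[q]) / 2


data = [312, 318, 320, 331, 342, 344, 345, 349, 350, 390]
print(percentile(data, 25))  # 320
print(percentile(data, 10))  # 315.0
print(percentile(data, 80))  # 349.5
```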
Percentiles: an example
- 312; 318; 320; 331; 342; 344; 345; 349; 350; 390
- Lower fourth (first quartile):
  $Q_1$ = median of bottom half = 320
- Median:
  $\tilde{x} = \frac{342 + 344}{2} = 343$
- Upper fourth (third quartile):
  $Q_3$ = median of top half = 349
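Percentiles: an example
- 312; 318; 320; 331; 342; 344; 345; 349; 350; 390
- The index is $i = \frac{25}{100} \cdot 10 = 2.5$, which is not an
  integer, so the 25th percentile is the data in the third position:
  $p_{25} = 320$
- NOTE: $p_{25} = Q_1$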
Deciles: an example
- 312; 318; 320; 331; 342; 344; 345; 349; 350; 390
- The 1st decile is the 10th percentile:
  $\frac{312 + 318}{2} = 315$
- The 8th decile is the 80th percentile:
  $\frac{349 + 350}{2} = 349.5$
An Exercise
- 34; 35; 36; 38; 40; 45; 51; 63
- Find the upper and lower fourths (first and third quartiles)
  and the median.
- Find the 90th percentile.
$f_s$: Fourths Spread (or IQR: interquartile range)

\[ f_s = Q_3 - Q_1 \]

- The fourths spread (interquartile range) is the length of the
  interval containing the middle 50% of the data.
- We will also see that this is the width of the box in a box and
  whisker plot.
- This is another way to measure the spread of the data and it
  is also (almost) a way to measure the middle.
IQR: an example
- 1; 2; 3; 4; 5; 6; 7; 8; 9
- $Q_1$: $i = \frac{25}{100} \cdot 9 = 2.25$, so $Q_1 = 3$
- Similarly $Q_3 = 7$
- IQR = $Q_3 - Q_1$ = 7 − 3 = 4
Outliers
- An outlier is a datum that does not fit the rest of the data.
- A common way to determine which data are outliers is to say
  that an outlier is an observation that lies more than 1.5(IQR)
  outside the middle 50%, that is, below $Q_1 - 1.5(\mathrm{IQR})$ or
  above $Q_3 + 1.5(\mathrm{IQR})$.
- e.g. for the sample 1; 2; 3; 4; 5; 6; 7; 8; 9, with Q1 = 3 and Q3 = 7,
  1.5(IQR) = 1.5(4) = 6
  So, an outlier would have to be below
  Q1 − 6 = 3 − 6 = −3
  or above
  Q3 + 6 = 7 + 6 = 13
  We see that there are no outliers in this sample.
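
The 1.5(IQR) rule can be sketched as a pair of small helpers. The names `outlier_fences` and `find_outliers` and the parameter `k` are illustrative; the quartiles are passed in as arguments (here the values Q1 = 3 and Q3 = 7 from the example) rather than recomputed.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the lower and upper cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(sample, q1, q3, k=1.5):
    """Return the observations lying strictly outside the cutoffs."""
    lower, upper = outlier_fences(q1, q3, k)
    return [x for x in sample if x < lower or x > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(outlier_fences(3, 7))       # (-3.0, 13.0)
print(find_outliers(data, 3, 7))  # []
```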
Box and Whisker Plots
- A Box and Whisker Plot is a visual representation of data
  that focuses on the fourths (quartiles).
- To construct a box plot, use a horizontal number line and a
  rectangular box. The smallest and largest data values label
  the endpoints of the axis.
- The lower fourth (first quartile) marks one end of the box and
  the upper fourth (third quartile) marks the other end of the box.
- The middle fifty percent of the data fall inside the box.
- The "whiskers" extend from the ends of the box to the
  smallest and largest data values.
- The box plot gives a good quick picture of the data.
Box and Whisker Plots: An Example
Create a box plot of the following data
312; 318; 320; 331; 342; 344; 345; 349; 350; 390
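
As a rough illustration of the construction, the sketch below draws a one-line text box plot from the five values the plot needs: smallest value, lower fourth, median, upper fourth, largest value. For this sample those are 312, 320, 343, 349 and 390, taken from the percentile example. The function name `ascii_boxplot`, the characters used, and the default width are all illustrative choices, and the rounding to character columns makes the picture approximate.

```python
def ascii_boxplot(lo, q1, med, q3, hi, width=40):
    """Render a horizontal text box plot scaled to `width` characters."""
    if not lo <= q1 <= med <= q3 <= hi or lo == hi:
        raise ValueError("need lo <= q1 <= med <= q3 <= hi with lo < hi")

    def col(value):
        # Map a data value to a character column between 0 and width - 1.
        return round((value - lo) / (hi - lo) * (width - 1))

    line = ["-"] * width               # whiskers span smallest to largest value
    for c in range(col(q1), col(q3) + 1):
        line[c] = "="                  # box spans the lower to the upper fourth
    line[0] = "|"                      # smallest data value
    line[width - 1] = "|"              # largest data value
    line[col(med)] = "M"               # median inside the box
    return "".join(line)


print(ascii_boxplot(312, 320, 343, 349, 390))
```

The long right-hand whisker in the output reflects the gap between 350 and 390 in the sample.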
Histograms
A Histogram is made by grouping data into bins and plotting the
frequency or relative frequency of members in each bin versus the
bin values.
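
The binning step can be sketched without any plotting library. The function name `histogram_counts` and the choice of five bins of width 20 starting at 300 are assumptions made for illustration; these notes do not specify bin boundaries. Each bin includes its left edge and excludes its right edge.

```python
def histogram_counts(sample, start, width, bins):
    """Count observations in `bins` equal-width bins beginning at `start`.

    Bin k covers start + k*width <= x < start + (k + 1)*width.
    Observations outside all bins are ignored.
    """
    counts = [0] * bins
    for x in sample:
        k = int((x - start) // width)
        if 0 <= k < bins:
            counts[k] += 1
    return counts


data = [312, 318, 320, 331, 342, 344, 345, 349, 350, 390]
counts = histogram_counts(data, start=300, width=20, bins=5)
relative = [c / len(data) for c in counts]
print(counts)    # [2, 2, 5, 0, 1]
print(relative)  # [0.2, 0.2, 0.5, 0.0, 0.1]
```

Plotting `counts` against the bins gives the frequency histogram, and plotting `relative` gives the relative frequency histogram.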
Histograms: An Example
Create a histogram of the following data
312; 318; 320; 331; 342; 344; 345; 349; 350; 390
Box and Whisker Plots: An Example
-1.96; -0.814; 1.86; 1.96; 0.519; 0.739; -0.540; 0.702; 0.663; 0.591;
0.580; 0.475; 0.589; -1.33; 0.420; -0.460; -0.482; 1.58; 0.778;
0.530; -0.507; -0.233; -0.195; 0.193; -0.136
Histogram: An Example
e.g. -1.96; -0.814; 1.86; 1.96; 0.519; 0.739; -0.540; 0.702; 0.663;
0.591; 0.580; 0.475; 0.589; -1.33; 0.420; -0.460; -0.482; 1.58;
0.778; 0.530; -0.507; -0.233; -0.195; 0.193; -0.136
Comparing the Visuals
Practice
Draw the box plot and histogram for the sample
1; 1.1; 1.2; 2; 2.2; 3; 4.1; 5; 7; 7.1; 10
Skewness
- In our last two samples, the mean and median were the same.
- This is not always the case.
- 1; 1; 2; 3; 5; 7; 10
- The median is x̃ = 3.
- The sample mean is

\[ \bar{x} = \frac{1 + 1 + 2 + 3 + 5 + 7 + 10}{7} \approx 4.14 \]
Skewness
- If the mean exceeds the median then the sample is skewed to
  the right.
- If the median exceeds the mean then the sample is skewed to
  the left.
- $\bar{x} > M \Rightarrow$ skewed to the right
  $\bar{x} < M \Rightarrow$ skewed to the left
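
The comparison can be expressed as a small helper. This sketch applies only the mean-versus-median rule stated on this slide; the function name `skew_direction` and the returned labels are illustrative choices.

```python
import statistics


def skew_direction(sample):
    """Compare the sample mean with the median M to label the skew."""
    mean = statistics.mean(sample)
    med = statistics.median(sample)
    if mean > med:
        return "right"
    if mean < med:
        return "left"
    return "none indicated"


print(skew_direction([1, 1, 2, 3, 5, 7, 10]))  # right
```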