Download Data Summarization - Colorado State University`s Department of

Lecture Notes 3:
Data summarization
Highlights:
• Average
• Median
• Quartiles
• 5-number summary (and relation to boxplots)
• Outliers
• Range & IQR
• Variance and standard deviation
• Determining shape using mean & median
Some important characteristics of a data set
Location: Where is the data set “located” along a number line? Where is its center?
Spread: How dispersed (i.e. spread out) is the data?
Outliers: Are there any unusual values in the data set?
Shape: What is the shape of the distribution of values in the data set?
Location Statistics
Mean, Median & Quartiles
• In these notes, we will look at some common
descriptive statistics that are useful for summarizing
a data set.
• Recall that a statistic is any number calculated from
a set of data.
• The most succinct way to describe the location of a
data set is to identify its center.
• There are two statistics commonly used to describe center: the mean and the median.
Sample average
• The sample average (a.k.a. mean) is the sum of the data divided by the sample size.
• We denote the mean by $\bar{x}$, read “x bar”.
• The sample size is the number of observations in the sample, and is denoted “n”.
• The sum of all the observations in a sample is denoted by $\sum x_i$.
• So, our formula for the sample mean is
$$\bar{x} = \frac{\sum x_i}{n}$$
Sample Average Example
• Suppose we are interested in the average undulation
rate (in Hz) of a paradise tree snake, which undulates
after jumping from a tree in order to glide away.
• We take a sample of n = 8 snakes and somehow
measure the rates at which they undulate as they
propel themselves from a source.
• The eight observed rates are 0.9, 1.4, 1.2, 1.2, 1.3,
2.0, 1.4, 1.6
Sample Average Example
So, for this sample, we can compute:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{11.0}{8} = 1.375 \text{ Hz}$$
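The arithmetic for the snake example can be checked with a few lines of Python. This is only a sketch of the formula above; the variable names (`rates`, `total`, `x_bar`) are our own choices and not part of the original notes.

```python
# Snake undulation rates (Hz) from the example above.
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

n = len(rates)       # sample size
total = sum(rates)   # sum of all observations
x_bar = total / n    # sample mean = sum / n

print(f"n = {n}, sum = {total:.1f}, mean = {x_bar:.3f}")
# prints: n = 8, sum = 11.0, mean = 1.375
```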
Median
• If you put data in order from the smallest to the largest values, the
number in the middle is called the median.
• The median separates the bottom 50% of the data from the top 50% of
the data.
• If the sample size is odd, the median will be a value in your sample. If
the sample size is even, the median will be “between” the middle two
numbers in your sample.
Computing the median
1) Order the data set, smallest to largest.
2) Compute the rank of the median using Rank = (n + 1)/2. The rank tells you which observation will be the median.
3) If “Rank” is an integer value, go right to it in the sorted data set. Otherwise compute the average of the two surrounding ordered observations.
For instance, if rank = 5, then the median is the 5th ordered observation. If rank = 5.5, then the median is the average of the 5th and 6th ordered observations.
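The three steps above can be sketched as a short Python function. The function name `median_by_rank` and the two small test lists are made up for illustration; the logic follows the rank rule exactly as stated, and assumes the data set is not empty.

```python
def median_by_rank(data):
    """Median via the rank rule: rank = (n + 1) / 2."""
    ordered = sorted(data)               # step 1: order the data
    n = len(ordered)
    rank = (n + 1) / 2                   # step 2: rank of the median
    if rank.is_integer():                # step 3: integer rank, go right to it
        return ordered[int(rank) - 1]    # ranks start at 1, list indexes at 0
    lower = ordered[int(rank) - 1]       # otherwise average the two
    upper = ordered[int(rank)]           # surrounding ordered observations
    return (lower + upper) / 2


print(median_by_rank([7, 3, 9, 1, 5]))   # n = 5, rank = 3   -> 5
print(median_by_rank([7, 3, 9, 1]))      # n = 4, rank = 2.5 -> 5.0
```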
Computing the Median
• The data set below is already ordered (read down each column). There are 19 observations.

   49    73    96   116   137
   69    78    96   116   142
   70    81   105   117   151
   70    81   110   121

• Find the rank of the median using (n+1)/2: (19 + 1)/2 = 10.
• Now go to this observation by counting from the start of the data set to the rank of the median. The 10th ordered observation is 96, so the median is 96.
• You can verify that this is the median by making sure that there are the same number of observations above it as there are below it (here, 9 on each side).
Computing the Median
• The data set below is already ranked (read down each column). There are 20 observations.

   49    73    96   116   137
   69    78    96   116   142
   70    81   105   117   151
   70    81   110   121   175

• Find the rank of the median using (n+1)/2: (20 + 1)/2 = 10.5.
• In this case, the rank is between two integers, so the median will be the average of these two ordered observations: the 10th (96) and the 11th (105), giving (96 + 105)/2 = 100.5.
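Both median examples can be checked with Python's standard `statistics` module. This is a sketch only; the list names `obs_19` and `obs_20` are our own labels for the two data sets in these notes.

```python
import statistics

# The 19 ordered observations, and the same data with 175 appended.
obs_19 = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
          105, 110, 116, 116, 117, 121, 137, 142, 151]
obs_20 = obs_19 + [175]

for data in (obs_19, obs_20):
    n = len(data)
    rank = (n + 1) / 2   # rank of the median
    print(f"n = {n}, rank = {rank}, median = {statistics.median(data)}")
# prints: n = 19, rank = 10.0, median = 96
# prints: n = 20, rank = 10.5, median = 100.5
```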
Location Statistics:
Quartiles
• The median breaks the data set into two halves.
• Quartiles break the data set into 4 quarters.
• The lower quartile, Q1, is the “median” of all the data below the overall median.
• The upper quartile, Q3, is the “median” of all the data above the overall median.
Computing Quartiles
Here, in the data set of 20 observations, there are 10 observations below the median:

   49   69   70   70   73   78   81   81   96   96

We can find their “median”, Q1, in the usual manner: the rank is (10 + 1)/2 = 5.5, so Q1 is the average of the 5th and 6th of these observations, (73 + 78)/2 = 75.5.
Q1 separates the lower 25% from the upper 75% of the data.
Computing Quartiles
Likewise, there are 10 observations above the median:

   105   110   116   116   117   121   137   142   151   175

We can use the same rank we used to find Q1 (5.5), but start counting from the first observation above the overall median. Q3 is the average of the 5th and 6th of these observations, (117 + 121)/2 = 119.
Q3 separates the lower 75% from the top 25% of the data.
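The “median of each half” rule for quartiles can be sketched in Python as follows. The function name `quartiles_by_halves` is hypothetical, and the sketch assumes at least two observations. Other quartile conventions exist, so statistical software may report slightly different values for the same data.

```python
import statistics

def quartiles_by_halves(data):
    """Q1 and Q3 as the medians of the lower and upper halves.

    When n is odd, the overall median is left out of both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]                    # observations below the median
    upper = ordered[len(ordered) - half:]     # observations above the median
    return statistics.median(lower), statistics.median(upper)


obs_20 = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
          105, 110, 116, 116, 117, 121, 137, 142, 151, 175]
q1, q3 = quartiles_by_halves(obs_20)
print(q1, q3)   # 75.5 119.0
```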
Computing Quartiles
• A brief aside: when sample size is odd, it will not be the case that *exactly* 50% of the data is below the median or that *exactly* 50% is above it.
• This is because the median itself is not counted as being in either the upper or lower half of the data set.
• For reasonably large data sets, we may say things like “50% of the data is above the median” and “25% of the data is below Q1”, even though in some cases these are approximations.
Computing Quartiles
• Note that for relatively small datasets, you may be able to “eyeball” the data to find the median, Q1, and Q3, rather than using rank.
• For instance, it is not challenging to find the median and quartiles for the snake undulation rate data set of size n=8 from before.
• Simply order the numbers 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6 from smallest to largest, and you can quickly see where the median and quartiles lie:

   0.9   1.2   1.2   1.3   1.4   1.4   1.6   2.0

   Q1 = (1.2 + 1.2)/2 = 1.2, median = (1.3 + 1.4)/2 = 1.35, Q3 = (1.4 + 1.6)/2 = 1.5
Location Statistics:
Extremes
• We are also often interested in the extremes of a data set.
• These extreme values are referred to as the minimum and the
maximum. “Extreme” in this context doesn’t necessarily mean
“really big” or “really small”. It just means “the biggest” or “the
smallest”.
The 5-number summary
• The 5-number summary can be used to
summarize a data set.
• This group consists of the minimum, Q1, median, Q3, and maximum.
• These are all measures of location.
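As an illustration, here is a minimal Python sketch that collects the 5-number summary for the snake undulation data. It assumes the “median of each half” rule for quartiles used in these notes; the dictionary name `summary` is our own.

```python
import statistics

# Snake undulation rates (Hz), sorted smallest to largest.
rates = sorted([0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6])
half = len(rates) // 2

summary = {
    "min": rates[0],
    "Q1": statistics.median(rates[:half]),    # median of the lower half
    "median": statistics.median(rates),
    "Q3": statistics.median(rates[-half:]),   # median of the upper half
    "max": rates[-1],
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

For this sample the five values come out to 0.90, 1.20, 1.35, 1.50 and 2.00.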
Boxplots and the 5-number summary
• Sometimes boxplots are called “box and whisker plots.”
• Boxplots graphically illustrate the 5 values in a 5-number summary.
[Figure: boxplot of height (female)]
Boxplots and the 5-number summary
• Boxplots can be displayed horizontally or vertically.
• The dark line inside the box is the median.
• The edges of the box are Q1 and Q3.
• The whiskers extend to either the min and max, or to the furthest non-outliers.
Boxplots and the 5-number
summary
• Outliers are represented as dots on a boxplot.
• Note: 50% of the data is inside the box, 25% is
below the box, and 25% is above the box.
Outliers
• Outliers are data points that are located far away from where the
majority of the data lie.
• There is not universal agreement on what the standard should be for
classifying an observation as an outlier. It is to some extent subjective.
• Data analysis software packages will have internal standards by which
they decide which values should be considered outlying.
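One widely used convention, which is an assumption here and not something these notes prescribe, flags any value more than 1.5 × IQR below Q1 or above Q3. A minimal Python sketch of that rule, with the hypothetical function name `iqr_outliers` and quartiles computed as the medians of the lower and upper halves:

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Flag values beyond k * IQR from the quartiles (k = 1.5 is a convention)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])    # median of the lower half
    q3 = statistics.median(ordered[-half:])   # median of the upper half
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr    # cutoffs for "unusual" values
    return [x for x in ordered if x < low or x > high]


rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
print(iqr_outliers(rates))   # [2.0]
```

Under this particular rule the rate of 2.0 Hz is flagged. A different multiplier `k`, or a different software package, could reach a different verdict, which is the subjectivity described above.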
Outliers
• It’s usually a good idea to look more closely at an outlier to see if it is real or if it is a mistake.
• The outlier might be an improperly entered data value. Data entry is a tedious process and
sometimes people make mistakes.
• The outlier might be in different units than the rest of the data. For instance, in the
questionnaires from the first day of class, a few students gave their heights in centimeters rather
than inches. If these heights had not been converted, then our class dataset would have shown
students over 12 feet tall.
Outliers
• Outliers are often real, accurate pieces of data that are simply unusual.
• For instance, most people work 35-40 hours per week. However, a very small number work 70-80 hours a week.
• It is sometimes tempting to remove outliers from a data set, but we must find out first whether or not the outlier is a legitimate observation or a mistake.
Dispersion (Spread)
Here is a good piece of advice:
“Do not cross a river if it is, on
average, 4 feet deep”
-Nassim Taleb, The Black Swan
Why is this good advice? What additional
information would we need before we decide if
crossing the river is a good idea?
Dispersion (Spread)
• Information about location (average or median) is not enough to adequately summarize a data set.
• Sometimes the average describes something that doesn’t exist. For example, the average human being has one ovary and one testicle.
• Information about how your data is dispersed is also useful, and is essential in inferential statistics.
• We don’t just want to know where the center of our data lies; we also want to know how spread out the data is!
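To make the river advice concrete, here is a small Python sketch with two sets of depth readings. The numbers are entirely hypothetical and chosen only so that both rivers average 4 feet.

```python
import statistics

# Hypothetical depth readings (feet) at five points across two rivers.
even_river = [4, 4, 4, 4, 4]
uneven_river = [1, 1, 2, 2, 14]

for name, depths in (("even", even_river), ("uneven", uneven_river)):
    print(name, "mean:", statistics.mean(depths), "deepest:", max(depths))
```

Both rivers have the same mean depth, but only one of them has a 14-foot hole in it. The mean alone cannot tell them apart; a measure of spread can.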
The Range
• The range is the easiest measure of dispersion to compute.
• It is the difference between the maximum value and the minimum value.
• One problem with using the range is that it doesn’t tell you whether most of the data is spread out through the whole range, or if the maximum and minimum values are outliers.
The IQR
• The inter-quartile range (Q3 – Q1) is not affected by extreme
values since it is calculated using values that lie close to the
center of the data set
• We will not use either the range or the IQR when we move on to
inferential statistics. But they are still useful as descriptive
statistics.
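The difference between the two measures can be sketched in Python. Below, the largest snake rate (2.0) is replaced by a hypothetical data-entry error (20.0); the helper names `value_range` and `iqr` are our own, and quartiles are computed as the medians of the lower and upper halves.

```python
import statistics

def value_range(data):
    """Range = maximum - minimum."""
    return max(data) - min(data)

def iqr(data):
    """Inter-quartile range = Q3 - Q1."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return statistics.median(ordered[-half:]) - statistics.median(ordered[:half])


rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
typo = [20.0 if x == 2.0 else x for x in rates]   # hypothetical entry error

for label, data in (("original", rates), ("with typo", typo)):
    print(f"{label}: range = {value_range(data):.1f}, IQR = {iqr(data):.1f}")
# prints: original: range = 1.1, IQR = 0.3
# prints: with typo: range = 19.1, IQR = 0.3
```

The single bad value inflates the range from 1.1 to 19.1 while the IQR stays at 0.3.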
Variance
• The variance is another measure of dispersion. It is closely related to the standard deviation, which we will
consider shortly.
• Unlike the range or IQR, the variance statistic is computed using
all of the data values in a data set.
• It is sensitive to outliers, but the effects of extreme values are
“diluted” if there are a large number of observations.
Sum of Squared Deviations
• To compute the variance of a data set we first need a
statistic called the sum of squared deviations
• This is often abbreviated as SS, for “sum of squares”
• To get the squared deviation for a single
observation, subtract the mean from this
observation, and then square the result.
• Do this for all observations and sum the results.
This gives us the sum of squared deviations.
• Mathematically,
$$SS = \sum (x_i - \bar{x})^2$$
Sum of Squared Deviations
• Example: find the sum of squared deviations (SS) for our snake undulation rate dataset:

   0.9   1.4   1.2   1.2   1.3   2.0   1.4   1.6

$$\bar{x} = 1.375, \qquad SS = \sum (x_i - \bar{x})^2$$
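The recipe for SS (subtract the mean, square, then sum) can be sketched in Python for the eight values listed in this example. The variable names are our own.

```python
# The eight observed values from the example.
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

x_bar = sum(rates) / len(rates)                    # sample mean
squared_devs = [(x - x_bar) ** 2 for x in rates]   # one squared deviation per observation
ss = sum(squared_devs)                             # sum of squared deviations

print(f"mean = {x_bar:.3f}, SS = {ss:.3f}")
# prints: mean = 1.375, SS = 0.735
```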
Sample Variance
• The sample variance is denoted by the symbol $s^2$.
• Mathematically,
$$s^2 = \frac{SS}{n-1} = \frac{\sum (x_i - \bar{x})^2}{n-1}$$
• The English interpretation of a variance is: “The average squared distance that a group of ‘n’ points lies from the mean of the group.”
• This is not a very intuitive concept, though it is very often used in mathematical computations.
Sample Standard Deviation
• The sample standard deviation is simply the
square root of the sample variance.
• It is denoted by the letter s
• Continuing with our example, we have:
$$s = \sqrt{s^2} = \sqrt{\frac{SS}{n-1}} = \sqrt{\frac{0.735}{7}} \approx 0.324$$
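Python's standard `statistics` module provides both quantities directly: `statistics.variance` divides by n - 1, and `statistics.stdev` is its square root. A short sketch using the eight values from the running example:

```python
import math
import statistics

rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

s2 = statistics.variance(rates)   # sample variance, SS / (n - 1)
s = statistics.stdev(rates)       # sample standard deviation

print(f"s^2 = {s2:.3f}, s = {s:.3f}")
# prints: s^2 = 0.105, s = 0.324
print(math.isclose(s, math.sqrt(s2)))   # True: s is the square root of s^2
```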
Interpret the Standard Deviation
• The standard deviation can be thought of roughly as an average distance that a group of points lies from the group mean.
• A large standard deviation tells you that your data is highly dispersed, or spread out.
• In inferential statistics, a large standard deviation signifies high levels of uncertainty regarding statistical inferences.
• Note that what counts as “large” or “small” depends on the magnitude of the data itself.
Shapes of Distributions
• You don’t always need a histogram to get a rough idea of the shape of a distribution. Comparing the mean and the median of your data set is often enough to tell which way it is skewed.
[Figure: histogram of grades (frequency vs. grade), with median = 92 and mean = 86]
Shapes of Distributions
• What is the shape of this distribution?
• Note that the mean is 86, and the median is 92.
[Figure: histogram, with median = 92 and mean = 86]
Shapes of Distributions
• What is the shape of this distribution?
• Note that the mean is 2.6, and the median is 0.6.
[Figure: histogram, with median = 0.6 and mean = 2.6]
Shapes of Distributions
• What is the shape of this distribution?
• Note that the mean is 102, and the median is 102.
[Figure: histogram, with mean = 102 and median = 102]
Mean, Median, & Shape
• If the mean is greater than the median then the
distribution is skewed to the right
• If the mean is less than the median then the
distribution is skewed to the left
• If the mean and median are (approximately) equal
then the distribution is (approximately)
symmetric
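The three rules above can be sketched as a small Python function. The function name `describe_shape` and the `tolerance` parameter (how close counts as “approximately equal”) are assumptions made for illustration; the notes do not specify a cutoff.

```python
def describe_shape(mean, median, tolerance=0.0):
    """Rule-of-thumb shape from the mean and the median.

    `tolerance` is a hypothetical cutoff for "approximately equal".
    """
    diff = mean - median
    if abs(diff) <= tolerance:
        return "approximately symmetric"
    return "skewed right" if diff > 0 else "skewed left"


# The three histograms from the preceding slides.
print(describe_shape(86, 92))     # skewed left
print(describe_shape(2.6, 0.6))   # skewed right
print(describe_shape(102, 102))   # approximately symmetric
```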
Conclusion
• A statistic is any number calculated from a set of data. Descriptive statistics
are numbers that are used to describe important features of a data set.
• The mean and median are very commonly used statistics which refer to
location
• The standard deviation is a very commonly used statistic which refers to
dispersion.
• In the next set of notes, we will look at probability and the normal distribution,
which will lay the groundwork for understanding inferential statistics.