Exploratory Data Analysis
PS372
Spring 2010
The four features of distributions




Central Location – where are most of the
observations?
Spread – how far apart are the observations?
Shape – Symmetric or skewed?
Outliers – are any observations very far from
the rest?
What type of data do you have?


Nominal – observations are in categories.
Examples are gender (male, female) or eye
color (blue, green, brown, other)
Ordinal – observations can be ranked (i.e.
greater than or less than makes sense).
Examples are education level (less than high
school, high school, bachelor’s, graduate
degree) or agreement with a survey question
(strongly disagree, disagree, ambivalent,
agree, strongly agree)
Scales, continued


Interval – an observation is on an interval
scale if the difference between two numbers
has meaning. In measuring temperature, 95 degrees
is 5 degrees higher than 90 degrees, 30 degrees is 5
degrees higher than 25 degrees, etc.
This is not true of all data. For example, is
the difference between “agree” and “strongly
agree” the same as the difference between
“ambivalent” and “agree”?
Scales, continued



Ratio – The strongest form of scale. It
indicates that ratios (division) of numbers have
meaning. If your income is 20,000 then you
have twice as much income as someone
who makes 10,000.
Temperature in Fahrenheit or Celsius is NOT
ratio scaled, since 10 degrees is not “twice
as hot” as 5 degrees.
In a ratio scale, 0 has meaning as “nothing”.
In those temperature scales, 0 is arbitrary.
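One way to check the "twice as hot" claim is to convert to Kelvin, a temperature scale whose zero is not arbitrary. The short Python sketch below is only an illustration; the function name is made up for this example.

```python
def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin, whose zero point is absolute zero."""
    return c + 273.15

# Dividing the Celsius numbers gives 2.0, but only because 0 is placed arbitrarily
naive_ratio = 10 / 5

# On a scale with a true zero, the two temperatures are nearly the same
kelvin_ratio = celsius_to_kelvin(10) / celsius_to_kelvin(5)

print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 3))  # 1.018
```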
Back to the four features



Central location – where are most of the
observations?
When the observations are in categories, the
most relevant “statistics” are the number
and/or the relative frequency in each category.
For example, “50.5% of live births are male,
while 49.5% are female”. Alternatively, “in
our town, we have 934 men and 982 women”
It doesn’t make sense to talk about “the
average person’s gender” since you really
can’t be somewhere in the middle.
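As a rough sketch, the counts from the town example can be turned into percentages in a few lines of Python. The variable names are chosen for illustration only.

```python
counts = {"men": 934, "women": 982}  # counts quoted in the town example
total = sum(counts.values())

# Relative frequency of each category, expressed as a percentage
percentages = {category: 100 * n / total for category, n in counts.items()}

for category, pct in percentages.items():
    print(f"{category}: {counts[category]} ({pct:.1f}%)")
```

Either the counts or the percentages describe the data; no "average" category is computed.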
Mode




One statistic mentioned often for categorical
data (ordinal or nominal) is the mode, which is
the category with the most observations.
The mode is most meaningful when one of the
categories has most of the observations, as in
“most faculty at UK have doctoral degrees”
If the data is spread among many categories,
knowing the mode doesn’t provide a full picture.
For example, “the largest department in Arts and
Sciences at UK is Psychology” does not say
anything about the majority of faculty.
Summary – the mode often isn’t that useful.
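The point about the mode can be illustrated with a small Python sketch. Both data sets below are made up for this example (they are not real UK figures), and the helper name is hypothetical.

```python
from collections import Counter

def mode_and_share(values):
    """Return the most common category and the fraction of observations in it."""
    category, count = Counter(values).most_common(1)[0]
    return category, count / len(values)

# Hypothetical data: one category holds most of the observations
degrees = ["doctorate"] * 8 + ["master's"] * 2

# Hypothetical data: observations spread over many categories
departments = (["psychology"] * 3 + ["history"] * 2 + ["physics"] * 2
               + ["english"] * 2 + ["math"])

print(mode_and_share(degrees))      # ('doctorate', 0.8)
print(mode_and_share(departments))  # ('psychology', 0.3)
```

In the second case the mode covers only 30% of the observations, so it says little about the data as a whole.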
Central Location for Interval/Ratio


For interval/ratio data, the most common
measures of central location are the mean
and median.
The mean is defined as the arithmetic
average of the observations. You find this by
adding them up and dividing by the total
number. If your observations are (1,5,12), the
mean is (1+5+12)/3 = 6.
Mean/Median continued


The median is the “middle” observation of the
SORTED data. If your observations are
(1,5,12), the median is 5. If your observations
are (4,10,2,8,9), the median is 8.
If there is an even number of observations, average
the two middle values. So if the data are
(6,10,4,3), the middle values are 4 and 6,
and (4+6)/2 = 5. The median is 5.
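The mean and median rules above can be sketched in Python as follows. The function names are made up for illustration; Python's standard `statistics` module provides `mean` and `median` functions that give the same answers.

```python
import statistics

def mean_by_hand(values):
    """Add the observations up and divide by how many there are."""
    return sum(values) / len(values)

def median_by_hand(values):
    """Middle value of the sorted data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(mean_by_hand([1, 5, 12]))            # 6.0
print(median_by_hand([1, 5, 12]))          # 5
print(median_by_hand([4, 10, 2, 8, 9]))    # 8
print(median_by_hand([6, 10, 4, 3]))       # 5.0

# The standard library agrees with the hand calculation
print(statistics.median([6, 10, 4, 3]))    # 5.0
```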
Differences between the mean and
median



The median is robust, which means that
outliers have little or no effect on it. The mean is not robust.
Suppose we have data (1,4,6,10,12). The
mean is 33/5 = 6.6 while the median is 6.
Suppose we change the 12 to 14000. The
median is still 6, but the mean changes to
14021/5 = 2804.2. Note also that the median
is still close to most of the data, but the mean
is nowhere close to any data point.
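The comparison above is easy to reproduce. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen for illustration.

```python
import statistics

original = [1, 4, 6, 10, 12]
with_outlier = [1, 4, 6, 10, 14000]  # the 12 replaced by 14000

for data in (original, with_outlier):
    # The mean moves with the outlier; the median stays at the middle value
    print(data, "mean:", statistics.mean(data), "median:", statistics.median(data))
```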
Spread



For ordinal/nominal data, we do NOT have a
measure of spread in this class.
Measures of spread for ordinal/nominal data do
exist, but we do not cover them.
Essentially, these measures indicate
whether the data are spread evenly across all the
categories or whether one or a few
categories contain almost all the data.
The notion is called entropy. Not required in
our class, but look it up if you need it.
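For the curious, here is a minimal Python sketch of one common version, Shannon entropy measured in bits. The choice of Shannon entropy, the function name, and both sets of counts are assumptions made for this illustration; none of this is required for the class.

```python
import math

def shannon_entropy(counts):
    """Entropy (in bits) of the category shares implied by a list of counts."""
    total = sum(counts)
    shares = [c / total for c in counts if c > 0]  # empty categories contribute nothing
    return -sum(p * math.log2(p) for p in shares)

even = [25, 25, 25, 25]       # hypothetical: data spread evenly over four categories
concentrated = [97, 1, 1, 1]  # hypothetical: one category holds almost all the data

print(shannon_entropy(even))          # 2.0, the maximum for four categories
print(shannon_entropy(concentrated))  # much smaller, roughly 0.24
```

Higher entropy means the observations are spread more evenly across the categories.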
Spread for interval/ratio


Some common measures of spread for
interval/ratio data are the range, the
interquartile range, and the standard
deviation.
The range is simply the distance between the
smallest and largest observations. It is
obviously not robust to outliers, and seldom
used except when the spread is very small.
(e.g. if all the scores on an exam happened to
be between 76 and 78, which doesn’t
happen very often)
Interquartile range


First, we have to define the quartiles. Recall
when we compute the median, we are
dividing the data in half. The quartiles divide
each of the halves in half again (this divides
the data into four parts, hence the term
quartile)
To find the quartiles, first sort the data as if
you were finding the median.
Quartiles continued



If n is even, divide the data in half, thus
creating a first half and a second half
If n is odd, remove the median, and then
divide the data in half to produce a first half
and a second half.
The first quartile, Q1, is the median of the first
half. The third quartile, Q3, is the median of
the second half. (Q2 is the median).
Example of computing quartiles



Suppose our sorted data was 12, 14, 23, 36,
40, 42, 44, 61, and 78.
There are n=9 numbers, so find the median
M=40 and remove it. The first half is
(12,14,23,36) and the second half is (42, 44,
61, and 78).
The median of the first half is Q1=(14+23)/2 =
18.5 while the median of the second half is
Q3 = (44+61)/2 = 52.5
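The quartile rule above can be sketched in Python as follows. The function name is made up for this example. Statistical software uses several slightly different quartile conventions, so a program may report somewhat different values on other data.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    first_half = ordered[:half]
    # When n is odd, the median itself is left out of both halves
    second_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (statistics.median(first_half),
            statistics.median(ordered),
            statistics.median(second_half))

data = [12, 14, 23, 36, 40, 42, 44, 61, 78]
q1, m, q3 = quartiles_by_halves(data)
print(q1, m, q3)  # 18.5 40 52.5
```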
Interquartile range


The interquartile range is Q3 – Q1. It is not
sensitive to outliers.
We used the data 12, 14, 23, 36, 40, 42, 44,
61, and 78. If we changed the 78 to 100,000
then the interquartile range (IQR) does not
change.
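A short Python sketch confirms this and contrasts the IQR with the range. The function name is hypothetical, and the quartiles follow the median-of-halves rule described earlier.

```python
import statistics

def iqr_by_halves(values):
    """Q3 - Q1, where Q1 and Q3 are the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2  # size of each half; the median is left out when n is odd
    first_half = ordered[:half]
    second_half = ordered[n - half:]
    return statistics.median(second_half) - statistics.median(first_half)

data = [12, 14, 23, 36, 40, 42, 44, 61, 78]
changed = [12, 14, 23, 36, 40, 42, 44, 61, 100000]  # the 78 replaced by 100,000

print(iqr_by_halves(data), iqr_by_halves(changed))          # 34.0 34.0
print(max(data) - min(data), max(changed) - min(changed))   # 66 99988
```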
Standard deviation

The standard deviation is based on measuring the
average squared distance from the mean. It is defined as
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
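As a sketch, the sample standard deviation (sum the squared distances from the mean, divide by n - 1, then take the square root) can be computed by hand and checked against Python's standard `statistics.stdev`. The function name and the choice of example data (reused from the mean/median slides) are for illustration only.

```python
import math
import statistics

def sample_std_dev(values):
    """Square root of the sum of squared deviations from the mean over n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

data = [1, 4, 6, 10, 12]
print(sample_std_dev(data))    # about 4.45
print(statistics.stdev(data))  # same value from the standard library
```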
Standard deviation continued



The standard deviation is sensitive to
outliers. If one of the observations is very
large, then the standard deviation will be
large as well.
Unless there are strong outliers, the standard
deviation is the most commonly used
measure of spread.
This is because the standard deviation is
directly related to normal distributions (bell
curves), which we will study later.
Interlude – review of central
location and spread




For nominal/ordinal data, we simply report
the percentages in each category.
For interval/ratio data, central location is
usually measured by the mean (not robust)
or the median (robust).
For interval/ratio data, spread is usually
measured by the standard deviation (not
robust) or the Interquartile Range (robust)
The mode (central location) and the range
(spread) are rarely used for inference.
Shape




Look at the “tails”. If the tails are equal
length, then the distribution is symmetric
If the tail for lower values is longer, the
distribution is left skewed
If the tail for higher values is longer, the
distribution is right skewed.
“Symmetric” gets the benefit of the doubt in
describing a distribution. “Roughly
symmetric” is fine. I will not put judgment
calls on homework or exams.
Symmetric Data – Ideally and
Practically
Right skewed data – ideally and
practically
Left skewed data – ideally and
practically
Outliers



Recall outliers are any points that appear
separate from the rest.
Often this is a judgment call. Saying “mild
outlier” is fine; I do not intend to police
judgment calls.
Outliers often occur with skewed data in the
direction of the long tail.
Boxplots


A boxplot is intended to be a SIMPLE plot
which allows you to quickly see all the
features of the distribution.
In PS372 you will NOT be expected to draw
a boxplot from scratch, but you will be
expected to interpret a boxplot drawn on a
computer.
Step 1 for boxplot – The Box



Box extends from
Q1 to Q3, with a line
for the median.
Thus, you can
immediately see the
median (central
location) and the
IQR (spread).
Note the box
contains 50% of the
data
[Figure: box extending from Q1 to Q3, with a line at the median]
Step 2 for boxplot – The fences


Construct the
“fences”. These are
NOT in the final
product. They are
just used to make
decisions on
outliers.
Inner fences are 1.5
IQR from the box,
outer fences are 3.0
IQR from the box.
[Figure: boxplot with Q1, the median, and Q3 labeled; the fences are marked in steps of 1.5 IQR on each side of the box]
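The fence positions follow directly from Q1, Q3, and the IQR. Here is a minimal Python sketch using the quartiles from the earlier worked example (Q1 = 18.5, Q3 = 52.5); the function name is made up for illustration.

```python
def fences(q1, q3):
    """Return the inner and outer fences, each as a (lower, upper) pair."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

inner, outer = fences(18.5, 52.5)
print(inner)  # (-32.5, 103.5)
print(outer)  # (-83.5, 154.5)
```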
Step 2 for boxplot – Inner Fences


[Figure: the same boxplot with the inner fences labeled, 1.5 IQR from each end of the box]
Step 2 for boxplot – Outer fences


[Figure: the same boxplot with the outer fences labeled, 3.0 IQR from each end of the box]
Step 3 for boxplot – Whiskers


Each whisker
extends from the box
to the data point closest to,
but still inside, the
inner fence.
Remember, the
whiskers end at a
data point, not at the
inner fences.
[Figure: boxplot with the whiskers drawn from the box to the last data points inside the inner fences]
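To make the whisker rule concrete, here is a small Python sketch. The data set is hypothetical: it is the earlier quartile example with the smallest and largest values replaced, which leaves Q1 = 18.5 and Q3 = 52.5 unchanged. The function name is also made up.

```python
def whisker_ends(values, q1, q3):
    """Most extreme data points that still lie inside the inner fences."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return min(inside), max(inside)

# Hypothetical data; the inner fences are at -32.5 and 103.5
data = [-50, 14, 23, 36, 40, 42, 44, 61, 200]
print(whisker_ends(data, q1=18.5, q3=52.5))  # (14, 61)
```

The whiskers stop at 14 and 61, which are actual observations, not at the fence values.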
Step 4 for boxplot – Mild outliers


Mild outliers for a
boxplot are defined
to be points located
between the inner
and outer fences.
They are denoted
by open circles.
[Figure: boxplot with mild outliers drawn as open circles between the inner and outer fences]
Step 5 for boxplot – Extreme
outliers


Extreme outliers for
a boxplot are
defined to be points
located beyond the
outer fences
They are denoted
by filled circles.
[Figure: boxplot with extreme outliers drawn as filled circles beyond the outer fences]
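The mild/extreme distinction can be sketched in Python as follows. The data set is hypothetical (the earlier quartile example with its smallest and largest values replaced, so Q1 = 18.5 and Q3 = 52.5 still hold), and the function name is made up for illustration.

```python
def classify_outliers(values, q1, q3):
    """Split outliers into mild (between the fences) and extreme (beyond the outer fences)."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    mild, extreme = [], []
    for x in values:
        if x < outer_low or x > outer_high:
            extreme.append(x)
        elif x < inner_low or x > inner_high:
            mild.append(x)
    return mild, extreme

# Hypothetical data; inner fences at -32.5 and 103.5, outer fences at -83.5 and 154.5
data = [-50, 14, 23, 36, 40, 42, 44, 61, 200]
mild, extreme = classify_outliers(data, q1=18.5, q3=52.5)
print(mild)     # [-50]
print(extreme)  # [200]
```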
Final boxplot


Remember, the
fences are not
actually drawn.
You can see the
four features of
distributions easily
with a boxplot.
Outliers, for
example, are
explicitly drawn.
Using Boxplots

Central location is
shown through the
median (some
boxplots will show
the mean as a
separate line).
Using Boxplots


Spread is shown
through the IQR
(you cannot get the
standard deviation
from a boxplot).
You can also see
the range of the
data, but remember
the range is often
not that useful.
Using Boxplots

Shape can be seen
through the box and
the whiskers. If one
side of the box and
the corresponding
whisker are longer,
then the data is
skewed in that
direction (here left
skewed)
Using boxplots



Sometimes the box “leans” one way and the whiskers
the other. Then you can’t tell that much about shape
from the boxplot. This happens most often in small
datasets, where there isn’t much information about
shape in the entire dataset anyway.
Remember that symmetric always gets the benefit of
the doubt, so a slight “lean” isn’t enough to conclude
skewness.
Outliers are of course drawn explicitly on the plot,
and while you don’t have to take their definitions of
“mild” and “extreme” as absolute truth, it can be
handy.
Some variants


Some people and/or computer programs add
some “bells and whistles” to this basic
boxplot.
For example, Stata will often put a “+” in the
boxplot showing the location of the mean.
Side by side boxplots


When comparing multiple groups of people
(or anything else), boxplots provide a handy
method for comparison.
By placing the boxplots side by side, you can
immediately see similarities and differences
in central location, spread, and shape.
[Figure: side-by-side boxplots of the 1970 Draft Lottery, with months on the x axis and draft number on the y axis]
Conclusions


There is clear evidence the later months,
especially December, fared far worse in the
draft lottery than other months.
This draft was redone later after the
unfairness was noted by many sources.
Review



There are four features of distributions –
central location, spread, shape, and outliers
Central location can be measured by the
mode (nominal or ordinal data) or the median
or mean (interval/ratio data)
In interval/ratio data, spread can be
measured by the range (rarely useful), the
IQR, or the standard deviation.
More review


Outliers are any points far from the other
points. This definition is deliberately vague.
Two people may disagree over whether a
point is an outlier.
There is an explicit definition of outlier for a
boxplot (any point more extreme than Q1 –
1.5 IQR or Q3 + 1.5 IQR), but that is NOT
etched in stone
More review



Shape is in “the tails”. If the tails are equal
length, then the distribution is symmetric
If the tail for lower values is longer, the
distribution is left skewed
If the tail for higher values is longer, the
distribution is right skewed.
Describing a single distribution


When describing a distribution, or comparing two
distributions, you need to mention all four features of
the distributions, noting where they are similar and
where they are different.
For example, “both distributions have the same
spread (IQR is around 5, standard deviation is
around 7), but distribution A is, on average, much
higher than distribution B (mean for A is 78 while the
mean for B is 70). Both distributions are symmetric
and have no outliers”.
Example


Two classrooms were observed, with one
classroom (n=21) using “new directed
reading activities” and another classroom
(n=23) not using the activities.
This might be useful for an exploratory study,
but cannot provide conclusive evidence of
anything, as the classrooms differ on far
more than just “activities” or “no activities”
(for example, the teachers differ)
Example continued
Descriptive statistics


For the controls, n=23, mean=41.52, M=42,
std.dev = 17.15, IQR=26
For the treatment group, n=21, mean=51.48,
M=53, std.dev = 11.00, IQR=14
An example paragraph summary

The two groups vary most on spread, both in
terms of standard deviation (17.15 for the
controls and 11.00 for the treatment group)
and IQR (26 for the control group and 14 for
the treatment group). The difference in
spread is sufficient that the control group
extends beyond the treated group both for
high and low scoring students.
Paragraph summary continued

On average, scores are higher in the
treatment group. The mean of the treatment
group is 51.48 compared to a mean of 41.52
for the controls (the respective medians are
53 and 42). Both groups appear
approximately symmetric (perhaps a slight
right skew for the control group) and have no
outliers.