Statistics: In a Nutshell
A statistic is a number that is calculated to represent some characteristic of a group of numbers.
The goal of statistics is to use these summary numbers to allow us to get a sense of a group of
numbers, to build a 'model' of the group of numbers.
There are two kinds of statistics. One kind, called descriptive statistics, involves calculating
statistics that are designed to tell us something about the specific group of numbers that we
have actually collected.
A second kind, called inferential statistics, consists of statistics that we use to try to get some idea
of what the parameters of a population are like. We are trying to find out about a large group,
even though we don't actually have direct information about the large group.
A parameter is a characteristic of a population that gives us a sense of some quality of the
population. Parameters are analogous to statistics, except parameters relate to populations and
statistics relate to samples. (If you don't know what the word "analogous" means, or if you
come across any word you don't understand, you should ask someone or look it up in the
dictionary!)
In this class, we will begin with descriptive statistics, developing tools to get a simpler and
simpler sense of a large group of numbers. At each stage, we will simplify the information that is
used to describe a set of numbers. When we do this, we make it easier for our minds to
understand what the numbers are telling us, but we also always lose some information, too.
Here is an outline of all the major material we will cover in this class:
Descriptive Statistics
Imagine a bunch of numbers
If we just have a whole bunch of numbers, we don't get much information. It is hard to tell what
is really going on. Suppose we ask 30 students their score on the last exam they took, and these
are the responses we get:
75   90   85   90   60   70   85   80   75   100
95   95   85   75   75   80   90   90   90   75
85   100  90   85   75   85   100  100  100  90
What is the general performance of these students? What sense of their academic
abilities can we get from just looking at the data in this way?
1/09
We can get a better sense if we can organize the data into a frequency distribution.
A frequency distribution groups the numbers that are the same together, and lists them together
with the number of times they occur in the group of numbers. For the exam score data, a
frequency distribution looks like this:
X       Frequency of X
60      1
65      0
70      1
75      6
80      2
85      6
90      7
95      2
100     5
From this table, we can see that people generally did fairly well on their last exam;
not many people scored below 75. More people seemed to score in the 80s or
90s, and 5 people got a perfect score of 100%.
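A frequency distribution like the one above can be tallied by hand, or with a few lines of code. Here is a minimal Python sketch (the score list is the exam data given above):

```python
from collections import Counter

# The 30 exam scores listed above.
scores = [75, 90, 85, 90, 60, 70, 85, 80, 75, 100,
          95, 95, 85, 75, 75, 80, 90, 90, 90, 75,
          85, 100, 90, 85, 75, 85, 100, 100, 100, 90]

# Count how many times each score occurs.
freq = Counter(scores)

# Print the frequency distribution from lowest to highest score,
# including scores (like 65) that never occur.
for x in range(min(scores), max(scores) + 5, 5):
    print(x, freq.get(x, 0))
```

Running this reproduces the table: 90 occurs 7 times, 75 and 85 each occur 6 times, and 65 never occurs.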
Sometimes, it also helps if we graph the frequency distribution in a bar graph or histogram, like
this:
[Histogram: Exam Score (0 to 110) on the x-axis, Frequency on the y-axis]
Or, we can use a different kind of graph called a frequency polygon, that looks something like
this:
[Frequency Polygon: Exam Scores (60 to 100) on the x-axis, Frequency (0 to 7) on the y-axis]
However, we might still want to summarize these data in an even more succinct way. We can do
this by using what are sometimes called the moments of a distribution. The most common
summary statistics are measures of central tendency (or location) and measures of spread (or
variability).
1. Central Tendency refers to a measure of how high or low the scores are,
generally. This is an average.
a) The most common measure of central tendency is the mean (this is
probably what you think of when you think of an average). The mean for
the data above is 85.67.
b) The median is another measure of central tendency — the score that
divides the numbers in half, so that 50% of the scores are above the
median and 50% are below the median. The median of the data above is
85.
c) The mode is another measure that is sometimes used to indicate central
tendency. The mode is the most common or frequent value. What is the
mode of the data above?
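All three measures of central tendency are available in Python's standard `statistics` module. A short sketch, using the exam data from above (it also answers the mode question for you, so try it by hand first):

```python
import statistics

# The 30 exam scores from the frequency-distribution example.
scores = [75, 90, 85, 90, 60, 70, 85, 80, 75, 100,
          95, 95, 85, 75, 75, 80, 90, 90, 90, 75,
          85, 100, 90, 85, 75, 85, 100, 100, 100, 90]

mean = statistics.mean(scores)      # arithmetic average of all 30 scores
median = statistics.median(scores)  # middle score when sorted
mode = statistics.mode(scores)      # most frequent score

print(round(mean, 2))  # 85.67
print(median)          # 85.0
print(mode)            # 90
```

The mean and median match the values in the text (85.67 and 85), and the mode turns out to be 90, which occurs 7 times.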
2. Knowing the location or central tendency of a set of numbers really doesn't tell us
that much. It tells us how high or how low the numbers are overall, but it doesn't
tell us anything more. One thing that is very useful to know is how spread out the
numbers are. We call this the spread or the variability.
a) There are three common measures of spread or variability. The first of
these is called the range. The range is simply the distance between the
highest and lowest score in the distribution. That is, we take the lowest
score, and subtract it from the highest score, to get the range. What is the
range of the distribution given above?
b) A second measure of variability is called the variance. The variance is a
mean — in this case, the mean squared distance of each score from the overall
mean of the distribution. We'll talk much more about this later. The
variance of the distribution above is 99.56.
c) An even more useful measure of variability (for reasons we will see later)
is the standard deviation. This is the square root of the variance. For the
data above, the standard deviation is 9.98.
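The three measures of spread can also be computed with the `statistics` module. Note that the values in the text (99.56 and 9.98) come from the population formulas, which divide by N; the sample versions (`statistics.variance` and `statistics.stdev`) divide by N - 1 and give slightly larger numbers. A sketch with the same exam data:

```python
import statistics

# The 30 exam scores from the examples above.
scores = [75, 90, 85, 90, 60, 70, 85, 80, 75, 100,
          95, 95, 85, 75, 75, 80, 90, 90, 90, 75,
          85, 100, 90, 85, 75, 85, 100, 100, 100, 90]

# Range: highest score minus lowest score.
score_range = max(scores) - min(scores)

# Population variance and standard deviation (dividing by N),
# which match the values given in the text.
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)

print(score_range)           # 40
print(round(variance, 2))    # 99.56
print(round(std_dev, 2))     # 9.98
```

This also answers the range question in part a): 100 - 60 = 40.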
3. Measures of variability and measures of central tendency are used when we are
thinking about only one variable. However, often we want to get a picture of two
variables, and how they relate to each other.
a) For example, in addition to asking about exam score, suppose I also asked
people their overall grade point average. We would expect there to be a
relationship such that people with higher GPAs probably got a higher
grade on their last exam. To look at this we would calculate a statistic
called a correlation.
b) Another thing we could do is try to come up with an equation where we
could predict someone's exam score based on their GPA. We could use
our data to create a regression equation. A regression equation would give
us a way to make an educated guess about people's exam scores.
(1) Of course, we are not going to be right all the time, but this
would be our best guess, based on the data that we have
collected.
B. Understanding these ways to represent or describe data gives us the ability to get a
'picture' of what some specific set of numbers is like. We can get a sense of the overall
data with a frequency distribution — either in a table, a histogram, or a frequency
polygon. We can understand some characteristics of those data more succinctly with
summary statistics such as measures of location, variability, and shape.
C. If we have two variables, we can get even more information. We can understand how the two
variables relate to each other with a correlation coefficient or a regression equation.
D. However, these summary statistics only apply to the specific numbers that we have in
front of us. If we want to get a sense of the parameters of a population — for example,
the mean or standard deviation of the whole population, or the difference between the
means of two different populations, etc. — then we have to move into inferential statistics.
Inferential Statistics
II. As mentioned, inferential statistics refers to statistics used to get a sense of what is going
on with the population. Because we only have data from a sample, we have to make an
inference about what the values of the population parameters are. Because by definition
we do not know things about the population, the inferences we make about the population
parameters are always accompanied by uncertainty. Fortunately, we can often make
certain assumptions that allow us to tell exactly how uncertain we are!
III. A very important concept for inferential statistics — perhaps the most important
concept in this whole course — is the concept of the sampling distribution.
A. The sampling distribution of a statistic is a frequency distribution of that statistic.
That is, if we imagine that we could take every possible sample from some
population, and calculate some statistic on each sample, we would have a bunch
of numbers — the value of the statistic for each of the different samples.
B. We can group these into a frequency distribution, and the shape of this
distribution, as well as the moments of the distribution (i.e., the mean, standard
deviation, and so on), can help us a great deal.
1. Of course, we can't really collect every possible sample (if we could, we
would also be able to get data on the whole population, and then we
wouldn't need inferential statistics at all). But, we can often assume that
the sampling distribution of a particular statistic is of a certain form — a
form that corresponds to some theoretical distribution that we already
know about (such as the normal distribution).
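Although we can never take every possible sample, a computer can take a great many of them, which makes the idea of a sampling distribution concrete. The sketch below invents a population of exam-like scores (mean 80, SD 10 — these numbers are assumptions, not data from the text), draws 2,000 samples of size 30, and looks at the distribution of the 2,000 sample means:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# A made-up "population" of 10,000 exam-like scores.
population = [random.gauss(80, 10) for _ in range(10_000)]

# Approximate the sampling distribution of the mean: draw many
# samples of size 30 and record each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(2_000)
]

# The sample means cluster tightly around the population mean (80),
# and their spread is close to the population SD divided by sqrt(30).
print(round(statistics.mean(sample_means), 1))
print(round(statistics.stdev(sample_means), 1))
```

A histogram of `sample_means` would look approximately normal, which is exactly the kind of theoretical form inferential statistics relies on.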
IV. There are basically two approaches to inferential statistics that are commonly used, and
we will cover both of them. Of course, they are closely related to each other, but they
take somewhat different approaches.
A. The first approach can be called the estimation approach. With this approach, we
try and come up with some sample statistic that will be a good estimate of a
population parameter that we are interested in.
1. Often, the statistic that is the best estimate is closely related to the
parameter. For example, the best estimate of the mean of a population will
be the mean of the sample.
a) However, it is important to understand that for other parameters —
most of which we will not cover in this class — this may not be the
case.
2. When we have an estimate of a parameter, there will be some uncertainty.
That is, we will be pretty sure that the exact population mean is not the
same, exactly, as our sample mean.
a) However, we can come up with a confidence interval, which gives
us a range within which the population parameter most likely lies.
(1) For example, we might say that the mean of some population
is 12.45, plus or minus 4.2. This means that
we are pretty certain (although never 100% certain) that the
mean of the population is between 8.25 (= 12.45 - 4.2) and
16.65 (= 12.45 + 4.2).
(2) The narrower the confidence interval, the more exact we
can be about the value of the parameter.
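The arithmetic of the example above, plus a sketch of where a margin like 4.2 typically comes from. The sample standard deviation and sample size in the second half are invented for illustration; the text does not give them:

```python
import math
import statistics

# The interval from the text: an estimate of 12.45, plus or minus 4.2.
estimate = 12.45
margin = 4.2
print(round(estimate - margin, 2), round(estimate + margin, 2))  # 8.25 16.65

# How a margin like 4.2 can arise (sample SD and size are made up here):
# margin = critical value * standard error = z * (s / sqrt(n)).
s, n = 10.7, 25
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
print(round(z * s / math.sqrt(n), 1))  # 4.2
```

Notice that a larger sample (bigger n) shrinks the standard error, which is why more data gives a narrower, more exact confidence interval.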
B. The second approach to inferential statistics is called the null hypothesis testing
approach. In this approach, we suppose that a population parameter has a certain
value. For example, we might adopt the null hypothesis that the population mean
is zero. Then, we test whether, if the population parameter were zero, we would
be likely to get the data that we actually got in our sample.
1. For example, we might have a sample with a mean of 12.45 and a standard
deviation of 2. And, the question might be: Is the population mean equal
to zero?
a) Appropriate statistical tests can be done, and we could say that there
is less than a 5% chance that data with a mean of 12.45 and a
standard deviation of 2 would come from a population with an
actual mean of zero.
b) Since it is therefore very unlikely that the sample we have came
from a population with a mean of zero, we can reject the null
hypothesis and say that the mean of the population is probably not
zero.
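The "appropriate statistical test" here would be a one-sample t-test, which we will cover later. A minimal sketch of just the test statistic, using the summary numbers from the example (the text does not give a sample size, so n = 16 is an assumption made for illustration):

```python
import math

# Summary statistics from the example in the text.
sample_mean = 12.45
sample_sd = 2.0
null_mean = 0.0    # the null hypothesis: population mean is zero
n = 16             # sample size invented for illustration

# One-sample t statistic: how many standard errors the sample mean
# lies from the hypothesized population mean.
t = (sample_mean - null_mean) / (sample_sd / math.sqrt(n))
print(round(t, 2))  # 24.9
```

A t of 24.9 is enormously larger than the critical value (roughly 2.1 for 15 degrees of freedom at the 5% level), so with these numbers we would reject the null hypothesis, just as the text concludes.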
C. We will consider estimation/hypothesis testing about a variety of different
parameters that are useful in understanding real data in psychology. We will also
learn how to use SPSS to get both the descriptive and inferential statistics that we
need.
This is basically everything we will be doing in this class. We will cover a few other things
along the way that will help us understand some of the issues that are covered in this outline, but
just about everything can fit somewhere in this scheme of things. Simple, no?