Chapter 1
Statistics Review
(Reminding you of what you, hopefully, already
know)
1
What are we doing here?
Statistics!
A system of managing, understanding and analyzing data. To
understand the theories we develop, we need a way to
objectively categorize and summarize information (data). Then
we need methods for judging and interpreting data. Statistics
give us all of that (kinda handy, huh?)
2
Some terms to start with…
Observations – the “individual” we are studying. In
psychology, this is usually people or animals, but not always.
Some synonyms – “Experimental Units” (that’s from the
book), subjects, participants, individuals, people, rats,
guinea pigs…
Variable – the characteristics that we are measuring.
3
Types of data
Quantitative data – information about observations that are
numerical, both in value and in nature.
Qualitative data – information that is not numerical in
nature (although it can be in name). There is a field of analysis
called Qualitative Data Analysis that goes far beyond simple
categories.
4
Classification of Data
We can make finer distinctions than the above
1. Nominal – qualitative: simply names the individual or a
quality.
2. Ordinal – qualitative: values of the variable are ordered,
but there are not necessarily equal spaces between the
values.
3. Interval – quantitative. Scores are ordered and equally
spaced, but there is no meaningful zero.
4. Ratio – quantitative. Interval, but with a meaningful zero.
5
Two More Classifications…
Discrete – There is a set number of possible scores and there
can be nothing between them. Nominal and ordinal are both
discrete.
Continuous – There is a theoretically infinite number of
possible scores, even if there is a restricted range. Ratio data is
always continuous, interval data may or may not be, but
typically is.
Assignment 3 – p4, items 1.1, 1.3
6
Populations and Samples
Population – the set of all possible observations, existing and/or
conceptual. Described by parameters.
1. People (everybody)
2. NCCU Students
3. All college students
4. People with depression
5. Countries
6. Businesses
Sample – a subset of data from a population. Described by statistics.
1. People in a telephone survey, dorm residents at NCCU
2. Students in this class, Psi Chi members, football players
3. NCCU Students, Seniors at all NC universities
4. Students getting treated at student health, patients of Dr. Jones
5. North, South, Central American countries
6. Fortune 500 companies
7
Random sampling – observations are selected in such a way
that each member of the population has an equal chance of
selection.
1. Student ID numbers are written on Ping-Pong balls and
selected Lotto-style
2. Computer draws of listing of all college students
3. Random draws from a hat containing the names of
everybody diagnosed with depression.
4. Cut out the names of all the countries listed in the NY Times
Almanac, random draws.
5. File a request under the Freedom of Information Act to get
the names of all corporations from the IRS, random draw.
Assignment 4 – p8, items 1.9, 1.13, 1.15
8
Intro to Graphical Stats
Drawing graphs of our data enhances our understanding
immeasurably.
[Figure: "Relative Frequency Distribution" – y-axis "Frequency" (0 to 250), x-axis score bins from 60 to 150, plus "More".]
9
The x-axis shows the range of scores, the y-axis
shows the relative frequency of the occurrence of
those scores. This type of graph describes the
frequencies of random variables. That is, some
variable in a population that can take on the
values described in the graph.
10
Area under the curve:
The space enclosed by the graph represents all
the possible scores and how often they occur. If
we say that the total area is equal to 1, then the
area associated with a specific value gives the
proportion (or percentage) of the population with
that score.
11
Consider this distribution:
[Figure: a frequency distribution – y-axis "Frequency" (0 to 120), x-axis score bins from 70 to 150.]
12
Two New Terms:
Random variable: Whatever variable we are
measuring. “Random” because it can take on
some range of values. The likelihood that it takes
on a specific value is determined by the
probability distribution. The probability
distribution is the same thing as its relative
frequency. See all that stuff about the area under
the curve above.
13
The problem with populations…
We can’t measure most populations so their
probability distributions are unknown.
Taking a sample allows us to infer information
about the population. We assume that population
parameters are about equal to sample statistics.
14
First steps with samples
Before we start going crazy with data analysis,
we need to just have a look at our data.
76, 80, 84, 85, 86, 87, 87, 88, 89, 93, 94, 95, 95,
96, 97, 98, 98, 99, 100, 101, 104, 106, 106, 106,
109, 112, 115, 126, 133, 139
15
Stem and Leaf Plot
 7 | 6
 8 | 04567789
 9 | 345567889
10 | 0146669
11 | 25
12 | 6
13 | 39
16
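If you want to see this mechanically (Python isn't part of the course; this is just a sketch), a stem-and-leaf plot splits each score into a tens-digit stem and a ones-digit leaf:

```python
# Build a stem-and-leaf plot from the sample data on slide 15.
data = [76, 80, 84, 85, 86, 87, 87, 88, 89, 93, 94, 95, 95,
        96, 97, 98, 98, 99, 100, 101, 104, 106, 106, 106,
        109, 112, 115, 126, 133, 139]

stems = {}
for x in data:
    # Stem = tens part (x // 10), leaf = ones digit (x % 10).
    stems.setdefault(x // 10, []).append(x % 10)

for stem in sorted(stems):
    leaves = ''.join(str(leaf) for leaf in sorted(stems[stem]))
    print(f"{stem:2d} | {leaves}")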
Histogram
[Figure: histogram of the sample – y-axis "Frequency" (0 to 12), x-axis bins 80, 90, 100, 110, 120, 130, 140, More.]
Assignment 5 – p14, items 1.17, 1.19, 1.21
17
Numerical Descriptions of Data
(Note: “descriptions” is not an accident, this branch of stats is
called descriptive statistics)
Finding the center of a data set:
The mean – the arithmetic average, the sum of all of the scores,
divided by the total number of scores.
Writing this with statistical notation:
$\bar{x} = \frac{\sum x_i}{n}$
18
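In code, the formula above is a one-liner (the scores here are made up, just for illustration):

```python
# Mean = sum of all scores divided by the number of scores.
scores = [76, 80, 84, 85, 86, 87]  # illustrative data
mean = sum(scores) / len(scores)
print(mean)  # 83.0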
Understanding the breadth or spread of a dataset.
Range – the distance between the lowest and highest scores.
Simply subtract the low score from the high score.
Variance – the average of the squares of the deviations.
$\sigma^2 = \frac{SS}{N}$
where
$SS = \sum (x_i - \mu)^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{N}$
19
The standard deviation – the square root of the variance. One
of the very nice things is that a score's distance from the mean,
expressed in standard deviations (a z-score), can be negative.
Thus, it not only tells how far a score is from the mean, but
whether it's above or below the mean.
Remember that the above formula is for populations. In the real
world we have to estimate the parameters. The sample statistic
is calculated the same way except we use n-1 instead of N.
$s = \sqrt{\frac{SS}{n-1}}$
where
$SS = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$
20
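A sketch of both formulas side by side (the data are made up; note the only difference between the population and sample versions is dividing SS by N versus n − 1):

```python
import math

# Spread of a small illustrative dataset.
scores = [4, 8, 6, 5, 3, 7]
mean = sum(scores) / len(scores)

# SS = sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in scores)

# The computational shortcut gives the same SS.
ss_short = sum(x * x for x in scores) - sum(scores) ** 2 / len(scores)

pop_variance = ss / len(scores)                 # sigma^2 = SS / N
sample_sd = math.sqrt(ss / (len(scores) - 1))   # s = sqrt(SS / (n - 1))
print(ss, pop_variance, sample_sd)
```

Either form of SS works; the shortcut just avoids computing the mean first.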
Interpreting the Standard Deviation
1. For any dataset, regardless of its shape, at least 75% of
the observations will fall within 2 standard deviations of
the mean.
2. For most datasets of moderate size (about 25 or more
observations) with a mound shaped distribution, 95% of
the observations will fall within 2 standard deviations of
the mean.
So, when you hear that someone’s score was –2.5 standard
deviations from the mean, you know a lot.
21
More on parameters and statistics
Although we perform our analyses on samples, we are
interested in understanding populations. In other words, we
are more interested in parameters than statistics.
However, we can’t measure parameters so we estimate
them with statistics. The sample mean is an estimate of the
population mean. The sample variance is an estimate of the
population variance.
Assignment 6 p24, items 1.23, 1.25, 1.29
22
The Normal Probability Distribution
The normal distribution is symmetrical around the mean
with its spread determined by its standard deviation.
The standard normal distribution is a normal distribution
with a mean of 0 and a standard deviation of 1. This is also
known as a z distribution.
You can use the z table to find out the proportion of data
between two scores. Look at the portion of the chart on
page 30.
Now, why is this useful?
Assignment 7 – p32, items 1.35, 1.37
23
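What the z table tabulates can be sketched with the standard library's error function (this is an aside, not the course's method; in class you use the printed table):

```python
import math

def phi(z):
    """Cumulative proportion of the standard normal below z --
    the kind of value the z table gives you."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Proportion of scores between z = -1.96 and z = +1.96:
between = phi(1.96) - phi(-1.96)
print(round(between, 3))  # about 0.95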
Sampling Distributions (not sample distributions)
Imagine this: You gather a sample of students and give
them a statistics test. Now you can calculate a mean and
standard deviation. We use this to estimate the population
mean and standard deviation.
Question: If you do this just once, how likely are you to hit
the actual mean and standard deviation of the population?
So, what is the solution? Take a bunch of samples.
But, now we have a bunch of means and standard deviations.
So what do we do with them?
They are scores, just like anything else. In fact, they now make
up a new kind of distribution. Sampling distributions are
probability distributions of sample statistics.
24
More detail
Take a sample of a specific n, say 20 observations, on a
specific variable, say income. Now you can calculate the mean,
say it's $37,455.08.
OK, great. Put the sample back in the population (i.e.,
sample with replacement). Do the same thing again, also with
n=20, again on income. Find the mean again.
Do this an infinite number of times. This is the sampling
distribution.
Now we can’t actually take an infinite number of samples.
We can’t because, well, infinite is kind of a lot. It’s even more
than a lot.
So, we mathematically estimate the parameters of the
distribution.
25
Standard Error
The standard deviation of the sampling distribution is
called the standard error of the statistic (for means, the
standard error of the mean), usually referred to as just the
standard error. This is something we are going to use a lot.
Why is it called the standard error?
26
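The sample-with-replacement procedure from the previous slides can be simulated directly. This is a sketch with a made-up population (normal, mean 100, SD 15); the point is that the SD of the pile of sample means, the standard error, comes out close to $\sigma / \sqrt{n}$:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# A hypothetical population of 10,000 scores.
population = [random.gauss(100, 15) for _ in range(10_000)]

# Take many samples of n = 20 with replacement; keep each sample's mean.
sample_means = [statistics.mean(random.choices(population, k=20))
                for _ in range(5_000)]

# SD of the sampling distribution (the standard error) vs. sigma / sqrt(n).
se_observed = statistics.stdev(sample_means)
se_theory = statistics.pstdev(population) / 20 ** 0.5
print(round(se_observed, 2), round(se_theory, 2))
```

We obviously can't take an infinite number of samples, but 5,000 is enough to see the two numbers agree closely.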
The Central Limit Theorem
Given sufficiently large sample sizes the shape of
the sampling distribution is normal, regardless of
the shape of the population distribution.
27
Making & assessing parameter estimates
If the mean of the sampling distribution is equal to the
parameter, the statistic is considered an unbiased estimate.
If not, it is considered biased.
The sample mean and sample variance are considered
unbiased.
However, this does not mean that they are accurate
estimates. We need a way of assessing accuracy.
28
Confidence Intervals
Recall two things: Sampling distributions are normal.
Given a normal distribution, 95% of the observations
fall within about 2 standard deviations of the mean.
It's actually 1.96 rather than 2. So, since the standard
deviation of the sampling distribution is the standard error,
we add and subtract 1.96 standard errors from the sample mean.
This range of numbers is called the “95% Confidence
Interval” or 95% CI.
29
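As a sketch, with made-up sample numbers (n = 25 scores, mean 100, s = 15):

```python
import math

# 95% CI = sample mean +/- 1.96 standard errors.
n, mean, s = 25, 100.0, 15.0     # hypothetical sample summary
se = s / math.sqrt(n)            # estimated standard error = 3.0
margin = 1.96 * se
ci = (mean - margin, mean + margin)
print(round(ci[0], 2), round(ci[1], 2))  # 94.12 105.88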
What Does the CI Mean?
Assuming that we are using .95 as the confidence
coefficient, we are 95% sure that we have captured the true
mean inside that range.
That is, 95% of the time this procedure will give us a
range of data that includes the true mean.
Widening the interval (e.g., moving to 99% or 100%
confidence) will increase our confidence, but decrease our
precision.
Shrinking the interval (e.g., moving to 75% or 50%
confidence) will increase our precision but decrease our confidence.
Assignment 8 – p44, 1.45, 1.47, 1.51
30
Review of Hypothesis Testing
First, more definitions:
1. Theory – theories are big, very big. They are broad,
powerful statements that explain a set of
observations. They are based on evidence and are
discarded when they no longer work.
2. Hypothesis – hypotheses are smaller. They are
derived from theories and are very specific, narrow
statements about what we are testing with a given
experiment or set of analyses.
31
The test of a hypothesis makes use of the
following concepts:
1. The null hypothesis, H0, is a statement, usually
indicating no effect, which we assume is true.
2. An alternative or research hypothesis, Ha or H1, makes a
statement counter to the null. Usually, this is what we
“want” to find evidence for.
3. A test statistic, a number calculated from the data that
we use to make a decision about H0 and H1.
4. The rejection region is a set of values. If the test
statistic falls inside this region, then we reject H0.
32
Testing a hypothesis about a population mean
The z-test – compare the sample data to a hypothesized
population mean.
$z = \frac{\bar{y} - \mu_0}{\sigma_{\bar{y}}}$
where:
$\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}$
33
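Plugging numbers into the z formula (all values made up; σ is assumed known, which is the whole premise of the z-test):

```python
import math

# One-sample z-test: z = (ybar - mu0) / (sigma / sqrt(n)).
ybar, mu0 = 104.0, 100.0   # sample mean, hypothesized population mean
sigma, n = 15.0, 36        # known population SD, sample size

se = sigma / math.sqrt(n)  # standard error of the mean = 2.5
z = (ybar - mu0) / se
print(z)  # 1.6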
The problem with the z-test – we rarely know the
population standard deviation.
Enter the one-sample t-test:
y  0
t
s
n
Assignment 9 – p57 1.56, 1.57, 1.61, 1.77
34
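The one-sample t works the same way except s comes from the sample itself (data here are hypothetical):

```python
import math
import statistics

# One-sample t-test: t = (ybar - mu0) / (s / sqrt(n)).
sample = [98, 104, 110, 95, 102, 107, 99, 105]  # hypothetical scores
mu0 = 100.0

n = len(sample)
ybar = statistics.mean(sample)
s = statistics.stdev(sample)   # divides SS by n - 1, as on slide 20
t = (ybar - mu0) / (s / math.sqrt(n))
print(round(t, 3))
```

Compare the result against the critical value from the t table with n − 1 degrees of freedom.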
Testing the difference between two means.
Say we are comparing the effects of talk therapy on
panic attacks. One group of subjects gets talk therapy
while the other gets no therapy.
We set =0.05. What does this mean?
Next, we get our means, standard deviations and
estimate the standard error. We use this to calculate t.
We look this value up in the t table. If our value is
bigger than the critical value, then we reject the null.
35
Calculation
$t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
36
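A sketch of the pooled two-sample t, using made-up group data (say, weekly panic-attack counts for the therapy vs. no-therapy groups from slide 35). The pooled variance $s_p^2$ combines the two groups' sums of squares:

```python
import math
import statistics

# Hypothetical data for two independent groups.
g1 = [4, 6, 5, 7, 5, 6]   # talk therapy
g2 = [8, 7, 9, 6, 8, 10]  # no therapy
n1, n2 = len(g1), len(g2)

# Sum of squared deviations within each group.
ss1 = sum((x - statistics.mean(g1)) ** 2 for x in g1)
ss2 = sum((x - statistics.mean(g2)) ** 2 for x in g2)

# Pooled variance: combined SS over combined degrees of freedom.
sp2 = (ss1 + ss2) / (n1 + n2 - 2)

t = (statistics.mean(g1) - statistics.mean(g2)) \
    / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 3))
```

Look the result up in the t table with n1 + n2 − 2 degrees of freedom.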
There are two minor differences when doing this
with a computer package. First, the package does
all the calculating for you. Second, it reports the
exact p, or the observed significance level.
So, if we set  to 0.05, then we are actually
looking for a p less than 0.05, usually written as
p<0.05.
Thus, when we get the exact p, we are looking for
it to be less than 0.05.
37
Assumptions
When doing a statistical test, there are certain
assumptions about the data that must be met first.
1. The sampled population (or populations) is normal.
2. The sample (or samples) is randomly and
independently sampled
3. The variances of however many groups are being
tested are equal. This is called homogeneity of
variance.
Assignment 10 – p63 items 1.71, 1.73, 1.77
38
Checking for heteroscedasticity (unequal
variances)
If the first two assumptions are true, then we can test the
third with an F-test. The F distribution is a non-normal
distribution whose shape is dependent on its degrees of
freedom. Unlike the t-distribution, it has two degrees of
freedom. It is used when we are sampling from two
populations (or two parameters from the same population).
In this case we are sampling variances from our two
populations. The degrees of freedom are n1-1 and n2-1.
Turn to page 73.
39
Calculation
$F = \frac{s_1^2}{s_2^2}$
where the numerator is the larger sample variance and the
denominator is the smaller.
Assignment p79, 1.83
40
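A sketch of the variance-ratio F, reusing the hypothetical two-group data from the t-test example (larger sample variance on top, with its degrees of freedom listed first):

```python
import statistics

# Hypothetical data for two independent groups.
g1 = [4, 6, 5, 7, 5, 6]
g2 = [8, 7, 9, 6, 8, 10]

# Sample variances (n - 1 in the denominator).
v1, v2 = statistics.variance(g1), statistics.variance(g2)

# F = larger sample variance / smaller sample variance.
F = max(v1, v2) / min(v1, v2)
# Degrees of freedom: numerator group first, each n - 1.
df = (len(g2) - 1, len(g1) - 1) if v2 > v1 else (len(g1) - 1, len(g2) - 1)
print(round(F, 3), df)
```

Compare F against the critical value from the F table at those degrees of freedom; a large F suggests the variances are not equal.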