Math 143
Fall 2010
1 Random Variables
A random variable is a variable whose value is
1. numerical
and
2. determined by the outcome of some random process
The distribution of a random variable describes the possible values of a random variable, and the probability of having
those values. That is, it answers the two questions:
• What values?
• With what frequency?
We use random variables to help us quantify the results of experiments for the purpose of analysis. Random variables
are generally denoted using capital letters, like X, Y, Z.
1.1 Some handy “rules”
In the rules below, a and b are numbers and X and Y are random variables.
1. Linear Transformations
(a) E(aX) = a E(X)
(b) E(X + b) = E(X) + b
(c) Var(aX) = a² Var(X)
(d) Var(X + b) = Var(X)
(e) If X is normal, then aX and X + b are also normal. (This is a special property of normal distributions
and a few others.)
2. Sums
(a) E(X + Y ) = E(X) + E(Y ) for any random variables X and Y
(b) Var(X + Y ) = Var(X) + Var(Y ) only in the special case that X and Y are independent.
(c) If X and Y are normal and independent, then X + Y is normal too.
3. Differences (these rules follow from the ones above)
(a) E(X − Y ) = E(X) − E(Y )
(b) Var(X − Y ) = Var(X) + Var(Y ) only in the special case that X and Y are independent.
• this is the one that gets people; the variance of a difference is the sum of the variances.
(c) If X and Y are normal and independent, then X − Y is normal too.
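The rules above can be checked numerically. As a sketch, the simulation below draws two independent normal variables (using the means and standard deviations 45, 3 and 60, 4 from the example table) and compares sample results to what the rules predict; the sample size and seed are illustrative choices.

```python
# Sanity-checking the rules by simulation (a sketch, not a proof).
import random

random.seed(1)
N = 200_000
a, b = 2.0, 5.0

X = [random.gauss(45, 3) for _ in range(N)]   # X drawn independently of Y
Y = [random.gauss(60, 4) for _ in range(N)]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# Rule 1: E(aX + b) = a E(X) + b and Var(aX + b) = a^2 Var(X)
m_lin = mean([a * x + b for x in X])    # close to 2 * 45 + 5 = 95
v_lin = var([a * x + b for x in X])     # close to 4 * 9 = 36

# Rules 2-3: for independent X and Y, the sum AND the difference
# both have variance Var(X) + Var(Y) = 9 + 16 = 25
v_sum = var([x + y for x, y in zip(X, Y)])
v_diff = var([x - y for x, y in zip(X, Y)])

print(m_lin, v_lin, v_sum, v_diff)
```

Note that the variance of the difference comes out near 25, not near 9 − 16; this is the point flagged in rule 3(b).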
Examples. Suppose we have random variables with means, standard deviations, and variances given in the following
table.
variable   mean   variance   st dev
X          45     9          3
Y          60     16         4
Last updated: November 3, 2010
We can use this information to compute the means, variances, and standard deviations of other variables, including
the ones in the following table. (The results for variance and standard deviation are only correct if the variables are
independent.)
variable   mean              variance      st dev
2X + 5     2 · 45 + 5 = 95   4 · 9 = 36    √36 = 2 · 3 = 6
X + Y      45 + 60 = 105     9 + 16 = 25   √(3² + 4²) = √25 = 5
X − Y      45 − 60 = −15     9 + 16 = 25   √(3² + 4²) = √25 = 5

1.2 Binomial Random Variables
1. The Binom(n, p) Situation
(a) n trials. A random process is repeated n times (n known in advance). Each repetition is called a trial.
(b) Two outcomes. Each trial has two outcomes (often called success and failure).
(c) Constant probability of success. The probability of success is the same for each trial and denoted p.
(d) Independent Trials. Each trial is independent of the others.
(e) Count successes. The Binom(n, p) random variable counts the number of successes.
2. When n = 1, we call the binomial random variable a Bernoulli random variable.
3. Formulas for expected value and variance of a Binom(n, p) random variable
• Expected value: np
• Variance: np(1 − p)
• Standard deviation: √(np(1 − p))
4. There is a computational formula for determining binomial probabilities, but we will usually use a computer for
this. (In StatCrunch: Stat > Calculators > Binomial)
5. Binom(n, p) ≈ Norm(np, √(np(1 − p)))
(a) The approximation is better when n is larger and when p is closer to 1/2.
(b) Rule of Thumb: The approximation is good enough for us if np ≥ 10 and n(1 − p) ≥ 10. That is, if we
would expect at least 10 successes and at least 10 failures.
(c) This result follows from a mathematical theorem called the Central Limit Theorem.
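The quality of the approximation can be checked directly. The sketch below compares an exact binomial probability with its normal approximation for illustrative values n = 100, p = 1/2 (which satisfy the rule of thumb); the +0.5 continuity correction is a standard refinement not discussed above.

```python
# Exact binomial probability vs. the normal approximation (a sketch).
import math

n, p = 100, 0.5   # np = 50 and n(1 - p) = 50, so the rule of thumb holds

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binom(n, p)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(X <= 55) exactly, and via Norm(np, sqrt(np(1 - p)))
exact = sum(binom_pmf(k, n, p) for k in range(56))
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
approx = normal_cdf(55.5, mu, sigma)   # +0.5 is a continuity correction

print(round(exact, 4), round(approx, 4))
```

The two numbers agree to about three decimal places, which is consistent with the StatCrunch calculator mentioned above.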
6. The binomial distribution is often a good model for sampling for a proportion:
(a) Each person in the sample is a trial
(b) They give one of two responses (yes/no, male/female, smoker/non-smoker, etc.)
(c) p is the true (unknown) probability in the population
(d) Need to be sure that the people in our sample form an independent, representative sample.
2 Sampling Distributions
1. A parameter is a number that describes a population or a process.
• Example: the percentage of Americans wearing black socks today.
• Usually the value of a parameter is unknown. (Do you know what percentage of Americans are wearing
black socks today?)
• Favorite parameters: population proportion (p), population mean (µ), population standard deviation (σ)
2. A statistic is a number that describes a sample.
• Example: the percentage of people in my sample who are wearing black socks today.
• If you know the data, you know the statistic – it can be calculated from the data.
• Favorite statistics: sample proportion (p̂), sample mean (x), sample standard deviation (s).
3. A sampling distribution is the distribution of a statistic.
• If we use randomization to collect our data (random sampling, random assignment to treatment groups,
etc.), then the statistic is a random variable (why?)
• The study of sampling distributions allows us to learn what our sample statistic tells us about a population
parameter.
4. The sampling distribution for a sample count (number of observations in a data set with a certain property) is
≈ Binom(n, p) ≈ Norm(np, √(np(1 − p)))
• Can only use this if we have a random sampling method.
• Binomial approximation good enough if population is much larger than sample.
• Normal approximation good enough if in addition np ≥ 10, and n(1 − p) ≥ 10.
5. The sampling distribution for a sample proportion (proportion of observations in a data set with a certain
property)
“Binom(n, p)”/n ≈ Norm(p, √(p(1 − p)/n))
• Approximation good enough in same conditions as above.
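As a sketch, the spread of simulated sample proportions can be checked against √(p(1 − p)/n); the true proportion, sample size, and repetition count below are illustrative.

```python
# Simulating the sampling distribution of p-hat (a sketch).
import math
import random

random.seed(2)
p, n = 0.3, 400      # true population proportion and sample size
reps = 5_000         # number of simulated samples

phats = []
for _ in range(reps):
    successes = sum(1 for _ in range(n) if random.random() < p)
    phats.append(successes / n)

mean_phat = sum(phats) / reps
sd_phat = math.sqrt(sum((x - mean_phat) ** 2 for x in phats) / reps)

print(round(mean_phat, 3))                   # close to p = 0.3
print(round(sd_phat, 4))                     # close to the formula value
print(round(math.sqrt(p * (1 - p) / n), 4))  # sqrt(p(1 - p)/n) = 0.0229
```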
6. The sampling distribution for the sample mean (x)
• expected value = µ (unbiased)
• standard deviation = σ/√n
• distribution ≈ Norm(µ, σ/√n) provided the sample size is large enough (Central Limit Theorem).
  ◦ Assumes random sampling method and large population (at least 10 times larger than the sample).
  ◦ Exact when population is normal.
  ◦ Very good approximation even for small samples if population is unimodal and roughly symmetric.
  ◦ Approximation requires larger samples if population is skewed.
  ◦ In most cases, n = 30 is plenty big.
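The Central Limit Theorem claim above can be seen in simulation: even for a skewed population, means of samples of size 30 have mean µ and standard deviation close to σ/√n. The exponential population and repetition count below are illustrative choices.

```python
# CLT sketch: sample means from a skewed (exponential) population.
import math
import random

random.seed(3)
mu = sigma = 1.0     # an Exponential(1) population has mean 1 and sd 1
n, reps = 30, 10_000

means = []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    means.append(sum(sample) / n)

grand_mean = sum(means) / reps
sd_means = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / reps)

print(round(grand_mean, 3))   # close to mu = 1
print(round(sd_means, 3))     # close to sigma/sqrt(30) = 0.183
```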
3 Inference for Proportions and Means
3.1 The Scenarios
The inference procedures in this section deal with four scenarios. (We’ll learn more scenarios later.)
• 1-proportion
In this scenario we use a random sample of a categorical variable to learn about the proportion in the population
that have a specific value (generically called success) for this categorical variable.
Data: A single categorical variable.
Example: Estimate what proportion of Grand Rapids residents will vote in the upcoming election.
• 2-proportion
In this scenario, we use samples from two populations to investigate how different the proportions are in the two
populations.
Data: A categorical response variable plus a categorical variable indicating which population each observation
is from.
Example: Will the voting percentage be higher among women than among men?
• 1-sample t
In this scenario we use a random sample of a quantitative variable to learn about the mean value of that
measurement in the population.
Data: A single quantitative variable.
Example: Estimate the mean length of lemur tails.
• 2-sample t
In this scenario we use samples from two populations to compare the mean of some quantitative measurement
between the two populations.
Data: A quantitative response variable plus a categorical variable indicating which population each observation
is from.
Example: Are male lemur tails longer than female lemur tails (on average)?
Observational and Experimental designs
We can use the 2-proportion and 2-sample t procedures to analyze the results of either observational or experimental
designs. In an experiment with two treatments (e.g., drug vs. placebo), the two populations correspond to the two
treatment options. In an observational study, we may classify subjects based on two variables without either of them
being a treatment.
Recall that in an experiment, the values of at least one explanatory variable are determined by the researchers (usually
using randomness).
3.2 The Big Picture
These four scenarios lead to very similar procedures. Seeing the similarities can help you remember how things go.
3.2.1 Confidence Intervals
All four confidence intervals have the general form
data value ± (critical value) · SE
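This general recipe can be sketched in code for the 1-proportion case; the counts below are illustrative, and z∗ = 1.96 is the usual critical value for 95% confidence.

```python
# The "data value +/- (critical value) * SE" recipe, 1-proportion case.
import math

successes, n = 240, 400   # e.g., 240 of 400 sampled people say "yes"
z_star = 1.96             # critical value for 95% confidence

p_hat = successes / n
se = math.sqrt(p_hat * (1 - p_hat) / n)   # SE for one proportion
lower, upper = p_hat - z_star * se, p_hat + z_star * se

print(round(p_hat, 3), round(lower, 3), round(upper, 3))
```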
The table below shows how to obtain these values for each scenario.

scenario        data value   ±   critical value   SE (standard error)
1-proportion    p̂            ±   z∗               √(p̂(1 − p̂)/n)
2-proportion    p̂1 − p̂2      ±   z∗               √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)
1-sample t      x            ±   t∗               s/√n
2-sample t      x1 − x2      ±   t∗               √(s1²/n1 + s2²/n2)

3.2.2 Hypothesis Testing
All hypothesis tests follow our 4-step outline:
1. State the Null and Alternative Hypotheses.
2. Calculate a test statistic.
All the test statistics for these four scenarios have a similar form:
t or z = (data value − hypothesized value) / (SD or SE)
3. Calculate a p-value.
The p-value is the probability of obtaining a test statistic at least as extreme as the one calculated from the
data, assuming that the null hypothesis is true.
4. Draw a conclusion.
This step is exactly the same for all tests. You just need to understand what a p-value means in the context of
your particular test.
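The four steps can be sketched for a 1-proportion test; the counts and hypothesized value below are illustrative, and a two-sided alternative is assumed.

```python
# A sketch of the 4-step outline for a 1-proportion z-test.
import math

# 1. Hypotheses: H0: p = 0.5 vs. Ha: p != 0.5
p0 = 0.5
successes, n = 230, 400

# 2. Test statistic: (data value - hypothesized value) / SD.
#    The SD uses the hypothesized p0, since we compute assuming H0.
p_hat = successes / n
sd = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sd

# 3. p-value: chance of a statistic at least this extreme under H0
def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_value = 2 * (1 - normal_cdf(abs(z)))   # two-sided

# 4. Conclusion: a small p-value is evidence against H0
print(round(z, 2), round(p_value, 4))
```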
scenario        H0                     test statistic   data value          SD or SE
1-proportion    H0: p = p0             z                p̂                   SD = √(p0(1 − p0)/n)
2-proportion    H0: p1 − p2 = 0        z                p̂1 − p̂2             SE = √(p̂(1 − p̂)/n1 + p̂(1 − p̂)/n2)
1-sample t      H0: µ = µ0             t                x − µ0              SE = s/√n
2-sample t      H0: µ1 − µ2 = µdiff    t                (x1 − x2) − µdiff   SE = √(s1²/n1 + s2²/n2)

Note: For the 2-proportion test, we use a pooled estimate for p (p̂ = (x1 + x2)/(n1 + n2)) in our formula for the standard error because the null hypothesis tells us the two proportions are the same.

Last updated: November 3, 2010