Summary of Formulas and Concepts
Descriptive Statistics (Ch. 1-4)
Definitions
Population: The complete set of numerical information on a particular quantity
in which an investigator is interested. We assume a population consists of
N values.
Sample: An observed subset of population values. We assume a sample consists
of n values.
Graphic Summaries
Dotplot - A graphic to display the original data. Draw a number line, and put
a dot on it for each observation. For identical or really close observations,
stack the dots.
[Example dotplot: columns of stacked dots above a number line with tick marks at 40, 50, 60, 70, 80, 90.]
Histogram - A graphic that displays the shape of numeric data by grouping it
in intervals.
1. Choose evenly spaced categories
2. Count the number of observations in each group
3. Draw a bar graph with the height of each bar equal to the number of
observations in the corresponding interval.
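The counting in steps 1-3 is mechanical; as an illustration, here is a minimal Python sketch (the data and bin edges are invented for the example):

```python
# Minimal histogram-counting sketch; data and bin edges are invented.
data = [47, 52, 55, 61, 63, 64, 68, 71, 72, 75, 78, 81, 86]
edges = [40, 50, 60, 70, 80, 90]   # evenly spaced interval boundaries

counts = [0] * (len(edges) - 1)
for x in data:
    for i in range(len(edges) - 1):
        # x falls in [edges[i], edges[i+1]); the last interval also
        # includes its right endpoint
        if edges[i] <= x < edges[i + 1] or (i == len(edges) - 2 and x == edges[-1]):
            counts[i] += 1

for i, c in enumerate(counts):
    print(f"[{edges[i]}, {edges[i+1]}): {'*' * c}")   # bar height = count
```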
Stem-and-leaf plot - Similar to a dotplot. Data are grouped according to their
leading digits (the stems), and the last digit is used as a plotting symbol
(the leaf, like a dot in the dotplot).
In the display below, the leftmost column is a cumulative count of observations
from each end of the data toward the middle; the bracketed number is how many
observations are in the middle row. The column of digits before the bar holds
the stems, and the digits after the bar are the leaves.
   2    4 | 57
   5    5 | 133
  13    5 | 56777899
  23    6 | 0222334444
  39    6 | 5667777788888899
 (18)   7 | 000000111222222334
  38    7 | 5555555666668899999999
  16    8 | 0022333444
Numeric Summaries of Data
Suppose that we have n observations, labeled x_1, x_2, ..., x_n. Then ∑_{i=1}^n x_i means

    ∑_{i=1}^n x_i = x_1 + x_2 + x_3 + ... + x_n

Some other relations are:

    ∑_{i=1}^n f(x_i) = f(x_1) + f(x_2) + f(x_3) + ... + f(x_n), for any function f,

    ∑_{i=1}^n c x_i = c ∑_{i=1}^n x_i, for any constant c,

    ∑_{i=1}^n (a x_i + b y_i) = a ∑_{i=1}^n x_i + b ∑_{i=1}^n y_i, for any constants a and b.
Measures of Location - numeric summaries of the center of a distribution.
Mean (or average)

    µ = (1/N) ∑_{i=1}^N x_i        (population)

    x̄ = (1/n) ∑_{i=1}^n x_i        (sample)
Median - the middle observation
The middle observation of the sorted data if n is odd, otherwise the
average of the two middle values.
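As a quick numeric check of both definitions, Python's standard statistics module computes them directly (the data are invented):

```python
import statistics

data = [2, 4, 4, 5, 7, 9]        # invented example, n = 6 (even)

print(sum(data) / len(data))     # sample mean (1/n) * sum: 5.166...
print(statistics.mean(data))     # same value via the library

# n is even, so the median averages the two middle sorted values, 4 and 5
print(statistics.median(data))   # 4.5
```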
Measures of Dispersion - numeric summaries of the spread or variation of a
distribution.
Variance and Standard Deviation

    σ² = (1/N) ∑_{i=1}^N (x_i − µ)²          (population)

    s² = (1/(n − 1)) ∑_{i=1}^n (x_i − x̄)²    (sample)

Population and sample standard deviations (σ and s) are just the square
roots of these quantities.
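The N versus n − 1 divisor matters in software too; the standard library exposes both versions, as this sketch with invented data shows:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]    # invented; mean is 5

print(statistics.pvariance(data))  # population variance (divide by N): 4.0
print(statistics.pstdev(data))     # population std deviation: 2.0

print(statistics.variance(data))   # sample variance (divide by n - 1): ~4.571
print(statistics.stdev(data))      # sample std deviation: ~2.138
```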
Interpreting σ
Chebyshev’s rule: for any population,
at least 75% of the observations lie within 2σ of µ,
at least 89% of the observations lie within 3σ of µ,
and in general, at least 100(1 − 1/m²)% of the observations lie within m × σ of the
mean µ.
IQR (Interquartile Range): The distance between the (n + 1)/4-th and
3(n + 1)/4-th smallest observations in an ordered dataset. These two values are
called the first and third quartiles (Q1 and Q3).
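Software packages use several slightly different quartile conventions; the sketch below follows the (n + 1)/4 rule above and, as one common convention, interpolates when the position is fractional (the helper name quartile is just for this example):

```python
def quartile(sorted_data, k):
    """k-th quartile by the (n + 1)/4 rule, with linear interpolation."""
    n = len(sorted_data)
    pos = k * (n + 1) / 4           # 1-based position in the ordered data
    lo = int(pos)                   # whole part of the position
    frac = pos - lo                 # fractional part
    if lo < 1:
        return sorted_data[0]
    if lo >= n:
        return sorted_data[-1]
    return sorted_data[lo - 1] + frac * (sorted_data[lo] - sorted_data[lo - 1])

data = sorted([3, 7, 8, 5, 12, 14, 21, 13, 18])   # invented, n = 9
q1, q3 = quartile(data, 1), quartile(data, 3)     # 2.5th and 7.5th observations
print(q1, q3, q3 - q1)                            # 6.0 16.0 10.0 (IQR)
```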
Measure of symmetry: Skewness

    skewness = [ ∑_{i=1}^n (x_i − x̄)³ / n ] / s³

Negative means left skewed, 0 means symmetric, positive means right
skewed.
Measure of heavy tails: Kurtosis

    kurtosis = [ ∑_{i=1}^n (x_i − x̄)⁴ / n ] / s⁴ − 3

A normal distribution has a kurtosis of 3, so subtracting 3 makes the
interpretation relative to the normal: positive values mean a sharper peak, and
negative values mean a flatter top, than a normal distribution.
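Both formulas translate directly into code; a minimal sketch using the sample standard deviation s defined above (invented data with a long right tail):

```python
import statistics

def skewness(data):
    n, x_bar = len(data), sum(data) / len(data)
    s = statistics.stdev(data)                    # sample s, n - 1 divisor
    return sum((x - x_bar) ** 3 for x in data) / n / s ** 3

def excess_kurtosis(data):
    n, x_bar = len(data), sum(data) / len(data)
    s = statistics.stdev(data)
    return sum((x - x_bar) ** 4 for x in data) / n / s ** 4 - 3

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 12]   # one large value on the right
print(skewness(data))                     # positive: right skewed
print(excess_kurtosis(data))              # relative to the normal's 3
```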
Box-and-whisker plot: A graphic that summarizes the data using the median
and quartiles, and displays outliers. Good for comparing several groups of
data.

[Figure: a box plot. The box runs from the first quartile to the third quartile
with a line at the median; a whisker extends to the largest value below
Q3 + 1.5 IQR, and observations beyond that are plotted individually as
outliers (o).]
Probability (Ch. 5)
Definitions and Set Theory:
Random experiment: A process leading to at least two possible outcomes with uncertainty as to which will occur.
Basic outcome: A possible outcome of the random experiment.
Sample space: The set of all basic outcomes.
Event: A set of basic outcomes from the sample space. An event is said to occur if
any one of its constituent basic outcomes occurs.
Combining events: let A and B be two events.

                 Technical symbol   Pronounced   Meaning
  Union          A ∪ B              A or B       A occurs or B occurs or both occur
  Intersection   A ∩ B              A and B      A occurs and B occurs
  Complement     Ā                  not A        A does not occur
Venn Diagrams:

[Figure: five Venn diagrams, shading A ∩ B, shading A ∪ B, shading the
complement Ā, showing A and B mutually exclusive (disjoint), and showing
A, B, C collectively exhaustive (together covering the whole sample space).]
Probability Postulates:
1. If A is any event in the sample space S, 0 ≤ P(A) ≤ 1
2. Let A be an event in S and let O_i denote the basic outcomes. Then
P(A) = ∑_{O_i ∈ A} P(O_i), where the notation implies that the summation
extends over all the basic outcomes in A.
3. P(S) = 1
Probability rules for combining events:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive.
P(Ā) = 1 − P(A)
Conditional Probability:

    P(A|B) = P(A ∩ B) / P(B),    provided P(B) > 0.

    P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
Independence:
Two events are Statistically Independent if and only if
P(A ∩ B) = P(A)P(B)
or equivalently P(A|B) = P(A) and P(B|A) = P(B).
General case: events E1, E2, ..., Ek are independent if and only if the product rule

    P(E1 ∩ E2 ∩ ... ∩ Ek) = P(E1) P(E2) ... P(Ek)

holds for the full collection and for every subcollection of the events.
Bivariate Probability
Probabilities of outcomes for bivariate events:

            B1            B2            B3
  A1   P(A1 ∩ B1)   P(A1 ∩ B2)   P(A1 ∩ B3)   P(A1)
  A2   P(A2 ∩ B1)   P(A2 ∩ B2)   P(A2 ∩ B3)   P(A2)
       P(B1)        P(B2)        P(B3)

P(A1 ∩ B2) is the probability of A1 and B2 occurring together.
P(A1) = P(A1 ∩ B1) + P(A1 ∩ B2) + P(A1 ∩ B3) is the marginal probability that A1
occurs.
If we think of A1, A2 as a group of attributes A, and B1, B2, B3 as a group of attributes
B, then A and B are independent only if every one of {A1, A2} is independent of
every one of {B1, B2, B3}.
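To make the table concrete, this sketch stores an invented joint probability table as a dictionary, recovers the marginals by summing across the other attribute, and checks the independence condition cell by cell:

```python
from itertools import product

# Invented joint probabilities P(Ai ∩ Bj); they sum to 1.
joint = {
    ("A1", "B1"): 0.10, ("A1", "B2"): 0.20, ("A1", "B3"): 0.10,
    ("A2", "B1"): 0.15, ("A2", "B2"): 0.30, ("A2", "B3"): 0.15,
}
A_vals, B_vals = ["A1", "A2"], ["B1", "B2", "B3"]

# Marginals: sum the joint probabilities over the other attribute.
P_A = {a: sum(joint[a, b] for b in B_vals) for a in A_vals}
P_B = {b: sum(joint[a, b] for a in A_vals) for b in B_vals}
print(P_A, P_B)   # ≈ {'A1': 0.4, 'A2': 0.6} and {'B1': 0.25, 'B2': 0.5, 'B3': 0.25}

# Independence requires P(Ai ∩ Bj) = P(Ai) P(Bj) for every cell.
print(all(abs(joint[a, b] - P_A[a] * P_B[b]) < 1e-12
          for a, b in product(A_vals, B_vals)))   # True for this table
```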
Discrete Random Variables (Ch. 6-7)
Definitions
Random Variable: (r.v.) A variable that takes on numerical values determined by
the outcome of a random experiment.
Discrete Random Variable: A r.v. that can take on no more than a countable
number of values.
Continuous Random Variable: A r.v. that can take any value in an interval.
Notation: An upper case letter (e.g. X) will represent a r.v.; a lower case letter (e.g.
x) will represent one of its possible values.
Discrete Probability Distributions
The probability function, P_X(x), of a discrete r.v. X gives the probability that X
takes the value x:

    P_X(x) = P(X = x)

where the function is evaluated at all possible values of x.
Properties:
1. P_X(x) ≥ 0 for any value x
2. ∑_x P_X(x) = 1

Cumulative probability function, F_X(x_0), of a r.v. X:

    F_X(x_0) = P(X ≤ x_0) = ∑_{x ≤ x_0} P_X(x).

Properties:
1. 0 ≤ F_X(x) ≤ 1 for any x
2. If a < b, then F_X(a) ≤ F_X(b).
Expectation of Discrete Random Variables
Expected value of a discrete r.v.:

    E(X) = µ_X = ∑_x x P_X(x).

For any function g(X),

    E(g(X)) = ∑_x g(x) P_X(x).

Variance of a discrete r.v.:

    Var(X) = σ_X² = E((X − µ_X)²) = ∑_x (x − µ_X)² P_X(x) = E(X²) − µ_X².

The standard deviation of X is σ_X.
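These sums are easy to evaluate once the probability function is stored as a table; a minimal sketch with an invented distribution:

```python
# Invented probability function P_X(x) of a discrete r.v. X.
pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

mu = sum(x * p for x, p in pmf.items())               # E(X) ≈ 1.7
var = sum((x - mu) ** 2 * p for x, p in pmf.items())  # E((X - mu)^2)
ex2 = sum(x ** 2 * p for x, p in pmf.items())         # E(X^2)

print(mu, var)            # ≈ 1.7 and ≈ 0.81
print(ex2 - mu ** 2)      # ≈ 0.81 again, via the shortcut E(X^2) - mu^2
```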
Plug-In Rules: let X be a r.v., and a and b constants. Then

    E(a + bX) = a + b E(X)
    Var(a + bX) = b² Var(X).

This only works for linear functions.
Jointly Distributed Discrete Random Variables
Joint Probability Function: Suppose X and Y are r.v.’s. Their joint probability
function gives the probability that simultaneously X = x and Y = y:

    P_{X,Y}(x, y) = P({X = x} ∩ {Y = y})

Properties:
1. P_{X,Y}(x, y) ≥ 0 for any pair (x, y)
2. ∑_x ∑_y P_{X,Y}(x, y) = 1.

Marginal probability function:

    P_X(x) = ∑_y P_{X,Y}(x, y).

Conditional probability function:

    P_{Y|X}(y|x) = P_{X,Y}(x, y) / P_X(x)

Independence: X and Y are independent if and only if

    P_{X,Y}(x, y) = P_X(x) P_Y(y)

for all possible (x, y) pairs.
Expectation: Let X and Y be r.v.’s, and g(X, Y) any function. Then

    E(g(X, Y)) = ∑_x ∑_y g(x, y) P_{X,Y}(x, y).

Conditional Expectation: Let X and Y be r.v.’s, and suppose we know the
conditional distribution of X for Y = y, labeled P_{X|Y}(x|y). Then

    E(X|Y = y) = ∑_x x P_{X|Y}(x|y).

Covariance: If E(X) = µ_X and E(Y) = µ_Y,

    Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = ∑_x ∑_y (x − µ_X)(y − µ_Y) P_{X,Y}(x, y)
              = E(XY) − µ_X µ_Y = [ ∑_x ∑_y x y P_{X,Y}(x, y) ] − µ_X µ_Y
If two r.v.’s are independent, their covariance is zero. The converse is not necessarily
true.
Correlation:

    ρ_XY = Cor(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

Always −1 ≤ ρ_XY ≤ 1, and ρ_XY = ±1 if and only if Y = a + bX (a, b constants).
Plug-in rules: Let X and Y be r.v.’s, and a, b constants. Then

    E(aX + bY) = a E(X) + b E(Y)
    Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
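A short sketch that evaluates the covariance and correlation sums for an invented joint distribution of X and Y (the helper E is hypothetical, defined only for this example):

```python
import math

# Invented joint probability function P_{X,Y}(x, y).
joint = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.6}

def E(f):
    """Expectation of f(X, Y) under the joint distribution."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: (x - mu_x) * (y - mu_y))          # Cov(X, Y)
var_x = E(lambda x, y: (x - mu_x) ** 2)
var_y = E(lambda x, y: (y - mu_y) ** 2)

print(cov)                                  # ≈ 0.11; E(XY) - mu_x*mu_y matches
print(cov / math.sqrt(var_x * var_y))       # correlation, between -1 and 1
```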
Binomial distribution
Bernoulli Trials:
A sequence of repeated experiments consists of Bernoulli trials if:
1. The result of each trial is either a success or failure.
2. The probability p of a success is the same for all trials.
3. The trials are independent.
If X is the number of successes in n Bernoulli trials, X is a Binomial Random
Variable. It has probability function:

    P_X(x) = (n choose x) p^x (1 − p)^(n−x)

where (n choose x) counts the number of ways of getting x successes in n trials.
The formula for it is

    (n choose x) = n! / (x! (n − x)!)

where n! = n × (n − 1) × (n − 2) × ... × 2 × 1.
Mean and Variance: E(X) = np, Var(X) = np(1 − p).
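The probability function can be evaluated with the standard library's math.comb, which computes (n choose x); a quick sketch with invented parameters (the helper name binom_pmf is just for this example):

```python
import math

def binom_pmf(x, n, p):
    """P_X(x) = (n choose x) p^x (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3
probs = [binom_pmf(x, n, p) for x in range(n + 1)]
print(sum(probs))                               # ≈ 1, as a pmf must
print(sum(x * q for x, q in enumerate(probs)))  # ≈ n*p = 3.0
print(binom_pmf(3, n, p))                       # P(X = 3) ≈ 0.2668
```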
Continuous Random Variables (Ch. 7)
Probability Distributions
Probability density function: A function f_X(x) of the continuous r.v. X with the
following properties:
1. f_X(x) ≥ 0 for all values of x.
2. P(a ≤ X ≤ b) = the area under f_X(x) between a and b, if a < b.
3. The total area under the curve is 1.
4. The area under the curve to the left of any value x is F_X(x), the probability that
X does not exceed x.
Cumulative distribution function: Same as before.

    P(a ≤ X ≤ b) = F_X(b) − F_X(a)    (provided a < b).
Expectations, Variances, Covariances, etc.
Same rules as for discrete r.v.’s. The summation (∑) is replaced by the integral (∫),
which is not needed for this course.
Normal Distribution
Probability Density function:

    f_X(x) = (1/√(2πσ²)) e^{−(x − µ)²/(2σ²)}

for constants µ and σ such that −∞ < µ < ∞ and 0 < σ < ∞.
Mean and Variance: E(X) = µ, Var(X) = σ².
Notation: X ∼ N (µ, σ 2 ) means X is normal with mean µ and variance σ 2 .
If Z ∼ N (0, 1) we say it has a standard normal distribution.
If X ∼ N(µ, σ²) then Z = (X − µ)/σ ∼ N(0, 1). Thus

    P(a < X < b) = P( (a − µ)/σ < Z < (b − µ)/σ ) = F_Z((b − µ)/σ) − F_Z((a − µ)/σ)
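The standardization step is exactly what statistics.NormalDist (Python 3.8+) does internally; a sketch computing P(a < X < b) both ways, with invented parameters:

```python
from statistics import NormalDist

mu, sigma = 100, 15
a, b = 85, 130

X, Z = NormalDist(mu, sigma), NormalDist(0, 1)

print(X.cdf(b) - X.cdf(a))    # direct: F_X(b) - F_X(a), ≈ 0.8186
print(Z.cdf((b - mu) / sigma) - Z.cdf((a - mu) / sigma))   # same, via Z
```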
Central Limit Theorem
Let X_1, X_2, ..., X_n be n independent r.v.’s, each with identical distributions,
mean µ and variance σ². As n becomes large,

    X̄ ∼ N(µ, σ²/n)    and    ∑_{i=1}^n X_i = nX̄ ∼ N(nµ, nσ²)
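A simulation sketch of the theorem: means of n draws from a Uniform(0, 1) population (decidedly non-normal) cluster around µ = 0.5 with spread close to σ/√n:

```python
import random
import statistics

# Uniform(0, 1): mu = 0.5, sigma^2 = 1/12
n, reps = 50, 10_000
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(means))   # close to mu = 0.5
print(statistics.stdev(means))   # close to sigma/sqrt(n) ≈ 0.0408
```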
Sampling & Sampling distributions
Simple random sample: (or random sample) A method of randomly drawing n
objects which are Independent and Identically Distributed (I.I.D.).
Statistic: A function of the sample information.
Sampling distribution of a statistic: The probability distribution of the values a
statistic can take, over all possible samples of a fixed size n.
Sampling distribution of the mean: Suppose X_1, ..., X_n are a random sample
from some population with mean µ_X and variance σ_X². The sample mean is

    X̄ = (1/n) ∑_{i=1}^n X_i.

It has the following properties:
1. E(X̄) = µ_X.
2. It has standard deviation σ_X̄ = σ_X / √n.
3. If the population distribution is normal,

    X̄ ∼ N(µ_X, σ_X̄²) = N(µ_X, σ_X²/n).

4. If the population distribution is not normal, but n is large, then (3) is roughly
true.
Sampling distribution of a proportion: Suppose the r.v. X is the number of
successes in a binomial sample of n trials, whose probability of success is p. The
sample proportion is

    p̂ = X/n

It has the following properties:
1. E(p̂) = p.
2. It has standard deviation σ_p̂ = √(p(1 − p)/n).
3. If n is large (np(1 − p) > 9, or roughly n ≥ 40),

    p̂ ∼ N(p, σ_p̂²) = N(p, p(1 − p)/n).
Point Estimation
Estimator: A random variable that depends on the sample information and whose
realizations provide approximations to an unknown population parameter.
Estimate: A specific realization of an estimator.
Point estimator: An estimator that is a single number.
Point estimate: A specific realization of a point estimator.
Bias: Let θ̂ be an estimator of the parameter θ. The bias in θ̂ is

    Bias(θ̂) = E(θ̂) − θ.

If the bias is 0, θ̂ is an unbiased estimator.
Efficiency: Let θ̂_1 and θ̂_2 be two estimators of θ, based on the same sample. Then
θ̂_1 is more efficient than θ̂_2 if

    Var(θ̂_1) < Var(θ̂_2).
Interval Estimation
Confidence Interval: Let θ be an unknown parameter. Suppose that from sample
information, we can find random variables A and B such that
P(A < θ < B) = 1 − α.
If the observed values are a and b, then (a, b) is a 100(1 − α)% confidence interval
for θ. The quantity (1 − α) is called the probability content of the interval.
Student’s t distribution: Given a random sample of n observations with mean X̄
and standard deviation s, from a normal population with mean µ, the random
variable

    T = (X̄ − µ) / (s/√n)

follows the Student’s t distribution with (n − 1) degrees of freedom. For n > 30,
the t distribution is quite close to a N(0, 1) distribution.
Data                                    Parameter   100(1 − α)% C.I.
N(µ, σ²), σ² known                      µ           x̄ ± z_{α/2} · σ/√n
mean µ, σ² unknown, n > 30              µ           x̄ ± z_{α/2} · s/√n
N(µ, σ²), σ² unknown                    µ           x̄ ± t_{n−1,α/2} · s/√n
Binomial(n, p),                         p           p̂ ± z_{α/2} · √(p̂(1 − p̂)/n)
  np(1 − p) > 9, or roughly n ≥ 40
n matched pairs,                        µX − µY     d̄ ± t_{n−1,α/2} · s_d/√n
  difference ∼ N(µX − µY, σ²)
2 independent samples,                  µX − µY     x̄ − ȳ ± z_{α/2} · √(s_x²/n_x + s_y²/n_y)
  means µX, µY,
  variances unknown, n > 30
2 independent samples,                  µX − µY     x̄ − ȳ ± t_{n*,α/2} · √(s_x²/n_x + s_y²/n_y)
  means µX, µY,
  variances unknown
2 independent samples,                  pX − pY     p̂_x − p̂_y ± z_{α/2} · √(p̂_x(1 − p̂_x)/n_x + p̂_y(1 − p̂_y)/n_y)
  Binomial(n_X, p_X), Binomial(n_Y, p_Y)
Notes for the table:
1. All quantities in the C.I. column are either known constants or observed sample quantities.
2. P(Z > z_{α/2}) = α/2 for Z ∼ N(0, 1).
3. P(T > t_{n−1,α/2}) = α/2 for T ∼ Student’s t with (n − 1) d.f.
4. s, s_x, s_y, s_d are observed sample standard deviations corresponding to x_i, x_i, y_i, d_i =
x_i − y_i respectively.
5. p̂, p̂_x, p̂_y are the observed sample proportions corresponding to x_i, x_i, y_i respectively.
6. d̄ is the sample mean corresponding to d_i = x_i − y_i.
7. n, n_x, n_y are the total, x and y sample sizes.
8. n* = (s_x²/n_x + s_y²/n_y)² / [ (s_x²/n_x)²/(n_x − 1) + (s_y²/n_y)²/(n_y − 1) ]
Also note: The text gives a different formula for comparing two means with small sample
sizes. It requires that the two variances be equal, which may not be the case. Unless
you’re sure the variances are equal, it’s safer to use the approximation given here (the formula
with n*). If you are sure that the variances are equal, using the book’s formula is fine.
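Evaluating the n* interval needs a t quantile, which the Python standard library lacks; this sketch assumes SciPy is available and uses invented summary statistics:

```python
import math
from scipy import stats   # assumed available, for the t quantile

# Invented summary statistics for two independent samples
x_bar, sx, nx = 5.2, 1.1, 12
y_bar, sy, ny = 4.6, 1.7, 15
alpha = 0.05

se2x, se2y = sx**2 / nx, sy**2 / ny
# n* from note 8 (it need not be an integer)
n_star = (se2x + se2y) ** 2 / (se2x**2 / (nx - 1) + se2y**2 / (ny - 1))

t_crit = stats.t.ppf(1 - alpha / 2, n_star)
half = t_crit * math.sqrt(se2x + se2y)
print(x_bar - y_bar - half, x_bar - y_bar + half)   # the 95% C.I. endpoints
```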
Estimating the sample size: If you want a 100(1 − α)% interval of ±L (i.e. length
2L), choose n so that

  situation                 n
  normal, σ known           n = z_{α/2}² σ² / L²
  Bernoulli, worst case     n = 0.25 z_{α/2}² / L²
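For example, both sample-size formulas in code, using the standard library's inverse normal CDF for z_{α/2} (target precision and σ invented):

```python
import math
from statistics import NormalDist

alpha, L = 0.05, 0.1                       # 95% interval, half-width 0.1
z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{alpha/2} ≈ 1.96

sigma = 2.0                                # known sigma (invented)
print(math.ceil(z**2 * sigma**2 / L**2))   # normal case: 1537
print(math.ceil(0.25 * z**2 / L**2))       # Bernoulli worst case: 97
# round up, since n must be a whole number of observations
```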
Hypothesis Testing
Null Hypothesis (H0 ): The hypothesis we assume to be true unless there is sufficient evidence to the contrary.
Alternative Hypothesis (H1 ): The hypothesis we test the null against. If there
is evidence that H0 is false, we accept H1 .
Type I Error: Rejecting a true H0 .
Type II Error: Not rejecting a false H0 .
Significance Level: P(reject H0 | H0 true) = P(type I error).
Power: The probability of rejecting a null hypothesis that is false. Note that this
depends on the true value of the parameter.
P-value: The smallest significance level at which the null hypothesis can be rejected.
It measures how likely data as extreme as the observed data would be if H0 were true.
Notes for the following table (In addition to the comments for CI’s):
1. The first three tests are examples of one-sided (> and < alternatives) and two-sided (≠
alternative) tests. The remaining tests all have a > alternative, but are easily adapted
to either of the other two alternatives.
2. In the last test,

    p̂_0 = (n_x p̂_x + n_y p̂_y) / (n_x + n_y)
3. In all the tests, we are comparing the unknown parameter (such as µ, p or µX − µY )
to constants (µ0 , p0 , and D0 ).
Hypothesis tests with significance level α

Data                               H0                 H1             reject H0 if
N(µ, σ²), σ² known                 µ = µ0 or µ ≤ µ0   µ > µ0         (x̄ − µ0)/(σ/√n) > z_α
same                               µ = µ0 or µ ≥ µ0   µ < µ0         (x̄ − µ0)/(σ/√n) < −z_α
same                               µ = µ0             µ ≠ µ0         (x̄ − µ0)/(σ/√n) not in (−z_{α/2}, z_{α/2})
mean µ, σ² unknown, n > 30         µ = µ0 or µ ≤ µ0   µ > µ0         (x̄ − µ0)/(s/√n) > z_α
N(µ, σ²), σ² unknown, n < 30       µ = µ0 or µ ≤ µ0   µ > µ0         (x̄ − µ0)/(s/√n) > t_{n−1,α}
Binomial(n, p),                    p = p0 or p ≤ p0   p > p0         (p̂ − p0)/√(p0(1 − p0)/n) > z_α
  np(1 − p) > 9, or roughly n ≥ 40
n matched pairs,                   µX − µY = D0 or    µX − µY > D0   (d̄ − D0)/(s_d/√n) > t_{n−1,α}
  difference ∼ N(µX − µY, σ²)        µX − µY ≤ D0
Hypothesis tests with significance level α (continued)

Data                               H0                 H1             reject H0 if
2 independent samples,             µX − µY = D0 or    µX − µY > D0   (x̄ − ȳ − D0)/√(s_x²/n_x + s_y²/n_y) > z_α
  means µX, µY,                      µX − µY ≤ D0
  variances unknown,
  n_x > 30, n_y > 30
2 independent normal samples,      µX − µY = D0 or    µX − µY > D0   (x̄ − ȳ − D0)/√(s_x²/n_x + s_y²/n_y) > t_{n*,α}
  means µX, µY,                      µX − µY ≤ D0
  variances unknown
2 independent samples,             pX − pY = 0 or     pX − pY > 0    (p̂_x − p̂_y)/√(p̂_0(1 − p̂_0)(1/n_x + 1/n_y)) > z_α
  Binomial(n_x, p_x) and             pX − pY ≤ 0
  Binomial(n_y, p_y)
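As a closing illustration, here is the last test in the table in code, with invented counts; the same pattern, swapping in the appropriate statistic and critical value, covers the other rows:

```python
import math
from statistics import NormalDist

# Invented data: successes out of trials in each sample
x_succ, nx = 58, 100
y_succ, ny = 45, 100
alpha = 0.05

p_x, p_y = x_succ / nx, y_succ / ny
p0 = (nx * p_x + ny * p_y) / (nx + ny)      # pooled proportion from note 2

z_stat = (p_x - p_y) / math.sqrt(p0 * (1 - p0) * (1 / nx + 1 / ny))
z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value

print(round(z_stat, 3), round(z_alpha, 3))  # 1.839 1.645
print("reject H0" if z_stat > z_alpha else "do not reject H0")
```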