Statistical concepts and related topics
required for STAT 139, Fall 2016
September 3, 2016
Suggested review sources:
• Ch. 1-5, Introduction to the Practice of Statistics, Moore, McCabe, [and Craig if 5th
ed.].
• Section on Matrix Algebra (Appendix A) in A. Sen and M. Srivastava, "Regression Analysis: Theory, Methods, and Applications" (or any other textbook used in math courses that you took).
1 Randomness and Probability
An outcome is the result of an experiment or other situation involving uncertainty. The
set of all possible outcomes of a probability experiment is called a sample space.
An event is any collection of outcomes of an experiment.
A probability provides a quantitative description of the likely occurrence of a particular event. Probability is conventionally expressed on a scale from 0 to 1.
Conditional Probability. The usual notation for "event A occurs given that event B has occurred" is "A|B" (A given B). The symbol | is a vertical line and does not imply division. P(A|B) denotes the probability that event A will occur given that event B has occurred already. A rule that can be used to determine a conditional probability from unconditional probabilities is:

P(A|B) = P(A ∩ B) / P(B).
Probability rules and laws for any events A and B:
• Addition Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Multiplication Rule: P(A ∩ B) = P(A)P(B|A)
Two events A and B are independent if the occurrence of one of the events gives us no information about whether or not the other event will occur; that is, the events have no influence on each other. Formally, P(A ∩ B) = P(A)P(B), or P(A|B) = P(A).
Two events A and B are disjoint (or mutually exclusive) if P(A ∩ B) = 0.
Bayes’ Theorem is a result that allows new information to be used to update the conditional
probability of an event. In its simplest form, for two events A and B it is
P(A|B) = P(B|A)P(A) / P(B).
Related Topics: Law of Total Probability, Venn diagrams.
2 Random Variables and their Probability Distributions
A random variable is a function that associates a unique numerical value with every
outcome of an experiment.
There are two types of random variable: discrete and continuous.
A random variable has either an associated probability distribution (discrete random variable) or a probability density function (PDF) (continuous random variable).
More formally, the probability distribution of a discrete random variable X is a function
which gives the probability that the random variable takes value xi , P (X = xi ), for each
i = 1, 2, . . . , n.
All random variables (discrete and continuous) have a cumulative distribution function
(CDF). It is a function giving the probability that the random variable X is less than or
equal to x, for every value x: F (x) = P (X ≤ x), for −∞ < x < ∞.
The probability density function of a continuous random variable X is a function which can
be integrated to obtain the probability that the random variable takes a value in a given
interval. More formally, the probability density function, f(x), of a continuous random variable X is the derivative of the cumulative distribution function: f(x) = d/dx F(x). Then,

P(a < X ≤ b) = F(b) − F(a) = ∫_a^b f(x) dx.
The expected value of a random variable X indicates its average or central value. It may
be referred to as population mean if X describes some population. The usual notation is
E(X) = µ.
Rules for Expected Values:
• If X is a random variable and a and b are fixed numbers and Z = a + bX, then
µZ = a + bµX
• If X and Y are random variables and Z = X + Y , then
µZ = µX + µY
Expected value of a discrete random variable that takes values X1, X2, ..., Xn with probabilities p1, p2, ..., pn is

µX = X1 p1 + X2 p2 + · · · + Xn pn = Σ_{i=1}^n Xi pi.
The (population) variance of a random variable is a non-negative number which gives an idea of how widely spread the values of the random variable are likely to be; the larger the variance, the more scattered the observations on average. The variance of the random variable X is defined to be Var(X) = E[(X − E(X))²] = σX². Taking the square root of the variance gives the standard deviation, i.e., √Var(X) = σX.
Rules for Variances:
• If X is a random variable and a and b are fixed numbers and Z = a + bX, then
σZ² = b² σX²
• If X and Y are independent random variables, then
σ_{X+Y}² = σX² + σY²
σ_{X−Y}² = σX² + σY²
• If X and Y have correlation ρXY, then
σ_{X+Y}² = σX² + σY² + 2ρXY σX σY
σ_{X−Y}² = σX² + σY² − 2ρXY σX σY
Variance of a discrete random variable that takes values X1, X2, ..., Xn with probabilities p1, p2, ..., pn is

σX² = (X1 − µX)² p1 + (X2 − µX)² p2 + · · · + (Xn − µX)² pn = Σ_{i=1}^n (Xi − µX)² pi.
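The mean and variance of a discrete random variable can be computed directly from these definitions; the values and probabilities below are arbitrary, chosen only for illustration.

```python
# A three-valued discrete random variable (made-up distribution).
values = [1, 2, 3]
probs = [0.2, 0.5, 0.3]  # must sum to 1

# mu = sum of x_i * p_i
mu = sum(x * p for x, p in zip(values, probs))
# sigma^2 = sum of (x_i - mu)^2 * p_i
var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
print(mu, var)  # mean 2.1, variance 0.49
```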
Types of discrete probability distributions: Bernoulli (Bern(p)), Binomial (Binom(n, p)).
Types of continuous probability distributions: Normal or Gaussian (N(µ, σ²)), Uniform (Unif(a, b)).
Normal distribution models (some) continuous random variables. A continuous random variable X, taking all real values in the range −∞ < x < ∞, is said to follow a Normal distribution with parameters µ and σ², X ∼ N(µ, σ²), if it has PDF

f(x) = (1/(σ√(2π))) exp{−(1/2)((x − µ)/σ)²}.

Here, E(X) = µ and Var(X) = σ². Also, (X − µ)/σ ∼ N(0, 1), a standard Normal distribution.
The 68-95-99.7 rule. About 68% of the values from a normal distribution lie within 1
standard deviation of the mean. Similarly, about 95% of the values lie within 2 standard
deviations of the mean. Nearly all (99.7%) of the values lie within 3 standard deviations of
the mean.
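A quick simulation can illustrate the 68-95-99.7 rule; the number of draws and the seed below are arbitrary choices.

```python
import random

# Draw from the standard Normal N(0, 1) and count the fraction of
# values within 1, 2, and 3 standard deviations of the mean.
random.seed(139)
draws = [random.gauss(0, 1) for _ in range(100_000)]
n = len(draws)
within1 = sum(abs(x) <= 1 for x in draws) / n
within2 = sum(abs(x) <= 2 for x in draws) / n
within3 = sum(abs(x) <= 3 for x in draws) / n
print(within1, within2, within3)  # roughly 0.68, 0.95, 0.997
```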
The median is the value halfway through the ordered data set, below and above which
there lies an equal number of data values. It is generally a good descriptive measure of the
location which works well for skewed data, or data with outliers.
The mode is the most frequently occurring value in a set of discrete data. There can be
more than one mode if two or more values are equally common.
A correlation coefficient between X and Y , usually denoted by ρXY , is a number between
−1 and 1 which measures the degree to which the two random variables are linearly related.
If there is perfect linear relationship with positive slope between the two variables, we have
a correlation coefficient of 1; if there is positive correlation, whenever one variable has a
high (low) value, so does the other. If there is a perfect linear relationship with negative
slope between the two variables, we have a correlation coefficient of -1; if there is negative
correlation, whenever one variable has a high (low) value, the other has a low (high) value.
A correlation coefficient of 0 means that there is no linear relationship between the variables.
3 Describing data
The sample mean is an estimator available for estimating the population mean. It is a
measure of location, commonly called the average, often symbolized X̄. In a sample of size
n with observed values X1, X2, ..., Xn,

X̄ = (X1 + X2 + · · · + Xn)/n = (1/n) Σ_{i=1}^n Xi.
Sample variance (S 2 ) is a measure of the spread of or dispersion within a set of sample
data. In a sample of size n with observed values X1, X2, ..., Xn,

S² = Σ_{i=1}^n (Xi − X̄)² / (n − 1).
Sample standard deviation, S, is the square root of the sample variance.
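These sample statistics can be computed directly from the formulas; the data values below are arbitrary.

```python
# Sample mean, sample variance (n - 1 denominator), and sample
# standard deviation for a small made-up data set.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = s2 ** 0.5
print(xbar, s2, s)  # mean 5.0, variance 32/7
```

Dividing by n − 1 rather than n makes S² an unbiased estimator of the population variance.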
Percentiles are values that divide a sample of data into one hundred groups containing
(as far as possible) equal numbers of observations. For example, 30% of the data values lie
below the 30th percentile.
Quantiles are a set of ’cut points’ that divide a sample of data into groups containing (as
far as possible) equal numbers of observations.
Shapes of the distribution:
• Skewness is defined as asymmetry in the distribution of the sample data values. Values
on one side of the distribution tend to be further from the ’middle’ than values on the
other side. For skewed data, the usual measures of location will give different values,
for example, mode < median < mean would indicate positive (or right) skewness.
• Symmetry is implied when data values are distributed in the same way above and
below the middle of the sample.
Sample correlation coefficient for two observed samples, X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Yn ,
is calculated in the following way:
rXY = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^n (Xi − X̄)² · Σ_{i=1}^n (Yi − Ȳ)² )
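The sample correlation coefficient can be computed term by term from this formula; the two samples below are made up for illustration.

```python
# Sample correlation r_XY for two small paired samples.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n
# Numerator: sum of products of deviations.
num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
# Denominator: square root of the product of the sums of squares.
den = (sum((x - xbar) ** 2 for x in xs)
       * sum((y - ybar) ** 2 for y in ys)) ** 0.5
r = num / den
print(r)  # about 0.775: a fairly strong positive linear association
```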
The Law of Large Numbers. As the number of independent observations in the sample
increases, the sample mean X̄ of the observed values eventually approaches the mean of the
population µ.
The Central Limit Theorem states that whenever a random sample of size n is taken from any distribution with mean µ and variance σ², then the sample mean X̄ will be approximately normally distributed with mean µ and variance σ²/n. The larger the value of the sample size n, the better the approximation to the normal.
This is very useful when it comes to inference. For example, it allows us (if the sample
size is fairly large) to use hypothesis tests which assume normality even if our data appear
non-normal. This is because the tests use the sample mean X̄, which the Central Limit
Theorem tells us will be approximately normally distributed.
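A short simulation can illustrate the Central Limit Theorem; the sample size, number of replications, and seed below are arbitrary. Unif(0, 1) has mean 1/2 and variance 1/12, so sample means of size n should concentrate around 1/2 with variance roughly 1/(12n).

```python
import random

random.seed(139)
n = 30  # size of each sample
# Compute 20,000 sample means, each from n Unif(0, 1) draws.
means = [sum(random.random() for _ in range(n)) / n for _ in range(20_000)]

grand_mean = sum(means) / len(means)
var_of_means = sum((m - grand_mean) ** 2 for m in means) / len(means)
print(grand_mean, var_of_means)  # near 0.5 and near 1/360
```

A histogram of `means` would also look approximately bell-shaped, even though the underlying uniform distribution is flat.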
Related topics: Histogram, Boxplot, quartiles, dispersion, range, outlier.
4 Matrix Algebra
Matrices. A matrix is an array of numbers. For example,
A = [ a11  a12  a13 ]
    [ a21  a22  a23 ]

is a matrix of size 2 x 3. The rows of A are [a11 a12 a13] and [a21 a22 a23], and the columns of A are

[ a11 ]   [ a12 ]        [ a13 ]
[ a21 ] , [ a22 ] , and  [ a23 ].

We call aij the (i, j) entry of A.
A (column) vector is an n x 1 matrix, or a matrix with n rows and one column. For example,

X = [ a1 ]
    [ a2 ]
    [ a3 ]

is a vector of length 3.
The elements along the north-west to south-east diagonal of a square matrix A comprise the
diagonal of A. When the only non-zero elements of a matrix are in its diagonal, it is called
a diagonal matrix; e.g.,

A = [ 1 0 0 ]
    [ 0 2 0 ]
    [ 0 0 3 ]
The matrix is called an identity matrix if it is diagonal and all its diagonal elements are equal to 1. It is denoted by I (or In for the n x n case).
Addition and Multiplication. If A and B are matrices of the same size then the (i, j)
entry of A + B, called matrix addition, is aij + bij , or, [aij ] + [bij ] = [aij + bij ] .
We define a scalar multiplication for a matrix componentwise: c [aij ] = [caij ].
If A = [aij] is p x m and B = [bij] is m x n, then we define the matrix multiplication AB = C = [cij] as the p x n matrix with entries

cij = Σ_{k=1}^m aik bkj.
We can verify, for all matrices A, B, C, and D of compatible sizes (say A and B are m x n and C and D are n x q), that:
1. A(BC) = (AB)C (associativity);
2. A(C + D) = AC + AD;
3. (A + B)C = AC + BC;
4. A(aC) = a(AC) for all scalars a;
5. IB = BI = B.
It is generally NOT the case that AB = BA.
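The entrywise definition of matrix multiplication, and the failure of commutativity, can be checked with a short sketch; the two matrices below are chosen arbitrarily.

```python
# Matrix multiplication from the definition c_ij = sum_k a_ik b_kj,
# using plain nested lists; A is p x m, B is m x n.
def matmul(A, B):
    p, m, n = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(n)]
            for i in range(p)]

A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]

AB = matmul(A, B)
BA = matmul(B, A)
print(AB)  # [[2, 1], [4, 3]]
print(BA)  # [[3, 4], [1, 2]] -- AB != BA in general
```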
Transpose and Inverse. Let A = [aij ]. The transpose of A, denoted AT , is the matrix [bij ]
where bij = aji .
If, for a square matrix A, there exists a matrix B such that AB = BA = I, then B is called
the inverse of A and is denoted as A−1 .
We can find A−1 for an n x n matrix A as follows:
(a) Create the "double matrix" [A|I].
(b) Row reduce A down to I (if possible) while simultaneously performing the same operations on the right side of the "double matrix". The I on the RHS will be transformed into A−1: [I|A−1].
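Steps (a) and (b) can be sketched in code; this is a minimal Gauss-Jordan implementation with only row swaps for zero pivots (no numerical safeguards), and it assumes A is invertible. The example matrix is arbitrary.

```python
# Invert an n x n matrix via the [A|I] double matrix (Gauss-Jordan).
def invert(A):
    n = len(A)
    # (a) Build the double matrix [A | I].
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    # (b) Row reduce the left half down to I.
    for col in range(n):
        # Swap in a row with a non-zero pivot if needed.
        pivot = next(r for r in range(col, n) if abs(M[r][col]) > 1e-12)
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so the pivot entry is 1.
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    # The right half is now A^{-1}.
    return [row[n:] for row in M]

A = [[4.0, 7.0],
     [2.0, 6.0]]
Ainv = invert(A)
print(Ainv)  # approximately [[0.6, -0.7], [-0.2, 0.4]]
```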
Systems of Linear Equations. A linear equation in variables x1 , x2 , . . . , xn is an equation
of the form a1 x1 + a2 x2 + . . . + an xn = b. A collection of such equations is called a system:
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
...................................
am1 x1 + am2 x2 + · · · + amn xn = bm .
A solution to this system is a set of values for x1 , x2 , . . . xn such that each equation is
satisfied.
A system of linear equations is dependent if one equation in the system is a linear combination of the other equations.
The system of linear equations shown above can be represented in matrix notation as AX = B, where

A = [ a11  a12  ...  a1n ]      X = [ x1 ]      B = [ b1 ]
    [ a21  a22  ...  a2n ]          [ x2 ]          [ b2 ]
    [ ...  ...  ...  ... ]          [ .. ]          [ .. ]
    [ am1  am2  ...  amn ]          [ xn ]          [ bm ]
Solutions to such systems can be written as X = A−1 B, as long as A is square and A−1 exists (i.e., the system is not dependent).
Material partially borrowed from
http://www.stats.gla.ac.uk/steps/glossary and
http://faculty.etsu.edu/gardnerr/2250/notes.htm.