Statistical concepts and related topics required for STAT 139, Fall 2016
September 3, 2016

Suggested review sources:
• Ch. 1-5, Introduction to the Practice of Statistics, Moore, McCabe [and Craig if 5th ed.].
• Section on Matrix Algebra (Appendix A) in A. Sen and M. Srivastava, "Regression Analysis: Theory, Methods, and Applications" (or any other textbook used in math courses that you took).

1 Randomness and Probability

An outcome is the result of an experiment or other situation involving uncertainty. The set of all possible outcomes of a probability experiment is called a sample space. An event is any collection of outcomes of an experiment. A probability provides a quantitative description of the likely occurrence of a particular event. Probability is conventionally expressed on a scale from 0 to 1.

Conditional Probability. The usual notation for "event A occurs given that event B has occurred" is "A|B" (A given B). The symbol | is a vertical line and does not imply division. P(A|B) denotes the probability that event A will occur given that event B has occurred already. A rule that can be used to determine a conditional probability from unconditional probabilities is:

    P(A|B) = P(A ∩ B) / P(B).

Probability rules and laws for any events A and B:
• Addition Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Multiplication Rule: P(A ∩ B) = P(A) P(B|A)

Two events A and B are independent if the occurrence of one of the events gives us no information about whether or not the other event will occur; that is, the events have no influence on each other. Formally, P(A ∩ B) = P(A) P(B), or equivalently P(A|B) = P(A).

Two events A and B are disjoint (or mutually exclusive) if P(A ∩ B) = 0.

Bayes' Theorem is a result that allows new information to be used to update the conditional probability of an event. In its simplest form, for two events A and B it is

    P(A|B) = P(B|A) P(A) / P(B).

Related Topics: Law of Total Probability, Venn diagrams.
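The conditional-probability rules and Bayes' Theorem above can be checked numerically. A minimal sketch in Python; the diagnostic-test scenario and all probability values below are made-up illustration numbers, not part of the course material:

```python
# Hypothetical diagnostic-test example (illustrative numbers only).
# A = "has condition", B = "test is positive".
p_A = 0.01             # P(A): prevalence
p_B_given_A = 0.95     # P(B|A): sensitivity
p_B_given_notA = 0.05  # P(B|not A): false-positive rate

# Law of Total Probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' Theorem: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B

print(round(p_B, 4))          # P(test positive)
print(round(p_A_given_B, 4))  # P(condition | positive test)
```

Note how a positive result updates P(A) from 0.01 to roughly 0.16: the prior P(A) is combined with the likelihoods exactly as the theorem states.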
2 Random Variables and their Probability Distributions

A random variable is a function that associates a unique numerical value with every outcome of an experiment. There are two types of random variable: discrete and continuous. A random variable has either an associated probability distribution (discrete random variable) or probability density function (PDF) (continuous random variable).

More formally, the probability distribution of a discrete random variable X is a function which gives the probability that the random variable takes value xᵢ, P(X = xᵢ), for each i = 1, 2, . . . , n.

All random variables (discrete and continuous) have a cumulative distribution function (CDF). It is a function giving the probability that the random variable X is less than or equal to x, for every value x: F(x) = P(X ≤ x), for −∞ < x < ∞.

The probability density function of a continuous random variable X is a function which can be integrated to obtain the probability that the random variable takes a value in a given interval. More formally, the probability density function f(x) of a continuous random variable X is the derivative of the cumulative distribution function: f(x) = d/dx F(x). Then,

    P(a < X ≤ b) = F(b) − F(a) = ∫ₐᵇ f(x) dx.

The expected value of a random variable X indicates its average or central value. It may be referred to as the population mean if X describes some population. The usual notation is E(X) = µ.

Rules for Expected Values:
• If X is a random variable, a and b are fixed numbers, and Z = a + bX, then µ_Z = a + b µ_X.
• If X and Y are random variables and Z = X + Y, then µ_Z = µ_X + µ_Y.

The expected value of a discrete random variable that takes values X₁, X₂, . . . , Xₙ with probabilities p₁, p₂, . . . , pₙ is

    µ_X = X₁p₁ + X₂p₂ + . . . + Xₙpₙ = Σᵢ₌₁ⁿ Xᵢpᵢ.
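The discrete expected-value formula above can be computed directly. A short sketch; the values and probabilities are an arbitrary illustrative distribution:

```python
# A discrete random variable taking values x_i with probabilities p_i
# (illustrative numbers; the probabilities must sum to 1).
values = [1, 2, 3, 4]
probs  = [0.1, 0.2, 0.3, 0.4]

# mu_X = sum_i x_i * p_i
mu = sum(x * p for x, p in zip(values, probs))
print(mu)  # 1*0.1 + 2*0.2 + 3*0.3 + 4*0.4
```

Each value is simply weighted by its probability; the same loop with (x − mu)**2 in place of x gives the variance formula further below.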
The (population) variance of a random variable is a non-negative number which gives an idea of how widely spread the values of the random variable are likely to be; the larger the variance, the more scattered the observations on average. The variance of the random variable X is defined to be Var(X) = E[(X − E(X))²] = σ²_X. Taking the square root of the variance gives the standard deviation: √Var(X) = σ_X.

Rules for Variances:
• If X is a random variable, a and b are fixed numbers, and Z = a + bX, then σ²_Z = b²σ²_X.
• If X and Y are independent random variables, then
    σ²_{X+Y} = σ²_X + σ²_Y
    σ²_{X−Y} = σ²_X + σ²_Y
• If X and Y have correlation ρ_XY, then
    σ²_{X+Y} = σ²_X + σ²_Y + 2ρ_XY σ_X σ_Y
    σ²_{X−Y} = σ²_X + σ²_Y − 2ρ_XY σ_X σ_Y

The variance of a discrete random variable that takes values X₁, X₂, . . . , Xₙ with probabilities p₁, p₂, . . . , pₙ is

    σ²_X = (X₁ − µ_X)²p₁ + (X₂ − µ_X)²p₂ + . . . + (Xₙ − µ_X)²pₙ = Σᵢ₌₁ⁿ (Xᵢ − µ_X)²pᵢ.

Types of discrete probability distributions: Bernoulli (Bern(p)), Binomial (Binom(n, p)).
Types of continuous probability distributions: Normal or Gaussian (N(µ, σ²)), Uniform (Unif(a, b)).

The Normal distribution models (some) continuous random variables. A continuous random variable X, taking all real values in the range −∞ < x < ∞, is said to follow a Normal distribution with parameters µ and σ², written X ∼ N(µ, σ²), if it has PDF

    f(x) = (1 / (σ√(2π))) exp{ −(1/2) ((x − µ)/σ)² }.

Here, E(X) = µ and Var(X) = σ². Also, (X − µ)/σ ∼ N(0, 1), a standard Normal distribution.

The 68-95-99.7 rule. About 68% of the values from a normal distribution lie within 1 standard deviation of the mean. Similarly, about 95% of the values lie within 2 standard deviations of the mean. Nearly all (99.7%) of the values lie within 3 standard deviations of the mean.

The median is the value halfway through the ordered data set, below and above which there lies an equal number of data values.
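The 68-95-99.7 rule can be verified by simulation. A sketch assuming numpy is available; the choice of µ = 10, σ = 2, the seed, and the sample size are all arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.0                     # arbitrary illustrative parameters
x = rng.normal(mu, sigma, size=100_000)   # draws from N(mu, sigma^2)

# Fraction of draws within k standard deviations of the mean.
for k in (1, 2, 3):
    frac = np.mean(np.abs(x - mu) <= k * sigma)
    print(f"within {k} sd: {frac:.3f}")
# The three fractions should come out close to 0.683, 0.954, and 0.997.
```

With 100,000 draws the sampling error is small enough that the empirical fractions land within a few thousandths of the theoretical values.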
It is generally a good descriptive measure of location which works well for skewed data, or data with outliers.

The mode is the most frequently occurring value in a set of discrete data. There can be more than one mode if two or more values are equally common.

A correlation coefficient between X and Y, usually denoted by ρ_XY, is a number between −1 and 1 which measures the degree to which the two random variables are linearly related. If there is a perfect linear relationship with positive slope between the two variables, we have a correlation coefficient of 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope between the two variables, we have a correlation coefficient of −1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation coefficient of 0 means that there is no linear relationship between the variables.

3 Describing data

The sample mean is an estimator available for estimating the population mean. It is a measure of location, commonly called the average, often symbolized X̄. In a sample of size n with observed values X₁, X₂, . . . , Xₙ,

    X̄ = (X₁ + X₂ + · · · + Xₙ) / n = (1/n) Σᵢ₌₁ⁿ Xᵢ.

Sample variance (S²) is a measure of the spread of, or dispersion within, a set of sample data. In a sample of size n with observed values X₁, X₂, . . . , Xₙ,

    S² = (1/(n−1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)².

Sample standard deviation, S, is the square root of the sample variance.

Percentiles are values that divide a sample of data into one hundred groups containing (as far as possible) equal numbers of observations. For example, 30% of the data values lie below the 30th percentile. Quantiles are a set of 'cut points' that divide a sample of data into groups containing (as far as possible) equal numbers of observations.
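The sample mean, sample variance, and percentiles above map directly onto numpy functions. A sketch on a small made-up sample (numpy assumed available); note the n − 1 denominator in the sample variance:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative sample

n = len(data)
xbar = data.sum() / n                       # sample mean
s2 = ((data - xbar) ** 2).sum() / (n - 1)   # sample variance, n-1 denominator
s = s2 ** 0.5                               # sample standard deviation

# numpy equivalents: ddof=1 selects the n-1 denominator.
assert np.isclose(s2, data.var(ddof=1))
assert np.isclose(s, data.std(ddof=1))

p30 = np.percentile(data, 30)  # 30th percentile: 30% of values lie below it
print(xbar, s2, p30)
```

Without ddof=1, numpy's var and std default to the n denominator (the population formula), a common source of small discrepancies.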
Shapes of the distribution:
• Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the distribution tend to be further from the 'middle' than values on the other side. For skewed data, the usual measures of location will give different values; for example, mode < median < mean would indicate positive (or right) skewness.
• Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.

The sample correlation coefficient for two observed samples, X₁, X₂, . . . , Xₙ and Y₁, Y₂, . . . , Yₙ, is calculated in the following way:

    r_XY = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / √( Σᵢ₌₁ⁿ (Xᵢ − X̄)² · Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² ).

The Law of Large Numbers. As the number of independent observations in the sample increases, the sample mean X̄ of the observed values eventually approaches the mean of the population, µ.

The Central Limit Theorem states that whenever a random sample of size n is taken from any distribution with mean µ and variance σ², the sample mean X̄ will be approximately normally distributed with mean µ and variance σ²/n. The larger the value of the sample size n, the better the approximation to the normal. This is very useful when it comes to inference. For example, it allows us (if the sample size is fairly large) to use hypothesis tests which assume normality even if our data appear non-normal. This is because the tests use the sample mean X̄, which the Central Limit Theorem tells us will be approximately normally distributed.

Related topics: Histogram, Boxplot, quartiles, dispersion, range, outlier.

4 Matrix Algebra

Matrices. A matrix is an array of numbers. For example,

    A = [ a11  a12  a13 ]
        [ a21  a22  a23 ]

is a matrix of size 2 x 3. The rows of A are [a11 a12 a13] and [a21 a22 a23], and the columns of A are [a11 a21]ᵀ, [a12 a22]ᵀ, and [a13 a23]ᵀ. We call aᵢⱼ the (i, j) entry of A. A (column) vector is an n x 1 matrix, or a matrix with n rows and one column.
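The matrix and vector definitions above can be mirrored with numpy arrays. A sketch (numpy assumed available; the numeric entries are arbitrary), keeping in mind that numpy indexes from 0 while the text indexes from 1:

```python
import numpy as np

# A 2 x 3 matrix: 2 rows, 3 columns.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(A.shape)   # (2, 3)

# Rows and columns.
print(A[0, :])   # first row
print(A[:, 1])   # second column

# The (i, j) entry; the (1, 2) entry a_12 in the text's
# 1-based notation is A[0, 1] here.
print(A[0, 1])

# A column vector: an n x 1 matrix (n rows, one column).
x = np.array([[1], [2], [3]])
print(x.shape)   # (3, 1)
```

Keeping column vectors as explicit n x 1 arrays (rather than flat length-n arrays) makes shapes line up with the matrix notation used in regression.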
For example,

    X = [ a1 ]
        [ a2 ]
        [ a3 ]

is a vector of length 3.

The elements along the north-west to south-east diagonal of a square matrix A comprise the diagonal of A. When the only non-zero elements of a matrix are on its diagonal, it is called a diagonal matrix; e.g.,

    A = [ 1  0  0 ]
        [ 0  2  0 ]
        [ 0  0  3 ]

A matrix is called an identity matrix if it is diagonal and all of its diagonal elements are equal to 1. It is denoted by I (or Iₙ for the n x n identity).

Addition and Multiplication. If A and B are matrices of the same size, then the (i, j) entry of A + B, called matrix addition, is aᵢⱼ + bᵢⱼ; that is, [aᵢⱼ] + [bᵢⱼ] = [aᵢⱼ + bᵢⱼ]. We define scalar multiplication for a matrix componentwise: c[aᵢⱼ] = [caᵢⱼ]. If A = [aᵢⱼ] is p x m and B = [bᵢⱼ] is m x n, then we define the matrix multiplication AB = C = [cᵢⱼ] as the p x n matrix with entries

    cᵢⱼ = Σₖ₌₁ᵐ aᵢₖ bₖⱼ.

We can verify for all matrices A, B, C, and D of compatible sizes that:
1. A(BC) = (AB)C (associativity);
2. A(C + D) = AC + AD;
3. (A + B)C = AC + BC;
4. A(aC) = a(AC) for all scalars a;
5. IB = BI = B.
It is generally NOT the case that AB = BA.

Transpose and Inverse. Let A = [aᵢⱼ]. The transpose of A, denoted Aᵀ, is the matrix [bᵢⱼ] where bᵢⱼ = aⱼᵢ. If, for a square matrix A, there exists a matrix B such that AB = BA = I, then B is called the inverse of A and is denoted A⁻¹. We can find A⁻¹ for an n x n matrix A as follows:
(a) Create the "double matrix" [A|I].
(b) Row reduce A down to I (if possible) while simultaneously performing the same operations on the right side of the "double matrix". The I on the right-hand side will be transformed into A⁻¹: [I|A⁻¹].

Systems of Linear Equations. A linear equation in variables x₁, x₂, . . . , xₙ is an equation of the form

    a₁x₁ + a₂x₂ + . . . + aₙxₙ = b.

A collection of such equations is called a system:

    a11 x1 + a12 x2 + · · · + a1n xn = b1
    a21 x1 + a22 x2 + · · · + a2n xn = b2
    ...................................
    am1 x1 + am2 x2 + · · · + amn xn = bm.
A solution to this system is a set of values for x₁, x₂, . . . , xₙ such that each equation is satisfied. A system of linear equations is dependent if one equation in the system is a linear combination of the other equations.

The system of linear equations shown above can be represented in matrix notation as AX = B, where

    A = [ a11  a12  ...  a1n ]     X = [ x1 ]     B = [ b1 ]
        [ a21  a22  ...  a2n ]         [ x2 ]         [ b2 ]
        [ ...  ...  ...  ... ]         [ ...]         [ ...]
        [ am1  am2  ...  amn ]         [ xn ]         [ bm ]

If A is square and A⁻¹ exists (which requires, in particular, that the system is not dependent), the unique solution is X = A⁻¹B.

Material partially borrowed from http://www.stats.gla.ac.uk/steps/glossary and http://faculty.etsu.edu/gardnerr/2250/notes.htm.
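The matrix operations above (product, transpose, inverse) and the solution of AX = B map onto numpy's linear-algebra routines. A closing sketch with a small illustrative 2 x 2 system (numpy assumed available):

```python
import numpy as np

# Illustrative system:
#   1*x1 + 2*x2 = 5
#   3*x1 + 4*x2 = 6
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([5.0, 6.0])

# Transpose, and a check that matrix multiplication is not commutative.
print(A.T)
print(np.array_equal(A @ A.T, A.T @ A))  # generally False

# Solve AX = B. Mathematically X = A^{-1} B; numerically,
# np.linalg.solve is preferred over forming the inverse explicitly.
X = np.linalg.solve(A, B)
print(X)  # solution: x1 = -4, x2 = 4.5

# Agreement with the explicit-inverse formula, and a residual check.
assert np.allclose(X, np.linalg.inv(A) @ B)
assert np.allclose(A @ X, B)
```

np.linalg.solve raises LinAlgError when A is singular, which corresponds to the dependent case where A⁻¹ does not exist.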