Simple random sampling

Peter McCullagh*

July 2007

*Support for this research was provided by NSF Grant DMS-0305009.

1 Simple random sampling

1.1 Background and terminology

Samples and sample values

Let N be a positive integer, let [N] = {1, ..., N} be a set containing N elements, and let Y: [N] → R be a given function. In the present context, [N] is called the population, the elements of [N] are called statistical units or sampling units, and Y(1), ..., Y(N) is the list of population values. The value Y(i) ≡ $Y_i$ on unit i is assumed for the present to be a real number; in practice it may be vector-valued.

Let n be an integer with 0 ≤ n ≤ N. A one-to-one function ϕ: [n] → [N] is called a sample. One-to-one means that ϕ(u) = ϕ(u′) implies u = u′, so the sample contains no duplicate units. The sample is an ordered list ϕ(1), ..., ϕ(n) consisting of n distinct units in the population. The composition Yϕ: [n] → R is the list of sample values Y(ϕ(1)), ..., Y(ϕ(n)). The sampled units are distinct, but the sample values need not be distinct.

It is important to distinguish between the population elements, each of which is an identifying label such as u = Mr G. Bush, and the value of Y associated with that unit. For example, if Y = weight in kg, the value Y(u) is a real number such as 85.1. Note that the units of measurement (kg) are part of the definition of the variable Y, which is contrary to normal practice in physics and engineering. In everyday speech we often say that Y = weight, in which case Y(u) = 85.1 kg is not a real number.

If σ: [n] → [n] is a permutation of the sample labels, the composition ϕσ: [n] → [N] is a sample of size n whose elements ϕ(σ(1)), ..., ϕ(σ(n)) coincide with those of the sample ϕ, but in a different order. Two samples ϕ, ϕ′ consisting of the same units in different orders are distinct and are not regarded as equivalent. Accordingly, the number of distinct samples [n] → [N] is given by the descending factorial

$$N^{\downarrow n} = N(N-1)\cdots(N-n+1) = \binom{N}{n}\, n!.$$

A simple random sample is a random element ϕ having the uniform distribution on the set of injective maps [n] → [N]. As always, the sample values Yϕ: [n] → R are obtained by composition with Y.

Symmetric functions

This section is concerned with the behaviour of a function $T(x_1, \ldots, x_n)$ under permutation of its arguments. We say that T is a symmetric function if $T(x_{\sigma(1)}, \ldots, x_{\sigma(n)}) = T(x_1, \ldots, x_n)$ for each permutation σ: [n] → [n]. It is immediately evident that if T, T′ are two real-valued symmetric functions, each linear combination is also symmetric, as is the product TT′ and the ratio T/T′ (provided that T′ ≠ 0). A symmetric function need not be real-valued, but we focus initially on real-valued functions on account of their simplicity. Examples of symmetric functions:

$$\sum x_j, \quad \sum x_j^2, \quad \bar x_n, \quad s_n^2, \quad \sum (x_j - \bar x_n)^2, \quad \min(x_1,\ldots,x_n), \quad \max(x_1,\ldots,x_n), \quad \mathrm{med}(x_1,\ldots,x_n).$$

Examples of non-symmetric functions:

$$x_1, \quad x_n, \quad x_1 + 2x_2 + \cdots + n x_n.$$

Note that the sample variance $s_n^2$ is defined only for n ≥ 2. The emphasis in this section is on homogeneous polynomial symmetric functions.
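The set-up so far can be made concrete in a few lines of code. The following Python sketch (an illustration only, with invented population values) enumerates the injective maps [n] → [N] for a small population, checks that there are $N^{\downarrow n}$ of them, draws a simple random sample by composition with Y, and confirms that a symmetric function is unchanged by reordering the sample values.

```python
import itertools
import math
import random

N, n = 5, 3
Y = {1: 85.1, 2: 72.4, 3: 85.1, 4: 60.0, 5: 91.2}  # population values Y(1), ..., Y(N)

# A sample is an ordered list of n distinct units: an injective map [n] -> [N].
samples = list(itertools.permutations(range(1, N + 1), n))
assert len(samples) == math.perm(N, n)  # descending factorial N(N-1)...(N-n+1) = 60

# A simple random sample is uniform on this set; its values arise by composition with Y.
phi = random.choice(samples)
values = [Y[u] for u in phi]            # units are distinct, values need not be
print(phi, values)

# A symmetric function, such as the mean, is unaffected by permuting its arguments.
mean = lambda v: sum(v) / len(v)
assert math.isclose(mean(values), mean(list(reversed(values))))
```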
Statistic

For present purposes, a statistic is a symmetric function of the sample values, defined for each sample ϕ: [n] → [N] with n ≥ d sufficiently large. That is to say, a statistic is a sequence $T = (T_n)_{n \ge d}$ in which each component $T_n$ is a symmetric function $R^n \to R$. For example, the sample mean and the total sum of squares

$$\bar x_n = (x_1 + \cdots + x_n)/n, \qquad S_n^2 = \sum (x_i - \bar x_n)^2$$

are both defined for all n ≥ 1. However, the sample variance $s_n^2 = S_n^2/(n-1)$ is defined only for n ≥ 2, while the sample skewness

$$k_{3,n}(x) = n \sum (x_i - \bar x_n)^3 / ((n-1)(n-2))$$

is defined for n ≥ 3. Ordinarily, a statistic is defined as an ordinary function $R^n \to R$ for some fixed n, but here we insist on a sequence of functions. The rationale is that the most important properties of a statistic such as the sample variance are not properties of the individual functions, but properties of the sequence $s_2^2, s_3^2, \ldots$. By defining a statistic as a sequence of functions, we are forced to declare, at least implicitly, a relation between the functions $T_n$ and $T_{n+1}$. Implicitly, we are saying that if we had one additional data value we would compute $T_{n+1}(x_1, \ldots, x_{n+1})$ rather than $T_n(x_1, \ldots, x_n)$, and if the entire population were available we would compute $T_N(x_1, \ldots, x_N)$. It is therefore natural to require that the functions in the sequence be related to one another.

In 1950, Tukey proposed a mathematical definition of the concept of a natural statistic, related to simple random sampling, as follows. Let T be a statistic with components $T_n\colon R^n \to R$ defined for each n ≥ d. Consider a population of size N with values $x_1, \ldots, x_N$, and a sample ϕ of size n with values $x_{\varphi(1)}, \ldots, x_{\varphi(n)}$. The statistic T associates with each sample ϕ a number $T_n(x\varphi)$, and with the population another number $T_N(x)$. The sequence T is said to be inherited on the average if for each d ≤ n ≤ N the average sample value $T_n(x\varphi)$ is equal to the population value $T_N(x)$, the average being taken over simple random samples ϕ: [n] → [N] for fixed population values x: [N] → R. In other words, for each x: [N] → R,

$$E(T_n(x\varphi) \mid x) = T_N(x),$$

where the expectation is taken with respect to ϕ, uniformly distributed on the set of $N^{\downarrow n}$ samples.

It is worth remarking at this point that Tukey's definition, which he termed 'inheritance on the average', looks very much like the definition of unbiasedness in parametric statistical models. However, unbiasedness in parametric models is a property of the individual functions $T_n(x)$, whereas inheritance is a property of the sequence. This is a subtle but fundamental distinction. By contrast, consistency and rates of convergence are both properties of a sequence; however, they depend only on the tail of the sequence, so they carry no direct logical implications for finite samples.

Since each element of [N] occurs in the image of ϕ with probability n/N, the sample total satisfies

$$\mathrm{ave}_\varphi(x_{\varphi(1)} + \cdots + x_{\varphi(n)}) = \frac{n}{N}(x_1 + \cdots + x_N).$$

Thus, while the totals are symmetric, they do not have the inheritance property. The sample average, however, is inherited. The sample variance can be written in the form

$$2 s_n^2 = \frac{1}{n^{\downarrow 2}} \sum_{i \neq j} (x_i - x_j)^2,$$

with summation over pairs of distinct units, which makes it clear that this statistic is also inherited under simple random sampling. Likewise, the sample fraction of identical pairs

$$\frac{1}{n^{\downarrow 2}} \sum_{i \neq j} I(x_i = x_j)$$

and the mean absolute deviation, defined as

$$\frac{1}{n^{\downarrow 2}} \sum_{i \neq j} |x_i - x_j|,$$

are both inherited for n ≥ 2. Other examples include the sample skewness $k_{3,n}(x)$ and $k_{11,n} = \bar x_n^2 - s_n^2/n$. In addition, if T, T′ are inherited statistics, each linear combination αT + α′T′ is also inherited. In this context T and T′ are both statistics, i.e. sequences of symmetric functions, so the coefficients α, α′ are scalars independent of n.
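Inheritance can be checked by brute force for a small population: average the statistic over all $N^{\downarrow n}$ samples and compare with the population value. The sketch below (Python, with invented population values) confirms that the mean and the variance $s_n^2$ are inherited, that the sample total is rescaled by n/N, and that the median is generally not inherited.

```python
import itertools
import statistics

x = [2.0, 5.0, 5.0, 9.0, 11.0, 3.0]     # fixed population values x: [N] -> R (invented)
N, n = len(x), 3

def ave_over_samples(T):
    """Average of T(x phi) over all N(N-1)...(N-n+1) samples phi: [n] -> [N]."""
    vals = [T([x[u] for u in phi]) for phi in itertools.permutations(range(N), n)]
    return sum(vals) / len(vals)

mean = lambda v: sum(v) / len(v)

print(ave_over_samples(mean), mean(x))        # equal: the sample mean is inherited
print(ave_over_samples(statistics.variance),  # equal: s^2 is inherited
      statistics.variance(x))
print(ave_over_samples(sum), n * sum(x) / N)  # the total is rescaled by n/N
print(ave_over_samples(statistics.median),    # unequal in general: not inherited
      statistics.median(x))
```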
Inheritance on the average is a very demanding property, satisfied only by very special sequences such as U-statistics. In practice, we often work with statistics that are only approximately inherited. For example, neither the sample median nor the inter-quartile range is inherited.

1.2 k-statistics

Sample means

For any $x \in R^n$ define the following power averages:

$$m_{r,n}(x) = \frac{1}{n}\sum_i x_i^r, \qquad m_{rs,n}(x) = \frac{1}{n^{\downarrow 2}}\sum_{ij}^{\neq} x_i^r x_j^s, \qquad m_{rst,n}(x) = \frac{1}{n^{\downarrow 3}}\sum_{ijk}^{\neq} x_i^r x_j^s x_k^t, \quad \ldots$$

where $\sum^{\neq}$ denotes summation over tuples of distinct indices. The sequence of functions $m_r = (m_{r,n})_{n \ge 1}$ is a statistic defined for n ≥ 1. Likewise, $m_{rs} = (m_{rs,n})_{n \ge 2}$ is a sequence of symmetric functions and thus a statistic in the same sense. Ordinarily we suppress the index n and write $m_{rs}(x)$ instead of $m_{rs,n}(x)$, the value of n being inferred from the argument $x \in R^n$. It is evident from their construction as U-statistics that each of these statistics has the inheritance property, and consequently each linear combination also has the inheritance property. The combinations that have proved most useful and natural for statistical purposes are certain homogeneous polynomials called k-statistics and polykays, defined as follows:

$$k_1(x) = m_1(x), \qquad k_{11}(x) = m_{11}(x), \qquad k_{111} = m_{111}, \qquad k_{1111} = m_{1111},$$
$$k_2(x) = m_2(x) - m_{11}(x), \qquad k_{21} = m_{21} - m_{111}, \qquad k_{211} = m_{211} - m_{1111},$$
$$k_{22} = m_{22} - 2m_{211} + m_{1111}, \qquad k_3 = m_3 - 3m_{21} + 2m_{111}, \qquad k_{31} = m_{31} - 3m_{211} + 2m_{1111},$$
$$k_4 = m_4 - 4m_{31} - 3m_{22} + 12m_{211} - 6m_{1111}.$$

In general, $k_{r,n}(x)$ is a polynomial of degree r in x,

$$k_{r,n}(x) = \sum \phi_{i_1,\ldots,i_r}\, x_{i_1} x_{i_2} \cdots x_{i_r},$$

with coefficients $\phi_{i_1,\ldots,i_r} = (-1)^{\nu-1} (\nu-1)!/n^{\downarrow \nu}$, where ν is the number of distinct values among $i_1, \ldots, i_r$. The single-index k's, called k-statistics, are due to Fisher (1929); the multi-index k's, called polykays, are due to Tukey (1950, 1956).

The following rationale helps to explain the coefficients that occur in the definition of the k-statistics. Suppose that the components of x are independent and identically distributed random variables with distribution F. Then the expected value of the power average is $E(m_r(x)) = \mu_r$, the rth moment of F. Likewise, $E(m_{rs}(x)) = \mu_r \mu_s$ is the product of two moments, $E(m_{rst}(x)) = \mu_r \mu_s \mu_t$ is the product of three moments, and so on. For the k-statistics of degree two, $E(k_{11}(x)) = \kappa_1^2$ and $E(k_2(x)) = \mu_2 - \mu_1^2 = \kappa_2$: in each case a cumulant or a product of cumulants of F. For the k-statistics of degree three, $E(k_{111}(x)) = \kappa_1^3$, $E(k_{21}(x)) = \mu_2 \mu_1 - \mu_1^3 = \kappa_1 \kappa_2$, and $E(k_3(x)) = \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3 = \kappa_3$. For the k-statistics of degree four, $E(k_{22}(x)) = \mu_2^2 - 2\mu_2\mu_1^2 + \mu_1^4 = \kappa_2^2$ and $E(k_4(x)) = \kappa_4$. These expectations are for a fixed sample whose values are independent and identically distributed; they are not to be confused with the expectation for a simple random sample ϕ: [n] → [N] with fixed x: [N] → R.

Multiplication tables

The polynomial $k_{1,n}(x)$ is homogeneous symmetric of degree one in x, while $k_{11,n}(x)$ and $k_{2,n}(x)$ are homogeneous of degree two. The square $k_{1,n}^2(x)$ is homogeneous of degree two, while the products $k_{1,n} k_{11,n}$ and $k_{1,n} k_{2,n}$ are homogeneous of degree three. In order to compute variances and covariances, it is helpful to have a multiplication table that expresses each product as a linear combination of k-statistics and polykays. This is a multiplication table for functions, so the coefficients in the linear combination depend on n.
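Before turning to explicit products, it may help to compute the power averages and k-statistics numerically. A minimal Python sketch (the helper m and the test values are invented for illustration) builds $k_2$, $k_3$ and $k_4$ by direct summation over distinct index tuples and checks $k_2$ and $k_3$ against their familiar closed forms.

```python
import itertools
import math
import statistics

def m(x, *p):
    """Power average: mean of x_i^r x_j^s ... over ordered tuples of distinct indices."""
    tuples = list(itertools.permutations(range(len(x)), len(p)))
    return sum(math.prod(x[i] ** r for i, r in zip(t, p)) for t in tuples) / len(tuples)

def k2(x): return m(x, 2) - m(x, 1, 1)
def k3(x): return m(x, 3) - 3 * m(x, 2, 1) + 2 * m(x, 1, 1, 1)
def k4(x): return (m(x, 4) - 4 * m(x, 3, 1) - 3 * m(x, 2, 2)
                   + 12 * m(x, 2, 1, 1) - 6 * m(x, 1, 1, 1, 1))

x = [1.0, 4.0, 4.0, 7.0, 10.0]
n, xbar = len(x), sum(x) / len(x)

# k2 is the usual sample variance s^2, and k3 the sample skewness quoted earlier.
assert math.isclose(k2(x), statistics.variance(x))
assert math.isclose(k3(x), n * sum((v - xbar) ** 3 for v in x) / ((n - 1) * (n - 2)))
```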
Two explicit calculations may help to illustrate what is involved in the construction of such a multiplication table. First,

$$k_{1,n}^2(x) = (x_1 + \cdots + x_n)^2/n^2 = \sum_{ij}^{\neq} x_i x_j/n^2 + \sum_i x_i^2/n^2$$
$$= n^{\downarrow 2}\, m_{11,n}(x)/n^2 + m_{2,n}(x)/n$$
$$= k_{11,n}(x)(n-1)/n + k_{2,n}(x)/n + k_{11,n}(x)/n$$
$$= k_{11,n}(x) + k_{2,n}(x)/n.$$

Second,

$$k_{2,n}^2(x) = \Big(\sum_{ij} \phi_{ij}\, x_i x_j\Big)^2$$
$$= \sum_{ijkl}^{\neq} \phi_{ij}\phi_{kl}\, x_i x_j x_k x_l + \sum_{ijk}^{\neq} (4\phi_{ij}\phi_{ik} + 2\phi_{ii}\phi_{jk})\, x_i^2 x_j x_k + \sum_{ij}^{\neq} (2\phi_{ij}^2 + \phi_{ii}\phi_{jj})\, x_i^2 x_j^2 + \sum_{ij}^{\neq} 4\phi_{ii}\phi_{ij}\, x_i^3 x_j + \sum_i \phi_{ii}^2\, x_i^4$$
$$= n^{\downarrow 4} m_{1111}/(n^{\downarrow 2})^2 + n^{\downarrow 3} m_{211}\big(4/(n^{\downarrow 2})^2 - 2/(n^2(n-1))\big) + n^{\downarrow 2} m_{22}\big(2/(n^{\downarrow 2})^2 + 1/n^2\big) - 4 m_{31}/n + m_4/n$$
$$= k_{22,n}(x)(n+1)/(n-1) + k_{4,n}(x)/n,$$

where $\phi_{ii} = 1/n$ and $\phi_{ij} = -1/n^{\downarrow 2}$ for i ≠ j are the coefficients in $k_{2,n}(x)$. Some additional multiplication formulae are as follows:

$$k_1^2 = k_{11} + k_2/n$$
$$k_1^3 = k_{111} + 3k_{21}/n + k_3/n^2$$
$$k_1^4 = k_{1111} + 6k_{211}/n + 3k_{22}/n^2 + 4k_{31}/n^2 + k_4/n^3$$
$$k_1 k_2 = k_{21} + k_3/n$$
$$k_1^2 k_2 = k_{211} + k_{22}/n + 2k_{31}/n + k_4/n^2$$
$$k_1 k_3 = k_{31} + k_4/n$$
$$k_2^2 = k_{22}(n+1)/(n-1) + k_4/n$$

Variances

Let x: [N] → R be given and let ϕ: [n] → [N] be a simple random sample. The sample average $k_{1,n}(x\varphi)$ is a random variable whose mean is $E(k_{1,n}(x\varphi) \mid x) = k_{1,N}(x)$, the population mean. The variance is obtained by using the multiplication table twice, as follows:

$$\mathrm{var}(k_{1,n}(x\varphi)) = E(k_{1,n}^2(x\varphi)) - k_{1,N}^2(x)$$
$$= E(k_{11,n}(x\varphi) + k_{2,n}(x\varphi)/n) - k_{11,N}(x) - k_{2,N}(x)/N$$
$$= k_{11,N}(x) + k_{2,N}(x)/n - k_{11,N}(x) - k_{2,N}(x)/N$$
$$= k_{2,N}(x)(1/n - 1/N) = n^{-1} k_{2,N}(x)(1 - n/N).$$

In the limit as N → ∞, the variance is $\kappa_2/n$, where $\kappa_2$ is the assumed limit of $k_{2,N}(x)$. The factor 1 − n/N is called the finite-population correction.

The sample variance $k_{2,n}(x\varphi)$ is a random variable whose mean under simple random sampling, $E(k_{2,n}(x\varphi) \mid x) = k_{2,N}(x)$, coincides with the population variance, a consequence of inheritance. Its variance is obtained by using the multiplication table together with the inheritance property:

$$\mathrm{var}(k_{2,n}(x\varphi)) = E(k_{2,n}^2(x\varphi)) - k_{2,N}^2(x)$$
$$= E\big(k_{22,n}(x\varphi)(n+1)/(n-1) + k_{4,n}(x\varphi)/n\big) - k_{22,N}(x)(N+1)/(N-1) - k_{4,N}(x)/N$$
$$= k_{22,N}(x)\big(2/(n-1) - 2/(N-1)\big) + k_{4,N}(x)(1/n - 1/N).$$

In the limit as N → ∞, the variance is $2\kappa_2^2/(n-1) + \kappa_4/n$. In a similar manner we find that

$$\mathrm{cov}(k_{1,n}(x\varphi), k_{2,n}(x\varphi) \mid x) = k_{3,N}(x)(1/n - 1/N),$$

while the third cumulant of $k_{1,n}(x\varphi)$ is

$$\mathrm{cum}_3(k_{1,n}(x\varphi) \mid x) = k_{3,N}(x)(1/n - 1/N)(1/n - 2/N).$$

References

Dressel, P.L. (1940). Statistical seminvariants and their estimates with particular emphasis on their relation to algebraic seminvariants. Ann. Math. Statist. 11, 33–57.

Dwyer, P.S. and Tracy, D.S. (1964). A combinatorial method for the product of two polykays with some general formulae. Ann. Math. Statist. 35, 1174–1185.

Fisher, R.A. (1929). Moments and product moments of sampling distributions. Proc. Lond. Math. Soc. (2) 30, 199–238.

Thiele, T.N. (1897). Elementaer Iagttagelseslaere. Reprinted in English as The Theory of Observations, Ann. Math. Statist. 2 (1931), 165–308.

Tracy, D.S. (1968). Some rules for a combinatorial method for multiple products of generalized k-statistics. Ann. Math. Statist. 39, 983–998.

Tukey, J.W. (1950). Some sampling simplified. J. Amer. Statist. Assoc. 45, 501–519.

Tukey, J.W. (1956). Variances of variance components: I. Balanced designs. Ann. Math. Statist. 27, 362–377.
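The multiplication table and the variance formulae above lend themselves to direct numerical verification. The following Python sketch (invented population values; the helper m repeats the power-average definition) checks the identity $k_1^2 = k_{11} + k_2/n$ and the formula $\mathrm{var}(k_{1,n}(x\varphi)) = k_{2,N}(x)(1/n - 1/N)$ by enumerating every sample from a small population.

```python
import itertools
import math

def m(v, *p):
    """Power average over ordered tuples of distinct indices."""
    tuples = list(itertools.permutations(range(len(v)), len(p)))
    return sum(math.prod(v[i] ** r for i, r in zip(t, p)) for t in tuples) / len(tuples)

def k1(v): return m(v, 1)
def k2(v): return m(v, 2) - m(v, 1, 1)

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]   # fixed population values (invented)
N, n = len(x), 4

# Multiplication table: k1^2 = k11 + k2/n, with k11 = m11, for any argument of length n.
v = x[:n]
assert math.isclose(k1(v) ** 2, m(v, 1, 1) + k2(v) / n)

# Variance of the sample mean over all N(N-1)...(N-n+1) samples, by brute force.
vals = [k1([x[u] for u in phi]) for phi in itertools.permutations(range(N), n)]
ev = sum(vals) / len(vals)
var = sum((w - ev) ** 2 for w in vals) / len(vals)
assert math.isclose(var, k2(x) * (1 / n - 1 / N))  # k_{2,N}(x) (1/n - 1/N)
```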