Survey

Survey

Document related concepts

Transcript

Note that these are textbook chapters, although Lecture Notes may be referenced. Chapter 1 Overview and Descriptive Statistics 1.1 - Populations, Samples and Processes 1.2 - Pictorial and Tabular Methods in Descriptive Statistics 1.3 - Measures of Location 1.4 - Measures of Variability 1 What is “random variation” in the distribution of a population? Examples: Toasting time, Temperature settings, etc. of a population of toasters… POPULATION 1: Little to no variation (e.g., product manufacturing) In engineering situations such as this, we try to maintain “quality control”… i.e., “tight tolerance levels,” high precision, low variability. But what about a population of, say, people? 4 What is “random variation” in the distribution of a population? Example: Body Temperature (F) POPULATION 1: Little to no variation (e.g., clones) Most individual values ≈ population mean value Density Very little variation about the mean! 98.6 F 5 What is “random variation” in the distribution of a population? Example: Examples:Body Gender, Temperature Race, Age, (F) Height, Annual Income,… POPULATION 2: Much variation (more common) Density Much more variation about the mean! 6 Example • Click on image for full .pdf article • Links in article to access datasets Study Question: How can we estimate “mean age at first birth” of women in the U.S.? POPULATION Women in U.S. who have given birth “Random Variable” X = Age at first birth Suppose we know that X follows a “normal distribution” (a.k.a. “bell curve”) in the population. That is, the Population Distribution of X ~ N(, ). and are “population characteristics” i.e., “parameters” (fixed, unknown) How is this accomplished? Hospital records, etc. standard deviation σ “sampling frame” mean μ = ??? {x1, x2, x3, x4, … , x400} Study Question:Other possible parameters: How can we estimate • standard POPULATION deviation “mean age at first birth”• median Women in U.S. who of women in the U.S.? • minimum have given birth • maximum “Random Variable” x = 25.6 is an example of a “sample characteristic” = “statistic.” (numerical info culled from a sample) Suppose we know that X follows a “normal This is called a “point estimate“ of distribution” (a.k.a. “bell curve”) in the population. from the one sample. That is, the Population Distribution of X ~ N(, ). Can it be improved, and if so, how? • Choose a bigger sample, which standard should reduce “variability.” and are ??? deviation • Average the sample means of “population σ many samples, not just one. characteristics” (introduces “sampling variability”) i.e., “parameters” “Sampling Distribution” ~ ??? (fixed, unknown) X = Age at first birth ????????? How big??? mean μ = ??? {x1, x2, x3, x4, … , x400} FORMUL A mean x = 25.6 mean x = 25.6 Without knowing every value in the population, it is not possible to determine the exact value of with 100% “certainty.” mean x = 25.6 Study Question: How can we estimate “mean age at first birth” of women in the U.S.? POPULATION Women in U.S. who have given birth “Random Variable” X = Age at first birth Suppose we know that X follows a “normal distribution” (a.k.a. “bell curve”) in the population. That is, the Population Distribution of X ~ N(, ). and are “population characteristics” i.e., “parameters” (fixed, unknown) standard deviation σ For concreteness, suppose = 1.5 mean μ = ??? {x1, x2, x3, x4, … , x400} FORMUL A mean x = 25.6 95% CONFIDENCE INTERVAL FOR µ 25.453 mean x = 25.6 25.747 μ Without knowing every value in the population, it is not possible to determine the exact value of with 100% “certainty.” BASED ON OUR SAMPLE DATA, the true value of μ is between 25.453 and 25.747, with 95% “confidence” (…akin to “probability”). This is called an “interval estimate“ of from the sample. Used in “Statistical Inference” via “Hypothesis Testing”… Study Question: How can we estimate “mean age at first birth” of women in the U.S.? • Arithmetic Mean POPULATION Women in U.S. who have given birth “Random Variable” X = Age at first birth “population characteristics” i.e., “parameters” (fixed, unknown) x1 x2 n xn • Geometric Mean xG n x1 x2 xn • Harmonic Mean Suppose we know that X follows a “normal distribution” (a.k.a. “bell curve”) in the population. That is, the Population Distribution of X ~ N(, ). and are xA standard deviation σ xH 1 x1 x12 n Each of these gives an estimate of for a particular sample. Any general sample estimator for is denoted by the symbol ˆ . Likewise for and mean μ = ??? {x1, x2, x3, x4, … , xn} x1n FORMUL A mean x ˆ . Study Question: How can we estimate “mean age at first birth” of women in the U.S.? POPULATION Women in U.S. who have given birth “Random Variable” X = Age at first birth Suppose we know that X follows a “normal distribution” (a.k.a. “bell curve”) in the population. That is, the Population Distribution of X ~ N(, ). and are “population characteristics” i.e., “parameters” (fixed, unknown) Extending these ideas to other parameters of a population gives rise to the general theory of… “PARAMETER ESTIMATION” standard deviation σ mean μ = ??? {x1, x2, x3, x4, … , xn} FORMUL A mean x POPULATION composed of “units” (people, rocks, toasters,...) To make certain calculations simpler, we assume that populations are “arbitrarily large” (or indeed, infinite). What do we want to know about this population? How is… “Random Variable” X (age, income level, …) … distributed? Suppose we know know that that XX follows follows aa “normal known but with parameters distribution” distribution” “probability (a.k.a. “bell curve”) in the population… in the population. That is, the Population Distribution of X ~ Dist( N(, 1)., 2,…). and are “population characteristics” i.e., “parameters” (fixed, unknown) standard deviation heavily skewed tail σ mean μ = ??? 1 , 2 , unknown vals. SAMPLE For a particular , want to define a corresponding “parameter estimator” ˆ Ideal properties… • Unbiased estimator of • Minimum Variance among all such unbiased estimators i.e., “MVUE” POPULATION composed of “units” (people, rocks, toasters,...) To make certain calculations simpler, we assume that populations are “arbitrarily large” (or indeed, infinite). “Random Variable” X = any numerical value that can be assigned to each unit of a population “Random” refers to the notion that this value is unknown until actually observed (usually as part of an outcome of an experiment to test a specific hypothesis). Contrast this with the idea of a “nonrandom” variable with no empirical error, e.g., X = # cards in a deck = 52. What do we want to know about this population? How is… “Random Variable” X (age, income level, …) … distributed? Quantitative [measurement] length mass temperature pulse rate # puppies shoe size There are two general types......... Quantitative and Qualitative 10 10½ 11 16 POPULATION composed of “units” (people, rocks, toasters,...) To make certain calculations simpler, we assume that populations are “arbitrarily large” (or indeed, infinite). “Random Variable” X = any numerical value that can be assigned to each unit of a population “Random” refers to the notion that this value is unknown until actually observed (usually as part of an outcome of an experiment to test a specific hypothesis). Contrast this with the idea of a “nonrandom” variable with no empirical error, e.g., X = # cards in a deck = 52. What do we want to know about this population? How is… “Random Variable” X (age, income level, …) … distributed? Quantitative [measurement] length mass temperature pulse rate # puppies shoe size CONTINUOUS (can take their values at any point in a continuous interval) DISCRETE (only take their values in disconnected jumps) There are two general types......... Quantitative and Qualitative 17 POPULATION composed of “units” (people, rocks, toasters,...) To make certain calculations simpler, we assume that populations are “arbitrarily large” (or indeed, infinite). “Random Variable” X = any numerical value that can be assigned to each unit of a population “Random” refers to the notion that this value is unknown until actually observed (usually as part of an outcome of an experiment to test a specific hypothesis). Contrast this with the idea of a “nonrandom” variable with no empirical error, e.g., X = # cards in a deck = 52. There are two general types......... Quantitative and Qualitative What do we want to know about this population? How is… “Random Variable” X (age, income level, …) … distributed? Qualitative [categorical] video game levels (1, 2, 3,...) 1 2 3 income level (low, mid, high) zip code PIN # 1 2 3 color (Red, Green, Blue) ORDINAL, RANKED (ordered labels) NOMINAL (unordered labels) IMPORTANT SPECIAL CASE: Binary (or Dichotomous) 1, "Success" X • “Pregnant?” (Yes / No) 0, "Failure" • Coin toss (Heads / Tails) • Treatment (Drug / Placebo) 18 1, "Success" Y 0, "Failure" POPULATION Define a new parameter = P(Success) Point estimator Suppose we intend to select a random sample of size n from this population of Success and Failures… … in such a way that the “Success or Failure” outcome of any selected individual conveys no information about the “Success or Failure” outcome of any other selected individual. That is, the “Success or Failure” outcomes between any two individuals are independent. (Think of tossing a coin n times.) ˆ ? Random Variable Let X = “Number of Successes in the sample.” (0, 1, 2, …, n) Then a natural estimator for could be the sample proportion of Success X ˆ n Ex: n = 500 tosses, X= 285 Heads ˆ 285 0.57 500