8/30/16

Unit 1: Data Collection
Section 1.1 & 1.2 in the Text
"In God we trust. All others must bring data." – W. Edwards Deming

Unit 1 Outline
• Variables & Measurement
• Collecting Data
• Sampling
• Random Assignment (for causal inference)

Variables and their Measurements
• Variable: any characteristic that takes different values for different individuals in a sample or population.
• Two major types of variables: categorical and quantitative.
• A categorical variable is a variable that can take on a few different values (categories) when measured. Sometimes called a qualitative variable.
• A quantitative variable is a variable that is measured on a numerical scale covering a large range of values.
• Examples?
  Categorical:
  Quantitative:

Categorical Variables
• Two types: nominal and ordinal.
• Nominal variable: a categorical variable in which the categories are unordered.
• Ordinal variable: a categorical variable in which the categories have an order or hierarchy (and can possibly be numeric), but there is "no defined distance between levels on the measurement scale".
• Examples?
  Nominal:
  Ordinal:

Quantitative Variables
• Two types: discrete and continuous.
• Both are measured on an interval scale. That is, there is a specific numerical distance between any two measurements.
• Discrete variable: a quantitative variable that can only take on specific numbers, like 0, 1, 2, …
• Continuous variable: a quantitative variable that can take an infinite number of possible values within a range of numbers.
• Examples:
  Discrete:
  Continuous:

Dummy Variables
• There is a special type of "nominal" categorical variable called a dummy variable or indicator variable.
• These variables take on only 2 possible values: 0 or 1. The one usually stands for success or yes, while the zero usually stands for failure or no.
• By convention, they are usually named after the category that is a success.
• Example: to represent sex/gender, we could define a dummy variable named female, which would be 1 for all women and 0 for all men:

\[
\text{female} =
\begin{cases}
1 & \text{if female} \\
0 & \text{if male}
\end{cases}
\]

Summary of types of Variables
Variables
  Categorical
    Nominal (more common)
      Dummy (special case)
    Ordinal (less common)
  Quantitative
    Discrete
    Continuous
Note: in this class (and most of statistics), the most important difference is that between categorical and quantitative variables. That distinction will typically determine the type of statistics and analysis used. Nominal and ordinal variables are often treated the same. The same goes for discrete and continuous variables.

Collecting Data
• Data can be collected in many ways:
  1. Anecdotal information
  2. Available data
  3. Observational studies
  4. Randomized experiments
• The further down the list you go, the more reliable the information typically is, and the stronger the conclusions you can draw.

Anecdotal evidence
• Anecdotal evidence is based on haphazardly selected individual cases that often come to our attention because they are striking (and probably not representative).
• Example: Politicians often cite the case of a single individual to invoke a public response consistent with the politicians' desire (a sample of size n = 1).
• "Ask for averages, not testimonials"

Available data
• Available data are data that were produced in the past for some other purpose but may help answer a present question.
• Many use available data because producing new data is expensive (nearly always the most costly part of research).
• There are lots of reliable available datasets on the web, rich with information. Some examples:
  http://www.census.gov/#
  http://www3.norc.org/gss+website/
  http://www.hcup-us.ahrq.gov/nisoverview.jsp

Observational Studies
• An observational study is one in which data are collected by merely observing the measurements on the individuals in the sample. No attempt is made to influence or intervene with the subjects.
• It may be difficult to reach causal conclusions (that changing one variable causes another variable to change), since other variables may be muddling up this relationship (called confounding).
• Example: Does smoking cigarettes increase your risk of heart disease?
• Example: Let's come up with a survey of Harvard:

Observational Studies
Pros:
• Usually cheap
• The only option when a randomized experiment is infeasible or unethical
• Showing causation is not always necessary
  • Risk factors for medical decisions, population statistics.
• Risk factor (common in medicine and epidemiology): a variable associated with an increased risk of disease or infection.
• Examples?

Observational Studies
Cons:
• Establishing causation may be impossible due to the presence of confounding variables.
• Requires advanced statistical methods and unverifiable assumptions.
• A confounding variable (or factor), sometimes referred to as a confounder or a lurking variable, affects both the group membership and the outcome (or dependent) variable.
• This third variable can make the two variables appear to be causally related when they are not.

Confounding Variables
• Name confounding variables that may induce the following associations:
  • The association between the amount of serious crime committed and the amount of ice cream sold by street vendors.
  • Drink More Diet Soda, Gain More Weight?: Overweight Risk Soars 41% With Each Daily Can of Diet Soft Drink.
  • Negative correlation between the size of one's palm and their life expectancy.

Experiments
• An experiment is a study in which an investigator imposes an intervention (e.g. a treatment) on individuals in order to observe their response.
• Clinical trials are a type of experiment.
• An example: a comparison of different drugs for women with breast cancer, often with as few as 100 people.
• The experimenter chooses which women in the study receive the different levels of the drug (new therapy vs. old therapy). The levels of the drug are called the treatment.
• The outcome of the study may be the measured amount of disease-free survival for each woman.

Experiments: a few details
• There must always be at least two treatment groups to compare. The 'default' condition is often called the control group (standard of care in clinical trials).
• The control group may receive a placebo treatment. This is a treatment that looks like the active treatment (classic example: a 'sugar pill').
• The subjects should be randomized to the treatment groups. That is, chance should decide which patients receive which treatment.
• On average, this balances all other variables across the treatment groups.
• For this balance to hold well in practice, the study needs enough replication (enough subjects in each group).

Experiments
• An experiment is the best (only?) way to determine whether one variable (the treatment) causes another variable (the outcome) to vary.
• However, experiments are not always ethical or feasible. You cannot knowingly do harm to human subjects by forcing them to take a dangerous treatment (example: forcing them to smoke).
• Experiments may not mimic real life (the conditions in which an experiment is run are often too 'perfect' or unrealistic), so the results often generalize imperfectly to the real world.
• They are also the most expensive way to collect data.

Confounding Variables and Randomization
• Suppose we would like to compare two methods of teaching introductory statistics.
• At Harvard, one professor uses a standard lecture format in his class and another professor uses an interactive clicker approach in her class.
• Students in the two classes are given achievement tests to see how well they learned the material.
• Confounding variables?
• Better experiment?

Population vs. Sample
• Population: the entire group of individuals on which we desire information.
  • Technicality: actual vs. conceptual populations
  • For our Harvard study:
• Sample: the part of the population on which we actually collect data.
  • For our Harvard study:

Parameters and Statistics
• Parameter (often called an estimand): a numerical summary of the population (like µ or p).
  • For our Harvard study:
• Statistic: a numerical summary of the sample data.
  • For our Harvard study:
• Estimator: a statistic used as a guess for the value of the estimand (x̄ or p̂).
• Estimate: a particular realization of the estimator (4/12 = 0.33).

Selection of Study Units
Study/experimental unit/subject: one member of a set of entities being studied.
Two extremes of a selection mechanism:
• Self-selection (volunteers, haphazard)
• Random sampling
Analysis, Estimates, & Inference
Purpose: describe population characteristics.

Parameter vs. Estimate
• Parameter (also, estimand): proportion of childless households in the population.
• Estimate: proportion of childless households in the sample.
(Figure legend: childless household; household with children under 18)

Entire population = all possible units

Target Population: a collection of units a researcher is interested in; a group about which the researcher wishes to draw conclusions.

Census: sample everybody in the target population

Census
Pros:
• In principle, no need to use statistical inference
Cons:
• Expensive
• Long and difficult
• In practice, never perfect:
  • Respondents are often not representative of the target population!

Sampling Frame: collection of units that are potential members of the sample.
• Overcoverage: the frame includes units that are not in the target population.
• Undercoverage: the frame misses units that are in the target population.

Respondents: sampling units whose data were actually obtained.

Sample: a [randomly selected] subset of a sampling frame

Sampling Steps
Population
Target population
Sampling frame
Sample
Respondents

Selecting a Sample from a Sampling Frame
• What is the simplest way of collecting a random sample?
• Small example: selection of a 3-member advisory committee at random from the 11 faculty members of the Stat Dept.
  • What is the population? What is the sample?
  • What's the chance that any one specific member is selected for the committee?
  • (Stat 110 question): How many different 3-person committees can be formed?

Random Sampling
• Helps ensure that, on average, all subpopulations in the overall population are roughly represented in the sample.
• Simple Random Sampling (SRS): every subset of n units has an equal chance of being selected
  • Pick size n (may use power analysis)
  • Enumerate all units
  • Pick n numbers randomly

Random Sampling
• Systematic Random Sampling: select every kth unit from the ordered sampling frame, starting from a randomly chosen one of the first k positions.
  • Easier to administer
  • Requires a well-mixed population
• Variable probability sampling: allow units to have unequal probabilities of being sampled.
  • Requires more careful analysis that involves weighting.
  • Example: Stratified Sampling: split the population into homogeneous subpopulations and use SRS (or another method) within a sampling frame of each subpopulation.

Simple random sample example
• If we were to draw a simple random sample of n = 60 students from all Harvard undergrads, we could:
  1) Write out the sampling frame: the list of all individuals in the population.
  2) Assign each of the N members a number from 1 to N.
  3) Use a random number table or software to generate random numbers. So if N is a 4-digit number, we could just generate random 4-digit numbers and choose the individuals based on those numbers.

Stratified random samples
Basic idea: sample important groups separately, then combine these samples
  1) Divide the population into groups of similar individuals, called strata
  2) Choose a separate simple random sample within each stratum
  3) Combine the results of the simple random samples to form the overall statistic, weighting each separate stratum correctly to mimic the population

Stratified Sample Example
• We could perform a stratified sample at Harvard by randomly selecting 5 individuals from each house, and then combining them into one sample of n = 60.
• What's an advantage of this stratified sample compared to the SRS?

Assignment to Groups
(Diagram labels: Random sampling; Assignment to groups/treatments)
If the assignment mechanism is random, the expected proportion of childless couples in each (treatment) group is the same.
Two extremes of an assignment mechanism:
• Haphazard (or unknown)
• Random

Assignment to Groups
• Complete Randomization (parallel to SRS).
• Stratified or clustered Randomization (parallel to stratified sampling)
  • Ensures that representatives from all strata are present in each treatment group.
• How not to randomize:

Inferences Permitted by Study Design
• If sampled randomly: inferences about the population can be drawn
• If groups are randomly assigned: causal inferences can be drawn
• If sampled randomly AND groups are randomly assigned: both kinds of inference can be drawn

Observational studies
• Difficult to draw inferences about population
• Difficult to draw causal inferences

Study's generalizability
• Internal validity is the validity of (causal) inferences in a scientific study.
  • Should be established first.
  • It is low when there are
    • unaccounted confounding factors;
    • ignored missing data;
    • noncompliance;
    • unverified assumptions;
    • suboptimal method of analysis.
• A study that readily allows its findings to generalize to the population at large has high external validity.

Concepts to review: (some will be covered in Units 2 & 3)
• Outcomes, events, probability
• Random variables (r.v.)
• Probability distribution of r.v.
• Indicator variables
• Bernoulli and Binomial distribution
• Normal (Gaussian) distribution
• Mean, variance, standard deviation (SD)
• Histogram
• Sample mean, sample variance, sample SD
• Law of Large Numbers
• Central Limit Theorem
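
As a recap, the sampling and assignment mechanisms described in this unit (SRS, systematic sampling, stratified sampling, complete randomization) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the course: the sampling frame of 12 houses with 500 students each is made up (12 is simply what 5 students per house and n = 60 imply), and all function names are hypothetical. Only the standard library's `random` and `math` modules are used.

```python
import math
import random


def simple_random_sample(frame, n, rng):
    """SRS: every subset of n units is equally likely to be chosen."""
    return rng.sample(frame, n)


def systematic_sample(frame, k, rng):
    """Take every k-th unit, starting at a random one of the first k positions."""
    start = rng.randrange(k)
    return frame[start::k]


def stratified_sample(strata, n_per_stratum, rng):
    """Draw a separate SRS of fixed size within each stratum, then combine."""
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], n_per_stratum))
    return sample


def randomize_to_groups(units, rng):
    """Complete randomization: shuffle, then split into two halves."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(2016)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 12 houses, 500 students each (made-up sizes).
houses = {
    f"house_{h:02d}": [f"house_{h:02d}_student_{i}" for i in range(500)]
    for h in range(1, 13)
}
frame = [student for members in houses.values() for student in members]

srs = simple_random_sample(frame, 60, rng)       # n = 60 from N = 6000
systematic = systematic_sample(frame, 100, rng)  # every 100th unit
stratified = stratified_sample(houses, 5, rng)   # 5 per house
treatment, control = randomize_to_groups(srs, rng)

# Committee example: 3 members chosen at random from 11 faculty.
n_committees = math.comb(11, 3)                   # number of possible committees
p_selected = math.comb(10, 2) / math.comb(11, 3)  # chance one given member is chosen
```

All three sampling schemes return 60 students here, but only the stratified sample is certain to contain exactly 5 students from every house; the SRS may over- or under-represent a house by chance. The last two lines address the committee questions: there are 165 possible 3-person committees, and any one specific member is selected with probability 3/11 (about 0.27).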