• Distribution: describes what values a variable takes and how frequently these values occur. The distribution of a variable can be described graphically and numerically in terms of “shape”, “center” and “spread”.
• Mean is the “average value”
• If there are n observations x1, x2,…, xn, then the mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$
• For example, if the data are 3, 2, 3, 6, 1, then their mean (or average) is (3+2+3+6+1)/5 = 3.0
• Median is the “midpoint”
• 50% of observations are smaller than the median and 50% are larger than the median
• If n is odd then the median is the center observation in the ordered list
• If n is even then the median is the mean of the two center observations in the ordered list
• For example, if the data are 3, 2, 3, 6, 1, we can order them 1, 2, 3, 3, 6 and see that the median is 3
• Mode is the observation that occurs most frequently
• The mode may not be unique; there may be more than one mode
• For example, if the data are 3, 2, 3, 6, 1, the mode is 3 because it occurs most frequently
• Outliers usually demand investigation
• Often they are errors in the data (e.g. due to instrument failure or errors in recording) but they also may be very important (e.g. a new scientific observation)
• If there is no reason to suspect they have been wrongly recorded, we may want to use summaries that are resistant to their influence (e.g., medians rather than means)
• Outliers should not be discarded without good reason
• A measure of spread conveys information regarding variability – how dispersed the distribution is
• Common numerical summaries of spread
• Variance (s^2)
• Standard deviation (SD or s)
• Range (largest minus smallest observation) and IQR
The concept of variance:
• The “center” of a group of observations can be measured by the mean
• The variability of a single observation ($x_i$) can be measured by its distance from the center (e.g.
mean)
• Since we want this to always be a positive number, we consider the square of that deviation, $(x_i - \bar{x})^2$
• If we consider the sum of such “squared deviations from the mean” as a measure of variability, we realize that we need to take its average
• The variance is the average (almost) of squared deviations from the mean; the units of variance are squared units
• If there are n observations x1, x2,…, xn, then the variance is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard Deviation
• The standard deviation (SD) is the square root of the variance: $s = \sqrt{s^2}$
Quartiles and the Interquartile Range
• The first quartile Q1 is the median of the observations in the ordered list to the left of the overall median (25% are smaller than Q1 and 75% are larger)
• The third quartile Q3 is the median of the observations in the ordered list to the right of the overall median
• Interquartile Range, IQR = Q3 - Q1, is a measure of variability of the distribution (the IQR contains the middle 50% of the observations)
• Example: for the observations 1, 2, 3, 4, 5: Q1 = 1.5, Median = 3, Q3 = 4.5, and IQR = 3
• A boxplot graphically displays several important features of a distribution, including the median, quartiles and outliers; it is a tool for visualizing the location (center) and variation of quantitative data, and for illustrating differences between 2 or more groups of data
Constructing a boxplot
• Draw a box whose ends are the lower and upper quartiles Q1 and Q3 (length of box is equal to the IQR)
• Mark the median by a line within the box
• Observations greater than Q3 + (1.5 x IQR) or less than Q1 – (1.5 x IQR) are considered to be outliers and highlighted
• Draw lines from the quartiles to the most extreme values that are not marked as outliers (called whiskers)
• Bar graphs: used for categorical data
• Display count/percentage of individuals in each category of the categorical variable
• emphasize center and spread of a distribution
• Histograms: used for quantitative data
• Display count/percentage of individuals within intervals of equal width; the number of intervals and choice of interval
width is important
• emphasize the distribution of values
• Graphically summarize the distribution of one variable
1. Center (i.e., the location) of the data
2. Spread (i.e., the variation)
3. Skewness (departure from left-right symmetry)
4. Presence of outliers
5. Presence of multiple modes (high frequency values) in the data
• Red text notes the strength of a histogram compared to a boxplot
Density Curves
• It is often easier to conceptualize a population of values with smooth curves rather than histograms
• The curve serves as a mathematical model for the distribution
• Graph on next slide comes from data in IPS, p 66
• Histogram of Iowa test vocabulary scores, Gary, Indiana 7th graders (n = 947)
• Vertical axis is relative frequency
• Plus approximation of the distribution with the normal density curve
Properties of a density curve
• All values are positive (curve sits above the horizontal axis)
• Total area under the curve is 1
• Areas under the curve and between two x values give (an approximation to) the relative frequency of values in the population between those x values
• Shapes follow those of histograms
The “normal distributions”
• The normal distributions are a family of density curves indexed by their mean and standard deviation
• The curves are symmetric, unimodal, bell-shaped
Normal distributions – N(µ,σ)
• The family of “normal distributions” consists of symmetric, bell-shaped density curves
• All normal distributions have the same shape, but with possibly different means (µ) and standard deviations (σ)
• Common notation for a normal distribution - N(µ,σ)
The Standard Normal Distribution Z ~ N(0,1)
• The standard normal distribution, called the Z distribution, has µ = 0 and σ = 1, so we write Z ~ N(0,1)
• All tables of the normal distribution are for Z ~ N(0,1)
• If Y ~ N(µ,σ) then we can standardize it by: Z = (Y − µ)/σ
The 68-95-99.7 rule
• All normal distributions follow the 68-95-99.7 rule (approximately)
.. 68% of observations fall within σ of µ
.. 95% fall within 2σ of µ
.. 99.7% fall
within 3σ of µ
• Conversely, if a distribution has this property then it is normal or nearly normal
• Standardizing: Z = (Y − µ)/σ, so Z ~ N(0,1)
Why is the “normal distribution” so common?
• Result became known as the “Central Limit Theorem”: under general conditions, the distribution of a sum (or average) of many random quantities is close to a normal distribution when repeated
How to “standardize” an observation
• Subtract the mean from the observation
• Divide by the standard deviation
• If Y is N(µ,σ) distributed, then Z = (Y − µ)/σ has a N(0,1) distribution
• The standardized value is often called a “z-score”
Standardizing
• Example: grades of a previous STAT 104 final exam - the mean was 66 and the standard deviation was 12
• Student A scored 78 and student B scored 48
• Since 1 SD = 12 points, student A scored 1 SD above the mean (Z = (78 − 66)/12 = 1)
• Student B scored 1.5 SDs below the mean (Z = (48 − 66)/12 = −1.5)
• A standardized score takes into account the spread of the data
Normal distribution – finding the probability to the right of a Z-score
• Since the normal tables give the probability to the left of a Z-score, we use subtraction and the fact that the total probability is 1: P(Z > z0) = 1 − P(Z < z0)
Example - SAT Verbal Scores (IPS p 79)
• SAT verbal test scores have an approximately normal distribution with µ = 505 and σ = 110 [X ~ N(505,110)]
• What test score will place a student in the top 10%?
• So we want to find x0 such that Prob(X > x0) = 0.1
• This is the same as finding z0 such that Prob(Z > z0) = 0.1, where Z = (X − µ)/σ ~ N(0,1) (standardize)
• Because Table A only gives area to the left, we need to restate this problem as: what z0 has area 0.9 to the left?
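The standardization steps and the table lookup set up above can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: it assumes the standard-library `statistics.NormalDist` as a stand-in for Table A, and the helper name `z_score` is made up for this example.

```python
from statistics import NormalDist

def z_score(y, mu, sigma):
    """Standardize an observation: subtract the mean, divide by the SD."""
    return (y - mu) / sigma

# STAT 104 final exam example: mean 66, SD 12
z_a = z_score(78, 66, 12)   # student A
z_b = z_score(48, 66, 12)   # student B

# Standard normal Z ~ N(0,1); cdf(z) plays the role of Table A (area to the left)
Z = NormalDist(mu=0, sigma=1)
area_right_of_a = 1 - Z.cdf(z_a)  # area to the right = 1 - area to the left

# SAT verbal example: find z0 with area 0.9 to its left, then un-standardize
z0 = Z.inv_cdf(0.9)
x0 = 505 + z0 * 110

print(z_a, z_b)
print(round(area_right_of_a, 4))
print(round(z0, 2), round(x0, 1))
```

Because `inv_cdf` does not round z0 to two decimals the way a printed table does, the cutoff score it gives is about 646, slightly above the value obtained from the table entry 1.28.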
Example - SAT Verbal Scores
• Prob(Z < 1.28) = 0.9 from Table A, so Prob(Z > 1.28) = 0.1
• To determine the SAT score, set z0 = 1.28 = (x0 − 505)/110 and solve for x0, so x0 = 505 + (1.28)(110) = 645.8
Some properties of normal distributions
• Think of Prob(Y < w) as the probability of an event, where “Y < w” is the event [shorten to P(Y < w)]
• When dealing with distributions, P(Y < w) can also be interpreted as a proportion or relative frequency
• Table A gives us P(Z < z0) when Z ~ N(0,1)
• A plot that can be used to assess normality is called a normal quantile plot (or normal probability plot)
• A tool that will become useful later
• P(Z < -z0) = P(Z > z0), i.e. the distribution is symmetric about 0
• P(z1 < Z < z2) = P(Z < z2) - P(Z < z1)
• P(Z ≤ z0) = P(Z < z0), since P(Z = z0) = 0
Normal quantile plots
4. Plot each data point y (vertical axis) against the corresponding z (horizontal axis)
5. If the data distribution is close to the normal distribution then the plotted points will lie close to a straight line
Quantile plot vs histogram vs boxplot
• Unlike a histogram, a quantile plot does not require an arbitrary definition of bins (the width of the bars)
• A boxplot will show symmetry, but is not good at indicating when tails have “too many outliers” for normality
Transforming data to “normality”
• Consider the elimination of outliers – with caution
• If the data are positive and skewed, then consider transforming the data using the natural logarithm
• Other possible transformations include the class of power transformations X^k where k ≠ 0 (e.g. k = ½)
• Many methods described later in the course are more reliable when the data are normally distributed, or nearly so
• Such transformations do not always work
Relationships between variables
.. Categorical variables – a limited set of outcomes
..
Quantitative variables – take on numerical values (arithmetic operations are meaningful)
• Use boxplots to examine the relationship between a categorical variable and a quantitative variable
• Use scatterplots to look at the relationship between two quantitative variables (measured on the same individuals) (first step when studying the relationship)
Positive and negative associations
• Two variables measured on the same individuals are called positively associated if increasing values of one variable tend to occur with increasing values of the other
• They are negatively associated if increasing values of one variable tend to occur with decreasing values of the other
Response and explanatory variables
• Response variable, denoted as Y, measures the outcome of a study. Y is the variable we want to predict/explain (often called the dependent variable)
• Explanatory variable, denoted as X, is a variable that may predict/explain (but not necessarily cause) the response variable (often called the predictor variable) (frequently there are many possible explanatory variables)
Linear relationships
• The relationship between two variables is said to be linear if the points on the scatterplot lie (approx.) on a straight line
• A perfect linear relationship between a response variable (Y) and an explanatory variable (X) is Y = a + bX
• A positive linear relationship means b > 0
• A negative linear relationship means b < 0
• What if b = 0? Flat.
Correlation
• Correlation is a measure of the strength of the linear relationship between two variables
• It is usually denoted by r, with a range of -1 to 1
.. r = 1 means the relationship between two variables X and Y is exactly positive linear
.. r = -1 indicates the relationship is exactly negative linear
.. r = 0 indicates a very weak (or no) linear relationship
Correlation
• Definition: Suppose we have n pairs of observations (x1,y1),…,(xn,yn) on two variables X and Y.
The correlation between X and Y is given by the formula $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$, where sx and sy are the SDs of X and Y
Properties of r, the correlation coefficient
• r is always between –1 and +1
• r is 1 or –1 only if the points lie exactly on a straight line
• The sign of r indicates a positive or negative association
• r is unaltered by changes in units of X or Y
• The absolute value of r measures the strength of the linear relationship
• r has no direct interpretation as a percent or proportion (e.g., r = 0.8 is not twice as strong as r = 0.4)
*Association does not imply causation*
Least-squares regression
• Situation: 2 quantitative variables
• A regression line is a straight line that describes how a response variable (Y) changes as an explanatory variable (X) changes
• Unlike correlation, regression requires that we have a response variable (Y) and an explanatory variable (also called predictor variable) (X)
Least-squares regression – the formulas
Suppose we have n pairs of observations on X and Y: (x1,y1), (x2,y2), (x3,y3), ...
, (xn,yn)
We want to find the straight line that best “fits” the data
This line has an equation of the form $\hat{y} = a + bx$, where $\hat{y}$ (“y hat”) is the predicted value of Y, a is the y-intercept (value of Y when X = 0), and b is the slope of the line
Least-squares criterion
The “best-fitting” line is the line that makes the sum of the squares of the vertical deviations from the data points to the line as small as possible
Minimizes the quantity: $\sum_{i=1}^{n} (y_i - a - b x_i)^2$
We want to solve for a and b to make the above quantity as small as possible
Least-squares intercept (a) and slope (b)
The values of a and b that minimize this quantity are: $b = r\,\frac{s_y}{s_x}$ and $a = \bar{y} - b\,\bar{x}$, where sx and sy are the standard deviations for X and Y and r is the correlation coefficient between X and Y
Interpreting the regression line
• The least-squares regression line for these data is: $\hat{y}$ = 64.93 + 0.635x
• For these data, b = 0.635 (the slope), so that height increases on average by 0.635 centimeters for each one-month increase in age
• For these data, a = 64.93 (the y-intercept), which is where the line (if extended) would cross the y axis, at x = 0
• Correlation between X-Y and Y-X is the same
• But least-squares regression of Y on X is different from regression of X on Y
Interpreting r^2 in regression
• r^2 is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x. Thus, r^2 = variance of $\hat{y}$ / variance of y, where $\hat{y}$ are the predicted values ($\hat{y}$ = a + bx) and y are the observed values
Analysis of residuals
• Definition: residual = observed y – predicted y = y – $\hat{y}$. So the data (y) are: the predicted values ($\hat{y}$) [the pattern] plus the residuals [deviations from the pattern]
• DATA = FIT + RESIDUALS
• Residual plot: a scatterplot of residuals versus x values
• Ideally…
.. Residuals are close to zero for all values of x
.. Have no pattern when plotted against the explanatory variable x (random scatter)
..
Residuals have a normal distribution with mean 0
Outliers and Influential Points
• Outlier: in regression, a point that lies far from the fitted line, often producing a large residual
• Influential Point: a point whose removal would markedly change the position of the regression line
• Extrapolation is the use of the regression line for prediction outside the range of the explanatory variable. This can produce nonsensical results
• Aggregation: associations based on averaged data
• Problem: a scatterplot of just the averages hides much of the variability in the data
• In general, regression with aggregate data overstates the strength of the association (larger r^2)
Lurking variables
• A variable not among the explanatory or response variables that influences the interpretation
• Solution: plot the residuals against time and other variables that may influence the results
Common relationships between X and Y
(a) Association between X and Y (partially) due to “X causes Y”
(b) Association between X and Y (partially) explained by a “lurking variable” (Z)
(c) Association between X and Y is mixed up with and cannot be distinguished from the effect of an additional variable (Z)
Establishing causation
The best (and only?)
method of clearly establishing causation is to conduct a carefully-designed randomized experiment that changes X, the explanatory variable, and controls for the effects of possible lurking variables
Establishing causation – the backup plan
• The association is strong
• The association is consistent across many studies
• Higher doses are associated with stronger responses
• The alleged cause precedes the effect in time
• There is a plausible causal relationship
The ecological fallacy
• Sociologists: “ecology” = study of groups
• Data on group behavior is called ecological data
• Ecological fallacy: concluding (perhaps incorrectly) that relationships holding for groups necessarily hold for individuals in those groups
• Aggregate data + lurking variables = ecological fallacy
• Aggregate data may be easier to obtain than data on individuals, but such inferences are only weakly supported, at best
2 types of logarithms
• Logarithms were invented to reduce multiplication and division calculations to addition and subtraction calculations before the time of electronic calculators
• Basic principle of logarithm use:
Log (A x B) = Log A + Log B
Log (A^b) = b x Log (A)
• Two major types of logarithms
• Common (base 10) (usually called “log”)
• Natural (base e = 2.718...) (usually called “ln”)
• We will now consider log (base 10) transformations
• Later we will start using ln (base e) in formulas
• If log (A) = c, then A = 10^c
• If ln (A) = c, then A = e^c
Checking regression assumptions
We usually check:
• Relation between X and Y is linear
• Residuals have constant SD
• Residuals have a normal distribution (e.g.
examine plots of the residuals)
If assumptions are not met:
• Pretend they are (“ostrich”)
• Consider more complex models
• Transform data to conform to assumptions
Nonlinear transformations
• Earlier in the course we discussed linear transformations (y = a + bx)
• Here we consider nonlinear transformations
Nonlinear transformations can
1) alter the shape of distributions (to make skewed distributions more symmetric)
2) change the form of the relationship between two variables (to make it linear)
3) alter the residuals (to make them normal with constant SD)
Transformations
• Logarithmic transformation works very well for some financial data and some biological data (makes “exponential growth” data linear)
• When the relationship between Y and X is not linear, consider transformations of the form Y^k and X^k where: k = ... -3, -2, -1, -½, log, ½, 1, 2, 3 ... (the “ladder of power transformations”)
• A specific experimental condition (intervention) is called a treatment
• An experiment imposes some “treatment” on individuals in order to observe their responses
• An experiment allows us to control lurking variables
• In principle, randomized controlled experiments are the “gold standard” of evidence to support “causation”
• Experiments may not always be ethical or practical
Principles of experimental design
• 1st Control – directly compare two or more treatments – helps control effects of lurking variables
• 2nd Randomization – use randomization to assign individuals (experimental units) to treatments
• 3rd Replication – replicate each treatment on many individuals to reduce effect of chance variation (also called repetition)
Control group
• Control - 1st principle of experimental design
• In a “controlled experiment”, two or more groups of individuals (subjects, experimental units) are compared
.. Treatment group: subjects receive a specific intervention
..
Control group: subjects do not receive the specific intervention and are compared to the treatment group
• Controlled comparisons allow us to eliminate (or reduce) effects of specific treatment assignments, selection of subjects, placebo effects and potential biases (systematic favoring of a certain outcome)
• If studies are uncontrolled, results may be meaningless
Assignment of treatments
• The 2nd principle of experimental design concerns assignment of subjects to treatments
• We want the treatment groups to be as alike as possible in every way (except for the treatment) for a fair comparison
• We could do it by matching (e.g. by subject’s age, sex, smoking), but matching is not enough (unknown lurking variables cannot be matched)
• Instead, use chance to decide - randomization
• Assignment of treatments using randomization helps ensure balance of known and unknown factors in the treatment groups
Replication
• Randomization produces treatment groups that are similar in all respects except treatment received
• Therefore differences in the response must be due to either the treatments or the play of chance
• Replication of the treatments on many subjects (large sample size) reduces the role of chance variation
• Replication gives the experiment the power to detect differences between the treatments
• A treatment effect so large that it would rarely occur by chance is said to be “statistically significant”
• The placebo effect is a measurable, observable, or felt improvement not attributable to a treatment
Blinding
• Blinding: comparison of treatments can be distorted if subjects or persons administering or evaluating treatment know which treatment is being allocated – especially for subjective endpoints
• Blinding avoids many sources of unconscious biases
• Single-blind: subjects do not know which treatment they have received
• Double-blind: neither subjects nor experimenters know which treatments have been received
Population and sample
•
Population: entire group of individuals on which we desire information
• Sample: part of population on which we actually collect data
• Sampling design: method used to choose sample from population
• Census: survey of an entire population
• Why sample, instead of taking a census? Time, expense, and sometimes sampling units are changed by their measurement
Simple random sample (SRS)
• In an SRS of size n:
1) each individual in the population has an equal chance of being chosen
2) every set of n individuals has an equal chance of being the sample chosen
Hite Report and non-response bias
• Sampling frame: the “list” of individuals from whom the sample is selected
Drawback of simple random sampling
• Weakness of SRS: it does not use relevant information about the population - such as the existence of a small group of people who are poorer than the others - to ensure proper balance that pure random sampling may miss
• A sampling method that uses this type of information is called stratified random sampling
.. Individuals are divided into groups called strata
.. Often (but not always), an SRS is taken within each stratum
• National surveys can be even more complicated, using multistage sampling
Stratified random samples
Basic idea: sample important groups separately, then combine these samples
1) Divide population into groups of similar individuals, called strata
2) Choose a separate simple random sample within each stratum
3) Combine these simple random samples to form the full sample (in the correct proportions)
Multistage samples
One way to take a nationwide multistage sample:
Stage 1: Take a sample from the 3000 counties in the US
Stage 2: Take a sample of townships within each county chosen
Stage 3: Take a sample of city blocks (or census blocks) within each township chosen
Stage 4: Take a sample of households within each block
At each stage, take a simple random sample
Data
• Data can be produced in many ways:
1. Anecdotal information
2. Available data
3. Observational studies
4.
Controlled experiments
5. Randomized controlled experiments
• Major differences in quality of information produced and ultimately the reliability of conclusions that can be drawn (lower on list is better)
• Randomized controlled experiments provide by far the most reliable information
Blocking in experimental designs
• Blocking: a block is a group of individuals known to be similar in some way that is thought likely to influence the response variable
• In a “randomized block design”, randomization is carried out separately within each block
• Example: matched-pairs design
.. Blocks consisting of two units matched as closely as possible, e.g., using identical twins
Observational studies (e.g. sample surveys) versus experiments
• An observational study collects information from individuals, making no attempt to influence the responses
• An experiment imposes an intervention (e.g. treatment) on individuals in order to observe their responses
• Sample surveys are a type of observational study
Block designs
• The device of pairing observations is a special case of blocking
• A block is a portion of the experimental material (e.g., the 2 shoes of one boy) that is expected to be more homogeneous than the aggregate (all shoes of all boys)
• By confining treatment comparisons within such blocks, greater precision can be obtained
• In the paired design the block size is two, and we compare two treatments A and B
Design of studies
Methods for producing data are called designs
Major elements of study design
1) Who or what is the object of study (individuals)?
2) Will the study be observational or experimental? (if experimental - how will treatments be assigned?)
3) How will the individuals be selected?
4) How many individuals will be studied?
5) What variables will be measured?
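The randomized block and matched-pairs designs described above can be illustrated with a short Python sketch. It is hypothetical throughout: the unit labels, the blocks, the treatment names "A" and "B", and the helper `randomize_within_blocks` are assumptions made for illustration, and the fixed seed is there only to make the example reproducible.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Assign treatments at random, separately within each block.

    blocks: dict mapping a block label to a list of experimental units
    treatments: list of treatment labels; each block's size should be a
                multiple of len(treatments) so the block stays balanced
    """
    rng = random.Random(seed)
    assignment = {}
    for label, units in blocks.items():
        # Build a balanced list of treatments for this block, then shuffle it
        reps = len(units) // len(treatments)
        labels = treatments * reps
        rng.shuffle(labels)
        for unit, treatment in zip(units, labels):
            assignment[unit] = treatment
    return assignment

# Matched-pairs design: blocks of size two (hypothetical twin pairs)
pairs = {
    "pair1": ["twin1a", "twin1b"],
    "pair2": ["twin2a", "twin2b"],
    "pair3": ["twin3a", "twin3b"],
}
assignment = randomize_within_blocks(pairs, ["A", "B"], seed=1)
for unit, treatment in assignment.items():
    print(unit, treatment)
```

Because the shuffle happens inside each block, every pair receives one A and one B, so the comparison of A with B is always made between units that were matched to begin with.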
Choice of blocks
• Blocks should be chosen on the basis of the most important (known) unavoidable source of variation among the individuals (experimental units)
• Randomization then averages out the remaining sources of variability to allow “unbiased” (i.e., un-confounded) estimation of treatment effects
• Blocks allow greater precision, because a source of systematic variation is removed (reduced variability) from the experimental comparison
Sampling distribution
• What would happen if an experiment (or a sample) were repeated many times? (a “thought experiment”)
• Take repeated samples of the same size from the same population:
– 1st sample, calculate the statistic of interest
– 2nd sample, calculate the statistic of interest, and so on ...
• The statistic will vary from sample to sample
• The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population
• The sampling distribution often has a predictable pattern
Some terminology and concepts
• Experimental units (e.g., individual subjects) are the objects of the study
Placebo effects
• A placebo is a medically inert substance, such as a sugar pill, used to replace medication in a clinical research trial
The major concept of statistical inference
• A sampling distribution characterizes the behavior of a statistic
Cautions for sample surveys
1) Selection bias: some groups in the population are over- or under-represented in the sample
2) Non-response bias: non-respondents may differ in important ways from respondents
3) Response bias: e.g., wording of questions, telescoping in the recall of events
Parameters and statistics
Parameter: number that describes the population
Statistic: number that describes a sample
Statistical inference: use information from a sample (a statistic) to make an inference about a population (a population parameter)
Sample ..
Population
• A sampling distribution is inherently unobservable, because there will (in almost all cases) be only one survey, one experiment, one observational study ...
• Probability theory provides tools for calculating the theoretical form of a sampling distribution
• Understanding the behavior of a statistic under (hypothetical) repeated samplings (the sampling distribution) helps understand the precision and reliability of the statistic
Bias and variability
• Two measures of the reliability of a statistic
.. Bias – the distance of the center of the sampling distribution from the true parameter
.. Variability – the variance of the sampling distribution
• Bias is often thought of as a measure of validity of a study (e.g. reduced by using random sampling)
• Variability captures the spread in the sampling distribution (e.g. reduced by increasing sample size)
• Survey results come with a “margin of error” (± 3%)
• If bias = 0 and variability is small, the values of a statistic will be tightly clustered around the “truth”
Size doesn’t matter
• Population size doesn’t matter
• The variability of a statistic from a random sample doesn’t depend on the size of the population (provided the population is substantially larger than the sample)
• Important consequences for surveys: an SRS of 2500 from the more than 210 million adults in the US gives results as precise as an SRS of 2500 from the 665,000 inhabitants of San Francisco
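The claim that precision depends on the sample size rather than the population size can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not the source's own calculation: the two population sizes, the sample size n = 500, the true proportion 0.5, and the helper `sd_of_sample_proportion` are all hypothetical choices, scaled down so the code runs quickly.

```python
import random
import statistics

def sd_of_sample_proportion(pop_size, n, p=0.5, reps=300, seed=0):
    """Approximate the variability of a sample proportion by repeated sampling.

    The population is individuals 0..pop_size-1; those with index below
    p * pop_size are counted as having the trait. Each repetition draws a
    simple random sample (without replacement) of size n.
    """
    rng = random.Random(seed)
    cutoff = int(p * pop_size)
    proportions = []
    for _ in range(reps):
        sample = rng.sample(range(pop_size), n)
        proportions.append(sum(i < cutoff for i in sample) / n)
    # The SD of the simulated proportions estimates the SD of the sampling distribution
    return statistics.stdev(proportions)

small = sd_of_sample_proportion(pop_size=20_000, n=500)
large = sd_of_sample_proportion(pop_size=2_000_000, n=500)
theory = (0.5 * 0.5 / 500) ** 0.5  # sqrt(p(1-p)/n), which ignores population size

print(round(small, 4), round(large, 4), round(theory, 4))
```

Both simulated values should land close to the theoretical one even though one population is 100 times larger than the other, which is the thought experiment of repeated sampling carried out numerically.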