Various topics
Petter Mostad
2005.11.14

Overview
• Epidemiology
• Study types / data types
• Econometrics
• Time series data
• More about sampling
  – Estimation of required sample size

Epidemiology
• Epidemiology is the study of diseases in a population
  – prevalence
  – incidence, mortality
  – survival
• Goals
  – describe occurrence and distribution
  – search for causes
  – determine effects in experiments

Some study types
• Observational studies
  – Cross-sectional studies
  – Cohort studies
  – Longitudinal studies
  – Case / control studies
• Experimental studies
  – Randomized, controlled experiments
  – Interventions

Cross-sectional studies
• Examines a sample of persons, at a single timepoint
• Time effects rely on memory of respondents
• Good for estimating prevalence
• Difficult for rare diseases
• Response rate bias

Cohort studies and longitudinal studies
• A sample (cohort) is followed over some time period.
• If queried at specific timepoints: Longitudinal study
• Gives better information about causal effects, as report of events is not based on memory
• Requires that a substantial group develops disease, and that substantial groups differ with respect to risk factors
• Problem: Long time perspective

Case – control studies
• Starts with a set of sick individuals (cases), and adds a set of controls, for comparison.
• Cases and controls should be from the same population
• Matching controls
• Good method for rare diseases
• Problem: Bias from selection

Measures of risk
• Relative risk
• Odds ratio
• Incidence rate ratio
• Attributable risk

Econometrics
• “Econometrics is the field of economics that concerns itself with the application of mathematical statistics and the tools of statistical inference to the empirical measurement of relationships postulated by economic theory”
• Is the unification of
  – economic statistics
  – quantitative economic theory
  – mathematical economics

About econometrics
• Variations and extensions of the regression model
  – heteroscedasticity
  – autocorrelation models
  – panel data
  – logistic regression
  – non-linear regression models
  – multivariate regression
• Matrix computation (linear algebra) is an almost indispensable tool
• Time series data
• Simultaneous equations models

Heteroscedasticity
• Recall: When the variances of independent errors in the model vary, the model is heteroscedastic.
• Example: In a regression model of house size against income, the variance of house sizes might increase with income
• In case of heteroscedasticity, ordinary regression models are not optimal.
• Previously, we mentioned variable transformation as a possible solution
• Much more advanced solutions exist, when the heteroscedasticity is known or can be estimated: Generalized least squares, …

Autocorrelations
• Recall: When, for example, the data is from a time series, the random errors for adjacent time steps might be correlated!
• Improvements in the model might reduce the problem
• Standard regression methods are not optimal
• Modelling and estimating the autoregression gives improved results

Panel data
• Data collected for the same sample, at repeated time points
• Corresponds to longitudinal epidemiological studies
• A combination of cross-sectional data and time series data
• Increasingly popular study type

Analyzing panel data
• Fixed effects: Standard regression, but using a constant term differing for each individual
  – We get a parameter for each person!
• Random effects: A stochastic variable models variation connected to individual
  – The individual variation is assumed drawn from a distribution with fixed variance
  – A generalization of least squares is needed for computations

Analyzing panel data
• Heteroscedasticity might be a problem here as well
• Autocorrelations
• Dynamic models: Lagged variables

Logistic regression
• What if the dependent variable is an indicator variable?
• The model then has two stages: First, we predict a value $z_i$ from predictors as before, then the probability of indicator value 1 is given by $e^{z_i} / (1 + e^{z_i})$
• Given data, we can estimate coefficients in a similar way as before

Non-linear regression models
• Ordinary regression is very useful, but it is limited by the linear form of the equations
• Sometimes, variable transformations can bring the connection between variables to a linear form
• Other times, this is not possible: The relationship describes the dependent variable as some function of independent variables and some random error.
• The model may still be estimated by minimizing the errors. This is non-linear regression.
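
A minimal Python sketch of the two-stage logistic model from the logistic regression slide above, under stated assumptions: the function name `logistic_probability` and all coefficient values are hypothetical and chosen only for illustration. It shows how a linear predictor z is mapped to a probability; it does not estimate coefficients from data.

```python
import math

def logistic_probability(coefficients, intercept, predictors):
    """Return P(indicator = 1) for one observation (illustrative helper)."""
    # Stage 1: linear predictor z, computed as in ordinary regression.
    z = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    # Stage 2: map z to a probability with e^z / (1 + e^z).
    # For z >= 0 the equivalent form 1 / (1 + e^(-z)) avoids overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients and predictor values, for illustration only.
p = logistic_probability([0.8, -0.5], -1.0, [2.0, 1.0])
print(round(p, 4))
```

The result always lies between 0 and 1, which is why this form suits an indicator (0/1) dependent variable.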
Multivariate regression
• Instead of one dependent variable, one can have a vector of dependent variables
• A theory of multivariate multiple regression can be developed (with the help of matrix algebra): Many similar results to ordinary multiple regressions
• Captures the dependencies between dependent variables

Simultaneous equations models
• Often, you want to describe interdependencies between variables, rather than explaining one variable in terms of others
• Example:
  – Demand is a function of various variables, including price
  – The same is the case with supply
  – Setting demand = supply creates simultaneous equations
• Identifiability?
• Estimation: Least squares is not optimal; other methods exist

Time series models
• Time series issues:
  – Identifying trends, cycles, etc.
  – Predicting future values
• Autoregressive models:
  – Explicit models for time dependencies:
    AR(1): $X_t = \phi_1 X_{t-1} + \varepsilon_t$, with $\mathrm{Corr}(X_t, X_{t-j}) = \phi_1^j$
    AR(2): $X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \varepsilon_t$
• (Box-Jenkins, ARIMA models)

The runs test (for random samples)
• In a random sample, the probability that an observation is above or below the median is independent of whether the previous observation is.
• A run is a (maximal) sequence of observations such that all are above the median, or all are below.
• For n observations, the number of runs has a null distribution under the assumption of no autocorrelation. With too few runs, the null hypothesis of no autocorrelation can be rejected. (Table in Newbold).
• For large samples, a formula based on a normal approximation can be used.

Sampling in practice
• Newbold mentions:
  1. Information required?
  2. Relevant population?
  3. Sample selection?
  4. Obtaining information?
  5. Inferences from sample?
  6. Conclusions?
• Sampling / nonsampling errors

Types of sampling
• Simple random sampling
• Stratified sampling
• Cluster sampling
• Two-phase sampling (using pilot studies)
• Each requires somewhat adjusted formulas for estimation

Correcting for finite population in estimations
• Our estimates of, for example, population variances, population proportions, etc. assumed an “infinite” population
• When the population size N is comparable to the sample size n, a correction factor is necessary. (Why?)
• Examples:
  – Variance of population mean estimate: $\hat{\sigma}^2_{\bar{X}} = \frac{s^2}{n} \cdot \frac{N - n}{N}$
  – Variance of population proportion estimate: $\hat{\sigma}^2_{\hat{p}} = \frac{\hat{p}(1 - \hat{p})}{n - 1} \cdot \frac{N - n}{N}$

Estimation of required sample size
• An important part of experimental planning
• The answer will generally depend on the parameters you want to estimate in the first place, so only a rough estimate is possible
• However, a rough estimate may sometimes be very important to do
• A pilot study may be very helpful

Example: Estimating the mean of a normally distributed population
• We want to estimate the mean $\mu$
• We want a confidence interval to extend a distance $a$ from the estimate
• We guess at the population variance $\sigma^2$
• A sample size estimate: $n = \frac{Z_{\alpha/2}^2 \sigma^2}{a^2} \approx \frac{4\sigma^2}{a^2}$ at 95% confidence
• If we have a population of size N, and want a specified $\sigma^2_{\bar{X}}$, we get $n = \frac{N \sigma^2}{(N - 1)\sigma^2_{\bar{X}} + \sigma^2}$
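
The two sample size formulas above can be tried out with a short Python sketch. The function names and the example numbers (a guessed standard deviation of 10, a half-width of 2, a population of 1000) are assumptions made for illustration and do not come from the slides; results are rounded up to the next whole observation.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, a, confidence=0.95):
    """n = Z^2 * sigma^2 / a^2 for a confidence interval of half-width a."""
    # Z_{alpha/2}: upper alpha/2 quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / a) ** 2)

def sample_size_finite(sigma, target_var_mean, N):
    """n = N*sigma^2 / ((N-1)*var_mean + sigma^2) for a population of size N."""
    n = N * sigma ** 2 / ((N - 1) * target_var_mean + sigma ** 2)
    return math.ceil(n)

# Hypothetical planning values, for illustration only.
print(sample_size_for_mean(sigma=10, a=2))                    # exact quantile
print(math.ceil(4 * 10 ** 2 / 2 ** 2))                        # rule of thumb 4*sigma^2/a^2
print(sample_size_finite(sigma=10, target_var_mean=1.0, N=1000))
```

The rule of thumb gives a slightly larger n than the exact quantile because it rounds 1.96² up to 4. For a very large N the finite-population formula approaches σ² divided by the target variance of the mean.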