Introduction to Biostatistics, Summer 2005

The Mean Regression
or Regression to the Mean
Rama C. Nair
Epidemiology and Community Medicine
• Why Data Analysis?
– Qualitative and quantitative data
• Role of Statistics
– Quantitative information
– Variability in data
– Bias and Random error
– Population vs sample
• Regression Analysis
– Definition
– Types of regressions
– Benefits and pitfalls of regression analyses
Why data analysis
• Research Question
• Collection of information
– Qualitative and quantitative
• Only quantitative information (or information
that can be quantified) is considered here
• Analyzing the information collected to arrive
at a conclusion (decision) about the research
question
• Is childhood obesity an increasing problem in
the community?
• What might be the major causes for this
increasing trend?
• Would introduction of supervised physical
activities in the school address the issue?
• What are the important considerations for a
good school program?
Role of Statistics in Data Analysis
• Variability in data
– Measured by Probability Distributions
• Understanding the variability
– Reasons for variation
• Effects of variability on inference
– Reliable and valid inference
– Random error and Bias
Role of statistics in data analysis
• Analyzing variability in observed data and arriving at
inferences (answers to research questions) that are
reliable and valid, in the presence of the bias and
‘random’ errors that are inherent in all data
– A tall order
• Statisticians start with
– Let X1, X2,…, Xn be i.i.d. (independent and
identically distributed) as N(µ,σ), describing the
n observations
• Epidemiologists put a context and see if the statistical
model fits
Population vs sample
• Truly, statistics only come into play when we are
observing only a sample of the whole population
• Inference from sample to population is based on the
statistical properties of sample statistics based on the
probability distribution of the variables
– Descriptive analysis involving estimation of
characteristics (mean, relative risk, odds ratio, or
simply probabilities of events)
– Statistical tests of hypotheses about the
characteristics (alone or in combination with some
Bias and Random Error
• How does one deal with bias and random
error in statistical analyses?
– Try to minimize bias by choosing
appropriate design
– Alternately, using mathematical modeling,
one may ‘eliminate’ bias in analysis
– Random error cannot be avoided, but
effect on inference can be minimized by
increasing sample size
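The point about sample size can be made concrete with the standard error of a sample mean, which for i.i.d. observations is σ/√n. The sketch below is only an illustration: the value of `sigma` and the sample sizes are hypothetical, not taken from the lecture.

```python
import math

def standard_error_of_mean(sigma: float, n: int) -> float:
    """Standard error of the mean of n i.i.d. observations with standard deviation sigma."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population standard deviation
for n in (25, 100, 400):
    # Quadrupling the sample size halves the random error in the estimated mean.
    print(n, standard_error_of_mean(sigma, n))
```

Note that this only addresses random error; a larger sample does nothing to reduce bias.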
Regression analysis
• What is regression
Regression - Wikipedia
In statistics, regression analysis examines the relation of a dependent
variable (response variable) to specified independent variables
(explanatory variables). The mathematical model of their relationship is
the regression equation. The dependent variable is modeled as a
random variable because of uncertainty as to its value, given only the
value of each independent variable. A regression equation contains
estimates of one or more hypothesized regression parameters
("constants"). These estimates are constructed using data for the
variables, such as from a sample. The estimates measure the
relationship between the dependent variable and each of the
independent variables. They also allow estimating the value of the
dependent variable for a given value of each respective independent
variable.
Uses of regression include curve fitting, prediction (including
forecasting of time-series data), modeling of causal relationships, and
testing scientific hypotheses about relationships between variables.
The MEAN Regression
• Relating (regressing) the ‘dependent’ variable to the
‘independent’ variable(s)
• Simply a way of characterizing a relationship through
a mathematical (statistical) model
– Simple linear regression
– Logistic regression
– Cox regression (proportional hazards model for
survival analysis)
– Time series analysis of recurrent data
The mathematical model
• All starts with a simple observation
– If two variables are related to each other,
can one predict the value of one of the
variables if the value of the other variable
is known?
– Y=f(X), where f is a known mathematical
function
– The quest is to find the correct form of f
The mathematical model
– Where does f come from?
• Observation
–Plotting values of X and Y in a
bivariate plot to see if there is any
obvious pattern
• Theoretical considerations
–Area (rectangle) = length x width
• A combination of the two
The simple linear regression
• The simplest form of regression
– One dependent variable, Y is related to one independent
variable, X
– Plot X and Y (scatterplot)
– Is there a straight line relationship (Is Y changing
proportionally to X)?
– Y=α+βX
• Two ‘parameters’ determine the equation
• Slope of the line and the intercept of the line on the Y
(dependent variable) axis
Example of a simple linear regression
Simple Linear Regression
• Notice that not all data points are on the line, so
obviously the equation does not fit all the
data
– Not a perfect relationship
– The actual relationship is something more
– Can we use this relationship as an approximation?
– What are the risks in using this equation to
‘estimate’ the relationship?
Simple Linear Regression
What is the purpose of identifying this
relationship?
• Predict values of Y for any given X?
• Predict trends in Y based on trends in
X?
• Predict gain/loss if we introduced a
program to change values of Y in the
population, by changing values of X?
Simple Linear Regression
• If Yi is an actual observation in the previous picture, and the
equation to the blue line is Y=α+βX, then
• Yi=α+βXi+εi
The εi would be the deviation (error) of the observed value from
the ‘fitted’ value – a measure of uncertainty about the model
being a good fit to the data
• Clearly many possible lines (other than the blue line) can be
drawn and each of them will have different distribution of εi
• Which line do we choose as the best fit (one with the least
error)?
• Since there are many data points, we want a cumulative measure of error
• Does Mean squared error seem reasonable?
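To see how mean squared error ranks candidate lines, the sketch below compares two lines on a small data set. The data and the two candidate (α, β) pairs are invented for illustration only.

```python
import numpy as np

# Hypothetical observations, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def mean_squared_error(alpha, beta, x, y):
    """Average of the squared deviations e_i = y_i - (alpha + beta * x_i)."""
    residuals = y - (alpha + beta * x)
    return float(np.mean(residuals ** 2))

mse_line_1 = mean_squared_error(0.0, 2.0, x, y)  # candidate line Y = 2X
mse_line_2 = mean_squared_error(1.0, 1.5, x, y)  # candidate line Y = 1 + 1.5X
print(mse_line_1, mse_line_2)
```

Under this criterion the first candidate is the better fit because its mean squared error is smaller.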
Least squares regression
• Using minimum mean squared error as the criterion
– What is the straight line that best fits the data?
– Estimates of α (a) and β (b)
– Sample vs Population
• a and b are the best estimates of α and β based on the
observations and these values are going to be different in
different sample, even if the straight line relationship is
fixed for the population
• Sampling variation of a and b
• Measured by standard error of these estimates
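For a single independent variable, the least squares estimates have a closed form: b = Sxy/Sxx and a = ȳ − b·x̄. A minimal sketch follows, using invented data; the function name `least_squares_line` is hypothetical.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (a, b), the least squares estimates of the intercept and slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-products divided by sum of squares of x.
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: the fitted line passes through the point of means.
    a = y_bar - b * x_bar
    return float(a), float(b)

# Hypothetical sample, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares_line(x, y)
print(a, b)
```

A different sample from the same population would give different values of a and b, which is the sampling variation the standard errors describe.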
Inference on regression
• The regression coefficient
– Slope of the regression line signifies the
magnitude of change in Y expected with
changes in X
– For prediction, one needs to know the
value of β
– Estimated by b
– Standard error of b allows one to draw
conclusions as to possible true values of β
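One common way to use the standard error of b is an approximate 95% interval, b ± 1.96 × SE(b). The sketch below assumes simulated data (true slope 2.0, chosen arbitrarily) and a sample large enough for the Normal approximation; with small samples a t quantile would usually replace 1.96.

```python
import numpy as np

# Simulated data for illustration only: true intercept 1.0, true slope 2.0.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)

def slope_with_standard_error(x, y):
    """Return the least squares slope b and its estimated standard error."""
    n_obs = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    b = np.sum((x - x_bar) * (y - y_bar)) / s_xx
    a = y_bar - b * x_bar
    residuals = y - (a + b * x)
    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    residual_variance = np.sum(residuals ** 2) / (n_obs - 2)
    se_b = np.sqrt(residual_variance / s_xx)
    return float(b), float(se_b)

b, se_b = slope_with_standard_error(x, y)
ci_low, ci_high = b - 1.96 * se_b, b + 1.96 * se_b
print(f"b = {b:.3f}, SE = {se_b:.3f}, approx. 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```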
Assumptions for the linear regression
• As with many statistical procedures, the first assumption is that
the observations are statistically independent of each other
– This is essential in constructing the probability distribution of
the sample values
• It is also assumed that the random errors ε are Normally
distributed
– This is essential in calculating the actual probability
distribution; as long as the distributional form is known, one
can do this even if the distribution is not Normal (though
– However, the least squares method of estimating the
parameters that we used is optimal when the distribution is
Normal
• A third assumption for the estimated regression equation to be
reliable and valid is that the deviation of the observed values
from the fitted values remains similar for all values of X
– This is essential for the standard errors of the estimates to be valid (reliable)
Multiple Linear Regression
• That was simple.
• Now what happens if there is more than one independent
variable that might have something to do with the dependent
variable?
– Can fit a simple linear regression (SLR) for each variable, but that is wasteful and can
create confusion, especially if many of the Xs themselves are
related to each other
• A comprehensive equation, relating all of them in one equation
to the dependent variable
• Y = Xβ + ε
– (matrix notation)
• Yi=β1X1i+β2X2i+…+βkXki+εi, for the ith observation
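In matrix form, the least squares estimate of β can be computed with a standard solver. The sketch below uses `numpy.linalg.lstsq` on invented, noise-free data so that the known coefficients are recovered exactly; treating the first column of X as a constant 1 (an intercept) is an assumption made here for illustration.

```python
import numpy as np

# Hypothetical predictors, invented for illustration only.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 0.0])

# Noise-free response built from known coefficients (1, 2, -3).
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix X: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares solution of Y = X beta + epsilon.
beta_hat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

With real data the response would include random error, and the estimates would only approximate the underlying coefficients.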
Multiple Linear Regression
• Essentially the same as simple linear regression
• The regression coefficients are now ‘partial’,
in that each signifies the linear
relationship of one independent variable to
the dependent variable, with all the others in
the equation held fixed
• The methods of estimation and hypothesis
testing are essentially the same as for simple
linear regression
• Assumptions are also similar
Linear Regression
Goodness of Fit
How does one assess how good the relationship is?
Are the βs significantly different from 0?
Back to the purpose of the regression
– Explain the variability in Y as a result of variability in X (in
other words, Y and X are related)
– Amount of variability in Y (variance of Y, function of Mean
squared deviation from the mean)
– Amount of variability still ‘unexplained’ after the regression
(mean squared deviation of the residuals from the fitted line)
Linear Regression
• Unexplained variation
– If perfect fit, the sum of squares of the residuals
is zero
– If completely random (β=0), then this sum of squares is the
same as the total sum of squares of Y about its mean
– The difference between the two is a measure of variability
‘explained by’ the relationship, called regression SS
– Therefore the ratio of regression SS to the Total SS serves
as a criterion for how good the fit is
– $0 \le R^2 \le 1$, known as the ‘coefficient of determination’
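The ratio of regression SS to total SS described above can be computed directly. This is a minimal sketch on invented data; `r_squared` is a hypothetical helper name, and `numpy.polyfit` is used only to obtain the least squares line.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)           # least squares line
    fitted = intercept + slope * x
    ss_total = np.sum((y - y.mean()) ** 2)           # variability in Y
    ss_residual = np.sum((y - fitted) ** 2)          # variability left unexplained
    ss_regression = ss_total - ss_residual           # variability 'explained'
    return float(ss_regression / ss_total)

# Hypothetical sample, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
r2 = r_squared(x, y)
print(r2)
```

For simple linear regression this value equals the square of the correlation coefficient between X and Y.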
• In the linear regression, notice that we assumed Y has a Normal
distribution (by virtue of the linear regression equation and the
distribution of random errors)
• So the dependent variable has to be a continuous variable
• What if it is dichotomous, as with most epidemiologic studies
where we are looking at illness or similar entities measured on a
dichotomous scale?
Logistic Regression
• Y is now a dichotomous variable
• Clearly we can only talk about proportions (probabilities) of Y
being 1 or 0 as something we can predict
• Transforming Y with the logistic function would help with this
(mathematical derivation of why this is feasible or desirable is
available in many texts: e.g. Hosmer and Lemeshow – Applied
Logistic Regression)
Derivation of the logistic model
• Let $\pi(x) = \dfrac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}$
• The logit transformation
– $g(x) = \ln\left[\dfrac{\pi(x)}{1-\pi(x)}\right]$
– $g(x) = \beta_0+\beta_1 x$
– linear regression for g(x)
• Original outcome y
• Distribution of y not Normal
• $y = \pi(x) + \varepsilon$
• $\varepsilon = 1-\pi(x)$ with prob. $\pi(x)$ when y=1
• $\varepsilon = -\pi(x)$ with prob. $1-\pi(x)$ when y=0
Logistic regression
• Analysis steps
• n independent pairs (xi,yi)
• estimate regression coefficients and
goodness of fit of the model
• linear regression: least squares
– same as maximum likelihood if y is Normally distributed
• logistic regression: maximum likelihood
Logistic regression
• Maximum likelihood method
– Given a parametric model, the maximum
likelihood estimates of a set of parameters
are the values that maximize the probability of obtaining the
observed data
• The likelihood function = joint probability of
observations under the given probability
Logistic Regression
• $P(Y=1 \mid x) = \pi(x)$
• $P(Y=0 \mid x) = 1-\pi(x)$
• Prob. for observation $(x_i, y_i)$
– $\pi(x_i)$ if $y_i=1$
– $1-\pi(x_i)$ if $y_i=0$
– $\pi(x_i)^{y_i}\,(1-\pi(x_i))^{1-y_i}$ in general
• For n observations, the joint probability (because independent)
– $\prod_{i=1}^{n} \pi(x_i)^{y_i}\,(1-\pi(x_i))^{1-y_i}$
– This is the likelihood function l
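The likelihood above, and its logarithm, can be written as short functions of the coefficients. The sketch below uses a tiny invented data set; with both coefficients set to zero every π(x_i) is 0.5, so the likelihood is 0.5 raised to the number of observations.

```python
import numpy as np

def likelihood(beta0, beta1, x, y):
    """Product over observations of pi(x_i)^y_i * (1 - pi(x_i))^(1 - y_i)."""
    pi = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    return float(np.prod(pi ** y * (1.0 - pi) ** (1.0 - y)))

def log_likelihood(beta0, beta1, x, y):
    """Sum over observations of y_i*ln(pi(x_i)) + (1 - y_i)*ln(1 - pi(x_i))."""
    pi = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    return float(np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi)))

# Hypothetical observations, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(likelihood(0.0, 0.0, x, y), log_likelihood(0.0, 0.0, x, y))
```

Working with the log turns the product into a sum, which is easier to maximize and numerically more stable.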
Logistic regression
• Maximizing the likelihood function is achieved
by maximizing its log
• Unlike linear regression, one cannot get a
linear equation to calculate the regression
coefficients
• Need to obtain estimates by iteration because
the equation is nonlinear
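One standard iterative scheme is Newton-Raphson applied to the log-likelihood; the lecture does not say which iteration it has in mind, so this choice is an assumption. The sketch below uses invented data in which the two outcome groups overlap, so finite estimates exist; `fit_logistic` is a hypothetical name.

```python
import numpy as np

def fit_logistic(x, y, max_iter=25, tol=1e-10):
    """Maximum likelihood estimates (beta0, beta1) by Newton-Raphson iteration."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])  # intercept column plus predictor
    beta = np.zeros(2)                          # starting values
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        gradient = X.T @ (y - pi)               # derivative of the log-likelihood
        weights = pi * (1.0 - pi)
        information = X.T @ (X * weights[:, None])
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # stop once the update is negligible
            break
    return beta

# Hypothetical observations, invented for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
beta_hat = fit_logistic(x, y)
print(beta_hat)
```

At convergence the gradient is essentially zero, which is the nonlinear equation that has no closed-form solution.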
Logistic regression
• Inferences on the regression coefficients follow the same rules
as linear regression
– Estimates and standard errors of β are calculated and
approximate Normal distributions are used (Wald test)
• Interpretation of β
– Related to odds ratio as $e^{\beta}$
– Calculate odds ratio and its standard error
• Goodness of fit
– Again not as simple as simple linear regression
– Many methods available
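For the interpretation of β above, the odds ratio is obtained by exponentiating the coefficient. A common approach, assumed here rather than taken from the lecture, is to build the confidence interval on the log odds scale with the Wald standard error and then exponentiate the limits. The coefficient and standard error below are invented.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio exp(beta) with an approximate 95% interval exp(beta +/- z*se)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical estimate and standard error, for illustration only.
beta_hat, se_beta = math.log(2.0), 0.25
odds_ratio, lower, upper = odds_ratio_with_ci(beta_hat, se_beta)
print(odds_ratio, lower, upper)
```

An interval that excludes 1 corresponds to a Wald test rejecting β = 0 at roughly the 5% level.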
• In summary
– Regression is a simple way of relating variables by the use
of mathematical functions, allowing one to examine the
variability in one variable as a function of variability in the
other(s)
– Relationship could be one-one or one-many
– Allows for adjustment of confounding (assuming general
linear model)
– Some allowance for testing effect modification
– Need to be careful of the assumptions regarding data
collection, data format, patterns of variability, study design
• In summary
– Any model can be used to fit the data
– Interpretation depends primarily on the theoretical foundation
for the model
– Parameters of the model may have identifiable
characteristics (for example the odds ratio in logistic
regression) and meaningful definitions when the theoretical
foundation is solid
– While confounding can be adjusted for and effect modification
detected, this is very much model dependent