Lund University
Mathematical Sciences
Fall 2016
Inference Theory
Computer Lab 2
1 Introduction
This computer lab consists of two main parts:
• Parametric and nonparametric tests
• Regression analysis.
Try to answer all questions, and ask when you do not understand. We will analyse only real-life data,
received from a research department at a university hospital.
The data that you are supposed to analyse in this lab come from the Clinical Research Center
(CRC) at the University Hospital in Malmö (UMAS). The data are real, and researchers at CRC
have done (part of) their research on them. Thus a main goal of this lab is to illustrate the types
of analyses that one can do as a research scientist at a high-quality research institute.
Note that there are no "right answers" to the questions below. You are supposed to analyse the
material. You can of course do this in a more or less clever way. The lab instructions tell you what
you should think of when you perform the analyses, but again there are no "right answers" to the
research questions. You might even find a new connection or result that was not previously known!
2 The data
2.1 The data material
The data consist of n = 4547 individuals who have been followed until either (i) they die (from a heart
disease) or (ii) they leave the study for some reason. For each individual there are measurements
on a number of phenotypes:
T2D
FASTINGGLUCOSE
CHOL
TRIGL
HDLCHOL
LDL
FASTINGINSULIN
BMI
WH
SEX
SMOKING
PHYSACTIVITY
There are also, for each individual, measurements on 7 genotypes. These genes have actual names,
known to the researchers at CRC, but for ethical reasons they are coded for us as gene 1, gene 2,
..., gene 7. In the data material we have, they are labeled as
g1, g2, g3, g4, g5, g6, g7
The individuals are identified by two identifiers:
PATIENT
FAMILY
The identifier PATIENT is a unique identifier for each individual, while the identifier FAMILY can
be shared by several individuals (that belong to the family identified by the variable FAMILY).
2.2 Load data into R
To load the data into R's data working memory, start R and do
dat<-read.table("datafil.txt")
You can list all variables for the data by doing
names(dat)
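If names(dat) only shows generic names like V1, V2, ..., the file most likely has a header row that read.table did not detect automatically. In that case (an assumption about the file format, so check names(dat) again afterwards) the following should work
dat<-read.table("datafil.txt",header=TRUE)
names(dat)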
3 Nonparametric tests and estimators
In this first part you are supposed to get a picture of the distributions of the different phenotypes
and to test whether there are differences between those distributions.
3.1 Nonparametric estimates
The empirical distribution function (ecdf) is an estimator of the true and unknown distribution
function. To calculate the empirical distribution function, for instance for BMI, do
plot(ecdf(dat$BMI))
If you want to look at the ecdf for BMI for the group that has the risk genotype for g1 (corresponding
to the value g1 = 1), do
plot(ecdf(dat$BMI[dat$g1==1]))
If you want to see, in the same figure, the ecdf for BMI for the two groups that have and that do
not have the risk genotype (corresponding to g1 = 1 and g1 = 0), do
plot(ecdf(dat$BMI[dat$g1==0]),xlim=c(15,60))
par(new=TRUE,col="red")
plot(ecdf(dat$BMI[dat$g1==1]),xlim=c(15,60))
Do this for some interesting groups. Some suggestions for possible group divisions: you can for
instance divide by gender (variable SEX, where 1 means man and 2 means woman), by whether the
individual is a smoker or not (variable SMOKING), or by whether the individual has type II diabetes
or not (T2D). Other interesting continuous phenotypes, apart from BMI, whose distributions and
distributional differences between groups you can study, are CHOL, TRIGL, HDLCHOL, LDL and
FASTINGINSULIN.
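As a sketch (using the variable names listed above, with SEX coded 1 for men and 2 for women), you can for instance compare the BMI distributions of men and women in one figure with a legend:
plot(ecdf(dat$BMI[dat$SEX==1]),xlim=c(15,60),main="ecdf of BMI by SEX")
lines(ecdf(dat$BMI[dat$SEX==2]),col="red")
legend("bottomright",legend=c("men (SEX=1)","women (SEX=2)"),col=c("black","red"),lty=1)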
3.2 Nonparametric tests
The above gives you estimates of the distribution functions, and the plotted figures can give you an
indication of whether there are differences in the distributions between groups.
To make a formal test of differences between groups you can use a two-sample Kolmogorov-Smirnov
test (the R function is ks.test). Do
plot(ecdf(dat$HDL[dat$SMOKING==0]),xlim=c(0,4))
par(new=TRUE,col="red")
plot(ecdf(dat$HDL[dat$SMOKING==1]),xlim=c(0,4))
ks.test(dat$HDL[dat$SMOKING==1],dat$HDL[dat$SMOKING==0])
Do the figures and the formal test agree?
You can use the ecdf to estimate quantiles, by taking the inverse of the ecdf at a fixed point. More
convenient, and amounting to the same thing, is to calculate the empirical quantiles. Convince
yourself that you understand the equivalence, and ask the teacher if it is not clear to you. As an
example, if you want to estimate the lower 10% quantile for HDL in the group of smokers, do
quantile(dat$HDL[dat$SMOKING==1], 0.1, na.rm=TRUE)
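To convince yourself of the equivalence, here is a small sketch: invert the ecdf numerically and compare with quantile(), where type=1 corresponds exactly to the inverse-ecdf definition.
x<-dat$HDL[dat$SMOKING==1]
x<-x[!is.na(x)]
Fn<-ecdf(x)
min(x[Fn(x)>=0.1])              # smallest value whose ecdf value is at least 0.1
quantile(x,0.1,type=1)          # inverse-ecdf (type 1) empirical quantile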
You can use a one-sample Kolmogorov-Smirnov test to check if data are Normally distributed.
ks.test(dat$HDL[dat$SMOKING==1],"pnorm")
plot(ecdf(dat$HDL[dat$SMOKING==1]),xlim=c(0,4))
t<-seq(0,4,by=0.01)
m<-mean(dat$HDL[dat$SMOKING==1],na.rm=TRUE)
v<-var(dat$HDL[dat$SMOKING==1],na.rm=TRUE)
par(new=TRUE,col="blue")
plot(t,pnorm(t,m,sqrt(v)),xlim=c(0,4))
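Note that ks.test(x,"pnorm") with no further arguments compares against the standard Normal N(0,1). To compare against a Normal with the mean and variance estimated above, you can pass the estimated parameters along (keeping in mind that plugging in estimates makes the test only approximate):
ks.test(dat$HDL[dat$SMOKING==1],"pnorm",mean=m,sd=sqrt(v))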
A better way to see graphically whether there is a difference from the Normal distribution is to use a qq-plot.
qqnorm(dat$HDL[dat$SMOKING==1])
Use the help function on "qqnorm", and convince yourself that you understand what a qq-plot does.
Ask if you don’t understand.
4 Parametric tests
One can use for instance a t-test to test whether there is a difference between the expectations
of a covariate in different groups. Note that even if we see, as above, that the distributions are not
Gaussian, the t-test will still give (approximately) correct p-values, since the estimates of the
expectations are averages, and an average of i.i.d. data is approximately Gaussian by the CLT.
Note however that the parametric estimates are not as informative about the (whole) distribution,
since we only estimate the expectation of the distribution, and note also that parametric tests are
not as sensitive to differences, since we only test for differences between the expectations of the
distributions.
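If you want to illustrate the CLT argument above for yourself (a small sketch, not part of the formal analysis), you can resample means of, say, HDL among smokers and inspect their distribution with a qq-plot:
x<-dat$HDL[dat$SMOKING==1]
x<-x[!is.na(x)]
means<-replicate(1000,mean(sample(x,replace=TRUE)))   # bootstrap means
qqnorm(means)                                         # should look close to a straight line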
To do a t-test, do
t.test(dat$HDL[dat$SMOKING==1],dat$HDL[dat$SMOKING==0])
Try to do some more analyses. Do you get the same results as in the previous section, using nonparametric tests?
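For instance (assuming T2D is coded 0/1), you could compare the means of BMI between the diabetes groups with a t-test, and the whole distributions with the two-sample Kolmogorov-Smirnov test from the previous section:
t.test(dat$BMI[dat$T2D==1],dat$BMI[dat$T2D==0])
ks.test(dat$BMI[dat$T2D==1],dat$BMI[dat$T2D==0])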
5 Linear regression
5.1 Univariate linear regression
We will use two variables as response variables, BMI and WH, and try to find what other variables
might influence them, and try to estimate or describe that influence.
A univariate linear regression model can be fitted to the data as
fit<-lm(BMI~FASTINGGLUCOSE,data=dat)
summary(fit)
Give an interpretation of the results! Do help on lm. Try using other explanatory variables. To
graphically see if the model fits the data, one can plot the residuals
plot(fit$fit,fit$res)
What should the residuals look like? Ask the teacher what you can do if they do not look like
they should. To see if the residuals are Normally distributed, you can do a qq-plot
qqnorm(fit$res)
A formal test of whether the model fits the data is given by the F-statistic in
summary(fit)
Ask if you do not understand.
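For instance (a sketch with another, arbitrarily chosen, explanatory variable), you can repeat the fit and the residual checks:
fit2<-lm(BMI~FASTINGINSULIN,data=dat)
summary(fit2)
plot(fitted(fit2),residuals(fit2))   # residuals against fitted values
qqnorm(residuals(fit2))              # Normality check for the residuals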
5.2 Multivariate linear regression
A multivariate regression model can be fitted to the data (using the least squares method), for instance using the covariates FASTINGGLUCOSE, CHOL, TRIGL, HDLCHOL, LDL and FASTINGINSULIN as explanatory variables, by
fit<-lm(BMI~FASTINGGLUCOSE+CHOL+TRIGL+HDLCHOL+LDL+FASTINGINSULIN,data=dat)
summary(fit)
Interpret the results! If you compare this model with a model with only FASTINGGLUCOSE as
explanatory variable, you will find that FASTINGGLUCOSE is no longer significant in the
multivariate regression model. Why?
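One way to start investigating this (a sketch, not the answer itself) is to look at the pairwise correlations between the covariates, using for each pair the observations where both are available:
covs<-dat[,c("FASTINGGLUCOSE","CHOL","TRIGL","HDLCHOL","LDL","FASTINGINSULIN")]
round(cor(covs,use="pairwise.complete.obs"),2)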
The question of which covariates you should end up using in a multivariate model is known
as a model selection problem. To do model selection in a regression model it is convenient to use
the commands add1 and drop1. One of the standard methods is so-called stepwise backward
elimination: start from a "large" model and remove covariates one at a time, for as long as you can.
In the example above one would do
fit<-lm(BMI~FASTINGGLUCOSE+CHOL+TRIGL+HDLCHOL+LDL+FASTINGINSULIN,data=dat)
drop1(fit,test="Chisq")
fit<-lm(BMI~CHOL+TRIGL+HDLCHOL+LDL+FASTINGINSULIN,data=dat)
drop1(fit,test="Chisq")
etc.
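If you want to cross-check the backward elimination you do by hand with drop1, R also has the function step(), which automates the search. Note that step() uses the AIC criterion by default, not the tests above, and that it requires all candidate models to be fitted to the same observations, so the sketch below first restricts to complete cases (variable names as listed above):
covs<-c("BMI","FASTINGGLUCOSE","CHOL","TRIGL","HDLCHOL","LDL","FASTINGINSULIN")
dat.cc<-na.omit(dat[,covs])       # complete cases, so all submodels use the same rows
fit.cc<-lm(BMI~.,data=dat.cc)     # full model with all the listed covariates
step(fit.cc,direction="backward") # backward elimination by AIC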
1. Use the above method to find a model for BMI. Start with a multivariate model that uses all
variables as explanatory and eliminate them one at a time, until it is no longer possible.
2. Do the same thing (i.e. find a good model) for WH.
3. Find separate models for BMI (and for WH), for men and for women. What are your conclusions? To do an analysis on a subset of all data, for instance for only males, one can use the
code
fit<-lm(BMI~FASTINGGLUCOSE+CHOL+TRIGL+HDLCHOL
+LDL+FASTINGINSULIN,data=dat,subset=SEX==1)
A way to incorporate a variable that you believe "should be included" but that was not significant
is to make a (non-linear) transformation of the variable. For instance, one could dichotomise a
continuous variable. Try to do this for FASTINGGLUCOSE, by dichotomising for instance at its
median
fast.m<-quantile(dat$FASTINGGLUCOSE,0.5,na.rm=TRUE)
dat$ny.fast<-as.numeric(dat$FASTINGGLUCOSE>fast.m)   # indicator: 1 if FASTINGGLUCOSE is above its median, 0 otherwise
fit<-lm(BMI~ny.fast+CHOL+TRIGL+HDLCHOL+LDL+FASTINGINSULIN,data=dat)
summary(fit)
What is your conclusion?
6 Nonparametric estimation of the probability density function
Load the package KernSmooth into R’s program memory
library(KernSmooth)
Do help on the function bkde. Choose one of the variables whose density you could be interested in
estimating, and estimate it; for instance for BMI the code is
fit<-bkde(dat$BMI[!is.na(dat$BMI)])
plot(fit,type="l")
Try to change the bandwidth to some different values.
fit<-bkde(dat$BMI[!is.na(dat$BMI)],bandwidth=1)
plot(fit,type="l")
fit<-bkde(dat$BMI[!is.na(dat$BMI)],bandwidth=0.5)
plot(fit,type="l")
fit<-bkde(dat$BMI[!is.na(dat$BMI)],bandwidth=0.3)
plot(fit,type="l")
fit<-bkde(dat$BMI[!is.na(dat$BMI)],bandwidth=0.1)
plot(fit,type="l")
fit<-bkde(dat$BMI[!is.na(dat$BMI)],bandwidth=0.05)
plot(fit,type="l")
What is your explanation for what you see?
Try to plot the kernel estimator of the density (for some choice of bandwidth) against a Gaussian
density (for instance with the expectation and the variance estimated from the data), for comparison;
a sketch is given below. Explain what you see. Is this a formal test? Why not? Explain where in the
lab you did the formal test.
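A sketch of such a comparison, assuming BMI and an (arbitrarily chosen) bandwidth of 1; the blue curve is the Gaussian density with mean and variance estimated from the data:
x<-dat$BMI[!is.na(dat$BMI)]
fit<-bkde(x,bandwidth=1)
plot(fit,type="l",xlab="BMI",ylab="density")                  # kernel density estimate
lines(fit$x,dnorm(fit$x,mean=mean(x),sd=sd(x)),col="blue")    # fitted Gaussian density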
The End