Fitting data to distributions

When you take your field measures (e.g., body mass, feather length, or the distance traveled by an individual), these data points can be described by a probability distribution. A probability distribution describes the probability of each possible measure. Statisticians have described several theoretical probability distributions that are characterized mathematically by parameters such as the mean and variance, which often have important biological meaning. Oftentimes, however, the data you collect come from an unknown distribution. Distribution fitting is the process of quantitatively estimating parameters such as the mean and variance under the assumption that the data follow a particular theoretical distribution. In this exercise we are going to study three commonly used theoretical distributions: the normal, the Poisson, and the negative binomial.

Normal Distribution

Probably the most commonly used statistical distribution, the normal distribution describes continuous data and has two parameters: the mean and the standard deviation. Let's study the shape of the distribution by simulating some data. We are going to use the function rnorm() to randomly simulate 1000 data points from a normal distribution with a mean of 20 and a standard deviation of 1.
> normal.sim=rnorm(1000,mean=20,sd=1)
Look at the data. Note that each data point has decimal places because it comes from a continuous distribution. Can you tell the shape of the distribution just by looking at the raw numbers?
> normal.sim
A histogram will help us better assess the shape of the distribution. > hist(normal.sim)
Note the shape of the distribution. Try guessing the mean of the distribution by looking at the histogram. It seems that it is between 19 and 21. Let’s calculate it. > mean(normal.sim)
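The same idea recovers the standard deviation with sd(). A minimal self-contained sketch (set.seed() just makes the random draws reproducible; without it your numbers will differ slightly from run to run):

```r
set.seed(42)
normal.sim <- rnorm(1000, mean = 20, sd = 1)

# With 1000 draws, the sample estimates sit close to the true parameters
mean(normal.sim)  # close to 20
sd(normal.sim)    # close to 1
```

The more points you simulate, the closer these sample estimates get to the true mean and standard deviation.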
A normal distribution is assumed by many standard statistical tests such as the t-test, linear regression, and principal component analysis. How do you know if your data are normally distributed? We are going to study two approaches: a visual assessment and a statistical test (the Shapiro-Wilk test). Let's start with the visual approach. There are two ways to assess visually whether your data are normally distributed (remember that the normal distribution applies to continuous data). You can make a histogram the way we just did; however, that is a very subjective check. An alternative is a Quantile-Quantile (Q-Q) plot, which compares the quantiles of two distributions. On the x-axis are the quantiles of the theoretical distribution (here, the normal) and on the y-axis the quantiles of the sample distribution (your field data). If your data follow a normal distribution, you would expect the points in the Q-Q plot to fall almost exactly along the diagonal of the figure. Given that normal.sim was simulated from a normal distribution, we expect its points to follow nearly a straight diagonal line. To make the plot we are going to use the function qqnorm(), which draws a Q-Q plot with the theoretical normal quantiles on the x-axis.
> qqnorm(normal.sim)
Now let's compare the plot we just made with a Q-Q plot for a dataset that is not normal. To do this we are going to simulate data using the rgamma() function, which generates data following a gamma distribution instead of a normal. The gamma distribution also describes continuous data and has two parameters: shape and rate (instead of the normal's mean and standard deviation). Let's simulate 1000 data points with a shape of 2 and a rate of 2. Note that we are using the gamma distribution only to produce a dataset that we know does not follow a normal distribution.
> nonormal.sim=rgamma(1000,shape=2,rate=2)
Now let's make a histogram of this new non-normal dataset.
> hist(nonormal.sim)
How does the shape of this distribution compare with that of normal.sim? Now let's compare Q-Q plots for normal.sim and nonormal.sim.
> par(mfrow=c(1,2))
> qqnorm(normal.sim,main="Normal")
> qqnorm(nonormal.sim,main="Non-normal (gamma)")
Note how the Q-Q plot for the normal data follows the diagonal line while the Q-Q plot for the non-normal data does not. How much deviation from the diagonal can we tolerate and still call the data normally distributed? This is a hard question to answer because the graphical method is very subjective. The Shapiro-Wilk normality test is often used in combination with a Q-Q plot to assess the normality assumption. The Shapiro-Wilk test evaluates the null hypothesis that the data are normally distributed. Hence, if the test is not significant, we cannot reject normality (at the associated α level); if the test is significant, we conclude that the data are not normally distributed. Let's run a Shapiro-Wilk test on both datasets.
> shapiro.test(normal.sim)
> shapiro.test(nonormal.sim)
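shapiro.test() prints a full report, but it also returns an object (of class "htest") whose fields you can use directly, which is handy if you ever need to check many variables in a loop. A quick sketch with freshly simulated data:

```r
set.seed(1)
x <- rnorm(200)         # data simulated from a normal distribution
res <- shapiro.test(x)  # returns a list-like "htest" object
res$statistic           # the W test statistic
res$p.value             # the p-value as a plain number you can compare to 0.05
```

Because the data are random, the exact p-value varies from run to run; what matters is that you can extract it programmatically rather than reading it off the printed report.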
Note that normal.sim (normally distributed data) had a p-value > 0.05, which means we cannot say its distribution differs from a normal distribution, while nonormal.sim (gamma distributed data) had a p-value < 0.05, which means its distribution is significantly different from a normal distribution.

Poisson Distribution

The Poisson distribution is used to describe discrete data (i.e., counts with no decimal places). It assumes that the data points are independent of each other. This distribution has a single parameter (λ) that describes both the mean and the variance. Let's simulate and plot some Poisson distributed data so we can study the shapes of the distribution. We are going to use the function rpois() (analogous to the rnorm() we used for the normal distribution) to simulate 1000 data points for each of three values of λ: 1, 3, and 10.
> pos1=rpois(1000,1)
> pos3=rpois(1000,3)
> pos10=rpois(1000,10)
Now let’s plot them to see how the shape of the distribution varies depending on the value of λ. > par(mfrow=c(1,3))
> hist(pos1,main="lambda=1")
> hist(pos3,main="lambda=3")
> hist(pos10,main="lambda=10")
Note how the hump of the distribution shifts to the right as λ increases. This is different from the normal distribution, where the hump is always located in the middle of the distribution. If you have some count data and you suspect it is Poisson distributed (we will see shortly how to test how good that suspicion is), how can you estimate the λ parameter? We use distribution fitting. In distribution fitting, we use optimization algorithms to fit a theoretical distribution (in this case the Poisson) to the empirical data. Let's start by importing the bird count data stored in the file lab4_data.txt the way we usually do, with the read.table() function. The data include 100 point counts in wetlands for three species: American Redstart (AMRE), House Sparrow (HOSP), and Cattle Egret (CAEG).
> birds=read.table("lab4_data.txt",header=TRUE)
Since it is count data we are going to try to fit a Poisson distribution to the data. Let’s first see a histogram of the data. Once you do the histogram do not close the figure window. > hist(birds$AMRE)
Now we are going to use the function fitdistr() from the MASS library to fit a Poisson distribution to the data. First you need to load the library.
> library(MASS)
If the library is not installed on your computer you can always install it with
> install.packages("MASS",dependencies=TRUE)
The function fitdistr() is very easy to use. You just need to specify the data vector that you want to fit (in this case we will start with the point count data for AMRE) and the distribution that you want to fit. The function uses maximum likelihood to estimate the parameters of the distribution. Since the Poisson distribution has only one parameter, we are going to estimate a single parameter, λ.
> AMRE.lambda=fitdistr(birds$AMRE,"poisson")$estimate
> AMRE.lambda
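For the Poisson there is a useful sanity check: the maximum-likelihood estimate of λ is simply the sample mean, so fitdistr() should agree with mean() up to numerical precision. A self-contained sketch using simulated counts (a stand-in, so it runs even without the lab4 data file):

```r
library(MASS)

set.seed(10)
counts <- rpois(500, lambda = 4)  # stand-in count data with a known lambda

lambda.ml <- as.numeric(fitdistr(counts, "poisson")$estimate)
lambda.ml     # close to the true value of 4
mean(counts)  # matches lambda.ml: for the Poisson, the MLE is the sample mean
```

This is also why fitting a Poisson is so fast: there is a closed-form answer, and no iterative optimization is really needed.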
Now let's plot the fitted distribution line on top of the histogram. Because the dataset has 100 point counts, we multiply the fitted Poisson probabilities by 100 to put them on the same scale as the histogram's counts.
> lines(seq(0,max(birds$AMRE)),100*dpois(seq(0,max(birds$AMRE)),AMRE.lambda),col="red")
How does the fit (the model, in red) compare to the real data (the histogram)?

Exercise: Try fitting a Poisson distribution to the House Sparrow point-count data (HOSP). Show the histogram with the fit line. How good is the fit?
> hist(birds$HOSP)
> HOSP.lambda=fitdistr(birds$HOSP,"poisson")$estimate
> lines(seq(0,max(birds$HOSP)),100*dpois(seq(0,max(birds$HOSP)),HOSP.lambda),col="red")
The fit is not really good. There is a way to empirically assess how good the fit to the data is: goodness of fit tests. Goodness of fit tests help us assess whether it is reasonable to assume that the empirical data come from a specific theoretical distribution. The hypotheses are formulated much like those of the Shapiro-Wilk test: the null hypothesis states that the empirical sample comes from the known distribution, and the alternative hypothesis states that it does not. A common goodness of fit test for discrete data is the chi-square (X2) test. To run this test we are going to use the vcd package in R. You can install it with
> install.packages("vcd",dependencies=TRUE)
> library("vcd")
Now let's run a goodness of fit test for a Poisson distribution on the AMRE and HOSP data. We are going to use the function goodfit(), whose arguments are the data (in this case the AMRE or the HOSP counts) and the type of distribution to test (Poisson here, but it also works with the binomial or the negative binomial, as we will do in the next section).
> gf.AMRE<-goodfit(birds$AMRE,type="poisson",method="MinChisq")
> summary(gf.AMRE)
> gf.HOSP<-goodfit(birds$HOSP,type="poisson",method="MinChisq")
> summary(gf.HOSP)
Note that for AMRE p > 0.05, which means we cannot reject the null hypothesis that the data come from a Poisson distribution (a Poisson distribution can be used to describe this dataset). On the other hand, for HOSP the X2 statistic is so high that R cannot compute a proper p-value, which means a Poisson fit is not adequate for the HOSP data because the expected and observed values are very different. Also note that after you ran the goodfit() function on the HOSP data there was a warning message saying that NA/Inf values were used. This means there was a zero in the denominator of the X2 equation. In general, warning messages in R can be ignored only if you know what the message means and how it affects your results.

Negative Binomial Distribution

The Poisson distribution assumes that the mean and the variance are equal. Only a handful of real datasets meet this assumption. The negative binomial allows the variance to be greater than the mean. Recall that λ is both the mean and the variance of the Poisson distribution. In the negative binomial, the variance is defined as:

Var(X) = λ + λ²/r

where r is called the dispersion parameter. What happens if r is a very large number? Think about it for a second. When r is really large, λ²/r becomes a very small number (zero in practical terms), and hence the variance and the mean are essentially the same, as in the Poisson. This means that the negative binomial is a generalization of the Poisson: the Poisson is the limiting case as r grows very large. Let's explore this in R. We are going to use the rpois() function to simulate 1000 data points from a Poisson distribution with λ=1. Then we are going to use the rnbinom() function to simulate 1000 data points from a negative binomial distribution with a mean of 1 (R calls the mean of the negative binomial mu, hence mu=1), and we are going to vary the r parameter (called size in R) from 1 to 1e16, which is a really big number. Note that we are going to simulate the data and plot it on the same line by calling rpois() or rnbinom() inside hist().
> par(mfrow=c(2,2))
> hist(rpois(1000,lambda=1),main="Poisson")
> hist(rnbinom(1000,mu=1,size=1),main="Negative Binomial r=1")
> hist(rnbinom(1000,mu=1,size=10),main="Negative Binomial r=10")
> hist(rnbinom(1000,mu=1,size=1e16),main="Negative Binomial r=1e16")
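You can make the "large r behaves like a Poisson" point quantitative with the variance-to-mean ratio: for the Poisson it is 1, and for the negative binomial it is (λ + λ²/r)/λ, which shrinks toward 1 as r grows. A sketch with simulated data (sample ratios will land near, not exactly at, the theoretical values):

```r
set.seed(7)
vm.ratio <- function(x) var(x) / mean(x)  # variance-to-mean ratio

vm.ratio(rpois(10000, lambda = 1))             # near 1 (Poisson)
vm.ratio(rnbinom(10000, mu = 1, size = 1))     # near 2 (theory: 1 + 1/1)
vm.ratio(rnbinom(10000, mu = 1, size = 1e16))  # near 1, like the Poisson
```

A ratio well above 1 in your own count data is the signature of overdispersion, which is exactly what the negative binomial is designed to handle.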
Study the figure. Look at your Poisson distribution and compare its shape with the other three panels. What happens as r increases? Why? Now let's fit a negative binomial distribution to point-count data on Little Blue Herons (LBHE). First draw a histogram to see how the distribution is shaped.
> hist(birds$LBHE)
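The histogram is only suggestive; a quick numerical check for overdispersion is to compare the sample variance to the sample mean. The sketch below uses a simulated stand-in vector (hypothetical, so it runs without the data file); on your side you would run the same comparison on birds$LBHE:

```r
set.seed(3)
lbhe <- rnbinom(100, mu = 2, size = 1)  # hypothetical stand-in for birds$LBHE

mean(lbhe)
var(lbhe)
var(lbhe) > mean(lbhe)  # TRUE indicates overdispersion
```

If the variance clearly exceeds the mean, the Poisson assumption is suspect and the negative binomial is the natural next candidate.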
It looks a bit skewed to the right, which might indicate overdispersion (variance greater than the mean). Let's fit a negative binomial distribution and assess the model fit.
> (LBHE.fit=fitdistr(birds$LBHE,"negative binomial"))
> LBHE.gf=goodfit(birds$LBHE,type="nbinomial",method="MinChisq")
> summary(LBHE.gf)
You might get a warning message saying that the estimate is inaccurate, but you now know why this happens (zeros in the denominator of the chi-square).

Exercise: What do you think? Is the negative binomial a good fit for the LBHE data? What is the null hypothesis? Can you reject it?