Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
1 The R Data Analysis System - Statistics Functions Basic Distributions Random Samples from Distributions Evaluating Density or Probability Mass Functions Evaluating Cumulative Distribution Functions Quantiles of Distributions Numerical Summaries of Data Student-t Tests and Confidence Intervals for a Single Sample Two-sample Student-t Tests and Confidence Intervals Confidence Intervals and Tests for Proportions Tests for Equality of Proportions Tabulating and Crosstabulating Factor Data Chi-square Tests for Goodness of Fit Chi-square Tests for Contingency Tables and Homogeneity This document describes the basic R functions for generating samples from common distributions, working with density, distribution, and quantile functions, and performing standard statistical inference procedures such as estimation and hypothesis testing. Many of these tasks can be done with point and click operations in R Commander, but doing them from the command line is actually easier once you learn how. Basic Distributions The common names and the R names of the most important distributions are given in the following table. Each distribution has two to four R functions associated with it. These are prefixed "r" for random number generation, "d" for density function (or pmf) evaluation, "p" for probability (cdf) evaluation, and "q" for quantile function evaluation. 1 2 Common Name Uniform Names of R Functions runif, dunif, punif, qunif Normal rnorm, dnorm, pnorm, qnorm Binomial rbinom, dbinom, pbinom, qbinom Poisson rpois, dpois, ppois, qpois Exponential rexp, dexp, pexp, qexp Gamma rgamma, dgamma, pgamma, qgamma Chi-square rchisq, dchisq, pchisq, qchisq Student-t rt, dt, pt, qt F rf, df, pf, qf Multinomial rmultinom, dmultinom Hypergeometric rhyper, dhyper, phyper, qhyper Geometric rgeom, dgeom, pgeom, qgeom Negative Binomial rnbinom, dnbinom, pnbinom, qnbinom Weibull rweib, dweib, pweib, qweib There are more, but these are the most important. Each function has several arguments, most of which have default values in case you do not give them. Random Samples from Distributions Generation of random samples from a distribution is accomplished by the function with the prefix "r" associated with the distribution. To generate a random sample of size 100 from the uniform distribution on the unit interval and save it in a vector called unifsample, type > unifsample = runif(100) > unifsample For most of the "r" functions, the only required argument is the sample size. Parameter values are set with optional arguments. The help files explain them fully. For example, to generate 100 samples from the uniform distribution on the interval (-1,4), the command is 2 3 > runif(n=100, min=-1, max=4) To generate 500 samples from the normal distribution with mean 4 and standard deviation 3, type > rnorm(n=500, mean=4, sd=3) Arguments to functions in R do not have to be named if they are given in the expected order. For example, the normal sample above could have been obtained by > rnorm(500, 4, 3) Evaluating Density or Probability Mass Functions The value of the density function or probability mass function is accomplished by the function prefixed with "d". For example, the probability that exactly 3 heads are observed in 10 tosses of a fair coin is > dbinom(x=3, size=10, prob=0.5) or just > dbinom(3,10,0.5) You can evaluate at more than one point with just one call. For example, > dbinom(0:10, 10, 0.5) 3 4 will give you all the values of the binomial probability mass function for 10 trials and success probability 0.5. You can obtain a plot of the normal density function with mean 4 and standard deviation 3 by > curve(dnorm(x,4,3), from=-5, to=13) Evaluating Cumulative Distribution Functions Functions with the prefix "p" evaluate the cumulative distribution function of a random variable. Examples are > pnorm(q=2.28, mean = 4, sd=3) > pnorm(2.28, 4, 3) > pexp(q=3, rate = .5) > pgamma(5, shape=3, lower.tail=F) The last command illustrates the "lower.tail" argument, which is a logical argument with the default value of T. If it is specified as F, the number returned is the probability that the random variable is greater than the given value. Quantiles of Distributions Quantiles of a distribution are equivalent to its percentiles. If p is a number strictly between 0 and 1, the value Q(p) of the quantile function Q associated with a distribution is the 100pth percentile of the distribution. In R, the prefix "q" before the root name of a distribution designates its quantile function. Thus, to find the 30th percentile of the standard normal distribution, type > qnorm(0.30) 4 5 All of the quantile functions in R accept optional arguments for adjusting parameter values. For instance, > qexp(0.75, rate=2) > qnorm(seq(.1,.9,.1), -1, 2) > qchisq(0.95, df=9) > qchisq(0.05, df=9, lower.tail=F) Numerical Summaries of Data Suppose that vect is the name of a numeric vector or matrix in R. The various numerical summary statistics (mean, standard deviation, median, etc.) of the data in vect are given in the following table. Summary Statistic R Function Mean mean(vect) Median median(vect) 5% trimmed mean mean(vect,trim=0.05) Standard deviation sd(vect) Variance var(vect) th th p quantile (100p percentile) quantile(p,vect) Range ((smallest value, largest value)) range(vect) Interquartile range (3rd quartile - 1st quartile) IQR(vect) Minimum value min(vect) Maximum value max(vect) MAD (median absolute deviation from the mad(vect) median) 5 6 A six-number summary consisting of the minimum, the quartiles, the median, the mean, and the maximum can be obtained all at once by > summary(vect) These numerical summaries seldom make sense for a factor or categorical variable. It may be useful instead to tabulate the frequency of occurrence of its different levels. This is done by > table(vect) Typically, in a data frame some of the variables (columns) will be numeric and some will be factors. If we call the summary function for a data frame, we get the six-number numerical summary for the numeric variables in the data frame and the tabulated frequencies for the factors in the data frame. A good example of this, using one of R's built-in data frames, is given by > summary(ChickWeight) Student-t Tests and Confidence Intervals for a Single Sample If vect is a numerical vector containing the values of a sample from a normal distribution with mean µ, and nullmean is a given numeric value, the two-sided student-t test of the null hypothesis H0: µ = nullmean versus H1: µ ≠ nullmean is accomplished by > t.test(vect, mu = nullmean) 6 7 This function returns the p-value of the test statistic and a 95% confidence interval for the population mean. The default value of the argument "mu" is 0 and it does not have to be specified if you are testing the hypothesis that the population mean is 0. The optional argument "alternative" allows you to specify a one-sided alternative in either direction. The optional argument "conf.level" allows you to specify a confidence level other than 95%. Two-sample Student-t Tests and Confidence Intervals If vect1 and vect2 are two independent samples from normal distributions with means µ1 and µ2, the student-t test of the null hypothesis H0: µ1 = µ2 against the alternative that the means are not equal is accomplished by > t.test(vect1, vect2) The p-value of the test statistic and a 95% confidence interval for difference of the means is returned. For a one-sided alternative hypothesis, use the "alternative" argument. The test performed is Welch's approximate procedure which does not assume the two populations have equal variances. If you want an exact t-test assuming equal population variances, use > t.test(vect1, vect2, var.equal=T) The level of the confidence interval can be adjusted with the "conf.level" argument. Confidence Intervals and Tests for Proportions 7 8 Suppose that nosuccesses successes are observed in notrials trials and let p0 be a hypothesized value for the success probability or population proportion. A test of the null hypothesis based on the exact binomial distribution of the number of successes is given by > binom.test(x=nosuccesses, n=notrials, p=p0) This returns the p-value against the two-sided alternative and a 95% confidence interval. The alternative hypothesis and the confidence level can be adjusted with the arguments "alternative" and "conf.level". As an example, > binom.test(x=12, n=20, p=.4) returns a p-value of 0.1075 and the confidence interval 0.3605 to 0.8808. Tests and confidence intervals using the normal approximation, appropriate when the number of trials is large, are given by the function "prop.test", as in > prop.test(x=nosuccesses, n=notrials, p=p0) For example, > prop.test(12,20,0.4) The function "prop.test" actually returns the result of a chi-square test which is equivalent to the normal test. It also has optional arguments "alternative" and "conf.level". Tests for Equality of Proportions 8 9 A test of the hypothesis that two binomial success probabilities (population proportions) are equal can be accomplished with "prop.test" as well. If nosuccesses1 and nosuccesses2 are the observed numbers of successes from two independent binomial experiments and notrials1 and notrials2 are the numbers of trials in the two experiments, the following will produce the p-value of a chi-square test of the hypothesis that the two experiments have the same success probability. > prop.test(x=c(nosuccesses1, nosuccesses2), n=c(notrials1, notrials2)) This also gives a 95% confidence interval for the difference of the two population proportions. Use the "conf.level" argument to adjust the confidence level as desired. For example, > prop.test(x=c(12, 20), n=c(20, 40), conf.level=0.99) Tabulating and Crosstabulating Factor Data The counts with which levels of a factor variable occur can be obtained with the "table" function, as in > table(factor1) Here, factor1 is the name of the factor variable. If factor1 and factor2 are two factors in a data frame called dataframe, you can get counts of all the combinations of factor levels by > table(dataframe[,"factor1"], dataframe[,"factor2"]) If you have attached the data frame to the search list, like this: > attach(dataframe) the process is easier. 9 10 > table(factor1, factor2) This produces a 2-way table of counts of all the combinations of factor levels. It can be extended to any number of factors for higher-dimensional tables. If all the variables in a data frame are factors you can get a multi-way table by typing > table(dataframe) If you want proportions instead of counts, use the function "prop.table". > thingy=prop.table(table(factor1, factor2)) You can add margins containing the row sums and column sums to a table by > addmargins(thingy) Chi-square Tests for Goodness of Fit Suppose that a numeric vector vect contains sample values and we wish to use the chi-square test goodness of fit test for a hypothesized distribution. The first task is to subdivide the range of the data into nonoverlapping intervals and determine the probability of a sample value falling in each interval. Suppose that has been done and the endpoints of the intervals are in a vector endpoints whose entries increase from the left end of the leftmost interval to the right end of the rightmost interval. One possible way of creating the vector endpoints is > endpoints = seq(a, b, h) This divides the interval (a, b) into subintervals of length h. Suppose the probabilities of these subintervals have been calculated and are in a vector probs. 10 11 The next task is to count the sample values that fall into each of the subintervals. This can be done with the "cut" function, which converts a numeric vector to a factor whose levels are the subintervals just defined, followed by the "table" function to tabulate the frequencies of occurrence of the subintervals. > frequencies = table(cut(vect, breaks = endpoints)) Finally, the chi-square test is accomplished by > chisq.test(frequencies, p = probs) The default for the argument "p" is equal probabilities for the subintervals. "p" doesn't have to be specified if that is what you want to test. The test returns the value of the chi-square statistic and its p-value. If the probabilities of some of the subintervals are too small for the given sample size, you will get a warning that the test may not be reliable. Chi-square Tests for Contingency Tables and Homogeneity The chi-square test for independence of two random factors giving rise to a contingency table is the same as the test of homogeneity, or equality of several multinomial distributions. In both cases the count data is arranged in a 2-way table with r rows and c columns and the null distribution of the test statistic is approximately chisquare with (r-1)(c-1) degrees of freedom. Such a table is created in R with the "table" function. Suppose the table is named table1. The command > chisq.test(table1) returns the test statistic, its p-value, and a warning if any of the cell frequencies are too low for reliable inference. An example, using one of R's built-in data sets is > table1=HairEyeColor[,"Blue",] 11 12 > chisq.test(table1) This shows that among blue-eyed individuals, hair color and sex are not independent attributes. 12