The R Data Analysis System - Statistics Functions
Basic Distributions
Random Samples from Distributions
Evaluating Density or Probability Mass Functions
Evaluating Cumulative Distribution Functions
Quantiles of Distributions
Numerical Summaries of Data
Student-t Tests and Confidence Intervals for a Single Sample
Two-sample Student-t Tests and Confidence Intervals
Confidence Intervals and Tests for Proportions
Tests for Equality of Proportions
Tabulating and Crosstabulating Factor Data
Chi-square Tests for Goodness of Fit
Chi-square Tests for Contingency Tables and Homogeneity
This document describes the basic R functions for generating samples from common distributions, working
with density, distribution, and quantile functions, and performing standard statistical inference procedures such
as estimation and hypothesis testing. Many of these tasks can be done with point-and-click operations in R
Commander, but doing them from the command line is actually easier once you learn how.
Basic Distributions
The common names and the R names of the most important distributions are given in the following table. Each
distribution has two to four R functions associated with it. These are prefixed "r" for random number
generation, "d" for density function (or pmf) evaluation, "p" for probability (cdf) evaluation, and "q" for
quantile function evaluation.
Common Name          Names of R Functions
Uniform              runif, dunif, punif, qunif
Normal               rnorm, dnorm, pnorm, qnorm
Binomial             rbinom, dbinom, pbinom, qbinom
Poisson              rpois, dpois, ppois, qpois
Exponential          rexp, dexp, pexp, qexp
Gamma                rgamma, dgamma, pgamma, qgamma
Chi-square           rchisq, dchisq, pchisq, qchisq
Student-t            rt, dt, pt, qt
F                    rf, df, pf, qf
Multinomial          rmultinom, dmultinom
Hypergeometric       rhyper, dhyper, phyper, qhyper
Geometric            rgeom, dgeom, pgeom, qgeom
Negative Binomial    rnbinom, dnbinom, pnbinom, qnbinom
Weibull              rweibull, dweibull, pweibull, qweibull
There are more, but these are the most important. Each function has several arguments, most of which have
default values in case you do not give them.
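As a quick illustration of the four prefixes, here are the four functions for the normal distribution evaluated at a few arbitrary points (the numerical results in the comments are approximate):
> rnorm(1)        # one random draw from the standard normal
> dnorm(0)        # density at 0, approximately 0.3989
> pnorm(1.96)     # P(Z <= 1.96), approximately 0.975
> qnorm(0.975)    # 97.5th percentile, approximately 1.96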
Random Samples from Distributions
Generation of random samples from a distribution is accomplished by the function with the prefix "r"
associated with the distribution. To generate a random sample of size 100 from the uniform distribution on the
unit interval and save it in a vector called unifsample, type
> unifsample = runif(100)
> unifsample
For most of the "r" functions, the only required argument is the sample size. Parameter values are set with
optional arguments. The help files explain them fully. For example, to generate 100 samples from the uniform
distribution on the interval (-1,4), the command is
> runif(n=100, min=-1, max=4)
To generate 500 samples from the normal distribution with mean 4 and standard deviation 3, type
> rnorm(n=500, mean=4, sd=3)
Arguments to functions in R do not have to be named if they are given in the expected order. For example, the
normal sample above could have been obtained by
> rnorm(500, 4, 3)
Evaluating Density or Probability Mass Functions
Evaluation of the density function or probability mass function is done by the function prefixed with
"d". For example, the probability that exactly 3 heads are observed in 10 tosses of a fair coin is
> dbinom(x=3, size=10, prob=0.5)
or just
> dbinom(3,10,0.5)
You can evaluate at more than one point with just one call. For example,
> dbinom(0:10, 10, 0.5)
will give you all the values of the binomial probability mass function for 10 trials and success probability
0.5. You can obtain a plot of the normal density function with mean 4 and standard deviation 3 by
> curve(dnorm(x,4,3), from=-5, to=13)
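For a discrete distribution such as the binomial, a spike plot of the probability mass function is often clearer than a smooth curve. One possible sketch:
> x = 0:10
> plot(x, dbinom(x, 10, 0.5), type="h", ylab="P(X = x)")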
Evaluating Cumulative Distribution Functions
Functions with the prefix "p" evaluate the cumulative distribution function of a random variable. Examples are
> pnorm(q=2.28, mean = 4, sd=3)
> pnorm(2.28, 4, 3)
> pexp(q=3, rate = .5)
> pgamma(5, shape=3, lower.tail=F)
The last command illustrates the "lower.tail" argument, which is a logical argument with the default value of
T. If it is specified as F, the number returned is the probability that the random variable is greater than the
given value.
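Because the cdf gives the probability of being at or below a value, probabilities of intervals can be obtained by subtraction. For example, for the normal distribution with mean 4 and standard deviation 3,
> pnorm(7, 4, 3) - pnorm(1, 4, 3)   # P(1 < X < 7), approximately 0.683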
Quantiles of Distributions
Quantiles of a distribution are equivalent to its percentiles. If p is a number strictly between 0 and 1, the value
Q(p) of the quantile function Q associated with a distribution is the 100pth percentile of the distribution. In R,
the prefix "q" before the root name of a distribution designates its quantile function. Thus, to find the 30th
percentile of the standard normal distribution, type
> qnorm(0.30)
All of the quantile functions in R accept optional arguments for adjusting parameter values. For instance,
> qexp(0.75, rate=2)
> qnorm(seq(.1,.9,.1), -1, 2)
> qchisq(0.95, df=9)
> qchisq(0.05, df=9, lower.tail=F)
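The quantile function is the inverse of the cdf, so the "p" and "q" functions undo each other. A quick check:
> qnorm(pnorm(1.5))                   # returns 1.5
> pchisq(qchisq(0.95, df=9), df=9)    # returns 0.95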
Numerical Summaries of Data
Suppose that vect is the name of a numeric vector or matrix in R. The various numerical summary statistics
(mean, standard deviation, median, etc.) of the data in vect are given in the following table.
Summary Statistic                                    R Function
Mean                                                 mean(vect)
Median                                               median(vect)
5% trimmed mean                                      mean(vect, trim=0.05)
Standard deviation                                   sd(vect)
Variance                                             var(vect)
pth quantile (100pth percentile)                     quantile(vect, p)
Range (smallest value, largest value)                range(vect)
Interquartile range (3rd quartile - 1st quartile)    IQR(vect)
Minimum value                                        min(vect)
Maximum value                                        max(vect)
MAD (median absolute deviation from the median)      mad(vect)
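As a small illustration, the following applies a few of these functions to a simulated sample (the exact numbers will differ from run to run):
> vect = rnorm(50, mean=10, sd=2)
> mean(vect)
> sd(vect)
> quantile(vect, 0.25)    # first quartile
> IQR(vect)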
A six-number summary consisting of the minimum, the quartiles, the median, the mean, and the maximum can
be obtained all at once by
> summary(vect)
These numerical summaries seldom make sense for a factor or categorical variable. It may be useful instead to
tabulate the frequency of occurrence of its different levels. This is done by
> table(vect)
Typically, in a data frame some of the variables (columns) will be numeric and some will be factors. If we call
the summary function for a data frame, we get the six-number numerical summary for the numeric variables in
the data frame and the tabulated frequencies for the factors in the data frame. A good example of this, using
one of R's built-in data frames, is given by
> summary(ChickWeight)
Student-t Tests and Confidence Intervals for a Single Sample
If vect is a numerical vector containing the values of a sample from a normal distribution with mean µ, and
nullmean is a given numeric value, the two-sided student-t test of the null hypothesis
H0: µ = nullmean versus
H1: µ ≠ nullmean
is accomplished by
> t.test(vect, mu = nullmean)
This function returns the p-value of the test statistic and a 95% confidence interval for the population
mean. The default value of the argument "mu" is 0 and it does not have to be specified if you are testing the
hypothesis that the population mean is 0. The optional argument "alternative" allows you to specify a one-sided
alternative in either direction. The optional argument "conf.level" allows you to specify a confidence level
other than 95%.
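For instance, a one-sided test of H0: µ = 5 against H1: µ > 5, with a 99% confidence level, might look like this (vect being any numeric sample vector):
> t.test(vect, mu = 5, alternative = "greater", conf.level = 0.99)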
Two-sample Student-t Tests and Confidence Intervals
If vect1 and vect2 are two independent samples from normal distributions with means µ1 and µ2, the student-t
test of the null hypothesis
H0: µ1 = µ2
against the alternative that the means are not equal is accomplished by
> t.test(vect1, vect2)
The p-value of the test statistic and a 95% confidence interval for difference of the means is returned. For a
one-sided alternative hypothesis, use the "alternative" argument. The test performed is Welch's approximate
procedure which does not assume the two populations have equal variances. If you want an exact t-test
assuming equal population variances, use
> t.test(vect1, vect2, var.equal=T)
The level of the confidence interval can be adjusted with the "conf.level" argument.
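For example, a one-sided Welch test with a 90% confidence interval for the difference of means could be requested by
> t.test(vect1, vect2, alternative = "less", conf.level = 0.90)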
Confidence Intervals and Tests for Proportions
Suppose that nosuccesses successes are observed in notrials trials and let p0 be a hypothesized value for the
success probability or population proportion. A test of the null hypothesis that the success probability equals p0,
based on the exact binomial distribution of the number of successes, is given by
> binom.test(x=nosuccesses, n=notrials, p=p0)
This returns the p-value against the two-sided alternative and a 95% confidence interval. The alternative
hypothesis and the confidence level can be adjusted with the arguments "alternative" and "conf.level".
As an example,
> binom.test(x=12, n=20, p=.4)
returns a p-value of 0.1075 and the confidence interval 0.3605 to 0.8088.
Tests and confidence intervals using the normal approximation, appropriate when the number of trials is large,
are given by the function "prop.test", as in
> prop.test(x=nosuccesses, n=notrials, p=p0)
For example,
> prop.test(12,20,0.4)
The function "prop.test" actually returns the result of a chi-square test which is equivalent to the normal test. It
also has optional arguments "alternative" and "conf.level".
Tests for Equality of Proportions
A test of the hypothesis that two binomial success probabilities (population proportions) are equal can be
accomplished with "prop.test" as well. If nosuccesses1 and nosuccesses2 are the observed numbers of
successes from two independent binomial experiments and notrials1 and notrials2 are the numbers of trials in
the two experiments, the following will produce the p-value of a chi-square test of the hypothesis that the two
experiments have the same success probability.
> prop.test(x=c(nosuccesses1, nosuccesses2), n=c(notrials1, notrials2))
This also gives a 95% confidence interval for the difference of the two population proportions. Use the
"conf.level" argument to adjust the confidence level as desired. For example,
> prop.test(x=c(12, 20), n=c(20, 40), conf.level=0.99)
Tabulating and Crosstabulating Factor Data
The counts with which levels of a factor variable occur can be obtained with the "table" function, as in
> table(factor1)
Here, factor1 is the name of the factor variable. If factor1 and factor2 are two factors in a data frame called
dataframe, you can get counts of all the combinations of factor levels by
> table(dataframe[,"factor1"], dataframe[,"factor2"])
If you have attached the data frame to the search list, like this:
> attach(dataframe)
the process is easier:
> table(factor1, factor2)
This produces a 2-way table of counts of all the combinations of factor levels. It can be extended to any
number of factors for higher-dimensional tables. If all the variables in a data frame are factors you can get a
multi-way table by typing
> table(dataframe)
If you want proportions instead of counts, use the function "prop.table".
> thingy=prop.table(table(factor1, factor2))
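The "prop.table" function also accepts a "margin" argument if you want proportions within each row or column rather than overall proportions. For example,
> prop.table(table(factor1, factor2), margin = 1)   # proportions within each row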
You can add margins containing the row sums and column sums to a table by
> addmargins(thingy)
Chi-square Tests for Goodness of Fit
Suppose that a numeric vector vect contains sample values and we wish to use the chi-square goodness of
fit test for a hypothesized distribution. The first task is to subdivide the range of the data into nonoverlapping
intervals and determine the probability of a sample value falling in each interval. Suppose that has been done
and the endpoints of the intervals are in a vector endpoints whose entries increase from the left end of the
leftmost interval to the right end of the rightmost interval. One possible way of creating the vector endpoints is
> endpoints = seq(a, b, h)
This divides the interval (a, b) into subintervals of length h. Suppose the probabilities of these subintervals
have been calculated and are in a vector probs.
The next task is to count the sample values that fall into each of the subintervals. This can be done with the
"cut" function, which converts a numeric vector to a factor whose levels are the subintervals just defined,
followed by the "table" function to tabulate the frequencies of occurrence of the subintervals.
> frequencies = table(cut(vect, breaks = endpoints))
Finally, the chi-square test is accomplished by
> chisq.test(frequencies, p = probs)
The default for the argument "p" is equal probabilities for the subintervals. "p" doesn't have to be specified if
that is what you want to test. The test returns the value of the chi-square statistic and its p-value. If the
probabilities of some of the subintervals are too small for the given sample size, you will get a warning that the
test may not be reliable.
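Putting the steps together, a sketch of a test that a sample comes from the standard normal distribution might look like the following (the sample and the endpoints here are arbitrary choices):
> vect = rnorm(200)
> endpoints = c(-Inf, -1, 0, 1, Inf)
> probs = diff(pnorm(endpoints))                      # probabilities of the four intervals
> frequencies = table(cut(vect, breaks = endpoints))
> chisq.test(frequencies, p = probs)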
Chi-square Tests for Contingency Tables and Homogeneity
The chi-square test for independence of two random factors giving rise to a contingency table is the same as the
test of homogeneity, or equality of several multinomial distributions. In both cases the count data is arranged in
a 2-way table with r rows and c columns, and the null distribution of the test statistic is approximately
chi-square with (r-1)(c-1) degrees of freedom. Such a table is created in R with the "table" function. Suppose the
table is named table1.
The command
> chisq.test(table1)
returns the test statistic, its p-value, and a warning if any of the cell frequencies are too low for reliable
inference. An example, using one of R's built-in data sets, is
> table1=HairEyeColor[,"Blue",]
> chisq.test(table1)
This shows that among blue-eyed individuals, hair color and sex are not independent attributes.
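Collapsing the full three-way HairEyeColor table over sex gives a hair-by-eye contingency table, which can be tested for independence in the same way (table2 is just an arbitrary name):
> table2 = margin.table(HairEyeColor, margin = c(1, 2))
> chisq.test(table2)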