Unit II: Summarizing Data & Revisiting Probability (NOS 2101)
2.7. Probability Distribution Function (PDF):
It defines the probability of outcomes based on certain conditions. Depending on those conditions, the following probability distributions are commonly used.
Types of Probability Distribution:
• Binomial Distribution
• Poisson Distribution
• Continuous Uniform Distribution
• Exponential Distribution
• Normal Distribution
• Chi-squared Distribution
• Student t Distribution
• F Distribution
2.7.1. Normal Distribution:
A random variable X has a normal probability distribution if its density follows the familiar bell-shaped curve. A graph of a normal distribution, where we have chosen the mean a = 0 and the standard deviation b = 1, appears in the figure below:
Figure 1: Normal Distribution
Figure 1 shows the normal distribution of sample data. The shape of a normal curve is highly dependent on the
standard deviation.
Importance of Normal Distribution:
• The normal distribution is a continuous distribution that is “bell-shaped”.
• Data are often assumed to be normal.
• Normal distributions can estimate probabilities over a continuous interval of data values.
Figure 2: A normal distribution with Mean = 0 and Standard deviation = 1
Properties:
The normal density f(x), with any mean μ and any positive standard deviation σ, has the following properties:
• It is symmetric around the point x = μ, which is at the same time the mode, the median and the mean of the distribution.
• It is unimodal: its first derivative is positive for x < μ, negative for x > μ, and zero only at x = μ.
• Its density has two inflection points (where the second derivative of f is zero and changes sign), located one standard deviation away from the mean, at x = μ − σ and x = μ + σ.
• Its density is log-concave.
• Its density is infinitely differentiable, indeed supersmooth of order 2.
• Its derivative with respect to the variance σ² equals one half of its second derivative with respect to x: ∂f/∂σ² = ½ f′′(x).
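A small numerical check of two of these properties for the standard normal density (μ = 0, σ = 1); this sketch is illustrative and not part of the original notes:
> all.equal(dnorm(-1.3), dnorm(1.3))       # symmetry about x = mu
> d2 <- function(x, h = 1e-4) (dnorm(x + h) - 2 * dnorm(x) + dnorm(x - h)) / h^2
> c(d2(0.9), d2(1.1))                      # curvature changes sign near the inflection point x = mu + sigma = 1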
Normal Distribution in R:
Description:
Density, distribution function, quantile function and random generation for the normal distribution with mean
equal to mean and standard deviation equal to sd.
Usage
• dnorm(x, mean = 0, sd = 1, log = FALSE)
• pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
• qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
• rnorm(n, mean = 0, sd = 1)
Arguments
– x, q: vector of quantiles.
– p: vector of probabilities.
– n: number of observations. If length(n) > 1, the length is taken to be the number required.
– mean: vector of means.
– sd: vector of standard deviations.
– log, log.p: logical; if TRUE, probabilities p are given as log(p).
– lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x]; otherwise, P[X > x].
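For instance, the effect of lower.tail can be checked quickly for the standard normal (the values in the comments are approximate):
> pnorm(1.96)                        # P[X <= 1.96], about 0.975
> pnorm(1.96, lower.tail = FALSE)    # P[X > 1.96], about 0.025
> qnorm(0.975)                       # about 1.96: qnorm() inverts pnorm()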
Lab Activity 1:
> data2 <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4)
> dens = density(data2)
> dens
Call:
        density.default(x = data2)

Data: data2 (16 obs.);  Bandwidth 'bw' = 0.9644

       x                 y
 Min.   :-0.8932   Min.   :0.0002982
 1st Qu.: 2.3034   1st Qu.:0.0134042
 Median : 5.5000   Median :0.0694574
 Mean   : 5.5000   Mean   :0.0781187
 3rd Qu.: 8.6966   3rd Qu.:0.1396352
 Max.   :11.8932   Max.   :0.1798531
> str(dens)
List of 7
$ x : num [1:512] -0.893 -0.868 -0.843 -0.818 -0.793 ...
$ y : num [1:512] 0.000313 0.000339 0.000367 0.000397 0.000429 ...
$ bw : num 0.964
$ n : int 16
$ call : language density.default(x = data2)
$ data.name: chr "data2"
$ has.na : logi FALSE
- attr(*, "class")= chr "density"
Lab Activity 2:
To generate 20 random numbers from a normal distribution with a mean of 5 and a standard deviation of 1:
> rnorm(20, mean = 5, sd = 1)
[1] 5.610090 5.042731 5.120978 4.582450 5.015839 3.577376 5.159308 6.496983
[9] 3.071729 6.187525 5.027074 3.517274 4.393562 3.866088 4.533490 6.021554
[17] 5.359491 5.265780 3.817124 5.855315
> pnorm(5, mean = 5, sd = 1)
[1] 0.5
> qnorm(0.5, 5, 1)
[1] 5
> dnorm(c(4,5,6), mean = 5, sd = 1)
[1] 0.2419707 0.3989423 0.2419707
Lab Activity 3: Probability Theories:
1. If you throw a die 20 times, what is the probability of getting the following results?
a. 3 sixes
b. 6 sixes
c. 1, 2 and 3 sixes
Solution: the number of sixes follows a binomial distribution with n = 20 and p = 1/6; a dbinom() sketch is shown below.
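Each part is a binomial probability with n = 20 trials and success probability p = 1/6, so it can be evaluated with dbinom(); the commands below are a minimal sketch, not part of the original transcript:
> dbinom(3, size = 20, prob = 1/6)     # a. exactly 3 sixes in 20 throws
> dbinom(6, size = 20, prob = 1/6)     # b. exactly 6 sixes
> dbinom(1:3, size = 20, prob = 1/6)   # c. exactly 1, 2 and 3 sixes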
2. Check whether Sepal Length in the iris data set is normally distributed.
Use: to find whether Sepal Length is normally distributed, we use two commands: qqnorm() and qqline().
qqnorm() shows the actual distribution of the data, while qqline() draws the line on which the data would lie if they were normally distributed. Deviation of the plotted points from the line indicates that the data are not normally distributed.
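A minimal sketch of the commands behind Figure 3, using the built-in iris data set (the exact call used in the original is not shown):
> qqnorm(iris$Sepal.Length)    # Q-Q plot of the sample quantiles against normal quantiles
> qqline(iris$Sepal.Length)    # reference line for perfectly normal data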
Figure 3: Normal Q-Q plot of iris$Sepal.Length
3. Show that the mean of the first 10 Sepal Length values differs significantly from the population mean of Sepal Length.
T-test of a sample subset of the Iris data set.
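The t-test call that produced the output discussed below is not shown in the transcript; a plausible reconstruction, assuming a one-sided test of the first 10 values against the full-sample mean:
> x <- iris$Sepal.Length[1:10]                                   # sample: first 10 observations
> t.test(x, mu = mean(iris$Sepal.Length), alternative = "less")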
Here the p-value is much less than 0.05, so we reject the null hypothesis and accept the alternative hypothesis, which says that the mean of the sample is less than the population mean:
µs < µp
Also, the sample mean is 4.86 and the degrees of freedom are 9, which is the sample size − 1.
Similarly, we can do a two-sided test by writing alternative = "two.sided", and a paired-sample t-test by passing paired = TRUE as an argument.
4. Do an ANOVA test of 3 different data sets which are subsets of Sepal Length.
ANOVA of the Iris data with its own subsets.
From the ANOVA of the above data we see that the degrees of freedom of the independent variable is 2 and the F value is 2.447.
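The subsets used above are not listed in the transcript, so the sketch below uses three hypothetical random subsets of Sepal Length; the structure of the output (2 degrees of freedom for the group factor) is the same, but the F value will differ from the 2.447 quoted above.
> set.seed(1)                                     # hypothetical seed, for reproducibility
> s1 <- sample(iris$Sepal.Length, 30)
> s2 <- sample(iris$Sepal.Length, 30)
> s3 <- sample(iris$Sepal.Length, 30)
> dat <- data.frame(length = c(s1, s2, s3),
+                   group = factor(rep(c("s1", "s2", "s3"), each = 30)))
> summary(aov(length ~ group, data = dat))        # one-way ANOVA across the three subsets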
5. Create a random walk plot of 100 inputs starting at t = 10 seconds.
To create a random walk for 100 trials starting at t = 10 (the generating code is sketched below):
> plot(y, type = 'l')
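The transcript omits the code that generates y; a minimal sketch, assuming a simple Gaussian random walk whose time index starts at t = 10 s:
> set.seed(42)                            # hypothetical seed, for reproducibility
> t_vals <- 10:109                        # 100 time points starting at t = 10
> y <- cumsum(rnorm(100))                 # random walk: cumulative sum of N(0, 1) steps
> plot(t_vals, y, type = 'l', xlab = 'time (s)', ylab = 'position')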
***
2.7.2. Test of Normal Distribution:
Normality tests are used to determine if a data set is well-modeled by a normal distribution and to compute how
likely it is for a random variable underlying the data set to be normally distributed. The tests are a form of model
selection, and can be interpreted several ways, depending on one's interpretations of probability:
In descriptive statistics terms, one measures a goodness of fit of a normal model to the data – if the fit is
poor then the data are not well modeled in that respect by a normal distribution, without making a judgment on
any underlying variable.
In frequentist statistics and statistical hypothesis testing, data are tested against the null hypothesis that they are normally distributed.
In Bayesian statistics, one does not "test normality" per se, but rather computes the likelihood that the data
come from a normal distribution with given parameters μ,σ (for all μ,σ), and compares that with the likelihood
that the data come from other distributions under consideration, most simply using a Bayes factor (giving the
relative likelihood of seeing the data given different models), or more finely taking a prior distribution on possible
models and parameters and computing a posterior distribution given the computed likelihoods.
2.7.3. Graphical methods:
Histogram method: An informal approach to testing normality is to compare a histogram of the sample data to
a normal probability curve. The empirical distribution of the data (the histogram) should be bell-shaped and
resemble the normal distribution. This might be difficult to see if the sample is small. In this case one might
proceed by regressing the data against the quantiles of a normal distribution with the same mean and variance
as the sample. Lack of fit to the regression line suggests a departure from normality.
Histogram: hist()
Example of specifying the breaks attribute of hist():
> hist(data2, breaks = c(2, 4, 5, 6, 9))
Example:
> hist(data2, freq = FALSE, col = 'gray85')
> lines(density(data2), lty = 2)
> lines(density(data2, kernel = 'rectangular'))
Quantile-quantile plot (QQ plot): A graphical tool for assessing normality is the normal probability plot, a
quantile-quantile plot (QQ plot) of the standardized data against the standard normal distribution. Here the
correlation between the sample data and normal quantiles (a measure of the goodness of fit) measures how well
the data are modeled by a normal distribution. For normal data the points plotted in the QQ plot should fall
approximately on a straight line, indicating high positive correlation. These plots are easy to interpret and also
have the benefit that outliers are easily identified.
Frequentist tests:
Frequentist tests are used to test univariate normality. They include:
• D'Agostino's K-squared test
• Jarque–Bera test (derived from skewness and kurtosis estimates)
• Anderson–Darling test
• Cramér–von Mises criterion
• Lilliefors test for normality (itself an adaptation of the Kolmogorov–Smirnov test)
• Shapiro–Wilk test
• Pearson's chi-squared test
• Shapiro–Francia test
The normal distribution has the highest entropy of any distribution for a given standard deviation. There are a
number of normality tests based on this property, the first attributable to Vasicek.
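Several of these tests are available directly in R; for example, the Shapiro–Wilk test is built in as shapiro.test(). A quick sketch on the Sepal Length data used earlier:
> shapiro.test(iris$Sepal.Length)    # null hypothesis: the data are normally distributed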
Bayesian tests:
Kullback–Leibler divergences between the whole posterior distributions of the slope and variance do not indicate
non-normality. However, the ratio of expectations of these posteriors and the expectation of the ratios give
similar results to the Shapiro–Wilk statistic except for very small samples, when non-informative priors are used.
Spiegelhalter suggests using a Bayes factor to compare normality with a different class of distributional
alternatives. This approach has been extended by Farrell and Rogers-Stewart.
2.8. Central Limit Theorem:
The central limit theorem states that under certain (fairly common) conditions, the sum of many random
variables will have an approximately normal distribution.
More specifically, if X1, …, Xn are independent and identically distributed random variables with the same
arbitrary distribution, zero mean, and variance σ², and Z is their mean scaled by √n, that is
Z = √n · ((X1 + ⋯ + Xn) / n) = (X1 + ⋯ + Xn) / √n,
then, as n increases, the probability distribution of Z will tend to the normal distribution with zero mean and
variance σ².
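A quick empirical sketch of this statement in R, using a uniform distribution on (−1, 1) for the Xi (an arbitrary non-normal choice with zero mean and variance 1/3):
> set.seed(1)                                              # hypothetical seed
> n <- 30
> z <- replicate(5000, sqrt(n) * mean(runif(n, -1, 1)))    # Z = sqrt(n) * sample mean
> hist(z, freq = FALSE, breaks = 40)                       # looks approximately normal
> curve(dnorm(x, mean = 0, sd = sqrt(1/3)), add = TRUE, lty = 2)   # limiting N(0, sigma^2) density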
The central limit theorem also implies that certain distributions can be approximated by the normal distribution,
for example:
• The binomial distribution B(n, p) is approximately normal with mean np and variance np(1−p) for large n and
for p not too close to zero or one.
• The Poisson distribution with parameter λ is approximately normal with mean λ and variance λ, for large values
of λ.
• The chi-squared distribution χ2(k) is approximately normal with mean k and variance 2k, for large k.
• The Student's t-distribution t(ν) is approximately normal with mean 0 and variance 1 when ν is large.
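As a numerical check of the first approximation (the parameters n = 100 and p = 0.3 are arbitrary illustrative choices):
> n <- 100; p <- 0.3
> pbinom(35, size = n, prob = p)                           # exact binomial P(X <= 35)
> pnorm(35.5, mean = n * p, sd = sqrt(n * p * (1 - p)))    # normal approximation (with continuity correction)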
Random walk:
A random walk is a mathematical formalization of a path that consists of a succession of random steps.
Example: The path traced by a molecule as it travels in a liquid or a gas, the search path of a foraging animal,
the price of a fluctuating stock and the financial status of a gambler can all be modeled as random walks,
although they may not be truly random in reality. The term random walk was first introduced by Karl Pearson in
1905.
Random walks have been used in many fields: ecology, economics, psychology, computer science, physics,
chemistry, and biology.
Application of Random Walk:
Applying the random walk theory to finance and stocks suggests that stock prices change randomly, making it
impossible to predict stock prices. The random walk theory corresponds to the belief that markets are efficient,
and that it is not possible to beat or predict the market because stock prices reflect all available information and
the occurrence of new information is seemingly random as well.
Case study: Binomial Distribution:
Amir buys a chocolate bar every day during a promotion that says one out of six chocolate bars has a gift
coupon inside. Answer the following questions:
•What is the distribution of the number of chocolates with gift coupons in seven days?
•What is the probability that Amir gets no chocolates with gift coupons in seven days?
•Amir gets no gift coupons for the first six days of the week. What is the chance that he will get one on the
seventh day?
•Amir buys a bar every day for six weeks. What is the probability that he gets at least three gift coupons?
•How many days of purchase are required so that Amir’s chance of getting at least one gift coupon is 0.95 or
greater?
Solution:
Hints:
Formula: P(X = r) = nCr p^r q^(n−r)
where n is the number of trials,
r is the number of successful outcomes,
p is the probability of success,
and q is the probability of failure.
Other important formulae: p + q = 1, hence q = 1 − p. Here, p = 1/6 and q = 5/6.
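Using these hints, the case-study questions can be answered in R with dbinom() and pbinom(); the calls below are an illustrative sketch rather than part of the original solution. The number of coupons in 7 days follows a binomial distribution with n = 7 and p = 1/6.
> p <- 1/6
> dbinom(0, size = 7, prob = p)            # P(no coupons in 7 days)
> p                                        # by independence, chance of a coupon on day 7 is still 1/6
> 1 - pbinom(2, size = 42, prob = p)       # P(at least 3 coupons in 6 weeks = 42 days)
> ceiling(log(0.05) / log(1 - p))          # smallest n with P(at least one coupon) >= 0.95 (17 days)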