Univariate MATLAB plots
boxplot Box plot
cdfplot Empirical cumulative distribution function plot
dfittool Interactive distribution fitting
disttool Interactive density and distribution plots
ecdfhist Empirical cumulative distribution function histogram
histfit Histogram with a distribution fit
normplot Normal probability plot
normspec Normal density plot between specifications
probplot Probability plots
qqplot Quantile-quantile plot
randtool Interactive random number generation
wblplot Weibull probability plot
Some MATLAB bivariate/multivariate plots
boxplot Box plot
gline Interactively add line to plot
gname Add case names to plot
gplotmatrix Matrix of scatter plots by group
gscatter Scatter plot by group
hist3 Bivariate histogram
lsline Add least-squares line to scatter plot
refcurve Add reference curve to plot
refline Add reference line to plot
scatterhist Scatter plot with marginal histograms
Statistics Toolbox in MATLAB
Some examples
help rng
rng('default')
x = [normrnd(4,1,1,100) normrnd(6,0.5,1,200)];
histfit(x)
[Figure: histogram of x with superimposed fitted normal density, produced by histfit(x)]
help histfit
histfit Histogram with superimposed fitted normal density.
histfit(DATA,NBINS) plots a histogram of the values in the vector DATA,
along with a normal density function with parameters estimated from the
data. NBINS is the number of bars in the histogram. With one input
argument, NBINS is set to the square root of the number of elements in
DATA.
histfit(DATA,NBINS,DIST) plots a histogram with a density from the DIST
distribution. DIST can take the following values:
  'beta'                                Beta
  'birnbaumsaunders'                    Birnbaum-Saunders
  'exponential'                         Exponential
  'extreme value' or 'ev'               Extreme value
  'gamma'                               Gamma
  'generalized extreme value' or 'gev'  Generalized extreme value
  'generalized pareto' or 'gp'          Generalized Pareto (threshold 0)
  'inverse gaussian'                    Inverse Gaussian
  'logistic'                            Logistic
  'loglogistic'                         Log logistic
  'lognormal'                           Lognormal
  'negative binomial' or 'nbin'         Negative binomial
  'nakagami'                            Nakagami
  'normal'                              Normal
  'poisson'                             Poisson
  'rayleigh'                            Rayleigh
  'rician'                              Rician
  'tlocationscale'                      t location-scale
  'weibull' or 'wbl'                    Weibull
H = histfit(...) returns a vector of handles to the plotted lines.
H(1) is a handle to the histogram, H(2) is a handle to the density curve.
Reference page in Help browser
doc histfit
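As a brief illustration of the three-argument form and the returned handles, the sketch below fits a non-default distribution to the same kind of sample used earlier. The bin count (20), the choice of 'logistic', and the styling values are arbitrary assumptions made for the example; they do not come from the help text.

```matlab
% Sketch: histfit with an explicit bin count and a non-default distribution.
rng('default')                                   % reproducible random numbers
x = [normrnd(4,1,1,100) normrnd(6,0.5,1,200)];   % mixture sample as in the example above
h = histfit(x, 20, 'logistic');                  % 20 bins, fitted logistic density
set(h(1), 'FaceColor', [0.8 0.8 1])              % h(1): handle to the histogram
set(h(2), 'Color', 'k', 'LineWidth', 2)          % h(2): handle to the density curve
```

Any other name from the DIST list can be substituted, provided the data lie in that distribution's support.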
probplot('normal',x)
[Figure: Probability plot for Normal distribution; x-axis: Data, y-axis: Probability]
>> p = 0:0.25:1;
y = quantile(x,p);
z = [p;y]
z=
0 0.2500 0.5000 0.7500 1.0000
1.0557 4.7375 5.6872 6.1526 7.5784
>> y = [mean(x) median(x)]
y = 5.3438 5.6872
>> y = [skewness(x) kurtosis(x)]
y = -1.0417 3.5895
>> Z = zscore(x);
>> find(abs(Z)>3)
3 35
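The outlier check above standardizes each observation and flags those more than three standard deviations from the mean. A minimal sketch of the same computation written out by hand, assuming zscore's default normalization (the sample standard deviation, divided by n-1); the variable names Zmanual and outliers are introduced here for illustration only:

```matlab
% Sketch: z-scores computed by hand, then thresholded at |z| > 3.
rng('default')
x = [normrnd(4,1,1,100) normrnd(6,0.5,1,200)];
Zmanual  = (x - mean(x)) / std(x);    % should agree with zscore(x) for a vector
outliers = find(abs(Zmanual) > 3);    % indices of observations beyond 3 SDs
```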
quantile initially assigns the sorted values in X to the (0.5/n), (1.5/n), ..., ([n – 0.5]/n)
quantiles. For example:
For a data vector of six elements such as {6, 3, 2, 10, 8, 1}, the sorted elements {1, 2,
3, 6, 8, 10} respectively correspond to the (0.5/6), (1.5/6), (2.5/6), (3.5/6), (4.5/6),
and (5.5/6) quantiles.
For a data vector of five elements such as {2, 10, 5, 9, 13}, the sorted elements {2, 5,
9, 10, 13} respectively correspond to the 0.1, 0.3, 0.5, 0.7, and 0.9 quantiles.
The following figure illustrates this approach for data vector X = {2, 10, 5, 9, 13}. The first
observation corresponds to the cumulative probability 1/5 = 0.2, the second observation
corresponds to the cumulative probability 2/5 = 0.4, and so on. The step function in this
figure shows these cumulative probabilities. quantile instead places the observations in
midpoints, such that the first corresponds to 0.5/5 = 0.1, the second corresponds to 1.5/5
= 0.3, and so on, and then connects these midpoints. The red lines in the following figure
connect the midpoints.
By switching the axes, as in the next figure, you can see the values of the
variable X that correspond to the p quantiles.
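The midpoint rule described above can be sketched in a few lines: sort the data, attach the probabilities (k - 0.5)/n to the sorted values, and connect those points with straight lines. The sketch assumes that probabilities outside [0.5/n, (n - 0.5)/n] map to the smallest or largest observation, and the variable names (xs, pk, pc, q) are chosen here for illustration.

```matlab
% Sketch: quantiles by the midpoint rule with linear interpolation.
X  = [2 10 5 9 13];
xs = sort(X);                          % 2 5 9 10 13
n  = numel(xs);
pk = ((1:n) - 0.5) / n;                % 0.1 0.3 0.5 0.7 0.9
p  = [0.1 0.25 0.5 0.8 1];             % requested cumulative probabilities
pc = min(max(p, pk(1)), pk(end));      % clamp to the range covered by the midpoints
q  = interp1(pk, xs, pc);              % approximately 2, 4.25, 9, 11.5, 13
```

For this data vector the result should agree with quantile(X,p).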
quantile finds any quantiles between the data values using linear interpolation.
Linear interpolation uses linear polynomials to approximate a function f(x) and
construct new data points within the range of a known set of data points.
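Algebraically, given the data points (x1, y1) and (x2, y2), where y1 = f(x1) and y2 =
f(x2), linear interpolation finds y = f(x) for a given x between x1 and x2 as follows:
$$ y = f(x) = y_1 + \frac{x - x_1}{x_2 - x_1}\,(y_2 - y_1) $$
Similarly, if the 1.5/n quantile is y_{1.5/n} and the 2.5/n quantile is y_{2.5/n}, then linear
interpolation finds the 2.3/n quantile y_{2.3/n} as
$$ y_{2.3/n} = y_{1.5/n} + \frac{2.3/n - 1.5/n}{2.5/n - 1.5/n}\,\left(y_{2.5/n} - y_{1.5/n}\right) $$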
load gas
prices = [price1 price2];
normplot(prices)
Both scatters approximately follow straight lines through the first and third
quartiles of the samples, indicating approximately normal distributions.
A hypothesis test is used to quantify the test of normality.
Since each sample is relatively small, a Lilliefors test is recommended.
lillietest(price1)
ans = 0
lillietest(price2)
ans = 0
Neither test rejects the null hypothesis of normality at the default 5% significance level.
sample_means = mean(prices)
sample_means = 115.1500 118.5000
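Test whether the first sample is consistent with a mean price of $1.15, assuming a known standard deviation of $0.04:
[h,pvalue,ci] = ztest(price1/100,1.15,0.04)
h = 0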
pvalue = 0.8668
ci =
1.1340
1.1690
The logical output h = 0 indicates a failure to reject the null hypothesis at the
default significance level of 5%. This is a consequence of the high probability
under the null hypothesis, indicated by the p value, of observing a z-statistic as
extreme as or more extreme than the one computed from the sample. The 95%
confidence interval on the mean, [1.1340 1.1690], includes the hypothesized
population mean of $1.15.
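To make the link between the z-statistic, the p value, and the confidence interval explicit, the test can be reproduced by hand. This is a sketch under the same assumptions as the ztest call above (hypothesized mean 1.15, known standard deviation 0.04, two-sided alternative, 5% level); the variable names are chosen for illustration.

```matlab
% Sketch: two-sided z-test computed step by step.
load gas                                        % example data set with price1 and price2
x     = price1/100;                             % first sample, in dollars
mu0   = 1.15;                                   % hypothesized mean
sigma = 0.04;                                   % assumed known standard deviation
n     = numel(x);
se    = sigma/sqrt(n);                          % standard error of the mean
z     = (mean(x) - mu0)/se;                     % z-statistic
pvalue = 2*normcdf(-abs(z));                    % two-sided p value
ci     = mean(x) + norminv([0.025 0.975])*se;   % 95% confidence interval
h      = pvalue <= 0.05;                        % 1 means reject at the 5% level
```

The values of h, pvalue, and ci should match the ztest output shown above.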
Does the later sample offer stronger evidence for rejecting a null hypothesis of a statewide average price of $1.15 in February? The shift shown in the probability plot and the
difference in the computed sample means suggest this. The shift might indicate a
significant fluctuation in the market, raising questions about the validity of using the
historical standard deviation. If a known standard deviation cannot be assumed, a t-test
is more appropriate.
[h,pvalue,ci] = ttest(price2/100,1.15)
h = 1; pvalue = 4.9517e-004; ci = 1.1675 1.2025
The logical output h = 1 indicates a rejection of the null hypothesis at the default
significance level of 5%. In this case, the 95% confidence interval on the mean does not
include the hypothesized population mean of $1.15.
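The t-test differs from the z-test only in that the standard deviation is estimated from the sample and the reference distribution is Student's t with n-1 degrees of freedom. A hand-computed sketch, assuming the same two-sided alternative and 5% level as the ttest call above (variable names are illustrative):

```matlab
% Sketch: two-sided one-sample t-test computed step by step.
load gas
x   = price2/100;                               % second sample, in dollars
mu0 = 1.15;                                     % hypothesized mean
n   = numel(x);
se  = std(x)/sqrt(n);                           % estimated standard error
t   = (mean(x) - mu0)/se;                       % t-statistic with n-1 degrees of freedom
pvalue = 2*tcdf(-abs(t), n-1);                  % two-sided p value
ci     = mean(x) + tinv([0.025 0.975], n-1)*se; % 95% confidence interval
h      = pvalue <= 0.05;                        % 1 means reject at the 5% level
```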
You might want to investigate the shift in prices a little more closely. The function ttest2
tests if two independent samples come from normal distributions with equal but
unknown standard deviations and the same mean, against the alternative that the
means are unequal.
[h,sig,ci] = ttest2(price1,price2)
h = 1; sig = 0.0083; ci = -5.7845 -0.9155
The null hypothesis is rejected at the default 5% significance level, and the confidence
interval on the difference of means does not include the hypothesized value of 0.
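Under the equal-variance assumption, the two-sample statistic is built from a pooled standard deviation. The following sketch writes the calculation out, assuming the default two-sided alternative and 5% level of ttest2; the names sp, se, and d are introduced here for illustration.

```matlab
% Sketch: pooled-variance two-sample t-test computed step by step.
load gas
n1 = numel(price1);
n2 = numel(price2);
df = n1 + n2 - 2;                                         % degrees of freedom
sp = sqrt(((n1-1)*var(price1) + (n2-1)*var(price2))/df);  % pooled standard deviation
se = sp*sqrt(1/n1 + 1/n2);                                % standard error of the difference
d  = mean(price1) - mean(price2);                         % difference of sample means
t  = d/se;                                                % t-statistic
sig = 2*tcdf(-abs(t), df);                                % two-sided p value
ci  = d + tinv([0.025 0.975], df)*se;                     % 95% CI on the difference
h   = sig <= 0.05;                                        % 1 means reject at the 5% level
```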
A notched box plot is another way to visualize the shift.
boxplot(prices,1)
set(gca,'XTick',[1 2])
set(gca,'XTickLabel',{'January','February'})
xlabel('Month')
ylabel('Prices ($0.01)')
The plot displays the distribution of the samples around their medians. The heights of the notches
in each box are computed so that the side-by-side boxes have nonoverlapping notches when their
medians are different at a default 5% significance level. The computation is based on an
assumption of normality in the data, but the comparison is reasonably robust for other
distributions. The side-by-side plots provide a kind of visual hypothesis test, comparing medians
rather than means. The plot above appears to barely reject the null hypothesis of equal medians.
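The visual comparison can be checked numerically. The sketch below assumes the notch extremes documented for boxplot, median ± 1.57(q3 - q1)/sqrt(n), and assumes the quartiles can be taken from quantile; treat it as an approximation of what the plot draws rather than a reproduction of its internals.

```matlab
% Sketch: notch intervals around the medians and an overlap check.
load gas
prices = [price1 price2];
n = size(prices,1);                               % observations per sample
q = quantile(prices, [0.25 0.5 0.75]);            % rows: q1, median, q3 for each column
halfwidth = 1.57*(q(3,:) - q(1,:))/sqrt(n);       % assumed notch half-width
notchLo = q(2,:) - halfwidth;
notchHi = q(2,:) + halfwidth;
overlap = notchLo(2) <= notchHi(1) && notchLo(1) <= notchHi(2);  % true if notches overlap
```

If overlap is false, the notches do not overlap, which the plot reads as evidence of different medians at roughly the 5% level.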
The nonparametric Wilcoxon rank sum test, implemented by the function
ranksum, can be used to quantify the test of equal medians. It tests if two
independent samples come from identical continuous (not necessarily normal)
distributions with equal medians, against the alternative that they do not
have equal medians.
[p,h] = ranksum(price1,price2)
p=
0.0095
h=
1
The test rejects the null hypothesis of equal medians at the default 5% significance
level.