Single Samples
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
Resources
• Crawley, MJ (2005) Statistics: An Introduction
Using R. Wiley.
• Gentle, JE (2002) Elements of Computational
Statistics. Springer.
• Gonick, L., and Woollcott Smith (1993) A Cartoon
Guide to Statistics. HarperResource (for fun).
Questions of Interest About Single
Samples
• What is the mean value?
• Is the mean value significantly different from
expectation or theory?
• What is the level of uncertainty associated with
the estimate of the mean value?
Facts Needed for Answers
• Are the values normally distributed (bell-shaped) or not?
• Are there outliers in the data?
• If the data were collected over a period of time,
is there evidence for serial correlation?
To use standard parametric tests, you need
normal data, without outliers, and without
serial correlation.
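As a sketch of the parametric route, a one-sample t-test compares the sample mean against a theoretical expectation. The expected mean of 2.5 and the simulated data here are assumed example values, not from the course data set:

```r
# One-sample Student's t-test against an assumed expected mean of 2.5
set.seed(0)
y <- rnorm(50, mean = 2.4, sd = 0.2)   # simulated stand-in for the course data
t.test(y, mu = 2.5)                    # tests whether the mean differs from 2.5
```

The output reports the estimated mean, a p-value, and a 95% confidence interval, answering all three questions of interest at once.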
Data Summary
> data <- read.table("das.txt", header=T)
> names(data)
[1] "y"
> attach(data)
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.904   2.241   2.414   2.419   2.568   2.984
> plot(y)
Querying your data
> y[50] <- 21.79386             # introduce a typo: a misplaced decimal point
> plot(y)                       # the outlier stands out immediately
> which(y > 10)                 # find the index of the suspect value
[1] 50
> y[50] <- 2.179386             # correct the typo
> boxplot(y, ylab="data values")
Results
Normal Distribution
• The Central Limit Theorem implies that anything
produced by adding a large number of random samples
(such as the mean) is approximately normally distributed.
– dnorm(z) is the density of the standard normal
distribution, with mean 0.0 and standard deviation (i.e.,
√variance) of 1.0. (z here is the standard unit for the
normal distribution.)
– pnorm(x) is the probability of a z value of x or less.
– qnorm(c(p1,p2)) gives the corresponding values of z that
produce the probabilities p1 and p2.
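A quick sketch of the three standard-normal helper functions just listed:

```r
# Standard-normal helpers in R
dnorm(0)                 # density at z = 0: about 0.3989
pnorm(1.96)              # P(Z <= 1.96): about 0.975
qnorm(c(0.025, 0.975))   # z values cutting off the central 95%: about -1.96 and 1.96
```

Note the inverse relationship: qnorm undoes pnorm, so qnorm(pnorm(x)) returns x.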
Plots for Testing Normality
• The simplest and often the best test of normality is the
quantile-quantile plot
– qqnorm(y)
– qqline(y, lty=2)
• If the resulting plot shows a marked S-shape, it
indicates non-normality. You’ve already seen this
demonstrated.
• If the data are non-normal, use Wilcoxon's signed rank
test (wilcox.test) rather than Student's t-test (t.test)
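The Q-Q diagnosis and the non-parametric fallback above can be sketched as follows. The exponential sample and the null median of 0.5 are assumed illustration values:

```r
# Q-Q plot on deliberately non-normal data, then the non-parametric fallback
set.seed(1)
y <- rexp(100)           # strongly right-skewed, so clearly non-normal
qqnorm(y)
qqline(y, lty = 2)       # points bow away from the dashed line in an S/banana shape
wilcox.test(y, mu = 0.5) # signed-rank test of location, instead of t.test(y, mu = 0.5)
```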
Inference
• Demonstration with speed of light data
• Another way to test this is bootstrapping
– Demonstration
• Demonstration of Student's t
– dt(z,df)
– pt(z,df)
– qt(c(p,q),df)
• Comparison between Student's t and normal
distributions.
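The bootstrapping idea mentioned above can be sketched in base R by resampling the data with replacement; the simulated sample and the choice of 10,000 resamples are assumptions for illustration:

```r
# Bootstrap interval for the mean of a sample
set.seed(3)
y <- rnorm(100, mean = 2.4, sd = 0.2)                        # stand-in sample
boot.means <- replicate(10000, mean(sample(y, replace = TRUE)))  # resampled means
quantile(boot.means, c(0.025, 0.975))                        # bootstrap 95% interval
```

If the parametric assumptions hold, this interval should agree closely with the one reported by t.test(y).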
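A minimal sketch of the Student's t functions and the comparison with the normal distribution (5 degrees of freedom chosen arbitrarily for illustration):

```r
# Student's t vs the normal distribution
qt(0.975, df = 5)                # about 2.571: wider critical value than the normal
qnorm(0.975)                     # about 1.960
pt(2, df = 10)                   # P(T <= 2) with 10 degrees of freedom
curve(dnorm(x), -4, 4, lty = 2)  # normal density (dashed)
curve(dt(x, df = 5), add = TRUE) # t density: fatter tails, lower peak
```

As the degrees of freedom grow, the t density converges to the dashed normal curve.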
Skew
• Dimensionless version of the third moment
about the mean.
m3 = Σ(y − ȳ)³ / n
s3 = (√s²)³
skew = γ1 = m3 / s3
• Measures the extent to which the distribution
has a tail on one or the other side.
• Demo of skew test.
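The skew statistic defined above can be sketched as a small R function. The approximate standard error √(6/n) used in the rough test is an assumption taken from standard references, not stated on the slide:

```r
# Dimensionless skew, following the definition above
skew <- function(x) {
  m3 <- sum((x - mean(x))^3) / length(x)  # third moment about the mean
  s3 <- sqrt(var(x))^3                    # cube of the standard deviation
  m3 / s3
}

x <- rexp(1000)                # strongly right-skewed data, so skew should be large
skew(x) / sqrt(6 / length(x))  # compare to ~2: a rough t-like test of skewness
```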
Kurtosis
• Dimensionless version of the fourth moment
about the mean.
m4 = Σ(y − ȳ)⁴ / n
s4 = (s²)²
kurtosis = γ2 = m4 / s4 − 3
• Measures the extent to which the distribution is
peaky or flat-topped.
• Demo of kurtosis test.
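The same pattern gives a sketch of the kurtosis statistic defined above:

```r
# Excess kurtosis, following the definition above (0 for a normal distribution)
kurtosis <- function(x) {
  m4 <- sum((x - mean(x))^4) / length(x)  # fourth moment about the mean
  s4 <- var(x)^2                          # square of the variance
  m4 / s4 - 3                             # subtract 3 so the normal scores 0
}

kurtosis(rnorm(1000))   # near 0 for normal data; positive values indicate a peaky shape
```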
Conclusions
• A generalisation of these individual tests is the
Kolmogorov-Smirnov test (ks.test), which is
usually used to compare two distributions.
• If the variance is already an ill-behaved estimate,
the skew and kurtosis, as higher moments, are
estimated even less reliably.
• We've seen ways of testing for normality and
outliers. Serial correlation will be discussed
when we learn about analysis of variance.
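The Kolmogorov-Smirnov test mentioned above can be sketched in both its one- and two-sample forms; the data here are simulated for illustration. (Strictly, estimating the mean and sd from the same sample makes the one-sample p-value optimistic, a caveat worth remembering.)

```r
# Kolmogorov-Smirnov tests of distribution
set.seed(2)
y <- rnorm(100, mean = 2.4, sd = 0.2)
ks.test(y, "pnorm", mean = mean(y), sd = sd(y))  # one-sample: y vs a fitted normal
ks.test(rnorm(100), rexp(100))                   # two-sample: are the distributions the same?
```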