Diploma in Statistics: Laboratory 1
In this laboratory I want you to learn how to use Minitab for simple statistical
analyses. The interface is (in my view!) very simple, especially for anyone familiar
with Microsoft products, for example Excel. We will begin with the design and
analysis of simple comparative studies, using the data from the notes to explore
the Minitab facilities. Then we will analyse the melon varieties data discussed
when we studied ANOVA and the Ski-trails dataset which was examined when we
studied regression. Note that screenshots of most of the required operations are
attached to this handout.
Minitab works on data stored in a spreadsheet (worksheet). This sheet is static
(unlike Excel): it stores numbers or text, not formulae, so you cannot dynamically
change analyses. Most operations refer to data stored in columns (labelled c1,
c2…). The introductory example of a simple comparative study from Chapter 2 is
reproduced below.
1.
A study involved charging a spheroniser (essentially a centrifuge) with a dough-like
material and running it at two speeds (labelled A and B) to determine which gives
the higher yield. The yield is the percentage of the material that ends up as usable
pellets (the pellets are sieved to separate out and discard pellets that are either too
large or too small).
Speed-B   76.2  81.3  77.0  79.9  76.4  76.2  77.6  80.5  81.5  77.3  78.2
Speed-A   73.5  77.0  74.8  72.7  75.4  77.1  76.1  74.4  78.1  76.5  75.0
Table 1: Percentage yields for the two speeds
Type the data for Speed A and Speed B into two columns, say Speed A into c1
and Speed B into c2. In the header over the worksheet cells type Speed A into
Column 1 and Speed B into column 2.
Obtain descriptive statistics, Stat/ Basic Statistics/ Display descriptive
statistics, and dotplots, Graph/ Dotplot, of the data. (See screenshots, pages 5,
6)
Use the Stat/ Basic Statistics/ Two-sample t menu to obtain a t-test and
confidence interval for the long-run yield difference. (See screenshots, page 7)
The purpose of this session is to learn how to use a statistics package to analyse
data. The objective is not just to push the right buttons! When you get output,
ask yourself if you know how the calculations were done. How do you interpret
each of the elements in the output?
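One way to check that you know how the calculations were done is to reproduce the two-sample output by hand. The sketch below uses Python rather than Minitab and computes the pooled (equal-variance) t statistic and a 95% confidence interval for the Table 1 yields. It assumes that the first eleven values listed in Table 1 belong to Speed B and the last eleven to Speed A, which is consistent with the F statistic of 1.58 in the variance-test output later in this handout. The critical value 2.086 (t distribution, 20 degrees of freedom, two-tailed 5% level) is taken from standard tables rather than computed, and all variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Table 1 yields (assumed split: first eleven values Speed B, last eleven Speed A)
speed_b = [76.2, 81.3, 77.0, 79.9, 76.4, 76.2, 77.6, 80.5, 81.5, 77.3, 78.2]
speed_a = [73.5, 77.0, 74.8, 72.7, 75.4, 77.1, 76.1, 74.4, 78.1, 76.5, 75.0]

n_a, n_b = len(speed_a), len(speed_b)
diff = mean(speed_b) - mean(speed_a)

# Pooled variance: the two sample variances weighted by their degrees of freedom
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * variance(speed_a) + (n_b - 1) * variance(speed_b)) / df

# Standard error of the difference between the two sample means
se_diff = sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat = diff / se_diff

T_CRIT = 2.086  # tabulated t value for df = 20, two-tailed 5% level (assumed constant)
ci = (diff - T_CRIT * se_diff, diff + T_CRIT * se_diff)

print(f"difference = {diff:.2f}, t = {t_stat:.2f}, df = {df}")
print(f"95% CI for the difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Compare each printed number with the corresponding element of the Minitab two-sample t output.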
2.
To check the model assumptions we usually obtain ‘residuals’ (what is ‘left-over’
when we subtract the group mean from each of the two sets of results) and
combine them into one set of values for analysis.
Use Calc/ calculator to calculate residuals (see page 8): c1-mean(c1) gives the
residuals for Speed A and c2-mean(c2) gives those for Speed B. Store these in
columns c3 (Speed A) and c4 (Speed B).
Stack the two columns in c5 either by copy and paste operations (either using the
Edit menu, or control-C copies selected cells and control-V pastes the values into
new cells) or by using Data/ Stack/ Columns. Obtain a Normal probability plot of
the residuals using Stat/ Basic Statistics / Normality Test (see page 9). Is the
Normality assumption reasonable?
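The residual calculation and stacking steps can be sketched outside Minitab as follows. This is a minimal Python illustration, assuming the same split of Table 1 as elsewhere (first eleven values Speed B, last eleven Speed A). The Normal scores use the plotting position (i - 0.375)/(n + 0.25), which is one common convention and an assumption here; Minitab's plot may use a different one.

```python
from statistics import NormalDist, mean

# Table 1 yields (assumed split: first eleven values Speed B, last eleven Speed A)
speed_b = [76.2, 81.3, 77.0, 79.9, 76.4, 76.2, 77.6, 80.5, 81.5, 77.3, 78.2]
speed_a = [73.5, 77.0, 74.8, 72.7, 75.4, 77.1, 76.1, 74.4, 78.1, 76.5, 75.0]


def residuals(values):
    """Subtract the group mean from each value in the group."""
    group_mean = mean(values)
    return [v - group_mean for v in values]


# Stack the two sets of residuals into one list (the role of column c5)
stacked = residuals(speed_a) + residuals(speed_b)

# Normal scores: the standard Normal quantile expected for each ordered residual
n = len(stacked)
ordered = sorted(stacked)
scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Plotting ordered residuals against scores gives a Normal probability plot;
# points close to a straight line support the Normality assumption.
for r, z in zip(ordered, scores):
    print(f"{r:7.2f} {z:7.2f}")
```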
3.
Use Stat/ Basic Statistics/ 2 Variances to carry out an F-test of the hypothesis
that the two long-run standard deviations are equal (see pages 10, 11).
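The F statistic itself is simply the ratio of the two sample variances, as the short Python sketch below shows. It assumes the same split of Table 1 as before (first eleven values Speed B, last eleven Speed A). The p-value requires the F distribution and is not computed here; the Minitab output later in this handout reports a test statistic of 1.58 with a P-value of 0.483.

```python
from statistics import stdev, variance

# Table 1 yields (assumed split: first eleven values Speed B, last eleven Speed A)
speed_b = [76.2, 81.3, 77.0, 79.9, 76.4, 76.2, 77.6, 80.5, 81.5, 77.3, 78.2]
speed_a = [73.5, 77.0, 74.8, 72.7, 75.4, 77.1, 76.1, 74.4, 78.1, 76.5, 75.0]

# F statistic: ratio of the sample variances (the squares of the sample SDs)
f_ratio = variance(speed_b) / variance(speed_a)
df_num, df_den = len(speed_b) - 1, len(speed_a) - 1

print(f"SD(B) = {stdev(speed_b):.3f}, SD(A) = {stdev(speed_a):.3f}")
print(f"F = {f_ratio:.2f} on ({df_num}, {df_den}) degrees of freedom")
```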
4.
Now carry out the same analyses (apart from the residual analysis) for the
Epitaxial Layer study (attached, page 18); carry out the other analyses required in
the examination question. Note that the t-tests on summarised data are carried
out by selecting the “summarised data” part of the menu and filling in the data
summaries given on page 18. The one-sample analyses are done using menus
that are very similar to the two-sample analyses (see Stat/ Basic Statistics/ 1-sample t). Note that when comparing two variances (SD squared) the menu
requires you to input the variances – not the SDs, as given in the exam question.
Analyse the Drivers dataset (attached, pages 19-20) using the Stat/ Basic
Statistics/ Paired t menu. Obtain the means and differences for each driver on
the two car designs using the Calc/ calculator menu. Plot differences against
means for each driver (use the Graph/ Scatterplot menu) and get a Normal plot of
the differences – do the model assumptions appear valid?
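The paired analysis reduces to a one-sample analysis of the within-driver differences. The Python sketch below illustrates this for the Drivers data, assuming each driver's row in the attached table reads Driver, Design 1, Design 2. The resulting t statistic can be checked against the value of 2.48 quoted in the attached examination question. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Parking times in seconds for drivers 1 to 12 (row order assumed from the attached table)
design_1 = [42.0, 30.7, 21.2, 29.2, 27.1, 38.4, 28.8, 63.2, 38.6, 29.4, 28.3, 26.2]
design_2 = [17.8, 20.2, 16.8, 41.4, 21.3, 38.5, 16.9, 32.2, 27.7, 23.1, 29.6, 20.5]

# Difference and mean for each driver (the quantities plotted against each other)
differences = [a - b for a, b in zip(design_1, design_2)]
driver_means = [(a + b) / 2 for a, b in zip(design_1, design_2)]

# Paired t statistic: mean difference divided by its standard error
n = len(differences)
mean_diff = mean(differences)
se_mean_diff = stdev(differences) / sqrt(n)
t_stat = mean_diff / se_mean_diff

print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}, df = {n - 1}")
for m, d in zip(driver_means, differences):
    print(f"{m:6.2f} {d:7.2f}")
```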
5.
Use the sample size facilities within Minitab (Stat/ Power and Sample Size) to
determine an appropriate sample size to detect a difference of (a) 0.5 (b) 1.0 (c)
1.5 units between two long-run means, when carrying out a two-tailed test using a
significance level of 0.05, while requiring a power of 0.95 and assuming the
standard deviation is either 1, 1.25, or 1.5 (see pages 12, 13).
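A rough check on the sample sizes Minitab reports can be made with the usual Normal approximation, n = 2((z_alpha/2 + z_beta) * sigma / delta)^2 per group. The Python sketch below is an approximation only, not Minitab's method: Minitab works with the t distribution, so its answers tend to be slightly larger. For a difference of 0.5 and a standard deviation of 1, for example, the approximation gives 104 per group, while the Minitab output later in this handout reports 105. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.95):
    """Approximate per-group sample size for a two-tailed two-sample comparison.

    Uses the Normal approximation; an illustrative helper, not Minitab's algorithm.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed significance level
    z_beta = z.inv_cdf(power)           # required power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


for sigma in (1, 1.25, 1.5):
    for delta in (0.5, 1.0, 1.5):
        print(f"sigma = {sigma}, difference = {delta}: n = {n_per_group(delta, sigma)}")
```

Note how quickly the required sample size falls as the difference to be detected grows, and how it rises with the standard deviation.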
6.
Mead and Curnow (1983) report an experiment to compare the yields of four
varieties of melon. Six plots of each variety were grown; the plots were allocated
to varieties at random. The yields (in kg) are shown below. Enter these into four
columns of the worksheet.
Variety1  Variety2  Variety3  Variety4
  25.1      40.3      18.3      28.1
  17.3      35.3      22.6      28.6
  26.4      32.0      25.9      33.2
  16.1      36.5      15.1      31.7
  22.2      43.3      11.4      30.3
  15.9      37.1      23.7      27.6
Page 14 gives screenshots based on the STAT/ ANOVA/ One-way (unstacked)
menu. Obtain the ANOVA table, carry out multiple comparisons (note: Tukey’s =
HSD, Fisher’s = LSD; Dunnett’s is another multiple comparison technique), get
residuals and fitted values and use them to check the model assumptions.
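To see where the entries of the ANOVA table come from, the sums of squares can be computed directly. The Python sketch below does this for the melon yields, assuming the values are grouped by variety in the order listed (six plots per variety). It also produces the fitted values (variety means) and residuals used for checking the model assumptions. Multiple comparisons are not attempted here.

```python
from statistics import mean

# Melon yields in kg, six plots per variety (grouping assumed from the listing order)
yields = {
    "Variety1": [25.1, 17.3, 26.4, 16.1, 22.2, 15.9],
    "Variety2": [40.3, 35.3, 32.0, 36.5, 43.3, 37.1],
    "Variety3": [18.3, 22.6, 25.9, 15.1, 11.4, 23.7],
    "Variety4": [28.1, 28.6, 33.2, 31.7, 30.3, 27.6],
}

all_values = [y for ys in yields.values() for y in ys]
grand_mean = mean(all_values)

# Between-variety and within-variety sums of squares
ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in yields.values())
ss_within = sum((y - mean(ys)) ** 2 for ys in yields.values() for y in ys)

df_between = len(yields) - 1
df_within = len(all_values) - len(yields)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

# Fitted values are the variety means; residuals are what is left over
fitted = [mean(ys) for ys in yields.values() for _ in ys]
residuals = [y - f for y, f in zip(all_values, fitted)]

print(f"Between: SS = {ss_between:.1f}, df = {df_between}, MS = {ms_between:.1f}")
print(f"Within:  SS = {ss_within:.1f}, df = {df_within}, MS = {ms_within:.2f}")
print(f"F = {f_stat:.2f}")
```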
7.
Neter and Wasserman (1974) report the data below on a study of the relationship
between numbers of visitor days, Y, and numbers of miles, X, of intermediate level
ski trails in a sample of New England ski resorts during normal snow conditions.
The data are shown overleaf.
Miles-of-trails   Visitor Days
     10.5            19929
      2.5             5839
     13.1            23696
      4.0             9881
     14.7            30011
      3.6             7241
      7.1            11634
     22.5            45684
     17.0            36476
      6.4            12068
First, obtain a scatterplot of Visitor Days versus Miles-of-trails with a fitted line,
using Stat/ Regression/ Fitted Line Plot (see screenshots page 15). Next fit a
regression model getting full output, including a prediction and a confidence
interval when X = 10 miles of trails (type 10 into the “prediction intervals for new
observations” box of the Options sub-menu). Review what is available under the
various sub-menus (Graphs, Options, Storage, Results – see page 15). Store
residuals and fitted values and carry out the usual residual analyses. Interpret the
different elements of the output.
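The fitted line that Minitab reports can be reproduced from the least squares formulas: slope = Sxy/Sxx and intercept = ybar - slope * xbar. The Python sketch below applies them to the ski trails data, assuming the two columns pair up in the order listed, and gives the point prediction at X = 10 miles. The confidence and prediction intervals need the t distribution and are left to Minitab.

```python
from statistics import mean

# Ski trails data (pairing assumed to follow the order in which the values are listed)
miles = [10.5, 2.5, 13.1, 4.0, 14.7, 3.6, 7.1, 22.5, 17.0, 6.4]
visitor_days = [19929, 5839, 23696, 9881, 30011, 7241, 11634, 45684, 36476, 12068]

x_bar, y_bar = mean(miles), mean(visitor_days)

# Corrected sum of squares for X and corrected sum of cross-products
s_xx = sum((x - x_bar) ** 2 for x in miles)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, visitor_days))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted values and residuals for the usual residual analyses
fitted = [intercept + slope * x for x in miles]
residuals = [y - f for y, f in zip(visitor_days, fitted)]

# Point prediction of visitor days at 10 miles of trails
predicted_at_10 = intercept + slope * 10

print(f"Visitor Days = {intercept:.0f} + {slope:.0f} * Miles-of-trails")
print(f"Prediction at X = 10: {predicted_at_10:.0f}")
```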
8.
To explore some of the graphical functions within Minitab we will first generate
some random data. Use the Calc/ Random Data/ Normal menu to generate 100
data values into a column – see screenshots page 16. Name the column N-data.
Just type it on the worksheet in the header row.
Use the Calc/ Random Data/ Chi-square menu to generate 100 data values from
a chi-square distribution with degrees of freedom equal to 7 into a column – see
screenshots page 17. Name the column Chi-data.
Now explore the Graph menu a little – for example you could get Histograms,
Dotplots and Boxplots of the two columns of data. In each case get both datasets
on the same graph – this makes it easier to make comparisons. I have not given
you screenshots – work it out for yourself! There is a help button on each submenu – see what you can learn from these (in particular look at the Boxplot Help
menu).
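For comparison, the same two columns of random data can be generated outside Minitab. The Python sketch below is an illustration only: it assumes a standard Normal distribution (mean 0, standard deviation 1) for N-data, since the handout does not specify the parameters, and it builds each chi-square value with 7 degrees of freedom as the sum of seven squared standard Normal values. The seed is arbitrary and only makes the sketch repeatable.

```python
import random
from statistics import mean

random.seed(1)  # arbitrary seed, chosen only so that the sketch is repeatable

# 100 values from a standard Normal distribution (assumed mean 0, SD 1)
n_data = [random.gauss(0, 1) for _ in range(100)]


def chi_square(df):
    """One chi-square value: the sum of df squared standard Normal values."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))


# 100 values from a chi-square distribution with 7 degrees of freedom
chi_data = [chi_square(7) for _ in range(100)]

print(f"N-data:   mean = {mean(n_data):.2f}")
print(f"Chi-data: mean = {mean(chi_data):.2f}")
```

The chi-square values are all positive and skewed to the right, which is what the histograms, dotplots and boxplots should show when the two columns are drawn on the same graph.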
NOTE: THE COMMANDS AND SCREENSHOTS IN THIS HANDOUT ARE BASED ON MINITAB 15 –
MINITAB 16 IS NOW INSTALLED ON THE COLLEGE SYSTEM; THERE MAY BE MINOR DIFFERENCES
IN THE LATER VERSION.
Descriptive Statistics (screenshot, page 5)
Select the variables you want either by highlighting them and then clicking on
Select, or just by double-clicking on the variable, e.g. C1 A. You can change the
default options by selecting or deselecting them.
Dotplots for Data in Two Columns (screenshot, page 6)
Select the variables you want either by highlighting them and then clicking on
Select, or just by double-clicking on the variable, e.g. C1 A.
Two-Sample t-test for two groups in different columns (screenshot, page 7)
Click first in the box marked in the screenshot and then double-click on C1 A. Do
the same for B. Select the option marked in the screenshot for the standard t-test.
Calculating Residuals (screenshots, page 8)
The expression in the calculator subtracts the mean of column 1 from each value in
C1 and puts the resulting numbers (residuals) in C3.
Do the same for C2 and put the resulting residuals in C4. You can then cut and paste
the numbers in C3 and C4 and put them into C5; C5 will then contain 22 values.
Name it Residuals by simply typing the name in the Minitab header.
An alternative way of stacking the residuals is shown in the second screenshot: from
the column list (C1 A, C2 B, C3, C4) select C3 and C4, or just type them, then type
and deselect the remaining entries as marked in the screenshot.
Normal Plot (screenshot, page 9)
The options marked in the screenshot are alternative tests for Normality.
Comparing Standard Deviations (Variances) (screenshot, page 10)
Click first in the box marked in the screenshot and then select C1 and C2.
Test for Equal Variances for Speed-B, Speed-A (Minitab output, page 11)
F-Test: Test Statistic 1.58, P-Value 0.483
Levene's Test: Test Statistic 0.53, P-Value 0.477
[Plot: 95% Bonferroni Confidence Intervals for StDevs, Speed-B and Speed-A]
[Plot: Boxplots of the two data columns, Speed-B and Speed-A, on a common Data axis]
The standard F-test assumes data Normality. Levene's is an alternative test that
does not make this assumption.
Extract from Minitab HELP Menu:
Boxplots summarize information about the shape, dispersion, and center of your data.
They can also help you spot outliers.
· The left edge of the box represents the first quartile (Q1), while the right edge
represents the third quartile (Q3). Thus the box portion of the plot represents the
interquartile range (IQR), or the middle 50% of the observations.
· The line drawn through the box represents the median of the data.
· The lines extending from the box are called whiskers. The whiskers extend
outward to indicate the lowest and highest values in the data set (excluding outliers).
· Extreme values, or outliers, are represented by dots. A value is considered an
outlier if it is outside of the box (greater than Q3 or less than Q1) by more than 1.5 times
the IQR.
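The boxplot rules described in the extract can be sketched in a few lines. The Python example below applies them to the Speed B yields, assuming these are the first eleven values listed in Table 1. It uses the default "exclusive" quartile method of Python's statistics.quantiles, which places Q1 at position (n + 1)/4 in the ordered data; whether this matches Minitab's quartile convention exactly is an assumption.

```python
from statistics import quantiles

# Speed B yields (assumed to be the first eleven values listed in Table 1)
speed_b = [76.2, 81.3, 77.0, 79.9, 76.4, 76.2, 77.6, 80.5, 81.5, 77.3, 78.2]

# Quartiles: the edges of the box and the line through it
q1, median, q3 = quantiles(speed_b, n=4)
iqr = q3 - q1

# A value is flagged as an outlier if it lies more than 1.5 * IQR outside the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in speed_b if v < lower_fence or v > upper_fence]

# Whiskers run to the lowest and highest values that are not outliers
inside = [v for v in speed_b if lower_fence <= v <= upper_fence]
whiskers = (min(inside), max(inside))

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr:.1f}")
print(f"whiskers = {whiskers}, outliers = {outliers}")
```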
Note: Bonferroni confidence intervals mean that we are 95% confident that both
intervals simultaneously cover the two population parameters, here σ1 and σ2.
Sample Size for a Comparative Study (screenshot, page 12)
See overleaf for the resulting output.
Power and Sample Size (Minitab output, page 13)
2-Sample t Test
Testing mean 1 = mean 2 (versus not =)
Calculating power for mean 1 = mean 2 + difference
Alpha = 0.05  Assumed standard deviation = 1
            Sample  Target
Difference    Size   Power  Actual Power
       0.5     105    0.95      0.950129
Note that the actual power can be a bit different from that specified as your target
value since the sample size is discrete, i.e., we can choose sample sizes of say, 104,
105 or 106 but not intermediate non-integer values.
Analysis of Variance (screenshot, page 14)
Put the Melons data into four columns, e.g. C5-C8, and then select them so that
they appear in the box marked in the screenshot.
Regression (screenshots, page 15)
Put the data into two columns beforehand. In each of the two dialogs shown, put
Visitor Days in the response (Y) box and Miles of trail in the predictor (X) box, as
marked in the screenshots.
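Generating Random Data (screenshots, pages 16 and 17)
Generate 100 rows into a column, C2 (say). For the chi-square data, enter 7 as the
degrees of freedom.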
Examination-style Question
1.
A first step in processing silicon wafers for integrated circuit devices is to grow an
epitaxial layer on the polished silicon wafers. The manufacturing specifications
for a particular product are that the epitaxial layer should be between 10 µm and
18 µm thick. Two production lines produce these wafers. Samples of 25 wafers
from the two lines gave the results tabulated below, when their epitaxial layers
were measured.
          Mean    Standard Deviation
Line 1    14.3    0.80
Line 2    16.4    0.75
Assume layer thickness is Normally distributed.
(a)
Carry out a statistical test of the hypothesis that the long-run average
thickness of layers produced on line 1 equals the mid-specification value
of 14 µm. Interpret the result of your test.
(5 marks)
(b)
Calculate and interpret a 95% confidence interval for the long-run average
thickness of layers produced on line 2.
(4 marks)
(c)
Carry out a statistical test of the hypothesis that the long-run average
layer thicknesses are the same for lines 1 and 2 and interpret the result of
the test.
(6 marks)
(d)
Calculate and interpret a 95% confidence interval for the difference
between the long-run average layer thicknesses from lines 1 and 2.
(6 marks)
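The summarised-data analyses use only the sample sizes, means and standard deviations. The Python sketch below shows how those summaries enter the one-sample and two-sample calculations. It assumes 25 wafers were sampled from each line, that Line 1 has mean 14.3 and SD 0.80 and Line 2 has mean 16.4 and SD 0.75, and that the pooled (equal-variance) form is used for the comparison. The critical values 2.064 (24 degrees of freedom) and 2.011 (48 degrees of freedom) are taken from t tables, not computed. Interpretation of the results is left to you.

```python
from math import sqrt

n = 25                      # wafers per line (assumed equal for both lines)
mean_1, sd_1 = 14.3, 0.80   # Line 1 summaries (assignment assumed from the table)
mean_2, sd_2 = 16.4, 0.75   # Line 2 summaries
target = 14.0               # mid-specification value

# (a) One-sample t statistic for Line 1 against the target, df = n - 1 = 24
t_one = (mean_1 - target) / (sd_1 / sqrt(n))

# (b) 95% confidence interval for the Line 2 long-run mean
T_CRIT_24 = 2.064  # tabulated value, assumed constant
half_width_2 = T_CRIT_24 * sd_2 / sqrt(n)
ci_line_2 = (mean_2 - half_width_2, mean_2 + half_width_2)

# (c) Pooled two-sample t statistic, df = 2n - 2 = 48
pooled_var = (sd_1 ** 2 + sd_2 ** 2) / 2  # equal sample sizes, so a simple average
se_diff = sqrt(pooled_var * 2 / n)
t_two = (mean_1 - mean_2) / se_diff

# (d) 95% confidence interval for the difference between the long-run means
T_CRIT_48 = 2.011  # tabulated value, assumed constant
diff = mean_1 - mean_2
ci_diff = (diff - T_CRIT_48 * se_diff, diff + T_CRIT_48 * se_diff)

print(f"(a) t = {t_one:.3f} on {n - 1} df")
print(f"(b) 95% CI for Line 2: ({ci_line_2[0]:.2f}, {ci_line_2[1]:.2f})")
print(f"(c) t = {t_two:.2f} on {2 * n - 2} df")
print(f"(d) 95% CI for difference: ({ci_diff[0]:.2f}, {ci_diff[1]:.2f})")
```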
Examination Question: Q1 Hilary, 2008
1.
A study of engineering design factors that affect the handling ability of differently
designed cars involved twelve drivers parallel parking two cars of different
design. The data are shown below; the responses are the times (in seconds) to
complete the parking operation under standardized test conditions. The order in
which the cars were driven was randomized separately for each driver.
Driver   Design 1   Design 2
   1       42.0       17.8
   2       30.7       20.2
   3       21.2       16.8
   4       29.2       41.4
   5       27.1       21.3
   6       38.4       38.5
   7       28.8       16.9
   8       63.2       32.2
   9       38.6       27.7
  10       29.4       23.1
  11       28.3       29.6
  12       26.2       20.5
(a)
A paired t-test gave a t-value of t=2.48. How was this value calculated? What
are the number of degrees of freedom and the critical values for a t-test using a
significance level of 0.05?
Interpret the result of the test. Calculate the
corresponding 95% confidence interval and explain its interpretation in this
context. What model underlies the analysis?
[10 marks]
(b)
A two-sample t-test gave a t-value of t=2.02. How was this value calculated?
What are the number of degrees of freedom and the critical values for a t-test
using a significance level of 0.05? Interpret the result of the test. Calculate the
corresponding 95% confidence interval and explain its interpretation in this
context. What model underlies the analysis?
[10 marks]
(c)
Why is the result of the test carried out for part (b) different from the result of that
for part (a)? Which analysis do you consider more appropriate and why?
[5 marks]