Lab 2. Normal probability plots and scatterplots
www.nmt.edu/~olegm/283labs/Lab2stat.pdf
Note: the menus and other things you will read or type on the computer are in italics. Attach the
printouts whenever needed.
In this lab, we discuss two statistical methods based on scatterplots: the normal
probability plot, and the scatterplot for exploring the relationship between two variables.
1 Normal probability plot
A normal probability plot (n.p.p.) helps check whether a distribution is close to normal, i.e.
has a particular bell shape. (This is similar to the Normal quantile plot discussed in
the book.)
Any deviation from this shape shows up in the n.p.p. as a departure from a straight
line. We can also check the shape informally, e.g. by looking at a histogram. However,
some shapes (like the heavy-tailed distributions below) are hard to spot this way.
To make a normal probability plot, use Graph → Probability plot → Simple; in the
Variables window, select the variable you need. In the problems below, we’ll examine
some data and see how particular features of a distribution are reflected in the n.p.p.
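For readers who want to see what goes into such a plot, here is a minimal Python sketch of the idea: sort the data and pair each value with the standard normal quantile of its rank. The plotting position (i - 0.5)/n, the function name normal_scores, and the sample values are assumptions made for illustration; Minitab may use a different plotting position and axis layout.

```python
from statistics import NormalDist

def normal_scores(data):
    """Return (normal score, observed value) pairs for a normal probability plot."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n  # assumed plotting position for the i-th smallest value
        pairs.append((std_normal.inv_cdf(p), x))
    return pairs

# Hypothetical sample, for illustration only
sample = [4.1, 5.0, 5.3, 5.9, 6.2, 6.8, 7.4]
for z, x in normal_scores(sample):
    print(f"{z:6.2f}  {x:5.1f}")
```

Plotting the observed values against the normal scores gives a roughly straight line when the data are close to normal.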
Problem 1
Make a normal probability plot for the mounting holes problem (corresponds to 1.145 in
the textbook); see holes.txt.
The variable given is the distance between the holes in thousandths of an inch.
Normality here would imply, for example, that the distance is equally likely to
be too short or too long, and that very large errors are rare.
(a) What is the most notable deviation from normality in these data?
(b) How would you propose to fix it?
Problem 2
In cases (a)-(c), describe the departures from normality and how they show up in the
n.p.p.
(a) Heavy-tailed distribution / outliers on both ends. Open the data set Internet.txt
(monthly fees for Internet access in 2000).
Make a histogram and a normal probability plot. Describe the behavior that
you see.
(b) Uniform distribution. Generate 100 “random numbers” from the uniform distribution
on the interval [0,1]: Calc → Random Data → Uniform; Generate 100 rows
of data; store in column C2.
Make a histogram and a normal probability plot. Describe.
(c) A skewed distribution. Open the data set Guinea.txt (survival times for
Guinea pigs).
Make a histogram and a normal probability plot. Describe.
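The random-number step in part (b) can also be reproduced outside Minitab. The sketch below is one possible version, not the lab's method: it draws 100 uniform numbers with Python's random module and prints the standardized smallest and largest values next to the normal scores expected at those ranks. The seed, the variable names, and the plotting position (i - 0.5)/n are assumptions.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(283)  # arbitrary fixed seed so the run is repeatable
c2 = [rng.random() for _ in range(100)]  # plays the role of column C2

# Standardize the sample so it can be compared with standard normal scores
m, s = mean(c2), stdev(c2)
standardized = sorted((x - m) / s for x in c2)

# Normal scores for the smallest and largest of n observations,
# using the assumed plotting position (i - 0.5) / n
n = len(c2)
z_low = NormalDist().inv_cdf(0.5 / n)
z_high = NormalDist().inv_cdf((n - 0.5) / n)

print(f"smallest: observed {standardized[0]:.2f}, normal score {z_low:.2f}")
print(f"largest:  observed {standardized[-1]:.2f}, normal score {z_high:.2f}")
```

Comparing the two numbers on each printed line shows how the ends of the sample sit relative to what a normal sample of the same size would give.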
2 Scatterplots and correlation
Scatterplots describe relationships between pairs of numerical variables.
Data in the file wine.txt describe the relationship between wine consumption
and heart disease death rates (deaths per 100,000 people) for 19 developed nations.
To make a scatterplot, use Graph → Scatterplot → Simple; select wine consumption
into the X cell and heart disease into the Y cell.
Problem 3
Answer the questions below
(a) Are there any outliers?
(b) Clusters of countries?
(c) Is there a linear pattern?
(d) How strong is the relationship?
(e) Italy’s wine consumption is 7.9 (liters of alcohol from wine, per person per year).
What is its heart disease rate?
(f) Compute the correlation coefficient using Stat → Basic Statistics → Correlation,
bringing both variables into the Variables box.
(g) Does it appear that drinking more wine would reduce a person’s risk of heart
disease?¹
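The correlation coefficient asked for in part (f) is the usual Pearson correlation. As a rough illustration of what the menu command computes, here is a small Python sketch. The function name pearson_r and the demo numbers are hypothetical and are not the values in wine.txt.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Sums of cross-products and squared deviations about the means
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical demo values, for illustration only
x_demo = [1.0, 2.5, 3.0, 4.5, 6.0]
y_demo = [250, 220, 200, 150, 110]
print(round(pearson_r(x_demo, y_demo), 3))
```

The result always lies between -1 and 1; its sign gives the direction of the linear association and its magnitude the strength.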
2.1 Effect of linear transformations
We will investigate the effect that linear transformations have on the correlation.
Open the data set Sevilleta.txt. It contains average daily temperatures (in Celsius)
at Sevilleta National Wildlife Refuge for the months of September and October, 2002.
Changing the unit of measurement
For example, to change temperatures from Celsius to Fahrenheit, we need to use the
formula
°F = °C × 1.8 + 32
We will create a new variable called temp_F. Go to Calc → Calculator and type the
arithmetic expression into the Expression window.
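The same unit change can be sketched in Python. The function name and the sample temperatures below are hypothetical; only the formula comes from the lab.

```python
def celsius_to_fahrenheit(temp_c):
    """Apply the linear transformation F = C * 1.8 + 32."""
    return temp_c * 1.8 + 32

# Hypothetical Celsius readings, not the values in Sevilleta.txt
temp_C = [18.5, 21.0, 16.2]
temp_F = [celsius_to_fahrenheit(t) for t in temp_C]
print(temp_F)
```

Each value in the new column is computed from the matching value in the old one, just as the Calculator expression does row by row.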
Problem 4
(a) Make a scatterplot of temp_C versus temp_F. (Do not print.) What is the value
of the correlation coefficient? Explain why it takes this value.
(b) Make a scatterplot of Y= temp_F versus X = Day. What kind of association do
you observe? Describe in words what happens.
(c) Compute the correlations of both temp_C and temp_F with Day. What do you
observe?
¹ About cause and effect, read the discussion at http://www.nmt.edu/~olegm/283labs/SciAmWine.pdf
Problem 5: Exploration
The file DJIret.txt contains the values of returns on the Dow Jones Industrials index
(DJI), where
Return = 100% × (New value − Old value) / Old value
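To make the return formula concrete, the following Python sketch computes percentage returns from a series of index values. DJIret.txt already contains returns, so this only illustrates where such numbers come from; the function name and the index levels are hypothetical.

```python
def percent_returns(values):
    """Percentage return between each pair of consecutive values."""
    return [100.0 * (new - old) / old for old, new in zip(values, values[1:])]

# Hypothetical index levels, for illustration only
levels = [100.0, 110.0, 99.0]
print(percent_returns(levels))
```

A series of n values yields n - 1 returns, one for each step from an old value to a new one.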
(a) Make a histogram and a Normal Probability plot of the data. Would you
describe this distribution as heavy-tailed? symmetric? skewed?² Is the non-normality
easy to spot using the histogram alone?
(b) Does today’s return affect tomorrow’s return? Make a scatterplot of the
return series against itself, shifted by one day. [To obtain the shifted series,
you can simply copy and paste the numbers into a new column, offset by one row.]
Is it possible to predict tomorrow’s return based on today’s return?
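The shifted series in part (b) amounts to pairing each return with the one that follows it. A minimal Python sketch of that pairing is given here; the function name lag_pairs and the sample returns are hypothetical.

```python
def lag_pairs(series, lag=1):
    """Pair each observation with the one `lag` steps later."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return list(zip(series[:-lag], series[lag:]))

# Hypothetical daily returns in percent, for illustration only
returns = [0.4, -1.2, 0.7, 0.1, -0.3]
for today, tomorrow in lag_pairs(returns):
    print(f"today {today:5.1f}   tomorrow {tomorrow:5.1f}")
```

Each pair supplies one point for the scatterplot, with today's return on the X axis and tomorrow's on the Y axis; the last observation has no partner and is dropped.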
² A heavy-tailed distribution of returns is generally cited as an extra source of risk when buying stocks.
Heavy tails mean that large gains or large losses occur more often than a Normal distribution would predict,
whether for individual stocks or, as in this case, for a stock index. For example, returns more than 2σ from
the mean occur more often than the 5% of the time promised by the 95% rule for the Normal distribution.
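One way to check the 2σ remark on a data set is to count the share of observations lying more than two standard deviations from the mean and compare it with the Normal value. The Python sketch below does this under stated assumptions: the function name fraction_beyond and the demo returns are hypothetical, and the sample standard deviation stands in for σ.

```python
from statistics import NormalDist, mean, stdev

def fraction_beyond(data, k=2.0):
    """Share of observations more than k sample standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return sum(1 for x in data if abs(x - m) > k * s) / len(data)

# Exact two-sided Normal tail beyond 2 sigma (the 95% rule rounds this to 5%)
normal_share = 2 * (1 - NormalDist().cdf(2.0))

# Hypothetical returns with one large loss, for illustration only
demo_returns = [0.2, -0.1, 0.3, 0.0, -0.2, 0.1, -0.3, 0.2, 0.1, -4.0]
print(f"observed share beyond 2 sd: {fraction_beyond(demo_returns):.3f}")
print(f"Normal share beyond 2 sd:   {normal_share:.4f}")
```

An observed share noticeably above the Normal share is consistent with heavy tails, although a small sample can give such a result by chance.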