Math 140 Notes and Activity Packet (Word)
Exploratory Data Analysis (EDA), Correlation & Regression
Go over EDA Notes 1 (PDF online only) before doing EDA Act 1
Math 140 – EDA Activity 1
Using Graphs to Explore Shape, Center, Averages and Outliers
1. Open the men’s and women’s health data on the website www.teachoutcoc.org. Look under the
“Data Sets” tab.
2. Look at the following columns of data for both men and women: age, height, weight, pulse and
body mass index. For each column, use StatCrunch to create a histogram, dot plot and box plot and
find the mean, median, mode, minimum and maximum values. Then answer the following questions.
Save your graphs, sample statistics and answers in a Word document on your flash drive.
a) What are the values trying to measure? What are the units?
b) Look at the graphs and determine the shape of the data set.
c) Look at the graphs and estimate where you think the center or average should be for the
data set. Which is the most accurate average, the mean, median or mode? What is the
average?
d) Use the box plot to determine if there are any outliers in the data set. Remember, these
are values that look far away from most of the data. Do you think these outliers were
mistakes in collecting the data or just “unusual” individual values?
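For readers who want to check StatCrunch's numbers by hand, here is a minimal Python sketch of the statistics in step 2 and of one common box-plot rule for flagging outliers. The pulse values are made up for illustration (they are not the health data), and the 1.5 × IQR fence is an assumed convention. Quartile methods differ between programs, so the fences may not match StatCrunch exactly.

```python
import statistics

# Hypothetical pulse rates (beats per minute); NOT the course health data.
pulse = [60, 64, 64, 68, 68, 68, 72, 72, 76, 80, 124]

mean = statistics.mean(pulse)
median = statistics.median(pulse)
modes = statistics.multimode(pulse)   # a list, since data can have several modes
low, high = min(pulse), max(pulse)

# Quartiles (the "inclusive" method is one of several conventions).
q1, _, q3 = statistics.quantiles(pulse, n=4, method="inclusive")
iqr = q3 - q1

# Assumed box-plot rule: flag values more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in pulse if v < lower_fence or v > upper_fence]

print(round(mean, 1), median, modes, low, high)
print(q1, q3, outliers)
```

In this made-up sample the single large value (124) is flagged as an outlier, and it pulls the mean (about 74.2) above the median (68), which is why the median is usually the better average when outliers or skew are present.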
NOTE: To copy your graphs from StatCrunch and save them in Word, follow these steps:
Click on Options
Click on Copy
Right click and copy the new window
Open a Word document
Press Ctrl+Alt+V (Paste Special) and click on “Device Independent Bitmap”
Save the Word document on your flash drive
Go over EDA Notes 2 (PDF online only) before doing EDA Act 2
Math 140 – EDA Activity 2
Shape, Center, Spread and Typical Values
1. Open the men’s and women’s health data on the website www.teachoutcoc.org. Look under the
“Data Sets” tab.
2. Look at the following columns of data for both men and women: age, height, weight, pulse and
body mass index. For each column, use StatCrunch to create a histogram and find the mean, median,
IQR, standard deviation, Q1 and Q3. Then answer the following questions. Save your graphs, sample
statistics and answers in a Word document on your flash drive.
a) Look at the histogram and determine the shape of the data set.
(You do not need to save the graph.)
b) Based on the shape, which is the most accurate measure of center, the mean or median?
What is the average?
c) Based on the shape, which is the most accurate measure of typical spread, the
standard deviation or IQR?
d) Use the most accurate center and spread to find two numbers that typical values in the
data set will fall in between. (Use the mean + or – standard deviation if bell shaped. Use Q1
and Q3 if skewed.)
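The rule in part d) can be sketched in Python as follows. This is an illustration only: the function name `typical_range` and the height values are hypothetical, the shape of the data is something you judge from the histogram and pass in yourself, and the quartile method may differ slightly from StatCrunch's.

```python
import statistics

def typical_range(values, bell_shaped):
    """Return (low, high) bounds for typical values.

    bell_shaped=True  -> mean - SD to mean + SD
    bell_shaped=False -> Q1 to Q3 (for skewed data)
    """
    if bell_shaped:
        center = statistics.mean(values)
        spread = statistics.stdev(values)   # sample standard deviation
        return center - spread, center + spread
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

# Hypothetical heights in inches; NOT the course health data.
heights = [62, 64, 65, 66, 66, 67, 68, 70]

print(typical_range(heights, bell_shaped=True))
print(typical_range(heights, bell_shaped=False))
```

For these made-up heights the mean is 66 inches and the standard deviation is about 2.45 inches, so the bell-shaped rule gives roughly 63.55 to 68.45 inches, while the skewed rule gives Q1 = 64.75 to Q3 = 67.25 inches.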
Go over EDA Notes 3 (PDF online only) before doing EDA Act 3&4
Math 140 EDA – Activity#3
Exploratory Data Analysis Paragraph
Directions: Use the health data on my website to write a data analysis paragraph for each of the
following data sets for both men and women: age, height, weight, pulse and body mass index.
Use StatCrunch to make a histogram and boxplot and to find the mean, median, min, max, IQR, Q1, Q3,
and standard deviation in order to analyze the shape and outliers. You do not need to save
your histogram and boxplot. Write a data analysis paragraph describing the shape of the data
set, as well as outliers, measures of center and measures of spread. Also include the size of the
data set and the smallest and largest values in the data set. Be sure to include which is the best
measure of center and spread and write a sentence interpreting the meaning of the center and
spread in the context of the data set. Give the most accurate average for the data set and also
give two numbers that typical values fall in between. Also analyze the outliers. Should the
outliers be included in the data set or not, and what was their effect on the shape, center and
spread?
1. Write a data analysis paragraph for women’s ages.
2. Write a data analysis paragraph for men’s ages.
3. Write a data analysis paragraph for women’s height.
4. Write a data analysis paragraph for men’s height.
5. Write a data analysis paragraph for women’s weight.
6. Write a data analysis paragraph for men’s weight.
7. Write a data analysis paragraph for women’s pulse rates.
8. Write a data analysis paragraph for men’s pulse rates.
9. Write a data analysis paragraph for women’s body mass index.
10. Write a data analysis paragraph for men’s body mass index.
Math 140 EDA – Activity#4
Classifying Summary Statistics
1. For each of the following sample statistics, classify it as a measure of spread (variability), a measure
of center (average), or a measure of position. Then write a sentence describing what the statistic tells
us.
a) Mean
b) Standard Deviation
c) Minimum
d) Range
e) Median
f) Quartile 3 (Q3)
g) Interquartile Range (IQR)
h) Maximum
i) Quartile 1 (Q1)
j) Mode
k) Variance
2. List all the measures of center. Which is the most accurate for bell shaped (normal) data sets?
Which is the most accurate for skewed data sets?
3. List all the measures of spread. Which is the most accurate for bell shaped (normal) data sets?
Which is the most accurate for skewed data sets?
4. List all the measures of position.
5. A very important statistic that is not a center, spread or position, is the frequency or sample size.
Write a sentence describing the meaning of the sample size.
6. Use StatCrunch and the bear data to find all of the summary statistics we discussed for the bears’
weights. You need to give the name of the statistic, the number and the units.
EDA review problems on the Sampling/Experiment/EDA review sheet
located on the Sampling and Experiments page.
Go over Correlation Notes (pdf online only), Regression Notes (below in word)
and residual notes (pdf online only) before doing Regression Act 1 & 2
Correlation and Regression Notes
Relationship Hypothesis Tests

Categorical / Categorical Relationship (Chi-Squared Independence Test)
H0: The categorical variables are independent
(show that the distributions of conditional probabilities are the same)
Ha: The categorical variables are dependent
(show that the distributions of conditional probabilities are different)

Categorical / Quantitative Relationship (ANOVA)
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6$
(categorical variable and quantitative variable are independent (not related))
$H_A$: at least one $\mu$ is $\neq$
(categorical variable and quantitative variable are dependent (related))

Quantitative / Quantitative Relationship (Correlation Hypothesis Test) “Regression”
Correlation: See if there is a linear relationship between two quantitative variables.
The study of that relationship is often called “Correlation and Regression”.
Scatterplot: a graph for seeing visually whether there is correlation or not
I. Choosing your variables:
Choose which variable will be x (explanatory variable or independent variable) and
which variable will be y (response variable or dependent variable).

Is one of the variables a natural response variable?
Ex) Year (time) and unemployment rates in U.S. Let explanatory variable x be time
(years) and let the response variable y be unemployment rate. Unemployment
responds to time, but not the other way around.

If the variables respond to each other, pick the response variable to be the one you
are most interested in or may want to make predictions about.
Ex) The unemployment rate in U.S. and the national debt in the U.S.
If you are studying national debt and factors that may be related to the national debt,
then you should make the national debt be your response variable y (and that means
that unemployment rate would be explanatory x).
II. Graphing your data (Scatterplot)

Make ordered pairs from your x and y data (x , y) and create a scatterplot.

StatCrunch: Graph → scatterplot → pick columns for x and y → compute

Correlation Study: see how well ordered pair quantitative data fit a line.
(regression line)

Correlation Coefficient (r): a number between -1 and +1 that measures the strength
and direction of correlation. (Always look at the scatterplot along with the r value. Do not
just look at the r value.)
r close to +1 (r = +0.893) → Strong positive correlation (line going up from left to
right (positive slope) and the points in the scatterplot are close to the line)
r close to -1 (r = -0.916) → Strong negative correlation (line going down from left to
right (negative slope) and the points in the scatterplot are close to the line)
r close to 0 (+0.037 or -0.009) → No linear correlation (but there could still be a nonlinear relationship)
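To see where r comes from, the following Python sketch computes the usual (Pearson) correlation coefficient by hand for three small made-up data sets. The data and the function name `correlation` are illustrative assumptions, not part of the course data or of StatCrunch.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up ordered pairs for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]      # tends to rise with x
y_down = [6, 4, 5, 4, 2]    # tends to fall with x
y_curve = [4, 1, 0, 1, 4]   # perfect U shape (nonlinear)

print(round(correlation(x, y_up), 3))
print(round(correlation(x, y_down), 3))
print(round(correlation(x, y_curve), 3))
```

The first two data sets give r of about +0.853 and -0.853. The U-shaped data gives r = 0 even though y is completely determined by x, which is why the scatterplot should always be checked along with the r value.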
III. R-Squared
(Squaring the correlation coefficient r) R-squared: the percentage of variability in y (response)
that can be explained by the linear relationship with x (explanatory).
Example 1: Rainfall (inches) versus number of car accidents.
Explanatory (x) : Rain
Response(y) : car accidents
r = +0.51 (There is a moderate positive correlation between rain and car accidents.)
Note!!! Correlation is not Causation!! We can NOT say that rain CAUSES car accidents.
R-squared = 0.51^2 = 0.2601 or 26.01%
About 26% of the variability in car accidents can be explained by the relationship with rain.
Confounding variables? Age, condition of car, road conditions, experience, texting,
drinking, drugs.
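The arithmetic in Example 1 can be checked with a few lines of Python. This is only a sketch; the value of r comes from the example, and the variable names are illustrative.

```python
r = 0.51                     # correlation coefficient from Example 1
r_squared = r ** 2           # proportion of variability in y explained by x

print(f"R-squared = {r_squared:.4f} ({r_squared:.2%})")

# Squaring removes the sign, so r = -0.51 would give the same R-squared.
r_squared_negative = (-0.51) ** 2
```

Because the sign is lost when r is squared, R-squared tells you how much variability is explained but not whether the relationship is positive or negative.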
IV. Standard Deviation of the residual errors (s_e)
(two meanings : Average distance from line & prediction error)
1. The average distance that points are from the regression line.
2. If we use the regression line to make a prediction, the standard deviation of the
residuals gives us how much average error we can expect in that prediction.
Residual : How far a point is above or below the regression line.
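As a rough sketch of how this number is computed, the Python below finds each residual, squares and adds them, divides by n - 2, and takes the square root. That is the standard formula for the residual standard deviation of a fitted line. Strictly speaking it is a root-mean-square distance, so "average distance" is an informal reading. The data, the line and the function name `residual_sd` are made up for illustration.

```python
import math

def residual_sd(xs, ys, intercept, slope):
    """Standard deviation of the residuals about the line y = intercept + slope*x."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)          # sum of squared residuals
    return math.sqrt(sse / (len(xs) - 2))         # n - 2 for a fitted line

# Made-up data; y = 1.8 + 0.8x is the least-squares line for these points.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

s_e = residual_sd(x, y, intercept=1.8, slope=0.8)
print(round(s_e, 3))
```

For these five points the residuals are -0.6, 0.6, 0.8, -1.0 and 0.2, and the result is about 0.894 in the units of y.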
Example 2: x: nicotine y: tar
Simple linear regression results:
Dependent Variable: tar (mg)
Independent Variable: nicotine (mg)
tar (mg) = -1.2713139 + 14.207623 nicotine (mg)
Sample size: 29
R (correlation coefficient) = 0.96136857
R-sq = 0.92422952
Estimate of error standard deviation: 1.2983543
r = 0.961 There is a strong positive correlation between nicotine and tar.
r-squared = 0.924 (92.4%) 92.4% of the variability in y (tar) can be explained by the
relationship with x (nicotine).
Are there any confounding variables that might influence the amount of tar? Carbon
monoxide, company, cost of cigarette.
** Important Note: “Correlation ≠ Causation”
Just because there is a relationship (correlation), that does not give you the right to say that
x “causes” y to happen.
Standard deviation of residual errors = 1.298 ≈ 1.3 (same units as y) (mg tar)
The points in the scatterplot are about 1.3 mg from the regression line on average.
If we try to predict tar from the nicotine, we could have an average error of 1.3 mg.
Regression line: y = A + Bx (OLI book)
A is y intercept (where line crosses y axis) starting value
B is slope (average rate of change)
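The notes do not spell out how A and B are found. Assuming the ordinary least-squares method (the standard choice for simple linear regression), a minimal Python sketch looks like this. The data and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (A, B) for the least-squares line y = A + B*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # slope: average change in y per unit of x
    a = mean_y - b * mean_x       # y intercept: predicted y when x = 0
    return a, b

# Made-up ordered pairs for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

A, B = fit_line(x, y)
print(round(A, 2), round(B, 2))
```

For this made-up data the line is y = 1.8 + 0.8x, so y goes up by 0.8 on average for each one-unit increase in x.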
Regression line: x: nicotine, y: tar
Y = -1.2713139 + 14.207623 X
Slope = 14.2
What does that mean?
$\text{Slope} = \frac{\text{Change in } Y}{\text{Change in } X} = \frac{14.2 \text{ mg of tar}}{1 \text{ mg of nicotine}}$
The amount of tar increases by 14.2 mg for every 1 mg of nicotine.
Regression Act 1 – abcd only
Example 3: Health Data
StatCrunch Directions:
Stat → Regression → Simple Linear → pick x and y columns → compute
Is a woman’s age related to her diastolic blood pressure?
Pick x and y (blood pressure responds to age, but age does not respond to blood pressure)
X: (explanatory or independent variable) Woman’s Age
Y: (response or dependent variable) Diastolic Blood Pressure
StatCrunch Printout
Women Diast BP = 47.699875 + 0.59368323 Women Age (years)
Sample size: 40
R (correlation coefficient) = 0.63594161
R-sq = 0.40442173
Estimate of error standard deviation: 9.0898362
a. The scatterplot and r-value show a moderate positive correlation. (r = 0.636)
b. r-squared = 0.404 = 40.4%
So 40.4% of the variability in a woman’s diastolic blood pressure can be explained by the
relationship with woman’s age.
Confounding variables (that influence BP)? Race, ethnicity, stress, genetics, diet.
c. standard deviation of residual errors = 9.1 blood pressure units
Two meanings:
Points in the scatterplot are 9.1 blood pressure units away from the regression line on average.
If we use the regression line to make a prediction (guess a woman’s diastolic blood pressure)
we could have an average error of 9.1 blood pressure units.
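Meaning of the slope of the regression line? 0.59 (rate of change between x and y)
$\text{Slope} = \frac{\text{Change in } Y}{\text{Change in } X} = \frac{0.59 \text{ blood pressure units}}{1 \text{ year}}$
Women’s diastolic blood pressure increases by about 0.59 blood pressure units per year of age, on average.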
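One way to see the slope as a rate of change is to compare predictions from the Example 3 regression line at different ages. The sketch below uses the coefficients from the StatCrunch printout; the function name `predict_diastolic` and the ages 40, 41 and 50 are arbitrary choices for illustration.

```python
def predict_diastolic(age_years):
    """Predicted diastolic blood pressure from the Example 3 regression line."""
    return 47.699875 + 0.59368323 * age_years

# Predictions one year apart differ by exactly the slope.
change_per_year = predict_diastolic(41) - predict_diastolic(40)

# Over ten years the predicted change is ten times the slope.
change_per_decade = predict_diastolic(50) - predict_diastolic(40)

print(round(change_per_year, 2), round(change_per_decade, 2))
```

The predicted difference is about 0.59 units for one year and about 5.94 units for ten years. These are average changes along the line, not a statement about any individual woman.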
(Example 2 continued)
Use the regression line to predict how much tar a cigarette will have if the cigarette has 1.5
mg of nicotine.
Y = -1.2713139 + 14.207623 (1.5) = 20.04 mg of tar!!
How much error in this prediction? The standard deviation of the residual errors is the prediction error. (off
by 1.3 mg on average)
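The same prediction can be reproduced in Python. This is a small sketch using the intercept, slope and error standard deviation from the Example 2 printout; the function name `predict_tar` is made up for illustration.

```python
def predict_tar(nicotine_mg):
    """Predicted tar (mg) from the Example 2 regression line."""
    intercept = -1.2713139
    slope = 14.207623
    return intercept + slope * nicotine_mg

prediction = predict_tar(1.5)
residual_sd = 1.2983543   # "Estimate of error standard deviation" from the printout

print(f"predicted tar: {prediction:.2f} mg, typical error about {residual_sd:.1f} mg")
```

The function can be reused for other nicotine amounts, but predictions are only trustworthy for x values within the range of the original data.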
Histogram of the residuals looks nearly normal (only slight skew).
The residual plot shows a fan (V) shape, so the residuals are not evenly spread out. (Fails the
homoscedasticity requirement.)
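A fan shape means the residuals are more spread out at one end of the x axis than the other. As a rough numerical companion to looking at the residual plot, the sketch below compares the spread of the residuals for the smaller x values with the spread for the larger x values. This split-in-half comparison is an informal heuristic assumed here for illustration, not a formal test and not a StatCrunch feature, and the residuals are made up.

```python
import statistics

def spread_by_half(xs, residuals):
    """Sample SD of the residuals for the lower-x half and the upper-x half."""
    pairs = sorted(zip(xs, residuals))          # order the points by x
    half = len(pairs) // 2
    left = [e for _, e in pairs[:half]]
    right = [e for _, e in pairs[half:]]
    return statistics.stdev(left), statistics.stdev(right)

# Made-up residuals that widen as x grows (a "fan" shape).
x = [1, 2, 3, 4, 5, 6, 7, 8]
fan = [0.1, -0.2, 0.3, -0.4, 1.5, -2.0, 2.5, -3.0]

left_sd, right_sd = spread_by_half(x, fan)
print(round(left_sd, 2), round(right_sd, 2))
```

Here the right-hand spread (about 2.66) is many times the left-hand spread (about 0.31), matching what a fan-shaped residual plot would show. Roughly equal values would be consistent with evenly spread residuals.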
Math 140 Regression Activity#1
Correlation and Regression with Statcrunch
Directions: Your goal today is to explore relationships between quantitative variables using StatCrunch.
For each of the following pairs of variables, decide which should be the explanatory variable and which
should be the response variable. Go to the “Stat” menu, and click on “Regression”, then “Simple
Linear”. Put in the columns for the explanatory (x) and the response (y). Click on Fitted line plot,
Residuals versus x variable, and a Histogram of the residuals. Save the scatterplot, the Residuals versus x
variable plot, and the Histogram of the residuals in a Word document, along with the r value, r-squared, the standard
deviation of the residuals, and the equation of the regression line.
a) Use the r value and scatterplot to interpret the strength and direction of the linear
relationship. Do the variables have a weak, moderate or strong linear relationship (correlation), a nonlinear (curved) relationship, or no relationship at all?
b) Write a sentence to explain the r-squared value. Were there any confounding variables that
might influence the response variable other than the explanatory variable?
c) Write two sentences to explain the standard deviation of the residuals.
d) Give the regression line formula and write a sentence to explain the slope.
e) A residual is the vertical distance each point in the scatterplot is from the regression line.
Look at the histogram of the residuals. We like the histogram to be nearly normal (close to bell shaped).
What is the shape of the histogram of the residuals?
f) The graph of the residuals versus the x variable is often called a “residual plot”. In a scatterplot, it is often difficult to see
how far points are from the regression line. Think of the residual plot graph as putting the points in the
scatterplot under a magnifying glass. You can see the distances better. The residual plot should show
dots that are evenly spaced from the line. This is called “Homoscedasticity”. It should not be “fan”
shaped. If it is fan shaped it fails the homoscedasticity requirement. Is the residual plot fan shaped or
evenly spaced?
1. Open the health data. Explore the relationship between a man’s age and cholesterol.
2. Open the health data. Explore the relationship between a man’s height and weight.
3. Open the health data. Explore the relationship between a man’s weight and BMI.
4. Open the health data. Explore the relationship between a man’s systolic blood pressure and diastolic
blood pressure.
5. Open the bear data. Explore the relationship between age and weight.
6. Open the bear data. Explore the relationship between chest size and neck size.
7. Open the bear data. Explore the relationship between skull length and skull width.
Math 140 Regression Activity#2
Using Regression lines to make Predictions
1. Open the cigarette data. Let the explanatory variable represent the amount of nicotine and
the response variable represent the amount of tar. Find the equation of the regression line and
the standard error. Use the equation to predict the amount of tar if a cigarette contains 1.2 mg
of nicotine. How much error might there be in that prediction?
2. Open the cigarette data. Let the explanatory variable represent the amount of nicotine and
the response variable represent the amount of carbon monoxide (CO). Find the equation of the
regression line and the standard error. Use the equation to predict the amount of CO if a
cigarette contains 1.2 mg of nicotine. How much error might there be in that prediction?
3. Open the women’s health data. Let the explanatory variable represent the systolic blood
pressure and the response variable represent the diastolic blood pressure. Find the equation of
the regression line and the standard error. Use the equation to predict the diastolic blood
pressure of a person who has a systolic blood pressure of 130. How much error might there be
in that prediction?
4. Open the bear data. Let the explanatory variable represent the age of the bear in months
and the response variable represent the length of the bear in inches. Find the equation of the
regression line and the standard error. Use the equation to predict the length of a bear that is
24 months old. How much error might there be in that prediction?
5. Open the bear data. Let the explanatory variable represent the neck circumference of the
bear and the response variable represent the weight of the bear in pounds. Find the equation
of the regression line and the standard error. Use the equation to predict the weight of a bear
that has a neck circumference of 24 inches. How much error might there be in that prediction?