Download Lecture 2 - West Virginia University

Lecture 2
STAT 211 – 019
Dan Piett
West Virginia University

Last Lecture
• Population/Sample
• Variable Types
  • Discrete/Continuous Numeric & Ranked/Unranked Categorical
• Displaying Small Sets of Numbers
  • Dot Plots, Stem and Leaf, Pie Charts
• Histograms
  • Frequency/Density and Symmetric vs Right/Left Skewed
• Measures of Center
  • Mean/Median

Overview
• 2.3 Measures of Dispersion
• 2.5 Boxplots
• 3.1 Scatterplots
• 3.2 Correlation
• 3.3 Regression

Section 2.3 Measures of Dispersion

Descriptive Statistics
• Describing the Data
• How do we describe data?
  • Graphs (Last Class)
  • Measures
    • Center (Last Class)
      • Mean/Median
    • Dispersion/Spread (This Class)
      • Variance, Standard Deviation, IQR

Spread of Data
• Example: Spread
  • Data 1: 8, 8, 9, 9, 10, 11, 11, 12, 12
  • Data 2: -30, -20, -10, 0, 10, 20, 30, 40, 50
  • Data 1 – Mean = Median = 10
  • Data 2 – Mean = Median = 10
• Both have the same measure of center but how do they differ?
  • Data 2 is much more spread out.

Sample Standard Deviation
• Sample Standard Deviation (S) is a measure of how spread out the data is
• S can be any number >= 0
• Larger S indicates a larger spread
• Unit Associated with S is the same unit as the variable
  • Example: Mean of 110 lb, Standard Deviation 10 lb
• The square of the sample standard deviation is called the sample variance

Standard Deviation Example
• Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
  • S = 1.58
• Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
  • S = 27.39
• As you can see, the standard deviation of Data 2 is much larger than Data 1.
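The two S values above can be checked numerically. The following is a minimal sketch, assuming Python and its standard `statistics` module (neither is part of the lecture); the helper name `describe` is hypothetical. Note that `statistics.variance` and `statistics.stdev` compute the sample versions, which divide by n - 1.

```python
from statistics import mean, median, stdev, variance

# The two data sets from the Spread of Data example.
data_1 = [8, 8, 9, 9, 10, 11, 11, 12, 12]
data_2 = [-30, -20, -10, 0, 10, 20, 30, 40, 50]


def describe(data):
    """Return (mean, median, sample variance, sample standard deviation)."""
    return mean(data), median(data), variance(data), stdev(data)


for name, data in (("Data 1", data_1), ("Data 2", data_2)):
    m, med, var, s = describe(data)
    print(f"{name}: mean={m:.2f}, median={med:.2f}, variance={var:.2f}, S={s:.2f}")

# Data 1: mean=10.00, median=10.00, variance=2.50, S=1.58
# Data 2: mean=10.00, median=10.00, variance=750.00, S=27.39
```

Both data sets share the same center, so only the variance and S columns reveal how differently they are spread out.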
Population Variance/Standard Deviation
• Much like the sample mean (xbar) estimates the population mean (mu), the sample variance (s^2) and sample standard deviation (s) can be used to estimate the true population variance (sigma^2) and standard deviation (sigma)

Linear Transformations and Changes of Scale
• By adding or subtracting a constant to every value in a data set
  • The mean is increased/decreased by the same amount
  • The median is increased/decreased by the same amount
  • The standard deviation is unchanged
• By multiplying each value by a constant
  • The mean is multiplied by the same amount
  • The median is multiplied by the same amount
  • The standard deviation is multiplied by the absolute value of that constant (the same amount when the constant is positive)

Section 2.5 Boxplots

Quartiles
• Quartiles are numbers which partition the data into 4 subgroups (i.e., 4 quarters in a dollar)
• Q1
  • The value separating the lowest 25% of the data values from the rest
• Q2, a.k.a. Median
  • The value separating the lowest 50% of the data values from the rest
• Q3
  • The value separating the lowest 75% of the data values from the rest
• Q4, a.k.a. Maximum
  • The largest data value

Quartiles Example
• You can think of Q1 as the median of the bottom half of the data and Q3 as the median of the top half of the data

Interquartile Range (IQR)
• The IQR is another measure of spread, much like S.
• A larger IQR indicates more spread-out data
• IQR is calculated as Q3 - Q1
• Example
  • Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
    • IQR = 11.5 - 8.5 = 3
  • Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
    • IQR = 35 - (-15) = 50

Boxplots
• Boxplots are a graphical representation of the quartiles.

Using IQR to Find Potential Outliers
• One method to find potential outliers is as follows:
  1. Find the IQR
  2. Add 1.5*IQR to Q3
     • Anything larger than this value can be flagged as a potential outlier
  3. Likewise, subtract 1.5*IQR from Q1
     • Anything smaller than this value can be flagged as a potential outlier
• Example
  • Data 1 (8, 8, 9, 9, 10, 11, 11, 12, 12)
    • 1.5*IQR = 4.5, so the cutoffs are 8.5 - 4.5 = 4 and 11.5 + 4.5 = 16; no value falls outside them, so there are no potential outliers
  • Data 2 (-30, -20, -10, 0, 10, 20, 30, 40, 50)
    • 1.5*IQR = 75, so the cutoffs are -15 - 75 = -90 and 35 + 75 = 110; no value falls outside them, so there are no potential outliers

Section 3.1 Scatterplots

Bivariate Data
• Bivariate data is data consisting of two variables measured on the same individual
• Examples
  • Height and Weight
  • Classes skipped and GPA
• Graphed using a scatterplot

Scatterplot Example

Section 3.2 Correlation

Pearson Correlation Coefficient
• We have discussed ways to describe data of one variable. This section will discuss how to describe two variables on the same individual together.
• The correlation coefficient, r, is a measure of the strength of a linear (straight line) relationship between bivariate data. (You will not need to know the formula for r)
• To say two variables are correlated is to say that an increase/decrease in one corresponds to an increase/decrease in the other.

More on r
• r can take on values between -1 and 1
• The strength of the correlation depends on how close r is to the extreme values of -1 or 1
  • r = -.78 is a stronger correlation than r = .50
• There are three types of correlation
  • Positive
  • Negative
  • No Correlation

Positive Correlation
• Positive correlation exists when r is between 0 and 1.
• The closer r is to 1, the stronger the relationship
• This implies that as one of the variables increases, the other one tends to increase as well.
• Examples:
  • Height and Weight, Temperature and Ice Cream Sales

Negative Correlation
• Negative correlation exists when r is between -1 and 0.
• The closer r is to -1, the stronger the relationship
• This implies that as one of the variables increases, the other one tends to decrease.
• Example:
  • Temperature and Hot Chocolate Sales

No Correlation
• No correlation exists when r is approximately 0
• This implies that as one of the variables increases, the other one tends neither to increase nor to decrease
• Example:
  • Temperature and Cookie Sales

Interpretation of r
• Although we may find that two variables are correlated, this does not necessarily mean that there is a causal relationship.
• Example:
  • High school teachers who are paid less tend to have students who do better on the SATs than teachers who are paid more. It has been found that there is a negative correlation between teacher salary and students' SAT scores. Therefore we should pay our teachers less so students score higher.
  • Clearly this is not a causal relationship. There is likely a third variable that explains this. One possibility may be the age of the teacher.

Section 3.3 Regression

Regression Intro
• Once we have decided that two variables are correlated, we can use the value of one of the variables, "x", to predict the value of the other variable, "y".
• Example:
  • Use height (x) to predict weight (y)
  • Use temperature (x) to predict ice cream sales (y)

Regression Equation
• Calculating a Regression Equation Given the slope and intercept
• Plotting a Regression Line

Notes on Regression Lines
• Residuals
  • A residual is the signed vertical distance between a point (observed y-value) and the regression line (predicted y-value)
  • Formula: Observed Value – Predicted Value
• Using the Cholesterol Example:
  • For TV Hours = 3, our predicted value was 212.2
  • The actual value on the graph is 220.
  • The residual for this particular point is 220 - 212.2 = 7.8
  • A residual may be positive or negative
  • The interpretation is that the observed y-value is 7.8 units larger than the predicted y-value for TV Hours = 3
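The residual calculation above can be written out in a few lines. This is a minimal sketch in Python (an assumption; the lecture itself uses no programming language). The function name `residual` is hypothetical, and the two numbers are the ones quoted from the cholesterol example.

```python
def residual(observed, predicted):
    """Residual = observed y-value minus predicted y-value."""
    return observed - predicted


# Values quoted in the cholesterol example at TV Hours = 3.
observed_y = 220
predicted_y = 212.2

r = residual(observed_y, predicted_y)
print(f"residual = {r:.1f}")  # prints "residual = 7.8"

# The sign shows which side of the regression line the point falls on.
if r > 0:
    print("observed value lies above the regression line")
elif r < 0:
    print("observed value lies below the regression line")
else:
    print("observed value lies on the regression line")
```

A positive residual, as here, means the regression line under-predicted the observed value; a negative one means it over-predicted.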
