Seminar 15 | Tuesday, October 18, 2007 | Aliaksei Smalianchuk

Means and Variances

What happens to means and variances when data is manipulated? Let’s check by manipulating data from the survey.

Data

Height in inches (HT)
Shoe size (Shoe)
Age (Age)

Additional Columns:

Height with a 1 inch heel (HeightPlus1)
Height in centimeters, approximated as 2.5 times the height in inches (2.5TimesHeight)
Sum of height and shoe size (HeightPlusShoe)
Sum of height and age (HeightPlusAge)

Statistics

Variable          N     Mean     StDev
HT                444   66.928   3.938
Shoe              445   9.1056   1.9484
Age               444   20.371   2.912
HeightPlus1       444   67.928   3.938
2.5TimesHeight    444   167.32   9.84
HeightPlusShoe    444   76.035   5.693
HeightPlusAge     444   87.299   4.913

Observation 1

The mean of heel heights (HeightPlus1, 67.928) is one inch larger than the mean of heights (HT, 66.928).

Why? If the same constant is added to every value, the mean increases by that constant.

Observation 2

The standard deviation of heel heights equals the standard deviation of heights (both 3.938).

Why? Standard deviation measures spread around the mean. Adding a constant shifts every value and the mean by the same amount, so the shape of the distribution doesn’t change.

Observation 3

The standard deviation of heights in centimeters (9.84) is 2.5 times the standard deviation of heights (3.938).

Why? By multiplying all data values by a constant we stretch the spread of the histogram by the same factor, so the properties that depend on the spread (like standard deviation) are multiplied by that factor too.
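Observations 1 to 3 can be checked on any small data set. The sketch below is only an illustration: the six heights are made-up values, not the survey data, and the helper name summarize is an assumption introduced here. It uses Python's standard statistics module rather than Minitab.

```python
import statistics

# Hypothetical heights in inches (made up for illustration, NOT the survey data).
heights = [62.0, 64.5, 66.0, 67.5, 70.0, 72.0]


def summarize(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


# Same manipulations as the slide columns HeightPlus1 and 2.5TimesHeight.
plus_one = [h + 1 for h in heights]
times_2_5 = [2.5 * h for h in heights]

mean_h, sd_h = summarize(heights)
mean_p, sd_p = summarize(plus_one)
mean_t, sd_t = summarize(times_2_5)

print(f"HT              mean={mean_h:.3f}  sd={sd_h:.3f}")
print(f"HeightPlus1     mean={mean_p:.3f}  sd={sd_p:.3f}")
print(f"2.5TimesHeight  mean={mean_t:.3f}  sd={sd_t:.3f}")
```

Adding 1 should move only the mean, while multiplying by 2.5 should multiply both the mean and the standard deviation by 2.5.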
Observation 4

Mean of HeightPlusShoe = Mean of Height + Mean of Shoe

Observation 5

Mean of HeightPlusAge = Mean of Height + Mean of Age

Why? Since [(X1 + Y1) + (X2 + Y2) + … + (Xn + Yn)]/n = (X1 + X2 + … + Xn)/n + (Y1 + Y2 + … + Yn)/n

Variances

Variance = σ²
Variances apply to a probability distribution
Variance is a way to capture the degree of spread of a distribution

Variances

Variable          Variance
HT                15.50784
Shoe              3.796263
Age               8.479744
HeightPlusShoe    32.41025
HeightPlusAge     24.13757

Dependence

Are shoe sizes and heights dependent? Are age and height dependent? Let’s check using scatter plots.

[Scatter plot: Height vs. Shoe Size]

[Scatter plot: Height vs. Age]

Back to variances

Variance of HeightPlusShoe (32.41) is much greater than Var(Height) + Var(Shoe) (about 19.30)
Variance of HeightPlusAge (24.14) is very close to Var(Height) + Var(Age) (about 23.99)

Why? Can you see a difference between the relationships (Height vs. Shoe Size) and (Height vs. Age)?

Dependence

Adding two positively dependent data sets produces extremes (small values are added to corresponding small values and large values to corresponding large values). This makes the variance much larger.

Dependence

In the case of independent sets, values do not necessarily correspond by relative value (large values can be added to small values). The variance of the sum then stays close to the sum of the variances.

Variance of sample mean

Mean = (X1 + X2 + … + Xn)/n

For independent X1, X2, …, Xn:

Variance[(X1 + X2 + … + Xn)/n] = (Variance[X1] + Variance[X2] + … + Variance[Xn])/n²

The divisor is n², not n, because dividing every value by n divides the standard deviation by n and therefore the variance by n².

Dependence?

Would this work for dependent values of X1, X2, …, Xn?
Would the variance produced by this formula be larger or smaller than actual?

Sampling without replacement

Would the variance formula hold true? Why?

Dependence

For positively dependent values, such as height and shoe size, adding the variances gives a smaller result than the actual variance of the sum, because adding such data sets produces extremes and widens the spread.

Sampling without replacement on smaller populations (n < 10) will produce dependence.

The End

Extra Credit (Dr. Pfenning)

Use Minitab Calculator to create column “Birthyear”
Plot Earned vs. Birthyear, note relationship
Create column “EarnedPlusBirthyear”
Find the standard deviations of Earned, Birthyear, and EarnedPlusBirthyear, then square them to get variances
Compare variances
Explain results
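The comparison of Var(X + Y) with Var(X) + Var(Y) from the variance slides can be sketched in Python. The gap between the two is exactly twice the covariance of the columns. Everything in the sketch is an assumption made for illustration: the simulated columns are not the survey data, the coefficients that tie shoe size to height are invented, and the helper name variance_of_sum_parts is hypothetical.

```python
import random
import statistics


def variance_of_sum_parts(xs, ys):
    """Return (Var(X+Y), Var(X)+Var(Y), Cov(X,Y)) using sample (n-1) formulas."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_sum = statistics.variance([x + y for x, y in zip(xs, ys)])
    sum_of_vars = statistics.variance(xs) + statistics.variance(ys)
    return var_sum, sum_of_vars, cov


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers

# Simulated columns (hypothetical, NOT the survey data).
heights = [rng.gauss(67, 4) for _ in range(500)]
# Dependent column: follows height plus a little noise (invented coefficients).
shoes = [0.4 * h - 18 + rng.gauss(0, 1) for h in heights]
# Independent column: generated without looking at height.
ages = [rng.gauss(20, 3) for _ in range(500)]

dep = variance_of_sum_parts(heights, shoes)
indep = variance_of_sum_parts(heights, ages)

for label, (var_sum, sum_of_vars, cov) in (("dependent", dep), ("independent", indep)):
    print(f"{label:12s} Var(X+Y)={var_sum:.3f}  "
          f"Var(X)+Var(Y)={sum_of_vars:.3f}  2*Cov={2 * cov:.3f}")
```

For the dependent pair the variance of the sum should clearly exceed the sum of the variances. For the independent pair the two should be close. Applying the same arithmetic to the slide figures, and assuming all three columns were computed on the same rows, 32.41025 - (15.50784 + 3.796263) is about 13.11, which would be twice the covariance of height and shoe size.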