The Math Part of the Course…

Measures of Central Tendency

Mode: the value with the highest frequency in a dataset
Median: the middle value in an ordered dataset
Mean: the arithmetic average of the dataset

When to use each:
Mode: good for non-numerical data and for identifying the most frequent value
Median: when an outlier would significantly influence the mean, use the median
Mean: when the data have no likely outliers, use the mean

Measures of Dispersion

Range: the spread of values in a dataset (describes the extremes around the typical case)
Standard deviation: shows how much variation there is from the mean. A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data are spread out over a large range of values.

Population standard deviation formula: σ = √( Σ(x − μ)² / N )
Sample standard deviation formula: s = √( Σ(x − x̄)² / (n − 1) )

Solving for population standard deviation, assume the dataset: 1, 8, 14, 29, 46

Step one: solve for μ: μ = (1 + 8 + 14 + 29 + 46) / 5 = 98 / 5 = 19.6

Step two: solve for Σ(x − μ)²:

x      μ       x − μ     (x − μ)²
1      19.6    −18.6      345.96
8      19.6    −11.6      134.56
14     19.6     −5.6       31.36
29     19.6      9.4       88.36
46     19.6     26.4      696.96
                  Σ =    1297.20

Step three: solve the final equation: σ = √(1297.20 / 5) = √259.44 ≈ 16.11

The Normal Distribution

Say μ = 2 and σ = 1/3 in a normal distribution. [Graph: normal curve, μ = 2, σ = 1/3]

The following graph represents the same information, but it has been standardized so that μ = 0 and σ = 1: [Graph: standard normal curve, μ = 0, σ = 1]

The two graphs have different μ and σ, but they have the same shape (if we tweak the axes). The new distribution of the normal random variable Z with mean 0 and variance 1 (or standard deviation 1) is called the standard normal distribution. Standardizing the distribution like this makes it much easier to calculate probabilities.
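As a quick check, the population standard deviation from the worked example above can be reproduced with Python's standard library:

```python
import math
import statistics

data = [1, 8, 14, 29, 46]

mu = sum(data) / len(data)                        # step one: mean = 19.6
squared_devs = [(x - mu) ** 2 for x in data]      # step two: squared deviations
sigma = math.sqrt(sum(squared_devs) / len(data))  # step three: population formula

print(round(sum(squared_devs), 2))  # 1297.2
print(round(sigma, 2))              # 16.11

# statistics.pstdev implements the same population formula
assert math.isclose(sigma, statistics.pstdev(data))
```

Note that `statistics.stdev` would instead give the sample version, dividing by n − 1 rather than n.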
Considering our example above where μ = 2 and σ = 1/3:
One-half standard deviation = σ/2 = 1/6, and
Two standard deviations = 2σ = 2/3

If we have mean μ and standard deviation σ, then the standardized value is z = (x − μ) / σ.

Since all the values of X falling between x1 and x2 have corresponding Z values between z1 and z2, the area under the X curve between X = x1 and X = x2 equals the area under the Z curve between Z = z1 and Z = z2. Hence, we have the following equivalent probabilities:

P(x1 < X < x2) = P(z1 < Z < z2)

So the region from ½ s.d. to 2 s.d. to the right of μ = 2 is the area from x1 = 2 + 1/6 ≈ 2.17 to x2 = 2 + 2/3 ≈ 2.67. [Graph: shaded region under the curve, μ = 2, σ = 1/3]

This area is exactly the same as the area between z1 = 0.5 and z2 = 2 in the standard normal curve. [Graph: shaded region under the standard normal curve, μ = 0, σ = 1]

Finding the Area Under the Normal Curve

In the standard normal curve, the mean is 0 and the standard deviation is 1. The green shaded area in the diagram represents the area between z = 0 and z = 1.45, i.e., within 1.45 standard deviations above the mean. The area of this shaded portion is 0.4265 (or 42.65% of the total area under the curve).

To get this area of 0.4265, we read down the left side of the table for the first two digits of the z-value (the whole number and the first number after the decimal point, in this case 1.4), then we read across the table for the "0.05" part (the top row represents the second decimal place of the z-value we are interested in):

z      0.00     0.01     0.02     0.03     0.04     0.05     0.06
1.4    0.4192   0.4207   0.4222   0.4236   0.4251   0.4265   0.4279

We have: (left column) 1.4 + (top row) 0.05 = 1.45 standard deviations. The area represented by 1.45 standard deviations to the right of the mean is shaded in green in the standard normal curve above. You can see how to find the value of 0.4265 in the full z-table below: follow the "1.4" row across and the "0.05" column down until they meet at 0.4265.
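Instead of reading the table, the area between z = 0 and z = 1.45 can be computed from the error function in Python's math module (Φ below is the standard normal cumulative distribution function):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Area between the mean (z = 0) and z = 1.45
area = phi(1.45) - phi(0)
print(round(area, 4))  # 0.4265
```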
z      0.00     0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09
0.0    0.0000   0.0040   0.0080   0.0120   0.0160   0.0199   0.0239   0.0279   0.0319   0.0359
0.1    0.0398   0.0438   0.0478   0.0517   0.0557   0.0596   0.0636   0.0675   0.0714   0.0753
0.2    0.0793   0.0832   0.0871   0.0910   0.0948   0.0987   0.1026   0.1064   0.1103   0.1141
0.3    0.1179   0.1217   0.1255   0.1293   0.1331   0.1368   0.1406   0.1443   0.1480   0.1517
0.4    0.1554   0.1591   0.1628   0.1664   0.1700   0.1736   0.1772   0.1808   0.1844   0.1879
0.5    0.1915   0.1950   0.1985   0.2019   0.2054   0.2088   0.2123   0.2157   0.2190   0.2224
0.6    0.2257   0.2291   0.2324   0.2357   0.2389   0.2422   0.2454   0.2486   0.2517   0.2549
0.7    0.2580   0.2611   0.2642   0.2673   0.2704   0.2734   0.2764   0.2794   0.2823   0.2852
0.8    0.2881   0.2910   0.2939   0.2967   0.2995   0.3023   0.3051   0.3078   0.3106   0.3133
0.9    0.3159   0.3186   0.3212   0.3238   0.3264   0.3289   0.3315   0.3340   0.3365   0.3389
1.0    0.3413   0.3438   0.3461   0.3485   0.3508   0.3531   0.3554   0.3577   0.3599   0.3621
1.1    0.3643   0.3665   0.3686   0.3708   0.3729   0.3749   0.3770   0.3790   0.3810   0.3830
1.2    0.3849   0.3869   0.3888   0.3907   0.3925   0.3944   0.3962   0.3980   0.3997   0.4015
1.3    0.4032   0.4049   0.4066   0.4082   0.4099   0.4115   0.4131   0.4147   0.4162   0.4177
1.4    0.4192   0.4207   0.4222   0.4236   0.4251   0.4265   0.4279   0.4292   0.4306   0.4319
1.5    0.4332   0.4345   0.4357   0.4370   0.4382   0.4394   0.4406   0.4418   0.4429   0.4441
1.6    0.4452   0.4463   0.4474   0.4484   0.4495   0.4505   0.4515   0.4525   0.4535   0.4545
1.7    0.4554   0.4564   0.4573   0.4582   0.4591   0.4599   0.4608   0.4616   0.4625   0.4633
1.8    0.4641   0.4649   0.4656   0.4664   0.4671   0.4678   0.4686   0.4693   0.4699   0.4706
1.9    0.4713   0.4719   0.4726   0.4732   0.4738   0.4744   0.4750   0.4756   0.4761   0.4767
2.0    0.4772   0.4778   0.4783   0.4788   0.4793   0.4798   0.4803   0.4808   0.4812   0.4817
2.1    0.4821   0.4826   0.4830   0.4834   0.4838   0.4842   0.4846   0.4850   0.4854   0.4857
2.2    0.4861   0.4864   0.4868   0.4871   0.4875   0.4878   0.4881   0.4884   0.4887   0.4890
2.3    0.4893   0.4896   0.4898   0.4901   0.4904   0.4906   0.4909   0.4911   0.4913   0.4916
2.4    0.4918   0.4920   0.4922   0.4925   0.4927   0.4929   0.4931   0.4932   0.4934   0.4936
2.5    0.4938   0.4940   0.4941   0.4943   0.4945   0.4946   0.4948   0.4949   0.4951   0.4952
2.6    0.4953   0.4955   0.4956   0.4957   0.4959   0.4960   0.4961   0.4962   0.4963   0.4964
2.7    0.4965   0.4966   0.4967   0.4968   0.4969   0.4970   0.4971   0.4972   0.4973   0.4974
2.8    0.4974   0.4975   0.4976   0.4977   0.4977   0.4978   0.4979   0.4979   0.4980   0.4981
2.9    0.4981   0.4982   0.4982   0.4983   0.4984   0.4984   0.4985   0.4985   0.4986   0.4986
3.0    0.4987   0.4987   0.4987   0.4988   0.4988   0.4989   0.4989   0.4989   0.4990   0.4990
3.1    0.4990   0.4991   0.4991   0.4991   0.4992   0.4992   0.4992   0.4992   0.4993   0.4993
3.2    0.4993   0.4993   0.4994   0.4994   0.4994   0.4994   0.4994   0.4995   0.4995   0.4995
3.3    0.4995   0.4995   0.4995   0.4996   0.4996   0.4996   0.4996   0.4996   0.4996   0.4997
3.4    0.4997   0.4997   0.4997   0.4997   0.4997   0.4997   0.4997   0.4997   0.4997   0.4998
3.5    0.4998   0.4998   0.4998   0.4998   0.4998   0.4998   0.4998   0.4998   0.4998   0.4998
3.6    0.4998   0.4998   0.4999   0.4999   0.4999   0.4999   0.4999   0.4999   0.4999   0.4999
3.7    0.4999   0.4999   0.4999   0.4999   0.4999   0.4999   0.4999   0.4999   0.4999   0.4999

Find the area under the standard normal curve for the following, using the z-table. Sketch each one.
(a) between z = 0 and z = 0.78
(b) between z = -0.56 and z = 0
(c) between z = -0.43 and z = 0.78
(d) between z = 0.44 and z = 1.50
(e) to the right of z = -1.33

Answers:
(a) 0.2823
(b) 0.2123 (by symmetry, the area from z = -0.56 to 0 equals the area from 0 to 0.56)
(c) 0.1664 + 0.2823 = 0.4487
(d) 0.4332 - 0.1700 = 0.2632
(e) 0.4082 + 0.5 = 0.9082

It was found that the mean length of 100 parts produced by a lathe was 20.05 mm with a standard deviation of 0.02 mm. Find the probability that a part selected at random would have a length
(a) between 20.03 mm and 20.08 mm
(b) between 20.06 mm and 20.07 mm
(c) less than 20.01 mm

Let X = length of part.

(a) 20.03 is 1 standard deviation below the mean; 20.08 is 1.5 standard deviations above the mean.
P(20.03 < X < 20.08) = P(-1 < Z < 1.5) = .3413 + .4332 = .7745
So the probability is 0.7745.

(b) 20.06 is 0.5 standard deviations above the mean; 20.07 is 1 standard deviation above the mean.
P(20.06 < X < 20.07) = P(.5 < Z < 1) = .3413 - .1915 = .1498
So the probability is 0.1498.

(c) 20.01 is 2 s.d. below the mean.
P(X < 20.01) = P(Z < -2) = .5 - .4772 = .0228
So the probability is 0.0228.
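The lathe answers can be checked the same way, converting each length to a z-score and using an error-function-based normal CDF (any small differences from the table answers are just table rounding):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 20.05, 0.02
z = lambda x: (x - mu) / sigma

# (a) P(20.03 < X < 20.08) = P(-1 < Z < 1.5)
print(round(phi(z(20.08)) - phi(z(20.03)), 4))  # 0.7745
# (b) P(20.06 < X < 20.07) = P(0.5 < Z < 1)
print(round(phi(z(20.07)) - phi(z(20.06)), 4))  # 0.1499 (table rounding gives .1498)
# (c) P(X < 20.01) = P(Z < -2)
print(round(phi(z(20.01)), 4))                  # 0.0228
```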
A company pays its employees an average wage of $3.25 an hour with a standard deviation of 60 cents. If the wages are approximately normally distributed, determine
a. the proportion of the workers getting wages between $2.75 and $3.69 an hour;
b. the minimum wage of the highest 5%.

Let X = wage.

(a) z1 = (2.75 - 3.25)/0.60 = -.833 and z2 = (3.69 - 3.25)/0.60 = .733, so
P(2.75 < X < 3.69) = P(-.833 < Z < .733) = .298 + .268 = .566
So about 56.6% of the workers have wages between $2.75 and $3.69 an hour.

(b) Let X = the minimum wage of the highest 5%. The z-value cutting off the top 5% is z = 1.645 (from the table). Then
(X - 3.25)/0.60 = 1.645
X - 3.25 = .987
X = 4.237
So the minimum wage of the top 5% of salaries is $4.24.

The average life of a certain type of motor is 10 years, with a standard deviation of 2 years. If the manufacturer is willing to replace only 3% of the motors that fail, how long a guarantee should he offer? Assume that the lives of the motors follow a normal distribution.

Let X = life of motor and x = guarantee period. [Graph: normal curve, μ = 10, σ = 2]

We need to find the value (in years) that will give us the bottom 3% of the distribution. These are the motors that we are willing to replace under the guarantee.

P(X < x) = 0.03

The area that we can find from the z-table is 0.5 - 0.03 = 0.47, and the corresponding z-score is z = -1.88. Since z = (x - μ)/σ, we can write:

(x - 10)/2 = -1.88

Solving this gives x = 10 - 3.76 = 6.24. So the guarantee period should be 6.24 years.

Measures of Association

Example data:

                Monkey Favorability Rating
Age Group      Low   Medium   High
<12              4        8     20
12-24            6        9      8
>24             18        9      3

Lambda: an asymmetrical measure of association — the value varies depending on which variable is treated as independent. Ranges from 0 to 1.
Formula: λ = (E1 - E2) / E1

1. Calculate row and column totals.

                Monkey Favorability Rating
Age Group      Low   Medium   High   Total
<12              4        8     20      32
12-24            6        9      8      23
>24             18        9      3      30
Total           28       26     31      85

2. Calculate E1: find the mode of the dependent variable (the attribute that occurs most often) and subtract it from N (the sample size).
E1 = N - f of the mode
E1 = 85 - 31 = 54

3. Calculate E2: find the mode within each category of the independent variable (here, each age-group row).
Subtract each mode from its category total and add the results together:
E2 = (category total - category mode), summed over all categories of the independent variable
E2 = (32 - 20) + (23 - 9) + (30 - 18) = 12 + 14 + 12 = 38

4. Find lambda: λ = (E1 - E2) / E1 = (54 - 38) / 54 ≈ .30

Interpretation: thirty percent of the errors in predicting monkey favorability can be reduced by taking the respondent's age into account.

Gamma:
• A measure of association for ordinal variables.
• A symmetrical measure, so you don't need to specify the IV and DV.
• Compares pairs of observations that are concordant (ordered in the same direction) and discordant (ordered in opposite directions).
• Ranges from -1 to +1.
• Formula: γ = (Ns - Nd) / (Ns + Nd), where Ns = count of same-order (concordant) pairs and Nd = count of inverse-order (discordant) pairs.

                Monkey Favorability Rating
Age Group      Low   Medium   High
<12              4        8     20
12-24            6        9      8
>24             18        9      3

To find Ns: starting from the top-left cell, multiply each cell frequency by the sum of all cells below it and to its right.
Ns = 4(9 + 8 + 9 + 3) + 8(8 + 3) + 6(9 + 3) + 9(3)
Ns = 116 + 88 + 72 + 27 = 303

To find Nd: starting from the bottom-left cell, multiply each cell frequency by the sum of all cells above it and to its right.
Nd = 18(9 + 8 + 8 + 20) + 9(8 + 20) + 6(8 + 20) + 9(20)
Nd = 810 + 252 + 168 + 180 = 1410

γ = (Ns - Nd) / (Ns + Nd) = (303 - 1410) / (303 + 1410) = -1107 / 1713 ≈ -.65

Interpret: using age to predict monkey favorability results in a proportional reduction of error of about 65%. There is an inverse or negative relationship: as age increases, favorability of monkeys decreases.

Chi-Square: chi-square is a statistical test commonly used to compare observed data with the data we would expect to obtain under a specific hypothesis. For example, if, according to Mendel's laws, you expected 10 of 20 offspring from a cross to be male and the actual observed number was 8 males, then you might want to know about the "goodness of fit" between the observed and expected counts. Were the deviations (differences between observed and expected) the result of chance, or were they due to other factors?
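The lambda and gamma calculations above can be reproduced in a few lines of Python (a sketch; the table is hard-coded with age groups as rows and favorability as columns, as in the example):

```python
# Rows = age groups (<12, 12-24, >24); columns = favorability (Low, Medium, High)
table = [[4, 8, 20],
         [6, 9, 8],
         [18, 9, 3]]

n = sum(map(sum, table))                    # 85
row_totals = [sum(r) for r in table]        # 32, 23, 30
col_totals = [sum(c) for c in zip(*table)]  # 28, 26, 31

# Lambda, with favorability as the dependent variable
e1 = n - max(col_totals)                    # 85 - 31 = 54
e2 = sum(t - max(row) for t, row in zip(row_totals, table))  # 12 + 14 + 12 = 38
lam = (e1 - e2) / e1                        # 16 / 54, about 0.30

# Gamma: count concordant (ns) and discordant (nd) pairs
ns = nd = 0
for i in range(3):
    for j in range(3):
        below = range(i + 1, 3)
        ns += table[i][j] * sum(table[a][b] for a in below for b in range(j + 1, 3))
        nd += table[i][j] * sum(table[a][b] for a in below for b in range(j))
gamma = (ns - nd) / (ns + nd)               # (303 - 1410) / 1713, about -0.65

print(round(lam, 2), ns, nd, round(gamma, 2))
```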
How much deviation can occur before you, the investigator, must conclude that something other than chance is at work, causing the observed to differ from the expected? The chi-square test is always testing what scientists call the null hypothesis, which states that there is no significant difference between the expected and observed results.

                Monkey Favorability Rating
Age Group      Low   Medium   High   Total
<12              4        8     20      32
12-24            6        9      8      23
>24             18        9      3      30
Total           28       26     31      85

Hypotheses: H0: age and favorability are independent; H1: age and favorability are related.

First step: calculate the expected value of each cell. Our null hypothesis is that age has no bearing on favorability of monkeys; under the null, the distribution of favorability within each age group would be the same. To calculate the expected value of a cell:

E = (row total × column total) / N

For example, the expected value for the (<12, Low) cell is (32 × 28) / 85 = 10.54. Expected values are shown in parentheses:

                Monkey Favorability Rating
Age Group      Low         Medium      High         Total
<12            4 (10.54)   8 (9.79)    20 (11.67)     32
12-24          6 (7.58)    9 (7.04)    8 (8.39)       23
>24            18 (9.88)   9 (9.18)    3 (10.94)      30
Total          28          26          31             85

Second step: calculate the chi-square value. Formula: χ² = Σ (O - E)² / E

χ² = (4-10.54)²/10.54 + (8-9.79)²/9.79 + (20-11.67)²/11.67
   + (6-7.58)²/7.58 + (9-7.04)²/7.04 + (8-8.39)²/8.39
   + (18-9.88)²/9.88 + (9-9.18)²/9.18 + (3-10.94)²/10.94
   = 23.66

Third step: determine the critical value.

       Significance level
df     .10       .05       .025      .01       .005
1      2.7055    3.8415    5.0239    6.6349    7.8794
2      4.6052    5.9915    7.3778    9.2104    10.5965
3      6.2514    7.8147    9.3484    11.3449   12.8381
4      7.7794    9.4877    11.1433   13.2767   14.8602
5      9.2363    11.0705   12.8325   15.0863   16.7496
6      10.6446   12.5916   14.4494   16.8119   18.5475

To use this table, we need to first determine our level of significance. For the purposes of this class, let's always work on the assumption that we want 95% confidence (α = .05). Next, we need to figure out our degrees of freedom: df = (rows - 1)(columns - 1) = (3 - 1)(3 - 1) = 4. As a result, our critical value for α = .05 at df = 4 is 9.4877.

Fourth step: compare the calculated chi-square value with the critical value.
Chi-square calculated: 23.66; chi-square critical: 9.49
Since 23.66 > 9.49, we REJECT the null. We can conclude that monkey favorability and age are related in some way.

Two Sample T-Test

Purpose: to compare responses from two groups.
These two groups can come from different experimental treatments, or from different natural "populations".

Assumptions:
• each group is considered to be a sample from a distinct population
• the responses in each group are independent of those in the other group
• the distributions of the variable of interest are normal

In a test of the hypothesis that females smile at others more than males, females and males were videotaped while interacting and the number of smiles emitted was recorded. Using the following number of smiles in the 5-minute interaction, test the null hypothesis that there is no gender difference in the number of smiles.

Males:   8, 11, 13, 4, 2
Females: 15, 19, 13, 11, 18

Step One: Calculate the means of each group.
x̄1 (males) = (8 + 11 + 13 + 4 + 2) / 5 = 7.6
x̄2 (females) = (15 + 19 + 13 + 11 + 18) / 5 = 15.2

Step Two: Solve for the variances of the two samples, using s² = Σ(x − x̄)² / (n − 1).

Males:
x      x̄      x − x̄    (x − x̄)²
8      7.6     0.4       0.16
11     7.6     3.4      11.56
13     7.6     5.4      29.16
4      7.6    −3.6      12.96
2      7.6    −5.6      31.36
                Σ =     85.20,  so s1² = 85.2 / 4 = 21.3

Females:
x      x̄      x − x̄    (x − x̄)²
15     15.2   −0.2       0.04
19     15.2    3.8      14.44
13     15.2   −2.2       4.84
11     15.2   −4.2      17.64
18     15.2    2.8       7.84
                Σ =     44.80,  so s2² = 44.8 / 4 = 11.2

Step Three: Solve for t.

t = (x̄2 − x̄1) / √(s1²/n1 + s2²/n2) = (15.2 − 7.6) / √(21.3/5 + 11.2/5) = 7.6 / √6.5 ≈ 2.98

Step Four: Compare the calculated t-value with the critical t-value.

To determine the critical t-value, we first need to determine the degrees of freedom (df). With t-tests, df = n1 + n2 − 2 = 5 + 5 − 2 = 8. At 95% confidence (α = .05, two-tailed), the critical t-value is consequently 2.306.

       Confidence level (two-tailed)
df     50%     60%     70%     80%     90%     95%     98%     99%     99.5%   99.8%   99.9%
1      1.000   1.376   1.963   3.078   6.314   12.71   31.82   63.66   127.3   318.3   636.6
2      0.816   1.061   1.386   1.886   2.920   4.303   6.965   9.925   14.09   22.33   31.60
3      0.765   0.978   1.250   1.638   2.353   3.182   4.541   5.841   7.453   10.21   12.92
4      0.741   0.941   1.190   1.533   2.132   2.776   3.747   4.604   5.598   7.173   8.610
5      0.727   0.920   1.156   1.476   2.015   2.571   3.365   4.032   4.773   5.893   6.869
6      0.718   0.906   1.134   1.440   1.943   2.447   3.143   3.707   4.317   5.208   5.959
7      0.711   0.896   1.119   1.415   1.895   2.365   2.998   3.499   4.029   4.785   5.408
8      0.706   0.889   1.108   1.397   1.860   2.306   2.896   3.355   3.833   4.501   5.041
9      0.703   0.883   1.100   1.383   1.833   2.262   2.821   3.250   3.690   4.297   4.781
10     0.700   0.879   1.093   1.372   1.812   2.228   2.764   3.169   3.581   4.144   4.587
11     0.697   0.876   1.088   1.363   1.796   2.201   2.718   3.106   3.497   4.025   4.437
12     0.695   0.873   1.083   1.356   1.782   2.179   2.681   3.055   3.428   3.930   4.318
13     0.694   0.870   1.079   1.350   1.771   2.160   2.650   3.012   3.372   3.852   4.221
14     0.692   0.868   1.076   1.345   1.761   2.145   2.624   2.977   3.326   3.787   4.140
15     0.691   0.866   1.074   1.341   1.753   2.131   2.602   2.947   3.286   3.733   4.073
16     0.690   0.865   1.071   1.337   1.746   2.120   2.583   2.921   3.252   3.686   4.015

t-score calculated: 2.98; t-score critical: 2.306
Since 2.98 > 2.306, we REJECT the null. We can conclude that gender and smiling are related in some way.

Regression

Regression is a tool for describing how, how strongly, and under what conditions an independent and a dependent variable are associated. It can be used to make causal inferences. The ordinary least squares regression formula is Y = a + bX and describes a line:
– Y = dependent variable
– a = y-intercept (or constant)
– b = slope or coefficient
– X = independent variable
If b is positive, the relationship is positive; if b is negative, the relationship is negative.

Interpreting Regression

Data are gathered on 40 countries to study variations in birth rate. Consider this equation:

Y = 32 - .0018X     r = -.78     SEb = .00024

where Y = birth rate per 1000 population and X = per capita income.

Identify the following: independent and dependent variables; regression coefficient; the constant; the correlation coefficient; the coefficient of determination; the standard error of the slope.
IV: per capita income
DV: birth rate per 1000 population
Regression coefficient: -.0018 (for every increase of $1 in per capita income, the predicted birth rate per 1000 population drops by .0018)
Constant: 32 (the predicted value of Y would be 32 if X = 0)
Correlation coefficient: -.78 (there is a strong, negative relationship)
Coefficient of determination: .6084 (-.78 × -.78)
Standard error of the slope: .00024

What percent of the variation in birth rate is associated with per capita income? 60.84% (r² = -.78 × -.78 = .6084)

What is the direction of the relationship? Negative.

Calculate the t-ratio. What does this tell you? It allows us to test the hypothesis that b = 0. t = b / SEb = -.0018 / .00024 = -7.5. df = n - 2 = 38, and the critical t-value at 95% confidence and df = 38 is 2.024. Since |-7.5| > 2.024, we REJECT the null. We can conclude that per capita income and birth rate are related in some way.

A country has a per capita income of $2000. Estimate its birth rate.
Y = 32 - .0018X
Y = 32 - .0018(2000)
Y = 32 - 3.6
Y = 28.4
28.4 births per 1000 population.

Interpreting Multiple Regression

Regression Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .638a   .407       .403                19.469
a. Predictors: (Constant), ZZ11. PRE IWR OBS: R gender, Y6. Employment status, J1. Party ID: Does R think of self as Dem, Rep, Ind or what, Y1x. Age of Respondent, Y3. Highest grade of school or year of college R completed, C5ax. SUMMARY: R better/worse off than 1 year ago, F1ax. SUMMARY: economy better/worse in last year, Y21a. Household income

R-Square is the proportion of variance in the dependent variable which can be predicted from the independent variables. This value indicates that 41% of the variance in the dependent variable can be predicted from the independent variables. Note that this is an overall measure of the strength of association, and does not reflect the extent to which any particular independent variable is associated with the dependent variable.
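The bivariate birth-rate example above can be verified numerically (a sketch; the numbers come straight from the regression output):

```python
# Birth-rate equation: Y = 32 - .0018 X
a, b, se_b = 32.0, -0.0018, 0.00024

# Predicted birth rate for a per capita income of $2000
y_hat = a + b * 2000
print(round(y_hat, 1))  # 28.4

# t-ratio for testing H0: b = 0
t = b / se_b
print(round(t, 1))      # -7.5 (|t| > 2.024, so reject the null)
```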
ANOVA(b)

Model 1        Sum of Squares   df     Mean Square   F         Sig.
Regression     352041.587       8      44005.198     116.098   .000a
Residual       513212.737       1354   379.035
Total          865254.324       1362
a. Predictors: (Constant), ZZ11. PRE IWR OBS: R gender, Y6. Employment status, J1. Party ID: Does R think of self as Dem, Rep, Ind or what, Y1x. Age of Respondent, Y3. Highest grade of school or year of college R completed, C5ax. SUMMARY: R better/worse off than 1 year ago, F1ax. SUMMARY: economy better/worse in last year, Y21a. Household income
b. Dependent Variable: B1j. Feeling Thermometer: Republican Party

The F value is the Mean Square Regression divided by the Mean Square Residual: F = 44005.198 / 379.035 = 116.098. The p value associated with this F value is very small (0.0000). These values are used to answer the question "Do the independent variables reliably predict the dependent variable?". The p value is compared to your alpha level (typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable". You could say that the group of independent variables can be used to reliably predict the dependent variable. If the p value were greater than 0.05, you would say that the group of independent variables does not show a significant relationship with the dependent variable, or that the group of independent variables does not reliably predict the dependent variable. Note that this is an overall significance test assessing whether the group of independent variables, when used together, reliably predicts the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable. The ability of each individual independent variable to predict the dependent variable is addressed in the table below, where each of the individual variables is listed.

Coefficients(a)

                                                     Unstandardized        Standardized
Model 1                                              B         Std. Error  Beta     t        Sig.
(Constant)                                           62.215    3.569                17.430   .000
C5ax. SUMMARY: R better/worse off than 1 year ago    .418      .432        .021     .966     .334
F1ax. SUMMARY: economy better/worse in last year     3.763     .743        .113     5.062    .000
J1. Party ID: Does R think of self as Dem, Rep,
    Ind or what                                      7.393     .271        .601     27.269   .000
Y1x. Age of Respondent                               .087      .034        .054     2.546    .011
Y3. Highest grade of school or year of college
    R completed                                      -.632     .243        -.062    -2.601   .009
Y6. Employment status                                -1.772    2.398       -.016    -.739    .460
Y21a. Household income                               .018      .106        .004     .169     .865
ZZ11. PRE IWR OBS: R gender                          -2.877    1.072       -.057    -2.684   .007
a. Dependent Variable: B1j. Feeling Thermometer: Republican Party

Feeling Thermometer Republican Party = 62.215 + .418 Better/Worse Off + 3.763 Economy + 7.393 PartyID + .087 Age - .632 Education - 1.772 Unemployed + .018 Income - 2.877 Gender

(B) These estimates tell you about the relationship between the independent variables and the dependent variable. Each estimate gives the amount of increase in Feeling Thermometer Republican that would be predicted by a 1-unit increase in that predictor.

(Beta) These are the values for a regression equation in which all of the variables have been standardized to have a mean of zero and a standard deviation of one. Because the standardized variables are all expressed in the same units, the magnitudes of the standardized coefficients indicate which variables have the greatest effects on the predicted value. This is not necessarily true of the unstandardized coefficients: because their magnitudes largely depend on the units of the variables, the effect of each variable on the prediction can be difficult to gauge. While the standardized coefficients may differ considerably from the unstandardized coefficients in magnitude, the sign (positive or negative) of each coefficient is unchanged.

(t, Sig.) These columns provide the t value and two-tailed p value used in testing the null hypothesis that the coefficient is 0. Coefficients having p values less than alpha are significant.
For example, if you chose alpha to be 0.05, coefficients having a p value of 0.05 or less would be statistically significant (i.e., you can reject the null hypothesis and say that the coefficient is significantly different from 0).
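As a closing worked check, the two-sample t statistic from the smiling study earlier can be computed the same way these coefficient t values are, by dividing an estimate by its standard error (a sketch using only the standard library):

```python
import statistics

males = [8, 11, 13, 4, 2]
females = [15, 19, 13, 11, 18]

m1, m2 = statistics.mean(males), statistics.mean(females)          # 7.6, 15.2
v1, v2 = statistics.variance(males), statistics.variance(females)  # 21.3, 11.2

# t = difference in means / standard error of the difference
se = (v1 / len(males) + v2 / len(females)) ** 0.5
t = (m2 - m1) / se
print(round(t, 2))  # 2.98 (> critical 2.306 at df = 8, so reject the null)
```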