Philip Robbins, 10 Apr 2011, IS6010, Case Study #1

1. GENDER variable: What type of data does GENDER represent?
Nominal Data. The GENDER variable describes data based on a label: male or female.

2. GENDER variable: What does the mean gender of 1.40 tell us?
If Male is coded as 1 and Female is coded as 2, a mean of 1.40 tells us that there are more males than females in our GENDER sample: 60% of the sample is coded 1 and 40% is coded 2.

3. GENDER variable: What would be the appropriate measure of central tendency for gender?
For a nominal variable with only two categories, either the Mean or the Mode can serve as a measure of central tendency. For nominal variables with more than two categories the Mean has no meaning, and the Median is not meaningful for nominal data. The Mode is the more appropriate measure.

4. GENDER variable: What is the value for central tendency?
Mode = 1

5. RANKING variable: What type of data does RANKING represent?
Ordinal Data. The RANKING variable describes order based on the SCORE variable.

6. RANKING variable: What is the appropriate measure of central tendency?
When using an ordinal scale, the central tendency of a group of items can be described by the group's Mode or its Median, but the Mean cannot be defined. In this case the Median is more appropriate.

7. RANKING variable: What is the value for central tendency?
Median = 8.

8. RANKING variable: Would it be appropriate to describe the average ranking? Why or why not?
No. Ordinal data describes order. It does not describe relative size or degree of difference between the data items; thus, a mean of ordinal data such as RANKING has no definition.

9. SCORE variable: What type of data does SCORE represent?
Interval Data. A SCORE variable does not have an absolute zero point.

10. SCORE variable: What is the mean, median and mode of this data set?
Mean = 73.13, Median = 75, Mode = 55.

11. SCORE variable: What does the difference between the mean, median and mode tell you?
The Mean represents the arithmetic average or balance point in a distribution and is the sum of all the elements divided by the number of elements, which is 73.13. The Median, 75, represents the middle element/value when all the SCORE values are ordered from the smallest to the largest value. The Mode represents the data element/value that occurs most frequently. In this case the Mode is 55, which appears 3 times in the case example. The Mean lying slightly below the Median is consistent with a slight negative skew.

12. SCORE variable: Is this data set skewed? If so, in which direction?
Skewness characterizes the degree of asymmetry of a distribution around its mean. The Skewness value for SCORE is -0.065, a negative value indicating a very slightly skewed distribution with an asymmetric tail extending towards more negative values. Normal distributions produce a Skewness statistic of about zero.

13. SCORE variable: What is the range of the data set? How is this determined?
Range = 45. The Range is the difference between the bounds of the data: the lowest SCORE value, 50, subtracted from the highest SCORE value, 95.

14. SCORE variable: What does the kurtosis figure tell you?
Kurtosis is a measure used to describe the distribution of observed data around the mean. A high kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations, and is portrayed by a curve with a sharp peak and heavy tails, whereas a low kurtosis is portrayed by a flatter, more even curve with thin tails. It is sometimes referred to as the "volatility of volatility". SCORE has a Kurtosis of -1.753. Because the value is negative it is an excess kurtosis (measured relative to a normal distribution), so the SCORE distribution is flatter and thinner-tailed than a normal distribution.

15. SCORE variable: Do you think this data is normally distributed? Why?
No. In this case the Skewness value of -0.065 and the Kurtosis of -1.753 indicate a non-normal distribution.
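The SCORE figures quoted in the answers above can be rechecked with a short MATLAB sketch. It uses only base functions and the SCORE values listed in the case study. The skewness and kurtosis lines use bias-adjusted sample formulas; that choice is an assumption, since the case study does not say how its values were computed, and the variable names (se, skew, kurt and so on) are illustrative only.

```matlab
% SCORE values as listed in the case study
score = [95 92 91 90 88 82 80 75 70 60 59 55 55 55 50];
n = numel(score);

m   = mean(score);              % arithmetic mean
med = median(score);            % middle value of the sorted data
mo  = mode(score);              % most frequent value
r   = max(score) - min(score);  % range: highest minus lowest value
v   = var(score);               % sample variance (divides by n-1)
s   = std(score);               % sample standard deviation, sqrt(v)
se  = s / sqrt(n);              % standard error of the mean

% Assumed formulas: bias-adjusted sample skewness and excess kurtosis
z    = (score - m) / s;         % standardized deviations
skew = n / ((n-1)*(n-2)) * sum(z.^3);
kurt = n*(n+1) / ((n-1)*(n-2)*(n-3)) * sum(z.^4) ...
       - 3*(n-1)^2 / ((n-2)*(n-3));

fprintf('mean %.2f, median %g, mode %g, range %g\n', m, med, mo, r);
fprintf('variance %.2f, SD %.2f, SE %.2f\n', v, s, se);
fprintf('skewness %.3f, excess kurtosis %.3f\n', skew, kurt);
```

Under these assumptions the sketch should agree with the quoted Mean, Median, Mode, Range, Skewness and Kurtosis to the precision given. Swapping in the WEIGHT vector performs the same check for the second variable.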
The plotted histogram shows the effect of negative skewness and negative kurtosis on the SCORE distribution. MATLAB code:
>> score = [95 92 91 90 88 82 80 75 70 60 59 55 55 55 50];
>> x=1:1:100;        % bin centers
>> y=score;
>> hist(y,x);shg     % plot the histogram and bring the figure to the front
(Figure: histogram of SCORE values)

16. SCORE variable: What does the standard error tell you?
The standard error of a method of measurement or estimation is the standard deviation of the sampling distribution associated with the estimation method. In the case of the SCORE dataset the standard deviation is 16.24 around a mean of 73.13; the standard error of the mean is the standard deviation divided by the square root of the sample size, 16.24 / sqrt(15) = 4.19.

17. SCORE variable: What is the relationship between the variance and the standard deviation? What do these numbers tell you?
The Standard Deviation (SD) is the square root of the Variance. SD has an advantage in that it is in the same units as the Mean, which makes interpretation easy. Variance is the average of the squared differences from the Mean. Variance is used as a measure of how far a set of numbers are spread out from each other; in this case, SCORE has a Variance of 263.70. Assuming a normal distribution, the standard deviation tells us that 68% of the participants within this case study scored between 56.89 and 89.37 (the Mean of 73.13 plus or minus 16.24).

18. WEIGHT variable: What type of data does WEIGHT represent?
Ratio Data. Ratio data is like Interval data but with the additional property of a true zero point.

19. WEIGHT variable: What is the mean, median, and mode of this data set?
Mean = 144.73, Median = 130, Mode = 108.

20. WEIGHT variable: What does the difference between the mean, median and mode tell you?
The Mean represents the arithmetic average or balance point in the WEIGHT distribution and is the sum of all the elements divided by the number of elements, which is 144.73. The Median, 130, represents the middle element/value when all the WEIGHT values are ordered from the smallest to the largest value. The Mode represents the data element/value that occurs most frequently.
In this case the Mode is 108, which appears 3 times in the case example. The Mean lying above the Median is consistent with a positive skew.

21. WEIGHT variable: Is the data set skewed? If so, in which direction?
Skewness characterizes the degree of asymmetry of a distribution around its mean. The Skewness value for WEIGHT is 0.625, a positive value indicating a skewed distribution with an asymmetric tail extending towards more positive values.

22. WEIGHT variable: What is the range of the data set? How is this determined?
Range = 135. The Range is the difference between the bounds of the data: the lowest WEIGHT value, 90, subtracted from the highest WEIGHT value, 225.

23. WEIGHT variable: What does the kurtosis figure tell you?
Kurtosis is a measure used to describe the distribution of observed data around the mean. A high kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations, and is portrayed by a curve with a sharp peak and heavy tails, whereas a low kurtosis is portrayed by a flatter, more even curve with thin tails. It is sometimes referred to as the "volatility of volatility". WEIGHT has a Kurtosis of -1.037. Because the value is negative it is an excess kurtosis (measured relative to a normal distribution), so the WEIGHT distribution is flatter and thinner-tailed than a normal distribution.

24. WEIGHT variable: Do you think this data is normally distributed? Why?
No. The Skewness and Kurtosis indicate that the distribution is not normal.
WEIGHT Mean = 144.73
WEIGHT Variance = 2168.07
WEIGHT SD = 46.56
WEIGHT Skewness = 0.625
The plotted histogram shows the effect of positive skewness and negative kurtosis on the WEIGHT distribution. MATLAB code:
>> weight = [200 110 103 145 130 180 170 90 102 225 225 108 108 108 167];
>> x=60:1:260;       % bin centers
>> y=weight;
>> hist(y,x);shg     % plot the histogram and bring the figure to the front
(Figure: histogram of WEIGHT values)

25. WEIGHT variable: What does the standard error tell you?
The standard error of a method of measurement or estimation is the standard deviation of the sampling distribution associated with the estimation method.
In the case of the WEIGHT dataset the standard deviation is 46.56 around a mean of 144.73; the standard error of the mean is the standard deviation divided by the square root of the sample size, 46.56 / sqrt(15) = 12.02.

26. WEIGHT variable: What is the relationship between the variance and the standard deviation? What do these numbers tell you?
The Standard Deviation (SD) is the square root of the Variance. SD has an advantage in that it is in the same units as the Mean, which makes interpretation easy. Variance is the average of the squared differences from the Mean. Variance is used as a measure of how far a set of numbers are spread out from each other; in this case, WEIGHT has a Variance of 2168.07. Assuming a normal distribution, the standard deviation tells us that 68% of the participants within this case study weigh between 98.17 and 191.29.

6010 WEEK 1 NOTES ROBBINS

==============================================================
SAMPLE SIZE
==============================================================

Watch out for Sampling Bias: unbiased samples tend to yield more accurate results than biased samples, even if the biased samples are large and the unbiased samples are small.

Increasing sample size increases precision: When you say you have precise results (or a reasonable degree of precision), you are saying the results vary by only a small amount from sample to sample, which will happen if each sample is large.

Watch out for Diminishing returns: At some point the returns (in terms of an increase in precision) diminish to the point that further increases in sample size are of very little benefit.

==============================================================
DESCRIPTIVE STATISTICS
==============================================================

Sampling Methods:
  Random Sampling: each person has an equal chance of being selected.
  Stratified Sampling: a method of sampling from a population divided into subgroups (strata).
    the strata should be mutually exclusive and collectively exhaustive.
    this type of sampling reduces sampling error.
    produces a weighted mean that has less variability than the arithmetic mean of a simple random sample.
  Systematic Sampling: selects every kth element, where k = N/n, N = population size, n = sample size.
  Cluster Sampling: uses natural groupings within a statistical population. used in marketing research.
  Convenience Sampling: population readily available and convenient. not a representative method; used only for pilot testing.

Level of Measurements:
  Nominal: data that consists of names, labels, or categories only.
  Ordinal: describes order, but not relative size or degree of difference between the items measured. scale type or rank order.
  Interval: like ordinal level with the additional property that we can determine meaningful amounts of differences between data.
  Ratio: like interval data but with a true zero point.

Measure of Central Tendency:
  Average: an average, or measure of central tendency, can be a mean, median, or mode. Be specific about which average is meant, especially in scientific research, and identify whether the underlying distribution is skewed.
  Mean: the arithmetic average (balance point in a distribution), computed by adding up a collection of numbers and dividing by their count. It is the value around which the deviations sum to zero. the drawback is that means are drawn in the direction of the skew / extreme scores (outliers).
    * Mean is used with Interval & Ratio data.
  Median: the middle element / value of a set. in situations where outliers dramatically impact the mean, the median can be much more representative of the central tendency of the sample set.
    for odd # = order smallest to largest and the middle value is the median
    for even # = order smallest to largest, sum the two data elements in the middle and divide by 2
    * Median is used with Ordinal, Interval, Ratio data, and also used when a distribution is highly skewed.
  Mode: the data element / value that occurs most frequently. a set can have more than one mode; having two modes is called bimodal.
    having more than two modes is called multimodal.
    * Mode can be used with all data types.
  Range: refers to the extreme values in a data set.
  Standard Deviation: (S, SD for a population / s, sd for a sample) is the measure of how much variability or dispersion there is from the average (mean, or expected value). The smaller the variability is, the smaller the standard deviation is. normality has been observed with great frequency in nature. standard deviation is derived from, and describes the variability of, normal distributions.
  Relationships
  {
    + if a distribution is normal, 68% of the participants in the distribution lie within one standard-deviation unit of the mean.
    + a "narrower curve" is attributed to a lower standard deviation
    + more than half of the observations are within 1 standard deviation of the mean
    + more than 90% of the observations are within 2 standard deviations of the mean
    + most observations fall within 3 standard deviations of the mean
  }

Shapes of Distributions:
  Normal: When very large samples are used, the distribution often takes on a smooth, bell-shaped (normal) curve. (e.g. weights of grains of sand on a beach)
  Positive Skew: distribution that is skewed to the right. (trailing tail is on the right, e.g. income curve)
  Negative Skew: distribution that is skewed to the left. (trailing tail is on the left, e.g. math test results from PhDs)
    "skewed to the left" indicates a "negative skew"
    "skewed to the right" indicates a "positive skew"

The Median and Interquartile Range:
  Whereas the standard deviation measures variability from a mean average, the Range or the Interquartile Range is used to measure variability from the median average.
  Range: the highest value minus the lowest value. the range is based on only the two most extreme values, and extreme values are the least reliable, thus it is considered an unreliable statistic.
  Interquartile Range (IQR): divides a distribution into quarters; the range of the middle 50% is considered the IQR.
    When the median is reported as the measure of central tendency, it is customary to report the IQR as the measure of variability.
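
The median and IQR described above can be illustrated with a minimal MATLAB sketch using the WEIGHT values from the case study. The quartile convention here (Q1 and Q3 taken as the medians of the lower and upper halves, leaving out the middle value when the count is odd) is an assumption; other common conventions give slightly different quartiles. The names w, half, q1, q3 and iqrValue are illustrative only.

```matlab
% WEIGHT values as listed in the case study
weight = [200 110 103 145 130 180 170 90 102 225 225 108 108 108 167];

w    = sort(weight);             % order smallest to largest
n    = numel(w);
med  = median(w);                % middle value of the ordered data
half = floor(n/2);               % size of each half (middle value excluded if n is odd)

% Assumed convention: quartiles are the medians of the two halves
q1 = median(w(1:half));          % lower quartile
q3 = median(w(end-half+1:end));  % upper quartile
iqrValue = q3 - q1;              % range of the middle 50%

fprintf('median %g, Q1 %g, Q3 %g, IQR %g\n', med, q1, q3, iqrValue);
```

Because the IQR ignores the lowest and highest quarters of the data, it is less affected by extreme values than the Range, which is why it pairs naturally with the median.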
