Chapter 2: Describing & Exploring Data

As mentioned in the introduction, the purpose of using statistics in research is to aid in the summary of data. Two primary methods are used to accomplish such a summary: graphical and numerical.

Graphical methods summarize data visually via graphs, charts, and tables.
• frequency distributions
• histograms
• stem-and-leaf displays
• boxplots
• (dotplots)
• scatterplots

Numerical methods summarize data using numbers to describe various characteristics or trends within the distribution.
• measures of central tendency
• measures of variability
• measures of association

Chapter 2: Example Data Set

Consider the following data (Chap2Ex.sav). If I were to tell you that these are test scores, you would have a difficult time summarizing them by simply eyeballing the data.

GRE SCORES (the person ID equals the observation number, so scores are listed in observation order)

Obs 1-25:   680 492 540 392 724 438 551 491 441 503 426 475 569 420 426 420 534 470 365 543 631 643 458 661 394
Obs 26-50:  405 595 539 492 622 437 436 466 492 597 378 618 466 609 486 364 267 459 565 540 453 587 408 628 355
Obs 51-75:  535 494 377 515 416 490 414 576 505 693 465 661 398 473 518 418 547 346 578 586 535 534 394 446 557
Obs 76-100: 572 638 401 416 580 638 401 437 488 574 590 459 489 604 400 495 650 438 369 455 405 252 417 482 630

Chapter 2: Frequency Distributions

One first step in summarizing data is to create a frequency distribution showing
the number of observations for each observed data value. (In my experience, frequency distributions usually group the data, and counts are provided to show numbers or percentages of observations within various ranges.) Sometimes frequency distributions include cumulative frequencies so that you can determine the counts and percentages of observations at or below any given data point or range. Percentile ranks are determined this way for educational tests. Alternatively, you might be interested in depicting fewer score points than percentiles. For example, a common quantile (AKA fractile) is the quartile.
• Quartiles: The data values at or below which fall 25% (Q1), 50% (Q2 = median), and 75% (Q3) of the values, corresponding to the first, second, and third quartiles. The fourth quartile is the maximum value.

Chapter 2: Frequency Procedure

Frequency procedure: In SPSS, frequency distributions for specified variables can be obtained by choosing the following from the menu bar: Analyze -> Descriptive Statistics -> Frequencies. In the pop-up dialog box, select GRE in the left box and move it to the right box by clicking the arrow (►) button. This procedure generates a frequency distribution table for the variable GRE.
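Outside SPSS, the same kind of table can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure: the function name `frequency_table` is made up for this sketch, and the data are the 18 values from the stem-and-leaf example in this chapter rather than the GRE scores. Statistical packages interpolate quartiles in different ways, so the quartiles printed here need not agree with SPSS.

```python
from collections import Counter
from statistics import median, quantiles

# Illustrative data: the 18 values from the stem-and-leaf example in this chapter.
scores = [11, 15, 18, 21, 22, 22, 23, 25, 28, 30,
          31, 32, 33, 33, 33, 34, 45, 51]


def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / n,
                     100 * cumulative / n))
    return rows


table = frequency_table(scores)
for value, freq, pct, cum_pct in table:
    print(f"{value:>5} {freq:>4} {pct:6.1f} {cum_pct:6.1f}")

# Quartiles via the standard library's default ("exclusive") method;
# other software may interpolate differently.
q1, q2, q3 = quantiles(scores, n=4)
print("Q1, Q2, Q3:", q1, q2, q3)
```

The cumulative percent column is what percentile ranks are read from: the last row always reaches 100.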

Chapter 2: Frequency Table

GRE
Value   Frequency   Percent   Valid Percent   Cumulative Percent
252         1         1.0         1.0               1.0
267         1         1.0         1.0               2.0
346         1         1.0         1.0               3.0
355         1         1.0         1.0               4.0
364         1         1.0         1.0               5.0
365         1         1.0         1.0               6.0
369         1         1.0         1.0               7.0
377         1         1.0         1.0               8.0
378         1         1.0         1.0               9.0
392         1         1.0         1.0              10.0
394         2         2.0         2.0              12.0
398         1         1.0         1.0              13.0
400         1         1.0         1.0              14.0
401         2         2.0         2.0              16.0
405         2         2.0         2.0              18.0
408         1         1.0         1.0              19.0
414         1         1.0         1.0              20.0
416         2         2.0         2.0              22.0
417         1         1.0         1.0              23.0
418         1         1.0         1.0              24.0
420         2         2.0         2.0              26.0
426         2         2.0         2.0              28.0
436         1         1.0         1.0              29.0
437         2         2.0         2.0              31.0
438         2         2.0         2.0              33.0
441         1         1.0         1.0              34.0
446         1         1.0         1.0              35.0
453         1         1.0         1.0              36.0
455         1         1.0         1.0              37.0
458         1         1.0         1.0              38.0
459         2         2.0         2.0              40.0
465         1         1.0         1.0              41.0
466         2         2.0         2.0              43.0
470         1         1.0         1.0              44.0
473         1         1.0         1.0              45.0
475         1         1.0         1.0              46.0
482         1         1.0         1.0              47.0
486         1         1.0         1.0              48.0
488         1         1.0         1.0              49.0
489         1         1.0         1.0              50.0
490         1         1.0         1.0              51.0
491         1         1.0         1.0              52.0
492         3         3.0         3.0              55.0
494         1         1.0         1.0              56.0
495         1         1.0         1.0              57.0
503         1         1.0         1.0              58.0
505         1         1.0         1.0              59.0
515         1         1.0         1.0              60.0
518         1         1.0         1.0              61.0
534         2         2.0         2.0              63.0
535         2         2.0         2.0              65.0
539         1         1.0         1.0              66.0
540         2         2.0         2.0              68.0
543         1         1.0         1.0              69.0
547         1         1.0         1.0              70.0
551         1         1.0         1.0              71.0
557         1         1.0         1.0              72.0
565         1         1.0         1.0              73.0
569         1         1.0         1.0              74.0
572         1         1.0         1.0              75.0
574         1         1.0         1.0              76.0
576         1         1.0         1.0              77.0
578         1         1.0         1.0              78.0
580         1         1.0         1.0              79.0
586         1         1.0         1.0              80.0
587         1         1.0         1.0              81.0
590         1         1.0         1.0              82.0
595         1         1.0         1.0              83.0
597         1         1.0         1.0              84.0
604         1         1.0         1.0              85.0
609         1         1.0         1.0              86.0
618         1         1.0         1.0              87.0
622         1         1.0         1.0              88.0
628         1         1.0         1.0              89.0
630         1         1.0         1.0              90.0
631         1         1.0         1.0              91.0
638         2         2.0         2.0              93.0
643         1         1.0         1.0              94.0
650         1         1.0         1.0              95.0
661         2         2.0         2.0              97.0
680         1         1.0         1.0              98.0
693         1         1.0         1.0              99.0
724         1         1.0         1.0             100.0
Total     100       100.0       100.0

The slides mark Q1, Q2 (median), and Q3 in this table; they fall where the cumulative percent reaches 25, 50, and 75, respectively.

Chapter 2: Histograms

The graphic version of a frequency distribution is a histogram
(also known as a bar chart*). We examine frequency distributions and histograms for each variable of interest to get an impression of the overall shape of the data and to see whether there are outliers in the data. Both of these features may influence your choice of a statistic for summarizing your data.

[Figure: Histogram of Scores; x-axis: Score, y-axis: Frequency]

*The difference between a histogram and a bar chart is that the histogram is for continuous variables and the bar chart is for discrete (categorical) variables. Thus, to create a histogram, we need to decide on the number of class intervals. SPSS does this for you, but it is generally recommended to use from 5 to 20 class intervals.

Chapter 2: Distribution Shapes

As mentioned previously, graphic representations of data help us to get an idea of the shape of the data and to identify outliers, because these characteristics may influence the choice of summary statistic. There are a variety of terms used to describe the shape of data.
• Normal: Unimodal (one distinct peak), symmetrical (i.e., you can draw a line through the center such that each side is nearly a mirror image of the other), and bell-shaped (peaked in the middle and tapering in the tails). There is a mathematical formula for a truly normal distribution.
• Bimodal: Having two distinct peaks (definitely NOT "normal").
• Skewed: Having most data points cluster tightly with a few data points being extreme. Distributions may be negatively skewed (having a few data points in the extreme low direction) or positively skewed (having a few data points in the extreme high direction). The mnemonic is that you can draw an arrow on the tail of the distribution, and the skew is in the direction of the arrow.
[Figure: a negatively (left) skewed distribution and a positively (right) skewed distribution]

• Kurtosis: The relative concentration of data points in the center, shoulders, and tails of the distribution.
  • Mesokurtic: A description of the relative concentration of data points in a normal distribution.
  • Platykurtic: A distribution that tends to have fewer data points in the center and tails (and more in the shoulders) than a normal distribution. Remember that plateaus are flat.
  • Leptokurtic: A distribution that tends to have more data points in the center and tails (and fewer in the shoulders) than a normal distribution. Remember that to "leap" is to jump up: the distribution jumps up in the middle.

[Figure: a platykurtic curve (negative kurtosis) and a leptokurtic curve (positive kurtosis), each overlaid on a normal distribution]

Chapter 2: Chart Procedure

The chart procedure: The chart procedure is an SPSS procedure that requests bar charts or histograms for specific variables. In the data screen of SPSS, choose Graphs -> Histogram from the menu bar, and move the variable GRE from the left box to the variable box at the upper right by clicking the right arrow in the middle. (Another way to create a histogram: after Analyze -> Descriptive Statistics -> Frequencies, move GRE to the right box, click the Charts button, choose Histogram, and click Continue.) This action tells SPSS that we want to generate a plot of the data. In this example, we request charts for the variable GRE.

[Figure: Histogram for GRE scores; x-axis: GRE, y-axis: Frequency; Mean = 497.02, Std. Dev. = 95.325, N = 100]

Q. How many class intervals does this histogram have, and what is the width of each class interval (AKA bin size)?
A.
19 class intervals; the width of each is 200/8 = 25.

Sometimes it is a good idea to overlay the normal curve, because the normal distribution can serve as a reference distribution that is unimodal, symmetric (skewness = 0), and mesokurtic (kurtosis = 0). We can do this by selecting the Display normal curve option (or With normal curve in the Charts options of Frequencies).

[Figure: Histogram for GRE scores with the normal curve overlaid; x-axis: GRE, y-axis: Frequency; Mean = 497.02, Std. Dev. = 95.325, N = 100. Prompts on the slide: Modality? Skewness? Kurtosis?]

Sometimes it is hard to judge these characteristics of the shape of the distribution by eyeballing the chart. In this case, we get the values by choosing the Statistics option in the dialog box of the Frequencies procedure. Choose Mean, Median, and Mode from the Central Tendency section, and Skewness and Kurtosis from the Distribution section.

Statistics: GRE
N Valid                    100
N Missing                    0
Mean                    497.02
Median                  489.50
Mode                       492
Skewness                  .113
Std. Error of Skewness    .241
Kurtosis                 -.387
Std. Error of Kurtosis    .478

Note: A rule of thumb for skewness and kurtosis:
• Absolute value less than 1: slightly skewed (or kurtotic)
• Absolute value between 1 and 2 (i.e., -2 to -1, or 1 to 2): moderately
• Absolute value greater than 2 (i.e., less than -2 or larger than 2): heavily

There is also a relationship among the mean $\bar{X}$, the median Md, and the mode Mo for a given distribution if the distribution is unimodal.
(a) When the distribution is symmetric, $\bar{X} = Md = Mo$.
(b) When the distribution is negatively skewed, $\bar{X} < Md < Mo$.
(c) When the distribution is positively skewed, $Mo < Md < \bar{X}$.

Chapter 2: Summation Notation

For simplicity of expression, we use symbols to represent various concepts in statistics.

Variables: The codes (often numerical codes) we use to describe the constructs we're interested in. Variables are indicated by upper-case letters (X, Y). Individual values are represented using subscripts ($X_i$, $Y_j$).
Summation: We frequently need to add a series of observations for a variable. The Greek upper-case sigma ($\Sigma$) is used to symbolize this. For example,

\[ \sum_{i=1}^{N} X_i \]

is read as "the sum of the values of X ranging from 1 to N."
• $\Sigma$ stands for "the sum of".
• $X$ stands for the variable we sum.
• $i$, referred to as a subscripting index, stands for the individual values of X.
• $N$ stands for the highest value we sum across (usually the number of cases).

N could be replaced by a number, but we usually use a letter like N to indicate that we're summing across all values of X (i.e., there are N values of the X variable).

Some examples. Say the data are these pretest scores: 8, 7, 5, 10.
• $X_1$ would be the score of the first person in the data set. Here $X_1 = 8$. The first score is not necessarily the largest (or smallest) score, because we don't assume the scores are ordered.
• $X_i$ is the "ith" score; you select the value of i you are interested in. If i = 3, $X_3 = 5$. If i = N, then here N = 4, so $X_4 = 10$.
• Saying "$X_i$, for i = 1 to N" means "the set of all N scores".

Frequently, it is clear that we want to sum all values of X, so we can simply write

\[ \sum X \;\Big(= \sum_{i=1}^{N} X_i\Big), \]

which equals $(X_1 + X_2 + X_3 + \dots + X_N)$. That is, we omit all the subscripts, which Howell does. But in my opinion, it is always a good idea to keep them, because they will help you.

Other common summations are:
• The sum of the squared values of X:
\[ \sum_{i=1}^{N} X_i^2 = X_1^2 + X_2^2 + X_3^2 + \dots + X_N^2 \]
• The square of the sum of X:
\[ \Big(\sum_{i=1}^{N} X_i\Big)^2 = (X_1 + X_2 + X_3 + \dots + X_N)^2 \]

Note that $\sum_{i=1}^{N} X_i^2 \neq \big(\sum_{i=1}^{N} X_i\big)^2$. (Check this with the example above: $64 + 49 + 25 + 100 = 238$, whereas $30^2 = 900$.) In general, you perform the operations within parentheses before performing the operations outside of the parentheses.

• The sum of X added to a constant C:
\[ \sum (X + C) = (X_1 + C) + (X_2 + C) + (X_3 + C) + \dots + (X_N + C) = \sum X + NC \]
• The sum of X multiplied by a constant C:
\[ \sum XC = X_1 C + X_2 C + X_3 C + \dots + X_N C = C \sum X \]
• The sum of the products of matched pairs of X and Y:
\[ \sum XY = X_1 Y_1 + X_2 Y_2 + X_3 Y_3 + \dots + X_N Y_N \]
• The sum of a difference between X and Y:
\[ \sum (X - Y) = (X_1 - Y_1) + (X_2 - Y_2) + (X_3 - Y_3) + \dots + (X_N - Y_N) = \sum X - \sum Y \]
• Note that $\sum XY = \sum X_i Y_i \neq \sum X_i \sum Y_i$.

It is easier to tell if a variable is being summed when it has a subscript, but sometimes, as above, the subscript is dropped.

In crosstabulated or two-way tables, it is common to use two subscripts (one for the rows and one for the columns). Hence, to sum across both rows and columns, we would write

\[ \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij}, \]

which is read, "the sum of X over the subscripts i and j, where i ranges from 1 to I and j ranges from 1 to J."

Ex. We have a cross-tabulated table of political party preference by gender. Each cell contains the number of people.

                  Political preference
Gender        1=D            2=R            3=I
1=Male        3 (= X11)      5 (= X12)      2 (= X13)
2=Female      5 (= X21)      3 (= X22)      2 (= X23)

In this example, I = 2 and J = 3. Thus we have

\[ \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} = \sum_{i=1}^{2} \sum_{j=1}^{3} X_{ij}, \]

which simply says to add up all the numbers in the table. We can go with either the columns first or the rows first. If we fix the row, add up the numbers in that row, and then move to the second row, we do:

\[ \sum_{i=1}^{2} \sum_{j=1}^{3} X_{ij} = \sum_{i=1}^{2} \Big(\sum_{j=1}^{3} X_{ij}\Big) = \sum_{i=1}^{2} (X_{i1} + X_{i2} + X_{i3}) = (X_{11} + X_{12} + X_{13}) + (X_{21} + X_{22} + X_{23}) = 10 + 10 = X_{1\cdot} + X_{2\cdot} = 20 \]

                  Political preference
Gender          1=D            2=R            3=I            Row Total
1=Male          3 (= X11)      5 (= X12)      2 (= X13)      10 (= X1·)
2=Female        5 (= X21)      3 (= X22)      2 (= X23)      10 (= X2·)
Column Total    8 (= X·1)      8 (= X·2)      4 (= X·3)      20 (= X··)

Or we can go the other way around. That is, fixing the column, add up the numbers in that column, and then do the same thing for the next column.
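Both orders of summation can be checked directly on the 2 × 3 table. The sketch below is only an illustration in Python (the variable names are made up here); it stores the cell counts as a list of rows and sums them rows-first and columns-first.

```python
# Cell counts from the party-by-gender table: rows = gender, columns = party.
X = [
    [3, 5, 2],  # male:   D, R, I
    [5, 3, 2],  # female: D, R, I
]

# Rows first: sum within each row, then add the row totals.
row_totals = [sum(row) for row in X]
total_rows_first = sum(row_totals)

# Columns first: sum within each column, then add the column totals.
col_totals = [sum(col) for col in zip(*X)]
total_cols_first = sum(col_totals)

print("row totals:", row_totals)
print("column totals:", col_totals)
print("grand totals:", total_rows_first, total_cols_first)
```

The row totals play the role of $X_{i\cdot}$, the column totals the role of $X_{\cdot j}$, and either route ends at the same grand total $X_{\cdot\cdot}$.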
\[ \sum_{i=1}^{2} \sum_{j=1}^{3} X_{ij} = \sum_{j=1}^{3} \Big(\sum_{i=1}^{2} X_{ij}\Big) = \sum_{j=1}^{3} (X_{1j} + X_{2j}) = (X_{11} + X_{21}) + (X_{12} + X_{22}) + (X_{13} + X_{23}) = 8 + 8 + 4 = X_{\cdot 1} + X_{\cdot 2} + X_{\cdot 3} = 20 \]

(The row and column totals are those of the table above.)

We now realize that the summations are exchangeable. That is,

\[ \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} = \sum_{i=1}^{I} \Big(\sum_{j=1}^{J} X_{ij}\Big) = \sum_{j=1}^{J} \Big(\sum_{i=1}^{I} X_{ij}\Big) = \sum_{j=1}^{J} \sum_{i=1}^{I} X_{ij}. \]

Chapter 2: Measures of Central Tendency

One characteristic of a distribution that we may wish to summarize is its location on the underlying continuum. For example, in the following plot, the blue and red distributions are identical, but the red one is shifted to the right on the horizontal axis (AKA X-axis). We refer to such a difference in the positions of distributions as a location shift. We depict such differences in location using statistics that are called measures of central tendency (AKA measures of location).

There are three primary measures of central tendency:
1. Mode (Mo): The most frequently occurring data value.
2. Median (Med): When the data are rank ordered, the middle value (or the average of the two middle values when there is an even number of observations). The median, therefore, represents the 50th percentile of the data values.
3. Mean (also $\bar{X}$ or "X-bar"): The arithmetic average, obtained by adding all data values and dividing by the number of observations:

\[ \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} \]

Q. There are 7 observations: [3 5 7 5 6 8 9]. What are the mean, the median, and the mode of this distribution?
A. Mean = 43/7 = 6.14, Median = 6, Mode = 5.

Each of the three measures of central tendency is more appropriate for some types of data than for others.
1. Mode: Nominal, ordinal, interval, ratio; i.e., it can be used for all types of variables.
2. Median: Ordinal, interval, ratio; also frequently used when there are outliers in the data. It is no good for nominal variables, which are not ordered at all.
3. Mean: Interval, ratio; this is used when the numbers themselves have meaning beyond just ordering the data.

Chapter 2: Questions on Measures of Central Tendency

SPSS reports:
The mean of the 100 GRE scores is 497.02.
The median of the 100 GRE scores is 489.50.
The mode of the 100 GRE scores is 492.

• Can you explain how the mean, 497.02, was obtained?
• Can you explain how the median, 489.50, was obtained using the frequency table shown before? Since N = 100 (an even number), the median location is (100 + 1)/2 = 50.5, i.e., halfway between the 50th and 51st ordered observations: (489 + 490)/2 = 489.50.
• Can you explain how the mode, 492, was obtained using the frequency table shown before?

Chapter 2: Measures of Central Tendency: Mean, Med, & Mo

The mean, median, and mode are equal ONLY when the distribution is symmetrical and unimodal. When the distribution is skewed and unimodal, the mode will be the hump in the distribution. The mean will be pulled out toward the tail of the skew. The median will be between the other two values.

[Figure: a skewed distribution with Mo, Med, and Mean marked in that order from the hump toward the tail]

Chapter 2: Measures of Variability

Another characteristic of a distribution that we may wish to summarize is its dispersion or spread on the underlying continuum. For example, in this plot, the blue and red distributions have the same measure of central tendency, but the red one is more widely dispersed (wider and flatter) along the X-axis. We refer to such a difference in the spread of distributions as a difference in dispersion or variability, and we depict such differences in spread using statistics that are called measures of variability (AKA measures of dispersion). A distribution with a small measure of variability has more homogeneous members than one with greater variability (which has more heterogeneous members).

There are five primary measures of variability:
1. Range: The difference between the two most extreme data points (maximum – minimum).
2. Interquartile Range (IQR): The difference between the 75th (Q3) and 25th (Q1) percentiles, Q3 – Q1.
3. Variance ($s_X^2$ or "s-squared sub X"): The average squared deviation of scores from the mean (with N – 1 in the denominator):

\[ s_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} \]

4. Standard Deviation ($s_X$ or "s-sub X"): The square root of the variance; roughly, the typical deviation of scores from the mean (it is not the same as the average absolute deviation):

\[ s_X = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}} \]

5. Coefficient of Variation (CV): An index that rescales the standard deviations from two groups that are measured on the same scale but have very different means (useful for comparing group variability). Thus, the CV measures the variability relative to the magnitude of the mean:

\[ CV = \frac{s_X}{\bar{X}} \]

[Figure: two distributions, 1 and 2, with standard deviations $s_1$ and $s_2$ and means $\bar{X}_1$ and $\bar{X}_2$]

In the figure, CV1 > CV2. A large CV indicates a potential "floor effect".

Like the measures of central tendency, measures of variability are influenced by certain characteristics of the data:
• Range: sensitive to outliers
• IQR: insensitive to the outer 50% of the data
• $s_X^2$ & $s_X$: very sensitive to outliers

Also, the measures of variability are more appropriate for some types of data than others (none are suitable for nominal data).
• IQR: Ordinal, interval, ratio
• $s_X^2$, $s_X$, CV, & Range: Interval, ratio

Chapter 2: SPSS descriptive statistics for continuous variables

The descriptives procedure: The Descriptives procedure in SPSS produces simple descriptive statistics for specific variables. In SPSS, choose Analyze -> Descriptive Statistics -> Descriptives. In the pop-up dialog box, select GRE in the left box and move it to the right box by clicking the arrow (►) button. This procedure generates a table of descriptive statistics for the variable GRE.

Chapter 2: Descriptive Statistics Procedure

Descriptive Statistics
                      N     Minimum   Maximum   Mean     Std. Deviation
GRE                   100   252       724       497.02   95.325
Valid N (listwise)    100

Chapter 2: Transformations and Statistics

Frequently, we wish to transform data from their original scale to one that has more meaning to us. For example, we might want to transform test scores to an IQ scale (with a mean of 100 and standard deviation of 15) or an SAT/GRE scale (with a mean of 500 and standard deviation of 100). Similarly, we might wish to transform temperature from Celsius to Fahrenheit (with a freezing point of 32 rather than 0). These are all examples of linear transformations in which a new mean and standard deviation are applied to a scale. We can use linear transformations to transform a variable to have any desired mean and standard deviation.

Linear transformation: X′ = bX + a, where b is the scaling factor and a is the location factor.

There are a few general rules that allow us to make such transformations without losing the meaning of the variable in question. For example:
• Adding a constant to all values in a dataset ($X_i' = X_i + a$ for all i) increases the mean of the distribution by the value of the constant and leaves the variance and standard deviation unchanged:
\[ \bar{X}' = \bar{X} + a, \qquad S_{X'} = S_X, \qquad S_{X'}^2 = S_X^2 \]
• Multiplying all values in a dataset by a constant ($X_i' = bX_i$ for all i) multiplies the mean and the standard deviation of the distribution by the value of the constant; the variance is multiplied by the square of the constant:
\[ \bar{X}' = b\bar{X}, \qquad S_{X'} = bS_X, \qquad S_{X'}^2 = b^2 S_X^2 \]
• A linear transformation is a combination of both addition and multiplication, that is, $X_i' = bX_i + a$ for all i:
\[ \bar{X}' = b\bar{X} + a, \qquad S_{X'} = bS_X, \qquad S_{X'}^2 = b^2 S_X^2 \]
(These standard deviation rules assume b > 0; for a negative b, the standard deviation is multiplied by |b|.)

A common linear transformation is standardization, in which scores are scaled to have a mean of 0 and standard deviation of 1. A variable (or score) that has a mean of 0 and standard deviation of 1 is frequently referred to as a standardized variable (or score), and it is denoted by the symbol z.
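The transformation rules above can be checked numerically. The following is a minimal Python sketch, not part of the SPSS workflow: it reuses the four pretest scores from the summation example, and the factors b = 4 and a = 20 are chosen only for illustration.

```python
from math import isclose
from statistics import mean, stdev, variance

x = [8, 7, 5, 10]   # pretest scores from the summation example
b, a = 4, 20        # illustrative scaling and location factors (b > 0)

# Linear transformation X' = bX + a applied to every score.
x_new = [b * xi + a for xi in x]

# Rules for X' = bX + a: the mean is rescaled and shifted,
# the SD is rescaled only, the variance is rescaled by b squared.
mean_rule_holds = isclose(mean(x_new), b * mean(x) + a)
sd_rule_holds = isclose(stdev(x_new), b * stdev(x))
var_rule_holds = isclose(variance(x_new), b ** 2 * variance(x))
print(mean_rule_holds, sd_rule_holds, var_rule_holds)

# Standardization: subtract the mean, then divide by the SD.
z = [(xi - mean(x)) / stdev(x) for xi in x]
print("mean of z:", round(mean(z), 6), "SD of z:", round(stdev(z), 6))
```

Note that `stdev` and `variance` use N – 1 in the denominator, matching the sample formulas in this chapter.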
What would be the values of a (intercept) and b (slope) in the general formula for the linear transformation, i.e., $X' = bX + a$? If we want to center X′ on 0, then we can subtract the mean of X from all of the observed values of X (why would this work?). Hence, the shift applied to X is $-\bar{X}$. Similarly, if we want X′ to have a standard deviation of 1, we can divide all values by the standard deviation of X (why would this work?). Hence, b would equal $1/S_X$. Written in the form $X' = bX + a$, this gives $b = 1/S_X$ and $a = -\bar{X}/S_X$.

Hence, to get X′ scores with a new mean equal to 0 and standard deviation of 1, we use the following version of the linear equation:

\[ X' = \frac{X - \bar{X}}{S_X} \]

Thus, the standardized score Z for variable X ($Z_X$) can be obtained by the formula

\[ Z_X = \frac{X - \bar{X}}{S_X}, \quad \text{or, for the ith observation,} \quad Z_{X_i} = \frac{X_i - \bar{X}}{S_X} \quad \text{for } i = 1, \dots, N. \]

More generally, to get X′ scores with a new mean $\bar{X}'$ and standard deviation $S_{X'}$, we use the following transformation formula:

\[ X' = S_{X'} \left( \frac{X - \bar{X}}{S_X} \right) + \bar{X}' \quad \text{or} \quad X' = S_{X'} Z_X + \bar{X}', \quad \text{where } Z_X = \frac{X - \bar{X}}{S_X}. \]

Or, for each observation i,

\[ X_i' = S_{X'} \left( \frac{X_i - \bar{X}}{S_X} \right) + \bar{X}' \quad \text{or} \quad X_i' = S_{X'} Z_{X_i} + \bar{X}', \quad \text{where } Z_{X_i} = \frac{X_i - \bar{X}}{S_X} \text{ for } i = 1, \dots, N. \]

An example: Our GRE variable has a mean of 497.02 and $s_X$ = 95.32. Suppose we want a new mean of 100 and a new SD of 15. We compute the following transformed score for the original score of 500:

\[ X' = S_{X'} \left( \frac{X - \bar{X}}{S_X} \right) + \bar{X}' = 15 \left( \frac{500 - 497.02}{95.32} \right) + 100 = 100.47 \]

So an original score X = 500 would be X′ = 100.47.

Chapter 2: Transformation in SPSS

You can transform variables in SPSS by using the Compute command, which creates a new variable and computes its value for each case by following the formula you provide. There are a variety of SPSS functions that may be useful when doing transformations. Below is an example showing how to create three transformations of the GRE variable (Chap2Ex.sav): adding a constant, multiplying by a constant, and transforming to a scale with a mean of 100 and standard deviation of 15.

Transform -> Compute.
In the pop-up window, type SCORE_PLUS in the Target Variable box and GRE + 500 in the Numeric Expression box, then click OK. For the other variables we do the same thing. (Here I used the syntax window by clicking Paste.)

    COMPUTE SCORE_PLUS = GRE + 500 .
    COMPUTE SCORE_TIMES = GRE * 100 .
    COMPUTE SCORE_SCALED = (15 * ((GRE - 497.02) / 95.32)) + 100 .
    EXECUTE .

Now we can compute the descriptive statistics.

    DESCRIPTIVES VARIABLES=GRE SCORE_PLUS SCORE_TIMES SCORE_SCALED
      /STATISTICS=MEAN STDDEV MIN MAX .

The output looks like this.

Descriptive Statistics
                      N     Minimum     Maximum     Mean        Std. Deviation
GRE                   100   252         724         497.02      95.325
SCORE_PLUS            100   752.00      1224.00     997.0200    95.32464
SCORE_TIMES           100   25200.00    72400.00    49702.00    9532.46425
SCORE_SCALED          100   61.44       135.72      100.0000    15.00073
Valid N (listwise)    100

Q. Can you state the general rule of linear transformation?

If we perform the linear transformation X′ = bX + a on a variable X, then the mean, standard deviation, and variance of the new variable X′ are:

\[ \bar{X}' = b\bar{X} + a, \qquad S_{X'} = bS_X, \qquad S_{X'}^2 = b^2 S_X^2 \]

Ex. X has a mean of 50 and an S.D. of 10. We create a new variable Y by multiplying by a scaling factor of 4 and adding a location factor of 20 (i.e., b = 4, a = 20 in Y = bX + a). What would be the mean and the standard deviation of the new variable Y? (By the rule above, the mean is 4(50) + 20 = 220 and the standard deviation is 4(10) = 40.)

We can check this empirically by generating 1000 X's from a normal distribution with a mean of 50 and an SD of 10. The output looks like this.

Descriptive Statistics
                      N      Minimum   Maximum   Mean        Std. Deviation
X                     1000   23.10     81.08     50.0951     10.01557
Y                     1000   112.41    344.31    220.3805    40.06227
Valid N (listwise)    1000

Chapter 2: Stem-and-Leaf Displays

Another graphical method for summarizing data is the stem-and-leaf display, which gives you a visual display of the shape of the distribution while preserving the actual values for every data point in the data set.
The stem of a stem-and-leaf display contains the leading digits (or most significant digits) of the data points (e.g., the 3s in 31, 34, 37, and 39). The leaves (or trailing digits or less significant digits) of the display contain the remaining portions of the data points, allowing you to identify individual data points (e.g., the 1, 4, 7, and 9 of 31, 34, 37, and 39). The display below summarizes these data points: 11, 15, 18, 21, 22, 22, 23, 25, 28, 30, 31, 32, 33, 33, 33, 34, 45, 51.

1|158
2|122358
3|0123334
4|5
5|1

Chapter 2: Boxplots

Another graphical method for summarizing data is the boxplot (AKA box-and-whisker plot), which gives you a summary of the data. The following quantities are contained in a boxplot:
• Median: The middlemost data value (50th percentile, i.e., Q2) when the data are ordered.
• Median Location: ML = (N + 1) / 2, where N is the number of scores. ML tells us where, in the rank ordered data, the median lies.
• Hinge: The median values of the upper and lower halves of the data when the data are rank ordered. The Upper Hinge (UH) represents the data value of the 75th percentile, and the Lower Hinge (LH) represents the data value of the 25th percentile. Thus, UH = Q3 and LH = Q1.
• Hinge Location: HL = (ML + 1) / 2 (Howell drops any fractional part of ML before applying this formula). HL tells us where, in the rank ordered data, the hinges lie.
• H-Spread: HS = UH – LH, a value comparable to the IQR.
• Inner Fence: UIF = Upper Hinge + (1.5 x HS) and LIF = Lower Hinge – (1.5 x HS).
• Adjacent Values: The actual data values that are no more extreme than the inner fences. LAV = the smallest data value that is not below LIF, and UAV = the largest data value that is not above UIF.

More simply, a boxplot represents the median and IQR of a data set. The median is represented by the line in the middle of the box. The upper and lower quartiles are represented by the outer edges of the box. The maximum and minimum reasonable values (approximately the lower and upper 2.5% of the data) are represented by the ends of the lines on each side of the box.
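The boxplot quantities defined above can be computed by hand or with a short script. The sketch below is a minimal Python illustration, not SPSS's algorithm: the function name `boxplot_summary` is made up, it assumes Howell's convention of dropping the fractional part of ML when locating the hinges, and it takes the adjacent values to be actual data values inside the inner fences. The data are the 18 values from the stem-and-leaf example.

```python
def boxplot_summary(values):
    """Median, hinges, H-spread, inner fences, adjacent values, outliers."""
    data = sorted(values)
    n = len(data)

    def value_at(location):
        # 1-based location; a location ending in .5 averages two neighbours.
        lower = data[int(location) - 1]
        upper = data[int(location + 0.5) - 1]
        return (lower + upper) / 2

    ml = (n + 1) / 2              # median location
    hl = (int(ml) + 1) / 2        # hinge location (fraction of ML dropped)
    lower_hinge = value_at(hl)
    upper_hinge = value_at(n + 1 - hl)   # same location counted from the top
    h_spread = upper_hinge - lower_hinge
    lif = lower_hinge - 1.5 * h_spread   # lower inner fence
    uif = upper_hinge + 1.5 * h_spread   # upper inner fence
    inside = [v for v in data if lif <= v <= uif]
    return {
        "median": value_at(ml),
        "LH": lower_hinge, "UH": upper_hinge, "HS": h_spread,
        "LIF": lif, "UIF": uif,
        "LAV": min(inside), "UAV": max(inside),
        "outliers": [v for v in data if v < lif or v > uif],
    }


scores = [11, 15, 18, 21, 22, 22, 23, 25, 28, 30,
          31, 32, 33, 33, 33, 34, 45, 51]
summary = boxplot_summary(scores)
for name, value in summary.items():
    print(name, value)
```

For these data the only value beyond an inner fence is 51, so it would be plotted as an individual point rather than included in the whisker.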
Asterisks are used to represent data points (outliers) that lie outside of the adjacent values, i.e., outside the "reasonable" limits.

[Figure: schematic boxplot labeled, from left to right, LAV, LH, ML, UH, UAV, with the H-spread spanning the box and outliers shown as asterisks]

Chapter 2: Explore procedure

The explore procedure: Explore is an SPSS procedure that requests comprehensive descriptive statistics for a particular variable (including stem-and-leaf and box plots). The following steps let you study the variables in detail (use Chap2Ex.sav). Choose Analyze -> Descriptive Statistics -> Explore; then, in the pop-up window, move the GRE variable from the left box to the Dependent List box and click OK. The equivalent syntax is as follows.

    EXAMINE VARIABLES=GRE
      /PLOT BOXPLOT STEMLEAF
      /COMPARE GROUP
      /STATISTICS DESCRIPTIVES
      /CINTERVAL 95
      /MISSING LISTWISE
      /NOTOTAL.

Descriptives: GRE
                                                  Statistic   Std. Error
Mean                                              497.02      9.532
95% Confidence Interval for Mean   Lower Bound    478.11
                                   Upper Bound    515.93
5% Trimmed Mean                                   496.66
Median                                            489.50
Variance                                          9086.787
Std. Deviation                                    95.325
Minimum                                           252
Maximum                                           724
Range                                             472
Interquartile Range                               154
Skewness                                          .113        .241
Kurtosis                                          -.387       .478

GRE Stem-and-Leaf Plot

Frequency    Stem &  Leaf
     2.00       2 .  56
     1.00       3 .  4
    10.00       3 .  5666779999
    22.00       4 .  0000001111122223333344
    22.00       4 .  5555566677788889999999
    13.00       5 .  0011333334444
    14.00       5 .  55667777888999
    10.00       6 .  0012233334
     5.00       6 .  56689
     1.00       7 .  2

Stem width:  100
Each leaf:   1 case(s)

[Figure: Box and Whisker Plot of GRE; vertical axis from 200 to 800]

Before, we had the histogram for the GRE variable.

[Figure: Histogram for GRE scores; x-axis: GRE, y-axis: Frequency; Mean = 497.02, Std. Dev. = 95.325, N = 100]

Chapter 2: Parameters and Statistics

Recall that we use descriptive statistics as estimates of population parameters. The table below shows the correspondence between several of the statistics and parameters we will discuss this semester.

Note. Parameter: a fixed number that represents a certain characteristic of the population in which we are interested.
This is usually unknown. Statistic: a value that we can compute from our data (i.e., sample) at hand. We use the sample statistic to estimate the corresponding population parameter.

Statistic (Roman letters)  --estimate-->  Parameter (Greek letters)
$\bar{X}$                                 $\mu_x$
$s_X^2$                                   $\sigma_x^2$
$s_X$                                     $\sigma_x$
$r_{xy}$                                  $\rho_{xy}$
$b_{x|y}$                                 $\beta_{x|y}$

There are four properties that are useful when we use statistics, particularly the mean and variance, as estimators of parameters:
1. Sufficiency: The statistic uses all of the information in the sample.
2. Unbiasedness: The expected value of the statistic (i.e., its average over a large number of samples) equals the parameter. Note that N – 1 is used in the denominator of the sample variance to make it an unbiased estimate of the population variance. For example, we use

\[ S_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} \quad \text{to unbiasedly estimate} \quad \sigma_X^2 = \frac{\sum_{i=1}^{N_{pop.}} (X_i - \mu_X)^2}{N_{pop.}}, \]

where $N_{pop.}$ is the population size (i.e., the number of cases in the entire population).
3. Efficiency: The variability of the statistic over a large number of samples is smaller than the variability of other, similar, descriptive statistics.
4. Resistance: The statistic is not heavily influenced by outliers.

Chapter 2: Discussion Questions

I have the IQ scores of 1000 students, and I ran SPSS (Frequencies & Explore) and obtained descriptive statistics, a histogram, a stem-and-leaf display, and a box plot. The output appears on the following pages. Based on the output, comment on the following characteristics of the IQ scores. Be sure to cite the indices you considered concerning each characteristic.
* Shape
* Location
* Dispersion
* Skewness
* Kurtosis
* Outliers
* Percentiles (especially quartiles)

What would happen if you assumed that these IQ scores were normally distributed? In other words, suppose you know only the mean and the standard deviation of the IQ scores. If you assume that IQ scores are normally distributed with the given mean and SD, what kind of errors might you make?
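One way to probe the last question is to compare the share of scores that actually falls within one standard deviation of the mean with the share a normal distribution would put there. The following Python sketch is only an illustration under stated assumptions: the function name `share_within` is made up, and the bimodal scores are hypothetical values constructed for the example, not the IQ data.

```python
from statistics import NormalDist, mean, stdev


def share_within(scores, k=1.0):
    """Proportion of scores within k sample SDs of the sample mean."""
    m, s = mean(scores), stdev(scores)
    inside = [x for x in scores if m - k * s <= x <= m + k * s]
    return len(inside) / len(scores)


# Share that a normal distribution places within 1 SD of its mean (about 68%).
normal_share = NormalDist().cdf(1) - NormalDist().cdf(-1)

# Hypothetical bimodal scores (NOT the IQ data): clusters at 90-104 and 130-144.
bimodal = list(range(90, 105)) + list(range(130, 145))
observed_share = share_within(bimodal)

print(f"normal: {normal_share:.3f}, observed: {observed_share:.3f}")
```

With two humps and a thin middle, fewer scores than the normal curve predicts lie near the mean, so a normal assumption overstates the middle range and understates the two ends.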
Output from Frequencies

Statistics: IQ

N                         Valid       1000
                          Missing     0
Mean                                  119.17
Median                                117.00
Mode                                  99
Std. Deviation                        22.460
Skewness                              .060
Std. Error of Skewness                .077
Kurtosis                              -1.287
Std. Error of Kurtosis                .155
Percentiles               25          99.00
                          50          117.00
                          75          139.00

Output from Frequencies: Histogram

[Figure: histogram of IQ; horizontal axis IQ from 60 to 180, vertical axis Frequency; Mean = 119.17, Std. Dev. = 22.46, N = 1,000.]

Output from Explore

Descriptives: IQ

                                                 Statistic    Std. Error
Mean                                             119.17       .710
95% Confidence Interval for Mean   Lower Bound   117.77
                                   Upper Bound   120.56
5% Trimmed Mean                                  119.09
Median                                           117.00
Variance                                         504.452
Std. Deviation                                   22.460
Minimum                                          68
Maximum                                          171
Range                                            103
Interquartile Range                              40
Skewness                                         .060         .077
Kurtosis                                         -1.287       .155

IQ Stem-and-Leaf Plot

Frequency    Stem &  Leaf
   1.00        6 .  &
    .00        7 .
   8.00        7 .  689&
  22.00        8 .  0123344
  50.00        8 .  5556667778888999
  76.00        9 .  000000111111222222333344444
  95.00        9 .  5555566666777778888889999999999
  99.00       10 .  000000011111112222222223333344444
  77.00       10 .  5555556666666777777888899
  54.00       11 .  000001112222333444
  33.00       11 .  556666778899
  30.00       12 .  001223344
  51.00       12 .  556677788888899999
  63.00       13 .  0000111112222333334444
  93.00       13 .  555555566666777778888888999999
  86.00       14 .  0000011111112222223333344444
  75.00       14 .  55556666677777778888889999
  50.00       15 .  0000111112222334
  25.00       15 .  55566799&
  10.00       16 .  0124&
   1.00       16 .  &
   1.00       17 .  &

Stem width:  10
Each leaf:   3 case(s)
& denotes fractional leaves.

[Figure: boxplot of IQ; vertical axis from 60 to 180.]

Knowing that the mean = 119.17 and SD = 22.5, you assumed that IQ is normally distributed. Then, using the well-known fact that in a normal distribution 68% of the scores lie within the range of mean ± 1 SD, you predicted that 68% of the scores lie within the range (96.67, 141.67), computed from 119.17 ± 22.5.
But actually only 58.5% of the students (those scoring 98 through 141) fall in the middle range. So we overestimated the share of students in the middle range by 9.5 percentage points. Another way of saying this is that we expect 16% of the scores to be below 97 and another 16% to be above 141. What we actually observe is:

At or below 97: 20.4%
Above 141: 21.1%

Thus, if you blindly assume that IQ scores are normally distributed, you underestimate the percentages of students in the high and low ranges by about 4 to 5 percentage points each and overestimate the middle range by about 10 percentage points. When we make statements based on a distribution, we cannot always assume normality. We need to base our inferences on the actual distribution.

I calculated the above percentages from the frequency table below.

Frequency Table: IQ

IQ       Frequency   Percent   Valid Percent   Cumulative Percent
68           1          .1          .1               .1
76           3          .3          .3               .4
77           1          .1          .1               .5
78           2          .2          .2               .7
79           2          .2          .2               .9
80           4          .4          .4              1.3
81           2          .2          .2              1.5
82           4          .4          .4              1.9
83           6          .6          .6              2.5
84           6          .6          .6              3.1
85           9          .9          .9              4.0
86           9          .9          .9              4.9
87          10         1.0         1.0              5.9
88          12         1.2         1.2              7.1
89          10         1.0         1.0              8.1
90          17         1.7         1.7              9.8
91          17         1.7         1.7             11.5
92          17         1.7         1.7             13.2
93          11         1.1         1.1             14.3
94          14         1.4         1.4             15.7
95          16         1.6         1.6             17.3
96          15         1.5         1.5             18.8
97          16         1.6         1.6             20.4
98          18         1.8         1.8             22.2
99          30         3.0         3.0             25.2
100         21         2.1         2.1             27.3
101         21         2.1         2.1             29.4
102         26         2.6         2.6             32.0
103         16         1.6         1.6             33.6
104         15         1.5         1.5             35.1
105         18         1.8         1.8             36.9
106         21         2.1         2.1             39.0
107         19         1.9         1.9             40.9
108         12         1.2         1.2             42.1
109          7          .7          .7             42.8
110         14         1.4         1.4             44.2
111          8          .8          .8             45.0
112         13         1.3         1.3             46.3
113         10         1.0         1.0             47.3
114          9          .9          .9             48.2
115          5          .5          .5             48.7
116         12         1.2         1.2             49.9
117          5          .5          .5             50.4
118          6          .6          .6             51.0
119          5          .5          .5             51.5
120          7          .7          .7             52.2
121          3          .3          .3             52.5
122          7          .7          .7             53.2
123          6          .6          .6             53.8
124          7          .7          .7             54.5
125          5          .5          .5             55.0
126          7          .7          .7             55.7
127          8          .8          .8             56.5
128         17         1.7         1.7             58.2
129         14         1.4         1.4             59.6
130         12         1.2         1.2             60.8
131         14         1.4         1.4             62.2
132         12         1.2         1.2             63.4
133         14         1.4         1.4             64.8
134         11         1.1         1.1             65.9
135         22         2.2         2.2             68.1
136         16         1.6         1.6             69.7
137         16         1.6         1.6             71.3
138         22         2.2         2.2             73.5
139         17         1.7         1.7             75.2
140         16         1.6         1.6             76.8
141         21         2.1         2.1             78.9
142         18         1.8         1.8             80.7
143         15         1.5         1.5             82.2
144         16         1.6         1.6             83.8
145         11         1.1         1.1             84.9
146         15         1.5         1.5             86.4
147         20         2.0         2.0             88.4
148         18         1.8         1.8             90.2
149         11         1.1         1.1             91.3
150         11         1.1         1.1             92.4
151         16         1.6         1.6             94.0
152         12         1.2         1.2             95.2
153          7          .7          .7             95.9
154          4          .4          .4             96.3
155          9          .9          .9             97.2
156          5          .5          .5             97.7
157          4          .4          .4             98.1
158          1          .1          .1             98.2
159          6          .6          .6             98.8
160          2          .2          .2             99.0
161          3          .3          .3             99.3
162          2          .2          .2             99.5
163          1          .1          .1             99.6
164          2          .2          .2             99.8
169          1          .1          .1             99.9
171          1          .1          .1            100.0
Total     1000       100.0       100.0
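The arithmetic behind the observed percentages can be sketched in a few lines of Python (used here only for illustration; the course itself works in SPSS). The two cumulative percents are copied from the frequency table, the 16/68/16 benchmark is the normal-theory expectation quoted earlier, and the variable names are assumptions of this sketch.

```python
# Cumulative percents copied from the frequency table (IQ score -> cumulative %).
cumulative_percent = {97: 20.4, 141: 78.9}

# Observed shares, rounded to one decimal to match the table's precision.
pct_at_or_below_97 = round(cumulative_percent[97], 1)
pct_above_141 = round(100.0 - cumulative_percent[141], 1)
pct_middle = round(cumulative_percent[141] - cumulative_percent[97], 1)  # scores 98..141

# Normal-theory expectations quoted in the text (low tail, middle, high tail).
expected = {"low": 16.0, "middle": 68.0, "high": 16.0}
observed = {"low": pct_at_or_below_97, "middle": pct_middle, "high": pct_above_141}

# Positive difference: the normal assumption underestimates that region.
for region in ("low", "middle", "high"):
    diff = round(observed[region] - expected[region], 1)
    print(f"{region:>6}: observed {observed[region]:.1f}%, "
          f"expected {expected[region]:.1f}%, difference {diff:+.1f}")
```

Note that a score of 97 itself lies inside (96.67, 141.67). Counting it in the middle band would give 78.9 - 18.8 = 60.1% instead of 58.5%, which does not change the conclusion.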