Action Research: Measurement Scales and Descriptive Statistics
INFO 515, Glenn Booker, Lecture #2

Measurement Needs
- Need a long set of measurements for one project, and/or many projects, to examine statistical trends
- Could use measurements to test specific hypotheses
- Other realistic uses of measurement are to help make decisions and track progress
- Need scales to make measurements!

Measurement Scales
- There are four types of measurement scales: Nominal, Ordinal, Interval, Ratio
- Completely optional mnemonic: to remember the sequence, I think of 'NOIR', as in the expression 'film noir' ('noir' is French for 'black')

Nominal Scale
- A nominal ("name") scale groups or classifies things into categories, which:
  - Must be jointly exhaustive (cover everything)
  - Must be mutually exclusive (one thing can't be in two categories at once)
  - Are in any sequence (none better or worse)
- So a nominal variable puts things into buckets which have no inherent order to them
- Examples include: gender (though some would dispute the limitation to only male/female categories), the Dewey decimal system, the Library of Congress system, academic majors, makes of stuff (cars, computers, etc.), parts of a system

Ordinal Scale
- This measurement ranks things in order
- Sequence is important, but the intervals between ranks are not defined numerically
- Rank is relative, such as "greater than" or "less than"
- E.g. letter grades, urgency of problems, class rank, inspection ratings
- So now the buckets we're using have some sense of order or direction

Interval Scale
- An interval scale measures quantitative differences, not just relative ones
- Addition and subtraction are allowed
- E.g. common temperature scales (°F or °C), a single date (Feb 15, 1999), maybe IQ scores
- Let me know if you find any more examples
- A zero point, if any, is arbitrary (90 °F is *not* six times hotter than 15 °F!)

Ratio Scale
- A ratio scale is an interval scale with a non-arbitrary zero point
- Allows division and multiplication
- The "best" type of scale to use, if possible
- E.g. defect rates for software, test scores, absolute temperature (Kelvin or Rankine), the number or count of almost anything, size, speed, length, ...

Summary of Scales
- Nominal: names different categories; not ordered, not ranked (Male, Female, Republican, Catholic, ...)
- Ordinal: categories are ordered (Low/High, Sometimes/Never)
- Interval: fixed intervals, no absolute zero (IQ, temperature)
- Ratio: fixed intervals with an absolute zero point (age, income, years of schooling, hours/week, weight)
- Age could be measured as ratio (years), ordinal (young, middle, old), or nominal (baby boomer, gen X)
- The scale of measurement affects (and may determine) the type of statistics you can use to analyze the data

Scale Hierarchy
- Measurement scales are hierarchical: ratio (best) / interval / ordinal / nominal
- Lower level scales can always be derived from data which uses a higher scale
- E.g. defect rates (a ratio scale) could be converted to {High, Medium, Low} or {Acceptable, Not Acceptable} (ordinal scales), as in the sketch below
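As an illustrative sketch of that derivation (not part of the original slides), the following Python snippet collapses a ratio-scale measurement into an ordinal one; the defect-rate values and the cutoff thresholds are hypothetical.

# Deriving an ordinal scale from ratio-scale data (hypothetical values and cutoffs).
defect_rates = [0.8, 2.5, 7.1, 0.2, 4.9]   # defects per KSLOC (a ratio scale)

def to_ordinal(rate):
    # Cutoffs are arbitrary illustrations, not a standard.
    if rate < 1.0:
        return "Low"
    elif rate < 5.0:
        return "Medium"
    else:
        return "High"

ordinal_ratings = [to_ordinal(r) for r in defect_rates]
print(ordinal_ratings)   # ['Low', 'Medium', 'High', 'Low', 'Medium']

Any statistic valid for an ordinal variable remains valid for the derived ratings, but the reverse conversion is impossible: the exact rates cannot be recovered from the categories.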
Reexamine Central Tendencies
- If data are nominal, only the mode is meaningful
- If data are ordinal, both median and mode may be used
- If data are ratio or interval (called "scale" in SPSS), you may use mean, median, and mode

Reexamine Variables
- Discrete variables use counting units or specific categories
  - Example: makes of cars, grades, ...
  - Use nominal or ordinal scales
- Continuous variables are integer or real measurements
  - Example: IQ test scores, length of a table, your weight, etc.
  - Use ratio or interval scales

Refine Research Types
- Qualitative research tends to use nominal and/or ordinal scale variables
- Quantitative research tends to use interval and/or ratio scale variables

Frequency Distributions
- Frequency distributions describe how many times each value occurs in a data set
- They are useful for understanding the characteristics of a data set
- Frequencies are the count of how many times each possible value appears for a variable (gender = male, or operating system = Windows 2000)
- They are most useful when there is a fixed and relatively small number of options for that variable
- They're harder to use for variables which are numbers (either real or integer) unless there are only a few specific options allowed (e.g. test responses 1 to 5 for a multiple-choice question)

Generating Frequency Distributions
- Select the command Analyze / Descriptive Statistics / Frequencies...
- Select one or more "Variable(s):"
- Note that the Frequency (count) and Percent are included by default; other outputs may be selected under the "Statistics..." button
- A bar chart can be generated as well using the "Charts..." button; see another way later

Sample Frequency Output
[SPSS frequency table; the table itself did not survive conversion, but it lists each value of the variable with its Frequency, Percent, Valid Percent, and Cumulative Percent for the 474 cases, and is analyzed below.]

Analysis of Frequency Output
- The first, unlabeled column has the values of the data: here it lists all Valid values (there are no Invalid ones, or it would show those too)
- The Frequency column is how many times that value appears in the data set
- The Percent column is the percent of cases with that value; in the fourth row, the value 15 appears 116 times, which is 24.5% of the 474 total cases (116/474*100 = 24.5%)
- The Valid Percent column divides each Frequency by the total number of Valid cases (it equals the Percent column if all cases are valid)
- The Cumulative Percent adds up the Valid Percent values going down the rows; so the first entry is the Valid Percent for the first row, the second entry is 11.2 + 40.1 = 51.3%, the next is 51.3 + 1.3 = 52.5%, and so on (small discrepancies are round-off error)
- (A small code sketch of these percent calculations appears after the graph-generation steps below)

Generating Frequency Graphs
- Frequency is often shown using a bar graph
- Bar graphs help make small amounts of data more visible
- To generate a frequency graph alone:
  - Click on the Charts menu and select "Bar..."
  - Leave the "Simple" graph selected, and leave "Summaries are for groups of cases" selected; click the "Define" button
  - Let the Bars Represent remain "N of cases"
  - Click on variable "Educational Level (years)" and move it into the Category Axis field
  - Click "OK"
- You should get the bar graph of Educational Level; notice that the text below the X axis is the Label for the Category Axis
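As a sketch of the percent bookkeeping described above, done outside SPSS, the following Python fragment tabulates Frequency, Percent, Valid Percent, and Cumulative Percent for one variable; the sample responses and the use of None to mark a missing case are hypothetical, not the Employee data set.

from collections import Counter

# Hypothetical responses for one variable; None marks a missing value.
responses = [12, 15, 15, 16, 12, 15, None, 16, 15, 12]

total_cases = len(responses)                       # includes missing values
valid = [r for r in responses if r is not None]    # excludes missing values
counts = Counter(valid)

cumulative = 0.0
print("Value  Freq  Percent  Valid%  Cum%")
for value in sorted(counts):
    freq = counts[value]
    percent = 100 * freq / total_cases     # "Percent": divide by all cases
    valid_pct = 100 * freq / len(valid)    # "Valid Percent": divide by valid cases only
    cumulative += valid_pct                # "Cumulative Percent" adds Valid Percent down the rows
    print(f"{value:>5}  {freq:>4}  {percent:7.1f}  {valid_pct:6.1f}  {cumulative:5.1f}")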
Sample Frequency Output
- Notice that the exact same graph can be generated from Frequencies, or just as a bar graph

Frequency Distributions
- A frequency distribution is a tabulation that indicates the number of times a score or group of scores occurs
- Bar charts are best used to graph the frequency of nominal & ordinal data
- Histograms are best used to display the shape of interval & ratio data

Frequency Distribution Example
[Bar chart of Employment Category frequencies, from SPSS for Windows, Student Version, alongside this frequency table:]
Employment Category   Frequency   Percent   Valid Percent   Cumulative Percent
Clerical                    363      76.6            76.6                 76.6
Custodial                    27       5.7             5.7                 82.3
Manager                      84      17.7            17.7                100.0
Total                       474     100.0           100.0

Basic Measures - Ratio
- Used for two exclusive populations (every case fits into one OR the other)
- Ratio = (# of testers) / (# of developers)
- E.g. the tester to developer ratio is 1:4

Proportions and Fractions
- Used for multiple (> 2) populations
- Proportion = (number in this population) / (total number in all populations)
- The sum of all proportions equals unity (one)
- E.g. survey results
- Proportions are based on integer units; fractions are based on real-numbered units

Percentage
- A proportion or fraction multiplied by 100 becomes a percentage
- Only report percentages when N (the total population measured) is above ~30 to 50, and always provide N for completeness
- Why? Otherwise a percentage will imply more accuracy than the data supports
- If 2 out of 3 people like something, it's misleading to report that 66.667% favor it

Percents
- Percent = the percentage of cases having a particular value
- Raw percent = divide the frequency of the value by the total number of cases (including missing values)
- Valid percent = calculated as above, but excluding missing values

Percent Change
- The percent increase in a measurement is the new value, minus the old one, divided by the old value; negative means a decrease:
  % increase = (new - old) / old
- The percent change is the absolute value of the percent increase or decrease:
  % change = | % increase |

Percent Increase
- % increase = (Later Value - Earlier Value) / (Earlier Value)
- So if a collection goes from 50,000 volumes in 1965 to 150,000 in 1975, the percent increase is:
  (150,000 - 50,000) / 50,000 = 2 = 200%
- Always divide by where you started
- Carpenter and Vasu (1978)

Percentiles
- A percentile is the point in a distribution at or below which a given percentage of scores fall
- The median is the 50th percentile
- Think of SAT scores: your percentile for verbal, math, etc. means what percent of people did worse than you

Rate
- Rate conveys the change in a measurement, such as over time (dx/dt)
- Rate = (# of observed events) / (# of opportunities) * constant
- Rate requires exposure to the risk being measured
- E.g. defects per KSLOC (1000 lines of code) = (# of defects) / (# of lines of code) * 1000
- (A code sketch of these percent and rate calculations follows the Exponential Notation slides below)

Exponential Notation
- You might see output of the form +2.78E-12
- The 'E' means 'times ten to the power of', so this is +2.78 × 10^-12 (in code form, +2.78*10**-12)
- A negative exponent, e.g. -12, makes it a very small number:
  10^-12 = 0.000000000001
  10^+12 = 1,000,000,000,000
- The leading number, here +2.78, controls whether it is a positive or negative number

Exponential Notation
- +5*10^+12 is a positive number >> 1
- +5*10^-12 is a positive number << 1
- -5*10^-12 is a negative number with magnitude << 1
- -5*10^+12 is a negative number with magnitude >> 1
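As an illustrative sketch (not from the slides themselves), these small Python helpers mirror the percent-increase, percent-change, and rate formulas above. The library-collection numbers come from the Carpenter and Vasu example; the defect and line counts are hypothetical.

def percent_increase(old, new):
    # (new - old) / old, expressed as a percentage; negative means a decrease.
    return 100 * (new - old) / old

def percent_change(old, new):
    # Absolute value of the percent increase or decrease.
    return abs(percent_increase(old, new))

def rate(events, opportunities, constant=1000):
    # (# of observed events) / (# of opportunities) * constant,
    # e.g. defects per KSLOC when opportunities are lines of code and constant = 1000.
    return events / opportunities * constant

print(percent_increase(50_000, 150_000))   # 200.0   (collection grows from 50,000 to 150,000 volumes)
print(percent_increase(150_000, 50_000))   # -66.66...  (a decrease)
print(percent_change(150_000, 50_000))     # 66.66...   (same change, reported as a magnitude)
print(rate(27, 45_000))                    # 0.6  (hypothetical: 27 defects in 45,000 lines of code, per KSLOC)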
Precision
- Keep your final output to a consistent level of precision (significant digits)
- Don't report one value as "12" and another as "11.86257523454574123"
- Pick a level of precision to match the accuracy of your inputs (or one digit more), and make sure everything is reported that way consistently (e.g. 12.0 and 11.9)

Data Analysis
- Raw data is collected, such as the dates a particular problem was reported and closed
- Refined data is extracted from raw data, e.g. the time it took a problem to be resolved
- Derived data is produced by analyzing refined data, such as the average time to resolve problems

Descriptive Statistics
- Descriptive statistics describe the key characteristics of one set of data (univariate):
  - Mean, median, mode, range (see also last week)
  - Standard deviation, variance
  - Skewness
  - Kurtosis
  - Coefficient of variation

Mean
- A.k.a. the average score
- The mean is the arithmetic average of the scores in a distribution: add all of the scores, then divide by the total number of scores
- The mean is greatly influenced by extreme scores; they pull it off center

Mean Calculation
Holdings in 7 different libraries (X): 7400, 6500, 6200, 5900, 5100, 4300, 3800
Sum every data value: ΣX = 39,200
Mean = ΣX / N = 39,200 / 7 = 5,600

Mean with a Frequency Distribution
X (IQ)   F = Freq   FX = F*X
  140        2          280
  135        1          135
  132        2          264
  130        1          130
  128        1          128
  126        1          126
  125        4          500
  123        1          123
  120        4          480
  110        3          330
  101        1          101
Total       21         2597
Mean = ΣFX / N, where N = ΣF
Mean = 2597 / 21 = 123.67 = 124 (rounded off)

Central Tendency Example
Staff salaries: $4100, 6000, 6000, 6000, 8000, 9000, 10000, 11000, 20000
Mode = $6000
Median = the (9 + 1)/2 = 5th value = $8000
Mean = ΣX / N = 80,100 / 9 = $8,900
Carpenter and Vasu (1978)

Handling Extreme Values
- In cases where you have an extreme value (high or low) in a distribution, it is helpful to report both the median and the mean
- Reporting both values gives some indication (through comparison) of a skewed distribution

Measures of Variation
- Measures which indicate the variation, or spread, of scores in a distribution:
  - Range (see last week)
  - Variance
  - Standard Deviation

Standard Deviation, Variance
- Standard deviation is the average amount the data differ from the mean:
  SD = sqrt( Σ(Xi - X̄)² / (N - 1) ) = sqrt( Variance )
- Variance is the standard deviation squared:
  Variance = Σ(Xi - X̄)² / (N - 1)
- [per ISO 3534-1, para 2.33 and 2.34]

Standard Deviation
- The standard deviation is the square root of the variance, and is expressed in the same units as the original data
- Since the variance is expressed in "squared units," it doesn't make much practical sense by itself; for example, what are "squared books" or "squared man-hours"?

Computing the Variance
S² = Σ(X - Mean)² / N
1. Subtract the mean from each score
2. Square the result
3. Sum the squares for all data points
4. Divide by the N of cases

Divide by N or N-1???
- You'll see different formulas for variance and standard deviation: some divide by N, some by N-1 (as in the two formulas above); why?
- If your data covers the entire population (you have all of the possible data to analyze), then divide by N
- If your data covers a sample from the population, divide by N-1
- (The code sketch below illustrates the mean and both versions of the variance)
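As a sketch of these calculations in Python (the slides do them by hand or in SPSS), using the IQ frequency-distribution data above and Python's standard statistics module:

import statistics

# IQ frequency distribution from the "Mean with a Frequency Distribution" slide.
values = [140, 135, 132, 130, 128, 126, 125, 123, 120, 110, 101]
freqs  = [  2,   1,   2,   1,   1,   1,   4,   1,   4,   3,   1]

# Expand the grouped data back into individual scores (N = sum of the frequencies = 21).
scores = [x for x, f in zip(values, freqs) for _ in range(f)]

mean = statistics.mean(scores)             # ΣFX / N = 2597 / 21
pop_var = statistics.pvariance(scores)     # divides by N   (data = the entire population)
samp_var = statistics.variance(scores)     # divides by N-1 (data = a sample of the population)

print(round(mean, 2))                      # 123.67, which rounds to 124
print(round(pop_var, 2), round(samp_var, 2))
print(round(statistics.stdev(scores), 2))  # sample standard deviation = sqrt(sample variance)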
Standard Deviation for a Frequency Distribution
Standard deviation of the Bookmobile distribution:
X     F    FX     X²    FX²
17    2    34    289    578
16    4    64    256   1024
14    5    70    196    980
10    2    20    100    200
 9    3    27     81    243
 6    1     6     36     36
Totals:   221          3061
σ = sqrt( (ΣFX² - (ΣFX)²/N) / N )
  = sqrt( (3061 - (221)²/17) / 17 )
  = sqrt( (3061 - 2873) / 17 )
  = 3.3
Notice that FX² is F*(X²), not (F*X)²

Std Dev Reflects Consistency
Frequency of shots landing at each distance from the target (in meters):
Distance (m):   200  150  100   50    0  -50  -100  -150  -200
Battery A:        2    4    5    7    9    7     5     4     2
Battery B:        0    1    5   10   13   10     5     1     0
Battery A: Mean = 0, Standard Deviation = 102.74
Battery B: Mean = 0, Standard Deviation = 65.83
Runyon and Haber (1984)

Standard Deviation vs. Std. Error
- To be precise, the standard error is the standard deviation of a statistic used to estimate a population parameter [per ISO 3534-1, para 2.56 and 2.50]
- So standard error pertains to sample data, while standard deviation should describe the entire population
- We often use them interchangeably

Skewness
- Skewness is a measure of the asymmetry of a distribution
- A distribution with significant positive skewness has a long right tail; negative skewness has a long left tail
- The normal distribution is symmetric and has a skewness value of zero
- Positive skewness means the mean and median are more positive than the mode (the peak of the distribution)
- As a rough guide, a skewness magnitude of more than two (> 2 or < -2) is taken to indicate a significant departure from symmetry
[Figure from www.riskglossary.com: curves with positive and negative skewness; both curves have the same mean and standard deviation.]

Kurtosis
- Kurtosis is a measure of the extent to which data cluster around a central point
- For a normal distribution, the value of the kurtosis is 3
- The kurtosis excess (= kurtosis - 3) is zero for a normal distribution
- Positive kurtosis excess indicates that the data have longer tails than "normal"; negative kurtosis excess indicates the data have shorter tails
- If a distribution's kurtosis is greater than 3, it is said to be leptokurtic (sharp peak); if its kurtosis is less than 3, it is said to be platykurtic (flat peak); mesokurtic is the "normal" curve, which has kurtosis = 3
[Figure from www.riskglossary.com: a platykurtic curve on the left and a leptokurtic curve on the right; the right-hand curve has higher kurtosis, being more peaked at the center with fatter tails, though the two might have equal standard deviation.]

Skewness & Kurtosis Example
- From the Employee data set, use Analyze / Descriptive Statistics / Descriptives and select the 'salary' variable; under Options..., select Skewness and Kurtosis
- Skewness is 2.125, so there is significant positive skewness to the data
- Kurtosis is 5.378, so the data is leptokurtic

Coefficient of Variation
- The coefficient of variation (CV) is the ratio of the standard deviation to the mean: divide the standard deviation by the mean to get CV
  CV = s/m   [per ISO 3534-1, para 2.35]
- The smaller the CV, the more representative the mean is of the total distribution; a higher CV means more variability
- The larger the decimal fraction, the worse job the mean does of giving us a true picture of the distribution
- CV can be used to compare the means and standard deviations of two different populations
- (A code sketch of the grouped standard deviation and the CV for the Bookmobile data follows)
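As an illustrative Python sketch of the grouped-data standard deviation and the coefficient of variation, using the Bookmobile stop counts from the slides (the by-hand result above was σ ≈ 3.3):

from math import sqrt

# Bookmobile distribution: number of stops (X) and how many bookmobiles made that many stops (F).
stops = [17, 16, 14, 10, 9, 6]
freqs = [ 2,  4,  5,  2, 3, 1]

n = sum(freqs)                                           # 17 bookmobiles
sum_fx = sum(f * x for x, f in zip(stops, freqs))        # ΣFX  = 221
sum_fx2 = sum(f * x * x for x, f in zip(stops, freqs))   # ΣFX² = 3061 (F times X squared)

mean = sum_fx / n                                        # 221 / 17 = 13.0
pop_sd = sqrt((sum_fx2 - sum_fx**2 / n) / n)             # divide by N:   σ ≈ 3.33 (population form, ≈ 3.3 above)
samp_sd = sqrt((sum_fx2 - sum_fx**2 / n) / (n - 1))      # divide by N-1: s ≈ 3.43 (the Std. Dev on the histogram slide)

cv = pop_sd / mean                                       # coefficient of variation ≈ 0.26
print(round(mean, 1), round(pop_sd, 2), round(samp_sd, 2), round(cv, 2))

The gap between 3.33 and 3.43 is exactly the N versus N-1 choice discussed earlier; with only 17 cases the two differ noticeably.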
Generating a Histogram
- Frequency graphs can be generated for variables which have many integer or real values (e.g. salary) by using a histogram
- A histogram shows how many data points fall into various ranges of values
- The closest "normal" curve can be shown for comparison
- The "¾ rule" is helpful for histograms: the tallest bar should be about ¾ of the height of the Y axis
- Be sure to label the X and Y axes appropriately
- Each bar shows how many data points fall within a range of X axis values
- See How to Lie with Statistics, by Darrell Huff

Histogram of Salary
[Histogram of CURRENT SALARY, with Frequency on the Y axis and salary bins from about 6,000 to 54,000 on the X axis; Std. Dev = 6830.26, Mean = 13767.8, N = 474.00]

Another Note on Histograms
- SPSS will define its own bar widths for a histogram, e.g. how wide the range of salary values is for each bar
- Later in the course, we'll look at how you can define your own variables to make predefined histogram bars

Pie Chart Histogram
- A histogram can also be made in the shape of a pie
- This should be limited to variables with a small number of possible values
[A *bad* pie chart histogram: a pie of CURRENT SALARY with dozens of individual salary values as slices (included just because it's colorful)]
[A better example: a pie chart of EDUCATIONAL LEVEL with slices for 8, 12, 14, 15, 16, 17, 18, 19, 20, and 21 years; this visually implies the percentages of data in each value]

Bookmobile Data
Number of stops made by each of 17 bookmobiles (A through Q):
Bookmobile:    A  B   C   D   E   F   G   H   I   J  K   L   M   N  O   P   Q
No. of stops:  6  9  10  14  16  17  14  16  14  10  9  14  14  16  9  17  16
Grouped as a frequency distribution (the same data as used in the standard-deviation example above and in the distributions table and histogram below):
Value of X (stops):       17  16  14  10   9   6
F (no. of bookmobiles):    2   4   5   2   3   1     (N = 17)
Bookmobile examples taken from Carpenter and Vasu (1978)

Bookmobile Distributions
Stops    f      %    CF (adding up from bottom)   CF (adding down from top)   Cumulative % (adding up)
  17     2   11.8                17                           2                       100
  16     4   23.5                15                           6                        88
  14     5   29.4                11                          11                        64
  10     2   11.8                 6                          13                        35
   9     3   17.6                 4                          16                        23
   6     1    5.8                 1                          17                         6

Histogram of Bookmobile Stops
[Histogram of the number of bookmobile stops, in bins from 5.0 to 17.5; Std. Dev = 3.43, Mean = 13.0, N = 17.00. A Python sketch that reproduces this histogram follows.]
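As an illustrative sketch of generating the same kind of histogram outside SPSS (assuming matplotlib is available), using the 17 bookmobile stop counts from the slides:

import matplotlib.pyplot as plt

# Number of stops made by each of the 17 bookmobiles (A through Q).
stops = [6, 9, 10, 14, 16, 17, 14, 16, 14, 10, 9, 14, 14, 16, 9, 17, 16]

# Bins of width 2.5 from 5.0 to 17.5, roughly matching the SPSS histogram above.
plt.hist(stops, bins=[5.0, 7.5, 10.0, 12.5, 15.0, 17.5], edgecolor="black")
plt.xlabel("Number of Bookmobile Stops")   # always label the axes
plt.ylabel("Frequency")
plt.title("Histogram of Bookmobile Stops")
plt.show()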
Normalizing Data
- Some data sets are not very close to a normal distribution
- Sometimes it helps to transform the independent variable by applying a math function to it, such as looking at log(x) (the logarithm of each x value) instead of just x
- In SPSS this can be done by defining a new variable, such as "log_x", then using Transform / Compute to calculate log_x = LG10(x), assuming that 'x' is the original variable
- Then generate a histogram showing the normal curve, to see if log_x is closer to a normal distribution (a code sketch of the same transform follows)
- Who cares if we have a normal distribution? Many tests in statistics can only be applied to a variable which has a normal distribution, so it's worth our while to transform the variable
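As a sketch of the same log transform outside SPSS, equivalent to computing log_x = LG10(x); the variable values here are hypothetical, chosen only to be positively skewed.

import math

# Hypothetical positively skewed variable (something salary-like).
x = [9000, 10500, 12000, 13500, 15000, 21000, 28000, 40000, 135000]

# LG10 in SPSS is the base-10 logarithm; math.log10 is the Python equivalent.
log_x = [math.log10(value) for value in x]

print([round(v, 3) for v in log_x])
# The transformed values are much less stretched out at the high end;
# a histogram of log_x would then be compared against the normal curve, as in the slides.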