Statistics Primer

Thomas P. Sturm, Ph.D.
University of St. Thomas
St. Paul, Minnesota

Contents:
Part I - The Nature of Statistics
Part II - Presentation
Part III - Descriptive Statistics
Part IV - Data Collection

Copyright © 1971-1997 Thomas P. Sturm
All rights reserved
MINITAB and Minitab for Windows are registered trademarks of Minitab, Inc.


The Nature of Statistics

- What is statistics
- Types of data
- Scales of measurement

Copyright 1993-97 Thomas P. Sturm

What is Statistics

Statistics is the SCIENCE of
a. COLLECTING,
b. Classifying / Presenting / Tabulating / Describing, and
c. INTERPRETING
NUMERICAL data.

All three areas will be covered:
- Collecting - Chapter 3
- Describing - Chapters 1 and 2
- Interpreting - bulk of the course

Course Goal: To produce good "statistical consumers"

Collecting Data

Data must be collected with a purpose - to find information about a designated group of people/places/things/events.

POPULATION - the collection of ALL objects that are of interest
- must be carefully defined
- must be able to determine under all circumstances whether something is in the population or not
  e.g. employees - current? fired? retired? part-time?

Problem: It's usually just too expensive (or impossible) to get the information for all objects in a population (a CENSUS).

SAMPLE - a subset of the population used to find information about the entire population
- more economical
- with care, can obtain an accurate picture of the population

So, to get information about the population, we take a sample and find information about the things in the sample.

Variables in Statistics

PROPERTY - an attribute that is relevant for all things in the population (and therefore the sample)
  e.g. height, weight, color, result of casting a die, beauty

VARIABLE - any characteristic that can be measured for all things in the population
  e.g. height (in inches), weight (in pounds), color (a word), # of spots on a die

OBSERVATION - a VALUE for a variable is assigned through a process of MEASUREMENT
  e.g. use a ruler to MEASURE a VALUE of 6'4" as the OBSERVED height of a basketball player

POSSIBLE VALUES - values that COULD be obtained
  e.g. 0 to 100% on an exam

OBSERVED VALUES - values that are actually obtained in the current instance
  e.g. 97%, 92%, 84%, 63% in a class of 4 students

Types of Data

QUALITATIVE
- ATTRIBUTE or CATEGORICAL data: useful only to place individuals into categories (e.g. Earthlings, Martians)

QUANTITATIVE
- DISCRETE: a finite set of values, e.g. number of students
- CONTINUOUS: an infinite set of values in a bounded range, e.g. height of students

But statistics only deals with NUMERICAL data (and MEASUREMENT assigns a numerical value to a VARIABLE), so, for QUALITATIVE data, part of the measurement process is to assign a number to each attribute value, e.g. SEX - 1=male, 2=female, etc.

Thus, as part of the measurement process, everything gets a number. But what can you DO with those numbers?

Scales of Measurement

Nominal Scale (Qualitative data), e.g. 1=male, 2=female
- come from qualitative (attribute) data
- can only count how many of each value you have to obtain FREQUENCY data
- cannot sort, add, subtract, multiply, or divide the numbers

Ordinal Scale (Ordinal data), e.g. 1=never, 2=occasionally, 3=frequently, 4=always
- come from a condensation of quantitative data where asking for specific numbers would not be accurate
- can sort in addition to count, 1 < 2
- cannot add, subtract, multiply, or divide the numbers

Interval Scale (Metric data), e.g. temperature in Fahrenheit
- come from quantitative values that are measured against arbitrary starting points
- can subtract in addition to sorting and counting: 24 outside, 72 inside, 48 degrees warmer inside
- cannot add, multiply, or divide the numbers

Ratio Scale (Metric data), e.g. number of courses taken, any FREQUENCY data, rates
- come from quantitative values that have "natural" zeroes
- 0 is meaningful: Pat took 6 courses, Chris took 2 courses, Pat took 3 times as many courses as Chris
- can perform all operations

Distribution/Variation

In general, not all of the measurements yield the same value. This could be because of different measurements of the same thing or measurement of different members of a sample. This is called VARIATION.

The values of the data have some sort of a DISTRIBUTION which characterizes where in the range of POSSIBLE values the OBSERVED values most frequently fall.

Much of descriptive statistics deals with finding simple ways (perhaps as simple as a single number) of describing the distribution.

Nominal and ordinal data allow the least amount of mathematical manipulation, so the description of nominal and ordinal data is limited to counting the frequencies of the observations (and sorting the observations if on an ordinal scale) and then presenting the counts.


Presentation

- Properties of a Table
- Bar charts
- Pie charts
- Graphs
- Statistical lying
- Tally sheets
- Frequency histograms and polygons
- Stem and Leaf diagrams

Copyright 1993-97 Thomas P. Sturm

Properties of a Table

Every statistical table worthy of our consumption should have, at a minimum, the following:

TITLE - states exactly and as succinctly as possible what the entries are
CAPTION - distinguishes one body of information from another
STUB - distinguishes each member of a series from another
BODY - contains the actual figures
UNITS of MEASUREMENT - somewhere in the table (caption, stub, or title) the units must be given (number, dollars, millions, etc.)
SOURCE - describes where the information was obtained

Without ALL of these, a table loses its credibility - because it is the kind statistical liars use.

Example Statistical Table

Governmental Per Capita Tax Revenue (Dollars)

Year    State and Local    Federal    Total
1960        200.66          427.81    628.47
1962        223.62          442.69    666.32
1964        249.75          473.03    722.78
1965        266.11          483.49    749.61

Source: Statistical Abstract of the U.S., 1966, 87th ed. (Washington, D.C., 1966), p. 417.

Presenting Categorical Data

Categorical (nominal) data can only be counted. You cannot "average" it, subtract it, or divide it. However, you can present it in a wide variety of ways.

A survey of 120 University of St. Thomas students on the question: "Given the choice between the two, do you prefer to eat at Scooters or in the Grill?"

Response      Frequency
Grill             74
Scooters          37
No opinion         9
Total            120

Source: Postoffice box mail survey taken by QMCS 220 students during spring semester, 1993.
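As a rough illustration of how the counts in the survey table can be presented in another form, the sketch below divides each frequency by the total to obtain a relative frequency. The counts come from the table above; the function name relative_frequencies and the print layout are assumptions made for this example only.

```python
# Frequencies from the Grill/Scooters survey table.
frequencies = {"Grill": 74, "Scooters": 37, "No opinion": 9}


def relative_frequencies(counts):
    """Return each count divided by the total number of responses."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


rel = relative_frequencies(frequencies)

# Print a small table: response, frequency, relative frequency.
for label, share in rel.items():
    print(f"{label:<12}{frequencies[label]:>5}{share:>8.3f}")
```

Counting and dividing a count by the total are the only arithmetic steps used here, which is consistent with nominal data supporting frequency counts and nothing more.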
The stub/caption/body of the table could have been presented in a variety of other ways, e.g.:

- It could have included relative frequencies as well

Response      Frequency    Relative Frequency
Grill             74             .62
Scooters          37             .31
No opinion         9             .07
Total            120            1.00

Histograms and Bar Charts

- It could have been done as a (vertical) histogram

[Figure: vertical bar chart of the responses - Grill 74, Scooters 37, No Opinion 9; vertical axis from 0 to 80]

- It could have been done with a horizontal bar chart

[Figure: horizontal bar chart of the responses - Grill 74, Scooters 37, No Opinion 9; horizontal axis from 0 to 80]

Pie Charts

- It could have been done with a pie chart

[Figure: pie chart - Grill 61%, Scooters 31%, No Opinion 8%]

- It could have been done with a picture graph

[Figure: picture graph for Grill, Scooters, and No Opinion]

Graphing a Statistical Lie

- The following is NOT a valid way of presenting the same information:

[Figure: misleading graph of the Grill, Scooters, and No Opinion responses; image not recoverable]

Becoming a Better Statistical Consumer

1. Consider the source
   How many details of the study are given?
   - study with full details included
   - study with details available
   - informed opinion
   - opinion poll
   Who did the study?
   - name identification, independent organization
   - audited and published by respected publisher
   - special interest organizations
   - self-interest groups
   How stable is the data?
   - historical
   - current
   - forecast

2. Does it make sense?
   - Is it an "offhand" percentage?
   - Is it a "less than" comparison? Less than what?
   - Does it lack internal consistency? e.g. percents of 20 items should be multiples of 5
   - Does it have too much precision, regularity, or even #s?
   - Is it plausible?
   - Is the arithmetic correct?

3. Are the conclusions correct?
   - Are the survey results consistent with the conclusions?
   - Have the definitions remained consistent over time?
   - Are there wrong interpretations of right results?
   - Are there confusing counts and percentages?
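The internal-consistency check in item 2 (percents of 20 items should be multiples of 5) can be sketched in code. The function below is a hypothetical helper, not part of the primer: it asks whether any whole-number count out of n items could round to the reported percentage. The tolerance of 0.5 assumes percentages reported to the nearest whole percent.

```python
def consistent_with_sample_size(percent, n, tolerance=0.5):
    """Return True if some whole count k out of n is within
    `tolerance` percentage points of the reported percent."""
    return any(abs(100 * k / n - percent) <= tolerance for k in range(n + 1))


# With 20 items, only multiples of 5 percent are possible.
print(consistent_with_sample_size(35, 20))   # 7 of 20 is exactly 35%
print(consistent_with_sample_size(37, 20))   # no count of 20 gives about 37%
# With 120 responses, 62% is consistent (74 of 120 is about 61.7%).
print(consistent_with_sample_size(62, 120))
```

A reported percentage that fails this check for the stated sample size is a signal to look more closely at the source, not proof of an error.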
Presenting Ratio Data

Consider the following data:

Survey of number of hours taken to learn the skill of sending and receiving international Morse code at a rate of 13 words per minute:

 80  90  88  73  98  89  52
 69  63  97  78  88  92  83
 78 109  98  64  75  94  83
100  76  81  82  67  85  85
100  70  96  61  75  95  58
105  70  96  81  88 108  64

Generally, we want to see how the OBSERVATIONS are spread out across the range of POSSIBLE VALUES.

The data above in random order is of little use.

Tally Sheet

Put the data into groups of adjacent numbers, describing the situation by enumerating the number of occurrences in each range of numbers or CLASS.

The number of occurrences is called the FREQUENCY or class frequency.

We would like to use "convenient" numbers to divide the data into groups, and we would like to end up with 6 to 14 intervals for under 100 observations, 7 to 15 intervals for over 100 observations.

Class      Lower class   Upper class   Class
interval   limit         limit         mark     Tally             Frequency
50-59          50            59         54.5    //                    2
60-69          60            69         64.5    ///// /               6
70-79          70            79         74.5    ///// ///             8
80-89          80            89         84.5    ///// ///// //       12
90-99          90            99         94.5    ///// ////            9
100-109       100           109        104.5    /////                 5

Note: No overlap between the groups, groups of equal size, difference between upper class limit of one group and lower class limit of the next is the "smallest unit" of measure.

Frequency Histogram

[Figure: frequency histogram of the Morse code data; vertical axis is frequency from 0 to 12, horizontal axis is labeled with class marks]

Note: Adjacent vertical bars are connected (compare with disconnected bars for categorical data). Values on horizontal axis refer to midpoint of vertical bar.

Frequency Polygon

[Figure: frequency polygon of the Morse code data; vertical axis is frequency from 0 to 12, horizontal axis is labeled with class marks]

Note: Must plot data for two classes beyond the range of the data (in this case 40-49 and 110-119) that have a frequency of 0. Must plot the frequency at the class mark.
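The tally sheet can be reproduced programmatically. The sketch below groups the Morse code data into classes of width 10 starting at 50 and reports the class limits, class mark, tally, and frequency. The function name frequency_table and its parameters are assumptions for illustration; it also assumes whole-number data with no value below the starting limit.

```python
# Hours needed to learn Morse code at 13 words per minute (42 observations).
hours = [
    80, 90, 88, 73, 98, 89, 52,
    69, 63, 97, 78, 88, 92, 83,
    78, 109, 98, 64, 75, 94, 83,
    100, 76, 81, 82, 67, 85, 85,
    100, 70, 96, 61, 75, 95, 58,
    105, 70, 96, 81, 88, 108, 64,
]


def frequency_table(data, start, width):
    """Return (lower limit, upper limit, class mark, frequency) per class.

    Assumes integer data and that no value is below `start`.
    """
    counts = {}
    for value in data:
        lower = start + ((value - start) // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    rows = []
    for lower in range(start, max(data) + 1, width):
        upper = lower + width - 1
        rows.append((lower, upper, (lower + upper) / 2, counts.get(lower, 0)))
    return rows


table = frequency_table(hours, start=50, width=10)
for lower, upper, mark, freq in table:
    print(f"{lower:>3}-{upper:<3}  {mark:>5}  {'/' * freq:<12}  {freq:>2}")
```

The class mark is computed as the midpoint of the lower and upper class limits, matching the 54.5, 64.5, ... values on the tally sheet.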
Stem and Leaf Diagram

 5 | 28
 6 | 134479
 7 | 00355688
 8 | 011233558889
 9 | 024566788
10 | 00589

or, with each stem split in two (* holds leaves 0-4, - holds leaves 5-9):

 5* | 2
 5- | 8
 6* | 1344
 6- | 79
 7* | 003
 7- | 55688
 8* | 011233
 8- | 558889
 9* | 024
 9- | 566788
10* | 00
10- | 589

Another Stem and Leaf Diagram

Consider the following data:

10 15 12 11 10 15 19 14 13 13
13 25 22 17 18 14 20 24 20 22
16 18 19 21 23 18 17 19 19 15
20 11 16 14 13 22 10 18 16 12

Stem and Leaf Diagram to obtain 6 to 14 groups (* holds leaves 0-1, T leaves 2-3, F leaves 4-5, S leaves 6-7, - leaves 8-9):

1* | 00011
1T | 223333
1F | 444555
1S | 66677
1- | 88889999
2* | 0001
2T | 2223
2F | 45

Minitab Commands to Learn (Command Driven Version)

HELP
EXIT
PAPER
NOPAPER
READ
PRINT
NAME
SET
LET
HISTOGRAM
DOTPLOT
STEM-AND-LEAF
DESCRIBE
MEAN
...

Minitab Commands to Learn (Menu Driven Version)

File: Open, Save, Save as, Worksheet description, Print window
Manipulate: Delete rows, Erase variables
Calculate: Column statistics
Statistics: Basic statistics, Descriptive statistics, Graphs, Correlation, Covariance, Normality Test, EDA (Exploratory Data Analysis), Stem-and-leaf, Boxplot
Graph: Plot (1 variable vs. another), Chart (sum of 1 variable vs. another), Histogram (1 variable with values), Boxplot (category vs. values), Pie Chart (2 varieties)
Help


Descriptive Statistics

- Measures of Central Tendency
- Measures of Dispersion
- Box Plots

Copyright © 1995 - 1997 Thomas P. Sturm

Home Run Data

Babe Ruth (sorted): 22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60
Roger Maris (sorted): 8, 13, 14, 16, 23, 26, 28, 33, 39, 61

Histograms:

[Figure: frequency histogram of Ruth's home runs; horizontal axis 20 to 60, vertical axis frequency 0 to 4]
[Figure: frequency histogram of Maris's home runs; horizontal axis 10 to 60, vertical axis frequency 0 to 3]

Stem-and-Leaf Plots

Stem-and-leaf of Ruth    N = 15
Leaf Unit = 1.0

   1    2 | 2
   2    2 | 5
   3    3 | 4
   4    3 | 5
   6    4 | 11
  (5)   4 | 66679
   4    5 | 44
   2    5 | 9
   1    6 | 0

Stem-and-leaf of Maris   N = 10
Leaf Unit = 1.0

   1    0 | 8
   4    1 | 346
  (3)   2 | 368
   3    3 | 39
   1    4 |
   1    5 |
   1    6 | 1

Side-by-side Stem and leaf plots:

Stem    Ruth       Maris
 0                 8
 1                 346
 2      25         368
 3      45         39
 4      1166679
 5      449
 6      0          1

Measures of Central Tendency

Mode - most frequently occurring observation from a set of grouped data.
  Ruth: about 45; Maris: about 20

Mean - (arithmetic mean) - the sum of the observations divided by the number of observations.
  Ruth hit 659 home runs in 15 years or 659/15 = 43.93
  Maris hit 261 home runs in 10 years or 261/10 = 26.1

Median - the number in the center from a set of sorted data
  Ruth: 22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60
  Maris: 8, 13, 14, 16, 23, 26, 28, 33, 39, 61
  Thus for Ruth the median is 46.00; for Maris the median is (23 + 26)/2 = 24.5

Midrange - the average of the lowest and the highest observation
  Ruth: (22 + 60)/2 = 41; Maris: (8 + 61)/2 = 34.5

5% Trimmed mean - the mean after trimming off the highest 5% and lowest 5% of the values
  Ruth: 577/13 = 44.38; Maris: 192/8 = 24

Resistance to outliers:
- The midrange is the least resistant.
- The mean is also not resistant.
- The trimmed mean is resistant to up to 5% outliers at each end.
- The mode is generally resistant unless the outliers form a cluster.
- The median is totally resistant.
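The measures of central tendency above can be checked with a short Python sketch using the standard library. The helper names midrange and trimmed_mean are hypothetical. The number of values trimmed from each end is taken as 5% of n rounded to the nearest whole number (halves rounded up), which is an assumption chosen because it reproduces the primer's figures of one value trimmed from each end for both players.

```python
from statistics import mean, median

ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]


def midrange(data):
    """Average of the lowest and highest observation."""
    return (min(data) + max(data)) / 2


def trimmed_mean(data, percent=5):
    """Mean after dropping `percent` of the values from each end.

    The count dropped per end is percent% of n, rounded to the nearest
    integer with halves rounded up (an assumption, see the lead-in).
    """
    ordered = sorted(data)
    k = (percent * len(ordered) + 50) // 100
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


for name, data in (("Ruth", ruth), ("Maris", maris)):
    print(name, round(mean(data), 2), median(data),
          midrange(data), round(trimmed_mean(data), 2))
```

The mode is left out because the primer reads it from grouped data ("about 45", "about 20") instead of computing it from the raw values.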
Measures of Dispersion

Range - the difference between the largest and smallest observation
  Ruth: 60 - 22 = 38; Maris: 61 - 8 = 53
  No resistance to outliers

Interquartile range (IQR) - the difference between the third quartile and the first quartile
  Ruth: Q1 = 35, Q3 = 54, Q3 - Q1 = 19
  Maris: Q1 = 14, Q3 = 33, Q3 - Q1 = 19
  Very high resistance to outliers

Five-Number summary - Minimum, Q1, Median, Q3, Maximum
  Ruth: 22, 35, 46, 54, 60; Maris: 8, 14, 24.5, 33, 61

Outliers - Observations more than 1.5 x IQR below Q1 or more than 1.5 x IQR above Q3

Boxplot - Ends of box are at the quartiles, line within the box marks the median, two lines (whiskers) extend to smallest and largest observations.

Modified Boxplot - Ends of box are at the quartiles, line within the box marks the median, two lines (whiskers) extend to smallest and largest observations that are not outliers. Outliers are plotted individually and separately.

Boxplots

[Figure: boxplot of Ruth; horizontal axis 20 to 60]
[Figure: boxplot of Maris; horizontal axis 10 to 60]

Variance and Standard Deviation

These measures of dispersion are only valid when the mean is used as the measure of central tendency. They are not as resistant to outliers as the median, but have the best theoretical properties when the distribution is "well behaved".

Variance: Intuitively, the adjusted average square of the differences from the mean. This is a theoretical measure.
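The measures of dispersion defined in this section can be sketched in Python as follows. All function names are hypothetical. The quartiles here are taken as the medians of the lower and upper halves of the sorted data, an assumption that reproduces the hand-computed quartiles above; statistical packages may interpolate differently and report slightly different values. The last function uses the sum-of-squares shortcut for the variance, which is algebraically equivalent to the definition.

```python
from math import sqrt
from statistics import median

ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]


def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves
    (the middle value is excluded when n is odd)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def outliers(data):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    q1, q3 = quartiles(data)
    fence = 1.5 * (q3 - q1)
    return [x for x in data if x < q1 - fence or x > q3 + fence]


def variance(data):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)


def variance_shortcut(data):
    """Same quantity via sum of squares minus (square of sum) / n."""
    n = len(data)
    total = sum(data)
    sum_of_squares = sum(x * x for x in data)
    return (sum_of_squares - total * total / n) / (n - 1)


for name, data in (("Ruth", ruth), ("Maris", maris)):
    q1, q3 = quartiles(data)
    print(name, max(data) - min(data), q3 - q1, outliers(data),
          round(variance(data), 1), round(sqrt(variance(data)), 2))
```

Under this quartile rule, Maris's 61 falls just inside the upper fence of 33 + 1.5 x 19 = 61.5, so neither data set has an outlier by the 1.5 x IQR criterion.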
Ruth: s² = (1/14)[(22-43.93)² + (25-43.93)² + (34-43.93)² + (35-43.93)² + (41-43.93)² + (41-43.93)² + (46-43.93)² + (46-43.93)² + (46-43.93)² + (47-43.93)² + (49-43.93)² + (54-43.93)² + (54-43.93)² + (59-43.93)² + (60-43.93)²]
  = (480.9 + 358.3 + 98.60 + 79.74 + 8.585 + 8.585 + 4.285 + 4.285 + 4.285 + 9.425 + 25.70 + 101.4 + 101.4 + 227.1 + 258.2)/14
  = 1771/14 = 126.5

Maris: s² = (1/9)[(8-26.1)² + (13-26.1)² + (14-26.1)² + (16-26.1)² + (23-26.1)² + (26-26.1)² + (28-26.1)² + (33-26.1)² + (39-26.1)² + (61-26.1)²]
  = (327.6 + 171.6 + 146.4 + 102.0 + 9.610 + .01000 + 3.610 + 47.61 + 166.4 + 1218)/9
  = 2193/9 = 243.7

Standard deviation: The square root of the variance. This is an interpretive measure.
  Ruth: √126.5 = 11.25; Maris: √243.7 = 15.61

Computation of the Variance as a Spreadsheet

Consider the data for Maris. This can be placed in tabular form for ease of computation (by hand, by spreadsheet, or by Minitab). The work is done a column at a time, from left to right. The results from each column, in general, are used in the computations in the next column.

         $x$    $x - \bar{x}$    $(x - \bar{x})^2$
           8        -18.1            327.6
          13        -13.1            171.6
          14        -12.1            146.4
          16        -10.1            102.0
          23         -3.1              9.610
          26          -.1               .01000
          28          1.9              3.610
          33          6.9             47.61
          39         12.9            166.4
          61         34.9           1218.
Sum      261          0.0           2193.

$\bar{x} = \frac{\sum x}{n} = \frac{261}{10} = 26.1$

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{2193}{9} = 243.7$

Shortcut formula for variance

Use the shortcut formula for variance if you are going to program the calculation or do it by hand.

1. Calculate sum, square the sum.
2. Calculate square of each observation, sum the squares.
3. Divide the result in part 1 by the number of observations.
4. Subtract the result in part 3 from the result in part 2.
5. Divide the result in part 4 by one less than the number of observations.

Ruth:
1. Sum is 659. Square of sum is 434281.
2. 22² + 25² + 34² + 35² + 41² + 41² + 46² + 46² + 46² + 47² + 49² + 54² + 54² + 59² + 60²
   = 484 + 625 + 1156 + 1225 + 1681 + 1681 + 2116 + 2116 + 2116 + 2209 + 2401 + 2916 + 2916 + 3481 + 3600
   = 30723
3. 434281/15 = 28952
4. 30723 - 28952 = 1771
5. 1771/14 = 126.5

Note: Extra accuracy is needed in steps 1 to 3 because it is anticipated that the two numbers subtracted in part 4 will be "nearly" equal.

Output from the Minitab DESCRIBE Command

Variable     N     Mean   Median  Tr Mean    StDev  SE Mean
Ruth        15    43.93    46.00    44.38    11.25     2.90
Maris       10    26.10    24.50    24.00    15.61     4.94

Variable      Min      Max       Q1       Q3
Ruth        22.00    60.00    35.00    54.00
Maris        8.00    61.00    13.75    34.50

Variable - Name of the variable (or C# if unnamed)
N - Number of observations
Mean - Arithmetic mean
Median - Median
Tr Mean - 5% Trimmed mean
StDev - Standard deviation
SE Mean - Standard error of the mean (not covered until Chapter 4)
Min - Minimum value
Max - Maximum value
Q1 - First quartile
Q3 - Third quartile


Data Collection

- Measurement
- Sampling Methods
- Survey Design
- Designing Experiments

Copyright 1993-97 Thomas P. Sturm

Experimentation / Measurement

- A method of determining a specific value for a variable
- To have VALIDITY, must ensure that the variable used to represent the property is relevant or appropriate.
  e.g. asking for height in gallons is not appropriate
- To accurately portray the characteristics of the population, use an INSTRUMENT that possesses the following characteristics:

UNBIASED - Bias is a systematic tendency to misrepresent (overstate or understate the true value) the data in some way
  e.g. AGE on the survey - biased 1/2 year low

PRECISE - Lack of precision causes observed values obtained through the measurement process to be somewhat distant or scattered from their "true" value
  e.g. EARNINGS LAST SUMMER - how many responses would stand up to an IRS audit for accuracy

RELIABLE - Unreliable results are those which would be quite different if the experiment/observation were made again under "identical" circumstances
  e.g. PICK A NUMBER FROM 1 to 10 - how many would pick the same number again and again

Sampling

Sampling is the process by which we select the sample that we are going to measure. The sampling itself must be done to provide an unbiased, reliable, and precise estimate of the values of the population it is intended to represent.

A CENSUS is actually an attempt to "sample" the entire population, and is generally expensive
- could be impossible (flash bulb testing??)
- could be inaccessible (homeless??)
IF you expend enough effort to get everything, a census is the most accurate, reliable, and unbiased

A CONVENIENCE SAMPLE is a sample of whatever is the easiest to measure
- students in a class
- what you happen to have on hand
generally the most prone to inaccuracy and unreliability, and very likely to be biased

A SELF-SELECTED SAMPLE is a sample of people who "choose themselves" to be in the survey
- phone-ins to 900 numbers
- mail-back surveys without follow-up
generally the most prone to bias, and very likely to be inaccurate and unreliable

Sampling Methods

The following sampling methods, when used with care within their limits of applicability, can produce unbiased, reliable, and precise results.

SIMPLE RANDOM SAMPLE (SRS) - every member of the population has exactly the same probability of being selected
- can be hard to make the probabilities exactly equal
- can miss an accurate description of "rare" subsets
- could still be expensive

STRATIFIED SAMPLE - divide the population into "strata" and then perform an SRS on each stratum
- e.g. healthy adults vs. those with a rare disease
- need to know relative sizes of the strata

SYSTEMATIC SAMPLE - start at a random point, and then select every kth item
  e.g. for a sample of 1/10th of the population at an event that issued numbered tickets, pick at random a digit from 0 to 9, and then include everyone whose ticket number ended with that digit
- could be just as expensive as SRS

CLUSTER SAMPLE - pick, at random, areas or regions or groups of the population, then perform a census within each group
- least expensive of the above methods
- must have enough areas to avoid unreliability
- must carefully check results between groups for bias

Measurement Errors

- Reporting errors (in 1950 survey, average age of women over 40 was under 40 years old)
- Recording errors (random transcription errors)
- Unit of measurement errors (some in dollars, some in cents; some per unit, some per six-pack, some per case of 24)
  Suggestion: pick a convenient unit of measurement, perhaps through the use of a consistently applied coding technique
- Processing errors (performing mathematical operations inappropriate for the scale of measurement of the data)
- Non-response errors (no response from selected groups)
- Errors in doing the sampling
- Errors in adjusting data from stratified samples
  - Must accurately classify each response into the appropriate stratum
  - Must know the proportion of people actually in each stratum
  - Must properly "scale back" the responses from the "overrepresented" strata to derive population statistics

Survey Design

When designing a survey, you must look ahead to the administration of the survey, the collection of results, the tabulation of results, the analysis of results, and the interpretation of results. To do this successfully, you must ask some basic questions:

- Why am I doing the survey? What specific facts do I hope to learn more about?
  What variables might be used to measure those facts and what variables might influence those facts?
- Who am I going to survey? What is my population? Am I surveying people, or doing experiments with physical objects? Can I realistically obtain the kind of sample I want from that population at reasonable cost?
- What questions will I ask? Some questions need to address the specific facts I hope to learn more about, while other questions need to "consider the source." These latter questions are called "demographic" questions. For example, if I want to learn more about pop consumption on campus, in addition to asking questions about how much pop is consumed, when it is consumed, where it is purchased, where it is consumed, diet or regular, etc., I might also want to ask demographic questions such as age, class year, sex, weight, day student or boarder, etc.
- The remaining questions are the form of the survey, how many surveys to administer, and how to ask the questions.

Form of the Survey

Direct measurement in the laboratory:
+ Most accurate and reliable
+ Least subject to unknown influences
- Can be expensive, many times impractical

Direct personal interviews:
+ Consistent measurement if interviewers are well trained
- Time consuming
- Hard to get a random sample

Telephone interviews:
+ Somewhat consistent measurement with skilled interviewers
+ Less expensive than lab setting or face-to-face interviews
- Lower response rate due to hang-ups and no answers even after many callbacks
- Incomplete surveys due to hang-ups

Mail surveys:
+ Quick
+ Least expensive
- Lowest response rate (10% to 20%)
- Great deal of opportunity to misinterpret questions
- No idea if person is informed about the subject you are studying

How Many Surveys to Administer

Determining the number of surveys to administer depends upon the following factors:

What form of survey is being done?
- you need to send out about 10 times as many mail surveys as you would need lab participants

For nominal demographic data, how many categories does the data divide into (maximum over all questions)?
- you need 5 times as many completed surveys if you have a demographic question that has 10 possible nominal responses than if you have a survey all of whose demographic questions on a nominal scale are logically divided into two categories (e.g. male/female, boarder/day student, graduate/undergraduate, yes/no, etc.)

How accurate do the results need to be?
- the more data, the more accurately the measurement can be done

How much statistics do you know and how much professional statistical assistance can you afford?
- the SMALLER the sample, the MORE work it is to accurately draw conclusions from the data
- ideally, you want 30 completed surveys PER DEMOGRAPHIC GROUP for the demographic question measured on the nominal scale with the most possible response values.

Example Calculation of Number of Surveys to Administer

Example 1: We are doing a laboratory experiment in which all of the demographic questions on a nominal scale are yes/no. We will need to ensure 30 yes responses and 30 no responses to each demographic question. However, since we are controlling the demographics, we could get by with as few as 60 well-crafted experiments.

Example 2: We are doing a long mail survey in which one of the demographic questions is on a nominal scale and has 10 possible responses. We need 300 completed surveys at a minimum to get 30 for each value on the nominal scale. However, we have no control over the response, so we could need up to twice that many (600) completed surveys. Mail survey response is in the 10% to 20% range, but our survey is long, so we should expect the low end of that range. Since we could also use the additional responses if they come in, we assume a 10% response rate. This means we should send out about 6000 surveys.
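The arithmetic in the two examples can be sketched as a small Python function. The function name surveys_to_send and its parameters are hypothetical; the defaults (30 per group, a safety factor of 2 for uncontrolled responses, a 10% response rate) are the rule-of-thumb figures from Example 2, not universal constants.

```python
def surveys_to_send(categories, per_group=30, imbalance_factor=2,
                    response_percent=10):
    """Estimate how many surveys to administer.

    categories       - largest number of responses to any nominal
                       demographic question
    per_group        - completed surveys wanted in each category
    imbalance_factor - multiplier for having no control over who responds
    response_percent - expected response rate, as a whole-number percent
    """
    completed_needed = categories * per_group * imbalance_factor
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-completed_needed * 100 // response_percent)


# Example 1: controlled lab experiment, yes/no demographics, everyone responds.
print(surveys_to_send(2, imbalance_factor=1, response_percent=100))
# Example 2: long mail survey, 10 nominal categories, 10% response rate.
print(surveys_to_send(10))
```

Changing response_percent to 20 shows how strongly the mailing size depends on the assumed response rate.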
In both of these examples, our ultimate results are likely to have comparable (low) accuracy, but the statistical analysis required (because of 30 in each group) will be manageable without advanced statistical methods.

How to Ask (Phrase) Survey Questions

Many times the measurement process does more to determine the scale of measurement than the nature of the property being measured. Always strive for ratio data or as close to it as possible.

The method of asking a question, establishing a "ruler," or selecting units of measurement can have a dramatic effect on the scale of measurement (and the measurement errors) of the resulting values.

Examples:

Temperature:
- Fahrenheit - interval
- Kelvin - ratio

Age:
- Young/Old check boxes - nominal (unreliable??)
- Traditional College Age/Older check boxes - ordinal
- Birthdate - can be converted to ratio without bias

In general, try to control the responses to things that can easily be converted to numerical values on as high a scale of measurement as possible, and try to provide as much "calibration" as possible so that the values are being measured as consistently as possible between subjects.

Miscellaneous Survey Mechanics

Different respondents may interpret the question differently. This variation needs to be eliminated with totally unambiguous language. This is best tested by piloting a survey and then asking respondents how they interpreted the question.

All respondents must understand the question. The variation in reading ability needs to be factored out by targeting the reading level to below grade 8 level difficulty. This can be measured by computer software.

Respondents may "give up" in the middle of a survey. Make sure the important questions are asked first, and the less important questions (less important demographics, for example) are at the end.
Designing Experiments

We may be able to measure a phenomenon by subjecting different experimental units (usually called subjects in this context, especially when they are human) to different stimuli (usually called treatments in this context).

This frequently takes a form similar to finding relationships in categorical data, namely, we attempt to explain the value of a response variable by noting differences in an explanatory variable.

The difference in designing experiments is that as designers we have control over the values of the explanatory variables, which are called factors in this context.

The experiment is performed by combining specific values (usually called levels in this context) for each of the explanatory variables, and measuring the resulting response.

We need for each factor a "control group" that measures the "normal" outcome when "no treatment" is given.

We need to eliminate confounding variables due to chance. This is generally done by placing subjects into experimental groups at random.

Where humans are involved (in any way) we need to eliminate subjective bias by not allowing subject or researcher to know what treatment is being received. This is known as a double-blind experiment.

Calculating the Number of Participants

The crucial factor in experimental design is to avoid combinatorial explosion. The most common method of designing an experiment is called block design. In a block design, we place a fixed number of people in every category combination. Ideally, we would like about 30 people in each category combination.

- If we have 1 factor with 2 levels, we need 2 x 30 = 60 participants.
- If we have 2 factors with 2 levels, we need 2 x 2 x 30 = 120 participants.
- If we have 2 factors, the first with 2 levels and the second with 5 levels, we need 2 x 5 x 30 = 300 participants.
- If we have 3 factors, each with 4 levels, we need 4 x 4 x 4 x 30 = 1920 participants.
- If we have 3 factors, the first with 5 levels, the second with 7 levels, and the third with 11 levels, we need 5 x 7 x 11 x 30 = 11550 participants.
- If we have 4 factors, each with 3 levels, we would need 3 x 3 x 3 x 3 x 30 = 2430 participants.
- If we have 7 factors, the first of which has 3 levels, and each of the other 6 factors has 20 levels, we would need 3 x 20 x 20 x 20 x 20 x 20 x 20 x 30 = 5,760,000,000 participants, or more people than there are on earth!

It is therefore not difficult to understand why most experimental designs involve only 1, 2, or at most 3 factors.
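The participant counts above all follow one rule: multiply the number of levels of every factor together, then multiply by the number of people wanted in each combination. A minimal Python sketch, with the hypothetical function name participants_needed and the primer's figure of 30 per combination as the default:

```python
from math import prod


def participants_needed(levels_per_factor, per_cell=30):
    """Participants required when every combination of factor levels
    receives `per_cell` people."""
    return prod(levels_per_factor) * per_cell


designs = [
    [2],             # 1 factor, 2 levels
    [2, 2],          # 2 factors, 2 levels each
    [2, 5],          # 2 factors, 2 and 5 levels
    [4, 4, 4],       # 3 factors, 4 levels each
    [5, 7, 11],      # 3 factors, 5, 7, and 11 levels
    [3] * 4,         # 4 factors, 3 levels each
    [3] + [20] * 6,  # 7 factors, 3 levels then 20 levels each
]
for levels in designs:
    print(levels, participants_needed(levels))
```

Because the count is a product over factors, each added factor multiplies the requirement, which is the combinatorial explosion the section warns about.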