1. THE NATURE OF PROBABILITY AND STATISTICS

Objectives: At the end of this chapter, the students are expected to:
1. Define statistics;
2. Differentiate descriptive and inferential statistics;
3. Distinguish primary and secondary data;
4. Make a distinction between qualitative and quantitative data;
5. Identify discrete and continuous data;
6. Classify data according to variable type and appropriate level of measurement; and
7. Discuss some applications of statistics.

Introduction

Decision makers make better decisions when they use all available information in an effective and meaningful way. The primary role of statistics is to provide decision makers with methods for obtaining and analyzing information to help make these decisions. Statistics is used to answer long-range planning questions, such as when and where to locate facilities to handle future sales.

The word statistics is derived from the Latin word status, meaning “state”. In the beginning, statistics involved the compilation of data and graphs describing various aspects of a state or country. The word statistics means different things to different people. To some, statistics means the actual numbers derived from data; others refer to statistics as a method of analysis. Specifically, statistics is defined as the science of collecting, organizing, presenting, analyzing and interpreting numerical data for the purpose of assisting in making a more effective decision. Statistical methods are vital tools in much research in education, psychology, medicine, business, agriculture, and other disciplines.

Types of Statistics

Statistics is a tool which helps us develop general and meaningful conclusions that go beyond the original data. There are two types of statistical analysis: Descriptive and Inferential (or Inductive) Statistics.

1. Descriptive Statistics comprises all the methods used to collect, organize, summarize or present data, usually to make the data easier to understand.
It is concerned with summary calculations such as averages and percentages, and with the construction of graphs, charts and tables.

2. Inferential Statistics is concerned with the formulation of conclusions or generalizations about a population based on an observation or a series of observations of a sample drawn from that population. It consists of performing hypothesis testing, determining relationships among variables, and making predictions. For example, the average family income of the residents in Region 2 can be estimated from figures obtained from a few hundred families (the sample).

Quantitative and Qualitative Variables or Data

In doing research, we first have to define the variables relevant to the data. A variable is an item of interest that can take on many different values, while a collection of such values is called data. If a given value does not vary (is fixed), it is called a constant. There are two major classifications of variables: qualitative and quantitative.

1. Qualitative Variables are nonnumeric variables and cannot be measured. Examples include gender (male, female), religious affiliation (Roman Catholic, Iglesia ni Cristo, Methodist, etc.), and ethnicity (Ilocano, Tagalog, Ibanag, etc.).

2. Quantitative Variables are numerical variables and can be measured. Examples include the balance in your checking account and the number of children in your family. Some quantitative variables can take on only specific or isolated values along a scale; for example, the number of children in a family may be 1, 2, 3, or any other whole number, but it can never be 1.25 or 0.5. Such a variable has values that can only be obtained by counting and is referred to as a discrete or discontinuous variable.

Prepared by MARIANNE JANE ANTOINETTE D. PUA, M.S., 2022

Specifically, quantitative variables can be ordered and ranked. They can be classified into two groups: Discrete and Continuous.
Discrete variables take values that are obtained by counting. The results are whole numbers. For example, the number of students in the room.

Continuous variables take values that are obtained by measuring. The results can be any value between two specific values. For example, if you measure the height of every student in the room, you could get any number between two reasonable limits. So height is a continuous variable.

Levels of Measurement

Variables can also be classified according to the level of measurement. There are four levels of measurement: Nominal, Ordinal, Interval and Ratio.

1. Nominal Data: The weakest level of measurement. Names or numbers are used only to represent an item or characteristic. Examples include names, gender, religious affiliation, civil status, and college majors. Note that such data should not be treated as numerical, since relative size has no meaning.

2. Ordinal or Rank Data: These can be ordered or ranked, but a specific difference between the levels cannot be determined. For example, a performance rating (Outstanding, Very Satisfactory, Satisfactory, Poor) can be ordered. You know that Outstanding is higher than Very Satisfactory and that Very Satisfactory is higher than Satisfactory, but there is no exact difference between any two of them. For example, the grades behind Outstanding and Very Satisfactory may be close (4.65 and 4.45) or far apart (5.00 and 4.25), so the exact difference cannot be determined.

3. Interval Data: These can be ordered and have an exact difference between any two units, but have no meaningful zero or starting point. For example, temperature (in degrees Celsius or Fahrenheit) is interval data: the values can be ordered and there is an exact difference between two degrees, but zero does not mark a starting point, since there can be temperatures below zero.

4. Ratio Data: This is the highest level of measurement and allows for all basic arithmetic operations, including division and multiplication.
Data at this level can be ordered, have exact differences between units, and have a meaningful zero. Things that are counted are usually ratio level; for example, business data such as cost, revenue and profit.

Data Collection

Data can be collected in various ways:
1. Focus Group
2. Telephone Interview
3. Mail Questionnaires
4. Door-to-Door Survey
5. Mall Intercept
6. New Product Registration
7. Personal Interview
8. Experiments

Sources of Data

1. Secondary Data: Data which are already available. For example, ISU enrollment data. Secondary data are less expensive to obtain; however, they may not satisfy the researcher’s need.
2. Primary Data: Data which must be collected.

Sampling Techniques

Sampling techniques are used when only a part of the population is to be surveyed. If interviewing the whole population would take too long or be too expensive, a sample is used. If a sample is chosen so that it correctly represents the population, it is called unbiased; if it does not represent the whole population, it is called biased. There are many ways to collect a sample, statistical or nonstatistical. The most commonly used methods are:

A. Statistical Sampling:

1. Simple Random Sampling: This ensures that all elements of the population have an equal opportunity of being selected for the sample.

2. Stratified Random Sampling: This is obtained by selecting simple random samples from strata (mutually exclusive sets). Some of the criteria for dividing a population into strata are: Gender (male, female); Age (under 18, 18 to 28, 29 to 39); Occupation (blue-collar, professional, other).

3. Cluster Sampling: This is a simple random sample of groups, or clusters, of elements. Cluster sampling is useful when it is difficult or costly to generate a simple random sample.
For example, to estimate the average annual household income in a large city we might use cluster sampling, because simple random sampling would require a complete list of the households in the city from which to sample. Stratified random sampling would again require the list of households. A less expensive way is to let each block within the city represent a cluster. A sample of clusters could then be randomly selected, and every household within these clusters could be interviewed to find the average annual household income.

B. Nonstatistical Sampling:

1. Judgement Sampling: In this case, the person taking the sample has direct or indirect control over which items are selected for the sample.

2. Convenience Sampling: In this method, the decision maker selects a sample from the population in a manner that is relatively easy and convenient.

3. Quota Sampling: In this method, the decision maker requires the sample to contain a certain number of items with a given characteristic. Many political polls are, in part, quota sampling.

Note: A random number table provides lists of numbers that are randomly generated and can be used to select random samples. Computer packages can also be used to generate lists of random numbers. For the table, refer to any text in Statistics.

Parameter and Statistic

A specific, well-defined characteristic of a population is known as a parameter of that population, while a specific characteristic of a sample is called a statistic of that sample. For instance, take the population to be the temperature readings made at hourly intervals on December 12, 2019 at various locations around Santiago City, and the sample to be the readings taken at 1:00 P.M. local time on that date. Then the parameter is the highest temperature reading in Santiago City as determined at hourly intervals on December 12, 2019, while the statistic is the highest temperature reading at 1:00 P.M. local time on December 12, 2019 in Santiago City.
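The distinction between a parameter and a statistic in the temperature example can be sketched in Python. This is only an illustration: the station names and the readings are invented (generated at random), and the variable names are assumptions, not part of the lecture.

```python
import random

rng = random.Random(2019)  # fixed seed so the run is repeatable
locations = ["site_A", "site_B", "site_C", "site_D"]  # hypothetical station names

# Hypothetical population: one reading (deg C) per location per hour of the day.
population = {
    (site, hour): round(rng.uniform(22.0, 33.0), 1)
    for site in locations
    for hour in range(24)
}

# Sample: only the readings taken at 13:00 (1:00 P.M.).
sample = {key: temp for key, temp in population.items() if key[1] == 13}

parameter = max(population.values())  # highest reading in the whole population
statistic = max(sample.values())      # highest reading within the sample only

print(f"parameter (all hourly readings): {parameter} deg C")
print(f"statistic (1:00 P.M. readings): {statistic} deg C")
```

The statistic can never exceed the parameter here, because the sample is a subset of the population.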
Population and Sample

In statistics, the term population refers to a particular set of items, objects, phenomena, or people being analyzed. These items, also called elements, can be actual subjects such as people or animals, but they can also be numbers or definable quantities expressed in physical units.

A sample of a population is a subset of that population. It can be a set consisting of only one value, reading, or measurement singled out from a population, or it can be a subset that is identified according to certain characteristics. The physical unit (if any) that defines a sample is always the same as the physical unit that defines the main, or parent, population. A single element of a sample is called an event. When a sample consists of the whole population, it is called a census. When a sample consists of a subset of a population whose elements are chosen at random, it is called a random sample.

[Figure: diagram labeled "Population", "Sample" and "Infer".]

Generating Random Variables using MS Excel

A random variable is a discrete or continuous variable whose value cannot be predicted in any given instance. Such a variable is usually defined within a certain range of values, such as 1 through 6 in the case of a thrown die. In order for a variable to be random, the only requirement is that it must be impossible to predict its value in any single instance. For instance, we cannot predict what number will turn up if we throw a die one time.

A random sample is also called a probability sample or scientific sample. Random sampling is a type of sampling in which every item in a population of interest, or target population, has a known, and usually equal, chance of being chosen for inclusion in the sample. Having such a sample ensures that the sample items are chosen without bias and provides the statistical basis for determining the confidence that can be associated with the inferences.
The four principal methods of random sampling are the simple, systematic, stratified, and cluster sampling methods.

A simple random sample is one in which individual items are chosen from the target population on the basis of chance. Such chance selection is similar to the random drawing of numbers in a lottery. However, in statistical sampling a table of random numbers or a random-number generator computer program generally is used to identify the numbered items in the population that are to be selected for the sample.

A systematic sample is a random sample in which the items are selected from the population at a uniform interval of a listed order, such as choosing every tenth account receivable for the sample. The first account to be included in the sample would be chosen randomly from among the first 10 accounts (perhaps by reference to a table of random numbers). A particular concern with systematic sampling is the existence of any periodic, or cyclical, factor in the population listing that could lead to a systematic error in the sample results.

In stratified sampling the items in the population are first classified into separate subgroups, or strata, by the researcher on the basis of one or more important characteristics. Then a simple random or systematic sample is taken separately from each stratum. Such a sampling plan can be used to ensure proportionate representation of various population subgroups in the sample. Further, the required sample size to achieve a given level of precision typically is smaller than it is with simple random sampling, thereby reducing sampling cost.

Cluster sampling is a type of random sampling in which the population items occur naturally in subgroups. Entire subgroups, or clusters, are then randomly sampled.
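The four methods just described can be sketched in Python. This is a minimal illustration rather than a prescribed procedure: the population of ID numbers 1 through 876, the two strata, the blocks of 12 consecutive IDs used as clusters, and all function names are assumptions made for the example.

```python
import random


def simple_random_sample(items, n, rng):
    """Choose n items purely by chance, without replacement."""
    return rng.sample(items, n)


def systematic_sample(items, n, rng):
    """Take every k-th item of the listed order after a random start."""
    k = len(items) // n          # uniform sampling interval
    start = rng.randrange(k)     # random position within the first k items
    return [items[start + i * k] for i in range(n)]


def stratified_sample(strata, n, rng):
    """Draw a simple random sample from each stratum, in proportion to its size."""
    total = sum(len(members) for members in strata.values())
    picked = []
    for members in strata.values():
        share = round(n * len(members) / total)  # rounding may shift the total slightly
        picked.extend(rng.sample(members, share))
    return picked


def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every item inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


rng = random.Random(42)              # fixed seed so the run is repeatable
population = list(range(1, 877))     # hypothetical ID numbers 1..876

srs = simple_random_sample(population, 100, rng)
systematic = systematic_sample(population, 100, rng)

# Two hypothetical strata of unequal size.
strata = {"stratum_1": population[:400], "stratum_2": population[400:]}
stratified = stratified_sample(strata, 100, rng)

# 73 hypothetical blocks of 12 consecutive IDs each (73 * 12 = 876).
clusters = {block: population[block * 12:(block + 1) * 12] for block in range(73)}
clustered = cluster_sample(clusters, 8, rng)

print(len(srs), len(systematic), len(stratified), len(clustered))
```

Note that the cluster sample's size is set by the clusters drawn (8 blocks of 12), not chosen directly, which is one practical difference from the other three methods.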
Although a nonrandom sample can turn out to be representative of the population, there is difficulty in assuming beforehand that it will be unbiased, or in expressing statistically the confidence that can be associated with inferences from such a sample. A judgment sample is one in which an individual selects the items to be included in the sample. The extent to which such a sample is representative of the population then depends on the judgment of that individual and cannot be statistically assessed. A convenience sample includes the most easily accessible measurements, or observations, as is implied by the word convenience.

[Figure: "Voluntary Response Sample", with members of a population saying "I'm in!", "Me too!" and "And me!".]

In monitoring process output, a strict random sample is not usually feasible, since only readily available items or transactions can easily be inspected. In order to capture changes that are taking place in the quality of process output, small samples are taken at regular intervals of time. Such a sampling scheme is called the method of rational subgroups. Such sample data are treated as if random samples were taken at each point in time, with the understanding that one should be alert to any known reasons why such a sampling scheme could lead to biased results.

[Figure: "Investigator intervention", with the caption "This one's far too small".]

NOTE: For the purpose of statistical inference a representative sample is desired. Yet, the methods of statistical inference require only that a random sample be obtained. There is no sampling method that can guarantee a representative sample. The best we can do is to avoid any consistent or systematic bias by the use of random (probability) sampling. Some causes of bias in sampling are voluntary response, investigator intervention, or the effects of periodic, seasonal and/or systematic gathering of data.
While a random sample rarely will be exactly representative of the target population from which it was obtained, use of this procedure does guarantee that only chance factors underlie the amount of difference between the sample and the population. In statistical sampling, a table of random numbers or a random-number generator computer program generally is used to identify the numbered items in the population that are to be selected for the sample. Excel is also a powerful tool for generating a sample from a given population.

Problem: A researcher wishes to obtain a simple random sample of 100 households from the 876 households in San Fabian, Echague, Isabela. (For convenience, the households are identified by the ID numbers 1 through 876.) Use Excel to obtain the 100 ID numbers of the sampled households to be included in the study.

Steps:

(1) Open Excel. Place the integers from 1 to 876 in column A of the worksheet by first entering the number 1 in cell A1. With cell A1 active (by clicking away from and back to A1, for instance), CLICK EDIT, FILL, SERIES to open the Series dialog box.

(2) Select the Series in Columns button, with a Step value of 1 and a Stop value of 876. CLICK OK, and the integers 1 to 876 will appear in column A.

(3) To identify the 100 households to be sampled, CLICK TOOLS, DATA ANALYSIS, SAMPLING. Designate the Input Range as $A$1:$A$876, the Sampling Method as Random, the Number of Samples as 100, and the Output Range as $B$1. CLICK OK, and the IDs of the randomly selected households will appear in rows 1 through 100 of column B.

[Figure: worksheet showing the result, not reproduced.]

STATISTICS OF SAMPLING

In the preceding lecture, terms such as population parameter, sample statistic, and sampling bias were introduced. Now, we will try to understand what these terms mean and how they are related to each other.
When you measure a certain observation from a given unit, such as a person’s response to a Likert-scaled item (as shown in the figure below), that observation is called a response. In other words, a response is a measurement value provided by a sampled unit. Each respondent will give you different responses to different items in an instrument. Responses from different respondents to the same item or observation can be graphed into a frequency distribution based on their frequency of occurrence. For a large number of responses in a sample, this frequency distribution tends to resemble a bell-shaped curve called a normal distribution, which can be used to estimate overall characteristics of the entire sample, such as the sample mean (average of all observations in a sample) or standard deviation (variability or spread of observations in a sample). These sample estimates are called sample statistics (a “statistic” is a value that is estimated from observed data).

[Figure: data matrix of responses from 20 respondents (rows, No. 1 to 20) to five items (columns, attitude 1 to attitude 5). Annotations mark the item names, an individual response, a missing value, all responses from one respondent (a row), and all responses from all respondents to one item (a column). Note in the figure: the mean or SD of a column is a SAMPLE STATISTIC.]

Populations also have means and standard deviations that could be obtained if we could sample the entire population. However, since the entire population can rarely be sampled, population characteristics are generally unknown, and are called population parameters (and not “statistics”, because they are not statistically estimated from data).
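As a small illustration of sample statistics for a single item, the sketch below tabulates a frequency distribution and computes the sample mean and standard deviation. The ten responses are hypothetical (they are not the values from the figure), and `None` is an assumed way of marking a missing value.

```python
import statistics
from collections import Counter

# Hypothetical responses of ten respondents to one Likert item (scale 1 to 5).
# None marks a missing value; these numbers are invented for illustration.
responses = [3, 4, 2, 5, 3, None, 4, 3, 2, 4]

# Drop missing values before computing anything.
observed = [r for r in responses if r is not None]

frequency = Counter(observed)                # frequency distribution of the responses
sample_mean = statistics.mean(observed)      # average of the observed responses
sample_sd = statistics.stdev(observed)       # sample SD (n - 1 in the denominator)

for value in sorted(frequency):
    print(value, "*" * frequency[value])     # crude text histogram
print(f"mean = {sample_mean:.2f}, SD = {sample_sd:.2f}")
```

Both the mean and the SD here are sample statistics: they describe only the nine observed responses, not the population the respondents were drawn from.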
Sample statistics may differ from population parameters if the sample is not perfectly representative of the population; the difference between the two is called sampling error. Theoretically, if we could gradually increase the sample size so that the sample approaches closer and closer to the population, then sampling error will decrease and a sample statistic will increasingly approximate the corresponding population parameter. If a sample is truly representative of the population, then the estimated sample statistics should be identical to the corresponding theoretical population parameters.

You need to understand the concept of a sampling distribution in order to know when your sample statistics are at least reasonably close to the population parameters. A sampling distribution is a frequency distribution of a sample statistic (like the sample mean) from a set of samples, while the commonly referenced frequency distribution is the distribution of a response (observation) from a single sample. Just like a frequency distribution, the sampling distribution will tend to have more sample statistics clustered around the mean (which presumably is an estimate of a population parameter), with fewer values scattered farther from the mean. With an infinitely large number of samples, this distribution will approach a normal distribution.

The variability or spread of a sample statistic in a sampling distribution (i.e., the standard deviation of a sample statistic) is called its standard error. In contrast, the term standard deviation is reserved for the variability of an observed response from a single sample. The mean value of a sample statistic in a sampling distribution is presumed to be an estimate of the unknown population parameter. Based on the spread of this sampling distribution (i.e., based on the standard error), it is also possible to estimate confidence intervals for that predicted population parameter.
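The idea of a sampling distribution and its standard error can be explored with a short simulation. The sketch below is illustrative only: the population of 5,000 Likert-type responses is invented, the sample size of 30 and the 2,000 repetitions are arbitrary choices, and the comparison value (population SD divided by the square root of the sample size) is the standard textbook expression for the standard error of a mean, which the text above does not state explicitly.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population of 5,000 Likert-type responses (values 1 to 5).
population = rng.choices([1, 2, 3, 4, 5], weights=[1, 2, 4, 2, 1], k=5000)
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)

# Draw many random samples and record each sample's mean.
sample_size = 30
sample_means = [
    statistics.fmean(rng.sample(population, sample_size))
    for _ in range(2000)
]

# The SD of the sample means is the (empirical) standard error of the mean.
standard_error = statistics.stdev(sample_means)
expected = population_sd / sample_size ** 0.5

print(f"population mean        = {population_mean:.3f}")
print(f"mean of sample means   = {statistics.fmean(sample_means):.3f}")
print(f"empirical std. error   = {standard_error:.3f}")
print(f"sigma / sqrt(n)        = {expected:.3f}")
```

The sample means cluster around the population mean, and their spread is much smaller than the spread of individual responses, which is the distinction between standard error and standard deviation drawn above.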
A confidence interval is a range of sample statistic values within which a population parameter is estimated to lie with a stated probability. All normal distributions tend to follow a 68-95-99 percent rule (see Figure below), which says that over 68% of the cases in the distribution lie within one standard deviation of the mean value (μ ± 1σ), over 95% of the cases in the distribution lie within two standard deviations of the mean (μ ± 2σ), and over 99% of the cases in the distribution lie within three standard deviations of the mean value (μ ± 3σ). Since a sampling distribution with an infinite number of samples will approach a normal distribution, the same 68-95-99 rule applies, and it can be said that:

(Sample statistic ± one standard error) represents a 68% confidence interval for the population parameter.
(Sample statistic ± two standard errors) represents a 95% confidence interval for the population parameter.
(Sample statistic ± three standard errors) represents a 99% confidence interval for the population parameter.

[Figure: normal curve with the horizontal axis marked from -3 to +3 standard deviations. 68% of data are within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. Areas of the bands on each side of the mean, from the center outward: 34%, 13.5%, 2.4%, 0.1%.]

A sample is “biased” (i.e., not representative of the population) if its sampling distribution cannot be estimated or if the sampling distribution violates the 68-95-99 percent rule. As an aside, note that in most regression analysis where we examine the significance of regression coefficients with p<0.05, we are attempting to see if the sample statistic (regression coefficient) predicts the corresponding population parameter (true effect size) with a 95% confidence interval.
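The percentages in the 68-95-99 rule, and the intervals built from them, can be checked with Python's standard library. This is a sketch under stated assumptions: the sample mean of 3.2 and standard error of 0.15 are hypothetical numbers chosen for illustration, and the helper names `coverage` and `interval` are not from the lecture.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def interval(statistic, standard_error, k):
    """Return (lower, upper) = statistic minus/plus k standard errors."""
    return statistic - k * standard_error, statistic + k * standard_error


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.4f}")

# Hypothetical sample mean and standard error, for illustration only.
sample_mean, standard_error = 3.2, 0.15
low, high = interval(sample_mean, standard_error, 2)
print(f"approximate 95% interval: {low:.2f} to {high:.2f}")
```

The exact coverages are slightly above 68%, 95% and 99%, which is why the rule is worded "over 68%", "over 95%" and "over 99%".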
Interestingly, the “six sigma” standard attempts to identify manufacturing defects outside the 99% confidence interval, a band six standard deviations wide (three on each side of the mean; standard deviation is represented using the Greek letter sigma), which corresponds to significance testing at p<0.01.

DETERMINING THE SAMPLE SIZE

The sample size depends on three factors: (1) the degree of accuracy required; (2) the amount of variability inherent in the population from which the sample is taken; and (3) the nature and complexity of the characteristics of the population under consideration.

There are various formulas for calculating the required sample size, depending on whether the data collected are to be of a categorical or quantitative nature (e.g., whether the aim is to estimate a proportion or a mean). These formulas require knowledge of the variance or proportion in the population and a determination as to the maximum tolerable error, as well as the acceptable Type I error risk (e.g., confidence level).

The formula used for these calculations is:

$$ s = \frac{\chi^2 N P (1 - P)}{d^2 (N - 1) + \chi^2 P (1 - P)} $$

where $s$ = required sample size; $\chi^2$ = the table value of chi-square for 1 degree of freedom at the desired confidence level (3.841 for 95%); $N$ = the population size; $P$ = the population proportion (assumed to be 0.50, which gives the maximum sample size); and $d$ = the degree of accuracy expressed as a proportion (0.05).

This formula is the one used by Krejcie & Morgan in their 1970 article “Determining Sample Size for Research Activities” (Educational and Psychological Measurement, vol. 30, pp. 607-610).

Proportional Allocation of Samples:

$$ n_i = \frac{N_i}{N} \, n $$

where $n_i$ = number of samples allocated to group $i$; $N_i$ = population of group $i$; $n$ = desired/estimated sample size; and $N$ = total population.

Guidelines with regard to the minimum number of items needed for a representative sample:

Descriptive studies: a minimum of 100.
Correlational studies: a sample of at least 30 is deemed necessary to establish the existence of a relationship.
Experimental and causal comparative studies: a minimum of 30 per group. Sometimes experimental studies with only 15 items in each group can be defended if they are very tightly controlled.
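The two calculations in this section can be sketched in Python. The sketch assumes the formula published by Krejcie & Morgan (1970), s = χ²NP(1 − P) / [d²(N − 1) + χ²P(1 − P)], with their usual defaults (χ² = 3.841, P = 0.50, d = 0.05), and the proportional allocation rule n_i = (N_i / N) n. The function names, the population of 876, and the three group sizes are hypothetical. Rounding to the nearest whole number is a choice made here; rounding up would be a more conservative alternative.

```python
import math


def krejcie_morgan(population_size, chi_square=3.841, proportion=0.5, margin=0.05):
    """Sample size s = X^2 N P(1-P) / (d^2 (N-1) + X^2 P(1-P)), rounded to nearest."""
    spread = chi_square * proportion * (1 - proportion)
    numerator = spread * population_size
    denominator = margin ** 2 * (population_size - 1) + spread
    return round(numerator / denominator)


def proportional_allocation(group_sizes, sample_size):
    """Allocate n_i = (N_i / N) * n, fixing rounding with the largest-remainder rule."""
    total = sum(group_sizes.values())
    exact = {g: sample_size * size / total for g, size in group_sizes.items()}
    allocation = {g: math.floor(x) for g, x in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Hand the remaining units to the groups with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda g: exact[g] - allocation[g], reverse=True)
    for g in by_fraction[:leftover]:
        allocation[g] += 1
    return allocation


n = krejcie_morgan(876)  # hypothetical population of 876 households
groups = {"group_1": 400, "group_2": 300, "group_3": 176}  # hypothetical group sizes
allocation = proportional_allocation(groups, n)

print("required sample size:", n)
print("allocation by group:", allocation)
```

The largest-remainder step is an addition to the plain formula: it guarantees that the group allocations add up exactly to the required sample size after rounding.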
If the sample is randomly selected and is sufficiently large, an accurate view of the population can be obtained, provided that no bias enters the selection process.