STAT BIO MODULE 1

2022
1. THE NATURE OF PROBABILITY AND STATISTICS
Objectives:
At the end of this chapter, the students are expected to:
1. Define statistics;
2. Differentiate descriptive and inferential statistics;
3. Distinguish primary and secondary data;
4. Make a distinction between qualitative and quantitative data;
5. Identify discrete and continuous data;
6. Classify data according to variable type and appropriate level of measurement; and
7. Discuss some applications of statistics.
Introduction
Decision makers make better decisions when they use all available information in an effective
and meaningful way. The primary role of statistics is to provide decision makers with methods for
obtaining and analyzing information to help make these decisions. Statistics is used to answer
long-range planning questions, such as when and where to locate facilities to handle future sales.
The word statistics is derived from the Latin word status, meaning “state”. In the beginning,
statistics involved the compilation of data and graphs describing various aspects of a state or country.
The word statistics means different things to different people. To some, statistics means actual numbers
derived from data; others refer to statistics as a method of analysis. Specifically, statistics is defined
as the science of collecting, organizing, presenting, analyzing and interpreting numerical data for the
purpose of assisting in making more effective decisions.
Statistical methods are vital tools in research in education, psychology, medicine,
business, agriculture, and other disciplines.
Types of Statistics
Statistics is a tool which helps us develop general and meaningful conclusions that go beyond
the original data. There are two types of statistical analyses: Descriptive and Inferential or Inductive
Statistics.
1. Descriptive Statistics comprises all the methods used to collect, organize, summarize or present
data, usually to make the data easier to understand. It is concerned with summary calculations such as
averages and percentages, and with the construction of graphs, charts and tables.
2. Inferential Statistics is concerned with the formulation of conclusions or generalizations
about a population based on an observation or a series of observations of a sample drawn from that
population. It consists of performing hypothesis tests, determining relationships among variables,
and making predictions. For example, the average family income of the residents in Region 2 can be
estimated from figures obtained from a few hundred families (the sample).
Quantitative and Qualitative Variables or Data
In doing research, we first have to define the variables relevant to the data. The term
variable means an item of interest that can take on many different values, while a collection
of such values is called data. If a value does not vary, that is, if it is
fixed, it is called a constant. There are two major classifications of variables: qualitative and
quantitative.
1. Qualitative Variables are nonnumeric variables and cannot be measured numerically. Examples include
gender (male, female), religious affiliation (Roman Catholic, Iglesia ni Cristo, Methodist, etc.), and
ethnicity (Ilocano, Tagalog, Ibanag, etc.).
2. Quantitative Variables are numerical variables and can be measured or counted. Examples include
the balance in your checking account and the number of children in your family.
Some quantitative variables can take on only specific or isolated values along a scale. For
example, the number of children in a family may be 1, 2, 3, or any other whole number, but it can
never be 1.25 or 0.5. Such a variable has values that can only be obtained through the process of
counting and is referred to as a discrete or discontinuous variable.
Prepared by MARIANNE JANE ANTOINETTE D. PUA, M.S.
Specifically, quantitative variables can be ordered and ranked. They can be classified into two
groups: Discrete and Continuous.
Discrete variables take values that are obtained by counting. The results are whole numbers.
For example, the number of students in the room.
Continuous variables take values that are obtained by measuring. The results can be any value
between two specific values. For example, if you measure the height of every student in the room, you
could get any number between two reasonable amounts. So height is a continuous variable.
Levels of Measurement:
Variables can also be classified according to the level of measurement. There are four levels
of measurement: Nominal, Ordinal, Interval and Ratio.
1. Nominal Data: The weakest level of measurement. Numbers or labels are used only to represent an item or
characteristic. Examples include: names, gender, religious affiliation, civil status, college majors. Note
that such data should not be treated as numerical, since relative size has no meaning.
2. Ordinal or Rank Data: This can be ordered or ranked, but a specific difference between the levels
cannot be determined. For example, the performance rating (Outstanding, Very Satisfactory,
Satisfactory, Poor) can be ordered. You know that Outstanding is higher than Very Satisfactory
and Very Satisfactory is higher than Satisfactory, etc., but there is no exact difference between any two
of them. For example, the grades behind Outstanding and Very Satisfactory may be close (4.65 and 4.45)
or far apart (5.00 and 4.25), so the exact difference cannot be determined.
3. Interval Data: This can be ordered and has an exact difference between any two units but has
no meaningful zero or starting point. For example, temperature in degrees Celsius is interval data: readings can be
ordered and there is an exact difference between two degrees, but zero does not mean the starting point
since there can be temperatures below zero.
4. Ratio Data: This is the highest level of measurement and allows for all basic arithmetic
operations, including division and multiplication. Data at this level can be ordered, have exact
differences between units, and have a meaningful zero. Things that are counted are usually ratio level,
for example, business data such as cost, revenue and profit.
Data Collection: Data can be collected in various ways:
1. Focus Group
2. Telephone Interview
3. Mail Questionnaires
4. Door-to-Door Survey
5. Mall Intercept
6. New Product Registration
7. Personal Interview
8. Experiments
Sources of Data:
1. Secondary Data: Data which are already available. For example, ISU enrollment data.
Secondary data is less expensive; however, it may not satisfy the researcher’s need.
2. Primary Data: Data which must be collected.
Sampling Techniques:
Sampling Techniques are used when only a part of the population is to be surveyed. If it takes too
long or is too expensive to interview the whole population, a sample is used. If a sample is chosen
correctly to represent the population, it is called unbiased; if it does not represent the whole
population, it is called biased.
There are many ways to collect a sample, statistical or non-statistical. The most commonly
used methods are:
A. Statistical Sampling:
1. Simple Random Sampling: This is used to see that all possible elements of the population
have an equal opportunity of being selected for the sample.
2. Stratified Random Sampling: This is obtained by selecting simple random samples from
strata (or mutually exclusive sets). Some of the criteria for dividing a population into strata are:
Gender (male, female); Age (under 18, 18 to 28, 29 to 39); Occupation (blue-collar, professional,
other).
3. Cluster Sampling: This is a simple random sample of groups, or clusters, of elements. Cluster
sampling is useful when it is difficult or costly to generate a simple random sample. For example, to
estimate the average annual household income in a large city we use cluster sampling, because to use
simple random sampling we need a complete list of households in the city from which to sample. To
use stratified random sampling, we would again need the list of households. A less expensive way is to
let each block within the city represent a cluster. A sample of clusters could then be randomly selected,
and every household within these clusters could be interviewed to find the average annual household
income.
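The block-based procedure described above can be sketched in Python. This is only an illustration under stated assumptions: the city, the block labels, the household IDs, and the function name cluster_sample are hypothetical and do not come from the module.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters and return every element inside them."""
    rng = random.Random(seed)
    # Simple random sample of cluster labels (not of individual households).
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [unit for label in chosen for unit in clusters[label]]
    return chosen, members

# Hypothetical city: 10 blocks, each with 5 households identified by an ID.
blocks = {f"block_{b}": [f"b{b}_h{h}" for h in range(1, 6)] for b in range(1, 11)}

chosen_blocks, households = cluster_sample(blocks, n_clusters=3, seed=1)
print(chosen_blocks)     # the three randomly selected blocks
print(len(households))   # every household in those blocks is interviewed
```

Only a list of blocks is needed to draw the sample, which is why this approach avoids the complete household list that simple random or stratified sampling would require.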
B. Nonstatistical Sampling:
1. Judgement Sampling: In this case, the person taking the sample has direct or indirect
control over which items are selected for the sample.
2. Convenience Sampling: In this method, the decision maker selects a sample from the
population in a manner that is relatively easy and convenient.
3. Quota Sampling: In this method, the decision maker requires the sample to contain a certain
number of items with a given characteristic. Many political polls are, in part, quota sampling.
Note: A random number table provides lists of randomly generated numbers that can be
used to select random samples. Computer packages can also generate lists of random numbers. For
the table, refer to any statistics textbook.
Parameter and Statistic
A specific, well-defined characteristic of a population is known as a parameter of that
population, while a specific characteristic of a sample is called a statistic of that sample. For instance,
suppose the population is the set of temperature readings taken at hourly intervals on December 12, 2019
at various locations around Santiago City, and the sample is the set of readings taken at 1:00 P.M. local
time that day. Then the highest temperature reading in Santiago City as determined at hourly intervals on
December 12, 2019 is a parameter, while the highest temperature reading at 1:00 P.M. local time on
December 12, 2019 in Santiago City is a statistic.
Population and Sample
In statistics, the term population refers to a particular set of items, objects, phenomena, or
people being analyzed. These items, also called elements, can be actual subjects such as people or
animals, but they can also be numbers or definable quantities expressed in physical units.
A sample of a population is a subset of that population. It can be a set consisting of only one
value, reading, or measurement singled out from a population, or it can be a subset that is
identified according to certain characteristics. The physical unit (if any) that defines a sample is
always the same as the physical unit that defines the main, or parent, population. A single element of a
sample is called an event.
[Figure: Population and Sample, linked by the label “Infer”]
When a sample consists of the whole population, it is called a census. When a sample consists
of a subset of a population whose elements are chosen at random, it is called a random sample.
Generating Random Variables using MS Excel
A random variable is a discrete or continuous variable whose value cannot be predicted in any
given instance. Such a variable is usually defined within a certain range of values, such as 1 through 6
in the case of a thrown die. For a variable to be random, the only requirement is that it
must be impossible to predict its value in any single instance. For instance, we can’t predict what
number will turn up if we throw a die one time.
A random sample is also called a probability sample, or scientific sample. Random sampling is
a type of sampling in which every item in a population of interest, or target population, has a known,
and usually equal, chance of being chosen for inclusion in the sample. Having such a sample ensures
that the sample items are chosen without bias and provides the statistical basis for determining the
confidence that can be associated with the inferences. The four principal methods of random sampling
are the simple, systematic, stratified, and cluster sampling methods.
A simple random sample is one in which individual items are chosen from the target
population on the basis of chance. Such chance selection is similar to the random drawing of numbers
in a lottery. However, in statistical sampling a table of random numbers or a random-number generator
computer program generally is used to identify the numbered items in the population that are to be
selected for the sample.
A systematic sample is a random sample in which the items are selected from the population at
a uniform interval of a listed order, such as choosing every tenth account receivable for the sample.
The first account of the 10 accounts to be included in the sample would be chosen randomly (perhaps
by reference to a table of random numbers). A particular concern with systematic sampling is the
existence of any periodic, or cyclical, factor in the population listing that could lead to a systematic
error in the sample results.
In stratified sampling the items in the population are first classified into separate subgroups, or
strata, by the researcher on the basis of one or more important characteristics. Then a simple random
or systematic sample is taken separately from each stratum. Such a sampling plan can be used to
ensure proportionate representation of various population subgroups in the sample. Further, the
required sample size to achieve a given level of precision typically is smaller than it is with simple
random sampling, thereby reducing sampling cost.
Cluster sampling is a type of random sampling in which the population items occur naturally in
subgroups. Entire subgroups, or clusters, are then randomly sampled.
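The systematic and stratified methods described above can be sketched as follows. This is a minimal illustration, not the module's own procedure: the account numbers, the two strata, the 10% sampling fraction, and the function names are all hypothetical.

```python
import random

def systematic_sample(items, k, seed=None):
    """Take every k-th item after a random start among the first k items."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting position 0..k-1
    return items[start::k]

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the same fraction from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        size = max(1, round(len(members) * fraction))
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical listing of 100 accounts receivable, numbered 1 to 100.
accounts = list(range(1, 101))
every_tenth = systematic_sample(accounts, k=10, seed=7)

# Hypothetical strata: 60 members in one subgroup and 40 in the other.
strata = {"male": list(range(1, 61)), "female": list(range(61, 101))}
by_gender = stratified_sample(strata, fraction=0.10, seed=7)

print(every_tenth)
print({name: len(ids) for name, ids in by_gender.items()})
```

Using the same fraction in every stratum is what gives the proportionate representation mentioned in the text. If the listing has a cycle whose length matches k, the systematic sample would inherit that cycle, which is the periodicity concern noted above.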
Although a nonrandom sample can turn out to be representative of the population, there is
difficulty in assuming beforehand that it will be unbiased, or in expressing statistically the confidence
that can be associated with inferences from such a sample.
A judgment sample is one in which an individual selects the items to be included in the sample.
The extent to which such a sample is representative of the population then depends on the judgment
of that individual and cannot be statistically assessed. A convenience sample includes the
most easily accessible measurements, or observations, as is implied by the word convenience.
[Figure: Voluntary Response Sample. Members of the population volunteer: “I’m in!”, “Me too!”, “And me!”]
A strict random sample is not usually feasible since only readily available items or transactions
can easily be inspected. In order to capture changes that are taking place in the quality of process
output, small samples are taken at regular intervals of time. Such a sampling scheme is called the
method of rational subgroups. Such sample data are treated as if random samples were taken at each
point in time, with the understanding that one should be alert to any known reasons why such a
sampling scheme could lead to biased results.
[Figure: Investigator intervention. Caption in figure: “This one’s far too small”]
NOTE:
For the purpose of statistical inference a representative sample is desired. Yet, the methods of
statistical inference require only that a random sample be obtained. There is no sampling method
that can guarantee a representative sample. The best we can do is to avoid any
consistent or systematic bias by the use of random (probability) sampling. Some causes of bias in
sampling are voluntary response, investigator intervention, or the effects of periodic, seasonal and/or
systematic gathering of data.
While a random sample rarely will be exactly representative of the target population from
which it was obtained, use of this procedure does guarantee that only chance factors underlie the
amount of difference between the sample and the population.
In statistical sampling, a table of random numbers or a random-number generator computer
program generally is used to identify the numbered items in the population that are to be selected for
the sample. Excel is also a powerful tool in generating a sample from a given population.
Problem: A researcher wishes to obtain a simple random sample of 100 households from 876
households in San Fabian, Echague, Isabela. For convenience, the households are identified
by the ID numbers 1 through 876. Use Excel to obtain the 100 ID numbers of the sampled
households to be included in the study.
Steps:
(1) Open Excel. Place the integers from 1 to 876 in column A of the worksheet by first
entering the number 1 in cell A1. With cell A1 active (by clicking away from and back to A1, for instance),
CLICK EDIT, FILL, SERIES to open the Series dialog box.
(2) Select the Series in Columns button with a Step value of 1 and a Stop value of 876. CLICK
OK, and the integers 1 to 876 will appear in column A.
(3) To identify the 100 households to be sampled, CLICK TOOLS, DATA ANALYSIS,
SAMPLING. Designate the Input Range as $A$1:$A$876, the Sampling Method as Random, the
number of samples as 100, and the Output Range as $B$1. CLICK OK, and the IDs of the randomly
selected households will appear in rows 1 through 100 of column B.
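For readers without Excel, the same selection can be sketched with a random-number generator in Python. This is an alternative illustration rather than the module's procedure, and the seed value is an arbitrary assumption used only to make the run repeatable. Note that random.sample draws without replacement, so no household ID can appear twice; spreadsheet sampling tools may behave differently, so it is worth checking their output for repeated IDs.

```python
import random

rng = random.Random(2022)          # arbitrary seed so the run is repeatable
household_ids = range(1, 877)      # ID numbers 1 through 876

# Draw 100 distinct IDs; every household has the same chance of selection.
sampled_ids = sorted(rng.sample(household_ids, 100))
print(sampled_ids)
```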
And the result is: [worksheet screenshot not reproduced]
STATISTICS OF SAMPLING
In the preceding lecture, terms such as population parameter, sample statistic, and sampling
bias were introduced. Now, we will try to understand what these terms mean and how they are related
to each other.
When you measure a certain observation from a given unit, such as a person’s response to a
Likert-scaled item (as shown in the figure on the following page), that observation is called a
response. In other words, a response is a measurement value provided by a sampled unit. Each
respondent will give you different responses to different items in an instrument. Responses from
different respondents to the same item or observation can be graphed into a frequency distribution
based on their frequency of occurrences.
For a large number of responses in a sample, this frequency distribution tends to resemble a
bell-shaped curve called a normal distribution, which can be used to estimate overall characteristics of
the entire sample, such as sample mean (average of all observations in a sample) or standard deviation
(variability or spread of observations in a sample). These sample estimates are called sample statistics
(a “statistic” is a value that is estimated from observed data).
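As a small illustration of these ideas, the sketch below builds a frequency distribution and computes the sample mean and standard deviation for one item. The ten responses are hypothetical values made up for the example, not data from the module.

```python
from collections import Counter
import statistics

# Hypothetical responses of ten respondents to one Likert-scaled item.
responses = [3, 4, 3, 2, 4, 3, 3, 2, 2, 2]

frequency = Counter(responses)               # how often each response value occurs
sample_mean = statistics.mean(responses)     # average of all observations
sample_sd = statistics.stdev(responses)      # spread of the observations (n - 1 divisor)

for value in sorted(frequency):
    print(value, frequency[value])
print(sample_mean, round(sample_sd, 3))
```

The mean and standard deviation computed here are sample statistics in the sense used above, because they are estimated from observed data.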
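[Figure: Sample data sheet with item names as column headings; values read row by row from the extracted figure]

No.  attitude 1  attitude 2  attitude 3  attitude 4  attitude 5
1    3           4           3           2           4
2    3           3           2           2           2
3    3           1           1           1           1
4    3           3           1           4           3
5    3           4           2           2           2
6    5           3           2           1           4
7    2           3           4           4           4
8    3           4           2           2           1
9    3           2           3           3           1
10   3           2           1           3           3
11   1           3           3           3           3
12   3           4           2           2           0
13   3           3           3           3           2
14   3           2           3           2           1
15   3           3           3           3           3
16   4           4           1           3           4
17   4           3           3           3           2
18   3           3           1           3           1
19   3           3           1           4           1
20   3           3           3           2           1

Figure annotations: the column headings are the item names; one row holds all responses from one respondent; one column holds all responses from all respondents in one item (note: the mean or SD of this set is a SAMPLE STATISTIC); each cell is an individual response; one cell is marked as a missing value (apparently the 0 entry).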
Populations also have means and standard deviations that could be obtained if we could measure
the entire population. However, since the entire population can rarely be measured, population
characteristics are usually unknown, and are called population parameters (and not “statistics” because
they are not statistically estimated from data).
Sample statistics may differ from population parameters if the sample is not perfectly
representative of the population; the difference between the two is called sampling error.
Theoretically, if we could gradually increase the sample size so that the sample approaches closer and
closer to the population, then sampling error will decrease and a sample statistic will increasingly
approximate the corresponding population parameter.
If a sample is truly representative of the population, then the estimated sample statistics should
be identical to corresponding theoretical population parameters. There is a need for you to understand
the concept of a sampling distribution to be able to know when your samples are at least reasonably
close to the population parameters.
A sampling distribution is a frequency distribution of a sample statistic (like the sample mean)
from a set of samples, while the commonly referenced frequency distribution is the distribution of a
response (observation) from a single sample. Just like a frequency distribution, the sampling
distribution will also tend to have more sample statistics clustered around the mean (which presumably
is an estimate of a population parameter), with fewer values scattered farther from the mean. With an
infinitely large number of samples, this distribution will approach a normal distribution. The
variability or spread of a sample statistic in a sampling distribution (i.e., the standard deviation of a
sample statistic) is called its standard error. In contrast, the term standard deviation is reserved for
the variability of an observed response from a single sample.
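The difference between a standard deviation and a standard error can be seen in a small simulation. The sketch below is illustrative only: the population of 10,000 scores, the sample size of 50, and the 1,000 repeated samples are assumptions chosen for the example. It compares the observed spread of the sample means with the usual approximation of the population standard deviation divided by the square root of the sample size.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed for a repeatable run

# Hypothetical population of 10,000 Likert-type scores from 1 to 5.
population = [rng.randint(1, 5) for _ in range(10_000)]

def sample_means(pop, sample_size, n_samples):
    """Mean of each of n_samples simple random samples drawn from pop."""
    return [statistics.mean(rng.sample(pop, sample_size)) for _ in range(n_samples)]

means = sample_means(population, sample_size=50, n_samples=1_000)

standard_error = statistics.stdev(means)               # spread of the sample means
theoretical = statistics.pstdev(population) / 50 ** 0.5  # sigma / sqrt(n)

print(round(statistics.mean(means), 3), round(statistics.mean(population), 3))
print(round(standard_error, 3), round(theoretical, 3))
```

The list means is an empirical sampling distribution: its center should sit close to the population mean, and its spread (the standard error) should be much smaller than the spread of individual responses.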
The mean value of a sample statistic in a sampling distribution is presumed to be an estimate of
the unknown population parameter. Based on the spread of this sampling distribution (i.e., based on
the standard error), it is also possible to estimate confidence intervals for that population
parameter. A confidence interval is a range of sample statistic values within which the population
parameter is estimated to lie with a specified probability. All normal distributions tend to follow a 68-95-99 percent
rule (see Figure below), which says that over 68% of the cases in the distribution lie within one
standard deviation of the mean value (μ ± 1σ), over 95% of the cases in the distribution lie within two
standard deviations of the mean (μ ± 2σ), and over 99% of the cases in the distribution lie within three
standard deviations of the mean value (μ ± 3σ). Since a sampling distribution with an infinite number
of samples will approach a normal distribution, the same 68-95-99 rule applies, and it can be said that:
• (Sample statistic ± one standard error) represents a 68% confidence interval for the population
parameter.
• (Sample statistic ± two standard errors) represents a 95% confidence interval for the population
parameter.
• (Sample statistic ± three standard errors) represents a 99% confidence interval for the
population parameter.
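[Figure: Normal curve. 68% of data are within 1 standard deviation of the mean (34% on each side), 95% within 2 standard deviations (a further 13.5% on each side), and 99.7% within 3 standard deviations (a further 2.4% on each side), leaving 0.1% in each tail. Horizontal axis runs from μ − 3σ to μ + 3σ.]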
A sample is “biased” (i.e., not representative of the population) if its sampling distribution
cannot be estimated or if the sampling distribution violates the 68-95-99 percent rule. As an aside, note
that in most regression analysis where we examine the significance of regression coefficients with
p<0.05, we are attempting to see if the sample statistic (regression coefficient) predicts the
corresponding population parameter (true effect size) with a 95% confidence interval. Interestingly,
the “six sigma” standard attempts to identify manufacturing defects outside the 99% confidence
interval, which spans six standard deviations, three on each side of the mean (standard deviation is
represented using the Greek letter sigma), representing significance testing at p<0.01.
DETERMINING THE SAMPLE SIZE
The sample size depends on three factors: (1) the degree of accuracy required; (2) the
amount of variability inherent in the population from which the sample was taken;
and (3) the nature and complexity of the characteristics of the population under
consideration.
There are various formulas for calculating the required sample size based
upon whether the data collected is to be of a categorical or quantitative nature (e.g., is to
estimate a proportion or a mean). These formulas require knowledge of the variance or
proportion in the population and a determination as to the maximum desirable
error, as well as the acceptable Type I error risk (e.g., confidence level).
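The formula used for these calculations was:
\[ s = \frac{\chi^2 N P (1 - P)}{d^2 (N - 1) + \chi^2 P (1 - P)} \]
where s is the required sample size, \chi^2 is the table value of chi-square for 1 degree of freedom at the desired confidence level (3.841 for 95%), N is the population size, P is the population proportion (assumed to be 0.50, which gives the maximum sample size), and d is the degree of accuracy expressed as a proportion (0.05).
This formula is the one used by Krejcie & Morgan in their 1970 article “Determining Sample Size for
Research Activities” (Educational and Psychological Measurement, vol. 30, pp. 607-610).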
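A sample-size calculation of this kind can be sketched in Python. The sketch assumes the commonly quoted form of the Krejcie & Morgan formula, s = χ²NP(1 − P) / [d²(N − 1) + χ²P(1 − P)], with the defaults usually paired with it (χ² = 3.841 for 95% confidence, P = 0.50, d = 0.05). The function name and the choice to round to the nearest whole number are assumptions made for the illustration.

```python
def krejcie_morgan(population_size, chi2=3.841, p=0.5, d=0.05):
    """Required sample size for a finite population (assumed formula, see lead-in)."""
    numerator = chi2 * population_size * p * (1 - p)
    denominator = d ** 2 * (population_size - 1) + chi2 * p * (1 - p)
    return round(numerator / denominator)

for n_population in (100, 500, 1000, 10000):
    print(n_population, krejcie_morgan(n_population))
```

The required sample grows much more slowly than the population, so very large populations need only a few hundred sampled units under these assumptions.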
Proportional Allocation of Samples:
\[ n_i = \frac{N_i}{N} \times n \]
Where n_i = number of samples allocated to group i; N_i = population of group i; n = desired/estimated sample size; and N = total population.
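Proportional allocation can be sketched as follows, assuming the usual rule that each group receives a share of the sample equal to its share of the total population. The three groups, their sizes, and the function name are hypothetical. Because each share is rounded to a whole number, the allocations may occasionally add up to one more or one less than the desired sample size and need a manual adjustment.

```python
def proportional_allocation(group_sizes, sample_size):
    """Allocate sample_size across groups in proportion to each group's population."""
    total = sum(group_sizes.values())
    return {group: round(sample_size * size / total)
            for group, size in group_sizes.items()}

# Hypothetical population of 876 households split into three groups.
groups = {"A": 400, "B": 300, "C": 176}
allocation = proportional_allocation(groups, sample_size=100)
print(allocation)
print(sum(allocation.values()))
```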
Guidelines with regard to the minimum number of items needed for
a representative sample:
• Descriptive studies – a minimum of 100.
• Correlational studies – a sample of at least 30 is deemed necessary to establish the existence of
a relationship.
• Experimental and causal-comparative studies – a minimum of 30 per group. Sometimes
experimental studies with only 15 items in each group can be defended if they are very tightly
controlled.
If the sample is randomly selected and is sufficiently large, an accurate view of the
population can be obtained, provided that no bias enters the selection process.