MATH 2441
Probability and Statistics for Biological Sciences
Looking Ahead … Terminology and Directions
There are certain basic terms and concepts that are used throughout the explanation and use
of the methods of probability and statistics. You will become familiar with these ideas as they arise again
and again in the course. However, to alert you to their importance, we mention a few of them here.
I. Variables
As in many branches of mathematics, the notion of a variable arises immediately. You can find some very
formal definitions of a variable in reference books, but for our purposes, a variable is simply a specific
property of each member of a population of things. For example:

- in reference to a population of human beings, a person's height, weight, age, annual income,
  years of school completed, hair color, favorite flavor of ice cream, favorite political candidate,
  etc. are all variables.
- in reference to the population of all cans of chicken soup produced by Acme Soup Inc., such
  properties as the percent salt, the number of millilitres of soup in the can, the date on which the
  can of soup was produced, etc. are all variables.
- in reference to a population of wheat plants of a certain variety, the height of the plant at a
  certain time after planting, the number of days required for the seed to germinate, the percent
  protein in the harvested grains, etc. are all variables.
- in reference to flipping a coin, the identity of the face showing (heads or tails) is a variable. If
  five coins are tossed simultaneously, the number of coins falling heads up is a variable.
In statistics, we often wish to characterize relationships between variables (for example, we might wish to
determine whether the value of a person's annual income is related to the value of their hair color, or
perhaps, whether the percent protein in wheat kernels is related to how much fertilizer was applied to the
field during growth, etc.). In cases such as these, we still retain the distinction between independent
variables and dependent variables. The value of the dependent variable is thought to be determined, or at
least influenced, by the values assigned or observed for the independent variables.
When the value a variable takes on comes as the result of some random process (a process in which the
specific result is not predictable with certainty in advance), we refer to it as a random variable. Random
variables play a large role in statistical work, since we are mostly concerned with properties of members of
random samples.
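The coin-tossing example above can be sketched in a few lines of Python (a hypothetical illustration, not part of the course materials): each run of the random process may give a different value of the random variable.

```python
import random

def heads_in_five_tosses():
    """Toss five fair coins; return the number landing heads up.

    The result is a random variable: it is not predictable in advance,
    and different runs may give different values (0 through 5).
    """
    return sum(random.choice(["heads", "tails"]) == "heads" for _ in range(5))

# Repeating the random process gives a (generally different) value each time.
values = [heads_in_five_tosses() for _ in range(10)]
print(values)  # ten integers, each between 0 and 5
```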
II. Samples and Populations
Always remember that the basic goal of statistics is to estimate values of properties of, draw
conclusions about, or make predictions about populations using data obtained from a random sample of that
population. Very early in the course, we shall have to be careful to distinguish when we are dealing with a
population property (called a population parameter) or a property of a random sample (called a sample
statistic).
Sample statistics will always be random variables, because their values will always depend on which
elements of the population get randomly selected to be part of the sample. Once the sample has been
selected, the value of the statistic is known.
Population parameters are always fixed quantities, never random variables (simply because once a
population is described, it is a fixed collection of things). However, rarely are these values of population
parameters known -- hence the need for the methods of statistical analysis.
In statistics, the term experiment refers generally to the procedure required to measure the value of a
sample statistic.
III. Sampling Error and Statistical Significance/Confidence Levels
Since the estimates we will make of population properties, or the conclusions we draw or the predictions we
make about a population will always be based on direct observation of elements of a smaller random sample
selected from that population, we need to keep in mind that we can obtain a quite erroneous
result about the population just because of the coincidence of which elements of the population actually turn
up in the sample.
For example, suppose we are studying a population of, say, all graduates of BCIT's
Biotechnology Diploma Program. For the sake of argument, let's say that there are now
exactly 400 such individuals, of which 10 have become millionaires since graduation
(though we wouldn't know this unless we determined the current wealth of each one of the
400 -- that is, unless we studied the entire population). Instead of going to the time and
expense of locating each of the 400 individuals in this population and forcing each of them
to tell us how much money they have (a somewhat questionable experimental approach
for several reasons), we decide to put the names of all 400 on small slips of paper in a
box, and then while blindfolded, we draw just two names as our random sample of
individuals for more detailed study. We then contact those two people, who, as it turns out
are quite happy to tell us how much money they have.
Now, it could be that both of these individuals in our sample are from the group of 390
grads who are not millionaires. We would then be led by our sample to conclude that no
biotechnology grads are millionaires -- clearly an incorrect conclusion. On the other hand,
it might also be that by wild coincidence, the two names we draw for our sample just
happen to be in the small group of ten grads who are now millionaires. In this case we will
come to the incorrect conclusion that all biotechnology grads are millionaires.
It is true that basing conclusions about a population on random samples of size two is too
risky to be considered an appropriate approach. The point here, though, is that varying degrees
of such risk are present no matter how large the sample is, except when the sample
includes every element of the population.
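The scenario above is easy to simulate. The following Python sketch (an illustration, not part of the course materials) repeatedly draws samples of size two from the hypothetical population of 400 graduates containing 10 millionaires, and counts how often each misleading sample occurs:

```python
import random

# Hypothetical population matching the example: 400 graduates, 10 millionaires.
population = ["millionaire"] * 10 + ["not"] * 390

trials = 100_000
both_not = 0
both_millionaires = 0
for _ in range(trials):
    sample = random.sample(population, 2)  # draw two names without replacement
    millionaires_drawn = sample.count("millionaire")
    if millionaires_drawn == 0:
        both_not += 1
    elif millionaires_drawn == 2:
        both_millionaires += 1

# About 95% of samples contain no millionaires at all, so the misleading
# conclusion "no grads are millionaires" is actually the most likely outcome;
# drawing two millionaires is possible but very rare.
print(both_not / trials)           # roughly 0.95
print(both_millionaires / trials)  # roughly 0.0006
```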
The difference between the value of a property of the population and the value of the corresponding property
of a sample is referred to as sampling error. It is a feature of the random sampling process itself,
and some degree of sampling error cannot be avoided in statistical experiments involving real populations.
Whenever we estimate some property of a population based on observations for a random sample of that
population (in the example above, we might be estimating the percentage of biotechnology grads who are
millionaires), we will attach a level of confidence to our estimate. The level of confidence is a number on a
scale of 0 to 1, written as a percentage, where numbers near 100% indicate a high likelihood that our
estimate is correct as stated. (A level of confidence of 100% would mean we are certain our estimate is
correct, whereas a level of confidence of 0% means we are certain our estimate is incorrect. For most work,
a level of confidence of 95% is considered acceptable.)
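As a preview of how a level of confidence attaches to an estimate, here is a sketch using the standard large-sample formula for a proportion, which the course develops later. The sample numbers are hypothetical, chosen only to echo the millionaire example:

```python
import math

# Hypothetical data: 6 millionaires observed in a random sample of 80 grads.
n, successes = 80, 6
p_hat = successes / n  # the sample statistic: observed proportion

# Standard large-sample 95% confidence interval for a proportion:
#   p_hat +/- 1.96 * sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

# We are 95% confident the population proportion lies in this interval;
# the interval (not certainty) reflects the unavoidable sampling error.
print(f"95% confidence interval: ({lower:.3f}, {upper:.3f})")
```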
On the other hand, when we are using the data in a sample to draw conclusions about a population, the
measure of reliability of that conclusion is called its level of significance. Again, the level of significance is
a number between 0 and 1, written as a percent, and representing the likelihood that our conclusion is
wrong as a result of sampling error. A level of significance of 1 or 100% would mean we are certain our
conclusion is incorrect, whereas a level of significance of 0 or 0% would mean we are certain our conclusion
is correct. Generally, conclusions based on levels of significance of 0.05 (5%) or smaller are considered
acceptable. When detailed calculations of the likelihood of a particular conclusion being incorrect are done,
that likelihood is called the p-value of the conclusion.
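A p-value can be computed exactly in simple cases. The following Python sketch (a hypothetical example, not from the course) tests the claim "this coin is fair" after observing 9 heads in 10 tosses, taking the p-value as the probability, under the claim, of a result at least as extreme as the one observed:

```python
from math import comb

n, observed_heads = 10, 9  # hypothetical experiment: 9 heads in 10 tosses

def binom_prob(k, n):
    """Probability of exactly k heads in n tosses of a fair coin."""
    return comb(n, k) * 0.5 ** n

# Two-sided p-value: probability of 9 or more heads, or 1 or fewer heads,
# if the coin really is fair.
p_value = (sum(binom_prob(k, n) for k in range(9, 11))
           + sum(binom_prob(k, n) for k in range(0, 2)))

print(round(p_value, 4))  # 0.0215 -- below 0.05, so we would reject "fair"
```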
We will spend quite a lot of time fleshing out these notions -- they are central to the whole discipline of
statistics. From this point on, however, you need to be aware of the sort of results we will be able to get
about populations from information obtained from a random sample of that population, and you must
never lose track of the presence of sampling error.
David W. Sabo (1999)