Statistics - Bio-Link

[email protected]
Biotechnology Laboratory
Technician Program
Course:
Basic Biotechnology Laboratory Skills
for a Regulated Workplace
Lisa Seidman, Ph.D.
STATISTICS
A BRIEF INTRODUCTION
WHY LEARN ABOUT
STATISTICS?

Statistics provides tools that are used in



Quality control
Research
Measurements

Sports
IN THIS COURSE

We will use some of these tools



Ideas
Vocabulary
A few calculations
VARIATION

There is variation in the natural world




People vary
Measurements vary
Plants vary
Weather varies

Variation among organisms is the basis of
natural selection and evolution
EXAMPLE



100 people take a drug and 75 of them get
better
100 people don’t take the drug but 68 get
better without it
Did the drug help?
VARIABILITY IS A PROBLEM



There is variation in response to the illness
There is variation in response to the drug
So it’s difficult to figure out if the drug helped
STATISTICS

Provides mathematical tools to help arrive at
meaningful conclusions in the presence of
variability


Might help researchers decide if a drug is
helpful or not
This is a more advanced application of
statistics than we will get into
DESCRIPTIVE STATISTICS


Chapter 16 in your textbook
Descriptive statistics is one area within
statistics
DESCRIPTIVE STATISTICS

Provides tools to DESCRIBE, organize and
interpret variability in our observations of the
natural world
DEFINITIONS

Population:

Entire group of events, objects, results, or
individuals, all of whom share some unifying
characteristic
POPULATIONS

Examples:

All of a person’s red blood cells
All the enzyme molecules in a test tube

All the college students in the U.S.

SAMPLE

Sample: Portion of the whole population that
represents the whole population

Example: It is virtually impossible to measure
the level of hemoglobin in every cell of a
patient

Rather, take a sample of the patient’s blood
and measure the hemoglobin level
MORE ABOUT SAMPLES

Representative sample: a sample that truly
represents the variability in the population (a good sample)
TWO VOCABULARY WORDS



A sample is random if all members of the population
have an equal chance of being drawn
A sample is independent if the choice of one
member does not influence the choice of another
Samples need to be taken randomly and
independently in order to be representative
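A minimal sketch of drawing a random sample in Python, assuming a hypothetical population of 100 numbered vials (the vial numbers, sample size, and helper name `draw_random_sample` are illustrative, not from the course). `random.sample` draws without replacement and gives every member the same chance of being chosen.

```python
import random

# Hypothetical population: 100 vials identified by number (an assumption for illustration).
population = list(range(1, 101))

def draw_random_sample(members, size, seed=None):
    """Return `size` members chosen at random, without replacement."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(members, size)

sample = draw_random_sample(population, 15, seed=1)
print(sorted(sample))
```

Picking the first 15 vials off the line, by contrast, would not give every vial an equal chance.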
SAMPLING


How we take a sample is critical and often
complex
If a sample is not taken correctly, it will not be
representative
EXAMPLE

How would you sample a field of corn?
VARIABLES

Variables:


Characteristics of a population (or a sample) that
can be observed or measured
Called variables because they can vary among
individuals
VARIABLES

Examples:



Blood hemoglobin levels
Activity of enzymes
Test scores of students


A population or sample can have many
variables that can be studied
Example

Same population of six year old children can be
studied for




Height
Shoe size
Reading level
Etc.
DATA

Data: Observations of a variable (singular is datum)


May or may not be numerical
Examples:



Heights of all the children in a sample (numerical)
Lengths of insects (numerical)
Pictures of mouse kidney cells (not numerical)
ALWAYS UNCERTAINTY

Even if you take a sample correctly, there is
uncertainty when you use a sample to represent the
whole population


Various samples from the same population are unlikely to
be identical
So, need to be careful about drawing conclusions
about a population, based on a sample – there is
always some uncertainty
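The idea that different samples from one population give different answers can be tried out with a small simulation. This is only a sketch: the population is simulated (1,000 made-up values centered near 200), and the sample sizes and the helper name `sample_means` are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of 1,000 measurements (simulated; not real data).
population = [rng.gauss(200, 10) for _ in range(1000)]
population_mean = statistics.mean(population)

def sample_means(members, size, repeats, rng):
    """Mean of each of `repeats` random samples of the given size."""
    return [statistics.mean(rng.sample(members, size)) for _ in range(repeats)]

small = sample_means(population, 5, 4, rng)
large = sample_means(population, 200, 4, rng)

print(f"population mean: {population_mean:.1f}")
print("means of samples of 5:  ", [round(m, 1) for m in small])
print("means of samples of 200:", [round(m, 1) for m in large])
```

The sample means differ from one another and from the population mean; with correctly drawn samples, the larger ones usually land closer to it.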
SAMPLE SIZE



If a sample is drawn correctly, then, the larger
the sample, the more likely it is to accurately
reflect the entire population
If it is not done correctly, then a bigger
sample may not be any better
How does this apply to the corn field?
INFERENTIAL STATISTICS



Another branch of statistics
Won’t talk about it much
Deals with tools to handle the uncertainty of
using a sample to represent a population
EXAMPLE PROBLEM

In a quality control setting, 15 vials of product
from a batch are tested. What is the sample?
What is the population?

In an experiment, the effect of a carcinogenic
compound was tested on 2000 lab rats.
What is the sample? What is the population?

In a clinical study, a new drug was tested on
fifty patients. What is the sample? What is
the population?
ANSWERS

15 vials, the sample, were tested for QC. The
population is all the vials in the batch.

The sample is the rats that were tested. The
population is probably all lab rats.

The sample is the 50 patients tested in the trial. The
population is all patients with the same condition.
EXAMPLE PROBLEM

An advertisement says that 2 out of 3 doctors
recommend Brand X.



What is the sample? What is the population?
Is the sample representative?
Does this statement ensure that Brand X is better
than competitors?
ANSWER

Many abuses of statistics relate to poor sampling.
The population of interest is all doctors. No way to
know what the sample is. The sample could have
included only relatives of employees at Brand X
headquarters, or only doctors in a certain area.
Therefore the statement does not ensure that the
majority of doctors recommend Brand X. It certainly
does not ensure that Brand X is best.
DESCRIBING DATA SETS



Draw a sample from a population
Measure values for a particular variable
Result is a data set
DATA SETS


Individuals vary, therefore the data set has
variation
Data without organization is like letters that
aren’t arranged into words

Numerical data can be arranged in ways that
are meaningful – or that are confusing or
deceptive
DESCRIPTIVE STATISTICS


Provides tools to organize, summarize, and
describe data in meaningful ways
Example:



Exam scores for a class are the data set
What is the variable of interest?
Can summarize with the class “average.” What
does this tell you?


A measure that describes a data set, such as
the average, is sometimes called a “statistic”
Average gives information about the center of
the data
MEDIAN AND MODE



Two other statistics that give information
about the center of a set of data
Median is the middle value
Mode is most frequent value
MEASURES OF CENTRAL
TENDENCY


Measures that describe the center of a data
set are called: Measures of Central Tendency
Mean, median, and the mode
HYPOTHETICAL DATA SET
2 5 6 7 8 3 9 3 10 4 7 4 6 11 9
Simplest way to organize them is to put in
order:
2 3 3 4 4 5 6 6 7 7 8 9 9 10 11
By inspection they center around 6 or 7
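As a quick check of the ordering step, here is a short Python sketch using the standard `statistics` module on the hypothetical data set above. It also finds the median and the mode; `multimode` is used because several values tie for most frequent in this data set.

```python
import statistics

data = [2, 5, 6, 7, 8, 3, 9, 3, 10, 4, 7, 4, 6, 11, 9]

ordered = sorted(data)
median = statistics.median(ordered)      # middle value of the ordered list
modes = statistics.multimode(ordered)    # every value tied for most frequent

print(ordered)
print("median:", median)   # 6
print("modes:", modes)     # [3, 4, 6, 7, 9]
```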
MEAN
Mean is basically the same as the average
Add all the numbers together and divide by the
number of values
2 3 3 4 4 5 6 6 7 7 8 9 9 10 11
What is the mean for this data set?
NOMENCLATURE



Mean = 6.3 = $\bar{X}$, read “X bar”
The observations are called $X_1$, $X_2$, etc.
There are 15 observations in this example, so the
last one is $X_{15}$
$$\bar{X} = \frac{\sum X_i}{n}$$
where n = number of values
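The formula for the mean translates directly into code. A minimal sketch (the function name `mean` is just a label chosen here):

```python
def mean(values):
    """Sample mean: the sum of the observations divided by their count n."""
    n = len(values)
    return sum(values) / n

data = [2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8, 9, 9, 10, 11]
x_bar = mean(data)
print(round(x_bar, 1))  # 6.3
```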
EXAMPLE

Data set
2 3 3 4 5 6 7 8 9
What is the mode?
What is the median?
MEAN OF A POPULATION
VERSUS THE MEAN OF A
SAMPLE




Statisticians distinguish between the mean of
a sample and the mean of a population
The sample mean is $\bar{X}$
The population mean is μ
It is rare to know the population mean, so the
sample mean is used to represent it
DISPERSION

Data sets A and B both have the same
average:
A 4 5 5 5 6 6
B 1 2 4 7 8 9

But are not the same:


A is more clumped around the
central value
B is more dispersed, or spread out
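A short sketch confirming that A and B share a mean but differ in spread. As a simple stand-in for spread it uses the gap between the lowest and highest values.

```python
import statistics

a = [4, 5, 5, 5, 6, 6]
b = [1, 2, 4, 7, 8, 9]

for name, values in (("A", a), ("B", b)):
    mean = statistics.mean(values)
    spread = max(values) - min(values)   # highest value minus lowest value
    print(f"{name}: mean = {mean:.2f}, spread = {spread}")
```

Both means come out the same, while B's values span a much wider interval than A's.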
MEASURES OF DISPERSION


Measures of central tendency do not describe
how dispersed a data set is
Measures of dispersion do; they describe
how much the values in a data set vary from
one another
MEASURES OF DISPERSION

Common measures of dispersion are:




Range
Variance
Standard deviation
Coefficient of variation
CALCULATIONS OF
DISPERSION


Measures of dispersion, like measures of
central tendency, are calculated
Range is the difference between the lowest
and highest values in a data set
Example:
2 3 3 4 4 5 6 6 7 7 8 9 9 10 11
Range: 11 - 2 = 9, or 2 to 11
Range is not particularly informative because
it is based on only two values from the data
set
CALCULATING VARIANCE AND
STANDARD DEVIATION


Variance and standard deviation are measures of
the average amount by which each
observation varies from the mean
Example:
4cm 5cm 6cm 7cm 7cm 7cm 9cm 11cm
Data set, lengths of 8 insects
4cm 5cm 6cm 7cm 7cm 7cm 9cm 11cm



The mean is 7 cm
How much do they vary from one another?
Intuitively might see how much each point
varies from the mean

This is called the deviation
CALCULATION OF
DEVIATIONS FROM MEAN
4cm 5cm 6cm 7cm 7cm 7cm 9cm 11cm

Value - Mean    Deviation (in cm)
(4-7)           -3
(5-7)           -2
(6-7)           -1
(7-7)            0
(7-7)            0
(7-7)            0
(9-7)           +2
(11-7)          +4

Sum of deviations = 0



Sum of the deviations from the mean is
always zero
Therefore, cannot use the average deviation
Therefore, mathematicians decided to square
each deviation so they will get positive
numbers
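A small sketch showing that the deviations for the insect lengths cancel out when added:

```python
lengths_cm = [4, 5, 6, 7, 7, 7, 9, 11]

mean = sum(lengths_cm) / len(lengths_cm)        # 7.0
deviations = [x - mean for x in lengths_cm]     # value minus mean

print(deviations)        # [-3.0, -2.0, -1.0, 0.0, 0.0, 0.0, 2.0, 4.0]
print(sum(deviations))   # 0.0
```

The negative and positive deviations offset each other, which is why the plain average deviation is not useful.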
Value - Mean    Deviation (in cm)    Squared Deviation
(4-7)           -3                    9 cm²
(5-7)           -2                    4 cm²
(6-7)           -1                    1 cm²
(7-7)            0                    0
(7-7)            0                    0
(7-7)            0                    0
(9-7)           +2                    4 cm²
(11-7)          +4                   16 cm²

Total squared deviation = sum of squares = 34 cm²
VARIANCE

Total squared deviation (sum of squares)
divided by the number of measurements:
$$\frac{34\ \text{cm}^2}{8} = 4.25\ \text{cm}^2$$
STANDARD DEVIATION

Square root of the variance:
$$\sqrt{4.25\ \text{cm}^2} = 2.06\ \text{cm}$$


Note that the SD has the same units as the data
Note also that the larger the variance and SD, the
more dispersed are the data
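The variance and SD calculation for the eight insect lengths can be reproduced in a few lines. This sketch follows the steps above, dividing the sum of squares by the number of measurements:

```python
import math

lengths_cm = [4, 5, 6, 7, 7, 7, 9, 11]

n = len(lengths_cm)
mean = sum(lengths_cm) / n
sum_of_squares = sum((x - mean) ** 2 for x in lengths_cm)

variance = sum_of_squares / n      # units: cm squared
sd = math.sqrt(variance)           # units: cm, same as the data

print(sum_of_squares)   # 34.0
print(variance)         # 4.25
print(round(sd, 2))     # 2.06
```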
VARIANCE AND SD OF
POPULATION VS SAMPLE



Statisticians distinguish between the mean
and SD of a population and a sample
The variance of a population is called sigma
squared, $\sigma^2$
Variance of a sample is $S^2$


The standard deviation of a population is
called sigma, σ
Standard deviation of a sample is S or SD
STANDARD DEVIATION OF A
SAMPLE
$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
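A sketch of the sample standard deviation, which divides by n - 1 rather than n. The function name `sample_sd` is chosen here for illustration. Python's standard library provides both versions: `statistics.stdev` uses n - 1 and `statistics.pstdev` uses n.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation: divide the sum of squares by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    sum_of_squares = sum((x - x_bar) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - 1))

lengths_cm = [4, 5, 6, 7, 7, 7, 9, 11]
print(round(sample_sd(lengths_cm), 2))          # 2.2
print(round(statistics.stdev(lengths_cm), 2))   # 2.2  (also divides by n - 1)
print(round(statistics.pstdev(lengths_cm), 2))  # 2.06 (divides by n)
```

For the insect data, dividing by n - 1 gives a slightly larger SD than dividing by n.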
EXAMPLE PROBLEM
A biotechnology company sells cultures of E.
coli. The bacteria are grown in batches that
are freeze dried and packaged into vials.
Each vial is expected to have 200 mg of
bacteria. A QC technician tests a sample of
vials from each batch and reports the mean
weight and SD.

Batch Q-21 has a mean weight of 200 mg
and an SD of 12 mg. Batch P-34 has a mean
weight of 200 mg and an SD of 4 mg. Which
batch appears to have been packaged in a more
controlled fashion?
ANSWER

The SD can be interpreted as an indication of
consistency. The SD of the weights of Batch
P-34 is lower than of Batch Q-21. Therefore,
the weights for vials for Batch P-34 are less
dispersed than those for Batch Q-21 and
Batch P-34 appears to have been better
controlled.
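The two batches can also be compared with the coefficient of variation, listed earlier among the measures of dispersion. These slides do not define it; the sketch below assumes the common definition, SD divided by the mean and expressed as a percentage.

```python
def coefficient_of_variation(mean, sd):
    """Relative dispersion: SD as a percentage of the mean (assumed definition)."""
    return sd / mean * 100

batches = {"Q-21": (200, 12), "P-34": (200, 4)}  # (mean in mg, SD in mg)

for name, (mean_mg, sd_mg) in batches.items():
    cv = coefficient_of_variation(mean_mg, sd_mg)
    print(f"Batch {name}: SD = {sd_mg} mg, CV = {cv:.1f}%")
```

Because both batches have the same mean, the CV ranks them the same way the SD does: P-34 is the more consistent batch.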
FREQUENCY DISTRIBUTIONS


So far, talked about calculations to describe
data sets
Now talk about graphical methods
TABLE 5
THE WEIGHTS OF 175 FIELD MICE
(in grams)
19 21 19 20 19 20 22 22 23 21
20 22 25 25 24 26 22 21 24 20
24 20 22 22 21 20 22 21 22 26
20 22 21 23 21 21 21 21 23 22
21 22 21 22 20 20 20 21 23 22
25 21 21 22 23 20 22 19 23 22
21 23 23 21 23 21 24 22 23 25
22 23 22 24 24 25 21 22 22 19
22 24 19 24 22 23 20 21 22 24
25 21 25 21 23 23 23 21 19 19
24 21 23 20 20 20 24 26 20 23
19 24 22 22 22 24 20 21 18 23
21 22 21 23 28 21 26 21 21 21
21 22 27 21 19 27 24 19 23 25
20 22 24 24 22 22 20 23 22 23
22 22 25 20 25 17 22 23 21 22
20 23 24 20 20 23 22 23 20 20
22 24 23 22
FREQUENCY DISTRIBUTION TABLE OF THE
WEIGHTS OF FIELD MICE

Weight (g)    Frequency
17            1
18            1
19            11
20            25
21            34
22            40
23            27
24            19
25            10
26            4
27            2
28            1
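A frequency table like this one can be built by counting how often each value occurs. A minimal sketch using `collections.Counter`, applied to only the first ten weights listed in Table 5 to keep it short:

```python
from collections import Counter

# First ten weights (g) listed in Table 5, used as a small illustration.
weights = [19, 21, 19, 20, 19, 20, 22, 22, 23, 21]

frequency = Counter(weights)   # maps each weight to the number of times it occurs

print("Weight (g)  Frequency")
for weight in sorted(frequency):
    print(f"{weight:<11} {frequency[weight]}")
```

Running the same count over all the recorded weights would give the full table.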
FREQUENCY TABLE


Tells us that most mice have weights in the
middle of the range, a few are lighter or
heavier
The word distribution refers to a pattern of
variation for a given variable


It is important to be aware of patterns, or
distributions, that emerge when data are
organized by frequency
The frequency distribution can be illustrated
as a frequency histogram
FREQUENCY HISTOGRAM




X axis is units of measurement, in this
example, weight in grams
Y axis is the frequency of a particular value
For example, 11 mice weighed 19 g
The values for these 11 mice are illustrated
as a bar


Note that when the mouse data were
collected, a mouse recorded as 19 grams
actually weighed between 18.5 g and 19.4 g.
Therefore the bar spans an interval of 1 gram
FIRST FOUR BARS
[Figure: first four bars of the frequency histogram. X axis: weights in grams (17, 18, 19, 20). Y axis: frequency.]
CONSTRUCTING A
FREQUENCY HISTOGRAM




Divide the range of the data into intervals
It is simplest to make each interval (class) the
same width
No set rule as to how many intervals to have
For example, length data might be 0-9 cm,
10-19 cm, 20-29 cm and so on



Count the number of observations that are in
each interval
Make a frequency table with each interval
and the frequency of values in that interval
Label the axes of a graph with the intervals
on the X axis and the frequency on the Y axis



Draw in bars where the height of a bar
corresponds to the frequency of the value
Center the bars above the midpoint of the
class interval
For example, if the interval is 0-9 cm, then
the bar should be centered at 4.5 cm
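The construction steps can be sketched in Python as a text histogram. The length values below are made up for illustration, and the interval width of 10 matches the 0-9 cm example.

```python
from collections import Counter

# Hypothetical lengths in cm (made up for illustration).
lengths_cm = [3, 7, 12, 15, 15, 18, 21, 24, 8, 11, 27, 14]
width = 10  # every interval (class) has the same width

# Map each value to the lower edge of its interval: 0, 10, 20, ...
counts = Counter((x // width) * width for x in lengths_cm)

for lower in sorted(counts):
    upper = lower + width - 1
    midpoint = (lower + upper) / 2      # e.g. 4.5 for the 0-9 cm interval
    bar = "#" * counts[lower]           # bar length = frequency in that interval
    print(f"{lower:>2}-{upper:<2} cm (mid {midpoint:>4}): {bar}")
```

Each row plays the role of one bar: its label is the interval, and its length is the number of observations in that interval.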
NORMAL FREQUENCY
DISTRIBUTION

If weights of very many lab mice were
measured, would likely have a frequency
distribution that looks like a bell shape, also
called the “normal distribution”
NORMAL DISTRIBUTION
[Figure: bell-shaped normal distribution curve. X axis: weight. Y axis: frequency.]
NORMAL DISTRIBUTION


Very important
Examples:


Heights of humans
Repeated measurements of the same thing
tend to have this distribution
CALCULATIONS AND
GRAPHICAL METHODS



Related
The center of the peak of a normal curve is
the mean, the median and the mode
Values are evenly spread out on either side
of that high point



The width of the normal curve is related to
the SD
The more dispersed the data, the higher the
SD and the wider the normal curve
The exact relationship is in the text; we will not go into it this
semester
EXAMPLE PROBLEM
A technician customarily performs a certain assay.
The results of 8 typical assays are:
32.0 mg 28.9 mg 23.4 mg 30.7 mg
23.6 mg 21.5 mg 29.8 mg 27.4 mg
a. If the technician obtains a value of 18.1 mg,
should he be concerned? Base your answer on
estimation.
b. Perform statistical calculations to see if the value
is outside the range of two SDs.
ANSWER


The average appears to be in the
mid-twenties, and the values vary around it by about ± 5.
Therefore, 18.1 mg appears a bit low.
Mean = 27.16 mg, SD = 3.87 mg. The mean
minus 2 SD is 19.4 mg, so 18.1 mg appears to be
outside the range and should be investigated
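The calculation in part b can be checked with Python's `statistics` module. This sketch uses `statistics.stdev`, the sample SD (n - 1), which reproduces the SD quoted in the answer.

```python
import statistics

assays_mg = [32.0, 28.9, 23.4, 30.7, 23.6, 21.5, 29.8, 27.4]
new_value_mg = 18.1

mean = statistics.mean(assays_mg)
sd = statistics.stdev(assays_mg)          # sample SD (n - 1)
lower, upper = mean - 2 * sd, mean + 2 * sd

print(f"mean = {mean:.2f} mg, SD = {sd:.2f} mg")
print(f"mean +/- 2 SD: {lower:.1f} to {upper:.1f} mg")
print("outside range:", not (lower <= new_value_mg <= upper))
```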