Download SciMethod _ Stat - Napa Valley College

The Scientific Method and
Basic Statistics
Objectives:
Understand the steps in the Scientific Method
Be able to describe basic statistical parameters and how they relate
to the Normal (Gaussian) Distribution Model
Be able to explain how hypotheses are tested; supported or
rejected.
What Do Scientists Do?
•Scientists collect data and develop
theories, models, and laws about how
nature works.
Science searches for natural causes
to explain natural phenomena
1. Purpose of science
a. to determine cause and effect
b. to gain insight into natural events
2. Science does not include “absolutes”
3. Science provides tentative explanations of
natural phenomena
4. Fundamental basis of science:
The Principle of Uncertainty
“Science cannot prove anything, nor is it a search for the
‘truth’.”
1. Science develops tentative answers for guesses (hypotheses)
based on evidence
2. Theory - when supporting evidence is very strong!
Science Is a Search for Order
in Nature
Identify a problem
Find out what is known about the problem
Ask a question to be investigated
Gather data through experiments
Propose a scientific hypothesis
Science Is a Search for Order
in Nature
Make testable predictions
Keep testing and making observations
Accept or reject the hypothesis
Scientific theory: well-tested and widely
accepted hypothesis
Characteristics of Science…and Scientists
Curiosity
Skepticism
Reproducibility
Peer review
Openness to new ideas
Critical thinking
Creativity
Observation: Nothing happens when I try
to turn on my flashlight.
Question: Why didn’t the light come on? Are
the batteries dead?
Hypothesis: Maybe the batteries are dead.
Test hypothesis with an experiment: Put in
new batteries and try to turn on the flashlight.
Result: Flashlight still does not work.
New hypothesis: Maybe the bulb is burned out.
Experiment: Put in a new bulb.
Result: Flashlight works.
Conclusion: New hypothesis is verified.
Fig. 2-3, p. 33
Concept 1.1
Connections in Nature
Observation of Pacific tree frogs
suggested that a parasite can cause
deformities.
Small glass beads implanted in
tadpoles to mimic the effect of cysts
of Ribeiroia ondatrae, a trematode
flatworm, also produced deformities.
Concept 1.1
Connections in Nature
Further studies:
• Deformities of Pacific tree frogs
occurred only in ponds that also
had an aquatic snail, Helisoma
tenuis, an intermediate host of the
parasite.
• All frogs with deformed limbs had
Ribeiroia cysts.
Figure 1.3 The Life Cycle of Ribeiroia
1. Observation
• The awareness of a natural event or natural
phenomenon directly or indirectly by means of
our senses.
Observation:
North facing slopes have heavier tree growth
than south facing slopes
Observation: North facing slopes have heavier tree
growth than south facing slopes
Possible Questions:
• What causes trees to grow more abundantly on north facing
slopes?
Question both relevant and testable, but very general.
• What causes the slope to be north facing?
Probably not relevant.
• Did Martians plant these trees 10,000 years ago?
Probably not testable.
• Is evaporation of water less on north facing slopes than south
facing slopes?
More relevant and to the point.
Observation: North facing slopes have heavier tree growth
than south facing slopes
Question: Is evaporation of water less on north facing slopes
than south facing slopes?
3. Hypothesis:
A guess postulating an answer to the question
Must be relevant and testable
Bias
My idea is so logical, so reasonable, and it sounds
so right, it must be correct
Where is the supporting evidence?
Observation: North facing slopes have heavier tree growth
than south facing slopes
Question: Is evaporation of water less on north facing slopes than south
facing slopes?
Hypothesis:
Evaporation is greater on south facing slopes
than north facing slopes.
4. Experiment
•Additional observations gathered to test the
hypothesis.
Observation: North facing slopes have heavier tree growth
than south facing slopes
Question: Is evaporation of water less on north facing slopes than
south facing slopes?
Hypothesis: Evaporation is greater on south facing slopes than
north facing slopes.
Experiment:
Test evaporation using a sling psychrometer.
Experimental Difficulties
• Bias
• Experimental Errors
• Sample Size
What are the odds of flipping:
• 5 heads in a row?
2^-5 = 1/32
• 10 heads in a row?
2^-10 = 1/1024
• 100 heads in a row?
2^-100 ≈ 1/(1.27 × 10^30), or 1 in
1,270,000,000,000,000,000,000,000,000,000
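The coin-flip odds above follow from multiplying 1/2 by itself once per flip. A minimal sketch, assuming a fair coin and independent flips (the function name `prob_all_heads` is only illustrative):

```python
from fractions import Fraction


def prob_all_heads(n_flips: int) -> Fraction:
    """Exact probability of n_flips heads in a row with a fair coin: (1/2)**n."""
    return Fraction(1, 2) ** n_flips


for n in (5, 10, 100):
    p = prob_all_heads(n)
    # The denominator is the "1 in ..." figure quoted on the slide.
    print(f"{n} heads in a row: 1 in {p.denominator:,} (about {float(p):.3g})")
```

Exact fractions are used so that the 100-flip case is not rounded before it is printed.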
Charlie
Charlie’s Sick
Diagnosis – Fish Ick
Fish Ick Medicine
Controlled Experiment
•Run two side-by-side experiments
1. No change
2. Change one experimental variable only
Controlled Study
Experimental Group
Conditions Identical Except
Fish ick medicine
How many of each?
~50 experimental fish
Control Group
no medicine
~50 control fish
5. Evaluation – Conclusions
• Analyze the results of the experiment
How many of each lived?
50 Experimental Fish: Live 40 / 50
50 Control Fish: Live 10 / 50
Conclusion – Medication helps
50 Experimental Fish: Live 40 / 50
50 Control Fish: Live 32 / 50
Conclusion – Not clear if medication helps
5. Evaluation
• When results are close the sample size is critical.
How many fish should be used?
Inconclusive result if 100 fish are used (difference = 1/256 chance)
Experimental Fish: Live 40 / 50
Control Fish: Live 32 / 50
More conclusive result if 1000 fish are used (difference = 1/(1.21 × 10^30) chance)
Experimental Fish: Live 400 / 500
Control Fish: Live 320 / 500
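The effect of sample size can also be checked with a standard test. The sketch below does not reproduce the chance figures quoted on the slide; it assumes a pooled two-proportion z-test, which is one common choice and is not named in the slides. The function name is illustrative.

```python
import math


def two_proportion_p_value(live_a: int, n_a: int, live_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in survival rates (pooled z-test)."""
    p_a, p_b = live_a / n_a, live_b / n_b
    pooled = (live_a + live_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


p_100_fish = two_proportion_p_value(40, 50, 32, 50)
p_1000_fish = two_proportion_p_value(400, 500, 320, 500)
print(f"100 fish:  p = {p_100_fish:.3f}")
print(f"1000 fish: p = {p_1000_fish:.2e}")
```

With the same survival rates (80% vs 64%), the 100-fish result stays above the usual 0.05 cutoff while the 1000-fish result falls far below it.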
Statistical Approach to Science
• How does science develop theories?
• A theory is a hypothesis which is solidly supported by
evidence. Support for hypotheses comes from statistics
• Using a sample, the mean of an experimental
population can be determined along with other
statistical parameters
• The absolute “true mean” (denoted as μ) cannot be
determined. Instead, we estimate a mean (x̄) for our
sample population.
• We can estimate a confidence interval in which the true
mean of the population lies at a given level of
probability
• This honors the Uncertainty Principle in Science
Statistical Method
• There is a high degree of variability in living things:
cells, organisms, populations
• Sample – a portion of a population must be sufficiently
large, but obtained randomly
• Random selection reduces bias
“Normal” Distribution
The line of a bell-shaped curve reveals continuous variation in the population
[Figure: number of individuals with some value of the trait (vertical axis) against range of values for the trait (horizontal axis); Fig. 8-14a and Fig. 8-14b, p.120]
Statistics
• Summation Notation and Symbols
• i is the index variable, or counter. The index variable is
used to identify each observed value.
• n is the number of observations
• Xi is the variable of interest for observation number i.
• ∑ is sigma (Greek capital S)
• This means to add, or sum, all observations of variable X
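• Mean
$$\bar{x} = \frac{1}{N} \sum x_i$$
• Variance
$$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$$
• Standard deviation
$$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{\sum x_i^2 - N\bar{x}^2}{N - 1}}$$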
Arithmetic Mean
• Mean is the average value of observations
• Determined by adding up all values then dividing by the
number of observations
• The mean represents an estimate of the absolute “true mean”,
denoted with a Greek lowercase mu (μ)
$$\bar{x} = \frac{1}{N} \sum x_i$$
Variance
Variance is an estimate of the spread of values in our
observations
• Obtained by summing the squares of the differences
between individual values and the mean, then dividing by
the number of observations minus one.
• Again, this is an estimate of the “true variance” (σ²)
$$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$$
Standard deviation
Standard deviation is another estimate of the spread of values in
relation to the mean. Again, this is an estimate of the “true
deviation” (σ), represented by a lowercase Greek sigma
Simply calculated as the square root of the variance
$$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{\sum x_i^2 - N\bar{x}^2}{N - 1}}$$
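The mean, variance, and standard deviation can be worked through on a small data set. The sketch below uses five made-up observations (hypothetical values, not from the slides) and checks that the definitional and shortcut forms of the standard deviation agree.

```python
import math
import statistics

values = [4.0, 7.0, 6.0, 5.0, 8.0]  # hypothetical observations x_i
n = len(values)

# Mean: sum of all observations divided by their number.
mean = sum(values) / n

# Sample variance: squared deviations from the mean, divided by N - 1.
variance = sum((x - mean) ** 2 for x in values) / (n - 1)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

# Shortcut form: (sum of x_i^2 - N * mean^2) / (N - 1), then square root.
shortcut = math.sqrt((sum(x * x for x in values) - n * mean ** 2) / (n - 1))

print(mean, variance, round(std_dev, 4), round(shortcut, 4))
print(statistics.stdev(values))  # the standard library uses the same N - 1 divisor
```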
Confidence Interval
A CI gives a range of values that is expected, at a stated level of
confidence, to contain the true population value; our sample
estimate sits at the center of the range
It also provides our level of confidence for rejecting or
failing to reject a null hypothesis
$$CI = (\bar{X}_1 - \bar{X}_2) \pm t \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
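As a rough check of the confidence interval for a difference of two means, the sketch below plugs in the summary statistics from the two-sample t-test output that appears later in this transcript (groups None and Shade). The critical value `T_CRIT_95 = 1.9737` is an assumption: it approximates the two-sided 95% t value for about 174 degrees of freedom and would normally come from a t table or statistics package.

```python
import math


def ci_difference_of_means(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Return (low, high) for (mean1 - mean2) +/- t * sqrt(s1^2/n1 + s2^2/n2)."""
    diff = mean1 - mean2
    std_err = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    margin = t_crit * std_err
    return diff - margin, diff + margin


T_CRIT_95 = 1.9737  # assumed critical t value (two-sided 95%, df of about 174)

low, high = ci_difference_of_means(
    16.55697, 2.60453, 116,   # group "None": mean, SD, N
    14.57568, 2.03032, 287,   # group "Shade": mean, SD, N
    T_CRIT_95,
)
print(f"95% CI for the difference in means: {low:.3f} to {high:.3f}")
```

The interval excludes zero, which is consistent with rejecting the null hypothesis of equal means at the 95% level.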
Confidence Level
• In biology the level of confidence used is usually 95%.
• This means there is a 5% chance that our conclusion is in error!
Confidence Level
95% Confidence interval:
95% of data will be contained within non-shaded area of curve
Fig. 8-15, p.121
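For a Normal (Gaussian) model, the non-shaded 95% region can be located numerically. A minimal sketch, assuming a standard normal curve (mean 0, standard deviation 1) and Python 3.8 or later for `statistics.NormalDist`:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Leaving 2.5% in each tail puts the upper cutoff at the 97.5th percentile.
z_95 = standard_normal.inv_cdf(0.975)

# Fraction of the curve's area between -z_95 and +z_95.
coverage = standard_normal.cdf(z_95) - standard_normal.cdf(-z_95)

print(f"cutoff = +/- {z_95:.3f} standard deviations, coverage = {coverage:.3f}")
```

This gives the familiar rule that about 95% of normally distributed values lie within roughly 1.96 standard deviations of the mean.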
T-test determines probability that two data sets are from a
single population
Hypotheses
Ho: µ1 = µ2
H1: µ1 ≠ µ2
After conducting a t-test, we would reject the null
hypothesis; the two means are not equal
In this example we can visually see a significant
difference between two means.
[Figure: paired histograms of N against Count for TAXON groups Pelv and Porph]
Null vs Alternate Hypotheses
• Null Hypothesis
Ho: µ1 = µ2
• By default, the null hypothesis is that there is no significant difference between our
two sample means.
• Alternate Hypothesis
H1: µ1 ≠ µ2
Decision Rule
If the p-value is less than alpha, reject the null hypothesis
If the p-value is greater than or equal to alpha, fail to reject the null hypothesis
t Test
Decision Rule
If the p-value is less than alpha, reject the null
hypothesis (two means are not equal)
If the p-value is greater than or equal to alpha,
fail to reject the null hypothesis
Two-sample t-test on TEMP grouped by TREATMENT$ against Alternative = 'not equal'

Group    N      Mean        SD
None     116    16.55697    2.60453
Shade    287    14.57568    2.03032

Separate variance:
Difference in means = 1.98130
95.00% CI = 1.44862 to 2.51398
t = 7.34105
df = 174.2
p-value = 0.00000

Pooled variance:
Difference in means = 1.98130
95.00% CI = 1.50322 to 2.45937
t = 8.14733
df = 401
p-value = 0.00000

[Figure: paired histograms of TEMP (vertical axis, 10 to 25) against Count for TREATMENT groups None and Shade]
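The t and df values in the output above can be reproduced from the group summary statistics alone. The sketch below assumes the "separate variance" result is the Welch form (with Welch-Satterthwaite degrees of freedom) and the "pooled variance" result is the classic equal-variance form; the function names are illustrative. The p-value is not computed, since that needs the t distribution from a table or statistics package.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Separate-variance t statistic and Welch-Satterthwaite df."""
    a, b = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic and its df (n1 + n2 - 2)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


none_group = (16.55697, 2.60453, 116)   # mean, SD, N for "None"
shade_group = (14.57568, 2.03032, 287)  # mean, SD, N for "Shade"

t_sep, df_sep = welch_t(*none_group, *shade_group)
t_pool, df_pool = pooled_t(*none_group, *shade_group)
print(f"separate variance: t = {t_sep:.3f}, df = {df_sep:.1f}")
print(f"pooled variance:   t = {t_pool:.3f}, df = {df_pool}")
```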
Comparing more than two means
•T-tests work when we want to determine the
equality of two means.
•What if we have 3 or more sample
populations to compare?
•There are additional statistical analyses
performed on more than two populations,
but they depend on the type of data and on
the question we’re asking
•Typically results in models
Types of Data
• Categorical - qualitative data that fall into distinct
categories. Further divided into two types:
  • Nominal - descriptive (color, gender)
  • Ordinal - where order is important (mature, immature)
• Numerical - quantitative, measured numerical
observations, also subdivided into two types:
  • Discrete - only certain values are possible (number of
  seeds, offspring, etc.)
  • Continuous - any value within an interval is possible and
  limited only by the resolution of the measuring device
  (height, weight, concentration, temperature)
The General Linear Model
• Used for comparing multiple populations or data sets
• Analysis of variance - like a t-test on 3 or more groups
• Correlation - tests whether two variables are correlated
(display a linear relationship)
• Regression analysis - once correlation is established,
determines how well an independent variable (x-axis)
predicts the value of a dependent variable (y-axis)
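To illustrate how analysis of variance extends the two-group comparison, the sketch below computes a one-way ANOVA F statistic by hand. The three groups are made-up numbers (hypothetical, not data from the slides), and the function name is illustrative. Turning F into a p-value requires the F distribution, which is left to a statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


hypothetical_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(hypothetical_groups)
print(f"F = {f_stat:.2f} on {df_between} and {df_within} degrees of freedom")
```

A large F means the group means differ by more than the scatter within groups would suggest.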
Analysis of Variance (ANOVA)
[Figure: two panels of Least Squares Means of TEMP (vertical axis, 10 to 19) by SITE (HCN, HCS, LP, MP)]
General Linear Model
Regression on continuous variables
[Figure: NDVI vs Leaf Chlorophyll; NDVI (vertical axis, 0 to 0.6) against Chlorophyll mg/cm² (horizontal axis, 0 to 120); R² = 0.8114]
ANOVA
• Sometimes data must be reclassified
• Here, we measured actual
concentrations of pesticide
(continuous data), but had
to run an ANOVA as if the
data were categorical
• This was decided by peers
reviewing our manuscript
for publication
General Linear Model: Linear Regression
• A data set has values y_i, each of which has an associated modeled value f_i (also
sometimes referred to as ŷ_i). Here, the values y_i are called the observed values and
the modeled values f_i are sometimes called the predicted values.
• The "variability" of the data set is measured through different sums of squares:
  • the total sum of squares (proportional to the sample variance): $SS_{tot} = \sum_i (y_i - \bar{y})^2$
  • the regression sum of squares, also called the explained sum of squares: $SS_{reg} = \sum_i (f_i - \bar{y})^2$
  • the sum of squares of residuals, also called the residual sum of squares: $SS_{res} = \sum_i (y_i - f_i)^2$
  In the above, $\bar{y}$ is the mean of the observed data: $\bar{y} = \frac{1}{n} \sum_i y_i$
• The most general definition of the coefficient of determination is $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
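A small worked example may help connect the sums of squares to R². The sketch below fits an ordinary least-squares line to five made-up points (hypothetical data, not the NDVI measurements) and computes R² as one minus the residual sum of squares over the total sum of squares. The function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(observed) / len(observed)
    ss_tot = sum((y - y_bar) ** 2 for y in observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1 - ss_res / ss_tot


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical independent variable
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical observed values y_i

slope, intercept = fit_line(xs, ys)
predicted = [slope * x + intercept for x in xs]  # modeled values f_i
r2 = r_squared(ys, predicted)
print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r2:.2f}")
```

An R² near 1 means the fitted line accounts for most of the variability in the observed values; a value near 0 means it accounts for little.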