SciMethod _ Stat - Napa Valley College
The Scientific Method and Basic Statistics

Objectives:
• Understand the steps in the Scientific Method
• Be able to describe basic statistical parameters and how they relate to the Normal (Gaussian) Distribution Model
• Be able to explain how hypotheses are tested: supported or rejected

What Do Scientists Do?
• Scientists collect data and develop theories, models, and laws about how nature works.
• Science searches for natural causes to explain natural phenomena.
1. Purpose of science
   a. to determine cause and effect
   b. to gain insight into natural events
2. Science does not include “absolutes”
3. Science provides tentative explanations for natural phenomena
4. Fundamental basis of science: The Principle of Uncertainty

“Science cannot prove anything, nor is it a search for the ‘truth’.”
1. Science develops tentative answers or guesses (hypotheses) based on evidence
2. Theory: when supporting evidence is very strong!

Science Is a Search for Order in Nature
• Identify a problem
• Find out what is known about the problem
• Ask a question to be investigated
• Gather data through experiments
• Propose a scientific hypothesis
• Make testable predictions
• Keep testing and making observations
• Accept or reject the hypothesis
• Scientific theory: well-tested and widely accepted hypothesis

Characteristics of Science…and Scientists
• Curiosity
• Skepticism
• Reproducibility
• Peer review
• Openness to new ideas
• Critical thinking
• Creativity

Example (Fig. 2-3, p. 33)
Observation: Nothing happens when I try to turn on my flashlight.
Question: Why didn’t the light come on? Are the batteries dead?
Hypothesis: Maybe the batteries are dead.
Test hypothesis with an experiment: Put in new batteries and try to turn on the flashlight.
Result: Flashlight still does not work.
New hypothesis: Maybe the bulb is burned out.
Experiment: Put in a new bulb.
Result: Flashlight works.
Conclusion: New hypothesis is verified.

Concept 1.1 Connections in Nature
• Observation of Pacific tree frogs suggested that a parasite can cause deformities.
• Small glass beads implanted in tadpoles to mimic the effect of cysts of Ribeiroia ondatrae, a trematode flatworm, also produced deformities.
Further studies:
• Deformities of Pacific tree frogs occurred only in ponds that also had an aquatic snail, Helisoma tenuis, an intermediate host of the parasite.
• All frogs with deformed limbs had Ribeiroia cysts.
[Figure 1.3: The Life Cycle of Ribeiroia]

1. Observation
• The awareness of a natural event or natural phenomenon, directly or indirectly, by means of our senses.
Observation: North-facing slopes have heavier tree growth than south-facing slopes.

2. Question
Possible questions:
• What causes trees to grow more abundantly on north-facing slopes? Both relevant and testable, but very general.
• What causes the slope to be north-facing? Probably not relevant.
• Did Martians plant these trees 10,000 years ago? Probably not testable.
• Is evaporation of water less on north-facing slopes than on south-facing slopes? More relevant and to the point.
Question: Is evaporation of water less on north-facing slopes than on south-facing slopes?

3. Hypothesis
• A guess postulating an answer to the question
• Must be relevant and testable
• Bias: “My idea is so logical, so reasonable, and it sounds so right, it must be correct.” Where is the supporting evidence?
Hypothesis: Evaporation is greater on south-facing slopes than on north-facing slopes.

4. Experiment
• Additional observations gathered to test the hypothesis.
Observation: North-facing slopes have heavier tree growth than south-facing slopes.
Question: Is evaporation of water less on north-facing slopes than on south-facing slopes?
Hypothesis: Evaporation is greater on south-facing slopes than on north-facing slopes.
Experiment: Test evaporation using a sling psychrometer.

Experimental Difficulties
• Bias
• Experimental errors
• Sample size

What are the odds of flipping:
• 5 heads in a row? 2^-5 = 1/32
• 10 heads in a row? 2^-10 = 1/1024
• 100 heads in a row? 2^-100 = 1/(1.27 × 10^30), or 1 in 1,270,000,000,000,000,000,000,000,000,000

Charlie
Charlie’s Sick
Diagnosis: Fish Ick
Fish Ick Medicine

Controlled Experiment
• Run two side-by-side experiments
1. No change
2. Change one experimental variable only

Controlled Study
Conditions identical except for the fish ick medicine. How many of each?
• Experimental group: fish ick medicine, ~50 experimental fish
• Control group: no medicine, ~50 control fish

5. Evaluation: Conclusions
• Analyze the results of the experiment. How many of each lived?

Experimental fish (50)   Control fish (50)   Conclusion
Live 40 / 50             Live 10 / 50        Medication helps
Live 40 / 50             Live 32 / 50        Not clear if medication helps

5. Evaluation
• When results are close, the sample size is critical. How many fish should be used?

Fish used   Experimental fish   Control fish     Result
100         Live 40 / 50        Live 32 / 50     Inconclusive (difference = 1/256 chance)
1000        Live 400 / 500      Live 320 / 500   More conclusive (difference = 1/(1.21 × 10^30) chance)

Statistical Approach to Science
How does science develop theories?
• A theory is a hypothesis that is solidly supported by evidence.
• Support for hypotheses comes from statistics.
• Using a sample, the mean of an experimental population can be estimated, along with other statistical parameters.
• The absolute “true mean” (denoted μ) cannot be determined. Instead, we estimate a mean (x̄) for our sample population.
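Looking back at the sample-size slides, the coin-flip odds and the effect of using more fish can be checked with a short Python sketch. The function names are hypothetical, and the two-proportion z-test used for the fish is a standard textbook approximation chosen here as an assumption; it is not the calculation behind the "1/256" and "1/(1.21 × 10^30)" figures on the slide, which the slide does not explain.

```python
import math
from statistics import NormalDist


def prob_heads_in_a_row(n):
    """Probability of n heads in n flips of a fair coin, i.e. 2**-n."""
    return 0.5 ** n


def two_proportion_p_value(live_1, n_1, live_2, n_2):
    """Two-sided p-value from a pooled two-proportion z-test.

    Assumption: a normal approximation is acceptable for these counts.
    """
    p_1 = live_1 / n_1
    p_2 = live_2 / n_2
    pooled = (live_1 + live_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Coin flips from the slide: 5, 10, and 100 heads in a row.
for flips in (5, 10, 100):
    print(flips, "heads in a row: 1 in", 1 / prob_heads_in_a_row(flips))

# Same survival rates (80% vs 64%), two different sample sizes.
print("100 fish: p =", two_proportion_p_value(40, 50, 32, 50))
print("1000 fish: p =", two_proportion_p_value(400, 500, 320, 500))
```

Under this approximation the 100-fish comparison gives a p-value of roughly 0.07, above the usual 0.05 cutoff, while the 1000-fish comparison with the same survival rates gives a far smaller one. That matches the slide's qualitative point that a close result needs a larger sample.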
We can estimate a confidence interval in which the true mean of the population lies at a given level of probability. This honors the Uncertainty Principle in science.

Statistical Method
• There is a high degree of variability in living things: cells, organisms, populations
• Sample: a portion of a population; it must be sufficiently large and obtained randomly
• Random selection reduces bias

“Normal” Distribution
The line of a bell-shaped curve reveals continuous variation in the population.
[Figures 8-14a and 8-14b, p. 120. Axes: range of values for the trait; number of individuals with some value of the trait]

Statistics: Summation Notation and Symbols
• i is the index variable, or counter. The index variable is used to identify each observed value.
• n is the number of observations (written N in the formulas below)
• Xi is the variable of interest for observation number i
• ∑ is sigma (Greek capital S). It means to add, or sum, all observations of variable X

• Mean
$$\bar{x} = \frac{1}{N}\sum x_i$$
• Variance
$$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$$
• Standard deviation
$$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{\sum x_i^2 - N\bar{x}^2}{N - 1}}$$

Arithmetic Mean
• The mean is the average value of the observations, determined by adding up all values and dividing by the number of observations
• The mean represents an estimate of the absolute “true mean”, denoted with a Greek lower-case mu (μ)
$$\bar{x} = \frac{1}{N}\sum x_i$$

Variance
• Variance is an estimate of the spread of values in our observations
• Obtained by summing the squares of the differences between individual values and the mean, then dividing by the number of observations minus one
• Again, this is an estimate of the “true variance” (σ²)
$$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$$

Standard deviation
Standard deviation is another estimate of the spread of values in relation to the mean.
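The mean, variance, and standard deviation formulas can be checked numerically. The sketch below is a minimal Python version, assuming a small made-up list of measurements (`heights` is hypothetical data, not from the slides); it also confirms that the shortcut form with ∑x² − N·x̄² gives the same variance as the definition.

```python
import math
import statistics


def sample_mean(xs):
    """x-bar = (1/N) * sum(x_i)."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """s^2 = sum((x_i - x-bar)^2) / (N - 1)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def sample_variance_shortcut(xs):
    """Same quantity via (sum(x_i^2) - N * x-bar^2) / (N - 1)."""
    n = len(xs)
    m = sample_mean(xs)
    return (sum(x * x for x in xs) - n * m * m) / (n - 1)


def sample_sd(xs):
    """s = square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# Hypothetical measurements, used only for illustration.
heights = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("mean:", sample_mean(heights))
print("variance:", sample_variance(heights))
print("variance (shortcut):", sample_variance_shortcut(heights))
print("standard deviation:", sample_sd(heights))

# The standard library uses the same N - 1 definitions.
print("statistics.variance:", statistics.variance(heights))
print("statistics.stdev:", statistics.stdev(heights))
```

Dividing by N − 1 instead of N is what makes these sample estimates of the population values rather than descriptions of the sample alone.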
Again, the standard deviation is an estimate of the “true deviation” (σ), represented by a lower-case Greek sigma. It is simply calculated as the square root of the variance:
$$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{\sum x_i^2 - N\bar{x}^2}{N - 1}}$$

Confidence Interval
• A confidence interval (CI) gives the range of values within which the true population value is expected to lie at a stated level of probability, with our sample estimate at the center of the range
• It also provides our level of confidence for rejecting or failing to reject a null hypothesis
$$CI = (\bar{X}_1 - \bar{X}_2) \pm t\,\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Confidence Level
• In biology the level of confidence used is usually 95%.
• This means there is a 5% chance that our conclusion is in error!
• 95% confidence interval: 95% of data will be contained within the non-shaded area of the curve (Fig. 8-15, p. 121)

T-test
• The t-test determines the probability that two data sets are from a single population
Hypotheses
H0: µ1 = µ2
H1: µ1 ≠ µ2
• After conducting a t-test, we would reject the null hypothesis; the two means are not equal
• In this example we can visually see a significant difference between the two means.
[Figure: back-to-back histograms of N by TAXON (Pelv, Porph); horizontal axis: Count]

Null vs Alternate Hypotheses
• Null Hypothesis H0: µ1 = µ2
• By default, the null hypothesis is that there is no significant difference between our two sample means.
• Alternate Hypothesis H1: µ1 ≠ µ2

Decision Rule
• If the p-value is less than alpha: reject the null hypothesis
• If the p-value is greater than or equal to alpha: fail to reject the null hypothesis

t Test
• If the p-value is less than alpha, reject the null hypothesis (the two means are not equal)
• If the p-value is greater than or equal to alpha, fail to reject the null hypothesis

Two-sample t-test on TEMP grouped by TREATMENT$ against Alternative = 'not equal'

Group   N     Mean       SD
None    116   16.55697   2.60453
Shade   287   14.57568   2.03032

Separate variance:
Difference in means = 1.98130
95.00% CI = 1.44862 to 2.51398
t = 7.34105
df = 174.2
p-value = 0.00000

Pooled variance:
Difference in means = 1.98130
95.00% CI = 1.50322 to 2.45937
t = 8.14733
df = 401
p-value = 0.00000

[Figure: back-to-back histograms of TEMP by TREATMENT (None, Shade); horizontal axis: Count]

Comparing more than two means
• T-tests work when we want to determine the equality of two means.
• What if we have 3 or more sample populations to compare?
• There are additional statistical analyses performed on more than two populations, but they depend on the type of data and on the question we’re asking
• Typically results in models

Types of Data
Categorical: qualitative data that fall into distinct categories.
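As an aside on the two-sample t-test output for TEMP by TREATMENT, the reported t statistics and degrees of freedom can be recomputed from the summary table alone. The Python sketch below is a minimal version under stated assumptions: the function names are hypothetical, the group values are copied from the table, and the critical value 1.9737 is an assumed t-table lookup for about 174 degrees of freedom at the 95% level (the standard library has no t distribution).

```python
import math


def welch_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Separate-variance (Welch) t statistic, degrees of freedom, std. error."""
    v_1 = sd_1 ** 2 / n_1
    v_2 = sd_2 ** 2 / n_2
    std_err = math.sqrt(v_1 + v_2)
    t = (mean_1 - mean_2) / std_err
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v_1 + v_2) ** 2 / (v_1 ** 2 / (n_1 - 1) + v_2 ** 2 / (n_2 - 1))
    return t, df, std_err


def pooled_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Pooled-variance t statistic, degrees of freedom, std. error."""
    df = n_1 + n_2 - 2
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))
    return (mean_1 - mean_2) / std_err, df, std_err


def confidence_interval(diff, std_err, t_crit):
    """CI = difference in means +/- t_crit * standard error."""
    return diff - t_crit * std_err, diff + t_crit * std_err


# Summary statistics from the output table: (mean, SD, N).
NONE = (16.55697, 2.60453, 116)
SHADE = (14.57568, 2.03032, 287)
T_CRIT_95 = 1.9737  # assumed table value for df of about 174

t_w, df_w, se_w = welch_t(*NONE, *SHADE)
ci_low, ci_high = confidence_interval(NONE[0] - SHADE[0], se_w, T_CRIT_95)
t_p, df_p, se_p = pooled_t(*NONE, *SHADE)

print("separate variance: t =", t_w, "df =", df_w)
print("95% CI:", ci_low, "to", ci_high)
print("pooled variance: t =", t_p, "df =", df_p)
```

The results should agree with the reported values (t = 7.34105 with df = 174.2, and t = 8.14733 with df = 401) to within rounding of the printed summary statistics. The confidence interval is the difference-in-means formula from the Confidence Interval slide applied to the separate-variance standard error.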
Further divided into two types:
• Nominal: descriptive (color, gender)
• Ordinal: where order is important (mature, immature)
Numerical: quantitative, measured numerical observations, also subdivided into two types:
• Discrete: only certain values are possible (number of seeds, offspring, etc.)
• Continuous: any value within an interval is possible, limited only by the resolution of the measuring device (height, weight, concentration, temperature)

The General Linear Model
• Used for comparing multiple populations or data sets
• Analysis of variance: like a t-test on 3 or more groups
• Correlation: tests whether two variables are correlated (display a linear relationship)
• Regression analysis: once correlation is established, determines how well an independent variable (x-axis) predicts the value of a dependent variable (y-axis)

Analysis of Variance (ANOVA)
[Figure: Least Squares Means of TEMP by SITE (HCN, HCS, LP, MP), two panels]

General Linear Model
Regression on continuous variables
[Figure: NDVI vs Leaf Chlorophyll; x-axis: Chlorophyll (mg/cm^2), y-axis: NDVI; R^2 = 0.8114]

ANOVA
• Sometimes data must be reclassified
• Here, we measured actual concentrations of pesticide (continuous data), but had to run an ANOVA as if the data were categorical
• This was decided by peers reviewing our manuscript for publication

General Linear Model: Linear Regression
A data set has values $y_i$, each of which has an associated modeled value $f_i$ (also sometimes referred to as $\hat{y}_i$). Here, the values $y_i$ are called the observed values and the modeled values $f_i$ are sometimes called the predicted values.
The "variability" of the data set is measured through different sums of squares:
• the total sum of squares (proportional to the sample variance):
$$SS_{tot} = \sum_i (y_i - \bar{y})^2$$
• the regression sum of squares, also called the explained sum of squares:
$$SS_{reg} = \sum_i (f_i - \bar{y})^2$$
• the sum of squares of residuals, also called the residual sum of squares:
$$SS_{res} = \sum_i (y_i - f_i)^2$$
In the above, $\bar{y}$ is the mean of the observed data:
$$\bar{y} = \frac{1}{n}\sum_i y_i$$
The most general definition of the coefficient of determination is
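A commonly used form of the coefficient of determination, supplied here as an assumption and not taken from the slide, is $R^2 = 1 - SS_{res}/SS_{tot}$, built from the total and residual sums of squares named above. The Python sketch below is a minimal illustration of that form; the function name and the example values are hypothetical.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot (assumed standard form).

    observed  -- the y_i values
    predicted -- the modeled values f_i, in the same order
    """
    mean_y = sum(observed) / len(observed)
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1 - ss_res / ss_tot


# Hypothetical observations and two extreme models.
y_obs = [1.0, 2.0, 3.0, 4.0]
perfect_fit = [1.0, 2.0, 3.0, 4.0]   # predictions equal the observations
mean_only = [2.5, 2.5, 2.5, 2.5]     # always predicts the mean of y_obs

print("perfect fit:", r_squared(y_obs, perfect_fit))
print("mean-only model:", r_squared(y_obs, mean_only))
```

A model that reproduces every observation leaves no residual variation and scores 1, while a model that only ever predicts the mean explains none of the variation and scores 0. Values in between, such as the R^2 = 0.8114 reported for NDVI against chlorophyll, indicate the share of variation the model accounts for.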