Lecture 2: Some Basic Statistical Methods
Engr. Dr. Attaullah Shah

Distributions for Sample Data
Data distributions come in two basic varieties. When the values that can be observed are anything within some range, the distribution is said to be continuous. If only certain particular values can be observed, the distribution is said to be discrete.

Levels of measurement: (1) nominal, (2) ordinal, (3) interval or ratio scale.
[Figure: example of continuous data shown on number-line scales.]

Measurements of Location

Mean
Mean = sum of values / n: $\bar{x} = \sum x_i / n$
e.g. lengths of 8 fish larvae at day 3 after hatching: 0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5 mm
mean length = (0.6 + 0.7 + 1.2 + 1.5 + 1.7 + 2.0 + 2.2 + 2.5)/8 = 1.55 mm

Median
Order the values from smallest to largest. When n is odd, the median is the value at position (n + 1)/2; when n is even, it is the mean of the values at positions n/2 and n/2 + 1.
[Figures: positions of the mean and median on a 0 to 4.0 mm scale for differently shaped distributions.]
The median is often reported alongside the mean. The mean is used much more frequently; however, the median is a better measure of central tendency for data with a skewed distribution or outliers.

Measurements of Dispersion

Range
e.g. lengths of 8 fish larvae at day 3 after hatching: 0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5 mm
Range = 2.5 - 0.6 = 1.9 mm (or, from 0.6 to 2.5 mm)

Population Standard Deviation ($\sigma$)
An averaged measurement of the deviation from the mean, $x_i - \bar{x}$, e.g.
five rainfall measurements, whose mean is 7:

Rainfall (mm)    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
12               12 - 7 = 5         25
0                0 - 7 = -7         49
2                2 - 7 = -5         25
5                5 - 7 = -2         4
16               16 - 7 = 9         81
                                    Sum = 184

Population variance: $\sigma^2 = \sum (x_i - \bar{x})^2 / n = 184/5 = 36.8$
Population SD: $\sigma = \sqrt{\sum (x_i - \bar{x})^2 / n} = \sqrt{36.8} = 6.1$

Sample SD (s)
$s = \sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}$
$s = \sqrt{[\sum x_i^2 - (\sum x_i)^2 / n] / (n - 1)}$
Two modifications:
• Dividing $\sum (x_i - \bar{x})^2$ by (n - 1) rather than n gives a better estimate of $\sigma$ ($s^2$ is an unbiased estimate of $\sigma^2$); however, as n increases, the difference between s and $\sigma$ declines rapidly.
• The sum of squared deviations can be calculated as $\sum x_i^2 - (\sum x_i)^2 / n$.

Sample SD (s)
e.g. five rainfall measurements, whose mean is 7:

Rainfall $x_i$ (mm)    $x_i^2$
12                     144
0                      0
2                      4
5                      25
16                     256
$\sum x_i = 35$        $\sum x_i^2 = 429$
$(\sum x_i)^2 = 1225$

$s^2 = [\sum x_i^2 - (\sum x_i)^2 / n] / (n - 1) = [429 - (1225/5)] / (5 - 1) = 46.0$
$s = \sqrt{46.0} = 6.782$

Frequency Distribution
e.g. the particle sizes (µm) of 37 grains from a sample of sediment from an estuary:

8.2  6.3  6.8  6.4  8.1  6.3
5.3  7.0  6.8  7.2  7.2  7.1
5.2  5.3  5.4  6.3  5.5  6.0
5.5  5.1  4.5  4.2  4.3  5.1
4.3  5.8  4.3  5.7  4.4  4.1
4.2  4.8  3.8  3.8  4.1  4.0

Define convenient classes (of equal width) and class intervals, e.g. 1 µm.

e.g. frequency distribution for the size of particles collected from the estuary:

Particle size (µm)    Frequency
3.0 to under 4.0      2
4.0 to under 5.0      12
5.0 to under 6.0      10
6.0 to under 7.0      7
7.0 to under 8.0      4
8.0 to under 9.0      2

Frequency Histogram
[Figure: frequency histogram for the size of particles collected from the estuary; x-axis: particle size (µm) in classes from 3 to <4 up to 8 to <9; y-axis: frequency.]
[Figure: frequency distribution of the height of the students in a class (n = 52: 30 females and 22 males); x-axis: height (cm) in classes from >149-153 up to >181-185; y-axis: frequency.]

Normal Distribution
There are many standard distributions for continuous data. Here, only the normal distribution (also sometimes called the Gaussian distribution) is considered.
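The measures of location and dispersion worked through earlier (the fish-larvae lengths and the rainfall data) can be checked with a short Python sketch. This is only an illustration using the standard `statistics` module; the variable names are assumptions made for the example and are not part of the lecture.

```python
import statistics

# Data copied from the worked examples in the slides.
larvae_mm = [0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5]  # fish larvae lengths
rainfall_mm = [12, 0, 2, 5, 16]                        # rainfall measurements

# Measures of location and range for the larvae lengths.
mean_length = statistics.mean(larvae_mm)
median_length = statistics.median(larvae_mm)   # n is even: mean of the two middle values
length_range = max(larvae_mm) - min(larvae_mm)

# Population formulas divide by n; sample formulas divide by (n - 1).
pop_var = statistics.pvariance(rainfall_mm)
pop_sd = statistics.pstdev(rainfall_mm)
sample_var = statistics.variance(rainfall_mm)
sample_sd = statistics.stdev(rainfall_mm)

# Shortcut form of the sample variance: [sum(x^2) - (sum x)^2 / n] / (n - 1).
n = len(rainfall_mm)
sum_x = sum(rainfall_mm)
sum_x2 = sum(x * x for x in rainfall_mm)
shortcut_var = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(f"mean = {mean_length:.2f} mm, median = {median_length:.2f} mm, "
      f"range = {length_range:.1f} mm")
print(f"population variance = {pop_var:.1f}, population SD = {pop_sd:.1f}")
print(f"sample variance = {sample_var:.1f}, sample SD = {sample_sd:.3f}")
print(f"shortcut sample variance = {shortcut_var:.1f}")
```

The shortcut form and the definitional form give the same sample variance, which is the point of the second "modification" listed for the sample SD.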
This distribution is characterized as being bell-shaped, with most values being near the center of the distribution. There are two parameters to describe the distribution: the mean and the standard deviation, which are often denoted by μ and σ, respectively. There is also a function to describe the distribution in terms of these two parameters, which is referred to as a probability density function (pdf).

Normal curve
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$
[Figures: frequency histograms of height (cm) with a fitted normal curve f(x).]

In general, for all normal distributions, about 68% of values will be in the range μ ± σ, about 95% will be in the range μ ± 2σ, and about 99.7% will be in the range μ ± 3σ.

The normal curve was first expressed on paper (for astronomy) by A. de Moivre in 1733.

The parameters μ and σ determine the position of the curve on the x-axis and its shape.
[Figure: probability density against height (cm) for male and female students.]
[Figure: probability density against X for N(10,1), N(20,1), N(20,2) and N(10,3).]
• Normal distribution: N(μ, σ)
• Probability density function: the area under the curve is equal to 1.

The standard normal curve
μ = 0, σ = 1, and the total area under the curve = 1. Units along the x-axis are measured in σ units.
Figures: (a) for ±1σ, area = 0.6826 (68.26%); (b) for ±2σ, area = 95.44%; (c) the shaded area = 100% - 95.44%.

Application of the Standard Normal Distribution
For example: we have a large data set (e.g. n = 200) of normally distributed suspended solids determinations for a particular site on a river: $\bar{x}$ = 18.3 ppm and s = 8.2 ppm. We are asked to find the probability of a random sample containing 30 ppm suspended solids or more.
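The coverage figures quoted for the normal curve (about 68%, 95% and 99.7% within 1, 2 and 3 standard deviations of the mean) follow from the density function. A minimal Python sketch is given here, assuming the standard identity between the normal cumulative distribution and the error function; the function names `normal_pdf`, `normal_cdf` and `coverage` are illustrative choices, not from the lecture.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density f(x) of N(mu, sigma)."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def coverage(k):
    """Fraction of values within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)

for k in (1, 2, 3):
    print(f"within mu +/- {k} sigma: {coverage(k):.4f}")
```

Because the result depends only on the number of standard deviations k, the same fractions apply to every normal distribution, whatever its μ and σ.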
Application of the Standard Normal Distribution
The standard deviate (or Z value): $Z = (X_i - \mu)/\sigma$
Z = (30 - 18.3)/8.2 = 1.43
From the Z table, the probability of a sample having 30 ppm or more is 0.0764, or 7.64%; i.e. for n = 200, about 15 samples would be expected to have 30 ppm or more.

Central Limit Theorem
As the sample size (n) increases, the means of samples (i.e. subsets or replicate groups) drawn from a population of any distribution will approach the normal distribution. By taking the mean of the means, we smooth out the extreme values within the sets while keeping the mean of the means close to the overall mean. As the number of subsets increases, the standard deviation of the mean of the means is reduced and the frequency distribution becomes very close to the normal distribution.

Inferential statistics: testing the null hypothesis
Inferential = "that may be inferred." Infer = conclude or reach an opinion.
The hypothesis under test, the null hypothesis, will be that Z has been chosen at random from the population represented by the curve.
Frequencies of Z values close to the mean (μ = 0) are high, while frequencies away from the mean decline.
e.g. two values of Z are shown: Z = 1.96 and Z = 2.58. From the table, the corresponding upper-tail probabilities are 0.025 (2.5%) and 0.0049 (0.5%).

Testing of Hypothesis

Samples and Populations
How do we select?
[Figure: Population → Sample of population → Inference. Populations are described by parameters; samples are described by statistics, e.g. $\bar{X}$ of the sample.]

Null and Research Hypotheses
Hypothesis:
• An educated guess
• Reflects the research problem being investigated
• Determines the techniques for testing the research questions
• Should be grounded in theory

Hypotheses Contd.
Research Question → Research Hypothesis → Test
In research we NEVER prove a hypothesis!
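The suspended-solids calculation and the tail probabilities quoted for Z = 1.96 and Z = 2.58 can be reproduced without a printed Z table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); rounding Z to two decimals is an assumption made here only to mimic a table lookup.

```python
from statistics import NormalDist

# Values from the river example in the slides.
mean_ppm = 18.3
sd_ppm = 8.2
threshold_ppm = 30.0
n_samples = 200

std_normal = NormalDist()  # mean 0, standard deviation 1

# Standard deviate, rounded to two decimals as in a Z-table lookup.
z = round((threshold_ppm - mean_ppm) / sd_ppm, 2)

# Upper-tail area: probability of a sample at or above the threshold.
p_exceed = 1.0 - std_normal.cdf(z)
expected_count = n_samples * p_exceed

# Upper-tail areas for the two Z values shown on the curve.
upper_tail = {zc: 1.0 - std_normal.cdf(zc) for zc in (1.96, 2.58)}

print(f"Z = {z:.2f}, P(X >= 30 ppm) = {p_exceed:.4f}")
print(f"expected samples out of {n_samples}: {expected_count:.1f}")
for zc, tail in upper_tail.items():
    print(f"Z = {zc}: upper tail = {tail:.4f}")
```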
Purposes of the Null Hypothesis
Purpose #1: acts as a starting point
• The state of affairs accepted as true in the absence of any other information
• Until a systematic difference is shown, assume that any difference observed is due to chance
• The researcher's job is to eliminate chance factors and evaluate other factors that may contribute to group differences

Null Hypothesis
Purpose #2: provides a benchmark to measure actual outcomes
• How likely is it that outcomes are due to some other factor?
• Helps define the range within which observed differences can reasonably be attributed to chance or to something other than chance

Null Hypotheses
Usually a statement of no difference or no association: an equality.
Sentence: There will be no difference in the pollution level before and after the construction activity.
Symbols: $H_0$: $\mu_{before} = \mu_{after}$, or $H_0$: $\mu_{before} - \mu_{after} = 0$

Research/Alternative Hypotheses
A statement of a relationship between the variables: an inequality.
• May be non-directional (two-tailed)
• May be directional (one-tailed), which is more powerful when the effect lies in the predicted direction, because the one-tailed p-value is half the two-tailed p-value

Non-directional Alternative Hypothesis
Reflects a difference between groups, but the direction of the difference is not specified.
Sentence: There is a difference in pollution level before and after the construction activity.
Symbols: $H_a$: $\mu_{before} \neq \mu_{after}$

Directional Alternative Hypothesis
Reflects a difference between groups, and the direction of the difference is specified.
Sentence: The pollution level after construction will be higher than the pollution level before construction.
Symbols: $H_a$: $\mu_{before} < \mu_{after}$

What Makes a Good Hypothesis?
A good hypothesis:
• is stated in declarative form and not as a question.
• posits an expected relationship between variables.
• reflects the theory or literature on which it is based.
• should be brief and to the point.
• is testable, which means that one can carry out the intent of the question reflected by the hypothesis.

Six Steps of Hypothesis Testing
1. State the null hypothesis.
2. State the alternative hypothesis.
3. Select a level of significance.
4. Collect and summarize the sample data.
5. Refer to a criterion for evaluating the sample evidence.
6. Make a decision to keep/reject the null.

Step 1: State the Null Hypothesis
• States that there is no relationship between the variables.
• Refers to the population.
Examples of the Null Hypothesis
Written: There are no differences in the pre-, mid-, and post-construction pollution levels due to the new project.
Symbols: $\mu_{pre} = \mu_{mid} = \mu_{post}$

Step 2: State the Alternative Hypothesis
• Symbolically referred to as $H_a$
• States the opposite of $H_0$
Examples of the Alternative Hypothesis
Written: There are differences within the pre, mid, and post pollution levels.
Symbols: $\mu_{pre} \neq \mu_{mid} \neq \mu_{post}$ for at least one pair.

Step 3: Select a Level of Significance
• Most researchers select a small number such as 0.001, 0.01, or 0.05. The most common choice is 0.05.
• Otherwise known as the "alpha level": p = 0.05, $\alpha = 0.05$.
• The significance level serves as a scientific cutoff point that determines what decision will be made concerning the null hypothesis.

Type I and Type II Errors
Mistakes can occur:
• Type I error: the mistake of rejecting $H_0$ when the null is actually true. When the level of significance is set at 0.05, the chance of a Type I error is equal to 1 out of 20.
• Type II error: the mistake made if $H_0$ is not rejected when the null is actually false.

Statistical Errors in Hypothesis Testing
Consider court judgments where the accused is presumed innocent until proved guilty beyond reasonable doubt (i.e.
$H_0$ = innocent):

                              Accused is innocent      Accused is guilty
                              ($H_0$ is true)          ($H_0$ is false)
Court's decision: Guilty      Wrong judgement          OK
Court's decision: Innocent    OK                       Wrong judgement

Statistical Errors in Hypothesis Testing
As with court judgments, testing a null hypothesis in statistics is subject to the same kinds of error:

                          If $H_0$ is true    If $H_0$ is false
If $H_0$ is rejected      Type I error        No error
If $H_0$ is accepted      No error            Type II error

Statistical Errors in Hypothesis Testing
e.g. $H_0$ = responses of cancer patients to a new drug and to a placebo are similar
• If $H_0$ is indeed a true statement about a statistical population, it will be concluded (erroneously) to be false 5% of the time (when $\alpha$ = 0.05).
• Rejection of $H_0$ when it is in fact true is a Type I error (also called an $\alpha$ error).
• If $H_0$ is indeed false, our test may occasionally fail to detect this fact, and we accept $H_0$.
• Acceptance of $H_0$ when it is in fact false is a Type II error (also called a $\beta$ error).

Step 4: Collection and Analysis of Sample Data
The summary of the sample data will always lead to a single numerical value, which is referred to as the calculated value (r, t, or F).
The computer calculates the probability of the above value in the form of p = ____.

Step 5: The Criterion for Evaluating the Sample Evidence
Two methods:
• Compare the calculated and critical values.
• Compare the data-based p-value against a preset point on the 0-1 scale on which p must fall (the level of significance).

Step 6: Make a Decision!
Reject the null if the p-value is less than the established level of significance.
• A statistically significant difference was obtained (p < 0.05).
Fail to reject (retain) the null if the p-value is greater than the established level of significance.
• $H_0$ was tenable.
• The null was retained.
• No significant difference was found.
• The result was not statistically significant.
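The two criteria in Step 5 and the decision in Step 6 can be written as a small Python sketch. The function names `decide_by_p_value` and `decide_by_critical_value` are hypothetical, and two choices are assumptions rather than something stated in the slides: a p-value exactly equal to the significance level is treated as "fail to reject", and the critical-value comparison uses the absolute value of the statistic (as for a two-tailed Z test).

```python
def decide_by_p_value(p_value, alpha=0.05):
    """Step 6 using the p-value criterion: reject H0 when p < alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Assumption: p == alpha is treated as not significant.
    return "reject H0" if p_value < alpha else "fail to reject H0"

def decide_by_critical_value(calculated, critical):
    """Step 6 using the critical-value criterion.

    Assumption: the statistic is compared in absolute value, which suits
    a two-tailed Z test and is harmless for non-negative statistics.
    """
    return "reject H0" if abs(calculated) > critical else "fail to reject H0"

print(decide_by_p_value(0.03))               # below alpha = 0.05
print(decide_by_p_value(0.20))               # above alpha = 0.05
print(decide_by_critical_value(2.10, 1.96))  # outside the +/-1.96 limits
print(decide_by_critical_value(-1.50, 1.96)) # inside the +/-1.96 limits
```

Both criteria lead to the same decision for a given test; the p-value form is simply the one most statistical software reports.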
Inferential statistics: testing the null hypothesis
As the curve is symmetrical about the mean, the probability of obtaining a value of Z < -1.96 is also 2.5%, so the total probability of obtaining a value of Z between -1.96 and +1.96 is 95%. Likewise, between Z = -2.58 and Z = +2.58, the total probability is 99%.
We can then state a null hypothesis that a random observation of the population will have a value between -1.96 and +1.96.

Inferential statistics: testing the null hypothesis
If a random observation of Z lies outside the limits -1.96 to +1.96, there are two possibilities: either we have chosen an "unlikely" value of Z, or our hypothesis is incorrect.
Conventionally, when performing a significance test, we make the rule that if a Z value lies outside the range ±1.96, the null hypothesis is rejected and the Z value is termed significant at the 5% level, or $\alpha$ = 0.05 (or p < 0.05); ±1.96 is the critical value of the statistic.
For Z = ±2.58, the value is termed significant at the 1% level.

Chi-square statistics
• Widely used for the analysis of nominal-scale data
• Introduced by Karl Pearson in 1900
• Its theory and application were expanded by him and R. A. Fisher

The $\chi^2$ test: $\chi^2 = \sum (\text{observed freq.} - \text{expected freq.})^2 / \text{expected freq.}$
• Used to take a sample of nominal-scale data and infer whether the population from which it came conforms to a certain theoretical distribution.
• Used to test the $H_0$ that the observations (not the variables) are independent of each other for the population.
• Based on the difference between the actual observed frequencies (not %) and the expected frequencies that would be obtained if the variables were truly independent.

The $\chi^2$ test: $\chi^2 = \sum (\text{observed freq.} - \text{expected freq.})^2 / \text{expected freq.}$
• Used as a measure of how far a sample distribution deviates from a theoretical distribution.
• $H_0$: no difference between the observed and expected frequencies ($H_A$: they are different).
• If $H_0$ is true, then both the difference and the chi-square value will be small.
• If $H_0$ is false, then both measurements will be large, and $H_A$ will be accepted.

Example
In a questionnaire, 259 adults were asked what they thought about cutting air pollution by increasing the tax on vehicle fuel. 113 people agreed with this idea but the rest disagreed. Perform a chi-square test to determine the probability of the results being obtained by chance.

            Agree              Disagree
Observed    113                259 - 113 = 146
Expected    259/2 = 129.5      259/2 = 129.5

$H_0$: Observed = Expected
$\chi^2 = (113 - 129.5)^2/129.5 + (146 - 129.5)^2/129.5 = 2.102 + 2.102 = 4.204$
df = k - 1 = 2 - 1 = 1
From the chi-square table, the critical $\chi^2$ ($\alpha$ = 0.05, df = 1) = 3.841 < calculated $\chi^2$ = 4.204, so 0.025 < p < 0.05.
Therefore, $H_0$ is rejected. The probability of the results being obtained by chance is between 0.025 and 0.05.

Confidence Interval
Confidence limits for a parameter of a distribution give a range within which the parameter is expected to lie. For example, 90% confidence limits for a distribution mean define a range, called a confidence interval, within which the mean is expected to lie 90% of the time, in the sense that if many such intervals are calculated, then about 90% of them will contain the true value of the parameter.

When to use z and when to use t
The z and t distributions are used in confidence intervals. Which one applies is determined by the distribution of $\bar{X}$.
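Returning to the fuel-tax questionnaire example, the chi-square calculation can be reproduced with a few lines of Python. This is a sketch under one stated assumption: for df = 1 the chi-square statistic is the square of a standard normal deviate, so its upper-tail probability can be obtained from `math.erfc` without a chi-square table. The variable names are illustrative.

```python
import math

# Observed counts from the questionnaire example.
total = 259
observed = {"agree": 113, "disagree": total - 113}

# Under H0 the two answers are equally likely.
expected = total / len(observed)

# Chi-square = sum of (observed - expected)^2 / expected over the classes.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

# Critical value for alpha = 0.05, df = 1, as quoted from the table.
critical_05 = 3.841

# Assumption: valid only for df = 1, where P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(f"chi-square = {chi_square:.3f}, df = {df}")
print(f"exceeds critical value {critical_05}: {chi_square > critical_05}")
print(f"p = {p_value:.4f}")
```

The computed p-value falls inside the 0.025 to 0.05 bracket read from the table, which is consistent with rejecting the null hypothesis at the 5% level.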
Use z (with $\sigma_{\bar{X}} = \sigma/\sqrt{n}$) when:
• n is large or sampling is from a normal distribution
• σ is known
Use t (with $s_{\bar{X}} = s/\sqrt{n}$) when:
• n is large or sampling is from a normal distribution
• σ is unknown

General Form of Confidence Intervals
The general form of a confidence interval is:
(Point Estimate) ± (Margin of Error)
or
(Point Estimate) ± ($z_{\alpha/2}$ or $t_{\alpha/2}$) × (Appropriate Standard Error)

Example
The average cost of all required tests for water analysis has gone up because several tests are required. A sample of 41 sources was taken. The average cost of tests for these 41 sources is $86.15. Construct a 95% confidence interval for the average cost of tests, assuming:
1. The standard deviation is $22.
2. The standard deviation is unknown, but the sample standard deviation is $24.77.

Case 1
Because the sample size is > 30, it is not necessary to assume that the costs follow a normal distribution to construct a confidence interval. And because it is assumed that σ is known (to be $22), this will be a z-interval.
$\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ = 86.15 ± 1.96 × 22/√41 = $86.15 ± $6.73 ($79.42 to $92.88)

Case 2
Because the sample size is > 30, it is not necessary to assume that the costs follow a normal distribution to construct a confidence interval. Because it is assumed that σ is unknown, this will be a t-interval with 40 degrees of freedom and s = 24.77.
$\bar{x} \pm t_{\alpha/2,\,40}\,s/\sqrt{n}$ = 86.15 ± 2.021 × 24.77/√41 = $86.15 ± $7.82 ($78.33 to $93.97)

How does one variable respond to changes in another variable?
e.g. growth of lichen vs. air pollution
• Lichen is sensitive to SO2
• Growth is determined by max length
• Pollution is indicated by the distance from a town center (0-10 km)
[Figure: scatter plot for Evernia prunastri; x-axis: distance from the town center (km), 0 to 10; y-axis: max length (mm), 0 to 30.]
• Decreasing thallus size as the town center is approached
• A gap in the data between 4 and 6 km
• Any outlier(s)?
• A statistical technique termed CORRELATION enables us to quantify the relationship between two variables
• Calculation of a correlation coefficient r (range from -1 to +1)
[Figure: scatter plot of max length (mm) against distance from the town center (km).]
• r near +1: positive correlation
• r near 0: no correlation
• r near -1: negative correlation
[Figure: four scatter plots, A to D, of X2 against X1.]

Significant negative correlation between number of trees and number of sick people within individual regions (r = -0.981, p < 0.001). Is this conclusion right???

Regression Analysis
• A simple mathematical expression to provide an estimate of one variable from another
• It is possible to predict the likely outcome of events given sufficient quantitative knowledge of the processes involved

Regression Model 1
"Controlled" parameter (independent variable, x) vs. measured parameter (dependent variable, y)
• The independent variable (on the x-axis) must be measured with a high degree of accuracy and is not subject to random variation
• Other influential factors must be kept constant
• The dependent variable (on the y-axis) may vary randomly, and its "error" should follow a normal distribution

Normally distributed populations of y values
[Figure: normal curves of y values at successive values of X along the regression line.]
• The population of y values is normally distributed, and
• The variances of the different populations of y values corresponding to different individual x values are similar

Regression Model 2
• Both are measured parameters (x1 and x2, not x and y) which cannot be controlled
• Both are subject to random variation and are called random-effects factors
• Common in field studies where conditions are difficult to control
• Correlation, rather than regression, is required for bivariate normal distributions, e.g. measurements of human arm and leg lengths

Example for Model 1
• Study the rate of disappearance of a pesticide in a seawater sample: Time (independent) vs.
Concentration (dependent). Other factors such as pH and salinity must be kept constant.
• Study the growth rate of fish at different fixed water temperatures: Temp (independent) vs. Growth rate (dependent). Other factors such as diet and feeding frequency must be kept constant.

[Figure: regression line of Y on X, marking a point (x, y), the rise c, the run d, and the intercept a.]
Model: y = a + bx
Slope: coefficient b = c/d
Intercept: coefficient a
[Figure: lines with negative slope (b < 0), positive slope (b > 0), and zero slope (b = 0).]

Wing lengths of 13 sparrows of various ages
Age (days), X:          3    4    5    6    8    9    10   11   12   14   15   16   17
Wing length (cm), Y:    1.4  1.5  2.2  2.4  3.1  3.2  3.2  3.9  4.1  4.7  4.5  5.2  5.0
[Figure: wing length (cm) against age (days) with fitted line y = 0.2702x + 0.7131, $R^2$ = 0.9733.]

The concept of least squares
The sum of $d_i^2$ indicates the deviations of the points from the regression line. The best-fit line is achieved with the minimum sum of squared deviations ($\sum d_i^2$).
[Figure: wing length (cm) against age (days) with a fitted line y = 0.2695x + 0.7284, $R^2$ = 0.9705, and a deviation d marked.]

Age (days) X    Wing length (cm) Y    XY
3               1.4                   4.2
4               1.5                   6.0
5               2.2                   11.0
6               2.4                   14.4
8               3.1                   24.8
9               3.2                   28.8
10              3.2                   32.0
11              3.9                   42.9
12              4.1                   49.2
14              4.7                   65.8
15              4.5                   67.5
16              5.2                   83.2
17              5.0                   85.0

n = 13
mean: $\bar{x}$ = 10.0, $\bar{y}$ = 3.415
sum: $\sum x$ = 130.0, $\sum y$ = 44.4, $\sum xy$ = 514.8
sum of squares: $\sum x^2$ = 1562.0, $\sum y^2$ = 171.3

Calculation for a regression y = a + bx
$b = [\sum xy - (\sum x \sum y)/n] / [\sum x^2 - (\sum x)^2/n]$
$a = \bar{y} - b\bar{x}$
b = [514.8 - (130)(44.4)/13] / [1562 - (130)^2/13] = 70.8/262 = 0.270 cm/day
a = 3.415 - (0.270)(10) = 0.715 cm
The simple linear regression equation is Y = 0.715 + 0.270X

Normally distributed populations of y values
• The population of y values is normally distributed, and
• The variances of the different populations of y values corresponding to different individual x values are similar

Residual = $y - \hat{y}$ = y - (a + bx)
[Figures: residual plots showing positive (+), zero (0) and negative (-) residuals along the x-axis.]
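The least-squares calculation for the sparrow data can be sketched in plain Python using the same sums as the slides. The variable names are illustrative assumptions. Computed without intermediate rounding, the slope and intercept agree with the fitted line shown on the chart (y = 0.2702x + 0.7131).

```python
# Sparrow data from the slides: age in days (X) and wing length in cm (Y).
age_days = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing_cm = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

n = len(age_days)
sum_x = sum(age_days)
sum_y = sum(wing_cm)
sum_xy = sum(x * y for x, y in zip(age_days, wing_cm))
sum_x2 = sum(x * x for x in age_days)
sum_y2 = sum(y * y for y in wing_cm)

# Corrected sums of products and squares.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

# Slope and intercept of y = a + b x.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n

# Coefficient of determination and residuals y - (a + b x).
r_squared = s_xy ** 2 / (s_xx * s_yy)
residuals = [y - (a + b * x) for x, y in zip(age_days, wing_cm)]

print(f"b = {b:.4f} cm/day, a = {a:.4f} cm, R^2 = {r_squared:.4f}")
print(f"sum of residuals = {sum(residuals):.2e}")
```

The residuals of a least-squares line with an intercept sum to zero (up to floating-point error), which is a quick check that the fit was computed correctly.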