Alejandra López Ramírez - 285144
Learning Diary for Research Methodology
CBU – Forestry.
1. Research process
The investigative process is supported by the scientific method. The scientific method is a set of
systematic procedures that allows us to answer a research problem or generate new
knowledge. In simple words, it is the way things are done in science; however, its methodical
nature does not mean that it is an inflexible process. On the contrary, the scientific method allows
freedom of thought, criticism, rigorous analysis and discussion. Its steps are guides for developing
the research process in an orderly and rational manner.
The following steps are part of the scientific method:
1. Observation of natural phenomena or of a research problem.
2. Formulation of a hypothesis, based on observation, previous knowledge or experiments, that
may explain the natural phenomena or answer the problem.
3. Use of the hypothesis to make predictions (logical consequences of the hypothesis).
4. Experimentation to test the predictions, modifying the hypothesis based on the obtained
results.
5. Repetition of steps 3 and 4 until there are no discrepancies between theory and experiment and/or
observation. With no discrepancies, a hypothesis becomes a theory.
There are two key facts about the scientific method:
A scientific hypothesis must be falsifiable. This means that it has to be possible to prove the
hypothesis false by identifying possible results that would show discrepancies with the predictions
deduced from it; “what is unfalsifiable is classified as unscientific”. For example,
the existence of the afterlife is not a falsifiable claim.
Results must be repeatable.
An essential tool for the investigative process is statistics, through which data are organized, analyzed and
interpreted. The process of collecting/selecting data from a statistical population is called
sampling. Statistics allows us to design our experiments so that we get a representative sample
that really describes the whole population, and to test and interpret our results in order to make inferences.
Figure 1. The process of sampling and statistical analysis and interpretation of data.
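As a minimal sketch of the sampling idea (my own illustration, using simulated values rather than real data), R can draw a random sample from a population and use it to estimate the population parameters:

# Hypothetical population: heights (m) of 10,000 trees, values simulated
set.seed(42)                                   # for reproducibility
population <- rnorm(10000, mean = 18, sd = 3)  # made-up population

# Draw a simple random sample of 50 trees
heights_sample <- sample(population, size = 50)

mean(heights_sample)  # sample mean, an estimate of the population mean (18)
sd(heights_sample)    # sample standard deviation, an estimate of sigma (3)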
2. Basic concepts in statistics
The following concepts are essential for describing the statistical behavior of data sets; they are
used to describe and summarize a data set and give an idea of how the data are organized:
Mean: the sample mean of a variable is the sum of the observed values in a data set divided by the
number of observations (the length of the sample); in simple terms, it is the average of the data
set. This parameter is the reference point for evaluating the dispersion within the data.
Variance: the variance gives an idea of the amount of dispersion or variability of the data within a
data set. It is a measure that quantifies how much the points in a data set spread out from their mean.
Therefore, a high variance means high dispersion of the data with respect to the mean.
Std. deviation: although the variance gives a measure of dispersion, the standard deviation is the
parameter usually reported for the evaluation of variability. This value is the square root of the
variance.
Normal distribution: it is defined by the mean and the standard deviation. Basically, when we have a data
set, we assume that the frequencies of the values are symmetrically distributed on both sides of
the mean, with the highest frequencies around the average. It is also known as the Gaussian
distribution.
Parametric analysis: the statistical evaluation of data with a normal distribution. Many statistical
tests are based on the assumption of normality.
Standard error: this parameter indicates how close the sample mean is likely to be to the
real average of the population. So, while the standard deviation gives an idea of the variability within the
data set, the standard error gives an idea of the variability with respect to the whole population.
The standard error is calculated by dividing the standard deviation by the square root of the number
of observations, meaning that the larger the sample is, the smaller the standard error will be.
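As a small illustration (my own addition, with made-up numbers), these descriptive parameters can be computed in R:

# Hypothetical sample of tree heights (m)
x <- c(14.2, 15.8, 16.1, 17.5, 18.0, 18.3, 19.9, 21.4)

n <- length(x)    # number of observations
mean(x)           # sample mean
var(x)            # sample variance
sd(x)             # standard deviation = sqrt(var(x))
sd(x) / sqrt(n)   # standard error of the mean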
Example (extracted from the examples on the wiki): we are trying to explain the volume of a species
of pine using variables that are easier to measure. We have tried with the height of the trees as the only
predictor.
As we can see in the table, the descriptive parameters are given to get an idea of how the data are
distributed and organized, but we cannot infer anything about the data yet; we need to perform a t-test
or a regression analysis in order to use the data or conclude something from the results. In
the example we can see that a sample of 50 trees has been measured in terms of height and
volume. If we want to infer something about the variability of the data, we can calculate the
coefficient of variation with the formula

CV = \frac{s}{\bar{x}} \times 100\%

where s is the standard deviation and \bar{x} is the sample mean. If we apply this estimation to
Height and Volume using the values in the table, we can get an idea of which variable has
more variability.
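A sketch of how the coefficient of variation could be computed in R; the height and volume vectors below are placeholders, since the table values are not reproduced here:

# Coefficient of variation: standard deviation relative to the mean, in %
cv <- function(x) 100 * sd(x) / mean(x)

# Placeholder vectors standing in for the Height and Volume columns
height <- c(18.1, 19.5, 20.2, 21.0, 22.3)
volume <- c(0.35, 0.48, 0.51, 0.60, 0.74)

cv(height)  # relative variability of height
cv(volume)  # relative variability of volume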
3. t-test and ANOVA
A t-test is a tool in inferential statistics to evaluate whether the means of two groups are
statistically different from each other, and thus to conclude whether there is a
significant difference between the two groups. (Whether the variances of the groups can be assumed
equal determines which version of the test is used.) For instance, a t-test allows us to evaluate whether the
application of a treatment to a sample has been effective when we compare it with a sample
without treatment (a control sample). The assumption that the means of the two
samples are equal constitutes the null hypothesis (H0), and the opposite, that they are
different, is called the alternative hypothesis (H1). Thus, the t-test is used to test hypotheses
(reject or fail to reject H0). Uses of the t-test:
 Comparing two sample means
 Comparing the means of paired observations
 Determining the significance of a regression coefficient
 Comparing two regression coefficients
When we do a t-test, the parameter in the output that we have to concentrate on is the p-value,
which tells us whether the result is statistically significant or not. In the majority of analyses, an
alpha of 0.05 is used as the cutoff for significance:
If p-value < 0.05, the result is statistically significant and we reject the null
hypothesis that the means of the two samples are equal, since there is a significant
difference between the samples.
If p-value > 0.05, the result is not statistically significant and we fail to reject the null
hypothesis, meaning that the two samples are not statistically different from each other.
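As an illustration, a two-sample t-test can be run in R as below; the yield values are invented placeholders, not the data from the example:

# Hypothetical yields (kg/ha) for fertilized and control fields
fertilized <- c(5120, 5340, 4980, 5210, 5400, 5290)
control    <- c(5010, 5150, 4900, 5080, 5220, 5100)

# Two-sample t-test; by default R does not assume equal variances (Welch test)
result <- t.test(fertilized, control)
result$p.value  # compare against alpha = 0.05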
Example of a t-test (extracted from the examples on the wiki; this was an exam question): we are
testing a new fertilizer in some fields in order to increase the yield (kg/ha). To analyze whether there
are differences, we have compared the results of applying the fertilizer against some fields
that we have left as a “control”. We have used a t-test to contrast the differences.
The table above shows the results of the t-test applied to the example previously described.
The output reports the test both with equal variances assumed and with equal variances not
assumed; this choice concerns the assumption about the variances, while the null hypothesis
itself is that the two group means are equal. On the t-distribution with n − 1 = 49
degrees of freedom, the t-value corresponds to a p-value of 0.615. This result is not statistically
significant. Therefore, we cannot reject the null hypothesis that the means of both
groups are equal, so we conclude that the treatment with fertilizer was not effective, because
there was no significant difference between the two samples.
If we want to compare the means of more than two groups, we use ANOVA as
an alternative, in which the null hypothesis is that all the means are equal and the alternative
hypothesis is that at least one of the means is different.
ANOVA checks whether the variation among more than two samples is due to the treatment
we have applied or is random. For this, it compares the sample means and shows us whether they are
equal or different.
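A minimal one-way ANOVA sketch in R with three hypothetical treatment groups (the values are invented):

# Hypothetical response values for three treatments
yield <- c(5.1, 5.4, 5.0, 6.2, 6.5, 6.1, 5.2, 5.3, 5.1)
treatment <- factor(rep(c("A", "B", "C"), each = 3))

# One-way ANOVA: H0 is that all group means are equal
fit <- aov(yield ~ treatment)
summary(fit)  # the F-test p-value tells us whether to reject H0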
4. Basics of modeling: simple regression
In the investigative process, statistical modeling is a powerful tool, and within it regression analysis
is widely used to estimate the relationships among variables. Linear regression calculates an
equation that approximates the dependence relation between a dependent variable Y, the
independent variables Xi and a random term ε. This equation minimizes the distance between
the fitted line (the regression line) and all of the data points (figure 2). The variable Y in this
equation is the dependent or response variable, which is a function of the independent or
predictor variables X.
The linear function that describes the relation between the dependent variable Y, an independent
variable X and a random term ε is

Y = \beta_0 + \beta_1 X + \varepsilon

Figure 2. Linear regression model. The red line is the regression line that best fits the data
(blue points); the intercept is the value on the y-axis where the line crosses that axis.
Multiple linear regression is the terminology used when the linear model has more than one
independent variable, in the form

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon

There are several assumptions when it comes to explaining the relationship between the variables
of a population through a regression analysis: linearity, normality and homoscedasticity, and we
have to check these assumptions after our model is fitted!
How to build and analyze a linear regression model:
Using ratio-scale variables as predictors that can plausibly explain the dependent one, we run a
regression model (assuming that we are using statistical software) and obtain the coefficients for
the linear function. For example: we are trying to explain the volume of a species of pine using
variables that are easier to measure. We have tried with the height of the trees as the only predictor.
The analysis has been run and these are the results:
The results show the unstandardized coefficients for the predictors (the constant and Height), which
are the parameters of our fitted equation:

\widehat{Volume} = -0.818 + b_1 \cdot Height

where b_1 is the unstandardized coefficient for Height given in the table. The constant (−0.818)
is the y-intercept of the regression line. We also have R2 = 0.387.
R2 is the coefficient of determination: it indicates the proportion of the variance in the
dependent variable that is predictable from the independent variable. The result means that the
variable Height explains 38.7% of the variance in Volume.
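In R, fitting this kind of model reduces to a single lm() call; the data frame below is a placeholder standing in for the 50-tree sample:

# Placeholder data standing in for the measured trees
trees <- data.frame(
  Height = c(18.2, 19.1, 20.5, 21.3, 22.8, 24.0),
  Volume = c(0.31, 0.38, 0.45, 0.52, 0.66, 0.78)
)

# Simple linear regression: Volume explained by Height
model <- lm(Volume ~ Height, data = trees)
summary(model)  # coefficients, their p-values and R-squared
coef(model)     # the constant (intercept) and the slope for Height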
What is usually checked:
R2: regression models with an R2 higher than 0.90 are usually desirable; this means that the
independent variables explain more than 90% of the variance of the response. Our
example shows a really low R2. The usual way to proceed is to find a better model with a higher R2.
p-values: for a good fit, results whose predictors have a p-value lower than
0.05 are preferable, which means that every parameter is significant; in other words, every parameter
must have an influence on the response variable. In the example, we can see that both coefficients are
significant.
Plot of residuals versus predicted values: the residuals are the differences
between the observed values and the modeled values; in this case, the difference between the
volume in the data set (measured for each tree) and the volume calculated by applying the fitted
regression equation:

e_i = Volume_i - \widehat{Volume}_i

The homoscedasticity and normality of the residuals have to be evaluated:
 Linearity: we check whether the residuals show any kind of pattern or, instead,
randomness. Randomness in the residuals is what is desirable, since this means
that the regression model satisfies the linearity principle and the model fits the data.
 Homoscedasticity: if the residuals become wider within the range, then the principle of
homoscedasticity is not satisfied.
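A sketch of how these residual checks could be done in R, continuing from a fitted lm object such as the model in the previous sketch:

# Residuals versus fitted values: look for randomness (linearity)
# and a constant spread (homoscedasticity)
plot(fitted(model), resid(model),
     xlab = "Predicted volume", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality of the residuals: quantile-quantile plot
qqnorm(resid(model))
qqline(resid(model))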
Example of non-linearity:
Figure 3. Residuals versus predicted values
Figure 3 shows a structured tendency in the residuals-versus-fitted-values plot, rather than the
randomness that is the desirable case. This indicates that the assumption of linearity is not
satisfied and we have a bad fit, with systematic deviations. In this case, the solution for a
better model is to use nonlinear regression, because linear models are unable to fit the specific
curve that these data follow. In other words, we need to transform the variables.
However, since the objective of the study is to build a linear regression model, the way to proceed
is to transform the variables and fit a linear model to the transformed data.
Example of heteroscedasticity of residuals:
Figure 4. Residuals versus predicted values
Figure 5. Residuals versus predicted values
There is no equality of variances in the models whose residuals are shown in figures 4
and 5; we can see that both become wider at higher values. The first one could belong to a linear
model because of its randomness, but if we focus on both figures, we can see how the
variability of the residuals increases with higher values of volume. Normality is also violated.
5. Advanced models: alternatives to simple regression
As I explained in the previous chapter, multiple linear regression can be an alternative to
simple regression if we know that two or more independent variables can explain the
response variable better. Note that R2 can only increase or stay the same as more predictors are added
to a multiple linear regression model, even if the added independent variables are unconnected to the
response variable. But what if the assumption of linearity cannot be accepted after
evaluating the plot of residuals versus predicted values? “The problem of
heteroscedasticity (different variances) can be corrected by transforming the data using a
logarithm or a power. The proper transformation will depend on each particular data set.”
In this way, we can obtain a model that fits the data better. Sometimes a visual evaluation of the
scatter plots of the independent variables versus the dependent variable at the very beginning can
give an idea of the proper model to explain the relation between our variables.
Example (using the same data as in my previous example): after the previous analysis, we have
fitted a new model based on the square of the diameter (DIAM2) of the trees.
The results are as follows:
We can now see an improvement in R2. We also have significant predictors. We will have to
check the residuals to see whether there is homoscedasticity or not (it is pretty obvious that we
will not get linearity).
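A sketch of how such a transformed model could be fitted in R (my own illustration; it assumes the data frame also has a Diameter column, which is not shown in the diary):

# Fit Volume on the square of the diameter; I() protects the arithmetic
# inside the formula (Diameter is an assumed column name)
model2 <- lm(Volume ~ I(Diameter^2), data = trees)
summary(model2)  # check whether R-squared improved over the Height model

# Re-check the residuals of the new model as before
plot(fitted(model2), resid(model2))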
6. Validation of models
Model validation provides us with parameters that are useful for evaluating the applicability of our
model, that is, whether we can use it with data sets other than the one it was fitted on. For instance,
once we have created and tested our model, we have to run it with different data and compare the
new results with measured data.
For example, for the R assignment we used a validation set to validate the linear regression model
obtained. We performed a t-test to validate the model. The results for both data sets (the
modeling and validation sets) show p-values > 0.05 (p-value = 0.9998 for the modeling data and
p-value = 0.1382 for the validation data). This means that we cannot reject the null hypothesis that
the means of both groups (observed and modeled volumes) are equal.
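A sketch of what this validation t-test could look like in R, assuming a validation data frame with the same columns as the modeling data (the names here are placeholders):

# Predict volumes for the held-out validation set (assumed data frame)
predicted <- predict(model, newdata = validation)

# Compare observed and modeled volumes:
# p > 0.05 means we cannot reject that their means are equal
t.test(validation$Volume, predicted)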
We can also use the following parameters:
- RMSE (root mean squared error): states the dispersion or difference between the
measured values and the values from our model run.
- Bias: describes the difference between the average levels (means) of the model and the
measured data.
These values can be relative (%) or absolute, and depending on which one we need, the
calculations are slightly different. The formulas for the absolute versions are

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

Bias = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)

where y_i is the measured value and \hat{y}_i is the modelled value; the relative (%) versions are
obtained by dividing by the mean of the measured values and multiplying by 100.
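These formulas translate directly into R; the sketch below uses obs for the measured values and pred for the modelled ones, with bias taken as measured minus modelled, matching the formulas above:

# Absolute RMSE and bias between measured (obs) and modelled (pred) values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
bias <- function(obs, pred) mean(obs - pred)

# Relative (%) versions, scaled by the mean of the measured values
rmse_rel <- function(obs, pred) 100 * rmse(obs, pred) / mean(obs)
bias_rel <- function(obs, pred) 100 * bias(obs, pred) / mean(obs)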
7. GIS tools and remote sensing applications
GIS (geographical information systems) are systems designed to capture, store, manipulate,
analyze and present spatial data. They can be very useful for the forest sciences, for instance to
calculate volume for timber production or potential biomass at different levels.
Geographic data can be stored in a vector or a raster format. In a vector format, two-dimensional
data are stored in terms of x and y coordinates. A raster format expresses data as a grid of cells,
each holding a value of a continuously varying quantity.
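As a toy illustration (my own, not from the diary), vector data can be represented as coordinate pairs with attributes, and raster data as a grid of cell values:

# Vector format: features stored as x/y coordinates with attributes
plots <- data.frame(x = c(512300, 512410),
                    y = c(6901200, 6901350),
                    volume = c(210, 185))  # hypothetical stand volumes

# Raster format: a surface stored as a regular grid of cell values
elevation <- matrix(runif(25, min = 100, max = 250), nrow = 5, ncol = 5)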