Alejandra López Ramírez - 285144
Learning Diary for Research Methodology
CBU – Forestry.
1. Research process
The investigative process is supported by the scientific method. The scientific method is a set of
systematic procedures that allows us to answer a research problem or generate new
knowledge. In simple words, it is the way things are done in science; however, its methodical
nature does not mean that it is an inflexible process. On the contrary, the scientific method allows
freedom of thought, criticism, rigorous analysis and discussion. Its steps are guides for developing
the research process in an orderly and rational manner.
The following steps are part of the scientific method:
1. Observation of natural phenomena or of a research problem.
2. Formulation of a hypothesis, based on observation, previous knowledge or experiments, that
may explain the natural phenomena or answer the problem.
3. Use of the hypothesis to make predictions (logical consequences of the hypothesis).
4. Experimentation to test the predictions, modifying the hypothesis based on the obtained
results.
5. Repetition of steps 3 and 4 until there are no discrepancies between theory and experiment and/or
observation. With no discrepancies, a hypothesis becomes a theory.
There are two key facts about the scientific method:
A scientific hypothesis must be falsifiable. This means that it has to be possible to prove the
hypothesis false by identifying possible results that would show discrepancies with the predictions
deduced from it; “what is unfalsifiable is classified as unscientific”. For example,
the existence of the afterlife is not a falsifiable claim.
Results must be repeatable.
An essential tool for the investigative process is statistics, through which data are organized, analyzed and
interpreted. The process of collecting/selecting data from a statistical population is called
sampling. Statistics allows us to design our experiments so that we get a representative sample
that really describes the whole population, and to test and interpret our results in order to make inferences.
Figure 1. The process of sampling and statistical analysis and interpretation of data.
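As a minimal sketch of the sampling idea (my own illustration, using simulated values rather than real data), R can draw a random sample from a population and use it to estimate the population parameters:

# Hypothetical population: heights (m) of 10,000 trees, values simulated
set.seed(42)                                   # for reproducibility
population <- rnorm(10000, mean = 18, sd = 3)  # made-up population

# Draw a simple random sample of 50 trees
heights_sample <- sample(population, size = 50)

mean(heights_sample)  # sample mean, an estimate of the population mean (18)
sd(heights_sample)    # sample standard deviation, an estimate of sigma (3)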
2. Basic concepts in statistics
The following concepts are essential for describing the statistical behavior of data sets; they are
used to describe and summarize a data set and give an idea of how the data are organized:
Mean: the sample mean of a variable is the sum of the observed values in a data set divided by the
number of observations (the length of the sample); in simple terms, it is the average of the data
set. This parameter is the reference point for evaluating the dispersion within the data.
Variance: the variance gives an idea of the amount of dispersion or variability of the data within a
data set. It is a measure that quantifies how much the points in a data set spread out from their mean.
Therefore, a high variance means high dispersion of the data with respect to the mean.
Std. deviation: although the variance gives a measure of dispersion, the standard deviation is the
parameter usually reported for the evaluation of variability. This value is the square root of the
variance.
Normal distribution: it is defined by the mean and the standard deviation. Basically, when we have a data
set, we assume that the frequencies of the values are symmetrically distributed on both sides of
the mean, with the highest frequencies around the average. It is also known as the Gaussian
distribution.
Parametric analysis: the statistical evaluation of data with a normal distribution. Many statistical
tests are based on the assumption of normality.
Standard error: this parameter indicates how close the sample mean is likely to be to the
real average of the population. So, while the standard deviation gives an idea of the variability within the
data set, the standard error gives an idea of the variability with respect to the whole population.
The standard error is calculated by dividing the standard deviation by the square root of the number
of observations, meaning that the larger the sample is, the smaller the standard error will be.
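As a small illustration (my own addition, with made-up numbers), these descriptive parameters can be computed in R:

# Hypothetical sample of tree heights (m)
x <- c(14.2, 15.8, 16.1, 17.5, 18.0, 18.3, 19.9, 21.4)

n <- length(x)    # number of observations
mean(x)           # sample mean
var(x)            # sample variance
sd(x)             # standard deviation = sqrt(var(x))
sd(x) / sqrt(n)   # standard error of the mean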
Example (extracted from the examples on the wiki): we are trying to explain the volume of a species
of pine using variables that are easier to measure. We have tried with the height of the trees as the only
predictor.
As we can see in the table, the descriptive parameters are given to get an idea of how the data are
distributed and organized, but we cannot infer anything about the data yet; we need to perform a t-test
or a regression analysis in order to use the data or conclude something from the results. In
the example we can see that a sample of 50 trees has been measured in terms of height and
volume. If we want to infer something about the variability of the data, we can calculate the
coefficient of variation with the formula

CV = \frac{s}{\bar{x}} \times 100\%

where s is the standard deviation and \bar{x} is the sample mean. If we apply this estimation to
Height and Volume using the values in the table, we can get an idea of which variable has
more variability.
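A sketch of how the coefficient of variation could be computed in R; the height and volume vectors below are placeholders, since the table values are not reproduced here:

# Coefficient of variation: standard deviation relative to the mean, in %
cv <- function(x) 100 * sd(x) / mean(x)

# Placeholder vectors standing in for the Height and Volume columns
height <- c(18.1, 19.5, 20.2, 21.0, 22.3)
volume <- c(0.35, 0.48, 0.51, 0.60, 0.74)

cv(height)  # relative variability of height
cv(volume)  # relative variability of volume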
3. t-test and ANOVA
A t-test is a tool in inferential statistics to evaluate whether the means of two groups are
statistically different from each other, and thus to conclude whether there is a
significant difference between the two groups. (Whether the variances of the groups can be assumed
equal determines which version of the test is used.) For instance, a t-test allows us to evaluate whether the
application of a treatment to a sample has been effective when we compare it with a sample
without treatment (a control sample). The assumption that the means of the two
samples are equal constitutes the null hypothesis (H0), and the opposite, that they are
different, is called the alternative hypothesis (H1). Thus, the t-test is used to test hypotheses
(reject or fail to reject H0). Uses of the t-test:
 Comparing two sample means
 Comparing the means of paired observations
 Determining the significance of a regression coefficient
 Comparing two regression coefficients
When we do a t-test, the parameter in the output that we have to concentrate on is the p-value,
which tells us whether the result is statistically significant or not. In the majority of analyses, an
alpha of 0.05 is used as the cutoff for significance:
If p-value < 0.05, the result is statistically significant and we reject the null
hypothesis that the means of the two samples are equal, since there is a significant
difference between the samples.
If p-value > 0.05, the result is not statistically significant and we fail to reject the null
hypothesis, meaning that the two samples are not statistically different from each other.
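As an illustration, a two-sample t-test can be run in R as below; the yield values are invented placeholders, not the data from the example:

# Hypothetical yields (kg/ha) for fertilized and control fields
fertilized <- c(5120, 5340, 4980, 5210, 5400, 5290)
control    <- c(5010, 5150, 4900, 5080, 5220, 5100)

# Two-sample t-test; by default R does not assume equal variances (Welch test)
result <- t.test(fertilized, control)
result$p.value  # compare against alpha = 0.05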
Example of a t-test (extracted from the examples on the wiki; this was an exam question): we are
testing a new fertilizer in some fields in order to increase the yield (kg/ha). To analyze whether there
are differences, we have compared the results of applying the fertilizer against some fields
that we have left as a “control”. We have used a t-test to contrast the differences.
The table above shows the results of the t-test applied to the example previously described.
The output reports the test both with equal variances assumed and with equal variances not
assumed; this choice concerns the assumption about the variances, while the null hypothesis
itself is that the two group means are equal. On the t-distribution with n − 1 = 49
degrees of freedom, the t-value corresponds to a p-value of 0.615. This result is not statistically
significant. Therefore, we cannot reject the null hypothesis that the means of both
groups are equal, so we conclude that the treatment with fertilizer was not effective, because
there was no significant difference between the two samples.
If we want to compare the means of more than two groups, we use ANOVA as
an alternative, in which the null hypothesis is that all the means are equal and the alternative
hypothesis is that at least one of the means is different.
ANOVA checks whether the variation among more than two samples is due to the treatment
we have applied or is random. For this, it compares the sample means and shows us whether they are
equal or different.
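A minimal one-way ANOVA sketch in R with three hypothetical treatment groups (the values are invented):

# Hypothetical response values for three treatments
yield <- c(5.1, 5.4, 5.0, 6.2, 6.5, 6.1, 5.2, 5.3, 5.1)
treatment <- factor(rep(c("A", "B", "C"), each = 3))

# One-way ANOVA: H0 is that all group means are equal
fit <- aov(yield ~ treatment)
summary(fit)  # the F-test p-value tells us whether to reject H0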
4. Basics of modeling: simple regression
In the investigative process, statistical modeling is a powerful tool, and within it regression analysis
is widely used to estimate the relationships among variables. Linear regression calculates an
equation that approximates the dependence relation between a dependent variable Y, the
independent variables Xi and a random term ε. This equation minimizes the distance between
the fitted line (the regression line) and all of the data points (figure 2). The variable Y in this
equation is the dependent or response variable, which is a function of the independent or
predictor variables X.
The linear function that describes the relation between the dependent variable Y, an independent
variable X and a random term ε is

Y = \beta_0 + \beta_1 X + \varepsilon

Figure 2. Linear regression model. The red line is the regression line that best fits the data
(blue points); the intercept is the value on the y-axis where the line crosses that axis.
Multiple linear regression is the terminology used when the linear model has more than one
independent variable, in the form

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon

There are several assumptions when it comes to explaining the relationship between the variables
of a population through a regression analysis: linearity, normality and homoscedasticity, and we
have to check these assumptions after our model is fitted!
How to build and analyze a linear regression model:
Using ratio-scale variables as predictors that can plausibly explain the dependent one, we run a
regression model (assuming that we are using statistical software) and obtain the coefficients for
the linear function. For example: we are trying to explain the volume of a species of pine using
variables that are easier to measure. We have tried with the height of the trees as the only predictor.
The analysis has been run and these are the results:
The results show the unstandardized coefficients for the predictors (the constant and Height), which
are the parameters of our fitted equation:

\widehat{Volume} = -0.818 + b_1 \cdot Height

where b_1 is the unstandardized coefficient for Height given in the table. The constant (−0.818)
is the y-intercept of the regression line. We also have R2 = 0.387.
R2 is the coefficient of determination: it indicates the proportion of the variance in the
dependent variable that is predictable from the independent variable. The result means that the
variable Height explains 38.7% of the variance in Volume.
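In R, fitting this kind of model reduces to a single lm() call; the data frame below is a placeholder standing in for the 50-tree sample:

# Placeholder data standing in for the measured trees
trees <- data.frame(
  Height = c(18.2, 19.1, 20.5, 21.3, 22.8, 24.0),
  Volume = c(0.31, 0.38, 0.45, 0.52, 0.66, 0.78)
)

# Simple linear regression: Volume explained by Height
model <- lm(Volume ~ Height, data = trees)
summary(model)  # coefficients, their p-values and R-squared
coef(model)     # the constant (intercept) and the slope for Height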
What is usually checked:
R2: regression models with an R2 higher than 0.90 are usually desirable; this means that the
independent variables explain more than 90% of the variance of the response. Our
example shows a really low R2. The usual way to proceed is to find a better model with a higher R2.
p-values: for a good fit, results whose predictors have a p-value lower than
0.05 are preferable, which means that every parameter is significant; in other words, every parameter
must have an influence on the response variable. In the example, we can see that both coefficients are
significant.
Plot of residuals versus predicted values: the residuals are the differences
between the observed values and the modeled values; in this case, the difference between the
volume in the data set (measured for each tree) and the volume calculated by applying the fitted
regression equation:

e_i = Volume_i - \widehat{Volume}_i

The homoscedasticity and normality of the residuals have to be evaluated:
 Linearity: we check whether the residuals show any kind of pattern or, instead,
randomness. Randomness in the residuals is what is desirable, since this means
that the regression model satisfies the linearity principle and the model fits the data.
 Homoscedasticity: if the residuals become wider within the range, then the principle of
homoscedasticity is not satisfied.
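A sketch of how these residual checks could be done in R, continuing from a fitted lm object such as the model in the previous sketch:

# Residuals versus fitted values: look for randomness (linearity)
# and a constant spread (homoscedasticity)
plot(fitted(model), resid(model),
     xlab = "Predicted volume", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality of the residuals: quantile-quantile plot
qqnorm(resid(model))
qqline(resid(model))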
Example of non-linearity:
Figure 3. Residuals versus predicted values
Figure 3 shows a structured tendency in the residuals-versus-fitted-values plot, rather than the
randomness that is the desirable case. This indicates that the assumption of linearity is not
satisfied and we have a bad fit, with systematic deviations. In this case, the solution for a
better model is to use nonlinear regression, because linear models are unable to fit the specific
curve that these data follow. In other words, we need to transform the variables.
However, since the objective of the study is to build a linear regression model, the way to proceed
is to transform the variables and fit a linear model to the transformed data.
Example of heteroscedasticity of residuals:
Figure 4. Residuals versus predicted values
Figure 5. Residuals versus predicted values
There is no equality of variances in the models whose residuals are shown in figures 4
and 5; we can see that both become wider at higher values. The first one could belong to a linear
model because of its randomness, but if we focus on both figures, we can see how the
variability of the residuals increases with higher values of volume. Normality is also violated.
5. Advanced models: alternatives to simple regression
As I explained in the previous chapter, multiple linear regression can be an alternative to
simple regression if we know that two or more independent variables can explain the
response variable better. Note that R2 can only increase or stay the same as more predictors are added
to a multiple linear regression model, even if the added independent variables are unconnected to the
response variable. But what if the assumption of linearity cannot be accepted after
evaluating the plot of residuals versus predicted values? “The problem of
heteroscedasticity (different variances) can be corrected by transforming the data using a
logarithm or a power. The proper transformation will depend on each particular data set.”
In this way, we can obtain a model that fits the data better. Sometimes a visual evaluation of the
scatter plots of the independent variables versus the dependent variable at the very beginning can
give an idea of the proper model to explain the relation between our variables.
Example (using the same data as in my previous example): after the previous analysis, we have
fitted a new model based on the square of the diameter (DIAM2) of the trees.
The results are as follows:
We can now see an improvement in R2. We also have significant predictors. We will have to
check the residuals to see whether there is homoscedasticity or not (it is pretty obvious that we
will not get linearity).
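A sketch of how such a transformed model could be fitted in R (my own illustration; it assumes the data frame also has a Diameter column, which is not shown in the diary):

# Fit Volume on the square of the diameter; I() protects the arithmetic
# inside the formula (Diameter is an assumed column name)
model2 <- lm(Volume ~ I(Diameter^2), data = trees)
summary(model2)  # check whether R-squared improved over the Height model

# Re-check the residuals of the new model as before
plot(fitted(model2), resid(model2))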
6. Validation of models
Model validation provides us with parameters that are useful for evaluating the applicability of our
model, that is, whether we can use it with data sets other than the one it was fitted on. For instance,
once we have created and tested our model, we have to run it with different data and compare the
new results with measured data.
For example, for the R assignment we used a validation set to validate the linear regression model
obtained. We performed a t-test to validate the model. The results for both data sets (the
modeling and validation sets) show p-values > 0.05 (p-value = 0.9998 for the modeling data and
p-value = 0.1382 for the validation data). This means that we cannot reject the null hypothesis that
the means of both groups (observed and modeled volumes) are equal.
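A sketch of what this validation t-test could look like in R, assuming a validation data frame with the same columns as the modeling data (the names here are placeholders):

# Predict volumes for the held-out validation set (assumed data frame)
predicted <- predict(model, newdata = validation)

# Compare observed and modeled volumes:
# p > 0.05 means we cannot reject that their means are equal
t.test(validation$Volume, predicted)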
We can also use the following parameters:
- RMSE (root mean squared error): states the dispersion or difference between the
measured values and the values from our model run.
- Bias: describes the difference between the average levels (means) of the model and the
measured data.
These values can be relative (%) or absolute, and depending on which one we need, the
calculations are slightly different. The formulas for the absolute versions are

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

Bias = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)

where y_i is the measured value and \hat{y}_i is the modelled value; the relative (%) versions are
obtained by dividing by the mean of the measured values and multiplying by 100.
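These formulas translate directly into R; the sketch below uses obs for the measured values and pred for the modelled ones, with bias taken as measured minus modelled, matching the formulas above:

# Absolute RMSE and bias between measured (obs) and modelled (pred) values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
bias <- function(obs, pred) mean(obs - pred)

# Relative (%) versions, scaled by the mean of the measured values
rmse_rel <- function(obs, pred) 100 * rmse(obs, pred) / mean(obs)
bias_rel <- function(obs, pred) 100 * bias(obs, pred) / mean(obs)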
7. GIS tools and remote sensing applications
GIS (geographical information systems) are systems designed to capture, store, manipulate,
analyze and present spatial data. They can be very useful for the forest sciences, for instance to
calculate volume for timber production or potential biomass at different levels.
Geographic data can be stored in a vector or a raster format. In a vector format, two-dimensional
data are stored in terms of x and y coordinates. A raster format expresses data as a grid of cells,
each holding a value of a continuously varying quantity.
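As a toy illustration (my own, not from the diary), vector data can be represented as coordinate pairs with attributes, and raster data as a grid of cell values:

# Vector format: features stored as x/y coordinates with attributes
plots <- data.frame(x = c(512300, 512410),
                    y = c(6901200, 6901350),
                    volume = c(210, 185))  # hypothetical stand volumes

# Raster format: a surface stored as a regular grid of cell values
elevation <- matrix(runif(25, min = 100, max = 250), nrow = 5, ncol = 5)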