CORRELATION COEFFICIENT

E.P. Yankovich

To model geoecological objects and processes as complex natural systems, it is necessary to consider some of their properties, since the aim is to clarify the generic structure of the studied object. In some cases the studied properties appear independently of one another; in other cases more or less clear interrelations can be found between them. To estimate the degree of interrelation, the linear (Pearson) correlation coefficient, which assumes a normal distribution of the observations, is widely used.

The correlation coefficient is a parameter characterizing the degree of linear interrelation between two samples. It ranges from -1 (strict inverse linear relationship) to 1 (strict direct proportion); a value of 0 means there is no linear relationship between the two samples. Here, direct dependence is understood as dependence in which an increase or decrease in the value of one property leads to an increase or decrease in the value of the second property, respectively.

The sample estimate of the correlation coefficient can be calculated by the formula

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n S_x S_y}

where \bar{x} and \bar{y} are the sample estimates of the mean values of the random variables X and Y; S_x and S_y are the sample estimates of their standard deviations; n is the number of compared paired values.

For hand calculations the following equivalent formula is used:

r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right]}}

If, because the data are few, you cannot test the hypothesis that the empirical distribution follows the normal law, you can use Spearman's rank correlation coefficient. Its calculation is based on replacing the sample values of the investigated random variables by their ranks in increasing order; it is assumed that if there is no correlation dependence between the values of the random variables, the ranks of these variables will be independent.
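The Pearson formula above can be illustrated with a short Python sketch (an added illustration, not part of the original text), computing r directly from the definition with the 1/n form of the standard deviations:

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation:
    r = sum((x_i - mean_x) * (y_i - mean_y)) / (n * S_x * S_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n * sx * sy)

# Strict direct proportion -> r = 1; strict inverse linear relation -> r = -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # close to 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # close to -1.0
```

The function names and sample data here are invented for the illustration.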
The expression for calculating the rank correlation coefficient is

r = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

where d_i is the rank difference of the conjugate values of the studied variables x_i and y_i, and n is the number of pairs in the sample.

LAWS OF RANDOM VARIABLE DISTRIBUTION

The law of distribution of a random variable is the relationship between all possible values of the random variable and their corresponding probabilities. The distribution law can be presented in tabulated form, graphically, or in the form of a distribution function.

A distribution series is the set of possible values x_i and their corresponding probabilities p_i = P(X = x_i); it can be presented in tabulated form. Here the probabilities p_i satisfy

\sum_{i=1}^{k} p_i = 1

where the number of possible values k can be finite or infinite.

The graphic presentation of a distribution series is called a distribution polygon. To draw the distribution polygon, the possible values of the random variable (x_i) are plotted on the abscissa and the probabilities p_i on the ordinate; the points A_i with coordinates (x_i, p_i) are connected by broken lines. If the true probabilities are not known, the relative frequency of occurrence of each value is plotted on the ordinate instead.

The distribution function is the most general form of describing the distribution law. It defines the probability that the random variable takes a value less than any specified value x. This probability depends on x and is therefore a function of x:

F(x) = P(X < x)

[Figure: graphs of the integral distribution function for a discrete and a continuous random variable.]

For a discrete random variable the function F(x) is calculated by the formula

F(x) = \sum_{x_i < x} p_i

where the summation is carried out over all i for which x_i < x.
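For a discrete distribution series, the two conditions above can be sketched in Python (an added illustration, not part of the original text; the series values are invented): the probabilities must sum to 1, and F(x) sums p_i over all x_i < x.

```python
def distribution_function(series, x):
    """F(x) = P(X < x) = sum of p_i over all x_i < x."""
    return sum(p for xi, p in series if xi < x)

# A distribution series: possible values x_i with probabilities p_i = P(X = x_i).
series = [(1, 0.2), (2, 0.5), (3, 0.3)]
assert abs(sum(p for _, p in series) - 1.0) < 1e-12  # probabilities sum to 1

print(distribution_function(series, 1))    # 0 (no x_i < 1)
print(distribution_function(series, 2.5))  # p_1 + p_2, about 0.7
print(distribution_function(series, 10))   # about 1.0
```

F(x) is a non-decreasing step function for a discrete variable, rising from 0 to 1.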
A continuous random variable is characterized by a nonnegative function f(x), called the probability density, which is defined by

f(x) = \lim_{\Delta x \to 0} \frac{P(x \le X < x + \Delta x)}{\Delta x}

At any x the probability density f(x) satisfies the equality

F(x) = \int_{-\infty}^{x} f(\tilde{x})\,d\tilde{x}

linking it with the distribution function F(x). Geometrically, the probability that X falls in the interval (a, b) is equal to the area of the curvilinear trapezoid corresponding to the definite integral \int_{a}^{b} f(x)\,dx.

[Figure: graph of the probability density function (differential distribution function).]

Normal Distribution (the term was first used by Galton in 1889; the distribution is also called Gaussian). The normal distribution (the "bell-shaped curve", symmetrical about the mean) is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions. In general, the normal distribution provides a good model for a random variable when:
1. There is a strong tendency for the variable to take a central value;
2. Positive and negative deviations from this central value are equally likely;
3. The frequency of deviations falls off rapidly as the deviations become larger.

The normal density function is determined by the following formula:

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty

where
μ is the mean,
σ is the standard deviation,
e is the base of the natural logarithm, sometimes called Euler's e (2.71...),
π is the constant Pi (3.14...).

The exact form of the normal distribution (the specific "bell curve", see Fig.) is defined by only two parameters: the mean and the standard deviation. A specific property of the normal distribution is that 68% of all observations fall within ±1 standard deviation of the mean, and the range of ±2 standard deviations includes 95% of the values.
In other words, under the normal distribution, standardized observations less than -2 or greater than +2 have a relative frequency below 5%. (A standardized observation is obtained by subtracting the mean from the raw value and dividing the result by the standard deviation.)

Log-normal Distribution

The log-normal distribution is often used in simulations of variables such as personal incomes, age at first marriage, or tolerance to poison in animals. In general, if x is a sample from a normal distribution, then y = e^x is a sample from a log-normal distribution. Thus, the log-normal density is defined as

f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}

where x > 0, -∞ < μ < +∞, σ > 0;
μ is the scale parameter,
σ is the shape parameter,
e is the base of the natural logarithm, sometimes called Euler's e (2.71...).

[Figure: probability density function y = lognorm(x; 0; 0.5) and probability distribution function p = ilognorm(x; 0; 0.5) of the log-normal distribution.]

Student's t Distribution

The Student's t distribution is symmetric about zero, and its general shape is similar to that of the standard normal distribution. It is most commonly used in testing hypotheses about the mean of a particular population. The Student's t density with m degrees of freedom (m = 1, 2, ...) is defined as

f_t(x; m) = \frac{\Gamma\!\left(\frac{m+1}{2}\right)}{\sqrt{\pi m}\,\Gamma\!\left(\frac{m}{2}\right)} \left(1 + \frac{x^2}{m}\right)^{-\frac{m+1}{2}}, \quad -\infty < x < \infty

[Figure: probability density function y = student(x; 5) and probability distribution function p = istudent(x; 5) of the t distribution.]

Characteristics of the t distribution:

M[x] = x_{med} = x_{mod} = 0, \quad D[x] = \frac{m}{m-2}, \quad A = 0, \quad E = \frac{6}{m-4}

If the number of degrees of freedom is large (m > 30), the t distribution is practically equal to the normal distribution N(x; 0; 1).

ONE-DIMENSIONAL STATISTICAL MODELS.
STATISTICAL CHARACTERISTICS OF A SAMPLE RANDOM VARIABLE

One-dimensional statistical models are used to solve two types of problems: to estimate average parameters of geoecological objects and to verify hypotheses statistically. The most common statistical characteristics of a one-dimensional random variable are:
• range
• median
• mode
• average value
• dispersion
• root-mean-square deviation
• coefficient of variation
• skewness
• excess

The range is the difference between the maximum x_max and minimum x_min values of the property: p = x_max - x_min.

The median is the middle term of the ordered series of values. To find the median, arrange all values in increasing (or decreasing) order and take the middle term of the series. If n is even, there are two values in the middle of the series, and the median is equal to their half-sum.

The mode is the most frequent value of the random variable.

The average value is the arithmetic mean of all measured values:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

The median, mode and average value are characteristics of position; the measured values of the random variable are grouped near them.

The dispersion is the average of the squared deviations of the values of the random variable from its average value (the dispersion of a random variable is a measure of its spread, i.e. of its deviation from the mathematical expectation):

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

The average square deviation is the square root of the dispersion:

\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}

The coefficient of variation is the ratio of the average square deviation to the average value:

V = \frac{\sigma}{\bar{x}}

The coefficient of variation is expressed in unit fractions or (after multiplication by 100) in percentages. It makes sense to calculate the coefficient of variation only for positive random variables.

The dispersion, the average square deviation, the coefficient of variation and also the range are measures of the scatter of the values of the random variable around the average value: the larger these measures are, the greater the scattering.
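The characteristics of position and scatter listed above can be computed together; the sketch below (an added illustration, not part of the original text, with invented sample data) uses the population (1/n) form of the dispersion given in the formulas:

```python
import math
from collections import Counter

def describe(xs):
    """Range, median, mode, mean, dispersion, standard deviation and
    coefficient of variation, as defined in the text (1/n dispersion)."""
    n = len(xs)
    s = sorted(xs)
    rng = s[-1] - s[0]                                   # range
    mid = n // 2
    median = s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
    mode = Counter(xs).most_common(1)[0][0]              # most frequent value
    mean = sum(xs) / n
    disp = sum((x - mean) ** 2 for x in xs) / n          # dispersion
    std = math.sqrt(disp)                                # root-mean-square deviation
    cv = std / mean                                      # coefficient of variation
    return rng, median, mode, mean, disp, std, cv

xs = [2, 4, 4, 4, 5, 5, 7, 9]
rng, median, mode, mean, disp, std, cv = describe(xs)
print(rng, median, mode, mean, disp, std)  # 7 4.5 4 5.0 4.0 2.0
```

Note that for this even-sized sample the median (4.5) is the half-sum of the two middle terms, as stated in the text.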
Skewness is the degree of noncentrality of the distribution of the values of the random variable relative to the average value:

A = \frac{1}{n\sigma^3} \sum_{i=1}^{n} (x_i - \bar{x})^3

Excess is the degree of peakedness or flat-toppedness of the distribution of the values of the random variable relative to the normal distribution law:

E = \frac{1}{n\sigma^4} \sum_{i=1}^{n} (x_i - \bar{x})^4 - 3

Skewness and excess are dimensionless values. They show the peculiarities of the grouping of the values of the random variable around the average value.

Thus:
• the median, mode and average value are characteristics of position;
• the dispersion, average square deviation, coefficient of variation and also the range are measures of scatter;
• skewness and excess show the peculiarities of the grouping of values.

Statistical estimates can be point or interval estimates. In point estimation the unknown characteristic of the random variable is estimated by a single number; in interval estimation it is estimated by an interval within which, with a specified probability, the true value of the estimated variable must lie.

STATISTICAL MODELING

Mathematical expressions including at least one random component (i.e. a variable whose value cannot be exactly predicted for a single observation) are called statistical models. They are extensively used for mathematical modeling because they account well for the random fluctuations of experimental data. Statistical models are usually used for:
• obtaining trusted estimates of the properties of geological objects from sampling data;
• testing hypotheses;
• identifying and describing dependences between the properties of geological objects;
• classifying geological objects;
• determining the amount of sampling data needed to estimate the properties of geological objects to a specified accuracy.

Two concepts, the general population and the sample, are the basis of statistical modeling. The general population is the set of all possible values of the examined characteristic of an object or phenomenon. The sample is the set of observed values of this characteristic.
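The relation between the two concepts can be illustrated with a short simulation (an added sketch, not part of the original text; the population values and their distribution are invented for the illustration): the sample mean, a point estimate of the population mean, stabilizes as the sample grows.

```python
import random

random.seed(7)

# Hypothetical general population: a large set of possible values of one
# characteristic; a sample is a subset of observed values drawn from it.
population = [random.gauss(50.0, 10.0) for _ in range(100_000)]
pop_mean = sum(population) / len(population)

for n in (20, 200, 2000):
    sample = random.sample(population, n)
    est = sum(sample) / n  # point estimate of the population mean
    print(n, round(est, 1))
```

This is also why, as the next section notes, the reliability of statistical estimates degrades for very small samples.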
Statistical modeling assumes that the sampling population satisfies the requirements of mass, homogeneity, randomness and independence.

The mass condition reflects the fact that statistical regularities are manifested in mass phenomena, so the size of the sampling population must be sufficiently great. It has been established empirically that the reliability of statistical estimates decreases as the sample is reduced in the range from 60 to 30-20 values, and with fewer observations there is no point in applying statistical methods.

The homogeneity condition means that the sampling population must consist of observations which belong to one object and are carried out by the same method, i.e. the sample size and the analysis method must be constant.

The randomness condition ensures the unpredictability of the result of a single sample observation.

The independence condition means that the result of each observation does not depend on the results of previous and subsequent observations, and, for observations over an area or volume, the results do not depend on the space coordinates.

The concept of the probability of a random event is one of the main concepts in statistical modeling. An event is any fact which can be realized as a result of an experiment or test. In turn, an experiment or test is the realization of a certain complex of conditions, though a person does not always take part in it. All events are subdivided into certain, impossible and random:
• a certain event is sure to happen in a test of the given kind;
• an impossible event is never realized in a test of the given kind;
• random events may or may not happen in a test of the given kind.

A variable taking one or another value, unknown in advance, as a result of a test is called a random variable. Random variables are discrete or continuous; the values which they take may be bounded or not.
A discrete random variable can take only fixed values, and on any specified interval the number of these values is finite. A continuous random variable can take infinitely many values in any specified interval.

The value called probability is used as a measure of the possibility of random events. The probability of event A is a number which characterizes the objective possibility of the occurrence of this event. It is designated as either P(A) or p, i.e. p = P(A).

Classical interpretation: the probability of event A is equal to the ratio of the number of outcomes favourable to event A to the total number of outcomes:

P(A) = m/n

where n is the total number of outcomes and m is the number of outcomes favourable to event A. P(A) varies from 0 to 1: the probability of a certain event is equal to 1, and the probability of an impossible event is equal to 0.

The ratio m/n of the number m of tests in which event A occurred to the total number of tests n is called the relative frequency of the event in this series of n tests. In almost every sufficiently long series of tests the relative frequency of event A settles at a definite value m/n, which is taken as the probability of event A. The relative frequency of event A is called the statistical probability, symbolized

P^*(A) = \frac{m_A}{n}

where m_A is the number of experiments in which event A occurred, and n is the total number of experiments.

The basic characteristics of a random variable

The most important of them are the mathematical expectation of the random variable, denoted M(X), and the dispersion D(X) = σ²(X), the square root of which, σ(X), is called the standard deviation, or standard.

For a discrete (discontinuous) random variable, the mathematical expectation M(X) is defined as the sum of the products of the values of the random variable and their probabilities:

M(X) = x_1 p_1 + x_2 p_2 + \ldots + x_k p_k = \sum_{i=1}^{k} x_i p_i

or

M(X) = \sum_{i=1}^{k} x_i p_i \Big/ \sum_{i=1}^{k} p_i

Mechanical interpretation of the mathematical expectation: M(X) is the abscissa of the centroid of material points whose abscissas are equal to the possible values of the random variable and whose masses are equal to the corresponding probabilities.

For a continuous random variable the mathematical expectation is the integral

M(X) = \int_{-\infty}^{\infty} x f(x)\,dx

where f(x) is the probability density of the random variable X; the integral is supposed to converge absolutely. The mathematical expectation M(X) can be understood as the "theoretical mean value of the random variable".

Along with the mathematical expectation, other characteristics are used: the median x_med divides the distribution of X into two equal parts and is defined by the condition F(x_med) = 0.5; the mode x_mod is the most commonly occurring value of X, i.e. the abscissa of the maximum point of f(x) for a continuously distributed random variable. All three characteristics (mathematical expectation, median and mode) coincide for symmetrical distributions. If there are several modes, the distribution is called multimodal.

The dispersion of a random variable X is the mathematical expectation of the squared deviation of the random variable from its mathematical expectation:

D(X) = M\left((X - M(X))^2\right)

The dispersion can be calculated by the formula

D(X) = M(X^2) - [M(X)]^2

For a discrete random variable X this formula gives

D(X) = \sum_{i=1}^{k} x_i^2 p_i - [M(X)]^2

For a continuous random variable X

D(X) = \int_{-\infty}^{\infty} (x - M(X))^2 f(x)\,dx

The dimension of the dispersion is equal to the square of the dimension of the random variable. If the mathematical expectation of a random variable gives us its "average", the point on the coordinate line around which the values of the considered random variable "are spread", then the dispersion characterizes "the degree of spread" of the values of the random variable about this average value.
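For a discrete random variable, both M(X) and the shortcut formula D(X) = M(X²) - [M(X)]² can be sketched in a few lines of Python (an added illustration, not part of the original text; the fair-die example is invented, with M(X) = 3.5 and D(X) = 35/12):

```python
def expectation(values, probs):
    """M(X) = x1*p1 + x2*p2 + ... + xk*pk"""
    return sum(x * p for x, p in zip(values, probs))

def dispersion(values, probs):
    """D(X) = M(X^2) - [M(X)]^2"""
    m = expectation(values, probs)
    return sum(x ** 2 * p for x, p in zip(values, probs)) - m ** 2

# Fair six-sided die: each face has probability 1/6.
xs = [1, 2, 3, 4, 5, 6]
ps = [1 / 6] * 6
print(round(expectation(xs, ps), 4))  # 3.5
print(round(dispersion(xs, ps), 4))   # 35/12, about 2.9167
```

The same two functions reproduce the mechanical interpretation: M(X) is the centroid of the masses p_i placed at the points x_i.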
The positive root of the dispersion is called the root-mean-square (standard) deviation and is denoted by

\sigma = \sqrt{D(X)}

The root-mean-square deviation has the same dimension as the random variable itself.

The coefficient of variation is the value

V = \frac{\sigma}{M(X)} \cdot 100\%

The coefficient of variation is a dimensionless value used for comparing the degrees of variation of random variables with different units of measurement.

The skewness ratio (or coefficient of skewness) of a distribution is the value

A = \frac{\mu_3}{\sigma^3}

where μ₃ is the third central moment. The coefficient of skewness characterizes the degree of skewness of the distribution of the random variable relative to its mathematical expectation. For symmetrical distributions A = 0. If the peak of the graph of f(x) is shifted towards small values (the "tail" of the graph of f(x) stretches to the right), then A > 0; in the contrary case A < 0.

[Figure: density curves f(x) for A > 0, A = 0 and A < 0.]

The coefficient of excess (or peakedness) is the value

E = \frac{\mu_4}{\sigma^4} - 3

where μ₄ is the fourth central moment. The coefficient of excess is a measure of the sharpness of the peak of the probability density graph f(x).

[Figure: density curves f(x) for E > 0, E = 0 and E < 0.]
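The sign behavior of A described above can be checked numerically (an added sketch, not part of the original text; sample central moments stand in for the theoretical μ₃ and μ₄, and the data are invented):

```python
import math

def skewness_excess(xs):
    """A = mu3 / sigma^3 and E = mu4 / sigma^4 - 3, with sample central moments."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    mu3 = sum((x - mean) ** 3 for x in xs) / n
    mu4 = sum((x - mean) ** 4 for x in xs) / n
    return mu3 / sigma ** 3, mu4 / sigma ** 4 - 3

# A symmetric sample gives A = 0; a long tail to the right gives A > 0.
a_sym, _ = skewness_excess([1, 2, 3, 4, 5])
a_right, _ = skewness_excess([1, 1, 1, 2, 10])
print(a_sym)        # 0.0
print(a_right > 0)  # True
```

The same function returns E, which is 0 for the normal law by construction of the "- 3" term.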