UDK 911.5.1./.9 (075.8)
BBK 26.8:32.8.я73
М77

T.A. Mongolina, E.P. Yankovich, L.V. Nadeina

Geographic information systems and mathematical modeling: lectures for the course "Geographic information systems and mathematical modeling" for students of programme 022000 "Ecology and Environmental Management", professional profile "Geoecology" / T.A. Mongolina, E.P. Yankovich, L.V. Nadeina. Tomsk: Tomsk Polytechnic University Publishing House, 2012. 58 p.

Authorization granted by the Editorial Advisory Board of Tomsk Polytechnic University

Reviewer: Prof. S.I. Arbuzov, Doctor of Geological and Mineralogical Sciences

© STE HPT TPU, 2012
© T.A. Mongolina, E.P. Yankovich, L.V. Nadeina, 2012
© Design. Tomsk Polytechnic University Publishing House, 2012

CONTENTS

Lecture 1. Modeling in science. Types, principles and methods of mathematical modeling. Statistical modeling. One-dimensional statistical models
Lecture 2. Two-dimensional and multi-dimensional statistical models. Spatial modeling
Lecture 3. Introduction to geographic information systems
Lecture 4. Study of geographical data
Lecture 5. Coordinate systems and map projection

LECTURE 1. MODELING IN SCIENCE.
TYPES, PRINCIPLES AND METHODS OF MATHEMATICAL MODELING. STATISTICAL MODELING. ONE-DIMENSIONAL STATISTICAL MODELS.

Modeling in science

Modeling is one of the methods of studying the surrounding world. The process of developing and using a model is called modeling. A model (measure, sample, standard) is a material or imaginary object that replaces the original in the process of study and retains those typical characteristics of the original that are important for the investigation. Modeling belongs to the general scientific methods and is applied at both the empirical and the theoretical level.

The term "model" is often used to denote:
1) a device reproducing the construction or function of another device (reduced, magnified or full-sized);
2) an analog (drawing, graph, plan, scheme, description, etc.) of a phenomenon, process or object.

To develop a model, an investigator always proceeds from the purpose in hand and takes into account only the factors that are most important from his or her point of view. Therefore no model is identical to the original object, and no model is complete.

Material systems as objects of study are divided into well-organized and poorly organized ones. Well-organized systems consist of a limited number of elements linked by strictly defined, unique dependences. Examples are the simplest chemical and physical processes, mechanisms, devices, etc. Their properties and states can be described by physical and chemical laws.

Complicated natural objects and phenomena are poorly organized systems. Typical examples are living organisms and their communities, as well as many objects studied by the Earth sciences. When studying such systems we can find only particular regularities in their structure, i.e. tendencies that do not lend themselves to strict quantification.
The basic method of studying poorly organized systems is modeling, in which the direct object of study is replaced by its simplified analog, a model. According to the nature of the model, object modeling and sign (information) modeling are distinguished.

Object modeling is carried out with a model that reproduces certain geometrical, physical, dynamical or functional characteristics of the object.

Sign (information) modeling uses sign representations (diagrams, schemes, graphs, hieroglyphs, character sets) as models. Mathematical modeling (modeling by means of mathematical relations) is an example of sign (information) modeling. The formalization of geological concepts often has to be checked during the mathematical processing of geological information.

Different methods of sign (information) modeling play a key role in the Earth sciences. According to the nature of the information they are divided into verbal, graphical and mathematical models. Numerous classifications, concepts and definitions are verbal models. Graphical geoecological documents (maps, plans, schemes, sections, projections, etc.) are graphical models in so far as they approximately depict the properties of real objects. Numbers and formulae describing relations and regularities in the change of the properties of geological formations or of the parameters of geological processes are mathematical models.

In recent years the borderline between these types of models has become conventional because of the wide use in geoecological investigations of computer modeling, which draws on various kinds of geoecological information. Cartographic information is digitized using a nominal scale, and the results of measurements made during geochemical and geophysical surveys are displayed as maps by means of plotters or graphical displays.
Types of mathematical models

Mathematical models are classified according to their construction principle, according to the nature of the relationships they describe, and according to the types of problems they solve.

According to the construction principle, static and dynamic modeling are distinguished. Static modeling consists in the mathematical formulation of the properties of the investigated object from the results of their study, by inductive generalization of a sample of empirical observations. Dynamic modeling uses deductive methods, in which the properties of specific objects are derived from general ideas about their structure and from the laws that define their properties.

Static modeling involves:
• transformation of geoecological information into a form convenient for analysis;
• determination of regularities in mass, random measurements of the properties of the studied object;
• mathematical description of the revealed regularities (construction of the mathematical model);
• use of the obtained quantitative characteristics to solve specific geoecological problems: testing a geoecological hypothesis, selecting a method for further study of the object, etc.;
• estimation of the probability of possible errors in the solution of the formulated problem that arise because the object is studied by a sampling method.

According to the nature of the relationships between the parameters and properties of the studied objects, mathematical models are divided into deterministic and statistical ones.

Deterministic models express functional relationships between arguments and dependent variables. They are written as equations in which each value of the argument corresponds to exactly one value of the dependent variable. Deterministic models are seldom used to model geoecological objects, probably because they agree poorly with real phenomena, where functional relationships hold only within a limited range.

Mathematical expressions that include at least one random component (i.e.
a variable whose value cannot be predicted exactly for a single observation) are called statistical models. They are widely used in mathematical modeling because they account well for random fluctuations of experimental data.

The variety of geoecological problems and objects of study makes it necessary to use methods from different branches of mathematics in the process of modeling (probability theory and mathematical statistics, set theory, group theory, information theory, graph theory, game theory, vector and matrix algebra, differential geometry). The same problem can often be solved by different methods, while in some cases a single problem requires a combination of methods from different branches of mathematics. This makes it rather difficult to classify the mathematical methods used in geoecology.

At the same time, according to the types of problems solved and the set of mathematical methods used, all mathematical models fall distinctly into two groups.

The first group consists of models that rely mainly on the apparatus of probability theory and mathematical statistics. Geoecological objects are considered internally homogeneous, and changes of their properties in space are considered random and independent of the measurement site. Such models can conventionally be called statistical. Depending on the number of simultaneously examined properties they are divided into one-dimensional, two-dimensional and multidimensional models. Statistical models are usually used for:
• obtaining reliable estimates of the properties of geological objects from sampling data;
• testing hypotheses;
• identifying and describing dependences between the properties of geological objects;
• classifying geological objects;
• determining the amount of sampling data needed to estimate the properties of geological objects to a specified accuracy.
The second group consists of models in which the properties of geoecological objects are treated as spatial variables. These models assume that the properties of geoecological objects depend on the coordinates of the measuring point and that their change in space follows definite regularities. Along with certain probabilistic methods (random functions, time series, variance analysis), they also use the techniques of combinatorics (polynomials), harmonic analysis, vector algebra, differential geometry and other branches of mathematics. Both static and dynamic modeling are used to study spatial geoecological variables.

Models of spatial geological variables are used to solve problems dealing with:
• testing hypotheses about regularities in the location of geoecological objects relative to each other;
• testing hypotheses about the nature of the processes by which geoecological formations develop;
• isolation of anomalies in fields;
• classification of geoecological objects according to the features of their internal structure;
• development of interpolation and extrapolation techniques for the delineation of geoecological objects;
• selection of the optimal density and shape of the observation network for the study of geoecological objects.

Principles and methods of mathematical modeling in geoecology

The use of mathematical modeling in geoecology involves a number of difficulties. A mathematical model, like any other model, is a simplified analog of the investigated object. Because geoecological objects and processes are complex, no mathematical model can reproduce all their properties. It is therefore often necessary to use different mathematical models to describe various properties of one and the same object. In doing so, one must make sure that the selected model adequately represents exactly those properties of the object that affect the solution of the problem.
A mathematical model cannot characterize the examined properties completely. It is based on certain assumptions about the nature of the properties of the modeled object.

Thus, the solution of geoecological problems on the basis of mathematical modeling is a rather difficult process, which can be divided into the following steps:
1) setting the problem;
2) determining the geoecological population, i.e. delimiting the geoecological object or the time interval of the geoecological process;
3) determining the basic properties of the object or parameters of the process in the context of the problem posed;
4) passing from the geoecological population to the tested and sampled populations, depending on the characteristics of the investigation methods;
5) selecting the type of mathematical model;
6) formulating the mathematical problem in the context of the selected mathematical model;
7) selecting the method for solving the mathematical problem;
8) solving the mathematical problem, i.e. calculating the parameters of the mathematical model of the object;
9) interpreting the obtained results as applied to the geoecological problem;
10) estimating the probability and the value of the possible error caused by the inadequacy of the model to the object.

Thus, the steps of developing the geoecological model (the tested and sampled geoecological populations) precede mathematical modeling proper.

Sampling methods of study are widespread in geoecological investigations. Local observation areas and samples are very small compared with the areas and the volumes of the Earth's interior under investigation. This raises the problems of where to place local observation stations and how to systematize sampling data. An investigator judges the properties of the whole population by studying the part of it that is accessible to observation and sampling; this part is called the sampled population.
How well the properties of the sampled population correspond to those of the studied population depends on the location, density and total number of observation points, and also on the size, orientation, shape and volume of the samples or on the method used to measure the property. There are three main systems for locating observation points: uniform sampling, random sampling and multistage sampling.

Each geological population can be put into correspondence with a set of elementary values obtained by measuring or analyzing some property of the geological object. Such sets of values are called sampling (statistical) populations.

Statistical modeling

Two concepts, the general population and the sample, form the basis of statistical modeling. The general population is the set of all possible values of a given characteristic of the examined object or phenomenon. The sample is the set of observed values of this characteristic.

Statistical modeling assumes that the sampling population satisfies the requirements of mass, homogeneity, randomness and independence.

The mass condition arises because statistical regularities manifest themselves in mass phenomena, so the size of the sampling population must be sufficiently large. It has been established empirically that the reliability of statistical estimates falls as the sample is reduced from 60 to 30-20 values, and that applying statistical methods to fewer observations is pointless.

The homogeneity condition requires that the sampling population consist of observations that belong to one object and are made by the same method, i.e. the sample size and the method of analysis must be constant. When summarizing the results of geoecological investigations, however, one has to deal with data obtained by various techniques in different years.
Since in the practice of geoecological investigations the homogeneity condition is not always met, the use of statistical methods has to be accompanied by an analysis of the possible consequences of violating this condition. The nature of the geoecological problem being solved must be taken into account, and in some cases special methods must be used to test the hypothesis that the sample is homogeneous.

The randomness condition means that the result of a single sample observation is unpredictable. As a rule, the complexity and variability of geoecological objects rule out an accurate estimate of their properties before observation. The randomness condition is therefore strictly met only when the choice of the sampling location or of the measurement site is not connected with the value of the studied property.

The independence condition means that the result of each observation does not depend on the results of previous and subsequent observations and, for observations distributed over an area or a volume, does not depend on the space coordinates. This condition is not met for most geoecological processes and formations: the properties of geoecological formations change in space, and the parameters of geoecological processes change in time, according to certain regularities. For this reason the field of application of statistical models is limited to objects whose properties show no regular change in space or time, and to problems in whose solution such regularities need not be taken into account.

The concept of the probability of a random event is one of the main concepts of statistical modeling. An event is any fact that may occur as the result of an experiment or test. An experiment or test, in turn, is the realization of a certain set of conditions, in which a person does not necessarily take part. All events are subdivided into certain (persistent), impossible and random.
An event that is sure to happen in a given test is called certain (persistent). An impossible event never occurs in a given test. Random events may or may not happen in a given test.

A variable that takes, as the result of a test, one or another value not known in advance is called a random variable. Random variables are discrete (discontinuous) or continuous, and their values may be bounded or unbounded. A discrete variable can take only isolated fixed values, and in any specified interval the number of these values is finite. A continuous random variable can take infinitely many values in any specified interval.

A quantity called probability is used as a measure of the possibility of random events. The probability of event A is a number that characterizes the objective possibility of the occurrence of this event. It is denoted either by P(A) or by p, i.e. p = P(A).

Classical interpretation: the probability of event A is equal to the ratio of the number of outcomes favourable to event A to the total number of outcomes:

$$P(A) = \frac{m}{n}, \qquad (1.1)$$

where n is the total number of outcomes and m is the number of outcomes favourable to event A.

P(A) varies from 0 to 1. The probability of a certain event is equal to 1, and the probability of an impossible event is equal to 0.

The classical interpretation applies when probabilities can be predicted from the symmetry of the conditions under which the experiment is carried out and, consequently, from the symmetry of its outcomes, which leads to the concept of "equally possible" outcomes. The classical interpretation is therefore tied to the concept of equal possibility and is used for experiments that reduce to the scheme of events. This requires that the events e1, e2, ..., en be incompatible, i.e. no two of them can occur together; that they form a complete group, i.e.
they exhaust all possible outcomes; and that they be equally possible, provided that the experiment ensures an equal possibility of occurrence for each of them.

It is rather difficult to find any regularities by analyzing the results of individual tests, but the stability of mean characteristics can be discovered in a sequence of identical trials. The ratio m/n of the number of tests m in which event A occurred to the total number of tests n is called the relative frequency of the event in this series of n tests. In almost every sufficiently long series of tests the relative frequency of event A settles at a definite value m/n, which is taken as the probability of event A. The stability of the relative frequency has been verified by special experiments. Statistical regularities of this kind were first discovered in games of chance, i.e. in tests characterized by equally possible outcomes. This opened the way to the statistical approach to the numerical determination of probability when the symmetry conditions of the experiment are violated.

The relative frequency of event A is called the statistical probability and is written as

$$P^{*}(A) = \frac{m_A}{n}, \qquad (1.2)$$

where m_A is the number of experiments in which event A occurred and n is the total number of experiments.

Formulae (1.1) and (1.2) look similar but are essentially different. Formula (1.1) serves for the theoretical calculation of the probability of an event from the given conditions of the experiment. Formula (1.2) serves for the experimental determination of the relative frequency of an event, and experimental statistical material is needed to use it.

The basic properties of probability

1. For every random event A its probability is defined, and $0 \le P(A) \le 1$.
2. For the certain (persistent) event U, $P(U) = 1$.
Properties 1 and 2 follow from the definition of probability.
3.
If events A and B are mutually exclusive, the probability of their sum is equal to the sum of their probabilities: $P(A + B) = P(A) + P(B)$. This property is called the law of addition of probabilities in the special case (for mutually exclusive events).
4. For arbitrary events A and B, $P(A + B) = P(A) + P(B) - P(AB)$. This property is called the law of addition of probabilities in the general case.

For opposite events $A$ and $\bar{A}$, $P(A) + P(\bar{A}) = 1$.

In addition, the impossible event, denoted by $\varnothing$, is introduced: no outcome from the space of elementary events is favourable to it. The probability of the impossible event is equal to 0, $P(\varnothing) = 0$.

The basic characteristics of random variable

The properties of a random variable can be characterized by different parameters. The most important of them are the mathematical expectation of the random variable, denoted by M(X), and the dispersion $D(X) = \sigma^2(X)$, whose square root $\sigma(X)$ is called the standard deviation.

For a discrete (discontinuous) random variable, the mathematical expectation M(X) is defined as the sum of the products of the possible values of the random variable and their probabilities:

$$M(X) = x_1 p_1 + x_2 p_2 + \dots + x_k p_k = \sum_{i=1}^{k} x_i p_i \quad \text{or} \quad M(X) = \frac{\sum_{i=1}^{k} x_i p_i}{\sum_{i=1}^{k} p_i}.$$

Mechanical interpretation of the mathematical expectation: M(X) is the abscissa of the centre of mass of a system of point masses whose abscissas are equal to the possible values of the random variable and whose masses are equal to the corresponding probabilities.

The mathematical expectation of a continuous random variable is the integral

$$M(X) = \int_{-\infty}^{\infty} x f(x)\,dx,$$

which is assumed to converge absolutely; here f(x) is the probability density of the random variable X.

The mathematical expectation M(X) can be understood as the "theoretical mean value of the random variable".

Consider the properties of the mathematical expectation:
1. The mathematical expectation has the same dimension as the random variable.
2. The mathematical expectation can be either positive or negative.
3.
The mathematical expectation of a constant C is equal to that constant, i.e. $M(C) = C$.
4. The mathematical expectation of a sum of random variables is equal to the sum of their mathematical expectations, i.e. $M(X + Y + \dots + W) = M(X) + M(Y) + \dots + M(W)$.
5. The mathematical expectation of the product of two or more mutually independent random variables is equal to the product of their mathematical expectations, i.e. $M(XY) = M(X)\,M(Y)$.
6. The mathematical expectation of the product of a random variable and a constant C is equal to the product of the constant and the mathematical expectation of the random variable: $M(CX) = C\,M(X)$.

Along with the mathematical expectation, other characteristics are used. The median $x_{med}$ divides the distribution of X into two equal parts and is defined by the condition $F(x_{med}) = 0.5$. The mode $x_{mod}$ is the most frequently occurring value of X; for a continuously distributed random variable it is the abscissa of the maximum point of f(x). In symmetrical distributions all three characteristics (mathematical expectation, median and mode) coincide. If there are several modes, the distribution is called multimodal.

While the mathematical expectation of a random variable gives its "average", the point on the coordinate line around which the values of the random variable are spread, the dispersion characterizes the degree of spread of the values of the random variable about its average value.

The dispersion of a random variable X is the mathematical expectation of the squared deviation of the random variable from its mathematical expectation, i.e.

$$D(X) = M\big[(X - M(X))^2\big].$$

The dispersion is calculated by the formula

$$D(X) = M(X^2) - [M(X)]^2.$$

For a discrete random variable X this gives

$$D(X) = \sum_{i=1}^{k} x_i^2\, p_i - [M(X)]^2.$$

For a continuous random variable X

$$D(X) = \int_{-\infty}^{\infty} \big(x - M(X)\big)^2 f(x)\,dx.$$

The dimension of the dispersion is equal to the square of the dimension of the random variable.

Properties of dispersion:
1. The dispersion of a constant is always equal to 0: $D(C) = 0$.
2.
A constant factor can be taken outside the dispersion sign after being squared: $D(CX) = C^2 D(X)$.
3. The dispersion of the algebraic sum of two independent random variables is equal to the sum of their dispersions: $D(X \pm Y) = D(X) + D(Y)$.

The positive square root of the dispersion is called the root-mean-square (standard) deviation and is denoted by $\sigma = \sqrt{D(X)}$. The root-mean-square deviation has the same dimension as the random variable.

A random variable is called centered if $M(X) = 0$, and standardized if $M(X) = 0$ and $\sigma = 1$.

In the general case the properties of a random variable can be characterized by ordinary moments and moments about the mean of different orders.

The ordinary moment of order K is the number $\alpha_K$ determined by the formula

$$\alpha_K = M(X^K) = \begin{cases} \sum_i x_i^K\, p_i & \text{for discrete } X, \\ \int_{-\infty}^{\infty} x^K f(x)\,dx & \text{for continuous } X, \end{cases}$$

where $M(X^K)$ is the mathematical expectation of the K-th power of the random variable.

The moment about the mean of order K is the number $\mu_K$ determined by the formula

$$\mu_K = M\big[(X - \alpha_1)^K\big] = \begin{cases} \sum_i (x_i - \alpha_1)^K\, p_i & \text{for discrete } X, \\ \int_{-\infty}^{\infty} (x - \alpha_1)^K f(x)\,dx & \text{for continuous } X. \end{cases}$$

From the definitions of the moments it follows, in particular, that

$$\alpha_0 = \mu_0 = 1, \qquad \alpha_1 = M(X), \qquad D(X) = \sigma^2 = \mu_2 = \alpha_2 - \alpha_1^2.$$

Characteristics derived from the ordinary moments and the moments about the mean are often used.

The coefficient of variation is the value

$$V = \frac{\sigma}{\alpha_1} \cdot 100\,\%.$$

The coefficient of variation is a dimensionless value used to compare the degrees of variation of random variables with different units of measurement.

The skewness ratio (or coefficient of skewness) of a distribution is the value

$$A = \frac{\mu_3}{\sigma^3}.$$

The coefficient of skewness characterizes the degree of asymmetry of the distribution of a random variable relative to its mathematical expectation. For symmetric distributions A = 0. If the peak of the graph of f(x) is shifted towards small values (the "tail" of the graph of f(x) extends to the right), A > 0. In the opposite case A < 0 (see Fig. 1).
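The moment-based characteristics above can be checked on a small example. The following is a minimal sketch in Python, assuming a hypothetical distribution series (the values `xs` and probabilities `ps` are illustrative and do not come from the lectures); the function names are likewise only illustrative. It computes the ordinary moments, the moments about the mean, the dispersion, the coefficient of variation and the coefficient of skewness of a discrete random variable.

```python
from math import sqrt


def raw_moment(xs, ps, k):
    """Ordinary moment of order k: sum of x_i**k * p_i."""
    return sum(x ** k * p for x, p in zip(xs, ps))


def central_moment(xs, ps, k):
    """Moment about the mean of order k: sum of (x_i - M(X))**k * p_i."""
    mean = raw_moment(xs, ps, 1)
    return sum((x - mean) ** k * p for x, p in zip(xs, ps))


# Hypothetical distribution series (assumption, for illustration only).
xs = [0, 1, 2, 3]
ps = [0.4, 0.3, 0.2, 0.1]

mean = raw_moment(xs, ps, 1)            # mathematical expectation M(X)
dispersion = central_moment(xs, ps, 2)  # D(X), second moment about the mean
sigma = sqrt(dispersion)                # standard deviation
variation = sigma / mean * 100          # coefficient of variation, %
skewness = central_moment(xs, ps, 3) / sigma ** 3

print(round(mean, 6))        # 1.0
print(round(dispersion, 6))  # 1.0
print(round(variation, 6))   # 100.0
print(round(skewness, 6))    # 0.6
```

In this example the probability mass is concentrated at small values with a tail to the right, so the coefficient of skewness comes out positive, and the dispersion agrees with the shortcut formula D(X) = M(X²) − [M(X)]².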
Fig. 1. Dependence of the probability density graph f(x) on the coefficient of skewness A (curves for A > 0, A = 0 and A < 0)

The coefficient of excess (or peakedness) is the value

$$E = \frac{\mu_4}{\sigma^4} - 3,$$

where $\mu_4$ is the fourth-order moment about the mean and $\sigma$ is the standard deviation. The coefficient of excess is a measure of the sharpness of the peak of the probability density graph f(x) (Fig. 2).

Fig. 2. Dependence of the graph of a symmetric probability density f(x) on the coefficient of excess E (curves for E > 0, E = 0 and E < 0)

Laws of random variable distribution

The law of distribution of a random variable is the relationship between all possible values of the random variable and their corresponding probabilities. It can be presented in tabulated form, graphically, or as a distribution function.

A distribution series is the set of possible values $x_i$ and their corresponding probabilities $p_i = P(X = x_i)$; it can be presented in tabulated form.

Table 1. Distribution series of a discrete random variable X

x_i | x_1 | x_2 | ... | x_k
p_i | p_1 | p_2 | ... | p_k

Here the probabilities $p_i$ satisfy $\sum_{i=1}^{k} p_i = 1$, where the number of possible values k can be finite or infinite.

The graphic presentation of a distribution series is called a distribution polygon. To draw it, the possible values of the random variable $x_i$ are plotted on the abscissa and the probabilities $p_i$ on the ordinate; the points $A_i$ with coordinates $(x_i, p_i)$ are connected by a broken line. If the true probabilities are not known, the relative frequency of occurrence of each value is plotted on the ordinate.

The distribution function is the most general form of description of the distribution law. It defines the probability that the random variable X takes a value less than a specified value x. This probability depends on x and is therefore a function of x, i.e.
$F(x) = P(X < x)$.

For a discrete random variable the function F(x) is calculated by the formula

$$F(x) = \sum_{x_i < x} p_i,$$

where the summation is carried out over all i for which $x_i < x$.

A continuous random variable is characterized by a non-negative function f(x), called the probability density, which is defined by

$$f(x) = \lim_{\Delta x \to 0} \frac{P(x \le X < x + \Delta x)}{\Delta x}.$$

For any x the probability density f(x) satisfies the equality

$$F(x) = \int_{-\infty}^{x} f(t)\,dt,$$

which links it with the distribution function F(x). Thus a continuous random variable is specified either by the distribution function F(x) (integral law) or by the probability density f(x) (differential law).

[Figure: graphs of the integral distribution function for a discrete random variable and for a continuous random variable]

The distribution function F(x) (integral law of distribution) has the following properties:
1) $P(a \le X < b) = F(b) - F(a)$;
2) $F(x_1) \le F(x_2)$ if $x_1 < x_2$;
3) $\lim_{x \to +\infty} F(x) = 1$;
4) $\lim_{x \to -\infty} F(x) = 0$.

The probability density f(x) (differential law of distribution) has the following basic properties:
1) $f(x) \ge 0$;
2) $f(x) = \dfrac{dF(x)}{dx} = F'(x)$;
3) $\int_{-\infty}^{x} f(t)\,dt = F(x)$;
4) $\int_{-\infty}^{\infty} f(x)\,dx = 1$;
5) $P(a \le X < b) = \int_{a}^{b} f(x)\,dx$.

Geometrically, the probability that X falls in the interval (a, b) is equal to the area of the curvilinear trapezoid corresponding to the definite integral $\int_{a}^{b} f(x)\,dx$ (Fig. 3).

Fig. 3. Graphic presentation of the probability density function (differential function of distribution)

Consider the laws of distribution that are used most often.

Normal Distribution (the term was first used by Galton in 1889; the distribution is also called Gaussian). The normal distribution (the "bell-shaped curve", which is symmetrical about the mean) is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions. In general, the normal distribution provides a good model for a random variable when:
1. There is a strong tendency for the variable to take a central value;
2. Positive and negative deviations from this central value are equally likely;
3.
The frequency of deviations falls off rapidly as the deviations become larger.

As an underlying mechanism that produces the normal distribution, one may think of an infinite number of independent random (binomial) events that bring about the values of a particular variable. For example, there are probably a nearly infinite number of factors that determine a person's height (thousands of genes, nutrition, diseases, etc.). Thus, height can be expected to be normally distributed in the population.

The normal distribution density is determined by the following formula:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad -\infty < x < \infty,$$

where
μ is the mean;
σ is the standard deviation;
e is the base of the natural logarithm, sometimes called Euler's e (2.71...);
π is the constant Pi (3.14...).

The exact form of the normal distribution (the specific "bell curve", see Fig.) is defined by only two parameters: the mean and the standard deviation.

A specific property of the normal distribution is that 68% of all observations fall within ±1 standard deviation of the mean, and the range of ±2 standard deviations includes 95% of the values. In other words, under the normal distribution, standardized observations below −2 or above +2 have a relative frequency of less than 5%. (A standardized observation is obtained by subtracting the mean from the original value and dividing the result by the standard deviation.)

Graphical methods, sample parameters of the distribution shape and goodness-of-fit measures are usually used to assess whether the available experimental data agree with the normal distribution law.

Log-normal Distribution. The log-normal distribution is often used in simulations of variables such as personal incomes, age at first marriage, or tolerance to poison in animals. In general, if x is a sample from a normal distribution, then $y = e^x$ is a sample from a log-normal distribution.
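The statements about the normal distribution made above (the density formula and the 68% / 95% property) can be verified numerically. The sketch below is a minimal illustration in Python that uses only the standard `math` module; the function names `normal_pdf`, `coverage` and `standardize`, and the sample numbers passed to `standardize`, are illustrative assumptions, not part of the lectures. The coverage uses the known identity P(|X − μ| < kσ) = erf(k/√2) for a normal variable.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def coverage(k):
    """Probability that a normal variable lies within k standard deviations of the mean."""
    return erf(k / sqrt(2))


def standardize(value, mean, sigma):
    """Standardized observation: subtract the mean, divide by the standard deviation."""
    return (value - mean) / sigma


print(round(coverage(1), 4))      # 0.6827
print(round(coverage(2), 4))      # 0.9545
print(round(1 - coverage(2), 4))  # 0.0455, i.e. less than 5%

# Hypothetical observation 13.0 from a population with mean 10.0 and sigma 2.0.
print(standardize(13.0, 10.0, 2.0))  # 1.5
```

The density is symmetric about the mean, so `normal_pdf(mu - d, mu, sigma)` and `normal_pdf(mu + d, mu, sigma)` return the same value for any deviation `d`.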
Thus, the log-normal distribution is defined as:
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}},$$
where x > 0; -∞ < μ < +∞; σ > 0;
μ is the scale parameter;
σ is the shape parameter;
e is the base of the natural logarithm, sometimes called Euler's e (2.71...).
Fig. 4 Graphs f(x) and F(x) of log-normal distribution (panels: Probability Density Function y = lognorm(x; 0; 0,5); Probability Distribution Function p = ilognorm(x; 0; 0,5))
Chi-squared Distribution. A continuous random variable X possesses a chi-squared distribution with m degrees of freedom if it can be represented as the sum of squares of m independent values distributed according to the standard normal law N(0; 1), i.e. if its probability density is of the form (see Fig. 5):
$$f_{Ch}(x; m) = \frac{1}{2^{m/2}\,\Gamma(m/2)}\, e^{-x/2}\, x^{\frac{m}{2}-1}, \quad 0 \le x < \infty,$$
where $\Gamma(z) = \int_{0}^{\infty} e^{-t} t^{z-1}\,dt$ is the gamma function, for which
$$\Gamma\left(n + \tfrac{1}{2}\right) = \frac{(2n-1)!!}{2^{n}}\sqrt{\pi} \quad \text{and} \quad \Gamma(n+1) = n! \quad \text{for integer } n \ge 0.$$
Fig. 5 Graphs f(x) and F(x) of chi-squared distribution (panels: Probability Density Function y = chi2(x; 3); Probability Distribution Function p = ichi2(x; 3))
Characteristics of the chi-squared distribution: $M[x] = m$, $x_{mod} = m - 2$, $D[x] = 2m$, $A = 2^{3/2}/\sqrt{m}$, $E = 12/m$.
The density graph of the chi-squared distribution is asymmetric (positively skewed, since A > 0: the peak is shifted to the left and the long tail extends to the right), peaked (E > 0), and $x_{mod} < M[x]$. The dependence of the density graphs of the chi-squared distribution on m is shown in Fig. 6 below.
Fig. 6 Dependence of f(x) graphs of chi-squared distribution on m (curves y = chi2(x; m) for m = 5 and m = 10)
Student's t Distribution. The Student's t distribution is symmetric about zero, and its general shape is similar to that of the standard normal distribution. It is most commonly used in testing hypotheses about the mean of a particular population. The Student's t distribution is defined as (for m = 1, 2, . . .):
$$f_t(x; m) = \frac{\Gamma\left(\frac{m+1}{2}\right)}{\sqrt{\pi m}\;\Gamma\left(\frac{m}{2}\right)} \left(1 + \frac{x^2}{m}\right)^{-\frac{m+1}{2}}, \quad -\infty < x < \infty.$$
Fig. 7 Graphs f(x) and F(x) of t-distribution law (panels: Probability Density Function y = student(x; 5); Probability Distribution Function p = istudent(x; 5))
Characteristics of the t-distribution: $M[x] = x_{med} = x_{mod} = 0$, $D[x] = \frac{m}{m-2}$, $A = 0$, $E = \frac{6}{m-4}$.
If the number of degrees of freedom is large (m > 30), the t-distribution practically coincides with the standard normal distribution N(x; 0; 1).
One-dimensional statistical models
One-dimensional statistical models are used to solve two types of problems: to estimate the average parameters of geoecological objects and to verify hypotheses statistically.
Because the conditions under which geoecological objects are studied may deviate from the strict requirements imposed on a statistical experiment, the statistical analysis of geoecological data should in practice be divided into two stages: exploratory and confirmatory. The aim of the first stage is to convert observational data into a more compact and visual form that makes it possible to identify regularities in these data. This in turn makes it possible, at the second stage, to apply the traditional statistical methods of solving geoecological problems in a better substantiated way.
At the first stage it is reasonable to apply methods that require no a priori assumptions about the properties of the sampled population and no labour-intensive calculations. Preference should be given to methods in which numerical information is converted into graphic form.
Statistical characteristics of sample random variable
The calculation of the statistical characteristics of a sample random variable is the basis for most computations.
The most widely used statistical characteristics of a one-dimensional random variable are:
range
median
mode
average value
dispersion
root-mean-square deviation
coefficient of variation
skewness
excess
Suppose there are n measurements of a property x. It is necessary to find the statistical characteristics of this set of measurements.
Range is the difference between the maximum $x_{max}$ and minimum $x_{min}$ values of the property: $R = x_{max} - x_{min}$.
Median is the middle value of the ordered series of values. To find the median it is necessary to arrange all values in increasing or decreasing order and to find the middle term of the series. If n is an even integer, there will be two values in the middle of the series, and the median is equal to half of their sum.
Mode is the most frequently occurring value of the random variable.
Average value is the arithmetic mean of all measured values:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Median, mode and average value are characteristics of position. The measured values of the random variable are grouped near them.
Dispersion is a number equal to the average squared deviation of the values of the random variable from its average value (the dispersion of a random variable is a measure of its spread, i.e. of its deviation from the mathematical expectation).
Average square deviation is a number equal to the square root of the dispersion. The average square deviation has the same dimension as the random variable and its average value. For example, if the values of the random variable are measured in meters, the average square deviation will be expressed in meters too.
Coefficient of variation is the ratio of the average square deviation to the average value. The coefficient of variation is expressed as a fraction of unity or (after multiplication by 100) in percent. It makes sense to calculate the coefficient of variation for positive random variables.
Dispersion, average square deviation, coefficient of variation and range are measures of the scatter of values of the random variable about the average value. The larger these measures are, the greater the scatter.
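The characteristics of position and scatter listed above can be computed directly from a list of measurements. The following is only a minimal sketch: the function name `describe` and the sample values are hypothetical, and the dispersion is taken as the unbiased estimate (division by n - 1), which is an assumption because the text does not show which normalization is used.

```python
import math
import statistics


def describe(values):
    """Return position and scatter characteristics of a sample."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: unbiased estimate of dispersion (divide by n - 1).
    dispersion = sum((x - mean) ** 2 for x in values) / (n - 1)
    std = math.sqrt(dispersion)  # average square deviation
    return {
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "mean": mean,
        "dispersion": dispersion,
        "std": std,
        "cv": std / mean,  # meaningful for positive variables only
    }


# Hypothetical measurements of a property, e.g. in meters.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
for name, value in describe(sample).items():
    print(f"{name}: {value:.3f}")
```

Multiplying the `cv` entry by 100 gives the coefficient of variation in percent.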
Skewness is the degree of asymmetry of the distribution of values of the random variable relative to the average value.
Excess is the degree of peakedness or flat-toppedness of the distribution of values of the random variable relative to the normal distribution law.
Skewness and excess are dimensionless values. They show the peculiarities of the grouping of values of the random variable about the average value.
Thus:
Median, mode and average value are characteristics of position;
Dispersion, average square deviation, coefficient of variation and range are measures of scatter;
Skewness and excess show the peculiarities of the grouping of values.
Statistical estimates can be point estimates or interval estimates. In point estimation the unknown characteristic of a random variable is estimated by a number; in interval estimation it is estimated by an interval within which the true value of the estimated variable must lie with a specified probability.
A point estimate carries no information about the precision of the obtained result. The smaller the sample and the stronger the variability of the property, the larger the error can be. That is why, when the sample is very small, it is desirable to know the interval of property values within which the unknown true average value falls with a specified probability.
Suppose the statistical characteristic Θ* found from sample data serves as an estimate of the unknown parameter Θ. We will consider Θ to be a constant number (although Θ may also be a random variable). Clearly, Θ* determines the parameter Θ the more accurately, the smaller the absolute difference |Θ − Θ*| is. In other words, if δ > 0 and |Θ − Θ*| < δ, then the smaller δ is, the more accurate the estimate. Therefore, the positive number δ characterizes the accuracy of the estimate.
However, statistical methods do not allow us to state categorically that the estimate Θ* satisfies the inequality |Θ − Θ*| < δ; we can speak only of the probability γ with which this inequality holds.
The reliability (confidence probability) of the estimate of Θ by Θ* is the probability γ with which the inequality |Θ − Θ*| < δ holds.
Usually the reliability of an estimate is specified in advance, and a number close to 1 is taken as γ. The reliabilities most often used are 0,95; 0,99 and 0,999.
Suppose the probability that |Θ − Θ*| < δ is equal to γ:
P[ |Θ − Θ*| < δ ] = γ.
Replacing the inequality |Θ − Θ*| < δ by the equivalent double inequality −δ < Θ − Θ* < δ, or Θ* − δ < Θ < Θ* + δ, we have
P[ Θ* − δ < Θ < Θ* + δ ] = γ.
This relation should be understood as follows: the probability that the interval (Θ* − δ, Θ* + δ) contains the unknown parameter Θ is equal to γ.
The interval (Θ* − δ, Θ* + δ) which covers the unknown parameter with the desired reliability γ is called the confidence interval.
The way the confidence interval for the mathematical expectation is constructed depends on whether the dispersion σ² is known. If it is known, the confidence interval corresponding to the desired reliability (confidence probability) γ is given by
$$\left(\bar{x} - t\frac{\sigma}{\sqrt{n}};\ \bar{x} + t\frac{\sigma}{\sqrt{n}}\right).$$
A probability so low that the corresponding event can be considered practically impossible is called the significance level. Usually the significance level is denoted by α. The confidence probability and the significance level are related by γ = 1 − α.
Statistical verification of hypotheses
Many solutions of geoecological problems are based on analogy, when regularities established in the study of analogous objects are used to explain the structural features of poorly explored objects. To choose the right analog object it is necessary to estimate the measure of its similarity to the prototype system. In other cases it is necessary to estimate the measure of difference between geoecological objects with respect to one or another physical property.
Statistical methods for verifying hypotheses about the character of properties are used to solve the problem of similarity or difference of geoecological objects.
In geoecological practice these methods are used to verify hypotheses:
about the equality of average values of the studied property obtained by different methods for the same object or by one method for various objects;
about the equality of dispersions of two random variables from sample data;
about the homogeneity of the studied object.
Statistical verification of hypotheses is carried out with the help of goodness measures. A goodness measure is the value of a certain function K = f(X1, X2, ..., Xn), where X1, X2, ..., Xn are random variables characterizing the verified hypothesis. The function is chosen in such a way that, if the verified hypothesis is true, its values form a random variable with a distribution known in advance.
The verified hypothesis is accepted if the value K calculated from the sample values X1, X2, ..., Xn is less than or greater than (depending on the statement of the hypothesis) the theoretical value K taken from the corresponding distribution for the same conditions and the specified probability α. The probability α here corresponds to the probability of a practically impossible event and is called the significance level. The probability (1 − α), with which the correctness of the decision is a practically certain event, is correspondingly called the confidence probability.
The error of rejecting the verified hypothesis although it is true is called a type 1 error, and acceptance of a false hypothesis is called a type 2 error. If the probability of a type 2 error is denoted by β, then (1 − β), i.e. the probability that this error is absent, is called the power (strength) of the criterion relative to the competing hypothesis. An increase of the confidence probability decreases the probability of a type 1 error but increases the probability of a type 2 error.
The field of application of particular goodness measures is usually limited by certain conditions, and their power depends on the character of the competing (alternative) hypothesis and on the sample size.
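The trade-off between the two kinds of error can be illustrated numerically. The sketch below is not taken from the lecture: it assumes, purely for illustration, a one-sided test of the mean of a normal population with known standard deviation, for which the power has a closed form. The function name `power_one_sided` and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal law N(0; 1)


def power_one_sided(alpha, delta, sigma, n):
    """Power (1 - beta) of a one-sided test of H0: mean = m0 against
    H1: mean = m0 + delta (delta > 0), with known standard deviation sigma.
    Assumption: this simple test is chosen only to illustrate the text."""
    critical = STD_NORMAL.inv_cdf(1 - alpha)  # boundary of rejection region
    shift = delta * sqrt(n) / sigma           # standardized true difference
    return 1 - STD_NORMAL.cdf(critical - shift)


# Hypothetical setting: true shift of 0.5 units, sigma = 1.
for alpha in (0.05, 0.01):
    for n in (10, 30):
        power = power_one_sided(alpha, 0.5, 1.0, n)
        print(f"alpha={alpha}, n={n}: beta={1 - power:.3f}, power={power:.3f}")
```

Under these assumptions, lowering α (raising the confidence probability) increases β for a fixed sample size, while a larger sample increases the power, which is the behaviour described in the paragraph above.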
To solve problems by statistical verification of hypotheses it is necessary to perform the following operations:
• to formulate a clear testable hypothesis (H0) and an alternative hypothesis (H1) based on the essence of the problem;
• to choose the most powerful tests that do not contradict the properties of the studied random variables;
• to estimate the consequences of type 1 and type 2 errors for the problem being solved, and to choose the significance level so as to minimize the losses resulting from an incorrect decision;
• to calculate the empirical value of the goodness measure K from sample data, to compare it with the theoretical value K for the chosen significance level and to make a decision about hypothesis H0;
• to interpret the obtained result with respect to the posed problem.
Statistical goodness measures are divided into parametric and nonparametric.
Parametric goodness measures are derived from particular statistical laws of distribution and can be used only if the distribution of the sample data agrees with the corresponding law.
Nonparametric goodness measures can be used even if the distribution law of the investigated values is unknown or corresponds to none of the known laws. Nonparametric goodness measures usually have lower power than parametric ones, but their field of application is considerably wider.
Verification of hypotheses on parameters distribution law
Most statistical methods of problem solving are based on the properties of particular distribution laws. However, the investigator usually cannot know beforehand what properties the samples obtained during the investigation will have. That is why the comparison of empirical distributions with known theoretical ones is a preliminary stage in solving a specific problem.
Theoretical distribution conformance testing.
In most cases the law of distribution and its parameters are unknown when real problems are solved. At the same time, the statistical methods applied require a certain law of distribution as a prerequisite.
Hence, an important problem arising in the analysis of a single sample is to estimate the measure of agreement between the obtained empirical data and some theoretical distribution.
The assumption that the population is normally distributed is verified most often, because the majority of statistical procedures are designed for samples obtained from a normally distributed population.
The graphical method, sample parameters of the distribution form and goodness measures are used to estimate the correspondence of experimental data to the normal distribution law.
The graphical method makes it possible to estimate roughly the differences and coincidences of distributions.
If the number of observations is large (n > 100), the calculation of the sample parameters of the distribution form (excess and skewness) gives quite good results. The assumption of normality is considered not to contradict the available data if the skewness is close to zero, i.e. lies in the range from -0,2 to 0,2, and the excess lies in the range from -1 to 1.
The use of goodness measures gives the most satisfactory results. Goodness measures are statistical measures intended for verifying the agreement between experimental results and a theoretical model. Here, the null hypothesis (H0) is the statement that the distribution of the population from which the sample was obtained does not differ from normal.
The nonparametric measure χ² (chi-square) is the most widely used goodness measure. It is based on a comparison of the empirical frequencies in the grouping intervals with the theoretical frequencies calculated from the formulae of the normal distribution.
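The comparison of empirical and theoretical frequencies can be sketched as follows. This is an illustrative implementation under stated assumptions, not the lecture's own procedure: the grouping boundaries, the simulated sample and the function name `chi_square_normal` are hypothetical; the normal law is fitted with the sample mean and standard deviation; and the number of degrees of freedom is taken by the usual convention as (number of intervals - 1 - 2), because two parameters are estimated from the sample.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def normal_cdf(dist, x):
    """Distribution function of `dist`, extended to infinite arguments."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return dist.cdf(x)


def chi_square_normal(values, edges):
    """Chi-square statistic for agreement with a fitted normal law.

    `edges` are the inner boundaries of the grouping intervals; the first
    and last intervals are open-ended, so theoretical frequencies sum to n.
    """
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    bounds = [-math.inf] + list(edges) + [math.inf]
    chi2 = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        observed = sum(1 for v in values if lo < v <= hi)
        expected = n * (normal_cdf(fitted, hi) - normal_cdf(fitted, lo))
        chi2 += (observed - expected) ** 2 / expected
    dof = (len(bounds) - 1) - 1 - 2  # intervals - 1 - estimated parameters
    return chi2, dof


# Hypothetical data: 200 simulated measurements, six grouping intervals.
rng = random.Random(1)
data = [rng.gauss(50.0, 5.0) for _ in range(200)]
chi2, dof = chi_square_normal(data, edges=[42, 46, 50, 54, 58])
print(f"chi-square = {chi2:.2f} with {dof} degrees of freedom")
```

The calculated value is then compared with the tabulated χ² value for the chosen significance level and the given number of degrees of freedom; H0 is rejected if the calculated value is larger.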
Verification of hypotheses on location
The most important question arising in the analysis of two samples is whether these samples differ. For this purpose one usually verifies the statistical hypothesis that both samples belong to the same population, or that the population means are equal.
So-called tests of differences are used to solve problems of this kind. Different statistical tests can be used to verify the same hypothesis. The correct choice of test is determined by the characteristics of the data and of the verified hypotheses, and also by the experience of the investigator.
Parametric tests. Parametric tests are used to verify hypotheses of location and distribution.
Student's t-test (test of differences) is the most popular parametric test for verifying hypotheses about population means (mathematical expectations). The test gives the probability p of obtaining the observed difference if both means belong to the same population. If this probability p is lower than the significance level (p < 0,05), the samples are considered to belong to two different populations.
Two cases of using the t-test can be distinguished. In the first case it is applied to test the hypothesis of equality of the population means of two independent, unrelated samples (the so-called two-sample t-test). In this case there is a control group and an experimental group. In the second case, when the same group of objects provides the numerical material for testing hypotheses about means, the so-called paired t-test is used. The samples are then called dependent (related).
In both cases two requirements should be met: normality of the distribution of the investigated characteristic in each of the compared groups, and equality of dispersions in the compared populations. However, the correct use of Student's t-test for two groups is often difficult, since these conditions cannot always be reliably checked.
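The two cases can be sketched in code. For independent samples the statistic below is the difference of the sample means divided by $\sqrt{S_1^2/n_1 + S_2^2/n_2}$, with $(n_1 + n_2 - 2)$ degrees of freedom; for the paired case the usual statistic on the pairwise differences is assumed, since the text does not give its formula. Function names and data are hypothetical, and the critical values must still be taken from tables.

```python
import math
from statistics import mean, variance


def t_independent(x, y):
    """t statistic and degrees of freedom for two independent samples."""
    n1, n2 = len(x), len(y)
    t = (mean(x) - mean(y)) / math.sqrt(variance(x) / n1 + variance(y) / n2)
    return t, n1 + n2 - 2


def t_paired(first, second):
    """t statistic and degrees of freedom for dependent (paired) samples.
    Assumption: the standard statistic on pairwise differences is used."""
    d = [b - a for a, b in zip(first, second)]
    n = len(d)
    t = mean(d) / math.sqrt(variance(d) / n)
    return t, n - 1


# Hypothetical measurements of one property at two sites.
site_a = [1.0, 2.0, 3.0, 4.0, 5.0]
site_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t, dof = t_independent(site_a, site_b)
print(f"independent samples: t = {t:.3f}, degrees of freedom = {dof}")

# Hypothetical repeated measurements on the same four objects.
t, dof = t_paired([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])
print(f"paired samples: t = {t:.3f}, degrees of freedom = {dof}")
```

If the absolute value of the calculated t exceeds the tabulated value for the given degrees of freedom and confidence probability, the hypothesis of equal means is rejected.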
The use of the parametric Student's t-test is based on the fact that if a sample $X_1, X_2, ..., X_{n_1}$ of $n_1$ values and a sample $Y_1, Y_2, ..., Y_{n_2}$ of $n_2$ values are taken from a normally distributed population, the variable
$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}},$$
where $\bar{x}$, $\bar{y}$ are the sample estimates of the mean and $S_1^2$, $S_2^2$ are the sample estimates of the dispersion, follows the Student distribution law with $(n_1 + n_2 - 2)$ degrees of freedom.
Verification of the hypothesis on the equality of two sample means consists in substituting into this formula $\bar{x}$ and $S_1^2$ from the first sample and $\bar{y}$ and $S_2^2$ from the second sample, and comparing the obtained value of the t-test with the tabulated value for the given number of degrees of freedom and the specified confidence probability. If the calculated value of the test is greater than the tabulated one, the hypothesis on the equality of sample means is rejected.
When the sample data correspond to the log-normal model, it is recommended to use the Rodionov test to verify the hypothesis on the equality of sample means. D.A. Rodionov established that the variable
$$Z = \frac{\overline{\lg x} - \overline{\lg y} + 1{,}153\left(S_{\lg x}^2 - S_{\lg y}^2\right)}{\sqrt{\dfrac{S_{\lg x}^2}{n_1} + \dfrac{S_{\lg y}^2}{n_2} + 2{,}65\left(\dfrac{S_{\lg x}^4}{n_1 - 1} + \dfrac{S_{\lg y}^4}{n_2 - 1}\right)}}$$
is asymptotically normally distributed with mathematical expectation 0 and dispersion 1. Therefore, the theoretical value of the variable Z is found from the table of values of Laplace's integral function.
Nonparametric tests (Van der Waerden test, Wilcoxon test, goodness-of-fit test χ²) are usually used for small samples or in cases when average values are calculated from semiquantitative data, for example from the results of semiquantitative spectral analysis. Nonparametric tests are used when the data distribution law differs from normal or is unknown.
Verification of the hypothesis on the equality of means determined from two samples (A and B) with the help of the Van der Waerden X-test begins with ranking all values of both samples, i.e. writing them as one series in increasing order.
The X-test is the variable
$$X = \sum_{1}^{h} \psi\left(\frac{i}{n+1}\right),$$
where n is the total number of values in the two samples; h is the number of observations in the sample; i is the sequence number of each value of sample B in the general series; ψ(...) is the function inverse to the normal distribution function.
The nonparametric Wilcoxon test (W) is also based on the ranking procedure and is the sum of the ranks $R_i$ of the smaller sample in the total ranked series formed from both samples:
$$W = \sum_{i=1}^{n_1} R_i, \quad n_1 \le n_2.$$
If the hypothesis on the equality of means of populations A and B is true, i.e. $H_0: \bar{x}_1 = \bar{x}_2$, the mathematical expectation of the Wilcoxon statistic (MW) and the possible deviation of its sample estimate from it depend only on the sample sizes $n_1$ and $n_2$.
Verification of hypotheses on equality of dispersions
The degree of variability of various objects is estimated by the magnitude of the dispersion or of the coefficient of variation of some property; comparing it is necessary for a well-founded use of the analogy method in the process of study.
Fisher's test. Fisher's test is used to verify the hypothesis that two dispersions belong to one population and are therefore equal. The data are assumed to be independent and distributed according to the normal law. The hypothesis on the equality of dispersions is accepted if the ratio of the larger dispersion to the smaller one is less than the critical value of Fisher's distribution:
$$F = \frac{S_1^2}{S_2^2}, \quad F < F_{crit},$$
where $F_{crit}$ depends on the significance level and on the numbers of degrees of freedom of the dispersions in the numerator and the denominator.
The Siegel-Tukey test is the nonparametric analog of Fisher's test. It can be used for any kind of distribution and is insensitive to bad (outlying) values, so it is convenient for solving problems, especially with small samples.
Analysis of sample homogeneity
When one-dimensional statistical models are used to describe the properties of a geological object, it is supposed that this object is homogeneous with respect to the investigated property.
The homogeneity problem is usually solved on the basis of the assumed geoecological model. The statement of object homogeneity is obtained by verifying the hypothesis of its statistical homogeneity, using quantitative data on the variability of its properties.
The problems based on verification of the hypothesis of statistical homogeneity of geoecological objects can be divided into three types:
• identification of bad (outlying) values;
• separation of non-homogeneous samples;
• estimation of the degree of influence of different factors on the variability of the properties of the geological object.
In the case of a normal distribution of the background population, the problem of bad values is solved with the help of the parametric tests of Smirnov and Ferguson.
N.V. Smirnov determined that if the extreme value of the population is not a bad value, the variable
$$t = \frac{x_{max} - \bar{x}}{S_b}$$
has the distribution named after him. In this formula $x_{max}$ is the extreme value of the population; $\bar{x}$ is the arithmetic average; $S_b^2$ is the biased estimate of dispersion, which is calculated from the unbiased estimate of dispersion $S^2$ according to the formula
$$S_b^2 = \frac{n-1}{n} S^2.$$
If the calculated value of the coefficient of skewness exceeds the table value for the given confidence probability and n degrees of freedom, the extreme value of the population should be regarded as a bad value.
If the distribution of the background population differs from normal, naturally occurring large values belonging to the investigated population will all be taken as anomalous. This limits the field of application of both tests: they can be used only if it is known beforehand that the distribution of the background population is normal.
In practice, highly improbable values lying outside $\bar{x} \pm 3\sigma$ or $\bar{x} \pm 2\sigma$ are very often taken as bad values. However, this approach cannot be regarded as correct, since it does not protect against errors of either the first or the second kind, and the probability of these errors cannot be estimated.
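The Smirnov statistic and the informal k-sigma rule can be compared on a small example. This is only a sketch under assumptions: the function names and the data are hypothetical, the k-sigma rule uses the unbiased sample standard deviation in place of σ, and the critical values of the Smirnov distribution are not included and must be taken from tables.

```python
import math
from statistics import mean, variance


def smirnov_statistic(values):
    """t = (x_max - mean) / S_b, where S_b**2 = (n - 1) / n * S**2
    is the biased estimate of dispersion."""
    n = len(values)
    biased_dispersion = (n - 1) / n * variance(values)
    return (max(values) - mean(values)) / math.sqrt(biased_dispersion)


def k_sigma_flags(values, k=3.0):
    """Values farther than k standard deviations from the mean.
    Assumption: sigma is replaced by the sample standard deviation."""
    m, s = mean(values), math.sqrt(variance(values))
    return [v for v in values if abs(v - m) > k * s]


# Hypothetical sample with one suspicious value.
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
print(f"Smirnov t = {smirnov_statistic(sample):.3f}")
print("outside 3 sigma:", k_sigma_flags(sample, k=3.0))
print("outside 2 sigma:", k_sigma_flags(sample, k=2.0))
```

In this example the suspicious value inflates the standard deviation so much that the 3-sigma rule does not flag it, while the 2-sigma rule does. This illustrates why the text calls such rules unreliable.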
Analysis of variance
The properties of any complex natural system usually depend on a number of factors which are responsible for their variability. Identification of these factors and estimation of the degree of their effect on the variability (nonhomogeneity) of the properties of the investigated object are carried out with the help of analysis of variance.
Analysis of variance is intended for problems in which one or several independent factors, each having several gradations (levels), act on the measured random variable. In single-factor, two-factor, etc. analysis the factors affecting the result are considered to be known, and the question is to determine the significance of this effect or to estimate it.
Analysis of variance can be used if we can assume that the selected groups correspond to normal populations and that the distributions of observations in the groups are independent.
In uniform single-factor analysis of variance of a random variable x relative to a factor A having p levels, when the number of measurements on each level is equal to q, the results of observations are designated as $x_{ij}$, where i is the number of the observation (i = 1, 2, ..., q) and j is the number of the factor level (j = 1, 2, ..., p). They are written as a table:
Observation number    Level of factor
                      A1      A2      …    Ap
1                     x11     x12     …    x1p
2                     x21     x22     …    x2p
…                     …       …       …    …
q                     xq1     xq2     …    xqp
Group means           x̄gr1    x̄gr2    …    x̄grp
The following statistics are calculated from these data:
1) the total sum of squared deviations of the observed values of the characteristic from the grand mean $\bar{x}$:
$$C_{total} = \sum_{j=1}^{p}\sum_{i=1}^{q} (x_{ij} - \bar{x})^2;$$
2) the factor sum of squared deviations of the group means from the grand mean, which characterizes the dispersion between the groups:
$$C_{fact} = q\sum_{j=1}^{p} (\bar{x}_{gr\,j} - \bar{x})^2;$$
3) the residual sum of squared deviations of the observed values from their group means, which characterizes the dispersion within the groups:
$$C_{res} = \sum_{i=1}^{q} (x_{i1} - \bar{x}_{gr\,1})^2 + \sum_{i=1}^{q} (x_{i2} - \bar{x}_{gr\,2})^2 + ... + \sum_{i=1}^{q} (x_{ip} - \bar{x}_{gr\,p})^2.$$
In single-factor analysis of variance the calculations can be simplified by using the equality $C_{res} = C_{total} - C_{fact}$ (the residual sum equals the total sum minus the factor sum);
4) the total, factor and residual variances:
$$S_{total}^2 = \frac{C_{total}}{pq - 1}; \quad S_{fact}^2 = \frac{C_{fact}}{p - 1}; \quad S_{res}^2 = \frac{C_{res}}{p(q - 1)};$$
5) the value of Fisher's test:
$$F = \frac{S_{fact}^2}{S_{res}^2}.$$
The value of Fisher's test is compared with the critical one for the chosen significance level α and the numbers of degrees of freedom $k_1 = p - 1$ and $k_2 = p(q - 1)$.
In nonuniform single-factor analysis of variance the number of observations on the level $A_1$ is equal to $q_1$, on the level $A_2$ to $q_2$, ..., on the level $A_p$ to $q_p$. In this case the total sum of squared deviations is calculated using the formula
$$C_{total} = P_1 + P_2 + ... + P_p - \frac{(R_1 + R_2 + ... + R_p)^2}{n},$$
where
$P_1 = \sum_{i=1}^{q_1} x_{i1}^2$ is the sum of squares of the observed characteristic values on the level $A_1$;
$P_2 = \sum_{i=1}^{q_2} x_{i2}^2$ is the sum of squares of the observed characteristic values on the level $A_2$;
$P_p = \sum_{i=1}^{q_p} x_{ip}^2$ is the sum of squares of the observed characteristic values on the level $A_p$;
$R_1 = \sum_{i=1}^{q_1} x_{i1}$, $R_2 = \sum_{i=1}^{q_2} x_{i2}$, …, $R_p = \sum_{i=1}^{q_p} x_{ip}$ are the sums of the observed characteristic values on the levels $A_1$, $A_2$, …, $A_p$ respectively;
$n = q_1 + q_2 + ... + q_p$ is the total number of tests (the sample size).
The factor sum of squared deviations is calculated using the formula
$$C_{fact} = \frac{R_1^2}{q_1} + \frac{R_2^2}{q_2} + ... + \frac{R_p^2}{q_p} - \frac{(R_1 + R_2 + ... + R_p)^2}{n}.$$
The residual sum of squared deviations is calculated using the formula
$$C_{res} = C_{total} - C_{fact}.$$
The rest of the operations are carried out just as in the case of an equal number of tests:
$$S_{total}^2 = \frac{C_{total}}{n - 1}; \quad S_{fact}^2 = \frac{C_{fact}}{p - 1}; \quad S_{res}^2 = \frac{C_{res}}{n - p}.$$
The value of Fisher's test is compared with the critical one for the chosen significance level α and the numbers of degrees of freedom $k_1 = p - 1$ and $k_2 = n - p$.
Questions
1. Enumerate the types of geologo-mathematical models.
2. Point out the steps of solving an ecological task on the basis of a mathematical model.
3. Give the definition of the terms "statistical model" and "deterministic model".
4. Point out the differences between statistical models and spatial models.
5. The basic principles and methods of mathematical modeling.
6. Population and sample. Assessment of statistical parameters for a sample.
7. What requirements are imposed upon sampling data?
8. What is the probability of a random event?
9. What is the distribution law of a random variable?
10. What distribution laws are usually used for modeling of ecological objects and processes?
11. Properties of the normal distribution law.
12. One-dimensional statistical models. Essence and conditions of application.
Basic sources:
1. Aivazyan S.A., Enyukov I.S., Meshalkin L.D. Practical statistics. Basis of modeling and primary data processing. Reference book. – M.: Finance and statistics, 1983. – 472 p.
2. Aivazyan S.A., Enyukov I.S., Meshalkin L.D. Practical statistics: Dependences study: Reference book. – M.: Finance and statistics, 1985. – 182 p.
3. Borovikov V.P. Statistic for students and engineers. – M.: ComputerPress, 200. – 301 p.
4. Van der Waerden B.L. Mathematical statistics. – M.: Foreign Literature Publishing House, 1960. – 302 p.
5. Kazhdan A.B., Gus’kov O.I. Mathematical methods in geology. – M.: Nedra, 1990. – 251 p.
6. Kendall M., Stuart A. Distribution theory. – M.: Nauka, 1966. – 566 p.
7. Kendall M., Stuart A. Statistical conclusions and connections. – M.: Nauka, 1973. – 899 p.
8. Cramér H. Mathematical methods of statistics. – M., 1948. – 631 p.
9. Müller P., Neumann P., Storm R. Mathematical statistics tables. – M., 1982. – 270 p.
10. Fisher R.A., Yates F. Statistical Tables for Biological, Agricultural, and Medical Research. – Edinburgh: Oliver and Boyd, 1953.
LECTURE 2. TWO-DIMENSIONAL AND MULTI-DIMENSIONAL STATISTICAL MODELS. SPATIAL MODELLING.
To model geoecological objects and processes as complex natural systems it is necessary to consider several of their properties together, because the aim of modeling is to clarify the general structure of the studied object.
In some cases the studied properties manifest themselves independently of one another, and in other cases more or less clear interrelations can be seen between them. Sometimes, to explain the nature of the observed dependences, it is necessary to trace a long chain of interrelated processes and phenomena. For example, as a result of thorough statistical processing of data on mining accidents, it was established that their periodicity is connected with the phases of the Moon. At first glance this connection seems extremely strange; it is explained by the influence of the Moon on tidal forces, which appear not only in the hydrosphere but also in the lithosphere and often play the role of a "trigger" for such phenomena as rock bursts, gas outbursts, etc.
The connection between different properties of objects often defies explanation from genetic and cause-and-effect positions, since the observed dependences cannot be linked with geoecological processes.
The study of interdependences between the values of geoecological properties contributes to a deeper understanding of the characteristics of geological processes and to the determination of factors influencing the efficiency of the methods used to investigate geoecological objects. In some cases it makes it possible to obtain quantitative estimates of some properties from the values of other, easily determined properties.
Since the studied interdependences have a statistical character and differ in practice from functional ones, two-dimensional and multi-dimensional statistical models are used to study and describe them.
The determination of correlation relationships between different properties of geoecological objects helps to solve a wide range of problems. Correlation analysis is often used to study geoecological processes and to choose a rational set of research methods.
In other cases the statistical analysis of observation results is preceded by theoretical considerations, and the established correlation relationships are taken into account in the development of deterministic models describing the dependences between geoecological phenomena and the studied physical, chemical, biological and other factors.
The linear correlation coefficient (Pearson), which assumes a normal law of distribution of observations, is widely used to estimate the degree of interrelation.
The correlation coefficient is a parameter characterizing the degree of linear interrelation between two samples. The correlation coefficient varies from -1 (strict inverse linear relationship) to 1 (strict direct proportion). If its value is equal to 0, there is no linear relationship between the two samples.
Here, direct dependence is understood as a dependence in which an increase or decrease in the value of one property leads, respectively, to an increase or decrease in the second property. For example, gas pressure increases when the temperature increases, and decreases when the temperature decreases.
The sample estimate of the correlation coefficient can be calculated according to the formula
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n S_x S_y},$$
where $\bar{x}$ and $\bar{y}$ are the sample estimates of the average values of the random variables X and Y; $S_x$ and $S_y$ are the sample estimates of their standard deviations; n is the number of compared paired values.
For hand calculations the following formula is used:
$$r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right]}}.$$
If, because the amount of data is small, it is impossible to test the hypothesis that the empirical distribution agrees with the normal law, Spearman's rank correlation coefficient can be used instead. Its calculation is based on replacing the sample values of the investigated random variables by their ranks in increasing order. It is assumed that if there is no correlation dependence between the values of the random variables, the ranks of these variables will also be independent.
The expression for calculating the rank correlation coefficient is:

$$ r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, $$

where $d_i$ is the difference between the ranks of the paired values $x_i$ and $y_i$ of the studied variables, and n is the number of pairs in the sample.

If the existence of a correlation between two variables has been established from the sample, its form has been determined and an equation describing it is available, then one of the random variables can be forecast from the values of the other. The solution of such problems is based on constructing empirical regression lines or calculating their analytical expressions, the regression equations. To solve these problems correctly it is necessary not only to estimate the strength of the correlation but also to identify its character. Therefore the approximate check of the linearity hypothesis, made from the shape of the empirical regression line, is usually supplemented by analytical calculations. The analytical check relies on the fact that for a linear relationship the correlation coefficient and the correlation ratio are equal in absolute value. The Fisher test is the appropriate criterion for checking this hypothesis:

$$ F = \frac{(\eta_{y/x}^2 - r^2)(N - m)}{(1 - \eta_{y/x}^2)(m - 2)}, $$

where $\eta_{y/x}$ is the correlation ratio of the characteristic Y over the grouping classes of X; m is the number of grouping classes; N is the number of value pairs (X, Y). The obtained value of F is compared with the tabulated critical value $F_{cr}$ for significance level α with $f_1 = m - 2$ and $f_2 = N - m$ degrees of freedom. The correlation is considered to be nonlinear if $F > F_{cr}$.

Regression analysis. In addition to correlation, regression is considered when investigating interrelations between samples. Regression is used to analyze the effect of one or more independent variables on a single dependent variable. Thus, regression analysis is one more tool for studying stochastic dependences.
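Before regression is treated in detail, the rank correlation coefficient given above can be illustrated with a short sketch. The helper names and the data are hypothetical, and the ranking helper assumes that there are no tied values (ties would require averaged ranks, which this sketch does not handle).

```python
def ranks(values):
    """Replace each value by its rank in increasing order (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """r = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the rank difference of pair i."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Invented sample of five pairs.
xs = [10, 20, 30, 40, 50]
ys = [3, 1, 2, 5, 4]
print(spearman(xs, ys))
```

Because only ranks enter the formula, any monotonically increasing relationship, linear or not, gives a coefficient of 1.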
Regression analysis establishes the form of the dependence between a random (dependent) variable Y and the values of one or several independent variables, the values of the independent variables being regarded as given. Such a dependence is usually described by a mathematical model (the regression equation) that contains some unknown parameters. In the course of regression analysis the estimates of these parameters are determined from the sample data, the statistical errors of the estimates or their confidence limits are found, and the adequacy of the model to the experimental data is checked.

In linear regression analysis the link between the random variables is assumed to be linear. In the simplest case the linear regression model contains two variables, X and Y. From n pairs of observations (X1, Y1), (X2, Y2), ..., (Xn, Yn) it is required to construct a straight line, called the regression line, that best approximates the observed values. The equation of this line, y = ax + b, is the regression equation. With the help of the regression equation the expected value of the dependent variable y corresponding to a value of the independent variable x can be predicted. Thus, regression analysis consists in fitting a graph and its equation to a set of observations.

In regression analysis all variables that enter into the equation must be continuous, not discrete.

When the dependence between one dependent variable Y and several independent variables X1, X2, ..., Xm is considered, we speak of multiple linear regression. In this case the regression equation is

y = a0 + a1x1 + a2x2 + … + amxm,

where a0, a1, a2, …, am are the regression coefficients to be determined.

The coefficient of determination R2 (R-square) is a criterion of the effectiveness of the regression model. It shows how accurately the obtained regression equation approximates the given data.
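The simple linear case and the coefficient of determination can be sketched as follows. The text does not name the fitting criterion; ordinary least squares is assumed here, and the function names and data are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in y = a*x + b (least squares is an assumption)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def r_squared(xs, ys, a, b):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented observations lying close to the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
a, b = fit_line(xs, ys)
print(a, b, r_squared(xs, ys, a, b))
```

An R2 close to 1 indicates that the fitted line reproduces the observations well; a value near 0 indicates that it explains little more than the mean of y does.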
The significance of the regression model is investigated with the help of the F-test (Fisher test). If the F-test is significant (p < 0.05), the regression model is significant.

Whether the coefficients a0, a1, a2, …, am differ reliably from 0 is checked with the help of Student's test. If p > 0.05 for a coefficient, that coefficient can be taken as 0; this means that the influence of the corresponding independent variable on the dependent variable is unreliable, and the variable can be eliminated from the equation.

MULTI-DIMENSIONAL STATISTICAL MODELS

Every phenomenon can be characterized by a set of characteristics that can be determined and observed. Geoecological objects should be considered as systems that depend on a great number of factors and require a multidimensional attribute space for their description. To solve such problems it is necessary to consider the whole complex of studied characteristics together, i.e. to form a multi-dimensional statistical model.

In solving most multidimensional geoecological problems we have to deal with complex combinations of factors that cannot be isolated in pure form and studied independently. The combined study of a complex of interrelated characteristics reveals additional information about the variability of the investigated objects and makes it possible to forecast their unknown properties.

Multidimensional statistical analysis relies on a wide range of methods. Discriminant and cluster analyses are methods of multidimensional classification intended to separate a collection of objects, subjects or phenomena into uniform groups. It must also be taken into account that each object is characterized by a great number of different, stochastically connected characteristics.

The large number of original characteristics describing the functioning of an object makes it necessary to select the most important of them and to study a smaller set of characteristics.
The original characteristics are often transformed in a way that ensures a minimal loss of information. This is achieved by dimension-reduction methods, to which factor analysis belongs. The method takes into account the essentially multidimensional nature of the data and makes it possible to explain multivariate structures concisely and simply. With the help of the obtained factors and principal components we can reveal existing regularities that are not directly observable. This gives the opportunity to describe simply the observed original data, their structure and the character of the correlations between them. Data compression is achieved because the number of factors or principal components (the new units of measurement) is far smaller than the number of original characteristics.

CLUSTER ANALYSIS

The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Cluster analysis thus simply discovers structures in data without explaining why they exist.

We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores, items of a similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations.
There are countless examples in which clustering plays an important role. For instance, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation, the less similar are the members of the respective class. Man has more in common with all other primates (e.g., apes) than with the more "distant" members of the mammals (e.g., dogs), etc.

Methods of cluster analysis are used to partition a set of objects in such a way that all objects belonging to one cluster (class, group) are more similar to each other than to the objects of other clusters. In fact, cluster analysis is not so much an ordinary statistical method as a "set" of various algorithms for distributing objects among clusters. Its methods are used mostly when there are no a priori hypotheses about the structure of the data, that is, at the descriptive stage of an investigation. Tests of statistical significance are not applied in cluster analysis, because the analysis merely finds the most significant partition available. Clustering techniques are used in various fields (Hartigan, 1975). Cluster analysis is very useful whenever a large amount of information has to be classified into groups suitable for further processing.

Cluster analysis relies on two assumptions. The first is that the characteristics of the objects under consideration admit the desired partition of the population into clusters. The second is that the scales or units of measurement of the characteristics are valid, i.e. that the scales are comparable.

Scaling is of great importance in cluster analysis. Consider an example.
Imagine that the values of characteristic x in data set A are two orders of magnitude larger than the values of characteristic y: the variable x ranges from 100 to 700, and the variable y ranges from 0 to 1. Then, when the distance between two points representing the positions of objects in the space of their properties is calculated, the variable with large values (x) will completely dominate the variable with small values (y). Because the units of measurement of the characteristics are not uniform, the distance between two points cannot be calculated in a meaningful way.

This problem is solved by preliminary standardization of the variables. Standardization, or normalization, brings the values of all transformed variables to a common range by relating each value to a quantity that reflects a property of the characteristic itself. There are different ways of normalizing the given data:

$$ z = \frac{x_i - \bar{x}}{\sigma}, \qquad z = \frac{x_i}{\bar{x}}, \qquad z = \frac{x_i}{x_{max}}, \qquad z = \frac{x_i - \bar{x}}{x_{max} - x_{min}}, $$

where $\bar{x}$ and $\sigma$ are the mean and the root-mean-square (standard) deviation, respectively; $x_{max}$ and $x_{min}$ are the largest and the smallest values; $x_i$ is the i-th value of the characteristic.

Along with standardization, each variable can be given a coefficient of importance (weight) that represents the significance of that variable. Expert estimates obtained in the course of an expert survey can serve as weights. The products of the standardized variables and the corresponding weights make it possible to obtain distances between points in multidimensional space that take the weights of the variables into account.

The aim of cluster analysis is to divide a set of elements into groups in such a way that the objects with the highest similarity are combined within a group, while different groups are separated from one another with respect to the given characteristics.
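The scaling problem described above, and the first of the normalization formulas, can be sketched as follows. The data values are invented to match the ranges in the example (x between 100 and 700, y between 0 and 1), and the standard deviation is computed with divisor n, which is an assumption.

```python
import math

def z_scores(values):
    """z = (x_i - mean) / sigma, with sigma computed using divisor n (assumption)."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sigma for v in values]

def euclidean(p, q):
    """Straight-line distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Invented objects: x is on a scale of hundreds, y on a scale of 0..1.
x = [100.0, 400.0, 700.0]
y = [0.0, 1.0, 0.5]

raw = list(zip(x, y))
scaled = list(zip(z_scores(x), z_scores(y)))

# On raw data the distance is almost entirely the difference in x (about 300);
# after standardization both variables contribute on a comparable scale.
print(euclidean(raw[0], raw[1]))
print(euclidean(scaled[0], scaled[1]))
```

Which of the four normalizations is appropriate depends on the data; the z-score form is shown only because it is the first one listed.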
Measures of similarity for cluster analysis can be divided into:
o Similarity measures of the distance type (distance functions). In this case objects are considered the more similar, the smaller the distance between them. That is why some authors call similarity measures of the distance type measures of dissimilarity.
o Similarity measures of the correlation type, called linkage. In this case objects are considered the more similar, the stronger the linkage between them.
o Information statistics.

The numbers assigned to clusters as a result of the calculation have no conceptual meaning; they are needed only to distinguish one cluster from another. The order of the clusters can therefore be chosen as is convenient for the investigator when the results of cluster analysis are used in other methods.

A measure of similarity between the elements of a set is called a metric if it satisfies the following conditions: symmetry, the triangle inequality, discernibility of non-identical objects and indiscernibility of identical objects.

Minkowski metric

The Minkowski metric is the most general of these metrics. Its exponent r can be chosen in the range from 1 to 4. If the exponent equals 2, we obtain the Euclidean distance. The Minkowski distance equals the r-th root of the sum of the absolute differences of the paired values raised to the power r:

$$ \mathrm{distance}(x, y) = \left( \sum_i |x_i - y_i|^r \right)^{1/r} $$

Euclidean metric

This is probably the most commonly chosen type of distance. It is simply the geometric distance in the multidimensional space. The Euclidean distance between two points x and y is the shortest distance between them. In a two- or three-dimensional space this measure is the length of the straight line linking the given points.
If r = 2 in the Minkowski metric, we obtain the standard Euclidean distance (Euclidean metric):

$$ \mathrm{distance}(x, y) = \left( \sum_i (x_i - y_i)^2 \right)^{1/2} $$

Squared Euclidean metric (Squared Euclidean distance)

You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. Due to the squaring, large differences count for more in the calculation:

$$ \mathrm{distance}(x, y) = \sum_i (x_i - y_i)^2 $$

Manhattan distance

There are also alternative distance measures. The city-block distance uses the sum of the variables' absolute differences. This is often called the Manhattan metric, as it is akin to the walking distance between two points in a city like New York's Manhattan district, where the distance equals the number of blocks in the directions North-South and East-West. If r = 1, the Minkowski metric gives the Manhattan distance:

$$ \mathrm{distance}(x, y) = \sum_i |x_i - y_i| $$

Chebychev distance

This distance measure may be appropriate in cases when we want to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is the maximum of the absolute differences in the clustering variables' values:

$$ \mathrm{distance}(x, y) = \max_i |x_i - y_i| $$

Power distance

Sometimes we may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance, which is computed as:

$$ \mathrm{distance}(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/r}, $$

where r and p are user-defined parameters. Parameter p controls the progressive weight that is placed on differences on individual dimensions; parameter r controls the progressive weight that is placed on larger differences between objects. If r and p are both equal to 2, this distance is equal to the Euclidean distance.
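A few example calculations show how these measures behave. The sketch below implements the distance formulas listed above; the function names are chosen for illustration, and points are plain tuples of coordinates.

```python
def minkowski(x, y, r):
    """(sum |x_i - y_i|^r)^(1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def euclidean(x, y):
    """Minkowski distance with r = 2."""
    return minkowski(x, y, 2)

def squared_euclidean(x, y):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """City-block distance: Minkowski distance with r = 1."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def power_distance(x, y, p, r):
    """(sum |x_i - y_i|^p)^(1/r) with user-defined p and r."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))             # 5.0
print(squared_euclidean(a, b))     # 25.0
print(manhattan(a, b))             # 7.0
print(chebychev(a, b))             # 4.0
print(power_distance(a, b, 2, 2))  # 5.0, same as the Euclidean distance
```

For the same pair of points the measures rank as Chebychev ≤ Euclidean ≤ Manhattan, which is why the choice of distance changes which objects appear "close" in a cluster analysis.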
Percent disagreement

This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as the share of dimensions on which the two objects differ:

$$ \mathrm{distance}(x, y) = \frac{\text{number of } x_i \neq y_i}{\text{number of dimensions}} $$

AMALGAMATION OR LINKAGE RULES

At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, how do we determine the distances between those new clusters? In other words, we need a linkage or amalgamation rule to determine when two clusters are sufficiently similar to be linked together. There are various possibilities: for example, we could link two clusters together when any two objects in the two clusters are closer together than the respective linkage distance. Put another way, we use the "nearest neighbors" across clusters to determine the distances between clusters; this method is called single linkage. This rule produces "stringy" types of clusters, that is, clusters "chained together" by only single objects that happen to be close together. Alternatively, we may use the neighbors across clusters that are furthest away from each other. This method is called complete linkage. Numerous other linkage rules of this kind have been proposed.

The following methods of cluster analysis are considered:
Hierarchical methods:
o single linkage (nearest neighbour),
o average linkage method (King),
o Ward's method.
Iterative methods of grouping:
o K-means method (MacQueen).
Algorithms:
o method of correlation Pleiades (Terentjev),
o Wroclaw taxonomy.

1) SINGLE LINKAGE (NEAREST NEIGHBOUR)

This method is the simplest of the hierarchical agglomerative methods of cluster analysis. In this method the distance between two clusters is determined by the distance between the two closest objects (nearest neighbours) in the different clusters.
Each object is initially placed in a separate cluster, and at each step we merge the closest pair of clusters, until certain termination conditions are satisfied. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."

2) AVERAGE LINKAGE METHOD (KING)

This method is similar to single linkage (nearest neighbour). In this method the distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair comes from each of the clusters. The method thus uses information on all pairs of distances, not merely the minimum or maximum distances. For this reason, it is usually preferred to the single and complete linkage methods.

3) WARD'S METHOD

Another commonly used approach in hierarchical clustering is Ward's method. This method is distinct from all other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the sum of squares of any two clusters that can be formed at each step. In general, this method is regarded as very efficient; however, it tends to create clusters of small size.

4) K-MEANS METHOD (MACQUEEN)

K-means (MacQueen) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results. The better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an initial grouping is done.
At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes are made. In other words, the centroids do not move any more.

5) METHOD OF CORRELATION PLEIADES

The concept of correlation Pleiades was originally advanced by Terentjev. The results of classification can be presented visually in the shape of a cylinder cut by planes that are perpendicular to its axis. The planes correspond to levels of linkage (from 0 to 1 with a step of 0.1). The parameters or objects being classified are combined on these levels. Therefore, this method resembles single linkage (nearest neighbour), but with union at fixed levels. The results of classification are displayed graphically as circles, the sections (Pleiades) of the correlation cylinder mentioned above. The classified objects are marked on the circles. Linkages between classified objects are shown by chords connecting the points of the circle that correspond to the objects.

VISUALIZING RESULTS OF CLUSTER ANALYSIS

When carrying out a hierarchical cluster analysis, the process can be represented on a diagram known as a dendrogram. This diagram illustrates which clusters have been joined at each stage of the analysis and the distance between clusters at the time of joining. If there is a large jump in the distance between clusters from one stage to another, this suggests that at one stage clusters that are relatively close together were joined whereas, at the following stage, the clusters that were joined were relatively far apart. This implies that the optimum number of clusters may be the number present just before that large jump in distance.
This is easier to understand by actually looking at a dendrogram. After carrying out a classification it is recommended to visualize the results of clustering by constructing a dendrogram. Suppose that the results of classification, in the form of linkage levels for pairs of objects, have been obtained by applying one of the hierarchical methods. The idea of the dendrogram construction is evident: pairs of objects are linked according to the level of linkage plotted on the ordinate (Fig. 2.1).

Fig. 2.1. Dendrogram of hierarchical method (ordinate: linkage distance; abscissa: objects 1 to 6)

Consider a horizontal hierarchical tree plot. We begin with each object in a class by itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster. As a result we link more and more objects together and aggregate (amalgamate) larger and larger clusters of increasingly dissimilar elements. Finally, in the last step, all objects are joined together. In these plots, the horizontal axis denotes the linkage distance (in vertical icicle plots, the vertical axis denotes the linkage distance). Thus, for each node in the graph (where a new cluster is formed) we can read off the criterion distance at which the respective elements were linked together into a new single cluster. When the data contain a clear "structure" in terms of clusters of objects that are similar to each other, this structure will often be reflected in the hierarchical tree as distinct branches. As the result of a successful analysis with the joining method, we are able to detect clusters (branches) and interpret those branches.
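The joining process that a dendrogram records can be sketched with a naive agglomerative procedure. This is an illustrative sketch, not an efficient implementation: the function names are hypothetical, Euclidean distance is assumed, and only the single, complete and average linkage rules described earlier are covered. The list of merges it returns (which clusters were joined and at what linkage distance) is exactly the information plotted in a tree diagram.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices) under a linkage rule."""
    pair_distances = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # nearest neighbours
        return min(pair_distances)
    if linkage == "complete":    # furthest neighbours
        return max(pair_distances)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, linkage="single"):
    """Start with one cluster per object; repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Invented data: two tight pairs of points far from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
for left, right, distance in agglomerate(points, "single"):
    print(left, right, round(distance, 3))
```

For these points the linkage distances are 1, 1 and 5: the large jump before the last merge suggests that two clusters describe the data best.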
The symbols of the investigated objects (vectors of the matrix) are placed on the abscissa axis, and the lowest values of the distance coefficient corresponding to each step of the classifying procedure are placed on the ordinate axis. The ordinate axis is thus used for a scaled presentation of the levels of hierarchical grouping. The clarity and conceptual value of tree graphs increase if they show not only the closeness of intragroup linkages but also the between-group distances h. Such a tree graph, which takes into account not only intragroup distances but also mean between-group distances, is called a dendrograph.

K-MEANS METHOD

This method works with the objects themselves rather than with a matrix of similarity. In the k-means method an object belongs to the class to which its distance is minimal. The distance is understood as the Euclidean distance, i.e. the objects are considered to be points of Euclidean space. The distance between an object and a class is the distance between the object and the center of the class. Each class of objects has a centroid: the mean values of the parameters are taken as the centre of the class. Once this is defined, the distance between an object and a group of objects can be determined and the algorithm can work. Imagine that a group consists of 2 objects. Join these points with a line segment and find its midpoint. This will be the centroid of the group of two points. The distance from this centroid to a given point is the desired distance.

The k-means method "works" as follows:
1) an initial partition of the data into clusters is given (the number of clusters is determined by the user), and the centroids of the clusters are computed;
2) points are moved: each point is placed in the nearest cluster;
3) the centroids of the new clusters are computed;
4) steps 2 and 3 are repeated until a stable configuration is found (i.e. the clusters stop changing) or the number of iterations reaches the limit given by the user.
The resulting configuration is the desired one.
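Steps 1 to 4 above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the starting centroids are passed in directly instead of being computed from an initial partition, and a cluster that loses all its points simply keeps its previous centroid.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Centroid of a group: the mean value of every coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: place each point in the cluster with the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        # Step 4: stop when the clusters no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute the centroid of every non-empty cluster.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, assignment) if label == k]
            if members:
                centroids[k] = mean_point(members)
    return assignment, centroids

# Invented data and starting centroids.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels, centers = k_means(points, [(0.0, 0.0), (5.0, 0.0)])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(0.0, 0.5), (5.0, 0.5)]
```

With two points per cluster, each final centroid is the midpoint of the segment joining them, as in the two-point example in the text. Different starting centroids may lead to a different final partition.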
Usually, as the result of a K-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain very different means for most, if not all, dimensions used in the analysis. The magnitude of the F values from the analysis of variance performed on each dimension is another indication of how well the respective dimension discriminates between clusters.

APPLICATION OF CLUSTER ANALYSIS:
1) development of a typology or classification;
2) investigation of useful conceptual schemes for grouping objects;
3) generation of hypotheses on the basis of data investigation;
4) testing of hypotheses, or investigations to determine whether types (groups) identified by some other means are actually present in the data.

FACTOR ANALYSIS

Factor analysis is a statistical method used to study the dimensionality of a set of variables. In factor analysis, latent variables represent unobserved constructs and are referred to as factors or dimensions. Factor analysis is designed for interval data, although it can also be used for ordinal data. The variables used in factor analysis should be linearly related to each other. This can be checked by looking at scatterplots of pairs of variables. The variables must also be at least moderately correlated with each other; otherwise the number of factors will be almost the same as the number of original variables, which means that carrying out a factor analysis would be pointless.

The factor analysis model can be written algebraically as follows. If you have p variables X1, X2, ..., Xp measured on a sample of n subjects, then variable i can be written as a linear combination of m factors F1, F2, ..., Fm, where m < p:

$$ X_i = a_{i1} F_1 + a_{i2} F_2 + \dots + a_{im} F_m + e_i, $$

where the $a_{ij}$ are the factor loadings and $e_i$ is the part of variable $X_i$ that is not explained by the factors.

The main goals of factor analysis are:
1. reduction of the number of variables (data reduction);
2. classification of variables.
So factor analysis is used either as a method of data reduction or as a method of classification.
Thus, factor analysis makes it possible to identify dependences between phenomena, to discover the latent basis of some phenomena, and to answer the question of why these phenomena are related.

As a method of statistical investigation, factor analysis includes the following stages:
1) Formulation of the goal. Goals can be:
a) research (identifying factors and analyzing them);
b) applied (construction of aggregate characteristics for forecasting and management).
2) Choice of the set of characteristics and objects.
3) Generation of the initial factor structure.
4) Correction of the factor structure on the basis of the goals of the investigation.
5) Identification of second-order factors. Second-order factors are more general categories of the investigated phenomenon.
6) Interpretation and use of the results.

The main ideas of factor analysis are as follows. The correlation matrix, constructed with the use of Pearson's correlation coefficient, is the main object of investigation. The basic requirement on the constructed matrix is positive semidefiniteness. A Hermitian matrix is called positive semidefinite if all its principal minors are non-negative. Non-negativeness of all eigenvalues follows from this property.

The correlation coefficients constituting the correlation matrix are by default calculated between parameters (characteristics, tests), not between objects (individuals, persons). So the dimension of the correlation matrix is equal to the number of parameters. This is the so-called R technique. The correlation between objects can also be studied; this approach is called the Q technique. There is also the P technique, in which the investigations are carried out on the same individual at different periods of time, and the correlation between the states of the individual is studied.

All methods of factor analysis rest on the hypothesis that the studied dependence is linear in character. The main requirement on the given data is that they should follow a multidimensional normal distribution.
Reduction of the correlation matrix is the process of replacing the units on its main diagonal by values called generalities (communalities). The generality is the sum of squares of the factor loadings. The generality of a given variable is that part of its variance (dispersion) which is due to the general factors. This follows from the fact that the total variance is composed of the general variance, due to factors common to all variables, the specific variance, due to factors specific only to the given variable, and the variance due to error.

The aim of factor analysis is to obtain the matrix of factor mapping. Its rows are the coordinates of the endpoints of the vectors corresponding to the m variables in the r-dimensional factor space. The proximity of the endpoints of these vectors gives a rough idea of the mutual dependence of the variables. Each vector holds, in compressed form, information about the process. In addition, if the number of selected factors is greater than one, the matrix of factor mapping is rotated; the aim of this is to obtain a so-called simple structure.

For illustrative purposes the results can be represented graphically, but this is difficult to do if three or more factors have been selected. So the r-dimensional factor space is usually displayed in two-dimensional sections.

When solving a factor analysis problem you should be prepared for the possibility of failing to solve it. This is caused by the complexity of the eigenvalue problem for the correlation matrix. For example, the correlation matrix can turn out to be degenerate because parameters coincide or are completely linearly correlated. For high-order matrices, numerical underflow can occur during the calculations. That is why, in theory, we cannot exclude the situation in which the methods of factor analysis are not applicable, at least until the given data have been successfully "corrected". Corrected data can be obtained as follows.
Identify the linearly dependent parameters with the help of, for example, the method of correlation Pleiades (other methods can also be applied) and leave in the given data only one parameter from each group of linearly dependent parameters.

METHOD OF PRINCIPAL COMPONENTS

The study of objects becomes rather difficult as the dimensionality of the attribute space increases. The problem arises of replacing numerous observable characteristics by a smaller number of them without loss of useful information. The method of principal components is one of the most widely used methods for solving this problem.

The basis of the method of principal components is the linear transformation of m given variables (characteristics) into m new variables, each of which is a linear combination of the given ones. In the process of transformation the vectors of the observable variables are replaced by new vectors (principal components), which make different contributions to the total variance (dispersion) of the multidimensional characteristics.

Principal components are eigenvectors of the covariance matrix of the given characteristics. The number of eigenvectors of the covariance matrix is determined by the number of studied characteristics, i.e. it is equal to the number of its columns (or rows). Each eigenvector (principal component) is characterized by an eigenvalue and by coordinates. If the covariance matrix is used, the variables remain in their original metric.

The eigenvalues of the covariance matrix (λj) characterize the lengths of its eigenvectors, i.e. the variances of the corresponding components. The sum of the eigenvalues of the covariance matrix is equal to its trace, i.e. the sum of its diagonal elements.

The coordinates of an eigenvector of the covariance matrix (ωij) are numerical coefficients characterizing its position in the m-dimensional attribute space. The number of coordinates of each eigenvector, ω1, ω2, ..., ωm, is determined by the dimension of the space, and their numerical values are the coefficients of the linear equation of the eigenvector.
The eigenvalues of the covariance matrix are found as the roots of its characteristic polynomial equation. This is rather difficult to do for large values of m, so in computational practice they are determined by matrix transformation methods, which can be carried out only with the help of a computer. Methods for finding the coordinates of the eigenvectors of a symmetric matrix are also laborious and require a computer. Since the covariance matrices of the given characteristics are symmetric, their eigenvectors are always orthogonal, and the new variables are mutually uncorrelated.
In the method of principal components the coordinates of the eigenvectors are regarded as the loadings of the variables on one factor or another. They are used to calculate the matrix of new variables by projecting the vectors of given data (characteristics x1, x2, …, xm) onto the eigenvector axes (γ1, γ2, …, γm):
$$\gamma_j = \sum_{i=1}^{m} \omega_{ji} x_i , \qquad (1)$$
where $\omega_{ji}$ is the loading of the j-th component on the i-th characteristic.
With the help of formula (1) the original n × m matrix of observed characteristics is recalculated into a matrix of new variables of the same dimension, taking into account the eigenvalue of each component. If the statistical (correlation) relationships between the observed characteristics of the multidimensional space are clearly expressed, decomposition of the original matrix of observations into m new components makes the distribution of variance among the new components more contrasting than among the original characteristics. As a rule, the variance of one principal component amounts to half or more of the total variance of the characteristics, and together with the variances of the next one or two components their joint contribution to the total variance exceeds 90%. Therefore the dimension of the space of observed characteristics can be reduced noticeably (to p ≤ m) without loss of information on the variability of the observed characteristics. In this case we can restrict ourselves to the data for the two or three most informative principal components.
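The choice of p from the share of total variance can be sketched as follows. The 90% threshold comes from the text above; the function name and the eigenvalues in the example are hypothetical and serve only to illustrate the arithmetic.

```python
def components_to_keep(eigenvalues, threshold=0.90):
    """Return (p, shares): the smallest number of leading components whose
    cumulative share of the total variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    shares = [lam / total for lam in ordered]
    cumulative = 0.0
    for p, share in enumerate(shares, start=1):
        cumulative += share
        if cumulative >= threshold:
            return p, shares
    return len(ordered), shares

# Hypothetical eigenvalues of a covariance matrix for m = 5 characteristics
example_eigenvalues = [5.2, 3.1, 0.9, 0.5, 0.3]
p, shares = components_to_keep(example_eigenvalues)
print("components kept:", p)
print("variance shares:", [round(s, 2) for s in shares])
```

With these example values the first component carries about half of the total variance and three components together pass the 90% mark, so the five-dimensional attribute space would be reduced to p = 3.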
Hence, for the purposes of geoecological analysis, the n × p matrix of principal components (where p, as a rule, does not exceed 2 or 3) can be used instead of the original n × m matrix. Since the new variables in this matrix are uncorrelated, the method of principal components can be regarded as a powerful tool for determining the number of linearly independent vectors contained in the original matrix.
Let us consider the method of principal components in more detail, as a variant of the principal factor method. The basic model of principal components is written in matrix form as
$$Z = AP,$$
where Z is the matrix of standardized given data, A is the factor mapping and P is the matrix of factor values. The order of matrix Z is m × n, the order of matrix A is m × r, and the order of matrix P is r × n, where m is the number of variables (data vectors), n is the number of individuals (elements of one vector) and r is the number of extracted factors. As can be seen from the expression above, the model of component analysis includes only factors common to these vectors.
The matrix of standardized data is obtained from the matrix of given data Y (of order m × n) according to the formula
$$z_{ij} = \frac{y_{ij} - \bar{y}_i}{s_i}, \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, n,$$
where $y_{ij}$ is an element of the matrix of given data, $\bar{y}_i$ is the mean value and $s_i$ is the standard deviation of the i-th variable.
To calculate the correlation matrix, the basic element of factor analysis, we use the simple relation
$$R = \frac{1}{n-1} Z Z',$$
where R is the correlation matrix of order m × m and ' is the transposition symbol. The values on the main diagonal are equal to 1. These values are called communalities and are designated $h_i^2$; they are a measure of the full variance of a variable.
Matrices A and P are unknown. Matrix A can be found from the fundamental theorem of factor analysis
$$R = A C A',$$
where C is the correlation matrix showing the relationships between factors. If C = I, the factors are called orthogonal; if C ≠ I, they are called oblique. Here I is the identity matrix.
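The standardization formula and the relation R = ZZ'/(n − 1) can be illustrated with a short Python sketch. It assumes, as in the text, that the data matrix Y has m rows (variables) and n columns (individuals); the function names and the numbers in Y are hypothetical.

```python
import math

def standardize(Y):
    """Standardize each row (variable) of Y: subtract the row mean and
    divide by the row standard deviation (divisor n - 1)."""
    Z = []
    for row in Y:
        n = len(row)
        mean = sum(row) / n
        s = math.sqrt(sum((y - mean) ** 2 for y in row) / (n - 1))
        Z.append([(y - mean) / s for y in row])
    return Z

def correlation_matrix(Z):
    """R = Z Z' / (n - 1) for an m x n matrix of standardized data."""
    m, n = len(Z), len(Z[0])
    return [[sum(Z[i][k] * Z[j][k] for k in range(n)) / (n - 1)
             for j in range(m)]
            for i in range(m)]

# Hypothetical data: m = 3 variables measured on n = 4 individuals
Y = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # exactly proportional to the first variable
    [4.0, 3.0, 2.0, 1.0],   # exactly decreasing with the first variable
]
R = correlation_matrix(standardize(Y))
for row in R:
    print([round(r, 3) for r in row])
```

In this example the main diagonal of R consists of units, and the off-diagonal elements equal +1 or −1 because the three variables are completely linearly correlated. Such a matrix is degenerate, which is exactly the situation in which factor analysis cannot be applied without first correcting the data.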
For matrix C the following relation holds:
$$C = \frac{1}{n-1} P P'.$$
We consider only the case of orthogonal factors, for which
$$R = A A'.$$
The model of classical factor analysis includes a number of common factors and one specific factor for each variable. The first formula of this section, Z = AP, is the main model of factor analysis for the method of principal components. The number of principal components is always less than or equal to the number of variables.
PROBLEM OF ROTATION
The coordinate axes corresponding to the extracted factors are orthogonal, and their directions are established sequentially according to the maximum of the remaining variance. However, coordinate axes obtained in this way usually have no meaningful interpretation. Therefore a more suitable position of the coordinate system is sought by rotating it about its origin. The arrangement of the vectors remains unchanged under this procedure. The aim of rotation is to find one of the possible coordinate systems that yields a so-called simple factor structure. The popular VARIMAX rotation method is applied.
CRITERIA OF FACTOR MAXIMUM NUMBER
There are several criteria for assessing the maximum number of factors to retain. Criteria based on analysis of the determinants of the original and reproduced correlation matrices are not stable. Criteria based on the magnitude of the eigenvalues of the correlation matrix ultimately reduce to an analysis of the percentage of variance extracted by the factors. All common factors taken together, whose number is equal to the number of parameters, extract 100% of the variance. If the sum of the percentages exceeds 100%, negative eigenvalues (and, as a result, complex eigenvectors) were obtained when calculating the eigenvalues of the correlation matrix, which can indicate an incorrect reduction of the original correlation matrix.
Kaiser criterion: only the factors that correspond to eigenvalues of the correlation matrix greater than 1 are retained.
Scree test: all factors whose eigenvalues differ little from one another should be discarded.
The scree test is a graphical method first proposed by Cattell (Cattell, 1966). Cattell suggested finding the place on the graph where the decrease of the eigenvalues from left to right slows down the most. Only "factorial scree" is assumed to lie to the right of this point. "Scree" is a geological term: scree, or talus, is an accumulation of broken rock fragments at the base of mountain cliffs. According to this criterion, 2 or 3 factors can be retained in this example.
Application area of multidimensional statistical models in geoecology
The possibilities for applying multidimensional statistical models to the study of dependencies between sets of different geoecological characteristics are practically unlimited in any branch of geoecology. Correlation methods of paragenetic analysis of chemical elements are widely used in ecological geochemistry.
Multidimensional statistical descriptions of the relationships between geoecological variables, followed by assessment of their interdependence, are used in geoecological practice to identify, discriminate and classify the studied objects or to search for more informative combinations of characteristics for solving predictive problems.
Classification of geoecological objects, for example hierarchical grouping of associations of chemical elements according to complete chemical analysis data, is carried out with the help of cluster analysis, other methods of multidimensional correlation analysis, or factor analysis.
Prediction of various properties of the studied geoecological objects is the ultimate objective of most multidimensional statistical methods. Depending on the type of given data and the objectives of the geoecological investigation, different multidimensional models are used to form the corresponding algorithms. As a rule, the problem then arises of finding more informative combinations of characteristics and of reducing the dimension of their space.
This is done with the help of the principal component method, the R-method of factor analysis, or other logical and heuristic methods.
MODELING OF SPATIAL VARIABLES
When studying geoecological objects and processes, the investigator is interested not only in the average characteristics of variability and the interrelationships of the observed properties, but also in the regularities of their spatial changes. Statistical models are not suitable for this purpose, since any statistical characteristic shows only the average level of variability of the studied property, irrespective of the spatial location of the observation points, while the regularities of their spatial arrangement can be different. Moreover, statistical characteristics give objective estimates of the level of variability only when the sample data represent a population of independent random variables.
For mathematical modeling of the regularities in the spatial distribution of the properties of geoecological formations, their characteristics are considered not as random variables but as spatial variables, which possess a number of specific properties such as regularity, domain of existence and domain of definition. Their populations form fields of spatial variables, within which the location of each variable is determined by its spatial coordinates.
Geometrical and analytical methods of modeling the fields of geological, geochemical, geophysical and other spatial variables help to separate and quantitatively describe the tendencies observed in the change of the properties of the investigated objects. In some cases they allow us to identify new, previously unknown regularities.
The results of geoecological mapping, geochemical and schlich (heavy-mineral concentrate) surveys and geophysical observations are used for modeling. Spatial regularities in the changes of geophysical fields are widely used in geological mapping.
Mathematical modeling of geochemical and geophysical fields allows us to identify anomalies more reliably.
Geoecological objects as fields of spatial variables
A field of a spatial variable is a spatial domain in which each point can be assigned some value of the studied variable. Such a spatial domain can be considered a geoecological field, in which a certain value of the studied geoecological characteristic corresponds to each element of the domain.
Depending on the nature of the modeled characteristics, geophysical, geochemical, morphometric and other geological fields are distinguished; according to the dimension of the studied space they are divided into one-dimensional, two-dimensional, three-dimensional and multi-dimensional.
Continuous and discrete geological spatial variables. According to their domain of existence, geological spatial variables are divided into continuous and discrete. Continuous spatial variables express properties of the object that are present at any point of the field, i.e. over the entire area of the investigated territory. The concentrations of chemical elements in rocks, their physical properties, the thickness of the studied geological bodies and many other properties of rocks and ores belong to these variables. Spatially confined geological formations, whose domains of existence are negligibly small in comparison with the investigated areas, belong to discrete spatial variables. They are represented by specific geological bodies (for example, certain facies), mineral deposits, phenocrysts of some minerals or mineral aggregates in rocks, etc.
Scalar and vector fields. According to their dimensionality, scalar and vector geological fields are distinguished. The majority of the studied geological variables are scalar. To specify a scalar value it is necessary to know its magnitude and sign. Populations of such variables form scalar geological fields. Vector spatial variables are used more rarely in geological practice.
To specify a vector spatial variable it is necessary to know both its magnitude and its direction. A vector random field can be modeled as vectors oriented in real two- or three-dimensional space (for example, magnetic fields) or as sets of different scalar variables (for example, the contents of several chemical elements at each point). If derivatives of the initial values, i.e. gradients of geological fields, are studied instead of the values themselves, many scalar fields can be transformed into vector fields.
Background, anomalies and trend surface
The additive random field is the most widely used model of a continuous scalar geological field. In this case the values of a continuous scalar variable $u$ are given on the $(x, y)$ plane and are described by the additive scalar field
$$u = f(x, y) + \varepsilon,$$
where $f(x, y) = \hat{u}$ is a function of the coordinates and ε is a random variable. The task of modeling is to estimate the function f(x, y) under known assumptions about ε, and to describe the random part ε under certain assumptions about f(x, y).
The main problem in the study of spatial regularities is to describe the non-random component of the field, which shows the level of its values typical of particular parts of the examined territory.
The non-random component characterizing the main part of the modeled geological field is called its background. The background part of the field identifies the areas of relatively increased and decreased values of the studied characteristic, and it contains useful geological information about the nature of the studied geological object. To distinguish the background, the basic properties of the field must be generalized and more or less significant local deviations must be removed. In each specific case the deviations from the background are considered anomalous.
Trend surface analysis is the most widely used global surface-fitting procedure.
The mapped data are approximated by a polynomial expansion of the geographic coordinates of the control points, and the coefficients of the polynomial function are found by the method of least squares, ensuring that the sum of the squared deviations from the trend surface is a minimum. Each original observation is considered to be the sum of a deterministic polynomial function of the geographic coordinates plus a random error.
The polynomial can be expanded to any desired degree, although there are computational limits because of rounding error. The unknown coefficients are found by solving a set of simultaneous linear equations which include the sums of powers and cross products of the X, Y, and Z values. Once the coefficients have been estimated, the polynomial function can be evaluated at any point within the map area. It is a simple matter to create a grid matrix of values by substituting the coordinates of the grid nodes into the polynomial and calculating an estimate of the surface for each node. Because of the least-squares fitting procedure, no other polynomial equation of the same degree can provide a better approximation of the data.
Two different methodological approaches are used for trend analysis in geological practice:
1) smoothing of the given data by moving statistical windows;
2) approximation of fields by a single function of the spatial coordinates (orthogonal polynomials, etc.).
The methods of moving averages are universal, and they give better estimates of the average parameters of spatially confined parts of geological fields than the polynomial trend-analysis method. These data are used to identify regional geological regularities.
The results of trend analysis of geological fields are affected by the relative nature of the regular and random components of the variability of the observed characteristics.
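The least-squares fitting of a trend surface described above can be sketched for the simplest, first-degree polynomial (a plane). The sketch below is an illustration under stated assumptions, not a prescribed implementation: the mapped value is denoted u, the coefficients b0, b1, b2, and the function names and control points are hypothetical. The sums of powers and cross products are collected into a system of three linear equations, which is then solved by Gaussian elimination.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("system is singular (e.g. collinear points)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_linear_trend(points):
    """Fit u = b0 + b1*x + b2*y to (x, y, u) control points by least squares."""
    n = len(points)
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sxx = sum(x * x for x, _, _ in points)
    syy = sum(y * y for _, y, _ in points)
    sxy = sum(x * y for x, y, _ in points)
    su = sum(u for _, _, u in points)
    sxu = sum(x * u for x, _, u in points)
    syu = sum(y * u for _, y, u in points)
    A = [[n, sx, sy], [sx, sxx, sxy], [sy, sxy, syy]]
    return solve_linear_system(A, [su, sxu, syu])

def residuals(points, coeffs):
    """Deviations of the observations from the trend surface."""
    b0, b1, b2 = coeffs
    return [u - (b0 + b1 * x + b2 * y) for x, y, u in points]

# Hypothetical control points lying exactly on the plane u = 2 + 0.5x - 1.5y
control_points = [(x, y, 2.0 + 0.5 * x - 1.5 * y)
                  for x, y in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]]
coefficients = fit_linear_trend(control_points)
deviations = residuals(control_points, coefficients)
print("coefficients:", [round(c, 6) for c in coefficients])
```

Because the example points lie exactly on a plane, the fitted coefficients reproduce that plane and all deviations are close to zero. With real data the deviations are the candidates for anomalies, and the fitted plane plays the role of the background.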
In this connection, depending on the scope, goals and objectives of the investigation, the background is understood as a trend surface of a particular degree of smoothness, and anomalies are understood as any deviations from the background that exceed the reference surface.
Separating regional regularities by approximating empirical data with a function of the spatial coordinates involves rather laborious calculations and requires the use of computers. Orthogonal polynomials of different degrees, the Laplace equation, trigonometric polynomials, etc. are used as approximating functions. Orthogonal polynomials are usually used in the case of a uniform rectangular network.
In the simplest case the trend is defined as a linear function of the geographic coordinates, constructed from the set of observations in such a way that the sum of squared deviations of the characteristic values from the plane is minimal. A model of this kind is a variant of the statistical method of multiple regression, where the function $f(x, y) = \hat{u}$ describing the trend surface is taken as
$$\hat{u} = \beta_0 + \beta_1 x + \beta_2 y$$
(where x and y are spatial coordinates, and β0, β1 and β2 are polynomial coefficients).
The following equations are used to estimate the three coefficients:
$$\sum u = \beta_0 n + \beta_1 \sum x + \beta_2 \sum y;$$
$$\sum xu = \beta_0 \sum x + \beta_1 \sum x^2 + \beta_2 \sum xy;$$
$$\sum yu = \beta_0 \sum y + \beta_1 \sum xy + \beta_2 \sum y^2;$$
where n is the number of observation points; u is the value of the characteristic at the observation points; x and y are the coordinates of the observation points.
To solve the equations, they are written in matrix form:
$$\begin{pmatrix} n & \sum x & \sum y \\ \sum x & \sum x^2 & \sum xy \\ \sum y & \sum xy & \sum y^2 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \sum u \\ \sum xu \\ \sum yu \end{pmatrix}$$
and solved for β0, β1 and β2.
This method of estimating the model coefficients is called the least squares method.
Questions
1. Multidimensional statistical models. Essence and conditions of application.
2. What statistical method is used to solve the problem of similarity and difference of sample means?
3. What statistical method is used to divide samples into maximally different groups?
4. What statistical method is used to identify the correlation between variables?
5. What is linear regression? What is nonlinear regression?
6. Characterize spatial models.
7. What is the field of a spatial variable?
Basic sources:
1. Aivazyan S.A., Enyukov I.S., Meshalkin L.D. Practical statistics. Basis of modeling and primary data processing. Reference book. – M.: Finance and statistics, 1983. – 472 p.
2. Aivazyan S.A., Enyukov I.S., Meshalkin L.D. Practical statistics: Dependences study: Reference book. – M.: Finance and statistics, 1985. – 182 p.
3. Borovikov V.P. Statistic for students and engineers. – M.: ComputerPress, 200. – 301 p.
4. Dreiper N., Smith G. Application regression analysis. Book 1. – M., 1986. – 365 p.
5. Dreiper N., Smith G. Application regression analysis. Book 2. – M., 1987. – 349 p.
6. Kazhdan A.B., Gus’kov O.I. Mathematical methods in geology. – M.: Nedra, 1990. – 251 p.
7. Kendell M., Stewart A. Statistical conclusions and connections. – M.: Nauka, 1973. – 899 p.
8. Kim Dg.O., Muller Ch.Y., Klekka Y.P. et al. Factor, discriminant and cluster analysis. – M.: Finance and statistics, 1989. – 215 p.
9. Kramer G. Mathematical methods of statistics. – M., 1948. – 631 p.
10. Breiman L., Friedman J.H., Olshen R.A., Stone C.J. Classification and regression trees. – Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984. – 358 p.
LECTURE 3. INTRODUCTION
Definition and content of the concept of GIS. GIS history. Relationship between GIS and basic courses. Relevance of GIS use in the processing and representation of ecological information. Characteristics of the main functions of GIS (information collection and processing, modeling and analysis, use of data in decision-making). The main classifications of GIS (Bracken, Webster, 1990; Koshkarev, Karyakin, 1987) and their characteristics. Academic literature and educational material, periodical literature and reference materials.
Geographic information systems (GIS) are a new and fast-growing research area at the interface between computer technologies and Earth sciences. They are used in many fields of human activity, from interactive maps on the Internet and satellite communication devices to mining field development programs.
Geographic information system (State Standard GOST R 52438-2005): an information system that operates on spatial data.
Geographic information system: a computing system for the collection, checking, integration and analysis of information relating to the Earth's surface.
Simply put, a GIS is an electronic map in which the attributes of each object are recorded in a table connected with the map.
Information system: a system used for the storage, processing, search, distribution, communication and representation of information (GOST 7.0-99, article 3.1.30).
Data: information presented in a form suitable for processing by automatic tools (GOST 15971-90, article 1).
There are four periods in the history of GIS development.
Pioneer period (the late 1950s – the early 1970s). Its origins lie in the work of teams of researchers and developers from Canada and Sweden, who formulated the first objectives and approaches to building information systems oriented towards spatial data processing. "First generation" GIS were very different from modern GIS; the output of the first GIS was not graphic maps but generalized results of investigations presented in tabulated form.
The first GIS appeared in Canada and was used for the accounting and analysis of natural resources (De Mers, 2000). The main function of this GIS was the input of source accounting documents for storage and regular updating, including data aggregation and the compilation of summary statistical tables.
Period of the early 1970s – the early 1980s. Period of state initiatives.
The 1970s were characterized by growing interaction between the methods and tools of geoinformatics and digital techniques of mapping and automated cartography. GIS were developed on the basis of information retrieval systems, and later they acquired the functions of cartographic data banks with data analysis and modeling capabilities. Most GIS included a number of map construction tasks and used map documents as a source of data.
In the late 1960s the opinion formed in the USA that GIS technologies should be used for processing and presenting population census data. A methodology providing correct geographic referencing of census data was required. The main problem was the need to convert population addresses to geographic coordinates.
A special format of map data representation was developed. Rectangular coordinates of street intersections divided the streets of all settlements in the USA into separate segments. The algorithms for map data processing and representation were adopted from GIS developers in Canada and the Harvard laboratory and were implemented as the program POLYVRT, which converted population addresses to geographic coordinates and described the graphic segments of streets. Thus the topological approach to geographic information management was widely used for the first time. It contained a method for the mathematical description of spatial relationships between objects.
Period of the 1980s. Period of business development.
The 1980s were characterized by dynamic GIS development. A large number of software packages, desktop GIS, the appearance of network applications and a large number of end users meant that systems supporting individual data sets on separate computers opened the door to distributed geodatabase systems.
The well-known software product ARC/INFO was designed by the Environmental Systems Research Institute (ESRI, Inc.) in 1981.
It was the most successful realization of the idea of separate internal representation of geometric and attribute information. At present, it is one of the most popular packages in the world.
Period of the 1990s – present. User period.
This period is characterized by high competition among commercial producers of geographic information technologies and services, increased demand for data and the formation of a global geographic information infrastructure. Saturation of the GIS software market, especially for personal computers, expanded the field of application of GIS technologies. This required substantial sets of digital geodata and also GIS engineers.
In Russia the development of geoinformatics and GIS began in the late 1980s – the early 1990s, and the geographic information boom began in the mid-1990s.
Different GIS are produced for many areas of knowledge. They are now widely used in all fields of human activity connected with the Earth's surface: in cartography, geology, meteorology, land management, ecology, municipal administration, defense and others.
The wide application of GIS is connected with the high efficiency of these systems and with the results of complex analysis, which cannot be obtained either by simple map analysis or by data table analysis.
With the help of ArcGIS you can solve GIS problems of any degree of complexity, from a separate analytical project to the implementation of a multiuser GIS for your organization. Today thousands of organizations and hundreds of thousands of users use GIS technologies to study and process various sets of geographically referenced information.
Electronic maps of towns are the simplest example of GIS. We see the map of the town on the screen, select a street and obtain its name, or vice versa, we type the street's name in the search line and see it indicated on the map.
For illustrative purposes we consider some fields where the use of GIS is traditional.
Management and areawide planning.
This field is based on the behaviour of different social groups, which reflects social needs and opportunities. These groups have a given location and dynamics on the territory.
Urban development and architecture. Planning, engineering surveys and town development are the typical work of the urban services that support the development of the territory.
Civil engineering infrastructure. Inventory, accounting and planning of the location of distributed infrastructure objects (water supply, drainage, heat supply, gas supply, electricity supply) and their management, assessment of their state, and decision-making on repairs and emergency situations.
Land resources management, land cadastre. Compilation of cadastres and classified maps and delimitation of territories and areas are the typical problems of this field.
Management of natural resources and environmental activity. Estimation of natural resource stocks, modeling of processes in the natural environment and decision-making are the typical problems of this field.
Geology, mineral resources and the mineral resource industry. The specific character of these problems is that mineral reserves on a certain area have to be calculated from the results at sample points (exploratory boreholes, test pits, etc.) under a known model of the deposit formation process.
Transport planning and management (logistics). The characteristics of the locations where goods are stored and where goods are awaited, the position, state and characteristics of the hauling equipment, and the characteristics of the road network (mean speed, repairs, bypass roads, blockages, borders, customs stations, etc.) are plotted on the map. It is required to draw up a schedule of movement and to correct it from time to time if unforeseen situations arise.
Land, air and marine navigation mapping and management of land, air and water transport. These are well-established fields with well-understood problems.
Problems of controlling moving objects, subject to a given system of relations between them and stationary objects, have a special place here.
Marketing and market analysis. Development tendencies are identified and the effect of various topological properties (closeness, intersection and combination of areas) is assessed.
Agriculture. Estimation of reserves from a number of point measurements, transport planning, interaction of dynamically changing areas, classification and identification of "similarity" of spatial objects, precision agriculture.
Emergency situations. Assessment of potentially hazardous facilities, modeling of the consequences of emergency situations.
Fast response services. Public safety, fire fighting, emergency medical service.
GIS classification
The variety of problems arising in practice has led to the formation of different GIS, which can be classified by the following characteristics:
1) According to the type of architecture, GIS are divided into two classes: open and closed.
Closed systems are characterized by a low price, and the class of problems they solve is defined in advance. They have a simple interface and are quickly learnt by users. Their function set cannot be changed. Consequently, the lifecycle of these systems is very short.
Open systems have a certain set of functions and also special tools with which users can create and build their own applications, thereby expanding the functionality of the base GIS. Open systems are more expensive, but they have a longer lifecycle and can be adapted to a very large class of problems.
2) According to the hardware platform, GIS are divided into:
- professional GIS;
- desktop GIS.
The well-known systems of ESRI and Intergraph are classical professional GIS. They are powerful systems designed primarily for workstations and networks; they include map document vectorization units and support work with many external devices.
These systems are built on a modular principle and can be delivered in flexible configurations.
Desktop GIS are PC-oriented and are used by a large number of users. GIS of this kind have a smaller function set. They are inexpensive and widely used; in large GIS projects, where the GIS is built as a multilevel system, workplaces are organized on their basis.
3) According to spatial coverage, GIS are divided into:
- global (planetary);
- continental;
- national (state);
- regional;
- local.
Each state has its own gradation. For example, in Russia there are:
- federal GIS (FGIS);
- regional GIS (RGIS);
- municipal GIS (MGIS) and local GIS (LGIS).
4) According to the subject domain of information modeling, there are urban, agricultural, geological, environmental, recreational, water source monitoring GIS, etc.
5) According to functionality, GIS are divided into:
- multipurpose (tool, full-function) GIS;
- special-purpose GIS;
- GIS viewers.
Multipurpose GIS are characterized by openness; they deal with different data formats, possess a powerful graphics editor, and have tools for developing and implementing various applications (extending the function set). This class of GIS is widely used, since it can be adapted when necessary and allows different problems in many fields to be solved. As a rule, these systems have their own embedded language that works with both attribute and graphic information, and they have tools for embedding program modules written in a high-level language.
Special-purpose GIS solve a small number of problems using a given parameter set. Their main task is to control processes and to prevent uncontrolled situations.
GIS viewers are needed to visualize spatial information and to print it out. These systems do not have tools for spatial analysis and modeling.
6) According to the data model used:
• Vector GIS are based on vector graphics and work with topological and non-topological vector data models.
• Raster GIS are based on raster graphics and work with raster data models.
• Hybrid GIS combine vector and raster GIS.
Most modern GIS are not strictly vector or strictly raster. Vector GIS usually have some tools for working with raster data, and vice versa, raster GIS usually have some tools for working with vector data models.
Functions of GIS.
The function set implemented in a GIS depends, first of all, on the purpose of the system as a whole. The main functions of GIS are:
- data input and update;
- data storage and data manipulation;
- data analysis;
- output and presentation of data and results.
A specific feature of GIS is that they can handle any type of digital information; therefore, to be imported, all data must first be converted to digital form. Point descriptions and reports are converted to text files, photos and sketches are scanned, and the image files are renamed according to the numbers of the observation points. All tabular information is brought to a uniform format.
Data viewing. Data visualization. GIS makes it possible to visualize data as maps. Any geographic information system provides tools for viewing data.
On the screen the map has a layered ("sandwich") organization: every vector map, georeferenced raster or observation point image is represented by a separate layer, which can be switched on and off.
Layer-by-layer organization of data has the following advantages:
- the visibility of layers can be changed during data visualization;
- the order of layers can be changed during visualization;
- the visualization parameters of every layer can be set independently;
- spatial analysis can be carried out independently for each layer;
- the map can be formed from layers of different level of detail and origin.
Vector data form vector layers, and raster data form raster layers. A raster layer corresponds to one raster image. If vector layers have similar attributes, they can be combined into group layers; the same applies to raster data.
Symbols. A map is a model of the real world whose elements represent real objects or events. The symbol is the basic element of every cartographic representation: it is by means of symbols that objects and events are shown on the map. There are three basic types of symbols in cartography: point symbols, line symbols, and area symbols. Objects or events that are too small to be shown at map scale are represented by point symbols. Objects or events that are long at map scale but have negligible width are represented by line symbols. Objects or events that are extensive at map scale and occupy a closed region are represented by area symbols. In addition, text symbols are used to place text on the map.
Analysis. Geographic information systems differ from other information systems in that they offer effective means of spatial data analysis. With the help of this analysis it is possible to carry out spatial modeling of objects and events. GIS is a tool of spatial analysis, and spatial analysis is called the "heart" of GIS. Maps can be compared by switching map layers on and off. By displaying chemical analysis data for different elements, we can draw conclusions about regularities in their distribution.
The tasks of spatial analysis can be divided into 5 generalized categories.
1) Location analysis.
A spatial query about object location corresponds to this category. The pattern of the spatial distribution of objects is shown visually on the map; it reveals relationships between the objects and gives a better understanding of the area under investigation. Only after seeing where objects are located is it possible to understand some of the reasons for their spatial interrelationships. To investigate regularities in the distribution of data, the objects under study must be displayed in a way that is based on the values of their characteristics.
2) Satisfaction of spatial conditions.
The spatial query corresponding to this category is: where are the spatial conditions satisfied? A simple query about object location consists of one condition, and answering it requires a single operation. A complex query about object location can include a set of conditions, and answering it requires several operations of spatial analysis. For example:
a) where is there a construction site 2 hectares in area, within 200 m of the district road, with a soil bearing capacity of up to 1 kg per square centimeter?
b) justify the location of a shopping center, educational organization, or business center, taking spatial factors into account;
c) find the optimal route for a pipeline that is being designed.
3) Time analysis.
The query corresponding to this category is: what has changed spatially over the specified period? Answering this question is an attempt to identify the changes that have occurred in space and time, and the tendencies of these changes on a certain territory. For example, what is the tendency in the spread of flu in the town, what new objects have been built recently, how is the town sprawling? By saving and comparing maps of different dates, GIS can carry out time analysis.
4) Identification of structure.
The spatial query corresponding to this category is: what spatial structures and distributions are there? For example, how many anomalies that deviate from the normal distribution are there, and where are they? How is the population distributed in the town? Which roads are the most dangerous? Identifying spatial structures is a very difficult problem that requires powerful tools of spatial analysis.
5) Assessment of different scenarios.
A scenario of potential developments results from questions of the form "What will happen if…". For example, what will happen if rainfall intensity is critical? What will the costs be if a street is widened by 14 m? How will transport links change if the tram line on Pushkin Street is removed?
In these cases the user works with a prediction model and maps of potential effects. Such a model makes it possible to construct a hypothetical situation and to forecast, in space and time, the development and consequences of social and economic situations, disasters, and technogenic accidents.
Recently the analytical and modeling functions of GIS have grown noticeably. For example, the ArcGIS 9.3 system (ESRI) includes the modules Spatial Analyst, 3D Analyst, Network Analyst, and Geostatistical Analyst.
Data output. Besides presenting data on the display, GIS includes elements of a cartographic publishing system, which can compose maps at the required scales and generate the list of symbols, scale bars, north arrows, etc.
Questions
1. Enumerate the main kinds of GIS classification.
2. Characterize the structure of GIS.
3. The basic functional capabilities of GIS.
4. What is the layer-by-layer approach to data organization?
5. Characterize the history of GIS development.
Basic sources:
1. Ananiev Yu.S. Geographic information systems: Manual. – Tomsk: Tomsk Polytechnic University Publishing House, 2003. – 69 p.
2. GOST Р 52155-2003 Geographic information systems. Federal, regional, municipal.
3. Koshkarev A.V., Tikunov V.S. Geoinformatics. – M.: Kartgeocenter-Geoizdat, 1993. – 213 p.
4. Kuznetsov O.L., Nikitin A.A. Geoinformatics. – M.: Nedra, 1992. – 301 p.
LECTURE 4. STUDY OF GEOGRAPHICAL DATA
Concept of data. Geographic data. Discrete and continuous data. Three principal data components: attribute, geographic and temporal data. Formats of GIS data. Vector data. Raster data. Representation of spatial data. Concept and advantages of the geodatabase. ArcCatalog: data management application. ArcMap: data display application. Data and layers. Types of maps and their characteristics. Map layout.
The term "data" derives from the Latin "datum", meaning fact. Data are a collection of facts represented in a formalized manner (in quantitative and qualitative terms).
Data are discrete records concerning phenomena. In computer systems, data are information and facts usually collected as a result of experience, observation, experiment, processes, and assumptions. Data can consist of numbers, words, or images, especially as the results of measurements or observations of a set of variables. Data are often regarded as the lowest level of abstraction, the level from which information and knowledge about the real world are obtained.
Geographic data are the unity of geospatial, semantic, and temporal data about geographic locations.
Geospatial data are data about spatial properties: the location, shape, size, and spatial relationships of geographic objects, phenomena, and processes on the real surface of the Earth.
Semantic data are data that describe the content of geographic objects and their properties.
Temporal data record the time at which an object was investigated and show how the properties of the object change over a period of time. The main requirement for temporal data is that they be up to date, which means that they can be used for processing today. Outdated data are stale data that cannot be used under new, changed conditions.
In most geographical information technologies, a single class of data, attributes, is used to represent both temporal and thematic characteristics.
A geographical information system must be able to manage all components of geographic data jointly.
Discrete geographical objects are separate, bounded bodies on the real surface of the Earth. A discrete geographical object can be present or absent at any place on the Earth's surface. Examples are manholes, road traffic accidents (RTA), roads, pipelines, buildings, blocks, zones.
Continuous phenomena (fields) characterize a territory as a whole, not isolated objects. For example, surface elevation, precipitation, and temperature can be measured at any place in a territory and characterize it as a whole.
The two main methods of representing spatial data in GIS are referred to as the vector model and the raster model.
VECTOR MODELS
The vector model of data essentially entails recording the grid coordinates of the points, lines, or polygons used to depict a feature. Each point is recorded as a single x,y coordinate pair. Lines and polygons are recorded as a series of x,y coordinate pairs. The computer reconstructs the shape of lines and polygons by joining each successive coordinate pair with a straight line.
Vector models are convenient for representing and storing discrete objects such as buildings, pipelines, or boundaries of areas. For example, the location of a borehole (a point object) is described by an x,y coordinate pair. Linear objects such as roads, rivers, or pipelines are stored as series of x,y coordinate pairs. Polygonal objects such as river drainage basins, land parcels, or service areas are stored as closed sets of coordinates. Coordinates are usually pairs (x,y) or triples (x,y,z, where z is, for example, altitude). Coordinate values depend on the geographical reference system in which the data are stored.
ArcGIS stores vector data in feature classes and in sets of topologically related feature classes. The attributes associated with features are stored in data tables. To represent spatial data, ArcGIS uses three different implementations of the vector model: coverages, shapefiles, and geodatabases.
RASTER MODEL OF DATA
The raster model of data is a digital representation of spatial objects as a set of raster cells (pixels) with assigned values of the object class. The cell is the elementary unit of the raster model.
The raster model is effective for working with continuous properties. A raster image is a set of cell values, like a scanned map or a picture. In the raster model the real world is represented as a surface divided uniformly into cells. The X,Y coordinates of at least one corner of the raster are known; therefore, its location in geographical space is determined.
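The last point, that one known corner fixes the position of every cell, can be sketched as follows. This is a minimal illustration under stated assumptions: the known corner is the upper-left one, cells are square, and rows are numbered downward from the top. The function names and sample coordinates are hypothetical and do not come from any GIS library.

```python
def cell_center(col, row, x_min, y_max, cell_size):
    """Map coordinates of the center of cell (col, row).

    Assumes (x_min, y_max) is the upper-left corner of the raster,
    cells are square, and row 0 is the top row.
    """
    x = x_min + (col + 0.5) * cell_size
    y = y_max - (row + 0.5) * cell_size
    return x, y


def cell_index(x, y, x_min, y_max, cell_size):
    """Column and row of the cell containing the map point (x, y)."""
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    return col, row


# Hypothetical raster: upper-left corner at (500000, 6300000), 10 m cells.
print(cell_center(0, 0, 500000.0, 6300000.0, 10.0))
print(cell_index(500025.0, 6299975.0, 500000.0, 6300000.0, 10.0))
```

Only the corner and the cell size are stored with the image; cell coordinates are computed on demand, which is why a raster needs no per-cell coordinate record.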
Raster models are convenient for storing and analyzing data that are distributed continuously over a certain area. Each cell contains a value that determines its class or category; the value may be a measurement or the result of interpretation. Raster data include images, for example aerial imagery, satellite data, or scanned maps; they are often used to create GIS data.
The sources of raster data are:
- images: aerial photographs of a territory; satellite images of a territory; photographs of objects;
- drawings: topographical maps; plans; technical drawings; schemes;
- figures;
- texts: documents; tables.
Both models have advantages and disadvantages. Modern GIS can work with both vector and raster models.
ATTRIBUTE DATA. The properties of geographical objects are represented in a database by a set of attributes. An attribute is a property, a qualitative or quantitative characteristic, that describes a spatial object and is associated with its unique number or identifier. Sets of attribute values are usually represented as tables of relational databases. In this case a row (record) holds the attributes of one object, and a column (field) holds attributes of one type. The tools of a database management system (DBMS) are used for ordering, storing, and manipulating attribute data.
METADATA. Information that describes the content, quality, condition, origin, and other characteristics of data or other pieces of information. Metadata for spatial data may describe and document its subject matter; how, when, where, and by whom the data was collected; availability and distribution information; its projection, scale, resolution, and accuracy; and its reliability with regard to some standard. Metadata consists of properties and documentation. Properties are derived from the data source (for example, the coordinate system and projection of the data), while documentation is entered by a person (for example, keywords used to describe the data).
Geodatabases.
Geodatabases implement an object-based GIS data model – the geodatabase data model. A geodatabase stores each feature as a row in a table. The vector shape of the feature is stored in the table’s shape field, with the feature attributes in other fields. Each table stores a feature class. In addition to features, geodatabases can also store rasters, data tables, and references to other tables. Geodatabases are repositories that can hold all of your spatial data in one location. They are like adding coverages, shapefiles, and rasters into a DBMS. However, they also add important new capabilities over file-based data models. Some advantages of a geodatabase are that features in geodatabases can have built-in behavior; geodatabase features are completely stored in a single database; and large geodatabase feature classes can be stored seamlessly, not tiled. In addition to generic features, such as points, lines, and areas, you can create custom features such as transformers, pipes, and parcels. Custom features can have special behavior to better represent real-world objects. You can use this behavior to support sophisticated modeling of networks, data entry error prevention, custom rendering of features, and custom forms for inspecting or entering attributes of features. Features in geodatabases. Because you can create your own custom objects, the number of potential feature classes is unlimited. The basic geometries (shapes) for geodatabase feature classes are points, multipoints, network junctions, lines, network edges, and polygons. You can also create features with new geometries. All point, line, and polygon feature classes can - Be multipart (for example, like multipoint shapes or regions in a coverage). - Have x,y; x,y,z; or x,y,z,m coordinates (m-coordinates store distance measurement values such as the distance to each milepost along a highway). - Be stored as continuous layers instead of tiled. 
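The idea that each feature is a row whose shape field holds the geometry and whose other fields hold the attributes can be sketched in a few lines. This is an illustration only, not the geodatabase storage format or an ArcGIS API: the `Feature` class, the attribute names, and the sample parcels are hypothetical. The length and area helpers follow the vector model, in which successive coordinate pairs are joined by straight lines.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class Feature:
    """One row of a hypothetical feature class table."""
    shape: list                                     # list of (x, y) pairs
    attributes: dict = field(default_factory=dict)  # the other fields


def polyline_length(coords):
    """Length of a line built from straight segments between successive pairs."""
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(coords, coords[1:]))


def polygon_area(ring):
    """Area of a simple polygon by the shoelace formula (ring may be open)."""
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # close the ring
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice_area) / 2.0


# A feature class is then simply a table (list) of such rows.
parcels = [
    Feature(shape=[(0, 0), (100, 0), (100, 200), (0, 200)],
            attributes={"id": 1, "land_use": "construction site"}),
    Feature(shape=[(0, 0), (50, 0), (50, 50), (0, 50)],
            attributes={"id": 2, "land_use": "garden"}),
]

# Select parcels of at least 2 hectares (20,000 square meters).
large = [p.attributes["id"] for p in parcels if polygon_area(p.shape) >= 20000]
print(large)
```

Because geometry and attributes sit in the same row, a condition on shape (area) and a condition on attributes can be combined in one selection.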
Whether you use GIS in a project or multiuser environment, you can use the three ArcGIS desktop applications (ArcCatalog, ArcMap, and ArcToolbox) to do your work.
ArcCatalog is the application for managing your spatial data holdings, for managing your database designs, and for recording and viewing metadata. ArcMap is used for all mapping and editing tasks, as well as for map-based analysis. ArcToolbox is used for data conversion and geoprocessing. Using these three applications together, you can perform any GIS task, simple to advanced, including mapping, data management, geographic analysis, data editing, and geoprocessing.
ArcCatalog lets you find, preview, document, and organize geographic data and create sophisticated geodatabases to store that data. ArcCatalog provides a framework for organizing large and diverse stores of GIS data. You can use ArcCatalog to organize folders and file-based data when you build project databases on your computer. You can create personal geodatabases on your computer and use tools in ArcCatalog to create or import feature classes and tables. You can also view and update metadata, allowing you to document your datasets and projects.
ArcMap lets you create and interact with maps. In ArcMap, you can view, edit, and analyze your geographic data. You can query your spatial data to find and understand relationships among geographic features. You can symbolize your data in a wide variety of ways. You can create charts and reports to communicate your understanding with others. You can lay out your maps in a what-you-see-is-what-you-get layout view. With ArcMap, you can create maps that integrate data in a wide variety of formats including shapefiles, coverages, tables, computer-aided drafting (CAD) drawings, images, grids, and triangulated irregular networks (TINs).
ArcToolbox is a simple application containing many GIS tools used for geoprocessing. Simple geoprocessing tasks are accomplished through form-based tools.
More complex operations can be done with the aid of wizards.
A geographical map is a reduced image of the Earth's surface on a plane, constructed in a certain projection that takes the curvature of the surface into account, which shows the placement, combination, and connections of natural and social phenomena selected and characterized according to the purpose of the map.
Classification
Geographical maps are subdivided into the following categories.
According to the spatial coverage:
- maps of the world;
- maps of continents;
- maps of countries and regions.
According to the scale:
- large-scale (1:200000 and larger);
- medium-scale (smaller than 1:200000, up to and including 1:1000000);
- small-scale (smaller than 1:1000000).
Maps of different scales differ in accuracy, in the degree of image detail, in the level of generalization, and in purpose.
According to purpose:
- scientific-reference maps are intended for research and for obtaining the fullest possible information;
- cultural-educational maps are intended for the popularization of knowledge and ideas;
- training maps are used as visual aids in the study of Geography, History, Geology, and other disciplines;
- engineering maps represent the objects and conditions needed to solve technical tasks;
- tourist maps include such common geographic features as road networks, population centers, rivers, lakes, forests, and land relief, as well as items of special tourist interest, including architectural and historical landmarks, preserves, national parks, museums, hotels, tourist centers, and camping sites. Such maps serve to acquaint tourists with a given district and provide information on possible travel routes, on the location of specific landmarks, and on the availability of tourist services;
- navigation (road) maps, etc.
According to content:
General geographic (physical) maps are mainly used to depict physical features present on the earth's surface, such as landforms and water bodies, deserts and plains, climate, vegetation, and erosion. Large-scale general geographic maps on which all terrain features are shown are called topographic maps, medium-scale ones are called topographic survey maps, and small-scale ones are called survey maps.
Thematic maps
A thematic map is a type of map especially designed to show a particular theme connected with a specific geographic area. These maps "can portray physical, social, political, cultural, economic, sociological, agricultural, or any other aspects of a city, state, region, nation, or continent". They can be divided into two groups: maps of natural phenomena and maps of social phenomena.
Maps of natural phenomena cover all components of the natural environment and their combinations. This group consists of geological maps, geophysical maps, maps of the relief of the land surface and of the bottom of the World Ocean, meteorological and climatic maps, oceanographic, botanical, hydrological, and soil maps, maps of mineral resources, maps of physical-geographical landscapes and physical-geographical regionalization, etc.
Social-political maps include maps of population and economic, political, historical, and social-geographical maps; each of these subcategories can, in turn, have its own subdivisions. Thus, economic maps also include maps of industry (both general and by branch), agriculture, fisheries, transport and communications, etc.
A layout is the arrangement of the elements of a digital map image or a printed map, including titles, the legend, north arrows, scale bars, and geographical data. A layout represents a set of map elements together with our geographical data (i.e. the data frame).
A map element, in digital cartography, is a distinctly identifiable graphic or object in the map or page layout. For example, a map element can be a title, scale bar, legend, or other map-surround element.
The map area itself can be considered a map element; or an object within the map can be referred to as a map element, such as a roads layer or a school symbol.
MAP ELEMENTS:
North arrows show the orientation of the map.
The map scale helps to visualize the real sizes of objects on the map and the distances between them.
A scale bar is a line or a bar divided into parts and labeled with the corresponding ground distances. It is usually graduated in multiples of map units, such as tens of kilometers or hundreds of miles. If the map is enlarged or reduced, the scale bar changes accordingly.
Scale text. The scale of the map can also be shown as text. Scale text indicates the scale of the map and of its spatial objects. It tells the user what real distance is represented by a given unit on the map, for example, "One centimeter is equal to 100000 meters".
The legend shows which symbols are used to map which objects. Legends consist of samples of the map's symbols with descriptive text. When a single symbol is used for all objects of a layer, the layer's name is given in the legend. If several symbols are used to represent the objects of one layer, the field used to classify the objects becomes the legend heading, and each category is labeled with its value. The legend contains small patches that are samples of the symbols on the map.
In the process of layout it is necessary to take into account the goals of the map, the conditions of its use, and the audience that will use it.
Questions
1. The basic sources of data in GIS.
2. The approaches of position measurement.
3. The basic methods of data input in GIS.
4. Data structure in GIS.
5. What variants of spatial and attribute data connections are there?
6. Name the basic characteristics of raster models of spatial data.
7. Surface analysis in GIS.
Basic sources:
1. Ananiev Yu.S. Geographic information systems: Manual. – Tomsk: Tomsk Polytechnic University Publishing House, 2003. – 69 p.
2. Chandra A.M., Gosh S.K. Remote sensing and geographic information systems. – M.: Technosfera, 2008. – 312 p.
3. Tsvetkov V.Ya. Geographic information systems and technologies. – M.: Finance and statistics, 1998. – 288 p.
LECTURE 5. COORDINATE SYSTEMS AND MAP PROJECTION
Geographical coordinate system. Datum. Projection coordinate systems. Map projection. Map projection classification. Projection interpretation and conversion.
Considering the computer model lets us single out the principle by which the real world is represented in GIS: the models of geographical objects and their spatial properties (location, shape, size, spatial relations) are represented by coordinates. These coordinates are connected with the locations of real-world objects by means of a coordinate system.
What is a coordinate system? Coordinate systems are arbitrary designations for spatial data. Their purpose is to provide a common basis for communication about a particular place or area on the earth's surface. Within ArcGIS it is a system which localizes a position in space and defines relationships between positions.
A geographic (or geodetic) coordinate system (GCS) uses a three-dimensional spherical surface to define locations on the earth. A GCS is often incorrectly called a datum, but a datum is only one part of a GCS. A GCS includes an angular unit of measure, a prime meridian, and a datum (based on a spheroid). The spheroid defines the size and shape of the earth model, while the datum connects the spheroid to the earth's surface.
A point is referenced by its longitude and latitude values. Longitude and latitude are angles measured from the earth's center to a point on the earth's surface. The angles are usually measured in degrees (or in grads).
In the spherical system, horizontal lines, or east-west lines, are lines of equal latitude, or parallels. Vertical lines, or north-south lines, are lines of equal longitude, or meridians. These lines encompass the globe and form a gridded network called a graticule.
The line of latitude midway between the poles is called the equator. It defines the line of zero latitude. The line of zero longitude is called the prime meridian. For most GCSs, the prime meridian is the longitude that passes through Greenwich, England. The origin of the graticule (0,0) is defined by where the equator and prime meridian intersect. Latitude and longitude values are traditionally measured either in decimal degrees or in degrees, minutes, and seconds. Latitude values are measured relative to the equator and range from –90° at the South pole to +90° at the North pole. Longitude values are measured relative to the prime meridian. They range from –180° when traveling west to 180° when traveling east. SPHEROIDS AND SPHERES. The shape and size of a geographic coordinate system's surface is defined by a sphere or spheroid. Although the earth is best represented by a spheroid, it is sometimes treated as a sphere to make mathematical calculations easier. The assumption that the earth is a sphere is possible for small-scale maps (smaller than 1:5,000,000). At this scale, the difference between a sphere and a spheroid is not detectable on a map. However, to maintain accuracy for larger-scale maps (scales of 1:1,000,000 or larger), a spheroid is necessary to represent the shape of the earth. A sphere is based on a circle, while a spheroid (or ellipsoid) is based on an ellipse. The shape of an ellipse is defined by two radii. The longer radius is called the semimajor axis, and the shorter radius is called the semiminor axis. Rotating the ellipse around the semiminor axis creates a spheroid. A spheroid is also known as an oblate ellipsoid of rotation. As a rule, a spheroid is chosen for one country or certain territory. If the spheroid is ideally suited for one geographic region, it does not mean that it is suited for another region. Datums. The coordinate system defines datum and map projection. 
While a spheroid approximates the shape of the earth, a datum defines the position of the spheroid relative to the center of the earth. A datum provides a frame of reference for measuring locations on the surface of the earth. It defines the origin and orientation of latitude and longitude lines.
A datum is a frame of reference that describes the shape and size of the Earth and the origin, orientation, and scale of the coordinate systems used to determine location relative to the Earth by coordinates. In this sense a datum is a mathematical representation of the shape of the Earth's surface.
A geographic (datum) transformation is a well-defined mathematical method to convert coordinates between two geographic coordinate systems. As with the coordinate systems, there are several hundred predefined geographic transformations that you can access. It is very important to correctly use a geographic transformation if it is required. When neglected, coordinates can be in the wrong location by up to a few hundred meters. Sometimes no transformation exists, or you have to use a third GCS like the World Geodetic System 1984 (WGS84) and combine two transformations. The World Geodetic System is the basis for location measurement all over the world.
MAP PROJECTION
Whether you treat the earth as a sphere or a spheroid, you must transform its three-dimensional surface to create a flat map sheet. This mathematical transformation is commonly referred to as a map projection. One easy way to understand how map projections alter spatial properties is to visualize shining a light through the earth onto a surface, called the projection surface. Imagine the earth's surface is clear with the graticule drawn on it. Wrap a piece of paper around the earth. A light at the center of the earth will cast the shadows of the graticule onto the piece of paper. You can now unwrap the paper and lay it flat. The shape of the graticule on the flat paper is different from that on the earth. The map projection has distorted the graticule.
A spheroid can't be flattened to a plane any more easily than a piece of orange peel can be flattened; it will rip. Representing the earth's surface in two dimensions causes distortion in the shape, area, distance, or direction of the data.
A map projection uses mathematical formulas to relate spherical coordinates on the globe to flat, planar coordinates.
Different projections cause different types of distortions. Some projections are designed to minimize the distortion of one or two of the data's characteristics. A projection could maintain the area of a feature but alter its shape.
The process of transferring information from the Earth to a map causes every projection to distort at least one aspect of the real world: shape, area, distance, or direction. If you deal with small areas such as a town or a district, the distortion is small and may not be noticeable on your map or in your measurements. If you work at the national, continental, or global level, you need to choose the map projection that least distorts the properties most important for your project.
Classification based on distortion characteristics.
Conformal projections. A projection that maintains angular relationships and accurate shapes over small areas is called a conformal projection. Such projections preserve small local shapes without distortion.
Equal area projections. A projection that maintains accurate relative sizes is called an equal area, or equivalent, projection. These projections are used for maps that show distributions or other phenomena where showing area accurately is important.
Equidistant projections. A projection that maintains accurate distances from the center of the projection or along given lines is called an equidistant projection.
Equidistant projection maps preserve the distances between certain points.
Classification based on developable surface.
Map projections can also be classified based on the shape of the developable surface onto which the Earth's surface is projected. A developable surface is a simple geometric form capable of being flattened without stretching, such as a cylinder, cone, or plane. The projection surface can take a normal, transverse, or oblique position relative to the axis of rotation of the sphere or spheroid.
Conic projection
A conic (or conical) projection is a type of map in which a cone is wrapped around a sphere (the globe), and the details of the globe are projected onto the conic surface.
This projection is based on the concept of the 'piece of paper' being rolled into a cone shape and touching the Earth along a circular line. Most commonly, the tip of the cone is positioned over a Pole and the line where the cone touches the Earth is a line of latitude; but this is not essential. The line of latitude where the cone touches the Earth is called the standard parallel. Because of the distortions away from the standard parallel, conic projections are usually used to map regions of the Earth, particularly in mid-latitude areas.
This map uses the same settings as the previous World Map, but it is more typical of a conic projection map. Distortions are greatest to the north and south, away from the standard parallel. But, because the standard parallel runs east-west, distortions are minimal through the middle of the map.
In conical or conic projections, the reference spherical surface is projected onto a cone placed over the globe. The cone is cut lengthwise and unwrapped to form a flat map. The cone may be either tangent to the reference surface along a small circle or it may cut through the globe and be secant at two small circles. Examples of conic projections include Lambert conformal conic, Albers equal area conic, and equidistant conic projections.
In the Albers equal area conic projection, parallels toward the North and South poles are spaced closer together than the central parallels, and the projection preserves areas.
Cylindrical projections
Like conic projections, cylindrical projections can also have tangent or secant cases. The Mercator projection is one of the most common cylindrical projections, and the equator is usually its line of tangency. Meridians are geometrically projected onto the cylindrical surface, and parallels are mathematically projected. This produces graticular angles of 90 degrees. The cylinder is "cut" along any meridian to produce the final cylindrical projection. The meridians are equally spaced, while the spacing between parallel lines of latitude increases toward the poles. This projection is conformal and displays true direction along straight lines. On a Mercator projection, rhumb lines, lines of constant bearing, are straight lines, but most great circles are not.
For more complex cylindrical projections, the cylinder is rotated, thus changing the tangent or secant lines. Transverse cylindrical projections, such as the Transverse Mercator, use a meridian as the tangential contact or lines parallel to meridians as lines of secancy. The standard lines then run north-south, along which the scale is true. Oblique cylinders are rotated around a great circle line located anywhere between the equator and the meridians. In these more complex projections, most meridians and lines of latitude are no longer straight.
In all cylindrical projections, the line of tangency or lines of secancy have no distortion and thus are equidistant lines. Other geographic properties vary according to the specific projection.
Planar projections (azimuthal projections)
Planar projections project map data onto a flat surface touching the globe. A planar projection is also known as an azimuthal projection or a zenithal projection. This type of projection is usually tangent to the globe at one point but may be secant also.
The point of contact may be the North Pole, the South Pole, a point on the equator, or any point in between. This point specifies the aspect and is the focus of the projection. The focus is identified by a central longitude and a central latitude. Possible aspects are polar, equatorial, and oblique.

Polar aspects are the simplest form. Parallels of latitude are concentric circles centered on the pole, and meridians are straight lines that intersect with their true angles of orientation at the pole. Planar projections are used most often to map polar regions.

Gauss-Kruger projection

DESCRIPTION
This projection is similar to the Mercator except that the cylinder touches the globe along a meridian instead of the equator. The result is a conformal projection that does not maintain true directions. The central meridian is placed on the region to be highlighted. This centering minimizes distortion of all properties in that region. This projection is best suited for land masses that stretch north-south. The Gauss-Kruger coordinate system is based on the Gauss-Kruger projection.

PROJECTION METHOD
Cylindrical projection with the central meridian placed in a particular region.

LINES OF CONTACT
Any single meridian for the tangent projection. For the secant projection, two parallel lines equidistant from the central meridian.

LINEAR GRATICULES
The equator and the central meridian.

PROPERTIES
Shape: Conformal. Small shapes are maintained. Shapes of larger regions are increasingly distorted away from the central meridian.
Area: Distortion increases with distance from the central meridian.
Direction: Local angles are accurate everywhere.
Distance: Accurate scale along the central meridian if the scale factor is 1.0. If it is less than 1.0, there are two straight lines with accurate scale equidistant from and on each side of the central meridian.

APPLICATIONS
Gauss-Kruger coordinate system. Gauss-Kruger divides the world into zones six degrees wide.
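The six-degree zoning can be illustrated with a little arithmetic. The sketch below assumes the usual convention that zones are numbered eastward from the Greenwich meridian, so that zone 1 covers 0° to 6° E and is centered on 3° E. The function names are hypothetical and serve only as an illustration.

```python
def gk_zone_from_longitude(lon_deg):
    """Six-degree Gauss-Kruger zone number (1..60) for a longitude in degrees.

    East longitudes are positive. Assumption: zones are counted eastward
    from Greenwich, so zone 1 spans 0 to 6 degrees E.
    """
    lon = lon_deg % 360.0          # map west longitudes into 0..360 degrees east
    return int(lon // 6) + 1


def gk_central_meridian(zone):
    """Central meridian of a six-degree zone, in degrees east of Greenwich (0..360)."""
    if not 1 <= zone <= 60:
        raise ValueError("six-degree zones are numbered 1 to 60")
    return 6 * zone - 3


if __name__ == "__main__":
    lon_tomsk = 84.95  # approximate longitude of Tomsk, degrees E
    zone = gk_zone_from_longitude(lon_tomsk)
    print(zone, gk_central_meridian(zone))
```

For a longitude of about 84.95° E (roughly that of Tomsk), this gives zone 15 with a central meridian of 87° E.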
Each zone has a scale factor of 1.0 and a false easting of 500,000 meters. The central meridian of zone 1 is at 3° E. Some places also add the zone number times one million to the 500,000 false easting value, so Gauss-Kruger zone 5 could have a false easting value of 500,000 or 5,500,000 meters. Three-degree Gauss-Kruger zones also exist.

The UTM system is similar. Its scale factor is 0.9996, and the central meridian of UTM zone 1 is at 177° W. The false easting value is 500,000 meters, and southern hemisphere zones also have a false northing of 10,000,000 meters.

Questions:
1. The basic principles and methods of mathematical modeling.
2. The approaches of position measurement.
3. Methods of data output and data visualization in GIS.

Basic sources:
1. Ananiev Yu.S. Geographic information systems: Manual. – Tomsk: Tomsk Polytechnic University Publishing House, 2003. – 69 p.
2. Geographic information systems: Annals. – M.: KIBERSO. – 112 p.
3. Chandra A.M., Gosh S.K. Remote sensing and geographic information systems. – M.: Technosfera, 2008. – 312 p.
4. Tsvetkov V.Ya. Geographic information systems and technologies. – M.: Finance and statistics, 1998. – 288 p.