UDC 911.5.1./.9 (075.8)
BBC 26.8:32.8.я73
М77
T.A. Mongolina, E.P. Yankovich, L.V. Nadeina
М77 Geographic information systems and mathematical modeling: Lectures for the course "Geographic information systems and mathematical modeling" for students of the 022000 "Ecology and Environmental Management" programme, professional profile "Geoecology" / T.A. Mongolina, E.P. Yankovich, L.V. Nadeina. – Tomsk: Tomsk Polytechnic University Publishing House, 2012. – 58 p.
UDC 911.5.1./.9 (075.8)
BBC 26.8:32.8.я73
Authorization granted by the Editorial Advisory Board of Tomsk Polytechnic University
Reviewer
Prof. S.I. Arbuzov, Doctor of Geological and Mineralogical Sciences
© STE HPT TPU, 2012
© T.A. Mongolina, E.P. Yankovich, L.V. Nadeina, 2012
© Design. Tomsk Polytechnic University Publishing House, 2012
CONTENTS
Lecture 1. Modeling in science. Types, principles and methods of mathematical modeling. Statistical modeling. One-dimensional statistical models
Lecture 2. Two-dimensional and multi-dimensional statistical models. Spatial modeling
Lecture 3. Introduction to geographic information systems
Lecture 4. Study of geographical data
Lecture 5. Coordinate systems and map projection
LECTURE 1. MODELING IN SCIENCE. TYPES, PRINCIPLES AND METHODS OF
MATHEMATICAL MODELING. STATISTICAL MODELING. ONE-DIMENSIONAL
STATISTICAL MODELS.
Modeling in science
Modeling is one of the methods of studying the surrounding world. The process of developing and using models is called modeling.
A model (from measure, sample, standard) is a material or imaginary object which replaces the original in the process of study while keeping those of its typical characteristics that are important for the investigation.
Modeling belongs to the general scientific methods and is applied at both the empirical and theoretical levels of study.
The term "model" is often used to denote:
1) a device reproducing the construction or the function of another device (reduced, magnified or full-sized);
2) an analog (drawing, graph, plan, scheme, description, etc.) of a phenomenon, process or object. When developing a model, an investigator always proceeds from the purpose in hand and takes into account only the important factors. Therefore no model is identical to the original object; it is not a complete copy, since in developing it the investigator took into account only the factors that are most important from his point of view.
Material systems as objects of study are divided into well-organized and badly organized ones. Well-organized systems consist of a limited number of elements with strongly defined and unique dependences between them. The simplest chemical and physical processes, mechanisms, devices, etc. belong here; their properties and states can be described with the help of physical and chemical laws.
Complicated natural objects and phenomena belong to badly organized systems. Living organisms and their communities, as well as many objects studied by the Earth sciences, are typical badly organized systems. When studying such systems we can find only general regularities in their structure, i.e. tendencies which do not lend themselves to strict quantification.
The basic method of studying badly organized systems is modeling, in which the direct object of study is replaced by its simplified analog, a model.
According to the character of the model, there are object modeling and sign (information) modeling. In object modeling the investigation is carried out with the help of a model reproducing defined geometrical, physical, dynamical or functional characteristics of the object.
Sign (information) modeling uses sign representations (diagrams, schemes, graphs, hieroglyphs, character sets) as models.
Mathematical modeling (modeling with the help of mathematical relations) is an example of sign (information) modeling. The formalization of geological concepts must often be controlled in the process of mathematical treatment of geological information.
Different methods of sign (information) modeling play a key role in the Earth sciences. According to the character of the information they can be divided into verbal, graphical and mathematical.
Numerous classifications, concepts and definitions belong to verbal models.
Various graphic geoecological documents – maps, plans, schemes, sections, projections, etc. – should be regarded as graphical models, since they approximately depict the properties of real objects.
Numbers and formulae describing relations and regularities of change of geological formation properties or geological process parameters are used as mathematical models.
In recent years the borderline between these kinds of models has become conditional, owing to the wide use in geoecological investigations of computer modeling based on various kinds of geoecological information.
Cartographic information is digitized with the help of a nominal scale, and the results of measurements made during geochemical and geophysical surveys are depicted as maps with the help of plotters or graphical displays.
Types of mathematical models
Mathematical models are classified according to the construction principle, the character of the bonds between parameters, and the types of problems solved.
According to the construction principle, static and dynamic modeling are distinguished.
Static modeling consists in the mathematical formulation of the investigated object properties according to the results of their study, by inductive generalization of a sample of empirical observations.
Dynamic modeling uses deductive methods, in which the properties of specific objects are derived from general ideas about their structure and the laws defining their properties.
Static modeling appears as:
• transformation of geoecological information into a well-behaved form;
• determination of regularities in mass, random measurements of the studied object properties;
• mathematical description of the revealed regularities (construction of a mathematical model);
• use of the obtained quantitative characteristics to solve specific geoecological problems – testing geoecological hypotheses, selecting the method of further study of the object, etc.;
• estimation of the probability of possible errors in solving the formulated problem by means of the sampling method of object study.
Mathematical models are divided into deterministic and statistical models according to the character of the bonds between parameters and properties of the studied objects.
Deterministic models describe functional connections between arguments and dependent variables: they are written as equations in which, for a given value of the argument, there is only one value of the variable. Deterministic models are seldom used for modeling geoecological objects, because they have little relation to real phenomena, where functional connections are preserved only within a limited range.
Mathematical expressions including at least one random component (i.e. a variable whose value cannot be exactly predicted for a single observation) are called statistical models. They are extensively used for mathematical modeling, since they account well for random fluctuations of experimental data.
The wide class of geoecological problems and objects of study leads to the necessity of using methods from different branches of mathematics (probability theory and mathematical statistics, set theory, group theory, information theory, graph theory, game theory, vector-matrix algebra, differential geometry) in the process of modeling. Moreover, the same problem can often be solved by different methods, while in certain cases the solution of one problem requires a complex of methods from different branches of mathematics. This makes it rather difficult to classify the mathematical methods used in geoecology.
At the same time, according to the types of problems solved and the set of mathematical methods used, all mathematical models are distinctly divided into two groups.
The first group consists of models based mainly on the apparatus of probability theory and mathematical statistics. Geoecological objects are considered to be internally homogeneous, and changes of their properties in space are considered random and independent of the measurement site. Such models can conditionally be called statistical. Depending on the number of simultaneously examined properties they are divided into one-dimensional, two-dimensional and multidimensional models.
Statistical models are usually used for:
• obtaining trusted estimates of geological object properties from sampling data;
• testing hypotheses;
• identifying and describing dependences between properties of geological objects;
• classifying geological objects;
• determining the amount of sampling data needed to estimate geological object properties to a specified accuracy.
The second group consists of models where the properties of geoecological objects are considered to be spatial variables. In these models it is supposed that geoecological object properties depend on the coordinates of the measuring point, and that there are defined regularities in the change of these properties in space. Here the techniques of combinatorics (polynomials), harmonic analysis, vector algebra, differential geometry and other branches of mathematics are used along with certain probabilistic methods (random functions, time series, variance analysis).
The techniques of both static and dynamic modeling are used to study spatial geoecological variables.
Models of spatial geological variables are used to solve problems dealing with:
• testing hypotheses about regularities of the location of geoecological objects relative to each other;
• testing hypotheses about the nature of the processes of geoecological formation development;
• isolation of anomalies in fields;
• classification of geoecological objects according to features of their internal structure;
• development of interpolation and extrapolation techniques for delineating geoecological objects;
• selection of the optimal observation network density and form in the study of geoecological objects.
Principles and methods of mathematical modeling in geoecology
The use of mathematical modeling in geoecology involves a number of difficulties. A mathematical model, like any other, is a simplified analog of the investigated object. Because of the complexity of geoecological objects and processes, no single mathematical model can reproduce all their properties. Therefore it is often necessary to use different mathematical models to describe various properties of one and the same object. At the same time it is necessary to make sure that the selected model adequately depicts exactly those properties of the object which affect the solution of the problem.
A mathematical model cannot characterize the examined properties completely. It is based on certain assumptions connected with the nature of the properties of the modeled object.
Thus, the solution of geoecological problems on the basis of mathematical modeling is a rather difficult process which can be divided into the following steps:
1) problem setting;
2) determination of the geoecological population, i.e. delimiting the geoecological object or the time interval of the geoecological process;
3) determination of the basic properties of the object or parameters of the process in the context of the posed problem;
4) transition from the geoecological population to the tested and sampled populations, subject to the characteristics of the investigation methods;
5) selection of the type of mathematical model;
6) formulation of the mathematical problem in the context of the selected mathematical model;
7) selection of the method of solving the mathematical problem;
8) solution of the mathematical problem, i.e. calculation of the parameters of the mathematical model of the object;
9) interpretation of the obtained results as applied to the geoecological object;
10) estimation of the probability and possible magnitude of errors due to the inadequacy of the model to the object.
Thus, the steps of geoecological model development (tested and sampled geoecological populations) precede the step of mathematical modeling proper.
Sampling methods of study are widespread in geoecological investigations. Local observation areas and samples are very small compared with the areas and Earth interior volumes in which investigations are carried out. Hence there are problems related to the placement of local observation stations and the systematization of sampling data.
An investigator judges the properties of the whole population by studying the part of it which is accessible to observation and sampling; this part is called the sampled population. How well the properties of the sampled population correspond to those of the studied population depends on the location, density and total number of observation points, and also on the size, orientation, form and volume of samples and on the method of measuring the property.
There are three main observation point location systems: uniform sampling, random sampling and multiple stage sampling.
A set of elementary characteristics obtained as a result of measurement or analysis of some property of a geological object can be put into correspondence with each geological population. Such sets of elementary characteristics are called sampling (statistical) populations.
Statistical modeling
Two concepts – the general population and the sample – are the basis of statistical modeling.
The general population is the set of all possible values of the specified characteristic of the examined object or phenomenon.
The sample is the sum total of the observed values of this characteristic.
Statistical modeling assumes that the sampled population satisfies the requirements of mass character, homogeneity, randomness and independence.
The mass condition is due to the fact that statistical regularities are manifested in mass phenomena, so the size of the sampled population must be sufficiently great. It has been established empirically that the reliability of statistical estimates decreases as the sample is reduced from 60 to 30–20 values, and there is no point in applying statistical methods with fewer observations.
The homogeneity condition requires that the sampled population consist of observations belonging to one object and carried out by the same method, i.e. the sample size and the analysis method must be constant.
To summarize the results of geoecological investigations it is often necessary to deal with data obtained by various techniques in different years. Since in the practice of geoecological investigations the homogeneity condition is not always observed, the use of statistical methods has to be accompanied by an analysis of the possible consequences of the violation of this condition. It is necessary to take into account the nature of the geoecological problem being solved, and in some cases to use special methods to test the hypothesis of sample homogeneity.
The randomness condition means that the result of a single sample observation is unpredictable. As a rule, the complexity and changeability of geoecological objects exclude the possibility of accurately estimating their properties before observation. The randomness condition is strictly fulfilled only when the sampling location or the measurement of the studied property is not connected with the value characterizing this property.
The independence condition requires that the result of each observation does not depend on the results of previous and subsequent observations and, for observations over an area or volume, does not depend on the space coordinates. This condition is not fulfilled for most geoecological processes and formations: there are certain regularities of changeability of geoecological formation properties in space and of geoecological process parameters in time. For this reason the field of use of statistical models is limited to objects with no regular changes in space or in time, and to problems in which these regularities can be neglected.
The concept of random event probability is one of the main concepts in statistical modeling.
An event is any fact which can be realized as the result of an experiment or test.
In turn, an experiment or test is the realization of a certain complex of conditions, in which a person does not always take part.
All events are subdivided into certain, impossible and random.
• An event which is sure to happen in a given test is called certain.
• An impossible event is never realized in a given test.
• Random events are characterized by the fact that they may or may not happen in a given test.
A variable taking one or another value, unknown in advance, as the result of a test is called a random variable.
Random variables are discrete (discontinuous) or continuous. The values which they can take may be limited or not.
A discrete variable can take only fixed values, and on any specified interval the number of these values is finite.
A continuous random variable can take infinitely many values in any specified interval.
The quantity called probability is used as a measure of the possibility of random events.
The probability of event A is a number which characterizes the objective possibility of occurrence of this event. It is designated as P(A) or p, i.e. p = P(A).
Classical interpretation:
The probability of event A is equal to the ratio of the number of outcomes favourable to event A to the total number of outcomes:
$$P(A) = \frac{m}{n}, \qquad (1.1)$$
where n is the total number of outcomes and m is the number of outcomes favourable to event A.
P(A) varies from 0 to 1.
The probability of a certain event is equal to 1; the probability of an impossible event is equal to 0.
The classical interpretation works when the probability can be predicted from the symmetry conditions under which the experiment is carried out, hence from the symmetry of the test outcomes, which leads to the concept of "equal possibility" of outcomes.
Therefore the classical interpretation is connected with the concept of equal possibility and is used for experiments reducible to the scheme of events. This requires that the events e1, e2, ..., en be incompatible, i.e. no two of them can occur together; that they form a complete group, i.e. exhaust all possible outcomes; and that they be equally possible, provided the experiment ensures the equal possibility of occurrence of each of them.
It is rather difficult to find regularities in the analysis of individual test results, but the stability of mean characteristics can be discovered in a sequence of identical trials. The ratio m/n of the number m of tests in which event A occurred to the total number of tests n is called the relative frequency of the event in this series of n tests. In almost every sufficiently long series of tests the relative frequency of event A settles down at a defined value m/n, which is taken as the probability of event A. The stability of the relative frequency is verified by special experiments. Such statistical regularities were first discovered in gambling games, i.e. in tests characterized by equally possible outcomes. This opened the door to the statistical approach of numerical determination of probability when the symmetry conditions of the experiment are violated. The relative frequency of event A is called the statistical probability and is written as
$$P^{*}(A) = \frac{m_A}{n}, \qquad (1.2)$$
where $m_A$ is the number of experiments in which event A occurred and n is the total number of experiments.
Formulae (1.1) and (1.2) for the determination of probability look similar but are essentially different. Formula (1.1) serves for the theoretical calculation of the probability of an event from the given conditions of the experiment. Formula (1.2) serves for the experimental determination of the relative frequency of an event; experimental statistical material is needed to use it.
The basic properties of probability
1. For every stochastic event A its probability is determined, and $0 \le P(A) \le 1$.
2. For the certain event U, $P(U) = 1$.
Properties 1 and 2 follow from the definition of probability.
3. If events A and B are mutually exclusive, the probability of the sum of the events is equal to the sum of their probabilities: $P(A + B) = P(A) + P(B)$. This property is called the law of addition of probabilities in the special case (for mutually exclusive events).
4. For arbitrary events A and B
$$P(A + B) = P(A) + P(B) - P(AB).$$
This property is called the law of addition of probabilities in the general case.
For opposite events A and $\bar{A}$, $P(A) + P(\bar{A}) = 1$.
Besides, the impossible event, denoted by $\varnothing$, which contains no outcome from the space of elementary events, is introduced. The probability of the impossible event is equal to 0: $P(\varnothing) = 0$.
The basic characteristics of a random variable
The properties of a random variable can be characterized by different parameters. The most important of them are the mathematical expectation of the random variable, denoted by $M(X)$, and the dispersion $D(X) = \sigma^2(X)$, the square root of which, $\sigma(X)$, is called the standard deviation or standard.
For a discrete (discontinuous) random variable the mathematical expectation $M(X)$ is defined as the sum of the products of the possible values of the random variable and their probabilities:
$$M(X) = x_1 p_1 + x_2 p_2 + \dots + x_k p_k = \sum_{i=1}^{k} x_i p_i \quad\text{or}\quad M(X) = \sum_{i=1}^{k} x_i p_i \Big/ \sum_{i=1}^{k} p_i .$$
Mechanical interpretation of the mathematical expectation: $M(X)$ is the abscissa of the centroid of mass points whose abscissas are equal to the possible values of the random variable and whose masses are equal to the corresponding probabilities.
The mathematical expectation of a continuous random variable is defined as the integral
$$M(X) = \int_{-\infty}^{\infty} x f(x)\,dx ,$$
where the integral is supposed to converge absolutely; here $f(x)$ is the probability density of the distribution of the random variable $X$.
The mathematical expectation $M(X)$ can be understood as the "theoretical mean value of the random variable".
Consider the properties of the mathematical expectation:
1. The mathematical expectation has the same dimension as the random variable.
2. The mathematical expectation can be both positive and negative.
3. The mathematical expectation of a constant C is equal to this constant: M(C) = C.
4. The mathematical expectation of a sum of random variables is equal to the sum of their mathematical expectations: M(X + Y + ... + W) = M(X) + M(Y) + ... + M(W).
5. The mathematical expectation of the product of two or several mutually independent random variables is equal to the product of their mathematical expectations: M(XY) = M(X)M(Y).
6. The mathematical expectation of the product of a random variable by a constant C is equal to the product of the constant by the mathematical expectation of the random variable: M(CX) = CM(X).
Along with the mathematical expectation other characteristics are used: the median $x_{med}$, which divides the distribution of X into two equal parts and is defined by the condition $F(x_{med}) = 0.5$; and the mode $x_{mod}$, the most commonly occurring value of X, which for a continuously distributed random variable is the abscissa of the maximum point of $f(x)$.
All three characteristics (mathematical expectation, median and mode) coincide for symmetrical distributions.
If there are several modes, the distribution is called multimodal.
While the mathematical expectation of a random variable gives its "average", the point on the coordinate line around which the values of the random variable are spread, the dispersion characterizes the degree of spread of the values of the random variable about its average value.
The dispersion of a random variable X is the mathematical expectation of the squared deviation of the random variable from its mathematical expectation:
$$D(X) = M\big[(X - M(X))^2\big].$$
The dispersion is calculated by the formula
$$D(X) = M(X^2) - [M(X)]^2 .$$
For a discrete random variable X this gives
$$D(X) = \sum_{i=1}^{k} x_i^2 p_i - [M(X)]^2 .$$
For a continuous random variable X
$$D(X) = \int_{-\infty}^{\infty} \big(x - M(X)\big)^2 f(x)\,dx .$$
The dimension of the dispersion is equal to the square of the dimension of the random variable.
Properties of the dispersion:
1. The dispersion of a constant is always equal to 0: D(C) = 0.
2. A constant factor can be taken outside the dispersion sign after squaring: D(CX) = C²D(X).
3. The dispersion of the algebraic sum of two independent random variables is equal to the sum of their dispersions: D(X ± Y) = D(X) + D(Y).
The positive root of the dispersion is called the root-mean-square (standard) deviation and is denoted by $\sigma = \sqrt{D(X)}$. The root-mean-square deviation has the same dimension as the random variable.
A random variable is called centered if M(X) = 0, and standardized if M(X) = 0 and σ = 1.
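As a small numerical illustration (added here, not from the original text), the sketch below computes the mathematical expectation, dispersion and standard deviation of a hypothetical discrete random variable given by its distribution series; the values and probabilities are invented for the example.

```python
# Minimal sketch: M(X), D(X) and sigma for a hypothetical discrete random variable.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # possible values x_i (example data)
p = np.array([0.1, 0.4, 0.3, 0.2])      # probabilities p_i, must sum to 1

assert abs(p.sum() - 1.0) < 1e-12        # distribution series check: sum p_i = 1

m = np.sum(x * p)                        # M(X) = sum x_i p_i
d = np.sum(x**2 * p) - m**2              # D(X) = M(X^2) - [M(X)]^2
sigma = np.sqrt(d)                       # standard deviation

print(f"M(X) = {m:.3f}, D(X) = {d:.3f}, sigma = {sigma:.3f}")
```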
In the general case the properties of a random variable can be characterized by ordinary moments and moments about the mean (central moments).
The ordinary moment of order K is the number $\alpha_K$ determined by the formula
$$\alpha_K = M(X^K) = \begin{cases} \sum_i x_i^K p_i & \text{for discrete } X,\\[4pt] \int_{-\infty}^{\infty} x^K f(x)\,dx & \text{for continuous } X,\end{cases}$$
where $M(X^K)$ is the mathematical expectation of the K-th power of the random variable (for random variables of discrete and continuous type respectively).
The moment about the mean of order K is the number $\mu_K$ determined by the formula
$$\mu_K = M\big[(X - \alpha_1)^K\big] = \begin{cases} \sum_i (x_i - \alpha_1)^K p_i & \text{for discrete } X,\\[4pt] \int_{-\infty}^{\infty} (x - \alpha_1)^K f(x)\,dx & \text{for continuous } X.\end{cases}$$
From the definitions of the moments it follows, in particular, that
$$\alpha_0 = \mu_0 = 1,\qquad \alpha_1 = M(X),\qquad D(X) = \mu_2 = \alpha_2 - \alpha_1^2 .$$
Derived characteristics based on the ordinary and central moments are often used. The coefficient of variation is the value
$$V = \frac{\sigma}{\alpha_1} \cdot 100\% .$$
The coefficient of variation is a dimensionless value used to compare the degrees of variation of random variables with different units of measurement.
The skewness ratio (coefficient of skewness) of a distribution is the value
$$A = \frac{\mu_3}{\sigma^3} .$$
The coefficient of skewness characterizes the degree of skewness of the random variable distribution relative to its mathematical expectation. For symmetric distributions A = 0. If the peak of the graph of f(x) is shifted towards small values (the "tail" of f(x) extends to the right), then A > 0; in the contrary case A < 0 (see Fig. 1).
Fig. 1 Dependence of probability density graphs f(x) on coefficient of skewness A
The coefficient of excess (peakedness) is the value
$$E = \frac{\mu_4}{\sigma^4} - 3 .$$
The coefficient of excess is a measure of the sharpness of the probability density graph f(x) (Fig. 2).
Fig. 2 Dependence of symmetric probability density graphs f(x) on the coefficient of excess E (curves for E > 0, E = 0, E < 0)
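As a numerical illustration (an addition, not in the original text), the sketch below computes the coefficient of variation, skewness and excess for a hypothetical sample using NumPy and SciPy; the data array is invented for the example.

```python
# Minimal sketch: coefficient of variation, skewness A and excess E of a sample.
import numpy as np
from scipy import stats

x = np.array([2.1, 2.5, 2.4, 3.0, 2.2, 2.8, 2.6, 2.3, 2.9, 2.7])  # example sample

mean = x.mean()
sigma = x.std(ddof=0)                  # root-mean-square deviation

V = sigma / mean * 100                 # coefficient of variation, %
A = stats.skew(x, bias=True)           # coefficient of skewness, mu_3 / sigma^3
E = stats.kurtosis(x, fisher=True, bias=True)  # excess, mu_4 / sigma^4 - 3

print(f"V = {V:.1f}%, A = {A:.3f}, E = {E:.3f}")
```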
Laws of random variable distribution
The law of distribution of a random variable is the relationship between all possible values of the random variable and their corresponding probabilities. The law of distribution can be presented in tabulated form, graphically, or in the form of a distribution function.
The distribution series is the set of possible values $x_i$ and their corresponding probabilities $p_i = P(X = x_i)$; it can be presented in tabulated form.
Table 1
Distribution series of a discrete random variable X
x_i:   x_1   x_2   ...   x_k
p_i:   p_1   p_2   ...   p_k
Here the probabilities $p_i$ satisfy $\sum_{i=1}^{k} p_i = 1$, where the number of possible values k can be finite or infinite.
A graphic presentation of the distribution series is called a distribution polygon. To draw the distribution polygon, the possible values of the random variable ($x_i$) are plotted on the abscissa and the probabilities $p_i$ on the ordinate; the points $A_i$ with coordinates ($x_i$, $p_i$) are connected by straight line segments.
If the true probabilities are not known, the relative frequency of occurrence of each value is plotted on the ordinate instead.
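A small illustration (added, not from the original): building the empirical distribution series of a discrete sample from relative frequencies, which gives the points of its distribution polygon; the observation list is invented.

```python
# Minimal sketch: empirical distribution series (values and relative frequencies).
from collections import Counter

observations = [2, 3, 3, 1, 2, 2, 4, 3, 2, 1, 3, 2]   # hypothetical discrete sample
n = len(observations)

counts = Counter(observations)
series = sorted((x, m / n) for x, m in counts.items())  # (x_i, relative frequency)

for x_i, freq in series:
    print(f"x = {x_i}: relative frequency = {freq:.3f}")

assert abs(sum(f for _, f in series) - 1.0) < 1e-12     # frequencies sum to 1
```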
The distribution function is the most general form of describing the distribution law. It defines the probability that the random variable X will take a value smaller than a specified value x. This probability depends on x and is therefore a function of x:
$$F(x) = P(X < x).$$
For a discrete random variable the function F(x) is calculated by the formula
$$F(x) = \sum_{x_i < x} p_i ,$$
where the summation is carried out over all i for which $x_i < x$.
A continuous random variable is characterized by a nonnegative function f(x), called the probability density, defined by
$$f(x) = \lim_{\Delta x \to 0} \frac{P(x \le X < x + \Delta x)}{\Delta x} .$$
For any x the probability density f(x) satisfies the equality
$$F(x) = \int_{-\infty}^{x} f(x)\,dx ,$$
linking it with the distribution function F(x).
Thus, a continuous random variable is given either by the distribution function F(x) (integral law) or by the probability density f(x) (differential law).
[Graphs of the distribution function F(x) for a discrete random variable and for a continuous random variable]
The distribution function F(x) (integral law of distribution) has the following properties:
1) $P(a \le X < b) = F(b) - F(a)$;
2) $F(x_1) \le F(x_2)$ if $x_1 < x_2$;
3) $\lim_{x \to +\infty} F(x) = 1$;
4) $\lim_{x \to -\infty} F(x) = 0$.
The probability density f(x) (differential law of distribution) has the following basic properties:
1) $f(x) \ge 0$;
2) $f(x) = \dfrac{dF(x)}{dx} = F'(x)$;
3) $\int_{-\infty}^{x} f(t)\,dt = F(x)$;
4) $\int_{-\infty}^{\infty} f(x)\,dx = 1$;
5) $P(a \le X < b) = \int_{a}^{b} f(x)\,dx$.
Geometrically, the probability that X falls in the interval (a, b) is equal to the area of the curvilinear trapezoid corresponding to the definite integral $\int_{a}^{b} f(x)\,dx$ (Fig. 3).
Fig. 3 Graphic presentation of probability density function (differential function of distribution)
Consider laws of distribution that are most often used.
Normal Distribution (the term was first used by Galton in 1889; the distribution is also called Gaussian).
The normal distribution (the "bell-shaped curve" which is symmetrical about the mean) is a
theoretical function commonly used in inferential statistics as an approximation to sampling
distributions. In general, the normal distribution provides a good model for a random variable,
when:
1. There is a strong tendency for the variable to take a central value;
2. Positive and negative deviations from this central value are equally likely;
3. The frequency of deviations falls off rapidly as the deviations become larger.
As an underlying mechanism that produces the normal distribution, one may think of an infinite
number of independent random (binomial) events that bring about the values of a particular
variable. For example, there are probably a nearly infinite number of factors that determine a
person's height (thousands of genes, nutrition, diseases, etc.). Thus, height can be expected to be
normally distributed in the population. The normal distribution function is determined by the
following formula:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty,$$
where
μ is the mean,
σ is the standard deviation,
e is the base of the natural logarithm, sometimes called Euler's e (2.71...),
π is the constant pi (3.14...).
The exact form of the normal distribution (the specific "bell curve") is defined by only two parameters: the mean and the standard deviation.
A specific property of the normal distribution is that 68% of all observations fall in the range ±1 standard deviation from the mean, and the range ±2 standard deviations includes 95% of the values.
In other words, under the normal distribution, standardized observations smaller than −2 or larger than +2 have a relative frequency of less than 5% (a standardized observation is obtained by subtracting the mean from the raw value and dividing the result by the standard deviation).
The graphical method, sample parameters of the distribution form and goodness measures are usually used to estimate the agreement of available experimental data with the normal distribution law.
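The following sketch (an added illustration, not part of the original lecture) uses scipy.stats.norm to check the 68%/95% property numerically for an arbitrary choice of μ and σ.

```python
# Minimal sketch: probability mass within ±1 and ±2 standard deviations of the mean.
from scipy.stats import norm

mu, sigma = 10.0, 2.0            # example parameters of a normal distribution
dist = norm(loc=mu, scale=sigma)

within_1 = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)           # ~0.683
within_2 = dist.cdf(mu + 2 * sigma) - dist.cdf(mu - 2 * sigma)   # ~0.954

print(f"P(|X - mu| < 1*sigma) = {within_1:.3f}")
print(f"P(|X - mu| < 2*sigma) = {within_2:.3f}")
```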
Log-normal Distribution. The log-normal distribution is often used in simulations of variables such as personal incomes, age at first marriage, or tolerance to poison in animals. In general, if x is a sample from a normal distribution, then $y = e^x$ is a sample from a log-normal distribution. Thus, the log-normal density is defined as
$$f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}, \qquad x > 0,\ -\infty < \mu < +\infty,\ \sigma > 0,$$
where
μ is the scale parameter,
σ is the shape parameter,
e is the base of the natural logarithm, sometimes called Euler's e (2.71...).
Fig. 4 Graphs of f(x) and F(x) of the log-normal distribution (y = lognorm(x; 0; 0.5), p = ilognorm(x; 0; 0.5))
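As an added illustration (not in the original), the sketch below draws a sample from a normal distribution and shows that its exponential follows a log-normal law, using scipy.stats.lognorm with the same μ = 0, σ = 0.5 as in Fig. 4.

```python
# Minimal sketch: exp of a normal sample follows a log-normal distribution.
import numpy as np
from scipy import stats

mu, sigma = 0.0, 0.5
rng = np.random.default_rng(0)

x = rng.normal(loc=mu, scale=sigma, size=5000)   # normal sample
y = np.exp(x)                                    # log-normal sample

# In scipy's parameterization, s is the shape (sigma) and scale = exp(mu).
theoretical = stats.lognorm(s=sigma, scale=np.exp(mu))
print("sample mean:", y.mean(), "theoretical mean:", theoretical.mean())
```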
A continuous random variable X has the chi-squared distribution with m degrees of freedom if it can be represented as the sum of squares of m values distributed according to the standard normal law N(0; 1), i.e. if its probability density is of the form (see Fig. 5):
Fig. 5 Graphs of f(x) and F(x) of the chi-squared distribution (y = chi2(x; 3), p = ichi2(x; 3))
$$f_{Ch}(x; m) = \frac{1}{2^{m/2}\,\Gamma(m/2)}\, e^{-x/2}\, x^{\frac{m}{2}-1}, \qquad 0 \le x,$$
where
$$\Gamma(z) = \int_{0}^{\infty} e^{-t} t^{z-1}\,dt$$
is the gamma function:
$$\Gamma\!\left(\frac{2n+1}{2}\right) = \frac{(2n-1)!!}{2^n}\sqrt{\pi} \quad\text{and}\quad \Gamma(n+1) = n! \quad\text{for } n = 0, 1, 2, \dots$$
Characteristics of the chi-squared distribution:
$$M[x] = m,\quad x_{mod} = m - 2,\quad D[x] = 2m,\quad A = 2^{3/2}/\sqrt{m},\quad E = 12/m .$$
The density graph of the chi-squared distribution is asymmetric (positively skewed, since A > 0) and peaked (E > 0), with $x_{mod} < M[x]$.
The dependence of the chi-squared density graphs on m is shown in Fig. 6 below.
Fig. 6 Dependence of f(x) graphs of the chi-squared distribution on m (curves for m = 5 and m = 10)
Student's t Distribution. The Student's t distribution is symmetric about zero, and its general shape is similar to that of the standard normal distribution. It is most commonly used in testing hypotheses about the mean of a particular population. The Student's t distribution is defined (for m = 1, 2, ...) as
$$f_t(x; m) = \frac{\Gamma\!\left(\frac{m+1}{2}\right)}{\sqrt{\pi m}\;\Gamma\!\left(\frac{m}{2}\right)}\left(1 + \frac{x^2}{m}\right)^{-\frac{m+1}{2}}, \qquad -\infty < x < \infty .$$
Fig. 7 Graphs of f(x) and F(x) of the t-distribution (y = student(x; 5), p = istudent(x; 5))
Characteristics of the t-distribution:
$$M[x] = x_{med} = x_{mod} = 0,\qquad D[x] = \frac{m}{m-2},\qquad A = 0,\qquad E = \frac{6}{m-4} .$$
If the number of degrees of freedom is large (m > 30), the t-distribution is close to the normal distribution N(x; 0; 1).
One-dimensional statistical models
One-dimensional statistical models are used to solve two types of problems: to estimate average parameters of geoecological objects and to verify hypotheses statistically.
Owing to possible deviations of the conditions of geoecological object study from the strict requirements imposed on a statistical experiment, the statistical analysis of geoecological data should in practice be divided into two stages – exploratory and confirmatory.
The aim of the first stage is to translate the observational data into a more compact and visual form which allows regularities in these data to be identified. The second stage makes it possible to approach the traditional statistical methods of solving geoecological problems in a more substantiated way.
During the first stage it is reasonable to apply methods which are free of a priori assumptions about the properties of the sampled population and which do not need labour-intensive calculations. Preference should be given to methods in which numerical information is translated into graphic form.
Statistical characteristics of a sample random variable
The calculation of statistical characteristics of a sample random variable is the basis for most computations. The most commonly used statistical characteristics of a one-dimensional random variable are:
• range
• median
• mode
• average value
• dispersion
• root-mean-square deviation
• coefficient of variation
• skewness
• excess
Suppose n measurements of a property x have been made. It is necessary to find the statistical characteristics of this set of measurements.
The range is the difference between the maximum $x_{max}$ and minimum $x_{min}$ values of the property: $p = x_{max} - x_{min}$.
The median is the middle value of the ordered series of values. To find the median, all values are arranged in increasing (or decreasing) order and the middle term of the series is taken. If n is even and there are two values in the middle of the series, the median is equal to their half-sum. The mode is the most frequently occurring value of the random variable.
The average value is the arithmetic mean of all measured values:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .$$
The median, mode and average value are characteristics of position; the measured values of the random variable are grouped near them.
The dispersion is the average of the squared deviations of the values of the random variable from its average value (the dispersion of a random variable is a measure of its spread, i.e. of its deviation from the mathematical expectation):
$$S^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 .$$
The average square deviation is the square root of the dispersion:
$$S = \sqrt{S^2} .$$
The average square deviation has the same dimension as the random variable and the average value. For example, if the values of the random variable are measured in meters, the average square deviation will also be expressed in meters.
The coefficient of variation is the ratio of the average square deviation to the average value:
$$V = \frac{S}{\bar{x}} .$$
The coefficient of variation is expressed in unit fractions or (after multiplication by 100) in percentages. It makes sense to calculate the coefficient of variation only for positive random variables.
The dispersion, average square deviation, coefficient of variation and range are measures of the scatter of the values of the random variable around the average value: the larger these measures are, the greater the scatter.
Skewness is the degree of asymmetry of the distribution of values of the random variable relative to the average value:
$$A = \frac{1}{nS^3}\sum_{i=1}^{n} (x_i - \bar{x})^3 .$$
Excess is the degree of peakedness or flat-toppedness of the distribution of values of the random variable relative to the normal distribution law:
$$E = \frac{1}{nS^4}\sum_{i=1}^{n} (x_i - \bar{x})^4 - 3 .$$
Skewness and excess are dimensionless values. They show singularities of the grouping of values of the random variable in the neighborhood of the average value.
Thus:
• the median, mode and average value are characteristics of position;
• the dispersion, average square deviation, coefficient of variation and range are measures of scatter;
• skewness and excess show singularities of the grouping of values.
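For illustration (an addition, not from the original text), the following sketch computes all the listed sample characteristics for a hypothetical set of measurements, using the 1/n form of the formulas written above.

```python
# Minimal sketch: sample characteristics of a hypothetical set of measurements.
from collections import Counter

import numpy as np
from scipy import stats

x = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.0, 4.7, 5.3, 4.6, 5.0])  # example data

value_range = x.max() - x.min()                     # range
median = np.median(x)
mode = Counter(x.tolist()).most_common(1)[0][0]     # most frequent value
mean = x.mean()
dispersion = x.var(ddof=0)                          # 1/n form, as in the text
std = x.std(ddof=0)                                 # average square deviation
V = std / mean                                      # coefficient of variation
A = stats.skew(x, bias=True)                        # skewness
E = stats.kurtosis(x, fisher=True, bias=True)       # excess

print(value_range, median, mode, mean, dispersion, std, V, A, E)
```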
Statistical estimations can be point or interval estimations. In point estimation the unknown characteristic of the random variable is estimated by a single number; in interval estimation it is estimated by an interval within which, with a specified probability, the true value of the estimated variable must lie.
A point estimate carries no information about the precision of the obtained result. The smaller the sample and the stronger the mutability of the property, the larger the error can be. That is why, when the sample is very small, it is desirable to know the interval of property values in which the unknown true average value falls with a specified probability.
Suppose the statistical characteristic Θ* found from the sampling data serves as an estimate of the unknown parameter Θ. Θ will be considered a constant number (Θ may also be a random variable). It is clear that Θ* determines the parameter the more adequately, the smaller the absolute difference |Θ – Θ*| is. In other words, if δ > 0 and |Θ – Θ*| < δ, then the smaller δ is, the more exact the estimation. Thus a positive number δ characterizes the accuracy of the estimation.
However, statistical methods do not allow us to assert definitely that the estimate Θ* satisfies the inequality |Θ – Θ*| < δ; we can speak only about the probability γ with which this inequality holds.
The reliability (confidence probability) of the estimation of Θ by Θ* is the probability γ with which the inequality |Θ – Θ*| < δ holds. Usually the reliability of the estimation is given in advance, and a number close to 1 is taken as γ. Most often the reliability is taken equal to 0.95, 0.99 or 0.999.
Suppose the probability that |Θ – Θ*| < δ is equal to γ:
P[ |Θ – Θ*| < δ ] = γ.
Replacing the inequality |Θ – Θ*| < δ by the equivalent two-sided inequality –δ < Θ – Θ* < δ, or Θ* – δ < Θ < Θ* + δ, we have P[ Θ* – δ < Θ < Θ* + δ ] = γ.
This relation should be understood in the following way: the probability that the interval (Θ* – δ, Θ* + δ) contains the unknown parameter Θ is equal to γ.
The interval (Θ* – δ, Θ* + δ) which covers the unknown parameter with the desired reliability γ is called a confidence interval.
The way the confidence interval for the mathematical expectation is constructed depends on whether the dispersion σ² is known. If it is known, the confidence interval corresponding to the desired reliability (confidence probability) is given by
$$\left(\bar{x} - t\frac{\sigma}{\sqrt{n}};\ \bar{x} + t\frac{\sigma}{\sqrt{n}}\right).$$
The low probability at which an event can be considered practically impossible is called the significance level. Usually the significance level is denoted by α. The confidence probability and the significance level are related by γ = 1 – α.
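As an added sketch (not part of the original), the code below computes a confidence interval for the mean of a small hypothetical sample; since the dispersion is estimated from the sample, Student's t quantile is used instead of the normal one.

```python
# Minimal sketch: confidence interval for the mean at confidence probability 0.95.
import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2])  # hypothetical sample
gamma = 0.95                                               # confidence probability
n = len(x)

mean = x.mean()
s = x.std(ddof=1)                           # sample estimate of the standard deviation
t = stats.t.ppf((1 + gamma) / 2, df=n - 1)  # two-sided t quantile

delta = t * s / np.sqrt(n)
print(f"confidence interval: ({mean - delta:.3f}; {mean + delta:.3f})")
```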
Statistical verification of hypotheses
Many geoecological problems are solved by analogy: regularities established in the study of an analogous object are used to explain structural features of underexplored objects. To choose the right object-analog it is necessary to estimate the measure of its similarity to the prototype system.
In other cases it is necessary to estimate the measure of discrepancy between geoecological objects according to one or another physical property.
Statistical methods of hypothesis verification are used to solve the problem of similarity or difference of geoecological objects. In geoecological practice these methods are used to test hypotheses:
• about the equality of average values of the studied property obtained by different methods for the same object, or by one method for different objects;
• about the equality of the dispersions of two random variables from sample data;
• about the homogeneity of the studied object.
Statistical verification of hypotheses is carried out with the help of goodness measures (statistical criteria).
A goodness measure is the value of a certain function K = f(X1, X2, ..., Xn), where X1, X2, ..., Xn are random variables characterizing the verified hypothesis. The function is chosen in such a way that, if the verified hypothesis is true, its values are represented by a random variable with a distribution known in advance.
The verified hypothesis is accepted if the value of K calculated from the sample values X1, X2, ..., Xn is less or greater (depending on the statement of the hypothesis) than the theoretical value of K for similar conditions and a specified probability α taken from that distribution. The probability α here corresponds to the probability level of a practically impossible event and is called the significance level.
The probability (1 – α), at which the validity of the decision is a practically certain event, is accordingly called the confidence probability.
Rejection of the verified hypothesis when it is actually true is called a type 1 error, and acceptance of a false hypothesis is called a type 2 error.
If the probability of a type 2 error is denoted by β, then (1 – β), i.e. the probability of avoiding this error, is called the power of the criterion relative to the competing hypothesis.
Increasing the confidence probability decreases the probability of a type 1 error but increases the probability of a type 2 error. The field of application of particular goodness measures is usually limited by certain conditions, and their power depends on the character of the competing (alternative) hypothesis and on the sample size.
To solve problems by statistical verification of hypotheses it is necessary to perform the following operations:
• to formulate a clear testable hypothesis (H0) and an alternative hypothesis (H1) based on the point of the problem;
• to choose the most powerful tests which are not contrary to the properties of the studied random variables;
• to estimate the consequences of type 1 and type 2 errors for the situation of the solved problem, and to choose the significance level so as to minimize the losses resulting from an incorrect decision;
• to calculate the empirical value of the goodness measure K from the sample data, to compare it with the theoretical value of K for the stated significance level and to make a decision relative to hypothesis H0;
• to interpret the obtained result in respect to the posed problem.
Statistical goodness measures are divided into parametric and nonparametric. Parametric goodness measures are derived from various statistical distribution laws and can be used only if the sample data distribution is in agreement with the corresponding law. Nonparametric goodness measures can be used even if the distribution law of the investigated values is unknown or corresponds to none of the known laws. Nonparametric goodness measures usually have less power than parametric ones, but their field of application is essentially wider.
Verification of hypotheses on the distribution law
Most statistical methods of problem solving are based on the use of the properties of various distribution laws. However, the investigator usually cannot know beforehand what the properties of the samples obtained in the investigation will be. That is why the comparison of the empirical distributions with known theoretical ones is a preliminary stage of specific problem solution.
Theoretical distribution conformance testing. In most cases the law of distribution and its parameters are unknown in the process of solving a real problem. At the same time, the statistical methods to be applied assume a certain law of distribution. Hence, an important problem arising in the analysis of one sample is the estimation of the degree of concordance of the obtained empirical data with some theoretical distribution. The assumption of a normal distribution of the population is verified most often, because the majority of statistical procedures are designed for samples obtained from normally distributed populations.
The graphical method, sample parameters of the distribution form and goodness measures are used to estimate the correspondence of experimental data to the normal distribution law.
The graphical method allows a provisional estimate of the dissimilarity or coincidence of the distributions.
If the number of observations is large (n > 100), the calculation of the sample parameters of the distribution form (excess and skewness) produces quite good results. The assumption of normality of the distribution is said not to contradict the available data if the skewness is close to zero, i.e. lies in the range from -0.2 to 0.2, and the excess lies in the range from -1 to 1.
The use of goodness measures produces the most satisfactory results. Goodness measures are statistical measures intended for verifying the agreement of experimental results with a theoretical model. Here the null hypothesis (H0) is the statement that the population from which the sample was obtained does not differ from a normal one.
The nonparametric χ2 (chi-square) measure is the most widely used among goodness measures. It is based on the comparison of the empirical frequencies of the grouping intervals with the theoretical frequencies calculated according to the formulae of the normal distribution.
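An added sketch (not from the original): testing the agreement of a hypothetical sample with the normal law by comparing grouped empirical and theoretical frequencies with the chi-square measure; the binning scheme and sample are assumptions made for the example.

```python
# Minimal sketch: chi-square test of agreement of a sample with the normal law.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=1.5, size=200)        # hypothetical sample

mu, sigma = x.mean(), x.std(ddof=1)                  # normal law fitted to the data
k = 8                                                # number of grouping intervals
edges = stats.norm(mu, sigma).ppf(np.linspace(0, 1, k + 1))
edges[0], edges[-1] = x.min() - 1, x.max() + 1       # close the outer intervals

observed, _ = np.histogram(x, bins=edges)            # empirical frequencies
expected = np.full(k, len(x) / k)                    # theoretical frequencies

chi2 = np.sum((observed - expected) ** 2 / expected)
df = k - 1 - 2                                       # two parameters were estimated
p_value = stats.chi2.sf(chi2, df)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")       # large p: H0 (normality) kept
```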
Verification of hypotheses on location
The most important question arising in the analysis of two samples is whether there are differences between them. For this purpose the statistical hypothesis that both samples belong to the same population, or that the population means are equal, is usually verified. So-called tests of differences are used to solve such problems. Different statistical tests can be used to verify the same hypothesis; the correct selection of the test is determined by the characteristics of the data, by the hypothesis being verified, and also by the level of the investigator's experience.
Parametric tests. Parametric tests are used to verify hypotheses about location and distribution. Student's t-test (test of differences) is the most popular parametric test for verifying hypotheses about population means (mathematical expectations).
The test allows one to estimate the probability that the two means belong to two different populations. If this probability p is lower than the significance level (p < 0.05), the samples are considered to belong to two different populations.
Two cases of the use of the t-test can be distinguished. In the first case it is applied to test the hypothesis of equality of the population means of two independent, unrelated samples (the so-called two-sample test); here there is a test group and a control group. In the second case, when the same group of objects produces the numerical material for testing hypotheses about means, the so-called paired test is used; the samples are then called dependent, or connected.
In both cases the requirements of normality of the distribution of the investigated characteristic in each compared group and of equality of the dispersions in the compared populations must be fulfilled. The correct use of Student's t-test for two groups is therefore often difficult, since these conditions cannot always be definitely checked.
The use of the parametric Student's t-test is based on the fact that, if a sample X1, X2, ..., Xk of n1 values and a sample Y1, Y2, ..., Yk of n2 values are selected from a normally distributed population, the variable
$$t_{xy} = \frac{\bar{x} - \bar{y}}{\sqrt{S_1^2/n_1 + S_2^2/n_2}},$$
where $\bar{x}$, $\bar{y}$ are the sample estimates of the means and $S_1^2$, $S_2^2$ are the sample estimates of the dispersions, follows the Student distribution law with (n1 + n2 – 2) degrees of freedom. Verification of the hypothesis of equality of two sample means consists in substituting into this formula $\bar{x}$ and $S_1^2$ for the first sample and $\bar{y}$ and $S_2^2$ for the second sample, and comparing the obtained value of the t-test with the tabulated value for this number of degrees of freedom and the specified confidence probability. If the calculated value of the test is greater than the tabulated one, the hypothesis of equality of the sample means is rejected.
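An added illustration (not in the original): comparing the means of two hypothetical samples with Student's t-test via scipy.stats.ttest_ind; the samples are invented.

```python
# Minimal sketch: Student's t-test for equality of two sample means.
import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0])   # sample 1 (hypothetical)
y = np.array([10.9, 11.2, 10.7, 11.0, 10.8, 11.1])        # sample 2 (hypothetical)

t_stat, p_value = stats.ttest_ind(x, y, equal_var=True)   # assumes equal dispersions
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# If p is below the chosen significance level (e.g. 0.05),
# the hypothesis of equal means is rejected.
```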
To verify the hypothesis of equality of sample means when the sampling data correspond to the log-normal model, it is recommended to use the Rodionov test. D.A. Rodionov established that the variable
$$Z = \frac{\lg\bar{x} - \lg\bar{y} + 1.153\,\big(S^2_{\lg x} - S^2_{\lg y}\big)}{\sqrt{S^2_{\lg x}/n_1 + S^2_{\lg y}/n_2 + 2.65\,\big(S^4_{\lg x}/(n_1 - 1) + S^4_{\lg y}/(n_2 - 1)\big)}}$$
is distributed asymptotically normally with mathematical expectation 0 and dispersion 1. Therefore the theoretical value of the variable Z is found from the table of values of Laplace's integral function.
Nonparametric tests (the Van der Waerden test, the Wilcoxon test, the χ2 goodness-of-fit test) are usually used for small samples or in cases when average values are calculated from semiquantitative data – for example, from the results of semiquantitative spectral analysis. Nonparametric tests are also used when the data distribution law differs from the normal one or is unknown.
Verification of the hypothesis of equality of means determined from two samples (A and B) with the help of the Van der Waerden X-test begins with ranking all values of both samples, i.e. writing them as a series in increasing order. The X-test is the variable
$$X = \sum_{i=1}^{h} \psi\!\left(\frac{i}{n + 1}\right),$$
where n is the total number of values in the two samples; h is the number of observations in sample B; i is the sequence number of each value of sample B in the combined series; ψ(...) is the inverse of the normal distribution function.
The nonparametric Wilcoxon test (W) is also based on the ranking procedure; it is the sum of the ranks $R_i$ of the smaller sample in the combined ranked series of both samples:
$$W = \sum_{i=1}^{n_1} R_i, \qquad n_1 \le n_2 .$$
If the hypothesis of equality of the means of populations A and B is true, i.e. $H_0: \bar{x}_1 = \bar{x}_2$, the mathematical expectation of the Wilcoxon statistic (MW) and the possible deviation of the sample estimate from it depend only on the sample sizes n1 and n2.
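An added sketch (not from the original): a Wilcoxon rank-sum comparison of two hypothetical samples using scipy.stats.ranksums, which normalizes the rank sum and reports a p-value.

```python
# Minimal sketch: Wilcoxon rank-sum test for two independent samples.
import numpy as np
from scipy import stats

a = np.array([3.1, 2.8, 3.5, 3.0, 2.9, 3.3])       # sample A (hypothetical)
b = np.array([3.6, 3.9, 3.7, 4.0, 3.8, 3.5, 3.7])  # sample B (hypothetical)

stat, p_value = stats.ranksums(a, b)                # normalized rank-sum statistic
print(f"z = {stat:.3f}, p = {p_value:.4f}")
# Small p: the hypothesis of equal population means (locations) is rejected.
```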
Verification of hypotheses on equality of dispersions
The degree of variation of various objects is estimated by the magnitude of the dispersion or of the coefficient of variation of some property, and such an estimate is necessary for a justified use of the analog method in the process of study.
Fisher's test. Fisher's test is used to verify the hypothesis that two dispersions belong to one population and, therefore, are equal. The data are assumed to be independent and distributed according to the normal law. The hypothesis of equality of the dispersions is accepted if the ratio of the larger dispersion to the smaller one is less than the critical value of Fisher's distribution:
$$F = \frac{S_1^2}{S_2^2}, \qquad F < F_{crit},$$
where $F_{crit}$ depends on the significance level and the numbers of degrees of freedom of the dispersions in the numerator and denominator.
The Siegel–Tukey test is the nonparametric analog of Fisher's test. It can be used for any kind of distribution and works in the presence of bad (outlying) values, so it is convenient for problem solution, especially for small samples.
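An added illustration (not part of the original): computing the F ratio of two sample dispersions and comparing it with the critical value of Fisher's distribution via scipy; the samples are invented.

```python
# Minimal sketch: Fisher's test for equality of two dispersions.
import numpy as np
from scipy import stats

x = np.array([5.1, 4.8, 5.6, 5.0, 4.9, 5.4, 5.2, 5.3])   # sample 1 (hypothetical)
y = np.array([5.0, 6.1, 4.2, 5.8, 4.5, 6.0, 4.7])         # sample 2 (hypothetical)

s1, s2 = x.var(ddof=1), y.var(ddof=1)
if s1 < s2:                                   # larger dispersion in the numerator
    s1, s2 = s2, s1
    df1, df2 = len(y) - 1, len(x) - 1
else:
    df1, df2 = len(x) - 1, len(y) - 1

F = s1 / s2
F_crit = stats.f.ppf(0.95, df1, df2)          # critical value, significance level 0.05
print(f"F = {F:.2f}, F_crit = {F_crit:.2f}")  # F < F_crit: equal dispersions accepted
```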
Analysis of sample homogeneity
When one-dimensional statistical models are used to describe the properties of a geological object, the object is supposed to be homogeneous with respect to the investigated property. The homogeneity problem is usually solved on the basis of the assumed geoecological model. The statement of object homogeneity is obtained by verifying the hypothesis of its statistical homogeneity; in this case quantitative data on the variability of its properties are used.
The problems based on verification of the hypothesis of statistical homogeneity of geoecological objects can be divided into three types:
• selection of bad (outlying) values;
• separation of non-homogeneous samples;
• estimation of the degree of effect of different factors on the variability of the geological object properties.
In the case of a normal distribution of the background population, this problem is solved with the help of Smirnov's and Fergusson's parametric tests.
N.V. Smirnov determined that, if the extreme value of the population is not a bad value, the variable
$$t = \frac{x_{max} - \bar{x}}{S_{sh}}$$
has the distribution named after him. In this formula $x_{max}$ is the extreme value of the population; $\bar{x}$ is the arithmetic average; $S_{sh}^2$ is the shifted (biased) estimate of the dispersion, which is calculated from the unbiased estimate of the dispersion $S^2$ according to the formula
$$S_{sh}^2 = S^2\,\frac{n - 1}{n}.$$
If the calculated value of the coefficient of skewness exceeds the tabulated value for the given confidence probability and n degrees of freedom, the extreme value of the population should be recognized as a bad value. If the distribution of the background population differs from the normal one, large values which actually belong to the investigated population will often be recognized as "anomalous". This limits the field of use of both tests: they can be used only if it is known beforehand that the distribution of the background population is normal. In practice, values deviating from the average by more than 3σ (or 2σ) in absolute magnitude are very often recognized as bad values. However, this way cannot be regarded as correct, since it does not guarantee against errors of either the first or the second kind, and the probability of these errors cannot be estimated.
Analysis of variance
The properties of any complex natural system usually depend on a number of factors responsible for their variability. Identification of these factors and estimation of the degree of their effect on the variability (non-homogeneity) of the investigated object properties are carried out with the help of analysis of variance.
Analysis of variance is intended for investigating problems in which one or several independent factors, each possessing several gradations, act on the measured random variable. In single-factor, two-factor analysis, etc., the factors affecting the result are considered known, and the question is to determine the importance of, or to estimate, this effect. The use of analysis of variance is possible if we can suppose that the selected groups come from normal populations and that the observations within groups are independent.
In the uniform single-factor analysis of variance of a random variable x relative to a factor A having p levels, when the numbers of measurements at each level are equal to q, the results of observations are designated xij, where i is the number of the observation (i = 1, 2, ..., q) and j is the number of the factor level (j = 1, 2, ..., p); they are written as a table:
Observation number | A1      A2      …      Ap   (levels of factor A)
1                  | x11     x12     …      x1p
2                  | x21     x22     …      x2p
…                  | …       …       …      …
q                  | xq1     xq2     …      xqp
Group means        | x̄гр1    x̄гр2    …      x̄гр p
The following statistics are calculated by these data:
1) the total sum of squared deviations of the observed values of the characteristic from the grand mean x̄:
Cобщ = ∑j ∑i (xij − x̄)², where the summation runs over i = 1, …, q and j = 1, …, p;
2) the factor sum of squared deviations of the group means from the grand mean, which characterizes the dispersion between the groups:
Cфакт = q ∑j (x̄гр j − x̄)², j = 1, …, p;
3) the residual sum of squared deviations of the observed values from their group means, which characterizes the dispersion within the groups:
Cост = ∑i (xi1 − x̄гр1)² + ∑i (xi2 − x̄гр2)² + … + ∑i (xip − x̄гр p)², i = 1, …, q.
In the process of single-factor analysis of variance the calculations can be simplified using the
equality Сост = Собщ – Сфакт;
4) the total, factor and residual variances:
S²общ = Cобщ / (pq − 1);   S²факт = Cфакт / (p − 1);   S²ост = Cост / (p(q − 1));
5) the value of Fisher's test:
F = S²факт / S²ост.
The value of Fisher's test is compared with the critical one for the chosen significance level α and the numbers of degrees of freedom k1 = p − 1 and k2 = p(q − 1).
In the nonuniform single-factor analysis of variance the number of observations at level A1 is q1, at level A2 it is q2, …, at level Ap it is qp. In this case the total sum of squared deviations is calculated using the formula
Cобщ = P1 + P2 + … + Pp − (R1 + R2 + … + Rp)² / n,
where P1 = ∑ xi1² (i = 1, …, q1) – sum of squares of the observed characteristic values at level A1;
P2 = ∑ xi2² (i = 1, …, q2) – sum of squares of the observed characteristic values at level A2;
…
Pp = ∑ xip² (i = 1, …, qp) – sum of squares of the observed characteristic values at level Ap;
R1 = ∑ xi1, R2 = ∑ xi2, …, Rp = ∑ xip – sums of the observed characteristic values at levels A1, A2, …, Ap respectively;
n = q1 + q2 + … + qp – total number of tests (sample size).
The factor sum of squared deviations is calculated using the formula
Cфакт = R1²/q1 + R2²/q2 + … + Rp²/qp − (R1 + R2 + … + Rp)² / n.
The residual sum of squared deviations is calculated using the formula
Сост = Собщ – Сфакт
The rest of operations are carried out just like in case of the equal number of tests:
S²общ = Cобщ / (n − 1);   S²факт = Cфакт / (p − 1);   S²ост = Cост / (n − p).
The value of Fisher's test is compared with critical one for the set significance level α and
number of degrees of freedom k1 = p –1 and k2 = n–p.
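A compact sketch of the single-factor analysis of variance with unequal group sizes, following the P and R formulas above; the three groups are hypothetical data, Python is used only for illustration, and scipy.stats.f supplies the critical value.

```python
import numpy as np
from scipy.stats import f

# Hypothetical observations at p = 3 levels of factor A (unequal q_j)
groups = [np.array([5.1, 4.8, 5.5, 5.0]),
          np.array([6.2, 6.0, 5.8, 6.4, 6.1]),
          np.array([4.3, 4.6, 4.1])]

n = sum(len(g) for g in groups)            # total number of tests
p = len(groups)
P = sum((g**2).sum() for g in groups)      # sum of P_j = sums of squared values
R = [g.sum() for g in groups]              # R_j = sums of values per level
R_tot = sum(R)

C_total = P - R_tot**2 / n                 # total sum of squared deviations
C_factor = sum(r**2 / len(g) for r, g in zip(R, groups)) - R_tot**2 / n
C_resid = C_total - C_factor

S_factor_sq = C_factor / (p - 1)           # factor variance
S_resid_sq = C_resid / (n - p)             # residual variance
F_value = S_factor_sq / S_resid_sq

alpha = 0.05
F_crit = f.ppf(1 - alpha, p - 1, n - p)
print(F_value, F_crit,
      "factor effect significant" if F_value > F_crit else "not significant")
```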
Questions
1. Enumerate types of geologo-mathematical models.
2. Point out the steps of mathematical model-based ecological task solution.
3. Give the definition of the terms "statistical model" and "deterministic model".
4. Point out differences between statistical models and spatial models.
5. The basic principles and methods of mathematical modeling.
6. Population and sample. Assessment of statistical parameters for a sample.
7. What requirements are imposed upon sampling data?
8. What is the probability of a random event?
9. What is the distribution law of a random variable?
10. What distribution laws are usually used for modeling of ecological objects and processes?
11. Properties of the normal distribution law.
12. One-dimensional statistical models. Essence and application conditions.
Basic sources:
1. Aivazyan S.A., Enyukov I.S., Meshalkin L.D. Practical statistics. Basis of modeling and primary data processing. Reference book. – M.: Finance and statistics, 1983. – 472 p.
2. Aivazyan S.A., Enyukov I.S., Meshalkin L.D. Practical statistics: Dependences study: Reference book. – M.: Finance and statistics, 1985. – 182 p.
3. Borovikov V.P. Statistics for students and engineers. – M.: ComputerPress, 200. – 301 p.
4. Van der Varden B.L. Mathematical statistics. – M.: Foreign Literature Publishing House, 1960. – 302 p.
5. Kazhdan A.B., Gus'kov O.I. Mathematical methods in geology. – M.: Nedra, 1990. – 251 p.
6. Kendell M., Stewart A. Distribution theory. – M.: Nauka, 1966. – 566 p.
7. Kendell M., Stewart A. Statistical conclusions and connections. – M.: Nauka, 1973. – 899 p.
8. Kramer G. Mathematical methods of statistics. – M., 1948. – 631 p.
9. Muller P., Noiman P., Shtorm R. Mathematical statistics tables. – M., 1982. – 270 p.
10. Fisher R.A., Yates F. Statistical Tables for Biological, Agricultural, and Medical Research. – Edinburgh: Oliver and Boyd, 1953.
LECTURE 2. TWO-DIMENSIONAL AND MULTI-DIMENSIONAL STATISTICAL MODELS. SPATIAL MODELING.
To model geoecological objects and processes as complex natural systems it is necessary to consider a number of their properties jointly, since the aim is to clarify the internal structure of the studied object. In some cases the studied properties appear independent of one another; in other cases more or less distinct interrelations can be observed between them.
Sometimes, to explain the nature of the observed dependences, it is necessary to trace a long chain of interrelated processes and phenomena. Thus, as a result of long-term statistical processing of data on mining accidents it was established that their periodicity is connected with the phases of the Moon. This connection, extremely strange at first glance, is explained by the influence of the Moon on tidal forces arising both in the hydrosphere and in the lithosphere; these forces often play the role of a "trigger" for such phenomena as rock bursts, gas blow-outs, etc.
The connection between different properties of objects often defies explanation from genetic and cause-and-effect positions, since the observed dependences cannot always be tied to specific geoecological processes.
The study of interdependences between the values of geoecological properties contributes to a deeper understanding of the characteristics of geological processes and to the determination of factors influencing the efficiency of methods for investigating geoecological objects. In some cases it allows one to obtain quantitative estimates of certain properties from the values of other, more easily determined ones. Since the studied interdependences have a statistical character and in practice differ from functional ones, two-dimensional and multi-dimensional statistical models are used to study and describe them.
The determination of correlation relationships between different properties of
geoecological objects helps to solve a wide range of problems. The correlation analysis is often
used to study geoecological processes, to choose rational complex of research methods.
In other cases the statistical analysis of observation results is preceded by theoretical conclusions, and the determined correlation relationships are taken into account in the development of deterministic models describing dependences between geoecological phenomena and the studied physical, chemical, biological and other factors.
The linear (Pearson) correlation coefficient, which assumes a normal law of distribution of the observations, is the most widespread measure of the degree of interrelation.
The correlation coefficient is a parameter characterizing the degree of linear interrelation between two samples. The correlation coefficient varies from –1 (strict inverse linear relationship) to 1 (strict direct proportion). If its value equals 0, there is no linear relationship between the two samples. Here, a direct dependence is understood as a dependence in which an increase or decrease in the value of one property leads to an increase or decrease of the second property, respectively. For example, gas pressure increases when the temperature increases, and gas pressure decreases when the temperature decreases.
The sample estimate of the correlation coefficient can be calculated according to the formula
r = ∑i (xi − x̄)(yi − ȳ) / (n Sx Sy),
where x̄ and ȳ – sample estimates of the average values of the random variables X and Y; Sx and Sy – sample estimates of their standard deviations; n – number of comparable paired values.
When we carry out hand calculations, this formula is used:
r = [∑ xi yi − (1/n)(∑ xi)(∑ yi)] / √{[∑ xi² − (1/n)(∑ xi)²] [∑ yi² − (1/n)(∑ yi)²]}.
If, because of the small amount of data, it is impossible to test the hypothesis that the empirical distribution agrees with the normal law, Spearman's rank correlation coefficient can be used to test the hypothesis. Its calculation is based on replacing the sample values of the investigated random variables by their ranks in increasing order. It is assumed that if there is no correlation dependence between the values of the random variables, the ranks of these variables will be independent. The expression for calculating the rank correlation coefficient is:
r = 1 − 6 ∑ di² / (n(n² − 1)),
where di – rank difference of the conjugate values of the studied variables xi and yi; n – number of pairs in the sample.
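A small sketch of both coefficients on hypothetical paired measurements: Pearson's r via the hand-calculation formula above, and Spearman's coefficient via rank differences (rank ties are ignored here, which is a simplifying assumption). Python and SciPy's rankdata are used only for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def pearson_r(x, y):
    """Pearson r via the computational (hand-calculation) formula."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    num = (x * y).sum() - x.sum() * y.sum() / n
    den = np.sqrt(((x**2).sum() - x.sum()**2 / n) *
                  ((y**2).sum() - y.sum()**2 / n))
    return num / den

def spearman_r(x, y):
    """Spearman r = 1 - 6*sum(d_i^2) / (n*(n^2 - 1)), d_i = rank differences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    d = rankdata(x) - rankdata(y)          # rank differences of conjugate values
    return 1 - 6 * (d**2).sum() / (n * (n**2 - 1))

# Hypothetical paired observations (e.g. element content vs. depth)
x = [1.2, 2.0, 2.9, 4.1, 5.0, 6.2]
y = [2.1, 2.8, 3.9, 5.2, 5.9, 7.4]
print(pearson_r(x, y), spearman_r(x, y))   # both close to +1: strong direct relationship
```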
If the existence of a correlation relationship between two variables has been proved from the sample, if its form has been determined and if there is an equation describing it, it becomes possible to forecast one of the random variables from the values of the other.
The solution of such problems is based on the construction of empirical regression lines or on the calculation of their analytical expressions – regression equations. To solve these problems correctly it is necessary not only to estimate the strength of the correlation relationship but also to identify its character. Therefore the approximate way of checking the hypothesis of linearity of the relationship from the shape of the empirical regression line is usually supplemented by analytical calculations. The analytical way of checking the hypothesis of linearity is based on the fact that
if the relationship is linear, the correlation coefficient and the correlation ratio are equal in absolute value. Fisher's test is the relevant criterion to check this hypothesis:
F = (η²yx − r²)(N − m) / ((1 − η²yx)(m − 2)),
where η²yx – correlation ratio of the characteristic Y over the classes of grouping of X; m – number of classes of grouping; N – number of value pairs XY.
The obtained value F is compared with the tabulated Fкр for the significance level α at f1 = (m − 2) and f2 = (N − m) degrees of freedom. The correlation is considered to be nonlinear if F > Fкр.
Regression analysis. In addition to correlation analysis, regression analysis is distinguished when investigating interrelations between samples. Regression is used to analyze the effect of one or more independent variables on a dependent variable. Thus, regression analysis is one more tool for studying stochastic dependences.
Regression analysis establishes the form of the dependence between the random variable Y (dependent) and the values of one or several variables (independent), the values of the independent variables being considered as prescribed. Such a dependence is usually determined by a certain mathematical model (a regression equation) containing several unknown parameters. In the course of regression analysis, estimates of these parameters are determined on the basis of the sample data, the statistical errors of the estimates or confidence limits are determined, and the adequacy of the mathematical model to the experimental data is checked.
In linear regression analysis the link between the random variables is supposed to be linear. In the simplest case there are two variables X and Y in the linear regression model. From n pairs of observations (X1, Y1), (X2, Y2), ..., (Xn, Yn) it is required to construct a straight line, called the regression line, which approximates the observed values as closely as possible. The equation of this line, y = ax + b, is the regression equation. The desired value of the dependent variable y corresponding to a value of the independent variable x can be predicted with the help of the regression equation.
Thus, it is safe to say that regression analysis consists in fitting a graph and its equation to a set of observations. In regression analysis all variables entering the equation must be continuous, not discrete. When the dependence between one dependent variable Y and several independent variables X1, X2, ..., Xm is considered, we speak of multiple linear regression.
In this case the regression equation is
y = a0 + a1x1 + a2x2 + … + amxm,
where a0, a1, a2, …, am – regression coefficients requiring determination.
The coefficient of determination R² (R-square) is an effectiveness criterion of the regression model: it shows with what degree of accuracy the obtained regression equation approximates the given data.
The significance of the regression model is investigated with the help of the F-test (Fisher's test). If the F statistic is significant (p < 0.05), the regression model is significant.
The significance of the difference of the coefficients a0, a1, a2, …, am from 0 is checked with the help of Student's test. When p > 0.05 the coefficient can be considered equal to 0; this means that the influence of the corresponding independent variable on the dependent variable is unreliable, and this independent variable can be eliminated from the equation.
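A minimal sketch of fitting the regression line y = ax + b by least squares and computing the coefficient of determination R²; the paired data are hypothetical and Python is used only for illustration.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8])

# Least-squares estimates of the slope a and intercept b
a, b = np.polyfit(x, y, deg=1)

y_hat = a * x + b
ss_res = ((y - y_hat)**2).sum()          # residual sum of squares
ss_tot = ((y - y.mean())**2).sum()       # total sum of squares
r_square = 1 - ss_res / ss_tot           # coefficient of determination

print(f"y = {a:.3f}*x + {b:.3f}, R^2 = {r_square:.3f}")
```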
MULTI-DIMENSIONAL STATISTICAL MODELS
Every phenomenon can be characterized by a set of characteristics which are determinable and observable. Geoecological objects should be considered as systems depending on a great number of factors and requiring a multidimensional attribute space for their characterization. To solve such problems it is necessary to consider the complex of studied characteristics jointly, i.e. to form a multi-dimensional statistical model.
When solving most multidimensional geoecological problems we have to deal with complex combinations of factors which cannot be isolated in pure form and studied independently. The combined study of complex interrelated characteristics yields additional information about the variability of the investigated objects and makes it possible to forecast their unknown properties.
Multidimensional statistical analysis relies on a wide range of methods. Discriminant and cluster analyses belong to the methods of multidimensional classification, which are intended to separate collections of objects, subjects and phenomena into uniform groups. It should also be taken into account that each object is characterized by a great number of different and stochastically connected characteristics. The presence of a great number of original characteristics describing the functioning of an object makes it necessary to select the most important of them and to study a smaller set of characteristics. The original characteristics are often subjected to transformation in a way that ensures minimal loss of information.
This task is addressed by dimension reduction methods, to which factor analysis belongs. This method takes into account the essential multidimensionality of the data and makes it possible to explain multivariate structures laconically and simply. With the help of the obtained factors and principal components we can reveal existing non-observable regularities. This gives the opportunity to describe the observed original data, their structure and the character of the correlations between them in a simple way. Data compression is achieved because the number of factors or principal components – the new units of measurement – is far smaller than the number of original characteristics.
CLUSTER ANALYSIS
The term cluster analysis (first used by Tryon, 1939) encompasses a number of different
algorithms and methods for grouping objects of similar kind into respective categories. A general
question facing researchers in many areas of inquiry is how to organize observed data into
meaningful structures, that is, to develop taxonomies. In other words cluster analysis is an
exploratory data analysis tool which aims at sorting different objects into groups in a way that
the degree of association between two objects is maximal if they belong to the same group and
minimal otherwise. Given the above, cluster analysis can be used to discover structures in data
without providing an explanation or interpretation. In other words, cluster analysis simply
discovers structures in data without explaining why they exist. We deal with clustering in almost
every aspect of daily life. For example, a group of diners sharing the same table in a restaurant
may be regarded as a cluster of people. In food stores items of similar nature, such as different
types of meat or vegetables are displayed in the same or nearby locations. There are a countless
number of examples in which clustering plays an important role. For instance, biologists have to
organize the different species of animals before a meaningful description of the differences
between animals is possible. According to the modern system employed in biology, man belongs
to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this
classification, the higher the level of aggregation the less similar are the members in the
respective class. Man has more in common with all other primates (e.g., apes) than it does with
the more "distant" members of the mammals (e.g., dogs), etc.
Methods of cluster analysis are used to solve problems of partitioning a set in such a way that all objects belonging to one cluster (class, group) are more similar to each other than to the objects of other clusters. As a matter of fact, cluster analysis is not so much an ordinary statistical method as a "set" of various algorithms for distributing objects over clusters. Methods of cluster analysis are used in most cases when there are no a priori hypotheses about the data structure. Cluster analysis is applied at the descriptive stage of an investigation. Checks of statistical significance are not applied in cluster analysis, since the analysis itself only determines the most plausible grouping.
The techniques of clustering are used in various fields (Hartigan, 1975). Cluster analysis is very useful whenever it is necessary to classify a large amount of information into groups suitable for further processing.
Cluster analysis relies on two assumptions. The first assumption is that the considered characteristics of the objects admit the desired partition of the population into clusters. The second assumption is the validity of the chosen scale or measuring units of the characteristics, i.e. that the scales are comparable.
In cluster analysis scaling is of great importance. Consider an example. Imagine that in a data set A the characteristic x is two orders of magnitude larger than the characteristic y: the values of the variable x are in the range from 100 to 700, and the values of the variable y are in the range from 0 to 1.
Then in the process of calculation of value of a distance between two points showing the
object location in space and their properties the variable possessing great values, i.e. variable x,
will dominate absolutely over the variable possessing small values, i.e. variable y. Therefore,
because of non-uniformity of measuring units of characteristics it is impossible to calculate
reasonably a distance between two points.
This problem is solved with the help of preliminary standardization of the variables. Standardization or normalization brings the values of all transformed variables to a unified value range by relating them to a quantity that reflects the spread of the actual characteristic. There are different ways of normalizing the given data:
z = (xi − x̄)/σ, z = xi/x̄, z = xi/xmax, z = (xi − x̄)/(xmax − xmin),
where x̄, σ – mean value and root-mean-square (standard) deviation respectively; xmax, xmin – largest and smallest values; xi – the i-th value of the characteristic.
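A brief sketch of the normalization variants listed above applied to one variable; the raw vector is hypothetical and Python is used only for illustration.

```python
import numpy as np

x = np.array([120.0, 340.0, 510.0, 205.0, 660.0])   # hypothetical raw characteristic

z_standard = (x - x.mean()) / x.std(ddof=1)          # z = (x_i - mean) / sigma
z_by_mean = x / x.mean()                             # z = x_i / mean
z_by_max = x / x.max()                               # z = x_i / x_max
z_range = (x - x.mean()) / (x.max() - x.min())       # z = (x_i - mean) / (x_max - x_min)

print(z_standard)
```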
Along with standardization of the variables, a coefficient of importance (weight) can be assigned to every variable to represent its significance. Expert estimates obtained in the process of an expert survey can serve as weights. The products of the standardized variables and the corresponding weights are then used to calculate distances between points in multidimensional space; if no such information is available, the weights are taken to be equal.
The aim of cluster analysis is to divide a set of elements into groups in such a way that objects with the highest values of similarity are combined within groups, while disjoint groups are separated with respect to the given characteristics. Measures of similarity for cluster analysis can be divided into:
 Similarity measures of the distance type (distance functions), also called measures of dissimilarity. In this case objects are considered the more similar, the smaller the distance between them.
 Similarity measures of the correlation type, called linkage, defining the similarity of objects. In this case objects are considered the more similar, the larger the linkage between them.
 Information statistics.
The numbers obtained as a result of cluster calculation have no conceptual value; they are needed only to distinguish one cluster from another. Still, the ordering of clusters can be convenient for the investigator when the results of cluster analysis are used in other methods. The measure of similarity between elements of a set is called a metric if it satisfies the conditions of symmetry, the triangle inequality, distinguishability of non-identical objects and indistinguishability of identical objects.
Minkowski metric
The Minkowski metric is the most general metric. The degree of the metric can be chosen in the range from 1 to 4. If this degree equals 2, we obtain the Euclidean distance. The Minkowski distance is the r-th root of the sum of the absolute differences of the paired values raised to the r-th power:
distance(x, y) = {∑i |xi − yi|^r}^(1/r)
Euclidean metric
This is probably the most commonly chosen type of distance. It is simply the geometric distance in multidimensional space. The Euclidean distance between two points x and y is the least distance between them; in a two- or three-dimensional space this measure corresponds to the straight line linking the given points. If r = 2 in the Minkowski metric, we obtain the standard Euclidean distance (Euclidean metric):
distance(x, y) = {∑i (xi − yi)²}^(1/2)
Squared Euclidean metric (Squared Euclidean distance)
You may want to square the standard Euclidean distance in order to place progressively
greater weight on objects that are further apart. Due to squaring the large differences are
accounted better in the process of calculation
distance(x, y) = ∑i (xi − yi)²
Manhattan distance
There are also alternative distance measures: The city-block distance uses the sum of the
variables’ absolute differences. This is often called the Manhattan metric as it is akin to the
walking distance between two points in a city like New York’s Manhattan district, where the
distance equals the number of blocks in the directions North-South and East-West. If r=1
Minkowski metric gives Manhattan distance.
distance(x,y) = ∑i |xi - yi|
Chebychev distance
This distance measure may be appropriate in cases when we want to define two objects as
"different" if they are different on any one of the dimensions. Researchers frequently use the
Chebychev distance, which is the maximum of the absolute difference in the clustering
variables’ values. The Chebychev distance is computed as:
distance(x,y) = Maximum|xi - yi|
Power distance
Sometimes we may want to increase or decrease the progressive weight that is placed on
dimensions on which the respective objects are very different. This can be accomplished via the
power distance. The power distance is computed as:
distance(x, y) = (∑i |xi − yi|^p)^(1/r),
where r and p are user-defined parameters. A few example calculations may demonstrate how
this measure "behaves." Parameter p controls the progressive weight that is placed on differences
on individual dimensions, parameter r controls the progressive weight that is placed on larger
differences between objects. If r and p are equal to 2, then this distance is equal to the Euclidean
distance.
Percent disagreement
This measure is particularly useful if the data for the dimensions included in the analysis
are categorical in nature. This distance is computed as:
distance(x, y) = (Number of xi ≠ yi) / i
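A small sketch of several of the distance measures above for two feature vectors; the vectors are hypothetical and assumed already standardized, and Python is used only for illustration.

```python
import numpy as np

x = np.array([0.2, 1.4, -0.7, 0.9])
y = np.array([0.5, 0.8, -0.1, 1.6])

diff = np.abs(x - y)

euclidean = np.sqrt(((x - y)**2).sum())        # geometric distance
squared_euclidean = ((x - y)**2).sum()         # weights distant objects more heavily
manhattan = diff.sum()                         # city-block distance
chebychev = diff.max()                         # maximum absolute difference
r, p = 3, 3
power = (diff**p).sum() ** (1.0 / r)           # power distance with user-defined r and p
minkowski = (diff**r).sum() ** (1.0 / r)       # Minkowski metric of order r

print(euclidean, manhattan, chebychev)
```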
AMALGAMATION OR LINKAGE RULES
At the first step, when each object represents its own cluster, the distances between those
objects are defined by the chosen distance measure. However, once several objects have been
linked together, how do we determine the distances between those new clusters? In other words,
we need a linkage or amalgamation rule to determine when two clusters are sufficiently similar
to be linked together. There are various possibilities: for example, we could link two clusters
together when any two objects in the two clusters are closer together than the respective linkage
distance. Put another way, we use the "nearest neighbors" across clusters to determine the
distances between clusters; this method is called single linkage. This rule produces "stringy"
types of clusters, that is, clusters "chained together" by only single objects that happen to be
close together. Alternatively, we may use the neighbors across clusters that are furthest away
from each other. This method is called complete linkage. There are numerous other linkage rules
such as these that have been proposed.
The following methods of cluster analysis are considered:
 Hierarchical methods:
o single linkage (nearest neighbour),
o average linkage method (King),
o Ward's method
 Iterative methods of grouping:
o K-means method (MacQueen).
 Algorithms:
o method of correlation Pleiades (Terentjev),
o Wroclaw taxonomy.
1) SINGLE LINKAGE (NEAREST NEIGHBOUR)
This method is the simplest one of hierarchical agglomerative methods of cluster
analysis. In this method the distance between two clusters is determined by the distance of the
two closest objects (nearest neighbours) in the different clusters. Each object is placed in a
separate cluster, and at each step we merge the closest pair of clusters, until certain termination
conditions are satisfied. This rule will, in a sense, string objects together to form clusters, and the
resulting clusters tend to represent long "chains."
2) AVERAGE LINKAGE METHOD (KING)
This method is similar to single linkage (nearest neighbour). In this method the distance
between the two clusters is defined as the average of the distances between all pairs of objects,
where one member of the pair is taken from each of the clusters. This method also uses information on
all pairs of distance, not merely the minimum or maximum distances. For this reason, it is
usually preferred to the single and complete linkage methods
3) WARD’S METHOD
Another commonly used approach in hierarchical clustering is Ward’s method. This method is
distinct from all other methods because it uses an analysis of variance approach to evaluate the
distances between clusters. In short, this method attempts to minimize the sum of squares of any
two clusters that can be formed at each step. In general, this method is regarded as very efficient,
however, it tends to create clusters of small size.
4) K-MEANS METHOD (MACQUEEN)
K-means (MacQueen) is one of the simplest unsupervised learning algorithms that solve
the well known clustering problem. The procedure follows a simple and easy way to classify a
given data set through a certain number of clusters (assume k clusters) fixed a priori. The main
idea is to define k centroids, one for each cluster. These centroids should be placed in a careful way, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has thus been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more.
5) METHOD OF CORRELATION PLEIADES
The concept of correlation Pleiades was originally advanced by Terentjev. The results of
classification can be visually presented in the shape of a cylinder cut by planes perpendicular to its axis. The planes correspond to linkage levels (from 0 to 1 with a step of 0.1). The parameters or objects subject to classification are combined at these levels. Therefore, this method resembles single linkage (nearest neighbour), but with union at fixed levels. The results of
classification are displayed graphically in the shape of circles – sections (Pleiades) of the
correlation cylinder mentioned above.
The classified objects are marked on the circles.
Linkages between classified objects are shown by means of chord connection of circle points
related to objects.
VISUALIZING RESULTS OF CLUSTER ANALYSIS
When carrying out a hierarchical cluster analysis, the process can be represented on a diagram
known as a dendrogram. This diagram illustrates which clusters have been joined at each stage
of the analysis and the distance between clusters at the time of joining. If there is a large jump in
the distance between clusters from one stage to another then this suggests that at one stage
clusters that are relatively close together were joined whereas, at the following stage, the clusters
that were joined were relatively far apart. This implies that the optimum number of clusters may
be the number present just before that large jump in distance. This is easier to understand by
actually looking at a dendrogram.
You see, after carrying out classification it is recommended to visualize results of clustering
by means of construction of a dendrogram. Suppose that the results of classification by way of
variables for pairs of objects are obtained after applying one of hierarchical methods. The idea of
the dendrogram construction is evident – pairs of objects are linked according to the level of
linkage plotted on the ordinate (Fig.2.1)
Fig. 2.1. Dendrogram of a hierarchical method (objects 1–6 on the abscissa, linkage distance on the ordinate)
Consider a Horizontal Hierarchical Tree Plot we begin with each object in a class by
itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not
unique. Put another way, we lower our threshold regarding the decision when to declare two or
more objects to be members of the same cluster. As a result we link more and more objects
together and aggregate (amalgamate) larger and larger clusters of increasingly dissimilar
elements. Finally, in the last step, all objects are joined together. In these plots, the horizontal
axis denotes the linkage distance (in Vertical Icicle Plots, the vertical axis denotes the linkage
distance). Thus, for each node in the graph (where a new cluster is formed) we can read off the
criterion distance at which the respective elements were linked together into a new single cluster.
When the data contain a clear "structure" in terms of clusters of objects that are similar to each
other, then this structure will often be reflected in the hierarchical tree as distinct branches. As
the result of a successful analysis with the joining method, we are able to detect clusters
(branches) and interpret those branches.
Symbolic notations of investigation objects (vectors of matrix) are positioned on abscissa
axis; and the lowest values of distance coefficient corresponding to each step of the classifying
procedure are positioned on ordinate axis. Therefore, ordinate axis is used for scaled presentation
of hierarchical grouping levels.
The visualization and conceptual importance of tree graphs increase if they represent not only information about the closeness of intragroup linkages but also the between-group distances h. Such a dendritic graph, taking into account not only intragroup distances but also mean between-group distances, is called a dendrograph.
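A sketch of building and plotting a dendrogram with a hierarchical single-linkage procedure; it relies on SciPy's clustering module and on randomly generated hypothetical data with two well-separated groups.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical objects described by two standardized characteristics
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (5, 2)),      # one compact group
                  rng.normal(3, 0.3, (5, 2))])     # a second, distant group

# Single linkage (nearest neighbour) on Euclidean distances
Z = linkage(data, method="single", metric="euclidean")

dendrogram(Z)                    # objects on the abscissa, linkage distance on the ordinate
plt.ylabel("Linkage distance")
plt.show()
```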
K-MEANS METHOD
This method is connected with objects but it is not connected with matrix of similarity. In
k-means method the object belongs to such class the distance to which is minimal. The distance
is understood as Euclidean distance, i.e. the objects are considered to be points of Euclidean
space. The distance between the object and class is the distance between the object and the center
of class. Each class of objects possesses the centroid. The mean parameter values are considered
to be the centre of class. Then the distance between the object and group of objects are
determined and algorithm can work.
Imagine, that number of objects is equal 2. Join these points with segment of line and find
its midpoint. It will be a centroid of group consisting of two points. From this centroid to given
point distance will be the desired one.
K-means method “works” as follows:
1) First, cluster partition of data is given (number of clusters is determined by user);
centroids of clusters are computed;
2) displacement of points occurs: each point is located in the nearest cluster;
3) centroids of new clusters are computed;
4) steps 2 and 3 are repeated until a stable configuration is found (i.e. the clusters stop changing) or the number of iterations exceeds the limit given by the user.
The resulting configuration is the desired one; a sketch of this loop is given below.
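A bare-bones sketch of the k-means loop described in steps 1–4 above (initial centroids, reassignment to the nearest centroid, recomputation of centroids, stop when stable); the data, k and the initialization choice are hypothetical, and Python is used only for illustration.

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points as the initial centroids (one simple initialization choice)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to the nearest centroid (Euclidean distance)
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute centroids as the mean of the points in each cluster
        new_centroids = np.array([data[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: stop when the configuration is stable (centroids no longer move)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
labels, centroids = k_means(points, k=2)
print(centroids)
```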
Usually, as the result of a K-means clustering analysis, we would examine the means for
each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain
very different means for most, if not all dimensions, used in the analysis. The magnitude of the F
values from the analysis of variance performed on each dimension is another indication of how
well the respective dimension discriminates between clusters.
APPLICATION OF CLUSTER ANALYSIS:
1) development of typology or classification;
2) investigation of useful conceptual schemes of object grouping;
3) generation of hypotheses on the basis of data investigation;
4) testing of hypotheses, or investigation to determine whether the types (groups) isolated by one means or another are really present in the data.
FACTOR ANALYSIS
Factor analysis is a statistical method used to study the dimensionality of a set of variables.
In factor analysis, latent variables represent unobserved constructs and are referred to as factors
or dimensions.
Factor analysis is designed for interval data, although it can also be used for ordinal data.
The variables used in factor analysis should be linearly related to each other. This can be
checked by looking at scatterplots of pairs of variables. Obviously the variables must also be at
least moderately correlated to each other, otherwise the number of factors will be almost the
same as the number of original variables, which means that carrying out a factor analysis would
be pointless. The factor analysis model can be written algebraically as follows. If you have p
variables X1,X2, . . . ,Xp measured on a sample of n subjects, then variable i can be written as a
linear combination of m factors F1, F2, . . . , Fm where, as explained above m < p.
The main goals of factor analysis are:
1. number of variables reduction (data reduction) and
2. classification of variables.
So factor analysis is used either as a method of data reduction or as a method of classification.
Thus, factor analysis makes it possible to identify dependences between phenomena, to discover the latent basis of some phenomena, and to answer the question why these phenomena are related.
As the method of statistic investigation, factor analysis includes following stages:
1) Goal formation
Goals can be:
a) Research (identifying factors and analysis of them)
b) Applied (construction of aggregate characteristics for forecasting and management)
2) Choice of characteristics and objects population
3) Generation of given factor structure
4) Correction of factor structure on the basis of investigation goals.
5) Identifying of second-order factors
We obtain second-order factors – more general categories of investigated phenomenon.
6) Interpretation and use of results.
The main ideas of factor analysis are:
The correlation matrix constructed with the use of Pearson's correlation coefficient is the main object of investigation. Positive semidefiniteness is the basic requirement on the constructed matrix. A Hermitian matrix is called positive semidefinite if all its principal minors are non-negative; the non-negativity of all eigenvalues follows from this property.
The correlation coefficients constituting the correlation matrix are usually calculated between parameters (characteristics, tests), not between objects (individuals, persons), so the dimension of the correlation matrix is equal to the number of parameters. This is the so-called R-technique. But the correlation between objects can also be studied; this methodology is called the Q-technique. There is also the P-technique, in which the analysis is carried out on the same individual observed in different periods of time, and the correlation between the individual's states is studied.
In all methods of factor analysis there is a hypothesis that the studied dependence has the
linear character. The main requirement to given data is that they should be subordinate to
multidimensional normal distribution.
The reduction of the correlation matrix is the process of replacing the units on its main diagonal by values called communalities (generalities). The communality is the sum of the squared factor loadings. The communality of a given variable is that part of its dispersion which is accounted for by the common factors. This follows from the fact that the total dispersion is composed of the common dispersion due to factors shared by all variables, the specific dispersion due to factors specific only to the given variable, and the dispersion due to error.
The aim of factor analysis is to obtain the matrix of the factor mapping. Its rows are the endpoints of the coordinate vectors related to the m variables in the r-dimensional factor space. The proximity of the endpoints of these vectors gives a rough idea of the mutual dependence of the variables. Each vector contains, in condensed form, the information concerning the process. In addition, if the number of selected factors is greater than one, a rotation of the factor mapping matrix is carried out; its aim is to obtain the so-called simple structure.
For illustrative purposes the results can be represented graphically, but this is difficult to do if there are three or more selected factors. In that case the mapping of the r-dimensional factor space is usually presented as two-dimensional sections.
When solving a factor analysis problem one should be prepared for the possibility that the problem cannot be solved; this is caused by the complexity of computing the eigenvalues of the correlation matrix. For example, the correlation matrix can turn out to be degenerate because of coincidence or complete linear correlation of parameters. In the course of the calculations an underflow can occur for high-order matrices. Therefore we cannot theoretically exclude the situation when factor analysis methods are not applicable, at least until the given data are successfully "corrected". Corrected data can be obtained as follows: identify linearly dependent parameters with the help of, for example, the method of correlation Pleiades (other methods may also be applied) and keep only one parameter from each group of linearly dependent parameters in the given data.
METHOD OF PRINCIPAL COMPONENTS
It is rather difficult to study objects when the dimensionality of the attribute space increases. There is a problem of replacing the numerous observed characteristics by a smaller number of them without loss of useful information. The method of principal components is one of the most widely used methods of solving this problem.
The basis of the method of principal components is the linear transformation of the m given variables (characteristics) into m new variables, where each new variable is a linear combination of the given ones. In the process of this transformation the vectors of the observed variables are replaced by new vectors (principal components) which make different contributions to the total dispersion of the multidimensional characteristics. The principal components are the eigenvectors of the covariance matrix of the given characteristics. The number of eigenvectors of the covariance matrix is determined by the number of studied characteristics, i.e. it is equal to the number of its columns (or rows). Each eigenvector (principal component) is characterized by an eigenvalue and coordinates. If the covariance matrix is used, the variables remain in their original metric.
The eigenvalues of the covariance matrix (λj) are the lengths of its eigenvectors, i.e. their dispersions. The sum of the eigenvalues of the covariance matrix is equal to its trace, i.e. the sum of its diagonal elements. The coordinates of an eigenvector of the covariance matrix (ωij) are numerical coefficients characterizing its position in the m-dimensional attribute space. The number of coordinates of each eigenvector (ω1, ω2, ..., ωm) is determined by the space dimension, and their numerical values are the coefficients of the linear equation of the eigenvector.
The eigenvalues of the covariance matrix are found as the roots of its characteristic polynomial equation. This is rather difficult to do for large m, so in computational practice they are determined by iterative matrix methods, which can be carried out only with the help of a computer. The methods of finding the coordinates of the eigenvectors of a symmetric matrix are also laborious and require the use of a computer.
Since the covariance matrices of the given characteristics are symmetric, their eigenvectors are always orthogonal and the new variables are uncorrelated with one another.
In the method of principal components the coordinates of the eigenvectors are considered as the loadings of the variables on one or another factor. They are used to calculate the matrix of the new population obtained by projecting the given data vectors (characteristics x1, x2, …, xm) onto the eigenvector axes (γ1, γ2, …, γm):
γj = ∑i ωji xi,  i = 1, …, m,   (1)
where ωji – loading of the j-th component in the i-th variable (characteristic). With the help of formula (1) the parent matrix of observed characteristics of dimension n×m is recalculated into a matrix of new variables (of the same dimension), taking into account the eigenvalue of each component. If the statistical (correlation) linkages between the observed characteristics of the multidimensional space are clearly displayed, the decomposition of the parent matrix of observations into m new components makes the distribution of dispersion over the new components much more distinct than over the original characteristics. As a rule, the dispersion of one of the principal components reaches half or more of the total dispersion of the characteristics, and together with the dispersions of the next one or two components their combined contribution to the total dispersion exceeds 90%.
Therefore, the dimension of the space of observed characteristics can be reduced noticeably (to p ≤ m) without loss of information about the variability of the observed characteristics. In this case we can limit ourselves to the data for the two or three most informative principal components. This allows us to consider that, for the purposes of geoecological analysis, the matrix of principal components of dimension n×p (where p, as a rule, does not exceed 2–3) can be used instead of the parent matrix of dimension n×m. Since the new variables in this matrix are uncorrelated, the method of principal components can be considered a powerful tool for determining the number of linearly independent vectors contained in the parent matrix.
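A compact sketch of the method of principal components via eigendecomposition of the covariance matrix, including the share of the total dispersion explained by each component; the data matrix is hypothetical and Python is used only for illustration.

```python
import numpy as np

# Hypothetical matrix of observations: n = 6 samples, m = 3 characteristics
X = np.array([[2.1, 4.0, 0.5],
              [2.4, 4.3, 0.7],
              [3.9, 6.1, 1.4],
              [4.2, 6.0, 1.3],
              [5.1, 7.2, 1.9],
              [5.5, 7.8, 2.2]])

Xc = X - X.mean(axis=0)                    # center the characteristics
cov = np.cov(Xc, rowvar=False)             # covariance matrix (m x m)

eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues and orthogonal eigenvectors
order = np.argsort(eigvals)[::-1]          # sort components by decreasing dispersion
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()        # contribution of each component to total dispersion
scores = Xc @ eigvecs                      # projections of the data onto the principal axes

print(explained)                           # typically the first 1-2 components dominate
```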
Let us consider the method of principal components in more detail as a variant of the main factors method. The basic model of principal components is written in matrix form as follows:
Z = AP,
where Z – matrix of standardized given data, A – factor mapping, P – matrix of factor values.
The order of matrix Z is m × n, the order of matrix A is m × r, the order of matrix P is r × n, where m – number of variables (data vectors), n – number of individuals (elements of one vector), r – number of separated factors.
As we can see from expression mentioned above the model of component analysis includes
only factors common to these vectors.
Matrix of standardized given data is defined by matrix of given data Y (order of matrix Y
т х п ) according to formula
zij = (yij − ȳi) / si,  i = 1, 2, …, m, j = 1, 2, …, n,
where yij – element of the matrix of given data, ȳi – mean value of the i-th variable, si – its standard deviation.
To calculate correlation matrix – the base element of factor analysis – we should use the
simple relation
(1/(n − 1)) Z Z' = R,
where R – correlation matrix of order m × m; ' – symbol of transposition.
On the main diagonal there are values equal to 1; these values are called communalities, designated hi², and they are a measure of the full dispersion of a variable.
The matrices A and P are unknown. Matrix A can be found from the fundamental theorem of factor analysis
R = A·C·A',
where C – correlation matrix showing the linkage between factors.
If C = I we speak of orthogonal factors; if C ≠ I we speak of oblique factors. Here I is the identity matrix. For the matrix C the relation
(1/(n − 1)) P P' = C
holds.
We consider only case of orthogonal factors for which
R = A*A'
The model of classical factor analysis includes a number of common factors and one
specific factor to each variable.
The first formula mentioned in this unit (Z = AP) is the main model of factor analysis for the method of principal components. The number of principal components is always less than or equal to the number of variables.
PROBLEM OF ROTATION
The coordinate axes corresponding to the separated factors are orthogonal, and their directions are established sequentially according to the maximum of the remaining dispersion. However, the coordinate axes obtained in this way are not meaningfully interpretable, so the coordinate system is repositioned by rotating it about its origin. Owing to this procedure the mutual arrangement of the vectors is unchanged. The aim of rotation is to find one of the possible coordinate systems that yields the so-called simple factor structure. The popular VARIMAX rotation method is commonly applied.
CRITERIA FOR THE MAXIMUM NUMBER OF FACTORS
There are several criteria for assessing the maximum number of retained factors. Criteria based on the analysis of the determinants of the parent and reproduced correlation matrices do not show stability. Criteria based on the eigenvalues of the correlation matrix ultimately lead to an analysis of the percentage of dispersion separated by the factors. All common factors, whose number equals the number of parameters, account for 100% of the dispersion. If the sum of the percentages exceeds 100%, it means that negative eigenvalues (and, as a result, complex eigenvectors) were obtained in the process of calculating the correlation matrix eigenvalues, which can indicate an incorrect reduction of the parent correlation matrix.
Kaiser criterion: only factors corresponding to eigenvalues of the correlation (covariance) matrix greater than 1 are retained.
Scree test: we should cast out all factors whose eigenvalues differ little from one another. The scree test is a graphical method first proposed by Cattell (Cattell, 1966). Cattell proposed to find the place on the plot of eigenvalues where their decrease from left to right slows down most strongly; only the "factorial scree" is supposed to lie to the right of this point. "Scree" is a geological term: scree, or talus, is an accumulation of broken rock fragments at the base of mountain cliffs. According to this criterion, 2 or 3 factors could be retained in the example plot.
Application area of multidimensional statistical models in geoecology
Applications of multidimensional statistical models for studying dependences among complexes of different geoecological characteristics are practically unbounded in any branch of geoecology. Correlation methods of paragenetic analysis of chemical elements are widely used in ecological geochemistry.
Multidimensional statistical descriptions of the linkage of geoecological variables, with subsequent assessment of their interdependences, are used in geoecological practice with the aim of identifying, discriminating and classifying the studied objects, or of searching for more informative combinations of characteristics to solve predictive problems.
Classification of geoecological objects, for example hierarchical grouping of chemical element associations according to complete chemical analysis data, is carried out with the help of cluster analysis, other methods of multidimensional correlation analysis or the method of factor analysis.
Prediction of various properties of the studied geoecological objects is the ultimate objective of most multidimensional statistical methods.
Depending on the type of the given data and the objectives of the geoecological investigation, different multidimensional models are used to build these algorithms. Meanwhile, as a rule, there is the problem of searching for the most informative combinations of characteristics and of reducing the dimension of their space. It is solved with the help of the principal component method, the R-technique of factor analysis or other logical and heuristic methods.
MODELING OF SPATIAL VARIABLES
In the process of studying objects and geoecological processes the investigator is interested not only in the average characteristics of variability and in the interrelationships of the observed values of the phenomena properties, but also in the regularities of their spatial changes. Statistical models are not suitable for these purposes, since any statistical characteristic shows only the average level of variability of the studied property regardless of the spatial location of the observation points, while the regularities of their spatial location can be quite different. Moreover, statistical characteristics give objective estimates of the observed level of variability only when the sample data represent a population of independent random variables.
For the purposes of mathematical modeling of the regularities of the spatial distribution of the properties of the studied geoecological formations, their characteristics are considered not as random variables but as spatial variables, which possess a number of specific properties such as regularity, a domain of existence and a definitional domain.
Their populations form fields of spatial variables, within which the location of each value is determined by its space coordinates.
Geometrical and analytical modeling methods of geological, geochemical, geophysical and
other spatial variables fields favour the separation and quantitative description of tendencies
observed in change of investigated object properties. And in some cases they allow identify new,
earlier unknown regularities. The results of geoecological mapping, geochemical and schlich
surveys, geophysical observations are used for aims of modeling. Spatial regularities of
geophysical field changes are used widely in the process of geological mapping. Mathematical
modeling of geochemical and geophysical fields allows us to identify anomalies more reliably.
Geoecological objects as fields of spatial variables
Field of spatial variable is called a spatial domain, where each point can be paired with
some value of studied variable. The spatial domain can be considered as a geoecological field. In
addition, certain value of studied geoecological characteristic is in accord with each element of
spatial domain.
Depending on the nature of the modeled characteristics there are geophysical, geochemical, morphometric and other geological fields; according to the dimension of the studied space they are divided into one-dimensional, two-dimensional, three-dimensional and multi-dimensional.
Continuous and discrete geological spatial variables. According to domain of existence
geological spatial variables are divided into continuous and discrete. Continuous spatial variables
express properties of object visualized in any point of field, i.e. in the entire area of investigated
territory. Concentration of chemical elements in rock formation, their physical properties,
thickness of studied geological bodies and a lot of other properties of rocks and ores belong to
number of these variables. Spatially confined geological formation, the domains of existence of
which are small to negligible in comparison with investigated areas, belong to number of discrete
spatial variables. They are represented by specific geological bodies (for example, certain
facies), mineral deposits, phenocrysts of some minerals or mineral aggregate in rocks and etc.
Scalar and vector fields. According to characteristics of dimensionality there are scalar and
vector geological fields. The majority of studied geological variables belong to scalar values. To
give scalar values it is necessary to know their module and sign. Populations of these variables
form scalar geological fields.
Vector spatial variables are used more rarely in geological practice. To specify a vector spatial variable it is necessary to know both its module and its direction. A vector random field can be simulated as vectors oriented in real two- or three-dimensional space (for example, magnetic fields) or as complexes of different scalar variables (for example, the contents of several chemical elements at each point). If not the initial values but their derivatives, i.e. the gradients of the geological fields, are studied, many scalar fields can be transformed into vector fields.
Background, anomalies and trend surface
The model of an additive random field is the most widespread model of a continuous scalar geological field. In this case the values of a continuous scalar variable û = f(x, y) are given on the (x, y) plane and are used to describe the additive scalar field
u = f(x, y) + ε,
where f(x, y) = û – a function of the coordinates (the non-random component); ε – a random variable.
The task of modeling is to estimate the function f(x, y) under known assumptions about ε and to describe the random component ε under certain assumptions about f(x, y). The main problem of studying spatial regularities is to describe the non-random component of the field, showing the level of its values which is typical of particular parts of the examined territory.
The non-random component characterizing the main part of the simulated geological field is called its background. The background part of the field identifies areas of relatively increased and decreased values of the studied characteristic and contains useful geological information about the nature of the studied geological object. To distinguish the background it is necessary to generalize the base properties of the field while suppressing more or less significant local deviations. In each specific case the deviations from the background are considered as anomalous ones.
Trend surface analysis is the most widely used global surface-fitting procedure. The
mapped data are approximated by a polynomial expansion of the geographic coordinates of the
control points, and the coefficients of the polynomial function are found by the method of least
squares, insuring that the sum of the squared deviations from the trend surface is a minimum.
Each original observation is considered to be the sum of a deterministic polynomial function of
the geographic coordinates plus a random error.
The polynomial can be expanded to any desired degree, although there are computational
limits because of rounding error. The unknown coefficients are found by solving a set of
simultaneous linear equations which include the sums of powers and cross products of the X, Y,
and Z values. Once the coefficients have been estimated, the polynomial function can be
evaluated at any point within the map area. It is a simple matter to create a grid matrix of values
by substituting the coordinates of the grid nodes into the polynomial and calculating an estimate
of the surface for each node. Because of the least-squares fitting procedure, no other polynomial
equation of the same degree can provide a better approximation of the data.
Two different methodological approaches to trend analysis are used in geological
practice:
1) smoothing of the observed data by moving statistical windows;
2) approximation of the field by a single function of the spatial coordinates (orthogonal polynomials, etc.).
The moving-average methods are universal and give better estimates of the average parameters of
spatially confined parts of geological fields than the polynomial
trend-analysis method; the data of the latter are used mainly to identify regional geological regularities.
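The sketch below illustrates the first approach on a hypothetical regular grid; the window size, the synthetic field and the NumPy implementation are illustrative assumptions, not part of the original text.

```python
import numpy as np

def moving_average(field, window=3):
    """Smooth a regularly gridded field with a square moving window.

    Cells near the edge are averaged over the part of the window
    that falls inside the grid, so the output has the same shape.
    """
    half = window // 2
    rows, cols = field.shape
    background = np.empty_like(field, dtype=float)
    for i in range(rows):
        for j in range(cols):
            r0, r1 = max(0, i - half), min(rows, i + half + 1)
            c0, c1 = max(0, j - half), min(cols, j + half + 1)
            background[i, j] = field[r0:r1, c0:c1].mean()
    return background

# Example: a synthetic field = smooth background + random noise
rng = np.random.default_rng(0)
x, y = np.meshgrid(np.arange(20), np.arange(20))
field = 0.5 * x + 0.2 * y + rng.normal(0, 1, x.shape)

background = moving_average(field, window=5)
anomalies = field - background   # deviations from the background
```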
The relative nature of the regular and random components of the observed variability
affects the results of trend analysis of geological fields. For this reason, depending
on the scale, goals and objectives of the investigation, the background is taken to mean a trend
surface of a particular degree of smoothness, and anomalies are taken to mean deviations from
the background that exceed this reference surface.
Separating regional regularities by approximating the empirical data with functions of the
spatial coordinates involves rather laborious calculations and requires the use of
computers. Orthogonal polynomials of different degrees, the Laplace equation, trigonometric
polynomials, etc. are used as approximating functions.
Orthogonal polynomials are usually used in the case of a uniform rectangular observation network.
In the simplest case the trend is defined as a linear function of the geographic coordinates, fitted
to the population of observations in such a way that the sum of squared deviations of the
characteristic values from the plane is minimal. This model is a variant of the statistical method of
multiple regression, in which the function describing the trend surface is taken as

û = β0 + β1·x + β2·y,

where x and y are the spatial coordinates and β0, β1, β2 are the polynomial coefficients.
The three coefficients are estimated from the normal equations

Σu  = β0·n  + β1·Σx  + β2·Σy;
Σxu = β0·Σx + β1·Σx² + β2·Σxy;
Σyu = β0·Σy + β1·Σxy + β2·Σy²,

where n is the number of observation points, u is the value of the characteristic at an observation
point, and x and y are the coordinates of the observation points.
To solve the equations they are written in matrix form

| n    Σx    Σy  |   | β0 |   | Σu  |
| Σx   Σx²   Σxy | × | β1 | = | Σxu |
| Σy   Σxy   Σy² |   | β2 |   | Σyu |

and solved for β0, β1 and β2. This method of estimating the polynomial coefficients is
called the least squares method.
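For illustration only, the normal equations above can be assembled and solved numerically; the observation coordinates and values in the following sketch are hypothetical.

```python
import numpy as np

def fit_trend_plane(x, y, u):
    """Least-squares fit of the plane u = b0 + b1*x + b2*y.

    Builds the 3x3 system of normal equations described above
    and solves it for the coefficients b0, b1, b2.
    """
    n = len(u)
    A = np.array([[n,        x.sum(),      y.sum()],
                  [x.sum(),  (x**2).sum(), (x*y).sum()],
                  [y.sum(),  (x*y).sum(),  (y**2).sum()]])
    b = np.array([u.sum(), (x*u).sum(), (y*u).sum()])
    return np.linalg.solve(A, b)          # b0, b1, b2

# Hypothetical observation points
x = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
u = np.array([1.1, 1.9, 3.2, 2.0, 3.1, 3.9])

b0, b1, b2 = fit_trend_plane(x, y, u)
trend = b0 + b1 * x + b2 * y              # background (trend) values
residuals = u - trend                      # local anomalies
```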
Questions
1. Multi-dimensional statistical models. Essence and application conditions.
2. What statistical method is used to solve the problem of the similarity or difference of sampling means?
3. What statistical method is used to divide samples into maximally different groups?
4. What statistical method is used to identify the correlation between variables?
5. What is linear regression? What is nonlinear regression?
6. Characterize spatial models.
7. What is the field of a spatial variable?
Basic sources:
1. Aivazyan S.A., Enyukov I.S., Meshalkin L.D. Practical statistics. Basis of modeling and primary data processing. Reference book. – M.: Finance and statistics, 1983. – 472 p.
2. Aivazyan S.A., Enyukov I.S., Meshalkin L.D. Practical statistics: Dependences study. Reference book. – M.: Finance and statistics, 1985. – 182 p.
3. Borovikov V.P. Statistics for students and engineers. – M.: ComputerPress, 200. – 301 p.
4. Dreiper N., Smith G. Applied regression analysis. Book 1. – M., 1986. – 365 p.
5. Dreiper N., Smith G. Applied regression analysis. Book 2. – M., 1987. – 349 p.
6. Kazhdan A.B., Gus’kov O.I. Mathematical methods in geology. – M.: Nedra, 1990. – 251 p.
7. Kendell M., Stewart A. Statistical conclusions and connections. – M.: Nauka, 1973. – 899 p.
8. Kim Dg.O., Muller Ch.Y., Klekka Y.P. et al. Factor, discriminant and cluster analysis. – M.: Finance and statistics, 1989. – 215 p.
9. Kramer G. Mathematical methods of statistics. – M., 1948. – 631 p.
10. Breiman L., Friedman J.H., Olshen R.A., Stone C.J. Classification and regression trees. – Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984. – 358 p.
LECTURE 3. INTRODUCTION
Definition and content of concept GIS. GIS history. Correlation between GIS and basic
courses. Relevance of GIS use in ecological information processing and representation.
Characteristic of main functions of GIS (information collection and information processing,
modeling and analysis, data use in the process of decision-making). The main classifications of
GIS (Brаcken, Webster, 1990; Koshkarev, Karyakin, 1987) and their characteristic. Academic
literature and educational material, periodical literature and reference materials.
Geographic information systems (GIS) are a new and fast-growing research area at the
interface between computer technologies and the Earth sciences. They are used in many fields of human
activity – from interactive maps on the Internet and satellite communication devices to mining
field development programs.
A geographic information system (State Standard GOST Р 52438-2005) is an information system that uses spatial data. A geographic information system is a computing system for the
collection, checking, integration and analysis of information related to the Earth's surface. Simply
put, a GIS is an electronic map in which the attributes of each object are stored in a table connected
with the map. An information system is a system used for the storage, processing, search, distribution,
transmission and presentation of information (GOST 7.0-99, article 3.1.30).
Data are information presented in a formalized form suitable for processing by automatic
means (GOST 15971-90, article 1).
There are four periods in the history of GIS development.
Pioneer period (the late 1950s – the early 1970s).
Its sources lie in the work of teams that formulated the first objectives of and approaches to
building information systems oriented toward spatial data processing. "First generation" GIS was
much different from modern GIS; the output of the first GIS was not graphic maps but
generalized results of investigations presented in tabular form. These were teams of
researchers and developers from Canada and Sweden. The first GIS appeared in Canada and was
used for the accounting and analysis of natural resources (De Mers, 2000). Its main function
was to input source accounting documents for storage and regular updating, including data
aggregation and the compilation of statistical tabular reports.
Period of the early 1970s – the early 1980s. Period of state initiatives.
The 1970s were characterized by the development of the interaction of geoinformatics methods and
tools with digital mapping techniques and automated cartography. GIS developed on the basis of
information retrieval systems and later acquired the functions of cartographic data banks with data
analysis and modeling capabilities. Most GIS of that time consisted of a number of map construction
tasks and used the map document as the source of data.
In the late 1960s in the USA the opinion formed that GIS technologies should be used for
processing and presenting census data. A methodology providing correct gridding of census data was
required, and converting population addresses to geographic coordinates was the main problem.
A special format of map data representation was developed: the rectangular coordinates of
street intersections segmented the streets of all settlements in the USA into separate segments. The
algorithms for processing and representing map data were taken over from the GIS developers in Canada
and from the Harvard laboratory and were implemented in the program POLYVRT, which converted
population addresses to geographic coordinates and described the street segments graphically.
Thus the topological approach to geographic information management, which contains a mathematical
method of describing the spatial relationships between objects, was widely used for the first time.
Period of the 1980s. Period of business development.
The 1980s were characterized by the dynamic development of GIS. A great number of software
products, desktop GIS, the appearance of network applications, a large number of end users, and
systems supporting individual data sets on separate computers opened the door for distributed
geodatabase systems. The well-known software product ARC/INFO, designed by the Environmental
Systems Research Institute (ESRI Inc.) in 1981, was the most successful realization of the idea of
separate internal representation of geometric and attribute information. At present it is one of the
most popular packages in the world.
Period of the 1990s – present. User period.
This period is characterized by strong competition among commercial producers of
geographic information technologies and services, increased demand for data, and the formation of a
global geographic information infrastructure. Saturation of the GIS software market, especially for
personal computers, expanded the field of application of GIS technologies; this required substantial
sets of digital geodata as well as GIS specialists.
In Russia the development of geoinformatics and GIS began in the late 1980s – early
1990s, and the geographic information boom began in the mid-1990s. Different GIS are now produced
for many areas of knowledge.
GIS are now widely distributed in all fields of human activity connected with the Earth's
surface. They are used in cartography, geology, meteorology, land management, ecology,
municipal administration, defense and other fields. The wide application of GIS is connected with the
high efficiency of these systems and with the results of complex analysis, which cannot be obtained by
simple map analysis or by analysis of data tables alone.
With the help of ArcGIS you can solve GIS problems of any degree of complexity: from a
separate analytical project to the implementation of a multiuser GIS for your organization. Today
thousands of organizations and hundreds of thousands of users use GIS technologies to study and
process various sets of geographically connected information. Electronic town maps are the simplest
example of GIS: we see the map of the town on the screen, select a street and obtain its name,
or, vice versa, type the street name into the search line and see it indicated on the map.
For illustrative purposes, consider some fields where the use of GIS is traditional.
Management and area-wide planning. This field is based on the behaviour of different social
groups, which expresses social needs and opportunities and has a particular placement and dynamics
on the territory.
Urban development and architecture. Planning, engineering survey and town development
are the typical work of the urban services supporting the development of the territory.
Civil engineering infrastructure. Inventory, accounting and planning of the location of
distributed production infrastructure objects (water supply, drainage, heat supply, gas supply,
electricity supply), their management, assessment of their state and decision-making on repairs and
emergency situations.
Land resources management, land cadastre. Compilation of cadastres and classification maps and
the delimitation of territories and areas are the typical problems of this field.
Management of natural resources and environmental activity. Estimation of natural
resource stocks, modeling of processes in the natural environment and decision-making are the typical
problems of this field.
Geology, mineral resources and mineral resource industry. The specific character of these
problems is that mineral reserves have to be calculated for a certain area from measurements at
individual points (boreholes, test holes, etc.) using a known model of the deposit formation process.
Planning and transportation (logistics). The characteristics of the locations where goods are
stored and of the locations where goods are expected, the position, state and
characteristics of the hauling equipment, and the characteristics of the road network (mean speed,
repairs, bypass roads, blocks, boundaries, customs stations, etc.) are placed on the map. It is required
to draw up a schedule of movement and to correct it from time to time when unforeseen situations arise.
Surface, air and water navigation mapping and surface, air and water transport
management. These are well-established fields with well-understood problems. A special place here
belongs to problems of moving-object control in which a system of relations between the moving
objects and stationary objects has to be maintained.
Marketing and market analysis. Development tendencies are identified and the effect of various
topological properties – closeness, crossing and combination of areas – is assessed.
Agriculture. Estimation of reserves from a number of point measurements,
transport planning, interaction of dynamically changing areas, classification and identification of the
"similarity" of spatial objects, precision agriculture.
Emergency situations. Assessment of potentially hazardous facilities, modeling of
consequences in emergency situations.
Fast response service. Public safety, fire fighting activity, emergency medical service.
GIS classification
A lot of problems recurring in practice have led to the formation of different GIS, which can be
classified by the following characteristics:
1) According to the type of architecture they are divided into two classes: open and closed.
Closed systems are characterized by a low price, and the class of problems they solve is fixed
in advance. They are characterized by a simple interface and are quickly
learnt by users. Their function set cannot be changed, and it should be noted that the lifecycle of
these systems is very short.
Open systems have a certain set of functions and special instruments with which users can generate and
build their own applications, thereby expanding the functionality of the base GIS. Open
systems are more expensive, but they have a longer lifecycle and can be adapted to a very
large class of problems.
2) According to hardware platform they are divided into:
- professional GIS;
- desktop GIS.
Well-known systems from ESRI and Intergraph belong to the classical professional GIS. They are powerful
systems designed primarily for workstations and networks, they include modules for vectorizing map
documents, and they support work with many external devices. These systems are built on a modular
principle and can be delivered in flexible configurations.
Desktop GIS are oriented toward personal computers and are used by a large number of users. They have
a smaller function set and a low price and are widely used; in large GIS projects workplaces
powered by them are organized, and the GIS is built as a multilevel system.
3) According to spatial coverage GIS are divided into:
- Global (planetary);
- Continental;
- National (state);
- Regional;
- Local.
Within a state there may be a further gradation. For example, in Russia there are:
- federal GIS (FGIS);
- regional GIS (RGIS);
- municipal GIS (MGIS) and local GIS (LGIS);
4) According to the subject domain of information modeling there are:
- urban, agricultural, geological, environmental, recreational, water source monitoring GIS, etc.
5) According to functionality there are:
- multipurpose (tool, full-function) GIS;
- special-purpose GIS;
- GIS-viewers.
Multipurpose GIS are characterized by openness: they deal with different data formats, possess a
powerful graphics editor, and have tools for developing and implementing different applications
(extending the function set). This class of GIS is widely used because it can be adapted where
necessary and allows different problems in many fields to be solved. As a rule,
these systems possess their own embedded language working with both attribute and
graphic information, and they have tools for embedding program modules written in a high-level language.
Special-purpose GIS solve a few problems using a given parameter set. Their main task is to
control processes and prevent uncontrolled situations.
GIS-viewers are needed to visualize spatial information and print it out. These systems do
not have tools for spatial analysis and modeling.
6) According to used data model:
• Vector GIS are based on vector graphics and work with topological and non-topological
vector data models.
• Raster GIS are based on raster graphics and work with raster data models.
• Hybrid GIS combine vector and raster GIS.
Most modern GIS are not strictly vector or strictly raster: there are usually some
tools for working with raster data in vector GIS and, vice versa, some tools for working with
vector data models in raster GIS.
Functions of GIS.
The set of functions implemented in a GIS depends first of all on the purpose of the system as a whole.
The main functions of GIS are:
- data input and updating;
- data storage and manipulation;
- data analysis;
- data output and presentation of data and results.
A specific feature of GIS is that they handle only computer-readable information, so before being
imported all data must be converted to digital form. Point descriptions and reports are
converted to text files, photos and sketches are scanned and the image files are named according to
the numbers of the observation points. All tabular information is reduced to a common form.
Data viewing. Data visualization. GIS make it possible to visualize data in the form of maps. Any
geoinformation system provides instruments for viewing data. On the screen the map has a
layered organization – every vector map, georeferenced raster or observation point image is
represented by a separate layer which can be switched on and off. The layer-by-layer
organization of data has the following advantages:
- the visibility of layers can be changed during data visualization;
- the order of the layers can be changed during visualization;
- the visualization parameters of every layer can be set independently;
- spatial analysis can be carried out independently for each layer;
- the map can be formed from layers of different levels of detail and origin.
Vector data form vector layers and raster data form raster layers; a raster layer corresponds to one
raster image. Vector layers with similar attributes can be combined into group layers; the same
applies to raster data.
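The following schematic sketch (not tied to any particular GIS product) illustrates how layer visibility and drawing order can be handled independently; the Layer and MapDocument classes and all layer names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Layer:                       # hypothetical structure, for illustration only
    name: str
    kind: str                      # "vector" or "raster"
    visible: bool = True
    features: list = field(default_factory=list)

class MapDocument:
    """Keeps layers in drawing order and draws only the visible ones."""
    def __init__(self) -> None:
        self.layers: List[Layer] = []

    def add(self, layer: Layer) -> None:
        self.layers.append(layer)

    def move_to_top(self, name: str) -> None:
        layer = next(l for l in self.layers if l.name == name)
        self.layers.remove(layer)
        self.layers.append(layer)          # last layer is drawn on top

    def draw(self) -> None:
        for layer in self.layers:          # bottom-to-top drawing order
            if layer.visible:
                print(f"drawing {layer.kind} layer: {layer.name}")

doc = MapDocument()
doc.add(Layer("satellite image", "raster"))
doc.add(Layer("rivers", "vector"))
doc.add(Layer("sample points", "vector"))
doc.layers[0].visible = False              # switch the raster layer off
doc.draw()
```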
Symbols.
A map is a model of the real world in which elements represent real objects or events.
The symbol is the basic element of all cartographic representations; cartographic representations
appear on the map with the help of symbols. There are three basic types of symbols in
cartography: point symbols, line symbols and area symbols.
If objects or events are very small relative to the map scale, they are represented by point symbols. If
objects or events are extended relative to the map scale but have negligible width, they are
represented by line symbols. If objects or events are extensive relative to the map scale and
occupy a closed region, they are represented by area symbols. In addition, text symbols are used
to place text on the map.
Analysis. Geographic information systems differ from other information systems in that
they possess effective capabilities for spatial data analysis, with the help of which spatial modeling
of objects and events can be carried out.
GIS is a tool of spatial analysis, and spatial analysis is called the "heart" of GIS. Maps can be
compared by switching map layers on and off. By displaying the chemical analysis data for different
elements, for example, we can draw conclusions about distribution regularities.
The tasks of spatial analysis can be divided into five generalized categories.
1) Location analysis. This category corresponds to a spatial query about where objects are located.
The pattern of spatial distribution of objects is represented visually on the map, which makes it
possible to show relationships between them and to understand the investigated field better. Only after
seeing the object locations is it possible to understand some of the reasons for the spatial
interrelationships. In order to investigate regularities in the data distribution, the objects under
study have to be displayed in a certain manner based on the values of their characteristics.
2) Satisfaction of spatial conditions. The spatial query corresponding to this category is: where are
the spatial conditions satisfied? A simple query about object location consists of one condition,
and to answer it one ordinary operation is enough. A complex query about
object location can include a set of conditions, and to answer it several operations of spatial
analysis have to be carried out. For example: a) where is there a construction site of 2 hectares
within 200 m of a district road with a bearing capacity of the soil of up to 1 kg per
square centimeter? b) validate the location of a shopping center, educational organization or
business center taking spatial factors into account; c) find the optimal route for a pipeline that is being designed.
3) Time analysis. The query corresponding to this category is: what has changed spatially over the
specified period? Answering this question is an attempt to identify changes that have occurred in space
and time and the tendencies of these changes on a certain territory. For example: what is the trend in
the spread of flu in the town, what new objects have been built recently, how is the town sprawling?
By saving and comparing maps of different dates GIS can carry out time analysis.
4) Identification of structure. The spatial query corresponding to this category is: what spatial
structures and distributions are there? For example, how many anomalies are there that do not fit the
normal distribution, and where are they? What is the population distribution in
the town? Which roads are the most dangerous? Identifying spatial structures is a very difficult
problem requiring powerful tools of spatial analysis.
5) Assessment of different scenarios. A potential scenario is the result of questions of the kind
"What will happen if…". For example, what will happen if the rainfall intensity becomes critical? What
will the costs be if a street is widened by 14 m? How will transport communication change if the tram
line is removed from Pushkin Street? In these cases the user uses a prediction model and maps of
potential effects. The use of such models allows hypothetical situations to be constructed and the
development and consequences of social and economic situations, disasters and
technogenic accidents to be forecast in space and time.
In recent years an evident growth of the analytical and modeling functions of GIS has been observed. For
example, the ArcGIS 9.3 system (ESRI) includes the modules Spatial Analyst, 3D Analyst,
Network Analyst and Geostatistical Analyst.
Map printing. Besides presenting data on the screen, GIS include elements of a cartographic
publishing system that can lay out maps at the required scales and generate legends,
scale bars, north arrows, etc.
Questions
1. Enumerate the main kinds of GIS classification.
2. Characterize the structure of GIS.
3. The basic functional capabilities of GIS.
4. What is the layered approach to data management?
5. Characterize the history of GIS development.
Basic sources:
1. Ananiev Yu.S. Geographic information systems: Manual. – Tomsk: Tomsk Polytechnic University Publishing House, 2003. – 69 p.
2. GOST Р 52155-2003 Geographic information systems. Federal, regional, municipal.
3. Koshkarev A.V., Tikunov V.S. Geoinformatics. – M.: Kartgeocenter-Geoizdat, 1993. – 213 p.
4. Kuznetsov O.L., Nikitin A.A. Geoinformatics. – M.: Nedra, 1992. – 301 p.
LECTURE 4. STUDY OF GEOGRAPHICAL DATA
Concept of data. Geographic data. Discrete and continuous data. Three principal data
components: attribute, geographic and temporal data. Formats of GIS data. Vector data. Raster
data. Representation of spatial data. Concept and advantages of the geodatabase.
ArcCatalog: data management application. ArcMap: data display application. Data and
layers. Types of maps and their characteristics. Map layout.
The term "Data" derives from the Latin “Datum” – fact (English: Data – данные). Data –
collection of facts represented in a formalized manner (in quantitative and qualitative terms).
Data correspond to discrete records concerning phenomena. Data correspond to information and
facts usually collected as a result of experience, observations and experiment, processes and
assumptions in computer systems. Data can consist of numbers, words or images, especially as
results of measurements or observations of variable set. Data is often represented as low-level
abstraction. This level is necessary to obtain information and knowledge. We get real-world
information.
Geographic data represent the unity of geospatial, semantic and temporal data about
geographic locations.
Geospatial data are data about spatial properties: the location, form, sizes and spatial relationships
of geographic objects, phenomena and processes on the real Earth's surface.
Semantic data describe the content and meaning of geographic objects and their properties.
Temporal data fix the time at which the object was investigated and show how the properties of the
object change over time. The main requirement on temporal data is relevance, which means
that the data can be used for today's processing. Irrelevant data are stale data which cannot be used in
the new, changed conditions.
In most geographic information technologies a single class of data – attributes – is used to
represent both the temporal and the thematic components.
A geographic information system must be able to keep all components of geographic data
under joint control.
Discrete geographical objects are separate bounded macrobodies of the real Earth's
surface. A discrete geographical object can be present or absent at any given place on the Earth's surface:
for example, manholes, road traffic accidents (RTA), roads, pipelines, buildings, blocks, zones, and so on.
Continuous phenomena (fields) characterize the territory as a whole rather than
isolated objects. For example, surfaces, precipitation and temperature can be measured
at any place on the territory and characterize it as a whole.
Two main methods used in GIS are referred to as the vector model and the raster model.
VECTOR MODELS
The vector model of data essentially entails recording the grid coordinates of the points,
lines or polygons used to depict a feature. Each point may be recorded as a single x,y coordinate
pair. Lines and polygons are recorded as series of x,y coordinate pairs. The computer
reconstructs the shape of lines and polygons by joining each successive coordinate pair with a
straight line. Vector models are convenient for representing and storing discrete objects such as
buildings, pipelines or the boundaries of areas. For example, the point locations (point objects) of
bore-holes are described by x,y coordinate pairs. Linear objects such as roads, rivers or pipelines
are stored as series of x,y coordinate pairs.
Polygonal objects such as river-drainage systems, land parcels or service sectors are stored in
the form of closed sets of coordinate pairs.
Coordinates are often pairs (x,y) or triples (x,y,z, where z is, for example, altitude).
Coordinate values depend on the geographic reference system in which the data are stored.
ArcGIS stores vector data in feature classes and in sets of topologically related feature classes.
The attributes connected with the features are stored in data tables. To represent spatial data
ArcGIS uses three different realizations of the vector model: coverages, shapefiles and geodatabases.
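As an illustration of the vector model, the sketch below uses the shapely geometry library (any comparable geometry library would serve); the coordinates and object names are hypothetical.

```python
from shapely.geometry import Point, LineString, Polygon

# A point object (e.g. a bore-hole) is a single x,y pair
borehole = Point(356420.0, 6254310.0)

# A linear object (e.g. a pipeline) is a series of x,y pairs
pipeline = LineString([(356000, 6254000), (356500, 6254200), (357100, 6254150)])

# A polygonal object (e.g. a land parcel) is a closed series of x,y pairs
parcel = Polygon([(356100, 6254100), (356400, 6254100),
                  (356400, 6254400), (356100, 6254400)])

print(pipeline.length)             # length reconstructed from the vertices
print(parcel.area)                 # area of the closed contour
print(parcel.contains(borehole))   # simple spatial relationship query
```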
RASTER MODEL OF DATA
The raster model of data is a digital representation of spatial objects in the form of a set of raster
cells (pixels) with assigned values of the object class. The cell is the elementary unit of the raster model.
The raster model is effective for operating on continuous properties. A raster image is a set
of values for cells, like a scanned map or a picture. In the raster model the real world is
represented as a surface divided uniformly into cells. The x,y coordinates of at least one corner of the
raster are known, and therefore its location in geographic space is determined. Raster models are
convenient for storing and analyzing data distributed continuously over a certain area. Each cell
contains a value determining the class or category; it may be a measurement or the result of
interpretation. Raster data include images, for example airborne imagery, satellite data or
scanned maps; they are often used as source data for GIS.
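A minimal sketch of the raster model: assuming the cell values are held in a NumPy array and the coordinates of the upper-left corner and the cell size are known, every cell can be placed in geographic space. All numbers here are hypothetical.

```python
import numpy as np

values = np.array([[12.1, 13.4, 15.0],
                   [11.8, 14.2, 16.3],
                   [10.9, 13.7, 15.8]])   # e.g. element concentrations per cell

x_origin, y_origin = 500000.0, 6250000.0  # upper-left corner of the raster
cell_size = 100.0                          # cell size in map units (metres)

def cell_center(row, col):
    """Map coordinates of the centre of cell (row, col)."""
    x = x_origin + (col + 0.5) * cell_size
    y = y_origin - (row + 0.5) * cell_size   # rows grow downwards
    return x, y

print(cell_center(0, 0))    # (500050.0, 6249950.0)
print(values[0, 0])         # value stored in that cell
```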
The sources of raster data are:
- images:
aerial photographs of the territory;
satellite images of territory;
photography of objects;
- drawings:
topographical maps;
plans;
technical drawings;
schemes;
- figures;
- texts:
documents;
tables.
Both models have advantages and disadvantages. Modern GIS can work with both vector
and raster models.
ATTRIBUTE DATA. The properties of geographical objects are represented in the database
by a set of attributes. An attribute is a synonym of requisite: a property, a qualitative or quantitative
characteristic describing a spatial object and associated with its unique number or identifier.
Sets of attribute values are usually represented in the form of tables of a relational database. In
this case a row (record) represents the attributes of one object, and a column (field) represents
attributes of one type. The tools of a database management system (DBMS) are used for
ordering, storing and manipulating attribute data.
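A sketch of the tabular organization of attribute data, using pandas purely for illustration; the attribute names and values are hypothetical.

```python
import pandas as pd

# One row (record) per spatial object, one column (field) per attribute
attributes = pd.DataFrame({
    "object_id": [101, 102, 103],            # unique identifier of the object
    "type":      ["well", "well", "spring"],
    "depth_m":   [120.0, 85.5, None],
    "pH":        [7.2, 6.8, 7.9],
})

# A typical attribute query of the kind performed with DBMS tools
deep_wells = attributes[(attributes["type"] == "well") &
                        (attributes["depth_m"] > 100)]
print(deep_wells)
```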
METADATA. Metadata is information that describes the content, quality, condition, origin, and other
characteristics of data or other pieces of information. Metadata for spatial data may describe and
document its subject matter; how, when, where, and by whom the data was collected; availability
and distribution information; its projection, scale, resolution, and accuracy; and its reliability
with regard to some standard. Metadata consists of properties and documentation. Properties are
derived from the data source (for example, the coordinate system and projection of the data),
while documentation is entered by a person (for example, keywords used to describe the data).
Geodatabases. Geodatabases implement an object-based GIS data model – the geodatabase
data model. A geodatabase stores each feature as a row in a table. The vector shape of the feature
is stored in the table’s shape field, with the feature attributes in other fields. Each table stores a
feature class. In addition to features, geodatabases can also store rasters, data tables, and
references to other tables. Geodatabases are repositories that can hold all of your spatial data in
one location. They are like adding coverages, shapefiles, and rasters into a DBMS. However,
they also add important new capabilities over file-based data models.
Some advantages of a geodatabase are that features in geodatabases can have built-in behavior;
geodatabase features are completely stored in a single database; and large geodatabase feature
classes can be stored seamlessly, not tiled. In addition to generic features, such as points, lines,
and areas, you can create custom features such as transformers, pipes, and parcels. Custom
features can have special behavior to better represent real-world objects. You can use this
behavior to support sophisticated modeling of networks, data entry error prevention, custom
rendering of features, and custom forms for inspecting or entering attributes of features.
Features in geodatabases. Because you can create your own custom objects, the number of
potential feature classes is unlimited. The basic geometries (shapes) for geodatabase feature
classes are points, multipoints, network junctions, lines, network edges, and polygons. You can
also create features with new geometries.
All point, line, and polygon feature classes can:
- Be multipart (for example, like multipoint shapes or regions in a coverage).
- Have x,y; x,y,z; or x,y,z,m coordinates (m-coordinates store distance measurement values such as the distance to each milepost along a highway).
- Be stored as continuous layers instead of tiled.
Whether you use GIS in a project or multiuser environment, you can use the three ArcGIS desktop applications – ArcCatalog, ArcMap, and ArcToolbox – to do your work.
ArcCatalog is the application for managing your spatial data holdings, for managing your
database designs, and for recording and viewing metadata. ArcMap is used for all mapping and
editing tasks, as well as for map-based analysis. ArcToolbox is used for data conversion and
geoprocessing. Using these three applications together, you can perform any GIS task, simple to
advanced, including mapping, data management, geographic analysis, data editing, and
geoprocessing.
ArcCatalog lets you find, preview, document, and organize geographic data and create
sophisticated geodatabases to store that data. ArcCatalog provides a framework for organizing
large and diverse stores of GIS data. You can use ArcCatalog to organize folders and file-based
data when you build project databases on your computer. You can create personal geodatabases
on your computer and use tools in ArcCatalog to create or import feature classes and tables. You
can also view and update metadata, allowing you to document your datasets and projects.
ArcMap lets you create and interact with maps. In ArcMap, you can view, edit, and
analyze your geographic data. You can query your spatial data to find and understand
relationships among geographic features. You can symbolize your data in a wide variety of
ways. You can create charts and reports to communicate your understanding with others. You
can lay out your maps in a what-you-see-is-what-you-get layout view. With ArcMap, you can
create maps that integrate data in a wide variety of formats including shapefiles, coverages,
tables, computer-aided drafting (CAD) drawings, images, grids, and triangulated irregular
networks (TINs).
ArcToolbox is a simple application containing many GIS tools used for geoprocessing.
Simple geoprocessing tasks are accomplished through form-based tools. More complex
operations can be done with the aid of wizards.
A geographical map is a reduced, generalized image of the Earth's surface on a plane, built in a
certain map projection with allowance for the curvature of the surface, which shows the location,
combination and relations of natural and social phenomena selected and characterized in
accordance with the purpose of the map.
Classification
Geographical maps are subdivided into the following categories.
According to spatial coverage:
- maps of the world;
- maps of continents;
- maps of countries and regions.
According to scale:
- large-scale (1:200,000 and larger);
- medium-scale (from 1:200,000 to 1:1,000,000 inclusive);
- small-scale (smaller than 1:1,000,000).
Maps of different scales have different accuracy and detail of the image, different levels of
generalization and different purposes.
According to purpose:
- scientific reference maps are intended for carrying out research and obtaining the fullest possible information;
- cultural and educational maps are intended for the popularization of knowledge and ideas;
- training maps are used as visual aids for studying geography, history, geology and other disciplines;
- engineering maps represent the objects and conditions that are necessary for solving technical tasks;
- tourist maps include such common geographic features as road networks, population centers,
rivers, lakes, forests, and land relief, as well as items of special tourist interest, including
architectural and historical landmarks, preserves, national parks, museums, hotels, tourist
centers, and camping sites. Such maps serve to acquaint tourists with a given district and
provide information on possible travel routes, on the location of specific landmarks, and on
the availability of tourist services;
- navigation (road) maps, etc.
According to content
Geographical (physical) maps are mainly used to depict physical features such as various
landforms and water bodies, deserts and plains, climate, vegetation and erosion present on the
Earth's surface.
Large-scale geographical maps on which all landmarks are represented are called topographic maps,
medium-scale geographical maps are called topographic survey maps, and small-scale geographical
maps are called survey maps.
Thematic maps
A thematic map is a type of map especially designed to show a particular theme connected with a
specific geographic area. These maps "can portray physical, social, political, cultural, economic,
sociological, agricultural, or any other aspects of a city, state, region, nation, or continent".
They can be divided into two groups: maps of natural phenomena and maps of social
phenomena.
Maps of natural phenomena include all components of the natural environment and their
combinations. This group consists of geological maps, geophysical maps, maps of the surface relief
and the bottom of the World Ocean, meteorological and climatic maps, oceanographic, botanical,
hydrological and soil maps, maps of mineral resources, maps of physical-geographical landscapes
and physical-geographical regionalization, etc. Socio-political maps include maps of
population, economic, political, historical and social-geographical maps, and each of these subcategories,
in turn, can have its own structure of division. Thus, economic maps include maps of
industry (both general and branch), agriculture, fisheries, transport and communications, etc.
Layout is the arrangement of the elements of a digital map image or printed map, including titles,
the legend, the north arrow, the scale bar and the geographical data. A layout represents a set of
map elements together with your geographic data (i.e. the data frame).
In digital cartography, a map element is a distinctly identifiable graphic or object in the map or page layout. For
example, a map element can be a title, scale bar, legend, or other map-surround element. The
map area itself can be considered a map element, or an object within the map can be referred to
as a map element, such as a roads layer or a school symbol.
MAP ELEMENTS:
The north arrow shows how the map is oriented.
The map scale helps to visualize the real sizes of objects and the distances between them
on the map. A scale bar is a line or bar divided into parts and labeled with the corresponding real
distances on the ground. It is usually made in multiples of map units such as tens of
kilometers or hundreds of miles. If a map is enlarged or reduced, the scale bar changes with it.
Scale text. The scale of the map can also be shown with text. The scale text expresses the scale of
the map and of its spatial objects; it tells the user what real distance is represented by a particular
unit on the map, for example, "one centimeter is equal to 100,000 meters".
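A small worked example of the scale arithmetic, assuming only that at scale 1:M a map distance corresponds to a ground distance M times larger.

```python
def ground_distance_m(map_distance_cm: float, scale_denominator: int) -> float:
    """Convert a distance measured on the map (in cm) to metres on the ground."""
    return map_distance_cm * scale_denominator / 100.0   # cm -> m

# At 1:10,000,000 one centimetre on the map corresponds to 100,000 m on the ground
print(ground_distance_m(1.0, 10_000_000))   # 100000.0
# At 1:200,000 (the boundary of large-scale maps above) it corresponds to 2,000 m
print(ground_distance_m(1.0, 200_000))      # 2000.0
```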
The legend shows which symbols are used to map which objects. Legends consist of samples of the
map symbols with descriptive text. When one symbol is used for all objects of a layer, the
layer's name is given in the legend. If several symbols are used to represent the objects of one layer,
the attribute used to classify the objects becomes the heading of the legend, and each category is
labeled with the corresponding value. The legend thus contains small fragments – samples of the map symbols.
In the process of layout it is necessary to take into account the goals, the conditions of use and the
audience who will use the map.
Questions
1. The basic sources of data in GIS.
2. The approaches of position measurement.
3. The basic methods of data input in GIS.
4. Data structure in GIS.
5. What variants of spatial and attribute data connections are there?
6. Name the basic characteristics of raster models of spatial data.
7. Surface analysis in GIS.
Basic sources:
8. Ananiev Yu.S. Geographic information systems: Manual. – Tomsk: Tomsk Polytechnic University Publishing House, 2003. – 69 p.
9. Chandra A.M., Gosh S.K. Remote sensing and geographic information systems. – M.: Technosfera, 2008. – 312 p.
10. Tsvetkov V.Ya. Geographic information systems and technologies. – M.: Finance and statistics, 1998. – 288 p.
LECTURE 5. COORDINATE SYSTEMS AND MAP PROJECTION
Geographical coordinate system. Datum. Projection coordinate systems. Map projection.
Map projection classification. Projection interpretation and conversion.
Consideration of the computer model allows us to formulate the principle of real-world representation in GIS:
models of geographical objects and their spatial properties – location, shape, size, spatial relations –
are represented by coordinates, and these coordinates are connected with the location of real-world
objects by means of a coordinate system.
What is a coordinate system? Coordinate systems are arbitrary designations for spatial
data. Their purpose is to provide a common basis for communication about a particular place or
area on the earth's surface. Within ArcGIS a coordinate system is a system which localizes positions
in space and defines the relationships between positions.
A geographic (or geodetic) coordinate system (GCS) uses a three-dimensional spherical
surface to define locations on the earth. A GCS is often incorrectly called a datum, but a datum is
only one part of a GCS. A GCS includes an angular unit of measure, a prime meridian, and a
datum (based on a spheroid). The spheroid defines the size and shape of the earth model, while
the datum connects the spheroid to the earth's surface.
A point is referenced by its longitude and latitude values. Longitude and latitude are angles
measured from the earth's center to a point on the earth's surface. The angles often are measured
in degrees (or in grads). In the spherical system, horizontal lines, or east-west lines, are lines of
equal latitude, or parallels. Vertical lines, or north-south lines, are lines of equal longitude, or
meridians. These lines encompass the globe and form a gridded network called a graticule.
The line of latitude midway between the poles is called the equator. It defines the line of zero
latitude. The line of zero longitude is called the prime meridian. For most GCSs, the prime
meridian is the longitude that passes through Greenwich, England. The origin of the graticule
(0,0) is defined by where the equator and prime meridian intersect.
Latitude and longitude values are traditionally measured either in decimal degrees or in
degrees, minutes, and seconds. Latitude values are measured relative to the equator and range
from –90° at the South pole to +90° at the North pole. Longitude values are measured relative to
the prime meridian. They range from –180° when traveling west to 180° when traveling east.
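Because latitude and longitude are angles, distances between points follow great circles rather than straight lines; the sketch below is a standard haversine calculation that assumes a spherical Earth with a mean radius of 6371 km, and the coordinates are illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/long points on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Distance from the graticule origin (0, 0) to 10 degrees north along the prime meridian
print(haversine_km(0.0, 0.0, 10.0, 0.0))   # roughly 1112 km
```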
SPHEROIDS AND SPHERES.
The shape and size of a geographic coordinate system's surface is defined by a sphere or
spheroid. Although the earth is best represented by a spheroid, it is sometimes treated as a sphere
to make mathematical calculations easier. The assumption that the earth is a sphere is possible
for small-scale maps (smaller than 1:5,000,000). At this scale, the difference between a sphere
and a spheroid is not detectable on a map. However, to maintain accuracy for larger-scale maps
(scales of 1:1,000,000 or larger), a spheroid is necessary to represent the shape of the earth. A
sphere is based on a circle, while a spheroid (or ellipsoid) is based on an ellipse.
The shape of an ellipse is defined by two radii. The longer radius is called the semimajor axis,
and the shorter radius is called the semiminor axis. Rotating the ellipse around the semiminor
axis creates a spheroid. A spheroid is also known as an oblate ellipsoid of rotation.
As a rule, a spheroid is chosen for one country or certain territory. If the spheroid is ideally
suited for one geographic region, it does not mean that it is suited for another region.
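As a worked illustration of the semimajor and semiminor axes, the flattening of the WGS84 spheroid can be computed directly from its published axis lengths.

```python
# WGS84 spheroid parameters (published values)
a = 6378137.0          # semimajor axis, metres
b = 6356752.314245     # semiminor axis, metres

flattening = (a - b) / a
inverse_flattening = 1.0 / flattening

print(flattening)            # about 0.0033528
print(inverse_flattening)    # about 298.2572
```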
Datums. A coordinate system is defined by a datum and a map projection. While a spheroid
approximates the shape of the earth, a datum defines the position of the spheroid relative to the
center of the earth. A datum provides a frame of reference for measuring locations on the surface
of the earth. It defines the origin and orientation of latitude and longitude lines.
A datum is a frame of reference which describes the shape and size of the Earth and the origin,
orientation and scale of the coordinate systems used to determine locations relative to the Earth by
coordinates; it is a mathematical representation of the shape of the Earth's surface. A geographic
(datum) transformation is a well-defined mathematical method of converting coordinates between two
geographic coordinate systems. As with the coordinate systems, there are several hundred predefined
geographic transformations that you can access. It is very important to correctly use a geographic
transformation if it is required. When neglected, coordinates can be in the wrong location by up
to a few hundred meters. Sometimes no transformation exists, or you have to use a third GCS
like the World Geodetic System 1984 (WGS84) and combine two transformations. The World
Geodetic System is a base of the location measurement all over the world.
MAP PROJECTION
Whether you treat the earth as a sphere or a spheroid, you must transform its three-dimensional surface to create a flat map sheet. This mathematical transformation is commonly
referred to as a map projection. One easy way to understand how map projections alter spatial
properties is to visualize shining a light through the earth onto a surface, called the projection
surface. Imagine the earth's surface is clear with the graticule drawn on it. Wrap a piece of paper
around the earth. A light at the center of the earth will cast the shadows of the graticule onto the
piece of paper. You can now unwrap the paper and lay it flat. The shape of the graticule on the
flat paper is different from that on the earth. The map projection has distorted the graticule. A
spheroid can't be flattened to a plane any more easily than a piece of orange peel can be
flattened—it will rip. Representing the earth's surface in two dimensions causes distortion in the
shape, area, distance, or direction of the data. A map projection uses mathematical formulas to
relate spherical coordinates on the globe to flat, planar coordinates. Different projections cause
different types of distortions. Some projections are designed to minimize the distortion of one or
two of the data's characteristics. A projection could maintain the area of a feature but alter its
shape.
The process of transferring information from the Earth to a map causes every projection to
distort at least one aspect of the real world – shape, area, distance, or direction. If you deal with
small areas such as a town or a district, the distortion will not be large and may not be
noticeable on your map or in your measurements. If you work at the national, continental or global level, it
is necessary to choose a map projection that distorts as little as possible those properties which
are most important for your project.
Different projections cause different types of distortions. Some projections are designed to
minimize the distortion of one or two of the data's characteristics.
Classification based on distortion characteristics.
Conformal projections. A projection that maintains angular relationships and accurate
shapes over small areas is called a conformal projection. They save small local shapes without
distortions.
Equal area projections. A projection that maintains accurate relative sizes is called an equal
area, or equivalent projection. These projections are used for maps that show distributions or
other phenomena where showing area accurately is important.
Equidistant projections. A projection that maintains accurate distances from the center of the
projection or along given lines is called an equidistant projection. Equidistant projection maps
keep the distance between certain points.
Classification based on developable surface.
Map projections can also be classified based on the shape of the developable surface to
which the Earth's surface is projected. A developable surface is a simple geometric form capable
of being flattened without stretching, such as a cylinder, cone, or plane. Surfaces of projections
can take normal, transverse or oblique position relate to the axis of rotation of sphere or
spheroid.
Conic projection
A conic (or conical) projection is a type of map in which a cone is wrapped around a
sphere (the globe), and the details of the globe are projected onto the conical surface. This
projection is based on the concept of the ‘piece of paper’ being rolled into a cone shape and
touching the Earth on a circular line. Most commonly, the tip of the cone is positioned over a
Pole and the line where the cone touches the Earth is a line of latitude; but this is not essential.
The line of latitude where the cone touches the Earth is called standard parallel. Because of the
distortions away from the standard parallel, conic projections are usually used to map limited regions of
the Earth – particularly in mid-latitude areas. On a typical conic projection map distortions are greatest
to the north and south, away from the standard parallel; because the standard parallel runs east-west,
distortions are minimal through the middle of the map. In conical or conic projections, the
reference spherical surface is projected onto a cone placed over the globe. The cone is cut
lengthwise and unwrapped to form a flat map. The cone may be either tangent to the reference
surface along a small circle or it may cut through the globe and be secant at two small circles.
Examples of conic projections include Lambert conformal conic, Albers equal area conic, and
equidistant conic projections. On the Albers equal area conic projection the parallels near the North
and South poles are spaced more closely than the central parallels, and the projection preserves areas.
Cylindrical projections
Like conic projections, cylindrical projections can also have tangent or secant cases. The
Mercator projection is one of the most common cylindrical projections, and the equator is
usually its line of tangency. Meridians are geometrically projected onto the cylindrical surface,
and parallels are mathematically projected. This produces graticular angles of 90 degrees. The
cylinder is "cut" along any meridian to produce the final cylindrical projection. The meridians
are equally spaced, while the spacing between parallel lines of latitude increases toward the
poles. This projection is conformal and displays true direction along straight lines. On a
Mercator projection, rhumb lines, lines of constant bearing, are straight lines, but most great
circles are not. For more complex cylindrical projections, the cylinder is rotated, thus changing
the tangent or secant lines. Transverse cylindrical projections, such as the Transverse Mercator,
use a meridian as the tangential contact or lines parallel to meridians as lines of secancy. The
standard lines then run north-south, along which the scale is true. Oblique cylinders are rotated
around a great circle line located anywhere between the equator and the meridians. In these more
complex projections, most meridians and lines of latitude are no longer straight. In all cylindrical
projections, the line of tangency or lines of secancy have no distortion and thus are equidistant
lines. Other geographic properties vary according to the specific projection.
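A sketch of the forward equations of the spherical (tangent) Mercator projection, x = R·λ and y = R·ln(tan(π/4 + φ/2)); the spherical radius R is an assumption of the sketch, and the printout shows how the spacing of parallels grows toward the poles.

```python
import math

R = 6371000.0   # spherical Earth radius in metres (an assumption of this sketch)

def mercator(lat_deg, lon_deg):
    """Forward spherical Mercator projection (cylinder tangent at the equator)."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = R * lam
    y = R * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y

for lat in (0, 30, 60, 80):
    # y grows faster and faster toward the poles, stretching the parallels apart
    print(lat, round(mercator(lat, 0.0)[1]))
```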
Planar projections (azimuthal projections)
Planar projections project map data onto a flat surface touching the globe. A planar projection is
also
known as an azimuthal projection or a zenithal projection. This type of projection is usually
tangent to the globe at one point but may be secant also. The point of contact may be the North
Pole, the South Pole, a point on the equator, or any point in between. This point specifies the
aspect and is the focus of the projection. The focus is identified by a central longitude and a
central latitude. Possible aspects are polar, equatorial, and oblique. Polar aspects are the simplest
form. Parallels of latitude are concentric circles centered on the pole, and meridians are straight
lines that intersect with their true angles of orientation at the pole. Planar projections are used
most often to map polar regions.
Gauss-Kruger projection
DESCRIPTION
This projection is similar to the Mercator except that the cylinder is
longitudinal along a meridian instead of the equator. The result is a conformal projection that
does not maintain true directions. The central meridian is placed on the region to be highlighted.
This centering minimizes distortion of all properties in that region. This projection is best suited
for land masses that stretch north-south. The Gauss-Kruger coordinate system is based on the
Gauss-Kruger projection.
PROJECTION METHOD
Cylindrical projection with central meridian placed in a particular region.
LINES OF CONTACT
Any single meridian for the tangent projection. For the secant projection, two parallel lines
equidistant from the central meridian.
LINEAR GRATICULES
The equator and the central meridian.
PROPERTIES
Shape
Conformal. Small shapes are maintained. Shapes of larger regions are increasingly distorted
away from the central meridian.
Area
Distortion increases with distance from the central meridian.
Direction
Local angles are accurate everywhere.
Distance
Accurate scale along the central meridian if the scale factor is 1.0. If it is less than 1.0, there are
two straight lines with accurate scale equidistant from and on each side of the central meridian.
APPLICATIONS
Gauss-Kruger coordinate system. Gauss-Kruger divides the world into zones six degrees
wide. Each zone has a scale factor of 1.0 and a false easting of 500,000 meters. The central
meridian of zone 1 is at 3° E. Some places also add the zone number times one million to the
500,000 false easting value. Gauss-Kruger zone 5 could have a false easting value of 500,000 or
5,500,000 meters. Three degree Gauss-Kruger zones exist also. The UTM system is similar. The
scale factor is 0.9996, and the central meridian of UTM zone 1 is at 177° W. The false easting
value is 500,000 meters, and southern hemisphere zones also have a false northing of
10,000,000.
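A small sketch of the zone arithmetic described above; it encodes only the relations stated in the text (six-degree zones, Gauss-Kruger zone 1 centred at 3° E, UTM zone 1 centred at 177° W), and the longitudes used are hypothetical.

```python
def gauss_kruger_zone(lon_east: float) -> int:
    """Six-degree Gauss-Kruger zone number for an eastern longitude (0..180)."""
    return int(lon_east // 6) + 1

def gauss_kruger_central_meridian(zone: int) -> float:
    return 6 * zone - 3            # zone 1 -> 3 degrees E

def utm_zone(lon: float) -> int:
    """UTM zone number for a longitude in the range -180..180."""
    return int((lon + 180) // 6) + 1

def utm_central_meridian(zone: int) -> float:
    return -183 + 6 * zone         # zone 1 -> -177 degrees (177 W)

print(gauss_kruger_zone(84.9), gauss_kruger_central_meridian(15))   # 15 87
print(utm_zone(84.9), utm_central_meridian(45))                     # 45 87
```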
Questions:
1. The basic principles and methods of mathematical modeling.
2. The approaches of position measurement.
3. Methods of data output and data visualization in GIS.
Basic sources:
1. Ananiev Yu.S. Geographic information systems: Manual. – Tomsk: Tomsk Polytechnic University Publishing House, 2003. – 69 p.
2. Geographic information systems: Annals. – M.: KIBERSO. – 112 p.
3. Chandra A.M., Gosh S.K. Remote sensing and geographic information systems. – M.: Technosfera, 2008. – 312 p.
4. Tsvetkov V.Ya. Geographic information systems and technologies. – M.: Finance and statistics, 1998. – 288 p.