CORRELATION COEFFICIENT
E.P. Yankovich
To model geoecological objects and processes as complex natural systems, it is necessary to consider a number of their properties, since the aim is to clarify the generic structure of the studied object. In some cases the studied properties are presented independently of one another; in other cases more or less clear interrelations can be found between them.
The linear (Pearson) correlation coefficient, which assumes a normal law of distribution of the observations, is widely used to estimate the degree of interrelation.
The correlation coefficient is a parameter characterizing the degree of linear interrelation between two samples. It ranges from –1 (a strict inverse linear relationship) to 1 (strict direct proportionality). A value of 0 means there is no linear relationship between the two samples. Here, direct dependence means that an increase or decrease in the value of one property leads to an increase or decrease in the value of the second property, respectively.
The sample estimate of the correlation coefficient can be calculated according to the formula

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n S_x S_y}

where \bar{x} and \bar{y} are sample estimates of the average values of the random variables X and Y; S_x and S_y are sample estimates of their standard deviations; n is the number of compared paired values.
When calculations are carried out by hand, this formula is used:

r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[ n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 \right] \left[ n \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2 \right]}}
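The hand-calculation formula above can be sketched in code (an illustrative snippet; the function name pearson_r is my own, not from the source):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation via the hand-calculation formula:
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2)(n*Syy - Sy^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# A strict direct proportion gives r = 1, a strict inverse
# linear relationship gives r = -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```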
If, because of the small amount of data, you cannot test the hypothesis that the empirical distribution is in accord with the normal law, Spearman's rank correlation coefficient can be used instead.
Its calculation is based on replacing the sample values of the investigated random variables by their ranks in increasing order. It is assumed that if there is no correlation dependence between the values of the random variables, the ranks of these variables will be independent.
The expression for calculating the rank correlation coefficient is:
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

where d_i is the rank difference of the conjugate values of the studied variables x_i and y_i, and n is the number of pairs in the sample.
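The rank-based calculation can be sketched as follows (an illustration; the helper names ranks and spearman_r are assumptions, and ties are handled by average ranks, which the text does not discuss):

```python
def ranks(values):
    """Average 1-based ranks; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_r(xs, ys):
    """r_s = 1 - 6*sum(d_i^2) / (n*(n^2 - 1)), d_i = rank difference.
    (Exact when there are no ties.)"""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A monotone (not necessarily linear) relationship gives r_s = 1.
print(spearman_r([10, 20, 30, 40], [1, 4, 9, 16]))   # 1.0
```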
LAWS OF RANDOM VARIABLE
DISTRIBUTION
The law of distribution of a random variable is the relationship between all possible values of the random variable and their corresponding probabilities.
The law of distribution can be presented in tabulated form, graphically, or in the form of a distribution function.
A distribution series is the population of possible values x_i and their corresponding probabilities p_i = P(X = x_i); it can be presented in tabulated form. Here the probabilities p_i satisfy

\sum_{i=1}^{k} p_i = 1

where the number of possible values k can be finite or infinite.
The graphic presentation of a distribution series is called a distribution polygon. To draw the distribution polygon, the possible values of the random variable (x_i) are plotted on the abscissa and the probabilities p_i on the ordinate; the points A_i with coordinates (x_i, p_i) are connected by broken lines.
If the true probabilities are not known, the relative frequency of occurrence of each value is plotted on the ordinate instead.
The distribution function is the most common form of describing the distribution law. It defines the probability that the random variable ξ will take a value less than any specified value x. This probability depends on x and is therefore a function of x, i.e. F(x) = P(ξ < x).
[Figure: graphs of the integral distribution function for a discrete and for a continuous random variable]
The function F(x) for a discrete random variable is calculated by the formula

F(x) = \sum_{x_i < x} p_i

where the summation is carried out over all i for which x_i < x.
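A minimal sketch of building F(x) from a distribution series (the helper name cdf_from_series and the series values are my own, chosen for illustration):

```python
def cdf_from_series(values, probs):
    """Return the distribution function F(x) = sum of p_i over x_i < x."""
    def F(x):
        return sum(p for v, p in zip(values, probs) if v < x)
    return F

# Distribution series: P(X=1)=0.2, P(X=2)=0.5, P(X=3)=0.3
F = cdf_from_series([1, 2, 3], [0.2, 0.5, 0.3])
print(F(1))    # 0 -- no values lie below 1
print(F(2.5))  # 0.7
```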
A continuous random variable is characterized by a nonnegative function f(x), called the probability density, which is defined by:

f(x) = \lim_{\Delta x \to 0} \frac{P(x \le X < x + \Delta x)}{\Delta x}
At any x the probability density f(x) satisfies the equality

F(x) = \int_{-\infty}^{x} f(\tilde{x}) \, d\tilde{x}

linking it with the distribution function F(x).
Geometrically, the probability that X falls in the interval (a, b) is equal to the area of the curvilinear trapezoid corresponding to the definite integral \int_{a}^{b} f(x) \, dx.
[Figure: graphic presentation of the probability density function (differential function of distribution)]
Normal Distribution
(the term was first used by Galton in 1889; the distribution is also called Gaussian).
The normal distribution (the "bell-shaped curve"
which is symmetrical about the mean) is a
theoretical function commonly used in inferential
statistics as an approximation to sampling
distributions.
In general, the normal distribution provides a good
model for a random variable when:
1. There is a strong tendency for the variable to
take a central value;
2. Positive and negative deviations from this central
value are equally likely;
3. The frequency of deviations falls off rapidly as
the deviations become larger.
The normal distribution density is determined by the following formula:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty,

where
μ is the mean,
σ is the standard deviation,
e is the base of the natural logarithm, sometimes called Euler's number (2.71...),
π is the constant pi (3.14...).
The exact form of the normal distribution (the specific "bell curve", see Fig.) is defined by only two parameters: the mean and the standard deviation.
The specific property of the normal distribution lies in the fact that 68% of all observations fall within ±1 standard deviation of the mean, and the range of ±2 standard deviations includes 95% of the values. In other words, under the normal distribution, standardized observations below −2 or above +2 have a relative frequency of less than 5% (standardizing an observation means that the average value is subtracted from the base value and the result is divided by the standard deviation).
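The 68%/95% property can be checked numerically through the standard normal distribution function, Φ(z) = ½(1 + erf(z/√2)) (a sketch using only the standard library; the name std_normal_cdf is my own):

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function Phi(z) via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

within_1sd = std_normal_cdf(1) - std_normal_cdf(-1)
within_2sd = std_normal_cdf(2) - std_normal_cdf(-2)
print(round(within_1sd, 4))  # 0.6827
print(round(within_2sd, 4))  # 0.9545
```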
Log-normal Distribution
The log-normal distribution is often used in simulations of variables such as personal incomes, age at first marriage, or tolerance to poison in animals. In general, if x is a sample from a normal distribution, then y = e^x is a sample from a log-normal distribution. Thus, the log-normal distribution is defined as:
f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \, e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}

where x > 0; −∞ < μ < +∞; σ > 0;
μ is the scale parameter,
σ is the shape parameter,
e is the base of the natural logarithm, sometimes called Euler's number (2.71...).
[Figure: graphs f(x) and F(x) of the log-normal distribution: probability density function y = lognorm(x; 0; 0.5) and probability distribution function p = ilognorm(x; 0; 0.5)]
Student's t Distribution
The Student's t distribution is symmetric about zero, and its general shape is similar to that of the standard normal distribution. It is most commonly used in testing hypotheses about the mean of a particular population. The Student's t distribution is defined as (for m = 1, 2, ...):
f_t(x; m) = \frac{\Gamma\!\left(\frac{m+1}{2}\right)}{\sqrt{m\pi}\,\Gamma\!\left(\frac{m}{2}\right)} \left(1 + \frac{x^2}{m}\right)^{-\frac{m+1}{2}}, \quad -\infty < x < \infty.
[Figure: probability density function y = student(x; 5) and probability distribution function p = istudent(x; 5) of the t distribution]
Characteristics of the t-distribution:

M[x] = x_{med} = x_{mod} = 0, \quad D[x] = \frac{m}{m-2}, \quad A = 0, \quad E = \frac{6}{m-4}

If the number of degrees of freedom is large (m > 30), the t-distribution is close to the normal distribution N(x; 0, 1).
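The density formula above, and its convergence to the standard normal density for large m, can be sketched as follows (an illustrative snippet; lgamma is used instead of gamma to avoid overflow for large m):

```python
import math

def t_pdf(x, m):
    """Student's t density with m degrees of freedom."""
    log_c = math.lgamma((m + 1) / 2) - math.lgamma(m / 2)
    c = math.exp(log_c) / math.sqrt(m * math.pi)
    return c * (1 + x * x / m) ** (-(m + 1) / 2)

def normal_pdf(x):
    """Standard normal density N(x; 0, 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# For large m the t density is practically indistinguishable
# from the standard normal density.
print(abs(t_pdf(0, 200) - normal_pdf(0)) < 1e-3)   # True
```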
ONE-DIMENSIONAL STATISTICAL MODELS.
STATISTICAL CHARACTERISTICS OF SAMPLE
RANDOM VARIABLE
One-dimensional statistical models are used
to solve two types of problems: to estimate
average parameters of geoecological objects
and to verify hypotheses statistically.
The most common statistical characteristics of a one-dimensional random variable are:
• range
• median
• mode
• average value
• dispersion
• root-mean-square deviation
• coefficient of variation
• skewness
• excess
Range is the difference between the maximum x_max and minimum x_min values of a property: p = x_max − x_min.
Median is the middle term of the ordered series of values. To find the median it is necessary to arrange all values in increasing (or decreasing) order and to find the middle term of the series. If n is even and there are two values in the middle of the series, the median is equal to their half-sum.
Mode is the most frequently occurring value of the random variable.
Average value is the arithmetic mean of all measured values:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Median, mode and mean value are characteristics
of position. Measured values of random variable
are grouped near them.
Dispersion is a number equal to the average of the squared deviations of the values of the random variable from its average value (the dispersion of a random variable is a measure of its spread, i.e. of its deviation from the mathematical expectation):

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
Average square deviation is a number equal to the square root of the dispersion:

\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
Coefficient of variation is the ratio of the average square deviation to the average value:

V = \frac{\sigma}{\bar{x}}

The coefficient of variation is expressed in unit fractions or (after multiplication by 100) in percentages. It is reasonable to calculate the coefficient of variation only for positive random variables.
Dispersion, average square deviation, coefficient of variation and also range are measures of the scatter of the values of a random variable in the neighborhood of the average value: the larger these measures are, the greater the scatter.
Skewness is the degree of noncentrality of the distribution of the values of the random variable relative to the average value:

A = \frac{1}{n \sigma^3} \sum_{i=1}^{n} (x_i - \bar{x})^3
Excess is the degree of peakedness or flat-toppedness of the distribution of the values of the random variable relative to the normal distribution law:

E = \frac{1}{n \sigma^4} \sum_{i=1}^{n} (x_i - \bar{x})^4 - 3
Skewness and excess are dimensionless values. They show the peculiarities of the grouping of the values of the random variable in the neighborhood of the average value.
• Thus:
median, mode and average value are characteristics of position;
dispersion, average square deviation, coefficient of variation and also range are measures of scatter;
skewness and excess show the peculiarities of the grouping of values.
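The characteristics summarized above can be computed together in one short sketch (the function name describe is my own; the 1/n population formulas from the text are used, and ties for the mode are broken arbitrarily):

```python
import math

def describe(xs):
    """Range, median, mode, mean, dispersion, standard deviation,
    coefficient of variation, skewness and excess, using the 1/n
    (population) formulas given in the text."""
    n = len(xs)
    s = sorted(xs)
    rng = s[-1] - s[0]
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    mode = max(set(xs), key=xs.count)          # most frequent value
    mean = sum(xs) / n
    disp = sum((x - mean) ** 2 for x in xs) / n
    std = math.sqrt(disp)
    return {
        "range": rng, "median": median, "mode": mode, "mean": mean,
        "dispersion": disp, "std": std, "cv": std / mean,
        "skewness": sum((x - mean) ** 3 for x in xs) / (n * std ** 3),
        "excess": sum((x - mean) ** 4 for x in xs) / (n * std ** 4) - 3,
    }

d = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(d["mean"], d["dispersion"], d["std"])   # 5.0 4.0 2.0
```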
Statistical estimates can be point or interval estimates. In point estimation the unknown characteristic of a random variable is estimated by a single number; in interval estimation it is estimated by an interval which, with a specified probability, must contain the true value of the estimated variable.
STATISTICAL MODELING
Mathematical expressions including at least one random component (i.e. a variable whose value cannot be exactly predicted for a single observation) are called statistical models. They are extensively used for mathematical modeling because they account well for the random fluctuations of experimental data.
Statistical models are usually used for:
• obtaining trusted assessments of the properties of geological objects from sampling data;
• testing hypotheses;
• identifying and describing dependences between the properties of geological objects;
• classifying geological objects;
• determining the amount of sampling data needed to estimate the properties of geological objects to a specified accuracy.
Two concepts – general population and sampling – are the basis of statistical modeling.
General population – the set of all possible values of a specified characteristic of the examined object or phenomenon.
Sampling – the sum total of the observed values of this characteristic.
Statistical modeling assumes that the sampling population satisfies the requirements of mass, homogeneity, randomness and independence.
The mass condition is due to the fact that statistical regularities are manifested in mass phenomena, so the sampling population must be sufficiently large. It has been established empirically that the reliability of statistical estimates decreases as the sample is reduced in the range from 60 to 30-20 values, and there is no point in applying statistical methods to fewer observations.
The homogeneity condition is due to the fact that the sampling population must consist of observations which belong to one object and are carried out by the same method, i.e. the sample size and analysis method must be constant.
The randomness condition provides for the unpredictability of the result of a single sample observation.
The independence condition is due to the fact that the results of each observation do not depend on the results of previous and subsequent observations, and, for observations dealing with area and volume, the results do not depend on spatial coordinates.
The concept of the probability of a random event is one of the main concepts in statistical modeling.
An event is any fact which can be realized as the result of an experiment or test. In turn, an experiment or test is the realization of a certain complex of conditions, though a person does not always take part in it.
All events are subdivided into certain, impossible and random:
• a certain event is one that is sure to happen in a test of this kind;
• an impossible event is never realized in a test of this kind;
• random events are characterized by the fact that in a test of this kind they may or may not happen.
A variable that takes one or another value, unknown in advance, as the result of a test is called a random variable.
Random variables are discrete or continuous; the values they take may be limited or not. A discrete variable takes fixed values, and within any specified interval the number of these values is finite. A continuous random variable can take infinitely many values in any specified interval.
The value called probability is used as a measure of the possibility of random events.
The probability of event A is a number which characterizes the objective possibility of the occurrence of this event. It is designated either P(A) or p, i.e. p = P(A).
Classical interpretation: the probability of event A is equal to the ratio of the number of outcomes favourable to event A to the total number of outcomes: P(A) = m/n, where n is the total number of outcomes and m is the number of outcomes favourable to event A.
P(A) varies from 0 to 1. The probability of a certain event is equal to 1; the probability of an impossible event is equal to 0.
The ratio m/n of the number m of tests in which event A occurred to the total number of tests n is called the relative frequency of the event in this series of n tests. In almost every sufficiently long series of tests the relative frequency of event A settles at a defined value m/n, which is taken as the probability of event A.
The relative frequency of event A is called the statistical probability, which is symbolized

P^{*}(A) = \frac{m_A}{n}

where m_A is the number of experiments in which event A occurred, and n is the total number of experiments.
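A quick sketch of how the relative frequency approaches the classical probability in a long series of tests (the die-rolling setup and the seed are my own choices for illustration):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible
n = 100_000
# Event A: a fair die shows an even number; classically P(A) = 3/6 = 0.5
m_A = sum(1 for _ in range(n) if random.randint(1, 6) % 2 == 0)
relative_frequency = m_A / n  # statistical probability P*(A)
print(abs(relative_frequency - 0.5) < 0.01)   # True
```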
The basic characteristics of a random variable
The most important of them are the mathematical expectation of the random variable, denoted M(X), and the dispersion D(X) = σ²(X), the square root of which, σ(X), is called the standard deviation, or standard.
For a discrete (discontinuous) random variable, the mathematical expectation M(X) is defined as the sum of the products of the values of the random variable and the probabilities of those values:
M(X) = x_1 p_1 + x_2 p_2 + \ldots + x_k p_k = \sum_{i=1}^{k} x_i p_i

or

M(X) = \sum_{i=1}^{k} x_i p_i \Big/ \sum_{i=1}^{k} p_i
Mechanical interpretation of the mathematical expectation: M(X) is the abscissa of the centroid of a system of mass points whose abscissas are equal to the possible values of the random variable and whose masses, placed at these points, are equal to the corresponding probabilities.
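The discrete definition can be sketched directly (the function name expectation is mine; a fair die is used as a hypothetical example):

```python
def expectation(values, probs):
    """M(X) = sum of x_i * p_i for a discrete random variable."""
    return sum(x * p for x, p in zip(values, probs))

# Fair die: M(X) = (1 + 2 + ... + 6) / 6 = 3.5
m = expectation([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
print(round(m, 6))   # 3.5
```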
The mathematical expectation of a continuous random variable is defined by the integral (which is supposed to converge absolutely):

M(X) = \int_{-\infty}^{\infty} x f(x) \, dx

where f(x) is the probability density of the distribution of the random variable X.
The mathematical expectation M(X) can be understood as the "theoretical mean value of the random variable".
Along with the mathematical expectation, other characteristics are used:
the median x_med divides the distribution of X into two equal parts and is defined by the condition F(x_med) = 0.5;
the mode x_mod is the most commonly occurring value of X, and for a continuously distributed random variable it is the abscissa of the maximum point of f(x).
All three characteristics (mathematical expectation, median and mode) coincide in symmetrical distributions.
If there are several modes, the distribution is called multimodal.
The dispersion of a random variable X is the mathematical expectation of the squared deviation of the random variable from its mathematical expectation, i.e.

D(X) = M[(X - M(X))^2]

The dispersion is calculated by the formula:

D(X) = M(X^2) - [M(X)]^2
For a discrete random variable X the formula gives

D(X) = \sum_{i=1}^{k} x_i^2 p_i - [M(X)]^2
For a continuous random variable X

D(X) = \int_{-\infty}^{\infty} (x - M(X))^2 f(x) \, dx
The dimension of the dispersion is equal to the square of the dimension of the random variable. If the mathematical expectation of a random variable gives us its "average", the point on the coordinate line around which the values of the considered random variable "are spread", then the dispersion characterizes the degree of spread of the values of the random variable about its average value.
The positive root of the dispersion is called the root-mean-square (standard) deviation and is denoted

\sigma = \sqrt{D(X)}

The root-mean-square deviation possesses the same dimension as the random variable itself.
The coefficient of variation is the value

V = \frac{\sigma}{M(X)} \cdot 100\%

The coefficient of variation is a dimensionless value applied for comparison of the degrees of variation of random variables with different units of measurement.
The skewness ratio (or coefficient of skewness) of a distribution is the value

A = \frac{\mu_3}{\sigma^3}

where μ_3 is the third central moment. The coefficient of skewness characterizes the degree of skewness of the distribution of the random variable relative to its mathematical expectation. For symmetric distributions A = 0. If the peak of the graph of f(x) is shifted toward small values (the "tail" of the graph of f(x) to the right), A > 0; in the contrary case A < 0.
[Figure: density graphs f(x) for distributions with A > 0, A = 0 and A < 0]
The coefficient of excess (or peakedness) is the value

E = \frac{\mu_4}{\sigma^4} - 3

where μ_4 is the fourth central moment. The coefficient of excess is a measure of the sharpness of the probability density graph f(x).
[Figure: density graphs f(x) for distributions with E > 0, E = 0 and E < 0]