Geophysical data analysis --- Chapter 1
Chapter 1
UNBC
Basic notions in classical data analysis
1.1 Data and Data presentation
The goal of data analysis: discover relations and features in a dataset.
Geophysical data are those that record a measurement and also the location
for the measurement. Geophysical data analysis can be understood as using
such data to detect, examine, understand or predict events and phenomena
that are of relevance to a geophysical study. However, the methods of
analysis may not themselves be geophysical. Many of the more common
descriptive, inferential and relational statistics make no specific allowance
for where the data were collected.
Types of statistics: There are three types of statistics: those that describe
and summarise a set of data in its own right (descriptive statistics); those that
'go beyond' the data to infer something more general about the population
from which the data were sampled (inferential statistics); and those which
examine relationships (relational statistics).
Data are measurements of something of interest. They are also called
observations because the measurements help us to observe (and to quantify)
an attribute of whatever is being studied.
Data type -- Discrete data are those that take one of a restricted set of
possible values. They are usually whole-number (integer) data. Continuous data
can take any value, or any value between a lower and an upper limit, to a
certain level of precision. They are ‘real’ or ‘floating-point’ numbers.
Some ways to present data --- table, bar chart, histogram, line graph, etc.
Bar graph - A bar graph is a way of summarizing a set of categorical data.
It displays the data using a number of rectangles, of the same width, each of
which represents a particular category. Bar graphs can be displayed
horizontally or vertically and they are usually drawn with a gap between the
bars (rectangles).
Histogram - A histogram is a way of summarizing data that are measured
on an interval scale (either discrete or continuous). It is often used in
exploratory data analysis to illustrate the features of the distribution of the
data in a convenient form.
Pie chart - A pie chart is used to display a set of categorical data. It is a
circle, which is divided into segments. Each segment represents a particular
category. The area of each segment is proportional to the number of cases in
that category.
Line graph - A line graph is particularly useful when we want to show the
trend of a variable over time. Time is displayed on the horizontal axis (x-axis)
and the variable is displayed on the vertical axis (y-axis).
Scatter plot - A scatter plot shows how one variable varies with another. The
strength of the linear relationship between two variables is measured by their
correlation.
2.2 Measures of central tendency
Measures of central tendency and dispersion are common descriptive
measures for summarising numerical data.
Simple definition:
Measures of central tendency are measures of the location of the middle or the
center of a distribution.
The most frequently used measures of central tendency are
the mean, median and mode.

The mean is obtained by summing the values of all the observations and
dividing by the number of observations.

The median (also referred to as the 50th percentile) is the middle value in
a sample of ordered values. Half the values are above the median and
half are below the median.
The mode is a value occurring most frequently. It is rarely of any practical
use for numerical data.

A comparison of the mean, median and mode can reveal information about
skewness, as illustrated in the figure below. The mean, median and mode are similar
when the distribution is symmetrical. When the distribution is skewed, the median
is more appropriate as a measure of central tendency, because it is less
sensitive to extreme values than the mean.
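As a small illustration of the three measures, the sketch below uses Python's standard `statistics` module on a made-up, right-skewed sample. The data values are hypothetical and chosen only so that one large value pulls the mean above the median.

```python
import statistics

# Hypothetical right-skewed sample: mostly small values plus one large one.
values = [2, 3, 3, 4, 5, 6, 40]

mean_value = statistics.mean(values)      # sum of the values / number of values
median_value = statistics.median(values)  # middle value of the sorted sample
mode_value = statistics.mode(values)      # most frequently occurring value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

For this sample the mean lies above the median, which in turn lies above the mode, the ordering usually associated with a long right-hand tail.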
Statistical definition:
Continuous variables
For a random variable x which takes on continuous values over a domain
$\Omega$, the expectation is given by an integral,
$$E[x] = \int_{\Omega} x\,p(x)\,dx \tag{1.1}$$
where p(x) is the probability density function. For any function f(x), the
expectation is
$$E[f(x)] = \int_{\Omega} f(x)\,p(x)\,dx \tag{1.2}$$
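Example: For a standard normal distribution, x ~ N(0,1), the expectation of x is
$$E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx = \int_{-\infty}^{\infty} x\,\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx = \int_{-\infty}^{0} x\,\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx + \int_{0}^{\infty} x\,\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx = 0,$$
because the integrand is an odd function of x, so the two half-range integrals cancel.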
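The cancellation of the two half-range integrals can also be checked numerically. The following is a minimal sketch, assuming a simple trapezoidal rule and a finite integration range of [-10, 10] in place of the infinite one; the helper names are hypothetical.

```python
import math


def standard_normal_pdf(x):
    """Probability density p(x) of N(0, 1)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def integrand(x):
    """The function x * p(x) whose integral is the expectation E[x]."""
    return x * standard_normal_pdf(x)


def trapezoid(func, a, b, n):
    """Approximate the integral of func over [a, b] using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for k in range(1, n):
        total += func(a + k * h)
    return total * h


# Assumption: the tails beyond |x| = 10 contribute a negligible amount.
negative_half = trapezoid(integrand, -10.0, 0.0, 10_000)
positive_half = trapezoid(integrand, 0.0, 10.0, 10_000)
expectation = negative_half + positive_half

print("negative half:", negative_half)
print("positive half:", positive_half)
print("E[x] estimate:", expectation)
```

The two halves should come out equal in magnitude and opposite in sign, so their sum is zero up to rounding error.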

Discrete variables
Let x be a random variable which takes on discrete values. For example, x can be the
outcome of a die cast, where the possible values are $x_i = i$, with i = 1, 2, …, 6.
The expectation or expected value of x from a population is given by
$$E[x] = \sum_i x_i\,p_i \tag{1.3}$$
where $p_i$ is the probability of $x_i$ occurring. If the die is fair, $p_i$ is 1/6 for all i,
so $E[x] = \sum_i x_i\,p_i = (1+2+3+4+5+6)/6 = 3.5$. We also write
$$E[x] = \mu_x$$
with $\mu_x$ denoting the mean of x for the population.
Similarly, for any function f(x) of a discrete variable, its expectation is
$$E[f(x)] = \sum_i f(x_i)\,p_i \tag{1.4}$$
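The fair-die example can be reproduced directly from (1.3) and (1.4). This is a minimal sketch using exact fractions; the function name `expectation` is an illustrative choice, not something defined in the notes.

```python
from fractions import Fraction

# Outcomes of a fair die and their probabilities p_i = 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [Fraction(1, 6)] * 6


def expectation(f, values, probs):
    """Return sum_i f(x_i) * p_i, the discrete expectation of f(x)."""
    return sum(f(x) * p for x, p in zip(values, probs))


mean_x = expectation(lambda x: x, outcomes, probabilities)              # E[x]
mean_x_squared = expectation(lambda x: x * x, outcomes, probabilities)  # E[x^2]

print("E[x]   =", mean_x, "=", float(mean_x))
print("E[x^2] =", mean_x_squared)
```

Using `Fraction` keeps the arithmetic exact, so E[x] comes out as 7/2 rather than a rounded floating-point value.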
The expectation of a sum of random variables satisfies
$$E[ax + by + c] = aE[x] + bE[y] + c, \tag{1.5}$$
where x and y are random variables, and a, b, c are constants.
In practice, one can only sample N measurements of x ($x_1, \ldots, x_N$) from the
population. The sample mean $\bar{x}$ or $\langle x \rangle$ is calculated as
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{1.6}$$
which is in general different from the population mean $\mu_x$. As the sample size
increases, the sample mean approaches the population mean:
$$\bar{x} \to \mu_x \quad \text{as } N \to \infty.$$
In MATLAB, the mean is computed with mean(x).
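The approach of the sample mean to the population mean can be illustrated by simulation. The sketch below is written in Python rather than MATLAB (an assumption made for illustration) and draws simulated fair-die rolls, for which the population mean is 3.5; the seed and sample sizes are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

population_mean = 3.5  # E[x] for a fair die


def sample_mean(values):
    """Sample mean: (1/N) * sum of the N sampled values."""
    return sum(values) / len(values)


# Absolute difference between sample mean and population mean, by sample size.
errors = {}
for sample_size in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(sample_size)]
    errors[sample_size] = abs(sample_mean(rolls) - population_mean)
    print(sample_size, sample_mean(rolls), errors[sample_size])
```

The difference typically shrinks as N grows, although for any single run it need not decrease at every step.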
2.3 Variance and covariance
A measure of central tendency is useful but says nothing about how the data
are distributed around it. Knowing only an average is not, by itself,
especially helpful.
Example: Imagine measurements are taken of a pollutant in each of three
streams on 10 different days. The first stream is found to have a mean
pollution concentration of 6.1 units, the second stream has 9.1 units and the
third has 5.6 units. If the threshold beyond which a stream becomes toxic is
10 units, then which is the stream to worry about?
Intuitively the answer is the second stream because its mean (of 9.1) is
closest to 10. But that intuition could be misleading without knowledge of
how the individual measurements vary around it, as shown below.
In general, a measure of central tendency should always be accompanied by
a measure of spread, which describes the fluctuation about the mean value. It is
commonly characterized by the variance of the population,
$$\mathrm{Var}(x) = E[(x - \mu_x)^2] = E[x^2 - 2x\mu_x + \mu_x^2] = E[x^2] - \mu_x^2 \tag{1.7}$$
where (1.5) has been invoked. The standard deviation s is the positive square
root of the population variance, i.e.,
$$s^2 = \mathrm{Var}(x) \tag{1.8}$$
The sample standard deviation $\sigma$ is the positive square root of the sample variance,
$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2 \tag{1.9}$$
As the sample size increases, the sample variance approaches the population
variance. For large N, distinction is often not made between having N-1 or N
in the denominator of (1.9).
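Both forms of the population variance in (1.7), and the effect of the N-1 versus N denominator in (1.9), can be checked with a short calculation. The sketch below uses the fair-die population from the earlier example and a small hypothetical sample; `statistics.variance` divides by N-1 and `statistics.pvariance` divides by N.

```python
import statistics
from fractions import Fraction

# Population variance of a fair die, computed in the two ways shown in (1.7).
outcomes = range(1, 7)
p = Fraction(1, 6)
mu = sum(x * p for x in outcomes)                           # E[x]
var_definition = sum((x - mu) ** 2 * p for x in outcomes)   # E[(x - mu)^2]
var_shortcut = sum(x * x * p for x in outcomes) - mu ** 2   # E[x^2] - mu^2

# Sample variance of a hypothetical sample of N = 5 measurements.
sample = [4.0, 7.0, 5.0, 9.0, 5.0]
var_n_minus_1 = statistics.variance(sample)   # denominator N - 1, as in (1.9)
var_n = statistics.pvariance(sample)          # denominator N
std_n_minus_1 = statistics.stdev(sample)      # square root of var_n_minus_1

print(var_definition, var_shortcut)
print(var_n_minus_1, var_n, std_n_minus_1)
```

For a sample this small the two denominators give noticeably different values; the gap narrows as N increases.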
Degrees of freedom
The degrees of freedom are the amount of flexibility you have to change the
values of some observations if certain properties of the data set are fixed.
Those properties are often referred to as parameters. Imagine somebody
takes some playing cards and, having removed the picture cards, looks at 10
cards at the top of the pack. That person then tells you the sum of the 10
cards is sixty before dealing them face down. The question is: what is the
maximum number of cards you need to turn up before you know the face
value of them all? The answer is nine. You know there are 10 cards and you
know the sum of their values. By deducting the values of the first 9 cards
from sixty, you can determine the value of the final card.
Alternatively, consider a data set containing n = 10 observations for which
the mean is known and must stay constant. You can take any nine of the
observations and change their values to anything you like. But, having done
so, you have no choice about what the tenth element equals: its value is fixed
by the mean and by the other observations. Your ‘degrees of freedom’ are
limited to n - 1 of the data values. Be aware that the number of degrees of
freedom is not always n - 1. It depends on the parameters of the data set
required for a particular test or measure. A formal definition of degrees of
freedom is the sample size, n, minus the number of parameters, p,
estimated from the data.
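The card example amounts to a one-line calculation: once the total and nine of the values are known, the tenth is no longer free. A minimal sketch, with hypothetical card values and a hypothetical helper name:

```python
def remaining_value(total, known_values):
    """Value forced on the last observation when the total is fixed."""
    return total - sum(known_values)


# Card example: the ten cards sum to sixty and nine have been turned up.
known_sum = 60
revealed_cards = [7, 3, 10, 5, 6, 8, 2, 9, 4]  # hypothetical face values
last_card = remaining_value(known_sum, revealed_cards)
print("last card:", last_card)

# Fixed-mean example: n = 10 observations whose mean must stay at 6.0.
n = 10
fixed_mean = 6.0
free_values = [1.0, 12.5, 3.0, 8.0, 6.0, 6.0, 9.5, 2.0, 7.0]  # chosen freely
tenth_value = remaining_value(n * fixed_mean, free_values)
print("tenth value:", tenth_value)
```

Nine values can be chosen freely, the tenth cannot, which is the n - 1 degrees of freedom described above.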
Normalization: The goal is to allow different variables to be measured and
compared on the same scale, i.e., it is an effective way to remove the influence of
units.
For example, we would like to compare two very different variables, e.g.,
sea surface temperature and fish population. Because their units differ, their
variations cannot even be drawn meaningfully on the same plot. So, one usually standardizes
the variables before making the comparison. The standardized variable
$$x_s = (x - \bar{x})/\sigma \tag{1.10}$$
is obtained from the original variable by subtracting the sample mean and
dividing by the sample standard deviation. The standardized variable is also
called the normalized variable or the standardized anomaly (where anomaly
means the deviation from the mean value).
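A minimal sketch of the standardization in (1.10) follows. The sea surface temperature and fish population values are hypothetical numbers invented for illustration, and `standardize` is an illustrative helper name.

```python
import statistics


def standardize(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    mean_value = statistics.mean(values)
    std_value = statistics.stdev(values)  # sample standard deviation (N - 1)
    return [(v - mean_value) / std_value for v in values]


# Hypothetical data in very different units.
sea_surface_temperature = [26.1, 27.4, 28.0, 26.8, 27.2]  # degrees Celsius
fish_population = [12000, 9500, 8700, 11000, 10200]       # number of fish

sst_standardized = standardize(sea_surface_temperature)
fish_standardized = standardize(fish_population)

print([round(v, 2) for v in sst_standardized])
print([round(v, 2) for v in fish_standardized])
```

After standardization both series are dimensionless, with zero sample mean and unit sample standard deviation, so they can be plotted on a common axis.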
Covariance
Covariance measures how two variables move together. It measures whether
the two move in the same direction (a positive covariance) or in opposite
directions (a negative covariance). The variables can be any two quantities
that are observed together.
For two random variables x and y, with means $\mu_x$ and $\mu_y$ respectively, their
covariance is given by
$$\mathrm{Cov}(x, y) = E[(x - \mu_x)(y - \mu_y)] \tag{1.11}$$
The variance is simply a special case of the covariance, with
$$\mathrm{Var}(x) = \mathrm{Cov}(x, x). \tag{1.12}$$
The sample covariance is computed as
$$\mathrm{Cov}(x, y) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}). \tag{1.13}$$
In MATLAB, the variance, standard deviation and covariance are computed with
var(x), std(x) and cov(x,y).
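The sample covariance (1.13) is straightforward to write out by hand. The sketch below is in Python rather than MATLAB (an assumption for illustration) and uses small hypothetical series; `sample_covariance` is an illustrative name.

```python
def sample_covariance(x, y):
    """Sample covariance with N - 1 in the denominator, as in (1.13)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical series: one moves with x, the other moves against it.
x = [1.0, 2.0, 3.0, 4.0]
y_same = [2.0, 4.0, 6.0, 8.0]
y_opposite = [8.0, 6.0, 4.0, 2.0]

print(sample_covariance(x, y_same))      # positive: same direction
print(sample_covariance(x, y_opposite))  # negative: opposite directions
print(sample_covariance(x, x))           # equals the sample variance of x
```

The last call illustrates (1.12): the covariance of a variable with itself is its variance.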
2.4 Skewness:
Skewness is a measure of symmetry or, more precisely, the lack of
symmetry:
$$\mathrm{skew}(x) = \frac{E[(x - \mu_x)^3]}{\sigma_x^3} \tag{1.14}$$
where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of x. As one might
expect, the formula takes on a positive value if x is positively skewed and a
negative value if x is negatively skewed.
A distribution, or data set, is symmetric if it looks the same to the left and
right of the center point, i.e., skew = 0. If the left tail (tail at the small end of
the distribution) is more pronounced than the right tail (tail at the large end of
the distribution), the skew is negative. If the reverse is true, the skew is
positive. Distributions with positive skew are sometimes called "skewed to
the right" whereas distributions with negative skew are called "skewed to the
left."
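The sign convention can be checked on small samples. The sketch below estimates (1.14) by replacing the expectations with sample averages (dividing by N throughout, which is an assumption; other estimators apply bias corrections). The data sets are hypothetical.

```python
def skewness(values):
    """Plug-in estimate of E[(x - mu)^3] / sigma^3 using averages over N."""
    n = len(values)
    mean_value = sum(values) / n
    second_moment = sum((v - mean_value) ** 2 for v in values) / n
    third_moment = sum((v - mean_value) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5


right_tail = [1, 2, 2, 3, 3, 3, 10]      # long tail at the large end
left_tail = [-v for v in right_tail]     # mirror image: long tail at the small end
symmetric = [1, 2, 3, 4, 5]              # same shape on both sides of the center

print(skewness(right_tail))  # positive
print(skewness(left_tail))   # negative
print(skewness(symmetric))   # zero
```

Mirroring a data set about zero flips the sign of its skewness while leaving its variance unchanged.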
[Figure: Positive vs. negative skewness. Left panel: + skew; right panel: - skew.]
These graphs illustrate the notion of skewness. Both PDFs have the same expectation
and variance. The one on the left is positively skewed. The one on the right is
negatively skewed.
2.5 Examples of using these concepts to analyse practical problems.
Standard Deviation
Covariance
For the data sets X = 65.21, 64.75, 65.26, 65.76, 65.96 and Y = 67.25, 66.39, 66.12,
65.70, 66.64, find the covariance to estimate the linear relationship between the two data
sets X & Y.
Solution
Sum(X) = 65.21 + 64.75 + 65.26 + 65.76 + 65.96
= 326.94
μx = 326.94 / 5
= 65.388
Sum(Y) = 67.25 + 66.39 + 66.12 + 65.70 + 66.64 = 332.10
μy = 332.10 / 5
= 66.42
cov(X,Y) = SUM((xi - μx) * (yi - μy)) / (n - 1)
= ((65.21 - 65.388) * (67.25 - 66.42) + (64.75 - 65.388) * (66.39 - 66.42) + (65.26 - 65.388) *
(66.12 - 66.42) + (65.76 - 65.388) * (65.70 - 66.42) + (65.96 - 65.388) * (66.64 - 66.42)) / 4
= -0.2322 / 4
= -0.058
The negative sign indicates that X and Y tend to vary in opposite directions.
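The hand calculation can be cross-checked by recomputing the covariance from the raw X and Y values. This is a minimal sketch that keeps full floating-point precision for the means instead of rounding them first.

```python
# Data sets from the worked example.
x = [65.21, 64.75, 65.26, 65.76, 65.96]
y = [67.25, 66.39, 66.12, 65.70, 66.64]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum of products of deviations, divided by n - 1 (sample covariance).
sum_of_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
covariance = sum_of_products / (n - 1)

print("mean of X:", round(mean_x, 3))
print("mean of Y:", round(mean_y, 2))
print("cov(X, Y):", round(covariance, 3))
```

Rounded to three decimal places, the result should agree with the value of about -0.058 obtained by hand.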
Shown above are two cases: El Nino and La Nina. Comparing the two cases
reveals: (1) the largest variations occur in the eastern Pacific Ocean; (2) El
Nino and La Nina are asymmetric.
These two features can be explained by the figure below. As can be seen,
the largest variances appear over the eastern Pacific,
coinciding with the above figures. Similarly, large positive skewness
occupies the equatorial eastern boundary, and small negative skewness
appears in the west, indicating the asymmetry between El Nino and
La Nina.
Reference:
Harris, Richard; Jarvis, Claire. Statistics for Geography and Environmental
Science (Kindle Location 1028). Taylor and Francis. Kindle Edition, chapter
2.