STATISTICS
Exploratory Data Analysis
and Probability
Professor Ke-Sheng Cheng
Department of Bioenvironmental Systems Engineering
National Taiwan University
What is “statistics”?
• Statistics is a science of “reasoning” from data.
• A body of principles and methods for
extracting useful information from data, for
assessing the reliability of that information, for
measuring and managing risk, and for making
decisions in the face of uncertainty.
5/5/2017
Lab for Remote Sensing Hydrology and Spatial Modeling
Department of Bioenvironmental Systems Engineering, National Taiwan University
• The major difference between statistics and
mathematics is that statistics always needs
“observed” data, while mathematics does not.
• An important feature of statistical methods is
the “uncertainty” involved in analysis.
• Statistics is the discipline concerned with the
study of variability, with the study of
uncertainty and with the study of decision-making in the face of uncertainty. As these
are issues that are crucial throughout the
sciences and engineering, statistics is an
inherently interdisciplinary science.
Stochastic Modeling & Simulation
• Building probability models for real world
phenomena.
– No matter how sophisticated a model is, it only
represents our understanding of the complicated
natural systems.
• Generating a large number of possible
realizations.
• Making decisions or assessing risks based on
simulation results.
• Conducted by computers.
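The three steps above (build a model, generate realizations, assess risk) can be sketched in a few lines. The rainfall model, its mean, and the threshold below are hypothetical values chosen only for illustration; they do not come from the lecture.

```python
import math
import random

# Hypothetical model (an assumption for illustration only): the annual
# maximum daily rainfall, in mm, follows an exponential distribution.
MEAN_RAINFALL_MM = 100.0
THRESHOLD_MM = 300.0


def simulate_exceedance(n_realizations, seed=0):
    """Estimate P(rainfall > THRESHOLD_MM) from simulated realizations."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_realizations):
        # One realization drawn from the assumed probability model.
        rainfall = rng.expovariate(1.0 / MEAN_RAINFALL_MM)
        if rainfall > THRESHOLD_MM:
            exceed += 1
    return exceed / n_realizations


estimate = simulate_exceedance(100_000)
# For this toy model the exact answer is known, which allows a sanity check.
exact = math.exp(-THRESHOLD_MM / MEAN_RAINFALL_MM)
print(f"simulated risk = {estimate:.4f}, exact = {exact:.4f}")
```

In practice the model is rarely simple enough to have a closed-form answer, which is why the risk is read off the simulated realizations.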
Exploratory Data Analysis
• Features of data distributions
– Histograms
– Center: mean, median
– Spread: variance, standard deviation, range
– Shape: skewness, kurtosis
– Order statistics and sample quantiles
– Clusters
– Extreme observations: outliers
• Histogram: frequencies and relative
frequencies
– A sample data set X
104.838935  265.018615  205.279506  146.938446   22.371870
129.538575   37.587841  231.608794   24.762863   82.708815
 82.535199  115.387515   64.158533   72.895810   85.553281
102.347372  275.440477  149.905426  150.761192  102.460651
133.663194  107.569047   96.920012   19.277535   70.721022
113.442704  134.931864   16.480639  139.201204   81.266071
 34.202372  134.484317  100.717110  131.144892  174.200632
  9.961515  112.180103  101.351639   45.472935  121.101643
 12.577133   60.397366   33.918756    9.539663  130.360126
 53.449806  105.368124   16.652365  149.996985   10.382787
• Frequency histogram
• Relative histogram
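A frequency histogram counts the observations falling in each class interval; a relative histogram divides those counts by the sample size. The sketch below tabulates both for the sample data set X. The class width of 50 starting at 0 is an assumption, since the binning used in the original figures is not stated.

```python
# Sample data set X from the slides.
X = [
    104.838935, 265.018615, 205.279506, 146.938446, 22.371870,
    129.538575, 37.587841, 231.608794, 24.762863, 82.708815,
    82.535199, 115.387515, 64.158533, 72.895810, 85.553281,
    102.347372, 275.440477, 149.905426, 150.761192, 102.460651,
    133.663194, 107.569047, 96.920012, 19.277535, 70.721022,
    113.442704, 134.931864, 16.480639, 139.201204, 81.266071,
    34.202372, 134.484317, 100.717110, 131.144892, 174.200632,
    9.961515, 112.180103, 101.351639, 45.472935, 121.101643,
    12.577133, 60.397366, 33.918756, 9.539663, 130.360126,
    53.449806, 105.368124, 16.652365, 149.996985, 10.382787,
]

BIN_WIDTH = 50.0  # assumed class width
N_BINS = 6        # classes [0, 50), [50, 100), ..., [250, 300)

# Frequencies: number of observations in each class.
counts = [0] * N_BINS
for x in X:
    counts[int(x // BIN_WIDTH)] += 1

# Relative frequencies: frequencies divided by the sample size.
rel_freq = [c / len(X) for c in counts]

for k, (c, f) in enumerate(zip(counts, rel_freq)):
    lower = k * BIN_WIDTH
    print(f"[{lower:5.0f}, {lower + BIN_WIDTH:5.0f})  count={c:2d}  relative={f:.2f}")
```

The two histograms have the same shape; only the vertical scale differs.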
• Measures of center
– Sample mean
– Sample median
Sample mean = 98.26067
Sample median = 101.8495
– One desirable property of the sample median is
that it is resistant to extreme observations, in the
sense that its value depends only on the values of the
middle observations and is quite unaffected by the
actual values of the outer observations in the
ordered list. The same cannot be said for the
sample mean. Any significant change in the
magnitude of an observation results in a
corresponding change in the value of the mean.
Hence, the sample mean is said to be sensitive to
extreme observations.
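A small numerical illustration of this resistance, using a made-up five-point sample (not data from the lecture): replacing the largest observation with an extreme value moves the mean a great deal and leaves the median where it was.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
sample = [2.0, 4.0, 5.0, 7.0, 9.0]
# The same sample with the largest observation replaced by an extreme value.
contaminated = [2.0, 4.0, 5.0, 7.0, 900.0]

mean_before = statistics.mean(sample)
mean_after = statistics.mean(contaminated)
median_before = statistics.median(sample)
median_after = statistics.median(contaminated)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```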
• Measures of spread
– Sample variance and sample standard deviation
– Range
• the difference between the largest and smallest values
Sample variance = 4039.931
Sample standard deviation = 63.56045
Range = 265.9008 (275.440477 – 9.539663)
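The measures of center and spread quoted for the sample data set X can be recomputed with Python's standard `statistics` module. This sketch assumes the sample variance uses the n − 1 divisor, which is what `statistics.variance` implements.

```python
import statistics

# Sample data set X from the slides.
X = [
    104.838935, 265.018615, 205.279506, 146.938446, 22.371870,
    129.538575, 37.587841, 231.608794, 24.762863, 82.708815,
    82.535199, 115.387515, 64.158533, 72.895810, 85.553281,
    102.347372, 275.440477, 149.905426, 150.761192, 102.460651,
    133.663194, 107.569047, 96.920012, 19.277535, 70.721022,
    113.442704, 134.931864, 16.480639, 139.201204, 81.266071,
    34.202372, 134.484317, 100.717110, 131.144892, 174.200632,
    9.961515, 112.180103, 101.351639, 45.472935, 121.101643,
    12.577133, 60.397366, 33.918756, 9.539663, 130.360126,
    53.449806, 105.368124, 16.652365, 149.996985, 10.382787,
]

sample_mean = statistics.mean(X)
sample_median = statistics.median(X)
sample_variance = statistics.variance(X)  # divides by n - 1 (assumed convention)
sample_sd = statistics.stdev(X)           # square root of the sample variance
sample_range = max(X) - min(X)            # largest minus smallest value

print(f"mean     = {sample_mean:.5f}")
print(f"median   = {sample_median:.4f}")
print(f"variance = {sample_variance:.3f}")
print(f"sd       = {sample_sd:.5f}")
print(f"range    = {sample_range:.4f}")
```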
• Measures of shape
– Sample skewness
– Sample kurtosis
Sample skewness = 0.7110874
Sample kurtosis = 0.533141 (excess kurtosis; 3.533141 in R, before subtracting 3)
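The slides do not say which estimator produced these values. As a sketch, the moment-ratio definitions below are one common choice; other estimators apply small-sample corrections and give slightly different numbers. The function names and the test samples are illustrative only.

```python
def central_moment(data, k):
    """k-th sample central moment (divisor n)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def skewness(data):
    """Moment-ratio skewness: m3 / m2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data, excess=True):
    """Moment-ratio kurtosis: m4 / m2^2, minus 3 when excess=True."""
    raw = central_moment(data, 4) / central_moment(data, 2) ** 2
    return raw - 3.0 if excess else raw


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical symmetric sample
right_skewed = [1.0, 1.0, 1.0, 10.0]     # hypothetical sample with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric, excess=False), kurtosis(symmetric))
```

A positive skewness indicates a longer right tail. The two kurtosis conventions differ by exactly 3, the kurtosis of a normal distribution.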
• Order statistics
• Sample quantiles
Linear interpolation
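One way to obtain a sample quantile by linear interpolation is to locate the fractional position (n − 1)p in the ordered sample and interpolate between the two neighbouring order statistics. Several interpolation conventions exist, and whether the slide uses exactly this one is an assumption; the function name is illustrative.

```python
import math


def sample_quantile(data, p):
    """Sample quantile by linear interpolation between order statistics."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    ordered = sorted(data)       # order statistics x_(1) <= ... <= x_(n)
    h = (len(ordered) - 1) * p   # fractional 0-based position of the quantile
    lower = math.floor(h)
    upper = min(lower + 1, len(ordered) - 1)
    # Interpolate linearly between the two neighbouring order statistics.
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])


data = [4.0, 1.0, 3.0, 2.0]      # hypothetical sample
print(sample_quantile(data, 0.25), sample_quantile(data, 0.5))
```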
• Box-and-whisker plot (or box plot)
– A box-and-whisker plot includes two major parts – the box
and the whiskers.
– A parameter range determines how far the plot whiskers
extend out from the box. If range is positive, the whiskers
extend to the most extreme data point which is no more
than range times the interquartile range (IQR) from the box.
A value of zero causes the whiskers to extend to the data
extremes.
– Outliers are marked by points which fall beyond the
whiskers.
– Hinges and the five-number summary
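The pieces of a box plot can be computed directly. The sketch below uses Tukey's hinges for the box (each hinge is an order statistic or the average of two adjacent ones, not an interpolated quantile) and the range-times-IQR rule described above for the whiskers. Function names and the sample are illustrative assumptions.

```python
import math


def five_number_summary(data):
    """Minimum, lower hinge, median, upper hinge, maximum (Tukey's hinges)."""
    x = sorted(data)
    n = len(x)
    n4 = math.floor((n + 3) / 2) / 2
    depths = [1, n4, (n + 1) / 2, n + 1 - n4, n]  # 1-based positions
    # Each value is an order statistic or the average of two adjacent ones.
    return [0.5 * (x[math.floor(d) - 1] + x[math.ceil(d) - 1]) for d in depths]


def box_plot_stats(data, whisker_range=1.5):
    """Box edges, whisker ends, and outliers for a box-and-whisker plot."""
    low, lower_hinge, median, upper_hinge, high = five_number_summary(data)
    if whisker_range == 0:
        # A value of zero extends the whiskers to the data extremes.
        return {"box": (lower_hinge, median, upper_hinge),
                "whiskers": (low, high), "outliers": []}
    reach = whisker_range * (upper_hinge - lower_hinge)
    lo_fence, hi_fence = lower_hinge - reach, upper_hinge + reach
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    outliers = [v for v in data if v < lo_fence or v > hi_fence]
    return {"box": (lower_hinge, median, upper_hinge),
            "whiskers": (min(inside), max(inside)), "outliers": outliers}


sample = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical sample with one extreme value
print(five_number_summary(sample))
print(box_plot_stats(sample))
```

For the four-point sample 1, 2, 3, 4 the lower hinge is 1.5, whereas linear interpolation at p = 0.25 over positions (n − 1)p gives 1.75, so the two rules need not agree.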
Not “linear interpolation”
– In R, a boxplot is essentially a graphical
representation determined by the five-number summary (5NS).
– The summary function in R yields a list of six numbers: the minimum, first quartile, median, mean, third quartile, and maximum.
– Box-and-whisker plot of X
Seasonal variation of average
monthly rainfalls in CDZ, Myanmar
– Boxplots are based on average monthly rainfalls of
54 rainfall stations.
Random Experiment and Sample Space
• An experiment that can be repeated under the
same (or uniform) conditions, but whose
outcome cannot be predicted in advance, even
when the same experiment has been
performed many times, is called a random
experiment.
• Examples of random experiments
– Tossing a coin.
– Rolling a die.
– Selecting a numbered ball (1-50) from an urn
(selection with replacement).
– Occurrences of earthquakes
• The time interval between two consecutive
earthquakes stronger than scale 6.
– Occurrences of typhoons
• The amount of rainfall produced by typhoons in one
year (yearly typhoon rainfall).
• The following items are always associated
with a random experiment:
– Sample space. The set of all possible outcomes,
denoted by $\Omega$.
– Outcomes. Elements of the sample space,
denoted by $\omega$. These are also referred to as
sample points or realizations.
– Events. An event is a subset of $\Omega$ for which the
probability is defined. Events are denoted by
capital Latin letters (e.g., A, B, C).
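For the die-rolling experiment these three items map naturally onto sets. The sketch below is only an illustration; the variable and function names are assumptions.

```python
# Rolling a die once.
sample_space = frozenset({1, 2, 3, 4, 5, 6})  # the set of all possible outcomes
even = frozenset({2, 4, 6})                   # an event: "an even number shows"


def occurs(event, outcome):
    """An event occurs when the realized outcome is one of its elements."""
    return outcome in event


outcome = 4                                   # one realization (sample point)
print(even <= sample_space, occurs(even, outcome))
```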
Definition of Probability
• Classical probability
• Frequency probability
• Probability model
Classical (or a priori) probability
• If a random experiment can result in $n$
mutually exclusive and equally likely
outcomes and if $n_A$ of these outcomes have an
attribute $A$, then the probability of $A$ is the
fraction $n_A/n$.
• Example 1.
Compute the probability of getting two heads
if a fair coin is tossed twice. (1/4)
• Example 2.
Compute the probability that a card drawn from an
ordinary well-shuffled deck will be an ace or
a spade. (16/52)
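Both examples can be checked by enumerating the equally likely outcomes and counting those with the attribute of interest. The helper name and the encoding of coins and cards below are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def classical_probability(outcomes, has_attribute):
    """Fraction of equally likely outcomes that have the attribute."""
    outcomes = list(outcomes)
    n_a = sum(1 for o in outcomes if has_attribute(o))
    return Fraction(n_a, len(outcomes))


# Example 1: a fair coin tossed twice gives four equally likely outcomes.
two_tosses = list(product("HT", repeat=2))
p_two_heads = classical_probability(two_tosses, lambda o: o == ("H", "H"))

# Example 2: one card from a 52-card deck; 4 aces + 13 spades - 1 ace of spades = 16.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))
p_ace_or_spade = classical_probability(
    deck, lambda card: card[0] == "A" or card[1] == "spades"
)

print(p_two_heads, p_ace_or_spade)
```

`Fraction` reduces 16/52 to 4/13; the two are equal.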
Remarks
• The probabilities determined by the classical
definition are called “a priori” probabilities
since they can be derived purely by deductive
reasoning.
• The “equally likely” assumption requires the
experiment to be carried out in such a way
that the assumption is realistic, for example by using
a balanced coin, a die that is not loaded,
a well-shuffled deck of cards,
random sampling, and so forth. This
assumption also requires that the sample
space be appropriately defined.
• Troublesome limitations in the classical
definition of probability:
– If the number of possible outcomes is infinite;
– If possible outcomes are not equally likely.
Relative frequency
(or a posteriori) probability
• We observe outcomes of a random
experiment which is repeated many times.
We postulate a number p which is the
probability of an event, and approximate p by
the relative frequency f with which the
repeated observations satisfy the event.
• Suppose a random experiment is repeated $n$
times under uniform conditions and event $A$
occurs $n_A$ times. The relative frequency
with which $A$ occurs is then $f_n(A) = n_A/n$. If the limit
of $f_n(A)$ as $n$ approaches infinity exists, then
one can assign the probability of $A$ by:
$$P(A) = \lim_{n \to \infty} f_n(A).$$
• This method requires the existence of the
limit of the relative frequencies. This
property is known as statistical regularity.
This property will be satisfied if the trials are
independent and are performed under
uniform conditions.
• Example 3
A fair coin was tossed 100 times, and heads
occurred 54 times. The probability of heads
on each toss is estimated to be
0.54.
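A simulated version of this experiment shows how the relative frequency of heads settles down as the number of tosses grows. This is a sketch with a pseudo-random fair coin; the seed and the numbers of tosses are arbitrary choices.

```python
import random


def relative_frequency(n_tosses, rng):
    """Relative frequency of heads in n_tosses independent fair-coin tosses."""
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses


rng = random.Random(2017)  # fixed seed so the run is repeatable
frequencies = {n: relative_frequency(n, rng) for n in (100, 10_000, 1_000_000)}

for n, f in frequencies.items():
    print(f"n = {n:>9,d}   relative frequency of heads = {f:.4f}")
```

With only 100 tosses an estimate such as 0.54 is unremarkable for a fair coin; the fluctuation around 0.5 shrinks as n increases.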
• The chain of probability definition
Random experiment → Sample space → Event space → Probability space
Probability Model
Each outcome can be thought of as a sample
point, or an element, in the sample space.
• Event and event space
– An event is a subset of the sample space. The class
of all events associated with a given random
experiment is defined to be the event space.
– An event will always be a subset of the sample
space, but for sufficiently large sample spaces not all
subsets will be events. Thus the class of all subsets
of the sample space will not necessarily correspond
to the event space.
– If the sample space consists of only a finite number
of points, then the corresponding event space will be
the class of all subsets of the sample space.
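For a finite sample space the event space can be listed explicitly as the class of all subsets, which has 2^n members when the sample space has n points. A minimal sketch, with a hypothetical three-point sample space:

```python
from itertools import chain, combinations


def all_subsets(sample_space):
    """Every subset of a finite sample space, from the empty set to the whole space."""
    points = sorted(sample_space)
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1)
    )
    return [frozenset(s) for s in subsets]


omega = {"a", "b", "c"}          # hypothetical three-point sample space
event_space = all_subsets(omega)
print(len(event_space))          # 2 ** 3 subsets
```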
1) $\emptyset$ (the empty set) and $\Omega$ (the sure event) are
both subsets of $\Omega$.
2) An event A is said to occur if the experiment
at hand results in an outcome that belongs to
A.
3) An event space is usually denoted by a script
Latin letter such as $\mathcal{A}$ and $\mathcal{B}$.
4) Two events A and B are said to be mutually
exclusive if and only if $A \cap B = \emptyset$. Events
$A_1, A_2, A_3, \ldots$ are mutually exclusive if and only
if $A_i \cap A_j = \emptyset$ for $i \neq j$.
Event space and algebra of events
• Let $\mathcal{A}$ denote an event space. The following
properties define a Boolean algebra, or
algebra of events:
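The list of properties is not reproduced in this transcript. As a reference sketch, the usual textbook statement is given below; the notation ($\Omega$ for the sample space, $\mathcal{A}$ for the event space, an overbar for the complement) is an assumption and may differ from the original slide.

```latex
\begin{aligned}
&\text{(i)}   && \Omega \in \mathcal{A}, \\
&\text{(ii)}  && A \in \mathcal{A} \;\Rightarrow\; \overline{A} \in \mathcal{A}, \\
&\text{(iii)} && A_1, A_2 \in \mathcal{A} \;\Rightarrow\; A_1 \cup A_2 \in \mathcal{A}.
\end{aligned}
```

Closure under finite intersections follows from (ii) and (iii) by De Morgan's laws.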
Probability function
• Let $\Omega$ denote the sample space and $\mathcal{A}$ denote
an algebra of events for some random
experiment. Then a probability function $P$ is a
set function with domain $\mathcal{A}$ (an algebra of
events) and counterdomain the interval [0, 1]
which satisfies the following axioms:
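The axioms themselves are not reproduced in this transcript. The standard (Kolmogorov) axioms are sketched below for reference; the exact wording and numbering on the original slide may differ.

```latex
\begin{aligned}
&\text{(i)}   && P(A) \ge 0 \quad \text{for every } A \in \mathcal{A}, \\
&\text{(ii)}  && P(\Omega) = 1, \\
&\text{(iii)} && \text{if } A_1, A_2, \ldots \in \mathcal{A} \text{ are mutually exclusive and }
                 \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}, \text{ then }
                 P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).
\end{aligned}
```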
• Probability is a mapping (function) of sets to
numbers.
• Probability is not a mapping of the sample
space to numbers.
– The expression $P(\omega)$ for $\omega \in \Omega$ is not defined.
However, for a singleton event $\{\omega\}$, $P(\{\omega\})$ is
defined.
Probability space
• A probability space is the triplet $(\Omega, \mathcal{A}, P[\cdot])$, where
$\Omega$ is a sample space, $\mathcal{A}$ is an event space, and $P[\cdot]$ is
a probability function with domain $\mathcal{A}$.
• A probability space constitutes a complete
probabilistic description of a random
experiment.
– The sample space $\Omega$ defines all of the possible
outcomes, the event space $\mathcal{A}$ defines all possible
things that could be observed as a result of an
experiment, and the probability $P$ defines the degree of
belief or evidential support associated with the
experiment.
Finite Sample Space
• A random experiment can result in a finite
number of possible outcomes. A sample space
with only a finite number of elements (points)
is called a finite sample space.
• Finite sample space with equally likely points –
simple sample space
• Finite sample space without equally likely
points
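On a finite sample space, the probability of an event is the sum of the probabilities of its points. With equally likely points this reduces to counting. The sketch below contrasts a fair die with a hypothetical loaded die; the loaded weights are made up for illustration.

```python
from fractions import Fraction


def probability(event, point_probs):
    """P(event) as the sum of the probabilities of the points in the event."""
    return sum((point_probs[w] for w in event), Fraction(0))


# Simple sample space: six equally likely points.
fair = {face: Fraction(1, 6) for face in range(1, 7)}

# Finite sample space without equally likely points (hypothetical weights).
loaded = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
          4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)}

even = {2, 4, 6}
print(probability(even, fair), probability(even, loaded))
```

The function takes an event (a set of points) as its argument, not a single outcome.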
Conditional probability
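The defining formula is not reproduced in this transcript. The standard definition, for events $A$ and $B$ in the same probability space with $P(B) > 0$, is sketched below; the notation is an assumption.

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0 .
```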
Bayes’ theorem
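The statement is not reproduced in this transcript. A standard form, assuming $B_1, \ldots, B_n$ are mutually exclusive events whose union is $\Omega$, each with $P(B_j) > 0$, and $P(A) > 0$, is sketched below.

```latex
P(B_k \mid A) = \frac{P(A \mid B_k)\, P(B_k)}{\sum_{j=1}^{n} P(A \mid B_j)\, P(B_j)},
\qquad k = 1, \ldots, n .
```

The denominator equals $P(A)$ by the theorem of total probability.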
Multiplication rule
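The rule is not reproduced in this transcript. The standard statement, which follows from the definition of conditional probability whenever the conditioning events have positive probability, is sketched below.

```latex
\begin{aligned}
P(A \cap B) &= P(A \mid B)\, P(B) = P(B \mid A)\, P(A), \\
P(A_1 \cap A_2 \cap \cdots \cap A_n) &= P(A_1)\, P(A_2 \mid A_1) \cdots
  P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).
\end{aligned}
```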
Independent events
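The definition is not reproduced in this transcript. In its standard form, two events $A$ and $B$ are independent if and only if the following holds.

```latex
P(A \cap B) = P(A)\, P(B).
```

When $P(B) > 0$ this is equivalent to $P(A \mid B) = P(A)$: knowing that $B$ occurred does not change the probability of $A$.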
• The property of independence of two events A
and B and the property that A and B are
mutually exclusive are distinct, though related,
properties.
• If A and B are mutually exclusive events then
$A \cap B = \emptyset$. Therefore, $P(A \cap B) = 0$. Whereas, if A
and B are independent events then
$P(A \cap B) = P(A)P(B)$. Events A and B will be mutually
exclusive and independent events only if
$P(A \cap B) = P(A)P(B) = 0$, that is, at least one of A or
B has zero probability.
• But if A and B are mutually exclusive events
and both have nonzero probabilities then it is
impossible for them to be independent events.
• Likewise, if A and B are independent events
and both have nonzero probabilities then it is
impossible for them to be mutually exclusive.
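The distinction can be checked on a fair die. In this sketch the events are hypothetical examples: "even" and "at most 2" are independent but not mutually exclusive, while "even" and "1 or 3" are mutually exclusive but not independent.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # fair die: six equally likely points


def prob(event):
    """Classical probability of an event on the fair-die sample space."""
    return Fraction(len(event & OMEGA), len(OMEGA))


def independent(a, b):
    return prob(a & b) == prob(a) * prob(b)


def mutually_exclusive(a, b):
    return len(a & b) == 0


even = frozenset({2, 4, 6})
at_most_two = frozenset({1, 2})
one_or_three = frozenset({1, 3})
impossible = frozenset()        # an event with zero probability

print(independent(even, at_most_two), mutually_exclusive(even, at_most_two))
print(independent(even, one_or_three), mutually_exclusive(even, one_or_three))
print(independent(even, impossible), mutually_exclusive(even, impossible))
```

Only the last pair is both independent and mutually exclusive, and it involves an event of zero probability.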