Statistical and Data Analysis Methods, 1HY013, 5 ECTS
Exercise 3: Probability distributions
This exercise provides applied examples of how models of the probability distributions
of datasets can be used to obtain specific, useful information. In Part 1, you use a ~50-year
record of river discharge to calculate probabilities of high-flow events. In Part 2, you use
data from a survey of soil composition to help design a sampling strategy for potential soil
contamination by mining activity. You can start with either Part 1 or Part 2.
Tasks
Part 1 - Analysis of river discharge data
Task 1.1
You are provided with a 51-year (1950-2000) record of monthly mean
discharge measurements (Q) from a river called Matawani, in Canada. The
data are in an MS Excel spreadsheet (Exercise3_FlowData.xls). Open it to see
how the data are organized. Then, load the data into Matlab.
Produce a plot of the empirical cumulative distribution function (cdf) for the data.
Suppose that we define unusual flow situations in this river as follows:
"extreme low flow"     F(x) < 0.15
"low flow"             0.15 ≤ F(x) ≤ 0.25
"high flow"            0.85 ≤ F(x) ≤ 0.95
"flood"                F(x) > 0.95
Using the cdf plot, determine the monthly discharge values that mark the
boundaries of these events, i.e. the values that discharge must rise above or
fall below for each event to occur. Round the values to the nearest 10 m³/s.
Hint: use the data cursor tool in the Matlab figure window.
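The threshold lookup can also be done numerically as a check on the values read
with the data cursor. The following is a minimal sketch, not a reference solution.
It assumes the discharge values are available in a vector Q; here Q is filled with
synthetic, made-up numbers, because the layout of Exercise3_FlowData.xls is not
described in this handout. All variable names are illustrative.

```matlab
% Synthetic stand-in for the monthly discharge record (m^3/s).
% ASSUMPTION: made-up values; replace with the data loaded via xlsread.
rng(1);                                 % reproducible random numbers
Q = 200 * exp(0.6 * randn(612, 1));     % 51 years x 12 months = 612 values

% Empirical cdf: F(k) is the cumulative probability at the value x(k).
[F, x] = ecdf(Q);

% Cumulative probabilities that separate the flow categories.
levels = [0.15 0.25 0.85 0.95];

% For each level, take the smallest discharge whose cdf reaches that level.
thresholds = zeros(size(levels));
for k = 1:numel(levels)
    idx = find(F >= levels(k), 1, 'first');
    thresholds(k) = x(idx);
end

% Round to the nearest 10 m^3/s, as the task asks.
thresholds = round(thresholds / 10) * 10;
disp(thresholds)

% Plot the empirical cdf for visual comparison with the numbers above.
cdfplot(Q);
xlabel('Monthly mean discharge Q (m^3/s)');
ylabel('F(x)');
```

With the real record in Q, the four entries of thresholds should be close to the
values picked off the plot by hand.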
Task 1.2
Based on the results from Task 1.1, determine how often the Matawani river
experienced each of these situations during the period of record. To find out,
count the frequency of each situation by constructing a histogram with
predefined bins. What was the mean frequency of floods over the period
1950-2000? What was the mean return time (in years) of such events? What is the
probability that two floods could occur in the same decade?
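Counting events in predefined bins can be sketched as follows. This is only an
illustration under stated assumptions: the discharge data are synthetic and the
threshold values in thr are invented placeholders, not results for the Matawani
record. Note also that histc counts a value equal to a bin edge in the upper bin,
which differs slightly from the "≤" conventions in the table of Task 1.1.

```matlab
% ASSUMPTION: synthetic discharge data and placeholder thresholds (m^3/s).
rng(2);
Q = 200 * exp(0.6 * randn(612, 1));     % 612 months = 51 years
nYears = 51;
thr = [110 130 370 540];                % hypothetical category boundaries

% Bin edges: -Inf and Inf catch everything below the first and above the
% last threshold. histc counts edges(k) <= Q < edges(k+1) in bin k; the
% final bin counts only values exactly equal to Inf, so it stays empty here.
edges = [-Inf thr Inf];
counts = histc(Q, edges);
counts = counts(1:end-1);               % drop the trailing (empty) bin

% counts(1): extreme low flow, counts(2): low flow, counts(3): normal,
% counts(4): high flow, counts(5): flood.
nFlood = counts(5);

% Mean flood frequency (flood months per year) and mean return time (years).
floodsPerYear = nFlood / nYears;
returnTime = 1 / floodsPerYear;

fprintf('Flood months: %d, frequency: %.2f per year, return time: %.2f years\n', ...
        nFlood, floodsPerYear, returnTime);
```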
Task 1.3
Finally, plot the discharge time series. Do you notice any special features?
Can you suggest what could explain them? Are there any implications for
making inferences from these data, e.g. predictions of flood return time?
Course responsible: Dr. Claudia Teutschbein, e-mail: [email protected]
Part 2 - Analysis of sediment geochemistry data
Task 2.1
SGU conducted a survey of the trace metal content of soils in a 15 km² region
of Sweden where a new Pb-Zn mine will open in a few years. The objective of
the survey is to define the natural "background" concentrations of metals, so
that the future environmental impact of the mine can be measured against
these data. A total of 238 soil samples were analyzed. The data are in an MS
Excel workbook (Exercise3_SedData.xls). Open it to see how the data are
organized. Then, load the data into Matlab.
Create and compare box plots of (a) the raw data; (b) the normalized
(z-transformed) data; (c) the log-transformed data; and (d) the normalized,
log-transformed data. Do the same with normal probability plots. Which
metals have a "normal-like" distribution? Which ones are closer to log-normally
distributed? Which metals have the largest number of naturally occurring
positive outliers?
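The four versions of the data can be produced with a few lines of Matlab. The
sketch below is a minimal illustration, assuming the concentrations are stored in a
matrix X with one column per metal. The data and the metal labels are synthetic
placeholders, since the layout of Exercise3_SedData.xls is not given here, and the
choice of log10 rather than the natural logarithm is also an assumption.

```matlab
% ASSUMPTION: synthetic concentrations (ppm) for three hypothetical metals,
% one column per metal; replace with the columns read from the workbook.
rng(3);
X = [exp(3 + 0.5 * randn(238, 1)), ...  % log-normal-like
     exp(1 + 0.8 * randn(238, 1)), ...  % log-normal-like, more skewed
     50 + 5 * randn(238, 1)];           % normal-like, positive in practice
labels = {'MetalA', 'MetalB', 'MetalC'};

Z  = zscore(X);      % (b) z-transform: each column gets mean 0, std 1
L  = log10(X);       % (c) log-transform (requires positive values)
ZL = zscore(L);      % (d) z-transform of the log-transformed data

figure;
subplot(2, 2, 1); boxplot(X,  'Labels', labels); title('(a) raw');
subplot(2, 2, 2); boxplot(Z,  'Labels', labels); title('(b) z-transformed');
subplot(2, 2, 3); boxplot(L,  'Labels', labels); title('(c) log-transformed');
subplot(2, 2, 4); boxplot(ZL, 'Labels', labels); title('(d) z-transformed log');

% Normal probability plot: points that follow the straight reference line
% suggest a normal-like distribution for that column.
figure;
normplot(ZL);
```

Calling normplot on X, Z and L in the same way gives the remaining normal
probability plots for comparison.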
Task 2.2
National guidelines have been established that give recommended "safe limits"
for the concentration of some potentially toxic metals in soils:
Metal             Safe limit (ppm)
Cadmium (Cd)      1.2
Chromium (Cr)     81
Copper (Cu)       34
Lead (Pb)         37
Nickel (Ni)       21
Zinc (Zn)         150
In the SGU survey, samples were collected with an approximately
homogeneous spatial density: one soil sample per km² (except where there
was a lake or river). Now suppose that only 20 samples had been taken at random
over this area. What is the probability that such a survey would find at least one
sample with a Pb concentration naturally above the recommended limit?
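One way to approach this question is to treat each sample as an independent trial
with a fixed probability p of exceeding the limit, so that the number of
exceedances follows a binomial distribution. The sketch below rests on that
assumption. The value of p is a made-up placeholder, and the vector name Pb
mentioned in the comment is hypothetical.

```matlab
% ASSUMPTION: p is a made-up placeholder for the fraction of samples whose
% Pb concentration exceeds the safe limit; estimate it from the survey data,
% e.g. p = mean(Pb > 37) for a vector Pb of concentrations in ppm.
p = 0.08;
n = 20;                          % number of randomly placed samples

% P(at least one exceedance) = 1 - P(no exceedance)
pAtLeastOne = 1 - binocdf(0, n, p);

% Same quantity written out directly, as a cross-check.
pDirect = 1 - (1 - p)^n;

fprintf('P(at least one sample above the limit) = %.3f\n', pAtLeastOne);
```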
Task 2.3
Suppose that after 5 years of mining operations, 5% of soils around the mine
become contaminated by Pb and Zn (levels higher than the recommended safe
limits). Suppose also that the spatial pattern of contamination is unpredictable.
If you wanted a 90% probability of finding contamination in at least 10 samples,
what is the minimum number of samples you should take in the monitoring soil
survey?
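Under one reading of this task, the number X of contaminated samples among n
independent samples is binomial with p = 0.05, and the question asks for the
smallest n with P(X ≥ 10) ≥ 0.90. The following sketch searches for that n by
simple iteration. It is an illustration of this interpretation only, and the
variable names are arbitrary.

```matlab
p      = 0.05;    % fraction of contaminated soils (from the task)
kMin   = 10;      % at least this many contaminated samples
target = 0.90;    % required detection probability

% Increase n until P(X >= kMin) = 1 - P(X <= kMin - 1) reaches the target.
% This probability grows with n, so the first n that passes is the minimum.
n = kMin;
while 1 - binocdf(kMin - 1, n, p) < target
    n = n + 1;
end

fprintf('Smallest n: %d, P(X >= %d) = %.4f\n', ...
        n, kMin, 1 - binocdf(kMin - 1, n, p));
```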
List of useful Matlab commands
Statistics:
cdfplot – plots the empirical cumulative distribution function of a data array.
ecdf – computes the empirical cumulative distribution function of a data array,
returning the cumulative probabilities and the data values at which they apply.
histc – returns the frequency (counts) of data in user-defined intervals.
binopdf – returns the value of the binomial probability function for specified
values of x, n, p.
binocdf – returns the value of the cumulative binomial probability function for
specified values of x, n, p.
Related Matlab functions also exist for the hypergeometric, Poisson, and many other
distributions. Type "discrete distributions" or "continuous distributions" in the Matlab
help search window to see details of all the related functions available.
Figures:
bar – produces a bar plot.
hist – produces a frequency histogram.
boxplot – produces a box plot.
normplot – produces a normal probability plot.
Data cursor tool description: http://www.mathworks.se/help/matlab/creating_plots/datacursor-displaying-data-values-interactively.html
Data import/export:
xlsread – reads data from an MS Excel workbook/sheet (filename.xls).