Statistical and Data Analysis Methods, 1HY013, 5 ECTS
Exercise 3: Probability distributions

This exercise provides some applied examples of how models of the probability distributions of datasets can be used to obtain specific, useful information. In Part 1, you use a ~50-year record of river discharge to calculate probabilities of high-flow events. In Part 2, you use data from a survey of soil composition to help design a sampling strategy for potential soil contamination by mining activity. You can start with either Part 1 or Part 2.

Tasks

Part 1 - Analysis of river discharge data

Task 1.1
You are provided with a 51-year (1950-2000) record of monthly mean discharge measurements (Q) from a river called Matawani, in Canada. The data are in an MS Excel spreadsheet (Exercise3_FlowData.xls). Open it to see how the data are organized. Then, load the data into Matlab. Produce a plot of the empirical cumulative distribution function (cdf) for the data. Suppose that we define unusual flow situations in this river as follows:

"extreme low flow"    F(x) < 0.15
"low flow"            0.15 ≤ F(x) ≤ 0.25
"high flow"           0.85 ≤ F(x) ≤ 0.95
"flood"               F(x) > 0.95

Using the cdf plot, determine the threshold values of monthly discharge that must be passed (above or below) for these different events to occur. Round the figures to the nearest 10 m³/s. Hint: use the data cursor tool in the Matlab graphic window.

Task 1.2
Based on the results from Task 1.1, determine, for the period of record, how often the Matawani river experienced these different situations. To find out, count the frequency of each situation by constructing a histogram with predefined bin edges. What was the mean frequency of floods over the period 1950-2000? And the mean return time (years) of such events? What is the probability that two floods could occur in the same decade?

Task 1.3
Finally, plot the discharge time series. Do you notice any special features? Can you suggest what could explain them?
Are there any implications for making inferences from these data, e.g., for flood return time predictions?

Course responsible: Dr. Claudia Teutschbein, e-mail: [email protected]

Part 2: Analysis of sediment geochemistry data

Task 2.1
SGU conducted a survey of the trace metal content of soils in a 15 km² region of Sweden where a new Pb-Zn mine will open in a few years. The objective of the survey is to define the natural "background" concentrations of metals, so that the future environmental impact of the mine can be measured against these data. A total of 238 soil samples were analyzed. The data are in an MS Excel workbook (Exercise3_SedData.xls). Open it to see how the data are organized. Then, load the data into Matlab. Create and compare box plots of (a) the raw data; (b) the normalized (z-transformed) data; (c) the log-transformed data; and (d) the normalized, log-transformed data. Do the same with normal probability plots. Which metals have a "normal-like" distribution? Which ones are closer to log-normally distributed? Which metals have the largest number of naturally occurring positive outliers?

Task 2.2
National guidelines have been established that give recommended "safe limits" for the concentration of some potentially toxic metals in soils:

Metal            Safe limit (ppm)
Cadmium (Cd)     1.2
Chromium (Cr)    81
Copper (Cu)      34
Lead (Pb)        37
Nickel (Ni)      21
Zinc (Zn)        150

In the SGU survey, samples were collected with an approximately homogeneous spatial density: one soil sample per km² (except where there was a lake or river). Now suppose that only 20 samples were taken at random over this area. What is the probability that this survey would find at least one sample with a Pb concentration naturally above the recommended limit?
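One possible way to set up the calculation in Task 2.2 is sketched below. It is an illustration under stated assumptions, not the required solution: the 20 samples are treated as independent trials with a constant exceedance probability (a binomial model, which only approximates sampling without replacement), and the value of p_exceed is a hypothetical placeholder rather than a result from the SGU data. The variable name Pb mentioned in the comments is likewise hypothetical.

```matlab
% Sketch for Task 2.2: probability of at least one exceedance in n samples.
% ASSUMPTION: p_exceed is a hypothetical placeholder, NOT a value from the
% SGU data. Replace it with the fraction of survey samples above the limit,
% e.g. p_exceed = mean(Pb > 37), where Pb is your imported Pb column.
p_exceed = 0.10;   % hypothetical probability that one sample exceeds 37 ppm
n = 20;            % number of randomly collected samples

% P(at least one) = 1 - P(none), using the binomial probability of x = 0
p_at_least_one = 1 - binopdf(0, n, p_exceed);

% Equivalent closed form, usable as a check without the Statistics Toolbox
p_check = 1 - (1 - p_exceed)^n;

fprintf('P(at least one exceedance in %d samples) = %.4f\n', n, p_at_least_one);
```

Working with the complement (no exceedance at all) avoids summing the probabilities of 1, 2, ..., 20 exceedances separately.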
Task 2.3
Suppose that after 5 years of mining operations, 5 % of soils around the mine become contaminated by Pb and Zn (levels higher than the recommended safe limits). Suppose also that the spatial pattern of contamination is unpredictable. If you wanted to have a 90 % probability of detecting the presence of possible contamination in at least 10 samples, what is the minimum number of samples you should take in the monitoring soil survey?

List of useful Matlab commands

Statistics:
cdfplot – plots the empirical cumulative distribution function of a data array.
ecdf – computes the empirical cumulative distribution function of a data array, evaluated at the observed data values.
histc – returns the frequency (counts) of data in user-defined intervals.
binopdf – returns the value of the binomial probability density function for specified values of x, n, and p.
binocdf – returns the value of the cumulative binomial probability function for specified values of x, n, and p.

Related Matlab functions also exist for the binomial, hypergeometric, and Poisson distributions, and many others. Type "discrete distributions" or "continuous distributions" in the Matlab help search window to see details of all the related functions available.

Figures:
bar – produces a bar plot
hist – produces a frequency histogram
boxplot – produces a box plot
normplot – produces a normal probability plot

Data cursor tool description: http://www.mathworks.se/help/matlab/creating_plots/datacursor-displaying-data-values-interactively.html

Data import/export:
xlsread – reads data from an MS Excel workbook/sheet (filename.xls).
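As an illustration of how ecdf and histc from the list above can be combined, here is a minimal sketch on synthetic data. It does not use the exercise files: the array q, the random seed, the lognormal parameters, and the choice of the 0.95 level are all assumptions made for the example. The ecdf function requires the Statistics Toolbox.

```matlab
% Sketch on SYNTHETIC data (not the Matawani record): q is a made-up
% array standing in for a column of monthly discharge values.
rng(1);                          % fixed seed so the run is repeatable
q = exp(5 + 0.5*randn(612, 1));  % 612 = 51 years x 12 months, hypothetical values

% Empirical cdf: f(i) is the fraction of data values <= x(i)
[f, x] = ecdf(q);

% Smallest data value at which the empirical cdf reaches 0.95
q95 = x(find(f >= 0.95, 1, 'first'));

% Count values in user-defined intervals with histc.
% histc returns one count per edge; the last entry counts values exactly
% equal to the last edge (Inf here), so it is 0 for finite data.
edges   = [0, q95, Inf];
counts  = histc(q, edges);
n_below = counts(1);             % number of values in [0, q95)
n_above = counts(2);             % number of values in [q95, Inf)

fprintf('Threshold: %.1f, values at or above it: %d of %d\n', ...
        q95, n_above, numel(q));
```

Reading the threshold from the arrays returned by ecdf gives the same information as the data cursor on a cdfplot figure, but in a form that can be reused as a bin edge for histc.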