Download AnswersPSno2

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Probability wikipedia , lookup

Randomness wikipedia , lookup

Transcript
Problem Set #2
Geog 2000: Introduction to Geographic Statistics
Instructor: Dr. Paul C. Sutton
Due Tuesday Week #5 in Lab
Instructions:
There are three parts to this problem set: 1) Do it by Hand Exercises (#’s 1 – 10). Do
these with a calculator or by hand. Draw most of your figures by hand. Write out your
answers either digitally or by hand. Start your answer to each numbered question on a
new page. Make it easy for me to grade. 2) Computer exercises (#’s 11-13). Use the
computer to generate the graphics and paste them into a digital version of this assignment
while leaving room for you to digitally or hand write your responses if you so wish. 3)
‘How to Lie with Statistics’ Essay(s) (#’s 14-18). For this section, prepare a typed
paragraph answering each question. Do not send me digital versions of your answers. I
want paper copies because that is the easiest way for me to spill coffee on your
assignments when I am grading them. All of the questions on these four problem sets will
resemble many of the questions on the three exams. Consequently, it behooves you to
truly understand – ON YOUR OWN – how to do these problems. I am fine with you
working with other students on these problem sets (to enhance your learning
opportunities); however, my exams will punish those who free ride the comprehension of
the material in these problem sets. (Don’t mess with me; don’t cheat yourself).
Pattern Recognition: Is this an old ‘cold war’ era satellite image or;
a, ‘down on the farm’ photograph of Farmer John’s Best friend?
Do it By Hand Exercises (i.e. Don’t Use A Computer)
#1) The Standard Normal Distribution in all its glory
Make a detailed drawing of a Standard Normal (e.g. N(0,1)) probability density function
(aka pdf), and answer the following questions:
A) What fraction of observations drawn from a N(0,1) distribution are larger than 1
standard deviation above the mean?
13.6 + 2.1 + 0.1 = 15.8%
B) What is the mathematical equation describing the N(0,1) pdf?
F(x) = (1 / (2*п)1/2 ) * (2.718281828)(-x2 / 2)
Note: It looks nicer in the figure above
C) What is the total area under the N(0, 1) curve? (e.g.
is the equation you identified in part ‘B’?
) where
P(x)
The total area under the standard normal curve from -∞ to +∞ is simply 1.
D) What fraction of observations drawn from a N(0,1) distribution would you expect
to land anywhere from -1 to +1? (e.g. -1 <= X <= +1)
34.1 + 34.1 = 68.2%
E) What fraction of observations drawn from a N(0,1) distribution would you expect
to land anywhere from -2 to +2? (e.g. -2 <= X <= +2); -3 to +3? (e.g. -3 <= X <= +3)
-2 to +2:
-3 to + 3:
34.1 + 34.1 + 13.6 + 13.6 = 95.4%
34.1 + 34.1 + 13.6 + 13.6 + 2.1 + 2.1 = 99.6%
F) Assume women’s heights are distributed N(5’ 7”, 3”). A particular woman has a
height of 5’ 10”. What percentage of women are taller than this woman?
Same as Answer to ‘A’ above:
13.6 + 2.1 + 0.1 = 15.8%
G) Suppose there are 1,000 trout in Lake Goober. Assume the lengths of these trout
are distributed N(10”, 2”). How many fish would you expect that lake to have that
are anywhere from 11” to 13” long?
The expected value for the number of fish
between 11” and 13” inches long is: 242
The total area under the standard normal curve from +0.5 to +1.5
is 0.3085 – 0.0668 = 0.2417 or 24.2% 24.2% of 1,000 is 242 fish
#2) Measures of variability and Z-scores oh my!
Given the following observations of a random experiment calculate the following:
3, 4, 4, 6, 7, 8, 8, 8, 9, 10, 10, 12, 21, 28, 35, 52
A) The Mean 14.0625
B) The Median 8.5
C) The Standard Deviation 13.5
D) The Variance 182.2
E) The inter-quartile range 12 or 12.5
F) The Mode 8
H) Draw a histogram of these data and comment on the pattern it reveals.
Distributions
Column 1
Quantiles
0
10
20
30
40
50
100.0% maximum
99.5%
97.5%
90.0%
75.0%
quartile
50.0%
median
25.0%
quartile
10.0%
2.5%
0.5%
0.0%
minimum
Moments
52.000
52.000
52.000
40.100
18.750
8.500
6.250
3.700
3.000
3.000
3.000
Mean
Std Dev
Std Err Mean
upper 95% Mean
low er 95% Mean
N
14.0625
13.497994
3.3744984
21.255073
6.8699269
16
This is clearly a skewed distribution with a longer tail toward high values
I) How is the standard deviation different than the Mean Absolute Deviation?
Define both the standard deviation and the mean absolute deviation and calculate
each of them for this set of numbers.
The Standard Deviation ‘S’ is given by:
In this dataset ‘s’ = 13.5
The Mean Absolute Deviation is given by
In this dataset MAD = 9.96875
J) Calculate a z-score (e.g. a normalized value for each of the 16 numbers). Does it
make sense to use z-scores for this particular distribution? Explain
X
Zsc0re
3
0.82
4
0.75
4
0.75
6
-0.6
7
0.52
8
0.45
8
0.45
8
0.45
9
0.38
10
10
-0.3
-0.3
12
0.15
21
28
35
52
0.514
1.033
1.551
2.811
If one is calculating z-scores under the assumption that this dataset is distributed
normally it is not a good idea. This dataset is way too skewed to regard as normal.
K) Suggest a random experimental observation that might produce a set of numbers
like this?
A lot of geographic data has skewed distributions like this. For example this could
be the windspeeds at a particular location on March 9th for 16 different years. It
could also be the number of dollars found in 16 randomly selected people’s pockets.
#3) Measurement Scales: “These Tennis Balls are MUCH Fuzzier”.
Define the four measurement scales most commonly encountered in geographic
measurement (listed below as ‘A’, ‘B’, ‘C’, and ‘D’). Provide an example of the
observations of a ‘random process’ for each of these:
A) Nominal
‘Apples and Oranges’. The values of this scale have no ‘numeric’ meaning in
the way that we usually think about number. Land cover classes, Religion,
Language, Sex, etc.
B) Ordinal
‘Ranking’ Sears ‘Good’ , ‘Better’, ‘Best’ characterization of their appliances.
Likert scale survey responses such as ‘Strongly Agree’, ‘Agree’, ‘Neutral’
etc. The measurments can be ranked in an order from highest to lowest but
the difference between them is not comparable (e.g. ‘Better’ is not necessarily
as far from ‘Good’ as ‘Best’ is from ‘Better’). In essence, the ‘distance’
between numbers is not meaningful.
C) Interval
With both interval and ratio data the distance between numbers is
meaningful. For example; 90˚F is 10 degrees hotter than 80˚ F and that 10˚
difference is the same as the 10˚ difference between 30˚ degrees and 20˚
degrees. However, for interval data the zero is not meaningful. Degrees
Fahrenheit being the most commonly used example of interval data. 80˚ F is
NOT twice as ‘hot’ as 40˚ F nor is 20˚ F twice as hot as 10˚F . If you measure
in degrees Kelvin – Temperature is an absolute measure of the energy in
matter and temperature ratios can be used reasonably.
D) Ratio
With ratio scale measurements the zero is meaningful and making ratios of
measurements is appropriate. Length, Weight, Counts (e.g. populations) are
all ratio measures.
E) Classify each of the following according to their measurement scale:
1) Student course Evaluation Scores here at DU.
(i.e. Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree)
Likert Scale responses are definitely ‘ORDINAL’; however, many
statisticians perform analyses on survey data using statistical methods that
assume the responses are interval. This is considered problematic by some
but is a common practice. For example, if survey responses ‘Strongly
Agree’….”Strongly Disagree’ are coded as 5 …. 1 (Which they almost
universally ARE comparing means (or averages) of these number is not truly
kosher statistically if they are ordinal data. Nonetheless – Most Universities
look at averages of these kinds of numbers when evaluating teaching
effectiveness.
2) Sex (e.g. Male or Female)
Sex is a pretty heavy duty topic but classifying people as male or female is
relatively simple (not always – but most of the time). This kind of
classification is definitely Nominal.
3) Number of text messages sent from randomly selected cell phones
This is clearly a count. 10 messages is twice as many as 5 and zero is zero.
This kind of measurement is clearly Ratio.
4) SAT scores of the graduating high school students of the year 2000
This is the toughest of these four to classify its measurement scale. It is
impossible to even get a zero on an SAT test. The lowest possible score is 200.
One question is: Is a 300 as much better than a 200 as a 700 is better than a
600. Another question is: Is an 800 twice as good as a 400? (answer: probably
not – in fact – an 800 is probably more than twice as good as a 400). SAT
scores are definitely ordinal – and they may be interval - but they are
certainly not ratio.
#4) Tumbling Dice
Use the example of the rolling of two six sided dice to define the ideas of:
A) A Random Experiment
From the book: A random experiment is the process of observing the
outcome of a chance event. Rolling two die and summing their face up values
is not only a random experiment but it produces a value or ‘score’ which is a
random variable which is the numerical outcome of a random experiment.
B) Elementary Outcomes
The elementary outcomes of this random experiment are all the possible
combinations of the two die:
1,1 1,2 1,3 1,4 1,5 1,6
4,1 4,2 4,3 4,4 4,5 4,6
2,1, 2,2 2,3 2,4 2,5 2,6
5,1 5,2 5,3 5,4 5,5 5,6
3,1 3,2 3,3, 3,4 3,5 3,6
6,1 6,2 6,3 6,4 6,5 6,6
Basically there are 36 ways for two dice to ‘result’ in a ‘score’.
Note However that if you define the ‘score’ as the ‘sum’ of the two die there
are only eleven possible outcomes (e.g. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) and they
do not all have the same probability.
C) Sample Space
From the book: The sample space is the set or collection of all the elementary
outcomes. In this case it has been spelled out clearly above.
D) Draw a probability mass function for the random experiment of rolling two
dice and summing their face up numbers
Score
2
3
4
5
6
7
8
9
10
11
12
Prob
1 in 36
2 in 36
3 in 36
4 in 36
5 in 36
6 in 36
5 in 36
4 in 36
3 in 36
2 in 36
1 in 36
Dec Eqv
0.02777
0.0555
0.08333
0.1111
0.13888
0.1666
0.13888
0.1111
0.08333
0.0555
0.02777
E) What is the the probability of at least one of the die showing a ‘6’ on its face
in a single roll of two dice?
The easy but tedious way to answer this question is to go to the sample space and
count the number of the 36 ‘events’ that have a ‘6’ in them. Based on that the odds
(aka probability) of having at least one ‘6’ in a roll of two die is 11/36 (0.30555).
From the book we might see it as one of the probability rules:
P(E or F) = P(E) + P(F) – P(E & F).
In this case P(‘6’ on die ‘a’ OR ‘6’ on die ‘b’) = 1/6 +1/6 – 1/36 = 11/36. Get it? 
#5) Sometimes extreme events are very important & the Normal Distribution is not.
Consider the following numbers observed by a hypothetical hydrologist for a stream he
has been observing for 3 years. The numbers represent the maximum flow of water (in
cubic meters per minute), at a given point at the lower end of the stream right after a
rainstorm.
1.0, 1.0, 1.5. 1.6, 1.8, 2.0, 2.1, 2.1, 2.1, 2.5, 2.5, 2.6, 2.7, 2.7, 2.7, 3.0, 3.3, 3.5,
3.5, 3.6, 3.6, 3.7, 3.7, 3.8, 3.9, 4.1, 4.3, 4.9, 5.5, 6.3, 7.1, 7.5, 8.8, 10.3, 16.5, 59.6
A) Draw a histogram of these numbers (use JMP if you wish). Do these observations look
normally distributed? How might the Normal distribution not be an appropriate
model for these observations?
Distributions
Column 1
Quantiles
0
10
20
30
40
50
60
100.0% maximum
99.5%
97.5%
90.0%
75.0%
quartile
50.0%
median
25.0%
quartile
10.0%
2.5%
0.5%
0.0%
minimum
Moments
59.600
59.600
59.600
9.250
4.750
3.500
2.200
1.570
1.000
1.000
1.000
Mean
Std Dev
Std Err Mean
upper 95% Mean
low er 95% Mean
N
5.5944444
9.7290124
1.6215021
8.8862687
2.3026202
36
Clearly, this distribution is very skewed. Does not look normal at all.
C)
Suppose that sediment transport (e.g. kilograms of soil, gravel etc.) that are
moved by these storms varies directly as a function of the flow of water raised to
the 3rd power (e.g. a cubic function (i.e. Flow = 1 , sediment transport = 1; flow =
2, sediment transport = 8; flow = 3, sediment transport = 27). Calculate the
predicted sediment transport for each of the storms and calculate what percentage
of the sediment transport that took place over these three years was accomplished
by the single largest storm. Paste a table in here (you’ll probably want to use excel or JMP)
Stream Flow
Sed Flow (x3)
1
1
1
1.5
1.6
1.8
1 3.375 4.096 5.832
2
2.1
2.1
2.1
2.5
2.5
8 9.261 9.261 9.261 15.625 15.625
2.6
17.576
3
3.3
3.5
3.5
3.6
3.6
3.7
3.7
27 35.94 42.875 42.88 46.66 46.66 50.653 50.653
3.8
54.872
Stream Flow
Sed Flow (x3)
2.7
2.7
2.7
19.68 19.68 19.68
Stream Flow
Sed Flow (x3)
3.9
4.1
4.3
4.9
5.5
6.3
7.1
7.5
8.8
10.3
16.5
59.6
59.32 68.92 79.51 117.6 166.4 250.05 357.9 421.9 681.5 1092.7 4492.1 211708.7
Total Sediment moved: 220,053.8
211,708/220,053 = .962
Biggest Stream Flow: 211,708
96.2% From One Storm
D) Assume these are reasonable numbers (as observations) and the stated relationship
between stream flow and sediment transport is also correct. How important are
extreme events? How well do you think the Normal distribution will serve as a
model for these kind of hydrologic phenomena?
In this example a single event accounted for 96.2% of the sediment flow. This event
also accounted for almost 30% of total water flow (29.59% actually). Clearly
modeling this with a normal distribution is not such a good idea. Beware of this
because many many many geographic phenomena are not well characterized by the
normal distribution.
#6) Shooting Craps - The odds are not that easy to calculate….
Google the rules for shooting craps. The gist is this (I think; – roll a ‘7’ or an ‘11’ on the
first roll – you win. If you roll ‘snake-eyes’ (a one and a one) you lose. I think you also
lose if you roll a 12). If you roll something other than a 2,7,11, or 12 than you keep
rolling until you match what you rolled or something else happens – I don’t really know the rules.
A) In any case – summarize the rules of ‘craps’ right here. (I hope to learn something
from this).
Thanks to Wikipedia – Again: Craps is a game played by 1 or more players. Players
take turns rolling two dice. The player rolling the dice is called the "shooter." The game
is played in rounds, with the first roll of a new round called the "come-out roll."On the
come-out roll, if the total of the two dice is 7, 11, 2, 3 or 12, the round ends immediately
and the shooter must roll another come-out roll. A result of 2, 3 or 12 is called 'craps'
while a result of 7 or 11 is called a 'win' or a 'natural.' When any other number (4, 5, 6,
8, 9, or 10) is rolled on the come-out roll, this number becomes what is called the point.
If a point is established, then the shooter will re-roll the dice continuously until either a
7 is rolled or the point is rolled again. If the shooter rolls the point again, the round
ends and the game starts over with the same shooter rolling another come-out roll. If
the shooter rolls a 7 instead of the point, this is called a 'seven-out,' the round ends, and
the dice pass to the next player to the left, who becomes the new shooter.
B) What is the probability of ‘winning’ on your first roll? (e.g. a ‘7’ or an ‘11’)
Rolling a ‘7’ is mutually exclusive from rolling an ‘11’. Consequently we can simply
add their respective probabilities. P(‘7’) = 1/6; P(‘11’) = 1/18. Thus the probability
of winning on the ‘come-out’ roll is simply: 1/6 + 1/18 = 2/9 (.2222..)
C) What is the probability of ‘losing’ on your first roll?
Again, the probability of rolling a 2,3, or 12 are all mutually exclusive so we can
simply add their respective probabilities. P(‘2’) = 1/36; P(‘3’) = 2/36; P(‘12’) = 1/36.
Thus the probability of losing on ‘come-out’ roll is: 1/36 + 2/36 + 1/36 = 1/9 (.1111..)
D) Why is it difficult to calculate the ‘overall’ odds of winning at craps?
While it is easy to calculate the odds (probability) of winning or losing on the ‘come-out’
roll (which is in fact 2/9 + 1/9 = 1/3 of the time the first roll is the last roll of a round); it is
not as easy to calculate the odds when a ‘point’ is rolled. Suppose your ‘come-out’ roll is a
‘5’. Theoretically you could indefinitely roll ‘not 5’s’ and ‘not 7’s’ forever. It’s kind of
like baseball that way . Solving this kind of problem is still classical probability and the
numbers are calculable (real mathematicians would regard these problems as trivial);
nonetheless they are beyond the scope of this instructor and this class .
#7) A Tale of two distributions
Use the normal and the binomial distribution to explain the difference between a discrete
random variable and a continuous random variable. What are the parameters of these two
distributions? Why does one have a probability distribution function whereas the other
has a probability mass function? Describe the relationship between these two
distributions? (Larry Gonick’s ‘Cartoon Guide to Statistics’ Should REALLY help ).
The Normal Distribution is continuous and has two parameters the mean (μ) and
variance (σ2 ). The mean describes the central tendency and the variance describes
the spread or variability about that central tendency. We often designate this as
N(μ, σ2 ). The pdf of a normal distribution is a symmetric bell shaped curve. It is a
pdf (probability density function) because the normal is continuous. Random
variables measured from something that is distributed normally can take on an
infinite number of possible values. The Binomial on the other hand is a discrete
distribution. Consequently it has a probability mass function (pmf) which essentially
describes the probability of each of the finite number of possible outcomes of a
Binomially distributed random variable. The number of possible outcomes (or
sample space ) is N plus 1; where ‘N’ is one of the parameters of the binomial
distribution. ‘N’ can be thought of as the ‘# of coin flips’ or ‘# of Bernoulli trials’.
The other parameter of the Binomial is ‘p’ or the ‘probability of success’ or, in the
case of a coin, the probability of ‘heads’. Thus a binomial random variable is
generally designated as Bin(N, p). As ‘N’ approaches very large numbers and ‘p’ is
0.5 the pdf of a binomial starts to look very similar to a normal distribution. The
Normal and Binomial are two very important and fundamental distributions that
are distinct in that they are continuous and discrete respectively; however, they are
fundamentally related to one another.
2,000 samples from Binomial (100, .5)
2,000 samples from N(50, 5)
The above two figures were derived from random generations from a Binomial and a
Normal. As you can see there is a striking similarity (and that’s with N=100).
#8) NASCAR, fishing derbies, and Random Variables
Suppose there is a big fishing derby for NASCAR fans in Kentucky. Lake Goober is
stocked with trout whose lengths are distributed N(10, 3). Assume the IQ of the fishing
derby participants is distributed N(80, 10). Let’s say we create a random variable that is
simply the sum of the IQ of the fisherman and the length of the first fish he or she
catches. How will that random variable be distributed?
First, let’s assume that the IQ of the fisherman and the length of the first fish he or
she catches are independent (i.e. smarter folks don’t tend to catch bigger fish). If
you have carefully read pages 68-72 of The Cartoon Guide you would know that
when we add two independent random variables we can simply add their means and
add their variances; so, the answer is simply:
N(80, 10) + N(10, 3) = N(90, 13)
#9) Testing for rare diseases and the False Positive Paradox
(not a true paradox by the way)
Suppose a rare disease infects one out of every 3,000 people in a population. And
suppose that there is a good, but not perfect, test for this disease. If a person has the
disease the test comes back positive 99 % of the time. On the other hand, the test also
produces some false positives. About 1 % of the uninfected patients also test positive.
Assume you just tested positive.
A) What are your chances of having this disease?
Let’s say one million people get tested. Presumably you can expect there to be
(1/3000)*1,000,000 = 333.33 people who actually have the disease. 99% of these
would test positive: 0.99 * 333.33 = 330. Thus out of a million 330 with the disease
would test positive. Now, how many without the disease would test positive? Well,
there would be 1,000,000 – 333.33 = 999,666.67 ‘people’ will not have the disease.
1% of these will test positive. This would be: 0.01 * 999,666.67 = 9,996.67. So the
number of people who tested positive is: 9,996.67 + 333.33 = 10,330. The percentage
of those who tested positive that actually have the disease is 100* (333.33 / 10,330) =
32.3%. So, even if you tested positive there is only a 32.3 % chance that you really
have the disease.
C) Also, assume all administrations of this test are independent of one another
regardless of who is tested. What should you do if you test positive?
Get tested again. If you test positive again start to worry but get tested again.
This kind of test with ‘false positive’ possibilities is problematic.
#10) Pascal’s Triangle, Permutations, Combinations, and the Binomial Distribution
Draw the first few lines of Pascal’s triangle and explain the rule for drawing additional
rows indefinitely. How is this triangle related to the Binomial distribution? What is the
binomial distribution again (include formulas)? Define, and explain applications, of the
formulas for both permutations P(n,r) and combinations C(n,r) shown below:
P(n,r) is the number of ‘permutations’ of
‘n’ things taken ‘r’ at a time. For
permutations order matters. C(n,r) is the
number of Combinations of ‘n’ things
taken ‘r’ at a time. Here order does not
matter. Think of the letters A, B, C, and D.
How many permutations of two letters
from these four are possible? That would
be P(4, 2). The formula says it is:
4*3*2*1 = 12 Permutations
(2*1)
(e.g. AB, BA, AC, CA, AD, DA, BC, CB, BD, DB, CD, DC)
How many combinations are possible?
The next line is 1, 6, 15, 20, 15, 6, 1 – The rule
Is That you add the two numbers above it (and
keep 1’s on the edges).
The formula says: 4*3*2*1 = 6 combinations
2*1*(2*1)
(e.g. AB, AC, AD, BC, BD, CD)
Wikipedia again –
In mathematics, the binomial theorem is an important formula giving the expansion of powers of sums.
Its simplest version says
whenever n is any non-negative integer, the number
is the binomial coefficient (using the choose function), and n! denotes the factorial of n.
Binomial Probability Formula:
Where p is the probability of a success, and q is the probability of a failure (which is
complimentary to p, that is q=1-p.) This is simply a rewording of the binomial theorem.
That is N is N factorial and K is K factorial*(N - K)
Computer Problems (use JMP and/or Excel for these exercises)
#11) Going from Univariate to Bivariate Characterization of Random
Manifestations of distributions: Point patterns in ‘Spaaaaace’.
YOW! This is fun.
As you did in problem set one, create the following ‘random’ generations of 2000 values
for each of the following distributions: N(100, 3), N(100, 30), N(100, 30) (again – two
different instantiations), Exponential (1), Uniform (100), Uniform (100) (again – two
different instantiations), and Gamma (7).
Now, you will use the ‘scatterplot’ functionality of JMP. Go to the ‘Analysis’ menu and
choose ‘Fit Y by X’. Here you can produce scatterplots of any two of your univariate
random columns. The dialog box looks like the figure below.
Once you have selected an ‘x’ and a ‘y’ you click ‘OK’ and get a scatterplot that in this
case looks like the figure on the right. In this
case a Uniform(100) distribution was plotted
against an Exponential (1) distribution.
Suppose this is a ‘map’ of a geographically
observed phenomena. What random
observations might show such a pattern?
Here is an example: Presume that the ‘X’
axis is the north side of an interstate
highway and the black dots represent
cigarette butts found along the side of the
highway. Most are close to the highway and
they appear less and less frequently as an
exponential function of distance from the
road. They are uniformly distributed along
the highway because (in theory) people will
toss butts out the window at equal probability anywhere along the road (of course we are
assuming we are a long way from exits, gas stations, scenic overlooks, etc. In fact, this is
a perfectly reasonable ‘model’ of cigarette butt distribution from a transportation
corridor. Hopefully this improves your idea of what many people are talking about when
they say something along the lines of: ‘We modeled this geographically manifested
phenomena according to these assumptions and distribution functions’. Not really as
scary as it sounds eh? Now your problems are to create a bunch of scatterplots based on
the distributions you have ‘randomly’ sampled from. For each scatterplot suggest a
phenomena (geographic or not) that might produce such a pattern. (Note: Paste your plot
into your submitted answer set along with your comments and interpretations).
A) Plot N(100, 30) vs. N(100,30). Describe the pattern you see and suggest a random
phenomena that might produce such a pattern. Explain.
This is a ‘shotgun’ pattern with
the densest concentrarion of
points at the center point (100,
100). Possible phenomena that
might produce such a pattern – 1)
Dead cows around a poisoned
water well? 2) A scatter plot of
IQ and height of Americans born
in 1980?
Bivariate Fit of N(100, 30) 'A' By N(100, 30) 'B'
N(100, 30) 'A'
200
100
0
0
100
200
N(100, 30) 'B'
B) Plot N(100, 30) vs. an Exponential (1). Describe the pattern you see and suggest
a random phenomena that might produce such a pattern. Explain.
Bivariate Fit of N(100, 30) 'B' By Exponential (1)
200
N(100, 30) 'B'
This one is kind of weird. The
density is bell-shaped as you go
up and down the y axis but
decays exponentially as you go
right along the x axis. Possible
pehneomena that might display
this pattern – People crowding
around a window in a wall
from which someone is
handing out $1,000 dollar
bills? The scattering of a
turned over truck full of
rubber duckies (Y axis is the
road)?
100
0
0
1
2
3
4
5
Exponential (1)
6
7
8
9
C)
Plot a Uniform (100) vs. a N(100, 30). Describe the pattern you see and suggest a
random phenomena that might produce such a pattern. Explain.
Bivariate Fit of Uniform(100) 'B' By N(100, 30) 'A'
The density of points is uniformly bell
shaped along the x-axis and constant along
the y-axis. Possible phenomena that might
display this pattern – Trash along a north –
south running highway? Riparian trees e.g.
cottonwoods) along a north –south running
river?
Uniform(100) 'B'
100
90
80
70
60
50
40
30
20
10
0
0
100
200
N(100, 30) 'A'
D) Plot a Gamma(7) vs. A Uniform (100). Describe the pattern you see and suggest a
random phenomena that might produce such a pattern. Explain.
Bivariate Fit of Gamma (7) By Uniform (100) 'A'
20
Gamma (7)
The point density is uniform as you
travel across the x-axis but it is low,
peaks, and dribbles out (skewed) as
you go up the y-axis. A plot of ‘inches
of rainfall per storm’ (y –axis) vs. time
(x-axis) might produce this pattern.
Or, this could be the pattern of where
pennies land when tossed from the top
of a tall building. Where the wall of
the building is the x axis.
10
0
10 20 30 40 50 60 70 80 90 100
Uniform (100) 'A'
E) Plot a Uniform (100) vs. a Uniform (100). Describe the pattern you see and suggest
a random phenomena that might produce such a pattern. Explain.
Bivariate Fit of Uniform (100) 'A' By Uniform(100) 'B'
100
Uniform (100) 'A'
Now this is complete spatial
randomness. Uniform distribution
across x-axis, uniform distrubition
across y-axis. Phenomena that might
look like this –Meteorite impacts on a
planimetric surface (almost Colorado –
latitude effects tweak a bit); Cactus
locations on a large flat desert with
uniform soil, rainfall etc.?
90
80
70
60
50
40
30
20
10
0
0
10 20 30 40 50 60 70 80 90 100
Uniform(100) 'B'
F) Plot a N(100,30) vs. N(100,3). Describe the pattern you see and suggest a random
phenomena that might produce such a pattern. Explain.
Bivariate Fit of N(100, 30) 'B' By N(100, 3)
200
N(100, 30) 'B'
Be careful here. This has a striking
resemblance to the ‘shotgun’ pattern you
saw in ‘A’; however, note the scales. The
y-axis ‘spread’ is much higher than the xaxis ‘spread’. If you ‘mapped’ this fairly
it would be a ‘cigar-shaped’ pattern.
Possible ways to manifest this pattern:
The pattern –on the ground- of sewage
dumped from a flying jet? A scatterplot
of the length of a coke cans held by 2000
people and their respective weights in
pounds?
100
0
90
100
110
N(100, 3)
#12) What is ‘Complete Spatial Randomness’ (aka CSR) for Point Patterns?
Complete spatial randomness is an important theoretical idea for geographers. Of
course, the nature of what you are measuring is important; and, confounding, for this
idea. What would completely spatially random temperature look like? Not a
reasonable thing to think of as a random phenomena - really. Temperature is spatially
auto-correlated – which we will talk about later. Temperature is also continuous –
best represented as a field. In fact, temperature, pressure, humidity, rainfall,
population density, and many other important geographic phenomena are often best
represented as a field and ARE NOT SPATIALLY RANDOM. The fact that so many
phenomena are not spatially random is a big reason geography is a discipline.
Nonetheless; complete spatial randomness is an important idea to start from. Often
we will test to see if phenomena manifest as non-random in space. So… Let’s
consider a phenomena that we might theoretically expect to be completely spatially
random (CSR): Meteorite impacts in the state of Colorado.
A) Which of the spatial distributions you produced in Question #1 of the Computer
problems do you think most closely represents ‘Complete Spatial Randomness’ as
we might theoretically expect for the ‘x’, ‘y’ locations of meteorite impacts in the
state of Colorado? Explain.
The Pattern in ‘E’ – Uniform vs. Uniform defines an example of Complete
Spatial Randomness. Any ‘X’ or ‘Y’ coordinate is equally likely. Interestingly
enough, a quadratting of such a CSR pattern produces ‘counts’ that have a
Poisson distribution.
#13) How do you randomly sample the earth’s surface?
A team of scientists were producing a global land cover dataset derived from the
interpretation of satellite imagery obtained from Landsat. They used some complex
statistical methods beyond the scope of this course to ‘classify’ all the ‘pixels’ in these
images as some sort of land cover type or ocean or lake or whatever. These complex
methods of ‘classification’ are not perfect. Efforts of this nature usually require a
statistical approach to ‘ground truthing’ of the ‘classification accuracy’. Usually this
involves the generation of ‘X’ number of points (RANDOMLY IN SPACE) in which
‘field scientists’ will got to those particular locations and determine the appropriate land
cover classification of that particular point in space. Once all these field scientists have
traveled to these locations and characterized them a ‘classification accuracy’ assessment
can be completed.
A) Suppose the team that is doing this work prepares the following “Random
Sample” of the earth’s surface according to the following rule: We generate a
Random Uniform (-90, +90) distribution of ‘Y’ or ‘Latitude’ Values and pair
them randomly with a Random Uniform (-180, +180) set of ‘Longitude’ Values.
This set of ‘Y’ (aka Latitude) and ‘X’ (aka Longitude) pairs will randomly sample
the surface of the earth (assuming it’s a sphere which we know it is not but who
cares?). Will this approach randomly sample the surface of the earth – assuming it
is a sphere? Explain.
NO. The earth is distorted when represented as a 360 by 180 rectangle. The poles
are ‘over-represented’. Remember a 1° by 1° ‘patch’ of earth is not really even a
square at the equator (although it is pretty close to one) but it gets less and less
square as you approach the poles.
B) Only if you think the approach presented above is NOT random – Propose a
better method of randomly sampling the surface of the earth using simple and
reasonable probability density functions. Explain.
The best way to do this is with a Uniform distribution for Longitude and a Cosine
distribution for Latitude. Cosine goes from 1 to 0 as ‘Y’ goes from 0 to 90°
How to Lie With Statistics Questions
#14) Write a 4 to 7 sentence summary of Chapter 4: Much ado about practically nothing
This chapter is about perhaps significant but trivial results. Is a child with a
measured IQ of 103 really smarter than one with a measured IQ of 99? Given
unceratainty and narrowness in the measuring process – probably not. Often
Medical studies say things like: Eating oatmeal every day increases your lifespan. A
study involving tens of thousands of patients might demonstrate that people who are
forced to each oatmeal every day do indeed live one week longer than those who
don’t. So What? I’ll take the non-oatmeal eating life – any day.
#15) Write a 4 to 7 sentence summary of Chapter 5: The Gee-Whiz Graph
The axes of a graph can be used to legally deceive. Suppose a DU supporter wants to
show how much more popular DU is based on undergraduate applications. They
could present a graph with the y-axis ranging from 3,000 to 3,100 and plot two
successive years number of applications of 3,001 and 3,099. The line would be going
up in a steep curve depending upon how far apart the two years are on the x-axis.
However, this really only represents a 3% increase from one-year to the next and
may in fact be random variation.
#17) Write a 4 to 7 sentence summary of Chapter 6: The one-dimensional picture
Edwin Tuffte gets his panties in a twist about this kind of distortion. The book
shows pictorial examples of deception (Cows, Bags of Money, etc). If you plotted
your income through time it is perfectly reasonable to use a ‘bar’ chart. Scaling
linear dimmensions on 2-D representations of 3-D objects is a cognitive no-no unless
you are intentionally being rhetorically sleazy.
# 18) Suppose that just the other day the Dow Jones Industrial Average dropped by 300
points. Find the last month’s worth of Dow Jones Industrial Averages and plot them.
Now, Prepare two graphs: 1) A graph where you are trying to scare people that you
are showing the graph to that the Dow Jones is Crashing (e.g. the sky is falling); and
2) Where you are trying to assuage the fears of your investors and ‘minimize’ this
300 point drop in the Dow Jones Industrial Average. Explain.
I went to this URL to get some DOW Jones data:
http://money.cnn.com/quote/chart/chart.html?symb=djia&sid=1643&time=1mo&S
ubmit1=Refresh
But I just ginned these two graphs up from the aether to make this point:
12,500
12,000
12,200
0
February 2008
February 2008
Both show the same data (not really but close enough) only y-axiz scaling is
different.