Download AnswersPSno1

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Statistics wikipedia , lookup

Probability wikipedia , lookup

History of statistics wikipedia , lookup

Transcript
Problem Set #1
Geog 2000: Introduction to Geographic Statistics
Instructor: Dr. Paul C. Sutton
Due Tuesday Week #3 in Lab
Instructions:
There are three parts to this problem set: 1) Do it by Hand Exercises (#’s 1 – 7). Do these
with a calculator or by hand. Draw your figures by hand. Write out your answers by
either digitally or by hand. Start your answer to each numbered question on a new page.
Make it easy for me to grade. 2) Computer exercises (#’s 8-13). Use the computer to
generate the graphics and paste them into a digital version of this assignment while
leaving room for you to digitally or hand write your responses if you so wish. 3) ‘How to
Lie with Statistics’ Essay(s) (#’s 14-17). For this section prepare a typed paragraph
answering each question. Do not send me digital versions of your answers. I want paper
copies because that is the easiest way for me to spill coffee on your assignments when I
am grading them. All of the questions on these four problem sets will resemble many of
the questions on the three exams. Consequently it behooves you to truly understand – ON
YOUR OWN – how to do these problems. I am fine with you working with other
students on these problem sets but my exams will punish those who free ride the
comprehension of the material in these problem sets.
Do it By Hand Exercises (i.e. Don’t Use A Computer)
#1) Strange Numbers from What or Why or Where or How?
Consider the following set of numbers that are observations of some random process or
phenomena: (BTW N = 16)
0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 4, 5, 8, 14, 41
A) Calculate and/or identify the mean, median, mode, range, inter-quartile range,
maximum, Shortest half, outliers, and minimum of these numbers.
Mean: 4.75 or 4.0 *
Median: 2
Mode: 0
Range:
Inter-quartile Range: 4.5
Shortest Half: 0-2
Minimum: 0
Outliers: 14 and 41 *
41
Maximum:
41
* NOTE: Quantiles are not obvious to calculate for small samples (see Quantile Calculation paper on
course web site). Nonetheless, for our purposes I’m ok with the 25 th and 75th quantiles of this dataset
to be either .5 and 4.5 or .25 and 4.75. As far as outliers go JMP uses a rule that works for me: upper
quartile + 1.5*(interquartile range) ……lower quartile - 1.5*(interquartile range).
B) Draw a histogram, a box-plot, and a stem-and-leaf plot for these numbers. Explain
how the Stem and Leaf plot does not really improve on the histogram with respect to
characterizing these numbers in this particular example but might in another dataset.
Stem & Leaf on 5’s
0
10
20
30
40
0,0,0,0,1,1,1,2,2,3,4
5,8
14
...
41
C) Write a short 2-4 sentence explanation for each of the following random processes as
to why each of them do or do not represent a reasonable and probable manifestation of
the measurements represented by these numbers. ( I expect answers to vary )
1) The number of sexual partners of a 16 randomly selected 20 year old men in
New York City in 1995.
This strikes me as possible. It is a skewed distribution with many low values
and perhaps a few very high values. It could be argued that 41 sexual
partners at the age of 20 is too high an outlier but it is at least possible and
maybe probable.
2) The number of corporate logos (e.g. the McDonald’s ‘M’, The Nike
‘Swoosh’, British Petroleum’s ‘BP’) recognized from a presentation of 1000
logos to a randomly selected set of 16 sixth grade students from Boston MA.
FOUR kids in Boston that don’t regognize at least one corporate logo? Not a
chance. In fact, most kids know more corporate logos by the age of four than
the number of plants or animals they can identify in their own backyard as
high school students. Possible but not very likely in the least.
3) The length in inches of Trout caught in a Fishing Derby in Lake Tahoe CA.
Zero inch trout? No way. Who would even report catching a zero inch trout?
Impossible and improbable.
4) The actual Fertility rate (the number of live births) of 16 randomly selected
American Women of the age of 50 in 2007 (Actual TFR then was 2.05).
While the numbers may be possible the probability is almost zero that in a
sample of sixteen women you would find one that had given birth to 41
children. Also, the average is way above the real average for this phenomena.
Almost impossible and certainly improbable.
5) The net worth (in millions of dollars) of sixteen people who paid $50 to have
their photograph taken at the top of a ski lift in Aspen Colorado in 2001.
Aspen has a large population of very rich people floating through. And,
millionaires are the new middle class in America. This strikes me as both
possible and perhaps probable.
#2) Fuzzy Dice from Hell and their Probability Density Functions
Assume you have two tetrahedrally shaped dice (i.e. four sided) on which are printed a 1,
2, 3, and 4 on each side. Also assume that the dice are ‘fair’ (i.e. equal probability of 1-4
showing up [in this case face down] when they are rolled).
A)
1) What is the average result you would expect from the ‘sum’ value of the rolling
of these two dice?
5
2) What is the range of possible outcomes from the sum of the rolling of these two
dice?
2 to 8 (range = 6)
3) What is a more likely outcome of the sum of a single roll of these two dice:
Sum of Dice = 2 or Sum of Dice = 5? Explain.
Sum of Dice = 5 greater probability than sum of dice = 2
Probability of sum = 2 is ¼ * ¼ = 1/16
(of 16 possible outcomes only 1 chance for sum of dice to = 2)
Several ways for sum of dice to be 5
(a 3 and a 2, a 2 and a 3, a 4 and a 1 or a 1 and a 4 --- 4/16 probability)
B) Draw a probability distribution function (actually since this is a discrete distribution it
is more accurately referred to as a probability mass function – which should look a lot
like a histogram or bar chart), that characterizes the probability of each of the possible
sums of the values of the face down value of these two dice.
.25
.1875
1/8
.125 1/16
.0625
0.0
2
3
3/16
1/4
3/16
1/8
1/16
4
5
6
7
8
C) Suppose your roommate suggests to you that she will engage you in a game of chance
with these two die. If she rolls a score of three or less (where the ‘score’ is the sum of the
two dice) you pay her 10 dollars. If she rolls any score greater than three she pays you
three dollars. Your roommate will roll the die 100 times and payments will be made for
each roll of the dice. 1) Are you wise to play this game of chance? 2) If you did choose
to play this game of chance how much would you expect to win or lose assuming this
‘game of chance’ worked out perfectly randomly (which these things rarely do)?
Expected Loss on any roll: 3/16 * 10 - 13/16 * 3 = - .5625 dollars
On average, after 100 rolls you will have won $ 56.25
D) Suppose you try to cheat on your roommate by sneaking in a ‘bogus’ die with ‘twos’
on every side. Did you improve your odds of winning? Explain.
Not a smart move. You actually reduce your odds from winning to losing.
Expected win is now: ¾ * 3 – ¼ * 10 = 2.25 – 2.50 = -0.25
After 100 rolls, on average, you will be down $25.
E) What is the average value of the roll of one of these four sided die? How might you do
financially if you suggested to your roommate that she roll a SINGLE four sided die and
you pay her $100 dollars every time she rolled the average value and you collected a
mere twenty five cents every time she did not? Explain.
The average roll of a single four-sided die with 1 through 4 marked on each side is
2.5 In this case the average can NEVER be rolled. Good bet. You’ll win EVERY
time. But Your roommate just might figure this out after a while .
#3) The odds and oddity of heirs
Assume that the probability of having a baby boy or a baby girl are equal and 50%.
Assume a woman has two children. What are her chances of having at least one female
child? Explain.
The four possible outcomes are: Boy & Boy, Boy & Girl, Girl & Boy, Girl & Girl
In 3 of the 4 cases there is at least one girl. Thus, if a woman has two children, ~75%
of the time she will have at least one girl.
#4) Sometimes it is easier to think about what couldn’t happen.
Assume you have been randomly selected from thousands of college age students to
participate in an ‘American Idol’ competition. Also assume that there is an equal
probability of all of these college age students having a birthday on any day of the year
(No Leap Years – we’re not going there – e.g. 1 in 365 days of the year).
A) How many students would have to be selected for which the probability of you
having the same birthday as one of the other contestants would be a 50% chance?
(Note: You can use a calculator or spreadsheet for this question – it does involve
significant sequential arithmetic).
Wikipedia Provided a very clear answer to this question
To compute the approximate probability that in a room of n people, at least two have the
same birthday, we disregard variations in the distribution, such as leap years, twins,
seasonal or weekday variations, and assume that the 365 possible birthdays are equally
likely. Real-life birthday distributions are not uniform since not all dates are equally
likely.[3]
It is easier to first calculate the probability p(n) that all n birthdays are different. If n >
365, by the pigeonhole principle this probability is 0. On the other hand, if n ≤ 365, it is
given by
because the second person cannot have the same birthday as the first (364/365), the third
cannot have the same birthday as the first two (363/365), etc.
The event of at least two of the n persons having the same birthday is complementary to
all n birthdays being different. Therefore, its probability p(n) is
This probability surpasses 1/2 for n = 23 (with value about 50.7%).
B) Assume the song you have decided to perform for this ‘American Idol’
competition is “Blue Valentines” by Tom Waits – What are the odds that out of
1,000 other contestants one or more of them would choose the same song?
Answers will vary. I say the odds are zero because Tom Waits is a gravelly
voiced singer who sings depressing songs that would be an unlikely choice of
candidates trying to win an American Idol competition.
C) Characterize the probabilistic approach you took to parts ‘A’ and ‘B’ of this
Question as classical, relative frequency, monte carlo, or subjective. Explain.
The first calculation is a classical probability approach whereas the second is a
subjective probability approach.
#5) Conditional Probability can be REALLY tough to recognize: “Let’s Make a Deal”.
Monte Hall (and I think now Drew Carey) hosted (and Drew Carey now hosts) a famous
game show called “Let’s Make a Deal”. A classic statistical problem is associated with a
challenge in this show involving “The infamous Door #3” or whatever door comes into
play. Assume you are a contestant on this show. Assume the producers of the show etc.
are not messing around with you. Here is your situation: Monte Hall (now Drew Carey)
presents you with three doors that obscure potential prizes. The prizes are: $1,000,000; an
old Goat, and a young Sheep. Assume, (which I think is a safe bet), you want the million
bucks. Monte Hall tells you to pick a door: “ Door #1?, Door #2? Or,…Door #3”. You
pick ‘Door #2’. Now… (Monte Hall WILL do This no matter what door you initially
choose) Monte Hall says – “Are You Sure you want to Stay with Door #2?” – And then he
reveals to you the Goat or Sheep behind Door #1 or Door #3. And Monte Hall now gives
you this option: “Do you want to stay with Door #2 OR Do you want to switch to the
door I have not revealed to you” . It’s up to you. Should you switch or should you stick
with your first choice? Remember this: Monte Hall WILL challenge you with this option
of switching regardless of what your initial choice was. Is he doing you a favor by giving
you this choice from a statistical perspective or is this merely a 50 – 50 choice on your
part? What should you do: Switch or Stick or does it not matter? Explain using what
you have learned about probability.
You should defnintely switch given the opportunity. On your first selection you have
a 1/3 chance of choosing the money and a 2/3 chance of choosing a hooved animal. If
you chose the money you would HAVE to switch to a goat or sheep (this will happen
1/3 of the time). If you chose a hooved animal you will ALWAYS switch to the
money because the other hooved animal has been revealed to you (This will happen
2/3 of the time). Consequently it is NOT a 50/50 proposition. It is a 2/3 – 1/3
proposition in your favor.
#6) Probabilistic Approaches
Define and provide and example of each of the following approaches to Probability:
A) Classical
Population parameters of Random Variables are known and the approach is
derived from strict mathematical rules. Casino games such as roulette, craps,
poker, black jack, etc are all classical probability problems.
B) Relative Frequency
A large body of empirical evidence provides the groundwork for relative
frequency approaches. Insurance companies do this all the time. Male
drivers between the ages of 20 and 25 have such and such a probability of
getting in accidents that cost such and such amount. Insurance rates are
based on relative frequencies. These are often adjusted based on conditional
probabilities such as # of DUIs, # of Tickets, Levels of Education, Distance
traveled to work etc.
C) Subjective
Subjective probability is when someone makes decisions based on experience,
folklore, hunches, judgement, etc. A good example is listening to sports guys
on TV pick football games. They are not random. Las Vegas also puts ‘odds’
on sporting events. These ‘odds’ are basically the subjective assessment of
experts. They are often pretty damn good.
D) Monte Carlo
The Monte Carlo approach is similar to relative frequency approach but
often takes place when one does not have a large history of empirical results
to draw from. In this case, theoretical models are developed using stochastic
(random variables) and ‘empirically modeled’ relative frequencies are
generated. For example, the “Let’s Make A Deal” question (#5) can be
approached classically or you could simply play it 100 times (e.g. Monte
Carlo it) and get an empirical relative frequency distribution. An example
provided by Pete Rogerson (Geography Prof At SUNY Buffalo) is two
theoretical hockey teams. In this example lets use to stochastic parameters
per team: 1) # of shots on goal each team makes in a game; and 2) The % of
saves each goalie makes for each shot on goal. If you model these parameters
as random instantiations either team can win even if the team with the better
goalie also averages more shots on goal. After many ‘runs’ of the model
though – the ‘better team’ wins more games.
#7) Will the Real Independent Random Variable Please step forward?
What does it mean for observations of a random variable to be ‘Independent’?
Provide an example of independent observations and an example in which
Observations are not independent. Why is geography (and especially spatial
Analyses) often criticized for using regular statistics on non-independent
Measurements?
Independence of measurement means that making one measurement will tell you
nothing about any of the other measurements you make. This is essentially why we
want truly random sampling when we are doing statistics (if possible). When two
events are statistically independent, it means that knowing whether one of them
occurs makes it neither more probable nor less probable that the other occurs. In
other words, the occurrence of one event occurs does not affect the outcome of the
occurrence of the other event. Similarly, when we assert that two random variables
are independent, we intuitively mean that knowing something about the value of one
of them does not yield any information about the value of the other.
Example of Independent Observation: The roll of a single die should not effect the
roll of another single die.
Example of Non-Independent Observation: The Let’s Make a Deal question (#5)
demonstrates a kind of non-independence. Your odds of ‘winning’ change
depending upon whether your first choice was the money or a hooved animal.
Non-Independence in Space: Space makes almost everything dependent. The value
of a house is almost always related to the value of the houses near it. Even if you
randomly select within a neighborhood you have almost necessarily biased your
sample of housing values. The temperature at a given location is more like the
temperature around it than the temperature 1500 miles away in any direction.
This idea of spatial dependence is one of the foundational reasons geography is a
discipline. Nonetheless, it is also a reason that we are often ‘cheating’ (to a greater
or lesser extent) whenever we use traditional statistics.
Computer Problems (use JMP and/or Excel for these exercises)
The Bell Curve (aka The Normal Distribution) is but one of many, many different kinds
of random number distributions. The JMP software package allows you to generate sets
of ‘observations’ from a large number of different kinds of random number distributions.
Some distributions have ‘parameters’. The meaning and interpretation of these
parameters is not always the same. For example, the Normal distribution has two
parameters: The mean (usually symbolized as: µ), and the standard deviation (usually
symbolized as: σ). The mean is a measure of the central tendency of the normal
distribution and the standard deviation is a measure of the spread, or, variability of the
numbers around the mean. The Normal distribution is fairly good at characterizing many
random variables. For example, consider the distribution of the height of 25 year old men
in the state of Colorado. Let’s assume that is can be characterized by a Normal
distribution with a mean of 70 inches (5’ 10”) and a standard deviation of 3 inches. We
can ask JMP to artificially sample from a N(70,3) (read as Normal distribution with mean
of 70 and standard deviation 3) distribution. In JMP open a new table, add 2000 rows
(we’ll take a sample of 2000) and double click on the column and click on Edit Formula
(we’ll do this in lab too) and choose ‘Random’ then choose ‘Normal’. Add the 70 and the
3 to the box, check apply, and bam! – You have 2000 numbers in your table drawn from
a Random N(70, 3) distribution. Use the Analysis tool to make a histogram and get
descriptive statistics.
Figure 1: Histogram of 2000 randomly generated points from a N(70, 3) distribution
So here is a histogram of the randomly generated numbers from a N(70, 3) distribution.
Use the ‘?’ tool (which is the help system of JMP) to find out what all these diagrams,
and numbers are. There is a box and whiskers chart, a shortest half (the red figure), and a
whole bunch more. Make sure you know how to use this ‘?’ tool. It is very cool. In fact
make sure you can find the figure pasted onto the next page in the JMP help system. This
is a very good way to learn both how to use the software, and how to interpret the
statistical output of the software.
Figure 2: JMP Help system explanation of figures in the Histogram function
Find the above figure in the JMP help system. Now, we’ll generate another 2000
observations from a Normal distribution with different parameters. Assume a coke can is
7 inches tall. The variability of the size of a coke can will be much smaller (relatively and
actually) than the variability of the heights of 25 year old men living in Colorado. So do
the same thing again by creating a new column, editing its formula etc. but this time
generate your numbers from a N(7, 0.03) distribution (e.g. very little spread about the
mean). Plot the histogram. Compare the two datasets – Both of them are from a Normal
Distribution but have different ‘parameters’ for the Mean and Standard Deviation.
Figure 3: Histogram of 2000 randomly generated points from a N(7, 0.03) distribution
Note the numerical scale across the bottom. All the points fall within 0.06 inches around
7 whereas for the first generation from N(70, 3) all the points were within 20 inches
around 70. Both are Normal distributions but clearly the heights of coke cans are much
less variable than the heights of 25 year old men living in Colorado. Your computer
exercises for this problem set are to prepare a one page summary for each of the
following distributions: Uniform, Binomial, Exponential, Poisson, and Chi-Square. (Note
the Chi-Square is not in the ‘Random’ part of the formula editor but in the ‘Statistical’
part). Google each distribution to find out a little bit about it: How many parameters does
it have? How are the parameters interpreted? Is this a discrete or a continuous
distribution. What kinds of phenomena are best represented by this kind of distribution?
Generate two histograms for each distribution using different parameters. Tell me a one
page story about each distribution with these figures and your interpretation based on
your research. Definitely answer the italicized questions above in this story.
This is the essence of the assignment:
#8) Generate two Normal distributions identical to those described above and
Create a single column with all of the observations (e.g. 2000 from N(70,3) and
2000 from a N(7, 0.03). Plot them all on one histogram. What do you see?
All the variability of the N(7, .03) variable (~ from 6.9
to 7.1) fits into a single bin of the histogram; whereas,
the variability of the N(70, 3) random variable ranges
from roughly 60 to 80. Both are symmetric normal
distributions but they have different means and
standard deviations (aka central tendencies and
spreads).
10
20
30
40
50
60
70
80
#9) (Less than) One page Summary of the Uniform Distribution
Uniform (100)
The Uniform distribution is essentially equal
probability across the range for which it exists.
In fact, the only ‘parameter’ of the uniform
distribution is this ‘range’. The Uniform
distribution is essentially what is used for
random number generation applications such as
selecting lottery numbers. There are discrete and
continuous versions of the uniform.The pdf of
the continuous uniform distribution is:
0 10 20 30 40 50 60 70 80 90 100
#10) One Page Summary of the Binomial Distribution
Distributions
Bin(100, .5)
Quantiles
40
50
60
70
100.0% maximum
99.5%
97.5%
90.0%
75.0%
quartile
50.0%
median
25.0%
quartile
10.0%
2.5%
0.5%
0.0%
minimum
Moments
70.000
62.000
60.000
56.000
53.000
50.000
46.000
43.000
40.000
36.005
33.000
Mean
Std Dev
Std Err Mean
upper 95% Mean
low er 95% Mean
N
49.791
5.0389497
0.1126743
50.011971
49.570029
2000
If you play with the ‘hand tool’ in JMP you can make the histogram as fine ‘binned’
as you like. In the case above I made it so fine ‘binned’ that you can see the discrete
nature of the binomial distribution. ‘Scores’ of 50 and 49 and 19 can occur but no
non-integer values can. This Bin(100, .5) example would be analogous to flipping a
coin 100 times and counting the number of heads – 2000 times. Theoretically a 0 or
a 100 could occur; however, note that in 2000 trials of 100 coin flips they do not
occur. In fact, all the values fall between 30 and 70. The parameters of a Binomial
are ‘n’ (analogous to the # of coin flips – or Bernoulli trials, and ‘p’ the probability
of getting a ‘heads’ or ‘success’ in Bernoulli trial jargon. The formulas for the pmf
and some of the parameters for the Binomial are pasted in from Wikipedia below:
Parameters
number of trials (integer)
success probability
(real)
Support
Probability
mass function
(pmf)
Cumulative
distribution
function (cdf)
Mean
Median
Mode
Variance
one of
#11) One Page Summary of the Exponential Distribution
Distributions
Exp(10)
Quantiles
0
1
2
3
4
5
6
7
8
9 10
100.0% maximum 10.331
99.5%
5.063
97.5%
3.640
90.0%
2.256
75.0%
quartile
1.335
50.0%
median
0.690
25.0%
quartile
0.282
10.0%
0.100
2.5%
0.025
0.5%
0.00633
0.0%
minimum 0.00025
Moments
Mean
Std Dev
Std Err Mean
upper 95% Mean
low er 95% Mean
N
0.9796749
1.0018597
0.0224023
1.0236092
0.9357407
2000
The exponential is a skewed continuous distribution that has many applications in
the natural and social sciences. It is often considered a continuous analog of the
negative binomial distribution. The exponential distribution has only one
parameter: λ (e.g. Exp(λ)). The mean of the exponential is 1/ λ; while the variance is
given by 1/ λ2 . Processes that are often modeled by the exponential include the
decay of radioactive material and the length of time one might expect to wait for a
bus at a bus stop. The Wikipedia characterization of pdf and cumulative
distribution function are pasted in below:
Probability density function
The probability density function (pdf) of an exponential distribution has the form
where λ > 0 is a parameter of the distribution, often called the rate parameter. The
distribution is supported on the interval [0,∞). If a random variable X has this
distribution, we write X ~ Exponential(λ).
[edit] Cumulative distribution function
The cumulative distribution function is given by
#12) One Page Summary of the Poisson Distribution
Distributions
Poisson(10)
Quantiles
0
10
20
100.0% maximum
99.5%
97.5%
90.0%
75.0%
quartile
50.0%
median
25.0%
quartile
10.0%
2.5%
0.5%
0.0%
minimum
Moments
20.000
19.000
17.000
14.000
12.000
10.000
8.000
6.000
5.000
3.000
1.000
Mean
Std Dev
Std Err Mean
upper 95% Mean
low er 95% Mean
N
10.112
3.1039126
0.0694056
10.248115
9.9758851
2000
The Poisson distribution is a discrete distribution often used in spatial applications.
In fact, is an important distribution with respect to deriving complete spatial
randomness (CSR). Imagine a field study of say.. meteorite impacts. If you divide
this area up into equal sized sub-areas (often quadrats) you would expect the ‘count’
of meteorite impacts in these quadrats to be distributed as a poisson variable with
λ = # of meteorite impacts / # of quadrats (if they are spatially random). In fact, for
the Poisson; it’s only parameter: λ, is also the expected value of both it’s mean and
variance (e.g. for X~P(λ); E[X] = λ and E[ σ2 x ] = λ. Again Wikipedia provides some
nice background info on the Poisson distribution pasted in below:
In probability theory and statistics, the Poisson distribution is a discrete probability distribution that
expresses the probability of a number of events occurring in a fixed period of time if these events occur
with a known average rate and independently of the time since the last event. The Poisson distribution can
also be used for the number of events in other specified intervals such as distance, area or volume.The
distribution was discovered by Siméon-Denis Poisson (1781–1840) and published, together with his
probability theory, in 1838 in his work Recherches sur la probabilité des jugements en matières criminelles
et matière civile ("Research on the Probability of Judgments in Criminal and Civil Matters"). The work
focused on certain random variables N that count, among other things, a number of discrete occurrences
(sometimes called "arrivals") that take place during a time-interval of given length. If the expected number
of occurrences in this interval is λ, then the probability that there are exactly k occurrences (k being a nonnegative integer, k = 0, 1, 2, ...) is equal to
where
e is the base of the natural logarithm (e = 2.71828...)
k is the number of occurrences of an event - the probability of which is given by the function
k! is the factorial of k
λ is a positive real number, equal to the expected number of occurrences that occur during the given
interval. For instance, if the events occur on average every 4 minutes, and you are interested in the number
of events occurring in a 10 minute interval, you would use as model a Poisson distribution with
λ = 10/4 = 2.5.
As a function of k, this is the probability mass function. The Poisson distribution can be derived as a
limiting case of the binomial distribution.
#13) One Page Summary of the Chi-Square Distribution
For clues on how to do this see: http://web.utk.edu/~leon/stat571/jmp/chi-square/default.html
Distributions
Chi-Square (5)
Quantiles
0
10
20
100.0% maximum
99.5%
97.5%
90.0%
75.0%
quartile
50.0%
median
25.0%
quartile
10.0%
2.5%
0.5%
0.0%
minimum
Moments
22.252
16.547
13.207
9.741
6.861
4.435
2.726
1.674
0.807
0.441
0.194
Mean
Std Dev
Std Err Mean
upper 95% Mean
low er 95% Mean
N
5.1645598
3.2840109
0.0734327
5.3085724
5.0205471
2000
The Chi-Square distribution (aka χ2 ) is a continuous distribution with only one
parameter often denoted ‘k’ and called ‘degrees of freedom’. The Chi-square
distribution is the distribution that results from the summing of ‘k’ standard
normal variables squared. Consequently it is always positive and it is skewed to the
right. It is used in the oft used Chi-Square ‘Goodness of Fit’ test which has many
applications including the testing of
the independence of nominal
variables with respect to frequency
distributions (for example are ‘eye
color’ and ‘handedness’ of human
beings correlated?), and several
geographic applications. The shape
of the Chi-Square changes
significantly as it’s degree of
freedom ‘k’ parameter increases
from 1-5. For ‘k’’s beyond 5 it
becomes an increasingly flat yet
slightly skewed curve. The most
common use of the chi-square is the
‘goodness of fit’ test in which the statistics is:
(Observed – Expected)2
χ
2
= -----------------------------(Expected)2
Where the ‘Observed’ and ‘Expected’
numbers are obtained from a‘Contingency
Table’. (more on this later in the course)
Again, hat’s off to Wikipedia for some additional info on this distribution (below):
In probability theory and statistics, the chi-square distribution (also chi-squared or χ2
distribution) is one of the most widely used theoretical probability distributions in inferential
statistics,
i.e. in statistical significance tests. It is useful because, under reasonable
assumptions, easily calculated quantities can be proven to have distributions that
approximate to the chi-square distribution if the null hypothesis is true.If Xi are k independent,
normally distributed random variables with mean 0 and variance 1, then the random variable
is distributed according to the chi-square distribution. This is usually written
The chi-square distribution has one parameter: k - a positive integer that specifies the number of degrees of
freedom (i.e. the number of Xi). The chi-square distribution is a special case of the gamma distribution.The
best-known situations in which the chi-square distribution are used are the common chi-square tests for
goodness of fit of an observed distribution to a theoretical one, and of the independence of two criteria of
classification of qualitative data. However, many other statistical tests lead to a use of this distribution. One
example is Friedman's analysis of variance by ranks.
How to Lie With Statistics Question
#14) Write a 4 to 7 sentence summary of Chapter 1: The Sample with built in Bias
ANSWERS WILL VARY: The average income of Yale graduates was used as an
example. They asked alumni who came to a re-union and believed what they said. DU
can ‘bias’ their ‘Average SAT’ score of applicant by not requiring applicants to submit
SAT scores. This biases the average up because applicants with good scores will want
to submit whereas applicants with bad scores are more likely to not sbumt their SAT
scores
#15) Write a 4 to 7 sentence summary of Chapter 2: The well chosen average
ANSWERS WILL VARY: The skewed distribution of housing prices is used here as an
example. There are many measures of ‘central tendency’ which can sometimes
scurrilously be replaced with the technical and precise term of ‘average’. The mean
(aka ‘average’, median (aka middle value), and mode (most common value) are all
measures of central tendency. All of them could be used to characterize a distribution.
The true ‘average’ price of a house in a given housing market is usually higher than
what most people pay for a house because housing prices tend to be skewed kind of like
a chi-squared distribution – i.e. a long tail to the right. So, if you are trying to impress
people about what a nice neighborhood you live in is you might give the average price
of a home in it; however, if you are trying to sell someone on the affordability of the
neighborhood you might use the median price of a home.
#16) Write a 4 to 7 sentence summary of Chapter 3: The little figures that are not there
ANSWERS WILL VARY: Some results are prominently presented even though they
may be based on small sample sizes or even biased samples. The old DENTYNE
commercial comes to mind: “Four out of Five Dentists, WHO CHEW GUM, choose
Dentyne. Come on, how many Dentists chew gum? (I don’t really know probably 99% )
#17) An apochryphal story about the department of Geography at the University of North
Carolina (UNC) is that their undergraduate geography majors earn more money than the
geography graduates of any other university in the country. It just so happens that
Michael Jordan (an extremely well paid basketball player) got his degree from UNC in
geography in 1986. Would Darrell Huff describe this as ‘Lying With Statistics’? If so,
what category of your above essays would this ‘lie’ fall under. Explain.
ANSWERS WILL VARY: Darrell Huff would describe this as lying with statistics and
he would probably categorize it as either a sample with built in bias or as a well-chosen
average. Hopefully you get the point in either case.