Probability Concepts 91585
A thorough understanding of the differences between true probability, theoretical probability and experimental
estimates.
• True probability
The true probability that an event will occur. It is often impossible to calculate, but it can be estimated through experimental probability and theoretical probability.
E.g. we assume flipping a coin gives a 50:50 chance of landing on heads or tails, but as the coin isn’t perfectly symmetrical, and there is a chance it can land on its edge, this is not exactly the case. However, it will be very close!
• Theoretical probability
This is an estimate of the true probability that we can calculate by modelling the situation.
We know that an approximately symmetrical coin will give an approximately 50:50 chance of getting a heads with each toss. Although this isn’t exactly equal to the true probability, it is a very good model and we can assume it to be true.
Theoretical probability is an “educated guess”.
We can often use trees and other diagrams to calculate the overall theoretical probability if we know the probability of the individual steps.
• Experimental probability
Experimental probability is an estimate of the true probability through trial and simulation. If we flip a coin 100 times and get approximately 50 heads, we can estimate that the true probability of getting a heads is 50:50.
This is often a crude way to estimate the true probability, as many trials are required to get an accurate estimate; however, if the model is extremely complicated it can be the only way.
Experimental probability calculation relies on independent trials to give meaningful estimates.
There is no such thing as theoretical probability! It is just an experimental estimate.
Understanding true probability, model estimates and experimental estimates
True probability is the (almost always) unknown actual probability that an event will occur in a given situation. The
actual probability of a coin landing heads up is affected by the position from which it is tossed, the asymmetry of the
two faces of the coin etc, so is not exactly 0.5, though the probability of a fair coin landing heads will be very close to
0.5. We can find out about the unknown true probability by observation (experiment) or by trying to understand the
situation and modelling it.
In probability an experiment is one or more trials of a probability situation. An experimental estimate is
calculated from observation as the number of successful trials divided by the total number of trials. In the long run
(over many trials), the experimental estimate may approach the true probability.
An experimental estimate that a coin will land heads if it is tossed 20 times and lands heads up 14 times is 14/20 = 0.7.
A probability model is a representation of a situation involving probability. Probability models can incorporate
experimental estimates and assumptions about the situation (e.g., independence). These assumptions may be based
on an idealised view of the world and an understanding of the mathematics of probability.
A model estimate is an estimate of the probability that an event will occur, based on a probability model. The model
estimate of a fair coin landing heads is 0.5.
If a model is a good representation of the situation, the experimental estimate over many trials will be close to the
model estimate.
A model must always be considered in context. A good model is one which is fit for the purpose for which it is being
used. When tossing an approximately fair coin, the model estimate of P(heads) = 0.5 is a good model for most
purposes. A transport system modelling the timing of traffic lights to get a smooth flow of traffic will require a more
complex model, tested against experimental observations to ensure that it is fit for the purpose.
In some situations there is no obvious theoretical model, so we can only estimate the probabilities and probability
distributions via experiment. These estimates can be used as a basis for building a theoretical model. For instance, to
develop a model of the probability of getting a basketball through the hoop, an initial model might assume a constant
probability of 0.5. As data is gathered, there could be successive refinements of the model so that it becomes a better
estimate of the true probability. The data might indicate that the probability of getting the ball in the hoop is closer to
0.2 and that it changes over time.
Sometimes we might think that an obvious theoretical model applies, but experimental estimates demonstrate that our
model is a poor one. There is now a need to find a better model using the estimates from the experiments. We might
initially model the result of spinning a weighted coin as P(heads) = 0.5 but realise that that estimate is a poor one and
use data to improve it.
Deterministic and probabilistic models
A deterministic model does not include elements of randomness. Every time you run the model with the
same initial conditions you will get the same results.
A probabilistic model does include elements of randomness. Every time you run the model, you are likely to
get different results, even with the same initial conditions.
Randomness
What is true randomness? What makes something random? Often we misinterpret what proper randomness looks like.
Humans are keyed up to see patterns in everything. We will find patterns where there are none. Often a uniform spread appears ‘random’ to us because it is difficult to identify a pattern in it, whereas true randomness looks ‘lumpy’.
Which of these images shows a random spread of dots?
Randomness is often not intuitive.
The easiest way to increase your understanding of randomness is through questions and practical examples.
Imagine flipping a coin 30 times. Try to make up a ‘random’ sequence of coin tosses in your head and write down the results (write down 30 flips as if you had actually flipped the coin).
Then actually flip a coin 30 times, recording your results, and compare the 2 sets of data. What differences do you notice?
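If you do not have a coin handy, the ‘real’ half of this exercise can be simulated. The sketch below is only an illustration: the helper name longest_run and the fixed seed are choices made here, not part of the original notes. It prints 30 simulated flips and the longest streak of identical results, which is usually longer than people expect when they invent a sequence by hand.

```python
import random

def longest_run(flips):
    """Length of the longest streak of identical consecutive results."""
    best = current = 1
    for previous, this in zip(flips, flips[1:]):
        current = current + 1 if this == previous else 1
        best = max(best, current)
    return best

# A fixed seed (an arbitrary choice) makes the run repeatable; remove it for fresh flips.
rng = random.Random(1)
flips = [rng.choice("HT") for _ in range(30)]

print("".join(flips))
print("heads:", flips.count("H"), "longest run:", longest_run(flips))
```

Compare the longest run here with the longest run in the sequence you made up.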
Independence
Two events, A and B, are independent if the fact that A occurs does not affect the probability of B occurring.
Think about rolling a die. The probability of landing on each face is equal.
Therefore: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1 (when we roll the die, it will land on one of the faces)
Now suppose we roll two dice. Does the roll of the first die affect the second?
If we have rolled a 6 on the first die, what is the probability of getting a 6 on the second die?
The same as always! P(6) = 1/6 no matter how many dice have been rolled previously or whatever their rolls were!
Try to think of some examples of independent events. Eg the results of flipping the same coin twice.
Events are independent if: P(A and B) = P(A) x P(B)
Dependence
Events are dependent if the outcome of one has an effect on the outcome of the other.
Eg: think about the probability of being blond and the probability of having blue eyes.
Are blond people more likely to have blue eyes?
Most real life events are dependent on some level (even if the dependence is very small)
Events are dependent if: P(A and B) ≠ P(A) x P(B)
1) A dresser drawer contains one pair of socks in each of the following colors: blue, brown, red, white and black. Each pair is folded together in a matching set. You reach into the sock drawer and choose a pair of socks without looking. You replace this pair and then choose another pair of socks. What is the probability that you will choose the red pair of socks both times?
Are both draws independent?
2) A survey found that 72% of people in a school of 300 like pizza. If 3 people are selected at random, what is the probability that all three like pizza?
Is the probability of the second and third person liking pizza independent of the first person liking pizza?
Why? Why not?
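One way to check your reasoning on these two questions is to compute both versions of the pizza answer. This is a sketch under stated assumptions: it assumes exactly 72% of the 300 students (216 people) like pizza and that the three people selected are different people. The variable names are made up for illustration.

```python
from fractions import Fraction

# Question 1: the pair is replaced, so both draws have P(red) = 1/5.
p_red_twice = Fraction(1, 5) * Fraction(1, 5)

# Question 2 (assumption: exactly 216 of the 300 students like pizza).
likers, school = 216, 300

# Treating the three selections as independent, P(A and B and C) = P(A) x P(B) x P(C).
p_independent = Fraction(likers, school) ** 3

# Selecting three different people: each selection changes who is left.
p_dependent = (Fraction(likers, school)
               * Fraction(likers - 1, school - 1)
               * Fraction(likers - 2, school - 2))

print("red pair twice:", p_red_twice)
print("pizza, treated as independent:", float(p_independent))
print("pizza, without replacement:", float(p_dependent))
```

The two pizza answers differ only slightly, which is why the independent model is often accepted as a good approximation when the group is large compared with the number selected.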
Mutually exclusive events
Events are mutually exclusive if both cannot occur at the same time. The most obvious example of mutually exclusive events is a pair of complementary events: A cannot occur at the same time as A’ (A not occurring).
(Think about tossing a coin. We cannot get a heads and a tails on the same toss.)
However, mutually exclusive events do not have to be complementary.
If events A & B are mutually exclusive: P(A&B) = 0
and so P(A ∪ B) = P(A) + P(B)
Conditional probability
Conditional probability is the probability of an event occurring given that another event occurs.
It is written P(A | B), read “P(A given B)” (these notes also write it as P(A/B)).
In essence it means the “probability of event A occurring if event B occurs”. Think back to our mutually exclusive events.
Suppose A & B are mutually exclusive. Then P(A | B) = 0 because if B occurs, A cannot occur.
It is calculated through the formula:
$P(A \mid B) = \dfrac{P(A \,\&\, B)}{P(B)}$
Probability distributions and graphs
In your exams you’ll often be asked to estimate expected values from looking at data distributions and graphs.
This will often involve comparisons, e.g. which distribution has the greater variance, and which has the highest expected value.
These skills are most easily learnt through practice at looking at graphs.
Contingency tables
Contingency tables:
• Can be used to display probabilities or frequencies of events with 2 or more variables
• Can help in conversion of frequencies to probabilities
• Can help in determining independence
We use a table with event or condition A on one axis and event or condition B on the other axis. The table allows easy comparison between probabilities of both events. It also allows us to easily calculate the probability of both events or either of the events happening. Contingency tables are preferable to Venn diagrams, but use whatever you find most comfortable.
                     | A occurs                      | A’ (not occurring)
B occurs             | B & A occur (B ∩ A), (B ∪ A)  | B occurs, A doesn’t (B ∪ A)
B’ (not occurring)   | A occurs, B doesn’t (B ∪ A)   | Neither occurs
Here we can see how the table works. Events A and B can be anything. We can also see that it is easy to identify which boxes correspond to A ∩ B (both occur, also written A&B) and A ∪ B (at least one occurs).
Contingency tables are preferred to Venn diagrams due to the ease of calculating using tables.
Try filling this table out.
Contingency tables can easily be converted from numbers to fractions by dividing through by the total value.
We can also calculate probabilities in other terms.
• For example: what fraction of the drinkers drink coffee?
• We take the number of coffee drinkers and divide this by the total number of drinkers (200 − people who drink nothing)
This is called conditional probability.
It could also be written as “what is the probability someone drinks coffee given they drink something?”
Next we look at P(C | T), where C is a coffee drinker and T is a tea drinker.
How would we calculate the probability someone drinks coffee given they drink tea? This is similar to calculating the probability of one of the squares, as in the previous example, but this time our ‘total’ is different. Why?
Think about the phrasing of the question: out of all the tea drinkers, what is the probability one drinks coffee? So we would divide square A by square H.

                  | Tea drinker | Tea drinker’ | Total
Coffee drinker    | A: 122      | C: 132       | E: 254
Coffee drinker’   | B: 321      | D: 98        | F: 419
Total             | H: 453      | I: 230       | J: 683

So what is P(C | T)?
= P(T&C)/P(T)
= 122/453
= 0.27
What is P(T’ ∪ C)?
What is P(T’ ∪ T)?
What is P(C’ | T’)?
$P(A \mid B) = \dfrac{P(A \,\&\, B)}{P(B)}$
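The same steps (convert counts to probabilities, then divide the joint probability by the probability of the condition) can be sketched in a few lines. The counts below are hypothetical, invented purely for illustration, and the function names are likewise not from the notes.

```python
from fractions import Fraction

# Hypothetical counts (NOT the table in these notes), keyed by (row event, column event).
counts = {
    ("C", "T"): 30, ("C", "T'"): 20,
    ("C'", "T"): 10, ("C'", "T'"): 40,
}
total = sum(counts.values())

def p_joint(row, col):
    """P(row & col): one cell divided by the grand total."""
    return Fraction(counts[(row, col)], total)

def p_row(row):
    """P(row): the row total divided by the grand total."""
    return sum(p_joint(row, c) for c in ("T", "T'"))

def p_col(col):
    """P(col): the column total divided by the grand total."""
    return sum(p_joint(r, col) for r in ("C", "C'"))

def p_row_given_col(row, col):
    """P(row | col) = P(row & col) / P(col)."""
    return p_joint(row, col) / p_col(col)

print("P(C | T) =", p_row_given_col("C", "T"))

# Independence check: is P(C & T) equal to P(C) x P(T)?
independent = p_joint("C", "T") == p_row("C") * p_col("T")
print("C and T independent?", independent)
```

Notice that the ‘total’ used for the conditional probability is the column total for T, not the grand total.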
Probability trees
Probability trees are another useful tool for determining probabilities. They break a situation into a series of probability steps.
Probability trees can easily get very large and cumbersome compared to contingency tables, but are especially useful in visualising conditional probability.
Each step represents a probability step or ‘event’. Each branch is an event occurring or its complement. The next step is the next event occurring. Trees can have any number of events (and each step can have any number of branches).
Event A    Event B    Probabilistic outcome
A          B          (A&B)
A          B’         (A&B’)
A’         B          (A’&B)
A’         B’         (A’&B’)
It is easy to visualise probabilities by using trees.
We can also easily calculate conditional probability:
P(A | B) = P(A&B)/P(B)
Venn diagrams
Venn diagrams are the most common way to draw several events and the ways they overlap.
They are, however, not as convenient to use as tree diagrams or contingency tables (if you can use contingency tables or trees, it is preferable; however, if you are more comfortable using Venn diagrams, keep on using them!).
Total probability is 1. The probabilities of different events happening are worked out as a fraction or percentage. So if there is a 50% chance of A occurring, P(A) = 0.5 (or a fraction of 1).
P(A ∪ B) is the probability of either A, B or both A and B occurring.
Venn diagrams are invaluable when solving 3-way probabilistic models. Tree diagrams can also be used for multiple-step probability models but get very large very quickly! Contingency tables cannot be used as we would require 3 dimensions!
3-way Venn diagrams often have difficult calculations, but as long as your definitions are right, you won’t go wrong.
Often you will be given just enough information, and will have to use calculations to find all unknowns in a Venn diagram.
The same probability calculation rules apply to these 3-way diagrams.
P(A&B&C) = P(A) × P(B) × P(C) for independent events
P((A&B) | C) = P(A&B&C)/P(C)
P(A | (B&C)) = P(A&B&C)/P(B&C)
- these rules can be used interchanging A, B & C
Variance
Variance is a measure of how spread out data is, or how spread out the values of a random variable are.
The greater the variance, the more ‘varied’ the values are.
For example if we have two random number generators:
Generator 1 makes a number between -50 and 50
Generator 2 makes a number between -500 and 500.
Both have the same expected value or mean (0), but if we ran both generators, generator 2 would give a much more spread out or ‘varied’ range of numbers.
Variance can be calculated by the following formula:
$\mathrm{VAR}(X) = E[X^2] - \mu^2$
The variance of a probability distribution is equal to the expected value of X squared, minus the mean squared.
Calculations will likely not be required but it is important to understand what variance is a measure of: how varied the data is.
The greater the spread, the greater the variance and standard deviation! These are closely related measures of spread but have different values.
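The formula can be tried on the two generators. This sketch assumes each generator picks a whole number in its range with equal probability, which the notes do not actually specify; the function name is made up for illustration.

```python
from fractions import Fraction

def mean_and_variance(values):
    """Mean and variance when every value is equally likely (an assumption)."""
    p = Fraction(1, len(values))
    mu = sum(p * x for x in values)            # E[X]
    e_x2 = sum(p * x * x for x in values)      # E[X^2]
    return mu, e_x2 - mu ** 2                  # VAR(X) = E[X^2] - mu^2

mu1, var1 = mean_and_variance(range(-50, 51))
mu2, var2 = mean_and_variance(range(-500, 501))

print("generator 1: mean", mu1, "variance", var1, "sd", float(var1) ** 0.5)
print("generator 2: mean", mu2, "variance", var2, "sd", float(var2) ** 0.5)
```

Both means are 0, but the second variance is far larger, and the standard deviation is simply the square root of the variance.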
Standard deviation
Standard deviation is another measure of the spread of data or of a probability distribution.
Remember we can look at results or probability functions to determine variance or standard deviation.
We calculate standard deviation by first calculating variance.
SD or $\sigma = \sqrt{\mathrm{VAR}(X)}$
Probability distributions
The probability distributions paper takes a different turn from previous years.
It is no longer a precise paper with many calculator-related questions such as ‘find the expected value’.
This year you will be asked to estimate expected values and variance with real world (and non-exact) data.
Distributions are largely up to interpretation. You will be asked to estimate which model would fit best to certain data and why.
The paper involves calculating and interpreting expected values and standard deviations of discrete random variables.
You will also have to apply distributions to data: binomial, normal, Poisson etc.
A good understanding of the underlying concepts of probability distributions will equip you to tackle any question the exam poses.
Discrete and continuous data
Although all calculations you encounter will involve discrete data, you will need to understand the differences between the different types.
A discrete distribution is one in which the variable can only take on certain set values, like age in years or goals scored.
A continuous distribution is one in which the variable can take on any value in a range. The values may be forced to fall within constraints, but there are infinitely many values which can be achieved.
Mean
The mean is the average. This means if we add all the data values together and divide by the number of values, we will have the mean.
The mean is also the ‘expected value’ if we were to take a random value from our data points. Mean values are pulled towards extreme values or outliers on our data plots.
Standard deviation
Standard deviation is a measure of the spread of data. It is related to variance:
$\text{Standard deviation} = \sqrt{\text{Variance}}$
Standard deviation shows how much variation or dispersion exists from the average (mean), or expected value. A low standard deviation indicates that the data points tend to be very close to the mean; a high standard deviation indicates that the data points are spread out over a large range of values.
True, theoretical & experimental distributions.
Just as we compared true, theoretical and experimental probability, we will also look at the corresponding distributions.
Just like in the probability paper: the true probability distribution is some unknown ‘real’ distribution which we can never know or properly calculate.
It can only be estimated through a combination of theoretical and experimental probability.
So to estimate the ‘true’ probability distribution we would have to combine estimates from our theoretical probability distribution with our experimental probability distribution.
A theoretical distribution is an estimate of what our data spread would look like, based upon mathematical modelling of the event.
An experimental distribution is simply the spread of data we get when running an experiment or simulation of the event.
Types of distributions…
You will need to know which distribution is best applied to different forms of data.
Binomial distribution
A binomial distribution is the probability distribution of the number of successes in a series of yes/no trials. E.g. we flip 100 coins: what is the probability distribution of total heads? Or what values of total heads can we expect?
A distribution characterised by:
• Only 2 outcomes: success or failure
  E.g. flipping a coin.
• Fixed number of identical trials
  The number of trials is set at the beginning.
• P(success) at each trial is constant
  Every individual trial has the same P(success).
• Each trial is independent
  Each individual trial has no impact on any other trial.
We know that any binomial distribution must be discrete. This is because each individual trial is either a success or a failure: we can’t have half a success!
Mean and variance of a binomial distribution.
If we let X be our random variable (the number of successes, e.g. the number of heads)
We can calculate the mean and variance using the following formulae:
$E(X) = \mu = n\pi$
$\mathrm{VAR}(X) = \sigma^2 = n\pi(1 - \pi)$
n = number of trials
π = probability of success at each trial
E(X) = expected value
σ = standard deviation (remember σ² = VAR(X))
These are theoretical probability equations.
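As a quick check of these formulae, the sketch below uses the 100-coin example (n = 100, π = 0.5, assuming a fair coin) and compares nπ with the mean computed directly from the binomial probabilities. The function name binomial_pmf is chosen here for illustration.

```python
from math import comb, sqrt

def binomial_pmf(k, n, pi):
    """P(X = k) for a binomial distribution with n trials and success probability pi."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

n, pi = 100, 0.5                 # 100 flips of a coin assumed to be fair

mean = n * pi                    # E(X) = n * pi
variance = n * pi * (1 - pi)     # VAR(X) = n * pi * (1 - pi)
sd = sqrt(variance)

# The same mean, found the long way: sum of k * P(X = k) over every possible k.
mean_from_pmf = sum(k * binomial_pmf(k, n, pi) for k in range(n + 1))

print(mean, variance, sd, round(mean_from_pmf, 6))
```

The shortcut formulae and the long calculation agree, which is why the formulae are safe to use directly.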
A typical exam question could either ask you to calculate expected values (means) or standard deviations using theoretical models, or supply you with a distribution of discrete data.
Questions will ask you what the expected value is, what the standard deviation is and what distribution best fits this data.
Using trees
Binomial distributions can be calculated using tree diagrams. These can provide a good frame to calculate outcomes, but get very complex very quickly!
[Tree diagram: three successive die rolls, each branching into 4 (a four is rolled) and N (any other number)]
In this example each event is a die roll. A positive event is the roll of a 4 on the die. A negative event is any other number.
For only 3 rolls the tree provides a frame in which calculations can easily be conducted.
But as we can see, if we needed to calculate probabilities for 6 or more rolls, the tree would get too complicated!
For this tree each branch has either P(4) = 1/6 or P(N) = 5/6. Try calculating the probabilities of each combination of rolls.
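To check your answers, the tree can be enumerated by multiplying along each path. This is a minimal sketch; the labels "4" and "N" follow the tree above, and the other names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

branch_prob = {"4": Fraction(1, 6), "N": Fraction(5, 6)}

# Every path through a 3-roll tree: multiply the probabilities along the branches.
paths = {}
for path in product("4N", repeat=3):
    prob = Fraction(1)
    for branch in path:
        prob *= branch_prob[branch]
    paths["".join(path)] = prob

# Add up the paths that contain the same number of fours.
by_number_of_fours = {}
for path, prob in paths.items():
    k = path.count("4")
    by_number_of_fours[k] = by_number_of_fours.get(k, 0) + prob

for path, prob in paths.items():
    print(path, prob)
print(by_number_of_fours)
```

There are 8 paths for 3 rolls; for 6 rolls there would be 64, which is why the tree quickly becomes impractical to draw.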
Poisson distribution
Poisson distributions apply when we have a set frame or time for events to occur and we want to figure out the probability of different numbers of events occurring in that frame. E.g. how many cars pass along a road in an hour?
It is a special kind of distribution based on counting the number of times an event occurs in a time interval.
• The Poisson distribution is discrete. It can only take on whole-number values.
• Imagine a distribution of the number of cars passing down a street in a given hour.
• The distribution has an expected value, but as we are counting the number of events in a given time interval, the number of events cannot be below 0.
• This gives a lopsided distribution: the Poisson distribution.
• The Poisson distribution must follow certain conditions.
  ‐ Each occurrence is independent of other occurrences
  ‐ Events cannot occur simultaneously
  ‐ Events occur at random, and are unpredictable
  ‐ For a small interval the probability of the event occurring is proportional to the size of the interval
$P(X = x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$
It is a good idea to familiarise yourself with the shape of the Poisson distribution as you may have to match it to real world data.
• The Poisson distribution has only one parameter: λ
  This makes calculations considerably easier once the concept is grasped.
• There is only one parameter because λ = μ (mean) = variance
• So the mean value also tells us how spread out the data is: the standard deviation is √λ.
• Among the distributions covered here, this property is particular to the Poisson distribution. The shape depends on λ: for small λ the distribution is strongly lopsided (skewed to the right), and as λ increases it becomes more spread out and more symmetrical.
Calculations with Poisson…
$P(X = x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$
λ is our only parameter (it is both the variance and the mean number of events in a given interval).
Remember for any distribution, total probability = 1
(This is equivalent to saying the probability that anything happens is 100%)
Have a go at calculating this question.
The mean number of eruptions a year is 4.7.
What is the probability more than 2 will occur?
We use the fact that Poisson only has discrete values, so P(X > 2) = 1 − P(X = 0) − P(X = 1) − P(X = 2).
𝜆 = 4.7
We want P(X>2)
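Once you have attempted the question, the working can be checked with a short sketch like the one below. The function name poisson_pmf is chosen here for illustration; it simply evaluates the Poisson formula.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

lam = 4.7

# Because X only takes whole-number values, "more than 2" is everything except 0, 1 and 2.
p_at_most_2 = sum(poisson_pmf(x, lam) for x in range(3))
p_more_than_2 = 1 - p_at_most_2

print(round(p_more_than_2, 4))
```

The answer comes out at roughly 0.85, so more than 2 eruptions in a year is quite likely when the mean is 4.7.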
You may need to apply reverse Poisson to find λ when given a certain probability. This will require basic algebraic manipulation.
Given that we know a distribution follows Poisson, and P(X=0) = 0.001, calculate λ.
These problems will always be stated in the form P(0) = … or P(X≥1) = ….
Hence we only need P(0), which is either given directly or found as 1 − P(X≥1).
$P(X = 0) = \dfrac{e^{-\lambda} \lambda^{x}}{x!} = 0.001$ and we know x = 0
Have a go at this reverse Poisson problem.
$P(X = 0) = \dfrac{e^{-\lambda} \lambda^{0}}{0!} = \dfrac{e^{-\lambda} \times 1}{1} = 0.001$
So $0.001 = e^{-\lambda}$
$\lambda = -\ln(0.001) = 6.9078$
Remember 𝜆 is also the variance!
Harder Poisson problems may involve probability trees and conditional probability.
E.g. the mean number of accidents on SH1 on a weekday = 1, and on a weekend day = 2.
What is the probability that 3 accidents occur on a day chosen at random?
First draw a tree!
P(weekend) = 2/7, 𝜆 = 2: $P(3 \mid \text{weekend}) = \dfrac{e^{-2}\, 2^{3}}{3!}$
P(weekday) = 5/7, 𝜆 = 1: $P(3 \mid \text{weekday}) = \dfrac{e^{-1}\, 1^{3}}{3!}$
$P(X = 3) = \tfrac{2}{7}\, P(3 \mid \text{weekend}) + \tfrac{5}{7}\, P(3 \mid \text{weekday})$
You may also be asked conditional problems:
If no crashes occur on SH1, what is the probability it is a weekday?
First we need P(X=0) for weekdays and P(X=0) for weekends.
Weekends: $P(X = 0) = \dfrac{e^{-2}\, 2^{0}}{0!} = 0.13534$
Weekdays: $P(X = 0) = \dfrac{e^{-1}\, 1^{0}}{0!} = 0.3679$
In addition, there are 5 weekdays for every 2 weekend days.
So $P(\text{weekday} \mid \text{no crash}) = \dfrac{\tfrac{5}{7}\, P(\text{no crash} \mid \text{weekday})}{\tfrac{5}{7}\, P(\text{no crash} \mid \text{weekday}) + \tfrac{2}{7}\, P(\text{no crash} \mid \text{weekend})}$
$= \dfrac{\tfrac{5}{7}(0.3679)}{\tfrac{2}{7}(0.13534) + \tfrac{5}{7}(0.3679)}$
= 0.871727 (if there was a day without crashes, it is far more likely that it occurred on a weekday.)
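Both SH1 calculations follow the same pattern: weight each branch of the tree by the chance of being on that branch. A minimal sketch, with variable names invented for illustration:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

p_weekday, p_weekend = 5 / 7, 2 / 7
lam_weekday, lam_weekend = 1, 2

# P(3 accidents on a random day): add the two branches of the tree.
p_three = (p_weekday * poisson_pmf(3, lam_weekday)
           + p_weekend * poisson_pmf(3, lam_weekend))

# P(weekday | no crash): the weekday branch as a share of all no-crash branches.
weekday_and_none = p_weekday * poisson_pmf(0, lam_weekday)
weekend_and_none = p_weekend * poisson_pmf(0, lam_weekend)
p_weekday_given_none = weekday_and_none / (weekday_and_none + weekend_and_none)

print(round(p_three, 4), round(p_weekday_given_none, 4))
```

The conditional answer agrees with the 0.87 found by hand; small differences in the later decimal places come from rounding 0.3679 and 0.13534.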
Remember the most important skill is fitting distributions to data!
Here are some examples of real world poisson distributions which you would need to recognise in the exam.
Normal distribution:
The normal distribution follows a symmetrical bell curve. It is probably the most useful distribution and is used in many ways in the real world.
𝜇 is the mean. It marks the centre of the normal curve.
The curve will be bell shaped and symmetric around the mean.
Standard deviation or 𝜎 is roughly 1/6th of the range of a normal distribution.
SD is roughly the typical distance of values from the mean.
Skills in estimating SD and mean will improve with practice.
𝝁 is shifted by extreme values (remember it is the average)
𝝈 (SD) is stretched by extreme values (measure of spread)
Try estimating the standard deviation and means of yr 12 and yr 9 students.
Mean = 13.1
SD = 2.4
Mean = 9.0
SD = 2.8
The standard deviation dictates how spread out the curve is. A curve with a high 𝜎 will be flatter; a low 𝜎 will give a sharp spike. Remember the total area under a distribution is always 1.
• Because normal distributions do not take on discrete values, calculation is a bit more complicated. We can no longer calculate P(X = 1), because the probability that X is exactly 1.000000… is 0.
• Instead we use normal distributions to calculate probabilities over a range, such as P(X < 1).
• We use the standard score Z to calculate values, where $Z = \dfrac{X - \mu}{\sigma}$.
Z is equal to the number of standard deviations X is from the mean.
Using a calculator
The first thing we need to do before solving calculations using a graphics calculator is to convert to a ‘standard curve’ of 𝜇 = 0 & 𝜎 = 1.
This is where Z comes into play. We use Z to standardise our curve, then our calculators can solve to find probabilities using the reference curve.
E.g. what is the probability a carton of eggs weighs less than 200 g if the mean is 205 g and the standard deviation is 3 g? We first need to convert to a standard curve.
• First adjust our mean to 0 g. So we want P(X < −5) when SD = 3.
• $Z = \dfrac{X - \mu}{\sigma} = \dfrac{-5 - 0}{3} = -\dfrac{5}{3}$
On your calculator…
>menu
>stat
>dist(tab)
>norm(tab)
>Ncd (cumulative probability; Npd gives only the height of the curve)
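The egg-carton answer can also be checked without a graphics calculator. This sketch uses Python's standard statistics.NormalDist (available from Python 3.8) and shows that working on the original scale and on the standardised Z scale give the same probability.

```python
from statistics import NormalDist

mu, sigma = 205, 3        # carton weights in grams, from the example
x = 200

z = (x - mu) / sigma      # standard score: -5/3

# Probability on the original scale ...
p_direct = NormalDist(mu, sigma).cdf(x)
# ... and on the standard curve (mean 0, SD 1) using Z.
p_standard = NormalDist().cdf(z)

print(round(z, 4), round(p_direct, 4), round(p_standard, 4))
```

Both routes give a probability a little under 0.05, so fewer than 1 carton in 20 would be expected to weigh less than 200 g under this model.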
In the real world, distributions are never exactly normally distributed. The normal distribution is a theoretical
model which is useful because it enables us to calculate probabilities for a distribution based only on the mean
and standard deviation. If the population we are considering has a distribution which is approximately normal
and we have good estimates for µ and σ, then the probability estimates we make for the population using a
normal distribution model are going to be close to the actual population probabilities. We call such populations
normally distributed to indicate that they have the characteristics of a normal distribution and can usefully be
modelled by a normal distribution. Similar considerations apply to modelling using uniform, triangular,
binomial or Poisson distributions.
The normal distribution can give reasonably accurate estimates for probabilities of distributions of populations
if they have the following characteristics: unimodal; reasonably symmetric; frequency of the observations falls
off rapidly as measurements get further from the central value; few or no extreme values. A uniform
distribution is the best choice of model when there are lower and upper limits to possible values and little
information about the shape of the distribution, or when the context and shape of the sample distribution
suggest a uniform distribution. A triangular distribution is the best choice when there is information about
lower and upper limits and the mode, but little information about the shape apart from that, or when the
context and shape of the sample distribution suggest a triangular distribution.
Whether a sample distribution is consistent with being from a population which could be modelled by a given
distribution is a matter of judgement. Contextual knowledge should be used to decide whether a model is
useful, along with the characteristics of the distribution. For example, a small sample (eg n<30) may not look
normal but could be consistent with being from an underlying normal distribution, while a large sample (eg
n>200) from a normally distributed population would be expected to look approximately normal. Uniform and
triangular distributions can be discrete or continuous. Unlike real world distributions, the underlying true
distribution of a probability situation may in some cases be exactly modelled by a theoretical distribution.
Students should have the opportunity to see how changing the bin (class) width changes the appearance of a
histogram.
Limitations of real world data
It is easy to think of real world data as being perfect: that normally distributed data is purely normal and will always look normal. However, we know that real distributions are very rarely exactly normal; they only closely resemble a normal distribution.
And often samples from normal distributions may not even appear normal!
We often think of a sample size of 30+ as being large enough to employ the central limit theorem, but this is often not enough to give a good-looking normal distribution. Realistically we need sample sizes above 200 to give a proper-looking normal distribution.
Small sample size of 30
Heights of people will give a very good approximation of a normal distribution. However this distribution looks anything but normal!
We will need a sample size much larger than 30 to give a good-looking distribution.
You need to be aware that if a sample size is small, it is unwise to comment on its distribution, as it could turn out to be anything.
With a larger sample, the distribution looks a lot better. Even though both are sampling from the same population, the larger sample size has a huge impact on the appearance of the distribution.
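You can see this effect for yourself by simulating samples of different sizes from the same normal population. The sketch below is an illustration only: the population mean of 175 cm, the SD of 7 cm, the 5 cm bins and the function name are all assumptions made here, not values from the notes.

```python
import random

def text_histogram(sample, low=150, high=200, width=5):
    """Count values into bins of the given width; returns (bin start, count) pairs."""
    bins = {start: 0 for start in range(low, high, width)}
    for x in sample:
        start = low + int((x - low) // width) * width
        if start in bins:          # values outside low..high are ignored
            bins[start] += 1
    return sorted(bins.items())

rng = random.Random(0)             # fixed seed (arbitrary) so the run is repeatable
for n in (30, 500):
    # Assumed population: heights with mean 175 cm and SD 7 cm.
    sample = [rng.gauss(175, 7) for _ in range(n)]
    print(f"n = {n}")
    for start, count in text_histogram(sample):
        print(f"{start:>4} {'#' * round(40 * count / n)}")
```

Run it a few times with different seeds: the n = 30 picture changes shape a lot from run to run, while the n = 500 picture stays close to a bell shape.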
Errors in interpretation can also arise through poorly placed class boundaries for a histogram. If the bars are not lined up with the mean, the data can look skewed and not normal even if the individual points closely resemble a normal distribution.
[Histogram: height of 500 NZ men; horizontal axis: height (cm), vertical axis: frequency]
This sample, although having a large sample size, appears skewed to the side.
Why? Poorly placed boundaries for our histogram bars. The data is normal, but because the mean lies to the left of a histogram bar, the distribution will always appear skewed to one side.
Histograms can be dangerous as even normal distributions can appear ‘not normal’.
Continuous probability functions and probability density functions
The calculations used to find probabilities within certain constraints for the normal distribution use continuous probability theory (the Poisson distribution, by contrast, is discrete).
Continuous probability distributions, described by probability density functions, model non-discrete data.
They never fit real-life data exactly, but provide a theoretical model.
Triangular, uniform and other distributions.
Sometimes you may be presented with data that does not fit any of the above distributions and you may have to improvise a distribution.
In certain situations we may only know a very small amount of information.
E.g. my own bus route (277) runs only every half hour, and isn’t as reliable as the inner link.
I know that the bus is most likely to appear on time, but could in fact turn up at any time between the time it is due and half an hour later. It is never early.
The distribution has a max at t=0 and a min at t=30.
Other than that we know nothing about the probability of arrival times.
We could fit any of these models to the data.
How do we determine which is the best model to use?
The first one can be ruled out because we know we have a max at the beginning and a min at the end. Aside from that we apply a rule of ‘using the distribution of the least complexity’.
We will use the last distribution as it is the simplest.
We can use this distribution to calculate probabilities such as P(Bus arrives after 10 minutes).
This is obviously a rough estimate of the true probability, but is the best theoretical model we can make with the given information.
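As an illustration of such a calculation, suppose the chosen model is a triangular distribution with its peak at t = 0 that falls in a straight line to zero at t = 30. This is an assumption made for this sketch, since the diagrams of the candidate models are not reproduced here. The probability of the bus arriving after t minutes is then the area of the smaller triangle to the right of t, which is ((30 − t)/30)².

```python
def prob_later_than(t, latest=30):
    """P(bus arrives more than t minutes after it is due).

    Assumes a triangular density 2 * (latest - t) / latest**2 on 0..latest,
    highest at t = 0 and zero at t = latest.
    """
    if t <= 0:
        return 1.0
    if t >= latest:
        return 0.0
    return ((latest - t) / latest) ** 2

print(round(prob_later_than(10), 3))   # area to the right of 10 minutes
print(round(prob_later_than(20), 3))   # area to the right of 20 minutes
```

Under this model the chance of waiting more than 10 minutes is (20/30)² = 4/9, a little under a half.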
Question structure.
The basic question structure for distributions will show a real world random data set.
• You will be asked to provide an appropriate theoretical model for this data set.
• You may be asked to estimate the mean and standard deviation for the data set.
• You may also be asked to interpret the implications of your chosen model, or the mean or standard deviation.
• You may also be asked to discuss limitations of your model, or whether or not the mean or SD seem accurate using the context of the data.
• You will also need to brush up on your pure theoretical skills: calculating the SD and mean of different distributions.
• Also you will need to have a firm grasp on confidence interval calculations.
Question examples…
1.
Seeds are planted in rows of six. After 14 days the number of seeds which have germinated in each
of the 100 rows is noted. The results are shown in the table:
Number of seeds germinating:  0   1   2   3   4   5   6
Number of rows:               2   1   2  10  30  35  20
Find the theoretical frequencies of 0, 1, 2, …, 6 seeds germinating in a row, using an associated theoretical
distribution.
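One way to approach question 1 is sketched here, assuming a binomial model with n = 6 seeds per row and p estimated from the sample mean (mean = np). The variable names are illustrative only.

```python
from math import comb

seeds = [0, 1, 2, 3, 4, 5, 6]
rows = [2, 1, 2, 10, 30, 35, 20]   # observed number of rows

n = 6                               # seeds planted per row
total_rows = sum(rows)              # 100 rows
mean = sum(x * f for x, f in zip(seeds, rows)) / total_rows
p_hat = mean / n                    # binomial mean is n * p


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Theoretical frequencies: number of rows times each binomial probability.
expected = [total_rows * binomial_pmf(k, n, p_hat) for k in seeds]
print(mean, p_hat)                      # 4.5 0.75
print([round(e) for e in expected])     # [0, 0, 3, 13, 30, 36, 18]
```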
2.
In a large batch of items from a production line the probability that an item is faulty is p. 400
samples, each of size 5, are taken and the number of faulty items in each batch is noted. Estimate p from
the frequency distribution given in the table. Use the theoretical binomial distribution with the same mean
to estimate the number of samples which would be expected to have more than one faulty item, if 600
samples were taken from the production line.
Number of faulty items:  0    1   2   3  4  5
Frequency:               297  90  10  2  1  0
3.
On average 20% of the bolts produced by a machine in a factory are faulty. Samples of ten bolts are
to be selected at random from the bolts produced that day.
a) Calculate the probability that, in any one sample, two or fewer bolts will be faulty.
b) Find the expected value and standard deviation of the number of bolts in a sample which will not
be faulty.
4.
In a large batch of items from a production line the probability that an item is faulty is p. 400
samples, each of size 5, are taken and the number of faulty items in each batch is noted.
a. Estimate p from the frequency distribution given in the table.
b. Select a theoretical distribution to model this situation, and justify its use. Use it to estimate the number
of samples of size five which would be expected to have more than one faulty item, if 600 samples were
taken from the production line.
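A minimal sketch of the working for this question, assuming the frequency table is the one given with question 2 (297, 90, 10, 2, 1, 0 samples with 0 to 5 faulty items) and a binomial model with n = 5:

```python
faulty = [0, 1, 2, 3, 4, 5]
freq = [297, 90, 10, 2, 1, 0]      # number of samples with that many faulty items

n = 5                               # items per sample
samples = sum(freq)                 # 400 samples
mean = sum(x * f for x, f in zip(faulty, freq)) / samples   # 0.3 faulty per sample
p_hat = mean / n                    # estimate of p, from mean = n * p

# P(X > 1) = 1 - P(X = 0) - P(X = 1) for X ~ Binomial(5, p_hat)
p0 = (1 - p_hat) ** n
p1 = n * p_hat * (1 - p_hat) ** (n - 1)
p_more_than_one = 1 - p0 - p1

expected_in_600 = 600 * p_more_than_one
print(round(p_hat, 2), round(p_more_than_one, 4), round(expected_in_600))
```

This gives p of about 0.06 and roughly 19 of the 600 samples with more than one faulty item.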
5.
The number of emergency admissions each day to a hospital varies. The mean number of
admissions is 2 with a standard deviation of 1.5. Select a suitable theoretical distribution to model this
situation, justify your choice, and use the distribution to answer the following:
a. Evaluate the probability that on a particular day, there will be no emergency admission.
b. At the beginning of one day the hospital has 5 beds for emergencies. Calculate the probability that this
will be an insufficient number for the day.
c. Calculate the probability that there will be exactly three admissions on two consecutive days.
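A sketch of question 5 under a Poisson model with mean 2. The choice is supported by the variance, 1.5² = 2.25, being close to the mean. Part c is read here as "exactly three admissions on each of two consecutive days", with the days assumed independent; that reading is an assumption.

```python
from math import exp, factorial

lam = 2.0   # mean admissions per day


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


p_none = poisson_pmf(0, lam)                                    # part a
p_more_than_5 = 1 - sum(poisson_pmf(k, lam) for k in range(6))  # part b: 5 beds not enough
p_three = poisson_pmf(3, lam)
p_three_both_days = p_three**2                                  # part c, days independent

print(round(p_none, 4), round(p_more_than_5, 4), round(p_three_both_days, 4))
```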
6.
A firm selling electrical components records the number of new orders received over a period of 150
days.
Number of new orders:  0   1   2   3  4
Number of days:        51  54  36  6  3
a. Find the average number of new orders per day
b. Use an appropriate theoretical distribution to calculate the probability that there will be 5 or more orders
in a day. Justify your choice of distribution.
c. The firm packs the electrical components in boxes of 60. On average 2% of the components are
faulty. What is the chance of getting more than two defective components in a box?
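The sketch below works through question 6, assuming a Poisson model for daily orders (the sample mean and variance are close) and a binomial model for faulty components in a box. Note that "more than two" read strictly is P(X > 2); the value 0.3381 listed in the answers corresponds to P(X ≥ 2), so both are computed.

```python
from math import comb, exp, factorial

orders = [0, 1, 2, 3, 4]
days = [51, 54, 36, 6, 3]

total_days = sum(days)                                             # 150
mean = sum(x * f for x, f in zip(orders, days)) / total_days       # part a
variance = sum(f * (x - mean) ** 2 for x, f in zip(orders, days)) / total_days


def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)


def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Part b: P(5 or more orders in a day) under Poisson(mean)
p_five_or_more = 1 - sum(poisson_pmf(k, mean) for k in range(5))

# Part c: box of 60 components, 2% faulty
p_at_least_two = 1 - sum(binomial_pmf(k, 60, 0.02) for k in range(2))
p_more_than_two = 1 - sum(binomial_pmf(k, 60, 0.02) for k in range(3))

print(round(mean, 2), round(variance, 2))
print(round(p_five_or_more, 4), round(p_at_least_two, 4), round(p_more_than_two, 4))
```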
7.
On average 20% of the bolts produced by a machine in a factory are faulty. Samples of ten bolts are
to be selected at random from the bolts produced that day.
a. Calculate the probability that, in any one sample, two or fewer bolts will be faulty.
b. Find the expected value and standard deviation of the number of bolts in a sample which will not
be faulty.
c. State any assumptions you have made in answering this question, and comment on whether the
assumptions were valid.
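A short sketch for this question, assuming a binomial model: each bolt is faulty independently with the same probability 0.2, which are exactly the assumptions part c asks you to state.

```python
from math import comb, sqrt

n, p_faulty = 10, 0.2


def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Part a: P(two or fewer faulty bolts in a sample of ten)
p_two_or_fewer = sum(binomial_pmf(k, n, p_faulty) for k in range(3))

# Part b: the number of bolts that are NOT faulty is Binomial(10, 0.8)
p_good = 1 - p_faulty
mean_good = n * p_good
sd_good = sqrt(n * p_good * (1 - p_good))

print(round(p_two_or_fewer, 4), round(mean_good, 1), round(sd_good, 2))
```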
8.
a. National records for the past 100 years were examined to find the number of deaths in each year
due to lightning. The most deaths in any year was four, which was recorded once. In 35 years no
death was observed and in 38 years only one death. The mean number of deaths per year was 1.00. Draw up
a frequency table of the number of deaths per year, and estimate the corresponding expected frequencies
for an associated theoretical distribution having the same mean.
b. Justify your choice of theoretical distribution to model the number of deaths per year.
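The missing frequencies for 2 and 3 deaths can be recovered from the two totals (100 years, and a mean of 1.00 giving 100 deaths in all). The sketch below does this and then computes Poisson expected frequencies with the same mean. Grouping the last class as "4 or more" is an assumption made so that the expected frequencies sum to 100.

```python
from math import exp, factorial

years = 100
total_deaths = 100                  # mean 1.00 deaths per year over 100 years
known = {0: 35, 1: 38, 4: 1}        # frequencies given in the question

# f2 + f3 = remaining years, 2*f2 + 3*f3 = remaining deaths
remaining_years = years - sum(known.values())                            # 26
remaining_deaths = total_deaths - sum(k * f for k, f in known.items())  # 58
f3 = remaining_deaths - 2 * remaining_years                              # 6
f2 = remaining_years - f3                                                # 20
observed = [known[0], known[1], f2, f3, known[4]]

lam = total_deaths / years
expected = [years * exp(-lam) * lam**k / factorial(k) for k in range(4)]
expected.append(years - sum(expected))   # last class taken as "4 or more"

print(observed)                          # [35, 38, 20, 6, 1]
print([round(e, 1) for e in expected])   # [36.8, 36.8, 18.4, 6.1, 1.9]
```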
9.
In one trial of an experiment a certain number of dice are thrown and the number of sixes rolled is
recorded. The dice are all biased the same way, and the probability of getting a six in one throw is p. The
results of sixty trials are shown in the table.
Number of sixes rolled:  0   1   2   3  4  >4
Frequency:               19  26  12  2  1  0
Choose a theoretical distribution to model this situation. By comparing these results with those expected
for the theoretical distribution, estimate the number of dice thrown in each trial, and the value of p.
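One possible approach, assuming a binomial model, is to match the sample mean to np and the sample variance to np(1 - p), then solve for p and n. This is a sketch of that idea rather than the only acceptable method.

```python
from math import comb

sixes = [0, 1, 2, 3, 4]
freq = [19, 26, 12, 2, 1]           # the ">4" class has frequency 0

trials = sum(freq)                  # 60 trials
mean = sum(x * f for x, f in zip(sixes, freq)) / trials
variance = sum(f * (x - mean) ** 2 for x, f in zip(sixes, freq)) / trials

# Binomial: mean = n*p and variance = n*p*(1 - p), so variance / mean = 1 - p
p_est = 1 - variance / mean
n_est = round(mean / p_est)

# Compare with the observed table using the fitted model
expected = [trials * comb(n_est, k) * p_est**k * (1 - p_est) ** (n_est - k)
            for k in range(n_est + 1)]

print(round(mean, 2), round(variance, 2), round(p_est, 2), n_est)
print([round(e, 1) for e in expected])
```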
10.
The manager of a processing plant noticed during the course of a morning
that one of her employees was often idle. She decided to record when the employee
was active or idle over a 3 hour period. The results are given in the table.
Time (am)       Status
7:00 – 7:32     idle
7:32 – 8:20     active
8:20 – 8:30     idle
8:30 – 9:30     active
9:30 – 10:00    idle
a) If the manager had walked through the processing plant at a random time
between 7:00 am and 10:00 am, determine the probability that she would have
found the employee idle.
b) If the manager had randomly observed the employee for 6 one-minute time
periods between 7 and 10 am, justify the use of the binomial distribution to model
the situation.
c) Find the probability that the employee would have been idle for all 6 random one-minute observations.
d) Find the probability that the employee would have been active less than half the time.
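A sketch of the calculations for question 10, assuming the six observations are independent and each finds the employee idle with the same probability (the binomial assumptions asked for in part b):

```python
from math import comb

idle_minutes = 32 + 10 + 30         # 7:00-7:32, 8:20-8:30, 9:30-10:00
total_minutes = 180                 # 7:00 am to 10:00 am
p_idle = idle_minutes / total_minutes          # part a


def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_obs = 6
p_all_idle = binomial_pmf(n_obs, n_obs, p_idle)                 # part c

# Part d: active fewer than 3 times out of 6 means idle 4, 5 or 6 times
p_active_less_than_half = sum(binomial_pmf(k, n_obs, p_idle) for k in (4, 5, 6))

print(round(p_idle, 1), round(p_all_idle, 4), round(p_active_less_than_half, 4))
```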
Distributions practice - ANSWERS
1.
Binomial, x̄ = 4.5 = 6p, p = 0.75
Expected frequencies: 0, 0, 3, 13, 30, 36, 18
Justification using assumptions of binomial distribution supported by variance ≈ npq or similarity of
experimental observation to the binomial model.
2.
a. Poisson, x̄ = 2.767, sd = 1.511, variance = 2.860
Justify using assumptions of Poisson supported by either mean ≈ variance or similarity of experimental
observation to the Poisson model.
b. 0.7632
c. 0.06285
3.
a. Poisson, x̄ = 0.5
b. Justify using assumptions of Poisson supported by either mean ≈ variance or similarity of
experimental observation to the Poisson model.
P(X > 5) = 0.000014 ≈ 0
c. Binomial n = 144, p = 0.03, P(X ≥ 2) = 0.932
4.
a. Binomial, x̄ = 0.3 = np, n = 5, p = 0.06
b. Justification using assumptions of binomial distribution supported by variance ≈ npq or similarity
of experimental observation to the binomial model.
In one sample of size 5, P(X > 1) = 0.0319
In 600 samples, expect 19 to have more than one faulty item.
5.
Poisson, justify using assumptions of Poisson supported by mean ≈ variance.
a. P(X = 0) = 0.1353
b. P(X > 5) = 0.0165
c. On one day P(X = 3) = 0.180, P(two days in a row) = 0.180² = 0.0324
6.
a. x̄ = 1.04
b. Poisson, justify using assumptions of Poisson supported by either mean ≈ variance or similarity of
experimental observation to the Poisson model. P(X ≥ 5) = 0.0043
c. Binomial n = 60, p = 0.02, P(X ≥ 2) = 0.3381
7.
a. Binomial n = 10, p = 0.2, P(X ≤ 2) = 0.6778
b. 8, 1.3
8.
Deaths:    0     1     2     3     4
Observed:  35    38    (20)  (6)   1
Expected:  36.8  36.8  18.4  6.1   1.9
Poisson, justify using assumptions of Poisson supported by either mean ≈ variance or similarity of
experimental observation to the Poisson model.
9.
Binomial np = 1, np(1-p) = 0.8000, 1-p = 0.8, p = 0.2, n = 5
10.
a. idle = 32 + 10 + 30 = 72 minutes, P(idle) = 72/180 = 0.4
b. fixed number of trials n = 6, each observation independent (assumed), constant probability p = 0.4,
two outcomes (active or idle)
c. P(X = 6) = 0.0041
d. P(X ≥ 4) = 0.1792
11.
Li Ching-Yuen was a Chinese herbalist and longevity expert who was known to have died in 1928. He
claimed to have been born in 1734, giving him a lifespan of 196 years. Investigations into birth records
indicated that he was actually born in 1678, giving an even longer lifespan of 250 years!
Whilst this may seem unbelievable, is it? In this question we use statistics to look into the lifespan of very old
people.
Whilst there is no conclusive historical evidence to support the birth date of Li Ching-Yuen, the following data
concerning lifespans are known [at the time of writing this question (October 2008); sources given below]
• There were about 450000 people in the world aged over 100.
• There were 82 living people who were known to be over the age of 110.
• There were 2 people known to be over the age of 115 (ages 115 and 116).
• There are 31 unverified claims of people over the age of 110, two of whom claimed to be aged 115 and 116.
• In the past 50 years, 25 people are known for certain to have lived beyond the age of 115.
• In the past 50 years, 2 people are known for certain to have lived beyond the age of 120 (dying at ages 120 and 122).
A hypothesis H is made saying: Once you make it to your 100th birthday there is a fixed probability p of
surviving to your next birthday on any given subsequent birthday. For example, if p were 0.05 then the
hypothesis says that on my 100th birthday there is a 5% chance of surviving until I am 101; on my 101st
birthday there would be a 5% chance of surviving until I am 102 and so on.
Does the data approximately fit this hypothesis? What values of p would seem most appropriate?
Assume that the hypothesis is true with a generous value of p=0.5. With this hypothesis, how many 100 year
olds would need to be in a room before we might feel confident that one would live to the age of 196 suggested
by Li Ching-Yuen himself? How does this number compare with the number of people on earth today (6.7
billion)?
Extension: There are many statistical complications involved in predicting death rates. How many can you think
of? How might these affect these statistics in future?
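A rough sketch of how hypothesis H can be explored numerically. Two assumptions are made here that the question does not state: "confident that one would live to 196" is taken to mean an expected count of one survivor, and the crude estimates of p treat the counts of living people as if they followed a single group through time.

```python
p = 0.5                                 # the generous value suggested above
birthdays_needed = 196 - 100            # 96 further birthdays to survive
p_reach_196 = p**birthdays_needed

# Number of 100 year olds needed for one expected survivor to age 196
people_for_one_expected = 1 / p_reach_196
world_population = 6.7e9
ratio = people_for_one_expected / world_population

# Crude estimates of p from the data: surviving k more years has probability p**k
p_from_100_to_110 = (82 / 450_000) ** (1 / 10)
p_from_110_to_115 = (2 / 82) ** (1 / 5)

print(f"{p_reach_196:.3g} {people_for_one_expected:.3g} {ratio:.3g}")
print(round(p_from_100_to_110, 2), round(p_from_110_to_115, 2))
```

Even with p = 0.5 the room would need about 2⁹⁶ people, far more than the population of the earth, while the data suggest p is somewhat below 0.5.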
Poisson practice:
1.
The number of telephone calls received per minute at the switchboard of a certain office was logged during the period 10
a.m. to noon on a working day. The results were as in the following table. f is the number of minutes with x calls per
minute.
By consideration of the mean and variance of this distribution show that a possible model is a Poisson distribution. Using the
calculated mean and on the assumption of a Poisson distribution calculate
a. The probability that two or more calls were received during any one minute.
b. The probability that no calls were received during any one minute.
2.
(a) The number of accidents notified in a factory per day over a period of 200 days gave rise to the following table:
Number of accidents:  0    1   2   3  4  5
Number of days:       127  54  14  3  1  1
i. Calculate the mean number of accidents per day
ii. Assuming this situation can be represented by a suitable Poisson distribution, calculate the
corresponding frequencies.
(b) Of the items produced by a machine, approximately 3% are defective and those occur at random. What is the
probability that, in a sample of 144 items, there will be at least two which are defective?
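A sketch of both parts of this question. For part (b) two routes are shown: a Poisson approximation with mean np = 4.32, and the exact binomial. Which one is expected is not stated, so treating both as acceptable is an assumption.

```python
from math import comb, exp, factorial

accidents = [0, 1, 2, 3, 4, 5]
days = [127, 54, 14, 3, 1, 1]

total_days = sum(days)                                               # 200
mean = sum(x * f for x, f in zip(accidents, days)) / total_days      # part (a)(i)


def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)


# Part (a)(ii): Poisson frequencies with the same mean
expected = [total_days * poisson_pmf(k, mean) for k in accidents]

# Part (b): at least two defective in a sample of 144, with 3% defective
n, p = 144, 0.03
lam_b = n * p
p_poisson = 1 - poisson_pmf(0, lam_b) - poisson_pmf(1, lam_b)
p_binomial = 1 - (1 - p) ** n - comb(n, 1) * p * (1 - p) ** (n - 1)

print(mean, [round(e, 1) for e in expected[:4]])
print(round(p_poisson, 3), round(p_binomial, 3))
```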
3.
The number of emergency admissions each day to a hospital is found to have a Poisson distribution with a mean 2.
a. Evaluate the probability that on a particular day, there will be no emergency admission.
b. At the beginning of one day the hospital has 5 beds for emergencies. Calculate the probability that this will be
an insufficient number for the day.
c. Calculate the probability that there will be exactly three admissions on two consecutive days.
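Part c can be read in two ways, and the sketch below computes both, assuming the days are independent. If it means exactly three admissions on each day, the two daily probabilities are multiplied. If it means three admissions in total over the two days, the total is Poisson with mean 2 + 2 = 4. The value 0.195 in the answers to this exercise matches the second reading.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)


lam_per_day = 2.0

# Reading 1: exactly three admissions on each of the two days
p_three_each_day = poisson_pmf(3, lam_per_day) ** 2

# Reading 2: exactly three admissions in total over the two days
p_three_in_total = poisson_pmf(3, 2 * lam_per_day)

print(round(p_three_each_day, 4), round(p_three_in_total, 4))
```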
4.
(a) The following table shows the number of phone calls I received over a period of 150 days
Number of calls:  0   1   2   3  4
Number of days:   51  54  36  6  3
(i)
Find the average number of calls per day
(ii)
Calculate the frequencies of a comparable Poisson distribution.
(b) A firm selling electrical components packs them in boxes of 60. On average 2% of the components are faulty. What
is the chance of getting more than two defective components in a box?
5.
(a) The number of organic particles in a volume V cm3 of a certain liquid follows a Poisson distribution with a mean of
01.V. Find the probabilities that a sample of 1 cm3 of the liquid will contain
(i)
At least one organic particle
(ii)
Exactly one organic particle
(b) The liquid is sold in vials, each vial containing 10cm3 of the liquid. The vials are dispatched for sale in boxes, each
box containing 100 vials. Find the probability that a vial will contain at least one organic particle. Hence find the mean
and the standard deviation of the number of vials per 100 vials that contain at least one organic particle.
6.
(a) National records for the past 100 years were examined to find the number of deaths in each year due to lightning.
The most deaths in any year was four, which was recorded once. In 35 years no death was observed and in 38 years only
one death. The mean number of deaths per year was 1.00. Draw up a frequency table of the number of deaths per year,
and estimate the corresponding expected frequencies for a Poisson distribution having the same mean. Illustrate both
frequency distributions graphically.
(b) Justify the choice of a Poisson distribution to model the number of deaths per year.
Answers:
1. Mean 2.917, variance 2.860 (a) 0.788 (b) 0.00293
2. (a) (i) 0.5 (ii) 121.3, 60.7, 15.2, 2.5, 0.32, 0.03 (b) 0.929
3. (a) 0.135 (b) 0.017 (c) 0.195
4. (a) (i) 1.04 (ii) 53.0, 55.1, 28.7, 9.9, 2.6 (b) 0.121
5. (a) 2e-4 (b) 0.1353, 0.2707, 0.3233, 0.1782,
6.
Deaths:    0     1     2     3     4
Observed:  35    38    (20)  (6)   1
Expected:  36.8  36.8  18.4  6.1   1.9
91584
Statistical evaluation
practice reports
It’s a good idea to skim read the report before answering questions to get the
gist of what is being stated.
Read the report and fill in the framework for the analysis as you go. Questions
can then be answered on the reports.
If you run out of reports to analyse, try using any statistical report. Newspapers
are full of them!
Kiwis unlikely to queue up for asset shares – poll
Published: 6:55PM Friday August 10, 2012 Source: ONE News
There may not be a rush of Kiwis buying up shares in Mighty River Power or other state assets, the latest
ONE News Colmar Brunton poll suggests.
The partial float of Mighty River Power will take place in October or November if the Government has its
way.
The latest poll has support for the sales up two percentage points since the last survey in March but there is
still more opposition by nearly two to one.
The poll shows most Kiwis think they have the cash for a splash in the share market. Asked if they could afford the
$1000 needed for the minimum share purchase almost 50% say definitely, 11% say probably and the rest didn't know
or were unsure.
However the Prime Minister remains optimistic that Kiwis who can, will buy in.
"If you ask the question if the programme is definitely going to go ahead, will people support it, then it looks
like there's quite a high level of interest in terms of people buying shares," John Key said.
But when asked how likely they are to buy shares, just 13% of people said very likely with 21% saying quite
likely, meaning only a third of people appear keen to invest.
That leaves 65% who aren't likely to buy shares.
The Shareholders Association says due to a lack of financial literacy less than 10% of New Zealanders are
directly active in the share market and the poll numbers are the best the Government could hope for.
The association says people are still wary.
"There's still the older generation out there who were burnt off in the '87 sharemarket crash, and there's still
also their sons and daughters who have seen their mum and dad lose money in the '87 share market crash,"
says Shareholders' Association director, Grant Diggle.
The ONE News Colmar Brunton poll has a margin of error of plus or minus 3.1%.
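The report gives a margin of error but no sample size. As a sketch of how the two are linked, the code below uses the rule of thumb margin ≈ 1/√n and the 95% formula 1.96 × √(0.25/n). Both assume a simple random sample, and the implied sample size is an inference, not a figure from the report.

```python
margin = 0.031      # reported margin of error
z = 1.96            # 95% confidence level, assumed

# Sample size implied by each formula
n_rule_of_thumb = (1 / margin) ** 2
n_formula = (z * 0.5 / margin) ** 2


def interval(p_hat, margin=margin):
    """Poll percentage plus or minus the quoted (maximum) margin of error."""
    return (p_hat - margin, p_hat + margin)


# "Very likely" (13%) plus "quite likely" (21%) to buy shares
low, high = interval(0.13 + 0.21)

print(round(n_rule_of_thumb), round(n_formula))
print(round(low, 3), round(high, 3))
```

Either formula points to a sample of roughly 1000 people, and the share of people likely to buy is plausibly somewhere between about 31% and 37%.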
Writing frame for critically evaluating a report
Pre-Reading: “Getting the gist”
Read the media report and summarise what it is about in 3 sentences or bullet points.
While reading: “Worry Questions”
Read the media report again, asking appropriate “worry questions” as you go.
Record your answers in the boxes below:
Source
Method
Target Group
Who sampled
How selected
Sample size
margin of error
Questions asked
Key Findings
Claims
What is missing?
Critical Evaluation:
Discuss 2 good aspects of this report
Discuss 2 concerns
Comparing polls
14 May 11
Credit: Electoral Commission
The poll puzzle: Horizon provides a complete electoral picture
Some bloggers supporting various political parties are asking why Horizon's party vote poll results differ
from other polls.
They point out that other polls, mainly conducted by phone, have National about 20% ahead of Labour.
Horizon's polls show a lesser margin, for example, 9.7% on May 14 and 13.8% in April 2011.
They therefore claim Horizon's methodology must be faulty, and the HorizonPoll national panel is "self-selected".
However, there is no apples-with-apples comparison.
And the HorizonPoll panel is not self-selected.
Most of the telephone pollsters are reaching 1000 respondents and expressing the about 69% who have a
party vote preference as a percentage of 100.
They exclude undecided and won't say respondents.
They therefore are not expressing a complete picture of the 18+ adult population.
People are invited to join the HorizonPoll national online research panel based on the profile of the
population at the 2006 census. The panel is, therefore, not self-selected.
Less than 5% of the panel is self-enrolled and an iterative rim weighting system, using up to six factors at
one time, including party vote 2008, ensures results are robust within the confidence levels stipulated.
Other pollsters, where they publish what factors they are weighting on in order to make their results
representative, appear not to be weighting on 2008 party vote. This opens up room for any larger sampling
of any particular parties' voters to possibly affect results.
Horizon also usually uses sample sizes of 1800 or higher, to provide greater reliability in assessing the vote
for minor parties. This is important in a MMP environment, in which minor parties have been determining
which main party can form a coalition government.
Horizon's party vote results are weighted and expressed as a percentage of the adult population aged 18+
(after filtering by registration and intention to vote detailed below).
National won 32.9% of adult population votes in 2008, Labour 25%, Act 1.7%, NZ First 3%, Green 4.9%,
other parties 3.1%. Some 26.7% did not vote.
Horizon can also analyse the intentions of this significant non-voter group.
At April 2011 it appeared about 60,000 of them were again expressing a party vote preference and were
intending to vote. This too could have a major bearing on the outcome of the November 26 general election.
Horizon produces what we call a Net Potential Vote poll.
We take the responses of decided voters, as others do. We also ask the undecided group, which has been
varying in size between 12.8 and 23%, if they have a preference.
These preferences are added to the decided group - and then those who are not eligible to vote (so can't) are
excluded, along with those who definitely will not or may not vote.
The results we publish are therefore for decideds + undecideds with a preference - all of whom are on the
electoral rolls and say they are likely to or will definitely vote.
For further information please contact:
Grant McInman
Manager, Horizon Research
Telephone: +64 (21) 076 2040
Writing frame for critically evaluating a report
Pre-Reading: “Getting the gist”
Read the media report and summarise what it is about in 3 sentences or bullet points.
While reading: “Worry Questions”
Read the media report again, asking appropriate “worry questions” as you go.
Record your answers in the boxes below:
Source
Method
Target Group
Who sampled
How selected
Sample size
margin of error
Questions asked
Key Findings
Claims
What is missing?
Critical Evaluation:
Discuss 2 good aspects of this report
Discuss 2 concerns