Probability Concepts 91585

A thorough understanding of the differences between true probability, theoretical probability and experimental estimates.

• True probability
The true probability is the actual probability that an event will occur. It is often impossible to calculate, but it can be estimated through experimental probability and theoretical probability. Eg we assume flipping a coin gives a 50:50 chance of landing on heads or tails, but because the coin isn’t perfectly symmetrical, and there is a small chance it lands on its edge, this is not exactly the case. However, it will be very close!

• Theoretical probability
This is an estimate of the true probability that we calculate by modelling the situation. We know that an approximately symmetrical coin will give an approximately 50:50 chance of getting a head with each toss. Although this isn’t exactly equal to the true probability, it is a very good model and we can assume it to be true. Theoretical probability is an “educated guess”. We can often use trees and other diagrams to calculate an overall theoretical probability if we know the probability of the individual steps.

• Experimental probability
Experimental probability is an estimate of the true probability through trial and simulation. If we flip a coin 100 times and get approximately 50 heads, we can estimate that the true probability of getting a head is about 0.5. This is often a crude way to estimate the true probability, as many trials are required to get an accurate estimate; however, if the situation is extremely complicated it can be the only way. Experimental estimates rely on independent trials to be meaningful.

Strictly speaking, there is no such thing as an exact theoretical probability! Like an experimental estimate, it is just an estimate of the true probability.

Understanding true probability, model estimates and experimental estimates

True probability is the (almost always) unknown actual probability that an event will occur in a given situation.
The actual probability of a coin landing heads up is affected by the position from which it is tossed, the asymmetry of the two faces of the coin etc, so is not exactly 0.5, though the probability of a fair coin landing heads will be very close to 0.5. We can find out about the unknown true probability by observation (experiment) or by trying to understand the situation and modelling it. In probability an experiment is one or more trials of a probability situation. An experimental estimate is calculated from observation as the number of successful trials divided by the total number of trials. In the long run (over many trials), the experimental estimate may approach the true probability. An experimental estimate that a coin will land heads if it is tossed 20 times and lands heads up 14 times is 14/20 = 0.7. A probability model is a representation of a situation involving probability. Probability models can incorporate experimental estimates and assumptions about the situation (e.g., independence). These assumptions may be based on an idealised view of the world and an understanding of the mathematics of probability. A model estimate is an estimate of the probability that an event will occur, based on a probability model. The model estimate of a fair coin landing heads is 0.5. If a model is a good representation of the situation, the experimental estimate over many trials will be close to the model estimate. A model must always be considered in context. A good model is one which is fit for the purpose for which it is being used. When tossing an approximately fair coin, the model estimate of P(heads) = 0.5 is a good model for most purposes. A transport system modelling the timing of traffic lights to get a smooth flow of traffic will require a more complex model, tested against experimental observations to ensure that it is fit for the purpose. 
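To see how an experimental estimate settles towards the model estimate as the number of trials grows, here is a minimal Python sketch. It assumes a simulated coin whose chance of heads is exactly the model value 0.5; the function name `experimental_estimate` and the fixed seed are illustrative choices, not part of these notes.

```python
import random


def experimental_estimate(n_trials, p_heads=0.5, seed=1):
    """Proportion of heads in n_trials simulated tosses.

    p_heads is the assumed model probability; seed only makes runs repeatable.
    """
    rng = random.Random(seed)
    heads = sum(rng.random() < p_heads for _ in range(n_trials))
    return heads / n_trials  # successful trials / total trials


# Compare a short experiment with much longer ones.
for n in (20, 200, 20000):
    print(n, experimental_estimate(n))
```

With only 20 tosses the estimate can easily be as far from 0.5 as the 14/20 = 0.7 above; with many thousands of tosses it should sit very close to the model estimate.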
In some situations there is no obvious theoretical model, so we can only estimate the probabilities and probability distributions via experiment. These estimates can be used as a basis for building a theoretical model. For instance, to develop a model of the probability of getting a basketball through the hoop, an initial model might assume a constant probability of 0.5. As data is gathered, there could be successive refinements of the model so that it becomes a better estimate of the true probability. The data might indicate that the probability of getting the ball in the hoop is closer to 0.2 and that it changes over time.

Sometimes we might think that an obvious theoretical model applies, but experimental estimates demonstrate that our model is a poor one. There is then a need to find a better model using the estimates from the experiments. We might initially model the result of spinning a weighted coin as P(heads) = 0.5 but realise that that estimate is a poor one and use data to improve it.

Deterministic and probabilistic models

A deterministic model does not include elements of randomness. Every time you run the model with the same initial conditions you will get the same results. A probabilistic model does include elements of randomness. Every time you run the model, you are likely to get different results, even with the same initial conditions.

RANDOMNESS

What is true randomness? What makes something random? Often we misinterpret what proper randomness looks like. Humans are primed to see patterns in everything, so we will find patterns where there are none. Often a uniform spread appears ‘random’ to us because it is difficult to identify a pattern in it, whereas true randomness looks ‘lumpy’. Which of these images shows a random spread of dots?

Randomness is often not intuitive. The easiest way to increase your understanding of randomness is through questions and practical examples. Imagine flipping a coin 30 times.
Try to make up a ‘random’ sequence of coin tosses in your head and write down the results (write down 30 flips as if you had actually flipped the coin). Then actually flip a coin 30 times, recording your results, and compare the two sets of data. What differences do you notice?

Independence

Two events, A and B, are independent if the fact that A occurs does not affect the probability of B occurring. Think about rolling a die. The probability of landing on each face is equal. Therefore:
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1 (when we roll the die, it will land on one of the faces)

Now suppose we roll two dice. Does the roll of the first die affect the second? If we have rolled a 6 on the first die, what is the probability of getting a 6 on the second die? The same as always! P(6) = 1/6 no matter how many dice have been rolled previously or whatever their rolls were! Try to think of some examples of independent events. Eg the results of flipping the same coin twice.

Events are independent if: P(A and B) = P(A) x P(B)

Dependence

Events are dependent if the outcome of one has an effect on the outcome of the other. Eg: think about the probability of being blond and the probability of having blue eyes. Are blond people more likely to have blue eyes? Most real life events are dependent on some level (even if the dependence is very small).

Events are dependent if: P(A and B) ≠ P(A) x P(B)

1) A dresser drawer contains one pair of socks in each of the following colours: blue, brown, red, white and black. Each pair is folded together in a matching set. You reach into the sock drawer and choose a pair of socks without looking. You replace this pair and then choose another pair of socks. What is the probability that you will choose the red pair of socks both times? Are both draws independent?

2) A survey found that 72% of people in a school of 300 like pizza. If 3 people are selected at random, what is the probability that all three like pizza?
Is the probability of the second and third person liking pizza independent of the first person liking pizza? Why? Why not?

Mutually exclusive events

Events are mutually exclusive if both cannot occur at the same time. The most obvious example of mutually exclusive events is a pair of complementary events: A cannot occur at the same time as A’ (A not occurring). (Think about tossing a coin. We cannot get a head and a tail on the same toss.) However, mutually exclusive events do not have to be complementary.

If events A & B are mutually exclusive: P(A&B) = 0 and P(A∪B) = P(A) + P(B)

Conditional probability

Conditional probability is the probability of an event occurring given that another event occurs. It is written as P(A/B), read “P(A given B)”. In essence it means the “probability of event A occurring if event B occurs”. Think back to our mutually exclusive events. Suppose A & B are mutually exclusive. Then P(A/B) = 0, because if B occurs, A cannot occur.

It is calculated through the formula: $P(A/B) = \frac{P(A \,\&\, B)}{P(B)}$

Probability distributions and graphs

In your exams you’ll often be asked to estimate expected values from looking at data distributions and graphs. This will often involve comparisons, eg: which one has the greater variance, and which one has the highest expected value? These skills are most easily learnt through practice at looking at graphs.

Contingency tables

Contingency tables
• Can be used to display probabilities or frequencies of events with 2 or more variables
• Can help in conversion of frequencies to probabilities
• Can help in determining independence

We use a table with event or condition A on one axis and event or condition B on the other axis. The table allows easy comparison between the probabilities of both events. It also allows us to easily calculate the probability of both events or either of the events happening. Contingency tables are usually preferable to Venn diagrams, but use whatever you find most comfortable.
                      A occurs                        A’ (not occurring)
B occurs              B & A occur (B∩A), (B∪A)        B occurs, A doesn’t (B∪A)
B’ (not occurring)    A occurs, B doesn’t (B∪A)       Neither occurs

Here we can see how the table works. Events A and B can be anything. We can also see that it is easy to identify which boxes correspond to A∩B (A and B) and A∪B (A or B). Contingency tables are preferred to Venn diagrams because calculating with tables is easy. Try filling this table out.

Contingency tables can easily be converted from numbers to fractions by dividing through by the total value. We can also calculate probabilities in other terms.
• For example: What fraction of the drinkers drink coffee?
• We take the number of coffee drinkers and divide this by the total number of drinkers (200 minus the people who drink nothing).
This is called conditional probability. It could also be written as “what is the probability someone drinks coffee given they drink something?”

How would we calculate P(C/T), the probability someone drinks coffee given they drink tea? (C is coffee drinker, T is tea drinker.) This is similar to calculating the probability of one of the squares as in the previous example, but this time our ‘total’ is different. Why? Think about the phrasing of the question: out of all the tea drinkers, what is the probability one drinks coffee? So we would divide square A by square H.

                    Tea drinker    Tea drinker’    Total
Coffee drinker      A: 122         C: 132          E: 254
Coffee drinker’     B: 321         D: 98           F: 419
Total               H: 453         I: 230          J: 683

(As printed, the Tea drinker column adds to 443 rather than 453; the working here uses the printed totals.)

So what is P(C/T)?
P(C/T) = P(T∩C)/P(T) = 122/453 = 0.27

What is P(T’∪C)? What is P(T’∪T)? What is P(C’/T’)?

$\text{Probability}(A \text{ given } B) = \frac{\text{Probability}(A \,\&\, B)}{\text{Probability}(B)}$

Probability trees

Probability trees are another useful tool for determining probabilities. They use a series of probability steps. Probability trees can easily get very large and cumbersome compared to contingency tables, but are especially useful in visualising conditional probability. Each step represents a probability step or ‘event’.
Each path is an event occurring or its complementary event. The next step is the next event occurring. Trees can have any number of events (and each step can have any number of branches).

Event A    Event B    Probabilistic outcome
A          B          (A&B)
A          B’         (A&B’)
A’         B          (A’&B)
A’         B’         (A’&B’)

It is easy to visualise probabilities by using trees. We can also easily calculate conditional probability: P(A/B) = P(A&B)/P(B)

Venn diagrams

Venn diagrams are the most common way to draw several events together. They are, however, not as useful as tree diagrams or contingency tables. (If you can use contingency tables or trees, it is preferable. However, if you are more comfortable using Venn diagrams, keep on using them!) The total probability is 1. The probabilities of different events happening are worked out as a fraction or percentage. So if there is a 50% chance of A occurring, P(A) = 0.5 (or 1/2). P(A∪B) is the probability of either A, B or both A and B occurring.

Venn diagrams are invaluable when solving 3 way probabilistic models. Tree diagrams can also be used for multiple step probability models but get very large very quickly! Contingency tables cannot be used as we would require 3 dimensions! 3 way Venn diagrams often have difficult calculations, but as long as your definitions are right, you won’t go wrong. Often you will be given just enough information, and will have to use calculations to find all the unknowns in a Venn diagram. The same probability calculation rules apply to these 3 way diagrams:
P(A&B&C) = P(A) x P(B) x P(C) for independent events
P((A&B)/C) = P(A&B&C)/P(C)
P(A/(B&C)) = P(A&B&C)/P(B&C)
These rules can be used interchanging A, B & C.

Variance

Variance is a measure of how spread out data or probability values are. The greater the variance, the more ‘varied’ our distribution is. For example, suppose we have two random number generators:
Generator 1 makes a number between -50 and 50.
Generator 2 makes a number between -500 and 500.
Both have the same expected value or mean (0), but if we ran both generators, generator 2 would give a much more spread out or ‘varied’ range of numbers. Variance can be calculated by the following formula:

$\mathrm{VAR}(X) = E[X^2] - \mu^2$

The variance of a probability distribution is equal to the expected value of X squared, minus the mean squared. Calculations will likely not be required, but it is important to understand what variance is a measure of: how varied the data is. The greater the spread, the greater the variance and standard deviation! These measure the same characteristic but have different values.

Standard deviation

Standard deviation is another measure of probability spread or data spread. Remember we can look at results or probability functions to determine variance or standard deviation. We calculate standard deviation by first calculating variance:

$\sigma = \sqrt{\mathrm{VAR}(X)}$

Probability distributions

The probability distributions paper takes a different turn from previous years. It is no longer a precise paper with many calculator related questions such as ‘find the expected value’. This year you will be asked to estimate expected values and variance from real world (and non exact) data. Distributions are largely up to interpretation. You will be asked to judge which model would best fit certain data and why. The paper involves calculating and interpreting expected values and standard deviations of discrete random variables. You will also have to apply distributions to data: binomial, normal, Poisson etc. A good understanding of the underlying concepts of probability distributions will equip you to tackle any question the exam poses.

Discrete and continuous data

Although all calculations you encounter will involve discrete data, you will need to understand the differences between the different types. A discrete distribution is a series of data in which values can only take on certain set values, like age in years or goals scored.
A continuous distribution is a range of values which can take on any value. These may be forced to fall within constraints, but there are infinitely many values which can be achieved.

Mean

The mean is the average. This means that if we add all the data values together and divide by the number of values, we will have the mean. The mean is also the ‘expected value’ if we were to take a random value from our data points. Mean values are pulled towards extreme values or outliers on our data plots.

Standard deviation

Standard deviation is a measure of the spread of data. It is related to variance:

$\text{Standard deviation} = \sqrt{\text{Variance}}$

Standard deviation shows how much variation or dispersion exists from the average (mean), or expected value. A low standard deviation indicates that the data points tend to be very close to the mean; a high standard deviation indicates that the data points are spread out over a large range of values.

True, theoretical & experimental distributions

Just as we compared true, theoretical and experimental probability, we will also look at the corresponding distributions. Just like in the probability paper:
True probability is some unknown ‘real’ probability which we can never know or properly calculate. It can only be estimated through a combination of theoretical and experimental probability. So to estimate the ‘true’ probability distribution we would have to combine estimates from our theoretical probability distribution with our experimental probability distribution.
Theoretical distribution is an estimate of what our data spread would look like, based upon mathematical modelling of the event.
Experimental distribution is simply the spread of data we receive when running an experiment or simulation of the event.

Types of distributions…

You will need to know which distribution is best applied to different forms of data.

Binomial distribution

A binomial distribution is the probability distribution of the number of successes in a series of yes/no trials.
Eg we flip 100 coins: what is the probability distribution of the total number of heads? Or what values of total heads can we expect?

A distribution characterised by:
• Only 2 outcomes: success or failure. Eg flipping a coin.
• Fixed number of identical trials. The number of trials is set at the beginning.
• P(success) at each trial is constant. Each individual trial has the same P(success).
• Each trial is independent. Each individual trial has no impact on any other trial.

We know that any binomial distribution must be discrete. This is because each individual trial is either a success or a failure: we can’t have half a success!

Mean and variance of a binomial distribution

If we let X be our random variable (eg the total number of heads), we can calculate the mean and variance using the following formulae:

$E(X) = \mu = n\pi$
$\mathrm{VAR}(X) = \sigma^2 = n\pi(1 - \pi)$

n = number of trials
π = probability of success at each trial
E(X) = expected value
σ = standard deviation (remember σ² = VAR(X))

These are theoretical probability equations. A typical exam question could either ask you to calculate expected means or standard deviations using theoretical models, or supply you with a distribution of discrete data. Questions will ask you what the expected value is, what the standard deviation is and what distribution best fits the data.

Using trees

Binomial distributions can be calculated using tree diagrams. These can provide a good frame to calculate outcomes, but get very complex very quickly!

[Tree diagram: three rolls, each branching into 4 (a four is rolled) or N (not a four), giving eight paths from 4-4-4 to N-N-N.]

In this example each event is a dice roll. A positive event is the roll of a 4 on the dice. A negative event is any other number. For only 3 rolls the tree provides a frame in which calculations can easily be carried out. But as we can see, if we needed to calculate probabilities for 6 or more rolls, the tree would get too complicated! For this tree each branch has P(4) = 1/6 or P(N) = 5/6. Try calculating the probabilities of each combination of rolls.
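The three-roll tree can also be checked by enumeration. The sketch below (Python, with variable names chosen purely for illustration) lists all eight paths, multiplies the probabilities along each path, and groups the paths by the number of fours. It then compares the result with the standard binomial formula, which these notes do not state explicitly, and with E(X) = nπ and VAR(X) = nπ(1 − π).

```python
from fractions import Fraction
from itertools import product
from math import comb

p4 = Fraction(1, 6)  # P(4): a four is rolled
pN = Fraction(5, 6)  # P(N): any other number

# Walk every path of the tree (3 rolls, 2 branches each = 8 paths).
paths = {}
for path in product("4N", repeat=3):
    prob = Fraction(1)
    for outcome in path:
        prob *= p4 if outcome == "4" else pN  # multiply along the branches
    paths["".join(path)] = prob

# Add up the paths that share the same number of fours.
dist = {}
for path, prob in paths.items():
    k = path.count("4")
    dist[k] = dist.get(k, 0) + prob

# Same distribution from the binomial formula C(n, k) * pi^k * (1 - pi)^(n - k).
n, pi = 3, p4
formula = {k: comb(n, k) * pi**k * (1 - pi) ** (n - k) for k in range(n + 1)}

mean = n * pi                 # E(X) = n * pi
variance = n * pi * (1 - pi)  # VAR(X) = n * pi * (1 - pi)

for k in sorted(dist):
    print(k, dist[k], formula[k])
print("mean", mean, "variance", variance)
```

Using exact fractions rather than decimals keeps the path probabilities readable (for example 1/216 for three fours) and makes it easy to confirm that the eight paths add to 1.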
Poisson distribution

Poisson distributions work when we have a set frame or time for events to occur and we want to figure out the probability of different numbers of events occurring in that time frame. EG: How many cars pass along a road in an hour? It is a special kind of distribution based on counting the number of times an event occurs in a time interval.

• The Poisson distribution is discrete. It can only take on whole number values.
• Imagine a distribution of the number of cars passing down a street in a given hour.
• The distribution has an expected value, but as we are examining the number of events in a given time interval, the number of events cannot be below 0.
• This gives a lopsided distribution: the Poisson distribution.
• A Poisson distribution must follow certain conditions:
- Each occurrence is independent of other occurrences
- Events cannot occur simultaneously
- Events occur at random, and are unpredictable
- For a small interval the probability of the event occurring is proportional to the size of the interval

$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$

It is a good idea to familiarise yourself with the shape of the Poisson distribution as you may have to match it to real world data.

• The Poisson distribution has only one parameter: λ. This makes calculations considerably easier once the concept is grasped.
• There is only one parameter because the variance and the mean (μ) are both equal to λ.
• So the mean value is equal to the variance, our measure of how much the data is spread out.
• This link between mean and variance is a distinguishing feature of the Poisson distribution. The shape is controlled entirely by λ: for small λ the distribution is strongly lopsided, and as λ increases it moves to the right, spreads out and becomes more symmetrical.

Calculations with Poisson…

$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$
As λ is our only parameter (it is both the variance and the mean number of events in a given interval), it is all we need for any Poisson calculation. Remember, for any distribution, total probability = 1. (This is equivalent to saying the probability that anything happens is 100%.)

Have a go at calculating this question. The mean number of eruptions a year is 4.7. What is the probability more than 2 will occur? We use the fact that Poisson only has discrete values. λ = 4.7 and we want P(X>2), so

$P(X > 2) = 1 - P(X=0) - P(X=1) - P(X=2)$

You may need to apply reverse Poisson to find λ when given a certain probability. This will require basic algebraic manipulation. Given that we know a distribution follows Poisson, and P(X=0) = 0.001, calculate λ. These problems will always be stated in the form P(0) = … or P(X≥1) = …. Hence we only need P(X=0), which is either given directly or found from 1 − P(X≥1). Have a go at this reverse Poisson problem.

$P(X = 0) = \frac{e^{-\lambda}\lambda^{0}}{0!} = \frac{e^{-\lambda} \times 1}{1} = e^{-\lambda} = 0.001$

So $\lambda = -\ln(0.001) = 6.9078$. Remember λ is also the variance!

Harder Poisson problems may involve probability trees and conditional probability. Eg the mean number of accidents on SH1 on a weekday = 1, and on a weekend day = 2. What is the probability that 3 accidents occur on a day chosen at random? First draw a tree!

P(weekend) = 2/7, λ = 2: $P(3 \text{ given weekend}) = \frac{e^{-2}\,2^{3}}{3!}$
P(weekday) = 5/7, λ = 1: $P(3 \text{ given weekday}) = \frac{e^{-1}\,1^{3}}{3!}$

$P(X = 3) = \frac{5}{7}\,P(3 \text{ given weekday}) + \frac{2}{7}\,P(3 \text{ given weekend})$

You may also be asked conditional problems: if no crashes occur on SH1, what is the probability it is a weekday? First we need P(X=0) for weekdays and P(X=0) for weekends.

Weekends: $P(X=0) = \frac{e^{-2}\,2^{0}}{0!} = 0.13534$
Weekdays: $P(X=0) = \frac{e^{-1}\,1^{0}}{0!} = 0.3679$

In addition, there are 5 weekdays for every 2 weekend days. So

$P(\text{weekday given no crash}) = \frac{\frac{5}{7}\,P(\text{no crash given weekday})}{\frac{5}{7}\,P(\text{no crash given weekday}) + \frac{2}{7}\,P(\text{no crash given weekend})} = \frac{\frac{5}{7}(0.3679)}{\frac{2}{7}(0.13534) + \frac{5}{7}(0.3679)} \approx 0.8717$

(If there was a day without crashes, it is far more likely that it occurred on a weekday.)
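The Poisson problems above can be checked numerically. This is a minimal Python sketch using only the standard library; the helper name `poisson_pmf` and the variable names are illustrative, and the 5/7 and 2/7 weights assume, as in the SH1 example, that the day is chosen at random from a seven-day week.

```python
from math import exp, factorial, log


def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x! for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**x / factorial(x)


# Eruptions: lambda = 4.7, P(X > 2) = 1 - P(0) - P(1) - P(2).
p_more_than_2 = 1 - sum(poisson_pmf(x, 4.7) for x in range(3))

# Reverse Poisson: P(X = 0) = e^(-lambda) = 0.001, so lambda = -ln(0.001).
lam_from_p0 = -log(0.001)

# SH1 accidents: weekday lambda = 1, weekend lambda = 2.
p_weekday, p_weekend = 5 / 7, 2 / 7
p_three = p_weekday * poisson_pmf(3, 1) + p_weekend * poisson_pmf(3, 2)

# Conditional: P(weekday given no crash) = P(weekday and no crash) / P(no crash).
p_none = p_weekday * poisson_pmf(0, 1) + p_weekend * poisson_pmf(0, 2)
p_weekday_given_none = p_weekday * poisson_pmf(0, 1) / p_none

print(round(p_more_than_2, 4))
print(round(lam_from_p0, 4))
print(round(p_three, 4))
print(round(p_weekday_given_none, 4))
```

The last two calculations follow the tree: multiply along each branch (type of day, then number of accidents) and add the branches that give the outcome of interest.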
Remember the most important skill is fitting distributions to data! Here are some examples of real world Poisson distributions which you would need to recognise in the exam.

Normal distribution

The normal distribution follows a symmetrical bell curve. It is probably the most useful distribution curve and is used in many ways in the real world.

μ is the mean. It marks the centre of the normal curve, roughly the middle of the distribution. The curve will be bell shaped and symmetric around the mean.
Standard deviation or σ is roughly 1/6th of the range of a normal distribution. It measures the typical distance of values from the mean.

Skills in estimating the SD and mean will improve with practice.
μ is shifted by extreme values (remember it is the average).
σ (SD) is stretched by extreme values (it is a measure of spread).

Try estimating the standard deviations and means for yr 12 and yr 9 students. The two graphs give: Mean = 13.1, SD = 2.4; Mean = 9.0, SD = 2.8.

The standard deviation dictates how spread out the curve is. A curve with a high σ will be flatter; a low σ will give a sharp spike. Remember the total area under a distribution is always 1.

• Because normal distributions do not take on discrete values, calculation is a bit more complicated. We can no longer calculate P(X = 1), because the probability that X is exactly 1.000000… approaches 0.
• Instead we use normal distributions to calculate probabilities such as P(X < 1).
• We use the standard score Z to calculate values, where $Z = \frac{X - \mu}{\sigma}$. Z is the number of standard deviations X is from the mean.

Using a calculator

The first thing we need to do before solving calculations using a graphics calculator is to convert to a ‘standard curve’ with μ = 0 and σ = 1. This is where Z comes into play. We use Z to standardise our curve, then our calculators can find probabilities using the reference curve.

EG What is the probability a carton of eggs weighs less than 200 g if the mean is 205 grams and the standard deviation is 3? We first need to convert to a standard curve.
• First adjust our mean to 0 g. So we want P(X < -5) when SD = 3.
• $Z = \frac{X - \mu}{\sigma} = \frac{-5 - 0}{3} = -\frac{5}{3}$

On your calculator… > menu > stat > dist (tab) > norm (tab) > Ncd (the cumulative option, since we want the probability below a value)

In the real world, distributions are never exactly normally distributed. The normal distribution is a theoretical model which is useful because it enables us to calculate probabilities for a distribution based only on the mean and standard deviation. If the population we are considering has a distribution which is approximately normal and we have good estimates for µ and σ, then the probability estimates we make for the population using a normal distribution model are going to be close to the actual population probabilities. We call such populations normally distributed to indicate that they have the characteristics of a normal distribution and can usefully be modelled by a normal distribution. Similar considerations apply to modelling using uniform, triangular, binomial or Poisson distributions.

The normal distribution can give reasonably accurate estimates for probabilities of distributions of populations if they have the following characteristics: unimodal; reasonably symmetric; frequency of the observations falls off rapidly as measurements get further from the central value; few or no extreme values.

A uniform distribution is the best choice of model when there are lower and upper limits to possible values and little information about the shape of the distribution, or when the context and shape of the sample distribution suggest a uniform distribution.

A triangular distribution is the best choice when there is information about lower and upper limits and the mode, but little information about the shape apart from that, or when the context and shape of the sample distribution suggest a triangular distribution.
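Returning to the egg carton example (mean 205 g, standard deviation 3 g), the calculator steps can be mirrored in a few lines of Python. This is a sketch under the assumption that carton weights follow a normal model; `normal_cdf` is a hypothetical helper name, built on the standard library's error function rather than on any calculator routine.

```python
from math import erf, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma  # standard score: number of SDs x is from the mean
    return 0.5 * (1 + erf(z / sqrt(2)))


# Egg cartons: mu = 205 g, sigma = 3 g, and we want P(X < 200).
z = (200 - 205) / 3
p_under_200 = normal_cdf(200, mu=205, sigma=3)

print(round(z, 4))
print(round(p_under_200, 4))
```

Standardising first and then reading the probability from the standard curve gives the same number, since `normal_cdf(z)` with the default μ = 0 and σ = 1 is exactly that lookup.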
Whether a sample distribution is consistent with being from a population which could be modelled by a given distribution is a matter of judgement. Contextual knowledge should be used to decide whether a model is useful, along with the characteristics of the distribution. For example, a small sample (eg n<30) may not look normal but could be consistent with being from an underlying normal distribution, while a large sample (eg n>200) from a normally distributed population would be expected to look approximately normal. Uniform and triangular distributions can be discrete or continuous. Unlike real world distributions, the underlying true distribution of a probability situation may in some cases be exactly modelled by a theoretical distribution. Students should have the opportunity to see how changing the bin (class) width changes the appearance of a histogram.

Limitations of real world data

It is easy to think of real world data as being perfect: normal distributions are purely normal and will always look normal. However, we know that distributions are very rarely exactly normal, and only closely resemble a normal distribution. Often samples from normal distributions may not even appear normal! We think of a sample size of 30+ as being large enough to employ the central limit theorem, but this is often not enough to give a good looking normal distribution. Realistically we need sample sizes above 200 for a sample to look convincingly normal.

[Figure: small sample size of 30]

Heights of people will give a very good approximation of a normal distribution. However, this distribution looks anything but normal! We will need a sample size much larger than 30 to give a good looking distribution. You need to be aware that if a sample size is small, it is unwise to comment on its distribution, as it could turn out to be anything.

This distribution looks a lot better.
Even though both are sampling from the same population, the larger sample size has a huge impact on the appearance of the distribution.

Errors in interpretation can also arise through poorly placed bin boundaries for a histogram. If the bars are not in line with the mean, the data can look skewed and not normal even if the individual points closely resemble a normal distribution.

[Figure: histogram, “height of 500 NZ men”; x-axis: height (cm), bins from -165 to >190; y-axis: frequency, 0 to 200]

This sample, although having a large sample size, appears skewed to the side. Why? Poorly placed boundary lines on our histogram. The data is normal, but because the mean sits to the left of the histogram bar, the distribution will always appear skewed to one side. Histograms can be dangerous, as even normal distributions can appear ‘not normal’.

Continuous probability functions and probability density functions

The calculations used to find probabilities within certain limits for the normal distribution used continuous probability theory (the Poisson distribution, by contrast, is discrete). Continuous probability distributions, or density functions, use non discrete data. They never match real life data exactly, but provide a theoretical model.

Triangular, uniform and other distributions

Sometimes you may be presented with data that does not fit any of the above distributions and you may have to improvise a distribution. In certain situations we may only know a very small amount of information. Eg: My own bus route (277) runs only every half hour, and isn’t as reliable as the inner link. I know that the bus is most likely to appear on time, but could in fact turn up at any time between the time it is due and half an hour later. It is never early. The distribution has a max at t=0 and a min at t=30. Other than that we know nothing about the probability of arrival times. We could fit any of these models to the data. How do we determine which is the best model to use?
The first one can be ruled out because we know we have a max at the beginning and a min at the end. Aside from that, we apply a rule of ‘using the distribution of the least complexity’. We will use the last distribution as it is the simplest. We can use this distribution to calculate probabilities such as P(Bus arrives after 10 minutes). This is obviously a rough estimate of the true probability, but it is the best theoretical model we can make with the given information.

Question structure

The basic question structure for distributions will show a real world random data set.
• You will be asked to provide an appropriate theoretical model for this data set.
• You may be asked to estimate the mean and standard deviation for the data set.
• You may also be asked to interpret the implications of your chosen model, or of the mean or standard deviation.
• You may also be asked to discuss limitations of your model, or whether or not the mean or SD seems accurate using the context of the data.
You will also need to brush up on your pure theoretical skills: calculating the SD and mean of different distributions. You will also need to have a firm grasp of confidence interval calculations.

Question examples…

1. Seeds are planted in rows of six. After 14 days the number of seeds which have germinated in each of the 100 rows is noted. The results are shown in the table:

Number of seeds germinating    0    1    2    3    4    5    6
Number of rows                 2    1    2    10   30   35   20

Find the theoretical frequencies of 0, 1, 2, …, 6 seeds germinating in a row, using an associated theoretical distribution.

2. In a large batch of items from a production line the probability that an item is faulty is p. 400 samples, each of size 5, are taken and the number of faulty items in each batch is noted. Estimate p from the frequency distribution given in the table.
Use the theoretical binomial distribution with the same mean to estimate the number of samples which would be Number of faulty 0 1 2 3 4 5 expected to have more than one faulty item, if 600 items samples were taken from the production line. Frequency 297 90 10 2 1 0 3. On average 20% of the bolts produced by a machine in a factory are faulty. Samples of ten bolts are to be selected at random from the bolts produced that day. a) Calculate the probability that, in any one sample, two or fewer bolts will be faulty. b) Find the expected value and standard deviation of the number of bolts in a sample which will not be faulty. 4. In a large batch of items from a production line the probability that an item is faulty is p. 400 samples, each of size 5, are taken and the number of faulty items in each batch is noted. a. Estimate p from the frequency distribution given in the table. b. Select a theoretical distribution to model this situation, and justify its use. Use it to estimate the number of samples of size five which would be expected to have more than one faulty item, if 600 samples were taken from the production line. 5. The number of emergency admissions each day to a hospital varies. The mean number of admissions is 2 with a standard deviation of 1.5. Select a suitable theoretical distribution to model this situation, justify your choice, and use the distribution to answer the following: a. Evaluate the probability that on a particular day, there will be no emergency admission. b. At the beginning of one day the hospital has 5 beds for emergencies. Calculate the probability that this will be an insufficient number for the day. c. Calculate the probability that there will be exactly three admissions on two consecutive days. 6. days A firm selling electrical components records the number of new orders received over a period of 150 Number of new orders 0 1 2 3 4 Number of days 51 54 36 6 3 a. Find the average number of new orders per day b. 
Use an appropriate theoretical distribution to calculate the probability that there will be 5 or more orders in a day. Justify your choice of distribution. c. The firm packs the electrical components in boxes of 60. On average 2% of the components are faulty. What is the chance of getting more than two defective components in a box? 7. On average 20% of the bolts produced by a machine in a factory are faulty. Samples of ten bolts are to be selected at random from the bolts produced that day. a. Calculate the probability that, in any one sample, two or fewer bolts will be faulty. b. Find the expected value and standard deviation of the number of bolts in a sample which will not be faulty. c. State any assumptions you have made in answering this question, and comment on whether the assumptions were valid. 8. a. National records for the past 100 years were examined to find the number of deaths in each year due to lightening. The most deaths were in any year were four which was recorded once. In 35 years no death was observed and in 38 years only one death. The mean number of deaths per year was 1.00. Draw up a frequency table of the number of deaths per year, and estimate the corresponding expected frequencies for an associated theoretical distribution having the same mean. b. Justify your choice of theoretical distribution to model the number of deaths per year. 9. In one trial of an experiment a certain number of dice are thrown and the number of sixes rolled is recorded. The dice are all biased the same way, and the probability of getting a six in one throw is p. The results of sixty trials are shown in the table. Number of sixes rolled 0 1 2 3 4 >4 frequency 19 26 12 2 1 0 Choose a theoretical distribution to model this situation. By comparing these results with those expected for the theoretical distribution, estimate the number of dice thrown in each trial, and the value of p. 10. 
The manager of a processing plant noticed during the course of a morning that one of her employees was often idle. She decided to record when the employee was active or idle over a 3 hour period. The results are given in the table.

   Time (am)       Status
   7:00 – 7:32     idle
   7:32 – 8:20     active
   8:20 – 8:30     idle
   8:30 – 9:30     active
   9:30 – 10:00    idle

   a) If the manager had walked through the processing plant at a random time between 7:00 am and 10:00 am, determine the probability that she would have found the employee idle.
   b) If the manager had randomly observed the employee for 6 one-minute time periods between 7 and 10 am, justify the use of the binomial distribution to model the situation.
   c) Find the probability that the employee would have been idle for all 6 random one-minute observations.
   d) Find the probability that the employee would have been active less than half the time.

Distributions practice - ANSWERS

1. Binomial, x̄ = 4.5 = 6p, p = 0.75. Expected frequencies: 0, 0, 3, 13, 30, 36, 18. Justification using assumptions of binomial distribution supported by variance ≈ npq or similarity of experimental observation to the binomial model.

2. a. Poisson, x̄ = 2.767, sd = 1.511, variance = 2.860. Justify using assumptions of Poisson supported by either mean ≈ variance or similarity of experimental observation to the Poisson model.
   b. 0.7632
   c. 0.06285

3. a. Poisson, x̄ = 0.5
   b. Justify using assumptions of Poisson supported by either mean ≈ variance or similarity of experimental observation to the Poisson model. P(X > 5) = 0.000014 ≈ 0
   c. Binomial n = 144, p = 0.03, P(X ≥ 2) = 0.932

4. a. Binomial, x̄ = 0.3 = np, n = 5, p = 0.06
   b. Justification using assumptions of binomial distribution supported by variance ≈ npq or similarity of experimental observation to the binomial model. In one sample of size 5, P(X > 1) = 0.0319. In 600 samples, expect 19 to have more than one faulty item.

5. Poisson, justify using assumptions of Poisson supported by mean ≈ variance.
   a. P(X = 0) = 0.1353
   b. P(X > 5) = 0.0165
   c. On one day P(X = 3) = 0.180, P(two days in a row) = 0.180² = 0.0324

6. a. x̄ = 1.04
   b. Poisson, justify using assumptions of Poisson supported by either mean ≈ variance or similarity of experimental observation to the Poisson model. P(X ≥ 5) = 0.0043
   c. Binomial n = 60, p = 0.02, P(X ≥ 2) = 0.3381

7. a. Binomial n = 10, p = 0.2, P(X ≤ 2) = 0.6778
   b. Expected value 8, standard deviation 1.3

8. Deaths       0      1      2      3      4
   Observed    35     38    (20)    (6)     1
   Expected    36.8   36.8   18.4    6.1    1.9

   Poisson, justify using assumptions of Poisson supported by either mean ≈ variance or similarity of experimental observation to the Poisson model.

9. Binomial, np = 1, np(1-p) = 0.8000, 1-p = 0.8, p = 0.2, n = 5

10. a. idle = 32 + 10 + 30 = 72 minutes, P(idle) = 72/180 = 0.4
    b. fixed number of trials n = 6, each observation independent (assumed), constant probability p = 0.4, two outcomes (active or idle)
    c. P(X = 6) = 0.0041
    d. P(X ≥ 4) = 0.1792

11. Li Ching-Yuen was a Chinese herbalist and longevity expert who was known to have died in 1928. He claimed to have been born in 1734, giving him a lifespan of 196 years. Investigations into birth records indicated that he was actually born in 1678, giving an even longer lifespan of 250 years! Whilst this may seem unbelievable, is it? In this question we use statistics to look into the lifespan of very old people.

Whilst there is no conclusive historical evidence to support the birth date of Li Ching-Yuen, the following data concerning lifespans are known [at the time of writing this question (October 2008); sources given below]:
There were about 450000 people in the world aged over 100.
There were 82 living people who were known to be over the age of 110.
There were 2 people known to be over the age of 115 (ages 115 and 116).
There are 31 unverified claims of people over the age of 110, two of whom claimed to be aged 115 and 116.
In the past 50 years, 25 people are known for certain to have lived beyond the age of 115.
In the past 50 years, 2 people are known for certain to have lived beyond the age of 120 (dying at ages 120 and 122).

A hypothesis H is made saying: once you make it to your 100th birthday, there is a fixed probability p of surviving to your next birthday on any given subsequent birthday. For example, if p were 0.05 then the hypothesis says that on my 100th birthday there is a 5% chance of surviving until I am 101; on my 101st birthday there would be a 5% chance of surviving until I am 102, and so on.

Does the data approximately fit this hypothesis? What values of p would seem most appropriate?

Assume that the hypothesis is true with a generous value of p = 0.5. With this hypothesis, how many 100 year olds would need to be in a room before we might feel confident that one would live to the age of 196 suggested by Li Ching-Yuen himself? How does this number compare with the number of people on earth today (6.7 billion)?

Extension: There are many statistical complications involved in predicting death rates. How many can you think of? How might these affect these statistics in future?

Poisson practise:

1. The number of telephone calls received per minute at the switchboard of a certain office was logged during the period 10 a.m. to noon on a working day. The results were as in the following table. f is the number of minutes with x calls per minute. By consideration of the mean and variance of this distribution show that a possible model is a Poisson distribution. Using the calculated mean and on the assumption of a Poisson distribution calculate
   a. The probability that two or more calls were received during any one minute.
   b. The probability that no calls were received during any one minute.

2. (a) The number of accidents notified in a factory per day over a period of 200 days gave rise to the following table:

   Number of accidents     0    1    2    3    4    5
   Number of days        127   54   14    3    1    1

   i. Calculate the mean number of accidents per day.
   ii. Assuming this situation can be represented by a suitable Poisson distribution, calculate the corresponding frequencies.
   (b) Of the items produced by a machine, approximately 3% are defective and those occur at random. What is the probability that, in a sample of 144 items, there will be at least two which are defective?

3. The number of emergency admissions each day to a hospital is found to have a Poisson distribution with a mean of 2.
   a. Evaluate the probability that on a particular day, there will be no emergency admission.
   b. At the beginning of one day the hospital has 5 beds for emergencies. Calculate the probability that this will be an insufficient number for the day.
   c. Calculate the probability that there will be exactly three admissions on two consecutive days.

4. (a) The following table shows the number of phone calls I received over a period of 150 days.

   Number of calls    0    1    2    3    4
   Number of days    51   54   36    6    3

   (i) Find the average number of calls per day.
   (ii) Calculate the frequencies of a comparable Poisson distribution.
   (b) A firm selling electrical components packs them in boxes of 60. On average 2% of the components are faulty. What is the chance of getting more than two defective components in a box?

5. (a) The number of organic particles in a volume V cm3 of a certain liquid follows a Poisson distribution with a mean of 01.V. Find the probabilities that a sample of 1 cm3 of the liquid will contain
   (i) At least one organic particle
   (ii) Exactly one organic particle
   (b) The liquid is sold in vials, each vial containing 10 cm3 of the liquid. The vials are dispatched for sale in boxes, each box containing 100 vials. Find the probability that a vial will contain at least one organic particle. Hence find the mean and the standard deviation of the number of vials per 100 vials that contain at least one organic particle.

6. (a) National records for the past 100 years were examined to find the number of deaths in each year due to lightning. The most deaths in any year was four, which was recorded once. In 35 years no death was observed and in 38 years only one death. The mean number of deaths per year was 1.00. Draw up a frequency table of the number of deaths per year, and estimate the corresponding expected frequencies for a Poisson distribution having the same mean. Illustrate both frequency distributions graphically.
   (b) Justify the choice of a Poisson distribution to model the number of deaths per year.

Answers:

1. Mean 2.917, variance 2.860 (a) 0.788 (b) 0.00293
2. (a) (i) 0.5 (ii) 121.3, 60.7, 15.2, 2.5, 0.32, 0.03 (b) 0.929
3. (a) 0.135 (b) 0.017 (c) 0.195
4. (a) (i) 1.04 (ii) 53.0, 55.1, 28.7, 9.9, 2.6 (b) 0.121
5. (a) 2e-4 (b) 0.1353, 0.2707, 0.3233, 0.1782
6. Deaths       0      1      2      3      4
   Observed    35     38    (20)    (6)     1
   Expected    36.8   36.8   18.4    6.1    1.9

91584 Statistical evaluation practice reports

It’s a good idea to skim read the report before answering questions, to get the gist of what is being stated. Read the report and fill in the framework for the analysis as you go. Questions can then be answered on the reports. If you run out of reports to analyse, try using any statistical report. Newspapers are full of them!

Kiwis unlikely to queue up for asset shares – poll
Published: 6:55PM Friday August 10, 2012. Source: ONE News

There may not be a rush of Kiwis buying up shares in Mighty River Power or other state assets, the latest ONE News Colmar Brunton poll suggests.

The partial float of Mighty River Power will take place in October or November if the Government has its way.

The latest poll has support for the sales up two percentage points since the last survey in March but there is still more opposition by nearly two to one.

The poll shows most Kiwis think they have the cash for a splash in the share market. Asked if they could afford the $1000 needed for the minimum share purchase almost 50% say definitely, 11% say probably and the rest didn't know or were unsure.
However the Prime Minister remains optimistic that Kiwis who can, will buy in.

"If you ask the question if the programme is definitely going to go ahead, will people support it, then it looks like there's quite a high level of interest in terms of people buying shares," John Key said.

But when asked how likely they are to buy shares, just 13% of people said very likely with 21% saying quite likely, meaning only a third of people appear keen to invest. That leaves 65% who aren't likely to buy shares.

The Shareholders Association says due to a lack of financial literacy less than 10% of New Zealanders are directly active in the share market and the poll numbers are the best the Government could hope for.

The association says people are still wary. "There's still the older generation out there who were burnt off in the '87 sharemarket crash, and there's still also their sons and daughters who have seen their mum and dad lose money in the '87 share market crash," says Shareholders' Association director, Grant Diggle.

The ONE News Colmar Brunton poll has a margin of error of plus or minus 3.1%.

Writing frame for critically evaluating a report

Pre-Reading: “Getting the gist”
Read the media report and summarise what it is about in 3 sentences or bullet points.

While reading: “Worry Questions”
Read the media report again, asking appropriate “worry questions” as you go. Record your answers in the boxes below:
Source
Method
Target Group
Who sampled
How selected
Sample size
Margin of error
Questions asked
Key Findings
Claims
What is missing?

Critical Evaluation:
Discuss 2 good aspects of this report
Discuss 2 concerns

Comparing polls
14 May 11. Credit: Electoral Commission

The poll puzzle: Horizon provides a complete electoral picture

Some bloggers supporting various political parties are asking why Horizon's party vote poll results differ from other polls. They point out that other polls, mainly conducted by phone, have National about 20% ahead of Labour. Horizon's polls show a lesser margin, for example, 9.7% on May 14 and 13.8% in April 2011. They therefore claim Horizon's methodology must be faulty, and the HorizonPoll national panel is "self-selected".

However, there is no apples-with-apples comparison. And the HorizonPoll panel is not self-selected.

Most of the telephone pollsters are reaching 1000 respondents and expressing the about 69% who have a party vote preference as a percentage of 100. They exclude undecided and won't say respondents. They therefore are not expressing a complete picture of the 18+ adult population.

People are invited to join the HorizonPoll national online research panel based on the profile of the population at the 2006 census. The panel is, therefore, not self-selected.
Less than 5% of the panel is self-enrolled and an iterative rim weighting system, using up to six factors at one time, including party vote 2008, ensures results are robust within the confidence levels stipulated.

Other pollsters, where they publish what factors they are weighting on in order to make their results representative, appear not to be weighting on 2008 party vote. This opens up room for any larger sampling of any particular parties' voters to possibly affect results.

Horizon also usually uses sample sizes of 1800 or higher, to provide greater reliability in assessing the vote for minor parties. This is important in an MMP environment, in which minor parties have been determining which main party can form a coalition government.

Horizon's party vote results are weighted and expressed as a percentage of the adult population aged 18+ (after filtering by registration and intention to vote detailed below).

National won 32.9% of adult population votes in 2008, Labour 25%, Act 1.7%, NZ First 3%, Green 4.9%, other parties 3.1%. Some 26.7% did not vote.

Horizon can also analyse the intentions of this significant non-voter group. At April 2011 it appeared about 60,000 of them were again expressing a party vote preference and were intending to vote. This too could have a major bearing on the outcome of the November 26 general election.

Horizon produces what we call a Net Potential Vote poll. We take the responses of decided voters, as others do. We also ask the undecided group, which has been varying in size between 12.8 and 23%, if they have a preference. These preferences are added to the decided group - and then those who are not eligible to vote (so can't) are excluded, along with those who definitely will not or may not vote.

The results we publish are therefore for decideds + undecideds with a preference - all of whom are on the electoral rolls and say they are likely to or will definitely vote.
For further information please contact:
Grant McInman
Manager, Horizon Research
Telephone: +64 (21) 076 2040

Writing frame for critically evaluating a report

Pre-Reading: “Getting the gist”
Read the media report and summarise what it is about in 3 sentences or bullet points.

While reading: “Worry Questions”
Read the media report again, asking appropriate “worry questions” as you go. Record your answers in the boxes below:
Source
Method
Target Group
Who sampled
How selected
Sample size
Margin of error
Questions asked
Key Findings
Claims
What is missing?

Critical Evaluation:
Discuss 2 good aspects of this report
Discuss 2 concerns
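
When filling in the "Sample size" and "Margin of error" boxes, it can help to check whether the two figures in a report are consistent with each other. The sketch below is only a rough check, not the pollsters' own method: it assumes a simple random sample, a 95% confidence level (z = 1.96) and the worst case p = 0.5, none of which are stated in the reports above. The function names are illustrative.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_size_for(moe, p=0.5, z=1.96):
    """Smallest simple random sample size giving a margin of error of at most moe."""
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

# Sample sizes mentioned in the Horizon report: about 1000 (phone polls) and 1800 (Horizon).
for n in (1000, 1800):
    print(n, "respondents: about", round(100 * margin_of_error(n), 1), "percentage points")

# Sample size implied by a quoted margin of error of 3.1% (assumption: simple random sample).
print("implied sample size:", sample_size_for(0.031))
```

Under these assumptions a sample of about 1000 gives a margin of error of roughly 3.1 percentage points, and a sample of 1800 gives roughly 2.3. Weighted or panel-based polls may not behave like a simple random sample, which is itself a useful "worry question".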