Probability and Statistics 1 Review Sheet (starred formulae need to be learned)

Representation of Data

Types of Data: Quantitative data has a numerical value and comes in two types. Discrete data has steps between each value (shoe size, number of elephants, money). Continuous data is measured and rounded to the nearest … (time, height, weight, etc.).

Stem and Leaf Diagrams: Used to quickly organise data and see the distribution. The median and mode can be found much more easily from them.
Advantage: Shows all the data values.
Disadvantage: Difficult to see the spread of the data accurately.

Histograms: The area of each bar is proportional to the frequency. The vertical axis often shows frequency density = frequency ÷ class width. To find the frequency of a particular group, work out what one unit square (or one grid square) of the histogram represents, then look at the area you want.
Advantage: Shows the distribution of the data and takes unequal class widths into account.
Disadvantage: Areas are harder to compare than heights.

Cumulative Frequency Diagrams: Show the running total. Find the cumulative frequency in a table, then plot the upper boundary of each class against the cumulative frequency. Use the diagram to find the median, upper quartile and lower quartile for a box plot. It can also give an estimate of, for example, the value below which 60% of the data lies.
Advantage: Quartiles and given percentages of the data are easy to find.
Disadvantage: Two different sets of data cannot easily be compared.

Box and Whisker Plots: Show the quartiles. Use them to compare two different distributions by looking at the interquartile range, the medians, and the highest and lowest values. If Q3 - Q2 = Q2 - Q1, the data is symmetrical. If Q3 - Q2 > Q2 - Q1, the data has positive skew. If Q3 - Q2 < Q2 - Q1, the data has negative skew.
Advantage: Two distributions can easily be compared.
Disadvantage: The specific data values are not shown.

Measures of Location & Spread

Mean: $\bar{x} = \frac{\sum x_i}{n}$ *; for a frequency table, $\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$ *.
Advantage: Uses all the information in the data set.
Disadvantage: Can be distorted easily by outliers.
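As a rough illustration of the frequency density and frequency-table mean formulae above, the following Python sketch works through a small grouped table. The class boundaries and frequencies are made up for the example (they are not from this sheet), and using the class midpoint as x_i for grouped data is an assumption of the sketch.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency).
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 8)]


def frequency_density(lower, upper, frequency):
    """Height of a histogram bar: frequency divided by class width."""
    return frequency / (upper - lower)


def grouped_mean(table):
    """Estimate sum(x_i * f_i) / sum(f_i), taking x_i as the class midpoint (assumption)."""
    total_frequency = sum(f for _, _, f in table)
    weighted_sum = sum((lo + hi) / 2 * f for lo, hi, f in table)
    return weighted_sum / total_frequency


densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]
mean_estimate = grouped_mean(classes)

print(densities)
print(mean_estimate)
```

For this table the densities are 0.5, 1.2 and 0.4, and the estimated mean is 445 ÷ 25 = 17.8. Note that the widest class has the lowest bar even though it does not have the lowest frequency.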
Median: The middle value, normally found from a cumulative frequency diagram or a stem and leaf diagram. If n is odd, the median is the ½(n+1)th value. If n is even, the median is halfway between the ½n th value and the following value.

Mode: The most common value; not often used.

Range: Largest value minus smallest value.
Disadvantage: Doesn't tell you much about the pattern of the distribution.

Interquartile Range: Upper quartile minus lower quartile. Useful for seeing where the middle 50% of the data lies.

Standard Deviation and Variance: Show the spread of the data about the mean. They use every data value (unlike the interquartile range).
Variance = $\frac{1}{n}\sum x_i^2 - \bar{x}^2$ *. The standard deviation is the square root of the variance.
Variance from a frequency table = $\frac{\sum x_i^2 f_i}{\sum f_i} - \bar{x}^2$ *.

Sometimes a question asks for the mean and standard deviation when a constant, for example 100, has been subtracted from every data value. In these cases the question will include $\sum (x - 100)$ or similar, which means 100 has been taken off each data value. If this happens, remember that the true mean is the one calculated plus 100, but the standard deviation stays the same.

Probability

Sample Space: A list or table of all possible outcomes. Remember that the probabilities of all the possible outcomes always add up to 1, and no probability can be greater than 1.

P(A or B) = $P(A \cup B) = P(A) + P(B) - P(A \cap B)$: the probability of A, plus the probability of B, minus the probability that both A and B have happened.

P(A and B) = $P(A \cap B) = P(A) \times P(B \mid A)$: the probability of A multiplied by the probability of B given that A has happened. Use tree diagrams to help you calculate these.

Permutations and Combinations

Permutations: The number of ways of arranging n distinct objects is n!. The number of different permutations (where order matters) of r objects picked from n distinct objects is $^nP_r = \frac{n!}{(n-r)!}$. An example is the number of different ways that the gold, silver and bronze medals can be won in a race of 8 people, which is $^8P_3 = \frac{8!}{5!} = 336$. If the objects are not distinct, you need to take this into account.
So, the number of ways of arranging n objects, when p are the same, q are the same, r are the same, etc., is $\frac{n!}{p!\,q!\,r!\cdots}$, where p + q + r + … = n. For example, how many different ways can you arrange the letters in Ecclesbourne? $\frac{12!}{3!\,2!}$, as there are 12 letters, including 3 E's and 2 C's.

Combinations: Order doesn't matter. An example is the number of ways that different people can finish in the top 3 of a race, where it doesn't matter who gets gold, silver or bronze. Repeated combinations are ignored. $^nC_r = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}$.

Discrete Probability Distributions

Random Variable: A quantity whose value depends on chance.

Probability Distributions: A listing of all the possible values of a random variable and the corresponding probabilities. For an experiment, we can estimate how many times a value will come up using expected frequency = total frequency × probability. E.g. if a die is thrown 360 times and the probability of getting an even number is ½, then we would expect 360 × ½ = 180 even numbers.

If a probability distribution is not binomial or geometric (see below) we can use $E(X) = \mu = \sum x_i p_i$ and $\mathrm{Var}(X) = \sigma^2 = \sum x_i^2 p_i - \mu^2$ to find the expected value and variance of X. We normally use these formulae when the distribution is given in a table.

Binomial Distribution

Assumptions:
(1) A single trial has just two possible outcomes (success and failure).
(2) There is a fixed number of trials, n.
(3) The outcome of each trial is independent of the outcome of all the other trials.
(4) The probability of success at each trial, p, is constant.

The binomial distribution has two parameters, n (the number of trials) and p (the probability of success). To say that X is binomially distributed with parameters n and p we write X~B(n,p). We use the binomial distribution to find the probability of r successes in the n trials using the formula $P(X = r) = {}^nC_r\, p^r (1-p)^{n-r}$, where 1 - p is the probability of failure. We can also use the cumulative binomial tables from the formula book. Each value shows $P(X \le x)$.
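The binomial probability formula and the cumulative values given in the tables can be checked with a short Python sketch. The function names and the values n = 10, p = 0.5 are chosen purely for illustration and do not come from this sheet.

```python
from math import comb


def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p): nCr * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def binomial_cdf(x, n, p):
    """P(X <= x): the kind of value listed in cumulative binomial tables."""
    return sum(binomial_pmf(r, n, p) for r in range(x + 1))


# Illustrative parameters (an assumption, not from the sheet).
n, p = 10, 0.5
prob_exactly_3 = binomial_pmf(3, n, p)
prob_at_most_3 = binomial_cdf(3, n, p)
total_probability = binomial_cdf(n, n, p)  # all outcomes together

print(prob_exactly_3, prob_at_most_3, total_probability)
```

With these values P(X = 3) = 120/1024 ≈ 0.117 and P(X ≤ 3) = 176/1024 ≈ 0.172, and the probabilities of all outcomes add up to 1.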
The tables can be manipulated to find "greater than or equal to" probabilities and probabilities for a range of values, for example $P(X \ge x) = 1 - P(X \le x-1)$ and $P(a \le X \le b) = P(X \le b) - P(X \le a-1)$. If you practise using them it will save lots of time in the exam.

Expectation and Variance: $E(X) = np$, $\mathrm{Var}(X) = np(1-p)$.

We use the binomial distribution when we have a fixed number of trials.

Geometric Distribution

Assumptions:
(1) A single trial has just two possible outcomes (success and failure) and these are mutually exclusive.
(2) The outcome of each trial is independent of the outcome of all the other trials.
(3) The probability of success at each trial is constant.
(4) The trials are repeated until a success occurs.

The geometric distribution has one parameter, p, the probability of success. To say that X is geometrically distributed with parameter p we write X~Geo(p). We use the geometric distribution to find the probability that the first success comes at the xth trial using the formula $P(X = x) = p(1-p)^{x-1}$, where 1 - p is the probability of failure. To find $P(X \ge x)$ use $1 - P(X \le x-1)$. The probability that at least a trials are needed is $(1-p)^{a-1}$.

Expectation: $E(X) = \frac{1}{p}$.

We use the geometric distribution when we want to keep on going until we have a success.

Correlation

The product moment correlation coefficient (r) measures how close the points of a scatter graph are to a straight line. r lies between -1 and 1, where 1 is perfect positive correlation (line goes from bottom left to top right), -1 is perfect negative correlation (line goes from top left to bottom right) and 0 is no correlation (points randomly scattered). Calculate it using the formula

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, where $S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2$, $S_{xy} = \sum x_i y_i - \frac{1}{n}\sum x_i \sum y_i$, $S_{yy} = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2$.

You will normally be given the information you need in the question.

Spearman's rank correlation coefficient measures the correlation between the ranks of the two data sets. 1 indicates that the ranks are the same, -1 indicates that the ranks are the complete opposite and 0 indicates little agreement between the two rankings.
To calculate Spearman's rank correlation coefficient, rank the items in each data set from 1 to n and then find the difference (d) between the two rankings for each pair of data. Square the differences (d²) and then use the formula $r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$. Spearman's rank coefficient can be 1 even if the points lie on a curve, because the ranks still agree: as x increases, so does y.

Regression

We can find a regression line to predict values. The least-squares regression line of y on x is $y = a + bx$, where $b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$. Use this line to predict y when we know the x value. The least-squares regression line of x on y is $x = a' + b'y$, where $b' = \frac{S_{xy}}{S_{yy}}$ and $a' = \bar{x} - b'\bar{y}$. Use this line to predict x when we know the y value.
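The correlation and regression calculations in this sheet can be sketched in Python as follows. The data values and function names are invented for illustration, and the ranking helper assumes there are no tied values (the sheet does not say how to handle ties).

```python
from math import sqrt


def summary_sums(xs, ys):
    """Return S_xx, S_yy and S_xy for paired data."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xx, s_yy, s_xy


def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    s_xx, s_yy, s_xy = summary_sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)


def regression_y_on_x(xs, ys):
    """Return (a, b) for the least-squares line y = a + b x."""
    s_xx, _, s_xy = summary_sums(xs, ys)
    b = s_xy / s_xx
    a = sum(ys) / len(ys) - b * sum(xs) / len(xs)
    return a, b


def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman's rank coefficient r_s = 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical paired data, not taken from the sheet.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 3, 7]

r = pmcc(xs, ys)
r_s = spearman(xs, ys)
a, b = regression_y_on_x(xs, ys)
predicted_y = a + b * 3.5  # predict y for a known x inside the data range

print(r, r_s, a, b, predicted_y)
```

For this data S_xx = 10, S_xy = 9 and S_yy = 14.8, so r = 9/√148 ≈ 0.74, the regression line of y on x is y = 1.5 + 0.9x, and the sum of squared rank differences is 6, giving r_s = 0.7.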