Probability and Statistics 1 Review Sheet (starred formulae need to be
learned)
Representation of Data
Types of Data:
Quantitative: Data that has numerical value – two types:
– discrete – data that has steps between each value (shoe size, number of elephants, money)
– continuous – data that is measured and rounded to the nearest unit (time, height, weight etc).
Stem and Leaf Diagrams: Used to quickly organise data and see the distribution – can be used to find the median and mode much more easily.
Advantage: Shows all the data values. Disadvantage: Difficult to see the spread of the data accurately.
Histograms: The area of the bar is proportional to the frequency. Often use
frequency density = frequency ÷ class width.
To find the frequency of a particular group, work out what one unit squared (or one square) on the histogram represents and then look at the area you want.
Advantage: Shows the distribution of the data and takes unequal classes into account.
Disadvantage: Difficult to read, as frequencies must be compared by area rather than height.
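As a quick sketch of the frequency-density calculation above (the class boundaries and counts below are made up purely for illustration):

```python
def frequency_density(classes, frequencies):
    """Frequency density = frequency / class width, one value per class."""
    return [f / (upper - lower) for (lower, upper), f in zip(classes, frequencies)]

# Made-up grouped data with unequal class widths:
classes = [(0, 10), (10, 15), (15, 20), (20, 40)]
frequencies = [20, 15, 10, 30]
densities = frequency_density(classes, frequencies)  # [2.0, 3.0, 2.0, 1.5]
```

Plotting these densities as bar heights makes each bar's area equal to its class frequency.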
Cumulative Frequency Diagrams: Show the running total. Find the cumulative frequency in the table, then plot the upper boundary of each class against the cumulative frequency. Use to find the median, Upper Quartile and Lower Quartile for a box plot. Can also estimate, e.g., the value below which 60% of the data lies.
Advantage: Can easily find quartiles and certain percentages of data
Disadvantage: Cannot easily compare two different sets of data.
Box and Whisker Plots: Show the quartiles – use to compare two different distributions, looking at the interquartile range, medians, and highest and lowest values. If Q3 − Q2 = Q2 − Q1, the data is symmetrical. If Q3 − Q2 > Q2 − Q1, the data has positive skew. If Q3 − Q2 < Q2 − Q1, the data has negative skew.
Advantage: Can easily compare two distributions
Disadvantage: Don’t know specific data values.
Measures of Location & Spread
Mean: x̄ = Σxᵢ / n *; for a frequency table, x̄ = Σxᵢfᵢ / Σfᵢ *.
Advantage: Uses all the information in the data set. Disadvantage: Can be distorted easily by outliers.
Median: Middle value – normally found from a cumulative frequency diagram or a stem and leaf diagram. If n is odd, the median is the ½(n+1)th value. If n is even, the median is halfway between the ½nth value and the following value.
Mode: Most popular value, not often used.
Range: Largest – smallest. Disadvantage: Doesn’t tell you much about the
pattern of the distribution.
Interquartile Range: Upper Quartile − Lower Quartile. Useful for seeing where the middle 50% of the data lies.
Standard Deviation and Variance: Show the spread of the data from the mean, using all the data values (unlike the interquartile range). Variance = (1/n)Σxᵢ² − x̄² *. The standard deviation is the square root of the variance.
Variance from a frequency table = Σxᵢ²fᵢ / Σfᵢ − x̄² *.
Sometimes a question asks for the mean and standard deviation of data values that have been coded, e.g. by subtracting 100. In these cases the question will include Σ(x − 100) or similar, meaning 100 has been taken off each data value. If this happens, remember that the true mean is the given (coded) mean plus 100, but the standard deviation stays the same.
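The mean, variance and coding rules can be checked numerically. A minimal sketch with made-up data values: the coded mean plus 100 recovers the true mean, while the standard deviation is unchanged.

```python
from math import sqrt

def mean_and_sd(xs):
    """Mean and standard deviation using Variance = (1/n)Σx² − x̄²."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum(x * x for x in xs) / n - mean ** 2
    return mean, sqrt(variance)

data = [102, 104, 101, 105, 103]     # made-up data values
coded = [x - 100 for x in data]      # the "(x − 100)" coding
m, s = mean_and_sd(data)
cm, cs = mean_and_sd(coded)
# True mean = coded mean + 100; standard deviation stays the same.
assert abs(m - (cm + 100)) < 1e-9 and abs(s - cs) < 1e-9
```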
Probability
Sample Space: A list or table of all possible outcomes. Remember that
probabilities always add up to 1 and can never be greater than 1.
P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B): the probability of A plus the probability of B, minus the probability that A and B have both happened.
P(A and B) = P(A ∩ B) = P(A) × P(B | A): the probability of A multiplied by the probability of B given that A has happened.
Use tree diagrams to help you calculate these.
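A quick numerical check of the addition and multiplication laws, using a hypothetical example (one card drawn from a standard pack; A = "card is red", B = "card is a king"):

```python
p_a = 26 / 52            # P(A): half the pack is red
p_b = 4 / 52             # P(B): four kings
p_a_and_b = 2 / 52       # P(A ∩ B): two red kings
p_b_given_a = p_a_and_b / p_a        # P(B | A)
p_a_or_b = p_a + p_b - p_a_and_b     # P(A ∪ B)

assert abs(p_a * p_b_given_a - p_a_and_b) < 1e-12   # P(A) × P(B|A) = P(A ∩ B)
assert abs(p_a_or_b - 28 / 52) < 1e-12              # 26 + 4 − 2 favourable cards
```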
Permutations and Combinations
Permutations
The number of ways of arranging n objects is n!
The number of different permutations (where order matters) of picking r objects from n distinct objects is nPr = n! / (n − r)!. E.g. this might be the number of different ways that the gold, silver and bronze medals can be won in a race of 8 people.
If the objects are not distinct, you need to take this into account. The number of ways of arranging n objects, when p are the same, q are the same, r are the same, etc., is n! / (p! q! r! …), where p + q + r + … = n. For example, how many different ways can you arrange the letters in Ecclesbourne? 12! / (3! 2!), as there are 3 E’s and 2 C’s.
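These permutation formulae can be sketched in code: `math.perm` gives nPr directly, and the repeated-letters formula matches a count built from the letter frequencies.

```python
from math import factorial, perm
from collections import Counter

def arrangements(word):
    """n! / (p! q! r! ...) – arrangements of n letters with repeats."""
    n = factorial(len(word))
    for count in Counter(word).values():
        n //= factorial(count)
    return n

# Gold, silver and bronze from 8 runners: 8P3 = 8 × 7 × 6.
assert perm(8, 3) == 336
# Ecclesbourne: 12 letters with 3 E's and 2 C's.
assert arrangements("ECCLESBOURNE") == factorial(12) // (factorial(3) * factorial(2))
```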
Combinations
Order doesn’t matter – e.g. the number of ways that people can finish in the top 3 of a race when it doesn’t matter who gets gold, silver or bronze. Ignores repeated combinations. nCr = n! / (r!(n − r)!).
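A minimal sketch of nCr using `math.comb`, together with its link to permutations: each unordered selection of 3 corresponds to 3! = 6 ordered finishes.

```python
from math import comb, perm

# Choosing the top 3 of 8 runners when medals are not distinguished:
assert comb(8, 3) == 56
# Each unordered choice of 3 corresponds to 3! = 6 ordered finishes:
assert perm(8, 3) == comb(8, 3) * 6
```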
Discrete Probability Distributions
Random Variable: A quantity whose value depends on chance.
Probability Distributions: A listing of all the possible values of a random
variable and the corresponding probabilities.
If we are given experimental results, we can estimate how many times a value will come up using expected frequency = total frequency × probability. E.g. if a die is thrown 360 times and the probability of getting an even number is ½, then we would expect 360 × ½ = 180 even numbers.
If a probability distribution is not binomial or geometric (see below), we can use μ = E(X) = Σxᵢpᵢ and σ² = Var(X) = Σxᵢ²pᵢ − μ² to find the expected value and variance of X. We normally use these formulae when the data is given in a table.
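The table formulae μ = Σxᵢpᵢ and σ² = Σxᵢ²pᵢ − μ² can be sketched as a short function; the fair-die table below is illustrative:

```python
def expectation_and_variance(table):
    """E(X) = Σxᵢpᵢ and Var(X) = Σxᵢ²pᵢ − μ² for a list of (x, p) pairs."""
    mu = sum(x * p for x, p in table)
    var = sum(x * x * p for x, p in table) - mu ** 2
    return mu, var

# Score on a fair six-sided die:
die = [(x, 1 / 6) for x in range(1, 7)]
mu, var = expectation_and_variance(die)
assert abs(mu - 3.5) < 1e-9
assert abs(var - 35 / 12) < 1e-9
```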
Binomial Distribution
Assumptions: (1) A single trial has just two possible outcomes (success and
failure). (2) There are a fixed number of trials, n. (3) The outcome of each trial
is independent of the outcome of all the other trials. (4) The probability of
success at each trial, p, is constant.
The binomial distribution has two parameters, n (the number of trials) and p
(the probability of success). If we wanted to say that X was binomially
distributed with parameters n and p we would write X~B(n,p).
We use the binomial distribution to find the probability of exactly r successes in the n trials, using the formula P(X = r) = nCr × p^r × (1 − p)^(n−r), where 1 − p is the probability of failure.
We can also use the cumulative binomial tables from the formula book. Each value shows P(X ≤ x). The tables can be manipulated to find greater-than-or-equal-to probabilities and numbers between. If you practise using them it will save lots of time in the exam.
Expectation and Variance: E(X) = np, Var(X) = np(1 − p).
We use the binomial distribution when we have a fixed number of trials.
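A sketch of the binomial formula, using a hypothetical X ~ B(10, 0.5) (number of heads in 10 coin tosses); summing r × P(X = r) over all r reproduces E(X) = np.

```python
from math import comb

def binomial_pmf(n, p, r):
    """P(X = r) = nCr × p^r × (1 − p)^(n − r) for X ~ B(n, p)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 10, 0.5
assert abs(binomial_pmf(n, p, 5) - 252 / 1024) < 1e-12      # 10C5 / 2^10
assert abs(sum(binomial_pmf(n, p, r) for r in range(n + 1)) - 1) < 1e-12
# E(X) = np:
mean = sum(r * binomial_pmf(n, p, r) for r in range(n + 1))
assert abs(mean - n * p) < 1e-12
```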
Geometric Distribution
Assumptions: (1) A single trial has just two possible outcomes (success and
failure) and these are mutually exclusive. (2) The outcome of each trial is
independent of the outcome of all the other trials. (3) The probability of
success at each trial is constant. (4) The trials are repeated until a success
occurs.
The Geometric Distribution has one parameter p, the probability of success. If
we wanted to say that X was Geometrically distributed with parameter p, then
we would write X~Geo(p).
We use the Geometric Distribution to find the probability that the first success occurs on the xth trial, using the formula P(X = x) = p(1 − p)^(x−1), where 1 − p is the probability of failure.
To find P(X ≤ x) use 1 − P(X ≥ x + 1). To find the probability that at least a trials are needed, use P(X ≥ a) = (1 − p)^(a−1).
Expectation: E(X) = 1/p.
We use the Geometric Distribution when we want to keep on going until we
have a success.
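The geometric formulae can be checked numerically; p = 1/6 (waiting for a six on a fair die) is an illustrative choice, and E(X) = 1/p is verified by summing a long truncated tail.

```python
def geometric_pmf(p, x):
    """P(X = x) = p(1 − p)^(x − 1): first success on the xth trial."""
    return p * (1 - p) ** (x - 1)

def at_least(p, a):
    """P(X ≥ a) = (1 − p)^(a − 1): at least a trials are needed."""
    return (1 - p) ** (a - 1)

p = 1 / 6
x = 4
# P(X ≤ x) = 1 − P(X ≥ x + 1):
assert abs(sum(geometric_pmf(p, k) for k in range(1, x + 1))
           - (1 - at_least(p, x + 1))) < 1e-12
# E(X) = 1/p, checked numerically over a long (truncated) tail:
approx_mean = sum(k * geometric_pmf(p, k) for k in range(1, 2000))
assert abs(approx_mean - 1 / p) < 1e-6
```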
Correlation
Product Moment Correlation Coefficient (r): measures how close the points of a scatter graph are to a straight line. r lies between 1 and −1, where 1 is perfect positive correlation (line goes from bottom left to top right), −1 is perfect negative correlation (line goes from top left to bottom right) and 0 is no correlation (points randomly scattered). Calculate using the formula:
r = Sxy / √(Sxx Syy), where Sxy = Σxᵢyᵢ − (1/n)(Σxᵢ)(Σyᵢ), Sxx = Σxᵢ² − (1/n)(Σxᵢ)² and Syy = Σyᵢ² − (1/n)(Σyᵢ)². You will normally be given the information you need in the question.
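The PMCC formula, sketched directly from the Sxy, Sxx and Syy definitions; the data below is made up and exactly linear, so r comes out as 1.

```python
from math import sqrt

def pmcc(xs, ys):
    """r = Sxy / √(Sxx·Syy) with Sxy = Σxy − (Σx)(Σy)/n, etc."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sxy / sqrt(sxx * syy)

# Perfectly linear made-up data (y = 2x + 1) gives r = 1:
assert abs(pmcc([1, 2, 3, 4], [3, 5, 7, 9]) - 1) < 1e-9
```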
Spearman’s rank correlation coefficient measures the correlation between
the ranks of the two datasets. 1 indicates that the ranks are the same, -1
indicates the ranks are the complete opposite and 0 indicates little agreement
between the two rankings.
To calculate, rank the items in each data set from 1 to n and then find the difference (d) between the two rankings for each pair of data. Square the differences (d²) and then use the formula rs = 1 − 6Σdᵢ² / (n(n² − 1)) to calculate.
Spearman’s rank can give a coefficient of 1 even when the points lie on a curve; this is because the ranks are still the same – as x increases, so does y.
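Spearman’s formula from two lists of ranks, as a minimal sketch:

```python
def spearman(rank_x, rank_y):
    """rs = 1 − 6Σd² / (n(n² − 1)) from two lists of ranks."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Identical rankings give rs = 1; completely reversed rankings give rs = −1.
assert spearman([1, 2, 3, 4], [1, 2, 3, 4]) == 1
assert spearman([1, 2, 3, 4], [4, 3, 2, 1]) == -1
```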
Regression
We can find an accurate regression line to predict values. The least-squares regression line of y on x is y = a + bx, where b = Sxy / Sxx and a = ȳ − bx̄. Use this line to predict y when we know the x value. The least-squares regression line of x on y is x = a′ + b′y, where b′ = Sxy / Syy and a′ = x̄ − b′ȳ. Use this line to predict x when we know the y value.
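The y-on-x least-squares line can be sketched from the same summary sums; the made-up data below lies exactly on y = 1 + 2x, so the fit recovers that line.

```python
def regression_y_on_x(xs, ys):
    """Least-squares line y = a + bx with b = Sxy/Sxx, a = ȳ − b·x̄."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Exactly linear made-up data recovers the line y = 1 + 2x:
a, b = regression_y_on_x([1, 2, 3, 4], [3, 5, 7, 9])
assert abs(a - 1) < 1e-9 and abs(b - 2) < 1e-9
```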