Using your GDC to calculate the χ² statistic

Mathematical Studies Standard Level for the IB Diploma
Revision Topic 4: Statistical applications
Chapter 11: The normal distribution
The normal distribution curve
The shape of data plotted in a histogram can be compared to the normal distribution curve. This is a
standardised view of how data can be distributed, and it has the following properties:
• bell-shaped
• symmetrical about the mean value, μ
• equal values for the mean, median and mode
• area under the curve equals 1
• approximately 68% of the data lies within 1 standard deviation, σ, of the mean
• approximately 95% of the data lies within 2 standard deviations of the mean
• approximately 99.7% (usually quoted as 99%) of the data lies within 3 standard deviations of the mean
• to find the standard deviation marks along the horizontal axis for the percentages 68%, 95% and 99%,
start at the middle (the mean) and add or take away the correct number of standard deviations.
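The percentages in the list above can be checked numerically. The following is a minimal sketch using Python's standard library rather than a GDC; the function name proportion_within is an illustrative choice, not something from the course notes. It gives roughly 68%, 95% and 99.7% (the last figure is usually rounded to 99%).

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma^2) curve within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# The result does not depend on the particular mean or standard deviation.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.4f}")
```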
The normal distribution is written using this notation:
X ~ N(μ, σ²)
In a question you may be told that some data, X, follows the normal distribution, X ~ N(12, 5²), or you
may just be given the values of μ and σ.
Copyright Cambridge University Press 2014. All rights reserved.
Probability calculations using the normal distribution
You may be asked for the probability that an event will happen or the percentage of time that an
event occurs. These mean the same thing, which is that you should work out the area under the curve
between two points.
You should use your GDC to do this, obtaining both a graph of the relevant area and the value that
you want.
Questions that ask you to do probability calculations with the normal distribution will give you the
following information:
• that the data follows a normal distribution or is ‘normally distributed’
• the value of the mean
• the value of the standard deviation
• one or two boundary values.
When given boundary values, the question will ask you to calculate a probability associated with one
of the following situations:
Situation given in the question | Lower value to enter into GDC | Upper value to enter into GDC
More than a value | the value you are given | 99999
Between two values | the lower value you are given | the higher value you are given
Below a value | −99999 | the value you are given
Using your GDC
It is easier to follow what to do on your GDC if we look at a particular example:
A large number of mobile phone calls were monitored, and their lengths were recorded to the nearest
minute. The call lengths were found to be normally distributed with a mean of 12 minutes and a
standard deviation of 5 minutes. What is the percentage of calls that lasted over 15 minutes?
This question gives you the following values:
• lower boundary value = 15
• upper boundary value = 99999
• μ = 12
• σ = 5
Texas TI-84
Before drawing the graph on the TI-84, you need to set your window so you can actually see the graph.
Set the window boundaries as follows:
Xmin = μ − 3σ
Xmax = μ + 3σ
Ymin = −0.25
Ymax = 0.25
Then select the shadenorm( function and enter the values in this order:
lower, upper, mean, standard deviation
Then draw the graph and read off the probability value (Area).
Casio fx-9750GII
Get to the variable screen and input the values in this order:
lower, upper, standard deviation, mean
Then draw the graph and read off the probability value.
The GDC gives P = 0.274, so the percentage of calls lasting more than 15 minutes is 27.4%.
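The same calculation can be reproduced away from the GDC. This is a minimal sketch using Python's standard library; the helper name normal_probability is an illustrative assumption, and its argument order simply mirrors the TI-84 order (lower, upper, mean, standard deviation).

```python
from statistics import NormalDist

def normal_probability(lower, upper, mu, sigma):
    """Area under the N(mu, sigma^2) curve between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Calls lasting over 15 minutes: lower = 15, upper = 99999, mean = 12, sd = 5
p_over_15 = normal_probability(15, 99999, 12, 5)
print(round(p_over_15, 3))  # 0.274

# "Below a value" uses -99999 as the lower boundary instead
p_under_15 = normal_probability(-99999, 15, 12, 5)
```

The boundaries ±99999 stand in for infinity: they are so many standard deviations from the mean that the area beyond them is negligible.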
Inverse normal calculations
If you know the probability of an event happening, along with the mean and standard deviation of the
normal distribution, you can work out the boundary value(s) of the event.
For example, if 30% of a group of students scored below the pass mark on a test, and you know that
μ = 45 and σ = 7.2 for the test scores, then you can find the pass mark.
To do this, you need to use the ‘inverse normal’ function on your GDC.
You will need to input the following values to calculate the boundary value:
Area under the normal curve: This is the given probability or percentage written as a decimal.
σ: The standard deviation
μ: The mean
So, for the example above you would do the following:
Texas TI-84
Navigate to the inverse normal function and enter the known values in this order: Area, μ, σ
Note: The TI-84 always gives the ≤ value, i.e. the left tail boundary. If you want the ≥ value (right
tail boundary), subtract the given probability from 1 and enter that result as the Area.
Casio fx-9750GII
Navigate to the inverse normal function and enter the values of Area, σ, μ given in the question.
‘Tail’ means the side of the graph that is shaded and depends on whether the probability given in the
question corresponds to ≤ or ≥ the boundary value:
≤ is left
≥ is right
So the pass mark was 41.2 (or 41 to the nearest whole number).
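As a cross-check on the pass-mark example (30% below the pass mark, μ = 45, σ = 7.2), here is a minimal sketch using Python's standard library. The helper boundary_value and its tail argument are illustrative assumptions; for a right tail it converts the given probability to a left-tail area before inverting.

```python
from statistics import NormalDist

def boundary_value(area, mu, sigma, tail="left"):
    """Boundary x with P(X <= x) = area (left tail) or P(X >= x) = area (right tail)."""
    dist = NormalDist(mu, sigma)
    if tail == "right":
        area = 1 - area  # convert a right-tail probability to a left-tail area
    return dist.inv_cdf(area)

pass_mark = boundary_value(0.30, 45, 7.2)                    # 30% scored below this mark
top_30_mark = boundary_value(0.30, 45, 7.2, tail="right")    # 30% scored above this mark
print(round(pass_mark, 1))  # 41.2
```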
Chapter 12: Correlation
The concept of correlation
Bivariate data: Data that consists of measurements of two variables collected from each individual in a
sample
Correlation: The relationship between the two variables of bivariate data
The variables of bivariate data can be classified as follows:
Independent variable: Variable that is controlled by the person conducting the study
Dependent variable: Observed variable that should demonstrate the effect of the hypothesised
relationship
For example, in the hypothesis
‘A greater number of calories eaten per day will make a person heavier’
the independent variable is the number of calories consumed per day and the dependent variable is the
person’s weight.
Scatter diagrams
The easiest way to see if there is a pattern in bivariate data is to draw a scatter diagram by creating
coordinates from your data in this order:
(independent variable value, dependent variable value)
Then plot these coordinates on a grid and look at the grouping of the points to determine what type of
correlation there is.
Positive correlation: As one variable increases, so does the other.
No correlation: No apparent relationship.
Negative correlation: As one variable increases, the other decreases.
Correlation and causation
• Just because two variables have a correlation, it doesn’t mean that one causes the other.
• Be cautious when making judgements based on data.
• Don’t forget to consider all the variables that might affect the results.
Line of best fit
To highlight the relationship between the two variables of bivariate data, you should draw a line of
best fit on your scatter diagram.
To do this, follow these steps:
• Find the mean of each variable (i.e. the data plotted along the x-axis and the data plotted along the
y-axis), giving you the mean point (x̄, ȳ).
• Plot the mean point on the scatter diagram.
• Draw a line through the mean point so that the other points of the scatter diagram are spread evenly
above and below the line.
This line represents the relationship between the two variables.
Drawing a scatter diagram and line of best fit on your GDC
On both the Texas TI-84 and the Casio fx-9750GII:
• Put the bivariate data into your GDC: enter it in the data table as two lists, the first for the
independent variable and the second for the dependent variable.
• Set the graph type to ‘scatter’.
• Then draw the graph.
Once you have created the scatter diagram, you can get the GDC to calculate the line of best fit along
with a measure of the strength of the correlation.
In this case the equation of the line of best fit (regression line) is y = −0.944x + 10.3
Texas TI-84: To draw the regression line on the TI-84, you need to manually input the equation into the
[Y=] screen.
Casio fx-9750GII: The regression line is drawn on the scatter diagram itself.
Pearson’s product moment correlation coefficient
This is a measure, based on the data and the line of best fit, which tells you how strong the correlation
is. Remember the following points:
• This coefficient is usually denoted by r.
• −1 ≤ r ≤ 1
• If r = +1, there is a perfect positive correlation.
• If r = 0, there is no correlation.
• If r = −1, there is a perfect negative correlation.
• If the value of r is between −0.5 and 0.5, the correlation is too weak to draw any meaningful
conclusions from the regression line.
• The closer to ±1 the value of r is, the stronger the correlation.
In the GDC example above, r = −0.913, which indicates a very strong negative correlation.
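To see how r behaves, the coefficient can be computed directly from its definition. This is a minimal sketch in Python: the data values are hypothetical (they are not the data set behind the GDC example), and the names pearson_r and describe are illustrative. The describe helper applies the rule above that |r| < 0.5 is too weak to draw conclusions.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe(r):
    """Classify r using the -0.5 < r < 0.5 rule from the notes."""
    if abs(r) < 0.5:
        return "too weak to draw meaningful conclusions"
    return "positive correlation" if r > 0 else "negative correlation"

# Hypothetical illustrative data
xs = [1, 2, 3, 4, 5, 6]
ys = [9.5, 8.1, 7.9, 6.0, 5.8, 4.2]
r = pearson_r(xs, ys)
print(f"r = {r:.3f}: {describe(r)}")
```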
Regression line of y on x
A regression line is a line of best fit that minimises the overall distance between the data points and
the line of fit. Remember that:

• The line has an equation of the form y = ax + b
• You should use your GDC to find the values of a and b.
• You should rearrange the equation so that it is written sensibly.
• If the correlation is strong, you can use the regression line to predict values.
• If the correlation is weak (i.e. −0.5 < r < 0.5), then you should not predict values using the
regression line.
• You should not use a regression line to predict values outside the range of data given.
In the GDC example above, a = −0.944 and b = 10.3, so the equation of the regression line is
y = −0.944x + 10.3, which could also be written as y = 10.3 − 0.944x.
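The values of a and b that the GDC reports come from a least-squares calculation, which can be sketched as follows. The data here are hypothetical, the function names are illustrative, and the predict helper simply refuses to extrapolate beyond the range of the data, in line with the advice above.

```python
def regression_line(xs, ys):
    """Least-squares regression line of y on x, returned as (a, b) for y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x  # forces the line through the mean point
    return a, b

def predict(x, xs, a, b):
    """Predict y for x, but only inside the range of the observed x values."""
    if not min(xs) <= x <= max(xs):
        raise ValueError("x is outside the range of the data; do not extrapolate")
    return a * x + b

# Hypothetical illustrative data
xs = [1, 2, 3, 4, 5, 6]
ys = [9.5, 8.1, 7.9, 6.0, 5.8, 4.2]
a, b = regression_line(xs, ys)
print(f"y = {a:.3f}x + {b:.3f}")
```

Because b is computed from the means, the regression line always passes through the mean point, which is why the hand-drawn line of best fit is also drawn through that point.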
Chapter 13: Chi-squared hypothesis testing
The chi-squared test is used to see if two variables are independent. It can also be used to assess
whether data differs significantly from what is expected, called the ‘goodness of fit’.
Expected frequencies
First, you need to be able to work out the frequencies that you would ‘expect’ to see, based on some
hypothesis you assume for the data. How this is done depends on the type of problem you have.
• For a goodness-of-fit test: Assuming a certain theoretical distribution for the data, the
expected frequency of each outcome would be
total frequency × probability of that outcome occurring
• For a test of independence of two variables: Given a two-way table summarising the observed
frequencies of the data, the expected frequency of each cell would be
(row total × column total) / total
(This is the probability of the row outcome multiplied by the column total, or vice versa, and
it gives you the correct share of the total you should expect.)
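Both rules for expected frequencies can be written as short functions. This is a minimal sketch in Python; the function names and the small data sets (60 rolls of a fair die and a 2 × 2 table) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def expected_goodness_of_fit(total_frequency, probabilities):
    """Expected frequency of each outcome: total frequency x probability."""
    return [total_frequency * p for p in probabilities]

def expected_independence(observed):
    """Expected frequency of each cell: row total x column total / total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Hypothetical example 1: 60 rolls of a fair die, so about 10 expected per face
die_expected = expected_goodness_of_fit(60, [1 / 6] * 6)

# Hypothetical example 2: a 2 x 2 two-way table of observed frequencies
table_expected = expected_independence([[20, 30], [30, 20]])
print(die_expected)
print(table_expected)
```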
The χ² statistic
The χ² statistic is a measure of the discrepancy between the observed and expected frequencies. You
should use your GDC to calculate the χ² statistic, and then interpret it in relation to the following:
Critical χ² value: The threshold value above which the discrepancy is considered significant.
In exam questions you will be given this value.
Significance level: The maximum probability of making a mistake in your conclusion, deciding
that the result is significant when actually it isn’t.
In questions you will be given this value, and it is normally 1%, 5% or 10%.
Null hypothesis, H0: The hypothesis that the factors being tested are independent.
Alternative hypothesis, H1: The hypothesis that the factors being tested are dependent.
p-value: The probability of getting a discrepancy as large as the calculated χ² statistic
if the theoretical distribution or null hypothesis were correct.
Degrees of freedom: The number of outcomes that can be independent, given that the total
frequency is fixed.
In the goodness-of-fit test,
degrees of freedom = number of outcomes − 1
In the independence test,
degrees of freedom = (number of rows − 1) × (number of columns − 1)
Using your GDC to calculate the χ² statistic
Suppose you need to do a goodness-of-fit test on the following data, where the theoretical distribution
assumes equally likely outcomes:
A   B   C   D
11  6   9   10
Put the data into your GDC as a list.
On both the Texas TI-84 and the Casio fx-9750GII:
• Go into list mode.
• Enter your data (observed frequencies) in list 1 and the expected frequencies in list 2.
• Access the χ² statistic and p-value.
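For comparison with the GDC output, the goodness-of-fit calculation for the data above (observed frequencies 11, 6, 9, 10 with equally likely outcomes) can be sketched in Python. The helper chi2_sf is an illustrative assumption: it evaluates the upper-tail probability from a standard series for the incomplete gamma function, and is not how any particular GDC computes the p-value.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a = df / 2.0
    half = x / 2.0
    # Series for the regularised lower incomplete gamma function P(a, half)
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= half / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half + a * math.log(half) - math.lgamma(a))
    return 1.0 - lower

observed = [11, 6, 9, 10]
total_frequency = sum(observed)
# Equally likely outcomes: each expected frequency is total / number of outcomes
expected = [total_frequency / len(observed)] * len(observed)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2_sf(chi2, df)
print(f"chi-squared = {chi2:.3f}, df = {df}, p-value = {p_value:.3f}")
```

Here each expected frequency is 36 / 4 = 9, so χ² = (4 + 9 + 0 + 1) / 9 = 14/9.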
Suppose you need to perform an independence test on the following bivariate data, given in a two-way
frequency table:
    A   B   C   D
a   15  12  8   6
b   6   11  7   7
c   9   6   14  20
In this case, put the data into a table or matrix.
On both the Texas TI-84 and the Casio fx-9750GII:
• Go into matrix edit mode.
• Set the size of the matrix: enter the number of rows first (3) followed by the number of columns (4).
• Then input your data.
• Access the χ² statistic and p-value.
If you need to see the expected frequencies (which are calculated automatically by the GDC), open
‘matrix B’.
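The independence test for the 3 × 4 table above can also be worked through in Python as a check on the GDC. This is a minimal sketch with illustrative function names; the p-value helper uses a closed form that is valid only for an even number of degrees of freedom, which happens to apply here (df = 6), so it is not a general-purpose routine.

```python
import math

observed = [
    [15, 12, 8, 6],   # row a
    [6, 11, 7, 7],    # row b
    [9, 6, 14, 20],   # row c
]

def chi2_independence(table):
    """Return (chi-squared statistic, degrees of freedom, expected frequencies)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

def chi2_sf_even_df(x, df):
    """Upper-tail probability; this closed form holds only when df is even."""
    if df % 2 != 0:
        raise ValueError("closed form requires even degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

chi2, df, expected = chi2_independence(observed)
p_value = chi2_sf_even_df(chi2, df)
print(f"chi-squared = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
```

The expected list plays the same role as the matrix of expected frequencies that the GDC calculates automatically.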
Understanding the χ² statistic and p-value
The GDC gives you the χ² statistic, the p-value and the degrees of freedom (df).
For each test, comparing the χ² statistic with the critical value or the p-value with the significance
level will lead to the following conclusions:
χ² statistic | p-value comparison | Goodness-of-fit test | Independence test
χ² < critical value | p-value > significance level | Good fit | Accept the null hypothesis
χ² > critical value | p-value < significance level | Not a good fit | Reject the null hypothesis
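The decision rule in the table can be sketched as two small functions. The function names are illustrative, and the 5% critical value used in the example (7.815 for 3 degrees of freedom) is taken from a standard χ² table rather than from these notes; in an exam the critical value is given in the question.

```python
def conclude_from_statistic(chi2, critical_value):
    """Compare the chi-squared statistic with the critical value."""
    return "reject H0" if chi2 > critical_value else "accept H0"

def conclude_from_p_value(p_value, significance_level):
    """Compare the p-value with the significance level (as a decimal)."""
    return "reject H0" if p_value < significance_level else "accept H0"

# Goodness-of-fit data from the notes: observed 11, 6, 9, 10, expected 9 each
observed = [11, 6, 9, 10]
expected = [9, 9, 9, 9]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 5% critical value for 3 degrees of freedom, from a standard table (assumption)
decision = conclude_from_statistic(chi2, 7.815)
print(decision)  # accept H0, i.e. the data are a good fit
```

Both comparisons always lead to the same conclusion for a given test, so either one can be used.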