Worksheet 7: Contingency analysis (frequencies) - 2017
Example of Contingency Analysis
This example uses the Car Poll.CSV sample data table, which contains data collected from car polls. The
data includes aspects about the individual polled, such as their sex, marital status, and age. The data also
includes aspects about the car that they own, such as the country of origin, the size, and the type. Here we
want to examine the relationship between car sizes (small, medium, and large) and the cars’ country of
origin.
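The steps below all work from the counts of each country-by-size combination. As a minimal sketch of that tabulation outside JMP, the snippet below counts combinations and computes the within-country percentages. The `records` list is made up for illustration and is not the Car Poll data; the helper name `row_percent` is also an assumption.

```python
from collections import Counter

# Hypothetical (country, size) records, for illustration only.
# These are NOT the Car Poll data.
records = [
    ("American", "Large"), ("American", "Large"), ("American", "Medium"),
    ("European", "Medium"), ("European", "Small"),
    ("Japanese", "Small"), ("Japanese", "Small"), ("Japanese", "Medium"),
]

# Frequency of each (country, size) combination.
counts = Counter(records)
countries = sorted({country for country, _ in records})
sizes = sorted({size for _, size in records})


def row_percent(country):
    """Percentage of each size within one country (row percentages)."""
    total = sum(counts[(country, s)] for s in sizes)
    return {s: 100.0 * counts[(country, s)] / total for s in sizes}


for country in countries:
    print(country, [counts[(country, s)] for s in sizes], row_percent(country))
```

A missing combination simply has a count of zero, which is why `Counter` is convenient here.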
1) Graph the relationship between size of car (size) and the car’s country of origin (country). Note that
both are categorical variables and that you will be graphing the frequency of occurrence of the
combinations of size and country.
a. Using Graph Builder, put country on the x-axis and size as an overlay variable.
b. Make sure you select N as the summary statistic.
c. What hypothesis is being tested (the null hypothesis)?
d. What are the qualitative results?
2) Test your hypothesis using the Contingency platform (Fit Y by X).
a. To launch the Fit Y by X platform, select Analyze > Fit Y by X.
b. Put size in the Y box and country in the X box, so that the output matches the description below (country on the x-axis, size on the y-axis). Note that both are categorical.
c. Look at the output.
i. The graph is a mosaic plot, which is a graphical representation of the
two-way frequency table or Contingency Table. A mosaic plot is divided into
rectangles. The height of each rectangle is proportional to the
proportion of the Y variable within that level of the X variable. The width
of each rectangle is proportional to the share of observations in that level of the
X variable.
1) The proportions on the x-axis represent the number of observations for
each level of the X variable, which is country.
2) The proportions on the y-axis at right represent the overall proportions of
Small, Medium, and Large cars for the combined levels (American,
European, and Japanese).
3) The scale of the y-axis at left shows the response probability, with the
whole axis being a probability of one (representing the total sample).
ii. Now look at the Contingency table
Note the following about Contingency tables:
The Count, Total%, Col%, and Row% correspond to the data within each
cell that has row and column headings (such as the cell under American
and Large).
The last column contains the total counts for each row and percentages
for each row.
The bottom row contains total counts for each column and percentages
for each column.
iii. Now look at the Tests
1) What is your conclusion concerning your null hypothesis?
2) If the null hypothesis is rejected - What combination of country of origin and
size of car contributes most to that conclusion?
i. Click on the red triangle next to the contingency table, then on
Deviation and Cell Chi Square. The Deviation value is the difference
between the observed count and the count expected if the null hypothesis is true
(observed – expected). The cell chi-square is the contribution of the cell to the
overall chi-square value. If the null hypothesis is true, deviations should be
close to zero and cell chi-square values should be small. Deviation values have
direction (a sign); chi-square values do not.
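To make the Deviation and Cell Chi Square quantities concrete, here is a minimal sketch that computes them by hand for a 3 x 3 table. The counts are hypothetical (not the Car Poll results), and the helper `chi2_sf_even_df` is an illustrative name; it uses a closed form for the chi-square upper tail that is valid only when the degrees of freedom are even, which is the case for a 3 x 3 table (df = 4).

```python
import math

# Hypothetical counts, for illustration only (NOT the Car Poll results).
# Rows are countries; columns are sizes in the order given by `sizes`.
sizes = ["Large", "Medium", "Small"]
observed = {
    "American": [30, 50, 20],
    "European": [5, 15, 20],
    "Japanese": [5, 55, 100],
}

row_tot = {r: sum(row) for r, row in observed.items()}
col_tot = [sum(observed[r][j] for r in observed) for j in range(len(sizes))]
n = sum(row_tot.values())

cells = {}   # (country, size) -> (expected, deviation, cell chi-square)
chi2 = 0.0
for r, row in observed.items():
    for j, obs in enumerate(row):
        expected = row_tot[r] * col_tot[j] / n   # expected count under independence
        deviation = obs - expected               # signed: observed - expected
        cell_chi2 = deviation ** 2 / expected    # unsigned contribution
        cells[(r, sizes[j])] = (expected, deviation, cell_chi2)
        chi2 += cell_chi2

df = (len(observed) - 1) * (len(sizes) - 1)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires even degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


p_value = chi2_sf_even_df(chi2, df)
largest = max(cells, key=lambda key: cells[key][2])
print(f"Pearson chi-square = {chi2:.2f}, df = {df}, p = {p_value:.3g}")
print("Largest cell contribution:", largest)
```

The deviations sum to zero across the table, while the cell chi-square values sum to the overall Pearson chi-square, so the cell with the largest value is the one contributing most to a rejection.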
Example of GENERALIZED (Poisson) Regression
This example uses the same data but in a different framework that allows you to look in more detail
at the distributional patterns. In particular, you can assess all three terms: Size, Country, and the
interaction between Size and Country. You need to restructure the data set to do this.
1) Use TABLES > SUMMARY and put Country and Size in the Group Box and leave the
STATISTICS box empty. Run the summary.
2) Your new data table will have three columns – Size, Country and N Rows. N Rows is the
frequency of observations for combinations of Size and Country
3) Using the new table, go to ANALYZE > FIT MODEL and put N Rows in the Y Box and Size,
Country and Country*Size in the Model Effects Box.
4) Use PERSONALITY = GENERALIZED LINEAR MODEL and DISTRIBUTION = POISSON.
5) Run the Model and open EFFECT TESTS.
a. What terms are significant?
b. Do the results make sense? Are they more informative than the simple Contingency
Analysis?
6) Now click the red triangle next to MAXIMUM LIKELIHOOD and click on Profilers > Profiler.
Vary Country and see if the interaction between Country and Size makes sense.
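The link between the Poisson model and the contingency analysis can be sketched by hand. In a Poisson log-linear model with only the Size and Country main effects, the fitted count for each cell is (country total x size total) / N, the same expected count used in the contingency test. The deviance of that additive model is the likelihood-ratio chi-square (G²), and a likelihood-ratio test of the Country*Size interaction compares against exactly this value. The counts below are hypothetical, and whether JMP reports this exact statistic for the interaction depends on the test type it uses, so treat this as an illustration rather than a reproduction of the JMP output.

```python
import math

# Hypothetical summary table in long format: (country, size) -> N Rows.
# Illustration only; NOT the Car Poll counts.
n_rows = {
    ("American", "Large"): 30, ("American", "Medium"): 50, ("American", "Small"): 20,
    ("European", "Large"): 5,  ("European", "Medium"): 15, ("European", "Small"): 20,
    ("Japanese", "Large"): 5,  ("Japanese", "Medium"): 55, ("Japanese", "Small"): 100,
}

country_tot = {}
size_tot = {}
for (country, size), count in n_rows.items():
    country_tot[country] = country_tot.get(country, 0) + count
    size_tot[size] = size_tot.get(size, 0) + count
n = sum(n_rows.values())

# Fitted counts of the additive (no-interaction) Poisson log-linear model.
fitted = {
    (country, size): country_tot[country] * size_tot[size] / n
    for (country, size) in n_rows
}

# Deviance of the additive model = likelihood-ratio chi-square (G^2).
# Cells with a zero count contribute zero to the sum.
g2 = 2.0 * sum(
    obs * math.log(obs / fitted[key]) for key, obs in n_rows.items() if obs > 0
)
df = (len(country_tot) - 1) * (len(size_tot) - 1)
print(f"G^2 for the Country*Size interaction = {g2:.2f} on {df} df")
```

A large G² relative to its degrees of freedom indicates that the main effects alone cannot reproduce the cell counts, which is the same conclusion as rejecting independence in the contingency analysis.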
Use of the KS test to compare distributions
Here we want to compare the size frequency distributions of abalone for two sample areas: one in a
Marine Protected Area (MPA) and another in an area of No Protection. We can do this using a KS
(Kolmogorov-Smirnov) test.
1) Open “Abalone frequencies by MPA status”. There are three columns: Status, Size (mm), and
Frequency, which is the number of observations of abalone of a given size in an area of given
status.
2) Now use FIT Y by X and put Size in the Y Box and Status in the X Box. Also put Frequency
in the FREQ Box. Click OK.
3) From the results, click on the red triangle, then on NONPARAMETRIC, then on
KOLMOGOROV SMIRNOV TEST. What is the P-value? Is it significant?
4) You may be wondering what the test is actually doing. This may help:
a. Click on the red triangle again and this time click on CDF PLOT. Now do the same for
DENSITIES, COMPARE DENSITIES and DENSITIES, PROPORTION OF
DENSITIES.
b. The KS test compares the cumulative distribution functions (CDFs) of the two groups
using the distance between them (D). If the maximum D is large enough, then the two
distributions differ. What was your hypothesis? Probably that the CDF for the MPA
was lower than that for No Protection, because that would indicate that there were
more large individuals in the MPA (ask if this does not make sense). Based on the CDF, D,
and P-value, is this hypothesis supported?
c. Now look at the COMPARE DENSITIES and PROPORTION OF DENSITIES
graphs. Do these also support your hypothesis?
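The KS statistic described above can be sketched directly from frequency data. The snippet below builds an empirical CDF for each area from (size, frequency) pairs and takes the maximum absolute difference between the two CDFs as D. The size classes and frequencies are invented for illustration and are not the abalone data; the function names are also assumptions, and no p-value is computed here.

```python
# Hypothetical frequency tables: size (mm) -> number of abalone observed.
# Illustration only; NOT the abalone data used in the worksheet.
mpa = {60: 2, 80: 5, 100: 8, 120: 5}
no_protection = {60: 6, 80: 8, 100: 4, 120: 2}


def ecdf(freq):
    """Return the empirical CDF implied by a size -> frequency table."""
    total = sum(freq.values())

    def cdf(x):
        return sum(f for size, f in freq.items() if size <= x) / total

    return cdf


def ks_statistic(a, b):
    """Maximum absolute CDF difference D, plus the signed differences (a - b)."""
    cdf_a, cdf_b = ecdf(a), ecdf(b)
    points = sorted(set(a) | set(b))   # the CDFs only change at observed sizes
    diffs = [cdf_a(x) - cdf_b(x) for x in points]
    return max(abs(d) for d in diffs), diffs


d_stat, signed_diffs = ks_statistic(mpa, no_protection)
print(f"D = {d_stat:.2f}")
print("MPA CDF minus No Protection CDF at each size:", signed_diffs)
```

In this made-up example every signed difference is zero or negative, meaning the MPA CDF lies at or below the No Protection CDF, which is the pattern expected when the MPA holds relatively more large individuals.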