GEOGRAPHY ● AS/A2 LEVEL ● AQA (1031/2031)
Practical Skills for AS/A2 Geography at BHS
3. Statistical Skills
zigzageducation.co.uk

Contents
Thank You for Choosing ZigZag Education
Teacher Feedback Opportunity
Terms and Conditions of Use
Teacher’s Introduction
2.1 Measures of Central Tendency: Mean, Median and Mode
    Arithmetic Mean ($\bar{x}$)
    Median
    Mode
2.2 Measures of Dispersion
    Range
    Inter-quartile Range
    Dispersion Graphs
    Exercise 2.1
    Comparing Dispersion Graphs
    Box-and-Whisker Diagrams
    Histograms
    Exercise 2.2
    Standard Deviation (σ)
    Exercise 2.3
2.3 Correlation
    Scattergraphs (Scatter Diagrams)
    Spearman’s Rank Correlation Coefficient (rs)
2.4 Comparative Tests
    Chi-squared test (χ²)
    Mann-Whitney U test
    Exercise 2.4
2.5 Examination Questions
    Examination Assignment 2.1
    Examination Assignment 2.2

2.1 Measures of Central Tendency: Mean, Median and Mode

Arithmetic Mean ($\bar{x}$)
The arithmetic mean, usually called ‘the mean’, is the ‘average’.
It is found by adding up the values in a data set and dividing by the number of values. It is expressed as:

\[ \bar{x} = \frac{\sum x}{n} \]

where
$\bar{x}$ (bar x) = mean
$\sum$ (sigma) = the sum of
$x$ = values of the variable
$n$ = number of items in the set

Look at the population data in Figure 2.1 (column 1). To find the mean population, the values of the individual populations are first added together to give a total population of 3,376 million ($\sum x$). This is divided by the number of countries in the set ($n = 13$) to give a mean value of 259.69 million ($\bar{x} = 259.69$).

Exam hint: Remember that when making calculations the answer should be given to 2 decimal places.

Figure 2.1: Population data for selected countries in Africa/Asia/Latin America (2006)

| Country | Population (millions) (1) | Life expectancy at birth (2) | % Urban (3) |
|---|---|---|---|
| Egypt | 75 | 70 | 43 |
| Nigeria | 134 | 44 | 44 |
| Ethiopia | 74 | 49 | 15 |
| Uganda | 27 | 47 | 12 |
| Mexico | 108 | 75 | 75 |
| Bangladesh | 146 | 61 | 23 |
| India | 1,121 | 63 | 29 |
| Pakistan | 165 | 62 | 34 |
| China | 1,311 | 72 | 37 |
| Brazil | 186 | 72 | 81 |
| Bolivia | 9 | 64 | 63 |
| Chile | 16 | 78 | 87 |
| Puerto Rico | 4 | 77 | 94 |
| Afghanistan | No data | 42 | 22 |

Median
The median is the middle value or mid-point in a set of data. The simplest way to find the median is to arrange the values in sequence from highest to lowest. This can be done either by simply listing the numbers in a line (Figure 2.2) or by using a column dispersion diagram (see Figure 2.5). To find the middle value (median), count the number of values. There must be an equal number of points above and below the median. For example, with 25 values the median is the 13th value (you can count from either the bottom or the top), with 12 values above and 12 values below the median.

Exam hint: You might be tempted to try to find the median by observation alone. Be warned that this frequently results in errors!

Figure 2.2: Population sizes (millions)
There are 13 values (an odd number) in the set, so the 7th is the median (with 6 values above and 6 values below the median).
1311, 1121, 186, 165, 146, 134, 108 (median: 7th value in sequence), 75, 74, 27, 16, 9, 4

With an odd number of values the median should be easy to find. Look again at the population data in Figure 2.1 (column 1). There are 13 countries in the data set and their values (in millions) have been arranged in rank order in Figure 2.2. As there are 13 values, the mid-point is the 7th value (there will be 6 points above and 6 below the median), which is 108. So the median value of the 13 countries is 108 million.

With an odd number of values the position of the median can be calculated:

\[ \text{Median} = \frac{n+1}{2}\text{th value in the set} \]

So the median population in Figure 2.2 is the $\frac{13+1}{2}$th value (the 7th) in the set (= 108).

However, with an even number of values there is no single central value, so the median cannot be read straight from the list. You will need to take the mean of the two middle values. So in a sequence of 14 values the two mid-points are the 7th and 8th values. These two values are added together and divided by 2 (as there are two of them) to give you the median.

To find the median life expectancy for the countries listed in Figure 2.1 (column 2), the data has again been arranged in rank order in Figure 2.3. There is an even number of values (14) and the two middle values are 63 and 64. So the median is the mean of 63 and 64:

\[ \frac{63 + 64}{2} = 63.5 \]

Figure 2.3: Life expectancy
There are 14 values (an even number), so the median is the mean of the two middle values (with 7 values above and 7 values below the median).

78, 77, 75, 72, 72, 70, 64, 63, 62, 61, 49, 47, 44, 42

Median = $\frac{63 + 64}{2}$ = 63.5

Note: if values occur more than once (as with the number 72 in Figure 2.3), list the values next to one another to make sure they are all counted!

Mode
The mode is the value that occurs most frequently in a data set. However, with continuous data a mode is a rare occurrence. For example, with data on population sizes there is unlikely to be a useful mode, as few countries will have exactly the same population.
More important is the grouping of data into a number of classes, e.g. 0–4, 5–9, 10–14, 15–19. The group that occurs most frequently is called the modal class. This can be shown in graph form using a histogram (see section 2.2). When grouping into classes aim for 5 or 6 groups and ensure that the class interval is the same!

Of the measures of central tendency, the mean is the one used most frequently. It takes all the values into consideration and is easy to calculate. However, it can be influenced by one or two extreme values and, therefore, on its own is of limited value. The median on its own also gives no indication of the spread of the data (as in Figure 2.2).

2.2 Measures of Dispersion

To give a more accurate impression of a data set it is useful to look at the dispersion of the data.

Range
This is the difference between the highest and lowest value in a data set. It is of little significance apart from indicating the spread of the data.

Inter-quartile Range
To find the inter-quartile range, first rank the data and find the median as described above. The Upper Quartile is the mid-point of the values above the median, and the Lower Quartile is the mid-point of the values below the median. In this example each half has an even number of values, so in each case there will be two middle values and you will need to take the mean of them.

In Figure 2.4 the upper quartile is the median of the values above 108 and the lower quartile is the median of the values below 108. For the upper quartile the two middle values are 186 and 165, so the upper quartile is the mean of the two values, which is 175.5. If you carry out the same procedure for the lower quartile the result is 21.5. Subtracting the lower quartile from the upper quartile gives the Inter-quartile Range, which is an index of dispersion. If we divide the inter-quartile range by two we obtain the ‘Quartile Deviation’.
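The calculations described so far can be checked with a short script. The following is a minimal sketch in Python, assuming the 13 population values from Figure 2.1; the function name `quartiles_by_halves` is made up for illustration, and it follows the ‘median of each half’ method used in the text rather than any particular software default (spreadsheets and calculators may use slightly different quartile rules).

```python
from statistics import mean, median

# Population (millions) for the 13 countries with data in Figure 2.1.
populations = [75, 134, 74, 27, 108, 146, 1121, 165, 1311, 186, 9, 16, 4]


def quartiles_by_halves(values):
    """Return (lower quartile, median, upper quartile).

    The quartiles are taken as the medians of the values above and
    below the overall median, as described in the text.
    """
    ranked = sorted(values, reverse=True)  # rank from highest to lowest
    half = len(ranked) // 2                # size of each half (median excluded if n is odd)
    upper_half = ranked[:half]             # values above the median
    lower_half = ranked[-half:]            # values below the median
    return median(lower_half), median(ranked), median(upper_half)


lq, med, uq = quartiles_by_halves(populations)
mean_population = round(mean(populations), 2)  # answer to 2 decimal places
iqr = uq - lq                                  # inter-quartile range
quartile_deviation = iqr / 2

print("Mean:", mean_population)
print("Median:", med)
print("Upper quartile:", uq, "Lower quartile:", lq)
print("Inter-quartile range:", iqr, "Quartile deviation:", quartile_deviation)
```

With these values the script reproduces the figures in the text: a mean of 259.69, a median of 108, quartiles of 175.5 and 21.5, and an inter-quartile range of 154.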
To help with the calculation of the upper and lower quartiles with an odd number of values, the following formulae can be used (with the values ranked from highest to lowest):

\[ \text{Upper Quartile} = \frac{n+1}{4}\text{th value} \]

So in Figure 2.4 the Upper Quartile would be the $\frac{13+1}{4}$th value (= 3.5, i.e. the mean of values 3 and 4).

\[ \text{Lower Quartile} = \frac{3(n+1)}{4}\text{th value} \]

In Figure 2.4 the Lower Quartile would be the $\frac{3 \times (13+1)}{4}$th value (= 10.5, i.e. the mean of values 10 and 11).

Note: this formula cannot be used with an even number of values.

The inter-quartile range is a more useful measure than the range as it tells us how the values are dispersed about the median (above and below it). It tells us the spread of the middle 50% of the data above and below the median. A small inter-quartile range means that there is a narrow range of values about the median. The large inter-quartile range in Figure 2.4 indicates a wide spread of data with regard to the population sizes of the 13 countries.

Figure 2.4: The inter-quartile range of population size for selected countries
Values above the median: 1311, 1121, 186, 165, 146, 134. Upper Quartile = $\frac{186 + 165}{2}$ = 175.5
Median: 108
Values below the median: 75, 74, 27, 16, 9, 4. Lower Quartile = $\frac{27 + 16}{2}$ = 21.5
Inter-quartile range = 154 (175.5 – 21.5)

Visual Representation of Data
The dispersion of values in a data set may be appreciated better if it is presented in visual form. This can be done using dispersion diagrams and histograms.

Dispersion Graphs
A dispersion graph can show a range of values in a data set in the form of a graph (Figure 2.5). They are visually very effective, as the full range of data can be seen together with the patterns and groupings of the data. They are particularly useful for making comparisons either between areas or at the same location over a period of time.

Technique
A vertical scale is drawn and should cover the full range of data.
The independent variable is represented on the horizontal axis, although a scale may be irrelevant if only one column is used. The values are then plotted on the graph in the form of a column using dots of uniform size. One dot represents one value (values which are identical should be placed next to one another on the same line). The median and quartiles can be shown using horizontal lines or arrows. The data for life expectancy for Figure 2.1 (column 2) can be seen in Figure 2.5 below.

Figure 2.5: A dispersion graph showing life expectancy (for countries in Figure 2.1) [graph not reproduced; horizontal axis: Countries]

Exercise 2.1
Skills: statistical, graphical, mean, median, upper/lower quartile, inter-quartile range, dispersion graphs

1. a) What is the mean population size for the countries shown in Figure 2.6 (column 1)?
   b) List the values for life expectancy shown in Figure 2.6 in rank order (Figure 2.7). What is the median life expectancy?
   c) Plot the values for life expectancy in Figure 2.6 on a column dispersion graph (Figure 2.8).
   d) On the graph (Figure 2.8) mark the median and upper and lower quartiles.
   e) What is the inter-quartile range for life expectancy?

Figure 2.6: Population data for selected countries in Europe/N. America (2006)

| Country | Population (millions) (1) | Life expectancy at birth (2) | % Urban (3) |
|---|---|---|---|
| UK | 61 | 78 | 89 |
| Belarus | 10 | 69 | 70 |
| Poland | 38 | 75 | 62 |
| Russia | 142 | 65 | 73 |
| Italy | 59 | 80 | 90 |
| Spain | 46 | 81 | 76 |
| Ukraine | 47 | 68 | 68 |
| Canada | 33 | 80 | 79 |
| USA | 299 | 78 | 79 |
| Czech Republic | 10 | 76 | 77 |
| Portugal | 11 | 78 | 53 |
| Sweden | 9 | 81 | 84 |
| Croatia | 4 | 75 | 56 |
| Lithuania | 3 | 72 | 67 |

Figure 2.7: List of values for life expectancy [blank, to be completed]

Figure 2.8: Dispersion graph (life expectancy) [blank graph for plotting; vertical axis marked from 40 to 85]

(6 marks)

2. Compare your completed graph (Figure 2.8) with the dispersion column in Figure 2.5.
(2 marks)

3. What problems are shown here with regard to dispersion graphs?
(2 marks)

(Total marks = 10)

Comparing Dispersion Graphs
Although the dispersion graph does not provide a statistical record of the spread of data, it is an excellent visual guide.
It is particularly useful when comparisons need to be made between samples. The scatter of values is plotted for each sample, using the same scale, and the medians and upper and lower quartiles are marked. Comparisons can be made and are based on the relative positions of the medians but, in particular, on the positions of the upper and lower quartiles.

If you examine Figure 2.9 you can see there is no significant difference between the two sets of data, as the lower quartile of B is between the median and lower quartile of A. But if you look at Figure 2.10, which shows infant mortality rates in different continents, you can see that the lower quartile for Africa is above the median for Asia but below the upper quartile. This suggests a difference between the data which is ‘probably significant’. However, if you compare the dispersion for Africa and Asia with that of Europe you can see that the inter-quartile ranges do not overlap. The lower quartile for Africa, and also for Asia, lies above the upper quartile for Europe, which indicates a ‘significant difference’ between the data.

Figure 2.9: Dispersion diagrams showing no significant difference [diagram not reproduced; columns A and B, each marked with M = Median, UQ = Upper Quartile, LQ = Lower Quartile]

Figure 2.10: Dispersion diagrams showing the infant mortality for selected countries in Africa, Asia and Europe [diagram not reproduced; columns: Africa, Asia, Europe]

Box-and-Whisker Diagrams
The dispersion graph can easily be converted into a box-and-whisker diagram (Figure 2.11) with a slight amendment.

Technique
1) Plot the points and mark on the median and upper and lower quartiles as in Figure 2.5.
2) Draw a further two horizontal lines, parallel with the horizontal axis, through the highest and lowest values.
3) Draw 2 vertical lines from the upper quartile to the lower quartile to ‘box’ the data. The box represents the inter-quartile range.
4) Draw a central vertical line from the highest to lowest value. This will result in the ‘whiskers’ which show the range of data.

Figure 2.11: A box-and-whisker diagram showing the infant mortality in Africa [diagram not reproduced; labelled from top to bottom: highest value, upper quartile, median, lower quartile, lowest value]

Graphs of frequency distributions

Histograms
Histograms are graphs that show the frequency distribution of data grouped into classes. A histogram is an effective way of showing the distribution of values in a data set, but it can only be used where the data is in groups or classes. Figure 2.12 shows bars or rectangles rising from the horizontal (x) axis, which is marked off into classes. The vertical (y) axis indicates the frequency of the dependent variable. Notice that the bars in the histogram are continuous, with no gaps between the bars.

Exam hint: Students often confuse the histogram with the bar graph. But the histogram shows frequencies and the data must be in classes. Also, the bars in the histogram must be continuous and not separated with spaces.

Figure 2.12: A typical histogram [graph not reproduced; vertical (y) axis: frequency, 0 to 6; horizontal (x) axis: classes 0–20, 21–40, 41–60, 61–80, 81–100; the tallest bar is labelled ‘Modal class’]

Technique
Before you can draw the histogram you must decide on the number of classes and the class intervals. There must be a fixed class interval within the range of data. You must not use intervals that are different. Histograms should have at least five classes, but there are no hard and fast rules that can be used in deciding on the number of classes. You must look at the range of data. Some standard textbooks advise students to use the formula:

Number of classes = 5 × log of the number of items in the set

But it must be stressed that this is the maximum number of classes and there is absolutely no requirement to find such a figure! Normally 6 or 7 classes are ideal.
There is, however, a problem with the class interval and the boundaries between the classes, because no class boundary can be omitted or counted twice. So, for example, you cannot have class intervals of 1–20, 20–40, 40–60 because 20 and 40 would appear in two groups. So, therefore, you would use the class intervals 0–19, 20–39, 40–59, etc. This is fine with discrete data, which are in whole numbers, but most data is not in whole numbers. So if your class intervals are 0–19 and 20–39, to which class is a value of 19.5 allocated? Does it belong with 19 or with 20? To overcome this problem it is necessary to group the data as 0–19.9, 20–39.9, and so on, and then no value can be omitted. The values of 19.9 and 39.9 assume a recurring figure of .999999.

Exam hint: When grouping into classes aim for at least 5 groups and ensure that the class interval is the same. Make sure you are plotting frequency on the vertical axis.

Look again at Figure 2.1 (column 3). The data ranges from 12% urban to 94% urban, so a convenient class range would be from 10–100%. What about the number of classes and the class interval? You could group the data in 10s, which would produce 9 classes, but there is too little data for such a large number of classes. In groups of 20 there would be too few classes, so 15 would seem to be reasonable. Starting at 10, the groups would be 10–24, 25–39, 40–54, 55–69, 70–84, 85–99. The histogram produced is shown in Figure 2.13 below:

Figure 2.13: Histogram to show % urban populations for selected countries in Africa, Asia, and Latin America [graph not reproduced; vertical axis: frequency, 0 to 5; horizontal axis: % urban in classes 10–24, 25–39, 40–54, 55–69, 70–84, 85–99]

Notes:
The modal class in Figure 2.13 is 10–24% urban, with a frequency of 4.
Histograms are particularly useful for making visual comparisons between two or more sets of data, but you must ensure the scales and class interval are the same!
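The grouping step behind Figure 2.13 can be sketched in a few lines of Python. This is only an illustration, assuming the % urban values from Figure 2.1 (column 3) and the class interval of 15 starting at 10 chosen in the text; the names `START`, `WIDTH` and `class_label` are invented for the example, and the labels assume whole-number data.

```python
from collections import Counter

# % urban for the 14 countries in Figure 2.1 (column 3).
urban = [43, 44, 15, 12, 75, 23, 29, 34, 37, 81, 63, 87, 94, 22]

START = 10   # lower limit of the first class
WIDTH = 15   # fixed class interval


def class_label(value):
    """Return the class (e.g. '10-24') that a whole-number value falls into."""
    lower = START + ((value - START) // WIDTH) * WIDTH
    return f"{lower}-{lower + WIDTH - 1}"


# Count how many values fall into each class.
frequencies = Counter(class_label(v) for v in urban)

# The modal class is the class with the highest frequency.
modal_class, modal_frequency = frequencies.most_common(1)[0]

for label in ["10-24", "25-39", "40-54", "55-69", "70-84", "85-99"]:
    print(label, frequencies[label])
print("Modal class:", modal_class, "with frequency", modal_frequency)
```

Because every value is assigned to exactly one class, no value is omitted or counted twice, which is the point made about class boundaries above.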
Frequency Polygon
This is similar to a histogram and uses the same vertical scale of frequency and horizontal scale of classes. But instead of bars, points are plotted with dots where the mid-point of each class reaches the appropriate frequency. The points are joined by straight lines. If the points are plotted and joined by a smooth curve, instead of straight lines, then the result is a ‘frequency curve’. Frequency curves may be cumulative, in which case the vertical axis has a cumulative frequency, for example a Lorenz curve.

Exercise 2.2
Skills: ICT skills, graphical skills, histograms, frequency charts

1) Copy and paste the data for Figure 2.6 on page 7 into an Excel worksheet. Create a histogram using ICT to show the percentage urban population for the selected countries in Europe/N. America.
• Label the axes correctly
• Adjust the bars to ensure that they touch and there are no gaps between them
• Divide the horizontal axis into percentage groups: 0–24, 25–39, 40–54, 55–69, 70–84, 85–99
• Give your graph a title
(5 marks)

2) Using the same scale, present the same data using ICT in a frequency polygon line graph.
(5 marks)

3) Compare the two graphs you have drawn with the histogram for urban populations in Africa/Asia/South America (Figure 2.13).
(5 marks)

4) Using Figure 2.10 and your completed histograms, assess the significance of the two methods (dispersion graph and histogram) as methods of showing dispersion of data and differences between data.
(5 marks)

(Total marks = 20)

Standard Deviation (σ)
This is the measure of dispersion of all values in a data set from the arithmetic mean. It is the most common method for showing dispersion but involves more detailed calculation than the inter-quartile range. If you don’t understand what that means, don’t worry: here’s how to work it out!

Technique
1. Calculate the arithmetic mean ($\bar{x}$) of all the values.
2. Measure how much each value differs from it by subtracting the mean from each value, $(x - \bar{x})$ [values higher than the mean give a positive (+) difference and values lower give a negative (-) difference].
3. Each difference is squared, $(x - \bar{x})^2$.
4. Add together all the squared deviations, $\sum (x - \bar{x})^2$, and divide this figure by the number ($n$) of values. This is the ‘variance’ of the distribution.
5. The standard deviation ($S$) is the square root of the variance:

\[ S = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \]

Note: you do not need to remember formulas at AS level; they will be provided for you.

The standard deviation is the best method for showing the extent to which the values cluster around the mean value. A low standard deviation, for example, will indicate that the values are clustered around the mean and there is a small spread of data. A high standard deviation will indicate that the values are widely spread around the mean and, therefore, dispersion is large. However, the degree of dispersion will vary with the mean value itself. If two data sets have the same standard deviation but different means, the dispersion will be greater for the lower value.
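The five steps above can be followed line by line in a short Python sketch. The data set used here is a small made-up example (not taken from any figure in this booklet), chosen so that the arithmetic is easy to check by hand; the function name `standard_deviation` is also just illustrative.

```python
from math import sqrt


def standard_deviation(values):
    """Standard deviation using the formula S = sqrt(sum((x - mean)^2) / n)."""
    n = len(values)
    mean = sum(values) / n                             # step 1: arithmetic mean
    differences = [x - mean for x in values]           # step 2: subtract the mean from each value
    squared = [d ** 2 for d in differences]            # step 3: square each difference
    variance = sum(squared) / n                        # step 4: sum and divide by n (the variance)
    return sqrt(variance)                              # step 5: square root of the variance


# Hypothetical values for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print("Mean:", sum(sample) / len(sample))
print("Standard deviation:", standard_deviation(sample))
```

For this made-up set the mean is 5, the squared differences add up to 32, the variance is 4 and the standard deviation is 2. Note that the formula divides by n, as in the text; some calculators and spreadsheets also offer a version that divides by n - 1, which gives a slightly larger answer.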
In a normal distribution, which is symmetrical:
• about 68% of the values will lie within ±1 standard deviation of the mean
• about 95% of the values will lie within ±2 standard deviations of the mean
• about 99.7% of the values will lie within ±3 standard deviations of the mean

Exercise 2.3
Skills: statistical skills, standard deviation

1) Calculate the standard deviation of the population data shown in Figure 2.14.
Note: The mean is rarely a whole number so you will need to work to a reasonable level of accuracy; here we have worked to 2 decimal places. To avoid introducing rounding errors, use the memory function on your calculator to store the mean ($\bar{x}$).

Figure 2.14: Table for calculating standard deviation

| Country | Population (millions) ($x$) | $x - \bar{x}$ | $(x - \bar{x})^2$ |
|---|---|---|---|
| Egypt | 75 | -184.69 | 34110.40 |
| Nigeria | 134 | -125.69 | 15797.97 |
| Ethiopia | 74 | | |
| Uganda | 27 | -232.69 | 54144.63 |
| Mexico | 108 | | |
| Bangladesh | 146 | -113.69 | 12925.41 |
| India | 1121 | 861.31 | 741854.92 |
| Pakistan | 165 | -94.69 | 8966.19 |
| China | 1311 | | |
| Brazil | 186 | | |
| Bolivia | 9 | -250.69 | 62845.47 |
| Chile | 16 | | |
| Puerto Rico | 4 | -255.69 | 65377.37 |

$\sum x$ = 3376
$\bar{x}$ = 259.69
$\sum (x - \bar{x})^2$ =
$n$ = number of values

\[ S = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \]

Standard Deviation = (closer to 0 = less deviation)
1SD = ±
2SD = ±
(6 marks)

2) Comment on the standard deviation figure.
(4 marks)

(Total marks = 10)

2.3 Correlation

Correlation is a measure of the relationship between two variables or two sets of data. It implies that there is an association between the data, for example rainfall and altitude, but not necessarily that one causes the other. It is usual in correlation for one of the variables to depend on the other variable; for this reason it is known as the dependent variable. The variable it depends on is known as the independent variable. Correlation can also be either positive or negative.

Which is the dependent variable in the association: rainfall or altitude? Would the association be positive or negative? Can you think of a negative association?

Correlation can be shown in different ways. It can be shown in graph form by, for example, a scattergraph, or by a statistical technique such as Spearman’s rank correlation.

Scattergraphs (Scatter Diagrams)

Technique
This is the simplest and most visual technique to show correlation. The values for the two sets of data are plotted as dots on a graph using a horizontal (x) axis and a vertical (y) axis. The independent variable is placed on the horizontal axis and the dependent variable on the vertical axis (Figure 2.15).
Figure 2.15 A typical scattergraph with positive correlation
[Scattergraph: independent variable (x) on the horizontal axis and dependent variable (y) on the vertical axis, both scaled 0 to 7]

If one of the values increases as the other increases, then it is a positive correlation (Figure 2.15). If all these points fall in a straight line rising from left to right (Figure 2.16), it is called a perfect positive correlation. If one value decreases as the other increases, then it is a negative correlation. If all the points fall in a straight line decreasing from left to right (Figure 2.17), it is called a perfect negative correlation.

Figure 2.16 Perfect positive correlation (coefficient = +1)
Figure 2.17 Perfect negative correlation (coefficient = –1)

The perfectly straight correlation lines (at 45°) of Figure 2.16 and Figure 2.17 rarely occur in reality, and a more likely situation is seen in Figure 2.15. When all the points have been plotted, a ‘best fit line’ (trend line) is drawn to show the trend of the data (Figure 2.18). The closer the points are to the trend line, the greater the association between the data. Points which occur well away from the best fit line are known as residuals or anomalies.

Exam hint: You should construct the best fit line so that there is an equal number of points on either side of the line. For greater accuracy, calculate the mean values of both variables and lightly mark the point where they intersect; the best fit line should go through this point. The best fit line in geography is a straight line (not a curve) and does not have to go through zero.
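The mean-point idea in the exam hint can be checked numerically. The sketch below is only an illustration under stated assumptions: it swaps the by-eye method for a least-squares line (a different technique, which always passes through the mean point), the function name `best_fit_line` is hypothetical, and the data are invented rather than taken from any figure in this booklet.

```python
from statistics import mean


def best_fit_line(xs, ys):
    """Return (slope, intercept, mean_point) of the least-squares straight line.

    Assumption: least squares stands in for the by-eye best fit line.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of deviations and sum of squared x deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # Choosing the intercept this way forces the line through the mean point
    intercept = y_bar - slope * x_bar
    return slope, intercept, (x_bar, y_bar)


# Hypothetical paired data: independent variable (x), dependent variable (y)
xs = [1, 2, 3, 4, 5, 6]
ys = [1.5, 1.0, 3.0, 3.5, 4.0, 6.0]

slope, intercept, (x_bar, y_bar) = best_fit_line(xs, ys)
direction = "positive" if slope > 0 else "negative" if slope < 0 else "none"
print(f"mean point = ({x_bar}, {y_bar:.2f})")
print(f"line: y = {slope:.2f}x + {intercept:.2f} ({direction} correlation)")
```

A line drawn by eye will not match this exactly, but marking the printed mean point on the graph gives a quick check that a hand-drawn line is reasonable.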
Figure 2.18 A scattergraph with best fit line
[Scattergraph: independent variable (x) against dependent variable (y), both scaled 0 to 7, showing the mean point and the best fit line, with an equal number of points (3) on either side of the best fit line]

Advantages and Disadvantages of Scattergraphs

Advantages
- Scattergraphs are useful in identifying patterns and trends from the data, and a good visual impression is produced
- You can easily identify any anomalies/variations in the data; such anomalies cannot be identified with Spearman’s rank correlation
- Easy to construct

Disadvantages
- It is sometimes difficult to insert best fit lines and to see any clear trend from the plotted data
- They are not an accurate measure of the degree of correlation

Spearman’s Rank Correlation Coefficient (rs)

This is a statistical measure of the strength of the relationship between two variables or two sets of data. The calculated values will lie within the range of +1 to –1.
- A coefficient of +1 indicates a perfect positive correlation
- A coefficient of –1 indicates a perfect negative correlation
However, it is very rare to find a perfect correlation.

Technique
There are two parts to the correlation:
1. A coefficient is calculated to give the degree of association between two variables (this on its own is meaningless, as it is just a figure)
2. The coefficient is then tested to determine its significance

1. Calculation of Coefficient
The Spearman’s rank correlation coefficient uses ‘ranked’ data (see Figure 2.28) and is carried out as follows:
a) Place the two variables in rank order (starting with the highest value as rank 1), i.e. 1, 2, 3, 4, 5. Where two or more values are the same, sum the ranks they would occupy and divide equally between them, e.g.
Ranks 2 & 3 = $\dfrac{2 + 3}{2} = 2.5$ (for each variable)
Ranks 2 & 3 & 4 = $\dfrac{2 + 3 + 4}{3} = 3$ (for each variable)
b) When this has been completed for both sets of data, subtract the rank for the second set of variables from the first set to obtain the difference in each rank (d)
c) Square the differences ($d^2$)
d) Total the differences squared ($\sum d^2$)
e) Apply the following formula to determine the coefficient. The coefficient will lie between +1 and –1

$r_s = 1 - \dfrac{6 \sum d^2}{n^3 - n}$

where
rs = Spearman’s rank correlation coefficient
d = the difference in ranks of each matched pair
$\sum$ = sum of
n = number of paired values

2. Testing the Significance (or probability of chance)
i) State the hypothesis in negative terms (called the Null Hypothesis or H0), which implies there is an absence of any relationship/association, as follows:
   H0 = there is no relationship between the 2 variables
ii) State the alternative hypothesis (H1):
   H1 = there is a negative/positive correlation between the two variables
iii) The Null Hypothesis will either be accepted or rejected by using a graph (Figure 2.19) or table (Figure 2.20) to determine the amount of chance association between the 2 variables (Note: the graph is on a log scale)
iv) Plot the point on the graph using the coefficient and the degrees of freedom (n – 2, which means subtract 2 from the number of pairs in the correlation) to find the significance level
v) H0 is accepted or rejected (5% is the rejection level). If rejected, state the significance level:
   If between the 5% and 1% line = 5% significance level
   If between the 1% and 0.1% line = 1% significance level
   If above the 0.1% line = 0.1% significance level (highly or 99.9% significant, i.e.
only 0.1% likelihood that it is due to chance)

Figure 2.19 Graph for use in interpreting Correlation Coefficient
[Graph: Spearman’s rank correlation coefficient (0.1 to 1.0) on the vertical axis against degrees of freedom (number of pairs of items in sample – 2; log scale from 2 to 80) on the horizontal axis, with lines for the 5%, 1% and 0.1% significance levels (likelihood of the correlation occurring by chance). Annotation: unable to reject H0 where the significance level is above 5%; hence the 5% level of significance is known as the rejection level.]

Critical values can also be found using the table in Figure 2.20. The value of rs for any given number of pairs (n) must be equal to or greater than the value shown to gain the level of significance.

Figure 2.20 Critical Values of rs for Spearman’s rank correlation coefficient

Number of pairs (n)    Levels of significance
                       5%       1%
10                     0.65     0.78
12                     0.59     0.72
14                     0.54     0.67
16                     0.50     0.63

2.4 Comparative Tests

Chi-squared test (χ2)

The chi-squared (χ2) test is a significance test that examines the difference between a set of collected data (called observed data) and a theoretical set of data (called expected data). It may also be used to find whether there is a difference between two sets of observed data. So the test could be used to examine the angularity/size of river bedload in different stages of a river’s course, or pedestrian/traffic flows at different times of the day.

However, before you use this test you should understand that there are certain conditions that need to be met:
a) The data must be in the form of ‘frequencies’ counted in a number of different categories (percentages cannot be used). If the data has been obtained by measurement, then it must be grouped into different classes or categories before it can be used.
For example, if data was collected on the size of particles making up the bedload of a river (measured along the long axis), the particle sizes would need to be grouped into classes such as: up to 15mm, 16–30mm, over 30mm.
b) The total number of observed data must be greater than 20 for the test to have any meaning. The expected frequency in any one cell should not normally be less than 5.
c) Only one set of collected data is needed for this test. If two sets of collected data are used they should be independent: one must not be dependent on the other.

Let’s examine how this test could be used in practice with regard to hydrological studies examining pebble roundness.

Technique (Worked Example – with one set of observed values)

Forty pebbles were collected in the field, using random sampling (see section 4.1), in the upper course of a river in order to investigate the effect of the river’s course on pebble roundness, using a simple classification based on observation.

1) State the hypotheses you are going to use: firstly the null hypothesis (H0) and then the alternative hypothesis (H1). The null hypothesis implies there is no difference between the observed and expected data. The alternative states there is a difference between the observed and expected data.
   H0: the upper course of a river will have no effect on pebble roundness
   H1: the upper course of a river will have an effect on pebble roundness

2) Make a contingency table (Figure 2.21) into which you can insert the observed data (O) and expected (theoretical) data (E). Each box in the table is called a ‘cell’. The table shows the actual number of pebbles collected of various shapes (O) and the expected number (E). Of the 40 pebbles examined, 14 were angular, 16 sub-angular, 7 sub-rounded and 3 rounded. In theory you would expect an even distribution of pebbles of different roundness, so the expected value (E) is the mean value, i.e. 10 for each roundness class (40 divided by 4).
Note: the contingency table may be composed of any number of columns and rows. However, with more columns there is a greater likelihood that one or more cells in the expected data will fail to achieve the minimum value of 5.

Figure 2.21 Contingency table for calculating χ2

                     Angular    Sub angular    Sub rounded    Rounded
Observed (O)         ____       ____           ____           ____
Expected (E)         ____       ____           ____           ____
O − E                ____       ____           ____           ____
(O − E)²             ____       ____           ____           ____
(O − E)² / E         ____       ____           ____           ____

3) Complete the table (Figure 2.22) by calculating the chi-squared value as follows:
- In each category subtract the expected data (E) from the observed data (O) to obtain O − E in each cell.
- O − E is then squared to give $(O - E)^2$ in each cell, and then divided by the expected data (E) to give $\dfrac{(O - E)^2}{E}$.
- The figures obtained in the cells are added together to find the chi-squared value using the formula:

$\chi^2 = \sum \dfrac{(O - E)^2}{E}$

Figure 2.22 Calculation of χ2

                     Angular    Sub angular    Sub rounded    Rounded
Observed (O)         14         16             7              3
Expected (E)         10         10             10             10
O − E                4          6              -3             -7
(O − E)²             16         36             9              49
(O − E)² / E         1.6        3.6            0.9            4.9

$\chi^2 = \sum \dfrac{(O - E)^2}{E} = 11.0$

4) Interpret the χ2 figure using a ‘Table of critical values’ (Figure 2.23). The values in the table show levels of probability and degrees of freedom. The correct degrees of freedom are found by subtracting 1 from the number of categories (observed frequencies) in the set. As there is one observed frequency for each of the four shapes of pebbles, the degrees of freedom are 4 – 1 = 3. Now check this figure against the level of probability (p). p = 5% (0.05) means that only 5 times in 100 the result could be due to chance; p = 0.1% (0.001) is the highest level of significance shown and means there is only 1 chance in 1000 that the result could be due to chance. The critical values given must be equalled or exceeded at the relevant degrees of freedom to achieve the given level of probability. It can be seen here that the chi-squared figure of 11.0 is below the 1% level of probability (which is 11.35) but above the 5% level of 7.82.
Therefore, it is significant at the 5% level, so the H0 hypothesis can be rejected and H1 is accepted.

Figure 2.23 Table of critical values for χ2

Degrees of freedom    Levels of probability (p)
                      5% (0.05)    1% (0.01)    0.1% (0.001)
1                     3.84         6.64         10.83
2                     5.99         9.21         13.82
3                     7.82         11.35        16.27
4                     9.49         13.28        18.47
5                     11.07        15.09        20.52
6                     12.59        16.81        22.46
7                     14.07        18.48        24.32
8                     15.51        20.09        26.13
9                     16.92        21.67        27.88
10                    18.31        23.21        29.59

Worked Example (with 2 sets of observed data)

Let us examine pebble roundness with two sets of data collected in both the upper and middle course of a river.

1) Again state the hypotheses you are going to use: first the null hypothesis (H0) and then the alternative hypothesis (H1)
   H0: there is no difference in pebble roundness in the upper and middle course of a river
   H1: there is a difference in pebble roundness in the upper and middle course of a river

2) Make a contingency table and insert the observed data (O) for both the upper course and middle course of the river. The table shows the actual number of pebbles collected of various shapes (O). There were 14 angular pebbles, 16 sub-angular, 7 sub-rounded and 3 rounded in the upper course. In the middle course, 38 pebbles were measured: 2 angular, 8 sub-angular, 18 sub-rounded and 10 rounded. Add up the rows and columns and complete the totals (Figure 2.24).

Figure 2.24 Observed Frequencies (O)

                   Angular    Sub angular    Sub rounded    Rounded    Total
Upper course       14         16             7              3          40 (row)
Middle course      2          8              18             10         38 (row)
Total              16         24             25             13         78 (grand total)

3) Using the formula below, work out the expected frequencies (E) for each cell (Figure 2.25). This is the number of pebbles of each roundness you would expect to find at each location. Round to one decimal place.
$E = \dfrac{\text{cell row total} \times \text{cell column total}}{\text{grand total}}$

Figure 2.25 Expected Frequencies (E)

                   Angular               Sub angular            Sub rounded            Rounded               Total
Upper course       (40 × 16)/78 = 8.2    (40 × 24)/78 = 12.3    (40 × 25)/78 = 12.8    (40 × 13)/78 = 6.7    40 (row)
Middle course      (38 × 16)/78 = 7.8    (38 × 24)/78 = 11.7    (38 × 25)/78 = 12.2    (38 × 13)/78 = 6.3    38 (row)
Total              16                    24                     25                     13                    78 (grand total)

4) Work out the χ2 value for each cell using the formula $\dfrac{(O - E)^2}{E}$. Total all the values in the cells to find the calculated χ2 value.
χ2 value = 20.3

Figure 2.26: χ2 values

                   Angular                  Sub angular                 Sub rounded                 Rounded                   Total
Upper course       (14 − 8.2)²/8.2 = 4.1    (16 − 12.3)²/12.3 = 1.1     (7 − 12.8)²/12.8 = 2.6      (3 − 6.7)²/6.7 = 2.0      9.8
Middle course      (2 − 7.8)²/7.8 = 4.3     (8 − 11.7)²/11.7 = 1.2      (18 − 12.2)²/12.2 = 2.8     (10 − 6.3)²/6.3 = 2.2     10.5
Total              8.4                      2.3                         5.4                         4.2                       20.3 (total)

5) Interpret the chi-squared value of 20.3 using a ‘Table of critical values’ (Figure 2.23). The values in the table show levels of probability and degrees of freedom (V). Use the following formula to find the degrees of freedom:

$V = (r - 1) \times (c - 1)$

where
r = number of rows in the contingency table
c = number of columns in the contingency table
V = degrees of freedom

Degrees of freedom (V) = 1 × 3 = 3

It can be seen in Figure 2.23 that, with 3 degrees of freedom, the chi-squared value of 20.3 is above the 0.1% level of probability (which is 16.27). Therefore, it is significant at the 0.1% level, so the H0 hypothesis can be rejected (at 0.1% there is only 1 chance in 1000 that the result could be due to chance) and H1 is accepted. Therefore, there is a difference in pebble roundness in the upper and middle course of a river.

Mann-Whitney U test

The Mann-Whitney U test can be used if you wish to test for significant differences between two independent sets of data. It tells us whether there is a statistically significant difference between the median values of the two sets of data, although it is not necessary to calculate the median values.
The value of the test lies in the simplicity of its calculation. It can also be used in a wide variety of situations, even when there are small samples in the data. For example, it could be used to compare food prices in local shops with supermarkets, to compare numbers of species of vegetation in a sand dune transect, or to compare traffic or pedestrian flows in different locations in the CBD. Its usefulness stems from the following:
a) The data used can be either at the interval or ordinal level (i.e. an order of magnitude), as long as it can be arranged in a rank order. For example, it could be used to measure the intensity of colour in a soil sample, by describing it as light brown, mid-brown, dark brown, black.
b) The data can be counted or calculated (a t-test is used for measured data, such as river velocity) and the samples can be of different sizes.
c) It can be used for small samples of data (unlike many statistical tests), provided that both samples have at least one measurement and one of them has at least 5 measurements.
d) The data does not have to come from a population with a normal distribution.

Technique
1) State the null hypothesis (H0) and the alternative hypothesis (H1). The null hypothesis implies there is no difference between the two sets of data. The alternative states there is a difference between the two sets of data.
2) Arrange the data in a rank order sequence (lowest to highest), with the identity of each group retained, usually called sample A and B. If the samples are of different sizes, the smaller sample is usually designated as sample A (nA).
3) Examine the rank order and, for each measurement in sample B, count how many values in sample A are smaller and record this in a table. A measurement in sample A that is the same as the one in sample B counts as 0.5. The total of these counts is called UA.
4) Repeat the procedure for sample A by counting how many values in sample B are smaller than each value in sample A. But note this figure (UB) can be found from a formula:

$U_B = n_A \times n_B - U_A$

where n = number in the sample

The smaller of the two values UA or UB is the value taken as the U value.
5) Refer to the table of critical values of U at the 5% significance level (Figure 2.27). Read the sample sizes from the top and left-hand sides of the table to find the critical value.
6) If the U value is less than or equal to the critical value, the null hypothesis can be rejected, i.e. there is a significant difference between the two samples at the 5% significance level.

Figure 2.27: Critical values of U at the 5% level
Columns: size of the smallest sample (n1). Rows: size of the largest sample (n2).

n2\n1   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
 1      –    –    –    –    –    –    –    –    –    –    –    –    –    –    –    –    –    –    –    –
 2      –    –    –    –    –    –    –    0    0    0    0    1    1    1    1    1    2    2    2    2
 3      –    –    –    –    0    1    1    2    2    3    3    4    4    5    5    6    6    7    7    8
 4      –    –    –    0    1    2    3    4    4    5    6    7    8    9   10   11   11   12   13   13
 5      –    –    0    1    2    3    5    6    7    8    9   11   12   13   14   15   17   18   19   20
 6      –    –    1    2    3    5    6    8   10   11   13   14   16   17   19   21   22   24   25   27
 7      –    –    1    3    5    6    8   10   12   14   16   18   20   22   24   26   28   30   32   34
 8      –    0    2    4    6    8   10   13   15   17   19   22   24   26   29   31   34   36   38   41
 9      –    0    2    4    7   10   12   15   17   20   23   26   28   31   34   37   39   42   45   48
10      –    0    3    5    8   11   14   17   20   23   26   29   33   36   39   42   45   48   52   55
11      –    0    3    6    9   13   16   19   23   26   30   33   37   40   44   47   51   55   58   62
12      –    1    4    7   11   14   18   22   26   29   33   37   41   45   49   53   57   61   65   69
13      –    1    4    8   12   16   20   24   28   33   37   41   45   50   54   59   63   67   72   76
14      –    1    5    9   13   17   22   26   31   36   40   45   50   55   59   64   67   74   78   83
15      –    1    5   10   14   19   24   29   34   39   44   49   54   59   64   70   75   80   85   90
16      –    1    6   11   15   21   26   31   37   42   47   53   59   64   70   75   81   86   92   98
17      –    2    6   11   17   22   28   34   39   45   51   57   63   67   75   81   87   93   99  105
18      –    2    7   12   18   24   30   36   42   48   55   61   67   74   80   86   93   99  106  112
19      –    2    7   13   19   25   32   38   45   52   58   65   72   78   85   92   99  106  113  119
20      –    2    8   13   20   27   34   41   48   55   62   69   76   83   90   98  105  112  119  127

Dashes indicate no decision possible at the stated level of significance

Worked example of Mann-Whitney U test

A study was carried out on a sand dune ecosystem in Lancashire in order to find the differences in numbers of vegetation species in a succession at 50m and 300m from the shoreline. A 50m tape was laid down parallel to the shoreline at 50m and 300m inland, and points were selected randomly along the tape using a table of random numbers. At each randomly selected point a quadrat was laid down and the number of species of vegetation counted. The numbers of species recorded were as follows:

At 300m inland: 3, 4, 6, 6, 7, 7
At 50m inland: 0, 1, 1, 2, 3, 4, 5

1) State the hypotheses you are going to use: first the null hypothesis (H0) and then the alternative hypothesis (H1)
   H0: there is no significant difference in the number of species of vegetation at 50m and 300m inland from the shoreline.
   H1: there is a significant difference in the number of species of vegetation at 50m and 300m inland from the shoreline.

2) The data was arranged in a rank order sequence (lowest to highest), with the identity of each group retained, called samples A and B. The smaller sample at 300m is designated as sample A (nA).
   Sample A (300m inland): 3, 4, 6, 6, 7, 7
   Sample B (50m inland): 0, 1, 1, 2, 3, 4, 5

3) Examine the rank order and, for each measurement in sample B, count how many values in sample A are smaller and record them. If any measurement in sample A is the same as the one in sample B, this counts as 0.5 (this would apply to the recordings of 3 and 4). The recording of 4 is designated as 1.5 because it has one measurement in sample A which is smaller and one the same. The total of these scores is then called UA.
Measurements in Sample B:                       0,  1,  1,  2,  3,    4,    5
Number of smaller measurements at Sample A:     0,  0,  0,  0,  0.5,  1.5,  2
Therefore, UA = sum of the scores of sample B = 4

4) Repeat the procedure for sample A by counting how many values in sample B are smaller than each value in sample A.

Recordings at Sample A:                         3,    4,    6,  6,  7,  7
Number of smaller measurements at Sample B:     4.5,  5.5,  7,  7,  7,  7
Therefore, UB = sum of the scores of sample A = 38

Note: there is no need to list all the data in this sample because it can be calculated using the following formula:

$U_B = n_A \times n_B - U_A$, where n = number in the sample
UB = (6 × 7) – 4 = 38

The smaller of the two values UA or UB is the value taken as the U value.
U value is UA = 4

5) Refer to the table of critical values of U at the 5% significance level (Figure 2.27). Read the sample sizes from the top and left-hand side of the table to find the critical value. If the U value is less than or equal to the critical value, the null hypothesis can be rejected at the 5% significance level. The U value of 4 is indeed less than the critical value of 6. Therefore, the null hypothesis can be rejected at the 5% significance level. We can accept the alternative hypothesis that there is a significant difference in the number of species of vegetation at 50m and 300m inland from the shoreline.

Exercise 2.4 – A2 Skills: Statistical skills, Mann-Whitney U test

The following data on traffic flow was recorded in the centre and at the edge of the CBD over 5-minute periods at different times of the day.

Sample        1      2      3      4      5     6     7     8
CBD centre    220    150    162    110    62    85    46    102
CBD edge      143    88     56     97     42    40    63    88

1) State (a) the null hypothesis and (b) the alternative hypothesis.
a)
b)
(2 marks)

2) Carry out the Mann-Whitney U test to determine whether there is a significant difference in the results.
[Working table: columns for Sample A and Sample B, with a Total row giving UA and UB]
(6 marks)

3) Comment on the significance of the difference between the two sets of results.
(2 marks)
(Total marks = 10)

2.5 Examination Questions

Examination Assignment 2.1 (Skills: graphical skills, scattergraphs, best fit lines, handling data)

1) What is the meaning of the term infant mortality rate?
(2 marks)

2) Using the data in Figure 2.28, construct a scattergraph (Figure 2.27) to show the relationship between per capita GDP (gross domestic product) and infant mortality rate.
(5 marks)

Exam hint: Make sure you label the axes and place the independent and dependent variables on the correct axes.

3) Draw a best fit line to show the trend of the data. Circle any anomalies.
(2 marks)

4) Explain how you drew the best fit line.
(3 marks)

5) What conclusions can you draw from the completed scattergraph?
(4 marks)

Figure 2.27 A scattergraph to show the relationship between per capita GDP and infant mortality rate

6) Describe the advantages of using the scattergraph as a means of analysing data.
(4 marks)

7) Describe the factors that influence infant mortality rates in countries at different stages of development.
(5 marks)
(Total marks = 25)

Examination Assignment 2.2 (Skills: statistical skills, Spearman’s rank correlation, handling statistical data, significance levels, drawing conclusions from results)

1) Complete the Spearman’s rank correlation table (Figure 2.28)
(6 marks)

Figure 2.28 Infant Mortality/Per Capita GDP* (2007)

Country        per capita GDP (US$)    rank    Infant mortality rate    rank    d      $d^2$
Canada         38200                   1       4.6                      11      -10    100
Belarus        10000                   ____    6.6                      ____    ____   ____
Sweden         36900                   2       2.8                      14      -12    144
N. Zealand     27300                   6       5.7                      9       -3     9
France         33800                   4       3.4                      13      -9     81
U.K.           35300                   3       5.0                      10      -7     49
India          27000                   7       34.6                     1       6      36
Spain          33700                   5       4.3                      12      -7     49
Poland         16200                   8       7.1                      7       1      1
Philippines    3300                    ____    22.1                     ____    ____   ____
Egypt          5400                    ____    29.5                     ____    ____   ____
Romania        10000                   ____    24.6                     ____    ____   ____
Brazil         9700                    ____    27.6                     ____    ____   ____
Russia         14600                   ____    11.1                     ____    ____   ____

$\sum d^2 =$ ____

*per capita GDP (or Gross Domestic Product) means the average income per person

2) Use the following formula to calculate the Spearman’s rank correlation coefficient (rs) between per capita GDP and infant mortality.

$r_s = 1 - \dfrac{6 \sum d^2}{n^3 - n}$

(2 marks)

3) State the Null Hypothesis (H0)
(1 mark)

4) Using the correlation graph (Figure 2.19), give the level of significance of your results. Can you accept or reject the Null Hypothesis (H0)?
(3 marks)

5) What conclusions can be drawn from your results and what are the reasons for them?
............................................................
............................................................
............................................................
(7 marks)

Exam hint: Remember that correlation can be either positive or negative. Both are equally valid as long as the trend is obvious or the coefficient is significant. The coefficient can only vary between +1 and –1 (if your coefficient falls outside this range, go back and check your calculations!).

6) Assess the strengths and weaknesses of the Spearman’s rank correlation test for analysing data.
............................................................
............................................................
............................................................
............................................................
............................................................
............................................................
............................................................
............................................................
............................................................
(6 marks)

(Total marks = 25)
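As a check on hand calculations, the Spearman’s rank formula from question 2 can be sketched in Python. This is only an illustrative sketch: the function names (rank_descending, spearman_rank) and the five-country data set are hypothetical and are not taken from the table in this exercise. Rank 1 is given to the highest value in each data set, and tied values are assumed to share the mean of the ranks they occupy.

```python
def rank_descending(values):
    """Rank values so the largest gets rank 1; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' across any run of tied values.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rank(x, y):
    """Return rs = 1 - 6 * sum(d^2) / (n^3 - n) for two paired data sets."""
    if len(x) != len(y):
        raise ValueError("both data sets must have the same number of values")
    n = len(x)
    if n < 2:
        raise ValueError("at least two pairs of values are needed")
    rank_x = rank_descending(x)
    rank_y = rank_descending(y)
    # d is the difference between the two ranks for each pair.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n ** 3 - n)


# Hypothetical data for five countries (not the worksheet's figures).
gdp = [30000, 25000, 12000, 8000, 3000]
imr = [5.5, 4.0, 15.0, 40.0, 20.0]

rs = spearman_rank(gdp, imr)
print(round(rs, 2))  # -0.8
```

Here the ranks differ by -3, -3, 0, 3 and 3, so Σd² = 36 and rs = 1 - 216/120 = -0.8, a strong negative correlation that lies within the permitted range of +1 to -1. Note that this formula is only approximate when the data contain tied ranks, and the result must still be compared against critical values to judge its significance.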