Psy B07 DESCRIBING AND EXPLORING DATA Chapter 2

Outline: plotting data; grouping data; terminology; notation; measures of central tendency; measures of variability; properties of a statistic.

Plotting Data

Once a set of data has been collected, the raw numbers must be manipulated in some fashion to make them more informative. Several options are available, including plotting the data and calculating descriptive statistics.

Raw data: age (years) and weight (lb) of the 48 students in a typical second-year course (made-up data). Ages and weights are listed in the same order.

Age      18  26  21  21  25  18  20  21  18  21  21  21  20  21  20  23  22  20  21  22  24  26  19  19
Weight  107 115 108 111 163 119 119 200 178 135 143 113 103 166 112 151 192 135 117 138 137 161 117 142

Age      20  21  20  19  19  21  22  19  20  20  19  19  19  20  20  19  20  20  20  22  22  19  23  20
Weight  108 110 109 127 143 121 112 136 161 131 144 123 101 193 127 158 149 138 129 138 137 156 122 132

Often, the first thing one does with a set of raw data is to plot a frequency distribution. Usually this is done by first creating a table of the frequencies broken down by values of the relevant variable; the frequencies in the table are then plotted in a histogram.

Example: Typical age in a second-year course

Age         18  19  20  21  22  23  24  25  26
Frequency    3  10  14  10   5   2   1   1   2

Note: The frequencies in the table were calculated by simply counting the number of subjects having the specified value for the age variable.

[Figure: histogram of the age frequencies; x-axis: Age (18 to 26), y-axis: Frequency (0 to 16)]

Grouping Data

Plotting is easy when the variable of interest has a relatively small number of values (as our age variable did). However, the values of a variable are sometimes more continuous, resulting in uninformative frequency plots if done in the above manner.
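The counting step behind the age frequency table can be sketched in a few lines of Python. This is only an illustration, not part of the course materials: the variable names are assumptions, and the ages are typed in from the raw data listed earlier.

```python
from collections import Counter

# Ages of the 48 students, copied from the raw data table (made-up data).
ages = [
    18, 26, 21, 21, 25, 18, 20, 21, 18, 21, 21, 21,
    20, 21, 20, 23, 22, 20, 21, 22, 24, 26, 19, 19,
    20, 21, 20, 19, 19, 21, 22, 19, 20, 20, 19, 19,
    19, 20, 20, 19, 20, 20, 20, 22, 22, 19, 23, 20,
]

# Count how many students have each age value.
age_counts = Counter(ages)

# Print the frequency table, plus one '#' per student as a sideways histogram.
for age in sorted(age_counts):
    print(f"{age:>3} {age_counts[age]:>3} {'#' * age_counts[age]}")
```

Each printed row shows an age, its frequency, and a bar of that length, which is the histogram turned on its side.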
For example, our weight variable ranges from 100 lb to 200 lb. If we used the previously described technique, we would end up with about 100 bars, most of them with a frequency of less than 2 or 3 (and many with a frequency of zero). We can get around this problem by grouping our values into bins. Try for around 10 bins with natural splits.

Weight bin (lb)   Midpoint   Frequency
100 - 109          104.5         6
110 - 119          114.5        10
120 - 129          124.5         6
130 - 139          134.5        10
140 - 149          144.5         5
150 - 159          154.5         3
160 - 169          164.5         4
170 - 179          174.5         1
180 - 189          184.5         0
190 - 199          194.5         2
200 - 209          204.5         1

[Figure: histogram of the binned weights; x-axis: Weight (lbs), bin midpoints 104.5 to 204.5; y-axis: Frequency (0 to 12)]

Check out this demo, which shows how the bin width you select can affect the “look” of the data. Here is another, similar demonstration of the effects of bin width. See the section in the text on cumulative frequency distributions.

Terminology

Frequency histograms often have a roughly symmetrical bell shape; such distributions are called normal or Gaussian.

Height (inches), bin midpoint   60.5  62.5  64.5  66.5  68.5  70.5  72.5  74.5  76.5
Frequency                          3     8     7    12     7     6     4     0     1

[Figure: histogram of the height frequencies; x-axis: Height (Inches), y-axis: Frequency (0 to 14)]

Sometimes the bell shape is not symmetrical. The term positive skew refers to the situation where the “tail” of the distribution is to the right; negative skew is when the “tail” is to the left.

[Figure: the height histogram above shown next to a second, asymmetrical histogram whose tail extends to the right; y-axis: Frequency (0 to 14)]

Bin midpoint   0.75  2.75  4.75  6.75  8.75  10.75  12.75  14.75  16.75  18.75  20.75
Frequency         7    13    12     5     5      2      0      1      1      0      1
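Returning to the weight example, the grouping into 10 lb bins described earlier can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `bin_lower_limit` and the constants are hypothetical, and the weights are typed in from the raw data table.

```python
from collections import Counter

# Weights (lb) of the 48 students, copied from the raw data table (made-up data).
weights = [
    107, 115, 108, 111, 163, 119, 119, 200, 178, 135, 143, 113,
    103, 166, 112, 151, 192, 135, 117, 138, 137, 161, 117, 142,
    108, 110, 109, 127, 143, 121, 112, 136, 161, 131, 144, 123,
    101, 193, 127, 158, 149, 138, 129, 138, 137, 156, 122, 132,
]

BIN_WIDTH = 10   # assumed width, matching the 100-109, 110-119, ... bins
FIRST_BIN = 100  # lower limit of the first bin


def bin_lower_limit(weight):
    """Return the lower limit of the bin that contains `weight`."""
    return FIRST_BIN + ((weight - FIRST_BIN) // BIN_WIDTH) * BIN_WIDTH


# Count how many weights fall into each bin, keyed by the bin's lower limit.
bin_counts = Counter(bin_lower_limit(w) for w in weights)

# Print bin limits, midpoint and frequency; empty bins show a count of 0.
for lower in range(FIRST_BIN, 210, BIN_WIDTH):
    upper = lower + BIN_WIDTH - 1
    midpoint = (lower + upper) / 2
    print(f"{lower}-{upper}  {midpoint:5.1f}  {bin_counts[lower]:>2}")
```

Changing `BIN_WIDTH` (and the upper limit of the loop to match) is a quick way to see how the choice of bin width changes the shape of the resulting histogram.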
Notation

Variables

When we describe a set of data corresponding to the values of some variable, we will refer to that set using a letter such as X or Y. When we want to talk about specific data points within that set, we specify those points by adding a subscript to the letter, as in $X_1$.

Value    5      8      12     3      6      8      7
Label    $X_1$  $X_2$  $X_3$  $X_4$  $X_5$  $X_6$  $X_7$

The Greek letter sigma, which looks like $\Sigma$, means “add up” or “sum” whatever follows it. Thus, $\sum X_i$ means “add up all the $X_i$s”. If we use the $X_i$s from the previous example, $\sum X_i = 5 + 8 + 12 + 3 + 6 + 8 + 7 = 49$ (or just $\sum X$).

Nasty Example

Student   Midterm Mark (X)   Real Mark (Y)
1               82                84
2               66                51
3               70                72
4               81                56
5               61                73

$\sum X = 360$
$\sum Y = 336$
$\sum (X - Y) = 24$
$\sum X^2 = 26262$
$(\sum X)^2 = 129600$

Note the difference between the last two: $\sum X^2$ squares each score and then adds, whereas $(\sum X)^2$ adds the scores and then squares the total.

Your turn

$\sum XY = 24283$
$(\sum (X - Y))^2 = 576$
$\sum (X^2 - Y^2) = 2956$

Sometimes things are made more complicated because letters (e.g., X) are sometimes used to refer to entire data sets (as opposed to single variables), and multiple subscripts are used to specify specific data points.

Example: scores for 3 students over 5 weeks.
Week: 1 2 3 4 5
Student: 1 2 3
Scores: 7 6 4 2 2 3 4 4 3 4 3 4 5 4 6

$X_{24} = 3$
$\sum X$ or $\sum X_{ij} = 61$

Measures of Central Tendency

While distributions provide an overall picture of a data set, it is sometimes desirable to represent the entire data set using descriptive statistics. The first descriptive statistics we will discuss are those used to indicate where the centre of the distribution lies.

[Figure: the height histogram shown earlier; x-axis: Height (Inches), bin midpoints 60.5 to 76.5; y-axis: Frequency (0 to 14)]

There are, in fact, three different measures of central tendency. The first of these is called the mode.
The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.

Note that if you have plotted a frequency histogram, you can often identify the mode simply by finding the value with the highest bar. However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).

To find the mode, create a non-grouped frequency table as described previously, then identify the value with the greatest frequency.

Example: Class height.

Value   61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76
Freq     3   4   4   4   3   7   5   4   3   2   4   4   0   0   0   1

Here the mode is 66 inches, the value with the greatest frequency (7).

A second measure of central tendency is called the median. The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below it).

To find the median, the data points must first be sorted into either ascending or descending numerical order.
The position of the median value can then be calculated using the following formula:

$\text{Median Location} = \frac{N + 1}{2}$

1) If there is an odd number of data points: (1, 3, 3, 4, 4, 5, 6, 7, 12)

$\text{Median Location} = \frac{9 + 1}{2} = 5$

The median is the item in the fifth position of the ordered data set; therefore the median is 4.

2) If there is an even number of data points: (1, 3, 3, 3, 5, 5, 6, 7)

$\text{Median Location} = \frac{8 + 1}{2} = 4.5$

We take the average of the two adjacent values (the fourth and fifth scores, 3 and 5), in this case giving us 4.

Finally, the most commonly used measure of central tendency is called the mean (denoted $\bar{X}$ for a sample and $\mu$ for a population). The mean is the same as what most of us call the average, and it is calculated in the following manner:

$\bar{X} = \frac{\sum X}{N}$

For example, given the data set that we used to calculate the median (odd-number example), the corresponding mean would be:

$\bar{X} = \frac{\sum X}{N} = \frac{45}{9} = 5$

When a distribution is fairly symmetrical, the mean, median, and mode will be quite similar. However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different.

This raises the issue of which measure is best.

Example: Pizza Eating (slices per week)

Value    0   1   2   3   4   5   6   8  10  15  16  20  40
Freq     4   2   8   6   6   6   5   5   2   1   1   1   1

Mode = 2 slices per week
Median = 4 slices per week
Mean = 271/48 ≈ 5.6 slices per week

Note that if you were calculating these values, you would show all your steps (it’s good to be a prof!).

Here is a demonstration that allows you to change a frequency histogram while simultaneously noting the effects of those changes on the mean versus the median.
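Alongside the interactive demonstration, the three measures can also be recomputed directly from the pizza-eating frequency table. The sketch below is illustrative only: the function name `median_by_location` and the variable names are assumptions, and the median is found with the $(N + 1)/2$ location rule given above.

```python
def median_by_location(values):
    """Median via the location rule: sort, then use position (N + 1) / 2."""
    ordered = sorted(values)
    location = (len(ordered) + 1) / 2
    below = ordered[int(location) - 1]        # score at floor(location), 1-based
    above = ordered[int(location + 0.5) - 1]  # score at ceil(location), 1-based
    return (below + above) / 2                # identical scores when N is odd


# Pizza-eating example: slices per week -> number of students.
pizza_freq = {0: 4, 1: 2, 2: 8, 3: 6, 4: 6, 5: 6, 6: 5,
              8: 5, 10: 2, 15: 1, 16: 1, 20: 1, 40: 1}

# Expand the table into one score per student.
scores = [value for value, freq in pizza_freq.items() for _ in range(freq)]

mode = max(pizza_freq, key=pizza_freq.get)  # value with the highest frequency
median = median_by_location(scores)
mean = sum(scores) / len(scores)            # sum of X divided by N

print(f"N = {len(scores)}, mode = {mode}, median = {median}, mean = {mean:.2f}")
```

Editing a single frequency in `pizza_freq` (for example, adding another student at 40 slices) and rerunning shows how strongly the mean reacts to extreme scores compared with the median and the mode.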
As you use the demo, you should easily be able to think about how these changes are also affecting the mode, right?

Measures of Variability

In addition to knowing where the centre of the distribution is, it is often helpful to know the degree to which individual values cluster around the centre. This is known as variability.

There are various measures of variability, the most straightforward being the range of the sample:

Range = highest value minus lowest value

While the range provides a good first pass at variability, it is not the best measure because of its sensitivity to extreme scores (see text).

One approach to estimating variability is to directly measure the degree to which individual data points differ from the mean and then average those deviations. This is known as the average deviation:

$\frac{\sum (X - \bar{X})}{N}$

However, if we try to do this with real data, the result will always be zero, because the deviations above the mean exactly cancel the deviations below it:

Example: (2, 3, 3, 4, 4, 6, 6, 12)

$\frac{\sum (X - \bar{X})}{N} = \frac{\sum (-3, -2, -2, -1, -1, 1, 1, 7)}{8} = \frac{0}{8} = 0$

One way to get around the problem with the average deviation is to use the absolute value of the differences, instead of the differences themselves. The absolute value of a number is just the number without any sign. For example, $|-3| = 3$ and $|+3| = 3$.

Thus, we could rewrite and solve our average deviation question as follows (MAD stands for mean absolute deviation):

$\text{MAD} = \frac{\sum |X - \bar{X}|}{N} = \frac{3 + 2 + 2 + 1 + 1 + 1 + 1 + 7}{8} = \frac{18}{8} = 2.25$

Therefore, this data set has a mean of 5 and a MAD of 2.25.

Although the MAD is an acceptable measure of variability, the most commonly used measure is the variance (denoted $s^2$ for a sample and $\sigma^2$ for a population) and its square root, termed the standard deviation (denoted $s$ for a sample and $\sigma$ for a population).
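The range, average deviation and MAD calculations for the example data set (2, 3, 3, 4, 4, 6, 6, 12) can be checked with a short script. This is a sketch for illustration; the variable names are assumptions rather than anything defined in the course.

```python
scores = [2, 3, 3, 4, 4, 6, 6, 12]
n = len(scores)

# Range: highest value minus lowest value.
value_range = max(scores) - min(scores)

# Mean: sum of the scores divided by N.
mean = sum(scores) / n

# Signed deviations from the mean; these always sum to zero.
deviations = [x - mean for x in scores]
average_deviation = sum(deviations) / n

# Mean absolute deviation: drop the signs before averaging.
mad = sum(abs(d) for d in deviations) / n

print(f"range = {value_range}, mean = {mean}")
print(f"average deviation = {average_deviation}, MAD = {mad}")
```

Replacing the 12 with a larger score changes the range by the full amount of the change, which is one way to see its sensitivity to extreme scores.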
The computation of variance is also based on the basic notion of the average deviation; however, instead of getting around the “zero problem” by using absolute deviations (as in MAD), the “zero problem” is eliminated by squaring the differences from the mean:

$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$

Example: (2, 3, 4, 4, 4, 5, 6, 12)

$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{9 + 4 + 1 + 1 + 1 + 0 + 1 + 49}{8} = \frac{66}{8} = 8.25$

To convert the variance into the standard deviation (SD), we simply take its square root:

$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{9 + 4 + 1 + 1 + 1 + 0 + 1 + 49}{8}} = \sqrt{8.25} = 2.87$

This demonstration allows you to play with the mean and standard deviation of a distribution. Note that changing the mean of the distribution simply moves the entire distribution to the left or right without changing its shape. In contrast, changing the standard deviation alters the spread of the data but does not affect where the distribution is “centered”. DEMO

Population vs. Sample

As mentioned, we usually deal with statistics, not parameters. $\sigma^2$ and $\sigma$ are parameters. Their counterparts when dealing with samples are $s^2$ and $s$. The formulae are slightly different:

$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$

$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$

Properties of a Statistic

So, the mean ($\bar{X}$) and variance ($s^2$) are the descriptive statistics that are most commonly used to represent the data points of a sample. The real reason that they are the preferred measures of central tendency and variability is that they have certain properties as estimators of their corresponding population parameters, $\mu$ and $\sigma^2$.

Four properties are considered desirable in a population estimator: sufficiency, unbiasedness, efficiency, and resistance. Both the mean and the variance are the best estimators in their class in terms of the first three of these four properties.
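Going back to the variance example (2, 3, 4, 4, 4, 5, 6, 12), the population and sample formulas can be compared side by side. The sketch below is only an illustration with assumed variable names; it computes the sum of squared deviations once and then divides by N or by N-1.

```python
import math

scores = [2, 3, 4, 4, 4, 5, 6, 12]
n = len(scores)
mean = sum(scores) / n

# Sum of squared deviations from the mean: 9 + 4 + 1 + 1 + 1 + 0 + 1 + 49.
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Population formulas divide by N.
pop_variance = sum_of_squares / n
pop_sd = math.sqrt(pop_variance)

# Sample formulas divide by N - 1.
sample_variance = sum_of_squares / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(f"population: variance = {pop_variance:.2f}, SD = {pop_sd:.2f}")
print(f"sample:     variance = {sample_variance:.2f}, SD = {sample_sd:.2f}")
```

Python's standard `statistics` module provides `pvariance`, `pstdev`, `variance` and `stdev` for the same four quantities; the manual version is used here so that each step of the formulas stays visible.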
To understand these properties, you first need to understand a concept in statistics called the sampling distribution.

We will discuss sampling distributions off and on throughout the course, and I only want to touch on the notion now. Basically, the idea is this: in order to examine the properties of a statistic, we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample. We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.

Check out this demonstration, which I hope makes the concept of sampling distributions clearer.

1) Sufficiency

A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter.

2) Unbiasedness

A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of the statistic across many samples, such as the mean of a number of sample means) is equal to the population parameter it is estimating. This is the explanation for the N-1 in the $s^2$ formula.

Using this procedure, the mean can be shown to be an unbiased estimator (see p. 47). However, if the $\sigma^2$ formula is used to calculate $s^2$, it turns out to underestimate $\sigma^2$.

The reason for this bias is that, when we calculate $s^2$, we use $\bar{X}$, an estimator of the population mean. The chances of $\bar{X}$ being EXACTLY the same as $\mu$ are virtually nil, which results in the bias. To compensate, we use N-1.

Note that this is only true when calculating $s^2$; if you have a measurable population and you want to calculate $\sigma^2$, you use N in the denominator, not N-1.

Degrees of Freedom

The mean of 6, 8, & 10 is 8.
If I allow you to change as many of these numbers as you want BUT the mean must stay 8, how many of the numbers are you free to vary? You can choose any two of them freely; the third is then forced by the fixed mean.

The point of this exercise is that when the mean is fixed, it removes a degree of freedom from your sample; this is like actually subtracting 1 from the number of observations in your sample. It is for exactly this reason that we use N-1 in the denominator when we calculate $s^2$ (i.e., the calculation requires that the mean be fixed first, which effectively removes, or fixes, one of the data points).

3) Efficiency

The efficiency of a statistic is reflected in the variance that is observed when one examines the values of that statistic (e.g., the means) across a number of independently chosen samples. The smaller the variance, the more efficient the statistic is said to be.

4) Resistance

The resistance of an estimator refers to the degree to which that estimate is affected by extreme values. As mentioned previously, both $\bar{X}$ and $s^2$ are highly sensitive to extreme values.

Despite this, they are still the most commonly used estimates of the corresponding population parameters, mostly because of their superiority over other measures in terms of sufficiency, unbiasedness, and efficiency.
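The idea of a sampling distribution, and the bias that the N-1 correction removes, can be illustrated with a tiny exact calculation. The sketch below makes several assumptions that are not in the slides: it treats the numbers 6, 8 and 10 as an entire population, draws samples of size 2 with replacement, and enumerates every possible sample instead of sampling at random. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

population = [6, 8, 10]  # treated here as the whole population (assumption)
SAMPLE_SIZE = 2          # assumed sample size for the illustration


def mean(values):
    """Exact arithmetic mean as a Fraction."""
    return Fraction(sum(values), len(values))


def sum_of_squares(values):
    """Sum of squared deviations from the mean of `values`."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)


# Population parameters: mu, and sigma^2 with N in the denominator.
mu = mean(population)
sigma_sq = sum_of_squares(population) / len(population)

# Every equally likely ordered sample of size 2, drawn with replacement.
samples = list(product(population, repeat=SAMPLE_SIZE))

# Expected value of each statistic = its mean over the sampling distribution.
expected_mean = mean([mean(s) for s in samples])
expected_var_n = mean([sum_of_squares(s) / SAMPLE_SIZE for s in samples])
expected_var_n_minus_1 = mean(
    [sum_of_squares(s) / (SAMPLE_SIZE - 1) for s in samples]
)

print("mu =", mu, " sigma^2 =", sigma_sq)
print("E[sample mean] =", expected_mean)
print("E[variance with N] =", expected_var_n)
print("E[variance with N-1] =", expected_var_n_minus_1)
```

Under these assumptions the sample mean averages exactly 8, the N denominator averages 4/3 against a population variance of 8/3, and the N-1 denominator averages exactly 8/3, which is the sense in which $s^2$ is unbiased.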