Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Ana Jerončić 200 participants [EUR] about half (71+37=108)÷200 = 54% of the bills are “small”, i.e. less than 30 EUR (18+28+14=60)÷200 = 30% i.e. nearly a third of the phone bills are greater than 75 EUR There are only a few telephone bills in the middle range. Variable Frequency Frequency Frequency Symmetry A histogram is said to be symmetric if, when we draw a vertical line down the center of the histogram, the two sides are identical in shape and size: Variable Variable 2.3 A special type of symmetric unimodal histogram is one that is bell shaped Drawing the histogram helps verify the shape of the distribution in question. Frequency Many statistical techniques require that the population be bell shaped. Variable Bell Shaped Frequency Frequency Skewness (asymmetry) A skewed histogram is one with a long tail extending to either the right or the left: Variable Positively Skewed Variable Negatively Skewed 2.5 (left)—Serum albumin values in 248 adults FIG 2 (right)—Normal distribution with the same mean and standard deviation as the serum albumin values. Altman D G , and Bland J M BMJ 1995;310:298 ©1995 by British Medical Journal Publishing Group •Center of distribution •Variability •Shape Statistics that show how different units seem similar Parameters of central tendency Mean Median Mode Statistics that show how different units differ Parameters of statistical variability Standard deviation Range Percentils The average arithmetic value of set of numbers Adding all data together and then dividing them by the number of observations (sometimes referred to as n or the sample size) Observations: 3, 4, 5, 6, 7 Total sum: 3+4+5+6+7= 25 Number of observations = 5 Mean = 25/ 5 = 5 Calculate the mean of following data: 1, 2, 3, 3, 4, 5 =(1+2+3+3+4+5)/6 =3 1, 1, 1, 1, 2, 12 3 Mean is the most commonly used as the measure of central tendency. It is a central point around which the standard deviation is calculated. MeanA=3 MeanB=3 A B -Not a good descriptor of dataset B -Large influence of outliers, especially in small samples (ie. number 12) Mean is not a good descriptor of data when distribution is asymmetrical Median is in the Middle Median – the middle number in a set of ordered numbers. 1, 3, 7, 10, 13 Median = 7 Step 1 – Arrange the numbers in order from least to greatest. 21, 18, 24, 19, 27 18, 19, 21, 24, 27 Step 2 – Find the middle number. 21, 18, 24, 19, 27 18, 19, 21, 24, 27 Number that separates the lowest value half and the highest-value half of a sample or a population Centre of the distribution Numbers simply need to be put in order and the middle one is chosen Advantage: 1. More robust to outliers and a better representative of a group in small samples 2. Used in a skewed distribution The value that has the largest number of observations. In a bell curve distribution, the mode is at the peak. Example: 2,2,2,4,5,6,7,7,7,7,8 7 is the most frequent observation (4 times) Mod is 7 It is not influenced by the sample size or by intensities of observations However, it may not represent values close to the mean or median Most useful in grouped or categorical data RARELY USED Standard deviation Range Quantiles Smallest interval which contains all the data values. Calculated by substracting smallest observation from the greatest Takes into account outliers (it depends on only two observations) and represents quantitave data well when the sample size is large The interquartile range (IQR) is the range of the middle 50% of the data in a distribution. It is computed as follows: IQR = 75th percentile - 25th percentile Data are put in numerical order and then the lower and upper quarter of the data are discarded Advantage: eliminates the risk of misrepresenting data distribution due to outliers The most commonly used measure of data variability. Measure of average distance of all data values from the mean. The standard deviation is especially useful measure of data variability when the distribution is normal or approximately normal because the proportion of the distribution within a given number of standard deviations from the mean can be calculated. Mean 68% of data!! 1 standard deviation 68% of the distribution is within 1 standard deviation of the mean and approximately 95% of the distribution is within 2 standard deviations of the mean. Example If you observe a normal distribution of your variable with a mean of 50 and a standard deviation of 10, then 68% of the distribution would be between 50 - 10 = 40 and 50 +10 =60. Similarly, about 95% of the distribution would be between 50 - 2 x 10 = 30 and 50 + 2 x 10 = 70. Both distributions have means of 50. The blue distribution has a standard deviation of 5; The red distribution has a standard deviation of 10. For the blue distribution, 68% of the distribution is between 45 and 55; for the red distribution, 68% is between 40 and 60. Figure shows two normal distributions.