Statistical Description (1)
Xiaojin Yu

Recommended reading
  Introductory Biostatistics (Le C.T., Wiley, 2003): http://www.hstathome.com/tjziyuan/Introductory%20Biostatistics%20Le%20C.T.%20%20(Wiley,%202003)(T)(551s).pdf
  Introductory Biostatistics for the Health Sciences: http://faculty.ksu.edu.sa/hisham/Documents/eBooks/Introductory_Biostatistics_for_the_Health.pdf

Review
  What is medical statistics about?
  Key terms in statistics

Key terms in statistics
  Population (individual) and sample
  Variation and random variable
  Random variable and data
  Statistic and parameter
  Sampling error
  Probability

Framework of statistical analysis
  Population: made up of individuals that vary; its parameters are unknown.
  Sample: drawn from the population by random sampling; representative, but subject to sampling error; its statistics are known.
  Statistical description summarizes the sample.
  Statistical inference, based on probability, uses the known statistics to draw conclusions about the unknown parameters.

Statistical Description: contents
  For quantitative (numerical) data
    Frequency distribution
    Measures of central tendency
    Measures of dispersion
  For qualitative (categorical) data

Raw data (quantitative)
Example: 120 values of height (cm) for 12-year-old boys in 1997:
  142.3 134.4 150.3 141.9 143.5 138.1 142.9 140.9 134.7 141.2
  135.5 140.2 156.6 148.8 133.1 140.7 139.2 140.2 134.9 141.4
  138.5 148.9 144.4 145.4 142.7 137.9 142.7 141.2 144.7 137.4
  143.6 160.9 138.9 154.0 143.4 142.4 145.7 151.3 143.9 141.5
  139.3 145.1 142.3 154.2 137.7 147.7 137.4 148.9 138.2 140.8
  151.1 148.8 141.9 145.8 125.9 137.9 138.5 152.3 143.6 146.7
  141.6 149.8 144.0 140.1 147.8 147.9 132.7 139.9 139.6 146.6
  150.0 139.2 142.5 145.2 145.4 150.6 140.5 150.8 152.9 149.7
  143.5 132.1 143.3 139.6 130.5 141.8 146.2 139.5 138.9 144.5
  147.9 147.5 142.9 145.9 146.5 142.4 134.5 146.8 143.3 146.4
  134.7 137.1 141.8 136.9 129.4 146.7 149.0 138.7 148.8 135.1
  156.3 143.8 147.3 147.1 141.4 148.1 142.5 144.0 142.1 139.9

Data summary for continuous variables
  Numerical methods
    Description of central tendency
    Description of dispersion
  Tabular and graphical methods
    Frequency table
    Histogram

Frequency table
  Class interval for height (cm)   Frequency (f)   Relative frequency
  124~                                   1              0.0083
  128~                                   2              0.0167
  132~                                  10              0.0833
  136~                                  22              0.1833
  140~                                  37              0.3083
  144~                                  26              0.2167
  148~                                  15              0.1250
  152~                                   4              0.0333
  156~                                   2              0.0167
  160~                                   1              0.0083
  Total                                120              1.0000
  Here 124~ denotes the interval from 124 up to, but not including, 128.

Solution to the example
  1. Choose the number of intervals: k = 10.
  2. Calculate the width: R = X_max - X_min = 160.9 - 125.9 = 35, so w = R/k = 35/10 = 3.5, rounded up to a convenient width of 4 cm.
  3. Form the intervals: 124~, 128~, 132~, 136~, 140~, 144~, 148~, 152~, 156~, 160~.
  4. Count the frequency in each interval: each observation is tallied into its interval and the tally marks are totalled.
  A recommended further step is to present the proportion, or relative frequency, for each interval:
  $\text{relative frequency} = \dfrac{\text{frequency within a certain interval}}{\text{total number of observations}} \times 100\%$

Final frequency table
  Class interval (cm)   Frequency (f)   Relative frequency   Cumulative frequency   Cumulative relative frequency
  124~                        1              0.0083                  1                     0.0083
  128~                        2              0.0167                  3                     0.0250
  132~                       10              0.0833                 13                     0.1083
  136~                       22              0.1833                 35                     0.2917
  140~                       37              0.3083                 72                     0.6000
  144~                       26              0.2167                 98                     0.8167
  148~                       15              0.1250                113                     0.9417
  152~                        4              0.0333                117                     0.9750
  156~                        2              0.0167                119                     0.9917
  160~                        1              0.0083                120                     1.0000
  Total                     120              1.0000

Basic steps to form a frequency table
  Step 1: determine the number of intervals (usually 5 to 15).
  Step 2: calculate the width of the intervals.
  Step 3: form the intervals, each covering a certain range of values.
  Step 4: count the number of observations within each interval.
  The final table consists of the intervals and the frequencies.

Figure 2.1 Distribution of heights of 120 boys from China, 1997 (histogram; horizontal axis from 124 to 164, vertical axis: frequency)

Presenting data graphically
  Presents data visually and intuitively
  Easy to read and understand
  Self-explanatory: stands alone from the text
  Statistical tables and graphs are intended to communicate information, so they should be easy to read and understand.
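The four steps for forming a frequency table can be sketched in Python. This is a minimal illustration, not part of the original slides: the function names `frequency_table` and `relative_and_cumulative` are hypothetical, the intervals are assumed to be left-closed (124~ covers 124 up to but not including 128), and only the first ten raw heights are binned to keep the example short.

```python
from itertools import accumulate


def frequency_table(values, start, width, k):
    """Count values into k left-closed intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * k
    for v in values:
        i = int((v - start) // width)
        if not 0 <= i < k:
            raise ValueError(f"{v} falls outside the class intervals")
        counts[i] += 1
    return counts


def relative_and_cumulative(freqs):
    """Return relative, cumulative, and cumulative relative frequencies."""
    n = sum(freqs)
    relative = [f / n for f in freqs]
    cumulative = list(accumulate(freqs))
    cumulative_relative = [c / n for c in cumulative]
    return relative, cumulative, cumulative_relative


# Frequencies taken from the height table (intervals 124~, 128~, ..., 160~).
HEIGHT_FREQS = [1, 2, 10, 22, 37, 26, 15, 4, 2, 1]

# First ten raw heights from the example (assumption: a short subset for illustration).
FIRST_TEN = [142.3, 134.4, 150.3, 141.9, 143.5, 138.1, 142.9, 140.9, 134.7, 141.2]

if __name__ == "__main__":
    rel, cum, cum_rel = relative_and_cumulative(HEIGHT_FREQS)
    for i, f in enumerate(HEIGHT_FREQS):
        print(f"{124 + 4 * i}~  {f:3d}  {rel[i]:.4f}  {cum[i]:3d}  {cum_rel[i]:.4f}")
    print(frequency_table(FIRST_TEN, start=124, width=4, k=10))
```

Passing all 120 heights to `frequency_table` with the same `start`, `width` and `k` would give the frequency column directly; the relative and cumulative columns then follow from the frequencies alone.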
The shape of the distribution is a characteristic of the variable.

Application
  The shape of the distribution can lead to a research question; the main concerns are whether the distribution is unimodal and whether it is symmetric.

Shape of a frequency distribution
  Unimodal or bimodal
  Symmetric or skewed

Unimodal or bimodal
  Indicates whether the population is homogeneous or heterogeneous, that is, whether the definition of the population or the classification is appropriate.

Symmetry and skewness
  Symmetric means the distribution has the same shape on both sides of the peak.
  Skewness means the lack of symmetry in a probability distribution. (The Cambridge Dictionary of Statistics in the Medical Sciences.)
  An asymmetric distribution is called skew. (Armitage: Statistical Methods in Medical Research.)

Figure 2.2 Symmetric and asymmetric distributions (panels: negative skewness, positive skewness)

Positive and negative skewness
  A distribution is said to have positive skewness when it has a long thin tail to the right, and negative skewness when it has a long thin tail to the left.
  A distribution in which the upper tail is longer than the lower would be called positively skewed.

Fig. The distribution of hair mercury (Hg, hydrargyrum; μmol/kg) in 237 adults (histogram; vertical axis: frequency)

Fig. The distribution of quality of life (QOL) scores of 892 senior citizens (histogram; horizontal axis: QOL score, 0 to 100; vertical axis: frequency)

Fig. The distribution of survival times (months) for 102 malignant melanoma patients (histogram; vertical axis: frequency)

Fig. The distribution of ages at death (years) of males in 1990~1992 (histogram; vertical axis: frequency)

Numerical methods
  Central tendency: arithmetic mean, median, geometric mean
  Dispersion: range, interquartile range, standard deviation, variance, coefficient of variation

Mean
  Concept and notation
  Calculation
  Application

Concept of mean
  The arithmetic mean is usually called simply the mean.
  The population mean is denoted by $\mu$.
  The sample mean is denoted by $\bar{x}$ ("x-bar").

Calculation of mean
  Given a data set of size n, {x1, x2, ..., xn}, the mean is computed by summing all the x's and dividing the sum by n. Symbolically,
  $\bar{x} = \dfrac{\sum x}{n}$

Grouped data
  The mean can be approximated using the formula
  $\bar{x} \approx \dfrac{\sum f m}{n}$
  where f denotes the frequency, m the interval midpoint, and the summation is across the intervals.

Midpoint
  The midpoint of an interval is the average of its lower and upper true boundaries.
  The midpoint of the first interval (124~) is $(124 + 128)/2 = 126$.
  The midpoint of the second interval (128~) is $(128 + 132)/2 = 130$.

Example 1
  Class interval for height (cm)   Frequency (f)     m      fm
  124~                                   1           126     126
  128~                                   2           130     260
  132~                                  10           134    1340
  136~                                  22           138    3036
  140~                                  37           142    5254
  144~                                  26           146    3796
  148~                                  15           150    2250
  152~                                   4           154     616
  156~                                   2           158     316
  160~                                   1           162     162
  Total                                120                 17156
  $\bar{x} \approx 17156 / 120 = 142.97$ cm

Average: limitation in describing data
  It has been said that a fellow with one leg frozen in ice and the other leg in boiling water is comfortable ON AVERAGE!

Geometric mean: notation
  The geometric mean, denoted G or GM, is defined as the nth root of the product of a set of n numbers.

Geometric mean: calculation
  Following the definition,
  $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
  For example, for 2, 4, 8 (n = 3): $G = \sqrt[3]{2 \times 4 \times 8} = 4$.
  Equivalently, in terms of logarithms,
  $\ln G = \dfrac{1}{n} \sum_{i=1}^{n} \ln x_i = \bar{X}_{\ln x}$, so $G = e^{\ln G} = \ln^{-1}\left(\bar{X}_{\ln x}\right)$

Geometric mean: calculation example
  Given a data set consisting of the survival times to relapse, in weeks, of 21 acute leukemia patients who received some drug:
  1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23 (n = 21)
  The mean is 8.67 weeks. For the geometric mean,
  $\ln G = \dfrac{1}{n} \sum_{i=1}^{n} \ln x_i = \bar{X}_{\ln x} \approx 1.825$, so $G = e^{1.825} \approx 6.20$ weeks.

Example: serum hemagglutination inhibition (HI) antibody dilution in 107 subjects after measles vaccination
  $G = \lg^{-1}\left(\dfrac{\sum f_i \lg X_i}{n}\right) = \lg^{-1}\left(\dfrac{165.2654}{107}\right) = 35.04$

Geometric mean: application
  Positively skewed data, if a log transformation creates a symmetric, unimodal distribution
  Data forming a geometric series

Median
  Concept of median
  Calculation
  Application: advantages and disadvantages

Concept of median
  If the data are arranged in increasing or decreasing order, the median is the middle value, which divides the set into two equal halves. M denotes the sample median.
  cch    17  19  31  39  48  56  68  73  73  75  80
  rank    1   2   3   4   5   6   7   8   9  10  11

Calculation: how do we get it?
  Example 1 (n = 11)
  cch    17  19  31  39  48  56  68  73  73  75  80
  rank    1   2   3   4   5   6   7   8   9  10  11
  M = 56
  a. When n is odd, $M = X_{\left(\frac{n+1}{2}\right)}$

  Example 2 (n = 12)
  cch    17  19  31  39  48  56  68  73  73  75  80  122
  rank    1   2   3   4   5   6   7   8   9  10  11   12
  M = (56 + 68)/2 = 62
  b. When n is even, $M = \dfrac{1}{2}\left(X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}\right)$

Application: advantage
  The median is robust to extreme values.
  cch    17  19  31  39  48  56  68  73  73  75  80  1220
  rank    1   2   3   4   5   6   7   8   9  10  11    12
  Replacing 122 by 1220 changes the mean from 58.42 to 149.9, while the median stays at 62.

Application: when is it used?
  Fig. A skewed distribution.

Data described by the median
  Skewed data
  Normally distributed data
  Ordinal data!

For a normal distribution
  Figure 3 The average height of basketball players.

Disadvantages of the median
  The precise magnitude of most of the observations is not taken into account.
  If two groups of observations are pooled, the median of the combined group cannot be expressed in terms of the medians of the two component groups.
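The three measures of location covered so far can be checked with a short Python sketch. It is an illustration under stated assumptions rather than the lecture's own method: the helper names `geometric_mean_via_logs`, `median_by_rank` and `grouped_mean` are hypothetical, and the data are the relapse times, the cch values and the grouped height table from the examples. With 122 appended to the cch values the two middle values are 56 and 68, so the even-n rule gives 62; the geometric mean of the relapse times comes out at about 6.20 weeks.

```python
import math
from statistics import mean


def geometric_mean_via_logs(xs):
    """Geometric mean computed as exp(mean(ln x)); requires all x > 0."""
    if any(x <= 0 for x in xs):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def median_by_rank(xs):
    """Median by the rank rule: middle value (odd n) or mean of the two middle values (even n)."""
    s = sorted(xs)
    n = len(s)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # rank (n+1)/2, shifted to a 0-based index
    return (s[n // 2 - 1] + s[n // 2]) / 2  # ranks n/2 and n/2 + 1


def grouped_mean(freqs, midpoints):
    """Approximate mean of grouped data: sum(f * m) / n."""
    return sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)


# Survival times to relapse (weeks) of the 21 leukemia patients.
RELAPSE_WEEKS = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]
# The cch values used in the median examples.
CCH = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80]
# Grouped height data: frequencies and interval midpoints.
HEIGHT_FREQS = [1, 2, 10, 22, 37, 26, 15, 4, 2, 1]
HEIGHT_MIDPOINTS = [126, 130, 134, 138, 142, 146, 150, 154, 158, 162]

if __name__ == "__main__":
    print("mean relapse time:", round(mean(RELAPSE_WEEKS), 2))
    print("geometric mean:", round(geometric_mean_via_logs(RELAPSE_WEEKS), 2))
    print("median, n = 11:", median_by_rank(CCH))
    print("median, n = 12:", median_by_rank(CCH + [122]))
    print("with 1220: mean", round(mean(CCH + [1220]), 1),
          "median", median_by_rank(CCH + [1220]))
    print("grouped mean height:", round(grouped_mean(HEIGHT_FREQS, HEIGHT_MIDPOINTS), 2))
```

Working through logarithms rather than multiplying all the values keeps the intermediate numbers small, which is the same reason the slides introduce the logarithmic form.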
  Unlike the median, the mean of a pooled group can be expressed in terms of the component means:
  $\bar{X} = \dfrac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$

Summary: choosing the most appropriate measure
  Symmetric, unimodal distribution: mean
  Log transformation creates a symmetric, unimodal distribution: geometric mean
  Distribution unknown or data uncertain: median
  Outliers or skewed data: median
  Ordinal data: median

Measures of dispersion
  Range
  Interquartile range
  Variance and standard deviation
  Coefficient of variation

Percentile (quantile)
  The percentile $P_X$ divides the ordered data so that X% of the observations lie below it and (100 - X)% lie above it.
  Quartiles:
    Lower (first) quartile: $Q_L = P_{25}$ (25%)
    Second quartile: the median, $P_{50}$ (50%)
    Upper (third) quartile: $Q_U = P_{75}$ (75%)

Measures of dispersion: example
  Group A: 26  28  30  32  34
  Group B: 24  27  30  33  36
  Group C: 26  29  30  31  34

Range and interquartile range
  $R = x_{\max} - x_{\min}$
  $Q_U - Q_L = P_{75} - P_{25}$
  Illustration: 0% ($P_0$), 25% ($Q_L = P_{25} = 8.51$), 50% ($M = P_{50}$), 75% ($Q_U = P_{75} = 19.45$), 100% ($P_{100}$)
  The range and the interquartile range are simple and easy to explain. However, the range has two difficulties:
  1. Its value is determined by only two of the original observations.
  2. Its interpretation depends on the number of observations in a complicated way, which is an undesirable feature.

Variance $s^2$
  An alternative approach is to make use of the deviations from the mean, $x - \bar{x}$; the greater the variation in the data set, the larger the magnitude of these deviations will tend to be. The variance $s^2$ is computed by squaring each deviation, adding the squares, and dividing their sum by one less than n:
  $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$
  where n - 1 is the degrees of freedom (df).

Variance
  A population variance is denoted by $\sigma^2$: $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$
  A sample variance is denoted by $s^2$: $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$

The following should be noted
  It would be no use to take the mean of the deviations, because $\sum (x - \bar{x}) = 0$.
  Taking the mean of the absolute values, $\dfrac{\sum |x - \bar{x}|}{n}$, is a possibility. However, this measure has the drawback of being difficult to handle mathematically.

Standard deviation, SD
  The variance $s^2$ has units that are the square of the original units.
  For example, if x is the time in seconds, the variance is measured in seconds squared (sec²). So it is convenient to have a measure of variation expressed in the same units as the original data, and this can be done by taking the square root of the variance. This quantity is the standard deviation:
  $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$

Formula for calculation
  In general, the calculation using the mean is likely to cause some trouble. If the mean is not a round number, say 10/3, it will need to be rounded off, and errors arise when this rounded figure is subtracted from each x. This difficulty can be overcome by using the following shortcut formula for the variance or SD:
  $s = \sqrt{\dfrac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1}}$

Solution to calculation of s
  With $\sum x_i = 746.1$, $\sum x_i^2 = 50689.33$ and n = 11:
  $s = \sqrt{\dfrac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1}} = \sqrt{\dfrac{50689.33 - (746.1)^2 / 11}{10}} = 2.89$

Example
  Group A: 26  28  30  32  34
  Group B: 24  27  30  33  36
  Group C: 26  29  30  31  34

             range   variance   sd     mean
  Group A:     8      10.0      3.16    30
  Group B:    12      22.5      4.74    30
  Group C:     8       8.5      2.92    30

Coefficient of variation, CV
  $CV = \dfrac{s}{\bar{X}} \times 100\%$
  Requires a nonzero mean.
  Used to compare dispersion between different distributions:
    for variables with different scales or units;
    for variables with very different means.

Example: comparing the dispersion of two variables
             mean           sd
  Height:    166.06 (cm)    4.95 (cm)
  Weight:    53.72 (kg)     4.96 (kg)
  Height: $CV = \dfrac{4.95}{166.06} \times 100\% = 2.98\%$
  Weight: $CV = \dfrac{4.96}{53.72} \times 100\% = 9.23\%$

What do the variance and SD tell us?
  A large variance (or SD) means: more variable, wider range, and the mean is less representative.
  A small variance (or SD) means: less variable, narrower range, and the mean is more representative.

Which measure should be used?
  sd, variance: for unimodal, symmetric distributions
  CV: for different units; for very different means
  Range: for any distribution; wasteful of information
  Interquartile range: for any distribution; robust; wasteful of information
  The subjects should be homogeneous!

Summary of average and dispersion
  Mean ± sd (min, max)
  Median ± interquartile range (min, max)
  Use both an average and a measure of dispersion.
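The dispersion measures in this part can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names `sd_shortcut` and `coefficient_of_variation` are hypothetical, the standard library's `variance` and `stdev` use the n - 1 divisor, and Group C is taken as 26, 29, 30, 31, 34, which reproduces the tabulated range 8, variance 8.5 and mean 30.

```python
import math
from statistics import mean, stdev, variance

# The three example groups (Group C as assumed in the lead-in).
GROUPS = {
    "A": [26, 28, 30, 32, 34],
    "B": [24, 27, 30, 33, 36],
    "C": [26, 29, 30, 31, 34],
}


def value_range(xs):
    """Range R = x_max - x_min."""
    return max(xs) - min(xs)


def sd_shortcut(sum_x, sum_x2, n):
    """Sample SD from the sums alone: sqrt((sum x^2 - (sum x)^2 / n) / (n - 1))."""
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


def coefficient_of_variation(sd, mean_value):
    """CV in percent; undefined for a zero mean."""
    if mean_value == 0:
        raise ValueError("CV requires a nonzero mean")
    return sd / mean_value * 100


if __name__ == "__main__":
    for name, xs in GROUPS.items():
        print(f"Group {name}: range {value_range(xs)}, variance {variance(xs):.1f}, "
              f"sd {stdev(xs):.2f}, mean {mean(xs)}")
    print("shortcut sd:", round(sd_shortcut(746.1, 50689.33, 11), 2))
    print("CV height (%):", round(coefficient_of_variation(4.95, 166.06), 2))
    print("CV weight (%):", round(coefficient_of_variation(4.96, 53.72), 2))
```

The shortcut function takes only the two sums and n, which mirrors the hand calculation and avoids subtracting a rounded mean from every observation.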
Summary
  Each variable has its own distribution.
  Describe it using graphs and using statistics:
    Average: mean, G, M
    Dispersion: sd, variance, Q, CV, R
  Choose an appropriate measure.
  Use an average together with a measure of dispersion.

Data summarization
  Tabular and graphical methods
    Frequency table
    Histogram
  Numerical methods (using statistics)
    Measures of location: arithmetic mean, median, geometric mean
    Measures of dispersion: range, interquartile range (IQR), standard deviation, variance, coefficient of variation
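As a closing illustration of using an average together with a measure of dispersion, the sketch below reports either "mean ± sd (min, max)" or the median with its quartiles. The function names `summarize` and `format_summary` are hypothetical, and the quartiles use the standard library's "inclusive" interpolation, which is only one of several conventions; other software may give slightly different quartile values for small samples.

```python
from statistics import mean, median, quantiles, stdev


def summarize(xs):
    """Collect the location and dispersion measures used for reporting."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")  # one quartile convention among several
    return {
        "mean": mean(xs),
        "sd": stdev(xs),
        "median": median(xs),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "min": min(xs),
        "max": max(xs),
    }


def format_summary(xs, symmetric):
    """Mean ± sd for symmetric, unimodal data; median and quartiles otherwise."""
    s = summarize(xs)
    if symmetric:
        return f"{s['mean']:.2f} ± {s['sd']:.2f} ({s['min']}, {s['max']})"
    return (f"median {s['median']} (quartiles {s['q1']} to {s['q3']}; "
            f"min {s['min']}, max {s['max']})")


# Symmetric example: Group A from the dispersion example.
GROUP_A = [26, 28, 30, 32, 34]
# Skewed example: the cch values with the extreme value 1220.
CCH_EXTREME = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 1220]

if __name__ == "__main__":
    print(format_summary(GROUP_A, symmetric=True))
    print(format_summary(CCH_EXTREME, symmetric=False))
```

Whether a data set counts as symmetric is left to the caller here, since that judgment comes from looking at the frequency distribution rather than from a single statistic.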