Chapter 2 Descriptive statistics for quantitative data
定量资料的描述性统计分析

Review: Types of data
Numerical data:
--- continuous
--- discrete
Categorical data:
--- nominal
--- ordinal

Review: Statistics
Statistics is a branch of applied mathematics concerned with the collection and interpretation of data, and with evaluating the reliability of the conclusions based on the data.

Types of statistical analysis
Descriptive analysis:
--- Data collection
--- Data interpretation
Inferential analysis:
--- Evaluate the reliability of the conclusions

Contents
Frequency distribution ★ Central tendency ★ Dispersion (measures of variability) ★ Tables and graphs

New words
• Frequency 频数
• Proportion 比例
• Percentage 百分数
• Histogram 直方图
• Polygon 折线图
• Distribution 分布
• Frequency distribution 频数分布
• Cumulative frequency 累积频数
• Cumulative proportion 累积比例
• Central tendency 集中趋势
• Dispersion 离散程度
• Mean 均数
• Arithmetic mean 算术均数
• Geometric mean 几何均数
• Median 中位数
• Mode 众数
• Skewness 偏度
• Kurtosis 峰度
• Descriptive analysis 描述分析
• Inferential analysis 推断分析

1. Frequency distribution

Data:
Id    sex   age
1     m     6
2     m     8
3     f     13
4     m     16
5     f     16
6     f     15
7     f     23
8     m     19
9     f     25
10    f     21
11    m     13
12    f     19
13    f     9
14    f     10
15    f     14

Frequency (频数): for a given variable, the number of times a value occurs is called its frequency.

Frequency table of sex
Sex   Label    Frequency
m     Male     5
f     Female   10

Proportion or percent (比例或百分数): the ratio of a frequency to the total frequency.

Frequency table of sex
Sex     Label    Frequency   Percent
m       Male     5           33.33
f       Female   10          66.67
Total   m+f      15          100.00

Frequency distribution: a table or a graph that lists all the distinct values of a variable together with the frequency and proportion with which each value occurs.

[Figure: bar chart "Frequency distribution of sex"; x-axis: Sex (male, female); y-axis: Frequency]

Method of displaying frequency distribution of categorical data
1. Nominal data
2. Ordinal data

Freq distribution of nominal data

Freq distribution of sex
Sex   Frequency   Percentage
m     5           33.33
f     10          66.67

[Figure: bar chart "Frequency distribution of sex"; x-axis: Sex (male, female); y-axis: Frequency]

Data (with eyesight):
Id    sex   eyesight   age
1     m     1          6
2     m     2          8
3     f     3          13
4     m     3          16
5     f     4          16
6     f     4          15
7     f     5          23
8     m     6          19
9     f     6          25
10    f     6          21
11    m     7          13
12    f     7          19
13    f     8          9
14    f     9          10
15    f     9          14

Freq distribution of ordinal data

Freq distribution of eyesight
Eyesight   Frequency   Percentage
1-3        4           26.67
4-6        6           40.00
7-9        5           33.33

[Figure: bar chart "Frequency distribution of eyesight"; x-axis: Eyesight (1-3, 4-6, 7-9); y-axis: Frequency]

Method of displaying frequency distribution of numerical data
• First divide the whole interval into several non-overlapping subintervals.
• Count how many observations lie in each subinterval to make a frequency table.
• Take the midpoint of each subinterval as the x-axis label and draw a histogram (直方图) or a polygon (折线图).
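The three steps above can be sketched in Python for the age variable. This is a minimal illustration, not part of the original slides: the function name `frequency_table` is hypothetical, and the subinterval edges 0, 10, 20, 30 (with only the last subinterval closed on the right) are an assumption that follows the age example used in this chapter.

```python
# Ages of the 15 subjects in the chapter's data table.
ages = [6, 8, 13, 16, 16, 15, 23, 19, 25, 21, 13, 19, 9, 10, 14]

# Assumed subinterval edges: [0, 10), [10, 20), [20, 30].
edges = [0, 10, 20, 30]


def frequency_table(values, edges):
    """Return (midpoint, frequency, percent) for each subinterval.

    Subintervals are left-closed and right-open, except the last one,
    which also includes its right edge.
    """
    rows = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        if i == last:
            count = sum(1 for v in values if lo <= v <= hi)
        else:
            count = sum(1 for v in values if lo <= v < hi)
        midpoint = (lo + hi) / 2
        percent = 100 * count / len(values)
        rows.append((midpoint, count, percent))
    return rows


for midpoint, count, percent in frequency_table(ages, edges):
    print(f"midpoint={midpoint:>5}  frequency={count:>2}  percent={percent:.1f}")
```

The midpoints (5, 15, 25) are the x-axis labels a histogram or polygon would use, and the counts are the bar heights.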
Freq distribution of numerical data

Freq distribution of age
Age       Midpoint   Frequency
[0-10)    5          3
[10-20)   15         9
[20-30]   25         3

[Figure: histogram "Frequency distribution of age"; x-axis: Age (midpoints 5, 15, 25); y-axis: Frequency]

Histogram and polygon
[Figure: histogram "Frequency distribution of age" and "Frequency polygon for age"; x-axis: Age; y-axis: Frequency]

[Figure: summary of displays. Nominal data: bar chart of sex. Ordinal data: bar chart of eyesight. Numerical data: histogram and frequency polygon of age]

Cumulative frequency and cumulative proportion
Cumulative frequency (累计频数): the sum of the frequencies from the lowest category up to and including a given category.
Cumulative proportion (累计比例): the sum of the proportions from the lowest category up to and including a given category.

Frequency table of age
Age     Midpoint   Frequency   Proportion (%)   Cumulative frequency   Cumulative proportion (%)
0-10    5          3           20.0             3                      20.0
10-20   15         9           60.0             12                     80.0
20-30   25         3           20.0             15                     100.0

[Figure: plot of cumulative frequency and cumulative proportion]

The major measures of the characteristics of observations for a numerical variable
• Central tendency (集中趋势)
• Dispersion (离散程度)

[Figure: histogram "Frequency distribution of red blood cells"; x-axis: Red blood cells (classes 420- to 640-); y-axis: Frequency]

2. Central tendency
Central tendency (集中趋势): the description of the concentration of values near the middle of the range of a variable.
The major measures of central tendency are: mean, median, mode.
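As a quick preview of the three measures just named, the sketch below computes them for the age variable of the 15 subjects using Python's standard `statistics` module. It is an illustration only; the variable names are hypothetical, and `statistics.multimode` assumes Python 3.8 or later.

```python
import statistics

# Ages of the 15 subjects in the chapter's data table.
ages = [6, 8, 13, 16, 16, 15, 23, 19, 25, 21, 13, 19, 9, 10, 14]

mean_age = statistics.mean(ages)        # arithmetic mean
median_age = statistics.median(ages)    # middle value of the ordered data
modes_age = statistics.multimode(ages)  # every value with the highest frequency

print(f"mean   = {mean_age:.2f}")
print(f"median = {median_age}")
print(f"modes  = {modes_age}")
```

For these data the mean is about 15.13, the median is 15, and three values (13, 16 and 19) tie as modes, each occurring twice.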
The mean
The mean (均数): a measure of the average level of all observations of a variable. It is defined as follows:
population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
This is the arithmetic mean (算术均数).

Eg1a: Estimate the mean
The data listed below are haemoglobin contents (血色素, g/L); estimate the mean.
Data:
id   1     2     3     4     5     6     7     8     9     10    11    12
x    121   118   130   120   122   118   116   124   127   129   125   132
Solution:
n = 12
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ = (121+118+…+125+132)/12 = 123.5
So the estimated mean haemoglobin is 123.5 g/L.

Another formula for the mean
If x takes k different values, and $f_i$ is the frequency with which the i-th value $x_i$ occurs in the sample, then the sample mean can be estimated as follows:
Data:
x      x1   x2   ……   xk
freq   f1   f2   ……   fk   (total n)
Formula:
$\bar{X} = \sum_{i=1}^{k} f_i x_i \Big/ \sum_{i=1}^{k} f_i = \sum_{i=1}^{k} f_i x_i \Big/ n$

Eg1b: Estimate the mean
The following data are serum cholesterol (血清胆固醇) measurements from 101 men aged 30-49. Estimate the mean.
Data:
Serum cholesterol   Midpoint   Freq.
2.5 ~               3.0        9
3.5 ~               4.0        32
4.5 ~               5.0        42
5.5 ~               6.0        15
6.5 ~               7.0        3
Total                          101
Solution:
n = 101
$\bar{x} = \sum_{i=1}^{k} f_i x_i \Big/ \sum_{i=1}^{k} f_i$ = (3×9+4×32+5×42+6×15+7×3)/101 = 4.71 (mmol/L)

The median
The median (中位数): the middle value of the ordered observations of a variable. It is defined as follows:
population median: $M = X_{((N+1)/2)}$
sample median: $m = x_{((n+1)/2)}$
where $X_{(1)}, X_{(2)}, \ldots, X_{(N)}$ are the ordered values in the population and $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ are the ordered values in the sample.

The method of estimating the median:
1) Order all observed values of the variable from smallest to largest;
2) If n is odd, find the middle observation; its value is the required median;
3) If n is even, find the two middle observations; the average of these two values is the required median.
eg, if n=9, then m = x((9+1)/2) = x(5), the 5th ordered value;
if n=10, then m = x((10+1)/2) = x(5.5) = (x(5)+x(6))/2.

Eg2a: Estimate the median
The data listed below are haemoglobin contents (g/L); estimate the median.
Solution:
Data:
id   1     2     3     4     5     6     7     8     9     10    11    12
x    121   118   130   120   122   118   116   124   127   129   125   132
The ordered values are: 116, 118, 118, 120, 121, 122, 124, 125, 127, 129, 130, 132.
n = 12 is even, therefore med = (122+124)/2 = 123.
So the median haemoglobin is 123 g/L.

Eg2b: Estimate the median
The following data are serum cholesterol (mmol/L) measurements from 101 men aged 30-49. Estimate the median.
Data:
Serum cholesterol   Midpoint   Freq.
2.5 ~               3.0        9
3.5 ~               4.0        32
4.5 ~               5.0        42
5.5 ~               6.0        15
6.5 ~               7.0        3
Solution:
Since n = 101 is odd, the median is the single middle value, that is, the 51st ordered value. The first two classes contain 9+32 = 41 values, so the 51st value falls in the class 4.5 ~, whose midpoint is 5.0; ie, the median M = 5.0.
A more accurate value interpolates within that class (the 51st value is the 10th of its 42 values): M = 4.5 + (5.5-4.5)/42×10 = 4.74.

Frequency distribution about mean and median
[Figure: histogram "Central tendency of serum cholesterol", Mean = 4.71, Median = 5.0; x-axis: Serum cholesterol (3 to 7); y-axis: Frequency]

Skewed distribution
[Figure: two frequency curves with the positions of the median and mean marked; left panel: positive or right skewed; right panel: negative or left skewed]

Comparing mean and median
                    Mean                             Median
Information         more (actual values)             less (ranks)
Data                not available for ordinal data   available for ordinal and numerical data
Size in magnitude:
symmetric: mean = median
positively (+) skewed: mean > median
negatively (-) skewed: mean < median

The definition of median
The median is a value for which no more than half the data are smaller than it and no more than half the data are larger than it.
eg, 12, 14, 14, 15, 16, 16, 16, 17, 18. M = 16, for which four values are < M and two values are > M.

The geometric mean
When the distribution of a variable is not symmetric, or the data have no upper or lower bound, the geometric mean is often a better measure of central tendency than the arithmetic mean. For positive values it is the n-th root of their product: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$.
Eg3. The following data are 10 patients' white blood cell counts (×1000): 11, 9, 35, 5, 9, 8, 3, 10, 12, 8. Estimate the arithmetic mean and the geometric mean.
Solution: arithmetic mean = (11+9+…+12+8)/10 = 110/10 = 11.0; geometric mean = (11×9×…×12×8)^(1/10) ≈ 9.03. The single large value 35 pulls the arithmetic mean up more than the geometric mean.

The mode
The mode (众数): the most frequently occurring value (or values) in a set of data.
• It indicates where the data are relatively most concentrated.
• If a data set consists of the values 6, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 12, 12, 12, 12, 13, then the modes are 8 and 12.

Summary
• Frequency distribution
• Histogram & polygon
• Measures of central tendency
• Measures of dispersion

Note: when the widths of the subintervals are not equal, or the data have no upper or lower bound, a polygon is more suitable than a histogram.
[Figure: "Frequency distribution of birthweight"; x-axis: weight (ounces), 0 to 180; y-axis: frequency]

New words
• Dispersion 离散程度
• Range 全距
• Deviation 离均差
• Variance 方差
• Standard deviation 标准差
• Coefficient of variation 变异系数
• Quartile 四分位数
• Percentile 百分位数
• Inter-quartile interval 四分位间距

3. Dispersion
Dispersion (离散程度): an indication of the spread of measurements around the center of the distribution of a variable.
The major measures of dispersion are: range, variance, standard deviation, interquartile range, coefficient of variation, etc.

The range
The range (全距): it measures the length over which the data are distributed.
Population range: Range = max - min
Sample range: Range = max - min
* It is a simple measure and has the same unit as the original data.
# It uses little information (only max and min).
# The sample range underestimates the population range: it is biased and inefficient.
# It conveys no information about the middle of the distribution.

The quartiles
The first quartile (第一四分位数) Q1: a value for which no more than 25% of the observed values are less than it, and no more than 75% of the observed values are greater than it.
[Diagram: ordered values from X1 to Xn; ≤ 25% of values below Q1, ≤ 75% above]
The second quartile (第二四分位数) Q2 = M: a value for which no more than 50% of the observed values are less than it, and no more than 50% of the observed values are greater than it.
[Diagram: ordered values from X1 to Xn; ≤ 50% of values below M, ≤ 50% above]
The third quartile (第三四分位数) Q3: a value for which no more than 75% of the observed values are less than it, and no more than 25% of the observed values are greater than it.
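The quartile definitions above can be checked directly by counting. The sketch below is an illustration under stated assumptions rather than the estimation method of these slides: the function names are hypothetical, only observed values are tested as candidates, and the data are the ages of the 15 subjects from section 1.

```python
# Ages of the 15 subjects from the frequency distribution section.
ages = [6, 8, 13, 16, 16, 15, 23, 19, 25, 21, 13, 19, 9, 10, 14]


def satisfies_quantile_definition(values, candidate, p):
    """True if no more than p of the values lie below the candidate
    and no more than (1 - p) of the values lie above it."""
    n = len(values)
    below = sum(1 for v in values if v < candidate)
    above = sum(1 for v in values if v > candidate)
    return below <= p * n and above <= (1 - p) * n


def observed_quantiles(values, p):
    """Observed values that satisfy the definition for proportion p."""
    return sorted({v for v in values
                   if satisfies_quantile_definition(values, v, p)})


for name, p in [("Q1", 0.25), ("Q2", 0.50), ("Q3", 0.75)]:
    print(name, observed_quantiles(ages, p))
```

For these 15 ages exactly one observed value meets each definition (10, 15 and 19). With other sample sizes a value between two observations may be needed, which is why interpolation rules exist and why different rules can give slightly different answers.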
[Diagram: ordered values from X1 to Xn; ≤ 75% of values below Q3, ≤ 25% above]

Location of quartiles
[Diagram: ordered values from X1 to Xn divided by Q1, Q2 = M and Q3 into four parts, each holding ≤ 25% of the values; ≤ 50% of the values lie on each side of M]

The method of estimating the quartiles
If the subscript is not an integer or half-integer, it is rounded up to the nearest integer or half-integer.

Eg1: Estimate the quartiles
A (n=9):    34 36 37 39 40 41 42 43 79
B (n=10):   34 36 37 39 40 41 42 43 44 45

The inter-quartile range (四分位数间距): the difference between Q3 and Q1, Q3-Q1.
[Diagram: ordered values from X1 to Xn; the middle 50% of the values lie between Q1 and Q3, around M]

Eg2: Estimate the interquartile range
A (n=9):    34 36 37 39 40 41 42 43 79
B (n=10):   34 36 37 39 40 41 42 43 44 45
Interquartile range of A = 42.5-36.5 = 6.0
Interquartile range of B = 43.5-37.0 = 6.5

The percentiles
The αth percentile (α百分位数) Pα: a value for which no more than α% of the data are less than it, and no more than (100-α)% of the data are larger than it, where 0 ≤ α ≤ 100.
• P0 = min, P100 = max
• P25 = Q1, P50 = Q2 = M, P75 = Q3.

The method of estimating the percentiles
If the subscript is not an integer or half-integer, it is rounded up to the nearest integer or half-integer.

Eg3: Estimate the percentiles
Data:
A (n=9):    34 36 37 39 40 41 42 43 79
B (n=10):   34 36 37 39 40 41 42 43 44 45
For data A: P0=34, P10=34, P20=36, P30=37, …, P90=79, P100=79.
For data B: P0=34, P10=34, P20=36, P30=37, …, P90=44, P100=45.
Note: there are many ways to estimate percentiles, so the results are not unique.

The variance
The variance (Var, 方差): it measures the average squared dispersion of the data about the mean.
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$
Sample variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
Note: the degrees of freedom are not the same: N for the population and n-1 for the sample.
* It conveys information about the middle of the distribution.
* S² is an unbiased estimate of σ²; both are non-negative values.
# The unit is not the same as that of the original data (it is the squared unit).
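To make the difference between the two divisors concrete, the sketch below applies both variance definitions to the 12 haemoglobin values from Eg1a. It is an illustration only: the function names are hypothetical, and treating the same 12 values first as a whole population and then as a sample is an assumption made purely to compare N with n-1.

```python
# Haemoglobin contents (g/L) from Eg1a; their mean is 123.5.
haemoglobin = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]


def sum_of_squared_deviations(values):
    """Sum of squared deviations of the values about their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def population_variance(values):
    """Divide by N: the values are treated as the whole population."""
    return sum_of_squared_deviations(values) / len(values)


def sample_variance(values):
    """Divide by n - 1: the values are treated as a sample."""
    return sum_of_squared_deviations(values) / (len(values) - 1)


print("sum of squared deviations:", sum_of_squared_deviations(haemoglobin))
print("population variance (N):  ", population_variance(haemoglobin))
print("sample variance (n - 1):  ", sample_variance(haemoglobin))
```

The sum of squared deviations is 297, so dividing by N = 12 gives 24.75 and dividing by n-1 = 11 gives 27, both in (g/L)².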
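Simplified formulas of variance
Population variance: $\sigma^2 = \frac{1}{N}\left[\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}\right]$
Sample variance: $S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right]$

Proof of the simplified formula
$\sum (X_i - \bar{X})^2 = \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2 = \sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}$, since $\bar{X} = \frac{1}{n}\sum X_i$.

Eg4a: Estimate the variance
Data:
id    x     x*x
1     1     1
2     2     4
3     3     9
4     4     16
5     5     25
∑     15    55
Solution:
$S^2 = \frac{1}{5-1}\left[55 - \frac{15^2}{5}\right]$ = (55-45)/4 = 2.5

Another formula for variance
Data:
x      x1   x2   ……   xk
freq   f1   f2   ……   fk   (total n)
Formula:
$S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n}\right]$, where $n = \sum_{i=1}^{k} f_i$

Eg4b: Estimate the variance
Data:
id    x     f     f*x   f*x*x
1     1     3     3     3
2     2     3     6     12
3     3     2     6     18
4     4     1     4     16
5     5     2     10    50
∑     15    11    29    99
Solution:
$S^2 = \frac{1}{11-1}\left[99 - \frac{29^2}{11}\right]$ = (99-76.45)/10 ≈ 2.25

The standard deviation
The standard deviation (sd, SD, 标准差): the square root of the variance. It measures the average dispersion of the data about the mean in the original unit.
Population sd: $\sigma = \sqrt{\sigma^2}$
Sample sd: $s = \sqrt{S^2}$
* It conveys information about dispersion around the mean of the distribution.
* s is the usual estimate of σ; both are non-negative values. (S² is unbiased for σ², but s itself is slightly biased for σ.)
* The unit is the same as that of the original data.

Eg5: Estimate the SD
Data:
id    x     x*x
1     1     1
2     2     4
3     3     9
4     4     16
5     5     25
∑     15    55
Solution:
$s = \sqrt{S^2} = \sqrt{2.5}$ = 1.58

The coefficient of variation
The coefficient of variation (cv, CV, 变异系数): it measures the variation relative to the mean.
Population cv: $CV = \frac{\sigma}{\mu} \times 100\%$
Sample cv: $CV = \frac{s}{\bar{X}} \times 100\%$
* It measures relative variability or relative dispersion.
* Its value does not depend on the unit of the variable, unlike the variance and the standard deviation, which carry units.
* It can be used to compare the variation of variables measured in different units.

Eg6: Estimate the CV
        Data: age    Data: weight    Data: weight
id      x            y               y
1       1            11              110
2       2            12              120
3       3            13              130
4       4            14              140
5       5            15              150
sum     15           65              650
mean    3            13              130
var     2.5          2.5             250
sd      1.58         1.58            15.8
cv      52.70        12.16           12.16

Coding effects:
(1) Adding or subtracting a constant (+-): S is unchanged;
(2) Multiplying or dividing by a constant (×÷): CV is unchanged.

Summary
1. Measures of central tendency: mean, median, mode.
2. Measures of dispersion: variance, standard deviation, range, inter-quartile range, CV.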
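The coding effects listed in Eg6 can be checked numerically. The following sketch uses Python's standard `statistics` module and the three data columns of Eg6; the helper name `cv_percent` is hypothetical, and the CV is taken as the sample sd divided by the mean, expressed in percent, which is consistent with the figures in that table.

```python
import statistics


def cv_percent(values):
    """Sample coefficient of variation in percent: 100 * s / mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


x = [1, 2, 3, 4, 5]        # Eg6, first column
y = [v + 10 for v in x]    # add a constant: 11 ... 15
z = [v * 10 for v in y]    # multiply by a constant: 110 ... 150

for name, data in [("x", x), ("y = x + 10", y), ("z = 10 * y", z)]:
    print(f"{name:<11} sd = {statistics.stdev(data):6.2f}   "
          f"cv = {cv_percent(data):5.2f}%")
```

Adding 10 leaves the standard deviation at about 1.58 but lowers the CV from 52.70% to 12.16%, while multiplying by 10 scales the standard deviation to about 15.81 and leaves the CV at 12.16%.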