Descriptive Statistics
B.H. Robbins Scholars Series
May 27, 2010

Outline
- Measurement
- Random Variables and Distributions
  - Commonly used distributions
- Descriptive Statistics
- Graphics
  - Categorical Variables
  - Continuous Variables

Measurement

Types of measurements
- Binary: two-category variable
  - yes/no, present/absent, 0/1
- Categorical (nominal, polytomous, discrete): at least 2 values that are not necessarily ordered
  - Race, eye color, etc.
- Ordinal: a categorical variable whose values are in a special order
  - Disease severity, Likert score, cancer stage, ASA status
- Count: a discrete value that (in theory) has no upper limit
  - Number of ER visits in a day, number of CABG surgeries in a year.

Types of measurements (cont.)
- Continuous: numerical variable that has many possible values that represent some underlying distribution.
  - Tend to have the most information (assuming not a lot of preprocessing)
  - Turning continuous data into categories is risky: loss of information (i.e., lower power to detect effects, more people needed to have the same power).
  - Errors are not reduced by categorization unless that's the only way to get someone to answer the question (e.g., with income)

Random Variables and Distributions

Random Variables: X
- A potential measurement
- A quantity that may vary from subject to subject
- A variable whose values are random but whose probability distribution is known.
- A function that assigns a unique value to every outcome of an experiment.
- Once the random variable X is observed, it is a sample value (x).

Distributions of Random Variables (RV)
- Distribution of the RV, X, is a profile of its tendencies
- Depending upon the type of RV, the distribution may be completely characterized by one or a few parameters.
- Binary variable:
  - The probability of a 'yes' or 'present' or 1
  - In the sample this is a proportion
- K-category categorical variable (multinomial):
  - The probability that a randomly chosen subject will be from category 1, 2, ..., K
  - In a sample this is a proportion falling into each category

Distributions of Random Variables (RV)
- Continuous variable:
  1. Probability density: the value of x is on the x-axis, and the relative probability of observing values close to it is on the y-axis. In a sample this is a histogram.
  2. Cumulative probability distribution: the y-axis is the probability of X ≤ x. This function only rises or stays flat. For a sample it is a cumulative histogram.
  3. All percentiles of X
  4. All moments of X (e.g., mean(X), mean(X^2), mean(X^3), ...)
  5. If we know one of these, we can derive the others.

Distributions of Random Variables (RV)
- If the distribution is known, we may be able to make reasonable guesses about future observations
- It is much harder to guess a single value x than it is to estimate a mean from a distribution.
- At the very least, the distribution tells us the proportions of people we expect to see within each interval of interest

Commonly used distributions

Normal distribution
- Characterized by two parameters: mean ($\mu$) and variance ($\sigma^2$)
- Mean is a measure of central tendency
- Variance is the mean of the squared deviations from the mean.
- $\mu$ and $\sigma^2$ are completely independent of each other.
- Symmetric distribution

[Figure: density curves for Normal(−2,1), Normal(−2,4), Normal(2,1), and Normal(2,4); x-axis: x, y-axis: Density]

Bernoulli distribution
- The random variable is binary
- Characterized by one parameter: probability (p)
- The mean is p and the variance is p(1 − p)

[Figure: bar charts for Bernoulli(0.1), Bernoulli(0.5), Bernoulli(0.25), and Bernoulli(0.9); x-axis: No/Yes, y-axis: Proportion]

Poisson distribution
- Characterized by a single parameter: mean ($\lambda$)
- Poisson distribution is commonly used in modeling a count per interval of time
- In a Poisson distribution, the variance is equal to the mean.
- Generally right skewed, but approaches a symmetric distribution as $\lambda$ increases

[Figure: Poisson(1), Poisson(5), Poisson(2), and Poisson(10); x-axis: x, y-axis: Density]

Exponential distribution
- Characterized by a single parameter: rate ($\lambda$)
- It is used to describe the time until an event occurs (train arrives, light bulb burns out, a patient has an MI)
- The mean of an exponential is $1/\lambda$ and the variance is $1/\lambda^2$
- A relaxed version of this distribution is very commonly used in medical research for survival analysis.
- Right (positively) skewed

[Figure: density curves for Exponential(1), Exponential(0.2), Exponential(0.5), and Exponential(0.05); x-axis: x, y-axis: Density]

Descriptive Statistics

Continuous Variables
- Let $x_1, x_2, \ldots, x_n$ denote sample values
- Measures of location or the 'center' of a sample
  - Arithmetic average: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    - Population mean is the value that $\bar{x}$ converges to as $n \to \infty$
    - Highly influenced by a single value (which may be a bad thing if that value is not real)
  - Median: middle among all sorted values (i.e., half the sample is greater and half is smaller)
    - Usually very descriptive
    - Not affected by an outlying value

Continuous Variables: Measures of location

[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 1.18 1.32]
- Sample average = -0.28
- Median = -0.59

[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 11.8 13.2]
- Sample average = 1.76
- Median = -0.59

Continuous Variables (cont.)
- Quantiles or percentiles: can generally be used to describe central tendency, spread, symmetry, heavy tailedness, etc.
  - The p-th sample quantile, $x_p$, is the value such that the fraction p of observations falls below it.
  - The p-th population quantile is the value x such that Pr(X ≤ x) = p
  - Sample median is the 50th percentile or 0.5 quantile
  - Quartiles: ($Q_1, Q_2, Q_3$) are the (25th, 50th, 75th) percentiles
  - Quintiles: ($Q_1, Q_2, Q_3, Q_4$) are the (20th, 40th, 60th, 80th) percentiles

Continuous Variables (cont.)
- Measures of spread or variability
- Inter-quartile range, ($Q_1$, $Q_3$): an interval that contains half of all subjects
  - Most people think these are meaningful for continuous distributions
- Variance: the average of the squared deviations between each observed value and the overall average value:
  - $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  - $n - 1$ is used because we must acknowledge that we have estimated, and do not know, the true value of $\mu$, the population mean.
- Standard deviation: the square root of the variance
  - A normal distribution can be defined in terms of the proportion of people falling within a SD of the mean
  - Approximately 68% of the population falls within 1 SD of the mean, and approximately 95% falls within 2 SDs of the mean.

Continuous Variables: Measures of scale or spread

[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 1.18 1.32]
- Sample standard deviation = 0.98
- Interquartile range: (-0.965, 0.325)

[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 11.8 13.2]
- Sample standard deviation = 5.36
- Interquartile range: (-0.965, 0.325)

Continuous Variables (cont.)
- With highly skewed or asymmetric distributions, variance and SD may not be very useful summaries of spread
- If median length of stay at Vanderbilt Hospital is 3 days, then having an SD of 10 due to some very sick people is difficult to interpret.
- Range: may not be useful because it can only grow as n increases and it is dominated by a single value.
- Coefficient of variation (standard deviation divided by the mean): there are few situations where this is useful, because it depends heavily on the value of the mean (especially if the mean is close to 0).

Discrete Variables
- Categorical variables
  - The proportion of observations falling within each category
- Count variables
  - Best to report the average.

Graphics

Chart Junk
- Chart junk (Tufte): non-data-ink or redundant data-ink
  - Not part of the minimum set of visuals needed to communicate the message understandably.
  - Elements of a display that are not necessary to understand the information contained in it
  - Can be distracting and can skew the depiction, making it difficult to understand (3-D pie and bar charts are examples)
- Ink:information ratio: the ratio of the amount of ink used to the amount of information portrayed

Categorical Variables
- Pie chart
  - High ink:information ratio
  - Can create optical illusions (perception depends on orientation vs the horizon)
  - With a lot of categories, difficult to label
- Bar chart
  - High ink:information ratio
  - Hard to depict uncertainty
  - Hard to interpret subcategories
  - Labels hard to read if bars are vertical

Categorical Variables
- Dot chart
  - Leads to the most accurate perception
  - Easy to show labels
  - Allows for multiple levels of categorization
  - Multiple categories within a single line of dots
  - Easy to show 2-sided error bars
  - Dot chart and figure 3

[Figure: dot chart of Y (x-axis from 0.0 to 0.8) for Categories a through m, in two panels, North and South, with the sample size n printed beside each row]

Sample sizes (n), paired with categories in the order listed on the figure:

Category      n (North)   n (South)
Category a          770         494
Category b         9565       13849
Category c         7439       10810
Category d         2378        6529
Category e         8093         375
Category f        10341       13410
Category g         7610       11764
Category h         1484       14450
Category i        14320        3392
Category j        10518       13591
Category k         1461       13841
Category l         9231        1384
Category m         3241       14911

Continuous Variables
- Histograms show relative frequencies (requires binning of data)
  - Not optimal for comparing multiple distributions
- Cumulative distribution functions
  - Can read all quantiles directly off the plot
- Box plot: good way to compare many groups

[Figure: three panels plotting Pr(X < x) against x. Normal CDF: Normal(0, 1) and Normal(−1, 2). Exponential CDF: Exponential(.5) and Exponential(.2). Poisson CDF: Poisson(2) and Poisson(5).]

[Figure: Box and Whisker Plot; groups A through F; x-axis: Value]

[Figure: Box Percentile Plot; groups A through F; x-axis: Value]

[Figure: Decile Plot; groups A through F; x-axis: Value]

[Figure: Decile Plot: Exponential Distribution; groups A through F; x-axis: Value]

Summary
- Measurements
- Random variables and distributions
- Numerical descriptive statistics
- Graphical summaries
  - Categorical variables
  - Continuous variables
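
Worked example: cumulative distribution of a sample. The slides describe the sample version of a cumulative probability distribution as a cumulative histogram that only rises or stays flat. A minimal Python sketch of that idea follows; the helper name `ecdf` and the reuse of the eleven-value sample from the measures-of-location slide are illustrative choices, not part of the original deck.

```python
from bisect import bisect_right

# Eleven-value sample shown on the "Measures of location" slide.
sample = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]


def ecdf(values):
    """Return F, where F(x) is the fraction of values that are <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


F = ecdf(sample)
grid = [-2.0, -1.0, -0.59, 0.0, 1.0, 2.0]
steps = [F(x) for x in grid]

for x, fx in zip(grid, steps):
    print(f"F({x:5.2f}) = {fx:.3f}")
```

Evaluating F at the sample median (-0.59) gives 6/11, slightly above one half, because six of the eleven values are at or below it.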
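
Worked example: Poisson and exponential moments. The slides state that a Poisson variable has variance equal to its mean, and that an exponential variable with rate $\lambda$ has mean $1/\lambda$ and variance $1/\lambda^2$. The sketch below checks both statements numerically. It is an illustration under assumptions that are not in the deck: the Poisson sum is truncated at `k_max = 100`, the exponential check is a simulation with a fixed seed, and the function and variable names are hypothetical.

```python
import math
import random
from statistics import fmean


def poisson_moments(lam, k_max=100):
    """Mean and variance computed from the Poisson pmf, truncated at k_max."""
    p = math.exp(-lam)  # P(X = 0)
    first = 0.0
    second = 0.0
    for k in range(1, k_max + 1):
        p *= lam / k  # P(X = k) obtained from P(X = k - 1)
        first += k * p
        second += k * k * p
    return first, second - first ** 2


pois_mean, pois_var = poisson_moments(5.0)
print(f"Poisson(5): mean = {pois_mean:.4f}, variance = {pois_var:.4f}")

# Simulation check for the exponential distribution (rate is an assumed value).
rate = 0.5
rng = random.Random(0)
draws = [rng.expovariate(rate) for _ in range(100_000)]
exp_mean = fmean(draws)
exp_var = fmean([(d - exp_mean) ** 2 for d in draws])
print(f"Exponential({rate}): mean ~ {exp_mean:.3f} (theory {1 / rate:.3f}), "
      f"variance ~ {exp_var:.3f} (theory {1 / rate ** 2:.3f})")
```

The Poisson figures are exact up to truncation and rounding error, while the exponential figures are simulation estimates and will differ slightly from the theoretical values.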
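
Worked example: mean versus median. The measures-of-location slide compares two samples that differ only in their two largest values. The short Python sketch below recomputes the sample average and the median for both, using only the standard library; the variable names are illustrative.

```python
from statistics import mean, median

original = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]
# Same sample with the two largest values replaced by 11.8 and 13.2, as on the slide.
with_outliers = original[:-2] + [11.8, 13.2]

for label, sample in (("original", original), ("with outliers", with_outliers)):
    print(f"{label}: mean = {mean(sample):.2f}, median = {median(sample):.2f}")
```

The average moves from -0.28 to 1.76 while the median stays at -0.59, which is the point the slide makes about the mean being highly influenced by individual values.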
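
Worked example: standard deviation versus interquartile range. The measures-of-spread slide reports the sample standard deviation and the quartile interval for the same two samples. The sketch below recomputes them. One assumption is worth flagging: quartiles can be defined in several ways, and the `method="inclusive"` option of `statistics.quantiles` is used here because its linear interpolation reproduces the values printed on the slide. The deck itself does not say which definition was used.

```python
from statistics import quantiles, stdev

original = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]
with_outliers = original[:-2] + [11.8, 13.2]


def spread(sample):
    """Return the sample SD (n - 1 divisor) and the (Q1, Q3) interval."""
    q1, _, q3 = quantiles(sample, n=4, method="inclusive")
    return stdev(sample), (q1, q3)


sd_a, iqr_a = spread(original)
sd_b, iqr_b = spread(with_outliers)
print(f"original:      SD = {sd_a:.2f}, (Q1, Q3) = ({iqr_a[0]:.3f}, {iqr_a[1]:.3f})")
print(f"with outliers: SD = {sd_b:.2f}, (Q1, Q3) = ({iqr_b[0]:.3f}, {iqr_b[1]:.3f})")
```

The standard deviation grows from 0.98 to 5.36, while the quartile interval is unchanged because neither quartile depends on the two largest values.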
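
Worked example: the 68% and 95% figures for a normal distribution. The standard-deviation slide quotes the share of a normal population within 1 and 2 SDs of the mean. A minimal sketch using `statistics.NormalDist` from the Python standard library follows; the helper name `coverage` and the particular mean and SD in the second check are assumptions made for illustration.

```python
from statistics import NormalDist


def coverage(k, mu=0.0, sigma=1.0):
    """Probability that a Normal(mu, sigma) value lies within k SDs of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


within_1 = coverage(1)
within_2 = coverage(2)
# The result does not depend on the mean or SD; these values are arbitrary.
within_1_shifted = coverage(1, mu=-2.0, sigma=2.0)

print(f"within 1 SD: {within_1:.4f}")
print(f"within 2 SD: {within_2:.4f}")
```

The two probabilities come out near 0.68 and 0.95, and they are the same for any choice of mean and SD.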