Module 2: Descriptive Statistics (and a bit about R)
Statistics (OA3102)
Professor Ron Fricker
Naval Postgraduate School
Monterey, California
Reading assignment: WM&S chapter 1
Revision: 1-12

Why Care About Descriptive Statistics?
• Data sets continue to grow ever bigger
  – The human mind cannot assimilate and make sense of volumes of raw data
• Descriptive statistics are a useful form of data reduction
  – Numeric summaries
  – Graphical plots
• Good descriptive statistics help analysts and decision makers understand what the raw data means

Goals for this Module
• Define types of data and types of variables
• Learn how to appropriately summarize data using descriptive statistics
  – Numerical descriptive statistics
    • Measures of location: mean, median, mode
    • Measures of spread: variance, standard deviation, range, interquartile range, etc.
  – Graphical descriptive statistics
    • Continuous variables: histogram, boxplot
    • Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R

Variables
• A characteristic that is being studied in a statistical problem is called a variable
• Types of variables:
  – Continuous: Can divide by any number and the result still makes sense
    • Examples: flight time, failure rate, detection distance
  – Categorical:
    • Ordinal: ordered categories
      – Examples: rank, magazine capacity, shirt size
    • Nominal: unordered categories
      – Examples: gender, service branch, ship type

Types of Data
[Diagram: Data divides into Qualitative (nominal) and Quantitative; Quantitative divides into Discrete (ordinal) and Continuous (continuous)]

Some Descriptive Statistics
• Numerical:
  – Location: Mean, median, mode
  – Spread: Standard deviation, variance, range, quantiles, IQR
  – Correlation
• Graphical:
  – Histograms, bar charts, dot charts, boxplots, scatter plots, etc.
• Good descriptive statistics lead to good decision making

Sample Mean ($\bar{x}$)
• Sample average or sample mean
  – Sample consists of n observations, $x_1, \ldots, x_n$
      $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  – Often denoted by $\bar{x}$ (spoken “x-bar”)
• To calculate
  – R: use mean() function
  – Excel: =AVERAGE(cell reference)

Sample Median ($\tilde{x}$)
• The median is the halfway point in the ordered data
• Steps to calculate the median:
  – Order the data from smallest to largest
  – If the number of observations is odd, the middle observation is the median. E.g., for 1 3 5 6 12 12 99 the median is 6
  – If the number is even, then the average of the two middle observations is the median. E.g., for 1 3 5 6 12 12 the median is 5.5

Using More Formal Notation…
• Let $x_{(i)}$ denote the ith order statistic from a sample $x_1, x_2, \ldots, x_n$
  – E.g., for $x_1 = 5$, $x_2 = 12$, $x_3 = 2$, we have $x_{(1)} = 2$, $x_{(2)} = 5$, $x_{(3)} = 12$
• Then the sample median can be defined as
    n odd: $\tilde{x} = x_{((n+1)/2)}$
    n even: $\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$
  – Equations apply to samples and populations
• To calculate
  – R: use median() function
  – Excel: =MEDIAN(cell reference)

Mean vs. Median
• Both are measures of location or “central tendency”
  – But the median is less affected by outliers
• Example:
  – Imagine a sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 2
    • Median = mean = 1
  – Another sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 83
    • Median still equals 1, but mean = 10!
• Which to use?
Depends on whether you are:
  – characterizing a “typical” observation (the median)
  – or describing the average value (the mean)

Exercise
• Calculate “by hand” the mean and median for the data: {6,1,3,7,3,6,7,4,8}

Exercise (continued)
• Now do the same for {6,1,3,7,3,6,7,4,8,100}

Now, in R:
• For {6,1,3,7,3,6,7,4,8}:
• For {6,1,3,7,3,6,7,4,8,100}:
[Screenshots: R console output]

Common Measures of “Spread”
• Measures of location tell you where the “center” of the data is
• Measures of spread tell you how variable the data is around the center
• Typical measures of spread:
  – Sample variance: essentially, the average squared deviation around the mean,
      $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  – Standard deviation: the square root of the variance, $s = \sqrt{s^2}$
    • The standard deviation is in the same units as the mean

Exercise
• Calculate “by hand” the sample variance and standard deviation for the data: {1,2,3,4,5}

Pictorially
[Figures on four slides]

Ignore Variability at Your Peril
• Often analyses only focus on the average
• But it’s possible to be right on average and be way off in every case
  – The average high temperature in Washington DC in June is 83 degrees
    • “Oh, how balmy!”
    • No...it’s either 75° or it’s 90+ degrees!
From Flaws and Fallacies in Statistical Thinking by Stephen K. Campbell.

The Range (R)
• Range is another measure of spread
• In words, it is the largest observation in the sample minus the smallest observation
  – Example: A sample of students’ ages in the class
    • Data: 21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40
    • Note that they are already ordered!
    • R = 40 - 21 = 19
  – Using previous notation: $R = x_{(n)} - x_{(1)}$
• In R: use the code diff(range())
  – range() function gives $x_{(1)}$ and $x_{(n)}$

Other Measures of Spread: Quantiles and Percentiles
• Percentiles
  – For data, the pth percentile, $0 \le p \le 100$, is the value of x such that p% of the data is less than or equal to x
• Quantiles same as percentiles except for scale
  – Percentiles are on a 0 to 100 scale
  – Quantiles are on a 0 to 1 scale
  – The pth quantile equals the (p×100)th percentile

Special Percentiles and Quantiles
• Special percentiles:
  – Minimum: 0th percentile (or 0 quantile)
  – Median: 50th percentile (or 0.5 quantile)
  – Maximum: 100th percentile (or 1.0 quantile)
• Quartiles: 25th and 75th percentiles
  – Devore: “lower fourth” and “upper fourth”
• Interquartile Range (IQR): IQR = 75th percentile - 25th percentile
  – Devore calls the IQR the “fourth spread”
  – In R: IQR()

Calculating Quantiles
• R function: quantile(data, probs)
  – data is a numeric vector of data
  – probs is a numeric vector of probabilities
    • Default: 0, 0.25, 0.5, 0.75 and 1.0 quantiles
• In R, pth quantile is $x_{(p(n-1)+1)}$
  – If p×(n-1)+1 is not an integer, interpolate between two closest values
  – E.g., [Screenshot: R console output]

Hinges
• Hinges are an alternative to quartiles
  – They’re the $x_{(j)}$ and $x_{(n-j+1)}$ order statistics, for
      $j = \frac{\lfloor (n+1)/2 \rfloor + 1}{2}$
    where if j is not integer, interpolate
• Easier way to compute:
  – If n is even, they’re the median values of the upper and lower halves of the sorted data
  – If n is odd, they’re the median values of the upper and lower halves of the sorted data, where each half includes the median data point

Exercise
• “By hand,” calculate the five number summary for {12,2,7,5,15,4,9,18,6}
  – The five number summary is the minimum, lower hinge, median, upper hinge, maximum

Exercise (continued)
• “By hand,” calculate the five number summary for {12,2,7,5,15,4,9,18,6,10}

Results in R
[Screenshot: R console output]
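
The two five number summary exercises can be checked with a short R sketch. It assumes base R's fivenum() for the hinge-based summary and quantile() with its default settings for the $x_{(p(n-1)+1)}$ rule; the variable names x9, x10, and so on are illustrative only and do not come from the slides.

```r
# Data from the two five number summary exercises
x9  <- c(12, 2, 7, 5, 15, 4, 9, 18, 6)
x10 <- c(12, 2, 7, 5, 15, 4, 9, 18, 6, 10)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five9  <- fivenum(x9)
five10 <- fivenum(x10)

# quantile() with default settings interpolates at position p*(n-1)+1
q9  <- quantile(x9)
q10 <- quantile(x10)

# Range and interquartile range for the n = 10 sample
rng10 <- diff(range(x10))
iqr10 <- IQR(x10)

print(five9)   # 2 5 7 12 18
print(five10)  # 2 5 8 12 18
print(q10)     # 0%: 2, 25%: 5.25, 50%: 8, 75%: 11.5, 100%: 18
```

For n = 9 the hinges and the quartiles coincide (5 and 12), while for n = 10 the quartiles (5.25 and 11.5) differ from the hinges (5 and 12), which is why the distinction between the two definitions matters.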

The Empirical Rule
• If the distribution of measurements is approximately normal, then:
  • 68% of the data is within μ ± 1σ
  • 95% within μ ± 2σ
  • 99.7% (“almost all”) within μ ± 3σ
[Figure: standard normal density over Z from -4 to 4, with the 68%, 95%, and 99.7% regions marked]

Remember Notation Conventions
• Summation:
  – Σ notation and subscripts
• Size:
  – n denotes size of sample
  – N denotes size of population
• Knowns vs. unknowns:
  – Small letters (e.g., “x”) mean quantity is known
  – Capital letters (e.g., “X”) mean quantity is unknown (i.e., it’s a random variable)

Graphically Depicting Data
• Many different types of plots and charts
[Figure: example chart; y-axis “Count Axis” (thousands), x-axis from 80 to 125]
• Whatever you do, don’t fall into the trap of just using Excel plots because they’re easy
  – R is much more powerful and flexible
  – Excel does not do some important/useful plot types

A Classic Good Graphic
[Figure]

Some Types of Graphical and Tabular Summaries of Data
• Univariate discrete data: tables, barplots, dot charts, pie charts
• Univariate continuous data: stem-and-leaf plots, strip charts, histograms, boxplots
• Bivariate discrete data: two-way contingency tables
• Bivariate continuous data: scatterplots, QQ plots

Tabular Summaries of Data
• Categorical data: counts and/or percentages by category
• Continuous data: counts and/or percentages within “bins”
  – Bins: sequential intervals over the range of data
    • Generally intervals are of equal width
    • Must decide how to count a data point that falls on the boundary between two bins
      – Either count them all in the left bins, or in the right bins
      – Doesn’t matter which, just be consistent

Example: Tabular Summary of Univariate Categorical Data

  Manufacturer       Frequency   Relative Frequency (fraction)
  Honda                  41          0.34
  Yamaha                 27          0.23
  Kawasaki               20          0.17
  Harley-Davidson        18          0.15
  BMW                     3          0.03
  Other                  11          0.08
  Total                 120          1.00

• In R, use the table() function
• For the example:
[Screenshot: R console output]

Barplots
• Barplots are also known as bar charts and bar graphs
• Plot one bar for each category
  – Bars show counts or percentage of observations in each category
• Can plot bars vertically or horizontally
• In R: barplot()
  – Option horiz=TRUE plots bars horizontally (default is FALSE)

In R
    barplot(table(manufac),xlab="Manufacturer",ylab="Count")
    barplot(table(manufac),ylab="Manufacturer",xlab="Count",horiz=TRUE)
[Figures: vertical and horizontal barplots of counts by manufacturer]

Plotting Fractions
    barplot(table(manufac)/length(manufac),xlab="Manufacturer",ylab="Fraction")
    barplot(table(manufac)/length(manufac),ylab="Manufacturer",xlab="Fraction",horiz=TRUE)
[Figures: vertical and horizontal barplots of fractions by manufacturer]

Histograms
• A histogram is a graph of the observed frequencies in a sample or population
• Histograms show the distribution of the data
• Reading a histogram: There are 10 observations greater than 215 but less than or equal to 225
[Figure: example histogram; y-axis counts from 0 to 12, x-axis from 170 to 260]

Histograms Depict the Empirical Distribution
• Histograms help answer:
  – Where is the mean of the data (roughly) located?
  – How variable is the data?
  – What is the overall shape of the data?
    • Is the distribution symmetric? Is it skewed? If so, in what direction?
  – Are there any unusual observations?
• In R: hist() function
  – Options:
    • breaks option allows user to vary number of bars
    • freq=TRUE (default) gives counts
    • freq=FALSE gives density histogram (area sums to one)

Frequency Histogram of Challenger Data
    > challenger<-c(84,49,61,40,83,67,45,66,70,69,80,58,68,60,67,72,73,70,57,63,70,78,52,67,53,67,75,61,70,81,76,79,75,76,58,31)
    > hist(challenger)
[Figure: frequency histogram of the Challenger data]

Density Histogram of Challenger Data
    hist(challenger,freq=FALSE)
[Figure: density histogram of the Challenger data]

Dos and Don’ts for Histograms
• Do try alternate numbers of bars
  – Find best depiction of the shape (distribution) of data
  – Start with number of classes = $\sqrt{n}$ (i.e., breaks = $\sqrt{n} - 1$)
      hist(challenger,breaks=2)
      hist(challenger,breaks=5)
      hist(challenger,breaks=9)
      hist(challenger,breaks=25)
• Don’t use unequal bin widths – keep the bar widths all the same
• Don’t plot histograms by hand – use software

Extremes in Histograms
[Figure: three histograms of the Challenger data; y-axis Frequency (count), x-axis Temperature (F)]
• One extreme: A single bar for all the data (30-89) – but that just shows the total, no information about the shape of the data
• $\sqrt{n}$ classes seems to be about right to show distribution of the data
• Another extreme: One bar for each temperature – but that’s just a bar chart.
  It’s hard to see the shape

Differences Between Barplots and Histograms
• Barplots:
  – For categorical data
  – Often most easily read with bars plotted horizontally
  – Adjacent bars are separated from each other
• Histograms:
  – For continuous data
  – Convention is to plot bars vertically (to look like a pdf)
  – Adjacent (nonzero) bars touch (since the base of each bar denotes the “bin” for that bar)

Boxplots
• Boxplots show distribution in one dimension
  – Only useful for continuous variables
  – Good for comparing distributions of a continuous variable between categorical groups
  – Will not show multiple modes
• Illustration (of one variant):
[Figure: boxplot with labeled outliers, whiskers, median, and hinges]

Exercise
• Given the following summary statistics for the Challenger data, (roughly) draw the boxplot over the “strip chart”
[Figure: summary statistics and strip chart of the Challenger data]

Exercise: Result from R
• Boxplot
[Figure: boxplot of the Challenger data]

Histograms vs. Boxplots
• Histogram shows distribution of the data in two dimensions – the boxplot is in one dimension
  – Histogram shows frequency of observations within ranges
  – Boxplot only shows summary statistics

We’ll Use Software To Do Most Calculations and Plots…
• …generally R
• Benefits of R include:
  – It’s free
  – More importantly, it’s powerful, flexible, extensible, and cutting-edge
  – In terms of extensibility, there are now thousands of libraries (aka packages) available to do custom calculations, plots, etc.
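
The Challenger boxplot exercise above can be reproduced roughly with the sketch below. It reuses the challenger vector from the histogram slides and assumes base R's boxplot() and stripchart(); the object names summ and bp, the horizontal layout, and the placement of the strip chart at height 0.6 are illustrative choices, not taken from the original slide.

```r
# Challenger data as given on the histogram slide
challenger <- c(84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
                68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
                53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31)

# Five number summary: minimum, lower hinge, median, upper hinge, maximum
summ <- fivenum(challenger)
print(summ)

# Horizontal boxplot; the returned list holds the plotted statistics
bp <- boxplot(challenger, horizontal = TRUE, xlab = "Temperature (F)")

# Overlay the raw observations as a stacked strip chart just below the box
stripchart(challenger, method = "stack", add = TRUE, at = 0.6, pch = 1)

# Whisker ends and hinges actually drawn, and any points flagged as outliers
print(bp$stats)
print(bp$out)
```

By default boxplot() draws the whiskers out to the most extreme observations within 1.5 hinge-spreads of the box, so bp$stats can differ from the five number summary when an observation is flagged in bp$out.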

Some R Paradigms
• Command line interface
• Object-oriented programming
• Types of objects, particularly data frames
• Vector-based calculations

Command Line Interface
• Command line allows scripting/programming, which gives flexibility and extensibility
  – Point and click paradigm limits user to what has been programmed into the interface
  – Trade-off is “user friendliness,” meaning command line users must learn the underlying language and syntax
• Good news: Once you gain a working familiarity, you have access to a very powerful computing tool

All the Std Graphics Plus…
[Figure]

Examples #1 through #5: Flexible Graphics
[Figures on five slides]

Object-oriented Programming
• R is an object-oriented programming language
  – Wikipedia: “Object-oriented programming (OOP) is a programming paradigm that uses "objects" … to design applications and computer programs.”
• Everything in R is an object of some type
  – Each type of object has particular properties
  – Properties control what objects can and cannot do, as well as how other objects interact with them

Types of Objects
• Important types of objects in R:
  – Vector: a one-dimensional list of numbers
  – Matrix: a two-dimensional list of numbers
  – Array: a multi-dimensional list of numbers
  – Data frame (data.frame): a two-dimensional list that can contain any type of data (numeric, string, logical, etc.)
  – Function: small programs that usually take input as arguments and after running produce output
• The function class(obj) will tell you what type of object “obj” is

More on Data Frames
• Think of them like tables
  – Columns correspond to variables (and the data within a column must all be of the same type)
  – Rows correspond to observations

More on Functions
• Function calls always end with parentheses
  – If there are arguments, they go here
  – Some functions don’t have or need arguments
    • Example: ls()
  – The function’s code is printed when the parentheses are left off
• Can run functions of functions
  – Example: mean(seq(1, 9))
• Lots of built-in functions and you can write your own

Vector-based Calculations
• R is very efficient (i.e., fast) working with vectors, much less so with loops
• Key idea: In data frames, instead of writing code that operates on the rows of a data frame (i.e., observation by observation), you write code that operates on the columns (i.e., the variables)
• Takes a while to get used to thinking in terms of vectors rather than individual observations

Simple Example
• Data frame with data on various types of travel for a set of individuals:
[Screenshot: data frame]
• Easy way to calc total days deployed in R:
[Screenshot: R code]

Simple Example, continued
• Even fancier:
[Screenshot: R code]
• The hard way:
[Screenshot: R code]

What We Covered in this Module
• Defined types of data and types of variables
• Learned how to appropriately summarize data using descriptive statistics
  – Numerical descriptive statistics
    • Measures of location: mean, median, mode
    • Measures of spread: variance, standard deviation, range, interquartile range, etc.
  – Graphical descriptive statistics
    • Continuous variables: histogram, boxplot
    • Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R

Homework
• WM&S chapter 1
  – Required exercises 2, 9, 13, 17, 22, 25
  – Extra credit: 11
• Hints and instructions:
  – Do exercises 2, 13, and 25 in R as much as possible
    o The data sets are in Sakai in CSV format; read them in using the instructions from Lab #1
    o Exercise 2: Just construct a frequency histogram in R with the Mt. Washington observation left out
    o Exercises 13 and 25: The sort() function in R could be useful for counting the number that fall in each interval
  – Exercise 9: Use either Table 4 in WM&S or R to calculate. If you use R, the pnorm() function will be helpful
  – Exercise 17: Only do the approximation for Exercise 1.2
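
As a hedged illustration of the pnorm() hint (not a solution to any homework exercise), the sketch below recovers the Empirical Rule percentages from the standard normal distribution. It also shows the vector-based style described earlier, since pnorm() is applied to a whole vector at once rather than in a loop. The names k and within are illustrative only.

```r
# Number of standard deviations from the mean
k <- 1:3

# pnorm() is vectorized: P(-k < Z < k) for k = 1, 2, 3 in one expression
within <- pnorm(k) - pnorm(-k)

# Express as percentages, rounded to one decimal place
print(round(100 * within, 1))  # 68.3 95.4 99.7
```

For a normal variable with mean m and standard deviation s, the same probabilities follow from pnorm(m + k * s, mean = m, sd = s) - pnorm(m - k * s, mean = m, sd = s).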