Programming in R
Data Analysis Module: Basic Descriptive Statistics

Data Analysis Module
• Basic Descriptive Statistics and Confidence Intervals
• Basic Visualizations
  – Histograms
  – Pie Charts
  – Bar Charts
  – Scatterplots
• T-tests
  – One Sample
  – Paired
  – Independent Two Sample
• ANOVA
• Chi Square and Odds
• Regression Basics

Data Analysis: Descriptive Statistics
In this session I will explain:
• Measures of central tendency and variation
• How to use figures to summarize a single variable (univariate data)
• How to create these in R

Data Analysis: Descriptive Statistics
When describing a single variable, we look at:
• Center: where do we find most of the data?
• Distribution or shape, such as a bell shaped curve
• Variation or dispersion: how spread out is the data? On average, how far are observations from the center?
• Outliers: have we got Bill Gates in our salary sample?

Measures of central tendency
The “center” of a data set can be described using two different measures:
1. Mean: the commonly known “average”
2. Median: the midpoint

The mean
• The sample mean is sometimes called “x bar”
  $\bar{x} = \frac{\sum x_i}{n}$
• Translation: add up all the values and divide by the number of values
• Usually, this is what people call the average

The median
• The middle of the data is called the median
  – Sort the data from smallest to largest
  – If there is an odd number of observations, the middle number is the median
  – For an even number of observations, the median is the midpoint between the two middle numbers
• Example with middle values 7521 and 8139: median price = (7521 + 8139)/2 = 7830

Shape and skewness

Normal variables and standard deviation
• In a symmetric, bell shaped distribution, we are able to describe the entire distribution using only two numbers, the mean and the standard deviation
• The standard deviation is roughly the average distance of the observations from their mean

Calculating the standard deviation
Standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Translation: find the difference between the mean and each value in the dataset, square each difference, add these up, divide by the total number of
values minus 1, then take the square root of that (or get R to do it for you).

And we care because? The Empirical Rule
For any normal curve, approximately
• 68% of the values fall within 1 standard deviation of the mean
• 95% of the values fall within 2 standard deviations of the mean
• 99.7% of the values fall within 3 standard deviations of the mean

Other things to describe
• How many modes?
• The range, minimum and maximum

[Figure: histogram of eruption times of Old Faithful geyser in Yellowstone National Park, 1997, n = 107. Horizontal axis: time of eruption in minutes. Vertical axis: # of eruptions.]
This histogram shows a bimodal shape. The data have a minimum of 1.67 minutes and a maximum of 4.93 minutes, for a range of 3.26 minutes.
http://wps.aw.com/wps/media/objects/15/15719/projects/ch3_faithful/index.html

The five number summary
• Minimum, lower quartile, median, upper quartile and maximum
• The visual representation of the five number summary is the box plot, also called the box and whiskers plot
[Figure: box plot with labels Minimum, Lower Quartile, Median, Upper Quartile, Maximum]

Interpreting box plots
• ¼ of students slept between 3 and 6 hours
• ¼ slept between 6 and 7 hours
• ¼ slept between 7 and 8 hours
• ¼ slept between 8 and 16 hours
• Outlier: any value more than 1.5 interquartile ranges (IQR) beyond the closest quartile, shown with stars
Copyright ©2005 Brooks/Cole, a division of Thomson Learning, Inc.

Other ways to visualize data
• When developing a visual representation of a single variable, the most common tools are
  – Histograms, Pie Charts, Bar Charts, Box Plots and Stem and Leaf Plots
• We’ve already seen a histogram and a box plot

How to produce these in R
• summary() to get the minimum, first quartile, median, mean, third quartile, and maximum
• table() to get frequency counts
• prop.table() to get proportions (multiply by 100 for percentages)
• pie(), barplot(), hist(), and boxplot() to get pie charts, bar plots, histograms, and box plots, respectively
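
The functions listed above can be tried out with a short script. What follows is only a minimal sketch: the vectors sleep_hours, prices, and majors are made-up illustration data, not the data behind the slides (only the two middle prices, 7521 and 8139, come from the median example). The outlier check uses quantile() with its default method, which is an assumption on my part; boxplot() computes its quartiles slightly differently, so the two can disagree on small samples.

```r
# All data below are hypothetical and exist only to illustrate the functions.

# --- Center and variation of a numeric variable ---
sleep_hours <- c(6, 7, 7.5, 8, 5, 6.5, 7, 9, 7, 16)  # made-up hours of sleep

mean(sleep_hours)     # sum of the values divided by their count
median(sleep_hours)   # midpoint of the sorted values
sd(sleep_hours)       # sample standard deviation (divides by n - 1)
range(sleep_hours)    # minimum and maximum
summary(sleep_hours)  # min, 1st quartile, median, mean, 3rd quartile, max

# Standard deviation "by hand", following the translation on the slide
n <- length(sleep_hours)
manual_sd <- sqrt(sum((sleep_hours - mean(sleep_hours))^2) / (n - 1))
all.equal(manual_sd, sd(sleep_hours))  # TRUE

# Median with an even number of observations
prices <- c(6900, 7521, 8139, 9200)  # the outer two values are made up
median(prices)                        # (7521 + 8139) / 2 = 7830

# --- Empirical rule: share of a normal curve within k standard deviations ---
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)  # roughly 0.68, 0.95, 0.997
coverage

# --- Outliers: more than 1.5 IQR beyond the closest quartile ---
q <- unname(quantile(sleep_hours, c(0.25, 0.75)))  # lower and upper quartile
iqr <- q[2] - q[1]
outliers <- sleep_hours[sleep_hours < q[1] - 1.5 * iqr |
                        sleep_hours > q[2] + 1.5 * iqr]
outliers

# --- Counts and proportions of a categorical variable ---
majors <- c("Biology", "Math", "Biology", "History", "Math", "Biology")
counts <- table(majors)       # frequency counts
props <- prop.table(counts)   # proportions that sum to 1
round(100 * props, 1)         # percentages

# --- Plots ---
hist(sleep_hours, main = "Hours of sleep", xlab = "Hours")
boxplot(sleep_hours, horizontal = TRUE, main = "Hours of sleep")
barplot(counts, main = "Majors", ylab = "Count")
pie(counts, main = "Majors")
```

Run interactively, each plotting call opens a graphics window; run with Rscript, the plots are written to a file named Rplots.pdf in the working directory.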