3. Basics of R – cont.
Summarising Statistical Data
Graphical Displays
CA200 (based on the book by Prof. Jane M. Horgan)

Basics

    6+7*3/2            #general expression
    [1] 16.5
    x <- 1:4           #integers are assigned to the vector x
    x                  #print x
    [1] 1 2 3 4
    x2 <- x**2         #square the elements, or x2 <- x^2
    x2
    [1] 1 4 9 16
    X <- 10            #case sensitive!
    prod1 <- X*x
    prod1
    [1] 10 20 30 40

Getting Help

• click the Help button on the toolbar
• help()
• help.start()
• demo()
• ?read.table
• help.search("data.entry")
• apropos("boxplot")
  - "boxplot", "boxplot.default", "boxplot.stats"

Statistics: Measures of Central Tendency

Typical or central points:
• Mean: Sum of all values divided by the number of cases
• Median: Middle value. 50% of the data lie below it and 50% above
• Mode: Most commonly occurring value, i.e. the value with the highest frequency

Statistics: Measures of Dispersion

Spread or variation in the data:
• Standard Deviation (σ): The square root of the average squared deviation from the mean
  - measures how the data values differ from the mean
  - a small standard deviation implies that most values are near the average
  - a large standard deviation indicates that values are widely spread above and below the average
• Range: Lowest and highest value
• Quartiles: Divide the data into quarters. The 2nd quartile is the median
• Interquartile Range: The distance between the 1st and 3rd quartiles, which spans the middle 50% of the data

Data Entry

• Entering data from the screen to a vector
• Example 1.1:

    downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29, 30, 30, 30, 33, 36, 44, 45, 47, 51)
    mean(downtime)
    [1] 25.04348
    median(downtime)
    [1] 25
    range(downtime)
    [1] 0 51
    sd(downtime)
    [1] 14.27164
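The measures of dispersion and the mode defined above can be checked by hand on the downtime data. The following is a minimal sketch and not part of the original notes; the names dev, sd_by_hand, freq and modes are illustrative. One point to be aware of: R's sd() divides the sum of squared deviations by n - 1 rather than n, so the hand calculation does the same.

```r
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

# Standard deviation from its definition (sd() uses the n - 1 divisor)
n <- length(downtime)
dev <- downtime - mean(downtime)           # deviations from the mean
sd_by_hand <- sqrt(sum(dev^2) / (n - 1))
sd_by_hand
sd(downtime)                               # should agree with sd_by_hand

# Quartiles and interquartile range
quantile(downtime, c(0.25, 0.5, 0.75))
IQR(downtime)

# Mode: base R has no function for the statistical mode,
# so tabulate the values and pick the most frequent one(s)
freq <- table(downtime)
modes <- as.numeric(names(freq)[freq == max(freq)])
modes
```

Tabulating with table() returns every value that ties for the highest frequency, which is why modes is a vector rather than a single number.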
• Entering data from a file to a data frame
• Example 1.2: Examination results: results.txt

    gender  arch1  prog1  arch2  prog2
    m          99     98     83     94
    m          NA     NA     86     77
    m          97     97     92     93
    m          99     97     95     96
    m          89     92     86     94
    m          91     97     91     97
    m         100     88     96     85
    f          86     82     89     87
    and so on

• NA indicates a missing value.
• There is no mark for arch1 and prog1 in the second record.
• Download the file to the desired location, then read it into a data frame:

    results <- read.table("C:\\results.txt", header = T)
    results$arch1[5]
    [1] 89

• Alternatively, attach(results) allows you to access the columns without the prefix results$:

    attach(results)
    names(results)
    arch1[5]
    [1] 89

Data Entry – Missing values

    mean(arch1)
    [1] NA                 #no result because some marks are missing

• Use na.rm = T or na.rm = TRUE (not available, remove):

    mean(arch1, na.rm = T)
    [1] 83.33333
    mean(prog1, na.rm = T)
    [1] 84.25
    mean(arch2, na.rm = T)
    mean(prog2, na.rm = T)
    mean(results, na.rm = T)
      gender    arch1    prog1    arch2    prog2
          NA 94.42857 93.00000 89.75000 90.37500

• Note: the column means in this last output correspond to the eight records listed above, not to the full file. Current versions of R no longer accept a data frame in mean(); sapply(results[2:5], mean, na.rm = T) gives the column means instead.

Data Entry – cont.

• Use "read.table" if the data in the text file are separated by spaces
• Use "read.csv" when the data are separated by commas
• Use "read.csv2" when the data are separated by semicolons

Data Entry – cont.

Entering data into a spreadsheet:
• newdata <- data.frame()    #brings up a new spreadsheet called newdata
• fix(newdata)               #allows you to subsequently add data to this data frame

Summary Statistics

Example 1.1: Downtime:

    summary(downtime)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       0.00   16.00   25.00   25.04   31.50   51.00

Example 1.2: Examination Results:

    summary(results)
     gender      arch1             prog1            arch2            prog2
     f: 4    Min.   :  3.00   Min.   :65.00   Min.   :56.00   Min.   :63.00
     m:22    1st Qu.: 79.25   1st Qu.:80.75   1st Qu.:77.75   1st Qu.:77.50
             Median : 89.00   Median :82.50   Median :85.50   Median :84.00
             Mean   : 83.33   Mean   :84.25   Mean   :81.15   Mean   :83.85
             3rd Qu.: 96.00   3rd Qu.:90.25   3rd Qu.:91.00   3rd Qu.:92.50
             Max.   :100.00   Max.   :98.00   Max.   :96.00   Max.   :97.00
             NA's   :  2.00   NA's   : 2.00
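Returning to the missing-value handling in Example 1.2, the following is a self-contained sketch that uses only the eight records listed from results.txt, so its means differ from those of the full file. It assumes the data are supplied inline through the text argument of read.table rather than read from a path; the names results8 and col_means are illustrative.

```r
# The eight records shown from results.txt, supplied inline
results8 <- read.table(header = TRUE, text = "
gender arch1 prog1 arch2 prog2
m 99 98 83 94
m NA NA 86 77
m 97 97 92 93
m 99 97 95 96
m 89 92 86 94
m 91 97 91 97
m 100 88 96 85
f 86 82 89 87
")

mean(results8$arch1)                  # NA, because one mark is missing
mean(results8$arch1, na.rm = TRUE)    # mean of the seven available marks

# Column means for the numeric columns only (gender is dropped)
col_means <- sapply(results8[, -1], mean, na.rm = TRUE)
col_means
```

Using sapply over the numeric columns avoids passing the whole data frame to mean(), and it leaves the non-numeric gender column out of the calculation.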
Example 1.2: Examination Results: for a separate analysis of one subject use:

    mean(results$arch1, na.rm=T)
    [1] 83.33333
    summary(arch1, na.rm=T)      # hint: use attach(results)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
       3.00   79.25   89.00   83.33   96.00  100.00    2.00

Programming in R

• Example 1.3: Write a program to calculate the mean of downtime
  Formula for the mean: (sum of the elements) / (number of elements)

    x <- sum(downtime)        # sum of elements in downtime
    n <- length(downtime)     # number of elements in the vector
    mean_downtime <- x/n

  or

    mean_downtime <- sum(downtime) / length(downtime)

Programming in R – cont.

• Example 1.4: Write a program to calculate the standard deviation of downtime
  #hint - use sqrt function

Graphical displays - Boxplots

• Boxplot – a graphical summary based on the median, quartiles and extreme values

    boxplot(downtime)

• the box represents the interquartile range, which contains the middle 50% of cases
• the whiskers are lines that extend from the box to the highest and lowest values that are not extreme
• the line across the box represents the median
• extreme values are cases more than 1.5 box lengths beyond the upper or lower edge of the box

Graphical displays – Boxplots – cont.

• To improve the graphical display use labels:

    boxplot(downtime, xlab = "downtime", ylab = "minutes")

Graphical displays – Multiple Boxplots

• Multiple boxplots on the same axes, by adding extra arguments to the boxplot function:

    boxplot(results$arch1, results$arch2, xlab = "Architecture, Semesters 1 and 2")

• Conclusions:
  – marks are lower in sem2
  – the range of marks is narrower in sem2
• Note the outliers in sem1! These are atypical values, more than 1.5 box lengths beyond the edge of the box.
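The 1.5 box-length rule for extreme values can be checked numerically. The sketch below uses a hypothetical vector called marks, invented purely for illustration (it is not the course data), and the names box_length, lower_fence and upper_fence are likewise assumptions. Note that boxplot() computes the box edges as hinges, which can differ slightly from the quartiles returned by quantile(), so the fences here are an approximation of what the plot uses.

```r
# Hypothetical marks, invented for illustration (not the course data)
marks <- c(3, 56, 70, 75, 79, 82, 86, 89, 91, 95, 96, 100)

q <- quantile(marks, c(0.25, 0.75))    # approximate edges of the box
box_length <- q[2] - q[1]              # interquartile range
lower_fence <- q[1] - 1.5 * box_length
upper_fence <- q[2] + 1.5 * box_length

# Values beyond the fences are the ones drawn as individual points
outliers <- marks[marks < lower_fence | marks > upper_fence]
outliers

# boxplot.stats() reports the values that boxplot() itself would flag
boxplot.stats(marks)$out
```

Comparing the hand-computed outliers with boxplot.stats(marks)$out is a quick way to confirm which points a boxplot will show separately from the whiskers.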
• To display values per gender:

    boxplot(arch1~gender, xlab = "gender", ylab = "Marks(%)", main = "Architecture Semester 1")

• Note the effect of using main = "Architecture Semester 1": it adds a title to the plot.

Par

Display several plots together using the par function:

    par(mfrow = c(2,2))      #outputs are displayed in a 2x2 array
    boxplot(arch1~gender, main = "Architecture Semester 1")
    boxplot(arch2~gender, main = "Architecture Semester 2")
    boxplot(prog1~gender, main = "Programming Semester 1")
    boxplot(prog2~gender, main = "Programming Semester 2")

To undo the matrix layout:

    par(mfrow = c(1,1))      #restores graphics to the full screen

Par – cont.

Conclusions:
- female students are doing less well in programming in sem1
- the median for female students in prog. sem1 is lower than for male students

Histograms

• A histogram is a graphical display of the frequencies in the categories of a variable

    hist(arch1, breaks = 5, xlab = "Marks(%)", ylab = "Number of students", main = "Architecture Semester 1")

• Note: this requests a histogram with five breaks of equal width (R treats the number as a suggestion). Observations are counted according to the category or "bin" they fall within.

Histograms

    hist(arch2, xlab = "Marks(%)", ylab = "Number of students", main = "Architecture Semester 2")

• Note: a histogram with default breaks

Using par with histograms

• par can be used to show all the subjects in one diagram:

    par(mfrow = c(2,2))
    hist(arch1, xlab = "Architecture", main = " Semester 1", ylim = c(0, 35))
    hist(arch2, xlab = "Architecture", main = " Semester 2", ylim = c(0, 35))
    hist(prog1, xlab = "Programming", main = " ", ylim = c(0, 35))
    hist(prog2, xlab = "Programming", main = " ", ylim = c(0, 35))

Note: ylim = c(0, 35) ensures that the y-axis has the same scale in all four plots!

Stem and leaf

• Stem and leaf – a more modern way of displaying data! Like a histogram, the diagram gives the frequencies of the categories, but it also shows the actual values in each category
• The stem usually depicts the 10s and the leaves depict the units.
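The split into stems (10s) and leaves (units) can be sketched by hand with integer division and the remainder operator. This is an illustration of the idea only, not how stem() is implemented; the names stems, leaves and leaf_groups are assumptions, and the downtime vector from Example 1.1 is repeated so that the block runs on its own.

```r
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

stems  <- downtime %/% 10    # integer division gives the 10s
leaves <- downtime %% 10     # remainder gives the units

# Group the leaves by their stem and print one row per stem
leaf_groups <- split(leaves, stems)
for (s in names(leaf_groups)) {
  cat(s, "|", paste(leaf_groups[[s]], collapse = ""), "\n")
}
```

Because downtime is already sorted, the leaves in each row appear in increasing order, as they do in a stem and leaf display.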
    stem(downtime, scale = 2)

      The decimal point is 1 digit(s) to the right of the |

      0 | 012
      1 | 2248
      2 | 1134589
      3 | 00036
      4 | 457
      5 | 1

Stem and leaf – cont.

    stem(prog1, scale = 2)

      The decimal point is 1 digit(s) to the right of the |

      6 | 5
      7 | 12
      7 | 66
      8 | 01112223
      8 | 5788
      9 | 012
      9 | 7778

Note: e.g. there are many students with a mark between 80% and 85%

Scatter Plots

• To investigate the relationship between variables:

    plot(prog1, prog2, xlab = "Programming, Semester 1", ylab = "Programming, Semester 2")

• Note:
  - one variable increases with the other!
  - students doing well in prog1 tend to do well in prog2!

Pairs

• If more than two variables are involved:

    courses <- results[2:5]
    pairs(courses)           #scatter plots for all possible pairs

  or

    pairs(results[2:5])

Graphical display vs. Summary Statistics

• Importance of graphical display to provide insight into the data!
• Anscombe (1973), four data sets
• Each data set consists of two variables on which there are 11 observations

    Data Set 1      Data Set 2      Data Set 3      Data Set 4
    x1     y1       x2     y2       x3     y3       x4     y4
    10    8.04      10    9.14      10    7.46       8    6.58
     8    6.95       8    8.14       8    6.77       8    5.76
    13    7.58      13    8.74      13   12.74       8    7.71
     9    8.81       9    8.77       9    7.11       8    8.84
    11    8.33      11    9.26      11    7.81       8    8.47
    14    9.96      14    8.10      14    8.84       8    7.04
     6    7.24       6    6.13       6    6.08       8    5.25
     4    4.26       4    3.10       4    5.39      19   12.50
    12   10.84      12    9.13      12    8.15       8    5.56
     7    4.82       7    7.26       7    6.42       8    7.91
     5    5.68       5    4.74       5    5.73       8    6.89

First read the data into separate vectors:

    x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
    y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
    x2 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
    y2 <- c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)
    x3 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
    y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
    x4 <- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)
    y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)

For convenience, group the data into frames:

    dataset1 <- data.frame(x1,y1)
    dataset2 <- data.frame(x2,y2)
    dataset3 <- data.frame(x3,y3)
    dataset4 <- data.frame(x4,y4)

It is usual to obtain summary statistics:

1. Calculate the mean:

    mean(dataset1)
          x1       y1
    9.000000 7.500909
    mean(data.frame(x1,x2,x3,x4))
    x1 x2 x3 x4
     9  9  9  9
    mean(data.frame(y1,y2,y3,y4))
          y1       y2       y3       y4
    7.500909 7.500909 7.500000 7.500909

2. Calculate the standard deviation:

    sd(data.frame(x1,x2,x3,x4))
          x1       x2       x3       x4
    3.316625 3.316625 3.316625 3.316625
    sd(data.frame(y1,y2,y3,y4))
          y1       y2       y3       y4
    2.031568 2.031657 2.030424 2.030579

Everything seems the same!

Note: these calls rely on mean() and sd() accepting a data frame, which current versions of R no longer support; sapply(dataset1, mean) and sapply(dataset1, sd) give the same column-wise results.

• But when we plot:

    par(mfrow = c(2, 2))
    plot(x1,y1, xlim=c(0, 20), ylim =c(0, 13))
    plot(x2,y2, xlim=c(0, 20), ylim =c(0, 13))
    plot(x3,y3, xlim=c(0, 20), ylim =c(0, 13))
    plot(x4,y4, xlim=c(0, 20), ylim =c(0, 13))

Note:
1. Data set 1 is linear with some scatter
2. Data set 2 is quadratic
3. Data set 3 has an outlier. Without it the data would be linear
4. Data set 4 contains x values which are all equal except for one outlier. If it were removed, the data would lie on a vertical line.

Everything seems different!
Graphical displays are the core of getting insight/feel for the data!
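The whole comparison can be repeated in a few lines with the anscombe data frame that ships with R in the datasets package, whose columns are named x1 to x4 and y1 to y4. This is a sketch under the assumption that the built-in values match the four data sets tabulated above; the names means, sds and old_par are illustrative.

```r
# anscombe is a built-in data frame with columns x1..x4 and y1..y4
means <- sapply(anscombe, mean)    # column-wise means
sds   <- sapply(anscombe, sd)      # column-wise standard deviations
round(means, 2)
round(sds, 2)

# Plot the four data sets on common axes in a 2x2 array
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlim = c(0, 20), ylim = c(0, 13),
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old_par)                       # restore the previous layout
```

Saving the value returned by par() and passing it back afterwards restores the earlier layout without having to remember the original mfrow setting.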