Lab 1: Basic Graphics and Descriptive Statistics
Michael Akritas

Outline

- Basic Statistical Graphics
    - Histograms and Stem and Leaf Plots
    - Scatterplots, Scatterplot Matrices and 3D Scatterplots
    - Pie Charts and Bar Graphs
- Mean, Variance and Standard Deviation
- Medians, Percentiles and Box Plots

Histograms and Stem and Leaf Plots

- In histograms, the range of the data is divided into bins, and a box is constructed above each bin.
- The height of each box is the bin's frequency. Alternatively, the heights can be rescaled so that the histogram's total area is one.
- R chooses the number of bins automatically, but it also accepts user-specified intervals. Moreover, R offers the option of constructing a smooth histogram (a kernel density estimate).
- In stem and leaf plots, each observation is split into its stem, which is the leading digit(s), and its leaf, which is the first of the remaining digits.
- Stem and leaf plots retain more information about the original data but offer less flexibility in selecting the bins.
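The bullets above mention user-specified intervals and rescaling to area one. The following is a minimal sketch of both, assuming the built-in faithful data set; the half-minute break points are an arbitrary choice made for illustration, not a recommendation.

```r
# Eruption durations (minutes) from the built-in Old Faithful data set
x <- faithful$eruptions

# User-specified bins: half-minute intervals covering the data (assumed choice)
my_breaks <- seq(1.5, 5.5, by = 0.5)

# Frequency histogram: box heights are the bin counts
h_count <- hist(x, breaks = my_breaks, main = "Counts",
                xlab = "Eruption duration (min)")

# Density histogram: heights rescaled so that the total area is one
h_dens <- hist(x, breaks = my_breaks, freq = FALSE, main = "Density",
               xlab = "Eruption duration (min)")

# Check the area: sum over bins of (height * bin width)
total_area <- sum(h_dens$density * diff(h_dens$breaks))
total_area
```

The object returned by hist stores the counts, densities and break points, so the area can be checked numerically rather than read off the plot.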
The R data set faithful

    x = faithful$eruptions   # store the eruption durations in x
    hist(x)                  # basic frequency histogram
    hist(x, freq = FALSE)    # histogram scaled so that area = 1
    plot(density(x))         # basic smooth histogram
    hist(x, freq = FALSE); lines(density(x))   # superimposes the two
    stem(x)                  # basic stem and leaf plot
    hist(x, freq = FALSE, col = "grey",
         main = "Histogram of Old Faithful eruption durations",
         xlab = "Eruption durations"); lines(density(x), col = "red")

A variation of the stem and leaf plot (*)

- The command stem(x) is equivalent to stem(x, scale = 1).
- To illustrate the role of the scale parameter, enter the data on US beer production (in millions of barrels) for different quarters during the period 1975-1982 into the R object x through

    x = c(35, 36, 36, 36, 39, 39, 41, 41, 41, 42, 42, 44, 44, 44, 44, 44,
          44, 46, 46, 47, 48, 48, 49, 49, 50, 52, 52, 53, 53, 54, 55)

  Then use the command:

    stem(x, scale = 0.5)

Output options (*)

- The figure can be saved as pdf, jpg, etc.
Alternatively:

    pdf("Desktop/HistOF.pdf")   # saves the figure in Desktop/HistOF.pdf
    hist(x, freq = FALSE, col = "grey"); lines(density(x), col = "red")
    dev.off()                   # this must be done before opening the pdf file

- To save it as a jpg file, replace pdf("Desktop/HistOF.pdf") in the above commands by jpeg("Desktop/HistOF.jpg").
- To save text output to a txt file, for example the stem and leaf plot, copy and paste, or use:

    sink("Desktop/StemOF.txt"); stem(x); sink(file = NULL)

Data Input

- There are several ways to import data stored in different types of files. We will use data stored in txt files uploaded on the internet.
- The basic command we will use is read.table. For example, to import the bear measurements data from http://sites.stat.psu.edu/~mga/401/Data/BearsData.txt, use the command

    br = read.table("http://stat.psu.edu/~mga/401/Data/BearsData.txt", header = TRUE)

  This generates an R data frame, called br, containing all the information in the file BearsData.txt.

Scatterplot with gender identification

- With the bear measurements data in the data frame br (type br to view the data, or br[1:3,] for the first three rows), an enhanced scatterplot of chest girth and weight with gender differentiation can be constructed with the commands:

    attach(br)
    # factor() makes the colour indexing work whether Sex was read as text or as a factor
    plot(Chest.G, Weight, pch = 21, bg = c("red", "green")[as.integer(factor(Sex))])
    legend(x = 22, y = 400, pch = c(21, 21), pt.bg = c("red", "green"),
           legend = c("Female", "Male"))
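The gender colouring in the scatterplot works by turning the Sex labels into integer codes and using those codes to index a vector of colours. A small sketch of the mechanism follows; the labels are made up for illustration and are not taken from BearsData.txt.

```r
# Hypothetical labels (not from BearsData.txt)
sex <- c("F", "M", "M", "F", "M")

# factor() sorts the distinct labels into levels; as.integer() gives the level codes
sex_f <- factor(sex)
codes <- as.integer(sex_f)

# Indexing a colour vector by the codes yields one colour per observation
point_cols <- c("red", "green")[codes]
point_cols

# unclass() applied to a factor exposes the same integer codes
same_codes <- identical(as.integer(unclass(sex_f)), codes)
same_codes
```

Because the levels are sorted alphabetically, the first colour goes to "F" and the second to "M"; with different labels the order of the colour vector would need to be checked against levels(sex_f).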
Scatterplot matrix with gender identification

- For more than two variables, a scatterplot matrix arranges all pairwise scatterplots in matrix form. With the bear measurements in the data frame br, use the command:

    # br[4:8] is a data frame consisting of columns 4-8
    pairs(br[4:8], pch = 21, bg = c("red", "green")[as.integer(factor(br$Sex))])

- (*) For a variation, which gives histograms on the diagonal and additional information, use the commands:

    install.packages("psych")   # installs the package psych
    library(psych)              # it suffices to issue this command once per session
    pairs.panels(br[4:8], pch = 21, bg = c("red", "green")[as.integer(factor(br$Sex))])

(*) 3D Scatterplots

- With the bear measurements data in the data frame br (use install.packages("scatterplot3d") if not installed before), use:

    library(scatterplot3d)
    scatterplot3d(br[6:8])   # the basic 3D scatterplot
    scatterplot3d(br[6:8], angle = 35, col.axis = "blue",
                  col.grid = "lightblue", color = "red")   # angle and color controls
    scatterplot3d(br[6:8], angle = 35, col.axis = "blue", col.grid = "lightblue",
                  color = "red", type = "h", box = FALSE)   # vertical lines, no box
    scatterplot3d(br[6:8], pch = 21,
                  bg = c("red", "green")[as.integer(factor(br$Sex))])   # with gender differentiation
Pie Charts and Bar Graphs

- Pie charts and bar graphs are used with count data to display the percentage of each category in the sample.
- Examples include counts (or percentages or proportions) of different ethnic, education or income categories, the market share of different car companies, and so on.
- The pie chart is popular in the mass media and is one of the most widely used statistical charts in the business world.
- It is a circular chart, where the sample is represented by a circle divided into sectors whose sizes represent proportions.
- The pie chart in http://www.stat.psu.edu/~mga/401/fig/LvMsPie.pdf displays information on the November 2011 light vehicle market share of car companies (source: http://wardsauto.com/keydata/USSalesSummary0702.xls).
- It has been pointed out, however, that it is difficult to compare different sectors of a given pie chart.
- According to Stevens' power law, length is a better scale to use than area.
- The bar graph uses bars whose heights are proportional to the proportions they represent.
- The bar graph for the light vehicle market share data is shown in http://www.stat.psu.edu/~mga/401/fig/LvMsBar2.pdf

Remark: When the bars are arranged in decreasing order of height, the bar graph is also called a Pareto chart. The Pareto chart is one of the key tools used in quality control, where it is often used to represent the most common sources of defects in a manufacturing process, the most frequent reasons for customer complaints, etc.
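The ordering described in the remark can be sketched in a few lines: sort the counts in decreasing order before calling barplot. The category names and counts below are invented for illustration; they are not the light vehicle data or any real defect data.

```r
# Hypothetical defect counts by cause (made-up numbers)
defects <- c(Scratches = 12, Misalignment = 41, Cracks = 7, Discoloration = 25)

# Sort so that the most frequent category comes first
defects_sorted <- sort(defects, decreasing = TRUE)

# Convert the counts to percentages of the total
pct <- 100 * defects_sorted / sum(defects_sorted)

# Bar graph with bars in decreasing order (a basic Pareto chart)
barplot(pct, las = 2, ylab = "Percent of defects", col = "grey")

# The leading category is the most common source of defects
top_cause <- names(pct)[1]
top_cause
```

This draws only the ordered bars; Pareto charts used in quality control often add a cumulative-percentage line as well, which is omitted here.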
[Google Pareto principle]

    lv = read.table("http://stat.psu.edu/~mga/401/Data/MarketShareLightVeh.txt", header = TRUE)
    attach(lv)   # so variables can be referred to by name
    pie(Percent, labels = Company, col = rainbow(length(Percent)))
    barplot(Percent, names.arg = Company, col = rainbow(length(Percent)), las = 2)

Mean, Variance and Standard Deviation

Example
The productivity of each of the N = 10,000 employees of a company is rated on a scale from 1 to 5. Let the statistical population $v_1, v_2, \dots, v_{10,000}$ be

    $v_i = 1$ for $i = 1, \dots, 300$,
    $v_i = 2$ for $i = 301, \dots, 1{,}000$,
    $v_i = 3$ for $i = 1{,}001, \dots, 5{,}000$,
    $v_i = 4$ for $i = 5{,}001, \dots, 9{,}000$,
    $v_i = 5$ for $i = 9{,}001, \dots, 10{,}000$.

Find the population proportions for each rating, the average rating, and the population variance and standard deviation of the rating.

Solution. Set the statistical population in x:

    x = c(rep(1,300), rep(2,700), rep(3,4000), rep(4,4000), rep(5,1000))

Compute the proportions:

    table(x)/10000

Compute the mean, variance and standard deviation:

    # var() uses the divisor n-1; the factor (n-1)/n converts it to the population variance
    mean(x); var(x)*(length(x)-1)/length(x)
    sqrt(var(x)*(length(x)-1)/length(x))

Example
Take a simple random sample
of size n = 500 from the population of Example 1.6.3, and compute the sample proportions of the different ratings, the average rating, and the sample variance and standard deviation.

Solution.

    y = sample(x, size = 500)
    table(y)/500
    mean(y); var(y); sd(y)

Medians, Percentiles and Box Plots

Example
Read the data on robots' reaction times to simulated malfunctions into the data frame t by

    t = read.table("http://stat.psu.edu/~mga/401/Data/RobotReactTime.txt", header = TRUE)

Read the reaction times of Robot 2 into the vector t2 by

    attach(t); t2 = Time[Robot==2]

(a) Obtain the five number summary.
(b) Get the sample 30th, 60th, and 90th percentiles.
(c) Construct a boxplot. Are there any outliers?

Solution.

    summary(t2); quantile(t2, c(.3,.6,.9)); boxplot(t2, col = 2)

- Go to next lab: http://www.stat.psu.edu/~mga/401/course.info/lab2.pdf
- Go to the Stat 401 home page: http://www.stat.psu.edu/~mga/401/course.info/
- http://www.google.com
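For part (c) of the robot example, boxplot flags as outliers the observations lying more than 1.5 box lengths beyond the edges of the box (the hinges). The sketch below applies that rule with boxplot.stats and then by hand; the vector z is made up for illustration and is not the robot reaction time data.

```r
# Hypothetical reaction times (made-up values)
z <- c(28.2, 27.9, 30.1, 28.5, 29.0, 29.6, 27.7, 28.8, 29.3, 35.4)

# boxplot.stats() returns the statistics that boxplot() draws
s <- boxplot.stats(z)
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out     # points beyond the whiskers, flagged as outliers

# The same rule by hand: fences at 1.5 box lengths beyond the hinges
hinges <- fivenum(z)[c(2, 4)]
fences <- hinges + c(-1.5, 1.5) * diff(hinges)
by_hand <- z[z < fences[1] | z > fences[2]]
by_hand
```

Applying boxplot.stats(t2)$out to the robot data would list any flagged reaction times directly, which answers the outlier question without reading them off the plot.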