Outline
Basic Statistical Graphics
Mean, Variance and Standard Deviation
Medians, Percentiles and Box Plots
Lab 1: Basic Graphics and Descriptive Statistics
Michael Akritas
Basic Statistical Graphics
Histograms and Stem and Leaf Plots
Scatterplots, Scatterplot Matrices and 3D Scatterplots
Pie Charts and Bar Graphs
Mean, Variance and Standard Deviation
Medians, Percentiles and Box Plots
Histograms and Stem and Leaf Plots
- In a histogram the range of the data is divided into bins, and a
  box is constructed above each bin.
- The height of each box is the bin's frequency. Alternatively,
  the heights can be rescaled so that the histogram's total area is one.
- R chooses the number of bins automatically, but it also
  accepts user-specified intervals. Moreover, R offers the option
  of constructing a smooth histogram.
- In a stem and leaf plot each observation is split into its
  stem, which is the beginning digit(s), and its leaf, which is the
  first of the remaining digits.
- Stem and leaf plots retain more information about the original data
  but do not offer as much flexibility in selecting the bins.
The R data set faithful
x = faithful$eruptions # store the eruption durations in x
hist(x) # basic frequency histogram
hist(x, freq = FALSE) # histogram with total area = 1
plot(density(x)) # basic smooth histogram
hist(x, freq = FALSE); lines(density(x)) # superimposes the two
stem(x) # basic stem and leaf plot
hist(x, freq = FALSE, col = "grey", main = "Histogram of Old Faithful eruption durations",
     xlab = "Eruption durations"); lines(density(x), col = "red")
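The commands above let R choose the bins. As a minimal sketch of the user-specified intervals mentioned earlier, the breaks argument of hist can be given a vector of bin endpoints. The grid b below (width 0.25 from 1.5 to 5.5) is only an illustrative choice, not one taken from the lab; plot = FALSE is used first so that the computed bins can be inspected.

```r
x <- faithful$eruptions                  # eruption durations, as above

# Illustrative bin endpoints (an assumption, not from the lab);
# they must cover the whole range of x.
b <- seq(1.5, 5.5, by = 0.25)

h <- hist(x, breaks = b, plot = FALSE)   # compute the bins without drawing
h$counts                                 # frequency in each bin

# With freq = FALSE the box heights are h$density, and
# height * width summed over the bins gives total area one.
area <- sum(h$density * diff(h$breaks))
area

hist(x, breaks = b, freq = FALSE, col = "grey")  # draw the histogram
lines(density(x), col = "red")                   # add the smooth histogram
```

Narrower bins show more detail but make the histogram more jagged, so it may be worth trying a few values of the bin width.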
A variation of the stem and leaf plot (*)
- The command stem(x) is equivalent to stem(x, scale = 1).
- To illustrate the role of the scale parameter, enter the data on
  US beer production (in millions of barrels) for different
  quarters during the period 1975-1982 in the R object x
  through
x = c(35, 36, 36, 36, 39, 39, 41, 41, 41, 42, 42, 44, 44, 44, 44, 44,
      44, 46, 46, 47, 48, 48, 49, 49, 50, 52, 52, 53, 53, 54, 55)
  Then use the command:
stem(x, scale = 0.5)
Output options (*)
- The figure can be saved as a pdf, jpg, etc. Alternatively, it can be
  written to a file with commands:
pdf("Desktop/HistOF.pdf") # saves figure in Desktop/HistOF.pdf
hist(x, freq = FALSE, col = "grey"); lines(density(x), col = "red")
dev.off() # this must be done before opening the pdf file
- To save it as a jpg file, replace pdf("Desktop/HistOF.pdf") in
  the above set of commands by jpeg("Desktop/HistOF.jpg").
- To save text output, for example the stem and leaf plot, to a txt
  file, copy and paste it, or use:
sink("Desktop/StemOF.txt"); stem(x); sink() # sink() with no argument restores output to the console
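The same steps can be tried without assuming that a Desktop folder exists. The sketch below writes to temporary files created with tempfile() (an assumption made here so that the commands run anywhere) and then checks that both files were produced.

```r
x <- faithful$eruptions

pdf_file <- tempfile(fileext = ".pdf")   # temporary path instead of Desktop/HistOF.pdf
pdf(pdf_file)                            # open the pdf device
hist(x, freq = FALSE, col = "grey"); lines(density(x), col = "red")
dev.off()                                # close the device so the file is complete

txt_file <- tempfile(fileext = ".txt")   # temporary path instead of Desktop/StemOF.txt
sink(txt_file); stem(x); sink()          # divert printed output to the file, then restore

file.exists(c(pdf_file, txt_file))       # both should be TRUE
stem_lines <- readLines(txt_file)        # the saved stem and leaf plot, line by line
head(stem_lines)
```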
Scatterplots, Scatterplot Matrices and 3D Scatterplots
Data Input
- There are several ways to import data stored in different types
  of files. We will use data stored in txt files posted on the
  internet.
- The basic command we will use is read.table. For example, to
  import the bear measurements data from
  http://sites.stat.psu.edu/~mga/401/Data/BearsData.txt,
  use the command
br = read.table("http://stat.psu.edu/~mga/401/Data/BearsData.txt",
                header = TRUE)
  This generates an R data frame, called br, containing all the
  information in the file BearsData.txt.
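read.table expects a plain text file whose first line holds the column names when header = TRUE. As a minimal sketch that runs without an internet connection, the text argument of read.table is used below with a few made-up rows. The column names mimic those used later in the lab, but the values are invented and are not the bear data.

```r
# Made-up rows for illustration only; NOT the contents of BearsData.txt.
txt <- "Sex Chest.G Weight
F 26 80
M 45 344
M 54 416"

toy <- read.table(text = txt, header = TRUE)  # first line supplies the column names

str(toy)        # one row per data line, one column per field
toy[1:2, ]      # the first two rows
toy$Weight      # a single column, extracted by name
```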
Scatterplot with gender identification
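- With the bear measurements data in the data frame br (type
  br to view the data, or br[1:3,] for the first three lines), an
  enhanced scatterplot of chest girth and weight with gender
  differentiation can be constructed with the commands:
attach(br) # so variables can be referred to by name
# factor() guarantees integer codes even if Sex was read as text (the default since R 4.0)
plot(Chest.G, Weight, pch = 21,
     bg = c("red", "green")[unclass(factor(Sex))])
legend(x = 22, y = 400, pch = c(21, 21), pt.bg = c("red", "green"),
       legend = c("Female", "Male")) # pt.bg fills the legend symbols as in the plot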
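How the color indexing works may be easier to see on a data set that ships with R. The sketch below uses iris (an assumption made for illustration; it is not part of the lab) and its three-level factor Species: unclass() turns the factor into the integer codes 1, 2, 3, which then pick the matching entry of the color vector for every observation.

```r
# iris ships with R; it stands in for the bear data here (illustration only).
grp  <- iris$Species                      # factor with 3 levels
code <- unclass(grp)                      # integer codes, in the order of levels(grp)
cols <- c("red", "green", "blue")[code]   # one fill color per observation

table(grp, cols)                          # each level maps to exactly one color

plot(iris$Petal.Length, iris$Petal.Width, pch = 21, bg = cols,
     xlab = "Petal length", ylab = "Petal width")
legend("topleft", pch = 21, pt.bg = c("red", "green", "blue"),
       legend = levels(grp))              # legend order follows the factor levels
```

Listing the legend labels with levels(grp) keeps the legend and the colors in the same order, which guards against mislabeling the groups.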
Scatterplot matrix with gender identification
- For more than two variables, a scatterplot matrix arranges all
  pairwise scatterplots in matrix form. With the bear
  measurements in the data frame br use the command:
pairs(br[4:8], pch = 21, bg = c("red", "green")[unclass(factor(br$Sex))])
# br[4:8] is a data frame consisting of columns 4-8
- (*) For a variation, which gives histograms on the diagonal and
  additional information, use the commands:
install.packages("psych") # installs the package psych (needed only once)
library(psych) # it suffices to issue this command once per session
pairs.panels(br[4:8], pch = 21, bg = c("red", "green")[unclass(factor(br$Sex))])
(*) 3D Scatterplots
- With the bear measurements data in the data frame br (use
  install.packages("scatterplot3d") if not installed before) use:
library(scatterplot3d); scatterplot3d(br[6:8]) # the basic 3D scatterplot
scatterplot3d(br[6:8], angle = 35, col.axis = "blue",
              col.grid = "lightblue", color = "red") # angle and color controls
scatterplot3d(br[6:8], angle = 35, col.axis = "blue", col.grid = "lightblue",
              color = "red", type = "h", box = FALSE) # vertical lines, no box
scatterplot3d(br[6:8], pch = 21, bg = c("red", "green")[unclass(factor(br$Sex))])
# with gender differentiation
Pie Charts and Bar Graphs
- Pie charts and bar graphs are used with count data; they
  display the percentage of each category in the sample.
- Examples include counts (or percentages or proportions) of
  different ethnic, education or income categories, the market
  share of different car companies, and so on.
- The pie chart is popular in the mass media and is one of the
  most widely used statistical charts in the business world.
- It is a circular chart, where the sample is represented by a
  circle divided into sectors whose sizes represent proportions.
- The pie chart in
  http://www.stat.psu.edu/~mga/401/fig/LvMsPie.pdf
  displays information on the November 2011 light vehicle
  market share of car companies (source:
  http://wardsauto.com/keydata/USSalesSummary0702.xls).
- It has been pointed out, however, that it is difficult to
  compare different sections of a given pie chart.
- According to Stevens' power law, length is a better scale to
  use than area.
- The bar graph uses bars whose heights are proportional to the
  proportions they represent.
- The bar graph for the light vehicle market share data is shown in
  http://www.stat.psu.edu/~mga/401/fig/LvMsBar2.pdf
Remark: When the bars are arranged in decreasing order of height,
the bar graph is also called a Pareto chart. The Pareto chart is one of the
key tools used in quality control, where it is often used to represent the
most common sources of defects in a manufacturing process, or the most
frequent reasons for customer complaints, etc. [Google Pareto principle]
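The Remark can be tried out with a few lines of R. The sketch below is only an illustration: the defect categories and counts are hypothetical, invented here, and the Pareto chart is obtained simply by sorting the counts in decreasing order before calling barplot.

```r
# Hypothetical defect counts, invented for illustration only.
defects <- c(Scratch = 12, Dent = 31, Misalignment = 7, Discoloration = 19)

pareto <- sort(defects, decreasing = TRUE)   # largest category first
pareto

barplot(pareto, las = 2, col = "grey",
        ylab = "Number of defects")          # bars in decreasing order of height
```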
lv = read.table("http://stat.psu.edu/~mga/401/Data/MarketShareLightVeh.txt",
                header = TRUE)
attach(lv) # so variables can be referred to by name
pie(Percent, labels = Company, col = rainbow(length(Percent)))
barplot(Percent, names.arg = Company, col = rainbow(length(Percent)), las = 2)
Mean, Variance and Standard Deviation
Example
The productivity of each of the N = 10,000 employees of a
company is rated on a scale from 1 to 5. Let the statistical
population v_1, v_2, ..., v_{10,000} be
  v_i = 1, i = 1, ..., 300,
  v_i = 2, i = 301, ..., 1,000,
  v_i = 3, i = 1,001, ..., 5,000,
  v_i = 4, i = 5,001, ..., 9,000,
  v_i = 5, i = 9,001, ..., 10,000.
Find the population proportions for each rating, the average rating,
and the population variance and standard deviation of the rating.
Solution. Set the statistical population in x:
x = c(rep(1,300), rep(2,700), rep(3,4000), rep(4,4000), rep(5,1000))
Compute the proportions:
table(x)/length(x)
Compute the mean, variance and standard deviation (var divides by N-1,
so the factor (N-1)/N converts its value to the population variance):
mean(x); var(x)*(length(x)-1)/length(x)
sqrt(var(x)*(length(x)-1)/length(x))
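As a cross-check, the same quantities can be obtained from the proportions alone: the mean is the sum of rating times proportion, and the variance is the proportion-weighted average of the squared ratings minus the squared mean. The names v, p, mu, sigma2 and sigma below are chosen here for illustration.

```r
v <- 1:5                                      # the possible ratings
p <- c(300, 700, 4000, 4000, 1000) / 10000    # population proportions

mu     <- sum(v * p)                          # population mean: 3.47
sigma2 <- sum(v^2 * p) - mu^2                 # population variance: 0.7691
sigma  <- sqrt(sigma2)                        # population standard deviation

# Agreement with the commands applied to the full population vector
x <- rep(v, times = c(300, 700, 4000, 4000, 1000))
all.equal(mu, mean(x))
all.equal(sigma2, var(x) * (length(x) - 1) / length(x))
```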
Example
Take a simple random sample of size n = 500 from the population of Example
1.6.3, and compute the sample proportions of the different ratings,
the average rating, and the sample variance and standard deviation.
Solution.
y = sample(x, size = 500) # x is the population vector set up in the previous solution
table(y)/500
mean(y); var(y); sd(y)
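Because sample() draws at random, each run gives slightly different answers. The sketch below fixes the random seed (the value 401 is arbitrary, an assumption made here only for reproducibility) and places the sample quantities next to their population counterparts.

```r
# Population from the ratings example
x <- c(rep(1, 300), rep(2, 700), rep(3, 4000), rep(4, 4000), rep(5, 1000))

set.seed(401)                 # arbitrary seed, only so the run can be repeated
y <- sample(x, size = 500)    # simple random sample, drawn without replacement

# Sample proportions next to the population proportions
# (factor(..., levels = 1:5) keeps all five ratings as columns)
rbind(sample     = table(factor(y, levels = 1:5)) / length(y),
      population = table(x) / length(x))

# Sample statistics next to their population counterparts
c(sample_mean = mean(y), population_mean = mean(x))
c(sample_var = var(y), population_var = var(x) * (length(x) - 1) / length(x))
```

The sample values should be close to, but typically not equal to, the population values.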
Medians, Percentiles and Box Plots
Example
Read the data on robots' reaction times to simulated malfunctions
into the data frame t by
t = read.table("http://stat.psu.edu/~mga/401/Data/RobotReactTime.txt", header = TRUE)
Read the reaction times of Robot 2 into the vector t2 by
attach(t); t2 = Time[Robot==2]
(a) Obtain the five number summary.
(b) Get the sample 30th, 60th, and 90th percentiles.
(c) Construct a boxplot. Are there any outliers?
Solution. summary(t2); quantile(t2, c(.3,.6,.9)); boxplot(t2, col=2)
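The outlier question in (c) can also be answered numerically. The sketch below uses a short hypothetical vector v (invented here, not the robot data) so that it runs without an internet connection. Note that summary() also reports the mean, whereas fivenum() returns exactly five numbers, and boxplot.stats() lists the points that boxplot() draws separately as outliers.

```r
# Hypothetical reaction times, invented for illustration only.
v <- c(28, 29, 30, 30, 31, 31, 32, 33, 34, 45)

fivenum(v)                    # min, lower hinge, median, upper hinge, max
quantile(v, c(.3, .6, .9))    # sample 30th, 60th and 90th percentiles

out <- boxplot.stats(v)$out   # points more than 1.5 x (hinge spread) beyond the hinges
out                           # here only the value 45 is flagged

boxplot(v, col = 2)           # the flagged point is drawn as a separate circle
```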
- Go to the next lab:
  http://www.stat.psu.edu/~mga/401/course.info/lab2.pdf
- Go to the Stat 401 home page:
  http://www.stat.psu.edu/~mga/401/course.info/
- http://www.google.com