Data displays and summaries - UBC Department of Statistics

Data displays and summaries
The first step in data analysis is
exploratory data analysis (EDA).
In EDA, we can display data using graphical tools and
summarize data using numerical summaries.
In EDA, we usually do not make any assumptions about the data.
It is important to recognize different types of variables (data).
There are two types of variables (data): continuous (or
quantitative) and discrete (or categorical or qualitative).
Different types of variables (data) call for different statistical
methods. Thus, in data analysis you should be able to tell what
type of data you are handling.
In this course we will focus more on continuous (quantitative)
data.
Discrete data displays and summaries
Discrete/categorical data can be graphically displayed by bar
graphs or pie charts.
Discrete/categorical data can be numerically summarized by
counts and percentages/proportions.
The distribution of the data describes how the data are
distributed, i.e., which values occur in the dataset and how
frequently each value (or group of similar values) occurs.
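As a minimal sketch of the numerical summaries above, the following Python snippet tabulates counts and proportions for a small categorical dataset. The blood-type values are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical categorical data: blood types of ten people.
blood_types = ["A", "O", "B", "O", "A", "O", "AB", "O", "A", "B"]

counts = Counter(blood_types)   # how many times each category occurs
n = len(blood_types)
# Proportion of each category; multiply by 100 to get percentages.
proportions = {category: count / n for category, count in counts.items()}

for category in sorted(counts):
    print(f"{category}: count={counts[category]}, "
          f"proportion={proportions[category]:.2f}")
```

The table of counts (or proportions) is exactly the distribution of this categorical variable, and it is what a bar graph or pie chart would display.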
Continuous data displays and summaries
Continuous data can be graphically displayed by a histogram (or
stemplot).
A histogram shows the distribution of the data (i.e., data range
and frequencies).
Continuous data can also be graphically displayed by a boxplot.
Continuous data can be numerically summarized by mean and
standard deviation (or variance).
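To illustrate what a histogram records, here is a rough text-only sketch in Python that groups values into bins and counts the frequency in each bin. The scores and the bin width of 10 are assumptions made for the example.

```python
from collections import Counter

# Hypothetical exam scores (illustrative data, not from the course).
scores = [52, 55, 61, 63, 64, 67, 70, 71, 72, 75, 78, 81, 84, 93]

bin_width = 10  # assumed bin width; other choices give a different picture
# Map each score to the lower edge of its bin, e.g. 67 -> 60.
bin_counts = Counter((score // bin_width) * bin_width for score in scores)

# Print one row per bin: the bin range and one star per observation.
for lower in sorted(bin_counts):
    upper = lower + bin_width - 1
    print(f"{lower:3d}-{upper:3d} | {'*' * bin_counts[lower]}")
```

Each row shows a range of values and how many observations fall in it, which is the same information a histogram draws as bars.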
The two most important summaries of continuous data are the mean
and the standard deviation.
The mean is the average of the values and measures the center of
the data.
The standard deviation (or variance) measures the variation of
the data around the mean.
Both the mean and standard deviation are important for
describing continuous data.
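A short sketch of these two summaries using Python's standard `statistics` module is given below. The measurements are hypothetical, and the sample versions of the standard deviation and variance (dividing by n - 1) are assumed, since the slides do not state which version is intended.

```python
import statistics

# Hypothetical measurements (illustrative only).
data = [4.1, 5.0, 5.6, 6.2, 7.1]

mean = statistics.mean(data)          # center of the data
sd = statistics.stdev(data)           # sample standard deviation (n - 1)
variance = statistics.variance(data)  # sample variance; equals sd squared

print(f"mean={mean:.2f}, sd={sd:.2f}, variance={variance:.2f}")
```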
Both the mean and the standard deviation can be greatly
influenced by a few outliers in the data.
More robust measures of center and variation are the median and
the inter-quartile range (IQR), which are far less sensitive to
outliers.
Other (robust) data summaries include percentiles and
quartiles.
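The effect of a single outlier can be checked with a small experiment. The sketch below uses hypothetical data and Python's `statistics.quantiles` with its default quartile convention; the helper name `iqr` is introduced here for illustration.

```python
import statistics

def iqr(values):
    """Inter-quartile range Q3 - Q1 (default quantile convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

clean = [10, 11, 12, 13, 14, 15, 16]
# Same data, but the largest value is replaced by an extreme outlier.
with_outlier = clean[:-1] + [160]

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name:12s} mean={statistics.mean(values):6.2f} "
          f"sd={statistics.stdev(values):6.2f} "
          f"median={statistics.median(values):5.1f} "
          f"IQR={iqr(values):4.1f}")
```

In this example the mean and standard deviation change substantially, while the median and IQR stay the same.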
Percentiles/quartiles for continuous data:
The 5th percentile is a number such that 5% of the data are
smaller than it and 95% of the data are larger than it.
The 95th percentile is a number such that 95% of the data are
smaller than it and 5% of the data are larger than it.
The 50th percentile is just the median.
The 25th and 75th percentiles are also called the first and
third quartiles, denoted by Q1 and Q3.
A boxplot shows the minimum, Q1, median, Q3, and the
maximum.
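The quantities that a boxplot draws can be computed directly. The following is a sketch with hypothetical data, using the default convention of Python's `statistics.quantiles`; other software may use slightly different interpolation rules and give slightly different quartiles on small datasets.

```python
import statistics

# Hypothetical data with 11 values (illustrative only).
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 21]

# n=4 returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4)
five_number = (min(data), q1, median, q3, max(data))

# n=100 returns 99 cut points; index 49 is the 50th percentile.
percentiles = statistics.quantiles(data, n=100)
p50 = percentiles[49]

print("min, Q1, median, Q3, max:", five_number)
print("50th percentile equals the median:", p50 == median)
```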
A boxplot is a useful tool for summarizing continuous data: it
shows the five-number summary of the dataset.
Note that a histogram shows the distribution of the data, not the
summary statistics. So a histogram and a boxplot serve different
purposes.
Boxplots are useful for comparing different groups of data.
The mean and the median are similar when the data distribution
is symmetric, but they can be very different when the data
distribution is skewed.
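A quick numerical check of the last point, using two small hypothetical datasets (one symmetric, one right-skewed), might look like this:

```python
import statistics

symmetric = [1, 2, 3, 4, 5, 6, 7]        # symmetric around 4
right_skewed = [1, 1, 2, 2, 3, 4, 30]    # long tail to the right

for name, values in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(f"{name:12s} mean={statistics.mean(values):5.2f} "
          f"median={statistics.median(values):5.2f}")
```

For the symmetric data the mean and median coincide, while for the skewed data the mean is pulled toward the long tail.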
There are infinitely many data distributions. The most common
distribution is the normal distribution.
The normal distribution, denoted by N(µ, σ), is completely
determined by its mean µ and standard deviation σ.
When µ = 0 and σ = 1, the normal distribution N(0, 1) is called
the standard normal distribution. Percentiles for N(0, 1) are
available (tables, software, internet).
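As one example of the software route, Python's standard library provides `statistics.NormalDist`, whose `inv_cdf` method returns percentiles. This is a sketch of one possible tool, not the one the course necessarily uses.

```python
from statistics import NormalDist

# Standard normal distribution N(0, 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)

# inv_cdf(p) returns the value below which a proportion p of values fall.
p50 = std_normal.inv_cdf(0.50)    # median of N(0, 1)
p95 = std_normal.inv_cdf(0.95)    # 95th percentile, about 1.645
p975 = std_normal.inv_cdf(0.975)  # 97.5th percentile, about 1.96

print(f"50th: {p50:.3f}, 95th: {p95:.3f}, 97.5th: {p975:.3f}")
```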
A variable X is said to follow a normal distribution if all of its
possible values follow a normal distribution.
The distribution of a continuous variable X can be displayed by
its density function f(x).
For any two numbers a and b (a < b), the proportion of values
of X between a and b is just the area under the density
function f(x) between a and b, i.e.,
\[ \int_a^b f(x)\,dx. \]
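The area interpretation can be verified numerically. The sketch below assumes X follows N(0, 1) and takes a = -1, b = 1 as an example; the helper `area_under_density` is a hypothetical name for a simple midpoint-rule approximation of the integral, which is then compared with the difference of cumulative probabilities.

```python
from statistics import NormalDist

def area_under_density(pdf, a, b, steps=10_000):
    """Approximate the integral of pdf from a to b with the midpoint rule."""
    width = (b - a) / steps
    return sum(pdf(a + (k + 0.5) * width) for k in range(steps)) * width

X = NormalDist(mu=0.0, sigma=1.0)   # assumed distribution for the example
a, b = -1.0, 1.0

numeric = area_under_density(X.pdf, a, b)   # area under f(x) on [a, b]
exact = X.cdf(b) - X.cdf(a)                 # proportion of values in [a, b]

print(f"numeric area = {numeric:.6f}, cdf difference = {exact:.6f}")
```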
If a variable X follows N(µ, σ), then the new variable
\[ Z = \frac{X - \mu}{\sigma} \]
follows N(0, 1).
Given data $x_1, x_2, \ldots, x_n$, the data transformation
\[ z_i = \frac{x_i - \bar{x}}{s}, \qquad i = 1, 2, \ldots, n \]
is called standardization. The $z_i$ values are called z-scores.
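Standardization is straightforward to carry out in code. This is a minimal sketch with hypothetical data, taking x̄ as the sample mean and s as the sample standard deviation (dividing by n - 1), which is an assumption about the intended definition of s.

```python
import statistics

def z_scores(values):
    """Return the z-scores (x_i - x_bar) / s for each x_i in values."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1)
    return [(x - x_bar) / s for x in values]

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]   # hypothetical data
z = z_scores(data)

print([round(value, 3) for value in z])
```

After standardization the z-scores have mean 0 and standard deviation 1, whatever the original units were.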
The 68-95-99.7 rule says that about 68%, 95%, and 99.7% of
normally distributed values lie within one, two, and three
standard deviations of the mean. It can be used to quickly
approximate common percentiles for normally distributed data or
variables.
To obtain any percentile for normally distributed data or
variables, we should first standardize the data or variable, and
then use the standard normal distribution N(0, 1) to find the
desired percentile.
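Both ideas can be sketched with Python's `statistics.NormalDist`. The first part checks the 68-95-99.7 rule on N(0, 1). The second part finds the 90th percentile of a hypothetical variable following N(170, 8) by looking up the percentile of N(0, 1) and undoing the standardization, x = µ + σz. The values 170 and 8 are made up for the example.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standard normal N(0, 1)

# Proportion of values within k standard deviations of the mean.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}
for k, proportion in within.items():
    print(f"within {k} SD: {proportion:.4f}")

# Hypothetical variable X ~ N(170, 8).
mu, sigma = 170.0, 8.0
z90 = Z.inv_cdf(0.90)     # 90th percentile of N(0, 1)
x90 = mu + sigma * z90    # undo the standardization z = (x - mu) / sigma

print(f"90th percentile of N({mu}, {sigma}): {x90:.2f}")
```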
Chapter 1 focuses on univariate EDA, i.e., exploratory data
analysis on data from a single (one) variable.