Chapter 1: Exploring data
Intro:
Statistics is the science of data. We begin our study of statistics by mastering the art of
examining data. Any set of data contains information about some group of individuals. The
information is organized in variables.
Individuals – are the objects described by a set of data. Individuals may be people, but they may
also be other things.
A variable – is any characteristic of an individual. A variable can take different values for
different individuals.
When you come across a new set of data, you need to ask yourself the following questions.
Who? – What individuals do the data describe? How many individuals appear in the data?
What? – How many variables are there? What are the exact definitions of these variables? In
what units is each variable recorded?
Why? – What is the reason the data were gathered? What conclusions are we looking for?
There are two types of variables, categorical and quantitative.
A categorical variable – places an individual into one of several groups or categories.
A quantitative variable – takes numerical values for which arithmetic operations such as adding
and averaging make sense.
A variable generally takes values that vary. The pattern of variation of a variable is its
distribution. The distribution of a variable tells us what values the variable takes and how often
it takes these values.
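As a minimal sketch of what a distribution records, the Python below tallies which values a variable takes and how often. The eye-color data and the helper name distribution are made up for illustration and do not come from the textbook.

```python
from collections import Counter

# Hypothetical data: eye color recorded for ten individuals.
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "hazel", "blue", "brown"]

def distribution(values):
    """Return {value: (count, proportion)} for the observed values."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, count / n) for value, count in counts.items()}

dist = distribution(eye_colors)
for value, (count, proportion) in sorted(dist.items()):
    print(f"{value:6s} count={count} proportion={proportion:.2f}")
```

The same tally works for a quantitative variable, although numerical values are usually grouped into classes first.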
To analyze data, we begin by examining each variable by itself, then move on to study
relationships among the variables. Start with graphs of the distributions, then add numerical
summaries of specific aspects of the data.
1.1: Displaying Distributions with graphs.
There are several graphs to choose from when displaying data: bar graphs, pie charts, dot plots,
stem plots, histograms, and time plots, just to name some that we will be using in this section.
The purpose of a graph is to help us understand the data. It lets you look for an overall pattern
and for striking deviations from that pattern. To describe the overall pattern of a distribution you
start with the three biggest descriptors: shape, center, and spread. Next you can look for outliers
and clusters.
Looking at shape we want to concentrate on main features. Look for major peaks, not minor ups
and downs. Look for clear outliers, not just the smallest and largest observations. Look for
rough symmetry or clear skewness.
A distribution is symmetric if the right and left sides of the histogram are approximately mirror
images of each other. A distribution is skewed right if the right side of the histogram extends out
farther than the left side. A distribution is skewed left if the left side of the histogram extends out
farther than the right side.
Relative frequency, cumulative frequency, percentiles, and ogives (pronounced "O-jive"; an ogive
is a relative cumulative frequency graph).
The pth percentile of a distribution is the value such that p percent of the observations fall at or
below it.
Let's look at a table to see what the other terms mean.
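A minimal sketch of how these quantities can be computed, assuming made-up quiz scores grouped into classes of width 10. The data, the class boundaries, and the function names frequency_table and percent_at_or_below are illustrative choices, not taken from the textbook.

```python
# Hypothetical quiz scores for 20 students.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78,
          80, 82, 83, 85, 88, 90, 91, 93, 95, 99]

def frequency_table(values, start, width, n_classes):
    """Rows of (low, high, freq, rel_freq, cum_freq, rel_cum_freq).

    Each class holds the values with low <= value < high.
    """
    n = len(values)
    rows = []
    cumulative = 0
    for k in range(n_classes):
        low = start + k * width
        high = low + width
        freq = sum(1 for v in values if low <= v < high)
        cumulative += freq
        rows.append((low, high, freq, freq / n, cumulative, cumulative / n))
    return rows

def percent_at_or_below(values, x):
    """Percent of observations that fall at or below x."""
    return 100 * sum(1 for v in values if v <= x) / len(values)

table = frequency_table(scores, start=50, width=10, n_classes=5)
for low, high, freq, rel, cum, rel_cum in table:
    print(f"{low}-{high}: freq={freq} rel={rel:.2f} cum={cum} rel_cum={rel_cum:.2f}")

print(percent_at_or_below(scores, 78))
```

An ogive plots the last column (relative cumulative frequency) against the upper boundary of each class. Reading across from a given percent on the vertical axis gives the matching percentile.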
Time plots plot each observation against the time at which it was measured. Time is always on
the x-axis. We want to describe trends in time plots to analyze what is going on over time.
Homework: #’s 1.23 – 1.30
1.2: Describing Distributions with Numbers.
Measuring center: The mean.
A description of a distribution almost always includes a measure of its center or average. The
most common measure of center is the arithmetic average, or mean.
The mean is represented with the notation $\bar{x}$ and is calculated by adding all of the observations
together and dividing by the number of observations.
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
or, in more compact notation,
$$\bar{x} = \frac{1}{n} \sum x_i$$
An important fact about the mean as a measure of center is that it is sensitive to the influence of a few
extreme observations. Because the mean cannot resist the influence of extreme observations, we
say that it is not a resistant measure of center.
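To make the lack of resistance concrete, here is a minimal sketch with made-up salary data (in thousands of dollars). Adding one extreme observation moves the mean well away from the bulk of the data.

```python
def mean(observations):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(observations) / len(observations)

# Hypothetical salaries, in thousands of dollars.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]   # one extreme observation added

print(mean(salaries))       # mean of the original five values
print(mean(with_outlier))   # mean after adding the extreme value
```

With these values the mean rises from 35 to roughly 95.8, which is larger than every observation except the extreme one.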
Measuring center: The median.
The median M is the midpoint of a distribution, the number such that half the observations are
smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center of the ordered list.
3. If the number of observations n is even, the median M is the mean of the two center
observations in the ordered list.
The median is not influenced by extreme observations, so we say that the median is a resistant
measure of center.
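The three steps for finding the median can be sketched in Python as follows. The salary data are made up for illustration.

```python
def median(observations):
    """Median by the three-step rule: sort, then take the center value
    (odd n) or the mean of the two center values (even n)."""
    ordered = sorted(observations)          # step 1
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                          # step 2: odd number of observations
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 3: even number

# Hypothetical salaries, in thousands of dollars.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]

print(median(salaries))       # center of five ordered values
print(median(with_outlier))   # mean of the two center values
```

Here the extreme value 400 moves the median only from 35 to 36.5, which illustrates why the median is called resistant.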
Comparing the mean and median.
The mean and median of a symmetric distribution are close together. If the distribution is
exactly symmetric then the mean and median are exactly the same. In a skewed distribution the
mean is farther out in the long tail than is the median.
Measuring spread or variability: the quartiles.
One way to measure spread is to calculate the range, which is the difference between the largest
and smallest observations. The range is not a resistant measure of spread; it is greatly influenced by
extreme values.
Another way to measure spread is to measure the spread of the middle half of the data. The
quartiles mark out the middle half. About 25% of the observations fall at or below the first
quartile, 50% fall at or below the second quartile (the median), and 75% fall at or below the
third quartile.
To calculate quartiles:
1. Arrange the observations in increasing order and locate the median M.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is
to the left of the overall median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list
is to the right of the overall median.
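The quartile procedure above can be sketched in Python as follows. The data and the function names are made up for illustration. The sketch assumes at least two observations and leaves the overall median out of both halves when n is odd, as the wording "to the left of" and "to the right of" suggests.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(observations):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    ordered = sorted(observations)               # step 1
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                       # positions left of the median
    # For odd n, skip the median itself; for even n, the upper half starts at half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (median_of_sorted(lower),             # step 2: Q1
            median_of_sorted(ordered),
            median_of_sorted(upper))             # step 3: Q3

odd_data = [1, 3, 4, 6, 7, 9, 12, 15, 18]    # hypothetical, n = 9
even_data = [1, 3, 4, 6, 7, 9, 12, 15]       # hypothetical, n = 8

print(quartiles(odd_data))
print(quartiles(even_data))
```

Statistical software often uses other interpolation rules for quartiles, so its results can differ slightly from this hand method.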
The Interquartile Range (IQR) is the distance between the first and third quartiles.
IQR = Q3 – Q1
If an observation falls between the first and third quartiles, you know that it is neither unusually
high nor unusually low. The IQR is also used to identify outliers. An observation is an outlier if it
falls more than 1.5 × IQR above the third quartile or below the first quartile.
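A minimal sketch of the 1.5 × IQR rule, using made-up data. The quartiles Q1 = 10 and Q3 = 18 were found by hand with the median-of-halves procedure and are passed in as arguments. The function name iqr_outliers is an illustrative choice.

```python
def iqr_outliers(observations, q1, q3):
    """Return the observations more than 1.5 * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in observations if x < low_fence or x > high_fence]

# Hypothetical data with n = 15; the median is 15, Q1 = 10, Q3 = 18.
data = [5, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 18, 20, 21, 37]

print(iqr_outliers(data, q1=10, q3=18))
```

Here IQR = 8, so the cutoffs are 10 - 12 = -2 and 18 + 12 = 30. Only the observation 37 lies outside them.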
The five number summary and box plots.
The five number summary of a data set consists of the smallest observation, the first quartile, the
median, the third quartile and the largest observation, written in order from smallest to largest:
Min   Q1   M   Q3   Max
The five number summary offers a reasonably complete description of center and spread. The
five number summary of a distribution leads to a new graph, the boxplot. Because boxplots
show less detail than histograms or stemplots, they are best used for side-by-side comparison of
more than one distribution. A boxplot gives an indication of the symmetry or skewness of a
distribution. Because regular boxplots conceal outliers, sometimes it is wise to use a modified
boxplot, which plots outliers as isolated points.
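As a rough numerical stand-in for a side-by-side boxplot, the sketch below prints the five number summaries of two made-up groups. The data and function names are illustrative. Quartiles are computed with the median-of-halves rule, which assumes at least two observations.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(observations):
    """Return (Min, Q1, M, Q3, Max)."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the overall median is left out of both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper), ordered[-1])

# Two hypothetical groups to compare.
group_a = [1, 3, 4, 6, 7, 9, 12, 15, 18]
group_b = [2, 4, 4, 5, 6, 7, 8, 30]

for name, group in (("A", group_a), ("B", group_b)):
    print(name, five_number_summary(group))
```

In group B the maximum, 30, lies far above Q3 = 7.5, well beyond 7.5 + 1.5 × 3.5 = 12.75. A modified boxplot would therefore draw it as an isolated point.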
Measuring spread: the standard deviation
The five number summary is not the most common numerical description of a distribution. That
distinction belongs to the combination of the mean to measure center and the standard
deviation to measure spread. The standard deviation measures spread by looking at how far the
observations are from their mean.
To calculate the standard deviation we need to look at the variance $s^2$ first. The variance is the
average squared deviation. The variance $s^2$ of a set of observations is the average of the squares
of the deviations of the observations from their mean. In symbols, the variance of n observations
$x_1, x_2, \dots, x_n$ is
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$
or, more compactly,
$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
The standard deviation s is the square root of the variance $s^2$.
$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
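A minimal worked example of the variance and standard deviation formulas, using a small made-up data set. Note the divisor n - 1 rather than n.

```python
import math

def variance(observations):
    """Sample variance: sum of squared deviations from the mean, over n - 1."""
    n = len(observations)
    x_bar = sum(observations) / n
    return sum((x - x_bar) ** 2 for x in observations) / (n - 1)

def standard_deviation(observations):
    """Sample standard deviation: the square root of the variance."""
    return math.sqrt(variance(observations))

# Hypothetical data: the mean is 3, the squared deviations are 4, 1, 0, 1, 4.
data = [1, 2, 3, 4, 5]

print(variance(data))             # 10 / 4
print(standard_deviation(data))   # square root of 2.5
```

The squared deviations sum to 10 and n - 1 = 4, so the variance is 2.5 and s is about 1.58. Python's standard statistics module provides variance and stdev functions that use the same n - 1 divisor.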
Properties of the standard deviation:
• s measures spread about the mean and should be used only when the mean is chosen as
the measure of center.
• s = 0 only when there is no spread. This happens only when all observations have the
same value. Otherwise, s > 0. As the observations become more spread out about their
mean, s gets larger.
• s, like the mean $\bar{x}$, is not resistant. Strong skewness or a few outliers can make s very large.
Choosing measures of center and spread.
Use the five number summary for describing a skewed distribution or a distribution with strong
outliers. Use mean and standard deviation to describe reasonably symmetric distributions that
are free from outliers.
Changing the units of measurement.
A linear transformation changes the original variable x into the new variable $x_{new}$ given by an
equation of the form
$$x_{new} = a + bx$$
Adding the constant a shifts all values of x upward or downward by the same amount.
Multiplying by the positive constant b changes the size of the unit of measurement.
Linear transformations do not change the shape of a distribution, but they can change the center and
spread. Fortunately, the effects of such changes follow a simple pattern. To see the effect of a
linear transformation on measures of center and spread, apply these rules.
• Multiplying each observation by a positive number b multiplies both the measures of
center (mean and median) and the measures of spread (standard deviation and IQR) by b.
• Adding the same number a to each observation adds a to the measures of center and to the
quartiles but does not change measures of spread.
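These rules can be checked numerically. The sketch below converts made-up Celsius temperatures to Fahrenheit (a = 32, b = 1.8) and then applies a pure shift (b = 1). It relies on Python's standard statistics module, and linear_transform is an illustrative helper name.

```python
import statistics

def linear_transform(observations, a, b):
    """Apply x_new = a + b * x to every observation."""
    return [a + b * x for x in observations]

# Hypothetical temperatures in degrees Celsius.
celsius = [10, 15, 20, 25, 30]
fahrenheit = linear_transform(celsius, a=32, b=1.8)   # change of units
shifted = linear_transform(celsius, a=5, b=1)         # add a constant only

for name, values in (("celsius", celsius), ("fahrenheit", fahrenheit),
                     ("shifted", shifted)):
    print(name, statistics.mean(values), statistics.median(values),
          round(statistics.stdev(values), 3))
```

The Fahrenheit mean and median equal 32 + 1.8 times the Celsius values, while the standard deviation is multiplied by 1.8 only. The pure shift moves the mean and median up by 5 and leaves the standard deviation unchanged.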
Comparing distributions.
Back-to-back stemplots and side-by-side boxplots are useful for comparing quantitative
distributions.
Homework: #’s 1.51 – 1.58
Chapter review
Homework: #’s 1.59 – 1.69