Chapter 1: Looking at Data - Distributions
I. Introduction (IPS pages 4-5)
A. Individuals – Individuals are the objects described by a set of data.
Individuals may be people, but they may also be animals or things.
B. Variable – A variable is any characteristic of an individual. A variable
can take different values for different individuals.
C. Categorical Variable – A categorical variable places an individual into
one of several groups or categories.
D. Quantitative Variable – A quantitative variable takes numerical values for
which arithmetic operations such as adding and averaging make sense.
E. Distribution – The distribution of a variable tells us what values it takes
and how often it takes these values.
II. Displaying Distributions with Graphs (IPS Section 1.1, pages 5-38)
A. Exploratory Data Analysis – Exploratory data analysis uses graphs and
numerical summaries to describe the variables in a data set and the
relations among them.
B. A large number of observations on a single variable can be summarized
in a table of frequencies (counts) or relative frequencies (percents or
fractions).
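As a rough illustration of item B, the sketch below builds a table of counts and relative frequencies. The function name `frequency_table` and the eye-color data are hypothetical, chosen only for this example.

```python
from collections import Counter

def frequency_table(observations):
    """Map each distinct value to (count, relative frequency)."""
    counts = Counter(observations)
    n = len(observations)
    return {value: (count, count / n) for value, count in counts.items()}

# Hypothetical categorical data, not taken from the text.
colors = ["brown", "blue", "brown", "green", "brown", "blue"]
table = frequency_table(colors)

for value, (count, rel) in sorted(table.items()):
    print(f"{value:<6} {count:>3} {rel:6.1%}")
```

The relative frequencies always sum to 1, which is what lets a bar graph or pie chart use either scale.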
C. Bar graphs and pie charts display the distributions of categorical
variables. These graphs use the frequencies or relative frequencies of
the categories.
D. Stemplots and histograms display the distributions of quantitative
variables. Stemplots separate each observation into a stem and a
one-digit leaf. Histograms plot the frequencies or relative frequencies of
classes of values.
E. To make a stemplot:
1. Separate each observation into a stem consisting of all but the final
(rightmost) digit and a leaf, the final digit. Stems may have as many
digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and
draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order
out from the stem.
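The three steps above can be sketched in a few lines of code. This is a minimal version that assumes non-negative integer observations; the function name `stemplot` and the sample scores are hypothetical. It also lists stems that have no leaves, so that gaps in the distribution stay visible.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative integers."""
    leaves = defaultdict(list)
    # Step 1: split each observation into stem (all but the last digit)
    # and leaf (the last digit). Sorting first keeps leaves in order.
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    # Steps 2 and 3: stems in a column, smallest at the top, a vertical
    # line to their right, then the leaves in increasing order.
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines

# Hypothetical data, not taken from the text.
scores = [23, 31, 35, 8, 42, 35, 27, 50]
lines = stemplot(scores)
print("\n".join(lines))
```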
F. Examining a Distribution – In any graph of data, look for the overall pattern
and for striking deviations from that pattern. You can describe the
overall pattern of a distribution by its shape, center and spread. An
important kind of deviation is an outlier, an individual value that falls
outside the overall pattern. Always look for outliers and try to explain
them.
Moore, David and McCabe, George. 2002. Introduction to the Practice of Statistics. W. H. Freeman
and Company, New York. 2 -101.
G. Some distributions have simple shapes, such as symmetric and skewed.
A distribution is symmetric if the values smaller and larger than its
midpoint are mirror images of each other. It is skewed to the right if the
right tail (larger values) is much longer than the left tail (smaller values).
The number of modes (major peaks) is another aspect of overall shape.
A distribution with one major peak is called unimodal. Not all
distributions have a simple overall shape, especially when there are few
observations.
H. Time Plot – A time plot of a variable plots each observation against the
time at which it was measured. Always put time on the horizontal scale
of your plot and the variable you are measuring on the vertical scale.
Connecting the data points by lines helps emphasize any change over
time.
I. Many interesting data sets are time series, measurements of a variable
taken at regular intervals over time.
J. Seasonal Variation and Trend – A pattern in a time series that repeats
itself at known regular intervals of time is called seasonal variation. A
trend in a time series is a persistent, long-term rise or fall.
III. Describing Distributions with Numbers (IPS Section 1.2, pages 38-63)
A. Mean – To find the mean $\bar{x}$ of a set of observations, add their values
and divide by the number of observations. If the n observations are
$x_1, x_2, \ldots, x_n$, their mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
or, in more compact notation,
$$\bar{x} = \frac{1}{n} \sum x_i$$
B. Median – The median M (or $\tilde{x}$) is the midpoint of a distribution, the number
such that half the observations are smaller and the other half are larger.
To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median is the center
observation in the ordered list. Find the location of the median by
counting (n+1)/2 observations up from the bottom of the list.
3. If the number of observations n is even, the median is the mean of the
two center observations in the ordered list. The location of the
median is again (n+1)/2 from the bottom of the list.
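A small sketch of the mean and of the median procedure just described may help. The function names and the data are hypothetical. Note that the location (n+1)/2 counts from 1, while Python lists are indexed from 0.

```python
def mean(xs):
    """Add the values and divide by the number of observations."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the ordered list, or the mean of the two middle values."""
    ordered = sorted(xs)               # step 1: order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # step 2: odd n, take the center observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 3: even n

# Hypothetical data with one very large observation.
data = [2, 3, 5, 7, 100]
print(mean(data), median(data))
print(median([1, 2, 3, 4]))
```

With the single large value 100 in the list, the mean (23.4) lies far above the median (5).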
C. Mean versus Median – The median and mean are the most common
measures of the center of a distribution. The mean and median of a
symmetric distribution are close together. If the distribution is exactly
symmetric, the mean and the median are exactly the same. In a skewed
distribution, the mean is farther out in the long tail than is the median.
D. The simplest useful numerical description of a distribution consists of
both a measure of center and a measure of spread.
E. We can describe the spread or variability of a distribution by giving
several percentiles. The pth percentile of a distribution is the value such
that p percent of the observations fall at or below it. The most
commonly used percentiles other than the median are the quartiles. The
first quartile is the 25th percentile, and the third quartile is the 75th
percentile. The second quartile is the median.
F. To Calculate the Quartiles:
1. Arrange the observations in increasing order and locate the median in
the ordered list of observations.
2. The first quartile Q1 is the median of the observations whose position
in the ordered list is to the left of the location of the overall median.
3. The third quartile Q3 is the median of the observations whose position
in the ordered list is to the right of the location of the overall median.
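The quartile rule above can be sketched as follows, under the reading that when n is odd the median observation itself belongs to neither half. The function names and data are hypothetical. Statistical software often uses other interpolation conventions, so its quartiles may differ slightly from these.

```python
def median(xs):
    """Middle value of the ordered list, or the mean of the two middle values."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(xs):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # positions left of the median location
    upper = ordered[half + (n % 2):]   # positions right of the median location
    return median(lower), median(ordered), median(upper)

# Hypothetical data, not taken from the text.
print(quartiles([1, 3, 4, 6, 8, 9, 12]))
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
```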
G. The Five-Number Summary – The five-number summary of a set of
observations consists of the smallest observation, the first quartile, the
median, the third quartile, and the largest observation, written in order
from smallest to largest. In symbols, the five-number summary is
Minimum   Q1   M (or $\tilde{x}$)   Q3   Maximum
H. Boxplot – A boxplot is a graph of the five-number summary. A central
box spans the quartiles Q1 and Q3. A line in the box marks the median.
Lines extend from the box out to the smallest and largest observations.
I. The Interquartile Range IQR – The interquartile range IQR is the
distance between the first and third quartiles:
$$\mathrm{IQR} = Q_3 - Q_1$$
I. The 1.5 × IQR Criterion for Outliers – Call an observation a suspected
outlier if it falls more than 1.5 × IQR above the third quartile or below the
first quartile.
J. Modified Boxplot – In a modified boxplot, the lines extend out from the
central box only to the smallest and largest observations that are not
suspected outliers. Observations more than 1.5 × IQR outside the box
are plotted as individual points.
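The five-number summary, the 1.5 × IQR criterion, and the line ends of a modified boxplot can be sketched together. The function names and data are hypothetical, and the quartiles use the median-of-halves rule from this outline, which may differ from a software package's default.

```python
def median(xs):
    """Middle value of the ordered list, or the mean of the two middle values."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(xs):
    """Return (minimum, Q1, M, Q3, maximum)."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + (n % 2):])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

def suspected_outliers(xs):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low or x > high]

def whisker_ends(xs):
    """Smallest and largest observations that are not suspected outliers."""
    flagged = set(suspected_outliers(xs))
    kept = [x for x in xs if x not in flagged]
    return min(kept), max(kept)

# Hypothetical data, not taken from the text.
data = [1, 3, 4, 6, 8, 9, 12, 40]
print(five_number_summary(data))
print(suspected_outliers(data))
print(whisker_ends(data))
```

Here Q1 = 3.5 and Q3 = 10.5, so IQR = 7 and the cutoffs are -7 and 21. Only the observation 40 is flagged, and the lines of the modified boxplot would end at 1 and 12.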
K. The Variance $s^2$ and the Standard Deviation s – The variance $s^2$ of a set
of observations is the average of the squares of the deviations of the
observations from their mean. In symbols, the variance of n
observations $x_1, x_2, \ldots, x_n$ is
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
or, in more compact notation,
$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
The standard deviation s is the square root of the variance $s^2$:
$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
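A short sketch of these two formulas follows, with the divisor n - 1 written out explicitly. The function names and data are hypothetical. The standard library's `statistics.variance` uses the same n - 1 divisor and is included only as a cross-check.

```python
import math
import statistics

def variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def std_dev(xs):
    """Square root of the variance."""
    return math.sqrt(variance(xs))

# Hypothetical data with mean 5; the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))             # 32 / 7
print(std_dev(data))
print(statistics.variance(data))  # library cross-check
```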
L. Properties of the Standard Deviation –
1. s measures spread about the mean and should be used only when
the mean is chosen as the measure of center.
2. s=0 only when there is no spread. This happens only when all
observations have the same value. Otherwise, s > 0. As the
observations become more spread out about their mean, s gets
larger.
3. s, like the mean, is not resistant. A few outliers can make s very
large.
M. Resistant Measures – A resistant measure of any aspect of a distribution
is relatively unaffected by changes in the numerical value of a small
proportion of the total number of observations, no matter how large these
changes are. The median and quartiles are resistant, but the mean and
the standard deviation are not. The mean and standard deviation are
good descriptions for symmetric distributions without outliers. They are
most useful for the normal distributions introduced in the following
section. The five-number summary is a better exploratory summary for
skewed distributions.
N. Linear Transformations – A linear transformation changes the original
variable x into the new variable $x_{\text{new}}$ given by an equation of the form
$$x_{\text{new}} = a + bx$$
Adding the constant a shifts all values of x upward or downward by the
same amount. In particular, such a shift changes the origin (zero point)
of the variable. Multiplying by the positive constant b changes the size of
the unit of measurement. Linear transformations do not change the
shape of a distribution.
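As an illustration of a linear transformation, the sketch below converts hypothetical Celsius readings to Fahrenheit (a = 32, b = 1.8). The relationships it checks, that the mean becomes a + b times the old mean and the standard deviation is multiplied by b, are standard results that this outline does not state explicitly.

```python
import statistics

def linear_transform(xs, a, b):
    """Apply x_new = a + b * x to every observation."""
    return [a + b * x for x in xs]

# Hypothetical temperature readings in degrees Celsius.
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
fahrenheit = linear_transform(celsius, a=32.0, b=1.8)

print(fahrenheit)
print(statistics.mean(celsius), statistics.mean(fahrenheit))
print(statistics.stdev(celsius), statistics.stdev(fahrenheit))
```

Adding a moves the center without changing the spread, while multiplying by b rescales both. The relative positions of the observations are unchanged, which is why the shape is preserved.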
IV. The Normal Distribution (IPS Section 1.3, pages 63-101)
A. Mathematical Model – A mathematical model is an idealized description.
It gives a compact picture of the overall pattern of the data but ignores
minor irregularities as well as any outliers.
B. Density Curve – A density curve is a curve that is always on or above
the horizontal axis and has area exactly 1 underneath it. A density curve
describes the overall pattern of a distribution. The area under the curve
and above any range of values is the relative frequency of all
observations that fall in that range. Examples of density curves include
the following:
Figure 1.22(a) A symmetric density curve with its mean and median
marked.
Figure 1.22(b) A right-skewed density curve with its mean and median
marked.
C. Mode – A mode of a distribution described by a density curve is a peak
point of the curve, the location where the curve is highest.
D. Median – Because areas under the curve represent proportions of the
observations, the median is the point with half the total area on each
side.
E. Quartiles – You can roughly locate the quartiles by dividing the area
under the curve into quarters as accurately as possible by eye.
F. IQR – The IQR is the distance between the first and third quartiles.
G. Median and Mean of a Density Curve – The median of a density curve is
the equal-areas point, the point that divides the area under the curve in
half. The mean of a density curve is the balancing point, at which the
curve would balance if made of solid material. The median and mean
are the same for a symmetric density curve. They both lie at the center
of the curve. The mean of a skewed curve is pulled away from the
median in the direction of the long tail.
Figure 1.23 The mean of a density curve is the point at which it would
balance.
H. Normal Distributions – The normal distributions are described by
bell-shaped, symmetric, unimodal density curves. The mean μ and standard
deviation σ completely specify the normal distribution N(μ, σ²). The
mean is the center of symmetry, and σ is the distance from μ to the
change-of-curvature points on either side.
I. The 68-95-99.7 Rule – In the normal distribution with mean μ and
standard deviation σ:
• Approximately 68% of the observations fall within σ of the mean μ.
• Approximately 95% of the observations fall within 2σ of the mean μ.
• Approximately 99.7% of the observations fall within 3σ of the mean μ.
Figure 1.25 The 68-95-99.7 rule for normal distributions.
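The three percentages in the 68-95-99.7 rule can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `proportion_within` is hypothetical.

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma) curve within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(proportion_within(k), 4))
```

The proportions are the same for every choice of μ and σ, so the rule applies to any normal distribution. The figures 68, 95, and 99.7 are rounded values.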
J. Standardizing and z-Scores – If x is an observation from a distribution
that has mean μ and standard deviation σ, the standardized value of x is
$$z = \frac{x - \mu}{\sigma}$$
A standardized value is often called a z-score.
K. The Standard Normal Distribution – The standard normal distribution is
the normal distribution N(0,1) with mean 0 and standard deviation 1. If a
variable X has any normal distribution N(μ, σ²) with mean μ and standard
deviation σ, then the standardized variable
$$Z = \frac{X - \mu}{\sigma}$$
has the standard normal distribution.
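A minimal sketch of standardizing follows. The numbers (mean 64, standard deviation 2.7, observation 70) and the function names are hypothetical. The second helper standardizes a whole data set using its own sample mean and standard deviation in place of μ and σ, which is an assumption made for illustration.

```python
import statistics

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def standardize(data):
    """Standardize each observation using the sample mean and standard deviation."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return [z_score(x, m, s) for x in data]

print(z_score(70, 64, 2.7))

# Hypothetical data, not taken from the text.
zs = standardize([61.0, 63.5, 64.0, 66.0, 70.5])
print(statistics.mean(zs), statistics.stdev(zs))
```

After standardizing, the values have mean 0 and standard deviation 1, up to floating-point rounding.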
L. The Standard Normal Table – Table A in the back of your book is a table
of areas under the standard normal curve. The table entry for each
value z is the area under the curve to the left of z.
The following figures further illustrate this concept:
Figure 1.27(a) The area under a standard normal curve to the left of
the point z = 1.40 is 0.9192. Table A gives areas under the standard
normal curve.
Figure 1.27(b) Areas under the standard normal curve to the right
and left of z = -2.15. Table A gives areas to the left.
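The Table A lookups in these two figures can be reproduced in software. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of the printed table. The helper names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def area_left(z):
    """Area under the standard normal curve to the left of z (a Table A entry)."""
    return STANDARD_NORMAL.cdf(z)

def area_right(z):
    """Area to the right of z; the total area under the curve is 1."""
    return 1.0 - area_left(z)

print(round(area_left(1.40), 4))    # the entry shown in Figure 1.27(a)
print(round(area_left(-2.15), 4))
print(round(area_right(-2.15), 4))
```

Because the table gives only areas to the left, an area to the right is found by subtracting the table entry from 1.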
M. Normal Quantile Plot – The adequacy of a normal model for describing a
distribution of data is best assessed by a normal quantile plot, which is
available in most statistical packages.
N. Use of Normal Quantile Plots – If the points on a normal quantile plot lie
close to a straight line, the plot indicates that the data are normal.
Systematic deviations from a straight line indicate a nonnormal
distribution. Outliers appear as points that are far away from the overall
pattern of the plot.
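One way the coordinates of a normal quantile plot can be computed is sketched below. It assumes the plotting positions (i - 0.5)/n for the i-th smallest observation; statistical packages use several slightly different conventions, so this is not necessarily what a given package does. The function name and data are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Pair each ordered observation with the standard normal z-score
    at its assumed plotting position (i - 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]

# Hypothetical data, not taken from the text.
points = normal_quantile_points([5.1, 4.8, 5.6, 5.0, 4.3])
for z, x in points:
    print(f"{z:6.3f}  {x}")
```

Plotting x against z and checking whether the points fall near a straight line is the visual test described in items M and N.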