Chapter 4
Displaying Quantitative Data
Numerical data can be visualized with a histogram.
Separate the data into equal-width intervals along a
horizontal axis, tally the frequency of data in each
interval, and build a rectangle over each interval whose
height measures the frequency (or, rather, the relative
frequency = proportion) of data in that interval.
[TI83: STAT Edit, STATPLOT, ZoomStat, and Window
settings.]
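The tallying step behind a histogram can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function name `histogram_table`, its parameters, and the `scores` data are all hypothetical, and the sketch assumes every value lies between `low` and `low + n_bins * width`.

```python
from collections import Counter

def histogram_table(data, low, width, n_bins):
    """Return (interval, frequency, relative frequency) rows for equal-width bins."""
    counts = Counter()
    for y in data:
        # Index of the interval [low + k*width, low + (k+1)*width) containing y.
        k = int((y - low) // width)
        # A value equal to the overall upper limit goes into the last interval.
        k = min(k, n_bins - 1)
        counts[k] += 1
    n = len(data)
    rows = []
    for k in range(n_bins):
        left = low + k * width
        rows.append(((left, left + width), counts[k], counts[k] / n))
    return rows

# Hypothetical data, invented for illustration only.
scores = [52, 58, 61, 64, 67, 67, 70, 73, 75, 78, 81, 88]
for (left, right), freq, rel in histogram_table(scores, low=50, width=10, n_bins=4):
    # One '#' per observation stands in for the height of the rectangle.
    print(f"[{left}, {right}): {'#' * freq:<6} freq={freq} rel={rel:.2f}")
```

The relative frequencies always add up to 1, which is why they can be read as proportions of the data.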
A dotplot displays each value as a dot over a scale axis.
Dots are stacked over the axis to indicate clusters of
data.
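As a rough illustration of how dots stack up, the following Python sketch builds a text dotplot. It assumes small integer data values, and the function name `dotplot_lines` and the sample data are hypothetical, chosen only for this example.

```python
from collections import Counter

def dotplot_lines(data):
    """Build a text dotplot: one column per integer value, dots stacked upward."""
    counts = Counter(data)
    values = list(range(min(data), max(data) + 1))
    tallest = max(counts.values())
    lines = []
    # Draw from the top level down so the stacks rest on the axis.
    for level in range(tallest, 0, -1):
        row = "".join(" * " if counts[v] >= level else "   " for v in values)
        lines.append(row.rstrip())
    # The last line is the scale axis, one label per column.
    lines.append("".join(f"{v:^3}" for v in values))
    return lines

# Hypothetical data, invented for illustration only.
for line in dotplot_lines([1, 2, 2, 3, 3, 3, 5]):
    print(line)
```

The tallest stack marks where the data cluster; the empty column over 4 shows a gap.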
A quicker way to display numerical data by hand is
with a stem-and-leaf display. All but the rightmost
digit (or digits) of the measurement become stems;
stems head rows in which the remaining digit(s), the
leaves, are listed, carefully lined up in columns. (List
all intermediate stems, even if they contain no leaves.)
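The construction just described can be sketched in Python. This is a minimal sketch, assuming non-negative whole-number data with the units digit as the leaf; the function name `stem_and_leaf` and the sample values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return display lines: stems are all digits but the last, leaves are the last digit."""
    leaves = defaultdict(list)
    for y in sorted(data):
        stem, leaf = divmod(y, 10)
        leaves[stem].append(leaf)
    lines = []
    # List every intermediate stem, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines

# Hypothetical data, invented for illustration only.
for line in stem_and_leaf([52, 58, 61, 64, 67, 67, 85, 88]):
    print(line)
```

Keeping the empty stem 7 in the display preserves the gap in the data, so the shape reads the same way a histogram would.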
Data that track the change over time of a particular
characteristic can be displayed in a time plot. Time is measured
along the horizontal axis and the characteristic of
interest is displayed along the vertical axis. Connecting
consecutive data points highlights variation over time.
Describing Numerical Data: Features of Interest
• The shape of a histogram or stem-and-leaf display
describes the distribution of the data, where data
are concentrated and how they spread out across
the entire range of values.
• Where is the center of the distribution located?
• How much spread is there in the distribution?
How tightly are data clustered about the center?
• Is there more than one cluster, or mode? (The
location of modes can change as the scale of a
display is altered.) Are the data unimodal,
bimodal, or multimodal?
• Is the distribution uniform (flat), indicating that
every value is (roughly) equally represented? Is it
roughly symmetric, with equally frequent values
on either side of the center? Or is it skewed (to the
left or right, in the direction of the tail)?
• Are there any outliers (values located far from the
center)? Can we explain them?
Summarizing Numerical Data: Center and Spread
• midrange
the number halfway between the smallest and
largest data value is an estimate of the center of
the distribution; often a poor estimate, since it is
highly sensitive to the presence of outlier values
• median
the middle observation in a sorted list of the data
values (for an even number of values, average the
two middle observations); a better estimate of
center since it is resistant to the effects of outliers,
hence a more commonly used measure of center
• range
the difference between the largest and smallest
data values is an estimate of the spread in the
data; again, often a poor estimate, since it is highly
sensitive to the size of outlier values
• lower/upper quartiles (Q1 and Q3)
the observations which are one quarter (Q1) and
three quarters (Q3) of the way up the list, the
median values of the half of the data located
below/above the median; also, the 25th and 75th
percentiles of the data
• interquartile range (IQR)
the difference IQR = Q3 − Q1 between the two
quartiles; a better measure of the spread in the
data since it is resistant to the presence of outliers
The five-number summary of a data set:
• minimum value,
• lower quartile,
• median,
• upper quartile, and
• maximum value.
[TI83: STAT CALC 1-VarStats]
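The measures of center and spread defined above, together with the five-number summary, can be sketched in Python as follows. The quartiles here are computed as the medians of the lower and upper halves of the sorted data, leaving out the middle value when the count is odd; that is one common convention, adopted here as an assumption, and other software may give slightly different quartiles. The function names and both data sets are hypothetical.

```python
def median(sorted_values):
    """Middle observation, or the average of the two middle ones for an even count."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def summary(data):
    """Five-number summary plus midrange, range, and IQR (needs at least two values)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]        # half of the data below the median
    upper = values[n - half:]    # half of the data above the median
    q1, q3 = median(lower), median(upper)
    return {
        "min": values[0], "Q1": q1, "median": median(values),
        "Q3": q3, "max": values[-1],
        "midrange": (values[0] + values[-1]) / 2,
        "range": values[-1] - values[0],
        "IQR": q3 - q1,
    }

# Hypothetical data; the second list replaces the largest value with an outlier.
clean = [2, 4, 4, 5, 7, 8, 9, 10, 12]
with_outlier = [2, 4, 4, 5, 7, 8, 9, 10, 120]
print(summary(clean))
print(summary(with_outlier))
```

Comparing the two printed summaries shows the point made in the definitions: the outlier moves the midrange and the range a great deal, while the median and the IQR stay where they were.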
• mean ($\bar{y}$)
the arithmetical average (where the distribution
“balances”), $\bar{y} = \frac{\sum y}{n}$;
in skewed distributions, the mean is pulled in the
direction of the skewness (the longer tail),
indicating sensitivity to the presence of outliers;
because each data value contributes its own “weight”
to the determination of the mean, this is the most
commonly used measure of center in practice
• deviation from the mean ($y - \bar{y}$)
the difference between a data value and the mean
of all the data
• variance ($s^2$)
estimates the average squared deviation from the
mean, $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
• standard deviation ($s$)
measure of spread that estimates the size of a
typical deviation from the mean,
$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$;
like the mean, sensitive to outliers
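The formulas for the mean, the deviations, the variance, and the standard deviation can be checked with a short Python sketch. The function names and both data sets are hypothetical, made up for this illustration; the variance divides by n − 1, as in the formula above.

```python
import math

def mean(data):
    """Arithmetical average: the sum of the values divided by their count."""
    return sum(data) / len(data)

def variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    ybar = mean(data)
    return sum((y - ybar) ** 2 for y in data) / (len(data) - 1)

def std_dev(data):
    """Square root of the variance."""
    return math.sqrt(variance(data))

# Hypothetical data; the second list replaces the largest value with an outlier.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
sample_with_outlier = [2, 4, 4, 4, 5, 5, 7, 49]

# Deviations from the mean always sum to zero (the mean is the balance point).
deviations = [y - mean(sample) for y in sample]
print(deviations, sum(deviations))

print(mean(sample), variance(sample), std_dev(sample))
print(mean(sample_with_outlier), std_dev(sample_with_outlier))
```

Because the deviations cancel out, they are squared before averaging. The second data set shows how a single outlier pulls the mean toward the long tail and inflates the standard deviation.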