Chapters 4-5
Displaying Quantitative Data
Numerical data can be visualized with a histogram.
Data are separated into equal intervals along a numerical scale, then the frequency of data in each of the
intervals is tallied. Rectangles are built over each interval, with heights, measured along a vertical scale,
given by the frequency (or relative frequency) of data
within each interval.
[TI83: STAT Edit, STATPLOT, ZoomStat, and
Window settings.]
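The tallying step behind a histogram can be sketched in Python. This is only an illustration under stated assumptions: the scores are invented, and the helper name `tally`, the interval width of 10, the starting point of 50, and the convention that each interval includes its left endpoint are choices made for the example, not part of the notes.

```python
from collections import Counter

def tally(data, width, start):
    """Count values in each interval [start + k*width, start + (k+1)*width).

    Assumes start is no larger than the smallest data value.
    """
    counts = Counter((x - start) // width for x in data)
    last = max(counts)
    # Report every interval, including empty ones, so gaps stay visible.
    return [(start + k * width, counts.get(k, 0)) for k in range(int(last) + 1)]

# Hypothetical quiz scores, invented for illustration.
scores = [52, 57, 61, 64, 68, 70, 71, 73, 75, 78, 79, 82, 85, 88, 96]

bins = tally(scores, width=10, start=50)
n = len(scores)
for left, freq in bins:
    # One '#' per observation gives a sideways text histogram;
    # freq / n is the relative frequency for that interval.
    print(f"{left:>3}-{left + 10:<3} {'#' * freq:<6} freq={freq} rel={freq / n:.2f}")
```

Changing `width` or `start` regroups the same data, which is one way to see how the apparent shape of a histogram depends on the chosen intervals.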
A quicker way to display numerical data by hand is with
a stem-and-leaf display. All but the rightmost digit
(or digits) of the measurement become stems; stems
head rows in which the remaining digit(s), the leaves,
are listed, carefully lined up vertically in columns. (List
all intermediate stems, even if they contain no leaves!)
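The construction of a stem-and-leaf display can be sketched as follows. The sketch assumes non-negative whole-number data with the last digit used as the leaf; the function name `stem_and_leaf` and the sample values are hypothetical.

```python
def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers (leaf = last digit)."""
    rows = {}
    for v in sorted(values):
        rows.setdefault(v // 10, []).append(v % 10)
    lines = []
    # Walk every stem from lowest to highest so that empty stems still get a row.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines

# Hypothetical measurements, invented for illustration.
data = [12, 15, 21, 23, 23, 28, 41, 44, 47, 50]
display = stem_and_leaf(data)
print("\n".join(display))
```

Stem 3 has no leaves in this data set but is still printed, so the gap between the 20s and the 40s remains visible.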
Describing Quantitative Data: Features of Interest
• The shape of a histogram or stem-and-leaf describes
the distribution of the data – where data is concentrated
and how it spreads out across the entire range of values.
• Where is the center of the distribution located?
• How much spread is there in the distribution? How
tightly is the data clustered about the center?
• Is there more than one cluster, or mode? Is the data
unimodal, bimodal, multimodal? Note: The location of modes can change with the scaling unit of a
display (width of a bar).
• Is the distribution uniform (has a flat contour), indicating that every value is (roughly) equally represented?
Is it roughly symmetric, with equally frequent values
on either side of the center (the distribution to the right
of the center is the mirror image of what appears to the
left)? Or is it skewed (heavier on one side of the center
than the other) to the left or right, in the direction of
the tail (region of most extreme values)?
• Are there any outliers (values located very far from
the center)? Can we explain why they appear?
Summarizing Quantitative Data: Statistics
• statistic
Any quantity computed from the data values; typically
used to quantify some feature of the distribution of the
data.
There are two chief systems of statistics in common use:
ordered statistics, which locate features of the distribution in terms of an ordering of the data values from
lowest to highest; and weighted statistics, which locate features relative to how they balance against the
rest of data set on the measuring scale.
Feature             Ordered statistic           Weighted statistic
Center              median                      mean (x̄)
Spread              interquartile range (IQR)   standard deviation (s)
Relative standing   percentile                  z-score (z)
• median
the middle observation in a sorted list of the data values (for an even number of values, average the two
middle observations); often a better measure of center
than the mean when the data are skewed or contain
outliers, since it is resistant to their effects
• percentile
the pth percentile is a data value greater than or equal
to p% of the data (the median is the 50th percentile)
• lower/upper quartiles (Q1 and Q3)
the observations which are one quarter (Q1) and three
quarters (Q3) of the way up the list; also equal to
the median values of each half of the data located below/above the median; or the 25th and 75th percentiles
(The 0th quartile is the minimum value: Q0 = min; the
second quartile is the median: Q2 = median; and the
fourth quartile is the maximum value: Q4 = max.)
• interquartile range (IQR)
difference IQR = Q3 − Q1 between the two quartiles
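The quartile definitions above can be sketched in Python using the "median of each half" description. One detail is an assumption of this sketch: when the number of values is odd, the middle observation is left out of both halves. Other tools may use a different convention and give slightly different quartiles. The function name `quartiles` and the data are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]            # values below the middle position
    upper = xs[len(xs) - half:]  # values above the middle position
    return median(lower), median(xs), median(upper)

values = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # hypothetical data
q1, med, q3 = quartiles(values)
iqr = q3 - q1  # IQR = Q3 - Q1
print(q1, med, q3, iqr)
```

For these nine values the sorted list is 3, 5, 7, 8, 12, 13, 14, 18, 21, so the median is 12, the lower half 3, 5, 7, 8 has median 6, and the upper half 13, 14, 18, 21 has median 16, giving IQR = 10.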
• five-number summary of a data set is the list of its
five quartiles:
◦ minimum (min)
◦ lower quartile (Q1)
◦ median
◦ upper quartile (Q3)
◦ maximum (max)
• boxplot
display of the five-number summary formed by
◦ drawing a box over a number line so that the sides
of the box are located at the two quartiles,
◦ drawing the wall (a line across the box) at the location of the median, and
◦ drawing whiskers (lines parallel to the number line)
extended from the sides of the box to the min and
max.
[TI83: STATPLOT, ZoomStat]
• modified boxplot
same as above, except:
◦ whiskers extend from the sides of the box to the
most extreme data values lying within the fences, points
positioned 1.5 IQR beyond each end of the box; and
◦ outliers (any values lying outside the fences) are individually marked with symbols; far outliers, which
lie more than 3 IQR from the ends of the box, are
often marked with different symbols for emphasis.
• resistance to outliers
moving the extreme values of a data set either further
away or closer to the center of the distribution does
not change the median value; hence, the median (and
other ordered statistics) is often preferred when describing skewed data sets (income data, housing prices, etc.).
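The fence rule for the modified boxplot can be sketched as a small classifier. The income figures are invented, the function names are hypothetical, and the quartiles are computed with the median-of-halves rule (an assumption; other quartile conventions would shift the fences slightly).

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using median-of-halves quartiles."""
    xs = sorted(data)
    half = len(xs) // 2
    return xs[0], median(xs[:half]), median(xs), median(xs[len(xs) - half:]), xs[-1]

def classify(data):
    """Split values into those inside the fences, outliers, and far outliers."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # the fences
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)      # cutoffs for far outliers
    inside, outliers, far = [], [], []
    for x in sorted(data):
        if x < outer[0] or x > outer[1]:
            far.append(x)
        elif x < inner[0] or x > inner[1]:
            outliers.append(x)
        else:
            inside.append(x)
    return inner, inside, outliers, far

incomes = [22, 25, 27, 30, 31, 33, 36, 40, 65, 95]  # hypothetical, in $1000s
fences, inside, outliers, far = classify(incomes)
print(fences, outliers, far)
```

Here Q1 = 27 and Q3 = 40, so IQR = 13 and the fences sit at 7.5 and 59.5. The value 65 lies outside the upper fence, and 95 lies more than 3 IQR above the box, so it counts as a far outlier.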
• mean (x̄)
a data set that includes n repeated measures of some
variable quantity x has mean value equal to its arithmetical average, the sum of the values divided by the
number of values:
\[ \bar{x} = \frac{\sum x}{n} \]
It represents the point on the number scale where the
distribution “balances” (as if the histogram were made
of some massive substance)
• sensitivity to outliers
moving the extreme values of a data set either further
away or closer to the center of the distribution can substantially alter the mean value; hence, the mean (and
other weighted statistics) is best reserved for describing symmetric data sets or those without much skew. In skewed
distributions, the mean is pulled in the direction of the
skewness (the longer tail).
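The contrast between the resistant median and the sensitive mean can be checked numerically. The prices below are invented for illustration; the second list is the same data with only the largest value moved far from the center.

```python
from statistics import mean, median

prices = [180, 195, 210, 225, 240]   # hypothetical housing prices, in $1000s
stretched = prices[:-1] + [900]      # same data, largest value pushed far out

# The mean is the sum of the values divided by the number of values.
assert mean(prices) == sum(prices) / len(prices)

print("original: ", mean(prices), median(prices))
print("stretched:", mean(stretched), median(stretched))
```

Both lists have median 210, but the mean rises from 210 to 342 when the single extreme value is moved, pulling it toward the longer tail.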
• deviation from the mean (x − x̄)
the difference between a single data value x and the
mean x̄ of the data set; values greater than the mean
have positive deviations, while those below the mean
have negative deviations. Each number in the data set
has its own deviation from the mean.
• variance (s²)
an estimate of the average squared deviation from the
mean:
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
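(we divide by n − 1 instead of n because the deviations are measured from the sample mean x̄ rather than the true mean; dividing by n would make the estimate too small on average)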
• standard deviation (s)
the square root of the variance, a measure of spread
that estimates the size of a typical deviation from the
mean:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
The larger the standard deviation, the farther from
the mean most values tend to lie.
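The variance and standard deviation formulas can be worked through step by step and compared with Python's `statistics` module, whose `variance` and `stdev` functions also divide by n − 1. The data values and the helper name `sample_variance` are hypothetical.

```python
from math import isclose, sqrt
from statistics import stdev, variance

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    deviations = [x - x_bar for x in data]  # one deviation per data value
    return sum(d ** 2 for d in deviations) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
s2 = sample_variance(data)
s = sqrt(s2)

# The hand computation should agree with the library functions.
print(s2, variance(data))
print(s, stdev(data))
```

For this data the mean is 5, the deviations are −3, −1, −1, −1, 0, 0, 2, 4 (they sum to zero), and the squared deviations sum to 32, so s² = 32/7.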
• z-score, or standardized value
a measure of relative standing that determines how far
each data value is from the mean measured in units of
standard deviations:
\[ z = \frac{x - \bar{x}}{s} \]
Positive z-scores correspond to values larger than the
mean, negative z-scores to values below the mean. For example, a
value with z = −2.15 lies below the
mean by an amount equal to 2.15 times the size
of the standard deviation.
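Standardizing a data set can be sketched in a few lines. The heights are invented and the function name `z_scores` is hypothetical; `stdev` from Python's `statistics` module is the sample standard deviation with the n − 1 divisor.

```python
from statistics import mean, stdev

def z_scores(data):
    """Return (x - x_bar) / s for every value x in the data."""
    x_bar = mean(data)
    s = stdev(data)
    return [(x - x_bar) / s for x in data]

heights = [62, 65, 66, 68, 70, 71, 74]  # hypothetical heights, in inches
z = z_scores(heights)
for x, score in zip(heights, z):
    print(f"{x}: z = {score:+.2f}")
```

The mean height here is 68, so that value has z = 0, shorter heights get negative z-scores, and taller ones get positive z-scores. A full set of z-scores always has mean 0 and standard deviation 1.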