Chapter 4: Numerical Methods for Describing Data (How to CUSS)
4.1 Describing the Center of a Data Set
Mean – the arithmetic average (sum of the observations divided by the number of
observations.) It is the balance point (fulcrum) of a set of data.
x = variable for which we have sample data
n = number of observations in the sample
x1 = first observation
x2 = second observation
xn = the nth (last) observation
sample mean (denoted by x̄) = (sum of all observations in sample) / (number of observations in sample) = (x1 + x2 + … + xn) / n = (Σx) / n
population mean = (denoted by μ) the average of all x values in the entire population.
Median – divides the data in half. When the data is in numerical order it is the middle value (or
the average of the middle two.)
Sample median – median of the n observations in a sample.
Trimmed mean – found by ordering the data from smallest to largest, and deleting a selected
number of values from each end and averaging the remaining values.
number deleted from each end = (trimming percentage) · n
Trimming percentage – percentage of values deleted from each end of the ordered list.
Trimming percentage = (number deleted from each end / n) · 100
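As an illustration only (the data values below are made up), a short Python sketch of the mean, median, and a 10% trimmed mean:

# Sample mean, median, and 10% trimmed mean for a small made-up data set
data = [2, 4, 4, 5, 7, 9, 11, 15, 18, 25]
n = len(data)

mean = sum(data) / n                       # sum of all observations / n

ordered = sorted(data)
mid = n // 2
# median: middle value, or the average of the two middle values when n is even
median = ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

trim_pct = 0.10                            # trimming percentage
k = int(trim_pct * n)                      # number deleted from each end
trimmed = ordered[k:n - k]
trimmed_mean = sum(trimmed) / len(trimmed)

print(mean, median, trimmed_mean)          # 10.0 8.0 9.125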
Comparing Mean and Median
Symmetric data – mean = median
Unimodal and positively skewed – mean greater than median
Unimodal and negatively skewed – mean less than median
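For instance, in this made-up positively skewed data set (a Python sketch), a few large values pull the mean above the median:

# Made-up positively skewed data: the long right tail drags the mean upward
import statistics

skewed = [1, 2, 2, 3, 3, 4, 5, 20, 40]
print(statistics.mean(skewed))     # about 8.9 (pulled toward the large values)
print(statistics.median(skewed))   # 3 (stays near the bulk of the data)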
Categorical data – is described using relative frequencies. If there are only two categories
(a dichotomy), they can be labeled S (success) and F (failure).
p = sample proportion of successes – the fraction of S's in the sample = (number of S's) / n
π = the population proportion of successes
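A brief Python sketch of the sample proportion, using made-up S/F responses:

# Sample proportion of successes for a made-up dichotomous variable
responses = ["S", "F", "S", "S", "F", "S", "F", "S", "S", "F"]
n = len(responses)
p = responses.count("S") / n               # p = (number of S's) / n
print(p)                                   # 0.6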
4.2 Describing Variability (Spread) in a Data Set
range = largest observation – smallest observation
deviations from the sample mean – differences between each observation and the sample mean (x̄):
x1 - x̄, x2 - x̄, …, xn - x̄
Except for rounding, Σ(x - x̄) = 0
sum of squared deviations (denoted by Sxx) = Σ(x - x̄)²
sample variance (denoted by s²) – uses squared deviations to prevent positive and negative
deviations from counteracting one another
s² = Σ(x - x̄)² / (n - 1) = Sxx / (n - 1)
sample standard deviation – (denoted by s) is the positive square root of the sample variance.
s and s² are based on (n - 1) degrees of freedom
population variance – denoted by σ²
population standard deviation – denoted by σ
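A Python sketch of these deviation-based formulas (made-up data; the standard library's statistics.variance also uses the n - 1 divisor):

import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]            # made-up values
n = len(data)
x_bar = sum(data) / n                      # sample mean
deviations = [x - x_bar for x in data]     # deviations from the sample mean
print(sum(deviations))                     # 0.0: the deviations sum to zero

s_xx = sum(d ** 2 for d in deviations)     # Sxx, the sum of squared deviations
s2 = s_xx / (n - 1)                        # sample variance, n - 1 degrees of freedom
s = math.sqrt(s2)                          # sample standard deviation
print(s2, s)                               # about 4.571 and 2.138
print(statistics.variance(data))           # same n - 1 result from the standard library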
Interquartile range (iqr) – a measure of variability that is resistant to outliers (unlike the
sample variance and standard deviation); it is based on quartiles.
iqr = upper quartile – lower quartile
lower quartile – median of the lower half of the sample, separates the bottom 25% of the
sample from the upper 75%.
upper quartile – median of the upper half of the sample, separates the top 25% of the
sample from the lower 75%.
population interquartile range – difference between the upper and lower population quartiles.
**If the histogram of the data set is approximated by a normal curve, then the standard
deviation (sd) and interquartile range are related by: sd = iqr/1.35
If the sd is much larger than iqr/1.35, then the histogram is heavy-tailed.
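A Python sketch of the quartile and iqr definitions above (median of each half, with the overall median left out of both halves when n is odd; note that statistical software often uses slightly different quartile conventions), followed by the sd vs. iqr/1.35 comparison, all on made-up data:

import statistics

def median_of(values):
    # median of a list that is already sorted
    m = len(values)
    mid = m // 2
    return values[mid] if m % 2 == 1 else (values[mid - 1] + values[mid]) / 2

data = sorted([4, 8, 9, 10, 12, 15, 18, 21, 25, 31])   # made-up values
n = len(data)
half = n // 2
lower_half = data[:half]
upper_half = data[half + 1:] if n % 2 == 1 else data[half:]

lower_q = median_of(lower_half)            # lower quartile
upper_q = median_of(upper_half)            # upper quartile
iqr = upper_q - lower_q
print(lower_q, upper_q, iqr)               # 9 21 12
print(statistics.stdev(data), iqr / 1.35)  # about 8.4 vs. about 8.9, reasonably close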
Computer printout of summary statistics (type you will see on AP Test)
N     Mean   Median  TrMean  StDev  SEMean  Min    Max     Q1     Q3
80    4.58   4.04    4.37    2.99   0.33    0.46   13.63   2.15   6.56
TrMean = trimmed mean; SEMean = standard error of the mean
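If pandas is available, a roughly similar summary can be produced (a sketch with made-up values; describe() does not report TrMean, its quartiles may differ slightly from the median-of-halves rule above, and SEMean must be computed as StDev divided by the square root of n):

import math
import pandas as pd

x = pd.Series([0.46, 2.1, 3.8, 4.0, 4.6, 5.2, 6.5, 13.63])    # made-up values
summary = x.describe()                     # count, mean, std, min, 25%, 50%, 75%, max
se_mean = summary["std"] / math.sqrt(summary["count"])        # SEMean = StDev / sqrt(n)
print(summary)
print(round(se_mean, 2))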
4.3 Summarizing a Data Set: Boxplots
Boxplot – method of summarizing a data set that conveys information about symmetry or skewness
in addition to the center and spread.
Outlier – observation that is more than 1.5(iqr) away from the nearest quartile (nearest end of
the box)
 Mild outlier – observation that is more than 1.5(iqr) but no more than 3(iqr) from the nearest quartile (nearest end of the box)
 Extreme outlier – observation that is more than 3(iqr) from the nearest quartile (nearest
end of the box)
Two types of boxplots:
Skeletal boxplot - disregards the presence of any mild or extreme outliers
 Draw a horizontal (or vertical) measurement scale
 Construct a rectangular box with a left edge at the lower quartile and a right edge at
the upper quartile. The box width is then equal to the iqr.
 Draw a vertical (or horizontal) segment inside the box at the median.
 Extend horizontal (or vertical) line segments (called whiskers) from each end of the
box to the smallest and largest observations in the data set.
Five-number summary – smallest observation, lower quartile, median, upper quartile, largest
observation
Modified boxplot – makes adjustments for any mild or extreme outliers
 Draw a horizontal (or vertical) measurement scale
 Construct a rectangular box with a left edge at the lower quartile and a right edge at
the upper quartile. The box width is then equal to the iqr.
 Draw a vertical (or horizontal) segment inside the box at the median.
 Determine if there are any mild or extreme outliers in the data set.
 Draw whiskers that extend from each end of the box to the most extreme
observation that is not an outlier.
 Draw a solid circle to mark the location of any mild outliers in the data set.
 Draw an open circle to mark the location of any extreme outliers in the data set.
Comparative boxplot – more than one boxplot using the same scale to convey impressions
concerning similarities and differences between data sets.
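A Python sketch of the mild/extreme outlier rules and the whisker rule used by the modified boxplot described above (the data values are made up, and the quartiles are supplied directly for brevity):

def classify_outliers(data, lower_q, upper_q):
    # Classify observations as mild or extreme outliers based on their
    # distance from the nearest quartile (nearest end of the box)
    iqr = upper_q - lower_q
    mild, extreme = [], []
    for x in data:
        dist = lower_q - x if x < lower_q else x - upper_q
        if dist > 3 * iqr:
            extreme.append(x)
        elif dist > 1.5 * iqr:
            mild.append(x)
    return mild, extreme

data = [2, 12, 15, 18, 22, 37, 80]                    # made-up values
mild, extreme = classify_outliers(data, lower_q=10, upper_q=20)
print(mild, extreme)                                  # [37] [80]

# whiskers extend to the most extreme observations that are not outliers
non_outliers = [x for x in data if x not in mild + extreme]
print(min(non_outliers), max(non_outliers))           # 2 22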
4.4 Interpreting Center and Variability
Chebyshev’s Rule – Allows us to get a sense of the distribution of data without a graphical
display by using only the mean and standard deviation. It is applicable to any data set, whether
symmetric or skewed, but we must be careful about making statements about the proportion
above a particular value. Also, while the rule states that "at least" a certain percentage is within
a number of standard deviations, there are often substantially more, so the rule is rather
conservative.
Consider any number k, where k > 1. Then the percentage of observations that are within k
standard deviations of the mean is at least 100( 1 - 1/k2)%. Substituting selected values of k
gives the following results:
# of std. dev. (k)   1 – 1/k²            % w/in k std. dev. of the mean
2                    1 – 1/4 = .75       at least 75%
3                    1 – 1/9 = .89       at least 89%
4                    1 – 1/16 = .94      at least 94%
4.472                1 – 1/20 = .95      at least 95%
5                    1 – 1/25 = .96      at least 96%
10                   1 – 1/100 = .99     at least 99%
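A Python sketch that checks Chebyshev's bound for k = 2 on a made-up, skewed data set:

import statistics

data = [1, 1, 2, 2, 3, 3, 4, 5, 8, 30]     # made-up, skewed data
mean = statistics.mean(data)
sd = statistics.stdev(data)

k = 2
within = [x for x in data if abs(x - mean) <= k * sd]
print(len(within) / len(data))             # 0.9, the actual proportion
print(1 - 1 / k ** 2)                      # 0.75, Chebyshev's "at least" bound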
Empirical Rule – Is more precise and less conservative than Chebyshev’s Rule and can be
applied whenever the distribution of data values can be reasonably well described by a normal
curve.
Approximately 68% of the observations are within 1 standard deviation of the mean.
Approximately 95% of the observations are within 2 standard deviations of the mean.
Approximately 99.7% of the observations are within 3 standard deviations of the mean.
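A Python sketch that checks these percentages on simulated, approximately normal data (assuming numpy is available; the mean and standard deviation used for the simulation are made up):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=100, scale=15, size=100_000)   # simulated, approximately normal data

mean, sd = x.mean(), x.std()
for k in (1, 2, 3):
    prop = np.mean(np.abs(x - mean) <= k * sd)
    print(k, round(prop, 3))               # roughly 0.683, 0.954, 0.997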
z score – measure of the position of a particular value in a data set relative to all values in the set.
z score = (value - mean) / standard deviation
(This process is sometimes called standardization and gives what is often called a standardized score.)
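A short Python sketch of standardization on made-up data:

import statistics

data = [62, 68, 70, 71, 74, 75, 78, 82]    # made-up values
mean = statistics.mean(data)
sd = statistics.stdev(data)

value = 82
z = (value - mean) / sd                    # z score = (value - mean) / standard deviation
print(round(z, 2))                         # about 1.54: 82 is about 1.5 sd above the mean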
4.5 Interpreting and Communicating the Results of Statistical Analysis
Reporting the results of data analysis
 Start with descriptive information about the variables of interest
 Make a graphical display of the data
 Describe the center, shape, and spread of the data using numerical measures
If the data is fairly symmetric
o Sample mean and standard deviation
If the data is noticeably skewed
o Sample median and interquartile range
o Five-number summary
What to look for in published data – questions to ask yourself
 Is the chosen summary measure appropriate for the type of data collected? In particular,
watch for inappropriate use of the mean and standard deviation with categorical data that has
been coded numerically.
 How do the mean and median compare, or if only one is given, was the appropriate
measure used?
 Is the standard deviation large or small, and what does it tell you about the variable being
summarized?
 Can anything be said about the values by applying Chebyshev's Rule or the Empirical
Rule?
Cautions and Limitations
 Measures of center don't tell all. They must be used with information about variability and shape.
Data distributions with different shapes can have the same mean and standard deviation.
 Both mean and standard deviation are sensitive to extreme values, especially if the data set
is small.
 Measures of center and variability describe the values of the variable studied, not the
frequencies with which those values occur.
 Be careful when interpreting boxplots based on small sample sizes; symmetry or skewness
cannot reliably be judged from a small sample.
 Not all distributions are normal; be cautious about using the Empirical Rule when the
distribution is not approximately normal.
 Watch out for outliers!