University of California, Davis
Department of Statistics
Summer Session II
Statistics 13
August 6, 2012
Date of latest update: August 6
Lecture 2: Describing Data
Definition 2.1 A class is one of the categories into which qualitative data can be
classified.
Definition 2.2 The class frequency is the number of observations in the data set
that fall into a particular class.
Definition 2.3 The class relative frequency is the class frequency divided by the
total number of observations in the data set; that is,
class relative frequency = (class frequency) / n,
where n is the total number of observations.
Definition 2.4 The class percentage is the class relative frequency multiplied by 100;
that is,
class percentage = (class relative frequency) × 100.
Summary of Graphical Descriptive Methods for Qualitative Data
• Bar Graph: The categories (classes) of the qualitative variable are represented
by bars, where the height of each bar is either the class frequency, class relative
frequency, or class percentage.
• Pie Chart: The categories (classes) of the qualitative variable are represented by
slices of a pie (circle). The size of each slice is proportional to the class relative
frequency.
• Pareto Diagram: A bar graph with the categories (classes) of the qualitative
variable (i.e., the bars) arranged by height in descending order from left to right.
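The quantities in Definitions 2.2 to 2.4 can be tabulated in a few lines of Python. This is a minimal sketch: the ten observations are hypothetical values made up for illustration (only the class names echo the notes), and the variable names are not from the notes. Sorting the classes by frequency gives the left-to-right order a Pareto diagram would use.

```python
from collections import Counter

# Hypothetical qualitative data (made up for illustration): one
# depression-severity class per patient.
observations = ["Mild", "None", "Mild", "Severe", "Moderate",
                "Mild", "None", "Moderate", "Mild", "Severe"]

n = len(observations)                     # total number of observations
class_frequency = Counter(observations)   # Definition 2.2

# Pareto order: classes sorted by frequency, highest first.
pareto_order = [cls for cls, _ in class_frequency.most_common()]

summary = {}
for cls in pareto_order:
    freq = class_frequency[cls]
    rel_freq = freq / n                   # Definition 2.3
    percentage = rel_freq * 100           # Definition 2.4
    summary[cls] = (freq, rel_freq, percentage)

for cls, (freq, rel_freq, percentage) in summary.items():
    print(f"{cls:<10} frequency={freq}  relative={rel_freq:.2f}  percent={percentage:.0f}%")
```

The relative frequencies always sum to 1, which is a quick check on any table of this kind.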
[Figure: bar graphs comparing the Control and Treatment groups. One panel, titled
"QIDS Overall Scores for Pre Surgical Questionnaire," shows severity of depression
(None, Mild, Moderate, Severe, Very Severe); the other shows patient income
(Under $25,000; $25,000−$50,000; $50,001−$75,000; $75,001−$100,000; Above $100,000;
Prefer not to answer). Axis labels: Number of patients, Scores.]
Captions: The Quick Inventory of Depressive Symptomatology score for the
pre-surgical questionnaire. Income of the patients.
Summary of Graphical Descriptive Methods for Quantitative Data
• Dot Plot: The numerical value of each quantitative measurement in the data set
is represented by a dot on a horizontal scale. When data values repeat, the dots are
placed above one another vertically.
• Stem-and-Leaf Display: The numerical value of the quantitative variable is partitioned into a “stem” and a “leaf.” The possible stems are listed in order in a
column. The leaf for each quantitative measurement in the data set is placed in
the corresponding stem row. Leaves for observations with the same stem value are
listed in increasing order horizontally.
• Histogram: The possible numerical values of the quantitative variable are partitioned into class intervals, each of which has the same width. These intervals form
the scale of the horizontal axis. The frequency or relative frequency of observations
in each class interval is determined. A vertical bar is placed over each class interval, with the height of the bar equal to either the class frequency or class relative
frequency.
Dotplots
Example 1 The outbreak of food poisoning on a sportsday, Thailand 1990.
[Figure: two panels. "Age by sex": dot plots of age for F and M on a horizontal
scale from 0 to 70. "Distribution of birthdate": frequency (0 to 20) against
birthdate on a horizontal scale from 1930 to 1975.]
Stem-and-Leaf Display
Example 2 The following data show the ages of the 27 residents of Alcan, Alaska. (Source:
U.S. Bureau of the Census)
45 46 43  1 19 37 52 35  8
42  3 41 10 11 48 40 31 42
50  6 55 40 41 30  7 12 58
The stem-and-leaf display for the data:
0 | 1 3 6 7 8
1 | 0 1 2 9
2 |
3 | 0 1 5 7
4 | 0 0 1 1 2 2 3 5 6 8
5 | 0 2 5 8
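The display in Example 2 can be reproduced with a short Python sketch. It assumes the stem is the tens digit and the leaf is the units digit, which matches the ages here; the function name `stem_and_leaf` is a hypothetical helper, not something from the notes.

```python
from collections import defaultdict

# Ages of the 27 residents from Example 2.
ages = [45, 46, 43, 1, 19, 37, 52, 35, 8, 42, 3, 41, 10, 11,
        48, 40, 31, 42, 50, 6, 55, 40, 41, 30, 7, 12, 58]

def stem_and_leaf(values):
    """Return {stem: sorted leaves}, with stem = tens digit, leaf = units digit."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    # List every possible stem, including empty ones, so gaps stay visible.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}

display = stem_and_leaf(ages)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Stem 2 prints with no leaves because no resident is in their twenties; keeping the empty row preserves the shape of the distribution.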
Histograms
Example 3 Using the age data from above.
[Figure: two panels, each titled "Histogram of age," with age on the horizontal
axis (0 to 60). One panel has Frequency on the vertical axis; the other has
Relative Frequency.]
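The class frequencies behind a histogram of the age data can be computed directly. This sketch assumes class intervals of width 10 that include their lower endpoint and exclude their upper endpoint; that convention is an assumption for illustration and is not stated in the notes.

```python
# Ages of the 27 residents from Example 2.
ages = [45, 46, 43, 1, 19, 37, 52, 35, 8, 42, 3, 41, 10, 11,
        48, 40, 31, 42, 50, 6, 55, 40, 41, 30, 7, 12, 58]

width = 10        # assumed class-interval width
n = len(ages)

# Count the observations in each interval [lower, lower + width).
frequency = {}
for lower in range(0, 60, width):
    frequency[lower] = sum(lower <= age < lower + width for age in ages)

relative_frequency = {lower: count / n for lower, count in frequency.items()}

# Crude text histogram: one '#' per observation in the interval.
for lower, count in frequency.items():
    bar = "#" * count
    print(f"[{lower:2d}, {lower + width:2d})  {bar:<10}  {count:2d}  {relative_frequency[lower]:.3f}")
```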
The Meaning of Summation Notation $\sum_{i=1}^{n} x_i$
Sum the measurements of the variable that appears to the right of the summation symbol,
beginning with the first measurement and ending with the nth measurement.
Example 4 A data set contains the observations 5, 1, 3, 2, 1. Then we set $x_1 = 5$, $x_2 = 1$,
$x_3 = 3$, $x_4 = 2$, $x_5 = 1$. Then
a. $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 5 + 1 + 3 + 2 + 1 = 12$
b. $\sum_{i=1}^{5} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 = 5^2 + 1^2 + 3^2 + 2^2 + 1^2 = 40$
c. $\sum_{i=1}^{5} (x_i - 1) = (x_1 - 1) + (x_2 - 1) + (x_3 - 1) + (x_4 - 1) + (x_5 - 1) = (x_1 + x_2 + x_3 + x_4 + x_5) - (1 + 1 + 1 + 1 + 1) = \sum_{i=1}^{5} x_i - 5 = 12 - 5 = 7$
d. $\sum_{i=1}^{5} (x_i - 1)^2 = (x_1 - 1)^2 + (x_2 - 1)^2 + (x_3 - 1)^2 + (x_4 - 1)^2 + (x_5 - 1)^2 = 4^2 + 0^2 + 2^2 + 1^2 + 0^2 = 21$
e. $\left(\sum_{i=1}^{5} x_i\right)^2 = (x_1 + x_2 + x_3 + x_4 + x_5)^2 = (5 + 1 + 3 + 2 + 1)^2 = 12^2 = 144$
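The five sums in Example 4 can be checked with Python's built-in `sum`. The variable names `a` through `e` simply mirror the parts of the example. Parts b and e differ only in whether squaring happens before or after summing.

```python
x = [5, 1, 3, 2, 1]   # observations from Example 4

a = sum(x)                          # sum of x_i
b = sum(xi ** 2 for xi in x)        # sum of squares: square first, then add
c = sum(xi - 1 for xi in x)         # sum of (x_i - 1)
d = sum((xi - 1) ** 2 for xi in x)  # sum of squared (x_i - 1)
e = sum(x) ** 2                     # square of the sum: add first, then square

print(a, b, c, d, e)   # prints: 12 40 7 21 144
```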
Definition 2.5 The mean of a set of quantitative data is the sum of the measurements,
divided by the number of measurements contained in the data set.
Formula for a Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Symbols for the Sample Mean and the Population Mean
x̄ = Sample mean
µ = Population mean
Definition 2.6 The median of a quantitative data set is the middle number when the
measurements are arranged in ascending (or descending) order.
Calculating a Sample Median M
Arrange the n measurements from the smallest to the largest.
1. If n is odd, M is the middle number.
2. If n is even, M is the mean of the middle two numbers.
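The mean formula and the two-case median rule translate directly into code. This is a sketch with hypothetical function names (`sample_mean`, `sample_median`); Python's standard `statistics` module offers equivalent functions.

```python
def sample_mean(x):
    """Sum of the measurements divided by the number of measurements."""
    return sum(x) / len(x)

def sample_median(x):
    """Middle value if n is odd; mean of the two middle values if n is even."""
    ordered = sorted(x)          # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [5, 1, 3, 2, 1]     # n = 5, the data from Example 4
even_sample = [5, 1, 3, 2]       # n = 4, made up to show the even case

print(sample_mean(odd_sample))     # 12 / 5 = 2.4
print(sample_median(odd_sample))   # ordered: 1, 1, 2, 3, 5 -> 2
print(sample_median(even_sample))  # ordered: 1, 2, 3, 5 -> (2 + 3) / 2 = 2.5
```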
Definition 2.7 A data set is said to be skewed if one tail of the distribution has more
extreme observations than the other tail.
[Figure: three relative frequency distributions, labeled Rightward skewness,
Symmetry, and Leftward skewness, each marking the positions of the mean and median.]
Definition 2.8 The mode is the measurement that occurs most frequently in the data
set.
Definition 2.9 The range of a quantitative data set is equal to the largest measurement
minus the smallest measurement.
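As a brief illustration of the mode and the range, here is a sketch using the observations from Example 4. It relies on `statistics.multimode`, which returns every most frequent value as a list, so ties are reported rather than hidden.

```python
from statistics import multimode

x = [5, 1, 3, 2, 1]            # observations from Example 4

modes = multimode(x)           # all values with the highest frequency
data_range = max(x) - min(x)   # largest minus smallest measurement

print(modes)        # [1], since 1 occurs twice and every other value once
print(data_range)   # 5 - 1 = 4
```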
Definition 2.10 The sample variance for a sample of n measurements is equal to the
sum of the squared distances from the mean, divided by (n − 1). The symbol $s^2$ is used
to represent the sample variance.
Formula for a Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
A shortcut formula: $s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right]$
Definition 2.11 The sample standard deviation, s, is defined as the positive square
root of the sample variance, $s^2$, or, mathematically,
$s = \sqrt{s^2}$
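A short check that the defining formula and the shortcut formula for the sample variance agree, using the observations from Example 4. The variable names are illustrative only.

```python
import math

x = [5, 1, 3, 2, 1]    # observations from Example 4
n = len(x)
mean = sum(x) / n      # x-bar = 2.4

# Defining formula: squared distances from the mean, divided by n - 1.
s2_definition = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Shortcut formula: (sum of squares - n * mean^2) / (n - 1).
s2_shortcut = (sum(xi ** 2 for xi in x) - n * mean ** 2) / (n - 1)

# Sample standard deviation: positive square root of the variance.
s = math.sqrt(s2_definition)

print(round(s2_definition, 4), round(s2_shortcut, 4), round(s, 4))
```

Both formulas give 11.2 / 4 = 2.8 here. The shortcut needs only the sum of squares and the mean, which is convenient by hand, though in floating-point arithmetic the defining formula is usually the safer choice.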
Symbols for Variance and Standard Deviation
$s^2$ = Sample variance
s = Sample standard deviation
$\sigma^2$ = Population variance
σ = Population standard deviation
Numerical Descriptive Measures
Central Tendency: Mean, Median, Mode
Variation: Range, Variance, Standard Deviation
Two ways to interpret the standard deviation:
1. Chebyshev’s Rule
and
2. Empirical Rule.
1. Chebyshev’s rule applies to any data set, regardless of the shape of the frequency
distribution of the data.
a. It is possible that very few of the measurements will fall within one standard deviation of the mean.
b. At least 3/4 of the measurements will fall within two standard deviations of the
mean.
c. At least 8/9 of the measurements will fall within three standard deviations of the
mean.
d. Generally, for any number k greater than 1, at least $(1 - 1/k^2)$ of the measurements
will fall within k standard deviations of the mean.
[Figure: mound-shaped relative frequency distribution of population measurements.]
2. The empirical rule is a rule of thumb that applies to data sets with frequency distributions
that are mound shaped and symmetric, as follows:
a. Approximately 68% of the measurements will fall within one standard deviation of
the mean.
b. Approximately 95% of the measurements will fall within two standard deviations of
the mean.
c. Approximately 99.7% (essentially all) of the measurements will fall within three
standard deviations of the mean.
Interval               Chebyshev’s rule                       Empirical rule
x̄ ± s    (µ ± σ)       No useful bound (possibly very few)    approx 68%
x̄ ± 2s   (µ ± 2σ)      At least 3/4                           approx 95%
x̄ ± 3s   (µ ± 3σ)      At least 8/9                           approx 99.7%
x̄ ± ks   (µ ± kσ)      At least $(1 - 1/k^2)$, for k > 1
Example 5 Use Chebyshev’s Theorem to give a lower bound on the percent of data in the
interval (x̄ − 2.5s, x̄ + 2.5s). Answer: At least $1 - \frac{1}{2.5^2} = 1 - \frac{1}{6.25} = 0.84 = 84\%$ of the measurements
will fall within the interval; that is, the lower bound is 84%.
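The bound in Example 5 generalizes to any k greater than 1. This is a minimal sketch; the function name `chebyshev_lower_bound` is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the expression 1 - 1/k^2 is zero or negative,
        # so the rule gives no useful information.
        raise ValueError("Chebyshev's rule is informative only for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 2.5, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.4f}")
```

For k = 2, 2.5, and 3 this gives 0.75, 0.84, and about 0.8889, matching parts b and c of the rule and the answer to Example 5.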
Definition 2.12 For any set of n measurements (arranged in ascending or descending
order), the pth percentile is a number such that p% of the measurements fall below that
number and (100 − p)% fall above it.
Definition 2.13
The sample z-score for a measurement x is
$z = \frac{x - \bar{x}}{s}$
The population z-score for a measurement x is
$z = \frac{x - \mu}{\sigma}$
Interpretation of z-scores for Mound-Shaped Distributions of Data
1. Approximately 68% of the measurements will have a z-score between -1 and 1.
2. Approximately 95% of the measurements will have a z-score between -2 and 2.
3. Approximately 99.7% (almost all) of the measurements will have a z-score between -3
and 3.
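Sample z-scores follow directly from Definition 2.13. The sketch below uses the five observations from Example 4 purely to show the arithmetic; a sample that small is not mound shaped, so the percentages above should not be expected to hold for it.

```python
import statistics

x = [5, 1, 3, 2, 1]            # observations from Example 4

mean = statistics.mean(x)      # sample mean, 2.4
s = statistics.stdev(x)        # sample standard deviation (divides by n - 1)

# z-score: distance from the mean in units of standard deviations.
z_scores = [(xi - mean) / s for xi in x]

for xi, z in zip(x, z_scores):
    print(f"x = {xi}: z = {z:+.3f}")
```

A positive z-score means the measurement lies above the mean and a negative one below it; the z-scores of a sample always sum to zero.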
Definition 2.14 An observation (or measurement) that is unusually large or small
relative to the other values in a data set is called an outlier. Outliers typically are
attributable to one of the following causes:
1. The measurement is observed, recorded, or entered into the computer incorrectly.
2. The measurement comes from a different population.
3. The measurement is correct, but represents a rare (chance) event.
Definition 2.15 The lower quartile QL is the 25th percentile of a data set. The
middle quartile M is the median. The upper quartile QU is the 75th percentile.
Definition 2.16 The interquartile range (IQR) is the distance between the lower
and upper quartiles.
IQR= QU − QL
Elements of a Box Plot
1. A rectangle (the box) is drawn with the ends (the hinges) drawn at the lower and
upper quartiles (QL and QU). The median of the data is shown in the box, usually
by a line.
2. The points at distances 1.5(IQR) from each hinge mark the inner fences of the
data set. Lines (the whiskers) are drawn from each hinge to the most extreme
measurement inside the inner fence. Thus,
Lower inner fence = QL − 1.5(IQR)
Upper inner fence = QU + 1.5(IQR)
3. A second pair of fences, the outer fences, appears at a distance of 3(IQR) from
the hinges. One symbol (e.g., “*”) is used to represent measurements falling
between the inner and outer fences, and another (e.g., “0”) is used to represent
measurements that lie beyond the outer fences. Thus outer fences are not shown
unless one or more measurements lie beyond them. We have
Lower outer fence = QL − 3(IQR)
Upper outer fence = QU + 3(IQR)
4. Different symbols can be used to represent the median and the extreme data
points.
Remark: An observation is defined as an outlier if it falls outside the range from lower
outer fence to upper outer fence.
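The fence calculations can be sketched as follows. The data are hypothetical, and the quartiles come from `statistics.quantiles` with the "inclusive" method; textbooks and software use several slightly different quartile conventions, so that choice is an assumption and may not match the one used in class.

```python
import statistics

# Hypothetical data with one extreme value, made up for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Lower quartile, median, and upper quartile (assumed "inclusive" convention).
q_l, median, q_u = statistics.quantiles(data, n=4, method="inclusive")
iqr = q_u - q_l

inner_fences = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)
outer_fences = (q_l - 3 * iqr, q_u + 3 * iqr)

# Measurements between the inner and outer fences (the "*" symbol in the notes).
between_fences = [x for x in data
                  if outer_fences[0] <= x < inner_fences[0]
                  or inner_fences[1] < x <= outer_fences[1]]

# Measurements beyond the outer fences (outliers under the Remark above).
beyond_outer = [x for x in data if x < outer_fences[0] or x > outer_fences[1]]

print("QL, M, QU:", q_l, median, q_u)
print("IQR:", iqr)
print("inner fences:", inner_fences)
print("outer fences:", outer_fences)
print("between fences:", between_fences)
print("beyond outer fences:", beyond_outer)
```

With these data only the value 100 lies beyond the upper outer fence, so it is the single observation flagged as an outlier.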
Graphing Bivariate Relationships
One way to describe the relationship between two quantitative variables, called a bivariate relationship, is to plot the data in a scattergram (or scatterplot).
[Figure: three scattergrams illustrating a. a positive relationship, b. a negative
relationship, and c. no relationship.]