Descriptive statistics: describing a sample of data
There are several ways in which we can summarize, or describe, a set of data. Throughout the
discussion in this handout, we will assume that we are describing a random sample of data from
some larger population. Recall that numbers that describe a sample of data are called
STATISTICS. The statistics serve as estimates of their corresponding population parameters.
The statistics that we use to describe a set of data depend on the type of data with which we are
dealing. We can summarize categorical (or binary) data with the proportion, while we can
summarize measurement data (discrete or continuous) with the mean, median, range,
interquartile range, variance and standard deviation.
In summarizing a sample of data, we might be interested in describing the “center” of the data, or
we might be interested in describing how the data vary. Statistics used to describe the center of
the data are called MEASURES OF LOCATION, while statistics used to describe how the data vary
are called MEASURES OF VARIABILITY.
The following list enumerates the most commonly used statistics:
1. The MEAN (or AVERAGE) of a sample of measurements (or OBSERVATIONS) is obtained by
simply “adding up the measurements, and dividing by the number of measurements in the
dataset.” Notationally, the SAMPLE MEAN, denoted x̄, is calculated using the following
formula:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where x1, x2, ..., xn denote the measurements, n is the number of measurements (called the
SAMPLE SIZE), and Σ means “add up.”
The sample mean x̄ is a measure of location that estimates the actual population mean µ.
The sample mean can be used to summarize discrete measurement data or continuous
measurement data.
Examples:
Variable             Data                            Sample mean
Number of brothers   0, 2, 4, 1, 5                   x̄ = (0 + 2 + 4 + 1 + 5)/5 = 12/5 = 2.4
Weight of females    135, 105, 112, 135, 128, 132    x̄ = (135 + 105 + 112 + 135 + 128 + 132)/6 = 747/6 = 124.5
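The arithmetic in these examples can be checked with a minimal Python sketch of the sample mean formula. The function name `sample_mean` and the variable names are illustrative assumptions, not part of the handout.

```python
def sample_mean(measurements):
    """Add up the measurements and divide by the sample size n."""
    n = len(measurements)
    return sum(measurements) / n

# Data from the two examples in the handout
brothers = [0, 2, 4, 1, 5]
weights = [135, 105, 112, 135, 128, 132]

print(sample_mean(brothers))  # 12 / 5
print(sample_mean(weights))   # 747 / 6
```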
Handout 02
2. Roughly speaking, the SAMPLE MEDIAN is the value that divides a sample of data into two
equal halves. That is, 50% of the data lie below the median and 50% of the data lie above it.
To calculate the sample median, we must first order the data. Then, if the number of
observations n is odd, the sample median is the middle observation; and if the number of
observations n is even, the sample median is the average of the two middle observations.
The sample median is a measure of location that estimates the actual population median. The
sample median can be used to describe discrete measurement data or continuous
measurement data.
Examples:
Variable             Ordered Data                    Sample median
Number of brothers   0, 1, 2, 4, 5                   2
Weight of females    105, 112, 128, 132, 135, 135    (128 + 132)/2 = 130
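The ordering procedure described above (sort, then take the middle observation or the average of the two middle observations) might be sketched in Python as follows. The function name `sample_median` is an illustrative assumption.

```python
def sample_median(measurements):
    """Order the data, then pick the middle value (odd n)
    or average the two middle values (even n)."""
    ordered = sorted(measurements)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

brothers = [0, 2, 4, 1, 5]                  # n = 5 (odd)
weights = [135, 105, 112, 135, 128, 132]    # n = 6 (even)

print(sample_median(brothers))
print(sample_median(weights))
```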
Similar to the sample median are the first quartile and the third quartile. The FIRST QUARTILE,
denoted Q1, is the value such that 25% of the data lie below the first quartile and 75% of the
data lie above it. The THIRD QUARTILE, denoted Q3, is the value such that 75% of the data
lie below the third quartile and 25% of the data lie above it. So, the first quartile, the median,
and the third quartile effectively divide up a sample of data into quarters.
NOTE: The sample mean is affected by extreme observations, or OUTLIERS, while the sample
median is largely unaffected by them. Therefore, in the presence of outliers, the median is
usually the more appropriate measure of location.
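To illustrate this note, the sketch below compares the mean and median of the "number of brothers" data before and after one value is replaced by an extreme one. The replacement value 50 is a hypothetical outlier chosen only for illustration; it does not appear in the handout.

```python
import statistics

brothers = [0, 2, 4, 1, 5]
# Hypothetical data: the largest value, 5, is replaced by an extreme value, 50
brothers_with_outlier = [0, 2, 4, 1, 50]

mean_before = statistics.mean(brothers)
mean_after = statistics.mean(brothers_with_outlier)
median_before = statistics.median(brothers)
median_after = statistics.median(brothers_with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

In this particular case the mean moves a long way while the median does not move at all, because the extreme value does not change which observation sits in the middle.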
3. The SAMPLE PROPORTION, denoted p̂, is the “percentage” of observations in the sample
having a certain trait. It is calculated by simply counting the number of observations in the
sample having the trait divided by n, the total number of observations in the sample. The
sample proportion, which estimates the actual population proportion p, is used to describe
categorical data (including binary data).
Examples:
Variable                        Data                                           Sample proportion
Ever smoke? (1 = yes, 0 = no)   0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0    p̂smokers = 6/15 = 0.40
Class?                          F, So, So, J, F, Se, Se, J, F, Se, So, Se      p̂F = 3/12 = 0.25
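The counting rule for the sample proportion can be sketched in Python as below. The function name `sample_proportion` and its `trait` parameter are illustrative assumptions.

```python
def sample_proportion(observations, trait):
    """Count the observations having the trait and divide by n."""
    n = len(observations)
    return observations.count(trait) / n

# Data from the two examples in the handout
ever_smoke = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]
class_year = ["F", "So", "So", "J", "F", "Se", "Se", "J", "F", "Se", "So", "Se"]

p_hat_smokers = sample_proportion(ever_smoke, 1)   # 6 of 15
p_hat_f = sample_proportion(class_year, "F")       # 3 of 12

print(p_hat_smokers, p_hat_f)
```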
4. The SAMPLE RANGE is the difference between the largest and smallest numbers in the
sample.
The sample range, which is a measure of variability, can be used to describe discrete
measurement data or continuous measurement data.
NOTE: You should get into the habit of using the minimum and maximum to see if your data
set contains any outliers. If you find an outlier, you should identify whether it is a data
transcription or data entry error before continuing to analyze your data.
Examples:
Variable             Ordered Data                    Sample range
Number of brothers   0, 1, 2, 4, 5                   5 − 0 = 5
Weight of females    105, 112, 128, 132, 135, 135    135 − 105 = 30
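A short Python sketch of the sample range follows. It also returns the minimum and maximum so they can be inspected for outliers, as the note above suggests. The function name `sample_range` is an illustrative assumption.

```python
def sample_range(measurements):
    """Return (minimum, maximum, range) for a sample."""
    smallest = min(measurements)
    largest = max(measurements)
    return smallest, largest, largest - smallest

brothers = [0, 2, 4, 1, 5]
weights = [135, 105, 112, 135, 128, 132]

print(sample_range(brothers))
print(sample_range(weights))
```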
5. The SAMPLE INTERQUARTILE RANGE, denoted IQR, is the difference between the third
quartile and the first quartile, i.e. IQR = Q3 − Q1. Thus, the sample range measures the
range of all of the data, while the sample interquartile range measures the range of the middle
half of the data. The interquartile range, which is a measure of variability, can be used to
describe discrete measurement data or continuous measurement data.
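The handout does not say which rule to use for locating the quartiles, and several conventions exist that can give slightly different answers on small samples. As one possible sketch, the code below uses `statistics.quantiles` from the Python standard library with `method="inclusive"`; that choice of method is an assumption, not something the handout prescribes.

```python
import statistics

def interquartile_range(measurements):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention (an assumption)."""
    q1, _median, q3 = statistics.quantiles(measurements, n=4, method="inclusive")
    return q3 - q1

brothers = [0, 2, 4, 1, 5]
weights = [135, 105, 112, 135, 128, 132]

print(interquartile_range(brothers))
print(interquartile_range(weights))
```

With a different quartile convention (for example `method="exclusive"`), the results may differ, so it is worth stating the convention whenever an IQR is reported.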
6. Roughly speaking, the SAMPLE VARIANCE, denoted s², measures the average amount the data
points in the sample deviate from the sample mean. Therefore, the larger s², the more variable
the data. Notationally, the sample variance is calculated using the following formula:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where x1, x2, ..., xn denote the measurements, x̄ is the sample mean, and n is the sample size.
The sample variance is a measure of variability that estimates the actual population variance
σ². The sample variance can be used to summarize the variability of discrete measurement
data or continuous measurement data.
NOTE: Because the deviations are squared in calculating the sample variance, s² is defined in
terms of squared units. That is, if your data are measured in pounds, then s² is defined in
pounds-squared. The SAMPLE STANDARD DEVIATION, denoted s, is simply the positive
square root of the sample variance; its advantage is that it defines variability in terms of the
original units.
Example:
Variable:          Number of brothers
Data:              0, 2, 4, 1, 5
Sample variance:
$$s^2 = \frac{(0-2.4)^2 + (2-2.4)^2 + (4-2.4)^2 + (1-2.4)^2 + (5-2.4)^2}{5-1} = \frac{5.76 + 0.16 + 2.56 + 1.96 + 6.76}{4} = \frac{17.2}{4} = 4.3$$
Therefore, s² is 4.3 brothers-squared, and s is the square root of 4.3, or about 2.07 brothers.
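The calculation in this example can be reproduced step by step with a minimal Python sketch. The function names `sample_variance` and `sample_std_dev` are illustrative assumptions.

```python
import math

def sample_variance(measurements):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(measurements)
    x_bar = sum(measurements) / n
    squared_deviations = [(x - x_bar) ** 2 for x in measurements]
    return sum(squared_deviations) / (n - 1)

def sample_std_dev(measurements):
    """Positive square root of the sample variance (original units)."""
    return math.sqrt(sample_variance(measurements))

brothers = [0, 2, 4, 1, 5]

s_squared = sample_variance(brothers)   # in brothers-squared
s = sample_std_dev(brothers)            # in brothers

print(round(s_squared, 2), round(s, 2))
```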
NOTE: The sample variance is affected by outliers. If you change the 1 to 10 in the above
example, the sample variance changes from 4.3 to 14.2!
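The effect described in this note can be checked with the `variance` function in Python's standard `statistics` module, which uses the same n − 1 divisor as the formula for s². The variable names are illustrative assumptions. With the 1 changed to 10, the sum of squared deviations is 56.8, so the variance works out to 56.8/4.

```python
import statistics

brothers = [0, 2, 4, 1, 5]
brothers_modified = [0, 2, 4, 10, 5]   # the 1 has been changed to 10

variance_original = statistics.variance(brothers)
variance_modified = statistics.variance(brothers_modified)

print(variance_original, variance_modified)
```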
RECALL that the goal is to use a statistic to ESTIMATE its corresponding parameter, or to use
a statistic to TEST A HYPOTHESIS about the corresponding parameter.
For example, a pharmaceutical company claims that its new pain reliever eliminates
headaches in 90% of the people who use the drug. A medical doctor blindly tests the
pharmaceutical company’s claim on a random sample of 100 of her patients. Only 52, or
52%, of the patients’ headaches were eliminated. Is it likely that a random sample would
produce a sample proportion p̂ = 0.52, if the actual population proportion p were 0.90? That
is, does the random sample provide sufficient evidence to reject the pharmaceutical
company’s claim?
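One way to get a feel for the answer is to compute how likely a result this low would be if the claim were true. The sketch below assumes that the 100 patients respond independently, so that the number of headaches eliminated follows a binomial distribution with n = 100 and p = 0.90. That modeling assumption and the function name `binomial_lower_tail` are not part of the handout.

```python
from math import comb

def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Probability of 52 or fewer successes in 100 patients if p really were 0.90
tail_probability = binomial_lower_tail(52, 100, 0.90)
print(tail_probability)
```

Under these assumptions the probability is vanishingly small, which suggests that a sample proportion of 0.52 would be very hard to reconcile with a true proportion of 0.90.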