Descriptive Statistics
Tieming Ji
Fall 2012
Motivation: To investigate the characteristics of a
population (typically too large to enumerate every element
in it), a sample (relatively small compared to the
population) is often taken for study. In this chapter, we are
going to learn (1) methods to visualize a sample; and (2)
statistics that quantify sample characteristics and can be used
to infer characteristics of interest for a population.
Definition: A random sample of size n from the distribution of
X is a collection of n independent random variables, each with
the same distribution as X.
Example 1: To study the random variable X, the life span in
hours of the lithium battery in a particular model of pocket
calculator, we obtain a random sample of 50 batteries and
determine the life span of each. The resulting data are:
4285  564 1278  205 3920
2066  604  209  602 1379
2584   14  349 3770   99
1009 4152  478  726  510
 318  737 3032 3894  582
1429  852 1461 2662  308
 981 1560  701  497 3367
1402 1786 1406   35   99
1137  520  261 2778  373
 414  396   83 1379  454
Stem-and-Leaf Diagram
The decimal point is 3 digit(s) to the right of the |

  0 | 001112233334445555566667779
  1 | 001344444568
  2 | 1678
  3 | 04899
  4 | 23

  0 | 00111223333444
  0 | 5555566667779
  1 | 001344444
  1 | 568
  2 | 1
  2 | 678
  3 | 04
  3 | 899
  4 | 23
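As a rough sketch of how the first diagram can be built, the Python snippet below rounds each life span to the nearest hundred hours, then uses the thousands digit as the stem and the hundreds digit as the leaf. The rounding rule and the function name `stem_and_leaf` are assumptions chosen because they match the display above; the slides do not state how the diagram was produced.

```python
from collections import defaultdict

# Battery life spans (hours) from Example 1.
life_spans = [
    4285, 564, 1278, 205, 3920,
    2066, 604, 209, 602, 1379,
    2584, 14, 349, 3770, 99,
    1009, 4152, 478, 726, 510,
    318, 737, 3032, 3894, 582,
    1429, 852, 1461, 2662, 308,
    981, 1560, 701, 497, 3367,
    1402, 1786, 1406, 35, 99,
    1137, 520, 261, 2778, 373,
    414, 396, 83, 1379, 454,
]


def stem_and_leaf(data, leaf_unit=100):
    """Map each stem to its sorted leaves (assumed rule: round to leaf_unit)."""
    rows = defaultdict(list)
    for x in data:
        # Round half up to the nearest multiple of leaf_unit.
        units = (x + leaf_unit // 2) // leaf_unit
        stem, leaf = divmod(units, 10)
        rows[stem].append(leaf)
    return {
        stem: "".join(str(leaf) for leaf in sorted(leaves))
        for stem, leaves in sorted(rows.items())
    }


diagram = stem_and_leaf(life_spans)
for stem, leaves in diagram.items():
    print(f"{stem} | {leaves}")
```

Note that 981 is rounded up to 1000, so it appears as a leaf 0 on stem 1 rather than on stem 0.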
Histograms
[Figure: histogram of the sample battery life spans; x-axis: Life Span of Sample Batteries (0 to 4000), y-axis: Frequency (0 to 20).]
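The counts behind a histogram of this sample can be sketched in a few lines of Python. The bin width of 500 hours is an assumption made for illustration (the slides do not state the bins used), and `bin_counts` is a hypothetical helper name.

```python
from collections import Counter

# Battery life spans (hours) from Example 1.
life_spans = [
    4285, 564, 1278, 205, 3920,
    2066, 604, 209, 602, 1379,
    2584, 14, 349, 3770, 99,
    1009, 4152, 478, 726, 510,
    318, 737, 3032, 3894, 582,
    1429, 852, 1461, 2662, 308,
    981, 1560, 701, 497, 3367,
    1402, 1786, 1406, 35, 99,
    1137, 520, 261, 2778, 373,
    414, 396, 83, 1379, 454,
]

WIDTH = 500  # assumed bin width in hours


def bin_counts(data, width):
    """Count observations in each bin [k*width, (k+1)*width)."""
    counts = Counter(x // width for x in data)
    # Include empty bins so the horizontal axis has no gaps.
    return {k * width: counts.get(k, 0) for k in range(max(counts) + 1)}


counts = bin_counts(life_spans, WIDTH)
for left, freq in counts.items():
    # Text rendering: one '#' per observation in the bin.
    print(f"{left:>4} to {left + WIDTH - 1:<4} | {'#' * freq} ({freq})")
```

With these assumed bins, 17 of the 50 batteries fall below 500 hours, which makes the right-skewed shape of the sample easy to see.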
(Empirical) Cumulative Distribution Plots
[Figure: empirical cumulative distribution of the battery life spans; x-axis: Life Span (0 to 4000), y-axis: Empirical F(x) (0.0 to 1.0).]
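The quantity plotted on the vertical axis can be sketched as follows, assuming the usual definition of the empirical cumulative distribution function: the fraction of observations less than or equal to x. The names `make_ecdf` and `F_hat` are hypothetical and used only for illustration.

```python
from bisect import bisect_right

# Battery life spans (hours) from Example 1.
life_spans = [
    4285, 564, 1278, 205, 3920,
    2066, 604, 209, 602, 1379,
    2584, 14, 349, 3770, 99,
    1009, 4152, 478, 726, 510,
    318, 737, 3032, 3894, 582,
    1429, 852, 1461, 2662, 308,
    981, 1560, 701, 497, 3367,
    1402, 1786, 1406, 35, 99,
    1137, 520, 261, 2778, 373,
    414, 396, 83, 1379, 454,
]


def make_ecdf(data):
    """Return a function F_hat with F_hat(x) = (# observations <= x) / n."""
    ordered = sorted(data)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return ecdf


F_hat = make_ecdf(life_spans)
for x in (0, 500, 1000, 2000, 4000, 5000):
    print(x, F_hat(x))
```

For instance, 28 of the 50 batteries lasted at most 1000 hours, so `F_hat(1000)` is 0.56.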
Location Statistics – Mean
Definition: Let $X_1, X_2, \cdots, X_n$ be a random sample of size n
for the random variable X. The statistic $\sum_{i=1}^{n} X_i / n$ is called the
sample mean and is denoted by $\bar{X}$.
Example 2: A random sample of size 9 yields the following observations
on the random variable X, the coal consumption in millions of tons by
electric utilities for a given year: 406, 395, 400, 450, 390, 410, 415, 401,
and 408. The sample mean is
\[ \bar{x} = \frac{1}{9} \sum_{i=1}^{9} x_i = \frac{1}{9} (406 + 395 + \cdots + 408) \approx 408.3. \]
Thus, the average coal consumption over the 9 observations is about 408.3
million tons.
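The arithmetic in Example 2 can be checked with a short Python sketch. The function name `sample_mean` is a hypothetical choice; the standard library's `statistics.mean` would give the same value.

```python
# Coal consumption (millions of tons) from Example 2.
coal = [406, 395, 400, 450, 390, 410, 415, 401, 408]


def sample_mean(xs):
    """Sum of the observations divided by the sample size n."""
    return sum(xs) / len(xs)


x_bar = sample_mean(coal)
print(sum(coal), round(x_bar, 1))  # sum is 3675, mean rounds to 408.3
```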
Location Statistics – Median
Definition: The order statistics of a sample $x_1, x_2, \cdots, x_n$ are the
observations ordered from the smallest to the largest,
denoted by $x_{(1)}, x_{(2)}, \cdots, x_{(n)}$.
Definition: Let $x_{(1)}, x_{(2)}, \cdots, x_{(n)}$ be the order statistics for a
sample of size n. The sample median is the middle observation
if n is odd. It is the average of the two middle observations if
n is even. We shall denote the median of a sample by $\tilde{x}$.
Definition: The median location is $(n+1)/2$.
In example 2, the order statistics are 390, 395, 400, 401, 406, 408, 410,
415, 450. The median location is (9+1)/2=5, and the median is x̃=406.
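A minimal Python sketch of the median definition follows. The helper name `sample_median` is hypothetical, and the code uses 0-based list indices, so the order statistic x(k) is stored at index k - 1.

```python
# Coal consumption (millions of tons) from Example 2.
coal = [406, 395, 400, 450, 390, 410, 415, 401, 408]


def sample_median(xs):
    """Middle order statistic if n is odd, else the mean of the two middle ones."""
    ordered = sorted(xs)  # order statistics x(1), ..., x(n)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Median location (n + 1) / 2 corresponds to 0-based index n // 2.
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sorted(coal))
print(sample_median(coal))  # 406, the 5th order statistic
```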
Measures of Variability – Sample Variance and
Sample Standard Deviation
Definition: Let $X_1, X_2, \cdots, X_n$ be a random sample of size n
for X. The statistic
\[ S^2 = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n-1} \]
is called the sample variance. Further, the statistic $S = \sqrt{S^2}$ is
called the sample standard deviation.
Theorem: Given a sample of size n for the random variable X, the
sample variance $S^2$ can be computed by
\[ S^2 = \frac{n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2}{n(n-1)}. \]
In example 2, we have $\sum_{i=1}^{9} x_i = 406 + 395 + \cdots + 408 = 3675$ and
$\sum_{i=1}^{9} x_i^2 = 406^2 + 395^2 + \cdots + 408^2 = 1503051$. Thus,
\[ S^2 = \frac{9 \sum_{i=1}^{9} x_i^2 - \left( \sum_{i=1}^{9} x_i \right)^2}{9 \times (9-1)} = \frac{9 \times 1503051 - (3675)^2}{9 \times 8} = 303.25. \]
And the sample standard deviation is
\[ S = \sqrt{S^2} = \sqrt{303.25} \approx 17.4. \]
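To illustrate that the definition and the computational formula agree, here is a small Python sketch applied to the Example 2 data. The function names `variance_definition` and `variance_shortcut` are hypothetical labels for the two formulas.

```python
import math

# Coal consumption (millions of tons) from Example 2.
coal = [406, 395, 400, 450, 390, 410, 415, 401, 408]


def variance_definition(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def variance_shortcut(xs):
    """Computational formula: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (n * total_sq - total ** 2) / (n * (n - 1))


s2 = variance_shortcut(coal)
s = math.sqrt(s2)
print(s2, round(s, 1))  # 303.25 and 17.4
print(variance_definition(coal))  # agrees with s2 up to floating-point rounding
```

The shortcut needs only the running totals of the observations and their squares, so it can be evaluated in a single pass over the data.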
Measures of Variability – Sample Range
Definition: The sample range of a random sample of size n
is defined as $x_{(n)} - x_{(1)}$.
In example 2, the sample range is 450 − 390 = 60 million tons. This is the largest
difference between any two of the 9 observations of yearly coal consumption.
Measures of Variability – Interquartile Range
The sample range is affected by outliers, whereas the interquartile
range (iqr) is relatively robust to them. The interquartile range
is defined as the difference between the 3rd quartile (75%) and the
1st quartile (25%).
Steps for finding the sample interquartile range with a sample of size n:
(1) Find the median location $(n+1)/2$, and round it down to the nearest
whole number, which is called the truncated median location.
(2) Define $q = (\text{truncated median location} + 1)/2$.
(3) The 1st quartile, $q_1$, is $x_{(q)}$ if q is an integer; otherwise, $q_1$ is the
average of $x_{(q-0.5)}$ and $x_{(q+0.5)}$.
(4) The 3rd quartile, $q_3$, is $x_{(n-q+1)}$ if q is an integer; otherwise, $q_3$ is
the average of $x_{(n-q+0.5)}$ and $x_{(n-q+1.5)}$.
(5) The sample interquartile range is iqr $= q_3 - q_1$.
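The steps above can be sketched in Python as follows. The function name `sample_iqr` and the helper `order_stat` are hypothetical; the code follows the quartile rule stated on this slide, which can differ from the quartile conventions used by statistical software.

```python
# Coal consumption (millions of tons) from Example 2.
coal = [406, 395, 400, 450, 390, 410, 415, 401, 408]


def sample_iqr(xs):
    """Return (q1, q3, iqr) using the truncated-median-location rule."""
    ordered = sorted(xs)
    n = len(ordered)

    def order_stat(k):
        # 1-based order statistic x(k); k is a whole number stored as int or float.
        return ordered[int(k) - 1]

    truncated = (n + 1) // 2  # median location (n + 1) / 2, rounded down
    q = (truncated + 1) / 2
    if q == int(q):
        q1 = order_stat(q)
        q3 = order_stat(n - q + 1)
    else:
        q1 = (order_stat(q - 0.5) + order_stat(q + 0.5)) / 2
        q3 = (order_stat(n - q + 0.5) + order_stat(n - q + 1.5)) / 2
    return q1, q3, q3 - q1


print(sample_iqr(coal))  # (400, 410, 10)
```

For the coal data, n = 9 gives a truncated median location of 5 and q = 3, so the quartiles are the 3rd and 7th order statistics. Note that the iqr of 10 is unaffected by the largest observation, 450, while the sample range of 60 depends on it directly.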
Boxplot
In example 1, there are 50 observations for the life span of a
kind of battery.
Chapter Summary
You are not required to draw a figure from given data, though it is
not difficult. We mainly want to test whether you can read a
figure and draw useful information from it. For example, are there
outliers according to a box plot? Can you guess the
population distribution by looking at a sample distribution
(histogram, stem-and-leaf diagram)?
You should understand the basic statistics: sample mean,
sample median, sample variance, sample standard deviation,
sample range, and sample interquartile range. Can you relate these
sample statistics to population parameters (location,
variation, etc.)?