Chapter 1: Descriptive Statistics – Part I


Statistics is the science of learning from data
exhibiting random fluctuation.
Descriptive statistics:
 Collecting data
 Presenting data
 Describing data

Inferential statistics:
 Drawing conclusions and/or making decisions
concerning a population based only on sample data
 Based on probability theory
Chapter 1: DESCRIPTIVE
STATISTICS – PART I
2

What are data?
 Data can be numbers, record names, or other labels.
 Data are useless without their context…
 To provide context we need Who, What (and in what
units), When, Where, and How of the data.


In civil engineering we most often meet numerical data.
Presentation tools for numerical data (one sample):
 Histogram
 Boxplot
[Figure: histogram of the compressive strength of concrete (MPa); sample size = 150 concrete cylinders; vertical axis: frequency]

Other examples of histograms:
Example 1.1, part a) on my personal website:
mat.fsv.cvut.cz/Hala/

How do we construct a boxplot?
This is discussed later, because it requires numerical measures.

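Measures of Location
 Sample mean
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
 Sample median
$\tilde{x} = x_{(\lfloor n/2 \rfloor + 1)}$ for $n$ odd,
$\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$ for $n$ even
 Mode
single value which repeats more often than any other
 First quartile
$Q_1 = x_{(\lfloor n/4 \rfloor + 1)}$ for $n$ not divisible by 4,
$Q_1 = \frac{x_{(n/4)} + x_{(n/4+1)}}{2}$ for $n$ divisible by 4
 Third quartile
$Q_3 = x_{(\lfloor 3n/4 \rfloor + 1)}$ for $n$ not divisible by 4,
$Q_3 = \frac{x_{(3n/4)} + x_{(3n/4+1)}}{2}$ for $n$ divisible by 4
Here $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ denotes the sorted sample and $\lfloor \cdot \rfloor$ denotes rounding down.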
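The position rules for the median, mode, and quartiles can be turned into a short Python sketch. It assumes the sample is non-empty and already sorted, that the k-th position is counted from 1, and that fractional positions are rounded down; the function names are illustrative and not part of the lecture.

```python
from collections import Counter


def median(sorted_x):
    """Median by the position rule: middle value, or mean of the two middle values."""
    n = len(sorted_x)
    if n % 2 == 1:
        return sorted_x[n // 2]  # 1-based position floor(n/2) + 1
    return (sorted_x[n // 2 - 1] + sorted_x[n // 2]) / 2


def quartile(sorted_x, k):
    """Quartile Q_k for k = 1 or k = 3, using the same position rule."""
    n = len(sorted_x)
    pos = k * n // 4  # floor(k*n/4)
    if n % 4 == 0:
        # average of the values at 1-based positions k*n/4 and k*n/4 + 1
        return (sorted_x[pos - 1] + sorted_x[pos]) / 2
    return sorted_x[pos]  # 1-based position floor(k*n/4) + 1


def mode(x):
    """Single most frequent value, or None when several values tie."""
    counts = Counter(x).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# The 16 impurity weights (grams) used in the chapter's worked example.
weights = [6, 8, 11, 6, 12, 7, 5, 28, 9, 10, 9, 10, 12, 10, 11, 9]
ordered = sorted(weights)
print(median(ordered), quartile(ordered, 1), quartile(ordered, 3), mode(weights))
```

For these 16 values the sketch gives a median of 9.5, quartiles 7.5 and 11, and no mode, because 9 and 10 share the highest frequency.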

Measures of Variation
 Range
$x_{(n)} - x_{(1)}$
 Population variance
$\sigma_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
 Sample variance
$\sigma_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
 Population standard deviation
$\sigma_n = \sqrt{\sigma_n^2}$
 Sample standard deviation
$\sigma_{n-1} = \sqrt{\sigma_{n-1}^2}$
 Interquartile range
$IQR = Q_3 - Q_1$
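As a minimal sketch, the two variance definitions (divisor n versus divisor n − 1) can be written directly in Python. The function names are hypothetical; the data are the 16 impurity weights from the chapter's worked example.

```python
import math


def population_variance(x):
    """Variance with divisor n."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / n


def sample_variance(x):
    """Variance with divisor n - 1."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)


weights = [6, 8, 11, 6, 12, 7, 5, 28, 9, 10, 9, 10, 12, 10, 11, 9]

value_range = max(weights) - min(weights)   # largest minus smallest value
var_n = population_variance(weights)
var_n1 = sample_variance(weights)
print(value_range, var_n, var_n1, math.sqrt(var_n), math.sqrt(var_n1))
```

Rounded to four decimals, the variances and standard deviations agree with the values 25.4023, 27.0958, 5.0401, and 5.2054 reported for this data set.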

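Other Numerical Measures
 Coefficient of variation
$\frac{\sigma_n}{\bar{x}}$ or $\frac{\sigma_{n-1}}{\bar{x}}$
 Skewness
$\frac{\sum (x_i - \bar{x})^3}{n \, \sigma_n^3}$ or $\frac{\sum (x_i - \bar{x})^3}{(n-1) \, \sigma_{n-1}^3}$
 Kurtosis
$\frac{\sum (x_i - \bar{x})^4}{n \, \sigma_n^4} - 3$ or $\frac{\sum (x_i - \bar{x})^4}{(n-1) \, \sigma_{n-1}^4} - 3$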
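The coefficient of variation, skewness, and kurtosis can be sketched in a few lines of Python. The function name `shape_measures` and its `sample` flag are assumptions made for illustration: the flag switches between the divisor-n and the divisor-(n − 1) variant of each formula.

```python
import math


def shape_measures(x, sample=False):
    """Return (coefficient of variation, skewness, kurtosis).

    sample=False uses divisor n and sigma_n;
    sample=True uses divisor n - 1 and sigma_(n-1).
    """
    n = len(x)
    mean = sum(x) / n
    divisor = n - 1 if sample else n
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / divisor)
    cv = sd / mean
    skewness = sum((xi - mean) ** 3 for xi in x) / (divisor * sd ** 3)
    kurtosis = sum((xi - mean) ** 4 for xi in x) / (divisor * sd ** 4) - 3
    return cv, skewness, kurtosis


# A small symmetric toy sample (not from the lecture): skewness is exactly 0.
cv, skewness, kurtosis = shape_measures([1, 2, 3, 4, 5])
print(cv, skewness, kurtosis)
```

For the toy sample 1, 2, 3, 4, 5 the divisor-n variant gives skewness 0 and kurtosis −1.3, which is easy to confirm by hand.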
Example 1.2
The analysis of 16 samples of building material yields the following weights of unwanted impurities (data in grams):
6 8 11 6 12 7 5 28 9 10 9 10 12 10 11 9
Compute all important numerical measures.
Answers:
 sample mean: $\bar{x} = 10.1875$
 variance: $\sigma_n^2 = 25.4023$ or $\sigma_{n-1}^2 = 27.0958$
 standard deviation: $\sigma_n = 5.0401$ or $\sigma_{n-1} = 5.2054$
Comment: We can use the formulas introduced on the previous pages. However, it is much simpler to use a scientific calculator: after entering the data we can easily recall the statistics $\bar{x}$, $\sigma_n$, or $\sigma_{n-1}$. We can also use alternate formulas for the variance (see later).
Answers (continued):
To find the median and quartiles, we first have to sort the data:
5 6 6 7 8 9 9 9 10 10 10 11 11 12 12 28
$n = 16$ is an even number; moreover, it is divisible by 4, so:
 sample median: $\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(8)} + x_{(9)}}{2} = \frac{9+10}{2} = 9.5$
 mode: does not exist (values 9 and 10 repeat with the same maximum frequency)
 first quartile: $Q_1 = \frac{x_{(n/4)} + x_{(n/4+1)}}{2} = \frac{x_{(4)} + x_{(5)}}{2} = \frac{7+8}{2} = 7.5$
 third quartile: $Q_3 = \frac{x_{(3n/4)} + x_{(3n/4+1)}}{2} = \frac{x_{(12)} + x_{(13)}}{2} = \frac{11+11}{2} = 11$
 range: $x_{(n)} - x_{(1)} = 28 - 5 = 23$
 interquartile range: $IQR = Q_3 - Q_1 = 3.5$
A particular value 𝑥 in a random sample is an outlier, if:
𝑥 > 𝑄3 + 1.5 ∙ 𝐼𝑄𝑅
or
𝑥 < 𝑄1 − 1.5 ∙ 𝐼𝑄𝑅
How do we construct a boxplot?
 Draw a horizontal plot line and choose a suitable scale.
 Plot a "box" above the plot line; its edges represent the quartiles.
 The median is represented by a vertical segment inside the box.
 Plot horizontal segments (whiskers) outside the box. The left one joins the left edge of the box with the smallest non-outlier; the right one joins the right edge of the box with the largest non-outlier.
 Outliers (if they exist) are depicted by particular symbols (e.g. stars or circles).
Example 1.3
Refer to Example 1.2. Construct the boxplot for the data.
Answers:
Recall the sorted data array and the important measures:
5 6 6 7 8 9 9 9 10 10 10 11 11 12 12 28
$\tilde{x} = 9.5$, $Q_1 = 7.5$, $Q_3 = 11$, $IQR = 3.5$
Fences (values cutting off the outliers):
 lower fence: $Q_1 - 1.5 \cdot IQR = 7.5 - 5.25 = 2.25$
 upper fence: $Q_3 + 1.5 \cdot IQR = 11 + 5.25 = 16.25$
There is one outlier in the sample – the largest observation 28.
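The fence rule and the whisker ends can be checked with a small Python sketch. The function name `boxplot_elements` and the dictionary keys are illustrative choices; the quartiles are passed in as already computed values rather than recomputed.

```python
def boxplot_elements(sorted_x, q1, q3):
    """Fences, whisker ends, and outliers by the 1.5 * IQR rule."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in sorted_x if lower_fence <= v <= upper_fence]
    outliers = [v for v in sorted_x if v < lower_fence or v > upper_fence]
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_low": inside[0],    # smallest non-outlier
        "whisker_high": inside[-1],  # largest non-outlier
        "outliers": outliers,
    }


data = [5, 6, 6, 7, 8, 9, 9, 9, 10, 10, 10, 11, 11, 12, 12, 28]
elements = boxplot_elements(data, q1=7.5, q3=11)
print(elements)
```

For this sample the whiskers run from 5 to 12, and 28 is the only value beyond the fences 2.25 and 16.25.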
Boxplot: [figure: boxplot of the Example 1.2 data]
Example 1.4
Refer to Examples 1.2 and 1.3:
We found in Example 1.3 that the value 28 is an outlier. Assume that this value is an erroneous measurement and exclude it from the sample.
a) Compute basic summary measures for the reduced
sample of 15 observations.
b) Construct the boxplot for the reduced sample.
c) Compare the results for both samples.
Answers are available on my personal website.

"Normally" distributed data:
 The histogram has an almost symmetric shape; it can be fitted well by a Gaussian curve (see Chapter 5).
 The median and the mean are almost equal.
 The boxplot is almost perfectly symmetric; there are no outliers.
 Skewness and kurtosis are very close to zero.

Examples:
 Histogram of compressive strength of concrete on page 4.
 Boxplot constructed in Example 1.4 (15 samples of building material – reduced data set).
Comment:
The skewness computed for the data in Example 1.4 is negative and equals approx. −0.416. This shows that the data are actually slightly left-skewed (see later).
(You will not be asked to compute skewness in the exam.)
In applications we very often meet left-skewed or right-skewed data.
Left-skewed: Mean < Median < Mode (longer tail extends to the left)
Symmetric: Mean = Median = Mode
Right-skewed: Mode < Median < Mean (longer tail extends to the right)
Coefficient of skewness is
 negative for left-skewed data
 positive for right-skewed data
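The sign rule can be checked numerically on the two impurity samples of this chapter. The sketch below rests on an assumption: the skewness values quoted in the comments (approx. 2.879 for the original 16 values and −0.416 for the reduced 15 values) appear to match the bias-adjusted sample skewness used by common spreadsheet software, not either of the two formulas listed under Other Numerical Measures. The hypothetical function `adjusted_skewness` therefore implements that adjusted formula.

```python
import math


def adjusted_skewness(x):
    """Bias-adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x_i - mean) / s)^3).

    Assumption: this is the variant behind the values quoted in the text;
    s is the standard deviation with divisor n - 1.
    """
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((xi - mean) / s) ** 3 for xi in x)


original = [6, 8, 11, 6, 12, 7, 5, 28, 9, 10, 9, 10, 12, 10, 11, 9]
reduced = [v for v in original if v != 28]  # outlier 28 excluded

skew_original = adjusted_skewness(original)  # positive: right-skewed
skew_reduced = adjusted_skewness(reduced)    # negative: slightly left-skewed
print(round(skew_original, 3), round(skew_reduced, 3))
```

Removing the single outlier flips the sign of the coefficient, which shows how sensitive skewness is to extreme observations.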

Examples of right-skewed distributions:
 Example 1.2 (16 samples of building material – original data set)
Comment: Skewness for this sample equals approx. 2.879.
 Earthquake magnitudes: [figure]

An example of a boxplot for right-skewed data: [figure]
Examples of left-skewed distributions:
 All three variables in Example 1.1 (Excel file Example 1.1_data
and answers).
 Grade distribution in a class of 80 students: [figure]
Additional questions:
 What is the range of the marks of the 20 best students?
 Which value cuts off the marks of the worst 25 % of students?
 Are there any outliers? Discuss.
 Can we say anything about the average mark in this exam?

Population variance:
$\sigma_n^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 = \frac{1}{n} \sum x_i^2 - \bar{x}^2$
Sample variance:
$\sigma_{n-1}^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 = \frac{n}{n-1} \sigma_n^2 = \frac{1}{n-1} \sum x_i^2 - \frac{n}{n-1} \bar{x}^2$
When and why should we use these alternate formulas?
 If a scientific calculator (or even a computer) is not available…
 If the data are integer values…
 When we have to recalculate the mean and variance after a subtle change in the data set…
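The last point can be illustrated with a short Python sketch: once the sums of the values and of their squares are known, removing one observation only requires adjusting those two sums. The function name `variance_from_sums` is hypothetical; the data are the impurity weights from Example 1.2, with the value 28 removed in the second call.

```python
def variance_from_sums(sum_x, sum_x2, n):
    """Mean, population variance, and sample variance from running sums."""
    mean = sum_x / n
    pop_var = sum_x2 / n - mean ** 2
    sample_var = sum_x2 / (n - 1) - n / (n - 1) * mean ** 2
    return mean, pop_var, sample_var


data = [6, 8, 11, 6, 12, 7, 5, 28, 9, 10, 9, 10, 12, 10, 11, 9]
n = len(data)
sum_x = sum(data)                    # 163
sum_x2 = sum(v * v for v in data)    # 2067

full = variance_from_sums(sum_x, sum_x2, n)

# Drop the observation 28 without touching the other 15 values.
reduced = variance_from_sums(sum_x - 28, sum_x2 - 28 ** 2, n - 1)
print(full, reduced)
```

With integer data the two sums are exact, so the only rounding happens in the final division and subtraction.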
Example 1.5
Using a microscope, a researcher observed the number of gold particles in a thin coating of gold solution. He completed 517 observations at regular time intervals. The results are listed in the table:
Number of particles       0    1    2    3    4    5    6    7
Frequency               112  168  130   68   32    5    1    1
Compute the mode, median, and quartiles. Compute the mean and
standard deviation, too.
Comment on the data distribution.
Answers:

Mode obviously equals 1.
To find the median and quartiles easily, let us calculate the cumulative frequencies:
Number of particles       0    1    2    3    4    5    6    7
Frequency               112  168  130   68   32    5    1    1
Cumulative frequency    112  280  410  478  510  515  516  517
The sample size $n = 517$ is odd, so the median equals the value in the 259th position in the sorted data array; therefore $\tilde{x} = 1$.
Since $n$ is not divisible by 4, the first quartile equals the value in the 130th position and the third quartile equals the value in the 388th position; therefore $Q_1 = 1$, $Q_3 = 2$.
Additional task: Try to create a boxplot for this data set.
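Looking up a position in the cumulative frequencies can be sketched in Python as follows. The helper name `value_at_position` is an illustrative choice; it returns the value of the observation at a given 1-based position in the sorted data without expanding the table into 517 individual values.

```python
from bisect import bisect_left
from itertools import accumulate

values = [0, 1, 2, 3, 4, 5, 6, 7]
freqs = [112, 168, 130, 68, 32, 5, 1, 1]
cumulative = list(accumulate(freqs))  # [112, 280, 410, 478, 510, 515, 516, 517]
n = cumulative[-1]


def value_at_position(pos):
    """Value of the pos-th smallest observation (1-based)."""
    # First class whose cumulative frequency reaches pos.
    return values[bisect_left(cumulative, pos)]


median = value_at_position(n // 2 + 1)   # position 259 (n is odd)
q1 = value_at_position(n // 4 + 1)       # position 130 (n not divisible by 4)
q3 = value_at_position(3 * n // 4 + 1)   # position 388
print(median, q1, q3)
```

The lookup reproduces the median 1 and the quartiles 1 and 2 found from the table.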
Answers (continued):
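If the data are summarized in a frequency table, the mean and variance formulas can be modified:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} x_i^* \cdot n_i$,
$\sigma_n^2 = \frac{1}{n} \sum_{i=1}^{k} (x_i^* - \bar{x})^2 \cdot n_i$,
$\sigma_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{k} (x_i^* - \bar{x})^2 \cdot n_i$,
where $x_1^*, x_2^*, \dots, x_k^*$ are the distinct data values, $n_i$ are the corresponding frequencies, and $n = \sum n_i$ is the total of all frequencies, i.e. the sample size.
An alternate variance formula can be derived, too:
$\sigma_n^2 = \frac{1}{n} \sum_{i=1}^{k} (x_i^*)^2 \cdot n_i - \bar{x}^2$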
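As a rough sketch, the frequency-table formulas translate to Python almost term by term. The function name `frequency_table_stats` is hypothetical; it uses the alternate variance formula and is applied here to the gold-particle counts of Example 1.5.

```python
import math


def frequency_table_stats(values, freqs):
    """Return (n, mean, population variance, sample variance) from a frequency table."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    # Alternate formula: mean of squares minus square of the mean.
    pop_var = sum(v * v * f for v, f in zip(values, freqs)) / n - mean ** 2
    sample_var = pop_var * n / (n - 1)
    return n, mean, pop_var, sample_var


values = [0, 1, 2, 3, 4, 5, 6, 7]
freqs = [112, 168, 130, 68, 32, 5, 1, 1]
n, mean, pop_var, sample_var = frequency_table_stats(values, freqs)
print(n, mean, pop_var, sample_var, math.sqrt(pop_var), math.sqrt(sample_var))
```

For this table the sketch gives n = 517, a mean of about 1.54352, and variances of about 1.52857 and 1.53153.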
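Answers (continued):
We obtain the following results:
$\bar{x} = \frac{1}{517} \cdot (0 \cdot 112 + 1 \cdot 168 + \dots + 7 \cdot 1) = \frac{798}{517} = 1.54352$,
$\sigma_n^2 = \frac{1}{517} \cdot (0^2 \cdot 112 + 1^2 \cdot 168 + \dots + 7^2 \cdot 1) - \left(\frac{798}{517}\right)^2 = 1.52857$,
$\sigma_{n-1}^2 = \frac{517}{516} \cdot \sigma_n^2 = 1.53153$.
$\sigma_n = 1.23635$, $\sigma_{n-1} = 1.23755$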
Data distribution is obviously right-skewed.

Comment: If you have a scientific calculator which is able to process data summarized in a frequency table, insert the data and frequencies and recall $\bar{x}$, $\sigma_n$, or $\sigma_{n-1}$.
If a large data set is grouped by classes in a frequency table and no
computer is available we can approximate the values of the mean
and variance using the table.
Hint:
 Round all data values within each class interval to the midpoint.
 Use the frequency-table formulas for the mean and variance (page 24), where $x_i^*$ designates the midpoint of the $i$-th class interval.
Example 1.6: Approximate the mean speed and the variance and
standard deviation of the speed (Excel file Example 1.1_data and
answers) using the frequency table in the sheet Histograms.
Answers are available in Excel file Example 1.1_data and answers in
the sheet Approximate calculation.
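The midpoint approximation can be sketched in Python as follows. The class intervals and counts below are hypothetical and are NOT the speed data from the Excel file; they only illustrate the mechanics, and the function name `grouped_stats` is an assumption.

```python
import math


def grouped_stats(class_bounds, freqs):
    """Approximate mean, population variance, and standard deviation from class midpoints."""
    midpoints = [(low + high) / 2 for low, high in class_bounds]
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    pop_var = sum(m * m * f for m, f in zip(midpoints, freqs)) / n - mean ** 2
    return mean, pop_var, math.sqrt(pop_var)


# Hypothetical speed classes (km/h) and frequencies, for illustration only.
bounds = [(40, 50), (50, 60), (60, 70)]
counts = [10, 20, 10]
approx_mean, approx_var, approx_sd = grouped_stats(bounds, counts)
print(approx_mean, approx_var, approx_sd)
```

With these made-up classes the midpoints are 45, 55, and 65, so the approximate mean is 55 and the approximate variance is 50. The results are approximations because every value in a class is treated as if it sat at the midpoint.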