Lecture 1: Descriptive statistics:
Population: group of individuals
(= respondents in a poll)
Variable: property of an individual
(= answer to a question in a poll)
• Qualitative variable: the possible values are
categories or levels.
We distinguish 2 kinds of scales:
Nominal: the values of the variable have no natural order,
e.g. sex, political preference.
Ordinal: the values of the variable are ordered,
e.g. 4 levels of education (none, lower, middle and
higher).
• Quantitative variable: the possible values are
numerical/measurable.
These variables are interval variables or ratio
variables (when there is a true zero).
Information about a variable: often information about
the population as a whole is not available, but information
about a sample is: we have n observed values
x1, x2, …, xn of the variable at our disposal as an
indication of the population characteristics.
Describing the observations:
Exploratory Data Analysis (EDA)
EDA techniques:
• Summarizing the data:
measures of location (centre), dispersion
(spread) etc., especially for quantitative variables
• Presentation with graphs and diagrams
Measures of location (centre):
1. The sample mean:
$$\bar{x} = \frac{x_1 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
2. The median M: the middle observation
(measurement) when the observations are ordered by size.
First arrange them from small to large (order
statistics): x1, x2, …, xn → x(1), x(2), …, x(n).
If there is an even number of
observations, compute the mean of the middle two.
3. The mode: the most frequently observed value.
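The three measures of location can be sketched with Python's standard `statistics` module. The sample below is hypothetical and serves only as an illustration; it does not come from the lecture.

```python
from statistics import mean, median, multimode

# Hypothetical sample (an assumption for illustration only)
observations = [4, 8, 6, 5, 3, 8, 9, 5, 8, 4]

x_bar = mean(observations)        # sample mean: sum of the x_i divided by n
ordered = sorted(observations)    # order statistics x(1), ..., x(n)
m = median(observations)          # n is even here: mean of the two middle values
modes = multimode(observations)   # list of the most frequently observed value(s)

print(x_bar, ordered, m, modes)
```

`multimode` returns a list because several values can share the highest frequency.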
Percentiles and quartiles:
The median is also called the 50th percentile: (about)
50% of the observations are smaller and 50% are greater
than the median M.
In a diagram of the observations (•= observation):
    25%       25%       25%              25%
•..........• •....• •..........• •..................•
           Q1      M           Q3
The quartiles Q1, Q2 and Q3 are the 25th, 50th and 75th
percentiles: they split the observations into roughly 4
equal parts.
Rules (software sometimes uses different rules):
• Q1 is the median of all observations smaller than
the overall median M.
• Q2 is the median M!
• Q3 is the median of all observations greater than M.
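A minimal sketch of these rules in Python, assuming that "smaller than M" and "greater than M" are read by position in the ordered sample (the lower and upper half, leaving out the middle observation when n is odd). The function name `quartiles` and the sample are hypothetical; as noted above, software may use other rules and give slightly different values.

```python
from statistics import median

def quartiles(observations):
    """Return (Q1, M, Q3) using the halves below and above the median position."""
    xs = sorted(observations)
    n = len(xs)
    lower = xs[: n // 2]          # observations before the median position
    upper = xs[(n + 1) // 2:]     # observations after the median position
    return median(lower), median(xs), median(upper)

# Hypothetical sample with n = 9 (the middle observation 9 is left out of both halves)
q1, m, q3 = quartiles([7, 1, 3, 9, 5, 11, 13, 15, 17])
print(q1, m, q3)
```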
Percentiles (what follows is a description, not a
formal definition):
Example: the 90th percentile is a limiting value such
that (about) 10% of the observations are greater and
90% are smaller.
Measures of dispersion (spread)
1. Interquartile distance: IQD = Q3 – Q1
2. Sample variance s²:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$
s² is said to have n-1 degrees of freedom.
3. Sample standard deviation s: $s = \sqrt{s^2}$
Properties of s and s²:
• s ≥ 0 and s² ≥ 0
• If s = 0, then all observations are equal.
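As a rough check of the two expressions for the sample variance, the sketch below computes both and compares them with `statistics.variance`, which also divides by n-1. The function names and the sample are hypothetical.

```python
from math import isclose, sqrt
from statistics import variance

def sample_variance(xs):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Computational form: (sum of x_i^2 - (sum of x_i)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

# Hypothetical sample (an assumption for illustration only)
observations = [2, 4, 4, 5, 7, 8]
s2 = sample_variance(observations)
s = sqrt(s2)  # sample standard deviation

print(s2, sample_variance_shortcut(observations), variance(observations), s)
```

A sample in which all observations are equal, e.g. `[3, 3, 3]`, gives s² = 0 and therefore s = 0.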
Outliers: unusually large or small observations
The 1.5×IQD rule: observations greater than Q3 +
1.5×IQD or smaller than Q1 - 1.5×IQD are outliers.
Resistant measures: not sensitive to outliers.
• Resistant: median and IQD
• Non-resistant: sample mean, variance and standard
deviation.
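The 1.5×IQD rule and the idea of resistance can be illustrated with a short sketch. It assumes the quartile rule of these notes, read by position in the ordered sample; the helper names and the sample are hypothetical.

```python
from statistics import mean, median

def quartiles(observations):
    """(Q1, M, Q3) from the lower and upper half of the ordered sample."""
    xs = sorted(observations)
    n = len(xs)
    return median(xs[: n // 2]), median(xs), median(xs[(n + 1) // 2:])

def outliers(observations):
    """Observations outside [Q1 - 1.5*IQD, Q3 + 1.5*IQD]."""
    q1, _, q3 = quartiles(observations)
    iqd = q3 - q1
    low, high = q1 - 1.5 * iqd, q3 + 1.5 * iqd
    return [x for x in observations if x < low or x > high]

# Hypothetical sample with one unusually large observation
sample = [10, 12, 13, 14, 15, 16, 18, 50]
flagged = outliers(sample)
cleaned = [x for x in sample if x not in flagged]

# The mean shifts much more than the median when the outlier is removed
print(flagged, mean(sample), mean(cleaned), median(sample), median(cleaned))
```

In this sample the mean drops from 18.5 to 14 once the flagged value is removed, while the median only moves from 14.5 to 14.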
If the sample consists of n observations, then:
• Frequency: the number of observations with the
same specific numerical or categorical value.
• Relative frequency: the quotient of frequency and
n (the fraction or proportion frequency/n).
Often written as a percentage, e.g. 13/50 = 26%.
The distribution of a variable:
all possible values and their relative frequency.
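A small sketch of frequencies and relative frequencies using `collections.Counter`. The poll answers are hypothetical.

```python
from collections import Counter

# Hypothetical answers to one poll question (an assumption for illustration only)
answers = ["yes", "no", "yes", "yes", "undecided", "no", "yes", "no", "yes", "yes"]

n = len(answers)
frequency = Counter(answers)  # value -> number of observations with that value
relative_frequency = {value: count / n for value, count in frequency.items()}

# The observed distribution: every value with its relative frequency as a percentage
for value, rel in relative_frequency.items():
    print(f"{value}: {frequency[value]}/{n} = {rel:.0%}")
```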
The 5-number summary of the observations:
the smallest, Q1, M, Q3 and the largest.
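The summary could be computed as follows, again assuming the quartile rule of these notes applied by position in the ordered sample. The function name and the sample are hypothetical.

```python
from statistics import median

def five_number_summary(observations):
    """Return (smallest, Q1, M, Q3, largest)."""
    xs = sorted(observations)
    n = len(xs)
    q1 = median(xs[: n // 2])        # median of the lower half
    q3 = median(xs[(n + 1) // 2:])   # median of the upper half
    return xs[0], q1, median(xs), q3, xs[-1]

# Hypothetical sample (an assumption for illustration only)
summary = five_number_summary([10, 12, 13, 14, 15, 16, 18, 50])
print(summary)
```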
Graphs and diagrams:
1. Bar graph:
- On the x-axis: the categories/values.
- The height of each bar represents the frequency or
relative frequency (given on the y-axis).
2. Sector, circle or pie diagram
- Especially for qualitative variables.
- Every sector shows the proportion of a value.
3. Stem-and-leaf diagram
Example:
stem leaf
1
1 5556668
2 01334
2 55678999
3 00123
3 579
4 022
4
This diagram represents 31 observations: 15 is the smallest,
42 the largest. 42 occurred twice as an observation.
We place the smallest leaf closest to the stem.
• Split the stem whenever there are too many
observations per value of the stem. In the example
we have split the tens (leaves 0-4 and 5-9).
• Back-to-back stem-and-leaf diagram: for comparing
two samples we use the same stems for both
samples.
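The split stem-and-leaf diagram of the example can be reproduced with a short sketch. The 31 observations are read back from the diagram above; the function name is hypothetical, and the code assumes non-negative whole numbers with tens as stems and units as leaves.

```python
def split_stem_and_leaf(observations):
    """Rows of a stem-and-leaf diagram with every stem split into leaves 0-4 and 5-9."""
    xs = sorted(observations)  # sorting puts the smallest leaf closest to the stem
    rows = []
    for stem in range(xs[0] // 10, xs[-1] // 10 + 1):
        for lo, hi in ((0, 4), (5, 9)):
            leaves = "".join(
                str(x % 10) for x in xs if x // 10 == stem and lo <= x % 10 <= hi
            )
            rows.append(f"{stem} {leaves}".rstrip())  # an empty row shows only the stem
    return rows

# The observations of the example, read back from the diagram
data = [15, 15, 15, 16, 16, 16, 18, 20, 21, 23, 23, 24, 25, 25, 26, 27,
        28, 29, 29, 29, 30, 30, 31, 32, 33, 35, 37, 39, 40, 42, 42]

rows = split_stem_and_leaf(data)
print("\n".join(rows))
```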
4. Box plot: a diagram of the 5-number summary.
If there are outliers, first graph them separately and
then the 5-number summary of the original
observations excluding the outliers.
5. Histogram
• Histogram of (relative) frequencies (grouping data):
First make a frequency table: choose intervals
(classes) of equal length and determine their
frequencies. (Rule of thumb: roughly 10ln(n)
intervals.)
The histogram consists of rectangles placed upon
the intervals on the x-axis, with the height equal
to the (relative) frequency on the y-axis.
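A frequency table for classes of equal length might be built as in the sketch below. The function name, the half-open classes [lower, upper) and the sample are assumptions; the number of classes is passed in by hand rather than derived from the rule of thumb.

```python
def frequency_table(observations, start, width, k):
    """Rows (lower, upper, frequency, relative frequency) for k classes of equal length.

    Class i covers [start + i*width, start + (i+1)*width); observations outside
    all classes are ignored in the counts.
    """
    counts = [0] * k
    for x in observations:
        i = int((x - start) // width)
        if 0 <= i < k:
            counts[i] += 1
    n = len(observations)
    return [
        (start + i * width, start + (i + 1) * width, c, c / n)
        for i, c in enumerate(counts)
    ]

# Hypothetical sample (an assumption for illustration only)
data = [12, 15, 21, 22, 25, 28, 31, 33, 34, 47]
table = frequency_table(data, start=10, width=10, k=4)
for lower, upper, freq, rel in table:
    print(f"[{lower}, {upper}): frequency {freq}, relative frequency {rel:.0%}")
```

The heights of the histogram rectangles are then the third (or fourth) entry of each row.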
• Histogram of intervals having different lengths:
the area of the rectangle equals the relative
frequency.
Write down in the frequency table for every
interval: the relative frequency, the length and the
height of the rectangle. Use the formula:
$$\text{height} = \frac{\text{relative frequency}}{\text{length}}$$
The height is called the frequency density.
(the higher, the more observations per unit)
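The frequency table for intervals of different length can be sketched as follows. The classes and frequencies are hypothetical, as is the function name; the point is that height = relative frequency / length makes the total area of the rectangles equal to 1.

```python
from math import isclose

def frequency_density(classes, n):
    """Rows (lower, upper, relative frequency, length, height) per class.

    classes: list of (lower, upper, frequency); n: total number of observations.
    """
    rows = []
    for lower, upper, freq in classes:
        rel = freq / n
        length = upper - lower
        rows.append((lower, upper, rel, length, rel / length))
    return rows

# Hypothetical classes of unequal length with their frequencies (n = 50)
classes = [(0, 10, 5), (10, 20, 10), (20, 50, 15), (50, 100, 20)]
rows = frequency_density(classes, n=50)

# Area of each rectangle = height * length = relative frequency
total_area = sum(height * length for _, _, _, length, height in rows)
print(rows, total_area)
```

Note that the widest class has the largest frequency (20) but the lowest rectangle, because its observations are spread over 50 units.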
6. Graph for time series
The x-axis is the time axis and the y-axis consists of
the values of the variable.
Pay attention to the (overall) trend and to seasonal
effects.
7. Scatter diagram: a graph of the relation between two
quantitative variables (on the x-axis and y-axis respectively). Example:
[Figure: scatter diagram "Reduction in % of NOx emissions depending on the amount of catalyst in petrol"; x-axis: amount of added substance (0 to 8); y-axis: 2.0 to 5.0]
When commenting on the graphs or diagrams, pay
attention to their shape (especially in the case of
histograms):
• The overall shape of the distribution: is it symmetric
or does it have a “tail” to the right or left, and how many peaks are there?
• The location of the centre and the spread.
• Gaps and (possible) outliers.