Download Descriptive Statistics for Continuous Data

Document related concepts
no text concepts found
Transcript
Unit 3: Descriptive Statistics for Continuous Data
Statistics for Linguists with R – A SIGIL Course
Designed by Marco Baroni1 and Stefan Evert2
1 Center
for Mind/Brain Sciences (CIMeC)
University of Trento, Italy
2 Corpus Linguistics Group
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
http://SIGIL.r-forge.r-project.org/
Copyright © 2007–2015 Baroni & Evert
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
1 / 40
Outline
Outline
Introduction
Categorical vs. numerical variables
Scales of measurement
Descriptive statistics
Characteristic measures
Histogram & density
Random variables & expectations
Continuous distributions
The shape of a distribution
The normal distribution (Gaussian)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
2 / 40
Introduction
Categorical vs. numerical variables
Outline
Introduction
Categorical vs. numerical variables
Scales of measurement
Descriptive statistics
Characteristic measures
Histogram & density
Random variables & expectations
Continuous distributions
The shape of a distribution
The normal distribution (Gaussian)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
3 / 40
Introduction
Categorical vs. numerical variables
Reminder: the library metaphor
I
I
In the library metaphor, we took random samples from an
infinite population of tokens (words, VPs, sentences, . . . )
Relevant property is a binary (or categorical) classification
I
I
I
I
active vs. passive VP or sentence (binary)
instance of lemma TIME vs. some other word (binary)
subcategorisation frame of verb token (itr, tr, ditr, p-obj, . . . )
part-of-speech tag of word token (50+ categories)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
4 / 40
Introduction
Categorical vs. numerical variables
Reminder: the library metaphor
I
I
In the library metaphor, we took random samples from an
infinite population of tokens (words, VPs, sentences, . . . )
Relevant property is a binary (or categorical) classification
I
I
I
I
I
active vs. passive VP or sentence (binary)
instance of lemma TIME vs. some other word (binary)
subcategorisation frame of verb token (itr, tr, ditr, p-obj, . . . )
part-of-speech tag of word token (50+ categories)
Characterisation of population distribution is straightforward
I
I
I
binomial: true proportion π = 10% of passive VPs,
or relative frequency of TIME, e.g. π = 2000 pmw
alternatively: specify redundant proportions (π, 1 − π),
e.g. passive/active VPs (.1, .9) or TIME/other (.002, .998)
multinomial: multiple proportions π1 + π2 + · · · + πK = 1,
e.g. (πnoun = .28, πverb = .17, πadj = .08, . . .)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
4 / 40
Introduction
Categorical vs. numerical variables
Numerical properties
In many other cases, the properties of interest are numerical:
Population census
height
weight
shoes
sex
178.18
160.10
150.09
182.24
169.88
185.22
166.89
162.58
69.52
51.46
43.05
63.21
63.04
90.59
47.43
54.13
39.5
37.0
35.5
46.0
43.5
46.5
43.0
37.0
f
f
f
m
m
m
m
f
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
5 / 40
Introduction
Categorical vs. numerical variables
Numerical properties
In many other cases, the properties of interest are numerical:
Population census
Wikipedia articles
height
weight
shoes
sex
tokens
types
TTR
avg len.
178.18
160.10
150.09
182.24
169.88
185.22
166.89
162.58
69.52
51.46
43.05
63.21
63.04
90.59
47.43
54.13
39.5
37.0
35.5
46.0
43.5
46.5
43.0
37.0
f
f
f
m
m
m
m
f
696
228
390
455
399
297
755
299
251
126
174
176
214
148
275
171
2.773
1.810
2.241
2.585
1.864
2.007
2.745
1.749
4.532
4.488
4.251
4.412
4.301
4.399
3.861
4.524
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
5 / 40
Introduction
Categorical vs. numerical variables
Descriptive vs. inferential statistics
Two main tasks of “classical” statistical methods (numerical data):
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
6 / 40
Introduction
Categorical vs. numerical variables
Descriptive vs. inferential statistics
Two main tasks of “classical” statistical methods (numerical data):
1. Descriptive statistics
I
I
I
compact description of the distribution of a (numerical)
property in a very large or infinite population
often by characteristic parameters such as mean, variance, . . .
this was the original purpose of statistics in the 19th century
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
6 / 40
Introduction
Categorical vs. numerical variables
Descriptive vs. inferential statistics
Two main tasks of “classical” statistical methods (numerical data):
1. Descriptive statistics
I
I
I
compact description of the distribution of a (numerical)
property in a very large or infinite population
often by characteristic parameters such as mean, variance, . . .
this was the original purpose of statistics in the 19th century
2. Inferential statistics
I
I
I
infer (aspects of) population distribution from a comparatively
small random sample
accurate estimates for level of uncertainty involved
often by testing (and rejecting) some null hypothesis H0
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
6 / 40
Introduction
Scales of measurement
Outline
Introduction
Categorical vs. numerical variables
Scales of measurement
Descriptive statistics
Characteristic measures
Histogram & density
Random variables & expectations
Continuous distributions
The shape of a distribution
The normal distribution (Gaussian)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
7 / 40
Introduction
Scales of measurement
Statisticians distinguish 4 scales of measurement
Categorical data
Numerical data
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
8 / 40
Introduction
Scales of measurement
Statisticians distinguish 4 scales of measurement
Categorical data
1. Nominal scale: purely qualitative classification
I
male vs. female, passive vs. active, POS tags, subcat frames
Numerical data
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
8 / 40
Introduction
Scales of measurement
Statisticians distinguish 4 scales of measurement
Categorical data
1. Nominal scale: purely qualitative classification
I
male vs. female, passive vs. active, POS tags, subcat frames
2. Ordinal scale: ordered categories
I
school grades A–E, social class, low/medium/high rating
Numerical data
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
8 / 40
Introduction
Scales of measurement
Statisticians distinguish 4 scales of measurement
Categorical data
1. Nominal scale: purely qualitative classification
I
male vs. female, passive vs. active, POS tags, subcat frames
2. Ordinal scale: ordered categories
I
school grades A–E, social class, low/medium/high rating
Numerical data
3. Interval scale: meaningful comparison of differences
I
temperature (°C), plausibility & grammaticality ratings
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
8 / 40
Introduction
Scales of measurement
Statisticians distinguish 4 scales of measurement
Categorical data
1. Nominal scale: purely qualitative classification
I
male vs. female, passive vs. active, POS tags, subcat frames
2. Ordinal scale: ordered categories
I
school grades A–E, social class, low/medium/high rating
Numerical data
3. Interval scale: meaningful comparison of differences
I
temperature (°C), plausibility & grammaticality ratings
4. Ratio scale: comparison of magnitudes, absolute zero
I
time, length/width/height, weight, frequency counts
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
8 / 40
Introduction
Scales of measurement
Statisticians distinguish 4 scales of measurement
Categorical data
1. Nominal scale: purely qualitative classification
I
male vs. female, passive vs. active, POS tags, subcat frames
2. Ordinal scale: ordered categories
I
school grades A–E, social class, low/medium/high rating
Numerical data
3. Interval scale: meaningful comparison of differences
I
temperature (°C), plausibility & grammaticality ratings
4. Ratio scale: comparison of magnitudes, absolute zero
I
time, length/width/height, weight, frequency counts
Additional dimension: discrete vs. continuous numerical data
I
I
discrete: frequency counts, rating (1, . . . , 7), shoe size, . . .
continuous: length, time, weight, temperature, . . .
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
8 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
I
reaction time (in psycholinguistic experiment)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
I
reaction time (in psycholinguistic experiment)
I
familiarity rating on scale 1, . . . , 7
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
I
reaction time (in psycholinguistic experiment)
I
familiarity rating on scale 1, . . . , 7
I
room number
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
I
reaction time (in psycholinguistic experiment)
I
familiarity rating on scale 1, . . . , 7
I
room number
I
grammaticality rating: “*”, “??”, “?” or “ok”
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
I
reaction time (in psycholinguistic experiment)
I
familiarity rating on scale 1, . . . , 7
I
room number
I
grammaticality rating: “*”, “??”, “?” or “ok”
I
magnitude estimation of plausibility (graphical scale)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
I
reaction time (in psycholinguistic experiment)
I
familiarity rating on scale 1, . . . , 7
I
room number
I
grammaticality rating: “*”, “??”, “?” or “ok”
I
magnitude estimation of plausibility (graphical scale)
I
frequency of passive VPs in text
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
I
reaction time (in psycholinguistic experiment)
I
familiarity rating on scale 1, . . . , 7
I
room number
I
grammaticality rating: “*”, “??”, “?” or “ok”
I
magnitude estimation of plausibility (graphical scale)
I
frequency of passive VPs in text
I
relative frequency of passive VPs
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
I
reaction time (in psycholinguistic experiment)
I
familiarity rating on scale 1, . . . , 7
I
room number
I
grammaticality rating: “*”, “??”, “?” or “ok”
I
magnitude estimation of plausibility (graphical scale)
I
frequency of passive VPs in text
I
relative frequency of passive VPs
I
token-type-ratio (TTR) and average word length (Wikipedia)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Introduction
Scales of measurement
Quiz
Which scale of measurement / data type is it?
I
subcategorisation frame
I
reaction time (in psycholinguistic experiment)
I
familiarity rating on scale 1, . . . , 7
I
room number
I
grammaticality rating: “*”, “??”, “?” or “ok”
I
magnitude estimation of plausibility (graphical scale)
I
frequency of passive VPs in text
I
relative frequency of passive VPs
I
token-type-ratio (TTR) and average word length (Wikipedia)
+ in this unit: continuous numerical variables on ratio scale
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
9 / 40
Descriptive statistics
Characteristic measures
Outline
Introduction
Categorical vs. numerical variables
Scales of measurement
Descriptive statistics
Characteristic measures
Histogram & density
Random variables & expectations
Continuous distributions
The shape of a distribution
The normal distribution (Gaussian)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
10 / 40
Descriptive statistics
Characteristic measures
The task
I
Census data from small country of Ingary with m = 502,202
inhabitants. The following properties were recorded:
I
I
I
I
I
body height in cm
weight in kg
shoe size in Paris points (Continental European system)
sex (male, female)
Frequency statistics for m = 1,429,649 Wikipedia articles:
I
I
I
I
token count
type count
token-type ratio (TTR)
average word length (across tokens)
+ Describe / summarise these data sets (continuous variables)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
11 / 40
Descriptive statistics
Characteristic measures
The task
I
Census data from small country of Ingary with m = 502,202
inhabitants. The following properties were recorded:
I
I
I
I
I
body height in cm
weight in kg
shoe size in Paris points (Continental European system)
sex (male, female)
Frequency statistics for m = 1,429,649 Wikipedia articles:
I
I
I
I
token count
type count
token-type ratio (TTR)
average word length (across tokens)
+ Describe / summarise these data sets (continuous variables)
> library(SIGIL)
> FakeCensus <- simulated.census()
> WackypediaStats <- simulated.wikipedia()
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
11 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: central tendency
I
How would you describe body heights with a single number?
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
12 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: central tendency
I
How would you describe body heights with a single number?
m
mean µ =
x1 + · · · + xm
1 X
=
xi
m
m
i=1
I
Is this intuitively sensible? Or are we just used to it?
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
12 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: central tendency
I
How would you describe body heights with a single number?
m
mean µ =
x1 + · · · + xm
1 X
=
xi
m
m
i=1
I
Is this intuitively sensible? Or are we just used to it?
> mean(FakeCensus$height)
[1] 170.9781
> mean(FakeCensus$weight)
[1] 65.28917
> mean(FakeCensus$shoe.size)
[1] 41.49712
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
12 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: variability (spread)
I
Average weight of 65.3 kg not very useful if we have to design
an elevator for 10 persons or a chair that doesn’t collapse:
We need to know if everyone weighs close to 65 kg, or whether
the typical range is 40–100 kg, or whether it is even larger.
I
Measure of spread: minimum and maximum, here 30–196 kg
I
We’re more interested in the “typical” range of values without
the most extreme cases
I
Average variability based on error xi − µ for each individual
shows how well the mean µ describes the entire population
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
13 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: variability (spread)
I
Average weight of 65.3 kg not very useful if we have to design
an elevator for 10 persons or a chair that doesn’t collapse:
We need to know if everyone weighs close to 65 kg, or whether
the typical range is 40–100 kg, or whether it is even larger.
I
Measure of spread: minimum and maximum, here 30–196 kg
I
We’re more interested in the “typical” range of values without
the most extreme cases
I
Average variability based on error xi − µ for each individual
shows how well the mean µ describes the entire population
m
1 X
(xi − µ) = 0
m
i=1
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
13 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: variability (spread)
I
Average weight of 65.3 kg not very useful if we have to design
an elevator for 10 persons or a chair that doesn’t collapse:
We need to know if everyone weighs close to 65 kg, or whether
the typical range is 40–100 kg, or whether it is even larger.
I
Measure of spread: minimum and maximum, here 30–196 kg
I
We’re more interested in the “typical” range of values without
the most extreme cases
I
Average variability based on error xi − µ for each individual
shows how well the mean µ describes the entire population
m
1 X
|xi − µ| is mathematically inconvenient
m
i=1
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
13 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: variability (spread)
I
Average weight of 65.3 kg not very useful if we have to design
an elevator for 10 persons or a chair that doesn’t collapse:
We need to know if everyone weighs close to 65 kg, or whether
the typical range is 40–100 kg, or whether it is even larger.
I
Measure of spread: minimum and maximum, here 30–196 kg
I
We’re more interested in the “typical” range of values without
the most extreme cases
I
Average variability based on error xi − µ for each individual
shows how well the mean µ describes the entire population
m
variance σ 2 =
1 X
(xi − µ)2
m
i=1
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
13 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: variability (spread)
m
1 X
variance σ =
(xi − µ)2
m
2
i=1
+ Do you remember how to calculate this in R?
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
14 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: variability (spread)
m
1 X
variance σ =
(xi − µ)2
m
2
i=1
+ Do you remember how to calculate this in R?
I
I
I
height: µ = 171.00, σ 2 = 199.50
weight: µ = 65.29, σ 2 = 306.72
shoe size: µ = 41.50, σ 2 = 21.70
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
14 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: variability (spread)
m
1 X
variance σ =
(xi − µ)2
m
2
i=1
+ Do you remember how to calculate this in R?
I
I
I
height: µ = 171.00, σ 2 = 199.50, σ = 14.12
weight: µ = 65.29, σ 2 = 306.72, σ = 17.51
shoe size: µ = 41.50, σ 2 = 21.70, σ = 4.66
I
Mean and variance are not on a comparable
scale
√
2
Ü standard deviation (s.d.) σ = σ
I
NB: still gives more weight to larger errors!
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
14 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: higher moments
I
I
Mean based on (xi )1 is also known as a “first moment”,
variance based on (xi )2 as a “second moment”
The third moment is called skewness
m 1 X xi − µ 3
γ1 =
m
σ
i=1
I
and measures the asymmetry of a distribution
The fourth moment (kurtosis) measures “bulginess”
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
15 / 40
Descriptive statistics
Characteristic measures
Characteristic measures: higher moments
I
I
Mean based on (xi )1 is also known as a “first moment”,
variance based on (xi )2 as a “second moment”
The third moment is called skewness
m 1 X xi − µ 3
γ1 =
m
σ
i=1
I
and measures the asymmetry of a distribution
The fourth moment (kurtosis) measures “bulginess”
I
How useful are these characteristic measures?
I
I
Given the mean, s.d., skewness, . . . , can you tell how many
people are taller than 190 cm, or how many weigh ≈ 100 kg?
Such measures mainly used for computational efficiency, and
even this required an elaborate procedure in the 19th century
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
15 / 40
Descriptive statistics
Histogram & density
Outline
Introduction
Categorical vs. numerical variables
Scales of measurement
Descriptive statistics
Characteristic measures
Histogram & density
Random variables & expectations
Continuous distributions
The shape of a distribution
The normal distribution (Gaussian)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
16 / 40
Descriptive statistics
Histogram & density
The shape of a distribution: discrete data
0.04
0.03
0.02
0.00
0.01
Proportion of population
0.05
0.06
Discrete numerical data can be tabulated and plotted
30
32
34
36
38
40
42
44
46
48
50
52
54
Shoe size
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
17 / 40
Descriptive statistics
Histogram & density
The shape of a distribution: histogram for continuous data
Continuous data must be collected into bins Ü histogram
47776
40000
42244
30000
35374
25752
21286
20000
Frequency
50000
60000
62637
61971
59379
56578
55453
10000
12257
9905
4584
3874
1096
1 10 39 275
0
1329324
47 10 1
120
140
160
180
200
220
body height
I
No two people have exactly the same body height, weight, . . .
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
18 / 40
Descriptive statistics
Histogram & density
The shape of a distribution: histogram for continuous data
Continuous data must be collected into bins Ü histogram
60000
50000
20000
10000
10000
12257
9905
4584
3874
1096
1 10 39 275
1329324
47 10 1
0
120
140
40000
Frequency
30000
25752
21286
160
180
200
220
0
40000
42244
35374
30000
47776
20000
Frequency
50000
60000
62637
61971
59379
56578
55453
120
body height
140
160
180
200
I
No two people have exactly the same body height, weight, . . .
I
Frequency counts (= y-axis scale) depend on number of bins
SIGIL (Baroni & Evert)
220
body height
3a. Continuous Data: Description
sigil.r-forge.r-project.org
18 / 40
Descriptive statistics
Histogram & density
The shape of a distribution: histogram for continuous data
0.025
0.020
0.015
Density
0.010
0.005
0.000
0.000
0.005
0.010
Density
0.015
0.020
0.025
Continuous data must be collected into bins Ü histogram
120
140
160
180
200
220
120
body height
140
160
180
I
Density scale is comparable for different numbers of bins
I
Area of histogram bar ≡ relative frequency in population
SIGIL (Baroni & Evert)
200
220
body height
3a. Continuous Data: Description
sigil.r-forge.r-project.org
19 / 40
Descriptive statistics
Histogram & density
0.000
0.005
0.010
Density
0.015
0.020
0.025
Refining histograms: the density function
120
140
160
180
200
220
body height
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
20 / 40
Descriptive statistics
Histogram & density
0.000
0.005
0.010
Density
0.015
0.020
0.025
Refining histograms: the density function
120
140
160
180
200
220
body height
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
20 / 40
Descriptive statistics
Histogram & density
0.000
0.005
0.010
Density
0.015
0.020
0.025
Refining histograms: the density function
120
140
160
180
200
220
body height
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
20 / 40
Descriptive statistics
Histogram & density
0.000
0.005
0.010
Density
0.015
0.020
0.025
Refining histograms: the density function
120
140
160
180
200
220
body height
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
20 / 40
Descriptive statistics
Histogram & density
0.000
0.005
0.010
Density
0.015
0.020
0.025
Refining histograms: the density function
120
140
160
180
200
220
body height
I
Contour of histogram = density function
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
20 / 40
Descriptive statistics
Random variables & expectations
Outline
Introduction
Categorical vs. numerical variables
Scales of measurement
Descriptive statistics
Characteristic measures
Histogram & density
Random variables & expectations
Continuous distributions
The shape of a distribution
The normal distribution (Gaussian)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
21 / 40
Descriptive statistics
Random variables & expectations
Formal mathematical notation
I
Population Ω = {ω1 , ω2 , . . . , ωm } with m ≈ ∞
I
item ωk = person, Wikipedia article, word (lexical RT), . . .
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
22 / 40
Descriptive statistics
Random variables & expectations
Formal mathematical notation
I
Population Ω = {ω1 , ω2 , . . . , ωm } with m ≈ ∞
I
For each item, we are interested in several properties (e.g.
height, weight, shoe size, sex) called random variables (r.v.)
I
item ωk = person, Wikipedia article, word (lexical RT), . . .
height X : Ω → R+ with X (ωk ) = height of person ωk
weight Y : Ω → R+ with Y (ωk ) = weight of person ωk
I sex G : Ω → {0, 1} with G (ωk ) = 1 iff ωk is a woman
+ formally, a r.v. is a (usually real-valued) function over Ω
I
I
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
22 / 40
Descriptive statistics
Random variables & expectations
Formal mathematical notation
I
Population Ω = {ω1 , ω2 , . . . , ωm } with m ≈ ∞
I
For each item, we are interested in several properties (e.g.
height, weight, shoe size, sex) called random variables (r.v.)
I
item ωk = person, Wikipedia article, word (lexical RT), . . .
height X : Ω → R+ with X (ωk ) = height of person ωk
weight Y : Ω → R+ with Y (ωk ) = weight of person ωk
I sex G : Ω → {0, 1} with G (ωk ) = 1 iff ωk is a woman
+ formally, a r.v. is a (usually real-valued) function over Ω
I
I
I
Mean, variance, etc. computed for each random variable:
1 X
X (ω) =: E[X ]
m
ω∈Ω
2
1 X
σX2 =
X (ω) − µX =: Var[X ]
m
ω∈Ω
= E (X − µX )2
µX =
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
expectation
variance
sigil.r-forge.r-project.org
22 / 40
Descriptive statistics
Random variables & expectations
Working with random variables
I
2
X 0 (ω) := X (ω) − µ defines new r.v. X 0 : Ω → R
I
The expectation is a linear functional on r.v.:
+ any function f (X ) of a r.v. is itself a random variable
I
I
I
E[X + Y ] = E[X ] + E[Y ] for X , Y : Ω → R
E[r · X ] = r · E[X ] for r ∈ R
E[a] = a for constant r.v. a ∈ R (additional property)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
23 / 40
Descriptive statistics
Random variables & expectations
Working with random variables
I
2
X 0 (ω) := X (ω) − µ defines new r.v. X 0 : Ω → R
I
The expectation is a linear functional on r.v.:
+ any function f (X ) of a r.v. is itself a random variable
I
I
I
I
E[X + Y ] = E[X ] + E[Y ] for X , Y : Ω → R
E[r · X ] = r · E[X ] for r ∈ R
E[a] = a for constant r.v. a ∈ R (additional property)
These rules enable us to simplify the computation of σX2 :
σX2 = Var[X ] = E (X − µX )2 = E X 2 − 2µX X + µ2X
= E[X 2 ] − 2µX E[X ] +µ2X = E[X 2 ] − µ2X
| {z }
=µX
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
23 / 40
Descriptive statistics
Random variables & expectations
Working with random variables
I
2
X 0 (ω) := X (ω) − µ defines new r.v. X 0 : Ω → R
I
The expectation is a linear functional on r.v.:
+ any function f (X ) of a r.v. is itself a random variable
I
I
I
I
E[X + Y ] = E[X ] + E[Y ] for X , Y : Ω → R
E[r · X ] = r · E[X ] for r ∈ R
E[a] = a for constant r.v. a ∈ R (additional property)
These rules enable us to simplify the computation of σX2 :
σX2 = Var[X ] = E (X − µX )2 = E X 2 − 2µX X + µ2X
= E[X 2 ] − 2µX E[X ] +µ2X = E[X 2 ] − µ2X
| {z }
=µX
I
Random variables and probabilities: r.v. X describes outcome
of picking a random ω ∈ Ω Ü sampling distribution
Pr(a ≤ X ≤ b) =
SIGIL (Baroni & Evert)
1 {ω ∈ Ω | a ≤ X (ω) ≤ b}
m
3a. Continuous Data: Description
sigil.r-forge.r-project.org
23 / 40
Descriptive statistics
Random variables & expectations
A justification for the mean
I
I
σX2 tells us how well the r.v. X is characterised by µX
More generally, E (X − a)2 tells us how well X is
characterised by some real number a ∈ R
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
24 / 40
Descriptive statistics
Random variables & expectations
A justification for the mean
I
I
I
σX2 tells us how well the r.v. X is characterised by µX
More generally, E (X − a)2 tells us how well X is
characterised by some real number a ∈ R
The best single value we can give for X is the one that
minimises the average squared error:
E (X − a)2 = E[X 2 ] − 2a E[X ] +a2
| {z }
=µX
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
24 / 40
Descriptive statistics
Random variables & expectations
A justification for the mean
I
I
I
σX2 tells us how well the r.v. X is characterised by µX
More generally, E (X − a)2 tells us how well X is
characterised by some real number a ∈ R
The best single value we can give for X is the one that
minimises the average squared error:
E (X − a)2 = E[X 2 ] − 2a E[X ] +a2
| {z }
=µX
I
It is easy to see that a minimum is achieved for a = µX
+ The quadratic error term in our definition of σX2 guarantees
that there is always a unique minimum. This would not have
been the case e.g. with |X − a| instead of (X − a)2 .
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
24 / 40
Descriptive statistics
Random variables & expectations
How to compute the expectation of a discrete variable
I
Population distribution of a discrete variable is fully described
by giving the relative frequency of each possible value t ∈ R:
πt = Pr(X = t)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
25 / 40
Descriptive statistics
Random variables & expectations
How to compute the expectation of a discrete variable
I
Population distribution of a discrete variable is fully described
by giving the relative frequency of each possible value t ∈ R:
X X (ω)
E[X ] =
=
m
πt = Pr(X = t)
X X t
t
ω∈Ω
|
X (ω)=t
{z
m
=
X
t
t
X
X (ω)=t
1
m
}
group by value of X
=
X
t
SIGIL (Baroni & Evert)
t·
X
|X (ω) = t| X
=
t · πt =
t · Pr(X = t)
m
t
t
3a. Continuous Data: Description
sigil.r-forge.r-project.org
25 / 40
Descriptive statistics
Random variables & expectations
How to compute the expectation of a discrete variable
I
Population distribution of a discrete variable is fully described
by giving the relative frequency of each possible value t ∈ R:
X X (ω)
E[X ] =
=
m
πt = Pr(X = t)
X X t
t
ω∈Ω
X (ω)=t
|
{z
m
=
X
t
t
X
X (ω)=t
1
m
}
group by value of X
=
X
t
I
t·
X
|X (ω) = t| X
=
t · πt =
t · Pr(X = t)
m
t
t
The second moment E[X 2 ] needed for Var[X ] can also be
obtained in this way from the population distribution:
X
E[X 2 ] =
t 2 · Pr(X = t)
t
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
25 / 40
Descriptive statistics
Random variables & expectations
How to compute the expectation of a continuous variable
I
Population distribution of continuous variable can be
described by its density function g : R → [0, ∞]
I
keep in mind that Pr(X = t) = 0 for almost every value
t ∈ R: nobody is exactly 172.3456789 cm tall!
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
26 / 40
Descriptive statistics
Random variables & expectations
How to compute the expectation of a continuous variable
I
Population distribution of continuous variable can be
described by its density function g : R → [0, ∞]
I
keep in mind that Pr(X = t) = 0 for almost every value
t ∈ R: nobody is exactly 172.3456789 cm tall!
Area under density curve between a and b =
proportion of items ω ∈ Ω with a ≤ X (ω) ≤ b.
Z
Pr(a ≤ X ≤ b) =
b
g (t) dt
a
a
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
b
sigil.r-forge.r-project.org
26 / 40
Descriptive statistics
Random variables & expectations
How to compute the expectation of a continuous variable
I
Population distribution of continuous variable can be
described by its density function g : R → [0, ∞]
I
keep in mind that Pr(X = t) = 0 for almost every value
t ∈ R: nobody is exactly 172.3456789 cm tall!
Area under density curve between a and b =
proportion of items ω ∈ Ω with a ≤ X (ω) ≤ b.
b
Z
Pr(a ≤ X ≤ b) =
g (t) dt
a
Same reasoning as for discrete variable leads to:
Z
a
b
+∞
t · g (t) dt
E[X ] =
and
−∞
Z +∞
f (t) · g (t) dt
E[f (X )] =
−∞
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
26 / 40
Continuous distributions
The shape of a distribution
Outline
Introduction
Categorical vs. numerical variables
Scales of measurement
Descriptive statistics
Characteristic measures
Histogram & density
Random variables & expectations
Continuous distributions
The shape of a distribution
The normal distribution (Gaussian)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
27 / 40
Continuous distributions
The shape of a distribution
0.6
3.5
4.0
µ + 2σ
µ+σ
µ
µ−σ
0.0
µ − 2σ
0.2
0.4
Density
0.8
1.0
1.2
Different types of continuous distributions
4.5
5.0
5.5
symmetric, bell-shaped
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
28 / 40
Continuous distributions
The shape of a distribution
0.015
140
160
180
µ + 2σ
µ+σ
µ
µ−σ
µ − 2σ
0.000
0.005
0.010
Density
0.020
0.025
Different types of continuous distributions
200
symmetric, bulgy
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
29 / 40
Continuous distributions
The shape of a distribution
Density
−2
0
µ + 2σ
µ+σ
µ
µ−σ
µ − 2σ
0.0
0.1
0.2
0.3
0.4
0.5
0.6
median
Different types of continuous distributions
2
4
6
skewed (median 6= mean)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
30 / 40
Continuous distributions
The shape of a distribution
0.015
40
60
µ + 2σ
µ+σ
µ
µ−σ
µ − 2σ
0.000
0.005
0.010
Density
0.020
0.025
median
Different types of continuous distributions
80
100
120
140
complicated . . .
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
31 / 40
Continuous distributions
The shape of a distribution
0.06
30
35
40
µ + 2σ
µ+σ
µ
µ−σ
0.00
µ − 2σ
0.02
0.04
Density
0.08
0.10
median
0.12
Different types of continuous distributions
45
50
55
bimodal (mean & median misleading)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
32 / 40
Continuous distributions
The normal distribution (Gaussian)
Outline
Introduction
Categorical vs. numerical variables
Scales of measurement
Descriptive statistics
Characteristic measures
Histogram & density
Random variables & expectations
Continuous distributions
The shape of a distribution
The normal distribution (Gaussian)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
33 / 40
Continuous distributions
The normal distribution (Gaussian)
The Gaussian distribution
I
In many real-life data sets, the distribution has a typical
“bell-shaped” form known as a Gaussian (or normal)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
34 / 40
Continuous distributions
The normal distribution (Gaussian)
The Gaussian distribution
I
In many real-life data sets, the distribution has a typical
“bell-shaped” form known as a Gaussian (or normal)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
34 / 40
Continuous distributions
I
The normal distribution (Gaussian)
Idealised density function is given by simple equation:
1
2
2
g (t) = √ e −(t−µ) /2σ
σ 2π
with parameters µ ∈ R (location) and σ > 0 (width)
σ
2σ
2σ
g (t )
σ
µ
t
I
I
Notation: X ∼ N(µ, σ 2 ) if r.v. has such a distribution
No coincidence: E[X ] = µ and Var[X ] = σ 2 (Ü homework ;-)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
35 / 40
Continuous distributions
The normal distribution (Gaussian)
Important properties of the Gaussian distribution
I
Distribution is well-behaved: symmetric, and most values are
relatively close to the mean µ (within 2 standard deviations)
Z
µ+2σ
Pr(µ − 2σ ≤ X ≤ µ + 2σ) =
µ−2σ
1
2
2
√ e −(t−µ) /2σ dt
σ 2π
≈ 95.5%
I
68.3% are within range µ − σ ≤ X ≤ µ + σ (one s.d.)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
36 / 40
Continuous distributions
The normal distribution (Gaussian)
Important properties of the Gaussian distribution
I
Distribution is well-behaved: symmetric, and most values are
relatively close to the mean µ (within 2 standard deviations)
Z
µ+2σ
Pr(µ − 2σ ≤ X ≤ µ + 2σ) =
µ−2σ
1
2
2
√ e −(t−µ) /2σ dt
σ 2π
≈ 95.5%
I
I
68.3% are within range µ − σ ≤ X ≤ µ + σ (one s.d.)
The central limit theorem explains why this particular
distribution is so widespread (sum of independent effects)
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
36 / 40
Continuous distributions
The normal distribution (Gaussian)
Important properties of the Gaussian distribution
I
Distribution is well-behaved: symmetric, and most values are
relatively close to the mean µ (within 2 standard deviations)
Z
µ+2σ
Pr(µ − 2σ ≤ X ≤ µ + 2σ) =
µ−2σ
1
2
2
√ e −(t−µ) /2σ dt
σ 2π
≈ 95.5%
I
I
68.3% are within range µ − σ ≤ X ≤ µ + σ (one s.d.)
The central limit theorem explains why this particular
distribution is so widespread (sum of independent effects)
+ Mean and standard deviation are meaningful characteristics if
distribution is Gaussian or near-Gaussian
I
completely determined by these parameters
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
36 / 40
Continuous distributions
The normal distribution (Gaussian)
Assessing normality
I
Many hypothesis tests and other statistical techniques assume
that random variables follow a Gaussian distribution
I
I
If this normality assumption is not justified, a significant test
result may well be entirely spurious.
It is therefore important to verify that sample data come from
such a Gaussian or near-Gaussian distribution
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
37 / 40
Continuous distributions
The normal distribution (Gaussian)
Assessing normality
I
Many hypothesis tests and other statistical techniques assume
that random variables follow a Gaussian distribution
I
If this normality assumption is not justified, a significant test
result may well be entirely spurious.
I
It is therefore important to verify that sample data come from
such a Gaussian or near-Gaussian distribution
I
Method 1: Comparison of histograms and density functions
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
37 / 40
Continuous distributions
The normal distribution (Gaussian)
Assessing normality
I
Many hypothesis tests and other statistical techniques assume
that random variables follow a Gaussian distribution
I
If this normality assumption is not justified, a significant test
result may well be entirely spurious.
I
It is therefore important to verify that sample data come from
such a Gaussian or near-Gaussian distribution
I
Method 1: Comparison of histograms and density functions
I
Method 2: Quantile-quantile plots
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
37 / 40
Continuous distributions
The normal distribution (Gaussian)
µ
0.015
0.000
0.005
0.010
Density
0.020
0.025
> hist(x,freq=FALSE)
> lines(density(x))
µ−σ
0.030
Plot histogram and
estimated density:
µ+σ
Assessing normality: Histogram & density function
100
SIGIL (Baroni & Evert)
120
3a. Continuous Data: Description
140
160
180
200
sigil.r-forge.r-project.org
220
38 / 40
Continuous distributions
The normal distribution (Gaussian)
µ−σ
estimated density
normal approximation
Density
0.015
0.010
0.005
> xG <seq(min(x),max(x),len=100)
> yG <dnorm(xG,mean(x),sd(x))
> lines(xG,yG,col="red")
0.000
Compare best-matching
Gaussian distribution:
0.020
0.025
> hist(x,freq=FALSE)
> lines(density(x))
100
SIGIL (Baroni & Evert)
µ+σ
0.030
Plot histogram and
estimated density:
µ
Assessing normality: Histogram & density function
120
3a. Continuous Data: Description
140
160
180
200
sigil.r-forge.r-project.org
220
38 / 40
Continuous distributions
The normal distribution (Gaussian)
0.030
Substantial deviation Ü
not normal (problematic)
µ−σ
0.020
Density
0.015
0.010
0.005
> xG <seq(min(x),max(x),len=100)
> yG <dnorm(xG,mean(x),sd(x))
> lines(xG,yG,col="red")
0.000
Compare best-matching
Gaussian distribution:
SIGIL (Baroni & Evert)
estimated density
normal approximation
0.025
> hist(x,freq=FALSE)
> lines(density(x))
µ
Plot histogram and
estimated density:
µ+σ
Assessing normality: Histogram & density function
0
3a. Continuous Data: Description
50
100
sigil.r-forge.r-project.org
150
38 / 40
Continuous distributions
The normal distribution (Gaussian)
Assessing normality: Quantile-quantile plots
Quantile-quantile plots
are better suited for
small samples:
180
●
●
●
●
●
140
If distribution is
near-Gaussian, points
should follow red line.
160
> qqnorm(x)
> qqline(x,col="red")
Sample Quantiles
200
●
● ●
●●
●
●
●●●
●●●●●
●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●●●●●
●
●●
●●
●
●●●
●
●
●
●
−3
−2
−1
0
1
2
3
Theoretical Quantiles
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
39 / 40
Continuous distributions
The normal distribution (Gaussian)
Assessing normality: Quantile-quantile plots
Quantile-quantile plots
are better suited for
small samples:
●
SIGIL (Baroni & Evert)
100
●
●
●
●
●●
●●●
●
●●●
●
●
●●●
●
●
●●
●●
●●●●
80
Sample Quantiles
●
40
One-sided deviation
Ü skewed distribution
● ●
●
●
●●
●●
●●●●
●
●●
●
60
> qqnorm(x)
> qqline(x,col="red")
If distribution is
near-Gaussian, points
should follow red line.
●
120
●
●●●
●
●●
●●
●●●
●●●●
●●●
●●●
●●●
●●●●
●●
●●●●●●
●
●
●●
●●
●●
●
●
● ●
● ●●
●
−2
−1
0
1
2
Theoretical Quantiles
3a. Continuous Data: Description
sigil.r-forge.r-project.org
39 / 40
Continuous distributions
The normal distribution (Gaussian)
Playtime!
I
Take random samples of n items each from the census and
wikipedia data sets (e.g. n = 100)
library(corpora)
Survey <- sample.df(FakeCensus, n, sort=TRUE)
I
I
Plot histograms and estimated density for all variables
Assess normality of the underlying distributions
by comparison with Gaussian density function
by inspection of quantile-quantile plots
+ Can you make them look like the figures in the slides?
I
I
I
Plot histograms for all variables in the full data sets
(and estimated density functions if you’re patient enough)
I
I
What kinds of distributions do you find?
Which variables can meaningfully be described by
mean µ and standard deviation σ?
SIGIL (Baroni & Evert)
3a. Continuous Data: Description
sigil.r-forge.r-project.org
40 / 40
Related documents