Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Unit 3: Descriptive Statistics for Continuous Data Statistics for Linguists with R – A SIGIL Course Designed by Marco Baroni1 and Stefan Evert2 1 Center for Mind/Brain Sciences (CIMeC) University of Trento, Italy 2 Corpus Linguistics Group Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany http://SIGIL.r-forge.r-project.org/ Copyright © 2007–2015 Baroni & Evert SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 1 / 40 Outline Outline Introduction Categorical vs. numerical variables Scales of measurement Descriptive statistics Characteristic measures Histogram & density Random variables & expectations Continuous distributions The shape of a distribution The normal distribution (Gaussian) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 2 / 40 Introduction Categorical vs. numerical variables Outline Introduction Categorical vs. numerical variables Scales of measurement Descriptive statistics Characteristic measures Histogram & density Random variables & expectations Continuous distributions The shape of a distribution The normal distribution (Gaussian) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 3 / 40 Introduction Categorical vs. numerical variables Reminder: the library metaphor I I In the library metaphor, we took random samples from an infinite population of tokens (words, VPs, sentences, . . . ) Relevant property is a binary (or categorical) classification I I I I active vs. passive VP or sentence (binary) instance of lemma TIME vs. some other word (binary) subcategorisation frame of verb token (itr, tr, ditr, p-obj, . . . ) part-of-speech tag of word token (50+ categories) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 4 / 40 Introduction Categorical vs. numerical variables Reminder: the library metaphor I I In the library metaphor, we took random samples from an infinite population of tokens (words, VPs, sentences, . . . ) Relevant property is a binary (or categorical) classification I I I I I active vs. passive VP or sentence (binary) instance of lemma TIME vs. some other word (binary) subcategorisation frame of verb token (itr, tr, ditr, p-obj, . . . ) part-of-speech tag of word token (50+ categories) Characterisation of population distribution is straightforward I I I binomial: true proportion π = 10% of passive VPs, or relative frequency of TIME, e.g. π = 2000 pmw alternatively: specify redundant proportions (π, 1 − π), e.g. passive/active VPs (.1, .9) or TIME/other (.002, .998) multinomial: multiple proportions π1 + π2 + · · · + πK = 1, e.g. (πnoun = .28, πverb = .17, πadj = .08, . . .) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 4 / 40 Introduction Categorical vs. numerical variables Numerical properties In many other cases, the properties of interest are numerical: Population census height weight shoes sex 178.18 160.10 150.09 182.24 169.88 185.22 166.89 162.58 69.52 51.46 43.05 63.21 63.04 90.59 47.43 54.13 39.5 37.0 35.5 46.0 43.5 46.5 43.0 37.0 f f f m m m m f SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 5 / 40 Introduction Categorical vs. numerical variables Numerical properties In many other cases, the properties of interest are numerical: Population census Wikipedia articles height weight shoes sex tokens types TTR avg len. 178.18 160.10 150.09 182.24 169.88 185.22 166.89 162.58 69.52 51.46 43.05 63.21 63.04 90.59 47.43 54.13 39.5 37.0 35.5 46.0 43.5 46.5 43.0 37.0 f f f m m m m f 696 228 390 455 399 297 755 299 251 126 174 176 214 148 275 171 2.773 1.810 2.241 2.585 1.864 2.007 2.745 1.749 4.532 4.488 4.251 4.412 4.301 4.399 3.861 4.524 SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 5 / 40 Introduction Categorical vs. numerical variables Descriptive vs. inferential statistics Two main tasks of “classical” statistical methods (numerical data): SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 6 / 40 Introduction Categorical vs. numerical variables Descriptive vs. inferential statistics Two main tasks of “classical” statistical methods (numerical data): 1. Descriptive statistics I I I compact description of the distribution of a (numerical) property in a very large or infinite population often by characteristic parameters such as mean, variance, . . . this was the original purpose of statistics in the 19th century SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 6 / 40 Introduction Categorical vs. numerical variables Descriptive vs. inferential statistics Two main tasks of “classical” statistical methods (numerical data): 1. Descriptive statistics I I I compact description of the distribution of a (numerical) property in a very large or infinite population often by characteristic parameters such as mean, variance, . . . this was the original purpose of statistics in the 19th century 2. Inferential statistics I I I infer (aspects of) population distribution from a comparatively small random sample accurate estimates for level of uncertainty involved often by testing (and rejecting) some null hypothesis H0 SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 6 / 40 Introduction Scales of measurement Outline Introduction Categorical vs. numerical variables Scales of measurement Descriptive statistics Characteristic measures Histogram & density Random variables & expectations Continuous distributions The shape of a distribution The normal distribution (Gaussian) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 7 / 40 Introduction Scales of measurement Statisticians distinguish 4 scales of measurement Categorical data Numerical data SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 8 / 40 Introduction Scales of measurement Statisticians distinguish 4 scales of measurement Categorical data 1. Nominal scale: purely qualitative classification I male vs. female, passive vs. active, POS tags, subcat frames Numerical data SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 8 / 40 Introduction Scales of measurement Statisticians distinguish 4 scales of measurement Categorical data 1. Nominal scale: purely qualitative classification I male vs. female, passive vs. active, POS tags, subcat frames 2. Ordinal scale: ordered categories I school grades A–E, social class, low/medium/high rating Numerical data SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 8 / 40 Introduction Scales of measurement Statisticians distinguish 4 scales of measurement Categorical data 1. Nominal scale: purely qualitative classification I male vs. female, passive vs. active, POS tags, subcat frames 2. Ordinal scale: ordered categories I school grades A–E, social class, low/medium/high rating Numerical data 3. Interval scale: meaningful comparison of differences I temperature (°C), plausibility & grammaticality ratings SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 8 / 40 Introduction Scales of measurement Statisticians distinguish 4 scales of measurement Categorical data 1. Nominal scale: purely qualitative classification I male vs. female, passive vs. active, POS tags, subcat frames 2. Ordinal scale: ordered categories I school grades A–E, social class, low/medium/high rating Numerical data 3. Interval scale: meaningful comparison of differences I temperature (°C), plausibility & grammaticality ratings 4. Ratio scale: comparison of magnitudes, absolute zero I time, length/width/height, weight, frequency counts SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 8 / 40 Introduction Scales of measurement Statisticians distinguish 4 scales of measurement Categorical data 1. Nominal scale: purely qualitative classification I male vs. female, passive vs. active, POS tags, subcat frames 2. Ordinal scale: ordered categories I school grades A–E, social class, low/medium/high rating Numerical data 3. Interval scale: meaningful comparison of differences I temperature (°C), plausibility & grammaticality ratings 4. Ratio scale: comparison of magnitudes, absolute zero I time, length/width/height, weight, frequency counts Additional dimension: discrete vs. continuous numerical data I I discrete: frequency counts, rating (1, . . . , 7), shoe size, . . . continuous: length, time, weight, temperature, . . . SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 8 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame I reaction time (in psycholinguistic experiment) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame I reaction time (in psycholinguistic experiment) I familiarity rating on scale 1, . . . , 7 SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame I reaction time (in psycholinguistic experiment) I familiarity rating on scale 1, . . . , 7 I room number SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame I reaction time (in psycholinguistic experiment) I familiarity rating on scale 1, . . . , 7 I room number I grammaticality rating: “*”, “??”, “?” or “ok” SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame I reaction time (in psycholinguistic experiment) I familiarity rating on scale 1, . . . , 7 I room number I grammaticality rating: “*”, “??”, “?” or “ok” I magnitude estimation of plausibility (graphical scale) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame I reaction time (in psycholinguistic experiment) I familiarity rating on scale 1, . . . , 7 I room number I grammaticality rating: “*”, “??”, “?” or “ok” I magnitude estimation of plausibility (graphical scale) I frequency of passive VPs in text SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame I reaction time (in psycholinguistic experiment) I familiarity rating on scale 1, . . . , 7 I room number I grammaticality rating: “*”, “??”, “?” or “ok” I magnitude estimation of plausibility (graphical scale) I frequency of passive VPs in text I relative frequency of passive VPs SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame I reaction time (in psycholinguistic experiment) I familiarity rating on scale 1, . . . , 7 I room number I grammaticality rating: “*”, “??”, “?” or “ok” I magnitude estimation of plausibility (graphical scale) I frequency of passive VPs in text I relative frequency of passive VPs I token-type-ratio (TTR) and average word length (Wikipedia) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Introduction Scales of measurement Quiz Which scale of measurement / data type is it? I subcategorisation frame I reaction time (in psycholinguistic experiment) I familiarity rating on scale 1, . . . , 7 I room number I grammaticality rating: “*”, “??”, “?” or “ok” I magnitude estimation of plausibility (graphical scale) I frequency of passive VPs in text I relative frequency of passive VPs I token-type-ratio (TTR) and average word length (Wikipedia) + in this unit: continuous numerical variables on ratio scale SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 9 / 40 Descriptive statistics Characteristic measures Outline Introduction Categorical vs. numerical variables Scales of measurement Descriptive statistics Characteristic measures Histogram & density Random variables & expectations Continuous distributions The shape of a distribution The normal distribution (Gaussian) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 10 / 40 Descriptive statistics Characteristic measures The task I Census data from small country of Ingary with m = 502,202 inhabitants. The following properties were recorded: I I I I I body height in cm weight in kg shoe size in Paris points (Continental European system) sex (male, female) Frequency statistics for m = 1,429,649 Wikipedia articles: I I I I token count type count token-type ratio (TTR) average word length (across tokens) + Describe / summarise these data sets (continuous variables) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 11 / 40 Descriptive statistics Characteristic measures The task I Census data from small country of Ingary with m = 502,202 inhabitants. The following properties were recorded: I I I I I body height in cm weight in kg shoe size in Paris points (Continental European system) sex (male, female) Frequency statistics for m = 1,429,649 Wikipedia articles: I I I I token count type count token-type ratio (TTR) average word length (across tokens) + Describe / summarise these data sets (continuous variables) > library(SIGIL) > FakeCensus <- simulated.census() > WackypediaStats <- simulated.wikipedia() SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 11 / 40 Descriptive statistics Characteristic measures Characteristic measures: central tendency I How would you describe body heights with a single number? SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 12 / 40 Descriptive statistics Characteristic measures Characteristic measures: central tendency I How would you describe body heights with a single number? m mean µ = x1 + · · · + xm 1 X = xi m m i=1 I Is this intuitively sensible? Or are we just used to it? SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 12 / 40 Descriptive statistics Characteristic measures Characteristic measures: central tendency I How would you describe body heights with a single number? m mean µ = x1 + · · · + xm 1 X = xi m m i=1 I Is this intuitively sensible? Or are we just used to it? > mean(FakeCensus$height) [1] 170.9781 > mean(FakeCensus$weight) [1] 65.28917 > mean(FakeCensus$shoe.size) [1] 41.49712 SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 12 / 40 Descriptive statistics Characteristic measures Characteristic measures: variability (spread) I Average weight of 65.3 kg not very useful if we have to design an elevator for 10 persons or a chair that doesn’t collapse: We need to know if everyone weighs close to 65 kg, or whether the typical range is 40–100 kg, or whether it is even larger. I Measure of spread: minimum and maximum, here 30–196 kg I We’re more interested in the “typical” range of values without the most extreme cases I Average variability based on error xi − µ for each individual shows how well the mean µ describes the entire population SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 13 / 40 Descriptive statistics Characteristic measures Characteristic measures: variability (spread) I Average weight of 65.3 kg not very useful if we have to design an elevator for 10 persons or a chair that doesn’t collapse: We need to know if everyone weighs close to 65 kg, or whether the typical range is 40–100 kg, or whether it is even larger. I Measure of spread: minimum and maximum, here 30–196 kg I We’re more interested in the “typical” range of values without the most extreme cases I Average variability based on error xi − µ for each individual shows how well the mean µ describes the entire population m 1 X (xi − µ) = 0 m i=1 SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 13 / 40 Descriptive statistics Characteristic measures Characteristic measures: variability (spread) I Average weight of 65.3 kg not very useful if we have to design an elevator for 10 persons or a chair that doesn’t collapse: We need to know if everyone weighs close to 65 kg, or whether the typical range is 40–100 kg, or whether it is even larger. I Measure of spread: minimum and maximum, here 30–196 kg I We’re more interested in the “typical” range of values without the most extreme cases I Average variability based on error xi − µ for each individual shows how well the mean µ describes the entire population m 1 X |xi − µ| is mathematically inconvenient m i=1 SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 13 / 40 Descriptive statistics Characteristic measures Characteristic measures: variability (spread) I Average weight of 65.3 kg not very useful if we have to design an elevator for 10 persons or a chair that doesn’t collapse: We need to know if everyone weighs close to 65 kg, or whether the typical range is 40–100 kg, or whether it is even larger. I Measure of spread: minimum and maximum, here 30–196 kg I We’re more interested in the “typical” range of values without the most extreme cases I Average variability based on error xi − µ for each individual shows how well the mean µ describes the entire population m variance σ 2 = 1 X (xi − µ)2 m i=1 SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 13 / 40 Descriptive statistics Characteristic measures Characteristic measures: variability (spread) m 1 X variance σ = (xi − µ)2 m 2 i=1 + Do you remember how to calculate this in R? SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 14 / 40 Descriptive statistics Characteristic measures Characteristic measures: variability (spread) m 1 X variance σ = (xi − µ)2 m 2 i=1 + Do you remember how to calculate this in R? I I I height: µ = 171.00, σ 2 = 199.50 weight: µ = 65.29, σ 2 = 306.72 shoe size: µ = 41.50, σ 2 = 21.70 SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 14 / 40 Descriptive statistics Characteristic measures Characteristic measures: variability (spread) m 1 X variance σ = (xi − µ)2 m 2 i=1 + Do you remember how to calculate this in R? I I I height: µ = 171.00, σ 2 = 199.50, σ = 14.12 weight: µ = 65.29, σ 2 = 306.72, σ = 17.51 shoe size: µ = 41.50, σ 2 = 21.70, σ = 4.66 I Mean and variance are not on a comparable scale √ 2 Ü standard deviation (s.d.) σ = σ I NB: still gives more weight to larger errors! SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 14 / 40 Descriptive statistics Characteristic measures Characteristic measures: higher moments I I Mean based on (xi )1 is also known as a “first moment”, variance based on (xi )2 as a “second moment” The third moment is called skewness m 1 X xi − µ 3 γ1 = m σ i=1 I and measures the asymmetry of a distribution The fourth moment (kurtosis) measures “bulginess” SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 15 / 40 Descriptive statistics Characteristic measures Characteristic measures: higher moments I I Mean based on (xi )1 is also known as a “first moment”, variance based on (xi )2 as a “second moment” The third moment is called skewness m 1 X xi − µ 3 γ1 = m σ i=1 I and measures the asymmetry of a distribution The fourth moment (kurtosis) measures “bulginess” I How useful are these characteristic measures? I I Given the mean, s.d., skewness, . . . , can you tell how many people are taller than 190 cm, or how many weigh ≈ 100 kg? Such measures mainly used for computational efficiency, and even this required an elaborate procedure in the 19th century SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 15 / 40 Descriptive statistics Histogram & density Outline Introduction Categorical vs. numerical variables Scales of measurement Descriptive statistics Characteristic measures Histogram & density Random variables & expectations Continuous distributions The shape of a distribution The normal distribution (Gaussian) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 16 / 40 Descriptive statistics Histogram & density The shape of a distribution: discrete data 0.04 0.03 0.02 0.00 0.01 Proportion of population 0.05 0.06 Discrete numerical data can be tabulated and plotted 30 32 34 36 38 40 42 44 46 48 50 52 54 Shoe size SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 17 / 40 Descriptive statistics Histogram & density The shape of a distribution: histogram for continuous data Continuous data must be collected into bins Ü histogram 47776 40000 42244 30000 35374 25752 21286 20000 Frequency 50000 60000 62637 61971 59379 56578 55453 10000 12257 9905 4584 3874 1096 1 10 39 275 0 1329324 47 10 1 120 140 160 180 200 220 body height I No two people have exactly the same body height, weight, . . . SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 18 / 40 Descriptive statistics Histogram & density The shape of a distribution: histogram for continuous data Continuous data must be collected into bins Ü histogram 60000 50000 20000 10000 10000 12257 9905 4584 3874 1096 1 10 39 275 1329324 47 10 1 0 120 140 40000 Frequency 30000 25752 21286 160 180 200 220 0 40000 42244 35374 30000 47776 20000 Frequency 50000 60000 62637 61971 59379 56578 55453 120 body height 140 160 180 200 I No two people have exactly the same body height, weight, . . . I Frequency counts (= y-axis scale) depend on number of bins SIGIL (Baroni & Evert) 220 body height 3a. Continuous Data: Description sigil.r-forge.r-project.org 18 / 40 Descriptive statistics Histogram & density The shape of a distribution: histogram for continuous data 0.025 0.020 0.015 Density 0.010 0.005 0.000 0.000 0.005 0.010 Density 0.015 0.020 0.025 Continuous data must be collected into bins Ü histogram 120 140 160 180 200 220 120 body height 140 160 180 I Density scale is comparable for different numbers of bins I Area of histogram bar ≡ relative frequency in population SIGIL (Baroni & Evert) 200 220 body height 3a. Continuous Data: Description sigil.r-forge.r-project.org 19 / 40 Descriptive statistics Histogram & density 0.000 0.005 0.010 Density 0.015 0.020 0.025 Refining histograms: the density function 120 140 160 180 200 220 body height SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 20 / 40 Descriptive statistics Histogram & density 0.000 0.005 0.010 Density 0.015 0.020 0.025 Refining histograms: the density function 120 140 160 180 200 220 body height SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 20 / 40 Descriptive statistics Histogram & density 0.000 0.005 0.010 Density 0.015 0.020 0.025 Refining histograms: the density function 120 140 160 180 200 220 body height SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 20 / 40 Descriptive statistics Histogram & density 0.000 0.005 0.010 Density 0.015 0.020 0.025 Refining histograms: the density function 120 140 160 180 200 220 body height SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 20 / 40 Descriptive statistics Histogram & density 0.000 0.005 0.010 Density 0.015 0.020 0.025 Refining histograms: the density function 120 140 160 180 200 220 body height I Contour of histogram = density function SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 20 / 40 Descriptive statistics Random variables & expectations Outline Introduction Categorical vs. numerical variables Scales of measurement Descriptive statistics Characteristic measures Histogram & density Random variables & expectations Continuous distributions The shape of a distribution The normal distribution (Gaussian) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 21 / 40 Descriptive statistics Random variables & expectations Formal mathematical notation I Population Ω = {ω1 , ω2 , . . . , ωm } with m ≈ ∞ I item ωk = person, Wikipedia article, word (lexical RT), . . . SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 22 / 40 Descriptive statistics Random variables & expectations Formal mathematical notation I Population Ω = {ω1 , ω2 , . . . , ωm } with m ≈ ∞ I For each item, we are interested in several properties (e.g. height, weight, shoe size, sex) called random variables (r.v.) I item ωk = person, Wikipedia article, word (lexical RT), . . . height X : Ω → R+ with X (ωk ) = height of person ωk weight Y : Ω → R+ with Y (ωk ) = weight of person ωk I sex G : Ω → {0, 1} with G (ωk ) = 1 iff ωk is a woman + formally, a r.v. is a (usually real-valued) function over Ω I I SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 22 / 40 Descriptive statistics Random variables & expectations Formal mathematical notation I Population Ω = {ω1 , ω2 , . . . , ωm } with m ≈ ∞ I For each item, we are interested in several properties (e.g. height, weight, shoe size, sex) called random variables (r.v.) I item ωk = person, Wikipedia article, word (lexical RT), . . . height X : Ω → R+ with X (ωk ) = height of person ωk weight Y : Ω → R+ with Y (ωk ) = weight of person ωk I sex G : Ω → {0, 1} with G (ωk ) = 1 iff ωk is a woman + formally, a r.v. is a (usually real-valued) function over Ω I I I Mean, variance, etc. computed for each random variable: 1 X X (ω) =: E[X ] m ω∈Ω 2 1 X σX2 = X (ω) − µX =: Var[X ] m ω∈Ω = E (X − µX )2 µX = SIGIL (Baroni & Evert) 3a. Continuous Data: Description expectation variance sigil.r-forge.r-project.org 22 / 40 Descriptive statistics Random variables & expectations Working with random variables I 2 X 0 (ω) := X (ω) − µ defines new r.v. X 0 : Ω → R I The expectation is a linear functional on r.v.: + any function f (X ) of a r.v. is itself a random variable I I I E[X + Y ] = E[X ] + E[Y ] for X , Y : Ω → R E[r · X ] = r · E[X ] for r ∈ R E[a] = a for constant r.v. a ∈ R (additional property) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 23 / 40 Descriptive statistics Random variables & expectations Working with random variables I 2 X 0 (ω) := X (ω) − µ defines new r.v. X 0 : Ω → R I The expectation is a linear functional on r.v.: + any function f (X ) of a r.v. is itself a random variable I I I I E[X + Y ] = E[X ] + E[Y ] for X , Y : Ω → R E[r · X ] = r · E[X ] for r ∈ R E[a] = a for constant r.v. a ∈ R (additional property) These rules enable us to simplify the computation of σX2 : σX2 = Var[X ] = E (X − µX )2 = E X 2 − 2µX X + µ2X = E[X 2 ] − 2µX E[X ] +µ2X = E[X 2 ] − µ2X | {z } =µX SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 23 / 40 Descriptive statistics Random variables & expectations Working with random variables I 2 X 0 (ω) := X (ω) − µ defines new r.v. X 0 : Ω → R I The expectation is a linear functional on r.v.: + any function f (X ) of a r.v. is itself a random variable I I I I E[X + Y ] = E[X ] + E[Y ] for X , Y : Ω → R E[r · X ] = r · E[X ] for r ∈ R E[a] = a for constant r.v. a ∈ R (additional property) These rules enable us to simplify the computation of σX2 : σX2 = Var[X ] = E (X − µX )2 = E X 2 − 2µX X + µ2X = E[X 2 ] − 2µX E[X ] +µ2X = E[X 2 ] − µ2X | {z } =µX I Random variables and probabilities: r.v. X describes outcome of picking a random ω ∈ Ω Ü sampling distribution Pr(a ≤ X ≤ b) = SIGIL (Baroni & Evert) 1 {ω ∈ Ω | a ≤ X (ω) ≤ b} m 3a. Continuous Data: Description sigil.r-forge.r-project.org 23 / 40 Descriptive statistics Random variables & expectations A justification for the mean I I σX2 tells us how well the r.v. X is characterised by µX More generally, E (X − a)2 tells us how well X is characterised by some real number a ∈ R SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 24 / 40 Descriptive statistics Random variables & expectations A justification for the mean I I I σX2 tells us how well the r.v. X is characterised by µX More generally, E (X − a)2 tells us how well X is characterised by some real number a ∈ R The best single value we can give for X is the one that minimises the average squared error: E (X − a)2 = E[X 2 ] − 2a E[X ] +a2 | {z } =µX SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 24 / 40 Descriptive statistics Random variables & expectations A justification for the mean I I I σX2 tells us how well the r.v. X is characterised by µX More generally, E (X − a)2 tells us how well X is characterised by some real number a ∈ R The best single value we can give for X is the one that minimises the average squared error: E (X − a)2 = E[X 2 ] − 2a E[X ] +a2 | {z } =µX I It is easy to see that a minimum is achieved for a = µX + The quadratic error term in our definition of σX2 guarantees that there is always a unique minimum. This would not have been the case e.g. with |X − a| instead of (X − a)2 . SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 24 / 40 Descriptive statistics Random variables & expectations How to compute the expectation of a discrete variable I Population distribution of a discrete variable is fully described by giving the relative frequency of each possible value t ∈ R: πt = Pr(X = t) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 25 / 40 Descriptive statistics Random variables & expectations How to compute the expectation of a discrete variable I Population distribution of a discrete variable is fully described by giving the relative frequency of each possible value t ∈ R: X X (ω) E[X ] = = m πt = Pr(X = t) X X t t ω∈Ω | X (ω)=t {z m = X t t X X (ω)=t 1 m } group by value of X = X t SIGIL (Baroni & Evert) t· X |X (ω) = t| X = t · πt = t · Pr(X = t) m t t 3a. Continuous Data: Description sigil.r-forge.r-project.org 25 / 40 Descriptive statistics Random variables & expectations How to compute the expectation of a discrete variable I Population distribution of a discrete variable is fully described by giving the relative frequency of each possible value t ∈ R: X X (ω) E[X ] = = m πt = Pr(X = t) X X t t ω∈Ω X (ω)=t | {z m = X t t X X (ω)=t 1 m } group by value of X = X t I t· X |X (ω) = t| X = t · πt = t · Pr(X = t) m t t The second moment E[X 2 ] needed for Var[X ] can also be obtained in this way from the population distribution: X E[X 2 ] = t 2 · Pr(X = t) t SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 25 / 40 Descriptive statistics Random variables & expectations How to compute the expectation of a continuous variable I Population distribution of continuous variable can be described by its density function g : R → [0, ∞] I keep in mind that Pr(X = t) = 0 for almost every value t ∈ R: nobody is exactly 172.3456789 cm tall! SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 26 / 40 Descriptive statistics Random variables & expectations How to compute the expectation of a continuous variable I Population distribution of continuous variable can be described by its density function g : R → [0, ∞] I keep in mind that Pr(X = t) = 0 for almost every value t ∈ R: nobody is exactly 172.3456789 cm tall! Area under density curve between a and b = proportion of items ω ∈ Ω with a ≤ X (ω) ≤ b. Z Pr(a ≤ X ≤ b) = b g (t) dt a a SIGIL (Baroni & Evert) 3a. Continuous Data: Description b sigil.r-forge.r-project.org 26 / 40 Descriptive statistics Random variables & expectations How to compute the expectation of a continuous variable I Population distribution of continuous variable can be described by its density function g : R → [0, ∞] I keep in mind that Pr(X = t) = 0 for almost every value t ∈ R: nobody is exactly 172.3456789 cm tall! Area under density curve between a and b = proportion of items ω ∈ Ω with a ≤ X (ω) ≤ b. b Z Pr(a ≤ X ≤ b) = g (t) dt a Same reasoning as for discrete variable leads to: Z a b +∞ t · g (t) dt E[X ] = and −∞ Z +∞ f (t) · g (t) dt E[f (X )] = −∞ SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 26 / 40 Continuous distributions The shape of a distribution Outline Introduction Categorical vs. numerical variables Scales of measurement Descriptive statistics Characteristic measures Histogram & density Random variables & expectations Continuous distributions The shape of a distribution The normal distribution (Gaussian) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 27 / 40 Continuous distributions The shape of a distribution 0.6 3.5 4.0 µ + 2σ µ+σ µ µ−σ 0.0 µ − 2σ 0.2 0.4 Density 0.8 1.0 1.2 Different types of continuous distributions 4.5 5.0 5.5 symmetric, bell-shaped SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 28 / 40 Continuous distributions The shape of a distribution 0.015 140 160 180 µ + 2σ µ+σ µ µ−σ µ − 2σ 0.000 0.005 0.010 Density 0.020 0.025 Different types of continuous distributions 200 symmetric, bulgy SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 29 / 40 Continuous distributions The shape of a distribution Density −2 0 µ + 2σ µ+σ µ µ−σ µ − 2σ 0.0 0.1 0.2 0.3 0.4 0.5 0.6 median Different types of continuous distributions 2 4 6 skewed (median 6= mean) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 30 / 40 Continuous distributions The shape of a distribution 0.015 40 60 µ + 2σ µ+σ µ µ−σ µ − 2σ 0.000 0.005 0.010 Density 0.020 0.025 median Different types of continuous distributions 80 100 120 140 complicated . . . SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 31 / 40 Continuous distributions The shape of a distribution 0.06 30 35 40 µ + 2σ µ+σ µ µ−σ 0.00 µ − 2σ 0.02 0.04 Density 0.08 0.10 median 0.12 Different types of continuous distributions 45 50 55 bimodal (mean & median misleading) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 32 / 40 Continuous distributions The normal distribution (Gaussian) Outline Introduction Categorical vs. numerical variables Scales of measurement Descriptive statistics Characteristic measures Histogram & density Random variables & expectations Continuous distributions The shape of a distribution The normal distribution (Gaussian) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 33 / 40 Continuous distributions The normal distribution (Gaussian) The Gaussian distribution I In many real-life data sets, the distribution has a typical “bell-shaped” form known as a Gaussian (or normal) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 34 / 40 Continuous distributions The normal distribution (Gaussian) The Gaussian distribution I In many real-life data sets, the distribution has a typical “bell-shaped” form known as a Gaussian (or normal) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 34 / 40 Continuous distributions I The normal distribution (Gaussian) Idealised density function is given by simple equation: 1 2 2 g (t) = √ e −(t−µ) /2σ σ 2π with parameters µ ∈ R (location) and σ > 0 (width) σ 2σ 2σ g (t ) σ µ t I I Notation: X ∼ N(µ, σ 2 ) if r.v. has such a distribution No coincidence: E[X ] = µ and Var[X ] = σ 2 (Ü homework ;-) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 35 / 40 Continuous distributions The normal distribution (Gaussian) Important properties of the Gaussian distribution I Distribution is well-behaved: symmetric, and most values are relatively close to the mean µ (within 2 standard deviations) Z µ+2σ Pr(µ − 2σ ≤ X ≤ µ + 2σ) = µ−2σ 1 2 2 √ e −(t−µ) /2σ dt σ 2π ≈ 95.5% I 68.3% are within range µ − σ ≤ X ≤ µ + σ (one s.d.) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 36 / 40 Continuous distributions The normal distribution (Gaussian) Important properties of the Gaussian distribution I Distribution is well-behaved: symmetric, and most values are relatively close to the mean µ (within 2 standard deviations) Z µ+2σ Pr(µ − 2σ ≤ X ≤ µ + 2σ) = µ−2σ 1 2 2 √ e −(t−µ) /2σ dt σ 2π ≈ 95.5% I I 68.3% are within range µ − σ ≤ X ≤ µ + σ (one s.d.) The central limit theorem explains why this particular distribution is so widespread (sum of independent effects) SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 36 / 40 Continuous distributions The normal distribution (Gaussian) Important properties of the Gaussian distribution I Distribution is well-behaved: symmetric, and most values are relatively close to the mean µ (within 2 standard deviations) Z µ+2σ Pr(µ − 2σ ≤ X ≤ µ + 2σ) = µ−2σ 1 2 2 √ e −(t−µ) /2σ dt σ 2π ≈ 95.5% I I 68.3% are within range µ − σ ≤ X ≤ µ + σ (one s.d.) The central limit theorem explains why this particular distribution is so widespread (sum of independent effects) + Mean and standard deviation are meaningful characteristics if distribution is Gaussian or near-Gaussian I completely determined by these parameters SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 36 / 40 Continuous distributions The normal distribution (Gaussian) Assessing normality I Many hypothesis tests and other statistical techniques assume that random variables follow a Gaussian distribution I I If this normality assumption is not justified, a significant test result may well be entirely spurious. It is therefore important to verify that sample data come from such a Gaussian or near-Gaussian distribution SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 37 / 40 Continuous distributions The normal distribution (Gaussian) Assessing normality I Many hypothesis tests and other statistical techniques assume that random variables follow a Gaussian distribution I If this normality assumption is not justified, a significant test result may well be entirely spurious. I It is therefore important to verify that sample data come from such a Gaussian or near-Gaussian distribution I Method 1: Comparison of histograms and density functions SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 37 / 40 Continuous distributions The normal distribution (Gaussian) Assessing normality I Many hypothesis tests and other statistical techniques assume that random variables follow a Gaussian distribution I If this normality assumption is not justified, a significant test result may well be entirely spurious. I It is therefore important to verify that sample data come from such a Gaussian or near-Gaussian distribution I Method 1: Comparison of histograms and density functions I Method 2: Quantile-quantile plots SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 37 / 40 Continuous distributions The normal distribution (Gaussian) µ 0.015 0.000 0.005 0.010 Density 0.020 0.025 > hist(x,freq=FALSE) > lines(density(x)) µ−σ 0.030 Plot histogram and estimated density: µ+σ Assessing normality: Histogram & density function 100 SIGIL (Baroni & Evert) 120 3a. Continuous Data: Description 140 160 180 200 sigil.r-forge.r-project.org 220 38 / 40 Continuous distributions The normal distribution (Gaussian) µ−σ estimated density normal approximation Density 0.015 0.010 0.005 > xG <seq(min(x),max(x),len=100) > yG <dnorm(xG,mean(x),sd(x)) > lines(xG,yG,col="red") 0.000 Compare best-matching Gaussian distribution: 0.020 0.025 > hist(x,freq=FALSE) > lines(density(x)) 100 SIGIL (Baroni & Evert) µ+σ 0.030 Plot histogram and estimated density: µ Assessing normality: Histogram & density function 120 3a. Continuous Data: Description 140 160 180 200 sigil.r-forge.r-project.org 220 38 / 40 Continuous distributions The normal distribution (Gaussian) 0.030 Substantial deviation Ü not normal (problematic) µ−σ 0.020 Density 0.015 0.010 0.005 > xG <seq(min(x),max(x),len=100) > yG <dnorm(xG,mean(x),sd(x)) > lines(xG,yG,col="red") 0.000 Compare best-matching Gaussian distribution: SIGIL (Baroni & Evert) estimated density normal approximation 0.025 > hist(x,freq=FALSE) > lines(density(x)) µ Plot histogram and estimated density: µ+σ Assessing normality: Histogram & density function 0 3a. Continuous Data: Description 50 100 sigil.r-forge.r-project.org 150 38 / 40 Continuous distributions The normal distribution (Gaussian) Assessing normality: Quantile-quantile plots Quantile-quantile plots are better suited for small samples: 180 ● ● ● ● ● 140 If distribution is near-Gaussian, points should follow red line. 160 > qqnorm(x) > qqline(x,col="red") Sample Quantiles 200 ● ● ● ●● ● ● ●●● ●●●●● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●●●●● ● ●● ●● ● ●●● ● ● ● ● −3 −2 −1 0 1 2 3 Theoretical Quantiles SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 39 / 40 Continuous distributions The normal distribution (Gaussian) Assessing normality: Quantile-quantile plots Quantile-quantile plots are better suited for small samples: ● SIGIL (Baroni & Evert) 100 ● ● ● ● ●● ●●● ● ●●● ● ● ●●● ● ● ●● ●● ●●●● 80 Sample Quantiles ● 40 One-sided deviation Ü skewed distribution ● ● ● ● ●● ●● ●●●● ● ●● ● 60 > qqnorm(x) > qqline(x,col="red") If distribution is near-Gaussian, points should follow red line. ● 120 ● ●●● ● ●● ●● ●●● ●●●● ●●● ●●● ●●● ●●●● ●● ●●●●●● ● ● ●● ●● ●● ● ● ● ● ● ●● ● −2 −1 0 1 2 Theoretical Quantiles 3a. Continuous Data: Description sigil.r-forge.r-project.org 39 / 40 Continuous distributions The normal distribution (Gaussian) Playtime! I Take random samples of n items each from the census and wikipedia data sets (e.g. n = 100) library(corpora) Survey <- sample.df(FakeCensus, n, sort=TRUE) I I Plot histograms and estimated density for all variables Assess normality of the underlying distributions by comparison with Gaussian density function by inspection of quantile-quantile plots + Can you make them look like the figures in the slides? I I I Plot histograms for all variables in the full data sets (and estimated density functions if you’re patient enough) I I What kinds of distributions do you find? Which variables can meaningfully be described by mean µ and standard deviation σ? SIGIL (Baroni & Evert) 3a. Continuous Data: Description sigil.r-forge.r-project.org 40 / 40