
STAB22 Statistics I
Lecture 3

Describing Quantitative Data

- Quantitative Data: variables measure numerical quantities (for which arithmetic operations make sense)
  - E.g. Breakfast cereal data (calories per serving)

    name          calories
    All-Bran      70
    Apple Jacks   110
    ⁞             ⁞

- Describe quantitative data using:
  - Plots: histogram, stem-&-leaf
  - Numerical summaries: mean, median, range, etc.

Histogram

- Create artificial categories, called bins or classes, for values of quantitative variable
  - E.g. calorie classes: 40→60, 60→80, 80→100, ... (classes span all data values & don't overlap)
- Pretend variable is categorical (bins are categories) & create bar-plot

Example

- Cereal calories
  - StatCrunch: Graphics > Histogram
  [Figure: relative frequency histogram of cereal calories]
- Differences from bar-plots:
  - Categories (classes) are ordered
  - No gaps between bars (unless freq. = 0)

Stem-and-Leaf Display

- Split each numerical value in two parts (stem & leaf), along some level (0.1's, 1's, 10's, 100's etc.)
  - E.g. 42.19, cut along 10's
    - Stem (leading digits) = 4
    - Leaf (trailing digits) = 2.19 or 2
- Put stems in the left column, and list leaves of every value to the right of their stem
- Sort leaves within each stem from smallest to largest

    Stems   Leaves
    4       12445589
    5       33566
    6       3458

Example

- Cereal potassium
  - StatCrunch: Graphics > Stem and Leaf
- Stem-and-leaf gives info similar to histogram's
  - S-&-L shows individual data values, but is not practical for very large data sets
  - E.g. What is maximum potassium content in data?

    Variable: potass
    Decimal point is 2 digit(s) to the right of the colon.
    0 : 002233333333444444444
    0 : 55555666666778999999
    1 : 00000001111111222233444
    1 : 667799
    2 : 034
    2 : 68
    3 : 23

Describing Quantitative Data

For any quantitative variable, want to describe overall pattern of its distribution. In particular, look at:
- Shape: peaks (modes), symmetry, outliers
- Centre: mean, median
- Spread: range, standard deviation

Shape of Distribution

- Modality: Check for modes (peaks), i.e. most frequently occurring values
  [Figure: three plots of Frequency vs. Variable: Unimodal (1 peak), Bimodal (2 peaks), Uniform (no peaks)]

Shape of Distribution

- Symmetry: distribution is called symmetric if, when we draw a vertical line down its centre, the two sides are similar in shape and size
  [Figure: Symmetric Distributions; three plots of Frequency vs. Variable]

Shape of Distribution

- Skewness: Unimodal distributions with one tail longer than the other are called skewed
  [Figure: two plots of Frequency vs. Variable: Skewed to the left (negatively skewed), Skewed to the right (positively skewed)]

Shape of Distribution

- Outliers: Any data that lie far off the main body of the distribution (i.e. extreme values) are called outliers
  [Figure: plot of Frequency vs. Variable with the outliers marked]

Centre of Distribution

- Mean: sum of all values of quantitative variable divided by the number of values
  - E.g. Amount customers spend at coffee shop; data: $3.5, $4.2, $6.7, $2.6, $5.1

    mean = (3.5 + 4.2 + 6.7 + 2.6 + 5.1) / 5 = 22.1 / 5 = 4.42

- Generally, for n sample values x_1, x_2, …, x_n, sample mean (denoted by \bar{x}) is given by

    $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

Mean

- Mean is "centre of gravity" of data
  [Figure: values 2.6, 3.5, 4.2, 5.1, 6.7 on a number line, balanced at the mean 4.42]
- Good representative value
- But, sensitive to outliers
  - E.g. 2.6, 3.5, 4.2, 5.1, 15: mean = 30.4 / 5 = 6.08

Centre of Distribution

- Median: midpoint of all values, after they are ordered from the smallest to the largest
  - E.g. Coffee shop data: 2.6, 3.5, 4.2, 5.1, 6.7; Median = 4.2
  - 50% of values above & 50% below median
- For even # of data (n = 2, 4, 6, etc.), median is the mean of the two middle numbers
  - E.g. 2.6, 3.5, 4.2, 5.1; Median = (3.5 + 4.2) / 2 = 3.85

Median

- Median is robust to (less influenced by) extreme values
  - E.g. 2.6, 3.5, 4.2, 5.1, 6.7; Median = 4.2
         2.6, 3.5, 4.2, 5.1, 15;  Median = 4.2
- Prefer median when data have outliers

Mean, Median & Shape

- For symmetric distributions: mean = median
- For skewed distributions: mean ≠ median (mean < median or mean > median)
  - One tail has more extreme observations than other
  [Figure: three distributions labelled Mean < Median, Mean = Median, Mean > Median]

Questions

1. What measure of centre would you use on the following data: 0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0, 22.5, 0.4?
2. Changing the value of a single score in a data set will necessarily cause the mean to change. (T/F)
3. Changing the value of a single score in a data set will necessarily cause the median to change. (T/F)
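The coffee shop examples from the slides on mean and median can be checked with a few lines of code. The following is a minimal sketch, assuming Python's standard `statistics` module as a stand-in for StatCrunch (which the lecture actually uses); the helper name `summarize` and the variable names are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Coffee shop data from the lecture (dollars spent per customer).
coffee = [3.5, 4.2, 6.7, 2.6, 5.1]
# Same data with the largest value replaced by an outlier (15).
with_outlier = [2.6, 3.5, 4.2, 5.1, 15]
# Even number of values: median is the mean of the two middle numbers.
even_n = [2.6, 3.5, 4.2, 5.1]


def summarize(values):
    """Return (mean, median) of the values, each rounded to 2 decimals."""
    return round(mean(values), 2), round(median(values), 2)


print(summarize(coffee))        # (4.42, 4.2)
print(summarize(with_outlier))  # (6.08, 4.2)
print(summarize(even_n))        # (3.85, 3.85)
```

Replacing 6.7 with 15 moves the mean from 4.42 to 6.08 while the median stays at 4.2, which is the sense in which the median is robust to extreme values.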
