STAB22 Statistics I
Lecture 3

Describing Quantitative Data
Quantitative data: variables that measure numerical quantities (for which arithmetic operations make sense)
E.g. breakfast cereal data (calories per serving):
    name          calories
    All-Bran      70
    Apple Jacks   110
    ⁞             ⁞
Describe quantitative data using:
    Plots: histogram, stem-and-leaf
    Numerical summaries: mean, median, range, etc.

Histogram
Create artificial categories, called bins or classes, for the values of a quantitative variable
E.g. calorie classes: 40→60, 60→80, 80→100, ... (classes span all data values and don't overlap)
Pretend the variable is categorical (the bins are the categories) and create a bar plot

Example: Cereal calories
StatCrunch: Graphics > Histogram
[Figure: relative frequency histogram of cereal calories]
Differences from bar plots:
    Categories (classes) are ordered
    No gaps between bars (unless freq. = 0)

Stem-and-Leaf Display
Split each numerical value into two parts (stem and leaf) at some level (0.1's, 1's, 10's, 100's, etc.)
E.g. 42.19, cut at the 10's:
    Stem (leading digits) = 4
    Leaf (trailing digits) = 2.19, or 2
Put the stems in the left column, and list the leaves of every value to the right of their stem
Sort the leaves within each stem from smallest to largest
    Stems   Leaves
    4       12445589
    5       33566
    6       3458

Example: Cereal potassium
StatCrunch: Graphics > Stem and Leaf
A stem-and-leaf display gives information similar to a histogram's
It shows individual data values, but is not practical for very large data sets
E.g. what is the maximum potassium content in the data?
    Variable: potass
    Decimal point is 2 digit(s) to the right of the colon.
    0 : 002233333333444444444
    0 : 55555666666778999999
    1 : 00000001111111222233444
    1 : 667799
    2 : 034
    2 : 68
    3 : 23

Describing Quantitative Data
For any quantitative variable, we want to describe the overall pattern of its distribution. In particular, look at:
    Shape: peaks (modes), symmetry, outliers
    Centre: mean, median
    Spread: range, standard deviation

Shape of Distribution
Modality: check for modes (peaks), i.e.
most frequently occurring values
    Unimodal (1 peak)
    Bimodal (2 peaks)
    Uniform (no peaks)
[Figure: three histograms (frequency vs. variable) showing unimodal, bimodal, and uniform shapes]

Shape of Distribution
Symmetry: a distribution is called symmetric if, when we draw a vertical line down its centre, the two sides are similar in shape and size
[Figure: symmetric distributions (frequency vs. variable)]

Shape of Distribution
Skewness: unimodal distributions with one tail longer than the other are called skewed
    Skewed to the left (negatively skewed)
    Skewed to the right (positively skewed)
[Figure: left-skewed and right-skewed distributions (frequency vs. variable)]

Shape of Distribution
Outliers: any data values that lie far from the main body of the distribution (i.e. extreme values) are called outliers
[Figure: histogram (frequency vs. variable) with outliers marked]

Centre of Distribution
Mean: sum of all values of a quantitative variable divided by the number of values
E.g. amount customers spend at a coffee shop; data: $3.5, $4.2, $6.7, $2.6, $5.1
    $\text{mean} = \frac{3.5 + 4.2 + 6.7 + 2.6 + 5.1}{5} = \frac{22.1}{5} = 4.42$
Generally, for n sample values $x_1, x_2, \ldots, x_n$, the sample mean (denoted by $\bar{x}$) is given by
    $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$

Mean
Mean is the "centre of gravity" of the data
    2.6  3.5  4.2  5.1  6.7    (mean = 4.42)
Good representative value
But sensitive to outliers
    E.g. 2.6  3.5  4.2  5.1  15    (mean = 30.4/5 = 6.08)

Centre of Distribution
Median: midpoint of all values, after they are ordered from smallest to largest
    E.g. coffee shop data: 2.6  3.5  4.2  5.1  6.7    Median = 4.2
50% of values lie above and 50% below the median
For an even number of data values (n = 2, 4, 6, etc.), the median is the mean of the two middle numbers
    E.g. 2.6  3.5  4.2  5.1    Median = $\frac{3.5 + 4.2}{2} = 3.85$

Median
Median is robust to (less influenced by) extreme values
    E.g. 2.6  3.5  4.2  5.1  6.7    Median = 4.2
         2.6  3.5  4.2  5.1  15     Median = 4.2
Prefer the median when the data have outliers

Mean, Median & Shape
For symmetric distributions: mean = median
For skewed distributions: mean < median or mean > median
    One tail has more extreme observations than the other
[Figure: three distributions labelled Mean < Median (skewed left), Mean = Median (symmetric), Mean > Median (skewed right)]

Questions
1. What measure of centre would you use on the following data: 0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0, 22.5, 0.4?
2. Changing the value of a single score in a data set will necessarily cause the mean to change. (T/F)
3. Changing the value of a single score in a data set will necessarily cause the median to change. (T/F)
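
Aside: the mean and median calculations for the coffee shop data can be checked with a few lines of code. The sketch below is not part of the lecture, which uses StatCrunch; it assumes Python and its standard `statistics` module, and the variable names are illustrative only.

```python
from statistics import mean, median

# Coffee shop data from the lecture (dollars spent per customer)
coffee = [3.5, 4.2, 6.7, 2.6, 5.1]

# Same data with the largest value replaced by an outlier (15)
coffee_outlier = [2.6, 3.5, 4.2, 5.1, 15]

# Even number of values: the median is the mean of the two middle values
coffee_even = [2.6, 3.5, 4.2, 5.1]

mean_coffee = mean(coffee)
mean_outlier = mean(coffee_outlier)
median_coffee = median(coffee)          # median() sorts the data internally
median_outlier = median(coffee_outlier)
median_even = median(coffee_even)

# Round for display to hide floating-point noise
print("mean:", round(mean_coffee, 2), "-> with outlier:", round(mean_outlier, 2))
print("median:", median_coffee, "-> with outlier:", median_outlier)
print("median of four values:", round(median_even, 2))
```

The outlier moves the mean from 4.42 to 6.08 while the median stays at 4.2, which is the sense in which the median is robust to extreme values.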