Chapter 3 Descriptive Statistics: Numerical Measures

Outline and Definitions

Sample Statistic: a measure computed from a data sample (point estimator of the population parameter)
Population Parameter: a measure computed from the data for a whole population

Example (to be used throughout lecture): Apartment rental prices
20 efficiency apartments were sampled in a small town and analyzed with their corresponding rent prices:

425 430 435 435 440 445 445 445 465 470
475 480 490 500 510 535 550 570 600 615

Section 3.1: Measures of Location

I) Mean
-- average of all the data values
-- mean can be computed from a sample ($\bar{x}$)
   $\bar{x} = \frac{\sum x_i}{n}$, where n = # observations and $\sum x_i = x_1 + x_2 + x_3 + \dots + x_n$
-- mean can be computed from a population ($\mu$)
   $\mu = \frac{\sum x_i}{N}$, where N = # observations and $\sum x_i = x_1 + x_2 + x_3 + \dots + x_N$
-- influenced by large and small values or outliers
Example (Apartment rental prices)

II) Median
-- value in the middle when the data are arranged in ascending order
-- preferred measure when the data have extreme values
For odd # of observations
-- arrange data in ascending order
-- median is the middle value
Example:
For even # of observations
-- median is the avg of the middle two values
Example:
Example (Apartment rental prices)

III) Mode
-- value of the data set that occurs most frequently
-- greatest frequency can occur at 2 or more different values
   Data with 2 modes are called bimodal
   Data with more than 2 modes are called multimodal
-- important measure for qualitative data
Example (Apartment rental prices)

IV] Percentiles
-- provide info about how the data are spread from the smallest to the largest value
-- pth percentile = a value where approximately p percent of the observations are < value and approximately (100 - p) percent of the observations are > value
Steps
1) Arrange data in ascending order
2) Compute the position of the pth percentile (index i): $i = (p/100)\,n$
3) If i is not an integer, round up; the pth percentile is the value in the ith position
   If i is an integer, the pth percentile is the avg of the values in positions i and i + 1
Example (Apartment rental prices)

V] Quartiles
-- quartiles are specific percentiles
   1st quartile = 25th percentile
   2nd quartile = 50th percentile or Median
   3rd quartile = 75th percentile
Example (Apartment rental prices)

Section 3.2: Measures of Variability
-- measures of variability are also known as measures of dispersion
-- used in conjunction with measures of location to understand a business situation
Example

I] Range
-- difference between the largest and smallest data values
-- very sensitive to the smallest and largest data values, so rarely used as an isolated measure of variability
Example (Apartment rental prices)

II] Interquartile Range
-- difference between the 3rd Quartile and the 1st Quartile
-- range for the middle 50% of the data
-- overcomes the influence of extreme values
Example (Apartment rental prices)

III] Variance
-- measure of variability considering all the data (how spread out the data points are around the center of the distribution)
-- based on the difference between the value of each observation ($x_i$) and the mean: $x_i - \bar{x}$ for a sample, $x_i - \mu$ for a population; these differences are known as deviations about the mean
-- variance is computed by squaring each deviation about the mean
-- variance is the average of the squared differences between each data value and the mean
-- since the deviations are squared, the units of the variance are also squared (not very intuitive as a single comparison)
Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Example (Apartment rental prices)

IV] Standard Deviation
-- positive square root of the variance
-- avg amount by which the data are spread about the mean (+/- above and below the mean)
-- measured in the same units as the data, making it more easily interpreted than the variance
-- for a bell-shaped (normal), symmetric distribution, about 68% of the cases lie within one standard deviation of the mean (34% above the mean and 34% below the mean)
-- within the right context, the lower the standard deviation, the more concentrated the values are around the mean
Sample Standard Deviation: $s = \sqrt{s^2}$
Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$
Example (Apartment rental prices)

V] Coefficient of Variation
-- indicates how large the standard deviation is in relation to the mean
Sample Coefficient of Variation: $\frac{s}{\bar{x}} \times 100\%$
Population Coefficient of Variation: $\frac{\sigma}{\mu} \times 100\%$
Example (Apartment rental prices)

Section 3.3: Measures of Distribution Shape, Relative Location and Detecting Outliers

Skewness
-- describes the shape of the distribution
-- numerical measure generated via statistical software
-- varies in degree and is described using terms such as moderately skewed and highly skewed
-- viewed via construction of a histogram
a) Symmetric
-- skewness = 0
-- mean = median = mode
Example
b) Moderately Skewed Left
-- skewness is negative (skewness measure between -1 and 0)
-- characterized as having a tail to the left
-- mean is usually < median
Example
c) Moderately Skewed Right
-- skewness is positive (skewness measure between 0 and 1)
-- mean is usually > median
Example
d) Highly Skewed Right
-- skewness is positive (skewness measure > 1)
-- mean is usually > median
Example

Z Score
-- an observation's z-score is a measure of the relative location of the observation in a data set
-- often called a standardized value: it converts a value into units of standard deviation
-- it denotes the number of standard deviations a data value $x_i$ is from the mean
   $z_i = \frac{x_i - \bar{x}}{s}$
   where $z_i$ = z score for $x_i$, $\bar{x}$ = sample mean, s = sample standard deviation
a) z score < 0: data value < sample mean
b) z score > 0: data value > sample mean
c) z score = 0: data value = sample mean
Example

Chebyshev's Theorem
-- applies for any type of distribution
-- allows us to make statements about the proportion of data values that must be within a specified number of standard deviations of the mean
-- multiplying the proportion by 100 gives the percentage
Theorem: At least $(1 - 1/z^2)$ of the items in any data set will be within +/- z standard deviations of the mean, where z is any value greater than 1.
   Proportion of data values within z = 2 standard deviations of the mean: at least 75%
   Proportion of data values within z = 3 standard deviations of the mean: at least 89%
   Proportion of data values within z = 4 standard deviations of the mean: at least 94%
-- actual outcome may be greater than the min %
Example

Empirical Rule
-- for normal (bell-shaped) distributions:
   68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean
   95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean
   99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean
[Figure: bell curve marking 68.26%, 95.44% and 99.72% of the values within 1, 2 and 3 standard deviations of the mean]

Detecting Outliers
-- an outlier is an unusually small or unusually large value in the data set
-- a data value with z score < -3 or z score > 3 might be considered an outlier

Section 3.5: Measures of Association Between 2 Variables
-- descriptive measures of the relationship between 2 variables

Covariance
-- measure of linear association between 2 variables
-- positive values indicate a positive linear relationship (as x inc, y inc)
   negative values indicate a negative linear relationship (as x inc, y dec)
-- when the points are evenly distributed with no linear pattern, the covariance is close to 0
-- be careful about reading a larger covariance as a stronger linear association when comparing two covariance values (the covariance depends on the units of measurement of x and y)
   E.g., the relationship between height and weight, with height measured in inches or in feet
   o Measuring height in inches rather than feet gives larger $(x_i - \bar{x})$ values, so the numerator is larger and the covariance is larger
   For that reason, we don't normally assess the strength of the linear association using covariance (it is just used to denote a positive or negative relationship)
Sample Covariance: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
   for a sample of size n with observations $(x_1, y_1), (x_2, y_2), \dots$
Population Covariance: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$

Pearson Product Moment Correlation Coefficient
-- can take on values between -1 and +1
-- values near -1 indicate a strong negative linear relationship
   values near +1 indicate a strong positive linear relationship
Sample Correlation Coefficient: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
   where $s_{xy}$ = sample covariance and $s_x$, $s_y$ = sample standard deviations
Population Correlation Coefficient: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
   where $\sigma_{xy}$ = population covariance and $\sigma_x$, $\sigma_y$ = population standard deviations
-- correlation is a measure of linear association, not causation: just because 2 variables are highly correlated does not mean one causes the other
-- scatter plots and correlation coefficients are normally examined together
Example

Section 3.6: Weighted Mean and Working With Grouped Data

Weighted Mean
-- mean is computed by giving each data value a weight that reflects its importance
-- analyst must choose the weight that best reflects the importance of each value
Sample Weighted Mean: $\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
   where $x_i$ = value of observation i and $w_i$ = weight for observation i
Example
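
As a minimal sketch of the weighted mean formula above (sum of w_i * x_i divided by sum of w_i), the Python snippet below may help. The scores and weights are hypothetical illustration values, not data from the lecture, and the function name weighted_mean is likewise an assumption.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data (not from the lecture): three scores weighted 5, 3 and 2.
scores = [80, 90, 70]
weights = [5, 3, 2]
print(weighted_mean(scores, weights))  # (400 + 270 + 140) / 10 = 81.0
```

With equal weights the result reduces to the ordinary mean, which is a quick sanity check on any choice of weights.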
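
The "Example (Apartment rental prices)" placeholders in Sections 3.1 to 3.3 can be checked numerically. The sketch below assumes Python 3.8 or later and its standard statistics module, and applies the measures to the 20 rents listed at the start of the chapter. The helper name percentile is an assumption; it follows the index rule i = (p/100)n from the Percentiles steps rather than any library's convention, so library percentile functions may return slightly different values.

```python
import math
import statistics

# The 20 apartment rents from the running example.
rents = [425, 430, 435, 435, 440, 445, 445, 445, 465, 470,
         475, 480, 490, 500, 510, 535, 550, 570, 600, 615]


def percentile(data, p):
    """p-th percentile via the rule i = (p/100) * n; assumes 0 < p < 100."""
    ordered = sorted(data)
    i = p * len(ordered) / 100
    if i != int(i):
        # Not an integer: round up, take the value in that 1-based position.
        return ordered[math.ceil(i) - 1]
    i = int(i)
    # Integer: average the values in 1-based positions i and i + 1.
    return (ordered[i - 1] + ordered[i]) / 2


# Section 3.1: measures of location
mean = statistics.mean(rents)
median = statistics.median(rents)
modes = statistics.multimode(rents)
q1, q3 = percentile(rents, 25), percentile(rents, 75)

# Section 3.2: measures of variability (sample formulas, divisor n - 1)
data_range = max(rents) - min(rents)
iqr = q3 - q1
variance = statistics.variance(rents)
std_dev = statistics.stdev(rents)
coeff_var = std_dev / mean * 100

# Section 3.3: z-scores and the |z| > 3 outlier screen
z_scores = [(x - mean) / std_dev for x in rents]
outliers = [x for x, z in zip(rents, z_scores) if abs(z) > 3]

print("mean:", mean, "median:", median, "mode(s):", modes)
print("Q1:", q1, "Q3:", q3, "range:", data_range, "IQR:", iqr)
print("variance:", round(variance, 2), "std dev:", round(std_dev, 2))
print("coefficient of variation (%):", round(coeff_var, 2))
print("outliers:", outliers)
```

For this sample the script gives a mean of 488, a median of 472.5, a mode of 445, Q1 = 442.5, Q3 = 522.5, and a sample standard deviation of about 58.02. No rent has a z score beyond +/- 3, so none is flagged as an outlier.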