STP226 Brief Class Notes
Instructor: Ela Jackiewicz

Chapter 3 Descriptive Measures

Measures of Center (Central Tendency)

These measures tell us where the center of our data is, that is, where the most typical value of a data set lies.

Mode: the value that occurs most frequently in the data set.
Obtain the frequency of each value.
1. If the greatest frequency is 1, then there is no mode.
2. If the greatest frequency is 2 or greater, then any value with that greatest frequency is a mode of the data set.
Example: 2, 3, 3, 3, 4, 4, 5. Mode = 3.

Median: divides the bottom 50% of the data from the top 50%.
Arrange the data in increasing order.
1. If the number of observations is odd, the median is the observation exactly in the middle.
2. If the number of observations is even, the median is the mean of the two middle observations.
For n observations, the median is at position $(n+1)/2$ in the ordered data.

Ex. Weight gain in pounds for 6 young lambs: 1, 2, 10, 11, 13, 19.
Position = (6+1)/2 = 3.5 (the median is between observations #3 and #4), Median = (10+11)/2 = 10.5 lb.
If we add one more observation, 10 lb, the data become 1, 2, 10, 10, 11, 13, 19.
Position = (7+1)/2 = 4 (the median is observation #4), Median = 10 lb.

The median is a robust (resistant) measure of center: it is relatively unaffected by changes in a small portion of the data.

Mean (arithmetic mean): the sum of the observations divided by the number of observations.

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where the $x_i$ are the observations in the sample. In our lamb example (n = 6), $\bar{x} = 56/6 \approx 9.33$ lb.

The differences between each data point and the mean, $x_i - \bar{x}$, are called deviations from the mean, and their sum is zero for any data set:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

In our example the sum of all deviations = (-8.33) + (-7.33) + 0.67 + 1.67 + 3.67 + 9.67 = 0 (up to rounding).

The mean can be visualized as the point of balance of a weightless seesaw with the data points (like children) sitting on it.
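The lamb example above can be checked with a few lines of Python. This is a minimal sketch using only the standard-library `statistics` module; the variable names are illustrative and not part of the notes.

```python
import statistics

# Weight gain (lb) for the 6 lambs in the example above
lambs = [1, 2, 10, 11, 13, 19]

median = statistics.median(lambs)        # mean of the two middle observations
mean = statistics.mean(lambs)            # sum of observations / n
deviations = [x - mean for x in lambs]   # x_i - x-bar

print("median:", median)
print("mean:", round(mean, 2))
print("sum of deviations:", round(sum(deviations), 10))

# Adding one more observation (10 lb) makes n odd,
# so the median is the single middle observation.
lambs7 = sorted(lambs + [10])
median7 = statistics.median(lambs7)
print("median with 7 lambs:", median7)

# Mode example from the notes; multimode returns every value
# that attains the greatest frequency.
modes = statistics.multimode([2, 3, 3, 3, 4, 4, 5])
print("mode(s):", modes)
```

Note that `statistics.multimode` returns all values when every frequency is 1, whereas the notes use the convention that such a data set has no mode.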
Unlike the median, the mean is not robust: it is influenced by any change in the data, and very strongly by extreme values. If the data have some extreme values, the median is a better measure of center for that data.

Mean vs Median
right-skewed distribution: Mean > Median
left-skewed distribution: Mean < Median
symmetric distribution: Mean = Median

Measures of Dispersion (Variability)

Range = Maximum - Minimum. It gives the overall spread of the data and is easy to calculate, but it is very sensitive to extreme data values.

Sample Standard Deviation

DEFINITION:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

$s^2$ is (nearly) the average of the squared deviations from the mean; the sum is divided by n-1 rather than n. The square root is taken at the end, so the units of s are the same as the units of the data.

Properties:
$s \ge 0$, and s = 0 only if all data points are the same
s has the same units as your data
larger s indicates more variability

$s^2$ is the sample variance. We will abbreviate standard deviation as SD; s will be used in the formulas.

Ex. Experiment on chrysanthemums: a botanist measured stem elongation over 7 days (in mm).
Data: 76, 72, 65, 70, 82. n = 5, $\bar{x}$ = 365/5 = 73.

x_i      x_i - x̄     (x_i - x̄)^2
76         3            9
72        -1            1
65        -8           64
70        -3            9
82         9           81
total      0          164

$s = \sqrt{164/4} = 6.40$ mm, variance $s^2 = 41$ mm^2.

s gives the typical distance of the observations from the mean; larger s means more variability. Like the mean, s is influenced by extreme data values (it is not a robust measure).

n-1 is the number of degrees of freedom of s. As an intuitive justification for dividing by n-1 rather than n, consider n = 1: a single data point gives no information about variability, so the variability of one observation cannot be computed.

Sample standard deviation, computational formula

x_i      x_i^2
76       5776
72       5184
65       4225
70       4900
82       6724
total:
365     26809

COMPUTATIONAL FORMULA:

$$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$$

$$s = \sqrt{\frac{26809 - \frac{(365)^2}{5}}{4}} = \sqrt{\frac{26809 - 26645}{4}} = \sqrt{\frac{164}{4}} = 6.40 \text{ mm}$$

The more variation there is in a data set, the larger its standard deviation.
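Both formulas for s can be verified on the chrysanthemum data. The following is a minimal Python sketch (standard library only; the variable names are ours, not from the notes) showing that the definition and the computational formula give the same sum of squares, and that both agree with the library routine `statistics.stdev`, which also divides by n-1.

```python
import math
import statistics

# Stem elongation (mm) from the chrysanthemum example
stems = [76, 72, 65, 70, 82]
n = len(stems)
mean = sum(stems) / n

# Definition: sum of squared deviations from the mean
ss_def = sum((x - mean) ** 2 for x in stems)
s_def = math.sqrt(ss_def / (n - 1))

# Computational formula: sum of x^2 minus (sum of x)^2 / n
ss_comp = sum(x * x for x in stems) - sum(stems) ** 2 / n
s_comp = math.sqrt(ss_comp / (n - 1))

# Library version (sample SD, n-1 in the denominator)
s_lib = statistics.stdev(stems)

print("sum of squares:", ss_def, ss_comp)
print("s:", round(s_def, 2), round(s_comp, 2), round(s_lib, 2))
print("variance:", ss_def / (n - 1))
```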
Like the mean, the standard deviation is not robust: it is influenced by any change in the data, and very strongly by extreme values.

Three Standard Deviations Rule: Almost all of the observations in any data set lie within three (3) standard deviations to either side of the mean.

More precise rules for any data set (optional):
Chebychev's rule: at least about 89% of the observations in any data set lie within three standard deviations to either side of the mean.
Chebychev's rule (more precisely): for any data set and any number k > 1, at least $100(1 - 1/k^2)\%$ of the observations lie within k standard deviations to either side of the mean. For k = 3 this gives $100(1 - 1/9)\% \approx 89\%$.
If the distribution is approximately bell-shaped, the Empirical Rule implies that about 99.7% of the observations lie within three standard deviations to either side of the mean. We will talk about "bell-shaped" distributions later.

Typical Percentages: The Empirical Rule

For a "nice" distribution (fairly symmetric, unimodal, no very long or very short tails) we expect to find:
about 68% of all data points within the interval $(\bar{x} - SD,\ \bar{x} + SD)$
about 95% of all data points within the interval $(\bar{x} - 2SD,\ \bar{x} + 2SD)$
more than 99% of all data points within the interval $(\bar{x} - 3SD,\ \bar{x} + 3SD)$

Effect of Transformation of Variables

Sometimes when we work with a data set it is convenient to transform our variable(s). For example, we may want to change units, or to turn very small numbers that appear in scientific notation into something easier to use by multiplying the original data by 10,000.

A linear transformation is the simplest one. Let X be the original variable with mean $\bar{x}$ and SD s. Then $X' = aX + b$ is its linear transformation, and the mean and SD of X' are denoted $\bar{x}'$ and $s'$ respectively.

This type of transformation does not change the essential shape of the distribution of X: the histogram of the transformed variable can be made identical to the original histogram by suitable scaling of the horizontal axis.
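One way to see that a linear transformation with a > 0 leaves the shape unchanged is to check that every observation keeps the same distance from the mean when that distance is measured in SD units. The sketch below is a minimal illustration in Python, assuming we convert the chrysanthemum stem data from mm to cm (a = 0.1, b = 0); this unit change and the helper name `standardized` are our own illustration, not part of the notes.

```python
import statistics

# Stem elongation in mm (chrysanthemum example) and the same data in cm
stems_mm = [76, 72, 65, 70, 82]
a, b = 0.1, 0.0                      # assumed transformation: X' = 0.1 * X + 0
stems_cm = [a * x + b for x in stems_mm]

def standardized(data):
    """Distance of each observation from the mean, in SD units."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - m) / s for x in data]

rel_mm = standardized(stems_mm)
rel_cm = standardized(stems_cm)

print("mean, SD in mm:", statistics.mean(stems_mm), round(statistics.stdev(stems_mm), 2))
print("mean, SD in cm:", round(statistics.mean(stems_cm), 2), round(statistics.stdev(stems_cm), 3))
print("same relative positions:",
      all(abs(p - q) < 1e-9 for p, q in zip(rel_mm, rel_cm)))
```

With a negative multiplier the relative positions change sign, so the histogram becomes a mirror image of the original.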
How Does a Linear Transformation Affect the Mean and SD?

An additive transformation (adding a positive or negative constant b to X) affects only the mean, not s. Multiplying X by a positive or negative constant a affects both the mean and the SD:

$$\bar{x}' = a\bar{x} + b \quad \text{and} \quad s' = |a|\,s$$

Ex. Suppose X = summer temperature in some American city in 2013 in °F, with $\bar{x}$ = 79.6 °F and s = 12.7 °F. To change X to °C, the transformation is

$$X' = (X - 32)\cdot\frac{5}{9} = \frac{5}{9}X - \frac{5}{9}\cdot 32$$

so the new mean is $\bar{x}' = \frac{5}{9}(79.6) - \frac{5}{9}(32) = 26.44$ °C and the new SD is $s' = \frac{5}{9}(12.7) = 7.06$ °C.

Nonlinear transformations, such as

$$X' = \sqrt{X}, \quad X' = \log X, \quad X' = \frac{1}{X}, \quad X' = X^2,$$

can affect data in complex ways, and they do change the essential shape of the frequency distribution. If the distribution is right-skewed, for example, and we wish to make it more symmetric, we can apply the square root transformation to pull in the right-hand tail and push out the left-hand tail. The logarithmic transformation produces an even more drastic change in that regard (check out the histograms given at the end of this section).

The Five-Number Summary; Boxplots

Median, Percentiles, Deciles, Quartiles, and Interquartile Range are all resistant measures.
Percentiles: divide the distribution into 100 equal parts (P1, P2, ..., P99).
P1 divides the bottom 1% of the data from the top 99%.
P2 divides the bottom 2% of the data from the top 98%, etc.
The median is the 50th percentile.

Deciles: divide the distribution into 10 equal parts (D1, D2, ..., D9).
D1 divides the bottom 10% of the data from the top 90%.
D2 divides the bottom 20% of the data from the top 80%, etc.
The median is D5.

Quartiles: divide the distribution into 4 equal parts (Q1, Q2, Q3).
Q1 divides the bottom 25% of the data from the top 75%.
Q2 divides the bottom 50% of the data from the top 50%.
Q3 divides the bottom 75% of the data from the top 25%.
The median is Q2.

To find the quartiles, arrange the data in increasing order.
1. Q1 is the median of the part of the data set that lies at or below the median of the entire data set.
2. Q2 is the median of the entire data set.
3. Q3 is the median of the part of the data set that lies at or above the median of the entire data set.

Examples:
1. n = 7 (odd). Data: 3, 4, 5, 6, 12, 13, 14
   Q1 = (4+5)/2 = 4.5, Q2 = 6, Q3 = (12+13)/2 = 12.5
   When n is odd, your calculator and your book compute quartiles in slightly different ways.
   Your book: the median is included in both the lower and the upper part of the data.
   Your calculator: the median is excluded from both parts, so you will get somewhat different values: Q1 = 4, Q2 = 6, Q3 = 13.
2. n = 10 (even). Data: 1, 3, 4, 5, 6, 12, 13, 14, 15, 18
   Q1 = 4, Q2 = (6+12)/2 = 9, Q3 = 14

Interquartile Range (IQR): the difference between the third and first quartiles.
IQR = Q3 - Q1
The IQR gives (approximately) the range of the middle 50% of the observations.

The five-number summary of a data set consists of the minimum, the quartiles, and the maximum, in increasing order:
Min, Q1, Q2, Q3, Max
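The two quartile conventions for odd n (median included, as in the book, or excluded, as on the calculator) can be compared directly. The following is a minimal Python sketch, not a reference implementation; the function names `quartiles` and `five_number_summary` and the `include_median` flag are our own.

```python
import statistics

def quartiles(data, include_median=True):
    """Return (Q1, Q2, Q3) as medians of the lower and upper parts.

    include_median=True  -> book convention (median counted in both parts)
    include_median=False -> calculator convention (median left out)
    The flag only matters when n is odd.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 0:
        lower, upper = xs[:half], xs[half:]
    elif include_median:
        lower, upper = xs[:half + 1], xs[half:]
    else:
        lower, upper = xs[:half], xs[half + 1:]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

def five_number_summary(data, include_median=True):
    """Return (Min, Q1, Q2, Q3, Max)."""
    q1, q2, q3 = quartiles(data, include_median)
    return min(data), q1, q2, q3, max(data)

data7 = [3, 4, 5, 6, 12, 13, 14]
book = quartiles(data7, include_median=True)
calc = quartiles(data7, include_median=False)
summary = five_number_summary(data7)
iqr = summary[3] - summary[1]

print("book:", book)
print("calculator:", calc)
print("five-number summary:", summary, "IQR:", iqr)
```

Statistical software often uses still other interpolation rules for quartiles, so small differences from both conventions are to be expected.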
Outliers: observations well outside of the overall pattern of the data.
LL = Lower limit = Q1 - 1.5(IQR)
UL = Upper limit = Q3 + 1.5(IQR)
Potential outliers are observations that lie outside of the lower and upper limits.

Boxplot (box-and-whisker diagram) and the modified boxplot

To construct a boxplot:
1. Determine the five-number summary (Min, Q1, Q2, Q3, Max).
2. Draw a horizontal axis on which the numbers obtained in step 1 can be located. Above this axis, mark the quartiles and the minimum and maximum with vertical lines.
3. Connect the quartiles to each other to make a box, and then connect the box to the minimum and maximum with lines.

[Boxplot for Example 1 above (n = 7): Min = 3, Q1 = 4.5, Q2 = 6, Q3 = 12.5, Max = 14]

To construct a modified boxplot:
1. Determine the quartiles.
2. Determine the potential outliers and the adjacent values (the most extreme observations that still lie within the lower and upper limits).
3. Draw a horizontal axis on which the numbers obtained in steps 1 and 2 can be located. Above this axis, mark the quartiles and the adjacent values with vertical lines.
4. Connect the quartiles to each other to make a box, and then connect the box to the adjacent values with lines.
5. Plot each potential outlier with an asterisk.
The two lines stretching out on both sides of the box are the whiskers.

Example. Data represent systolic blood pressure (in mmHg) of 7 adult males:
151 124 132 170 146 124 113
We order the data first: 113 124 124 132 146 151 170
Min = 113, Max = 170, Median = 132, Q1 = 124, Q3 = 151 (the median is excluded when we compute the quartiles).
The boxplot connects all 5 numbers; the box represents the middle half of the data.

[Boxplot of systolic blood pressure on a horizontal axis from 110 to 170 mmHg]

Are there any outliers? In our example:
IQR = 151 - 124 = 27, 1.5(IQR) = 1.5 * 27 = 40.5
lower limit = 124 - 40.5 = 83.5, upper limit = 151 + 40.5 = 191.5
All observations are within the limits, so there are no outliers in our data set.
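The outlier check for the blood pressure data can be sketched in a few lines of Python. This is only an illustration under the quartile values stated above (Q1 = 124, Q3 = 151); the function name `outlier_limits` is our own.

```python
def outlier_limits(q1, q3):
    """Lower and upper limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Systolic blood pressure (mmHg) of 7 adult males
bp = [151, 124, 132, 170, 146, 124, 113]

# Quartiles as given in the example (median excluded)
lower, upper = outlier_limits(124, 151)

# Potential outliers lie outside the limits
outliers = [x for x in bp if x < lower or x > upper]

# Adjacent values: most extreme observations still within the limits;
# these are where the whiskers of a modified boxplot end.
adjacent = (min(x for x in bp if x >= lower),
            max(x for x in bp if x <= upper))

print("limits:", lower, upper)
print("potential outliers:", outliers)
print("adjacent values:", adjacent)
```

Because there are no potential outliers here, the adjacent values coincide with the minimum and maximum, and the modified boxplot looks the same as the ordinary boxplot.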
Example. Radish growth (in mm) in the light:
4 5 5 7 7 8 9 10 10 10 10 14 20 21
Min = 4, Max = 21, Q1 = 7, Median = (9+10)/2 = 9.5, Q3 = 10
IQR = 3, lower limit = 7 - 4.5 = 2.5, upper limit = 10 + 4.5 = 14.5, so 20 and 21 are outliers.
The modified boxplot exposes the outliers.

[Modified boxplot of radish growth on a horizontal axis from 5 to 25 mm; the outliers 20 and 21 are plotted as asterisks]

Descriptive Measures for Populations; Use of Samples

Statistical inference is the process of drawing conclusions about the population based on the observations in the sample.

Notation:
              Size    Mean    SD
Sample        n       x̄       s
Population    N       μ       σ

Parameter: a descriptive measure for a population. Example: μ, σ
Statistic: a descriptive measure for a sample. Example: $\bar{x}$, s
The sample mean, $\bar{x}$, is used to estimate the population mean, μ.
The sample SD, s, is used to estimate the population SD, σ.

Population Mean μ (Mean of a Variable): computed in the same manner as a sample mean.
For a variable x, the mean of all possible observations for the entire population is called the population mean or mean of the variable x. It is denoted by $\mu_x$ or, when no confusion will arise, simply by μ. For a finite population, we have

$$\mu = \frac{\sum x}{N}$$

where N is the population size.

Population Standard Deviation σ (Standard Deviation of the Variable)
For a variable x, the standard deviation of all possible observations for the entire population is called the population standard deviation or standard deviation of the variable x. It is denoted by $\sigma_x$ or, when no confusion will arise, simply by σ. For a finite population, we have

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\frac{\sum x^2}{N} - \mu^2}$$

where N is the population size.

Population Variance: $\sigma^2$

Standardized Variable: for a variable x, the variable

$$z = \frac{x - \mu}{\sigma}$$

is called the standard score or z-score. z is also called the standardized version of x, or the standardized variable corresponding to the variable x.
The value of the z-score tells us how many standard deviations above or below the mean a particular value of x is.
Properties of z-scores:
z < 0 if x is below the mean, z > 0 if x is above the mean, z = 0 if x is equal to the mean.
z-scores have no units.
z-scores have mean = 0 and SD = 1:

$$\mu_z = \frac{\sum z}{N} = 0, \qquad \sigma_z = \sqrt{\frac{\sum (z - \mu_z)^2}{N}} = 1$$

Most of the z-scores are between -3 and 3 (Three Standard Deviations Rule).

Example: Final test scores in all Mat119 classes last semester have mean μ = 72 and SD σ = 10.
a) Jane scored 86 points. Find and interpret her z-score.
   $z = \frac{x - \mu}{\sigma}$, z = (86 - 72)/10 = 1.4
   Her test grade is 1.4 standard deviations above the average.
b) Jack's z-score was -1.0. What was his test score?
   $x = \mu + z\sigma$, x = 72 - 1.0(10) = 62
c) True or false? Very few final test scores are below 42 points or above 102 points.
   True. Most of the scores are within 3 SDs of the mean, that is, in the interval (42, 102), so very few are outside of that range.
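The z-score calculations in the example can be reproduced with a short Python sketch. The helper names `z_score` and `from_z` are our own, and the five-value population `scores` is a made-up illustration (not data from the notes) used only to check that z-scores have mean 0 and SD 1 when the population formulas are used.

```python
import statistics

def z_score(x, mu, sigma):
    """Number of SDs that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Recover the original value from a z-score: x = mu + z * sigma."""
    return mu + z * sigma

mu, sigma = 72, 10                       # Mat119 final test scores

jane_z = z_score(86, mu, sigma)          # part a)
jack_x = from_z(-1.0, mu, sigma)         # part b)
three_sd = (mu - 3 * sigma, mu + 3 * sigma)   # part c)

print("Jane's z-score:", jane_z)
print("Jack's score:", jack_x)
print("3 SD interval:", three_sd)

# Hypothetical small population, treated as the entire population
scores = [62, 70, 72, 74, 82]
m = statistics.mean(scores)
sd = statistics.pstdev(scores)           # population SD (divides by N)
zs = [z_score(x, m, sd) for x in scores]

print("mean of z:", round(statistics.mean(zs), 10))
print("SD of z:", round(statistics.pstdev(zs), 10))
```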