STP 420 SUMMER 2002

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

PART 1 - DATA

CHAPTER 1 LOOKING AT DATA - DISTRIBUTIONS

Individuals – objects described by a set of data (people, animals, things)
- all the data for one individual make up a case
Variable – any characteristic of an individual (may take different values for different individuals).
Categorical variable – places an individual into one of several groups/categories.
Quantitative variable – takes numerical values for which arithmetic operations (adding/averaging) make sense.
Distribution – tells us what values a variable takes and how often it takes these values.

1.1 Displaying Distributions with Graphs

Exploratory data analysis – use statistical tools (graphs and numerical summaries) and ideas to help examine data and describe their main features
- examine each variable and the relationships among variables
- construct graphs and add numerical summaries

Graphs for categorical variables
Bar graph - the order of the bars is not important
Pie chart - must include all the parts that make up the whole

Measuring the speed of light (Newcomb experiment)
Measurement – depends on the instrument used to make the measurement
- appropriateness of the measurement for the purpose
Variation – differences in measurements may be due to many factors
Distribution - the pattern of variation of a variable
The distribution of a quantitative variable records its numerical values and how often each value occurs.

Stemplot – gives a quick picture of a distribution while including the actual numerical values in the graph
1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. A stem can have 1, 2, or more digits; a leaf has only one digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
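
The three stemplot steps above can be sketched in Python. This is only an illustration, assuming non-negative whole-number observations; the function name `stemplot` and the data are made up for the example and are not part of the course notes.

```python
from collections import defaultdict


def stemplot(values):
    """Return the lines of a stemplot for non-negative integers."""
    leaves_by_stem = defaultdict(list)
    # Step 1: split each observation into stem (all but the last digit)
    # and leaf (the last digit). Sorting first keeps the leaves in order.
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)

    lines = []
    # Step 2: list the stems from smallest to largest. Stems with no
    # leaves are kept so that gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        # Step 3: write the leaves to the right of the stem.
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


# Hypothetical data, made up for illustration
scores = [12, 15, 21, 24, 24, 38, 40, 47]
print("\n".join(stemplot(scores)))
```

Rounding the data before calling the function plays the same role as rounding by hand before drawing a stemplot.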
Back-to-back stemplot – uses one column of stems and two sets of leaves, one on either side of the stems; helps to compare two data sets.
The number of stems can be doubled by splitting each stem in two: one with leaves 0 to 4 and the other with leaves 5 to 9.
It is a good idea to round numbers to only a few digits before making a stemplot (this loses some accuracy in the measurements).

Examining a distribution
1. In any graph of data, look for the overall pattern and for striking deviations from that pattern.
2. The overall pattern of a distribution can be described by its shape, center, and spread.
3. An outlier is an important kind of deviation: a value that falls outside the overall pattern.

Mode(s) – observation(s) that occur most often
- shown by the major peak(s) in the graph
Unimodal – distribution with one major peak
Symmetric distribution – the values smaller and larger than its midpoint are mirror images of each other
Skewed to the right – the right tail (larger values) is longer than the left tail (smaller values)
Skewed to the left – the left tail (smaller values) is longer than the right tail (larger values)

Histogram – breaks the range of values of a variable into intervals (of equal width) and displays only the count (frequency) or percent (relative frequency) of the observations that fall into each interval
Frequency table – table showing the intervals with their respective frequencies/relative frequencies
Roundoff error – may sometimes be significant

Looking at data
A histogram can help to show the shape, center, and spread of a distribution, and any outliers.
Time plot – plots the measurements in the order in which they were observed (over time).
Time series – measurements of a variable taken at regular intervals over time
- examples: economic/social data
Seasonal variation – a pattern in a time series that repeats itself at known regular intervals of time
Trend – persistent long-term rise or fall

Monthly consumer price index for some product
Index number – nationwide average price (less variable than the price at any one store, which may offer special prices from time to time)
Seasonally adjusted – helps to avoid misinterpretation, especially over short periods of time.

Decomposing time series
Statistical software can help to examine a time series by decomposing the data into systematic patterns, such as trends and seasonal variation, and the residuals that remain after these patterns are removed.

1.2 Describing Distributions with Numbers

Measures of center
1. Mean = $\bar{x}$
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum x_i$$
2. Median = M

The median is the midpoint of the distribution, the number such that half the observations are smaller and the other half are larger.
To find the median:
1. Arrange the observations in increasing order.
2. If the number of observations n is odd, the median is the center observation, at position (n+1)/2 in the ordered list.
3. If the number of observations n is even, the median is the mean of the two center observations in the ordered list; the position (n+1)/2 then falls halfway between those two observations.

The mean is affected by extreme observations whereas the median is not. Hence the median is called a resistant measure, and the mean is not resistant.

Measuring spread: Quartiles
Quartiles divide the distribution into 4 equal parts.
To calculate the quartiles:
1. Arrange the observations in increasing order and find the median (the same as Q2, the second quartile). 50% of the observations are to its left.
2. The first quartile (Q1) is the median of the observations to the left of the median. 25% of the observations are to its left.
3. The third quartile (Q3) is the median of the observations to the right of the median. 75% of the observations are to its left.

Percentiles divide the distribution into 100 equal parts.
25th percentile = Q1
50th percentile = Q2 = M
75th percentile = Q3

Range is the highest observation minus the lowest observation.
Interquartile range is the third quartile minus the first quartile: IQR = Q3 - Q1
An observation is a suspected outlier if it falls more than 1.5 × IQR above Q3 or below Q1.
The five-number summary consists of, in this order: Minimum, Q1, M = Q2, Q3, Maximum.

Boxplot – graph of the five-number summary with suspected outliers plotted individually
- useful in comparing distributions
1. A central box spans the quartiles.
2. A line in the box marks the median.
3. Observations more than 1.5 × IQR above Q3 or below Q1 are plotted individually as suspected outliers.
4. Lines extend from the box out to the smallest and largest observations that are not suspected outliers.

The variance $s^2$ of a set of observations is the average of the squares of the deviations of the observations from their mean, where the sum of squares is divided by n - 1 rather than n.
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$$
Hence, the standard deviation is
$$s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$$
where $x_1$ to $x_n$ are the observations and n - 1 is the degrees of freedom.

Properties
1. s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
2. s = 0 only when there is no spread, that is, when all observations have the same value. Otherwise s > 0, and more spread about the mean gives a bigger s.
3. s, like the mean, is not resistant. A few outliers can make s very large.
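
The five-number summary, the 1.5 × IQR rule, and the standard deviation described above can be sketched in Python as follows. The function names and the data are made up for illustration, and the quartiles follow the rule in these notes (the median of each half), which can differ slightly from what statistical software reports. The sketch assumes at least two observations.

```python
from math import sqrt


def median(sorted_xs):
    """Median of an already sorted list, at position (n + 1) / 2."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    # Even n: mean of the two center observations
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2


def five_number_summary(xs):
    """Minimum, Q1, M, Q3, maximum, with quartiles as medians of the halves."""
    s = sorted(xs)
    half = len(s) // 2
    lower = s[:half]             # observations to the left of the median
    upper = s[len(s) - half:]    # observations to the right of the median
    return s[0], median(lower), median(s), median(upper), s[-1]


def suspected_outliers(xs):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]


def standard_deviation(xs):
    """Standard deviation s, dividing the sum of squares by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


# Hypothetical data, made up for illustration
data = [1, 3, 4, 5, 6, 7, 9, 30]
print(five_number_summary(data))   # (1, 3.5, 5.5, 8.0, 30)
print(suspected_outliers(data))    # [30]
print(standard_deviation(data))
```

Removing the value 30 from the data and rerunning the sketch shows how strongly a single outlier affects s while leaving the median almost unchanged.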
A linear transformation changes the original variable x into a new variable
$$x_{new} = a + bx$$
(the equation of a straight line)
- the constant a shifts all the values of x upward or downward by a units
- the positive constant b changes the size of the unit of measurement
- linear transformations do not change the shape of a distribution

Effect of a linear transformation
To see the effects of a linear transformation on measures of center and spread, apply these rules:
1. Multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (interquartile range and standard deviation) by b.
2. Adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles and other percentiles but does not change measures of spread.

1.3 The Normal Distributions

Strategy for exploring data
1. Always plot the data (stemplot or histogram).
2. Look for the overall pattern and for striking deviations (outliers).
3. Calculate a numerical summary to describe center and spread.
4. Draw a smooth curve approximately through the tops of the bars in the histogram.

A density curve is a curve that
1. is always on or above the horizontal axis
2. has area exactly 1 underneath it
It describes the overall pattern of a distribution. The area under the curve and above any range of values is the relative frequency of all observations that fall in that range.

Measuring center and spread for density curves
If the curve is symmetric with a single peak, the mean, median, and mode are all the same x value, the one at the highest peak.

Median and mean of a density curve
1. The median has an area of 0.5 on each side.
2. The mean is the balance point.
3. If skewed to the right, the measures are in the order mode, median, mean (the mean is pulled to the right).
   If skewed to the left, the measures are in the order mean, median, mode (the mean is pulled to the left).

The mean of a population (idealized distribution) is μ.
The standard deviation of a population (idealized distribution) is σ.

The normal curve has equation:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

The 68-95-99.7 rule
In the normal distribution with mean μ and standard deviation σ:
1. 68% of the observations fall within σ of the mean μ
2. 95% of the observations fall within 2σ of the mean μ
3. 99.7% of the observations fall within 3σ of the mean μ

Standardizing observations
If x is an observation from a distribution that has mean μ and standard deviation σ, the standardized value of x is
$$z = \frac{x - \mu}{\sigma}$$
called a z-score.

Standard normal distribution - N(0, 1): mean 0 and standard deviation 1
If the variable X has any normal distribution N(μ, σ) with mean μ and standard deviation σ, then the standardized variable
$$Z = \frac{X - \mu}{\sigma}$$
has a standard normal distribution.

The standard normal table gives the area under the curve to the left of a given z-score. This area is often interpreted as a probability.
Every X variable must be standardized before the standard normal table can be used to compute probabilities.

Normal quantile plot
- a very sensitive way to assess normality, but not easily done by hand
- computer software lets us construct a more accurate plot without taking much time

If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are normal. Systematic deviations from a straight line indicate a nonnormal distribution. Outliers appear as points that are far away from the overall pattern of the plot.

To construct the normal quantile plot:
1. Arrange the observed data values from smallest to largest. Record what percentile of the data each value occupies. E.g. for 20 observations, the first is at the 5% point, the next is at the 10% point, and so on.
2. Find the z-score for each of the percentiles. E.g. z = -1.645 is the 5% point of the standard normal distribution.
3. Plot each data point x against the corresponding z. If the data distribution is close to standard normal, the plotted points will lie close to the 45° line x = z. If the data distribution is close to any normal distribution, the plotted points will lie close to some straight line.

Granularity – when plotted points appear to form horizontal segments in the normal quantile plot.
- This does not hold us back from adopting a normal distribution for the data.
- This could be avoided if the measurements were taken more accurately.
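
The construction steps for the normal quantile plot can be sketched with Python's standard library. One detail here is an assumption rather than part of these notes: the i-th smallest of n observations is placed at percentile (i - 0.5)/n instead of i/n, because the 100% point of the standard normal distribution has no finite z-score. The function name and the data are made up for illustration.

```python
from statistics import NormalDist


def normal_quantile_points(xs):
    """Pair each sorted observation with the z-score of its percentile.

    Assumption: the i-th smallest of n values sits at percentile
    (i - 0.5) / n, which keeps the largest value below 100%.
    """
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    n = len(xs)
    points = []
    for i, x in enumerate(sorted(xs), start=1):
        z = standard_normal.inv_cdf((i - 0.5) / n)   # step 2: percentile -> z
        points.append((z, x))                        # step 3: plot x against z
    return points


# The 5% point of the standard normal distribution quoted in the notes
print(round(NormalDist().inv_cdf(0.05), 3))   # -1.645

# Hypothetical data, made up for illustration
for z, x in normal_quantile_points([4.1, 5.0, 5.2, 5.9, 6.8]):
    print(f"z = {z:6.3f}   x = {x}")
```

Plotting the returned (z, x) pairs with any plotting tool and checking whether they fall near a straight line is the visual check described above.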