Chapter 1: Exploring data

Intro: Statistics is the science of data. We begin our study of statistics by mastering the art of examining data. Any set of data contains information about some group of individuals. The information is organized in variables.

Individuals are the objects described by a set of data. Individuals may be people, but they may also be other things.
A variable is any characteristic of an individual. A variable can take different values for different individuals.

When you come across a new set of data, ask yourself the following questions.
Who? What individuals do the data describe? How many individuals appear in the data?
What? How many variables are there? What are the exact definitions of these variables? In what units is each variable recorded?
Why? What is the reason the data were gathered? What conclusions are we looking for?

There are two types of variables, categorical and quantitative.
A categorical variable places an individual into one of several groups or categories.
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.

A variable generally takes values that vary. The pattern of variation of a variable is its distribution. The distribution of a variable tells us what values the variable takes and how often it takes these values.

To analyze data, begin by examining each variable by itself, then move on to study relationships among the variables. Start with graphs of the distributions, then add numerical summaries of specific aspects of the data.

1.1: Displaying Distributions with graphs.

There are several graphs to choose from when displaying data: bar graphs, pie charts, dotplots, stemplots, histograms, and time plots, to name the ones we will use in this section. The purpose of a graph is to help us understand the data. It lets you look for an overall pattern and for striking deviations from that pattern.
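As a small illustration of what "distribution" means for a categorical variable, the sketch below tallies how often each value occurs and what percent of the individuals fall in each category. The data and the helper name `distribution` are made up for this example; they are not from the textbook.

```python
from collections import Counter

# Hypothetical data: one categorical variable (eye color) for 10 individuals.
eye_color = ["brown", "blue", "brown", "green", "brown",
             "blue", "brown", "hazel", "blue", "brown"]

def distribution(values):
    """Return {value: (count, percent)} for a list of observations."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, 100 * count / n) for value, count in counts.items()}

dist = distribution(eye_color)
for value, (count, percent) in sorted(dist.items()):
    print(f"{value:6s} count={count} percent={percent:.0f}%")
```

The counts are what a bar graph would display, and the percents are what a pie chart would display.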
To describe the overall pattern of a distribution, start with the three main descriptors: shape, center, and spread. Next, look for outliers and clusters.

When looking at shape, concentrate on the main features. Look for major peaks, not minor ups and downs. Look for clear outliers, not just the smallest and largest observations. Look for rough symmetry or clear skewness.
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
A distribution is skewed right if the right side of the histogram extends out farther than the left side.
A distribution is skewed left if the left side of the histogram extends out farther than the right side.

Relative frequency, cumulative frequency, percentiles, and ogives (pronounced "O-jive"; an ogive is a relative cumulative frequency graph).
The pth percentile of a distribution is the value such that p percent of the observations fall at or below it. Let's look at a table to see what the other terms mean.

Time plots plot each observation against the time at which it was measured. Time is always on the x-axis. We describe trends in time plots to analyze what is going on over time.

Homework: #’s 1.23 – 1.30

1.2: Describing Distributions with Numbers.

Measuring center: The mean.
A description of a distribution almost always includes a measure of its center or average. The most common measure of center is the arithmetic average, or mean. The mean is written $\bar{x}$ and is calculated by adding all of the observations together and dividing by the number of observations n:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

or, in more compact notation,

$$\bar{x} = \frac{1}{n} \sum x_i$$

An important fact about the mean as a measure of center is that it is sensitive to the influence of a few extreme observations. Because the mean cannot resist the influence of extreme observations, we say that it is not a resistant measure of center.

Measuring center: The median.
The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.
The median is not influenced by extreme observations, so we say that the median is a resistant measure of center.

Comparing the mean and median.
The mean and median of a symmetric distribution are close together. If the distribution is exactly symmetric, then the mean and median are exactly the same. In a skewed distribution the mean is farther out in the long tail than the median is.

Measuring spread or variability: the quartiles.
One way to measure spread is to calculate the range, which is the difference between the largest and smallest observations. The range is not a resistant measure of spread, because it depends only on the two most extreme values. Another way to measure spread is to measure the spread of the middle half of the data. The quartiles mark out the middle half. About 25% of the observations fall at or below the first quartile Q1, about 50% fall at or below the second quartile (the median), and about 75% fall at or below the third quartile Q3. To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median M.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the overall median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the overall median.

The interquartile range (IQR) is the distance between the first and third quartiles.
IQR = Q3 – Q1
If an observation falls between Q1 and Q3, then you know that it is neither unusually high nor unusually low. The IQR is also used to identify outliers.
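The median and quartile steps above can be sketched in Python as follows. This is a minimal sketch that follows the rule stated in these notes (when n is odd, the overall median is left out of both halves); the data values and function names are made up for illustration, and other software may use slightly different quartile conventions.

```python
def median(ordered):
    """Median of a non-empty list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the center observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two center observations

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule from the notes."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations to the left of the median
    upper = ordered[half + n % 2:]    # observations to the right of the median
    return median(lower), median(ordered), median(upper)

# Hypothetical data set with n = 15 observations.
data = [13, 9, 18, 15, 14, 21, 7, 10, 11, 20, 5, 18, 37, 16, 17]
q1, m, q3 = quartiles(data)
iqr = q3 - q1
print(q1, m, q3, iqr)  # prints: 10 15 18 8
```

Here the ordered list has 15 values, so M is the 8th value (15), Q1 is the median of the 7 values below it (10), and Q3 is the median of the 7 values above it (18).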
An observation is an outlier if it falls more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile.

The five number summary and boxplots.
The five number summary of a data set consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation:
Min  Q1  M  Q3  Max
The five number summary offers a reasonably complete description of center and spread. The five number summary of a distribution leads to a new graph, the boxplot. Because boxplots show less detail than histograms or stemplots, they are best used for side-by-side comparison of more than one distribution. A boxplot gives an indication of the symmetry or skewness of a distribution. Because regular boxplots conceal outliers, it is sometimes wise to use a modified boxplot, which plots outliers as isolated points.

Measuring spread: the standard deviation.
The five number summary is not the most common numerical description of a distribution. That distinction belongs to the combination of the mean to measure center and the standard deviation to measure spread. The standard deviation measures spread by looking at how far the observations are from their mean. To calculate the standard deviation we first need the variance $s^2$.
The variance $s^2$ of a set of observations is the average of the squares of the deviations of the observations from their mean, where the average is taken by dividing by n – 1. In symbols, the variance of n observations $x_1, x_2, \dots, x_n$ is

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$

or, more compactly,

$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

The standard deviation s is the square root of the variance $s^2$:

$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$

Properties of the standard deviation:
s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0.
As the observations become more spread out about their mean, s gets larger.
s, like the mean $\bar{x}$, is not resistant. Strong skewness or a few outliers can make s very large.

Choosing measures of center and spread.
Use the five number summary to describe a skewed distribution or a distribution with strong outliers. Use the mean and standard deviation to describe reasonably symmetric distributions that are free from outliers.

Changing the units of measurement.
A linear transformation changes the original variable x into the new variable $x_{new}$ given by an equation of the form

$$x_{new} = a + bx$$

Adding the constant a shifts all values of x upward or downward by the same amount. Multiplying by the positive constant b changes the size of the unit of measurement. Linear transformations do not change the shape of a distribution, but they can change the center and spread. Fortunately, the effects of such changes follow a simple pattern. To see the effect of a linear transformation on measures of center and spread, apply these rules.
Multiplying each observation by a positive number b multiplies both the measures of center (mean and median) and the measures of spread (standard deviation and IQR) by b.
Adding the same number a to each observation adds a to the measures of center and to the quartiles but does not change the measures of spread.

Comparing distributions.
Back-to-back stemplots and side-by-side boxplots are useful for comparing quantitative distributions.

Homework: #’s 1.51 – 1.58

Chapter review
Homework: #’s 1.59 – 1.69
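As a numerical check of the standard deviation formula and the linear transformation rules from section 1.2, here is a minimal Python sketch. The observations and the constants a and b are made up for illustration; `sample_std` is a hypothetical helper that uses the n – 1 divisor described in these notes.

```python
import math
import statistics

def sample_std(values):
    """Standard deviation s with the n - 1 divisor (needs at least 2 values)."""
    n = len(values)
    xbar = sum(values) / n
    variance = sum((x - xbar) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)

# Hypothetical observations: mean 5, median 4.5, variance 24/5 = 4.8.
x = [2, 4, 4, 5, 7, 8]

# Hypothetical linear transformation x_new = a + b*x with b > 0.
a, b = 32, 1.8
x_new = [a + b * xi for xi in x]

mean_x, median_x, s_x = statistics.mean(x), statistics.median(x), sample_std(x)
mean_new, median_new, s_new = (statistics.mean(x_new),
                               statistics.median(x_new),
                               sample_std(x_new))

# Center is shifted by a and scaled by b; spread is only scaled by b.
print(math.isclose(mean_new, a + b * mean_x))      # True
print(math.isclose(median_new, a + b * median_x))  # True
print(math.isclose(s_new, b * s_x))                # True
```

The comparisons use `math.isclose` because floating-point arithmetic can introduce tiny rounding differences. The standard library function `statistics.stdev` also uses the n – 1 divisor, so it should agree with `sample_std` on the same data.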