Review - Week 1

Read: Part I: Chapters 1-6.

Review Week 1:

It is important to understand your data. We want to know who was measured, what was measured, how and where the data were collected, and when and why the study was performed.

Individuals are the objects described by a set of data. A variable is any characteristic of an individual. A categorical variable places an individual into one of several groups or categories. A quantitative variable takes numerical values for which arithmetic operations make sense. The distribution of a variable tells us what values it takes and how often it takes these values.

Once you have collected your data, you will often want to summarize and describe its important features using graphical methods and numerical summaries. To graphically summarize the distribution of a categorical variable, use bar charts or pie charts. For quantitative variables, use histograms or stem-and-leaf plots. When numerical data are collected over time, use a time plot. When looking at the resulting graphs, be sure to study:

• The shape of the distribution: is it symmetric or skewed?
• The center of the distribution
• The spread of the distribution
• Any unusual values that do not follow the pattern of the rest of the data

A statistic is a numerical summary of data. We are interested in summaries that provide information about the center and spread of the distribution. The mean of a data set is the sum of all the data points divided by the number of data points in the set. The median is the midpoint of a data set: half of the observations lie at or below it and half lie at or above it. The interquartile range, IQR, measures the spread of the middle 50% of the data. The variance, \(s^2\), of a set of data is the average of the squared deviations of the observations from the mean (for a sample, the sum of squared deviations is usually divided by n - 1 rather than n). The standard deviation, \(s\), is the square root of the variance.

Exercises:

Exercise 1: Looking at data

What are the individuals and variables in the data below? Which variables are categorical and which are quantitative?
Name    Gender    Age    Number of siblings
Bob     Male      27     2
Sue     Female    33     1
Bill    Male      21     0
John    Male      56     4

Exercise 2: Thinking about histograms

(a) Suppose you measured the height of all male Columbia students. What do you think the resulting histogram would look like? In particular, think about the shape, center and spread. Sketch what you think the histogram would look like.
(b) Suppose you instead measured the height of all Columbia students. How would the histogram differ from (a)? In particular, think about the shape, center and spread. Sketch what you think the histogram would look like.

Exercise 3: Measures of center and spread

(a) Suppose we have the data set {6, 3, 2, 4, 21}. Calculate the mean and median. Why do they appear to differ so much?
(b) Suppose we have the data set {1, 1, 2, 2, 2, 4, 6}. Calculate the mean and median.
(c) Given the data set {1, 2, 3, 6, 8, 8, 10, 14, 15, 17}, find the median and the IQR.
(d) Given the data set {2, 4, 3, 7}, calculate the standard deviation.

Exercise 4: Measures of center and spread

The five-number summary for the weights (in pounds) of fish caught in a bass tournament is 2.3, 2.8, 3.0, 3.6, 5.2.

(a) Would you describe this distribution as symmetric or skewed?
(b) Would you expect the mean weight of all fish caught to be higher or lower than the median? Explain.
(c) You caught 3 bass weighing 2.3 pounds, 3.9 pounds and 4.9 pounds. Were any of the fish outliers? Explain.
(d) Create a boxplot of these data, assuming that the three biggest fish that were caught weighed 4.7, 4.9 and 5.2 lbs.

Histogram Drill

The histogram below represents the average amount (dollars per student) spent by public schools in each state plus the District of Columbia during the school year 1997-8.

[Figure: histogram of spending. Horizontal axis: spending, with ticks from 4000 to 10000. Vertical axis: Frequency, with ticks from 0 to 10.]

Answer the following questions about the histogram:

• Describe the shape of the histogram.
• What is the total area under the histogram?
• Do you believe that the mean or the median is larger?
Why?
• How many states spend more than $8,000 per student per year?
• How many states spend less than $6,000 per student per year?
• What proportion of states spend more than $9,000 per student per year?
• What proportion of states spend less than $5,000 per student per year?

Summation Notation:

Let us denote the measurements in a data set of quantitative values as \(x_1, x_2, \ldots, x_n\), where \(x_1\) is the first measurement in the data set, \(x_2\) is the second, etc. If we want to calculate the sum of these n numbers, we can simply write \(x_1 + x_2 + \cdots + x_n\). Another way of writing this is

\[ \sum_{i=1}^{n} x_i . \]

Ex. Suppose we have a data set consisting of the values {1, 4, 5, 7}. Here \(x_1 = 1\), \(x_2 = 4\), \(x_3 = 5\) and \(x_4 = 7\).

\[ \sum_{i=1}^{4} x_i = 1 + 4 + 5 + 7 = 17 \]

\[ \sum_{i=1}^{4} 2x_i = 2 \times 1 + 2 \times 4 + 2 \times 5 + 2 \times 7 = 34 \]

\[ \sum_{i=1}^{4} x_i^2 = 1^2 + 4^2 + 5^2 + 7^2 = 1 + 16 + 25 + 49 = 91 \]

It is extremely important to note that, in general,

\[ \sum_{i=1}^{n} x_i^2 \neq \left( \sum_{i=1}^{n} x_i \right)^2 . \]

In the example above, the left side is 91 while the right side is \(17^2 = 289\).

Exercises:

Exercise 1: A data set consists of the values {1, 6, 3, 2, 3}. Calculate:

(a) \(\sum_{i=1}^{5} x_i\)
(b) \(\sum_{i=1}^{5} (x_i - 3)\)
(c) \(\left( \sum_{i=1}^{5} x_i \right)^2\)
(d) \(\sum_{i=1}^{5} x_i^2\)
(e) \(\sum_{i=1}^{5} (x_i - 3)^2\)
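
The sums in the worked example of the Summation Notation section (data set {1, 4, 5, 7}) can be checked with a short Python sketch. It also writes the mean and standard deviation from the Week 1 review as sums. The variable names are illustrative, and the variance here is assumed to divide by n - 1, the usual convention for a sample; it is a sketch for checking hand calculations, not a prescribed method.

```python
import math

# Data from the worked example: x1 = 1, x2 = 4, x3 = 5, x4 = 7.
x = [1, 4, 5, 7]
n = len(x)

total = sum(x)                             # sum of x_i
total_doubled = sum(2 * xi for xi in x)    # sum of 2 * x_i
sum_of_squares = sum(xi ** 2 for xi in x)  # sum of x_i^2
square_of_sum = sum(x) ** 2                # (sum of x_i)^2, a different quantity

# Mean and standard deviation written as sums.
# Assumption: the variance divides by n - 1 (sample convention).
mean = total / n
variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
std_dev = math.sqrt(variance)

print(total, total_doubled, sum_of_squares, square_of_sum)  # 17 34 91 289
print(mean, variance, std_dev)                              # 4.25 6.25 2.5
```

Changing the list `x` to another data set gives the corresponding sums, which is a quick way to check hand calculations after working an exercise on paper.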