WHAT IS STATISTICS?
Unit 6: A Brief Look at the World of Statistics

WEBSTER’S DEFINITION
statistics
1 numerical data assembled and classified so as to present significant information
2 the science of compiling such data

Another way to think about it: Statistics is the science (and art) of learning from data. Data are numbers with a context.
According to our Math 2 book:
• Statistics are numerical values used to summarize and compare sets of data.

NEW TERMINOLOGY
Individuals are the objects described by a set of data. They may be a single person, an animal, group, or thing.
A variable is any characteristic of an individual.
Categorical vs. Quantitative

DAY 1
CENTRAL TENDENCY & MEASURES OF DISPERSION

EXPLORATORY DATA ANALYSIS
Using statistical tools and ideas to examine data in order to describe their main features.

CENTER: Measures of central tendency
• Mean – the traditional “average” of a data set. This can be found by adding up all of the values and dividing by the number of values.
• Median – this is the value that would be in the middle of the data set if all of the values were written in order.
• Mode – this is the value in a data set that occurs the most frequently.

CENTER: MEAN
• Mean – the traditional “average” of a data set. This can be found by adding up all of the values and dividing by the number of values.
• Example: The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78 72 91 76 97 70 82 75 74 72 84 76
$\bar{x} = \frac{90 + 70 + 96 + \cdots + 76}{22} = \frac{1743}{22} \approx 79.2273$
So the mean is 79.2273 (note the symbol used, $\bar{x}$).

CENTER: MEDIAN
• Median – this is the value that would be in the middle of the data set if all of the values were written in order.
• Example: The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78 72 91 76 97 70 82 75 74 72 84 76
First, put them in order:
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
With 22 numbers, the median will be the average of the two “middle” numbers. In this case, 76 and 78 are the 11th and 12th terms.
Therefore, the median is 77.

CENTER: MODE
• Mode – this is the value in a data set that occurs the most frequently.
• Example: The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78 72 91 76 97 70 82 75 74 72 84 76
Put them in order (this helps detect the mode(s)):
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
We can see that 70 occurs three times, while no other value occurs more than twice. Therefore, 70 is the mode.
Note, this doesn’t tell us very much about the set of data as a whole.
Also note, there can be no mode or multiple modes.

SPREAD: RANGE
• Range – this is a simplistic measure of spread that is calculated as the difference between the greatest and least data values.
• Example: The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78 72 91 76 97 70 82 75 74 72 84 76
First, put them in order:
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
We can see that the lowest number is 53 and the highest is 97. Therefore, the range is 44 (found by 97 - 53).

Tonight’s Assignment
• You will need a SCIENTIFIC CALCULATOR for this unit.
• Homework in Textbook:
– p. 261 #1-13 All
– Find the Min, Mean, Median, Mode, Max & Range for ALL problems
– (You do not need to calculate the Standard Deviation.)

DAY 2
MEASURES OF DISPERSION

SPREAD: Measures of dispersion
• Range – this is a simplistic measure of spread that is calculated as the difference between the greatest and least data values.
• Mean Absolute Deviation – you learned about this measure last year. It is the average of the absolute deviations from the mean.
• Standard Deviation – this is a more complex calculation that is the most commonly used measure of spread in the practice of statistics.
• Interquartile Range (IQR) – this is calculated as the difference between the 3rd and 1st quartiles. It is often used to help identify outliers.

SPREAD: MEAN ABSOLUTE DEVIATION
• Mean Absolute Deviation – you learned about this measure last year.
It is the average of the absolute deviations from the mean.
• Example: The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78 72 91 76 97 70 82 75 74 72 84 76
• Recall, the mean is 79.2273, so to calculate the mean absolute deviation we subtract the mean from each value, take the absolute value, add up all such values, and divide by the number of values.
$\frac{|90 - 79.2273| + |70 - 79.2273| + |96 - 79.2273| + \cdots + |76 - 79.2273|}{22} \approx 8.7025$
• So the mean absolute deviation is 8.7025.

SPREAD: STANDARD DEVIATION
• Standard Deviation – this is a more complex calculation that is the most commonly used measure of spread in the practice of statistics.
• Example: The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78 72 91 76 97 70 82 75 74 72 84 76
• Recall, the mean is 79.2273, so to calculate the standard deviation we subtract the mean from each value, square this value, add up all such values, and divide by the number of values. Then take the square root.
$\sigma = \sqrt{\frac{(90 - 79.2273)^2 + (70 - 79.2273)^2 + (96 - 79.2273)^2 + \cdots + (76 - 79.2273)^2}{22}} \approx 10.4921$

Variance, what is it?
• Just so you are aware, variance = standard deviation squared.
• So, variance = $\sigma^2$ while standard deviation = $\sigma$.
• Of course, that means you can also consider the standard deviation to be the square root of the variance.
• Our book doesn’t directly address variance, but you may see it in some situations.

SPREAD: INTERQUARTILE RANGE (IQR)
• Interquartile Range (IQR) – this is calculated as the difference between the 3rd and 1st quartiles. It is often used to help identify outliers.
• Example: The grades for a quiz are as follows:
90 70 96 92 69 53 70 87 80 89 78 72 91 76 97 70 82 75 74 72 84 76
First, put them in order:
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
Then divide them into 4 equal sets:
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
• Now there are 5.5 (22/4) values in each quarter of the data set.

SPREAD: INTERQUARTILE RANGE (IQR) cont.
53,69,70,70,70,72,72,74,75,76,76,78,80,82,84,87,89,90,91,92,96,97
(Q1, M, and Q3 mark the first quartile, the median, and the third quartile on this ordered list.)
We had already determined that the Median was 77 (avg of 76 & 78). That divided the set into two halves of 11 values each.
To find Q1 and Q3, we simply find the median of the first and second halves.
Seen here, Q1 is 72, the Median is 77 and Q3 is 89.
So, the IQR = Q3 – Q1 = 89 – 72 = 17.
This measure tells us how closely together, or how far spread out, the middle 50% of the data is.

5 Number Summary
• The 5 Number Summary is: Minimum, Q1, Median, Q3, Maximum
• For our example, the minimum (lowest value) was 53 and the maximum (highest value) was 97.
• So our 5 Number Summary for this data set is: 53, 72, 77, 89, 97

OUTLIERS: Deviations from the majority of the data
• When you look at a graph for a set of data, an outlier is typically a visibly different point. It will not “fit” with the rest of the data.
• There are multiple ways to define an outlier.
• In this unit, an outlier is a data point that is more than two standard deviations from the mean.

Outlier Example
• Assume that the mean is 75 and the standard deviation is 11. Two standard deviations is 22, so we would consider anything above 97 (75 + 22) an outlier. Likewise, we would consider anything below 53 (75 – 22) an outlier.
• To determine if there are any outliers, we simply look at the set of data to see if there are any values more than 2 standard deviations away from the mean.

Tonight’s Assignment
• Worksheet 1:
– Mean Absolute Deviation
– Standard Deviation
– Finding Outliers

DAY 3
Exploring Basic One Variable Graphs

DISTRIBUTION?
The distribution of a variable tells us what values it takes and how often it takes these values.

How do we display a distribution?
• bar graph
• pie chart
• histogram
• stem plots
• time plots
• dot plots
• box plot

Why do we graph?
We want to get the overall picture of what is taking place before we start looking at numerical summaries of the data. The graphs will have particular features worth discussing that give us insight into the data.

What graph & when?
Categorical Variables
• bar graph
• pie chart: use with parts of a whole
Quantitative Variables
• histogram: when values are widely spread
• dot plot: when few values are taken
• stem plot: good with small data sets (<100)
• time plot: to display change over time
• box plot: to display the 5 number summary

Categorical variables
• Given categorical variables, we can use bar charts and pie charts to express them in a visual manner.
• Ex. 1 – For the given data, create a bar chart and a pie chart to express them in a clear and visual manner.

Favorite Music Genre    Count (thousands)    Percent
Classical               20                   6.5%
Rock                    100                  32.3%
Country                 40                   12.9%
Alternative             90                   29.0%
Heavy metal             60                   19.3%

BAR CHART
[Bar chart: Favorite Music Genre. Horizontal axis: Genre (Classical, Rock, Country, Alternative, Heavy Metal). Vertical axis: Thousands of People, 0 to 120.]

Pie Chart
[Pie chart: Favorite Music Genre, with one slice each for Classical, Rock, Country, Alternative, and Heavy Metal.]

Graphing Focus
• In this class, we will not be creating graphs for categorical variables.
• We will focus on graphing quantitative variables.
• By virtue of learning how to create these, you should be more comfortable reading these graphs when they are presented to you.

Constructing Histograms
A histogram is used to graph the distribution of a single quantitative variable.
To construct a histogram, first divide the range of the data into classes of equal width. Second, count the number of observations in each class. Finally, draw the histogram, being sure to title and label the graph appropriately.

Interpreting Histograms (and other similar graphs)
• There are really three things for us to consider:
– CENTER
– SPREAD (or dispersion)
– OUTLIERS
• We have already spent some time exploring measures of center and spread.
• We want to also consider outliers.

Tonight’s Assignment
• Worksheet 2: Histograms

DAY 4
Samples & Populations

How can we gather data about a very large group?

Population vs. Sample
A population is a group of people or objects that you want information about.
A sample is a subset of a population.
Population example: The height of 15 year old girls in the U.S.
Sample example: A sample of 15 year old girls in the U.S.

Types of Samples
• Self-selected sample: People volunteer to participate in the group.
• Systematic sample: A rule is used to select members of a population (people or data) to participate in the group.
• Convenience sample: The easiest members of a population are selected (such as people sitting in the 1st row).
• Random sample: Each member of the population has an equal chance of being selected.

What is good & bad about these sample types?

The Goal: Unbiased Sample
An unbiased sample accurately represents the population.
A biased sample may over-represent some members of the population, so it is less likely to represent the entire population accurately.

How do we know if we have a good sample?

Margin of Error
• We can calculate how closely a sample reflects the whole population by using the Margin of Error.
• The Margin of Error gives a limit on how much the sample data will vary from the entire population data.
• It is calculated as: $\pm \frac{1}{\sqrt{n}}$, where $n$ is the sample size.

Lunch Habits
• In a survey of 990 workers, 30% said that they eat lunch at home during a typical work week.
• What is the Margin of Error for the survey?
– Margin of Error = $\pm \frac{1}{\sqrt{990}} \approx \pm 0.032 = \pm 3.2\%$
• What is the interval that is likely to contain the exact percent of all workers who eat lunch at home each week?
– Find the low end and high end of the population range:
30% – 3.2% = 26.8%
30% + 3.2% = 33.2%
– So, between 26.8% and 33.2% of all workers are likely to eat lunch at home each week.

Tonight’s Assignment
• Textbook:
– p. 270 #1-25 ODD
– p. 275 #5-9 ALL
• QUIZ TOMORROW on Central Tendency, Samples & Populations!
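
Checking the worked values
The numbers worked out in this unit (the quiz-grade example and the lunch survey) can be rechecked with a short Python sketch. This is only one possible way to do it, not part of the original slides. The function names `summarize` and `margin_of_error` are made up for illustration. The sketch assumes the same conventions the slides use: the standard deviation divides by the number of values, and Q1 and Q3 are the medians of the lower and upper halves of the ordered data.

```python
import math
import statistics

# Quiz grades from the unit's running example (22 values).
grades = [90, 70, 96, 92, 69, 53, 70, 87, 80, 89, 78,
          72, 91, 76, 97, 70, 82, 75, 74, 72, 84, 76]


def summarize(data):
    """Return the unit's summary measures for a list of numbers.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(data)
    n = len(ordered)
    mean = sum(ordered) / n
    median = statistics.median(ordered)
    # Median-of-halves quartiles, as in the slides. When n is odd,
    # the middle value is left out of both halves.
    lower = ordered[: n // 2]
    upper = ordered[(n + 1) // 2:]
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return {
        "mean": mean,
        "median": median,
        # multimode returns every value tied for most frequent.
        "modes": statistics.multimode(ordered),
        "range": ordered[-1] - ordered[0],
        # Mean absolute deviation: average distance from the mean.
        "mad": sum(abs(x - mean) for x in ordered) / n,
        # Population standard deviation: divides by n, as the slides do.
        "std_dev": statistics.pstdev(ordered),
        "iqr": q3 - q1,
        "five_number": (ordered[0], q1, median, q3, ordered[-1]),
    }


def margin_of_error(n):
    """Approximate margin of error, 1 / sqrt(n), for sample size n."""
    return 1 / math.sqrt(n)


summary = summarize(grades)
moe = margin_of_error(990)  # lunch survey of 990 workers

for name, value in summary.items():
    print(name, value)
print("margin of error", round(moe, 3))
```

A calculator's built-in standard deviation key may divide by n - 1 instead of n, in which case its result for the quiz grades will be slightly larger than the value computed here.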