Chapter 2 Exploring Data with Graphs and Numerical Summaries
Learn …
• The Different Types of Data
• The Use of Graphs to Describe Data
• The Numerical Methods of Summarizing Data
Agresti/Franklin Statistics, 1 of 63

Section 2.1 What are the Types of Data?

In Every Statistical Study:
• Questions are posed
• Characteristics are observed

Characteristics are Variables
• A Variable is any characteristic that is recorded for subjects in the study

Variation in Data
• The terminology variable highlights the fact that data values vary.

Example: Students in a Statistics Class
• Variables:
  • Age
  • GPA
  • Major
  • Smoking Status
  • …

Data values are called observations
• Each observation can be:
  • Quantitative
  • Categorical

Categorical Variable
• Each observation belongs to one of a set of categories
• Examples:
  • Gender (Male or Female)
  • Religious Affiliation (Catholic, Jewish, …)
  • Place of residence (Apt, Condo, …)
  • Belief in Life After Death (Yes or No)

Quantitative Variable
• Observations take numerical values
• Examples:
  • Age
  • Number of siblings
  • Annual Income
  • Number of years of education completed

Graphs and Numerical Summaries
• Describe the main features of a variable
• For Quantitative variables: key features are center and spread
• For Categorical variables: key feature is the percentage in each of the categories

Quantitative Variables
• Discrete Quantitative Variables and
• Continuous Quantitative Variables

Discrete
• A quantitative variable is discrete if its possible values form a set of separate numbers such as 0, 1, 2, 3, …

Examples of discrete variables
• Number of pets in a household
• Number of children in a family
• Number of foreign languages spoken

Continuous
• A quantitative variable is continuous if its possible values form an interval

Examples of Continuous Variables
• Height
• Weight
• Age
• Amount of time it takes to complete an assignment

Frequency Table
• A method of organizing data
• Lists all possible values for a variable along with the number of observations for each value

Example: Shark Attacks
• What is the variable?
• Is it categorical or quantitative?
• How is the proportion for Florida calculated?
• How is the % for Florida calculated?

Example: Shark Attacks
• Insights: what the data tells us about shark attacks

Identify the following variable as categorical or quantitative:
Choice of diet (vegetarian or non-vegetarian):
  a. Categorical
  b. Quantitative

Identify the following variable as categorical or quantitative:
Number of people you have known who have been elected to political office:
  a. Categorical
  b. Quantitative

Identify the following variable as discrete or continuous:
The number of people in line at a box office to purchase theater tickets:
  a. Continuous
  b. Discrete

Identify the following variable as discrete or continuous:
The weight of a dog:
  a. Continuous
  b. Discrete

Section 2.2 How Can We Describe Data Using Graphical Summaries?

Graphs for Categorical Data
• Pie Chart: A circle having a “slice of pie” for each category
• Bar Graph: A graph that displays a vertical bar for each category

Example: Sources of Electricity Use in the U.S. and Canada

Pie Chart

Bar Chart

Pie Chart vs. Bar Chart
• Which graph do you prefer?
• Why?

Graphs for Quantitative Data
• Dot Plot: shows a dot for each observation
• Stem-and-Leaf Plot: portrays the individual observations
• Histogram: uses bars to portray the data

Example: Sodium and Sugar Amounts in Cereals

Dotplot for Sodium in Cereals
• Sodium Data: 0 210 260 125 220 290 210 140 220 200 125 170 250 150 170 70 230 200 290 180

Stem-and-Leaf Plot for Sodium in Cereal

Frequency Table

Histogram for Sodium in Cereals

Which Graph?
• Dot-plot and stem-and-leaf plot:
  • More useful for small data sets
  • Data values are retained
• Histogram:
  • More useful for large data sets
  • Most compact display
  • More flexibility in defining intervals

Shape of a Distribution
• Overall pattern
  • Clusters?
  • Outliers?
  • Symmetric?
  • Skewed?
  • Unimodal?
  • Bimodal?

Symmetric or Skewed?

Example: Hours of TV Watching

Identify the minimum and maximum sugar values:
  a. 2 and 14
  b. 1 and 3
  c. 1 and 15
  d. 0 and 16

Consider a data set containing IQ scores for the general public:
What shape would you expect a histogram of this data set to have?
  a. Symmetric
  b. Skewed to the left
  c. Skewed to the right
  d. Bimodal

Consider a data set of the scores of students on a very easy exam in which most score very well but a few score very poorly:
What shape would you expect a histogram of this data set to have?
  a. Symmetric
  b. Skewed to the left
  c. Skewed to the right
  d. Bimodal

Section 2.3 How Can We Describe the Center of Quantitative Data?

Mean
• The sum of the observations divided by the number of observations

  $\bar{x} = \dfrac{\sum x}{n}$

Median
• The midpoint of the observations when they are ordered from the smallest to the largest (or from the largest to the smallest)

Find the mean and median
CO2 Pollution levels in 8 largest nations measured in metric tons per person:
2.3 1.1 19.7 9.8 1.8 1.2 0.7 0.2
  a. Mean = 4.6, Median = 1.5
  b. Mean = 4.6, Median = 5.8
  c. Mean = 1.5, Median = 4.6

Outlier
• An observation that falls well above or below the overall set of data
• The mean can be highly influenced by an outlier
• The median is resistant: not affected by an outlier

Mode
• The value that occurs most frequently.
• The mode is most often used with categorical data

Section 2.4 How Can We Describe the Spread of Quantitative Data?

Measuring Spread: Range
• Range: difference between the largest and smallest observations

Measuring Spread: Standard Deviation
• Creates a measure of variation by summarizing the deviations of each observation from the mean and calculating an adjusted average of these deviations

  $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Empirical Rule
For bell-shaped data sets:
• Approximately 68% of the observations fall within 1 standard deviation of the mean
• Approximately 95% of the observations fall within 2 standard deviations of the mean
• Approximately 100% of the observations fall within 3 standard deviations of the mean

Parameter and Statistic
• A parameter is a numerical summary of the population
• A statistic is a numerical summary of a sample taken from a population

Section 2.5 How Can Measures of Position Describe Spread?

Quartiles
• Splits the data into four parts
• The median is the second quartile, Q2
• The first quartile, Q1, is the median of the lower half of the observations
• The third quartile, Q3, is the median of the upper half of the observations

Example: Find the first and third quartiles
Prices per share of 10 most actively traded stocks on NYSE (rounded to nearest $):
2 4 11 12 13 15 31 31 37 47
  a. Q1 = 2, Q3 = 47
  b. Q1 = 12, Q3 = 31
  c. Q1 = 11, Q3 = 31
  d. Q1 = 11.5, Q3 = 32

Measuring Spread: Interquartile Range
• The interquartile range is the distance between the third quartile and first quartile: IQR = Q3 - Q1

Detecting Potential Outliers
• An observation is a potential outlier if it falls more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile

The Five-Number Summary
• The five number summary of a dataset:
  • Minimum value
  • First Quartile
  • Median
  • Third Quartile
  • Maximum value

Boxplot
• A box is constructed from Q1 to Q3
• A line is drawn inside the box at the median
• A line extends outward from the lower end of the box to the smallest observation that is not a potential outlier
• A line extends outward from the upper end of the box to the largest observation that is not a potential outlier

Boxplot for Sodium Data
• Sodium Data (ordered): 0 70 125 125 140 150 170 170 180 200 200 210 210 220 220 230 250 260 290 290
• Five Number Summary:
  • Min: 0
  • Q1: 145
  • Med: 200
  • Q3: 225
  • Max: 290

Boxplot for Sodium in Cereals
• Sodium Data: 0 210 260 125 220 290 210 140 220 200 125 170 250 150 170 70 230 200 290 180

Z-Score
• The z-score for an observation measures how far an observation is from the mean in standard deviation units

  $z = \dfrac{\text{observation} - \text{mean}}{\text{standard deviation}}$

• An observation in a bell-shaped distribution is a potential outlier if its z-score < -3 or > +3
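
Worked check: sodium and stock data in Python
The summaries above can be reproduced with a short script. This is only a sketch, not part of the original slides: it uses Python's standard `statistics` module, and the function names (`five_number_summary`, `potential_outliers_iqr`, `z_scores`) are made up for illustration. Quartiles follow the median-of-halves rule from the Quartiles slide; how to treat the middle observation when the count is odd is an assumption, since the slides only show even-sized examples.

```python
from statistics import mean, median, stdev


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Assumption: with an odd count, the middle observation is left out
    # of both halves (the slides only cover even-sized data sets).
    upper = ordered[len(ordered) - half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def potential_outliers_iqr(values):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


def z_scores(values):
    """(observation - mean) / standard deviation for every observation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, divides by n - 1
    return [(v - m) / s for v in values]


# Data copied from the slides.
sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]
stocks = [2, 4, 11, 12, 13, 15, 31, 31, 37, 47]

print(five_number_summary(sodium))     # (0, 145.0, 200.0, 225.0, 290)
print(five_number_summary(stocks))     # (2, 11, 14.0, 31, 47)
print(potential_outliers_iqr(sodium))  # [0]
print(mean(sodium))                    # 185.5
print(min(z_scores(sodium)))           # z-score of the cereal with 0 mg sodium
```

• The sodium output matches the five-number summary on the Boxplot for Sodium Data slide.
• The 1.5 × IQR rule flags the 0 mg cereal as a potential outlier, while its z-score stays between -3 and +3, so the two criteria do not always agree.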