Describing Data
Week 1

The W’s (Where do the Numbers come from?)
• Who: Who was measured?
• By Whom: Who did the measuring?
• What: What was measured?
• Where: Where was the data measured?
• When: When was the measurement done?
• HoW: How was the data measured?
• Why: Why was the measurement done?

Always Check the W’s
• Anytime you see data, always check the W’s.
• This will help spot questionable statistics.
• ALWAYS QUESTION DATA

Variables (The What)
• Variables are characteristics that are recorded about each individual.
• Categorical variables are non-numeric in nature.
• Quantitative variables are measurements and have units.

Displaying and Describing Categorical Data

Terms
• Frequency table: Categories and counts
• Distribution: lists the frequencies of each category
• Relative frequency distribution: lists the relative frequencies of each category
• Contingency Table: The frequencies or relative frequencies of 2 variables.

Terms
• Marginal Distribution: the totals found on the margins of the table. The distribution of one of the two variables.
• Conditional distribution: the distribution of one row or column of a contingency table.
• Independence: two variables are independent if the conditional distribution of one variable is the same for every value of the other variable, and so matches that variable’s marginal distribution. (Huh!)

Three Rules of Data Analysis
• First, make a picture!
• First, make a picture!
• First, make a picture!

Why?
• Pictures reveal things tables don’t.
• Patterns can be revealed that are not readily apparent from the numbers.
• Pictures are the easiest way to explain the data to others.

To Make a Graph
• Make piles. Organize the data into like groups.
• Make a frequency table.
• Make a relative frequency table by finding the percentages.

Make a Graph
• Probably a bar chart graphing the frequencies or . . .
• A pie chart to graph the relative frequencies
• Beware of the area principle.
• Stay 2-D

To Make a Graph of Categorical Data
• Think
    Check W’s
    Identify the variables
    Check to see if categories overlap
    Data are counts

To Make a Graph of Categorical Data
• Show
    Select the appropriate graph to compare categories
    Bar Graph for frequencies
    Pie Chart for relative frequencies (percents)
    Stacked bar graph can be used instead of a pie chart

To Make a Graph of Categorical Data
• Tell
    Interpret the results
    Describe the results in the context of the problem
    Answers are sentences, not numbers

Displaying Quantitative Data
More Graphs

Histograms
• Think
    Must be quantitative data
    Want to see the distribution
    Could be counts or percents

Stem and Leaf Plots
• Think
    Must be quantitative data
    Want to see the distribution
    Usually counts
    Relatively small sample size

Stem and Leaf Plot
• Show
    Scale is usually vertical
    Put the ‘Stems’ on the vertical scale
    Stems are usually the data without the last digit
    Might be rounded
    If there are a lot of leaves with one stem, make dual stems and put 0-4 on one and 5-9 on the other
    Plot the ‘leaves’

Dot Plot
• Think
    Must be quantitative data
    Want to see the distribution
    Usually counts
    Relatively small sample size

Dot Plot
• Show
    Scale can be vertical or horizontal
    Place a dot at the appropriate location

Describing the Distribution
• Tell
    Shape
        How many humps?
            • Unimodal
            • Bimodal - maybe more than one group thrown together
            • Multimodal
        Uniform
        Symmetric
        Skewed
        Gaps
        Clusters

Describing the Distribution
• Tell (continued)
    Center
        What is the middle value?
        What is the middle range?

Describing the Distribution
• Tell (continued)
    Spread
        Range = Maximum value - minimum value
        Variation: How much does the data jump around?

Outliers
• Discuss any data points that do not seem to fit the overall pattern.
• Is there a logical explanation for them to be that different?
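
The stem-and-leaf construction and the range formula described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `stem_and_leaf` and the sample scores are made up, the data are assumed to be non-negative whole numbers, and the stem is taken to be the value without its last digit.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a simple stem-and-leaf plot.

    Assumes non-negative whole numbers; the stem is the value
    without its last digit and the leaf is the last digit.
    """
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)

    lines = []
    # Walk every stem from smallest to largest so gaps stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Hypothetical data, for illustration only.
scores = [62, 67, 71, 74, 74, 78, 83, 85, 98]
for line in stem_and_leaf(scores):
    print(line)

# Spread: Range = maximum value - minimum value
print("Range =", max(scores) - min(scores))  # Range = 36
```

Empty stems are printed on purpose, so that gaps in the distribution show up in the plot instead of being skipped.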
Comparing Two Distributions
• Compare the centers of the two distributions
• Compare the shapes of the two distributions
• Compare the spread of the two distributions
• Compare any extreme values (outliers) of the two distributions.

Time Plot
• Think
    Quantitative data
    Looking for trends
• Show
    Time is horizontal scale
    Plot data
    Connect the dots
    Can use calculator

Describing Distributions with Numbers

Measurements of the Center
• Mean: The ‘Average’
    • µ: mean of a population
    • x̄: mean of a sample
    • Unique
• Median: The middle score
    • Sort the data
    • Middle score or the average of the middle two scores
    • Unique

More Center Measures
• Mode: The most common score
    Not necessarily unique
    Does not necessarily exist

Finding Quartiles
• Sort the data
• Find the median
• The 1st quartile (25% mark) is the median of the smaller half of the data
• The 3rd quartile (75% mark) is the median of the larger half of the data

The Five Number Summary
• The minimum data point
• The 1st quartile
• The median
• The 3rd quartile
• The largest data point

InterQuartile Range and Outliers
• Outliers are data points that do not fit the pattern of the distribution.
• The interquartile range (IQR) is the difference 3rd quartile - 1st quartile
• An outlier is a point more than one and a half times the IQR below the 1st quartile or more than one and a half times the IQR above the 3rd quartile

Checking for Outliers
• Find the 5 number summary
• Calculate the Interquartile Range
• IQR = 3rd quartile - 1st quartile
• Lower cut off point = 1st quartile - 1.5(IQR)
• Upper cut off point = 3rd quartile + 1.5(IQR)
• Check for data outside the cut off points

The Normal Model
Density Curves and Normal Distributions

A Density Curve:
• Is always on or above the x axis
• Has an area of exactly 1 between the curve and the x axis
• Describes the overall pattern of a distribution
• The area under the curve above any range of values is the proportion of all the observations that fall in that range.
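
The quartile, five-number-summary, and outlier steps from this section can be sketched in Python as follows. The function names and the sample data are hypothetical. One assumption is worth flagging: the slides do not say whether the median itself belongs to either half when the number of data points is odd, so this sketch leaves it out of both halves. It also assumes at least two data points.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # smaller half
    upper = xs[half + n % 2:]    # larger half (skips the middle value if n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def find_outliers(data):
    """Return the points outside the 1.5 * IQR cut-off points."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr
    upper_cutoff = q3 + 1.5 * iqr
    return [x for x in data if x < lower_cutoff or x > upper_cutoff]

# Hypothetical data, for illustration only.
data = [2, 4, 5, 7, 8, 9, 11, 30]
print(five_number_summary(data))  # (2, 4.5, 7.5, 10.0, 30)
print(find_outliers(data))        # [30]
```

Here IQR = 10 - 4.5 = 5.5, so the cut-off points are 4.5 - 8.25 = -3.75 and 10 + 8.25 = 18.25, and only 30 falls outside them. Calculators and software packages may use slightly different quartile conventions and give slightly different cut-offs.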

Mean vs Median
• The median of a density curve is the equal area point that divides the area under the curve in half
• The mean of a density curve is the center of mass, the point where the curve would balance if it were made of solid material

Normal Curves
• Bell shaped, Symmetric, Single-peaked
• Mean = µ
• Standard deviation = σ
• Notation N(µ, σ)
• The inflection points of the curve lie one standard deviation on either side of µ

68-95-99.7 Rule
• About 68% of the data in a normal curve is within one standard deviation of the mean
• About 95% of the data in a normal curve is within two standard deviations of the mean
• About 99.7% of the data in a normal curve is within three standard deviations of the mean

Why are Normal Distributions Important?
• Good descriptions for many distributions of real data
• Good approximation to the results of many chance outcomes
• Many statistical inference procedures based on normal distributions work well for other roughly symmetric distributions

Standard Normal Curve

Standardizing (z-score)
• If x is from a normal population with mean µ and standard deviation σ, then the standardized value z is the number of standard deviations x is from the mean
• z = (x - µ)/σ
• The unit on z is standard deviations

Standard Normal Distribution
• A normal distribution with µ = 0 and σ = 1, written N(0,1), is called a Standard Normal distribution
• Z-scores are standard normal, where z = (x - µ)/σ

Standard Normal Tables
• Table B (pg 552) in your book gives the percent of the data to the left of the z value.
• Or in your Standard Normal table
• Find the 1st 2 digits of the z value in the left column and move over to the column of the third digit and read off the area.
• To find the cut-off point given the area, find the closest value to the area ‘inside’ the chart.
  The row gives the first 2 digits and the column gives the last digit.

Solving a Normal Proportion
• State the problem in terms of a variable (say x) in the context of the problem
• Draw a picture and locate the required area
• Standardize the variable using z = (x - µ)/σ
• Use the calculator/table and the fact that the total area under the curve = 1 to find the desired area.
• Answer the question.

Finding a Cutoff Given the Area
• State the problem in terms of a variable (say x) and area
• Draw a picture and shade the area
• Use the table to find the z value with the desired area
• Go z standard deviations from the mean in the correct direction.
• Answer the question.

Assessing Normality
• In order to use the previous techniques, the population must be normal
• To assess normality: construct a stem plot or histogram and see if the distribution is unimodal and roughly symmetric around the mean
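
The steps under "Solving a Normal Proportion" and "Finding a Cutoff Given the Area" can also be carried out in software instead of with the table. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later) in place of Table B. The function names and the N(100, 15) population are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

# The standard normal distribution, N(0, 1), stands in for the table.
STANDARD_NORMAL = NormalDist(0, 1)

def proportion_below(x, mu, sigma):
    """Area to the left of x under N(mu, sigma)."""
    z = (x - mu) / sigma              # standardize: z = (x - mu) / sigma
    return STANDARD_NORMAL.cdf(z)     # area to the left of z

def cutoff_for_area(area_left, mu, sigma):
    """Value of x with the given area to its left under N(mu, sigma)."""
    z = STANDARD_NORMAL.inv_cdf(area_left)  # z value with the desired area
    return mu + z * sigma                   # go z standard deviations from the mean

# Hypothetical population, for illustration only.
mu, sigma = 100, 15

# Proportion between 85 and 115 (within one standard deviation of the mean).
within_one_sd = proportion_below(115, mu, sigma) - proportion_below(85, mu, sigma)
print(round(within_one_sd, 3))  # 0.683

# Cut-off with 97.5% of the data below it.
print(round(cutoff_for_area(0.975, mu, sigma), 1))  # 129.4
```

An area to the right of a value is 1 minus the area to its left, since the total area under the curve is 1. The first result agrees with the 68-95-99.7 rule.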