MATH-138 Elementary Statistics
Emily C. Francis
Howard Community College
Unit 1 Lecture Slides

What is Statistics?
Statistics is the science of:
- Collecting data
- Analyzing data
- Drawing conclusions and making decisions as a result of the data analysis. This is referred to as “statistical inference”.

What is a “Statistic”?
- A statistic is a function of the data: Data -> [function] -> statistic
- For example, suppose we have a data set of the heights of students. Taking the average, or mean, of the heights is a function. Thus the mean height is a statistic of the data.

Phases in Statistical Analysis
- Data collection: The process of collecting data (samples) via surveys, observational studies, and/or designed experiments
- Data analysis: Graphing and summarizing key features of the data to discover major patterns in the data
- Statistical inference: Drawing inferences (conclusions) and making decisions based on the data

Population vs. Sample
For a given statistical inquiry:
- The population consists of all items of interest (people, places, companies, etc.)
- A sample is a (hopefully representative and random) subset of the population
- A numerical value/characteristic of a population is called a parameter. Parameters are usually unknown.
- A numerical value/characteristic of a sample is called a statistic

Components of a Data Set
- Cases: people, places, companies, colleges, etc.
- Variables: characteristics/measurements of each individual case

Variable Types
- Categorical variables
  - Have values that are described by words
  - Represent categories
  - Can be represented with numbers (the actual numbers assigned are irrelevant). The numbers have no units, and no mathematical operations can be performed on them.
- Quantitative variables
  - Have numerical values and units

Displaying & Describing Categorical Data
- No mathematical operations can be performed on categorical data
- Categorical data can only be counted and then described/displayed using:
  - Frequency (and relative frequency) tables
  - Bar charts
  - Pie charts

Contingency Tables
- A contingency table shows how cases are distributed along each variable, contingent on the value of another variable
- Marginal and conditional distributions

Displaying & Summarizing Quantitative Data
- A frequency (& relative frequency) distribution is an excellent initial data analysis tool
- A histogram is a visual representation of a frequency distribution. A relative frequency histogram is a visual representation of a relative frequency distribution.
- Dotplot

Describing a Distribution
- Shape
- Center
- Spread

Distribution Shape
- “Modality”
- Symmetry
- Outliers

Measures of Center
- Median
- Mean

Median
- The median of a variable is the midpoint of the sorted data values
- For odd n, the median equals the middle data value
- For even n, the median equals the average of the middle two values
- Is useful when the variable of interest has a skewed distribution and/or has outliers (it is not sensitive to these outliers)
- Does not have to be a data value

Mean
- The mean is the sum of all the data values divided by the number of data values:
  $\bar{y} = \frac{\sum y}{n}$
- Treats all values equally and can therefore be influenced by outliers
- Does not have to be a data value
- Deviations of the data points from the mean always sum to zero
- Is useful when the variable of interest is symmetric with no outliers

Distribution Shape (Contd.)
- Symmetric data:
  - Mean is approximately equal to the median
  - Tails of the distribution are balanced
- Skewed left data:
  - Mean < Median
  - Long tail of distribution “points” left
  - A few low values, but most data on right
- Skewed right data:
  - Median < Mean
  - Long tail of distribution “points” right
  - A few high values, but most data on left

Five-Number Summary
- Max
- Q3
- Median
- Q1
- Min

Measures of Spread
- Range
- Interquartile range (IQR)
- Variance
- Standard deviation

Range & IQR
- The range is the difference between the maximum and minimum data values
- IQR = Q3 - Q1
- The IQR is useful when the variable of interest has a skewed distribution and/or has outliers (it is not sensitive to these outliers)

Variance
- The variance is roughly the average of the squared deviations from the mean (the sum is divided by n - 1 rather than n):
  $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
- The units of this statistic are the squared units of the original data values

Standard Deviation
- The SD is the square root of the variance:
  $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$
- Is a single number that helps us understand how spread out the data is
- Units of measurement are the same as the original data

Standard Deviation (Contd.)
- The standard deviation (and variance) statistics are never negative
- If every data value is equal, then there is no variation, and hence SD = Var = 0
- Is useful when the variable of interest is symmetric with no outliers

Boxplots
- A boxplot is a graphical display of the five-number summary
- The procedure to construct a boxplot can be found on pgs. 90-91 of the text

Standardized Variables: Z Scores
- To “standardize” a variable, calculate each observation’s distance from the mean in units of the standard deviation. That is, define variable z as:
  $z = \frac{y - \bar{y}}{s}$

Normal Models
A normal model:
- Is symmetric and “bell” shaped
- Is commonly used to model many things in the business and physical worlds
- Is defined by 2 parameters, μ (the mean) and σ (the standard deviation)
- Its distribution peaks at μ
- A normal distribution with mean = 0 and std. dev. = 1 is called “standard”

The 68-95-99.7 Rule
For data from a NORMAL model:
- ~68% will lie within 1 std. dev. of the mean
- ~95% will lie within 2 std. dev’s of the mean
- ~99.7% (virtually all the data) will lie within 3 std. dev’s of the mean

Normalcdf & Invnorm
- If you are given a value (or values) and you want a percentage under the normal model, you use “normalcdf” on your calculator:
  normalcdf(left value, right value, mean, std. dev.)
- If you are given a percentage under the normal model and you want a value, you use “invnorm” on your calculator, where the percentage is the area to the left of the value, entered as a decimal:
  invnorm(percentage, mean, std. dev.)

Scatter Plots
- A scatter plot shows n pairs of bivariate data observations on an X-Y graph
- A scatter plot is usually the starting point for bivariate data analysis
- We create scatter plots to investigate the relationship between two variables:
  - Direction
  - Form
  - Strength

Correlation
- In our discussion of correlation (and regression), we will be talking about paired sample data
- A correlation exists between 2 variables when one of them is related to the other in some way
- The linear correlation coefficient, r, measures the strength of the LINEAR relationship between two variables
- Before you calculate r, the following should hold:
  - Quantitative variables condition
  - “Straight Enough” condition
  - Outlier condition

Correlation Properties
- The value of r is always between -1 and 1, inclusive. That is, -1 <= r <= 1.
- The value of r is not affected by which variable is labeled x and which is labeled y
- r measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear.
- Correlation is sensitive to outliers
- Correlation does not imply causality!
- Correlation does not measure slope

Regression
- If 2 variables have a “significant” linear correlation, it is appropriate to estimate their linear relationship; regression does this
- A regression estimates a and b so that the linear relationship between x and y can be expressed as:
  $\hat{y} = ax + b$
- Note that $\hat{y}$ is the PREDICTED value of y. Thus, you can use this equation to predict values of y for given values of x (though not for all values of x; predictions far outside the observed range of x are unreliable)
- The residual for any data point is:
  $y - \hat{y}$

Regression (Cont.)
When predicting a value of y based on some given value of x, do the following:
- If there is NOT a linear correlation, the best predicted y-value is the sample average of y
- If there IS a linear correlation, the best predicted y-value is found using the regression equation
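
The correlation and regression slides can be tied together with a small worked sketch. The slides do not say how r, a and b are computed, so the code below assumes the usual textbook formulas: r as the sum of products of z-scores divided by n - 1, slope a = r * s_y / s_x, and intercept b = mean(y) - a * mean(x). The data values and all function names are made up for illustration, and whether a linear correlation is present is left as a flag the caller supplies, since the slides do not define a test for it.

```python
from math import sqrt


def mean(values):
    """Sum of the data values divided by the number of data values."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation, using the n - 1 divisor from the slides."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def z_scores(values):
    """Standardize each value: distance from the mean in SD units."""
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Assumed formula: r = sum(z_x * z_y) / (n - 1)."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (len(xs) - 1)


def fit_line(xs, ys):
    """Assumed least-squares formulas for y-hat = a*x + b."""
    a = correlation(xs, ys) * sample_sd(ys) / sample_sd(xs)
    b = mean(ys) - a * mean(xs)
    return a, b


def predict(x, xs, ys, linear_correlation):
    """Follow the prediction rule on the last slide.

    linear_correlation is a hypothetical flag supplied by the caller.
    """
    if not linear_correlation:
        return mean(ys)  # no linear correlation: use the sample average of y
    a, b = fit_line(xs, ys)
    return a * x + b


# Made-up paired sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(xs, ys)
a, b = fit_line(xs, ys)
# Residual for each data point: observed y minus predicted y.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

print("r =", round(r, 4))
print("slope a =", round(a, 4), "intercept b =", round(b, 4))
print("residuals:", [round(e, 4) for e in residuals])
```

For this made-up data the fitted line comes out at roughly a = 0.6 and b = 2.2, and the residuals sum to zero (up to rounding), just as deviations from the mean do. Swapping xs and ys leaves r unchanged but gives a different slope, which illustrates why correlation does not measure slope.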