Statistics is a Science of Data
Class Notes

Statistics uses data to ESTIMATE UNKNOWN QUANTITIES, to make decisions, and to develop policies. To draw any sensible conclusions from collected data, we need to SUMMARIZE the data and EXAMINE the patterns that it forms.

Unit #1 “Exploring One-Variable Data” (One-Variable Data Analysis)

“Exploring data” includes the graphical and numerical techniques used to study data. (This topic appears in 8 to 12 out of 40 multiple-choice questions; in the free-response section, it appears in 1 to 2 out of 6 questions.)

Main issues: Who collects data, and what do they do with the collected data?

Data is rarely collected in a form that is immediately useful for decision-making. To use it, it needs to be ORGANIZED and SUMMARIZED. Descriptive methods are useful for data presentation, data reduction, and summarization. The best method depends on the type of data being collected.

Types of Descriptive Methods (all of them complement each other)
- Tabular methods
- Graphical methods
- Numerical methods

Tabular methods
A frequency distribution table facilitates the analysis of patterns of variation among the observed data.
- Frequency of a value (f) is the number of times that observation occurs.
- Relative frequency of a value (rf) is the ratio of the frequency (f) to the total number of observations (n).
- Cumulative frequency (cf) gives the number of observations less than or equal to a specified value.
See example 1, pages 41 to 43, from “Cracking the AP Statistics Exam” Ed 2010.

Graphical methods
1. For QUALITATIVE (categorical) data: bar charts and pie charts. A bar chart is particularly useful. Pie charts are frequently used but are not recommended.
2. For QUANTITATIVE data: dotplots and stemplots are used for small sets of data. For larger sets, histograms, cumulative frequency charts, and boxplots are often employed.
To describe quantitative data, we need to observe the center of the distribution, the spread, and the shape (symmetric, left-skewed, or right-skewed distribution). See page 47. Also look for patterns and for striking (unusual) deviations: clusters and gaps; outliers. (See page 48, from “Cracking the AP Statistics Exam” Ed 2010.)
3. Making a histogram using the entire data set. How do we read a histogram? (See pages 54 to 58, from “Cracking the AP Statistics Exam” Ed 2010.)

Numerical methods for Continuous Variables
There are three types of numerical measures:
- Measures of central tendency: Mean (population mean µ and sample mean x̄) and Median (M).
- Measures of variation (spread): Range, Interquartile Range (IQR), and Standard Deviation. σ denotes a population standard deviation and σ² a population variance; s denotes a sample standard deviation and s² a sample variance. The standard deviation measures the typical distance between the measurements and the mean of the set of data. s = 0 indicates that all of the measurements are identical. A larger s (and consequently, a larger variance) indicates a larger spread among the measurements. See example on page 63.
- Measures of position: Quartiles (divide the data into four equal parts), Percentiles, and standardized scores (z-scores).

Quartiles
- Q1, first quartile: a number such that at most 25% of the values are below it and at most 75% of the values are above it. In other words, roughly 25% of the values are below Q1.
- Q2, second quartile: same as the median. A number such that at most 50% of the values are below it and at most 50% of the values are above it. In other words, roughly 25% of the values are between Q1 and Q2.
- Q3, third quartile: a number such that at most 75% of the values are below it and at most 25% of the values are above it. In other words, roughly 25% of the values are above Q3.
Quartiles can be calculated using STAT/CALC/1-Var Stats on your calculator.
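The numerical measures above can also be checked away from the calculator. The following is a minimal sketch using Python's standard `statistics` module; the data set `scores` is made up for illustration and does not come from the notes or the textbook. Note that software packages use slightly different quartile conventions, so the values may not always match the TI-83 output exactly.

```python
import statistics

# Hypothetical sample of eleven measurements (illustrative only).
scores = [4, 7, 8, 10, 12, 13, 15, 16, 18, 21, 30]

mean = statistics.mean(scores)      # sum of values divided by n
median = statistics.median(scores)  # middle value of the ordered data

s = statistics.stdev(scores)        # sample standard deviation (divides by n - 1)
sigma = statistics.pstdev(scores)   # population standard deviation (divides by n)

# method="exclusive" locates the k-th quartile at position (n + 1) * k / 4
# in the ordered data; this is one common convention among several.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="exclusive")
iqr = q3 - q1

print("mean:", mean, "median:", median)
print("s:", round(s, 3), "sigma:", round(sigma, 3))
print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```

In this made-up sample the single large value (30) pulls the mean above the median, while the quartiles and IQR depend only on the positions of the middle values.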
Percentiles divide the set of data into 100 equal parts. For each variable there are 99 percentiles, denoted by P1, P2, ..., P99. Pk is the kth percentile: the number such that at most k percent of the values are below it and at most (100 - k) percent of the values are above it. Example: the 95th percentile means that roughly 95% of the values are at or below P95 and roughly 5% of the values are above P95.
P25 = Q1
P50 = Q2 = M
P75 = Q3
The location (l) of the kth percentile in the ordered data is l = (n + 1) k / 100.

Z-scores are independent of the units in which the data values are measured. They are useful when comparing observations measured on different scales.
Z-score = (measurement - mean) / standard deviation
- The z-score gives the distance between the measurement and the mean in terms of standard deviations.
- A negative z-score indicates that the measurement is smaller than the mean.
- A positive z-score indicates that the measurement is larger than the mean.
See example 10, page 66, and example 11, page 67. See page 68 for calculator instructions.

BOXPLOTS: To know how to make one by hand and how to read it, see the instructions on page 69.
Effect of changing units on summary measures (see pages 71 and 72).

Summarizing Distributions
- Measuring the Center
- Measuring Spread
- Measuring Position
- Empirical Rule
- Histograms
- Cumulative Frequency
- Boxplots
- Changing Units

Notes
-The presentation of data includes summarizations and descriptions, and involves concepts such as average values, measures of dispersion, positions of various values, and the shape of a distribution (descriptive statistics).
-Median: the middle number of a set of numbers arranged in numerical order. The median is not affected by exactly how large the larger values are or by exactly how small the smaller values are. It is a particularly useful measurement when the outliers are in some way suspicious or when we want to diminish their effects.
-Mean: found by summing the items in a set and dividing by the number of items.
Notation: μ denotes the mean of the whole population and x̄ represents the mean of a sample (a part of a population). Mean = ∑x / n
-Measures of center: Mean and Median. If Mean > Median, we can suppose that the distribution is right-skewed.
-Measures of spread:
  Range = Max - Min = Largest - Smallest
  IQR = Q3 - Q1 (Interquartile Range) = Upper Quartile - Lower Quartile
  Variance
  Standard Deviation
-IQR: the difference between the largest and the smallest values after removing the lower and upper quarters of the data. IQR is the range of the middle 50%, that is, IQR = Q3 - Q1 = 75th percentile minus 25th percentile.
-Variance: determined by averaging the squared differences of all the values from the mean. It is measured in square units.
-Standard Deviation: the square root of the variance. It is measured in the same units as the data.
-Mean and standard deviation are most appropriate for roughly symmetric data without outliers.
-IQR is a reasonable summary of spread (it uses only two quartiles and ignores information about how individual values vary).
-Outliers: data away from the body of the distribution (extreme values / unusual values).

Notes about Outliers:
- Can be the most informative part of the data
- Can be an error (then fix it if you can)
- Represent very important information to mention in your comments
- Affect the mean and the variance
- Median and IQR are not sensitive to outliers
- Do not delete outliers; think about them

Comparing Distributions
- Dotplots
- Double Bar Charts
- Back-to-back Stemplots
- Parallel Boxplots
- Cumulative Frequency Plots
Many real-life applications of statistics involve comparisons of TWO populations. We should portray both sets simultaneously. (See EXAMPLES I, II, III, IV, and V, pages 74, 75, 76, and 77, from “Cracking the AP Statistics Exam” Ed 2010.)
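To make the Unit #1 formulas for z-scores and percentile locations concrete, and to illustrate the note that the median is not sensitive to outliers, here is a small sketch in Python. The function names `z_score` and `percentile_location` and all of the numbers are hypothetical, chosen only for illustration.

```python
import statistics

def z_score(measurement, mean, sd):
    """Distance between a measurement and the mean, in standard deviations."""
    return (measurement - mean) / sd

def percentile_location(n, k):
    """Location l = (n + 1) * k / 100 of the k-th percentile in ordered data."""
    return (n + 1) * k / 100

# Comparing two observations measured on different scales (made-up values):
z_test_a = z_score(85, mean=70, sd=10)     # 1.5 standard deviations above the mean
z_test_b = z_score(600, mean=500, sd=100)  # 1.0 standard deviation above the mean

# With n = 19 ordered values, P25 (= Q1) sits at position 5.
loc_p25 = percentile_location(19, 25)

# Effect of one outlier on the mean versus the median:
clean = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(z_test_a, z_test_b, loc_p25, mean_shift, median_shift)
```

Although 600 is a much larger raw number than 85, the first score is relatively better because its z-score is larger. Replacing 6 with 60 moves the mean a long way but leaves the median unchanged.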
Unit #2 “Exploring Relationships between TWO Variables” (BIVARIATE DATA)
(To investigate the relationship between two quantitative variables)

Main topics
- Scatterplots (shape, direction, and strength of relationship)
- Correlation and Linearity
- Least Squares Regression Line
- Residual Plots
- Outliers and Influential Points
- Transformations to Achieve Linearity

Notes
Our studies so far have been concerned with measurements of a SINGLE VARIABLE. However, many statistical applications involve two or more variables related to one another.
First question: How can the strength of an apparent relationship be measured?
Second question: How can an observed relationship be put into functional terms?

1. Scatterplot and Correlation Coefficient
If two different variables have a LINEAR RELATION, then we can measure the strength of that relationship with the correlation coefficient and describe it with a LINEAR REGRESSION.
How do we SUMMARIZE the relationship between two variables? There are two commonly used measures:
- A SCATTERPLOT, a GRAPHICAL SUMMARY MEASURE
- The CORRELATION COEFFICIENT (r), a NUMERICAL SUMMARY MEASURE

A SCATTERPLOT is used to describe the nature, degree, and direction of the relation between two variables X and Y; (x, y) gives an ordered pair. A SCATTERPLOT can tell us:
- Shape: Is the relation between X and Y linear or nonlinear?
- Direction: Whether the y-value increases (positive relation) or decreases (negative relation) as x increases.
- Strength of relationship: If the trend of the data can be described with a line or a curve, then the spread of the data values around the line or curve describes the degree (or strength) of the relation:
  o Points very close to the line or curve: a strong relationship between the two variables.
  o More scattered points: a weaker relationship between the two variables.
  o Points scattered without any apparent pattern: no relationship between the two variables.

Correlation Coefficient (r) is a numeric measure of the degree and direction of the LINEAR RELATION between two quantitative variables.
-1 ≤ r ≤ 1 always
- Direction: the positive or negative sign of the correlation describes the direction of the relation between the two variables.
- Strength: the numeric value of the correlation describes the strength (or degree) of the linear relation between the two variables:
  o +1 describes a perfect positive correlation.
  o -1 describes a perfect negative correlation.
  o A value close to -1 or +1 describes a strong linear relationship between the two variables.
How to compute the correlation by using a calculator: page 84. How to make a scatterplot. How to get summary statistics for bivariate data.

Least Squares Regression Line
After we know that two variables are related to each other, we need to quantify (or estimate) the relation between them. A Linear Regression Equation or Linear Regression Model is an equation that gives a straight-line relationship between two variables:
Y = β0 + β1 X + e
where
- Y is the dependent variable or response variable.
- X is the independent variable or explanatory variable.
- β0 is the y-intercept. It is the value of Y for X = 0.
- β1 is the slope of the line. It gives the amount of change in Y for every unit change in X.
- e is the random error or RESIDUAL. It is the difference between the observed and predicted values.
The predicted value of Y for a given value of X is denoted by ŷ (read as “y-hat”). It is computed by using the estimated regression line ŷ = a + bx, where a is the estimate of β0 (the estimated y-intercept of the regression line) and b is the estimate of β1 (the estimated slope of the regression line); e (error or residual) = (y - ŷ).
The least squares regression line is the line that minimizes the error sum of squares (SSE). It is also known as the line of best fit.
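As a cross-check on the calculator, the correlation coefficient and the least squares line can be computed by hand from sums of squares. The sketch below assumes the usual definitions SSxx = ∑(x - x̄)², SSyy = ∑(y - ȳ)², and SSxy = ∑(x - x̄)(y - ȳ); the function name `least_squares` and the five data points are hypothetical, used only for illustration.

```python
import math

def least_squares(xs, ys):
    """Return (a, b, r) for the fitted line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = ss_xy / ss_xx                      # estimated slope
    a = y_bar - b * x_bar                  # estimated y-intercept
    r = ss_xy / math.sqrt(ss_xx * ss_yy)   # correlation coefficient
    return a, b, r

# Made-up bivariate data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = least_squares(xs, ys)
predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # e = y - y-hat
sse = sum(e ** 2 for e in residuals)                        # error sum of squares

print("a =", round(a, 4), "b =", round(b, 4), "r =", round(r, 4))
print("residuals:", [round(e, 4) for e in residuals], "SSE =", round(sse, 4))
```

For a least squares line with an intercept, the residuals always sum to zero (up to rounding), which is a quick sanity check on any hand computation.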
The least squares estimates are b = SSxy / SSxx and a = Ȳ - b X̄, where SSxy = ∑(X - X̄)(Y - Ȳ) and SSxx = ∑(X - X̄)².

Using TI-83

To make a scatterplot:
- Choose STAT - EDIT
- Enter X-values into L1
- Enter corresponding Y-values into L2
- Choose 2nd - STAT PLOT - 1:Plot1 and turn Plot1 On
- Choose scatterplot by highlighting the picture of a scatterplot under Type
- Enter Xlist: L1
- Enter Ylist: L2
- Select the type of marking from the 3 available marks
- Choose ZOOM - 9:ZoomStat
- Use the TRACE option to trace points in the scatterplot

To compute the correlation coefficient (r):
- Enter X-values into L1
- Enter corresponding Y-values into L2
- Choose 2nd - CATALOG - DiagnosticOn
- Choose STAT - CALC - 8:LinReg(a+bx)
- Enter L1, L2 (don’t forget the comma). This will result in LinReg(a+bx) L1, L2
- Press ENTER
This option will show a, b, r², and r for the regression line y = a + bx.

To get summary statistics for bivariate data:
- Enter X-values into L1
- Enter corresponding Y-values into L2
- Select STAT - CALC - 2:2-Var Stats
- Enter L1, L2. This will result in 2-Var Stats L1, L2
- Press ENTER
This option will provide X̄, ∑X, ∑X², Sx, σx, n, Ȳ, ∑Y, ∑Y², Sy, σy, ∑XY, min X, max X, min Y, max Y.

To compute least squares estimates of the slope (b), y-intercept (a), and coefficient of determination (r²) for the line y = a + bx:
- Enter X-values into L1
- Enter corresponding Y-values into L2
- Choose STAT - CALC - 8:LinReg(a+bx)
- Enter L1, L2 (don’t forget the comma). This will result in LinReg(a+bx) L1, L2
- Press ENTER
This option will show a, b, r², and r.

To get a regression line superimposed on a scatterplot:
- Enter X-values into L1
- Enter corresponding Y-values into L2
- Choose STAT - CALC - 8:LinReg(a+bx)
- Enter L1, L2, Y1. This will display LinReg(a+bx) L1, L2, Y1. To get Y1, choose VARS - Y-VARS - 1:Function - 1:Y1
- Press ENTER
- Press GRAPH

To get a RESIDUAL plot:
- Enter X-values into L1
- Enter corresponding Y-values into L2
- Choose STAT - CALC - 8:LinReg(a+bx)
- Enter L1, L2, Y1. This will display LinReg(a+bx) L1, L2, Y1. To get Y1, use the following sequence of commands:
  Choose VARS - Y-VARS - 1:Function - 1:Y1
- Press ENTER. This will store the regression equation in Y1.
- On the TI-83, create a new column and title it RESDL. This will allow you to define the column.
- Enter L2 - Y1(L1). This command will fill the column titled RESDL with residuals.
- Press ENTER to compute the residuals for the estimated line above.

To plot the computed residuals:
- Choose Y=
- In this screen, deselect the regression equation \Y1 by moving the cursor over the = symbol and pressing ENTER
- In this screen, set \Y2 = 0
- Choose 2nd - STAT PLOT - 1:Plot1. Turn Plot1 Off by highlighting OFF and pressing ENTER.
- Choose 2nd - STAT PLOT - 2:Plot2. Turn Plot2 On by highlighting ON and pressing ENTER.
- Choose scatterplot by highlighting the picture of a scatterplot under Type and pressing ENTER
- Enter Xlist: L1
- Enter Ylist: RESDL
- Select the type of marking from the three available marks
- Choose ZOOM - 9:ZoomStat
- Use the TRACE option to trace points in the scatterplot

Transformations to achieve linearity
Using nonlinear models. Here are some examples of transformations used to linearize the regression model:
- The log transformation (Z = ln(Y)) is used to linearize the regression model when the relationship between X and Y suggests a model with a consistently increasing slope.
- The square root transformation (Z = √Y = Y^(1/2)) is used when the spread of observations increases with the mean.
- The reciprocal transformation (Z = 1/Y) is used to minimize the effect of large values of X.
- The square transformation (Z = Y²) is used when the slope of the relation consistently decreases as the independent variable increases.
- The power transformation (ln(Y) and ln(X)) is used if the relation between the dependent and independent variables is modeled by Y = aX^b.

Exploring Categorical Data: Frequency Tables
Categorical data is data classified into different categories.
Example 1: Students might be classified by gender (male or female) and by their student status (freshman, sophomore, junior, or senior).
Example 2: Suppose 200 students were classified by gender and academic major. The data resulted in a 2 x 4 contingency table, shown in Table 10.

                      Academic Major
Gender     Arts   Sciences   Engineering   Education
Female      35       15           5           25
Male        10       40          50           20

The table shows that:
- 35 female students are majoring in arts.
- 10 male students are majoring in arts.
- 15 female students are majoring in sciences.
The figures above are joint frequencies of the respective categories. The joint frequency of two categories is the frequency with which two categories, one from each of the two classification criteria, occur together.

                           Academic Major
Gender          Arts   Sciences   Engineering   Education   Row Totals
Female           35       15           5           25           80
Male             10       40          50           20          120
Column Totals    45       55          55           45          200

This table shows that:
- There are 80 female students in this study.
- There are 120 male students in this study.
- Out of the 200 students in this study, 45 are majoring in arts subjects.
- Out of the 200 students in this study, 55 are majoring in science subjects.
- Out of the 200 students in this study, 55 are majoring in engineering subjects.
- Out of the 200 students in this study, 45 are majoring in education subjects.

From the above contingency table we can see that:
- Among female students, 35/80 or 43.75 percent are majoring in arts subjects. In other words, the conditional percentage of arts majors among female students is 43.75 percent.
- Among male students, 50/120 or 41.67 percent are majoring in engineering subjects. In other words, the conditional percentage of engineering majors among male students is 41.67 percent.
- Among science majors, 15/55 or 27.27 percent are females. In other words, the conditional percentage of female students among the science majors is 27.27 percent.
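The marginal totals and conditional percentages in Example 2 can be reproduced with a few lines of Python. This is only a sketch: the joint frequencies are the ones given in the example, while the dictionary layout and the helper names `pct_major_given_gender` and `pct_gender_given_major` are choices made here for illustration.

```python
# Joint frequencies from Example 2 (gender by academic major).
table = {
    "Female": {"Arts": 35, "Sciences": 15, "Engineering": 5, "Education": 25},
    "Male": {"Arts": 10, "Sciences": 40, "Engineering": 50, "Education": 20},
}

majors = list(table["Female"])

# Marginal totals.
row_totals = {gender: sum(counts.values()) for gender, counts in table.items()}
col_totals = {major: sum(table[g][major] for g in table) for major in majors}
grand_total = sum(row_totals.values())

def pct_major_given_gender(major, gender):
    """Conditional percentage of a major among students of one gender."""
    return 100 * table[gender][major] / row_totals[gender]

def pct_gender_given_major(gender, major):
    """Conditional percentage of a gender among students in one major."""
    return 100 * table[gender][major] / col_totals[major]

print(row_totals, col_totals, grand_total)
print(round(pct_major_given_gender("Arts", "Female"), 2))       # arts among females
print(round(pct_major_given_gender("Engineering", "Male"), 2))  # engineering among males
print(round(pct_gender_given_major("Female", "Sciences"), 2))   # females among science majors
```

The key point is the choice of denominator: a row total when conditioning on gender, a column total when conditioning on major.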