A STATISTICAL ANALYSIS OF STUDENT ATHLETES AT STETSON UNIVERSITY

By APRIL COATES

A SENIOR RESEARCH PAPER PRESENTED TO THE DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE OF STETSON UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF SCIENCE

STETSON UNIVERSITY
2005

ACKNOWLEDGMENTS

I would like to start off by acknowledging Dr. Erich Friedman. If not for him, I would not have a project to present. Thank you for allowing me to investigate what started as just a casual lunch conversation. I would also like to thank Dr. John Tichenor and Patti Sanders for all of their assistance in the collection of data. I would like to give special acknowledgment to Dr. Will Miles for his assistance throughout the semester. Thank you for your continuous support through computer crashes and frantic mental breakdowns. To my mom and brother, thank you for your constant love and support. Your confidence in me is what got me through this project. Lastly, I would like to say thank you to my second family, all of the professors of the Stetson University Math Department. Thank you all for pushing me beyond what I thought were my limits.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTERS
1. BACKGROUND
   1.1. Stetson University Athletic Department Mission Statement
   1.2. Financial Eligibility
   1.3. Data Collection
2. REGRESSION MODELS
   2.1. Two Variable Regression
      2.1.1. Principle of Least Squares
      2.1.2. Variable Interaction
      2.1.3. Residual Analysis
   2.2. Multiple Regression
      2.2.1. General Additive Multiple Regression Model
      2.2.2. First-Order Model
      2.2.3. Second-Order No-Interaction Model
      2.2.4. First-Order Predictors and Interaction
      2.2.5. Complete Second-Order Model
   2.3. Categorical Variables
      2.3.1. Dichotomous Variables
      2.3.2. Multi-Category Variables
3. DISTRIBUTIONS
   3.1. Normal Distributions
   3.2. Determining Underlying Distributions
      3.2.1. Hypothesis Testing and Significance Level
      3.2.2. Chi-Square Distribution
      3.2.3. Goodness-of-Fit Test
4. PRELIMINARY ANALYSIS
   4.1. Linear Regression
   4.2. Residual Plots
   4.3. Goodness-of-Fit Test
5. CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

TABLE
1. Cost of Attendance
2. Hypothesis Testing for One Proportion
3. Hypothesis Testing for Two Proportions
4. Frequency Table
5. Expected Values

LIST OF FIGURES

FIGURE
1. Sample of Linear Model
2. Residual Plot
3. First-Order Model
4. Second-Order No-Interaction Model
5. First-Order Predictors and Interaction Model
6. Complete Second-Order Model
7. Categorical (no interaction)
8. Categorical (interaction)
9. Normal Distributions
10. Standard Normal Distributions
11. Normal Curve Probabilities
12. Chi-Squared Distribution
13. Baseball Percentage vs. GPA
14. Baseball Regression Models
15. Baseball Residual Plots

ABSTRACT

A STATISTICAL ANALYSIS OF STUDENT ATHLETES AT STETSON UNIVERSITY
By April Coates
May 2006
Advisor: Dr. Will Miles
Department: Mathematics and Computer Science

Student-athletes have very different roles in the eyes of society. Some people feel that athletics are the top priority while academics come second. Others believe that athletes are students first and that athletics are merely an extracurricular activity.
The Stetson University Athletic Department strives for excellence in the classroom and on the playing field. However, are these student-athletes succeeding in the classroom? I approached the registrar, Dr. John Tichenor, to see what data would be available for an analysis of student-athletes over the past seven years. Prior to receiving the records, I began research on the methods needed to analyze the data: two variable regression, multivariable regression, residuals, categorical variables, and the goodness-of-fit test. Does the amount of athletic scholarship granted to an athlete correlate with academic performance? Do athletes tend to shy away from the more demanding majors? Are there sports that place a higher standard on academics? Are males and females really on the same playing field? I hope to answer these and other questions that arise next semester, when I can analyze the complete data set.

CHAPTER 1 BACKGROUND

Stetson University has been housing Division One athletes since it joined the Atlantic Sun Conference in 1985 [7]. Currently, Stetson students may participate in nine varsity sports: basketball, volleyball, crew, golf, tennis, soccer, cross country, baseball and softball. The NCAA requires that every athlete meet specific standards in order to remain eligible. For example, student-athletes must be full-time students (twelve credit hours) and must maintain a minimum grade point average of 2.00 [7].

1.1 STETSON UNIVERSITY ATHLETIC DEPARTMENT MISSION STATEMENT

Stetson University strives for individual student-athletes to achieve excellence both in the classroom and on the playing field. The university places great emphasis on student-athletes having a well-rounded college experience.
The following statement summarizes the Stetson University Athletic Department's ideal college experience:

The Stetson University Athletic Department strives to provide students with a sound educational experience through a holistic and collaborative athletic program that allows students to develop intellectually, spiritually, socially, and physically. Excellence is pursued through participation in a successful Division I, NCAA program, superior coaching, interaction among coaches, faculty, students, and staff, and a diversity of student-athlete activities based on a liberal-arts education. Students develop leadership through sport participation and community activities. In unison with the University Mission, the Athletic program helps students pursue truth by actively recruiting and providing a diverse and caring environment that values and commits to the rights and fair treatment of all people regardless of race, religion, or gender. The Athletic Program (Department) encourages its student athletes to be morally sensitive and contributing citizens through active forms of social responsibility [7].

1.2. FINANCIAL ELIGIBILITY

As at other colleges and universities, student-athletes at Stetson University may be awarded scholarships for their athletic ability. These scholarships can range from a small stipend to aid covering tuition, room, board and books. However, these scholarships and grants are awarded for only a one-year period. Upon the completion of an academic year, each athlete's athletic abilities and eligibility will be considered when deciding whether or not the grants will be renewed. Stetson University has the right to withdraw the student-athlete's scholarship if he or she has willingly withdrawn from the sport, is no longer eligible to compete, or is guilty of serious misconduct [7].

1.3. DATA COLLECTION

For the results of this project to be statistically meaningful, the data set should be as large as possible.
Thus, the data collected spans the past seven years, from 1997 to 2004, and includes over five hundred student-athletes. The data set includes variables such as sport, sex, race, home state, SAT scores, declared major, Stetson University cumulative grade point average as of the end of the fall of his or her sophomore year, Stetson University athletic scholarships, and non-athletic Stetson University scholarships granted.

When the data was collected, a few concerns were raised. While we wanted the data set to be as large as possible, we could not include all athletes over the available time span as originally planned. If we had, student-athletes would more than likely have been included in the data set more than once. For instance, suppose there is a student who played soccer for each of the four years he attended Stetson University. This particular soccer player would be included in the data set four times. In order to avoid counting student-athletes more than once, it was decided to use only one class each academic year.

The original thought was to use all freshman student-athletes; however, variables such as grade point average would not have been an accurate estimate of academic performance after only one semester. Many freshmen are overwhelmed by the adjustment to college life, and academic performance is not usually at its best. Therefore, the next thought was to include sophomores. As sophomores, students have had time to adjust to their new environments, and by this time many students have declared their major. However, the students are still at a point where their grade point average is not heavily affected by their major.

Another concern involved the amounts of athletic scholarship granted to the student-athletes.
Since the data extended over the past seven years, it was essential to incorporate the changes in tuition, room and board, books, as well as other fees not included in tuition. Looking at the table below, one can see the drastic increases over the seven years this project includes [8].

Year         Cost of Attending Stetson University
1997-1998    $21,220
1998-1999    $23,180
1999-2000    $24,140
2000-2001    $25,255
2001-2002    $26,440
2002-2003    $27,925
2003-2004    $29,685

Table 1. Total Cost of Attendance

For example, suppose a woman playing basketball received an athletic scholarship worth $12,000 in 1998, and one of her teammates received an athletic scholarship for $12,000 when she played three years later. On the surface, it appears that both women are receiving the same amount of aid. However, the table shows that in 1998 it cost approximately $23,180 to attend Stetson University for one year, compared to $26,440 in 2001. Thus, the scholarship granted in 1998 was worth more than the one awarded three years later in 2001 (about 52% of the total cost versus about 45%). Therefore, in order to account for the increase in cost over the past seven years, rather than simply looking at the amount of athletic scholarship granted, the focus will be on the percentage of total cost (tuition, room and board, books and other fees) covered by the athletic scholarship granted.

CHAPTER 2 REGRESSION MODELS

When examining data, it is often essential to determine the relation or correlation between variables. For instance, in a simple case where two variables are being compared, one of the variables, x, is independent and known in advance, while the other, y, is dependent on x. While it is impossible to predict the actual value of the dependent variable, it is possible to create a model that can estimate the expected value, called E(Y) [3]. In the simple case, a linear regression model can often be used.
However, in a more complicated data set, the model may be exponential, power, logistic, or reciprocal [2].

2.1 TWO VARIABLE REGRESSION

As previously mentioned, the simple case of regression is one comparing two variables. In order to estimate E(Y), the predicted value of y, take a random sample of n points, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Here E(Y) depends on the value of x, the independent variable. Thus, $E(Y) = \mu(x)$. In other words, the predicted value of y, let's call it Y, can be written as a function of x. Assuming that E(Y) is linear yields $E(Y) = \alpha + \beta x$, a regression curve. Begin by fitting a straight line through the sample points [3]. An example of a linear regression model is shown in Figure 1.

Figure 1. Sample Linear Model (showing the point $(x_i, \mu(x_i))$)

By expressing the regression coefficients $\alpha$ and $\beta$ in terms of the following,

$$E(X) = \mu_1, \quad E(Y) = \mu_2, \quad \operatorname{var}(X) = \sigma_1^2, \quad \operatorname{var}(Y) = \sigma_2^2, \quad \operatorname{cov}(X, Y) = \sigma_{12}, \quad \rho = \frac{\sigma_{12}}{\sigma_1 \sigma_2},$$

the linearity of the following equation can be proven.

$$E(Y) = \mu_2 + \rho \frac{\sigma_2}{\sigma_1}(x - \mu_1)$$

The value of the correlation coefficient, $\rho$, tells how well or how poorly the two variables are correlated. The correlation coefficient can range in value from -1 to 1. A correlation coefficient of 1 indicates a perfect positive correlation between the two variables, and a value of -1 a perfect negative correlation. From this theorem, if $\rho = 0$ and hence $\sigma_{12} = 0$, then the two variables are uncorrelated. The sign of the correlation coefficient indicates whether the two variables are positively or negatively correlated [4].

2.1.1 PRINCIPLE OF LEAST SQUARES

By examining the scatter plot of the observed paired data, one can judge whether a linear regression equation is an appropriate model. If the data appears linear, one method to find the linear model, $y = \alpha + \beta x$, is the principle of least squares. This method was originally recommended by Adrien Legendre, a French mathematician, in the early nineteenth century [4]. The deviation from a particular point in the data set, $(x_i, y_i)$, to the line, $y = \alpha + \beta x$, is called the residual.
The residual can be calculated by

$$y_i - (\alpha + \beta x_i)$$

Suppose that the data points in the study are $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Then the sum of the squares of the corresponding residuals is

$$f(\alpha, \beta) = \sum_{i=1}^{n} [y_i - (\alpha + \beta x_i)]^2$$

By squaring these residuals, a misleading summation of zero is avoided. Ideally, in a linear regression model, the goal is to minimize the residuals. The values $\hat{\alpha}$ and $\hat{\beta}$ that minimize $f(\alpha, \beta)$ are referred to as the least squares estimates. In other words, $f(\hat{\alpha}, \hat{\beta}) \le f(\alpha, \beta)$ for all $\alpha$ and $\beta$. Thus, the least squares line is $y = \hat{\alpha} + \hat{\beta} x$ [2].

The least squares estimates are found by taking the partial derivatives of $f(\alpha, \beta)$ with respect to $\alpha$ and $\beta$ and setting both equal to zero. Solving the resulting two equations in the two unknowns, $\alpha$ and $\beta$, gives the following formulas:

$$\hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \frac{\sum y_i - \hat{\beta} \sum x_i}{n} = \bar{y} - \hat{\beta}\bar{x}$$

Once again, before computing $\hat{\alpha}$ and $\hat{\beta}$ it is beneficial to draw the scatter plot to determine whether the relationship between the variables is approximately linear. If the data appears linear, statistical computer packages can aid in the calculations [2]. In this project, Microsoft Excel was used.

2.1.2 VARIABLE INTERACTION

Often a number of variables are proposed to be connected to others. In other words, the variables are not always independent of each other [4]. In the context of this project, one's gender may influence the sport played. For instance, a male athlete cannot play softball, volleyball, etc. Thus, perhaps the two variables are not independent. One way to assess the independence of two variables is to find the correlation between the two. While correlation does not imply causation, plotting the observed values, finding a regression model, and examining the correlation coefficient can provide more insight into the possibility of interaction between the variables.
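As a rough illustration of that last point, the least squares estimates of Section 2.1.1 and a sample correlation coefficient can be computed directly from paired data. The following is only a minimal sketch in Python: the function name and the five data points are hypothetical, are not taken from the project data, and do not represent the Excel calculations actually used.

```python
from math import sqrt


def least_squares_fit(xs, ys):
    """Return (alpha_hat, beta_hat, r) for the line y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products about the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    beta_hat = s_xy / s_xx                 # slope estimate
    alpha_hat = y_bar - beta_hat * x_bar   # intercept estimate
    r = s_xy / sqrt(s_xx * s_yy)           # sample correlation coefficient
    return alpha_hat, beta_hat, r


# Hypothetical paired observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
alpha_hat, beta_hat, r = least_squares_fit(xs, ys)
print(alpha_hat, beta_hat, r)
```

A value of r near 1 or -1 would suggest a strong linear relationship, while a value near 0 would suggest little linear association between the two variables.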
In Section 2.2.2, the differences in multiple regression models with and without interacting variables can be seen.

2.1.3 RESIDUAL ANALYSIS

When determining whether or not the chosen regression model adequately fits the data, it is beneficial to construct a diagnostic plot. One specific diagnostic plot places the predicted value $\hat{y}_i$ on the vertical axis and the actual value $y_i$ on the horizontal axis. Once the plot is constructed, if the points closely fit a line through the origin with a slope of positive one, then the regression model gives an accurate prediction of the values actually observed.

Diagnostic plots which plot the residual on the vertical axis versus $x_i$ or $\hat{y}_i$ on the horizontal axis are called residual plots. The residuals are $e_i = y_i - \hat{y}_i$, where $\hat{y}_i = E(Y)$. The residual plot in Figure 2 plots $x_i$ on the horizontal axis. Unlike the first diagnostic plot mentioned, a residual plot should show no distinct pattern if the regression is to be an effective and accurate predictor. The residuals should be randomly dispersed around zero, according to a normal distribution.

[Plot: "Algebra II vs Pre-Calculus", residual on the vertical axis against Algebra II on the horizontal axis]

Figure 2. Residual Plot

There is one more type of diagnostic plot, involving the standardized residuals. It plots the standardized residuals on the vertical axis versus either $x_i$ or $\hat{y}_i$ on the horizontal axis. When examining this plot, if a majority of the standardized residuals fall between $\pm 2$, the model is a good predictor. In other words, it is an accurate model if all but a few residuals fall within two standard deviations of the mean. The standardized residuals can be calculated using the following formula,

$$e_i^* = \frac{y_i - \hat{y}_i}{s\sqrt{1 - \dfrac{1}{n} - \dfrac{(x_i - \bar{x})^2}{S_{xx}}}}, \qquad i = 1, 2, 3, \ldots, n$$

where $s = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$ is the estimated standard deviation of the residuals and $S_{xx} = \sum (x_i - \bar{x})^2$. From the formula, one can tell that these calculations will be tedious.
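Because the hand calculation is tedious, a short program is a natural aid. The sketch below computes standardized residuals for a simple linear fit, assuming that s denotes the residual standard deviation $\sqrt{SSE/(n-2)}$, as in standard textbook treatments. The function name and the data are hypothetical and serve only to illustrate the arithmetic.

```python
from math import sqrt


def standardized_residuals(xs, ys):
    """Standardized residuals e_i* for the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    # Ordinary residuals e_i = y_i - y_hat_i.
    residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(xs, ys)]
    # Assumed definition: s is the residual standard deviation, sqrt(SSE / (n - 2)).
    s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [
        e / (s * sqrt(1 - 1 / n - (x - x_bar) ** 2 / s_xx))
        for x, e in zip(xs, residuals)
    ]


# Hypothetical paired observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
std_res = standardized_residuals(xs, ys)
flagged = [i for i, e in enumerate(std_res) if abs(e) > 2]  # points outside +/- 2
print(std_res, flagged)
```

Any index that appears in `flagged` marks an observation whose standardized residual lies outside the $\pm 2$ band and may deserve a closer look.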
However, one can construct various useful plots involving them. This method is most often used when doing a multiple regression analysis, which will be included in this particular project [2].

While residual plots often help to determine the appropriateness of the model chosen, difficulties may arise. In reality, the residual plot may not have a random dispersion of residual values. In this case, a nonlinear model may be more appropriate.

Perhaps the chosen model appears to fit the data fairly well, with the exception of a few outlying values. These values may be drastically distorting the best-fit function. Excluding those outliers from the data set could lead to a change in the original model. However, this is harder to determine when working with multiple regression models [2].

Lastly, the model could be a poor fit if one or more independent variables were not considered or included in the data set. This can be seen when a pattern appears in the residuals after they are plotted against an omitted variable. In that case, one can incorporate the omitted variable in a multiple regression model [2]. As these cases show, some modifications can be made to correct the irregularities found in residual plots.

2.2 MULTIPLE REGRESSION

In the simple case of regression, the only concern was the correlation between two variables. This simple case can now be generalized to multiple regression. One would use multiple regression to build a model that relates numerous independent variables to a single dependent variable [2]. For instance, suppose that high school grade point average and SAT score were thought to be good predictors of a student's college grade point average. If a model were constructed to help support this claim, it would be a multiple regression model.
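To make the grade point average example concrete, such a model can be fitted by least squares, just as in the two-variable case, by solving the normal equations. The following is a minimal sketch in plain Python. The helper names are hypothetical, and the five observations are invented: they were generated exactly from college GPA = 0.5 + 0.6 (high school GPA) + 0.1 (SAT score in hundreds of points) so that the recovered coefficients can be checked. It is not the procedure or the data used in this project.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple_regression(rows, ys):
    """Least squares coefficients [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical predictors: (high school GPA, SAT score in hundreds of points).
rows = [(3.0, 10.0), (3.5, 11.0), (2.5, 12.0), (4.0, 13.0), (3.2, 9.0)]
# Hypothetical responses generated exactly from the assumed relationship.
ys = [0.5 + 0.6 * gpa + 0.1 * sat for gpa, sat in rows]
coefficients = fit_multiple_regression(rows, ys)
print(coefficients)
```

With real data the responses would not lie exactly on a plane, and the fitted coefficients would only estimate the underlying relationship.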
Let n represent the number of independent variables, or "predictor variables," to be studied, where n must be at least two. Denote these predictor variables as $x_1, x_2, x_3, x_4, x_5, \ldots, x_n$. Then the general additive multiple regression model equation is

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n + \varepsilon$$

where $E(\varepsilon) = 0$ and $V(\varepsilon) = \sigma^2$. It is also assumed that $\varepsilon$ is normally distributed. In the simple linear case, $\beta_0 + \beta_1 x$ describes the mean Y value as a function of x. Similarly, the population regression function, $\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$, gives the expected value of Y as a function of $x_1, x_2, x_3, \ldots, x_n$. The $\beta_i$'s are called the population regression coefficients. The regression coefficient $\beta_i$ can be understood as the expected change in Y associated with a one unit increase in $x_i$ while all other variables are held constant [2].

2.2.1 GENERAL ADDITIVE MULTIPLE REGRESSION MODEL

Suppose a statistician has acquired data on $y, x_1, x_2$. From what we know about multiple regression so far, one possible regression model is $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. Nevertheless, there are other possible models that can be created from $x_1$ and/or $x_2$. Thus, polynomial regression is a special case of multiple regression. There are a total of four useful multiple regression models [2].

2.2.2. FIRST-ORDER MODEL

The first model is the first-order model, where $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. This is the most clear-cut generalization of the simple linear regression model. This model maintains that for a set value of one variable, the expected value of Y is a linear function of the remaining variable. Thus, when graphing the regression equation as a function of $x_1$ for various values of $x_2$, the graph is a collection of parallel lines [2]. See Figure 3.

Figure 3. First-Order Model

2.2.3. SECOND-ORDER NO-INTERACTION MODEL

The next model is the second-order no-interaction model of the form

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \varepsilon.$$
According to this equation and assuming that $x_2$ is fixed, the expected change in Y for a one unit increase in $x_1$ does not depend on $x_2$. The calculation follows:

$$\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 (x_1 + 1)^2 + \beta_4 x_2^2 - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2) = \beta_1 + \beta_3 + 2\beta_3 x_1$$

Due to this lack of dependency on $x_2$, the contours of the second-order no-interaction model are still parallel to one another. However, unlike the first-order model, the expected value is no longer linearly dependent on $x_1$. Thus, the contours are curves rather than simple straight lines [2].

Figure 4. Second-Order No-Interaction Model

2.2.4. FIRST-ORDER PREDICTORS AND INTERACTION

Unlike the first two models, the model with first-order predictors and interaction has nonparallel contour lines. The model is expressed as $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$. Look at the dependency when $x_1$ is increased by one unit:

$$\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 (x_1 + 1) x_2 - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2) = \beta_1 + \beta_3 x_2$$

Since this change depends on $x_2$, each contour line will have a different slope for various values of $x_2$. In this model, the change in expected value as one variable increases also depends heavily on the value of the other variable. This concept is called interaction. It is important to note that if the model involves variable interaction, the earlier interpretation of the $\beta_i$'s no longer applies, since it is impossible to increase $x_i$ while holding the remaining terms, such as $x_1 x_2$, constant [2].

Figure 5. First-Order Predictors and Interaction Model

2.2.5 COMPLETE SECOND-ORDER MODEL

This model is also referred to as the full quadratic model. The general form for the complete second-order model is

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon.$$

In this model, the expected change in the value of Y when $x_1$ is increased by one unit depends on both variables, $x_1$ and $x_2$. By the same method used for the prior models, the expected change in Y is $\beta_1 + \beta_3 + 2\beta_3 x_1 + \beta_5 x_2$.
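This change can be checked with the same subtraction used for the earlier models. With $x_2$ held fixed, the terms $\beta_0$, $\beta_2 x_2$, and $\beta_4 x_2^2$ cancel, leaving only the terms that involve $x_1$ (here $\Delta E(Y)$ is shorthand, introduced only for this worked step, for the expected value at $x_1 + 1$ minus the expected value at $x_1$):

```latex
\begin{aligned}
\Delta E(Y) &= \beta_1\left[(x_1 + 1) - x_1\right]
             + \beta_3\left[(x_1 + 1)^2 - x_1^2\right]
             + \beta_5\left[(x_1 + 1)x_2 - x_1 x_2\right] \\
            &= \beta_1 + \beta_3 (2x_1 + 1) + \beta_5 x_2 \\
            &= \beta_1 + \beta_3 + 2\beta_3 x_1 + \beta_5 x_2 .
\end{aligned}
```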
Since this is a function involving two variables, it implies that the contour lines are both curved and nonparallel to one another [2].

Figure 6. Complete Second-Order Model

2.3. CATEGORICAL VARIABLES

In all of the methods mentioned so far, we have assumed that the variables being analyzed are numerical, or quantitative, variables. However, it is possible that the chosen variables are categorical (qualitative). In this particular project, there are several categorical variables: sex, race, sport played, chosen major, and home state. Nevertheless, there are methods for incorporating these categorical variables into the analysis.

2.3.1. DICHOTOMOUS VARIABLES

First consider the simple case. Suppose the categorical variable being examined has two possible categories, such as male or female. This type of variable with two categories is called a dichotomous variable. With dichotomous variables, one must assign a dummy or indicator variable x. This dummy variable has two possible values, zero or one, and indicates which category applies to any given observation [2]. This concept is best explained through an example.

Example: Suppose that it has been discovered that annual salary is dependent on the number of years of experience and whether or not the employee has a college degree. Let the dependent variable y = the annual salary and the independent variable $x_2$ = the number of years of experience. Since the presence of a college degree is a categorical variable, let $x_1 = 1$ if the employee has a college degree and $x_1 = 0$ if the employee does not.

Take for instance the linear regression model $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. Thus, the mean value of the annual salary depends on whether the employee has a college degree:

mean salary = $\beta_0 + \beta_2 x_2$ when $x_1 = 0$ (doesn't have a degree)
mean salary = $\beta_0 + \beta_1 + \beta_2 x_2$ when $x_1 = 1$ (has a degree)

Therefore, if the number of years of experience is held constant, then $\beta_1$ is the difference in mean salary between employees with and without a college degree.
Thus, if $\beta_1 > 0$, on average the employee with a college degree will have a higher annual salary than one without a college degree. This can be seen in Figure 7.

Figure 7. Categorical (no interaction)

However, suppose the two variables interact. Then a possible model is $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$. Therefore, the mean annual salaries are

mean salary = $\beta_0 + \beta_2 x_2$ when $x_1 = 0$ (doesn't have a degree)
mean salary = $\beta_0 + \beta_1 + (\beta_2 + \beta_3) x_2$ when $x_1 = 1$ (has a degree)

For this model, the change in mean annual salary with a one year increase of experience depends on whether or not the employee has a college degree. Thus, the two variables, years of experience and presence of a college degree, interact. The interaction is displayed in Figure 8.

Figure 8. Categorical (interaction)

2.3.2. MULTI-CATEGORY VARIABLES

Suppose that there are more than two categories for the variable observed. For instance, assume you have a variable with three categories. The first instinct is to create one variable and code it with three values, zero, one and two, for the three categories. However, this is incorrect. Doing so would impose an ordering on the groups that is not necessarily implied in the context of the problem. The correct approach is to define two dummy variables [2].

For example, assume that y is once again the annual salary of an employee, $x_1$ = years of experience, and that the level of college degree is being considered. Then let $x_2 = 1$ if the employee's highest degree is a B.A. or a B.S. and $x_2 = 0$ otherwise, and let $x_3 = 1$ if the employee's highest degree is a Masters and $x_3 = 0$ otherwise.

Thus, in this example, if an employee has a B.A. or a B.S., then $x_2 = 1, x_3 = 0$. If the employee has a Masters, then $x_2 = 0, x_3 = 1$. It is then implied that if an employee has a PhD, then $x_2 = 0, x_3 = 0$. The no-interaction model only considers $x_1, x_2, x_3$. However, the interaction model below allows the mean change in annual salary associated with a one unit increase in the number of years of experience to depend on what type of college degree the employee has.

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \varepsilon$$

This was the three-category case.
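The coding scheme for the degree example can be sketched in a few lines of Python. The function name and category labels below are hypothetical; the only convention assumed is that the last category listed serves as the reference category and is coded as all zeros, which matches the PhD case above.

```python
def dummy_code(value, categories):
    """Encode `value` as len(categories) - 1 indicator (dummy) variables.

    The last entry of `categories` is treated as the reference category
    and is therefore coded as all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories[:-1]]


# Hypothetical labels for the three degree levels in the salary example.
degrees = ["Bachelor", "Masters", "PhD"]
coded = {degree: dummy_code(degree, degrees) for degree in degrees}
print(coded)  # each value is the pair [x2, x3]
```

Because no numerical order is attached to the labels, this coding avoids the artificial ranking that a single 0/1/2 variable would impose.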
However, this case can be generalized to a multiple-category case. Suppose that you have a categorical variable with n possible categories that you wish to incorporate in the multiple regression model. This would entail creating $n - 1$ dummy variables. Consequently, adding even one categorical variable can add numerous predictors to a model [2].

CHAPTER 3 DISTRIBUTIONS

When analyzing a set of data it is essential to estimate the underlying distribution. Many tests and methods assume that the data has a normal underlying distribution. However, this is not always the case. There are numerous kinds of distributions that can be used to describe a particular data set. A few are the Normal, Hypergeometric, Poisson, Gamma, Beta, Multinomial, and Chi-Square Distributions. Each distribution calls for its own methods [1].

3.1 NORMAL DISTRIBUTION

Since many measurements of variables are approximately normally distributed, the normal distribution is one of the most important distributions in statistical analysis. A random variable, x, is normally distributed if its corresponding probability density function is defined as

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty,$$

where $-\infty < \mu < \infty$ and $\sigma > 0$. A normal distribution is denoted $N(\mu, \sigma^2)$. Since $\sigma$ is positive, $f(x) > 0$ [3]. When the probability density function, f(x), is graphed, the area under the bell-shaped, symmetric curve is one. See Figure 9. This can be proven using multi-variable calculus and incorporating polar coordinates when integrating $\int_{-\infty}^{\infty} f(x)\,dx$ [2]. In a normal distribution, it is also true that $\mu = E(X)$ and $\sigma^2 = \operatorname{Var}(X)$. In other words, $\mu$ and $\sigma^2$ are actually the mean and variance of the distribution. This is proven using the moment-generating function of X [3].

Figure 9. Normal Distributions [B4]

A special case of the normal distribution is called the standard normal distribution. This is when the mean of the distribution is zero and the variance is one.
Like the normal distribution, the standard normal distribution is a bell-shaped symmetric curve; however, it peaks at 0. See Figure 10 below.

Figure 10. Standard Normal Distribution [5]

The probability density function of a standard normal random variable Z is

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty.$$

The various probabilities involving Z can be calculated using the Z-score table found in most statistics textbooks. Let

$$\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy.$$

Calculating the probabilities can be aided by creating a visual, like the one in Figure 11. For distributions that are normal but not standard normal, the random variable X can be standardized by

$$Z = \frac{X - \mu}{\sigma}.$$

But how can one discover the approximate underlying distribution of a data set [2]?

Figure 11. Normal Curve Probabilities [5]

3.2. DETERMINING UNDERLYING DISTRIBUTIONS

In order to provide some insight into the underlying distribution of a data set, Karl Pearson suggested a method using the chi-square statistic and hypothesis testing, which tests the suitability of a probabilistic model [3]. The chi-square goodness-of-fit test can determine whether the underlying distribution is the one assumed in the null hypothesis, or whether it is another distribution [2].

3.2.1. HYPOTHESIS TESTING AND SIGNIFICANCE LEVELS

As in prior studies, a hypothesis is simply an educated guess regarding the problem statement. With this in mind, we start by creating a null hypothesis, denoted $H_0$. The null hypothesis is the initial assumption about a parameter of the distribution being studied. The alternative hypothesis, $H_a$, is an assertion that contradicts $H_0$. Based on a specified significance level, $\alpha$, one either rejects the null hypothesis or fails to reject the null hypothesis [2]. Once the null and alternative hypotheses are determined, a test statistic is calculated from the sample data. If the test statistic falls in the critical region, then you reject the null hypothesis [3].
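As a minimal sketch of this decision rule, the right-tailed test for a single proportion (the statistic listed in Table 2) can be written in Python. The function names and the sample counts (60 successes in 100 trials, with $p_0 = 0.5$) are hypothetical. The critical value $z_\alpha$ is taken from the standard library's `statistics.NormalDist` rather than a printed table, which is an assumption about tooling and not something used in this paper.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(y, n, p0):
    """Test statistic z = (y/n - p0) / sqrt(p0 * (1 - p0) / n)."""
    return (y / n - p0) / sqrt(p0 * (1 - p0) / n)


def reject_right_tailed(z, alpha):
    """Reject H0: p = p0 in favor of Ha: p > p0 when z >= z_alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # upper-alpha critical value
    return z >= z_alpha


# Hypothetical data: 60 successes in 100 trials, testing H0: p = 0.5.
z = one_proportion_z(60, 100, 0.5)
print(round(z, 4), reject_right_tailed(z, alpha=0.05))
```

The left-tailed and two-sided tests differ only in the comparison: $z \le -z_\alpha$ and $|z| \ge z_{\alpha/2}$, respectively.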
The critical regions are defined in Tables 2 and 3. Even if the null hypothesis is rejected, one cannot conclude with certainty that the alternative hypothesis is true.

H0 | Ha | Critical Region
$p = p_0$ | $p > p_0$ | $z \ge z_\alpha$
$p = p_0$ | $p < p_0$ | $z \le -z_\alpha$
$p = p_0$ | $p \ne p_0$ | $|z| \ge z_{\alpha/2}$

where $z = \dfrac{y/n - p_0}{\sqrt{p_0(1 - p_0)/n}}$.

Table 2. Hypothesis Testing for One Proportion [3]

H0 | Ha | Critical Region
$p_1 = p_2$ | $p_1 > p_2$ | $z \ge z_\alpha$
$p_1 = p_2$ | $p_1 < p_2$ | $z \le -z_\alpha$
$p_1 = p_2$ | $p_1 \ne p_2$ | $|z| \ge z_{\alpha/2}$

where $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}}$, $\hat{p}_1$ and $\hat{p}_2$ are the two sample proportions, and $\hat{p}$ is the pooled sample proportion.

Table 3. Hypothesis Testing for Two Proportions [3]

3.2.2. CHI-SQUARE DISTRIBUTION

The chi-square distribution has only one parameter, m. This parameter is called the number of degrees of freedom of the distribution, where $m = 1, 2, 3, 4, \ldots$ For this distribution, the critical region is the area under the $\chi^2_m$ density curve, with m degrees of freedom, that lies to the right of the critical value $\chi^2_{\alpha,m}$. The chi-square distribution is not symmetric [2]. See Figure 12 below. It is also important to note that this distribution has a mean of m and a variance of 2m [1].

Figure 12. Chi-Square Distribution

3.2.3. GOODNESS-OF-FIT TEST

When analyzing data, it is essential to know the approximate underlying distribution of the data set. In a large set of data, let m be the number of possible outcomes and let $p_i$ represent the probability that a randomly selected observation will be of type i, where $i = 1, 2, 3, \ldots, m$. From the properties of probability, we know that

$$p_i \ge 0 \text{ for all } i = 1, 2, 3, \ldots, m \quad \text{and} \quad \sum_{i=1}^{m} p_i = 1.$$

Let $p_1^0, p_2^0, p_3^0, \ldots, p_m^0$ be numbers such that

$$p_i^0 > 0 \text{ for all } i = 1, 2, 3, \ldots, m \quad \text{and} \quad \sum_{i=1}^{m} p_i^0 = 1.$$
Since the goodness-of-fit test measures the inconsistency between the observed values and the values expected when the null hypothesis is true, the hypotheses are stated below.

$H_0$: $p_i = p_i^0$ for all $i = 1, 2, 3, \ldots, m$
$H_a$: $p_i \ne p_i^0$ for at least one i

If we let $N_i$ be the actual number of observations of type i, and n the total number of observations, then the test statistic used is

$$Q = \sum_{i=1}^{m} \frac{(N_i - n p_i^0)^2}{n p_i^0} = \sum \frac{(\text{observed} - \text{predicted})^2}{\text{predicted}}.$$

Karl Pearson found that if the null hypothesis is true, then as the sample size becomes very large, the distribution of Q is approximately the chi-square distribution with $m - 1$ degrees of freedom. Thus, after choosing a significance level $\alpha_0$, let c be the $1 - \alpha_0$ quantile of the chi-square distribution with $m - 1$ degrees of freedom. If $Q \ge c$, then the null hypothesis should be rejected. However, before the null hypothesis is completely rejected, it is necessary to be certain that there is no other reasonable alternative distribution that better fits the observed data [2].

CHAPTER 4 PRELIMINARY ANALYSIS

Upon receiving the data set, I did a brief analysis with a primary focus on comparing the student-athletes' Stetson University grade point averages with the percentage of tuition covered by athletic scholarships. Rather than examining all student-athletes at Stetson University over the past seven years, I decided to take a small sample of the data set. I chose to look specifically at baseball.

4.1. LINEAR REGRESSION

The data set included multiple variables that were determined to have a possible influence on a student-athlete's Stetson University grade point average. I decided to see if there was a correlation between the percentage of the cost of attending Stetson University covered by athletic scholarships and the grade point averages. Therefore, I pulled the baseball players from the data set and plotted the percentages against the grade point averages of the baseball players using Microsoft Excel. See Figure 13 for this plot.
[Figure 13: scatter plot of Stetson GPA (vertical axis) against Percentage of Tuition Covered by Athletic Scholarships (horizontal axis) for baseball players]

Figure 13. Baseball Percentage vs. GPA

Is there a model that could predict a baseball player's grade point average based on the percentage of the cost of schooling he was receiving? I chose to try a few different models to see if there was any correlation between these two variables. I tried various polynomial regression models; however, there was no significant difference between the models. Thus, only the result of the linear model is shown in Figure 14.

Figure 14. Baseball Regression Models

Looking at the linear regression equation and plot in Figure 14, the model appears to fit the data rather poorly. A correlation coefficient of 0.4352 does not imply a strong correlation between percentage of tuition covered by athletic scholarship and Stetson grade point average. Typically a correlation coefficient of 0.7 or higher is considered a good correlation. However, this is not enough to discard this model.

4.2. RESIDUAL ANALYSIS

Before discarding the model, we will look at the corresponding residual plot. Here are the residual values plotted against the corresponding x-values.

Figure 15. Baseball Residual Plots

Since the residual values are randomly dispersed in the plot above, the chosen model may not be as poor a fit as originally stated. However, there may be a better model than the polynomial equations first tried. It is also possible that these two variables are simply not strongly correlated. This is something that will be examined further next semester.

4.3. GOODNESS-OF-FIT TEST

With the baseball sample taken from the complete data set of all student-athletes, I wanted to apply the goodness-of-fit test to see if the underlying distribution was approximately normal.
Since I focused on the percentage of tuition covered by athletic scholarships and the grade point average in the regression models, I decided to see if the baseball players' grade point averages were normally distributed. I started by binning the data. I created three categories for the observed grade point averages to fall in. They were

Bin 1: GPA is in the interval [1.5, 2.25)
Bin 2: GPA is in the interval [2.25, 3.0)
Bin 3: GPA is greater than or equal to 3.0

The frequency of each bin is listed below in Table 4.

Bin | Frequency
1 | 25
2 | 32
3 | 30
Total | 87

Table 4. Frequency Table

Once I had the observed values categorized, I then calculated the mean and standard deviation of this sample set.

$$\bar{x} = 2.687437 \qquad s_x = 0.624282$$

With these results I was able to use the assumption that the distribution is normal to calculate the probability that an observed grade point average will fall into each bin. Here are the calculated probabilities.

$$P(1.5 \le X < 2.25) = \Phi\!\left(\frac{2.25 - 2.687437}{0.624283}\right) - \Phi\!\left(\frac{1.5 - 2.687437}{0.624283}\right) = \Phi(-0.701) - \Phi(-1.902) = 0.2133$$

$$P(2.25 \le X < 3.0) = \Phi\!\left(\frac{3.0 - 2.687437}{0.624283}\right) - \Phi\!\left(\frac{2.25 - 2.687437}{0.624283}\right) = \Phi(0.501) - \Phi(-1.902) = 0.6628$$

$$P(X \ge 3.0) = 1 - P(X < 3.0) = 1 - \Phi\!\left(\frac{3.0 - 2.687437}{0.624283}\right) = 1 - \Phi(0.501) = 0.3085$$

Now that the probabilities have been calculated and the data categorized, we must check that the expected value in each bin is at least five in order for the goodness-of-fit test to be an accurate approximation [1]. Table 5 lists the expected values, each of which exceeds five.

Bin | Frequency | Probability | Expected Value
1 | 25 | 0.2133 | 18.5571
2 | 32 | 0.6628 | 57.6636
3 | 30 | 0.3085 | 26.8395

Table 5. Expected Values

The hypotheses for the goodness-of-fit test can now be constructed.
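As a check on the probabilities and expected values in Table 5, the normal-model bin probabilities can be recomputed in Python. This is a sketch, not the method used in the paper: it evaluates $\Phi$ with `math.erf` instead of a Z-table, and the helper names are hypothetical. Computed this way, the first and third bins agree with Table 5 to about three decimal places, while the middle bin, taken as $\Phi(0.501) - \Phi(-0.701)$, comes out near 0.45. The value 0.6628 in the text equals $\Phi(0.501) - \Phi(-1.902)$, which is the probability of the wider interval [1.5, 3.0), so the expected count for Bin 2 may deserve a second look.

```python
from math import erf, sqrt

# Sample statistics and bin counts as reported in the text.
MEAN = 2.687437
STD_DEV = 0.624282
OBSERVED = [25, 32, 30]  # Table 4 frequencies
N = sum(OBSERVED)        # 87 players


def phi(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_cdf(x):
    """P(X <= x) for X ~ N(MEAN, STD_DEV^2), via standardization."""
    return phi((x - MEAN) / STD_DEV)


# Bin probabilities under the assumed normal model.
p_bin1 = normal_cdf(2.25) - normal_cdf(1.5)  # [1.5, 2.25)
p_bin2 = normal_cdf(3.0) - normal_cdf(2.25)  # [2.25, 3.0)
p_bin3 = 1.0 - normal_cdf(3.0)               # [3.0, infinity)
probabilities = [p_bin1, p_bin2, p_bin3]

# Expected counts n * p_i; each should be at least five for the test.
expected = [N * p for p in probabilities]

for i, (p, e) in enumerate(zip(probabilities, expected), start=1):
    print(f"Bin {i}: probability {p:.4f}, expected count {e:.2f}")
```

Because the bins start at 1.5, the three probabilities sum to slightly less than one under the normal model; the remainder is the probability of a grade point average below 1.5.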
The hypotheses are as follows:

$H_0$: the grade point averages follow a normal distribution
$H_a$: the grade point averages do not follow a normal distribution

We can now calculate the $\chi^2$ test statistic, Q:

$$Q = \sum_{i=1}^{m} \frac{(N_i - n p_i^0)^2}{n p_i^0} = \sum \frac{(\text{observed} - \text{predicted})^2}{\text{predicted}}$$

$$Q = \frac{(25 - 18.5571)^2}{18.5571} + \frac{(32 - 57.6636)^2}{57.6636} + \frac{(30 - 26.8395)^2}{26.8395}$$

$$Q = 2.2369 + 11.4218 + 0.3722$$

$$Q = 14.0309$$

Recall that when deciding whether to reject or fail to reject the null hypothesis, you must choose a significance level $\alpha$, which determines the critical value $\chi^2_{\alpha,m-1}$. If $Q \ge \chi^2_{\alpha,m-1}$, you reject the null hypothesis. In this case, with $m = 3$ bins, a test with a significance level of $\alpha = .10$ requires that $\chi^2_{\alpha,m-1} = 4.605$. Therefore, since 14.0309 is greater than 4.605, we reject the null hypothesis, which assumes that this distribution is normal. Since the test statistic was considerably higher than any of the tabulated $\chi^2_{\alpha,2}$ values, I am confident that the distribution of Stetson baseball players' grade point averages is not normal. However, it appears that the sample of grade point averages is approximately uniformly distributed. The high standard deviation has spread the data to where there is no significant peak around the mean.

CHAPTER 5 CONCLUSIONS

With the research and data collection I was able to accomplish this semester, I have a better idea of my plans for next semester. Ideally, I would like to construct a model that could estimate incoming student-athletes' grade point averages at Stetson University given numerous variables: sport played, major declared, percentage of tuition covered by athletic scholarship, percentage of tuition covered by non-athletic scholarship, and whether the student is from in-state or out-of-state. After doing some preliminary analysis, I understand that this may not be feasible. Therefore, I am also interested in examining any correlations between the various variables and creating regression models to describe those correlations. I also plan to compare variable averages among different sports and between males and females. To compare these means, analysis of variance is needed.
The analysis of variance uses hypothesis testing to compare the means of two or more normal distributions. It is a method that I plan to study further in the fall.

REFERENCES

[1] M. H. DeGroot and M. J. Schervish, Probability and Statistics: Third Edition. Addison Wesley, Boston, MA, 2002.
[2] J. L. Devore, Probability and Statistics for Engineering and the Sciences: Sixth Edition. Thomson Learning, Inc, Belmont, CA, 2004.
[3] R. V. Hogg and E. A. Tanis, Probability and Statistical Inference: Sixth Edition. Prentice Hall, Upper Saddle River, NJ, 2001.
[4] I. Miller and M. Miller, John E. Freund’s Mathematical Statistics with Applications: Seventh Edition. Pearson Prentice Hall, Upper Saddle River, NJ, 2004.
[5] The Normal Distribution, April 5, 2005. http://www.stat.yale.edu/Courses/199798/101/normal.htm
[6] “Normal Distribution,” Wikipedia, March 31, 2005. April 5, 2005. http://en.wikipedia.org/wiki/Normal_distribution
[7] Student-Athlete Handbook 2004-2005, February 6, 2005. http://www.stetson.edu/athletics/home/extras/handbook.pdf
[8] Stetson University Bulletin Archives, March 30, 2005. http://www.stetson.edu/other/bulletin/archives.php

BIOGRAPHICAL SKETCH

April Coates is a junior at Stetson University. She is majoring in mathematics with a minor in education. Her activities at Stetson include Delta Kappa Pi, the QED math club, Fellowship of Christian Athletes, Resident Assistant, Teacher’s Assistant, math tutoring, and intramurals. She enjoys working out, playing sports, going to the beach, and taking road trips. Upon graduation, she is undecided between attending graduate school and starting a career as a teacher.