A STATISTICAL ANALYSIS OF STUDENT ATHLETES AT STETSON UNIVERSITY
By
APRIL COATES
A SENIOR RESEARCH PAPER PRESENTED TO THE DEPARTMENT OF MATHEMATICS
AND COMPUTER SCIENCE OF STETSON UNIVERSITY IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF SCIENCE
STETSON UNIVERSITY
2005
ACKNOWLEDGMENTS
I would like to start by acknowledging Dr. Erich Friedman. If not for him, I would not
have a project to present. Thank you for allowing me to investigate what started as just a casual
lunch conversation. I would also like to thank Dr. John Tichenor and Patti Sanders for all of their
assistance in the collection of data. I would like to give special acknowledgement to Dr. Will
Miles for his assistance throughout the semester. Thank you for your continuous support through
computer crashes and frantic mental breakdowns. To my mom and brother, thank you for your
constant love and support. Your confidence in me is what got me through this project. Lastly, I
would like to say thank you to my second family—all of the professors of the Stetson University
Math Department. Thank you all for pushing me beyond what I thought were my limits.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTERS
1. BACKGROUND
   1.1. Stetson University Athletic Department Mission Statement
   1.2. Financial Eligibility
   1.3. Data Collection
2. REGRESSION MODELS
   2.1. Two Variable Regression
      2.1.1. Principle of Least Squares
      2.1.2. Variable Interaction
      2.1.3. Residual Analysis
   2.2. Multiple Regression
      2.2.1. General Additive Multiple Regression Model
      2.2.2. First-Order Model
      2.2.3. Second-Order No-Interaction Model
      2.2.4. First-Order Predictors and Interaction
      2.2.5. Complete Second-Order Model
   2.3. Categorical Variables
      2.3.1. Dichotomous Variables
      2.3.2. Multi-Category Variables
3. DISTRIBUTIONS
   3.1. Normal Distributions
   3.2. Determining Underlying Distributions
      3.2.1. Hypothesis Testing and Significance Level
      3.2.2. Chi-Square Distribution
      3.2.3. Goodness-of-Fit Test
4. PRELIMINARY ANALYSIS
   4.1. Linear Regression
   4.2. Residual Plots
   4.3. Goodness-of-Fit Test
5. CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
TABLE
1. Cost of Attendance
2. Hypothesis Testing for One Proportion
3. Hypothesis Testing for Two Proportions
4. Frequency Table
5. Expected Values
LIST OF FIGURES
FIGURE
1. Sample of Linear Model
2. Residual Plot
3. First-Order Model
4. Second-Order No-Interaction Model
5. First-Order Predictors and Interaction Model
6. Complete-Second Order Model
7. Categorical (no interaction)
8. Categorical (interaction)
9. Normal Distributions
10. Standard Normal Distributions
11. Normal Curve Probabilities
12. Chi-Squared Distribution
13. Baseball Percentage vs. GPA
14. Baseball Regression Models
15. Baseball Residual Plots
ABSTRACT
A STATISTICAL ANALYSIS OF STUDENT ATHLETES AT STETSON UNIVERSITY
By
April Coates
May 2006
Advisor: Dr. Will Miles
Department: Mathematics and Computer Science
Student-athletes have very different roles in the eyes of society. There are those in society who
feel athletics are the top priority while academics come second. On the other hand, there are others
who believe athletes are students first and athletics are merely an extracurricular activity. The
Stetson University Athletic Department strives for excellence in the classroom and on the playing
field. However, are these student-athletes succeeding in the classroom? I approached the
registrar, Dr. John Tichenor, to see what data would be available for the analysis of
student-athletes over the past seven years. Prior to receiving the records, I began research on various
methods needed to analyze the data—two variable regression, multivariable regression, residuals,
categorical variables, and the goodness-of-fit test.
Does the amount of athletic scholarship granted to an athlete correlate with academic
performance? Do athletes tend to shy away from the more demanding majors? Are there sports
that place a higher standard on academics? Are males and females really on the same playing
field? I hope to answer these and other questions that arise next semester when I can analyze the
complete data set.
CHAPTER 1
BACKGROUND
Stetson University has been housing Division One athletes since it joined the Atlantic Sun
Conference in 1985 [7]. Currently, Stetson students may participate in nine varsity sports:
basketball, volleyball, crew, golf, tennis, soccer, cross country, baseball and softball. The NCAA
maintains that every athlete be held to specific standards in order to remain eligible. For
example, student-athletes must be full time students (twelve credit hours) and must maintain a
minimum of a 2.00 grade point average [7].
1.1 STETSON UNIVERSITY ATHLETIC DEPARTMENT MISSION STATEMENT
Stetson University strives for individual student-athletes to achieve excellence both in the
classroom and on the playing field. The university places great emphasis on student-athletes
having a well-rounded college experience. The Athletic Department’s mission statement, quoted
below, summarizes its ideal college experience:
The Stetson University Athletic Department strives to provide students with a
sound educational experience through a holistic and collaborative athletic
program that allows students to develop intellectually, spiritually, socially, and
physically. Excellence is pursued through participation in a successful Division I,
NCAA program, superior coaching, interaction among coaches, faculty, students,
and staff, and a diversity of student-athlete activities based on a liberal-arts
education. Students develop leadership through sport participation and
community activities. In unison with the University Mission, the Athletic
program helps students pursue truth by actively recruiting and providing a
diverse and caring environment that values and commits to the rights and fair
treatment of all people regardless of race, religion, or gender. The Athletic
Program (Department) encourages its student athletes to be morally sensitive and
contributing citizens through active forms of social responsibility [7].
1.2. FINANCIAL ELIGIBILITY
Just like at other colleges and universities, student-athletes at Stetson University may be awarded
scholarships for their athletic ability. These scholarships can range from a small stipend to aid
covering tuition, room, board and books. However, these scholarships and grants are only
awarded for a one-year period. Upon the completion of an academic year, each athlete’s athletic
abilities and eligibility will be considered when deciding whether or not the grants will be
renewed. Stetson University has the right to withdraw the student-athlete’s scholarship if he or
she has willingly withdrawn from the sport, is no longer eligible to compete, or is guilty of a
serious misconduct [7].
1.3. DATA COLLECTION
For this project to yield statistically significant results, the data set should be as large as
possible. Thus, the data collected span seven academic years, from 1997 to 2004, and include
over five hundred student-athletes. The data set includes the following variables: sport, sex,
race, home state, SAT scores, declared major, Stetson University cumulative grade point average
as of the end of the fall of the sophomore year, Stetson University athletic scholarships, and
non-athletic Stetson University scholarships granted.
When the data were collected, a few concerns were raised. While we wanted the data set to be as
large as possible, we could not include all athletes over the available time span as originally
planned. If we had, student-athletes would more than likely have been included in the data set
more than once. For instance, suppose there is a student who played soccer for each of the four
years he attended Stetson University. This particular soccer player would be included in the
data set four times. To avoid counting student-athletes more than once, it was decided to use
only one class each academic year.
The original thought was to use all freshman student-athletes; however, variables such as grade
point average would not have been an accurate estimate of academic performance after only one
semester. Many freshmen are overwhelmed by the adjustment to college life, and their academic
performance is not usually at its best. Therefore, the next thought was to include sophomores.
By sophomore year, students have had time to adjust to their new environment, and by this time
many students have declared their major. However, the students are still at a point where their
grade point average is not heavily affected by their major.
Another concern involved the amounts of athletic scholarship granted to the student-athletes.
Since the data extend over seven years, it was essential to account for the changes in tuition,
room and board, books, and other fees not included in tuition. The table below shows the drastic
increases over the seven years this project includes [8].
Year          Cost of Attending Stetson University
1997-1998     $21,220
1998-1999     $23,180
1999-2000     $24,140
2000-2001     $25,255
2001-2002     $26,440
2002-2003     $27,925
2003-2004     $29,685
Table 1. Total Cost of Attendance
For example, suppose a woman playing basketball received an athletic scholarship worth $12,000
in 1998, and one of her teammates received an athletic scholarship for $12,000 three years
later. On the surface, it appears that both women are receiving the same amount of aid. However,
the table shows that it cost approximately $23,180 to attend Stetson University for one year in
1998, compared to $26,440 in 2001. Thus, the scholarship granted in 1998 was worth more than the
one awarded three years later in 2001. Therefore, to account for the increase in cost over the
past seven years, rather than simply looking at the amount of athletic scholarship granted, the
focus will be on the percentage of total cost (tuition, room and board, books and other fees)
covered by the athletic scholarship granted.
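The comparison above can be sketched in a few lines of Python. This is only an illustration of the
percentage calculation, not the spreadsheet actually used in the project; the function name
percent_covered is an assumption made for this example, and the costs are copied from Table 1.

```python
# Total cost of attendance by academic year, copied from Table 1.
COST_OF_ATTENDANCE = {
    "1997-1998": 21220,
    "1998-1999": 23180,
    "1999-2000": 24140,
    "2000-2001": 25255,
    "2001-2002": 26440,
    "2002-2003": 27925,
    "2003-2004": 29685,
}


def percent_covered(scholarship, year):
    """Return the percentage of that year's total cost covered by a scholarship.

    percent_covered is a hypothetical helper name used only for this sketch.
    """
    return 100.0 * scholarship / COST_OF_ATTENDANCE[year]


# The two basketball players from the example: the same dollar amount, three years apart.
early = percent_covered(12000, "1998-1999")
late = percent_covered(12000, "2001-2002")
print(f"1998 scholarship covers {early:.1f}% of the total cost")  # 51.8%
print(f"2001 scholarship covers {late:.1f}% of the total cost")   # 45.4%
```

The same $12,000 covers a noticeably smaller share of the total cost three years later, which is why
the percentage, rather than the dollar amount, is used as the variable.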
CHAPTER 2
REGRESSION MODELS
When examining data, it is often essential to determine the relation or correlation between
variables. For instance, in a simple case where two variables are being compared, one of the
variables, x, is independent and known in advance, while the other, y, is dependent on x. While it
is impossible to predict the actual value of the dependent variable exactly, it is possible to
create a model that estimates its expected value, called E(Y) [3]. In the simple case, a linear
regression model can often be used. However, for a more complicated data set, the model may be
exponential, power, logistic, or reciprocal [2].
2.1 TWO VARIABLE REGRESSION
As previously mentioned, the simple case of regression is one comparing two variables. In order
to estimate E(Y), the predicted value of y, take a random sample of n points,
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. So, E(Y) depends on the value of x, the independent
variable. Thus, $E(Y) = \mu(x)$. In other words, the predicted value of y, let’s call it Y, can be
written as a function of x. Assuming that E(Y) is linear yields $E(Y) = \alpha + \beta x$, a
regression curve. Begin by fitting a straight line through the sample points [3]. An example of a
linear regression model is shown in Figure 1.
[Figure: fitted line with a labeled point $(x_i, \mu(x_i))$]
Figure 1. Sample Linear Model
By expressing the regression coefficients $\alpha$ and $\beta$ in terms of the following,
$$E(X) = \mu_1, \quad E(Y) = \mu_2, \quad \mathrm{var}(X) = \sigma_1^2, \quad \mathrm{var}(Y) = \sigma_2^2, \quad \mathrm{cov}(X, Y) = \sigma_{12}, \quad \rho = \frac{\sigma_{12}}{\sigma_1 \sigma_2},$$
the linearity of the following equation can be proven.
$$E(Y) = \mu_2 + \rho \frac{\sigma_2}{\sigma_1} (x - \mu_1)$$
The value of the correlation coefficient, $\rho$, tells how well or how poorly the two variables
are correlated. The correlation coefficient can range in value from -1 to 1. A correlation
coefficient of $\pm 1$ indicates a perfect correlation between the two variables. From this
theorem, if $\rho = 0$ and hence $\sigma_{12} = 0$, then the two variables are uncorrelated. The
sign of the correlation coefficient indicates whether the two variables are positively or
negatively correlated [4].
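As a rough illustration of the correlation coefficient, the sketch below computes
$\rho = \sigma_{12} / (\sigma_1 \sigma_2)$ from a sample, using the sample covariance and variances
in place of the population quantities. The function name and the small data sets are made up for
this example and are not part of the Stetson data.

```python
import math


def correlation(xs, ys):
    """Sample version of rho = cov(X, Y) / (sd(X) * sd(Y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov_xy / math.sqrt(var_x * var_y)


# Hypothetical illustration data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect_positive = correlation(xs, [2.0 * x + 1.0 for x in xs])   # points on a rising line
perfect_negative = correlation(xs, [-3.0 * x + 4.0 for x in xs])  # points on a falling line
uncorrelated = correlation(xs, [(x - 3.0) ** 2 for x in xs])      # symmetric parabola

print(perfect_positive, perfect_negative, uncorrelated)
```

The third data set is worth noting: y is completely determined by x, yet the correlation is zero,
because the correlation coefficient only measures linear association.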
2.1.1 PRINCIPLE OF LEAST SQUARES
By examining the scatter plot of the observed paired data, it can be estimated whether a linear
regression equation is an appropriate model. If it appears that the data is linear, one method to
find the linear model, $y = \alpha + \beta x$, is the principle of least squares. This method was
originally recommended by Adrien Legendre, a French mathematician, in the early nineteenth
century [4].
The deviation from a particular point in the data set, $(x_i, y_i)$, to the line,
$y = \alpha + \beta x$, is called the residual. The residual can be calculated by
$$y_i - (\alpha + \beta x_i)$$
Suppose that the data points in the study are $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Then the
sum of the squares of the corresponding residuals is
$$f(\alpha, \beta) = \sum_{i=1}^{n} [y_i - (\alpha + \beta x_i)]^2$$
Squaring the residuals prevents positive and negative deviations from canceling to a misleading
sum of zero. In a linear regression model, the goal is to make the residuals as small as
possible. The values $\hat{\alpha}$ and $\hat{\beta}$ that minimize $f(\alpha, \beta)$ are referred
to as the least squares estimates. In other words,
$f(\hat{\alpha}, \hat{\beta}) \le f(\alpha, \beta)$ for all $\alpha$ and $\beta$. Thus, the least
squares line is $y = \hat{\alpha} + \hat{\beta} x$ [2].
The least squares estimates that minimize the residuals are found by taking partial derivatives of
$f(\alpha, \beta)$ with respect to $\alpha$ and $\beta$ and setting both equal to zero. Solving
the resulting two equations for the two unknowns, $\alpha$ and $\beta$, gives the following
formulas:
$$\hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$\hat{\alpha} = \frac{\sum y_i - \hat{\beta} \sum x_i}{n} = \bar{y} - \hat{\beta} \bar{x}$$
Once again, before computing $\hat{\alpha}$ and $\hat{\beta}$ it is beneficial to plot the scatter plot to determine if the
correlation between the variables is approximately linear. If the data appears linear, statistical
computer packages can aid in the calculations [2]. In this project, Microsoft Excel was used.
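Although Excel was used for the project itself, the least squares formulas are short enough to
sketch directly. The Python below is only an illustration under stated assumptions: the function
names are invented for this example, and the data (percentage of cost covered versus grade point
average) are hypothetical values, not records from the Stetson data set.

```python
def least_squares(xs, ys):
    """Return (alpha_hat, beta_hat) for the least squares line y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def sum_squared_residuals(alpha, beta, xs, ys):
    """f(alpha, beta): the sum of squared residuals for a candidate line."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: percentage of cost covered (x) and grade point average (y).
xs = [10.0, 25.0, 40.0, 55.0, 70.0]
ys = [2.6, 2.9, 3.0, 3.3, 3.4]

alpha_hat, beta_hat = least_squares(xs, ys)
best = sum_squared_residuals(alpha_hat, beta_hat, xs, ys)
print(f"least squares line: y = {alpha_hat:.4f} + {beta_hat:.4f} x")
print(f"sum of squared residuals: {best:.4f}")
```

Moving either estimate away from its least squares value can only increase the sum of squared
residuals, which is exactly the property $f(\hat{\alpha}, \hat{\beta}) \le f(\alpha, \beta)$.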
2.1.2 VARIABLE INTERACTION
Often a number of variables are thought to be connected to others. In other words, the variables
are not always independent of each other [4]. In the context of this project, one’s gender may
influence the sport played. For instance, a male student cannot play sports such as softball or
volleyball. Thus, the two variables may not be independent. One way to assess the independence
of two variables is to find the correlation between the two. While correlation does not imply
causation, plotting the observed values, finding a regression model, and examining the
correlation coefficient can provide more insight into the possibility of interaction between the
variables. Section 2.2 shows the differences between multiple regression models with and without
interacting variables.
2.1.3 RESIDUAL ANALYSIS
When determining whether the chosen regression model adequately fits the data, it is beneficial
to construct a diagnostic plot. One specific diagnostic plot places the predicted value,
$\hat{y}_i$, on the vertical axis and the actual value, $y_i$, on the horizontal axis. Once the
plot is constructed, if the points closely fit a line through the origin with a slope of positive
one, then the regression model gives an accurate prediction of the values actually observed.
Diagnostic plots which plot the residual on the vertical axis versus $x_i$ or $\hat{y}_i$ on the
horizontal axis are called residual plots. The residuals are $e_i = y_i - \hat{y}_i$, where
$\hat{y}_i$ is the estimate of $E(Y)$ at $x_i$. The residual plot in Figure 2 plots $x_i$ on the
horizontal axis. Unlike the first diagnostic plot mentioned, a residual plot should show no
distinct pattern if the regression is to be an effective and accurate predictor. The residuals
should be randomly dispersed around zero, according to a normal distribution.
[Figure: residual plot titled "Algebra II vs Pre-Calculus"; vertical axis: Residual; horizontal axis: Algebra II]
Figure 2. Residual Plot
There is one more type of diagnostic plot, involving the standardized residuals. It plots the
standardized residuals on the vertical axis versus either $x_i$ or $\hat{y}_i$ on the horizontal
axis. When examining this plot, if a majority of the standardized residuals fall between
$\pm 2$, the model is a good predictor. In other words, it is an accurate model if all but a few
residuals fall within two standard deviations of the mean. The standardized residuals can be
calculated using the following formula,
$$e_i^* = \frac{y_i - \hat{y}_i}{s \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}}}, \quad i = 1, 2, 3, \ldots, n,$$
where $s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$ and $S_{xx} = \sum (x_i - \bar{x})^2$.
As the formula suggests, these calculations are tedious by hand. However, one can construct
various useful plots involving them. This method is most often used when doing a multiple
regression analysis, which will be included in this particular project [2].
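Since the standardized residuals are tedious to compute by hand, a short sketch may help. The
Python below fits the least squares line and then standardizes each residual. It assumes that s is
the residual standard deviation, $\sqrt{\sum (y_i - \hat{y}_i)^2 / (n - 2)}$, and it uses made-up
data; the function name is also an assumption of this example.

```python
import math


def standardized_residuals(xs, ys):
    """Fit y = alpha + beta * x by least squares and return the standardized residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    alpha_hat = y_bar - beta_hat * x_bar

    residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(xs, ys)]
    # Assumption: s is the residual standard deviation with n - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

    return [
        e / (s * math.sqrt(1.0 - 1.0 / n - (x - x_bar) ** 2 / s_xx))
        for x, e in zip(xs, residuals)
    ]


# Hypothetical data, for illustration only.
xs = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
ys = [2.5, 2.9, 2.8, 3.1, 3.0, 3.4, 3.3, 3.6]

std_res = standardized_residuals(xs, ys)
outside = [e for e in std_res if abs(e) > 2.0]
print([round(e, 2) for e in std_res])
print(f"{len(outside)} of {len(std_res)} standardized residuals fall outside +/- 2")
```

For this small data set every standardized residual falls between -2 and 2, which is the behavior
expected when a linear model is a reasonable predictor.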
While residual plots often help to determine the appropriateness of the model chosen, difficulties
may arise. In reality, the residual plot may not have a random dispersion of residual values. In
this case, a nonlinear model may be more appropriate.
Perhaps the chosen model appears to fit the data fairly well, with the exception of a few outlying
values. These values may be drastically distorting the best-fit function.
Excluding those outliers in the data set could lead to a change in the original model. However,
this is harder to determine when working with multiple regression models [2].
Lastly, the model could possibly be a poor fit if one or multiple independent variables were not
considered or included in the data set. An example of this case can be seen when there is a
pattern in the residuals when the residuals are plotted against an omitted variable. From this, one
can incorporate the omitted variable in a multiple regression model [2]. As these cases show,
several modifications can be made to correct the irregularities found in residual
plots.
2.2 MULTIPLE REGRESSION
In the simple case of regression, the only concern was with the correlation between two variables.
However, this simple case can now be generalized into the multiple regression case. One would
use multiple regression to build a model that relates several independent variables to a single
dependent variable [2]. For instance, suppose that high school grade point average and SAT
score were thought to be good predictors of a student’s college grade point average. If a model
were constructed to help support this claim, it would be a multiple regression model. Let n
represent the number of independent variables, or “predictor variables,” to be studied, where n
has to be at least two. Denote these predictor variables as $x_1, x_2, x_3, \ldots, x_n$. Then the
general additive multiple regression model equation is
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n + \varepsilon$$
where $E(\varepsilon) = 0$ and $V(\varepsilon) = \sigma^2$. It is also assumed that $\varepsilon$
is normally distributed.
In the simple linear case, $\beta_0 + \beta_1 x$ describes the mean Y value as a function of x.
Similarly, the population regression function, $\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$,
gives the expected value of Y as a function of $x_1, x_2, x_3, \ldots, x_n$. The $\beta_i$'s are
called the population regression coefficients. The regression coefficient $\beta_i$ can be
understood as the expected change in Y associated with a one unit increase in $x_i$ while all
other variables are held constant [2].
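To make the model concrete, the sketch below estimates the coefficients of a two-predictor model
by solving the least squares normal equations with plain Gaussian elimination. It is a minimal
illustration, not the procedure used in this project: the function names, the predictor values
(high school grade point average and SAT score in hundreds), and the coefficients 0.4, 0.5 and 0.1
are all hypothetical. The responses are generated without the error term so that the fitted
coefficients can be checked against the ones used to build the data, and the predictors are
assumed not to be collinear.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bn] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical predictors: (high school GPA, SAT score in hundreds).
predictors = [(2.8, 9.5), (3.0, 11.0), (3.2, 10.0), (3.5, 12.0), (3.7, 11.5), (3.9, 13.0)]
# Hypothetical noise-free responses built from b0 = 0.4, b1 = 0.5, b2 = 0.1.
college_gpa = [0.4 + 0.5 * g + 0.1 * s for g, s in predictors]

coefficients = fit_multiple_regression(predictors, college_gpa)
print([round(c, 6) for c in coefficients])
```

With real data the error term is present, so the estimates would only approximate the population
regression coefficients.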
2.2.1 GENERAL ADDITIVE MULTIPLE REGRESSION MODEL
Suppose a statistician has acquired data on $y$, $x_1$, and $x_2$. From what we know about
multiple regression so far, one possible regression model is
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. Nevertheless, other models can be created
from $x_1$ and/or $x_2$, for example by adding squared or product terms. Thus, polynomial
regression is a special case of multiple regression. There are a total of four useful multiple
regression models [2].
2.2.2. FIRST-ORDER MODEL
The first model is the first-order model, where
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. This is the most clear-cut
generalization of the simple linear regression model. This model maintains that for a set
value of one variable, the expected value of Y is a linear function of the remaining variable.
Thus, when graphing the regression equation as a function of $x_1$ for various values of $x_2$,
the graph is a collection of parallel lines [2]. See Figure 3.
Figure 3. First-Order Model
2.2.3. SECOND-ORDER NO-INTERACTION MODEL
The next model is the second-order no-interaction model of the form
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \varepsilon.$$
According to this equation and assuming that $x_2$ is fixed, the expected change in Y for a one
unit increase in $x_1$ does not depend on $x_2$. The calculation follows,
$$\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 (x_1 + 1)^2 + \beta_4 x_2^2 - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2) = \beta_1 + \beta_3 + 2 \beta_3 x_1$$
Due to this lack of dependency on $x_2$, the contours of the second-order no-interaction model are
still parallel to one another. However, unlike the first-order model, the expected value is no
longer linearly dependent on $x_1$. Thus, the contours are curves rather than simple straight
lines [2].
Figure 4. Second-Order No-Interaction Model
2.2.4. FIRST-ORDER PREDICTORS AND INTERACTION
Unlike the first two models, the model with first-order predictors and interaction has nonparallel
contour lines. The model is expressed as
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$. Consider the change in
the expected value when $x_1$ is increased by one unit:
$$\beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 (x_1 + 1) x_2 - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2) = \beta_1 + \beta_3 x_2$$
Since this change depends on $x_2$, each contour line will have a different slope for different
values of $x_2$. In this model, the change in expected value as one variable increases depends on
the value of the other variable. This concept is called interaction. It is important to note that
if the model involves variable interaction, the earlier interpretation of the $\beta_i$'s no
longer applies, since increasing $x_i$ also changes the product term, so the remaining predictors
cannot all be held constant [2].
Figure 5. First-Order Predictors and Interaction Model
2.2.5 COMPLETE SECOND-ORDER MODEL
This model is also referred to as the full quadratic model. The general form for the complete
second-order model is
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon$.
In this model, the expected change in Y when $x_1$ is increased by one unit depends on both
variables, $x_1$ and $x_2$. By the same method used for the prior models, the expected change in Y
is $\beta_1 + \beta_3 + 2 \beta_3 x_1 + \beta_5 x_2$. Since this is a function of two variables,
the contour lines are both curved and nonparallel to one another [2].
Figure 6. Complete Second-Order Model
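The expected change in Y for a one unit increase in $x_1$ can be checked numerically for all four
models. The sketch below is only an illustration: the coefficient values are arbitrary numbers
chosen for this example, and the function names are not taken from any statistical package. Each
function returns the expected value of Y, so the error term is omitted.

```python
def first_order(b, x1, x2):
    return b[0] + b[1] * x1 + b[2] * x2


def second_order_no_interaction(b, x1, x2):
    return b[0] + b[1] * x1 + b[2] * x2 + b[3] * x1 ** 2 + b[4] * x2 ** 2


def first_order_interaction(b, x1, x2):
    return b[0] + b[1] * x1 + b[2] * x2 + b[3] * x1 * x2


def complete_second_order(b, x1, x2):
    return (b[0] + b[1] * x1 + b[2] * x2
            + b[3] * x1 ** 2 + b[4] * x2 ** 2 + b[5] * x1 * x2)


def change_in_x1(model, b, x1, x2):
    """Expected change in Y when x1 increases by one unit and x2 is held fixed."""
    return model(b, x1 + 1, x2) - model(b, x1, x2)


# Arbitrary illustrative coefficients b0..b5 and an arbitrary point (x1, x2).
b = [1.0, 2.0, -1.0, 0.5, 0.25, 0.75]
x1, x2 = 3.0, 4.0

print(change_in_x1(first_order, b, x1, x2))                  # b1
print(change_in_x1(second_order_no_interaction, b, x1, x2))  # b1 + b3 + 2*b3*x1
print(change_in_x1(first_order_interaction, b, x1, x2))      # b1 + b3*x2
print(change_in_x1(complete_second_order, b, x1, x2))        # b1 + b3 + 2*b3*x1 + b5*x2
```

Only the two models with a product term give a change that depends on $x_2$, which is why their
contour lines are not parallel.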
2.3. CATEGORICAL VARIABLES
All of the methods described so far assume that the variables being analyzed are numerical
(quantitative). However, it is possible that the chosen variables are categorical (qualitative).
In this particular project, there are several categorical variables: sex, race, sport played,
chosen major, and home state. Fortunately, there are methods for incorporating these categorical
variables into the analysis.
2.3.1. DICHOTOMOUS VARIABLES
First consider the simple case. Suppose the categorical variable being examined has two possible
categories, such as male or female. This type of variable with two categories is called a
dichotomous variable. With dichotomous variables, one must assign a dummy or indicator
variable x. This dummy variable has two possible values, zero or one, and indicates which
category applies to a given observation [2]. This concept is best explained through an
example.
Example: Suppose that it has been discovered that annual salary is dependent on the number of
years of experience and whether or not the employee has a college degree.
Let the dependent variable y = the annual salary and the independent variable $x_2$ = the number
of years of experience. Since the presence of a college degree is a categorical variable, let
$$x_1 = \begin{cases} 1 & \text{if the employee has a college degree} \\ 0 & \text{if the employee does not have a college degree} \end{cases}$$
Take for instance the linear regression model,
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon.$$
Thus, the mean value of the annual salary depends on whether the employee has a college degree:
mean salary = $\beta_0 + \beta_2 x_2$ when $x_1 = 0$ (doesn’t have a degree)
mean salary = $\beta_0 + \beta_1 + \beta_2 x_2$ when $x_1 = 1$ (has a degree)
Therefore, if the number of years of experience is held constant, then $\beta_1$ is the difference
in mean salary between having a college degree and lacking one. Thus, if $\beta_1 > 0$, on
average the employee with a college degree will have a higher annual salary than one without a
college degree. This can be seen in Figure 7.
Figure 7. Categorical (no interaction)
However, suppose the two variables interact. Then a possible model is
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon.$$
Therefore, the mean annual salaries are
mean salary = $\beta_0 + \beta_2 x_2$ when $x_1 = 0$ (doesn’t have a degree)
mean salary = $\beta_0 + \beta_1 + (\beta_2 + \beta_3) x_2$ when $x_1 = 1$ (has a degree)
For this model, the change in mean annual salary with a one year increase of experience depends
on whether or not the employee has a college degree. Thus, the two variables, years of experience
and presence of a college degree, interact. The interaction is displayed in Figure 8.
Figure 8. Categorical (interaction)
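The salary example can be sketched with a dummy variable in code. The coefficient values below are
invented purely for illustration, as is the function name; they are not estimates from any data.
Setting the interaction coefficient to zero gives the no-interaction model.

```python
def mean_salary(years, has_degree, b0, b1, b2, b3=0.0):
    """Mean salary b0 + b1*x1 + b2*x2 + b3*x1*x2, with x1 the degree dummy variable."""
    x1 = 1 if has_degree else 0  # dummy (indicator) variable
    x2 = years
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


# Hypothetical coefficients, for illustration only.
B0, B1, B2, B3 = 30000.0, 8000.0, 1500.0, 500.0

# No interaction: the gap between the two groups is b1 at every experience level.
gap_at_5 = mean_salary(5, True, B0, B1, B2) - mean_salary(5, False, B0, B1, B2)
gap_at_20 = mean_salary(20, True, B0, B1, B2) - mean_salary(20, False, B0, B1, B2)
print(gap_at_5, gap_at_20)  # 8000.0 8000.0

# Interaction: the gain from one more year of experience depends on the degree.
raise_with_degree = mean_salary(6, True, B0, B1, B2, B3) - mean_salary(5, True, B0, B1, B2, B3)
raise_without = mean_salary(6, False, B0, B1, B2, B3) - mean_salary(5, False, B0, B1, B2, B3)
print(raise_with_degree, raise_without)  # 2000.0 1500.0
```

In the no-interaction case the two salary lines are parallel, while in the interaction case the
line for employees with a degree has slope $\beta_2 + \beta_3$ instead of $\beta_2$.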
2.3.2. MULTI-CATEGORY VARIABLES
Suppose that there are more than two categories for the variable observed. For instance, assume
you have a variable with three categories. The first instinct is to create one variable and code
it with three values, zero, one and two, for the three categories. However, this is incorrect.
Doing so would impose an ordering on the groups that is not necessarily implied in the context
of the problem. The correct approach is to define two dummy variables [2].
For example, assume that y is once again the annual salary of an employee, $x_1$ = years of
experience, and that the level of college degree is being considered. Then
$$x_2 = \begin{cases} 1 & \text{B.A. or B.S.} \\ 0 & \text{otherwise} \end{cases} \qquad x_3 = \begin{cases} 1 & \text{Masters} \\ 0 & \text{otherwise} \end{cases}$$
Thus, in this example, if an employee has a B.A. or a B.S., $x_2 = 1, x_3 = 0$. If the employee
has a Masters, then $x_2 = 0, x_3 = 1$. It is then implied that if an employee has a PhD, then
$x_2 = 0, x_3 = 0$. The no-interaction model only considers $x_1, x_2, x_3$. However, the
interaction model below allows the mean change in annual salary associated with a one unit
increase in the number of years of experience to depend on the type of college degree the
employee has.
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \varepsilon$$
This was the three-category case. However, this case can be generalized to any number of
categories. Suppose that you have a categorical variable with n possible categories that you
wish to incorporate in the multiple regression model. This entails creating $n - 1$ dummy
variables. Consequently, adding even one categorical variable can add numerous predictors
to a model [2].
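The coding of a categorical variable with n categories into $n - 1$ dummy variables can be
sketched as follows. This is a minimal illustration, assuming the last category in the list is
treated as the baseline (all dummy variables equal to zero), as the PhD is in the example above;
the function name and category labels are chosen for this example only.

```python
def dummy_code(value, categories):
    """Return n - 1 dummy variables for a variable with n categories.

    The last entry of `categories` is the baseline and is coded as all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories[:-1]]


# The three degree levels from the example; the PhD is the baseline.
DEGREES = ["Bachelors", "Masters", "PhD"]

print(dummy_code("Bachelors", DEGREES))  # [1, 0]  ->  x2 = 1, x3 = 0
print(dummy_code("Masters", DEGREES))    # [0, 1]  ->  x2 = 0, x3 = 1
print(dummy_code("PhD", DEGREES))        # [0, 0]  ->  x2 = 0, x3 = 0
```

A variable such as sport played, with nine categories at Stetson, would add eight dummy variables
to the model under this scheme.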
CHAPTER 3
DISTRIBUTIONS
When analyzing a set of data, it is essential to estimate the underlying distribution. Many
tests and methods assume that the data have an underlying normal distribution, but this is not
always the case. Numerous different kinds of distributions
can be used to describe a particular data set. A few are the Normal, Hypergeometric, Poisson,
Gamma, Beta, Multinomial, and Chi-Square distributions. Each distribution calls for its own
methods of analysis [1].
3.1 NORMAL DISTRIBUTION
Since many of the measurements of variables are approximately normally distributed, the normal
distribution is one of the most important distributions in statistical analysis. A random variable,
x, is normally distributed if its probability density function is defined as

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad -\infty < x < \infty,$$

where $-\infty < \mu < \infty$ and $0 < \sigma < \infty$. A normal distribution is denoted $N(\mu, \sigma^2)$.
Since $\sigma$ is positive, $f(x) > 0$ [3]. When the probability density function, f(x), is graphed, the
area under the bell-shaped, symmetric curve is one. See Figure 9. This can be proven using
multivariable calculus, by converting to polar coordinates when integrating
$\int_{-\infty}^{\infty} f(x)\,dx$ [2]. In a
normal distribution, it is also true that $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$. In other words, $\mu$ and
$\sigma^2$ are the mean and variance of the distribution. This is proven using the moment-generating function of X [3].
Figure 9. Normal Distributions [B4]
A special case of the normal distribution is called the standard normal distribution. This is when
the mean of the distribution is zero and the variance is one. Like the normal distribution, the
standard normal distribution is also a bell-shaped symmetric curve. However, it peaks at $\mu = 0$.
See Figure 10 below.
Figure 10. Standard Normal Distribution [5]
The probability density function of a standard normal distribution with random variable Z is
given by

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2},$$

where $-\infty < z < \infty$.
The various probabilities involving Z can be calculated using the Z-score table found in any
statistics book. Let

$$\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy.$$
Calculating the probabilities can be aided by creating a visual, like the one in Figure 11. For
those distributions that are normal but not standard normal, the random variable X can be
standardized by

$$Z = \frac{X - \mu}{\sigma}.$$
But how can one discover the approximate underlying distribution of the data set [2]?
Figure 11. Normal Curve Probabilities [5]
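As an alternative to a printed Z-score table, the standard normal probabilities can be evaluated numerically. The sketch below is only an illustration: it relies on the identity $\Phi(z) = \tfrac{1}{2}\left[1 + \operatorname{erf}(z/\sqrt{2})\right]$, and the function names and the example values $\mu = 100$, $\sigma = 15$ are hypothetical.

```python
import math


def phi(z):
    """Standard normal CDF, Phi(z) = P(Z <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def standardize(x, mu, sigma):
    """Convert a value from N(mu, sigma^2) to a z-score."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


def normal_interval_prob(a, b, mu, sigma):
    """P(a <= X < b) for X ~ N(mu, sigma^2)."""
    return phi(standardize(b, mu, sigma)) - phi(standardize(a, mu, sigma))


# Hypothetical example: X ~ N(100, 15^2), probability of falling within one
# standard deviation of the mean.
print(round(phi(0.0), 4))
print(round(normal_interval_prob(85, 115, 100, 15), 4))
```

Standardizing first and then calling `phi` mirrors the table-lookup procedure described above, so the two approaches should agree up to the rounding of the table.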
3.2. DETERMINING UNDERLYING DISTRIBUTIONS
In order to provide some insight into the underlying distribution of a data set, Karl Pearson
suggested a method using the chi-square statistic and hypothesis testing, which tests the
suitability of a probabilistic model [3]. The chi-square goodness-of-fit test can determine whether
the underlying distribution is what is assumed in the null hypothesis, or whether it is another
distribution [2].
3.2.1. HYPOTHESIS TESTING AND SIGNIFICANCE LEVELS
Recall that a hypothesis is simply an educated guess regarding the problem statement.
With this in mind, we start by creating a null hypothesis, denoted $H_0$. The null hypothesis is the
initial assumption about a parameter of the population being sampled. The alternative hypothesis, $H_a$, is an
assertion that contradicts $H_0$. Based on a specified significance level, $\alpha$, one either
rejects the null hypothesis or fails to reject the null hypothesis [2].
Once the null and alternative hypotheses are determined, a test statistic is calculated from the
sample. If the test statistic falls in the critical region, then the null hypothesis is rejected [3].
The critical regions are defined in Tables 2 and 3. Note that rejecting the null hypothesis does not
prove that the alternative hypothesis is true; it indicates only that the data are inconsistent with
$H_0$ at the chosen significance level.
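| $H_0$ | $H_a$ | Critical Region |
|---|---|---|
| $p = p_0$ | $p > p_0$ | $z = \dfrac{y/n - p_0}{\sqrt{p_0(1 - p_0)/n}} \ge z_\alpha$ |
| $p = p_0$ | $p < p_0$ | $z = \dfrac{y/n - p_0}{\sqrt{p_0(1 - p_0)/n}} \le -z_\alpha$ |
| $p = p_0$ | $p \ne p_0$ | $\lvert z \rvert = \dfrac{\lvert y/n - p_0 \rvert}{\sqrt{p_0(1 - p_0)/n}} \ge z_{\alpha/2}$ |

Table 2. Hypothesis Testing for One Proportion [3]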
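The decision rules of Table 2 can be sketched in Python as follows. This is an illustration under stated assumptions: y is taken to be the number of successes in n trials, the critical values $z_\alpha$ come from the standard library's `statistics.NormalDist`, and the function names and the sample counts (60 successes in 100 trials) are hypothetical.

```python
import math
from statistics import NormalDist


def one_proportion_z(y, n, p0):
    """z = (y/n - p0) / sqrt(p0 (1 - p0) / n) for y successes in n trials."""
    return (y / n - p0) / math.sqrt(p0 * (1 - p0) / n)


def reject_h0(z, alpha, alternative):
    """Apply the critical regions of Table 2.

    alternative is "greater" (p > p0), "less" (p < p0), or "two-sided".
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":
        return z >= std_normal.inv_cdf(1 - alpha)
    if alternative == "less":
        return z <= -std_normal.inv_cdf(1 - alpha)
    if alternative == "two-sided":
        return abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError(f"unknown alternative: {alternative!r}")


# Hypothetical data: 60 successes in 100 trials, testing H0: p = 0.5
z = one_proportion_z(60, 100, 0.5)
print(round(z, 3), reject_h0(z, 0.05, "two-sided"))
```

Keeping the statistic and the decision rule in separate functions makes it easy to reuse the same rule with a different test statistic.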
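| $H_0$ | $H_a$ | Critical Region |
|---|---|---|
| $p_1 = p_2$ | $p_1 > p_2$ | $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}} \ge z_\alpha$ |
| $p_1 = p_2$ | $p_1 < p_2$ | $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}} \le -z_\alpha$ |
| $p_1 = p_2$ | $p_1 \ne p_2$ | $\lvert z \rvert = \dfrac{\lvert \hat{p}_1 - \hat{p}_2 \rvert}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}} \ge z_{\alpha/2}$ |

Table 3. Hypothesis Testing for Two Proportions [3]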
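The two-proportion statistic of Table 3 can be computed along the same lines. The sketch below assumes that $\hat{p}$ is the pooled sample proportion $(y_1 + y_2)/(n_1 + n_2)$, which is the usual convention for this test; the function name and the sample counts are hypothetical.

```python
import math


def two_proportion_z(y1, n1, y2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion (an assumption)."""
    p1_hat = y1 / n1
    p2_hat = y2 / n2
    pooled = (y1 + y2) / (n1 + n2)  # estimate of the common p under H0
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / std_error


# Hypothetical data: 45 of 100 successes in group 1, 30 of 100 in group 2
z = two_proportion_z(45, 100, 30, 100)
print(round(z, 4))
```

The resulting z is then compared with $z_\alpha$, $-z_\alpha$, or $z_{\alpha/2}$ according to the alternative hypothesis, exactly as in the one-proportion case.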
3.2.2. CHI-SQUARE DISTRIBUTION
The chi-square distribution has only one parameter, m. This parameter is called the number of
degrees of freedom of the distribution, where $m = 1, 2, 3, 4, \ldots$ For a test at significance level
$\alpha$, the critical region lies to the right of the critical value $\chi^2_{\alpha,m}$, which is chosen
so that the area under the $\chi^2_m$ density curve to its right equals $\alpha$.
The chi-square distribution is not symmetric [2]. See Figure 12 below. It is
also important to note that this distribution has a mean of m and a variance of 2m [1].
Figure 12. Chi-Square Distribution
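The stated mean and variance can be checked informally by simulation. The sketch below assumes the standard characterization of a chi-square random variable with m degrees of freedom as the sum of m squared independent standard normal variables; the function names, the seed, and the number of draws are arbitrary choices for illustration.

```python
import random
from statistics import fmean, pvariance


def chi_square_draw(m, rng):
    """One draw from chi-square(m): a sum of m squared N(0, 1) values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))


def simulate(m, draws=20000, seed=2005):
    """Return the sample mean and variance of simulated chi-square(m) draws."""
    rng = random.Random(seed)
    sample = [chi_square_draw(m, rng) for _ in range(draws)]
    return fmean(sample), pvariance(sample)


for m in (2, 4, 8):
    mean, var = simulate(m)
    # The two printed values should be close to m and 2m respectively.
    print(m, round(mean, 2), round(var, 2))
```

Because the draws are random, the printed values only approximate m and 2m; increasing `draws` tightens the agreement.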
3.2.3. GOODNESS-OF-FIT TEST
When analyzing data, it is essential to know the approximate underlying distribution of the data
set. Suppose that each observation in a large data set falls into exactly one of m categories, and
let $p_i$ represent the probability that a randomly selected observation will be of type i, where
$i = 1, 2, 3, \ldots, m$. From the properties of probability, we know the following:

$$p_i \ge 0 \text{ for all } i = 1, 2, 3, \ldots, m \quad \text{and} \quad \sum_{i=1}^{m} p_i = 1.$$

Let $p_1^0, p_2^0, p_3^0, \ldots, p_m^0$ be specified numbers such that

$$p_i^0 > 0 \text{ for all } i = 1, 2, 3, \ldots, m \quad \text{and} \quad \sum_{i=1}^{m} p_i^0 = 1.$$

Since the goodness-of-fit test measures the discrepancy between the observed values and the
values expected when the null hypothesis is true, the hypotheses are

$$H_0: p_i = p_i^0 \text{ for all } i = 1, 2, 3, \ldots, m \qquad H_a: p_i \ne p_i^0 \text{ for at least one } i.$$

If we let $N_i$ be the actual number of observations of type i out of n total observations, then the
test statistic used is

$$Q = \sum_{i=1}^{m} \frac{(N_i - n p_i^0)^2}{n p_i^0} = \sum \frac{(\text{observed} - \text{predicted})^2}{\text{predicted}}.$$

Karl Pearson found that if the null hypothesis is true, then as the sample size becomes very large,
the distribution of Q is approximately the chi-square distribution with $m - 1$ degrees of freedom. Thus,
after choosing a significance level $\alpha_0$, let c be the $1 - \alpha_0$ quantile of the chi-square
distribution with $m - 1$ degrees of freedom. If $Q \ge c$, then the null hypothesis should be
rejected. However, before drawing a firm conclusion from the test, it is necessary to check
whether any other reasonable distribution fits the observed data better [2].
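The statistic Q is straightforward to compute once the observed counts and the null probabilities are known. The following sketch is an illustration only: the function name and the die-rolling data are hypothetical, and the critical value 11.070 is the tabulated upper 0.05 point of the chi-square distribution with 5 degrees of freedom.

```python
def pearson_q(observed, null_probs):
    """Pearson's statistic Q = sum (N_i - n p_i0)^2 / (n p_i0)."""
    if len(observed) != len(null_probs):
        raise ValueError("observed and null_probs must have the same length")
    if abs(sum(null_probs) - 1.0) > 1e-9:
        raise ValueError("null probabilities must sum to 1")
    n = sum(observed)
    q = 0.0
    for n_i, p_i0 in zip(observed, null_probs):
        expected = n * p_i0  # predicted count under H0
        q += (n_i - expected) ** 2 / expected
    return q


# Hypothetical data: a die rolled 60 times, H0: all six faces equally likely
observed = [8, 12, 9, 11, 6, 14]
null_probs = [1 / 6] * 6
q = pearson_q(observed, null_probs)

CRITICAL_05_DF5 = 11.070  # tabulated chi-square value, alpha = 0.05, m - 1 = 5
print(round(q, 3), q >= CRITICAL_05_DF5)
```

Here the statistic falls well below the critical value, so the hypothetical die data would not lead to rejecting the null hypothesis at the 0.05 level.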
CHAPTER 4
PRELIMINARY ANALYSIS
Upon receiving the data set, I performed a brief analysis focused primarily on comparing
student-athletes' Stetson University grade point averages with the percentage of tuition covered
by athletic scholarships. Rather than examining all student-athletes at Stetson University over the
past seven years, I decided to take a small sample of the data set. I chose to look specifically at
baseball.
4.1 LINEAR REGRESSION
The data set included multiple variables that were determined to have a possible influence on the
student-athlete’s Stetson University grade point average. I decided to see if there was a
correlation between the percentage of cost to attend Stetson University covered by athletic
scholarships granted and the grade point averages. Therefore, I pulled the baseball players from
the data set and plotted the percentages versus the grade point average of the baseball players
using Microsoft Excel. See Figure 13 for this plot.
Figure 13. Baseball Percentage vs. GPA (scatter plot; x-axis: Percentage of Tuition Covered by Athletic Scholarships, y-axis: Stetson GPA)
Is there a model that could predict a baseball player's grade point average based on the
percentage of the cost of schooling covered by his athletic scholarship? I chose to try a few
different models to see if there was any correlation between these two variables. I tried various
polynomial regression models; however, there was no significant difference between the models.
Thus, only the result of the linear model is shown in Figure 14.
Figure 14. Baseball Regression Models
Looking at the linear regression equation and plot in Figure 14, the model appears to fit the data
rather poorly. A correlation coefficient of 0.4352 does not imply a strong correlation between
percentage of tuition covered by athletic scholarship and Stetson grade point average. Typically, a
correlation coefficient of 0.7 or higher is considered a good correlation. However, this alone is not
enough to discard this model.
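The fit in this section was produced with Microsoft Excel. For readers who want to see the underlying computation, the following sketch calculates the least-squares line and the correlation coefficient directly. The data points are hypothetical and are not the Stetson baseball data; the function name is likewise an assumption.

```python
import math


def linear_fit(xs, ys):
    """Return the least-squares slope, intercept, and correlation coefficient r."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical (fraction of tuition covered, GPA) pairs
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2.4, 2.9, 2.6, 3.1, 3.0]
slope, intercept, r = linear_fit(xs, ys)
print(round(slope, 3), round(intercept, 3), round(r, 4))
```

The same three sums of squares drive both the slope and r, which is why a shallow but consistent trend can still produce a large correlation coefficient.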
4.2. RESIDUAL ANALYSIS
Before discarding the model, we will look at the corresponding residual plot. Here are those
residual values plotted against the corresponding x-values.
Figure 15. Baseball Residual Plots
Since these residual values are randomly dispersed in the plot above, the chosen
model may not be as poor a fit as originally stated. However, there may be a better model
than the polynomial equations first tried. It is also possible that these two variables are
simply not strongly correlated. This will be examined further next semester.
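A residual is simply the observed value minus the value predicted by the fitted line. The sketch below computes the residuals for a small hypothetical data set (not the Stetson baseball data); the function names are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys):
    """Observed minus fitted value for each data point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical (fraction of tuition covered, GPA) pairs
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2.4, 2.9, 2.6, 3.1, 3.0]
res = residuals(xs, ys)
for x, e in zip(xs, res):
    print(f"{x:.2f}  {e:+.3f}")
```

Least-squares residuals from a line with an intercept always sum to zero, so what matters in a residual plot is whether they show a pattern, such as a curve or a funnel shape, rather than their overall level.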
4.3. GOODNESS-OF-FIT TEST
With the baseball sample taken from the complete data set of all student-athletes, I wanted to
apply the goodness-of-fit test to see if the underlying distribution was approximately normal.
Since I focused on the percentage of tuition covered by athletic scholarships and the grade point
average in the regression models, I decided to see if the baseball players’ grade point averages
were normally distributed.
I started by binning the data. I created three categories for the observed grade point averages to
fall in. They were
Bin 1: GPA is in the interval [1.5, 2.25)
Bin 2: GPA is in the interval [2.25, 3.0)
Bin 3: GPA is greater than or equal to 3.0
The frequency of each bin is listed below in Table 4.
| Bin | Frequency |
|---|---|
| 1 | 25 |
| 2 | 32 |
| 3 | 30 |
| Total | 87 |

Table 4. Frequency Table
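The binning step can be expressed compactly in code. In the sketch below, the bin boundaries are the ones defined above, while the function names and the list of grade point averages are hypothetical. Values below 1.5 belong to no bin and are skipped.

```python
def gpa_bin(gpa):
    """Return the bin number (1, 2, or 3) for a GPA, or None if it is below 1.5."""
    if gpa >= 3.0:
        return 3
    if gpa >= 2.25:
        return 2
    if gpa >= 1.5:
        return 1
    return None


def bin_frequencies(gpas):
    """Count how many GPAs fall into each of the three bins."""
    counts = {1: 0, 2: 0, 3: 0}
    for gpa in gpas:
        bin_number = gpa_bin(gpa)
        if bin_number is not None:
            counts[bin_number] += 1
    return counts


# Hypothetical grade point averages, not the Stetson data
print(bin_frequencies([1.6, 2.25, 2.9, 3.0, 3.7]))
```

Checking the largest boundary first keeps the half-open intervals explicit: a GPA of exactly 2.25 lands in Bin 2 and a GPA of exactly 3.0 lands in Bin 3.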
Once I had the observed values categorized, I then calculated the mean and standard deviation of
this sample set:

$$\bar{x} = 2.687437, \qquad s_x = 0.624282$$
With these results I was able to use the assumption that the distribution is normal to calculate the
probability that an observed grade point average will fall into each bin. Here are the calculated
probabilities.

$$P(1.5 \le X < 2.25) = \Phi\left(\frac{2.25 - 2.687437}{0.624282}\right) - \Phi\left(\frac{1.5 - 2.687437}{0.624282}\right) = \Phi(-0.701) - \Phi(-1.902) = 0.2133$$

$$P(2.25 \le X < 3.0) = \Phi\left(\frac{3.0 - 2.687437}{0.624282}\right) - \Phi\left(\frac{2.25 - 2.687437}{0.624282}\right) = \Phi(0.501) - \Phi(-0.701) = 0.4495$$

$$P(X \ge 3.0) = 1 - P(X < 3.0) = 1 - \Phi\left(\frac{3.0 - 2.687437}{0.624282}\right) = 1 - \Phi(0.501) = 0.3085$$

Now that the probabilities have been calculated and the data categorized, we must check that
the expected value in each bin is at least five in order for the goodness-of-fit test to be an
accurate approximation [1]. Table 5 confirms this.

| Bin | Frequency | Probability | Expected Value |
|---|---|---|---|
| 1 | 25 | 0.2133 | 18.5571 |
| 2 | 32 | 0.4495 | 39.1065 |
| 3 | 30 | 0.3085 | 26.8395 |

Table 5. Expected Values

The hypotheses for the goodness-of-fit test can now be constructed. They are as follows:

$$H_0: \text{the grade point averages are normally distributed} \qquad H_a: \text{the grade point averages are not normally distributed}$$

We can now calculate the $\chi^2$ test statistic, Q:

$$Q = \sum_{i=1}^{m} \frac{(N_i - n p_i^0)^2}{n p_i^0} = \sum \frac{(\text{observed} - \text{predicted})^2}{\text{predicted}}$$

$$Q = \frac{(25 - 18.5571)^2}{18.5571} + \frac{(32 - 39.1065)^2}{39.1065} + \frac{(30 - 26.8395)^2}{26.8395}$$

$$Q = 2.2369 + 1.2914 + 0.3722 = 3.9005$$

Recall that when deciding whether to reject or fail to reject the null hypothesis, you must
choose a significance level $\alpha$, which determines the critical value $\chi^2_{\alpha,m-1}$. If
$Q \ge \chi^2_{\alpha,m-1}$, you reject the null hypothesis. In this case, a test with a significance
level of 0.10 gives $\chi^2_{0.10,2} = 4.605$. Since 3.9005 is less than 4.605, we fail to reject the
null hypothesis, which assumes that this distribution is normal. The test therefore does not provide
sufficient evidence, at the 0.10 level, that the distribution of Stetson baseball players' grade
point averages is not normal. However, it appears that the sample of grade point averages is also
approximately uniform. The high standard deviation has spread the data to where there is no
significant peak around the mean.
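As a check on the hand calculation, the whole test can be recomputed from the reported mean, standard deviation, and bin frequencies. The sketch below is an illustration, not the procedure used in this paper: it evaluates $\Phi$ numerically with `statistics.NormalDist` instead of a Z-score table, so its digits may differ slightly from table lookups, and it uses the fact that for 2 degrees of freedom the chi-square critical value has the closed form $-2\ln\alpha$. The function names are assumptions.

```python
import math
from statistics import NormalDist

MEAN, SD = 2.687437, 0.624282  # sample mean and standard deviation from the text
OBSERVED = [25, 32, 30]        # bin frequencies from Table 4
EDGES = [1.5, 2.25, 3.0]       # lower edge of each bin; the last bin is open-ended


def bin_probabilities(mean, sd, edges):
    """Probability of each bin under N(mean, sd^2)."""
    dist = NormalDist(mean, sd)
    cdf_values = [dist.cdf(edge) for edge in edges] + [1.0]
    return [cdf_values[i + 1] - cdf_values[i] for i in range(len(edges))]


def pearson_q(observed, probs):
    """Q = sum (observed - predicted)^2 / predicted, with predicted = n * p."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def chi_square_critical_df2(alpha):
    """Upper-alpha critical value for 2 degrees of freedom.

    With 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2),
    so the critical value is -2 ln(alpha).
    """
    return -2.0 * math.log(alpha)


probs = bin_probabilities(MEAN, SD, EDGES)
q = pearson_q(OBSERVED, probs)
critical = chi_square_critical_df2(0.10)
print([round(p, 4) for p in probs])
print(round(q, 4), round(critical, 3), "reject H0" if q >= critical else "fail to reject H0")
```

Because each bin probability is a difference of two neighboring CDF values, the three probabilities sum to one minus the probability of a grade point average below 1.5, which gives a quick consistency check on any hand computation.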
CHAPTER 5
CONCLUSIONS
With the research and data collection I was able to accomplish this semester, I have a better
idea of my plans for next semester. Ideally, I would like to construct a model that could estimate
incoming student-athletes' grade point averages at Stetson University given numerous
variables: sport played, major declared, percentage of tuition covered by athletic scholarship,
percentage of tuition covered by non-athletic scholarship, and whether the student is from in-state
or out-of-state.
After doing some preliminary analysis, I understand that this may not be feasible. Therefore, I
am also interested in examining any correlations among the variables and creating
regression models to describe them. I also plan to compare variable averages
among different sports and between males and females. Comparing means in this way requires
analysis of variance, which uses hypothesis testing to compare the
means of two or more normal distributions. It is a method that I plan to study further in the fall.
REFERENCES
[1] M. H. DeGroot and M. J. Schervish, Probability and Statistics: Third Edition. Addison
Wesley, Boston, MA, 2002.
[2] J. L. Devore, Probability and Statistics for Engineering and the Sciences: Sixth Edition.
Thomson Learning, Inc, Belmont, CA, 2004.
[3] R. V. Hogg and E. A. Tanis, Probability and Statistical Inference: Sixth Edition. Prentice
Hall, Upper Saddle River, NJ, 2001.
[4] I. Miller and M. Miller, John E. Freund’s Mathematical Statistics with Applications: Seventh
Edition. Pearson Prentice Hall, Upper Saddle River, NJ, 2004.
[5] The Normal Distribution, April 5, 2005. http://www.stat.yale.edu/Courses/199798/101/normal.htm
[6] “Normal Distribution,” Wikipedia, March 31, 2005. April 5, 2005.
http://en.wikipedia.org/wiki/Normal_distribution
[7] Student-Athlete Handbook 2004-2005, February 6, 2005.
http://www.stetson.edu/athletics/home/extras/handbook.pdf
[8] Stetson University Bulletin Archives, March 30, 2005.
http://www.stetson.edu/other/bulletin/archives.php
BIOGRAPHICAL SKETCH
April Coates is a junior at Stetson University. She is majoring in mathematics with a minor in
education. Her activities at Stetson include Delta Kappa Pi, the QED math club, Fellowship
of Christian Athletes, Resident Assistant, Teacher's Assistant, math tutoring, and intramurals.
She enjoys working out, playing sports, going to the beach, and taking road trips. She is
undecided between attending graduate school and starting a career as a teacher after graduation.