More Probability Models for the NCAA Regional Basketball Tournaments
Neil C. Schwertman; Kathryn L. Schenk; Brett C. Holbrook
The American Statistician, Vol. 50, No. 1. (Feb., 1996), pp. 34-38.
Stable URL:
http://links.jstor.org/sici?sici=0003-1305%28199602%2950%3A1%3C34%3AMPMFTN%3E2.0.CO%3B2-2
The American Statistician is currently published by American Statistical Association.
Sports events and tournament competitions provide excellent opportunities for model building and using basic
statistical methodology in an interesting way. In this article, National Collegiate Athletic Association (NCAA) regional basketball tournament data are used to develop simple linear regression and logistic regression models using
seed position for predicting the probability of each of the
16 seeds winning the regional tournament. The accuracy of
these models is assessed by comparing the empirical probabilities not only to the predicted probabilities of winning
the regional tournament but also the predicted probabilities
of each seed winning each contest.
KEY WORDS: Basketball; Logistic regression; Regression.
1. INTRODUCTION
Enthusiasm for the study of probability is enhanced when
the concepts are illustrated by real examples of interest to
students. Athletic competitions afford many such opportunities to demonstrate the concepts of probability and have
been extensively studied in the literature; see, for example,
Mosteller (1952), Searls (1963), Moser (1982), Monahan
and Berger (1977), David (1959), Glenn (1960), Schwertman, McCready, and Howard (1991), and Ladwig and
Schwertman (1992). One excellent probability analysis opportunity for use in the classroom occurs each spring when
"March Madness," as the media calls it, occurs. "March
Madness" is the National Collegiate Athletic Association
(NCAA) regional and Final Four basketball tournaments
that culminate in a National Collegiate Championship game.
The NCAA selects (actually, certain conference champions or tournament winners are included automatically) 64
teams, 16 for each of 4 regions, to compete for the national
championship. The NCAA committee of experts not only
selects the 64 teams from 292 teams in Division 1-A, but assigns a seed position to each team in the four regions based
on their consensus of team strengths. The format for each
regional tournament is predetermined following the pattern
in Figure 1, where the number one seed (strongest team)
plays the sixteenth seed (weakest team), the number two
seed (next strongest team) plays the fifteenth seed (second
weakest), etc. The experts attempt to evenly distribute the
teams to the regional tournament to achieve parity in the
quality of each region.
Neil C. Schwertman is Chairman and Professor of Statistics, Department of Mathematics and Statistics, California State University, Chico,
CA 95929-0525. Kathryn L. Schenk is Instructional Support Coordinator, Computer Center, California State University, Chico, CA 95929-0525.
Brett C. Holbrook is Student, Department of Experimental Statistics, New
Mexico State University, Las Cruces, NM 88003-0003.
Schwertman et al. (1991) suggested three rather ad hoc
probability models that predicted remarkably well the empirical probability of each seed winning its regional tournament and advancing to the "final four." The validity of
the three models was measured only by each seed's probability of winning its regional tournament. In this article
we use the NCAA regional basketball tournament data as
an example to illustrate ordinary least squares and logistic
regression in developing prediction models. The parameter estimates for the simple models considered are based
on the 600 games played (1985-1994) during the first ten
years using the 64-team format. The validity of the eight new
empirical models and the three previous models in Schwertman et
al. (1991) is assessed by comparing the empirical probabilities not only to the predicted probabilities for each seed
winning the regional tournament but also to the predicted
probabilities of each seed winning each contest.
2. TOURNAMENT ANALYSIS
Predicting the probability of each seed winning the regional tournament (and advancing to the final four) requires
the consideration of all possible paths and opponents. Even
though there are 16 teams in each region, the single elimination format (only the winning team survives in the tournament, i.e., one loss and the team is eliminated) is relatively
easy to analyze compared to a double-elimination format.
[See, for example, the analysis of the college baseball world
series by Ladwig and Schwertman (1992).] In the first game
each seed has only one possible opponent, but in the second game there are two possible opponents, four possible
in the third game and eight possible in the regional finals.
Hence there are $1 \cdot 2 \cdot 4 \cdot 8 = 64$ potential sets of opponents for each seed to play in order to eventually win the
regional tournament. For example, for the number 2 seed
to win, it must defeat seed 15 in game 5, either 7 or 10 in
game 11, either 3, 14, 6, or 11 in game 14, and either 1,
16, 8, 9, 4, 13, 5, or 12 in game 15. The probability analysis for the regional championship must include not only
the probability of defeating each potential opponent, but
also the probability of each potential opponent advancing
to that particular game. To illustrate, suppose the second
seed wins the regional tournament by defeating seeds 15,
7, 6, and 1 in games 5, 11, 14, and 15, respectively. Then the
probability that this occurs is
$P(2,15) \cdot P(2,7) \cdot P(\text{7 plays in game 11}) \cdot P(2,6) \cdot P(\text{6 plays in game 14}) \cdot P(2,1) \cdot P(\text{1 plays in game 15})$,
where $P(i,j)$ is the probability that an $i$th
seed defeats a $j$th seed and $P(j,i) = 1 - P(i,j)$. A more
detailed explanation of the various paths and the associated probability analysis is contained in Schwertman et al.
(1991). As in that article and almost all such probability analyses of athletic competitions, we assume that the games are
independent and the probabilities remain constant throughout the tournament. To complete the analysis we now must
find probability models for determining P ( i , j ) .
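The path analysis described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the bracket order is inferred from the game numbers quoted in the text (seed 2 meets 15, then 7 or 10, then 3, 14, 6, or 11), the function names `advance` and `regional_win_probs` are hypothetical, and the game model plugged in is $P(i,j) = j/(i+j)$, the model 9 of Schwertman et al. (1991) quoted later in this article, used here only as a placeholder. The recursion relies on the independence assumption stated above.

```python
# Bracket order inferred from the game numbers quoted in the text
# (an assumption; adjacent entries meet in the first round).
BRACKET = [1, 16, 8, 9, 4, 13, 5, 12, 2, 15, 7, 10, 3, 14, 6, 11]


def advance(slots, p):
    """Return {seed: probability that seed emerges from this sub-bracket}."""
    if len(slots) == 1:
        return {slots[0]: 1.0}
    half = len(slots) // 2
    top = advance(slots[:half], p)
    bottom = advance(slots[half:], p)
    out = {}
    for side, other in ((top, bottom), (bottom, top)):
        for i, reach_i in side.items():
            # P(i reaches this game) times the sum, over possible opponents j,
            # of P(j reaches this game) * P(i defeats j).
            out[i] = reach_i * sum(
                reach_j * p(i, j) for j, reach_j in other.items()
            )
    return out


def regional_win_probs(p, bracket=BRACKET):
    """Probability of each seed winning the regional under game model p(i, j)."""
    return advance(list(bracket), p)


def model9(i, j):
    """Placeholder game model: P(i, j) = j / (i + j)."""
    return j / (i + j)


probs = regional_win_probs(model9)
for seed in sorted(probs):
    print(seed, round(probs[seed], 4))
```

Any other function satisfying $P(j,i) = 1 - P(i,j)$ can be passed in place of `model9`; the returned probabilities sum to 1 across the 16 seeds.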
3. PROBABILITY MODELS
The purpose of the probability models is to incorporate
the relative strength of the teams in estimating the probability of each team winning in each game. It seems reasonable
to use some function of seed positions because these were
determined by a consensus of experts. In order to have the
broadest possible use in the classroom we use the simplest
linear straight line model $E(Y) = \beta_0 + \beta_1 (S(i) - S(j))$,
where $S(i)$ is some function of $i$, the team's seed position.
Clearly, multiple regression models could be used for more
advanced classes, but our basic model is appropriate even
for most introductory classes. In addition to the three models used in Schwertman et al. (1991), eight other models for
assigning probabilities of success in each individual game
are considered. The eight models are defined by the $2^3$ possible combinations of three factors: (1) type of regression
(ordinary or logistic), (2) type of intercept $\beta_0$ (estimated
or specified constant), and (3) type of independent variable
(linear or nonlinear function of seed positions).
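To make the "estimated versus specified intercept" factor concrete, the following is a minimal Python sketch of ordinary least squares for the straight-line model, with the intercept either estimated or held at a fixed value such as .5. The function name `fit_line` and the small data set are hypothetical and serve only as an illustration; they are not the tournament data or the authors' code.

```python
def fit_line(x, y, intercept=None):
    """Least-squares fit of y = b0 + b1 * x.

    If intercept is None, both b0 and b1 are estimated.
    Otherwise b0 is held at the given value and only b1 is estimated.
    """
    n = len(x)
    if intercept is None:
        x_bar = sum(x) / n
        y_bar = sum(y) / n
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        b1 = sxy / sxx
        b0 = y_bar - b1 * x_bar
    else:
        b0 = intercept
        # Minimizing sum((y - b0 - b1*x)^2) over b1 alone gives this ratio.
        b1 = sum(xi * (yi - b0) for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return b0, b1


# Hypothetical toy data: x is a seed difference, y is 1 if the stronger team won.
x = [15, 13, 11, 9, 7, 5, 3, 1, 1, 3]
y = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]

print(fit_line(x, y))                 # intercept estimated
print(fit_line(x, y, intercept=0.5))  # intercept specified as .5
```

Fixing the intercept at .5 forces the fitted probability to be one half when the two teams have equal transformed seed values, which is the motivation for the specified-constant variants.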
There are obviously many functions of seed position that
could be used in the models. The choice of $S(i)$ and $S(j)$ is
quite arbitrary. We have chosen two rather simple functions
for our investigation that fortunately provide excellent predictor models. The first is $S_1(i) = -i$ for all $i$, which is simply
using the difference in seed position as a single independent
variable. This function of seed position suggests a linearity
in team strengths, for example, the difference in strength between seeds 1 and 3 is the same as between seeds 14 and 16.
Intuitively it seems likely that there is a greater difference
in quality between a number 1 and 3 seed than between a
14 and 16 seed. Thus this linearity may not be appropriate,
and therefore we consider a nonlinear function $S_2(i)$ for
incorporating team strength. Since the normal distribution
occurs naturally in describing many random variables and is
included in most introductory classes, it seems reasonable
to use the normal distribution to describe a nonlinear relationship in team strengths. If we assume that the strength of
the 292 teams is normally distributed and that the experts
properly ordered the top 64 teams for the tournament from
229 to 292 with 229 the weakest and 292 the strongest,
we can then determine a percentile and a corresponding z
score for that percentile. That is, adding a correction for
continuity to 292, $S_2(i) = \Phi^{-1}((294.5 - 4i)/292.5)$, where
$\Phi$ is the cumulative distribution function of the standard
normal. For example, the transformed seed position z score
for the number 1 seeds (289, 290, 291, 292 when ordered
from weakest to strongest) was calculated from the percentile of this group's median, that is, 290.5/292.5 = 99.316
percentile, which corresponds to a z score of 2.466. Similarly, the number 2 seed's (285, 286, 287, 288) z score is
calculated from the 286.5/292.5 = 97.9487 percentile, corresponding to a z score of 2.044. For seeds 3-16 the corresponding z scores are: 1.823, 1.666, 1.542, 1.438, 1.348,
1.267, 1.194, 1.127, 1.064, 1.006, .951, .898, .848, and .800,
respectively.
[Figure 2. Points in Scatterplot May Represent as Few as 1 or as Many as 97 Games.]
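The normal-score transformation can be checked with a short Python sketch. It assumes only the formula stated above, with the percentile for seed $i$ taken as $(294.5 - 4i)/292.5$; the function name `s2` is a hypothetical label, and the standard library's `statistics.NormalDist` supplies the inverse normal cumulative distribution function.

```python
from statistics import NormalDist

N_TEAMS = 292  # number of teams assumed in the article


def s2(seed):
    """Normal score for a seed: inverse normal CDF of (294.5 - 4*seed)/292.5."""
    percentile = (N_TEAMS + 2.5 - 4 * seed) / (N_TEAMS + 0.5)
    return NormalDist().inv_cdf(percentile)


for seed in range(1, 17):
    print(seed, round(s2(seed), 3))
```

The printed values should agree closely with the z scores listed in the text; a difference in the third decimal place is possible because of rounding in the published figures.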
The dependent variable is 1 if the lower seed (stronger
team) defeats the higher seed and 0 otherwise. It should
be noted that the logistic regression models (models 5-
[Figure 1. NCAA Regional Basketball Tournament Pairings.]
[Figure 3. Points in Scatterplot Represent as Few as 1 or as Many as 40 Games.]
[Table 2. Predicted Probabilities of Each Seed Winning the NCAA Regional Basketball Tournament. Table body not recoverable from the extracted text.]
$P_3(i,j) = .530385 + .286507(S_2(i) - S_2(j))$
$P_4(i,j) = .5 + .317258(S_2(i) - S_2(j))$
$P_5(i,j) = 1/(1 + e^{.0328 - .1770(j-i)})$
$P_6(i,j) = 1/(1 + e^{-.1727(j-i)})$
$P_7(i,j) = 1/(1 + e^{.?847 - 1.744?(S_2(i) - S_2(j))})$ (the digits marked ? are illegible in the source)
$P_8(i,j) = 1/(1 + e^{-1.6405(S_2(i) - S_2(j))})$.
Figures 2 and 3 display the graphs of models 1-8 and a
scatterplot of the data.
The other three models, 9-11, used by Schwertman
et al. (1991), consist of one nonlinear type (model 9),
$P_9(i,j) = j/(i+j)$; a linear type (model 10), $P_{10}(i,j) = .5 + .03125(S_1(i) - S_1(j))$; and one based on normal
scoring using the $S_2(i)$ (model 11), $P_{11}(i,j) = .5 + .2813625(S_2(i) - S_2(j))$. For details see Schwertman et al.
(1991).
Estimates of the unspecified parameter(s) in the first four
models were obtained by ordinary least squares, while the
next four models (5-8) were determined using the SAS logistic procedure. (See SAS/STAT User's Guide, Vol. 2, Version 6, 4th ed., pp. 1069-1126 for details.)
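As an illustration, the three earlier models (9-11) can be written as small Python functions. This is a sketch under the definitions given in this article, with $S_1(i) = -i$ and $S_2(i)$ the normal score of seed $i$; the function names are hypothetical and the constants are copied from the formulas above.

```python
from statistics import NormalDist


def s1(seed):
    """Linear seed function S_1(i) = -i."""
    return -seed


def s2(seed):
    """Normal-score seed function S_2(i)."""
    return NormalDist().inv_cdf((294.5 - 4 * seed) / 292.5)


def p9(i, j):
    """Model 9: P(i, j) = j / (i + j)."""
    return j / (i + j)


def p10(i, j):
    """Model 10: linear in the seed difference."""
    return 0.5 + 0.03125 * (s1(i) - s1(j))


def p11(i, j):
    """Model 11: linear in the difference of normal scores."""
    return 0.5 + 0.2813625 * (s2(i) - s2(j))


# Probability that seed i defeats seed j under each model.
for i, j in [(1, 16), (2, 15), (8, 9)]:
    print(i, j, round(p9(i, j), 4), round(p10(i, j), 4), round(p11(i, j), 4))
```

Each function satisfies $P(j,i) = 1 - P(i,j)$, so it can be used directly wherever a game-level probability is required.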
4. COMPARISON OF MODELS
The 11 different models for assigning probabilities of winning for each seed in each individual game were compared
in three ways by using a chi-square statistic as a measure of
the relative fit of the models. Of the possible 120 pairings
of seeds ($16 \cdot 15/2$) only 52 have occurred. Table 1 lists
the pairs that have occurred, the number of games played
between these seeds, the number of wins by the lower seed
number (stronger team), and the empirical and estimated
probabilities of the lower seed number winning from the
11 models. Using the empirical data for the seed pairings,
a chi-square goodness-of-fit statistic
for each of the 52 seed pairs was computed, and the sum
of these chi-squares is given for each model. Twenty-six
of the seed pairs had fewer than five games played, and
the small expected numbers in these cells, being used as a
divisor, may place too much emphasis on these cells and
distort the chi-square values. Hence a second set of chi-square statistics based on just the 26 seed pairings with at
least 5 games was computed and is also given in Table 1.
Models 1-8 use the data to estimate the model parameters,
and consequently these chi-square values are not entirely
independent. Nevertheless, the chi-square values do provide
a measure of the relative accuracy of the various models.
Table 3. Goodness of Fit Analysis
Expected numbers, by probability model number

Group (seed no.)   Obs.       1       2       3       4       5       6       7       8       9      10      11
1                    16   11.22   10.50   18.51   18.76   10.76   10.59   17.94   17.88   18.69    9.89   16.52
2                     8    8.07    7.80    7.06    6.96    8.12    8.06    7.47    7.46    7.78    7.50    6.77
3                     4    5.51    5.64    3.80    3.77    5.80    5.85    3.98    3.98    3.84    5.55    3.97
4                     3    3.86    3.95    2.22    2.15    4.01    4.04    2.24    2.25    2.04    4.00    2.45
5 or more             5    7.34    8.11    4.40    4.36    7.31    7.47    4.38    4.42    3.65    9.06    6.29
Chi-square (4 df)        3.3837  4.7865   .8276  1.0070  4.0921  4.4287   .5937   .5652  1.3545  6.3057   .6283
p value                   .4958   .3099   .9347   .9087   .3937   .3511   .9638   .9669   .8521   .1775   .9599
Models 2, 3, and 4 produced values of $P(1,16)$ that were greater
than 1.0. When this occurred the probabilities were set to
.99999 in order to compute the chi-square statistics.
The third comparison of the models was done by using
$P(i,j)$ to compute the probabilities of each seed winning
the regional tournament. These probabilities are displayed
in Table 2, and a chi-square goodness-of-fit to the empirical
probabilities, used for evaluating the 11 models, is given in
Table 3.
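The goodness-of-fit computation behind Table 3 can be reproduced approximately with the following Python sketch. The observed and expected counts are copied from Table 3 for models 1 and 11 only; the function names are hypothetical, and the p value uses the closed form of the chi-square upper tail for 4 degrees of freedom, $e^{-x/2}(1 + x/2)$, rather than any particular statistical package.

```python
import math

# Observed regional winners and expected counts as read from Table 3
# (groups: seeds 1, 2, 3, 4, and 5 or more).
OBSERVED = [16, 8, 4, 3, 5]
EXPECTED = {
    1: [11.22, 8.07, 5.51, 3.86, 7.34],
    11: [16.52, 6.77, 3.97, 2.45, 6.29],
}


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over the groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_4df(x):
    """Upper-tail probability of a chi-square variate with 4 degrees of freedom."""
    return math.exp(-x / 2) * (1 + x / 2)


for model, expected in EXPECTED.items():
    stat = chi_square(OBSERVED, expected)
    print(model, round(stat, 4), round(p_value_4df(stat), 4))
```

Because the expected counts in the table are rounded to two decimals, the statistics computed this way differ slightly from the published values in the third decimal place.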
5. CONCLUSIONS
For the set of 26 pairings with 5 games or more, both the
ordinary and logistic models using the z scoring of seed
number $S_2(i)$ had smaller chi-square values than the corresponding models using just the seed position. In many cases
the z scoring substantially improved the fit of the predicted
value to the empirical data, and seems to be a worthwhile
technique.
Two unexpected results of the analysis occurred. The first
is that when predicting the probability of success in each
game, $P(i,j)$, the regressions with the y intercept specified
(.5 for ordinary least squares and zero for logistic regression) occasionally provided somewhat smaller chi-square
values than the unrestricted models. Because least squares
minimizes the squared deviations between predicted and observed values it was anticipated that the unrestricted models (1, 3, 5, 7) would have slightly smaller chi-square values
than the corresponding restricted models (2, 4, 6, 8). The
chi-square statistic, however, is a weighted sum of these
squared deviations, and hence is not necessarily a minimum
when the unweighted sum is minimized.
The second unexpected result was that the models that
were best (the smaller chi-square values) at predicting
$P(i,j)$ for each game did not do as well as some of the
other models at predicting the overall regional tournament
champion. Models 7 and 8 (logistic, with and without intercept, z-scored seeds) were the best at predicting the regional
winner but were about in the middle (when ranked) of the
models for predicting individual games. On the other hand,
models 3 and 4 (ordinary least squares, with or without
specified intercept, z-scored seeds) were the best prediction
models for the individual games, but only fourth or fifth
best for predicting the regional champion.
If the objective is to develop a model for predicting the
winner between various seed pairs, then model 3 (ordinary
least squares, no specified intercept, z-scored seeds) seems
to be the most satisfactory, whereas if the objective is to
predict the regional winner, then the logistic models, 7 and
8 (logistic regression, with and without intercept, z-scored
seeds), are the most satisfactory models. Interestingly, the
ad hoc model 11 seemed to be very adequate at predicting
both.
Obviously there are numerous models that could be used.
We have focused on the simplest straight line models and elementary methods of incorporating team strength in order
to make the methodology accessible to a broad spectrum
of students. We have attempted to present an application
of ordinary least squares, logistic regression, and probability that should be of interest to many students. The everincreasing interest in "March Madness" can be used to motivate and stimulate this instructive, timely application of
several principles and methods of probability and statistics.
Students seem to learn better when they can see application of the subject to something of interest to them. We
believe that this analysis of the regional basketball tournaments can promote student learning and enthusiasm for
studying probability and statistics.
[Received August 1993. Revised October 1994.]
REFERENCES
David, H. A. (1959), "Tournaments and Paired Comparisons," Biometrika,
46, 139-149.
Glenn, W. A. (1960), "A Comparison of the Effectiveness of Tournaments,"
Biometrika, 47, 253-262.
Ladwig, J. A., and Schwertman, N. C. (1992), "Using Probability and
Statistics to Analyze Tournament Competitions," Chance, 5, 49-53.
Monahan, J. P., and Berger, P. D. (1977), "Playoff Structures in the National
Hockey League," in Optimal Strategies in Sports, eds. S. P. Ladany and
R. E. Machol, Amsterdam: North-Holland, pp. 123-128.
Moser, L. E. (1982), "A Mathematical Analysis of the Game of Jai Alai,"
The American Mathematical Monthly, 89, 292-300.
Mosteller, F. (1952), "The World Series Competition," Journal of the American Statistical Association, 47, 355-380.
Schwertman, N. C., McCready, T. A., and Howard, L. (1991), "Probability
Models for the NCAA Regional Basketball Tournaments," The American Statistician, 45, 35-38.
Searls, D. T. (1963), "On the Probability of Winning with Different Tournament Procedures," Journal of the American Statistical Association, 58,
1064-1081.