More Probability Models for the NCAA Regional Basketball Tournaments
Neil C. SCHWERTMAN, Kathryn L. SCHENK, and Brett C. HOLBROOK
The American Statistician, Vol. 50, No. 1 (Feb., 1996), pp. 34-38. © 1996 American Statistical Association. Stable URL: http://links.jstor.org/sici?sici=0003-1305%28199602%2950%3A1%3C34%3AMPMFTN%3E2.0.CO%3B2-2

Sports events and tournament competitions provide excellent opportunities for model building and using basic statistical methodology in an interesting way.
In this article, National Collegiate Athletic Association (NCAA) regional basketball tournament data are used to develop simple linear regression and logistic regression models using seed position for predicting the probability of each of the 16 seeds winning the regional tournament. The accuracy of these models is assessed by comparing the empirical probabilities not only to the predicted probabilities of winning the regional tournament but also the predicted probabilities of each seed winning each contest.

KEY WORDS: Basketball; Logistic regression; Regression.

1. INTRODUCTION

Enthusiasm for the study of probability is enhanced when the concepts are illustrated by real examples of interest to students. Athletic competitions afford many such opportunities to demonstrate the concepts of probability and have been extensively studied in the literature; see, for example, Mosteller (1952), Searls (1963), Moser (1982), Monahan and Berger (1977), David (1959), Glenn (1960), Schwertman, McCready, and Howard (1991), and Ladwig and Schwertman (1992).

One excellent probability analysis opportunity for use in the classroom occurs each spring when "March Madness," as the media calls it, occurs. "March Madness" is the National Collegiate Athletic Association (NCAA) regional and Final Four basketball tournaments that culminate in a National Collegiate Championship game. The NCAA selects (actually, certain conference champions or tournament winners are included automatically) 64 teams, 16 for each of 4 regions, to compete for the national championship. The NCAA committee of experts not only selects the 64 teams from 292 teams in Division 1-A, but assigns a seed position to each team in the four regions based on their consensus of team strengths.
The format for each regional tournament is predetermined following the pattern in Figure 1, where the number one seed (strongest team) plays the sixteenth seed (weakest team), the number two seed (next strongest team) plays the fifteenth seed (second weakest), etc. The experts attempt to evenly distribute the teams to the regional tournament to achieve parity in the quality of each region.

Schwertman et al. (1991) suggested three rather ad hoc probability models that predicted remarkably well the empirical probability of each seed winning its regional tournament and advancing to the "final four." The validity of the three models was measured only by each seed's probability of winning its regional tournament.

In this article we use the NCAA regional basketball tournament data as an example to illustrate ordinary least squares and logistic regression in developing prediction models. The parameter estimates for the simple models considered are based on the 600 games played (1985-1994) during the first ten years using the 64-team format. The validity of the eight new empirical models and the three previous models in Schwertman et al. (1991) is assessed by comparing the empirical probabilities not only to the predicted probabilities for each seed winning the regional tournament but also to the predicted probabilities of each seed winning each contest.

[Author note: Neil C. Schwertman is Chairman and Professor of Statistics, Department of Mathematics and Statistics, California State University, Chico, CA 95929-0525. Kathryn L. Schenk is Instructional Support Coordinator, Computer Center, California State University, Chico, CA 95929-0525. Brett C. Holbrook is Student, Department of Experimental Statistics, New Mexico State University, Las Cruces, NM 88003-0003.]

2. TOURNAMENT ANALYSIS

Predicting the probability of each seed winning the regional tournament (and advancing to the final four) requires the consideration of all possible paths and opponents. Even though there are 16 teams in each region, the single elimination format (only the winning team survives in the tournament, i.e., one loss and the team is eliminated) is relatively easy to analyze compared to a double-elimination format. [See, for example, the analysis of the college baseball world series by Ladwig and Schwertman (1992).] In the first game each seed has only one possible opponent, but in the second game there are two possible opponents, four possible in the third game, and eight possible in the regional finals. Hence there are $1 \cdot 2 \cdot 4 \cdot 8 = 64$ potential sets of opponents for each seed to play in order to eventually win the regional tournament. For example, for the number 2 seed to win, it must defeat seed 15 in game 5, either 7 or 10 in game 11, either 3, 14, 6, or 11 in game 14, and either 1, 16, 8, 9, 4, 13, 5, or 12 in game 15.

The probability analysis for the regional championship must include not only the probability of defeating each potential opponent, but also the probability of each potential opponent advancing to that particular game. To illustrate, suppose the second seed wins the regional tournament by defeating seeds 15, 7, 6, and 1 in games 5, 11, 14, and 15, respectively. Then the probability that this occurs is

$$P(2,15) \cdot P(2,7) \cdot P(7 \text{ plays in game } 11) \cdot P(2,6) \cdot P(6 \text{ plays in game } 14) \cdot P(2,1) \cdot P(1 \text{ plays in game } 15),$$

where $P(i,j)$ is the probability that an $i$th seed defeats a $j$th seed and $P(j,i) = 1 - P(i,j)$. A more detailed explanation of the various paths and the associated probability analysis is contained in Schwertman et al. (1991).
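The path bookkeeping described above can be organized round by round instead of enumerating all 64 opponent sets. The following is a minimal sketch, not the authors' program. It assumes the standard bracket order for the half of the region that the text does not spell out (the text confirms only the path of seed 2), it relies on the same independence assumption as the article, and it uses the rule $P(i,j) = j/(i+j)$, which the article lists later as model 9, purely as a placeholder for the game-level probabilities. The names `BRACKET`, `p_model9`, `play_round`, and `regional_win_probabilities` are illustrative.

```python
# Bracket order for one region, read top to bottom, so that adjacent entries
# meet in the first round. ASSUMPTION: standard NCAA layout; the text confirms
# only seed 2's path (15, then 7 or 10, then 3, 14, 6, or 11).
BRACKET = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]


def p_model9(i, j):
    """Placeholder game model: probability that seed i defeats seed j."""
    return j / (i + j)


def play_round(slots, p):
    """Play one round by merging adjacent bracket slots.

    Each slot is a dict mapping seed -> probability that the seed occupies
    that slot. A seed advances with probability
    P(seed is in its slot) * sum over opponents of
    P(opponent is in the other slot) * P(seed defeats opponent).
    """
    next_slots = []
    for left, right in zip(slots[0::2], slots[1::2]):
        merged = {}
        for s, ps in left.items():
            merged[s] = ps * sum(pt * p(s, t) for t, pt in right.items())
        for t, pt in right.items():
            merged[t] = pt * sum(ps * p(t, s) for s, ps in left.items())
        next_slots.append(merged)
    return next_slots


def regional_win_probabilities(p=p_model9, bracket=BRACKET):
    """Return a dict mapping each seed to its probability of winning the region."""
    slots = [{seed: 1.0} for seed in bracket]
    while len(slots) > 1:
        slots = play_round(slots, p)
    return slots[0]


if __name__ == "__main__":
    probs = regional_win_probabilities()
    for seed in sorted(probs):
        print(f"seed {seed:2d}: {probs[seed]:.4f}")
```

Any other game-level model can be passed in as `p`, provided it satisfies $P(j,i) = 1 - P(i,j)$ so that the sixteen probabilities sum to one.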
As in that article and in most such probability analyses of athletic competitions, we assume that the games are independent and that the probabilities remain constant throughout the tournament. To complete the analysis we now must find probability models for determining $P(i,j)$.

3. PROBABILITY MODELS

The purpose of the probability models is to incorporate the relative strength of the teams in estimating the probability of each team winning in each game. It seems reasonable to use some function of seed positions because these were determined by a consensus of experts. In order to have the broadest possible use in the classroom we use the simplest linear straight line model

$$E(Y) = \beta_0 + \beta_1 (S(i) - S(j)),$$

where $S(i)$ is some function of $i$, the team's seed position. Clearly, multiple regression models could be used for more advanced classes, but our basic model is appropriate even for most introductory classes. In addition to the three models used in Schwertman et al. (1991), eight other models for assigning probabilities of success in each individual game are considered. The eight models are defined by the $2^3$ possible combinations of three factors: (1) type of regression (ordinary or logistic), (2) type of intercept $\beta_0$ (estimated or specified constant), and (3) type of independent variable (linear or nonlinear function of seed positions).

There are obviously many functions of seed position that could be used in the models. The choice of $S(i)$ and $S(j)$ is quite arbitrary. We have chosen two rather simple functions for our investigation that fortunately provide excellent predictor models. The first is $S_1(i) = -i$ for all $i$, which is simply using the difference in seed position as a single independent variable. This function of seed position suggests a linearity in team strengths; for example, the difference in strength between seeds 1 and 3 is the same as between seeds 14 and 16.
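The linearity implied by $S_1$ can be checked with a line of algebra. This is only a worked illustration of the definitions above, written with the generic intercept $\beta_0$ and slope $\beta_1$ of the straight-line model, not with any fitted values:

```latex
\begin{align*}
S_1(i) - S_1(j) &= (-i) - (-j) = j - i, \\
E(Y \mid i = 1,\ j = 3) &= \beta_0 + \beta_1 (3 - 1) = \beta_0 + 2\beta_1, \\
E(Y \mid i = 14,\ j = 16) &= \beta_0 + \beta_1 (16 - 14) = \beta_0 + 2\beta_1 .
\end{align*}
```

Both pairings therefore receive the same predicted probability under any model built on $S_1$.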
Intuitively it seems likely that there is a greater difference in quality between a number 1 and 3 seed than between a 14 and 16 seed. Thus this linearity may not be appropriate, and therefore we consider a nonlinear function $S_2(i)$ for incorporating team strength. Since the normal distribution occurs naturally in describing many random variables and is included in most introductory classes, it seems reasonable to use the normal distribution to describe a nonlinear relationship in team strengths. If we assume that the strength of the 292 teams is normally distributed and that the experts properly ordered the top 64 teams for the tournament from 229 to 292, with 229 the weakest and 292 the strongest, we can then determine a percentile and a corresponding z score for that percentile. That is, adding a correction for continuity to 292,

$$S_2(i) = \Phi^{-1}\left(\frac{294.5 - 4i}{292.5}\right),$$

where $\Phi$ is the cumulative distribution function of the standard normal. For example, the transformed seed position z score for the number 1 seeds (289, 290, 291, 292 when ordered from weakest to strongest) was calculated from the percentile of this group's median, that is, 290.5/292.5 = 99.316 percentile, which corresponds to a z score of 2.466. Similarly, the number 2 seed's (285, 286, 287, 288) z score is calculated from the 286.5/292.5 = 97.9487 percentile, corresponding to a z score of 2.044. For seeds 3-16 the corresponding z scores are 1.823, 1.666, 1.542, 1.438, 1.348, 1.267, 1.194, 1.127, 1.064, 1.006, .951, .898, .848, and .800, respectively.

The dependent variable is 1 if the lower seed (stronger team) defeats the higher seed and 0 otherwise. It should be noted that the logistic regression models (models 5- [the rest of this sentence and the equations for models 1 and 2 are missing from the source text]

[Figure 1. NCAA Regional Basketball Tournament Pairings.]

The equations that survive in the source are as follows (digits marked ? are illegible):

$$P_3(i,j) = .530385 + .286507\,(S_2(i) - S_2(j))$$
$$P_4(i,j) = .5 + .317258\,(S_2(i) - S_2(j))$$
$$P_5(i,j) = 1/\left(1 + e^{.0328 - .1770(j-i)}\right)$$
$$P_6(i,j) = 1/\left(1 + e^{-.1727(j-i)}\right)$$
$$P_7(i,j) = 1/\left(1 + e^{.?847 - 1.744?\,(S_2(i) - S_2(j))}\right)$$
$$P_8(i,j) = 1/\left(1 + e^{-1.6405\,(S_2(i) - S_2(j))}\right).$$

Figures 2 and 3 display the graphs of models 1-8 and a scatterplot of the data. The other three models, 9-11, used by Schwertman et al. (1991), consist of one nonlinear type (model 9), $P_9(i,j) = j/(i+j)$; a linear type (model 10), $P_{10}(i,j) = .5 + .03125\,(S_1(i) - S_1(j))$; and one based on normal scoring using the $S_2(i)$ (model 11), $P_{11}(i,j) = .5 + .2813625\,(S_2(i) - S_2(j))$. For details see Schwertman et al. (1991). Estimates of the unspecified parameter(s) in the first four models were obtained by ordinary least squares, while the next four models (5-8) were determined using the SAS logistic procedure. (See SAS/STAT User's Guide, Vol. 2, Version 6, 4th ed., pp. 1069-1126 for details.)

[Figure 2. Points in Scatterplot May Represent as Few as 1 or as Many as 97 Games.]

[Figure 3. Points in Scatterplot Represent as Few as 1 or as Many as 40 Games.]

[Table 2. Predicted Probabilities of Each Seed Winning the NCAA Regional Basketball Tournament, by seed and model; the table values are not present in the source text.]

4. COMPARISON OF MODELS

The 11 different models for assigning probabilities of winning for each seed in each individual game were compared in three ways by using a chi-square statistic as a measure of the relative fit of the models. Of the possible 120 pairings of seeds ($16 \cdot 15/2$) only 52 have occurred. Table 1 lists the pairs that have occurred, the number of games played between these seeds, the number of wins by the lower seed number (stronger team), and the empirical and estimated probabilities of the lower seed number winning from the 11 models.
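The estimated probabilities for a given seed pairing follow directly from the game-level equations. The sketch below is a hedged illustration, not the authors' SAS or regression code: it evaluates $S_2(i) = \Phi^{-1}((294.5 - 4i)/292.5)$ with Python's standard library and then the models whose coefficients are legible in the article. Models 1 and 2 (equations not present in this text) and model 7 (illegible digits) are deliberately left out. The names `s1`, `s2`, `MODELS`, and `estimated_probabilities` are illustrative.

```python
from math import exp
from statistics import NormalDist  # Python 3.8+


def s1(i):
    """Linear seed function S1(i) = -i."""
    return -i


def s2(i):
    """Normal-score seed function S2(i) = Phi^{-1}((294.5 - 4i) / 292.5)."""
    return NormalDist().inv_cdf((294.5 - 4 * i) / 292.5)


def d1(i, j):
    """S1(i) - S1(j), which equals j - i."""
    return s1(i) - s1(j)


def d2(i, j):
    """S2(i) - S2(j)."""
    return s2(i) - s2(j)


# Probability that seed i defeats seed j, keyed by model number.
# Coefficients are transcribed from the article; models 1, 2, and 7 are
# omitted because their coefficients cannot be read from the source.
MODELS = {
    3: lambda i, j: 0.530385 + 0.286507 * d2(i, j),
    4: lambda i, j: 0.5 + 0.317258 * d2(i, j),
    5: lambda i, j: 1.0 / (1.0 + exp(0.0328 - 0.1770 * d1(i, j))),
    6: lambda i, j: 1.0 / (1.0 + exp(-0.1727 * d1(i, j))),
    8: lambda i, j: 1.0 / (1.0 + exp(-1.6405 * d2(i, j))),
    9: lambda i, j: j / (i + j),
    10: lambda i, j: 0.5 + 0.03125 * d1(i, j),
    11: lambda i, j: 0.5 + 0.2813625 * d2(i, j),
}


def estimated_probabilities(i, j):
    """Return {model number: estimated P(i, j)} for one seed pairing."""
    return {model: f(i, j) for model, f in MODELS.items()}


if __name__ == "__main__":
    for model, p in estimated_probabilities(1, 16).items():
        print(f"model {model:2d}: P(1,16) = {p:.4f}")
```

The straight-line models are not constrained to the unit interval, so a value above 1 for the most lopsided pairing is expected behavior of the formula and not a coding error.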
Using the empirical data for the seed pairings, a chi-square goodness-of-fit statistic for each of the 52 seed pairs was computed, and the sum of these chi-squares is given for each model. Twenty-six of the seed pairs had fewer than five games played, and the small expected numbers in these cells, being used as a divisor, may place too much emphasis on these cells and distort the chi-square values. Hence a second set of chi-square statistics based on just the 26 seed pairings with at least 5 games was computed and is also given in Table 1. Models 1-8 use the data to estimate the model parameters, and consequently these chi-square values are not entirely independent. Nevertheless, the chi-square values do provide a measure of the relative accuracy of the various models.

Models 2, 3, and 4 produced values of $P(1,16)$ that were greater than 1.0. When this occurred the probabilities were set to .99999 in order to compute the chi-square statistics.

The third comparison of the models was done by using $P(i,j)$ to compute the probabilities of each seed winning the regional tournament. These probabilities are displayed in Table 2, and a chi-square goodness-of-fit to the empirical probabilities, used for evaluating the 11 models, is given in Table 3.

Table 3. Goodness of Fit Analysis (expected numbers, by probability model number)

Group                       Model
(seed no.)   Obs.      1       2       3       4       5       6       7       8       9      10      11
1             16     11.22   10.50   18.51   18.76   10.76   10.59   17.94   17.88   18.69    9.89   16.52
2              8      8.07    7.80    7.06    6.96    8.12    8.06    7.47    7.46    7.78    7.50    6.77
3              4      5.51    5.64    3.80    3.77    5.80    5.85    3.98    3.98    3.84    5.55    3.97
4              3      3.86    3.95    2.22    2.15    4.01    4.04    2.24    2.25    2.04    4.00    2.45
5 or more      5      7.34    8.11    4.40    4.36    7.31    7.47    4.38    4.42    3.65    9.06    6.29
Chi-square (4 df)    3.3837  4.7865   .8276  1.0070  4.0921  4.4287   .5937   .5652  1.3545  6.3057   .6283
p value               .4958   .3099   .9347   .9087   .3937   .3511   .9638   .9669   .8521   .1775   .9599

5. CONCLUSIONS

For the set of 26 pairings with 5 games or more, both the ordinary and logistic models using the z scoring of seed number $S_2(i)$ had smaller chi-square values than the corresponding models using just the seed position. In many cases the z scoring substantially improved the fit of the predicted value to the empirical data, and seems to be a worthwhile technique.

Two unexpected results of the analysis occurred. The first is that when predicting the probability of success in each game, $P(i,j)$, the regressions with the y intercept specified (.5 for ordinary least squares and zero for logistic regression) occasionally provided somewhat smaller chi-square values than the unrestricted models. Because least squares minimizes the squared deviations between predicted and observed values, it was anticipated that the unrestricted models (1, 3, 5, 7) would have slightly smaller chi-square values than the corresponding restricted models (2, 4, 6, 8). The chi-square statistic, however, is a weighted sum of these squared deviations, and hence is not necessarily a minimum when the unweighted sum is minimized.

The second unexpected result was that the models that were best (the smaller chi-square values) at predicting $P(i,j)$ for each game did not do as well as some of the other models at predicting the overall regional tournament champion. Models 7 and 8 (logistic, with and without intercept, z-scored seeds) were the best at predicting the regional winner but were about in the middle (when ranked) of the models for predicting individual games. On the other hand, models 3 and 4 (ordinary least squares, with or without specified intercept, z-scored seeds) were the best prediction models for the individual games, but only fourth or fifth best for predicting the regional champion.
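The regional-winner ranking discussed above rests on the chi-square values of Table 3, which can be recomputed from the expected numbers printed there. The following is a minimal sketch, not the authors' code. It assumes the usual Pearson form, the sum of (observed - expected)^2 / expected over the five seed groups, and uses the closed-form upper-tail probability of a chi-square variable with 4 degrees of freedom, $e^{-x/2}(1 + x/2)$. The names `OBSERVED`, `EXPECTED`, `pearson_chi_square`, and `p_value_df4` are illustrative. Because the printed expected numbers are rounded to two decimals, the recomputed statistics match the printed ones only to about two decimal places.

```python
from math import exp

# Observed regional winners by seed group (1, 2, 3, 4, "5 or more"), Table 3.
OBSERVED = [16, 8, 4, 3, 5]

# Expected numbers for each probability model, transcribed from Table 3.
EXPECTED = {
    1: [11.22, 8.07, 5.51, 3.86, 7.34],
    2: [10.50, 7.80, 5.64, 3.95, 8.11],
    3: [18.51, 7.06, 3.80, 2.22, 4.40],
    4: [18.76, 6.96, 3.77, 2.15, 4.36],
    5: [10.76, 8.12, 5.80, 4.01, 7.31],
    6: [10.59, 8.06, 5.85, 4.04, 7.47],
    7: [17.94, 7.47, 3.98, 2.24, 4.38],
    8: [17.88, 7.46, 3.98, 2.25, 4.42],
    9: [18.69, 7.78, 3.84, 2.04, 3.65],
    10: [9.89, 7.50, 5.55, 4.00, 9.06],
    11: [16.52, 6.77, 3.97, 2.45, 6.29],
}


def pearson_chi_square(observed, expected):
    """Sum of (O - E)^2 / E over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df4(x):
    """Upper-tail probability for chi-square with exactly 4 degrees of freedom."""
    return exp(-x / 2.0) * (1.0 + x / 2.0)


if __name__ == "__main__":
    for model, expected in EXPECTED.items():
        x = pearson_chi_square(OBSERVED, expected)
        print(f"model {model:2d}: chi-square = {x:.4f}, p = {p_value_df4(x):.4f}")
```

The division by the expected count in `pearson_chi_square` is what makes the statistic a weighted sum of squared deviations, which is why a least squares fit need not minimize it.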
If the objective is to develop a model for predicting the winner between various seed pairs, then model 3 (ordinary least squares, no specified intercept, z-scored seeds) seems to be the most satisfactory, whereas if the objective is to predict the regional winner, then the logistic models, 7 and 8 (logistic regression, with and without intercept, z-scored seeds), are the most satisfactory models. Interestingly, the ad hoc model 11 seemed to be very adequate at predicting both.

Obviously there are numerous models that could be used. We have focused on the simplest straight line models and elementary methods of incorporating team strength in order to make the methodology accessible to a broad spectrum of students. We have attempted to present an application of ordinary least squares, logistic regression, and probability that should be of interest to many students. The ever-increasing interest in "March Madness" can be used to motivate and stimulate this instructive, timely application of several principles and methods of probability and statistics. Students seem to learn better when they can see application of the subject to something of interest to them. We believe that this analysis of the regional basketball tournaments can promote student learning and enthusiasm for studying probability and statistics.

[Received August 1993. Revised October 1994.]

REFERENCES

David, H. A. (1959), "Tournaments and Paired Comparisons," Biometrika, 46, 139-149.
Glenn, W. A. (1960), "A Comparison of the Effectiveness of Tournaments," Biometrika, 47, 253-262.
Ladwig, J. A., and Schwertman, N. C. (1992), "Using Probability and Statistics to Analyze Tournament Competitions," Chance, 5, 49-53.
Monahan, J. P., and Berger, P. D. (1977), "Playoff Structures in the National Hockey League," in Optimal Strategies in Sports, eds. S. P. Ladany and R. E. Machol, Amsterdam: North-Holland, pp. 123-128.
Moser, L. E. (1982), "A Mathematical Analysis of the Game of Jai Alai," The American Mathematical Monthly, 89, 292-300.
Mosteller, F. (1952), "The World Series Competition," Journal of the American Statistical Association, 47, 355-380.
Schwertman, N. C., McCready, T. A., and Howard, L. (1991), "Probability Models for the NCAA Regional Basketball Tournaments," The American Statistician, 45, 35-38.
Searls, D. T. (1963), "On the Probability of Winning with Different Tournament Procedures," Journal of the American Statistical Association, 58, 1064-1081.