Download KMcM final thesis

Document related concepts
no text concepts found
Transcript
2020
Prediction of the
Physiological
Workload of Football
Training Sessions
STATISTICAL MODELLING AND R SHINY
APPLICATION
KENNETH MCMILLAN 9372431
Project for MSc Data Analytics (online)
May 2020
0
Table of Contents
Figures ........................................................................................................................................ ii
Tables ......................................................................................................................................... ii
Executive Summary ................................................................................................................... iii
1. Introduction ............................................................................................................................ 1
Project Report Overview ........................................................................................................ 2
2. Exploratory Data Analysis ...................................................................................................... 3
Data Overview ....................................................................................................................... 3
Response and Explanatory variables ..................................................................................... 3
Effect of Position .................................................................................................................... 7
Non-linearity and repeated measurements ............................................................................. 7
EDA Summary ....................................................................................................................... 9
3. Methods ................................................................................................................................10
Data Collection ......................................................................................................................10
Data Management.................................................................................................................10
Data Analysis and Statistical Modelling .................................................................................10
Model Fitting .........................................................................................................................12
Model Comparison ................................................................................................................13
Model Selection.....................................................................................................................14
Model Performance ...............................................................................................................14
4. Results ..................................................................................................................................15
Total Distance models ...........................................................................................................15
Metrics Summary for all response variables ..........................................................................22
Performance on Test Data Set ..............................................................................................25
Results Summary ..................................................................................................................27
5. Conclusions and Discussion .................................................................................................28
Model selection: Simpler LMM or GAMM? ............................................................................28
Project limitations ..................................................................................................................28
Practical applications ............................................................................................................29
Future application improvements ..........................................................................................30
Final Conclusion....................................................................................................................30
6. References ...........................................................................................................................31
i
Appendix A: Partial Residual Plots (for TDmodel7) ...................................................................33
Appendix B: Model fits for all Response Variables 1 .................................................................36
Appendix C: Model fits for all Response Variables 2 (Facets) ...................................................39
Appendix D: Navigating the R Shiny Training Session Planner Application ...............................45
Appendix E: Collection of Training Session Planner Application Screenshots ...........................46
Figures
Figure 1: Professional Football Player wearing a wearable GPS device .................................... 1
Figure 2: Training Drill Schematic .............................................................................................. 4
Figure 3: Pairwise Scatterplots of continuous variables .............................................................. 5
Figure 4: Distribution of the continuous variables ....................................................................... 5
Figure 5: Training drills vary by duration, the longest being Tactical drills .................................. 6
Figure 6: Total Distance observations for each training drill ....................................................... 6
Figure 7: No apparent effect of position on Total Distance covered............................................ 7
Figure 8: Scatterplot of relationship between Duration and the response variables .................... 8
Figure 9: Scatterplot of Total Distance against Duration, by Drill Type ....................................... 8
Figure 10: Dotplot of measurements per drill for a sample player ............................................... 9
Figure 11: Wearable GPS 'STATSports Viper' Device ...............................................................10
Figure 12: Residual Plots for TDmodel1 Diagnostics.................................................................15
Figure 13: Residual Plots for TDmodel2 Diagnostics.................................................................16
Figure 14: Residual Plots for TDmodel3 Diagnostics.................................................................17
Figure 15: Residual Plots for TDmodel4 Diagnostics.................................................................18
Figure 16: Residual Plots for TDmodel7 Diagnostics (best fitting model)...................................19
Figure 17: A selection of Partial Residual Plots for TDmodel7...................................................20
Figure 18: Model Fits using TDmodel4, 7 and 9 facetted by Player ...........................................21
Figure 19: Model Fits using TDmodel4, 7 and 9 facetted by Training Drill .................................21
Figure 20: Plots of Predicted versus Actuals (using Test Data Set) ...........................................25
Figure 21: Predicted versus Actuals for all response variables ..................................................26
Tables
Table 1: Variables description and information ........................................................................... 4
Table 2: Total Distance Model Metrics ......................................................................................22
Table 3: High Intensity Distance Model Metrics .........................................................................22
Table 4: Acceleration Distance Model Metrics ...........................................................................23
Table 5: Deceleration Distance Model Metrics ..........................................................................23
Table 6: Sprint Distance Model Metrics .....................................................................................24
Table 7: Maximum Speed Model Metrics ..................................................................................24
Table 8: Performance Metrics using Test Data Set ...................................................................25
Table 9: Comparison of Train/Test Performance Metrics for all response variables ..................26
ii
Executive Summary
Introduction: Accurate prescription of individual physiological training loads in a football
(soccer) training session is crucial in order to optimise training while minimising the risk of injury
and increasing player availability. Global Positioning System (GPS) devices are routinely used
by professional football clubs to monitor the training workloads of their players. These GPS
devices collect data on key physiological workload metrics such as total distance covered, high
intensity distance, and acceleration/deceleration distances.
Main Aim: The main aim of this project was to build a football Training Session Planner
application using appropriate predictive models trained on historical GPS training data, to
ultimately enable a football coach to quickly view physiological workload predictions of the
training drills that constitute the prescribed training session.
Methods: Linear Mixed Models (LMMs) and Generalised Additive Mixed Models (GAMMs) were
fitted to the randomly assigned Training data set (70% of data) for all 6 response variables. Nine
different models were fitted for each response: the first 4 models involved fitting LMMs with
random intercept (lme4 R package), with the subsequent non-linear models using GAMMs with
different smoothing methods (mgcv and gamm4 R packages). A random player effect was used
to account for the nested structure of the data. R2, RMSE, MAE and AIC of the models were
compared in order to select an appropriate model for each response variable. Model
performance was then established using the Test data set.
Results: A GAMM with cubic regression splines was found to be the most appropriate model
for all of the response variables, but a simpler polynomial LMM produced a similar model fit
performance. A high R2 value (0.81) was found for the Total Distance response variable.
However, R2 values of the model fits were low when predicting Sprint Distance and Maximum
Speed (0.53 and 0.55 respectively). All models showed an excellent fit to the Test data set (r >
0.84) with the exception of the Sprint Distance and Maximum Speed response variables (r
=0.64 and 0.71 respectively). RMSEs were found to be of an acceptable practical value.
Conclusions: By using historical GPS workload data of football training sessions, the
physiological workloads of a prescribed training session can be estimated with practically
acceptable accuracy using LMMs and GAMMs. However, predictions of Sprint Distance and
Maximum Speed attained should be viewed with caution. The fitted models were deployed in an
interactive R Shiny Training Session Planner application designed to assist football coaches
with prediction of the physiological workload of their prescribed training sessions.
iii
1. Introduction
One of the most popular and effective training monitoring devices in elite sports is the Global
Positioning System (GPS) (1). They are widely used in professional football (soccer), rugby and
other sports (2–6) to summarise a player’s physiological training workload (i.e. metrics such as
total distance covered, high intensity distance, and acceleration distances) across the different
training drills in a training session (see Figure 1). Accurate ‘prescription’ of individual training
loads in a training session is crucial in order to optimise training adaptations while minimising
the risk of injury.
Football players participate in a wide range
of training drills during a training session in
order to induce physiological adaptations
needed to succeed in competitive matchplay. A typical soccer training session
consists of a combination of warm-up,
technical/tactical drills, fitness related drills,
and cool-down activities. The choice and
duration of each training drill has to be
planned in advance. The choice may
Figure 1: Professional Football Player wearing
a wearable GPS device
depend on the proximity of the next
upcoming competitive match and/or the football coach’s preference. Each prescribed training
drill will induce a specific physiological effect and these effects may differ greatly between
training drills.
Once a training session is completed the resulting GPS workload data are typically analysed to
assess each component of the training session separately by generating visualisations and
tables summarising each metric at the squad and individual player level.
The main aim of this project was to build a football Training Session Planner application using
statistical models trained on historical GPS workload data that would allow a football coach to
quickly obtain physiological workload predictions for their prescribed training session. The
application would allow the coach to quickly determine whether the physiological load of the
training session is appropriate or not. For example, the football coach may want to check that a
training session that is planned for 24-hours before a competitive match is not physiologically
too demanding, which might result in player fatigue on match-day. On the other hand, the coach
1
may want to analyse whether a prescribed training session is of the appropriate duration and
intensity to elicit favourable physiological adaptations.
Project Report Overview
Section 2: Exploratory Data Analysis findings are reported in Section 2.
Section 3: Methodologies used in the data analysis and modelling are described in Section 3.
Brief overviews of Linear Mixed Models and Generalised Additive Mixed Models are presented,
along with the reasons behind why these approaches were used in the statistical modelling in
this project.
Section 4: This section presents the main findings of the data analysis and details the modelling
processes that were implemented. For brevity, modelling details for the Total Distance response
variable are presented only. Modelling for the other 5 response variables followed a similar
pattern, and the model performance metrics are tabulated and graphically illustrated in this
section. Graphical illustrations of the model fits for all models can be found in Appendices B and
C. To conclude the Results section, model performance metrics using the Test data set are
presented.
Section 5: The main findings of the project are discussed in Section 5. Limitations and practical
applications of the project are acknowledged, and recommendations for future work are
suggested. A modified html version of the project report (which includes more Exploratory Data
Analysis graphs) can also viewed at:
https://rpubs.com/kennymcmillan/603180
Using the most appropriate models established in the project analysis, an interactive R Shiny
Training Session Planner application was created. This application was designed for ease of
use by football coaches to predict the training load of their prescribed training sessions. This
application allows coaches to “build” a training session, by selecting the individual training drills,
and their subsequent duration and number of sets. The R Shiny application can be viewed at:
https://aspireacademy-physiology.shinyapps.io/MSc_ShinyDashboard/
Instructions on navigating the Training Session Planner application can be found in Appendix D
at the end of this project report. Screenshots of the application are displayed in Appendix E.
2
2. Exploratory Data Analysis
This section describes findings of an Exploratory Data Analysis of the data set. The reader can
also interactively explore the data set using the R Shiny application (see EDA options in
sidebar).
Data Overview
GPS data were collected during training sessions from 26 professional football players. 4514
training drill observations were collected over a 2-year period. There are no missing data in the
data set. The majority of data collected came from Technical and Tactical Drills. The smallest
amount of data were collected from Warm-Up and Recovery Drills. Within each training drill
classification, the number of observations of each training drill were found to vary considerably.
Observations ranged from the maximum of 430 observations for the 11 v 11, fully coached drill
to only 46 for Speed and Agility training. GPS physiological workload responses to 25 different
training drills were analysed, with these drills belonging to 7 different classifications of training
drills (see Figure 2 overleaf).
Each player was also classified according to one of five possible playing positions: Attacker (A),
Central Defender (CD), Central Midfielder (CM), Wide Defender (WD), and Wide Midfielder
(WM). The numbers of observations per player was found to vary, in the range of 138 to 210
observations.
Response and Explanatory variables
Table 1 overleaf describes the response and explanatory variables in the data set. All 6
response variables investigated in this project are continuous random variables. All but one of
the explanatory variables are categorical, with the training duration variable being the exception
(continuous in nature). A scatterplot of all the continuous variables is presented in Figure 3. As
expected, it can be seen that the distance-related metrics generally increase as the duration of
the drill increases. Total Distance covered in particular has a high correlation with drill duration
(r = 0.85). In contrast, Maximum Speed attained is poorly correlated with training duration (r =
0.37). This observation makes sense as a player could reach maximum speed at any time-point
during a training drill. Maximum speed attained is more likely to be explained by the type of
training drill being performed (e.g. we would expect higher maximum speed to be attained
during speed and agility training compared to a warm-down drill).
3
Figure 2: Training Drill Schematic
Table 1: Variables description and information
4
Figure 3: Pairwise Scatterplots of continuous variables
The distributions of the response variables were found to be right-skewed, except Maximum
Speed that appeared normally distributed (see Figure 4). Duration of the training drill was also
found to be right-skewed. Drills of longer duration were mainly attributable to Tactical and
Technical Drills (see Figure 5), with some observations approaching the duration of competitive
match-play (90 minutes). There seems to be an interaction between drill duration and type.
Figure 4: Distribution of the continuous variables
5
Figure 5: Training drills vary by duration, the longest being Tactical drills
Figure 6 below illustrates all observations for each training drill for the Total Distance response
variable. For some drills it is apparent that Total Distance increases steadily with increasing
training duration, but for some other drills this is not as apparent (e.g. Warm-down Drill).
Figure 6: Total Distance observations for each training drill
6
Effect of Position
During competitive football match-play, playing in different positions may induce different
physiological workloads. For example, wide-midfield players usually cover a significantly greater
total distance during match-play than central-defenders. However, Figure 7 below illustrates that
for the training drills, playing position seems to have no great effect on workload output,
presumably due to smaller pitch sizes being used in most training drills, and the players not
being bound to their tactical position on the training pitch.
Figure 7: No apparent effect of position on Total Distance covered
Non-linearity and repeated measurements
Figure 8 overleaf clearly demonstrates that there is a non-linear relationship between all 6
response variables and training duration. Therefore, it would seem that fitting non-linear
models would be more appropriate than fitting linear models. Figure 9 also show this nonlinearity between Total Distance covered and duration of the drill by Drill Type. For the Sprint
Distance response variable, some strange stratified measurements are apparent, perhaps due
to incorrect measurements by the GPS devices.
7
Figure 8: Scatterplot of relationship between Duration and the response variables
Figure 9: Scatterplot of Total Distance against Duration, by Drill Type
8
The project data set consisted of repeated measurements for each training drill for each
player. This is illustrated in Figure 10 for Player 15, who provided the most observations (210
data points).
Figure 10: Dotplot of measurements per drill for a sample player
EDA Summary
The most important findings of the exploratory data analysis were:

Distributions of the response variables are right-skewed with the exception of the
Maximum Speed response variable that appears normally distributed. The explanatory
variable Duration is also right-skewed.

Different training drill types have different duration ranges.

Different training drills evoke different physiological responses.

There appears to be a non-linear relationship between duration of the training drill and
the workload metrics (response variables).

The playing position of the player may not have a great effect on the training workload
response.

The sprint distance data has some peculiarities.
9
3. Methods
This section details the methodologies used in this project.
Data Collection
Data were collected from training sessions performed by
26 players of a British professional football team over 2
football seasons. All the players were in peak physical
condition and free from injury at the time of data collection.
The data came from wearable GPS devices (see Figure
11) designed specifically for use in sport (STATSports
Figure 11: Wearable GPS
'STATSports Viper' Device
Viper GNSS unit). Data were collected at 5 Hz, and downloaded by the software that
accompanies the GPS devices. Full training session data was then carefully parsed into
individual drill data segments which were then exported into a Microsoft Excel spreadsheet for
further analysis.
Data Management
GPS observations coming from competitive match-play were removed from the data set, so that
only training drill information was used for analysis. 4514 training drill observations were
available for statistical analysis and modelling. 70% of the data were randomly assigned to a
Training data set for model fitting, with the remaining 30% set aside as a holdout Test data set.
Data Analysis and Statistical Modelling
All statistical analysis was performed using R version 3.6.1 (7).
Mixed-Effects Modelling
Most commonly used statistical techniques assume independence of the observations or
measurements. The project data set has a hierarchical structure, with repeated measurements each player had repeat observations for certain training drills. Repeat measurements in the
same individual are usually more similar to each other than values from different individuals and
therefore not independent (8). Mixed-effects models, such as Linear Mixed Models (LMMs)
offer a flexible framework to model the sources of variation and correlation that arise from
repeat observations (8). They are a statistical model that contains both fixed effects and
random effects. Fixed effects can be defined as effects that we wish to estimate explicitly, while
random effects are effects that introduce variability (9).
10
Random Intercept LMM
One form of LMM is the Random Intercept model and this modelling approach was applied in
some of the earlier models in this project. As an example, if we let 𝑦𝑖𝑗 be the Total Distance of
the i-th player on the j-th occasion and let 𝑥𝑖𝑗 denote the training drill duration, then:
𝑦𝑖𝑗 = 𝛽0 + 𝑏0𝑖 + 𝛽1 𝑥𝑖𝑗 + 𝑒𝑖𝑗 ,
where the 𝑏0𝑖 are independent 𝑁(0, 𝜎02 ) , the 𝑒𝑖𝑗 are independent 𝑁(0, 𝜎 2 ), and 𝑏0𝑖 and 𝑒𝑖𝑗 are
independent of each other. The term 𝑏0𝑖 is the random effect corresponding to the i-th player.
By rewriting the model as:
𝑦𝑖𝑗 = 𝑏𝑖∗+ 𝛽1 𝑥𝑖𝑗 + 𝑒𝑖𝑗
,
the terms 𝑏𝑖∗ ~ N(β0, 𝜎02 ) are random intercepts, with each intercept corresponding to a different
player centred around 𝛽0 , with added variability represented by the variance parameter 𝜎 2 (9).
Non-linear modelling: Generalised Additive Models and Generalised Additive Mixed Models
EDA results from Section 2 of this report highlighted that the relationship between training
duration and drill type is not necessarily linear over time. Mixed-effects models can be used to
model both linear and non-linear relationships between response and explanatory variables.
While LMMs can be used to express linear relationships between sets of variables, non-linear
mixed-effects models can also model mechanistic relationships between the response and
explanatory variables (8).
Generalised Additive Models (GAMs) and Generalised Additive Mixed Models (GAMMs) provide
a general framework for extending the standard linear model and capturing non-linearity. The
linear predictor depends on unknown smooth functions of some of the explanatory variables,
while maintaining additivity (10). The general form of a GAM is:
𝑔(𝐸(𝑌)) = 𝛽0 + 𝑓1 (𝑥1 ) + 𝑓2 (𝑥2 ) + ∙∙∙ + 𝑓𝑚 (𝑥𝑚 ) + e
The model relates a univariate response variable, 𝑌, to some predictor variables, 𝑥𝑖 . An
exponential family distribution is specified for 𝑌 (for example, normal or Poisson distributions)
along with a link function 𝑔 (for example the identity or log functions) relating the expected value
of 𝑌 to the predictor variables. The terms 𝑓1 (𝑥1 ), … , 𝑓𝑚 (𝑥𝑚 ) denote smooth, non-parametric
functions. Smooth functions in GAMs allow the shape of the data to be followed much more
closely than linear models, as they not constrained by the assumption of linearity. Therefore,
11
GAMs (and therefore GAMMs) have the ability to model non-linear relationships in a more
flexible form than standard linear models, and they have the potential to make more
accurate predictions. Indeed, GAM / GAMMs can serve as a useful compromise between
linear and fully non-parametric models (10).
When using the mgcv package in R, smoothing terms can be specified in a gam formula call by
using the “s” term. Various smooth classes are available for use, and a variety of these were
used in the models fitted in this project. What defines a smooth class is the basis used to
represent the smooth function and quadratic penalty (or multiple penalties) used to penalise the
basis coefficients in order to control the degree of smoothness. The smooths built into
the mgcv package are all based one way or another on low rank versions of splines.(11)
The smooth components of GAMs can be viewed as random effects for estimation purposes,
meaning that random effects terms can be incorporated into GAMs. Random effects in a GAM
can be represented in the same way that the smooths are — as penalised regression terms.
This method can be used within the gam function call in the mgcv package by making use of
the s( ... , bs = "re") term (11), and this method has been utilised in this project. As an
example, when using the Player_ID variable (an explanatory variable in this project) ,
s(Player_ID, bs = "re") produces a random coefficient for each level of Player, with the
random coefficients all modelled as i.i.d. normal (11).
Logarithmic transformations
Logarithmically transforming variables is a very common way to handle situations where a nonlinear relationship exists between the explanatory and response variables, or when a variance
stabilising transformation is needed. Taking the log of the response and/or explanatory variables
can help when residuals appear non-normal, and also help to eliminate any heteroscedasticity.
Logarithmic transformations are also a convenient means of transforming a highly skewed
variable into one that is more approximately normal. Since all response variables in this project
were found to be right-skewed (with the exception of the Maximum Speed response variable),
and a non-linear relationship between the response variables and the training drill duration was
apparent, logarithmic transformations of both the response variables and the explanatory
variable relating to drill duration were applied in the models fitted in this project.
Model Fitting
As recommended by Gelman (12), simple models were initially fitted and then modified to
generate possibly better models (according to some goodness-of-fit criterion) with this process
12
repeated until no further improvements were evident. This iterative process was applied for
modelling of all 6 response variables (see Table 1). In this project report, detailed analysis of the
modelling steps is presented only for the ‘Total Distance’ response variable (for brevity
purposes). However, all model fitting steps, results and analysis can be viewed in the submitted
R code file. The following 9 models were sequentially fitted for each response variable:
1. LMM using all variables (random intercept model)
2. LMM with log transformations and dropping of insignificant predictor variables (random
intercept model)
3. LMM with added interaction term (random intercept model)
4. LMM with polynomial (random intercept model)
5. GAMM using default smooth (Thin plate regression splines) (mgcv package).
6. GAMM using Factor smooths (mgcv package).
7. GAMM using Cubic Regression Splines (mgcv package).
8. GAMM using p-splines (mgcv package).
9. GAMM model (gamm4 package) (random intercept model)
All models were fitted using the Training data set. On selection of the final model fit, the holdout
Test data set was used to provide an unbiased evaluation of the final model (10).
R packages used to fit and visualise LMM and GAMM models
LMMs (random intercept model) were fitted using the lme4 package (13). The lme4 package
default output for LMMs provides t-values but no p-values. The primary motivation for this
omission is that in LMMs it is not at all obvious what the appropriate denominator degrees of
freedom to use are, except perhaps for some simple designs and nicely balanced data (14). In
this project, significance of LMMs were calculated using the lmerTest package (15). Prediction
intervals from the LMMs (used in the Training Prediction Planner Application) were obtained
using the ‘predictInterval’ function from the merTools package (16). R2 values for LMMs were
estimated using the MuMin package (17). GAMMs were fitted using the mgcv and gamm4
packages (11). The mgcViz package (18) was used for visualisation of GAMM partial residual
plots.
Model Comparison
The Likelihood Ratio Test (LRT) can compare two different nested models in order to determine
whether one model is a better fit to the data than the other. LRTs are commonly used to decide
if a particular explanatory variable should be included in a mixed-effects model. LRTs are most
13
commonly used to decide if a particular random effect (say, a random intercept) should be
retained in the model by evaluating whether that effect improves the fit of the model, with all
other model parameters held constant (14). When used for evaluating the significance of fixed
effects, LRTs have one potential disadvantage: using LRTs to compare two models that differ in
their fixed effects structure may not always be appropriate (8). When mixed-effects models are
fitted using Restricted Maximum Likelihood (REML), there is a term in the REML criterion that
changes when the fixed-effects structure changes, making a comparison of models differing in
their fixed effects structure meaningless. Thus, if LRTs are to be used to evaluate significance,
models must be fitted using Maximum Likelihood (ML), and this approach was therefore used in
this project.
Model Selection
The Root Mean Squared Error (RMSE) and the Akaike Information Criterion (AIC) (19) were
both used to decide between alternative models, by choosing the model with the smaller values.
RMSE can be interpreted as the standard deviation of the unexplained variance, and has the
useful property of being in the same units as the response variable. Lower values of RMSE
indicate a better fit. RMSE is a good measure of how accurately the model predicts the
response, and can be regarded as an important criterion for fit if the main purpose of the model
is prediction (the main aim of this project). When comparing alternative models, the residuals
from the fit were analysed to check for any departures from the model’s underlying
assumptions. The final selected model was then deployed in the R Shiny application using
REML.
Model Performance
Model performance was evaluated by comparing the predicted versus actual training load
metrics in the holdout Test data set for each of the 6 response variables. For each of these
metrics, Pearson’s correlation coefficient (predicted versus actuals), RMSE, and MAE were
reported.
14
4. Results
In this section of the report, the different models that were fitted to predict the Total Distance
covered are described and results reported. Similar models were fitted for the other 5 response
variables, and summaries of all model performance metrics are provided in Tables 2 to 7. R
code detailing all modelling steps can be found in the submitted R code file. To conclude the
results section, performance of the selected models on the Test Data Set are summarised and
illustrated.
Total Distance models
TDmodel1: All variables, no transformations
The first model that was fit to the Training data set to model Total Distance covered (highlighted
in purple colour in the R code equation below) was a LMM with all variables, and no
transformations. This LMM was a random intercept model, fitted using the lme4 package. A
random player effect was used (highlighted in red):
TDmodel1 = lmer(Total_Distance ~ Duration + Drill + Position + (1 | Player_ID),
data = train_data, REML = F)
𝑦𝑖𝑗𝑘𝑙 = 𝛽0 + 𝜏𝑗 + 𝛾𝑘 + 𝛽1 𝑥𝑖𝑗𝑘𝑙 + 𝑏𝑖 + 𝜖𝑖𝑗𝑘𝑙
𝑏𝑖 ~ 𝑁(0, 𝜎12 ), 𝜖𝑖𝑗𝑘𝑙 ~𝑁(0, 𝜎 2 )
where 𝛽0 is the intercept, 𝜏𝑗 is the 𝑗-th effect of Drill, 𝛾𝑘 is the 𝑘-th effect of Position, 𝛽1 is the
slope for Duration (𝑥𝑖𝑗𝑘𝑙 ) measured on the 𝑖-th player at occasion 𝑙 under the levels 𝑗, 𝑘, of Drill
and Position respectively. The term 𝑏𝑖 is the random effect corresponding to the i-th player.
This model produces a QQ plot with a “long” head and tail (see Figure 12). RMSE was found to
be the highest (385) out of all the subsequent models fitted.
Figure 12: Residual Plots for TDmodel1 Diagnostics
15
TDmodel2: Position variable dropped + Log transformation
ANOVA results showed the Position explanatory variable to be a non-significant predictor of
Total Distance covered and was subsequently dropped from future models. A log transformation
was also applied to both the response variable and the explanatory variable Duration, in order to
capture the correct functional form between the response and the predictors:
TDmodel2 = lmer(log(Total_Distance) ~ log(Duration) + Drill + (1 | Player_ID),
data = train_data, REML = F)
log(𝑦𝑖𝑗𝑘 ) = 𝛽0 + 𝜏𝑗 + 𝛽1 log(𝑥𝑖𝑗𝑘 ) + 𝑏𝑖 + 𝜖𝑖𝑗𝑘
𝑏𝑖 ~ 𝑁(0, 𝜎12 ), 𝜖𝑖𝑗𝑘 ~𝑁(0, 𝜎 2 )
where 𝛽0 is the intercept, 𝜏𝑗 is the 𝑗-th effect of Drill, 𝛽1 is the slope for Duration (𝑥𝑖𝑗𝑘 ) measured
on the 𝑖-th player at occasion 𝑘 under the level 𝑗 of Drill. The term 𝑏𝑖 is the random effect
corresponding to the i -th player.
In this instance where both the response variable and response variable are log-transformed,
the interpretation of the model is that there is an expected percentage change in 𝑦 when 𝑥
increases by some percentage.
Log transformation of Total Distance and Duration resulted in a more appropriate looking QQ
plot (see red arrow in Figure 13 below), and residual scatter plot.
Figure 13: Residual Plots for TDmodel2 Diagnostics
16
TDmodel3: Addition of an interaction term
For the 3rd model, an interaction term between training duration and drill was applied in the
model (highlighted in blue):
TDmodel3 = lmer(log(Total_Distance) ~ log(Duration)*Drill + (1 | Player_ID),
data = train_data, REML = F)
log(𝑦𝑖𝑗𝑘 ) = 𝛽0 + 𝜏𝑗 + 𝜃𝑗 𝑥∗𝑖𝑗𝑘 + 𝑏𝑖 + 𝜖𝑖𝑗𝑘 ,
where 𝑥∗𝑖𝑗𝑘 = log(𝑥𝑖𝑗𝑘 ) , 𝑏𝑖 ~ 𝑁(0, 𝜎12 ), 𝜖𝑖𝑗𝑘 ~𝑁(0, 𝜎 2 )
The difference in this model is the addition of a separate linear regression for each level of Drill.
Thus, each 𝜃𝑗 represents the slope of log(Duration) for the effect of drill 𝑗.
An ANOVA showed that addition of the interaction significantly improves the model, with AIC
largely decreasing from 1749 to 1482 and RMSE falling from 381 to 346. All residual model
checks seemed acceptable, except the long tail on QQ plot (see Figure 14).
Figure 14: Residual Plots for TDmodel3 Diagnostics
TDmodel4: Addition of appropriate polynomial
Prior EDA showed that the relationship between the response variables and Duration is nonlinear. Therefore, a polynomial fit was applied to the model to capture this non-linearity:
17
TDmodel4 = lmer (log (Total_Distance) ~ poly (log(Duration),3 )*Drill + (1 | Player_ID),
data = train_data, REML = F)
𝑃
log(𝑦𝑖𝑗𝑘 ) = 𝛽0 + 𝜏𝑗 + ∑
𝜃𝑗𝑞 (𝑃𝑞 (𝑥∗𝑖𝑗𝑘 )) + 𝑏𝑖 + 𝜖𝑖𝑗𝑘 ,
𝑞=1
∗
where 𝑥𝑖𝑗𝑘
= log(𝑥𝑖𝑗𝑘 ) , 𝑏𝑖 ~ 𝑁(0, 𝜎12 ), 𝜖𝑖𝑗𝑘 ~𝑁(0, 𝜎 2 ), and 𝑃𝑞 = Polynomial of degree q
ANOVA results indicated that a 3rd order polynomial fit was the most appropriate. When
compared to TDmodel3, AIC further decreased greatly from 1482 to 1350, whilst RMSE
dropped from 346 to 327. All residual model checks remained of an acceptable nature, albeit
with a long tail on the QQ plot still present (see Figure 15 overleaf).
Figure 15: Residual Plots for TDmodel4 Diagnostics
TDmodel5 to TDmodel8: GAMMs with differing smoothing functions
TDmodel5 to TDmodel8 were GAMM models, fitted using the gam function in the mgcv
package. Each GAMM model used a different smoother, denoted as 𝑓(𝑥):
log(𝑦𝑖𝑗𝑘 ) = 𝛽0 + 𝑓(𝑥∗𝑖𝑗𝑘 ) + 𝑏𝑖 + 𝜖𝑖𝑗𝑘 ,
where 𝑥∗𝑖𝑗𝑘 = log(𝑥𝑖𝑗𝑘 ) ,
𝑏𝑖 ~ 𝑁(0, 𝜎21 ), 𝜖𝑖𝑗𝑘 ~𝑁(0, 𝜎2 )
TDmodel5 used the default Thin plate regression splines as the smoothing function. An
interaction was fitted using the “by” parameter in the s function of a gam function call in the
mgcv package (highlighted in blue). Position was set as a random effect (highlighted in red):
18
TDmodel5 = mgcv::gam (log(Total_Distance) ~ s(log(Duration), by=Drill, id=0) +
s(Player_ID, bs = "re"), data = train_data, method = "ML")
In the following models, TDmodel6 through to TDmodel8 used the same interaction and random
effect terms as TDmodel5, only differing in the penalised splines used (highlighted in green).
TDmodel6 used the “factor smooth” option in mgcv, where a smooth was required at each level
of Drill, and each smooth had the same smoothing parameter:
TDmodel6 = mgcv::gam (log(Total_Distance) ~ s( log(Duration), by=Drill, bs = "fs ") +
s(Player_ID, bs="re"), data = train_data, method = "ML")
TDmodel7 used cubic regression splines: these have a cubic spline basis defined by a modest
sized set of knots spread evenly through the covariate values. They are penalised by the
conventional integrated square second derivative cubic spline penalty (11).
TDmodel7 = mgcv::gam(log(Total_Distance) ~ s(log(Duration), by=Drill, bs = "cr” ) +
s(Player_ID, bs="re") , data = train_data, method = "ML")
For TDmodel8, P-splines combine a B-spline basis with a discrete penalty on the basis
coefficients, where any combination of penalty and basis order is allowed (11).
TDmodel8 = mgcv::gam(log(Total_Distance) ~ s log(Duration), by=Drill, bs = "ps") +
s(Player_ID, bs="re") , data = train_data, method = "ML")
Further decrements in AIC were evident between TDmodel4 and TDmodel7 (1350 to 1315), but
only small fluctuations in RMSE were seen across the fitting of the various GAMM models.
Figure 16: Residual Plots for TDmodel7 Diagnostics (best fitting model)
19
Figure 16 on the previous page displays the model check graphs for TDmodel7, which was
found to be the model with the lowest AIC and second lowest RMSE out of all models fitted (see
Table 2 on page 22).
TDmodel9: GAMM using gamm4 package
Finally, a GAMM was fitted using the gamm4 package, but no further improvements in AIC or
RMSE were observed when compared to the previous GAMM models fitted using mgcv.
TDmodel9 = gamm4 (log(Total_Distance) ~ s(log(Duration), by = Drill),
random = ~ (1 | Player_ID), data = train_data, REML = F)
Figures 17 displays a selection of partial residual plots for TDmodel7 (the model chosen as the
most appropriate). Appendix A contains all of the partial residual plots for TDmodel7. Figures 18
and 19 display the model fits of TDmodel4, TDmodel7, and TDmodel9, facetted by Player and
Training Drill respectively. It can be seen that the performance of the LMM with polynomial
(TDmodel4) and the chosen GAMM model (TDmodel7) are very similar.
Figure 17: A selection of Partial Residual Plots for TDmodel7
20
Figure 18: Model Fits using TDmodel4, 7 and 9 facetted by Player
Figure 19: Model Fits using TDmodel4, 7 and 9 facetted by Training Drill
21
Metrics Summary for all response variables
R2, AIC, RMSE and MAE are shown in Tables 2 – 7 below for all models fitted on the 6
response variables. The models that were selected for validation with the Test data set are
highlighted in a red box in each table.
Table 2: Total Distance Model Metrics
A 3rd order polynomial was used in TDmodel4.
Table 3: High Intensity Distance Model Metrics
A 3rd order polynomial was used in HIDmodel4.
22
Table 4: Acceleration Distance Model Metrics
A 3rd order polynomial was used in ADmodel4.
Table 5: Deceleration Distance Model Metrics
A 3rd order polynomial was used in DDmodel4.
23
Table 6: Sprint Distance Model Metrics
A 5th order polynomial was used in SDmodel4.
Table 7: Maximum Speed Model Metrics
A 3rd order polynomial was used in MSmodel4.
24
Performance on Test Data Set
A GAMM using cubic regression splines was found to be the most appropriate model for the
prediction of all of the response variables. However, the simpler polynomial LMM models
had very similar fit. All selected models showed an excellent fit to the Test data set (r > 0.84),
with the exception of Sprint Distance and Maximum Speed (r = 0.64 and 0.71 respectively) (see
Table 8). Figures 20 and 21 show the Predicted versus Actuals for each response variable.
Table 8: Performance Metrics using Test Data Set
Figure 20: Plots of Predicted versus Actuals (using Test Data Set)
25
Figure 21: Predicted versus Actuals for all response variables
Table 9 displays a comparison of the Training and Test data performance metrics. The selected
models all performed better on the Test data set, with the exception of Sprint Distance and Max
Speed. It should also be noted that both the Sprint Distance and Maximum Speed response
variables showed poor correlation of fitted versus actuals in both the Training and Test data set
model fits.
Table 9: Comparison of Train/Test Performance Metrics for all response variables
26
Results Summary

Logarithmic transformation of the response variables and Duration was applied in order
to improve model fit.

Player position was established as a non-significant predictor variable.

Visual inspection of residual plots from the selected models did not reveal any
problematic deviations from homoscedasticity or normality, apart from the QQ plots
having a “long tail” appearance.

A GAMM using cubic regression splines was found to be the most appropriate model for
prediction of all the response variables.

The simpler LMMs with polynomial term had very similar RMSEs when compared to the
more complex GAMM models.

RMSE of all models are of an acceptable practical magnitude.

The selected GAMM models performed well on the Test set data, with the exception of
the response variables pertaining to Sprint Distance covered and Maximum Speed
attained.

R2 values are low for Sprint Distance and Maximum Speed (0.53 and 0.55 respectively),
and estimates of these responses should be treated with caution.
27
5. Conclusions and Discussion
Model selection: Simpler LMM or GAMM?
The results from this project demonstrate that GAMMs and an LMM using polynomials are
appropriate models for predicting the physiological workload of football training drills. In the
polynomial LMM, the polynomial term increases the flexibility of the linear model in order to
model a curvilinear relationship, thereby providing a simple way to model curves without having
to resort to more complicated non-linear models. However, the extra flexibility of polynomials
compared to smoothing splines can produce undesirable results at the boundaries, while
splines can still provide a reasonable fit to the data (10).
Project limitations
Some limitations of this project are recognised and highlighted in the following points:
a) Small number of predictor variables: There were only a small number of predictor
variables available for modelling the response variables. Additional information such as
date of training session, temperature and humidity, age of player, and sequence number
of drill in a training session may have provided more information. For instance, workload
metrics may be different in hot and hot humid conditions. Drills performed towards the
end of a training session may show lower workload values due to possible fatigue than
the same drill carried out in the earlier parts of a session. It may be reasonable to
suggest that younger players may work harder in some drills, whilst older, more
experienced players may “pace” themselves during the training sessions.
b) Limited data for some training drills: There were limited data observations for some
of the training drills, and this would negatively hinder appropriate model fitting. For
example, the “Speed and Agility” training drill had 46 observations only.
c) Random intercept-only: Random intercept-only LMMs were fitted in this project, as it
was decided not to use random slope models. It may be reasonable to investigate the
use of random slopes in subsequent analysis, as it may be expected that different
players may differ in their responses to the different drills.
d) Models used in application: Although the GAMM models using cubic regression
splines showed the combination of the lowest AICs and RMSE’s, the simpler polynomial
LMM models were chosen to be deployed in the R Shiny Application, because of easier
coding and quicker run-times. This is not a large limitation though, as Tables 2 to 7 in the
result section clearly show that the RMSE of these models is very similar to those of the
28
GAMM models (on training data). However, as previously discussed, the use of
polynomial models may result in some peculiar predictions at the boundaries of the
selected training duration for some of the training drills.
e) Prediction intervals: The application allows the user to select whether prediction
intervals should be displayed or not. Unfortunately, coding problems meant that a
simpler polynomial model that was missing the interaction term was used when
prediction intervals were required (see submitted R code file).
Practical applications
In order to put the prediction models into practical use, a predictive Training Session Planner
was created using R Shiny. Some practical usages of this application are discussed in the
following points:
a) Optimised football training session planning: The R Shiny Training Session Planner
application created for this project (see Appendices D and E) should assist a football
coach with their session planning - by simply selecting training drills, number of sets of
each drill, and the training drill duration, the user can instantly predict the physiological
workload of the individual drills and the full football training session. The application
ultimately provides the football coach with an easy-to-use tool to design and implement
training sessions in an efficient and safe manner by appropriate use of historical GPS
data.
b) Optimise player performance and well-being: By establishing the estimated workload
for each player before training commences, the football coach has an increased chance
of correctly prescribing a safe and effective training session that maximises performance
and reduces the chances of possible injury. Such a tool is clearly beneficial for the
football player also as an individualised training session designed to make them
physiologically prepared for competitive matches has the potential to improve their
performance and reduce their risk of injury due to inappropriate training loads.
c) Coach educational tool: The Training Session Planner can also be used as a coach
education tool. From personal experience, professional coaches rarely understand or
acknowledge the physiological load that their prescribed training sessions impose.
d) Estimate missing data: The Training Session Planner can also be used to predict missing
training data when devices fail to work correctly, or when a player has forgotten to wear
the GPS device during a training session.
29
e) An alternative to purchasing GPS equipment: For professional football clubs who
cannot afford to purchase GPS sports devices, the Training Session Planner application
may be used to help coaches estimate the physiological workload of their football
training sessions.
f) Use in other team sports: The Training Session Planner is applicable to all sports
where annotated GPS training session data are collected (e.g. NFL, rugby, basketball,
hockey, Australian Rules football). The application can be modified to use data from all
makes of GPS sports devices (e.g. GPSports, Catapult Sports).
Future application improvements
a) Use of GAMM models: The R Shiny Training Session Planner application should be
coded in order to use the GAMM models that were found to give the lowest AIC / RMSE
in this project, and prediction intervals calculated from these models.
b) Add new data: An option to add new data to the training drill database and then update
the prediction models would be a welcome improvement to the application.
c) Recommender system: The Training Session Planner application could be improved
by including a “recommender system”. For example, to recommend a sequence of
training drills that lasts between 75 to 90 minutes, which manifests a total distance
workload of between 6 - 7,000m, and a high intensity distance of between 500 and
750m, whilst minimising acceleration and deceleration distances.
Final Conclusion
In this project, statistical models using GPS data from retrospective training drills were built to
predict future training workload in order to optimise the planning of football training sessions. A
Training Session Planner can help coaches to effectively use historical GPS data to instantly
predict all physiological workload metrics of interest before a training session. In this way, each
training session can be planned effectively, and the predicted physiological workload metrics
can be updated quickly if there are any last minute changes to the football training session plan.
Cautious interpretation of predicted workload values is warranted for training drills with few
historical observations. Sprint Distance and Maximum Speed R2 values were low, and any
predictions of these metrics in the R Shiny application should also be interpreted carefully.
Furthermore, cautious interpretation of training workload predictions is required when training
durations are selected that lie at the boundaries of a training drill’s observational duration range.
30
6. References
1. Larsson P. Global Positioning System and Sport-Specific Testing: Sports Med.
2003;33(15):1093–101.
2. Dwyer DB, Gabbett TJ. Global positioning system data analysis: velocity ranges and a new
definition of sprinting for field sport athletes. J Strength Cond Res. 2012 Mar;26(3):818–24.
3. Hausler J, Halaki M, Orr R. Application of Global Positioning System and Microsensor
Technology in Competitive Rugby League Match-Play: A Systematic Review and Metaanalysis. Sports Med Auckl NZ. 2016 Apr;46(4):559–88.
4. Varley MC, Jaspers A, Helsen WF, Malone JJ. Methodological Considerations When
Quantifying High-Intensity Efforts in Team Sport Using Global Positioning System
Technology. Int J Sports Physiol Perform. 2017 Sep;12(8):1059–68.
5. White AD, MacFarlane NG. Analysis of international competition and training in men’s field
hockey by global positioning system and inertial sensor technology. J Strength Cond Res.
2015 Jan;29(1):137–43.
6. Wisbey B, Montgomery PG, Pyne DB, Rattray B. Quantifying movement demands of AFL
football using GPS tracking. J Sci Med Sport. 2010 Sep;13(5):531–6.
7. Ihaka R, Gentleman R. R: a language for data analysis and graphics. J Comput Graph Stat.
1996;5(3):299–314.
8. Pinheiro JC, Bates DM. Linear mixed-effects models: basic concepts and examples. Mix-Eff
Models -Plus. 2000;3–56.
9. Neocleous T. Advanced Predictive Models: Linear Mixed Models. Course notes for 2017-18
MSc Data Analytics (online). University of Glasgow. 2017.
10. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. Vol. 112.
Springer; 2013.
11. Wood S, Scheipl F. gamm4: Generalized additive mixed models using mgcv and lme4. R
package version 0.2-3. 2014.
12. Gelman A, Hill J. Data analysis using regression and multilevel / hierarchical models. Vol. 1.
Cambridge University Press New York, NY, USA; 2012.
13. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4.
ArXiv Prepr ArXiv14065823. 2014;
14. Luke SG. Evaluating significance in linear mixed-effects models in R. Behav Res Methods.
2017 Aug;49(4):1494–502.
15. Kuznetsova A, Brockhoff PB, Christensen RHB. Package ‘lmertest.’ R Package Version.
2015;2(0).
16. Knowles JE, Frederick C, Knowles MJE. Package ‘merTools.’ 2016;
31
17. Barton K, Barton MK. Package ‘MuMIn.’ R Package Version. 2019;1(6).
18. Fasiolo M, Nedellec R, Goude Y, Wood SN. Scalable visualization methods for modern
generalized additive models. J Comput Graph Stat. 2019;1–9.
19. Sakamoto Y, Ishiguro M, Kitagawa G. Akaike information criterion statistics. Dordr Neth
Reidel. 1986;81.
32
Appendix A: Partial Residual Plots (for TDmodel7)
33
34
35
Appendix B: Model fits for all Response Variables 1
Total Distance
High Intensity Distance
36
Acceleration Distance
Deceleration Distance
37
Sprint Distance
Maximum Speed
38
Appendix C: Model fits for all Response Variables 2 (Facets)
Total Distance
39
High Intensity Distance
40
Acceleration Distance
41
Deceleration Distance
42
Sprint Distance
43
Maximum Speed
44
Appendix D: Navigating the R Shiny Training Session Planner Application
45
Appendix E: Collection of Training Session Planner Application Screenshots
Screenshot of “Session Summary” tab. Squad average workload values are displayed (blue values boxes), as well as each
individual player’s total session statistics.
Training drills
selected
Squad Averages for
total training session
Individual player training session summary
(Total +/- 90% Prediction Intervals)
46
Screenshot of the “Session Summary Visuals” tab. The user can choose the workload metrics to visualise the workload of each
individual player.
Choose the workload metric
from the dropdown list
Visualise the individual
player workloads
47
Screenshot of “Drill 1 Summary” tab. The workloads of all 5 selected training drills can be viewed separately.
Workloads of each player
for the chosen training drill
Choose what drill
you want to analyse
48
By selecting the “Exploratory Data Analysis” option in the left-hand sidebar, the user has multiple options to visualise the data set
used in this project.
Exploratory Data Analysis
option in side bar
Visualise training drill:duration relationships
49
A PDF of the project report can be viewed and downloaded if required. A link to a knitted R Markdown version of the project report (a
different version with more graphs, R code output, and more expansive EDA section) is provided.
Project Report
option in side bar
Link to knitted R Markdown
version of project report
Downloadable PDF of
project report
50