Supplementary Material S1 – Additional details on the machine learning procedures
of the studies reviewed
In this section we provide additional details on the machine learning procedures employed by the studies reviewed in the manuscript. A typical machine learning pipeline involves three complementary steps: 1) feature selection; 2) classification or score prediction; and 3) validation. Classification or score prediction follows the selection of a subset of features, and validation procedures should be applied to both of these steps.
Feature Selection
Feature selection is an automated process that identifies a minimal set of maximally informative features within a larger set. Only a small subset of the studies reviewed in the manuscript (6/15) employ feature selection, using stepwise forward and ElasticNet methods.
Stepwise methods involve the construction and comparison of multiple statistical models, each involving a different combination of features. A stepwise forward method, in particular, correlates each single feature with the outcome and ranks the features in decreasing order of correlation. The first model includes only the most correlated feature. The second most correlated feature is then added to the model, but only kept if the model improves (e.g. in its fit to the data or its likelihood). The third feature is then introduced, and so on. Other stepwise procedures exist, such as the stepwise backward method, which starts from a model including all features, removes them one by one starting from the least correlated with the outcome, and stops when the model ceases to improve.
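As an illustration, the following minimal sketch shows greedy forward feature selection using scikit-learn's SequentialFeatureSelector. This is not the exact correlation-ranked procedure used in the reviewed studies, only a closely related approach; the data, the variable names (X, y) and the choice of estimator are illustrative assumptions.

    # Sketch of forward feature selection on placeholder data (not from the reviewed studies).
    import numpy as np
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 10))        # 60 hypothetical voice samples, 10 acoustic features
    y = rng.integers(0, 2, size=60)      # hypothetical ASD (1) vs TD (0) labels

    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=3,          # keep the 3 most informative features
        direction="forward",             # add features one at a time
        cv=5,
    )
    selector.fit(X, y)
    print("Selected feature indices:", np.flatnonzero(selector.get_support()))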
The ElasticNet regularization method is a statistical technique that selects a minimal number of predictors by grouping them (correlated predictors are grouped together) and applying penalization rules that shrink the coefficients of these groups, de facto excluding those groups for which the coefficients become indistinguishable from zero. The method combines the regularization penalties of the Lasso and Ridge methods.
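A minimal sketch of ElasticNet regularization with scikit-learn follows; features whose coefficients are shrunk to zero are effectively excluded. The data and outcome are invented placeholders, not material from the reviewed studies.

    # Sketch of ElasticNet-based feature selection on placeholder data.
    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 10))                 # hypothetical acoustic features
    y = X[:, 0] * 2.0 + rng.normal(size=60)       # hypothetical numeric outcome

    model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5)  # mix of Lasso (L1) and Ridge (L2) penalties
    model.fit(X, y)
    print("Retained (non-zero) features:", np.flatnonzero(model.coef_))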
For an introduction to the notion of feature selection, see Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. Berlin: Springer. For a recent review of feature selection methods, see Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.
Classification
Classification (in machine learning) is a statistical process through which an algorithm
learns to attribute previously unseen data to one of two or more classes. In the studies
reviewed in the manuscript, this means that the algorithm has to identify whether a
previously unseen participant has ASD or not, based on the acoustic characteristics of
their voice. The performance of a classification algorithm can be described in many
ways. The studies reviewed focus on accuracy (the proportion of correct classifications over the total number of classifications), sensitivity or recall (the proportion of participants with ASD that are correctly identified), specificity (the proportion of TD participants that are correctly identified) and precision or positive predictive value (the probability that participants classified as ASD actually have ASD). A wide range of algorithms has been developed to solve such tasks. The studies reviewed employ: Discriminant Analysis (DA), Naïve Bayes, Support Vector Machines, k-Nearest Neighbours, and Neural Networks.
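For illustration, the four performance measures above can be computed from a confusion matrix as in the following sketch; the labels are invented for the example (1 = ASD, 0 = TD).

    # Sketch: computing accuracy, sensitivity, specificity and precision from invented labels.
    from sklearn.metrics import confusion_matrix

    y_true = [1, 1, 1, 0, 0, 0, 0, 1]    # hypothetical true diagnoses
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]    # hypothetical classifier output

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy    = (tp + tn) / (tp + tn + fp + fn)   # correct classifications / all classifications
    sensitivity = tp / (tp + fn)                    # recall: ASD participants correctly identified
    specificity = tn / (tn + fp)                    # TD participants correctly identified
    precision   = tp / (tp + fp)                    # positive predictive value
    print(accuracy, sensitivity, specificity, precision)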
Discriminant Analysis combines the acoustic features into the function that best separates the classes at stake (e.g. ASD and TD). A linear DA creates a linear function, a quadratic DA a quadratic one, etc. This function defines a probability distribution, that is, the probability that a new voice belongs to an individual with ASD.
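The following minimal sketch fits a linear DA with scikit-learn and returns the probability that a new voice belongs to the ASD class; the data are random placeholders standing in for acoustic features.

    # Sketch of Linear Discriminant Analysis on placeholder acoustic features.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))         # hypothetical acoustic features
    y = rng.integers(0, 2, size=60)      # hypothetical ASD (1) vs TD (0) labels

    lda = LinearDiscriminantAnalysis().fit(X, y)   # QuadraticDiscriminantAnalysis would fit a quadratic function
    new_voice = rng.normal(size=(1, 5))
    print("P(ASD) for the new voice:", lda.predict_proba(new_voice)[0, 1])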
Naïve Bayes classifiers are conceptually similar to DA: they use Bayes' rule to define the probability that the combined values of the acoustic features are associated with ASD. However, Naïve Bayes also assumes that the acoustic features included in the analysis are independent of each other (that is, uncorrelated with each other).
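A minimal sketch of a Gaussian Naïve Bayes classifier follows; it treats each (placeholder) acoustic feature as independent given the class, as described above.

    # Sketch of Gaussian Naive Bayes on placeholder data.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))         # hypothetical acoustic features
    y = rng.integers(0, 2, size=60)      # hypothetical ASD (1) vs TD (0) labels

    nb = GaussianNB().fit(X, y)          # each feature modelled independently within each class
    print("P(ASD) for the first sample:", nb.predict_proba(X[:1])[0, 1])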
Support Vector Machines are classifiers that are particularly powerful in two contexts: when the features relate to the outcome in non-linear ways and when there are more features than data points (though with a high risk of model overfitting; see the Validation section). An SVM constructs a multi-dimensional hyperplane defined by the features and their interactions and identifies the regions that best separate the classes at stake (e.g. ASD and TD).
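The following sketch fits an SVM with a non-linear (RBF) kernel, which can capture non-linear relations between the features and the classes; the data and kernel choice are illustrative assumptions.

    # Sketch of a Support Vector Machine on placeholder data.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))         # hypothetical acoustic features
    y = rng.integers(0, 2, size=60)      # hypothetical ASD (1) vs TD (0) labels

    svm = SVC(kernel="rbf", C=1.0).fit(X, y)   # non-linear decision boundary via the RBF kernel
    print("Predicted classes for the first 5 samples:", svm.predict(X[:5]))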
k-Nearest Neighbours is a slightly different approach to classification. Instead of identifying a global function that maximally separates the classes, k-NN relies on the local structure of the training data. In other words, it takes the data point to classify and asks: of the k closest data points (nearest neighbours) in the training set, are there more cases of ASD or of TD? k-NN then attributes the majority class to the new data point.
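A minimal sketch of k-NN classification follows: a new point is assigned the majority class among its k closest training points. The data and the choice of k = 5 are placeholders.

    # Sketch of k-Nearest Neighbours classification on placeholder data.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))         # hypothetical acoustic features
    y = rng.integers(0, 2, size=60)      # hypothetical ASD (1) vs TD (0) labels

    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    new_voice = rng.normal(size=(1, 5))
    print("Majority class among the 5 nearest neighbours:", knn.predict(new_voice)[0])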
Neural networks constitute the last group of classification methods employed in the studies reviewed. While NNs come in many kinds, they all share a set of basic principles, in that they are inspired by a simplified understanding of biological neurons. A certain number of input units (input neurons) act as a filter for the acoustic features of the voice sample, being activated (or not) by their values. Activated neurons send a “message” on to the next layer(s) of “neurons” (hidden units). According to the combination of messages received, these units may or may not be activated and pass on a “message”. Further layers of neurons can undergo this process until the output units are reached. In the case of the reviewed papers, there is one output unit: if it is activated, the NN classifies the data as belonging to an individual with ASD; if not, as belonging to a TD individual. At this point the performance of the NN can be evaluated and the connections between neurons adjusted to reduce errors. This process is repeated until the NN reaches a satisfactory performance, at which point it is defined as “trained” and can be applied to new data (e.g. a test set). Deep NNs are characterized by a high number of inner layers and innovative learning rules.
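The sketch below trains a simple feed-forward neural network (multi-layer perceptron) with one hidden layer and a single output for the ASD/TD decision. The architecture and the data are illustrative assumptions, not those of any reviewed study.

    # Sketch of a feed-forward neural network classifier on placeholder data.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))        # hypothetical acoustic features
    y = rng.integers(0, 2, size=100)     # hypothetical ASD (1) vs TD (0) labels

    nn = MLPClassifier(hidden_layer_sizes=(10,),   # one hidden layer of 10 units
                       max_iter=2000, random_state=0)
    nn.fit(X, y)                         # connection weights are adjusted iteratively to reduce errors
    print("Training accuracy:", nn.score(X, y))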
Score Prediction
Score prediction algorithms differ from classification in that they try to predict a numeric variable (e.g. a score from 1 to 30) instead of a categorical one (e.g. ASD or TD). The studies reviewed employ only one family of score prediction algorithms: linear and ordinal regression models. A regression model attempts to quantify the relation between a numeric outcome and a set of numeric and/or categorical predictors. The performance of regression models is assessed (in the reviewed studies) as the amount of variance in the outcome measure (e.g. ADOS total score) that is explained by the model, also called R squared, or adjusted R squared when penalized according to the number of features employed in the regression model.
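As an illustration, the sketch below fits a linear regression predicting a numeric score from placeholder acoustic features and reports R squared and adjusted R squared; the outcome is invented, not an actual ADOS score.

    # Sketch of linear regression with R squared and adjusted R squared on placeholder data.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n, p = 60, 4
    X = rng.normal(size=(n, p))                      # hypothetical acoustic features
    score = 10 + 2 * X[:, 0] + rng.normal(size=n)    # hypothetical numeric outcome

    reg = LinearRegression().fit(X, score)
    r2 = reg.score(X, score)                         # proportion of variance explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)    # penalized for the number of features
    print(r2, adj_r2)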
Validation Process
The development of robust validation methods might be the most effective contribution of machine learning to statistics so far, especially when applied to psychology and related disciplines (e.g. Yarkoni & Westfall, 2016). Traditional statistical analyses are optimized to explain the data in the current sample. In other words, the statistical model developed (be it a t-test or a multiple regression) is optimized to explain as much variance in the current sample as possible. However, this tends to overfit the data, that is, to produce models that are only good for the current sample and its idiosyncrasies, but do not generalize to new data. The model then describes the random noise in the data rather than its systematic variation. To avoid this issue, machine learning approaches tend to employ validation procedures, that is, procedures in which the generalizability of the statistical models developed is directly tested on new data. The studies reviewed employ three related methods to do so: training/test data split, k-fold cross-validation and leave-one-out cross-validation.
Training/test data split is the simplest validation method. It involves setting aside a certain percentage of the data (e.g. 20%) as a hold-out dataset. The statistical analysis (e.g. the training of a classification algorithm) is performed on the remaining data. The resulting model is then tested on the hold-out dataset, and only performance on this last part is considered indicative of the ability of the model to generalize to new data. However, in many cases, and especially in the studies reviewed, the sample size is limited: the training/test data split might then seem an unreasonable loss of data on which to train a statistical model, and the test set might seem too small to reliably reflect the generalizability of the model. Cross-validation is a method developed to re-use all the data as training material, while maintaining hold-out data. K-fold cross-validation consists of dividing the dataset into k roughly equal parts (or folds). The model is then trained on all the data except one fold and tested on that fold. This procedure is repeated for all folds. The performance of the model is then estimated as the average of the performance on the k test folds. Leave-one-out cross-validation is a variant of k-fold cross-validation in which k is equal to the number of data points. In other words, the model is trained on all data except for one data point and the process is repeated for all data points. While leave-one-out cross-validation is very popular, it is more prone to overfitting than k-fold cross-validation with lower k (e.g. Friedman, J., Hastie, T., & Tibshirani, R., 2001. The elements of statistical learning. Berlin: Springer).
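The following sketch illustrates the three validation schemes on placeholder data with scikit-learn utilities; the classifier and sample sizes are arbitrary choices for the example.

    # Sketch of training/test split, k-fold and leave-one-out cross-validation on placeholder data.
    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_score, KFold, LeaveOneOut
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))         # hypothetical acoustic features
    y = rng.integers(0, 2, size=60)      # hypothetical ASD (1) vs TD (0) labels
    clf = LogisticRegression(max_iter=1000)

    # 1) Training/test data split: hold out 20% of the data for testing.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    print("Hold-out accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

    # 2) k-fold cross-validation: average performance over k held-out folds.
    print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=KFold(n_splits=5)).mean())

    # 3) Leave-one-out cross-validation: k equals the number of data points.
    print("LOO CV accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())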
Finally, it should be noted that a good validation process should be applied to both feature selection and classification (or score prediction), and that it should respect the structure of the data (so that, when more than one vocalization is available from the same participant, these vocalizations are not present in both training and test sets).
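A minimal sketch of such a scheme follows: feature selection is nested inside the cross-validation (via a Pipeline, so it is refit within each training fold) and a group-wise split keeps all vocalizations from the same participant in the same fold. The data, the grouping of three vocalizations per participant, and the specific selector and classifier are illustrative assumptions.

    # Sketch of group-wise cross-validation with nested feature selection on placeholder data.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.svm import SVC
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(90, 10))                        # hypothetical acoustic features
    participant = np.repeat(np.arange(30), 3)            # 3 vocalizations per hypothetical participant
    y = np.repeat(rng.integers(0, 2, size=30), 3)        # one diagnosis (ASD/TD) per participant

    pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=3)),         # feature selection refit within each training fold
        ("clf", SVC(kernel="rbf")),
    ])
    scores = cross_val_score(pipe, X, y, groups=participant, cv=GroupKFold(n_splits=5))
    print("Group-wise CV accuracy:", scores.mean())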