Supplementary Material S1 – Additional details on the machine learning procedures of the studies reviewed

In this section we provide additional details on the machine learning procedures employed by the studies reviewed in the manuscript. A typical machine learning process involves three complementary steps: 1) feature selection; 2) classification or score prediction; 3) validation. Classification or score prediction follows the selection of a subset of features, and validation procedures should be applied to both steps.

Feature Selection

Feature selection is an automated process that identifies a minimal set of maximally informative features within a larger set. Only a small subset of the studies reviewed in the manuscript (6/15) employ feature selection, relying on stepwise forward and ElasticNet methods. Stepwise methods involve the construction and comparison of multiple statistical models, each involving a different combination of features. A stepwise forward method, in particular, correlates each individual feature with the outcome and ranks the features in decreasing order of correlation. The first model contains only the most correlated feature. The second most correlated feature is then added to the model, but only kept if the model improves (e.g. in its fit to the data or its likelihood). The third feature is then introduced, and so on. Other stepwise procedures include building a model with all features and removing them one by one, starting from the feature least correlated with the outcome and stopping when the model ceases to improve (stepwise backward method). The ElasticNet regularization method is a statistical technique that selects a minimal number of predictors by grouping them (correlated predictors are grouped together) and applying penalization rules that shrink the coefficients of these groups, de facto excluding those groups whose coefficients become indistinguishable from zero.
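As a rough sketch of the stepwise forward procedure just described (the toy data, the use of in-sample R squared as the fit criterion, and the 0.01 improvement threshold are our own illustrative assumptions, not details taken from the reviewed studies):

```python
def pearson_r(x, y):
    # plain Pearson correlation between two equal-length sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def _solve(A, b):
    # naive Gaussian elimination with partial pivoting (for the normal equations)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def r_squared_ols(columns, y):
    # R^2 of an ordinary least-squares fit (with intercept) on the given predictors
    n, p = len(y), len(columns) + 1
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = _solve(XtX, Xty)
    y_hat = [sum(w * v for w, v in zip(beta, row)) for row in X]
    my = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def forward_select(features, y, min_gain=0.01):
    # rank features by |correlation| with the outcome, then add them one by one,
    # keeping each only if it improves the model fit by at least `min_gain`
    order = sorted(features, key=lambda f: -abs(pearson_r(features[f], y)))
    selected, best = [], 0.0
    for name in order:
        score = r_squared_ols([features[f] for f in selected] + [features[name]], y)
        if score > best + min_gain:
            selected.append(name)
            best = score
    return selected

# toy data (invented): the outcome is exactly 2*x1 + x2, so x3 adds nothing
features = {"x1": [1, 2, 3, 4, 5, 6],
            "x2": [2, 1, 4, 3, 6, 5],
            "x3": [1, 1, 2, 2, 1, 2]}
outcome = [4, 5, 10, 11, 16, 17]
print(forward_select(features, outcome))  # -> ['x1', 'x2']
```

A real implementation would compare models by likelihood or cross-validated fit rather than in-sample R squared, which always increases as features are added.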
The ElasticNet method combines the regularization penalties of the Lasso and Ridge methods. For an introduction to the notion of feature selection, see Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. Berlin: Springer. For a recent review of feature selection methods, see Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.

Classification

Classification (in machine learning) is a statistical process through which an algorithm learns to attribute previously unseen data to one of two or more classes. In the studies reviewed in the manuscript, this means that the algorithm has to identify whether a previously unseen participant has ASD or not, based on the acoustic characteristics of their voice. The performance of a classification algorithm can be described in many ways. The studies reviewed focus on accuracy (the proportion of correct classifications out of all classifications made), sensitivity or recall (the proportion of participants with ASD that are correctly identified), specificity (the proportion of TD participants that are correctly identified) and precision or positive predictive value (the probability that a participant classified as ASD actually has ASD).

A wide range of algorithms has been developed to solve such tasks. The studies reviewed employ: Discriminant Analysis (DA), Naïve Bayes, Support Vector Machines, k-Nearest Neighbours, and Neural Networks. Discriminant Analysis combines the acoustic features into the function that best separates the classes at stake (e.g. ASD and TD). A linear DA creates a linear function, a quadratic DA a quadratic one, and so on. This function defines a probability distribution over whether a new voice belongs to an individual with ASD. Naïve Bayes classifiers are conceptually similar to DA: they use Bayes' rule to define the probability that the combined values of the acoustic features are associated with ASD.
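The four performance measures just listed can be made concrete with a minimal sketch (the labels and counts below are invented for illustration):

```python
def confusion_counts(y_true, y_pred, positive="ASD"):
    # tally true/false positives and negatives for a binary classifier
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # correct / all classifications
        "sensitivity": tp / (tp + fn),                # ASD correctly identified among ASD
        "specificity": tn / (tn + fp),                # TD correctly identified among TD
        "precision": tp / (tp + fp),                  # actual ASD among predicted ASD
    }

truth = ["ASD", "ASD", "ASD", "TD", "TD", "TD", "TD", "TD"]
guess = ["ASD", "ASD", "TD", "TD", "TD", "TD", "ASD", "TD"]
print(metrics(truth, guess))
# -> accuracy 0.75, sensitivity ~0.67, specificity 0.8, precision ~0.67
```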
However, Naïve Bayes also assumes that the acoustic features included in the analysis are independent (that is, uncorrelated with each other). Support Vector Machines are classifiers that are particularly powerful in two contexts: when the features relate to the outcome in non-linear ways, and when there are more features than data points (though with a high risk of model overfitting; see the Validation Process section). A SVM constructs a multi-dimensional hyperplane defined by the features and their interactions and identifies the regions that best separate the classes at stake (e.g. ASD and TD).

k-Nearest Neighbours is a slightly different approach to classification. Instead of identifying a global function that maximally separates the classes, k-NN relies on the local structure of the training data. In other words, it takes the data point to classify and asks: of the k closest data points (nearest neighbours) in the training set, are there more cases of ASD or of TD? k-NN then attributes the majority class to the new data point.

Neural networks constitute the last group of classification methods employed in the studies reviewed. While NNs are of many kinds, they all share a set of basic principles, in that they are inspired by a simplified understanding of biological neurons. A certain number of input units (input neurons) act as a filter for the acoustic features of the voice sample, each being activated or not by their values. Activated neurons send a “message” on to the next layer(s) of “neurons” (hidden units). According to the combination of messages received, these units may or may not be activated and pass on a “message” of their own. Further layers of neurons can undergo this process until the output units are reached. In the case of the reviewed papers, there is one output unit: if it is activated, the NN classifies the data as belonging to an individual with ASD; if not, as belonging to a TD individual.
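The k-NN vote described above admits a very short sketch (Euclidean distance, k = 3, and the toy two-dimensional "acoustic" points are illustrative assumptions):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # sort the training set by distance to the query and keep the k closest
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# invented 2-D feature vectors: one tight cluster per class
training = [((0.0, 0.0), "TD"), ((0.1, 0.2), "TD"), ((0.2, 0.1), "TD"),
            ((1.0, 1.0), "ASD"), ((0.9, 1.1), "ASD"), ((1.1, 0.9), "ASD")]
print(knn_predict(training, (0.95, 1.0)))  # -> ASD
print(knn_predict(training, (0.1, 0.1)))   # -> TD
```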
At this point the performance of the NN can be evaluated and the connections between neurons adjusted to reduce errors. This process is repeated until the NN reaches a satisfactory performance; it can then be considered “trained” and applied to new data (e.g. a test set). Deep NNs are characterized by a high number of inner layers and innovative learning rules.

Score Prediction

Score prediction algorithms differ from classification algorithms in that they try to predict a numeric variable (e.g. a score from 1 to 30) instead of a categorical one (e.g. ASD or TD). The studies reviewed employ only one family of score prediction algorithms: linear and ordinal regression models. A regression model attempts to quantify the relation between a numeric outcome and a set of numeric and/or categorical predictors. In the reviewed studies, the performance of regression models is assessed as the amount of variance in the outcome measure (e.g. ADOS total score) that is explained by the model, also called R squared, or adjusted R squared (when penalized according to the number of features employed in the regression model).

Validation Process

The development of robust validation methods might be the most effective contribution of machine learning to statistics so far, especially when applied to psychology and related disciplines (e.g. Yarkoni & Westfall, 2016). Traditional statistical analyses are optimized to explain the data in the current sample. In other words, the statistical model developed (be it a t-test or a multiple regression) is optimized to explain as much variance in the current sample as possible. However, this tends to overfit the data, that is, to produce models that are only good for the current sample and its idiosyncrasies, but do not generalize to new data. Such a model describes the random noise in the data rather than capturing its systematic variation.
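As a brief illustration of the R squared and adjusted R squared measures mentioned under Score Prediction (the outcome values and predictions below are invented, not taken from any reviewed study):

```python
def r_squared(y, y_hat):
    # proportion of variance in y explained by the model predictions y_hat
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, n_features):
    # penalize R^2 for the number of predictors used in the regression model
    n = len(y)
    return 1 - (1 - r_squared(y, y_hat)) * (n - 1) / (n - n_features - 1)

observed = [10, 12, 14, 16, 18, 20]   # e.g. total scores (invented)
predicted = [11, 12, 13, 16, 18, 21]  # model predictions (invented)
print(r_squared(observed, predicted))               # -> 0.957...
print(adjusted_r_squared(observed, predicted, 2))   # -> 0.928...
```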
To avoid this issue, machine learning algorithms tend to employ validation procedures, that is, procedures in which the generalizability of the statistical models developed is directly tested on new data. The studies reviewed employ three related methods to do so: training/test data split, k-fold cross-validation and leave-one-out cross-validation.

Training/test data split is the simplest validation method. It involves removing a certain percentage of the data (e.g. 20%) as a hold-out dataset. The statistical analysis (e.g. the training of a classification algorithm) is performed on the remaining data. The resulting model is then tested on the hold-out dataset, and only the performance on this last part is considered indicative of the ability of the model to generalize to new data. However, in many cases, and especially in the studies reviewed, the sample size is limited: the training/test split may then seem an unreasonable loss of data on which to train a statistical model, and the test set may seem too small to reliably reflect the generalizability of the model.

Cross-validation is a method developed to re-use all the data as training material while maintaining hold-out data. K-fold cross-validation consists of dividing the dataset into k roughly equal parts (or folds). The model is then trained on all the data except one fold and tested on that fold. This procedure is repeated for all folds. The performance of the model is then estimated as the average of the performance on the k test folds. Leave-one-out cross-validation is a variant of k-fold cross-validation in which k is equal to the number of data points: the model is trained on all data except one data point, and the process is repeated for all data points. While leave-one-out cross-validation is very popular, it is more prone to overfitting than k-fold cross-validation with lower k (e.g. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. Berlin: Springer).
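The fold construction for k-fold (and, with k equal to the sample size, leave-one-out) cross-validation can be sketched as follows; the stride-based fold assignment is one simple choice among many (shuffled or stratified assignments are common alternatives):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation over n points.

    With k == n this reduces to leave-one-out cross-validation.
    """
    # fold i holds every k-th index starting at i, giving roughly equal folds
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held = set(test)
        train = [i for i in range(n) if i not in held]
        yield train, test

for train, test in kfold_indices(6, 3):
    print(test)  # -> [0, 3], then [1, 4], then [2, 5]
```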
Finally, it should be noted that a good validation process should be applied to both feature selection and classification (or score prediction), and that it should respect the structure of the data (so that, when more than one vocalization from the same participant is present, these vocalizations do not appear in both training and test sets).
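A minimal sketch of such a structure-respecting (participant-wise) split, assuming a simple round-robin assignment of participants to folds (the participant labels are invented):

```python
def grouped_folds(participants, k):
    """Assign each unique participant (not each vocalization) to one of k folds,
    so recordings from one participant never span both training and test sets."""
    unique = sorted(set(participants))
    fold_of = {p: i % k for i, p in enumerate(unique)}  # round-robin over participants
    folds = [[] for _ in range(k)]
    for idx, p in enumerate(participants):
        folds[fold_of[p]].append(idx)  # every vocalization follows its participant
    return folds

# participant of each vocalization: p1 and p3 land in fold 0, p2 and p4 in fold 1
recordings = ["p1", "p1", "p2", "p2", "p3", "p3", "p4"]
print(grouped_folds(recordings, 2))  # -> [[0, 1, 4, 5], [2, 3, 6]]
```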