This topic review discusses the terminology used in advanced statistical models collectively referred to as machine learning. Be familiar with this terminology and the different types of models, their applications in investment decision-making, and their limitations. Specifically, be able to identify the appropriate algorithm that is most suitable for a given problem.
The statistical models we have discussed so far often rely on a set of assumptions about the distribution of the underlying data. Machine learning (ML) requires no such restrictive distributional assumptions in most applications. Very broadly, ML is defined as the use of algorithms to make decisions by generalizing (that is, finding patterns) in a given data set. ML generally performs better than standard statistical approaches when dealing with a large number of variables (high dimension) and when the relationships among variables are nonlinear.
Supervised learning uses labelled training data (that is, the target variable is defined) to guide the ML program toward superior forecasting accuracy. For example, to identify earnings manipulators one could provide a large collection of attributes for known manipulators and for known non-manipulators. A computer program can then identify patterns that separate manipulators from non-manipulators and apply those patterns to classify new observations. Multiple regression is an example of supervised learning used in more classical statistics.
Typical tasks for supervised learning include classification and regression. If the target variable is continuous, the model involved is a regression model. Classification models are used when the target variable is categorical or ordinal (for example, ranking). Algorithms can be designed for binary classification (for example, classifying companies as likely to default vs not likely to default) or multicategory classification (for example, assigning a bond to one of several ratings classes).
In unsupervised learning, the ML program is not given labelled training data; instead, inputs (features) are provided without any prescribed outputs. In the absence of any target variable, the program seeks structure or interrelationships in the data (for example, clustering observations that are similar). Clustering is a common unsupervised ML task.
Deep learning algorithms are used for complex tasks such as image recognition and natural language processing. Programs that learn from their own prediction errors are called reinforcement learning algorithms. Both deep learning and reinforcement learning algorithms are commonly implemented using networks inspired by the architecture of the brain, known as neural networks. These networks are applied to problems with significant nonlinearities and complex structure. We discuss these classes of algorithms in more detail in later sections.
Figure 3.1 summarizes the suitability of various ML algorithms for different problem types and data characteristics.
Figure 3.2 shows the practical steps involved in selecting an appropriate ML algorithm based on the problem to be solved and the characteristics of the data.
We discuss these ML algorithms in the remainder of this topic review.
Overfitting is common in supervised ML and occurs when a model fits the training data so closely that it captures noise (random fluctuations) as if it were signal. Overfitting is especially likely when a large number of features are included. Overfitting manifests as an apparently superior in-sample fit (for example, high in-sample R-squared) but poorer predictive performance on new data (low out-of-sample R-squared). Overfit models do not generalize well.
When a model generalizes well, it retains explanatory and predictive power when applied to new (out-of-sample) data.
To measure model generalization, data analysts typically create three non-overlapping data sets:
- Training sample: the data used to fit (train) the model.
- Validation sample: the data used to tune and evaluate the model during development.
- Test sample: the data used to assess the final model's predictive ability on new observations.
In-sample prediction errors occur with the training and validation samples, while prediction errors in the test sample are known as out-of-sample errors. Data scientists decompose prediction error into three components:
- Bias error: in-sample error resulting from an underfit model that misses the underlying relationship.
- Variance error: out-of-sample error resulting from an overfit model that does not generalize to new data.
- Base error: error due to randomness (noise) in the data that no model can eliminate.
A learning curve plots the accuracy rate (that is, 1 - error rate) in the validation or test sample against the size of the training sample. A robust model that generalizes well shows improving accuracy as the training sample grows, with in-sample and out-of-sample error rates converging toward a desirable level. Models with high bias show converging accuracy rates, but at a level below the desired accuracy. Models with high variance show good in-sample accuracy but poor out-of-sample accuracy; the out-of-sample curve lags persistently behind the in-sample curve.
Variance error generally increases with model complexity, while bias error generally decreases with model complexity. Linear models tend to have higher bias and lower variance; highly nonlinear models tend to have higher variance and lower bias. The bias-variance trade-off is the principle that an optimal level of model complexity minimises total expected error, which is the sum of the bias, variance, and base error components.
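The trade-off can be illustrated numerically. The sketch below (assuming NumPy; the sine-wave data and the polynomial degrees are purely illustrative) fits polynomials of increasing complexity to noisy data: in-sample error falls steadily as complexity grows (falling bias), while a badly underfit model shows clearly higher out-of-sample error.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sample(n):
    """Noisy observations of a nonlinear relationship."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, y_train = make_sample(30)
x_test, y_test = make_sample(200)      # fresh data the model never saw

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)
# In-sample error can only fall as degree (complexity) rises;
# the underfit linear model shows the worst out-of-sample error.
```

Plotting `test_err` against degree would trace the U-shaped total-error curve implied by the bias-variance trade-off.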
To reduce overfitting, data scientists commonly apply:
- Complexity reduction: a penalty is imposed on included features that do not improve out-of-sample prediction accuracy; the penalty increases with model complexity.
- Cross-validation: out-of-sample error is estimated directly by repeatedly partitioning the data into training and validation samples.
In k-fold cross-validation, the sample is randomly divided into k equal parts. In each of k iterations, (k - 1) parts are used to train the model and the remaining part is used for validation, so each part serves exactly once as the validation set. The average error rate across the k validation folds provides an estimate of out-of-sample model performance and helps select hyperparameters that minimise expected error.
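A minimal pure-Python sketch of this procedure (the mean-predictor "model" in the usage example is a stand-in for any fitting routine):

```python
import random
from statistics import mean

def k_fold_splits(n, k, seed=0):
    """Randomly partition observation indices 0..n-1 into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k, fit, error):
    """Each fold serves once as the validation set; the model is trained on
    the remaining k - 1 folds and the k validation errors are averaged."""
    folds = k_fold_splits(len(X), k)
    scores = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(error(model, [X[i] for i in held_out],
                            [y[i] for i in held_out]))
    return mean(scores)

# Toy usage: a "model" that always predicts the training-sample mean
X = list(range(20))
y = [2.0 * xi for xi in X]
cv_error = cross_validate(X, y, k=5,
                          fit=lambda X, y: mean(y),
                          error=lambda m, X, y: mean((yi - m) ** 2 for yi in y))
```

Repeating this for each candidate hyperparameter value and choosing the value with the lowest average validation error is the tuning step described above.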
1. Which statement about target variables is most accurate?
A. They can be continuous, ordinal, or categorical.
B. They are not specified for supervised learning.
C. They refer to independent variables.
2. Which statement most accurately describes supervised learning?
A. It uses labeled training data.
B. It requires periodic human intervention.
C. It is best suited for classification.
3. A model that has poor in-sample explanatory power is most likely to have a high:
A. bias error.
B. variance error.
C. base error.
4. The problem of overfitting a model would least appropriately be addressed by:
A. imposing a penalty on included features that do not add to the explanatory power of the model.
B. using cross validation.
C. using a smaller sample.
5. Cross validation occurs when:
A. training and validation samples change over the learning cycle.
B. prediction is tested in another heterogeneous sample.
C. the performance parameter is set by another algorithm.
We describe common supervised ML algorithms and indicate the problems for which they are best suited. These algorithms are widely used in investment applications (for example, credit classification, fraud detection, or stock-selection tasks).
Penalized regression models reduce overfitting by including a penalty term that increases with the number or size of coefficients. The objective becomes minimising the sum of squared errors plus a penalty term. Penalized regressions make the model more parsimonious by shrinking or eliminating coefficients of nonperforming features.
Least Absolute Shrinkage and Selection Operator (LASSO). LASSO minimises the sum of squared errors plus the sum of the absolute values of the slope coefficients multiplied by a penalty parameter λ (lambda). There is a trade-off between reducing SSE (by including more features) and imposing the penalty (which discourages complexity). Through optimisation, LASSO can set some coefficients exactly to zero, thereby performing variable selection automatically. The penalty parameter λ is a hyperparameter chosen by the researcher (often via cross-validation).
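A minimal coordinate-descent sketch of LASSO (assuming NumPy, and using the commonly normalised objective SSE/(2n) + λΣ|b_j| rather than raw SSE; the helper names are illustrative). Note how the soft-threshold update sets small coefficients exactly to zero:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values inside [-lam, lam] become 0."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso(X, y, lam, n_iter=200):
    """Coordinate descent for: minimise SSE/(2n) + lam * sum(|b_j|)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # residual excluding feature j
            rho = X[:, j] @ r / n               # feature j's marginal fit
            z = X[:, j] @ X[:, j] / n
            b[j] = soft_threshold(rho, lam) / z
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.01 * rng.normal(size=200)  # only feature 0 matters
b_lasso = lasso(X, y, lam=0.5)   # nonperforming features driven to exactly 0
b_ols = lasso(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
```

With λ = 0.5 the irrelevant coefficients are eliminated outright, illustrating LASSO's automatic variable selection; cross-validation (previous section) is the usual way to choose λ.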
Regularization (in broader terms) forces coefficients of less-important features toward zero to reduce statistical variability in high-dimensional estimation. Regularization can be applied to nonlinear problems as well, for example when estimating stable covariance matrices for mean-variance optimisation.
Investment analysts commonly use LASSO and related regularization techniques to build parsimonious predictive models that generalise better out of sample.
In everyday usage, "parsimonious" means stingy or penny-pinching. In statistics, a parsimonious model is one that achieves a required explanatory or predictive performance using as few predictor variables as necessary.
SVM is a linear classification algorithm that separates the data into one of two classes by identifying an n-dimensional hyperplane (given n features) that best divides the sample. SVM chooses the boundary that maximises the margin, which is the distance between the boundary and the nearest observations from each class. The observations that lie on the margin boundaries are the support vectors.
SVM can be adapted via soft-margin classification to allow some misclassification in the training data. This introduces a trade-off between a wider margin (which tends to generalize better) and classification errors on the training set. For nonlinear problems, kernel methods or nonlinear models may be used instead of a linear SVM, but nonlinear models may require more features and may risk overfitting.
Investment applications of SVM include classifying debt issuers into likely-to-default versus not-likely-to-default, screening stocks to short, and text classification of news or company disclosures as positive or negative.
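A minimal sketch of a soft-margin linear SVM (assuming NumPy), trained by sub-gradient descent on the primal hinge-loss objective rather than the usual quadratic-programming formulation; the data and hyperparameter values are illustrative:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Minimise lam*||w||^2 + mean(max(0, 1 - y*(w.x + b))): the penalty
    widens the margin, the hinge loss penalises margin violations."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        inside = margins < 1                 # points violating the margin
        grad_w = 2 * lam * w - (y[inside, None] * X[inside]).sum(0) / len(y)
        grad_b = -y[inside].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two linearly separable classes (labels coded +1 / -1)
X = np.array([[2., 2.], [3., 1.], [2., 3.],
              [-2., -2.], [-3., -1.], [-2., -3.]])
y = np.array([1., 1., 1., -1., -1., -1.])
w, b = train_linear_svm(X, y)
# A new observation is classified by the sign of w.x + b
```

The `lam` penalty embodies the trade-off described above: a larger `lam` favours a wider margin at the cost of more training misclassifications.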
KNN is a nonparametric method commonly used for classification but sometimes used for regression. The researcher specifies the hyperparameter k, which determines how many nearest observations (according to a chosen distance metric) are consulted to classify a new observation. The new observation is assigned the class most common among the k nearest neighbours (or the average outcome for regression).
Choosing k is important: if k is too small, the model can be noisy and highly variable; if k is too large, the model can be overly smooth and biased. If k is even and a tie occurs, classification ambiguity arises. KNN requires the definition of a distance metric (for example, Euclidean distance); inclusion of irrelevant or highly correlated features can skew distances and degrade performance. Domain knowledge is often essential to choose appropriate feature scaling and distance measures.
Investment uses include predicting bankruptcy, assigning bonds to ratings classes, predicting stock returns, and constructing custom indices based on similarity.
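The KNN mechanics are simple enough to sketch in a few lines of pure Python (Euclidean distance and majority vote; the data and labels are illustrative):

```python
import math
from collections import Counter

def knn_classify(train, labels, x_new, k=3):
    """Assign x_new the class most common among its k nearest training
    observations, measured by Euclidean distance."""
    nearest = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i], x_new))
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy example: two well-separated groups in feature space
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["low_risk", "low_risk", "low_risk",
          "high_risk", "high_risk", "high_risk"]
```

Choosing an odd `k` avoids the tie ambiguity noted above, and rescaling features before computing distances prevents any one feature from dominating the metric.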
CART encompasses both classification trees (for categorical targets) and regression trees (for continuous targets). Classification trees are particularly useful when variables interact in complex, nonlinear ways and logit-type linear models are ill-suited.
A classification tree assigns observations to classes by recursively partitioning the feature space. At a node, the algorithm selects the most informative feature and a cutoff value c. Observations with feature values greater than c are assigned to one branch and the rest to the other branch; subsequent splits further partition the data. The process continues until splits no longer reduce the estimation error, producing terminal nodes (leaves) that produce class assignments or average predicted values for regression trees. A given feature may reappear at different levels of the tree with different cutoff values if it improves prediction at that level.
To prevent overfitting, regularization criteria such as maximum tree depth, minimum number of observations in a leaf, or maximum number of leaves are imposed. Alternatively, trees may be grown and then pruned by removing sections with minimal explanatory power.
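The split-selection step at a single node can be sketched as follows (a toy pure-Python example; Gini impurity is assumed here as the measure of how informative a split is, and the helper names are illustrative):

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed classes."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Find the (feature, cutoff) pair minimising the weighted impurity
    of the two branches it creates."""
    best = (None, None, float("inf"))
    n = len(y)
    for j in range(len(X[0])):                       # each candidate feature
        for c in sorted(set(row[j] for row in X)):   # each candidate cutoff
            left = [y[i] for i in range(n) if X[i][j] <= c]
            right = [y[i] for i in range(n) if X[i][j] > c]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, c, score)
    return best   # (feature index, cutoff value, weighted impurity)
```

A full tree applies `best_split` recursively to each branch until a stopping rule (maximum depth, minimum leaf size) is reached.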
CART is popular because it provides a visual and interpretable explanation of how predictions are formed, in contrast with some black-box methods.
Investment applications include detecting fraudulent financial statements and selecting individual stocks or bonds for portfolios.
Ensemble learning combines predictions from multiple models to produce a final prediction typically better than any single model. Two broad ensemble approaches are:
- Aggregation of heterogeneous learners: different types of algorithms are combined through a voting classifier.
- Aggregation of homogeneous learners: the same algorithm is trained on different samples of the data, for example through bootstrap aggregating (bagging), in which each model is trained on a random sample drawn with replacement from the original data.
Random forest is an ensemble of decision trees built using bagging where each tree is trained on a bootstrap sample of the original data and at each split a randomly chosen subset of features is considered. Each tree produces a classification or a prediction, and the ensemble aggregates them (for example, by majority vote or by averaging). Random forests reduce variance relative to a single tree and mitigate overfitting because each tree sees different data and features. However, the interpretability of single CART trees is largely lost and random forests are therefore often treated as black-box models.
Investment applications of random forests include factor-based asset allocation and prediction models for the success of an IPO.
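The bagging-plus-random-features idea can be sketched in pure Python. For brevity the "trees" below are single-split stumps rather than full trees, and all names and data are illustrative:

```python
import random
from collections import Counter

def fit_stump(X, y, features):
    """One-split tree: pick the (feature, cutoff) pair, restricted to a
    random feature subset, that minimises training misclassifications."""
    best = None
    for j in features:
        for c in set(row[j] for row in X):
            left = [y[i] for i, row in enumerate(X) if row[j] <= c]
            right = [y[i] for i, row in enumerate(X) if row[j] > c]
            if not left or not right:
                continue
            lc = Counter(left).most_common(1)[0][0]
            rc = Counter(right).most_common(1)[0][0]
            err = sum(v != lc for v in left) + sum(v != rc for v in right)
            if best is None or err < best[0]:
                best = (err, j, c, lc, rc)
    return best and best[1:]       # (feature, cutoff, left_cls, right_cls)

def fit_forest(X, y, n_trees=25, seed=0):
    """Each learner sees a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    p = len(X[0])
    forest = []
    for _ in range(n_trees):
        boot = [rng.randrange(len(X)) for _ in range(len(X))]
        feats = rng.sample(range(p), max(1, p // 2))
        stump = fit_stump([X[i] for i in boot], [y[i] for i in boot], feats)
        if stump:
            forest.append(stump)
    return forest

def forest_predict(forest, x):
    """Aggregate the ensemble by majority vote."""
    votes = Counter(lc if x[j] <= c else rc for j, c, lc, rc in forest)
    return votes.most_common(1)[0][0]

X = [[1, 5], [2, 6], [1, 6], [2, 5], [8, 1], [9, 2], [8, 2], [9, 1]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = fit_forest(X, y)
```

Because each learner sees different observations and features, individual errors tend to cancel in the vote, which is the variance reduction described above.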
1. A general linear regression model that focuses on reduction of the total number of features used is best described as a:
A. clustering model.
B. deep learning model.
C. penalized regression model.
2. A machine learning technique that can be applied to predict either a categorical target variable or a continuous target variable is most likely to describe a:
A. support vector machine.
B. classification and regression tree (CART).
C. logit model.
3. An algorithm to assign a bond to a credit rating category is least likely to use:
A. clustering.
B. classification and regression tree (CART).
C. K-nearest neighbor (KNN).
4. A fixed-income analyst is designing a model to categorize bonds into one of five ratings classifications. The analyst uses 12 fundamental variables and 2 technical variables to help her in the task. The number of features used by the analyst is closest to:
A. 14 features.
B. 70 features.
C. 120 features.
We now discuss common unsupervised learning methods and their investment applications.
PCA addresses problems that arise when a data set has excessive noise because the number of features (its dimension) is large. Dimension reduction discards attributes that contain little additional information and summarises the information in a smaller set of uncorrelated factors. PCA constructs linear combinations of the original features; these linear combinations are the principal components, defined by the eigenvectors of the data's covariance matrix. Each principal component has an associated eigenvalue that measures the proportion of total variance explained by that component. The first principal component has the largest eigenvalue and explains the largest share of variance; the second principal component explains the next largest share, and so on.
In practice, a scree plot shows the proportion of total variance explained by each principal component. Analysts commonly retain the smallest number of principal components that collectively explain a substantial portion (for example, about 85%-95%) of total variance. Because principal components are linear combinations of original variables, they can be difficult to label or interpret, which is why PCA is sometimes described as producing a black-box representation.
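The variance decomposition behind a scree plot can be sketched with NumPy (the four-feature, one-common-factor data set here is illustrative):

```python
import numpy as np

def pca_explained(X):
    """Eigendecompose the covariance of centred X; return each eigenvalue
    as a fraction of total variance, sorted in descending order."""
    Xc = X - X.mean(axis=0)                       # centre the features
    cov = np.cov(Xc, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]       # largest first
    return eigvals / eigvals.sum()

# Four features that all load on one common factor plus small noise
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
X = np.hstack([factor + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])
ratios = pca_explained(X)
# Plotting `ratios` against component number gives the scree plot;
# here the first component captures nearly all of the variance.
```

Retaining only the first component would satisfy an 85%-95% explained-variance rule for this data, reducing four features to one.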
Clustering groups observations into categories based on similarity of attributes (cohesion). For example, stocks may be grouped by similar return patterns rather than by standard sector labels. Human judgement plays a role in defining similarity and choosing distance metrics; Euclidean distance (straight-line distance in feature space) is a common choice.
K-means clustering partitions observations into k non-overlapping clusters, where k is a hyperparameter specified by the researcher. Each cluster has a centroid, and each observation is assigned to the cluster with the nearest centroid. The algorithm proceeds iteratively: initial centroids (often random) are chosen; observations are assigned to the nearest centroid; centroids are recomputed; assignments may change; the process repeats until assignments stabilise. A limitation is that k must be specified in advance, requiring some prior knowledge or exploratory work.
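The iteration can be sketched in pure Python (Lloyd's algorithm; for reproducibility the initial centroids are passed explicitly here, whereas in practice they are often chosen at random):

```python
import math

def k_means(points, init_centroids, iters=20):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its assigned points; repeat until stable."""
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return centroids, clusters

# Two obvious groups; k = 2 centroids seeded near each group
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, init_centroids=[(0, 0), (10, 10)])
```

Running the procedure for several values of k and comparing the resulting cluster cohesion is a common way to address the need to specify k in advance.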
Hierarchical clustering builds a hierarchy of clusters without a pre-specified number of clusters. There are two broad approaches:
- Agglomerative (bottom-up) clustering: each observation begins as its own cluster, and the two closest clusters are merged repeatedly until the desired structure emerges.
- Divisive (top-down) clustering: all observations begin in a single cluster, which is split repeatedly into smaller clusters based on dissimilarity.
Clustering can be used for diversification (for example, investing across clusters) and to detect concentration risk (for example, heavy allocation to a single cluster). Although clusters are not always easily interpretable, the method can reveal hidden structure in complex data.
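A minimal pure-Python sketch of the bottom-up (agglomerative) approach, assuming single-linkage distance (distance between clusters is the gap between their closest members); the stopping criterion and data are illustrative:

```python
import math

def agglomerative(points, n_clusters):
    """Start with every point as its own cluster; repeatedly merge the
    two closest clusters until only n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)       # merge the closest pair
    return clusters

clusters = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)],
                         n_clusters=3)
```

Recording the distance at each merge produces the dendrogram used to decide how many clusters to keep.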
ESG (environmental, social, governance) factor-based investing is gaining popularity. The governance factor is often relatively objective and easier to measure, while social and environmental impacts are more subjective. ML and natural language processing can parse corporate disclosures in text, audio, and video formats to collate signals. For example, mentions of phrases such as "human capital", "living wage", and "D&I" (diversity and inclusion) can indicate a company's social stance; mentions of "sustainable", "recycle", or "green" can indicate environmental focus.
Supervised learning algorithms such as logistic regression, SVM, CART, random forests, or neural networks can then be used to generate ESG scores from these extracted features.
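The phrase-extraction step can be sketched with the standard library (the phrase lists below are illustrative, not an actual scoring vocabulary):

```python
import re
from collections import Counter

# Hypothetical ESG phrase lists, loosely following the examples above
ESG_TERMS = {
    "social": ["human capital", "living wage", "diversity and inclusion"],
    "environmental": ["sustainable", "recycle", "green"],
}

def esg_term_counts(text):
    """Count case-insensitive occurrences of each ESG-related phrase."""
    text = text.lower()
    return {pillar: Counter({t: len(re.findall(re.escape(t), text))
                             for t in terms})
            for pillar, terms in ESG_TERMS.items()}

disclosure = ("We invest in human capital, pay a living wage, and run "
              "green, sustainable plants that recycle water.")
counts = esg_term_counts(disclosure)
```

In practice these counts (or richer natural-language features) would form the feature matrix fed to the supervised scoring model.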
Neural networks (NNs), also called artificial neural networks (ANNs), are adaptable models useful in supervised regression and classification. A typical feedforward neural network consists of an input layer, one or more hidden layers, and an output layer. The input layer nodes represent features (independent variables). Inputs are usually scaled so that values from multiple nodes are comparable.
Each node in a hidden layer (a neuron) receives multiple weighted inputs, computes a weighted sum (a summation operator), and passes the result through an activation function (typically nonlinear). The outputs of neurons feed forward to nodes in later layers in a process called forward propagation. During training, the network adjusts the weights to reduce errors using backward propagation (backpropagation), which computes gradients of a loss function and updates weights via an optimisation algorithm (for example, gradient descent).
Network structure (for example, number of inputs, the size and number of hidden layers, and number of output nodes) is specified by the researcher and treated as hyperparameters. For example, a network with three inputs, a single hidden layer of four neurons, and a single output can be denoted structurally as 3-4-1. Hyperparameters are tuned based on out-of-sample performance.
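Forward propagation through such a 3-4-1 network can be sketched with NumPy (the ReLU activation and random weights are illustrative; training would adjust the weights via backpropagation):

```python
import numpy as np

def relu(z):
    """A common nonlinear activation: pass positives, zero out negatives."""
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    """Forward propagation: each layer computes weighted sums of its
    inputs, and the hidden layer applies a nonlinear activation."""
    hidden = relu(W1 @ x + b1)      # 4 hidden neurons
    return W2 @ hidden + b2         # single output node (linear here)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # 4 hidden -> 1 output
y_hat = forward(np.array([0.5, -0.2, 0.1]), W1, b1, W2, b2)
```

The weight-matrix shapes (4x3 and 1x4) encode the 3-4-1 structure directly; changing the hidden-layer hyperparameter simply changes these shapes.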
Deep learning networks are neural networks with many hidden layers (at least two and frequently dozens). DLNs are especially powerful for tasks such as image recognition, natural language processing, and other complex pattern-recognition problems. The output layer of a DLN typically produces class probabilities and assigns each observation to the class with highest probability.
DLNs have been applied successfully in finance as well. For example, in one study using the six input parameters of the Black-Scholes model, a DLN predicted option values with an R-squared of 99.8%. Other studies have used DLNs on standard equity factors (for example, book-to-market and operating income-to-market capitalisation) and obtained predictive improvements over classical factor models in specific experiments. The popularity of DLNs stems from improvements in optimisation methods, increases in computational speed (for example, GPU computing), and the availability of large machine-readable data sets.
Reinforcement learning involves an agent that interacts with an environment and seeks to maximise a defined cumulative reward subject to constraints. RL does not require labelled training data; instead, the agent learns from feedback across many trials. Notable successes of RL include AlphaGo (DeepMind), which defeated the world champion at the game of Go by learning from self-play and experience.
Applications of RL to investment decision-making have been explored, but the evidence for consistent out-of-sample superiority in live trading remains mixed and inconclusive at present. RL methods can be computationally intensive and may require careful reward design and robust simulation environments before deployment.
1. Image recognition problems are best suited for which category of machine learning (ML) algorithms?
A. Hierarchical clustering.
B. Unsupervised learning.
C. Deep learning.
2. Which of the following is least likely to be described as a black-box approach to machine learning (ML)?
A. Principal component analysis (PCA).
B. Classification trees.
C. Random forests.
3. An analyst wants to categorize an investment universe of 1,000 stocks into 10 dissimilar groups. The machine learning (ML) algorithm most suited for this task is:
A. a classification and regression tree (CART).
B. clustering.
C. regression.
With supervised learning, inputs and outputs are identified for the computer, and the algorithm uses this labelled training data to model relationships.
With unsupervised learning, the computer is not given labelled data; rather it is provided unlabeled inputs and is tasked with determining the structure of the data.
Deep learning algorithms, based on neural networks with many hidden layers, are used for complex tasks such as image recognition and natural language processing; reinforcement learning algorithms learn from their own prediction errors.
In supervised learning, overfitting results from a large number of independent variables (features), producing an overly complex model that may have learned random noise and thus has apparent high in-sample forecasting accuracy but poor generalisation (low out-of-sample R-squared).
To reduce overfitting, data scientists use complexity reduction (for example, penalization/regularization) and cross-validation. Complexity reduction involves imposing a penalty on included features that do not improve out-of-sample prediction accuracy; the penalty increases with model complexity.
Common supervised learning algorithms:
- Penalized regression (for example, LASSO): reduces overfitting by imposing a penalty based on the number or magnitude of included features.
- Support vector machine (SVM): a linear classifier that separates observations into two classes using the hyperplane that maximises the margin.
- K-nearest neighbor (KNN): classifies a new observation based on the classes of its k nearest neighbours in feature space.
- Classification and regression trees (CART): recursively partition the data, at each node splitting on the most informative feature and cutoff value.
- Ensemble learning and random forests: combine the predictions of multiple models (for example, decision trees trained on bootstrap samples) to improve accuracy.
Common unsupervised learning algorithms:
- Principal component analysis (PCA): reduces dimension by summarising correlated features in a smaller set of uncorrelated principal components.
- K-means clustering: partitions observations into k non-overlapping clusters, each organised around a centroid.
- Hierarchical clustering: builds a hierarchy of clusters, either bottom-up (agglomerative) or top-down (divisive).
Neural networks comprise an input layer, hidden layers (which process inputs), and an output layer. Hidden-layer nodes (neurons) calculate weighted sums of inputs and apply an activation function (typically nonlinear).
Deep learning networks are neural networks with multiple hidden layers and are applied to pattern recognition tasks such as image and speech recognition.
Reinforcement learning algorithms learn to maximise a defined reward by interacting with an environment and learning from outcomes.
1. Ans. A Target variables (that is, dependent variables) can be continuous, ordinal, or categorical. Target variables are not specified for unsupervised learning. (LOS 3.a)
2. Ans. A Supervised learning uses labeled training data and does not require periodic human intervention. It is suited to both regression and classification tasks, not only classification. (LOS 3.a)
3. Ans. A Bias error is the in-sample error resulting from models with a poor fit. (LOS 3.b)
4. Ans. C To reduce the problem of overfitting, data scientists use complexity reduction and cross validation. Using a smaller sample would generally exacerbate overfitting rather than reduce it. (LOS 3.b)
5. Ans. A In cross validation, the training and validation samples are randomly generated for each learning cycle (for example, in k-fold cross-validation the held-out fold changes across cycles). (LOS 3.b)
1. Ans. C Penalized regression imposes a penalty based on the number or magnitude of features used in a model and is used to construct parsimonious models. (LOS 3.c)
2. Ans. B Classification and regression tree (CART) is a supervised ML technique that can predict either a continuous target variable (regression tree) or a categorical target variable (classification tree). Support vector machines and logit models are typically used for categorical targets only. (LOS 3.c)
3. Ans. A CART and KNN are supervised learning algorithms used for classification. Clustering is an unsupervised learning algorithm and therefore is less appropriate when labelled training data exist for specific ratings. (LOS 3.c)
4. Ans. A The analyst is using 12 fundamental variables and 2 technical variables for a total of 14 features. (LOS 3.c)
1. Ans. C Deep learning algorithms are used for complex tasks such as image recognition and natural language processing. (LOS 3.e)
2. Ans. B Classification trees are popular because they provide a visual explanation of the predictive process. Random forests and PCA do not provide clear guidance about the features used to classify observations (random forests) or what the principal components represent (PCA), which is why both are often described as black-box approaches. (LOS 3.c, 3.d)
3. Ans. B Since the researcher is not providing any labelled training data about the 1,000 stocks, an unsupervised learning algorithm is required. Clustering is an unsupervised approach suitable for grouping stocks into dissimilar groups. Regression and CART are supervised algorithms and require labelled targets. (LOS 3.c)