This topic review discusses the terminology used in advanced statistical models collectively referred to as machine learning. Be familiar with this terminology and the different types of models, their applications in investment decision-making, and their limitations. Specifically, be able to identify the appropriate algorithm that is most suitable for a given problem.
The statistical models discussed in earlier readings rely on a set of assumptions about the distribution of the underlying data. Machine learning (ML) typically requires fewer such assumptions and focuses on using algorithms to find patterns and make decisions from data. Broadly, ML is the use of algorithms to generalize from a given data set in order to make predictions or discover structure.
ML methods commonly perform better than standard statistical approaches when relationships among variables are nonlinear, when the number of features is large, and when data sets are too large or unstructured for traditional models to handle well.
Common ML terms and their meanings:
Supervised learning uses labeled training data (i.e., the target variable is defined for the training examples). The algorithm learns a mapping from inputs to outputs so it can predict the target for new observations. Example: to identify earnings manipulators one can provide a large set of attributes for known manipulators and non-manipulators; the algorithm learns patterns that distinguish the two groups. Multiple regression is a simple example of supervised learning. Typical supervised tasks include classification (target categorical or ordinal) and regression (target continuous).
Unsupervised learning is applied when the ML program is not given labeled training data; only inputs (features) are provided and the algorithm seeks structure or relationships within the inputs. Example tasks: clustering and dimension reduction.
Deep learning refers to models based on neural networks with multiple hidden layers. Deep learning models are commonly used for complex tasks such as image recognition and natural language processing. Algorithms that learn from feedback on their predictions and improve via trial and error are termed reinforcement learning. Both deep learning and reinforcement learning are typically implemented using neural network architectures and are especially useful when significant nonlinearities are present.
Figures in this module summarize algorithm suitability (Figure 3.1) and show the steps used to choose an appropriate ML algorithm (Figure 3.2).
Figure 3.1: ML Algorithm Types
Figure 3.2: Choice of Appropriate ML Algorithm
Later sections describe these ML algorithms in detail and give guidance on their investment applications.
Overfitting occurs when a supervised ML model captures noise in the training data as if it were a true pattern. This typically results when too many features are included relative to the amount of training data or when model complexity is excessive. Overfitted models show high in-sample fit (for example, large in-sample R-squared) but poor out-of-sample performance (low out-of-sample R-squared) because they do not generalize to new data.
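To make the symptom concrete, the hypothetical sketch below compares a model that memorizes its training data (a 1-nearest-neighbor predictor) against the same model out of sample. The perfect in-sample R-squared paired with a weaker out-of-sample R-squared is the signature of overfitting; all data values are invented for illustration.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - SSE/SST."""
    mean_y = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_y) ** 2 for a in actual)
    return 1 - sse / sst

def memorizer_predict(train, x):
    """Predict the y of the nearest training x (a model that fits the noise)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

train = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]   # noisy in-sample data
test = [(1.5, 1.5), (2.5, 2.5), (3.5, 3.5)]        # new out-of-sample data

in_sample = r_squared([y for _, y in train],
                      [memorizer_predict(train, x) for x, _ in train])
out_sample = r_squared([y for _, y in test],
                       [memorizer_predict(train, x) for x, _ in test])
# in_sample is exactly 1.0; out_sample is noticeably lower
```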
To assess generalization, data scientists typically partition data into three nonoverlapping data sets:
- a training sample, used to fit the model;
- a validation sample, used to tune model hyperparameters and compare candidate models; and
- a test sample, held out to evaluate out-of-sample performance.
Prediction errors are decomposed conceptually into:
- bias error: in-sample error from an underfit model that misses the true relationships;
- variance error: out-of-sample error from an overfit model that has learned noise in the training data; and
- base error: irreducible error due to randomness in the data.
A learning curve plots accuracy (1 - error rate) for validation or test samples against the size of the training sample. A well-generalizing model will typically show improving validation/test accuracy as the training size increases and convergence of in-sample and out-of-sample errors toward a satisfactory level. Patterns commonly seen:
- high bias (underfitting): in-sample and out-of-sample accuracy converge, but to an unacceptably low level, and adding training data does not help; and
- high variance (overfitting): a persistent gap between high in-sample accuracy and lower out-of-sample accuracy.
Remedies for overfitting include:
- reducing model complexity, for example by penalizing features that add little explanatory power; and
- cross validation, for example k-fold cross validation, in which the data are repeatedly split into training and validation samples so that each observation is used for validation exactly once.
When a model generalizes well, it retains explanatory power when applied to new (out-of-sample) data. Cross validation and appropriate penalization are routine tools to measure and improve generalization.
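As a concrete sketch of the cross validation idea (assuming k-fold with contiguous folds; fold construction varies by implementation), the following splits n observations into k rotating training/validation partitions:

```python
def kfold_indices(n, k):
    """Yield (train, validation) index lists for k-fold cross validation."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation

folds = list(kfold_indices(6, 3))
# each observation serves in the validation sample exactly once
```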
1. Which statement about target variables is most accurate?
A. They can be continuous, ordinal, or categorical.
B. They are not specified for supervised learning.
C. They refer to independent variables.
2. Which statement most accurately describes supervised learning?
A. It uses labeled training data.
B. It requires periodic human intervention.
C. It is best suited for classification.
3. A model that has poor in-sample explanatory power is most likely to have a high:
A. bias error.
B. variance error.
C. base error.
4. The problem of overfitting a model would least appropriately be addressed by:
A. imposing a penalty on included features that do not add to explanatory power of the model.
B. using cross validation.
C. using a smaller sample.
5. Cross validation occurs when:
A. training and validation samples change over the learning cycle.
B. prediction is tested in another heterogeneous sample.
C. the performance parameter is set by another algorithm.
Common supervised ML algorithms and their uses are described in the sections that follow.
Penalized regression models reduce overfitting by imposing a penalty for including many features. The objective is to minimize the sum of squared errors (SSE) plus a penalty term that increases with the magnitude or number of slope coefficients. Penalization makes the model more parsimonious (simpler) by excluding or shrinking coefficients of features that do not meaningfully improve out-of-sample prediction.
LASSO (Least Absolute Shrinkage and Selection Operator) is a widely used penalized regression. LASSO minimizes SSE plus the sum of the absolute values of the slope coefficients. The hyperparameter λ (lambda) controls the tradeoff between fit and penalty; increasing λ tends to set more coefficients exactly to zero, thus performing feature selection automatically.
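The LASSO objective can be sketched in a few lines. In this hypothetical example (all numbers invented), y is driven almost entirely by the first feature; at a small λ the dense coefficient vector achieves a lower penalized objective, while at a large λ the penalty favors the sparse one, which is exactly the feature-selection effect described above.

```python
def lasso_objective(X, y, betas, lam):
    """SSE plus lambda times the sum of absolute slope coefficients."""
    sse = sum((yi - sum(b * xi for b, xi in zip(betas, row))) ** 2
              for row, yi in zip(X, y))
    penalty = lam * sum(abs(b) for b in betas)
    return sse + penalty

X = [(1, 1), (2, -1), (3, 2), (4, -2)]   # feature 2 adds almost nothing
y = [2.1, 3.9, 6.2, 7.8]                 # roughly y = 2 * x1
dense = (2.0, 0.1)                       # keeps the weak second feature
sparse = (2.0, 0.0)                      # drops it

# small lambda: fit dominates, the dense coefficients win
assert lasso_objective(X, y, dense, 0.1) < lasso_objective(X, y, sparse, 0.1)
# large lambda: the penalty dominates, the sparse model wins
assert lasso_objective(X, y, dense, 5.0) > lasso_objective(X, y, sparse, 5.0)
```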
Regularization more generally refers to forcing coefficient estimates toward zero to reduce variance; regularization can be applied in linear and nonlinear settings (for example, to estimate a stable covariance matrix for mean-variance optimization).
SVM is principally a linear classification algorithm that separates data into two classes by defining an n-dimensional hyperplane (given n features). SVM chooses the boundary that maximizes the margin (the distance to the nearest observations on either side); the observations that determine the margin are called support vectors. When strict separation is not possible, soft margin classification allows some misclassification in the training data and optimizes the tradeoff between margin width and misclassification error.
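The geometry is easy to verify: an observation's distance to the hyperplane w·x + b = 0 is |w·x + b| / ||w||, and the sign of w·x + b gives the predicted class. A minimal sketch with invented weights:

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return abs(score) / norm

def classify(w, b, x):
    """Assign a class by which side of the hyperplane x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = (3.0, 4.0), -5.0   # hypothetical fitted hyperplane
assert distance_to_hyperplane(w, b, (2.0, 1.0)) == 1.0   # (6 + 4 - 5) / 5
assert classify(w, b, (2.0, 1.0)) == 1
assert classify(w, b, (0.0, 0.0)) == -1
```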
SVM applications in investment management: classifying debt issuers as likely-to-default vs not-likely-to-default, stocks to short vs not, or classifying textual sentiment (news, releases) as positive/negative.
KNN is a nonparametric method most commonly used for classification but sometimes for regression. The hyperparameter k specifies the number of closest observations (neighbors) in the training sample that are considered when classifying a new observation. The new observation is assigned the majority (or average, in regression) of the k neighbors.
Key considerations: choosing k (too small increases variance and potential error; too large dilutes local structure), tie handling (an even k can lead to ties), and defining the distance metric (what it means to be "near"; Euclidean distance is common, but other metrics may be used). Including irrelevant or strongly correlated features can distort distances and lead to poor performance.
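A minimal KNN classifier (Euclidean distance, majority vote, and an odd k to avoid ties; the labeled toy data are invented for illustration):

```python
import math
from collections import Counter

def knn_classify(train, x_new, k):
    """Classify x_new by majority vote of its k nearest training neighbors."""
    neighbors = sorted(train, key=lambda obs: math.dist(obs[0], x_new))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "default"), ((1, 2), "default"), ((2, 1), "default"),
         ((5, 5), "no default"), ((6, 5), "no default"), ((5, 6), "no default")]

assert knn_classify(train, (1.5, 1.5), k=3) == "default"
assert knn_classify(train, (5.5, 5.5), k=3) == "no default"
```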
Investment applications: predicting bankruptcy, assigning bonds to rating classes, predicting stock movement, constructing customized indices.
CART produces a decision tree by recursively partitioning the feature space. For classification trees the target is categorical (commonly binary); for regression trees the target is continuous.
At each internal node, CART selects a feature and a cutoff value c to split observations into two child nodes; one child contains observations with feature > c, the other the remainder. Each split is chosen to reduce estimation error (improve purity) relative to the parent node. Splitting continues until a stopping rule is met (for example, minimal node size or minimal reduction in error), producing terminal nodes (leaves).
A feature may reappear in lower nodes with a different cutoff if that improves classification. To avoid overfitting, regularization constraints such as maximum tree depth, minimum samples per leaf, or pruning (removing branches with little explanatory power) are used.
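The split search at a single node can be sketched as follows: for each candidate cutoff (a midpoint between sorted feature values), compute the weighted Gini impurity of the two child nodes and keep the cutoff with the greatest purity improvement. The data are invented, and a real implementation searches across all features, not just one.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the cutoff c minimizing weighted child-node impurity."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        c = (pairs[i - 1][0] + pairs[i][0]) / 2   # candidate cutoff
        left = [lab for v, lab in pairs if v <= c]
        right = [lab for v, lab in pairs if v > c]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (c, weighted)
    return best

# debt/equity-style toy feature; labels 0 = no default, 1 = default
cutoff, impurity = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
assert 3 < cutoff < 10     # the split lands between the two groups
assert impurity == 0.0     # both child nodes are pure
```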
CART is popular for its interpretability: it provides a visual, rule-based explanation for predictions, in contrast to many "black-box" algorithms.
Investment applications: detecting fraudulent financial statements, stock and bond selection, binary event prediction (e.g., IPO success).
Ensemble learning combines predictions from multiple models to reduce average error: different models' errors tend to cancel, improving overall accuracy. Two broad ensemble approaches:
- aggregation of heterogeneous learners: different types of algorithms are combined, with a voting classifier determining the final prediction; and
- aggregation of homogeneous learners: the same algorithm is trained on different samples of the data, for example via bootstrap aggregating (bagging), in which each model is trained on a random sample drawn with replacement from the training data.
Random forest is an ensemble variant based on CART: a large number of classification or regression trees are trained on bootstrapped samples from the initial training set. At each split, a random subset of features is considered (feature bagging). The final prediction is determined by aggregating predictions from all trees (majority vote for classification, average for regression). Because each tree uses different data and different feature subsets, random forests reduce overfitting and improve signal-to-noise ratio.
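The two mechanics that distinguish a random forest, bootstrap sampling and aggregation across trees, can be sketched as follows (the data and per-tree predictions are invented):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a sample of the same size, with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(tree_predictions):
    """Aggregate per-tree class predictions for one observation."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)               # seeded for reproducibility
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)
assert len(sample) == 5               # same size as the original data
assert set(sample) <= {1, 2, 3, 4, 5}

# three hypothetical trees disagree; the ensemble takes the majority
assert majority_vote(["buy", "buy", "sell"]) == "buy"
```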
Drawback: random forests sacrifice the transparency of a single decision tree and are therefore often considered a black-box model.
Investment applications: factor-based asset allocation, predicting IPO success, credit risk models.
1. A general linear regression model that focuses on reduction of the total number of features used is best described as a:
A. clustering model.
B. deep learning model.
C. penalized regression model.
2. A machine learning technique that can be applied to predict either a categorical target variable or a continuous target variable is most likely to describe a:
A. support vector machine.
B. classification and regression tree (CART).
C. logit model.
3. An algorithm to assign a bond to a credit rating category is least likely to use:
A. clustering.
B. classification and regression tree (CART).
C. K-nearest neighbor (KNN).
4. A fixed-income analyst is designing a model to categorize bonds into one of five ratings classifications. The analyst uses 12 fundamental variables and 2 technical variables to help her in the task. The number of features used by the analyst is closest to:
A. 14 features.
B. 70 features.
C. 120 features.
Examples of unsupervised learning and investment applications:
PCA is a dimension reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated variables called principal components (eigenvectors). Each component is a linear combination of the original features; each component has an associated eigenvalue representing the proportion of total variance explained by that component. The first principal component explains the largest share of variance, the second the next largest, and so on.
A scree plot shows the proportion of variance explained by each principal component. In practice, analysts typically retain the smallest number of components that collectively explain between roughly 85% and 95% of the total variance. Because principal components are linear combinations of original variables, their interpretation can be difficult; PCA is therefore often described as a black-box approach.
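For two features, the eigenvalues of the sample covariance matrix have a closed form, which makes the variance-explained calculation easy to sketch (data invented; real applications use many features and a numerical eigendecomposition):

```python
import math

def explained_variance_ratios(xs, ys):
    """Eigenvalues of the 2x2 sample covariance matrix as shares of total variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    mean = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov ** 2)
    lam1, lam2 = mean + spread, mean - spread   # eigenvalues, largest first
    total = lam1 + lam2
    return lam1 / total, lam2 / total

# two highly correlated features: the first component captures nearly everything
r1, r2 = explained_variance_ratios([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
assert r1 > 0.95
assert abs(r1 + r2 - 1) < 1e-12
```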
Investment applications: reducing noise in high-dimensional factor sets, constructing lower-dimension factor representations for portfolio construction.
Clustering groups observations into categories (clusters) based on similarity of attributes (cohesion). In investment contexts clustering can be used to group securities by behaviour rather than conventional sector labels (e.g., group stocks by return patterns). Human judgement often influences the choice of similarity metric; a common metric is Euclidean distance (straight-line distance between observations).
Common clustering methods:
- k-means clustering: partitions observations into k nonoverlapping clusters, where k is a hyperparameter set by the researcher; each cluster is characterized by its centroid, and observations are iteratively reassigned to the nearest centroid until assignments stabilize; and
- hierarchical clustering: builds a hierarchy of clusters without a prespecified number of clusters, either agglomerative (bottom-up, progressively merging the closest clusters) or divisive (top-down, progressively splitting clusters).
Investment uses: portfolio diversification by selecting assets from different clusters, risk analysis by identifying concentration in specific clusters, and uncovering latent structures across securities.
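A bare-bones k-means loop (Euclidean distance, invented 2-D data, centroids seeded deterministically at two of the points) shows the assign-then-update cycle at the heart of the method:

```python
import math

def k_means(points, centroids, iterations=10):
    """Repeatedly assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        centroids = [tuple(sum(p[d] for p, lab in zip(points, labels) if lab == j)
                           / labels.count(j) for d in range(2))
                     for j in range(len(centroids))]
    return labels, centroids

points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, centroids=[(1, 1), (8, 8)])
assert labels == [0, 0, 0, 1, 1, 1]   # the two tight groups separate cleanly
```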
EXAMPLE: Application of machine learning to ESG investing
ESG (environmental, social, and governance) factor-based investing is increasingly popular. The governance factor is relatively straightforward to measure via corporate governance metrics, while social and environmental signals are often more subjective. ML and natural language processing can extract information from textual disclosures, audio, and video to construct quantitative measures. For example, mentions of phrases such as "human capital," "living wage," and "D&I" (diversity & inclusion) can be used to quantify social-related disclosures; words such as "sustainability," "recycle," or "green" can indicate environmental intent.
Once textual features are extracted and scored, supervised learning algorithms (logistic regression, SVM, CART, random forests, neural networks) can be trained to generate ESG scores or to classify companies according to ESG risk/quality.
Artificial Neural Networks (ANNs), often simply called neural networks, are composed of layers of nodes (neurons) connected by weighted links. Typical architecture:
- an input layer, with one node per feature;
- one or more hidden layers, where inputs are combined, transformed by an activation function, and passed forward; and
- an output layer, which produces the prediction.
Processing proceeds by forward propagation: inputs are multiplied by weights, summed at each neuron, passed through an activation function, and then passed to the next layer. Learning occurs via backpropagation, where prediction errors are propagated backward through the network to adjust weights using gradient-based optimization methods.
Network structure (for example, a 3-4-1 network with three inputs, four neurons in a single hidden layer, and one output) is determined by the researcher and these choices are treated as hyperparameters that may be tuned based on validation or test performance. Neural networks can capture complex, nonlinear relationships when properly configured and regularised.
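Forward propagation through the 3-4-1 network described above can be sketched directly (the weights here are arbitrary invented values; a real network learns them via backpropagation):

```python
import math

def sigmoid(z):
    """A common activation function, squashing output into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: 3 inputs -> 4 hidden neurons -> 1 output neuron."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
              for weights, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

w_hidden = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.1],
            [-0.3, 0.2, 0.2], [0.2, -0.4, 0.1]]   # 4 neurons x 3 inputs
b_hidden = [0.0, 0.1, -0.1, 0.2]
w_out, b_out = [0.5, -0.5, 0.3, 0.1], 0.05

y = forward([1.0, 2.0, 3.0], w_hidden, b_hidden, w_out, b_out)
assert 0.0 < y < 1.0   # sigmoid output is a probability-like score
```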
Deep learning networks are neural networks with many hidden layers (at least 2 but often dozens). Deep architectures are particularly effective for hierarchical representation learning and excel in tasks like image, speech, and character recognition. In classification tasks, the final layer of a DLN commonly outputs class probabilities and the observation is assigned to the class with the highest probability.
Applications of DLNs include credit card fraud detection, self-driving cars, natural language processing, and various investment decision tasks (for example, option pricing). One study found a DLN using the six Black-Scholes input parameters could predict option values with an R-squared of 99.8%; other studies report DLNs outperforming traditional factor models in certain investment strategies. The recent rise of DLNs is driven by methodological advances, faster computing hardware (GPUs), and abundant machine-readable data.
Reinforcement learning involves an agent that interacts with an environment to maximize a defined reward signal subject to constraints. The agent learns from trial-and-error feedback rather than from labeled training examples. RL has produced notable successes (for example, DeepMind's AlphaGo in the game of Go). While RL techniques show potential for investment applications (for example, portfolio construction and trading strategies learned through simulated environments), results in finance are still an area of active research and the efficacy of RL in real-world investment decision-making is not conclusively established.
1. Image recognition problems are best suited for which category of machine learning (ML) algorithms?
A. Hierarchical clustering.
B. Unsupervised learning.
C. Deep learning.
2. Which of the following is least likely to be described as a black-box approach to machine learning (ML)?
A. Principal component analysis (PCA).
B. Classification trees.
C. Random forests.
3. An analyst wants to categorize an investment universe of 1,000 stocks into 10 dissimilar groups. The machine learning (ML) algorithm most suited for this task is:
A. a classification and regression tree (CART).
B. clustering.
C. regression.
1. A. Target variables (dependent variables) can be continuous, ordinal, or categorical. Target variables are not specified for unsupervised learning. (LOS 3.a)
2. A. Supervised learning uses labeled training data; it does not by definition require periodic human intervention. It is not best suited only for classification, because supervised learning handles both classification and regression tasks. (LOS 3.a)
3. A. Bias error is the in-sample error resulting from models with a poor fit (high bias). (LOS 3.b)
4. C. Using a smaller sample is the least appropriate way to address overfitting; reducing sample size generally worsens estimation variability. Appropriate remedies include complexity reduction (penalisation) and cross validation. (LOS 3.b)
5. A. In cross validation, training and validation samples are randomly generated over the learning cycles (for example in k-fold cross validation the training/validation roles rotate). (LOS 3.b)
1. C. A penalized regression imposes a penalty based on the number (or magnitude) of features used in a model; it is used to construct parsimonious models. (LOS 3.c)
2. B. Classification and regression trees (CART) can be applied to predict either a continuous target (regression tree) or a categorical target (classification tree). SVM is typically a classification tool (binary or multiclass via extensions) and logit models are for categorical targets. (LOS 3.c)
3. A. CART and KNN are supervised learning algorithms; clustering is an unsupervised method and is less likely to be used when labeled training data for rating classes is available. (LOS 3.c)
4. A. The analyst uses 12 fundamental variables and 2 technical variables for a total of 14 features. (LOS 3.c)
1. C. Deep learning algorithms are well suited for complex tasks such as image recognition and natural language processing. (LOS 3.e)
2. B. Classification trees (CART) are least likely to be described as black boxes because they provide visual, rule-based explanations. PCA and random forests are more typically described as black-box approaches because the components or ensemble predictions are harder to interpret directly. (LOS 3.c, 3.d)
3. B. Because the researcher is not providing labeled training data for the 1,000 stocks, an unsupervised algorithm such as clustering is appropriate. Regression and CART are supervised learning approaches. (LOS 3.c)