
Machine Learning

READING 3: MACHINE LEARNING

EXAM FOCUS

This topic review discusses the terminology used in advanced statistical models collectively referred to as machine learning. Be familiar with this terminology and the different types of models, their applications in investment decision-making, and their limitations. Specifically, be able to identify the appropriate algorithm that is most suitable for a given problem.

MACHINE LEARNING

The statistical models discussed in earlier readings rely on a set of assumptions about the distribution of the underlying data. Machine learning (ML) typically requires fewer such assumptions and focuses on using algorithms to find patterns and make decisions from data. Broadly, ML is the use of algorithms to generalize from a given data set in order to make predictions or discover structure.

ML methods commonly perform better than standard statistical approaches when dealing with:

  • High-dimensional data (a large number of variables or features).
  • Nonlinear relationships among variables.

Common ML terms and their meanings:

  • Target variable. The dependent variable (the y variable). Target variables can be continuous, categorical, or ordinal.
  • Features. The independent variables (the x variables) used as inputs to the model.
  • Training data set. The sample used to fit the model.
  • Hyperparameter. A model input specified by the researcher (for example, the penalty parameter in LASSO or the number of neighbors k in KNN).

MODULE 3.1: TYPES OF LEARNING AND OVERFITTING PROBLEMS

LOS 3.a: Describe supervised machine learning, unsupervised machine learning, and deep learning.

Supervised learning uses labeled training data (i.e., the target variable is defined for the training examples). The algorithm learns a mapping from inputs to outputs so it can predict the target for new observations. Example: to identify earnings manipulators one can provide a large set of attributes for known manipulators and non-manipulators; the algorithm learns patterns that distinguish the two groups. Multiple regression is a simple example of supervised learning. Typical supervised tasks include classification (target categorical or ordinal) and regression (target continuous).

Unsupervised learning is applied when the ML program is not given labeled training data; only inputs (features) are provided and the algorithm seeks structure or relationships within the inputs. Example tasks: clustering and dimension reduction.

Deep learning refers to models based on neural networks with multiple hidden layers. Deep learning models are commonly used for complex tasks such as image recognition and natural language processing. Algorithms that learn from feedback on their predictions and improve via trial and error are termed reinforcement learning. Both deep learning and reinforcement learning are typically implemented using neural network architectures and are especially useful when significant nonlinearities are present.

Figure 3.1 summarizes the types of ML algorithms; Figure 3.2 shows the steps for choosing an appropriate ML algorithm.

Figure 3.1: ML Algorithm Types

Figure 3.2: Choice of Appropriate ML Algorithm

  1. Decide if the data set is complex (contains too many features). If so, apply a dimension reduction algorithm before proceeding to step 2.
  2. Decide if the problem is classification.
    • If no (a numerical prediction problem): use penalized regression when the data relationship is approximately linear; for nonlinear or complex data, use CART, random forests, or neural networks.
    • If yes (a classification problem), continue to step 3.
  3. Is the classification supervised?
    • If yes and the data relationships are approximately linear, consider KNN or SVM.
    • If yes and the data are complex and nonlinear, consider CART, random forests, or neural networks.
  4. For unsupervised classification:
    • For linear structure and a known number of categories, use k-means.
    • If the number of categories is unknown, use hierarchical clustering.
    • For complex nonlinear structure, neural networks (or related deep learning methods) may be applied.
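
The four-step decision flow above can be sketched as a simple lookup. The function and flag names below are illustrative inventions, not curriculum terminology, and the sketch assumes dimension reduction (step 1) has already been applied where needed.

```python
def suggest_algorithm(numerical, supervised, linear, n_categories_known=True):
    """Illustrative sketch of the algorithm-selection flow.

    Assumes dimension reduction (step 1) was already applied
    if the feature set was too large.
    """
    if numerical:  # step 2: numerical prediction rather than classification
        return "penalized regression" if linear else "CART / random forest / neural network"
    if supervised:  # step 3: supervised classification
        return "KNN / SVM" if linear else "CART / random forest / neural network"
    # step 4: unsupervised classification
    if not linear:
        return "neural network"
    return "k-means" if n_categories_known else "hierarchical clustering"


print(suggest_algorithm(numerical=True, supervised=True, linear=True))
# -> "penalized regression" for a linear numerical prediction problem
```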

Later sections describe these ML algorithms in detail and give guidance on their investment applications.

LOS 3.b: Describe overfitting and identify methods of addressing it.

Overfitting occurs when a supervised ML model captures noise in the training data as if it were a true pattern. This typically results when too many features are included relative to the amount of training data or when model complexity is excessive. Overfitted models show high in-sample fit (for example, large in-sample R-squared) but poor out-of-sample performance (low out-of-sample R-squared) because they do not generalize to new data.

To assess generalization, data scientists typically partition data into three nonoverlapping data sets:

  • Training sample - used to develop the model.
  • Validation sample - used to tune hyperparameters and to select among alternative models.
  • Test sample - used for final evaluation of model performance on unseen data.
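
A minimal sketch of such a three-way partition, using only the standard library; the function name and the 60/20/20 fractions are illustrative choices, not curriculum requirements:

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle, then carve three nonoverlapping partitions (sketch)."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```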

Prediction errors are decomposed conceptually into:

  • Bias error. Error due to poor model fit (underfitting) - typically observed as high in-sample error resulting from models that are too simple.
  • Variance error. Error due to sensitivity to the particular training sample (overfitting) - manifests as poor out-of-sample performance.
  • Base error. Irreducible error due to inherent randomness (noise) in the data.

A learning curve plots accuracy (1 - error rate) for validation or test samples against the size of the training sample. A well-generalizing model will typically show improving validation/test accuracy as the training size increases, with in-sample and out-of-sample errors converging toward a satisfactory level. Patterns commonly seen:

  • High bias models: in-sample and out-of-sample errors converge but to an unacceptable (high) error level.
  • High variance models: the in-sample error is low while the out-of-sample error remains much higher; increasing training data and regularization are remedies.

Remedies for overfitting include:

  • Complexity reduction - impose penalties that exclude features that do not meaningfully contribute to out-of-sample performance. The penalty typically increases with the number of features used by the model.
  • Regularization - methods (such as LASSO or ridge) that shrink coefficient estimates toward zero to reduce variance.
  • Cross validation - estimate out-of-sample error directly from the available data. A common method is k-fold cross validation, where the sample is split into k equal parts; (k - 1) parts are used for training and the remaining part for validation; the process repeats k times with each part used once for validation, and the average validation error is computed.
  • Ensure that the training and validation samples are large and representative of the population of interest.
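
The k-fold rotation described above can be sketched as an index generator (stdlib only; names are illustrative):

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists; each observation
    appears in the validation fold exactly once."""
    fold_sizes = [n // k + (1 if f < n % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(n=10, k=5))
print([val for _, val in folds])
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Averaging the model's validation error across the k folds gives the cross-validated estimate of out-of-sample error.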

PROFESSOR'S NOTE

When a model generalizes well, it retains explanatory power when applied to new (out-of-sample) data. Cross validation and appropriate penalization are routine tools to measure and improve generalization.

MODULE QUIZ 3.1

1. Which statement about target variables is most accurate?

A. They can be continuous, ordinal, or categorical.

B. They are not specified for supervised learning.

C. They refer to independent variables.

2. Which statement most accurately describes supervised learning?

A. It uses labeled training data.

B. It requires periodic human intervention.

C. It is best suited for classification.

3. A model that has poor in-sample explanatory power is most likely to have a high:

A. bias error.

B. variance error.

C. base error.

4. The problem of overfitting a model would least appropriately be addressed by:

A. imposing a penalty on included features that do not add to explanatory power of the model.

B. using cross validation.

C. using a smaller sample.

5. Cross validation occurs when:

A. training and validation samples change over the learning cycle.

B. prediction is tested in another heterogeneous sample.

C. the performance parameter is set by another algorithm.

MODULE 3.2: SUPERVISED LEARNING ALGORITHMS

LOS 3.c: Describe supervised machine learning algorithms - including penalized regression, support vector machine, k-nearest neighbor, classification and regression tree, ensemble learning, and random forest - and determine the problems for which they are best suited.

Common supervised ML algorithms and their uses:

  • Penalized regressions.

    Penalized regression models reduce overfitting by imposing a penalty for including many features. The objective is to minimize the sum of squared errors (SSE) plus a penalty term that increases with the magnitude or number of slope coefficients. Penalization makes the model more parsimonious (simpler) by excluding or shrinking coefficients of features that do not meaningfully improve out-of-sample prediction.

    LASSO (Least Absolute Shrinkage and Selection Operator) is a widely used penalized regression. LASSO minimizes SSE plus a penalty equal to λ times the sum of the absolute values of the slope coefficients. The hyperparameter λ (lambda) controls the tradeoff between fit and penalty; increasing λ sets more coefficients exactly to zero, so LASSO performs feature selection automatically.

    Regularization more generally refers to forcing coefficient estimates toward zero to reduce variance; regularization can be applied in linear and nonlinear settings (for example, to estimate a stable covariance matrix for mean-variance optimization).
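
The effect of the λ penalty can be illustrated with a bare-bones coordinate-descent LASSO. This is a sketch on invented toy data, not a production implementation; a real application would use a tested library routine.

```python
def soft_threshold(z, t):
    """The shrinkage operator at the heart of LASSO."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=500):
    """Coordinate-descent LASSO with no intercept, minimizing
    SSE + lam * (|b_1| + ... + |b_p|)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # residuals with feature j's contribution removed
            r = [y[i] - sum(b[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            b[j] = soft_threshold(rho, lam / 2) / z
    return b

# toy data: y depends on both features; a huge penalty shrinks both to zero
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [2 * x1 + 0.3 * x2 for x1, x2 in X]
print(lasso_cd(X, y, lam=0.0))   # ~[2.0, 0.3]: plain OLS when the penalty is zero
print(lasso_cd(X, y, lam=1e6))   # [0.0, 0.0]
```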

  • Support Vector Machine (SVM).

    SVM is principally a linear classification algorithm that separates data into two classes by defining an n-dimensional hyperplane (given n features). SVM chooses the boundary that maximizes the margin - the distance to the nearest observations on either side; the observations that determine the margin are called support vectors. When strict separation is not possible, soft margin classification allows some misclassification in the training data and optimizes the tradeoff between margin width and misclassification error.

    SVM applications in investment management: classifying debt issuers as likely-to-default vs not-likely-to-default, stocks to short vs not, or classifying textual sentiment (news, releases) as positive/negative.
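
A linear soft-margin SVM can be approximated with subgradient descent on the hinge loss plus an L2 penalty on the weights. The toy data, hyperparameters, and function name below are all illustrative assumptions:

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Hinge-loss subgradient descent for a 2-feature linear classifier."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:   # inside the margin: hinge active
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:                                     # correct side: only shrink w
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

# linearly separable toy classes on either side of the origin
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -2), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in points]
print(preds == labels)  # True
```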

  • K-nearest neighbor (KNN).

    KNN is a nonparametric method most commonly used for classification but sometimes for regression. The hyperparameter k specifies the number of closest observations (neighbors) in the training sample that are considered when classifying a new observation. The new observation is assigned the majority (or average, in regression) of the k neighbors.

    Key considerations: choosing k (too small increases variance and potential error; too large dilutes local structure), tie handling (even k can lead to ties), and defining the distance metric (what it means to be "near" - often Euclidean distance but may be others). Including irrelevant or strongly correlated features can distort distances and lead to poor performance.

    Investment applications: predicting bankruptcy, assigning bonds to rating classes, predicting stock movement, constructing customized indices.
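
A minimal KNN classifier needs only a distance metric and a majority vote. The toy labels below are invented for illustration; an odd k sidesteps the tie problem noted above:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k nearest training observations
    (Euclidean distance)."""
    nearest = sorted(train, key=lambda obs: math.dist(obs[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy training set: two well-separated groups of bonds
train = [((0, 0), "investment grade"), ((0, 1), "investment grade"),
         ((1, 0), "investment grade"),
         ((5, 5), "high yield"), ((5, 6), "high yield"), ((6, 5), "high yield")]
print(knn_classify(train, (1, 1)))  # investment grade
print(knn_classify(train, (5, 4)))  # high yield
```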

  • Classification and Regression Trees (CART).

    CART produces a decision tree by recursively partitioning the feature space. For classification trees the target is categorical (commonly binary); for regression trees the target is continuous.

    At each internal node, CART selects a feature and a cutoff value c to split observations into two child nodes; one child contains observations with feature > c, the other the remainder. Each split is chosen to reduce estimation error (improve purity) relative to the parent node. Splitting continues until a stopping rule is met (for example, minimal node size or minimal reduction in error), producing terminal nodes (leaves).

    A feature may reappear in lower nodes with a different cutoff if that improves classification. To avoid overfitting, regularization constraints such as maximum tree depth, minimum samples per leaf, or pruning (removing branches with little explanatory power) are used.

    CART is popular for its interpretability: it provides a visual, rule-based explanation for predictions, in contrast to many "black-box" algorithms.

    Investment applications: detecting fraudulent financial statements, stock and bond selection, binary event prediction (e.g., IPO success).
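
The split-selection step described above can be illustrated with an exhaustive search for the feature/cutoff pair that minimizes weighted Gini impurity (a sketch on invented data; real CART implementations add stopping rules and pruning):

```python
def gini(labels):
    """Gini impurity: the probability that two random draws disagree."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature index, cutoff c, weighted impurity) of the best '> c' split."""
    best = (None, None, float("inf"))
    n = len(rows)
    for j in range(len(rows[0])):
        for c in sorted({row[j] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[j] > c]
            right = [lab for row, lab in zip(rows, labels) if row[j] <= c]
            if not left or not right:
                continue  # skip degenerate splits
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, c, score)
    return best

# feature 0 is noise; feature 1 > 2 perfectly separates the classes
rows = [(5, 10), (1, 11), (5, 1), (1, 2)]
labels = ["fraud", "fraud", "clean", "clean"]
print(best_split(rows, labels))  # (1, 2, 0.0)
```

Applying this search recursively to each child node (until a stopping rule triggers) is what grows the full tree.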

  • Ensemble learning and Random Forest.

    Ensemble learning combines predictions from multiple models to reduce average error: different models' errors tend to cancel, improving overall accuracy. Two broad ensemble approaches:

    • Aggregation of heterogeneous learners: combine different algorithms (for example, SVM, CART, logistic regression) using a voting scheme (majority vote or weighted vote).
    • Aggregation of homogeneous learners: use the same algorithm on different training samples and combine their predictions; commonly implemented via bagging (bootstrap aggregating).

    Random forest is an ensemble variant based on CART: a large number of classification or regression trees are trained on bootstrapped samples from the initial training set. At each split, a random subset of features is considered (feature bagging). The final prediction is determined by aggregating predictions from all trees (majority vote for classification, average for regression). Because each tree uses different data and different feature subsets, random forests reduce overfitting and improve signal-to-noise ratio.

    Drawback: random forests sacrifice the transparency of a single decision tree and are therefore often considered a black-box model.

    Investment applications: factor-based asset allocation, predicting IPO success, credit risk models.
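
The bootstrap-and-vote mechanics can be sketched with a deliberately trivial base learner (1-nearest-neighbor on one feature). The data and names are illustrative; a real random forest would grow decision trees and also bag features at each split:

```python
import random
from collections import Counter

def make_1nn(sample):
    """Base learner: 1-nearest-neighbor on a single numeric feature."""
    def predict(x):
        return min(sample, key=lambda obs: abs(obs[0] - x))[1]
    return predict

def bagged_ensemble(data, n_models=25, seed=7):
    """Train each base learner on a bootstrap (sampled-with-replacement) set."""
    rng = random.Random(seed)
    return [make_1nn([rng.choice(data) for _ in data]) for _ in range(n_models)]

def bagged_predict(models, x):
    votes = Counter(m(x) for m in models)   # majority vote across the ensemble
    return votes.most_common(1)[0][0]

data = [(0, "A"), (1, "A"), (2, "A"), (8, "B"), (9, "B"), (10, "B")]
models = bagged_ensemble(data)
print(bagged_predict(models, 1), bagged_predict(models, 9))  # A B
```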

MODULE QUIZ 3.2

1. A general linear regression model that focuses on reduction of the total number of features used is best described as a:

A. clustering model.

B. deep learning model.

C. penalized regression model.

2. A machine learning technique that can be applied to predict either a categorical target variable or a continuous target variable is most likely to describe a:

A. support vector machine.

B. classification and regression tree (CART).

C. logit model.

3. An algorithm to assign a bond to a credit rating category is least likely to use:

A. clustering.

B. classification and regression tree (CART).

C. K-nearest neighbor (KNN).

4. A fixed-income analyst is designing a model to categorize bonds into one of five ratings classifications. The analyst uses 12 fundamental variables and 2 technical variables to help her in the task. The number of features used by the analyst is closest to:

A. 14 features.

B. 70 features.

C. 120 features.

MODULE 3.3: UNSUPERVISED LEARNING ALGORITHMS AND OTHER MODELS

LOS 3.d: Describe unsupervised machine learning algorithms - including principal components analysis, k-means clustering, and hierarchical clustering - and determine the problems for which they are best suited.

Examples of unsupervised learning and investment applications:

  • Principal Component Analysis (PCA).

    PCA is a dimension reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated variables called principal components (eigenvectors). Each component is a linear combination of the original features; each component has an associated eigenvalue representing the proportion of total variance explained by that component. The first principal component explains the largest share of variance, the second the next largest, and so on.

    A scree plot shows the proportion of variance explained by each principal component. In practice, analysts typically retain the smallest number of components that collectively explain between roughly 85% and 95% of the total variance. Because principal components are linear combinations of original variables, their interpretation can be difficult; PCA is therefore often described as a black-box approach.

    Investment applications: reducing noise in high-dimensional factor sets, constructing lower-dimension factor representations for portfolio construction.
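
For just two features, the eigenvalues of the covariance matrix (and hence the variance explained by each component) follow from the quadratic formula; a stdlib-only sketch (real PCA uses a numerical eigensolver on the full covariance matrix):

```python
import math

def variance_explained_2d(xs, ys):
    """Proportion of total variance explained by each principal
    component of two features (illustrative sketch)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2
    det = sxx * syy - sxy ** 2
    root = math.sqrt(max(half_trace ** 2 - det, 0.0))  # guard tiny fp negatives
    l1, l2 = half_trace + root, half_trace - root      # l1 >= l2: ordered eigenvalues
    total = l1 + l2
    return l1 / total, l2 / total

# perfectly correlated features: the first component explains all the variance
p1, p2 = variance_explained_2d([1, 2, 3, 4], [2, 4, 6, 8])
print(round(p1, 6))  # 1.0
```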

  • Clustering.

    Clustering groups observations into categories (clusters) based on similarity of attributes (cohesion). In investment contexts, clustering can be used to group securities by behavior rather than conventional sector labels (e.g., group stocks by return patterns). Human judgment often influences the choice of similarity metric; a common metric is Euclidean distance (the straight-line distance between observations).

    Common clustering methods:

    • K-means clustering. Partitions observations into k nonoverlapping clusters (k is a hyperparameter specified by the researcher). Each cluster has a centroid, and observations are assigned to the cluster whose centroid is nearest. The algorithm starts with random centroids, assigns observations to the nearest centroid, recalculates centroids, reassigns observations where necessary, and iterates until assignments stabilize. Limitations: k must be chosen in advance, and the method is sensitive to the initial centroid choice and to the scaling of features.
    • Hierarchical clustering. Builds a hierarchy of clusters without specifying the number of clusters in advance. Two approaches:
      • Agglomerative (bottom-up): start with each observation as its own cluster and gradually merge similar clusters.
      • Divisive (top-down): start with one cluster containing all observations and recursively split it into smaller clusters.

    Investment uses: portfolio diversification by selecting assets from different clusters, risk analysis by identifying concentration in specific clusters, and uncovering latent structures across securities.
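
The assign-then-recalculate loop of k-means can be sketched in a few lines. The blob data is invented, and the starting centroids are passed in explicitly to highlight the method's sensitivity to initial values:

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for each point (Euclidean distance)."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def k_means(points, centroids, n_iter=10):
    """Sketch of the k-means loop: assign points, move each centroid
    to its cluster mean, repeat."""
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for j, c in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                c = tuple(sum(dim) / len(members) for dim in zip(*members))
            new_centroids.append(c)
        centroids = new_centroids
    return centroids, assign(points, centroids)

# two obvious blobs; k = 2 with deliberately chosen starting centroids
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
centroids, labels = k_means(points, [(0, 0), (5, 5)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```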

EXAMPLE: Application of machine learning to ESG investing

ESG (environmental, social, and governance) factor-based investing is increasingly popular. The governance factor is relatively straightforward to measure via corporate governance metrics, while social and environmental signals are often more subjective. ML and natural language processing can extract information from textual disclosures, audio, and video to construct quantitative measures. For example, mentions of phrases such as "human capital," "living wage," and "D&I" (diversity & inclusion) can be used to quantify social-related disclosures; words such as "sustainability," "recycle," or "green" can indicate environmental intent.

Once textual features are extracted and scored, supervised learning algorithms (logistic regression, SVM, CART, random forests, neural networks) can be trained to generate ESG scores or to classify companies according to ESG risk/quality.

LOS 3.e: Describe neural networks, deep learning nets, and reinforcement learning.

Neural Networks

Artificial Neural Networks (ANNs), often simply called neural networks, are composed of layers of nodes (neurons) connected by weighted links. Typical architecture:

  • Input layer: nodes representing the feature values (these are often scaled or normalized so that different features are comparable).
  • Hidden layer(s): one or more layers of neurons that transform inputs via weighted summation followed by an activation function (usually nonlinear).
  • Output layer: node(s) that produce the model prediction (a single node for a scalar regression output, or multiple nodes/output probabilities for classification).

Processing proceeds by forward propagation: inputs are multiplied by weights, summed at each neuron, passed through an activation function, and then passed to the next layer. Learning occurs via backpropagation, where prediction errors are propagated backward through the network to adjust weights using gradient-based optimization methods.

Network structure (for example, a 3-4-1 network with three inputs, four neurons in a single hidden layer, and one output) is determined by the researcher, and these choices are treated as hyperparameters that may be tuned based on validation or test performance. Neural networks can capture complex, nonlinear relationships when properly configured and regularized.
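
Forward propagation through a 3-4-1 network like the one just described can be sketched directly. The weights below are arbitrary illustrative numbers, and ReLU stands in for whatever activation function a practitioner might choose:

```python
def relu(z):
    """A common nonlinear activation function."""
    return max(0.0, z)

def forward(x, W1, b1, W2, b2):
    """One forward pass: weighted sums plus activation in the hidden
    layer, then a linear combination at the single output node."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

# a 3-4-1 network: 3 inputs, 4 hidden neurons, 1 output
x = [1.0, 2.0, 3.0]
W1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]   # 4 hidden neurons x 3 inputs
b1 = [0.0, 0.0, 0.0, -10.0]                          # the last neuron stays inactive
W2 = [1.0, 1.0, 1.0, 1.0]
b2 = 0.5
print(forward(x, W1, b1, W2, b2))  # 6.5
```

Backpropagation would then adjust W1, b1, W2, and b2 in the direction that reduces the prediction error.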

Deep Learning Networks (DLNs)

Deep learning networks are neural networks with many hidden layers (at least 2 but often dozens). Deep architectures are particularly effective for hierarchical representation learning and excel in tasks like image, speech, and character recognition. In classification tasks, the final layer of a DLN commonly outputs class probabilities and the observation is assigned to the class with the highest probability.

Applications of DLNs include credit card fraud detection, self-driving cars, natural language processing, and various investment decision tasks (for example, option pricing). One study found that a DLN using the six Black-Scholes input parameters could predict option values with an R-squared of 99.8%; other studies report DLNs outperforming traditional factor models in certain investment strategies. The recent rise of DLNs is driven by methodological advances, faster computing hardware (GPUs), and abundant machine-readable data.

Reinforcement Learning (RL)

Reinforcement learning involves an agent that interacts with an environment to maximize a defined reward signal subject to constraints. The agent learns from trial-and-error feedback rather than from labeled training examples. RL has produced notable successes (for example, DeepMind's AlphaGo in the game of Go). While RL techniques show potential for investment applications (for example, portfolio construction and trading strategies learned through simulated environments), results in finance are still an area of active research, and the efficacy of RL in real-world investment decision-making is not conclusively established.

MODULE QUIZ 3.3

1. Image recognition problems are best suited for which category of machine learning (ML) algorithms?

A. Hierarchical clustering.

B. Unsupervised learning.

C. Deep learning.

2. Which of the following is least likely to be described as a black-box approach to machine learning (ML)?

A. Principal component analysis (PCA).

B. Classification trees.

C. Random forests.

3. An analyst wants to categorize an investment universe of 1,000 stocks into 10 dissimilar groups. The machine learning (ML) algorithm most suited for this task is:

A. a classification and regression tree (CART).

B. clustering.

C. regression.

KEY CONCEPTS

LOS 3.a

  • With supervised learning, both inputs and desired outputs (labels) are provided; the algorithm learns a mapping from inputs to outputs.
  • With unsupervised learning, the algorithm is given unlabeled data and must discover structure (for example, clusters or low-dimensional representations).
  • Deep learning and reinforcement learning are techniques that allow algorithms (often neural networks) to learn complex nonlinear mappings or to learn from feedback and prediction errors.

LOS 3.b

  • In supervised learning, overfitting arises when a model is overly complex relative to the amount of training data; such models fit in-sample noise and fail to generalise to out-of-sample data.
  • To reduce overfitting, practitioners use complexity reduction (penalization, feature selection) and cross validation (for example, k-fold cross validation) to estimate and minimise out-of-sample error.

LOS 3.c

  • Penalized regression: reduces overfitting by penalising coefficient size or number of features (example: LASSO).
  • Support vector machine (SVM): linear classification algorithm that separates classes using a margin-maximising hyperplane.
  • K-nearest neighbor (KNN): nonparametric classification/regression based on proximity to training examples.
  • Classification and regression tree (CART): tree-structured model useful for handling nonlinear relationships and for producing interpretable decision rules.
  • Ensemble learning: combines multiple models to reduce average error.
  • Random forest: ensemble of decision trees trained on bootstrapped samples and random feature subsets to reduce overfitting.

LOS 3.d

  • Principal component analysis (PCA): reduces correlated variables into uncorrelated principal components (eigenvectors) ordered by eigenvalues (variance explained).
  • K-means clustering: partitions observations into k clusters with centroids; k is a hyperparameter chosen by the analyst.
  • Hierarchical clustering: produces a tree of clusters without specifying the number of clusters in advance; can be agglomerative or divisive.

LOS 3.e

  • Neural networks comprise an input layer, one or more hidden layers (neurons that compute weighted sums followed by activation functions), and an output layer.
  • Deep learning networks are neural networks with multiple (often many) hidden layers and are effective for pattern, speech, and image recognition tasks.
  • Reinforcement learning uses agents that learn to maximize a reward signal through repeated trial and error rather than from labeled training data.

ANSWER KEY FOR MODULE QUIZZES

Module Quiz 3.1

1. A. Target variables (dependent variables) can be continuous, ordinal, or categorical. Target variables are not specified for unsupervised learning. (LOS 3.a)

2. A. Supervised learning uses labeled training data; it does not by definition require periodic human intervention. Classification algorithms exist for both supervised and certain unsupervised contexts. (LOS 3.a)

3. A. Bias error is the in-sample error resulting from models with a poor fit (high bias). (LOS 3.b)

4. C. Using a smaller sample is the least appropriate way to address overfitting; reducing sample size generally worsens estimation variability. Appropriate remedies include complexity reduction (penalization) and cross validation. (LOS 3.b)

5. A. In cross validation, training and validation samples are randomly generated over the learning cycles (for example in k-fold cross validation the training/validation roles rotate). (LOS 3.b)

Module Quiz 3.2

1. C. A penalized regression imposes a penalty based on the number (or magnitude) of features used in a model; it is used to construct parsimonious models. (LOS 3.c)

2. B. Classification and regression trees (CART) can be applied to predict either a continuous target (regression tree) or a categorical target (classification tree). SVM is typically a classification tool (binary or multiclass via extensions) and logit models are for categorical targets. (LOS 3.c)

3. A. CART and KNN are supervised learning algorithms; clustering is an unsupervised method and is less likely to be used when labeled training data for rating classes is available. (LOS 3.c)

4. A. The analyst uses 12 fundamental variables and 2 technical variables for a total of 14 features. (LOS 3.c)

Module Quiz 3.3

1. C. Deep learning algorithms are well suited for complex tasks such as image recognition and natural language processing. (LOS 3.e)

2. B. Classification trees (CART) are least likely to be described as black boxes because they provide visual, rule-based explanations. PCA and random forests are more typically described as black-box approaches because the components or ensemble predictions are harder to interpret directly. (LOS 3.c, 3.d)

3. B. Because the researcher is not providing labeled training data for the 1,000 stocks, an unsupervised algorithm such as clustering is appropriate. Regression and CART are supervised learning approaches. (LOS 3.c)

The document Machine Learning is a part of the CFA Level 2 Course Quantitative Methods.