
Big Data Projects

EXAM FOCUS

This topic review is a broad overview of the use of big data analysis for financial forecasting. Candidates should understand the following:

  • Terminology used and the processes involved in big data projects.
  • Requirements and limitations of the techniques discussed.
  • How to evaluate a model's performance in practical settings.

INTRODUCTION

Big data is commonly characterised by the three Vs:

  • Volume - the quantity of data. Big data refers to very large volumes of data that are challenging to store and process using traditional methods.
  • Variety - the different sources and formats of data. Big data is collected from many sources such as user-generated content, transactional records, emails, images, clickstreams, and other logs. This diversity presents opportunities and concerns (for example, privacy and data governance).
  • Velocity - the speed at which data is created, collected, and updated (for example, social media posts arriving over short time intervals).

When using data to draw inferences, a fourth characteristic is important: veracity (or validity). Not all data sources are reliable; researchers must separate quality from quantity to produce robust forecasts.

Structured data (for example, balance-sheet rows and columns) is organised and readily used by standard statistical and ML models. Unstructured data (for example, free-text from filings or social media, images or audio) requires preprocessing for machine use because it is not arranged in neat rows and columns.

MODULE 4.1: DATA ANALYSIS STEPS

LOS 4.a: Identify and explain steps in a data analysis project

To illustrate the typical steps in analysing data for financial forecasting, consider a consumer credit scoring model. The overall process is iterative and usually follows these five main stages:

  1. Conceptualisation of the modelling task

    Define the precise problem, the model output (target), how the model will be used, the stakeholders, and whether the model will be embedded into existing or new business processes. For example, the purpose of the model might be to measure a borrower's credit risk accurately so that lending decisions can be automated or supported.

  2. Data collection

    Identify and collect structured numeric data from internal and external sources. For a credit scoring model this may include past repayment history, employment record, income and other borrower attributes. Decide which sources (internal/external) and specific tables or APIs to use.

  3. Data preparation and wrangling

    Clean the dataset and prepare it for modelling. Address missing values, invalid or out-of-range values, duplicates and inconsistent units. Preprocessing can involve aggregation, filtering, imputation rules, and selection of relevant variables. In credit scoring, rules may be used to fill gaps where data is missing or suspected inaccurate.

  4. Data exploration

    Perform exploratory data analysis (EDA), feature selection and feature engineering. For credit scoring several variables may be combined into an ability-to-pay score or other derived features.

  5. Model training

    Select an appropriate machine learning algorithm, train it on the training data, validate and tune hyperparameters. The model choice depends on the relationship between features and the target, data size, and business constraints.

These steps are iterative. Depending on model output quality, data exploration, feature engineering or even data collection choices might be revisited to improve model performance.

Adapting the steps for unstructured (text-based) data

If the model incorporates unstructured data (for example, a borrower's social media posts), the first four steps are modified as follows:

  1. Text problem formulation

    Define the specific inputs (text sources) and the output needed, and determine how the output will be used.

  2. Data collection (curation)

    Decide the sources (web scraping, specific social-media APIs). For supervised learning, create or obtain labelled examples (annotated target variables) indicating which texts map to positive or negative outcomes.

  3. Text preparation and wrangling

    Preprocess unstructured text to convert it into a form usable by structured modelling techniques (for example, tokenisation, stop-word removal, stemming or lemmatisation, and document-term matrix creation).

  4. Text exploration

    Use visualisation and textual feature engineering to select tokens, phrases or other representations useful for the modelling task.

Outputs from unstructured-text models may be used in isolation or combined with structured variables as inputs to another model.

LOS 4.b: Objectives, steps and examples of preparing and wrangling data

Data preparation and wrangling are critical steps that often consume the majority of a project's time and resources. Once a problem is defined, identify appropriate data with the help of domain experts and obtain it from internal databases or external vendors via APIs. When using external data, carefully check metadata or README files that describe how the data is stored and collected. External data can speed up projects but may be accessible to competitors, reducing proprietary advantage.

Data cleansing reduces errors in raw data. For structured data, common errors include:

  • Missing values
  • Invalid values (data outside a meaningful range)
  • Inaccurate values
  • Non-uniform values due to incorrect format or units
  • Duplicate observations

Cleansing is done with automated rules and human review. Metadata (summary statistics) often helps identify anomalies. Observations that cannot be cleansed may be dropped.

Data wrangling / preprocessing includes data transformation and scaling.

Common types of data transformation:

  • Extraction - derive a value from raw fields (for example, compute years employed from start date and end date).
  • Aggregation - consolidate related variables into one using weights where required.
  • Filtration - remove irrelevant observations.
  • Selection - remove unused features (columns).
  • Conversion - convert data types (nominal to dummy variables, ordinal to numeric ranks).

Handling outliers: Identify outliers using statistical techniques (for example, values more than three standard deviations from the mean). Options include replacing outliers with algorithm-determined values, dropping observations, trimming (exclude highest and lowest x% of observations) or winsorization (replace extremes by the maximum allowable value).
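The winsorization rule above can be sketched in a few lines of Python. This is a minimal illustration, not curriculum code; the function name and the choice of a three-standard-deviation boundary follow the example in the text.

```python
# Winsorization sketch: replace any value beyond +/- 3 standard deviations
# of the mean with the boundary value itself, rather than dropping it.
from statistics import mean, stdev

def winsorize_3sd(values):
    m, s = mean(values), stdev(values)
    lo, hi = m - 3 * s, m + 3 * s
    # Clamp each observation into [lo, hi]
    return [min(max(v, lo), hi) for v in values]
```

Trimming, by contrast, would drop the extreme observations entirely instead of clamping them.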

Scaling converts features to a common range. Some ML algorithms (for example, neural networks, support vector machines) require features to be on comparable scales. Two common scaling methods are:

  • Normalization - scale variable values to a fixed range, typically between 0 and 1.
  • Standardization - centre variables to have mean 0 and scale them to have unit standard deviation.

Notes:

  • Normalization maps values to [0, 1].
  • Standardization maps values to z-scores so that a value of +1.22 means 1.22 standard deviations above the mean.
  • Standardization is less sensitive to outliers than normalization but assumes an approximately normal distribution.
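The two scaling methods can be expressed directly from their definitions. A minimal Python sketch (function names are illustrative):

```python
from statistics import mean, stdev

def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: centre at mean 0 with unit standard deviation (z-scores)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```

For example, `normalize([2, 4, 6])` yields `[0.0, 0.5, 1.0]`, while `standardize([2, 4, 6])` yields z-scores of `[-1.0, 0.0, 1.0]`.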

PROFESSOR'S NOTE

Some learning outcomes (LOS) appear out of order in this reading for presentation clarity.

LOS 4.g: Preparing, wrangling and exploring text-based data

Text preparation or cleansing

Text cleansing commonly includes the following steps:

  • Remove HTML tags. Text obtained from web pages often contains HTML markup which should be removed before text analysis. Use regular expressions (regex) or HTML parsers as appropriate.
  • Remove punctuation. Many analyses drop punctuation marks. Where punctuation has semantic importance (for example, "%" or "$"), replace it with textual annotations so the model can retain the signal.
  • Remove or mask numbers. Replace digits with annotations when the numeric value is unimportant, or extract numeric features separately where the numeric value is relevant.
  • Remove extra white space. Tabs, multiple spaces, and indents that are formatting artefacts should be removed.
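The cleansing steps above can be chained with regular expressions. This is a simplified sketch; the annotation tokens (such as `percentSign` and `/number/`) are illustrative placeholders, not standard names.

```python
import re

def clean_text(raw):
    text = re.sub(r"<[^>]+>", " ", raw)           # remove HTML tags
    text = text.replace("%", " percentSign ")     # keep meaningful punctuation as an annotation
    text = re.sub(r"[^\w\s]", " ", text)          # drop remaining punctuation
    text = re.sub(r"\d+", " /number/ ", text)     # mask numbers with an annotation
    text = re.sub(r"\s+", " ", text).strip()      # collapse extra white space
    return text
```

For example, `clean_text("<p>Sales rose 5%.</p>")` produces `"Sales rose /number/ percentSign"`. For complex pages, a proper HTML parser is preferable to a regex, as noted above.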

Text wrangling (normalisation and tokenisation)

After cleansing, normalise text for consistent processing:

  • Lowercasing to avoid treating "Market" and "market" as distinct tokens.
  • Removal of stop words such as "the", "is", which often do not carry semantic meaning in many ML applications.
  • Stemming - a rules-based reduction that converts word variants into a common stem (for example, integrate, integration, integrating → integrat). Stemming is efficient but can be imprecise for human interpretation.
  • Lemmatization - converts inflected forms to the lemma (dictionary root). Lemmatization is more accurate than stemming but requires more resources.

Tokenisation splits text into tokens (commonly words). Example: "It is a beautiful day." tokenised into: (1) it, (2) is, (3) a, (4) beautiful, (5) day.

After normalization and tokenisation, apply a bag-of-words (BOW) approach which collects tokens disregarding their sequence. A document-term matrix converts unstructured text into structured form: each document is a row, each token is a column, and cell values record token occurrence counts.

If sequence matters, use N-grams. A two-word sequence is a bigram, a three-word sequence a trigram, etc. Example sentence: "The market is up today." Bigrams: "the_market", "market_is", "is_up", "up_today". N-gram BOWs preserve sequence at the cost of larger vocabularies; in N-gram implementations stop words are often retained because they can form meaningful phrases.
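Generating N-grams from a token list is a one-liner; a sketch reproducing the bigram example above:

```python
def ngrams(tokens, n):
    """Join each run of n consecutive tokens with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For instance, `ngrams(["the", "market", "is", "up", "today"], 2)` returns `["the_market", "market_is", "is_up", "up_today"]`.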

MODULE QUIZ 4.1

1. Which of the following is least likely to be a step in data analysis?

A. Structured formative analysis.

B. Data collection.

C. Data preparation.

2. Which of the following shortcomings of a feature is least likely to be addressed by data cleansing?

A. Missing values.

B. Common values.

C. Non-uniform values.

3. The process of adjusting variable values so that they fall between 0 and 1 is most commonly referred to as:

A. scaling.

B. standardization.

C. normalization.

MODULE 4.2: DATA EXPLORATION

LOS 4.d: Objectives, methods and examples of data exploration

Data exploration evaluates the dataset to determine how to configure it for model training. Steps include:

  • Exploratory data analysis (EDA) - use summary statistics, heat maps, word clouds, scatterplots and other visualisations to understand distributions, relationships and to form hypotheses for modelling.
  • Feature selection - choose attributes that add predictive value. Too many features increase model complexity and training time; a parsimonious model often generalises better.
  • Feature engineering - create new features by transforming, decomposing or combining existing ones (for example, convert revenue and number of customers into a per-customer revenue metric). Feature extraction (for example, computing age from date of birth) also falls under this step.

Model performance depends heavily on feature selection and engineering; analysts often iterate here until model results are satisfactory.

Data exploration for structured data

Structured data are rows (observations) and columns (features). EDA can be one-dimensional (single feature) or multi-dimensional. For high-dimensional data, apply dimension-reduction techniques such as principal component analysis (PCA).

Single-feature summary statistics: mean, standard deviation, skewness, kurtosis. Visualisations: box plots, histograms, density plots and bar charts. Histograms show observation frequencies across bins; density plots are smoothed histograms; box plots display median, quartiles and outliers.

Multiple-feature analysis: correlation matrices, scatterplots and paired visualisations. Statistical tests: parametric (ANOVA, t-tests, correlation) and nonparametric (Spearman rank, chi-square) as appropriate.

Feature selection is iterative and business-informed. Assign importance scores using statistical or model-based methods and then rank and select features. Dimension-reduction algorithms can decrease feature count to speed model training.

One-hot encoding (OHE) converts categorical features into binary dummy variables so models can process them.
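A minimal sketch of one-hot encoding (illustrative only; libraries such as pandas or scikit-learn provide production versions):

```python
def one_hot(values):
    """Map each categorical value to a binary indicator vector,
    one column per category (columns in sorted order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```

For example, `one_hot(["bond", "stock", "bond"])` yields `[[1, 0], [0, 1], [1, 0]]`, with the first column indicating "bond" and the second "stock".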

LOS 4.e: Methods for extracting, selecting and engineering features from textual data

Data exploration for unstructured data

Tokenise text and calculate summary statistics such as term frequency (number of times a token appears) and co-occurrence (tokens appearing together). A word cloud displays tokens with font size proportional to frequency, helping identify contextually important words.

Figure 4.1: Word Cloud, Apple (NASDAQ: AAPL) SEC Filing

Feature selection (text)

Select a subset of tokens from the BOW to reduce noise and improve parsimony. Often remove very high-frequency tokens (likely stop words) and very low-frequency tokens (rare or noisy). Common feature-selection methods for text include:

  • Document frequency (DF) - the proportion of documents that contain a token; used to filter tokens by prevalence.
  • Chi-square test - ranks tokens by how strongly they associate with a class in text classification; high chi-square tokens discriminate classes.
  • Mutual information (MI) - indicates how much a token contributes to class separation. Tokens present across all classes have MI close to 0; tokens appearing in few classes have MI closer to 1.
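Of these, document frequency is the simplest to compute. A sketch, assuming each document has already been reduced to a token list:

```python
def document_frequency(docs_tokens):
    """For each token, the proportion of documents that contain it."""
    n = len(docs_tokens)
    vocab = {t for toks in docs_tokens for t in toks}
    return {t: sum(t in toks for toks in docs_tokens) / n for t in vocab}
```

Tokens with DF near 1 appear in almost every document (likely stop words); tokens with DF near 0 are rare and may be noise. Both ends are candidates for removal.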

Feature engineering (text)

Common techniques:

  • Numbers - replace tokens with number-aware annotations (for example, /numberX/ or /number4/ for four-digit values such as years).
  • N-grams - preserve multiword sequences when they carry meaning (for example, "expansionary_monetary_policy").
  • Named entity recognition (NER) - tag tokens by entity class (for example, ORG for organisations, PLACE for locations) to create more discriminative features.
  • Parts of speech (POS) tagging - assign grammatical roles (for example, NNP for proper nouns, CD for cardinal numbers) to provide contextual signals.

MODULE QUIZ 4.2

1. The process used to convert a categorical feature into a binary (dummy) variable is best described as:

A. one-hot encoding (OHE).

B. parts of speech (POS).

C. named entity recognition (NER).

2. To make a bag-of-words (BOW) concise, the most appropriate procedure would be to:

A. eliminate high- and low-frequency words.

B. use a word cloud.

C. use N-grams.

3. Mutual information (MI) of tokens that appear in one or few classes is most likely to be:

A. close to 0.

B. close to 1.

C. close to 100.

MODULE 4.3: MODEL TRAINING AND EVALUATION

LOS 4.f: Objectives, steps and techniques in model training

Before training, define model objectives, identify useful data points and conceptualise the model. ML engineers should work with domain experts to understand relationships (for example, how inflation relates to exchange rates).

After unstructured data is processed into structured form (for example, document-term matrices), model training for such data follows the structured-data workflow: ML locates patterns and builds decision rules that generalise to new observations. Model fitting describes how well the model generalises out of sample.

Common causes of model fitting errors:

  • Training sample size - small samples can lead to underfitting because patterns are not learned sufficiently.
  • Number of features - too few features may underfit; too many can overfit due to high model complexity and low degrees of freedom.

There are three main tasks in model training:

  • Method selection - choose an appropriate algorithm considering supervised/unsupervised learning, data type (numeric, text, image), and dataset size. Examples:
    • Supervised methods: regression, ensemble trees, support vector machines (SVMs), neural networks (NNs).
    • Unsupervised methods: clustering, dimension reduction (PCA), anomaly detection.
    • Text: generalized linear models (GLMs), SVMs. Images: neural networks and deep learning.
  • Performance evaluation - assess goodness of fit and predictive ability using evaluation metrics and validation strategies.
  • Tuning - modify hyperparameters and model pipeline until performance meets objectives.

For supervised learning, split labelled data typically as follows: approximately 60% training, 20% validation (tuning), and 20% test (final out-of-sample evaluation). Unsupervised learning lacks labelled targets and does not require this split in the same way.
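The 60/20/20 split can be sketched as follows (a minimal illustration; shuffling before splitting avoids ordering bias, and the fixed seed here is only for reproducibility):

```python
import random

def split_60_20_20(data, seed=42):
    """Shuffle, then cut into ~60% training, 20% validation, 20% test."""
    d = data[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(d)
    n = len(d)
    i, j = round(0.6 * n), round(0.8 * n)
    return d[:i], d[i:j], d[j:]
```

With 100 labelled observations this yields partitions of 60, 20, and 20, together covering every observation exactly once.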

Class imbalance occurs when one class is over-represented (for example, many high-grade bonds vs few defaults). Strategies include undersampling the majority class and oversampling the minority class to provide balanced training data.
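Oversampling the minority class can be as simple as resampling it with replacement until the classes are balanced. A sketch under that assumption (the label values are hypothetical):

```python
import random

def oversample_minority(majority, minority, seed=0):
    """Resample the minority class with replacement until it matches
    the majority class in size, then combine the two."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra
```

With 10 majority observations and 3 minority observations, the result contains 20 observations, 10 per class. Undersampling would instead randomly drop majority observations.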

PROFESSOR'S NOTE

Model fitting is covered in more depth in separate machine learning materials.

LOS 4.c: Evaluate the fit of a machine learning algorithm

Model validation requires measuring training and out-of-sample performance. The techniques below focus on binary classification but are broadly applicable.

1. Error analysis and confusion matrix

Classification errors are:

  • False positives (FP) - predicted positive but actually negative (type I error).
  • False negatives (FN) - predicted negative but actually positive (type II error).

A confusion matrix summarises true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). From it we compute:

  • Precision (P) = TP / (TP + FP)
  • Recall (R) = TP / (TP + FN)
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • F1 score = (2 × P × R) / (P + R)

Precision is important when the cost of a false positive is high; recall is important when the cost of a false negative is high. The business context decides the preferred trade-off.
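The four metrics can be computed directly from the confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1
```

Applied to the dividend-cut example below (TP = 18, FP = 11, FN = 3, TN = 46), this returns precision ≈ 0.62, recall ≈ 0.86, accuracy ≈ 0.82, and F1 = 0.72.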

2. Receiver operating characteristic (ROC) and AUC

The ROC curve plots the true positive rate (TPR = recall) on the Y-axis against the false positive rate (FPR = FP / (FP + TN)) on the X-axis. The area under the curve (AUC) ranges from 0 to 1 - values closer to 1 indicate better discrimination. An AUC of 0.50 is equivalent to random guessing.

3. Regression metrics for continuous targets

For continuous target variables (regression problems) common metrics include root mean squared error (RMSE), mean absolute error (MAE) and R-squared.

EXAMPLE: Model evaluation

Dave Kwah is evaluating a model that predicts whether a company will have a dividend cut next year. The model uses a binary classification: cut versus not cut. In the test sample of 78 observations, the model correctly classified 18 companies that had a dividend cut, and 46 companies that did not have a dividend cut. The model failed to identify 3 companies that actually had a dividend cut.

1. Calculate the model's precision and recall.

2. Calculate the model's accuracy and F1 score.

3. Calculate the model's FPR.

Solution:

Identify confusion-matrix components from the description:

True positives (TP) = 18 (companies correctly predicted to have a dividend cut)

True negatives (TN) = 46 (companies correctly predicted to not have a dividend cut)

False negatives (FN) = 3 (companies that actually had a dividend cut but were missed)

Total observations = 78

Compute false positives (FP) = Total - (TP + TN + FN) = 78 - (18 + 46 + 3) = 11

Precision P = TP / (TP + FP) = 18 / (18 + 11) = 18 / 29 ≈ 0.6207

Recall R = TP / (TP + FN) = 18 / (18 + 3) = 18 / 21 ≈ 0.8571

Accuracy = (TP + TN) / Total = (18 + 46) / 78 = 64 / 78 ≈ 0.8205

F1 score = (2 × P × R) / (P + R)

Compute numerator = 2 × 0.6207 × 0.8571 ≈ 1.064

Compute denominator = 0.6207 + 0.8571 ≈ 1.4778

F1 ≈ 1.064 / 1.4778 ≈ 0.72 (rounded to two decimal places)

False positive rate FPR = FP / (FP + TN) = 11 / (11 + 46) = 11 / 57 ≈ 0.1930

Model tuning

After evaluation, revise the model to improve performance. Key concepts:

  • Bias error arises from underfitting (models too simple to capture patterns).
  • Variance error arises from overfitting (models too complex and not generalising).
  • A fitting curve plots training error and cross-validation error against model complexity to identify the optimal complexity that balances bias and variance.
  • Regularization penalises model complexity to reduce overfitting.
  • Hyperparameters (for example, number of hidden layers in a neural network) are set by engineers and tuned (for example, via grid search) to find the best combination.
  • Grid search automates hyperparameter selection across a predefined grid.
  • Ceiling analysis examines each component in the pipeline to locate the weak link whose improvement will most increase overall performance.
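Grid search can be sketched as an exhaustive loop over hyperparameter combinations. This is an illustrative skeleton only; the `train_fn` and `score_fn` callables stand in for whatever training and validation-scoring routines the project uses.

```python
from itertools import product

def grid_search(train_fn, score_fn, grid):
    """Train and score a model for every combination of hyperparameters
    in the grid; return the best (params, score) pair found."""
    best_params, best_score = None, float("-inf")
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = score_fn(train_fn(params))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, with a toy scoring function that peaks at a tree depth of 3, `grid_search(lambda p: p, lambda m: -(m["depth"] - 3) ** 2, {"depth": [1, 2, 3, 4, 5]})` selects `{"depth": 3}`. In practice the validation-set metric plays the role of `score_fn`.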

MODULE QUIZ 4.3

1. When the training data contains the ground truth, the most appropriate learning method is:

A. supervised learning.

B. unsupervised learning.

C. machine learning.

Use the following information to answer Questions 2 through 6.

While analysing health-care stocks, Ben Stokes devises a model to classify the stocks as those that will report earnings above consensus forecasts versus those that won't. Stokes prepares the following confusion matrix using the results of his model.

Confusion Matrix for Earnings Outperformance

                          Predicted: Outperform   Predicted: Not Outperform
  Actual: Outperform               12                         2
  Actual: Not Outperform            4                         7

2. The model's accuracy score is closest to:

A. 0.44.

B. 0.76.

C. 0.89.

3. The model's recall is closest to:

A. 0.67.

B. 0.72.

C. 0.86.

4. The model's precision is closest to:

A. 0.64.

B. 0.72.

C. 0.75.

5. The model's F1 score is closest to:

A. 0.80.

B. 0.89.

C. 0.94.

6. To reduce type I error, Stokes should most appropriately increase the model's:

A. precision.

B. recall.

C. accuracy.

KEY CONCEPTS

LOS 4.a

The steps in a data analysis project are:

  • Conceptualisation of the modelling task
  • Data collection
  • Data preparation and wrangling
  • Data exploration
  • Model training

LOS 4.b

Data cleansing addresses missing, invalid, inaccurate and non-uniform values, and duplicate observations. Data wrangling (preprocessing) includes data transformation (extraction, aggregation, filtration, selection, conversion) and scaling (normalization or standardization). Normalization scales values between 0 and 1. Standardization centres values at mean 0 and scales by standard deviation to give unit variance; it is less sensitive to outliers but assumes an approximately normal distribution for the feature.

LOS 4.c

Model performance evaluation tools for classification include error analysis via a confusion matrix and derived metrics:

  • precision (P) = TP / (FP + TP)
  • recall (R) = TP / (TP + FN)
  • accuracy = (TP + TN) / (all observations)
  • F1 score = (2 × P × R) / (P + R)

The ROC curve plots recall (TPR) against FPR (FP / (FP + TN)) and the AUC summarises its area. For continuous targets, use RMSE and related metrics. Model tuning balances bias and variance and chooses hyperparameters to improve out-of-sample performance.

LOS 4.d

Data exploration comprises EDA, feature selection and feature engineering. EDA inspects summary statistics and relationships. Feature selection chooses features that contribute to predictive power; feature engineering optimises and creates features for faster and more reliable model training.

LOS 4.e

Text-data summary statistics include term frequency and co-occurrence. A word cloud indicates frequently used words by font size. Feature selection for text may use document frequency, chi-square tests and mutual information. Text feature engineering includes identifying numbers, using N-grams, NER and POS tag tokenisation.

LOS 4.f

Model conceptualisation requires collaboration with domain experts. ML finds patterns in training data that generalise to out-of-sample data. Model fitting errors are caused by small training samples (risk of underfitting) or inappropriate feature counts (too few features → underfitting; too many features → overfitting). Model training includes selection, evaluation and tuning.

LOS 4.g

Text processing: remove HTML, punctuation, numbers and extraneous white space. Normalise text by lowercasing, removing stop words, stemming and lemmatisation. Tokenise text; apply N-grams when sequence matters. Create a bag-of-words (BOW) and convert to a document-term matrix for modelling.

ANSWER KEY FOR MODULE QUIZZES

Module Quiz 4.1

1. A Structured formative analysis is not a defined step in the curriculum. The five steps are conceptualisation of the modelling task; data collection; data preparation and wrangling; data exploration; and model training. (LOS 4.a)

2. B Common values are not addressed by data cleansing. Data cleansing handles missing, invalid, non-uniform and inaccurate values, and duplicates. (LOS 4.b)

3. C Normalization scales variable values between 0 and 1. (LOS 4.b)

Module Quiz 4.2

1. A OHE converts a categorical feature into a binary (dummy) variable suitable for machine processing. POS and NER assign tags to tokens and are not one-hot encoding. (LOS 4.e)

2. A To make a BOW concise, often high- and low-frequency words are eliminated (high-frequency words often include stop words; low-frequency words may be noise). Word clouds visualise frequency; N-grams preserve sequence when needed. (LOS 4.e)

3. B MI close to 1 indicates a token appears primarily in one or a few classes; tokens appearing across all classes will have MI close to 0. (LOS 4.e)

Module Quiz 4.3

1. A Supervised learning is used when the training data contains ground truth (labelled outcomes). Unsupervised learning is used when no target variable exists. Machine learning is the broad field that includes both supervised and unsupervised methods. (LOS 4.f)

The following matrix answers Questions 2 through 6:

Confusion Matrix used for Questions 2-6

                      Predicted Positive   Predicted Negative
  Actual Positive            12                    2
  Actual Negative             4                    7

Compute metrics from the matrix:

TP = 12, FP = 4, FN = 2, TN = 7, Total = 25

Accuracy = (TP + TN) / Total = (12 + 7) / 25 = 19 / 25 = 0.76. Answer: B

Recall (R) = TP / (TP + FN) = 12 / (12 + 2) = 12 / 14 ≈ 0.857 ≈ 0.86. Answer: C

Precision (P) = TP / (TP + FP) = 12 / (12 + 4) = 12 / 16 = 0.75. Answer: C

F1 score = (2 × P × R) / (P + R) = (2 × 0.75 × 0.86) / (0.75 + 0.86) = 0.80. Answer: A

To reduce type I error (false positives), Stokes should increase precision. Answer: A

TOPIC QUIZ

You have finished the Quantitative Methods topic section. Take the online Topic Quiz to assess understanding. These quizzes are timed and exam-like; a score below 70% suggests additional review is needed. Allow approximately three minutes per question.

The document Big Data Projects is a part of the CFA Level 2 Course Quantitative Methods.