READING 4

BIG DATA PROJECTS

EXAM FOCUS

This topic review is a broad overview of the use of big data analysis for financial forecasting. Candidates should understand:

  • the terminology used and the processes involved in big data projects;
  • the requirements and limitations of the techniques discussed; and
  • how to evaluate a model's performance.

INTRODUCTION

Big data is commonly characterised by the three Vs: volume, variety, and velocity.

  • Volume refers to the quantity of data. Big data implies a very large volume of observations and records.
  • Variety refers to the range of data sources. Big data is collected from many origins such as user-generated content, transactional records, emails, images, clickstreams and other sources. Collecting diverse data provides opportunities but raises concerns such as data privacy and governance.
  • Velocity refers to the speed at which data is generated and collected (for example, social media posts produced in a short interval).

When making inferences from data an additional characteristic, veracity (validity), must be considered: not every source is reliable, and analysts must separate quality from quantity to obtain robust forecasts.

Structured data (for example, balance-sheet figures) are neatly organised in rows and columns. Unstructured data (for example, text from regulatory filings or social media) lack such tabular organisation; machine learning algorithms must extract useful signals from noisy, unstructured streams.

MODULE 4.1: DATA ANALYSIS STEPS

LOS 4.a: Identify and explain steps in a data analysis project.

To illustrate the typical steps in analysing data for financial forecasting, consider a consumer credit scoring model. The process is commonly organised into five iterative steps:

  1. Conceptualization of the modelling task. Define the problem, the model output (target variable), how the output will be used, who will use it, and whether the model will be embedded in existing or new business processes. For the credit scoring example, the purpose is to measure accurately the credit risk of a borrower.
  2. Data collection. Identify and obtain structured numeric data from internal and external sources. Credit-scoring models may use past repayment history, employment history, income, and other borrower attributes. The analyst must decide which data sources (internal versus external) to use.
  3. Data preparation and wrangling. Clean the data and prepare it for model use. This includes addressing missing values, verifying out-of-range values, aggregating, filtering, or extracting relevant variables, and applying rules to fill gaps when data are missing or inaccurate.
  4. Data exploration. Perform exploratory data analysis (EDA), feature selection and feature engineering. For credit scoring, several variables may be combined to form composite scores (for example, an ability-to-pay metric).
  5. Model training. Choose an appropriate machine learning (ML) algorithm, evaluate it on a training set, and tune hyperparameters. The algorithm choice depends on the relationship between features and the target variable.

These five steps are iterative. Depending on model output quality, the analyst may return to earlier steps (for example, revisit feature engineering to improve predictive performance).

When analysing unstructured, text-based data the steps are adapted as follows:

  1. Text problem formulation. Determine the prediction or classification task, specify inputs and output, and decide how the output will be used.
  2. Data collection (curation). Determine sources (for example, web scraping, specific social media platforms). For supervised learning, create or annotate a reliable target variable (for example, label text associated with higher credit risk).
  3. Text preparation and wrangling. Preprocess unstructured streams so they can be used with structured modelling techniques.
  4. Text exploration. Visualise text data, select and engineer text features.

The output of a model trained on unstructured data can be used alone, or combined with structured features as input to a secondary model.

LOS 4.b: Describe objectives, steps, and examples of preparing and wrangling data.

Data preparation and wrangling is a critical step that typically consumes most of a project's time and resources. Once the problem is defined, domain experts help to identify the appropriate data to collect. Data are downloaded from internal databases or external vendors. When accessing a database, exercise caution to ensure data validity; README files often document how data are stored. External data can be obtained via application programming interfaces (APIs). External data are sometimes costly but can reduce local wrangling effort. Using widely available third-party data can reduce a firm's proprietary advantage because competitors may use the same sources.

Data cleansing reduces errors in raw data. For structured data, common errors include:

  • Missing values;
  • Invalid values (outside a meaningful range);
  • Inaccurate values;
  • Non-uniform values because of wrong formats or units of measurement; and
  • Duplicate observations.

Cleansing is performed by automated rules-based algorithms and by human review. Metadata (summary information) is a useful starting point for identifying errors. Observations that cannot be corrected are usually dropped.

Data wrangling (preprocessing) prepares data for model use. Common preprocessing tasks include data transformation and scaling.

Data transformation types include:

  • Extraction: derive new values (for example, compute years employed from dates);
  • Aggregation: combine related variables into one using appropriate weights;
  • Filtration: remove irrelevant observations;
  • Selection: remove features (columns) not needed for modelling; and
  • Conversion: convert between data types (nominal, ordinal, numeric).
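The extraction step in particular is easy to sketch. The snippet below derives a years-employed feature from raw employment dates; the record layout and field names are hypothetical, not from the text:

```python
from datetime import date

def years_employed(start: date, as_of: date) -> float:
    """Extraction: derive a new numeric feature (tenure in years) from raw dates."""
    return (as_of - start).days / 365.25

# Hypothetical borrower records; field names are illustrative only.
borrowers = [
    {"id": 1, "employment_start": date(2015, 6, 1)},
    {"id": 2, "employment_start": date(2021, 1, 15)},
]

as_of = date(2024, 6, 1)
for b in borrowers:
    b["years_employed"] = round(years_employed(b["employment_start"], as_of), 1)

print([b["years_employed"] for b in borrowers])
```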

Outliers can be detected using statistical rules (for example, values more than three standard deviations from the mean). Typical treatments include:

  • Replace the outlier with an algorithm-determined value;
  • Delete the observation; or
  • Trim the dataset by excluding the highest and lowest x% (trimming), or replace extreme values by a maximum allowable value (winsorisation).
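A minimal illustration of trimming and winsorisation, using plain Python on a small sample with one obvious outlier (the 5% cut-offs are illustrative):

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.05):
    """Replace extreme values with the boundary values (winsorisation)."""
    s = sorted(values)
    n = len(s)
    lo = s[int(n * lower_pct)]
    hi = s[int(n * (1 - upper_pct)) - 1]
    return [min(max(v, lo), hi) for v in values]

def trim(values, lower_pct=0.05, upper_pct=0.05):
    """Drop the highest and lowest x% of observations (trimming)."""
    s = sorted(values)
    n = len(s)
    return s[int(n * lower_pct): n - int(n * upper_pct)]

data = list(range(1, 21)) + [1000]   # one obvious outlier
print(winsorize(data)[-1])           # outlier capped at the boundary value
print(len(trim(data)))               # outlier (and lowest value) dropped
```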

Scaling converts features to a common unit of measurement. Some ML algorithms (for example, neural networks and support vector machines) perform better when features are homogeneous in range. Two common scaling methods are:

  • Normalization: rescale variable values to the interval [0, 1] as (x - minimum) / (maximum - minimum).
  • Standardization: centre variables at 0 and scale them in units of standard deviation from the mean, as (x - mean) / standard deviation, so that the transformed variable has mean 0 and standard deviation 1.

A standardized variable value of +1.22 means the original value is 1.22 standard deviations above the mean. Unlike normalization, standardization is less sensitive to outliers but assumes an approximately normal distribution for the variable.
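Both scaling methods can be sketched in a few lines; this uses the population standard deviation and a toy input vector:

```python
def normalize(xs):
    """Min-max normalization: rescale values to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Z-score standardization: mean 0, standard deviation 1."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5   # population SD
    return [(x - mean) / sd for x in xs]

xs = [2.0, 4.0, 6.0, 8.0]
print(normalize(xs))     # endpoints map to 0.0 and 1.0
print(standardize(xs))   # resulting values have mean 0 and SD 1
```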

PROFESSOR'S NOTE

Some learning outcome statements (LOS) are presented out of order for ease of exposition.

LOS 4.g: Describe preparing, wrangling, and exploring text-based data for financial forecasting.

Unstructured text is readable by humans but must be converted to structured form for machine processing. Text processing consists of cleansing and preprocessing steps.

Text Preparation or Cleansing

Common cleansing steps are:

  1. Remove HTML tags. Text collected from web pages often contains embedded HTML tags; these should be stripped out before analysis. Regular expressions (regex) are commonly used to identify and remove patterns.
  2. Remove punctuation. Most punctuation is removed as it rarely carries semantic meaning for many ML applications. Certain symbols (for example, %, $, ?) may be important and can be replaced by annotated tokens to preserve their information.
  3. Remove or annotate numbers. Digits may be removed or replaced by annotation tokens (for example, /numberX/) to indicate presence of a number when the specific value is not needed. If numeric values are important, extract them explicitly.
  4. Remove excess whitespace. Formatting-related whitespace (tabs, extra spaces, indents) is removed to standardise text.
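The four cleansing steps above can be sketched as one regular-expression pipeline. This is a simplified illustration: the /numberX/ annotation token follows the text, the sample sentence is invented, and informative symbols such as % are simply dropped here rather than annotated:

```python
import re

def clean_text(raw: str) -> str:
    """Illustrative text cleansing: HTML tags, numbers, punctuation, whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)          # strip HTML tags
    text = re.sub(r"\d+", " /numberX/ ", text)   # annotate numbers
    text = re.sub(r"[^\w/ ]", " ", text)         # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()     # collapse excess whitespace
    return text

raw = "<p>Revenue rose 12% in Q3,   beating forecasts.</p>"
print(clean_text(raw))
```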

Text Wrangling (Preprocessing)

After cleansing, text is normalised using:

  1. Lowercasing. Convert text to lower case so tokens like "market" and "Market" are identical.
  2. Stop-word removal. Remove common words such as "the", "is" when they do not contribute semantic value in the chosen modelling approach.
  3. Stemming. Apply rules-based reduction to convert word variants to a common stem (for example, "integrate", "integration", "integrating" → "integrat"). Stemming reduces vocabulary size but can make tokens less interpretable to humans.
  4. Lemmatization. Convert inflected forms to their lemma (morphological root). Lemmatization is often more accurate than stemming but is more computationally intensive.

Tokenization splits text into tokens (typically words). Example: the sentence "It is a beautiful day." yields five tokens: (1) it, (2) is, (3) a, (4) beautiful, (5) day.

After normalisation, a bag-of-words (BOW) representation collects tokens without regard to order. A document-term matrix converts unstructured data into a structured matrix with rows representing documents and columns representing tokens; each cell records the count of a token in a document.

If sequence matters, use N-grams where tokens are sequences of N words. A two-word sequence is called a bigram; a three-word sequence a trigram. Example: for "The market is up today.", bigrams include "the_market", "market_is", "is_up", "up_today". When using N-grams, stop-word removal must be considered carefully as it affects sequence tokens.
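Tokenisation, the document-term matrix, and bigram construction can be sketched as follows; the tokeniser is deliberately naive (lowercase, strip periods, split on whitespace):

```python
from collections import Counter

def tokenize(text: str):
    """Naive tokeniser: lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()

def bag_of_words(docs):
    """Document-term matrix: one token-count mapping per document."""
    return [Counter(tokenize(d)) for d in docs]

def ngrams(tokens, n=2):
    """Join n consecutive tokens with underscores (bigram when n = 2)."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = ["The market is up today.", "The market is down."]
dtm = bag_of_words(docs)
print(dtm[0]["market"])            # term count in the first document
print(ngrams(tokenize(docs[0])))   # bigrams matching the text's example
```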

MODULE QUIZ 4.1

1. Which of the following is least likely to be a step in data analysis?

  • A. Structured formative analysis.
  • B. Data collection.
  • C. Data preparation.

2. Which of the following shortcomings of a feature is least likely to be addressed by data cleansing?

  • A. Missing values.
  • B. Common values.
  • C. Non-uniform values.

3. The process of adjusting variable values so that they fall between 0 and 1 is most commonly referred to as:

  • A. scaling.
  • B. standardization.
  • C. normalization.

MODULE 4.2: DATA EXPLORATION

LOS 4.d: Describe objectives, methods, and examples of data exploration.

Data exploration evaluates a data set to determine appropriate configuration for model training. Typical steps:

  1. Exploratory data analysis (EDA). Examine summary statistics, heat maps, word clouds and other visuals to understand data properties, distributions, and relationships; to test hypotheses; and to plan modelling.
  2. Feature selection. Choose only the attributes needed for model training. More features increase model complexity and training time; unnecessary features increase noise and may reduce out-of-sample performance.
  3. Feature engineering (FE). Create new features by transforming (for example, natural logarithm), decomposing, or combining features. Feature extraction produces variables from raw data (for example, compute age from date of birth). Model performance often depends heavily on feature selection and engineering.

Data Exploration for Structured Data

Structured data are organised as rows (observations) and columns (features). EDA can be univariate or multivariate. When many features exist, dimension-reduction methods such as principal component analysis (PCA) can assist exploration.

For a single feature, descriptive statistics include mean, standard deviation, skewness and kurtosis. Visualisations used in EDA include box plots, histograms, density plots and bar charts.

  • Histograms show frequencies in equal-width bins.
  • Density plots are smoothed histograms for continuous variables.
  • Bar charts display frequency distributions of categorical variables.
  • Box plots show median, quartiles and outliers for continuous features.

For multivariate exploration, use correlation matrices and scatterplots; statistical tests include ANOVA, t-tests, Spearman rank correlation and chi-square tests.

Feature selection aims to retain features that contribute to out-of-sample predictive power, producing a parsimonious model. Features may be scored and ranked; dimension-reduction algorithms can reduce processing time.

Feature engineering optimises the representation of features for the algorithm. For categorical features, one-hot encoding (OHE) converts categories into binary dummy variables to enable machine processing.
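A minimal sketch of one-hot encoding in plain Python; the rating categories are invented for illustration:

```python
def one_hot(values):
    """One-hot encode a categorical feature into binary dummy columns."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

ratings = ["AAA", "BB", "AAA"]
encoded = one_hot(ratings)
print(encoded[0])   # first observation: dummy column per category
```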

LOS 4.e: Describe methods for extracting, selecting and engineering features from textual data.

Data Exploration for Unstructured Data

Tokenise text and compute summary statistics such as term frequency (count of a token) and co-occurrence (tokens appearing together). A word cloud visually emphasises frequent tokens by font size, helping identify contextually important words. (Figure 4.1 shows an example of a word cloud for an SEC filing.)

Feature selection for textual BOW representations reduces dimension and noise by eliminating unhelpful tokens. High- and low-frequency tokens are often removed: high frequency tokens tend to be stop words or common vocabulary words, while low-frequency tokens may be irrelevant.

Feature Selection Methods (Text)

  • Document frequency (DF): DF(token) = (number of documents containing the token) / (total number of documents).
  • Chi-square test: rank tokens by their usefulness to discriminate classes in text classification problems; higher chi-square values indicate greater association with a particular class.
  • Mutual information (MI): a numerical measure of a token's contribution to a class. Tokens that appear across all classes have MI ≈ 0; tokens concentrated in one or a few classes have MI approaching 1.
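Of the three selection measures, document frequency is the simplest to compute. A sketch, with invented mini-documents:

```python
def document_frequency(token, docs):
    """DF(token) = documents containing the token / total documents."""
    tokenized = [set(d.lower().split()) for d in docs]
    return sum(token in t for t in tokenized) / len(docs)

docs = ["rates rise again", "rates fall", "earnings beat forecasts"]
print(document_frequency("rates", docs))   # token appears in 2 of 3 documents
```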

Feature Engineering (Text)

Common FE techniques for text include:

  • Number tokens: convert numbers of particular lengths into annotated tokens, for example convert a four-digit year into /number4/ or a generic number into /numberX/.
  • N-grams: preserve multi-word patterns when useful, e.g. expansionary_monetary_policy as a single token rather than separate tokens.
  • Named entity recognition (NER): tag tokens by entity class (for example, Microsoft → ORG, Europe → PLACE) to improve discrimination.
  • Parts of speech (POS): tag tokens with grammatical categories (for example, Microsoft → NNP proper noun, 1969 → CD cardinal number) to supply syntactic context.

MODULE QUIZ 4.2

1. The process used to convert a categorical feature into a binary (dummy) variable is best described as:

  • A. one-hot encoding (OHE).
  • B. parts of speech (POS).
  • C. named entity recognition (NER).

2. To make a bag-of-words (BOW) concise, the most appropriate procedure would be to:

  • A. eliminate high- and low-frequency words.
  • B. use a word cloud.
  • C. use N-grams.

3. Mutual information (MI) of tokens that appear in one or few classes is most likely to be:

  • A. close to 0.
  • B. close to 1.
  • C. close to 100.

MODULE 4.3: MODEL TRAINING AND EVALUATION

LOS 4.f: Describe objectives, steps, and techniques in model training.

Before training, define the modelling objectives, identify useful data points and conceptualise the model. Model conceptualisation is an iterative planning phase; ML engineers should collaborate with domain experts to identify relevant relationships (for example, the relation between inflation and exchange rates).

After unstructured data has been processed into a structured matrix, model training follows similar principles used for structured data. Machine learning seeks patterns that explain the target variable. Model fitting describes the model's ability to generalise to new (out-of-sample) data.

Model fitting errors can arise from:

  • Training-sample size: small datasets may cause underfitting because the model cannot learn important patterns;
  • Number of features: too few features can cause underfitting; too many features can cause overfitting because of limited degrees of freedom. Feature selection mitigates both underfitting and overfitting; good feature engineering often reduces underfitting.

The three tasks of model training

  1. Method selection. Choose an appropriate algorithm considering whether the problem is supervised or unsupervised, the data type (numerical, text, image) and data size.
    • Supervised learning is used when labelled training data (ground truth) exists; common methods include regression, ensemble decision trees, support vector machines (SVM) and neural networks (NN).
    • Unsupervised learning is used when no labelled target exists; common methods include clustering, dimensionality reduction and anomaly detection.
    • Data type: for numerical prediction use regression or tree-based methods; for text use GLMs or SVMs; for images use neural networks and deep learning.
    • Data size: SVMs can handle large feature sets; neural networks typically require many observations and perform better when the feature set is suitable for deep architectures.
  2. Performance evaluation. Assess model efficacy using appropriate metrics and validation techniques.
  3. Tuning. Modify hyperparameters to improve performance.

For supervised learning, split the data into three parts: training (~60%), validation (tuning) (~20%), and test (~20%) to measure out-of-sample performance. For unsupervised learning, labelled splits are not applicable.
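The approximately 60/20/20 split can be sketched as follows; the seed and proportions are illustrative:

```python
import random

def train_validate_test_split(data, seed=42):
    """Split data ~60/20/20 into training, validation (tuning), and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n = len(shuffled)
    train_end, val_end = int(n * 0.6), int(n * 0.8)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

train, validation, test = train_validate_test_split(list(range(100)))
print(len(train), len(validation), len(test))   # 60 20 20
```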

Class imbalance arises when one class dominates the dataset (for example, many high-grade bonds and few defaults). To address imbalance, use undersampling of the majority class or oversampling of the minority class to present a balanced training set.
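A sketch of oversampling the minority class by repeated sampling until the two classes are balanced (the class labels are invented):

```python
import random

def oversample_minority(majority, minority, seed=7):
    """Repeat-sample the minority class until the training set is balanced."""
    rng = random.Random(seed)
    resampled = [rng.choice(minority) for _ in range(len(majority))]
    return majority + resampled

high_grade = ["HG"] * 95   # dominant class (e.g., high-grade bonds)
defaults = ["D"] * 5       # rare class (e.g., defaults)
balanced = oversample_minority(high_grade, defaults)
print(balanced.count("HG"), balanced.count("D"))   # 95 95
```

Undersampling would instead discard majority-class observations until the counts match.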

LOS 4.c: Evaluate the fit of a machine learning algorithm.

Techniques to Measure Model Performance

Validation requires appropriate metrics. The following techniques are commonly used for binary classification problems.

1. Error analysis and confusion matrix. Errors are false positives (type I) and false negatives (type II). A confusion matrix summarises classification results. From it, compute:

Precision (P) = true positives / (true positives + false positives)

Recall (R) = true positives / (true positives + false negatives)

Accuracy = (true positives + true negatives) / (all observations)

F1 score = (2 × P × R) / (P + R)

High precision is preferred when the cost of a false positive is large; high recall is preferred when the cost of a false negative is large. The tradeoff between precision and recall is a business decision.

2. Receiver operating characteristic (ROC). The ROC curve plots true positive rate (TPR = recall) on the Y-axis against false positive rate (FPR = false positives / actual negatives) on the X-axis. The area under the curve (AUC) ranges from 0 to 1; AUC close to 1 indicates high predictive accuracy. AUC = 0.50 corresponds to random guessing.

3. Root mean squared error (RMSE). Use RMSE for continuous targets (regression). RMSE is the square root of the mean squared difference between predicted and actual values; smaller values indicate better fit.
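RMSE itself is straightforward to compute; a sketch for a small invented set of predictions and actuals:

```python
def rmse(predicted, actual):
    """Root mean squared error for a continuous target."""
    n = len(predicted)
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n) ** 0.5

print(rmse([2.0, 4.0, 6.0], [2.5, 3.5, 6.0]))   # sqrt of mean squared error
```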

EXAMPLE: Model evaluation

Dave Kwah evaluates a binary classification model for whether a company will have a dividend cut next year (cut versus not cut). In the test sample of 78 observations:

  • The model correctly classified 18 companies that had a dividend cut (true positives, TP = 18).
  • The model correctly classified 46 companies that did not have a dividend cut (true negatives, TN = 46).
  • The model failed to identify 3 companies that actually had a dividend cut (false negatives, FN = 3).

Using these values, perform the following calculations:

  1. Calculate the model's precision and recall.
  2. Calculate the model's accuracy and F1 score.
  3. Calculate the model's false positive rate (FPR).

Answer

Compute intermediate values first.

TP = 18

TN = 46

FN = 3

Total observations = 78

FP = Total - (TP + TN + FN) = 78 - (18 + 46 + 3) = 11

Precision = TP / (TP + FP)

Precision = 18 / (18 + 11) = 18 / 29 ≈ 0.6207

Recall = TP / (TP + FN)

Recall = 18 / (18 + 3) = 18 / 21 ≈ 0.8571

Accuracy = (TP + TN) / Total

Accuracy = (18 + 46) / 78 = 64 / 78 ≈ 0.8205

F1 score = (2 × Precision × Recall) / (Precision + Recall)

F1 = (2 × 0.620690 × 0.857143) / (0.620690 + 0.857143) = 0.7200 (equivalently, 2TP / (2TP + FP + FN) = 36 / 50 = 0.72)

False positive rate (FPR) = FP / (FP + TN)

FPR = 11 / (11 + 46) = 11 / 57 ≈ 0.1930
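The calculations above can be confirmed with a short helper function (the function name is generic; the counts are those of the example):

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, F1, and FPR from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return precision, recall, accuracy, f1, fpr

# Counts from the dividend-cut example (FP inferred as 78 - 18 - 46 - 3 = 11).
p, r, a, f1, fpr = classification_metrics(tp=18, fp=11, tn=46, fn=3)
print(round(p, 4), round(r, 4), round(a, 4), round(f1, 4), round(fpr, 4))
```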

Model Tuning

After evaluation, tune the model until acceptable performance is reached. Two error types to consider:

  • Bias error (training error from underfitting): occurs when the model is too simple to learn patterns.
  • Variance error (validation error from overfitting): occurs when the model learns the training data too well and does not generalise.

A fitting curve plots training error and cross-validation error against model complexity. As complexity increases, training error typically decreases while variance (validation error) may increase. Regularisation imposes penalties on model complexity to reduce overfitting. Optimal model complexity balances bias and variance.

Parameters (for example regression slopes) are estimated from training data by optimisation. Hyperparameters (for example number of hidden layers, regularisation strength) are set by engineers and tuned via validation. Tuning can be manual or automated; a grid search systematically searches combinations of hyperparameters. Ceiling analysis evaluates each component in the model-building pipeline to identify the weakest link to improve overall performance.
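Grid search is simple to express: evaluate every combination of hyperparameter values and keep the best. The training and scoring routines below are toy stand-ins, not a real model:

```python
from itertools import product

def grid_search(train_fn, score_fn, param_grid):
    """Exhaustively evaluate every hyperparameter combination (a grid search).
    train_fn and score_fn are placeholders for the engineer's own routines."""
    best_score, best_params = float("-inf"), None
    keys = list(param_grid)
    for combo in product(*param_grid.values()):
        params = dict(zip(keys, combo))
        model = train_fn(**params)
        score = score_fn(model)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy stand-in: the "validation score" peaks at depth=3, rate=0.1.
def score(m):
    return -abs(m["depth"] - 3) - abs(m["rate"] - 0.1)

best, _ = grid_search(lambda **p: p, score,
                      {"depth": [1, 3, 5], "rate": [0.01, 0.1, 1.0]})
print(best)   # combination with the highest validation score
```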

MODULE QUIZ 4.3

1. When the training data contains the ground truth, the most appropriate learning method is:

  • A. supervised learning.
  • B. unsupervised learning.
  • C. machine learning.

Use the following information to answer Questions 2 through 6.

While analysing health-care stocks, Ben Stokes devises a model to classify stocks as those that will report earnings above consensus forecasts versus those that will not. Stokes prepares the following confusion matrix using the results of his model.

Confusion Matrix for Earnings Outperformance

                         Actual: above consensus   Actual: not above
Predicted: above         TP = 12                   FP = 4
Predicted: not above     FN = 2                    TN = 7

2. The model's accuracy score is closest to:

  • A. 0.44.
  • B. 0.76.
  • C. 0.89.

3. The model's recall is closest to:

  • A. 0.67.
  • B. 0.72.
  • C. 0.86.

4. The model's precision is closest to:

  • A. 0.64.
  • B. 0.72.
  • C. 0.75.

5. The model's F1 score is closest to:

  • A. 0.80.
  • B. 0.89.
  • C. 0.94.

6. To reduce type I error, Stokes should most appropriately increase the model's:

  • A. precision.
  • B. recall.
  • C. accuracy.

KEY CONCEPTS

LOS 4.a

The steps in a data analysis project include: (1) conceptualization of the modelling task, (2) data collection, (3) data preparation and wrangling, (4) data exploration, and (5) model training.

LOS 4.b

Data cleansing addresses missing, invalid, inaccurate and non-uniform values and duplicate observations. Data wrangling or preprocessing includes transformation and scaling. Transformation includes extraction, aggregation, filtration, selection and conversion. Scaling converts data to a common unit. Normalization scales variables to [0,1]. Standardization centres variables at mean 0 and scales by standard deviation; standardization is less sensitive to outliers but assumes approximate normal distribution.

LOS 4.c

Model performance for classification is evaluated using a confusion matrix and metrics such as precision, recall, accuracy and F1 score.

precision (P) = true positives / (false positives + true positives)

recall (R) = true positives / (true positives + false negatives)

accuracy = (true positives + true negatives) / (all observations)

F1 score = (2 × P × R) / (P + R)

The ROC curve plots the tradeoff between false positives and true positives; AUC measures area under ROC. Use RMSE for continuous targets. Model tuning balances bias and variance and selects optimal hyperparameters.

LOS 4.d

Data exploration includes EDA, feature selection and feature engineering. EDA inspects summary statistics and patterns. Feature selection chooses features that improve out-of-sample predictive power. Feature engineering optimises feature representations for faster and more accurate model training.

LOS 4.e

Text summary statistics include term frequency and co-occurrence. Word clouds visually highlight frequent tokens. Feature selection tools include document frequency, chi-square and mutual information. Feature engineering techniques for text include number tokenisation, N-grams, named entity recognition (NER) and parts-of-speech (POS) tagging, and tokenisation into BOW or other structured forms.

LOS 4.f

Model conceptualisation requires collaboration between ML engineers and domain experts to identify relationships and data characteristics. ML identifies patterns in training data so that the model generalises to out-of-sample data. Insufficient training data or inappropriate numbers of features can cause underfitting or overfitting. Model training involves method selection, evaluation and tuning.

LOS 4.g

Text processing removes HTML tags, punctuation, numbers and extraneous whitespace. Normalisation steps include lowercasing, stop-word removal, stemming and lemmatization. Tokenisation splits text into tokens. N-grams capture sequences when order is important. A document-term matrix organises tokens with documents as rows and tokens as columns; cell values count token occurrences.

ANSWER KEY FOR MODULE QUIZZES

Module Quiz 4.1

1. A Structured formative analysis is not a standard term in the curriculum. The five steps are conceptualization; data collection; data preparation and wrangling; data exploration; and model training. (LOS 4.a)

2. B Common values are not addressed by cleansing. Missing, invalid, non-uniform and inaccurate values are cleansed. (LOS 4.b)

3. C Normalization scales variables between 0 and 1. (LOS 4.b)

Module Quiz 4.2

1. A OHE converts categorical features into binary (dummy) variables suitable for machine processing. POS and NER assign tags to tokens. (LOS 4.e)

2. A To make a BOW concise, eliminate high- and low-frequency words. High-frequency words tend to be stop words or common vocabulary. Word clouds are visual tools; N-grams are used when sequence matters. (LOS 4.e)

3. B MI indicates a token's contribution to a class: tokens appearing in all classes have MI ≈ 0; tokens concentrated in one or a few classes have MI ≈ 1. (LOS 4.e)

Module Quiz 4.3

1. A Supervised learning is used when the training data contains ground truth (the known target). Unsupervised learning is used when there is no known target. Machine learning is the broad category including both. (LOS 4.f)

The following matrix answers Questions 2 through 6.

Confusion Matrix for Earnings Outperformance (answer key calculations)

Given values used by the answer key:

  • True positives (TP) = 12
  • True negatives (TN) = 7
  • False positives (FP) = 4
  • False negatives (FN) = 2
  • Total observations = 25

2. B Accuracy = (TP + TN) / (TP + TN + FP + FN) = 19 / 25 = 0.76. (LOS 4.c)

3. C Recall (R) = TP / (TP + FN) = 12 / 14 = 0.86. (LOS 4.c)

4. C Precision (P) = TP / (TP + FP) = 12 / 16 = 0.75. (LOS 4.c)

5. A F1 score = (2 × P × R) / (P + R) = (2 × 0.75 × 0.86) / (0.75 + 0.86) = 0.80. (LOS 4.c)

6. A To reduce type I error (false positives), increase precision. High precision is valued when the cost of a false positive is large. (LOS 4.c)

Topic Quiz: Quantitative Methods

You have completed the Quantitative Methods topic section. Take the Topic Quiz to assess understanding. These tests simulate exam-like questions and provide feedback. A score below 70% indicates further study is advisable.
