This topic review is a broad overview of the use of big data analysis for financial forecasting. Candidates should understand the following:
Big data is commonly characterised by the three Vs:
- Volume: the sheer quantity of data available.
- Velocity: the speed at which data is created and communicated.
- Variety: the range of data forms and sources, both structured and unstructured.
When using data to draw inferences, a fourth characteristic is important: veracity (or validity). Not all data sources are reliable; researchers must separate quality from quantity to produce robust forecasts.
Structured data (for example, balance-sheet rows and columns) is organised and readily used by standard statistical and ML models. Unstructured data (for example, free-text from filings or social media, images or audio) requires preprocessing for machine use because it is not arranged in neat rows and columns.
To illustrate the typical steps in analysing data for financial forecasting, consider a consumer credit scoring model. The overall process is iterative and usually follows these five main stages:
1. Conceptualisation of the modelling task. Define the precise problem, the model output (target), how the model will be used, the stakeholders, and whether the model will be embedded into existing or new business processes. For example, the purpose of the model might be to measure a borrower's credit risk accurately so that lending decisions can be automated or supported.
2. Data collection. Identify and collect structured numeric data from internal and external sources. For a credit scoring model this may include past repayment history, employment record, income and other borrower attributes. Decide which sources (internal/external) and specific tables or APIs to use.
3. Data preparation and wrangling. Clean the dataset and prepare it for modelling. Address missing values, invalid or out-of-range values, duplicates and inconsistent units. Preprocessing can involve aggregation, filtering, imputation rules, and selection of relevant variables. In credit scoring, rules may be used to fill gaps where data is missing or suspected inaccurate.
4. Data exploration. Perform exploratory data analysis (EDA), feature selection and feature engineering. For credit scoring several variables may be combined into an ability-to-pay score or other derived features.
5. Model training. Select an appropriate machine learning algorithm, train it on the training data, validate and tune hyperparameters. The model choice depends on the relationship between features and the target, data size, and business constraints.
These steps are iterative. Depending on model output quality, data exploration, feature engineering or even data collection choices might be revisited to improve model performance.
If the model incorporates unstructured data (for example, a borrower's social media posts), the first four steps are modified as follows:
1. Conceptualisation. Define the specific inputs (text sources) and the output needed, and determine how the output will be used.
2. Data collection. Decide the sources (web scraping, specific social-media APIs). For supervised learning, create or obtain labelled examples (annotated target variables) indicating which texts map to positive or negative outcomes.
3. Data preparation and wrangling. Preprocess unstructured text to convert it into a form usable by structured modelling techniques (for example, tokenisation, stop-word removal, stemming or lemmatisation, and document-term matrix creation).
4. Data exploration. Use visualisation and textual feature engineering to select tokens, phrases or other representations useful for the modelling task.
Outputs from unstructured-text models may be used in isolation or combined with structured variables as inputs to another model.
Data preparation and wrangling are critical steps that often consume the majority of a project's time and resources. Once a problem is defined, identify appropriate data with the help of domain experts and obtain it from internal databases or external vendors via APIs. When using external data, carefully check metadata or README files that describe how the data is stored and collected. External data can speed up projects but may be accessible to competitors, reducing proprietary advantage.
Data cleansing reduces errors in raw data. For structured data, common errors include:
- Incompleteness: missing values.
- Invalidity: values outside a meaningful range.
- Inaccuracy: values that are not a true measure of the variable.
- Inconsistency or non-uniformity: values recorded in conflicting formats or units.
- Duplication: repeated observations.
Cleansing is done with automated rules and human review. Metadata (summary statistics) often helps identify anomalies. Observations that cannot be cleansed may be dropped.
Data wrangling / preprocessing includes data transformation and scaling.
Common types of data transformation:
- Extraction: deriving a new variable from an existing one (for example, age from a date of birth).
- Aggregation: combining two or more variables into one.
- Filtration: removing observations (rows) that are not needed.
- Selection: removing features (columns) that are not needed.
- Conversion: changing a value's data type (for example, text to numeric).
Handling outliers: Identify outliers using statistical techniques (for example, values more than three standard deviations from the mean). Options include replacing outliers with algorithm-determined values, dropping observations, trimming (exclude the highest and lowest x% of observations) or winsorization (replace extreme high values with the maximum allowable value and extreme low values with the minimum allowable value).
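A minimal pure-Python sketch of trimming versus winsorization; the data and the 10% cutoff are illustrative, not from the reading:

```python
def trim(values, pct):
    """Drop the lowest and highest pct fraction of observations."""
    s = sorted(values)
    k = int(len(s) * pct)
    return s[k:len(s) - k]

def winsorize(values, pct):
    """Replace extreme values with the nearest allowable boundary value."""
    s = sorted(values)
    k = int(len(s) * pct)
    lo, hi = s[k], s[-k - 1]
    return [min(max(v, lo), hi) for v in values]

data = [1, 2, 2, 3, 3, 4, 4, 5, 5, 100]  # 100 is an outlier
print(trim(data, 0.10))       # drops 1 and 100
print(winsorize(data, 0.10))  # caps 100 at 5, floors 1 at 2
```

Note the difference: trimming shrinks the sample, while winsorization keeps every observation but pulls the extremes in to the boundary values.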
Scaling converts features to a common range. Some ML algorithms (for example, neural networks, support vector machines) require features to be on comparable scales. Two common scaling methods are:
- Normalization: rescales values to lie between 0 and 1, using x' = (x - x_min) / (x_max - x_min). It is sensitive to outliers.
- Standardization: centres values at a mean of 0 with unit standard deviation, using x' = (x - mean) / standard deviation. It is less sensitive to outliers but assumes the feature is approximately normally distributed.
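The two scaling methods can be sketched in pure Python as follows (`statistics.pstdev` gives the population standard deviation):

```python
import statistics

def normalize(values):
    """Min-max normalization: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: centre at mean 0 and scale to unit standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

x = [2.0, 4.0, 6.0, 8.0]
print(normalize(x))    # first value 0.0, last value 1.0
print(standardize(x))  # mean 0, unit standard deviation
```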
Notes:
Some learning outcomes (LOS) appear out of order in this reading for presentation clarity.
Text cleansing commonly includes the following steps:
- Remove HTML tags left over from web-sourced text.
- Remove (or replace with annotations) punctuation.
- Remove (or replace with annotations) numbers.
- Remove extraneous white space.
After cleansing, normalise text for consistent processing:
- Lowercasing, so that tokens differing only in case are treated alike.
- Removing stop words (common words such as "the" or "is" that carry little meaning for many tasks).
- Stemming: reducing inflected words to a common base form by rule.
- Lemmatisation: reducing words to their dictionary form (lemma); more accurate but more computationally expensive than stemming.
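The cleansing and normalisation steps above can be sketched as a small pipeline. The regexes, the tiny stop-word list, and the crude suffix-stripping "stemmer" are illustrative only; real work would use an NLP library's stemmer or lemmatiser:

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "to"}  # illustrative subset

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # strip punctuation and numbers
    return re.sub(r"\s+", " ", text).strip()   # collapse extra white space

def normalise(text):
    tokens = clean(text).lower().split()               # lowercase and tokenise
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # toy stemmer: drop a trailing "s" from longer words
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(normalise("<p>The market is up 5 points today!</p>"))
# ['market', 'up', 'point', 'today']
```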
Tokenisation splits text into tokens (commonly words). Example: "It is a beautiful day." tokenised into: (1) it, (2) is, (3) a, (4) beautiful, (5) day.
After normalization and tokenisation, apply a bag-of-words (BOW) approach which collects tokens disregarding their sequence. A document-term matrix converts unstructured text into structured form: each document is a row, each token is a column, and cell values record token occurrence counts.
If sequence matters, use N-grams. A two-word sequence is a bigram, a three-word sequence a trigram, etc. Example sentence: "The market is up today." Bigrams: "the_market", "market_is", "is_up", "up_today". N-gram BOWs preserve sequence at the cost of larger vocabularies; in N-gram implementations stop words are often retained because they can form meaningful phrases.
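Tokenisation, the bag-of-words document-term matrix, and bigram construction can be sketched together with the standard library (the two example documents are illustrative):

```python
from collections import Counter

docs = ["the market is up today", "the market is down today"]

def tokenize(text):
    """Split lowercased text into word tokens."""
    return text.lower().split()

def bigrams(tokens):
    """Join adjacent token pairs into two-word tokens."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

# document-term matrix: one row per document, one column per vocabulary token
vocab = sorted({t for d in docs for t in tokenize(d)})
dtm = [[Counter(tokenize(d))[t] for t in vocab] for d in docs]

print(vocab)                       # column order of the matrix
print(dtm)                         # token occurrence counts per document
print(bigrams(tokenize(docs[0])))  # ['the_market', 'market_is', 'is_up', 'up_today']
```

Notice that the two rows of the matrix differ only in the "up"/"down" columns; the bigram representation additionally preserves which words were adjacent.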
1. Which of the following is least likely to be a step in data analysis?
A. Structured formative analysis.
B. Data collection.
C. Data preparation.
2. Which of the following shortcomings of a feature is least likely to be addressed by data cleansing?
A. Missing values.
B. Common values.
C. Non-uniform values.
3. The process of adjusting variable values so that they fall between 0 and 1 is most commonly referred to as:
A. scaling.
B. standardization.
C. normalization.
Data exploration evaluates the dataset to determine how to configure it for model training. Steps include:
- Exploratory data analysis (EDA): summarising and visualising the data.
- Feature selection: choosing the features that contribute to the model's predictive power.
- Feature engineering: creating or transforming features to improve training.
Model performance depends heavily on feature selection and engineering; analysts often iterate here until model results are satisfactory.
Structured data are rows (observations) and columns (features). EDA can be one-dimensional (single feature) or multi-dimensional. For high-dimensional data, apply dimension-reduction techniques such as principal component analysis (PCA).
Single-feature summary statistics: mean, standard deviation, skewness, kurtosis. Visualisations: box plots, histograms, density plots and bar charts. Histograms show observation frequencies across bins; density plots are smoothed histograms; box plots display median, quartiles and outliers.
Multiple-feature analysis: correlation matrices, scatterplots and paired visualisations. Statistical tests: parametric (ANOVA, t-tests, correlation) and nonparametric (Spearman rank, chi-square) as appropriate.
Feature selection is iterative and business-informed. Assign importance scores using statistical or model-based methods and then rank and select features. Dimension-reduction algorithms can decrease feature count to speed model training.
One-hot encoding (OHE) converts categorical features into binary dummy variables so models can process them.
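A minimal one-hot encoding sketch in pure Python; the credit-rating categories are illustrative:

```python
# Expand a categorical feature into one binary (dummy) column per category.
ratings = ["AAA", "BB", "AAA", "C"]

categories = sorted(set(ratings))  # ['AAA', 'BB', 'C'] -> column order
encoded = [[1 if r == c else 0 for c in categories] for r in ratings]

print(categories)
print(encoded)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each observation becomes a row with exactly one 1, marking its category, which numeric models can consume directly.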
Tokenise text and calculate summary statistics such as term frequency (number of times a token appears) and co-occurrence (tokens appearing together). A word cloud displays tokens with font size proportional to frequency, helping identify contextually important words.
Figure 4.1: Word Cloud, Apple (NASDAQ: AAPL) SEC Filing
Select a subset of tokens from the BOW to reduce noise and improve parsimony. Often remove very high-frequency tokens (likely stop words) and very low-frequency tokens (rare or noisy). Common feature-selection methods for text include:
- Frequency measures: term frequency and document frequency (the fraction of documents that contain a given token).
- Chi-square test: measures how strongly a token is associated with a particular class.
- Mutual information (MI): measures how much a token contributes to identifying a class; MI is close to 0 for tokens spread evenly across all classes and close to 1 for tokens that appear in only one or a few classes.
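Frequency-based token selection can be sketched as follows; the tokenised documents and the document-frequency cutoffs are illustrative:

```python
from collections import Counter

# three toy tokenised documents
docs = [["stock", "up", "the"], ["stock", "down", "the"], ["bond", "flat", "the"]]

# document frequency: in how many documents does each token appear?
df = Counter(t for d in docs for t in set(d))
n = len(docs)

# keep tokens that are neither in every document (likely stop words)
# nor in only one document (likely rare/noisy) -- cutoffs are illustrative
keep = {t for t, c in df.items() if 1 < c < n}
print(sorted(keep))  # ['stock']
```

Here "the" is dropped because it appears in every document, and the single-document tokens are dropped as rare, leaving only "stock".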
Common text feature-engineering techniques:
- Converting numbers into a common token type so that numeric strings are not all treated as distinct tokens.
- N-grams: multi-word tokens that preserve word sequence.
- Named entity recognition (NER): tagging tokens as names of people, organisations, locations, and so on.
- Parts of speech (POS): tagging tokens by grammatical role (noun, verb, and so on).
1. The process used to convert a categorical feature into a binary (dummy) variable is best described as:
A. one-hot encoding (OHE).
B. parts of speech (POS).
C. named entity recognition (NER).
2. To make a bag-of-words (BOW) concise, the most appropriate procedure would be to:
A. eliminate high- and low-frequency words.
B. use a word cloud.
C. use N-grams.
3. Mutual information (MI) of tokens that appear in one or few classes is most likely to be:
A. close to 0.
B. close to 1.
C. close to 100.
Before training, define model objectives, identify useful data points and conceptualise the model. ML engineers should work with domain experts to understand relationships (for example, how inflation relates to exchange rates).
After unstructured data is processed into structured form (for example, document-term matrices), model training for such data follows the structured-data workflow: ML locates patterns and builds decision rules that generalise to new observations. Model fitting describes how well the model generalises out of sample.
Common causes of model fitting errors:
- A training sample that is too small, which risks underfitting.
- An inappropriate number of features: too few features can underfit the data; too many can overfit it.
There are three main tasks in model training:
- Method (model) selection.
- Performance evaluation.
- Tuning.
For supervised learning, split labelled data typically as follows: approximately 60% training, 20% validation (tuning), and 20% test (final out-of-sample evaluation). Unsupervised learning lacks labelled targets and does not require this split in the same way.
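The 60/20/20 split described above can be sketched with the standard library; the seed and dataset size are illustrative:

```python
import random

def split_indices(n, train=0.6, val=0.2, seed=42):
    """Shuffle row indices and split them into train/validation/test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

Keeping the test set untouched until final evaluation is what makes it a genuine out-of-sample check.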
Class imbalance occurs when one class is over-represented (for example, many high-grade bonds vs few defaults). Strategies include undersampling the majority class and oversampling the minority class to provide balanced training data.
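A rebalancing sketch under an illustrative 95/5 imbalance: oversample the minority class by resampling with replacement (`random.choices`), or undersample the majority class (`random.sample`):

```python
import random

majority = ["no_default"] * 95
minority = ["default"] * 5
rng = random.Random(0)  # seeded for reproducibility

# oversample the minority class up to the majority size
oversampled = minority + rng.choices(minority, k=len(majority) - len(minority))
balanced_over = majority + oversampled

# undersample the majority class down to the minority size
balanced_under = rng.sample(majority, len(minority)) + minority

print(len(balanced_over), len(balanced_under))  # 190 10
```

Oversampling keeps all observations but duplicates minority rows; undersampling discards majority rows and can lose information when the dataset is small.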
Model fitting is covered in more depth in separate machine learning materials.
Model validation requires measuring training and out-of-sample performance. The techniques below focus on binary classification but are broadly applicable.
Classification errors are:
- False positives (Type I errors): the model predicts the positive class when the actual class is negative.
- False negatives (Type II errors): the model predicts the negative class when the actual class is positive.
A confusion matrix summarises true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). From it we compute:
- Accuracy = (TP + TN) / (TP + TN + FP + FN).
- Precision (P) = TP / (TP + FP).
- Recall (R) = TP / (TP + FN).
- F1 score = (2 × P × R) / (P + R), the harmonic mean of precision and recall.
Precision is important when the cost of a false positive is high; recall is important when the cost of a false negative is high. The business context decides the preferred trade-off.
The ROC curve plots the true positive rate (TPR = recall) on the Y-axis against the false positive rate (FPR = FP / (FP + TN)) on the X-axis. The area under the curve (AUC) ranges from 0 to 1; values closer to 1 indicate better discrimination. An AUC of 0.50 is equivalent to random guessing.
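The classification metrics and FPR can be collected into one small helper; here it is checked against the dividend-cut worked example that appears later in the reading (TP = 18, TN = 46, FP = 11, FN = 3):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the standard binary-classification metrics from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # true positive rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)                         # false positive rate (ROC x-axis)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "fpr": fpr}

m = classification_metrics(18, 46, 11, 3)
print(round(m["precision"], 4), round(m["recall"], 4),
      round(m["accuracy"], 4), round(m["f1"], 2), round(m["fpr"], 4))
# 0.6207 0.8571 0.8205 0.72 0.193
```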
For continuous target variables (regression problems) common metrics include root mean squared error (RMSE), mean absolute error (MAE) and R-squared.
Dave Kwah is evaluating a model that predicts whether a company will have a dividend cut next year. The model uses a binary classification: cut versus not cut. In the test sample of 78 observations, the model correctly classified 18 companies that had a dividend cut, and 46 companies that did not have a dividend cut. The model failed to identify 3 companies that actually had a dividend cut.
1. Calculate the model's precision and recall.
2. Calculate the model's accuracy and F1 score.
3. Calculate the model's FPR.
Solution:
Identify confusion-matrix components from the description:
True positives (TP) = 18 (companies correctly predicted to have a dividend cut)
True negatives (TN) = 46 (companies correctly predicted to not have a dividend cut)
False negatives (FN) = 3 (companies that actually had a dividend cut but were missed)
Total observations = 78
Compute false positives (FP) = Total - (TP + TN + FN) = 78 - (18 + 46 + 3) = 11
Precision P = TP / (TP + FP) = 18 / (18 + 11) = 18 / 29 ≈ 0.6207
Recall R = TP / (TP + FN) = 18 / (18 + 3) = 18 / 21 ≈ 0.8571
Accuracy = (TP + TN) / Total = (18 + 46) / 78 = 64 / 78 ≈ 0.8205
F1 score = (2 × P × R) / (P + R)
Compute numerator = 2 × 0.6207 × 0.8571 ≈ 1.064
Compute denominator = 0.6207 + 0.8571 ≈ 1.4778
F1 ≈ 1.064 / 1.4778 ≈ 0.72 (rounded to two decimal places)
False positive rate FPR = FP / (FP + TN) = 11 / (11 + 46) = 11 / 57 ≈ 0.1930
After evaluation, revise the model to improve performance. Key concepts:
- Bias error: underfitting; the model is too simple and performs poorly even on the training data.
- Variance error: overfitting; the model fits the training data too closely and generalises poorly out of sample.
- Tuning: adjusting hyperparameters to balance bias against variance and improve out-of-sample performance.
1. When the training data contains the ground truth, the most appropriate learning method is:
A. supervised learning.
B. unsupervised learning.
C. machine learning.
Use the following information to answer Questions 2 through 6.
While analysing health-care stocks, Ben Stokes devises a model to classify the stocks as those that will report earnings above consensus forecasts versus those that won't. Stokes prepares the following confusion matrix using the results of his model.
| | Predicted: Outperform | Predicted: Not Outperform |
|---|---|---|
| Actual: Outperform | 12 | 2 |
| Actual: Not Outperform | 4 | 7 |
2. The model's accuracy score is closest to:
A. 0.44.
B. 0.76.
C. 0.89.
3. The model's recall is closest to:
A. 0.67.
B. 0.72.
C. 0.86.
4. The model's precision is closest to:
A. 0.64.
B. 0.72.
C. 0.75.
5. The model's F1 score is closest to:
A. 0.80.
B. 0.89.
C. 0.94.
6. To reduce type I error, Stokes should most appropriately increase the model's:
A. precision.
B. recall.
C. accuracy.
The steps in a data analysis project are:
1. Conceptualisation of the modelling task.
2. Data collection.
3. Data preparation and wrangling.
4. Data exploration.
5. Model training.
Data cleansing addresses missing, invalid, inaccurate and non-uniform values, and duplicate observations. Data wrangling (preprocessing) includes data transformation (extraction, aggregation, filtration, selection, conversion) and scaling (normalization or standardization). Normalization scales values between 0 and 1. Standardization centres values at mean 0 and scales by standard deviation to give unit variance; it is less sensitive to outliers but assumes an approximately normal distribution for the feature.
Model performance evaluation tools for classification include error analysis via a confusion matrix and derived metrics: accuracy, precision, recall (the true positive rate) and the F1 score (the harmonic mean of precision and recall).
The ROC curve plots recall (TPR) against FPR (FP / (FP + TN)) and the AUC summarises its area. For continuous targets, use RMSE and related metrics. Model tuning balances bias and variance and chooses hyperparameters to improve out-of-sample performance.
Data exploration comprises EDA, feature selection and feature engineering. EDA inspects summary statistics and relationships. Feature selection chooses features that contribute to predictive power; feature engineering optimises and creates features for faster and more reliable model training.
Text-data summary statistics include term frequency and co-occurrence. A word cloud indicates frequently used words by font size. Feature selection for text may use document frequency, chi-square tests and mutual information. Text feature engineering includes identifying numbers, using N-grams, NER and POS tag tokenisation.
Model conceptualisation requires collaboration with domain experts. ML finds patterns in training data that generalise to out-of-sample data. Model fitting errors are caused by small training samples (risk of underfitting) or inappropriate feature counts (too few features → underfitting; too many features → overfitting). Model training includes selection, evaluation and tuning.
Text processing: remove HTML, punctuation, numbers and extraneous white space. Normalise text by lowercasing, removing stop words, stemming and lemmatisation. Tokenise text; apply N-grams when sequence matters. Create a bag-of-words (BOW) and convert to a document-term matrix for modelling.
1. A Structured formative analysis is not a defined step in the curriculum. The five steps are conceptualisation of the modelling task; data collection; data preparation and wrangling; data exploration; and model training. (LOS 4.a)
2. B Common values are not addressed by data cleansing. Data cleansing handles missing, invalid, non-uniform and inaccurate values, and duplicates. (LOS 4.b)
3. C Normalization scales variable values between 0 and 1. (LOS 4.b)
1. A OHE converts a categorical feature into a binary (dummy) variable suitable for machine processing. POS and NER assign tags to tokens and are not one-hot encoding. (LOS 4.e)
2. A To make a BOW concise, often high- and low-frequency words are eliminated (high-frequency words often include stop words; low-frequency words may be noise). Word clouds visualise frequency; N-grams preserve sequence when needed. (LOS 4.e)
3. B MI close to 1 indicates a token appears primarily in one or a few classes; tokens appearing across all classes will have MI close to 0. (LOS 4.e)
1. A Supervised learning is used when the training data contains ground truth (labelled outcomes). Unsupervised learning is used when no target variable exists. Machine learning is the broad field that includes both supervised and unsupervised methods. (LOS 4.f)
The confusion matrix below is used to answer Questions 2 through 6:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 12 | 2 |
| Actual Negative | 4 | 7 |
Compute metrics from the matrix:
TP = 12, FP = 4, FN = 2, TN = 7, Total = 25
Accuracy = (TP + TN) / Total = (12 + 7) / 25 = 19 / 25 = 0.76. Answer: B
Recall (R) = TP / (TP + FN) = 12 / (12 + 2) = 12 / 14 ≈ 0.857 ≈ 0.86. Answer: C
Precision (P) = TP / (TP + FP) = 12 / (12 + 4) = 12 / 16 = 0.75. Answer: C
F1 score = (2 × P × R) / (P + R) = (2 × 0.75 × 0.86) / (0.75 + 0.86) = 0.80. Answer: A
To reduce type I error (false positives), Stokes should increase precision. Answer: A
You have finished the Quantitative Methods topic section. Take the online Topic Quiz to assess understanding. These quizzes are timed and exam-like; a score below 70% suggests additional review is needed. Allow approximately three minutes per question.