This topic review is a broad overview of the use of big data analysis for financial forecasting. Candidates should understand:
Big data is commonly characterised by the three Vs: volume, variety, and velocity.
When making inferences from data, a fourth characteristic, veracity (validity), must also be considered: not every source is reliable, and analysts must separate quality from quantity to obtain robust forecasts.
Structured data (for example, balance-sheet figures) are neatly organised in rows and columns. Unstructured data (for example, text from regulatory filings or social media) lack such tabular organisation; machine learning algorithms must extract useful signals from noisy, unstructured streams.
To illustrate the typical steps in analysing data for financial forecasting, consider a consumer credit scoring model. The process is commonly organised into five iterative steps:
These five steps are iterative. Depending on model output quality, the analyst may return to earlier steps (for example, revisit feature engineering to improve predictive performance).
When analysing unstructured, text-based data the steps are adapted as follows:
The output of a model trained on unstructured data can be used alone, or combined with structured features as input to a secondary model.
Data preparation and wrangling is a critical step that typically consumes most of a project's time and resources. Once the problem is defined, domain experts help to identify the appropriate data to collect. Data are downloaded from internal databases or external vendors. When accessing a database, exercise caution to ensure data validity; README files often document how data are stored. External data can be obtained via application programming interfaces (APIs). External data are sometimes costly but can reduce local wrangling effort. Using widely available third-party data can reduce a firm's proprietary advantage because competitors may use the same sources.
Data cleansing reduces errors in raw data. For structured data, common errors include:
Cleansing is performed by automated rules-based algorithms and by human review. Metadata (summary information) is a useful starting point for identifying errors. Observations that cannot be corrected are usually dropped.
Data wrangling (preprocessing) prepares data for model use. Common preprocessing tasks include data transformation and scaling.
Data transformation types include:
Outliers can be detected using statistical rules (for example, values more than three standard deviations from the mean). Typical treatments include:
Scaling converts features to a common unit of measurement. Some ML algorithms (for example, neural networks and support vector machines) perform better when features are homogeneous in range. Two common scaling methods:
A standardized variable value of +1.22 means the original value is 1.22 standard deviations above the mean. Unlike normalization, standardization is less sensitive to outliers but assumes an approximately normal distribution for the variable.
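The two scaling methods can be sketched as follows; the sample values are purely illustrative, and the standard deviation here is the population version (a sample version would divide by n − 1):

```python
# Sketch of the two scaling methods described above (illustrative values only).

def normalize(xs):
    """Min-max normalization: rescale values to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardization: center at mean 0, scale by standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5  # population SD
    return [(x - mean) / sd for x in xs]

values = [2.0, 4.0, 6.0, 8.0, 10.0]
print(normalize(values))    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(values))  # mean 0; first value is about -1.41
```

Note that normalization depends only on the observed minimum and maximum, which is why a single extreme outlier distorts it more than standardization.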
Some learning outcome statements (LOS) are presented out of order for ease of exposition.
Unstructured text is readable by humans but must be converted to structured form for machine processing. Text processing consists of cleansing and preprocessing steps.
Common cleansing steps are:
After cleansing, text is normalised using:
Tokenization splits text into tokens (typically words). Example: the sentence "It is a beautiful day." yields five tokens: (1) it, (2) is, (3) a, (4) beautiful, (5) day.
After normalisation, a bag-of-words (BOW) representation collects tokens without regard to order. A document-term matrix converts unstructured data into a structured matrix with rows representing documents and columns representing tokens; each cell records the count of a token in a document.
If sequence matters, use N-grams where tokens are sequences of N words. A two-word sequence is called a bigram; a three-word sequence a trigram. Example: for "The market is up today.", bigrams include "the_market", "market_is", "is_up", "up_today". When using N-grams, stop-word removal must be considered carefully as it affects sequence tokens.
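Tokenization, the document-term matrix, and bigrams can be sketched together, reusing the two example sentences from the text (the cleansing here is deliberately minimal):

```python
# Sketch: tokenization, a document-term matrix, and bigrams,
# using the example sentences from the text.
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text, strip punctuation, and split into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bigrams(tokens):
    """Join adjacent token pairs with an underscore, as in the example."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

docs = ["It is a beautiful day.", "The market is up today."]
counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))

# Document-term matrix: rows = documents, columns = tokens, cells = counts.
dtm = [[c[t] for t in vocab] for c in counts]
print(vocab)
print(dtm)
print(bigrams(tokenize(docs[1])))  # ['the_market', 'market_is', 'is_up', 'up_today']
```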
1. Which of the following is least likely to be a step in data analysis?
2. Which of the following shortcomings of a feature is least likely to be addressed by data cleansing?
3. The process of adjusting variable values so that they fall between 0 and 1 is most commonly referred to as:
Data exploration evaluates a data set to determine appropriate configuration for model training. Typical steps:
Structured data are organised as rows (observations) and columns (features). EDA can be univariate or multivariate. When many features exist, dimension-reduction methods such as principal component analysis (PCA) can assist exploration.
For a single feature, descriptive statistics include mean, standard deviation, skewness and kurtosis. Visualisations used in EDA include box plots, histograms, density plots and bar charts.
For multivariate exploration, use correlation matrices and scatterplots; statistical tests include ANOVA, t-tests, Spearman rank correlation and chi-square tests.
Feature selection aims to retain features that contribute to out-of-sample predictive power, producing a parsimonious model. Features may be scored and ranked; dimension-reduction algorithms can reduce processing time.
Feature engineering optimises the representation of features for the algorithm. For categorical features, one-hot encoding (OHE) converts categories into binary dummy variables to enable machine processing.
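One-hot encoding can be sketched as follows; the credit-rating categories are hypothetical, not taken from the text:

```python
# Minimal one-hot encoding sketch for a categorical feature
# (hypothetical rating categories, for illustration only).

def one_hot(values):
    """Map each category to a binary dummy-variable vector."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

cats, encoded = one_hot(["AAA", "BB", "AAA", "CCC"])
print(cats)     # ['AAA', 'BB', 'CCC']
print(encoded)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```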
Tokenise text and compute summary statistics such as term frequency (count of a token) and co-occurrence (tokens appearing together). A word cloud visually emphasises frequent tokens by font size, helping identify contextually important words. (Figure 4.1 shows an example of a word cloud for an SEC filing.)
Feature selection for textual BOW representations reduces dimension and noise by eliminating unhelpful tokens. High- and low-frequency tokens are often removed: high-frequency tokens tend to be stop words or common vocabulary words, while low-frequency tokens may be irrelevant.
Common FE techniques for text include:
1. The process used to convert a categorical feature into a binary (dummy) variable is best described as:
2. To make a bag-of-words (BOW) concise, the most appropriate procedure would be to:
3. Mutual information (MI) of tokens that appear in one or few classes is most likely to be:
Before training, define the modelling objectives, identify useful data points and conceptualise the model. Model conceptualisation is an iterative planning phase; ML engineers should collaborate with domain experts to identify relevant relationships (for example, the relation between inflation and exchange rates).
After unstructured data have been processed into a structured matrix, model training follows the same principles used for structured data. Machine learning seeks patterns that explain the target variable; model fit describes how well the model generalises to new (out-of-sample) data.
Model fitting errors can arise from:
For supervised learning, split the data into three parts: training (~60%), validation (tuning) (~20%), and test (~20%) to measure out-of-sample performance. For unsupervised learning, there is no ground truth to predict, so such splits are not applicable.
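The three-way split can be sketched as follows (the 60/20/20 proportions and the random seed are illustrative):

```python
# Sketch of a 60/20/20 train-validation-test split with shuffling.
import random

def split_data(rows, seed=42):
    """Shuffle a copy of the data, then cut it at the 60% and 80% marks."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n = len(rows)
    a, b = int(0.6 * n), int(0.8 * n)
    return rows[:a], rows[a:b], rows[b:]

train, val, test = split_data(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```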
Class imbalance arises when one class dominates the dataset (for example, many high-grade bonds and few defaults). To address imbalance, use undersampling of the majority class or oversampling of the minority class to present a balanced training set.
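Naive random oversampling of the minority class can be sketched as follows; the bond records and labels are hypothetical:

```python
# Sketch of naive oversampling: duplicate minority-class rows at random
# until every class matches the size of the largest class.
import random

def oversample(rows, label_of, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(label_of(r), []).append(r)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical imbalanced sample: nine non-defaults, one default.
data = [("bond_hg", "no_default")] * 9 + [("bond_lg", "default")]
balanced = oversample(data, label_of=lambda r: r[1])
print(len(balanced))  # 18: nine observations of each class
```

Undersampling works symmetrically by discarding majority-class rows instead of duplicating minority-class ones.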
Validation requires appropriate metrics. The following techniques are commonly used for binary classification problems.
1. Error analysis and confusion matrix. Errors are false positives (type I) and false negatives (type II). A confusion matrix summarises classification results. From it, compute:
Precision (P) = true positives / (true positives + false positives)
Recall (R) = true positives / (true positives + false negatives)
Accuracy = (true positives + true negatives) / (all observations)
F1 score = (2 × P × R) / (P + R)
High precision is preferred when the cost of a false positive is large; high recall is preferred when the cost of a false negative is large. The tradeoff between precision and recall is a business decision.
2. Receiver operating characteristic (ROC). The ROC curve plots true positive rate (TPR = recall) on the Y-axis against false positive rate (FPR = false positives / actual negatives) on the X-axis. The area under the curve (AUC) ranges from 0 to 1; AUC close to 1 indicates high predictive accuracy. AUC = 0.50 corresponds to random guessing.
3. Root mean squared error (RMSE). Use RMSE for continuous targets (regression). RMSE summarises average prediction error in the sample.
Dave Kwah evaluates a binary classification model for whether a company will have a dividend cut next year (cut versus not cut). In the test sample of 78 observations:
Using these values, perform the following calculations:
Answer
Compute intermediate values first.
TP = 18
TN = 46
FN = 3
Total observations = 78
FP = Total - (TP + TN + FN) = 78 - (18 + 46 + 3) = 11
Precision = TP / (TP + FP)
Precision = 18 / (18 + 11) = 18 / 29 ≈ 0.6207
Recall = TP / (TP + FN)
Recall = 18 / (18 + 3) = 18 / 21 ≈ 0.8571
Accuracy = (TP + TN) / Total
Accuracy = (18 + 46) / 78 = 64 / 78 ≈ 0.8205
F1 score = (2 × Precision × Recall) / (Precision + Recall)
F1 = (2 × 0.620690 × 0.857143) / (0.620690 + 0.857143) = 36 / 50 = 0.72
False positive rate (FPR) = FP / (FP + TN)
FPR = 11 / (11 + 46) = 11 / 57 ≈ 0.1930
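The worked example can be verified in a few lines, using the counts given above (TP = 18, TN = 46, FP = 11, FN = 3):

```python
# Verification of the worked example, using the counts from the text.
tp, tn, fp, fn = 18, 46, 11, 3

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = (2 * precision * recall) / (precision + recall)
fpr = fp / (fp + tn)

print(round(precision, 4))  # 0.6207
print(round(recall, 4))     # 0.8571
print(round(accuracy, 4))   # 0.8205
print(round(f1, 2))         # 0.72
print(round(fpr, 4))        # 0.193
```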
After evaluation, tune the model until acceptable performance is reached. Two error types to consider:
A fitting curve plots training error and cross-validation error against model complexity. As complexity increases, training error typically keeps falling, while cross-validation error eventually rises as variance grows. Regularisation imposes penalties on model complexity to reduce overfitting. Optimal model complexity balances bias and variance.
Parameters (for example regression slopes) are estimated from training data by optimisation. Hyperparameters (for example number of hidden layers, regularisation strength) are set by engineers and tuned via validation. Tuning can be manual or automated; a grid search systematically searches combinations of hyperparameters. Ceiling analysis evaluates each component in the model-building pipeline to identify the weakest link to improve overall performance.
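A bare-bones grid search can be sketched as follows; the hyperparameter names and the scoring function are toy stand-ins for validation-set performance, not part of the curriculum example:

```python
# Minimal grid-search sketch: try every hyperparameter combination and keep
# the one with the best validation score.
from itertools import product

def grid_search(grid, score):
    """grid: dict mapping hyperparameter name -> list of candidate values."""
    names = list(grid)
    best, best_score = None, float("-inf")
    for combo in product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        s = score(params)
        if s > best_score:
            best, best_score = params, s
    return best, best_score

# Toy scoring function standing in for validation-set performance;
# it peaks at layers = 2 and penalty = 0.1.
toy_score = lambda p: -(p["layers"] - 2) ** 2 - (p["penalty"] - 0.1) ** 2

best, s = grid_search({"layers": [1, 2, 3], "penalty": [0.01, 0.1, 1.0]}, toy_score)
print(best)  # {'layers': 2, 'penalty': 0.1}
```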
1. When the training data contains the ground truth, the most appropriate learning method is:
Use the following information to answer Questions 2 through 6.
While analysing health-care stocks, Ben Stokes devises a model to classify stocks as those that will report earnings above consensus forecasts versus those that will not. Stokes prepares the following confusion matrix using the results of his model.
(Confusion matrix presented by the analyst.)
2. The model's accuracy score is closest to:
3. The model's recall is closest to:
4. The model's precision is closest to:
5. The model's F1 score is closest to:
6. To reduce type I error, Stokes should most appropriately increase the model's:
The steps in a data analysis project include: (1) conceptualization of the modelling task, (2) data collection, (3) data preparation and wrangling, (4) data exploration, and (5) model training.
Data cleansing addresses missing, invalid, inaccurate and non-uniform values and duplicate observations. Data wrangling or preprocessing includes transformation and scaling. Transformation includes extraction, aggregation, filtration, selection and conversion. Scaling converts data to a common unit. Normalization scales variables to [0,1]. Standardization centres variables at mean 0 and scales by standard deviation; standardization is less sensitive to outliers but assumes approximate normal distribution.
Model performance for classification is evaluated using a confusion matrix and metrics such as precision, recall, accuracy and F1 score.
precision (P) = true positives / (false positives + true positives)
recall (R) = true positives / (true positives + false negatives)
accuracy = (true positives + true negatives) / (all observations)
F1 score = (2 × P × R) / (P + R)
The ROC curve plots the tradeoff between false positives and true positives; AUC measures area under ROC. Use RMSE for continuous targets. Model tuning balances bias and variance and selects optimal hyperparameters.
Data exploration includes EDA, feature selection and feature engineering. EDA inspects summary statistics and patterns. Feature selection chooses features that improve out-of-sample predictive power. Feature engineering optimises feature representations for faster and more accurate model training.
Text summary statistics include term frequency and co-occurrence. Word clouds visually highlight frequent tokens. Feature selection tools include document frequency, chi-square and mutual information. Feature engineering techniques for text include number tokenisation, N-grams, named entity recognition (NER) and parts-of-speech (POS) tagging, and tokenisation into BOW or other structured forms.
Model conceptualisation requires collaboration between ML engineers and domain experts to identify relationships and data characteristics. ML identifies patterns in training data so that the model generalises to out-of-sample data. Insufficient training data or inappropriate numbers of features can cause underfitting or overfitting. Model training involves method selection, evaluation and tuning.
Text processing removes HTML tags, punctuation, numbers and extraneous whitespace. Normalisation steps include lowercasing, stop-word removal, stemming and lemmatization. Tokenisation splits text into tokens. N-grams capture sequences when order is important. A document-term matrix organises tokens with documents as rows and tokens as columns; cell values count token occurrences.
1. A "Structured formative analysis" is not a standard term in the curriculum. The five steps are conceptualization; data collection; data preparation and wrangling; data exploration; and model training. (LOS 4.a)
2. B Common values are not addressed by cleansing. Missing, invalid, non-uniform and inaccurate values are cleansed. (LOS 4.b)
3. C Normalization scales variables between 0 and 1. (LOS 4.b)
1. A OHE converts categorical features into binary (dummy) variables suitable for machine processing. POS and NER assign tags to tokens. (LOS 4.e)
2. A To make a BOW concise, eliminate high- and low-frequency words. High-frequency words tend to be stop words or common vocabulary. Word clouds are visual tools; N-grams are used when sequence matters. (LOS 4.e)
3. B MI indicates a token's contribution to a class: tokens appearing in all classes have MI ≈ 0; tokens concentrated in one or a few classes have MI ≈ 1. (LOS 4.e)
1. A Supervised learning is used when the training data contains ground truth (the known target). Unsupervised learning is used when there is no known target. Machine learning is the broad category including both. (LOS 4.f)
The following confusion-matrix values are used by the answer key for Questions 2 through 6.
2. B Accuracy = (TP + TN) / (TP + TN + FP + FN) = 19 / 25 = 0.76. (LOS 4.c)
3. C Recall (R) = TP / (TP + FN) = 12 / 14 = 0.86. (LOS 4.c)
4. C Precision (P) = TP / (TP + FP) = 12 / 16 = 0.75. (LOS 4.c)
5. A F1 score = (2 × P × R) / (P + R) = (2 × 0.75 × 0.86) / (0.75 + 0.86) = 0.80. (LOS 4.c)
6. A To reduce type I error (false positives), increase precision. High precision is valued when the cost of a false positive is large. (LOS 4.c)
You have completed the Quantitative Methods topic section. Take the Topic Quiz to assess understanding. These tests simulate exam-like questions and provide feedback. A score below 70% indicates further study is advisable.