Understand how backtesting fits into risk assessment for an investment strategy, and know the steps and procedures used to backtest a strategy. Be able to interpret the metrics and visuals reported in a backtest and identify common problems that can invalidate backtest results. Understand historical scenario analysis, and contrast the Monte Carlo and historical simulation approaches. Understand the role of inputs and the choices made when constructing a simulation, and the process for interpreting simulation outputs. Finally, learn how to use sensitivity analysis as a complementary risk-assessment technique.
This module provides a summary of backtesting and other quantitative methods used to assess the risk of investment strategies. Backtesting uses historical data to emulate the investment process and to estimate the risk-return properties of a proposed strategy before real capital is committed.
Greater availability of market data, combined with large increases in computing power, has enabled widespread use of backtesting. Today, software tools allow investors to test many combinations of strategy rules, build multi-factor models, and assemble portfolios in simulation prior to live implementation.
The principal objective of backtesting is to assess the risk and return characteristics of an investment strategy by simulating how the strategy would have been executed historically. Backtesting helps to reassure investors that strategies and models are likely to perform as expected and provides a framework for refining the investment process.
Backtesting has long been used in the financial industry. It is particularly natural for systematic and quantitative strategies, but fundamental managers also make extensive use of historical tests. The method is intuitive because it mimics the real investing cycle: formulate a strategy, test it with what would have been known at the time, and assess the results.
Backtesting depends on the implicit assumption that the future will in some respects resemble the past. A strategy that performs well historically is expected, under that assumption, to perform well going forward. For many practical reasons discussed later, however, this assumption often fails: a strategy that appears strong in backtest can fail in live trading. Conversely, a strategy that would have worked well in reality but does not display predictive power in backtesting will rarely be adopted in practice.
1.
The primary objective of backtesting an investment strategy is to help an investor:
A. understand an investment strategy's risk-return trade-off.
B. generate the highest possible income without losing any principal.
C. develop portfolios that emphasize capital appreciation.
2.
Backtesting helps us understand the risk-return tradeoff of an investment strategy by:
A. approximating the real-life investment process.
B. comparing the performance of the strategy to that of a previous strategy.
C. copying or duplicating an existing strategy.
The fundamental steps in backtesting an investment strategy are generally as follows:
The first step is to specify assumptions and investment objectives. For an active investor, common goals include achieving high risk-adjusted returns while controlling downside risk. Additional considerations include portfolio turnover, concentration limits, and the investment time horizon.
The investment universe is the set of assets that the strategy may consider for investment. For exposition, we may use the Russell 3000 Index as an example of a broad U.S. equity universe.
When the investment universe spans multiple currencies, choices arise: whether to express returns in local currency or convert to a single reporting currency. This decision often depends on whether currency exposure is hedged. Returns should be reported relative to an appropriate benchmark that reflects the investment universe (e.g., the S&P/TSX Composite Index for Canadian equity strategies).
Monthly rebalancing is common, though some strategies use shorter or longer intervals. Higher-frequency rebalancing raises transaction costs, which can erode apparent backtested profits. Therefore, backtest performance should clearly state whether transaction costs are included.
Long histories increase confidence in backtest results, but market data are often nonstationary: they contain different regimes (e.g., recessions, expansions, high- or low-inflation periods). Because of regime variation, it is important to analyse discrete time intervals within the sample as well as the aggregate.
Equity strategies commonly use factor-based models. A factor is any variable that helps predict returns or risks, thereby enabling ranking and selection of stocks. Factors are meant to represent distinct sources of systematic risk tied to economic fundamentals.
Factor selection should be guided by both statistical evidence and economic rationale. A factor that performs well historically but lacks theoretical support may be the result of data mining and should be treated with scepticism. Typical practice is to develop a theory, define investing rules, gather the necessary historical data (e.g., earnings yield and returns), and partition the history into training (in-sample) and testing (out-of-sample) periods.
Rather than relying on a single variable, investors generally combine multiple factors, often in a linear model or multi-factor rank system. Two common ways to combine factor portfolios are an equally weighted benchmark (BM) combination and a risk parity (RP) combination.
For exposition, consider typical variables for common investment styles, such as earnings yield for the value style.
Example portfolio construction: short the worst 20% of stocks by a factor and buy the best 20%, rebalancing monthly and ignoring transaction costs for simplicity.
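This quintile long-short construction can be sketched for a single rebalancing date as follows (a minimal sketch assuming a one-period cross-section of factor scores and subsequent returns; the function name and data layout are hypothetical):

```python
import numpy as np

def long_short_quintile(scores, next_returns):
    """Buy the best 20% of stocks by factor score and short the worst 20%,
    equally weighted within each leg, ignoring transaction costs."""
    scores = np.asarray(scores)
    next_returns = np.asarray(next_returns)
    lo, hi = np.quantile(scores, [0.2, 0.8])          # quintile cutoffs
    long_leg = next_returns[scores >= hi].mean()      # best quintile
    short_leg = next_returns[scores <= lo].mean()     # worst quintile
    return long_leg - short_leg                       # hedged portfolio return
```

Repeating this at each monthly rebalance date produces the series of hedged-portfolio returns that the backtest evaluates.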
Equally weighted factor combinations (BM) often perform similarly to more complex weighting schemes. The risk parity (RP) approach adjusts for factor volatilities and inter-factor correlations so that each factor contributes equally to portfolio risk.
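One way to compute RP weights is to solve numerically for weights whose fractional risk contributions are equal, a sketch of which follows (the covariance matrix would be estimated from factor-return history; this is an illustrative implementation, not a prescribed method):

```python
import numpy as np
from scipy.optimize import minimize

def risk_parity_weights(cov):
    """Solve for long-only weights such that each factor's share of total
    portfolio variance is equal (1/n of the total)."""
    n = cov.shape[0]

    def objective(w):
        rc = w * (cov @ w)        # each factor's variance contribution
        rc = rc / rc.sum()        # fractional risk contributions
        return np.sum((rc - 1.0 / n) ** 2)

    w0 = np.full(n, 1.0 / n)      # start from equal weights
    res = minimize(objective, w0, tol=1e-12,
                   bounds=[(0.0, 1.0)] * n,
                   constraints=({"type": "eq",
                                 "fun": lambda w: w.sum() - 1.0},))
    return res.x
```

With a diagonal covariance matrix this reduces to inverse-volatility weighting; correlations in the off-diagonal entries shift the solution away from that simple case.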
Construct the portfolio according to the strategy rules and rebalance at the predetermined frequency. Portfolio construction is constrained by the strategy and by practical limits (geography, market cap, liquidity, shorting limits, etc.).
Rather than a single in-sample/out-of-sample split, many managers use a rolling window or walk-forward procedure. The model is calibrated over a moving window and then applied to the next out-of-sample period; parameters can be updated as new data arrive, while the primary methodology is fixed ex ante to reduce overfitting.
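The walk-forward loop can be sketched as follows (the trailing-mean signal here is a toy stand-in for whatever factor model is calibrated in-sample; array shapes and the quintile rule are assumptions for illustration):

```python
import numpy as np

def walk_forward(returns, window):
    """Walk-forward backtest: calibrate a signal on a trailing in-sample
    window, apply it to the next out-of-sample period, then roll forward."""
    out_of_sample = []
    T = returns.shape[0]                              # returns: T x N matrix
    for t in range(window, T):
        train = returns[t - window:t]                 # in-sample data only
        signal = train.mean(axis=0)                   # toy signal: trailing mean
        top = signal >= np.quantile(signal, 0.8)      # select the top quintile
        out_of_sample.append(returns[t, top].mean())  # next-period OOS return
    return np.array(out_of_sample)
```

Only data strictly before period t enters the calibration at t, which is the property that makes the resulting return series genuinely out-of-sample.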
Example: backtest a trailing 12-month earnings-yield value factor beginning 30 November 2023.
From the resulting series of out-of-sample returns, compute performance statistics such as monthly average return, maximum drawdown, volatility, and Sharpe ratio. Accept the strategy if it performs satisfactorily out-of-sample and makes economic sense.
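These summary statistics can be computed directly from the out-of-sample monthly return series, for example (annualising the Sharpe ratio by the square root of 12 assumes monthly observations):

```python
import numpy as np

def backtest_stats(monthly_returns, rf_monthly=0.0):
    """Monthly mean, volatility, annualised Sharpe ratio, and maximum
    drawdown from a series of monthly returns."""
    r = np.asarray(monthly_returns)
    mean = r.mean()
    vol = r.std(ddof=1)
    sharpe = (mean - rf_monthly) / vol * np.sqrt(12)   # annualised Sharpe
    wealth = np.cumprod(1 + r)                         # growth of $1
    peak = np.maximum.accumulate(wealth)               # running high-water mark
    max_drawdown = ((wealth - peak) / peak).min()      # worst peak-to-trough loss
    return {"mean": mean, "vol": vol,
            "sharpe": sharpe, "max_drawdown": max_drawdown}
```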
Caveats: rolling-window backtesting implicitly assumes past patterns may repeat, and may fail to capture dynamic market behaviour or extreme downside risks.
For multifactor strategies the rolling-window process is executed twice: once to generate returns for each individual factor portfolio, and again to combine the factors into the BM and RP composite portfolios.
Both BM and RP composites are rebalanced monthly and evaluated out-of-sample to determine returns and risk metrics.
Evaluate performance using a set of risk and return metrics and visuals, and check for signs of overfitting or biases.
A common approach is the Fama and French (1993) hedged portfolio: rank the universe by factor score, split into quantiles (e.g., quintiles), and go long the top quantile and short the bottom quantile to form a long-short hedged portfolio. Within quantiles, stocks may be equally weighted or weighted by market-cap. Rolling-window backtesting with monthly rebalancing yields out-of-sample performance series to be evaluated via metrics such as Sharpe, Sortino, and maximum drawdown.
Different backtesting procedures can produce different results. For instance, if the relationship between a factor and future returns is non-linear, one method may indicate significance while another does not. No single backtest method is perfect; ideally, multiple complementary methods point to similar conclusions.
1.
Considering the various factors (or assets) to be combined in an investment portfolio, the "risk parity" portfolio construction technique is least likely to take into account each factor's (or asset's):
A. volatility.
B. correlations.
C. liquidity.
2.
The basic steps in a "rolling window" backtest are most likely to include:
A. making the prediction, computing the variance of the prediction error, and determining the prediction interval.
B. determining the position of the initial random centroids, assigning each observation to its closest cluster, and redefining the clusters.
C. strategy design, historical investment simulation, and analysis of backtesting output.
3.
In the rolling window backtesting methodology, researchers are least likely to:
A. use a walk-forward framework.
B. calibrate trade signals based on the rolling window.
C. identify data, attributes, and priorities.
Common performance and risk metrics used in backtest analysis include:
Value at Risk (VaR) quantifies the minimum expected loss over a specified period for a given confidence level. VaR is sensitive to assumptions about the distribution of returns (e.g., normal vs fat-tailed).
Conditional VaR (CVaR) (also called expected shortfall) is the expected loss conditional on losses exceeding the VaR threshold. At significance level α (e.g., α = 5%), CVaR is the average loss in the worst α% of outcomes. With historical data, CVaR is the average of returns worse than the VaR cutoff.
Maximum drawdown is the largest historical loss from peak to trough. It begins from the highest cumulative return and subtracts the lowest cumulative return occurring after that peak. Maximum drawdown is commonly used by hedge funds and CTAs to characterise downside risk.
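Historical VaR and CVaR can be estimated directly from a sample of returns, as in this sketch (the 5% level is illustrative; losses are reported as positive numbers):

```python
import numpy as np

def historical_var_cvar(returns, alpha=0.05):
    """VaR: loss at the alpha quantile of the return distribution.
    CVaR: average loss over the tail at or beyond that quantile."""
    r = np.asarray(returns)
    cutoff = np.quantile(r, alpha)     # alpha-quantile return
    var = -cutoff                      # sign-flipped to a positive loss
    cvar = -r[r <= cutoff].mean()      # mean of the worst alpha% outcomes
    return var, cvar
```

By construction CVaR is at least as large as VaR at the same level, since it averages over the tail rather than marking its boundary.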
Useful visuals include:
Several common errors can bias backtest results:
Survivorship bias occurs when tests include only entities that survive to the present and exclude defunct firms. This happens because current lists of securities are easy to obtain while reconstructing all securities that once existed is harder. The remedy is to use point-in-time data, which reflect the information that would have been available at each historical date and therefore avoid look-ahead and survivorship bias.
Companies drop out of indices for many reasons (bankruptcy, acquisition, delisting, private buyouts), and new firms enter through IPOs, spin-offs, and corporate restructurings. For example, the Russell 3000 Index started with ≈3,000 securities in 1985; by May 31, 2019, fewer than 400 of the original constituents (≈13%) remained. Backtests that use only surviving stocks can produce results that are materially different from, and sometimes contrary to, the true historical performance.
Example: the low-volatility anomaly may appear to reverse if one uses only current survivors rather than point-in-time constituent lists. This highlights the importance of using point-in-time data in backtests.
Look-ahead bias appears when models use information that would not have been available at the decision date. Reporting lags and revisions are common causes. Point-in-time data are the preferred remedy, but when unavailable, analysts may implement realistic reporting-lag assumptions (e.g., a one-, two-, or three-month lag) to approximate when data would have been known.
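Under such a lag assumption, a two-month reporting lag can be applied by shifting each figure's effective date (the series and dates here are hypothetical):

```python
import pandas as pd

# Hypothetical monthly earnings-yield series, indexed by the fiscal period
# to which each figure refers.
ey = pd.Series([0.050, 0.060, 0.055],
               index=pd.period_range("2023-01", periods=3, freq="M"))

# Apply a two-month reporting lag: each figure becomes usable only two
# months after its period end, approximating when it was actually public.
ey_lagged = ey.copy()
ey_lagged.index = ey_lagged.index + 2
```

A backtest would then join the lagged series to the return data by date, so the January figure cannot influence a decision before March.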
However, incorrect lag assumptions create either residual look-ahead bias (too short a lag) or excessively stale information (too long a lag). Accounting restatements and revisions of macro data can also lead to differences between vendor-supplied current data and historical data that was available at the time.
Data snooping (or p-hacking) occurs when many models are tested and the best-performing one is selected without considering the multiple-testing problem. Selecting a model solely because it produced a large t-statistic or small p-value on historical data can produce false positives.
Remedies include using more stringent significance thresholds (e.g., requiring t-statistics above 3.0) and employing cross-validation. Cross-validation divides the data into training and validation sets and checks that model performance generalises to unseen data. Rolling-window backtesting is a form of cross-validation (without random splitting), where in-sample periods train the model and out-of-sample periods validate it.
1.
In assessing backtesting results, an analyst is least likely to take into account:
A. traditional performance measurements such as Sharpe ratio and Sortino ratio.
B. value at risk, conditional value-at-risk, and maximum drawdown.
C. transcription perturbations, synthesis, codon optimality, and translation elongation.
2.
Issues in backtesting to which analysts should pay particular attention are least likely to include:
A. survivorship bias.
B. look-ahead bias.
C. hindsight bias.
Historical scenario analysis (historical stress testing) examines strategy performance across different regimes and structural breaks. It assesses how strategies would have performed in distinct historical environments.
Two typical regime distinctions are:
Other regime examples include geopolitical states (trade agreement vs no trade agreement), credit-cycle stages, or inflationary vs disinflationary periods. Strategy risk and return characteristics frequently vary by regime. For example, a risk parity (RP) factor portfolio may be more resilient during recessions than an equally weighted benchmark (BM) portfolio. Distributional characteristics such as skewness and kurtosis can also differ across regimes: BM may show negative skewness and fat tails in both expansions and recessions but lower average returns in recessions; RP may show lower kurtosis and volatility.
Backtesting assumes the future will resemble the past. Many asset-allocation methods further assume returns follow a multivariate normal distribution. In practice, returns often show negative skewness (more frequent negative surprises) and excess kurtosis (fat tails), which make normal assumptions inadequate. Conventional mean-variance optimisation can therefore give misleading results, and rolling-window backtests may not capture unprecedented future events. To address these limitations, analysts supplement backtesting with scenario analysis and simulation.
Simulation generalises risk assessment beyond the single chronological past in backtests. Two primary classes are historical simulation and Monte Carlo simulation.
In a historical simulation, return observations are sampled at random (with or without replacement) from a long historical record, disregarding the original order. This approach leverages actual past outcomes but is limited to what has already occurred. Financial institutions commonly use historical simulation for risk evaluation.
In contrast, Monte Carlo simulation specifies a statistical distribution for each relevant variable and draws random observations from a calibrated multivariate distribution. Monte Carlo allows for more flexible distributional assumptions-non-normality, tail dependence, fat tails-by choosing appropriate parametric forms and calibrating parameters to historical data. Monte Carlo is more flexible but may be computationally intensive and requires careful distribution fitting.
Historical simulation shares advantages and limitations with rolling-window backtesting: both rely on the past to characterise future randomness. Historical simulation may be performed with or without replacement; sampling with replacement is called bootstrapping, which is useful when the required number of simulations is large relative to the historical sample size.
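Bootstrapped historical simulation can be sketched as follows (the horizon, seed, and function name are arbitrary choices for illustration):

```python
import numpy as np

def bootstrap_paths(returns, horizon, n_paths, seed=42):
    """Historical simulation with replacement: resample past returns at
    random, disregarding their original order, to build simulated
    multi-period cumulative-return paths."""
    rng = np.random.default_rng(seed)
    r = np.asarray(returns)
    draws = rng.choice(r, size=(n_paths, horizon), replace=True)
    return np.prod(1 + draws, axis=1) - 1    # cumulative return per path
```

Because each draw is taken with replacement, the number of simulated paths can far exceed the length of the historical record, which is the motivation for bootstrapping noted above.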
Professor's note: the term "bootstrapping" in the simulation sampling sense is unrelated to interest-rate bootstrapping used in fixed-income analysis.
Monte Carlo simulation requires selecting an appropriate statistical distribution for each key decision variable. Calibration involves estimating means, variances, skewness, kurtosis, and tail-dependence parameters from historical data. When simulating multiple correlated assets or factors, specifying a proper multivariate distribution is critical to capture correlations and tail co-movements; modelling each asset independently is inadequate when dependencies exist.
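A minimal calibration-and-draw sketch follows, assuming multivariate normality purely for simplicity (the text notes that richer, fat-tailed distributions are often needed; the point here is that the factors are drawn jointly so correlations are preserved):

```python
import numpy as np

def monte_carlo_draws(hist_returns, n_sims, seed=0):
    """Calibrate a mean vector and covariance matrix to historical factor
    returns, then draw joint simulated observations from the fitted
    multivariate normal so cross-factor correlation is preserved."""
    rng = np.random.default_rng(seed)
    mu = hist_returns.mean(axis=0)                 # calibrated means
    cov = np.cov(hist_returns, rowvar=False)       # calibrated covariance
    return rng.multivariate_normal(mu, cov, size=n_sims)
```

Drawing each factor independently instead would discard the correlation structure, which is exactly the failure the paragraph above warns against.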
Model complexity involves a trade-off: highly parameterised models may fit historical data well but produce large estimation error when historical data are insufficient to estimate many parameters. Simpler models may be misspecified but suffer less estimation noise.
Interpret simulation output with standard performance and risk measures: Sharpe ratio, downside-risk metrics (CVaR, maximum drawdown), and comparative distributions of benchmark versus strategy returns. Historical and Monte Carlo simulations are complementary: each models randomness differently and can validate backtest results. However, varying distributional assumptions, parameter estimates, and model specifications will typically yield different measures of Sharpe ratio, CVaR, and tail-risk estimates.
Sensitivity analysis studies how the target variable (e.g., portfolio return) responds to changes in input variables or modelling assumptions. Because Monte Carlo outcomes depend strongly on assumed distributions, sensitivity analysis tests robustness by using alternative distributional assumptions and re-running simulations.
For example, re-fit factor-return data to a multivariate skewed Student's t-distribution (which allows fat tails and skewness) and run 1,000 (or more) simulations. Compare the resulting metrics (Sharpe ratio, CVaR, downside-risk measures) with those from the base simulation and from the historical backtest. If results differ materially, the strategy's perceived risk and return are sensitive to distributional assumptions, indicating greater model risk.
Note: using more flexible distributions (e.g., multivariate skewed t) helps capture skewness and excess kurtosis but increases the number of parameters to estimate, thereby raising the potential for estimation error.
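A sketch of such a sensitivity check follows, using scipy's symmetric multivariate t as the alternative distribution (scipy has no built-in multivariate *skewed* t, so this captures fat tails but not skewness; all parameter values are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_t

mu = np.array([0.008, 0.004])                 # illustrative factor means
cov = np.array([[0.0025, 0.0008],
                [0.0008, 0.0016]])            # illustrative covariance
w = np.array([0.5, 0.5])                      # equal-weight portfolio
df = 4                                        # low df => fat tails

def cvar(r, alpha=0.05):
    cut = np.quantile(r, alpha)
    return -r[r <= cut].mean()                # average loss in the worst 5%

rng = np.random.default_rng(1)

# Base case: multivariate normal draws
normal_draws = rng.multivariate_normal(mu, cov, size=100_000)

# Alternative: multivariate t with the shape matrix scaled by (df-2)/df so
# the t draws have the same covariance as the normal draws
t_dist = multivariate_t(loc=mu, shape=cov * (df - 2) / df, df=df, seed=2)
t_draws = t_dist.rvs(size=100_000)

cvar_normal = cvar(normal_draws @ w)
cvar_t = cvar(t_draws @ w)
```

With the covariance held fixed, the fat-tailed alternative typically reports a larger CVaR; a material gap between the two figures is the signal of model risk the paragraph above describes.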
1.
It would be least accurate to state that historical scenario analysis:
A. is an overall examination of the complete historical record of an asset's average past performance.
B. examines the efficacy of a strategy in discrete historical environments, such as during recessions or periods of high inflation.
C. can help investors understand the performance of an investment strategy in different structural regimes.
2.
Standard rolling-window backtesting is most likely to fail to account for downside asset returns due to:
A. negative skewness, excess kurtosis, and tail dependence.
B. positive skewness, fat tails, and clustering of extreme events.
C. negative skewness, platykurtic distribution, and tail dependence.
3.
Unlike historical simulation, under the Monte Carlo approach:
A. each key variable is assigned a statistical distribution.
B. repeated samples are drawn from a set of time-series data.
C. the data is assumed to be stationary.
4.
In historical simulation, "bootstrapping" is most accurately described as:
A. random draws with replacement.
B. forming a company with little capital.
C. constructing a zero-coupon yield curve.
5.
Compared to a conventional Monte Carlo simulation, the use of a multivariate skewed Student's t-distribution is more likely to:
A. account for skewness in the data set.
B. require the estimation of fewer parameters.
C. benefit from smaller estimation errors.
The primary goal of backtesting is to assess the risk and return of an investment strategy by simulating the investment process using historical data. Backtesting evaluates whether a strategy would have produced excess returns historically and supports optimisation of the investment process.
The three steps in backtesting an investment strategy are (1) strategy design, (2) historical investment simulation, and (3) analysis of backtesting output.
In rolling-window backtesting, a walk-forward process is used: calibrate factors or trade signals on a moving window, rebalance periodically, and evaluate out-of-sample performance to approximate live investing.
Backtest outputs include return metrics (average return), risk measures (volatility, downside risk), and derived performance indicators such as the Sharpe ratio, Sortino ratio, and maximum drawdown. Visuals commonly include return-distribution plots and cumulative-return charts on a log scale.
Typical problems in backtesting include survivorship bias, look-ahead bias, and data snooping.
Cross-validation (training and testing on separate data segments) and testing strategies across different geographic markets can add confidence about robustness.
Scenario analysis examines strategy performance in distinct regimes (recession vs expansion; high vs low volatility). Scenario analysis and simulation can reveal risks not captured by methods assuming multivariate normality, especially when returns exhibit skewness and excess kurtosis.
Monte Carlo and historical simulation address skewness, excess kurtosis, and tail dependence differently. Historical simulation samples observations from the past (each observation equally likely). Monte Carlo assigns parametric distributions to key variables and draws from a calibrated multivariate distribution, a procedure that is nondeterministic and flexible in modeling distributional shapes and dependencies.
Historical simulation is straightforward but limited to realised past outcomes; bootstrapping (sampling with replacement) is often used when many trials are required relative to available data. Monte Carlo requires careful distribution-fitting and calibration; multivariate distributions must be used when asset returns are correlated.
Sensitivity analysis examines how changes in input assumptions affect outcomes. Fitting factor-return data to distributions that account for skewness and kurtosis (e.g., multivariate skewed Student's t) and rerunning simulations helps quantify the risk from model misspecification. However, more flexible distributions require estimating more parameters and therefore can increase estimation error.
1. A - Backtesting's main objective is to help understand the risk-return trade-off of a strategy by simulating the real-life investment process (LOS 38.a).
2. A - Backtesting helps understand the risk-return trade-off by simulating the real-life investment process (LOS 38.a).
1. C - Risk parity accounts for factor volatilities and correlations; liquidity is not the primary input of the risk-parity calculation (LOS 38.b).
2. C - The fundamental steps in rolling-window backtesting are: strategy design; historical investment simulation; analysis of backtesting output.
3. C - Rolling-window backtesting employs a walk-forward framework, calibrates factors/trade signals on the window, rebalances periodically, and tracks performance over time (LOS 38.b).
1. C - Analysts should consider traditional performance metrics (Sharpe, Sortino) and tail-risk measures (VaR, CVaR, maximum drawdown) when evaluating backtests (LOS 38.c).
2. C - Analysts should pay particular attention to survivorship bias and look-ahead bias in backtests; hindsight bias is a separate cognitive bias.
1. A - Historical scenario analysis examines discrete regimes (high/low inflation, recessions/expansions) and helps investors understand strategy performance across regimes (LOS 38.e).
2. A - Rolling-window backtesting may fail to account for negative skewness, excess kurtosis (fat tails), and tail dependence, which matter for downside risk (LOS 38.f).
3. A - Monte Carlo assigns statistical distributions to key variables and draws from them; historical simulation samples actual historical observations (LOS 38.f).
4. A - Bootstrapping in historical simulation refers to random draws with replacement (LOS 38.g).
5. A - A multivariate skewed Student's t-distribution can account for skewness and excess kurtosis but requires estimating more parameters and thus may suffer larger estimation errors (LOS 38.h).