Volatility Forecasts Embedded in the Prices of Crude-Oil Options

This paper evaluates and compares the ability of alternative option-implied volatility measures to forecast the monthly realized volatility of crude-oil returns. We find that a corridor implied volatility measure that aggregates information from a narrow range of option contracts consistently outperforms forecasts obtained by the popular Black-Scholes and model-free volatility expectations, as well as those generated by a high-frequency realized volatility model. In particular, this measure ranks favorably in all regression-based tests, delivers the lowest forecast errors under either symmetric or asymmetric loss functions, and generates economically significant gains in volatility timing exercises. Our results also show that the CBOE's "oil-VIX" (OVX) index performs poorly, as it routinely produces the least accurate forecasts.


Introduction
In economic terms, crude-oil is the most important traded commodity. Unsurprisingly, a wide range of economic agents, from individual investors to policy makers, closely monitor its price and routinely attempt to make predictions about the future. Unlike standard financial assets, however, one salient feature of crude-oil prices is that they can experience dramatic shifts for reasons that are largely unrelated to global macroeconomic conditions, such as OPEC policy however the former appear to be more accurate than the latter in individual forecast comparisons.
While the early literature has examined the information content of Black-Scholes implied volatilities calculated from different strikes and maturities (Trippi (1977); Chiras & Manaster (1978); Beckers (1981); Gemmill (1986); Fung, Lie, & Moreno (1990)), the consensus is that the simple ATMIV of a contract expiring as close as possible to the forecast horizon provides the most reliable results. More recently, ATMIV forecasts have been compared to the so-called model-free implied volatility (MFIV), which has a number of appealing theoretical properties. The empirical evidence, however, is inconclusive. Jiang & Tian (2005) study the S&P500 index and find MFIV to be more informative than ATMIV, while the opposite conclusion is reached by Andersen & Bondarenko (2007) for the same underlying asset. Taylor, Yadav, & Zhang (2010) examine individual U.S. stocks and report that ATMIV provides more accurate volatility forecasts than its model-free counterpart. Finally, in their study of three energy markets, Prokopczuk & Simen (2014) find that MFIV is more informative than ATMIV in predicting either crude-oil, heating oil or natural gas volatility. They also find that a simple adjustment for volatility risk-premia enhances the forecast performance of all option-implied measures.
When the task at hand is to predict future return variation, MFIV is not without shortcomings, for two main reasons. First, some options included in the calculation of this measure (such as deep out-of-the-money puts for the case of equities) tend to be very sensitive to volatility risk-premia fluctuations. This can introduce substantial variation in the option-implied measures that is largely unrelated to the forecast target, i.e. integrated variance. Second, calculating MFIV requires that market prices of options with extreme strikes are observed. In practice, this means that either some extrapolation scheme must be implemented, or that options beyond a certain strike range should be excluded from the calculation.
The most popular estimates of MFIV measures are the volatility indices produced and published by CBOE, such as the VIX index for the case of the S&P500 and the OVX for the case of crude-oil. CBOE's implementation algorithm, which is common for both the VIX and OVX indices, adopts a liquidity-based cut-off point that determines the range of options to be included in the MFIV measure calculation. The choice of this algorithm by CBOE has recently attracted some criticism. Andersen & Bondarenko (2007) were the first to note that the VIX is in fact an ex ante measure of corridor integrated variance, rather than integrated variance. In two comprehensive empirical studies, Andersen, Bondarenko, & Gonzalez-Perez (2015) and Andersen, Fusari, & Todorov (2017) use high-frequency option data and report that the VIX calculation method introduces systematic biases to the extracted measure, including artificial jumps, which become particularly pronounced during periods of market stress. From a different viewpoint, Griffin & Shams (2018) put forth evidence pointing towards market manipulation of the VIX futures market. In essence, this is facilitated by CBOE's adopted cut-off algorithm, as speculators can temporarily boost the liquidity of deep out-of-the-money S&P 500 options, increasing the level of the VIX just before the settlement price for VIX futures is determined. Given that the same methodology is used to calculate both the VIX and OVX indices, all of the above raise reasonable concerns regarding the informational efficiency of OVX-based forecasts. In addition, since the popularity of volatility indices has recently become widespread in the finance industry, a comparison between the OVX and other option-implied alternatives appears to be long overdue.
Our work builds on the study of Andersen & Bondarenko (2007), who propose an alternative measure of ex ante risk-neutral expectation of volatility, the so-called corridor implied volatility (CIV). Similar to the MFIV, and unlike the Black-Scholes model, this measure aggregates volatility information from several options and does not depend on a particular option pricing model. However, the extracted measure is not a risk-neutral expectation of integrated variance but of corridor integrated variance, i.e. return variation accumulated only when the asset price lies within a corridor of two pre-specified price levels. The advantage of this approach is that one can select a corridor width that, while containing a wide range of option prices, excludes those with extreme strikes, avoiding both price extrapolations and liquidity-driven cut-off points that may influence the reliability of the extracted measure. Moreover, since each corridor corresponds to a different range of option data, one can explore CIV measures that may be less sensitive to volatility risk-premia fluctuations.
The contribution of this paper is threefold. First, we examine the forecast performance of CIV measures vis-à-vis a collection of competing alternatives, including HAR, MFIV, OVX and ATMIV forecasts, for the case of crude-oil. Our paper builds on, and significantly expands, the work done by Prokopczuk & Simen (2014), who compare the performance of MFIV and ATMIV forecasts. Besides considering additional option-implied measures, our study also includes models that utilize high-frequency return information, while forecasts are ranked using both statistical and economic criteria. Second, we evaluate volatility forecasts for the case of the United States Oil Fund (USO), an ETF that attempts to track the price of West Texas Intermediate light sweet crude-oil. Considering this alternative, yet closely related, target quantity enables us to further scrutinize our earlier findings. Moreover, to the best of our knowledge, we are the first to construct and evaluate option-implied forecasts for this dataset. Third, we provide the first empirical evaluation of the OVX index, used in the forecasting study of Haugom, Langeland, Molnár, & Westgaard (2014), against other option-implied alternatives. We do so for both crude-oil forecasts, which are more important in applied practice, and USO volatility forecasts, which enable more reliable methodological comparisons. This is because USO options constitute the basis for the OVX calculation.
Our empirical results provide insights on a number of issues. We find that a particular CIV measure that uses a relatively narrow range of option prices consistently ranks favorably against all other competing measures using a variety of statistical and economic criteria. In particular, model forecasts that utilize this measure achieve the highest $R^2$ in Mincer-Zarnowitz regressions, remain significant in encompassing regression tests, and deliver the most accurate forecasts under both the symmetric and asymmetric loss functions we consider. Moreover, volatility timing exercises show that utilizing this measure results in significant economic gains. The superior performance of this narrow CIV measure is evident in our full crude-oil dataset, our "post-financialization"⁶ sample period (2007-2016), as well as our USO ETF series (2007-2016).
With respect to the performance of the OVX, we find clear evidence that it provides problematic crude-oil volatility forecasts as it is routinely outperformed by other option-based alternatives. When the target quantity is USO volatility, the OVX generates improved, but far from optimal, forecasts. Our complementary analysis of the algorithm used to calculate the OVX indicates that the currently adopted cut-off rule introduces noise which hinders its forecast accuracy. Finally, in contrast to Prokopczuk & Simen (2014), we find that ATMIV is more informative about future crude-oil volatility than the MFIV measure,⁷ although the opposite is true for the case of USO volatility forecasts.
The structure of the paper is as follows. Section 2 discusses various measures of volatility that we use in this study. Section 3 describes the dataset and the details of our methodology. The empirical results for our crude-oil dataset are presented in Section 4 while those corresponding to the USO ETF are presented in Section 5. Our robustness checks can be found in Section 6. Section 7 concludes.

Volatility Measures
In this section we describe the alternative volatility measures we use to construct forecasts. However, before doing so, we state the assumptions we make about asset price dynamics and outline the relevant theory on which all our volatility measures are based.

The Dynamics of Futures Prices
Assume that over the period $t \in [0, \bar{T}]$ investors can continuously trade in a frictionless and arbitrage-free market. In the filtered probability space $(\Omega, \mathcal{F}, P; \{\mathcal{F}_t\}_{t \in [0,\bar{T}]})$, the futures price of a contract expiring at time $0 < T < \bar{T}$, denoted as $F_t$, evolves according to the following general diffusion,

$$\frac{dF_t}{F_t} = \mu_t \, dt + \sigma_t \, dW_t, \tag{1}$$

where $W_t$ is a Wiener process. The drift $\mu_t$ and volatility $\sigma_t$ can change across time according to the filtration $\mathcal{F}_t$. The constraint imposed on the futures price dynamics is that the stochastic process is a semimartingale without jumps in prices.⁸ It is worth noting that the only restriction imposed on the volatility dynamics is that $\sigma_t$ is a strictly positive (càdlàg) stochastic process, so volatility can exhibit jumps.
6 With the introduction of new securities linked to commodities, as well as with regulatory changes related to trading, the popularity of investing in commodity products has had a significant impact on their price dynamics. See for example Henderson, Pearson, & Wang (2014) and Singleton (2014) among others.
7 The same result is also reported in Andersen & Bondarenko (2007) for the case of the S&P 500, and Taylor et al. (2010), for the case of individual stocks.
8 Price jumps are excluded from this representation because the option-implied expectations, discussed later in the paper, will be biased when prices are subject to discontinuous movements. Jiang & Tian (2005) and Carr & Wu (2009) note that the bias will not be sizeable for small or moderate jumps, although large jumps could have a significant impact, as argued by Carr, Lee, & Wu (2012).
According to these price dynamics, the total variation of logarithmic futures price changes from $t = 0$ to $T$ is given by the integrated variance (IVAR), defined as,

$$\mathrm{IVAR}_{0,T} = \int_0^T \sigma_t^2 \, dt. \tag{2}$$

Although total return variation is the forecast target of this paper, we also utilize the concept of corridor integrated variance (CIVAR), i.e., variance accumulated only when the underlying asset ($F_t$ in our case) lies between two "barrier" price levels $B_1$ and $B_2$. Defining the indicator function $I_t$ that takes the value of 1 if $B_1 \le F_t \le B_2$ and 0 otherwise, CIVAR is given by the following expression,

$$\mathrm{CIVAR}_{0,T}(B_1, B_2) = \int_0^T \sigma_t^2 \, I_t \, dt. \tag{3}$$

Obviously, when the corridor defined by $B_1$ and $B_2$ is sufficiently wide to contain all levels that the futures price can reach with positive probability under $P$, CIVAR will be equal to IVAR. In other words, IVAR can be seen as a special case of CIVAR, since for $B_1 = 0$ and $B_2 = \infty$ the definitions of the two measures coincide.

Volatility Expectations From Option Prices
Option markets may be informative about future volatility, since observed prices can be utilized to extract forward-looking expectations of the aforementioned volatility measures. In particular, suppose European plain vanilla options, written on an underlying futures contract $F_t$ and expiring at time $t = T$, trade for a continuum of strike prices $K$. As shown in Carr & Madan (1998), Demeterfi et al. (1999) and Britten-Jones & Neuberger (2000), ex ante risk-neutral expectations of the future integrated variance can be obtained by calculating the value of a static position in a portfolio of European options. Specifically, the expected integrated variance from time $t = 0$ to time $T$, under the risk-neutral measure $Q$, can be calculated from:

$$\mathrm{MFIV}_{0,T} = E_0^Q\left[\mathrm{IVAR}_{0,T}\right] = 2 \int_0^{\infty} \frac{M_{0,T}(K)}{K^2} \, dK, \tag{4}$$

where $M_{0,T}(K)$ is the price of a European out-of-the-money option (i.e., either put or call), with strike price $K$ and maturity $T$. Since this expectation does not depend on a particular option pricing model (such as the Black-Scholes model for example), it is referred to as Model Free Implied Variance (MFIV).
Similarly, as shown in Carr & Madan (1998) and Andersen & Bondarenko (2007), Corridor Implied Variance (CIV), i.e. the risk-neutral expectation of future integrated corridor variance, can be obtained by calculating the value of a static position in European options with strikes ranging from $B_1$ to $B_2$,

$$\mathrm{CIV}_{0,T}(B_1, B_2) = E_0^Q\left[\mathrm{CIVAR}_{0,T}(B_1, B_2)\right] = 2 \int_{B_1}^{B_2} \frac{M_{0,T}(K)}{K^2} \, dK. \tag{5}$$
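As a purely illustrative check of how such strike integrals behave, the sketch below evaluates 2∫M(K)/K² dK numerically over a wide strike range and over a narrow corridor. It rests on assumptions that are not part of the paper: option prices come from a flat-volatility Black (1976) model with a zero interest rate, the infinite strike range is truncated at 0.2 and 5 times the futures price, and the function names `black76_otm` and `implied_variance` are hypothetical. Under these assumptions the wide-range integral should recover σ²T, while the corridor value is strictly smaller.

```python
import math
from statistics import NormalDist

N = NormalDist()

def black76_otm(F, K, sigma, T):
    """Undiscounted Black (1976) price of the out-of-the-money option at strike K."""
    s = sigma * math.sqrt(T)
    d1 = (math.log(F / K) + 0.5 * s * s) / s
    d2 = d1 - s
    call = F * N.cdf(d1) - K * N.cdf(d2)
    put = K * N.cdf(-d2) - F * N.cdf(-d1)
    return put if K < F else call  # put below the futures price, call at or above it

def implied_variance(price_fn, lo, hi, n=20000):
    """Trapezoid approximation of 2 * integral from lo to hi of M(K) / K^2 dK."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        K = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * price_fn(K) / (K * K)
    return 2.0 * h * total

# Assumed inputs: futures price 50, flat volatility 35%, 22 trading days to expiry.
F0, sigma, T = 50.0, 0.35, 22 / 252
otm = lambda K: black76_otm(F0, K, sigma, T)

mfiv = implied_variance(otm, 0.2 * F0, 5.0 * F0)   # wide range standing in for (0, inf)
civ = implied_variance(otm, 0.95 * F0, 1.05 * F0)  # narrow corridor around the futures price
```

With real data the price function would be replaced by interpolated or fitted market prices, and the corridor bounds by the chosen barrier levels.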

CBOE Crude Oil Volatility Index (OVX)
The last option-based measure we consider is the Crude Oil ETF Volatility Index (OVX), also known as the "Oil VIX". The OVX, which is produced and disseminated by the CBOE, intends to measure the market's (risk-neutral) expectation of crude-oil price volatility over the next month. It is defined as the square root of MFIV, given in Equation 4. The data underlying the OVX computation are options written on the United States Oil Fund (USO), an ETF that is designed to track the price of West Texas Intermediate light sweet crude-oil. For the construction of the OVX the CBOE adopts the same methodology as the one employed for the popular S&P 500 VIX index. Notably, CBOE applies a liquidity criterion to determine the range of option contracts included in the calculation of the index. In particular, moving from high (low) strike, out-of-the-money, put (call) options towards those with lower (higher) strikes, once two contracts with consecutive strike prices have zero bid prices a cut-off point is applied and no further contracts are considered. Therefore, both the VIX and the OVX are, in reality, CIV measures, with a corridor width determined by market liquidity.
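The cut-off rule described above can be sketched in a few lines. This is a simplified illustration rather than CBOE's production algorithm: it assumes a single bid per strike for the relevant out-of-the-money option, it drops isolated zero-bid strikes (as in CBOE's published VIX methodology), and the function name `apply_cutoff` and the sample quotes are hypothetical.

```python
def apply_cutoff(strikes, bids, forward):
    """Return the strikes kept after the two-consecutive-zero-bid rule on each side."""
    def scan(candidates):
        kept, zero_run = [], 0
        for strike, bid in candidates:
            if bid <= 0.0:
                zero_run += 1
                if zero_run == 2:   # second consecutive zero bid: stop scanning
                    break
                continue            # an isolated zero bid is skipped, not kept
            zero_run = 0
            kept.append(strike)
        return kept

    quotes = sorted(zip(strikes, bids))
    puts = [(k, b) for k, b in reversed(quotes) if k < forward]   # move towards lower strikes
    calls = [(k, b) for k, b in quotes if k >= forward]           # move towards higher strikes
    return sorted(scan(puts) + scan(calls))

# Hypothetical quotes: the put side hits two zero bids at 40 and 35,
# so the 30 strike is discarded even though its bid is positive.
strikes = [30, 35, 40, 45, 50, 55, 60, 65, 70]
bids = [0.05, 0.0, 0.0, 0.10, 1.20, 1.00, 0.30, 0.0, 0.02]
kept = apply_cutoff(strikes, bids, forward=51.0)
corridor = (kept[0], kept[-1])
```

The lowest and highest retained strikes play the role of the corridor barriers, which is why the resulting index behaves like a CIV measure whose width moves with market liquidity.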

Realized Variance Measures
While our forecast target, i.e., integrated variance, is inherently latent, accurate ex post IVAR estimates can be obtained using high-frequency price observations. In particular, Barndorff-Nielsen & Shephard (2002), Meddahi (2002) and Andersen, Bollerslev, Diebold, & Labys (2003) show that summing squared intraday returns leads to an estimator which converges in probability to IVAR and is referred to as realized variance (RV). To calculate RV, suppose on day $t$ there are $M + 1$ equally spaced intraday price observations at times $t_i$, $i = 0, \ldots, M$. We will also assume the interval between the observations is $1/M$, i.e., the length of a day is standardized to unity. If the log price at time $t_i$ is $p_{t_i}$, then the intraday return between times $t_{i-1}$ and $t_i$ is $r_{t_i} = p_{t_i} - p_{t_{i-1}}$. It is then straightforward to calculate RV on day $t$ as,

$$RV_t = \sum_{i=1}^{M} r_{t_i}^2.$$

Theoretically, RV becomes more accurate as $M$ increases, i.e., as more intraday prices are observed over shorter and shorter intervals. However, if prices are observed over very short intervals, RV will be contaminated by microstructure noise, which causes an upward bias (Andersen, Bollerslev, Diebold, & Labys, 2000). A common remedy is to use prices observed over a relatively coarse set of intraday times so that the effects of microstructure noise are mitigated. Typically, prices recorded over 5-min intervals are used, despite transactions occurring at a much higher frequency. Although using a coarse intraday sampling interval solves the microstructure noise problem, it results in information being discarded. In order to recover some of this information, ensuring our RVs are estimated with as much accuracy as possible, and to continue avoiding microstructure noise by using a coarse sampling interval, we use sub-sampled RVs (Zhang, Mykland, & Aït-Sahalia, 2005) which are calculated as follows,

$$RV_t^{(\Delta)} = \frac{1}{\Delta} \sum_{i=1}^{M-\Delta+1} r_{t_i,\Delta}^2, \tag{6}$$

where $r_{t_i,\Delta} = p_{t_{i-1+\Delta}} - p_{t_{i-1}}$ is the $\Delta$-period intraday return between times $t_{i-1}$ and $t_{i-1+\Delta}$, and $M$ and $\Delta$ are chosen such that $M/\Delta$ is an integer.
Later in the paper we will require an accurate proxy of CIVAR. For this purpose, we introduce the concept of corridor realized variance (CRV). With sub-sampling, CRV is calculated as follows,

$$CRV_t^{(\Delta)}(B_1, B_2) = \frac{1}{\Delta} \sum_{i=1}^{M-\Delta+1} \tilde{r}_{t_i,\Delta}^2, \tag{7}$$

where $\tilde{r}_{t_i,\Delta}$ is the $\Delta$-period intraday corridor return between times $t_{i-1}$ and $t_{i-1+\Delta}$, computed from prices truncated at the barriers. That is, if either $p_{t_{i-1+\Delta}}$ or $p_{t_{i-1}}$ is above $B_2$ then it is set equal to $B_2$, whereas if either is below $B_1$ it is set equal to $B_1$. Again, $M$ and $\Delta$ are chosen such that $M/\Delta$ is an integer.
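A minimal sketch of these two estimators is given below. It assumes one common way of writing the sub-sampled estimator, namely the sum of all overlapping ∆-period squared returns divided by ∆, which may differ in detail from the exact expression used in the paper. It also assumes that the barriers are supplied in the same units as the log prices. The function names are hypothetical.

```python
import numpy as np

def subsampled_rv(log_prices, delta):
    """Sub-sampled realized variance from M + 1 equally spaced log prices."""
    p = np.asarray(log_prices, dtype=float)
    r = p[delta:] - p[:-delta]            # all M - delta + 1 overlapping delta-period returns
    return float(np.sum(r ** 2) / delta)  # average over the delta offset grids

def subsampled_crv(log_prices, delta, b1, b2):
    """Corridor version: prices are truncated at the barriers before differencing."""
    p = np.clip(np.asarray(log_prices, dtype=float), b1, b2)
    r = p[delta:] - p[:-delta]
    return float(np.sum(r ** 2) / delta)

# Toy path of four log prices; with delta = 1 the estimator is the plain sum of squares.
path = [0.00, 0.01, 0.02, 0.03]
rv = subsampled_rv(path, 1)
crv = subsampled_crv(path, 1, 0.01, 0.02)   # only the move inside [0.01, 0.02] counts
```

For 30-second prices, setting `delta=10` corresponds to 5-minute returns sampled on ten offset grids.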

Data and Sample Construction
Our main dataset consists of options and high-frequency prices for the WTI Light Sweet Crude Oil futures, which currently trade at CME, the world's most liquid commodity derivatives market. Our options and futures datasets start in January 1996 and end in April 2016. Our secondary dataset contains options and high-frequency prices of the USO ETF from May 2007 to December 2016.

Crude-oil Option Data
Option contracts are written on futures contracts for physical delivery of light sweet crude-oil. In particular, the underlying asset is the futures contract whose delivery date is three business days after the expiration of the option. These contracts have American-style exercise and are settled in cash. Our option data consist of daily settlement prices, which are recorded at 14:30 ET each day. We make several adjustments to the raw data before we proceed with the estimation of the option-implied volatility (OIV) measures. As the latter require price data on European options rather than American ones for their calculation, we attempt to alleviate this problem in two ways. First, we eliminate all in-the-money options and only keep out-of-the-money options, for which the early exercise premium is significantly lower and, for the case of deep out-of-the-money options, almost quantitatively negligible. Second, we estimate the early exercise premium of each option using the Barone-Adesi & Whaley (1987) American option pricing formula and subsequently calculate the price of their European-style counterparts.
Finally, in order to guard against recording errors and other market microstructure effects, we eliminate options with a price less than $0.01 and filter all call/put prices that violate standard arbitrage bounds. An overview of our final option data sample is provided in Table 1. It is noteworthy that the number of traded option contracts has increased substantially over the last few years.
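The filters above can be illustrated with a short sketch. The specific bounds are an assumption, since the text only refers to "standard arbitrage bounds": here a European call on a futures contract must satisfy max(0, e^{-rT}(F - K)) ≤ C ≤ e^{-rT}F, and a put must satisfy max(0, e^{-rT}(K - F)) ≤ P ≤ e^{-rT}K. The function name `passes_filters` and the sample inputs are hypothetical, and the early exercise adjustment is not shown.

```python
import math

def passes_filters(kind, price, strike, forward, r, T, min_price=0.01):
    """True if an option quote survives the moneyness, minimum price and bound checks."""
    if price < min_price:                 # guard against recording errors
        return False
    disc = math.exp(-r * T)
    if kind == "call":
        lower, upper = max(0.0, disc * (forward - strike)), disc * forward
        otm = strike >= forward
    else:
        lower, upper = max(0.0, disc * (strike - forward)), disc * strike
        otm = strike <= forward
    return otm and lower <= price <= upper

# Hypothetical quotes on a futures price of 50 with r = 2% and T = 0.1 years.
checks = [
    passes_filters("call", 0.50, 55.0, 50.0, 0.02, 0.1),   # kept
    passes_filters("put", 0.005, 45.0, 50.0, 0.02, 0.1),   # dropped: price below 0.01
    passes_filters("call", 60.0, 55.0, 50.0, 0.02, 0.1),   # dropped: above the upper bound
    passes_filters("put", 2.00, 55.0, 50.0, 0.02, 0.1),    # dropped: in-the-money
]
```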

High-Frequency Crude-oil Futures Data
Our high-frequency data consist of transaction prices recorded at 30-sec intervals. Until June 2006, futures were traded between 09:00 and 14:30 ET using an open outcry system in a trading pit, resulting in 661 price observations per day. Subsequently, they have been traded between 18:00 and 17:00 ET the following day on the electronic GLOBEX platform, resulting in 2,761 price observations per day.
To construct RV we use 5-min intraday returns. This frequency is commonly used in the empirical literature (see, e.g., Andersen, Bollerslev, Diebold, & Labys (2001) and Andersen et al. (2003), among others), as it is deemed to provide an appropriate trade-off between the objective of incorporating as much information as possible from intraday prices and the necessity to avoid contamination from microstructure noise. Hence, in applying our sub-sampled RV estimator in Equation (6) we set M = 660 or M = 2760, depending on whether the data are from before June 2006 or not, and ∆ = 10.

USO Option and High-Frequency Return Data
Options on the USO ETF trade between 08:30 and 15:00 CT on the CBOE and correspond to the delivery of 100 shares of the USO ETF, where the delivery date is three business days following the exercise of the option. These contracts have American-style exercise. Our USO ETF option prices consist of end-of-day bid and ask quotes. Before proceeding with the estimation of our OIV measures we estimate the early exercise premium and apply the same data filter procedures as with our crude-oil option dataset.
In analogy to our crude-oil futures data, our high-frequency USO ETF data consist of transaction prices recorded at 30-sec intervals. The USO ETF trades on the NYSE Arca between 09:30 and 16:00 ET, resulting in 781 price observations per day. Again, we use 5-min returns to construct RV so that M = 780 and ∆ = 10 when applying our sub-sampled RV estimator in Equation (6).

Forecast Samples
We examine the forecast performance of alternative option-implied measures using three distinct samples. The first, and most important one, corresponds to the full history of our crude-oil option and futures data (January 1996-April 2016). We refer to it as our full sample.
Motivated by the literature that finds that the behavior of commodities has experienced significant changes after their "financialization" period, we conduct a separate analysis using data that belong exclusively to that era. This sample, which we refer to as Post-Fin, starts in May 2007, when the OVX, our natural benchmark, was first reported by the CBOE, and ends in April 2016.
Finally, we examine option data and high-frequency returns corresponding to the USO ETF. Since the latter underlies CBOE's OVX computation, we believe that the analysis of USO volatility forecasts can provide interesting insights. This sample, which we refer to as the USO sample, spans the period between May 2007 and December 2016. In the interests of clarity, we present the empirical results of the USO sample separately in Section 5.
Our empirical study focuses on monthly variance forecasts. Along these lines, we study options that have approximately 22 trading days to expiration.⁹ Throughout the sample, the day on which we collect option prices is never before the maturity date of the previous option chain we studied, i.e., all our option-based forecasts are non-overlapping.

Construction of Implied Volatility Measures
The computation of the MFIV and CIV measures requires the existence of options trading for a continuum of strike prices, an assumption which is of course not satisfied in practice. In order to address this, we first estimate a risk-neutral distribution using the prices of observed options for each relevant date, which enables us to subsequently generate option prices for arbitrary strike prices. Our preferred risk-neutral distribution is the flexible Generalized Beta Distribution of the second kind (GB2) which, as discussed in Taylor (2005), has a number of appealing properties.¹⁰
The calculation of the CIV measures also requires a selection of the relevant corridor width, i.e., the barrier price levels $B_1$ and $B_2$. We consider four CIV measures¹¹ in total, with the barriers determined by evaluating the quantile function of the risk-neutral distribution $F_Q$. Specifically, defining $B_1 = F_Q^{-1}(p)$ and $B_2 = F_Q^{-1}(1 - p)$, the CIV1 to CIV4 measures are obtained by first setting p = 0.45, 0.35, 0.25 and 0.10, respectively, and subsequently evaluating Equation (5).
Figure 1 plots several CIV measures together with the MFIV estimates corresponding to crude-oil over the 1996-2016 period. As expected, all measures exhibit a strong degree of covariation, although narrow CIV measures appear more stable than their wide corridor counterparts. As shown in Table 2, all CIV measures are highly correlated with both MFIV and ATMIV. The autocorrelation patterns for 1, 6, and 12 lags also reveal that CIV measures become less persistent as the corridor widens. As expected, the unconditional moments of the CIV measures approach those of the full MFIV as more options are included in the calculations, while for the case of CIV4 the two measures have almost identical properties.
9 If for a given month options with an expiration horizon of 22 trading days are not available, we shorten the target horizon by one or (if needed) two days and scale the resulting OIV such that its expiration horizon corresponds to 22 days. For example, if the expiration horizon of the OIV was N trading days, we scale the OIV by multiplying it by 22/N.
10 The GB2 distribution, first proposed by Bookstaber & McDonald (1987), allows for general levels of skewness and kurtosis, while European option prices can be computed in closed form. We elaborate on our choice and examine its robustness in Section 6.1.1.
11 Supplementary analysis for alternative corridor measures is presented in Section 6.1.2.
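To make the barrier selection concrete, the sketch below computes the four corridors from the quantile function of a risk-neutral distribution and applies the 22/N horizon scaling. As a loudly stated simplification, a lognormal distribution whose mean equals the futures price stands in for the fitted GB2 distribution; the inputs (futures price 50, volatility 35%) and all function names are hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_quantile(p, forward, sigma, T):
    """Quantile of a lognormal risk-neutral distribution with mean equal to the forward."""
    s = sigma * math.sqrt(T)
    z = NormalDist().inv_cdf(p)
    return forward * math.exp(-0.5 * s * s + s * z)

def corridor(p, forward, sigma, T):
    """Barriers B1 = F_Q^{-1}(p) and B2 = F_Q^{-1}(1 - p)."""
    return (lognormal_quantile(p, forward, sigma, T),
            lognormal_quantile(1.0 - p, forward, sigma, T))

def scale_to_22_days(oiv, n_days):
    """Rescale an implied variance with an N-day horizon to a 22-day horizon."""
    return oiv * 22.0 / n_days

# CIV1 (narrowest) to CIV4 (widest) use p = 0.45, 0.35, 0.25, 0.10.
corridors = {
    f"CIV{i}": corridor(p, 50.0, 0.35, 22 / 252)
    for i, p in enumerate((0.45, 0.35, 0.25, 0.10), start=1)
}
widths = [b2 - b1 for b1, b2 in corridors.values()]
```

A smaller p pushes both barriers further into the tails, so the corridor widens monotonically from CIV1 to CIV4.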

Variance Forecasts using the HAR Model
There is a considerable literature demonstrating the superiority of forecasts generated from time-series models of RV over those obtained by a random walk model or by specifications using daily return data (see, for example, Corsi (2009), Patton & Sheppard (2015), Bollerslev, Patton, & Quaedvlieg (2016) and Bollerslev, Hood, Huss, & Pedersen (2018)). Along these lines, we use the heterogeneous autoregressive model (HAR) of Corsi (2009) to generate RV-based forecasts,

$$RV_{t+1,t+22} = \beta_0 + \beta_d RV_t + \beta_w RV_{t-4,t} + \beta_m RV_{t-21,t} + \varepsilon_{t+1,t+22}, \tag{8}$$

where $RV_{u,v} = \sum_{s=u}^{v} RV_s$ and the daily realized variances $RV_s$ are the sub-sampled RV estimates described in Equation (6). While various extensions of the HAR model could be considered, our study relies exclusively on the baseline version. This is because, as shown in the comprehensive studies of Sévi (2014) and Prokopczuk et al. (2016), sophisticated HAR extensions do not outperform the simple HAR benchmark for the case of crude-oil. We estimate the HAR model parameters using a rolling window of 60 monthly observations.
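The rolling estimation can be sketched as follows. The code assumes a HAR specification with daily, weekly (5-day sum) and monthly (22-day sum) regressors and a 22-day-ahead sum as the target, estimated by ordinary least squares on the 60 most recent non-overlapping monthly observations. The function names and the layout of the inputs (a daily RV array plus a list of forecast-origin indices spaced at least 22 days apart) are assumptions made for illustration.

```python
import numpy as np

def har_regressors(rv, t):
    """Constant, daily RV, 5-day sum and 22-day sum, all known at the end of day t."""
    return [1.0, rv[t], rv[t - 4:t + 1].sum(), rv[t - 21:t + 1].sum()]

def rolling_har_forecasts(rv, origins, window=60):
    """Forecast RV_{t+1,t+22} at each origin using the previous `window` origins."""
    rv = np.asarray(rv, dtype=float)
    forecasts = {}
    for j in range(window, len(origins)):
        train = origins[j - window:j]      # targets of these origins are realized by origins[j]
        X = np.array([har_regressors(rv, t) for t in train])
        y = np.array([rv[t + 1:t + 23].sum() for t in train])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        forecasts[origins[j]] = float(np.array(har_regressors(rv, origins[j])) @ beta)
    return forecasts

# Simulated daily RV series and monthly forecast origins, for illustration only.
rng = np.random.default_rng(0)
daily_rv = rng.gamma(shape=2.0, scale=1e-4, size=1600)
origins = list(range(21, 21 + 22 * 66, 22))
har_fc = rolling_har_forecasts(daily_rv, origins)
```

Because the origins are 22 trading days apart, every training target ends no later than the day on which the next forecast is made, so the scheme uses no future information.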

Empirical Results
We examine the forecast accuracy of competing forecasts using several techniques. Firstly, we evaluate the information content of each of our forecasts using Mincer-Zarnowitz regressions. Secondly, we make comparisons between the information content of our forecasts using encompassing regressions. Thirdly, we analyse the prediction errors of our forecasts using statistical loss functions. Lastly, we assess the economic value of our forecasts by implementing a volatility timing exercise.

Mincer-Zarnowitz Regressions
Our first evaluation procedure assesses the information content of our HAR-based forecasts and the OIVs. This is done by running Mincer-Zarnowitz regressions (Mincer & Zarnowitz, 1969), whereby we regress our variance target, the RV calculated over each out-of-sample monthly forecast horizon, against the competing forecasts. More precisely, for each forecast $i$, we run the following regression,

$$RV_{t+1,t+22} = \beta_0 + \beta_1 f^i_{t,t+22} + \varepsilon_{t+1,t+22}, \tag{9}$$

where $f^i_{t,t+22}$ is the forecast from model $i$ using information available up until day $t$ for the variance between days $t + 1$ and $t + 22$. The information content of each forecast is measured by the $R^2$ of this regression.
In Table 3 we present results for the Mincer-Zarnowitz regressions. In Panel A we report results for the full sample, whilst in Panel B we summarize results for the Post-Fin sample. The values of $R^2$ suggest the following. Firstly, option-implied volatilities (OIVs) have markedly higher information content compared to forecasts based solely on past RVs. Specifically, HAR forecasts have the lowest $R^2$ values, whilst the $R^2$ values for the OIV forecasts are larger by 5-7 percentage points for the full sample and 35-37 percentage points for the Post-Fin sample. Secondly, CIV1 appears to have the highest information content, since CIV1 forecasts result in the largest $R^2$ values in both the full and Post-Fin samples. Thirdly, the information content of our CIVs appears to be inversely related to the width of the corridor; as we move from CIV4 to CIV1, the $R^2$ values progressively increase. Lastly, the OVX has a low information content, with its $R^2$ being approximately on par with CIV4, the worst-performing CIV, and lower than that of ATMIV.
It should also be noted that the parameter estimates corresponding to Equation (9) differ substantially between the forecasts. This is as expected and a consequence of the differing levels of bias in the forecasts. If a forecast is unbiased, then we would expect $\beta_0 = 0$ and $\beta_1 = 1$. Whilst all of the $\beta_0$ parameters are close to zero, there are large differences between the values of $\beta_1$. Overall, the MFIV, ATMIV and OVX are upwardly biased, consistent with the presence of a variance risk-premium, whilst CIV1, CIV2 and CIV3 are downwardly biased, reflecting the fact that they provide risk-neutral expectations for CIVAR rather than IVAR. The observed biases are consistent with the mean values of the OIVs reported in Table 2.
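For readers who wish to replicate this type of regression, a minimal ordinary least squares sketch is shown below. The function name and the toy inputs are hypothetical, and no attempt is made to reproduce the standard errors reported in the paper.

```python
import numpy as np

def mincer_zarnowitz(realized, forecast):
    """Regress realized variance on a constant and the forecast; return (b0, b1, R^2)."""
    y = np.asarray(realized, dtype=float)
    f = np.asarray(forecast, dtype=float)
    X = np.column_stack([np.ones_like(f), f])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return float(beta[0]), float(beta[1]), float(r2)

# Toy example: the forecast overstates realized variance by a constant proportion.
toy_forecast = np.array([0.004, 0.006, 0.010, 0.007, 0.012])
toy_realized = 0.8 * toy_forecast
b0, b1, r2 = mincer_zarnowitz(toy_realized, toy_forecast)
```

In this toy case the intercept is zero and the slope is 0.8, i.e., realized variance is on average below the forecast, which is the pattern one would associate with an upwardly biased measure.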

Encompassing Regressions
Our second evaluation procedure assesses the relative performance of the alternative forecasts by running encompassing regressions. We make comparisons between two forecasts by running the following bivariate regressions,

$$RV_{t+1,t+22} = \beta_0 + \beta_1 f^i_{t,t+22} + \beta_2 f^j_{t,t+22} + \varepsilon_{t+1,t+22}. \tag{10}$$

We can determine whether the forecast from one model encompasses the other by examining the significance of the individual regression parameters. If the information contained in the forecast from model $i$ is subsumed by the information in the forecast from model $j$, then we expect $\beta_1$ to be insignificant and $\beta_2$ to be significant. If this occurs, then we say that the forecast from model $j$ encompasses the forecast from model $i$, and vice versa.
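A bivariate encompassing regression of this kind can be sketched as follows. The t-statistics use classical homoskedastic standard errors, which is an assumption made here for brevity; the paper may rely on a different (for example, robust) covariance estimator. The function name and the simulated data are hypothetical.

```python
import numpy as np

def encompassing(realized, f_i, f_j):
    """OLS of realized variance on a constant and two forecasts; returns (beta, t-stats)."""
    y = np.asarray(realized, dtype=float)
    X = np.column_stack([np.ones(len(y)), f_i, f_j])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    cov = (resid @ resid / dof) * xtx_inv     # classical OLS covariance (assumed)
    return beta, beta / np.sqrt(np.diag(cov))

# Simulated case in which only forecast j carries information about the target.
rng = np.random.default_rng(0)
f_i = rng.normal(size=500)
f_j = rng.normal(size=500)
target = 0.5 + 2.0 * f_j + rng.normal(size=500)
beta, tstat = encompassing(target, f_i, f_j)
```

Here the coefficient on `f_j` should be strongly significant while the coefficient on `f_i` should not, the pattern under which forecast j is said to encompass forecast i.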
We make the following comparisons using our encompassing regressions: (i) we compare the information content of the HAR model forecasts, which are based on RV alone, against our OIVs to examine whether the forward-looking information in our OIVs is useful vis-à-vis the backward-looking information contained in RV; and (ii) we compare the information content of our alternative OIVs to examine which of our OIVs contains the most useful forecasting information.
The results of the encompassing regressions are summarized in Table 4. Results for comparisons between the OIVs and the HAR forecasts are summarized in Panels A and B for the full and Post-Fin samples, respectively. Panel A shows that CIV1, CIV2 and CIV3 encompass the HAR forecasts, at the 1% level. In the remaining encompassing regressions, although the parameters on MFIV, ATMIV and CIV4 are significant at the 1% level, the parameters on HAR are also significant; albeit only at the 10% level for the encompassing regressions involving ATMIV and CIV4. Thus, although they do not encompass the HAR forecasts, there appears to be incremental information in the MFIV, ATMIV and CIV4 OIVs which is not captured in the historical time-series of RV. However, in Panel B, all OIVs, including the OVX, encompass the HAR forecasts, at the 1% level.
Results for comparisons between the OIVs are summarized in Panels C and D for the full and Post-Fin samples, respectively. Focusing on the full sample results, it can be seen that MFIV is encompassed by all the other OIVs at the 5% level. Comparisons of ATMIV and CIVs show that ATMIV is encompassed by CIV1, CIV2 and CIV3, at the 5% level. The encompassing regressions involving CIV1, CIV2, CIV3 and CIV4 show that CIV4 is encompassed by CIV1, CIV2 and CIV3, at the 5% level. Therefore, the results show that narrowing the corridor of our CIVs leads to improved information content. These findings are consistent with the arguments of Andersen & Bondarenko (2007) that the presence of options with extreme strikes, which are thinly traded and typically very sensitive to volatility risk premia fluctuations, contaminates the volatility information contained in the MFIV and wide CIV measures.¹³
The results in Panel D are weaker than those in Panel C and likely reflect the smaller size of the Post-Fin sample. Nonetheless, although often significant at the 10% level only, the results show that CIV1 encompasses all other OIVs. Notably, the OVX does not encompass any of the other OIVs, consistent with our Mincer-Zarnowitz regression results in Section 4.1 which showed the information content of the OVX to be inferior relative to our other OIVs.
In summary, the results from the encompassing regressions support those from the Mincer-Zarnowitz regressions in Section 4.1. The information content of the OIVs appears to be superior relative to those based on RV alone, i.e., the HAR forecasts. Furthermore, of the OIVs, the CIV1 appears to be informationally superior, since it encompasses our other OIVs in both the full and Post-Fin samples. In contrast, the information content of the OVX is seemingly limited as it does not encompass any other OIV while its respective parameter is always less significant when combined with any CIV.

Generating Out-of-sample Forecasts
The presence of a variance risk premium, as well as the fact that corridor implied variances provide risk-neutral forecasts of expected corridor integrated variance rather than integrated variance, generally causes the raw OIV measures to provide biased forecasts of future return variation. Empirically, this can be observed in the parameters of the Mincer-Zarnowitz regressions presented in Table 3. Along these lines, out-of-sample variance forecasts of models utilizing option-implied information are obtained through univariate models based on $OIV_{t,t+22}$, an option-implied measure (i.e. either MFIV, ATMIV, CIV1, CIV2, CIV3, CIV4, or, where applicable, OVX) calculated on day $t$ and corresponding to a forecast for the variance between days $t + 1$ and $t + 22$. We use BC-MFIV to denote the bias-corrected MFIV forecast, BC-ATMIV to denote the bias-corrected ATMIV forecast and so on.
13 We investigate the role of the variance risk premium in determining the forecast accuracy of our OIVs in more depth in Section 4.6.
Moreover, by comparing the adjusted $R^2$s of the encompassing regressions in Table 4 to the $R^2$s of the Mincer-Zarnowitz regressions in Table 3, we can see that a linear combination of RVs and OIVs leads to a higher information content than can be attained with any individual forecast. This motivates evaluating the accuracy of forecasts generated by augmented models that utilize information from both option prices and historical return data. Consistent with the prior literature (e.g. Busch, Christensen, & Nielsen (2011)), we generate these forecasts from an HAR model augmented with an OIV measure. We use HAR-MFIV to denote the forecasts from an HAR model augmented with MFIV, HAR-ATMIV the forecasts from an HAR model augmented with ATMIV and so on.
In total, we examine out-of-sample forecasts generated by thirteen different models for the full sample and fifteen for the Post-Fin and USO samples, since the latter two include the BC-OVX and HAR-OVX forecasts. The parameters for all models defined in equations (8), (11) and (12) are estimated using a rolling window of 60 monthly observations.
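The rolling estimation scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it supposes that the univariate bias-correction model is a linear regression of realized variance on a constant and the OIV measure, and the function names `ols_fit` and `rolling_bias_corrected`, as well as the alignment of the two input series, are hypothetical rather than taken from the paper.

```python
from statistics import mean


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def rolling_bias_corrected(oiv, rv, window=60):
    """One-step-ahead bias-corrected forecasts from a rolling regression.

    Assumed alignment: oiv[t] is the option-implied variance observed at the
    end of month t and rv[t] is the realized variance over the following
    month, so the pair (oiv[s], rv[s]) is fully known for every s < t.
    """
    forecasts = []
    for t in range(window, len(oiv)):
        # Estimate on the `window` most recent completed pairs only.
        a, b = ols_fit(oiv[t - window:t], rv[t - window:t])
        # Apply the fitted intercept and slope to the latest OIV reading.
        forecasts.append(a + b * oiv[t])
    return forecasts
```

With 60 monthly observations in each window, the first forecast is produced for the 61st month, and the window then moves forward one month at a time.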

Forecast Evaluation using Statistical Criteria
Although Mincer-Zarnowitz and encompassing regressions provide insights into the information content of our forecasts, they do not provide much information about their precision. From the perspective of economic agents, quantifying forecast accuracy is of paramount importance. Thus, we now turn to assessing prediction errors by means of statistical loss functions.

Statistical Loss Functions
To evaluate the accuracy of our forecasts, we use one symmetric loss function, the mean squared error (MSE), and one asymmetric loss function, the quasi-likelihood (QLIKE). A lower MSE and/or QLIKE corresponds to smaller prediction errors. These loss functions were chosen because they are commonly employed in the volatility forecasting literature and, as shown in Patton (2011), they are robust to measurement error in the IVAR proxy. More precisely, using the MSE and QLIKE loss functions ensures that the ranking of two forecasts in terms of expected loss is preserved even when the true integrated variance is replaced by a conditionally unbiased, but imperfect, proxy.
Denoting $f^i_t$ as the time $t$ variance forecast generated from a reference model $i$, and $RV_{t,T}$ as the corresponding realization of the target quantity, the MSE and QLIKE losses are computed over the $n$ total forecasts. In order to test for significant differences between the MSE and QLIKE of competing forecasts we use Diebold-Mariano tests (Diebold & Mariano, 1995) with Newey-West (Newey & West, 1987) standard errors.
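A minimal Python sketch of the two loss functions and of a Diebold-Mariano statistic is given below. Two choices here are assumptions rather than details confirmed by the text: the QLIKE loss is written in the normalized form RV/f - log(RV/f) - 1 commonly associated with Patton (2011), and the long-run variance uses Bartlett (Newey-West) weights with a user-chosen number of lags. All function names are illustrative.

```python
import math


def mse_losses(rv, f):
    """Per-forecast squared errors; their average is the MSE."""
    return [(r - x) ** 2 for r, x in zip(rv, f)]


def qlike_losses(rv, f):
    """Per-forecast QLIKE losses in the normalized form (assumed here).

    The loss is zero when the forecast equals the realization and grows
    faster for under-prediction than for over-prediction.
    """
    return [r / x - math.log(r / x) - 1.0 for r, x in zip(rv, f)]


def diebold_mariano(loss_a, loss_b, lags=0):
    """t-type statistic for equal expected loss of forecasts a and b.

    A negative value indicates that forecast a has the lower average loss.
    """
    d = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(d)
    d_bar = sum(d) / n
    u = [di - d_bar for di in d]
    # Long-run variance of the loss differential with Bartlett weights.
    lrv = sum(ui * ui for ui in u) / n
    for k in range(1, lags + 1):
        gamma_k = sum(u[i] * u[i - k] for i in range(k, n)) / n
        lrv += 2.0 * (1.0 - k / (lags + 1)) * gamma_k
    return d_bar / math.sqrt(lrv / n)
```

The statistic would then be compared with standard normal critical values. For example, `diebold_mariano(qlike_losses(rv, f_a), qlike_losses(rv, f_b), lags=1)` tests forecast `f_a` against the benchmark `f_b` under QLIKE.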

Forecast Evaluation Results
The out-of-sample MSEs and QLIKEs of our competing forecasts are presented in Table 5. Panel A reports results for the full sample whilst Panel B summarizes results for the Post-Fin sample. In each panel we also report the results of Diebold-Mariano tests in which the HAR forecasts are used as a benchmark. Therefore, any significance reported indicates that the MSE or QLIKE is significantly different to that of the HAR forecasts. In Panel B, to assess the accuracy of forecasts which incorporate information from the OVX, we additionally report results from Diebold-Mariano tests where the HAR-OVX forecasts are treated as the benchmark. 14
From Panel A it can be seen that, overall, the bias-corrected forecasts outperform the corresponding HAR-based forecasts according to the MSE. The BC-CIV1 forecasts result in the lowest MSE and, of the HAR-based forecasts, the lowest MSE is associated with HAR-CIV1. However, Diebold-Mariano tests show that no forecast results in an MSE which is significantly lower than that generated by the HAR forecasts.
The QLIKE results in Panel A differ slightly from those for the MSE. Under this loss function the HAR-based forecasts outperform the bias-corrected forecasts, with the most accurate forecasts being those of HAR-CIV1. Moreover, in contrast to the MSE results, we find that the QLIKEs for HAR-CIV1, HAR-CIV2, HAR-CIV3 and HAR-ATMIV are all significantly lower than for HAR at the 5% level. The QLIKEs for HAR-MFIV and HAR-CIV4 are also significantly lower at the 10% level. The fact that we find significant differences when using the QLIKE loss function is most likely associated with the ability of this loss function to more accurately discriminate between competing variance forecasts (Patton & Sheppard, 2009). Overall, the results in Panel A suggest that using information in CIV1 leads to the most accurate variance forecasts.
The results for the Post-Fin sample (Panel B) are analogous to those of the full sample (Panel A) and support our overall conclusion that incorporating information from CIV1 leads to the most accurate forecasts. Importantly, it can also be seen that forecasts based on the OVX perform poorly. When the HAR-OVX forecasts are used as a benchmark, nearly all non-OVX option-implied forecasts produce significantly lower QLIKEs and MSEs at the 5% level. 15 Although the BC-OVX forecasts result in an MSE that is significantly lower than that for the HAR-OVX forecasts, BC-OVX produces the largest MSE among the bias-corrected forecasts.
In summary, the results show that prediction errors can be minimized when CIV1 is employed. The results are consistent with those from the Mincer-Zarnowitz and encompassing regressions, which showed that the OIVs were informationally superior to forecasts based on RV and that CIV1 had a higher information content relative to other OIVs. Our results also suggest that forecasts based on the OVX perform significantly worse than those based on our alternative OIVs. Further insights regarding the performance of the OVX, including the effect of CBOE's cut-off methodology, are provided by the analysis of our USO sample in Section 5.

Forecast Evaluation using Economic Criteria
Although evaluating variance forecasts with MSE and QLIKE is common, they are statistical loss functions. It is not clear how minimizing the MSE and/or QLIKE translates into economic gains for the forecaster. Therefore, in order to ascertain whether the improved forecast accuracy we observe leads to economic gains, we perform a volatility timing exercise.
We consider an agent whose investment opportunity set consists of the WTI crude-oil futures and a risk-free asset and assume the agent's objective is to maximize the utility of a portfolio consisting of these two assets. We follow the set-up of Bollerslev et al. (2018) and assume Sharpe ratios are constant and that a quadratic utility function provides an accurate approximation to investors' true utility functions. With these assumptions, investors' utility per unit wealth (UoW) can be described in terms of $W_t$, the investor's wealth, $w_t$, the proportion of the investor's wealth held in crude-oil futures, $SR$, the Sharpe ratio, and $\gamma$, the investor's level of risk aversion. 16 It can be shown that constructing a portfolio to maximize expected utility is equivalent to forming a portfolio with a specific volatility target, which determines the optimal proportion of wealth to invest in crude-oil futures.
In the expression for the optimal proportion, the ratio in the numerator, $SR/\gamma$, corresponds to the volatility target and the denominator is the expected volatility. To operationalize the strategy, $E_t(RV_{t+1,t+22})$ in the denominator is replaced with a variance forecast. We also follow Bollerslev et al. (2018) and set $SR = 0.4$ and $\gamma = 2$, which they argue are sensible parameters when forecasting variance over a monthly horizon and which result in a volatility target of 20%. Using Equation 14 to substitute for $w_t$ in Equation 13, replacing $E_t(RV_{t+1,t+22})$ with our variance forecasts, and plugging in our assumed values of $SR$ and $\gamma$ leads to an expression for the utility per unit wealth based on forecast $f_{t+1,t+22}$. Comparisons between models are then made using realized utility. Note that realized utility is expressed as a percentage return and, given our assumptions, can take a maximum value of 4%.
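The volatility-timing calculation can be illustrated with the short Python sketch below. It assumes that utility per unit of wealth takes the quadratic form w·SR·σ - (γ/2)·w²·σ², where σ² is the realized variance in annualized decimal units and w is the volatility target divided by forecast volatility. This functional form is an assumption chosen to be consistent with the 20% volatility target and the 4% maximum stated above, averaging over periods is likewise assumed, and the function names are illustrative.

```python
import math

SR = 0.4     # Sharpe ratio, as set in the text
GAMMA = 2.0  # coefficient of risk aversion, as set in the text


def optimal_weight(forecast, sr=SR, gamma=GAMMA):
    """Volatility target (sr / gamma) divided by forecast volatility."""
    return (sr / gamma) / math.sqrt(forecast)


def utility_per_unit_wealth(rv, forecast, sr=SR, gamma=GAMMA):
    """Utility obtained when the weight is set using `forecast` and the
    variance that actually materializes is `rv` (assumed quadratic form)."""
    ratio = math.sqrt(rv / forecast)
    return (sr ** 2 / gamma) * (ratio - 0.5 * ratio ** 2)


def realized_utility(rvs, forecasts):
    """Average utility per unit wealth over all forecast periods."""
    values = [utility_per_unit_wealth(r, f) for r, f in zip(rvs, forecasts)]
    return sum(values) / len(values)
```

Under these assumptions a perfect forecast yields (0.4² / 2) × 0.5 = 4%, and any forecast error, whether an over- or under-prediction, lowers the utility below that value.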

Realized utility results
Panels A and B of Table 6 summarize the realized utility results for our full and Post-Fin samples, respectively. The first column of Panel A reports the realized utility. It can be seen that the HAR-based forecasts outperform the bias-corrected forecasts. In addition, of the HAR-based forecasts, HAR-CIV1 generates the highest realized utility.
In the first column of Panel A we also report the results from Diebold-Mariano tests of whether the realized utility associated with a forecast is significantly different to the realized utility generated by the HAR forecasts. It can be seen that HAR-CIV1, HAR-CIV2 and HAR-CIV3 produce realized utilities that are significantly higher, at the 5% level for HAR-CIV1 and HAR-CIV2 and the 10% level for HAR-CIV3, than the realized utility attained with the HAR forecasts. Therefore, the pattern in forecasting performance observed with the statistical loss functions is retained when we use an economic loss function. The results also confirm that the improvements in forecasting accuracy we observed with the statistical loss functions translate into economic benefits.
The difference between the value of the realized utility for the HAR-CIV1 forecasts relative to the HAR forecasts in Panel A is 2 bp. Although this may appear to be a relatively modest difference, there are two reasons why this represents a material economic improvement. Firstly, as highlighted by Bollerslev et al. (2018), there has been a drive by investment management companies, in particular mutual funds and ETFs, towards lowering fees. The fees now charged by low-cost funds are of the order of tens of basis points. As argued by Bollerslev et al. (2018), this means that a single digit basis point increase in fees is relatively substantial. The difference in realized utility of 2 bp means that a fund using the HAR-CIV1 forecasts instead of the HAR forecasts will be able to increase its fees by 2 bp and remain equally attractive to investors.
Secondly, the realized utilities reported are unconditional. Therefore, in any given month the economic benefit of using the HAR-CIV1 over the HAR forecasts could be much larger than 2 bp. In order to examine this further, we report in columns 2-7 of Panel A in Table 6 the 2.5, 5, 10, 25, 50 and 75% quantiles of the UoW for each set of forecasts. Comparing the HAR-CIV1 to the HAR forecasts, it is clear that there are substantial differences, of 5-10 bp, in the UoW at the 2.5, 5 and 10% quantiles. The difference between the lower quantiles of the UoW distributions suggests the HAR-CIV1 forecasts tend to outperform the HAR forecasts precisely when it is most difficult to forecast volatility and when an accurate forecast is in greatest demand, e.g., when there is a volatility shock which causes UoW to be low.
The results for the Post-Fin sample in Panel B of Table 6 are analogous to those for the full sample in Panel A. Again, HAR-CIV1 results in the highest realized utility, whilst, with the exception of HAR-OVX, all augmented HAR models generate a realized utility higher than that of the HAR. The differences between the realized utilities of each non-OVX augmented HAR forecast and the HAR forecasts are also on the whole larger than those reported in Panel A of Table 6, being approximately 2-5 bp.
To examine the economic benefit of using our OIVs over the OVX, in Panel B we test for a significant difference between each forecast and the HAR-OVX forecasts. 17 It can be seen that the realized utilities of all non-OVX augmented HAR forecasts are significantly higher, typically at the 1% level, than the realized utility associated with the HAR-OVX forecasts. Therefore, these results further support our conclusion that there is value in constructing our OIVs directly from option prices rather than relying on the CBOE's methodology.
Similar to our findings in Panel A, there are potentially large differences between the conditional values of UoW for the Post-Fin sample. For the 25-50% UoW quantiles the difference between the realized utilities associated with the HAR and augmented HAR forecasts is approximately 4-13 bp, whilst the difference between the HAR-OVX and the other non-OVX augmented HAR forecasts is approximately 9 bp.
Of course, the magnitude of the economic benefits derived from each forecast depends on the assumptions employed. As Bollerslev et al. (2018) highlight, the value of the realized utility is a linear function of the volatility target. Thus, if the volatility target doubles, e.g., through a doubling of the Sharpe ratio or a halving of the coefficient of risk aversion, the size of the economic benefits also doubles. Nevertheless, the framework above provides a sensible approximation and therefore the magnitude of the economic benefits presented should be reasonable.
In conclusion, the results demonstrate that the improvements in forecasting accuracy observed when using OIVs translate into economic benefits under reasonable assumptions. They also further corroborate our preference for the CIV1 amongst our OIVs.

Realized utility results with transaction costs
In order to make our analysis more realistic, we also take into consideration transaction costs. Specifically, transaction costs are calculated as a proportion of turnover, with the precise level of transaction costs controlled by the parameter c. We follow Wang, Liu, Ma, & Wu (2016) and Caldeira, Moura, Nogales, & Santos (2017) and set c to be either 0.033% or 0.15%. It should be noted that transaction costs are low in futures markets (Locke & Venkatesh, 1997) and have decreased markedly over the past few decades. Realized utility net of transaction costs is then obtained by deducting these costs from realized utility.
Panels A and B of Table 7 summarize the net realized utility and transaction costs for the full and Post-Fin samples, respectively. In Panel A (Panel B) we also use Diebold-Mariano tests to formally evaluate whether the net realized utility and transaction costs associated with each forecast are significantly different to the net realized utility and transaction costs generated by the HAR (HAR-OVX) forecasts.
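To illustrate how transaction costs could enter the calculation, the Python sketch below computes costs and net realized utility from a sequence of forecasts. It rests on simplifying assumptions that are not taken from the paper: turnover is approximated by the absolute change in the portfolio weight between consecutive periods, the weight is the volatility target divided by forecast volatility, and c is supplied as a decimal. The function names and the value of c used in any example are hypothetical.

```python
import math


def weights(forecasts, sr=0.4, gamma=2.0):
    """Portfolio weights implied by each variance forecast."""
    return [(sr / gamma) / math.sqrt(f) for f in forecasts]


def transaction_costs(w, c):
    """Cost in each period after the first: c times the absolute change
    in the weight (a simplified, assumed definition of turnover)."""
    return [c * abs(w[t] - w[t - 1]) for t in range(1, len(w))]


def net_realized_utility(gross_uow, costs):
    """Average utility per unit wealth after deducting transaction costs.

    gross_uow[t] is aligned with the weight held in period t; the first
    period is dropped because no previous weight exists to define turnover.
    """
    net = [u - tc for u, tc in zip(gross_uow[1:], costs)]
    return sum(net) / len(net)
```

Because the costs scale with changes in the weight, forecasts that fluctuate more from month to month incur higher costs, which is why the comparison of transaction costs across models is informative.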
In both Panels A and B it can be seen that transaction costs, under either value of c, do not differ substantially between the competing forecasts. In Panel A, there are no significant differences in transaction costs, whilst in Panel B, BC-CIV1 and BC-CIV2 lead to significantly lower transaction costs at the 10% level. Consequently, because none of the forecasts leads to unusually high transaction costs, the rankings of the forecasts in Panels A and B of Table 7 are identical to those in Panels A and B of Table 6. In particular, the HAR-CIV1 forecasts produce the highest net realized utility in both the full and Post-Fin samples. Within the full sample, the HAR-CIV1 and HAR-CIV2 forecasts result in net realized utilities that are significantly higher than the net realized utility of the HAR forecasts at the 5% level. For the Post-Fin sample, all the non-OVX augmented HAR forecasts result in significantly higher net realized utilities than the HAR-OVX forecasts, typically at the 1% level. Thus, differences in trading volumes are small and do not have a material impact on the relative economic benefits of the forecasts, and the results support our conclusions in Section 4.5.1.

Forecast Performance and Variance Risk Premia
Corridor implied volatility measures discard some information from option panels, as contracts with strikes outside the corridors are not included in their calculation. This creates a mismatch between the option-implied quantity (i.e. expected risk-neutral corridor variance) and the target quantity (expected real-world integrated variance). If narrow CIV measures perform well empirically, it must be the case that the benefits from discarding these options outweigh the costs.
There are two reasons that can justify this trade-off. The first is that the discarded options, due to thin or non-existent trading, merely contaminate option-implied expectations with measurement error. The second is that certain options embed risk premia, in the sense that a higher expected volatility is required to explain their price.
For the purpose of forecasting, the presence of such risk premia may not be a significant concern if they are relatively constant across time and a sufficiently long history of past data is available to estimate them, so that the bias from the option-implied forecasts can be removed. What is notoriously more challenging is the possibility that they vary substantially over time, causing variations in the option-implied measures that are unrelated to the target quantity, i.e. expected integrated variance. In this section we investigate whether the behavior of risk premia can help us explain why our narrow corridor measures perform well in practice.

Variance and Corridor Risk Premia
Thus far we have presented our empirical results without attempting to explain why CIV1 provides the most accurate variance forecasts. In the stock market, the variance risk premium (VRP), defined as the difference between real-world and risk-neutral expected values of variance, is known to vary across time (Todorov, 2009) and to have an unconditional expected value that is negative (Carr & Wu, 2008; Bollerslev, Tauchen, & Zhou, 2009). Naturally, these properties of the VRP adversely affect the accuracy of volatility forecasts derived from the prices of options.
In order to gain an insight into whether a similar argument applies to our crude oil data, we analyze the time series properties of risk premia associated with our option-implied measures.
To estimate the VRP we follow Bollerslev et al. (2009), 18 using the integrated variance (IVAR) measures for the real-world ($\mathbb{P}$) and risk-neutral ($\mathbb{Q}$) worlds that have been discussed in Section 2.1 and Section 2.2, respectively. Because our CIVs provide risk-neutral expectations of corridor integrated variance (CIVAR), it is not appropriate to estimate their corresponding risk premia in the same way. Instead, we use the analogous concept of the corridor variance risk premium (CVRP), defined in terms of the corridor implied variance, which is equal to CIV1, CIV2, CIV3 or CIV4, and $CRV^{(B_1,B_2)}_{t-21,t}$, which represents the associated corridor realized variances, denoted CRV1, CRV2, CRV3 and CRV4. 19 Similarly, the barriers of $CVRP^{(B_1,B_2)}$ correspond to those of our four CIV measures, so we denote the respective corridor variance risk premia estimates as $CVRP_1$, $CVRP_2$, $CVRP_3$ and $CVRP_4$.

Statistical properties of Corridor Realized Variances
Since corridor realized variances (CRVs) have not been explored in the extant literature, we first examine their time-series properties. Panel A of Table 8 provides summary statistics for our realized variance measures. As expected, the means and standard deviations of the CRVs are lower than those of the RV. The skewness and kurtosis of CRV1-CRV3 are also markedly lower. Therefore, the time-series properties of our CRVs are analogous to those of the CIVs in that they vary less over time and have fewer extreme values than their full variance counterparts. Although arguably a little higher for the CRVs, the autocorrelations are of a similar magnitude to those of the RV.
Our CIV measures provide (risk-neutral) forecasts of future corridor integrated variance. Our target quantity in this paper, however, is integrated variance. Therefore, a closer inspection of the relationship between the two, as measured by our respective proxies (i.e. CRV and RV), can provide some useful insights. For instance, if the relative differences are stable over time, then corridor integrated variance forecasts can be easily scaled up to provide reasonable predictions of integrated variance.
Along these lines, we examine the statistical properties of the ratios between the CRV and RV measures (i.e. CRV/RV) for each of our corridors. For CRV1 to CRV4 these ratios are denoted as R1 to R4. The summary statistics are displayed in Panel B of Table 8. As can be seen, although the skewness and kurtosis of R1 are a little larger than those of R2 and R3, R1 has a much lower standard deviation. In terms of persistence, the autocorrelations for all the realized variance ratios are of comparable magnitude. Therefore, the relationship between CRV1 and RV appears to be more stable than for the other, wider corridors, so it is reasonable to assume that sensible integrated variance predictions can be obtained by scaling up the corresponding corridor integrated variance forecasts.

Statistical properties of Corridor Risk Premia
To understand the behavior of corridor risk premia and examine what impact they might have on our forecast results, we take a closer look at their statistical properties. Several observations from Panel C of Table 8 are noteworthy. First, the VRP and CVRPs are significantly different from zero in all cases, meaning that all the option-implied measures provide biased forecasts of their respective target quantities. Second, both the mean and standard deviation of $CVRP_1$ are substantially lower than those of the other measures. Third, the skewness and kurtosis of $CVRP_1$, $CVRP_2$ and $CVRP_3$ are comparatively low, indicating that narrow corridors are associated with fewer extreme CVRP values. This suggests that for narrow corridor measures the difference between risk-neutral and real-world expectations will not be prone to sudden large deviations.
We continue our discussion by looking at the most important link between the statistical properties of the various CVRPs and the forecast performance of the associated CIV measures. Specifically, we examine whether variance and corridor risk premia vary across time by testing for zero autocorrelations in the series $(x - \bar{x})$, where $x$ is a reference VRP or CVRP estimate and $\bar{x}$ denotes its sample mean. We consider 1, 6 or 12 lags, and the null hypothesis that all autocorrelations up to these lags are jointly zero is evaluated using the Ljung-Box test. This test should uncover evidence of conditional mean dynamics in each series. We find that at the 5% level the constant mean assumption cannot be rejected for any of the CVRPs. In contrast, at the same significance level, the zero autocorrelation assumption is rejected at both 1 and 6 lags for the VRP. 20 Because a time-varying risk premium contaminates option-implied forecasts with variations that are irrelevant to the target quantity (expected integrated variance) and instead reflect adjustments in the pricing of risk, this empirical finding directly demonstrates the advantage of CIV forecasts over those obtained from the full MFIV measure.
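The Ljung-Box test described above can be sketched in Python as follows. The statistic is the standard Q = n(n + 2) Σ ρ_k² / (n - k), compared here with 5% chi-square critical values for 1, 6 and 12 degrees of freedom. The function names are illustrative, and the sketch is not intended to reproduce the exact implementation behind the reported results.

```python
# 5% critical values of the chi-square distribution for the lags considered.
CHI2_CRIT_5PCT = {1: 3.841, 6: 12.592, 12: 21.026}


def ljung_box(x, lags):
    """Ljung-Box Q statistic for the demeaned series x - mean(x)."""
    n = len(x)
    x_bar = sum(x) / n
    u = [xi - x_bar for xi in x]
    denom = sum(ui * ui for ui in u)
    q = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation at lag k.
        rho_k = sum(u[i] * u[i - k] for i in range(k, n)) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q


def rejects_zero_autocorrelation(x, lags):
    """True if the null of jointly zero autocorrelations up to `lags`
    is rejected at the 5% level."""
    return ljung_box(x, lags) > CHI2_CRIT_5PCT[lags]
```

A rejection for a risk-premium series indicates that its conditional mean moves over time, which is the property that a fixed bias correction cannot remove.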
We conclude our empirical evidence on time-varying risk premia with a more careful comparison between volatility expectations extracted from options at the tails of the risk-neutral density and those with strikes around its mean. To do so, we compare the risk premia associated exclusively with extreme strike options, defined as $CVRP_{tail} = VRP - CVRP_4$, with those that correspond to the central support of the RND. For the latter, we use the $CVRP_2$ estimate that contains a comparable (risk-neutral) probability mass. Since risk premia should vary with the state of the economy, we use the economic activity index of Aruoba, Diebold, & Scotti (2009), which we denote as ADS, to capture any changes in the conditional mean of each series. In particular, we regress each series on a constant and the ADS index.
The results of these regressions are revealing. For the $CVRP_2$ measure, the estimated parameters are $c = 0.028$ (significant at the 1% level) and $\beta = -0.00102$, which has the correct sign (i.e. risk premia are higher in bad economic states) but is insignificant, with a p-value of 93%. For the $CVRP_{tail}$ measure, the estimated parameters are $c^* = 0.0133$ (significant at the 1% level) and $\beta^* = -0.00648$, which also has the correct sign and, unlike $\beta$, is significant at the 5% level using HAC standard errors. 21 In connection with our forecasting study, this result carries two important messages for option-implied measures that rely on extreme strike options. First, it provides further evidence that the risk premia associated with extreme strike options are time-varying, which generally impairs their forecast accuracy. Second, it highlights that their risk-neutral forecasts will abruptly drift apart from real-world expectations during times of sudden market turmoil, so that their forecast performance will deteriorate when it is needed the most.
Overall, studying the dynamics of variance and corridor risk premia provides some intuition regarding the good forecasting performance of our narrow CIV measures and strengthens the argument for preferring them in practice.

Further Insights Using USO Options
Although the main focus of this paper is forecasting crude-oil volatility, interesting insights can be obtained by studying a closely related quantity, namely the volatility of the USO ETF. First, the fact that the two underlyings are tightly linked but trade in different markets offers a unique opportunity to examine whether our key findings are mainly driven by trading venue idiosyncrasies such as market liquidity. Furthermore, the fact that the USO dataset underlies CBOE's computation of the OVX paves the way for methodological comparisons. For instance, although our main results clearly showed that our CIV measures deliver more accurate crude-oil volatility forecasts than the OVX, directly contrasting the performance of the two methodologies was inappropriate as their implementation was based on different data. Finally, the USO dataset enables us to explore alternatives to CBOE's cut-off point rule and evaluate their forecast performance.
The details of the USO dataset, which consists of options and associated high-frequency returns of the underlying ETF, can be found in Section 3.1.3. In order to remain consistent with our main empirical analysis on crude-oil data, we adopt exactly the same methodological approach. Along these lines, our target quantity is the monthly realized variance of the asset underlying the option contracts, i.e. the USO ETF. All option-implied measures are obtained from option contracts that have approximately 22 trading days to expiration and our forecasts are strictly non-overlapping.

Forecasting USO volatility
In this section we investigate the information content of various volatility forecasts extracted from the prices of our USO options, i.e. we examine the performance of the MFIV, ATMIV, OVX and our four CIV measures. Following our empirical exercise for crude-oil volatility, we first conduct regression analysis and subsequently assess each alternative using two statistical loss functions.

Regression Analysis
We begin our analysis by inspecting the results of the various forecast regressions. 22 Table 9 shows the (adjusted) $R^2$ from Mincer-Zarnowitz regressions when each of the competing alternatives is used to predict the monthly variance of the USO fund. As can be seen, the CIV1 forecasts have the highest correlation with the target quantity, followed by CIV2, CIV3 and CIV4. Consistent with our earlier findings for crude-oil volatility, the information content of the HAR forecasts is notably inferior to that of any of the OIV measures. The OVX forecasts are ranked second to last.
Since combinations of forecasts from both option-implied and realized variance models are often preferable in practice, it is important to assess our OIV measures with respect to the incremental volatility information they contain, i.e. information that is not present in RV-based forecasts. Along these lines, we examine the adjusted $R^2$ from bivariate specifications where realized variance is regressed on both a reference OIV measure and the HAR forecast. Comparisons in terms of adjusted $R^2$ reveal that including HAR forecasts is not productive, since every univariate model ranks favorably compared to its bivariate counterpart. Notably, the relative ranking between bivariate models remains remarkably similar to that of the univariate specifications. Focusing exclusively on bivariate models, the most accurate forecasts are generated by those containing the CIV1 measure, followed by those that utilize CIV2 and CIV3, while the lowest correlations between forecasts and realizations are those corresponding to the OVX and ATMIV measures.
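The reason a bivariate specification can rank below its univariate counterpart is the degrees-of-freedom penalty in the adjusted $R^2$. The Python sketch below applies the standard adjustment formula; the numerical values in the usage note are hypothetical and are not taken from Table 9.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R-squared for a regression with an intercept and
    `n_regressors` explanatory variables estimated on `n_obs` points."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)
```

For instance, with 60 observations a univariate model with a hypothetical $R^2$ of 0.500 and a bivariate model with an $R^2$ of 0.502 would be ranked by comparing `adjusted_r2(0.500, 60, 1)` with `adjusted_r2(0.502, 60, 2)`. The small gain in raw fit from the second regressor does not offset the additional penalty, so the univariate model obtains the higher adjusted value.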
Overall, using the adjusted $R^2$ to rank all models clearly shows that narrow corridor measures contain the most useful information in terms of predicting future USO volatility. In particular, the best results are those corresponding to the univariate models containing CIV1, CIV2 and CIV3, followed by their bivariate counterparts that combine them with HAR forecasts. At the other end, both univariate and bivariate OVX forecasts rank towards the bottom of the list. It is also worth mentioning that the MFIV measure delivers less informative forecasts than any of the CIV alternatives. The superior performance of the CIV1, the unimpressive ranking of the OVX, as well as the inferior MFIV forecasts vis-à-vis our CIV measures, mirror exactly our earlier findings for the crude-oil dataset.

Statistical Loss Functions
We complete our empirical exercise by evaluating the accuracy of our models using our two benchmark statistical loss functions, i.e. the symmetric MSE and the asymmetric QLIKE. Consistent with our forecast setting for the case of crude-oil, we generate monthly variance forecasts from fifteen separate forecast models. Specifically, seven specifications are based exclusively on OIV measures, one utilizes only past return data (HAR), and seven are augmented models that combine HAR with OIV forecasts. One-step ahead forecasts are generated by estimating each regression model, described in Sections 3.4 (HAR model) and 4.3 (univariate and augmented models), using a rolling window of 60 monthly observations, so that all forecasts are bias-corrected. Table 9 displays the results of this forecasting horse race. Under the MSE loss function, the first thing that becomes apparent is that all univariate OIV models outperform every augmented specification. Second, forecasts obtained exclusively by relying on historical return information (i.e. generated by the HAR model) are less sharp than those of any other model. Third, the most accurate forecasts are generated by the univariate CIV1 specification, followed by those corresponding to the univariate CIV2 measure. It is also noteworthy that OVX forecasts deliver very low MSE losses as well, as they are only worse than those of CIV1 and CIV2. The augmented OVX specification continues to perform very poorly though, coming second to last compared to all other models. Finally, for both univariate and augmented specifications, the MFIV measure continues to deliver less accurate volatility forecasts compared to any CIV measure.
Under QLIKE losses, where a stronger penalty is imposed for underpredicting variance, models that combine two sources of information, namely option prices and historical returns, deliver more accurate forecasts than those obtained by univariate specifications. The sharpest forecasts are provided by the HAR-CIV1 model, followed by those of HAR-CIV2 and HAR-CIV3. At the other end, the worst forecasts are obtained by the univariate ATMIV and MFIV models. The HAR model generates lower QLIKE losses than any univariate OIV model but higher than any augmented specification. Finally, out of the 15 models, those that involve the OVX measure rank fourth (univariate) and tenth (augmented) from the top, so the performance of the measure is reasonable overall.
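The forecast evaluation described above can be illustrated with a minimal sketch. It assumes one common normalization of the QLIKE loss (RV/h − log(RV/h) − 1), which may differ from the exact form used in the paper by terms that do not affect rankings, and it shows only the univariate bias-correction regression; the function names and the timing convention in the docstring are illustrative assumptions.

```python
import math


def mse_loss(rv, forecast):
    """Squared-error loss for one realized variance / forecast pair."""
    return (rv - forecast) ** 2


def qlike_loss(rv, forecast):
    """QLIKE loss in the normalization rv/h - log(rv/h) - 1 (assumed here).

    It is zero when forecast == rv and penalizes under-prediction more
    heavily than over-prediction of the same absolute size.
    """
    ratio = rv / forecast
    return ratio - math.log(ratio) - 1.0


def ols_fit(x, y):
    """Intercept and slope of a univariate least-squares regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def rolling_bias_corrected(oiv, rv, window=60):
    """One-step-ahead forecasts from rolling regressions of RV on OIV.

    Assumed timing: oiv[t] is observed at the end of month t and rv[t] is
    the variance realized over month t. Each regression pairs oiv[s - 1]
    with rv[s], uses only data available at month t, and maps oiv[t]
    into a forecast of rv[t + 1].
    """
    forecasts = {}
    for t in range(window, len(oiv) - 1):
        x = oiv[t - window:t]          # oiv[t-window] .. oiv[t-1]
        y = rv[t - window + 1:t + 1]   # rv[t-window+1] .. rv[t]
        intercept, slope = ols_fit(x, y)
        forecasts[t + 1] = intercept + slope * oiv[t]
    return forecasts
```

Averaging `mse_loss` or `qlike_loss` over the out-of-sample months for each model would then give the kind of ranking reported in Table 9; the augmented specifications would add the HAR forecast as a second regressor.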

Crude-oil versus USO Volatility Forecasts
How much do our key empirical findings change when we examine the volatility of the USO ETF instead of that of crude-oil futures? The answer is very little, if at all. In both cases, regression analysis indicates that forecasts from historical return models are outmatched by option-implied forecasts. Furthermore, narrow corridor implied volatilities outperform both the MFIV and ATMIV measures. Using the adjusted R² as a ranking criterion, the CIV1 contains more information about future volatility levels than any other measure. With respect to MSE losses, univariate models that utilize OIV measures deliver sharper forecasts than augmented specifications. Once again, the most accurate forecasts are generated by narrow CIV measures, notably CIV1, rather than by other benchmarks, such as the HAR, ATMIV, MFIV or OVX forecasts. When the QLIKE loss function is considered, models that combine both option-implied and past realized variances rank favorably compared to their univariate counterparts. The lowest forecast errors are still obtained by models that utilize narrow corridor measures, with the augmented HAR-CIV1 model providing the best results.
All the above are true for both the USO ETF and the two crude-oil datasets (full and Post-Fin samples). The only notable, yet rather unsurprising, difference in the USO results is that the performance of the OVX index is not as poor as in the crude-oil sample. In particular, while still ranking below any alternative option-implied measure in univariate or bivariate forecast regressions, models that utilize the OVX generally have, in relative terms, lower realized losses than before. Some improvement was to be expected, since in the case of crude-oil forecasts there was a mismatch between the target quantity (crude-oil volatility) and the underlying asset of the options used in the OVX calculation (USO ETF).

Alternative Cut-off Points for the OVX Calculation
In general, the usefulness of the OVX index compared to other alternatives depends on the application at hand. For example, if one is interested in studying risk premia associated with extreme price movements, narrow corridor implied volatility measures will be of little value as they discard options with extreme strikes. With that in mind, and consistent with the spirit of our paper, the alternative OVX indices we construct are only evaluated with respect to their forecast accuracy. In the interests of brevity, we only assess the forecasts using regression analysis.
All our alternative OVX indices are computed by applying the CBOE methodology to the same option chains, but they differ in how the cut-off point, which determines the options included in the calculation of the measure, is chosen. In particular, we construct indices where the cut-off point is obtained by targeting fixed values of the Effective Range (ER) statistic of Andersen et al. (2015), which is defined in terms of the strike range (K_1, K_N), the at-the-money Black-Scholes implied volatility σ_BS (ATMIV in our notation), and the time-to-maturity T of the underlying options. We include five OVX alternatives in our comparisons by setting the lower and upper ends of the effective range to −z and z, respectively, for z = 1, 1.5, 2, 2.5, 3. The corresponding measures are denoted as OVX_1, OVX_1.5, OVX_2, OVX_2.5 and OVX_3. The measure obtained by applying the CBOE's cut-off algorithm is simply denoted as OVX. Our proposed methodology for directly using the effective range to determine which options should be included in the OVX calculation formula has a number of appealing features. First, it is a simple modification of the CBOE's cut-off methodology that is straightforward to implement. Second, besides calculating the at-the-money implied volatility, no additional assumptions, such as the estimation of a risk-neutral density, are required. Third, the rationale for using a fixed effective range target is clear. In times of high liquidity, more options are included in the original OVX calculation, which increases, ceteris paribus, the magnitude of the extracted measure. This effect is largely alleviated by keeping the strike range, normalized by the level of risk-neutral volatility (σ_BS), relatively stable across time and largely independent of market liquidity. 23 Targeting fixed values of the effective range statistic determines the minimum and maximum strikes that can be considered in the OVX calculation.
Of course, since observed strike prices are discrete, the actual range of (normalized) strikes will not be constant. Figure 2 depicts the actual effective range for the OVX, OVX_2 and OVX_1 measures, computed using the observed strike prices included in each volatility index formula. It is clear that the effective range for the OVX measure exhibits substantial time-variation throughout our sample. Notably, in some cases the range of (normalized) strikes changes markedly from one month to the next. Unsurprisingly, the effective ranges for the OVX_2 and OVX_1 measures are very stable across time.
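As a rough illustration of the strike-selection step, the sketch below assumes that a strike K is normalized as ln(K/F) / (σ_BS √T), where F is the forward (futures) price of the underlying. This scaling is an assumption made for the example, and the function names are hypothetical; the sketch also shows why discrete strikes make the realized range fall short of the target.

```python
import math


def normalized_strike(strike, forward, atm_vol, maturity):
    """Log-moneyness in units of ATM standard deviations (assumed ER scaling)."""
    return math.log(strike / forward) / (atm_vol * math.sqrt(maturity))


def select_strikes(strikes, forward, atm_vol, maturity, z):
    """Keep the listed strikes whose normalized value lies within [-z, z]."""
    return [k for k in strikes
            if abs(normalized_strike(k, forward, atm_vol, maturity)) <= z]


def effective_range(strikes, forward, atm_vol, maturity):
    """Realized (lower, upper) normalized bounds of a set of strikes."""
    return (normalized_strike(min(strikes), forward, atm_vol, maturity),
            normalized_strike(max(strikes), forward, atm_vol, maturity))


# Hypothetical option chain: strikes listed every 5 dollars around F = 50.
chain = list(range(20, 85, 5))
kept = select_strikes(chain, forward=50.0, atm_vol=0.4, maturity=1 / 12, z=1.0)
lower, upper = effective_range(kept, forward=50.0, atm_vol=0.4, maturity=1 / 12)
```

With a target of z = 1 the retained strikes span a realized range slightly inside (−1, 1), because no listed strike sits exactly on the target bounds.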
We now examine the information content of the OVX indices by looking at the adjusted R² of forecast regressions on realized variance. We consider both univariate and bivariate models, i.e. regression models that only use OVX indices and those that include the HAR forecast as an additional regressor. As shown in Table 10, with the exception of the OVX_1, every univariate model has a higher adjusted R² than its bivariate counterpart. The ranking within each of these groups is identical for both types of models, namely OVX_1 provides the best results, followed by OVX_2, OVX_1.5, OVX_2.5, OVX_3 and, lastly, the OVX. So, with the exception of OVX_2, the wider the cut-off range, the noisier the option-implied information about future realized variance. The most important empirical finding, however, is that the cut-off algorithm employed by the CBOE appears to result in a measure that provides less informative forecasts than any of the ER-based alternatives. It is also worth noting that while models that include the OVX_1 generate the highest adjusted R² in these comparisons, they still rank below our CIV1, CIV2 and CIV3 measures.
We conclude our analysis by conducting a series of encompassing regression tests where we combine the OVX measure with each of the competing indices. Specifically, we compare the information content of OVX forecasts, denoted as f_ovx, against those of a reference ER-based measure, denoted as f_er, by estimating an encompassing regression of realized variance on both forecasts, where β_1 and β_2 denote the slope coefficients on f_ovx and f_er, respectively. The estimation results are presented in Panel B of Table 10. When the OVX is combined with either OVX_1.5, OVX_2.5 or OVX_3, neither β_1 nor β_2 is significant, although the latter has slightly lower p-values than the former in all three cases. In contrast, when the OVX is combined with the OVX_2.5 measure, both β_1 and β_2 are significant at the 10% level, with p-values of 8% and 6.5% respectively. Finally, and most notably, our empirical results show that the OVX_1 measure encompasses the OVX at the 10% level, since β_2 is significant (p-value of 7%) while β_1 is not (p-value of 20%).
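An encompassing regression of this kind can be sketched as follows. The example is illustrative only: it uses NumPy, plain homoskedastic OLS standard errors (an assumption made for brevity, not necessarily the inference behind Table 10), and hypothetical argument names.

```python
import numpy as np


def encompassing_regression(rv, f_ovx, f_er):
    """OLS of realized variance on a constant and two competing forecasts.

    Returns the coefficients (intercept, beta_1, beta_2) and their t-ratios.
    Plain homoskedastic OLS standard errors are used for brevity; this is a
    simplification, not necessarily the inference used in the paper.
    """
    y = np.asarray(rv, dtype=float)
    X = np.column_stack([np.ones_like(y), f_ovx, f_er])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stats = beta / np.sqrt(np.diag(cov))
    return beta, t_stats
```

Under this setup, f_er is said to encompass f_ovx when β_2 is statistically significant while β_1 is not, which is the pattern reported for OVX_1 above.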
In this section we analyze the robustness of our results to: (i) the construction of alternative option-implied measures; (ii) the size of the rolling-window used to estimate the models; (iii) the choice of out-of-sample period; (iv) the method used to bias correct the OIVs; and (v) the inclusion of overnight returns. In the interests of brevity, we only discuss the results for our main dataset, i.e. the full crude-oil sample spanning the 1996-2016 period. Detailed tables, displaying the estimation results of sections 6.1-6.5, are provided in a supplementary online appendix. 24

Alternative Option-Implied Measures
When constructing our option-implied measures, several methodological choices have to be made. The two most notable ones were presented in Section 3.3. First, a model that can generate option prices for arbitrary strikes is needed. Second, one needs to select the barrier levels that define each corridor measure. Below, we explain the rationale of our decisions and empirically investigate the robustness of our results by contrasting them with two reasonable alternatives.

Cubic Spline Implied Volatility Function
In theory, a continuum of observed option prices is required in order to evaluate equations (4) and (5) and calculate model-free and corridor implied volatilities. Since this is not feasible in practice, some assumptions are needed to generate option prices at arbitrary strikes. A wide collection of alternative methods can be employed for this task, with the cubic spline implied volatility function approach being quite popular in the literature, especially for the case of S&P 500 options. 25 Instead of fitting a polynomial function to our implied volatility data, we have opted for a parametric GB2 risk-neutral density (RND) estimation. 26 The key finding of our paper is that CIV measures contain more information about future crude-oil volatility than other natural benchmarks, i.e. forecasts obtained through the HAR model or the ATMIV, OVX and MFIV measures. Amongst these benchmarks, MFIV is based on the GB2 RND approach we have adopted. It is therefore possible that our CIV measures have outperformed MFIV simply because of a poor modeling assumption on our part.
To empirically investigate whether our RND choice had an impact on our findings, we have constructed an MFIV measure (MFIV_c) using a cubic spline approach as in Jiang & Tian (2005) and assessed its performance in predicting future crude-oil variance using regression analysis. What we found was that, for both univariate and bivariate regressions (which include the lagged RV), the MFIV_c had a lower R² than any CIV measure. Encompassing regressions that combine MFIV_c with each of the alternative OIV measures further underscored these results. In particular, the parameter corresponding to the MFIV_c measure was always insignificant, while the narrow corridor measures CIV1 and CIV2 encompassed the MFIV_c measure at the 10% level.

Corridor Selection vis-à-vis Andersen & Bondarenko (2007)
As noted in Section 3.3, the corridor width of each CIV we examine is determined by the quantiles of the risk-neutral distribution F_Q, i.e. for a given quantile level p, we discard options that have strike prices either lower than F_Q^{-1}(p) or higher than F_Q^{-1}(1 − p). Our resulting CIV measures are obtained by setting p = 0.45, 0.35, 0.25 or 0.1, so our selection differs slightly from that of Andersen & Bondarenko (2007). Specifically, in their study Andersen & Bondarenko (2007) set p = 0.25, 0.10, 0.05 or 0.025, so our empirical results do not include their two widest corridor measures. Our choice was guided by the fact that crude-oil options are far less liquid than those written on the S&P 500, so option prices at extreme strikes would generally be either highly unreliable or not traded at all. With that in mind, we have opted to examine two narrow corridor measures instead, i.e. CIV1 and CIV2.
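The quantile-based barrier selection can be sketched as below. For simplicity the example assumes a lognormal risk-neutral distribution centred on a forward price, purely as a stand-in for the GB2 density estimated in the paper; the function names and parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def lognormal_quantile(p, forward, vol, maturity):
    """Inverse CDF of a lognormal risk-neutral density with mean `forward`."""
    s = vol * math.sqrt(maturity)
    return forward * math.exp(-0.5 * s * s + s * NormalDist().inv_cdf(p))


def corridor_barriers(p, forward, vol, maturity):
    """Lower and upper barriers F_Q^{-1}(p) and F_Q^{-1}(1 - p)."""
    return (lognormal_quantile(p, forward, vol, maturity),
            lognormal_quantile(1.0 - p, forward, vol, maturity))


# Barriers for the four quantile levels used in the text (illustrative inputs).
for p in (0.45, 0.35, 0.25, 0.10):
    low, high = corridor_barriers(p, forward=50.0, vol=0.4, maturity=1 / 12)
    print(f"p = {p:.2f}: keep strikes in [{low:.2f}, {high:.2f}]")
```

Lower values of p push the barriers further into the tails, so the corridor widens and progressively more out-of-the-money options enter the calculation.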
In order to verify that the exclusion of extreme corridor measures is innocuous, we construct the two "extreme" CIV measures missing from our CIV collection, i.e. CIV_.05 and CIV_.025 (obtained by setting p = 0.05 and p = 0.025, respectively), and examine their performance using regression analysis. In particular, we make comparisons in terms of the adjusted R² of forecast regressions and conduct inference via encompassing specifications.
Regressing the realized monthly variance series on either CIV_.05 or CIV_.025 and comparing the resulting adjusted R² with those of the other OIV alternatives revealed that, in both cases, their performance is inferior to that of all other CIV measures. The same is true when lagged RV is included in the regression models. Moreover, encompassing regressions demonstrate that every CIV measure encompasses CIV_.05 and CIV_.025 at the 10% level, while for the case of the CIV1 this is true at the 5% level. All in all, and in line with the main findings of our paper, very wide corridors tend to be associated with inferior forecast performance, and that is demonstrably true for extreme measures such as CIV_.05 and CIV_.025.

Estimation Window
Thus far, the parameters for our bias-correction and HAR models have been estimated using rolling windows of 60 observations, or approximately five years' worth of data. To examine the robustness of our results to this modeling choice, we vary the estimation window between 66 and 96 observations, or between approximately five and eight years, and then re-evaluate our forecasts using the MSE, QLIKE and realized utility loss functions. 27 Reassuringly, our main conclusions continue to hold. In particular, for all estimation windows, the BC-CIV1 model has the lowest MSE, whilst the HAR-CIV1 has the lowest QLIKE and highest realized utility. Therefore, our results are robust to the choice of estimation window.

Sub-sample Analysis
In our analysis in Section 4, our out-of-sample period started in January 2001 and ended in April 2016. In order to check the robustness of our results to variations in the out-of-sample period, we vary the out-of-sample start date to be either January 2002, January 2003, . . ., or January 2012. This ensures we have sub-samples that both include and exclude the 2007-2008 financial crisis. We do not consider a start date beyond 2012 because this would result in an insufficient number of out-of-sample observations.
Using the MSE, QLIKE, and realized utilities as measures of forecast accuracy, we find that the pattern observed over the full out-of-sample period is retained in each of the sub-samples; the BC-CIV1 provides the most accurate forecasts according to the MSE, whilst the HAR-CIV1 forecasts are the most accurate according to the QLIKE and have the highest realized utility. Therefore, our results appear to be robust to the choice of different out-of-sample periods.

Alternative Bias Correction Procedure
We have also conducted robustness checks with respect to our bias correction technique. Thus far, the bias in OIVs has been corrected for by generating out-of-sample forecasts through either univariate (Equation 11) or augmented (Equation 12) regression models. We now examine an alternative technique inspired by Prokopczuk & Simen (2014), who show that the forecasting performance of the MFIV can be improved significantly by making a non-parametric adjustment for the variance risk premium. We apply their technique to all our OIVs. Since this method relies on averages of ratios, we will refer to it as a relative bias-correction.
More precisely, to implement the relative bias-correction, the average ratio of OIV to RV must be computed to get an estimate of the relative bias, which, in our case, is estimated using a window of τ monthly non-overlapping observations. The relative bias-corrected OIV is then obtained by adjusting the current OIV by this estimated relative bias. We evaluate the performance of the relative bias-correction forecasts using the MSE, QLIKE and realized utility loss functions. As benchmarks, we include our best-performing (non-relative) bias-corrected and HAR-based forecasts, i.e. BC-CIV1 and HAR-CIV1. To investigate the effect of the window size used in the relative bias-correction, we make comparisons for τ = {6, 12, . . . , 60}. 28 When the MSE is used as a ranking criterion, we find that the relative bias-corrected forecasts are more accurate than the BC-CIV1 and HAR-CIV1 forecasts, for all values of τ. Except for when τ = 24, the lowest MSE is generated by a narrow corridor measure (RBC-CIV2), which is consistent with our finding that narrow CIV measures are more informative about future volatility than ATMIV or MFIV. Notably, under either the QLIKE or realized utility loss functions, the HAR-CIV1 forecasts are the most accurate for all values of τ. So, overall, our main conclusions concerning the information content of narrow CIV measures, and particularly CIV1, remain largely unaltered.
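A minimal sketch of the relative bias-correction is given below. It assumes that the adjustment divides the current OIV by the trailing mean of the OIV-to-RV ratio, which is the natural reading of the description above but is stated here as an assumption; the function names and the pairing convention are illustrative.

```python
def relative_bias(oiv, rv, t, tau):
    """Average OIV/RV ratio over the tau matched pairs ending at index t.

    Assumed pairing: oiv[s] is the option-implied variance for period s and
    rv[s] is the variance subsequently realized over that same period.
    """
    ratios = [oiv[s] / rv[s] for s in range(t - tau + 1, t + 1)]
    return sum(ratios) / tau


def relative_bias_corrected(oiv_now, oiv_hist, rv_hist, tau):
    """Divide the current OIV by the trailing mean OIV/RV ratio (assumed form)."""
    t = len(oiv_hist) - 1
    return oiv_now / relative_bias(oiv_hist, rv_hist, t, tau)
```

For instance, if option-implied variance has exceeded realized variance by 25% on average over the window, a current OIV of 0.05 would be scaled down to 0.04.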

Including Overnight Returns
Finally, we examine whether our results are robust to the inclusion of overnight returns, i.e., the return between the close of the market on day t − 1 and the open of the market on day t, in the computation of RV. Thus far, RV has been computed using intraday returns, meaning that our RV estimates exclude volatility accumulated overnight, which we denote overnight IVAR. This is important for two reasons. Firstly, our sample covers the period January 1996 to April 2016 and therefore includes a time period prior to June 2006, when our crude-oil futures only traded 09:00-14:30 ET. Thus, our forecasting analysis spans a time period when the exclusion of overnight IVAR may have a material impact on our results. Secondly, our OIVs provide risk-neutral expectations about the total IVAR over 22-day forecast horizons. Consequently, our forecasting models that employ OIVs have not been fully utilizing the information available and our comparisons of forecast accuracy have not been for time periods that match the forecast horizon of the OIVs.
To compute RVs that estimate IVAR inclusive of the overnight period, we follow Hansen & Lunde (2005) and use three alternative RV estimators. The first is the irregular RV where the squared overnight return is simply added to the intraday RV measure. The second is the scaled RV, where an estimate of overnight IVAR is incorporated into RV by scaling-up the intraday RV. Specifically, the intraday RV is multiplied by a constant that ensures that the resulting estimator has the correct expected value (i.e. IVAR). The third estimator is the combination RV which is similar to the irregular RV, except that the weights applied to the squared overnight returns and the intraday RV are allowed to vary from unity. In particular, Hansen & Lunde (2005) demonstrate how to select these weights in an optimal way, i.e. compute the weights that minimize the variance of the estimated RV measure while keeping its expected value equal to the total IVAR.
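The three estimators can be sketched as follows. This is an illustration under stated assumptions rather than the exact implementation of Hansen & Lunde (2005): the scaling constant is taken to be the ratio of the sum of squared demeaned close-to-close returns to the sum of intraday RVs, the target mean of the whole-day variance is estimated as the sum of the two sample means, and the combination weights are the solution to the variance-minimization problem described above, computed from sample moments.

```python
from statistics import mean, pvariance


def irregular_rv(overnight_ret, intraday_rv):
    """Add the squared overnight return to each day's intraday RV."""
    return [r * r + v for r, v in zip(overnight_ret, intraday_rv)]


def scaled_rv(close_to_close_ret, intraday_rv):
    """Scale intraday RV so its sample mean matches whole-day return variance."""
    m = mean(close_to_close_ret)
    delta = sum((r - m) ** 2 for r in close_to_close_ret) / sum(intraday_rv)
    return [delta * v for v in intraday_rv]


def combination_weights(overnight_ret, intraday_rv):
    """Weights (w1, w2) on squared overnight returns and intraday RV.

    They minimize the sample variance of w1 * r^2 + w2 * RV subject to
    w1 * mu1 + w2 * mu2 = mu0, with mu0 estimated as mu1 + mu2 (assumption).
    """
    x1 = [r * r for r in overnight_ret]
    x2 = list(intraday_rv)
    mu1, mu2 = mean(x1), mean(x2)
    mu0 = mu1 + mu2
    eta1, eta2 = pvariance(x1), pvariance(x2)
    eta12 = mean((a - mu1) * (b - mu2) for a, b in zip(x1, x2))
    phi = ((mu2 ** 2 * eta1 - mu1 * mu2 * eta12)
           / (mu2 ** 2 * eta1 + mu1 ** 2 * eta2 - 2 * mu1 * mu2 * eta12))
    return (1 - phi) * mu0 / mu1, phi * mu0 / mu2


def combination_rv(overnight_ret, intraday_rv):
    """Weighted sum of squared overnight returns and intraday RV."""
    w1, w2 = combination_weights(overnight_ret, intraday_rv)
    return [w1 * r * r + w2 * v for r, v in zip(overnight_ret, intraday_rv)]
```

Setting both weights to one recovers the irregular RV, while a zero weight on the overnight term with a rescaled intraday term corresponds to the scaled RV.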
To evaluate the robustness of our results to the inclusion of overnight returns, we repeat our forecasting exercises using the three alternative IVAR estimators described above. We find that models that employ CIV1 deliver the most accurate forecasts irrespective of whether IVAR is estimated using the scaled RV or the combination RV estimator: these CIV1-based models generate the lowest MSE and QLIKE losses, as well as the highest realized utility. Again, these results support our conclusion that using CIV1 leads to superior volatility forecasts.
When the irregular RV is used, the HAR-ATMIV and HAR-MFIV appear to deliver the most accurate forecasts under the QLIKE and realized utility loss functions, respectively. However, Ahoniemi & Lanne (2013) provide empirical evidence that both the scaled RV and combination RV provide more reliable results when selecting amongst competing volatility forecasts due to the higher level of noise in the irregular RV. Therefore, we do not believe the results using the irregular RV provide sufficiently strong evidence to contradict our conclusion that using CIV1 leads to the most accurate volatility forecasts.

Conclusion
In this paper we evaluated, using both economic and statistical criteria, the information content of monthly crude-oil volatility forecasts extracted from the prices of traded options. We examined a variety of alternative option-implied measures including Black-Scholes at-the-money implied volatilities (ATMIV), model-free volatility expectations (MFIV), CBOE's "oil-VIX" (OVX) and, notably, corridor implied volatilities (CIV). Besides stand-alone comparisons, option-implied forecasts were also contrasted, and combined, with those obtained by a realized volatility model (HAR) that utilizes high-frequency return information.
Our key finding is that a particular CIV measure (CIV1), which utilizes a narrow range of option contracts, consistently generates the most accurate forecasts among all alternatives. In Mincer-Zarnowitz regressions, CIV1 achieves the highest R², whilst encompassing regression tests show that CIV1 subsumes the information contained in ATMIV, MFIV, OVX and HAR forecasts. Furthermore, under either a symmetric (MSE) or an asymmetric (QLIKE) loss function, CIV1-based forecasts deliver the lowest forecast errors. In terms of economic significance, incorporating the CIV1 into the HAR model leads to forecasts that generate a significantly higher realized utility, even when transaction costs are taken into account. All these findings remain intact for both our full sample (1996-2016) and a sub-sample (2007-2016) that only contains data after the commodities' "financialization" period.
Our results also provide valuable insights regarding the information content of the OVX index. In terms of predicting crude-oil volatility, we find that the OVX-based forecasts perform rather poorly. Regression-based tests show that the OVX is encompassed by CIV1, whilst OVX-based forecasts are typically the least accurate according to either statistical or economic loss functions. To the best of our knowledge, this is the first time the reliability of the OVX measure has been scrutinized, so the concerns we raise have direct implications for practitioners who often rely on the CBOE's volatility indices.
Finally, we repeat the same empirical exercise for the case of the USO ETF volatility (from 2007 onwards). Since the USO ETF attempts to track the price of crude-oil and its options underlie the construction of the OVX index, it becomes possible to further scrutinize our results. Notably, we find that the same CIV1 measure continues to generate the most accurate volatility forecasts among all alternatives. This underscores the robustness of our findings. The performance of the OVX, while somewhat improved in this dataset, continues to be unimpressive. Our complementary analysis of the cut-off point algorithm applied by the CBOE suggests that it introduces noise that impedes the forecast accuracy of the index. Overall, this paper contributes to the academic literature that assesses the forward-looking information embedded in the prices of crude-oil options. Given that measuring crude-oil risk is of paramount importance for a variety of economic agents, our empirical study is of value for policy-makers and investors alike.

Table 6: Realized utility and quantiles of the utility per unit wealth (UoW) for the bias-corrected and HAR-based forecasts. The realized utility is given in Equation (16) and the per period UoW is given in Equation (15). All values in the table are reported as percentages. In the columns reporting realized utility, boldface is used to highlight maximum values. Panel A reports the realized utility and quantiles of the UoW for forecasts calculated using the full sample. Panel B reports the realized utility and quantiles of the UoW for forecasts calculated using the Post-Fin sample. We also report the results of Diebold-Mariano tests for differences between the realized utility of the HAR forecasts, which are the benchmark forecasts, and each of the remaining competing forecasts. Newey-West standard errors were used. ***, ** and * indicate a significant difference to the benchmark at the 1, 5 and 10% levels, respectively.
Table 7: Net realized utility and transaction costs (TC) for the bias-corrected and HAR-based forecasts. The net realized utility is given in Equation (18) and TC is given in Equation (17). Net realized utility and TC are presented for c = 0.15% and c = 0.033%. All values in the table are reported as percentages. In the columns reporting net realized utility, boldface is used to highlight maximum values. Panel A reports the net realized utility and TC for forecasts calculated using the full sample. Panel B reports the net realized utility and TC for forecasts calculated using the Post-Fin sample. We also report the results of Diebold-Mariano tests for differences between the net realized utility of the HAR forecasts, which are the benchmark forecasts, and each of the remaining competing forecasts. Analogous Diebold-Mariano tests are conducted for TC, where the TC of the HAR forecasts are the benchmark. Newey-West standard errors were used in all Diebold-Mariano tests. ***, ** and * indicate a significant difference to the benchmark at the 1, 5 and 10% levels, respectively.

Table 9: Summary of forecasting results for the USO ETF. This table displays the adjusted R² for univariate and bivariate forecast regressions on monthly realized variance. Results for the univariate models are obtained by evaluating Equation (9), while for bivariate models, i.e. specifications that combine option-implied and HAR forecasts, they are obtained through Equation (10). The table also displays the MSE and QLIKE losses corresponding to bias-corrected and augmented forecasts, generated through Equations (11) and (12), respectively.