- Static fit favored Adaptive-QLBS
- Shortfall risk favored RLOP
- Trading costs changed the game
- Price fit and hedge safety split apart
An option model can look accurate on a chart and still fail when real trades are on the line. That gap matters because hedging is supposed to protect you when prices move, not just match a formula. This paper compares two reinforcement-learning approaches for option hedging under trading costs: Adaptive-QLBS, an extension of QLBS (the Q-Learner in Black-Scholes), and RLOP, short for Replication Learning of Option Pricing. Using SPY and XOP option data, the authors find that Adaptive-QLBS gives higher static pricing accuracy in implied volatility space, while RLOP does better on the dynamic side by reducing shortfall probability, the chance that a hedge underperforms the option payoff. The takeaway goes beyond a single accuracy score: option pricing models should be judged by realized hedging outcomes as well as by their fit to market prices.
SPY and XOP option data set up a blunt test. The question was simple. Which model looks best on paper, and which one leaves the cleaner hedge? Adaptive-QLBS won the price race. RLOP won the safety race. That split is the surprise. A model can look right in implied volatility, the market's built-in guess for future swings, and still miss the payoff when trades get real. Shortfall risk is the chance that the hedge ends below the option payoff. Fees make that risk worse. Risk aversion, a built-in dislike of uncertain outcomes, pushes a model to care about that gap. This work says price fit alone is not enough.
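In symbols, one common way to write shortfall probability is below. It assumes a European call with strike K, an underlying price S_T at expiry, and a self-financing hedge whose terminal value V_T is already net of all trading fees. The notation is ours, not necessarily the paper's; the second line is the estimate you would compute over N simulated or backtested paths.

```latex
p_{\text{short}} = \mathbb{P}\left( V_T < (S_T - K)^{+} \right),
\qquad (x)^{+} = \max(x, 0)

\hat{p}_{\text{short}} = \frac{1}{N} \sum_{i=1}^{N}
  \mathbf{1}\left\{ V_T^{(i)} < \left(S_T^{(i)} - K\right)^{+} \right\}
```

Fees enter only through V_T. Holding the hedge positions fixed, a higher fee can only push each V_T^{(i)} down, which is why trading costs make shortfall risk worse.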
Why the best-looking price was not the safest hedge
Adaptive-QLBS fit prices better in implied volatility space, the market's built-in guess for future swings. In other words, once each price was translated into that common yardstick, its marks sat closer to market quotes. RLOP told a different story. It cut shortfall probability, the chance that a hedge ends below the option payoff. That matters in live trading because a cheap-looking hedge can still fail when the bill comes due. The comparison ran on SPY and XOP options, so the test covered both a broad U.S. stock market fund and an oil-and-gas sector fund. The result splits the task in two. One method better matches what traders see on the screen. The other better guards against what they fear at the end of the trade.
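To make "translated into that common yardstick" concrete, here is a minimal sketch: invert the Black-Scholes formula to turn each option price into an implied volatility, then score a model by the root-mean-square gap between its implied volatilities and the market's. It assumes European calls, no dividends, a flat rate r, and prices inside no-arbitrage bounds. The helper names (bs_call, implied_vol, iv_rmse) are hypothetical, and the paper's exact accuracy metric may differ.

```python
import math


def norm_cdf(x: float) -> float:
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(s: float, k: float, t: float, r: float, sigma: float) -> float:
    # Black-Scholes price of a European call on a non-dividend asset.
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


def implied_vol(price, s, k, t, r, lo=1e-4, hi=5.0, tol=1e-8):
    # Bisection works because the call price rises monotonically with sigma.
    # Assumes the price lies strictly inside the no-arbitrage bounds.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(s, k, t, r, mid) > price:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def iv_rmse(model_prices, market_prices, s, strikes, t, r):
    # Static fit score: RMS difference between model and market implied vols.
    errs = [
        implied_vol(m, s, k, t, r) - implied_vol(q, s, k, t, r)
        for m, q, k in zip(model_prices, market_prices, strikes)
    ]
    return math.sqrt(sum(e * e for e in errs) / len(errs))
```

Scoring in volatility units rather than dollars keeps cheap out-of-the-money options and expensive in-the-money options on the same footing, which is one reason implied volatility is the usual yardstick for price fit.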
SPY and XOP: both methods tested
- Adaptive-QLBS fit prices better in implied volatility space.
- RLOP cut shortfall probability in the hedge test.
- SPY and XOP gave both methods a real market check.
“Adaptive-QLBS achieves higher static pricing accuracy in implied volatility space, while RLOP delivers superior dynamic hedging performance by reducing shortfall probability.”
How reinforcement learning was bent toward real trades
QLBS starts from Black-Scholes, a classic option formula for an idealized world with continuous trading and no trading costs. The new version adds risk aversion, which means it penalizes uncertainty in the hedge's outcome, not just a poor average result. It also adds trading costs, the fees that eat into each rebalancing trade. RLOP goes after replication, which means it tries to reproduce the option's payoff through a sequence of trades in the underlying asset. Both methods fit inside standard reinforcement learning, a trial-and-error way to learn choices from reward. That makes them flexible enough for market frictions, where every trade can shave off value.
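The replication idea under trading costs can be illustrated with a plain discrete delta hedge rather than either learned policy. This is a minimal sketch, assuming geometric Brownian motion prices, Black-Scholes deltas, and a hypothetical proportional fee cost_rate charged on every rebalance. It is a baseline for intuition, not Adaptive-QLBS or RLOP, and every name and parameter value is an illustrative choice.

```python
import math
import random


def norm_cdf(x: float) -> float:
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_and_delta(s, k, tau, r, sigma):
    # Black-Scholes call price and delta for time to expiry tau > 0.
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    price = s * norm_cdf(d1) - k * math.exp(-r * tau) * norm_cdf(d2)
    return price, norm_cdf(d1)


def shortfall_probability(cost_rate, n_paths=2000, n_steps=50, s0=100.0,
                          k=100.0, t=0.5, r=0.01, sigma=0.2, seed=7):
    """Share of simulated paths where a delta hedge ends below the call payoff."""
    rng = random.Random(seed)  # fixed seed: identical price paths for every cost_rate
    dt = t / n_steps
    premium, _ = bs_call_and_delta(s0, k, t, r, sigma)
    shortfalls = 0
    for _ in range(n_paths):
        s, cash, shares = s0, premium, 0.0
        for step in range(n_steps):
            _, delta = bs_call_and_delta(s, k, t - step * dt, r, sigma)
            trade = delta - shares
            # Buy or sell shares and pay a proportional fee on the traded notional.
            cash -= trade * s + cost_rate * abs(trade) * s
            shares = delta
            # Advance the price one GBM step; cash earns the risk-free rate.
            z = rng.gauss(0.0, 1.0)
            s *= math.exp((r - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
            cash *= math.exp(r * dt)
        hedge_value = cash + shares * s  # final position marked at S_T, no exit fee
        if hedge_value < max(s - k, 0.0):
            shortfalls += 1
    return shortfalls / n_paths
```

Calling shortfall_probability(0.0) and then shortfall_probability(0.005) shows fees at work: the hedge positions do not depend on the fee and the seed fixes the paths, so raising cost_rate can only lower each path's terminal value, and the estimated shortfall probability cannot fall. A reinforcement-learning hedger in this framing would replace the fixed Black-Scholes delta with a learned action that trades hedge error against fees.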
Why the split matters
That split changes how you judge an option model. A pretty fit in implied volatility can hide a weak hedge. A strong hedge can also miss the neatest price mark. So the scorecard has to widen. Traders and risk teams need both static fit and realized hedge results. Static fit asks whether the model matches the market's price language. Dynamic hedge results ask whether the position survives when trades cost money. RLOP makes that second question harder to ignore. It puts shortfall risk on the front page. Adaptive-QLBS keeps its edge on price fit. Together, they show that one score no longer tells the full story.
What the result rules out
The sharp lesson is simple. A good price fit does not buy a safe hedge. That means judging option models on price fit alone looks too narrow. A desk now needs two scorecards. One tracks price fit. One tracks shortfall probability under trading costs. SPY and XOP make that split easy to see. Adaptive-QLBS wins the fit race. RLOP wins the protection race. The goal line for option pricing now has two parts: match the market, and survive the hedge.
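A two-scorecard report can be as small as the sketch below. It assumes you already have model and market implied volatilities plus terminal hedge values (net of fees) and option payoffs; the Scorecard type and score function are hypothetical, and the inputs in the test are toy numbers invented for illustration, not the paper's SPY or XOP results.

```python
import math
from dataclasses import dataclass


@dataclass
class Scorecard:
    iv_rmse: float         # static fit: RMS gap to market implied vols (lower is better)
    shortfall_prob: float  # dynamic safety: share of hedges ending below payoff (lower is better)


def score(model_ivs, market_ivs, hedge_values, payoffs) -> Scorecard:
    # Price-fit card: root-mean-square implied-volatility error.
    sq = [(m - q) ** 2 for m, q in zip(model_ivs, market_ivs)]
    iv_err = math.sqrt(sum(sq) / len(sq))
    # Hedge card: fraction of paths where the hedge value falls short of the payoff.
    short = sum(1 for v, p in zip(hedge_values, payoffs) if v < p) / len(payoffs)
    return Scorecard(iv_err, short)
```

Keeping the two numbers separate, instead of blending them into one weighted score, is the point: a model can win one card and lose the other, and a single number would hide that.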
