A Reinforcement Learning Scheme Prices Options Under Volatility Uncertainty

Key takeaways

High-dimensional pricing turns into a control problem
Backward actor-critic learns prices and controls together
C-vine policies keep correlation matrices valid
Tests compare favorably with Monte Carlo and learning baselines

When the volatility and correlation of several assets are uncertain, pricing an option turns into a control problem, not a simple formula. That is exactly the challenge the uncertain volatility model creates for multidimensional European options. The paper introduces a backward actor-critic stochastic policy gradient scheme that works step by step through time, pairing Proximal Policy Optimization with shallow neural networks for both the value function and the control policy. A key design choice is a squashed Gaussian policy built on a C-vine representation of correlation matrices, which keeps the correlation matrix positive semidefinite by construction. In numerical experiments on a range of multidimensional derivatives, the method produced accurate prices, stayed computationally efficient, and compared favorably with existing Monte Carlo and machine-learning benchmarks. The result is a practical way to tackle a high-dimensional pricing problem that quickly becomes hard for standard numerical methods.
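For readers who want the control problem written out, here is one standard way to state a multi-asset UVM price and the discrete dynamic programming step behind a backward scheme. It is a sketch under assumptions the summary does not spell out: a constant rate r, the seller's (supremum) price, and entrywise correlation bands. The paper's own notation, discounting, and admissible set may differ.

```latex
% Seller's worst-case price at time 0 over all admissible volatility/correlation paths
V_0 = \sup_{(\sigma,\rho)\in\mathcal{A}} \mathbb{E}\left[ e^{-rT}\, g\big(S^1_T,\dots,S^d_T\big) \right],
\qquad dS^i_t = r S^i_t\,dt + \sigma^i_t S^i_t\,dW^i_t,
\qquad d\langle W^i, W^j\rangle_t = \rho^{ij}_t\,dt

% Admissible controls: each volatility and correlation in its band, unit diagonal, PSD matrix
\mathcal{A} = \Big\{ \sigma^i_t \in [\underline{\sigma}^i, \overline{\sigma}^i],\;
\rho^{ij}_t \in [\underline{\rho}^{ij}, \overline{\rho}^{ij}],\;
\rho^{ii}_t = 1,\; \rho_t \succeq 0 \Big\}

% Discrete dynamic programming on t_0 < t_1 < \dots < t_N = T, solved backward from V_N = g
V_n(s) = \sup_{a \in A} \mathbb{E}\left[ e^{-r\Delta t}\, V_{n+1}\big(S_{t_{n+1}}\big)
\,\middle|\, S_{t_n} = s,\ (\sigma_{t_n}, \rho_{t_n}) = a \right]
```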

Three assets can send an option model into a maze. Each price moves on its own. The links between them move too. In the Uncertain Volatility Model, or UVM, the pricer knows only a bounded range for each volatility and correlation, not their exact values. That changes pricing from a neat formula into a search for the most cautious price: the worst case over every scenario those ranges allow. This paper asks whether reinforcement learning, a train-by-feedback method, can handle that search without breaking the math. The answer hinges on one idea. The policy must never ask for an impossible correlation pattern, one that no real set of assets could produce.
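With a single asset, the cautious-price search can still be done on a grid, which helps show what the learning scheme replaces. The sketch below is the classical one-asset UVM (the Black-Scholes-Barenblatt equation) solved by an explicit finite-difference scheme; it is not the paper's method. It assumes zero interest rates and a seller's price, and the names `uvm_price_1d` and `call_spread`, the 90/110 strikes, and the grid sizes are hypothetical choices for illustration. At every grid point and time step it picks the volatility in the band that raises the value most.

```python
import numpy as np

def uvm_price_1d(payoff, s0, maturity, sigma_lo, sigma_hi, s_max=300.0, n_space=300):
    """Seller's worst-case price of a one-asset European payoff when volatility may be
    anywhere in [sigma_lo, sigma_hi]; zero interest rate, explicit finite differences."""
    s = np.linspace(0.0, s_max, n_space + 1)
    ds = s[1] - s[0]
    # Explicit-scheme stability needs dt * sigma_hi^2 * s_max^2 / ds^2 <= 1; keep a 10% margin.
    n_steps = int(np.ceil(maturity * sigma_hi**2 * s_max**2 / ds**2 / 0.9))
    dt = maturity / n_steps
    v = payoff(s)                                  # terminal condition V(T, S) = g(S)
    for _ in range(n_steps):                       # march backward from T to 0
        gamma = np.zeros_like(v)                   # boundary nodes keep gamma = 0 (values frozen)
        gamma[1:-1] = (v[2:] - 2.0 * v[1:-1] + v[:-2]) / ds**2
        # Control step: take the volatility that raises the value most at each node,
        # sigma_hi where the value is convex (gamma > 0), sigma_lo where it is concave.
        sigma = np.where(gamma > 0.0, sigma_hi, sigma_lo)
        v = v + dt * 0.5 * sigma**2 * s**2 * gamma
    return float(np.interp(s0, s, v))

def call_spread(s):
    """Hypothetical test payoff: long a 90 call, short a 110 call."""
    return np.maximum(s - 90.0, 0.0) - np.maximum(s - 110.0, 0.0)

print(uvm_price_1d(call_spread, 100.0, 1.0, 0.1, 0.3))
```

Because the worst case can switch volatility from region to region, this price sits at or above the constant-volatility prices at either end of the band. The grid such a scheme needs grows exponentially with the number of assets, which is why it stops being practical for the multi-asset baskets the paper targets.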

When pricing becomes a search for the safest answer

The method tackles multidimensional European options, which pay off on several assets at once. In tests on a range of such derivatives, it produced accurate prices and stayed computationally efficient. Its results compare favorably with Monte Carlo, a random-sampling method, and with machine-learning baselines. The notable point is not only speed. The scheme also keeps the control search inside the admissible set: every volatility stays within its band, and every correlation matrix it proposes is a valid one. That matters because a robust price is only as sound as the scenarios behind it.

How the model keeps every guess legal

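The engine runs backward in time. A discrete dynamic programming principle, a step-by-step way to solve a sequential choice problem, breaks the pricing task into one decision per time step, starting from the payoff at maturity. An actor-critic loop then does the learning at each step. The actor proposes a control, a choice of volatilities and correlations. The critic scores it by estimating the resulting option value. Proximal Policy Optimization, or PPO, keeps each policy update small. Shallow neural networks approximate both the value function and the policy. The control side uses a squashed Gaussian policy, which samples from a bell curve and then passes the sample through a smooth squashing function so it lands inside the allowed bounds. A C-vine representation builds the correlation matrix from a sequence of partial correlations, each free to take any value between -1 and 1, and any such sequence yields a valid matrix.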
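A squashed Gaussian policy of the kind described above can be sketched in a few lines. This version assumes tanh as the squashing function (the choice popularized by Soft Actor-Critic); the summary does not say which squashing function the paper uses, and the name `sample_squashed_gaussian` and the 10% to 30% volatility band in the usage lines are hypothetical. The log-density includes the change-of-variables correction that a policy-gradient method needs.

```python
import numpy as np

def sample_squashed_gaussian(mean, log_std, low, high, rng):
    """Sample a bounded action from a tanh-squashed Gaussian; return (action, log_density)."""
    mean = np.asarray(mean, dtype=float)
    std = np.exp(log_std)
    u = mean + std * rng.standard_normal(mean.shape)   # unbounded Gaussian pre-activation
    y = np.tanh(u)                                     # squash into (-1, 1)
    half_width = 0.5 * (high - low)
    action = low + half_width * (y + 1.0)              # affine map into (low, high)
    # Log-density of u under N(mean, std^2), summed over action dimensions.
    log_prob_u = np.sum(-0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * np.log(2.0 * np.pi))
    # Change of variables: log|d action/du| = log(1 - tanh(u)^2) + log(half_width),
    # using log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)) for numerical stability.
    log_abs_det = np.sum(
        2.0 * (np.log(2.0) - u - np.logaddexp(0.0, -2.0 * u)) + np.log(half_width)
    )
    return action, log_prob_u - log_abs_det

# Hypothetical usage: two volatilities, each kept inside a 10%-30% band.
rng = np.random.default_rng(0)
vols, log_p = sample_squashed_gaussian(np.zeros(2), np.full(2, -1.0), 0.1, 0.3, rng)
print(vols, log_p)
```

Squashing, unlike clipping, keeps the map smooth and invertible, so the density of every action stays well defined for the gradient update.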

The backward pass learns later times first.
The actor proposes a move, and the critic scores it.
Shallow neural nets estimate both the price and the policy.
The C-vine policy keeps correlation matrices valid.
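PPO's rule of keeping each update small is usually implemented as the clipped surrogate objective below. This is a generic sketch of PPO-Clip, not the paper's training code: the function name `ppo_clip_objective`, the clip width of 0.2 (a common default), and the advantage estimates it receives are assumptions, since the summary does not say how the paper defines them.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """PPO-Clip surrogate objective (to be maximized), averaged over a batch."""
    ratio = np.exp(log_prob_new - log_prob_old)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # The elementwise minimum removes any gain from pushing the ratio outside
    # [1 - clip_eps, 1 + clip_eps], which is what keeps each policy update small.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

In an actor-critic loop, the critic's value estimates would supply the advantages, and the actor's parameters would be moved to increase this objective for a few gradient steps per batch.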

“positive semidefiniteness by construction.”

From the abstract
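The quoted property can be checked with a small construction. A standard way to turn C-vine partial correlations into a correlation matrix is to build a lower-triangular factor row by row; because each row has unit length, the product of the factor with its transpose has ones on the diagonal and is positive semidefinite automatically. This is a sketch: the function name `cvine_to_correlation` is hypothetical, and how the paper orders the assets or feeds policy outputs into the vine is not stated here. Feeding it tanh-squashed values, which always lie in (-1, 1), is one way a squashed Gaussian policy could plug in.

```python
import numpy as np

def cvine_to_correlation(partial):
    """Map C-vine partial correlations to a correlation matrix.

    partial[k, i] for k < i is the partial correlation of variables k and i given
    variables 0..k-1, and must lie in [-1, 1]. Entries on or below the diagonal
    are ignored.
    """
    d = partial.shape[0]
    chol = np.zeros((d, d))
    chol[0, 0] = 1.0
    for i in range(1, d):
        remaining = 1.0                     # product over l < k of (1 - partial[l, i]^2)
        for k in range(i):
            chol[i, k] = partial[k, i] * np.sqrt(remaining)
            remaining *= 1.0 - partial[k, i] ** 2
        chol[i, i] = np.sqrt(remaining)     # row i of chol now has unit Euclidean norm
    return chol @ chol.T                    # L L^T: PSD by construction, unit diagonal

# Hypothetical usage: any unconstrained numbers, squashed into (-1, 1), give a valid matrix.
rng = np.random.default_rng(0)
corr = cvine_to_correlation(np.tanh(rng.standard_normal((4, 4))))
print(np.linalg.eigvalsh(corr).min())
```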

Why that constraint changes the game

Robust pricing matters when a desk wants a cautious number, not a fragile one. The UVM asks for the worst credible case inside a bounded set. That is hard once many assets enter the scene, because the number of volatilities and pairwise correlations to choose grows with the square of the basket size. This method gives that search a practical shape. It turns the task into a backward learning loop. It also keeps the policy inside the legal region from the start. That matters because a single invalid correlation matrix cannot drive a simulation of correlated asset moves, and in a backward scheme a bad value at one time step feeds into every earlier step. The paper’s tests show a method that stays accurate and workable.
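To see why one invalid correlation choice is fatal, take three pairwise correlations that each look fine on their own but cannot coexist. The matrix below has a negative eigenvalue, so NumPy's Cholesky factorization, a common way to turn independent normal draws into correlated ones, refuses it. The helper `can_cholesky` is a hypothetical name for this illustration; note that Cholesky requires a strictly positive definite matrix.

```python
import numpy as np

def can_cholesky(corr):
    """Return True if NumPy can Cholesky-factor corr (requires positive definiteness)."""
    try:
        np.linalg.cholesky(corr)
        return True
    except np.linalg.LinAlgError:
        return False

# Each entry is inside [-1, 1], yet assets 0-1 and 1-2 move strongly together
# while 0-2 move strongly apart, which no real trio of assets can do.
bad = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])

print(np.linalg.eigvalsh(bad).min())   # -0.8 up to rounding: not positive semidefinite
print(can_cholesky(bad))               # False: cannot be used to simulate correlated shocks
```

A policy that proposed the three correlations independently could land on a matrix like this one; a C-vine parametrization rules it out before any simulation starts.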

What this opens next

The surprise is simple. A learning system can explore this market maze and still obey the matrix rules. That opens a concrete path for multidimensional robust pricing. Standard methods can bog down there. The next test is the same UVM setting with more assets in the option basket. If the backward policy still works there, robust pricing could stay practical as dimension climbs. That would make the legal-policy trick more than a neat idea. It would make it a working tool for harder markets.
