Key takeaways
  • Order-flow imbalance as the state
  • Policy RL over tabular Q-learning
  • GRPO and GSPO as group-aware updates
  • Better PnL, profitability, and drawdown

If you want a trading bot that reacts to rapid order-book changes, this paper says old-school Q-learning may not be the best teacher. The authors use order-flow imbalance, a compact summary of buying and selling pressure in the limit order book, to train policy-based reinforcement learning agents. They test vanilla PPO plus DeepSeekMath-inspired GRPO and GSPO, which use group-normalized updates and downside-aware shaping, on AMZN, AAPL, and GOOG. Under a simplified backtesting setup with spread-scaled rewards, these newer policies improve net average PnL, profitability, and drawdown over the Q-learning baseline. The paper’s bigger claim is simple: order-flow signals can work as a state representation for policy reinforcement learning, and group-aware PPO-style surrogates can outperform value-based methods in this directional trading setting. This is not a market-making system. The agent makes directional decisions from forecast-style states, rather than posting bid and ask quotes or managing inventory.

At the top of the book, a stock can look calm and still be gathering pressure. This system tries to read that pressure from order-flow imbalance, then turn it into a directional trade on AMZN, AAPL, and GOOG. The surprising part is not that the agent learns from a compact state; it is that policy-based reinforcement learning, helped by DeepSeekMath-style group updates, beats a Q-learning baseline in simplified backtests. If you have watched a price flicker on a trading app and wondered what sits behind the move, this is the logic: compress the book into the signals that matter, then let the policy learn when to act. The reward is scaled by the spread, so gains are measured against a basic source of trading friction even though the backtest itself is simplified.
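
One way to picture a spread-scaled reward is the minimal sketch below. It is an assumption, not the paper's formula: the function name `spread_scaled_reward`, the `cost_in_spreads` parameter, and the choice to charge half a spread per trade are all hypothetical. The only idea taken from the text is that a directional gain is expressed in units of the quoted spread.

```python
def spread_scaled_reward(direction, mid_before, mid_after, spread,
                         cost_in_spreads=0.5):
    """Reward for one directional trade, in units of the quoted spread.

    direction: +1 for long, -1 for short, 0 for no trade.
    cost_in_spreads: hypothetical friction charge (half a spread by default),
    applied only when a trade is actually taken.
    """
    if spread <= 0:
        raise ValueError("spread must be positive")
    if direction == 0:
        return 0.0
    # Signed mid-price move, divided by the spread so that a one-spread
    # move in the chosen direction scores 1.0 before costs.
    gross = direction * (mid_after - mid_before) / spread
    return gross - cost_in_spreads


# A long trade that gains 3 cents against a 2-cent spread.
print(spread_scaled_reward(+1, 100.00, 100.03, 0.02))
```

Scaling by the spread makes rewards comparable across stocks with different tick and price levels, which is one plausible reason to prefer it over raw dollar PnL.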

Why a smaller state can still catch the move

Across AMZN, AAPL, and GOOG, the policy agents beat the tabular Q-learning baseline on the main backtest measures: net average PnL, profitability, and drawdown. Vanilla PPO points in the right direction, but the DeepSeekMath-inspired variants, GRPO and GSPO, push farther because they normalize updates across a group and shape the objective to care about downside as well as upside. That matters in trading, where a model can look sharp if it wins often but still bleed slowly when losses cluster. The result says two things at once: order-flow signals carry enough information to serve as the agent's state, and the training rule matters alongside the input representation.
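
The three backtest measures can be sketched from a list of per-trade PnL values. The definitions below are assumptions chosen for illustration: "profitability" is read as the share of trades with positive PnL, and drawdown as the largest peak-to-trough drop of cumulative PnL. The paper may define either differently, and `backtest_summary` is a hypothetical helper name.

```python
def backtest_summary(trade_pnls):
    """Summarize a backtest from per-trade net PnL values (costs already deducted)."""
    if not trade_pnls:
        raise ValueError("need at least one trade")

    n = len(trade_pnls)
    net_average_pnl = sum(trade_pnls) / n
    # Assumed definition: fraction of trades that made money.
    profitability = sum(1 for p in trade_pnls if p > 0) / n

    # Max drawdown: largest drop of cumulative PnL below its running peak.
    equity = 0.0
    peak = 0.0
    max_drawdown = 0.0
    for p in trade_pnls:
        equity += p
        peak = max(peak, equity)
        max_drawdown = max(max_drawdown, peak - equity)

    return {
        "net_average_pnl": net_average_pnl,
        "profitability": profitability,
        "max_drawdown": max_drawdown,
    }


summary = backtest_summary([1.0, -0.5, -1.0, 2.0, -0.25])
print(summary)
```

Reporting drawdown next to the average is what exposes the "wins often but bleeds when losses cluster" pattern: two strategies with the same average PnL can have very different worst stretches.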

Backtests on 3 stocks (AMZN, AAPL, GOOG):
  • Tabular Q-learning anchors the value-based baseline.
  • Vanilla PPO provides the plain policy-gradient reference.
  • GRPO adds group-normalized updates and downside-aware shaping.
  • GSPO tests another DeepSeekMath-inspired surrogate.

Our results show that (1) Order-Flow signals are an adequate state for policy RL and (2) group-aware PPO surrogates are preferable over value-based baselines.

From the abstract

How DeepSeekMath ideas reach the order book

This setup stays directional rather than market-making, so the agent does not post both sides of the book or juggle inventory. Instead, it starts from order-flow features, a compact summary of buying and selling pressure in the limit order book, and uses policy-gradient methods to choose directional actions from forecast-style states. PPO gives the plain policy update. GRPO and GSPO borrow the DeepSeekMath idea of group-normalized updates, so one run gets judged against its peers rather than in isolation, and downside-aware shaping pushes the model to respect painful losses, not only gains. In a backtest built on spread-scaled rewards, that design keeps learning close to the friction the trader actually faces.
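
The group-normalized update can be sketched as follows. This is a simplified illustration, not the paper's implementation: the advantage follows the GRPO idea of standardizing each reward against the mean and standard deviation of its group, while `shape_downside` and its `downside_weight` are hypothetical stand-ins for downside-aware shaping, whose exact form the text does not give. The clipped surrogate is the standard PPO objective for a single sample; GSPO's variant is not shown.

```python
import math
import statistics


def shape_downside(reward, downside_weight=2.0):
    """Hypothetical shaping: amplify losses, leave gains unchanged."""
    return reward * downside_weight if reward < 0 else reward


def group_normalized_advantages(rewards, downside_weight=2.0, eps=1e-8):
    """Score each reward relative to its group: (r - mean) / std."""
    shaped = [shape_downside(r, downside_weight) for r in rewards]
    mean = statistics.fmean(shaped)
    std = statistics.pstdev(shaped)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in shaped]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO clipped objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four sampled outcomes for the same state, in spread units.
group_rewards = [1.0, -1.0, 0.5, 2.0]
advantages = group_normalized_advantages(group_rewards)
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed to compute the advantage, and the amplified loss ends up with the most negative advantage in the group.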

The scope is deliberately narrow. The agent does not post bid and ask quotes, manage queue position, or tune inventory around two-sided liquidity provision. That matters because it turns the problem into a cleaner test of forecasting and action choice. The agent looks at forecast-style states, then picks a direction; the hard part is not quoting around the spread but deciding whether the pressure in the book justifies a trade at all. On that smaller stage, order-flow imbalance gives the learner a compact view of the market's push and pull.
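
For readers who have not met order-flow imbalance, the sketch below computes it from consecutive best bid and ask quotes using the widely used top-of-book definition: size added at or above the previous bid counts as buying pressure, size added at or below the previous ask counts as selling pressure. Whether the paper uses exactly this form, deeper book levels, or some normalization is not stated here, so treat the formula as an assumption; `Quote`, `ofi_increment`, and `ofi_series` are illustrative names, and prices are integer cents to avoid float comparisons.

```python
from typing import NamedTuple


class Quote(NamedTuple):
    bid_px: int  # best bid price, in cents
    bid_sz: int  # size resting at the best bid
    ask_px: int  # best ask price, in cents
    ask_sz: int  # size resting at the best ask


def ofi_increment(prev: Quote, curr: Quote) -> int:
    """Order-flow imbalance contributed by one quote update."""
    # Bid side: new size counts if the bid held or improved;
    # old size is subtracted if the bid held or fell.
    bid_flow = (curr.bid_sz if curr.bid_px >= prev.bid_px else 0) \
        - (prev.bid_sz if curr.bid_px <= prev.bid_px else 0)
    # Ask side: mirror image of the bid side.
    ask_flow = (curr.ask_sz if curr.ask_px <= prev.ask_px else 0) \
        - (prev.ask_sz if curr.ask_px >= prev.ask_px else 0)
    return bid_flow - ask_flow


def ofi_series(quotes):
    """OFI increments for each consecutive pair of quotes."""
    return [ofi_increment(a, b) for a, b in zip(quotes, quotes[1:])]


quotes = [
    Quote(10000, 500, 10002, 400),
    Quote(10000, 700, 10002, 300),  # bid size grows, ask size shrinks
    Quote(10001, 200, 10002, 300),  # bid price improves
    Quote(10000, 600, 10001, 500),  # bid retreats, ask steps down
]
print(ofi_series(quotes))  # [300, 200, -700]
```

Summing these increments over a short window gives a single signed number per interval, which is the kind of compact state the agent could act on.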


Why it matters for directional trading

The practical gain is simplicity without pretending the market is simple. A full limit order book can swamp a learner, but order-flow signals shrink it to the part that seems to matter for direction. That makes policy RL more plausible here, because the agent does not need a value table over a huge state space to do useful work. The result also nudges trading design toward training rules that care about the shape of returns, not just their average size. The edge of GRPO and GSPO is attributed to group-normalized updates and downside awareness, which make the model less likely to be rewarded for reckless spikes that later collapse into drawdown.
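
For contrast, a tabular Q-learning baseline keeps one value per (state, action) pair and updates it with the standard temporal-difference rule. The sketch below shows that rule only; the state labels (discretized order-flow buckets), the three-action set, and the `alpha` and `gamma` values are hypothetical, since the text does not describe the baseline's configuration.

```python
from collections import defaultdict

ACTIONS = (-1, 0, +1)  # short, stay flat, long (assumed action set)


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the TD target."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def greedy_action(q, state):
    """Pick the action with the highest current value estimate."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
q_update(q, "buy_pressure", +1, reward=1.0, next_state="neutral")
print(greedy_action(q, "buy_pressure"))
```

The table grows with the number of distinct states, which is why a compact order-flow state helps the baseline too, and why a parameterized policy becomes attractive once the state is continuous.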

What to test next

The next test is simple to name: take the same policy stack beyond AMZN, AAPL, and GOOG, and beyond the simplified spread-scaled backtest that makes this study tractable. If the pattern holds there, then a compact order-flow state plus a group-aware policy update could become a serious alternative to value tables for directional trading. If it breaks, the fault line will be clearer too, because the result already shows where to look: not just at the state, but at the training rule that turns that state into action.