Reinforcement Learning

Train AI agents to trade using PPO, A2C, DQN and other RL algorithms

RlxEnv — Gym-Compatible Environment

RLX provides RlxEnv — a trading environment compatible with OpenAI Gym and Gymnasium. Train agents using Stable Baselines3, RLlib, or any Gym-compatible framework.

Python
from rlxbt import RlxEnv, load_data

# Load data
data = load_data("data/BTCUSDT_1h_with_indicators.csv")

# Create environment
env = RlxEnv(
    data=data,
    initial_capital=100000.0,
    window_size=32       # Observation window
)

# Standard Gym interface
obs, info = env.reset()
action = env.action_space.sample()
obs, reward, done, truncated, info = env.step(action)

Action Space

  • 0 (Hold): Close if open
  • 1 (Long): Open / Hold long
  • 2 (Short): Open / Hold short
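
As a quick sanity check of the action encoding, the sketch below always submits action 1 (Long) to get a crude buy-and-hold-style baseline for comparing trained agents against. It only reuses the constructor arguments and step signature shown on this page; the constant names are illustrative.

Python
from rlxbt import RlxEnv, load_data

# Action codes from the table above (constant names are just for readability)
HOLD, LONG, SHORT = 0, 1, 2

data = load_data("data/BTCUSDT_1h_with_indicators.csv")
env = RlxEnv(data=data, initial_capital=100000.0, window_size=32)

# Always go long: a crude buy-and-hold-style baseline
obs, info = env.reset()
done = truncated = False
while not (done or truncated):
    obs, reward, done, truncated, info = env.step(LONG)

print(f"Always-long baseline return: {info['total_return']*100:+.2f}%")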

Observation Space

The observation vector includes market data and account state:

  1. Market Window (window_size × 5) — Normalized OHLCV data
  2. Portfolio Value — Normalized current portfolio value
  3. Position Size — Signed position size (-1 to 1)
  4. Position Status — 1.0 if a position is open, 0.0 if closed

Total dimension: (window_size × 5) + 3 = 163 for window_size=32
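
A quick way to verify this is to inspect the observation returned by reset(). The sketch below assumes the observation is a flat NumPy array and that RlxEnv exposes the standard Gym observation_space attribute.

Python
from rlxbt import RlxEnv, load_data

data = load_data("data/BTCUSDT_1h_with_indicators.csv")
env = RlxEnv(data=data, initial_capital=100000.0, window_size=32)

obs, info = env.reset()
print(obs.shape)                    # expected: (163,) = 32 * 5 + 3
print(env.observation_space.shape)  # should match the observation above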

Exit Rules + RL = Synergy

The key insight: separate entry timing from risk management.

🤖 RL Agent Handles
  • WHEN to enter (timing)
  • Market state recognition
  • Pattern detection

🛡️ Exit Rules Handle
  • HOW to exit (risk)
  • Stop-loss protection
  • Take-profit targets

Python
# Exit rules for risk management
exit_rules = {
    "hold_bars": 12,              # Max 12 bars in position
    "max_drawdown_percent": 5.0,  # 5% stop-loss
    "min_profit_percent": 1.5,    # 1.5% take-profit
}

# Create environment with exit rules
env = RlxEnv(
    data=data,
    initial_capital=100000.0,
    window_size=32,
    exit_rules=exit_rules
)

Why This Works

The agent can focus on signal quality without learning complex risk management. Exit rules handle downside protection automatically, simplifying the reward function.

Training with Stable Baselines3

Python
from rlxbt import RlxEnv, load_data
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Load and split data
data = load_data("data/BTCUSDT_1h_with_indicators.csv")
train_data = data.iloc[:int(len(data)*0.7)].reset_index(drop=True)
test_data = data.iloc[int(len(data)*0.7):].reset_index(drop=True)

# Exit rules configuration
exit_rules = {
    "hold_bars": 12,
    "max_drawdown_percent": 5.0,
    "min_profit_percent": 1.5,
}

# Create vectorized environment
train_env = DummyVecEnv([
    lambda: RlxEnv(data=train_data, window_size=32, exit_rules=exit_rules)
])

# Train PPO agent
model = PPO(
    "MlpPolicy",
    train_env,
    verbose=1,
    ent_coef=0.02,
    learning_rate=3e-4
)
model.learn(total_timesteps=100_000)

# Evaluate on test set
test_env = RlxEnv(data=test_data, window_size=32, exit_rules=exit_rules)
obs, _ = test_env.reset()
done = False
truncated = False

while not (done or truncated):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = test_env.step(int(action))

print(f"Total Return: {info['total_return']*100:+.2f}%")
print(f"Sharpe Ratio: {info['sharpe_ratio']:.4f}")

Real Training Results

PPO agent trained on BTCUSDT 1h data (100K steps, 30% holdout test set):

Exit Rules             Return    Sharpe    Max DD   Trades
No Rules (baseline)    +14.31%   0.0390    26.67%   95
Conservative (2% SL)   +11.35%   0.0102    22.14%   2,475
Aggressive (5% SL)     -23.88%   -0.0200   40.20%   963

Exit Reason Distribution (Conservative)

  • Signal Exit (agent decisions): 95.0%
  • Max Drawdown (stop-loss triggered): 3.9%
  • Min Profit (take-profit triggered): 1.1%

Exit Rules Configurations

No Rules

Agent controls everything:

Python
exit_rules = None

Conservative

Tight risk management:

Python
exit_rules = {
    "hold_bars": 48,
    "max_drawdown_percent": 2.0,
    "min_profit_percent": 3.0
}

Aggressive

Wider stops:

Python
exit_rules = {
    "hold_bars": 12,
    "max_drawdown_percent": 5.0,
    "min_profit_percent": 1.5
}
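
To compare the three configurations side by side, the sketch below evaluates a single trained agent under each of them. It continues from the Stable Baselines3 example above (model and test_data) and simply mirrors the dictionaries shown in this section.

Python
from rlxbt import RlxEnv

# Continues from the training example above:
# `model` is the trained PPO agent, `test_data` the 30% holdout split.
configs = {
    "No Rules": None,
    "Conservative": {"hold_bars": 48, "max_drawdown_percent": 2.0, "min_profit_percent": 3.0},
    "Aggressive": {"hold_bars": 12, "max_drawdown_percent": 5.0, "min_profit_percent": 1.5},
}

for name, rules in configs.items():
    env = RlxEnv(data=test_data, window_size=32, exit_rules=rules)
    obs, _ = env.reset()
    done = truncated = False
    while not (done or truncated):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, truncated, info = env.step(int(action))
    print(f"{name:<13} return {info['total_return']*100:+.2f}%  "
          f"sharpe {info['sharpe_ratio']:.4f}")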

⚠️ RL Results Are Stochastic

Different training runs produce different results. For reliable conclusions, average across multiple seeds.

Best Practices

1. Use Exit Rules for Risk Management

Let the agent focus on timing. Exit rules handle downside protection.

2. Train on 70%, Test on 30%

Always validate on out-of-sample data to detect overfitting.

3. Use Entropy Regularization

Set ent_coef=0.02 to encourage exploration.

4. Average Multiple Seeds

Run training 3-5 times with different seeds and average the results, as in the sketch below.
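
A minimal sketch of that workflow, continuing from the data split and exit_rules defined in the training example (the number of seeds and the seed values are arbitrary):

Python
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from rlxbt import RlxEnv

# Continues from the training example: train_data, test_data, exit_rules
returns = []
for seed in (0, 1, 2):
    train_env = DummyVecEnv([
        lambda: RlxEnv(data=train_data, window_size=32, exit_rules=exit_rules)
    ])
    model = PPO("MlpPolicy", train_env, seed=seed,
                ent_coef=0.02, learning_rate=3e-4, verbose=0)
    model.learn(total_timesteps=100_000)

    test_env = RlxEnv(data=test_data, window_size=32, exit_rules=exit_rules)
    obs, _ = test_env.reset()
    done = truncated = False
    while not (done or truncated):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, truncated, info = test_env.step(int(action))
    returns.append(info["total_return"])

print(f"Mean return over seeds: {np.mean(returns)*100:+.2f}% "
      f"(std {np.std(returns)*100:.2f}%)")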

5. Match hold_bars to Timeframe

For 1h data, 12-48 bars = 0.5-2 days. Adjust based on your strategy horizon.
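
The conversion is simple arithmetic; the variable names below are illustrative and not part of RLX.

Python
# Convert a target holding period into hold_bars for a given candle timeframe
timeframe_hours = 1          # 1h candles
max_hold_hours = 24          # hold at most one day
hold_bars = int(max_hold_hours / timeframe_hours)   # -> 24

exit_rules = {
    "hold_bars": hold_bars,
    "max_drawdown_percent": 5.0,
    "min_profit_percent": 1.5,
}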