Train AI agents to trade using PPO, A2C, DQN and other RL algorithms
RLX provides RlxEnv — a trading environment compatible with OpenAI Gym and Gymnasium. Train agents using Stable Baselines3, RLlib, or any Gym-compatible framework.
```python
from rlxbt import RlxEnv, load_data

# Load data
data = load_data("data/BTCUSDT_1h_with_indicators.csv")

# Create environment
env = RlxEnv(
    data=data,
    initial_capital=100000.0,
    window_size=32,  # Observation window
)

# Standard Gym interface
obs, info = env.reset()
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
```

The observation vector includes market data and account state:
- Market data: window_size × 5 values (5 features per bar over the observation window)
- Account state: 3 values
- Total dimension: (window_size × 5) + 3 = 163 for window_size=32
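As a quick sanity check, you can verify this at runtime. This is a sketch that assumes the observation is a flat vector, as the dimension formula above implies:

```python
# Verify the observation dimension: (window_size * 5) + 3 = 163 for window_size=32
obs, info = env.reset()
expected_dim = 32 * 5 + 3  # = 163
assert obs.shape == (expected_dim,), obs.shape
print(env.observation_space)  # standard Gym attribute describing the space
```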
The key insight: separate entry timing from risk management.
```python
# Exit rules for risk management
exit_rules = {
    "hold_bars": 12,              # Max 12 bars in position
    "max_drawdown_percent": 5.0,  # 5% stop-loss
    "min_profit_percent": 1.5,    # 1.5% take-profit
}

# Create environment with exit rules
env = RlxEnv(
    data=data,
    initial_capital=100000.0,
    window_size=32,
    exit_rules=exit_rules,
)
```

Why This Works
The agent can focus on signal quality without learning complex risk management. Exit rules handle downside protection automatically, simplifying the reward function.
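For intuition only, here is a minimal sketch of the kind of per-bar check such exit rules imply. The `should_exit` helper is hypothetical and not part of the RlxEnv API; the real environment may, for example, measure drawdown from the position's peak rather than from the entry price.

```python
# Illustrative sketch of rule-based exits (hypothetical, not RlxEnv internals)
def should_exit(bars_held: int, entry_price: float, current_price: float,
                rules: dict) -> bool:
    pnl_pct = (current_price - entry_price) / entry_price * 100.0
    if bars_held >= rules["hold_bars"]:            # time-based exit
        return True
    if pnl_pct <= -rules["max_drawdown_percent"]:  # stop-loss
        return True
    if pnl_pct >= rules["min_profit_percent"]:     # take-profit
        return True
    return False
```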
```python
from rlxbt import RlxEnv, load_data
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Load and split data (70% train / 30% out-of-sample test)
data = load_data("data/BTCUSDT_1h_with_indicators.csv")
train_data = data.iloc[:int(len(data) * 0.7)].reset_index(drop=True)
test_data = data.iloc[int(len(data) * 0.7):].reset_index(drop=True)

# Exit rules configuration
exit_rules = {
    "hold_bars": 12,
    "max_drawdown_percent": 5.0,
    "min_profit_percent": 1.5,
}

# Create vectorized environment
train_env = DummyVecEnv([
    lambda: RlxEnv(data=train_data, window_size=32, exit_rules=exit_rules)
])

# Train PPO agent
model = PPO(
    "MlpPolicy",
    train_env,
    verbose=1,
    ent_coef=0.02,
    learning_rate=3e-4,
)
model.learn(total_timesteps=100_000)

# Evaluate on the test set
test_env = RlxEnv(data=test_data, window_size=32, exit_rules=exit_rules)
obs, _ = test_env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = test_env.step(int(action))
    done = terminated or truncated

print(f"Total Return: {info['total_return']*100:+.2f}%")
print(f"Sharpe Ratio: {info['sharpe_ratio']:.4f}")
```

PPO agent trained on BTCUSDT 1h data (100K steps, 30% holdout test set):
| Exit Rules | Return | Sharpe | Max DD | Trades |
|---|---|---|---|---|
| No Rules (baseline) | +14.31% | 0.0390 | 26.67% | 95 |
| Conservative (2% SL) | +11.35% | 0.0102 | 22.14% | 2,475 |
| Aggressive (5% SL) | -23.88% | -0.0200 | 40.20% | 963 |
No Rules (baseline):

```python
exit_rules = None
```

Conservative (2% stop-loss):

```python
exit_rules = {
    "hold_bars": 48,
    "max_drawdown_percent": 2.0,
    "min_profit_percent": 3.0,
}
```

Aggressive (5% stop-loss):

```python
exit_rules = {
    "hold_bars": 12,
    "max_drawdown_percent": 5.0,
    "min_profit_percent": 1.5,
}
```
⚠️ RL Results Are Stochastic

Different training runs produce different results. For reliable conclusions, average results across multiple seeds.
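A sketch of that workflow, reusing `train_data`, `test_data`, and `exit_rules` from the training example above (the exact evaluation protocol is up to you):

```python
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

returns = []
for seed in range(5):
    # Train one agent per seed on the same training split
    train_env = DummyVecEnv([
        lambda: RlxEnv(data=train_data, window_size=32, exit_rules=exit_rules)
    ])
    model = PPO("MlpPolicy", train_env, ent_coef=0.02,
                learning_rate=3e-4, seed=seed, verbose=0)
    model.learn(total_timesteps=100_000)

    # Evaluate each seed on the same out-of-sample split
    test_env = RlxEnv(data=test_data, window_size=32, exit_rules=exit_rules)
    obs, _ = test_env.reset()
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = test_env.step(int(action))
        done = terminated or truncated
    returns.append(info["total_return"])

print(f"Mean return over seeds: {np.mean(returns)*100:+.2f}% "
      f"(std {np.std(returns)*100:.2f}%)")
```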
- Let the agent focus on timing; exit rules handle downside protection.
- Always validate on out-of-sample data to detect overfitting.
- Set `ent_coef=0.02` to encourage exploration.
- Run training 3-5 times with different seeds and average the results (see the sketch above).
- For 1h data, 12-48 bars = 0.5-2 days; adjust `hold_bars` to your strategy horizon (quick conversion below).
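The bar-count-to-horizon conversion is plain arithmetic; for example, with 1-hour bars:

```python
# Convert a hold_bars limit into a wall-clock horizon (simple arithmetic,
# not an RlxEnv API); bar_hours is the bar interval in hours.
bar_hours = 1
for hold_bars in (12, 48):
    print(f"hold_bars={hold_bars} -> {hold_bars * bar_hours / 24:.1f} days")
# hold_bars=12 -> 0.5 days
# hold_bars=48 -> 2.0 days
```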