Optimizing RL Agents with Exit Rules in RLXBT
Introduction
One of the main challenges in training RL agents for trading is the "noisy" reward signal: it is difficult for an agent to tell whether a trade was profitable because of a good entry or simply because of a lucky turn of events. Exit rules let us separate the entry logic (which the agent learns) from the risk-management logic (which is strictly defined).
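Concretely, the split looks like this in RLXBT: the exit logic is declared as an exit_rules dictionary when constructing RlxEnv, and the policy only chooses entries. Below is a condensed sketch of the pattern the full script at the end of this article follows (the data path here is illustrative):

```python
from stable_baselines3 import PPO
from rlxbt import load_data, RlxEnv

# Illustrative path; the full script resolves it relative to the project root.
data = load_data("data/BTCUSDT_1h_2020-12-12_2025-12-11.csv")

# Risk management is declared on the environment, not learned by the policy.
env = RlxEnv(
    data=data,
    initial_capital=100000.0,
    window_size=32,
    exit_rules={
        "hold_bars": 12,              # time-based exit after 12 bars
        "max_drawdown_percent": 5.0,  # hard stop on position drawdown
        "min_profit_percent": 1.5,    # take-profit target
    },
)

# The agent only has to learn entry timing: hold (0), long (1), short (2).
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=100_000)
```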
Exit Rules Configurations
In our experiment, we compared three approaches (the corresponding exit_rules dictionaries are shown right after this list):
- No Rules (Baseline): The agent decides when to close a position.
- Conservative: Strict 2% stop-loss, 3% take-profit, and a maximum holding time of 48 hours.
- Aggressive: 5% stop-loss, quick 1.5% take-profit, and holding for no more than 12 hours.
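In code, each configuration is a plain dictionary handed to RlxEnv through its exit_rules parameter; these are the exact definitions used in the full script below:

```python
# Baseline: the agent alone decides when to close.
no_rules = None

# Conservative risk management.
conservative_rules = {
    "hold_bars": 48,              # max 48 hours (2 days)
    "max_drawdown_percent": 2.0,  # stop-loss at 2% drawdown
    "min_profit_percent": 3.0,    # take-profit at 3%
}

# Aggressive day trading.
aggressive_rules = {
    "hold_bars": 12,              # max 12 hours
    "max_drawdown_percent": 5.0,  # allow 5% drawdown
    "min_profit_percent": 1.5,    # quick profit taking at 1.5%
}
```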
Experimental Results
Data: BTCUSDT, 1-hour timeframe (2020-2025).
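The dataset is loaded with load_data and split chronologically into 70% training, 15% validation, and 15% test, as in the full script (the path is shown relative to the project root):

```python
from rlxbt import load_data

data = load_data("data/BTCUSDT_1h_2020-12-12_2025-12-11.csv")

# Chronological split: 70% train, 15% validation, 15% test.
train_size = int(len(data) * 0.7)
val_size = int(len(data) * 0.15)
train_data = data.iloc[:train_size].reset_index(drop=True)
val_data = data.iloc[train_size:train_size + val_size].reset_index(drop=True)
test_data = data.iloc[train_size + val_size:].reset_index(drop=True)
```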
Results Summary (PPO Agent, Test Set)
| Configuration | Return | Sharpe Ratio | Max Drawdown | Total Trades |
|---|---|---|---|---|
| No Rules | -5.14% | -0.0075 | 14.71% | 744 |
| Conservative (2% SL, 3% TP) | -3.27% | -0.1234 | 4.46% | 17 |
| Aggressive (5% SL, 1.5% TP) | +34.58% | 0.0407 | 20.15% | 1242 |
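Each row comes from a deterministic rollout of the trained agent on the held-out test set; the metrics are read from the environment's final info dictionary at the end of the episode. A small helper distilled from the evaluation loop in the full script (the evaluate name is ours):

```python
def evaluate(model, test_env):
    """Deterministic rollout; metrics come from the final info dict."""
    obs, _ = test_env.reset()
    done = False
    info = {}
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, truncated, info = test_env.step(int(action))
    return {
        "total_return_pct": info.get("total_return", 0) * 100,
        "sharpe_ratio": info.get("sharpe_ratio", 0),
        "max_drawdown_pct": info.get("max_drawdown", 0) * 100,
        "total_trades": int(info.get("total_trades", 0)),
    }
```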
Exit Reason Analysis (for the best strategy)
For the aggressive strategy, which showed the best result, the distribution of position closing reasons is as follows:
- Signal (agent-initiated exit): 88.6%
- MaxBarsReached (Timeout): 6.5%
- MinProfitReached (Take-Profit): 4.7%
- MaxDrawdown (Stop-Loss): 0.2%
Conclusion: The agent learned to exploit short market impulses effectively, while the exit rules provided a safety net during prolonged moves and sharp drawdowns.
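The distribution is computed from the exit_reason attached to each closed trade on the backtest result. Here is a self-contained sketch of that analysis (a random policy stands in for the trained agent for brevity; the numbers above come from the trained PPO model, as in the full script):

```python
import numpy as np
from rlxbt import load_data, RlxEnv

data = load_data("data/BTCUSDT_1h_2020-12-12_2025-12-11.csv")
env = RlxEnv(
    data=data,
    initial_capital=100000.0,
    window_size=32,
    exit_rules={"hold_bars": 12, "max_drawdown_percent": 5.0, "min_profit_percent": 1.5},
)

# Roll one episode; any policy works, the full script uses the trained PPO agent.
obs, _ = env.reset()
done = False
while not done:
    obs, reward, done, truncated, info = env.step(int(np.random.choice([0, 1, 2])))

# Each closed trade carries the reason it was exited.
result = env.get_backtest_result()
exit_reasons = {}
for trade in result.trades:
    reason = str(trade.exit_reason) if hasattr(trade, "exit_reason") else "Unknown"
    exit_reasons[reason] = exit_reasons.get(reason, 0) + 1

total = sum(exit_reasons.values())
for reason, count in sorted(exit_reasons.items(), key=lambda x: -x[1]):
    print(f"{reason:<20} {count:>5} ({count / total * 100:5.1f}%)")
```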
Full Example Code
Below is the full script to reproduce the results. To run it, you will need the rlxbt and stable-baselines3 libraries installed.
```python
#!/usr/bin/env python3
"""
RLX RL Environment Demo with Exit Rules

This demo shows how to:
1. Configure RlxEnv with custom exit rules
2. Train an RL agent (PPO) with risk management
3. Compare performance with/without exit rules
4. Generate detailed metrics and reports

Exit Rules Features:
- hold_bars: Maximum bars to hold a position
- max_drawdown_percent: Force exit if position drawdown exceeds threshold
- min_profit_percent: Take profit when minimum target reached
- exit_at_night: Close positions during night hours
- max_hold_minutes: Time-based exit

LICENSING:
- Set RLX_LICENSE_KEY environment variable or pass license_key parameter to RlxEnv
- For development builds (--features offline_license), license is not required
- Get your license at https://rlxbt.com/pricing
"""
import sys
import os
import time
import numpy as np
import pandas as pd
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Add project root to path
project_root = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.insert(0, project_root)

try:
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv
    from stable_baselines3.common.callbacks import BaseCallback

    HAS_SB3 = True
except ImportError:
    HAS_SB3 = False
    print("⚠️  stable_baselines3 not installed. Running simplified demo.")

try:
    from rlxbt import rlx, load_data, RlxEnv
except ImportError:
    print("❌ Failed to import RLX. Please run 'maturin develop' first.")
    sys.exit(1)


class RewardCallback(BaseCallback):
    """Callback to track training progress."""

    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.episode_rewards = []
        self.episode_count = 0

    def _on_step(self) -> bool:
        if self.locals.get("dones", [False])[0]:
            self.episode_count += 1
            if self.episode_count % 10 == 0:
                info = self.locals.get("infos", [{}])[0]
                portfolio = info.get("portfolio_value", 100000)
                ret = (portfolio - 100000) / 100000 * 100
                print(
                    f"   Episode {self.episode_count}: Portfolio ${portfolio:,.0f} ({ret:+.2f}%)"
                )
        return True


def run_episode_manual(env, strategy="random"):
    """Run single episode with manual strategy (no RL library needed)."""
    obs, _ = env.reset()
    done = False
    total_reward = 0
    actions_taken = []

    while not done:
        if strategy == "random":
            action = np.random.choice([0, 1, 2])
        elif strategy == "always_long":
            action = 1
        elif strategy == "always_short":
            action = 2
        else:  # hold
            action = 0

        obs, reward, done, truncated, info = env.step(action)
        total_reward += reward
        actions_taken.append(action)

    return total_reward, info, actions_taken


def main():
    print("=" * 70)
    print("🤖 RLX RL ENVIRONMENT WITH EXIT RULES DEMO")
    print("=" * 70)

    # Check for license key
    license_key = os.environ.get("RLX_LICENSE_KEY")
    if license_key:
        print(f"🔑 Using license key: {license_key[:20]}...")
    else:
        print("ℹ️  No RLX_LICENSE_KEY set (OK for development builds)")
        print('   For production: export RLX_LICENSE_KEY="rlx_pro_..."')

    # =========================================================================
    # 1. LOAD DATA
    # =========================================================================
    data_path = os.path.join(
        project_root, "data", "BTCUSDT_1h_2020-12-12_2025-12-11.csv"
    )
    if not os.path.exists(data_path):
        print(f"❌ Data file not found: {data_path}")
        return

    print(f"\n📂 Loading data from: {os.path.basename(data_path)}")
    data = load_data(data_path)
    print(f"   Total bars: {len(data):,}")
    print(f"   Date range: {data['timestamp'].min()} to {data['timestamp'].max()}")

    # Split data
    train_size = int(len(data) * 0.7)
    val_size = int(len(data) * 0.15)
    train_data = data.iloc[:train_size].reset_index(drop=True)
    val_data = data.iloc[train_size : train_size + val_size].reset_index(drop=True)
    test_data = data.iloc[train_size + val_size :].reset_index(drop=True)

    print(f"\n📊 Data Split:")
    print(f"   Train: {len(train_data):,} bars (70%)")
    print(f"   Valid: {len(val_data):,} bars (15%)")
    print(f"   Test:  {len(test_data):,} bars (15%)")

    # =========================================================================
    # 2. DEFINE EXIT RULES CONFIGURATIONS
    # =========================================================================
    print("\n" + "=" * 70)
    print("⚙️  EXIT RULES CONFIGURATIONS")
    print("=" * 70)

    # Configuration 1: No Exit Rules (baseline)
    no_rules = None

    # Configuration 2: Conservative Risk Management
    conservative_rules = {
        "hold_bars": 48,  # Max 48 hours (2 days)
        "max_drawdown_percent": 2.0,  # Stop loss at 2% drawdown
        "min_profit_percent": 3.0,  # Take profit at 3%
    }

    # Configuration 3: Aggressive Day Trading
    aggressive_rules = {
        "hold_bars": 12,  # Max 12 hours
        "max_drawdown_percent": 5.0,  # Allow 5% drawdown
        "min_profit_percent": 1.5,  # Quick profit taking at 1.5%
    }

    # Configuration 4: Session-Based (Night Exit)
    session_rules = {
        "hold_bars": 24,  # Max 24 hours
        "exit_at_night": True,  # Close before night
        "night_start_hour": 22,  # Night starts at 22:00 UTC
        "night_end_hour": 6,  # Night ends at 06:00 UTC
        "max_drawdown_percent": 3.0,
    }

    configs = [
        ("No Rules (Baseline)", no_rules),
        ("Conservative", conservative_rules),
        ("Aggressive", aggressive_rules),
        ("Session-Based", session_rules),
    ]

    for name, rules in configs:
        print(f"\n📋 {name}:")
        if rules:
            for k, v in rules.items():
                print(f"   {k}: {v}")
        else:
            print("   No exit rules applied")

    # =========================================================================
    # 3. TEST RANDOM AGENT WITH DIFFERENT CONFIGS
    # =========================================================================
    print("\n" + "=" * 70)
    print("🎲 RANDOM AGENT COMPARISON (baseline)")
    print("=" * 70)

    random_results = []
    for config_name, exit_rules in configs:
        # License key is automatically read from RLX_LICENSE_KEY environment variable
        env = RlxEnv(
            data=test_data,
            initial_capital=100000.0,
            window_size=20,
            exit_rules=exit_rules,
        )

        # Run 5 episodes with random actions
        returns = []
        trades_list = []
        for _ in range(5):
            _, info, _ = run_episode_manual(env, strategy="random")
            returns.append(info.get("total_return", 0) * 100)
            trades_list.append(int(info.get("total_trades", 0)))

        avg_return = np.mean(returns)
        avg_trades = np.mean(trades_list)
        random_results.append(
            {
                "config": config_name,
                "avg_return": avg_return,
                "avg_trades": avg_trades,
                "std_return": np.std(returns),
            }
        )

        print(f"\n{config_name}:")
        print(f"   Avg Return: {avg_return:+.2f}% (±{np.std(returns):.2f}%)")
        print(f"   Avg Trades: {avg_trades:.0f}")

    # =========================================================================
    # 4. TRAIN RL AGENTS FOR EACH CONFIG (if stable_baselines3 available)
    # =========================================================================
    if HAS_SB3:
        print("\n" + "=" * 70)
        print("🧠 RL AGENT TRAINING (PPO) - Training separate agent per config")
        print("=" * 70)

        # Training configurations (only train with rules that make sense)
        train_configs = [
            ("No Rules", no_rules),
            ("Conservative (2% SL, 3% TP)", conservative_rules),
            ("Aggressive (5% SL, 1.5% TP)", aggressive_rules),
        ]

        eval_results = []
        trained_models = {}

        for config_name, exit_rules in train_configs:
            print(f"\n🏋️ Training PPO agent with: {config_name}")
            if exit_rules:
                print(f"   Exit Rules: {exit_rules}")

            # Create training environment
            # Use lambda with default argument to capture exit_rules correctly
            train_env = DummyVecEnv(
                [
                    lambda er=exit_rules: RlxEnv(
                        data=train_data,
                        initial_capital=100000.0,
                        window_size=32,  # Optimized window size
                        exit_rules=er,
                    )
                ]
            )

            # Create PPO model with optimized hyperparameters
            model = PPO(
                "MlpPolicy",
                train_env,
                verbose=0,
                learning_rate=3e-4,
                n_steps=1024,
                batch_size=64,
                n_epochs=10,
                gamma=0.99,
                ent_coef=0.02,  # Higher entropy for exploration
            )

            # Training
            print(f"   Training for 100,000 timesteps...")
            start_time = time.time()
            model.learn(total_timesteps=100_000)
            train_time = time.time() - start_time
            print(f"   Training completed in {train_time:.1f}s")

            trained_models[config_name] = model

            # =================================================================
            # 5. EVALUATE ON TEST SET
            # =================================================================
            test_env = RlxEnv(
                data=test_data,
                initial_capital=100000.0,
                window_size=32,
                exit_rules=exit_rules,
            )

            obs, _ = test_env.reset()
            done = False
            actions = {0: 0, 1: 0, 2: 0}

            while not done:
                action, _ = model.predict(obs, deterministic=True)
                action = int(action)
                actions[action] += 1
                obs, reward, done, truncated, info = test_env.step(action)

            total_actions = sum(actions.values())
            result = {
                "config": config_name,
                "total_return": info.get("total_return", 0) * 100,
                "sharpe_ratio": info.get("sharpe_ratio", 0),
                "max_drawdown": info.get("max_drawdown", 0) * 100,
                "total_trades": int(info.get("total_trades", 0)),
                "win_rate": info.get("win_rate", 0) * 100
                if info.get("win_rate")
                else 0,
                "portfolio_value": info.get("portfolio_value", 100000),
                "hold_pct": actions[0] / total_actions * 100,
                "long_pct": actions[1] / total_actions * 100,
                "short_pct": actions[2] / total_actions * 100,
                "train_time": train_time,
            }
            eval_results.append(result)

            print(f"\n   📊 Test Results:")
            print(f"      Total Return: {result['total_return']:+.2f}%")
            print(f"      Sharpe Ratio: {result['sharpe_ratio']:.4f}")
            print(f"      Max Drawdown: {result['max_drawdown']:.2f}%")
            print(f"      Total Trades: {result['total_trades']}")
            print(
                f"      Actions: Hold={actions[0]} ({result['hold_pct']:.1f}%), "
                f"Long={actions[1]} ({result['long_pct']:.1f}%), "
                f"Short={actions[2]} ({result['short_pct']:.1f}%)"
            )

        # =====================================================================
        # 6. SUMMARY TABLE
        # =====================================================================
        print("\n" + "=" * 70)
        print("📈 RESULTS SUMMARY - Each agent trained with its own config")
        print("=" * 70)

        print("\n┌" + "─" * 78 + "┐")
        print(
            f"│ {'Config':<32} {'Return':>10} {'Sharpe':>10} {'Drawdown':>10} {'Trades':>8} │"
        )
        print("├" + "─" * 78 + "┤")
        for r in eval_results:
            print(
                f"│ {r['config']:<32} {r['total_return']:>+9.2f}% {r['sharpe_ratio']:>10.4f} "
                f"{r['max_drawdown']:>9.2f}% {r['total_trades']:>8} │"
            )
        print("└" + "─" * 78 + "┘")

        # Best config
        best = max(eval_results, key=lambda x: x["sharpe_ratio"])
        print(f"\n🏆 Best Configuration: {best['config']}")
        print(f"   Sharpe Ratio: {best['sharpe_ratio']:.4f}")
        print(f"   Total Return: {best['total_return']:+.2f}%")
        print(f"   Max Drawdown: {best['max_drawdown']:.2f}%")

        # =====================================================================
        # 7. EXIT STATISTICS (using best model)
        # =====================================================================
        print("\n" + "=" * 70)
        print(f"📊 EXIT REASONS ANALYSIS ({best['config']})")
        print("=" * 70)

        # Use the best performing model for analysis
        best_model = trained_models.get(best["config"])
        best_rules = None
        for name, rules in train_configs:
            if name == best["config"]:
                best_rules = rules
                break

        if best_model and best_rules:
            analysis_env = RlxEnv(
                data=test_data,
                initial_capital=100000.0,
                window_size=32,
                exit_rules=best_rules,
            )

            obs, _ = analysis_env.reset()
            done = False
            while not done:
                action, _ = best_model.predict(obs, deterministic=True)
                obs, reward, done, truncated, info = analysis_env.step(int(action))

            # Get backtest result for exit statistics
            try:
                backtest_result = analysis_env.get_backtest_result()

                # Count exit reasons
                exit_reasons = {}
                for trade in backtest_result.trades:
                    reason = (
                        str(trade.exit_reason)
                        if hasattr(trade, "exit_reason")
                        else "Unknown"
                    )
                    exit_reasons[reason] = exit_reasons.get(reason, 0) + 1

                if exit_reasons:
                    print("\nExit Reason Distribution:")
                    total_exits = sum(exit_reasons.values())
                    for reason, count in sorted(
                        exit_reasons.items(), key=lambda x: -x[1]
                    ):
                        pct = count / total_exits * 100
                        print(f"   {reason:<30} {count:>5} ({pct:>5.1f}%)")
            except Exception as e:
                print(f"   Could not get exit statistics: {e}")
    else:
        print("\n⚠️  Skipping RL training (stable_baselines3 not installed)")
        print("   Install with: pip install stable-baselines3 shimmy gymnasium")

    # =========================================================================
    # 8. FINAL NOTES
    # =========================================================================
    print("\n" + "=" * 70)
    print("💡 KEY TAKEAWAYS")
    print("=" * 70)
    print("""
    1. EXIT RULES IMPACT:
       - Conservative rules (2% SL, 3% TP) reduce risk but may limit upside
       - Aggressive rules allow bigger swings, higher variance
       - Session-based rules useful for avoiding overnight gaps

    2. RL + EXIT RULES SYNERGY:
       - RL agent learns WHEN to enter (signal timing)
       - Exit rules handle risk management (HOW to exit)
       - This separation allows cleaner learning signal

    3. CONFIGURATION RECOMMENDATIONS:
       - Day Trading: aggressive_rules with short hold_bars
       - Swing Trading: conservative_rules with longer hold_bars
       - 24/7 Markets (Crypto): no night exit needed
       - Traditional Markets: session_rules with night exit

    4. HYPERPARAMETER TUNING:
       - hold_bars: Match your trading timeframe
       - max_drawdown_percent: Set based on risk tolerance
       - min_profit_percent: Balance between taking profits and letting winners run
    """)

    print("=" * 70)
    print("✅ Demo completed!")
    print("=" * 70)


if __name__ == "__main__":
    main()
```
Key Takeaways
- Synergy of RL and exit rules: an RL agent trains more reliably when catastrophic losses are handled for it by max_drawdown_percent, leaving a cleaner learning signal for entry timing.
- Conservatism vs. aggressiveness: in this test, conservative rules constrained the agent too much (only 17 trades), while the aggressive configuration let the PPO agent realize its potential (+34.58%).
- Drawdown: exit rules can sharply reduce maximum drawdown (the conservative configuration cut it from 14.71% to 4.46% versus the "pure" RL agent), but the aggressive configuration shows the trade-off: its higher return came with a 20.15% drawdown.
Article prepared for the RLXBT community. More examples in the project repository.