Reinforcement Learning for Trading¶
This document describes the PPO-based reinforcement learning system for automated cryptocurrency trading.
Overview¶
The RL trading module implements Proximal Policy Optimization (PPO) for training autonomous trading agents. It includes a Gym-compatible trading environment and a complete PPO implementation.
Quick Start¶
import pandas as pd

from src.ml.environments.trading_env import TradingEnv
from src.ml.models.rl_agent import PPOAgent

# Load market data
data = pd.read_csv('btc_hourly.csv')

# Create trading environment
env = TradingEnv(
    data=data,
    initial_balance=10000,
    commission=0.001,
    window_size=50,
    reward_function='sharpe'
)

# Create RL agent
agent = PPOAgent(
    observation_dim=env.observation_space.shape[0],
    action_dim=env.action_space.n,
    hidden_size=256,
    learning_rate=3e-4
)

# Train agent
training_stats = agent.train(
    env,
    episodes=1000,
    max_steps=500,
    update_frequency=2048,
    verbose=True
)

# Evaluate agent
results = agent.evaluate(env, episodes=10, deterministic=True)
print(f"Mean Return: {results['mean_return']:.2%}")
print(f"Mean Reward: {results['mean_reward']:.2f}")

# Save trained agent
agent.save('models/ppo_agent_btc.pt')
Trading Environment¶
The TradingEnv class simulates trading on historical market data, including execution costs.
Features¶
- Gym API Compatibility: Standard reset(), step(), render() interface
- Realistic Trading: Market orders with slippage and commissions
- Position Management: Long positions (short support optional)
- Multiple Reward Functions: Profit, Sharpe, Sortino, Calmar
- Feature Integration: Uses any engineered features
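To make the "market orders with slippage and commissions" item concrete, here is a minimal sketch of how a buy fill might be priced. The function name `simulate_market_buy` and the proportional cost model are assumptions for illustration, not the actual TradingEnv logic.

```python
def simulate_market_buy(cash, price, commission=0.001, slippage=0.0005):
    """Spend all `cash` on a market buy; return (units_bought, fill_price).

    Hypothetical cost model: slippage moves the fill price against the
    buyer, and commission is charged as a fraction of the traded notional.
    """
    fill_price = price * (1.0 + slippage)
    # Choose the notional N so that N * (1 + commission) equals the cash spent.
    notional = cash / (1.0 + commission)
    units = notional / fill_price
    return units, fill_price


units, fill_price = simulate_market_buy(cash=10_000.0, price=50_000.0)
```

Under this model both costs reduce the number of units acquired, so a round trip (buy, then sell) pays slippage and commission twice.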
Actions¶
- HOLD (0): Hold current position
- BUY (1): Buy/increase long position
- SELL (2): Sell/close long position
Observations¶
The observation space includes:
- Current balance (normalized)
- Current position
- Entry price
- Current price
- Historical window of features
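As an illustration of how these components could be combined into one flat vector, consider the sketch below. The helper `build_observation`, the ordering of the fields, and the choice to leave prices unscaled are all assumptions; the real layout is defined in trading_env.py.

```python
def build_observation(balance, initial_balance, position,
                      entry_price, current_price, feature_window):
    """Flatten account state plus a feature window into a single list.

    Hypothetical layout: four account/price fields first, followed by the
    window rows concatenated from oldest to newest.
    """
    account_state = [
        balance / initial_balance,  # balance normalized by starting capital
        position,
        entry_price,
        current_price,
    ]
    flat_features = [value for row in feature_window for value in row]
    return account_state + flat_features


# Three time steps with two features each (made-up numbers).
window = [[0.01, 55.0], [-0.02, 48.0], [0.005, 51.0]]
obs = build_observation(9500.0, 10000.0, 0.0, 0.0, 42000.0, window)
```

With this layout the vector length is 4 plus window length times the number of features, which is the value an agent would need as its observation dimension.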
Rewards¶
Four reward formulations:
- Profit: Simple P&L normalized by initial capital
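- Sharpe: Risk-adjusted return (mean return divided by the standard deviation of returns)
- Sortino: Downside risk-adjusted return (mean return divided by the deviation of negative returns only)
- Calmar: Return divided by maximum drawdown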
# Configure reward function
env = TradingEnv(
    data,
    reward_function='sharpe',  # or 'profit', 'sortino', 'calmar'
    initial_balance=10000,
    commission=0.001
)
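The risk-adjusted formulations can be written out in a few lines. The sketch below is one plausible reading of the definitions above, using only the standard library; the function names, the zero return target for Sortino, and the use of total (not annualized) return for Calmar are assumptions, and the environment may compute its rewards differently.

```python
import math
import statistics


def sharpe_ratio(returns):
    """Mean return divided by the population standard deviation."""
    std = statistics.pstdev(returns)
    return statistics.fmean(returns) / std if std > 0 else 0.0


def sortino_ratio(returns):
    """Mean return divided by downside deviation (target return of 0)."""
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / len(returns))
    return statistics.fmean(returns) / downside if downside > 0 else 0.0


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar_ratio(equity):
    """Total return over the period divided by maximum drawdown."""
    total_return = equity[-1] / equity[0] - 1.0
    mdd = max_drawdown(equity)
    return total_return / mdd if mdd > 0 else 0.0


# Made-up equity curve for illustration.
equity = [10000.0, 10200.0, 9900.0, 10100.0, 10500.0]
returns = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
```

Each function returns 0.0 when its denominator is zero, which avoids division errors early in an episode when there is too little history to measure risk.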
PPO Agent¶
Architecture¶
- Actor Network: Policy network for action selection
- Critic Network: Value network for state evaluation
- GAE: Generalized Advantage Estimation
- Clipped Objective: PPO clipping for stable updates
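The GAE and clipped-objective items above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code in rl_agent.py: the function names `compute_gae` and `clipped_surrogate` are hypothetical, and the trajectory segment is assumed to contain no episode boundary.

```python
import math


def compute_gae(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation for one trajectory segment.

    `values[t]` is the critic's estimate for step t and `last_value` is the
    estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then an exponentially weighted sum of TD errors.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(new_logp, old_logp, advantage, clip_epsilon=0.2):
    """Per-sample PPO objective (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)
    clipped = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
    return min(ratio * advantage, clipped * advantage)


advantages = compute_gae([1.0, 1.0], [0.0, 0.0], 0.0, gamma=1.0, gae_lambda=1.0)
```

The clipping caps how much a single update can gain from moving the probability ratio away from 1, which is what keeps policy updates small.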
Training¶
# Configure agent
obs_dim = env.observation_space.shape[0]
agent = PPOAgent(
    observation_dim=obs_dim,
    action_dim=3,
    hidden_size=256,      # Size of hidden layers
    learning_rate=3e-4,   # Learning rate
    gamma=0.99,           # Discount factor
    gae_lambda=0.95,      # GAE lambda
    clip_epsilon=0.2,     # PPO clip parameter
    entropy_coef=0.01     # Entropy bonus
)

# Train with custom parameters
stats = agent.train(
    env,
    episodes=1000,
    max_steps=1000,
    update_frequency=2048,  # Update every N steps
    eval_frequency=10,      # Evaluate every N episodes
    verbose=True
)

# Access training statistics
print(f"Final reward: {stats['episode_rewards'][-1]:.2f}")
print(f"Policy loss: {stats['policy_loss'][-1]:.6f}")
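Per-episode rewards are usually noisy, so a single final value says little about the trend. One way to read the training statistics is to smooth them with a trailing mean, as in this sketch; `moving_average` is a hypothetical helper, and the list below merely stands in for `stats['episode_rewards']`.

```python
def moving_average(values, window=3):
    """Trailing mean over the last `window` values (shorter at the start)."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


episode_rewards = [1.0, 3.0, 2.0, 6.0]  # stand-in for stats['episode_rewards']
smoothed = moving_average(episode_rewards, window=2)
```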
Evaluation¶
# Evaluate trained agent
results = agent.evaluate(
    env,
    episodes=10,
    deterministic=True  # Use deterministic policy
)

# Print results
print(f"Mean Return: {results['mean_return']:.2%}")
print(f"Std Return: {results['std_return']:.2%}")

# Access detailed episode statistics
# (named ep_stats so it does not shadow the training `stats` dict)
for i, ep_stats in enumerate(results['episode_stats']):
    print(f"Episode {i+1}:")
    print(f"  Total Return: {ep_stats['total_return']:.2%}")
    print(f"  Num Trades: {ep_stats['num_trades']}")
    print(f"  Win Rate: {ep_stats['win_rate']:.2%}")
    print(f"  Max Drawdown: {ep_stats['max_drawdown']:.2%}")
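For readers unsure what the per-episode trade figures mean, the sketch below shows one common way to derive a trade count and win rate from a list of closed-trade profits. `summarize_trades` is a hypothetical helper and counting only strictly positive trades as wins is an assumption; the evaluator may define these fields differently.

```python
def summarize_trades(trade_pnls):
    """Return trade count and the fraction of trades with positive P&L."""
    num_trades = len(trade_pnls)
    wins = sum(1 for pnl in trade_pnls if pnl > 0)
    win_rate = wins / num_trades if num_trades else 0.0
    return {'num_trades': num_trades, 'win_rate': win_rate}


# Made-up profit and loss per closed trade.
summary = summarize_trades([120.0, -50.0, 30.0, -10.0])
```

A high win rate alone is not evidence of a good policy: a few large losses can outweigh many small wins, which is why it is reported next to total return and maximum drawdown.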
Advanced Usage¶
Custom Trading Environment¶
# Create environment with custom features
from src.ml.features import FeatureEngineer

fe = FeatureEngineer()
data_with_features = fe.fit_transform(data)

env = TradingEnv(
    data_with_features,
    initial_balance=100000,
    commission=0.001,
    slippage=0.0005,
    window_size=60,
    max_position_size=1.0,
    allow_short=False,
    normalize_observations=True,
    features=['returns', 'rsi', 'macd', 'volatility']
)
Hyperparameter Tuning¶
# Grid search for best hyperparameters
from itertools import product

param_grid = {
    'hidden_size': [128, 256, 512],
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'gamma': [0.95, 0.99, 0.995],
    'clip_epsilon': [0.1, 0.2, 0.3]
}

obs_dim = env.observation_space.shape[0]
best_return = float('-inf')
best_params = None

# Try every combination in the grid (3 * 3 * 3 * 3 = 81 training runs)
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    agent = PPOAgent(observation_dim=obs_dim, action_dim=3, **params)
    agent.train(env, episodes=100, verbose=False)
    # Evaluated on the training env for brevity; prefer a held-out env
    results = agent.evaluate(env, episodes=5)
    if results['mean_return'] > best_return:
        best_return = results['mean_return']
        best_params = params

print(f"Best params: {best_params}, Return: {best_return:.2%}")
Transfer Learning¶
# Train on BTC
btc_env = TradingEnv(btc_data, initial_balance=10000)
agent = PPOAgent(observation_dim=btc_env.observation_space.shape[0], action_dim=3)
agent.train(btc_env, episodes=500)
# Fine-tune on ETH
eth_env = TradingEnv(eth_data, initial_balance=10000)
agent.train(eth_env, episodes=100) # Continue training
# Evaluate on ETH
results = agent.evaluate(eth_env, episodes=10)
Integration with Backtesting¶
from src.backtesting.base import Strategy

# NOTE: `symbol`, `SignalType`, `EnhancedBacktestEngine`, and
# `_build_observation()` are not defined in this snippet; import or
# implement them from your project before running it.

class RLStrategy(Strategy):
    def __init__(self, bars, events_queue, agent, features):
        self.agent = agent
        self.features = features
        super().__init__(bars, events_queue)

    def calculate_signals(self, event):
        # Get observation from current market state
        observation = self._build_observation()
        # Get action from agent
        action, _ = self.agent.select_action(observation, deterministic=True)
        # Execute action (HOLD, action 0, emits no signal)
        if action == 1:  # BUY
            self.emit_signal(symbol, SignalType.LONG)
        elif action == 2:  # SELL
            self.emit_signal(symbol, SignalType.SHORT)

# Use in backtest
engine = EnhancedBacktestEngine(
    symbol_list=['BTCUSDT'],
    strategy_class=RLStrategy,
    strategy_params={'agent': agent, 'features': features},
    # ... other params
)
Best Practices¶
- Start Simple: Begin with small hidden sizes and fewer episodes
- Monitor Training: Watch for reward improvements over time
- Diverse Data: Train on varied market conditions
- Regularization: Use entropy bonus to encourage exploration
- Evaluation: Always evaluate on unseen data
- Hyperparameters: Tune based on your specific market
- Risk Management: Implement position limits and stop-losses
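Evaluating on unseen data requires a split that respects time order, since shuffling market data leaks future information into training. A minimal sketch, assuming a hypothetical helper `chronological_split` and a 70/15/15 split chosen only for illustration:

```python
def chronological_split(n_rows, train_frac=0.7, val_frac=0.15):
    """Return (start, end) row ranges for train, validation, and test.

    Ranges are contiguous and ordered in time; no shuffling is applied.
    """
    train_end = round(n_rows * train_frac)
    val_end = train_end + round(n_rows * val_frac)
    return (0, train_end), (train_end, val_end), (val_end, n_rows)


train_range, val_range, test_range = chronological_split(1000)
```

With a pandas DataFrame, each range can be applied as `data.iloc[start:end]` to build separate environments for training and evaluation.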
Troubleshooting¶
Poor Performance¶
- Increase training episodes
- Adjust reward function
- Add more features to observations
- Tune hyperparameters
- Ensure data quality
Unstable Training¶
- Reduce learning rate
- Reduce clip epsilon (a smaller value limits how far each update can move the policy)
- Add gradient clipping
- Normalize observations
- Check for data issues
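Observation normalization is often done with running statistics so that no future data is needed. The sketch below uses Welford's online algorithm for a single scalar feature; the class name `RunningNormalizer` is hypothetical, and this is not necessarily how the environment's `normalize_observations` option works.

```python
import math


class RunningNormalizer:
    """Track a running mean and variance, then standardize new values."""

    def __init__(self, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.epsilon = epsilon

    def update(self, x):
        # Welford's update: numerically stable and single pass.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # Fall back to unit variance until at least two samples are seen.
        variance = self.m2 / self.count if self.count > 1 else 1.0
        return (x - self.mean) / math.sqrt(variance + self.epsilon)


normalizer = RunningNormalizer()
for price_change in [1.0, 2.0, 3.0]:
    normalizer.update(price_change)
```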
Overfitting¶
- Use validation set
- Reduce model complexity
- Add entropy regularization
- Train on more diverse data
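A validation set is most useful when it also decides when to stop training. The following sketch shows a simple patience-based rule; `EarlyStopping` is a hypothetical helper, not part of PPOAgent, and the scores are made-up validation returns.

```python
class EarlyStopping:
    """Signal a stop after `patience` checks without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('-inf')
        self.bad_rounds = 0

    def step(self, score):
        """Record a validation score; return True when training should stop."""
        if score > self.best + self.min_delta:
            self.best = score
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


stopper = EarlyStopping(patience=2)
decisions = [stopper.step(score) for score in [0.01, 0.03, 0.02, 0.025]]
```

In practice the score would come from evaluating the agent on a validation environment after each block of training episodes, and the checkpoint with the best score would be the one saved.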
API Reference¶
See src/ml/models/rl_agent.py and src/ml/environments/trading_env.py for complete API documentation.
Examples¶
See notebooks/reinforcement_learning_trading.ipynb for detailed examples including:
- Environment setup
- Agent training
- Evaluation and analysis
- Hyperparameter tuning
- Production deployment