
Reinforcement Learning for Trading

This document describes the PPO-based reinforcement learning system for automated cryptocurrency trading.

Overview

The RL trading module implements Proximal Policy Optimization (PPO) for training autonomous trading agents. It includes a Gym-compatible trading environment and a complete PPO implementation.

Quick Start

from src.ml.environments.trading_env import TradingEnv
from src.ml.models.rl_agent import PPOAgent
import pandas as pd

# Load market data
data = pd.read_csv('btc_hourly.csv')

# Create trading environment
env = TradingEnv(
    data=data,
    initial_balance=10000,
    commission=0.001,
    window_size=50,
    reward_function='sharpe'
)

# Create RL agent
agent = PPOAgent(
    observation_dim=env.observation_space.shape[0],
    action_dim=env.action_space.n,
    hidden_size=256,
    learning_rate=3e-4
)

# Train agent
training_stats = agent.train(
    env,
    episodes=1000,
    max_steps=500,
    update_frequency=2048,
    verbose=True
)

# Evaluate agent (reuses the training environment for brevity; evaluate on held-out data for a realistic estimate)
results = agent.evaluate(env, episodes=10, deterministic=True)
print(f"Mean Return: {results['mean_return']:.2%}")
print(f"Mean Reward: {results['mean_reward']:.2f}")

# Save trained agent
agent.save('models/ppo_agent_btc.pt')

Trading Environment

The TradingEnv class provides a realistic trading simulation:

Features

  • Gym API Compatibility: Standard reset(), step(), render() interface
  • Realistic Trading: Market orders with slippage and commissions
  • Position Management: Long positions (short support optional)
  • Multiple Reward Functions: Profit, Sharpe, Sortino, Calmar
  • Feature Integration: Uses any engineered features
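
To make the effect of slippage and commissions concrete, the sketch below fills a market buy and a market sell at the same quoted price. The function names and the cost model (slippage as a fractional price move against the trader, commission as a fraction of traded notional) are assumptions for illustration and may differ from what TradingEnv does internally.

```python
def buy_fill(cash, price, commission=0.001, slippage=0.0005):
    """Spend `cash` on a market buy; return (units_bought, fill_price)."""
    fill_price = price * (1.0 + slippage)  # buyer pays slightly above the quote
    # Commission is charged on top of the traded notional
    units = cash / (fill_price * (1.0 + commission))
    return units, fill_price


def sell_fill(units, price, commission=0.001, slippage=0.0005):
    """Sell `units` at market; return (cash_received, fill_price)."""
    fill_price = price * (1.0 - slippage)  # seller receives slightly below the quote
    cash = units * fill_price * (1.0 - commission)
    return cash, fill_price


# Round trip at an unchanged quote: the loss is pure trading cost
units, _ = buy_fill(10_000, 50_000.0)
cash, _ = sell_fill(units, 50_000.0)
round_trip_cost = 10_000 - cash
```

With these illustrative rates, a round trip at an unchanged price costs roughly 0.3% of the capital traded, which is why frequent trading tends to be penalized in the reward.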

Actions

  • HOLD (0): Hold current position
  • BUY (1): Buy/increase long position
  • SELL (2): Sell/close long position
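
The action codes above can be mirrored in user code with a small enum. This is a minimal sketch: the `Action` enum and `next_position` helper are hypothetical names, and the all-in/all-out position rule is an assumption (the environment may instead scale positions incrementally).

```python
from enum import IntEnum


class Action(IntEnum):
    """Discrete action codes matching the list above."""
    HOLD = 0
    BUY = 1
    SELL = 2


def next_position(position, action, max_position=1.0):
    """Return the position after `action` (long-only, all-in/all-out assumption)."""
    if action == Action.BUY:
        return max_position  # open or top up to the maximum long position
    if action == Action.SELL:
        return 0.0           # close the long position
    return position          # HOLD leaves the position unchanged
```

Because `IntEnum` members compare equal to plain integers, the helper accepts either `Action.SELL` or the raw value `2` returned by an agent.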

Observations

The observation space includes:

  • Current balance (normalized)
  • Current position
  • Entry price
  • Current price
  • Historical window of features
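
One plausible way to assemble these components into a flat vector is sketched below. The function name, the ordering of fields, and the normalization of balance by the initial balance are assumptions; TradingEnv's actual layout may differ.

```python
def build_observation(balance, initial_balance, position, entry_price,
                      current_price, feature_window):
    """Flatten account state and a window of feature rows into one vector."""
    account = [
        balance / initial_balance,  # balance normalized by starting capital
        position,
        entry_price,
        current_price,
    ]
    # feature_window is a list of rows, one row of feature values per time step
    flat_window = [value for row in feature_window for value in row]
    return account + flat_window


# 3 time steps x 2 features (e.g. returns and RSI)
window = [[0.01, 55.0], [-0.02, 48.0], [0.005, 51.0]]
obs = build_observation(9_500.0, 10_000.0, 1.0, 50_000.0, 50_500.0, window)
```

Under this layout the observation length is 4 plus window_size times the number of features, which is the value an agent's observation_dim would need to match.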

Rewards

Four reward formulations:

  1. Profit: Simple P&L normalized by initial capital
  2. Sharpe: Risk-adjusted return (mean/std)
  3. Sortino: Downside risk-adjusted return
  4. Calmar: Return divided by maximum drawdown

# Configure reward function
env = TradingEnv(
    data,
    reward_function='sharpe',  # or 'profit', 'sortino', 'calmar'
    initial_balance=10000,
    commission=0.001
)
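
The four formulations can be sketched from an equity curve as follows. These are textbook-style definitions written for illustration; the function names, the small `eps` stabilizer, and the use of per-step (non-annualized) returns are assumptions rather than the environment's exact code.

```python
import math
import statistics


def step_returns(equity):
    """Per-step simple returns of an equity curve."""
    return [b / a - 1.0 for a, b in zip(equity, equity[1:])]


def profit_reward(equity, initial_balance):
    """P&L normalized by initial capital."""
    return (equity[-1] - equity[0]) / initial_balance


def sharpe_reward(returns, eps=1e-8):
    """Mean return divided by the standard deviation of returns."""
    return statistics.fmean(returns) / (statistics.pstdev(returns) + eps)


def sortino_reward(returns, eps=1e-8):
    """Mean return divided by downside deviation (negative returns only)."""
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / len(returns))
    return statistics.fmean(returns) / (downside + eps)


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar_reward(equity, eps=1e-8):
    """Total return divided by maximum drawdown."""
    return (equity[-1] / equity[0] - 1.0) / (max_drawdown(equity) + eps)


equity = [10_000.0, 10_100.0, 9_900.0, 10_200.0, 10_400.0]
returns = step_returns(equity)
```

Sortino ignores upside volatility, so on a curve with mostly positive steps like this one it comes out higher than Sharpe.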

PPO Agent

Architecture

  • Actor Network: Policy network for action selection
  • Critic Network: Value network for state evaluation
  • GAE: Generalized Advantage Estimation
  • Clipped Objective: PPO clipping for stable updates
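
For readers unfamiliar with GAE, the sketch below shows the standard backward recursion over one rollout. It is a plain-Python illustration of the published algorithm, not PPOAgent's code; the function name and argument layout are assumptions.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, values, dones are per-step lists; last_value is the critic's
    estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0  # no bootstrapping across episode ends
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Critic regression targets
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                       [False, False, True], gamma=1.0, gae_lambda=1.0)
# adv == [3.0, 2.0, 1.0]
```

With gamma and gae_lambda both set to 1 and a zero critic, the advantages reduce to the undiscounted reward-to-go, which makes the recursion easy to check by hand.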

Training

# Configure agent
obs_dim = env.observation_space.shape[0]  # Must match the environment's observation size

agent = PPOAgent(
    observation_dim=obs_dim,
    action_dim=3,
    hidden_size=256,        # Size of hidden layers
    learning_rate=3e-4,     # Learning rate
    gamma=0.99,             # Discount factor
    gae_lambda=0.95,        # GAE lambda
    clip_epsilon=0.2,       # PPO clip parameter
    entropy_coef=0.01       # Entropy bonus
)

# Train with custom parameters
stats = agent.train(
    env,
    episodes=1000,
    max_steps=1000,
    update_frequency=2048,   # Update every N steps
    eval_frequency=10,       # Evaluate every N episodes
    verbose=True
)

# Access training statistics
print(f"Final reward: {stats['episode_rewards'][-1]:.2f}")
print(f"Policy loss: {stats['policy_loss'][-1]:.6f}")
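
The roles of clip_epsilon and entropy_coef can be illustrated with the standard PPO clipped surrogate loss. This is a plain-Python sketch of the textbook objective; the function names are assumptions, and how PPOAgent combines the terms internally is not shown in this document.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
        # Take the more pessimistic of the unclipped and clipped terms
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


def entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A typical combined policy loss (assumed form):
#   loss = ppo_clip_loss(...) - entropy_coef * entropy(action_probs)
policy_loss = ppo_clip_loss([math.log(1.5)], [0.0], [1.0], clip_epsilon=0.2)
```

In the example the probability ratio is 1.5 with a positive advantage, so the objective is capped at 1.2 and the policy gains nothing from moving further in that direction. A smaller clip_epsilon tightens this cap and makes updates more conservative.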

Evaluation

# Evaluate trained agent
results = agent.evaluate(
    env,
    episodes=10,
    deterministic=True  # Use deterministic policy
)

# Print results
print(f"Mean Return: {results['mean_return']:.2%}")
print(f"Std Return: {results['std_return']:.2%}")

# Access detailed episode statistics
# (loop variable named ep_stats so it does not shadow the training `stats` dict)
for i, ep_stats in enumerate(results['episode_stats']):
    print(f"Episode {i+1}:")
    print(f"  Total Return: {ep_stats['total_return']:.2%}")
    print(f"  Num Trades: {ep_stats['num_trades']}")
    print(f"  Win Rate: {ep_stats['win_rate']:.2%}")
    print(f"  Max Drawdown: {ep_stats['max_drawdown']:.2%}")
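
As a rough guide to how such summary numbers are typically derived, the sketch below aggregates per-episode returns and computes a win rate from trade P&Ls. The helper names are hypothetical, and the choice of population standard deviation is an assumption; the library may use the sample estimator instead.

```python
import statistics


def summarize_episodes(episode_returns):
    """Aggregate per-episode total returns into mean and standard deviation."""
    return {
        "mean_return": statistics.fmean(episode_returns),
        "std_return": statistics.pstdev(episode_returns),  # population std (assumed)
    }


def win_rate(trade_pnls):
    """Fraction of closed trades with strictly positive P&L."""
    if not trade_pnls:
        return 0.0  # no trades: report 0 rather than dividing by zero
    return sum(1 for pnl in trade_pnls if pnl > 0) / len(trade_pnls)


summary = summarize_episodes([0.1, 0.3])
rate = win_rate([10.0, -5.0, 3.0, -1.0])
```

A high win rate alone says little about profitability, so it is worth reading it alongside total return and maximum drawdown.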

Advanced Usage

Custom Trading Environment

# Create environment with custom features
from src.ml.features import FeatureEngineer

fe = FeatureEngineer()
data_with_features = fe.fit_transform(data)

env = TradingEnv(
    data_with_features,
    initial_balance=100000,
    commission=0.001,
    slippage=0.0005,
    window_size=60,
    max_position_size=1.0,
    allow_short=False,
    normalize_observations=True,
    features=['returns', 'rsi', 'macd', 'volatility']
)

Hyperparameter Tuning

# Grid search over all hyperparameter combinations
from itertools import product

param_grid = {
    'hidden_size': [128, 256, 512],
    'learning_rate': [1e-4, 3e-4, 1e-3],
    'gamma': [0.95, 0.99, 0.995],
    'clip_epsilon': [0.1, 0.2, 0.3]
}

obs_dim = env.observation_space.shape[0]
best_return = float('-inf')
best_params = None

# 3 x 3 x 3 x 3 = 81 combinations; shrink the grid if training is slow
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    agent = PPOAgent(observation_dim=obs_dim, action_dim=3, **params)

    agent.train(env, episodes=100, verbose=False)
    # Ideally evaluate on a separate validation environment, not the training env
    results = agent.evaluate(env, episodes=5)

    if results['mean_return'] > best_return:
        best_return = results['mean_return']
        best_params = params

print(f"Best params: {best_params}, Return: {best_return:.2%}")

Transfer Learning

# Train on BTC
btc_env = TradingEnv(btc_data, initial_balance=10000)
agent = PPOAgent(observation_dim=btc_env.observation_space.shape[0], action_dim=3)
agent.train(btc_env, episodes=500)

# Fine-tune on ETH (eth_data must produce the same observation dimension as btc_data)
eth_env = TradingEnv(eth_data, initial_balance=10000)
agent.train(eth_env, episodes=100)  # Continue training

# Evaluate on ETH
results = agent.evaluate(eth_env, episodes=10)

Integration with Backtesting

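from src.backtesting.base import Strategy
# SignalType and EnhancedBacktestEngine must also be imported from the
# backtesting package; their import paths are not shown in this document.

class RLStrategy(Strategy):
    def __init__(self, bars, events_queue, agent, features, symbol):
        self.agent = agent
        self.features = features
        self.symbol = symbol  # Symbol this strategy trades, e.g. 'BTCUSDT'
        super().__init__(bars, events_queue)

    def calculate_signals(self, event):
        # Build the observation from current market state.
        # _build_observation() is not shown here; it must return a vector
        # laid out exactly like the TradingEnv observations used in training.
        observation = self._build_observation()

        # Get action from agent (HOLD = 0 emits no signal)
        action, _ = self.agent.select_action(observation, deterministic=True)

        # Execute action
        if action == 1:  # BUY
            self.emit_signal(self.symbol, SignalType.LONG)
        elif action == 2:  # SELL
            # In TradingEnv, SELL closes the long position; confirm that SHORT
            # has the same effect in your backtester before relying on it.
            self.emit_signal(self.symbol, SignalType.SHORT)

# Use in backtest
engine = EnhancedBacktestEngine(
    symbol_list=['BTCUSDT'],
    strategy_class=RLStrategy,
    strategy_params={'agent': agent, 'features': features, 'symbol': 'BTCUSDT'},
    # ... other params
)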

Best Practices

  1. Start Simple: Begin with small hidden sizes and fewer episodes
  2. Monitor Training: Watch for reward improvements over time
  3. Diverse Data: Train on varied market conditions
  4. Regularization: Use entropy bonus to encourage exploration
  5. Evaluation: Always evaluate on unseen data
  6. Hyperparameters: Tune based on your specific market
  7. Risk Management: Implement position limits and stop-losses
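
Position limits and stop-losses can be enforced outside the agent by filtering its proposed action before execution. The sketch below is one hypothetical way to do this; the function name, the 5% stop-loss threshold, and the rule ordering are illustrative assumptions, not part of the library.

```python
HOLD, BUY, SELL = 0, 1, 2  # action codes used by the trading environment


def apply_risk_limits(action, position, entry_price, current_price,
                      max_position=1.0, stop_loss_pct=0.05):
    """Override the agent's action when a risk rule is triggered."""
    # Stop-loss: force an exit once the long position has lost stop_loss_pct
    if position > 0 and current_price <= entry_price * (1.0 - stop_loss_pct):
        return SELL
    # Position limit: block further buying at the maximum position size
    if action == BUY and position >= max_position:
        return HOLD
    return action
```

Applying the filter during training as well as in live use keeps the agent's experience consistent with the constraints it will actually face.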

Troubleshooting

Poor Performance

  • Increase training episodes
  • Adjust reward function
  • Add more features to observations
  • Tune hyperparameters
  • Ensure data quality

Unstable Training

  • Reduce learning rate
  • Reduce clip epsilon (e.g. from 0.2 to 0.1) to make policy updates more conservative
  • Add gradient clipping
  • Normalize observations
  • Check for data issues
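
Observation normalization is often done with running statistics so that no future data leaks into the scaling. Below is a minimal single-feature sketch using Welford's online algorithm; the class name is hypothetical, and a real setup would keep one normalizer per feature or use a vectorized equivalent.

```python
import math


class RunningNormalizer:
    """Tracks running mean and variance of one feature (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        """Scale x to roughly zero mean and unit variance."""
        if self.count < 2:
            return 0.0  # not enough data to estimate a variance yet
        std = math.sqrt(self.m2 / self.count)
        return (x - self.mean) / (std + self.eps)


norm = RunningNormalizer()
for value in [1.0, 2.0, 3.0]:
    norm.update(value)
```

The statistics should be frozen, or saved with the model, at evaluation time so that the agent sees inputs on the same scale as during training.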

Overfitting

  • Use validation set
  • Reduce model complexity
  • Add entropy regularization
  • Train on more diverse data
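
For time-series data, a validation set should be split off chronologically rather than by random shuffling, so that the agent is never validated on data older than what it trained on. The sketch below shows one way to do this; the function name and the 70/15/15 proportions are assumptions.

```python
def chronological_split(rows, train_frac=0.7, val_frac=0.15):
    """Split time-ordered rows into train, validation, and test segments."""
    n = len(rows)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    # Slices preserve order: train is oldest, test is most recent
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


train, val, test = chronological_split(list(range(100)))
```

The same slicing works on a pandas DataFrame via `.iloc`, and each segment can then be passed to its own TradingEnv for training, model selection, and final evaluation.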

API Reference

See src/ml/models/rl_agent.py and src/ml/environments/trading_env.py for complete API documentation.

Examples

See notebooks/reinforcement_learning_trading.ipynb for detailed examples including:

  • Environment setup
  • Agent training
  • Evaluation and analysis
  • Hyperparameter tuning
  • Production deployment

References