Database Optimization Guide for DSTA¶
This document provides comprehensive guidelines for optimizing database operations in the DSTA (Dr. Strange Trading Analysis) system.
Table of Contents¶
- Overview
- Indexing Strategies
- Query Optimization
- Connection Pooling
- Bulk Operations
- Caching Strategies
- Performance Monitoring
- Best Practices
Overview¶
The DSTA system handles large volumes of time-series financial data. Proper database optimization is critical for:
- Fast data retrieval for backtesting
- Real-time trading strategy execution
- Efficient historical data analysis
- Scalable multi-strategy operations
Key Performance Metrics¶
- Query Response Time: < 100ms for typical queries
- Bulk Insert Rate: > 10,000 rows/second
- Concurrent Connections: Up to 100 simultaneous users
- Data Retention: 5+ years of historical data
Indexing Strategies¶
Time-Series Data Indexes¶
For candlestick and trade data, the most common query pattern is time-range based filtering:
-- Primary index: (symbol, timestamp)
CREATE INDEX idx_candlestick_symbol_timestamp
ON api_candlestick (symbol, timestamp DESC);
-- Composite index for interval-specific queries
CREATE INDEX idx_candlestick_symbol_interval_timestamp
ON api_candlestick (symbol, interval, timestamp DESC);
-- Covering index for OHLCV queries
CREATE INDEX idx_candlestick_ohlcv
ON api_candlestick (symbol, timestamp)
INCLUDE (open, high, low, close, volume);
Trade and OrderBook Indexes¶
-- Trade data
CREATE INDEX idx_trade_symbol_timestamp
ON api_trade (symbol, timestamp DESC);
CREATE INDEX idx_trade_order_id
ON api_trade (order_id);
-- OrderBook snapshots
CREATE INDEX idx_orderbook_symbol_timestamp
ON api_orderbook (symbol, timestamp DESC);
CREATE INDEX idx_orderbook_level
ON api_orderbook (symbol, timestamp, side, level);
Index Maintenance¶
-- Analyze table statistics
ANALYZE api_candlestick;
-- Reindex for performance
REINDEX INDEX CONCURRENTLY idx_candlestick_symbol_timestamp;
-- Check for unused indexes
SELECT
schemaname,
tablename,
indexname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY idx_tup_read DESC;
Using the QueryOptimizer¶
from optimization.query_optimizer import QueryOptimizer
optimizer = QueryOptimizer()
# Create recommended indexes
from api.models import Candlestick
optimizer.create_indexes(
Candlestick,
fields=['symbol', 'timestamp', 'interval']
)
# Analyze table
stats = optimizer.analyze_table('api_candlestick')
print(f"Table size: {stats['size_mb']:.2f} MB")
print(f"Row count: {stats['row_count']:,}")
Query Optimization¶
Using QueryOptimizer for Cached Queries¶
from optimization.query_optimizer import QueryOptimizer
optimizer = QueryOptimizer(cache_timeout=300)
# Execute query with caching
def get_ohlcv_data():
query = """
SELECT timestamp, open, high, low, close, volume
FROM api_candlestick
WHERE symbol = %s AND timestamp >= %s
ORDER BY timestamp
"""
return pd.read_sql_query(query, connection, params=['BTCUSDT', start_date])
data = optimizer.execute_cached_query(
cache_key='btcusdt_daily_2024',
query_func=get_ohlcv_data,
timeout=600 # 10 minutes
)
Optimizing Django QuerySets¶
from api.models import Candlestick
from optimization.query_optimizer import QueryOptimizer
optimizer = QueryOptimizer()
# Get queryset
queryset = Candlestick.objects.filter(symbol='BTCUSDT')
# Auto-optimize with select_related and prefetch_related
optimized_qs = optimizer.optimize_queryset(
queryset,
prefetch_related=['related_metrics']
)
# This reduces N+1 queries significantly
for candle in optimized_qs:
print(candle.symbol, candle.close)
Query Analysis¶
from optimization.query_optimizer import QueryOptimizer
optimizer = QueryOptimizer()
# Analyze slow query
query = """
SELECT symbol, AVG(close) as avg_price
FROM api_candlestick
WHERE timestamp >= '2024-01-01'
GROUP BY symbol
"""
analysis = optimizer.analyzer.analyze_query(query)
print("EXPLAIN output:", analysis['explain_output'])
print("Recommendations:", analysis['recommendations'])
# Get slow queries
slow_queries = optimizer.get_slow_queries(threshold=1.0)
for sq in slow_queries:
print(f"Query: {sq['query'][:50]}...")
print(f"Avg time: {sq['avg_time']:.3f}s")
print(f"Count: {sq['count']}")
Partitioning for Large Tables¶
-- Partition by time range (monthly)
CREATE TABLE api_candlestick (
id SERIAL,
symbol VARCHAR(20),
timestamp TIMESTAMP,
interval VARCHAR(10),
open DECIMAL(20, 8),
high DECIMAL(20, 8),
low DECIMAL(20, 8),
close DECIMAL(20, 8),
volume DECIMAL(20, 8)
) PARTITION BY RANGE (timestamp);
-- Create partitions
CREATE TABLE api_candlestick_2024_01
PARTITION OF api_candlestick
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE api_candlestick_2024_02
PARTITION OF api_candlestick
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Automatic partition creation can be handled by pg_partman extension
Connection Pooling¶
PostgreSQL Configuration¶
# postgresql.conf
max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 20MB
min_wal_size = 2GB
max_wal_size = 8GB
Django Database Settings¶
# settings.py
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.postgresql',
'NAME': 'dsta',
'USER': 'dsta_user',
'PASSWORD': 'secure_password',
'HOST': 'localhost',
'PORT': '5432',
'CONN_MAX_AGE': 600, # Connection pooling
'OPTIONS': {
'connect_timeout': 10,
'options': '-c statement_timeout=30000', # 30 seconds
},
}
}
# Connection pool settings
CONN_MAX_AGE = 600 # 10 minutes
Using PgBouncer¶
# pgbouncer.ini
[databases]
dsta = host=localhost port=5432 dbname=dsta
[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
# Pool settings
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 5
# Timeouts
server_idle_timeout = 600
server_lifetime = 3600
Bulk Operations¶
Using BulkOperations Class¶
from optimization.query_optimizer import BulkOperations
from api.models import Candlestick
bulk_ops = BulkOperations()
# Prepare data
candlestick_data = [
{
'symbol': 'BTCUSDT',
'timestamp': datetime(2024, 1, 1, i),
'interval': '1h',
'open': 45000.0,
'high': 45100.0,
'low': 44900.0,
'close': 45050.0,
'volume': 1000.0,
}
for i in range(1000)
]
# Bulk create (much faster than individual saves)
bulk_ops.bulk_create(Candlestick, candlestick_data, batch_size=1000)
# Bulk update
candlesticks = Candlestick.objects.filter(symbol='BTCUSDT')[:1000]
for c in candlesticks:
c.volume *= 1.1 # Update volume
bulk_ops.bulk_update(Candlestick, list(candlesticks), ['volume'], batch_size=1000)
Performance Comparison¶
| Operation | Individual Saves | Bulk Operations | Speedup |
|---|---|---|---|
| 1,000 inserts | 15.2 sec | 0.8 sec | 19x |
| 10,000 inserts | 152.5 sec | 3.2 sec | 48x |
| 1,000 updates | 12.8 sec | 0.6 sec | 21x |
COPY for Massive Inserts¶
import csv
from django.db import connection
def bulk_copy_from_csv(csv_file_path: str, table_name: str):
"""Use PostgreSQL COPY for fastest bulk inserts."""
with open(csv_file_path, 'r') as f:
with connection.cursor() as cursor:
cursor.copy_expert(
f"COPY {table_name} FROM STDIN WITH CSV HEADER",
f
)
Caching Strategies¶
Application-Level Caching¶
# settings.py
CACHES = {
'default': {
'BACKEND': 'django.core.cache.backends.redis.RedisCache',
'LOCATION': 'redis://127.0.0.1:6379/1',
'OPTIONS': {
'CLIENT_CLASS': 'django_redis.client.DefaultClient',
},
'KEY_PREFIX': 'dsta',
'TIMEOUT': 300, # 5 minutes default
}
}
Using QueryCache¶
from optimization.query_optimizer import QueryCache
cache = QueryCache(default_timeout=300)
# Cache query results
key = 'btcusdt_ohlcv_2024'
data = cache.get(key)
if data is None:
# Execute query
data = execute_expensive_query()
cache.set(key, data, timeout=600)
# Get cache statistics
stats = cache.get_stats()
print(f"Hit rate: {stats['hit_rate_percent']:.1f}%")
print(f"Total requests: {stats['total_requests']}")
Materialized Views¶
-- Create materialized view for daily aggregates
CREATE MATERIALIZED VIEW daily_ohlcv AS
SELECT
symbol,
DATE_TRUNC('day', timestamp) as day,
FIRST(open) as open,
MAX(high) as high,
MIN(low) as low,
LAST(close) as close,
SUM(volume) as volume
FROM api_candlestick
WHERE interval = '1h'
GROUP BY symbol, DATE_TRUNC('day', timestamp);
-- Create index on materialized view
CREATE INDEX idx_daily_ohlcv_symbol_day ON daily_ohlcv (symbol, day);
-- Refresh periodically (can be automated with cron)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_ohlcv;
Performance Monitoring¶
Query Performance Tracking¶
from optimization.query_optimizer import QueryOptimizer
optimizer = QueryOptimizer()
# Execute queries...
# Optimizer automatically tracks performance
# Get statistics
stats = optimizer.get_query_stats()
print(f"Total queries: {stats['total_queries']}")
print(f"Average time: {stats['avg_time']:.3f}s")
# Identify slow queries
slow_queries = optimizer.get_slow_queries(threshold=1.0)
for query in slow_queries:
print(f"\nSlow query detected:")
print(f" Query: {query['query'][:100]}...")
print(f" Avg time: {query['avg_time']:.3f}s")
print(f" Executions: {query['count']}")
PostgreSQL Monitoring Queries¶
-- Active queries
SELECT
pid,
now() - query_start as duration,
query,
state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
-- Table I/O statistics
SELECT
schemaname,
tablename,
heap_blks_read,
heap_blks_hit,
ROUND(heap_blks_hit::numeric / NULLIF(heap_blks_hit + heap_blks_read, 0) * 100, 2) as cache_hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;
-- Index usage
SELECT
schemaname,
tablename,
indexname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 20;
Best Practices¶
1. Always Use Indexes for Time-Series Queries¶
✅ Good:
# Indexed query
Candlestick.objects.filter(
symbol='BTCUSDT',
timestamp__gte=start_date,
timestamp__lte=end_date
).order_by('timestamp')
❌ Bad:
2. Use select_related() and prefetch_related()¶
✅ Good:
❌ Bad:
# N+1 query problem
trades = Trade.objects.all()
for trade in trades:
print(trade.order.id) # Separate query for each trade!
3. Batch Database Operations¶
✅ Good:
❌ Bad:
4. Use Database-Level Aggregations¶
✅ Good:
# Aggregation in database
from django.db.models import Avg, Max, Min
stats = Candlestick.objects.filter(symbol='BTCUSDT').aggregate(
avg_close=Avg('close'),
max_high=Max('high'),
min_low=Min('low')
)
❌ Bad:
# Fetch all data and aggregate in Python
candles = Candlestick.objects.filter(symbol='BTCUSDT').all()
avg_close = sum(c.close for c in candles) / len(candles) # Slow and memory-intensive
5. Implement Proper Caching¶
✅ Good:
from django.core.cache import cache
def get_daily_ohlcv(symbol, date):
key = f'daily_ohlcv_{symbol}_{date}'
data = cache.get(key)
if data is None:
data = Candlestick.objects.filter(
symbol=symbol,
timestamp__date=date
).values()
cache.set(key, data, timeout=3600)
return data
❌ Bad:
def get_daily_ohlcv(symbol, date):
# Always hit database, no caching
return Candlestick.objects.filter(
symbol=symbol,
timestamp__date=date
).values()
6. Use Connection Pooling¶
✅ Good:
# settings.py with connection pooling
DATABASES = {
'default': {
'CONN_MAX_AGE': 600, # Reuse connections
}
}
❌ Bad:
# Creating new connection for each request
DATABASES = {
'default': {
'CONN_MAX_AGE': 0, # No connection reuse
}
}
7. Monitor and Optimize Slow Queries¶
from optimization.query_optimizer import QueryOptimizer
# Regular monitoring
optimizer = QueryOptimizer()
slow_queries = optimizer.get_slow_queries(threshold=1.0)
if slow_queries:
logger.warning(f"Found {len(slow_queries)} slow queries")
for query in slow_queries:
# Analyze and optimize
analysis = optimizer.analyzer.analyze_query(query['query'])
logger.info(f"Recommendations: {analysis['recommendations']}")
8. Use Read Replicas for Heavy Queries¶
# settings.py
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.postgresql',
'NAME': 'dsta',
# Primary database config
},
'replica': {
'ENGINE': 'django.db.backends.postgresql',
'NAME': 'dsta',
# Replica database config (read-only)
}
}
# Use replica for heavy read queries
Candlestick.objects.using('replica').filter(symbol='BTCUSDT').all()
Conclusion¶
Proper database optimization is essential for the DSTA system's performance. Key takeaways:
- Index strategically - Focus on time-series queries
- Batch operations - Use bulk creates/updates
- Cache aggressively - Reduce database load
- Monitor continuously - Identify and fix slow queries
- Scale horizontally - Use read replicas for heavy loads
Following these guidelines will ensure the DSTA system can handle large volumes of financial data efficiently and scale to support multiple trading strategies simultaneously.
References¶
Last Updated: January 27, 2025