Deployment Runbook¶
This guide covers production deployment, environment configuration, database management, monitoring, and troubleshooting for the Tessellate Renewables platform.
1. Railway Deployment (Production)¶
Railway is the primary deployment platform. The project contains multiple services that deploy independently from the same repository.
Service Configuration¶
| Service | Source | Dockerfile | Start Command | Port |
|---|---|---|---|---|
| api | main branch | `Dockerfile` | `uvicorn app:app --host 0.0.0.0 --port 8000 --workers 4` | 8000 |
| worker | main branch | `Dockerfile.worker` | `celery -A celery_config worker -l info -Q default,solar,wind,reports -c 4` | - |
| redis | Railway Plugin | - | Managed | 6379 |
| postgres | Railway Plugin | - | Managed | 5432 |
| streamlit | main branch | `Dockerfile.streamlit` | `streamlit run frontend/app.py --server.port 8501` | 8501 |
| ml-engine | main branch | `Dockerfile.ml-engine` | `python -m modal_gpu` | - |
Deploying Updates¶
Railway auto-deploys from the main branch. For manual deployments:
```bash
# Install Railway CLI
npm install -g @railway/cli

# Login and link project
railway login
railway link

# Deploy a specific service
railway up --service api
railway up --service worker

# View deployment logs
railway logs --service api
railway logs --service worker

# Open Railway dashboard
railway open
```
Rolling Deployments¶
Railway performs zero-downtime rolling deployments by default:
- New container is built and started.
- Health check is verified (`/health` for the API service).
- Traffic is shifted to the new container.
- Old container is drained and stopped.
If the health check fails, the deployment is rolled back automatically.
Railway Configuration Files¶
- `railway.worker.toml` -- Worker service overrides
- `railway.streamlit.toml` -- Streamlit service overrides
- `railway.ml-engine.toml` -- ML engine service overrides
- `Procfile` -- Process definitions (alternative to Dockerfiles)
2. Environment Variables Reference¶
All environment variables used across services. Variables are set in Railway's service settings or via the Railway CLI.
Core Application¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `APP_ENV` | Yes | - | Environment: `production`, `staging`, `development` |
| `SECRET_KEY` | Yes | - | Application secret for JWT signing (min 32 chars) |
| `API_BASE_URL` | Yes | - | Public API URL (e.g., `https://api.tessellate.energy`) |
| `ALLOWED_ORIGINS` | No | `*` | CORS allowed origins (comma-separated) |
| `LOG_LEVEL` | No | `INFO` | Logging level: DEBUG, INFO, WARNING, ERROR |
| `WORKERS` | No | `4` | Uvicorn worker count |
| `DEBUG` | No | `false` | Enable debug mode (never in production) |
Database¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `DATABASE_URL` | Yes | - | PostgreSQL connection string |
| `DATABASE_POOL_SIZE` | No | `10` | SQLAlchemy connection pool size |
| `DATABASE_MAX_OVERFLOW` | No | `20` | Max overflow connections |
| `DATABASE_POOL_TIMEOUT` | No | `30` | Pool checkout timeout (seconds) |
| `DATABASE_ECHO` | No | `false` | Echo SQL statements to logs |
Redis¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `REDIS_URL` | Yes | - | Redis connection string |
| `REDIS_MAX_CONNECTIONS` | No | `50` | Max connection pool size |
| `REDIS_CACHE_TTL` | No | `3600` | Default cache TTL in seconds |
| `REDIS_SSL` | No | `false` | Enable TLS for Redis connection |
Celery¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `CELERY_BROKER_URL` | Yes | - | Redis URL for task broker (usually same as `REDIS_URL`) |
| `CELERY_RESULT_BACKEND` | Yes | - | Redis URL for task results |
| `CELERY_WORKER_CONCURRENCY` | No | `4` | Worker process concurrency |
| `CELERY_TASK_SOFT_TIME_LIMIT` | No | `300` | Soft time limit per task (seconds) |
| `CELERY_TASK_TIME_LIMIT` | No | `600` | Hard time limit per task (seconds) |
| `CELERY_WORKER_MAX_TASKS_PER_CHILD` | No | `100` | Recycle worker after N tasks |
Authentication¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `JWT_SECRET_KEY` | Yes | - | JWT signing key (can be same as `SECRET_KEY`) |
| `JWT_ALGORITHM` | No | `HS256` | JWT algorithm |
| `JWT_ACCESS_TOKEN_EXPIRE_MINUTES` | No | `30` | Access token lifetime |
| `JWT_REFRESH_TOKEN_EXPIRE_DAYS` | No | `7` | Refresh token lifetime |
External Services¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `SENDGRID_API_KEY` | Yes | - | SendGrid API key for transactional email |
| `SENDGRID_FROM_EMAIL` | No | `[email protected]` | Sender email address |
| `STRIPE_SECRET_KEY` | Yes | - | Stripe secret key |
| `STRIPE_PUBLISHABLE_KEY` | Yes | - | Stripe publishable key |
| `STRIPE_WEBHOOK_SECRET` | Yes | - | Stripe webhook signing secret |
| `NREL_API_KEY` | Yes | - | NREL API key for solar/weather data |
| `MODAL_TOKEN_ID` | No | - | Modal GPU compute token ID |
| `MODAL_TOKEN_SECRET` | No | - | Modal GPU compute token secret |
Rate Limiting and Metering¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `RATE_LIMIT_DEFAULT` | No | `1000` | Default requests per minute per key |
| `RATE_LIMIT_BURST` | No | `50` | Burst allowance above rate limit |
| `METERING_ENABLED` | No | `true` | Enable API usage metering |
| `METERING_FLUSH_INTERVAL` | No | `60` | Metering buffer flush interval (seconds) |
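`RATE_LIMIT_DEFAULT` and `RATE_LIMIT_BURST` combine into a single per-window check. A minimal in-memory sketch of that logic (the production service presumably tracks counters in Redis; the class and method names here are illustrative):

```python
import time
from typing import Optional


class FixedWindowLimiter:
    """Allow `limit + burst` requests per window, counted per API key."""

    def __init__(self, limit: int = 1000, burst: int = 50, window_s: int = 60):
        self.limit = limit
        self.burst = burst
        self.window_s = window_s
        # (api_key, window index) -> request count
        self._counts: dict = {}

    def allow(self, api_key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now // self.window_s)
        bucket = (api_key, window)
        count = self._counts.get(bucket, 0)
        if count >= self.limit + self.burst:
            return False  # over limit plus burst: reject
        self._counts[bucket] = count + 1
        return True


limiter = FixedWindowLimiter(limit=2, burst=1)
# first 3 calls (limit 2 + burst 1) pass, the 4th is rejected
results = [limiter.allow("key_a", now=0.0) for _ in range(4)]
```

The counter resets naturally when the window index rolls over, which is why a fixed TTL on the Redis key is enough in practice.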
Feature Flags¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `ENABLE_WIND_DIVISION` | No | `true` | Enable wind analysis endpoints |
| `ENABLE_GPU_TASKS` | No | `false` | Enable GPU-accelerated tasks |
| `ENABLE_WEBHOOKS` | No | `true` | Enable webhook delivery |
| `ENABLE_SSE` | No | `true` | Enable server-sent events |
| `ENABLE_AB_TESTING` | No | `false` | Enable A/B testing framework |
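Taken together, the tables above imply a loader that enforces required variables and applies typed defaults. A stdlib-only sketch covering a few of the rows (the function names are illustrative, not the project's actual config module):

```python
import os


def env_bool(name: str, default: bool) -> bool:
    """Parse a boolean flag like DEBUG or METERING_ENABLED."""
    return os.getenv(name, str(default)).strip().lower() in ("1", "true", "yes")


def env_required(name: str) -> str:
    """Fail fast at startup when a required variable is missing."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


def load_settings() -> dict:
    # Mirrors a few rows of the tables above.
    return {
        "app_env": env_required("APP_ENV"),
        "log_level": os.getenv("LOG_LEVEL", "INFO"),
        "workers": int(os.getenv("WORKERS", "4")),
        "debug": env_bool("DEBUG", False),
        "rate_limit_default": int(os.getenv("RATE_LIMIT_DEFAULT", "1000")),
    }
```

Failing fast on missing required variables surfaces misconfiguration at deploy time rather than on the first request.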
3. Docker Deployment¶
For self-hosted or non-Railway deployments, use Docker Compose.
Production Docker Compose¶
The file `docker-compose.prod.yml` defines the full production stack:
```bash
# Build and start all services
docker compose -f docker-compose.prod.yml up -d --build

# View logs
docker compose -f docker-compose.prod.yml logs -f api
docker compose -f docker-compose.prod.yml logs -f worker

# Scale workers
docker compose -f docker-compose.prod.yml up -d --scale worker=4

# Stop all services
docker compose -f docker-compose.prod.yml down

# Stop and remove volumes (WARNING: deletes data)
docker compose -f docker-compose.prod.yml down -v
```
Docker Build Notes¶
- Multi-stage builds: The API Dockerfile uses multi-stage builds to minimize image size. The final image is based on `python:3.11-slim`.
- Layer caching: Requirements are installed before copying application code to maximize Docker layer cache hits.
- Non-root user: Containers run as a non-root `appuser` for security.
- Health checks: Each container defines a Docker `HEALTHCHECK` instruction.
Environment File¶
Create a `.env` file for local Docker deployments (never commit this file):
```bash
# .env (DO NOT COMMIT)
APP_ENV=production
SECRET_KEY=your-secret-key-min-32-characters-long
DATABASE_URL=postgresql://user:password@postgres:5432/tessellate
REDIS_URL=redis://redis:6379/0
CELERY_BROKER_URL=redis://redis:6379/1
CELERY_RESULT_BACKEND=redis://redis:6379/1
SENDGRID_API_KEY=SG.xxxxx
STRIPE_SECRET_KEY=sk_live_xxxxx
STRIPE_PUBLISHABLE_KEY=pk_live_xxxxx
STRIPE_WEBHOOK_SECRET=whsec_xxxxx
NREL_API_KEY=xxxxx
JWT_SECRET_KEY=your-jwt-secret-key
```
4. Database Migrations¶
The project uses Alembic for database schema migrations.
Common Commands¶
```bash
# Check current migration version
alembic current

# View migration history
alembic history --verbose

# Generate a new migration from model changes
alembic revision --autogenerate -m "Add wind_jobs table"

# Apply all pending migrations
alembic upgrade head

# Apply next migration only
alembic upgrade +1

# Rollback one migration
alembic downgrade -1

# Rollback to a specific revision
alembic downgrade abc123

# Rollback all migrations (WARNING: destructive)
alembic downgrade base

# Show the SQL for a migration without applying it
alembic upgrade head --sql
```
Migration Best Practices¶
- Always review auto-generated migrations. Alembic's autogenerate is not perfect; it may miss index changes or generate incorrect `ALTER` statements.
- Test migrations on a copy of production data before deploying.
- Never edit a migration that has been deployed. Create a new migration to fix issues.
- Include both upgrade and downgrade functions in every migration.
- Use batch operations for large tables to avoid long locks.
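The last point might look like this inside a migration (an illustrative fragment, not runnable outside a migration context; the `wind_jobs` table and column names are assumptions):

```python
from alembic import op
import sqlalchemy as sa


def upgrade() -> None:
    # Group several schema changes into one pass over the table,
    # rather than taking a lock once per ALTER statement.
    with op.batch_alter_table("wind_jobs") as batch_op:
        batch_op.add_column(sa.Column("hub_height_m", sa.Float(), nullable=True))
        batch_op.create_index("ix_wind_jobs_hub_height_m", ["hub_height_m"])


def downgrade() -> None:
    with op.batch_alter_table("wind_jobs") as batch_op:
        batch_op.drop_index("ix_wind_jobs_hub_height_m")
        batch_op.drop_column("hub_height_m")
```

Adding the column as nullable avoids a full-table rewrite on PostgreSQL; a backfill can then run as a separate, throttled step.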
Railway Migrations¶
Migrations run automatically on deploy via the API service entrypoint:
```bash
#!/bin/bash
# docker-entrypoint.sh
set -e

echo "Running database migrations..."
alembic upgrade head

echo "Starting application..."
exec uvicorn app:app --host 0.0.0.0 --port ${PORT:-8000} --workers ${WORKERS:-4}
```
5. Celery Worker Setup¶
Starting Workers¶
```bash
# Start a worker with all queues
celery -A celery_config worker \
  --loglevel=info \
  --queues=default,solar,wind,reports \
  --concurrency=4 \
  --max-tasks-per-child=100

# Start a worker for solar tasks only
celery -A celery_config worker \
  --loglevel=info \
  --queues=solar \
  --concurrency=2 \
  --hostname=solar-worker@%h

# Start a worker for GPU tasks (single concurrency)
celery -A celery_config worker \
  --loglevel=info \
  --queues=gpu \
  --concurrency=1 \
  --hostname=gpu-worker@%h

# Start Celery Beat (periodic task scheduler)
celery -A celery_config beat \
  --loglevel=info \
  --schedule=/tmp/celerybeat-schedule
```
Monitoring Workers¶
```bash
# List active workers
celery -A celery_config inspect active

# List registered tasks
celery -A celery_config inspect registered

# View worker statistics
celery -A celery_config inspect stats

# Purge all queued tasks (WARNING: destructive)
celery -A celery_config purge

# Monitor in real-time with Flower
celery -A celery_config flower --port=5555
```
Worker Configuration¶
Key settings in `celery_config.py`:

| Setting | Value | Description |
|---|---|---|
| `task_serializer` | `json` | JSON serialization for tasks |
| `result_serializer` | `json` | JSON serialization for results |
| `accept_content` | `["json"]` | Only accept JSON content |
| `timezone` | `UTC` | All timestamps in UTC |
| `task_track_started` | `True` | Track when tasks begin execution |
| `task_acks_late` | `True` | Acknowledge after completion (not receipt) |
| `worker_prefetch_multiplier` | `1` | Prefetch one task at a time |
| `task_reject_on_worker_lost` | `True` | Requeue if worker dies |
6. Health Check Endpoints¶
API Health Checks¶
| Endpoint | Method | Description | Used By |
|---|---|---|---|
| `/health` | GET | Basic liveness check | Railway, load balancer |
| `/health/ready` | GET | Readiness check (DB + Redis) | Kubernetes, Railway |
| `/health/detailed` | GET | Full component status | Monitoring, ops |
GET /health¶
Returns 200 OK if the API process is running.
GET /health/ready¶
Returns 200 OK only if all critical dependencies are reachable.
```json
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "celery": "ok"
  },
  "timestamp": "2025-01-15T10:30:00Z"
}
```
Returns 503 Service Unavailable if any check fails:
```json
{
  "status": "not_ready",
  "checks": {
    "database": "ok",
    "redis": "error: connection refused",
    "celery": "ok"
  },
  "timestamp": "2025-01-15T10:30:00Z"
}
```
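The two payloads above follow one aggregation rule: any failing dependency check flips both the status field and the HTTP code. A sketch of that logic (the function name is illustrative):

```python
from datetime import datetime, timezone


def readiness_response(checks: dict) -> tuple:
    """Aggregate dependency checks into the /health/ready payload.
    Any check whose value is not "ok" makes the service not ready (503)."""
    ready = all(result == "ok" for result in checks.values())
    body = {
        "status": "ready" if ready else "not_ready",
        "checks": checks,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return body, 200 if ready else 503


body, status = readiness_response(
    {"database": "ok", "redis": "error: connection refused", "celery": "ok"}
)
# status == 503, body["status"] == "not_ready"
```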
GET /health/detailed¶
Returns full system health with performance metrics (requires admin auth).
```json
{
  "status": "healthy",
  "version": "1.4.2",
  "uptime_seconds": 86400,
  "components": {
    "database": {
      "status": "ok",
      "latency_ms": 2.3,
      "pool_size": 10,
      "pool_checked_out": 3
    },
    "redis": {
      "status": "ok",
      "latency_ms": 0.8,
      "memory_used_mb": 45.2,
      "connected_clients": 12
    },
    "celery": {
      "status": "ok",
      "active_workers": 4,
      "queued_tasks": 7,
      "active_tasks": 3
    }
  },
  "timestamp": "2025-01-15T10:30:00Z"
}
```
7. Monitoring Setup¶
Application Metrics¶
The platform exposes Prometheus-compatible metrics at /metrics (when enabled):
| Metric | Type | Description |
|---|---|---|
| `http_requests_total` | Counter | Total HTTP requests by method, path, status |
| `http_request_duration_seconds` | Histogram | Request latency distribution |
| `celery_tasks_total` | Counter | Total Celery tasks by name, status |
| `celery_task_duration_seconds` | Histogram | Task execution time |
| `db_query_duration_seconds` | Histogram | Database query latency |
| `redis_operations_total` | Counter | Redis operations by command |
| `active_websocket_connections` | Gauge | Current SSE connections |
| `api_keys_active` | Gauge | Number of active API keys |
SLO Monitoring¶
The platform includes built-in SLO (Service Level Objective) tracking. See `services/slo_tracker.py` and the `/slo/*` endpoints for:
- API availability (target: 99.9%)
- P95 latency (target: < 500ms)
- P99 latency (target: < 2000ms)
- Error rate (target: < 1%)
- Error budget tracking and burn rate alerts
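The error budget behind these targets is simple arithmetic: a 99.9% availability target allows 0.1% of requests to fail, and the budget is the fraction of those allowed failures not yet consumed. A sketch of that calculation (the actual tracker in `services/slo_tracker.py` may differ):

```python
def error_budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the SLO error budget left, clamped to [0, 1].
    A 99.9% target allows (1 - 0.999) * total_requests failures."""
    allowed_failures = (1.0 - target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1.0 - failed_requests / allowed_failures)


# 99.9% over 1,000,000 requests allows 1,000 failures;
# 400 failures leaves 60% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
```

Burn-rate alerting then compares how fast this fraction is dropping against the length of the SLO window.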
Log Aggregation¶
Structured JSON logging is used across all services:
```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "service": "api",
  "request_id": "req_abc123",
  "user_id": "usr_xyz",
  "org_id": "org_456",
  "method": "POST",
  "path": "/v1/solar/optimize",
  "status": 202,
  "duration_ms": 45,
  "message": "Solar optimization job created"
}
```
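A stdlib-only formatter producing records in this shape might look like the following (a sketch; how the real services inject `request_id` and the other context fields is an assumption):

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON matching the shape above."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "api",
            "message": record.getMessage(),
        }
        # Carry through structured fields passed via `extra=...`
        for field in ("request_id", "user_id", "org_id", "method", "path", "status", "duration_ms"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


logger = logging.getLogger("tessellate.api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Solar optimization job created", extra={"request_id": "req_abc123", "status": 202})
```

One-line JSON records are what make the `grep '"level":"ERROR"'` filter below reliable.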
View logs in Railway:
```bash
# API logs (last 100 lines)
railway logs --service api -n 100

# Worker logs (follow mode)
railway logs --service worker -f

# Filter by error level
railway logs --service api | grep '"level":"ERROR"'
```
Alerting¶
Configure alerts in your monitoring platform for:
| Alert | Condition | Severity |
|---|---|---|
| API Down | Health check fails for > 2 minutes | Critical |
| High Error Rate | 5xx rate > 5% over 5 minutes | Critical |
| High Latency | P95 > 2s over 5 minutes | Warning |
| Worker Queue Backlog | Queued tasks > 100 for > 10 minutes | Warning |
| Database Connection Pool | Pool utilization > 80% | Warning |
| Redis Memory | Memory usage > 80% | Warning |
| Celery Worker Down | Active workers < expected count | Critical |
| Error Budget Burn | SLO error budget < 20% remaining | Warning |
| Error Budget Exhausted | SLO error budget = 0% | Critical |
8. Troubleshooting¶
Common Issues¶
API returns 502 Bad Gateway¶
Cause: The API service is not responding to the load balancer health check.
Resolution:
1. Check API logs: `railway logs --service api -n 200`
2. Verify the health endpoint: `curl https://api.tessellate.energy/health`
3. Check if the database is reachable: `railway logs --service api | grep "database"`
4. Restart the service: `railway service restart api`
Celery tasks stuck in "queued" status¶
Cause: Workers are not consuming tasks from the queue.
Resolution:
1. Check worker logs: `railway logs --service worker -n 200`
2. Verify Redis connectivity: `railway logs --service worker | grep "redis"`
3. Check active workers: `celery -A celery_config inspect active`
4. Check queue lengths: `celery -A celery_config inspect reserved`
5. Restart workers: `railway service restart worker`
Database connection pool exhausted¶
Cause: Too many concurrent database connections. Pool size is exceeded.
Resolution:
1. Check pool stats via the `/health/detailed` endpoint.
2. Look for long-running queries:

```sql
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
```

3. Terminate queries that have been running for more than 5 minutes:

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE now() - query_start > interval '5 minutes' AND state != 'idle';
```

4. Increase `DATABASE_POOL_SIZE` and `DATABASE_MAX_OVERFLOW` if needed.
Redis out of memory¶
Cause: Cache or Celery result backend is consuming too much memory.
Resolution:
1. Check Redis memory: `redis-cli INFO memory`
2. Check key count by prefix:

```bash
redis-cli --scan --pattern "cache:*" | wc -l
redis-cli --scan --pattern "celery-task-meta-*" | wc -l
```

3. Lower `REDIS_CACHE_TTL` to expire cached items sooner.
4. Upgrade Redis instance memory on Railway.
Alembic migration fails on deploy¶
Cause: Migration conflicts or incompatible schema changes.
Resolution:
1. Check the migration error in the deploy logs.
2. Connect to the database and check the current migration version.
3. If a migration was partially applied, check the `alembic_version` table.
4. Fix the migration and re-deploy, or manually set the version.
Solar optimization job fails with OOM¶
Cause: The optimization is consuming too much memory (large search space or many iterations).
Resolution:
1. Check worker logs for the specific task ID.
2. Reduce optimization parameters:
   - Lower `population_size` for NSGA-II.
   - Reduce `n_iterations` for Bayesian optimization.
   - Use `grid_search` with coarser step sizes.
3. Increase worker memory allocation on Railway.
4. Set `CELERY_WORKER_MAX_TASKS_PER_CHILD=50` to recycle workers more frequently.
Webhook delivery failures¶
Cause: The target URL is unreachable or returning errors.
Resolution:
1. Check webhook delivery logs in the admin dashboard.
2. Verify the callback URL is publicly accessible.
3. Check for SSL certificate issues on the target.
4. Webhooks retry 3 times with exponential backoff (10s, 60s, 300s).
5. After 3 failures, the webhook is marked as failed. Re-trigger via API:
```bash
curl -X POST "https://api.tessellate.energy/v1/webhooks/{webhook_id}/retry" \
  -H "Authorization: Bearer $API_KEY"
```
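The retry schedule above (3 attempts at 10s, 60s, 300s) reduces to a small lookup; a sketch of the scheduling rule (the function name is illustrative):

```python
from typing import Optional

RETRY_DELAYS_S = [10, 60, 300]  # per the schedule above


def next_retry_delay(attempts_made: int) -> Optional[int]:
    """Delay in seconds before the next delivery attempt, or None once
    all retries are exhausted and the webhook should be marked failed."""
    if attempts_made < len(RETRY_DELAYS_S):
        return RETRY_DELAYS_S[attempts_made]
    return None
```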
Rate limit incorrectly applied¶
Cause: Rate limit counter in Redis is stale or misconfigured.
Resolution:
1. Check the current rate limit counter for the API key.
2. Clear the rate limit counter.
3. Verify the rate limit configuration for the organization's plan.
4. Check the `RATE_LIMIT_DEFAULT` environment variable.