Deployment Runbook

This guide covers production deployment, environment configuration, database management, monitoring, and troubleshooting for the Tessellate Renewables platform.


1. Railway Deployment (Production)

Railway is the primary deployment platform. The project contains multiple services that deploy independently from the same repository.

Service Configuration

Service Source Dockerfile Port Start Command
api main branch Dockerfile 8000 uvicorn app:app --host 0.0.0.0 --port 8000 --workers 4
worker main branch Dockerfile.worker - celery -A celery_config worker -l info -Q default,solar,wind,reports -c 4
redis Railway Plugin - 6379 Managed
postgres Railway Plugin - 5432 Managed
streamlit main branch Dockerfile.streamlit 8501 streamlit run frontend/app.py --server.port 8501
ml-engine main branch Dockerfile.ml-engine - python -m modal_gpu

Deploying Updates

Railway auto-deploys from the main branch. For manual deployments:

# Install Railway CLI
npm install -g @railway/cli

# Login and link project
railway login
railway link

# Deploy a specific service
railway up --service api
railway up --service worker

# View deployment logs
railway logs --service api
railway logs --service worker

# Open Railway dashboard
railway open

Rolling Deployments

Railway performs zero-downtime rolling deployments by default:

  1. New container is built and started.
  2. Health check is verified (/health for the API service).
  3. Traffic is shifted to the new container.
  4. Old container is drained and stopped.

If the health check fails, the deployment is rolled back automatically.

Railway Configuration Files

  • railway.worker.toml -- Worker service overrides
  • railway.streamlit.toml -- Streamlit service overrides
  • railway.ml-engine.toml -- ML engine service overrides
  • Procfile -- Process definitions (alternative to Dockerfiles)

2. Environment Variables Reference

This section lists all environment variables used across services. Variables are set in Railway's service settings or via the Railway CLI.

Core Application

Variable Required Default Description
APP_ENV Yes - Environment: production, staging, development
SECRET_KEY Yes - Application secret for JWT signing (min 32 chars)
API_BASE_URL Yes - Public API URL (e.g., https://api.tessellate.energy)
ALLOWED_ORIGINS No * CORS allowed origins (comma-separated)
LOG_LEVEL No INFO Logging level: DEBUG, INFO, WARNING, ERROR
WORKERS No 4 Uvicorn worker count
DEBUG No false Enable debug mode (never in production)

Database

Variable Required Default Description
DATABASE_URL Yes - PostgreSQL connection string
DATABASE_POOL_SIZE No 10 SQLAlchemy connection pool size
DATABASE_MAX_OVERFLOW No 20 Max overflow connections
DATABASE_POOL_TIMEOUT No 30 Pool checkout timeout (seconds)
DATABASE_ECHO No false Echo SQL statements to logs

Redis

Variable Required Default Description
REDIS_URL Yes - Redis connection string
REDIS_MAX_CONNECTIONS No 50 Max connection pool size
REDIS_CACHE_TTL No 3600 Default cache TTL in seconds
REDIS_SSL No false Enable TLS for Redis connection

Celery

Variable Required Default Description
CELERY_BROKER_URL Yes - Redis URL for task broker (usually same as REDIS_URL)
CELERY_RESULT_BACKEND Yes - Redis URL for task results
CELERY_WORKER_CONCURRENCY No 4 Worker process concurrency
CELERY_TASK_SOFT_TIME_LIMIT No 300 Soft time limit per task (seconds)
CELERY_TASK_TIME_LIMIT No 600 Hard time limit per task (seconds)
CELERY_WORKER_MAX_TASKS_PER_CHILD No 100 Recycle worker after N tasks

Authentication

Variable Required Default Description
JWT_SECRET_KEY Yes - JWT signing key (can be same as SECRET_KEY)
JWT_ALGORITHM No HS256 JWT algorithm
JWT_ACCESS_TOKEN_EXPIRE_MINUTES No 30 Access token lifetime
JWT_REFRESH_TOKEN_EXPIRE_DAYS No 7 Refresh token lifetime

External Services

Variable Required Default Description
SENDGRID_API_KEY Yes - SendGrid API key for transactional email
SENDGRID_FROM_EMAIL No [email protected] Sender email address
STRIPE_SECRET_KEY Yes - Stripe secret key
STRIPE_PUBLISHABLE_KEY Yes - Stripe publishable key
STRIPE_WEBHOOK_SECRET Yes - Stripe webhook signing secret
NREL_API_KEY Yes - NREL API key for solar/weather data
MODAL_TOKEN_ID No - Modal GPU compute token ID
MODAL_TOKEN_SECRET No - Modal GPU compute token secret

Rate Limiting and Metering

Variable Required Default Description
RATE_LIMIT_DEFAULT No 1000 Default requests per minute per key
RATE_LIMIT_BURST No 50 Burst allowance above rate limit
METERING_ENABLED No true Enable API usage metering
METERING_FLUSH_INTERVAL No 60 Metering buffer flush interval (seconds)
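
RATE_LIMIT_DEFAULT and RATE_LIMIT_BURST combine into a per-key, per-window request budget. As a rough in-memory sketch of the fixed-window counting this implies (production enforcement lives in Redis; the class here is illustrative, not the platform's code):

```python
import time

class FixedWindowLimiter:
    """Illustrative in-memory fixed-window limiter.
    Production uses Redis counters keyed per API key and window."""

    def __init__(self, limit=1000, burst=50, window_seconds=60):
        self.limit = limit + burst          # RATE_LIMIT_DEFAULT + RATE_LIMIT_BURST
        self.window = window_seconds
        self.counts = {}                    # (key, window_id) -> request count

    def allow(self, api_key, now=None):
        now = time.time() if now is None else now
        bucket = (api_key, int(now // self.window))
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= self.limit

limiter = FixedWindowLimiter(limit=2, burst=1, window_seconds=60)
results = [limiter.allow("key_a", now=0) for _ in range(4)]
# first three allowed (limit + burst = 3), fourth rejected
```

A new window starts every `window_seconds`, so a rejected key is admitted again once the clock crosses the next window boundary.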

Feature Flags

Variable Required Default Description
ENABLE_WIND_DIVISION No true Enable wind analysis endpoints
ENABLE_GPU_TASKS No false Enable GPU-accelerated tasks
ENABLE_WEBHOOKS No true Enable webhook delivery
ENABLE_SSE No true Enable server-sent events
ENABLE_AB_TESTING No false Enable A/B testing framework
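
The flags and numeric settings above follow a common pattern: unset means the documented default, and strings like "true"/"false" need parsing. A minimal sketch of such a loader (helper names are hypothetical, not the platform's actual config module):

```python
import os

TRUTHY = {"1", "true", "yes", "on"}

def env_bool(name, default):
    """Parse a boolean flag like ENABLE_WEBHOOKS; unset falls back to default."""
    raw = os.environ.get(name)
    return default if raw is None else raw.strip().lower() in TRUTHY

def env_int(name, default):
    """Parse an integer setting like WORKERS or CELERY_WORKER_CONCURRENCY."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)

# Defaults mirror the tables above.
ENABLE_WEBHOOKS = env_bool("ENABLE_WEBHOOKS", True)
WORKERS = env_int("WORKERS", 4)
```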

3. Docker Deployment

For self-hosted or non-Railway deployments, use Docker Compose.

Production Docker Compose

The file docker-compose.prod.yml defines the full production stack:

# Build and start all services
docker compose -f docker-compose.prod.yml up -d --build

# View logs
docker compose -f docker-compose.prod.yml logs -f api
docker compose -f docker-compose.prod.yml logs -f worker

# Scale workers
docker compose -f docker-compose.prod.yml up -d --scale worker=4

# Stop all services
docker compose -f docker-compose.prod.yml down

# Stop and remove volumes (WARNING: deletes data)
docker compose -f docker-compose.prod.yml down -v

Docker Build Notes

  • Multi-stage builds: The API Dockerfile uses multi-stage builds to minimize image size. The final image is based on python:3.11-slim.
  • Layer caching: Requirements are installed before copying application code to maximize Docker layer cache hits.
  • Non-root user: Containers run as a non-root appuser for security.
  • Health checks: Each container defines a Docker HEALTHCHECK instruction.

Environment File

Create a .env file for local Docker deployments (never commit this file):

# .env (DO NOT COMMIT)
APP_ENV=production
SECRET_KEY=your-secret-key-min-32-characters-long
DATABASE_URL=postgresql://user:password@postgres:5432/tessellate
REDIS_URL=redis://redis:6379/0
CELERY_BROKER_URL=redis://redis:6379/1
CELERY_RESULT_BACKEND=redis://redis:6379/1
SENDGRID_API_KEY=SG.xxxxx
STRIPE_SECRET_KEY=sk_live_xxxxx
STRIPE_PUBLISHABLE_KEY=pk_live_xxxxx
STRIPE_WEBHOOK_SECRET=whsec_xxxxx
NREL_API_KEY=xxxxx
JWT_SECRET_KEY=your-jwt-secret-key
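
A self-hosted stack fails more gracefully when missing required variables are caught at startup rather than mid-request. A hedged sketch of such a preflight check (the variable list mirrors the tables in section 2; the helper names are illustrative):

```python
import os

# Required variables, per the tables in section 2.
REQUIRED = [
    "APP_ENV", "SECRET_KEY", "API_BASE_URL",
    "DATABASE_URL", "REDIS_URL",
    "CELERY_BROKER_URL", "CELERY_RESULT_BACKEND",
    "JWT_SECRET_KEY", "SENDGRID_API_KEY",
    "STRIPE_SECRET_KEY", "STRIPE_PUBLISHABLE_KEY",
    "STRIPE_WEBHOOK_SECRET", "NREL_API_KEY",
]

def missing_vars(environ=os.environ):
    """Return required variables that are unset or empty."""
    return [name for name in REQUIRED if not environ.get(name)]

def validate(environ=os.environ):
    """Raise early with a readable message instead of failing later."""
    missing = missing_vars(environ)
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
```

A check like this could be called from the container entrypoint before the server starts.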

4. Database Migrations

The project uses Alembic for database schema migrations.

Common Commands

# Check current migration version
alembic current

# View migration history
alembic history --verbose

# Generate a new migration from model changes
alembic revision --autogenerate -m "Add wind_jobs table"

# Apply all pending migrations
alembic upgrade head

# Apply next migration only
alembic upgrade +1

# Rollback one migration
alembic downgrade -1

# Rollback to a specific revision
alembic downgrade abc123

# Rollback all migrations (WARNING: destructive)
alembic downgrade base

# Show the SQL for a migration without applying it
alembic upgrade head --sql

Migration Best Practices

  1. Always review auto-generated migrations. Alembic's autogenerate is not perfect; it may miss index changes or generate incorrect ALTER statements.

  2. Test migrations on a copy of production data before deploying:

    # Dump production schema (no data)
    pg_dump --schema-only $PROD_DATABASE_URL > schema.sql
    
    # Restore to test database
    psql $TEST_DATABASE_URL < schema.sql
    
    # Run migration against test
    DATABASE_URL=$TEST_DATABASE_URL alembic upgrade head
    

  3. Never edit a migration that has been deployed. Create a new migration to fix issues.

  4. Include both upgrade and downgrade functions in every migration.

  5. Use batch operations for large tables to avoid long locks:

    # In migration file
    def upgrade():
        # Add column as nullable first
        op.add_column('large_table', sa.Column('new_col', sa.String(), nullable=True))
        # Backfill in batches (separate script or task)
        # Then make non-nullable
        op.alter_column('large_table', 'new_col', nullable=False)
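
The backfill step above can be sketched as keyset-paginated batch updates, so each statement touches a bounded id range and holds its lock only briefly. This illustration uses sqlite3 as a stand-in for Postgres; in production the same pattern would run through the real driver, and the table/column names come from the example above:

```python
import sqlite3

def backfill_in_batches(conn, batch_size=1000):
    """Populate new_col in id-ordered batches (keyset pagination on the
    primary key), committing after each batch."""
    last_id = 0
    while True:
        # Find the upper id bound of the next batch.
        cur = conn.execute(
            "SELECT MAX(id) FROM (SELECT id FROM large_table "
            "WHERE id > ? ORDER BY id LIMIT ?) AS batch",
            (last_id, batch_size),
        )
        max_id = cur.fetchone()[0]
        if max_id is None:
            break  # no rows left
        conn.execute(
            "UPDATE large_table SET new_col = 'default' "
            "WHERE id > ? AND id <= ? AND new_col IS NULL",
            (last_id, max_id),
        )
        conn.commit()
        last_id = max_id
```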
    

Railway Migrations

Migrations run automatically on deploy via the API service entrypoint:

# docker-entrypoint.sh
#!/bin/bash
set -e
echo "Running database migrations..."
alembic upgrade head
echo "Starting application..."
exec uvicorn app:app --host 0.0.0.0 --port ${PORT:-8000} --workers ${WORKERS:-4}

5. Celery Worker Setup

Starting Workers

# Start a worker with all queues
celery -A celery_config worker \
  --loglevel=info \
  --queues=default,solar,wind,reports \
  --concurrency=4 \
  --max-tasks-per-child=100

# Start a worker for solar tasks only
celery -A celery_config worker \
  --loglevel=info \
  --queues=solar \
  --concurrency=2 \
  --hostname=solar-worker@%h

# Start a worker for GPU tasks (single concurrency)
celery -A celery_config worker \
  --loglevel=info \
  --queues=gpu \
  --concurrency=1 \
  --hostname=gpu-worker@%h

# Start Celery Beat (periodic task scheduler)
celery -A celery_config beat \
  --loglevel=info \
  --schedule=/tmp/celerybeat-schedule

Monitoring Workers

# List active workers
celery -A celery_config inspect active

# List registered tasks
celery -A celery_config inspect registered

# View worker statistics
celery -A celery_config inspect stats

# Purge all queued tasks (WARNING: destructive)
celery -A celery_config purge

# Monitor in real-time with Flower
celery -A celery_config flower --port=5555

Worker Configuration

Key settings in celery_config.py:

Setting Value Description
task_serializer json JSON serialization for tasks
result_serializer json JSON serialization for results
accept_content ["json"] Only accept JSON content
timezone UTC All timestamps in UTC
task_track_started True Track when tasks begin execution
task_acks_late True Acknowledge after completion (not receipt)
worker_prefetch_multiplier 1 Prefetch one task at a time
task_reject_on_worker_lost True Requeue if worker dies
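
The table maps to plain Celery settings. A sketch of the corresponding dictionary (how celery_config.py actually applies these to the app is assumed, e.g. via app.conf.update):

```python
# Mirrors the settings table above; assumed to be applied to the
# Celery app in celery_config.py via app.conf.update(**CELERY_SETTINGS).
CELERY_SETTINGS = {
    "task_serializer": "json",
    "result_serializer": "json",
    "accept_content": ["json"],           # only accept JSON content
    "timezone": "UTC",                    # all timestamps in UTC
    "task_track_started": True,           # track when tasks begin execution
    "task_acks_late": True,               # ack after completion, not receipt
    "worker_prefetch_multiplier": 1,      # prefetch one task at a time
    "task_reject_on_worker_lost": True,   # requeue if a worker dies
}
```

Together, task_acks_late and task_reject_on_worker_lost mean a task killed mid-run is redelivered, so tasks should be idempotent.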

6. Health Check Endpoints

API Health Checks

Endpoint Method Description Used By
/health GET Basic liveness check Railway, load balancer
/health/ready GET Readiness check (DB, Redis, Celery) Kubernetes, Railway
/health/detailed GET Full component status Monitoring, ops

GET /health

Returns 200 OK if the API process is running.

{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:00Z"
}

GET /health/ready

Returns 200 OK only if all critical dependencies are reachable.

{
  "status": "ready",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "celery": "ok"
  },
  "timestamp": "2025-01-15T10:30:00Z"
}

Returns 503 Service Unavailable if any check fails:

{
  "status": "not_ready",
  "checks": {
    "database": "ok",
    "redis": "error: connection refused",
    "celery": "ok"
  },
  "timestamp": "2025-01-15T10:30:00Z"
}
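
The readiness logic can be sketched as a probe aggregator: each dependency check either passes or records its error, and any failure yields a 503. The probe callables here are stand-ins for the real database, Redis, and Celery pings:

```python
from datetime import datetime, timezone

def readiness(checks):
    """Run each named probe; any failure flips the status to not_ready/503.

    `checks` maps a component name to a zero-argument callable that
    raises on failure (stand-ins for real DB/Redis/Celery pings).
    """
    results, healthy = {}, True
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
            healthy = False
    body = {
        "status": "ready" if healthy else "not_ready",
        "checks": results,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return (200 if healthy else 503), body
```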

GET /health/detailed

Returns full system health with performance metrics (requires admin auth).

{
  "status": "healthy",
  "version": "1.4.2",
  "uptime_seconds": 86400,
  "components": {
    "database": {
      "status": "ok",
      "latency_ms": 2.3,
      "pool_size": 10,
      "pool_checked_out": 3
    },
    "redis": {
      "status": "ok",
      "latency_ms": 0.8,
      "memory_used_mb": 45.2,
      "connected_clients": 12
    },
    "celery": {
      "status": "ok",
      "active_workers": 4,
      "queued_tasks": 7,
      "active_tasks": 3
    }
  },
  "timestamp": "2025-01-15T10:30:00Z"
}

7. Monitoring Setup

Application Metrics

The platform exposes Prometheus-compatible metrics at /metrics (when enabled):

Metric Type Description
http_requests_total Counter Total HTTP requests by method, path, status
http_request_duration_seconds Histogram Request latency distribution
celery_tasks_total Counter Total Celery tasks by name, status
celery_task_duration_seconds Histogram Task execution time
db_query_duration_seconds Histogram Database query latency
redis_operations_total Counter Redis operations by command
active_websocket_connections Gauge Current SSE connections
api_keys_active Gauge Number of active API keys

SLO Monitoring

The platform includes built-in SLO (Service Level Objective) tracking. See services/slo_tracker.py and the /slo/* endpoints for:

  • API availability (target: 99.9%)
  • P95 latency (target: < 500ms)
  • P99 latency (target: < 2000ms)
  • Error rate (target: < 1%)
  • Error budget tracking and burn rate alerts
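
The error-budget arithmetic behind the burn-rate alerts is simple: a 99.9% availability target allows 0.1% of requests to fail, and the remaining budget is whatever is left of that allowance in the window. A sketch (function name and request counts are illustrative):

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left in the window.

    For a 99.9% SLO the budget is 0.1% of requests; burning more
    failures than that clamps the remainder to zero.
    """
    budget = (1.0 - slo_target) * total_requests      # allowed failures
    if budget == 0:
        return 0.0
    return max(0.0, 1.0 - failed_requests / budget)

# 1,000,000 requests against a 99.9% target allow 1,000 failures;
# 800 observed failures leave ~20% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 800)
```

With the alert thresholds above, this example would already fire the "Error Budget Burn" warning (budget below 20% remaining).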

Log Aggregation

Structured JSON logging is used across all services:

{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "service": "api",
  "request_id": "req_abc123",
  "user_id": "usr_xyz",
  "org_id": "org_456",
  "method": "POST",
  "path": "/v1/solar/optimize",
  "status": 202,
  "duration_ms": 45,
  "message": "Solar optimization job created"
}
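
A formatter in this shape can be built on stdlib logging. This is a hedged sketch, not the platform's logger: field names follow the example above, and the context fields (request_id, path, status, ...) are assumed to arrive via logging's extra= mechanism from request middleware:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, matching the shape above."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "api",
            "message": record.getMessage(),
        }
        # Context fields set by middleware via logger.info(..., extra={...})
        # show up as attributes on the record.
        for field in ("request_id", "user_id", "org_id",
                      "method", "path", "status", "duration_ms"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)
```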

View logs in Railway:

# API logs (last 100 lines)
railway logs --service api -n 100

# Worker logs (follow mode)
railway logs --service worker -f

# Filter by error level
railway logs --service api | grep '"level":"ERROR"'

Alerting

Configure alerts in your monitoring platform for:

Alert Condition Severity
API Down Health check fails for > 2 minutes Critical
High Error Rate 5xx rate > 5% over 5 minutes Critical
High Latency P95 > 2s over 5 minutes Warning
Worker Queue Backlog Queued tasks > 100 for > 10 minutes Warning
Database Connection Pool Pool utilization > 80% Warning
Redis Memory Memory usage > 80% Warning
Celery Worker Down Active workers < expected count Critical
Error Budget Burn SLO error budget < 20% remaining Warning
Error Budget Exhausted SLO error budget = 0% Critical

8. Troubleshooting

Common Issues

API returns 502 Bad Gateway

Cause: The API service is not responding to the load balancer health check.

Resolution:

  1. Check API logs: railway logs --service api -n 200
  2. Verify the health endpoint: curl https://api.tessellate.energy/health
  3. Check if the database is reachable: railway logs --service api | grep "database"
  4. Restart the service: railway service restart api

Celery tasks stuck in "queued" status

Cause: Workers are not consuming tasks from the queue.

Resolution:

  1. Check worker logs: railway logs --service worker -n 200
  2. Verify Redis connectivity: railway logs --service worker | grep "redis"
  3. Check active workers: celery -A celery_config inspect active
  4. Check queue lengths: celery -A celery_config inspect reserved
  5. Restart workers: railway service restart worker

Database connection pool exhausted

Cause: Too many concurrent database connections; the pool and overflow limits are exceeded.

Resolution:

  1. Check pool stats via the /health/detailed endpoint.
  2. Look for long-running queries:

     SELECT pid, now() - pg_stat_activity.query_start AS duration, query
     FROM pg_stat_activity
     WHERE state != 'idle'
     ORDER BY duration DESC;

  3. Kill long-running queries if necessary (pg_stat_activity has no
     duration column, so compute it from query_start):

     SELECT pg_terminate_backend(pid)
     FROM pg_stat_activity
     WHERE now() - query_start > interval '5 minutes' AND state != 'idle';

  4. Increase DATABASE_POOL_SIZE and DATABASE_MAX_OVERFLOW if needed.

Redis out of memory

Cause: Cache or Celery result backend is consuming too much memory.

Resolution:

  1. Check Redis memory: redis-cli INFO memory
  2. Check key counts by prefix:

     redis-cli --scan --pattern "cache:*" | wc -l
     redis-cli --scan --pattern "celery-task-meta-*" | wc -l

  3. Clear expired Celery results:

     redis-cli --scan --pattern "celery-task-meta-*" | xargs redis-cli DEL

  4. Reduce REDIS_CACHE_TTL to expire cached items sooner.
  5. Upgrade Redis instance memory on Railway.

Alembic migration fails on deploy

Cause: Migration conflicts or incompatible schema changes.

Resolution:

  1. Check the migration error in the deploy logs.
  2. Connect to the database and check the current version:

     alembic current

  3. If a migration was partially applied, check the alembic_version table:

     SELECT * FROM alembic_version;

  4. Fix the migration and re-deploy, or manually set the version:

     UPDATE alembic_version SET version_num = 'target_revision';

Solar optimization job fails with OOM

Cause: The optimization is consuming too much memory (large search space or many iterations).

Resolution:

  1. Check worker logs for the specific task ID.
  2. Reduce optimization parameters:
     • Lower population_size for NSGA-II.
     • Reduce n_iterations for Bayesian optimization.
     • Use grid_search with coarser step sizes.
  3. Increase worker memory allocation on Railway.
  4. Set CELERY_WORKER_MAX_TASKS_PER_CHILD=50 to recycle workers more frequently.

Webhook delivery failures

Cause: The target URL is unreachable or returning errors.

Resolution:

  1. Check webhook delivery logs in the admin dashboard.
  2. Verify the callback URL is publicly accessible.
  3. Check for SSL certificate issues on the target.
  4. Webhooks retry 3 times with exponential backoff (10s, 60s, 300s).
  5. After 3 failures, the webhook is marked as failed. Re-trigger via the API:

curl -X POST "https://api.tessellate.energy/v1/webhooks/{webhook_id}/retry" \
  -H "Authorization: Bearer $API_KEY"
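
The retry schedule above (10s, 60s, 300s, then marked failed) can be expressed directly; a small sketch for reasoning about delivery timing (function names are illustrative, not the platform's code):

```python
# Backoff schedule from the runbook: 10s, 60s, 300s, then mark failed.
BACKOFF_SECONDS = [10, 60, 300]

def next_retry_delay(attempt):
    """Delay before retry number `attempt` (1-based).
    Returns None once the three retries are exhausted and the
    delivery should be marked failed."""
    if 1 <= attempt <= len(BACKOFF_SECONDS):
        return BACKOFF_SECONDS[attempt - 1]
    return None

def delivery_schedule():
    """Offsets (seconds after the first failure) of each retry."""
    total, offsets = 0, []
    for delay in BACKOFF_SECONDS:
        total += delay
        offsets.append(total)
    return offsets  # retries land 10s, 70s, and 370s after the first failure
```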

Rate limit incorrectly applied

Cause: Rate limit counter in Redis is stale or misconfigured.

Resolution:

  1. Check the current rate limit counter for an API key:

     redis-cli GET "rate_limit:{api_key_hash}:window"

  2. Clear the rate limit counter:

     redis-cli DEL "rate_limit:{api_key_hash}:window"

  3. Verify the rate limit configuration for the organization's plan.
  4. Check the RATE_LIMIT_DEFAULT environment variable.