Deployment Runbook

This guide covers production deployment, environment configuration, database management, monitoring, and troubleshooting for the Tessellate Renewables platform.


1. Railway Deployment (Production)

Railway is the primary deployment platform. The project contains multiple services that deploy independently from the same repository.

Service Configuration

Service Source Dockerfile Port Start Command
api main branch Dockerfile 8000 uvicorn app:app --host 0.0.0.0 --port 8000 --workers 4
worker main branch Dockerfile.worker - celery -A celery_config worker -l info -Q default,solar,wind,reports -c 4
redis Railway Plugin - 6379 Managed
postgres Railway Plugin - 5432 Managed
streamlit main branch Dockerfile.streamlit 8501 streamlit run frontend/app.py --server.port 8501
ml-engine main branch Dockerfile.ml-engine - python -m modal_gpu

Deploying Updates

Railway auto-deploys from the main branch. For manual deployments:

# Install Railway CLI
npm install -g @railway/cli

# Login and link project
railway login
railway link

# Deploy a specific service
railway up --service api
railway up --service worker

# View deployment logs
railway logs --service api
railway logs --service worker

# Open Railway dashboard
railway open

Rolling Deployments

Railway performs zero-downtime rolling deployments by default:

  1. New container is built and started.
  2. Health check is verified (/health for the API service).
  3. Traffic is shifted to the new container.
  4. Old container is drained and stopped.

If the health check fails, the deployment is rolled back automatically.

Railway Configuration Files

  • railway.worker.toml -- Worker service overrides
  • railway.streamlit.toml -- Streamlit service overrides
  • railway.ml-engine.toml -- ML engine service overrides
  • Procfile -- Process definitions (alternative to Dockerfiles)

2. Environment Variables Reference

This section lists all environment variables used across services. Variables are set in Railway's service settings or via the Railway CLI.

Core Application

Variable Required Default Description
APP_ENV Yes - Environment: production, staging, development
SECRET_KEY Yes - Application secret for JWT signing (min 32 chars)
API_BASE_URL Yes - Public API URL (e.g., https://api.tessellate.energy)
ALLOWED_ORIGINS No * CORS allowed origins (comma-separated)
LOG_LEVEL No INFO Logging level: DEBUG, INFO, WARNING, ERROR
WORKERS No 4 Uvicorn worker count
DEBUG No false Enable debug mode (never in production)

Database

Variable Required Default Description
DATABASE_URL Yes - PostgreSQL connection string
DATABASE_POOL_SIZE No 10 SQLAlchemy connection pool size
DATABASE_MAX_OVERFLOW No 20 Max overflow connections
DATABASE_POOL_TIMEOUT No 30 Pool checkout timeout (seconds)
DATABASE_ECHO No false Echo SQL statements to logs

Redis

Variable Required Default Description
REDIS_URL Yes - Redis connection string
REDIS_MAX_CONNECTIONS No 50 Max connection pool size
REDIS_CACHE_TTL No 3600 Default cache TTL in seconds
REDIS_SSL No false Enable TLS for Redis connection

Celery

Variable Required Default Description
CELERY_BROKER_URL Yes - Redis URL for task broker (usually same as REDIS_URL)
CELERY_RESULT_BACKEND Yes - Redis URL for task results
CELERY_WORKER_CONCURRENCY No 4 Worker process concurrency
CELERY_TASK_SOFT_TIME_LIMIT No 300 Soft time limit per task (seconds)
CELERY_TASK_TIME_LIMIT No 600 Hard time limit per task (seconds)
CELERY_WORKER_MAX_TASKS_PER_CHILD No 100 Recycle worker after N tasks

Authentication

Variable Required Default Description
JWT_SECRET_KEY Yes - JWT signing key (can be same as SECRET_KEY)
JWT_ALGORITHM No HS256 JWT algorithm
JWT_ACCESS_TOKEN_EXPIRE_MINUTES No 30 Access token lifetime
JWT_REFRESH_TOKEN_EXPIRE_DAYS No 7 Refresh token lifetime

External Services

Variable Required Default Description
SENDGRID_API_KEY Yes - SendGrid API key for transactional email
SENDGRID_FROM_EMAIL No [email protected] Sender email address
STRIPE_SECRET_KEY Yes - Stripe secret key
STRIPE_PUBLISHABLE_KEY Yes - Stripe publishable key
STRIPE_WEBHOOK_SECRET Yes - Stripe webhook signing secret
NREL_API_KEY Yes - NREL API key for solar/weather data
MODAL_TOKEN_ID No - Modal GPU compute token ID
MODAL_TOKEN_SECRET No - Modal GPU compute token secret

Rate Limiting and Metering

Variable Required Default Description
RATE_LIMIT_DEFAULT No 1000 Default requests per minute per key
RATE_LIMIT_BURST No 50 Burst allowance above rate limit
METERING_ENABLED No true Enable API usage metering
METERING_FLUSH_INTERVAL No 60 Metering buffer flush interval (seconds)
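
RATE_LIMIT_DEFAULT and RATE_LIMIT_BURST combine into a per-key, per-window request budget. As a rough in-memory sketch of the fixed-window counting this implies (production enforcement lives in Redis; the class here is illustrative, not the platform's code):

```python
import time

class FixedWindowLimiter:
    """Illustrative in-memory fixed-window limiter.
    Production uses Redis counters keyed per API key and window."""

    def __init__(self, limit=1000, burst=50, window_seconds=60):
        self.limit = limit + burst          # RATE_LIMIT_DEFAULT + RATE_LIMIT_BURST
        self.window = window_seconds
        self.counts = {}                    # (key, window_id) -> request count

    def allow(self, api_key, now=None):
        now = time.time() if now is None else now
        bucket = (api_key, int(now // self.window))
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= self.limit

limiter = FixedWindowLimiter(limit=2, burst=1, window_seconds=60)
results = [limiter.allow("key_a", now=0) for _ in range(4)]
# first three allowed (limit + burst = 3), fourth rejected
```

A new window starts every `window_seconds`, so a rejected key is admitted again once the clock crosses the next window boundary.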

Feature Flags

Variable Required Default Description
ENABLE_WIND_DIVISION No true Enable wind analysis endpoints
ENABLE_GPU_TASKS No false Enable GPU-accelerated tasks
ENABLE_WEBHOOKS No true Enable webhook delivery
ENABLE_SSE No true Enable server-sent events
ENABLE_AB_TESTING No false Enable A/B testing framework
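
The flags and numeric settings above follow a common pattern: unset means the documented default, and strings like "true"/"false" need parsing. A minimal sketch of such a loader (helper names are hypothetical, not the platform's actual config module):

```python
import os

TRUTHY = {"1", "true", "yes", "on"}

def env_bool(name, default):
    """Parse a boolean flag like ENABLE_WEBHOOKS; unset falls back to default."""
    raw = os.environ.get(name)
    return default if raw is None else raw.strip().lower() in TRUTHY

def env_int(name, default):
    """Parse an integer setting like WORKERS or CELERY_WORKER_CONCURRENCY."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)

# Defaults mirror the tables above.
ENABLE_WEBHOOKS = env_bool("ENABLE_WEBHOOKS", True)
WORKERS = env_int("WORKERS", 4)
```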

3. Docker Deployment

For self-hosted or non-Railway deployments, use Docker Compose.

Production Docker Compose

The file docker-compose.prod.yml defines the full production stack:

# Build and start all services
docker compose -f docker-compose.prod.yml up -d --build

# View logs
docker compose -f docker-compose.prod.yml logs -f api
docker compose -f docker-compose.prod.yml logs -f worker

# Scale workers
docker compose -f docker-compose.prod.yml up -d --scale worker=4

# Stop all services
docker compose -f docker-compose.prod.yml down

# Stop and remove volumes (WARNING: deletes data)
docker compose -f docker-compose.prod.yml down -v

Docker Build Notes

  • Multi-stage builds: The API Dockerfile uses multi-stage builds to minimize image size. The final image is based on python:3.11-slim.
  • Layer caching: Requirements are installed before copying application code to maximize Docker layer cache hits.
  • Non-root user: Containers run as a non-root appuser for security.
  • Health checks: Each container defines a Docker HEALTHCHECK instruction.

Environment File

Create a .env file for local Docker deployments (never commit this file):

# .env (DO NOT COMMIT)
APP_ENV=production
SECRET_KEY=your-secret-key-min-32-characters-long
DATABASE_URL=postgresql://user:password@postgres:5432/tessellate
REDIS_URL=redis://redis:6379/0
CELERY_BROKER_URL=redis://redis:6379/1
CELERY_RESULT_BACKEND=redis://redis:6379/1
SENDGRID_API_KEY=SG.xxxxx
STRIPE_SECRET_KEY=sk_live_xxxxx
STRIPE_PUBLISHABLE_KEY=pk_live_xxxxx
STRIPE_WEBHOOK_SECRET=whsec_xxxxx
NREL_API_KEY=xxxxx
JWT_SECRET_KEY=your-jwt-secret-key
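
A self-hosted stack fails more gracefully when missing required variables are caught at startup rather than mid-request. A hedged sketch of such a preflight check (the variable list mirrors the tables in section 2; the helper names are illustrative):

```python
import os

# Required variables, per the tables in section 2.
REQUIRED = [
    "APP_ENV", "SECRET_KEY", "API_BASE_URL",
    "DATABASE_URL", "REDIS_URL",
    "CELERY_BROKER_URL", "CELERY_RESULT_BACKEND",
    "JWT_SECRET_KEY", "SENDGRID_API_KEY",
    "STRIPE_SECRET_KEY", "STRIPE_PUBLISHABLE_KEY",
    "STRIPE_WEBHOOK_SECRET", "NREL_API_KEY",
]

def missing_vars(environ=os.environ):
    """Return required variables that are unset or empty."""
    return [name for name in REQUIRED if not environ.get(name)]

def validate(environ=os.environ):
    """Raise early with a readable message instead of failing later."""
    missing = missing_vars(environ)
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
```

A check like this could be called from the container entrypoint before the server starts.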

4. Database Migrations

The project uses Alembic for database schema migrations.

Common Commands

# Check current migration version
alembic current

# View migration history
alembic history --verbose

# Generate a new migration from model changes
alembic revision --autogenerate -m "Add wind_jobs table"

# Apply all pending migrations
alembic upgrade head

# Apply next migration only
alembic upgrade +1

# Rollback one migration
alembic downgrade -1

# Rollback to a specific revision
alembic downgrade abc123

# Rollback all migrations (WARNING: destructive)
alembic downgrade base

# Show the SQL for a migration without applying it
alembic upgrade head --sql

Migration Best Practices

  1. Always review auto-generated migrations. Alembic's autogenerate is not perfect; it may miss index changes or generate incorrect ALTER statements.

  2. Test migrations on a copy of production data before deploying:

    # Dump production schema (no data)
    pg_dump --schema-only $PROD_DATABASE_URL > schema.sql
    
    # Restore to test database
    psql $TEST_DATABASE_URL < schema.sql
    
    # Run migration against test
    DATABASE_URL=$TEST_DATABASE_URL alembic upgrade head
    

  3. Never edit a migration that has been deployed. Create a new migration to fix issues.

  4. Include both upgrade and downgrade functions in every migration.

  5. Use batch operations for large tables to avoid long locks:

    # In migration file
    def upgrade():
        # Add column as nullable first
        op.add_column('large_table', sa.Column('new_col', sa.String(), nullable=True))
        # Backfill in batches (separate script or task)
        # Then make non-nullable
        op.alter_column('large_table', 'new_col', nullable=False)
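
The backfill step above can be sketched as keyset-paginated batch updates, so each statement touches a bounded id range and holds its lock only briefly. This illustration uses sqlite3 as a stand-in for Postgres; in production the same pattern would run through the real driver, and the table/column names come from the example above:

```python
import sqlite3

def backfill_in_batches(conn, batch_size=1000):
    """Populate new_col in id-ordered batches (keyset pagination on the
    primary key), committing after each batch."""
    last_id = 0
    while True:
        # Find the upper id bound of the next batch.
        cur = conn.execute(
            "SELECT MAX(id) FROM (SELECT id FROM large_table "
            "WHERE id > ? ORDER BY id LIMIT ?) AS batch",
            (last_id, batch_size),
        )
        max_id = cur.fetchone()[0]
        if max_id is None:
            break  # no rows left
        conn.execute(
            "UPDATE large_table SET new_col = 'default' "
            "WHERE id > ? AND id <= ? AND new_col IS NULL",
            (last_id, max_id),
        )
        conn.commit()
        last_id = max_id
```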
    

Railway Migrations

Migrations run automatically on deploy via the API service entrypoint:

# docker-entrypoint.sh
#!/bin/bash
set -e
echo "Running database migrations..."
alembic upgrade head
echo "Starting application..."
exec uvicorn app:app --host 0.0.0.0 --port ${PORT:-8000} --workers ${WORKERS:-4}

5. Celery Worker Setup

Starting Workers

# Start a worker with all queues
celery -A celery_config worker \
  --loglevel=info \
  --queues=default,solar,wind,reports \
  --concurrency=4 \
  --max-tasks-per-child=100

# Start a worker for solar tasks only
celery -A celery_config worker \
  --loglevel=info \
  --queues=solar \
  --concurrency=2 \
  --hostname=solar-worker@%h

# Start a worker for GPU tasks (single concurrency)
celery -A celery_config worker \
  --loglevel=info \
  --queues=gpu \
  --concurrency=1 \
  --hostname=gpu-worker@%h

# Start Celery Beat (periodic task scheduler)
celery -A celery_config beat \
  --loglevel=info \
  --schedule=/tmp/celerybeat-schedule

Monitoring Workers

# List active workers
celery -A celery_config inspect active

# List registered tasks
celery -A celery_config inspect registered

# View worker statistics
celery -A celery_config inspect stats

# Purge all queued tasks (WARNING: destructive)
celery -A celery_config purge

# Monitor in real-time with Flower
celery -A celery_config flower --port=5555

Worker Configuration

Key settings in celery_config.py:

Setting Value Description
task_serializer json JSON serialization for tasks
result_serializer json JSON serialization for results
accept_content ["json"] Only accept JSON content
timezone UTC All timestamps in UTC
task_track_started True Track when tasks begin execution
task_acks_late True Acknowledge after completion (not receipt)
worker_prefetch_multiplier 1 Prefetch one task at a time
task_reject_on_worker_lost True Requeue if worker dies
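
The table maps to plain Celery settings. A sketch of the corresponding dictionary (how celery_config.py actually applies these to the app is assumed, e.g. via app.conf.update):

```python
# Mirrors the settings table above; assumed to be applied to the
# Celery app in celery_config.py via app.conf.update(**CELERY_SETTINGS).
CELERY_SETTINGS = {
    "task_serializer": "json",
    "result_serializer": "json",
    "accept_content": ["json"],           # only accept JSON content
    "timezone": "UTC",                    # all timestamps in UTC
    "task_track_started": True,           # track when tasks begin execution
    "task_acks_late": True,               # ack after completion, not receipt
    "worker_prefetch_multiplier": 1,      # prefetch one task at a time
    "task_reject_on_worker_lost": True,   # requeue if a worker dies
}
```

Together, task_acks_late and task_reject_on_worker_lost mean a task killed mid-run is redelivered, so tasks should be idempotent.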

6. Health Check Endpoints

API Health Checks

Endpoint Method Description Used By
/health GET Basic liveness check Railway, load balancer
/health/ready GET Readiness check (DB, Redis, Celery) Kubernetes, Railway
/health/detailed GET Full component status Monitoring, ops

GET /health

Returns 200 OK if the API process is running.

{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:00Z"
}

GET /health/ready

Returns 200 OK only if all critical dependencies are reachable.

{
  "status": "ready",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "celery": "ok"
  },
  "timestamp": "2025-01-15T10:30:00Z"
}

Returns 503 Service Unavailable if any check fails:

{
  "status": "not_ready",
  "checks": {
    "database": "ok",
    "redis": "error: connection refused",
    "celery": "ok"
  },
  "timestamp": "2025-01-15T10:30:00Z"
}
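
The readiness logic can be sketched as a probe aggregator: each dependency check either passes or records its error, and any failure yields a 503. The probe callables here are stand-ins for the real database, Redis, and Celery pings:

```python
from datetime import datetime, timezone

def readiness(checks):
    """Run each named probe; any failure flips the status to not_ready/503.

    `checks` maps a component name to a zero-argument callable that
    raises on failure (stand-ins for real DB/Redis/Celery pings).
    """
    results, healthy = {}, True
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
            healthy = False
    body = {
        "status": "ready" if healthy else "not_ready",
        "checks": results,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return (200 if healthy else 503), body
```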

GET /health/detailed

Returns full system health with performance metrics (requires admin auth).

{
  "status": "healthy",
  "version": "1.4.2",
  "uptime_seconds": 86400,
  "components": {
    "database": {
      "status": "ok",
      "latency_ms": 2.3,
      "pool_size": 10,
      "pool_checked_out": 3
    },
    "redis": {
      "status": "ok",
      "latency_ms": 0.8,
      "memory_used_mb": 45.2,
      "connected_clients": 12
    },
    "celery": {
      "status": "ok",
      "active_workers": 4,
      "queued_tasks": 7,
      "active_tasks": 3
    }
  },
  "timestamp": "2025-01-15T10:30:00Z"
}

7. Monitoring Setup

Application Metrics

The platform exposes Prometheus-compatible metrics at /metrics (when enabled):

Metric Type Description
http_requests_total Counter Total HTTP requests by method, path, status
http_request_duration_seconds Histogram Request latency distribution
celery_tasks_total Counter Total Celery tasks by name, status
celery_task_duration_seconds Histogram Task execution time
db_query_duration_seconds Histogram Database query latency
redis_operations_total Counter Redis operations by command
active_websocket_connections Gauge Current SSE connections
api_keys_active Gauge Number of active API keys

SLO Monitoring

The platform includes built-in SLO (Service Level Objective) tracking. See services/slo_tracker.py and the /slo/* endpoints for:

  • API availability (target: 99.9%)
  • P95 latency (target: < 500ms)
  • P99 latency (target: < 2000ms)
  • Error rate (target: < 1%)
  • Error budget tracking and burn rate alerts
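
The error-budget arithmetic behind the burn-rate alerts is simple: a 99.9% availability target allows 0.1% of requests to fail, and the remaining budget is whatever is left of that allowance in the window. A sketch (function name and request counts are illustrative):

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left in the window.

    For a 99.9% SLO the budget is 0.1% of requests; burning more
    failures than that clamps the remainder to zero.
    """
    budget = (1.0 - slo_target) * total_requests      # allowed failures
    if budget == 0:
        return 0.0
    return max(0.0, 1.0 - failed_requests / budget)

# 1,000,000 requests against a 99.9% target allow 1,000 failures;
# 800 observed failures leave ~20% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 800)
```

With the alert thresholds above, this example would already fire the "Error Budget Burn" warning (budget below 20% remaining).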

Log Aggregation

Structured JSON logging is used across all services:

{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "INFO",
  "service": "api",
  "request_id": "req_abc123",
  "user_id": "usr_xyz",
  "org_id": "org_456",
  "method": "POST",
  "path": "/v1/solar/optimize",
  "status": 202,
  "duration_ms": 45,
  "message": "Solar optimization job created"
}
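
A formatter in this shape can be built on stdlib logging. This is a hedged sketch, not the platform's logger: field names follow the example above, and the context fields (request_id, path, status, ...) are assumed to arrive via logging's extra= mechanism from request middleware:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, matching the shape above."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "api",
            "message": record.getMessage(),
        }
        # Context fields set by middleware via logger.info(..., extra={...})
        # show up as attributes on the record.
        for field in ("request_id", "user_id", "org_id",
                      "method", "path", "status", "duration_ms"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)
```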

View logs in Railway:

# API logs (last 100 lines)
railway logs --service api -n 100

# Worker logs (follow mode)
railway logs --service worker -f

# Filter by error level
railway logs --service api | grep '"level":"ERROR"'

Alerting

Configure alerts in your monitoring platform for:

Alert Condition Severity
API Down Health check fails for > 2 minutes Critical
High Error Rate 5xx rate > 5% over 5 minutes Critical
High Latency P95 > 2s over 5 minutes Warning
Worker Queue Backlog Queued tasks > 100 for > 10 minutes Warning
Database Connection Pool Pool utilization > 80% Warning
Redis Memory Memory usage > 80% Warning
Celery Worker Down Active workers < expected count Critical
Error Budget Burn SLO error budget < 20% remaining Warning
Error Budget Exhausted SLO error budget = 0% Critical

8. Troubleshooting

Common Issues

API returns 502 Bad Gateway

Cause: The API service is not responding to the load balancer health check.

Resolution:

  1. Check API logs: railway logs --service api -n 200
  2. Verify the health endpoint: curl https://api.tessellate.energy/health
  3. Check if the database is reachable: railway logs --service api | grep "database"
  4. Restart the service: railway service restart api

Celery tasks stuck in "queued" status

Cause: Workers are not consuming tasks from the queue.

Resolution:

  1. Check worker logs: railway logs --service worker -n 200
  2. Verify Redis connectivity: railway logs --service worker | grep "redis"
  3. Check active workers: celery -A celery_config inspect active
  4. Check queue lengths: celery -A celery_config inspect reserved
  5. Restart workers: railway service restart worker

Database connection pool exhausted

Cause: Too many concurrent database connections; the pool and overflow limits are exceeded.

Resolution:

  1. Check pool stats via the /health/detailed endpoint.
  2. Look for long-running queries:

     SELECT pid, now() - pg_stat_activity.query_start AS duration, query
     FROM pg_stat_activity
     WHERE state != 'idle'
     ORDER BY duration DESC;

  3. Kill long-running queries if necessary (pg_stat_activity has no
     duration column, so compute it from query_start):

     SELECT pg_terminate_backend(pid)
     FROM pg_stat_activity
     WHERE now() - query_start > interval '5 minutes' AND state != 'idle';

  4. Increase DATABASE_POOL_SIZE and DATABASE_MAX_OVERFLOW if needed.

Redis out of memory

Cause: Cache or Celery result backend is consuming too much memory.

Resolution:

  1. Check Redis memory: redis-cli INFO memory
  2. Check key counts by prefix:

     redis-cli --scan --pattern "cache:*" | wc -l
     redis-cli --scan --pattern "celery-task-meta-*" | wc -l

  3. Clear expired Celery results:

     redis-cli --scan --pattern "celery-task-meta-*" | xargs redis-cli DEL

  4. Reduce REDIS_CACHE_TTL to expire cached items sooner.
  5. Upgrade Redis instance memory on Railway.

Alembic migration fails on deploy

Cause: Migration conflicts or incompatible schema changes.

Resolution:

  1. Check the migration error in the deploy logs.
  2. Connect to the database and check the current version:

     alembic current

  3. If a migration was partially applied, check the alembic_version table:

     SELECT * FROM alembic_version;

  4. Fix the migration and re-deploy, or manually set the version:

     UPDATE alembic_version SET version_num = 'target_revision';

Solar optimization job fails with OOM

Cause: The optimization is consuming too much memory (large search space or many iterations).

Resolution:

  1. Check worker logs for the specific task ID.
  2. Reduce optimization parameters:
     • Lower population_size for NSGA-II.
     • Reduce n_iterations for Bayesian optimization.
     • Use grid_search with coarser step sizes.
  3. Increase worker memory allocation on Railway.
  4. Set CELERY_WORKER_MAX_TASKS_PER_CHILD=50 to recycle workers more frequently.

Webhook delivery failures

Cause: The target URL is unreachable or returning errors.

Resolution:

  1. Check webhook delivery logs in the admin dashboard.
  2. Verify the callback URL is publicly accessible.
  3. Check for SSL certificate issues on the target.
  4. Webhooks retry 3 times with exponential backoff (10s, 60s, 300s).
  5. After 3 failures, the webhook is marked as failed. Re-trigger via the API:

curl -X POST "https://api.tessellate.energy/v1/webhooks/{webhook_id}/retry" \
  -H "Authorization: Bearer $API_KEY"
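
The retry schedule above (10s, 60s, 300s, then marked failed) can be expressed directly; a small sketch for reasoning about delivery timing (function names are illustrative, not the platform's code):

```python
# Backoff schedule from the runbook: 10s, 60s, 300s, then mark failed.
BACKOFF_SECONDS = [10, 60, 300]

def next_retry_delay(attempt):
    """Delay before retry number `attempt` (1-based).
    Returns None once the three retries are exhausted and the
    delivery should be marked failed."""
    if 1 <= attempt <= len(BACKOFF_SECONDS):
        return BACKOFF_SECONDS[attempt - 1]
    return None

def delivery_schedule():
    """Offsets (seconds after the first failure) of each retry."""
    total, offsets = 0, []
    for delay in BACKOFF_SECONDS:
        total += delay
        offsets.append(total)
    return offsets  # retries land 10s, 70s, and 370s after the first failure
```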

Rate limit incorrectly applied

Cause: Rate limit counter in Redis is stale or misconfigured.

Resolution:

  1. Check the current rate limit counter for an API key:

     redis-cli GET "rate_limit:{api_key_hash}:window"

  2. Clear the rate limit counter:

     redis-cli DEL "rate_limit:{api_key_hash}:window"

  3. Verify the rate limit configuration for the organization's plan.
  4. Check the RATE_LIMIT_DEFAULT environment variable.