minhtrannhat 46ede7757d feat: add observability stack and background task infrastructure
Add OpenTelemetry instrumentation with distributed tracing and metrics:
- Structured JSON logging with trace context correlation
- Auto-instrumentation for FastAPI, asyncpg, httpx, redis
- OTLP exporter for traces and Prometheus metrics endpoint

Implement Celery worker and notification task system:
- Celery app with Redis/SQS broker support and configurable queues
- Notification tasks for incident fan-out, webhooks, and escalations
- Pluggable TaskQueue abstraction with in-memory driver for testing

Add Grafana observability stack (Loki, Tempo, Prometheus, Grafana):
- OpenTelemetry Collector for receiving OTLP traces and logs
- Tempo for distributed tracing backend
- Loki for log aggregation with Promtail DaemonSet
- Prometheus for metrics scraping with RBAC configuration
- Grafana with pre-provisioned datasources and API overview dashboard
- Helm templates for all observability components

Enhance application infrastructure:
- Global exception handlers with structured ErrorResponse schema
- Request logging middleware with timing metrics
- Health check updated to verify task queue connectivity
- Non-root user in Dockerfile for security
- Init containers in Helm deployments for dependency ordering
- Production Helm values with autoscaling and retention policies
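The commit above wires OpenTelemetry auto-instrumentation into FastAPI with an OTLP exporter. A minimal sketch of how that wiring typically looks, assuming the standard opentelemetry-* packages; the function, endpoint, and service name here are illustrative, not the repository's actual code:

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def configure_tracing(app: FastAPI, otlp_endpoint: str) -> None:
    # Tag every span with a service name so the tracing backend can group traces
    provider = TracerProvider(resource=Resource.create({"service.name": "incidentops-api"}))
    # Ship spans to the OpenTelemetry Collector over OTLP/gRPC in batches
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))
    trace.set_tracer_provider(provider)
    # Auto-instrument FastAPI routes so every request produces a span
    FastAPIInstrumentor.instrument_app(app)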

IncidentOps

A full-stack on-call & incident management platform

Environment Configuration

  • DATABASE_URL: Postgres connection string (no default)
  • REDIS_URL: Legacy Redis endpoint, also used if no broker override is supplied (default: redis://localhost:6379/0)
  • TASK_QUEUE_DRIVER: Task queue implementation, celery or inmemory (default: celery)
  • TASK_QUEUE_BROKER_URL: Celery broker URL; falls back to REDIS_URL when unset (default: None)
  • TASK_QUEUE_BACKEND: Celery transport semantics, redis or sqs (default: redis)
  • TASK_QUEUE_DEFAULT_QUEUE: Queue used for fan-out and notification deliveries (default: default)
  • TASK_QUEUE_CRITICAL_QUEUE: Queue used for escalation and delayed work (default: critical)
  • TASK_QUEUE_VISIBILITY_TIMEOUT: Visibility timeout passed to the SQS transport (default: 600)
  • TASK_QUEUE_POLLING_INTERVAL: Polling interval for the SQS transport, in seconds (default: 1.0)
  • NOTIFICATION_ESCALATION_DELAY_SECONDS: Delay before re-checking unacknowledged incidents (default: 900)
  • AWS_REGION: Region used when TASK_QUEUE_BACKEND=sqs (default: None)
  • JWT_SECRET_KEY: Symmetric JWT signing key (no default)
  • JWT_ALGORITHM: JWT algorithm (default: HS256)
  • JWT_ISSUER: JWT issuer claim (default: incidentops)
  • JWT_AUDIENCE: JWT audience claim (default: incidentops-api)
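
One plausible way to surface these variables in application code is a pydantic-settings model. The sketch below mirrors the names and defaults listed above; the Settings class, module layout, and the use of pydantic-settings itself are assumptions for illustration, not necessarily the repository's actual settings module:

from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Field names map case-insensitively to the environment variables above
    database_url: str                                    # DATABASE_URL, no default
    redis_url: str = "redis://localhost:6379/0"          # REDIS_URL
    task_queue_driver: str = "celery"                    # celery or inmemory
    task_queue_broker_url: str | None = None             # falls back to redis_url when unset
    task_queue_backend: str = "redis"                    # redis or sqs
    task_queue_default_queue: str = "default"
    task_queue_critical_queue: str = "critical"
    task_queue_visibility_timeout: int = 600
    task_queue_polling_interval: float = 1.0
    notification_escalation_delay_seconds: int = 900
    aws_region: str | None = None
    jwt_secret_key: str                                  # JWT_SECRET_KEY, no default
    jwt_algorithm: str = "HS256"
    jwt_issuer: str = "incidentops"
    jwt_audience: str = "incidentops-api"

    @property
    def broker_url(self) -> str:
        # Celery broker falls back to the legacy Redis endpoint
        return self.task_queue_broker_url or self.redis_url

settings = Settings()  # reads values from the environment at startup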

Task Queue Modes

  • Development / Tests: Set TASK_QUEUE_DRIVER=inmemory to bypass Celery entirely (the default for local pytest). The API enqueues events into an in-memory recorder while the worker code remains importable.
  • Celery + Redis: Set TASK_QUEUE_DRIVER=celery and either leave TASK_QUEUE_BROKER_URL unset (relying on REDIS_URL) or point it at another Redis endpoint. This is the default production-style configuration.
  • Celery + Amazon SQS: Provide TASK_QUEUE_BROKER_URL=sqs:// (Celery discovers AWS credentials automatically), set TASK_QUEUE_BACKEND=sqs, and configure AWS_REGION. Optional tuning is available via the visibility timeout and polling interval variables above; a configuration sketch follows this list.
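
A minimal sketch of the Celery wiring these modes imply, using Celery's documented Redis and SQS transport options; the module layout and environment handling are illustrative, not the project's exact implementation:

import os
from celery import Celery

# Broker: explicit override, otherwise the legacy Redis endpoint
broker_url = os.getenv("TASK_QUEUE_BROKER_URL") or os.getenv("REDIS_URL", "redis://localhost:6379/0")
celery_app = Celery("incidentops", broker=broker_url)

celery_app.conf.task_default_queue = os.getenv("TASK_QUEUE_DEFAULT_QUEUE", "default")

if os.getenv("TASK_QUEUE_BACKEND", "redis") == "sqs":
    # kombu's SQS transport reads these options; AWS credentials come from the environment
    celery_app.conf.broker_transport_options = {
        "region": os.getenv("AWS_REGION", "us-east-1"),
        "visibility_timeout": int(os.getenv("TASK_QUEUE_VISIBILITY_TIMEOUT", "600")),
        "polling_interval": float(os.getenv("TASK_QUEUE_POLLING_INTERVAL", "1.0")),
    }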

Running the Worker

The worker automatically discovers tasks under worker/tasks. Use the same environment variables as the API:

uv run celery -A worker.celery_app worker --loglevel=info
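
The tasks the worker picks up are ordinary Celery tasks. Below is a hypothetical notification task and the way the API could enqueue it by name; the task name, arguments, and deliver_webhook helper are illustrative only:

from celery import shared_task

@shared_task(name="notifications.send_incident_webhook", bind=True, max_retries=3)
def send_incident_webhook(self, incident_id: str, target_url: str) -> None:
    # Deliver the incident payload; retry with a delay on transient failures
    try:
        deliver_webhook(incident_id, target_url)  # hypothetical helper
    except Exception as exc:
        raise self.retry(exc=exc, countdown=30)

# From the API process, the task can be enqueued by name onto the default queue:
# celery_app.send_task("notifications.send_incident_webhook",
#                      args=[incident_id, target_url], queue="default")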

Setup

Docker Compose

docker compose up --build -d

K8S with Skaffold and Helm

# Create a local cluster
kind create cluster --name incidentops

# Install with infrastructure only (for testing)
helm install incidentops helm/incidentops -n incidentops --create-namespace \
  --set migration.enabled=false \
  --set api.replicaCount=0 \
  --set worker.replicaCount=0 \
  --set web.replicaCount=0

# Full install (requires building app images first)
helm install incidentops helm/incidentops -n incidentops --create-namespace

# Development loop (rebuilds and redeploys on change)
skaffold dev

# One-time deployment
skaffold run

# Production deployment
skaffold run -p production