minhtrannhat 46ede7757d feat: add observability stack and background task infrastructure
Add OpenTelemetry instrumentation with distributed tracing and metrics:
- Structured JSON logging with trace context correlation
- Auto-instrumentation for FastAPI, asyncpg, httpx, redis
- OTLP exporter for traces and Prometheus metrics endpoint

Implement Celery worker and notification task system:
- Celery app with Redis/SQS broker support and configurable queues
- Notification tasks for incident fan-out, webhooks, and escalations
- Pluggable TaskQueue abstraction with in-memory driver for testing

Add Grafana observability stack (Loki, Tempo, Prometheus, Grafana):
- OpenTelemetry Collector for receiving OTLP traces and logs
- Tempo for distributed tracing backend
- Loki for log aggregation with Promtail DaemonSet
- Prometheus for metrics scraping with RBAC configuration
- Grafana with pre-provisioned datasources and API overview dashboard
- Helm templates for all observability components

Enhance application infrastructure:
- Global exception handlers with structured ErrorResponse schema
- Request logging middleware with timing metrics
- Health check updated to verify task queue connectivity
- Non-root user in Dockerfile for security
- Init containers in Helm deployments for dependency ordering
- Production Helm values with autoscaling and retention policies
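The commit above wires OpenTelemetry auto-instrumentation into FastAPI with an OTLP exporter. A minimal sketch of how that wiring typically looks, assuming the standard opentelemetry-* packages; the function, endpoint, and service name here are illustrative, not the repository's actual code:

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def configure_tracing(app: FastAPI, otlp_endpoint: str) -> None:
    # Tag every span with a service name so the tracing backend can group traces
    provider = TracerProvider(resource=Resource.create({"service.name": "incidentops-api"}))
    # Ship spans to the OpenTelemetry Collector over OTLP/gRPC in batches
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))
    trace.set_tracer_provider(provider)
    # Auto-instrument FastAPI routes so every request produces a span
    FastAPIInstrumentor.instrument_app(app)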

IncidentOps

A full-stack on-call & incident management platform

Environment Configuration

  • DATABASE_URL: Postgres connection string (no default)
  • REDIS_URL: Legacy Redis endpoint, also used if no broker override is supplied (default: redis://localhost:6379/0)
  • TASK_QUEUE_DRIVER: Task queue implementation, celery or inmemory (default: celery)
  • TASK_QUEUE_BROKER_URL: Celery broker URL; falls back to REDIS_URL when unset (default: None)
  • TASK_QUEUE_BACKEND: Celery transport semantics, redis or sqs (default: redis)
  • TASK_QUEUE_DEFAULT_QUEUE: Queue used for fan-out and notification deliveries (default: default)
  • TASK_QUEUE_CRITICAL_QUEUE: Queue used for escalation and delayed work (default: critical)
  • TASK_QUEUE_VISIBILITY_TIMEOUT: Visibility timeout passed to the SQS transport (default: 600)
  • TASK_QUEUE_POLLING_INTERVAL: Polling interval for the SQS transport, in seconds (default: 1.0)
  • NOTIFICATION_ESCALATION_DELAY_SECONDS: Delay before re-checking unacknowledged incidents (default: 900)
  • AWS_REGION: Region used when TASK_QUEUE_BACKEND=sqs (default: None)
  • JWT_SECRET_KEY: Symmetric JWT signing key (no default)
  • JWT_ALGORITHM: JWT algorithm (default: HS256)
  • JWT_ISSUER: JWT issuer claim (default: incidentops)
  • JWT_AUDIENCE: JWT audience claim (default: incidentops-api)
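
One plausible way to surface these variables in application code is a pydantic-settings model. The sketch below mirrors the names and defaults listed above; the Settings class, module layout, and the use of pydantic-settings itself are assumptions for illustration, not necessarily the repository's actual settings module:

from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Field names map case-insensitively to the environment variables above
    database_url: str                                    # DATABASE_URL, no default
    redis_url: str = "redis://localhost:6379/0"          # REDIS_URL
    task_queue_driver: str = "celery"                    # celery or inmemory
    task_queue_broker_url: str | None = None             # falls back to redis_url when unset
    task_queue_backend: str = "redis"                    # redis or sqs
    task_queue_default_queue: str = "default"
    task_queue_critical_queue: str = "critical"
    task_queue_visibility_timeout: int = 600
    task_queue_polling_interval: float = 1.0
    notification_escalation_delay_seconds: int = 900
    aws_region: str | None = None
    jwt_secret_key: str                                  # JWT_SECRET_KEY, no default
    jwt_algorithm: str = "HS256"
    jwt_issuer: str = "incidentops"
    jwt_audience: str = "incidentops-api"

    @property
    def broker_url(self) -> str:
        # Celery broker falls back to the legacy Redis endpoint
        return self.task_queue_broker_url or self.redis_url

settings = Settings()  # reads values from the environment at startup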

Task Queue Modes

  • Development / Tests: Set TASK_QUEUE_DRIVER=inmemory to bypass Celery entirely (the default for local pytest). The API enqueues events into an in-memory recorder while the worker code remains importable.
  • Celery + Redis: Set TASK_QUEUE_DRIVER=celery and either leave TASK_QUEUE_BROKER_URL unset (relying on REDIS_URL) or point it at another Redis endpoint. This is the default production-style configuration.
  • Celery + Amazon SQS: Provide TASK_QUEUE_BROKER_URL=sqs:// (Celery discovers AWS credentials automatically), set TASK_QUEUE_BACKEND=sqs, and configure AWS_REGION. Optional tuning is available via the visibility timeout and polling interval variables above; a configuration sketch follows this list.
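
A minimal sketch of the Celery wiring these modes imply, using Celery's documented Redis and SQS transport options; the module layout and environment handling are illustrative, not the project's exact implementation:

import os
from celery import Celery

# Broker: explicit override, otherwise the legacy Redis endpoint
broker_url = os.getenv("TASK_QUEUE_BROKER_URL") or os.getenv("REDIS_URL", "redis://localhost:6379/0")
celery_app = Celery("incidentops", broker=broker_url)

celery_app.conf.task_default_queue = os.getenv("TASK_QUEUE_DEFAULT_QUEUE", "default")

if os.getenv("TASK_QUEUE_BACKEND", "redis") == "sqs":
    # kombu's SQS transport reads these options; AWS credentials come from the environment
    celery_app.conf.broker_transport_options = {
        "region": os.getenv("AWS_REGION", "us-east-1"),
        "visibility_timeout": int(os.getenv("TASK_QUEUE_VISIBILITY_TIMEOUT", "600")),
        "polling_interval": float(os.getenv("TASK_QUEUE_POLLING_INTERVAL", "1.0")),
    }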

Running the Worker

The worker automatically discovers tasks under worker/tasks. Use the same environment variables as the API:

uv run celery -A worker.celery_app worker --loglevel=info
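
The tasks the worker picks up are ordinary Celery tasks. Below is a hypothetical notification task and the way the API could enqueue it by name; the task name, arguments, and deliver_webhook helper are illustrative only:

from celery import shared_task

@shared_task(name="notifications.send_incident_webhook", bind=True, max_retries=3)
def send_incident_webhook(self, incident_id: str, target_url: str) -> None:
    # Deliver the incident payload; retry with a delay on transient failures
    try:
        deliver_webhook(incident_id, target_url)  # hypothetical helper
    except Exception as exc:
        raise self.retry(exc=exc, countdown=30)

# From the API process, the task can be enqueued by name onto the default queue:
# celery_app.send_task("notifications.send_incident_webhook",
#                      args=[incident_id, target_url], queue="default")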

Setup

Docker Compose

docker compose up --build -d

K8S with Skaffold and Helm

# Create a local cluster
kind create cluster --name incidentops

# Install with infrastructure only (for testing)
helm install incidentops helm/incidentops -n incidentops --create-namespace \
  --set migration.enabled=false \
  --set api.replicaCount=0 \
  --set worker.replicaCount=0 \
  --set web.replicaCount=0

# Full install (requires building app images first)
helm install incidentops helm/incidentops -n incidentops --create-namespace

# Development loop (rebuilds and redeploys on change)
skaffold dev

# One-time deployment
skaffold run

# Production deployment
skaffold run -p production