feat: add observability stack and background task infrastructure

Add OpenTelemetry instrumentation with distributed tracing and metrics:
- Structured JSON logging with trace context correlation
- Auto-instrumentation for FastAPI, asyncpg, httpx, redis
- OTLP exporter for traces and Prometheus metrics endpoint
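The trace-context correlation in the structured logs can be sketched with a stdlib-only JSON formatter. This is a minimal illustration, not the repo's actual logging code; the `trace_id`/`span_id` field names and the idea that middleware attaches them via `extra=` are assumptions:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, carrying trace context when present."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical: tracing middleware attaches these via `extra=`;
            # absent fields are simply dropped from the output line.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps({k: v for k, v in payload.items() if v is not None})
```

A log backend like Loki can then filter on `trace_id` to jump from a log line to the matching trace in Tempo.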

Implement Celery worker and notification task system:
- Celery app with Redis/SQS broker support and configurable queues
- Notification tasks for incident fan-out, webhooks, and escalations
- Pluggable TaskQueue abstraction with in-memory driver for testing
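The pluggable TaskQueue abstraction might look like the following sketch: a `Protocol` the broker-backed drivers satisfy, plus an in-memory driver that records tasks for assertions in tests. Method names (`enqueue`, `healthy`) are assumptions, not the repo's actual interface:

```python
from __future__ import annotations

from typing import Any, Protocol


class TaskQueue(Protocol):
    """Assumed minimal interface a Redis/SQS-backed driver must provide."""

    def enqueue(self, task_name: str, payload: dict[str, Any]) -> None: ...
    def healthy(self) -> bool: ...


class InMemoryTaskQueue:
    """Test driver: records tasks instead of sending them to a broker."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, dict[str, Any]]] = []

    def enqueue(self, task_name: str, payload: dict[str, Any]) -> None:
        self.sent.append((task_name, payload))

    def healthy(self) -> bool:
        return True
```

Tests can then inject `InMemoryTaskQueue` and assert on `sent` without a running broker.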

Add Grafana observability stack (Loki, Tempo, Prometheus, Grafana):
- OpenTelemetry Collector for receiving OTLP traces and logs
- Tempo for distributed tracing backend
- Loki for log aggregation with Promtail DaemonSet
- Prometheus for metrics scraping with RBAC configuration
- Grafana with pre-provisioned datasources and API overview dashboard
- Helm templates for all observability components
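Pre-provisioned Grafana datasources typically take the shape below (standard Grafana provisioning format; the in-cluster service URLs are assumptions, not the values from this commit):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
    access: proxy
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    access: proxy
```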

Enhance application infrastructure:
- Global exception handlers with structured ErrorResponse schema
- Request logging middleware with timing metrics
- Health check updated to verify task queue connectivity
- Non-root user in Dockerfile for security
- Init containers in Helm deployments for dependency ordering
- Production Helm values with autoscaling and retention policies
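The request-logging middleware can be sketched as a pure-ASGI wrapper (a hedged stand-in for the repo's actual middleware; the class name and log format are assumptions):

```python
import logging
import time

logger = logging.getLogger("api.requests")


class RequestTimingMiddleware:
    """Logs method, path, response status, and elapsed time for each HTTP request."""

    def __init__(self, app) -> None:
        self.app = app

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            # Pass through lifespan/websocket events untouched.
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        status = {}

        async def send_wrapper(message) -> None:
            # Capture the status code from the response-start event.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        await self.app(scope, receive, send_wrapper)
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "%s %s -> %s in %.1fms",
            scope["method"], scope["path"], status.get("code"), elapsed_ms,
        )
```

Because it speaks plain ASGI, the same wrapper works with FastAPI, Starlette, or any ASGI server.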
2026-01-07 20:51:13 -05:00
parent f427d191e0
commit 46ede7757d
45 changed files with 3742 additions and 76 deletions


@@ -0,0 +1,96 @@
"""Tests for worker notification helpers."""
from __future__ import annotations
from uuid import UUID, uuid4
import asyncpg
import pytest
from app.repositories.incident import IncidentRepository
from worker.tasks.notifications import NotificationDispatch, prepare_notification_dispatches
pytestmark = pytest.mark.asyncio
async def _seed_incident(conn: asyncpg.Connection) -> tuple[UUID, UUID, UUID]:
org_id = uuid4()
service_id = uuid4()
incident_id = uuid4()
await conn.execute(
"INSERT INTO orgs (id, name, slug) VALUES ($1, $2, $3)",
org_id,
"Notif Org",
"notif-org",
)
await conn.execute(
"INSERT INTO services (id, org_id, name, slug) VALUES ($1, $2, $3, $4)",
service_id,
org_id,
"API",
"api",
)
repo = IncidentRepository(conn)
await repo.create(
incident_id=incident_id,
org_id=org_id,
service_id=service_id,
title="Outage",
description="",
severity="high",
)
return org_id, service_id, incident_id
async def test_prepare_notification_dispatches_creates_attempts(db_conn: asyncpg.Connection) -> None:
org_id, _service_id, incident_id = await _seed_incident(db_conn)
target_id = uuid4()
await db_conn.execute(
"""
INSERT INTO notification_targets (id, org_id, name, target_type, enabled)
VALUES ($1, $2, $3, $4, $5)
""",
target_id,
org_id,
"Primary Webhook",
"webhook",
True,
)
dispatches = await prepare_notification_dispatches(db_conn, incident_id=incident_id, org_id=org_id)
assert len(dispatches) == 1
dispatch = dispatches[0]
assert isinstance(dispatch, NotificationDispatch)
assert dispatch.target["name"] == "Primary Webhook"
attempt = await db_conn.fetchrow(
"SELECT status FROM notification_attempts WHERE id = $1",
dispatch.attempt_id,
)
assert attempt is not None and attempt["status"] == "pending"
async def test_prepare_notification_dispatches_skips_disabled_targets(db_conn: asyncpg.Connection) -> None:
org_id, _service_id, incident_id = await _seed_incident(db_conn)
await db_conn.execute(
"""
INSERT INTO notification_targets (id, org_id, name, target_type, enabled)
VALUES ($1, $2, $3, $4, $5)
""",
uuid4(),
org_id,
"Disabled",
"email",
False,
)
dispatches = await prepare_notification_dispatches(db_conn, incident_id=incident_id, org_id=org_id)
assert dispatches == []