Add OpenTelemetry instrumentation with distributed tracing and metrics:
- Structured JSON logging with trace context correlation
- Auto-instrumentation for FastAPI, asyncpg, httpx, redis
- OTLP exporter for traces and Prometheus metrics endpoint

Implement Celery worker and notification task system:
- Celery app with Redis/SQS broker support and configurable queues
- Notification tasks for incident fan-out, webhooks, and escalations
- Pluggable TaskQueue abstraction with in-memory driver for testing

Add Grafana observability stack (Loki, Tempo, Prometheus, Grafana):
- OpenTelemetry Collector for receiving OTLP traces and logs
- Tempo for distributed tracing backend
- Loki for log aggregation with Promtail DaemonSet
- Prometheus for metrics scraping with RBAC configuration
- Grafana with pre-provisioned datasources and API overview dashboard
- Helm templates for all observability components

Enhance application infrastructure:
- Global exception handlers with structured ErrorResponse schema
- Request logging middleware with timing metrics
- Health check updated to verify task queue connectivity
- Non-root user in Dockerfile for security
- Init containers in Helm deployments for dependency ordering
- Production Helm values with autoscaling and retention policies
33 lines
489 B
YAML
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  trace_idle_period: 10s
  max_block_bytes: 1048576
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 168h # 7 days

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

querier:
  search:
    query_timeout: 30s