Add OpenTelemetry instrumentation with distributed tracing and metrics: - Structured JSON logging with trace context correlation - Auto-instrumentation for FastAPI, asyncpg, httpx, redis - OTLP exporter for traces and Prometheus metrics endpoint Implement Celery worker and notification task system: - Celery app with Redis/SQS broker support and configurable queues - Notification tasks for incident fan-out, webhooks, and escalations - Pluggable TaskQueue abstraction with in-memory driver for testing Add Grafana observability stack (Loki, Tempo, Prometheus, Grafana): - OpenTelemetry Collector for receiving OTLP traces and logs - Tempo for distributed tracing backend - Loki for log aggregation with Promtail DaemonSet - Prometheus for metrics scraping with RBAC configuration - Grafana with pre-provisioned datasources and API overview dashboard - Helm templates for all observability components Enhance application infrastructure: - Global exception handlers with structured ErrorResponse schema - Request logging middleware with timing metrics - Health check updated to verify task queue connectivity - Non-root user in Dockerfile for security - Init containers in Helm deployments for dependency ordering - Production Helm values with autoscaling and retention policies
24 lines
563 B
YAML
24 lines
563 B
YAML
global:
|
|
scrape_interval: 15s
|
|
evaluation_interval: 15s
|
|
|
|
scrape_configs:
|
|
# Scrape Prometheus itself
|
|
- job_name: "prometheus"
|
|
static_configs:
|
|
- targets: ["localhost:9090"]
|
|
|
|
# Scrape IncidentOps API metrics
|
|
- job_name: "incidentops-api"
|
|
static_configs:
|
|
- targets: ["api:9464"]
|
|
metrics_path: /metrics
|
|
scrape_interval: 10s
|
|
|
|
# Scrape IncidentOps Worker metrics (when metrics are enabled)
|
|
- job_name: "incidentops-worker"
|
|
static_configs:
|
|
- targets: ["worker:9464"]
|
|
metrics_path: /metrics
|
|
scrape_interval: 10s
|