feat: add observability stack and background task infrastructure

Add OpenTelemetry instrumentation with distributed tracing and metrics:
- Structured JSON logging with trace context correlation
- Auto-instrumentation for FastAPI, asyncpg, httpx, redis
- OTLP exporter for traces and Prometheus metrics endpoint
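
As a rough illustration of "structured JSON logging with trace context correlation", the sketch below uses only the standard library. The `current_trace_id` context variable is a hypothetical stand-in for the active OpenTelemetry span context, and the `trace_id` field name and `JsonTraceFormatter` class are assumptions, not names taken from this commit.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-in for the trace ID of the active OpenTelemetry span.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object, adding the trace ID when set."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        trace_id = current_trace_id.get()
        if trace_id is not None:
            payload["trace_id"] = trace_id
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    """Build an isolated logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter())
    logger = logging.getLogger("sketch.json_logging")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the sketch's output out of the root logger
    return logger
```

With this shape, any log line emitted while a trace ID is set can be joined to its trace in the tracing backend by that one field.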

Implement Celery worker and notification task system:
- Celery app with Redis/SQS broker support and configurable queues
- Notification tasks for incident fan-out, webhooks, and escalations
- Pluggable TaskQueue abstraction with in-memory driver for testing
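
A minimal sketch of what a pluggable queue abstraction with an in-memory test driver could look like. The method names (`enqueue`, `ping`), the `InMemoryTaskQueue` class, and the `incident_fanout` task name are illustrative assumptions; the actual interface in this commit may differ.

```python
import asyncio
from typing import Any, Dict, List, Protocol, Tuple


class TaskQueue(Protocol):
    """Interface a queue driver is assumed to satisfy in this sketch."""

    async def enqueue(self, task_name: str, payload: Dict[str, Any]) -> None: ...

    async def ping(self) -> bool: ...


class InMemoryTaskQueue:
    """Test driver: records tasks in a list instead of sending them to a broker."""

    def __init__(self) -> None:
        self.tasks: List[Tuple[str, Dict[str, Any]]] = []

    async def enqueue(self, task_name: str, payload: Dict[str, Any]) -> None:
        self.tasks.append((task_name, payload))

    async def ping(self) -> bool:
        return True  # there is no external broker to reach


async def notify_incident(queue: TaskQueue, incident_id: str) -> None:
    """The caller depends only on the TaskQueue interface, not on Celery."""
    await queue.enqueue("incident_fanout", {"incident_id": incident_id})
```

In a test, `asyncio.run(notify_incident(InMemoryTaskQueue(), "inc-1"))` exercises the calling code without a running broker, and a `ping`-style method is one way a health check could verify queue connectivity.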

Add Grafana observability stack (Loki, Tempo, Prometheus, Grafana):
- OpenTelemetry Collector for receiving OTLP traces and logs
- Tempo for distributed tracing backend
- Loki for log aggregation with Promtail DaemonSet
- Prometheus for metrics scraping with RBAC configuration
- Grafana with pre-provisioned datasources and API overview dashboard
- Helm templates for all observability components

Enhance application infrastructure:
- Global exception handlers with structured ErrorResponse schema
- Request logging middleware with timing metrics
- Health check updated to verify task queue connectivity
- Non-root user in Dockerfile for security
- Init containers in Helm deployments for dependency ordering
- Production Helm values with autoscaling and retention policies
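
To make the "global exception handlers" item concrete, here is a framework-free sketch of mapping exceptions to a body with the `ErrorResponse` field names (`error`, `message`, `details`, `request_id`). The `NotFoundError` class, the `to_error_payload` helper, and the `internal_error` code are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Optional, Tuple


class NotFoundError(Exception):
    """Hypothetical domain exception used only in this sketch."""


def to_error_payload(
    exc: Exception, request_id: Optional[str] = None
) -> Tuple[int, Dict[str, Any]]:
    """Map an exception to (HTTP status, body) using the ErrorResponse fields."""
    if isinstance(exc, NotFoundError):
        status, error, message = 404, "not_found", str(exc)
    else:
        # Unexpected errors get a generic message so internals are not leaked.
        status, error, message = 500, "internal_error", "Internal server error"
    body = {
        "error": error,
        "message": message,
        "details": None,
        "request_id": request_id,
    }
    return status, body
```

Calling `to_error_payload(NotFoundError("Incident not found"), "abc123def456")` yields a 404 and a body matching the first schema example in the diff, with `details` left as `None`.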
2026-01-07 20:51:13 -05:00
parent f427d191e0
commit 46ede7757d
45 changed files with 3742 additions and 76 deletions


@@ -8,7 +8,7 @@ from app.schemas.auth import (
     SwitchOrgRequest,
     TokenResponse,
 )
-from app.schemas.common import CursorParams, PaginatedResponse
+from app.schemas.common import CursorParams, ErrorDetail, ErrorResponse, PaginatedResponse
 from app.schemas.incident import (
     CommentRequest,
     IncidentCreate,
@@ -35,6 +35,8 @@ __all__ = [
     "TokenResponse",
     # Common
     "CursorParams",
+    "ErrorDetail",
+    "ErrorResponse",
     "PaginatedResponse",
     # Incident
     "CommentRequest",


@@ -3,6 +3,47 @@
 from pydantic import BaseModel, Field
 
 
+class ErrorDetail(BaseModel):
+    """Individual error detail for validation errors."""
+
+    loc: list[str | int] = Field(description="Location of the error (field path)")
+    msg: str = Field(description="Error message")
+    type: str = Field(description="Error type identifier")
+
+
+class ErrorResponse(BaseModel):
+    """Structured error response returned by all error handlers."""
+
+    error: str = Field(description="Error type (e.g., 'not_found', 'validation_error')")
+    message: str = Field(description="Human-readable error message")
+    details: list[ErrorDetail] | None = Field(
+        default=None, description="Additional error details for validation errors"
+    )
+    request_id: str | None = Field(
+        default=None, description="Request trace ID for debugging"
+    )
+
+    model_config = {
+        "json_schema_extra": {
+            "examples": [
+                {
+                    "error": "not_found",
+                    "message": "Incident not found",
+                    "request_id": "abc123def456",
+                },
+                {
+                    "error": "validation_error",
+                    "message": "Request validation failed",
+                    "details": [
+                        {"loc": ["body", "title"], "msg": "Field required", "type": "missing"}
+                    ],
+                    "request_id": "abc123def456",
+                },
+            ]
+        }
+    }
+
+
 class CursorParams(BaseModel):
     """Pagination parameters using cursor-based pagination."""