# IncidentOps Specification

Multi-tenant incident management API. The org context is embedded in the JWT, so URLs carry no `orgId`.

## Architecture

| Service | Stack | Purpose |
|---------|-------|---------|
| **api** | FastAPI, asyncpg | REST API, JWT auth, RBAC |
| **worker** | Celery, Redis | Notifications, escalations |
| **web** | Next.js | Dashboard (future) |

**Infrastructure:** PostgreSQL, Redis, ingress-nginx, Helm/Skaffold

## Auth

### JWT Access Token Claims
- `sub`: user_id (uuid)
- `org_id`: active org (uuid)
- `org_role`: `admin | member | viewer`
- `iss`: issuer (configurable, default: `incidentops`)
- `aud`: audience (configurable, default: `incidentops-api`)
- `jti`: unique token ID (uuid)
- `iat`: issued at (unix timestamp)
- `exp`: expiration (unix timestamp)
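
The claim set above can be illustrated with a minimal, standard-library-only sketch of minting and checking an HS256 token. The function names (`build_claims`, `encode_hs256`, `decode_hs256`) are hypothetical and not part of this spec; a real service would more likely rely on an established JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
import uuid


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def build_claims(user_id, org_id, org_role, *, issuer="incidentops",
                 audience="incidentops-api", ttl_minutes=15, now=None):
    """Assemble the access-token claims listed in this section."""
    issued_at = int(time.time()) if now is None else int(now)
    return {
        "sub": str(user_id),
        "org_id": str(org_id),
        "org_role": org_role,
        "iss": issuer,
        "aud": audience,
        "jti": str(uuid.uuid4()),
        "iat": issued_at,
        "exp": issued_at + ttl_minutes * 60,
    }


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()


def encode_hs256(claims: dict, secret: str) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def decode_hs256(token: str, secret: str, *, issuer="incidentops",
                 audience="incidentops-api", now=None) -> dict:
    """Verify signature, issuer, audience and expiry; raise ValueError otherwise."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    current = int(time.time()) if now is None else int(now)
    if claims.get("iss") != issuer or claims.get("aud") != audience:
        raise ValueError("issuer or audience mismatch")
    if claims.get("exp", 0) <= current:
        raise ValueError("token expired")
    return claims
```

Because `org_id` and `org_role` travel inside the signed payload, a handler can scope every query to the caller's org without reading anything from the URL.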

### Refresh Token
- Opaque token returned in JSON (not cookie)
- Stored hashed in DB with `active_org_id`
- Rotated on refresh and org-switch
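
A rough sketch of the hashed-storage and rotation behavior described above, using an in-memory dict as a stand-in for the `refresh_tokens` table. The class and method names are hypothetical, SHA-256 is an assumed hash choice, and expiry handling is left out for brevity.

```python
import hashlib
import secrets
from dataclasses import dataclass


def hash_token(raw: str) -> str:
    # Only the digest is persisted; the raw token is shown to the client once.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


@dataclass
class RefreshRecord:
    user_id: str
    active_org_id: str
    revoked: bool = False


class RefreshTokenStore:
    """In-memory stand-in for the refresh_tokens table (illustrative only)."""

    def __init__(self):
        self._rows = {}  # token hash -> RefreshRecord

    def issue(self, user_id: str, active_org_id: str) -> str:
        raw = secrets.token_urlsafe(32)
        self._rows[hash_token(raw)] = RefreshRecord(user_id, active_org_id)
        return raw

    def lookup(self, raw: str) -> RefreshRecord:
        row = self._rows.get(hash_token(raw))
        if row is None or row.revoked:
            raise PermissionError("invalid or revoked refresh token")
        return row

    def rotate(self, raw: str, new_org_id=None) -> str:
        """Revoke the presented token and issue a replacement.

        Passing new_org_id models the org-switch case; otherwise the
        active org carries over unchanged.
        """
        row = self.lookup(raw)
        row.revoked = True
        return self.issue(row.user_id, new_org_id or row.active_org_id)

    def revoke(self, raw: str) -> None:
        self.lookup(raw).revoked = True
```

Storing only the hash means a leaked database row cannot be replayed as a token, and marking the old row revoked on every rotation makes reuse of a stale token detectable.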

### Endpoints
| Endpoint | Description |
|----------|-------------|
| `POST /v1/auth/register` | Create user + default org, return tokens |
| `POST /v1/auth/login` | Authenticate, return tokens |
| `POST /v1/auth/refresh` | Rotate refresh token, mint new access token |
| `POST /v1/auth/switch-org` | Change active org, rotate tokens |
| `POST /v1/auth/logout` | Revoke refresh token |

## Authorization

### Roles
| Role | Permissions |
|------|-------------|
| viewer | Read-only |
| member | + create incidents, transitions, comments |
| admin | + manage members, notification targets |

### Enforcement
- Role check via dependency injection
- Ownership check: resource `org_id` must match JWT `org_id`
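
The two checks above might look roughly like this in plain Python. `ROLE_RANK`, `Forbidden`, `require_role`, and `ensure_same_org` are hypothetical names, and the spec does not say which HTTP status a failed ownership check maps to, so the sketch only raises an exception.

```python
# Cumulative roles from the table above: each role includes the ones below it.
ROLE_RANK = {"viewer": 0, "member": 1, "admin": 2}


class Forbidden(Exception):
    """Raised when the caller's role or org does not permit the action."""


def require_role(claims: dict, minimum: str) -> dict:
    """Return the claims if org_role is at least `minimum`, else raise."""
    role = claims.get("org_role")
    if role not in ROLE_RANK or ROLE_RANK[role] < ROLE_RANK[minimum]:
        raise Forbidden(f"requires {minimum} or higher")
    return claims


def ensure_same_org(claims: dict, resource_org_id: str) -> None:
    """Ownership check: the resource must belong to the org in the JWT."""
    if claims.get("org_id") != resource_org_id:
        raise Forbidden("resource belongs to a different org")


# Example: a member may act on a resource in their own org.
caller = {"sub": "u-1", "org_id": "o-1", "org_role": "member"}
require_role(caller, "member")
ensure_same_org(caller, "o-1")
```

In the FastAPI service, a function like `require_role` would typically be wrapped in a dependency (via `Depends`) so that each route declares its minimum role in its signature.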

## API Routes

All under `/v1`. Auth required unless noted.

### Org (implicit from JWT)
- `GET /org`: current org summary
- `GET /org/members` (admin)
- `GET /org/services`
- `POST /org/services` (member+)
- `GET /org/notification-targets` (admin)
- `POST /org/notification-targets` (admin)

### Incidents
- `GET /incidents?status=&cursor=&limit=`
- `POST /services/{serviceId}/incidents` (member+)
- `GET /incidents/{incidentId}`
- `GET /incidents/{incidentId}/events`
- `POST /incidents/{incidentId}/transition` (member+)
- `POST /incidents/{incidentId}/comment` (member+)
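
The spec does not define the cursor format for `GET /incidents`, so the following is only one plausible reading: keyset pagination with an opaque cursor that encodes the last row's `(created_at, id)`. The field names, the lowercase status values, the newest-first ordering, and the `items`/`next_cursor` response shape are all assumptions.

```python
import base64
import json


def encode_cursor(created_at: str, incident_id: str) -> str:
    raw = json.dumps([created_at, incident_id]).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> tuple:
    created_at, incident_id = json.loads(base64.urlsafe_b64decode(cursor))
    return created_at, incident_id


def list_incidents(rows, *, status=None, cursor=None, limit=20):
    """Return one page of incidents, newest first.

    `rows` is a list of dicts with `id`, `status` and an ISO-8601
    `created_at` string, standing in for an org-scoped query.
    """
    matching = [r for r in rows if status is None or r["status"] == status]
    # Sort by (created_at, id) so ties on the timestamp stay deterministic.
    matching.sort(key=lambda r: (r["created_at"], r["id"]), reverse=True)
    if cursor is not None:
        boundary = decode_cursor(cursor)
        # Keep only rows strictly older than the last row already returned.
        matching = [r for r in matching if (r["created_at"], r["id"]) < boundary]
    page = matching[:limit]
    has_more = len(matching) > limit
    next_cursor = (
        encode_cursor(page[-1]["created_at"], page[-1]["id"]) if has_more else None
    )
    return {"items": page, "next_cursor": next_cursor}
```

Unlike offset-based paging, a keyset cursor stays stable when new incidents are created while a client is paging through older ones.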

### Health
- `GET /healthz`: liveness
- `GET /readyz`: readiness (postgres + redis)

## Incident State Machine

```
Triggered → Acknowledged → Mitigated → Resolved
```

- Transitions validated at application level
- Optimistic locking via `version` column
- All changes recorded in `incident_events`
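
The three rules above can be sketched together as follows. This assumes the chain is strictly linear (no skipping states), which the diagram suggests but the spec does not state outright; the class, exception, and event field names are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed: each state may only advance to the next one in the chain.
ALLOWED = {
    "triggered": {"acknowledged"},
    "acknowledged": {"mitigated"},
    "mitigated": {"resolved"},
    "resolved": set(),
}


class InvalidTransition(Exception):
    pass


class VersionConflict(Exception):
    pass


@dataclass
class Incident:
    id: str
    status: str = "triggered"
    version: int = 1
    events: list = field(default_factory=list)  # stand-in for incident_events


def transition(incident: Incident, to_status: str, expected_version: int) -> Incident:
    """Apply a status change if the caller saw the latest version.

    In SQL, the version check would usually live in the UPDATE itself, e.g.
    `UPDATE incidents SET status = $1, version = version + 1
     WHERE id = $2 AND version = $3`, treating zero updated rows as a conflict.
    """
    if incident.version != expected_version:
        raise VersionConflict(f"expected v{expected_version}, found v{incident.version}")
    if to_status not in ALLOWED.get(incident.status, set()):
        raise InvalidTransition(f"{incident.status} -> {to_status}")
    # Append-only timeline entry, recorded alongside the state change.
    incident.events.append(
        {"type": "status_changed", "from": incident.status, "to": to_status}
    )
    incident.status = to_status
    incident.version += 1
    return incident
```

Two clients that both read version 1 and both try to acknowledge will see one success and one `VersionConflict`, rather than a silent double write.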

## Database Schema

| Table | Purpose |
|-------|---------|
| `users` | User accounts |
| `orgs` | Organizations |
| `org_members` | User-org membership + role |
| `services` | Org-scoped services |
| `incidents` | Org-scoped incidents with version |
| `incident_events` | Append-only timeline |
| `refresh_tokens` | Token rotation + active org |
| `notification_targets` | Webhook/email/slack configs |
| `notification_attempts` | Delivery tracking (idempotent) |

## Background Jobs (Celery)

| Task | Queue | Purpose |
|------|-------|---------|
| `incident_triggered` | default | Fan-out to notification targets |
| `send_webhook` | default | HTTP POST with retry |
| `escalate_if_unacked` | critical | Delayed escalation (stretch) |
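
As a rough, framework-free illustration of the fan-out and retry behavior in the table, the sketch below records one attempt per (incident, target) pair and retries delivery with exponential backoff. The retry count, delays, and record shape are assumptions, the `attempts` dict stands in for `notification_attempts`, and `post` is an injected callable in place of the real HTTP request; in the actual worker, Celery's own retry mechanism would likely replace the loop.

```python
import time


def backoff_delays(retries: int, base_seconds: int = 2, cap_seconds: int = 300) -> list:
    """Exponential backoff: 2s, 4s, 8s, ... capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(retries)]


def fan_out(incident_id: str, target_ids, attempts: dict) -> list:
    """Create one pending attempt per target; re-running adds nothing new."""
    created = []
    for target_id in target_ids:
        key = (incident_id, target_id)
        if key not in attempts:
            attempts[key] = {"status": "pending", "tries": 0}
            created.append(key)
    return created


def deliver(key, attempts: dict, post, *, max_tries: int = 4, sleep=time.sleep) -> bool:
    """Call post(key) until it succeeds or max_tries is exhausted."""
    attempt = attempts[key]
    if attempt["status"] == "sent":
        return True  # already delivered; do not send twice
    delays = backoff_delays(max_tries - 1)
    for try_number in range(max_tries):
        attempt["tries"] += 1
        try:
            post(key)
        except Exception:
            if try_number == max_tries - 1:
                attempt["status"] = "failed"
                return False
            sleep(delays[try_number])
        else:
            attempt["status"] = "sent"
            return True
    return False
```

Keying attempts on the (incident, target) pair is what makes a redelivered `incident_triggered` task harmless: the second run finds the existing rows and creates nothing.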

## Config (Environment)

| Variable | Required | Default |
|----------|----------|---------|
| `DATABASE_URL` | Yes | (none) |
| `REDIS_URL` | No | `redis://localhost:6379/0` |
| `JWT_SECRET_KEY` | Yes | (none) |
| `JWT_ALGORITHM` | No | `HS256` |
| `JWT_ISSUER` | No | `incidentops` |
| `JWT_AUDIENCE` | No | `incidentops-api` |
| `ACCESS_TOKEN_EXPIRE_MINUTES` | No | `15` |
| `REFRESH_TOKEN_EXPIRE_DAYS` | No | `30` |
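
The table above maps naturally onto a settings object. The project structure lists pydantic-settings for `config.py`; the sketch below deliberately uses only the standard library to show the same required/default behavior, and `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import MISSING, dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Required: no default, so loading fails fast when they are absent.
    database_url: str
    jwt_secret_key: str
    # Optional: defaults mirror the table above.
    redis_url: str = "redis://localhost:6379/0"
    jwt_algorithm: str = "HS256"
    jwt_issuer: str = "incidentops"
    jwt_audience: str = "incidentops-api"
    access_token_expire_minutes: int = 15
    refresh_token_expire_days: int = 30


def load_settings(env=None) -> Settings:
    """Build Settings from environment variables named after each field."""
    env = os.environ if env is None else env
    values, missing = {}, []
    for spec in fields(Settings):
        raw = env.get(spec.name.upper())
        if raw is None:
            if spec.default is MISSING:
                missing.append(spec.name.upper())
            continue
        # Environment values are strings; convert the integer fields.
        values[spec.name] = int(raw) if spec.type is int else raw
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    return Settings(**values)
```

Passing a plain dict as `env` keeps the loader easy to exercise in tests, for example `load_settings({"DATABASE_URL": "...", "JWT_SECRET_KEY": "..."})`.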

## Development

Use `uv` for all Python operations:

```bash
# Install dependencies
uv sync

# Run tests
uv run pytest tests/

# Run the API server
uv run uvicorn app.main:app --reload

# Run migrations
uv run python migrations/migrate.py
```

## Project Structure

```
incidentops/
├── app/
│   ├── main.py              # FastAPI entry
│   ├── config.py            # pydantic-settings
│   ├── db.py                # asyncpg pool
│   ├── core/                # security, exceptions
│   ├── api/v1/              # route handlers
│   ├── schemas/             # pydantic models
│   ├── repositories/        # data access
│   └── services/            # business logic
├── worker/
│   ├── celery_app.py
│   └── tasks/
├── migrations/
│   └── *.sql + migrate.py
├── helm/
├── Dockerfile
├── docker-compose.yml
└── pyproject.toml
```