19 KiB
19 KiB
IncidentOps Specification
A multi-tenant incident management system with implicit active-org context from JWT.
Project Structure
incidentops/
├── IncidentOps.sln
├── docker-compose.yml
├── skaffold.yaml
├── .gitignore
│
├── src/
│ ├── IncidentOps.Api/ # ASP.NET Core REST API
│ │ ├── Auth/
│ │ │ ├── ClaimsPrincipalExtensions.cs
│ │ │ ├── RequestContext.cs
│ │ │ └── RoleRequirement.cs
│ │ ├── Controllers/
│ │ │ ├── AuthController.cs
│ │ │ ├── HealthController.cs
│ │ │ ├── IncidentsController.cs
│ │ │ └── OrgController.cs
│ │ ├── Dockerfile
│ │ ├── Program.cs
│ │ ├── appsettings.json
│ │ └── appsettings.Development.json
│ │
│ ├── IncidentOps.Worker/ # Hangfire Worker Service
│ │ ├── Jobs/
│ │ │ ├── EscalateIfUnackedJob.cs
│ │ │ ├── IncidentTriggeredJob.cs
│ │ │ └── SendWebhookNotificationJob.cs
│ │ ├── Dockerfile
│ │ ├── Program.cs
│ │ └── appsettings.json
│ │
│ ├── IncidentOps.Domain/ # Domain Entities & Enums
│ │ ├── Entities/
│ │ │ ├── Incident.cs
│ │ │ ├── IncidentEvent.cs
│ │ │ ├── NotificationAttempt.cs
│ │ │ ├── NotificationTarget.cs
│ │ │ ├── Org.cs
│ │ │ ├── OrgMember.cs
│ │ │ ├── RefreshToken.cs
│ │ │ ├── Service.cs
│ │ │ └── User.cs
│ │ └── Enums/
│ │ ├── IncidentEventType.cs
│ │ ├── IncidentStatus.cs
│ │ ├── NotificationTargetType.cs
│ │ └── OrgRole.cs
│ │
│ ├── IncidentOps.Infrastructure/ # Data Access & Services
│ │ ├── Auth/
│ │ │ ├── IPasswordService.cs
│ │ │ ├── ITokenService.cs
│ │ │ └── JwtSettings.cs
│ │ ├── Data/
│ │ │ ├── DbConnectionFactory.cs
│ │ │ └── Repositories/
│ │ │ ├── IIncidentEventRepository.cs
│ │ │ ├── IIncidentRepository.cs
│ │ │ ├── INotificationTargetRepository.cs
│ │ │ ├── IOrgMemberRepository.cs
│ │ │ ├── IOrgRepository.cs
│ │ │ ├── IRefreshTokenRepository.cs
│ │ │ ├── IServiceRepository.cs
│ │ │ └── IUserRepository.cs
│ │ ├── Jobs/
│ │ │ ├── IEscalateIfUnackedJob.cs
│ │ │ ├── IIncidentTriggeredJob.cs
│ │ │ └── ISendWebhookNotificationJob.cs
│ │ ├── Migrations/
│ │ │ ├── Migration0001_InitialSchema.cs
│ │ │ ├── Migration0002_RefreshTokens.cs
│ │ │ └── Migration0003_NotificationTargets.cs
│ │ └── ServiceCollectionExtensions.cs
│ │
│ └── IncidentOps.Contracts/ # DTOs / API Contracts
│ ├── Auth/
│ │ ├── AuthResponse.cs
│ │ ├── LoginRequest.cs
│ │ ├── LogoutRequest.cs
│ │ ├── MeResponse.cs
│ │ ├── RefreshRequest.cs
│ │ ├── RegisterRequest.cs
│ │ └── SwitchOrgRequest.cs
│ ├── Incidents/
│ │ ├── CommentRequest.cs
│ │ ├── CreateIncidentRequest.cs
│ │ ├── IncidentDto.cs
│ │ ├── IncidentEventDto.cs
│ │ ├── IncidentListResponse.cs
│ │ └── TransitionRequest.cs
│ ├── Orgs/
│ │ ├── CreateNotificationTargetRequest.cs
│ │ ├── NotificationTargetDto.cs
│ │ ├── OrgDto.cs
│ │ └── OrgMemberDto.cs
│ └── Services/
│ ├── CreateServiceRequest.cs
│ └── ServiceDto.cs
│
├── web/ # Next.js Frontend
│ ├── app/
│ │ ├── dashboard/page.tsx
│ │ ├── login/page.tsx
│ │ ├── register/page.tsx
│ │ ├── layout.tsx
│ │ ├── page.tsx
│ │ └── globals.css
│ ├── lib/
│ │ └── api.ts
│ ├── types/
│ │ └── index.ts
│ ├── Dockerfile
│ ├── package.json
│ ├── tsconfig.json
│ └── next.config.js
│
├── helm/incidentops/ # Helm Chart
│ ├── Chart.yaml
│ ├── values.yaml
│ └── templates/
│ ├── _helpers.tpl
│ ├── api-deployment.yaml
│ ├── api-service.yaml
│ ├── worker-deployment.yaml
│ ├── web-deployment.yaml
│ ├── web-service.yaml
│ ├── ingress.yaml
│ └── secrets.yaml
│
└── docs/
└── specs.md
1. Architecture (microservices-lite)
Deployables
-
api-service (.NET 10, ASP.NET Core)
- REST API (implicit org scope from JWT)
- JWT access + refresh (both returned in JSON)
- RBAC enforced using
org_roleclaim + DB ownership checks - Writes incidents + timeline events
- Enqueues background jobs to Hangfire
-
worker-service (.NET 10 Worker Service)
- Runs Hangfire Server using Redis storage
- Executes jobs: notification send, escalation checks, rollups
- Writes notification attempts and system events
-
web (Next.js 14 + TypeScript)
- Auth pages + dashboard + incident detail
Dependencies (in kind via Helm)
- PostgreSQL (Bitnami)
- Redis (Bitnami) - Hangfire storage
- ingress-nginx
- (later) Prometheus/Grafana/OTel
2. Auth Model (active org in JWT, implicit org scope)
JWT Access Token Claims
| Claim | Description |
|---|---|
sub |
userId (uuid) |
org_id |
activeOrgId (uuid) |
org_role |
admin|member|viewer |
iss |
Issuer |
aud |
Audience |
iat |
Issued at |
exp |
Expiration |
jti |
(optional) Token ID |
Refresh Token Model (JSON, not cookie)
- Random opaque token returned in JSON
- Stored hashed in DB
- Rotated on refresh and switch-org
- Refresh token row stores
active_org_id(per-session org selection)
DB: refresh_tokens
id uuid PRIMARY KEY
user_id uuid NOT NULL
token_hash text NOT NULL UNIQUE
active_org_id uuid NOT NULL
expires_at timestamptz NOT NULL
revoked_at timestamptz NULL
created_at timestamptz NOT NULL
Auth Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/auth/register |
Create user + default org |
| POST | /v1/auth/login |
Authenticate, return tokens |
| POST | /v1/auth/refresh |
Rotate refresh token |
| POST | /v1/auth/switch-org |
Switch active org context |
| POST | /v1/auth/logout |
Revoke refresh token |
Registration Flow
On POST /v1/auth/register { email, password, displayName }:
- Create user record
- Create a default org automatically (e.g., "John's Org")
- Create org_member with role=Admin
- Return access + refresh tokens
3. Authorization Rules (implicit org scope)
Request Context
Middleware extracts from JWT:
UserIdfromsubOrgIdfromorg_idRolefromorg_role
Authorization Approach
- Role check: enforce viewer/member/admin by claim
- Ownership check: for any resource ID in path, load its
org_idfrom DB and require it equals tokenorg_id- Prevents cross-tenant IDOR even though org isn't in the URL
Role Permissions
| Role | Permissions |
|---|---|
| viewer | Read-only access |
| member | Create incidents, transitions, comments |
| admin | Manage members, notification targets, on-call schedules |
4. API Surface (implicit org in JWT)
All routes under /v1. Unless noted, routes require auth.
Auth
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /auth/register |
No | Register new user |
| POST | /auth/login |
No | Login |
| POST | /auth/refresh |
No | Refresh tokens |
| POST | /auth/switch-org |
No | Switch org context |
| POST | /auth/logout |
No | Logout |
| GET | /me |
Yes | Get current user info |
Org (current org context)
| Method | Endpoint | Role | Description |
|---|---|---|---|
| GET | /org |
viewer+ | Current org summary + role |
| GET | /org/members |
admin | List org members |
| POST | /org/members |
admin | Invite/add member (stretch) |
| GET | /org/services |
viewer+ | List services |
| POST | /org/services |
member+ | Create service |
| GET | /org/notification-targets |
admin | List notification targets |
| POST | /org/notification-targets |
admin | Create notification target |
Incidents
| Method | Endpoint | Role | Description |
|---|---|---|---|
| GET | /incidents |
viewer+ | List incidents (cursor pagination) |
| POST | /services/{serviceId}/incidents |
member+ | Create incident |
| GET | /incidents/{incidentId} |
viewer+ | Get incident detail |
| GET | /incidents/{incidentId}/events |
viewer+ | Get incident timeline |
| POST | /incidents/{incidentId}/transition |
member+ | Transition incident state |
| POST | /incidents/{incidentId}/comment |
member+ | Add comment |
Health
| Method | Endpoint | Description |
|---|---|---|
| GET | /healthz |
Liveness probe |
| GET | /readyz |
Readiness probe (checks Postgres + Redis) |
5. Domain Workflows
Incident State Machine
Triggered → Acknowledged → Mitigated → Resolved
Enforcement
- Application-level validation (allowed transitions)
- DB optimistic concurrency using
incidents.version
Transition Write Pattern
UPDATE incidents
SET status = @newStatus, version = version + 1, updated_at = NOW()
WHERE id = @id AND org_id = @orgId AND version = @expectedVersion
- If 0 rows updated →
409 Conflict(stale client) or404if not found in org
Timeline Model
Append-only incident_events records for:
- Incident created
- Transitions (ack, mitigate, resolve)
- Comments
- Notifications sent/failed
- Escalations triggered
actor_user_id is null for system/worker actions.
6. PostgreSQL Schema (core tables)
Users
CREATE TABLE users (
id uuid PRIMARY KEY,
email text NOT NULL UNIQUE,
password_hash text NOT NULL,
display_name text NOT NULL,
created_at timestamptz NOT NULL DEFAULT NOW()
);
Orgs
CREATE TABLE orgs (
id uuid PRIMARY KEY,
name text NOT NULL,
slug text NOT NULL UNIQUE,
created_at timestamptz NOT NULL DEFAULT NOW()
);
Org Members
CREATE TABLE org_members (
id uuid PRIMARY KEY,
org_id uuid NOT NULL REFERENCES orgs(id) ON DELETE CASCADE,
user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
role text NOT NULL CHECK (role IN ('admin', 'member', 'viewer')),
created_at timestamptz NOT NULL DEFAULT NOW(),
UNIQUE(org_id, user_id)
);
Services
CREATE TABLE services (
id uuid PRIMARY KEY,
org_id uuid NOT NULL REFERENCES orgs(id) ON DELETE CASCADE,
name text NOT NULL,
slug text NOT NULL,
description text,
created_at timestamptz NOT NULL DEFAULT NOW(),
UNIQUE(org_id, slug)
);
Incidents
CREATE TABLE incidents (
id uuid PRIMARY KEY,
org_id uuid NOT NULL REFERENCES orgs(id) ON DELETE CASCADE,
service_id uuid NOT NULL REFERENCES services(id) ON DELETE CASCADE,
title text NOT NULL,
description text,
status text NOT NULL DEFAULT 'triggered'
CHECK (status IN ('triggered', 'acknowledged', 'mitigated', 'resolved')),
severity text NOT NULL DEFAULT 'sev3'
CHECK (severity IN ('sev1', 'sev2', 'sev3', 'sev4')),
version integer NOT NULL DEFAULT 1,
created_at timestamptz NOT NULL DEFAULT NOW(),
updated_at timestamptz
);
CREATE INDEX idx_incidents_org_status ON incidents(org_id, status);
Incident Events
CREATE TABLE incident_events (
id uuid PRIMARY KEY,
incident_id uuid NOT NULL REFERENCES incidents(id) ON DELETE CASCADE,
event_type text NOT NULL,
actor_user_id uuid REFERENCES users(id),
payload jsonb,
created_at timestamptz NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_incident_events_incident ON incident_events(incident_id, created_at);
Notification Targets
CREATE TABLE notification_targets (
id uuid PRIMARY KEY,
org_id uuid NOT NULL REFERENCES orgs(id) ON DELETE CASCADE,
name text NOT NULL,
target_type text NOT NULL CHECK (target_type IN ('webhook', 'email', 'slack')),
configuration text NOT NULL,
is_enabled boolean NOT NULL DEFAULT true,
created_at timestamptz NOT NULL DEFAULT NOW(),
updated_at timestamptz
);
Notification Attempts
CREATE TABLE notification_attempts (
id uuid PRIMARY KEY,
incident_id uuid NOT NULL REFERENCES incidents(id) ON DELETE CASCADE,
target_id uuid NOT NULL REFERENCES notification_targets(id) ON DELETE CASCADE,
success boolean NOT NULL,
error_message text,
attempt_number integer NOT NULL DEFAULT 1,
created_at timestamptz NOT NULL DEFAULT NOW(),
UNIQUE(incident_id, target_id)
);
Refresh Tokens
CREATE TABLE refresh_tokens (
id uuid PRIMARY KEY,
user_id uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,
token_hash text NOT NULL UNIQUE,
active_org_id uuid NOT NULL REFERENCES orgs(id),
expires_at timestamptz NOT NULL,
revoked_at timestamptz,
created_at timestamptz NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_refresh_tokens_user ON refresh_tokens(user_id);
7. Data Access (Dapper) and Migrations (FluentMigrator)
Dapper Conventions
- Repositories receive
OrgIdas an explicit parameter and include it in WHERE clauses - Keep SQL close to repositories (or separate
.sqlfiles) - Use
NpgsqlConnection+IDbTransactionfor multi-statement operations
FluentMigrator
| Migration | Tables |
|---|---|
| 0001 | users, orgs, org_members, services, incidents, incident_events |
| 0002 | refresh_tokens |
| 0003 | notification_targets, notification_attempts |
8. Hangfire Job Design (Redis storage)
Setup
- API configures Hangfire Client (enqueue)
- Worker hosts Hangfire Server (process)
Queues
| Queue | Purpose |
|---|---|
| critical | Escalations |
| default | Notifications |
| low | Rollups |
Jobs
1. IncidentTriggeredJob(incidentId)
- Reads incident (must belong to org in incident row)
- Loads enabled notification targets for the org
- Inserts
notification_attemptsrows (idempotent) - Enqueues per-target send jobs
2. SendWebhookNotificationJob(incidentId, targetId)
- Attempts HTTP POST with incident summary payload
- Updates attempt status + writes
incident_eventof typesystem.notification_sentorsystem.notification_failed - Throws on transient failures to trigger retry; safe due to DB idempotency
3. EscalateIfUnackedJob(incidentId, step) (stretch)
- Runs delayed
- Checks status; if still Triggered, sends secondary notifications
Operational Note
- Expose Hangfire Dashboard only in local and protect it (basic auth or require a dev token)
9. Kubernetes (kind) + Helm + Skaffold (local-only)
Helm Umbrella Chart Deploys
- bitnami/postgresql
- bitnami/redis
- api Deployment/Service
- worker Deployment
- web Deployment/Service
- Ingress with host
incidentops.local:/api,/v1,/healthz,/readyz→ api-service/→ web
Configuration via Environment
| Variable | Description |
|---|---|
ConnectionStrings__Postgres |
PostgreSQL connection string |
Redis__ConnectionString |
Redis connection string |
Jwt__Issuer |
JWT issuer |
Jwt__Audience |
JWT audience |
Jwt__SigningKey |
JWT signing key (secret) |
Readiness
- API checks Postgres + Redis
- Worker checks Postgres + Redis at startup
Skaffold
- Builds three images (api, worker, web)
helm upgrade --installon changes
10. Frontend UX Requirements (implicit org)
- On login, display
activeOrgfrom response - Org switcher calls
/v1/auth/switch-organd replaces tokens - All subsequent API calls use only
Authorizationheader; no orgId params - Store tokens in localStorage or secure cookie
- Handle 401 by attempting token refresh
11. Key Highlights (README/Resume)
- "Multi-tenant org context embedded in JWT; org switching re-issues tokens."
- "DB ownership checks prevent cross-tenant resource access."
- "Optimistic concurrency for incident transitions."
- "Background jobs with retries + idempotent notification attempts."
- "Deployed locally to Kubernetes via Helm + Skaffold."
12. Technology Stack
| Layer | Technology |
|---|---|
| Runtime | .NET 10 |
| API Framework | ASP.NET Core |
| Worker | .NET Worker Service |
| Background Jobs | Hangfire with Redis |
| Database | PostgreSQL |
| ORM | Dapper |
| Migrations | FluentMigrator |
| Auth | JWT Bearer + BCrypt |
| Frontend | Next.js 14 + TypeScript |
| Container | Docker |
| Orchestration | Kubernetes (kind) |
| Deployment | Helm + Skaffold |
13. Local Development
Prerequisites
- .NET 10 SDK
- Node.js 20+
- Docker
- kind (Kubernetes in Docker)
- Helm
- Skaffold
Quick Start
# With Docker Compose (simplest)
docker-compose up -d
# Run API
cd src/IncidentOps.Api
dotnet run
# Run Worker (separate terminal)
cd src/IncidentOps.Worker
dotnet run
# Run Web (separate terminal)
cd web
npm install
npm run dev
With Kubernetes (kind)
# Create cluster
kind create cluster --name incidentops
# Deploy with Skaffold
skaffold dev
# Access at http://incidentops.local (add to /etc/hosts)
14. API Request/Response Examples
Register
POST /v1/auth/register
Content-Type: application/json
{
"email": "user@example.com",
"password": "SecurePass123!",
"displayName": "John Doe"
}
Response:
{
"accessToken": "eyJhbG...",
"refreshToken": "a1b2c3d4...",
"activeOrg": {
"id": "uuid",
"name": "John Doe's Org",
"slug": "org-abc123",
"role": "admin"
}
}
Create Incident
POST /v1/services/{serviceId}/incidents
Authorization: Bearer {accessToken}
Content-Type: application/json
{
"title": "Database connection timeout",
"description": "Users experiencing slow queries",
"severity": "sev2"
}
Transition Incident
POST /v1/incidents/{incidentId}/transition
Authorization: Bearer {accessToken}
Content-Type: application/json
{
"action": "ack",
"expectedVersion": 1
}