Tau.Acuvim/portal/docs/FLEET-DESIGN.md
Diseri Pearson 3333202f3a Dual-token rotation grace window (24h default)
Token rotation used to be immediate cutover — push gap from when
ops rotates to when the customer's .env is updated and portal
restarted. Now the old token keeps working for 24h after rotation,
so customer ops has a full workday to swap it in without dropping a
single push tick.

Backend
- Customer entity gains PreviousTokenHash + PreviousTokenExpiresAt
  (both nullable). Non-unique index on PreviousTokenHash so the
  OR-lookup in FindByTokenAsync stays cheap.
- CustomerService.RotateTokenAsync(id, graceWindow=null, ct):
  copies the existing TokenHash into PreviousTokenHash with
  PreviousTokenExpiresAt = now + graceWindow (default 24h, lifted
  to CustomerService.DefaultTokenGracePeriod), then issues a new
  current token. Second rotation overwrites the previous slot —
  at most one previous token is ever honoured.
- CustomerService.FindByTokenAsync matches either current OR
  (previous AND PreviousTokenExpiresAt > now). IsActive=false
  still rejects both.
- DTO exposes PreviousTokenExpiresAt so the UI can render the
  grace window status.
- New EF migration AddPreviousTokenGraceWindow on AdminDbContext.

Frontend
- Customers table "Token" column shows an "Old token valid until …"
  orange tag with a tooltip whenever the grace window is active,
  plus the issue/rotation date as before.
- TokenShownOnceModal mentions the 24h grace window so ops knows
  they have time to update .env without urgency.
- Rotate-token popconfirm copy updated to reflect the new behavior.

Tests (+5, 61/61 passing)
- CustomerTokenGraceTests covers: create doesn't set previous;
  rotate moves current into previous slot with future expiry;
  zero grace window rejects original immediately; second rotation
  overwrites previous (original dies, first-rotation becomes the
  new previous); inactive customer rejects both current AND previous.

Verified end-to-end on the dev host
- Migration applied cleanly on the existing admin_fleet DB (existing
  DEV0001 customer got NULL previous columns, no data loss).
- Created GRACE01 → got token1.
- Rotated → got token2. PreviousTokenExpiresAt = +24h. Both token1
  and token2 push successfully (200).
- Rotated again → got token3. token1 push now returns 401 (gone).
  token2 push still 200 (now the previous). token3 push 200 (current).

Docs
- FLEET-DESIGN.md §6 rewritten — no longer "immediate cutover".
- §11 "open seams" row for this feature marked as shipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:45:31 +02:00

345 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Fleet ingest — design (locked)
Status: **approved**, ready for Phase 13 implementation.
This is the design reference for the Admin / Fleet aggregation feature. Read this before touching any code in Phases 1315.
---
## 1. Goals & non-goals
**In scope (v1):**
- Mirror every customer's raw `monitoring.PowerMeasurements`, `Sites`, and `Devices` into a central DB on an Admin stack, with `CustomerId` attached.
- Admin operator sees fleet-wide and per-customer dashboards.
- Customer stacks keep working if Admin is unreachable; pushes resume when it comes back.
- Same portal binary in two modes (`RunMode=Client` / `Admin`), config-selected.
**Not in scope (v1):**
- Cross-customer cost / tariff aggregation (Admin shows kWh + power; per-customer instance still owns billing).
- Branding / user sync (Admin has its own brand + users — separate fleets of humans).
- Bidirectional sync (Admin → Customer commands, OTA, etc.).
- Multi-region or sharded central DB.
---
## 2. Topology
```
[Traefik public]
├─ abc0001.portal.example.com → Client stack (portal + timescale + grafana)
├─ abc0002.portal.example.com → Client stack
├─ ...
└─ admin.portal.example.com → Admin stack (portal + timescale + grafana)
[Client stack abc0001] --batches POST--> [Admin stack /api/fleet/ingest]
[Client stack abc0002] --batches POST--> (central TimescaleDB)
[Client stack ...] --batches POST-->
```
Admin stack reuses `docker-compose.prod.yml` and the same image, with:
- `COMPOSE_PROJECT_NAME=admin`
- `CUSTOMER_HOST=admin.portal.example.com`
- `RunMode=Admin`
- Its own Postgres volume (the central fleet DB)
The whole feature is opt-in. With no Admin stack deployed, customer stacks set `FleetIngest__Enabled=false` and work exactly as today.
---
## 3. `RunMode` mechanics
`Application.RunMode` is `Client` (default) or `Admin`. Single binary; Program.cs branches:
| Concern | Client | Admin |
|---|---|---|
| DbContext target | local Postgres | central Postgres |
| Endpoints `/api/sites/*`, `/api/measurements/*`, `/api/ingest/measurements`, `/api/admin/sites/*` | mapped | **not mapped** |
| Endpoints `/api/fleet/*`, `/api/admin/customers/*` | not mapped | mapped |
| `FleetPushService` (BackgroundService) | started if `FleetIngest__Enabled=true` | not registered |
| `FleetIngestService` | not registered | registered |
| EF migration set | `app` + `monitoring` | `app` + `fleet` |
| Identity / Branding / Users / Settings | same | same |
| Settings → Rates tab | shown | hidden |
| Sidebar nav | Dashboard / Dashboards / Sites / Settings | Dashboard / Dashboards / **Customers** / Settings |
**Startup guards:**
- Reject `RunMode=Admin` if `Database:ConnectionString` is empty (no auto-provision for Admin — there is no obvious local default for a fleet DB).
- Reject `RunMode=Client + FleetIngest__Enabled=true` if `FleetIngest__Url` or `FleetIngest__Token` is empty.
- Reject the wrong mode against an existing DB by sniffing for a marker table:
- `RunMode=Client` against a DB containing `fleet.Customers` → fatal.
- `RunMode=Admin` against a DB containing `monitoring.PowerMeasurements` → fatal.
- Stops the "I pointed dev at the prod DB" disaster.
---
## 4. Schemas
### 4.1 Central DB (Admin) — `fleet` schema
**`fleet.Customers`** — registry
| Column | Type | Notes |
|---|---|---|
| Id | uuid PK | Admin-generated, stable |
| Code | varchar(50) UNIQUE | e.g. `ABC0001` — human handle (preserve case for display, lowercase for compose) |
| Name | varchar(200) | |
| TokenHash | varchar(64) UNIQUE indexed | SHA-256 hex of the push token |
| TokenIssuedAt | timestamptz | |
| TokenRotatedAt | timestamptz nullable | |
| IsActive | bool | If false, ingest returns 401 |
| FirstSeenAt, LastSeenAt | timestamptz nullable | When their first / most-recent push landed |
| CreatedAt | timestamptz | |
**`fleet.Sites`** — mirrors customer-side, identity preserved
| Column | Type | Notes |
|---|---|---|
| Id | uuid | Customer-side Site UUID, preserved |
| CustomerId | uuid FK→Customers | |
| Name, Address, IsActive | … | |
| LocalMunicipalityId | int nullable | Opaque on Admin |
| ReceivedAt | timestamptz | When Admin upserted this row |
| **PK** | (CustomerId, Id) | |
**`fleet.Devices`** — same pattern
| Column | Type | Notes |
|---|---|---|
| Id | uuid | Preserved |
| CustomerId | uuid FK | |
| SiteId | uuid | (CustomerId, SiteId) FK→Sites composite |
| Name, ExternalId, Description, IsActive | … | |
| ReceivedAt | timestamptz | |
| **PK** | (CustomerId, Id) | |
**`fleet.PowerMeasurements`** — hypertable, the big one
| Column | Type | Notes |
|---|---|---|
| Time | timestamptz | Hypertable partition |
| CustomerId | uuid | |
| DeviceId | uuid | (CustomerId, DeviceId) FK→Devices |
| ActivePowerKw, ReactivePowerKvar, ApparentPowerKva, PowerFactor, VoltageV, FrequencyHz, EnergyImportedKwh, EnergyExportedKwh | double | |
| Source | varchar(50) nullable | |
| **PK** | (Time, CustomerId, DeviceId) | Time must be in any unique constraint on a hypertable |
| **Indexes** | (CustomerId, Time DESC); (CustomerId, DeviceId, Time DESC) | |
**`fleet.IngestEvents`** — observability
| Column | Type | Notes |
|---|---|---|
| Id | uuid PK | |
| CustomerId | uuid FK | |
| ReceivedAt | timestamptz | |
| BatchType | varchar(20) | `sites` / `devices` / `measurements` |
| RowsAccepted, RowsRejected | int | |
| BatchBytes | int | |
| ClientHwm | varchar(50) | Cursor the client claims this batch covers |
| TimeSpread | interval nullable | `max(Time) - min(Time)` in the batch — visible burst back-fills |
| Error | varchar(500) nullable | |
### 4.2 Client DB additions
One new table: **`app.FleetPushState`**
| Column | Type | Notes |
|---|---|---|
| ResourceType | varchar(20) PK | `sites` / `devices` / `measurements` |
| LastCursor | timestamptz nullable | For measurements: max(`ReceivedAt`) pushed. For sites/devices: max(`UpdatedAt`) |
| LastSyncedAt | timestamptz nullable | |
| LastError | varchar(500) nullable | |
| ConsecutiveFailures | int default 0 | Drives exponential backoff |
Plus: **add `ReceivedAt timestamptz` (default `NOW()`) to `monitoring.PowerMeasurements`**, indexed on `(ReceivedAt)`. See §5.1 for the rationale.
### 4.3 Migration assembly split
Two `DbContext` classes — `ClientDbContext` (existing `AppDbContext`, renamed) and `AdminDbContext`. Each owns its migrations folder. Only the one matching `RunMode` is registered with DI. Ugly-but-reliable wins over single-context-with-tooling-gymnastics.
---
## 5. Push pipeline
### 5.1 Late arrivals — push by `ReceivedAt`, not `Time`
The firmware (ESP32 / TTGO T-Call) **buffers samples offline** and replays them when connectivity returns. Late arrivals are a designed-for case, not an edge.
If we pushed by `Time > LastCursor`, a device offline for 6 hours would have its back-fill skipped (its samples' `Time` is in the past).
Fix: every row carries `ReceivedAt` assigned by the local DB on insert. Push selects `WHERE ReceivedAt > LastCursor[measurements]`. Back-fills are picked up on the next push tick after the firmware replays.
`Time` stays as-is for queries; nothing else changes about how the data is used.
### 5.2 Burst handling
When a firmware replay drops thousands of rows in one local insert, the next push tick sees a huge backlog. To avoid one customer monopolising network / Admin:
- Per request: up to `FleetIngest__BatchSize` rows (default 5000) **or** `FleetIngest__BatchMaxBytes` (default 1 MB), whichever hits first.
- Per tick: at most 3 successful batches per resource type, then yield.
- Result: a 30k-row back-fill drains over ~5 minutes (10 ticks @ 60s) without starving sites/devices syncing.
- Observable: `fleet.IngestEvents.TimeSpread` shows the `max(Time) - min(Time)` per batch; a wave of back-fill is obvious in Admin logs without grepping.
### 5.3 Cadence
- Default push interval: 60 s.
- Configurable: `FleetIngest__IntervalSeconds`.
### 5.4 Order of operations per tick
1. **Sites**: `UpdatedAt > LastCursor[sites]` → POST batch (full upsert, all rows; small data).
2. **Devices**: same.
3. **Measurements**: `ReceivedAt > LastCursor[measurements]` ordered by `ReceivedAt` → POST batches until drained or limit.
Sites and devices first so Admin's measurement FK insert doesn't reject rows for unknown devices.
### 5.5 Endpoint contract
```
POST /api/fleet/ingest
Headers:
X-Customer-Token: <opaque 32-byte hex>
X-Batch-Type: sites | devices | measurements
X-Push-Cursor: <ISO8601> (highest ReceivedAt/UpdatedAt in this batch)
Content-Type: application/json
Body:
JSON array of rows matching the type
Response 200:
{ accepted: int, rejected: int, errors: [{ row: int, reason: string }] }
Response 401: unknown / inactive / wrong token
Response 413: batch too large
Response 429: rate-limited
Response 5xx: transient — retry
```
CustomerId is derived from the token server-side. Clients never send it. Prevents one customer from forging rows for another.
### 5.6 Failure handling
| Failure | Client behavior |
|---|---|
| Network timeout / 5xx | Exponential backoff (1m → 2m → 5m → 10m, cap 30m); `ConsecutiveFailures++` |
| 401 | Stop trying for this resource; log loudly; surface in **Settings → App config** ("Admin link down") |
| 200 with `rejected > 0` on measurements | Re-push sites/devices on next tick before retrying measurements |
| 413 | Halve `BatchSize` for this resource and retry |
| 429 | Honor `Retry-After`; otherwise back off 1 minute |
Push runs on its own DI scope and its own DbContext. If the push DB query hangs, only push is blocked — never user-facing API.
---
## 6. Auth — token lifecycle
- Token = 32 random bytes, hex-encoded (64 chars). Generated server-side via `RandomNumberGenerator`.
- Stored as **SHA-256 hex hash**, UNIQUE indexed → ingest endpoint does a single O(log N) indexed lookup. Bcrypt is the wrong tool for a high-throughput, high-entropy machine token.
- Shown **once** in the Admin UI at creation/rotation. Admin only stores the hash. Lost token → rotate.
- Rotation: **dual-token with 24h grace window** (shipped after the initial design). On rotate, the current `TokenHash` is copied into `PreviousTokenHash` with `PreviousTokenExpiresAt = now + 24h`, then a fresh current is generated. `FindByTokenAsync` matches either current OR (previous AND not expired AND `IsActive`). Customer ops has the full window to update `.env` + restart without dropping a push tick. A second rotation within the grace window overwrites the previous slot — at most one previous token is ever honoured at a time. Grace duration is `CustomerService.DefaultTokenGracePeriod` (parameterizable on `RotateTokenAsync` for tests; lift to config if you need per-customer tuning).
- `Customers.IsActive=false` → all that customer's pushes get 401 until reactivated.
---
## 7. Continuous aggregates + retention
### 7.1 Aggregates (created post-migration by `FleetTimescaleBootstrapper`)
- `fleet_hourly_per_device``time_bucket('1 hour', Time)`, `CustomerId`, `DeviceId` → avg/max/min `ActivePowerKw`, `max(EnergyImported) - min(EnergyImported)` as `KwhDelta`, sample count.
- `fleet_daily_per_customer``time_bucket('1 day', Time)`, `CustomerId` → totals across all that customer's devices.
- **Realtime** (`materialized_only = false`) so late back-fills appear in CA queries immediately, with the unmaterialized tail served live from the hypertable until next refresh.
- Refresh policy:
- `fleet_hourly_per_device`: every 5 min; `start_offset = 30 days`, `end_offset = 1 hour`.
- `fleet_daily_per_customer`: every 1 hour; `start_offset = 365 days`, `end_offset = 1 day`.
- Wide `start_offset` means firmware back-fills within those windows get materialized on the next refresh tick. TimescaleDB's invalidation log means only chunks that actually changed get re-processed.
### 7.2 Retention vs compression
**Retention policy: NONE.** All data kept indefinitely for historical / trend analysis.
**Compression policy** (this is what makes "forever" cheap):
- `fleet.PowerMeasurements`: compress chunks older than **7 days**.
- Compression settings: `segmentby = (CustomerId, DeviceId)`, `orderby = Time DESC`.
- Typical ratio for power-monitoring data: **510×**. Point queries by `(CustomerId, DeviceId, Time range)` stay fast because of the segmentby clustering.
- Bootstrapper enables compression and adds the policy idempotently.
CA tables are tiny by comparison; no compression needed there.
---
## 8. Admin UI surface
| Page | What |
|---|---|
| Dashboard | Fleet headline — customer count, active customers (push in last hour), aggregate live active power, total kWh today, last-24h lag chart |
| Customers | Table — Code / Name / Last seen / Sites / Devices / Today kWh / Status / Actions (rotate token, disable, view) |
| Customer detail | Sites + devices for the selected customer + Grafana embed scoped to `customer_id` |
| Dashboards | Grafana embed — `fleet-overview` default, `customer-drilldown` parameterized |
| Settings → Branding | Admin's own brand |
| Settings → Users | Admin operator accounts |
| Settings → Grafana | Read-only info card |
| Settings → App config | Read-only config snapshot (RunMode visible at the top) |
Hidden in Admin mode: **Sites** top-level (no local sites), **Settings → Rates** tab (no local tariffs).
---
## 9. Observability
Two questions a real operator asks, two answers:
1. **"Is customer X pushing?"** — Customers page shows `LastSeenAt`, `ConsecutiveFailures`, last batch size, recent burst spread.
2. **"Why did this batch not land?"** — `SELECT * FROM fleet.IngestEvents WHERE CustomerId = … ORDER BY ReceivedAt DESC LIMIT 50;`
Plus Serilog:
- Client: `[INF] Pushed N measurements to fleet ingest (cursor=…) accepted=N rejected=0`
- Admin: `[INF] Accepted N measurements from {Code} ({CustomerId}), rejected 0`
---
## 10. Per-stack config
**Customer `.env` additions:**
```
RunMode=Client # default — can omit
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<32-byte hex token from Admin Customers page>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
FleetIngest__BatchMaxBytes=1048576
```
`FleetIngest__Enabled=false` → push service doesn't start; customer runs as today, no data leaves.
**Admin `.env`:**
```
RunMode=Admin
COMPOSE_PROJECT_NAME=admin
CUSTOMER_HOST=admin.portal.example.com
Database__ConnectionString=Host=timescaledb;... # central fleet DB (separate volume)
# All other vars same as Client compose.
```
---
## 11. Open seams (deferred, with the obvious extension paths noted)
| Seam | Where v2 picks it up |
|---|---|
| Tariff sync + cross-customer cost | New `fleet.Tariffs` table, push tariff change events from customer, derived cost in `fleet_daily_per_customer`. |
| Per-customer Postgres RLS for multi-Admin-user setups | Add `current_setting('app.customer_filter')`-based RLS policies on `fleet.*` tables; Admin role to customer-scope mapping in `IdentityRole.CustomerId`. |
| Bidirectional Admin → Customer commands | New WebSocket or long-poll channel on customer side; gated by mutual cert or a second token. |
| Branding sync (for the "Admin sees customer's brand when drilling in" niceness) | Push branding row from customer; Admin renders the customer's brand on Customer-detail pages. |
| Forward compatibility | Versioned URL `/v1/fleet/ingest`. Admin tolerates unknown JSON fields. Strictness on response shape only. |
| ~~Dual-token grace window for rotation~~ | **Shipped.** See §6 — `Customers.PreviousTokenHash` + `PreviousTokenExpiresAt`; ingest accepts either while previous is unexpired. Default 24h. |
| Sharded Admin (10000+ customers) | Customer's `FleetIngest__Url` already supports any host — point different customers at different Admin instances; aggregate at a tier above with Promscale or similar if needed. |
| Hard-delete / GDPR | Admin Customers page → "Decommission" action: `DELETE FROM fleet.* WHERE CustomerId = …` cascade. Logged. |
---
## 12. Phase split
| Phase | What | End-state |
|---|---|---|
| **13** | `RunMode` plumbing + Fleet schema + `AdminDbContext` split + Customers registry + token issuance UI. No data movement. | Spin up an Admin stack, register a customer, see their token. Client stack picks up `RunMode=Client` + push config but doesn't yet push. |
| **14** | `FleetPushService` (Client), `/api/fleet/ingest` + `FleetIngestService` (Admin), `IngestEvents`, `ReceivedAt` migration on Client, `FleetPushState`, continuous aggregates SQL, compression policy SQL. | Two local stacks: one Client pushes to one Admin. Rows visible in `fleet.PowerMeasurements`. Late-arrival firmware-replay scenario verified with fixtures. |
| **15** | Admin UI (Dashboard, Customers list, Customer detail, Customers token rotation), Grafana provisioning of `fleet-overview.json` + `customer-drilldown.json`, README + OPERATIONS updates for Admin deployment + customer onboarding workflow. | Full feature usable end-to-end. |
---
## 13. Sanity checks (carried over from the design conversation)
- **No central data without the central DB existing.** Whole feature is opt-in via `FleetIngest__Enabled`.
- **Customer model is unchanged.** Every customer keeps full control of their own data. We copy *roll-ups* (and raw measurements for fidelity), not authority.
- **One ingest endpoint, no fan-out.** N customers all POST to the same `/api/fleet/ingest`. Admin scales vertically first, shards horizontally later if needed (each customer's `FleetIngest__Url` can point at a specific Admin instance).