Token rotation used to be an immediate cutover, which left a push gap from the moment ops rotated until the customer's .env was updated and the portal restarted. Now the old token keeps working for 24h after rotation, so customer ops has a full workday to swap it in without dropping a single push tick.

Backend
- Customer entity gains PreviousTokenHash + PreviousTokenExpiresAt (both nullable). Non-unique index on PreviousTokenHash so the OR-lookup in FindByTokenAsync stays cheap.
- CustomerService.RotateTokenAsync(id, graceWindow=null, ct): copies the existing TokenHash into PreviousTokenHash with PreviousTokenExpiresAt = now + graceWindow (default 24h, lifted to CustomerService.DefaultTokenGracePeriod), then issues a new current token. A second rotation overwrites the previous slot, so at most one previous token is ever honoured.
- CustomerService.FindByTokenAsync matches either current OR (previous AND PreviousTokenExpiresAt > now). IsActive=false still rejects both.
- DTO exposes PreviousTokenExpiresAt so the UI can render the grace window status.
- New EF migration AddPreviousTokenGraceWindow on AdminDbContext.

Frontend
- Customers table "Token" column shows an "Old token valid until …" orange tag with a tooltip whenever the grace window is active, plus the issue/rotation date as before.
- TokenShownOnceModal mentions the 24h grace window so ops knows they have time to update .env without urgency.
- Rotate-token popconfirm copy updated to reflect the new behavior.

Tests (+5, 61/61 passing)
- CustomerTokenGraceTests covers: create doesn't set previous; rotate moves current into previous slot with future expiry; zero grace window rejects original immediately; second rotation overwrites previous (original dies, first-rotation becomes the new previous); inactive customer rejects both current AND previous.

Verified end-to-end on the dev host
- Migration applied cleanly on the existing admin_fleet DB (existing DEV0001 customer got NULL previous columns, no data loss).
- Created GRACE01 → got token1.
- Rotated → got token2. PreviousTokenExpiresAt = +24h. Both token1 and token2 push successfully (200).
- Rotated again → got token3. token1 push now returns 401 (gone). token2 push still 200 (now the previous). token3 push 200 (current).

Docs
- FLEET-DESIGN.md §6 rewritten; no longer "immediate cutover".
- §11 "open seams" row for this feature marked as shipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

345 lines
17 KiB
Markdown
# Fleet ingest: design (locked)

Status: **approved**, ready for Phase 13 implementation.

This is the design reference for the Admin / Fleet aggregation feature. Read this before touching any code in Phases 13–15.

---

## 1. Goals & non-goals

**In scope (v1):**

- Mirror every customer's raw `monitoring.PowerMeasurements`, `Sites`, and `Devices` into a central DB on an Admin stack, with `CustomerId` attached.
- Admin operator sees fleet-wide and per-customer dashboards.
- Customer stacks keep working if Admin is unreachable; pushes resume when it comes back.
- Same portal binary in two modes (`RunMode=Client` / `Admin`), config-selected.

**Not in scope (v1):**

- Cross-customer cost / tariff aggregation (Admin shows kWh + power; per-customer instance still owns billing).
- Branding / user sync (Admin has its own brand and users; these are separate fleets of humans).
- Bidirectional sync (Admin → Customer commands, OTA, etc.).
- Multi-region or sharded central DB.

---

## 2. Topology

```
[Traefik public]
  ├─ abc0001.portal.example.com  → Client stack (portal + timescale + grafana)
  ├─ abc0002.portal.example.com  → Client stack
  ├─ ...
  └─ admin.portal.example.com    → Admin stack (portal + timescale + grafana)

[Client stack abc0001] --batches POST--> [Admin stack /api/fleet/ingest]
[Client stack abc0002] --batches POST-->   (central TimescaleDB)
[Client stack ...]     --batches POST-->
```

Admin stack reuses `docker-compose.prod.yml` and the same image, with:

- `COMPOSE_PROJECT_NAME=admin`
- `CUSTOMER_HOST=admin.portal.example.com`
- `RunMode=Admin`
- Its own Postgres volume (the central fleet DB)

The whole feature is opt-in. With no Admin stack deployed, customer stacks set `FleetIngest__Enabled=false` and work exactly as today.

---

## 3. `RunMode` mechanics

`Application.RunMode` is `Client` (default) or `Admin`. Single binary; Program.cs branches:

| Concern | Client | Admin |
|---|---|---|
| DbContext target | local Postgres | central Postgres |
| Endpoints `/api/sites/*`, `/api/measurements/*`, `/api/ingest/measurements`, `/api/admin/sites/*` | mapped | **not mapped** |
| Endpoints `/api/fleet/*`, `/api/admin/customers/*` | not mapped | mapped |
| `FleetPushService` (BackgroundService) | started if `FleetIngest__Enabled=true` | not registered |
| `FleetIngestService` | not registered | registered |
| EF migration set | `app` + `monitoring` | `app` + `fleet` |
| Identity / Branding / Users / Settings | same | same |
| Settings → Rates tab | shown | hidden |
| Sidebar nav | Dashboard / Dashboards / Sites / Settings | Dashboard / Dashboards / **Customers** / Settings |

**Startup guards:**

- Reject `RunMode=Admin` if `Database:ConnectionString` is empty (no auto-provision for Admin, because there is no obvious local default for a fleet DB).
- Reject `RunMode=Client + FleetIngest__Enabled=true` if `FleetIngest__Url` or `FleetIngest__Token` is empty.
- Reject the wrong mode against an existing DB by sniffing for a marker table:
  - `RunMode=Client` against a DB containing `fleet.Customers` → fatal.
  - `RunMode=Admin` against a DB containing `monitoring.PowerMeasurements` → fatal.
  - Stops the "I pointed dev at the prod DB" disaster.

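The three startup guards above can be sketched as a single validation pass. This is an illustrative Python sketch only (the portal itself is a .NET binary): `validate_startup`, its parameters, and the `existing_tables` set standing in for a database lookup are hypothetical names, not the real Program.cs code.

```python
def validate_startup(run_mode, connection_string="", fleet_enabled=False,
                     fleet_url="", fleet_token="", existing_tables=frozenset()):
    """Return a list of fatal startup errors (an empty list means OK).

    Hypothetical helper mirroring the startup guards described above.
    `existing_tables` stands in for a query against the target database.
    """
    errors = []
    if run_mode == "Admin" and not connection_string:
        errors.append("RunMode=Admin requires Database:ConnectionString")
    if run_mode == "Client" and fleet_enabled and not (fleet_url and fleet_token):
        errors.append("FleetIngest__Url and FleetIngest__Token are required")
    # Marker-table sniff: each mode refuses a DB that belongs to the other.
    if run_mode == "Client" and "fleet.Customers" in existing_tables:
        errors.append("Client mode pointed at a fleet (Admin) database")
    if run_mode == "Admin" and "monitoring.PowerMeasurements" in existing_tables:
        errors.append("Admin mode pointed at a customer database")
    return errors
```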
---
## 4. Schemas

### 4.1 Central DB (Admin): `fleet` schema

**`fleet.Customers`**: registry

| Column | Type | Notes |
|---|---|---|
| Id | uuid PK | Admin-generated, stable |
| Code | varchar(50) UNIQUE | e.g. `ABC0001`; human handle (preserve case for display, lowercase for compose) |
| Name | varchar(200) | |
| TokenHash | varchar(64) UNIQUE indexed | SHA-256 hex of the push token |
| TokenIssuedAt | timestamptz | |
| TokenRotatedAt | timestamptz nullable | |
| IsActive | bool | If false, ingest returns 401 |
| FirstSeenAt, LastSeenAt | timestamptz nullable | When their first / most-recent push landed |
| CreatedAt | timestamptz | |

**`fleet.Sites`**: mirrors customer-side, identity preserved

| Column | Type | Notes |
|---|---|---|
| Id | uuid | Customer-side Site UUID, preserved |
| CustomerId | uuid FK→Customers | |
| Name, Address, IsActive | … | |
| LocalMunicipalityId | int nullable | Opaque on Admin |
| ReceivedAt | timestamptz | When Admin upserted this row |
| **PK** | (CustomerId, Id) | |

**`fleet.Devices`**: same pattern

| Column | Type | Notes |
|---|---|---|
| Id | uuid | Preserved |
| CustomerId | uuid FK | |
| SiteId | uuid | (CustomerId, SiteId) FK→Sites composite |
| Name, ExternalId, Description, IsActive | … | |
| ReceivedAt | timestamptz | |
| **PK** | (CustomerId, Id) | |

**`fleet.PowerMeasurements`**: hypertable, the big one

| Column | Type | Notes |
|---|---|---|
| Time | timestamptz | Hypertable partition |
| CustomerId | uuid | |
| DeviceId | uuid | (CustomerId, DeviceId) FK→Devices |
| ActivePowerKw, ReactivePowerKvar, ApparentPowerKva, PowerFactor, VoltageV, FrequencyHz, EnergyImportedKwh, EnergyExportedKwh | double | |
| Source | varchar(50) nullable | |
| **PK** | (Time, CustomerId, DeviceId) | Time must be in any unique constraint on a hypertable |
| **Indexes** | (CustomerId, Time DESC); (CustomerId, DeviceId, Time DESC) | |

**`fleet.IngestEvents`**: observability

| Column | Type | Notes |
|---|---|---|
| Id | uuid PK | |
| CustomerId | uuid FK | |
| ReceivedAt | timestamptz | |
| BatchType | varchar(20) | `sites` / `devices` / `measurements` |
| RowsAccepted, RowsRejected | int | |
| BatchBytes | int | |
| ClientHwm | varchar(50) | Cursor the client claims this batch covers |
| TimeSpread | interval nullable | `max(Time) - min(Time)` in the batch; makes burst back-fills visible |
| Error | varchar(500) nullable | |

### 4.2 Client DB additions

One new table: **`app.FleetPushState`**

| Column | Type | Notes |
|---|---|---|
| ResourceType | varchar(20) PK | `sites` / `devices` / `measurements` |
| LastCursor | timestamptz nullable | For measurements: max(`ReceivedAt`) pushed. For sites/devices: max(`UpdatedAt`) |
| LastSyncedAt | timestamptz nullable | |
| LastError | varchar(500) nullable | |
| ConsecutiveFailures | int default 0 | Drives exponential backoff |

Plus: **add `ReceivedAt timestamptz` (default `NOW()`) to `monitoring.PowerMeasurements`**, indexed on `(ReceivedAt)`. See §5.1 for the rationale.

### 4.3 Migration assembly split

Two `DbContext` classes: `ClientDbContext` (existing `AppDbContext`, renamed) and `AdminDbContext`. Each owns its migrations folder. Only the one matching `RunMode` is registered with DI. Ugly-but-reliable wins over single-context-with-tooling-gymnastics.

---

## 5. Push pipeline

### 5.1 Late arrivals: push by `ReceivedAt`, not `Time`

The firmware (ESP32 / TTGO T-Call) **buffers samples offline** and replays them when connectivity returns. Late arrivals are a designed-for case, not an edge case.

If we pushed by `Time > LastCursor`, a device offline for 6 hours would have its back-fill skipped (its samples' `Time` is in the past).

Fix: every row carries `ReceivedAt` assigned by the local DB on insert. Push selects `WHERE ReceivedAt > LastCursor[measurements]`. Back-fills are picked up on the next push tick after the firmware replays.

`Time` stays as-is for queries; nothing else changes about how the data is used.

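A small illustration of why the cursor has to follow `ReceivedAt`: the sketch below (plain Python with in-memory rows; `rows_to_push` is a hypothetical helper, not the real push query) compares both selection rules on one live sample and one sample replayed after six hours offline.

```python
from datetime import datetime, timedelta, timezone

def rows_to_push(rows, last_cursor, key):
    """Select rows whose `key` timestamp is strictly after the cursor."""
    return [r for r in rows if r[key] > last_cursor]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_cursor = now - timedelta(minutes=1)   # position after the last successful push

rows = [
    # Live sample: measured and received just now.
    {"Time": now, "ReceivedAt": now},
    # Replayed sample: measured 6 hours ago, but only received just now.
    {"Time": now - timedelta(hours=6), "ReceivedAt": now},
]

by_time = rows_to_push(rows, last_cursor, "Time")            # skips the replayed row
by_received = rows_to_push(rows, last_cursor, "ReceivedAt")  # selects both rows
```

Selecting by `Time` returns only the live sample, while selecting by `ReceivedAt` returns both, which is the behaviour the design relies on.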
### 5.2 Burst handling

When a firmware replay drops thousands of rows in one local insert, the next push tick sees a huge backlog. To avoid one customer monopolising network / Admin:

- Per request: up to `FleetIngest__BatchSize` rows (default 5000) **or** `FleetIngest__BatchMaxBytes` (default 1 MB), whichever hits first.
- Per tick: at most 3 successful batches per resource type, then yield.
- Result: a 30k-row back-fill is at least 6 batches, so it drains over 2 or more ticks (about 2 minutes at 60 s when the row cap is the binding limit, longer when the 1 MB byte cap splits batches further) without starving sites/devices syncing.
- Observable: `fleet.IngestEvents.TimeSpread` shows the `max(Time) - min(Time)` per batch; a wave of back-fill is obvious in Admin logs without grepping.

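The per-request and per-tick limits above can be sketched as a batch planner. This is a minimal Python sketch under stated assumptions: `plan_batches` is a hypothetical name, and row size is approximated by the compact JSON encoding of each row, which may differ from how the real client measures `BatchMaxBytes`.

```python
import json

def plan_batches(rows, batch_size=5000, batch_max_bytes=1_048_576,
                 max_batches_per_tick=3):
    """Split `rows` into the batches for one push tick.

    A batch closes when it holds `batch_size` rows or when adding the next
    row would exceed `batch_max_bytes`. At most `max_batches_per_tick`
    batches are returned; the remaining rows wait for the next tick.
    """
    batches, current, current_bytes = [], [], 0
    consumed = 0
    for row in rows:
        size = len(json.dumps(row, separators=(",", ":")).encode("utf-8"))
        if current and (len(current) >= batch_size
                        or current_bytes + size > batch_max_bytes):
            batches.append(current)
            consumed += len(current)
            current, current_bytes = [], 0
            if len(batches) == max_batches_per_tick:
                break  # per-tick limit reached; yield to other resource types
        current.append(row)
        current_bytes += size
    else:
        # Loop finished without hitting the per-tick limit: flush the tail.
        if current:
            batches.append(current)
            consumed += len(current)
    return batches, rows[consumed:]
```

With the defaults and small rows, 30,000 rows produce three 5,000-row batches on the first tick and leave 15,000 rows for the next one.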
### 5.3 Cadence

- Default push interval: 60 s.
- Configurable: `FleetIngest__IntervalSeconds`.

### 5.4 Order of operations per tick

1. **Sites**: `UpdatedAt > LastCursor[sites]` → POST batch (full upsert, all rows; small data).
2. **Devices**: same.
3. **Measurements**: `ReceivedAt > LastCursor[measurements]` ordered by `ReceivedAt` → POST batches until drained or limit.

Sites and devices first so Admin's measurement FK insert doesn't reject rows for unknown devices.

### 5.5 Endpoint contract

```
POST /api/fleet/ingest
Headers:
  X-Customer-Token: <opaque 32-byte hex>
  X-Batch-Type: sites | devices | measurements
  X-Push-Cursor: <ISO8601>   (highest ReceivedAt/UpdatedAt in this batch)
  Content-Type: application/json
Body:
  JSON array of rows matching the type
Response 200:
  { accepted: int, rejected: int, errors: [{ row: int, reason: string }] }
Response 401: unknown / inactive / wrong token
Response 413: batch too large
Response 429: rate-limited
Response 5xx: transient, retry
```

CustomerId is derived from the token server-side. Clients never send it. Prevents one customer from forging rows for another.

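To make the contract concrete, here is a hedged Python sketch that builds (but does not send) one ingest request with the standard library. `build_ingest_request` and `cursor_key` are hypothetical names, and the sketch assumes timestamps are ISO 8601 UTC strings in one uniform format, so that the lexicographic maximum is also the latest timestamp.

```python
import json
import urllib.request

def build_ingest_request(url, token, batch_type, rows, cursor_key):
    """Build a POST request for one batch without sending it.

    `cursor_key` is "ReceivedAt" for measurements and "UpdatedAt" for
    sites/devices. No CustomerId is sent: the server derives it from the token.
    """
    if batch_type not in ("sites", "devices", "measurements"):
        raise ValueError(f"unknown batch type: {batch_type}")
    if not rows:
        raise ValueError("refusing to build an empty batch")
    cursor = max(row[cursor_key] for row in rows)  # highest timestamp in batch
    return urllib.request.Request(
        url,
        data=json.dumps(rows).encode("utf-8"),
        method="POST",
        headers={
            "X-Customer-Token": token,
            "X-Batch-Type": batch_type,
            "X-Push-Cursor": cursor,
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the actual push; the status codes listed above then decide what the client does next.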
### 5.6 Failure handling

| Failure | Client behavior |
|---|---|
| Network timeout / 5xx | Exponential backoff (1m → 2m → 5m → 10m, cap 30m); `ConsecutiveFailures++` |
| 401 | Stop trying for this resource; log loudly; surface in **Settings → App config** ("Admin link down") |
| 200 with `rejected > 0` on measurements | Re-push sites/devices on next tick before retrying measurements |
| 413 | Halve `BatchSize` for this resource and retry |
| 429 | Honor `Retry-After`; otherwise back off 1 minute |

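The retry rules in the table can be sketched as a small decision function. This Python sketch uses hypothetical names (`next_delay_seconds`, `shrink_batch_size`), and it assumes that after the listed 1m/2m/5m/10m steps the delay goes straight to the 30-minute cap, which the table does not spell out.

```python
BACKOFF_MINUTES = [1, 2, 5, 10]   # schedule from the failure table
BACKOFF_CAP_MINUTES = 30

def next_delay_seconds(status, consecutive_failures, retry_after=None):
    """Return seconds to wait before the next attempt, or None to stop.

    `status` is the HTTP status code, or None for a network timeout.
    `consecutive_failures` counts failures including the current one.
    """
    if status == 401:
        return None                       # stop pushing until the token is fixed
    if status == 429:
        return retry_after if retry_after is not None else 60
    if status is None or status >= 500:
        index = max(consecutive_failures, 1) - 1
        if index < len(BACKOFF_MINUTES):
            return BACKOFF_MINUTES[index] * 60
        return BACKOFF_CAP_MINUTES * 60   # assumption: jump to the cap afterwards
    return 0                              # success, or handled elsewhere (e.g. 413)

def shrink_batch_size(batch_size):
    """413 handling: halve the batch size, never going below one row."""
    return max(1, batch_size // 2)
```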
Push runs on its own DI scope and its own DbContext. If the push DB query hangs, only push is blocked, never the user-facing API.

---

## 6. Auth: token lifecycle

- Token = 32 random bytes, hex-encoded (64 chars). Generated server-side via `RandomNumberGenerator`.
- Stored as **SHA-256 hex hash**, UNIQUE indexed → ingest endpoint does a single O(log N) indexed lookup. Bcrypt is the wrong tool for a high-throughput, high-entropy machine token.
- Shown **once** in the Admin UI at creation/rotation. Admin only stores the hash. Lost token → rotate.
- Rotation: **dual-token with 24h grace window** (shipped after the initial design). On rotate, the current `TokenHash` is copied into `PreviousTokenHash` with `PreviousTokenExpiresAt = now + 24h`, then a fresh current token is generated. `FindByTokenAsync` matches either the current token OR the previous token while `PreviousTokenExpiresAt > now`; `IsActive=false` rejects both. Customer ops has the full window to update `.env` and restart without dropping a push tick. A second rotation within the grace window overwrites the previous slot, so at most one previous token is honoured at a time. Grace duration is `CustomerService.DefaultTokenGracePeriod` (parameterizable on `RotateTokenAsync` for tests; lift to config if you need per-customer tuning).
- `Customers.IsActive=false` → all that customer's pushes get 401 until reactivated.

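The lifecycle above can be modelled in a few lines. The following Python sketch is an illustration, not the `CustomerService` implementation: the dict-based customer record and the function names `new_customer`, `rotate_token`, and `token_matches` are hypothetical, while the field names follow the columns named in this section. It checks one customer record at a time instead of doing the indexed lookup the real endpoint uses.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

DEFAULT_GRACE = timedelta(hours=24)

def hash_token(token):
    """SHA-256 hex digest of the token; only this value is stored."""
    return hashlib.sha256(token.encode("ascii")).hexdigest()

def new_customer(now):
    """Create a customer record and return it with the plaintext token."""
    token = secrets.token_hex(32)   # 32 random bytes, 64 hex chars
    customer = {"TokenHash": hash_token(token), "TokenIssuedAt": now,
                "PreviousTokenHash": None, "PreviousTokenExpiresAt": None,
                "IsActive": True}
    return customer, token

def rotate_token(customer, now, grace=DEFAULT_GRACE):
    """Move the current hash into the previous slot and issue a new token."""
    customer["PreviousTokenHash"] = customer["TokenHash"]
    customer["PreviousTokenExpiresAt"] = now + grace
    token = secrets.token_hex(32)
    customer["TokenHash"] = hash_token(token)
    return token

def token_matches(customer, token, now):
    """True if the token is current, or previous and still inside the grace window."""
    if not customer["IsActive"]:
        return False                 # inactive customers reject both tokens
    digest = hash_token(token)
    if secrets.compare_digest(digest, customer["TokenHash"]):
        return True
    previous = customer["PreviousTokenHash"]
    return (previous is not None
            and secrets.compare_digest(digest, previous)
            and customer["PreviousTokenExpiresAt"] > now)
```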
---
## 7. Continuous aggregates + retention

### 7.1 Aggregates (created post-migration by `FleetTimescaleBootstrapper`)

- `fleet_hourly_per_device`: `time_bucket('1 hour', Time)`, `CustomerId`, `DeviceId` → avg/max/min `ActivePowerKw`, `max(EnergyImportedKwh) - min(EnergyImportedKwh)` as `KwhDelta`, sample count.
- `fleet_daily_per_customer`: `time_bucket('1 day', Time)`, `CustomerId` → totals across all that customer's devices.
- **Realtime** (`materialized_only = false`) so late back-fills appear in CA queries immediately, with the unmaterialized tail served live from the hypertable until next refresh.
- Refresh policy:
  - `fleet_hourly_per_device`: every 5 min; `start_offset = 30 days`, `end_offset = 1 hour`.
  - `fleet_daily_per_customer`: every 1 hour; `start_offset = 365 days`, `end_offset = 1 day`.
- Wide `start_offset` means firmware back-fills within those windows get materialized on the next refresh tick. TimescaleDB's invalidation log means only chunks that actually changed get re-processed.

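As a rough picture of what `fleet_hourly_per_device` computes, the Python sketch below groups in-memory rows by hour, customer, and device. It only illustrates the arithmetic; the real aggregate is a TimescaleDB continuous aggregate, and the output keys other than `KwhDelta` (`AvgKw`, `MaxKw`, `MinKw`, `Samples`) are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime, timezone

def hourly_per_device(rows):
    """Group rows by (hour bucket, CustomerId, DeviceId) and summarise each group."""
    groups = defaultdict(list)
    for r in rows:
        # Truncate the timestamp to the start of its hour (the bucket).
        bucket = r["Time"].replace(minute=0, second=0, microsecond=0)
        groups[(bucket, r["CustomerId"], r["DeviceId"])].append(r)
    summary = {}
    for key, group in groups.items():
        power = [r["ActivePowerKw"] for r in group]
        energy = [r["EnergyImportedKwh"] for r in group]
        summary[key] = {
            "AvgKw": sum(power) / len(power),
            "MaxKw": max(power),
            "MinKw": min(power),
            "KwhDelta": max(energy) - min(energy),  # energy imported within the hour
            "Samples": len(group),
        }
    return summary
```

Because the energy counter is cumulative, the difference between the largest and smallest reading in a bucket gives the kWh imported during that hour, provided the counter does not reset inside the bucket.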
### 7.2 Retention vs compression

**Retention policy: NONE.** All data kept indefinitely for historical / trend analysis.

**Compression policy** (this is what makes "forever" cheap):

- `fleet.PowerMeasurements`: compress chunks older than **7 days**.
- Compression settings: `segmentby = (CustomerId, DeviceId)`, `orderby = Time DESC`.
- Typical ratio for power-monitoring data: **5–10×**. Point queries by `(CustomerId, DeviceId, Time range)` stay fast because of the segmentby clustering.
- Bootstrapper enables compression and adds the policy idempotently.

CA tables are tiny by comparison; no compression needed there.

---

## 8. Admin UI surface

| Page | What |
|---|---|
| Dashboard | Fleet headline: customer count, active customers (push in last hour), aggregate live active power, total kWh today, last-24h lag chart |
| Customers | Table: Code / Name / Last seen / Sites / Devices / Today kWh / Status / Actions (rotate token, disable, view) |
| Customer detail | Sites + devices for the selected customer + Grafana embed scoped to `customer_id` |
| Dashboards | Grafana embed: `fleet-overview` default, `customer-drilldown` parameterized |
| Settings → Branding | Admin's own brand |
| Settings → Users | Admin operator accounts |
| Settings → Grafana | Read-only info card |
| Settings → App config | Read-only config snapshot (RunMode visible at the top) |

Hidden in Admin mode: **Sites** top-level (no local sites), **Settings → Rates** tab (no local tariffs).

---

## 9. Observability

Two questions a real operator asks, two answers:

1. **"Is customer X pushing?"**: Customers page shows `LastSeenAt`, `ConsecutiveFailures`, last batch size, recent burst spread.
2. **"Why did this batch not land?"**: `SELECT * FROM fleet.IngestEvents WHERE CustomerId = … ORDER BY ReceivedAt DESC LIMIT 50;`

Plus Serilog:

- Client: `[INF] Pushed N measurements to fleet ingest (cursor=…) accepted=N rejected=0`
- Admin: `[INF] Accepted N measurements from {Code} ({CustomerId}), rejected 0`

---

## 10. Per-stack config

**Customer `.env` additions:**

```
RunMode=Client                      # default, can omit
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<32-byte hex token from Admin Customers page>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
FleetIngest__BatchMaxBytes=1048576
```

`FleetIngest__Enabled=false` → push service doesn't start; customer runs as today, no data leaves.

**Admin `.env`:**

```
RunMode=Admin
COMPOSE_PROJECT_NAME=admin
CUSTOMER_HOST=admin.portal.example.com
Database__ConnectionString=Host=timescaledb;...   # central fleet DB (separate volume)
# All other vars same as Client compose.
```

---

## 11. Open seams (deferred, with the obvious extension paths noted)

| Seam | Where v2 picks it up |
|---|---|
| Tariff sync + cross-customer cost | New `fleet.Tariffs` table, push tariff change events from customer, derived cost in `fleet_daily_per_customer`. |
| Per-customer Postgres RLS for multi-Admin-user setups | Add `current_setting('app.customer_filter')`-based RLS policies on `fleet.*` tables; Admin role to customer-scope mapping in `IdentityRole.CustomerId`. |
| Bidirectional Admin → Customer commands | New WebSocket or long-poll channel on customer side; gated by mutual cert or a second token. |
| Branding sync (for the "Admin sees customer's brand when drilling in" niceness) | Push branding row from customer; Admin renders the customer's brand on Customer-detail pages. |
| Forward compatibility | Versioned URL `/v1/fleet/ingest`. Admin tolerates unknown JSON fields. Strictness on response shape only. |
| ~~Dual-token grace window for rotation~~ | **Shipped.** See §6: `Customers.PreviousTokenHash` + `PreviousTokenExpiresAt`; ingest accepts either while previous is unexpired. Default 24h. |
| Sharded Admin (10000+ customers) | Customer's `FleetIngest__Url` already supports any host; point different customers at different Admin instances and aggregate at a tier above with Promscale or similar if needed. |
| Hard-delete / GDPR | Admin Customers page → "Decommission" action: `DELETE FROM fleet.* WHERE CustomerId = …` cascade. Logged. |

---

## 12. Phase split

| Phase | What | End-state |
|---|---|---|
| **13** | `RunMode` plumbing + Fleet schema + `AdminDbContext` split + Customers registry + token issuance UI. No data movement. | Spin up an Admin stack, register a customer, see their token. Client stack picks up `RunMode=Client` + push config but doesn't yet push. |
| **14** | `FleetPushService` (Client), `/api/fleet/ingest` + `FleetIngestService` (Admin), `IngestEvents`, `ReceivedAt` migration on Client, `FleetPushState`, continuous aggregates SQL, compression policy SQL. | Two local stacks: one Client pushes to one Admin. Rows visible in `fleet.PowerMeasurements`. Late-arrival firmware-replay scenario verified with fixtures. |
| **15** | Admin UI (Dashboard, Customers list, Customer detail, Customers token rotation), Grafana provisioning of `fleet-overview.json` + `customer-drilldown.json`, README + OPERATIONS updates for Admin deployment + customer onboarding workflow. | Full feature usable end-to-end. |

---

## 13. Sanity checks (carried over from the design conversation)

- **No central data without the central DB existing.** Whole feature is opt-in via `FleetIngest__Enabled`.
- **Customer model is unchanged.** Every customer keeps full control of their own data. We copy *roll-ups* (and raw measurements for fidelity), not authority.
- **One ingest endpoint, no fan-out.** N customers all POST to the same `/api/fleet/ingest`. Admin scales vertically first, shards horizontally later if needed (each customer's `FleetIngest__Url` can point at a specific Admin instance).