Tau.Acuvim/portal/docs/FLEET-DESIGN.md
Diseri Pearson b654997fc9 Fleet sync: Municipalities + Tariffs + TariffPeriods
Mirror the customer-side rate hierarchy (Municipality → Tariff →
TariffPeriod) to the central DB, scoped by CustomerId.
Per-municipality rate structure is preserved exactly as customers
configure it locally — each tariff still references its municipality
and carries its own per-period TOU rates.

Backend
- New fleet entities: FleetMunicipality, FleetTariff, FleetTariffPeriod.
  Composite PKs (CustomerId, Id) like Sites/Devices so two customers'
  rate trees never collide. FleetTariff has a real FK to
  FleetMunicipality on (CustomerId, MunicipalityId); FleetTariffPeriod
  cascades from its FleetTariff parent.
- Two new push batch types: "municipalities" (full set) and "tariffs"
  (full set with periods nested inside each tariff). Push order per
  tick: sites → devices → municipalities → tariffs → measurements.
- FleetIngestService dispatches the new types. Municipalities upsert
  by (CustomerId, Id). Tariffs run inside a single transaction per
  batch: upsert the tariff row, DELETE all periods for that tariff,
  INSERT the new set — atomic period replacement matches the
  customer-side UpsertTariffRequest semantics.
- FleetPushState gains ResourceMunicipalities + ResourceTariffs
  constants so the per-resource cursor/backoff state has slots for them.
- FleetQueryService.GetCustomerDetailAsync now includes municipalities
  and tariffs (with periods, with municipality name joined client-side
  from the lookup dict). New DTOs: FleetMunicipalityViewDto,
  FleetTariffViewDto, FleetTariffPeriodViewDto.
- AdminDbContext migration AddFleetRates creates the three tables and
  their indexes/FKs.

Frontend
- Customer detail page gains a "Tariffs (N)" tab. Each tariff renders
  as a collapsible card with its municipality, active flag, effective
  window, default rate / fixed charge / VAT, and an inline period
  table (Period name, day-of-week bitmask formatted as MTWTFSS, start,
  end, rate). Empty state when no tariffs synced yet.

Verified end-to-end on the dev host
- Created "Phase23 Test City" municipality + "Domestic TOU" tariff
  with two periods (Peak weekdays 17:00-20:00 @ 3.75, Off-Peak
  weekends 00:00-23:59 @ 1.20) on the Client.
- Within one push tick (~20s) all three rows landed on Admin:
  fleet.Municipalities (1), fleet.Tariffs (1), fleet.TariffPeriods (2).
- /api/fleet/customers/{id}/detail returns the full tree with
  municipality name resolved.

Tests: 61/61 still passing.

Design doc (FLEET-DESIGN.md §11) updated — tariff sync row marked
shipped, cross-customer cost compute flagged as the natural next step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:10:51 +02:00

345 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Fleet ingest — design (locked)
Status: **approved**, ready for Phase 13 implementation.
This is the design reference for the Admin / Fleet aggregation feature. Read this before touching any code in Phases 1315.
---
## 1. Goals & non-goals
**In scope (v1):**
- Mirror every customer's raw `monitoring.PowerMeasurements`, `Sites`, and `Devices` into a central DB on an Admin stack, with `CustomerId` attached.
- Admin operator sees fleet-wide and per-customer dashboards.
- Customer stacks keep working if Admin is unreachable; pushes resume when it comes back.
- Same portal binary in two modes (`RunMode=Client` / `Admin`), config-selected.
**Not in scope (v1):**
- Cross-customer cost / tariff aggregation (Admin shows kWh + power; per-customer instance still owns billing).
- Branding / user sync (Admin has its own brand + users — separate fleets of humans).
- Bidirectional sync (Admin → Customer commands, OTA, etc.).
- Multi-region or sharded central DB.
---
## 2. Topology
```
[Traefik public]
├─ abc0001.portal.example.com → Client stack (portal + timescale + grafana)
├─ abc0002.portal.example.com → Client stack
├─ ...
└─ admin.portal.example.com → Admin stack (portal + timescale + grafana)
[Client stack abc0001] --batches POST--> [Admin stack /api/fleet/ingest]
[Client stack abc0002] --batches POST--> (central TimescaleDB)
[Client stack ...] --batches POST-->
```
Admin stack reuses `docker-compose.prod.yml` and the same image, with:
- `COMPOSE_PROJECT_NAME=admin`
- `CUSTOMER_HOST=admin.portal.example.com`
- `RunMode=Admin`
- Its own Postgres volume (the central fleet DB)
The whole feature is opt-in. With no Admin stack deployed, customer stacks set `FleetIngest__Enabled=false` and work exactly as today.
---
## 3. `RunMode` mechanics
`Application.RunMode` is `Client` (default) or `Admin`. Single binary; Program.cs branches:
| Concern | Client | Admin |
|---|---|---|
| DbContext target | local Postgres | central Postgres |
| Endpoints `/api/sites/*`, `/api/measurements/*`, `/api/ingest/measurements`, `/api/admin/sites/*` | mapped | **not mapped** |
| Endpoints `/api/fleet/*`, `/api/admin/customers/*` | not mapped | mapped |
| `FleetPushService` (BackgroundService) | started if `FleetIngest__Enabled=true` | not registered |
| `FleetIngestService` | not registered | registered |
| EF migration set | `app` + `monitoring` | `app` + `fleet` |
| Identity / Branding / Users / Settings | same | same |
| Settings → Rates tab | shown | hidden |
| Sidebar nav | Dashboard / Dashboards / Sites / Settings | Dashboard / Dashboards / **Customers** / Settings |
**Startup guards:**
- Reject `RunMode=Admin` if `Database:ConnectionString` is empty (no auto-provision for Admin — there is no obvious local default for a fleet DB).
- Reject `RunMode=Client + FleetIngest__Enabled=true` if `FleetIngest__Url` or `FleetIngest__Token` is empty.
- Reject the wrong mode against an existing DB by sniffing for a marker table:
- `RunMode=Client` against a DB containing `fleet.Customers` → fatal.
- `RunMode=Admin` against a DB containing `monitoring.PowerMeasurements` → fatal.
- Stops the "I pointed dev at the prod DB" disaster.
---
## 4. Schemas
### 4.1 Central DB (Admin) — `fleet` schema
**`fleet.Customers`** — registry
| Column | Type | Notes |
|---|---|---|
| Id | uuid PK | Admin-generated, stable |
| Code | varchar(50) UNIQUE | e.g. `ABC0001` — human handle (preserve case for display, lowercase for compose) |
| Name | varchar(200) | |
| TokenHash | varchar(64) UNIQUE indexed | SHA-256 hex of the push token |
| TokenIssuedAt | timestamptz | |
| TokenRotatedAt | timestamptz nullable | |
| IsActive | bool | If false, ingest returns 401 |
| FirstSeenAt, LastSeenAt | timestamptz nullable | When their first / most-recent push landed |
| CreatedAt | timestamptz | |
**`fleet.Sites`** — mirrors customer-side, identity preserved
| Column | Type | Notes |
|---|---|---|
| Id | uuid | Customer-side Site UUID, preserved |
| CustomerId | uuid FK→Customers | |
| Name, Address, IsActive | … | |
| LocalMunicipalityId | int nullable | Opaque on Admin |
| ReceivedAt | timestamptz | When Admin upserted this row |
| **PK** | (CustomerId, Id) | |
**`fleet.Devices`** — same pattern
| Column | Type | Notes |
|---|---|---|
| Id | uuid | Preserved |
| CustomerId | uuid FK | |
| SiteId | uuid | (CustomerId, SiteId) FK→Sites composite |
| Name, ExternalId, Description, IsActive | … | |
| ReceivedAt | timestamptz | |
| **PK** | (CustomerId, Id) | |
**`fleet.PowerMeasurements`** — hypertable, the big one
| Column | Type | Notes |
|---|---|---|
| Time | timestamptz | Hypertable partition |
| CustomerId | uuid | |
| DeviceId | uuid | (CustomerId, DeviceId) FK→Devices |
| ActivePowerKw, ReactivePowerKvar, ApparentPowerKva, PowerFactor, VoltageV, FrequencyHz, EnergyImportedKwh, EnergyExportedKwh | double | |
| Source | varchar(50) nullable | |
| **PK** | (Time, CustomerId, DeviceId) | Time must be in any unique constraint on a hypertable |
| **Indexes** | (CustomerId, Time DESC); (CustomerId, DeviceId, Time DESC) | |
**`fleet.IngestEvents`** — observability
| Column | Type | Notes |
|---|---|---|
| Id | uuid PK | |
| CustomerId | uuid FK | |
| ReceivedAt | timestamptz | |
| BatchType | varchar(20) | `sites` / `devices` / `measurements` |
| RowsAccepted, RowsRejected | int | |
| BatchBytes | int | |
| ClientHwm | varchar(50) | Cursor the client claims this batch covers |
| TimeSpread | interval nullable | `max(Time) - min(Time)` in the batch — visible burst back-fills |
| Error | varchar(500) nullable | |
### 4.2 Client DB additions
One new table: **`app.FleetPushState`**
| Column | Type | Notes |
|---|---|---|
| ResourceType | varchar(20) PK | `sites` / `devices` / `measurements` |
| LastCursor | timestamptz nullable | For measurements: max(`ReceivedAt`) pushed. For sites/devices: max(`UpdatedAt`) |
| LastSyncedAt | timestamptz nullable | |
| LastError | varchar(500) nullable | |
| ConsecutiveFailures | int default 0 | Drives exponential backoff |
Plus: **add `ReceivedAt timestamptz` (default `NOW()`) to `monitoring.PowerMeasurements`**, indexed on `(ReceivedAt)`. See §5.1 for the rationale.
### 4.3 Migration assembly split
Two `DbContext` classes — `ClientDbContext` (existing `AppDbContext`, renamed) and `AdminDbContext`. Each owns its migrations folder. Only the one matching `RunMode` is registered with DI. Ugly-but-reliable wins over single-context-with-tooling-gymnastics.
---
## 5. Push pipeline
### 5.1 Late arrivals — push by `ReceivedAt`, not `Time`
The firmware (ESP32 / TTGO T-Call) **buffers samples offline** and replays them when connectivity returns. Late arrivals are a designed-for case, not an edge.
If we pushed by `Time > LastCursor`, a device offline for 6 hours would have its back-fill skipped (its samples' `Time` is in the past).
Fix: every row carries `ReceivedAt` assigned by the local DB on insert. Push selects `WHERE ReceivedAt > LastCursor[measurements]`. Back-fills are picked up on the next push tick after the firmware replays.
`Time` stays as-is for queries; nothing else changes about how the data is used.
### 5.2 Burst handling
When a firmware replay drops thousands of rows in one local insert, the next push tick sees a huge backlog. To avoid one customer monopolising network / Admin:
- Per request: up to `FleetIngest__BatchSize` rows (default 5000) **or** `FleetIngest__BatchMaxBytes` (default 1 MB), whichever hits first.
- Per tick: at most 3 successful batches per resource type, then yield.
- Result: a 30k-row back-fill drains over ~5 minutes (10 ticks @ 60s) without starving sites/devices syncing.
- Observable: `fleet.IngestEvents.TimeSpread` shows the `max(Time) - min(Time)` per batch; a wave of back-fill is obvious in Admin logs without grepping.
### 5.3 Cadence
- Default push interval: 60 s.
- Configurable: `FleetIngest__IntervalSeconds`.
### 5.4 Order of operations per tick
1. **Sites**: `UpdatedAt > LastCursor[sites]` → POST batch (full upsert, all rows; small data).
2. **Devices**: same.
3. **Measurements**: `ReceivedAt > LastCursor[measurements]` ordered by `ReceivedAt` → POST batches until drained or limit.
Sites and devices first so Admin's measurement FK insert doesn't reject rows for unknown devices.
### 5.5 Endpoint contract
```
POST /api/fleet/ingest
Headers:
X-Customer-Token: <opaque 32-byte hex>
X-Batch-Type: sites | devices | measurements
X-Push-Cursor: <ISO8601> (highest ReceivedAt/UpdatedAt in this batch)
Content-Type: application/json
Body:
JSON array of rows matching the type
Response 200:
{ accepted: int, rejected: int, errors: [{ row: int, reason: string }] }
Response 401: unknown / inactive / wrong token
Response 413: batch too large
Response 429: rate-limited
Response 5xx: transient — retry
```
CustomerId is derived from the token server-side. Clients never send it. Prevents one customer from forging rows for another.
### 5.6 Failure handling
| Failure | Client behavior |
|---|---|
| Network timeout / 5xx | Exponential backoff (1m → 2m → 5m → 10m, cap 30m); `ConsecutiveFailures++` |
| 401 | Stop trying for this resource; log loudly; surface in **Settings → App config** ("Admin link down") |
| 200 with `rejected > 0` on measurements | Re-push sites/devices on next tick before retrying measurements |
| 413 | Halve `BatchSize` for this resource and retry |
| 429 | Honor `Retry-After`; otherwise back off 1 minute |
Push runs on its own DI scope and its own DbContext. If the push DB query hangs, only push is blocked — never user-facing API.
---
## 6. Auth — token lifecycle
- Token = 32 random bytes, hex-encoded (64 chars). Generated server-side via `RandomNumberGenerator`.
- Stored as **SHA-256 hex hash**, UNIQUE indexed → ingest endpoint does a single O(log N) indexed lookup. Bcrypt is the wrong tool for a high-throughput, high-entropy machine token.
- Shown **once** in the Admin UI at creation/rotation. Admin only stores the hash. Lost token → rotate.
- Rotation: **dual-token with 24h grace window** (shipped after the initial design). On rotate, the current `TokenHash` is copied into `PreviousTokenHash` with `PreviousTokenExpiresAt = now + 24h`, then a fresh current is generated. `FindByTokenAsync` matches either current OR (previous AND not expired AND `IsActive`). Customer ops has the full window to update `.env` + restart without dropping a push tick. A second rotation within the grace window overwrites the previous slot — at most one previous token is ever honoured at a time. Grace duration is `CustomerService.DefaultTokenGracePeriod` (parameterizable on `RotateTokenAsync` for tests; lift to config if you need per-customer tuning).
- `Customers.IsActive=false` → all that customer's pushes get 401 until reactivated.
---
## 7. Continuous aggregates + retention
### 7.1 Aggregates (created post-migration by `FleetTimescaleBootstrapper`)
- `fleet_hourly_per_device``time_bucket('1 hour', Time)`, `CustomerId`, `DeviceId` → avg/max/min `ActivePowerKw`, `max(EnergyImported) - min(EnergyImported)` as `KwhDelta`, sample count.
- `fleet_daily_per_customer``time_bucket('1 day', Time)`, `CustomerId` → totals across all that customer's devices.
- **Realtime** (`materialized_only = false`) so late back-fills appear in CA queries immediately, with the unmaterialized tail served live from the hypertable until next refresh.
- Refresh policy:
- `fleet_hourly_per_device`: every 5 min; `start_offset = 30 days`, `end_offset = 1 hour`.
- `fleet_daily_per_customer`: every 1 hour; `start_offset = 365 days`, `end_offset = 1 day`.
- Wide `start_offset` means firmware back-fills within those windows get materialized on the next refresh tick. TimescaleDB's invalidation log means only chunks that actually changed get re-processed.
### 7.2 Retention vs compression
**Retention policy: NONE.** All data kept indefinitely for historical / trend analysis.
**Compression policy** (this is what makes "forever" cheap):
- `fleet.PowerMeasurements`: compress chunks older than **7 days**.
- Compression settings: `segmentby = (CustomerId, DeviceId)`, `orderby = Time DESC`.
- Typical ratio for power-monitoring data: **510×**. Point queries by `(CustomerId, DeviceId, Time range)` stay fast because of the segmentby clustering.
- Bootstrapper enables compression and adds the policy idempotently.
CA tables are tiny by comparison; no compression needed there.
---
## 8. Admin UI surface
| Page | What |
|---|---|
| Dashboard | Fleet headline — customer count, active customers (push in last hour), aggregate live active power, total kWh today, last-24h lag chart |
| Customers | Table — Code / Name / Last seen / Sites / Devices / Today kWh / Status / Actions (rotate token, disable, view) |
| Customer detail | Sites + devices for the selected customer + Grafana embed scoped to `customer_id` |
| Dashboards | Grafana embed — `fleet-overview` default, `customer-drilldown` parameterized |
| Settings → Branding | Admin's own brand |
| Settings → Users | Admin operator accounts |
| Settings → Grafana | Read-only info card |
| Settings → App config | Read-only config snapshot (RunMode visible at the top) |
Hidden in Admin mode: **Sites** top-level (no local sites), **Settings → Rates** tab (no local tariffs).
---
## 9. Observability
Two questions a real operator asks, two answers:
1. **"Is customer X pushing?"** — Customers page shows `LastSeenAt`, `ConsecutiveFailures`, last batch size, recent burst spread.
2. **"Why did this batch not land?"** — `SELECT * FROM fleet.IngestEvents WHERE CustomerId = … ORDER BY ReceivedAt DESC LIMIT 50;`
Plus Serilog:
- Client: `[INF] Pushed N measurements to fleet ingest (cursor=…) accepted=N rejected=0`
- Admin: `[INF] Accepted N measurements from {Code} ({CustomerId}), rejected 0`
---
## 10. Per-stack config
**Customer `.env` additions:**
```
RunMode=Client # default — can omit
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<32-byte hex token from Admin Customers page>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
FleetIngest__BatchMaxBytes=1048576
```
`FleetIngest__Enabled=false` → push service doesn't start; customer runs as today, no data leaves.
**Admin `.env`:**
```
RunMode=Admin
COMPOSE_PROJECT_NAME=admin
CUSTOMER_HOST=admin.portal.example.com
Database__ConnectionString=Host=timescaledb;... # central fleet DB (separate volume)
# All other vars same as Client compose.
```
---
## 11. Open seams (deferred, with the obvious extension paths noted)
| Seam | Where v2 picks it up |
|---|---|
| ~~Tariff sync~~ + cross-customer cost | **Tariff sync shipped.** `fleet.Municipalities` + `fleet.Tariffs` + `fleet.TariffPeriods` mirror the customer-side rate hierarchy scoped by `CustomerId`. New push batch types: `municipalities`, `tariffs` (with periods nested). Per-municipality rates preserved. **Cost compute on Admin side is the natural next step** — a `FleetCostService` joining `fleet.hourly_per_device` × tariffs → per-customer kWh × rate. |
| Per-customer Postgres RLS for multi-Admin-user setups | Add `current_setting('app.customer_filter')`-based RLS policies on `fleet.*` tables; Admin role to customer-scope mapping in `IdentityRole.CustomerId`. |
| Bidirectional Admin → Customer commands | New WebSocket or long-poll channel on customer side; gated by mutual cert or a second token. |
| Branding sync (for the "Admin sees customer's brand when drilling in" niceness) | Push branding row from customer; Admin renders the customer's brand on Customer-detail pages. |
| Forward compatibility | Versioned URL `/v1/fleet/ingest`. Admin tolerates unknown JSON fields. Strictness on response shape only. |
| ~~Dual-token grace window for rotation~~ | **Shipped.** See §6 — `Customers.PreviousTokenHash` + `PreviousTokenExpiresAt`; ingest accepts either while previous is unexpired. Default 24h. |
| Sharded Admin (10000+ customers) | Customer's `FleetIngest__Url` already supports any host — point different customers at different Admin instances; aggregate at a tier above with Promscale or similar if needed. |
| Hard-delete / GDPR | Admin Customers page → "Decommission" action: `DELETE FROM fleet.* WHERE CustomerId = …` cascade. Logged. |
---
## 12. Phase split
| Phase | What | End-state |
|---|---|---|
| **13** | `RunMode` plumbing + Fleet schema + `AdminDbContext` split + Customers registry + token issuance UI. No data movement. | Spin up an Admin stack, register a customer, see their token. Client stack picks up `RunMode=Client` + push config but doesn't yet push. |
| **14** | `FleetPushService` (Client), `/api/fleet/ingest` + `FleetIngestService` (Admin), `IngestEvents`, `ReceivedAt` migration on Client, `FleetPushState`, continuous aggregates SQL, compression policy SQL. | Two local stacks: one Client pushes to one Admin. Rows visible in `fleet.PowerMeasurements`. Late-arrival firmware-replay scenario verified with fixtures. |
| **15** | Admin UI (Dashboard, Customers list, Customer detail, Customers token rotation), Grafana provisioning of `fleet-overview.json` + `customer-drilldown.json`, README + OPERATIONS updates for Admin deployment + customer onboarding workflow. | Full feature usable end-to-end. |
---
## 13. Sanity checks (carried over from the design conversation)
- **No central data without the central DB existing.** Whole feature is opt-in via `FleetIngest__Enabled`.
- **Customer model is unchanged.** Every customer keeps full control of their own data. We copy *roll-ups* (and raw measurements for fidelity), not authority.
- **One ingest endpoint, no fan-out.** N customers all POST to the same `/api/fleet/ingest`. Admin scales vertically first, shards horizontally later if needed (each customer's `FleetIngest__Url` can point at a specific Admin instance).