Diseri Pearson 880525b306 Add Fleet ingest design doc (portal/docs/FLEET-DESIGN.md)

Locked design for Admin / cross-customer aggregation feature.
Implementation lands in phases 13-15.

Key decisions captured:
- Same portal binary, RunMode=Client|Admin config flag.
- Two DbContext classes (ClientDbContext + AdminDbContext) to keep
  schemas cleanly separated and migrations sane.
- Fleet ingest is opt-in (FleetIngest__Enabled=false works exactly
  as today, no data leaves customer stack).
- Push by ReceivedAt, not Time, so firmware offline-buffer replays
  are picked up automatically.
- Per-tick batch cap so a back-fill wave from one customer doesn't
  starve other customers' pushes.
- SHA-256 token hash (not bcrypt) for the high-throughput ingest
  endpoint; tokens shown once on Admin Customers page.
- Realtime continuous aggregates with wide start_offset so late
  back-fills materialize on the next refresh tick.
- No retention policy. TimescaleDB compression on chunks older than
  7 days handles long-term storage cost.
- Open seams (tariff sync, RLS, GDPR delete, dual-token rotation,
  sharding) documented with v2 extension paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 09:55:07 +02:00


Fleet ingest — design (locked)

Status: approved, ready for Phase 13 implementation.

This is the design reference for the Admin / Fleet aggregation feature. Read this before touching any code in Phases 13–15.

1. Goals & non-goals

In scope (v1):

Mirror every customer's raw monitoring.PowerMeasurements, Sites, and Devices into a central DB on an Admin stack, with CustomerId attached.
Admin operator sees fleet-wide and per-customer dashboards.
Customer stacks keep working if Admin is unreachable; pushes resume when it comes back.
Same portal binary in two modes (RunMode=Client / Admin), config-selected.

Not in scope (v1):

Cross-customer cost / tariff aggregation (Admin shows kWh + power; per-customer instance still owns billing).
Branding / user sync (Admin has its own brand + users — separate fleets of humans).
Bidirectional sync (Admin → Customer commands, OTA, etc.).
Multi-region or sharded central DB.

2. Topology

[Traefik public]
    ├─ abc0001.portal.example.com  →  Client stack (portal + timescale + grafana)
    ├─ abc0002.portal.example.com  →  Client stack
    ├─ ...
    └─ admin.portal.example.com    →  Admin stack (portal + timescale + grafana)

[Client stack abc0001] --batches POST--> [Admin stack /api/fleet/ingest]
[Client stack abc0002] --batches POST--> (central TimescaleDB)
[Client stack ...]     --batches POST-->

Admin stack reuses docker-compose.prod.yml and the same image, with:

COMPOSE_PROJECT_NAME=admin
CUSTOMER_HOST=admin.portal.example.com
RunMode=Admin
Its own Postgres volume (the central fleet DB)

The whole feature is opt-in. With no Admin stack deployed, customer stacks set FleetIngest__Enabled=false and work exactly as today.

3. `RunMode` mechanics

Application.RunMode is Client (default) or Admin. Single binary; Program.cs branches:

| Concern | Client | Admin |
| --- | --- | --- |
| DbContext target | local Postgres | central Postgres |
| Endpoints `/api/sites/`, `/api/measurements/`, `/api/ingest/measurements`, `/api/admin/sites/*` | mapped | not mapped |
| Endpoints `/api/fleet/`, `/api/admin/customers/` | not mapped | mapped |
| `FleetPushService` (BackgroundService) | started if `FleetIngest__Enabled=true` | not registered |
| `FleetIngestService` | not registered | registered |
| EF migration set | `app` + `monitoring` | `app` + `fleet` |
| Identity / Branding / Users / Settings | same | same |
| Settings → Rates tab | shown | hidden |
| Sidebar nav | Dashboard / Dashboards / Sites / Settings | Dashboard / Dashboards / Customers / Settings |

Startup guards:

Reject RunMode=Admin if Database:ConnectionString is empty (no auto-provision for Admin — there is no obvious local default for a fleet DB).
Reject RunMode=Client + FleetIngest__Enabled=true if FleetIngest__Url or FleetIngest__Token is empty.
Reject the wrong mode against an existing DB by sniffing for a marker table:
- RunMode=Client against a DB containing fleet.Customers → fatal.
- RunMode=Admin against a DB containing monitoring.PowerMeasurements → fatal.
Stops the "I pointed dev at the prod DB" disaster.
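
The guards above can be sketched as a pure check over the config and the set of tables found in the target DB. This is a minimal illustration in Python, not the portal's `Program.cs`; the `StartupConfig` type, the `startup_errors` function, and the way existing tables are passed in are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StartupConfig:
    run_mode: str = "Client"       # "Client" (default) or "Admin"
    connection_string: str = ""    # Database:ConnectionString
    fleet_enabled: bool = False    # FleetIngest__Enabled
    fleet_url: str = ""            # FleetIngest__Url
    fleet_token: str = ""          # FleetIngest__Token


# Marker table whose presence means the DB belongs to the other mode.
FOREIGN_MARKER = {
    "Client": "fleet.Customers",
    "Admin": "monitoring.PowerMeasurements",
}


def startup_errors(cfg: StartupConfig, existing_tables: set) -> list:
    """Return fatal startup errors; an empty list means all guards pass."""
    errors = []
    if cfg.run_mode == "Admin" and not cfg.connection_string:
        errors.append("RunMode=Admin requires Database:ConnectionString")
    if cfg.run_mode == "Client" and cfg.fleet_enabled:
        if not cfg.fleet_url or not cfg.fleet_token:
            errors.append("FleetIngest__Enabled=true requires Url and Token")
    marker = FOREIGN_MARKER[cfg.run_mode]
    if marker in existing_tables:
        errors.append(f"RunMode={cfg.run_mode} against a DB containing {marker}")
    return errors
```

Returning a list rather than raising on the first failure lets the process log every misconfiguration before it exits.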

4. Schemas

4.1 Central DB (Admin) — `fleet` schema

fleet.Customers — registry

| Column | Type | Notes |
| --- | --- | --- |
| Id | uuid PK | Admin-generated, stable |
| Code | varchar(50) UNIQUE | e.g. `ABC0001`; human handle (preserve case for display, lowercase for compose) |
| Name | varchar(200) | |
| TokenHash | varchar(64) UNIQUE indexed | SHA-256 hex of the push token |
| TokenIssuedAt | timestamptz | |
| TokenRotatedAt | timestamptz nullable | |
| IsActive | bool | If false, ingest returns 401 |
| FirstSeenAt, LastSeenAt | timestamptz nullable | When their first / most-recent push landed |
| CreatedAt | timestamptz | |

fleet.Sites — mirrors customer-side, identity preserved

| Column | Type | Notes |
| --- | --- | --- |
| Id | uuid | Customer-side Site UUID, preserved |
| CustomerId | uuid FK→Customers | |
| Name, Address, IsActive | … | |
| LocalMunicipalityId | int nullable | Opaque on Admin |
| ReceivedAt | timestamptz | When Admin upserted this row |
| PK | (CustomerId, Id) | |

fleet.Devices — same pattern

| Column | Type | Notes |
| --- | --- | --- |
| Id | uuid | Preserved |
| CustomerId | uuid FK | |
| SiteId | uuid | (CustomerId, SiteId) FK→Sites composite |
| Name, ExternalId, Description, IsActive | … | |
| ReceivedAt | timestamptz | |
| PK | (CustomerId, Id) | |

fleet.PowerMeasurements — hypertable, the big one

| Column | Type | Notes |
| --- | --- | --- |
| Time | timestamptz | Hypertable partition |
| CustomerId | uuid | |
| DeviceId | uuid | (CustomerId, DeviceId) FK→Devices |
| ActivePowerKw, ReactivePowerKvar, ApparentPowerKva, PowerFactor, VoltageV, FrequencyHz, EnergyImportedKwh, EnergyExportedKwh | double | |
| Source | varchar(50) nullable | |
| PK | (Time, CustomerId, DeviceId) | Time must be in any unique constraint on a hypertable |
| Indexes | (CustomerId, Time DESC); (CustomerId, DeviceId, Time DESC) | |

fleet.IngestEvents — observability

| Column | Type | Notes |
| --- | --- | --- |
| Id | uuid PK | |
| CustomerId | uuid FK | |
| ReceivedAt | timestamptz | |
| BatchType | varchar(20) | `sites` / `devices` / `measurements` |
| RowsAccepted, RowsRejected | int | |
| BatchBytes | int | |
| ClientHwm | varchar(50) | Cursor the client claims this batch covers |
| TimeSpread | interval nullable | `max(Time) - min(Time)` in the batch; makes burst back-fills visible |
| Error | varchar(500) nullable | |

4.2 Client DB additions

One new table: app.FleetPushState

| Column | Type | Notes |
| --- | --- | --- |
| ResourceType | varchar(20) PK | `sites` / `devices` / `measurements` |
| LastCursor | timestamptz nullable | For measurements: max(`ReceivedAt`) pushed. For sites/devices: max(`UpdatedAt`) |
| LastSyncedAt | timestamptz nullable | |
| LastError | varchar(500) nullable | |
| ConsecutiveFailures | int default 0 | Drives exponential backoff |

Plus: add ReceivedAt timestamptz (default NOW()) to monitoring.PowerMeasurements, indexed on (ReceivedAt). See §5.1 for the rationale.

4.3 Migration assembly split

Two DbContext classes — ClientDbContext (existing AppDbContext, renamed) and AdminDbContext. Each owns its migrations folder. Only the one matching RunMode is registered with DI. Ugly-but-reliable wins over single-context-with-tooling-gymnastics.

5. Push pipeline

5.1 Late arrivals — push by `ReceivedAt`, not `Time`

The firmware (ESP32 / TTGO T-Call) buffers samples offline and replays them when connectivity returns. Late arrivals are a designed-for case, not an edge case.

If we pushed by Time > LastCursor, a device offline for 6 hours would have its back-fill skipped (its samples' Time is in the past).

Fix: every row carries ReceivedAt assigned by the local DB on insert. Push selects WHERE ReceivedAt > LastCursor[measurements]. Back-fills are picked up on the next push tick after the firmware replays.

Time stays as-is for queries; nothing else changes about how the data is used.
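
A small worked example makes the difference concrete. The sketch below is illustrative Python over an in-memory list, not the push service's query; the sample timestamps and the tuple layout are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

T0 = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)

# Each row is (Time, ReceivedAt): when the sample was taken vs. when the
# local DB inserted it.
rows = [
    (T0 - timedelta(minutes=2), T0 - timedelta(minutes=2)),    # live, already pushed
    (T0 - timedelta(hours=6), T0 + timedelta(seconds=5)),      # replayed after 6 h offline
    (T0 - timedelta(hours=5), T0 + timedelta(seconds=5)),      # replayed
    (T0 + timedelta(seconds=10), T0 + timedelta(seconds=10)),  # live, new
]

last_cursor = T0  # everything received up to T0 has been pushed

# Selecting by Time skips both replayed rows: their Time is in the past.
by_time = [r for r in rows if r[0] > last_cursor]

# Selecting by ReceivedAt picks up the replayed rows and the new live row.
by_received_at = [r for r in rows if r[1] > last_cursor]

print(len(by_time), len(by_received_at))  # 1 3
```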

5.2 Burst handling

When a firmware replay drops thousands of rows in one local insert, the next push tick sees a huge backlog. To avoid one customer monopolising network / Admin:

Per request: up to FleetIngest__BatchSize rows (default 5000) or FleetIngest__BatchMaxBytes (default 1 MB), whichever hits first.
Per tick: at most 3 successful batches per resource type, then yield.
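Result: at the default caps a 30k-row back-fill is 6 batches of 5000 rows, so it drains in 2 ticks (about 2 minutes at 60 s) when the row cap is the binding limit, and in a few more ticks when the 1 MB byte cap binds first. Either way it does not starve sites/devices syncing.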
Observable: fleet.IngestEvents.TimeSpread shows the max(Time) - min(Time) per batch; a wave of back-fill is obvious in Admin logs without grepping.
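
The two caps and the per-tick limit can be sketched as follows. This is an illustrative Python model, not the `FleetPushService` code; `take_batch` and `ticks_to_drain` are hypothetical names, the byte accounting is an approximation of the serialized JSON size, and every POST is assumed to succeed.

```python
import json

BATCH_SIZE = 5000            # FleetIngest__BatchSize default
BATCH_MAX_BYTES = 1_048_576  # FleetIngest__BatchMaxBytes default (1 MB)
MAX_BATCHES_PER_TICK = 3     # per resource type


def take_batch(pending, batch_size=BATCH_SIZE, max_bytes=BATCH_MAX_BYTES):
    """Return the longest prefix of `pending` that fits both caps (at least one row)."""
    batch, size = [], 2  # 2 bytes for the enclosing "[]"
    for row in pending[:batch_size]:
        row_bytes = len(json.dumps(row).encode("utf-8")) + 1  # +1 for a separator
        if batch and size + row_bytes > max_bytes:
            break
        batch.append(row)
        size += row_bytes
    return batch


def ticks_to_drain(pending, **caps):
    """Count push ticks needed to drain `pending` for one resource type."""
    ticks = 0
    while pending:
        ticks += 1
        for _ in range(MAX_BATCHES_PER_TICK):
            batch = take_batch(pending, **caps)
            pending = pending[len(batch):]
            if not pending:
                break
    return ticks
```

With small rows, where the row cap is the one that binds, `ticks_to_drain` on 30,000 rows returns 2 (six batches of 5000, three per tick). Wider rows that hit the byte cap first produce smaller batches and therefore more ticks.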

5.3 Cadence

Default push interval: 60 s.
Configurable: FleetIngest__IntervalSeconds.

5.4 Order of operations per tick

Sites: rows with UpdatedAt > LastCursor[sites] → POST batch (full-row upsert of every changed row; small data).
Devices: same.
Measurements: ReceivedAt > LastCursor[measurements] ordered by ReceivedAt → POST batches until drained or limit.

Sites and devices first so Admin's measurement FK insert doesn't reject rows for unknown devices.

5.5 Endpoint contract

POST /api/fleet/ingest
Headers:
  X-Customer-Token: <opaque 32-byte hex>
  X-Batch-Type:     sites | devices | measurements
  X-Push-Cursor:    <ISO8601> (highest ReceivedAt/UpdatedAt in this batch)
  Content-Type:     application/json
Body:
  JSON array of rows matching the type
Response 200:
  { accepted: int, rejected: int, errors: [{ row: int, reason: string }] }
Response 401: unknown / inactive / wrong token
Response 413: batch too large
Response 429: rate-limited
Response 5xx: transient — retry

CustomerId is derived from the token server-side. Clients never send it. Prevents one customer from forging rows for another.
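
As a hedged illustration of the contract above, the following Python sketch builds (but does not send) one ingest request using only the standard library. The function name `build_ingest_request` and the JSON field names in the usage example are assumptions; the contract only says the body is a JSON array of rows matching the batch type.

```python
import json
import urllib.request
from datetime import datetime, timezone

BATCH_TYPES = ("sites", "devices", "measurements")


def build_ingest_request(url, token, batch_type, rows, cursor):
    """Build a POST for one batch. No CustomerId is sent: Admin derives it from the token."""
    if batch_type not in BATCH_TYPES:
        raise ValueError(f"unknown batch type: {batch_type}")
    return urllib.request.Request(
        url,
        data=json.dumps(rows).encode("utf-8"),
        method="POST",
        headers={
            "X-Customer-Token": token,
            "X-Batch-Type": batch_type,
            "X-Push-Cursor": cursor.isoformat(),  # highest ReceivedAt/UpdatedAt in the batch
            "Content-Type": "application/json",
        },
    )


# Usage with placeholder values (the token and row fields are made up):
request = build_ingest_request(
    "https://admin.portal.example.com/api/fleet/ingest",
    "ab" * 32,
    "measurements",
    [{"Time": "2026-05-18T06:00:00+00:00", "ActivePowerKw": 1.5}],
    datetime(2026, 5, 18, 12, 0, 5, tzinfo=timezone.utc),
)
```

Passing the request to `urllib.request.urlopen` would send it; the sketch stops short of that so it can run without a live Admin stack.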

5.6 Failure handling

| Failure | Client behavior |
| --- | --- |
| Network timeout / 5xx | Exponential backoff (1m → 2m → 5m → 10m, cap 30m); `ConsecutiveFailures++` |
| 401 | Stop trying for this resource; log loudly; surface in Settings → App config ("Admin link down") |
| 200 with `rejected > 0` on measurements | Re-push sites/devices on next tick before retrying measurements |
| 413 | Halve `BatchSize` for this resource and retry |
| 429 | Honor `Retry-After`; otherwise back off 1 minute |

Push runs on its own DI scope and its own DbContext. If the push DB query hangs, only push is blocked — never user-facing API.
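
The failure table can be read as a small state machine per resource type. The Python sketch below is one possible reading, not the shipped client logic. Assumptions: the `PushState` fields and the `next_delay_seconds` name are hypothetical, every failure after the fourth waits the 30-minute cap, a clean 200 resets `ConsecutiveFailures`, and a 413 retries immediately.

```python
from dataclasses import dataclass

BACKOFF_MINUTES = [1, 2, 5, 10, 30]  # 1m → 2m → 5m → 10m, then capped at 30m


@dataclass
class PushState:
    consecutive_failures: int = 0
    batch_size: int = 5000
    stopped: bool = False         # set on 401 until an operator intervenes
    resync_parents: bool = False  # re-push sites/devices before measurements


def next_delay_seconds(state, status, rejected=0, retry_after=None, interval=60):
    """Update `state` for one response; return seconds to wait (None = stop)."""
    if status is None or status >= 500:  # None models a network timeout
        state.consecutive_failures += 1
        step = min(state.consecutive_failures, len(BACKOFF_MINUTES)) - 1
        return BACKOFF_MINUTES[step] * 60
    if status == 401:
        state.stopped = True
        return None
    if status == 413:
        state.batch_size = max(1, state.batch_size // 2)
        return 0
    if status == 429:
        return retry_after if retry_after is not None else 60
    if status == 200:
        state.consecutive_failures = 0
        state.resync_parents = rejected > 0
        return interval
    raise ValueError(f"status {status} is not covered by the failure table")
```

Keeping the decision in a pure function over the response status makes the backoff schedule easy to unit-test without a network.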

6. Auth — token lifecycle

Token = 32 random bytes, hex-encoded (64 chars). Generated server-side via RandomNumberGenerator.
Stored as SHA-256 hex hash, UNIQUE indexed → ingest endpoint does a single O(log N) indexed lookup. Bcrypt is the wrong tool for a high-throughput, high-entropy machine token.
Shown once in the Admin UI at creation/rotation. Admin only stores the hash. Lost token → rotate.
Rotation: immediate cutover — new value replaces old hash. Customer ops updates .env + restarts. Brief push gap acceptable; can revisit with dual-token grace window later.
Customers.IsActive=false → all that customer's pushes get 401 until reactivated.
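
The token lifecycle above, sketched in Python for illustration (the portal itself would use .NET's `RandomNumberGenerator`). Assumptions: the hash is taken over the hex string as transmitted rather than the raw bytes, a dict keyed by hash stands in for the UNIQUE index on `TokenHash`, and the function names are hypothetical.

```python
import hashlib
import secrets


def hash_token(token):
    """SHA-256 hex digest of the token string (64 hex chars)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_token():
    """Return (plaintext_token, token_hash). Only the hash is stored."""
    token = secrets.token_hex(32)  # 32 random bytes, hex-encoded to 64 chars
    return token, hash_token(token)


def authenticate(token, customers_by_hash):
    """Map a presented token to a customer Id, or None (which becomes a 401)."""
    customer = customers_by_hash.get(hash_token(token))
    if customer is None or not customer["IsActive"]:
        return None
    return customer["Id"]


# Usage: issue a token, register the hash, then authenticate a push.
token, token_hash = issue_token()
customers = {token_hash: {"Id": "customer-1", "IsActive": True}}
print(authenticate(token, customers))  # customer-1
```

Because the token carries 256 bits of entropy, a fast unsalted hash is enough here; the slow, salted hashing that protects low-entropy passwords would only add latency to every ingest call.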

7. Continuous aggregates + retention

7.1 Aggregates (created post-migration by `FleetTimescaleBootstrapper`)

fleet_hourly_per_device: time_bucket('1 hour', Time), CustomerId, DeviceId → avg/max/min ActivePowerKw, max(EnergyImportedKwh) - min(EnergyImportedKwh) as KwhDelta, sample count.
fleet_daily_per_customer: time_bucket('1 day', Time), CustomerId → totals across all that customer's devices.
Realtime (materialized_only = false), so rows newer than the last materialized bucket are served live from the hypertable. Late back-fills that land in already-materialized buckets show up once the next refresh re-materializes those buckets.
Refresh policy:
- fleet_hourly_per_device: every 5 min; start_offset = 30 days, end_offset = 1 hour.
- fleet_daily_per_customer: every 1 hour; start_offset = 365 days, end_offset = 1 day.
Wide start_offset means firmware back-fills within those windows get materialized on the next refresh tick. TimescaleDB's invalidation log means only chunks that actually changed get re-processed.
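
To show what the hourly aggregate computes, and why a late back-fill changes an existing bucket, here is a plain-Python model of the `fleet_hourly_per_device` roll-up. It is not the continuous-aggregate SQL; the output key names (`AvgKw`, `MaxKw`, `MinKw`, `Samples`) are assumptions, and bucketing is done in UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_per_device(rows):
    """rows: (time, customer_id, device_id, active_power_kw, energy_imported_kwh)."""
    buckets = defaultdict(list)
    for time, customer_id, device_id, power, energy in rows:
        # Equivalent of time_bucket('1 hour', Time) for UTC timestamps.
        bucket = time.replace(minute=0, second=0, microsecond=0)
        buckets[(bucket, customer_id, device_id)].append((power, energy))
    result = {}
    for key, samples in buckets.items():
        powers = [p for p, _ in samples]
        energies = [e for _, e in samples]
        result[key] = {
            "AvgKw": sum(powers) / len(powers),
            "MaxKw": max(powers),
            "MinKw": min(powers),
            "KwhDelta": max(energies) - min(energies),
            "Samples": len(samples),
        }
    return result


hour = datetime(2026, 5, 18, 6, 0, tzinfo=timezone.utc)
live = [
    (hour.replace(minute=5), "c1", "d1", 2.0, 100.0),
    (hour.replace(minute=35), "c1", "d1", 6.0, 101.0),
]
# A replayed sample for the same hour arrives later and changes the bucket.
backfilled = live + [(hour.replace(minute=50), "c1", "d1", 4.0, 101.5)]

print(hourly_per_device(live)[(hour, "c1", "d1")]["KwhDelta"])        # 1.0
print(hourly_per_device(backfilled)[(hour, "c1", "d1")]["KwhDelta"])  # 1.5
```

The second result differs from the first for a bucket that was already complete, which is the case the wide start_offset is there to cover.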

7.2 Retention vs compression

Retention policy: NONE. All data kept indefinitely for historical / trend analysis.

Compression policy (this is what makes "forever" cheap):

fleet.PowerMeasurements: compress chunks older than 7 days.
Compression settings: segmentby = (CustomerId, DeviceId), orderby = Time DESC.
Typical ratio for power-monitoring data: 5–10×. Point queries by (CustomerId, DeviceId, Time range) stay fast because of the segmentby clustering.
Bootstrapper enables compression and adds the policy idempotently.

CA tables are tiny by comparison; no compression needed there.

8. Admin UI surface

| Page | What |
| --- | --- |
| Dashboard | Fleet headline: customer count, active customers (push in last hour), aggregate live active power, total kWh today, last-24h lag chart |
| Customers | Table: Code / Name / Last seen / Sites / Devices / Today kWh / Status / Actions (rotate token, disable, view) |
| Customer detail | Sites + devices for the selected customer + Grafana embed scoped to `customer_id` |
| Dashboards | Grafana embed: `fleet-overview` default, `customer-drilldown` parameterized |
| Settings → Branding | Admin's own brand |
| Settings → Users | Admin operator accounts |
| Settings → Grafana | Read-only info card |
| Settings → App config | Read-only config snapshot (RunMode visible at the top) |

Hidden in Admin mode: Sites top-level (no local sites), Settings → Rates tab (no local tariffs).

9. Observability

Two questions a real operator asks, two answers:

"Is customer X pushing?" — Customers page shows LastSeenAt, ConsecutiveFailures, last batch size, recent burst spread.
"Why did this batch not land?" — SELECT * FROM fleet.IngestEvents WHERE CustomerId = … ORDER BY ReceivedAt DESC LIMIT 50;

Plus Serilog:

Client: [INF] Pushed N measurements to fleet ingest (cursor=…) accepted=N rejected=0
Admin: [INF] Accepted N measurements from {Code} ({CustomerId}), rejected 0

10. Per-stack config

Customer .env additions:

RunMode=Client                                          # default — can omit
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<32-byte hex token from Admin Customers page>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
FleetIngest__BatchMaxBytes=1048576

FleetIngest__Enabled=false → push service doesn't start; customer runs as today, no data leaves.

Admin .env:

RunMode=Admin
COMPOSE_PROJECT_NAME=admin
CUSTOMER_HOST=admin.portal.example.com
Database__ConnectionString=Host=timescaledb;...        # central fleet DB (separate volume)
# All other vars same as Client compose.

11. Open seams (deferred, with the obvious extension paths noted)

| Seam | Where v2 picks it up |
| --- | --- |
| Tariff sync + cross-customer cost | New `fleet.Tariffs` table, push tariff change events from customer, derived cost in `fleet_daily_per_customer`. |
| Per-customer Postgres RLS for multi-Admin-user setups | Add `current_setting('app.customer_filter')`-based RLS policies on `fleet.*` tables; Admin role to customer-scope mapping in `IdentityRole.CustomerId`. |
| Bidirectional Admin → Customer commands | New WebSocket or long-poll channel on customer side; gated by mutual cert or a second token. |
| Branding sync (for the "Admin sees customer's brand when drilling in" niceness) | Push branding row from customer; Admin renders the customer's brand on Customer-detail pages. |
| Forward compatibility | Versioned URL `/v1/fleet/ingest`. Admin tolerates unknown JSON fields. Strictness on response shape only. |
| Dual-token grace window for rotation | `Customers.PreviousTokenHash` column; ingest accepts either for 24h after rotation. |
| Sharded Admin (10000+ customers) | Customer's `FleetIngest__Url` already supports any host, so different customers can point at different Admin instances; aggregate at a tier above with Promscale or similar if needed. |
| Hard-delete / GDPR | Admin Customers page → "Decommission" action: `DELETE FROM fleet.* WHERE CustomerId = …` cascade. Logged. |

12. Phase split

| Phase | What | End-state |
| --- | --- | --- |
| 13 | `RunMode` plumbing + Fleet schema + `AdminDbContext` split + Customers registry + token issuance UI. No data movement. | Spin up an Admin stack, register a customer, see their token. Client stack picks up `RunMode=Client` + push config but doesn't yet push. |
| 14 | `FleetPushService` (Client), `/api/fleet/ingest` + `FleetIngestService` (Admin), `IngestEvents`, `ReceivedAt` migration on Client, `FleetPushState`, continuous aggregates SQL, compression policy SQL. | Two local stacks: one Client pushes to one Admin. Rows visible in `fleet.PowerMeasurements`. Late-arrival firmware-replay scenario verified with fixtures. |
| 15 | Admin UI (Dashboard, Customers list, Customer detail, Customers token rotation), Grafana provisioning of `fleet-overview.json` + `customer-drilldown.json`, README + OPERATIONS updates for Admin deployment + customer onboarding workflow. | Full feature usable end-to-end. |

13. Sanity checks (carried over from the design conversation)

No central data without the central DB existing. Whole feature is opt-in via FleetIngest__Enabled.
Customer model is unchanged. Every customer keeps full control of their own data. We copy data (raw measurements plus site and device metadata), not authority.
One ingest endpoint, no fan-out. N customers all POST to the same /api/fleet/ingest. Admin scales vertically first, shards horizontally later if needed (each customer's FleetIngest__Url can point at a specific Admin instance).
