Tau.Acuvim/portal/docs/FLEET-DESIGN.md
Diseri Pearson 880525b306 Add Fleet ingest design doc (portal/docs/FLEET-DESIGN.md)
Locked design for Admin / cross-customer aggregation feature.
Implementation lands in phases 13-15.

Key decisions captured:
- Same portal binary, RunMode=Client|Admin config flag.
- Two DbContext classes (ClientDbContext + AdminDbContext) to keep
  schemas cleanly separated and migrations sane.
- Fleet ingest is opt-in (FleetIngest__Enabled=false works exactly
  as today, no data leaves customer stack).
- Push by ReceivedAt, not Time, so firmware offline-buffer replays
  are picked up automatically.
- Per-tick batch cap so a back-fill wave from one customer doesn't
  starve other customers' pushes.
- SHA-256 token hash (not bcrypt) for the high-throughput ingest
  endpoint; tokens shown once on Admin Customers page.
- Realtime continuous aggregates with wide start_offset so late
  back-fills materialize on the next refresh tick.
- No retention policy. TimescaleDB compression on chunks older than
  7 days handles long-term storage cost.
- Open seams (tariff sync, RLS, GDPR delete, dual-token rotation,
  sharding) documented with v2 extension paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:55:07 +02:00

17 KiB
Raw Blame History

Fleet ingest — design (locked)

Status: approved, ready for Phase 13 implementation.

This is the design reference for the Admin / Fleet aggregation feature. Read this before touching any code in Phases 1315.


1. Goals & non-goals

In scope (v1):

  • Mirror every customer's raw monitoring.PowerMeasurements, Sites, and Devices into a central DB on an Admin stack, with CustomerId attached.
  • Admin operator sees fleet-wide and per-customer dashboards.
  • Customer stacks keep working if Admin is unreachable; pushes resume when it comes back.
  • Same portal binary in two modes (RunMode=Client / Admin), config-selected.

Not in scope (v1):

  • Cross-customer cost / tariff aggregation (Admin shows kWh + power; per-customer instance still owns billing).
  • Branding / user sync (Admin has its own brand + users — separate fleets of humans).
  • Bidirectional sync (Admin → Customer commands, OTA, etc.).
  • Multi-region or sharded central DB.

2. Topology

[Traefik public]
    ├─ abc0001.portal.example.com  →  Client stack (portal + timescale + grafana)
    ├─ abc0002.portal.example.com  →  Client stack
    ├─ ...
    └─ admin.portal.example.com    →  Admin stack (portal + timescale + grafana)

[Client stack abc0001] --batches POST--> [Admin stack /api/fleet/ingest]
[Client stack abc0002] --batches POST--> (central TimescaleDB)
[Client stack ...]     --batches POST-->

Admin stack reuses docker-compose.prod.yml and the same image, with:

  • COMPOSE_PROJECT_NAME=admin
  • CUSTOMER_HOST=admin.portal.example.com
  • RunMode=Admin
  • Its own Postgres volume (the central fleet DB)

The whole feature is opt-in. With no Admin stack deployed, customer stacks set FleetIngest__Enabled=false and work exactly as today.


3. RunMode mechanics

Application.RunMode is Client (default) or Admin. Single binary; Program.cs branches:

Concern Client Admin
DbContext target local Postgres central Postgres
Endpoints /api/sites/*, /api/measurements/*, /api/ingest/measurements, /api/admin/sites/* mapped not mapped
Endpoints /api/fleet/*, /api/admin/customers/* not mapped mapped
FleetPushService (BackgroundService) started if FleetIngest__Enabled=true not registered
FleetIngestService not registered registered
EF migration set app + monitoring app + fleet
Identity / Branding / Users / Settings same same
Settings → Rates tab shown hidden
Sidebar nav Dashboard / Dashboards / Sites / Settings Dashboard / Dashboards / Customers / Settings

Startup guards:

  • Reject RunMode=Admin if Database:ConnectionString is empty (no auto-provision for Admin — there is no obvious local default for a fleet DB).
  • Reject RunMode=Client + FleetIngest__Enabled=true if FleetIngest__Url or FleetIngest__Token is empty.
  • Reject the wrong mode against an existing DB by sniffing for a marker table:
    • RunMode=Client against a DB containing fleet.Customers → fatal.
    • RunMode=Admin against a DB containing monitoring.PowerMeasurements → fatal.
  • Stops the "I pointed dev at the prod DB" disaster.

4. Schemas

4.1 Central DB (Admin) — fleet schema

fleet.Customers — registry

Column Type Notes
Id uuid PK Admin-generated, stable
Code varchar(50) UNIQUE e.g. ABC0001 — human handle (preserve case for display, lowercase for compose)
Name varchar(200)
TokenHash varchar(64) UNIQUE indexed SHA-256 hex of the push token
TokenIssuedAt timestamptz
TokenRotatedAt timestamptz nullable
IsActive bool If false, ingest returns 401
FirstSeenAt, LastSeenAt timestamptz nullable When their first / most-recent push landed
CreatedAt timestamptz

fleet.Sites — mirrors customer-side, identity preserved

Column Type Notes
Id uuid Customer-side Site UUID, preserved
CustomerId uuid FK→Customers
Name, Address, IsActive
LocalMunicipalityId int nullable Opaque on Admin
ReceivedAt timestamptz When Admin upserted this row
PK (CustomerId, Id)

fleet.Devices — same pattern

Column Type Notes
Id uuid Preserved
CustomerId uuid FK
SiteId uuid (CustomerId, SiteId) FK→Sites composite
Name, ExternalId, Description, IsActive
ReceivedAt timestamptz
PK (CustomerId, Id)

fleet.PowerMeasurements — hypertable, the big one

Column Type Notes
Time timestamptz Hypertable partition
CustomerId uuid
DeviceId uuid (CustomerId, DeviceId) FK→Devices
ActivePowerKw, ReactivePowerKvar, ApparentPowerKva, PowerFactor, VoltageV, FrequencyHz, EnergyImportedKwh, EnergyExportedKwh double
Source varchar(50) nullable
PK (Time, CustomerId, DeviceId) Time must be in any unique constraint on a hypertable
Indexes (CustomerId, Time DESC); (CustomerId, DeviceId, Time DESC)

fleet.IngestEvents — observability

Column Type Notes
Id uuid PK
CustomerId uuid FK
ReceivedAt timestamptz
BatchType varchar(20) sites / devices / measurements
RowsAccepted, RowsRejected int
BatchBytes int
ClientHwm varchar(50) Cursor the client claims this batch covers
TimeSpread interval nullable max(Time) - min(Time) in the batch — visible burst back-fills
Error varchar(500) nullable

4.2 Client DB additions

One new table: app.FleetPushState

Column Type Notes
ResourceType varchar(20) PK sites / devices / measurements
LastCursor timestamptz nullable For measurements: max(ReceivedAt) pushed. For sites/devices: max(UpdatedAt)
LastSyncedAt timestamptz nullable
LastError varchar(500) nullable
ConsecutiveFailures int default 0 Drives exponential backoff

Plus: add ReceivedAt timestamptz (default NOW()) to monitoring.PowerMeasurements, indexed on (ReceivedAt). See §5.1 for the rationale.

4.3 Migration assembly split

Two DbContext classes — ClientDbContext (existing AppDbContext, renamed) and AdminDbContext. Each owns its migrations folder. Only the one matching RunMode is registered with DI. Ugly-but-reliable wins over single-context-with-tooling-gymnastics.


5. Push pipeline

5.1 Late arrivals — push by ReceivedAt, not Time

The firmware (ESP32 / TTGO T-Call) buffers samples offline and replays them when connectivity returns. Late arrivals are a designed-for case, not an edge.

If we pushed by Time > LastCursor, a device offline for 6 hours would have its back-fill skipped (its samples' Time is in the past).

Fix: every row carries ReceivedAt assigned by the local DB on insert. Push selects WHERE ReceivedAt > LastCursor[measurements]. Back-fills are picked up on the next push tick after the firmware replays.

Time stays as-is for queries; nothing else changes about how the data is used.

5.2 Burst handling

When a firmware replay drops thousands of rows in one local insert, the next push tick sees a huge backlog. To avoid one customer monopolising network / Admin:

  • Per request: up to FleetIngest__BatchSize rows (default 5000) or FleetIngest__BatchMaxBytes (default 1 MB), whichever hits first.
  • Per tick: at most 3 successful batches per resource type, then yield.
  • Result: a 30k-row back-fill drains over ~5 minutes (10 ticks @ 60s) without starving sites/devices syncing.
  • Observable: fleet.IngestEvents.TimeSpread shows the max(Time) - min(Time) per batch; a wave of back-fill is obvious in Admin logs without grepping.

5.3 Cadence

  • Default push interval: 60 s.
  • Configurable: FleetIngest__IntervalSeconds.

5.4 Order of operations per tick

  1. Sites: UpdatedAt > LastCursor[sites] → POST batch (full upsert, all rows; small data).
  2. Devices: same.
  3. Measurements: ReceivedAt > LastCursor[measurements] ordered by ReceivedAt → POST batches until drained or limit.

Sites and devices first so Admin's measurement FK insert doesn't reject rows for unknown devices.

5.5 Endpoint contract

POST /api/fleet/ingest
Headers:
  X-Customer-Token: <opaque 32-byte hex>
  X-Batch-Type:     sites | devices | measurements
  X-Push-Cursor:    <ISO8601> (highest ReceivedAt/UpdatedAt in this batch)
  Content-Type:     application/json
Body:
  JSON array of rows matching the type
Response 200:
  { accepted: int, rejected: int, errors: [{ row: int, reason: string }] }
Response 401: unknown / inactive / wrong token
Response 413: batch too large
Response 429: rate-limited
Response 5xx: transient — retry

CustomerId is derived from the token server-side. Clients never send it. Prevents one customer from forging rows for another.

5.6 Failure handling

Failure Client behavior
Network timeout / 5xx Exponential backoff (1m → 2m → 5m → 10m, cap 30m); ConsecutiveFailures++
401 Stop trying for this resource; log loudly; surface in Settings → App config ("Admin link down")
200 with rejected > 0 on measurements Re-push sites/devices on next tick before retrying measurements
413 Halve BatchSize for this resource and retry
429 Honor Retry-After; otherwise back off 1 minute

Push runs on its own DI scope and its own DbContext. If the push DB query hangs, only push is blocked — never user-facing API.


6. Auth — token lifecycle

  • Token = 32 random bytes, hex-encoded (64 chars). Generated server-side via RandomNumberGenerator.
  • Stored as SHA-256 hex hash, UNIQUE indexed → ingest endpoint does a single O(log N) indexed lookup. Bcrypt is the wrong tool for a high-throughput, high-entropy machine token.
  • Shown once in the Admin UI at creation/rotation. Admin only stores the hash. Lost token → rotate.
  • Rotation: immediate cutover — new value replaces old hash. Customer ops updates .env + restarts. Brief push gap acceptable; can revisit with dual-token grace window later.
  • Customers.IsActive=false → all that customer's pushes get 401 until reactivated.

7. Continuous aggregates + retention

7.1 Aggregates (created post-migration by FleetTimescaleBootstrapper)

  • fleet_hourly_per_devicetime_bucket('1 hour', Time), CustomerId, DeviceId → avg/max/min ActivePowerKw, max(EnergyImported) - min(EnergyImported) as KwhDelta, sample count.
  • fleet_daily_per_customertime_bucket('1 day', Time), CustomerId → totals across all that customer's devices.
  • Realtime (materialized_only = false) so late back-fills appear in CA queries immediately, with the unmaterialized tail served live from the hypertable until next refresh.
  • Refresh policy:
    • fleet_hourly_per_device: every 5 min; start_offset = 30 days, end_offset = 1 hour.
    • fleet_daily_per_customer: every 1 hour; start_offset = 365 days, end_offset = 1 day.
  • Wide start_offset means firmware back-fills within those windows get materialized on the next refresh tick. TimescaleDB's invalidation log means only chunks that actually changed get re-processed.

7.2 Retention vs compression

Retention policy: NONE. All data kept indefinitely for historical / trend analysis.

Compression policy (this is what makes "forever" cheap):

  • fleet.PowerMeasurements: compress chunks older than 7 days.
  • Compression settings: segmentby = (CustomerId, DeviceId), orderby = Time DESC.
  • Typical ratio for power-monitoring data: 510×. Point queries by (CustomerId, DeviceId, Time range) stay fast because of the segmentby clustering.
  • Bootstrapper enables compression and adds the policy idempotently.

CA tables are tiny by comparison; no compression needed there.


8. Admin UI surface

Page What
Dashboard Fleet headline — customer count, active customers (push in last hour), aggregate live active power, total kWh today, last-24h lag chart
Customers Table — Code / Name / Last seen / Sites / Devices / Today kWh / Status / Actions (rotate token, disable, view)
Customer detail Sites + devices for the selected customer + Grafana embed scoped to customer_id
Dashboards Grafana embed — fleet-overview default, customer-drilldown parameterized
Settings → Branding Admin's own brand
Settings → Users Admin operator accounts
Settings → Grafana Read-only info card
Settings → App config Read-only config snapshot (RunMode visible at the top)

Hidden in Admin mode: Sites top-level (no local sites), Settings → Rates tab (no local tariffs).


9. Observability

Two questions a real operator asks, two answers:

  1. "Is customer X pushing?" — Customers page shows LastSeenAt, ConsecutiveFailures, last batch size, recent burst spread.
  2. "Why did this batch not land?"SELECT * FROM fleet.IngestEvents WHERE CustomerId = … ORDER BY ReceivedAt DESC LIMIT 50;

Plus Serilog:

  • Client: [INF] Pushed N measurements to fleet ingest (cursor=…) accepted=N rejected=0
  • Admin: [INF] Accepted N measurements from {Code} ({CustomerId}), rejected 0

10. Per-stack config

Customer .env additions:

RunMode=Client                                          # default — can omit
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<32-byte hex token from Admin Customers page>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
FleetIngest__BatchMaxBytes=1048576

FleetIngest__Enabled=false → push service doesn't start; customer runs as today, no data leaves.

Admin .env:

RunMode=Admin
COMPOSE_PROJECT_NAME=admin
CUSTOMER_HOST=admin.portal.example.com
Database__ConnectionString=Host=timescaledb;...        # central fleet DB (separate volume)
# All other vars same as Client compose.

11. Open seams (deferred, with the obvious extension paths noted)

Seam Where v2 picks it up
Tariff sync + cross-customer cost New fleet.Tariffs table, push tariff change events from customer, derived cost in fleet_daily_per_customer.
Per-customer Postgres RLS for multi-Admin-user setups Add current_setting('app.customer_filter')-based RLS policies on fleet.* tables; Admin role to customer-scope mapping in IdentityRole.CustomerId.
Bidirectional Admin → Customer commands New WebSocket or long-poll channel on customer side; gated by mutual cert or a second token.
Branding sync (for the "Admin sees customer's brand when drilling in" niceness) Push branding row from customer; Admin renders the customer's brand on Customer-detail pages.
Forward compatibility Versioned URL /v1/fleet/ingest. Admin tolerates unknown JSON fields. Strictness on response shape only.
Dual-token grace window for rotation Customers.PreviousTokenHash column; ingest accepts either for 24h after rotation.
Sharded Admin (10000+ customers) Customer's FleetIngest__Url already supports any host — point different customers at different Admin instances; aggregate at a tier above with Promscale or similar if needed.
Hard-delete / GDPR Admin Customers page → "Decommission" action: DELETE FROM fleet.* WHERE CustomerId = … cascade. Logged.

12. Phase split

Phase What End-state
13 RunMode plumbing + Fleet schema + AdminDbContext split + Customers registry + token issuance UI. No data movement. Spin up an Admin stack, register a customer, see their token. Client stack picks up RunMode=Client + push config but doesn't yet push.
14 FleetPushService (Client), /api/fleet/ingest + FleetIngestService (Admin), IngestEvents, ReceivedAt migration on Client, FleetPushState, continuous aggregates SQL, compression policy SQL. Two local stacks: one Client pushes to one Admin. Rows visible in fleet.PowerMeasurements. Late-arrival firmware-replay scenario verified with fixtures.
15 Admin UI (Dashboard, Customers list, Customer detail, Customers token rotation), Grafana provisioning of fleet-overview.json + customer-drilldown.json, README + OPERATIONS updates for Admin deployment + customer onboarding workflow. Full feature usable end-to-end.

13. Sanity checks (carried over from the design conversation)

  • No central data without the central DB existing. Whole feature is opt-in via FleetIngest__Enabled.
  • Customer model is unchanged. Every customer keeps full control of their own data. We copy roll-ups (and raw measurements for fidelity), not authority.
  • One ingest endpoint, no fan-out. N customers all POST to the same /api/fleet/ingest. Admin scales vertically first, shards horizontally later if needed (each customer's FleetIngest__Url can point at a specific Admin instance).