Tau.Acuvim/portal/docs/FLEET-DESIGN.md
Diseri Pearson b654997fc9 Fleet sync: Municipalities + Tariffs + TariffPeriods
Mirror the customer-side rate hierarchy (Municipality → Tariff →
TariffPeriod) to the central DB, scoped by CustomerId.
Per-municipality rate structure is preserved exactly as customers
configure it locally — each tariff still references its municipality
and carries its own per-period TOU rates.

Backend
- New fleet entities: FleetMunicipality, FleetTariff, FleetTariffPeriod.
  Composite PKs (CustomerId, Id) like Sites/Devices so two customers'
  rate trees never collide. FleetTariff has a real FK to
  FleetMunicipality on (CustomerId, MunicipalityId); FleetTariffPeriod
  cascades from its FleetTariff parent.
- Two new push batch types: "municipalities" (full set) and "tariffs"
  (full set with periods nested inside each tariff). Push order per
  tick: sites → devices → municipalities → tariffs → measurements.
- FleetIngestService dispatches the new types. Municipalities upsert
  by (CustomerId, Id). Tariffs run inside a single transaction per
  batch: upsert the tariff row, DELETE all periods for that tariff,
  INSERT the new set — atomic period replacement matches the
  customer-side UpsertTariffRequest semantics.
- FleetPushState gains ResourceMunicipalities + ResourceTariffs
  constants so the per-resource cursor/backoff state has slots for them.
- FleetQueryService.GetCustomerDetailAsync now includes municipalities
  and tariffs (with periods, with municipality name joined client-side
  from the lookup dict). New DTOs: FleetMunicipalityViewDto,
  FleetTariffViewDto, FleetTariffPeriodViewDto.
- AdminDbContext migration AddFleetRates creates the three tables and
  their indexes/FKs.

Frontend
- Customer detail page gains a "Tariffs (N)" tab. Each tariff renders
  as a collapsible card with its municipality, active flag, effective
  window, default rate / fixed charge / VAT, and an inline period
  table (Period name, day-of-week bitmask formatted as MTWTFSS, start,
  end, rate). Empty state when no tariffs synced yet.

Verified end-to-end on the dev host
- Created "Phase23 Test City" municipality + "Domestic TOU" tariff
  with two periods (Peak weekdays 17:00-20:00 @ 3.75, Off-Peak
  weekends 00:00-23:59 @ 1.20) on the Client.
- Within one push tick (~20s) all three rows landed on Admin:
  fleet.Municipalities (1), fleet.Tariffs (1), fleet.TariffPeriods (2).
- /api/fleet/customers/{id}/detail returns the full tree with
  municipality name resolved.

Tests: 61/61 still passing.

Design doc (FLEET-DESIGN.md §11) updated — tariff sync row marked
shipped, cross-customer cost compute flagged as the natural next step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:10:51 +02:00

18 KiB
Raw Blame History

Fleet ingest — design (locked)

Status: approved, ready for Phase 13 implementation.

This is the design reference for the Admin / Fleet aggregation feature. Read this before touching any code in Phases 1315.


1. Goals & non-goals

In scope (v1):

  • Mirror every customer's raw monitoring.PowerMeasurements, Sites, and Devices into a central DB on an Admin stack, with CustomerId attached.
  • Admin operator sees fleet-wide and per-customer dashboards.
  • Customer stacks keep working if Admin is unreachable; pushes resume when it comes back.
  • Same portal binary in two modes (RunMode=Client / Admin), config-selected.

Not in scope (v1):

  • Cross-customer cost / tariff aggregation (Admin shows kWh + power; per-customer instance still owns billing).
  • Branding / user sync (Admin has its own brand + users — separate fleets of humans).
  • Bidirectional sync (Admin → Customer commands, OTA, etc.).
  • Multi-region or sharded central DB.

2. Topology

[Traefik public]
    ├─ abc0001.portal.example.com  →  Client stack (portal + timescale + grafana)
    ├─ abc0002.portal.example.com  →  Client stack
    ├─ ...
    └─ admin.portal.example.com    →  Admin stack (portal + timescale + grafana)

[Client stack abc0001] --batches POST--> [Admin stack /api/fleet/ingest]
[Client stack abc0002] --batches POST--> (central TimescaleDB)
[Client stack ...]     --batches POST-->

Admin stack reuses docker-compose.prod.yml and the same image, with:

  • COMPOSE_PROJECT_NAME=admin
  • CUSTOMER_HOST=admin.portal.example.com
  • RunMode=Admin
  • Its own Postgres volume (the central fleet DB)

The whole feature is opt-in. With no Admin stack deployed, customer stacks set FleetIngest__Enabled=false and work exactly as today.


3. RunMode mechanics

Application.RunMode is Client (default) or Admin. Single binary; Program.cs branches:

Concern Client Admin
DbContext target local Postgres central Postgres
Endpoints /api/sites/*, /api/measurements/*, /api/ingest/measurements, /api/admin/sites/* mapped not mapped
Endpoints /api/fleet/*, /api/admin/customers/* not mapped mapped
FleetPushService (BackgroundService) started if FleetIngest__Enabled=true not registered
FleetIngestService not registered registered
EF migration set app + monitoring app + fleet
Identity / Branding / Users / Settings same same
Settings → Rates tab shown hidden
Sidebar nav Dashboard / Dashboards / Sites / Settings Dashboard / Dashboards / Customers / Settings

Startup guards:

  • Reject RunMode=Admin if Database:ConnectionString is empty (no auto-provision for Admin — there is no obvious local default for a fleet DB).
  • Reject RunMode=Client + FleetIngest__Enabled=true if FleetIngest__Url or FleetIngest__Token is empty.
  • Reject the wrong mode against an existing DB by sniffing for a marker table:
    • RunMode=Client against a DB containing fleet.Customers → fatal.
    • RunMode=Admin against a DB containing monitoring.PowerMeasurements → fatal.
  • Stops the "I pointed dev at the prod DB" disaster.

4. Schemas

4.1 Central DB (Admin) — fleet schema

fleet.Customers — registry

Column Type Notes
Id uuid PK Admin-generated, stable
Code varchar(50) UNIQUE e.g. ABC0001 — human handle (preserve case for display, lowercase for compose)
Name varchar(200)
TokenHash varchar(64) UNIQUE indexed SHA-256 hex of the push token
TokenIssuedAt timestamptz
TokenRotatedAt timestamptz nullable
IsActive bool If false, ingest returns 401
FirstSeenAt, LastSeenAt timestamptz nullable When their first / most-recent push landed
CreatedAt timestamptz

fleet.Sites — mirrors customer-side, identity preserved

Column Type Notes
Id uuid Customer-side Site UUID, preserved
CustomerId uuid FK→Customers
Name, Address, IsActive
LocalMunicipalityId int nullable Opaque on Admin
ReceivedAt timestamptz When Admin upserted this row
PK (CustomerId, Id)

fleet.Devices — same pattern

Column Type Notes
Id uuid Preserved
CustomerId uuid FK
SiteId uuid (CustomerId, SiteId) FK→Sites composite
Name, ExternalId, Description, IsActive
ReceivedAt timestamptz
PK (CustomerId, Id)

fleet.PowerMeasurements — hypertable, the big one

Column Type Notes
Time timestamptz Hypertable partition
CustomerId uuid
DeviceId uuid (CustomerId, DeviceId) FK→Devices
ActivePowerKw, ReactivePowerKvar, ApparentPowerKva, PowerFactor, VoltageV, FrequencyHz, EnergyImportedKwh, EnergyExportedKwh double
Source varchar(50) nullable
PK (Time, CustomerId, DeviceId) Time must be in any unique constraint on a hypertable
Indexes (CustomerId, Time DESC); (CustomerId, DeviceId, Time DESC)

fleet.IngestEvents — observability

Column Type Notes
Id uuid PK
CustomerId uuid FK
ReceivedAt timestamptz
BatchType varchar(20) sites / devices / measurements
RowsAccepted, RowsRejected int
BatchBytes int
ClientHwm varchar(50) Cursor the client claims this batch covers
TimeSpread interval nullable max(Time) - min(Time) in the batch — visible burst back-fills
Error varchar(500) nullable

4.2 Client DB additions

One new table: app.FleetPushState

Column Type Notes
ResourceType varchar(20) PK sites / devices / measurements
LastCursor timestamptz nullable For measurements: max(ReceivedAt) pushed. For sites/devices: max(UpdatedAt)
LastSyncedAt timestamptz nullable
LastError varchar(500) nullable
ConsecutiveFailures int default 0 Drives exponential backoff

Plus: add ReceivedAt timestamptz (default NOW()) to monitoring.PowerMeasurements, indexed on (ReceivedAt). See §5.1 for the rationale.

4.3 Migration assembly split

Two DbContext classes — ClientDbContext (existing AppDbContext, renamed) and AdminDbContext. Each owns its migrations folder. Only the one matching RunMode is registered with DI. Ugly-but-reliable wins over single-context-with-tooling-gymnastics.


5. Push pipeline

5.1 Late arrivals — push by ReceivedAt, not Time

The firmware (ESP32 / TTGO T-Call) buffers samples offline and replays them when connectivity returns. Late arrivals are a designed-for case, not an edge.

If we pushed by Time > LastCursor, a device offline for 6 hours would have its back-fill skipped (its samples' Time is in the past).

Fix: every row carries ReceivedAt assigned by the local DB on insert. Push selects WHERE ReceivedAt > LastCursor[measurements]. Back-fills are picked up on the next push tick after the firmware replays.

Time stays as-is for queries; nothing else changes about how the data is used.

5.2 Burst handling

When a firmware replay drops thousands of rows in one local insert, the next push tick sees a huge backlog. To avoid one customer monopolising network / Admin:

  • Per request: up to FleetIngest__BatchSize rows (default 5000) or FleetIngest__BatchMaxBytes (default 1 MB), whichever hits first.
  • Per tick: at most 3 successful batches per resource type, then yield.
  • Result: a 30k-row back-fill drains over ~5 minutes (10 ticks @ 60s) without starving sites/devices syncing.
  • Observable: fleet.IngestEvents.TimeSpread shows the max(Time) - min(Time) per batch; a wave of back-fill is obvious in Admin logs without grepping.

5.3 Cadence

  • Default push interval: 60 s.
  • Configurable: FleetIngest__IntervalSeconds.

5.4 Order of operations per tick

  1. Sites: UpdatedAt > LastCursor[sites] → POST batch (full upsert, all rows; small data).
  2. Devices: same.
  3. Measurements: ReceivedAt > LastCursor[measurements] ordered by ReceivedAt → POST batches until drained or limit.

Sites and devices first so Admin's measurement FK insert doesn't reject rows for unknown devices.

5.5 Endpoint contract

POST /api/fleet/ingest
Headers:
  X-Customer-Token: <opaque 32-byte hex>
  X-Batch-Type:     sites | devices | measurements
  X-Push-Cursor:    <ISO8601> (highest ReceivedAt/UpdatedAt in this batch)
  Content-Type:     application/json
Body:
  JSON array of rows matching the type
Response 200:
  { accepted: int, rejected: int, errors: [{ row: int, reason: string }] }
Response 401: unknown / inactive / wrong token
Response 413: batch too large
Response 429: rate-limited
Response 5xx: transient — retry

CustomerId is derived from the token server-side. Clients never send it. Prevents one customer from forging rows for another.

5.6 Failure handling

Failure Client behavior
Network timeout / 5xx Exponential backoff (1m → 2m → 5m → 10m, cap 30m); ConsecutiveFailures++
401 Stop trying for this resource; log loudly; surface in Settings → App config ("Admin link down")
200 with rejected > 0 on measurements Re-push sites/devices on next tick before retrying measurements
413 Halve BatchSize for this resource and retry
429 Honor Retry-After; otherwise back off 1 minute

Push runs on its own DI scope and its own DbContext. If the push DB query hangs, only push is blocked — never user-facing API.


6. Auth — token lifecycle

  • Token = 32 random bytes, hex-encoded (64 chars). Generated server-side via RandomNumberGenerator.
  • Stored as SHA-256 hex hash, UNIQUE indexed → ingest endpoint does a single O(log N) indexed lookup. Bcrypt is the wrong tool for a high-throughput, high-entropy machine token.
  • Shown once in the Admin UI at creation/rotation. Admin only stores the hash. Lost token → rotate.
  • Rotation: dual-token with 24h grace window (shipped after the initial design). On rotate, the current TokenHash is copied into PreviousTokenHash with PreviousTokenExpiresAt = now + 24h, then a fresh current is generated. FindByTokenAsync matches either current OR (previous AND not expired AND IsActive). Customer ops has the full window to update .env + restart without dropping a push tick. A second rotation within the grace window overwrites the previous slot — at most one previous token is ever honoured at a time. Grace duration is CustomerService.DefaultTokenGracePeriod (parameterizable on RotateTokenAsync for tests; lift to config if you need per-customer tuning).
  • Customers.IsActive=false → all that customer's pushes get 401 until reactivated.

7. Continuous aggregates + retention

7.1 Aggregates (created post-migration by FleetTimescaleBootstrapper)

  • fleet_hourly_per_devicetime_bucket('1 hour', Time), CustomerId, DeviceId → avg/max/min ActivePowerKw, max(EnergyImported) - min(EnergyImported) as KwhDelta, sample count.
  • fleet_daily_per_customertime_bucket('1 day', Time), CustomerId → totals across all that customer's devices.
  • Realtime (materialized_only = false) so late back-fills appear in CA queries immediately, with the unmaterialized tail served live from the hypertable until next refresh.
  • Refresh policy:
    • fleet_hourly_per_device: every 5 min; start_offset = 30 days, end_offset = 1 hour.
    • fleet_daily_per_customer: every 1 hour; start_offset = 365 days, end_offset = 1 day.
  • Wide start_offset means firmware back-fills within those windows get materialized on the next refresh tick. TimescaleDB's invalidation log means only chunks that actually changed get re-processed.

7.2 Retention vs compression

Retention policy: NONE. All data kept indefinitely for historical / trend analysis.

Compression policy (this is what makes "forever" cheap):

  • fleet.PowerMeasurements: compress chunks older than 7 days.
  • Compression settings: segmentby = (CustomerId, DeviceId), orderby = Time DESC.
  • Typical ratio for power-monitoring data: 510×. Point queries by (CustomerId, DeviceId, Time range) stay fast because of the segmentby clustering.
  • Bootstrapper enables compression and adds the policy idempotently.

CA tables are tiny by comparison; no compression needed there.


8. Admin UI surface

Page What
Dashboard Fleet headline — customer count, active customers (push in last hour), aggregate live active power, total kWh today, last-24h lag chart
Customers Table — Code / Name / Last seen / Sites / Devices / Today kWh / Status / Actions (rotate token, disable, view)
Customer detail Sites + devices for the selected customer + Grafana embed scoped to customer_id
Dashboards Grafana embed — fleet-overview default, customer-drilldown parameterized
Settings → Branding Admin's own brand
Settings → Users Admin operator accounts
Settings → Grafana Read-only info card
Settings → App config Read-only config snapshot (RunMode visible at the top)

Hidden in Admin mode: Sites top-level (no local sites), Settings → Rates tab (no local tariffs).


9. Observability

Two questions a real operator asks, two answers:

  1. "Is customer X pushing?" — Customers page shows LastSeenAt, ConsecutiveFailures, last batch size, recent burst spread.
  2. "Why did this batch not land?"SELECT * FROM fleet.IngestEvents WHERE CustomerId = … ORDER BY ReceivedAt DESC LIMIT 50;

Plus Serilog:

  • Client: [INF] Pushed N measurements to fleet ingest (cursor=…) accepted=N rejected=0
  • Admin: [INF] Accepted N measurements from {Code} ({CustomerId}), rejected 0

10. Per-stack config

Customer .env additions:

RunMode=Client                                          # default — can omit
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<32-byte hex token from Admin Customers page>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
FleetIngest__BatchMaxBytes=1048576

FleetIngest__Enabled=false → push service doesn't start; customer runs as today, no data leaves.

Admin .env:

RunMode=Admin
COMPOSE_PROJECT_NAME=admin
CUSTOMER_HOST=admin.portal.example.com
Database__ConnectionString=Host=timescaledb;...        # central fleet DB (separate volume)
# All other vars same as Client compose.

11. Open seams (deferred, with the obvious extension paths noted)

Seam Where v2 picks it up
Tariff sync + cross-customer cost Tariff sync shipped. fleet.Municipalities + fleet.Tariffs + fleet.TariffPeriods mirror the customer-side rate hierarchy scoped by CustomerId. New push batch types: municipalities, tariffs (with periods nested). Per-municipality rates preserved. Cost compute on Admin side is the natural next step — a FleetCostService joining fleet.hourly_per_device × tariffs → per-customer kWh × rate.
Per-customer Postgres RLS for multi-Admin-user setups Add current_setting('app.customer_filter')-based RLS policies on fleet.* tables; Admin role to customer-scope mapping in IdentityRole.CustomerId.
Bidirectional Admin → Customer commands New WebSocket or long-poll channel on customer side; gated by mutual cert or a second token.
Branding sync (for the "Admin sees customer's brand when drilling in" niceness) Push branding row from customer; Admin renders the customer's brand on Customer-detail pages.
Forward compatibility Versioned URL /v1/fleet/ingest. Admin tolerates unknown JSON fields. Strictness on response shape only.
Dual-token grace window for rotation Shipped. See §6 — Customers.PreviousTokenHash + PreviousTokenExpiresAt; ingest accepts either while previous is unexpired. Default 24h.
Sharded Admin (10000+ customers) Customer's FleetIngest__Url already supports any host — point different customers at different Admin instances; aggregate at a tier above with Promscale or similar if needed.
Hard-delete / GDPR Admin Customers page → "Decommission" action: DELETE FROM fleet.* WHERE CustomerId = … cascade. Logged.

12. Phase split

Phase What End-state
13 RunMode plumbing + Fleet schema + AdminDbContext split + Customers registry + token issuance UI. No data movement. Spin up an Admin stack, register a customer, see their token. Client stack picks up RunMode=Client + push config but doesn't yet push.
14 FleetPushService (Client), /api/fleet/ingest + FleetIngestService (Admin), IngestEvents, ReceivedAt migration on Client, FleetPushState, continuous aggregates SQL, compression policy SQL. Two local stacks: one Client pushes to one Admin. Rows visible in fleet.PowerMeasurements. Late-arrival firmware-replay scenario verified with fixtures.
15 Admin UI (Dashboard, Customers list, Customer detail, Customers token rotation), Grafana provisioning of fleet-overview.json + customer-drilldown.json, README + OPERATIONS updates for Admin deployment + customer onboarding workflow. Full feature usable end-to-end.

13. Sanity checks (carried over from the design conversation)

  • No central data without the central DB existing. Whole feature is opt-in via FleetIngest__Enabled.
  • Customer model is unchanged. Every customer keeps full control of their own data. We copy roll-ups (and raw measurements for fidelity), not authority.
  • One ingest endpoint, no fan-out. N customers all POST to the same /api/fleet/ingest. Admin scales vertically first, shards horizontally later if needed (each customer's FleetIngest__Url can point at a specific Admin instance).