Tau.Acuvim/portal/docs/FLEET-DESIGN.md
Diseri Pearson 7627306800 Cross-customer cost compute on the Admin side
Per-customer cost over a UTC time range, joining the fleet hierarchy:
hourly_per_device CA × Devices × Sites × Municipalities × active Tariff
× period (in the municipality's local time) → per-bucket kWh × rate.
Per-day rollup + per-device breakdown. VAT applied per bucket so cost
stays correct across mid-window tariff changes.

Backend
- FleetCostService.ComputeAsync(customerId, fromUtc, toUtc):
  - Loads device→site→municipality mapping (small per customer).
  - Loads tariffs (with periods) grouped by municipality, ordered by
    EffectiveFrom desc for active-tariff lookup.
  - Reads hourly buckets in range from fleet.hourly_per_device via
    raw ADO (the CA isn't an EF entity).
  - For each bucket: pick active tariff for the bucket's date, convert
    bucket start to local time via Municipality.TimeZoneId, pick the
    matching period (or default rate), compute base + VAT.
  - Rolls up per UTC date + per device. Tracks BucketsWithoutTariff
    when site has no muni or no tariff covers the bucket date.
- New DTOs: FleetCostDto / FleetCostDayDto / FleetCostDeviceDto.
- Endpoint: GET /api/fleet/customers/{id}/cost?from&to. AdminOnly.
  Validation: to > from, range <= 366 days (BadRequest on violation).
- Period-selection helper duplicated from CostCalculator (5 lines;
  generic abstraction across TariffPeriod / FleetTariffPeriod is
  more code than the duplication).
- Fixed monthly charges deliberately NOT applied (whole-month billing
  concept; FixedMonthlyChargesIncluded=false in the response).

Frontend
- AdminCustomerDetailPage gets a Cost tab:
  - RangePicker with quick ranges (Today, Last 7d, Last 30d, This
    month). Default last 7 days.
  - 4 Statistic cards: total kWh, base cost, VAT, total.
  - Warning alerts: when buckets-without-tariff > 0; always-on info
    that fixed monthly charges aren't included.
  - Per-day table + per-device table.

Verified end-to-end on the running stack
- Patched DEV0001's existing site to LocalMunicipalityId=1
  (Phase 23 test municipality with Domestic TOU tariff).
- Ingested 3 measurements at 10:00 / 10:20 / 10:40 UTC with kWh
  totals 2000 / 2020 / 2050 → hourly CA bucket has delta = 50 kWh.
- Total kWh in window = 56 (50 from new bucket + 6 from earlier
  Phase 14 backfill bucket).
- Tariff resolution: 10:00-12:00 UTC = 12:00-14:00 SAST, which is
  neither Peak (17:00-20:00 weekdays) nor Off-Peak (weekends only)
  → defaults to 2.50/kWh.
- 56 × 2.50 = 140.00 base ✓
- 140 × 0.15 = 21.00 VAT ✓
- Total 161.00 ✓ exactly what the API returned.

Docs: FLEET-DESIGN.md §11 row updated — tariff sync + cost compute
both marked as shipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:45:44 +02:00

345 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Fleet ingest — design (locked)
Status: **approved**, ready for Phase 13 implementation.
This is the design reference for the Admin / Fleet aggregation feature. Read this before touching any code in Phases 1315.
---
## 1. Goals & non-goals
**In scope (v1):**
- Mirror every customer's raw `monitoring.PowerMeasurements`, `Sites`, and `Devices` into a central DB on an Admin stack, with `CustomerId` attached.
- Admin operator sees fleet-wide and per-customer dashboards.
- Customer stacks keep working if Admin is unreachable; pushes resume when it comes back.
- Same portal binary in two modes (`RunMode=Client` / `Admin`), config-selected.
**Not in scope (v1):**
- Cross-customer cost / tariff aggregation (Admin shows kWh + power; per-customer instance still owns billing).
- Branding / user sync (Admin has its own brand + users — separate fleets of humans).
- Bidirectional sync (Admin → Customer commands, OTA, etc.).
- Multi-region or sharded central DB.
---
## 2. Topology
```
[Traefik public]
├─ abc0001.portal.example.com → Client stack (portal + timescale + grafana)
├─ abc0002.portal.example.com → Client stack
├─ ...
└─ admin.portal.example.com → Admin stack (portal + timescale + grafana)
[Client stack abc0001] --batches POST--> [Admin stack /api/fleet/ingest]
[Client stack abc0002] --batches POST--> (central TimescaleDB)
[Client stack ...] --batches POST-->
```
Admin stack reuses `docker-compose.prod.yml` and the same image, with:
- `COMPOSE_PROJECT_NAME=admin`
- `CUSTOMER_HOST=admin.portal.example.com`
- `RunMode=Admin`
- Its own Postgres volume (the central fleet DB)
The whole feature is opt-in. With no Admin stack deployed, customer stacks set `FleetIngest__Enabled=false` and work exactly as today.
---
## 3. `RunMode` mechanics
`Application.RunMode` is `Client` (default) or `Admin`. Single binary; Program.cs branches:
| Concern | Client | Admin |
|---|---|---|
| DbContext target | local Postgres | central Postgres |
| Endpoints `/api/sites/*`, `/api/measurements/*`, `/api/ingest/measurements`, `/api/admin/sites/*` | mapped | **not mapped** |
| Endpoints `/api/fleet/*`, `/api/admin/customers/*` | not mapped | mapped |
| `FleetPushService` (BackgroundService) | started if `FleetIngest__Enabled=true` | not registered |
| `FleetIngestService` | not registered | registered |
| EF migration set | `app` + `monitoring` | `app` + `fleet` |
| Identity / Branding / Users / Settings | same | same |
| Settings → Rates tab | shown | hidden |
| Sidebar nav | Dashboard / Dashboards / Sites / Settings | Dashboard / Dashboards / **Customers** / Settings |
**Startup guards:**
- Reject `RunMode=Admin` if `Database:ConnectionString` is empty (no auto-provision for Admin — there is no obvious local default for a fleet DB).
- Reject `RunMode=Client + FleetIngest__Enabled=true` if `FleetIngest__Url` or `FleetIngest__Token` is empty.
- Reject the wrong mode against an existing DB by sniffing for a marker table:
- `RunMode=Client` against a DB containing `fleet.Customers` → fatal.
- `RunMode=Admin` against a DB containing `monitoring.PowerMeasurements` → fatal.
- Stops the "I pointed dev at the prod DB" disaster.
---
## 4. Schemas
### 4.1 Central DB (Admin) — `fleet` schema
**`fleet.Customers`** — registry
| Column | Type | Notes |
|---|---|---|
| Id | uuid PK | Admin-generated, stable |
| Code | varchar(50) UNIQUE | e.g. `ABC0001` — human handle (preserve case for display, lowercase for compose) |
| Name | varchar(200) | |
| TokenHash | varchar(64) UNIQUE indexed | SHA-256 hex of the push token |
| TokenIssuedAt | timestamptz | |
| TokenRotatedAt | timestamptz nullable | |
| IsActive | bool | If false, ingest returns 401 |
| FirstSeenAt, LastSeenAt | timestamptz nullable | When their first / most-recent push landed |
| CreatedAt | timestamptz | |
**`fleet.Sites`** — mirrors customer-side, identity preserved
| Column | Type | Notes |
|---|---|---|
| Id | uuid | Customer-side Site UUID, preserved |
| CustomerId | uuid FK→Customers | |
| Name, Address, IsActive | … | |
| LocalMunicipalityId | int nullable | Opaque on Admin |
| ReceivedAt | timestamptz | When Admin upserted this row |
| **PK** | (CustomerId, Id) | |
**`fleet.Devices`** — same pattern
| Column | Type | Notes |
|---|---|---|
| Id | uuid | Preserved |
| CustomerId | uuid FK | |
| SiteId | uuid | (CustomerId, SiteId) FK→Sites composite |
| Name, ExternalId, Description, IsActive | … | |
| ReceivedAt | timestamptz | |
| **PK** | (CustomerId, Id) | |
**`fleet.PowerMeasurements`** — hypertable, the big one
| Column | Type | Notes |
|---|---|---|
| Time | timestamptz | Hypertable partition |
| CustomerId | uuid | |
| DeviceId | uuid | (CustomerId, DeviceId) FK→Devices |
| ActivePowerKw, ReactivePowerKvar, ApparentPowerKva, PowerFactor, VoltageV, FrequencyHz, EnergyImportedKwh, EnergyExportedKwh | double | |
| Source | varchar(50) nullable | |
| **PK** | (Time, CustomerId, DeviceId) | Time must be in any unique constraint on a hypertable |
| **Indexes** | (CustomerId, Time DESC); (CustomerId, DeviceId, Time DESC) | |
**`fleet.IngestEvents`** — observability
| Column | Type | Notes |
|---|---|---|
| Id | uuid PK | |
| CustomerId | uuid FK | |
| ReceivedAt | timestamptz | |
| BatchType | varchar(20) | `sites` / `devices` / `measurements` |
| RowsAccepted, RowsRejected | int | |
| BatchBytes | int | |
| ClientHwm | varchar(50) | Cursor the client claims this batch covers |
| TimeSpread | interval nullable | `max(Time) - min(Time)` in the batch — visible burst back-fills |
| Error | varchar(500) nullable | |
### 4.2 Client DB additions
One new table: **`app.FleetPushState`**
| Column | Type | Notes |
|---|---|---|
| ResourceType | varchar(20) PK | `sites` / `devices` / `measurements` |
| LastCursor | timestamptz nullable | For measurements: max(`ReceivedAt`) pushed. For sites/devices: max(`UpdatedAt`) |
| LastSyncedAt | timestamptz nullable | |
| LastError | varchar(500) nullable | |
| ConsecutiveFailures | int default 0 | Drives exponential backoff |
Plus: **add `ReceivedAt timestamptz` (default `NOW()`) to `monitoring.PowerMeasurements`**, indexed on `(ReceivedAt)`. See §5.1 for the rationale.
### 4.3 Migration assembly split
Two `DbContext` classes — `ClientDbContext` (existing `AppDbContext`, renamed) and `AdminDbContext`. Each owns its migrations folder. Only the one matching `RunMode` is registered with DI. Ugly-but-reliable wins over single-context-with-tooling-gymnastics.
---
## 5. Push pipeline
### 5.1 Late arrivals — push by `ReceivedAt`, not `Time`
The firmware (ESP32 / TTGO T-Call) **buffers samples offline** and replays them when connectivity returns. Late arrivals are a designed-for case, not an edge.
If we pushed by `Time > LastCursor`, a device offline for 6 hours would have its back-fill skipped (its samples' `Time` is in the past).
Fix: every row carries `ReceivedAt` assigned by the local DB on insert. Push selects `WHERE ReceivedAt > LastCursor[measurements]`. Back-fills are picked up on the next push tick after the firmware replays.
`Time` stays as-is for queries; nothing else changes about how the data is used.
### 5.2 Burst handling
When a firmware replay drops thousands of rows in one local insert, the next push tick sees a huge backlog. To avoid one customer monopolising network / Admin:
- Per request: up to `FleetIngest__BatchSize` rows (default 5000) **or** `FleetIngest__BatchMaxBytes` (default 1 MB), whichever hits first.
- Per tick: at most 3 successful batches per resource type, then yield.
- Result: a 30k-row back-fill drains over ~5 minutes (10 ticks @ 60s) without starving sites/devices syncing.
- Observable: `fleet.IngestEvents.TimeSpread` shows the `max(Time) - min(Time)` per batch; a wave of back-fill is obvious in Admin logs without grepping.
### 5.3 Cadence
- Default push interval: 60 s.
- Configurable: `FleetIngest__IntervalSeconds`.
### 5.4 Order of operations per tick
1. **Sites**: `UpdatedAt > LastCursor[sites]` → POST batch (full upsert, all rows; small data).
2. **Devices**: same.
3. **Measurements**: `ReceivedAt > LastCursor[measurements]` ordered by `ReceivedAt` → POST batches until drained or limit.
Sites and devices first so Admin's measurement FK insert doesn't reject rows for unknown devices.
### 5.5 Endpoint contract
```
POST /api/fleet/ingest
Headers:
X-Customer-Token: <opaque 32-byte hex>
X-Batch-Type: sites | devices | measurements
X-Push-Cursor: <ISO8601> (highest ReceivedAt/UpdatedAt in this batch)
Content-Type: application/json
Body:
JSON array of rows matching the type
Response 200:
{ accepted: int, rejected: int, errors: [{ row: int, reason: string }] }
Response 401: unknown / inactive / wrong token
Response 413: batch too large
Response 429: rate-limited
Response 5xx: transient — retry
```
CustomerId is derived from the token server-side. Clients never send it. Prevents one customer from forging rows for another.
### 5.6 Failure handling
| Failure | Client behavior |
|---|---|
| Network timeout / 5xx | Exponential backoff (1m → 2m → 5m → 10m, cap 30m); `ConsecutiveFailures++` |
| 401 | Stop trying for this resource; log loudly; surface in **Settings → App config** ("Admin link down") |
| 200 with `rejected > 0` on measurements | Re-push sites/devices on next tick before retrying measurements |
| 413 | Halve `BatchSize` for this resource and retry |
| 429 | Honor `Retry-After`; otherwise back off 1 minute |
Push runs on its own DI scope and its own DbContext. If the push DB query hangs, only push is blocked — never user-facing API.
---
## 6. Auth — token lifecycle
- Token = 32 random bytes, hex-encoded (64 chars). Generated server-side via `RandomNumberGenerator`.
- Stored as **SHA-256 hex hash**, UNIQUE indexed → ingest endpoint does a single O(log N) indexed lookup. Bcrypt is the wrong tool for a high-throughput, high-entropy machine token.
- Shown **once** in the Admin UI at creation/rotation. Admin only stores the hash. Lost token → rotate.
- Rotation: **dual-token with 24h grace window** (shipped after the initial design). On rotate, the current `TokenHash` is copied into `PreviousTokenHash` with `PreviousTokenExpiresAt = now + 24h`, then a fresh current is generated. `FindByTokenAsync` matches either current OR (previous AND not expired AND `IsActive`). Customer ops has the full window to update `.env` + restart without dropping a push tick. A second rotation within the grace window overwrites the previous slot — at most one previous token is ever honoured at a time. Grace duration is `CustomerService.DefaultTokenGracePeriod` (parameterizable on `RotateTokenAsync` for tests; lift to config if you need per-customer tuning).
- `Customers.IsActive=false` → all that customer's pushes get 401 until reactivated.
---
## 7. Continuous aggregates + retention
### 7.1 Aggregates (created post-migration by `FleetTimescaleBootstrapper`)
- `fleet_hourly_per_device``time_bucket('1 hour', Time)`, `CustomerId`, `DeviceId` → avg/max/min `ActivePowerKw`, `max(EnergyImported) - min(EnergyImported)` as `KwhDelta`, sample count.
- `fleet_daily_per_customer``time_bucket('1 day', Time)`, `CustomerId` → totals across all that customer's devices.
- **Realtime** (`materialized_only = false`) so late back-fills appear in CA queries immediately, with the unmaterialized tail served live from the hypertable until next refresh.
- Refresh policy:
- `fleet_hourly_per_device`: every 5 min; `start_offset = 30 days`, `end_offset = 1 hour`.
- `fleet_daily_per_customer`: every 1 hour; `start_offset = 365 days`, `end_offset = 1 day`.
- Wide `start_offset` means firmware back-fills within those windows get materialized on the next refresh tick. TimescaleDB's invalidation log means only chunks that actually changed get re-processed.
### 7.2 Retention vs compression
**Retention policy: NONE.** All data kept indefinitely for historical / trend analysis.
**Compression policy** (this is what makes "forever" cheap):
- `fleet.PowerMeasurements`: compress chunks older than **7 days**.
- Compression settings: `segmentby = (CustomerId, DeviceId)`, `orderby = Time DESC`.
- Typical ratio for power-monitoring data: **510×**. Point queries by `(CustomerId, DeviceId, Time range)` stay fast because of the segmentby clustering.
- Bootstrapper enables compression and adds the policy idempotently.
CA tables are tiny by comparison; no compression needed there.
---
## 8. Admin UI surface
| Page | What |
|---|---|
| Dashboard | Fleet headline — customer count, active customers (push in last hour), aggregate live active power, total kWh today, last-24h lag chart |
| Customers | Table — Code / Name / Last seen / Sites / Devices / Today kWh / Status / Actions (rotate token, disable, view) |
| Customer detail | Sites + devices for the selected customer + Grafana embed scoped to `customer_id` |
| Dashboards | Grafana embed — `fleet-overview` default, `customer-drilldown` parameterized |
| Settings → Branding | Admin's own brand |
| Settings → Users | Admin operator accounts |
| Settings → Grafana | Read-only info card |
| Settings → App config | Read-only config snapshot (RunMode visible at the top) |
Hidden in Admin mode: **Sites** top-level (no local sites), **Settings → Rates** tab (no local tariffs).
---
## 9. Observability
Two questions a real operator asks, two answers:
1. **"Is customer X pushing?"** — Customers page shows `LastSeenAt`, `ConsecutiveFailures`, last batch size, recent burst spread.
2. **"Why did this batch not land?"** — `SELECT * FROM fleet.IngestEvents WHERE CustomerId = … ORDER BY ReceivedAt DESC LIMIT 50;`
Plus Serilog:
- Client: `[INF] Pushed N measurements to fleet ingest (cursor=…) accepted=N rejected=0`
- Admin: `[INF] Accepted N measurements from {Code} ({CustomerId}), rejected 0`
---
## 10. Per-stack config
**Customer `.env` additions:**
```
RunMode=Client # default — can omit
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<32-byte hex token from Admin Customers page>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
FleetIngest__BatchMaxBytes=1048576
```
`FleetIngest__Enabled=false` → push service doesn't start; customer runs as today, no data leaves.
**Admin `.env`:**
```
RunMode=Admin
COMPOSE_PROJECT_NAME=admin
CUSTOMER_HOST=admin.portal.example.com
Database__ConnectionString=Host=timescaledb;... # central fleet DB (separate volume)
# All other vars same as Client compose.
```
---
## 11. Open seams (deferred, with the obvious extension paths noted)
| Seam | Where v2 picks it up |
|---|---|
| ~~Tariff sync + cross-customer cost~~ | **Both shipped.** `fleet.Municipalities` + `fleet.Tariffs` + `fleet.TariffPeriods` mirror the customer-side rate hierarchy scoped by `CustomerId` (per-customer per-municipality rates preserved). `FleetCostService` joins `fleet.hourly_per_device` × Devices × Sites × Municipalities × active Tariff × period (in municipality local time) → per-bucket cost. Endpoint: `GET /api/fleet/customers/{id}/cost?from&to`. Per-day rollup + per-device breakdown. VAT applied per bucket from the active tariff. Fixed monthly charges intentionally excluded (whole-month billing concept). 366-day range cap. UI: **Cost** tab on customer detail with RangePicker, summary cards, daily + per-device tables. |
| Per-customer Postgres RLS for multi-Admin-user setups | Add `current_setting('app.customer_filter')`-based RLS policies on `fleet.*` tables; Admin role to customer-scope mapping in `IdentityRole.CustomerId`. |
| Bidirectional Admin → Customer commands | New WebSocket or long-poll channel on customer side; gated by mutual cert or a second token. |
| Branding sync (for the "Admin sees customer's brand when drilling in" niceness) | Push branding row from customer; Admin renders the customer's brand on Customer-detail pages. |
| Forward compatibility | Versioned URL `/v1/fleet/ingest`. Admin tolerates unknown JSON fields. Strictness on response shape only. |
| ~~Dual-token grace window for rotation~~ | **Shipped.** See §6 — `Customers.PreviousTokenHash` + `PreviousTokenExpiresAt`; ingest accepts either while previous is unexpired. Default 24h. |
| Sharded Admin (10000+ customers) | Customer's `FleetIngest__Url` already supports any host — point different customers at different Admin instances; aggregate at a tier above with Promscale or similar if needed. |
| Hard-delete / GDPR | Admin Customers page → "Decommission" action: `DELETE FROM fleet.* WHERE CustomerId = …` cascade. Logged. |
---
## 12. Phase split
| Phase | What | End-state |
|---|---|---|
| **13** | `RunMode` plumbing + Fleet schema + `AdminDbContext` split + Customers registry + token issuance UI. No data movement. | Spin up an Admin stack, register a customer, see their token. Client stack picks up `RunMode=Client` + push config but doesn't yet push. |
| **14** | `FleetPushService` (Client), `/api/fleet/ingest` + `FleetIngestService` (Admin), `IngestEvents`, `ReceivedAt` migration on Client, `FleetPushState`, continuous aggregates SQL, compression policy SQL. | Two local stacks: one Client pushes to one Admin. Rows visible in `fleet.PowerMeasurements`. Late-arrival firmware-replay scenario verified with fixtures. |
| **15** | Admin UI (Dashboard, Customers list, Customer detail, Customers token rotation), Grafana provisioning of `fleet-overview.json` + `customer-drilldown.json`, README + OPERATIONS updates for Admin deployment + customer onboarding workflow. | Full feature usable end-to-end. |
---
## 13. Sanity checks (carried over from the design conversation)
- **No central data without the central DB existing.** Whole feature is opt-in via `FleetIngest__Enabled`.
- **Customer model is unchanged.** Every customer keeps full control of their own data. We copy *roll-ups* (and raw measurements for fidelity), not authority.
- **One ingest endpoint, no fan-out.** N customers all POST to the same `/api/fleet/ingest`. Admin scales vertically first, shards horizontally later if needed (each customer's `FleetIngest__Url` can point at a specific Admin instance).