Diseri Pearson a92b4277ae Phase 14: Push + ingest pipeline (end-to-end fleet aggregation)

Customer-stack measurements now flow to the Admin-stack central DB via
HTTPS POST, with firmware buffer-and-replay back-fills handled correctly.

Client side (push)
- monitoring.PowerMeasurements gains ReceivedAt (default NOW()) +
  index. Push selects WHERE ReceivedAt > LastCursor, so back-dated
  rows from offline-buffer replays are picked up automatically.
- app.FleetPushState table holds per-resource cursors + backoff state.
- FleetPushClient: HttpClient wrapper, X-Customer-Token header,
  X-Batch-Type, X-Push-Cursor. A 413 is the signal to halve the batch and retry.
- FleetPushService: BackgroundService loop. Per tick: sites (full set),
  devices (full set), measurements (cursor-driven, up to 3 batches).
  Exponential backoff per resource on failure (1m → 30m cap).
  Honors 429 Retry-After. Only registered when RunMode=Client AND
  FleetIngest__Enabled=true.
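A minimal TypeScript sketch of the per-resource retry rules listed above:
- 1 minute doubling to a 30 minute cap;
- a 429 Retry-After taking precedence;
- a 413 halving the batch.

The real FleetPushService is C#. The names here (BackoffState, backoffDelayMs, onPushResult) are hypothetical. Two behaviours are assumptions, not taken from the service: a 429 counts toward the failure streak, and a halved batch size is kept after the next success.

```typescript
// Hypothetical model of FleetPushService's per-resource retry state.
// Assumptions: 429 counts as a failure; a halved batch size is kept after success.
const BASE_DELAY_MS = 60_000;     // first retry after 1 minute
const MAX_DELAY_MS = 30 * 60_000; // cap at 30 minutes

interface BackoffState {
  consecutiveFailures: number;
  nextAttemptAtMs: number; // epoch ms; 0 means "eligible now"
  batchSize: number;
}

// Delay after the n-th consecutive failure (n >= 1): 1m, 2m, 4m, 8m, 16m, then 30m.
function backoffDelayMs(consecutiveFailures: number): number {
  const exponent = Math.max(0, consecutiveFailures - 1);
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, exponent));
}

function onPushResult(
  state: BackoffState,
  httpStatus: number,
  nowMs: number,
  retryAfterSeconds?: number,
): BackoffState {
  if (httpStatus >= 200 && httpStatus < 300) {
    // Success clears the failure streak.
    return { consecutiveFailures: 0, nextAttemptAtMs: 0, batchSize: state.batchSize };
  }
  if (httpStatus === 413) {
    // Payload too large: halve the batch (never below 1) and retry right away.
    return { ...state, batchSize: Math.max(1, Math.floor(state.batchSize / 2)), nextAttemptAtMs: nowMs };
  }
  const failures = state.consecutiveFailures + 1;
  const delayMs =
    httpStatus === 429 && retryAfterSeconds !== undefined
      ? retryAfterSeconds * 1000 // the server's Retry-After wins over the computed backoff
      : backoffDelayMs(failures);
  return { consecutiveFailures: failures, nextAttemptAtMs: nowMs + delayMs, batchSize: state.batchSize };
}
```

Keeping one state object per resource (sites, devices, measurements) means a failing measurements push does not delay the sites and devices pushes.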

Admin side (ingest)
- /api/fleet/ingest: anonymous (no portal cookie); authenticated by the
  X-Customer-Token header, whose SHA-256 hash is looked up (indexed) in
  fleet.Customers. 401 on a bad token; 400 on a bad batch type.
- FleetIngestService dispatches by X-Batch-Type:
  sites/devices → upsert by (CustomerId, Id) with ON CONFLICT UPDATE
  measurements → bulk INSERT ON CONFLICT (Time, CustomerId, DeviceId)
                 DO NOTHING (idempotent under re-delivery).
- Updates fleet.Customers.FirstSeenAt/LastSeenAt on each successful batch.
- Writes fleet.IngestEvents audit row per batch (accepted, rejected,
  bytes, client cursor, time-spread, error).
- FleetTimescaleBootstrapper runs after MigrateAsync in Admin mode:
  CREATE EXTENSION timescaledb, create_hypertable on fleet.PowerMeasurements,
  chunk interval 7 days, compression with segmentby=(CustomerId,DeviceId)
  + compress_orderby "Time" DESC, compression policy 7 days, hourly_per_device
  continuous aggregate (realtime, materialized_only=false, 30-day start_offset
  so back-fills get materialized on next refresh tick).
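The measurements path is idempotent because rows are keyed on (Time, CustomerId, DeviceId) and conflicts are ignored. The in-memory TypeScript model below shows why re-delivering a batch is harmless.

It is a sketch, not the ingest service. Several details are assumptions:
- MeasurementTable, the activePowerW field, and the duplicates counter are hypothetical.
- Timestamps are assumed to be normalised to one ISO-8601 format. PostgreSQL compares real timestamps; this model compares strings.

```typescript
// In-memory stand-in for fleet.PowerMeasurements: the key is (Time, CustomerId, DeviceId),
// and a row whose key already exists is skipped, like INSERT ... ON CONFLICT DO NOTHING.
interface FleetMeasurement {
  time: string;         // assumed normalised ISO-8601 UTC, e.g. "2026-04-01T10:00:00Z"
  customerId: string;
  deviceId: string;
  activePowerW: number; // hypothetical payload field
}

interface BatchOutcome {
  accepted: number;   // rows actually inserted
  duplicates: number; // rows ignored because the key already existed
}

class MeasurementTable {
  private rows: { [key: string]: FleetMeasurement } = {};

  private static keyOf(m: FleetMeasurement): string {
    return `${m.time}|${m.customerId}|${m.deviceId}`;
  }

  insertBatch(batch: FleetMeasurement[]): BatchOutcome {
    let accepted = 0;
    let duplicates = 0;
    for (const m of batch) {
      const key = MeasurementTable.keyOf(m);
      if (Object.prototype.hasOwnProperty.call(this.rows, key)) {
        duplicates++; // existing row wins; nothing is updated
      } else {
        this.rows[key] = m;
        accepted++;
      }
    }
    return { accepted, duplicates };
  }

  size(): number {
    return Object.keys(this.rows).length;
  }
}
```

Because the existing row wins, a client that re-sends after a timeout, or replays a firmware buffer, cannot double-count energy.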

Wiring
- docker-compose.yml threads Application__RunMode + FleetIngest__* from
  .env (defaults safely off) so a single dev host can run two stacks.
- .env.example documents the new vars under their own section.

Tests
- FleetIngestValidationTests (2 new). 53/53 passing.

Verified end-to-end on the dev host
- Client (portal-dev_portal, RunMode=Client, FleetIngest__Enabled=true)
  pushes to Admin (portal-admin-test, RunMode=Admin, separate admin_fleet DB)
  via container DNS.
- Customer registered on Admin (DEV0001), token captured, dropped into
  Client .env, Client restarted, push service started on schedule.
- Ingested measurements (including a 2026-04-01 back-dated sample
  simulating firmware replay) all land in fleet.PowerMeasurements with
  the correct CustomerId.
- Customer.FirstSeenAt/LastSeenAt update, IngestEvents records every
  batch (sites + devices per tick, measurements when cursor advances).
- Hypertable confirmed via timescaledb_information.hypertables;
  hourly_per_device CA confirmed via timescaledb_information.continuous_aggregates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 10:17:58 +02:00


Tau Acuvim Portal

Customer-facing, white-labeled power monitoring portal. One stack per customer, deployed behind Traefik with the customer ID (lowercased — Docker Compose v2 requirement) as the container prefix: customer ABC0001 produces abc0001_portal, abc0001_grafana, abc0001_timescale.

This project lives next to console/ (internal management interface) and firmware/ (ESP32) in the same repo. The three projects share no code; the portal stands alone.

Overview
Architecture
Configuration template
Local setup
Docker Compose
Database migrations
Accessing the app
Accessing Grafana
Default local test credentials
Production deployment notes
Security notes
Testing
Operations
Admin / Fleet mode

Overview

A customer signs in to their own branded portal, sees their meters' live + historical power readings via embedded Grafana dashboards, and (for admins) configures branding, municipality tariffs, users, sites, and devices. Each customer gets a fully isolated stack: their own database, their own Grafana, their own branding. Traefik routes <customer>.portal.example.com to the right containers.

Tech stack

Layer	Technology
Run modes	`RunMode=Client` (per-customer, default) or `RunMode=Admin` (fleet aggregation) — same binary, config-selected. See docs/FLEET-DESIGN.md.
Backend	.NET 10 minimal API, EF Core 10, Npgsql, ASP.NET Core Identity, Serilog
Frontend	React 18 + TypeScript + Vite, Ant Design 5, TanStack Query, react-router
Database	TimescaleDB 2.17 on PostgreSQL 16
Graphing	Grafana 11 (provisioned datasource + dashboards)
Container	Docker / Docker Compose; Traefik for routing
Auth	Cookie-based via ASP.NET Identity (SPA-friendly, 401/403 not redirects)

Architecture

Containers, per customer

                ┌────────────────────────────────────────────────────┐
                │                    Traefik                         │
                │   host: <customer>.portal.example.com              │
                └────────────────────────────────────────────────────┘
        ┌─────────────────────────┬─────────────────────────┐
        ▼                         ▼                         ▼
┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
│ <PREFIX>_portal   │   │ <PREFIX>_grafana  │   │<PREFIX>_timescale │
│ .NET API + SPA    │──▶│ Grafana 11        │──▶│ TimescaleDB + Pg16│
│ :8080             │   │ :3000             │   │ :5432             │
└───────────────────┘   └───────────────────┘   └───────────────────┘

<PREFIX> = the customer's 7-character ID lowercased (e.g. abc0001 for customer ABC0001), set via COMPOSE_PROJECT_NAME. Compose v2 rejects uppercase project names.
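The naming rule can be expressed as a small TypeScript helper, for example in provisioning scripts.

Two things are assumptions:
- The ID pattern (three letters plus four digits) is inferred from the examples ABC0001 and DEV0001 in this README. Adjust it if the real format differs.
- containersFor is a hypothetical name.

```typescript
// Derive the Compose project name and container names from a customer ID.
// Assumption: IDs look like ABC0001 (3 letters + 4 digits), inferred from this README's examples.
const CUSTOMER_ID_PATTERN = /^[A-Z]{3}[0-9]{4}$/;

interface CustomerContainers {
  projectName: string; // value for COMPOSE_PROJECT_NAME
  portal: string;
  grafana: string;
  timescale: string;
}

function containersFor(customerId: string): CustomerContainers {
  const id = customerId.trim().toUpperCase();
  if (!CUSTOMER_ID_PATTERN.test(id)) {
    throw new Error(`Unexpected customer ID format: ${customerId}`);
  }
  const prefix = id.toLowerCase(); // Compose v2 rejects uppercase project names
  return {
    projectName: prefix,
    portal: `${prefix}_portal`,
    grafana: `${prefix}_grafana`,
    timescale: `${prefix}_timescale`,
  };
}
```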

Backend layout

Combined container — Dockerfile builds the React SPA, then a multi-stage .NET build, then copies the SPA into wwwroot. One image, one process per customer.
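AppDbContext (RunMode=Client) with three schemas (RunMode=Admin uses a separate AdminDbContext covering identity, branding, and the fleet schema):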
- identity — ASP.NET Identity tables.
- app — branding, municipalities, tariffs, periods.
- monitoring — sites, devices, power measurements (hypertable).
Minimal API with endpoints grouped under Endpoints/*.cs. Services in Services/*.cs. Typed options in Configuration/.

Frontend layout

src/pages/ — page-level components (DashboardsPage, SettingsPage, etc.).
src/components/ — shared + feature components.
src/api/ — typed axios calls.
src/hooks/useAuth, useBranding — global state via Context.
RequireAuth / RequireRole — route guards.
AntD ConfigProvider is themed dynamically from BrandingProvider (white-labelling).

Configuration template

All configurable values are declared in src/Tau.Acuvim.Portal/appsettings.template.json — checked in, no secrets. It's the lowest-priority configuration source: everything overrides it.

Precedence (lowest → highest)

appsettings.template.json — shippable defaults (always loaded).
appsettings.json — runtime infra (Serilog, AllowedHosts).
appsettings.{Environment}.json — per-environment overrides.
appsettings.Local.json — gitignored, your local overrides.
Environment variables — use __ as section separator (e.g. Authentication__DefaultAdminPassword=...). Production secrets go here.
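The precedence rules can be modelled in a few lines of TypeScript: later sources override earlier ones key by key, and `__` in an environment variable name maps to the `:` section separator.

This is a simplification of .NET's configuration system, not a reimplementation. Flat "Section:Key" maps stand in for configuration providers, and the function names are hypothetical.

```typescript
// Flat "Section:Key" maps stand in for .NET configuration sources (a simplification).
type ConfigSource = { [key: string]: string };

// Environment variables use "__" where configuration keys use ":".
function fromEnvironment(env: ConfigSource): ConfigSource {
  const out: ConfigSource = {};
  for (const name of Object.keys(env)) {
    out[name.split("__").join(":")] = env[name];
  }
  return out;
}

// Later sources override earlier ones key by key. Keys compare case-insensitively,
// as .NET configuration keys do.
function layerConfig(...sources: ConfigSource[]): ConfigSource {
  const merged: ConfigSource = {};
  for (const source of sources) {
    for (const key of Object.keys(source)) {
      merged[key.toLowerCase()] = source[key];
    }
  }
  return merged;
}

function getSetting(config: ConfigSource, key: string): string | undefined {
  return config[key.toLowerCase()];
}
```

Sources are passed lowest priority first, mirroring the list above, e.g. `layerConfig(template, appsettings, environmentJson, local, fromEnvironment(processEnv))`.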

Sections

Section	Purpose
`Application`	Name, environment, public URL, `RunMode` (`Client` or `Admin`)
`Database`	Provider, connection string, `MigrateOnStartup`, `AutoProvisionLocalTimescaleDb`
`TimescaleDb`	Host/port/db/user/password for auto-provision in dev
`Grafana`	Base URL, internal URL, path prefix, embed mode, dashboard list
`WhiteLabel`	App name, logo URL, colours, footer, logo storage path
`Authentication`	Cookie name, lockout, default admin email/password
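`Monitoring`	Hypertable chunk interval, aggregate flag
`FleetIngest`	Client-mode push to the Admin stack: enable flag, ingest URL, customer token, interval, batch size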

Database connection resolution

If Database:ConnectionString is non-empty → use it.
Else if Database:AutoProvisionLocalTimescaleDb=true AND env is not Production → build from the TimescaleDb:* block (Host/Port/Database/Username/Password).
Otherwise → app refuses to start with a clear error.

AutoProvisionLocalTimescaleDb=true in Production is a hard failure — production must supply its own connection string via env var or secret.
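The resolution order above is sketched in TypeScript below. The backend implements it in C#; this version is for illustration only.

Two points are assumptions:
- The Production/AutoProvision hard failure is checked before anything else.
- The built string uses Npgsql's Host/Port/Database/Username/Password keywords, matching the stub connection string in the migration commands.

The option interface names are hypothetical.

```typescript
interface DatabaseOptions {
  connectionString?: string;
  autoProvisionLocalTimescaleDb: boolean;
}

interface TimescaleDbOptions {
  host: string;
  port: number;
  database: string;
  username: string;
  password: string;
}

function resolveConnectionString(db: DatabaseOptions, ts: TimescaleDbOptions, environmentName: string): string {
  const isProduction = environmentName.toLowerCase() === "production";
  // Hard failure in Production, checked first in this sketch.
  if (isProduction && db.autoProvisionLocalTimescaleDb) {
    throw new Error("AutoProvisionLocalTimescaleDb=true is not allowed in Production; supply Database:ConnectionString.");
  }
  // 1. A non-empty explicit connection string always wins.
  if (db.connectionString && db.connectionString.trim() !== "") {
    return db.connectionString;
  }
  // 2. Non-production only: build one from the TimescaleDb:* block.
  if (db.autoProvisionLocalTimescaleDb) {
    return `Host=${ts.host};Port=${ts.port};Database=${ts.database};Username=${ts.username};Password=${ts.password}`;
  }
  // 3. Nothing usable: refuse to start.
  throw new Error("No database connection string could be resolved.");
}
```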

Local setup

Prerequisites

.NET 10 SDK
Node 22+ / npm
Docker Desktop
dotnet-ef tool: dotnet tool install --global dotnet-ef

First-time: generate the initial migrations

Two DbContext classes — one per RunMode — each with its own migration folder.

cd C:\AcuvimDev\Tau.Acuvim\portal

# Client (RunMode=Client, default) — identity + branding + monitoring + rates
dotnet ef migrations add InitialCreate `
  --context AppDbContext `
  --project src/Tau.Acuvim.Portal/Tau.Acuvim.Portal.csproj `
  --output-dir Migrations

# Admin (RunMode=Admin) — identity + branding + fleet
$env:Application__RunMode='Admin'
$env:Database__ConnectionString='Host=localhost;Database=stub;Username=u;Password=p'  # dummy: parsed at design time, never connected to
dotnet ef migrations add InitialFleet `
  --context AdminDbContext `
  --project src/Tau.Acuvim.Portal/Tau.Acuvim.Portal.csproj `
  --output-dir Migrations/Admin
Remove-Item Env:Application__RunMode
Remove-Item Env:Database__ConnectionString

Commit both Migrations/ and Migrations/Admin/. MigrateAsync on startup applies whatever exists for the active context.

Option A: full stack in Docker (recommended)

cd C:\AcuvimDev\Tau.Acuvim\portal
Copy-Item .env.example .env
docker compose up --build -d
docker compose ps

Then:

Portal: http://localhost:8080
Grafana: http://localhost:3001 (anonymous Viewer in dev)
TimescaleDB: localhost:5433 (user/db from .env)

Stop + wipe:

docker compose down -v

Option B: backend + DB in Docker, frontend via Vite

Same docker compose up, then in another terminal:

cd C:\AcuvimDev\Tau.Acuvim\portal\frontend
npm install
npm run dev

Vite serves at http://localhost:5174 and proxies /api + /health to the .NET container on :8080.

Option C: everything local (no Docker for the app)

Postgres still needs Docker (or a local install):

docker compose up -d timescaledb
cd C:\AcuvimDev\Tau.Acuvim\portal\src\Tau.Acuvim.Portal
dotnet run     # listens on :8080
# in another terminal
cd C:\AcuvimDev\Tau.Acuvim\portal\frontend
npm run dev    # :5174

Docker Compose

Dev — `docker-compose.yml`

Three services: portal, timescaledb, grafana. Persistent named volumes (timescale-data, grafana-data, portal-keys, portal-branding). Healthcheck on Postgres; portal waits for healthy. Grafana ships with anonymous Viewer for easy local access; provisioned TimescaleDB datasource (uid: timescaledb) and any JSON dashboards under grafana/dashboards/.

Host port mappings: 8080→portal, 5433→timescaledb, 3001→grafana. Chosen to coexist with the console stack.

Prod — `docker-compose.prod.yml`

Same services, no host port mappings. Joins external traefik-public network. Per-customer Traefik labels (subdomain routing for portal, same-origin path-prefix routing for Grafana at /grafana). Grafana sub-path + GF_SERVER_ROOT_URL configured. All secrets via env vars.

Run:

docker network create traefik-public        # once on the host
docker compose -f docker-compose.prod.yml --env-file .env up -d

See OPERATIONS.md for the full per-customer deployment loop.

Database migrations

MigrateAsync runs on startup (controlled by Database:MigrateOnStartup, default true). Immediately after, TimescaleBootstrapper runs an idempotent block:

CREATE EXTENSION IF NOT EXISTS timescaledb (defensive).
SELECT create_hypertable('monitoring."PowerMeasurements"', 'Time', if_not_exists => TRUE, migrate_data => TRUE).
SELECT set_chunk_time_interval('monitoring."PowerMeasurements"', INTERVAL '<MonitoringOptions.ChunkTimeInterval>').

Safe to re-run on every start.

Adding a new migration

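When you change the entity model, name the context explicitly (the project has two DbContexts, so dotnet ef needs to be told which one to use):

cd C:\AcuvimDev\Tau.Acuvim\portal
dotnet ef migrations add <DescriptiveName> `
  --context AppDbContext `
  --project src/Tau.Acuvim.Portal/Tau.Acuvim.Portal.csproj `
  --output-dir Migrations

For a change to the Admin (fleet) model, use --context AdminDbContext and --output-dir Migrations/Admin, with the same temporary Application__RunMode and Database__ConnectionString variables as in the first-time setup.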

Commit. Next deploy applies it automatically.

Accessing the app

Environment	URL
Local (Docker combined)	http://localhost:8080
Local (Vite dev)	http://localhost:5174
Production	`https://<customer-host>` (e.g. `https://abc0001.portal.example.com`)

API base path: /api. Swagger UI in dev: /swagger.

Health endpoints

GET /health — liveness (process alive). Returns Healthy if the app responds.
GET /health/ready — readiness. Returns Healthy only if TimescaleDB answers.

Nav surface

Page	Who sees it
Dashboard	Any authenticated user
Dashboards (embedded Grafana)	Any authenticated user
Sites	Admin only
Settings (Branding / Rates / Users / Grafana / App config)	Admin only

Accessing Grafana

Local dev

Direct: http://localhost:3001. Anonymous Viewer can browse provisioned dashboards; sign in as admin with GRAFANA_ADMIN_PASSWORD to edit.
Embedded: portal Dashboards page — iframe src points at the local Grafana base URL.

Production

Direct browser access to <customer-host>/grafana/* is gated by your chosen auth mode (see Security notes). Anonymous is off in the prod compose.
Embedded: portal Dashboards page — same-origin iframe via Traefik path prefix /grafana.

Provisioning

Datasource: grafana/provisioning/datasources/timescaledb.yml (uid: timescaledb).
Dashboard provider: grafana/provisioning/dashboards/dashboards.yml (auto-discovers JSON in grafana/dashboards/, refresh 30s).
Starter dashboard: grafana/dashboards/power-overview.json — active power + cumulative energy + latest-power stat, parameterised by a device template variable.

To add a dashboard:

Drop the JSON into grafana/dashboards/.
Add an entry to Grafana.Dashboards in appsettings.template.json (or override in appsettings.Local.json) with the same Uid. The portal's Dashboards page picks it up after a refresh.
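When building the iframe source for an entry, the portal needs a dashboard URL under the Grafana base URL. In production that base is the same-origin /grafana prefix. The TypeScript sketch below shows one way to build it.

Only Grafana's standard query parameters are used: /d/<uid>, orgId, kiosk, and var-<name> for template variables. The rest is assumed:
- orgId=1 (the default organisation);
- the variable name device;
- the uid power-overview;
- the helper name grafanaDashboardUrl.

```typescript
interface DashboardLink {
  uid: string;                       // must match the Uid in Grafana.Dashboards
  vars?: { [name: string]: string }; // template variables, sent as var-<name>=<value>
  kiosk?: boolean;                   // hide Grafana chrome inside the iframe (default true)
}

function grafanaDashboardUrl(grafanaBaseUrl: string, link: DashboardLink): string {
  const base = grafanaBaseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  const params: string[] = ["orgId=1"];            // assumption: default organisation
  if (link.kiosk !== false) params.push("kiosk");
  const vars = link.vars || {};
  for (const name of Object.keys(vars)) {
    params.push(`var-${encodeURIComponent(name)}=${encodeURIComponent(vars[name])}`);
  }
  return `${base}/d/${encodeURIComponent(link.uid)}?${params.join("&")}`;
}
```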

Default local test credentials

These are defaults from the configuration template; change them before exposing the stack to anyone.

Email: admin@example.com
Password: ChangeMe123!

Defined in appsettings.template.json → Authentication. The bootstrapper seeds this account only if no account with that email exists, and never overwrites a changed password.

Production guard: if ASPNETCORE_ENVIRONMENT=Production and the default password is still ChangeMe123!, the app refuses to start with an explicit error. Override Authentication__DefaultAdminPassword via env var before deploying.
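The seeding rule and the production guard reduce to two small checks, sketched in TypeScript below. The real logic lives in the C# bootstrapper, and both function names are hypothetical.

The email comparison upper-cases both sides, in line with ASP.NET Identity's default normaliser. This sketch only decides whether to seed; it never touches an existing account's password.

```typescript
const TEMPLATE_DEFAULT_PASSWORD = "ChangeMe123!";

// Refuse to start in Production while the template default password is still configured.
function assertNotTemplatePassword(environmentName: string, defaultAdminPassword: string): void {
  if (environmentName.toLowerCase() === "production" && defaultAdminPassword === TEMPLATE_DEFAULT_PASSWORD) {
    throw new Error("Authentication:DefaultAdminPassword is still the template default; set Authentication__DefaultAdminPassword.");
  }
}

// Seed only when no existing account uses the admin email (case-insensitive).
function shouldSeedDefaultAdmin(existingEmails: string[], adminEmail: string): boolean {
  const wanted = adminEmail.trim().toUpperCase();
  return existingEmails.every((e) => e.trim().toUpperCase() !== wanted);
}
```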

Production deployment notes

For the per-customer deployment loop see OPERATIONS.md. The short version:

One Compose project per customer. Set COMPOSE_PROJECT_NAME=abc0001 (lowercase form of the customer ID — Compose v2 rejects uppercase). Containers are named abc0001_portal, abc0001_grafana, abc0001_timescale.
One subdomain per customer. Set CUSTOMER_HOST=abc0001.portal.example.com. Wildcard DNS + wildcard TLS cert via Traefik's resolver (certresolver=le).
Decide your Grafana auth mode (see Security notes). The prod compose deliberately disables anonymous access and leaves the auth mode unset, so the embedded iframe shows a Grafana login page until you pick one.
Set all secrets via env vars (not files):
- POSTGRES_PASSWORD
- GRAFANA_ADMIN_PASSWORD
- Authentication__DefaultAdminPassword
External traefik-public Docker network must exist (created once on the host running Traefik).

Up the stack:

docker compose -f docker-compose.prod.yml --env-file .env up -d

Verify the three containers report healthy and https://<customer-host>/health/ready returns Healthy.

Security notes

What's protected by default

ASP.NET Core Identity with lockout (5 failed attempts → 15 min) and strong password requirements (8+ chars, upper + lower + digit).
Cookies are HttpOnly + SameSite=Lax + Secure in prod, scoped to the portal subdomain.
Admin-only endpoints are gated by an AdminOnly policy (RequireRole("Admin")), enforced on the backend; the nav entries are also hidden on the frontend.
Cannot delete your own account — backend block, not just UI.
GET /api/admin/config-overview is admin-only; the DTO never includes the connection string or any password. Redaction by construction, not filtering.
Branding logo upload rejects files >2 MB and extensions outside {png, jpg, jpeg, svg, webp}.
Anti-forgery is left on by default on cookie-authenticated endpoints; the logo upload explicitly opts out (multipart needs it disabled). Other admin endpoints accept JSON over same-site Lax cookies, which is CSRF-safe for state-changing same-origin SPA requests.
Security headers (X-Content-Type-Options, X-Frame-Options: SAMEORIGIN, Referrer-Policy: strict-origin-when-cross-origin) on every response. HSTS in prod.
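The React frontend can mirror two of these rules before submitting, purely for faster feedback; the backend remains the authority. The TypeScript sketch below follows the stated rules:
- passwords: 8+ characters with upper case, lower case, and a digit;
- logos: at most 2 MB, with the listed extensions.

Several details are assumptions:
- 2 MB is taken as 2 × 1024 × 1024 bytes.
- The character classes are ASCII-only, while Identity's checks are Unicode-aware.
- The function names are hypothetical.

```typescript
const MAX_LOGO_BYTES = 2 * 1024 * 1024; // assumption: "2 MB" = 2 MiB
const ALLOWED_LOGO_EXTENSIONS = ["png", "jpg", "jpeg", "svg", "webp"];

// Returns the unmet requirements; an empty array means the password passes these checks.
function passwordProblems(password: string): string[] {
  const problems: string[] = [];
  if (password.length < 8) problems.push("at least 8 characters");
  if (!/[A-Z]/.test(password)) problems.push("an uppercase letter");
  if (!/[a-z]/.test(password)) problems.push("a lowercase letter");
  if (!/[0-9]/.test(password)) problems.push("a digit");
  return problems;
}

// Returns an error message, or null if the file looks acceptable.
function logoProblem(fileName: string, sizeBytes: number): string | null {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  if (ALLOWED_LOGO_EXTENSIONS.indexOf(ext) < 0) {
    return `Unsupported file type: ${ext ? "." + ext : "(none)"}`;
  }
  if (sizeBytes > MAX_LOGO_BYTES) return "Logo must be 2 MB or smaller";
  return null;
}
```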

Production refuse-to-start guards

App refuses to start in Production if Authentication:DefaultAdminPassword is still ChangeMe123!.
App refuses to start in Production if Database:AutoProvisionLocalTimescaleDb=true (you must supply an explicit connection string).
App refuses to start if no connection string can be resolved at all.

Grafana embedding — three production auth options

The dev compose runs Grafana with anonymous Viewer (safe on localhost). The prod compose has anonymous off and leaves the auth mode unset on purpose — pick one before publishing:

Option	What it does	Trade-off
(a) Traefik `forwardAuth`	Traefik middleware calls a portal `/api/auth/check` endpoint on every Grafana request; portal cookie required, else 401	Zero changes to Grafana. Best when "any portal user = same dashboards."
(b) Grafana `auth.proxy`	`GF_AUTH_PROXY_ENABLED=true`; trust an `X-WEBAUTH-USER` header injected by Traefik (e.g. copied from the portal's auth-check response via forwardAuth `authResponseHeaders`)	Maps portal user → Grafana user, gets per-user folders/perms. Sanitise the header; never let a client set it directly.
(c) Service-account API key + render tokens	Portal mints short-lived render tokens; SPA embeds via `?auth_token=...`	Most moving parts. Right when dashboards are stitched into custom UI per-panel rather than full Grafana.

Until one is wired, prod-mode Grafana refuses anonymous access and the iframe shows a login page — the intended safe default.

Other considerations

Same-origin embed (prod path-prefix routing through Traefik) sidesteps third-party-cookie blockers that increasingly break cross-origin Grafana iframes.
Provisioned datasource is editable: false — admins cannot accidentally rewire Grafana from its UI.
Default password complexity is tunable in Program.cs → IdentityOptions. Lockout is tunable in the same block.
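TimescaleDB licensing: the timescale/timescaledb:*-pg16 image is not Apache-only. Hypertables are Apache 2.0, but compression and continuous aggregates (both used by the fleet bootstrapper) are Timescale License (TSL) community features. TSL permits using them inside this product but not offering TimescaleDB itself as a managed DBaaS. Re-check the licence before any hosted-database offering.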

Testing

Backend unit tests under tests/Tau.Acuvim.Portal.Tests/ cover cost calculation, rate validation, connection-string resolution, day-of-week math, and fleet ingest validation:

cd C:\AcuvimDev\Tau.Acuvim\portal\tests\Tau.Acuvim.Portal.Tests
dotnet test

See TESTING.md for the full manual integration scenario, frontend test scaffolding recipe, and edge-case checklist.

Operations

For per-customer provisioning, secret rotation, backups, and health monitoring see OPERATIONS.md.

Admin / Fleet mode

A second deployment of the same image — RunMode=Admin, separate DB — aggregates data from all customer stacks for a fleet-wide operator view. See docs/FLEET-DESIGN.md for the full design.

Phase 14 (this release): the full push pipeline is live. Customer stacks with FleetIngest__Enabled=true run a FleetPushService background loop that batches sites, devices, and measurements and POSTs them to FleetIngest__Url with X-Customer-Token. Measurements are cursored by ReceivedAt rather than Time, so firmware buffer-and-replay back-fills are picked up automatically. Admin's /api/fleet/ingest upserts sites and devices, inserts measurements idempotently (duplicates are ignored), and writes an IngestEvents audit row per batch. Admin's FleetTimescaleBootstrapper makes fleet.PowerMeasurements a hypertable with compression after 7 days and a realtime fleet.hourly_per_device continuous aggregate.
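The ReceivedAt cursor is what makes replayed rows visible. A back-dated Time does not matter because the row's ReceivedAt is new. The TypeScript sketch below plans one tick's measurement batches from a local row set. The service itself does this with a SQL query in C#.

The row shape, numeric epoch timestamps, and names are hypothetical. Each batch's cursorAfter is meant to be persisted only after that batch is acknowledged. The sketch also assumes distinct ReceivedAt values: with a strict `>` cursor, rows sharing a value across a batch boundary would be skipped, and a composite cursor would avoid that.

```typescript
interface LocalMeasurement {
  time: number;       // measurement timestamp; may be back-dated by a firmware replay
  receivedAt: number; // when the portal stored the row
}

interface PlannedBatch<T> {
  rows: T[];
  cursorAfter: number; // new LastCursor once this batch is acknowledged
}

function planBatches<T extends LocalMeasurement>(
  rows: T[],
  lastCursor: number,
  batchSize: number,
  maxBatches: number,
): PlannedBatch<T>[] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const pending = rows
    .filter((r) => r.receivedAt > lastCursor)      // WHERE ReceivedAt > LastCursor
    .sort((a, b) => a.receivedAt - b.receivedAt);  // ORDER BY ReceivedAt
  const planned: PlannedBatch<T>[] = [];
  for (let i = 0; i < pending.length && planned.length < maxBatches; i += batchSize) {
    const slice = pending.slice(i, i + batchSize);
    planned.push({ rows: slice, cursorAfter: slice[slice.length - 1].receivedAt });
  }
  return planned;
}
```

With maxBatches = 3 this matches the "up to 3 batches per tick" rule; any remaining rows wait for the next tick.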

Spin up an Admin stack:

docker exec <client-timescale> createdb -U power_user admin_fleet    # one-time

docker run -d --name admin-portal --network <existing-network> `
  -e Application__RunMode=Admin `
  -e Database__ConnectionString='Host=<host>;Port=5432;Database=admin_fleet;Username=power_user;Password=<secret>' `
  -e Authentication__DefaultAdminPassword=<rotate-from-template-default> `
  -p 8090:8080 `
  portal-dev-portal

Then sign in at http://localhost:8090 → Customers → register the customer → token shown once.

Enable push on the customer stack: add to that customer's .env:

Application__RunMode=Client
FleetIngest__Enabled=true
FleetIngest__Url=http://admin-portal:8080/api/fleet/ingest   # container DNS in-network, or https://admin-host
FleetIngest__Token=<token from Customers page>
FleetIngest__IntervalSeconds=60                              # default
FleetIngest__BatchSize=5000                                  # default
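How these variables gate the push service can be sketched in TypeScript:
- RunMode defaults to Client.
- The service runs only when RunMode=Client and FleetIngest__Enabled=true.
- IntervalSeconds and BatchSize fall back to 60 and 5000.

Two choices here are assumptions rather than documented behaviour: failing fast on a non-integer value, and matching "true" case-insensitively. The function names are hypothetical.

```typescript
type EnvVars = { [name: string]: string | undefined };

interface FleetIngestSettings {
  enabled: boolean;
  url: string;
  token: string;
  intervalSeconds: number;
  batchSize: number;
}

function positiveIntOr(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === "") return fallback;
  const n = Number(value);
  if (!(n > 0) || Math.floor(n) !== n) {
    throw new Error(`Expected a positive integer, got "${value}"`);
  }
  return n;
}

function readFleetIngest(env: EnvVars): FleetIngestSettings {
  return {
    enabled: (env["FleetIngest__Enabled"] || "").toLowerCase() === "true",
    url: env["FleetIngest__Url"] || "",
    token: env["FleetIngest__Token"] || "",
    intervalSeconds: positiveIntOr(env["FleetIngest__IntervalSeconds"], 60),
    batchSize: positiveIntOr(env["FleetIngest__BatchSize"], 5000),
  };
}

// Registered only when RunMode=Client (the default) AND FleetIngest__Enabled=true.
function shouldRunPushService(env: EnvVars): boolean {
  const runMode = (env["Application__RunMode"] || "Client").toLowerCase();
  return runMode === "client" && readFleetIngest(env).enabled;
}
```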

Restart the customer's portal container — the FleetPushService starts on its own. Verify the audit trail on the Admin side:

docker exec <admin-timescale> psql -U power_user -d admin_fleet -c `
  'SELECT \"BatchType\",\"RowsAccepted\",\"ReceivedAt\" FROM fleet.\"IngestEvents\" ORDER BY \"ReceivedAt\" DESC LIMIT 10;'
