Diseri Pearson c5787a7a7f Phase 15: Admin operator surface + fleet dashboards + onboarding docs

The Admin stack now has a usable operator UI for managing the fleet.
End-to-end verified locally: Client pushes → Admin dashboard reflects
the activity within the CA refresh window.

Backend (Admin-only)
- FleetQueryService: dashboard headline (totals, active count, today's
  measurements + kWh from the hourly_per_device CA) and per-customer
  detail (sites, devices, last 50 measurements, last 20 ingest events).
- /api/fleet/dashboard and /api/fleet/customers/{id}/detail endpoints.
- DTOs added; Program.cs wires the service + endpoints under RunMode=Admin.

Frontend
- DashboardPage now branches on RunMode — Admin renders the fleet
  headline (statistic cards + customer summary table with lag tags),
  Client keeps the existing placeholder.
- AdminCustomerDetailPage drills into one customer: descriptions card +
  tabs for Recent ingest (with rejection counts, batch sizes, time-spread
  for visible firmware-replay waves), Recent measurements, Sites, Devices.
- AdminCustomersPage rows are clickable → /admin/customers/:id (skips
  the click when target is a button/popover so action buttons still work).
- App.tsx adds the /admin/customers/:id route, RequireRole-gated.

Grafana
- grafana/dashboards-admin/fleet-overview.json — 4 stat panels (active
  customers, total, last-24h samples, today's kWh) plus 2 time series
  (per-customer active power, per-customer hourly kWh). Reads from
  fleet.hourly_per_device CA.
- grafana/dashboards-admin/customer-drilldown.json — parameterized by
  $customer (template variable querying fleet.Customers). Per-device
  active power, cumulative kWh, recent ingest events table.

Docs
- README: Phase 15 section describing the new admin UI surface +
  pointer to the dashboards-admin folder.
- OPERATIONS: new "Fleet aggregator (Admin stack)" section covering
  one-time provisioning (Admin portal + Admin Grafana), end-to-end
  customer-onboarding workflow (register on Admin → drop token in
  customer .env → restart → verify in UI/SQL), common ops (rotate
  token, disable, investigate, compression stats, force CA refresh,
  decommission), and Admin-DB backup notes.
- README decommissioning note now mentions deleting from fleet.Customers
  if the customer was registered for aggregation.

Verified end-to-end
- Phase 14's Client + Admin stacks rebuilt with Phase 15 code.
- /api/fleet/dashboard returns correct totals (1 customer, 1 active,
  measurements + kWh derived from CA).
- /api/fleet/customers/{id}/detail returns sites, devices, recent
  measurements, recent ingest events.
- Ingested a fresh measurement on Client → after CA refresh, totals
  in Admin dashboard advance correctly.
- All 53 tests still passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 10:27:55 +02:00


Tau Acuvim Portal — Operations

Per-customer deployment loop. For background, architecture, and security model, read the README first.

Prerequisites (per host)
Provisioning a new customer
Updating a customer's stack
Rotating secrets
Backup & restore
Health & monitoring
Troubleshooting
Decommissioning a customer
Fleet aggregator (Admin stack)

Prerequisites (per host)

These are set up once per host that runs customer stacks, not once per customer.

Docker Engine (or Docker Desktop on Windows hosts).
External Traefik instance — running on the same host, joined to a Docker network named traefik-public. Configured with:
- Two entrypoints: web (80), websecure (443).
- A certificate resolver named le (Let's Encrypt via DNS-01 or HTTP-01).
- HTTP → HTTPS redirect.
- Docker provider with exposedByDefault: false.
Wildcard DNS + TLS cert for *.portal.example.com (or whatever your customer subdomain pattern is).

The traefik-public Docker network exists:

docker network create traefik-public        # one-time

The portal image is built (or pull-able from a registry):

cd /path/to/portal
docker compose -f docker-compose.prod.yml build

Provisioning a new customer

Goal: spin up an isolated stack for customer ABC0001 (Compose project abc0001 — lowercase required) at abc0001.portal.example.com.

1. Create the customer directory

A common pattern: one directory per customer holding only an .env file (the compose files are shared from the repo). Adjust to your fleet-management tool of choice (Ansible, Portainer, Helm-on-K8s later).

/srv/portal/abc0001/
  └── .env

2. Generate strong secrets

openssl rand -base64 32      # POSTGRES_PASSWORD
openssl rand -base64 32      # GRAFANA_ADMIN_PASSWORD
openssl rand -base64 32      # Authentication__DefaultAdminPassword

3. Fill in `.env`

COMPOSE_PROJECT_NAME=abc0001
CUSTOMER_HOST=abc0001.portal.example.com

POSTGRES_DB=power_monitoring
POSTGRES_USER=power_user
POSTGRES_PASSWORD=<from step 2>

Authentication__DefaultAdminEmail=admin@abc0001.example.com
Authentication__DefaultAdminPassword=<from step 2>

GRAFANA_ADMIN_PASSWORD=<from step 2>
Grafana__EmbedPathPrefix=/grafana
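
Steps 1 to 3 can be scripted. The following is a minimal sketch, not something shipped in the repo: gen_secret and render_env are hypothetical helper names, the admin e-mail pattern is copied from the example above, and the key names are the ones listed in step 3. It lowercases the customer code for COMPOSE_PROJECT_NAME and draws each secret the same way step 2 does.

```shell
# gen_secret: 32 random bytes, base64-encoded, as in step 2.
# Falls back to /dev/urandom + base64 if openssl is not installed.
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -base64 32
  else
    head -c 32 /dev/urandom | base64
  fi
}

# render_env CODE BASE_DOMAIN: print a customer .env on stdout.
# CODE is lowercased because the Compose project name must be lowercase.
render_env() {
  local code="$1" base_domain="$2" project
  project="$(printf '%s' "$code" | tr '[:upper:]' '[:lower:]')"
  cat <<EOF
COMPOSE_PROJECT_NAME=${project}
CUSTOMER_HOST=${project}.${base_domain}

POSTGRES_DB=power_monitoring
POSTGRES_USER=power_user
POSTGRES_PASSWORD=$(gen_secret)

Authentication__DefaultAdminEmail=admin@${project}.example.com
Authentication__DefaultAdminPassword=$(gen_secret)

GRAFANA_ADMIN_PASSWORD=$(gen_secret)
Grafana__EmbedPathPrefix=/grafana
EOF
}

# Example (umask 077 keeps the new file readable by its owner only):
# (umask 077; mkdir -p /srv/portal/abc0001 && \
#    render_env ABC0001 portal.example.com > /srv/portal/abc0001/.env)
```

Writing to stdout rather than straight to a file makes it easy to review the result before it lands in /srv/portal.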

4. Decide Grafana auth mode

Anonymous is off in the prod compose by default. Pick one of the three options from the README's Security notes and wire it before exposing the stack to anyone:

(a) Traefik forwardAuth → add the middleware to the traefik.http.routers.${COMPOSE_PROJECT_NAME}-grafana labels and implement /api/auth/check on the portal.
(b) Grafana auth.proxy → set the GF_AUTH_PROXY_ENABLED=true and GF_AUTH_PROXY_HEADER_NAME=X-WEBAUTH-USER env vars; ensure Traefik (or the portal) sets the header and that a client-supplied copy of it is never passed through to Grafana.
(c) Render tokens → minted by a portal endpoint; SPA appends ?auth_token=... to the iframe URL.

Without any of these, Grafana refuses anonymous access in prod (intended safe default — iframe will show a login page).

5. Bring it up

cd /srv/portal/abc0001
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml up -d

6. Verify

# Wait for healthy
docker ps --filter "label=com.docker.compose.project=abc0001"

# Health checks
curl -fs https://abc0001.portal.example.com/health         # → Healthy
curl -fs https://abc0001.portal.example.com/health/ready   # → Healthy

# Migration + seed in the logs
docker logs abc0001_portal | grep -E "Applied migration|Seeded|hypertable"
# expect:
#   Applied migration 'InitialCreate'
#   TimescaleDB hypertable for monitoring.PowerMeasurements is ready
#   Seeded default admin admin@abc0001.example.com
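
Containers rarely report healthy on the first probe, so the health checks above usually need a retry loop. A small sketch, assuming nothing beyond a POSIX-style shell; wait_until is a hypothetical helper name, not part of the portal tooling:

```shell
# wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it exits 0; give up after TRIES attempts.
wait_until() {
  local tries="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example: poll readiness up to 30 times, 2 seconds apart.
# wait_until 30 2 curl -fs https://abc0001.portal.example.com/health/ready
```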

Sign in as Authentication__DefaultAdminEmail with the password from step 2.
Settings → Users → create the customer's real admin account; toggle Admin on.
Sign out, sign in as the customer admin, change the default admin password (or delete the default admin account if the customer admin is the only one needed).
Settings → Branding → upload customer logo, apply colours.
Settings → Rates → seed at least one municipality + tariff for cost calc.
Sites → create the customer's sites/devices so the ingest pipeline knows where measurements belong.

Updating a customer's stack

Code-only update (no migrations, no compose changes)

docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml \
  up -d --build portal

Brief downtime while the new container starts. The DB is untouched.

Update with new migrations

Same command — MigrateAsync on startup applies pending migrations before the app accepts traffic. Watch the logs:

docker logs -f abc0001_portal | grep -E "Applied migration|Failed|hypertable"

If a migration fails the container will exit; fix forward, push a corrected image, retry.

Compose changes (env vars, ports, labels)

Edit the customer's .env (or the central docker-compose.prod.yml) and:

docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml up -d

Compose recreates only the containers whose definition changed.

Rolling many customers

There's no built-in fan-out — pick your orchestrator (Ansible playbook, simple bash loop, Portainer stacks). Update one customer first, verify, then roll the rest.
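
For the "simple bash loop" option, a sketch along these lines may be a starting point. It assumes the /srv/portal/<customer>/.env layout from "Provisioning a new customer"; PORTAL_COMPOSE, update_one, verify_one and roll_customers are names invented for this example. Customers are processed in alphabetical order and the loop stops at the first failure, so the first customer effectively acts as the canary.

```shell
# Path to the shared compose file (variable name is made up for this sketch).
PORTAL_COMPOSE="${PORTAL_COMPOSE:-/path/to/portal/docker-compose.prod.yml}"

# update_one DIR: rebuild and restart the portal for the customer whose .env is in DIR.
update_one() {
  (cd "$1" && docker compose --env-file .env -f "$PORTAL_COMPOSE" up -d --build portal)
}

# verify_one DIR: read CUSTOMER_HOST from DIR/.env and probe the readiness endpoint.
verify_one() {
  local host
  host="$(sed -n 's/^CUSTOMER_HOST=//p' "$1/.env")"
  curl -fs --retry 10 --retry-delay 3 "https://${host}/health/ready" >/dev/null
}

# roll_customers BASE_DIR: update every BASE_DIR/<customer>/ that has a .env,
# stopping at the first update or verification failure.
roll_customers() {
  local base="$1" dir
  for dir in "$base"/*/; do
    dir="${dir%/}"
    [ -f "$dir/.env" ] || continue
    echo "updating ${dir##*/}"
    if ! update_one "$dir" || ! verify_one "$dir"; then
      echo "stopped at ${dir##*/}" >&2
      return 1
    fi
  done
}

# Example:
# roll_customers /srv/portal
```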

Rotating secrets

Database password

# 1. Change the password inside Postgres
docker exec -it abc0001_timescale psql -U power_user -d power_monitoring \
  -c "ALTER USER power_user WITH PASSWORD '<new>';"

# 2. Update .env ('|' as the sed delimiter, because base64 secrets can contain '/')
sed -i 's|^POSTGRES_PASSWORD=.*|POSTGRES_PASSWORD=<new>|' .env

# 3. Recreate the portal + grafana to pick up new env vars
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml \
  up -d portal grafana

Grafana admin password

sed -i 's|^GRAFANA_ADMIN_PASSWORD=.*|GRAFANA_ADMIN_PASSWORD=<new>|' .env
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml \
  up -d grafana

GF_SECURITY_ADMIN_PASSWORD is re-applied on container start.
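
Both rotations rewrite a single key in .env. Where the new value may contain characters that are special to sed, a helper that never interprets the value can be safer. This is a sketch only; set_env_var is a hypothetical name, and it assumes one KEY=value pair per line.

```shell
# set_env_var FILE KEY VALUE
# Replace the KEY=... line in FILE (or append it if missing), keeping line order.
# The value is handed to awk through the environment, so characters such as
# / & + = need no escaping.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp "${file}.XXXXXX")" || return 1
  ENV_KEY="$key" ENV_VALUE="$value" awk '
    BEGIN { k = ENVIRON["ENV_KEY"]; v = ENVIRON["ENV_VALUE"]; seen = 0 }
    index($0, k "=") == 1 { print k "=" v; seen = 1; next }
    { print }
    END { if (!seen) print k "=" v }
  ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  chmod 600 "$tmp"
  mv "$tmp" "$file"
}

# Example:
# set_env_var .env GRAFANA_ADMIN_PASSWORD "$(openssl rand -base64 32)"
```

Writing to a temporary file and renaming it means an interrupted run leaves the original .env intact.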

Default admin password

Once the customer admin exists and has changed their own password, the default admin can be deleted from the Settings → Users UI. After that, Authentication__DefaultAdminPassword is only used if the row is re-seeded (which happens only when no account with that email exists).

Backup & restore

What to back up

Volume	What's in it	Frequency
`<PREFIX>_timescale-data`	All customer data (Identity, branding, tariffs, sites, devices, measurements)	Daily, more for high-write customers
`<PREFIX>_grafana-data`	Grafana's internal SQLite (user prefs, plugin state). Dashboards re-provision from JSON so this is not authoritative.	Weekly is plenty
`<PREFIX>_portal-branding`	Uploaded logos	Daily
`<PREFIX>_portal-keys`	Data Protection key ring (cookie signing). Losing this invalidates all sessions but doesn't lose data.	Weekly

Postgres dump

docker exec abc0001_timescale \
  pg_dump -U power_user -d power_monitoring -F c -f /tmp/backup.dump
docker cp abc0001_timescale:/tmp/backup.dump ./abc0001-$(date +%Y%m%d).dump
docker exec abc0001_timescale rm /tmp/backup.dump

The pg_dump bundled in the TimescaleDB container handles hypertables (natively as of PG12+), so the command above is sufficient. Run it inside the container, as shown, so that the pg_dump and server versions match.
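
For a scheduled daily dump, the three commands above can be wrapped and paired with a pruning step. This is a sketch under assumptions: dump_customer and prune_dumps are hypothetical names, the DOCKER variable exists only so the docker binary can be swapped out, and the retention count is the caller's choice (the table above gives frequencies, not retention).

```shell
DOCKER="${DOCKER:-docker}"

# dump_customer PROJECT OUT_DIR: dump, copy out with a dated name, clean up.
dump_customer() {
  local project="$1" out_dir="$2" stamp
  stamp="$(date +%Y%m%d)"
  "$DOCKER" exec "${project}_timescale" \
    pg_dump -U power_user -d power_monitoring -F c -f /tmp/backup.dump &&
  "$DOCKER" cp "${project}_timescale:/tmp/backup.dump" \
    "${out_dir}/${project}-${stamp}.dump" &&
  "$DOCKER" exec "${project}_timescale" rm /tmp/backup.dump
}

# prune_dumps PROJECT OUT_DIR KEEP: delete all but the KEEP newest dumps.
# YYYYMMDD names sort chronologically, so a reverse name sort is newest-first.
# Only PROJECT-<8 digits>.dump files are considered; other files are left alone.
prune_dumps() {
  local project="$1" out_dir="$2" keep="$3" f
  ls "$out_dir" | grep "^${project}-[0-9]\{8\}\.dump$" | sort -r |
    tail -n "+$((keep + 1))" |
    while IFS= read -r f; do
      rm -f "$out_dir/$f"
    done
}

# Example: dump today, then keep the 14 most recent daily dumps.
# dump_customer abc0001 /srv/backups && prune_dumps abc0001 /srv/backups 14
```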

Volume snapshot

For non-DB volumes, the simplest option is a tar of the volume's mountpoint; alternatively, use your storage layer's snapshot facility (LVM, ZFS, EBS, etc.).
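
One common way to take that tar without locating the mountpoint on the host is to mount the volume read-only into a throwaway container. A sketch, assuming the alpine image is acceptable on the host and OUT_DIR is an absolute host path; backup_volume is a hypothetical name and DOCKER exists only so the binary can be swapped out.

```shell
DOCKER="${DOCKER:-docker}"

# backup_volume VOLUME OUT_DIR: archive a named volume to
# OUT_DIR/VOLUME-YYYYMMDD.tgz via a short-lived container.
backup_volume() {
  local volume="$1" out_dir="$2" stamp
  stamp="$(date +%Y%m%d)"
  "$DOCKER" run --rm \
    -v "${volume}:/source:ro" \
    -v "${out_dir}:/backup" \
    alpine tar czf "/backup/${volume}-${stamp}.tgz" -C /source .
}

# Example:
# backup_volume abc0001_portal-branding /srv/backups
```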

Restore

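# Fresh DB (stop the portal and Grafana first so nothing is connected during the restore)
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env stop portal grafana
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env down timescaledb
docker volume rm abc0001_timescale-data
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env up -d timescaledb
until docker exec abc0001_timescale pg_isready -U power_user -d power_monitoring; do sleep 2; done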

# Restore
docker cp abc0001-YYYYMMDD.dump abc0001_timescale:/tmp/backup.dump
docker exec abc0001_timescale \
  pg_restore -U power_user -d power_monitoring --clean --if-exists /tmp/backup.dump
docker exec abc0001_timescale rm /tmp/backup.dump

# Start everything
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env up -d

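The TimescaleBootstrapper is idempotent: it will not error on a restored hypertable. If pg_restore itself reports errors on TimescaleDB catalog objects, note that TimescaleDB's documented restore procedure runs SELECT timescaledb_pre_restore(); before pg_restore and SELECT timescaledb_post_restore(); after it.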

Health & monitoring

Liveness / readiness

GET /health — liveness. Use as Traefik / load-balancer health check.
GET /health/ready — readiness (DB reachable). Use for orchestration "in service" decisions.

Logs

Serilog writes JSON to stdout; the Docker logging driver of your choice (json-file, journald, gelf to a central log store) picks it up.

docker logs abc0001_portal --tail 200 --follow

Notable lines:

Database connection resolved via … — confirms how this container resolved its DB at startup.
Applied migration '…' — one per pending migration.
TimescaleDB hypertable for monitoring.PowerMeasurements is ready — bootstrapper succeeded.
Seeded default admin … — first start only; absence on subsequent starts is correct.
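
A start-up check can be built on those lines. The sketch below reads log text on stdin and looks for two markers that should appear on every start, plus the "Failed" keyword used in the migration grep earlier. check_startup_log is a hypothetical name, and matching on these substrings is an assumption about how the messages appear in the JSON output.

```shell
# check_startup_log: read portal log text on stdin; exit 0 only if the
# start-up markers are present and no line mentions a failure.
check_startup_log() {
  local log marker status=0
  log="$(cat)"
  for marker in "Database connection resolved via" "hypertable"; do
    if ! printf '%s\n' "$log" | grep -qF "$marker"; then
      echo "missing: $marker" >&2
      status=1
    fi
  done
  if printf '%s\n' "$log" | grep -qF "Failed"; then
    echo "found a line containing 'Failed'" >&2
    status=1
  fi
  return "$status"
}

# Example:
# docker logs abc0001_portal 2>&1 | check_startup_log
```

"Applied migration" and "Seeded default admin" are deliberately not required, since they only appear when there is something to migrate or seed.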

DB health from the host

docker exec abc0001_timescale pg_isready -U power_user -d power_monitoring

TimescaleDB chunks

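docker exec -it abc0001_timescale psql -U power_user -d power_monitoring -c \
  "SELECT s.chunk_name, c.range_start, c.range_end, s.total_bytes
   FROM chunks_detailed_size('monitoring.\"PowerMeasurements\"') s
   JOIN timescaledb_information.chunks c
     ON c.chunk_schema = s.chunk_schema AND c.chunk_name = s.chunk_name
   ORDER BY c.range_start;"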

Troubleshooting

Symptom	First check
Portal container restart-looping	`docker logs <PREFIX>_portal` — usually a missing env var (default-admin password in prod, missing Postgres password) or a migration failure.
`/health/ready` returns Unhealthy	Postgres container down, or wrong creds. `docker logs <PREFIX>_timescale`.
Grafana iframe loads but no charts	Datasource UID mismatch — confirm `grafana/provisioning/datasources/timescaledb.yml` has `uid: timescaledb` and the dashboard JSON references the same.
Grafana iframe shows login screen in prod	Expected if no auth mode is wired yet (anonymous off by default). Pick a mode (see README → Security).
Branded logo missing after restart	`<PREFIX>_portal-branding` volume not mounted, or filesystem perms wrong. The container runs as user `app`; volume must be writable by uid 1000.
Ingest returns `accepted: 0, rejected: N`	Devices don't exist for those `externalId`s. Create them via the Sites screen first.
Cookie auth seems random / sessions lost on restart	`<PREFIX>_portal-keys` volume not mounted — Data Protection re-keys on every start.
Hypertable error on startup	Pre-existing non-empty plain table being converted. `migrate_data => TRUE` should handle it; if not, restore from backup and check for manual `monitoring."PowerMeasurements"` schema changes.

Decommissioning a customer

# Final backup
docker exec abc0001_timescale pg_dump -U power_user -d power_monitoring \
  -F c -f /tmp/final.dump
docker cp abc0001_timescale:/tmp/final.dump ./abc0001-final-$(date +%Y%m%d).dump

# Stop and remove containers
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml down

# Remove volumes (destroys data — confirm backup first)
docker volume rm \
  abc0001_timescale-data \
  abc0001_grafana-data \
  abc0001_portal-branding \
  abc0001_portal-keys

# Remove customer dir
rm -rf /srv/portal/abc0001

# DNS record + cert (manual or via your DNS automation)

# If using the fleet aggregator: also delete the customer from the Admin
# Customers page (UI Delete) or via psql against the central DB:
#   DELETE FROM fleet."Customers" WHERE "Code" = 'ABC0001';
# (cascades to Sites, Devices, PowerMeasurements, IngestEvents)
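
Because the volume removal is irreversible, it may be worth gating it on a sanity check of the final dump. A minimal sketch: require_dump is a hypothetical name, and the check relies on pg_dump custom-format archives (-F c) starting with the bytes "PGDMP". It confirms the file looks like a dump, not that it restores cleanly.

```shell
# require_dump FILE: succeed only if FILE is a non-empty pg_dump
# custom-format archive (identified by its leading "PGDMP" bytes).
require_dump() {
  local file="$1" magic
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  magic="$(head -c 5 "$file")"
  [ "$magic" = "PGDMP" ] || { echo "not a custom-format dump: $file" >&2; return 1; }
}

# Example: only remove the data volume once the final dump checks out.
# require_dump "./abc0001-final-$(date +%Y%m%d).dump" &&
#   docker volume rm abc0001_timescale-data
```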

Fleet aggregator (Admin stack)

For background and the full design see docs/FLEET-DESIGN.md. This section covers the day-to-day ops.

One-time: provisioning the Admin stack

# 1. Create a dedicated Postgres DB for the central fleet
docker exec <timescale-container> createdb -U power_user admin_fleet

# 2. Spin up the Admin portal (same image as a customer stack, different env)
docker run -d --name admin-portal --restart unless-stopped \
  --network <shared-network> \
  -e Application__RunMode=Admin \
  -e ASPNETCORE_ENVIRONMENT=Production \
  -e Application__PublicUrl=https://admin.portal.example.com \
  -e Database__ConnectionString='Host=<host>;Port=5432;Database=admin_fleet;Username=power_user;Password=<secret>' \
  -e Authentication__DefaultAdminEmail=ops@yourco.example \
  -e Authentication__DefaultAdminPassword=<strong> \
  -v admin-portal-keys:/data/keys \
  -v admin-portal-branding:/data/branding \
  tau-acuvim-portal:latest

# 3. (Optional) Spin up an Admin-side Grafana pointed at admin_fleet
docker run -d --name admin-grafana --restart unless-stopped \
  --network <shared-network> \
  -e GF_SECURITY_ADMIN_PASSWORD=<strong> \
  -e GF_SECURITY_ALLOW_EMBEDDING=true \
  -e GF_AUTH_ANONYMOUS_ENABLED=false \
  -e POSTGRES_DB=admin_fleet \
  -e POSTGRES_USER=power_user \
  -e POSTGRES_PASSWORD=<secret> \
  -v admin-grafana-data:/var/lib/grafana \
  -v /srv/portal/grafana/provisioning:/etc/grafana/provisioning:ro \
  -v /srv/portal/grafana/dashboards-admin:/var/lib/grafana/dashboards:ro \
  grafana/grafana:11.4.0

Behind Traefik: add labels on admin-portal and admin-grafana mirroring the per-customer pattern, with the rule Host(`admin.portal.example.com`) and, for Grafana, an additional && PathPrefix(`/grafana`). Choose a Grafana auth mode from the README's Security notes (forwardAuth / auth.proxy / render tokens) before exposing the stack.

Onboarding a new customer end-to-end

# A. Admin side — register and capture token (one-time per customer)
#    1. Sign in to https://admin.portal.example.com
#    2. Customers → "Register customer" → Code=ABC0001, Name=Acme Corp
#    3. Copy the token shown ONCE.

# B. Customer side
#    1. Spin up their stack (see "Provisioning a new customer" above).
#    2. Append the fleet settings to their .env (quoted 'EOF' so the shell
#       does not expand anything inside the token):
cat >> /srv/portal/abc0001/.env <<'EOF'
Application__RunMode=Client
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<token from step A.3>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
EOF

#    3. Restart the customer's portal so the push service starts.
cd /srv/portal/abc0001
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env \
  up -d portal

# C. Verify
#    1. In Admin UI → Customers, the "Last push" time for ABC0001 should advance within a minute.
#    2. Click the row → Customer detail → "Recent ingest" tab should list site and device
#       batches (and measurement batches once any are ingested locally).
#    3. From the host:
docker exec <admin-timescale> psql -U power_user -d admin_fleet -c \
  'SELECT "BatchType","RowsAccepted","ReceivedAt" FROM fleet."IngestEvents" ORDER BY "ReceivedAt" DESC LIMIT 10;'
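
If the push never shows up, a frequent cause is a fleet setting that was not filled in on the customer side. The following sketch checks the customer's .env for the keys added in step B before the portal is restarted; check_fleet_env is a hypothetical name. It treats a value that is empty or still starts with "<" as unset, and uses the last occurrence if a key appears more than once.

```shell
# check_fleet_env FILE: report FleetIngest settings that are missing, empty,
# or still a <placeholder>; exit non-zero if any are.
check_fleet_env() {
  local file="$1" key value status=0
  for key in Application__RunMode FleetIngest__Enabled FleetIngest__Url FleetIngest__Token; do
    value="$(sed -n "s/^${key}=//p" "$file" | tail -n 1)"
    case "$value" in
      "" | "<"*)
        echo "not set: $key" >&2
        status=1
        ;;
    esac
  done
  return "$status"
}

# Example:
# check_fleet_env /srv/portal/abc0001/.env && echo "fleet settings look complete"
```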

Common ops

What	Where
Rotate a customer's push token	Admin UI → Customers → row's "Rotate token" button. Update customer's `.env` and restart their portal. Brief push gap (until restart) is expected.
Disable a customer (stop accepting their data)	Admin UI → Customers → Edit → Active off. Ingest returns 401 immediately; data already in `fleet.*` is untouched.
Investigate "why hasn't ABC0001 shown up?"	Customer detail page → Recent ingest tab. Check for 401s, rejected rows, error messages. Or: `SELECT * FROM fleet."IngestEvents" WHERE "CustomerId" = '<id>' ORDER BY "ReceivedAt" DESC;`
Inspect compression	`SELECT * FROM hypertable_compression_stats('fleet."PowerMeasurements"');`
Force a continuous aggregate refresh	`CALL refresh_continuous_aggregate('fleet.hourly_per_device', NULL, NULL);`
Decommission a customer from the fleet	Admin UI → Customers → Delete (cascades sites/devices/measurements/events). Customer's local stack is untouched; their portal will get 401s on push until they disable `FleetIngest__Enabled` or you re-register them.

Backing up the central DB

Use the same pg_dump pattern as for a customer DB (see "Postgres dump" above), targeting admin_fleet. The dump includes the hypertable chunks. After pg_restore, start the Admin portal once so it re-bootstraps the continuous aggregate refresh policy (FleetTimescaleBootstrapper is idempotent).
