Tau.Acuvim/portal/OPERATIONS.md
Diseri Pearson e2cbb83397 Portal: Client Dashboard, Measurements page, Excel exports, Grafana auth, RLS
A bundle of related portal work — picked up while ensuring per-customer
isolation actually works end-to-end and replacing the placeholder Client
landing page. Build green, full test suite 66/66.

Frontend — Client surface
- DashboardPage: replace placeholder with 4 KPI cards (kWh, current kW,
  active devices, estimated cost), 24h active-power ECharts mini-chart,
  per-device "today/range" table, and a date-range picker with shortcuts
  (Today / 7d / 30d / This month / Custom). 30s auto-refresh.
- New Measurements page (/measurements, Client mode, any authenticated
  user) with multi-select device filter, full date range incl. an
  "All time" shortcut, server-paginated preview, and Excel export.
- "Export to Excel" buttons on: Client Dashboard summary, Client Dashboard
  raw measurements, Admin fleet dashboard, Admin customer-detail Cost tab.
- DashboardsPage sidebar items: let the menu item grow and reset
  line-height so the two-line title+description doesn't crush.

Frontend — Admin / user mgmt
- RestrictedAdmin role: admin who only sees their assigned customers.
  New UserFormDrawer choice + CustomerAccessModal for granting/revoking
  per-customer access; surfaced from the Users page.

Backend
- ClosedXML 0.104.2 + ExcelExportService (pure formatter; frozen header,
  currency/kWh/kW/date number formats, AdjustToContents).
- DashboardSummaryService computes per-device totals + estimated cost
  (hourly bucketing × site's municipality's active tariff, mirroring
  FleetCostService for the Admin side).
- New endpoints:
    GET  /api/dashboard/summary[+/export.xlsx]
    GET  /api/measurements/raw[+/export.xlsx]   (deviceIds, paginated)
    GET  /api/sites/devices                     (flat list w/ site name)
    GET  /api/fleet/dashboard/export.xlsx
    GET  /api/fleet/customers/{id}/cost/export.xlsx
    GET  /api/auth/check                        (cookie-only liveness)
- AdminCustomerAccess: per-user customer scoping for RestrictedAdmin via
  Postgres-row-level filter — RlsContext (per-DI-scope state) +
  CustomerFilterMiddleware (populates from claims after auth) +
  fleet.* DbSets gain HasQueryFilter expressions. Bootstrappers
  Elevate() to bypass the filter for trusted system code.
- Migration: 20260518095759_AddAdminCustomerAccess (mapping table,
  composite PK on UserId+CustomerId).

Infra / templating (the "spin it up via the template" piece)
- docker-compose.prod.yml + docker-compose.yml: pass WhiteLabel__*,
  Application__RunMode, FleetIngest__* through to the container as
  ${VAR:-default} substitutions. Previously these were silently dropped
  in prod — a customer's .env settings for branding/fleet-push never
  reached the running process. Latent bug, fixed.
- docker-compose.prod.yml: forwardAuth middleware labels on the
  Grafana router pointing at /api/auth/check. Option (a) from the
  README's three prod-auth modes — every Grafana request now gates on
  a valid portal cookie. Anonymous stays off.
- .env.example rewritten with a Client section, optional FleetIngest
  block, and an Admin variant block — annotated on what's required vs.
  optional and where the seed-only-on-first-boot caveat applies.
- README "Grafana embedding" table: option (a) now marked active with
  an inline note on how to switch modes later.
- OPERATIONS.md step 3 includes the white-label pre-brand .env snippet;
  step 4 (formerly "decide Grafana auth mode") updated to reflect
  that auth is wired by default.

Tests
- New BrandingSeedFromOptionsTests (5 tests) pins the env-var → IOptions
  → DB seed contract: first read seeds from options; subsequent reads
  return the DB row (UI edits survive restarts); EnsureSeededAsync is
  idempotent; UpdateAsync falls back to options for blanked fields.
- CustomerTokenGraceTests helper: pass the new RlsContext to
  AdminDbContext (SetAll() so existing semantics hold).

Verified end-to-end
- Real Docker spin-up with WhiteLabel__* in a throwaway .env →
  /api/branding returned all six fields verbatim (ApplicationName,
  LogoUrl, three colors, FooterText).
- curl login → /api/dashboard/summary returned valid JSON →
  /api/dashboard/summary/export.xlsx returned a 6.9 KB file the
  `file` command identifies as "Microsoft Excel 2007+".
- /api/measurements/raw with and without deviceIds filter returned
  correct paginated rows; /export.xlsx with filter produced a valid
  7.1 KB xlsx with the meter count in the filename.
- Frontend tsc -b clean; backend dotnet build 0/0; xunit 66/66.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:15:44 +02:00


Tau Acuvim Portal — Operations

Per-customer deployment loop. For background, architecture, and security model, read the README first.


Contents

  1. Prerequisites (per host)
  2. Provisioning a new customer
  3. Updating a customer's stack
  4. Rotating secrets
  5. Backup & restore
  6. Health & monitoring
  7. Troubleshooting
  8. Decommissioning a customer
  9. Fleet aggregator (Admin stack)

Prerequisites (per host)

Set these up once per host that runs customer stacks, not once per customer.

  1. Docker Engine (or Docker Desktop on Windows hosts).
  2. External Traefik instance — running on the same host, joined to a Docker network named traefik-public. Configured with:
    • Two entrypoints: web (80), websecure (443).
    • A certificate resolver named le (Let's Encrypt via DNS-01 or HTTP-01).
    • HTTP → HTTPS redirect.
    • Docker provider with exposedByDefault: false.
  3. Wildcard DNS + TLS cert for *.portal.example.com (or whatever your customer subdomain pattern is).
  4. The traefik-public Docker network exists:
    docker network create traefik-public        # one-time
    
  5. The portal image is built (or pull-able from a registry):
    cd /path/to/portal
    docker compose -f docker-compose.prod.yml build
    

Provisioning a new customer

Goal: spin up an isolated stack for customer ABC0001 (Compose project abc0001 — lowercase required) at abc0001.portal.example.com.
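
The naming convention above can be derived mechanically from the customer code. A minimal sketch, assuming the portal.example.com pattern used in this document; project_name and customer_host are hypothetical helper names, not part of the repo:

```shell
# Hypothetical helpers: derive the Compose project name and hostname
# from a customer code such as ABC0001.
project_name() {
  # Compose project names must be lowercase
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

customer_host() {
  # The domain suffix is the example pattern from this document
  printf '%s.portal.example.com' "$(project_name "$1")"
}

project_name ABC0001; echo
customer_host ABC0001; echo
```

Deriving both values from one input keeps the directory name, COMPOSE_PROJECT_NAME, and CUSTOMER_HOST from drifting apart.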

1. Create the customer directory

A common pattern: one directory per customer holding only an .env file (the compose files are shared from the repo). Adjust to your fleet-management tool of choice (Ansible, Portainer, Helm-on-K8s later).

/srv/portal/abc0001/
  └── .env

2. Generate strong secrets

openssl rand -base64 32      # POSTGRES_PASSWORD
openssl rand -base64 32      # GRAFANA_ADMIN_PASSWORD
openssl rand -base64 32      # Authentication__DefaultAdminPassword
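
Base64 output can include /, + and =, which need care when a value is later pasted into a sed expression or a connection string. If that matters for your tooling, hex secrets sidestep it. A minimal sketch; gen_secret is a hypothetical helper name, and the od fallback is only an assumption for hosts without openssl:

```shell
# Hypothetical helper: 32 random bytes as 64 hex characters (0-9, a-f only),
# so the value needs no escaping in sed, psql, or connection strings.
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 32
  else
    # Fallback: read 32 bytes from the kernel RNG and hex-encode them
    od -An -N32 -tx1 /dev/urandom | tr -d ' \n'
    echo
  fi
}

# Print one secret per variable from this step
for name in POSTGRES_PASSWORD GRAFANA_ADMIN_PASSWORD Authentication__DefaultAdminPassword; do
  printf '%s=%s\n' "$name" "$(gen_secret)"
done
```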

3. Fill in .env

Copy .env.example to the customer's directory and fill in:

COMPOSE_PROJECT_NAME=abc0001
CUSTOMER_HOST=abc0001.portal.example.com
Application__RunMode=Client

POSTGRES_DB=power_monitoring
POSTGRES_USER=power_user
POSTGRES_PASSWORD=<from step 2>

Authentication__DefaultAdminEmail=admin@abc0001.example.com
Authentication__DefaultAdminPassword=<from step 2>

GRAFANA_ADMIN_PASSWORD=<from step 2>
Grafana__EmbedPathPrefix=/grafana

# Pre-brand the stack so the customer's first sign-in already shows their
# colours and name. Only applied on first boot; later changes are via the UI.
WhiteLabel__ApplicationName=Acme Corp Power Monitoring
WhiteLabel__PrimaryColor=#0c4a6e
WhiteLabel__SecondaryColor=#0e7490
WhiteLabel__AccentColor=#06b6d4
WhiteLabel__FooterText=© Acme Corp

See .env.example for the full annotated set including the optional FleetIngest__* block (added later, when you enable fleet aggregation for this customer).
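
Before bringing the stack up, it can help to check that the required keys are filled in. The following is a sketch only: check_env is a hypothetical helper, and the list of required keys is taken from the snippet above rather than from the application itself.

```shell
# Hypothetical pre-flight check for a customer .env file.
# Reports keys that are missing, empty, or still hold a "<...>" placeholder,
# and rejects an uppercase COMPOSE_PROJECT_NAME. Returns non-zero on any problem.
check_env() {
  env_file=$1
  rc=0
  for key in COMPOSE_PROJECT_NAME CUSTOMER_HOST POSTGRES_PASSWORD \
             GRAFANA_ADMIN_PASSWORD Authentication__DefaultAdminPassword; do
    # Use the last assignment of the key and strip the "KEY=" prefix
    value=$(grep "^${key}=" "$env_file" | tail -n 1 | cut -d= -f2-)
    case "$value" in
      ""|"<"*) echo "missing: $key"; rc=1 ;;
    esac
  done
  name=$(grep '^COMPOSE_PROJECT_NAME=' "$env_file" | tail -n 1 | cut -d= -f2-)
  case "$name" in
    *[A-Z]*) echo "COMPOSE_PROJECT_NAME must be lowercase: $name"; rc=1 ;;
  esac
  return "$rc"
}

# Example (not executed here):
#   check_env /srv/portal/abc0001/.env && echo "env looks complete"
```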

4. Grafana auth (already wired)

Production Grafana embedding uses Traefik forwardAuth → portal /api/auth/check, defined inline on the Grafana router in docker-compose.prod.yml. Every Grafana sub-request is gated on a valid portal cookie; anonymous is off. No per-customer action required.

To switch to a different mode (e.g. auth.proxy for per-user Grafana folders), see the README "Grafana embedding — production auth" section.

5. Bring it up

cd /srv/portal/abc0001
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml up -d

6. Verify

# Wait for healthy
docker ps --filter "label=com.docker.compose.project=abc0001"

# Health checks
curl -fs https://abc0001.portal.example.com/health         # → Healthy
curl -fs https://abc0001.portal.example.com/health/ready   # → Healthy

# Migration + seed in the logs
docker logs abc0001_portal | grep -E "Applied migration|Seeded|hypertable"
# expect:
#   Applied migration 'InitialCreate'
#   TimescaleDB hypertable for monitoring.PowerMeasurements is ready
#   Seeded default admin admin@abc0001.example.com
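
The "wait for healthy" step can be scripted as a bounded retry loop. A minimal sketch; wait_for is a hypothetical helper, and the attempt count and delay in the example are arbitrary:

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    # Discard the probe's output; only its exit status matters
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (not executed here):
#   wait_for 30 2 curl -fs https://abc0001.portal.example.com/health/ready
```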

7. First login + handover

  1. Sign in as Authentication__DefaultAdminEmail with the password from step 2.
  2. Settings → Users → create the customer's real admin account; toggle Admin on.
  3. Sign out, sign in as the customer admin, change the default admin password (or delete the default admin account if the customer admin is the only one needed).
  4. Settings → Branding → upload customer logo, apply colours.
  5. Settings → Rates → seed at least one municipality + tariff for cost calc.
  6. Sites → create the customer's sites/devices so the ingest pipeline knows where measurements belong.

Updating a customer's stack

Code-only update (no migrations, no compose changes)

docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml \
  up -d --build portal

Brief downtime while the new container starts. The DB is untouched.

Update with new migrations

Same command — MigrateAsync on startup applies pending migrations before the app accepts traffic. Watch the logs:

docker logs -f abc0001_portal | grep -E "Applied migration|Failed|hypertable"

If a migration fails the container will exit; fix forward, push a corrected image, retry.

Compose changes (env vars, ports, labels)

Edit the customer's .env (or the central docker-compose.prod.yml) and:

docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml up -d

Compose recreates only the containers whose definition changed.

Rolling many customers

There's no built-in fan-out — pick your orchestrator (Ansible playbook, simple bash loop, Portainer stacks). Update one customer first, verify, then roll the rest.
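
As one illustration of the "simple bash loop" option, the sketch below updates a canary customer first and then the rest. It assumes the one-directory-per-customer layout from the provisioning section; roll_out, update_customer, and the DRY_RUN switch are hypothetical names, not part of the repo.

```shell
# Hypothetical rollout sketch: update one canary customer, then the others.
# With DRY_RUN=1 the docker commands are printed instead of executed.
COMPOSE_FILE=${COMPOSE_FILE:-/path/to/portal/docker-compose.prod.yml}

update_customer() {
  customer_dir=$1
  set -- docker compose --env-file "$customer_dir/.env" -f "$COMPOSE_FILE" up -d --build portal
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

roll_out() {
  root=$1
  canary=$2
  update_customer "$root/$canary" || return 1
  # In a real rollout, verify the canary's /health/ready here before continuing
  for d in "$root"/*/; do
    d=${d%/}
    [ -f "$d/.env" ] || continue            # skip directories without a .env
    [ "$d" = "$root/$canary" ] && continue  # canary is already updated
    update_customer "$d" || return 1        # stop at the first failure
  done
}

# Example (not executed here):
#   DRY_RUN=1 roll_out /srv/portal abc0001
```

Stopping at the first failure keeps a bad image from reaching every customer before someone looks at the logs.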


Rotating secrets

Database password

# 1. Change the password inside Postgres
docker exec -it abc0001_timescale psql -U power_user -d power_monitoring \
  -c "ALTER USER power_user WITH PASSWORD '<new>';"

# 2. Update .env
# ("|" as the sed delimiter: base64 secrets can contain "/")
sed -i 's|^POSTGRES_PASSWORD=.*|POSTGRES_PASSWORD=<new>|' .env

# 3. Recreate the portal + grafana to pick up new env vars
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml \
  up -d portal grafana

Grafana admin password

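sed -i 's|^GRAFANA_ADMIN_PASSWORD=.*|GRAFANA_ADMIN_PASSWORD=<new>|' .env
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml \
  up -d grafana
# Existing grafana-data volume: also reset the stored password
# (container name assumed to follow the <PREFIX>_grafana pattern).
docker exec abc0001_grafana grafana cli admin reset-admin-password '<new>'

Grafana reads GF_SECURITY_ADMIN_PASSWORD only when it first creates the admin user, so on an existing grafana-data volume the .env change alone does not alter the password; the reset command does. Keeping .env in sync ensures a rebuilt volume starts with the current value.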

Default admin password

Once the customer admin exists and has changed their own password, the default admin can be deleted from the Settings → Users UI. After that, Authentication__DefaultAdminPassword is only used if the row is re-seeded (which happens only when no account with that email exists).


Backup & restore

What to back up

| Volume | What's in it | Frequency |
| --- | --- | --- |
| <PREFIX>_timescale-data | All customer data (Identity, branding, tariffs, sites, devices, measurements) | Daily, more for high-write customers |
| <PREFIX>_grafana-data | Grafana's internal SQLite (user prefs, plugin state). Dashboards re-provision from JSON so this is not authoritative. | Weekly is plenty |
| <PREFIX>_portal-branding | Uploaded logos | Daily |
| <PREFIX>_portal-keys | Data Protection key ring (cookie signing). Losing this invalidates all sessions but doesn't lose data. | Weekly |

Postgres dump

docker exec abc0001_timescale \
  pg_dump -U power_user -d power_monitoring -F c -f /tmp/backup.dump
docker cp abc0001_timescale:/tmp/backup.dump ./abc0001-$(date +%Y%m%d).dump
docker exec abc0001_timescale rm /tmp/backup.dump

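The pg_dump bundled in the TimescaleDB container handles hypertables (supported natively as of PG12+), so the command above is sufficient. Run it inside the container, as shown, rather than from a host-side pg_dump, so the client version matches the server.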
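
Daily dumps accumulate, so some retention policy is usually needed. The sketch below keeps the newest N dumps for one customer; prune_dumps is a hypothetical helper, and it relies on the abc0001-YYYYMMDD.dump naming used above so that a plain sort orders files by date.

```shell
# Hypothetical retention helper: keep the newest <keep> dumps for one customer.
# Only files named <prefix>-YYYYMMDD.dump are considered; anything else
# (for example a "-final-" dump) is left alone.
prune_dumps() {
  dir=$1
  prefix=$2
  keep=$3
  total=$(ls "$dir" | grep -c "^${prefix}-[0-9]\{8\}\.dump$")
  excess=$((total - keep))
  [ "$excess" -gt 0 ] || return 0
  # Oldest first, so the first <excess> names are the ones to delete
  ls "$dir" | grep "^${prefix}-[0-9]\{8\}\.dump$" | sort | head -n "$excess" |
    while read -r name; do
      rm -- "$dir/$name"
    done
}

# Example (not executed here):
#   prune_dumps /srv/backups abc0001 14
```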

Volume snapshot

For non-DB volumes, simplest is a tar from the volume's mountpoint, or use your storage layer's snapshot facility (LVM, ZFS, EBS, etc.).
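
The tar approach might look like the following. This is a sketch under assumptions: snapshot_dir is a hypothetical helper, and the example assumes the local volume driver, where docker volume inspect reports a host mountpoint.

```shell
# Hypothetical helper: archive the contents of a directory (for example a
# volume mountpoint) into a gzip-compressed tarball.
snapshot_dir() {
  src=$1
  out=$2
  # -C switches into the source directory so archive paths are relative
  tar -czf "$out" -C "$src" .
}

# Example (not executed here; needs Docker and usually root to read the mountpoint):
#   mp=$(docker volume inspect -f '{{ .Mountpoint }}' abc0001_portal-branding)
#   snapshot_dir "$mp" "./abc0001-branding-$(date +%Y%m%d).tgz"
```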

Restore

# Fresh DB
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env down timescaledb
docker volume rm abc0001_timescale-data
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env up -d timescaledb

# Restore
docker cp abc0001-YYYYMMDD.dump abc0001_timescale:/tmp/backup.dump
docker exec abc0001_timescale \
  pg_restore -U power_user -d power_monitoring --clean --if-exists /tmp/backup.dump
docker exec abc0001_timescale rm /tmp/backup.dump

# Start everything
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env up -d

The TimescaleBootstrapper is idempotent — it will not error on a restored hypertable.


Health & monitoring

Liveness / readiness

  • GET /health — liveness. Use as Traefik / load-balancer health check.
  • GET /health/ready — readiness (DB reachable). Use for orchestration "in service" decisions.

Logs

Serilog writes JSON to stdout; the Docker logging driver of your choice (json-file, journald, gelf to a central log store) picks it up.

docker logs abc0001_portal --tail 200 --follow

Notable lines:

  • Database connection resolved via … — confirms how this container resolved its DB at startup.
  • Applied migration '…' — one per pending migration.
  • TimescaleDB hypertable for monitoring.PowerMeasurements is ready — bootstrapper succeeded.
  • Seeded default admin … — first start only; absence on subsequent starts is correct.
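
For a first boot, where all four lines are expected, the check can be automated. A minimal sketch; check_startup_log is a hypothetical helper and matches on the fixed parts of the lines listed above:

```shell
# Hypothetical check: read portal log text on stdin and report which of the
# expected first-boot markers are absent. On later starts, "Seeded default
# admin" and "Applied migration" are legitimately missing.
check_startup_log() {
  log=$(cat)
  missing=0
  for marker in "Database connection resolved via" "Applied migration" \
                "hypertable for monitoring.PowerMeasurements is ready" \
                "Seeded default admin"; do
    case "$log" in
      *"$marker"*) ;;
      *) echo "not found: $marker"; missing=1 ;;
    esac
  done
  return "$missing"
}

# Example (not executed here):
#   docker logs abc0001_portal 2>&1 | check_startup_log
```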

DB health from the host

docker exec abc0001_timescale pg_isready -U power_user -d power_monitoring

TimescaleDB chunks

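# chunks_detailed_size() reports sizes only; the time ranges come from
# timescaledb_information.chunks, so join the two.
docker exec -it abc0001_timescale psql -U power_user -d power_monitoring -c \
  "SELECT c.chunk_name, c.range_start, c.range_end, s.total_bytes
   FROM chunks_detailed_size('monitoring.\"PowerMeasurements\"') s
   JOIN timescaledb_information.chunks c USING (chunk_schema, chunk_name)
   ORDER BY c.range_start;"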

Troubleshooting

| Symptom | First check |
| --- | --- |
| Portal container restart-looping | docker logs <PREFIX>_portal. Usually a missing env var (default-admin password in prod, missing Postgres password) or a migration failure. |
| /health/ready returns Unhealthy | Postgres container down, or wrong creds. docker logs <PREFIX>_timescale. |
| Grafana iframe loads but no charts | Datasource UID mismatch. Confirm grafana/provisioning/datasources/timescaledb.yml has uid: timescaledb and the dashboard JSON references the same. |
| Grafana iframe shows login screen in prod | forwardAuth is wired by default (provisioning step 4), so check that the browser holds a valid portal cookie (signed in, session not expired) and that the forwardAuth labels on the Grafana router in docker-compose.prod.yml are intact. If you switched auth modes, see the README "Grafana embedding" section. |
| Branded logo missing after restart | <PREFIX>_portal-branding volume not mounted, or filesystem perms wrong. The container runs as user app; volume must be writable by uid 1000. |
| Ingest returns accepted: 0, rejected: N | Devices don't exist for those externalIds. Create them via the Sites screen first. |
| Cookie auth seems random / sessions lost on restart | <PREFIX>_portal-keys volume not mounted, so Data Protection re-keys on every start. |
| Hypertable error on startup | Pre-existing non-empty plain table being converted. migrate_data => TRUE should handle it; if not, restore from backup and check for manual monitoring."PowerMeasurements" schema changes. |

Decommissioning a customer

# Final backup
docker exec abc0001_timescale pg_dump -U power_user -d power_monitoring \
  -F c -f /tmp/final.dump
docker cp abc0001_timescale:/tmp/final.dump ./abc0001-final-$(date +%Y%m%d).dump

# Stop and remove containers
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml down

# Remove volumes (destroys data — confirm backup first)
docker volume rm \
  abc0001_timescale-data \
  abc0001_grafana-data \
  abc0001_portal-branding \
  abc0001_portal-keys

# Remove customer dir
rm -rf /srv/portal/abc0001

# DNS record + cert (manual or via your DNS automation)

# If using the fleet aggregator: also delete the customer from the Admin
# Customers page (UI Delete) or via psql against the central DB:
#   DELETE FROM fleet."Customers" WHERE "Code" = 'ABC0001';
# (cascades to Sites, Devices, PowerMeasurements, IngestEvents)
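
Because the volume removal is irreversible, it is worth gating it on the final dump actually existing. A minimal sketch; confirm_backup is a hypothetical helper and only checks that the file is present and non-empty, not that it restores cleanly:

```shell
# Hypothetical guard: succeed only if the final dump exists and is non-empty.
confirm_backup() {
  dump=$1
  if [ ! -s "$dump" ]; then
    echo "refusing to continue: $dump is missing or empty" >&2
    return 1
  fi
  echo "backup ok: $dump ($(wc -c < "$dump" | tr -d ' ') bytes)"
}

# Example (not executed here):
#   confirm_backup ./abc0001-final-20260519.dump && docker volume rm abc0001_timescale-data
```

For a stronger check, pg_restore --list on the dump confirms the archive's table of contents is readable before anything is deleted.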

Fleet aggregator (Admin stack)

For background and the full design see docs/FLEET-DESIGN.md. This section covers the day-to-day ops.

One-time: provisioning the Admin stack

# 1. Create a dedicated Postgres DB for the central fleet
docker exec <timescale-container> createdb -U power_user admin_fleet

# 2. Spin up the Admin portal (same image as a customer stack, different env)
docker run -d --name admin-portal --restart unless-stopped \
  --network <shared-network> \
  -e Application__RunMode=Admin \
  -e ASPNETCORE_ENVIRONMENT=Production \
  -e Application__PublicUrl=https://admin.portal.example.com \
  -e Database__ConnectionString='Host=<host>;Port=5432;Database=admin_fleet;Username=power_user;Password=<secret>' \
  -e Authentication__DefaultAdminEmail=ops@yourco.example \
  -e Authentication__DefaultAdminPassword=<strong> \
  -v admin-portal-keys:/data/keys \
  -v admin-portal-branding:/data/branding \
  tau-acuvim-portal:latest

# 3. (Optional) Spin up an Admin-side Grafana pointed at admin_fleet
docker run -d --name admin-grafana --restart unless-stopped \
  --network <shared-network> \
  -e GF_SECURITY_ADMIN_PASSWORD=<strong> \
  -e GF_SECURITY_ALLOW_EMBEDDING=true \
  -e GF_AUTH_ANONYMOUS_ENABLED=false \
  -e POSTGRES_DB=admin_fleet \
  -e POSTGRES_USER=power_user \
  -e POSTGRES_PASSWORD=<secret> \
  -v admin-grafana-data:/var/lib/grafana \
  -v /srv/portal/grafana/provisioning:/etc/grafana/provisioning:ro \
  -v /srv/portal/grafana/dashboards-admin:/var/lib/grafana/dashboards:ro \
  grafana/grafana:11.4.0

Behind Traefik: add labels on admin-portal and admin-grafana mirroring the per-customer pattern, with Host(`admin.portal.example.com`) and (for Grafana) && PathPrefix(`/grafana`). Choose a Grafana auth mode from README Security (forwardAuth / auth.proxy / render tokens) before exposing.

Onboarding a new customer end-to-end

# A. Admin side — register and capture token (one-time per customer)
#    1. Sign in to https://admin.portal.example.com
#    2. Customers → "Register customer" → Code=ABC0001, Name=Acme Corp
#    3. Copy the token shown ONCE.

# B. Customer side
#    1. Spin up their stack (see "Provisioning a new customer" above).
#    2. Append the fleet-push settings to their .env. Application__RunMode=Client
#       is already set from provisioning step 3, so it is not repeated here.
cat >> /srv/portal/abc0001/.env <<'EOF'
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<token from step A.3>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
EOF

#    3. Restart the customer's portal so the push service starts.
cd /srv/portal/abc0001
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env \
  up -d portal

# C. Verify
#    1. In Admin UI → Customers, ABC0001 should show "Last push" advance within a minute.
#    2. Click the row → Customer detail → "Recent ingest" tab should list sites/devices
#       batches (and measurements once any are ingested locally).
#    3. From the host:
docker exec <admin-timescale> psql -U power_user -d admin_fleet -c \
  'SELECT "BatchType","RowsAccepted","ReceivedAt" FROM fleet."IngestEvents" ORDER BY "ReceivedAt" DESC LIMIT 10;'

Common ops

| What | Where |
| --- | --- |
| Rotate a customer's push token | Admin UI → Customers → row's "Rotate token" button. Update customer's .env and restart their portal. Brief push gap (until restart) is expected. |
| Disable a customer (stop accepting their data) | Admin UI → Customers → Edit → Active off. Ingest returns 401 immediately; data already in fleet.* is untouched. |
| Investigate "why hasn't ABC0001 shown up?" | Customer detail page → Recent ingest tab. Check for 401s, rejected rows, error messages. Or: SELECT * FROM fleet."IngestEvents" WHERE "CustomerId" = '<id>' ORDER BY "ReceivedAt" DESC; |
| Inspect compression | SELECT * FROM hypertable_compression_stats('fleet."PowerMeasurements"'); |
| Force a continuous aggregate refresh | CALL refresh_continuous_aggregate('fleet.hourly_per_device', NULL, NULL); |
| Decommission a customer from the fleet | Admin UI → Customers → Delete (cascades sites/devices/measurements/events). Customer's local stack is untouched; their portal will get 401s on push until they disable FleetIngest__Enabled or you re-register them. |
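
Token rotation ends with an edit to the customer's .env. Since the token format is not specified here, the sketch below avoids sed and compares the key literally, so the new value may contain /, & or |. set_env_var is a hypothetical helper, not part of the repo.

```shell
# Hypothetical helper: set KEY=VALUE in a .env file, replacing an existing
# line in place or appending one if the key is absent. Key and value are
# passed through the environment so awk does not interpret escapes in them.
set_env_var() {
  file=$1
  key=$2
  value=$3
  tmp=$(mktemp)
  K="$key" V="$value" awk '
    BEGIN { k = ENVIRON["K"]; v = ENVIRON["V"] }
    index($0, k "=") == 1 { print k "=" v; found = 1; next }
    { print }
    END { if (!found) print k "=" v }
  ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back through the original file to keep its owner and mode
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (not executed here):
#   set_env_var /srv/portal/abc0001/.env FleetIngest__Token '<new token>'
```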

Backing up the central DB

Same pg_dump pattern as a customer DB (see above), targeting admin_fleet. Includes hypertable chunks; restore with pg_restore then run the Admin portal once to re-bootstrap the continuous aggregate refresh policy (FleetTimescaleBootstrapper is idempotent).