Tau.Acuvim/portal/OPERATIONS.md
Diseri Pearson e2cbb83397 Portal: Client Dashboard, Measurements page, Excel exports, Grafana auth, RLS
A bundle of related portal work — picked up while ensuring per-customer
isolation actually works end-to-end and replacing the placeholder Client
landing page. Build green, full test suite 66/66.

Frontend — Client surface
- DashboardPage: replace placeholder with 4 KPI cards (kWh, current kW,
  active devices, estimated cost), 24h active-power ECharts mini-chart,
  per-device "today/range" table, and a date-range picker with shortcuts
  (Today / 7d / 30d / This month / Custom). 30s auto-refresh.
- New Measurements page (/measurements, Client mode, any authenticated
  user) with multi-select device filter, full date range incl. an
  "All time" shortcut, server-paginated preview, and Excel export.
- "Export to Excel" buttons on: Client Dashboard summary, Client Dashboard
  raw measurements, Admin fleet dashboard, Admin customer-detail Cost tab.
- DashboardsPage sidebar items: let the menu item grow and reset
  line-height so the two-line title+description doesn't crush.

Frontend — Admin / user mgmt
- RestrictedAdmin role: admin who only sees their assigned customers.
  New UserFormDrawer choice + CustomerAccessModal for granting/revoking
  per-customer access; surfaced from the Users page.

Backend
- ClosedXML 0.104.2 + ExcelExportService (pure formatter; frozen header,
  currency/kWh/kW/date number formats, AdjustToContents).
- DashboardSummaryService computes per-device totals + estimated cost
  (hourly bucketing × site's municipality's active tariff, mirroring
  FleetCostService for the Admin side).
- New endpoints:
    GET  /api/dashboard/summary[+/export.xlsx]
    GET  /api/measurements/raw[+/export.xlsx]   (deviceIds, paginated)
    GET  /api/sites/devices                     (flat list w/ site name)
    GET  /api/fleet/dashboard/export.xlsx
    GET  /api/fleet/customers/{id}/cost/export.xlsx
    GET  /api/auth/check                        (cookie-only liveness)
- AdminCustomerAccess: per-user customer scoping for RestrictedAdmin via
  Postgres-row-level filter — RlsContext (per-DI-scope state) +
  CustomerFilterMiddleware (populates from claims after auth) +
  fleet.* DbSets gain HasQueryFilter expressions. Bootstrappers
  Elevate() to bypass the filter for trusted system code.
- Migration: 20260518095759_AddAdminCustomerAccess (mapping table,
  composite PK on UserId+CustomerId).

Infra / templating (the "spin it up via the template" piece)
- docker-compose.prod.yml + docker-compose.yml: pass WhiteLabel__*,
  Application__RunMode, FleetIngest__* through to the container as
  ${VAR:-default} substitutions. Previously these were silently dropped
  in prod — a customer's .env settings for branding/fleet-push never
  reached the running process. Latent bug, fixed.
- docker-compose.prod.yml: forwardAuth middleware labels on the
  Grafana router pointing at /api/auth/check. Option (a) from the
  README's three prod-auth modes — every Grafana request now gates on
  a valid portal cookie. Anonymous stays off.
- .env.example rewritten with a Client section, optional FleetIngest
  block, and an Admin variant block — annotated on what's required vs.
  optional and where the seed-only-on-first-boot caveat applies.
- README "Grafana embedding" table: option (a) now marked active with
  an inline note on how to switch modes later.
- OPERATIONS.md step 3 includes the white-label pre-brand .env snippet;
  step 4 (formerly "decide Grafana auth mode") updated to reflect
  that auth is wired by default.

Tests
- New BrandingSeedFromOptionsTests (5 tests) pins the env-var → IOptions
  → DB seed contract: first read seeds from options; subsequent reads
  return the DB row (UI edits survive restarts); EnsureSeededAsync is
  idempotent; UpdateAsync falls back to options for blanked fields.
- CustomerTokenGraceTests helper: pass the new RlsContext to
  AdminDbContext (SetAll() so existing semantics hold).

Verified end-to-end
- Real Docker spin-up with WhiteLabel__* in a throwaway .env →
  /api/branding returned all six fields verbatim (ApplicationName,
  LogoUrl, three colors, FooterText).
- curl login → /api/dashboard/summary returned valid JSON →
  /api/dashboard/summary/export.xlsx returned a 6.9 KB file the
  `file` command identifies as "Microsoft Excel 2007+".
- /api/measurements/raw with and without deviceIds filter returned
  correct paginated rows; /export.xlsx with filter produced a valid
  7.1 KB xlsx with the meter count in the filename.
- Frontend tsc -b clean; backend dotnet build 0/0; xunit 66/66.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:15:44 +02:00


# Tau Acuvim Portal — Operations
Per-customer deployment loop. For background, architecture, and security model, read the [README](./README.md) first.

---
## Contents
1. [Prerequisites (per host)](#prerequisites-per-host)
2. [Provisioning a new customer](#provisioning-a-new-customer)
3. [Updating a customer's stack](#updating-a-customers-stack)
4. [Rotating secrets](#rotating-secrets)
5. [Backup & restore](#backup--restore)
6. [Health & monitoring](#health--monitoring)
7. [Troubleshooting](#troubleshooting)
8. [Decommissioning a customer](#decommissioning-a-customer)
---
## Prerequisites (per host)
These exist once on the host running customer stacks; not per customer.
1. **Docker Engine** (or Docker Desktop on Windows hosts).
2. **External Traefik instance**, running on the same host and joined to a Docker network named `traefik-public`. Configured with:
   - Two entrypoints: `web` (80), `websecure` (443).
   - A certificate resolver named `le` (Let's Encrypt via DNS-01 or HTTP-01).
   - HTTP → HTTPS redirect.
   - Docker provider with `exposedByDefault: false`.
3. **Wildcard DNS + TLS cert** for `*.portal.example.com` (or whatever your customer subdomain pattern is). Let's Encrypt issues wildcard certificates only via the DNS-01 challenge.
4. **The `traefik-public` Docker network exists**:
   ```bash
   docker network create traefik-public # one-time
   ```
5. **The portal image is built** (or pull-able from a registry):
   ```bash
   cd /path/to/portal
   docker compose -f docker-compose.prod.yml build
   ```
---
## Provisioning a new customer
Goal: spin up an isolated stack for customer `ABC0001` (Compose project `abc0001` — lowercase required) at `abc0001.portal.example.com`.
### 1. Create the customer directory
A common pattern: one directory per customer holding only an `.env` file (the compose files are shared from the repo). Adjust to your fleet-management tool of choice (Ansible, Portainer, Helm-on-K8s later).
```
/srv/portal/abc0001/
└── .env
```
### 2. Generate strong secrets
```bash
openssl rand -base64 32 # POSTGRES_PASSWORD
openssl rand -base64 32 # GRAFANA_ADMIN_PASSWORD
openssl rand -base64 32 # Authentication__DefaultAdminPassword
```
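If the secrets will be pasted into `.env` and later touched by `sed` or SQL statements, a hex encoding can be easier to handle than base64. The sketch below is a variation on the commands above, not the project's own tooling: `gen_secret` and `print_secret_lines` are hypothetical helper names, and `openssl rand -hex 32` is swapped in for `-base64 32` so the output contains only `0-9` and `a-f`.

```bash
# Generate a 256-bit secret as 64 hex characters (only 0-9 and a-f),
# which needs no quoting in .env files, sed expressions, or SQL literals.
gen_secret() {
  openssl rand -hex 32
}

# Print the three secrets from this step as ready-to-paste .env lines.
# (Hypothetical helper; the key names are the ones used in this document.)
print_secret_lines() {
  printf 'POSTGRES_PASSWORD=%s\n' "$(gen_secret)"
  printf 'GRAFANA_ADMIN_PASSWORD=%s\n' "$(gen_secret)"
  printf 'Authentication__DefaultAdminPassword=%s\n' "$(gen_secret)"
}
```

Calling `print_secret_lines` prints three `KEY=value` lines, each with a fresh secret, that can be pasted into the customer's `.env`.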
### 3. Fill in `.env`
Copy `.env.example` to the customer's directory and fill in:
```ini
COMPOSE_PROJECT_NAME=abc0001
CUSTOMER_HOST=abc0001.portal.example.com
Application__RunMode=Client
POSTGRES_DB=power_monitoring
POSTGRES_USER=power_user
POSTGRES_PASSWORD=<from step 2>
Authentication__DefaultAdminEmail=admin@abc0001.example.com
Authentication__DefaultAdminPassword=<from step 2>
GRAFANA_ADMIN_PASSWORD=<from step 2>
Grafana__EmbedPathPrefix=/grafana
# Pre-brand the stack so the customer's first sign-in already shows their
# colours and name. Only applied on first boot; later changes are via the UI.
WhiteLabel__ApplicationName=Acme Corp Power Monitoring
WhiteLabel__PrimaryColor=#0c4a6e
WhiteLabel__SecondaryColor=#0e7490
WhiteLabel__AccentColor=#06b6d4
WhiteLabel__FooterText=© Acme Corp
```
See `.env.example` for the full annotated set including the optional `FleetIngest__*` block (added later, when you enable fleet aggregation for this customer).
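
Before bringing the stack up, it can help to check the file mechanically. The following is a minimal sketch, assuming the required keys are exactly the non-branding keys in the sample above; `check_env` is a hypothetical helper, not part of the repo. It flags missing keys, leftover `<from step 2>` style placeholders, and an uppercase Compose project name.

```bash
# Check a customer .env before "docker compose up".
# Prints one line per problem; returns non-zero if any were found.
check_env() {
  local file="$1" key value problems=0
  # Assumed required set: the non-branding keys from the sample .env.
  local required="COMPOSE_PROJECT_NAME CUSTOMER_HOST Application__RunMode \
POSTGRES_DB POSTGRES_USER POSTGRES_PASSWORD \
Authentication__DefaultAdminEmail Authentication__DefaultAdminPassword \
GRAFANA_ADMIN_PASSWORD"

  for key in $required; do
    # If a key is repeated, inspect the last assignment; strip "KEY=".
    value="$(grep -E "^${key}=" "$file" | tail -n 1 | cut -d= -f2-)"
    if [ -z "$value" ]; then
      echo "missing: $key"; problems=$((problems + 1))
    elif [[ "$value" == "<"*">" ]]; then
      echo "placeholder left in: $key"; problems=$((problems + 1))
    fi
  done

  # The Compose project name has to be lowercase.
  value="$(grep -E '^COMPOSE_PROJECT_NAME=' "$file" | tail -n 1 | cut -d= -f2-)"
  if [[ "$value" =~ [[:upper:]] ]]; then
    echo "COMPOSE_PROJECT_NAME must be lowercase: $value"; problems=$((problems + 1))
  fi

  [ "$problems" -eq 0 ]
}
```

A typical call would be `check_env /srv/portal/abc0001/.env && echo "env looks complete"`.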
### 4. Grafana auth (already wired)
Production Grafana embedding uses Traefik `forwardAuth` → portal `/api/auth/check`, defined inline on the Grafana router in `docker-compose.prod.yml`. Every Grafana sub-request is gated on a valid portal cookie; anonymous is off. No per-customer action required.
To switch to a different mode (e.g. `auth.proxy` for per-user Grafana folders), see the README "Grafana embedding — production auth" section.
### 5. Bring it up
```bash
cd /srv/portal/abc0001
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml up -d
```
### 6. Verify
```bash
# Wait for healthy
docker ps --filter "label=com.docker.compose.project=abc0001"
# Health checks
curl -fs https://abc0001.portal.example.com/health # → Healthy
curl -fs https://abc0001.portal.example.com/health/ready # → Healthy
# Migration + seed in the logs
docker logs abc0001_portal | grep -E "Applied migration|Seeded|hypertable"
# expect:
# Applied migration 'InitialCreate'
# TimescaleDB hypertable for monitoring.PowerMeasurements is ready
# Seeded default admin admin@abc0001.example.com
```
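The "wait for healthy" step can be scripted rather than polled by hand. This is a minimal sketch of a generic retry loop; `wait_until` is a hypothetical helper name, and the timeout and interval values in the usage note are arbitrary choices, not recommendations from the project.

```bash
# Retry a command until it succeeds or the timeout (in seconds) expires.
# Usage: wait_until <timeout> <interval> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if the deadline passes first.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is bash's built-in counter of seconds since the shell started.
  local deadline=$((SECONDS + timeout))
  until "$@" >/dev/null 2>&1; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

For example, `wait_until 120 5 curl -fs https://abc0001.portal.example.com/health/ready || echo "not ready after 120s"` polls the readiness endpoint every 5 seconds for up to two minutes.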
### 7. First login + handover
1. Sign in as `Authentication__DefaultAdminEmail` with the password from step 2.
2. **Settings → Users** → create the customer's real admin account; toggle Admin on.
3. Sign out, sign in as the customer admin, **change the default admin password** (or delete the default admin account if the customer admin is the only one needed).
4. **Settings → Branding** → upload customer logo, apply colours.
5. **Settings → Rates** → seed at least one municipality + tariff for cost calc.
6. **Sites** → create the customer's sites/devices so the ingest pipeline knows where measurements belong.
---
## Updating a customer's stack
### Code-only update (no migrations, no compose changes)
```bash
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml \
up -d --build portal
```
Brief downtime while the new container starts. The DB is untouched.
### Update with new migrations
Same command — `MigrateAsync` on startup applies pending migrations before the app accepts traffic. Watch the logs:
```bash
docker logs -f abc0001_portal | grep -E "Applied migration|Failed|hypertable"
```
If a migration fails the container will exit; fix forward, push a corrected image, retry.
### Compose changes (env vars, ports, labels)
Edit the customer's `.env` (or the central `docker-compose.prod.yml`) and:
```bash
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml up -d
```
Compose recreates only the containers whose definition changed.
### Rolling many customers
There's no built-in fan-out — pick your orchestrator (Ansible playbook, simple bash loop, Portainer stacks). Update one customer first, verify, then roll the rest.
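
A "simple bash loop" for the roll could look like the sketch below. It assumes the one-directory-per-customer layout under a common root described in the provisioning section; `update_customer`, `roll_customers`, `PORTAL_COMPOSE_FILE` and `DRY_RUN` are hypothetical names introduced here. The canary is updated first and a failure there stops the roll; any further verification between the canary and the rest remains a manual step.

```bash
# Path to the shared compose file; override via the environment if needed.
PORTAL_COMPOSE_FILE="${PORTAL_COMPOSE_FILE:-/path/to/portal/docker-compose.prod.yml}"

# Run the same "up -d" used elsewhere in this document for one customer dir.
# With DRY_RUN=1 the command is printed instead of executed.
update_customer() {
  local dir="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "cd $dir && docker compose --env-file .env -f $PORTAL_COMPOSE_FILE up -d"
  else
    (cd "$dir" && docker compose --env-file .env -f "$PORTAL_COMPOSE_FILE" up -d)
  fi
}

# Usage: roll_customers <root> <canary>
# Updates <root>/<canary> first, then every other directory holding a .env.
roll_customers() {
  local root="$1" canary="$2" dir
  update_customer "$root/$canary" || { echo "canary $canary failed" >&2; return 1; }
  for dir in "$root"/*/; do
    dir="${dir%/}"
    [ -f "$dir/.env" ] || continue              # skip dirs without a stack
    [ "$dir" = "$root/$canary" ] && continue    # canary is already done
    update_customer "$dir" || echo "FAILED: $dir" >&2
  done
}

# Example: DRY_RUN=1 roll_customers /srv/portal abc0001
```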

---
## Rotating secrets
### Database password
```bash
# 1. Change the password inside Postgres
docker exec -it abc0001_timescale psql -U power_user -d power_monitoring \
-c "ALTER USER power_user WITH PASSWORD '<new>';"
# 2. Update .env ('|' as the sed delimiter, because base64 secrets can contain '/')
sed -i 's|^POSTGRES_PASSWORD=.*|POSTGRES_PASSWORD=<new>|' .env
# 3. Recreate the portal + grafana to pick up new env vars
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml \
up -d portal grafana
```
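Editing `.env` with `sed` means the new secret passes through a substitution expression, where characters such as the delimiter or `&` have special meaning. As an alternative, here is a minimal sketch of a helper that rewrites the file without treating the value as a pattern; `set_env_var` is a hypothetical name, not something shipped with the portal.

```bash
# Set KEY=VALUE in an env file: replaces every existing KEY= line, or
# appends one if the key is absent. The value is written verbatim, so
# '/', '+', '&' and '=' in generated secrets need no escaping.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp line found=0
  tmp="$(mktemp)" || return 1
  # "|| [ -n "$line" ]" keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    if [[ "$line" == "$key="* ]]; then
      printf '%s=%s\n' "$key" "$value"
      found=1
    else
      printf '%s\n' "$line"
    fi
  done < "$file" > "$tmp"
  [ "$found" -eq 1 ] || printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Overwrite in place so the file keeps its owner and permissions.
  cat "$tmp" > "$file" && rm -f "$tmp"
}
```

With this in scope, step 2 above becomes `set_env_var .env POSTGRES_PASSWORD '<new>'`.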
### Grafana admin password
```bash
sed -i 's|^GRAFANA_ADMIN_PASSWORD=.*|GRAFANA_ADMIN_PASSWORD=<new>|' .env
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml \
up -d grafana
```
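Grafana applies `GF_SECURITY_ADMIN_PASSWORD` only when it first initialises its database, so on an existing `<PREFIX>_grafana-data` volume recreating the container does not change the admin password. To rotate it on a running stack, run `grafana cli admin reset-admin-password '<new>'` inside the Grafana container (via `docker exec`), and keep `.env` in sync so that a rebuilt volume is seeded with the same value.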
### Default admin password
Once the customer admin exists and has changed their own password, the default admin can be deleted from the **Settings → Users** UI. After that, `Authentication__DefaultAdminPassword` is only used if the row is re-seeded (which happens only when no account with that email exists).

---
## Backup & restore
### What to back up
| Volume | What's in it | Frequency |
|---|---|---|
| `<PREFIX>_timescale-data` | All customer data (Identity, branding, tariffs, sites, devices, measurements) | Daily, more for high-write customers |
| `<PREFIX>_grafana-data` | Grafana's internal SQLite (user prefs, plugin state). Dashboards re-provision from JSON so this is **not authoritative**. | Weekly is plenty |
| `<PREFIX>_portal-branding` | Uploaded logos | Daily |
| `<PREFIX>_portal-keys` | Data Protection key ring (cookie signing). Losing this invalidates all sessions but doesn't lose data. | Weekly |
### Postgres dump
```bash
docker exec abc0001_timescale \
pg_dump -U power_user -d power_monitoring -F c -f /tmp/backup.dump
docker cp abc0001_timescale:/tmp/backup.dump ./abc0001-$(date +%Y%m%d).dump
docker exec abc0001_timescale rm /tmp/backup.dump
```
The `pg_dump` shipped in the TimescaleDB container handles hypertables natively (PG12+), so the command above is sufficient for a consistent hypertable backup. Prefer it over a `pg_dump` binary installed on the host.
### Volume snapshot
For non-DB volumes, simplest is a `tar` from the volume's mountpoint, or use your storage layer's snapshot facility (LVM, ZFS, EBS, etc.).
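
The `tar` approach can be sketched as follows. `snapshot_dir` is a hypothetical helper; it only archives a directory and lists the result once as a basic sanity check. Finding the mountpoint with `docker volume inspect` and the output file name are shown as commented usage, and reading the mountpoint typically requires root.

```bash
# Archive the contents of a directory (for example a volume's mountpoint)
# into a gzip-compressed tarball, then list it once as a sanity check.
# Usage: snapshot_dir <source-dir> <output.tgz>   (output must be outside source)
snapshot_dir() {
  local src="$1" out="$2"
  tar -czf "$out" -C "$src" . || return 1
  tar -tzf "$out" >/dev/null
}

# Example (run as a user that can read Docker's volume directory):
#   mp="$(docker volume inspect -f '{{ .Mountpoint }}' abc0001_portal-branding)"
#   snapshot_dir "$mp" "./abc0001-branding-$(date +%Y%m%d).tgz"
```

Restoring is the reverse: extract the tarball into the mountpoint of a freshly created volume with `tar -xzf <file> -C <mountpoint>`.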
### Restore
```bash
# Fresh DB
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env down timescaledb
docker volume rm abc0001_timescale-data
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env up -d timescaledb
# Restore
docker cp abc0001-YYYYMMDD.dump abc0001_timescale:/tmp/backup.dump
docker exec abc0001_timescale \
pg_restore -U power_user -d power_monitoring --clean --if-exists /tmp/backup.dump
docker exec abc0001_timescale rm /tmp/backup.dump
# Start everything
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env up -d
```
The TimescaleBootstrapper is idempotent; it will not error on a restored hypertable.

---
## Health & monitoring
### Liveness / readiness
- `GET /health` — liveness. Use as Traefik / load-balancer health check.
- `GET /health/ready` — readiness (DB reachable). Use for orchestration "in service" decisions.
### Logs
Serilog writes JSON to stdout; the Docker logging driver of your choice (json-file, journald, gelf to a central log store) picks it up.
```bash
docker logs abc0001_portal --tail 200 --follow
```
Notable lines:
- `Database connection resolved via …` — confirms how this container resolved its DB at startup.
- `Applied migration '…'` — one per pending migration.
- `TimescaleDB hypertable for monitoring.PowerMeasurements is ready` — bootstrapper succeeded.
- `Seeded default admin …` — first start only; absence on subsequent starts is correct.
### DB health from the host
```bash
docker exec abc0001_timescale pg_isready -U power_user -d power_monitoring
```
### TimescaleDB chunks
```bash
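# chunks_detailed_size() reports sizes only; time ranges come from timescaledb_information.chunks
docker exec -it abc0001_timescale psql -U power_user -d power_monitoring -c \
  "SELECT c.chunk_name, c.range_start, c.range_end, s.total_bytes
   FROM chunks_detailed_size('monitoring.\"PowerMeasurements\"') AS s
   JOIN timescaledb_information.chunks AS c USING (chunk_schema, chunk_name)
   ORDER BY c.range_start;"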
```
---
## Troubleshooting
| Symptom | First check |
|---|---|
| Portal container restart-looping | `docker logs <PREFIX>_portal` — usually a missing env var (default-admin password in prod, missing Postgres password) or a migration failure. |
| `/health/ready` returns Unhealthy | Postgres container down, or wrong creds. `docker logs <PREFIX>_timescale`. |
| Grafana iframe loads but no charts | Datasource UID mismatch — confirm `grafana/provisioning/datasources/timescaledb.yml` has `uid: timescaledb` and the dashboard JSON references the same. |
| Grafana iframe shows login screen in prod | Expected if no auth mode is wired yet (anonymous off by default). Pick a mode (see README → Security). |
| Branded logo missing after restart | `<PREFIX>_portal-branding` volume not mounted, or filesystem perms wrong. The container runs as user `app`; volume must be writable by uid 1000. |
| Ingest returns `accepted: 0, rejected: N` | Devices don't exist for those `externalId`s. Create them via the Sites screen first. |
| Cookie auth seems random / sessions lost on restart | `<PREFIX>_portal-keys` volume not mounted — Data Protection re-keys on every start. |
| Hypertable error on startup | Pre-existing non-empty plain table being converted. `migrate_data => TRUE` should handle it; if not, restore from backup and check for manual `monitoring."PowerMeasurements"` schema changes. |
---
## Decommissioning a customer
```bash
# Final backup
docker exec abc0001_timescale pg_dump -U power_user -d power_monitoring \
-F c -f /tmp/final.dump
docker cp abc0001_timescale:/tmp/final.dump ./abc0001-final-$(date +%Y%m%d).dump
# Stop and remove containers
docker compose --env-file .env -f /path/to/portal/docker-compose.prod.yml down
# Remove volumes (destroys data — confirm backup first)
docker volume rm \
abc0001_timescale-data \
abc0001_grafana-data \
abc0001_portal-branding \
abc0001_portal-keys
# Remove customer dir
rm -rf /srv/portal/abc0001
# DNS record + cert (manual or via your DNS automation)
# If using the fleet aggregator: also delete the customer from the Admin
# Customers page (UI Delete) or via psql against the central DB:
# DELETE FROM fleet."Customers" WHERE "Code" = 'ABC0001';
# (cascades to Sites, Devices, PowerMeasurements, IngestEvents)
```
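Since removing the volumes is irreversible, the "confirm backup first" step can be made a hard gate. The sketch below assumes the final dump was written with `-F c` as above; `is_pg_custom_dump` is a hypothetical helper. It checks only that the file is non-empty and begins with the `PGDMP` magic string of pg_dump's custom format, so it catches missing or truncated-to-zero files but does not prove the archive is complete.

```bash
# Return 0 only if the file exists, is non-empty, and starts with "PGDMP",
# the magic string at the start of pg_dump custom-format (-F c) archives.
is_pg_custom_dump() {
  local file="$1"
  [ -s "$file" ] || return 1
  [ "$(head -c 5 "$file")" = "PGDMP" ]
}

# Example gate before "docker volume rm":
#   is_pg_custom_dump "./abc0001-final-$(date +%Y%m%d).dump" \
#     || { echo "final backup missing or invalid; aborting" >&2; exit 1; }
# A stronger check is to list the archive's contents:
#   pg_restore --list <dump-file> >/dev/null
```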
---
## Fleet aggregator (Admin stack)
For background and the full design see [docs/FLEET-DESIGN.md](./docs/FLEET-DESIGN.md). This section covers the day-to-day ops.
### One-time: provisioning the Admin stack
```bash
# 1. Create a dedicated Postgres DB for the central fleet
docker exec <timescale-container> createdb -U power_user admin_fleet
# 2. Spin up the Admin portal (same image as a customer stack, different env)
docker run -d --name admin-portal --restart unless-stopped \
--network <shared-network> \
-e Application__RunMode=Admin \
-e ASPNETCORE_ENVIRONMENT=Production \
-e Application__PublicUrl=https://admin.portal.example.com \
-e Database__ConnectionString='Host=<host>;Port=5432;Database=admin_fleet;Username=power_user;Password=<secret>' \
-e Authentication__DefaultAdminEmail=ops@yourco.example \
-e Authentication__DefaultAdminPassword=<strong> \
-v admin-portal-keys:/data/keys \
-v admin-portal-branding:/data/branding \
tau-acuvim-portal:latest
# 3. (Optional) Spin up an Admin-side Grafana pointed at admin_fleet
docker run -d --name admin-grafana --restart unless-stopped \
--network <shared-network> \
-e GF_SECURITY_ADMIN_PASSWORD=<strong> \
-e GF_SECURITY_ALLOW_EMBEDDING=true \
-e GF_AUTH_ANONYMOUS_ENABLED=false \
-e POSTGRES_DB=admin_fleet \
-e POSTGRES_USER=power_user \
-e POSTGRES_PASSWORD=<secret> \
-v admin-grafana-data:/var/lib/grafana \
-v /srv/portal/grafana/provisioning:/etc/grafana/provisioning:ro \
-v /srv/portal/grafana/dashboards-admin:/var/lib/grafana/dashboards:ro \
grafana/grafana:11.4.0
```
Behind Traefik: add labels on `admin-portal` and `admin-grafana` mirroring the per-customer pattern, with ``Host(`admin.portal.example.com`)`` and (for Grafana) ``&& PathPrefix(`/grafana`)``. Traefik rule values must be wrapped in backticks. Choose a Grafana auth mode from README Security (forwardAuth / auth.proxy / render tokens) before exposing.
### Onboarding a new customer end-to-end
```bash
# A. Admin side — register and capture token (one-time per customer)
# 1. Sign in to https://admin.portal.example.com
# 2. Customers → "Register customer" → Code=ABC0001, Name=Acme Corp
# 3. Copy the token shown ONCE.
# B. Customer side
# 1. Spin up their stack (see "Provisioning a new customer" above),
#    then append the fleet-push settings to their .env:
cat >> /srv/portal/abc0001/.env <<EOF
Application__RunMode=Client
FleetIngest__Enabled=true
FleetIngest__Url=https://admin.portal.example.com/api/fleet/ingest
FleetIngest__Token=<token from step A.3>
FleetIngest__IntervalSeconds=60
FleetIngest__BatchSize=5000
EOF
# 2. Restart the customer's portal so the push service starts.
cd /srv/portal/abc0001
docker compose -f /path/to/portal/docker-compose.prod.yml --env-file .env \
  up -d portal
# C. Verify
# 1. In Admin UI → Customers, ABC0001 should show "Last push" advance within a minute.
# 2. Click the row → Customer detail → "Recent ingest" tab should list sites/devices
# batches (and measurements once any are ingested locally).
# 3. From the host:
docker exec <admin-timescale> psql -U power_user -d admin_fleet -c \
'SELECT "BatchType","RowsAccepted","ReceivedAt" FROM fleet."IngestEvents" ORDER BY "ReceivedAt" DESC LIMIT 10;'
```
### Common ops
| What | Where |
|---|---|
| Rotate a customer's push token | Admin UI → Customers → row's "Rotate token" button. Update customer's `.env` and restart their portal. Brief push gap (until restart) is expected. |
| Disable a customer (stop accepting their data) | Admin UI → Customers → Edit → Active off. Ingest returns 401 immediately; data already in `fleet.*` is untouched. |
| Investigate "why hasn't ABC0001 shown up?" | Customer detail page → Recent ingest tab. Check for 401s, rejected rows, error messages. Or: `SELECT * FROM fleet."IngestEvents" WHERE "CustomerId" = '<id>' ORDER BY "ReceivedAt" DESC;` |
| Inspect compression | `SELECT * FROM hypertable_compression_stats('fleet."PowerMeasurements"');` |
| Force a continuous aggregate refresh | `CALL refresh_continuous_aggregate('fleet.hourly_per_device', NULL, NULL);` |
| Decommission a customer from the fleet | Admin UI → Customers → Delete (cascades sites/devices/measurements/events). Customer's local stack is untouched; their portal will get 401s on push until they disable `FleetIngest__Enabled` or you re-register them. |
### Backing up the central DB
Same `pg_dump` pattern as a customer DB (see above), targeting `admin_fleet`. Includes hypertable chunks; restore with `pg_restore` then run the Admin portal once to re-bootstrap the continuous aggregate refresh policy (`FleetTimescaleBootstrapper` is idempotent).