# Troubleshooting Guide
Common issues, diagnostic procedures, and resolutions for the Tau Acuvim monitoring system, covering the ESP32 device firmware, MQTT communication, and the console application.

---
## 1. Device Issues
### 1.1 Quick Reference Table
| Symptom | Likely Cause | Resolution |
|---------|-------------|------------|
| No AP visible after power on | Firmware not loaded or corrupt | Check the boot output over USB serial (Section 4), then re-flash firmware via USB (Section 8.1). |
| AP visible but captive portal does not open | Captive portal DNS redirect not triggering | Navigate manually to `http://192.168.4.1` in a browser. |
| WiFi connects but disconnects repeatedly | Wrong password, weak signal, or channel congestion | Re-enter credentials. Move the device closer to the AP. If the network is hidden, enter the SSID manually. |
| MQTT not connecting | Wrong broker address, port, credentials, or firewall blocking | Verify settings on captive portal MQTT page. Use the Test Connection button. Check that port 8883 is open on the server. |
| MQTT connects but no telemetry in console | ACL blocking publishes, topic prefix mismatch, or console MQTT bridge not subscribed | Check Mosquitto logs for ACL denials. Verify topic prefix matches between device and console config. |
| No Modbus data (all values zero) | RS485 wiring incorrect, wrong slave address, or wrong baud rate | Check A/B/GND wiring. Verify Acuvim II settings match device config (slave address, baud rate). |
| Modbus intermittent errors | Loose wiring, cable too long without termination, electrical noise | Tighten terminal connections. Add 120-ohm termination resistor. Use shielded cable. |
| GSM not connecting | SIM not inserted, no data plan, wrong APN, no signal, antenna disconnected | Verify SIM card seated correctly. Check APN settings. Ensure antenna is attached. Check coverage. |
| GSM connects but MQTT fails over GSM | MQTT broker not reachable from public internet, TLS issues on SIM800L | Verify broker has a public IP/domain. SIM800L supports TLS 1.0/1.1 only -- consider SIM7600 for TLS 1.2. |
| OTA update fails | Download timeout, wrong URL, insufficient flash space, checksum mismatch | Verify console URL is correct. Ensure firmware binary is under 1.9 MB. Check server logs for download errors. Try OTA over WiFi (not GSM). |
| Device goes offline intermittently | Power supply instability, brownout, WiFi interference | Check power adapter rating (5V 2A recommended). Look for `BROWNOUT` in heartbeat reset reason. |
| Data gaps in telemetry | Transport was down during the gap, SD card buffer failed or is not installed | Check heartbeat history for connection type changes. Verify SD card is inserted and formatted correctly. |
| High memory usage / declining heap | Memory leak in firmware | Monitor `heap_min` trend in heartbeat data. If it decreases steadily over hours, report as a firmware bug. |
| Crash loops (device reboots repeatedly) | Bad configuration causing panic, firmware bug | After 5 consecutive crash reboots, the device performs an automatic factory reset and enters AP mode. Re-configure via captive portal. If the crash persists, re-flash firmware via USB. |
| Device stuck, not responding | Main loop hung, watchdog not firing | Power cycle the device. If it recurs, connect via USB serial to capture the hang state. The watchdog should reset the device after 30 seconds of inactivity. |
| Wrong time on telemetry | NTP not synced, no internet access at boot | Device syncs NTP on WiFi connect. If no internet is available, timestamps may be approximate. Verify NTP servers are reachable from the site network. |
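### 1.2 Checking the Heap Trend
The "declining heap" row asks you to judge whether `heap_min` decreases steadily. One way to make that less subjective is to fit a straight line through the samples and look at the slope. The following is a minimal Python sketch, not part of the project: it assumes you have exported `(uptime_seconds, heap_min)` pairs from the heartbeat data yourself, and the function name and sample values are hypothetical. A slope close to zero is normal; a clearly negative slope that persists over several hours matches the leak symptom described in the table.

```python
from typing import Sequence, Tuple


def heap_slope_bytes_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of heap_min over time, in bytes per hour.

    samples: (uptime_seconds, heap_min_bytes) pairs from one boot session.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    # Slope is in bytes per second; scale to bytes per hour.
    return cov / var_t * 3600.0


# Hypothetical samples: heap_min drops by 1024 bytes every hour.
samples = [(0, 200_000), (3600, 198_976), (7200, 197_952), (10800, 196_928)]
slope = heap_slope_bytes_per_hour(samples)
print(f"{slope:.0f} bytes/hour")
```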
---
## 2. Console Application Issues
### 2.1 Quick Reference Table
| Symptom | Likely Cause | Resolution |
|---------|-------------|------------|
| Console web UI not loading | Container crashed, nginx misconfigured, DNS not resolving | Run `docker compose ps` to check container status. Check `docker compose logs console`. Verify nginx config and SSL certificate. |
| 502 Bad Gateway | Console container is down or not listening on port 5000 | Restart the console container: `docker compose restart console`. Check logs for startup errors. |
| Database connection error on startup | PostgreSQL not ready, wrong connection string, password mismatch | Verify the `db` container is healthy: `docker compose ps`. Check that the password in `.env` matches for both `POSTGRES_PASSWORD` and `ConnectionStrings__DefaultConnection`. |
| MQTT bridge not receiving messages | Broker credentials wrong, MQTT container down, subscription failure | Check `docker compose logs console` for MQTT connection errors. Verify the MQTT username/password in `.env` match the Mosquitto password file. |
| Devices register but show as offline | Health monitor not detecting heartbeats, time zone mismatch | Check that heartbeats are actually arriving: subscribe to `acuvim/+/heartbeat` with a test client. Ensure server clock is correct (NTP synced). |
| API returns 401 Unauthorized | JWT token expired or invalid, wrong secret | Re-login to obtain a new token. Verify `Jwt__Secret` in `.env` has not changed since the token was issued. |
| API returns 500 Internal Server Error | Unhandled exception in the application | Check `docker compose logs console` for the full stack trace. Common causes: database migration not applied, null reference on missing data. |
| SignalR not connecting from browser | WebSocket upgrade blocked by proxy, CORS issue | Verify nginx has the WebSocket upgrade configuration for `/hubs/`. Check browser console for CORS errors. Ensure the production domain is in the CORS allowed origins. |
| Firmware upload fails | File too large, wrong content type, disk full | Check `client_max_body_size` in nginx config (should be at least 10M). Verify disk space on the server. |
| Telemetry queries are slow | Missing database index, table too large | Run `EXPLAIN ANALYZE` on the slow query. Ensure the `idx_telemetry_device_ts` index exists. Consider data retention cleanup. |
| Alerts not appearing | Alert processing service not matching topics, device not publishing alerts | Check console logs for alert topic subscription. Verify device alert thresholds are configured (send `set_alerts` command). |
| OTA deployment shows "timeout" | Device did not respond within 60 seconds, device offline | Verify the device is online before deploying. Check that the device can reach the firmware download URL. Send the `ping` diagnostic command (Section 5.2) first to confirm the device responds. |
---
## 3. MQTT Broker Issues
### 3.1 Quick Reference Table
| Symptom | Likely Cause | Resolution |
|---------|-------------|------------|
| Mosquitto container won't start | Config syntax error, certificate file missing or wrong permissions | Check `docker compose logs mqtt`. Verify certificate files exist and are owned by UID 1883. Check `mosquitto.conf` syntax. |
| Devices cannot connect to port 8883 | Firewall blocking, TLS certificate error, wrong port in device config | Verify firewall rule allows 8883/tcp. Test with `openssl s_client -connect host:8883`. Check certificate validity. |
| "Connection refused" | Mosquitto not listening on the expected port, container not running | Verify `docker compose ps` shows the mqtt container running. Check that the listener is configured in `mosquitto.conf`. |
| "Not authorized" | Wrong username/password, user not in password file | Verify device credentials in `/opt/acuvim/mosquitto/config/passwd`. Re-create with `mosquitto_passwd` if needed. Restart Mosquitto after changes. |
| "ACL denied" | Topic does not match ACL pattern for the connecting user | Check `/opt/acuvim/mosquitto/config/acl`. Ensure the device username matches its device_id used in topics. |
| Messages not reaching console | Console MQTT client disconnected, subscription lost | Check console logs for MQTT reconnect events. Verify the console subscribes to `acuvim/+/#` and `devices/register`. |
| High broker memory usage | Too many retained messages, large message backlog for offline clients | Clear retained messages if excessive. Review `max_queued_messages` setting in `mosquitto.conf`. |
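### 3.2 Checking a Topic Against a Filter
Several rows above come down to whether a topic matches a subscription filter or ACL pattern. The sketch below implements standard MQTT wildcard matching (`+` for exactly one level, `#` for all remaining levels) so a suspect topic can be checked offline. It is an illustration only: `topic_matches` is a hypothetical helper, and the `acuvim/%u/#` pattern is an assumption about how the ACL file might be written (Mosquitto substitutes `%u` with the username in `pattern` lines), so compare it against the real file.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic matches a filter using + and # wildcards.

    Simplification: the special handling of topics starting with '$' is ignored.
    """
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            # '#' is only valid as the last level and matches everything after it,
            # including the parent level itself.
            return i == len(f_parts) - 1
        if i >= len(t_parts):
            return False
        if f != "+" and f != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)


device_id = "ACV-AABBCCDDEEFF"

# Console subscription from the table above.
print(topic_matches("acuvim/+/#", f"acuvim/{device_id}/heartbeat"))  # True
print(topic_matches("acuvim/+/#", "devices/register"))               # False

# Assumed ACL pattern; a username that differs from the device_id is denied.
acl_pattern = "acuvim/%u/#"
wrong_user = "ACV-112233445566"
print(topic_matches(acl_pattern.replace("%u", wrong_user),
                    f"acuvim/{device_id}/heartbeat"))                # False
```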
---
## 4. Serial Monitor Debugging
For local debugging when a device is physically accessible, connect via USB and monitor the serial output.
### 4.1 Using PlatformIO
```bash
pio device monitor --baud 115200
```
### 4.2 Using screen (Linux/macOS)
```bash
screen /dev/ttyUSB0 115200
```
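To exit: press `Ctrl+A` then `K`, then confirm with `Y`. On macOS the port usually appears under `/dev/cu.*` rather than `/dev/ttyUSB0`; run `ls /dev/cu.*` to find it.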
### 4.3 Using PuTTY (Windows)
1. Open PuTTY.
2. Select "Serial" as connection type.
3. Set the COM port (check Device Manager for the correct port).
4. Set speed to `115200`.
5. Click "Open".
### 4.4 What to Look For
The firmware logs the following events to serial output:
```
[BOOT] Acuvim Monitor v1.2.0
[BOOT] Reset reason: POWER_ON
[BOOT] Boot count: 4
[BOOT] Free heap: 245760 bytes
[NVS] Config loaded
[WIFI] Connecting to SiteNetwork...
[WIFI] Connected, IP: 192.168.1.100
[NTP] Time synced: 2026-05-16 10:30:00 UTC
[MQTT] Connecting to console.example.com:8883...
[MQTT] Connected, subscribing to acuvim/ACV-AABBCCDDEEFF/cmd
[MQTT] LWT set on acuvim/ACV-AABBCCDDEEFF/status
[REG] Publishing registration to devices/register
[MB] Reading Acuvim II (addr=1, baud=9600)...
[MB] Read OK: Va=230.1 Vb=231.4 Vc=229.8 Ia=15.2 Ib=14.8 Ic=15.5
[MQTT] Published telemetry (287 bytes)
[HB] Heartbeat #1 sent
```
Error messages to watch for:
```
[WIFI] Connection failed: AUTH_FAIL
[MQTT] Connection failed: rc=-2 (MQTT_CONNECTION_REFUSED)
[MB] Read failed: TIMEOUT (addr=1, reg=0x0000)
[MB] 3 consecutive errors
[GSM] SIM not detected
[GSM] Network registration failed
[OTA] Download failed: HTTP 404
[OTA] Checksum mismatch, aborting
[SYS] WARNING: Free heap below 30KB (28416 bytes)
[SYS] CRASH LOOP DETECTED (count=5), performing factory reset
```
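For long captures it can help to filter a saved log down to the problem lines. The sketch below is a hypothetical helper, not part of the project. It assumes the `[TAG] message` format shown in the samples above and flags lines containing substrings taken from those sample error messages; the marker list is an assumption and will need extending for messages not shown in this guide.

```python
import re
from typing import Iterable, List, Tuple

# Tagged line format seen in the samples above: "[TAG] message".
LINE_RE = re.compile(r"^\[(?P<tag>[A-Z0-9]+)\]\s+(?P<msg>.*)$")

# Substrings taken from the sample error lines; extend as needed.
ERROR_MARKERS = (
    "failed", "mismatch", "not detected",
    "consecutive errors", "warning", "crash loop",
)


def find_problems(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (tag, message) pairs for lines that look like errors."""
    problems = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip untagged lines
        msg = match.group("msg")
        if any(marker in msg.lower() for marker in ERROR_MARKERS):
            problems.append((match.group("tag"), msg))
    return problems


capture = """\
[BOOT] Reset reason: POWER_ON
[WIFI] Connected, IP: 192.168.1.100
[MB] Read failed: TIMEOUT (addr=1, reg=0x0000)
[MB] 3 consecutive errors
[MQTT] Published telemetry (287 bytes)
""".splitlines()

for tag, msg in find_problems(capture):
    print(f"{tag}: {msg}")
```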
---
## 5. Diagnostic MQTT Commands
Send these commands from the console or directly via a test MQTT client to diagnose device issues remotely.
### 5.1 Using mosquitto_pub (Command Line)
```bash
# Subscribe to responses first (run this in a separate terminal)
mosquitto_sub -h console.example.com -p 8883 \
  --cafile ca.crt -u console -P <password> \
  -t "acuvim/ACV-AABBCCDDEEFF/resp"
# Then ping the device
mosquitto_pub -h console.example.com -p 8883 \
  --cafile ca.crt -u console -P <password> \
  -t "acuvim/ACV-AABBCCDDEEFF/cmd" \
  -m '{"cmd":"ping","request_id":"diag-001"}'
```
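When scripting diagnostics, the payload and the response check can be generated rather than typed by hand. The sketch below is a minimal, hypothetical pair of helpers. It relies only on the fields that appear in this guide (`cmd`, `request_id`, `params`, and the `status` field of the ping response shown in Section 5.2); the generated default `request_id` is an assumption made for convenience, not a format the firmware requires.

```python
import json
import uuid
from typing import Any, Dict, Optional


def build_command(cmd: str,
                  params: Optional[Dict[str, Any]] = None,
                  request_id: Optional[str] = None) -> str:
    """Build a compact JSON command payload for the device cmd topic."""
    payload: Dict[str, Any] = {
        "cmd": cmd,
        # Assumption: any unique string is acceptable as a request_id.
        "request_id": request_id or f"diag-{uuid.uuid4().hex[:8]}",
    }
    if params is not None:
        payload["params"] = params
    return json.dumps(payload, separators=(",", ":"))


def is_success(response_json: str, request_id: str) -> bool:
    """True if the response belongs to request_id and reports success."""
    resp = json.loads(response_json)
    return resp.get("request_id") == request_id and resp.get("status") == "success"


ping = build_command("ping", request_id="diag-001")
print(ping)

reply = '{"request_id":"diag-001","status":"success","data":{"pong":true},"message":"pong"}'
print(is_success(reply, "diag-001"))
```

The string returned by `build_command` can be passed to `mosquitto_pub` with `-m`, and each line printed by `mosquitto_sub` can be passed to `is_success`.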
### 5.2 Diagnostic Commands
**Test connectivity:**
```json
{"cmd": "ping", "request_id": "diag-001"}
```
Expected response: `{"request_id":"diag-001","status":"success","data":{"pong":true},"message":"pong"}`

**Get full device status:**
```json
{"cmd": "get_status", "request_id": "diag-002"}
```
Returns uptime, memory, connection state, Modbus health, and more.

**Get current configuration:**
```json
{"cmd": "get_config", "request_id": "diag-003"}
```
Returns all configuration values (passwords masked).

**Get latest telemetry reading:**
```json
{"cmd": "get_telemetry", "request_id": "diag-004"}
```
Returns the most recent Acuvim II measurement without waiting for the next poll cycle.

**Restart the device:**
```json
{"cmd": "restart", "request_id": "diag-005"}
```
Use this when the device is behaving unexpectedly. It performs a clean restart.

**Factory reset (last resort):**
```json
{"cmd": "factory_reset", "request_id": "diag-006", "params": {"confirm": true}}
```
Clears all configuration. The device will restart into AP mode and require re-commissioning via the captive portal.

---
## 6. Database Diagnostics
### 6.1 Check Database Connectivity
```bash
docker compose exec db psql -U acuvim -d acuvim -c "SELECT 1;"
```
### 6.2 Check Device Status
```bash
docker compose exec db psql -U acuvim -d acuvim -c "
SELECT device_id, name, status, firmware_version, last_heartbeat,
connection_type, signal_strength
FROM devices
ORDER BY last_heartbeat DESC;
"
```
### 6.3 Check for Telemetry Gaps
```bash
docker compose exec db psql -U acuvim -d acuvim -c "
SELECT device_id,
MIN(timestamp) AS first_record,
MAX(timestamp) AS last_record,
COUNT(*) AS record_count,
EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp)) / NULLIF(COUNT(*) - 1, 0) AS avg_interval_sec
FROM telemetry_records
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY device_id
ORDER BY device_id;
"
```
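The query above reports only an average interval per device, which can hide a single long outage among many normal readings. To locate individual gaps, compare consecutive timestamps against the expected reporting interval. The following is a minimal sketch under stated assumptions: the timestamps are exported from `telemetry_records` by your own means, `find_gaps` is a hypothetical helper, and the 10-second interval and 2x tolerance are illustrative values rather than project defaults.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple


def find_gaps(timestamps: Sequence[datetime],
              expected: timedelta,
              tolerance: float = 2.0) -> List[Tuple[datetime, datetime]]:
    """Return (before, after) pairs where consecutive readings are further
    apart than expected * tolerance."""
    ordered = sorted(timestamps)
    limit = expected * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical data: one reading every 10 s, with a 5-minute hole.
start = datetime(2026, 5, 16, 10, 0, 0)
ts = [start + timedelta(seconds=10 * i) for i in range(6)]
ts += [ts[-1] + timedelta(minutes=5, seconds=10 * i) for i in range(3)]

gaps = find_gaps(ts, expected=timedelta(seconds=10))
for before, after in gaps:
    length = (after - before).total_seconds()
    print(f"gap from {before.isoformat()} to {after.isoformat()} ({length:.0f} s)")
```

Each reported gap can then be compared with the heartbeat history for the same period to see whether the transport was down or the SD card buffer failed to drain.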
### 6.4 Check Database Size
```bash
docker compose exec db psql -U acuvim -d acuvim -c "
SELECT
pg_size_pretty(pg_database_size('acuvim')) AS total_size,
pg_size_pretty(pg_total_relation_size('telemetry_records')) AS telemetry_size,
pg_size_pretty(pg_total_relation_size('alerts')) AS alerts_size,
pg_size_pretty(pg_total_relation_size('commands')) AS commands_size;
"
```
### 6.5 Check Unresolved Alerts
```bash
docker compose exec db psql -U acuvim -d acuvim -c "
SELECT device_id, alert_type, severity, message, created_at
FROM alerts
WHERE resolved_at IS NULL
ORDER BY created_at DESC
LIMIT 20;
"
```
---
## 7. Network Diagnostics
### 7.1 Test MQTT Broker Connectivity
From the server:
```bash
# Test TLS connection
openssl s_client -connect console.example.com:8883 -CAfile /opt/acuvim/mosquitto/certs/ca.crt
```
From a remote machine (simulating a device):
```bash
mosquitto_sub -h console.example.com -p 8883 \
--cafile ca.crt -u ACV-AABBCCDDEEFF -P <device-password> \
-t "acuvim/ACV-AABBCCDDEEFF/cmd" -v
```
### 7.2 Test Console API
```bash
# Health check
curl -s https://console.example.com/health | python3 -m json.tool
# Login and get token
TOKEN=$(curl -s -X POST https://console.example.com/api/auth/login \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"<password>"}' | python3 -c "import sys,json; print(json.load(sys.stdin)['token'])")
# List devices
curl -s https://console.example.com/api/devices \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
```
### 7.3 Check Firewall Rules
```bash
sudo ufw status verbose
```
Verify ports 443 and 8883 are allowed from anywhere.
### 7.4 Check Docker Network
```bash
# Verify containers can communicate
docker compose exec console ping -c 3 db
docker compose exec console ping -c 3 mqtt
```
---
## 8. Common Recovery Procedures
### 8.1 Device is Completely Unresponsive
1. Power cycle the device (disconnect and reconnect USB).
2. Wait 30 seconds for the boot sequence.
3. If the AP becomes visible, connect and reconfigure.
4. If no AP appears, connect via USB serial to check boot output.
5. If the serial shows a boot loop or panic, re-flash the firmware:
```bash
cd firmware/
pio run -t upload
```
### 8.2 Device Lost MQTT Connection After Config Change
If an `mqtt_set` command was sent with incorrect values:
1. Connect to the device's AP (Acuvim-XXXXXX) from a phone.
2. Open `http://192.168.4.1` in a browser.
3. Navigate to Settings > MQTT.
4. Correct the broker address, port, and credentials.
5. Tap Save & Connect.
### 8.3 Console Database Migration Failed
```bash
# Check what migrations are pending
docker compose exec console dotnet ef migrations list
# Apply migrations
docker compose exec console dotnet ef database update
# If a migration is broken, check the error and fix manually:
docker compose logs console | grep -i "migration"
```
### 8.4 Mosquitto Password File Corrupted
```bash
# Recreate the password file
docker run --rm -v /opt/acuvim/mosquitto/config:/mosquitto/config \
eclipse-mosquitto:2 \
mosquitto_passwd -c -b /mosquitto/config/passwd console <console-password>
# Re-add all device users
docker run --rm -v /opt/acuvim/mosquitto/config:/mosquitto/config \
eclipse-mosquitto:2 \
mosquitto_passwd -b /mosquitto/config/passwd ACV-AABBCCDDEEFF <device-password>
# Restart broker
docker compose restart mqtt
```
### 8.5 Restore Database from Backup
```bash
# Stop the console to prevent writes during restore
docker compose stop console
# Restore
gunzip -c /opt/acuvim/backups/acuvim_YYYYMMDD_HHMMSS.sql.gz | \
docker compose exec -T db psql -U acuvim -d acuvim
# Restart console
docker compose start console
```
---
## 9. Log Locations
| Component | Log Location | How to Access |
|-----------|-------------|---------------|
| Console application | Container stdout/stderr | `docker compose logs console` |
| PostgreSQL | Container stdout/stderr | `docker compose logs db` |
| Mosquitto | `/opt/acuvim/mosquitto/log/mosquitto.log` | `tail -f /opt/acuvim/mosquitto/log/mosquitto.log` |
| Nginx | `/var/log/nginx/access.log` and `error.log` | `tail -f /var/log/nginx/error.log` |
| ESP32 firmware | USB serial output | `pio device monitor --baud 115200` |
| Backup script | `/opt/acuvim/backups/backup.log` | `cat /opt/acuvim/backups/backup.log` |