# Troubleshooting Guide

Common issues, diagnostic procedures, and resolutions for the Tau Acuvim monitoring system covering the ESP32 device firmware, MQTT communication, and the console application.

---

## 1. Device Issues

### 1.1 Quick Reference Table

| Symptom | Likely Cause | Resolution |
|---------|-------------|------------|
| No AP visible after power on | Firmware not loaded or corrupt | Re-flash firmware via USB. See Section 8.1. |
| AP visible but captive portal does not open | Captive portal DNS redirect not triggering | Navigate manually to `http://192.168.4.1` in a browser. |
| WiFi connects but disconnects repeatedly | Wrong password, weak signal, or channel congestion | Re-enter credentials. Move device closer to AP. Try manual SSID entry if the network is hidden. |
| MQTT not connecting | Wrong broker address, port, credentials, or firewall blocking | Verify settings on captive portal MQTT page. Use the Test Connection button. Check that port 8883 is open on the server. |
| MQTT connects but no telemetry in console | ACL blocking publishes, topic prefix mismatch, or console MQTT bridge not subscribed | Check Mosquitto logs for ACL denials. Verify topic prefix matches between device and console config. |
| No Modbus data (all values zero) | RS485 wiring incorrect, wrong slave address, or wrong baud rate | Check A/B/GND wiring. Verify Acuvim II settings match device config (slave address, baud rate). |
| Modbus intermittent errors | Loose wiring, cable too long without termination, electrical noise | Tighten terminal connections. Add 120-ohm termination resistor. Use shielded cable. |
| GSM not connecting | SIM not inserted, no data plan, wrong APN, no signal, antenna disconnected | Verify SIM card seated correctly. Check APN settings. Ensure antenna is attached. Check coverage. |
| GSM connects but MQTT fails over GSM | MQTT broker not reachable from public internet, TLS issues on SIM800L | Verify broker has a public IP/domain. SIM800L supports TLS 1.0/1.1 only; consider SIM7600 for TLS 1.2. |
| OTA update fails | Download timeout, wrong URL, insufficient flash space, checksum mismatch | Verify console URL is correct. Ensure firmware binary is under 1.9 MB. Check server logs for download errors. Try OTA over WiFi (not GSM). |
| Device goes offline intermittently | Power supply instability, brownout, WiFi interference | Check power adapter rating (5V 2A recommended). Look for `BROWNOUT` in heartbeat reset reason. |
| Data gaps in telemetry | Transport was down during the gap, SD card buffer failed or is not installed | Check heartbeat history for connection type changes. Verify SD card is inserted and formatted correctly. |
| High memory usage / declining heap | Memory leak in firmware | Monitor `heap_min` trend in heartbeat data. If it decreases steadily over hours, report as a firmware bug. |
| Crash loops (device reboots repeatedly) | Bad configuration causing panic, firmware bug | After 5 consecutive crash reboots, the device performs an automatic factory reset and enters AP mode. Re-configure via captive portal. If the crash persists, re-flash firmware via USB. |
| Device stuck, not responding | Main loop hung, watchdog not firing | Power cycle the device. If it recurs, connect via USB serial to capture the hang state. The watchdog should reset the device after 30 seconds of inactivity. |
| Wrong time on telemetry | NTP not synced, no internet access at boot | Device syncs NTP on WiFi connect. If no internet is available, timestamps may be approximate. Verify NTP servers are reachable from the site network. |

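
For the declining-heap row, the trend check can be roughly scripted once `heap_min` samples have been pulled out of the heartbeat data. The sketch below is illustrative only: `heap_summary` is a hypothetical helper, not part of the project, and it assumes the samples are already available as one value per line, oldest first (how they are exported depends on your setup).

```bash
# Hypothetical helper: summarize heap_min samples read from stdin
# (bytes, oldest first, one per line).
# Prints: <first> <last> <total_drop> <falling_steps>
# A large total_drop spread over many falling steps is worth reporting.
heap_summary() {
  awk 'NR == 1 { first = $1 + 0 }
       NR > 1 && $1 + 0 < prev { falls++ }
       { prev = $1 + 0 }
       END { if (NR > 0) print first, prev, first - prev, falls + 0 }'
}

# Example call with made-up sample values:
#   printf '%s\n' 120000 118000 118000 115000 | heap_summary
```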
---

## 2. Console Application Issues

### 2.1 Quick Reference Table

| Symptom | Likely Cause | Resolution |
|---------|-------------|------------|
| Console web UI not loading | Container crashed, nginx misconfigured, DNS not resolving | Run `docker compose ps` to check container status. Check `docker compose logs console`. Verify nginx config and SSL certificate. |
| 502 Bad Gateway | Console container is down or not listening on port 5000 | Restart the console container: `docker compose restart console`. Check logs for startup errors. |
| Database connection error on startup | PostgreSQL not ready, wrong connection string, password mismatch | Verify the `db` container is healthy: `docker compose ps`. Check that the password in `.env` matches for both `POSTGRES_PASSWORD` and `ConnectionStrings__DefaultConnection`. |
| MQTT bridge not receiving messages | Broker credentials wrong, MQTT container down, subscription failure | Check `docker compose logs console` for MQTT connection errors. Verify the MQTT username/password in `.env` match the Mosquitto password file. |
| Devices register but show as offline | Health monitor not detecting heartbeats, time zone mismatch | Check that heartbeats are actually arriving: subscribe to `acuvim/+/heartbeat` with a test client. Ensure server clock is correct (NTP synced). |
| API returns 401 Unauthorized | JWT token expired or invalid, wrong secret | Re-login to obtain a new token. Verify `Jwt__Secret` in `.env` has not changed since the token was issued. |
| API returns 500 Internal Server Error | Unhandled exception in the application | Check `docker compose logs console` for the full stack trace. Common causes: database migration not applied, null reference on missing data. |
| SignalR not connecting from browser | WebSocket upgrade blocked by proxy, CORS issue | Verify nginx has the WebSocket upgrade configuration for `/hubs/`. Check browser console for CORS errors. Ensure the production domain is in the CORS allowed origins. |
| Firmware upload fails | File too large, wrong content type, disk full | Check `client_max_body_size` in nginx config (should be at least 10M). Verify disk space on the server. |
| Telemetry queries are slow | Missing database index, table too large | Run `EXPLAIN ANALYZE` on the slow query. Ensure the `idx_telemetry_device_ts` index exists. Consider data retention cleanup. |
| Alerts not appearing | Alert processing service not matching topics, device not publishing alerts | Check console logs for alert topic subscription. Verify device alert thresholds are configured (send `set_alerts` command). |
| OTA deployment shows "timeout" | Device did not respond within 60 seconds, device offline | Verify the device is online before deploying. Check that the device can reach the firmware download URL. Try a `ping` command first. |

---

## 3. MQTT Broker Issues

### 3.1 Quick Reference Table

| Symptom | Likely Cause | Resolution |
|---------|-------------|------------|
| Mosquitto container won't start | Config syntax error, certificate file missing or wrong permissions | Check `docker compose logs mqtt`. Verify certificate files exist and are owned by UID 1883. Check `mosquitto.conf` syntax. |
| Devices cannot connect to port 8883 | Firewall blocking, TLS certificate error, wrong port in device config | Verify firewall rule allows 8883/tcp. Test with `openssl s_client -connect host:8883`. Check certificate validity. |
| "Connection refused" | Mosquitto not listening on the expected port, container not running | Verify `docker compose ps` shows the mqtt container running. Check that the listener is configured in `mosquitto.conf`. |
| "Not authorized" | Wrong username/password, user not in password file | Verify device credentials in `/opt/acuvim/mosquitto/config/passwd`. Re-create with `mosquitto_passwd` if needed. Restart Mosquitto after changes. |
| "ACL denied" | Topic does not match ACL pattern for the connecting user | Check `/opt/acuvim/mosquitto/config/acl`. Ensure the device username matches its device_id used in topics. |
| Messages not reaching console | Console MQTT client disconnected, subscription lost | Check console logs for MQTT reconnect events. Verify the console subscribes to `acuvim/+/#` and `devices/register`. |
| High broker memory usage | Too many retained messages, large message backlog for offline clients | Clear retained messages if excessive. Review `max_queued_messages` setting in `mosquitto.conf`. |

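
When chasing "ACL denied" or "messages not reaching console", it helps to confirm that a concrete topic is actually covered by a filter such as `acuvim/+/#`. The following is a minimal sketch of standard MQTT wildcard matching (`+` for one level, `#` for the remainder). `mqtt_topic_matches` is a hypothetical helper written for this guide, not a Mosquitto tool, and it ignores special cases such as `$SYS` topics.

```bash
# Hypothetical helper: return 0 if topic $2 is matched by MQTT filter $1.
mqtt_topic_matches() {
  filter=$1
  topic=$2
  while :; do
    f=${filter%%/*}   # current filter level
    t=${topic%%/*}    # current topic level
    # '#' matches this level and everything below it
    [ "$f" = "#" ] && return 0
    # '+' matches any single level; otherwise the levels must be identical
    if [ "$f" != "+" ] && [ "$f" != "$t" ]; then
      return 1
    fi
    # Advance both sides by one level, remembering whether more levels remain
    case $filter in */*) filter=${filter#*/}; fmore=1 ;; *) fmore=0 ;; esac
    case $topic in */*) topic=${topic#*/}; tmore=1 ;; *) tmore=0 ;; esac
    if [ "$fmore" -eq 0 ] && [ "$tmore" -eq 0 ]; then
      return 0
    fi
    if [ "$tmore" -eq 0 ]; then
      # Topic exhausted: only a trailing '#' still matches (a/# matches a)
      [ "$filter" = "#" ] && return 0
      return 1
    fi
    if [ "$fmore" -eq 0 ]; then
      return 1
    fi
  done
}

# Example:
#   mqtt_topic_matches 'acuvim/+/#' 'acuvim/ACV-AABBCCDDEEFF/heartbeat' && echo covered
```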
---

## 4. Serial Monitor Debugging

For local debugging when a device is physically accessible, connect via USB and monitor the serial output.

### 4.1 Using PlatformIO

```bash
pio device monitor --baud 115200
```

### 4.2 Using screen (Linux/macOS)

```bash
screen /dev/ttyUSB0 115200
```

On macOS the serial port usually appears under a `/dev/cu.*` name rather than `/dev/ttyUSB0`; list `/dev/cu.*` to find it.

To exit: press `Ctrl+A` then `K`, then confirm with `Y`.

### 4.3 Using PuTTY (Windows)

1. Open PuTTY.
2. Select "Serial" as connection type.
3. Set the COM port (check Device Manager for the correct port).
4. Set speed to `115200`.
5. Click "Open".

### 4.4 What to Look For

The firmware logs the following events to serial output:

```
[BOOT] Acuvim Monitor v1.2.0
[BOOT] Reset reason: POWER_ON
[BOOT] Boot count: 4
[BOOT] Free heap: 245760 bytes
[NVS] Config loaded
[WIFI] Connecting to SiteNetwork...
[WIFI] Connected, IP: 192.168.1.100
[NTP] Time synced: 2026-05-16 10:30:00 UTC
[MQTT] Connecting to console.example.com:8883...
[MQTT] Connected, subscribing to acuvim/ACV-AABBCCDDEEFF/cmd
[MQTT] LWT set on acuvim/ACV-AABBCCDDEEFF/status
[REG] Publishing registration to devices/register
[MB] Reading Acuvim II (addr=1, baud=9600)...
[MB] Read OK: Va=230.1 Vb=231.4 Vc=229.8 Ia=15.2 Ib=14.8 Ic=15.5
[MQTT] Published telemetry (287 bytes)
[HB] Heartbeat #1 sent
```

Error messages to watch for:

```
[WIFI] Connection failed: AUTH_FAIL
[MQTT] Connection failed: rc=-2 (MQTT_CONNECTION_REFUSED)
[MB] Read failed: TIMEOUT (addr=1, reg=0x0000)
[MB] 3 consecutive errors
[GSM] SIM not detected
[GSM] Network registration failed
[OTA] Download failed: HTTP 404
[OTA] Checksum mismatch, aborting
[SYS] WARNING: Free heap below 30KB (28416 bytes)
[SYS] CRASH LOOP DETECTED (count=5), performing factory reset
```

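
On a long capture it is easier to filter for the failure lines than to read the whole log. The sketch below assumes the serial output has been saved to a file first; `serial_log_errors` is a hypothetical helper, and its pattern list is derived only from the sample error messages shown above, so other firmware messages may need to be added.

```bash
# Hypothetical helper: print the lines of a saved serial log ($1) that match
# the failure patterns from the sample messages above.
# Capture a log first, for example:
#   pio device monitor --baud 115200 | tee serial.log
serial_log_errors() {
  grep -E 'failed|mismatch|not detected|consecutive errors|WARNING|CRASH LOOP' "$1"
}

# Example:
#   serial_log_errors serial.log
```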
---

## 5. Diagnostic MQTT Commands

Send these commands from the console or directly via a test MQTT client to diagnose device issues remotely.

### 5.1 Using mosquitto_pub (Command Line)

```bash
# Ping the device
mosquitto_pub -h console.example.com -p 8883 \
  --cafile ca.crt -u console -P '<password>' \
  -t "acuvim/ACV-AABBCCDDEEFF/cmd" \
  -m '{"cmd":"ping","request_id":"diag-001"}'

# Subscribe to responses (start this in a second terminal before publishing)
mosquitto_sub -h console.example.com -p 8883 \
  --cafile ca.crt -u console -P '<password>' \
  -t "acuvim/ACV-AABBCCDDEEFF/resp"
```

### 5.2 Diagnostic Commands

**Test connectivity:**

```json
{"cmd": "ping", "request_id": "diag-001"}
```

Expected response: `{"request_id":"diag-001","status":"success","data":{"pong":true},"message":"pong"}`

**Get full device status:**

```json
{"cmd": "get_status", "request_id": "diag-002"}
```

Returns uptime, memory, connection state, Modbus health, and more.

**Get current configuration:**

```json
{"cmd": "get_config", "request_id": "diag-003"}
```

Returns all configuration values (passwords masked).

**Get latest telemetry reading:**

```json
{"cmd": "get_telemetry", "request_id": "diag-004"}
```

Returns the most recent Acuvim II measurement without waiting for the next poll cycle.

**Restart the device:**

```json
{"cmd": "restart", "request_id": "diag-005"}
```

Use this when the device is behaving unexpectedly. It performs a clean restart.

**Factory reset (last resort):**

```json
{"cmd": "factory_reset", "request_id": "diag-006", "params": {"confirm": true}}
```

Clears all configuration. The device will restart into AP mode and require re-commissioning via the captive portal.

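
When several of these commands are sent in a row, small shell helpers reduce quoting mistakes. The following is a sketch only: `build_cmd` and `resp_ok` are hypothetical names, `build_cmd` covers just the parameterless commands above, and `resp_ok` assumes the device replies with compact JSON (no spaces after colons), as in the expected ping response.

```bash
# Hypothetical helper: build a parameterless command payload.
#   $1 = command name, $2 = request_id
build_cmd() {
  printf '{"cmd":"%s","request_id":"%s"}' "$1" "$2"
}

# Hypothetical helper: return 0 if response JSON $1 carries request_id $2
# and reports status "success". Plain substring match, not a JSON parser.
resp_ok() {
  printf '%s' "$1" | grep -qF "\"request_id\":\"$2\"" &&
    printf '%s' "$1" | grep -qF '"status":"success"'
}

# Example, combined with the mosquitto clients from Section 5.1
# (-C 1 exits after one message, -W 10 gives up after 10 seconds):
#   mosquitto_pub ... -t "acuvim/ACV-AABBCCDDEEFF/cmd" -m "$(build_cmd ping diag-001)"
#   resp=$(mosquitto_sub ... -t "acuvim/ACV-AABBCCDDEEFF/resp" -C 1 -W 10)
#   resp_ok "$resp" diag-001 && echo "device answered"
```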
---

## 6. Database Diagnostics

### 6.1 Check Database Connectivity

```bash
docker compose exec db psql -U acuvim -d acuvim -c "SELECT 1;"
```

### 6.2 Check Device Status

```bash
docker compose exec db psql -U acuvim -d acuvim -c "
  SELECT device_id, name, status, firmware_version, last_heartbeat,
         connection_type, signal_strength
  FROM devices
  ORDER BY last_heartbeat DESC;
"
```

### 6.3 Check for Telemetry Gaps

```bash
# N records span N-1 intervals, so divide by COUNT(*) - 1
# (NULLIF avoids division by zero for a device with a single record).
docker compose exec db psql -U acuvim -d acuvim -c "
  SELECT device_id,
         MIN(timestamp) AS first_record,
         MAX(timestamp) AS last_record,
         COUNT(*) AS record_count,
         EXTRACT(EPOCH FROM MAX(timestamp) - MIN(timestamp)) / NULLIF(COUNT(*) - 1, 0) AS avg_interval_sec
  FROM telemetry_records
  WHERE timestamp > NOW() - INTERVAL '24 hours'
  GROUP BY device_id
  ORDER BY device_id;
"
```

A device whose `avg_interval_sec` is well above its configured telemetry interval has missing records somewhere in the 24-hour window.

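
The average interval shows that records are missing but not where. To locate individual gaps, the timestamps for one device can be exported and scanned for jumps. This is a sketch under assumptions: `find_gaps` is a hypothetical helper, the input is expected as epoch seconds sorted ascending with one value per line, and the threshold is whatever you consider a gap for your reporting interval.

```bash
# Hypothetical helper: read epoch-second timestamps (ascending, one per line)
# from stdin and print "<gap_start> <gap_end> <gap_seconds>" for every jump
# larger than the threshold in seconds ($1).
find_gaps() {
  awk -v max="$1" '
    NR > 1 && $1 - prev > max { print prev, $1, $1 - prev }
    { prev = $1 }'
}

# Example: export one device's timestamps with psql (-A unaligned, -t tuples
# only) and report gaps longer than 5 minutes:
#   docker compose exec -T db psql -U acuvim -d acuvim -At -c "
#     SELECT EXTRACT(EPOCH FROM timestamp)::bigint
#     FROM telemetry_records
#     WHERE device_id = 'ACV-AABBCCDDEEFF'
#       AND timestamp > NOW() - INTERVAL '24 hours'
#     ORDER BY timestamp;" | find_gaps 300
```

Each output line can then be compared against the heartbeat history to see whether the transport was down during that interval.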
### 6.4 Check Database Size

```bash
docker compose exec db psql -U acuvim -d acuvim -c "
  SELECT
    pg_size_pretty(pg_database_size('acuvim')) AS total_size,
    pg_size_pretty(pg_total_relation_size('telemetry_records')) AS telemetry_size,
    pg_size_pretty(pg_total_relation_size('alerts')) AS alerts_size,
    pg_size_pretty(pg_total_relation_size('commands')) AS commands_size;
"
```

### 6.5 Check Unresolved Alerts

```bash
docker compose exec db psql -U acuvim -d acuvim -c "
  SELECT device_id, alert_type, severity, message, created_at
  FROM alerts
  WHERE resolved_at IS NULL
  ORDER BY created_at DESC
  LIMIT 20;
"
```

---

## 7. Network Diagnostics

### 7.1 Test MQTT Broker Connectivity

From the server:

```bash
# Test TLS connection
openssl s_client -connect console.example.com:8883 -CAfile /opt/acuvim/mosquitto/certs/ca.crt
```

From a remote machine (simulating a device):

```bash
mosquitto_sub -h console.example.com -p 8883 \
  --cafile ca.crt -u ACV-AABBCCDDEEFF -P '<device-password>' \
  -t "acuvim/ACV-AABBCCDDEEFF/cmd" -v
```

### 7.2 Test Console API

```bash
# Health check
curl -s https://console.example.com/health | python3 -m json.tool

# Login and get token
TOKEN=$(curl -s -X POST https://console.example.com/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"<password>"}' | python3 -c "import sys,json; print(json.load(sys.stdin)['token'])")

# List devices
curl -s https://console.example.com/api/devices \
  -H "Authorization: Bearer $TOKEN" | python3 -m json.tool
```

### 7.3 Check Firewall Rules

```bash
sudo ufw status verbose
```

Verify ports 443 and 8883 are allowed from anywhere.

### 7.4 Check Docker Network

```bash
# Verify containers can communicate
docker compose exec console ping -c 3 db
docker compose exec console ping -c 3 mqtt
```

---

## 8. Common Recovery Procedures

### 8.1 Device is Completely Unresponsive

1. Power cycle the device (disconnect and reconnect USB).
2. Wait 30 seconds for the boot sequence.
3. If the AP becomes visible, connect and reconfigure.
4. If no AP appears, connect via USB serial to check boot output.
5. If the serial shows a boot loop or panic, re-flash the firmware:

   ```bash
   cd firmware/
   pio run -t upload
   ```

### 8.2 Device Lost MQTT Connection After Config Change

If an `mqtt_set` command was sent with incorrect values:

1. Connect to the device's AP (Acuvim-XXXXXX) from a phone.
2. Open `http://192.168.4.1` in a browser.
3. Navigate to Settings > MQTT.
4. Correct the broker address, port, and credentials.
5. Tap Save & Connect.

### 8.3 Console Database Migration Failed

```bash
# Check what migrations are pending
docker compose exec console dotnet ef migrations list

# Apply migrations
docker compose exec console dotnet ef database update

# If a migration is broken, check the error and fix manually:
docker compose logs console | grep -i "migration"
```

### 8.4 Mosquitto Password File Corrupted

```bash
# Recreate the password file
docker run --rm -v /opt/acuvim/mosquitto/config:/mosquitto/config \
  eclipse-mosquitto:2 \
  mosquitto_passwd -c -b /mosquitto/config/passwd console '<console-password>'

# Re-add all device users
docker run --rm -v /opt/acuvim/mosquitto/config:/mosquitto/config \
  eclipse-mosquitto:2 \
  mosquitto_passwd -b /mosquitto/config/passwd ACV-AABBCCDDEEFF '<device-password>'

# Restart broker
docker compose restart mqtt
```

### 8.5 Restore Database from Backup

```bash
# Stop the console to prevent writes during restore
docker compose stop console

# Restore
gunzip -c /opt/acuvim/backups/acuvim_YYYYMMDD_HHMMSS.sql.gz | \
  docker compose exec -T db psql -U acuvim -d acuvim

# Restart console
docker compose start console
```

---

## 9. Log Locations

| Component | Log Location | How to Access |
|-----------|-------------|---------------|
| Console application | Container stdout/stderr | `docker compose logs console` |
| PostgreSQL | Container stdout/stderr | `docker compose logs db` |
| Mosquitto | `/opt/acuvim/mosquitto/log/mosquitto.log` | `tail -f /opt/acuvim/mosquitto/log/mosquitto.log` |
| Nginx | `/var/log/nginx/access.log` and `error.log` | `tail -f /var/log/nginx/error.log` |
| ESP32 firmware | USB serial output | `pio device monitor --baud 115200` |
| Backup script | `/opt/acuvim/backups/backup.log` | `cat /opt/acuvim/backups/backup.log` |