Tau.Acuvim/docs/acuvim-spec-07.md
Renier Forster 84a0668c54 Initial commit: Tau Acuvim IoT monitoring system
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-16 19:05:32 +02:00


# Phase 7: Heartbeat, Health Management & Device Registration
## Objective
Implement the heartbeat system for device health monitoring, comprehensive health management for both the ESP32 and the Acuvim II, device self-registration with the console application, and the MQTT command/response protocol for remote device management.
## Prerequisites
- Phase 6 complete (OTA working)
- MQTT broker running
- Console application (Phase 8-9) or `mosquitto_sub` for testing
## Deliverables
1. Periodic heartbeat publishing (WiFi and/or GSM)
2. ESP32 health metrics collection
3. Acuvim II health monitoring and diagnostics
4. Device self-registration via MQTT
5. Complete MQTT command/response protocol
6. Watchdog timer for crash recovery
7. Diagnostic logging
---
## 7.1 Heartbeat System
### `heartbeat_manager.h` / `heartbeat_manager.cpp`
The heartbeat is a periodic MQTT publish containing device health metrics. The console uses the absence of heartbeats to detect offline devices.
```cpp
class HeartbeatManager {
public:
    void begin(ConfigManager& config, TransportManager& transport,
               MqttClient& mqtt, AcuvimReader& acuvim);
    void loop();      // Called in main loop
    void sendNow();   // Force immediate heartbeat
    String buildHeartbeatPayload();

private:
    unsigned long lastHeartbeat;
    uint32_t heartbeatCount;
    uint32_t bootCount;   // Persisted in NVS
};
```
### Heartbeat Payload
Published to `{prefix}/{device_id}/heartbeat`:
```json
{
  "ts": 1716000000,
  "dev": "ACV-AABBCCDDEEFF",
  "fw": "1.0.0",
  "up": 3600,
  "boot": 12,
  "hb": 60,
  "conn": {
    "type": "wifi",
    "wifi": {
      "ssid": "MyNetwork",
      "rssi": -45,
      "ip": "192.168.1.100"
    },
    "gsm": {
      "enabled": true,
      "connected": false,
      "signal": 0,
      "operator": ""
    },
    "mqtt": true
  },
  "health": {
    "heap_free": 120000,
    "heap_min": 95000,
    "psram_free": 3800000,
    "cpu_temp": 52.3,
    "reset_reason": "SW_RESET",
    "uptime_sec": 3600
  },
  "modbus": {
    "connected": true,
    "success": 1234,
    "errors": 5,
    "error_rate": 0.4,
    "last_error": 0,
    "last_read_ms": 45
  },
  "sd": {
    "available": true,
    "queued": 0,
    "free_mb": 3800
  },
  "ota": {
    "version": "1.0.0",
    "partition": "app0",
    "update_available": false
  }
}
```
### Heartbeat Fields Explained
| Field | Description |
|-------|-------------|
| `ts` | Unix epoch timestamp in seconds (UTC) |
| `dev` | Device ID |
| `fw` | Firmware version |
| `up` | Uptime in seconds |
| `boot` | Boot count (persisted, increments each restart) |
| `hb` | Heartbeat sequence number since boot |
| `conn.type` | Active transport (`wifi` or `gsm`) |
| `health.heap_free` | Current free heap memory (bytes) |
| `health.heap_min` | Minimum free heap since boot (leak detection) |
| `health.psram_free` | Free PSRAM (bytes) |
| `health.cpu_temp` | ESP32 internal temperature sensor (if available) |
| `health.reset_reason` | Why the ESP32 last reset (power on, watchdog, crash, etc.) |
| `modbus.error_rate` | Error percentage (errors / total * 100) |
| `modbus.last_read_ms` | Duration of last successful Modbus read cycle |
### Heartbeat Behavior
- Interval: configurable, default 60 seconds
- Published via active transport (WiFi or GSM)
- Sent immediately on boot (after initial connection)
- Sent immediately after transport switch
- QoS 1 (acknowledged delivery)
- If a publish fails, drop that heartbeat and send a fresh one on the next cycle (failed heartbeats are not queued)
- Console marks device as "degraded" after 3 missed heartbeats and "offline" after 5
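The timing rules above can be sketched in plain C++ as follows. This is a minimal illustration, not the firmware's actual code: `heartbeatDue`, `DeviceStatus`, and `classifyDevice` are hypothetical names, and the console-side classification assumes that "missed heartbeats" means whole intervals elapsed since the last heartbeat arrived, with no grace period.

```cpp
#include <cstdint>

// Returns true when at least intervalMs has elapsed since lastMs.
// Unsigned subtraction keeps the comparison correct when a 32-bit
// millisecond counter (such as Arduino's millis()) wraps around.
bool heartbeatDue(uint32_t nowMs, uint32_t lastMs, uint32_t intervalMs) {
    return static_cast<uint32_t>(nowMs - lastMs) >= intervalMs;
}

enum class DeviceStatus { Online, Degraded, Offline };

// Console-side view (illustrative): classify a device by how many whole
// heartbeat intervals have passed since the last heartbeat was received.
DeviceStatus classifyDevice(uint32_t secondsSinceLast, uint32_t intervalSec) {
    if (intervalSec == 0) return DeviceStatus::Offline;  // guard against division by zero
    uint32_t missed = secondsSinceLast / intervalSec;
    if (missed >= 5) return DeviceStatus::Offline;
    if (missed >= 3) return DeviceStatus::Degraded;
    return DeviceStatus::Online;
}
```

With the default 60 second interval, a device would be reported as degraded after 180 seconds of silence and offline after 300 seconds.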
## 7.2 ESP32 Health Monitoring
### Metrics Collection
```cpp
struct DeviceHealth {
    uint32_t freeHeap;
    uint32_t minFreeHeap;
    uint32_t freePsram;
    float cpuTemp;
    String resetReason;
    uint32_t uptimeSec;
    uint32_t bootCount;
    uint8_t wifiReconnects;
    uint8_t gsmReconnects;
    uint8_t mqttReconnects;
};
```
### Reset Reason Tracking
```cpp
#include <esp_system.h>  // esp_reset_reason()

// Map the ESP-IDF reset reason to the string reported in the heartbeat.
String getResetReasonString() {
    esp_reset_reason_t reason = esp_reset_reason();
    switch (reason) {
        case ESP_RST_POWERON:   return "POWER_ON";
        case ESP_RST_EXT:       return "EXTERNAL";
        case ESP_RST_SW:        return "SW_RESET";
        case ESP_RST_PANIC:     return "PANIC";
        case ESP_RST_INT_WDT:   return "INT_WDT";
        case ESP_RST_TASK_WDT:  return "TASK_WDT";
        case ESP_RST_WDT:       return "WDT";
        case ESP_RST_DEEPSLEEP: return "DEEP_SLEEP";
        case ESP_RST_BROWNOUT:  return "BROWNOUT";
        default:                return "UNKNOWN";
    }
}
```
### Memory Leak Detection
- Track `esp_get_minimum_free_heap_size()`, the low-water mark of free heap since boot; if it keeps falling over time, suspect a leak or fragmentation
- Log a warning if the minimum free heap drops below 30 KB
- Include in heartbeat for console-side trend analysis
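A minimal sketch of the checks above, written as host-compilable C++ so the logic can be tested off-device. `HeapWatch` and `kHeapWarnBytes` are hypothetical names; in firmware the sample passed to `update()` would come from `esp_get_minimum_free_heap_size()`.

```cpp
#include <cstdint>

constexpr uint32_t kHeapWarnBytes = 30 * 1024;  // 30 KB warning threshold

class HeapWatch {
public:
    // Record the current low-water mark. Returns true if a warning
    // should be logged because the value is below the threshold.
    bool update(uint32_t minFreeHeap) {
        if (!hasBaseline_) {
            baseline_ = minFreeHeap;
            hasBaseline_ = true;
        }
        latest_ = minFreeHeap;
        return minFreeHeap < kHeapWarnBytes;
    }

    // Bytes by which the low-water mark has fallen since the first sample.
    uint32_t dropSinceBaseline() const {
        return (hasBaseline_ && baseline_ > latest_) ? baseline_ - latest_ : 0;
    }

private:
    bool hasBaseline_ = false;
    uint32_t baseline_ = 0;
    uint32_t latest_ = 0;
};
```

The device only needs the threshold check; the drop since baseline is the kind of figure the console could derive from successive `health.heap_min` values for trend analysis.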
### Watchdog Timer
Enable the task watchdog to recover from hangs:
```cpp
#include <esp_task_wdt.h>
void setup() {
    esp_task_wdt_init(30, true);  // 30 second timeout, panic on timeout
    esp_task_wdt_add(NULL);       // Watch the main task
}

void loop() {
    esp_task_wdt_reset();         // Feed the watchdog
    // ... main loop work ...
}
```
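If the main loop hangs for more than 30 seconds, the watchdog resets the ESP32. The reset reason will be `TASK_WDT`, visible in the next heartbeat. The two-argument `esp_task_wdt_init(30, true)` call matches ESP-IDF 4.x (Arduino-ESP32 core 2.x); ESP-IDF 5.x (Arduino-ESP32 core 3.x) instead takes a pointer to an `esp_task_wdt_config_t`, with the timeout given in milliseconds.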
## 7.3 Acuvim II Health Monitoring
### Communication Health
Track and report Modbus communication quality:
```cpp
struct ModbusHealth {
    bool connected;              // Last read succeeded
    uint32_t totalReads;         // Total read attempts
    uint32_t successReads;       // Successful reads
    uint32_t failedReads;        // Failed reads
    float errorRate;             // Percentage (failed/total * 100)
    uint8_t lastError;           // Last Modbus error code
    uint32_t lastReadDuration;   // ms for last complete read cycle
    uint32_t consecutiveErrors;  // Errors in a row (0 = last read OK)
    uint32_t lastSuccessTs;      // Timestamp of last successful read
};
```
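One way to keep these counters consistent is to update them in a single place after every read attempt. The sketch below is illustrative only: `ModbusStats` is a trimmed, hypothetical copy of the struct above, `recordRead` and `communicationLost` are hypothetical helpers, and treating error code 0 as success is an assumption.

```cpp
#include <cstdint>

struct ModbusStats {
    bool connected = false;
    uint32_t totalReads = 0;
    uint32_t successReads = 0;
    uint32_t failedReads = 0;
    float errorRate = 0.0f;
    uint8_t lastError = 0;
    uint32_t consecutiveErrors = 0;
};

// Update the counters after one read attempt.
// errorCode == 0 is assumed to mean success.
void recordRead(ModbusStats& s, uint8_t errorCode) {
    s.totalReads++;
    if (errorCode == 0) {
        s.successReads++;
        s.consecutiveErrors = 0;
        s.connected = true;
    } else {
        s.failedReads++;
        s.consecutiveErrors++;
        s.lastError = errorCode;
        s.connected = false;
    }
    s.errorRate = 100.0f * static_cast<float>(s.failedReads) /
                  static_cast<float>(s.totalReads);
}

// True once five reads in a row have failed, the default threshold
// for the "Modbus communication loss" alert.
bool communicationLost(const ModbusStats& s) {
    return s.consecutiveErrors >= 5;
}
```

Computed this way, the sample heartbeat values (1234 successes, 5 errors) give 5 / 1239 * 100, which rounds to the 0.4 shown in the payload.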
### Acuvim II Value Health
Monitor for abnormal readings that may indicate meter or installation issues:
```cpp
struct AcuvimHealth {
    bool overvoltage;         // Any phase > threshold (e.g., 260V)
    bool undervoltage;        // Any phase < threshold (e.g., 200V)
    bool overcurrent;         // Any phase > rated current
    bool voltageImbalance;    // Phase voltage difference > 10%
    bool currentImbalance;    // Phase current difference > 20%
    bool frequencyDeviation;  // Frequency outside 49.5-50.5 Hz
    bool lowPowerFactor;      // PF < 0.85
    bool highTHD;             // THD > 8%
};
```
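The flags above could be derived from per-phase readings roughly as follows. This is a sketch under stated assumptions: `PhaseReading`, `imbalancePercent`, `anyAbove`, and `anyBelow` are hypothetical names, and imbalance is computed as the largest deviation from the three-phase mean, which is one common definition; the spec does not say which definition of "difference" it intends.

```cpp
#include <algorithm>
#include <cmath>

struct PhaseReading {
    float a, b, c;  // one value per phase (volts or amps)
};

// Largest deviation of any phase from the three-phase mean,
// expressed as a percentage of the mean.
float imbalancePercent(const PhaseReading& p) {
    float mean = (p.a + p.b + p.c) / 3.0f;
    if (mean <= 0.0f) return 0.0f;  // no load or invalid data
    float maxDev = std::max({std::fabs(p.a - mean),
                             std::fabs(p.b - mean),
                             std::fabs(p.c - mean)});
    return 100.0f * maxDev / mean;
}

// True if any phase is above the limit (overvoltage, overcurrent).
bool anyAbove(const PhaseReading& p, float limit) {
    return p.a > limit || p.b > limit || p.c > limit;
}

// True if any phase is below the limit (undervoltage).
bool anyBelow(const PhaseReading& p, float limit) {
    return p.a < limit || p.b < limit || p.c < limit;
}
```

For example, `overvoltage` would be `anyAbove(volts, 260.0f)` and `voltageImbalance` would be `imbalancePercent(volts) > 10.0f`.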
### Health Alerts
When health thresholds are exceeded, publish an alert to `{prefix}/{device_id}/alerts`:
```json
{
  "ts": 1716000000,
  "dev": "ACV-AABBCCDDEEFF",
  "alert": "overvoltage",
  "severity": "warning",
  "message": "Phase A voltage 265.3V exceeds 260V threshold",
  "value": 265.3,
  "threshold": 260.0,
  "phase": "A"
}
```
Alert types and default thresholds (configurable via console):
| Alert | Default Threshold | Severity |
|-------|-------------------|----------|
| Overvoltage | >260V | warning |
| Undervoltage | <200V | warning |
| Overcurrent | >(CT rated * 1.2) | warning |
| Voltage imbalance | >10% difference | info |
| Current imbalance | >20% difference | info |
| Frequency deviation | <49.5 or >50.5 Hz | warning |
| Low power factor | <0.85 | info |
| High THD | >8% | info |
| Modbus communication loss | 5 consecutive errors | critical |
| Modbus high error rate | >5% over 1 hour | warning |
### Alert Deduplication
- Same alert is not re-published until the condition clears and reoccurs
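- Use hysteresis: the alert triggers at the threshold and clears only once the value has moved back past it by a margin (e.g., overvoltage triggers at 260V and clears at 255V; for lower-bound alerts such as undervoltage the margin is added rather than subtracted)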
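A minimal sketch of deduplication with hysteresis for an upper-bound alert such as overvoltage. `HighAlertLatch` is a hypothetical name, and clearing strictly below the clear level (rather than at it) is an assumption.

```cpp
// Latches an upper-bound alert. For lower-bound alerts (undervoltage,
// low power factor) the comparisons would be mirrored.
class HighAlertLatch {
public:
    HighAlertLatch(float trigger, float margin)
        : trigger_(trigger), clear_(trigger - margin) {}

    // Feed one sample. Returns true only on the sample where the alert
    // becomes active, which is when it should be published.
    bool update(float value) {
        if (!active_ && value > trigger_) {
            active_ = true;
            return true;
        }
        if (active_ && value < clear_) {
            active_ = false;  // condition cleared, ready to re-trigger
        }
        return false;
    }

    bool active() const { return active_; }

private:
    float trigger_;
    float clear_;
    bool active_ = false;
};
```

With `HighAlertLatch latch(260.0f, 5.0f)`, a reading of 265.3 publishes once, further readings above 255 stay silent, and the alert can fire again only after a reading below 255.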
## 7.4 Device Self-Registration
### Registration Flow
On first boot (or after factory reset), the device registers itself with the console:
```
1. Device boots, generates device_id from MAC address
2. Connects to WiFi or GSM
3. Connects to MQTT
4. Publishes to "devices/register":
   {
     "device_id": "ACV-AABBCCDDEEFF",
     "mac": "AA:BB:CC:DD:EE:FF",
     "firmware": "1.0.0",
     "hardware": "TTGO T-Call v1.4",
     "chip": "ESP32-WROVER-B",
     "imei": "123456789012345",
     "capabilities": ["wifi", "gsm", "sd", "modbus", "ota"],
     "ts": 1716000000
   }
5. Console receives registration, creates/updates device record
6. Console publishes to "{prefix}/{device_id}/cmd":
   {
     "cmd": "register_ack",
     "request_id": "reg-001",
     "status": "accepted",
     "device_name": "Inverter-Building-A"
   }
7. Device saves registration acknowledgment
```
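Step 1 derives the device ID from the MAC address. A sketch of that formatting, assuming the ID is the prefix `ACV-` followed by the six MAC bytes in uppercase hex, as the example values suggest; `deviceIdFromMac` is a hypothetical helper, and in firmware the bytes would be read from the chip rather than passed in.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a 6-byte MAC address as "ACV-" followed by 12 uppercase hex digits.
std::string deviceIdFromMac(const uint8_t mac[6]) {
    char buf[17];  // "ACV-" (4) + 12 hex digits + terminating NUL
    std::snprintf(buf, sizeof(buf), "ACV-%02X%02X%02X%02X%02X%02X",
                  mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return std::string(buf);
}
```

For the MAC `AA:BB:CC:DD:EE:FF` this yields `ACV-AABBCCDDEEFF`, matching the registration payload above.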
### Registration Triggers
- First boot (no registration ACK in NVS)
- Factory reset (NVS cleared)
- Firmware update (re-register with new version)
- Every 24 hours (re-register to confirm presence)
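The four triggers can be folded into a single check at boot and in the main loop. This is a sketch only: `RegistrationState` and `needsRegistration` are hypothetical, and it assumes the firmware version and timestamp of the last registration are stored alongside the ACK.

```cpp
#include <cstdint>
#include <string>

struct RegistrationState {
    bool ackStored;             // registration ACK present in NVS
    std::string registeredFw;   // firmware version at last registration
    uint32_t lastRegisteredTs;  // epoch seconds of last registration
};

// Returns true if the device should publish a registration message now.
bool needsRegistration(const RegistrationState& s,
                       const std::string& currentFw, uint32_t nowTs) {
    const uint32_t kReRegisterSec = 24u * 60u * 60u;
    if (!s.ackStored) return true;                 // first boot or factory reset
    if (s.registeredFw != currentFw) return true;  // firmware was updated
    // Daily refresh. If the clock moved backwards, the unsigned
    // subtraction wraps to a large value and also triggers a refresh.
    return nowTs - s.lastRegisteredTs >= kReRegisterSec;
}
```

A factory reset clears NVS, so it shows up here as a missing ACK and needs no separate branch.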
### Capabilities Array
Dynamically detected at boot:
```cpp
JsonArray caps = doc.createNestedArray("capabilities");
caps.add("wifi"); // Always present
if (gsmManager.isModemReady()) caps.add("gsm");
if (sdManager.isAvailable()) caps.add("sd");
caps.add("modbus"); // Always present
caps.add("ota"); // Always present
```
## 7.5 MQTT Command/Response Protocol
### Command Format (Console -> Device)
Published to `{prefix}/{device_id}/cmd`:
```json
{
  "cmd": "<command_name>",
  "request_id": "<unique_id>",
  "params": { ... }
}
```
### Response Format (Device -> Console)
Published to `{prefix}/{device_id}/resp`:
```json
{
  "request_id": "<matching_id>",
  "status": "success|error",
  "data": { ... },
  "message": "Human-readable message"
}
```
### Complete Command List
| Command | Params | Description |
|---------|--------|-------------|
| `wifi_scan` | none | Scan and return available WiFi networks |
| `wifi_set` | `{ssid, password}` | Set WiFi credentials |
| `wifi_disconnect` | none | Disconnect WiFi |
| `mqtt_set` | `{broker, port, username, password, topic_prefix, tls}` | Update MQTT config |
| `gsm_set` | `{apn, username, password, enabled}` | Update GSM config |
| `transport_priority` | `{priority}` | Set transport priority |
| `sleep_set` | `{enabled, sleep_min, wake_sec}` | Update sleep config |
| `modbus_set` | `{slave_addr, baud_rate, poll_interval}` | Update Modbus config |
| `console_set` | `{url}` | Set console URL |
| `ota_update` | `{url, version, checksum}` | Trigger OTA update |
| `ota_check` | none | Check for available update |
| `get_config` | none | Return full device config |
| `get_status` | none | Return device status |
| `get_telemetry` | none | Return latest Acuvim readings |
| `restart` | none | Restart device |
| `factory_reset` | `{confirm: true}` | Factory reset (requires confirm) |
| `set_heartbeat` | `{interval_sec}` | Change heartbeat interval |
| `set_alerts` | `{thresholds: {...}}` | Update alert thresholds |
| `ping` | none | Connectivity test, device responds with `pong` |
| `register_ack` | `{status, device_name}` | Acknowledge registration |
### Command Handler
```cpp
class CommandHandler {
public:
    void begin(ConfigManager& config, WifiManager& wifi,
               GsmManager& gsm, TransportManager& transport,
               MqttClient& mqtt, AcuvimReader& acuvim,
               OtaManager& ota, SdManager& sd);
    void handleCommand(const String& topic, const String& payload);

private:
    void cmdWifiScan(const String& requestId);
    void cmdWifiSet(const String& requestId, JsonObject& params);
    void cmdMqttSet(const String& requestId, JsonObject& params);
    void cmdGsmSet(const String& requestId, JsonObject& params);
    void cmdSleepSet(const String& requestId, JsonObject& params);
    void cmdModbusSet(const String& requestId, JsonObject& params);
    void cmdOtaUpdate(const String& requestId, JsonObject& params);
    void cmdGetConfig(const String& requestId);
    void cmdGetStatus(const String& requestId);
    void cmdGetTelemetry(const String& requestId);
    void cmdRestart(const String& requestId);
    void cmdFactoryReset(const String& requestId, JsonObject& params);
    void cmdPing(const String& requestId);
    // ... etc
};
};
```
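Inside `handleCommand`, the `cmd` string has to be routed to the matching handler, and unknown commands should still produce an error response. One possible approach is a lookup table, sketched here in standard C++ with hypothetical names (`CommandDispatcher`, `CommandResult`); JSON parsing and parameter passing are left out, and the real firmware may well use a plain `if`/`else` chain instead.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

struct CommandResult {
    std::string status;   // "success" or "error"
    std::string message;  // human-readable message for the response
};

class CommandDispatcher {
public:
    using Handler = std::function<CommandResult(const std::string& requestId)>;

    // Register a handler for a command name.
    void on(const std::string& cmd, Handler h) {
        handlers_[cmd] = std::move(h);
    }

    // Route a command to its handler; unknown commands yield an error
    // result so the console always receives a response.
    CommandResult dispatch(const std::string& cmd,
                           const std::string& requestId) const {
        auto it = handlers_.find(cmd);
        if (it == handlers_.end()) {
            return {"error", "Unknown command: " + cmd};
        }
        return it->second(requestId);
    }

private:
    std::unordered_map<std::string, Handler> handlers_;
};
```

A `ping` handler registered with `on("ping", ...)` would return `{"success", "pong"}`, which the caller then wraps in the response format with the matching `request_id`.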
### Example: WiFi Scan Command
```
Console publishes to {prefix}/ACV-AABBCCDDEEFF/cmd:
{
  "cmd": "wifi_scan",
  "request_id": "scan-001"
}

Device publishes to {prefix}/ACV-AABBCCDDEEFF/resp:
{
  "request_id": "scan-001",
  "status": "success",
  "data": {
    "networks": [
      {"ssid": "Network1", "rssi": -40, "enc": "WPA2", "ch": 6},
      {"ssid": "Network2", "rssi": -65, "enc": "WPA2", "ch": 11}
    ]
  }
}
```
## 7.6 Sleep Mode Implementation
### Deep Sleep Flow
When sleep mode is enabled:
```
1. Wake from deep sleep (or initial boot)
2. Initialize hardware
3. Connect to transport (WiFi or GSM)
4. Connect to MQTT
5. Read Acuvim II data
6. Publish telemetry
7. Publish heartbeat
8. Drain SD card queue (if any)
9. Check for OTA updates
10. Process any pending MQTT commands (wait 5 seconds)
11. Disconnect MQTT (clean disconnect)
12. Enter deep sleep for configured duration
13. Repeat from step 1
```
### Implementation
```cpp
void enterDeepSleep(uint16_t minutes) {
    mqtt.disconnect();
    wifi.disconnect();
    gsm.powerOff();
    uint64_t sleepMicros = (uint64_t)minutes * 60ULL * 1000000ULL;
    esp_sleep_enable_timer_wakeup(sleepMicros);
    esp_deep_sleep_start();
}
```
### Sleep Considerations
- NVS persists across deep sleep (config retained)
- RTC memory can store small amounts of data (boot count, last state)
- WiFi reconnect takes ~2-5 seconds after deep sleep wake
- GSM reconnect takes ~10-30 seconds after deep sleep wake
- Factor connection time into wake duration calculation
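A rough way to budget the wake window is to add the connection time, the work time, and the command wait from the flow above. The helpers below are hypothetical and the figures in the usage note are only examples taken from the ranges quoted in this section.

```cpp
#include <cstdint>

// Sleep duration in microseconds, computed in 64 bits so that long
// sleeps do not overflow a 32-bit intermediate.
uint64_t sleepDurationMicros(uint16_t minutes) {
    return static_cast<uint64_t>(minutes) * 60ULL * 1000000ULL;
}

// Estimated awake time per cycle, in seconds.
uint32_t wakeBudgetSec(uint32_t connectSec, uint32_t workSec,
                       uint32_t commandWindowSec) {
    return connectSec + workSec + commandWindowSec;
}

// Percentage of each wake/sleep cycle spent awake.
double dutyCyclePercent(uint32_t wakeSec, uint16_t sleepMin) {
    double sleepSec = static_cast<double>(sleepMin) * 60.0;
    double total = static_cast<double>(wakeSec) + sleepSec;
    return total > 0.0 ? 100.0 * wakeSec / total : 0.0;
}
```

Taking the worst-case GSM reconnect of 30 seconds, an assumed 10 seconds of reading and publishing, and the 5 second command wait gives a 45 second wake window; with a 15 minute sleep that is a duty cycle of a little under 5%.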
## 7.7 Boot Count and Crash Detection
### Boot Count
```cpp
#include <Preferences.h>

// Boot counter persisted in NVS; call once per boot
void incrementBootCount() {
    Preferences prefs;
    prefs.begin("acuvim_sys", false);  // false = read/write
    uint32_t count = prefs.getUInt("boot_count", 0) + 1;
    prefs.putUInt("boot_count", count);
    prefs.end();
}
```
### Crash Loop Detection
If the device crashes repeatedly (e.g., bad config causes panic):
```cpp
void checkCrashLoop() {
    String reason = getResetReasonString();
    // Treat panics and all watchdog resets as crashes
    bool crashed = (reason == "PANIC" || reason == "TASK_WDT" ||
                    reason == "INT_WDT" || reason == "WDT");

    Preferences prefs;
    prefs.begin("acuvim_sys", false);
    if (crashed) {
        uint8_t crashes = prefs.getUChar("crash_count", 0) + 1;
        if (crashes >= 5) {
            // 5 consecutive crashes: factory reset as a last resort
            Serial.println("Crash loop detected, performing factory reset");
            configManager.reset();
            crashes = 0;
        }
        prefs.putUChar("crash_count", crashes);
    } else if (prefs.getUChar("crash_count", 0) != 0) {
        // Clean boot: reset the crash counter (skip the flash write if already 0)
        prefs.putUChar("crash_count", 0);
    }
    prefs.end();
}
```
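One limitation of counting only at boot is that the counter is cleared solely by a clean boot, so isolated crashes separated by weeks of normal operation could still add up to five. A possible refinement, not part of the spec, is to also clear the counter after a period of stable uptime. The sketch below uses hypothetical names (`CrashState`, `onBoot`, `onUptime`) and an assumed 10 minute stability window, and keeps the counter in memory where the firmware would use NVS.

```cpp
#include <cstdint>

struct CrashState {
    uint8_t crashCount = 0;  // would be persisted in NVS
};

enum class BootAction { Continue, FactoryReset };

constexpr uint8_t kMaxCrashes = 5;
constexpr uint32_t kStableUptimeSec = 600;  // assumption: 10 minutes counts as stable

// Call once at boot with whether the last reset was a crash.
BootAction onBoot(CrashState& s, bool lastResetWasCrash) {
    if (!lastResetWasCrash) {
        s.crashCount = 0;
        return BootAction::Continue;
    }
    if (++s.crashCount >= kMaxCrashes) {
        s.crashCount = 0;
        return BootAction::FactoryReset;
    }
    return BootAction::Continue;
}

// Call periodically from the main loop. Clears the counter once the
// device has run long enough that an earlier crash is unlikely to be
// part of a boot loop.
void onUptime(CrashState& s, uint32_t uptimeSec) {
    if (s.crashCount != 0 && uptimeSec >= kStableUptimeSec) {
        s.crashCount = 0;
    }
}
```

With this change, five crashes trigger a factory reset only if none of the intervening runs lasted at least the stability window.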
## 7.8 Testing & Validation
| Test | Method | Pass Criteria |
|------|--------|---------------|
| Heartbeat WiFi | Monitor MQTT | Heartbeat received at interval |
| Heartbeat GSM | Disable WiFi | Heartbeat via GSM |
| Heartbeat content | Parse JSON | All health fields present |
| Missing heartbeat | Kill device | Console detects missing heartbeat |
| Registration | Factory reset, reboot | Registration published, ACK received |
| WiFi scan command | Send via MQTT | Networks returned in response |
| Config command | Send get_config | Full config returned |
| Restart command | Send restart | Device reboots, comes back online |
| Alert: overvoltage | Simulate high voltage | Alert published once |
| Alert: dedup | Sustained overvoltage | Alert not repeated |
| Alert: clear | Voltage returns to normal | No alert, ready to re-trigger |
| Deep sleep | Enable sleep, monitor | Device wakes, publishes, sleeps |
| Watchdog | Create infinite loop (test build) | WDT resets device |
| Crash loop | Inject panic (test build) | Factory reset after 5 crashes |
| Memory health | Run 48 hours | min_free_heap stable |
| Boot count | Reboot multiple times | Count increments correctly |
## 7.9 Phase 7 Completion Criteria
- [ ] Heartbeat published at configurable interval
- [ ] Heartbeat contains all device health metrics
- [ ] Heartbeat works over both WiFi and GSM
- [ ] Device self-registers on first boot
- [ ] All MQTT commands implemented and tested
- [ ] Command/response protocol working bidirectionally
- [ ] Acuvim II health alerts generated on threshold violations
- [ ] Alert deduplication working (no repeated alerts)
- [ ] Deep sleep mode working with correct wake/sleep cycle
- [ ] Watchdog timer prevents hangs
- [ ] Crash loop detection triggers factory reset
- [ ] Boot count tracked and reported
- [ ] Reset reason tracked and reported
---
**Previous Phase:** [Phase 6 — OTA Firmware Updates](acuvim-spec-06.md)
**Next Phase:** [Phase 8 — Console Application Backend](acuvim-spec-08.md)