# Phase 7: Heartbeat, Health Management & Device Registration

## Objective

Implement the heartbeat system for device health monitoring, comprehensive health management for both the ESP32 and the Acuvim II, device self-registration with the console application, and the MQTT command/response protocol for remote device management.

## Prerequisites

- Phase 6 complete (OTA working)
- MQTT broker running
- Console application (Phase 8-9) or `mosquitto_sub` for testing

## Deliverables

1. Periodic heartbeat publishing (WiFi and/or GSM)
2. ESP32 health metrics collection
3. Acuvim II health monitoring and diagnostics
4. Device self-registration via MQTT
5. Complete MQTT command/response protocol
6. Watchdog timer for crash recovery
7. Diagnostic logging

---

## 7.1 Heartbeat System

### `heartbeat_manager.h` / `heartbeat_manager.cpp`

The heartbeat is a periodic MQTT publish containing device health metrics. The console uses the absence of heartbeats to detect offline devices.

```cpp
class HeartbeatManager {
public:
    void begin(ConfigManager& config, TransportManager& transport,
               MqttClient& mqtt, AcuvimReader& acuvim);
    void loop();      // Called in main loop
    void sendNow();   // Force immediate heartbeat
    String buildHeartbeatPayload();

private:
    unsigned long lastHeartbeat;
    uint32_t heartbeatCount;
    uint32_t bootCount;   // Persisted in NVS
};
```

### Heartbeat Payload

Published to `{prefix}/{device_id}/heartbeat`:

```json
{
  "ts": 1716000000,
  "dev": "ACV-AABBCCDDEEFF",
  "fw": "1.0.0",
  "up": 3600,
  "boot": 12,
  "hb": 60,
  "conn": {
    "type": "wifi",
    "wifi": { "ssid": "MyNetwork", "rssi": -45, "ip": "192.168.1.100" },
    "gsm": { "enabled": true, "connected": false, "signal": 0, "operator": "" },
    "mqtt": true
  },
  "health": {
    "heap_free": 120000,
    "heap_min": 95000,
    "psram_free": 3800000,
    "cpu_temp": 52.3,
    "reset_reason": "SW_RESET",
    "uptime_sec": 3600
  },
  "modbus": {
    "connected": true,
    "success": 1234,
    "errors": 5,
    "error_rate": 0.4,
    "last_error": 0,
    "last_read_ms": 45
  },
  "sd": {
    "available": true,
    "queued": 0,
    "free_mb": 3800
  },
  "ota": {
    "version": "1.0.0",
    "partition": "app0",
    "update_available": false
  }
}
```

### Heartbeat Fields Explained

| Field | Description |
|-------|-------------|
| `ts` | UTC epoch timestamp |
| `dev` | Device ID |
| `fw` | Firmware version |
| `up` | Uptime in seconds |
| `boot` | Boot count (persisted, increments each restart) |
| `hb` | Heartbeat sequence number since boot |
| `conn.type` | Active transport (`wifi` or `gsm`) |
| `health.heap_free` | Current free heap memory (bytes) |
| `health.heap_min` | Minimum free heap since boot (leak detection) |
| `health.psram_free` | Free PSRAM (bytes) |
| `health.cpu_temp` | ESP32 internal temperature sensor (if available) |
| `health.reset_reason` | Why the ESP32 last reset (power on, watchdog, crash, etc.) |
| `modbus.error_rate` | Error percentage (errors / total * 100) |
| `modbus.last_read_ms` | Duration of last successful Modbus read cycle |

### Heartbeat Behavior

- Interval: configurable, default 60 seconds
- Published via active transport (WiFi or GSM)
- Sent immediately on boot (after initial connection)
- Sent immediately after transport switch
- QoS 1 (acknowledged delivery)
- If publish fails: retry on next cycle (do not accumulate)
- Console marks device as "degraded" after 3 missed heartbeats and "offline" after 5

## 7.2 ESP32 Health Monitoring

### Metrics Collection

```cpp
struct DeviceHealth {
    uint32_t freeHeap;
    uint32_t minFreeHeap;
    uint32_t freePsram;
    float cpuTemp;
    String resetReason;
    uint32_t uptimeSec;
    uint32_t bootCount;
    uint8_t wifiReconnects;
    uint8_t gsmReconnects;
    uint8_t mqttReconnects;
};
```

### Reset Reason Tracking

```cpp
String getResetReasonString() {
    esp_reset_reason_t reason = esp_reset_reason();
    switch (reason) {
        case ESP_RST_POWERON:   return "POWER_ON";
        case ESP_RST_EXT:       return "EXTERNAL";
        case ESP_RST_SW:        return "SW_RESET";
        case ESP_RST_PANIC:     return "PANIC";
        case ESP_RST_INT_WDT:   return "INT_WDT";
        case ESP_RST_TASK_WDT:  return "TASK_WDT";
        case ESP_RST_WDT:       return "WDT";
        case ESP_RST_DEEPSLEEP: return "DEEP_SLEEP";
        case ESP_RST_BROWNOUT:  return "BROWNOUT";
        default:                return "UNKNOWN";
    }
}
```

### Memory Leak Detection

- Track `esp_get_minimum_free_heap_size()`: a steady decrease across heartbeats indicates a leak
- Log a warning if the minimum free heap drops below 30 KB
- Include in heartbeat for console-side trend analysis

### Watchdog Timer

Enable the task watchdog to recover from hangs:

```cpp
#include <esp_task_wdt.h>

void setup() {
    // ESP-IDF 4.x signature; ESP-IDF 5.x takes an esp_task_wdt_config_t* instead
    esp_task_wdt_init(30, true);   // 30 second timeout, panic on timeout
    esp_task_wdt_add(NULL);        // Watch the main task
}

void loop() {
    esp_task_wdt_reset();          // Feed the watchdog
    // ... main loop work ...
}
```

If the main loop hangs for more than 30 seconds, the watchdog resets the ESP32. The reset reason will be `TASK_WDT`, visible in the next heartbeat.

## 7.3 Acuvim II Health Monitoring

### Communication Health

Track and report Modbus communication quality:

```cpp
struct ModbusHealth {
    bool connected;              // Last read succeeded
    uint32_t totalReads;         // Total read attempts
    uint32_t successReads;       // Successful reads
    uint32_t failedReads;        // Failed reads
    float errorRate;             // Percentage (failed/total * 100)
    uint8_t lastError;           // Last Modbus error code
    uint32_t lastReadDuration;   // ms for last complete read cycle
    uint32_t consecutiveErrors;  // Errors in a row (0 = last read OK)
    uint32_t lastSuccessTs;      // Timestamp of last successful read
};
```

### Acuvim II Value Health

Monitor for abnormal readings that may indicate meter or installation issues:

```cpp
struct AcuvimHealth {
    bool overvoltage;         // Any phase > threshold (e.g., 260V)
    bool undervoltage;        // Any phase < threshold (e.g., 200V)
    bool overcurrent;         // Any phase > rated current
    bool voltageImbalance;    // Phase voltage difference > 10%
    bool currentImbalance;    // Phase current difference > 20%
    bool frequencyDeviation;  // Frequency outside 49.5-50.5 Hz
    bool lowPowerFactor;      // PF < 0.85
    bool highTHD;             // THD > 8%
};
```

### Health Alerts

When health thresholds are exceeded, publish an
alert to `{prefix}/{device_id}/alerts`:

```json
{
  "ts": 1716000000,
  "dev": "ACV-AABBCCDDEEFF",
  "alert": "overvoltage",
  "severity": "warning",
  "message": "Phase A voltage 265.3V exceeds 260V threshold",
  "value": 265.3,
  "threshold": 260.0,
  "phase": "A"
}
```

Alert types and default thresholds (configurable via console):

| Alert | Default Threshold | Severity |
|-------|-------------------|----------|
| Overvoltage | >260V | warning |
| Undervoltage | <200V | warning |
| Overcurrent | >(CT rated * 1.2) | warning |
| Voltage imbalance | >10% difference | info |
| Current imbalance | >20% difference | info |
| Frequency deviation | <49.5 or >50.5 Hz | warning |
| Low power factor | <0.85 | info |
| High THD | >8% | info |
| Modbus communication loss | 5 consecutive errors | critical |
| Modbus high error rate | >5% over 1 hour | warning |

### Alert Deduplication

- The same alert is not re-published until the condition clears and reoccurs
- Use hysteresis: an upper-bound alert triggers at the threshold and clears at threshold - margin (e.g., overvoltage triggers at 260V, clears at 255V); lower-bound alerts such as undervoltage mirror this and clear at threshold + margin

## 7.4 Device Self-Registration

### Registration Flow

On first boot (or after factory reset), the device registers itself with the console:

```
1. Device boots, generates device_id from MAC address
2. Connects to WiFi or GSM
3. Connects to MQTT
4. Publishes to "devices/register":
   {
     "device_id": "ACV-AABBCCDDEEFF",
     "mac": "AA:BB:CC:DD:EE:FF",
     "firmware": "1.0.0",
     "hardware": "TTGO T-Call v1.4",
     "chip": "ESP32-WROVER-B",
     "imei": "123456789012345",
     "capabilities": ["wifi", "gsm", "sd", "modbus", "ota"],
     "ts": 1716000000
   }
5. Console receives registration, creates/updates device record
6. Console publishes to "{prefix}/{device_id}/cmd":
   {
     "cmd": "register_ack",
     "request_id": "reg-001",
     "status": "accepted",
     "device_name": "Inverter-Building-A"
   }
7. Device saves registration acknowledgment
```

### Registration Triggers

- First boot (no registration ACK in NVS)
- Factory reset (NVS cleared)
- Firmware update (re-register with new version)
- Every 24 hours (re-register to confirm presence)

### Capabilities Array

Dynamically detected at boot:

```cpp
JsonArray caps = doc.createNestedArray("capabilities");
caps.add("wifi");                                  // Always present
if (gsmManager.isModemReady()) caps.add("gsm");
if (sdManager.isAvailable()) caps.add("sd");
caps.add("modbus");                                // Always present
caps.add("ota");                                   // Always present
```

## 7.5 MQTT Command/Response Protocol

### Command Format (Console -> Device)

Published to `{prefix}/{device_id}/cmd`:

```json
{
  "cmd": "",
  "request_id": "",
  "params": { ... }
}
```

### Response Format (Device -> Console)

Published to `{prefix}/{device_id}/resp`:

```json
{
  "request_id": "",
  "status": "success|error",
  "data": { ... },
  "message": "Human-readable message"
}
```

### Complete Command List

| Command | Params | Description |
|---------|--------|-------------|
| `wifi_scan` | none | Scan and return available WiFi networks |
| `wifi_set` | `{ssid, password}` | Set WiFi credentials |
| `wifi_disconnect` | none | Disconnect WiFi |
| `mqtt_set` | `{broker, port, username, password, topic_prefix, tls}` | Update MQTT config |
| `gsm_set` | `{apn, username, password, enabled}` | Update GSM config |
| `transport_priority` | `{priority}` | Set transport priority |
| `sleep_set` | `{enabled, sleep_min, wake_sec}` | Update sleep config |
| `modbus_set` | `{slave_addr, baud_rate, poll_interval}` | Update Modbus config |
| `console_set` | `{url}` | Set console URL |
| `ota_update` | `{url, version, checksum}` | Trigger OTA update |
| `ota_check` | none | Check for available update |
| `get_config` | none | Return full device config |
| `get_status` | none | Return device status |
| `get_telemetry` | none | Return latest Acuvim readings |
| `restart` | none | Restart device |
| `factory_reset` | `{confirm: true}` | Factory reset (requires confirm) |
| `set_heartbeat` | `{interval_sec}` | Change heartbeat interval |
| `set_alerts` | `{thresholds: {...}}` | Update alert thresholds |
| `ping` | none | Connectivity test, device responds with `pong` |
| `register_ack` | `{status, device_name}` | Acknowledge registration |

### Command Handler

```cpp
class CommandHandler {
public:
    void begin(ConfigManager& config, WifiManager& wifi, GsmManager& gsm,
               TransportManager& transport, MqttClient& mqtt,
               AcuvimReader& acuvim, OtaManager& ota, SdManager& sd);
    void handleCommand(const String& topic, const String& payload);

private:
    void cmdWifiScan(const String& requestId);
    void cmdWifiSet(const String& requestId, JsonObject& params);
    void cmdMqttSet(const String& requestId, JsonObject& params);
    void cmdGsmSet(const String& requestId, JsonObject& params);
    void cmdSleepSet(const String& requestId, JsonObject& params);
    void cmdModbusSet(const String& requestId, JsonObject& params);
    void cmdOtaUpdate(const String& requestId, JsonObject& params);
    void cmdGetConfig(const String& requestId);
    void cmdGetStatus(const String& requestId);
    void cmdGetTelemetry(const String& requestId);
    void cmdRestart(const String& requestId);
    void cmdFactoryReset(const String& requestId, JsonObject& params);
    void cmdPing(const String& requestId);
    // ... etc
};
```

### Example: WiFi Scan Command

```
Console publishes to {prefix}/ACV-AABBCCDDEEFF/cmd:
{
  "cmd": "wifi_scan",
  "request_id": "scan-001"
}

Device publishes to {prefix}/ACV-AABBCCDDEEFF/resp:
{
  "request_id": "scan-001",
  "status": "success",
  "data": {
    "networks": [
      {"ssid": "Network1", "rssi": -40, "enc": "WPA2", "ch": 6},
      {"ssid": "Network2", "rssi": -65, "enc": "WPA2", "ch": 11}
    ]
  }
}
```

## 7.6 Sleep Mode Implementation

### Deep Sleep Flow

When sleep mode is enabled:

```
1. Wake from deep sleep (or initial boot)
2. Initialize hardware
3. Connect to transport (WiFi or GSM)
4. Connect to MQTT
5. Read Acuvim II data
6. Publish telemetry
7. Publish heartbeat
8. Drain SD card queue (if any)
9. Check for OTA updates
10. Process any pending MQTT commands (wait 5 seconds)
11. Disconnect MQTT (clean disconnect)
12. Enter deep sleep for configured duration
13. Repeat from step 1
```

### Implementation

```cpp
void enterDeepSleep(uint16_t minutes) {
    // Shut down connections cleanly before sleeping
    mqtt.disconnect();
    wifi.disconnect();
    gsm.powerOff();

    // Timer wakeup takes microseconds; widen first to avoid 32-bit overflow
    uint64_t sleepMicros = (uint64_t)minutes * 60 * 1000000;
    esp_sleep_enable_timer_wakeup(sleepMicros);
    esp_deep_sleep_start();
}
```

### Sleep Considerations

- NVS persists across deep sleep (config retained)
- RTC memory can store small amounts of data (boot count, last state)
- WiFi reconnect takes ~2-5 seconds after deep sleep wake
- GSM reconnect takes ~10-30 seconds after deep sleep wake
- Factor connection time into wake duration calculation

## 7.7 Boot Count and Crash Detection

### Boot Count

```cpp
// In NVS
void incrementBootCount() {
    Preferences prefs;
    prefs.begin("acuvim_sys", false);
    uint32_t count = prefs.getUInt("boot_count", 0) + 1;
    prefs.putUInt("boot_count", count);
    prefs.end();
}
```

### Crash Loop Detection

If the device crashes repeatedly (e.g., bad config causes panic):

```cpp
void checkCrashLoop() {
    String reason = getResetReasonString();
    if (reason == "PANIC" || reason == "TASK_WDT" || reason == "INT_WDT") {
        Preferences prefs;
        prefs.begin("acuvim_sys", false);
        uint8_t crashes = prefs.getUChar("crash_count", 0) + 1;
        prefs.putUChar("crash_count", crashes);
        if (crashes >= 5) {
            // 5 consecutive crashes: factory reset as last resort
            Serial.println("Crash loop detected, performing factory reset");
            configManager.reset();
            prefs.putUChar("crash_count", 0);
        }
        prefs.end();
    } else {
        // Clean boot, reset crash counter
        Preferences prefs;
        prefs.begin("acuvim_sys", false);
        prefs.putUChar("crash_count", 0);
        prefs.end();
    }
}
```

## 7.8 Testing & Validation

| Test | Method | Pass Criteria |
|------|--------|---------------|
| Heartbeat WiFi | Monitor MQTT | Heartbeat received at interval |
| Heartbeat GSM | Disable WiFi | Heartbeat via GSM |
| Heartbeat content | Parse JSON | All health fields present |
| Missing heartbeat | Kill device | Console detects missing heartbeat |
| Registration | Factory reset, reboot | Registration published, ACK received |
| WiFi scan command | Send via MQTT | Networks returned in response |
| Config command | Send `get_config` | Full config returned |
| Restart command | Send `restart` | Device reboots, comes back online |
| Alert: overvoltage | Simulate high voltage | Alert published once |
| Alert: dedup | Sustained overvoltage | Alert not repeated |
| Alert: clear | Voltage returns to normal | No alert, ready to re-trigger |
| Deep sleep | Enable sleep, monitor | Device wakes, publishes, sleeps |
| Watchdog | Create infinite loop (test build) | WDT resets device |
| Crash loop | Inject panic (test build) | Factory reset after 5 crashes |
| Memory health | Run 48 hours | `health.heap_min` stable |
| Boot count | Reboot multiple times | Count increments correctly |

## 7.9 Phase 7 Completion Criteria

- [ ] Heartbeat published at configurable interval
- [ ] Heartbeat contains all device health metrics
- [ ] Heartbeat works over both WiFi and GSM
- [ ] Device self-registers on first boot
- [ ] All MQTT commands implemented and tested
- [ ] Command/response protocol working bidirectionally
- [ ] Acuvim II health alerts generated on threshold violations
- [ ] Alert deduplication working (no repeated alerts)
- [ ] Deep sleep mode working with correct wake/sleep cycle
- [ ] Watchdog timer prevents hangs
- [ ] Crash loop detection triggers factory reset
- [ ] Boot count tracked and reported
- [ ] Reset reason tracked and reported

---

**Previous Phase:** [Phase 6: OTA Firmware Updates](acuvim-spec-06.md)
**Next Phase:** [Phase 8: Console Application Backend](acuvim-spec-08.md)
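
### Example: Alert Hysteresis Sketch

The alert deduplication rule in section 7.3 (trigger at the threshold, clear at threshold minus a margin, and publish only once per excursion) can be sketched as a small state machine. This is an illustrative sketch, not part of the specified firmware: the class name `HysteresisAlert` and its methods are hypothetical, and it covers only upper-bound alerts such as overvoltage.

```cpp
// Hypothetical helper (not defined by this spec): tracks one upper-bound
// alert with hysteresis so that it fires once per excursion.
class HysteresisAlert {
public:
    HysteresisAlert(float triggerAt, float clearAt)
        : triggerAt_(triggerAt), clearAt_(clearAt), active_(false) {}

    // Feed one reading. Returns true only on the reading that raises the
    // alert; further readings while the alert is active return false.
    bool update(float value) {
        if (!active_ && value > triggerAt_) {
            active_ = true;
            return true;        // caller publishes the alert now
        }
        if (active_ && value < clearAt_) {
            active_ = false;    // condition cleared, alert is re-armed
        }
        return false;
    }

    bool isActive() const { return active_; }

private:
    float triggerAt_;   // e.g. 260.0 V for overvoltage
    float clearAt_;     // e.g. 255.0 V (threshold minus margin)
    bool active_;       // true while the alert condition is latched
};
```

In use, one instance per alert and phase would be updated on every poll, and a message would be published to `{prefix}/{device_id}/alerts` only when `update()` returns true. A reading of 257V after an overvoltage trigger at 260V neither re-publishes nor clears the alert, which is the behavior the "Alert: dedup" test expects.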