# Phase 7: Heartbeat, Health Management & Device Registration

## Objective

Implement the heartbeat system for device health monitoring, comprehensive health management for both the ESP32 and the Acuvim II, device self-registration with the console application, and the MQTT command/response protocol for remote device management.

## Prerequisites

- Phase 6 complete (OTA working)
- MQTT broker running
- Console application (Phase 8-9) or `mosquitto_sub` for testing

## Deliverables

1. Periodic heartbeat publishing (WiFi and/or GSM)
2. ESP32 health metrics collection
3. Acuvim II health monitoring and diagnostics
4. Device self-registration via MQTT
5. Complete MQTT command/response protocol
6. Watchdog timer for crash recovery
7. Diagnostic logging

---

## 7.1 Heartbeat System

### `heartbeat_manager.h` / `heartbeat_manager.cpp`

The heartbeat is a periodic MQTT publish containing device health metrics. The console uses the absence of heartbeats to detect offline devices.

```cpp
class HeartbeatManager {
public:
    void begin(ConfigManager& config, TransportManager& transport,
               MqttClient& mqtt, AcuvimReader& acuvim);
    void loop();      // Called in main loop
    void sendNow();   // Force immediate heartbeat

    String buildHeartbeatPayload();

private:
    unsigned long lastHeartbeat;
    uint32_t heartbeatCount;
    uint32_t bootCount;  // Persisted in NVS
};
```

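
One detail `loop()` has to get right is the interval check itself. The following is a minimal sketch, assuming the timestamps come from a 32-bit millisecond counter such as Arduino's `millis()`. `heartbeatDue` is a hypothetical helper, not part of the class above.

```cpp
#include <cstdint>

// Hypothetical helper: true when at least intervalMs has elapsed since lastMs.
// Unsigned subtraction wraps modulo 2^32, so the comparison stays correct when
// a 32-bit millisecond counter rolls over (roughly every 49.7 days).
bool heartbeatDue(uint32_t nowMs, uint32_t lastMs, uint32_t intervalMs) {
    return static_cast<uint32_t>(nowMs - lastMs) >= intervalMs;
}
```

Comparing `now - last` against the interval, rather than `now` against `last + interval`, is what keeps the check rollover-safe.
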
### Heartbeat Payload

Published to `{prefix}/{device_id}/heartbeat`:

```json
{
  "ts": 1716000000,
  "dev": "ACV-AABBCCDDEEFF",
  "fw": "1.0.0",
  "up": 3600,
  "boot": 12,
  "hb": 60,
  "conn": {
    "type": "wifi",
    "wifi": {
      "ssid": "MyNetwork",
      "rssi": -45,
      "ip": "192.168.1.100"
    },
    "gsm": {
      "enabled": true,
      "connected": false,
      "signal": 0,
      "operator": ""
    },
    "mqtt": true
  },
  "health": {
    "heap_free": 120000,
    "heap_min": 95000,
    "psram_free": 3800000,
    "cpu_temp": 52.3,
    "reset_reason": "SW_RESET",
    "uptime_sec": 3600
  },
  "modbus": {
    "connected": true,
    "success": 1234,
    "errors": 5,
    "error_rate": 0.4,
    "last_error": 0,
    "last_read_ms": 45
  },
  "sd": {
    "available": true,
    "queued": 0,
    "free_mb": 3800
  },
  "ota": {
    "version": "1.0.0",
    "partition": "app0",
    "update_available": false
  }
}
```

### Heartbeat Fields Explained

| Field | Description |
|-------|-------------|
| `ts` | UTC epoch timestamp |
| `dev` | Device ID |
| `fw` | Firmware version |
| `up` | Uptime in seconds |
| `boot` | Boot count (persisted, increments each restart) |
| `hb` | Heartbeat sequence number since boot |
| `conn.type` | Active transport (`wifi` or `gsm`) |
| `health.heap_free` | Current free heap memory (bytes) |
| `health.heap_min` | Minimum free heap since boot (leak detection) |
| `health.psram_free` | Free PSRAM (bytes) |
| `health.cpu_temp` | ESP32 internal temperature sensor (if available) |
| `health.reset_reason` | Why the ESP32 last reset (power on, watchdog, crash, etc.) |
| `modbus.error_rate` | Error percentage (errors / total * 100) |
| `modbus.last_read_ms` | Duration of last successful Modbus read cycle |

### Heartbeat Behavior

- Interval: configurable, default 60 seconds
- Published via active transport (WiFi or GSM)
- Sent immediately on boot (after initial connection)
- Sent immediately after transport switch
- QoS 1 (acknowledged delivery)
- If publish fails: retry on next cycle (do not accumulate)
- Console marks device as "degraded" after 3 missed heartbeats and "offline" after 5

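
The degraded/offline rule in the last bullet can be sketched as a pure function. This is only an illustration: the console is a separate application, `DeviceStatus` and `classifyDevice` are hypothetical names, and the sketch assumes every full interval without a heartbeat counts as one missed heartbeat, with no grace period.

```cpp
#include <cstdint>

enum class DeviceStatus { Online, Degraded, Offline };

// Hypothetical helper: counts whole heartbeat intervals elapsed since the last
// heartbeat was received and maps the count to a status (3 = degraded, 5 = offline).
DeviceStatus classifyDevice(uint32_t nowSec, uint32_t lastHeartbeatSec,
                            uint32_t intervalSec) {
    if (intervalSec == 0 || nowSec <= lastHeartbeatSec) {
        return DeviceStatus::Online;  // Nothing can have been missed yet
    }
    const uint32_t missed = (nowSec - lastHeartbeatSec) / intervalSec;
    if (missed >= 5) return DeviceStatus::Offline;
    if (missed >= 3) return DeviceStatus::Degraded;
    return DeviceStatus::Online;
}
```

With the default 60 second interval, a device under this rule turns degraded 180 seconds after its last heartbeat and offline after 300 seconds.
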
## 7.2 ESP32 Health Monitoring

### Metrics Collection

```cpp
struct DeviceHealth {
    uint32_t freeHeap;
    uint32_t minFreeHeap;
    uint32_t freePsram;
    float cpuTemp;
    String resetReason;
    uint32_t uptimeSec;
    uint32_t bootCount;
    uint8_t wifiReconnects;
    uint8_t gsmReconnects;
    uint8_t mqttReconnects;
};
```

### Reset Reason Tracking

```cpp
#include <esp_system.h>  // esp_reset_reason()

// Maps the ESP-IDF reset reason to the string reported in the heartbeat.
String getResetReasonString() {
    esp_reset_reason_t reason = esp_reset_reason();
    switch (reason) {
        case ESP_RST_POWERON:   return "POWER_ON";
        case ESP_RST_EXT:       return "EXTERNAL";
        case ESP_RST_SW:        return "SW_RESET";
        case ESP_RST_PANIC:     return "PANIC";
        case ESP_RST_INT_WDT:   return "INT_WDT";
        case ESP_RST_TASK_WDT:  return "TASK_WDT";
        case ESP_RST_WDT:       return "WDT";
        case ESP_RST_DEEPSLEEP: return "DEEP_SLEEP";
        case ESP_RST_BROWNOUT:  return "BROWNOUT";
        default:                return "UNKNOWN";
    }
}
```

### Memory Leak Detection

- Track `esp_get_minimum_free_heap_size()`, the lowest free-heap value seen since boot; if it keeps falling as uptime grows, suspect a leak
- Log a warning if the minimum free heap drops below 30 KB
- Include it in the heartbeat (`health.heap_min`) for console-side trend analysis

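
A minimal sketch of both checks, written as plain functions so they can run off-device. The names are hypothetical, "30 KB" is assumed to mean 30 * 1024 bytes, and the trend rule (count how often the low-water mark fell between samples) is just one simple heuristic, not a method prescribed by this spec.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: the 30 KB warning threshold is 30 * 1024 bytes.
constexpr uint32_t kLowHeapWarnBytes = 30u * 1024u;

// True when the low-water mark is under the warning threshold.
bool heapLow(uint32_t minFreeHeap) {
    return minFreeHeap < kLowHeapWarnBytes;
}

// Hypothetical trend check over successive heap_min samples (oldest first).
// The low-water mark never rises, so each step is either "unchanged" or
// "dropped"; a leak is suspected when it dropped in at least minDrops steps.
bool leakSuspected(const std::vector<uint32_t>& heapMinSamples, std::size_t minDrops) {
    std::size_t drops = 0;
    for (std::size_t i = 1; i < heapMinSamples.size(); ++i) {
        if (heapMinSamples[i] < heapMinSamples[i - 1]) {
            ++drops;
        }
    }
    return drops >= minDrops;
}
```

On the device, `heapLow` would be fed from `esp_get_minimum_free_heap_size()`; the trend check fits better on the console, which sees `heap_min` across many heartbeats.
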
### Watchdog Timer

Enable the task watchdog to recover from hangs:

```cpp
#include <esp_task_wdt.h>

void setup() {
    // ESP-IDF 4.x signature; ESP-IDF 5.x takes a const esp_task_wdt_config_t* instead
    esp_task_wdt_init(30, true);  // 30 second timeout, panic on timeout
    esp_task_wdt_add(NULL);       // Watch the main task
}

void loop() {
    esp_task_wdt_reset();  // Feed the watchdog
    // ... main loop work ...
}
```

If the main loop hangs for more than 30 seconds, the watchdog resets the ESP32. The reset reason will be `TASK_WDT`, visible in the next heartbeat.

## 7.3 Acuvim II Health Monitoring

### Communication Health

Track and report Modbus communication quality:

```cpp
struct ModbusHealth {
    bool connected;              // Last read succeeded
    uint32_t totalReads;         // Total read attempts
    uint32_t successReads;       // Successful reads
    uint32_t failedReads;        // Failed reads
    float errorRate;             // Percentage (failed/total * 100)
    uint8_t lastError;           // Last Modbus error code
    uint32_t lastReadDuration;   // ms for last complete read cycle
    uint32_t consecutiveErrors;  // Errors in a row (0 = last read OK)
    uint32_t lastSuccessTs;      // Timestamp of last successful read
};
```

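
How the counters might be maintained is sketched below. `ModbusStats`, `recordRead`, `errorRatePercent`, and `communicationLost` are hypothetical names covering only the counting fields of the struct above; the threshold of 5 consecutive errors is the default for the communication-loss alert listed under Health Alerts.

```cpp
#include <cstdint>

// Hypothetical subset of ModbusHealth: just the counters.
struct ModbusStats {
    uint32_t successReads = 0;
    uint32_t failedReads = 0;
    uint32_t consecutiveErrors = 0;
};

// Update the counters after each read attempt.
void recordRead(ModbusStats& s, bool ok) {
    if (ok) {
        ++s.successReads;
        s.consecutiveErrors = 0;  // A good read ends the error streak
    } else {
        ++s.failedReads;
        ++s.consecutiveErrors;
    }
}

// failed / total * 100, or 0 when nothing has been read yet.
float errorRatePercent(const ModbusStats& s) {
    const uint32_t total = s.successReads + s.failedReads;
    if (total == 0) return 0.0f;
    return 100.0f * static_cast<float>(s.failedReads) / static_cast<float>(total);
}

// Default communication-loss condition: 5 errors in a row.
bool communicationLost(const ModbusStats& s) {
    return s.consecutiveErrors >= 5;
}
```

With the counts from the sample heartbeat (1234 successes, 5 errors) the rate is 5 / 1239 * 100, about 0.40%, which matches the `error_rate` of 0.4 shown in the payload.
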
### Acuvim II Value Health

Monitor for abnormal readings that may indicate meter or installation issues:

```cpp
struct AcuvimHealth {
    bool overvoltage;         // Any phase > threshold (e.g., 260V)
    bool undervoltage;        // Any phase < threshold (e.g., 200V)
    bool overcurrent;         // Any phase > threshold (default: CT rated * 1.2)
    bool voltageImbalance;    // Phase voltage difference > 10%
    bool currentImbalance;    // Phase current difference > 20%
    bool frequencyDeviation;  // Frequency outside 49.5-50.5 Hz
    bool lowPowerFactor;      // PF < 0.85
    bool highTHD;             // THD > 8%
};
```

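
As an illustration, the three voltage flags could be derived from one set of phase readings as follows. The spec does not define how "phase voltage difference" is measured, so this sketch assumes (max - min) / average * 100; `VoltageFlags` and `checkVoltages` are hypothetical names, and the thresholds are the defaults quoted above.

```cpp
#include <algorithm>
#include <array>

// Hypothetical subset of AcuvimHealth covering the voltage checks only.
struct VoltageFlags {
    bool overvoltage;
    bool undervoltage;
    bool voltageImbalance;
};

// Evaluates three phase-to-neutral voltages (in volts) against the defaults.
VoltageFlags checkVoltages(const std::array<float, 3>& v) {
    const float hi = *std::max_element(v.begin(), v.end());
    const float lo = *std::min_element(v.begin(), v.end());
    const float avg = (v[0] + v[1] + v[2]) / 3.0f;

    VoltageFlags f{};
    f.overvoltage = hi > 260.0f;
    f.undervoltage = lo < 200.0f;
    // Assumed definition: spread between phases relative to their average.
    f.voltageImbalance = avg > 0.0f && ((hi - lo) / avg) * 100.0f > 10.0f;
    return f;
}
```

A single out-of-range phase can raise two flags at once: readings of 265.3 V, 230 V, and 230 V are both an overvoltage and, at roughly 14.6% spread, an imbalance.
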
### Health Alerts

When health thresholds are exceeded, publish an alert to `{prefix}/{device_id}/alerts`:

```json
{
  "ts": 1716000000,
  "dev": "ACV-AABBCCDDEEFF",
  "alert": "overvoltage",
  "severity": "warning",
  "message": "Phase A voltage 265.3V exceeds 260V threshold",
  "value": 265.3,
  "threshold": 260.0,
  "phase": "A"
}
```

Alert types and default thresholds (configurable via console):

| Alert | Default Threshold | Severity |
|-------|-------------------|----------|
| Overvoltage | >260V | warning |
| Undervoltage | <200V | warning |
| Overcurrent | >(CT rated * 1.2) | warning |
| Voltage imbalance | >10% difference | info |
| Current imbalance | >20% difference | info |
| Frequency deviation | <49.5 or >50.5 Hz | warning |
| Low power factor | <0.85 | info |
| High THD | >8% | info |
| Modbus communication loss | 5 consecutive errors | critical |
| Modbus high error rate | >5% over 1 hour | warning |

### Alert Deduplication

- Same alert is not re-published until the condition clears and reoccurs
- Use hysteresis: alert triggers at threshold, clears at threshold - margin (e.g., overvoltage triggers at 260V, clears at 255V)

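
Both rules fit in a small latch. The sketch below handles an "above threshold" alert such as overvoltage; `HysteresisAlert` is a hypothetical class, and it assumes the alert clears once the value is at or below the clear level.

```cpp
// Hypothetical latch for an "above threshold" alert with hysteresis.
class HysteresisAlert {
public:
    HysteresisAlert(float trigger, float clear)
        : trigger_(trigger), clear_(clear) {}

    // Feed one reading. Returns true only on the reading that raises the
    // alert, so the caller publishes once per excursion.
    bool update(float value) {
        if (!active_ && value > trigger_) {
            active_ = true;
            return true;
        }
        if (active_ && value <= clear_) {
            active_ = false;  // Condition cleared; the alert may fire again later
        }
        return false;
    }

    bool active() const { return active_; }

private:
    float trigger_;
    float clear_;
    bool active_ = false;
};
```

With `HysteresisAlert overvoltage(260.0f, 255.0f)`, a reading of 265.3 V publishes, later readings of 262 V or 257 V stay silent, and only after the voltage falls to 255 V or lower can a new excursion above 260 V publish again. An "under threshold" alert such as undervoltage would mirror the comparisons.
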
## 7.4 Device Self-Registration

### Registration Flow

On first boot (or after factory reset), the device registers itself with the console:

```
1. Device boots, generates device_id from MAC address
2. Connects to WiFi or GSM
3. Connects to MQTT
4. Publishes to "devices/register":
   {
     "device_id": "ACV-AABBCCDDEEFF",
     "mac": "AA:BB:CC:DD:EE:FF",
     "firmware": "1.0.0",
     "hardware": "TTGO T-Call v1.4",
     "chip": "ESP32-WROVER-B",
     "imei": "123456789012345",
     "capabilities": ["wifi", "gsm", "sd", "modbus", "ota"],
     "ts": 1716000000
   }
5. Console receives registration, creates/updates device record
6. Console publishes to "{prefix}/{device_id}/cmd":
   {
     "cmd": "register_ack",
     "request_id": "reg-001",
     "params": {
       "status": "accepted",
       "device_name": "Inverter-Building-A"
     }
   }
7. Device saves registration acknowledgment
```

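
Step 1 can be sketched as below, assuming the ID is simply the prefix `ACV-` followed by the six MAC bytes in uppercase hex, as the sample payload suggests. `deviceIdFromMac` is a hypothetical helper.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: builds "ACV-" + 12 uppercase hex digits from a 6-byte MAC.
std::string deviceIdFromMac(const uint8_t mac[6]) {
    std::string id = "ACV-";
    char hex[3];  // Two hex digits plus the terminating NUL
    for (int i = 0; i < 6; ++i) {
        std::snprintf(hex, sizeof(hex), "%02X", static_cast<unsigned>(mac[i]));
        id += hex;
    }
    return id;
}
```

On the ESP32 the bytes would typically come from `esp_efuse_mac_get_default()`, so the ID stays the same across reboots and factory resets.
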
### Registration Triggers

- First boot (no registration ACK in NVS)
- Factory reset (NVS cleared)
- Firmware update (re-register with new version)
- Every 24 hours (re-register to confirm presence)

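
The four triggers reduce to a single predicate. This is a sketch under assumptions: `RegistrationState` and `needsRegistration` are hypothetical, and the state is assumed to be persisted in NVS alongside the ACK, so a factory reset shows up as a missing ACK.

```cpp
#include <cstdint>
#include <string>

// Hypothetical registration state, assumed to be stored in NVS.
struct RegistrationState {
    bool ackStored;              // Registration ACK present
    std::string registeredFw;    // Firmware version at last registration
    uint32_t lastRegisteredSec;  // Epoch seconds of last registration
};

bool needsRegistration(const RegistrationState& s, const std::string& currentFw,
                       uint32_t nowSec) {
    if (!s.ackStored) return true;                  // First boot or factory reset
    if (s.registeredFw != currentFw) return true;   // Firmware was updated
    if (nowSec < s.lastRegisteredSec) return true;  // Clock moved backwards; re-register
    return nowSec - s.lastRegisteredSec >= 24u * 60u * 60u;  // Daily refresh
}
```
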
### Capabilities Array

Dynamically detected at boot:

```cpp
JsonArray caps = doc.createNestedArray("capabilities");
caps.add("wifi");                                 // Always present
if (gsmManager.isModemReady()) caps.add("gsm");
if (sdManager.isAvailable()) caps.add("sd");
caps.add("modbus");                               // Always present
caps.add("ota");                                  // Always present
```

## 7.5 MQTT Command/Response Protocol

### Command Format (Console -> Device)

Published to `{prefix}/{device_id}/cmd`:

```json
{
  "cmd": "<command_name>",
  "request_id": "<unique_id>",
  "params": { ... }
}
```

### Response Format (Device -> Console)

Published to `{prefix}/{device_id}/resp`:

```json
{
  "request_id": "<matching_id>",
  "status": "success|error",
  "data": { ... },
  "message": "Human-readable message"
}
```

### Complete Command List

| Command | Params | Description |
|---------|--------|-------------|
| `wifi_scan` | none | Scan and return available WiFi networks |
| `wifi_set` | `{ssid, password}` | Set WiFi credentials |
| `wifi_disconnect` | none | Disconnect WiFi |
| `mqtt_set` | `{broker, port, username, password, topic_prefix, tls}` | Update MQTT config |
| `gsm_set` | `{apn, username, password, enabled}` | Update GSM config |
| `transport_priority` | `{priority}` | Set transport priority |
| `sleep_set` | `{enabled, sleep_min, wake_sec}` | Update sleep config |
| `modbus_set` | `{slave_addr, baud_rate, poll_interval}` | Update Modbus config |
| `console_set` | `{url}` | Set console URL |
| `ota_update` | `{url, version, checksum}` | Trigger OTA update |
| `ota_check` | none | Check for available update |
| `get_config` | none | Return full device config |
| `get_status` | none | Return device status |
| `get_telemetry` | none | Return latest Acuvim readings |
| `restart` | none | Restart device |
| `factory_reset` | `{confirm: true}` | Factory reset (requires confirm) |
| `set_heartbeat` | `{interval_sec}` | Change heartbeat interval |
| `set_alerts` | `{thresholds: {...}}` | Update alert thresholds |
| `ping` | none | Connectivity test, device responds with `pong` |
| `register_ack` | `{status, device_name}` | Acknowledge registration |

### Command Handler

```cpp
class CommandHandler {
public:
    void begin(ConfigManager& config, WifiManager& wifi,
               GsmManager& gsm, TransportManager& transport,
               MqttClient& mqtt, AcuvimReader& acuvim,
               OtaManager& ota, SdManager& sd);

    // Parses the JSON payload and routes it to the matching cmd* method
    void handleCommand(const String& topic, const String& payload);

private:
    // JsonObject is a lightweight handle, so it is passed by value; this also
    // lets callers pass doc["params"].as<JsonObject>() directly
    void cmdWifiScan(const String& requestId);
    void cmdWifiSet(const String& requestId, JsonObject params);
    void cmdMqttSet(const String& requestId, JsonObject params);
    void cmdGsmSet(const String& requestId, JsonObject params);
    void cmdSleepSet(const String& requestId, JsonObject params);
    void cmdModbusSet(const String& requestId, JsonObject params);
    void cmdOtaUpdate(const String& requestId, JsonObject params);
    void cmdGetConfig(const String& requestId);
    void cmdGetStatus(const String& requestId);
    void cmdGetTelemetry(const String& requestId);
    void cmdRestart(const String& requestId);
    void cmdFactoryReset(const String& requestId, JsonObject params);
    void cmdPing(const String& requestId);
    // ... etc
};
```

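
One way `handleCommand` could route a parsed command is a name-to-handler table. The sketch below uses only the standard library so it can run off-device; `CommandRouter`, `Response`, and the error reply for unknown commands are assumptions for illustration, not part of the class above.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical in-memory form of the response envelope.
struct Response {
    std::string requestId;
    std::string status;   // "success" or "error"
    std::string message;
};

using Handler = std::function<Response(const std::string& requestId)>;

// Hypothetical router: maps the "cmd" field to a handler.
class CommandRouter {
public:
    void on(const std::string& cmd, Handler handler) {
        handlers_[cmd] = std::move(handler);
    }

    // Looks up the handler; unknown commands get an error response that still
    // echoes the request_id so the console can correlate it.
    Response dispatch(const std::string& cmd, const std::string& requestId) const {
        auto it = handlers_.find(cmd);
        if (it == handlers_.end()) {
            return Response{requestId, "error", "Unknown command: " + cmd};
        }
        return it->second(requestId);
    }

private:
    std::unordered_map<std::string, Handler> handlers_;
};
```

Registering `ping` then looks like `router.on("ping", [](const std::string& id) { return Response{id, "success", "pong"}; });`. Commands that take `params` would need a handler signature that also carries the parsed object.
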
### Example: WiFi Scan Command

```
Console publishes to {prefix}/ACV-AABBCCDDEEFF/cmd:
{
  "cmd": "wifi_scan",
  "request_id": "scan-001"
}

Device publishes to {prefix}/ACV-AABBCCDDEEFF/resp:
{
  "request_id": "scan-001",
  "status": "success",
  "data": {
    "networks": [
      {"ssid": "Network1", "rssi": -40, "enc": "WPA2", "ch": 6},
      {"ssid": "Network2", "rssi": -65, "enc": "WPA2", "ch": 11}
    ]
  }
}
```

## 7.6 Sleep Mode Implementation

### Deep Sleep Flow

When sleep mode is enabled:

```
1. Wake from deep sleep (or initial boot)
2. Initialize hardware
3. Connect to transport (WiFi or GSM)
4. Connect to MQTT
5. Read Acuvim II data
6. Publish telemetry
7. Publish heartbeat
8. Drain SD card queue (if any)
9. Check for OTA updates
10. Process any pending MQTT commands (wait 5 seconds)
11. Disconnect MQTT (clean disconnect)
12. Enter deep sleep for configured duration
13. Repeat from step 1
```

### Implementation

```cpp
#include <esp_sleep.h>

void enterDeepSleep(uint16_t minutes) {
    // Shut down connections cleanly before powering down
    mqtt.disconnect();
    wifi.disconnect();
    gsm.powerOff();

    // Widen to 64 bits before multiplying so long intervals do not overflow
    uint64_t sleepMicros = (uint64_t)minutes * 60ULL * 1000000ULL;
    esp_sleep_enable_timer_wakeup(sleepMicros);
    esp_deep_sleep_start();  // Does not return; the device reboots on wake
}
```

### Sleep Considerations

- NVS persists across deep sleep (config retained)
- RTC memory can store small amounts of data (boot count, last state)
- WiFi reconnect takes ~2-5 seconds after deep sleep wake
- GSM reconnect takes ~10-30 seconds after deep sleep wake
- Factor connection time into wake duration calculation

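
The arithmetic behind the last bullet can be sketched as follows. Both helpers are hypothetical; the connect times are the upper bounds from the list above and the 5 second command window comes from step 10 of the deep sleep flow, while `workSec` (read, publish, drain) is a figure the integrator would have to measure.

```cpp
#include <cstdint>

// Sleep duration in microseconds, widened to 64 bits before multiplying.
uint64_t sleepMicros(uint16_t minutes) {
    return static_cast<uint64_t>(minutes) * 60ULL * 1000000ULL;
}

// Hypothetical worst-case estimate of seconds spent awake per cycle.
uint32_t wakeBudgetSec(bool usingGsm, uint32_t workSec) {
    const uint32_t connectSec = usingGsm ? 30u : 5u;  // Upper bounds quoted above
    const uint32_t commandWindowSec = 5u;             // Wait for pending commands
    return connectSec + workSec + commandWindowSec;
}
```

A two hour sleep is 7,200,000,000 microseconds, which does not fit in 32 bits, hence the 64-bit arithmetic. A configured `wake_sec` shorter than the estimate (for example under 45 seconds on GSM with 10 seconds of work) would cut the cycle off before it finishes.
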
## 7.7 Boot Count and Crash Detection

### Boot Count

```cpp
#include <Preferences.h>

// Boot count is kept in NVS so it survives restarts and deep sleep
void incrementBootCount() {
    Preferences prefs;
    prefs.begin("acuvim_sys", false);  // false = read/write
    uint32_t count = prefs.getUInt("boot_count", 0) + 1;
    prefs.putUInt("boot_count", count);
    prefs.end();
}
```

### Crash Loop Detection

If the device crashes repeatedly (e.g., bad config causes panic):

```cpp
void checkCrashLoop() {
    String reason = getResetReasonString();
    bool crashed = (reason == "PANIC" || reason == "TASK_WDT" || reason == "INT_WDT");

    Preferences prefs;
    prefs.begin("acuvim_sys", false);

    if (crashed) {
        uint8_t crashes = prefs.getUChar("crash_count", 0) + 1;
        prefs.putUChar("crash_count", crashes);

        if (crashes >= 5) {
            // 5 consecutive crashes: factory reset as last resort
            Serial.println("Crash loop detected, performing factory reset");
            configManager.reset();
            prefs.putUChar("crash_count", 0);
        }
    } else if (prefs.getUChar("crash_count", 0) != 0) {
        // Clean boot: reset the crash counter (no write needed if already zero)
        prefs.putUChar("crash_count", 0);
    }

    prefs.end();
}
```

## 7.8 Testing & Validation

| Test | Method | Pass Criteria |
|------|--------|---------------|
| Heartbeat WiFi | Monitor MQTT | Heartbeat received at interval |
| Heartbeat GSM | Disable WiFi | Heartbeat via GSM |
| Heartbeat content | Parse JSON | All health fields present |
| Missing heartbeat | Kill device | Console detects missing heartbeat |
| Registration | Factory reset, reboot | Registration published, ACK received |
| WiFi scan command | Send via MQTT | Networks returned in response |
| Config command | Send get_config | Full config returned |
| Restart command | Send restart | Device reboots, comes back online |
| Alert: overvoltage | Simulate high voltage | Alert published once |
| Alert: dedup | Sustained overvoltage | Alert not repeated |
| Alert: clear | Voltage returns to normal | No alert, ready to re-trigger |
| Deep sleep | Enable sleep, monitor | Device wakes, publishes, sleeps |
| Watchdog | Create infinite loop (test build) | WDT resets device |
| Crash loop | Inject panic (test build) | Factory reset after 5 crashes |
| Memory health | Run 48 hours | min_free_heap stable |
| Boot count | Reboot multiple times | Count increments correctly |

## 7.9 Phase 7 Completion Criteria

- [ ] Heartbeat published at configurable interval
- [ ] Heartbeat contains all device health metrics
- [ ] Heartbeat works over both WiFi and GSM
- [ ] Device self-registers on first boot
- [ ] All MQTT commands implemented and tested
- [ ] Command/response protocol working bidirectionally
- [ ] Acuvim II health alerts generated on threshold violations
- [ ] Alert deduplication working (no repeated alerts)
- [ ] Deep sleep mode working with correct wake/sleep cycle
- [ ] Watchdog timer prevents hangs
- [ ] Crash loop detection triggers factory reset
- [ ] Boot count tracked and reported
- [ ] Reset reason tracked and reported

---

**Previous Phase:** [Phase 6: OTA Firmware Updates](acuvim-spec-06.md)
**Next Phase:** [Phase 8: Console Application Backend](acuvim-spec-08.md)