Tau.Acuvim/docs/acuvim-spec-07.md
Renier Forster 84a0668c54 Initial commit: Tau Acuvim IoT monitoring system
2026-05-16 19:05:32 +02:00


Phase 7: Heartbeat, Health Management & Device Registration

Objective

Implement the heartbeat system for device health monitoring, comprehensive health management for both the ESP32 and the Acuvim II, device self-registration with the console application, and the MQTT command/response protocol for remote device management.

Prerequisites

  • Phase 6 complete (OTA working)
  • MQTT broker running
  • Console application (Phase 8-9) or mosquitto_sub for testing

Deliverables

  1. Periodic heartbeat publishing (WiFi and/or GSM)
  2. ESP32 health metrics collection
  3. Acuvim II health monitoring and diagnostics
  4. Device self-registration via MQTT
  5. Complete MQTT command/response protocol
  6. Watchdog timer for crash recovery
  7. Diagnostic logging

7.1 Heartbeat System

heartbeat_manager.h / heartbeat_manager.cpp

The heartbeat is a periodic MQTT publish containing device health metrics. The console uses the absence of heartbeats to detect offline devices.

class HeartbeatManager {
public:
    void begin(ConfigManager& config, TransportManager& transport,
               MqttClient& mqtt, AcuvimReader& acuvim);
    void loop();                          // Called in main loop
    void sendNow();                       // Force immediate heartbeat

    String buildHeartbeatPayload();

private:
    unsigned long lastHeartbeat;
    uint32_t heartbeatCount;
    uint32_t bootCount;                   // Persisted in NVS
};
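
A minimal sketch of how loop() might decide that a heartbeat is due, assuming 32-bit millisecond timestamps such as those returned by Arduino millis(), which wrap after roughly 49.7 days. HeartbeatTimer and its members are hypothetical names used for illustration and are not part of the class above.

```cpp
#include <cstdint>

// Hypothetical helper: decides when the next heartbeat is due.
// Timestamps are 32-bit milliseconds, as returned by Arduino millis().
struct HeartbeatTimer {
    uint32_t intervalMs;
    uint32_t lastSentMs = 0;
    bool     sentOnce   = false;

    explicit HeartbeatTimer(uint32_t intervalSec) : intervalMs(intervalSec * 1000u) {}

    // Unsigned subtraction stays correct when the millisecond counter wraps.
    bool isDue(uint32_t nowMs) const {
        return !sentOnce || static_cast<uint32_t>(nowMs - lastSentMs) >= intervalMs;
    }

    // Call after a heartbeat has been handed to the MQTT client.
    void markSent(uint32_t nowMs) {
        lastSentMs = nowMs;
        sentOnce = true;
    }
};
```

Comparing elapsed time rather than an absolute deadline avoids a missed or early heartbeat at the moment the counter wraps.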

Heartbeat Payload

Published to {prefix}/{device_id}/heartbeat:

{
  "ts": 1716000000,
  "dev": "ACV-AABBCCDDEEFF",
  "fw": "1.0.0",
  "up": 3600,
  "boot": 12,
  "hb": 60,
  "conn": {
    "type": "wifi",
    "wifi": {
      "ssid": "MyNetwork",
      "rssi": -45,
      "ip": "192.168.1.100"
    },
    "gsm": {
      "enabled": true,
      "connected": false,
      "signal": 0,
      "operator": ""
    },
    "mqtt": true
  },
  "health": {
    "heap_free": 120000,
    "heap_min": 95000,
    "psram_free": 3800000,
    "cpu_temp": 52.3,
    "reset_reason": "SW_RESET",
    "uptime_sec": 3600
  },
  "modbus": {
    "connected": true,
    "success": 1234,
    "errors": 5,
    "error_rate": 0.4,
    "last_error": 0,
    "last_read_ms": 45
  },
  "sd": {
    "available": true,
    "queued": 0,
    "free_mb": 3800
  },
  "ota": {
    "version": "1.0.0",
    "partition": "app0",
    "update_available": false
  }
}

Heartbeat Fields Explained

| Field | Description |
|---|---|
| ts | UTC epoch timestamp |
| dev | Device ID |
| fw | Firmware version |
| up | Uptime in seconds |
| boot | Boot count (persisted, increments each restart) |
| hb | Heartbeat sequence number since boot |
| conn.type | Active transport (wifi or gsm) |
| health.heap_free | Current free heap memory (bytes) |
| health.heap_min | Minimum free heap since boot (leak detection) |
| health.psram_free | Free PSRAM (bytes) |
| health.cpu_temp | ESP32 internal temperature sensor (if available) |
| health.reset_reason | Why the ESP32 last reset (power on, watchdog, crash, etc.) |
| modbus.error_rate | Error percentage (errors / total * 100) |
| modbus.last_read_ms | Duration of last successful Modbus read cycle |

Heartbeat Behavior

  • Interval: configurable, default 60 seconds
  • Published via active transport (WiFi or GSM)
  • Sent immediately on boot (after initial connection)
  • Sent immediately after transport switch
  • QoS 1 (acknowledged delivery)
  • If a publish fails, drop that heartbeat and send a fresh one on the next cycle (failed heartbeats are not queued)
  • Console marks device as "degraded" after 3 missed heartbeats and "offline" after 5
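
The degraded/offline rule can be illustrated with a short sketch. It is written in C++ to match the firmware examples, although the console itself is a separate application. The function name, the enum, and the choice to count whole elapsed intervals with no grace period are assumptions.

```cpp
#include <cstdint>

enum class DeviceStatus { Online, Degraded, Offline };

// Hypothetical console-side rule: count the whole heartbeat intervals that
// have elapsed since the last heartbeat was received.
DeviceStatus classifyDevice(uint32_t secondsSinceLastHeartbeat, uint32_t intervalSec) {
    if (intervalSec == 0) return DeviceStatus::Offline;  // invalid interval, treated as offline (assumption)
    uint32_t missed = secondsSinceLastHeartbeat / intervalSec;
    if (missed >= 5) return DeviceStatus::Offline;
    if (missed >= 3) return DeviceStatus::Degraded;
    return DeviceStatus::Online;
}
```

With the default 60 second interval, a device would be shown as degraded after 180 seconds of silence and offline after 300 seconds.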

7.2 ESP32 Health Monitoring

Metrics Collection

struct DeviceHealth {
    uint32_t freeHeap;
    uint32_t minFreeHeap;
    uint32_t freePsram;
    float cpuTemp;
    String resetReason;
    uint32_t uptimeSec;
    uint32_t bootCount;
    uint8_t wifiReconnects;
    uint8_t gsmReconnects;
    uint8_t mqttReconnects;
};

Reset Reason Tracking

String getResetReasonString() {
    esp_reset_reason_t reason = esp_reset_reason();
    switch (reason) {
        case ESP_RST_POWERON:  return "POWER_ON";
        case ESP_RST_EXT:      return "EXTERNAL";
        case ESP_RST_SW:       return "SW_RESET";
        case ESP_RST_PANIC:    return "PANIC";
        case ESP_RST_INT_WDT:  return "INT_WDT";
        case ESP_RST_TASK_WDT: return "TASK_WDT";
        case ESP_RST_WDT:      return "WDT";
        case ESP_RST_DEEPSLEEP: return "DEEP_SLEEP";
        case ESP_RST_BROWNOUT: return "BROWNOUT";
        default:               return "UNKNOWN";
    }
}

Memory Leak Detection

  • Track esp_get_minimum_free_heap_size(); a value that keeps decreasing over time indicates a leak
  • Log a warning if the minimum free heap drops below 30 KB
  • Include in heartbeat for console-side trend analysis
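
A possible shape for these checks, written as plain C++ so it can be tested off-device. The function names, the sampling window, and the reading of 30 KB as 30 * 1024 bytes are assumptions. On the device, the samples would come from esp_get_minimum_free_heap_size().

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kLowHeapWarningBytes = 30u * 1024u;  // 30 KB warning threshold

// True when the minimum free heap is below the warning threshold.
bool isHeapLow(uint32_t minFreeHeap) {
    return minFreeHeap < kLowHeapWarningBytes;
}

// Hypothetical trend check: flags a possible leak when the minimum free heap
// dropped in each of the last `window` consecutive sample-to-sample steps.
bool looksLikeLeak(const std::vector<uint32_t>& minHeapSamples, std::size_t window) {
    if (window == 0 || minHeapSamples.size() < window + 1) return false;
    std::size_t start = minHeapSamples.size() - window - 1;
    for (std::size_t i = start; i + 1 < minHeapSamples.size(); ++i) {
        if (minHeapSamples[i + 1] >= minHeapSamples[i]) return false;  // no drop in this step
    }
    return true;
}
```

Because the minimum free heap never increases within one boot, a plateau is the healthy case and a continuous decline is the signal worth reporting.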

Watchdog Timer

Enable the task watchdog to recover from hangs:

#include <esp_task_wdt.h>

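void setup() {
    // Two-argument form as used by ESP-IDF 4.x / Arduino-ESP32 2.x;
    // ESP-IDF 5.x takes a pointer to esp_task_wdt_config_t instead.
    esp_task_wdt_init(30, true);  // 30 second timeout, panic on timeout
    esp_task_wdt_add(NULL);       // Watch the current (main loop) task
}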

void loop() {
    esp_task_wdt_reset();  // Feed the watchdog
    // ... main loop work ...
}

If the main loop hangs for more than 30 seconds, the watchdog resets the ESP32. The reset reason is then reported as TASK_WDT in the next heartbeat.

7.3 Acuvim II Health Monitoring

Communication Health

Track and report Modbus communication quality:

struct ModbusHealth {
    bool connected;              // Last read succeeded
    uint32_t totalReads;         // Total read attempts
    uint32_t successReads;       // Successful reads
    uint32_t failedReads;        // Failed reads
    float errorRate;             // Percentage (failed/total * 100)
    uint8_t lastError;           // Last Modbus error code
    uint32_t lastReadDuration;   // ms for last complete read cycle
    uint32_t consecutiveErrors;  // Errors in a row (0 = last read OK)
    uint32_t lastSuccessTs;      // Timestamp of last successful read
};
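
A minimal sketch of how these counters might be updated after each read attempt. ModbusHealthSketch is a trimmed, hypothetical copy of the struct above so that the example compiles on its own. The 5-error limit matches the default used for the communication-loss alert.

```cpp
#include <cstdint>

// Trimmed, hypothetical copy of the ModbusHealth fields used in this sketch.
struct ModbusHealthSketch {
    bool     connected         = false;
    uint32_t totalReads        = 0;
    uint32_t successReads      = 0;
    uint32_t failedReads       = 0;
    float    errorRate         = 0.0f;  // percentage
    uint32_t consecutiveErrors = 0;
};

// Record the outcome of one read attempt and refresh the derived fields.
void recordRead(ModbusHealthSketch& h, bool ok) {
    h.totalReads++;
    if (ok) {
        h.successReads++;
        h.consecutiveErrors = 0;
    } else {
        h.failedReads++;
        h.consecutiveErrors++;
    }
    h.connected = ok;
    h.errorRate = 100.0f * h.failedReads / h.totalReads;
}

// Communication loss is declared after 5 consecutive failed reads.
bool isCommunicationLost(const ModbusHealthSketch& h) {
    return h.consecutiveErrors >= 5;
}
```

With 1234 successful reads and 5 failures, as in the sample heartbeat, the error rate works out to 5 / 1239 * 100, or about 0.4%.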

Acuvim II Value Health

Monitor for abnormal readings that may indicate meter or installation issues:

struct AcuvimHealth {
    bool overvoltage;            // Any phase > threshold (e.g., 260V)
    bool undervoltage;           // Any phase < threshold (e.g., 200V)
    bool overcurrent;            // Any phase > rated current
    bool voltageImbalance;       // Phase voltage difference > 10%
    bool currentImbalance;       // Phase current difference > 20%
    bool frequencyDeviation;     // Frequency outside 49.5-50.5 Hz
    bool lowPowerFactor;         // PF < 0.85
    bool highTHD;                // THD > 8%
};
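
The voltage-related flags could be derived from the three phase voltages roughly as follows. The spec does not define how the imbalance percentage is computed. This sketch assumes it is the spread between the highest and lowest phase relative to the average. PhaseFlags, imbalancePercent and evaluateVoltages are hypothetical names.

```cpp
#include <algorithm>
#include <array>

struct PhaseFlags {
    bool overvoltage;
    bool undervoltage;
    bool voltageImbalance;
};

// Spread between the highest and lowest phase as a percentage of the
// average (assumed definition of "phase voltage difference").
float imbalancePercent(const std::array<float, 3>& v) {
    float hi  = *std::max_element(v.begin(), v.end());
    float lo  = *std::min_element(v.begin(), v.end());
    float avg = (v[0] + v[1] + v[2]) / 3.0f;
    return avg > 0.0f ? (hi - lo) / avg * 100.0f : 0.0f;
}

// Apply the example thresholds from the struct comments (260V, 200V, 10%).
PhaseFlags evaluateVoltages(const std::array<float, 3>& v) {
    PhaseFlags f{};  // all flags start false
    for (float x : v) {
        if (x > 260.0f) f.overvoltage = true;
        if (x < 200.0f) f.undervoltage = true;
    }
    f.voltageImbalance = imbalancePercent(v) > 10.0f;
    return f;
}
```

The same pattern would extend to current, frequency, power factor, and THD once their thresholds are loaded from the device configuration.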

Health Alerts

When health thresholds are exceeded, publish an alert to {prefix}/{device_id}/alerts:

{
  "ts": 1716000000,
  "dev": "ACV-AABBCCDDEEFF",
  "alert": "overvoltage",
  "severity": "warning",
  "message": "Phase A voltage 265.3V exceeds 260V threshold",
  "value": 265.3,
  "threshold": 260.0,
  "phase": "A"
}

Alert types and default thresholds (configurable via console):

| Alert | Default Threshold | Severity |
|---|---|---|
| Overvoltage | >260V | warning |
| Undervoltage | <200V | warning |
| Overcurrent | >(CT rated * 1.2) | warning |
| Voltage imbalance | >10% difference | info |
| Current imbalance | >20% difference | info |
| Frequency deviation | <49.5 or >50.5 Hz | warning |
| Low power factor | <0.85 | info |
| High THD | >8% | info |
| Modbus communication loss | 5 consecutive errors | critical |
| Modbus high error rate | >5% over 1 hour | warning |

Alert Deduplication

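  • The same alert is not re-published until the condition clears and then recurs
  • Use hysteresis: an upper-limit alert triggers at the threshold and clears at threshold minus a margin (e.g., overvoltage triggers at 260V, clears at 255V); a lower-limit alert such as undervoltage clears at threshold plus the margin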
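
The deduplication and hysteresis rules can be sketched as a small latch. HighAlertLatch is a hypothetical class for upper-limit alerts such as overvoltage. A lower-limit variant would mirror the comparisons.

```cpp
// Hypothetical latch for an upper-limit alert such as overvoltage.
class HighAlertLatch {
public:
    HighAlertLatch(float trigger, float clear) : trigger_(trigger), clear_(clear) {}

    // Returns true only on the sample where the alert should be published.
    bool update(float value) {
        if (!active_ && value > trigger_) {
            active_ = true;
            return true;
        }
        if (active_ && value < clear_) {
            active_ = false;  // condition cleared; the next excursion publishes again
        }
        return false;
    }

    bool active() const { return active_; }

private:
    float trigger_;
    float clear_;
    bool  active_ = false;
};
```

With a 260V trigger and 255V clear level, readings of 265.3V, 262V, and 257V publish a single alert; the latch only re-arms once the voltage falls below 255V.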

7.4 Device Self-Registration

Registration Flow

On first boot (or after factory reset), the device registers itself with the console:

1. Device boots, generates device_id from MAC address
2. Connects to WiFi or GSM
3. Connects to MQTT
4. Publishes to "devices/register":
   {
     "device_id": "ACV-AABBCCDDEEFF",
     "mac": "AA:BB:CC:DD:EE:FF",
     "firmware": "1.0.0",
     "hardware": "TTGO T-Call v1.4",
     "chip": "ESP32-WROVER-B",
     "imei": "123456789012345",
     "capabilities": ["wifi", "gsm", "sd", "modbus", "ota"],
     "ts": 1716000000
   }
5. Console receives registration, creates/updates device record
6. Console publishes to "{prefix}/{device_id}/cmd":
   {
     "cmd": "register_ack",
     "request_id": "reg-001",
     "status": "accepted",
     "device_name": "Inverter-Building-A"
   }
7. Device saves registration acknowledgment
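
Step 1 derives the device ID from the MAC address. A minimal sketch of that formatting, assuming the ID is the prefix "ACV-" followed by the six MAC bytes as uppercase hex, as in the example payload. deviceIdFromMac is a hypothetical helper. On the ESP32, the MAC bytes could be read with a call such as esp_efuse_mac_get_default().

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a 6-byte MAC address as "ACV-" followed by 12 uppercase hex digits.
std::string deviceIdFromMac(const uint8_t mac[6]) {
    char buf[17];  // "ACV-" (4) + 12 hex digits + terminating NUL
    std::snprintf(buf, sizeof(buf), "ACV-%02X%02X%02X%02X%02X%02X",
                  mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return std::string(buf);
}
```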

Registration Triggers

  • First boot (no registration ACK in NVS)
  • Factory reset (NVS cleared)
  • Firmware update (re-register with new version)
  • Every 24 hours (re-register to confirm presence)
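
The four triggers reduce to one decision at boot and on a periodic check. The sketch below assumes the device stores the acknowledged firmware version alongside the ACK. RegistrationState and needsRegistration are hypothetical names.

```cpp
#include <cstdint>
#include <string>

// Hypothetical snapshot of what the device remembers about its last registration.
struct RegistrationState {
    bool        hasAck;            // false on first boot or after factory reset (NVS cleared)
    std::string ackedFirmware;     // firmware version that was last acknowledged
    uint32_t    secondsSinceLast;  // time since the last registration was sent
};

constexpr uint32_t kReRegisterIntervalSec = 24u * 60u * 60u;  // 24 hours

bool needsRegistration(const RegistrationState& s, const std::string& currentFirmware) {
    if (!s.hasAck) return true;                           // first boot or factory reset
    if (s.ackedFirmware != currentFirmware) return true;  // firmware was updated
    return s.secondsSinceLast >= kReRegisterIntervalSec;  // periodic presence check
}
```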

Capabilities Array

Dynamically detected at boot:

JsonArray caps = doc.createNestedArray("capabilities");
caps.add("wifi");                           // Always present
if (gsmManager.isModemReady()) caps.add("gsm");
if (sdManager.isAvailable()) caps.add("sd");
caps.add("modbus");                         // Always present
caps.add("ota");                            // Always present

7.5 MQTT Command/Response Protocol

Command Format (Console -> Device)

Published to {prefix}/{device_id}/cmd:

{
  "cmd": "<command_name>",
  "request_id": "<unique_id>",
  "params": { ... }
}

Response Format (Device -> Console)

Published to {prefix}/{device_id}/resp:

{
  "request_id": "<matching_id>",
  "status": "success|error",
  "data": { ... },
  "message": "Human-readable message"
}

Complete Command List

| Command | Params | Description |
|---|---|---|
| wifi_scan | none | Scan and return available WiFi networks |
| wifi_set | {ssid, password} | Set WiFi credentials |
| wifi_disconnect | none | Disconnect WiFi |
| mqtt_set | {broker, port, username, password, topic_prefix, tls} | Update MQTT config |
| gsm_set | {apn, username, password, enabled} | Update GSM config |
| transport_priority | {priority} | Set transport priority |
| sleep_set | {enabled, sleep_min, wake_sec} | Update sleep config |
| modbus_set | {slave_addr, baud_rate, poll_interval} | Update Modbus config |
| console_set | {url} | Set console URL |
| ota_update | {url, version, checksum} | Trigger OTA update |
| ota_check | none | Check for available update |
| get_config | none | Return full device config |
| get_status | none | Return device status |
| get_telemetry | none | Return latest Acuvim readings |
| restart | none | Restart device |
| factory_reset | {confirm: true} | Factory reset (requires confirm) |
| set_heartbeat | {interval_sec} | Change heartbeat interval |
| set_alerts | {thresholds: {...}} | Update alert thresholds |
| ping | none | Connectivity test, device responds with pong |
| register_ack | {status, device_name} | Acknowledge registration |

Command Handler

#include <ArduinoJson.h>

class CommandHandler {
public:
    void begin(ConfigManager& config, WifiManager& wifi,
               GsmManager& gsm, TransportManager& transport,
               MqttClient& mqtt, AcuvimReader& acuvim,
               OtaManager& ota, SdManager& sd);

    // Entry point for messages received on the cmd topic
    void handleCommand(const String& topic, const String& payload);

private:
    // JsonObject is a lightweight handle into the parsed document, so it is
    // passed by value; a non-const reference cannot bind to the temporary
    // returned by as<JsonObject>().
    void cmdWifiScan(const String& requestId);
    void cmdWifiSet(const String& requestId, JsonObject params);
    void cmdMqttSet(const String& requestId, JsonObject params);
    void cmdGsmSet(const String& requestId, JsonObject params);
    void cmdSleepSet(const String& requestId, JsonObject params);
    void cmdModbusSet(const String& requestId, JsonObject params);
    void cmdOtaUpdate(const String& requestId, JsonObject params);
    void cmdGetConfig(const String& requestId);
    void cmdGetStatus(const String& requestId);
    void cmdGetTelemetry(const String& requestId);
    void cmdRestart(const String& requestId);
    void cmdFactoryReset(const String& requestId, JsonObject params);
    void cmdPing(const String& requestId);
    // ... etc
};
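
One way handleCommand could route a command name to its handler is a lookup table, which keeps unknown commands on a single error path. This is a host-side sketch using standard C++ types rather than the Arduino String and ArduinoJson types. CommandDispatcher and CommandResult are hypothetical.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

struct CommandResult {
    std::string status;   // "success" or "error"
    std::string message;  // human-readable message
};

class CommandDispatcher {
public:
    using Handler = std::function<CommandResult(const std::string& requestId)>;

    void registerHandler(const std::string& cmd, Handler h) {
        handlers_[cmd] = std::move(h);
    }

    // Looks up the handler by command name; unknown commands yield an error result.
    CommandResult dispatch(const std::string& cmd, const std::string& requestId) const {
        auto it = handlers_.find(cmd);
        if (it == handlers_.end()) {
            return {"error", "Unknown command: " + cmd};
        }
        return it->second(requestId);
    }

private:
    std::unordered_map<std::string, Handler> handlers_;
};
```

Returning an explicit error for unrecognized commands lets the console distinguish a device that is unreachable from one running firmware that predates a command.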

Example: WiFi Scan Command

Console publishes to {prefix}/ACV-AABBCCDDEEFF/cmd:
{
  "cmd": "wifi_scan",
  "request_id": "scan-001"
}

Device publishes to {prefix}/ACV-AABBCCDDEEFF/resp:
{
  "request_id": "scan-001",
  "status": "success",
  "data": {
    "networks": [
      {"ssid": "Network1", "rssi": -40, "enc": "WPA2", "ch": 6},
      {"ssid": "Network2", "rssi": -65, "enc": "WPA2", "ch": 11}
    ]
  }
}

7.6 Sleep Mode Implementation

Deep Sleep Flow

When sleep mode is enabled:

1. Wake from deep sleep (or initial boot)
2. Initialize hardware
3. Connect to transport (WiFi or GSM)
4. Connect to MQTT
5. Read Acuvim II data
6. Publish telemetry
7. Publish heartbeat
8. Drain SD card queue (if any)
9. Check for OTA updates
10. Process any pending MQTT commands (wait 5 seconds)
11. Disconnect MQTT (clean disconnect)
12. Enter deep sleep for configured duration
13. Repeat from step 1

Implementation

void enterDeepSleep(uint16_t minutes) {
    mqtt.disconnect();
    wifi.disconnect();
    gsm.powerOff();

    uint64_t sleepMicros = (uint64_t)minutes * 60 * 1000000;
    esp_sleep_enable_timer_wakeup(sleepMicros);
    esp_deep_sleep_start();
}
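
The cast to uint64_t in the conversion matters: sleep times above roughly 71 minutes no longer fit in a 32-bit microsecond count. A small standalone check, where sleepDurationMicros is a hypothetical helper mirroring the expression above:

```cpp
#include <cstdint>

// Converts minutes to microseconds using 64-bit arithmetic throughout.
constexpr uint64_t sleepDurationMicros(uint16_t minutes) {
    return static_cast<uint64_t>(minutes) * 60ULL * 1000000ULL;
}

// 72 minutes is 4,320,000,000 microseconds, which exceeds UINT32_MAX (4,294,967,295).
static_assert(sleepDurationMicros(72) > UINT32_MAX,
              "72 minutes does not fit in a 32-bit microsecond counter");
```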

Sleep Considerations

  • NVS persists across deep sleep (config retained)
  • RTC memory can store small amounts of data (boot count, last state)
  • WiFi reconnect takes ~2-5 seconds after deep sleep wake
  • GSM reconnect takes ~10-30 seconds after deep sleep wake
  • Factor connection time into wake duration calculation

7.7 Boot Count and Crash Detection

Boot Count

#include <Preferences.h>

// Boot counter persisted in NVS (namespace "acuvim_sys")
void incrementBootCount() {
    Preferences prefs;
    prefs.begin("acuvim_sys", false);
    uint32_t count = prefs.getUInt("boot_count", 0) + 1;
    prefs.putUInt("boot_count", count);
    prefs.end();
}

Crash Loop Detection

If the device crashes repeatedly (e.g., bad config causes panic):

void checkCrashLoop() {
    // Resets caused by a panic or a watchdog timeout count as crashes
    String reason = getResetReasonString();
    bool crashed = (reason == "PANIC" || reason == "TASK_WDT" || reason == "INT_WDT");

    Preferences prefs;
    prefs.begin("acuvim_sys", false);

    // A clean boot resets the counter; a crash increments it
    uint8_t crashes = crashed ? prefs.getUChar("crash_count", 0) + 1 : 0;

    if (crashes >= 5) {
        // 5 consecutive crashes: factory reset as a last resort
        Serial.println("Crash loop detected, performing factory reset");
        configManager.reset();
        crashes = 0;
    }

    prefs.putUChar("crash_count", crashes);
    prefs.end();
}
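
The counting policy can be separated from NVS access so it can be unit-tested on a host. The sketch below is an assumed refactoring, not the firmware's actual structure. CrashDecision, isCrashReason and nextCrashState are hypothetical names.

```cpp
#include <cstdint>
#include <string>

struct CrashDecision {
    uint8_t crashCount;    // value to persist for the next boot
    bool    factoryReset;  // true when the crash-loop limit was reached
};

constexpr uint8_t kCrashLoopLimit = 5;

// Reset reasons that count as a crash, matching the check above.
bool isCrashReason(const std::string& reason) {
    return reason == "PANIC" || reason == "TASK_WDT" || reason == "INT_WDT";
}

// Pure decision function: given the reset reason and the stored counter,
// return the new counter and whether a factory reset is required.
CrashDecision nextCrashState(const std::string& resetReason, uint8_t storedCount) {
    if (!isCrashReason(resetReason)) return {0, false};  // clean boot clears the counter
    uint8_t count = static_cast<uint8_t>(storedCount + 1);
    if (count >= kCrashLoopLimit) return {0, true};      // limit reached, start over after reset
    return {count, false};
}
```

Four consecutive panics only increment the counter; the fifth requests a factory reset and clears it, while any clean boot in between starts the count again from zero.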

7.8 Testing & Validation

| Test | Method | Pass Criteria |
|---|---|---|
| Heartbeat WiFi | Monitor MQTT | Heartbeat received at interval |
| Heartbeat GSM | Disable WiFi | Heartbeat via GSM |
| Heartbeat content | Parse JSON | All health fields present |
| Missing heartbeat | Kill device | Console detects missing heartbeat |
| Registration | Factory reset, reboot | Registration published, ACK received |
| WiFi scan command | Send via MQTT | Networks returned in response |
| Config command | Send get_config | Full config returned |
| Restart command | Send restart | Device reboots, comes back online |
| Alert: overvoltage | Simulate high voltage | Alert published once |
| Alert: dedup | Sustained overvoltage | Alert not repeated |
| Alert: clear | Voltage returns to normal | No alert, ready to re-trigger |
| Deep sleep | Enable sleep, monitor | Device wakes, publishes, sleeps |
| Watchdog | Create infinite loop (test build) | WDT resets device |
| Crash loop | Inject panic (test build) | Factory reset after 5 crashes |
| Memory health | Run 48 hours | min_free_heap stable |
| Boot count | Reboot multiple times | Count increments correctly |

7.9 Phase 7 Completion Criteria

  • Heartbeat published at configurable interval
  • Heartbeat contains all device health metrics
  • Heartbeat works over both WiFi and GSM
  • Device self-registers on first boot
  • All MQTT commands implemented and tested
  • Command/response protocol working bidirectionally
  • Acuvim II health alerts generated on threshold violations
  • Alert deduplication working (no repeated alerts)
  • Deep sleep mode working with correct wake/sleep cycle
  • Watchdog timer prevents hangs
  • Crash loop detection triggers factory reset
  • Boot count tracked and reported
  • Reset reason tracked and reported

Previous Phase: Phase 6 - OTA Firmware Updates
Next Phase: Phase 8 - Console Application Backend