Observability Pipeline
SecureLink collects three types of telemetry from edge devices: metrics (time-series measurements), logs (structured event records), and flows (network traffic records). This page describes how each data channel works and how online/offline detection is implemented.
Three Data Channels
┌──────────────────────────────────────┐
│             Edge Device              │
│                                      │
│  ┌──────────┐   ┌─────────┐          │
│  │   VPP    │   │ Docker  │          │
│  │ Dataplane│   │  Logs   │          │
│  └────┬─────┘   └────┬────┘          │
│       │              │               │
│  ┌────▼─────┐    ┌───▼────┐          │
│  │Flowprobe │    │Promtail│          │
│  │  Plugin  │    │        │          │
│  └────┬─────┘    └───┬────┘          │
│       │              │               │
│  ┌────▼───────┐      │               │
│  │IPFIX       │      │               │
│  │Collector   │      │               │
│  │(aggregate) │      │               │
│  └────┬───────┘      │               │
│       │              │               │
│  Agent│Metrics       │               │
└───────┼──────────────┼───────────────┘
        │              │
   MQTT │              │ Promtail push
        │              │
┌───────▼───────┐ ┌────▼────────┐
│message-service│ │    Loki     │
│      -go      │ │             │
└───────┬───────┘ └─────────────┘
        │
┌───────▼───────────┐
│  VictoriaMetrics  │
│ (Time-Series DB)  │
└───────┬───────────┘
        │
┌───────▼───────────┐
│      Grafana      │
│   (Dashboards)    │
└───────────────────┘
Metrics
Metrics are numerical measurements collected at regular intervals — CPU usage, memory consumption, interface packet counters, uptime, and IPFIX flow aggregates.
Collection: The Edge Agent collects system and VPP metrics and exposes them as Prometheus-format metrics.
Transport: Metrics are pushed via MQTT to the message-service-go bridge, which forwards them to VictoriaMetrics using the Prometheus Remote Write protocol.
Storage: VictoriaMetrics stores time-series data with 1-year retention. It is PromQL-compatible, meaning the same query language used with Prometheus works here.
Query: The VSM API queries VictoriaMetrics via PromQL to populate dashboard charts. Grafana also queries VictoriaMetrics directly for its dashboards.
Key labels: Every metric carries two identifying labels:
- edge_id: The device's serial number (always reliable)
- customer_id: The tenant ID (set during bootstrap from the deployment configuration)
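As a rough illustration of how these two labels scope a query, the sketch below builds a PromQL selector and a URL for the Prometheus-compatible instant-query endpoint (`/api/v1/query`). The metric name `edge_cpu_usage_percent`, the host, and the ID values are hypothetical placeholders, not names taken from SecureLink.

```python
from urllib.parse import urlencode


def escape_label_value(value: str) -> str:
    """Escape backslashes and double quotes for a PromQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_selector(metric: str, edge_id: str, customer_id: str) -> str:
    """Return a PromQL selector scoped to one device and one tenant."""
    return (
        f'{metric}{{edge_id="{escape_label_value(edge_id)}",'
        f'customer_id="{escape_label_value(customer_id)}"}}'
    )


def build_query_url(base_url: str, promql: str) -> str:
    """Return the URL for a Prometheus-compatible instant query."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Hypothetical metric name, IDs, and host, for illustration only.
selector = build_selector("edge_cpu_usage_percent", "SN-0001", "tenant-42")
url = build_query_url("http://victoriametrics:8428", selector)
print(selector)
print(url)
```

Filtering on both labels keeps a query confined to a single device within a single tenant.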
Logs
Logs are structured event records from Docker containers running on the orchestrator and edge devices.
Collection: Promtail runs as a sidecar on both the orchestrator and edge devices. It tails Docker container logs and adds metadata labels (container name, host, tenant).
Transport: Promtail pushes log entries to Loki using its native push API.
Storage: Loki stores logs with efficient compression. It indexes labels but stores log lines as chunks, making it cost-effective for high-volume logging.
Query: Logs are queried using LogQL, which supports filtering by labels and grep-like pattern matching on log content. The monitoring dashboard in the Web UI and Grafana both use LogQL.
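The label filtering and pattern matching described above can be sketched as a small LogQL query builder. The label keys `container` and `tenant` and their values are assumptions made for this example; the actual label names depend on the Promtail configuration.

```python
from typing import Dict, Optional


def quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_logql(labels: Dict[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL stream selector with an optional line-contains filter."""
    matchers = ",".join(
        f"{name}={quote(value)}" for name, value in sorted(labels.items())
    )
    query = "{" + matchers + "}"
    if contains is not None:
        # "|=" keeps only log lines that contain the given substring.
        query += " |= " + quote(contains)
    return query


# Hypothetical label keys and values, for illustration only.
query = build_logql({"container": "edge-agent", "tenant": "tenant-42"}, contains="error")
print(query)
```

The stream selector in braces is resolved against Loki's label index, while the `|=` filter scans the content of the matching log chunks.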
Flows (IPFIX)
Flows are records of network traffic — which source communicated with which destination, how many bytes were transferred, and over which protocol.
Generation: The VPP Flowprobe plugin is configured on selected interfaces to generate IPFIX (IP Flow Information Export) records for all traffic passing through.
Collection: Unlike traditional IPFIX deployments where records are sent to a central collector, SecureLink collects and aggregates flows on the edge device itself. The Edge Agent's IPFIX collector plugin listens for Flowprobe's IPFIX records on a dedicated internal interface.
Aggregation: Raw flows are aggregated by (protocol, destination port) to control metric cardinality. Without aggregation, the full 5-tuple (src IP, dst IP, src port, dst port, protocol) would create an explosion of unique time-series.
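A minimal sketch of this aggregation step is shown below. The `Flow` record layout and field names are hypothetical, and the sketch assumes each incoming record represents one flow; the Edge Agent's actual collector may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Flow:
    """One raw flow record; field names are hypothetical."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    byte_count: int
    packet_count: int


@dataclass
class Aggregate:
    byte_count: int = 0
    packet_count: int = 0
    flow_count: int = 0


def aggregate_flows(flows: Iterable[Flow]) -> Dict[Tuple[str, int], Aggregate]:
    """Collapse flows onto (protocol, destination port), dropping IPs and source port."""
    totals: Dict[Tuple[str, int], Aggregate] = defaultdict(Aggregate)
    for flow in flows:
        bucket = totals[(flow.protocol, flow.dst_port)]
        bucket.byte_count += flow.byte_count
        bucket.packet_count += flow.packet_count
        bucket.flow_count += 1  # assumes each record is one flow
    return dict(totals)


flows = [
    Flow("10.0.0.1", "93.184.216.34", 50001, 443, "tcp", 1200, 10),
    Flow("10.0.0.2", "93.184.216.34", 50002, 443, "tcp", 300, 4),
    Flow("10.0.0.1", "10.0.0.53", 40000, 53, "udp", 80, 1),
]
totals = aggregate_flows(flows)
for key, agg in sorted(totals.items()):
    print(key, agg)
```

Three distinct 5-tuples collapse into two series here, one for tcp/443 and one for udp/53, which is the cardinality reduction the aggregation is meant to achieve.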
Transport: Aggregated flow data is converted to Prometheus metrics and pushed via MQTT to the message-service-go bridge, which forwards it to VictoriaMetrics.
Key metrics:
| Metric | Description |
|---|---|
| edge_ipfix_bytes_total | Total bytes transferred per (protocol, destination port) |
| edge_ipfix_packets_total | Total packets per (protocol, destination port) |
| edge_ipfix_flows_total | Number of distinct flows per (protocol, destination port) |
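To make the shape of these series concrete, the sketch below renders one aggregate in the Prometheus text exposition format. The label names `protocol` and `dst_port` and the sample values are assumptions, and this page does not specify the serialization actually used over MQTT.

```python
from typing import Dict


def render_sample(name: str, labels: Dict[str, str], value: int) -> str:
    """Render one sample in the Prometheus text exposition format.

    Label values are assumed to need no escaping (protocol names and ports).
    """
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


# Hypothetical aggregate for one (protocol, destination port) pair.
labels = {"protocol": "tcp", "dst_port": "443"}
lines = [
    render_sample("edge_ipfix_bytes_total", labels, 1500),
    render_sample("edge_ipfix_packets_total", labels, 14),
    render_sample("edge_ipfix_flows_total", labels, 2),
]
print("\n".join(lines))
```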
Online/Offline Detection
The Web UI shows whether each device is online or offline. This determination uses a dual-check approach that combines two independent signals:
Signal 1: Telemetry Presence
The API queries VictoriaMetrics for recent CPU metrics for each edge. If metric data exists, the edge has been reporting telemetry recently.
Signal 2: Last Seen Recency
Every time an edge sends an inform message (every 60 seconds), the API updates the device's "last seen" timestamp in the database. The UI considers a device offline if this timestamp is more than 5 minutes old.
Both signals must pass for a device to show as online. This prevents false positives — a device that has stale metrics cached in VictoriaMetrics but has stopped reporting will correctly show as offline once the last-seen threshold is exceeded.
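The dual check can be sketched as a single predicate. The function and parameter names are hypothetical; the 5-minute threshold comes from the description above, and `has_recent_metrics` stands in for the result of the VictoriaMetrics query.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

LAST_SEEN_THRESHOLD = timedelta(minutes=5)


def is_online(
    has_recent_metrics: bool,
    last_seen: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True only when both signals pass."""
    if last_seen is None:
        return False  # never informed, so signal 2 cannot pass
    last_seen_fresh = (now - last_seen) <= LAST_SEEN_THRESHOLD
    return has_recent_metrics and last_seen_fresh


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_online(True, now - timedelta(minutes=1), now))   # both signals pass
print(is_online(True, now - timedelta(minutes=10), now))  # metrics present, last seen stale
print(is_online(False, now - timedelta(minutes=1), now))  # last seen fresh, no metrics
```

The second call models the stale-metrics case: telemetry is still present, but the last-seen timestamp is older than the threshold, so the device is reported offline.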
Heartbeat Intervals by Device Type
| Device Type | Message Type | Interval | Contains |
|---|---|---|---|
| Dedicated Edge | Inform | 60 seconds | VPP interface statistics, WAN IP |
| MTGE | Keepalive | 30 seconds | Heartbeat signal |
| MTGE | Inform | 60 seconds | VPP interface statistics, WAN IP |
| Connector | Keepalive | 30 seconds | Heartbeat signal |
AF_PACKET edges whose WAN interface operates in L2 bridge mode (no IP address) will still show as online. The last-seen timestamp is updated on every inform regardless of whether a WAN IP could be extracted.
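A minimal sketch of that behavior, assuming a hypothetical device record and inform payload represented as dictionaries with `last_seen` and `wan_ip` keys:

```python
from datetime import datetime, timezone
from typing import Any, Dict


def handle_inform(
    device: Dict[str, Any], payload: Dict[str, Any], now: datetime
) -> Dict[str, Any]:
    """Return an updated copy of a device record after an inform message."""
    updated = dict(device)
    updated["last_seen"] = now  # always refreshed, with or without a WAN IP
    wan_ip = payload.get("wan_ip")
    if wan_ip:
        updated["wan_ip"] = wan_ip  # only set when a WAN IP was extracted
    return updated


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
bridged = handle_inform({"edge_id": "SN-0001"}, {"wan_ip": None}, now)
routed = handle_inform({"edge_id": "SN-0002"}, {"wan_ip": "203.0.113.7"}, now)
print(bridged)
print(routed)
```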
Alerting
SecureLink supports automated alerting based on metric conditions:
VictoriaMetrics ──▶ vmalert ──▶ alertmanager ──▶ Notifications
(metrics store)     (rule       (grouping,       (webhook,
                    evaluation) routing)         email,
                                                 Slack)
- vmalert evaluates alerting rules against VictoriaMetrics at regular intervals (e.g., "no metrics from edge X for 5 minutes")
- When a rule fires, vmalert sends the alert to alertmanager
- alertmanager groups related alerts, deduplicates, and routes them to configured receivers
- Notifications can be sent via webhook (to the VSM API for in-app alerts), email, or Slack
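The grouping and deduplication step can be pictured with the conceptual sketch below. It is not alertmanager's implementation; the alert name `EdgeNoMetrics`, the label values, and the choice of grouping labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Alert = Dict[str, str]  # label name -> label value


def fingerprint(alert: Alert) -> Tuple[Tuple[str, str], ...]:
    """Identify an alert by its full, order-independent label set."""
    return tuple(sorted(alert.items()))


def group_alerts(
    alerts: Iterable[Alert], group_by: List[str]
) -> Dict[Tuple[str, ...], List[Alert]]:
    """Group alerts by the chosen labels and drop exact duplicates."""
    groups: Dict[Tuple[str, ...], Dict[tuple, Alert]] = defaultdict(dict)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key][fingerprint(alert)] = alert  # same fingerprint overwrites
    return {key: list(members.values()) for key, members in groups.items()}


alerts = [
    {"alertname": "EdgeNoMetrics", "customer_id": "tenant-42", "edge_id": "SN-0001"},
    {"alertname": "EdgeNoMetrics", "customer_id": "tenant-42", "edge_id": "SN-0001"},
    {"alertname": "EdgeNoMetrics", "customer_id": "tenant-42", "edge_id": "SN-0002"},
    {"alertname": "EdgeNoMetrics", "customer_id": "tenant-7", "edge_id": "SN-0100"},
]
groups = group_alerts(alerts, ["alertname", "customer_id"])
for key, members in groups.items():
    print(key, len(members))
```

Grouping by tenant in this way would let one notification cover every affected edge for that tenant instead of sending one message per device.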
Further Reading
- Platform Architecture — How observability fits into the overall system
- Device Status — How status information is displayed in the UI
- MQTT Topic Reference — Telemetry-related MQTT topics