
Observability Pipeline

SecureLink collects three types of telemetry from edge devices: metrics (time-series measurements), logs (structured event records), and flows (network traffic records). This page describes how each data channel works and how online/offline detection is implemented.

Three Data Channels

     ┌──────────────────────────────────────┐
     │             Edge Device              │
     │                                      │
     │  ┌──────────┐   ┌─────────┐          │
     │  │   VPP    │   │ Docker  │          │
     │  │ Dataplane│   │  Logs   │          │
     │  └────┬─────┘   └────┬────┘          │
     │       │              │               │
     │  ┌────▼─────┐    ┌───▼────┐          │
     │  │Flowprobe │    │Promtail│          │
     │  │  Plugin  │    │        │          │
     │  └────┬─────┘    └───┬────┘          │
     │       │              │               │
     │  ┌────▼───────┐      │               │
     │  │IPFIX       │      │               │
     │  │Collector   │      │               │
     │  │(aggregate) │      │               │
     │  └────┬───────┘      │               │
     │       │              │               │
     │  Agent│Metrics       │               │
     └───────┼──────────────┼───────────────┘
             │              │
        MQTT │              │ Promtail push
             │              │
┌────────────▼──┐    ┌──────▼──────┐
│message-service│    │    Loki     │
│      -go      │    │             │
└────────┬──────┘    └─────────────┘
         │
┌────────▼──────────┐
│  VictoriaMetrics  │
│ (Time-Series DB)  │
└────────┬──────────┘
         │
┌────────▼──────────┐
│      Grafana      │
│   (Dashboards)    │
└───────────────────┘

Metrics

Metrics are numerical measurements collected at regular intervals — CPU usage, memory consumption, interface packet counters, uptime, and IPFIX flow aggregates.

Collection: The Edge Agent collects system and VPP metrics and exposes them as Prometheus-format metrics.

Transport: Metrics are pushed via MQTT to the message-service-go bridge, which forwards them to VictoriaMetrics using the Prometheus Remote Write protocol.

Storage: VictoriaMetrics stores time-series data with 1-year retention. It is PromQL-compatible, meaning the same query language used with Prometheus works here.

Query: The VSM API queries VictoriaMetrics via PromQL to populate dashboard charts. Grafana also queries VictoriaMetrics directly for its dashboards.

Key labels: Every metric carries two identifying labels:

  • edge_id — The device's serial number (always reliable)
  • customer_id — The tenant ID (set during bootstrap from the deployment configuration)
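
As a rough illustration of how a client might query by these labels, the sketch below builds a PromQL selector and the URL for a Prometheus-compatible instant query (`/api/v1/query`), which VictoriaMetrics serves. The host name, the metric name `edge_cpu_usage_percent`, and the sample label values are assumptions made for the example, not names taken from SecureLink.

```python
from urllib.parse import urlencode

# Assumed address; the real VictoriaMetrics endpoint depends on the deployment.
VM_BASE_URL = "http://victoriametrics:8428"


def cpu_query_for_edge(edge_id: str, customer_id: str) -> str:
    """Build a PromQL selector filtered by the two identifying labels.

    The metric name edge_cpu_usage_percent is hypothetical.
    """
    return (
        f'edge_cpu_usage_percent{{edge_id="{edge_id}",'
        f'customer_id="{customer_id}"}}'
    )


def build_instant_query_url(promql: str, base_url: str = VM_BASE_URL) -> str:
    """Return the URL of a Prometheus-compatible instant query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


# Sample values are placeholders, not real device or tenant identifiers.
query = cpu_query_for_edge("SN12345", "tenant-a")
url = build_instant_query_url(query)
print(query)
print(url)
```

Filtering on both labels keeps a query scoped to one device within one tenant; dropping `edge_id` from the selector would return the series for every edge of that tenant.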

Logs

Logs are structured event records from Docker containers running on the orchestrator and edge devices.

Collection: Promtail runs as a sidecar on both the orchestrator and edge devices. It tails Docker container logs and adds metadata labels (container name, host, tenant).

Transport: Promtail pushes log entries to Loki using its native push API.

Storage: Loki indexes only the labels and stores the log lines themselves as compressed chunks, which keeps it cost-effective for high-volume logging.

Query: Logs are queried using LogQL, which supports filtering by labels and grep-like pattern matching on log content. The monitoring dashboard in the Web UI and Grafana both use LogQL.
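
To make the label filter plus pattern match concrete, here is a minimal sketch that assembles a LogQL query string from a label set and an optional substring filter (`|=`). The label keys `container` and `tenant` and the container name `vsm-api` are assumptions; the actual label names are whatever Promtail is configured to attach.

```python
from typing import Dict, Optional


def build_logql(labels: Dict[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL query: a label selector plus an optional line filter."""
    # Sort the labels so the generated query is deterministic.
    selector = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    query = "{" + selector + "}"
    if contains is not None:
        # Escape backslashes and double quotes inside the quoted filter string.
        escaped = contains.replace("\\", "\\\\").replace('"', '\\"')
        query += f' |= "{escaped}"'
    return query


# Hypothetical label values, for illustration only.
print(build_logql({"container": "vsm-api", "tenant": "tenant-a"}, "connection refused"))
```

The label selector narrows the search to the matching streams, and the `|=` filter then keeps only the lines that contain the given text.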

Flows (IPFIX)

Flows are records of network traffic — which source communicated with which destination, how many bytes were transferred, and over which protocol.

Generation: The VPP Flowprobe plugin is configured on selected interfaces to generate IPFIX (IP Flow Information Export) records for all traffic passing through.

Collection: Unlike traditional IPFIX deployments where records are sent to a central collector, SecureLink collects and aggregates flows on the edge device itself. The Edge Agent's IPFIX collector plugin listens for Flowprobe's IPFIX records on a dedicated internal interface.

Aggregation: Raw flows are aggregated by (protocol, destination port) to control metric cardinality. Without aggregation, the full 5-tuple (src IP, dst IP, src port, dst port, protocol) would create an explosion of unique time-series.
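
The aggregation step can be sketched as a group-by over flow records. This is an illustrative sketch, not the Edge Agent's code: the `Flow` record shape and the function name are hypothetical, and it assumes each record represents one distinct flow.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record carrying the full 5-tuple plus counters."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IANA protocol number, e.g. 6 = TCP, 17 = UDP
    byte_count: int
    packet_count: int


def aggregate_flows(flows: Iterable[Flow]) -> Dict[Tuple[int, int], Dict[str, int]]:
    """Collapse flows to (protocol, destination port) totals."""
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0, "flows": 0})
    for flow in flows:
        key = (flow.protocol, flow.dst_port)  # source IP, dest IP, source port are dropped
        totals[key]["bytes"] += flow.byte_count
        totals[key]["packets"] += flow.packet_count
        totals[key]["flows"] += 1  # assumes one record per distinct flow
    return dict(totals)


flows = [
    Flow("10.0.0.1", "1.1.1.1", 50000, 443, 6, 1500, 10),
    Flow("10.0.0.2", "1.1.1.1", 50001, 443, 6, 500, 4),
    Flow("10.0.0.1", "8.8.8.8", 40000, 53, 17, 120, 1),
]
for key, counts in sorted(aggregate_flows(flows).items()):
    print(key, counts)
```

Three flows with three different 5-tuples collapse into two keys here, which is the cardinality reduction the aggregation is meant to achieve.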

Transport: Aggregated flow data is converted to Prometheus metrics and pushed via MQTT to the message-service-go bridge, which forwards it to VictoriaMetrics along the same path as the other metrics.

Key metrics:

| Metric | Description |
|---|---|
| edge_ipfix_bytes_total | Total bytes transferred per (protocol, destination port) |
| edge_ipfix_packets_total | Total packets per (protocol, destination port) |
| edge_ipfix_flows_total | Number of distinct flows per (protocol, destination port) |
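
For a sense of what these series could look like, the sketch below renders per-key totals as lines in the Prometheus text exposition format. The metric names come from the table above; the label names `protocol` and `dst_port`, the function name, and the sample values are assumptions, and the real agent may encode its metrics differently.

```python
from typing import Dict, List, Tuple


def render_ipfix_metrics(
    totals: Dict[Tuple[int, int], Dict[str, int]], edge_id: str
) -> List[str]:
    """Render (protocol, dst_port) totals as Prometheus text-format samples."""
    lines = []
    for (protocol, dst_port), counts in sorted(totals.items()):
        # Label names protocol and dst_port are assumed for this example.
        labels = f'edge_id="{edge_id}",protocol="{protocol}",dst_port="{dst_port}"'
        lines.append(f"edge_ipfix_bytes_total{{{labels}}} {counts['bytes']}")
        lines.append(f"edge_ipfix_packets_total{{{labels}}} {counts['packets']}")
        lines.append(f"edge_ipfix_flows_total{{{labels}}} {counts['flows']}")
    return lines


# Placeholder totals for one (TCP, 443) key.
sample = {(6, 443): {"bytes": 2000, "packets": 14, "flows": 2}}
for line in render_ipfix_metrics(sample, "SN12345"):
    print(line)
```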

Online/Offline Detection

The Web UI shows whether each device is online or offline. This determination uses a dual-check approach that combines two independent signals:

Signal 1: Telemetry Presence

The API queries VictoriaMetrics for recent CPU metrics for each edge. If metric data exists, the edge has been reporting telemetry recently.

Signal 2: Last Seen Recency

Every time an edge sends an inform message (every 60 seconds), the API updates the device's "last seen" timestamp in the database. The UI considers a device offline if this timestamp is more than 5 minutes old.

Both signals must pass for a device to show as online. This prevents false positives — a device that has stale metrics cached in VictoriaMetrics but has stopped reporting will correctly show as offline once the last-seen threshold is exceeded.
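
The dual check reduces to a logical AND of the two signals. The following is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, and the result of the telemetry query is passed in as a boolean rather than fetched from VictoriaMetrics.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold from the description above: last seen older than 5 minutes means offline.
LAST_SEEN_THRESHOLD = timedelta(minutes=5)


def is_online(
    has_recent_metrics: bool,
    last_seen: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True only if both the telemetry and the last-seen signal pass."""
    if last_seen is None:
        # A device that has never sent an inform has no last-seen timestamp.
        return False
    last_seen_fresh = (now - last_seen) <= LAST_SEEN_THRESHOLD
    return has_recent_metrics and last_seen_fresh


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_online(True, now - timedelta(minutes=1), now))   # both signals pass
print(is_online(True, now - timedelta(minutes=6), now))   # last seen is stale
print(is_online(False, now - timedelta(minutes=1), now))  # no recent telemetry
```

Requiring both signals means a failure of either one is enough to mark the device offline.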

Heartbeat Intervals by Device Type

| Device Type | Message Type | Interval | Contains |
|---|---|---|---|
| Dedicated Edge | Inform | 60 seconds | VPP interface statistics, WAN IP |
| MTGE | Keepalive | 30 seconds | Heartbeat signal |
| MTGE | Inform | 60 seconds | VPP interface statistics, WAN IP |
| Connector | Keepalive | 30 seconds | Heartbeat signal |

Note: AF_PACKET edges whose WAN interface operates in L2 bridge mode (no IP address) will still show as online. The last-seen timestamp is updated on every inform regardless of whether a WAN IP could be extracted.

Alerting

SecureLink supports automated alerting based on metric conditions:

VictoriaMetrics ──▶ vmalert     ──▶ alertmanager ──▶ Notifications
(metrics store)     (rule           (grouping,       (webhook,
                    evaluation)     routing)         email,
                                                     Slack)
  1. vmalert evaluates alerting rules against VictoriaMetrics at regular intervals (e.g., "no metrics from edge X for 5 minutes")
  2. When a rule fires, vmalert sends the alert to alertmanager
  3. alertmanager groups related alerts, deduplicates, and routes them to configured receivers
  4. Notifications can be sent via webhook (to the VSM API for in-app alerts), email, or Slack
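
Step 3 can be illustrated with a small sketch of deduplication and grouping over alert label sets. This is a simplified model of the idea, not alertmanager's implementation: the function name, the alert name `EdgeNoMetrics`, and the choice of `group_by` labels are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Alert = Dict[str, str]  # an alert is modeled as its label set


def group_alerts(
    alerts: Sequence[Alert],
    group_by: Tuple[str, ...] = ("alertname", "customer_id"),
) -> Dict[Tuple[str, ...], List[Alert]]:
    """Drop exact duplicates, then bucket alerts by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))  # identical label sets collide
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts; the second entry is an exact duplicate of the first.
alerts = [
    {"alertname": "EdgeNoMetrics", "customer_id": "tenant-a", "edge_id": "SN1"},
    {"alertname": "EdgeNoMetrics", "customer_id": "tenant-a", "edge_id": "SN1"},
    {"alertname": "EdgeNoMetrics", "customer_id": "tenant-a", "edge_id": "SN2"},
    {"alertname": "EdgeNoMetrics", "customer_id": "tenant-b", "edge_id": "SN9"},
]
for key, members in group_alerts(alerts).items():
    print(key, [a["edge_id"] for a in members])
```

Grouping by tenant in this way would let one notification cover several affected edges of the same customer instead of sending one message per device.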

Further Reading