Architecting a Kubernetes-Native Network Performance Monitoring Service with iperf3, Prometheus, and Helm
Section 1: Architectural Blueprint for Continuous Network Validation
1.1 Introduction to Proactive Network Monitoring in Kubernetes
In modern cloud-native infrastructures, Kubernetes has emerged as the de facto standard for container orchestration, simplifying the deployment, scaling, and management of complex applications.^1^ However, the very dynamism and abstraction that make Kubernetes powerful also introduce significant challenges in diagnosing network performance issues. The ephemeral nature of pods, the complexity of overlay networks provided by Container Network Interfaces (CNIs), and the multi-layered traffic routing through Services and Ingress controllers can obscure the root causes of latency, packet loss, and throughput degradation.
Traditional, reactive troubleshooting---investigating network problems only after an application has failed---is insufficient in these environments. Performance bottlenecks can be subtle, intermittent, and difficult to reproduce, often manifesting as degraded user experience long before they trigger hard failures.^1^ To maintain the reliability and performance of critical workloads, engineering teams must shift from a reactive to a proactive stance. This requires a system that performs continuous, automated validation of the underlying network fabric, treating network health not as an assumption but as a measurable, time-series metric.
This document outlines the architecture and implementation of a comprehensive, Kubernetes-native network performance monitoring service. The solution leverages a suite of industry-standard, open-source tools to provide continuous, actionable insights into cluster network health. The core components are:
- iperf3: A widely adopted tool for active network performance measurement, used to generate traffic and measure maximum achievable bandwidth, jitter, and packet loss between two points.^2^
- Prometheus: A powerful, open-source monitoring and alerting system that has become the standard for collecting and storing time-series metrics in the Kubernetes ecosystem.^3^
- Grafana: A leading visualization tool for creating rich, interactive dashboards from various data sources, including Prometheus, enabling intuitive analysis of complex datasets.^4^
By combining these components into a cohesive, automated service, we can transform abstract network performance into a concrete, queryable, and visualizable stream of data, enabling teams to detect and address infrastructure-level issues before they impact end-users.^6^
1.2 The Core Architectural Pattern: Decoupled Test Endpoints and a Central Orchestrator
The foundation of this monitoring service is a robust, decoupled architectural pattern designed for scalability and resilience within a dynamic Kubernetes environment. The design separates the passive test endpoints from the active test orchestrator, a critical distinction that ensures the system is both efficient and aligned with Kubernetes operational principles.
The data flow and component interaction can be visualized as follows:
- A DaemonSet deploys an iperf3 server pod onto every node in the cluster, creating a mesh of passive test targets.
- A central Deployment, the iperf3-exporter, uses the Kubernetes API to discover the IP addresses of all iperf3 server pods.
- The iperf3-exporter periodically orchestrates tests, running an iperf3 client to connect to each server pod and measure network performance.
- The exporter parses the JSON output from iperf3, transforms the results into Prometheus metrics, and exposes them on a /metrics HTTP endpoint.
- A Prometheus server, configured via a ServiceMonitor, scrapes the /metrics endpoint of the exporter, ingesting the performance data into its time-series database.
- A Grafana instance, using Prometheus as a data source, visualizes the metrics in a purpose-built dashboard, providing heatmaps and time-series graphs of node-to-node bandwidth, jitter, and packet loss.
This architecture is composed of three primary logical components:
- Component 1: The iperf3-server DaemonSet. To accurately measure network performance between any two nodes (N-to-N), an iperf3 server process must be running and accessible on every node. The DaemonSet is the canonical Kubernetes controller for this exact use case. It guarantees that a copy of a specific pod runs on all, or a selected subset of, nodes within the cluster.^7^ When a new node joins the cluster, the DaemonSet controller automatically deploys an iperf3-server pod to it; conversely, when a node is removed, the pod is garbage collected. This ensures the mesh of test endpoints is always in sync with the state of the cluster, requiring zero manual intervention.^9^ This pattern of using a DaemonSet to deploy iperf3 across a cluster is a well-established practice for network validation.^11^
- Component 2: The iperf3-exporter Deployment. A separate, centralized component is required to act as the test orchestrator. This component is responsible for initiating the iperf3 client connections, executing the tests, parsing the results, and exposing them as Prometheus metrics. Since this is a stateless service whose primary function is to perform a periodic task, a Deployment is the ideal controller.^8^ A Deployment ensures a specified number of replicas are running, provides mechanisms for rolling updates, and allows for independent resource management and lifecycle control, decoupled from the iperf3-server pods it tests against.^10^
- Component 3: The Prometheus & Grafana Stack. The monitoring backend is provided by the kube-prometheus-stack, a comprehensive Helm chart that deploys Prometheus, Grafana, Alertmanager, and the necessary exporters for cluster monitoring.^4^ Our custom monitoring service is designed to integrate seamlessly with this stack, leveraging its Prometheus Operator for automatic scrape configuration and its Grafana instance for visualization.
1.3 Architectural Justification and Design Rationale
The primary strength of this architecture lies in its deliberate separation of concerns, a design choice that yields significant benefits in resilience, scalability, and operational efficiency. The DaemonSet is responsible for the presence of test endpoints, while the Deployment handles the orchestration of the tests. This decoupling is not arbitrary; it is a direct consequence of applying Kubernetes-native principles to the problem.
The logical progression is as follows: The requirement to continuously measure N-to-N node bandwidth necessitates that iperf3 server processes are available on all N nodes to act as targets. The most reliable, self-healing, and automated method to achieve this "one-pod-per-node" pattern in Kubernetes is to use a DaemonSet.^7^ This makes the server deployment automatically scale with the cluster itself. Next, a process is needed to trigger the tests against these servers. This "orchestrator" is a logically distinct, active service. It needs to be reliable and potentially scalable, but it does not need to run on every single node. The standard Kubernetes object for managing such stateless services is a
Deployment.^8^
This separation allows for independent and appropriate resource allocation. The iperf3-server pods are extremely lightweight, consuming minimal resources while idle. The iperf3-exporter, however, may be more CPU-intensive during the brief periods it is actively running tests. By placing them in different workload objects (DaemonSet and Deployment), we can configure their resource requests and limits independently. This prevents the monitoring workload from interfering with or being starved by application workloads, a crucial consideration for any production-grade system. This design is fundamentally more robust and scalable than simpler, monolithic approaches, such as a single script that attempts to manage both server and client lifecycles.^12^
Section 2: Implementing the iperf3-prometheus-exporter
The heart of this monitoring solution is the iperf3-prometheus-exporter, a custom application responsible for orchestrating the network tests and translating their results into a format that Prometheus can ingest. This section provides a detailed breakdown of its implementation, from technology selection to the final container image.
2.1 Technology Selection: Python for Agility and Ecosystem
Python was selected as the implementation language for the exporter due to its powerful ecosystem and rapid development capabilities. The availability of mature, well-maintained libraries for interacting with both Prometheus and Kubernetes significantly accelerates the development of a robust, cloud-native application.
The key libraries leveraged are:
- prometheus-client: The official Python client library for instrumenting applications with Prometheus metrics. It provides a simple API for defining metrics (Gauges, Counters, etc.) and exposing them via an HTTP server, handling much of the boilerplate required for creating a valid exporter.^13^
- iperf3-python: A clean, high-level Python wrapper around the iperf3 C library. It allows for programmatic control of iperf3 clients and servers, and it can directly parse the JSON output of a test into a convenient Python object, eliminating the need for manual process management and output parsing.^15^
- kubernetes: The official Python client library for the Kubernetes API. This library is essential for the exporter to become "Kubernetes-aware," enabling it to dynamically discover the iperf3-server pods it needs to test against by querying the API server directly.
2.2 Core Exporter Logic (Annotated Python Code)
The exporter's logic can be broken down into five distinct steps, which together form a continuous loop of discovery, testing, and reporting.
Step 1: Initialization and Metric Definition
The application begins by importing the necessary libraries and defining the Prometheus metrics that will be exposed. We use a Gauge metric, as bandwidth is a value that can go up or down. Labels are crucial for providing context; they allow us to slice and dice the data in Prometheus and Grafana.
Python
import os
import time
import logging

from kubernetes import client, config
from prometheus_client import start_http_server, Gauge
import iperf3

# --- Configuration ---
# Configure logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

# --- Prometheus Metrics Definition ---
IPERF_BANDWIDTH_MBPS = Gauge(
    'iperf_network_bandwidth_mbps',
    'Network bandwidth measured by iperf3 in Megabits per second',
    ['source_node', 'destination_node', 'protocol']
)
IPERF_JITTER_MS = Gauge(
    'iperf_network_jitter_ms',
    'Network jitter measured by iperf3 in milliseconds',
    ['source_node', 'destination_node', 'protocol']
)
IPERF_PACKETS_TOTAL = Gauge(
    'iperf_network_packets_total',
    'Total packets transmitted or received during the iperf3 test',
    ['source_node', 'destination_node', 'protocol']
)
IPERF_LOST_PACKETS = Gauge(
    'iperf_network_lost_packets_total',
    'Total lost packets during the iperf3 UDP test',
    ['source_node', 'destination_node', 'protocol']
)
IPERF_TEST_SUCCESS = Gauge(
    'iperf_test_success',
    'Indicates if the iperf3 test was successful (1) or failed (0)',
    ['source_node', 'destination_node', 'protocol']
)
Step 2: Kubernetes-Aware Target Discovery
A static list of test targets is an anti-pattern in a dynamic environment like Kubernetes.^16^ The exporter must dynamically discover its targets. This is achieved by using the Kubernetes Python client to query the API server for all pods that match the label selector of our
iperf3-server DaemonSet (e.g., app=iperf3-server). The function returns a list of dictionaries, each containing the pod's IP address and the name of the node it is running on.
This dynamic discovery is what transforms the exporter from a simple script into a resilient, automated service. It adapts to cluster scaling events without any manual intervention. The logical path is clear: Kubernetes clusters are dynamic, so a hardcoded list of IPs would become stale instantly. The API server is the single source of truth for the cluster's state. Therefore, the exporter must query this API, which in turn necessitates including the Kubernetes client library and configuring the appropriate Role-Based Access Control (RBAC) permissions for its ServiceAccount.
Python
def discover_iperf_servers():
    """Discover iperf3 server pods in the cluster using the Kubernetes API."""
    try:
        # Load in-cluster configuration
        config.load_incluster_config()
        v1 = client.CoreV1Api()

        label_selector = os.getenv('IPERF_SERVER_LABEL_SELECTOR', 'app=iperf3-server')

        logging.info(f"Discovering iperf3 servers with label '{label_selector}' across all namespaces")

        ret = v1.list_pod_for_all_namespaces(label_selector=label_selector, watch=False)

        servers = []
        for pod in ret.items:
            # Ensure the pod has an IP and is running
            if pod.status.pod_ip and pod.status.phase == 'Running':
                servers.append({
                    'ip': pod.status.pod_ip,
                    'node_name': pod.spec.node_name
                })
        logging.info(f"Discovered {len(servers)} iperf3 server pods.")
        return servers
    except Exception as e:
        logging.error(f"Error discovering iperf servers: {e}")
        return []
Step 3: The Test Orchestration Loop
The main function of the application contains an infinite while True loop that orchestrates the entire process. It periodically discovers the servers, creates a list of test pairs (node-to-node), and then executes an iperf3 test for each pair.
Python
def run_iperf_test(server_ip, server_port, protocol, source_node, dest_node):
    """Run a single iperf3 test and update the Prometheus metrics."""
    logging.info(f"Running iperf3 test from {source_node} to {dest_node} "
                 f"({server_ip}:{server_port}) using {protocol.upper()}")

    # Named iperf_client to avoid shadowing the imported kubernetes 'client' module
    iperf_client = iperf3.Client()
    iperf_client.server_hostname = server_ip
    iperf_client.port = server_port
    iperf_client.protocol = protocol
    iperf_client.duration = int(os.getenv('IPERF_TEST_DURATION', 5))
    iperf_client.json_output = True  # Critical for parsing

    result = iperf_client.run()

    # Parse results and update metrics
    parse_and_publish_metrics(result, source_node, dest_node, protocol)


def main_loop():
    """Main orchestration loop."""
    test_interval = int(os.getenv('IPERF_TEST_INTERVAL', 300))
    server_port = int(os.getenv('IPERF_SERVER_PORT', 5201))
    protocol = os.getenv('IPERF_TEST_PROTOCOL', 'tcp').lower()
    source_node_name = os.getenv('SOURCE_NODE_NAME')  # Injected via the Downward API

    if not source_node_name:
        logging.error("SOURCE_NODE_NAME environment variable not set. Exiting.")
        return

    while True:
        servers = discover_iperf_servers()

        for server in servers:
            # Avoid testing a node against itself
            if server['node_name'] == source_node_name:
                continue

            run_iperf_test(server['ip'], server_port, protocol,
                           source_node_name, server['node_name'])

        logging.info(f"Completed test cycle. Sleeping for {test_interval} seconds.")
        time.sleep(test_interval)
Step 4: Parsing and Publishing Metrics
After each test run, a dedicated function parses the JSON result object provided by the iperf3-python library.^15^ It extracts the key performance indicators and uses them to set the value of the corresponding Prometheus
Gauge, applying the correct labels for source and destination nodes. Robust error handling ensures that failed tests are also recorded as a metric, which is vital for alerting.
Python
def parse_and_publish_metrics(result, source_node, dest_node, protocol):
    """Parse the iperf3 result and update the Prometheus gauges."""
    labels = {'source_node': source_node,
              'destination_node': dest_node,
              'protocol': protocol}

    if result.error:
        logging.error(f"Test from {source_node} to {dest_node} failed: {result.error}")
        IPERF_TEST_SUCCESS.labels(**labels).set(0)
        # Clear previous successful metrics for this path
        IPERF_BANDWIDTH_MBPS.labels(**labels).set(0)
        IPERF_JITTER_MS.labels(**labels).set(0)
        return

    IPERF_TEST_SUCCESS.labels(**labels).set(1)

    # The summary bandwidth is in result.sent_Mbps or result.received_Mbps depending on direction.
    # For simplicity, we check for the available attributes.
    if hasattr(result, 'sent_Mbps'):
        bandwidth_mbps = result.sent_Mbps
    elif hasattr(result, 'received_Mbps'):
        bandwidth_mbps = result.received_Mbps
    else:
        # Fallback for different iperf3 versions/outputs
        bandwidth_mbps = result.Mbps if hasattr(result, 'Mbps') else 0

    IPERF_BANDWIDTH_MBPS.labels(**labels).set(bandwidth_mbps)

    if protocol == 'udp':
        IPERF_JITTER_MS.labels(**labels).set(getattr(result, 'jitter_ms', 0))
        IPERF_PACKETS_TOTAL.labels(**labels).set(getattr(result, 'packets', 0))
        IPERF_LOST_PACKETS.labels(**labels).set(getattr(result, 'lost_packets', 0))
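Because iperf_test_success records failures explicitly, broken paths can be alerted on directly. The following is a minimal PrometheusRule sketch, not part of the exporter itself; the group name, severity, and 15-minute hold are illustrative choices, and the release label simply mirrors the ServiceMonitor convention used later in this document (it must match whatever rule selector your Prometheus instance is configured with).
YAML
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: iperf3-network-alerts
  labels:
    release: prometheus-operator
spec:
  groups:
    - name: iperf3.rules
      rules:
        - alert: IperfTestFailing
          # Fires when any node-to-node iperf3 test has been failing for 15 minutes
          expr: iperf_test_success == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "iperf3 test from {{ $labels.source_node }} to {{ $labels.destination_node }} is failing"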
Step 5: Exposing the /metrics Endpoint
Finally, the main execution block starts a simple HTTP server using the prometheus-client library. This server exposes the collected metrics on the standard /metrics path, ready to be scraped by Prometheus.^13^
Python
if __name__ == '__main__':
    # Start the Prometheus metrics server
    listen_port = int(os.getenv('LISTEN_PORT', 9876))
    start_http_server(listen_port)
    logging.info(f"Prometheus exporter listening on port {listen_port}")

    # Start the main orchestration loop
    main_loop()
2.3 Containerizing the Exporter (Dockerfile)
To deploy the exporter in Kubernetes, it must be packaged into a container image. A multi-stage Dockerfile is used to create a minimal and more secure final image by separating the build environment from the runtime environment. This is a standard best practice for producing production-ready containers.^14^
Dockerfile
# Stage 1: Build stage with dependencies
FROM python:3.9-slim as builder

WORKDIR /app

# Install iperf3 and build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc iperf3 libiperf-dev && \
    rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: Final runtime stage
FROM python:3.9-slim

WORKDIR /app

# Copy the iperf3 binary and library from the builder stage
COPY --from=builder /usr/bin/iperf3 /usr/bin/iperf3
COPY --from=builder /usr/lib/x86_64-linux-gnu/libiperf.so.0 /usr/lib/x86_64-linux-gnu/libiperf.so.0

# Copy installed Python packages from the builder stage
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages

# Copy the exporter application code
COPY exporter.py .

# Expose the metrics port
EXPOSE 9876

# Set the entrypoint
CMD ["python", "exporter.py"]
The corresponding requirements.txt would contain:
prometheus-client
iperf3
kubernetes
Section 3: Kubernetes Manifests and Deployment Strategy
With the architectural blueprint defined and the exporter application containerized, the next step is to translate this design into declarative Kubernetes manifests. These YAML files define the necessary Kubernetes objects to deploy, configure, and manage the monitoring service. Using static manifests here provides a clear foundation before they are parameterized into a Helm chart in the next section.
3.1 The iperf3-server DaemonSet
The iperf3-server component is deployed as a DaemonSet to ensure an instance of the server pod runs on every eligible node in the cluster.^7^ This creates the ubiquitous grid of test endpoints required for comprehensive N-to-N testing.
Key fields in this manifest include:
- spec.selector: Connects the DaemonSet to the pods it manages via labels.
- spec.template.metadata.labels: The label app: iperf3-server is applied to the pods, which is crucial for discovery by both the iperf3-exporter and Kubernetes Services.
- spec.template.spec.containers: Defines the iperf3 container, using a public image and running the iperf3 -s command to start it in server mode.
- spec.template.spec.tolerations: This is often necessary to allow the DaemonSet to schedule pods on control-plane (master) nodes, which may have taints preventing normal workloads from running there. This ensures the entire cluster, including masters, is part of the test mesh.
- spec.template.spec.hostNetwork: true: This is a critical setting. By running the server pods on the host's network namespace, we bypass the Kubernetes network overlay (CNI) for the server side. This allows the test to measure the raw performance of the underlying node network interface, which is often the primary goal of infrastructure-level testing.
YAML
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: iperf3-server
  labels:
    app: iperf3-server
spec:
  selector:
    matchLabels:
      app: iperf3-server
  template:
    metadata:
      labels:
        app: iperf3-server
    spec:
      # Run on the host network to measure raw node-to-node performance
      hostNetwork: true
      # Tolerations to allow scheduling on control-plane nodes
      tolerations:
        - key: "node-role.kubernetes.io/control-plane"
          operator: "Exists"
          effect: "NoSchedule"
        - key: "node-role.kubernetes.io/master"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: iperf3-server
          image: networkstatic/iperf3:latest
          args: ["-s"]  # Start in server mode
          ports:
            - containerPort: 5201
              name: iperf3
              protocol: TCP
            - containerPort: 5201
              name: iperf3-udp
              protocol: UDP
          resources:
            requests:
              cpu: "50m"
              memory: "64Mi"
            limits:
              cpu: "100m"
              memory: "128Mi"
3.2 The iperf3-exporter Deployment
The iperf3-exporter is deployed as a Deployment, as it is a stateless application that orchestrates the tests.^14^ Only one replica is typically needed, as it can sequentially test all nodes.
Key fields in this manifest are:
- spec.replicas: 1: A single instance is sufficient for most clusters.
- spec.template.spec.serviceAccountName: This assigns the custom ServiceAccount (defined next) to the pod, granting it the necessary permissions to talk to the Kubernetes API.
- spec.template.spec.containers.env: The SOURCE_NODE_NAME environment variable is populated using the Downward API. This is how the exporter pod knows which node it is running on, allowing it to skip testing against itself.
- spec.template.spec.containers.image: This points to the custom exporter image built in the previous section.
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf3-exporter
  labels:
    app: iperf3-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iperf3-exporter
  template:
    metadata:
      labels:
        app: iperf3-exporter
    spec:
      serviceAccountName: iperf3-exporter-sa
      containers:
        - name: iperf3-exporter
          image: your-repo/iperf3-prometheus-exporter:latest  # Replace with your image
          ports:
            - containerPort: 9876
              name: metrics
          env:
            # Use the Downward API to inject the node name this pod is running on
            - name: SOURCE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            # Other configuration for the exporter script
            - name: IPERF_TEST_INTERVAL
              value: "300"
            - name: IPERF_SERVER_LABEL_SELECTOR
              value: "app=iperf3-server"
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
3.3 RBAC: Granting Necessary Permissions
For the exporter to perform its dynamic discovery of iperf3-server pods, it must be granted specific, limited permissions to read information from the Kubernetes API. This is accomplished through a ServiceAccount, a ClusterRole, and a ClusterRoleBinding.
- ServiceAccount: Provides an identity for the exporter pod within the cluster.
- ClusterRole: Defines a set of permissions. Here, we grant get, list, and watch access to pods. These are the minimum required permissions for the discovery function to work. The role is a ClusterRole because the exporter needs to find pods across all namespaces where servers might be running.
- ClusterRoleBinding: Links the ServiceAccount to the ClusterRole, effectively granting the permissions to any pod that uses the ServiceAccount.
YAML
apiVersion: v1
kind: ServiceAccount
metadata:
  name: iperf3-exporter-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: iperf3-exporter-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: iperf3-exporter-rb
subjects:
  - kind: ServiceAccount
    name: iperf3-exporter-sa
    namespace: default  # The namespace where the exporter is deployed
roleRef:
  kind: ClusterRole
  name: iperf3-exporter-role
  apiGroup: rbac.authorization.k8s.io
3.4 Network Exposure: Service and ServiceMonitor
To make the exporter's metrics available to Prometheus, we need two final objects. The Service exposes the exporter pod's metrics port within the cluster, and the ServiceMonitor tells the Prometheus Operator how to find and scrape that service.
This ServiceMonitor-based approach is the linchpin for a GitOps-friendly integration. Instead of manually editing the central Prometheus configuration file---a brittle and non-declarative process---we deploy a ServiceMonitor custom resource alongside our application.^14^ The Prometheus Operator, a key component of the
kube-prometheus-stack, continuously watches for these objects. When it discovers our iperf3-exporter-sm, it automatically generates the necessary scrape configuration and reloads Prometheus without any manual intervention.^4^ This empowers the application team to define
how their application should be monitored as part of the application's own deployment package, a cornerstone of scalable, "you build it, you run it" observability.
YAML
apiVersion: v1
kind: Service
metadata:
  name: iperf3-exporter-svc
  labels:
    app: iperf3-exporter
spec:
  selector:
    app: iperf3-exporter
  ports:
    - name: metrics
      port: 9876
      targetPort: metrics
      protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: iperf3-exporter-sm
  labels:
    # Label for the Prometheus Operator to discover this ServiceMonitor
    release: prometheus-operator
spec:
  selector:
    matchLabels:
      # This must match the labels on the Service object above
      app: iperf3-exporter
  endpoints:
    - port: metrics
      interval: 60s
      scrapeTimeout: 30s
Section 4: Packaging with Helm for Reusability and Distribution
While static YAML manifests are excellent for defining Kubernetes resources, they lack the flexibility needed for easy configuration, distribution, and lifecycle management. Helm, the package manager for Kubernetes, solves this by bundling applications into version-controlled, reusable packages called charts.^17^ This section details how to package the entire
iperf3 monitoring service into a professional, flexible, and distributable Helm chart.
4.1 Helm Chart Structure
A well-organized Helm chart follows a standard directory structure. This convention makes charts easier to understand and maintain.^19^
iperf3-monitor/
├── Chart.yaml          # Metadata about the chart (name, version, etc.)
├── values.yaml         # Default configuration values for the chart
├── charts/             # Directory for sub-chart dependencies (empty for this project)
├── templates/          # Directory containing the templated Kubernetes manifests
│   ├── _helpers.tpl    # A place for reusable template helpers
│   ├── server-daemonset.yaml
│   ├── exporter-deployment.yaml
│   ├── rbac.yaml
│   ├── service.yaml
│   └── servicemonitor.yaml
└── README.md           # Documentation for the chart
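For reference, the Chart.yaml at the root of this structure might look like the following minimal sketch; the description and version numbers shown here are illustrative placeholders rather than fixed values of the chart.
YAML
apiVersion: v2
name: iperf3-monitor
description: Continuous node-to-node network performance monitoring with iperf3 and Prometheus
type: application
# Chart packaging version (illustrative)
version: 0.1.0
# Default exporter image version deployed by the chart (illustrative)
appVersion: "0.1.0"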
4.2 Templating the Kubernetes Manifests
The core of Helm's power lies in its templating engine, which uses Go templates. We convert the static manifests from Section 3 into dynamic templates by replacing hardcoded values with references to variables defined in the values.yaml file.
A crucial best practice is to use a _helpers.tpl file to define common functions and partial templates, especially for generating resource names and labels. This reduces boilerplate, ensures consistency, and makes the chart easier to manage.^19^
Example: templates/_helpers.tpl
Code snippet
{{/*
Expand the name of the chart.
*/}}
{{- define "iperf3-monitor.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end -}}

{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
*/}}
{{- define "iperf3-monitor.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end -}}

{{/*
Common labels
*/}}
{{- define "iperf3-monitor.labels" -}}
helm.sh/chart: {{ include "iperf3-monitor.name" . }}
{{ include "iperf3-monitor.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end -}}

{{/*
Selector labels
*/}}
{{- define "iperf3-monitor.selectorLabels" -}}
app.kubernetes.io/name: {{ include "iperf3-monitor.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end -}}
Example: Templated exporter-deployment.yaml
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "iperf3-monitor.fullname" . }}-exporter
  labels:
    {{- include "iperf3-monitor.labels" . | nindent 4 }}
    app.kubernetes.io/component: exporter
spec:
  replicas: {{ .Values.exporter.replicaCount }}
  selector:
    matchLabels:
      {{- include "iperf3-monitor.selectorLabels" . | nindent 6 }}
      app.kubernetes.io/component: exporter
  template:
    metadata:
      labels:
        {{- include "iperf3-monitor.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/component: exporter
    spec:
      {{- if .Values.rbac.create }}
      serviceAccountName: {{ include "iperf3-monitor.fullname" . }}-sa
      {{- else }}
      serviceAccountName: {{ .Values.serviceAccount.name }}
      {{- end }}
      containers:
        - name: iperf3-exporter
          image: "{{ .Values.exporter.image.repository }}:{{ .Values.exporter.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.exporter.image.pullPolicy }}
          ports:
            - containerPort: 9876
              name: metrics
          env:
            - name: SOURCE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: IPERF_TEST_INTERVAL
              value: "{{ .Values.exporter.testInterval }}"
          resources:
            {{- toYaml .Values.exporter.resources | nindent 12 }}
4.3 Designing a Comprehensive values.yaml
The values.yaml file is the public API of a Helm chart. A well-designed values file is intuitive, clearly documented, and provides users with the flexibility to adapt the chart to their specific needs. Best practices include using clear, camelCase naming conventions and providing comments for every parameter.^21^
A particularly powerful feature of Helm is conditional logic. By wrapping entire resource definitions in if blocks based on boolean flags in values.yaml (e.g., {{- if .Values.rbac.create }}), the chart becomes highly adaptable. A user in a high-security environment can disable the automatic creation of ClusterRoles by setting rbac.create: false, allowing them to manage permissions manually without causing the Helm installation to fail.^20^ Similarly, a user not running the Prometheus Operator can set serviceMonitor.enabled: false. This adaptability transforms the chart from a rigid, all-or-nothing package into a flexible building block, dramatically increasing its utility across different organizations and security postures.
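As an illustration of this conditional pattern, a templates/servicemonitor.yaml might be wrapped as in the sketch below. The values paths follow the parameters documented in the table that follows, and the helper templates are those defined in _helpers.tpl; the exact label layout is an assumption, not a prescribed implementation.
YAML
{{- if .Values.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "iperf3-monitor.fullname" . }}-sm
  labels:
    {{- include "iperf3-monitor.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "iperf3-monitor.selectorLabels" . | nindent 6 }}
      app.kubernetes.io/component: exporter
  endpoints:
    - port: metrics
      interval: {{ .Values.serviceMonitor.interval }}
      scrapeTimeout: {{ .Values.serviceMonitor.scrapeTimeout }}
{{- end }}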
The following table documents the comprehensive set of configurable parameters for the iperf3-monitor chart. This serves as the primary documentation for any user wishing to install and customize the service.
| Parameter | Description | Type | Default |
|---|---|---|---|
| nameOverride | Override the name of the chart. | string | "" |
| fullnameOverride | Override the fully qualified app name. | string | "" |
| exporter.image.repository | The container image repository for the exporter. | string | ghcr.io/my-org/iperf3-prometheus-exporter |
| exporter.image.tag | The container image tag for the exporter. | string | (Chart.AppVersion) |
| exporter.image.pullPolicy | The image pull policy for the exporter. | string | IfNotPresent |
| exporter.replicaCount | Number of exporter pod replicas. | integer | 1 |
| exporter.testInterval | Interval in seconds between test cycles. | integer | 300 |
| exporter.testTimeout | Timeout in seconds for a single iperf3 test. | integer | 10 |
| exporter.testProtocol | Protocol to use for testing (tcp or udp). | string | tcp |
| exporter.resources | CPU/memory resource requests and limits for the exporter. | object | {} |
| server.image.repository | The container image repository for the iperf3 server. | string | networkstatic/iperf3 |
| server.image.tag | The container image tag for the iperf3 server. | string | latest |
| server.resources | CPU/memory resource requests and limits for the server pods. | object | {} |
| server.nodeSelector | Node selector for scheduling server pods. | object | {} |
| server.tolerations | Tolerations for scheduling server pods on tainted nodes. | array | [] |
| rbac.create | If true, create ServiceAccount, ClusterRole, and ClusterRoleBinding. | boolean | true |
| serviceAccount.name | The name of the ServiceAccount to use. Used if rbac.create is false. | string | "" |
| serviceMonitor.enabled | If true, create a ServiceMonitor for Prometheus Operator. | boolean | true |
| serviceMonitor.interval | Scrape interval for the ServiceMonitor. | string | 60s |
| serviceMonitor.scrapeTimeout | Scrape timeout for the ServiceMonitor. | string | 30s |
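Taken together, these parameters translate into a values.yaml along the lines of the abbreviated sketch below; the defaults mirror the table above, and the exporter image repository is a placeholder to be replaced with your own registry.
YAML
nameOverride: ""
fullnameOverride: ""

exporter:
  image:
    repository: ghcr.io/my-org/iperf3-prometheus-exporter
    tag: ""            # Defaults to the chart appVersion
    pullPolicy: IfNotPresent
  replicaCount: 1
  testInterval: 300    # Seconds between test cycles
  testTimeout: 10      # Seconds before a single iperf3 test is abandoned
  testProtocol: tcp    # tcp or udp
  resources: {}

server:
  image:
    repository: networkstatic/iperf3
    tag: latest
  resources: {}
  nodeSelector: {}
  tolerations: []

rbac:
  create: true

serviceAccount:
  name: ""

serviceMonitor:
  enabled: true
  interval: 60s
  scrapeTimeout: 30s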
Section 5: Visualizing Network Performance with a Custom Grafana Dashboard
The final piece of the user experience is a purpose-built Grafana dashboard that transforms the raw, time-series metrics from Prometheus into intuitive, actionable visualizations. A well-designed dashboard does more than just display data; it tells a story, guiding an operator from a high-level overview of cluster health to a deep-dive analysis of a specific problematic network path.^5^
5.1 Dashboard Design Principles
The primary goals for this network performance dashboard are:
- At-a-Glance Overview: Provide an immediate, cluster-wide view of network health, allowing operators to quickly spot systemic issues or anomalies.
- Intuitive Drill-Down: Enable users to seamlessly transition from a high-level view to a detailed analysis of performance between specific nodes.
- Correlation: Display multiple related metrics (bandwidth, jitter, packet loss) on the same timeline to help identify causal relationships.
- Clarity and Simplicity: Avoid clutter and overly complex panels that can obscure meaningful data.^4^
5.2 Key Visualizations and Panels
The dashboard is constructed from several key panel types, each serving a specific analytical purpose.
- Panel 1: Node-to-Node Bandwidth Heatmap. This is the centerpiece of the dashboard's overview. It uses Grafana's "Heatmap" visualization to create a matrix of network performance.
  - Y-Axis: Source Node (source_node label).
  - X-Axis: Destination Node (destination_node label).
  - Cell Color: The value of the iperf_network_bandwidth_mbps metric.
  - PromQL Query: avg(iperf_network_bandwidth_mbps) by (source_node, destination_node)

  This panel provides an instant visual summary of the entire cluster's network fabric. A healthy cluster will show a uniformly "hot" (high bandwidth) grid, while any "cold" spots immediately draw attention to underperforming network paths.
- Panel 2: Time-Series Performance Graphs. These panels use the "Time series" visualization to plot performance over time, allowing for trend analysis and historical investigation.
  - Bandwidth (Mbps): Plots iperf_network_bandwidth_mbps{source_node="$source_node", destination_node="$destination_node"}.
  - Jitter (ms): Plots iperf_network_jitter_ms{source_node="$source_node", destination_node="$destination_node", protocol="udp"}.
  - Packet Loss (%): Plots (iperf_network_lost_packets_total{...} / iperf_network_packets_total{...}) * 100.

  These graphs are filtered by the dashboard variables, enabling the drill-down analysis.
- Panel 3: Stat Panels. These panels use the "Stat" visualization to display single, key performance indicators (KPIs) for the selected time range and nodes.
  - Average Bandwidth: avg(iperf_network_bandwidth_mbps{...})
  - Minimum Bandwidth: min(iperf_network_bandwidth_mbps{...})
  - Maximum Jitter: max(iperf_network_jitter_ms{...})
5.3 Enabling Interactivity with Grafana Variables
The dashboard's interactivity is powered by Grafana's template variables. These variables are dynamically populated from Prometheus and are used to filter the data displayed in the panels.^4^
- $source_node: A dropdown variable populated by the PromQL query label_values(iperf_network_bandwidth_mbps, source_node).
- $destination_node: A dropdown variable populated by label_values(iperf_network_bandwidth_mbps{source_node="$source_node"}, destination_node). This query is cascaded, meaning it only shows destinations relevant to the selected source.
- $protocol: A custom variable with the options tcp and udp.
This combination of a high-level heatmap with interactive, variable-driven drill-down graphs creates a powerful analytical workflow. An operator can begin with a bird's-eye view of the cluster. Upon spotting an anomaly on the heatmap (e.g., a low-bandwidth link between Node-5 and Node-8), they can use the $source_node and $destination_node dropdowns to select that specific path. All the time-series panels will instantly update to show the detailed performance history for that link, allowing the operator to correlate bandwidth drops with jitter spikes or other events. This workflow transforms raw data into actionable insight, dramatically reducing the Mean Time to Identification (MTTI) for network issues.
5.4 The Complete Grafana Dashboard JSON Model
To facilitate easy deployment, the entire dashboard is defined in a single JSON model. This model can be imported directly into any Grafana instance.
JSON
{ "__inputs":, "__requires": [ { "type": "grafana", "id": "grafana", "name": "Grafana", "version": "8.0.0" }, { "type": "datasource", "id": "prometheus", "name": "Prometheus", "version": "1.0.0" } ], "annotations": { "list": [ { "builtIn": 1, "datasource": { "type": "grafana", "uid": "-- Grafana --" }, "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "editable": true, "fiscalYearStartMonth": 0, "gnetId": null, "graphTooltip": 0, "id": null, "links":, "panels":)", "format": "heatmap", "legendFormat": "{{source_node}} -> {{destination_node}}", "refId": "A" } ], "cards": { "cardPadding": null, "cardRound": null }, "color": { "mode": "spectrum", "scheme": "red-yellow-green", "exponent": 0.5, "reverse": false }, "dataFormat": "tsbuckets", "yAxis": { "show": true, "format": "short" }, "xAxis": { "show": true } }, { "title": "Bandwidth Over Time (Source: $source_node, Dest: $destination_node)", "type": "timeseries", "datasource": { "type": "prometheus", "uid": "prometheus" }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 9 }, "targets":, "fieldConfig": { "defaults": { "unit": "mbps" } } }, { "title": "Jitter Over Time (Source: $source_node, Dest: $destination_node)", "type": "timeseries", "datasource": { "type": "prometheus", "uid": "prometheus" }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 9 }, "targets": [ { "expr": "iperf_network_jitter_ms{source_node=\$source_node\, destination_node=\$destination_node\, protocol=\udp\}", "legendFormat": "Jitter", "refId": "A" } ], "fieldConfig": { "defaults": { "unit": "ms" } } } ], "refresh": "30s", "schemaVersion": 36, "style": "dark", "tags": ["iperf3", "network", "kubernetes"], "templating": { "list": [ { "current": {}, "datasource": { "type": "prometheus", "uid": "prometheus" }, "definition": "label_values(iperf_network_bandwidth_mbps, source_node)", "hide": 0, "includeAll": false, "multi": false, "name": "source_node", "options":, "query": "label_values(iperf_network_bandwidth_mbps, source_node)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 1, "type": "query" }, { "current": {}, "datasource": { "type": "prometheus", "uid": "prometheus" }, "definition": "label_values(iperf_network_bandwidth_mbps{source_node=\$source_node\}, destination_node)", "hide": 0, "includeAll": false, "multi": false, "name": "destination_node", "options":, "query": "label_values(iperf_network_bandwidth_mbps{source_node=\$source_node\}, destination_node)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 1, "type": "query" }, { "current": { "selected": true, "text": "tcp", "value": "tcp" }, "hide": 0, "includeAll": false, "multi": false, "name": "protocol", "options": [ { "selected": true, "text": "tcp", "value": "tcp" }, { "selected": false, "text": "udp", "value": "udp" } ], "query": "tcp,udp", "skipUrlSync": false, "type": "custom" } ] }, "time": { "from": "now-1h", "to": "now" }, "timepicker": {}, "timezone": "browser", "title": "Kubernetes iperf3 Network Performance", "uid": "k8s-iperf3-dashboard", "version": 1, "weekStart": "" }
Section 6: GitHub Repository Structure and CI/CD Workflow
To deliver this monitoring service as a professional, open-source-ready project, it is essential to package it within a well-structured GitHub repository and implement a robust Continuous Integration and Continuous Deployment (CI/CD) pipeline. This automates the build, test, and release process, ensuring that every version of the software is consistent, trustworthy, and easy for consumers to adopt.
6.1 Recommended Repository Structure
A clean, logical directory structure is fundamental for project maintainability and ease of navigation for contributors and users.
.
├── .github/
│   └── workflows/
│       └── release.yml        # GitHub Actions workflow for CI/CD
├── charts/
│   └── iperf3-monitor/        # The Helm chart for the service
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
│           └── ...
├── exporter/
│   ├── Dockerfile             # Dockerfile for the exporter
│   ├── requirements.txt       # Python dependencies
│   └── exporter.py            # Exporter source code
├── .gitignore
├── LICENSE
└── README.md
This structure cleanly separates the exporter application code (/exporter) from its deployment packaging (/charts/iperf3-monitor), and its release automation (/.github/workflows).
6.2 CI/CD Pipeline with GitHub Actions
A fully automated CI/CD pipeline is the hallmark of a mature software project. It eliminates manual, error-prone release steps and provides strong guarantees about the integrity of the published artifacts. By triggering the pipeline on the creation of a Git tag (e.g., v1.2.3), we use the tag as a single source of truth for versioning both the Docker image and the Helm chart. This ensures that chart version 1.2.3 is built to use image version 1.2.3, and that both have been validated before release. This automated, atomic release process provides trust and velocity, elevating the project from a collection of files into a reliable, distributable piece of software.
The following GitHub Actions workflow automates the entire release process:
YAML
# .github/workflows/release.yml
name: Release iperf3-monitor

on:
  push:
    tags:
      - 'v*.*.*'

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  lint-and-test:
    name: Lint and Test
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v3

      - name: Set up Helm
        uses: azure/setup-helm@v3
        with:
          version: v3.10.0

      - name: Helm Lint
        run: helm lint ./charts/iperf3-monitor

  build-and-publish-image:
    name: Build and Publish Docker Image
    runs-on: ubuntu-latest
    needs: lint-and-test
    permissions:
      contents: read
      packages: write
    steps:
      - name: Check out code
        uses: actions/checkout@v3

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          context: ./exporter
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

  package-and-publish-chart:
    name: Package and Publish Helm Chart
    runs-on: ubuntu-latest
    needs: build-and-publish-image
    permissions:
      contents: write
    steps:
      - name: Check out code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Set up Helm
        uses: azure/setup-helm@v3
        with:
          version: v3.10.0

      - name: Set Chart Version
        run: |
          VERSION=$(echo "${{ github.ref_name }}" | sed 's/^v//')
          export VERSION
          # helm-docs can be run here to regenerate the chart README (requires helm-docs on the runner):
          # helm-docs --sort-values-order file
          yq e -i '.version = strenv(VERSION)' ./charts/iperf3-monitor/Chart.yaml
          yq e -i '.appVersion = strenv(VERSION)' ./charts/iperf3-monitor/Chart.yaml

      - name: Publish Helm chart
        uses: stefanprodan/helm-gh-pages@v1.6.0
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          charts_dir: ./charts
          charts_url: https://${{ github.repository_owner }}.github.io/${{ github.event.repository.name }}
6.3 Documentation and Usability
The final, and arguably most critical, component for project success is high-quality documentation. The README.md file at the root of the repository is the primary entry point for any user. It should clearly explain what the project does, its architecture, and how to deploy and use it.
A common failure point in software projects is documentation that falls out of sync with the code. For Helm charts, the values.yaml file frequently changes, adding new parameters and options. To combat this, it is a best practice to automate the documentation of these parameters. The helm-docs tool can be integrated directly into the CI/CD pipeline to automatically generate the "Parameters" section of the README.md by parsing the comments directly from the values.yaml file.^20^ This ensures that the documentation is always an accurate reflection of the chart's configurable options, providing a seamless and trustworthy experience for users.
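For illustration, helm-docs reads specially formatted comments placed directly above each value; a values.yaml annotated for it might look like the short sketch below. The "# --" comment prefix is helm-docs' own convention, and the parameters shown are the ones documented for this chart.
YAML
exporter:
  # -- Interval in seconds between test cycles
  testInterval: 300
  # -- Protocol to use for testing (tcp or udp)
  testProtocol: tcp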
Conclusion
The proliferation of distributed microservices on Kubernetes has made network performance a critical, yet often opaque, component of overall application health. This report has detailed a comprehensive, production-grade solution for establishing continuous network validation within a Kubernetes cluster. By architecting a system around the robust, decoupled pattern of an iperf3-server DaemonSet and a Kubernetes-aware iperf3-exporter Deployment, this service provides a resilient and automated foundation for network observability.
The implementation leverages industry-standard tools---Python for the exporter, Prometheus for metrics storage, and Grafana for visualization---to create a powerful and flexible monitoring pipeline. The entire service is packaged into a professional Helm chart, following best practices for templating, configuration, and adaptability. This allows for simple, version-controlled deployment across a wide range of environments. The final Grafana dashboard transforms the collected data into an intuitive, interactive narrative, enabling engineers to move swiftly from high-level anomaly detection to root-cause analysis.
Ultimately, by treating network performance not as a given but as a continuously measured metric, organizations can proactively identify and resolve infrastructure bottlenecks, enhance application reliability, and ensure a consistent, high-quality experience for their users in the dynamic world of Kubernetes.