# Kubernetes-Native Network Performance Monitoring Service

This project provides a comprehensive solution for continuous network validation within a Kubernetes cluster. Leveraging industry-standard tools like `iperf3`, `Prometheus`, and `Grafana`, it offers proactive monitoring of network performance between nodes, helping to identify and troubleshoot latency, bandwidth, and packet loss issues before they impact applications.

## Features

* **Continuous N-to-N Testing:** Automatically measures network performance between all nodes in the cluster.
* **Kubernetes-Native:** Deploys as standard Kubernetes workloads (DaemonSet, Deployment).
* **Dynamic Discovery:** The exporter automatically discovers iperf3 server pods using the Kubernetes API.
* **Prometheus Integration:** Translates iperf3 results into standard Prometheus metrics for time-series storage.
* **Grafana Visualization:** Provides a rich, interactive dashboard with heatmaps and time-series graphs.
* **Helm Packaging:** Packaged as a Helm chart for easy deployment and configuration management.
* **Automated CI/CD:** Includes a GitHub Actions workflow for building and publishing the exporter image and Helm chart.

## Architecture

The service is based on a decoupled architecture:

1. **iperf3-server DaemonSet:** Deploys an `iperf3` server pod on every node to act as a test endpoint. The pods run on the host network so that raw node-to-node performance is measured.
2. **iperf3-exporter Deployment:** A centralized service that uses the Kubernetes API to discover server pods, orchestrates `iperf3` client tests against them, parses the JSON output, and exposes performance metrics via an HTTP endpoint.
3. **Prometheus & Grafana Stack:** A standard monitoring backend (such as `kube-prometheus-stack`) that scrapes the exporter's metrics and visualizes them in a custom dashboard.

This separation of concerns ensures scalability and resilience, and aligns with Kubernetes operational principles.
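
To make the data flow concrete, the measurement the exporter automates can also be reproduced by hand. The sketch below is illustrative only: it assumes the chart is installed in the `monitoring` namespace, uses the chart's default `component=server` label, and substitutes a placeholder pod IP; the exporter itself performs the discovery step through the Kubernetes API rather than `kubectl`.

```/dev/null/manual-test.sh
# 1. Discover the iperf3 server pods (one per node) and their pod IPs.
kubectl get pods -n monitoring -l app.kubernetes.io/component=server -o wide

# 2. Run a single client test against one server pod, emitting JSON.
#    This JSON is what the exporter parses into Prometheus metrics.
iperf3 -c <server-pod-ip> -p 5201 -t 10 -J
```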

## Getting Started

### Prerequisites

* A running Kubernetes cluster.
* `kubectl` configured to connect to your cluster.
* Helm v3+ installed.
* A Prometheus instance configured to scrape services (ideally using the Prometheus Operator and ServiceMonitors; see the quick check after this list).
* A Grafana instance accessible and configured with Prometheus as a data source.
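
The Prometheus Operator prerequisite can be sanity-checked before installing. This is a minimal sketch that assumes the operator's standard CRD name:

```/dev/null/check-prereqs.sh
# The ServiceMonitor CRD must exist for the chart's ServiceMonitor to be usable.
kubectl get crd servicemonitors.monitoring.coreos.com

# Confirm Helm v3+ and cluster connectivity.
helm version --short
kubectl cluster-info
```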
### Installation with Helm

1. Add the Helm chart repository (replace with your actual repo URL once published):

```/dev/null/helm-install.sh#L1-1
helm repo add iperf3-monitor https://malarinv.github.io/iperf3-monitor/
```

2. Update your Helm repositories:

```/dev/null/helm-install.sh#L3-3
helm repo update
```

3. Install the chart into the `monitoring` namespace (or your preferred namespace):

```/dev/null/helm-install.sh#L5-8
helm install iperf3-monitor iperf3-monitor/iperf3-monitor \
  --namespace monitoring \
  --create-namespace \
  --values values.yaml # Optional: Use a custom values file
```

> **Note:** Ensure your Prometheus instance is configured to scrape services in the namespace where you install the chart and that it recognizes `ServiceMonitor` resources with the label `release: prometheus-operator` (if using the standard `kube-prometheus-stack` setup).
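
After installation, a quick way to confirm the exporter is up and producing metrics is to port-forward to its Service and inspect the metrics endpoint. This is a minimal sketch; it assumes the release (and therefore the Service) is named `iperf3-monitor`, that it was installed in the `monitoring` namespace, and that the exporter serves metrics on the conventional `/metrics` path on port 9876.

```/dev/null/verify-install.sh
# The server DaemonSet pods (one per node) and the exporter pod should be Running.
kubectl get pods -n monitoring

# Forward the exporter Service locally and look for iperf metrics.
kubectl port-forward -n monitoring svc/iperf3-monitor 9876:9876 &
sleep 2
curl -s http://localhost:9876/metrics | grep '^iperf_'
# Expect series such as:
# iperf_network_bandwidth_mbps{source_node="node-a",destination_node="node-b",protocol="tcp"} ...
```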
### Configuration
The Helm chart is highly configurable via the `values.yaml` file. You can override default settings by creating your own `values.yaml` and passing it during installation (`--values my-values.yaml`).
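
For example, a small override file can change the scrape interval and restrict where the server DaemonSet runs. The keys below mirror the defaults shown further down and are only illustrative:

```/dev/null/custom-values.sh
# Write a minimal override file (keys mirror the default values.yaml).
cat > my-values.yaml <<'EOF'
serviceMonitor:
  interval: 30s
server:
  nodeSelector:
    kubernetes.io/os: linux
EOF

# Apply it to a new or existing release.
helm upgrade --install iperf3-monitor iperf3-monitor/iperf3-monitor \
  --namespace monitoring --values my-values.yaml
```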

Refer to the comments in the default `values.yaml` for a detailed explanation of each parameter:

```iperf3-monitor/charts/iperf3-monitor/values.yaml
# Default values for iperf3-monitor.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# -- Override the name of the chart.
nameOverride: ""

# -- Override the fully qualified app name.
fullnameOverride: ""

# Exporter Configuration (`controllers.exporter`)
# The iperf3 exporter is managed under the `controllers.exporter` section,
# leveraging the `bjw-s/common-library` for robust workload management.
controllers:
  exporter:
    # -- Enable the exporter controller.
    enabled: true
    # -- Set the controller type for the exporter.
    # Valid options are "deployment" or "daemonset".
    # Use "daemonset" for N-to-N node monitoring where an exporter runs on each node (or selected nodes).
    # Use "deployment" for a centralized exporter (typically with replicaCount: 1).
    # @default -- "deployment"
    type: deployment
    # -- Number of desired exporter pods. Only used if type is "deployment".
    # @default -- 1
    replicas: 1

    # -- Application-specific configuration for the iperf3 exporter.
    # These values are used to populate environment variables for the exporter container.
    appConfig:
      # -- Interval in seconds between complete test cycles (i.e., testing all server nodes).
      testInterval: 300
      # -- Log level for the iperf3 exporter (e.g., DEBUG, INFO, WARNING, ERROR, CRITICAL).
      logLevel: INFO
      # -- Timeout in seconds for a single iperf3 test run.
      testTimeout: 10
      # -- Protocol to use for testing (tcp or udp).
      testProtocol: tcp
      # -- iperf3 server port to connect to. Should match the server's listening port.
      serverPort: "5201"
      # -- Label selector to find iperf3 server pods.
      # This is templated. Default: 'app.kubernetes.io/name=<chart-name>,app.kubernetes.io/instance=<release-name>,app.kubernetes.io/component=server'
      serverLabelSelector: 'app.kubernetes.io/name={{ include "iperf3-monitor.name" . }},app.kubernetes.io/instance={{ .Release.Name }},app.kubernetes.io/component=server'

    # -- Pod-level configurations for the exporter.
    pod:
      # -- Annotations for the exporter pod.
      annotations: {}
      # -- Labels for the exporter pod (the common library adds its own defaults too).
      labels: {}
      # -- Node selector for scheduling exporter pods. Useful for DaemonSet or specific scheduling with Deployments.
      # Example:
      # nodeSelector:
      #   kubernetes.io/os: linux
      nodeSelector: {}
      # -- Tolerations for scheduling exporter pods.
      # Example:
      # tolerations:
      #   - key: "node-role.kubernetes.io/control-plane"
      #     operator: "Exists"
      #     effect: "NoSchedule"
      tolerations: []
      # -- Affinity rules for scheduling exporter pods.
      # Example:
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #         - matchExpressions:
      #             - key: "kubernetes.io/arch"
      #               operator: In
      #               values:
      #                 - amd64
      affinity: {}
      # -- Security context for the exporter pod.
      # securityContext:
      #   fsGroup: 65534
      #   runAsUser: 65534
      #   runAsGroup: 65534
      #   runAsNonRoot: true
      securityContext: {}
      # -- Automount service account token for the pod.
      automountServiceAccountToken: true

    # -- Container-level configurations for the main exporter container.
    containers:
      exporter: # Name of the primary container
        image:
          repository: ghcr.io/malarinv/iperf3-monitor
          tag: "" # Defaults to .Chart.AppVersion
          pullPolicy: IfNotPresent
        # -- Custom environment variables for the exporter container.
        # These are merged with the ones generated from appConfig.
        # env:
        #   MY_CUSTOM_VAR: "my_value"
        env: {}
        # -- Ports for the exporter container.
        ports:
          metrics: # Name of the port
            port: 9876 # Container port for metrics
            protocol: TCP
            enabled: true
        # -- CPU and memory resource requests and limits.
        # resources:
        #   requests:
        #     cpu: "100m"
        #     memory: "128Mi"
        #   limits:
        #     cpu: "500m"
        #     memory: "256Mi"
        resources: {}
        # -- Probes configuration for the exporter container.
        # probes:
        #   liveness:
        #     enabled: true # Example: enable liveness probe
        #     spec: # Customize probe spec if needed
        #       initialDelaySeconds: 30
        #       periodSeconds: 15
        #       timeoutSeconds: 5
        #       failureThreshold: 3
        probes:
          liveness:
            enabled: false
          readiness:
            enabled: false
          startup:
            enabled: false

server:
  # -- Configuration for the iperf3 server container image (DaemonSet).
  image:
    # -- The container image repository for the iperf3 server.
    repository: networkstatic/iperf3
    # -- The container image tag for the iperf3 server.
    tag: latest

  # -- CPU and memory resource requests and limits for the iperf3 server pods (DaemonSet).
  # These should be very low as the server is mostly idle.
  # @default -- A small default is provided if commented out.
  resources: {}
  # requests:
  #   cpu: "50m"
  #   memory: "64Mi"
  # limits:
  #   cpu: "100m"
  #   memory: "128Mi"

  # -- Node selector for scheduling iperf3 server pods.
  # Use this to restrict the DaemonSet to a subset of nodes.
  # @default -- {} (schedule on all nodes)
  nodeSelector: {}

  # -- Tolerations for scheduling iperf3 server pods on tainted nodes (e.g., control-plane nodes).
  # This is often necessary to include master nodes in the test mesh.
  # @default -- Tolerates control-plane and master taints.
  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"

rbac:
  # -- If true, create ServiceAccount, ClusterRole, and ClusterRoleBinding for the exporter.
  # Set to false if you manage RBAC externally.
  create: true

serviceAccount:
  # -- The name of the ServiceAccount to use for the exporter pod.
  # Only used if rbac.create is false. If not set, it defaults to the chart's fullname.
  name: ""

serviceMonitor:
  # -- If true, create a ServiceMonitor resource for integration with Prometheus Operator.
  # Requires a running Prometheus Operator in the cluster.
  enabled: true

  # -- Scrape interval for the ServiceMonitor. How often Prometheus scrapes the exporter metrics.
  interval: 60s

  # -- Scrape timeout for the ServiceMonitor. How long Prometheus waits for metrics response.
  scrapeTimeout: 30s

# -- Configuration for the exporter Service.
service:
  # -- Service type. ClusterIP is typically sufficient.
  type: ClusterIP
  # -- Port on which the exporter service is exposed.
  port: 9876
  # -- Target port on the exporter pod.
  targetPort: 9876

# -- Optional configuration for a network policy to allow traffic to the iperf3 server DaemonSet.
# This is often necessary if you are using a network policy controller.
networkPolicy:
  # -- If true, create a NetworkPolicy resource.
  enabled: false
  # -- Specify source selectors if needed (e.g., pods in a specific namespace).
  from: []
  # -- Specify namespace selectors if needed.
  namespaceSelector: {}
  # -- Specify pod selectors if needed.
  podSelector: {}
```

## Grafana Dashboard

A custom Grafana dashboard is provided to visualize the collected `iperf3` metrics.

1. Log in to your Grafana instance.
2. Navigate to `Dashboards` -> `Import`.
3. Paste the full JSON model provided below into the text area and click `Load`.
4. Select your Prometheus data source and click `Import`.

```/dev/null/grafana-dashboard.json
{
  "__inputs": [],
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "8.0.0"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "gnetId": null,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "title": "Node-to-Node Bandwidth (avg Mbps)",
      "type": "heatmap",
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 2,
      "targets": [
        {
          "expr": "avg(iperf_network_bandwidth_mbps) by (source_node, destination_node)",
          "format": "heatmap",
          "legendFormat": "{{source_node}} -> {{destination_node}}",
          "refId": "A"
        }
      ],
      "cards": { "cardPadding": null, "cardRound": null },
      "color": {
        "mode": "spectrum",
        "scheme": "red-yellow-green",
        "exponent": 0.5,
        "reverse": false
      },
      "dataFormat": "tsbuckets",
      "yAxis": { "show": true, "format": "short" },
      "xAxis": { "show": true }
    },
    {
      "title": "Bandwidth Over Time (Source: $source_node, Dest: $destination_node)",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 9
      },
      "targets": [
        {
          "expr": "iperf_network_bandwidth_mbps{source_node=~\"^$source_node$\", destination_node=~\"^$destination_node$\", protocol=~\"^$protocol$\"}",
          "legendFormat": "Bandwidth",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "mbps"
        }
      }
    },
    {
      "title": "Jitter Over Time (Source: $source_node, Dest: $destination_node)",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 9
      },
      "targets": [
        {
          "expr": "iperf_network_jitter_ms{source_node=~\"^$source_node$\", destination_node=~\"^$destination_node$\", protocol=\"udp\"}",
          "legendFormat": "Jitter",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms"
        }
      }
    }
  ],
  "refresh": "30s",
  "schemaVersion": 36,
  "style": "dark",
  "tags": ["iperf3", "network", "kubernetes"],
  "templating": {
    "list": [
      {
        "current": {},
        "datasource": {
          "type": "prometheus",
          "uid": "prometheus"
        },
        "definition": "label_values(iperf_network_bandwidth_mbps, source_node)",
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "source_node",
        "options": [],
        "query": "label_values(iperf_network_bandwidth_mbps, source_node)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "type": "query"
      },
      {
        "current": {},
        "datasource": {
          "type": "prometheus",
          "uid": "prometheus"
        },
        "definition": "label_values(iperf_network_bandwidth_mbps{source_node=~\"^$source_node$\"}, destination_node)",
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "destination_node",
        "options": [],
        "query": "label_values(iperf_network_bandwidth_mbps{source_node=~\"^$source_node$\"}, destination_node)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "type": "query"
      },
      {
        "current": { "selected": true, "text": "tcp", "value": "tcp" },
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "protocol",
        "options": [
          { "selected": true, "text": "tcp", "value": "tcp" },
          { "selected": false, "text": "udp", "value": "udp" }
        ],
        "query": "tcp,udp",
        "skipUrlSync": false,
        "type": "custom"
      }
    ]
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "browser",
  "title": "Kubernetes iperf3 Network Performance",
  "uid": "k8s-iperf3-dashboard",
  "version": 1,
  "weekStart": ""
}
```
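
Alternatively, if your Grafana runs with the dashboard sidecar enabled (the default in `kube-prometheus-stack`), the dashboard can be provisioned declaratively instead of imported by hand. This is a sketch under that assumption: it expects the JSON model above to be saved as `iperf3-dashboard.json` and a sidecar watching ConfigMaps labelled `grafana_dashboard`.

```/dev/null/provision-dashboard.sh
# Save the JSON model above as iperf3-dashboard.json, then create a labelled
# ConfigMap that the Grafana dashboard sidecar picks up automatically.
kubectl create configmap iperf3-dashboard \
  --namespace monitoring \
  --from-file=iperf3-dashboard.json
kubectl label configmap iperf3-dashboard \
  --namespace monitoring grafana_dashboard="1"
```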

## Repository Structure

The project follows a standard structure:

```/dev/null/repo-structure.txt
.
├── .github/
│   └── workflows/
│       └── release.yml            # GitHub Actions workflow for CI/CD
├── charts/
│   └── iperf3-monitor/            # The Helm chart for the service
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
│           ├── _helpers.tpl
│           ├── server-daemonset.yaml
│           ├── exporter-deployment.yaml
│           ├── rbac.yaml
│           ├── service.yaml
│           └── servicemonitor.yaml
├── exporter/
│   ├── Dockerfile                 # Dockerfile for the exporter
│   ├── requirements.txt           # Python dependencies
│   └── exporter.py                # Exporter source code
├── .gitignore                     # Specifies intentionally untracked files
├── LICENSE                        # Project license
└── README.md                      # This file
```

## Development and CI/CD

The project includes a GitHub Actions workflow (`.github/workflows/release.yml`) triggered on Git tags (`v*.*.*`) to automate:

1. Linting the Helm chart.
2. Building and publishing the Docker image for the exporter to GitHub Container Registry (`ghcr.io`).
3. Updating the Helm chart version based on the Git tag.
4. Packaging and publishing the Helm chart to GitHub Pages.
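
The same steps can be exercised locally before tagging a release. The commands below are a minimal sketch using the paths from the repository structure above; the `dev` image tag is only an example:

```/dev/null/local-dev.sh
# Lint and render the chart locally.
helm lint charts/iperf3-monitor
helm template iperf3-monitor charts/iperf3-monitor > /tmp/rendered.yaml

# Build the exporter image from the exporter/ directory.
docker build -t ghcr.io/malarinv/iperf3-monitor:dev exporter/
```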
## License
This project is licensed under the GNU Affero General Public License v3. See the `LICENSE` file for details.