# Kubernetes-Native Network Performance Monitoring Service

This project provides a comprehensive solution for continuous network validation within a Kubernetes cluster. Leveraging industry-standard tools such as `iperf3`, `Prometheus`, and `Grafana`, it proactively monitors network performance between nodes, helping you identify and troubleshoot latency, bandwidth, and packet loss issues before they impact applications.

## Features

* **Continuous N-to-N Testing:** Automatically measures network performance between all nodes in the cluster.
* **Kubernetes-Native:** Deploys as standard Kubernetes workloads (DaemonSet, Deployment).
* **Dynamic Discovery:** Automatically discovers iperf3 server pods using the Kubernetes API.
* **Prometheus Integration:** Translates iperf3 results into standard Prometheus metrics for time-series storage.
* **Grafana Visualization:** Provides a rich, interactive dashboard with heatmaps and time-series graphs.
* **Helm Packaging:** Packaged as a Helm chart for easy deployment and configuration management.
* **Automated CI/CD:** Includes a GitHub Actions workflow for building and publishing the exporter image and Helm chart.

## Architecture

The service is based on a decoupled architecture:

1. **iperf3-server DaemonSet:** Deploys an `iperf3` server pod on every node to act as a test endpoint. The pods run on the host network so tests measure raw node-to-node performance.
2. **iperf3-exporter Deployment:** A centralized service that uses the Kubernetes API to discover server pods, orchestrates `iperf3` client tests against them, parses the JSON output, and exposes performance metrics via an HTTP endpoint.
3. **Prometheus & Grafana Stack:** A standard monitoring backend (such as `kube-prometheus-stack`) that scrapes the exporter's metrics and visualizes them in a custom dashboard.

This separation of concerns ensures scalability and resilience, and aligns with Kubernetes operational principles.

## Getting Started

### Prerequisites

* A running Kubernetes cluster.
* `kubectl` configured to connect to your cluster.
* Helm v3+ installed.
* A Prometheus instance configured to scrape services (ideally using the Prometheus Operator and ServiceMonitors).
* A Grafana instance accessible and configured with Prometheus as a data source.

### Installation with Helm

1. Add the Helm chart repository (replace with your actual repo URL once published):

   ```bash
   helm repo add iperf3-monitor https://malarinv.github.io/iperf3-monitor/
   ```

2. Update your Helm repositories:

   ```bash
   helm repo update
   ```

3. Install the chart:

   ```bash
   # --namespace: install into your preferred namespace.
   # --values:    optional, pass a custom values file (see Configuration below).
   helm install iperf3-monitor iperf3-monitor/iperf3-monitor \
     --namespace monitoring \
     --create-namespace \
     --values values.yaml
   ```

> **Note:** Ensure your Prometheus instance is configured to scrape services in the namespace where you install the chart and that it recognizes `ServiceMonitor` resources with the label `release: prometheus-operator` (if using the standard `kube-prometheus-stack` setup).
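After installation, it is worth confirming that the server DaemonSet and the exporter are running and that metrics are being exposed. The sketch below assumes the release is named `iperf3-monitor` in the `monitoring` namespace, that the chart applies the usual `app.kubernetes.io/instance` label, and that the exporter Service (port `9876`) serves Prometheus metrics at `/metrics`; adjust names and labels to your setup:

```bash
# List the workloads created by the chart
# (label selector follows the common Helm convention and may differ in your chart).
kubectl -n monitoring get daemonset,deployment,pods \
  -l app.kubernetes.io/instance=iperf3-monitor

# Port-forward the exporter Service and inspect the raw metrics
# (Service name and /metrics path are assumptions based on standard conventions).
kubectl -n monitoring port-forward svc/iperf3-monitor 9876:9876 &
curl -s http://localhost:9876/metrics | grep iperf_
```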
### Configuration

The Helm chart is highly configurable via the `values.yaml` file. You can override default settings by creating your own `values.yaml` and passing it during installation (`--values my-values.yaml`). Refer to the comments in the default `values.yaml` (`charts/iperf3-monitor/values.yaml`) for a detailed explanation of each parameter:

```yaml
# Default values for iperf3-monitor.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# -- Override the name of the chart.
nameOverride: ""
# -- Override the fully qualified app name.
fullnameOverride: ""

exporter:
  # -- Configuration for the exporter container image.
  image:
    # -- The container image repository for the exporter.
    repository: ghcr.io/malarinv/iperf3-monitor
    # -- The container image tag for the exporter. If not set, the chart's appVersion is used.
    tag: ""
    # -- The image pull policy for the exporter container.
    pullPolicy: IfNotPresent
  # -- Number of exporter pod replicas. Typically 1 is sufficient.
  replicaCount: 1
  # -- Interval in seconds between complete test cycles (i.e., testing all server nodes).
  testInterval: 300
  # -- Timeout in seconds for a single iperf3 test run.
  testTimeout: 10
  # -- Protocol to use for testing (tcp or udp).
  testProtocol: tcp
  # -- CPU and memory resource requests and limits for the exporter pod.
  # @default -- A small default is provided if commented out.
  resources: {}
  #  requests:
  #    cpu: "100m"
  #    memory: "128Mi"
  #  limits:
  #    cpu: "500m"
  #    memory: "256Mi"

server:
  # -- Configuration for the iperf3 server container image (DaemonSet).
  image:
    # -- The container image repository for the iperf3 server.
    repository: networkstatic/iperf3
    # -- The container image tag for the iperf3 server.
    tag: latest
  # -- CPU and memory resource requests and limits for the iperf3 server pods (DaemonSet).
  # These should be very low as the server is mostly idle.
  # @default -- A small default is provided if commented out.
  resources: {}
  #  requests:
  #    cpu: "50m"
  #    memory: "64Mi"
  #  limits:
  #    cpu: "100m"
  #    memory: "128Mi"
  # -- Node selector for scheduling iperf3 server pods.
  # Use this to restrict the DaemonSet to a subset of nodes.
  # @default -- {} (schedule on all nodes)
  nodeSelector: {}
  # -- Tolerations for scheduling iperf3 server pods on tainted nodes (e.g., control-plane nodes).
  # This is often necessary to include master nodes in the test mesh.
  # @default -- Tolerates control-plane and master taints.
  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"

rbac:
  # -- If true, create ServiceAccount, ClusterRole, and ClusterRoleBinding for the exporter.
  # Set to false if you manage RBAC externally.
  create: true

serviceAccount:
  # -- The name of the ServiceAccount to use for the exporter pod.
  # Only used if rbac.create is false. If not set, it defaults to the chart's fullname.
  name: ""

serviceMonitor:
  # -- If true, create a ServiceMonitor resource for integration with Prometheus Operator.
  # Requires a running Prometheus Operator in the cluster.
  enabled: true
  # -- Scrape interval for the ServiceMonitor. How often Prometheus scrapes the exporter metrics.
  interval: 60s
  # -- Scrape timeout for the ServiceMonitor. How long Prometheus waits for a metrics response.
  scrapeTimeout: 30s

# -- Configuration for the exporter Service.
service:
  # -- Service type. ClusterIP is typically sufficient.
  type: ClusterIP
  # -- Port on which the exporter service is exposed.
  port: 9876
  # -- Target port on the exporter pod.
  targetPort: 9876

# -- Optional configuration for a network policy to allow traffic to the iperf3 server DaemonSet.
# This is often necessary if you are using a network policy controller.
networkPolicy:
  # -- If true, create a NetworkPolicy resource.
  enabled: false
  # -- Specify source selectors if needed (e.g., pods in a specific namespace).
  from: []
  # -- Specify namespace selectors if needed.
  namespaceSelector: {}
  # -- Specify pod selectors if needed.
  podSelector: {}
```
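As an example, the sketch below writes a hypothetical `my-values.yaml` that shortens the test cycle, switches to UDP, and tightens the scrape interval, then applies it with `helm upgrade --install`. The keys mirror the defaults above; the specific values are purely illustrative:

```bash
# Create a small override file (values shown are examples, not recommendations).
cat > my-values.yaml <<'EOF'
exporter:
  testInterval: 60      # run a full test cycle every minute
  testProtocol: udp     # measure jitter/packet loss instead of TCP bandwidth
serviceMonitor:
  interval: 30s         # scrape the exporter more frequently
EOF

# Apply the overrides to a new or existing release.
helm upgrade --install iperf3-monitor iperf3-monitor/iperf3-monitor \
  --namespace monitoring \
  --values my-values.yaml
```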
## Grafana Dashboard

A custom Grafana dashboard is provided to visualize the collected `iperf3` metrics.

1. Log in to your Grafana instance.
2. Navigate to `Dashboards` -> `Import`.
3. Paste the full JSON model provided below into the text area and click `Load`.
4. Select your Prometheus data source and click `Import`.

```json
{
  "__inputs": [],
  "__requires": [
    { "type": "grafana", "id": "grafana", "name": "Grafana", "version": "8.0.0" },
    { "type": "datasource", "id": "prometheus", "name": "Prometheus", "version": "1.0.0" }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": { "type": "grafana", "uid": "-- Grafana --" },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "gnetId": null,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "title": "Node-to-Node Bandwidth Heatmap",
      "type": "heatmap",
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "gridPos": { "h": 9, "w": 24, "x": 0, "y": 0 },
      "id": 2,
      "targets": [
        {
          "expr": "avg(iperf_network_bandwidth_mbps) by (source_node, destination_node)",
          "format": "heatmap",
          "legendFormat": "{{source_node}} -> {{destination_node}}",
          "refId": "A"
        }
      ],
      "cards": { "cardPadding": null, "cardRound": null },
      "color": { "mode": "spectrum", "scheme": "red-yellow-green", "exponent": 0.5, "reverse": false },
      "dataFormat": "tsbuckets",
      "yAxis": { "show": true, "format": "short" },
      "xAxis": { "show": true }
    },
    {
      "title": "Bandwidth Over Time (Source: $source_node, Dest: $destination_node)",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 9 },
      "targets": [
        {
          "expr": "iperf_network_bandwidth_mbps{source_node=~\"^$source_node$\", destination_node=~\"^$destination_node$\", protocol=~\"^$protocol$\"}",
          "legendFormat": "Bandwidth",
          "refId": "A"
        }
      ],
      "fieldConfig": { "defaults": { "unit": "mbps" } }
    },
    {
      "title": "Jitter Over Time (Source: $source_node, Dest: $destination_node)",
      "type": "timeseries",
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 9 },
      "targets": [
        {
          "expr": "iperf_network_jitter_ms{source_node=~\"^$source_node$\", destination_node=~\"^$destination_node$\", protocol=\"udp\"}",
          "legendFormat": "Jitter",
          "refId": "A"
        }
      ],
      "fieldConfig": { "defaults": { "unit": "ms" } }
    }
  ],
  "refresh": "30s",
  "schemaVersion": 36,
  "style": "dark",
  "tags": ["iperf3", "network", "kubernetes"],
  "templating": {
    "list": [
      {
        "current": {},
        "datasource": { "type": "prometheus", "uid": "prometheus" },
        "definition": "label_values(iperf_network_bandwidth_mbps, source_node)",
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "source_node",
        "options": [],
        "query": "label_values(iperf_network_bandwidth_mbps, source_node)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "type": "query"
      },
      {
        "current": {},
        "datasource": { "type": "prometheus", "uid": "prometheus" },
        "definition": "label_values(iperf_network_bandwidth_mbps{source_node=~\"^$source_node$\"}, destination_node)",
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "destination_node",
        "options": [],
        "query": "label_values(iperf_network_bandwidth_mbps{source_node=~\"^$source_node$\"}, destination_node)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "type": "query"
      },
      {
        "current": { "selected": true, "text": "tcp", "value": "tcp" },
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "protocol",
        "options": [
          { "selected": true, "text": "tcp", "value": "tcp" },
          { "selected": false, "text": "udp", "value": "udp" }
        ],
        "query": "tcp,udp",
        "skipUrlSync": false,
        "type": "custom"
      }
    ]
  },
  "time": { "from": "now-1h", "to": "now" },
  "timepicker": {},
  "timezone": "browser",
  "title": "Kubernetes iperf3 Network Performance",
  "uid": "k8s-iperf3-dashboard",
  "version": 1,
  "weekStart": ""
}
```
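The dashboard is driven by the exporter's metrics, chiefly `iperf_network_bandwidth_mbps` and `iperf_network_jitter_ms`, which you can also query directly. The sketch below port-forwards the Prometheus API and runs an ad-hoc PromQL query; the `prometheus-operated` Service name is typical for Prometheus Operator installations but may differ in your cluster:

```bash
# Forward the Prometheus API locally (adjust the Service name/namespace to your setup).
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &

# Average node-to-node bandwidth over the last 15 minutes.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time(iperf_network_bandwidth_mbps[15m])'
```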
"multi": false, "name": "protocol", "options": [ { "selected": true, "text": "tcp", "value": "tcp" }, { "selected": false, "text": "udp", "value": "udp" } ], "query": "tcp,udp", "skipUrlSync": false, "type": "custom" } ] }, "time": { "from": "now-1h", "to": "now" }, "timepicker": {}, "timezone": "browser", "title": "Kubernetes iperf3 Network Performance", "uid": "k8s-iperf3-dashboard", "version": 1, "weekStart": "" } ``` ## Repository Structure The project follows a standard structure: ```/dev/null/repo-structure.txt . ├── .github/ │ └── workflows/ │ └── release.yml # GitHub Actions workflow for CI/CD ├── charts/ │ └── iperf3-monitor/ # The Helm chart for the service │ ├── Chart.yaml │ ├── values.yaml │ └── templates/ │ ├── _helpers.tpl │ ├── server-daemonset.yaml │ ├── exporter-deployment.yaml │ ├── rbac.yaml │ ├── service.yaml │ └── servicemonitor.yaml └── exporter/ ├── Dockerfile # Dockerfile for the exporter ├── requirements.txt # Python dependencies └── exporter.py # Exporter source code ├── .gitignore # Specifies intentionally untracked files ├── LICENSE # Project license └── README.md # This file ``` ## Development and CI/CD The project includes a GitHub Actions workflow (`.github/workflows/release.yml`) triggered on Git tags (`v*.*.*`) to automate: 1. Linting the Helm chart. 2. Building and publishing the Docker image for the exporter to GitHub Container Registry (`ghcr.io`). 3. Updating the Helm chart version based on the Git tag. 4. Packaging and publishing the Helm chart to GitHub Pages. ## License This project is licensed under the GNU Affero General Public License v3. See the `LICENSE` file for details.