# Kubernetes-Native Network Performance Monitoring Service

This project provides a comprehensive solution for continuous network validation within a Kubernetes cluster. Leveraging industry-standard tools like `iperf3`, `Prometheus`, and `Grafana`, it offers proactive monitoring of network performance between nodes, helping to identify and troubleshoot latency, bandwidth, and packet loss issues before they impact applications.

## Features

* **Continuous N-to-N Testing:** Automatically measures network performance between all nodes in the cluster.
* **Kubernetes-Native:** Deploys as standard Kubernetes workloads (DaemonSet, Deployment).
* **Dynamic Discovery:** The exporter automatically discovers iperf3 server pods using the Kubernetes API.
* **Prometheus Integration:** Translates iperf3 results into standard Prometheus metrics for time-series storage.
* **Grafana Visualization:** Provides a rich, interactive dashboard with heatmaps and time-series graphs.
* **Helm Packaging:** Packaged as a Helm chart for easy deployment and configuration management.
* **Automated CI/CD:** Includes a GitHub Actions workflow for building and publishing the exporter image and Helm chart.

## Architecture

The service is based on a decoupled architecture:

1. **iperf3-server DaemonSet:** Deploys an `iperf3` server pod on every node to act as a test endpoint. The pods run on the host network so that raw node-to-node performance is measured.
2. **iperf3-exporter Deployment:** A centralized service that uses the Kubernetes API to discover server pods, orchestrates `iperf3` client tests against them, parses the JSON output, and exposes performance metrics via an HTTP endpoint.
3. **Prometheus & Grafana Stack:** A standard monitoring backend (such as `kube-prometheus-stack`) that scrapes the exporter's metrics and visualizes them in a custom dashboard.

This separation of concerns ensures scalability and resilience, and aligns with Kubernetes operational principles.
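
To make the data flow concrete, the measurement the exporter automates can also be reproduced by hand. The sketch below is illustrative only: it assumes the chart is installed in the `monitoring` namespace, uses the chart's default `component=server` label, and substitutes a placeholder pod IP; the exporter itself performs the discovery step through the Kubernetes API rather than `kubectl`.

```/dev/null/manual-test.sh
# 1. Discover the iperf3 server pods (one per node) and their pod IPs.
kubectl get pods -n monitoring -l app.kubernetes.io/component=server -o wide

# 2. Run a single client test against one server pod, emitting JSON.
#    This JSON is what the exporter parses into Prometheus metrics.
iperf3 -c <server-pod-ip> -p 5201 -t 10 -J
```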

## Getting Started

### Prerequisites

* A running Kubernetes cluster.
* `kubectl` configured to connect to your cluster.
* Helm v3+ installed.
* A Prometheus instance configured to scrape services (ideally using the Prometheus Operator and ServiceMonitors; see the quick check after this list).
* A Grafana instance accessible and configured with Prometheus as a data source.
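
The Prometheus Operator prerequisite can be sanity-checked before installing. This is a minimal sketch that assumes the operator's standard CRD name:

```/dev/null/check-prereqs.sh
# The ServiceMonitor CRD must exist for the chart's ServiceMonitor to be usable.
kubectl get crd servicemonitors.monitoring.coreos.com

# Confirm Helm v3+ and cluster connectivity.
helm version --short
kubectl cluster-info
```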
### Installation with Helm

1. Add the Helm chart repository (replace with your actual repo URL once published):

```/dev/null/helm-install.sh#L1-1
helm repo add iperf3-monitor https://malarinv.github.io/iperf3-monitor/
```

2. Update your Helm repositories:

```/dev/null/helm-install.sh#L3-3
helm repo update
```

3. Install the chart into the `monitoring` namespace (or your preferred namespace):

```/dev/null/helm-install.sh#L5-8
helm install iperf3-monitor iperf3-monitor/iperf3-monitor \
  --namespace monitoring \
  --create-namespace \
  --values values.yaml # Optional: Use a custom values file
```

> **Note:** Ensure your Prometheus instance is configured to scrape services in the namespace where you install the chart and that it recognizes `ServiceMonitor` resources with the label `release: prometheus-operator` (if using the standard `kube-prometheus-stack` setup).
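
After installation, a quick way to confirm the exporter is up and producing metrics is to port-forward to its Service and inspect the metrics endpoint. This is a minimal sketch; it assumes the release (and therefore the Service) is named `iperf3-monitor`, that it was installed in the `monitoring` namespace, and that the exporter serves metrics on the conventional `/metrics` path on port 9876.

```/dev/null/verify-install.sh
# The server DaemonSet pods (one per node) and the exporter pod should be Running.
kubectl get pods -n monitoring

# Forward the exporter Service locally and look for iperf metrics.
kubectl port-forward -n monitoring svc/iperf3-monitor 9876:9876 &
sleep 2
curl -s http://localhost:9876/metrics | grep '^iperf_'
# Expect series such as:
# iperf_network_bandwidth_mbps{source_node="node-a",destination_node="node-b",protocol="tcp"} ...
```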
### Configuration
The Helm chart is highly configurable via the `values.yaml` file. You can override default settings by creating your own `values.yaml` and passing it during installation (`--values my-values.yaml`).
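
For example, a small override file can change the scrape interval and restrict where the server DaemonSet runs. The keys below mirror the defaults shown further down and are only illustrative:

```/dev/null/custom-values.sh
# Write a minimal override file (keys mirror the default values.yaml).
cat > my-values.yaml <<'EOF'
serviceMonitor:
  interval: 30s
server:
  nodeSelector:
    kubernetes.io/os: linux
EOF

# Apply it to a new or existing release.
helm upgrade --install iperf3-monitor iperf3-monitor/iperf3-monitor \
  --namespace monitoring --values my-values.yaml
```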

Refer to the comments in the default `values.yaml` for a detailed explanation of each parameter:

```iperf3-monitor/charts/iperf3-monitor/values.yaml
# Default values for iperf3-monitor.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# -- Override the name of the chart.
nameOverride: ""

# -- Override the fully qualified app name.
fullnameOverride: ""

# Exporter Configuration (`controllers.exporter`)
# The iperf3 exporter is managed under the `controllers.exporter` section,
# leveraging the `bjw-s/common-library` for robust workload management.
controllers:
  exporter:
    # -- Enable the exporter controller.
    enabled: true
    # -- Set the controller type for the exporter.
    # Valid options are "deployment" or "daemonset".
    # Use "daemonset" for N-to-N node monitoring where an exporter runs on each node (or selected nodes).
    # Use "deployment" for a centralized exporter (typically with replicaCount: 1).
    # @default -- "deployment"
    type: deployment
    # -- Number of desired exporter pods. Only used if type is "deployment".
    # @default -- 1
    replicas: 1

    # -- Application-specific configuration for the iperf3 exporter.
    # These values are used to populate environment variables for the exporter container.
    appConfig:
      # -- Interval in seconds between complete test cycles (i.e., testing all server nodes).
      testInterval: 300
      # -- Log level for the iperf3 exporter (e.g., DEBUG, INFO, WARNING, ERROR, CRITICAL).
      logLevel: INFO
      # -- Timeout in seconds for a single iperf3 test run.
      testTimeout: 10
      # -- Protocol to use for testing (tcp or udp).
      testProtocol: tcp
      # -- iperf3 server port to connect to. Should match the server's listening port.
      serverPort: "5201"
      # -- Label selector to find iperf3 server pods.
      # This is templated. Default: 'app.kubernetes.io/name=<chart-name>,app.kubernetes.io/instance=<release-name>,app.kubernetes.io/component=server'
      serverLabelSelector: 'app.kubernetes.io/name={{ include "iperf3-monitor.name" . }},app.kubernetes.io/instance={{ .Release.Name }},app.kubernetes.io/component=server'

    # -- Pod-level configurations for the exporter.
    pod:
      # -- Annotations for the exporter pod.
      annotations: {}
      # -- Labels for the exporter pod (the common library adds its own defaults too).
      labels: {}
      # -- Node selector for scheduling exporter pods. Useful for DaemonSet or specific scheduling with Deployments.
      # Example:
      # nodeSelector:
      #   kubernetes.io/os: linux
      nodeSelector: {}
      # -- Tolerations for scheduling exporter pods.
      # Example:
      # tolerations:
      #   - key: "node-role.kubernetes.io/control-plane"
      #     operator: "Exists"
      #     effect: "NoSchedule"
      tolerations: []
      # -- Affinity rules for scheduling exporter pods.
      # Example:
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #         - matchExpressions:
      #             - key: "kubernetes.io/arch"
      #               operator: In
      #               values:
      #                 - amd64
      affinity: {}
      # -- Security context for the exporter pod.
      # securityContext:
      #   fsGroup: 65534
      #   runAsUser: 65534
      #   runAsGroup: 65534
      #   runAsNonRoot: true
      securityContext: {}
      # -- Automount service account token for the pod.
      automountServiceAccountToken: true

    # -- Container-level configurations for the main exporter container.
    containers:
      exporter: # Name of the primary container
        image:
          repository: ghcr.io/malarinv/iperf3-monitor
          tag: "" # Defaults to .Chart.AppVersion
          pullPolicy: IfNotPresent
        # -- Custom environment variables for the exporter container.
        # These are merged with the ones generated from appConfig.
        # env:
        #   MY_CUSTOM_VAR: "my_value"
        env: {}
        # -- Ports for the exporter container.
        ports:
          metrics: # Name of the port
            port: 9876 # Container port for metrics
            protocol: TCP
            enabled: true
        # -- CPU and memory resource requests and limits.
        # resources:
        #   requests:
        #     cpu: "100m"
        #     memory: "128Mi"
        #   limits:
        #     cpu: "500m"
        #     memory: "256Mi"
        resources: {}
        # -- Probes configuration for the exporter container.
        # probes:
        #   liveness:
        #     enabled: true # Example: enable liveness probe
        #     spec: # Customize probe spec if needed
        #       initialDelaySeconds: 30
        #       periodSeconds: 15
        #       timeoutSeconds: 5
        #       failureThreshold: 3
        probes:
          liveness:
            enabled: false
          readiness:
            enabled: false
          startup:
            enabled: false

server:
  # -- Configuration for the iperf3 server container image (DaemonSet).
  image:
    # -- The container image repository for the iperf3 server.
    repository: networkstatic/iperf3
    # -- The container image tag for the iperf3 server.
    tag: latest

  # -- CPU and memory resource requests and limits for the iperf3 server pods (DaemonSet).
  # These should be very low as the server is mostly idle.
  # @default -- A small default is provided if commented out.
  resources: {}
  # requests:
  #   cpu: "50m"
  #   memory: "64Mi"
  # limits:
  #   cpu: "100m"
  #   memory: "128Mi"

  # -- Node selector for scheduling iperf3 server pods.
  # Use this to restrict the DaemonSet to a subset of nodes.
  # @default -- {} (schedule on all nodes)
  nodeSelector: {}

  # -- Tolerations for scheduling iperf3 server pods on tainted nodes (e.g., control-plane nodes).
  # This is often necessary to include master nodes in the test mesh.
  # @default -- Tolerates control-plane and master taints.
  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"

rbac:
  # -- If true, create ServiceAccount, ClusterRole, and ClusterRoleBinding for the exporter.
  # Set to false if you manage RBAC externally.
  create: true

serviceAccount:
  # -- The name of the ServiceAccount to use for the exporter pod.
  # Only used if rbac.create is false. If not set, it defaults to the chart's fullname.
  name: ""

serviceMonitor:
  # -- If true, create a ServiceMonitor resource for integration with Prometheus Operator.
  # Requires a running Prometheus Operator in the cluster.
  enabled: true

  # -- Scrape interval for the ServiceMonitor. How often Prometheus scrapes the exporter metrics.
  interval: 60s

  # -- Scrape timeout for the ServiceMonitor. How long Prometheus waits for metrics response.
  scrapeTimeout: 30s

# -- Configuration for the exporter Service.
service:
  # -- Service type. ClusterIP is typically sufficient.
  type: ClusterIP
  # -- Port on which the exporter service is exposed.
  port: 9876
  # -- Target port on the exporter pod.
  targetPort: 9876

# -- Optional configuration for a network policy to allow traffic to the iperf3 server DaemonSet.
# This is often necessary if you are using a network policy controller.
networkPolicy:
  # -- If true, create a NetworkPolicy resource.
  enabled: false
  # -- Specify source selectors if needed (e.g., pods in a specific namespace).
  from: []
  # -- Specify namespace selectors if needed.
  namespaceSelector: {}
  # -- Specify pod selectors if needed.
  podSelector: {}
```

## Grafana Dashboard

A custom Grafana dashboard is provided to visualize the collected `iperf3` metrics.

1. Log in to your Grafana instance.
2. Navigate to `Dashboards` -> `Import`.
3. Paste the full JSON model provided below into the text area and click `Load`.
4. Select your Prometheus data source and click `Import`.

```/dev/null/grafana-dashboard.json
{
  "__inputs": [],
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "8.0.0"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "gnetId": null,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "title": "Node-to-Node Bandwidth (avg Mbps)",
      "type": "heatmap",
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 2,
      "targets": [
        {
          "expr": "avg(iperf_network_bandwidth_mbps) by (source_node, destination_node)",
          "format": "heatmap",
          "legendFormat": "{{source_node}} -> {{destination_node}}",
          "refId": "A"
        }
      ],
      "cards": { "cardPadding": null, "cardRound": null },
      "color": {
        "mode": "spectrum",
        "scheme": "red-yellow-green",
        "exponent": 0.5,
        "reverse": false
      },
      "dataFormat": "tsbuckets",
      "yAxis": { "show": true, "format": "short" },
      "xAxis": { "show": true }
    },
    {
      "title": "Bandwidth Over Time (Source: $source_node, Dest: $destination_node)",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 9
      },
      "targets": [
        {
          "expr": "iperf_network_bandwidth_mbps{source_node=~\"^$source_node$\", destination_node=~\"^$destination_node$\", protocol=~\"^$protocol$\"}",
          "legendFormat": "Bandwidth",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "mbps"
        }
      }
    },
    {
      "title": "Jitter Over Time (Source: $source_node, Dest: $destination_node)",
      "type": "timeseries",
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 9
      },
      "targets": [
        {
          "expr": "iperf_network_jitter_ms{source_node=~\"^$source_node$\", destination_node=~\"^$destination_node$\", protocol=\"udp\"}",
          "legendFormat": "Jitter",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms"
        }
      }
    }
  ],
  "refresh": "30s",
  "schemaVersion": 36,
  "style": "dark",
  "tags": ["iperf3", "network", "kubernetes"],
  "templating": {
    "list": [
      {
        "current": {},
        "datasource": {
          "type": "prometheus",
          "uid": "prometheus"
        },
        "definition": "label_values(iperf_network_bandwidth_mbps, source_node)",
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "source_node",
        "options": [],
        "query": "label_values(iperf_network_bandwidth_mbps, source_node)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "type": "query"
      },
      {
        "current": {},
        "datasource": {
          "type": "prometheus",
          "uid": "prometheus"
        },
        "definition": "label_values(iperf_network_bandwidth_mbps{source_node=~\"^$source_node$\"}, destination_node)",
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "destination_node",
        "options": [],
        "query": "label_values(iperf_network_bandwidth_mbps{source_node=~\"^$source_node$\"}, destination_node)",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "type": "query"
      },
      {
        "current": { "selected": true, "text": "tcp", "value": "tcp" },
        "hide": 0,
        "includeAll": false,
        "multi": false,
        "name": "protocol",
        "options": [
          { "selected": true, "text": "tcp", "value": "tcp" },
          { "selected": false, "text": "udp", "value": "udp" }
        ],
        "query": "tcp,udp",
        "skipUrlSync": false,
        "type": "custom"
      }
    ]
  },
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "browser",
  "title": "Kubernetes iperf3 Network Performance",
  "uid": "k8s-iperf3-dashboard",
  "version": 1,
  "weekStart": ""
}
```
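
Alternatively, if your Grafana runs with the dashboard sidecar enabled (the default in `kube-prometheus-stack`), the dashboard can be provisioned declaratively instead of imported by hand. This is a sketch under that assumption: it expects the JSON model above to be saved as `iperf3-dashboard.json` and a sidecar watching ConfigMaps labelled `grafana_dashboard`.

```/dev/null/provision-dashboard.sh
# Save the JSON model above as iperf3-dashboard.json, then create a labelled
# ConfigMap that the Grafana dashboard sidecar picks up automatically.
kubectl create configmap iperf3-dashboard \
  --namespace monitoring \
  --from-file=iperf3-dashboard.json
kubectl label configmap iperf3-dashboard \
  --namespace monitoring grafana_dashboard="1"
```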

## Repository Structure

The project follows a standard structure:

```/dev/null/repo-structure.txt
.
├── .github/
│   └── workflows/
│       └── release.yml            # GitHub Actions workflow for CI/CD
├── charts/
│   └── iperf3-monitor/            # The Helm chart for the service
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
│           ├── _helpers.tpl
│           ├── server-daemonset.yaml
│           ├── exporter-deployment.yaml
│           ├── rbac.yaml
│           ├── service.yaml
│           └── servicemonitor.yaml
├── exporter/
│   ├── Dockerfile                 # Dockerfile for the exporter
│   ├── requirements.txt           # Python dependencies
│   └── exporter.py                # Exporter source code
├── .gitignore                     # Specifies intentionally untracked files
├── LICENSE                        # Project license
└── README.md                      # This file
```

## Development and CI/CD

The project includes a GitHub Actions workflow (`.github/workflows/release.yml`) triggered on Git tags (`v*.*.*`) to automate:

1. Linting the Helm chart.
2. Building and publishing the Docker image for the exporter to GitHub Container Registry (`ghcr.io`).
3. Updating the Helm chart version based on the Git tag.
4. Packaging and publishing the Helm chart to GitHub Pages.
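
The same steps can be exercised locally before tagging a release. The commands below are a minimal sketch using the paths from the repository structure above; the `dev` image tag is only an example:

```/dev/null/local-dev.sh
# Lint and render the chart locally.
helm lint charts/iperf3-monitor
helm template iperf3-monitor charts/iperf3-monitor > /tmp/rendered.yaml

# Build the exporter image from the exporter/ directory.
docker build -t ghcr.io/malarinv/iperf3-monitor:dev exporter/
```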
## License
This project is licensed under the GNU Affero General Public License v3. See the `LICENSE` file for details.