How We Cut Kubernetes Resource Overhead by 50% Using Only Built-in Tools

Running a production Kubernetes cluster efficiently is one of those challenges that looks straightforward on paper but reveals surprising complexity the moment you start digging into the actual numbers. Recently, we completed a resource optimization project on a production K3s cluster hosting multiple workloads — databases, observability stacks, mail servers, web applications, and more — and the results were striking. Without adding a single node, without switching cloud providers, and without any expensive third-party FinOps tooling, we brought CPU Limits Commitment from 105% down to 77.7% and cut total memory consumption in the observability namespace by more than 70%.

This post walks through the methodology, the specific findings, and the lessons that apply to virtually any Kubernetes environment.

Why Resource Tuning Matters More Than You Think

In Kubernetes, there are two distinct resource concepts that operators often conflate: requests and limits.

Requests are what the scheduler uses to decide where to place a pod. If a node has 4 CPU cores not yet claimed by existing requests, a new pod requesting 2 cores can be scheduled there, regardless of what the pods already on that node are actually consuming in practice.

Limits define the ceiling. For memory, exceeding the limit results in an OOMKill. For CPU, the kernel throttles the container, causing latency spikes without a crash.

This distinction creates two distinct failure modes that we regularly see in production clusters:

  • Over-requesting makes the scheduler think the cluster is full when it isn’t, preventing new workloads from being placed even though physical capacity is available.
  • Over-committing CPU limits pushes the Limits Commitment metric above 100%, which means that if all pods spiked to their limits simultaneously, the cluster could not satisfy them all. While CPU throttling is less catastrophic than an OOMKill, it creates unpredictable latency under load.

Neither situation is dangerous in isolation, but both represent wasted capacity and potential instability that a data-driven approach can systematically address.
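To make the two commitment metrics concrete, here is a minimal sketch of how they can be derived from pod specs. The `PodCpu` type, the pod names, and the 4000m node are hypothetical illustrations, not values from the cluster described here.

```python
from dataclasses import dataclass


@dataclass
class PodCpu:
    """CPU request and limit for one pod, in millicores."""
    name: str
    request_m: int
    limit_m: int


def commitment(pods: list[PodCpu], allocatable_m: int) -> tuple[float, float]:
    """Return (requests commitment, limits commitment) as percentages."""
    total_requests = sum(p.request_m for p in pods)
    total_limits = sum(p.limit_m for p in pods)
    return (100 * total_requests / allocatable_m,
            100 * total_limits / allocatable_m)


# Hypothetical 4-core node (4000m allocatable) running three pods.
pods = [
    PodCpu("web", request_m=500, limit_m=2000),
    PodCpu("db", request_m=1000, limit_m=2000),
    PodCpu("cache", request_m=250, limit_m=500),
]
req_pct, lim_pct = commitment(pods, allocatable_m=4000)
print(f"requests: {req_pct:.2f}%  limits: {lim_pct:.2f}%")
```

In this example the scheduler still sees free capacity (requests below 100%), while the limits add up to more than the node can deliver at once.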

The Diagnostic Starting Point

The first step was establishing a baseline using the cluster’s existing Prometheus and Grafana stack. The initial state looked like this:

Metric                        Initial Value
CPU Utilisation               5.30%
CPU Requests Commitment       47.5%
CPU Limits Commitment         105%
Memory Utilisation            22.6%
Memory Requests Commitment    42.8%
Memory Limits Commitment      66.7%

The CPU Limits Commitment exceeding 100% immediately stands out. Physical CPU utilisation at 5.3% with limits at 105% means the configured limits are roughly 20 times the actual usage. This is a classic symptom of default or copy-paste resource configurations that were never revisited after initial deployment.

The Memory Limits at 66.7% looked more acceptable but had room for improvement, particularly given that utilisation was only 22.6%.
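The "roughly 20 times" figure is simply the ratio of limits commitment to utilisation. As a small illustrative sketch (the function name is ours, the inputs are the baseline figures above):

```python
def limit_to_usage_ratio(limits_commitment_pct: float, utilisation_pct: float) -> float:
    """How many times larger the configured limits are than actual usage."""
    return limits_commitment_pct / utilisation_pct


cpu_ratio = limit_to_usage_ratio(105.0, 5.30)
mem_ratio = limit_to_usage_ratio(66.7, 22.6)
print(f"CPU limits are {cpu_ratio:.1f}x actual usage")
print(f"Memory limits are {mem_ratio:.1f}x actual usage")
```

The same calculation for memory gives a ratio of about 3, which is why CPU limits were the more obvious target.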

Step 1: Finding the Memory Culprit

The investigation started with a namespace-level breakdown. A quick kubectl top pods -A --sort-by=memory revealed that the observability namespace was consuming over 11 GB of memory — the second highest consumer after the database cluster.

The biggest single pod was an 8 GB Memcached instance: the Loki chunks cache.

This was a Helm-managed deployment where the chunksCache block in the values file was commented out. The Grafana Loki Helm chart defaults to allocatedMemory: 8192 for the chunks cache — 8 GB — which Memcached immediately reserves at startup regardless of actual load.

The fix was straightforward: explicitly configure the cache size in the values file rather than relying on the default:

chunksCache:
  enabled: true
  allocatedMemory: 1024  # down from the default 8192 MB
  resources:
    requests:
      memory: 1152Mi
      cpu: 100m
    limits:
      memory: 1536Mi
      cpu: 500m

For a small-to-medium production cluster, 1 GB of chunks cache is more than sufficient. The result: over 7 GB freed from a single configuration line change, with no measurable impact on query performance.

The lesson here applies broadly: always explicitly configure Helm chart cache sizes. Chart defaults are designed for large-scale deployments and are frequently inappropriate for smaller clusters.
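When setting a cache size explicitly, it helps to confirm that the pod's request and limit leave room above the cache allocation. The following is a rough sketch under the assumption that `allocatedMemory` megabytes can be compared directly with `Mi` quantities; the helper names are hypothetical and only `Mi` and `Gi` suffixes are handled.

```python
def mib(quantity: str) -> int:
    """Convert a Kubernetes memory quantity ending in Mi or Gi to MiB."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported quantity: {quantity}")


def check_cache_sizing(allocated_mb: int, request: str, limit: str) -> dict:
    """Report the headroom between the cache size and the pod's request and limit."""
    request_mib, limit_mib = mib(request), mib(limit)
    if not allocated_mb <= request_mib <= limit_mib:
        raise ValueError("expected allocatedMemory <= request <= limit")
    return {
        "request_headroom_mib": request_mib - allocated_mb,
        "limit_headroom_mib": limit_mib - allocated_mb,
    }


report = check_cache_sizing(1024, request="1152Mi", limit="1536Mi")
freed_mb = 8192 - 1024  # chart default minus the new setting
print(report)
print(f"cache allocation reduced by {freed_mb} MB")
```

A check like this can run in CI against the values file so that a later change to the cache size does not silently exceed the pod limit.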

Step 2: Right-Sizing the Database

The largest memory consumer overall was a 3-node MariaDB Galera cluster, consuming approximately 22 GB across all replicas. The per-pod resource configuration was:

resources:
  limits:
    cpu: "1000m"
    memory: "8Gi"
  requests:
    cpu: "500m"
    memory: "6Gi"

The InnoDB buffer pool was configured at 4 GB per node. This is the right place to look first for MySQL/MariaDB memory tuning — the buffer pool is intentionally kept in memory and dominates the process’s memory footprint.

Before making any changes, the actual database state was queried:

-- Actual database size 
SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS 'DB Size (GB)' FROM information_schema.tables;
-- Result: 3.72 GB

-- Buffer pool hit ratio
SHOW STATUS LIKE 'Innodb_buffer_pool_read%';
-- read_requests: 8,445,733,405
-- reads: 129,755
-- Hit ratio: 99.9985%

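The hit ratio of 99.9985% is excellent: almost everything is served from memory. With only 3.72 GB of data in total, however, the 4 GB buffer pool per node is larger than the entire dataset, so part of it can never be used. A 3 GB pool no longer holds every page, but the frequently accessed working set is normally much smaller than the full dataset, so the hit ratio should stay effectively unchanged while freeing meaningful memory. Re-checking the hit ratio after the change confirms whether that assumption holds.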
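For reference, the hit ratio quoted above follows directly from the two status counters. A minimal sketch of the arithmetic (the function name is ours):

```python
def buffer_pool_hit_ratio(read_requests: int, reads: int) -> float:
    """Percentage of logical reads served from the buffer pool.

    read_requests: Innodb_buffer_pool_read_requests (logical reads)
    reads:         Innodb_buffer_pool_reads (reads that had to go to disk)
    """
    return 100 * (1 - reads / read_requests)


ratio = buffer_pool_hit_ratio(read_requests=8_445_733_405, reads=129_755)
print(f"hit ratio: {ratio:.4f}%")
```

Both counters are cumulative since server start, so comparing two snapshots taken some time apart gives a better picture of current behaviour than a single reading.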

innodb_buffer_pool_size=3G # down from 4G

resources:
  limits:
    memory: "5Gi"  # down from 8Gi
  requests:
    memory: "4Gi"  # down from 6Gi

For the CPU side, the numbers told a different story:

mariadb-galera-0   30m CPU
mariadb-galera-1   47m CPU
mariadb-galera-2   39m CPU

With 500m requests and 1000m limits against 30–47m actual usage, the requests were clearly too high. However, the correct response was not to reduce the limit — it was to reduce the request while keeping or raising the limit:

resources:
  requests:
    cpu: "100m"   # down from 500m — reflects actual usage
  limits:
    cpu: "2000m"  # raised — protects against Galera SST bursts

Galera’s State Snapshot Transfer (SST) process during node resync can briefly spike CPU usage significantly. A low limit here would cause throttling during cluster recovery scenarios — exactly when you need maximum performance.

This pattern — low requests, generous limits — is the right approach for stateful workloads with variable CPU profiles.
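One way to turn observed usage into a request value is to take the peak sample, apply a headroom factor, and round up. This is only a sketch: the 2x headroom and the 50m rounding step are assumptions chosen for illustration, not a rule stated by the measurements above.

```python
import math


def suggest_cpu_request(usage_samples_m: list[int],
                        headroom: float = 2.0,
                        step_m: int = 50) -> int:
    """Suggest a CPU request in millicores from observed usage.

    Takes the highest observed sample, multiplies it by a headroom factor,
    and rounds up to the next multiple of step_m.
    """
    peak = max(usage_samples_m)
    return math.ceil(peak * headroom / step_m) * step_m


# Observed usage of the three Galera pods, in millicores.
request_m = suggest_cpu_request([30, 47, 39])
print(f"suggested request: {request_m}m")
```

The limit is deliberately not derived from steady-state usage here, since it has to cover recovery bursts rather than the normal load.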

Step 3: Identifying Over-Requested Workloads Cluster-Wide

With the major memory issues resolved, attention shifted to the CPU Limits Commitment problem. The approach was to enumerate all pod CPU requests and limits across the cluster:

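# List CPU requests and limits for every container in every pod
kubectl get pods -A -o json | jq -r '
  .items[]
  | (.metadata.namespace + "/" + .metadata.name) as $pod
  | .spec.containers[]
  | $pod + "/" + .name
    + " → req: " + (.resources.requests.cpu // "none")
    + ", lim: " + (.resources.limits.cpu // "none")
' | sort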

Cross-referencing this output against kubectl top pods -A --sort-by=cpu revealed several patterns:

Pattern 1: Idle application servers with high static requests

Multiple web application pods had 500m CPU requests with actual usage under 15m — a 33x over-request. Reducing these to 10–50m requests while maintaining reasonable limits freed significant scheduler capacity without any performance impact.

Pattern 2: DaemonSet components with inflated limits

Five node-exporter pods (one per node) each had 1000m CPU limits. Node-exporter is a lightweight metrics collector that typically uses 5–15m CPU. Each pod’s limit was reduced to 100m, saving 4500m in total limit allocation.

Pattern 3: Sidecar containers with default configurations

Several multi-container pods had nginx sidecar containers with 1000m CPU requests — essentially the same as the main application container. Sidecar containers serving as reverse proxies rarely consume more than 10–20m CPU in practice. Correcting these configurations released over 2000m in scheduled requests.

Pattern 4: Caching layers with unnecessary replicas

Redis deployments used for application-level caching had multiple replicas configured. For a cache that exists purely to accelerate read performance — not for persistence or high availability — replicas provide no benefit. Eliminating unnecessary Redis replicas reduced both CPU and memory footprint.
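The cross-referencing behind these patterns can be sketched as a comparison of declared requests against observed usage. The pod names below are hypothetical and the 10x threshold is an arbitrary assumption; the figures mirror the patterns described above.

```python
def over_requested(requests_m: dict[str, int],
                   usage_m: dict[str, int],
                   min_factor: float = 10.0) -> list[tuple[str, float]]:
    """Return (pod, request/usage factor) pairs at or above min_factor, largest first."""
    flagged = []
    for pod, request in requests_m.items():
        usage = usage_m.get(pod)
        if not usage:  # skip pods with no usage sample or zero usage
            continue
        factor = request / usage
        if factor >= min_factor:
            flagged.append((pod, round(factor, 1)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical pod names, values in millicores.
requests_m = {"web/app-1": 500, "web/app-1-nginx": 1000, "db/mariadb-galera-0": 100}
usage_m = {"web/app-1": 15, "web/app-1-nginx": 20, "db/mariadb-galera-0": 30}

for pod, factor in over_requested(requests_m, usage_m):
    print(f"{pod}: request is {factor}x actual usage")

# Limit reduction from Pattern 2: five node-exporter pods, 1000m -> 100m each.
node_exporter_saving_m = 5 * (1000 - 100)
print(f"node-exporter limits reduced by {node_exporter_saving_m}m in total")
```

Sorting by factor puts the worst offenders first, which is usually where the quickest wins are.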

Step 4: Application-Level Tuning

One interesting issue emerged from the PHP-FPM configuration of a Nextcloud deployment. The startup logs showed:

WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 48 idle, and 57 total children

The PHP-FPM pool configuration was the culprit:

pm = dynamic 
pm.start_servers = 50
pm.min_spare_servers = 50
pm.max_spare_servers = 100
pm.max_children = 1000

Starting 50 workers on pod startup, each consuming 30–40 MB of memory, means approximately 1.5–2 GB allocated just to idle PHP workers before any request is served. The max_children = 1000 value is particularly dangerous — if PHP-FPM ever tried to honour this, it would require 30–40 GB of memory, far exceeding the pod’s memory limit.
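The memory estimates above are a straightforward multiplication of worker count by per-worker footprint. A small sketch, assuming the 30–40 MB per worker quoted above and decimal gigabytes:

```python
def fpm_memory_range_gb(workers: int,
                        per_worker_mb: tuple[int, int] = (30, 40)) -> tuple[float, float]:
    """Estimated (low, high) memory in GB for a given number of PHP-FPM workers."""
    low_mb, high_mb = per_worker_mb
    return workers * low_mb / 1000, workers * high_mb / 1000


print("start_servers = 50:  ", fpm_memory_range_gb(50))
print("max_children = 1000: ", fpm_memory_range_gb(1000))
```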

The corrected configuration:

pm = dynamic 
pm.max_children = 50
pm.start_servers = 5
pm.min_spare_servers = 3
pm.max_spare_servers = 10
pm.max_requests = 500

This reflects realistic concurrency for a single-tenant Nextcloud instance while keeping memory footprint manageable.
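A quick consistency check can catch pool settings that contradict each other or exceed the memory available. The sketch below encodes the ordering constraints PHP-FPM is understood to enforce for pm = dynamic, and derives a worker ceiling from a memory budget; the 2048 MB budget and the 40 MB per-worker figure are illustrative assumptions.

```python
def validate_dynamic_pool(max_children: int, start_servers: int,
                          min_spare: int, max_spare: int) -> list[str]:
    """Return a list of problems with a pm = dynamic pool; empty means consistent."""
    problems = []
    if not min_spare <= max_spare:
        problems.append("min_spare_servers must not exceed max_spare_servers")
    if not min_spare <= start_servers <= max_spare:
        problems.append("start_servers must lie between min and max spare servers")
    if max_spare > max_children:
        problems.append("max_spare_servers must not exceed max_children")
    return problems


def max_children_for_budget(budget_mb: int, per_worker_mb: int = 40) -> int:
    """Largest worker count that fits in a memory budget at the given per-worker size."""
    return budget_mb // per_worker_mb


# The corrected configuration from above.
print(validate_dynamic_pool(max_children=50, start_servers=5, min_spare=3, max_spare=10))
# Hypothetical 2048 MB budget for PHP workers.
print(max_children_for_budget(2048))
```

Note that the original configuration would also pass the ordering check: it was internally consistent, just far too large for the pod, which is why the memory budget check matters as well.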

The Results

After completing all optimisation passes, the cluster metrics reached the following state:

Metric                        Before    After    Change
CPU Utilisation               5.30%     3.93%    -1.37pp
CPU Requests Commitment       47.5%     30.3%    -17.2pp
CPU Limits Commitment         105%      77.7%    -27.3pp
Memory Utilisation            22.6%     20.4%    -2.2pp
Memory Requests Commitment    42.8%     38.0%    -4.8pp
Memory Limits Commitment      66.7%     58.2%    -8.5pp

The CPU Limits Commitment moving from 105% to 77.7% is the headline improvement — the cluster went from a theoretically oversubscribed state to one with healthy headroom for growth. Memory Limits Commitment at 58.2% provides comfortable protection against OOMKills cluster-wide.
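The Change column is expressed in percentage points (after minus before), which is different from the relative reduction. A short sketch that recomputes both from the before and after values:

```python
# (before, after) percentages from the results table.
metrics = {
    "CPU Utilisation": (5.30, 3.93),
    "CPU Requests Commitment": (47.5, 30.3),
    "CPU Limits Commitment": (105.0, 77.7),
    "Memory Utilisation": (22.6, 20.4),
    "Memory Requests Commitment": (42.8, 38.0),
    "Memory Limits Commitment": (66.7, 58.2),
}


def change_pp(before: float, after: float) -> float:
    """Absolute change in percentage points."""
    return round(after - before, 2)


def change_relative(before: float, after: float) -> float:
    """Change relative to the starting value, in percent."""
    return round(100 * (after - before) / before, 1)


for name, (before, after) in metrics.items():
    print(f"{name}: {change_pp(before, after)}pp "
          f"({change_relative(before, after)}% relative)")
```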

Critically, no workloads were degraded. All applications continued operating normally throughout the optimisation process, which was performed via rolling Helm upgrades and incremental deployment updates.

Key Principles to Take Away

This exercise reinforced several principles that apply universally to Kubernetes resource management:

1. Measure before you tune. Every change in this project was preceded by actual usage data from kubectl top or Prometheus metrics. Guessing at appropriate values — even educated guesses — leads to either waste or instability.

2. Requests and limits serve different purposes. Requests govern scheduling; limits govern runtime behaviour. Optimising requests improves scheduler efficiency and cluster capacity. Optimising limits reduces the risk of noisy neighbours and theoretical oversubscription.

3. Helm chart defaults are not production defaults. Chart maintainers configure defaults for general use, often erring toward over-provisioning. Always review cache sizes, replica counts, and resource blocks in any Helm chart before deploying to production.

4. CPU and memory behave differently under pressure. Memory over-limit causes OOMKill — abrupt and often disruptive. CPU over-limit causes throttling — subtle and manifesting as latency. This asymmetry means memory limits deserve more conservative headroom than CPU limits.

5. Stateful workloads need asymmetric CPU profiles. Databases and clustered systems have highly variable CPU usage — low during steady state, high during recovery, replication catch-up, or bulk operations. Low requests with generous limits is the correct pattern here.

6. Keep monitoring after changes. The InnoDB buffer pool hit ratio, PHP-FPM worker counts, and cache hit rates should be checked periodically after tuning. Workload growth can shift the optimal configuration over time.

What’s Next

With the cluster in a healthy state, the natural next step is establishing alerting thresholds:

  • Alert when CPU Limits Commitment exceeds 85% — signals that a node addition should be planned
  • Alert when Memory Limits Commitment exceeds 75% — provides early warning before OOMKill risk becomes real
  • Alert when any namespace’s actual memory usage exceeds 80% of its request — catches workloads that are growing beyond their declared profile

These signals allow for proactive capacity planning rather than reactive firefighting — which is ultimately the goal of any FinOps practice in a Kubernetes environment.
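In practice these thresholds would normally live in Prometheus alerting rules. Purely to illustrate the logic, here is a sketch of the three checks as a plain function; the namespace figures are hypothetical, and the cluster-wide inputs are the post-optimisation values reported above.

```python
def capacity_alerts(cpu_limits_pct: float,
                    mem_limits_pct: float,
                    usage_vs_request_pct: dict[str, float]) -> list[str]:
    """Return alert messages for the three thresholds listed above."""
    alerts = []
    if cpu_limits_pct > 85:
        alerts.append(f"CPU Limits Commitment at {cpu_limits_pct}%: plan a node addition")
    if mem_limits_pct > 75:
        alerts.append(f"Memory Limits Commitment at {mem_limits_pct}%: OOMKill risk rising")
    for namespace, pct in sorted(usage_vs_request_pct.items()):
        if pct > 80:
            alerts.append(f"{namespace}: memory usage at {pct}% of request")
    return alerts


# Namespace percentages below are hypothetical.
alerts = capacity_alerts(77.7, 58.2, {"observability": 62.0, "web": 85.0})
print(alerts)
```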

If your organisation is running Kubernetes workloads and wants to systematically reduce cloud costs and improve cluster efficiency, this kind of structured resource audit is an excellent starting point. The tooling required — Prometheus, Grafana, and kubectl — is already present in most production clusters. The value is in knowing what to look for and how to interpret what you find.

For professional guidance on Kubernetes optimization and Fractional DevOps & Cloud FinOps, feel free to reach out.
