A primary key lookup that takes 1.3ms in psql was taking 6 seconds through the app. I spent a day deploying an entire observability stack – OpenTelemetry, Grafana Tempo, Loki, metrics-server – before a throwaway observation in a terminal window cracked the case.
The Setup
Five-node bare-metal Kubernetes cluster at home. Three mini-PCs, two HP servers. All connected over wifi. A Next.js app with Prisma 7 talking to PostgreSQL 16. Everything running, everything functional, everything painfully slow.
Submitting an answer in the practice flow took 3-7 seconds. Navigating between questions felt like dial-up. The app was unusable for its intended purpose.
The Wrong Suspects
I started where any reasonable person would: the database.
psql> \timing
psql> SELECT * FROM "Attempt" WHERE id = 'cmnvu1obo000b01gsuzvaivso';
Time: 1.306 ms
1.3ms. The database was fine. I tested from inside the app pod:
| Layer | Latency |
|---|---|
| Direct psql | 1.3ms |
| Raw pg.Pool (warm) | 11ms |
| Raw pg.Pool (cold) | 260ms |
| Through Prisma adapter | 5,923ms |
A 5000x slowdown between psql and the app. My first instinct was to blame Prisma’s driver adapter, connection pooling, or SSL negotiation. I tuned the pool, disabled SSL, pre-warmed connections. Nothing helped.
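For reference, the "direct psql" number in the table came from timing a single statement from inside the cluster. A rough way to reproduce that kind of in-cluster measurement is a throwaway client pod – the connection URL below is a placeholder for whatever your setup uses:

```
# Throwaway client pod on the pod network; the connection URL is a placeholder.
kubectl run pgtest --rm -it --image=postgres:16 --restart=Never -- \
  psql "postgresql://app:secret@cps-postgres:5432/app" \
  -c '\timing' -c 'SELECT * FROM "Attempt" LIMIT 1;'
```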
Deploying Observability Into the Void
The cluster had zero monitoring. No metrics, no tracing, no structured logging. I was debugging blind. So I stopped chasing the bug and built the instruments.
- OpenTelemetry in the Next.js app via --require ./otel-setup.cjs (outside the bundler – Turbopack and OTel don't mix)
- Grafana Tempo for trace storage
- Grafana Loki + Promtail for log aggregation
- Postgres log_min_duration_statement=0 to log every query with server-side timing (see the snippet below)
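The Postgres knob is the easy one to replicate – it doesn't even need a restart. Something like this works, assuming the deployment is the cps-postgres one referenced later and that you can connect as the postgres superuser:

```
# Log every statement with its server-side duration, then reload (no restart needed).
kubectl exec deploy/cps-postgres -- psql -U postgres -c \
  "ALTER SYSTEM SET log_min_duration_statement = 0;"
kubectl exec deploy/cps-postgres -- psql -U postgres -c \
  "SELECT pg_reload_conf();"
```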
The traces told a clear story:
POST /practice/[attemptId] 6,597ms
├─ prisma:client:db_query SELECT AttemptQuestion 309ms
├─ prisma:client:db_query SELECT Question 11ms
├─ prisma:client:db_query SELECT Attempt 4,198ms ← ???
├─ prisma:client:db_query UPDATE AttemptQuestion 1,109ms
└─ prisma:client:db_query SELECT (next) 85ms
That third query – a primary key lookup on a table with one row – took 4.2 seconds through the app. Meanwhile, postgres logged:
duration: 0.24 ms statement: SELECT "public"."Attempt"."id" ...
0.24ms server-side. The entire 4.2 seconds was network transit. A ping test confirmed it:
5 packets transmitted, 3 packets received, 40% packet loss
round-trip min/avg/max = 102.398/197.925/308.581 ms
40% packet loss between pods on different nodes. On a local network. Over wifi.
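The ping test itself is nothing fancy – an exec into the app pod, pinging the Postgres pod IP. The label selectors below are placeholders, and it assumes ping exists in the app image; a debug container works if it doesn't.

```
# Ping the Postgres pod from the app pod; label selectors are placeholders.
APP_POD=$(kubectl get pod -l app=cps-app -o jsonpath='{.items[0].metadata.name}')
DB_IP=$(kubectl get pod -l app=cps-postgres -o jsonpath='{.items[0].status.podIP}')
kubectl exec "$APP_POD" -- ping -c 5 "$DB_IP"
```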
The Clue That Broke It Open
I was about to start investigating MTU issues, Flannel VXLAN overhead, and wifi channel congestion when I noticed something strange.
While I had kubectl logs -f deploy/cps-postgres running in a side terminal to watch the query logs, the app was fast. Every page loaded in under a second. I closed the terminal. Slow again. Opened it. Fast again.
That’s not a database problem. That’s not a network topology problem. That’s a wifi radio problem.
Wifi Power Save: The Silent Killer
Linux enables wifi power-save by default on every wifi interface. It’s a sensible default for laptops – save battery by letting the radio sleep when idle. It’s a catastrophic default for servers.
Here’s what happens:
- The wifi radio goes idle for a few hundred milliseconds (no packets)
- Linux puts the radio into power-save mode
- The radio sleeps, waking only at the AP’s beacon interval (~100ms) to check for buffered frames
- A new packet arrives (your TCP SYN to postgres)
- The packet waits for the next beacon, the radio wakes, re-negotiates with the AP
- If the SYN or SYN-ACK is lost during this messy wake sequence, TCP retransmits – after 1 second. Then 2 seconds. Then 4 seconds.
This explains every observation:
- First query per request: 1-6 seconds – radio was asleep
- Subsequent queries: 8-90ms – radio already awake from the first query
- With kubectl logs -f: everything fast – the continuous stream keeps the radio awake
- 40% packet loss – packets lost during wake transitions (you can watch the retransmit counters climb; see below)
- Node going NotReady – kubelet heartbeats delayed past the monitor timeout
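If you want to see those lost-and-retried packets directly, the kernel's TCP counters tell the story. A quick check on any node, assuming iproute2's nstat is available (exact counter names can vary slightly by kernel):

```
# Watch retransmission counters climb while the radio naps.
# -a prints absolute totals rather than deltas, -z includes zero counters.
watch -n 1 "nstat -az TcpRetransSegs TcpExtTCPSynRetrans"
```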
Every node had it on:
$ iw dev wlp2s0 get power_save
Power save: on
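To check the whole cluster in one pass – node hostnames and the interface name below are placeholders, and it assumes SSH access to each node:

```
# Report power-save state on every node; hostnames and interface are placeholders.
for node in node1 node2 node3 node4 node5; do
  echo -n "$node: "
  ssh "$node" iw dev wlp2s0 get power_save
done
```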
The Fix
One command per node:
sudo iw dev <wifi-interface> set power_save off
To persist across reboots, a systemd oneshot service:
[Unit]
Description=Disable wifi power save for K8s cluster stability
Wants=network-online.target
After=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/sbin/iw dev wlp2s0 set power_save off
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
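Save it under /etc/systemd/system (the filename is up to you – I'll call it wifi-powersave-off.service here) and enable it:

```
# Install and activate the oneshot unit; the filename is arbitrary.
sudo cp wifi-powersave-off.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now wifi-powersave-off.service
```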
Applied to all five nodes. Immediate result: practice flow went from 3-7 seconds per interaction to under a second consistently.
What I Almost Did Instead
Before finding the real cause, I had plans to:
- Rewrite the practice page as a client component to eliminate server round-trips (the user correctly vetoed this – “keep it SSR”)
- Co-locate app and postgres on the same node to avoid the network entirely
- Investigate Flannel VXLAN MTU issues on wifi
- Switch to wired networking across all nodes
All of those are either wrong, overkill, or solving a symptom. The actual fix was a one-liner.
Lessons
1. Wifi power-save is devastating for server workloads. Linux enables it by default. If you’re running anything on wifi – homelab, edge cluster, IoT gateway – disable it. The latency penalty isn’t the ~100ms wake time. It’s the cascading TCP retransmissions when packets are lost during the wake sequence.
2. Build instruments before chasing theories. I wasted zero time on wrong fixes because I deployed tracing first. The traces definitively showed “DB is fast, network is slow” before I touched any application code. Without them, I would have been optimizing queries that ran in 0.24ms.
3. Environmental observations beat log analysis. The breakthrough wasn’t in a trace or a metric. It was noticing that the app was fast when a terminal was streaming logs. Computers are deterministic – if behavior changes when an unrelated thing is running, those things aren’t actually unrelated.
4. log_min_duration_statement=0 is your friend. When you’re trying to prove where latency lives, logging every query with its server-side execution time is the simplest, most conclusive tool. If the server says 0.24ms and the client sees 4200ms, the problem is between them.