A primary key lookup that takes 1.3ms in psql was taking 6 seconds through the app. I spent a day deploying an entire observability stack – OpenTelemetry, Grafana Tempo, Loki, metrics-server – before a throwaway observation in a terminal window cracked the case.
The Setup
Five-node bare-metal Kubernetes cluster at home. Three mini-PCs, two HP servers. All connected over wifi. A Next.js app with Prisma 7 talking to PostgreSQL 16. Everything running, everything functional, everything painfully slow.
Submitting an answer in the practice flow took 3-7 seconds. Navigating between questions felt like dial-up. The app was unusable for its intended purpose.
The Wrong Suspects
I started where any reasonable person would: the database.
psql> \timing
psql> SELECT * FROM "Attempt" WHERE id = 'cmnvu1obo000b01gsuzvaivso';
Time: 1.306 ms
1.3ms. The database was fine. I tested from inside the app pod:
| Layer | Latency |
|---|---|
| Direct psql | 1.3ms |
| Raw pg.Pool (warm) | 11ms |
| Raw pg.Pool (cold) | 260ms |
| Through Prisma adapter | 5,923ms |
A 5000x slowdown between psql and the app. My first instinct was to blame Prisma’s driver adapter, connection pooling, or SSL negotiation. I tuned the pool, disabled SSL, pre-warmed connections. Nothing helped.
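For reference, the "direct psql" number in the table came from timing a single statement from inside the cluster. A rough way to reproduce that kind of in-cluster measurement is a throwaway client pod – the connection URL below is a placeholder for whatever your setup uses:

```
# Throwaway client pod on the pod network; the connection URL is a placeholder.
kubectl run pgtest --rm -it --image=postgres:16 --restart=Never -- \
  psql "postgresql://app:secret@cps-postgres:5432/app" \
  -c '\timing' -c 'SELECT * FROM "Attempt" LIMIT 1;'
```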
Deploying Observability Into the Void
The cluster had zero monitoring. No metrics, no tracing, no structured logging. I was debugging blind. So I stopped chasing the bug and built the instruments.
- OpenTelemetry in the Next.js app via --require ./otel-setup.cjs (outside the bundler – Turbopack and OTel don't mix)
- Grafana Tempo for trace storage
- Grafana Loki + Promtail for log aggregation
- Postgres log_min_duration_statement=0 to log every query with server-side timing (see the snippet below)
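The Postgres knob is the easy one to replicate – it doesn't even need a restart. Something like this works, assuming the deployment is the cps-postgres one referenced later and that you can connect as the postgres superuser:

```
# Log every statement with its server-side duration, then reload (no restart needed).
kubectl exec deploy/cps-postgres -- psql -U postgres -c \
  "ALTER SYSTEM SET log_min_duration_statement = 0;"
kubectl exec deploy/cps-postgres -- psql -U postgres -c \
  "SELECT pg_reload_conf();"
```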
The traces told a clear story:
POST /practice/[attemptId] 6,597ms
├─ prisma:client:db_query SELECT AttemptQuestion 309ms
├─ prisma:client:db_query SELECT Question 11ms
├─ prisma:client:db_query SELECT Attempt 4,198ms ← ???
├─ prisma:client:db_query UPDATE AttemptQuestion 1,109ms
└─ prisma:client:db_query SELECT (next) 85ms
That third query – a primary key lookup on a table with one row – took 4.2 seconds through the app. Meanwhile, postgres logged:
duration: 0.24 ms statement: SELECT "public"."Attempt"."id" ...
0.24ms server-side. The entire 4.2 seconds was network transit. A ping test confirmed it:
5 packets transmitted, 3 packets received, 40% packet loss
round-trip min/avg/max = 102.398/197.925/308.581 ms
40% packet loss between pods on different nodes. On a local network. Over wifi.
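The ping test itself is nothing fancy – an exec into the app pod, pinging the Postgres pod IP. The label selectors below are placeholders, and it assumes ping exists in the app image; a debug container works if it doesn't.

```
# Ping the Postgres pod from the app pod; label selectors are placeholders.
APP_POD=$(kubectl get pod -l app=cps-app -o jsonpath='{.items[0].metadata.name}')
DB_IP=$(kubectl get pod -l app=cps-postgres -o jsonpath='{.items[0].status.podIP}')
kubectl exec "$APP_POD" -- ping -c 5 "$DB_IP"
```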
The Clue That Broke It Open
I was about to start investigating MTU issues, Flannel VXLAN overhead, and wifi channel congestion when I noticed something strange.
While I had kubectl logs -f deploy/cps-postgres running in a side terminal to watch the query logs, the app was fast. Every page loaded in under a second. I closed the terminal. Slow again. Opened it. Fast again.
That’s not a database problem. That’s not a network topology problem. That’s a wifi radio problem.
Wifi Power Save: The Silent Killer
Linux enables wifi power-save by default on every wifi interface. It’s a sensible default for laptops – save battery by letting the radio sleep when idle. It’s a catastrophic default for servers.
Here’s what happens:
- The wifi radio goes idle for a few hundred milliseconds (no packets)
- Linux puts the radio into power-save mode
- The radio sleeps, waking only at the AP’s beacon interval (~100ms) to check for buffered frames
- A new packet arrives (your TCP SYN to postgres)
- The packet waits for the next beacon, the radio wakes, re-negotiates with the AP
- If the SYN or SYN-ACK is lost during this messy wake sequence, TCP retransmits – after 1 second. Then 2 seconds. Then 4 seconds.
This explains every observation:
- First query per request: 1-6 seconds – radio was asleep
- Subsequent queries: 8-90ms – radio already awake from the first query
- With kubectl logs -f: everything fast – the continuous stream keeps the radio awake
- 40% packet loss – packets lost during wake transitions (you can watch the retransmit counters climb; see below)
- Node going NotReady – kubelet heartbeats delayed past the monitor timeout
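If you want to see those lost-and-retried packets directly, the kernel's TCP counters tell the story. A quick check on any node, assuming iproute2's nstat is available (exact counter names can vary slightly by kernel):

```
# Watch retransmission counters climb while the radio naps.
# -a prints absolute totals rather than deltas, -z includes zero counters.
watch -n 1 "nstat -az TcpRetransSegs TcpExtTCPSynRetrans"
```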
Every node had it on:
$ iw dev wlp2s0 get power_save
Power save: on
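To check the whole cluster in one pass – node hostnames and the interface name below are placeholders, and it assumes SSH access to each node:

```
# Report power-save state on every node; hostnames and interface are placeholders.
for node in node1 node2 node3 node4 node5; do
  echo -n "$node: "
  ssh "$node" iw dev wlp2s0 get power_save
done
```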
The Fix
One command per node:
sudo iw dev <wifi-interface> set power_save off
To persist across reboots, a systemd oneshot service:
[Unit]
Description=Disable wifi power save for K8s cluster stability
Wants=network-online.target
After=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/sbin/iw dev wlp2s0 set power_save off
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
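Save it under /etc/systemd/system (the filename is up to you – I'll call it wifi-powersave-off.service here) and enable it:

```
# Install and activate the oneshot unit; the filename is arbitrary.
sudo cp wifi-powersave-off.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now wifi-powersave-off.service
```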
Applied to all five nodes. Immediate result: practice flow went from 3-7 seconds per interaction to under a second consistently.
What I Almost Did Instead
Before finding the real cause, I had plans to:
- Rewrite the practice page as a client component to eliminate server round-trips (the user correctly vetoed this – “keep it SSR”)
- Co-locate app and postgres on the same node to avoid the network entirely
- Investigate Flannel VXLAN MTU issues on wifi
- Switch to wired networking across all nodes
All of those are either wrong, overkill, or solving a symptom. The actual fix was a one-liner.
Lessons
1. Wifi power-save is devastating for server workloads. Linux enables it by default. If you’re running anything on wifi – homelab, edge cluster, IoT gateway – disable it. The latency penalty isn’t the ~100ms wake time. It’s the cascading TCP retransmissions when packets are lost during the wake sequence.
2. Build instruments before chasing theories. I wasted zero time on wrong fixes because I deployed tracing first. The traces definitively showed “DB is fast, network is slow” before I touched any application code. Without them, I would have been optimizing queries that ran in 0.24ms.
3. Environmental observations beat log analysis. The breakthrough wasn’t in a trace or a metric. It was noticing that the app was fast when a terminal was streaming logs. Computers are deterministic – if behavior changes when an unrelated thing is running, those things aren’t actually unrelated.
4. log_min_duration_statement=0 is your friend. When you’re trying to prove where latency lives, logging every query with its server-side execution time is the simplest, most conclusive tool. If the server says 0.24ms and the client sees 4200ms, the problem is between them.