A primary key lookup that takes 1.3ms in psql was taking 6 seconds through the app. I spent a day deploying an entire observability stack – OpenTelemetry, Grafana Tempo, Loki, metrics-server – before a throwaway observation in a terminal window cracked the case.

The Setup

Five-node bare-metal Kubernetes cluster at home. Three mini-PCs, two HP servers. All connected over wifi. A Next.js app with Prisma 7 talking to PostgreSQL 16. Everything running, everything functional, everything painfully slow.

Submitting an answer in the practice flow took 3-7 seconds. Navigating between questions felt like dial-up. The app was unusable for its intended purpose.

The Wrong Suspects

I started where any reasonable person would: the database.

psql> \timing
psql> SELECT * FROM "Attempt" WHERE id = 'cmnvu1obo000b01gsuzvaivso';
Time: 1.306 ms

1.3ms. The database was fine. I tested from inside the app pod:

Layer                     Latency
Direct psql               1.3ms
Raw pg.Pool (warm)        11ms
Raw pg.Pool (cold)        260ms
Through Prisma adapter    5,923ms
A 4,500x slowdown between psql and the app. My first instinct was to blame Prisma’s driver adapter, connection pooling, or SSL negotiation. I tuned the pool, disabled SSL, pre-warmed connections. Nothing helped.

Deploying Observability Into the Void

The cluster had zero monitoring. No metrics, no tracing, no structured logging. I was debugging blind. So I stopped chasing the bug and built the instruments.

  • OpenTelemetry in the Next.js app via --require ./otel-setup.cjs (outside the bundler – Turbopack and OTel don’t mix; a minimal sketch follows this list)
  • Grafana Tempo for trace storage
  • Grafana Loki + Promtail for log aggregation
  • Postgres log_min_duration_statement=0 to log every query with server-side timing
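
A minimal otel-setup.cjs can look like this. It’s a sketch, not the exact file: the service name and the in-cluster Tempo endpoint are assumptions, and @prisma/instrumentation is what produces the prisma:client:db_query spans shown below.

// otel-setup.cjs – loaded via node --require so it runs before any bundled code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PrismaInstrumentation } = require('@prisma/instrumentation');

const sdk = new NodeSDK({
  serviceName: 'cps-app',                          // assumed service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://tempo.monitoring:4318/v1/traces', // assumed Tempo OTLP/HTTP endpoint
  }),
  instrumentations: [
    getNodeAutoInstrumentations(),
    new PrismaInstrumentation(),
  ],
});

sdk.start();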

The traces told a clear story:

POST /practice/[attemptId]                           6,597ms
├─ prisma:client:db_query SELECT AttemptQuestion       309ms
├─ prisma:client:db_query SELECT Question               11ms
├─ prisma:client:db_query SELECT Attempt             4,198ms  ← ???
├─ prisma:client:db_query UPDATE AttemptQuestion      1,109ms
└─ prisma:client:db_query SELECT (next)                 85ms

That third query – a primary key lookup on a table with one row – took 4.2 seconds through the app. Meanwhile, postgres logged:

duration: 0.24 ms  statement: SELECT "public"."Attempt"."id" ...
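
Enabling that logging is two statements – or the equivalent edit in postgresql.conf:

ALTER SYSTEM SET log_min_duration_statement = 0;  -- log every statement with its server-side duration
SELECT pg_reload_conf();                          -- apply without a restart

At 0 this logs everything, which is noisy but exactly what you want when proving where latency lives.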

0.24ms server-side. The entire 4.2 seconds was network transit. A ping test confirmed it:

5 packets transmitted, 3 packets received, 40% packet loss
round-trip min/avg/max = 102.398/197.925/308.581 ms

40% packet loss between pods on different nodes. On a local network. Over wifi.
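
For reference, the test was nothing fancier than a pod-to-pod ping – the deployment name and target IP here are placeholders:

kubectl exec deploy/cps-app -- ping -c 5 <pod-ip-on-another-node>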

The Clue That Broke It Open

I was about to start investigating MTU issues, Flannel VXLAN overhead, and wifi channel congestion when I noticed something strange.

While I had kubectl logs -f deploy/cps-postgres running in a side terminal to watch the query logs, the app was fast. Every page loaded in under a second. I closed the terminal. Slow again. Opened it. Fast again.

That’s not a database problem. That’s not a network topology problem. That’s a wifi radio problem.

Wifi Power Save: The Silent Killer

Linux enables wifi power-save by default on every wifi interface. It’s a sensible default for laptops – save battery by letting the radio sleep when idle. It’s a catastrophic default for servers.

Here’s what happens:

  1. The wifi radio goes idle for a few hundred milliseconds (no packets)
  2. Linux puts the radio into power-save mode
  3. The radio sleeps, waking only at the AP’s beacon interval (~100ms) to check for buffered frames
  4. A new packet arrives (your TCP SYN to postgres)
  5. The packet waits for the next beacon, the radio wakes, re-negotiates with the AP
  6. If the SYN or SYN-ACK is lost during this messy wake sequence, TCP retransmits – after 1 second. Then 2 seconds. Then 4 seconds. (A quick way to see this is sketched after this list.)
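
That last step is visible in kernel counters. A hedged way to confirm it – nstat reports deltas since its previous run, so exercise the slow app between invocations and watch the numbers climb:

# total TCP retransmits, and SYN retransmits specifically
nstat -az TcpRetransSegs TcpExtTCPSynRetrans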

This explains every observation:

  • First query per request: 1-6 seconds – radio was asleep
  • Subsequent queries: 8-90ms – radio already awake from the first query
  • With kubectl logs -f: everything fast – continuous stream keeps the radio awake
  • 40% packet loss – packets lost during wake transitions
  • Node going NotReady – kubelet heartbeats delayed past the monitor timeout

Every node had it on:

$ iw dev wlp2s0 get power_save
Power save: on
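
A quick audit loop over the cluster (hostnames and interface name are placeholders):

for node in node-1 node-2 node-3 node-4 node-5; do
  ssh "$node" iw dev wlp2s0 get power_save
done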

The Fix

One command per node:

sudo iw dev <wifi-interface> set power_save off

To persist across reboots, a systemd oneshot service:

[Unit]
Description=Disable wifi power save for K8s cluster stability
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/iw dev wlp2s0 set power_save off
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
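
If a node’s wifi is managed by NetworkManager, it’s worth pinning the setting at the connection level too, since NetworkManager can reapply power save when a connection re-activates. The connection name is a placeholder; 2 means “disable”:

nmcli connection modify <wifi-connection> wifi.powersave 2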

Applied to all five nodes. Immediate result: practice flow went from 3-7 seconds per interaction to under a second consistently.

What I Almost Did Instead

Before finding the real cause, I had plans to:

  • Rewrite the practice page as a client component to eliminate server round-trips (rightly vetoed – “keep it SSR”)
  • Co-locate app and postgres on the same node to avoid the network entirely
  • Investigate Flannel VXLAN MTU issues on wifi
  • Switch to wired networking across all nodes

All of those are either wrong, overkill, or solving a symptom. The actual fix was a one-liner.

Lessons

1. Wifi power-save is devastating for server workloads. Linux enables it by default. If you’re running anything on wifi – homelab, edge cluster, IoT gateway – disable it. The latency penalty isn’t the ~100ms wake time. It’s the cascading TCP retransmissions when packets are lost during the wake sequence.

2. Build instruments before chasing theories. Once tracing was in place, I stopped burning time on speculative fixes: the traces definitively showed “DB is fast, network is slow” before I touched any more application code. Without them, I would have been optimizing queries that ran in 0.24ms.

3. Environmental observations beat log analysis. The breakthrough wasn’t in a trace or a metric. It was noticing that the app was fast when a terminal was streaming logs. Computers are deterministic – if behavior changes when an unrelated thing is running, those things aren’t actually unrelated.

4. log_min_duration_statement=0 is your friend. When you’re trying to prove where latency lives, logging every query with its server-side execution time is the simplest, most conclusive tool. If the server says 0.24ms and the client sees 4200ms, the problem is between them.