Issue Summary

In a K3s cluster (multi-node or single-node), one node is intermittently unresponsive:

  • kubectl times out or hangs when querying the cluster.
  • kubectl get nodes shows the affected node as NotReady, sometimes flipping to Ready.
  • The node is unresponsive through the Kubernetes control plane but may still be reachable via SSH.

Prerequisites

  • SSH access to all nodes.
  • kubectl access from the control node (or locally using k3s kubectl).
  • Root/sudo access on the affected node.
  • Node hostname or IP for targeting the investigation.

Troubleshooting Steps

1. Verify Node Status

Run from a working node:

kubectl get nodes -o wide

Expected behavior:

  • One node shows NotReady, or its status flaps between Ready and NotReady.

2. SSH Into the Affected Node

From any other node or your local system:

ssh <node-ip>

If login is slow, this could indicate:

  • High CPU/memory
  • I/O issues
  • Network issues
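Before attempting a full login, a bounded connectivity probe can distinguish "slow" from "down". A minimal sketch, assuming you export NODE_IP with the affected node's address first (the variable is a placeholder, not part of the original steps):

```shell
# Hypothetical: export NODE_IP=<node-ip> before running this probe.
NODE_IP="${NODE_IP:-}"
if [ -n "$NODE_IP" ]; then
  # BatchMode avoids hanging on a password prompt; ConnectTimeout bounds the wait.
  if ssh -o ConnectTimeout=5 -o BatchMode=yes "$NODE_IP" true 2>/dev/null; then
    ssh_result="reachable"
  else
    ssh_result="unreachable or slow"
  fi
else
  ssh_result="skipped (NODE_IP not set)"
fi
echo "SSH probe: $ssh_result"
```

If the probe times out but the node still pings, suspect a hung sshd or severe load rather than a network outage.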

3. Check System Resources

Once you have SSH access to the affected node, your first step is to check the node’s overall system health. The following commands help you identify if the issue is due to resource exhaustion or system-level bottlenecks.

3a. Check Who Is Logged In and What They're Doing

w

Explanation:

  • The w command shows who is logged into the system and what they are doing.
  • It provides useful details like system uptime, load average, and the processes each user is running.
  • At the top, it shows load averages - if these numbers are high (e.g., above the number of CPU cores), your system might be under heavy load.

Example Output:

15:05:12 up 10 days,  2:41,  2 users,  load average: 5.68, 5.33, 5.27
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    10.0.0.5         13:02    1:25m  0.01s  0.01s -bash

3b. Check Real-Time Resource Usage

top

Explanation:

  • Shows real-time CPU, memory, and process activity.
  • Helps identify processes consuming excessive CPU or memory.
  • The first few lines provide overall stats, while the lower section shows detailed info for each running process.

What to look for:

  • %CPU or %MEM near 100 for a single process.
  • Load average at the top - this should ideally be less than the total number of CPU cores.
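When top is hard to use over a laggy SSH session, the same information can be captured non-interactively. A sketch using procps-style ps (assumed present on the node) to snapshot the five heaviest processes:

```shell
# One-shot alternative to top: top 5 processes by CPU, then by memory.
echo "--- by CPU ---"
top_cpu=$(ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 6)
echo "$top_cpu"
echo "--- by memory ---"
top_mem=$(ps -eo pid,comm,%cpu,%mem --sort=-%mem | head -n 6)
echo "$top_mem"
```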

3c. Check Memory Usage

free -m

Explanation:

  • Displays the amount of used and available memory in megabytes.
  • Shows RAM and swap usage.

What to check:

  • If available memory is very low (under 100 MB), the node may be out of memory.
  • If swap usage is high, this may indicate memory pressure.

Example Output:

              total        used        free      shared  buff/cache   available
Mem:           2000        1800          50          10         150         100
Swap:          1024         900         124
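The 100 MB guideline above can be checked mechanically. A sketch that parses free -m (the column positions assume procps free, as shown in the example output):

```shell
# Column 7 of the "Mem:" row is "available"; column 3 of "Swap:" is used swap.
avail_mb=$(free -m | awk '/^Mem:/ {print $7}')
swap_used_mb=$(free -m | awk '/^Swap:/ {print $3}')
if [ "${avail_mb:-0}" -lt 100 ]; then
  mem_verdict="WARNING: only ${avail_mb} MB available - possible memory exhaustion"
else
  mem_verdict="memory OK: ${avail_mb} MB available, ${swap_used_mb:-0} MB swap in use"
fi
echo "$mem_verdict"
```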

3d. Check Disk Usage

df -h

Explanation:

  • Shows disk space usage for all mounted filesystems in a human-readable format.
  • Critical for checking if /var, /, or /var/lib/containerd are full.

What to look for:

  • If any mount point is at 100%, it can prevent Kubernetes and containerd from functioning properly.
  • K3s stores a lot of container data under /var/lib/rancher or /var/lib/containerd.

Example Output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        40G   38G  500M  99% /
tmpfs           500M     0  500M   0% /dev/shm
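A quick pass over all mount points can flag anything near full before it breaks the node. A sketch assuming GNU df's --output option; the 90% threshold is an arbitrary early-warning choice, not a K3s limit:

```shell
# Print any filesystem at or above 90% usage; a full / or /var breaks containerd.
full_fs=$(df -h --output=pcent,target 2>/dev/null | awk 'NR > 1 {
  use = $1; sub(/%/, "", use)
  if (use + 0 >= 90) print "WARNING: " $2 " is at " $1
}')
if [ -n "$full_fs" ]; then echo "$full_fs"; else echo "no filesystem above 90%"; fi
```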

3e. Check System Uptime and Load

uptime

Explanation:

  • A quick summary of how long the system has been running and current CPU load averages.
  • Load averages are shown for the last 1, 5, and 15 minutes.

How to interpret load average:

  • If the number is higher than the total number of CPU cores, your system is overloaded.
  • Example: A 4-core machine should ideally have a load average under 4.0.

Example Output:

15:20:32 up 10 days,  3:12,  2 users,  load average: 6.11, 6.55, 6.80
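The core-count rule of thumb above can be automated. A sketch that reads /proc/loadavg (Linux-specific) and compares the 1-minute load against the CPU count:

```shell
# Compare the 1-minute load average against the number of CPU cores.
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
# awk does the floating-point comparison the shell cannot.
overloaded=$(awk -v l="$load1" -v c="$cores" \
  'BEGIN { if (l + 0 > c + 0) print "yes"; else print "no" }')
echo "cores=$cores load(1m)=$load1 overloaded=$overloaded"
```

A sustained "yes" here, matched against the top output from step 3b, usually points at the process to investigate.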

4. Check K3s Service Status and Logs

K3s service (the unit is k3s on server nodes and k3s-agent on agent nodes):

sudo service k3s status

Check K3s logs (use -u k3s on a server node):

sudo journalctl -u k3s-agent -n 100

Look for:

  • Frequent restarts
  • Segfaults
  • Network/disk timeout logs
  • Out-of-memory events
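The status and log checks can be combined into one guarded sweep that inspects whichever K3s unit actually exists on the node. A sketch assuming systemd; it degrades to a message elsewhere, and the grep pattern is just a starting point for the symptoms listed above:

```shell
units_found=0
if command -v systemctl >/dev/null 2>&1 && systemctl list-units >/dev/null 2>&1; then
  for unit in k3s k3s-agent; do
    if systemctl list-unit-files "${unit}.service" --no-legend 2>/dev/null | grep -q "$unit"; then
      units_found=$((units_found + 1))
      systemctl status "$unit" --no-pager | head -n 5
      # Surface only the suspicious lines from the last 100 log entries.
      journalctl -u "$unit" -n 100 --no-pager 2>/dev/null \
        | grep -Ei 'error|oom|timeout|restart|segfault' | tail -n 20
    fi
  done
  echo "K3s units inspected: $units_found"
else
  echo "systemd not available here; run this on the affected node"
fi
```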

5. Restart Services (If Needed)

If resource usage is normal but the K3s service logs show issues, restart the service (k3s on a server node, k3s-agent on an agent node):

sudo service k3s stop
sudo service k3s start

Wait a few seconds and re-check node status from another machine:

kubectl get nodes
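Instead of re-running kubectl get nodes by hand, readiness can be polled after the restart. A guarded sketch; worker-1 is a hypothetical node name, and the jsonpath expression reads the node's Ready condition:

```shell
NODE="${NODE:-worker-1}"   # hypothetical node name - substitute your own
if command -v kubectl >/dev/null 2>&1 && kubectl get node "$NODE" >/dev/null 2>&1; then
  # Poll up to 6 times (about a minute) for the Ready condition to become True.
  for attempt in 1 2 3 4 5 6; do
    ready=$(kubectl get node "$NODE" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$ready" = "True" ] && break
    sleep 10
  done
  echo "node $NODE Ready=$ready"
else
  echo "no cluster access from here; run kubectl get nodes on a control node"
fi
```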

6. Review Kernel/System Logs

On the affected node:

dmesg | tail -n 50
sudo journalctl -xe

Look for:

  • OOM (Out of Memory) killer messages
  • Kernel panics or segfaults
  • Disk or network I/O errors
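The symptoms listed above can be grepped for directly. One sketch; dmesg may require root on hardened kernels, hence the fallback message:

```shell
pattern='out of memory|oom-killer|killed process|i/o error|segfault|kernel panic'
matches=$(dmesg 2>/dev/null | grep -Ei "$pattern" | tail -n 20)
if [ -n "$matches" ]; then
  echo "$matches"
else
  echo "no matching kernel messages (or dmesg is restricted for this user)"
fi
```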

7. Drain and Reboot the Node (If Safe)

This process is typically done for maintenance, troubleshooting, or upgrades. It’s important to ensure no critical workloads are impacted before performing these steps.

7a. Drain the Node

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Explanation:

  • This command evicts all pods from the specified node so that it can be safely rebooted or updated.
  • --ignore-daemonsets: Prevents errors by not trying to evict DaemonSet-managed pods, which usually run on all nodes (like logging or monitoring agents).
  • --delete-emptydir-data: Deletes data in pods that use emptyDir volumes (which are ephemeral), since these volumes will be lost anyway during the node restart.

7b. Reboot the Node

sudo reboot

Explanation:

  • This command restarts the node (server/VM) at the OS level.
  • It is used for applying OS updates, resolving stuck resources, or resetting node state.

7c. Uncordon the Node

kubectl uncordon <node-name>

Explanation:

  • After the node comes back online, this command marks it as schedulable again, allowing pods to be placed back on it by Kubernetes.
  • Without this, the node will remain “cordoned” (unschedulable), and no new pods will be scheduled on it.
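Steps 7a-7c can be strung together into one maintenance sequence. A hedged end-to-end sketch: worker-1 is a hypothetical node name, the timeouts are arbitrary, the ssh call assumes the node name resolves (use the IP otherwise), and the guard makes it a no-op without cluster access:

```shell
NODE="worker-1"   # hypothetical node name - substitute your own
if command -v kubectl >/dev/null 2>&1 && kubectl get node "$NODE" >/dev/null 2>&1; then
  # 7a: evict workloads (DaemonSet pods stay; emptyDir data is discarded).
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=120s
  # 7b: reboot at the OS level; the SSH session drops, so ignore its exit code.
  ssh "$NODE" sudo reboot || true
  # Wait for the kubelet to report Ready again after the reboot.
  kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s
  # 7c: allow the scheduler to place pods on the node again.
  kubectl uncordon "$NODE"
else
  echo "node $NODE not reachable via kubectl; sequence not executed"
fi
```

Run the sequence from a machine that keeps cluster access while the node reboots, never from the node being drained.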