This guide provides a step-by-step process to diagnose and resolve high memory usage that causes NextGen Gateway pods to crash in a Kubernetes environment. It includes commands to check pod status, identify memory-related issues, and implement solutions to stabilize the pod.

Verifying Memory Usage

To verify the memory usage of Kubernetes pods, ensure that the metrics server is enabled in your Kubernetes cluster. The "kubectl top" command can be used to retrieve snapshots of resource utilization for pods or nodes in your cluster.

Check Pod Memory Usage

Run the following command to check the memory usage of the pod:

$ kubectl top pods 
NAME                        CPU(cores)   MEMORY(bytes)
nextgen-gw-0                48m          1375Mi
nextgen-gw-redis-master-0   11m          11Mi

Check Container Memory Usage

Run the following command to check the memory usage of each container in the pod:

$ kubectl top pods --containers 
POD                         NAME           CPU(cores)   MEMORY(bytes)
nextgen-gw-0                nativebridge   0m           6Mi
nextgen-gw-0                postgres       5m           83Mi
nextgen-gw-0                vprobe         46m          633Mi
nextgen-gw-redis-master-0   redis          13m          11Mi

Check Node Memory Usage

Run the following command to check the memory usage of the node:

$ kubectl top nodes 
NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
nextgen-gateway   189m         9%     3969Mi          49% 
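When watching a busy cluster, it can help to flag only the pods whose memory usage crosses a threshold. The sketch below assumes a 1000Mi threshold for illustration and uses the sample "kubectl top pods" figures shown above as inline data; on a live cluster you would pipe in real output instead (for example, kubectl top pods --no-headers).

```shell
# Sketch: flag pods above an assumed 1000Mi memory threshold.
# The heredoc mirrors the sample output above; on a live cluster,
# feed in `kubectl top pods --no-headers` instead.
high=$(awk -v limit=1000 '
  { mem = $3                 # third column, e.g. "1375Mi"
    sub(/Mi$/, "", mem)      # strip the unit so we can compare numerically
    if (mem + 0 > limit) print $1 " is using " $3 }
' <<'EOF'
nextgen-gw-0               48m   1375Mi
nextgen-gw-redis-master-0  11m   11Mi
EOF
)
echo "$high"
```

With the sample data, only nextgen-gw-0 is reported, matching the pod that later shows memory trouble.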

Understanding Pod Crashes Due to High Memory Usage

The NextGen Gateway pod in a Kubernetes cluster crashes due to high memory usage.

Possible Causes

When a pod exceeds its allocated memory, Kubernetes automatically kills the process to protect the node’s stability, resulting in an OOMKilled (Out of Memory Killed) error. This is particularly critical for the NextGen Gateway, as it may affect the stability and monitoring capabilities of the OpsRamp platform.

Troubleshooting Steps

Follow these steps to diagnose and resolve memory issues for the NextGen Gateway pod:

  1. Check the pod status (for example, with kubectl get pods) to determine whether the pods are running.
  2. Gather detailed information about the pod by running the following command. This will provide the status, restart count, and the reason for any previous restarts:
    kubectl describe pod <pod_name> 
    Example:
    kubectl describe pod nextgen-gw-0 
  3. Examine memory-related termination reasons in the container status section of the describe output.
    Sample Log Output:
    vprobe: 
        Container ID:   containerd://40c8585cf88dc7d0dd4e43560dc631ef559b0c92e6d5d429719a384aaea77777 
        Image:          us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe:17.0.0 
        Image ID:       us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe@sha256:8de1a98c3c14307fa4882c7e7422a1a4e4d507d2bbc454b53f905062b665e9d2 
        Port:           <none> 
        Host Port:      <none> 
        State:          Running 
          Started:      Mon, 29 Jan 2024 12:01:30 +0530 
        Last State:     Terminated 
          Reason:       OOMKilled 
          Exit Code:    137 
          Started:      Mon, 29 Jan 2024 12:00:42 +0530 
          Finished:     Mon, 29 Jan 2024 12:01:29 +0530 
        Ready:          True 
        Restart Count:  1 
  4. Confirm the memory issue by checking the exit code.
    If the exit code is 137, the container was killed because it exceeded its memory limit (137 = 128 + 9, where 9 is SIGKILL).
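The exit-code check in step 4 can be scripted. This is a minimal sketch: the exit code is hard-coded from the sample log output above, and the kubectl command in the comment assumes the example pod name nextgen-gw-0.

```shell
# Sketch: interpret a container's last exit code.
# On a live cluster the code could be read with, for example:
#   kubectl get pod nextgen-gw-0 \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
exit_code=137   # taken from the sample log output above

# 137 = 128 + signal 9 (SIGKILL), which is what the container receives
# when it exceeds its memory limit
if [ "$exit_code" -eq 137 ]; then
  verdict="OOMKilled: container exceeded its memory limit"
else
  verdict="exit code $exit_code is not memory-related"
fi
echo "$verdict"
```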

Resolution for Memory Issues

To resolve the memory issue and prevent further pod crashes, take the following actions:

  1. Decrease the load on the NextGen Gateway by limiting the number of metrics being processed.
  2. Adjust the memory limits for the NextGen Gateway pod, ensuring it has sufficient memory to handle the required load without crashing. For detailed instructions on modifying the memory limits, refer to the Update Memory Limits for NextGen Gateway section.
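For orientation, memory limits in Kubernetes are typically expressed in the container spec as sketched below. The values shown are illustrative only, and the exact field location for the NextGen Gateway depends on how it is deployed (Helm values versus raw manifests); follow the Update Memory Limits for NextGen Gateway section for the authoritative procedure.

```yaml
# Illustrative values only; see the Update Memory Limits for NextGen Gateway
# section for where these fields live in your deployment.
resources:
  requests:
    memory: "1Gi"   # amount the scheduler reserves for the container
  limits:
    memory: "2Gi"   # above this, the container is OOMKilled (exit code 137)
```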