Troubleshooting Your Cluster

The following troubleshooting tips may be helpful in resolving issues with the CDF cluster.


Installation of master nodes fails

During installation, the master nodes can fail to install with the following error:

     Unable to connect to the server: context deadline exceeded

Ensure that your no_proxy and NO_PROXY variables include valid virtual IP addresses and hostnames for each of the master and worker nodes in the cluster, as well as the NFS server.
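
As a quick check, you can print the variables on each node and compare them against the cluster hosts. The hostnames and IP addresses below are illustrative placeholders only, not values from your environment:

# Print the proxy exclusion lists on the node being checked.
echo "no_proxy=$no_proxy"
echo "NO_PROXY=$NO_PROXY"

# Illustrative example of the expected content (placeholders only; use your
# own master, worker, and NFS hostnames and virtual IP addresses):
export no_proxy="localhost,127.0.0.1,master1.example.com,10.0.0.11,worker1.example.com,10.0.0.21,nfs.example.com,10.0.0.30"
export NO_PROXY="$no_proxy"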

Installation times out

During installation, the process may time out at the following step:

      Configure and start ETCD database

Ensure your no_proxy and NO_PROXY variables include correct Master Node information.

During sudo installation, worker node fails to install

During the Add Node phase, one or more of the worker nodes might fail to install, with the log showing the following error message:

     [ERROR] : GET Url: https://itom-vault.core:8200/v1/***/PRIVATE_KEY_CONTENT_{hostname}_{sudo user}, ResponseStatusCode: 404

You can take the following steps to rectify the issue:

  1. Click Cancel to return to the version selection screen.
  2. Proceed through the installation screens again (all previous data is preserved).
  3. On the Add Node screen, where you added the Worker Node data, remove the failed worker node by clicking the Delete icon.
  4. Click Add Node and add the node again.
  5. Click Next and proceed with the installation.
Cluster list empty in Kafka Manager

If the cluster list is empty in the Kafka Manager UI, delete the existing Kafka Manager pod and try the UI again after a new Kafka Manager pod returns to the Running state.
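
For example, assuming kubectl access from a master node (the namespace and pod name are placeholders to be replaced with the values from your deployment):

# Find the Kafka Manager pod and its namespace (names vary by deployment).
kubectl get pods --all-namespaces | grep kafka-manager

# Delete the pod; Kubernetes recreates it automatically.
kubectl delete pod -n <arcsight-installer-namespace> <kafka-manager-pod-name>

# Confirm the new pod reaches the Running state before reopening the UI.
kubectl get pods -n <arcsight-installer-namespace> | grep kafka-manager
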
Worker nodes out of disk space and pods evicted

If the worker nodes run out of disk space, causing the pods on the node to go into Evicted status, try one of the following steps:

  • Fix the disk space issue by adding an additional drive, or contact Micro Focus support for help removing unnecessary files.
  • On the node where the low disk space occurred, run the following command:
    {install dir}/kubernetes/bin/kube-restart.sh
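
A sketch for confirming the disk pressure and reviewing the evicted pods follows; run the kubectl commands from a master node, and note that removing the evicted pod records is an optional extra step beyond the documented workaround:

# On the affected worker node, check which file system is full.
df -h

# From a master node, list pods that have been evicted.
kubectl get pods --all-namespaces | grep Evicted

# Optionally remove the evicted pod records once disk space is recovered.
kubectl get pods --all-namespaces | awk '$4 == "Evicted" {print $1, $2}' | \
    while read ns pod; do kubectl delete pod -n "$ns" "$pod"; done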

For information on adjusting the eviction threshold, see "Updating the CDF Hard Eviction Policy."

Kafka fails to start up; fails to acquire lock or corrupted index file found

A number of scenarios can cause Kafka to fail to start and report either Failed to acquire lock or Corrupted index file found.

Workaround: To resolve this on the problematic Kafka node:

1. Go to the directory:

cd /opt/arcsight/k8s-hostpath-volume/th/kafka/

2. Find the file .lock, and delete it.

3. Search for all index files:

find . -name "*.index" | xargs ls -altr

4. Delete all the corrupted index files.

5. Restart the affected Kafka pod.
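
The steps above can be combined into a short sequence such as the following (the data path matches step 1; the file paths, namespace, and Kafka pod name are placeholders):

# Steps 1-2: locate and remove the stale lock file.
cd /opt/arcsight/k8s-hostpath-volume/th/kafka/
find . -name ".lock"
rm ./<path-to-lock-file>

# Steps 3-4: list the index files and delete any corrupted ones.
find . -name "*.index" | xargs ls -altr
rm ./<path-to-corrupted-index-file>

# Step 5: restart the affected Kafka pod; Kubernetes recreates it.
kubectl delete pod -n <arcsight-installer-namespace> <kafka-pod-name>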

Slow network or slow VM response during upgrade causes delay or failure of web services operations

An intermittent issue has been observed with web service pod startup during the upgrade to Transformation Hub 3.3 that correlates with slow network or slow VM response. Pod startup can be blocked or delayed, leading to various issues, such as failing to create new topics or failing to register the new schema version.

One error seen in the web service log file is: "Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 5715 ms, time limit is 2000". The workaround is to restart the web service pod, as shown in the sketch below.
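
A minimal sketch for restarting the web service pod, assuming kubectl access and that the pod name contains "web-service" (confirm the actual pod name in your deployment):

# Find the web service pod.
kubectl get pods -n <arcsight-installer-namespace> | grep web-service

# Delete it so Kubernetes restarts it.
kubectl delete pod -n <arcsight-installer-namespace> <web-service-pod-name>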

ArcSight Database rejects new sessions because the maximum sessions limit is reached

You might observe the following error in the logs:
     [Vertica][VJDBC](4060) FATAL: New session rejected due to limit, already 125 sessions active

Workaround: Delete active open sessions until the total number of active sessions is within the specified maximum limit.
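
A sketch of how the sessions could be inspected and closed on the Vertica-based database, assuming access to the vsql client and a database superuser account; the session ID is a placeholder taken from the first query:

# List the currently active sessions.
vsql -U dbadmin -c "SELECT session_id, user_name, client_hostname FROM v_monitor.sessions;"

# Close a specific session by its ID.
vsql -U dbadmin -c "SELECT CLOSE_SESSION('<session_id>');"

# Or, if appropriate, close all sessions except your own.
vsql -U dbadmin -c "SELECT CLOSE_ALL_SESSIONS();"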

ArcSight Database fails to restart

The database might fail to restart after an unexpected shutdown, for example. If this happens, you can run a set of commands to recover the last known good set of data and restart the database. Consult your database administrator for the commands to run.
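
For reference only, and only with guidance from your database administrator, a Vertica-based database is typically inspected and restarted with the Administration Tools; the database name below is a placeholder:

# Check the current state of the database cluster.
/opt/vertica/bin/admintools -t view_cluster

# Restart the database.
/opt/vertica/bin/admintools -t restart_db -d <database_name>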

Multiple node failures

Here are some considerations for handling node failures in a cluster with 3 or more worker nodes.

  • A cluster with 3 master nodes and 3 or more worker nodes should have at least 2 master nodes and at least 2 worker nodes running (quorum) to work properly in high availability.

  • As a general rule for preventing data loss, no more than TOPIC_REPLICATION_FACTOR minus 1 worker nodes can be down at any time.

  • Handling failures and stability if Worker nodes go down:

    • Restore the stability of the cluster as follows:
      • Repair any down worker nodes, or replace them with new ones.
      • Delete any pods that are stuck in the “Terminating” state (this is the expected behavior for stateful pods in Kubernetes when nodes are down); see the sketch after this list.
    • Wait until the pod startup sequence is completed. The cluster should resume normal operation.
    • Repair any issues on the lost nodes; the cluster should then return to the Running state.
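
A sketch for clearing pods stuck in the Terminating state after a node failure (force deletion assumes the failed node is confirmed down and will not rejoin with the same pod):

# List pods stuck in the Terminating state across all namespaces.
kubectl get pods --all-namespaces | grep Terminating

# Force-delete a stuck pod so its stateful replacement can be scheduled.
kubectl delete pod -n <namespace> <pod-name> --force --grace-period=0
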
Second upgrade fails or some resources are not actually upgraded

In some cases, a second upgrade may fail completely or fail to upgrade resources. If this is encountered, run the following command:
kubectl delete deployment suite-upgrade-pod-arcsight-installer -n `kubectl get namespaces | grep arcsight-installer | awk '{print $1}'`

Wait until the suite-upgrade-pod-arcsight-installer is deleted, then begin the second upgrade again.
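
To confirm the deployment is gone before retrying the upgrade, the same namespace lookup can be reused (a minimal sketch):

# Resolve the arcsight-installer namespace, then check that the upgrade
# deployment and its pod are no longer listed.
NS=`kubectl get namespaces | grep arcsight-installer | awk '{print $1}'`
kubectl get deployment suite-upgrade-pod-arcsight-installer -n "$NS"
kubectl get pods -n "$NS" | grep suite-upgrade-pod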

CDF deployment fails on servers running VMware vMotion

Installation of CDF may fail on virtual machines running the VMware vMotion product. If this occurs, disable vMotion on all CDF virtual machines and then run the CDF installation again.

After adding or reducing stream processor instances, Kafka Manager fails to show accurate consumer information for topics

After adding or reducing the number of stream processor instances, Kafka Manager may fail to show correct consumer information for some topics. To get the most current consumer information, restart the Kafka Manager pod with the command:

kubectl delete pod -n arcsight-installer-XXX th-kafka-manager-XXX

Then reconnect to the Kafka Manager UI.
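
The XXX portions of the command are deployment-specific; one way to look up the actual namespace and pod name is:

kubectl get pods --all-namespaces | grep th-kafka-manager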

Kafka Manager not displaying members of the consumer group

When a new member is added to a consumer group, Kafka Manager must be restarted in order to display the new members. This applies to Logger, ESM, Vertica Scheduler, SOAR, and Intelligence.

New partition source topics not correctly displayed in Kafka Manager

Changes to the partition source topics in Kafka Manager may take up to 5 minutes to refresh and display correctly.