During the upgrade of EKS, you will first upgrade the EKS control plane, and then upgrade the EKS version on each of the worker nodes.
To perform the EKS control plane and worker node upgrade:
Browse to the EKS console, then click Clusters.
From the cluster list, select the cluster to be upgraded and click Update now.
Select the new Kubernetes version, which must be exactly one minor version higher than the current version (for example, 1.19 to 1.20).
Click Update.
Do not proceed until this process has completed successfully.
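If you prefer the AWS CLI, the control plane upgrade can also be started and monitored from the command line. The following is a minimal sketch that assumes a cluster named my-cluster and an upgrade to 1.20; substitute your own cluster name and target version:
# Start the control plane upgrade
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.20
# Wait for the cluster to return to the ACTIVE state
aws eks wait cluster-active --name my-cluster
# Confirm the new control plane version
aws eks describe-cluster --name my-cluster --query 'cluster.version' --output text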
After the EKS control plane upgrade is completed, browse to the Auto Scaling groups page and create a new Auto Scaling group for Kafka workers. The new Auto Scaling group contains the same number of nodes (and the same roles) as the current Auto Scaling group, but using the newly-created launch template for the Kubernetes version to which you are upgrading.
In Choose launch template or configuration, select the launch template you just created, and then select the template version that corresponds to the EKS version you are upgrading to.
Select the same VPC and all private Availability Zones.
In Configure advanced options, for the Load balancing - optional value, check Attach to an existing load balancer.
From the Attach to an existing load balancer menu, check Choose from your load balancer target groups. This will allow you to select the load balancer target groups created for this cluster (3000, 443, 5443, 32080, 32081, 32090, 32101-32150 for the CTH pods, etc) from a list box.
This selection will automatically register the Auto Scaling group nodes to the target groups.
Leave all other advanced options at default values.
Ensure group sizes for desired, minimum and maximum are the same as the original Auto Scaling group.
Add any desired notifications.
Add the same tags contained in the original group. Make sure that the EKS cluster name appears in the tags.
Review the new configuration and click Create Auto Scaling group. This process will create three new nodes.
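If you want to confirm the new group and its target group registrations from the AWS CLI, a sketch like the following can be used (the group name and target group ARN are placeholders for your own values):
# Confirm the group sizes and attached target groups
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <new-asg-name>
# Check that the new nodes register as healthy in a given target group
aws elbv2 describe-target-health --target-group-arn <target-group-arn>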
Get the cluster node names by running the following command, and then record the node names returned:
kubectl get nodes
Example output:
NAME STATUS ROLES AGE VERSION
ip-10-0-10-253.example.com Ready <none> 10m v1.21.14-eks-ba74326
ip-10-0-10-61.example.com Ready <none> 3d1h v1.20.15-eks-ba74326
ip-10-0-13-119.example.com Ready <none> 3d1h v1.20.15-eks-ba74326
ip-10-0-13-40.example.com Ready <none> 10m v1.21.14-eks-ba74326
ip-10-0-14-238.example.com Ready <none> 10m v1.21.14-eks-ba74326
ip-10-0-15-193.example.com Ready <none> 3d1h v1.20.15-eks-ba74326
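To list only the nodes still running the old Kubernetes version (the nodes that remain to be drained), you can filter the output. For example, if the old version is 1.20:
kubectl get nodes | grep 'v1.20'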
Select a worker node and drain it by running the following command.
kubectl drain <FQDN of selected node> --force --delete-local-data --ignore-daemonsets --timeout=180s
(Conditional) If the intelligence-namenode=yes label has been applied to any node, drain that node last among all nodes.
Wait for the drain command to complete and for all pods to return to the Running state before proceeding to Step 7. The drain/cordon process can take several minutes; allow it time to complete.
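To see at a glance which pods have not yet returned to the Running state, a check such as the following can help (it uses the same namespace lookup as the other commands in this section):
kubectl get pods --no-headers -n $(kubectl get ns | awk '/arcsight/ {print $1}') | grep -vE 'Running|Completed'
No output means that every pod in the namespace is Running or Completed.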
Log in to Kafka Manager (cmak) at https://<CLUSTER FQDN>:32090/th/cmak and reassign Kafka partitions for the node as follows:
Click Generate Partition Assignments and select all topics (default)
Click Run Partition Assignments (use default settings for broker IDs)
Click Go to reassign partitions, and then refresh the page until the Completed field shows a date and time of completion.
(Conditional) If your deployment includes ArcSight Intelligence, log in to the Elasticsearch pod by executing this command:
kubectl exec -it -n $(kubectl get ns | awk '/arcsight/ {print $1}') elasticsearch-master-0 -c elasticsearch -- bash
Monitor the ES replication process using the following command:
curl -k -XGET 'https://elastic:<password>@localhost:9200/_cat/health?v=true'
Example command and output:
curl -k -XGET 'https://elastic:changeme@localhost:9200/_cat/health?v=true'
epoch      timestamp cluster  status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1671118161 15:29:21  interset green  6          3         1128   583 0    0    0        0             -                  100.0%
Wait until the cluster status is green and active_shards_percent reaches 100%. However, if the Elasticsearch recovery is not progressing and active_shards_percent does not reach 100%, or the status is red, you may disregard pods that do not return to the Running state and proceed with the EKS upgrade (this applies to the intelligence, interset, and searchmanager pods). After cordoning the node, some interset pods might remain in the init state; this can be ignored.
In Kafka Manager, check that the topic partition table data returns to normal. There should be no under-replication of data, and broker leader skew values should return to 0%. Do the following:
kubectl exec -it -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef --under-replicated-partitions
The command above checks the th-cef topic specifically; replace that value with another topic name to check other topics.
No result is returned when there is no under-replication.
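To check every topic at once instead of one topic at a time, you can omit the --topic option; kafka-topics then reports under-replicated partitions across all topics:
kubectl exec -it -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --under-replicated-partitions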
Run this command to check for broker leader skew:
kubectl exec -it -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef
The command above checks the th-cef topic specifically; replace that value with another topic name to check other topics.
Verify that the number of times each broker is listed equals the replicationFactor value.
Example:
/usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef
Topic: th-cef PartitionCount: 6 ReplicationFactor: 2 Configs: cleanup.policy=delete,segment.bytes=1073741824,message.format.version=2.5-IV0,retention.bytes=6442450944
Topic: th-cef Partition: 0 Leader: 1005 Replicas: 1005,1004 Isr: 1004,1005
Topic: th-cef Partition: 1 Leader: 1006 Replicas: 1006,1005 Isr: 1005,1006
Topic: th-cef Partition: 2 Leader: 1004 Replicas: 1004,1006 Isr: 1004,1006
Topic: th-cef Partition: 3 Leader: 1005 Replicas: 1005,1006 Isr: 1005,1006
Topic: th-cef Partition: 4 Leader: 1006 Replicas: 1006,1004 Isr: 1004,1006
Topic: th-cef Partition: 5 Leader: 1004 Replicas: 1004,1005 Isr: 1004,1005
The example output shown above is the final, successful result. Reaching it takes time, depending on the amount of data, the number of partitions, and the replication factors.
It might also require several executions of the command before all partitions are listed and each leader value appears twice.
If this output is not achieved after several executions of the command, go back to Step 7 and execute parts a and b, then run the command again.
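If you prefer to poll rather than re-run the check by hand, a small loop such as the following (a sketch; adjust the topic and interval to suit your environment) re-runs the under-replication check for th-cef every 30 seconds until nothing is reported:
while kubectl exec -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef --under-replicated-partitions | grep -q . ; do sleep 30; done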
Repeat Steps 6 through 9 for each additional node in the cluster. Only after you have completed these steps on every node should you proceed to Step 11.
(Conditional) If the intelligence-namenode=yes label was applied to the node being drained, proceed to
If not, proceed directly to Step 12 below.
Delete the old Auto Scaling group (you created its replacement earlier in this procedure). Wait until all nodes from the old Auto Scaling group are deleted and all pods have returned to the Running state before proceeding to Step 13.
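If you prefer the AWS CLI for this step, a sketch such as the following deletes the old group and terminates its instances (the group name is a placeholder); you can then watch the old nodes leave the cluster:
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name <old-asg-name> --force-delete
kubectl get nodes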
Your next step depends on the EKS version you have reached:
If your cluster has been configured for Kubernetes 1.22 already, the upgrade is complete, and you may proceed with the next steps.
If your cluster is configured for a Kubernetes version earlier than 1.22, then repeat the entire process above for the next Kubernetes version, beginning with Step 1.
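To confirm which version the control plane and worker nodes are currently running, you can use, for example:
# Control plane version (the cluster name is a placeholder)
aws eks describe-cluster --name <cluster-name> --query 'cluster.version' --output text
# Kubelet version reported by each worker node
kubectl get nodes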
(Conditional) If the ArcSight Database has been deployed, log into the database master node and run these commands:
./kafka_scheduler stop
./kafka_scheduler start
Verify that Kafka Scheduler has the correct new node IP addresses by running this command:
./kafka_scheduler status
Verify that the Event Table Status events count is increasing by running this command:
./kafka_scheduler events
Next Step - If Step 8 failed for your Intelligence deployment:
Next Step - If Step 8 concluded successfully for your Intelligence deployment:
Next Step - If you don't have Intelligence in your deployment: Continue with the rest of the upgrade checklist.