During the upgrade of EKS, you will first upgrade the EKS control plane, and then upgrade the EKS version on each of the worker nodes.
To perform the EKS control plane and worker node upgrade:
Browse to the EKS console, then click Clusters.
From the cluster list, select the cluster to be upgraded and click Update now.
Select the new Kubernetes version, which must be exactly one minor version higher than the current version (for example, 1.19 to 1.20).
Click Update.
Do not proceed until this process has completed successfully.
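If you prefer the AWS CLI, the control plane upgrade can also be started and monitored from the command line. The following is a minimal sketch that assumes a cluster named my-cluster and an upgrade to 1.20; substitute your own cluster name and target version:
# Start the control plane upgrade
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.20
# Wait for the cluster to return to the ACTIVE state
aws eks wait cluster-active --name my-cluster
# Confirm the new control plane version
aws eks describe-cluster --name my-cluster --query 'cluster.version' --output text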
After the EKS control plane upgrade is completed, browse to the Auto Scaling groups page and create a new Auto Scaling group for Kafka workers. The new Auto Scaling group contains the same number of nodes (and the same roles) as the current Auto Scaling group, but using the newly-created launch template for the Kubernetes version to which you are upgrading.
In Choose launch template or configuration, select the launch template you just created, and then select the template version that corresponds to the EKS version you are upgrading to.
Select the same VPC and all private Availability Zones.
In Configure advanced options, for the Load balancing - optional value, check Attach to an existing load balancer.
From the Attach to an existing load balancer menu, check Choose from your load balancer target groups. This will allow you to select the load balancer target groups created for this cluster (3000, 443, 5443, 32080, 32081, 32090, 32101-32150 for the CTH pods, etc) from a list box.
This selection will automatically register the Auto Scaling group nodes to the target groups.
Leave all other advanced options at default values.
Ensure group sizes for desired, minimum and maximum are the same as the original Auto Scaling group.
Add any desired notifications.
Add the same tags contained in the original group. Make sure that the EKS cluster name appears in the tags.
Review the new configuration and click Create Auto Scaling group. This process will create three new nodes.
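If you want to confirm the new group and its target group registrations from the AWS CLI, a sketch like the following can be used (the group name and target group ARN are placeholders for your own values):
# Confirm the group sizes and attached target groups
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <new-asg-name>
# Check that the new nodes register as healthy in a given target group
aws elbv2 describe-target-health --target-group-arn <target-group-arn>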
Get the cluster node names by running the following command, and then record the node names returned:
kubectl get nodes
Example output:
NAME STATUS ROLES AGE VERSION
ip-10-0-10-253.example.com Ready <none> 10m v1.21.14-eks-ba74326
ip-10-0-10-61.example.com Ready <none> 3d1h v1.20.15-eks-ba74326
ip-10-0-13-119.example.com Ready <none> 3d1h v1.20.15-eks-ba74326
ip-10-0-13-40.example.com Ready <none> 10m v1.21.14-eks-ba74326
ip-10-0-14-238.example.com Ready <none> 10m v1.21.14-eks-ba74326
ip-10-0-15-193.example.com Ready <none> 3d1h v1.20.15-eks-ba74326
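To list only the nodes still running the old Kubernetes version (the nodes that remain to be drained), you can filter the output. For example, if the old version is 1.20:
kubectl get nodes | grep 'v1.20'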
Select a worker node and drain it by running the following command.
kubectl drain <FQDN of selected node> --force --delete-local-data --ignore-daemonsets --timeout=180s
(Conditional) If the intelligence-namenode=yes label has been applied to any node, drain that node last among all nodes.
Wait for the drain command to complete and for all pods to return to the Running state before proceeding to Step 7. The drain/cordon process can take several minutes; allow it time to complete.
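To see at a glance which pods have not yet returned to the Running state, a check such as the following can help (it uses the same namespace lookup as the other commands in this section):
kubectl get pods --no-headers -n $(kubectl get ns | awk '/arcsight/ {print $1}') | grep -vE 'Running|Completed'
No output means that every pod in the namespace is Running or Completed.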
Log in to Kafka Manager (cmak) at https://<CLUSTER FQDN>:32090/th/cmak and reassign Kafka partitions for the node as follows:
Click Generate Partition Assignments and select all topics (default)
Click Run Partition Assignments (use default settings for broker IDs)
Click Go to reassign partitions, and then refresh the page until the Completed field shows a date and time of completion.
(Conditional) If your deployment includes ArcSight Intelligence, log in to the Elasticsearch pod by executing this command:
kubectl exec -it -n $(kubectl get ns | awk '/arcsight/ {print $1}') elasticsearch-master-0 -c elasticsearch -- bash
Monitor the ES replication process using the following command:
curl -k -XGET 'https://elastic:<password>@localhost:9200/_cat/health?v=true'
Example command and output:
curl -k -XGET 'https://elastic:changeme@localhost:9200/_cat/health?v=true'
epoch      timestamp cluster  status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1671118161 15:29:21  interset green  6          3         1128   583 0    0    0        0             -                  100.0%
Wait until the cluster status is green and active_shards_percent reaches 100%. However, if the Elasticsearch recovery is not progressing and active_shards_percent does not reach 100%, or the status is red, you may disregard pods that do not return to the Running state and proceed with the EKS upgrade (this applies to the intelligence, interset, and searchmanager pods). After cordoning the node, some interset pods might remain in the init state; this can be ignored.
In Kafka Manager, check that the topic partition table data returns to normal. There should be no under-replication of data, and broker leader skew values should return to 0%. Do the following:
kubectl exec -it -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef --under-replicated-partitions
The command above checks the th-cef topic specifically; replace that value with another topic name to check other topics.
No result is returned when there is no under-replication.
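To check every topic at once instead of one topic at a time, you can omit the --topic option; kafka-topics then reports under-replicated partitions across all topics:
kubectl exec -it -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --under-replicated-partitions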
Run this command to check for broker leader skew:
kubectl exec -it -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef
The command above checks the th-cef topic specifically; replace that value with another topic name to check other topics.
Verify that the number of times each broker is listed equals the replicationFactor value.
Example:
/usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef
Topic: th-cef PartitionCount: 6 ReplicationFactor: 2 Configs: cleanup.policy=delete,segment.bytes=1073741824,message.format.version=2.5-IV0,retention.bytes=6442450944
Topic: th-cef Partition: 0 Leader: 1005 Replicas: 1005,1004 Isr: 1004,1005
Topic: th-cef Partition: 1 Leader: 1006 Replicas: 1006,1005 Isr: 1005,1006
Topic: th-cef Partition: 2 Leader: 1004 Replicas: 1004,1006 Isr: 1004,1006
Topic: th-cef Partition: 3 Leader: 1005 Replicas: 1005,1006 Isr: 1005,1006
Topic: th-cef Partition: 4 Leader: 1006 Replicas: 1006,1004 Isr: 1004,1006
Topic: th-cef Partition: 5 Leader: 1004 Replicas: 1004,1005 Isr: 1004,1005
The example output shown above is the final, successful result. Reaching it takes time, depending on the amount of data, the number of partitions, and the replication factors.
It might also require several executions of the command before all partitions are listed and each leader value appears twice.
If this output is not achieved after several executions of the command, go back to Step 7 and execute parts a and b, then run the command again.
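If you prefer to poll rather than re-run the check by hand, a small loop such as the following (a sketch; adjust the topic and interval to suit your environment) re-runs the under-replication check for th-cef every 30 seconds until nothing is reported:
while kubectl exec -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef --under-replicated-partitions | grep -q . ; do sleep 30; done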
Repeat Steps 6 through 9 for each additional node in the cluster. Only after you have completed these steps on every node should you proceed to Step 11.
(Conditional) If the intelligence-namenode=yes label was applied to the node being drained, proceed to
If not, proceed directly to Step 12 below.
Delete the old Auto Scaling group (you created its replacement earlier in this procedure). Wait until all nodes from the old Auto Scaling group are deleted and all pods have returned to the Running state before proceeding to Step 13.
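If you prefer the AWS CLI for this step, a sketch such as the following deletes the old group and terminates its instances (the group name is a placeholder); you can then watch the old nodes leave the cluster:
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name <old-asg-name> --force-delete
kubectl get nodes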
Your next step depends on the EKS version you have reached:
If your cluster has been configured for Kubernetes 1.22 already, the upgrade is complete, and you may proceed with the next steps.
If your cluster is configured for a Kubernetes version earlier than 1.22, then repeat the entire process above for the next Kubernetes version, beginning with Step 1.
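To confirm which version the control plane and worker nodes are currently running, you can use, for example:
# Control plane version (the cluster name is a placeholder)
aws eks describe-cluster --name <cluster-name> --query 'cluster.version' --output text
# Kubelet version reported by each worker node
kubectl get nodes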
(Conditional) If the ArcSight Database has been deployed, log into the database master node and run these commands:
./kafka_scheduler stop
./kafka_scheduler start
Verify that Kafka Scheduler has the correct new node IP addresses by running this command:
./kafka_scheduler status
Verify that the Event Table Status events count is increasing by running this command:
./kafka_scheduler events
Next Step - If Step 8 failed for your Intelligence deployment:
Next Step - If Step 8 concluded successfully for your Intelligence deployment:
Next Step - If you don't have Intelligence in your deployment: Continue with the rest of the upgrade checklist.