Performing the EKS Upgrade

During the upgrade of EKS, you will first upgrade the EKS control plane, and then upgrade the EKS version on each of the worker nodes.

To perform the EKS control plane and worker node upgrade:

Browse to the EKS console, then click Clusters.
From the cluster list, select the cluster to be upgraded and click Update now.

If your cluster is at EKS version 1.22, attempting the next update will prompt a warning message, stating that: "

if you are using EBS volumes in your cluster, then you must install the Amazon EBS CSI driver before updating your cluster to version 1.23 to avoid interruptions to your workloads".

You can ignore this message and proceed with the update without installing the EBS CSI driver.
Select the new Kubernetes version, which must be one version number higher than the current version (for example, 1.19 to 1.20) and click Update.

It's normal for this process to take around 20 minutes. Once finished, the control plane would have been upgraded.
Do not proceed until this process has completed successfully.

After the EKS control plane upgrade is completed, browse to the Auto Scaling groups page and create a new Auto Scaling group for Kafka workers. The new Auto Scaling group contains the same number of nodes (and the same roles) as the current Auto Scaling group, but using the newly-created Launch Configuration for the Kubernetes version to which you are upgrading.
1. Switch to Launch Configuration, and from the list menu, select the Launch configuration you created earlier.
2. Select the same VPC and all private Availability Zones.
3. In Configure advanced options, for the Load balancing - optional value, check Attach to an existing load balancer.
  
  From the Attach to an existing load balancer menu, check Choose from your load balancer target groups. This will allow you to select the load balancer target groups created for this cluster (3000, 443, 5443, 32080, 32081, 320101-X for the CTH pods, etc) from a list box.
  
  This selection will automatically register the Auto Scaling group nodes to the target groups.
  
  Leave all other advanced options at default values.
4. Ensure group sizes for desired, minimum and maximum are the same as the original Auto Scaling group.
5. Add any desired notifications.
6. Add the same tags contained in the original group.
7. Review the new configuration and click Create Auto Scaling group. This process will create three new nodes.
Following the completion of the worker node upgrade, the Kafka broker IDs (as seen in the Kafka Manager UI under Brokers) will all be incremented by 1 from their starting values. This is expected.

See Creating the AWS Auto Scaling Group for more information.
Get the cluster node names by running the following command, and then record the node names returned:
```
kubectl get nodes
```

Example output:

NAME                                        STATUS   ROLES    AGE    VERSION

ip-10-0-10-253.example.com   Ready    <none>   10m    v1.21.14-eks-ba74326

ip-10-0-10-61.example.com    Ready    <none>   3d1h   v1.20.15-eks-ba74326

ip-10-0-13-119.example.com   Ready    <none>   3d1h   v1.20.15-eks-ba74326

ip-10-0-13-40.example.com    Ready    <none>   10m    v1.21.14-eks-ba74326

ip-10-0-14-238.example.com   Ready    <none>   10m    v1.21.14-eks-ba74326

ip-10-0-15-193.example.com   Ready    <none>   3d1h   v1.20.15-eks-ba74326

Repeat the following sequence consisting of steps 6 through 11 on each of the original Kafka worker nodes, cycling throughout the nodes one by one, until the sequence has been performed on all worker nodes. Proceed to Step 13 only after the sequence has been performed on each original Kafka worker node..

Select a worker node and drain it by running the following command.
```
kubectl drain <FQDN of selected node> --force --delete-local-data --ignore-daemonsets --timeout=180s
```
If the intelligence-namenode=yes label has been applied to any node, then drain that node last among all nodes.

Wait for the command to complete and for all pods to return to the Running state before proceeding to Step 7. The drain/cordon process can take several minutes for all pods to return to the Running state. Please be patient and allow the process time to complete and all pods to return to Running state.

If a crashloop or error status appears during this process, it can be ignored

Log in to Kafka Manager (cmak) at https://<CLUSTER FQDN>:32090/th/cmak and re-assign Kafka partitions for the node with these commands:
1. Click Generate Partition Assignments and select all topics (default)
2. Click Run Partition Assignments (use default settings for broker IDs)
3. Click Go to reassign partitions and refresh the page until the Completed field shows a data and time of completion.

(Conditional) If your deployment includes ArcSight Intelligence, then monitor the ES replication process by logging into the pod and executing this command:

kubectl exec -n $(kubectl get ns |awk '/arcsight/ {print $1}') elasticsearch-master-0 -c  elasticsearch -it bash

Monitor the Snapshot progress using the following command:

curl -k -XGET 'https://elastic:<password>@localhost:9200/_cat/health?v=true'

Example command and output:

curl -k -XGET 'https://elastic:changeme@localhost:9200/_cat/health?v=true'

epoch      timestamp cluster  status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1671118161 15:29:21  interset green           6         3   1128 583    0    0        0             0                  -                100.0%

At the completion of the Elastic Search replication, the status of all nodes should be "green". A "yellow" status indicates that the replication has not yet completed.

However, if the Elasticsearch recovery is not progressing and the active_shards_percent did not reach 100%, you may disregard the Intelligence pods failure and proceed with the EKS upgrade. This is because, after cordoning the node, some interset pods might remain in init state, and this may be ignored.

In Kafka Manager, check that the topic partition table data returns to normal. There should be no under-replication of data, and broker leader skew values should return to 0%. Do the following:

Select a Kafka pod and log into it, then run this command to check for under-replicated topics:

kubectl exec -it -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef --under-replicated-partitions

The command above checks specifically for the th-cef topic. Replacing that value with other topics would perform the check for those.

No result is returned when there is no under-replication.

Run this command to check for broker leader skew:

kubectl exec -it -n $(kubectl get ns|awk '/arcsight/ {print $1}') th-kafka-0 -c atlas-kafka -- /usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef

The command above checks specifically for the th-cef topic. Replacing that value with other topics would perform the check for those.

Verify that each broker is listed a number of times equal to the replicationFactor.

Example:

/usr/bin/kafka-topics --bootstrap-server localhost:9092 --describe --topic th-cef

Topic: th-cef   PartitionCount: 6       ReplicationFactor: 2    Configs: cleanup.policy=delete,segment.bytes=1073741824,message.format.version=2.5-IV0,retention.bytes=6442450944

Topic: th-cef   Partition: 0    Leader: 1005    Replicas: 1005,1004     Isr: 1004,1005

Topic: th-cef   Partition: 1    Leader: 1006    Replicas: 1006,1005     Isr: 1005,1006

Topic: th-cef   Partition: 2    Leader: 1004    Replicas: 1004,1006     Isr: 1004,1006

Topic: th-cef   Partition: 3    Leader: 1005    Replicas: 1005,1006     Isr: 1005,1006

Topic: th-cef   Partition: 4    Leader: 1006    Replicas: 1006,1004     Isr: 1004,1006

Topic: th-cef   Partition: 5    Leader: 1004    Replicas: 1004,1005     Isr: 1004,1005

Kafka will automatically advertise the new node to the connector.

Repeat steps 6 through 9 for each additional node in the cluster. When you have completed the steps on each node, only then proceed to Step 11.
(Conditional) If the intelligence-namenode=yes label was applied to the node being drained:
- Add the intelligence-namenode=yes label to one of the new nodes and update the node FDQN in the Reconfigure page of the CDF (HDFS NameNode under the Intelligence tab)
Delete the old Auto Scaling group (you created its replacement as above). Wait until all nodes from the old Auto Scaling group are deleted and all pods have returned to the Running state before proceeding to step 13.
Your next step depends on the EKS version you have reached:
- If your cluster has been configured for Kubernetes 1.23 already, the upgrade is complete, and you may proceed with the next steps.
- If your cluster is configured for a Kubernetes version earlier than 1.23, then repeat the entire process above for the next Kubernetes version, beginning with Step 1.

(Conditional) If the ArcSight Database has been deployed:

Log into each of the database nodes and replace the HDFS NameNode value from the previous step to the Core-site.xml file located in the /etc/hadoop/conf/ directory.

Select the option that applies to your configuration:

If Enable Secure Data Transfer with HDFS Cluster is disabled in the Intelligence tab of the CDF Management portal, in the /etc/hadoop/conf directory, add the following configurations:

In the hdfs-site.xml file:

<configuration>
</configuration>

In the core-site.xml file:

<configuration>
  <property>
   <name>fs.defaultFS</name>
   <value>hdfs://<namenode>:30820/</value>
  </property>
  <property>
   <name>dfs.namenode.http-address</name>
   <value><namenode>:30070</value>
  </property>
</configuration>

If Enable Secure Data Transfer with HDFS Cluster is enabled in the Intelligence tab of the CDF Management portal, in the /etc/hadoop/conf directory.

If Kerberos Authentication is enabled, reconfigure it by following this procedure before proceeding.

Navigate to the /etc/hadoop/conf directory and add the following configurations:

In the hdfs-site.xml file:

<configuration>
  <property>
   <name>dfs.http.policy</name>
   <value>HTTPS_ONLY</value>
  </property>
  <property>
   <name>dfs.encrypt.data.transfer</name>
   <value>true</value>
  </property>
</configuration>

In the core-site.xml file:

<configuration>
  <property>
   <name>fs.defaultFS</name>
   <value>hdfs://<namenode>:30820/</value>
  </property>
  <property>
   <name>dfs.namenode.https-address</name>
   <value><namenode>:30070</value>
  </property>
</configuration>

Change to the following directory:

cd /opt/vertica/bin/

su dbadmin

vsql
[password prompt]

Next, run the following commands:

SELECT CLEAR_HDFS_CACHES();
SELECT VERIFY_HADOOP_CONF_DIR();
SELECT node_name, node_address, export_address FROM nodes;

The expected output is:

Welcome to vsql, the Vertica Analytic Database interactive terminal.

Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit

fusiondb=>
fusiondb=> SELECT CLEAR_HDFS_CACHES();
CLEAR_HDFS_CACHES
------------------------------------------------------------------------
Cleared
(1 row)

fusiondb=> SELECT VERIFY_HADOOP_CONF_DIR();
VERIFY_HADOOP_CONF_DIR
-------------------------------------------------------------------------
Validation Success
v_fusiondb_node0001: HadoopConfDir [/etc/hadoop/conf] is valid
v_fusiondb_node0002: HadoopConfDir [/etc/hadoop/conf] is valid
v_fusiondb_node0003: HadoopConfDir [/etc/hadoop/conf] is valid

(3 rows)

fusiondb=> SELECT node_name, node_address, export_address FROM nodes;
node_name | node_address | export_address
-------------------------------------------------------------------------
v_fusiondb_node0001 | <IP1> | <IP1>
v_fusiondb_node0002 | <IP2> | <IP2>
v_fusiondb_node0003 | <IP3> | <IP3>
(3 rows)

Log into the database master node and run these commands:
```
./kafka_scheduler stop
./kafka_scheduler start
```

Verify that Kafka Scheduler has the correct new node IP addresses by running this command:

./kafka_scheduler status

Verify that Event Table Status events count is increasing by running this command:

./kafka_scheduler events

Next Step - If Step 8 failed for your Intelligence deployment: (Conditional - For Intelligence Deployments only) If the Elasticsearch recovery in step 8 failed

Next Step - If Step 8 concluded successfully for your Intelligence deployment: (Conditional - For Intelligence Deployments only) If the Elasticsearch recovery in step 8 succeeded

Next Step - If you don't have Intelligence in your deployment: Your upgrade process is finished!