Performing the AKS Upgrade

Prior to Upgrading AKS

Prior to beginning the upgrade of your Azure Kubernetes Service, perform these tasks.

Verify Connector Disk Storage Space: To ensure data continuity, make sure that your connectors have enough disk storage space to cache at least 30 minutes of event flow. Events queued during the upgrade are processed after the upgrade is complete.
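
One quick way to spot-check available space is to run a disk usage query on each connector host. The path below is only an example; substitute the volume that holds your connector installation and event cache.

    # Check free space on the volume used by the connector (example path only):
    df -h /opt/arcsight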

Verify External Data Disks: Ensure that each of your AKS nodes has a data disk attached.
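
One way to confirm this from the jump host, assuming the same key-based SSH access as azureuser that is used later in this procedure, is to list the block devices on each node and check that a data disk is present and mounted under /opt/arcsight:

    ssh -i id_rsa azureuser@<node-name> 'lsblk -o NAME,SIZE,MOUNTPOINT'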

Create a Data Collection Table: In a text editor, prepare a blank 3-column data collection table in which to record the AKS node information you will collect during the upgrade. The table should have one row for each AKS node in your cluster. Label each column as shown in the example below. The example table shows 3 rows, one for each node; add as many rows as necessary to accommodate your own environment.

Data Collection Table

| Node Name | Labels | Broker ID (required only for AKS nodes where the label kafka:yes has been applied) |
| <node-name-1> | <CSV list of node labels. Example: fusion,kafka,th-platform,th-processing,zk. If the Intelligence capability is installed: intelligence=yes,intelligence-datanode=yes,intelligence-spark=yes> | <broker-ID-1> |
| <node-name-2> | <CSV list of node labels. Example: fusion,kafka,th-platform,th-processing,zk. If the Intelligence capability is installed: intelligence=yes,intelligence-datanode=yes,intelligence-spark=yes> | <broker-ID-2> |
| <node-name-3> | <CSV list of node labels. Example: fusion,kafka,th-platform,th-processing,zk. If the Intelligence capability is installed: intelligence=yes,intelligence-datanode=yes,intelligence-spark=yes> | <broker-ID-3> |

This procedure upgrades your AKS cluster by one version (for example, from 1.19 to 1.20). Because AKS can only be upgraded one version at a time, you may need to cycle through the entire procedure below multiple times until you reach the required version.

To perform the AKS upgrade:

  1. Log in to your jump host.

  2. Get the names of the cluster nodes by running the following command. Record the name of each node in your data collection table, one per row.

    kubectl get nodes -o wide -A

    Step 3 must be repeated on each cluster node. Only after you have completed Step 3 on every node should you proceed to Step 5.

  3. Choose the first node from your table. From the jump host, back up the fstab configuration on the selected node by running the following commands:

    mkdir -p /opt/backup
    scp -i id_rsa azureuser@<node-name>:/etc/fstab /opt/backup/<node-name>_fstab
  4. Repeat Step 3 for each of the other AKS nodes in turn. When finished with all nodes, then proceed to Step 5.
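
    If you have several nodes, you can script the backup from the jump host rather than repeating Step 3 by hand. The loop below is only a convenience sketch; it assumes the node names recorded in Step 2 and the same key-based access as azureuser used above.

      # Back up /etc/fstab from every node in one pass (replace the node list):
      mkdir -p /opt/backup
      for node in <node-name-1> <node-name-2> <node-name-3>; do
        scp -i id_rsa azureuser@${node}:/etc/fstab /opt/backup/${node}_fstab
      done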

  5. List the Fusion and Transformation Hub node names in the cluster, along with their labels by running the following command. Record the AKS node names and labels of each node in your data collection table.

    kubectl get nodes -L=fusion,kafka,th-platform,th-processing,zk -o wide
    a. (Conditional) If Intelligence has been deployed, run the following command:

      kubectl get nodes -L=intelligence,intelligence-datanode,intelligence-namenode,intelligence-spark -o wide

      Record the AKS node names and labels of each node in the data collection table.

  6. Get the broker ID of each AKS node labeled kafka:yes. Do the following (or see the combined one-line command sketched after these sub-steps):

    a. SSH to each worker node labeled with kafka:yes, using the node name.

    b. Become root.

    c. Run the following command:

      cat /opt/arcsight/k8s-hostpath-volume/th/kafka/meta.properties | grep "broker.id"

    d. Record each broker ID in your data collection table.
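
    If you prefer, the SSH session, privilege escalation, and lookup can be combined into a single command per node, run from the jump host. This is only a sketch and assumes passwordless sudo for azureuser on the nodes:

      ssh -i id_rsa azureuser@<node-name> 'sudo grep "broker.id" /opt/arcsight/k8s-hostpath-volume/th/kafka/meta.properties'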

  7. Scale down itom-postgresql by running the following command:

    kubectl scale deployment/itom-postgresql -n core --replicas=0

    Wait for all itom-postgresql pods to be terminated before proceeding.
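
    To confirm that the pods are gone, you can run the following generic check (not part of the original procedure) repeatedly until it returns no output:

      kubectl get pods -n core | grep itom-postgresql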

  8. On the jump host, check the available AKS upgrade versions by running the following command:

    az aks get-upgrades --resource-group <myResourceGroup> --name <myAKSCluster> --output table
  9. Start the upgrade by running the following command:

    az aks upgrade --resource-group <myResourceGroup> --name <myAKSCluster> --kubernetes-version <new Kubernetes version>
  10. After the command executes, answer y to each of the following prompts:

    Are you sure you want to perform this operation?  (y/N):
    Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version <new AKS version> Continue? (y/N):
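
    The upgrade can take a while to complete. If you need to confirm later whether it has finished (for example, before Step 14), one general-purpose check, not part of the documented procedure, is to query the cluster's provisioning state; it reports Upgrading while the upgrade runs and Succeeded when it has completed:

      az aks show --resource-group <myResourceGroup> --name <myAKSCluster> --query provisioningState --output tsv
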
  11. Open a second SSH session to the jump host (in parallel to the previous one) to monitor the upgrade progress. Run the following command:

    watch kubectl get nodes -L=fusion,intelligence,intelligence-datanode,intelligence-namenode,intelligence-spark,kafka,th-platform,th-processing,zk -o wide

    Review the output of the command every minute or so to identify which AKS node is currently being upgraded. The status of a node which is being upgraded will change from Ready to Ready,SchedulingDisabled.

Execute Steps 12a through 12c promptly (within 3 minutes) on each worker node as soon as it finishes upgrading, beginning with the first upgraded worker node (its status changes from Ready,SchedulingDisabled back to Ready, and it reports the new AKS version). Doing so ensures that data is written correctly before the th-kafka pod restarts; failure to do so may result in Schema Registry errors, or broker ID or cluster ID mismatches.
If you encounter an issue with Schema Registry or broker ID mismatches, see the workaround here.
If you encounter an issue with a Kafka pod in CrashLoopBackOff status, reporting a cluster ID mismatch in logs, see the workaround here.
  12. Promptly (within 3 minutes) restore the files that you backed up earlier, mount the disk, and restore the settings as follows (a combined one-line sketch follows these sub-steps).

    a. Copy the backup files from the jump host to each AKS node:

      scp -i id_rsa /opt/backup/<node-name>_fstab azureuser@<node-name>:/tmp/fstab
    b. SSH to each AKS node to perform a restore of the backed-up files. Become root, and then execute the following commands (create the /opt/arcsight directory only if it does not already exist):

      mkdir /opt/arcsight
      cat /tmp/fstab | grep "/opt/arcsight" >> /etc/fstab 
      mount -a
    c. (Conditional) If Intelligence has been deployed, run the following commands:

      sudo echo vm.max_map_count=262144 >> /etc/sysctl.conf 
      sysctl --system
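
    If the 3-minute window is a concern, sub-step b can also be run non-interactively from the jump host once the fstab copy from sub-step a is in place. The following is only a sketch and assumes passwordless sudo for azureuser on the node:

      ssh -i id_rsa azureuser@<node-name> 'sudo bash -c "mkdir -p /opt/arcsight; grep /opt/arcsight /tmp/fstab >> /etc/fstab; mount -a"'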

  13. Scale up the itom-postgresql pod by running the following command:

    kubectl scale deployment/itom-postgresql -n core --replicas=1

  14. Before proceeding with this step, make sure that the AKS upgrade command executed in Steps 9 and 10 has completed successfully. Then verify that the nodes are labeled correctly after the AKS upgrade.

    a. Run the following command and compare the labels returned with the node label values you recorded in the table in Step 5:

      kubectl get nodes -L=fusion,kafka,th-platform,th-processing,zk -o wide
    b. (Conditional) If Intelligence has been deployed, run the following command and compare the labels with the values you recorded in the data collection table:

      kubectl get nodes -L=intelligence,intelligence-datanode,intelligence-namenode,intelligence-spark -o wide
    c. (Conditional) Add any additional Intelligence labels as required for your environment.

    d. (Conditional) If Intelligence has been deployed, verify that the intelligence-namenode label is applied to the same node as before the AKS upgrade. If the node needs to be relabeled, run the following command:

      kubectl label --overwrite node <node-name> intelligence-datanode=yes intelligence-spark=yes intelligence=yes intelligence-namenode=yes fusion=yes zk=yes kafka=yes th-processing=yes th-platform=yes
    e. (Conditional) If the intelligence-namenode label is applied to any other worker node, then remove it with the following command:

      kubectl label --overwrite node <node-name> intelligence-namenode-
  15. Verify that services are running after the AKS upgrade, as follows.

    a. On the jump host, save the arcsight-installer namespace in an environment variable for use in later commands:

      NS=$(kubectl get namespaces | grep arcsight-installer | awk '{print $1}')
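
      To confirm that the variable was set, you can echo it; it should print a single namespace name that contains arcsight-installer (this check is an addition, not part of the original procedure):

        echo $NS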

    b. From the jump host, verify that all pods are running and ready by running this command:

      watch -n1 "kubectl -n $NS get pods -o custom-columns=NAMESPACE:metadata.namespace,POD:metadata.name,PodIP:status.podIP,ContainersReadiness:status.containerStatuses[*].ready"
      If there are still non-ready containers (pods with a value of false in the ContainersReadiness column) after approximately 10 minutes, then check the workaround described here.
    c. Verify access to the OMT portal, Fusion, Kafka Manager, and, if deployed, Intelligence, by opening each of the following URLs in a browser:

      https://<hostname>:5443 
      https://<hostname>:443 
      https://<hostname>/th/cmak
      If Intelligence has been deployed:   
      https://<hostname>/interset
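
      If a browser is not immediately available, you can also check basic reachability from the jump host with curl; receiving any HTTP status code (rather than a connection error) indicates that the endpoint is responding. This is a generic check, not part of the original procedure:

        curl -k -s -o /dev/null -w "%{http_code}\n" https://<hostname>:5443
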
    d. Verify in Kafka Manager that the topic events count continues to increase with time.

    e. For each cluster node labeled to run Kafka pods (kafka:yes), verify that all topics are located in the Kafka folder by running the following commands:

      ssh -i id_rsa azureuser@<node-name>
      ls /opt/arcsight/k8s-hostpath-volume/th/kafka
      The expected output of the previous command should show folders for some of the existing topics, named in the format topic_name-N, where N is a partition number of the topic.
    f. Using the table you previously prepared, verify that each node's broker ID is correct by running the following command:

      kubectl exec th-zookeeper-0 -n $NS -- /bin/bash -c "/usr/bin/zookeeper-shell th-zookeeper-0.th-zk-sts:2181 ls /brokers/ids"
      The output should show the original broker IDs. If there are any mismatches between the reported broker IDs and the IDs in your data collection table, then please take note of them and use the workaround described here.
    g. Validate ArcSight Database event ingestion by connecting to your database initiator node. Navigate to the arcsight-db-tools directory and execute the following command. Verify that the events value is increasing with time.

      watch ./kafka_scheduler events

  16. (Conditional) If Intelligence has been deployed, restart hdfs-namenode and hdfs-datanode using the following command:

    kubectl delete pods -n $NS $(kubectl get pods -n $NS -o wide | grep "hdfs-" | cut -d ' ' -f1)
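
    After the delete, the pods are recreated automatically; you can confirm that they return to the Running state with a generic check such as the following:

      kubectl get pods -n $NS | grep "hdfs-"
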
  17. (Conditional) If Intelligence has been deployed, check the health of the elasticsearch cluster by running the following command:

    curl -k -XGET "https://elastic:<elasticsearch password>@<any_node_name>:31092/_cat/health?v=true"
    a. The cluster status returned by the previous command should show green. If the status is red, restart the elasticsearch pods using the following commands:

      kubectl scale statefulset elasticsearch-master -n $NS --replicas=0
      kubectl scale statefulset elasticsearch-data -n $NS --replicas=0 
      kubectl scale statefulset elasticsearch-master -n $NS --replicas=1 
      kubectl scale statefulset elasticsearch-data -n $NS --replicas={replica count}

  18. (Conditional) If Intelligence has been deployed, ensure that the Transformation Hub pods are in the running state, and then restart interset-logstash by running the following commands:

    kubectl -n $NS scale statefulset interset-logstash --replicas=0
    kubectl -n $NS scale statefulset interset-logstash --replicas={replica count}

  19. (Conditional) Cycle through the process as many times as necessary to upgrade to the supported AKS version, upgrading one AKS version per cycle.
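
    Before starting another cycle, you can confirm which Kubernetes version the cluster is currently running; this is a standard az CLI query rather than part of the documented procedure:

      az aks show --resource-group <myResourceGroup> --name <myAKSCluster> --query kubernetesVersion --output tsv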

Next Step: Upgrading ESM