Failure Modes

All Enterprise Server Cluster members depend on the Global Lock Manager (GLM) for the successful completion of all active jobs, so a catastrophic GLM failure affects every member of the cluster. If the connection loss is permanent, the system administrator may need to take action on both the GLM and the client.

GLM failure indicated on the Enterprise Server Cluster client

On an Enterprise Server Cluster client, GLM failure can be identified by the following console message:
CASCS1117I Connection to GLM_APPLID (sysid GLM_SYSID) lost (protocol P2P)
where:
  • GLM_APPLID is the GLM APPLID
  • GLM_SYSID is the GLM SYSID
This message indicates that the connection to the GLM has been lost. The connection may be lost because of a connection issue, but a catastrophic failure of the GLM will also result in the same message. With a catastrophic GLM failure, this connection will only be re-established after the GLM is restarted. Furthermore, with a catastrophic GLM failure, this message is returned immediately when the cluster client makes its first connection attempt. Otherwise, the system continues to retry the connection for up to the duration specified in the ES_GLM_TIMEOUT environment variable. When this limit is reached, the following console log output is displayed:
CASCS3032S Connection to ES Cluster manager ESCLMGR (sysid MST1) is disabled, verify and release global locks on ES cluster manager.
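
The two console messages above can be picked out of a console log programmatically. The following is a minimal Python sketch, not a supplied tool: the function name classify and the regular expressions are illustrative, and the patterns assume that a real log line contains the message text exactly as quoted in this section.

```python
import re

# Patterns derived from the two messages quoted above. The exact spacing
# in a real console log is an assumption.
LOST = re.compile(
    r"CASCS1117I Connection to (?P<applid>\S+) "
    r"\(sysid (?P<sysid>\S+)\) lost \(protocol (?P<protocol>\S+)\)"
)
DISABLED = re.compile(
    r"CASCS3032S Connection to ES Cluster manager (?P<applid>\S+) "
    r"\(sysid (?P<sysid>\S+)\) is disabled"
)


def classify(line):
    """Return (event, applid, sysid) for a GLM connection message, else None."""
    for event, pattern in (("lost", LOST), ("disabled", DISABLED)):
        match = pattern.search(line)
        if match:
            return event, match.group("applid"), match.group("sysid")
    return None
```

A "lost" event may still recover without intervention, whereas a "disabled" event means that the retry limit has been reached and the global locks on the GLM need to be verified.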

Remedial actions

Prior to any GLM restart, it is vital that you identify and correct the reason for the failure. If necessary, you should collect CAS data for further failure analysis.

On an Enterprise Server Cluster client, active jobs can continue to execute until they attempt to DEQUEUE their global locks. If the GLM is restarted before the DEQUEUE is requested, the DEQUEUE will execute successfully. Otherwise, the global DEQUEUE will fail, but the failure will be ignored and the local DEQUEUE will execute successfully. The JCL can therefore execute successfully, and no client-side actions are required.
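
The DEQUEUE behavior described above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the product's implementation: GlmUnavailable, release_locks and the global_dequeue callable are hypothetical names introduced only for illustration.

```python
class GlmUnavailable(Exception):
    """Hypothetical error raised when the GLM cannot be reached."""


def release_locks(resources, local_locks, global_dequeue):
    """Release each resource globally (best effort) and locally (always).

    Returns the resources whose global DEQUEUE failed and was ignored.
    """
    ignored = []
    for resource in resources:
        try:
            global_dequeue(resource)      # fails while the GLM is down
        except GlmUnavailable:
            ignored.append(resource)      # failure is ignored, job continues
        local_locks.discard(resource)     # local DEQUEUE always executes
    return ignored
```

In this model the job always finishes with no local locks held. Any resource returned in the ignored list corresponds to a global lock that may still be recorded on the GLM.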

Removing an Enterprise Server Cluster client from the cluster

An Enterprise Server region may need to be taken out of a cluster following a catastrophic GLM failure, following an Enterprise Server Cluster client failure, or for some other reason. In each case, administrator action is required.

Actions to take on the GLM prior to a GLM restart
When the GLM restarts after a failure, all of the cluster clients that were previously connected are marked as ACTIVE in the CASGLM.LCK file. Consequently, the GLM requests and waits for each cluster client to send its active global locks, and places itself in the NOWORK state while it awaits the responses. The state switches back to ACTIVE as soon as the GLM has received all of the client responses. If one or more clients do not answer, the GLM cannot proceed with its work.

The NOWORK state is displayed in the GLM.

On the Server Information page of ESMAC (CASRDO5), the following state information is displayed:

When all Enterprise Server Cluster clients marked as ACTIVE in CASGLM.LCK have reconnected, the following will be displayed:

If a client does not reply, the following message is displayed in the GLM's console log:
CASKC6008S No reply received for lock request from ESCLSLV2. GLM work halted until reply on ESMAC control page is provided.
If you attempt to execute a JCL job on an Enterprise Server Cluster client while the GLM is in NOWORK state, the following sequence of messages will be displayed:
ESCL1   CASCS3036E GLM ESCLMGR (sysid MST1) is in "NOWORK" state, waiting for all ES Cluster clients to send their locks. Check message KC6008S on the GLM. 10:36:43
ESCL1   JCLCM0188I JOB02312 LCKSLEEP JOB  STARTED 10:36:43
ESCL1   JCLCM2000E JOB02312 LCKSLEEP Unable to acquire global lock for job LCKSLEEP. 10:36:43
ESCL1   JCLCM0181S JOB02312 LCKSLEEP JOB  ABENDED - COND CODE S922 10:36:43

To allow the GLM to resume lock processing, a reply is expected on the CONTROL page of the GLM's ESMAC screen (CASRDO11):

To remove the Enterprise Server Cluster client ESCLSLV2 from the cluster, clear the checkbox.
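
The restart behavior described in this section can be modeled as a small state machine. The Python sketch below is illustrative only: the class GlmRestartModel and its methods are hypothetical, and it assumes that removing a client from the cluster is what allows the GLM to leave the NOWORK state when that client never replies.

```python
class GlmRestartModel:
    """Toy model of a GLM that has just restarted after a failure."""

    def __init__(self, active_clients):
        # Clients marked ACTIVE in CASGLM.LCK at restart.
        self.pending = set(active_clients)
        self.locks = {}

    @property
    def state(self):
        # NOWORK until every pending client has answered or been removed.
        return "NOWORK" if self.pending else "ACTIVE"

    def receive_locks(self, client, locks):
        """A client reconnected and sent its active global locks."""
        self.locks[client] = list(locks)
        self.pending.discard(client)

    def remove_client(self, client):
        """Administrator reply: take an unresponsive client out of the cluster."""
        self.pending.discard(client)
        self.locks.pop(client, None)
```

For example, with clients ESCL1 and ESCLSLV2 marked as ACTIVE, the model stays in NOWORK after ESCL1 has sent its locks, and switches to ACTIVE only when ESCLSLV2 either replies or is removed.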

Lock removal following a permanent connection loss to the GLM

In the event of a permanent connection loss to the GLM, both client and GLM actions are required.

The following scenario illustrates a situation in which global lock removal is necessary.

Members of the Enterprise Server Cluster:
  • The GLM
  • Enterprise Server Cluster client 1: ESCLCLT1
  • Enterprise Server Cluster client 2: ESCLCLT2
Scenario
All members have started and have successfully completed the handshake. JCL is executing in both regions.
Client 1 state
On ESCLCLT1, JCL1 is executing in PID1 and holds exclusive locks on resource 1 and resource 2. ESCLCLT1 PID1 JCL1 is in the ACTIVE state in the active queue.
Client 2 state
On ESCLCLT2, JCL2 is executing in PID2 and requests locks for resource 1. ESCLCLT2 PID2 JCL2 is in the WAIT state in the active queue.
A permanent connection failure occurs between the GLM and ESCLCLT1. The GLM is still active and the connection between ESCLCLT2 and the GLM is active.

If JCL1 attempts to DEQUEUE its locks while the connection from ESCLCLT1 to the GLM is down, the Enterprise Server Cluster client layer retries for the duration set by ES_GLM_TIMEOUT. When this limit is reached, the Enterprise Server Cluster layer marks the connection to the GLM as disabled, and all global ENQUEUE/DEQUEUE requests are rejected. The JCL terminates successfully and the locks, including the global locks on the Enterprise Server Cluster client, are released, but the global locks remain active on the GLM.
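
The retry-then-disable behavior can be sketched as follows. This is a simplified Python model, not the client layer's actual code. The class GlmLink, the one-unit retry interval and the injected clock and sleep functions are assumptions made for illustration, and the timeout argument stands in for ES_GLM_TIMEOUT, whose unit is not stated here.

```python
class GlmLink:
    """Toy model of a client's connection to the GLM."""

    def __init__(self, send, timeout, clock, sleep, interval=1.0):
        self.send = send            # callable(op, resource); raises ConnectionError
        self.timeout = timeout      # stands in for ES_GLM_TIMEOUT (unit assumed)
        self.clock = clock          # callable returning the current time
        self.sleep = sleep          # callable(duration)
        self.interval = interval    # hypothetical pause between retries
        self.disabled = False

    def request(self, op, resource):
        """Send a global ENQUEUE/DEQUEUE; return True on success."""
        if self.disabled:
            return False            # connection disabled: request rejected
        deadline = self.clock() + self.timeout
        while True:
            try:
                self.send(op, resource)
                return True
            except ConnectionError:
                if self.clock() >= deadline:
                    self.disabled = True   # retry limit reached
                    return False
                self.sleep(self.interval)
```

Once the link is disabled, later requests are rejected without contacting the GLM, which is why the GLM never learns that the job has released its locks.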

On the Enterprise Server Cluster client
The system administrator needs to decide whether JCL1 should run to completion or be cancelled.
On the Enterprise Server Cluster GLM
Whatever the decision, the locks held by JCL1 on the GLM need to be removed for the JCL2 job to run to completion on ESCLCLT2.
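
The effect of the stranded locks in this scenario can be illustrated with a toy lock table. The Python sketch below is a model only and does not represent how the GLM stores its locks or how locks are removed in practice. The class LockTable and its methods are hypothetical, and the model handles exclusive locks only.

```python
from collections import deque


class LockTable:
    """Toy model of exclusive global locks held on the GLM."""

    def __init__(self):
        self.holder = {}    # resource -> (client, job) currently ACTIVE
        self.waiters = {}   # resource -> deque of (client, job) in WAIT

    def enqueue(self, client, job, resource):
        """Request an exclusive lock; return the resulting state."""
        owner = (client, job)
        if resource not in self.holder:
            self.holder[resource] = owner
            return "ACTIVE"
        self.waiters.setdefault(resource, deque()).append(owner)
        return "WAIT"

    def remove_locks(self, client):
        """Remove every lock held by a client; return the promoted waiters."""
        promoted = []
        held = [r for r, (c, _) in self.holder.items() if c == client]
        for resource in held:
            del self.holder[resource]
            queue = self.waiters.get(resource)
            if queue:
                next_owner = queue.popleft()
                self.holder[resource] = next_owner
                promoted.append((resource, next_owner))
        return promoted
```

In the scenario above, JCL2 on ESCLCLT2 remains in the WAIT state for resource 1 until the locks recorded for ESCLCLT1 are removed, at which point it becomes the holder.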

The next section describes the use of the caslock command and its equivalent ESMAC page (CASRDO33). Both tools provide the capability to browse and remove locks, together with the ability to take a cluster offline.