Recovery

Depending on the failure scenario, recovery can involve carrying out several planned remedial activities.
Typical activities involved in the recovery of a failed Enterprise Server Cluster include:
  • Troubleshooting a network connection that should be reactivated as soon as possible, and ideally before the connection is marked as disabled.
  • Collecting information to allow later analysis of the failure.
  • Releasing the locks held by an Enterprise Server Cluster client.
  • Restoring a database.
The main objective of any recovery process is to limit work disruption as much as possible. Disruption can be minimised by careful preparation for a cluster failure:
  • Identify the possible failure scenarios and prepare for them.
  • Ensure that your preparation work is documented, and that the system administrator and/or operator knows what needs to be done at the point of failure, so that the duration of the failure is kept to a minimum.

Recovery scenarios

There are two primary causes of an Enterprise Server Cluster failure:
  • A permanent connection failure to the Global Lock Manager (GLM).
  • A catastrophic GLM failure, caused by, for example, a disk failure, memory corruption, or a resource shortage.
Note:

The system tolerates non-permanent connection failures for the period defined by the environment variable ES_GLM_TIMEOUT. Once a connection failure lasts longer than the period set by this variable, the state of the connection between the cluster client and the GLM is marked as disabled.

At this point, any attempt to acquire global locks fails and the following is displayed in the JCL job log:
JCLCM2000E Unable to acquire global lock for job JRX0033.
JCLCM0181S JOB ABENDED - COND CODE S922
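
When collecting information for later analysis, it can help to list the jobs that were refused a global lock. The following Python sketch is not part of the product: it assumes the job log is available as plain text and that the JCLCM2000E message always has the wording shown above. The function name find_lock_failures is hypothetical.

```python
import re

# Pattern built from the JCLCM2000E message text shown in the job log excerpt.
# Assumption: job names consist of letters, digits, underscore, #, $ or @.
LOCK_FAILURE = re.compile(
    r"JCLCM2000E Unable to acquire global lock for job ([\w#$@]+)"
)


def find_lock_failures(log_text):
    """Return the job names reported in JCLCM2000E messages, in log order."""
    return LOCK_FAILURE.findall(log_text)


if __name__ == "__main__":
    sample = (
        "JCLCM2000E Unable to acquire global lock for job JRX0033.\n"
        "JCLCM0181S JOB ABENDED - COND CODE S922\n"
    )
    print(find_lock_failures(sample))  # prints ['JRX0033']
```

The list of affected jobs gives the operator a starting point for deciding which jobs to resubmit once the connection is enabled again.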

The state of the connection will be reset to enabled as soon as the GLM reconnects.
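
The behaviour described in this note can be pictured as a small state model. The Python sketch below is an illustration only, not the product's implementation: the class GlmConnection and its methods are hypothetical, and the time unit of ES_GLM_TIMEOUT is not stated here, so the example simply uses the same arbitrary unit for the timeout and for the clock values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlmConnection:
    """Hypothetical model of the client-to-GLM connection state."""

    timeout: float                        # stands in for ES_GLM_TIMEOUT (unit assumed)
    failed_since: Optional[float] = None  # time at which the current failure began
    state: str = "enabled"

    def connection_lost(self, now: float) -> None:
        # Record only the start of a failure; repeated calls do not move it.
        if self.failed_since is None:
            self.failed_since = now

    def tick(self, now: float) -> None:
        # Mark the connection as disabled once the failure has lasted
        # longer than the tolerated period.
        if self.failed_since is not None and now - self.failed_since > self.timeout:
            self.state = "disabled"

    def reconnected(self) -> None:
        # A reconnect resets the state to enabled.
        self.failed_since = None
        self.state = "enabled"

    def lock_requests_fail(self) -> bool:
        # Global lock requests fail while the connection is disabled.
        return self.state == "disabled"


if __name__ == "__main__":
    conn = GlmConnection(timeout=30)
    conn.connection_lost(now=0)
    conn.tick(now=30)
    print(conn.state)   # prints enabled (the failure has not exceeded the timeout)
    conn.tick(now=31)
    print(conn.state)   # prints disabled
    conn.reconnected()
    print(conn.state)   # prints enabled
```

The model makes one point visible: a connection that is restored within the tolerated period never reaches the disabled state, which is why troubleshooting the network connection early is worthwhile.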