10.7 Enabling Monitoring and Configuring the Monitor Script

Resource monitoring allows OES Cluster Services to detect a resource failure independently of its ability to detect node failures. Monitoring is disabled by default. It is enabled separately for each cluster resource.

10.7.1 Understanding Resource Monitoring

When you enable resource monitoring, you must specify a polling interval, a failure rate, a failure action, and a timeout value. These settings control how error conditions are resolved for the resource.

Polling Interval

The monitor script runs at a frequency specified by the polling interval. By default, it runs every minute when the resource is online. You can specify the polling interval in minutes or seconds. The polling interval applies only to a given resource.

Failure Rate

The failure rate is the maximum number of failures (Maximum Local Failures) detected by the monitor script during a specified amount of time (Time Interval).

A failure action is initiated when the resource monitor detects that the resource has failed more times than the Maximum Local Failures value allows during the specified time interval. For each failure up to and including that maximum, Cluster Services automatically attempts to unload and load the resource. The progress and output of executing a monitor script are appended to the /var/opt/novell/log/ncs/<resource_name>.monitor.out file.

For example, if you set the failure rate to 3 failures in 10 minutes, the failure action is initiated if the resource fails 4 times in a 10-minute period. For the first 3 failures, Cluster Services automatically attempts to unload and load the resource.
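The failure rate arithmetic above can be sketched as a small shell function. This is only an illustration: failure_action_due is a hypothetical helper, not a Cluster Services command, and it assumes the time interval is a sliding window measured back from the most recent failure, which the text does not state.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of Cluster Services.
# Assumption: the Time Interval is a sliding window ending at NOW.
# Usage: failure_action_due MAX_LOCAL_FAILURES INTERVAL_SECONDS NOW TIMESTAMP...
# Returns 0 (true) when the failures inside the window exceed the maximum.
failure_action_due() {
    local max=$1 interval=$2 now=$3
    shift 3
    local count=0 ts
    for ts in "$@"; do
        # Count only failures that fall inside the time interval.
        if [ "$ts" -le "$now" ] && [ $(( now - ts )) -lt "$interval" ]; then
            count=$(( count + 1 ))
        fi
    done
    # The failure action initiates only when the count exceeds the maximum.
    [ "$count" -gt "$max" ]
}

# Failure rate of 3 failures in 10 minutes (600 seconds).
if failure_action_due 3 600 1000 500 700 900; then
    echo "third failure: failure action"
else
    echo "third failure: unload and load the resource"
fi

if failure_action_due 3 600 1000 500 700 900 1000; then
    echo "fourth failure: failure action"
else
    echo "fourth failure: unload and load the resource"
fi
```

With these sample timestamps, the third failure still results in an unload and load attempt, and the fourth failure within the window initiates the failure action.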

Failure Action

The Failover Action indicates whether you want the resource to be set to a comatose state, to migrate to another server, or to reboot the hosting node (without synchronizing or unmounting the disks) if a failure action initiates. With resource monitoring, the Start, Failover, and Failback Modes have no effect on where the resource migrates. This means that a resource that has been migrated by the resource monitoring failure action does not migrate back (fail back) to the node it migrated from unless you manually migrate it back.

Set Resources as Comatose: (Default) If the failure action initiates, the resource is placed in a comatose state. Administrator action is required to take the resource offline, resolve the issue, and bring it online again on the same or a different node.

Migrate the Resource Based on the Preferred Nodes List: If the failure action initiates and the resource is on its most preferred node, the resource migrates to the next available node in its Preferred Nodes list, which you previously ordered according to your failover order preferences. The resource is not automatically failed back to the original node. Administrator action is required to cluster migrate the resource back to the original node, as desired. Each time a failure action triggers a failover, the resource migrates to a different node, according to the order in its Preferred Nodes list and the availability of the nodes.

Reboot the Hosting Node without Syncing or Unmounting Disks: If the failure action initiates, the hosting node is rebooted, and every resource on that node fails over to the next available node in its Preferred Nodes list. This is a hard reboot, not a graceful one. The reboot option is normally used only for a mission-critical cluster resource that must remain available. The resources are not automatically failed back to the original node. Administrator action is required to cluster migrate them back to the node, as desired.

Timeout Value

The timeout value determines how much time the script is given to complete. If the script does not complete within the specified time, the configured failure action is initiated. Cluster Services marks the process as failed right after the defined timeout expires, but it must wait for the process to conclude before it can start other resource operations.

The timeout value is applied only when the resource is migrated to another node. It is not used during resource online/offline procedures.

How Resource Monitoring Works

1. The monitor script runs at the frequency you specify as the polling interval.
2. There are two conditions that trigger a response by OES Cluster Services:
   - An error is returned. Go to Step 3.
   - The script times out, and the process fails. Go to Step 4.
3. Cluster Services tallies the error occurrence, compares it to the configured failure rate, then does one of the following:
   - Total errors in the interval are less than or equal to the Maximum Local Failures: Cluster Services tries to resolve the error by offlining the resource, then onlining the resource.

     If this problem resolution effort fails, Cluster Services goes to Step 4 immediately regardless of the failure rate condition at that time.
   - Total errors in the interval are more than the Maximum Local Failures: Go to Step 4.
4. Cluster Services initiates the configured failure action. Possible actions are:
   - Puts the resource in a comatose state
   - Migrates the resource to another server
   - Reboots the hosting node (without synchronizing or unmounting the disks)
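
The decision flow above can be summarized as a short shell sketch. The function name monitor_response and its arguments are hypothetical and exist only for this illustration; Cluster Services makes these decisions internally, and the sketch mirrors only the rules stated in this section.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not a Cluster Services command.
# Usage: monitor_response RESULT ERRORS_IN_INTERVAL MAX_LOCAL_FAILURES RELOAD_OK
#   RESULT     ok | error | timeout  (outcome of one monitor script run)
#   RELOAD_OK  yes | no              (whether offline-then-online succeeded)
monitor_response() {
    local result=$1 errors=$2 max=$3 reload_ok=$4
    case "$result" in
        ok)
            echo "no action" ;;
        timeout)
            # A timeout goes straight to the failure action.
            echo "failure action" ;;
        error)
            if [ "$errors" -gt "$max" ]; then
                # More errors than the Maximum Local Failures.
                echo "failure action"
            elif [ "$reload_ok" = "yes" ]; then
                # Within the limit, and offline-then-online resolved it.
                echo "resource reloaded"
            else
                # Within the limit, but the reload attempt itself failed.
                echo "failure action"
            fi ;;
        *)
            echo "unknown result: $result" >&2
            return 2 ;;
    esac
}

monitor_response error 2 3 yes    # prints "resource reloaded"
monitor_response error 2 3 no     # prints "failure action"
monitor_response error 4 3 yes    # prints "failure action"
monitor_response timeout 0 3 yes  # prints "failure action"
```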

10.7.2 Configuring Resource Monitoring

The resource monitoring function allows you to monitor the health of a specified resource by using a script that you create or customize. If you want OES Cluster Services to check the health status of a resource, you must enable and configure resource monitoring for that resource. Enabling resource monitoring requires you to specify a polling interval, a failure rate, a failure action, and a timeout value.

If you are creating a new cluster resource, the Monitor Script page should already be displayed. You can start with Step 6.

1. In iManager, select Clusters > My Clusters.
2. Select the cluster that you want to manage.

   If the cluster does not appear in your list, add the cluster to your list as described in Section 8.2, Setting Up a Personalized List of Clusters to Manage.
3. Select the Cluster Options tab.
4. Click the cluster resource to open its Properties page.

   You can also select the check box next to the resource, then click Details.
5. On the Properties page, click the Monitoring tab.
6. Select the Enable Resource Monitoring check box to enable resource monitoring for the selected resource.

   Resource monitoring is disabled by default.
7. For the polling interval, specify how often you want the resource monitor script for this resource to run.

   You can specify the value in minutes or seconds.
8. Specify the maximum number of failures (Maximum Local Failures) allowed during the specified amount of time (Time Interval).

   See Failure Rate.
9. Specify the Failover Action by indicating whether you want the resource to be set to a comatose state, to migrate to another server, or to reboot the hosting node (without synchronizing or unmounting the disks) if a failure action initiates. The reboot option is normally used only for a mission-critical cluster resource that must remain available.

   See Failure Action.
10. Click the Scripts tab, then click the Monitor Script link.
11. Edit or add the necessary commands to the script to monitor the resource on the server.

    The resource templates included with Cluster Services for Linux include resource monitor scripts that you can customize.

    You also need to personalize the script by replacing variables with actual values for your specific configuration, such as the mount point, IP address, volume group name, file system type, and mount device.

    You can use the same commands that would be used at the Linux terminal console. For example, see Section 10.7.3, Monitoring Services that Are Critical to a Resource.
12. Specify the Monitor Script Timeout value, then click Apply to save the script.

    The timeout value determines how much time the script is given to complete. If the script does not complete within the specified time, the failure action you chose in Step 9 initiates.
13. Do one of the following:
    - If you are configuring a new resource, click Next, then continue with Section 10.9.2, Setting the Start, Failover, and Failback Modes for a Resource.
    - Click Apply to save your changes.

      Changes for a resource’s properties are not applied while the resource is loaded or running on a server. Apply the updated script by taking the resource offline and then bringing it online on the same node. Alternatively, after the updated scripts have been synchronized from eDirectory to the source and destination nodes, the updated scripts are used automatically on system failover or cluster migration. For more information, see Section 10.8, Applying Updated Resource Scripts by Offline/Online, Failover, and Migration.

10.7.3 Monitoring Services that Are Critical to a Resource

In addition to monitoring the clustered service or storage objects, a resource monitor script can be used to monitor the status of services that are critical to the resource, such as Linux User Management (LUM), eDirectory, or other services.

IMPORTANT: NCS provides the ability to monitor the status of the eDirectory daemon (ndsd) at the NCS level. It is disabled by default. The monitoring can be set independently on each node. It runs whenever NCS is running on the node. See Section 8.8, Configuring NCS to Monitor the eDirectory Daemon (ndsd).

If you enable NDSD monitoring at the NCS level, we recommend that you remove (or comment out) the eDirectory status check in individual monitor scripts to avoid excessive checking.

The resource monitor script runs only on the cluster server where the cluster resource is currently online. The script does not monitor the critical services on its assigned cluster server when the resource is offline. The monitor script does not monitor critical services for any other cluster node. Each monitor script acts independently, so be aware of the potential traffic you might generate if you include a status check in multiple monitoring scripts.

LUM Monitoring Example

You can use the systemctl status namcd.service command to monitor whether the Linux User Management service is running. Add the following command to the resource monitor script:

# (optional) status of the Linux User Management service 
exit_on_error systemctl status namcd.service

Alternatively, you can use the namcd status command to monitor whether the Linux User Management service is running and to automatically restart namcd if its daemon is not running. However, namcd creates messages in /var/log/messages with each check. Add the following command to the resource monitor script:

# (optional) status of the LUM service and restart if it is not loaded or running
exit_on_error namcd status

eDirectory Monitoring Example

You can monitor the state of the eDirectory service in an individual monitoring script by adding the following command:

# (optional) status of the eDirectory service
exit_on_error systemctl status ndsd.service

For OES 11 SP2 and later, you can alternatively monitor NDSD at the NCS level on each node. See Section 8.8, Configuring NCS to Monitor the eDirectory Daemon (ndsd).
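
To see how chained status checks such as the ones above affect the result of a monitor script, the following sketch can be run on any Linux system. It rests on assumptions: the exit_on_error definition here is a stand-in written for this illustration (in a real monitor script the function is supplied by Cluster Services and might behave differently), monitor_script is a hypothetical wrapper, and the true and false commands stand in for real checks such as systemctl status namcd.service.

```shell
#!/bin/bash
# Stand-in for illustration only. In a real monitor script, exit_on_error
# is supplied by Cluster Services; this definition is an assumption.
exit_on_error() {
    "$@" || exit 1
}

# Hypothetical wrapper: runs two checks in order inside a subshell, so that
# "exit" ends only the simulated monitor script, not the calling shell.
# Each argument is a command standing in for a real status check.
monitor_script() (
    exit_on_error "$1"   # for example, the LUM status check
    exit_on_error "$2"   # for example, the eDirectory status check
    exit 0
)

if monitor_script true true; then
    echo "both checks pass: monitor script reports success"
fi

if ! monitor_script true false; then
    echo "second check fails: monitor script reports an error"
fi
```

Because the first failing check ends the script with a nonzero status, each added service check is another condition that can count toward the failure rate for the resource.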

10.7.4 Example Monitor Scripts

The resource templates included with OES Cluster Services for Linux include resource monitor scripts that you can customize.

Example monitor scripts are available in the following sections:

Clustered NSS pool:
- Section 12.8, Configuring a Monitor Script for the Shared NSS Pool
Clustered LVM volume group: