The Failover and Recovery Processes

The purpose of a high availability group is to ensure that, if the hardware or software of your primary server fails, access to your data is disrupted as little as possible. During failover, a standby server in the group assumes the role of primary server and continues to provide access to your data.

Failover can occur automatically or manually. Manual failover is required only when the group has fewer than three active servers; three active servers means one primary server and two standby servers.

When automatic failover occurs, the server to be promoted to primary server is chosen using internal metrics based on which standby is the most up to date, each server's average response time, and the order in which the servers are listed in the ES_HA_VSAM environment variable. When the new primary has been established, any in-flight transactions are rolled back, and the following message is produced:

9/130 file status (connection to Fileshare re-established)
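
The selection criteria described above can be pictured with a small sketch. The actual metrics are internal to the product and their weighting is not documented here, so the strict priority order used below (most up to date first, then response time, then list position) is an assumption, and the Standby type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Standby:
    name: str
    last_applied_txn: int    # hypothetical: highest transaction number applied
    avg_response_ms: float   # hypothetical: average response time
    list_position: int       # position of the server in ES_HA_VSAM (0 = first)


def pick_new_primary(standbys):
    """Return the standby to promote, under the assumed priority order."""
    # Assumption: prefer the most up-to-date standby, break ties on the
    # lowest average response time, then on the earliest list position.
    return min(
        standbys,
        key=lambda s: (-s.last_applied_txn, s.avg_response_ms, s.list_position),
    )
```

With this ordering, a standby that is behind on transactions is never chosen over one that is fully up to date, regardless of how quickly it responds.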

When a primary or standby server leaves the high availability group, it can re-join once it has been restarted and has recovered sufficiently. Recovery is a two-stage process. First, the server performs a rollback recovery using the data in its transaction log file, and any in-flight transactions are restored to an in-flight state. Second, the server requests from the primary server all of the transactions that have occurred since it was removed from the group. Once these transactions have been received and applied to the transaction log, the server can re-join the group as a standby server.
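
The two recovery stages can be outlined as follows. This is only an illustrative model, not the product's implementation: the log representation (a list of dictionaries with "id" and "state" keys) and the function name are hypothetical.

```python
def recover_and_rejoin(local_log, primary_log):
    """Model the two-stage recovery of a server re-joining the group."""
    # Stage 1: rollback recovery from the server's own transaction log.
    # Transactions that were in flight are restored to an in-flight state.
    in_flight = [t["id"] for t in local_log if t["state"] == "in-flight"]
    last_seen = max((t["id"] for t in local_log), default=0)

    # Stage 2: ask the primary for every transaction that occurred after
    # this server left the group, and apply them to the transaction log.
    missed = [t for t in primary_log if t["id"] > last_seen]
    updated_log = local_log + missed

    # Only after catching up does the server re-join, always as a standby.
    return {
        "role": "standby",
        "log": updated_log,
        "in_flight": in_flight,
        "caught_up": len(missed),
    }
```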

You can configure this process using the iFileshare exit procedure; see IFSEXITPROC - The iFileshare Exit Procedure for more details.

If a server's removal leaves the group with fewer than the minimum number of servers, file replication still occurs within the group, but automatic failover is unavailable until the group is restarted with at least the minimum number of servers.

To ensure that your client application can continue processing when the recovery process completes, you must configure your Fileshare client to reconnect automatically when a connection is lost. This functionality is available only if you have added the following lines to the Fileshare client configuration file (fhredir.cfg):

/um
/ra <number-of-attempts>
/rd <delay-between-attempts>
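
As a rough picture of what a retry count and a delay between attempts imply, the sketch below models a bounded reconnect loop. It is not the Fileshare client's code: the reconnect function, its parameters, and the connect callable are hypothetical, and the unit of the delay is left unspecified, as it is above.

```python
import time


def reconnect(connect, attempts, delay, sleep=time.sleep):
    """Try connect() up to `attempts` times, waiting `delay` between tries.

    Returns the number of the attempt that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if connect():
            return attempt
        # Wait only if another attempt will follow.
        if attempt < attempts:
            sleep(delay)
    return None
```

In this model, a larger number of attempts or a longer delay gives a failover more time to complete before the client gives up.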

If the new primary server is on a remote machine, you must also configure your mf-client.dat configuration file so that the client can locate iFileshare on the new machine; see Configuring Remote Servers in a High Availability Group.