WiDirect Failover

In some scenarios it may be advisable to run multiple WiDirect units side by side so that one can take over if the other fails. If a WiDirect fails, the other one will perform all the WiDirect functions. The steps in this document show how to configure shared storage between two WiDirect or WiClient units so that either may take over at any time. The takeover can be performed manually, or the units can be configured to detect automatically when the other has failed.

It is important that the WiDirect operator understand all the steps in this guide, including the recovery steps in case something goes wrong. If the devices are configured incorrectly, the network can go down even though neither device has failed. Proper network monitoring is important for catching such problems.

Overview

Setting up multiple WiDirects for failover is complicated, but it provides benefits in the event that one of the units fails. Only one WiDirect is active at any time, but the second one keeps a constant backup of all the important data from the first. If one WiDirect fails, the other is still able to manage the network. Each WiDirect has its own local IP address on the eth0 and eth1 interfaces. The two WiDirects also share one IP address on each interface, which is held by whichever unit is active.


Configure Hostname

It is important for hostnames to be properly set on both WiDirects. The hostname can be set on the network page in version 2.3 and above.

The examples below use f1.awi6.net and f2.awi6.net as the hostnames of the two servers. f1.awi6.net has IP addresses 10.8.9.123 and 10.4.1.2, and f2.awi6.net has IP addresses 10.8.2.224 and 10.4.1.3. The active device will answer on the shared IP addresses 10.8.1.10 (f.awi6.net) and 10.4.1.1. Replace those IP addresses and hostnames with the actual values you are using.
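
Heartbeat and DRBD both compare the node names in their configuration files with the output of "uname -n", so it can help to confirm the name on each WiDirect before continuing. The following is a minimal sketch that assumes the example hostnames above; the function name is_cluster_node is hypothetical.

```shell
#!/bin/bash
# Example node names from this guide; replace them with your own.
CLUSTER_NODES="f1.awi6.net f2.awi6.net"

# Hypothetical helper: succeed if the given name is one of the cluster nodes.
is_cluster_node() {
    local name="$1" node
    for node in $CLUSTER_NODES; do
        [ "$node" = "$name" ] && return 0
    done
    return 1
}

# Usage on a WiDirect:
#   is_cluster_node "$(uname -n)" || echo "hostname $(uname -n) is not a cluster node"
```

The shared name f.awi6.net is deliberately not in the list, since it belongs to the shared IP address and not to either unit.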

Install Packages

Many of the steps below will require root access to the WiDirect. This command can be run initially to obtain root access:

	su -

Several packages must be installed to configure WiDirect failover. First, create a repository file for the clustering packages:

emacs /etc/yum.repos.d/clusterlabs.repo

Add this text to the text file:

[clusterlabs]
name=High Availability/Clustering server technologies (epel-5)
baseurl=http://www.clusterlabs.org/rpm/epel-5
type=rpm-md
gpgcheck=0
enabled=1

Save the file and run these commands:

wget https://allcitywireless.com/failover/epel-release-5-4.noarch.rpm
rpm -i epel-release-5-4.noarch.rpm
yum remove awicp_reloaders
yum install awicp_reloaders_ha drbd83 kmod-drbd83* heartbeat pacemaker ipmitool

After installing the packages run the reboot command:

reboot

Create Firewall Rules

A number of ports need to be opened for the services to work properly. TCP ports 7788 through 7799 need to be opened for the shared drive functionality to work. UDP port 694 must be opened for the process monitoring services to work. Add these lines to the top portion of the iptables file:

-A INPUT -i eth0 -p tcp -m tcp --dport 7788:7799 --tcp-flags SYN,RST,ACK SYN -j ACCEPT
-A INPUT -i eth0 -p udp -m udp --dport 694 -j ACCEPT

[Image: FailoverIptables.jpg]
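
After editing, a quick check can confirm that both rules made it into the file. This is only a sketch: the helper name has_failover_rules is hypothetical, and the path /etc/sysconfig/iptables in the usage comment is an assumption about where the rules file lives on the WiDirect.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if both failover rules appear in a rules file.
has_failover_rules() {
    local file="$1"
    # DRBD replication ports (TCP 7788-7799)
    grep -q -e '--dport 7788:7799' "$file" &&
    # Heartbeat port (UDP 694)
    grep -q -e '-p udp .*--dport 694 ' "$file"
}

# Usage (path is an assumption):
#   has_failover_rules /etc/sysconfig/iptables && echo "failover rules present"
```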

Configure Local Services

These commands should be run on each box.

service iptables restart
service mysqld stop
chkconfig mysqld off
service dhcpd stop
chkconfig dhcpd off
service dnsmasq stop
chkconfig dnsmasq off
service httpd stop
chkconfig httpd off
rm -rf /etc/rc3.d/*awicp*
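
The stop and disable pairs above follow one pattern, so they can also be written as a loop. This is a sketch of the same commands for the four services listed; the function name disable_local_services is hypothetical, and the iptables restart and the rm command still need to be run separately.

```shell
#!/bin/bash
# Hypothetical helper: stop each service and remove it from the boot sequence.
# Must be run as root on each WiDirect.
disable_local_services() {
    local svc
    for svc in mysqld dhcpd dnsmasq httpd; do
        service "$svc" stop
        chkconfig "$svc" off
    done
}
```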

Create Shared Drive

The two WiDirects replicate a shared volume that holds the data both of them need. At least 2 GB of empty space is available on the hard drive for this volume, and some models may have more. Check with AllCity Wireless support staff for more information. Run these commands on both WiDirects to create the logical volume and open the DRBD configuration file:

	lvm
	lvcreate --size 2G -n LogVol02 VolGroup00
	exit
	emacs /etc/drbd.conf


Below is an example DRBD configuration file. Remember to substitute the proper IP addresses and hostnames. Also enter an email address for split brain notifications. It is important that those notifications be acted on immediately, as they require manual intervention to get the system running again.

resource drbd0 {
	protocol C;

	handlers {
		split-brain "/usr/lib/drbd/notify-split-brain.sh supportemail@domain.com";
		fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
		after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
	}

	startup {
		degr-wfc-timeout 120;
		wfc-timeout      120;
	}

	disk {
		on-io-error detach;
		fencing resource-only;
	}

	net {
		timeout 120;
		connect-int 20;
		ping-int 20;
		max-buffers 2048;
		max-epoch-size 2048;
		ko-count 30;
		cram-hmac-alg "sha1";
		shared-secret "MakeThisSecretSecure";
	}

	syncer {
		rate 10M;
		al-extents 257;
	}

	on f1.awi6.net {
		device /dev/drbd0;
		disk /dev/VolGroup00/LogVol02;
		address 10.8.9.123:7788;
		meta-disk internal;
	}

	on f2.awi6.net {
		device /dev/drbd0;
		disk /dev/VolGroup00/LogVol02;
		address 10.8.2.224:7788;
		meta-disk internal;
	}
}
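
The configuration file can be checked for syntax errors before it is used. The sketch below relies on "drbdadm dump", which parses the configuration and prints it back; the wrapper name check_drbd_conf is hypothetical, and the resource name drbd0 is taken from the example above.

```shell
#!/bin/bash
# Hypothetical wrapper: succeed only if drbdadm can parse the configuration
# for the given resource. The parsed configuration itself is discarded.
check_drbd_conf() {
    local resource="${1:-drbd0}"
    if drbdadm dump "$resource" > /dev/null; then
        echo "configuration for $resource parsed cleanly"
    else
        echo "configuration for $resource has errors" >&2
        return 1
    fi
}
```

Running check_drbd_conf drbd0 on both WiDirects is a cheap way to catch a missing semicolon or brace before the metadata is created.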

After the configuration file is saved the next step is to create the drive metadata. These commands need to be run on both WiDirects:

	drbdadm create-md drbd0
	drbdadm up drbd0
	mkdir /shared
	service mysqld stop
	mkdir /root/AWICP/license
	chmod -R a+rw /root/AWICP/license
	cp /root/AWICP/etc/awicp.serial /root/AWICP/license
	reboot

After those commands have been run on both WiDirects, one WiDirect needs to be identified as the initial primary device. Run these commands on that WiDirect only. They make it the primary, create a filesystem on the shared drive, and move the data directories onto it:

drbdsetup /dev/drbd0 primary -o
mke2fs -j /dev/drbd0
e2fsck /dev/drbd0
mount /dev/drbd0 /shared
mv /var/lib/mysql /shared/mysql
ln -s /shared/mysql /var/lib/mysql
mv /root/AWICP/www/portal/branding /shared/
ln -s /shared/branding /root/AWICP/www/portal/branding 
mv /root/AWICP/etc /shared
ln -s /shared/etc /root/AWICP/etc
mv /root/AWICP/logs /shared/
ln -s /shared/logs /root/AWICP/logs 
mv /root/AWICP/monitor-data /shared/
ln -s /shared/monitor-data /root/AWICP/monitor-data 
mv /root/AWICP/db /shared/
ln -s /shared/db /root/AWICP/db
mv /etc/dhcpd.conf /shared/etc/dhcpd.conf
ln -s /shared/etc/dhcpd.conf /etc/dhcpd.conf 
mv /var/lib/dhcpd /shared/
ln -s /shared/dhcpd /var/lib/dhcpd
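
Each pair of commands above moves a path onto the shared drive and leaves a symbolic link at the old location. A minimal sketch of that pattern as a reusable function follows; the name relocate_to_shared is hypothetical, and the guard against running it twice is an addition that the commands above do not have.

```shell
#!/bin/bash
# Hypothetical helper: move SRC into DEST_DIR and leave a symlink at SRC.
# Refuses to act if SRC is already a symlink or the destination exists.
relocate_to_shared() {
    local src="$1" dest_dir="$2" name
    name=$(basename "$src")
    if [ -L "$src" ] || [ -e "$dest_dir/$name" ]; then
        echo "skipping $src: already relocated or destination exists" >&2
        return 1
    fi
    mv "$src" "$dest_dir/$name" && ln -s "$dest_dir/$name" "$src"
}

# Usage on the primary, with /shared mounted:
#   relocate_to_shared /var/lib/mysql /shared
#   relocate_to_shared /etc/dhcpd.conf /shared/etc
```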

One more step is required on the secondary WiDirect. Move its local copies aside as backups and create the same links to the shared drive:

mv /var/lib/mysql /var/lib/mysql.backup
ln -s /shared/mysql /var/lib/mysql
mv /root/AWICP/www/portal/branding  /root/AWICP/www/portal/branding.backup 
ln -s /shared/branding /root/AWICP/www/portal/branding 
mv /root/AWICP/etc /root/AWICP/etc.backup
ln -s /shared/etc /root/AWICP/etc 
mv /root/AWICP/logs /root/AWICP/logs.backup
ln -s /shared/logs /root/AWICP/logs 
mv /root/AWICP/monitor-data /root/AWICP/monitor-data.backup 
ln -s /shared/monitor-data /root/AWICP/monitor-data 
mv /root/AWICP/db /root/AWICP/db.backup
ln -s /shared/db /root/AWICP/db
mv /etc/dhcpd.conf /etc/dhcpd.conf.backup
ln -s /shared/etc/dhcpd.conf /etc/dhcpd.conf 
mv /var/lib/dhcpd /var/lib/dhcpd.backup
ln -s /shared/dhcpd /var/lib/dhcpd

After running those commands there will be a period of syncing between the two drives. Run the status command to check the status:

service drbd status

The status command will indicate whether the drives are inconsistent or up to date. Initially the secondary drive will be listed as inconsistent, and the status command will show the percent synced between the two devices.
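
The same disk states are also reported in /proc/drbd as a field of the form ds:local/peer. As a sketch, the function below extracts that field for device 0 so a script can wait for the sync to finish; the function names are hypothetical, and the sample line in the comment is illustrative rather than captured from a WiDirect.

```shell
#!/bin/bash
# Print the disk states (local/peer) for DRBD device 0 from /proc/drbd-style
# input on stdin. Illustrative input line:
#    0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
drbd_disk_state() {
    sed -n 's/^ *0: .* ds:\([^ ]*\).*/\1/p'
}

# Succeed when both sides report UpToDate.
drbd_in_sync() {
    [ "$(drbd_disk_state)" = "UpToDate/UpToDate" ]
}

# Usage:
#   drbd_disk_state < /proc/drbd
#   drbd_in_sync < /proc/drbd && echo "drives are in sync"
```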

View and Change Status of Shared Disk Drive

The drbd service manages the shared drive between the two boxes. To view the current status of the shared drive you can run the command below:

service drbd status

The commands in this section describe how to manually change which WiDirect is the primary one and which is the secondary. They are for reference only and do not typically need to be run. The following sections describe how to use the heartbeat service to manage these roles automatically. To change the current primary box to be secondary, run these commands:

/root/AWICP/bin/widirect_stop_all.sh
service mysqld stop
service httpd stop
umount /shared
drbdadm secondary drbd0

The other box can then be made the primary server:

drbdadm primary drbd0
mount /dev/drbd0 /shared
service mysqld start
service httpd start
/root/AWICP/bin/widirect_start_all.sh

Configure Services for Failover

The first step to configure the services for failover on both devices is to create the Heartbeat configuration file. Run this command to edit that file:

	emacs /etc/ha.d/ha.cf

Edit that file to contain the following text:

logfile /var/log/ha-log
autojoin none
bcast eth0
warntime 5
deadtime 15
initdead 60
keepalive 2
crm yes
node f1.awi6.net
node f2.awi6.net

The last two lines should be modified to contain the appropriate hostnames for the WiDirects. Run these commands on both devices to create and edit the keys file:

	touch /etc/ha.d/authkeys
	chmod 600 /etc/ha.d/authkeys
	emacs /etc/ha.d/authkeys

The following text can be added to create a simple authkeys file:

auth 2
2 sha1 test-ha

A more secure authkeys file can be generated from the command line with the below command. That authkeys file can then be copied to the other WiDirect.

	( echo -ne "auth 1\n1 sha1 ";   dd if=/dev/urandom bs=512 count=1 | openssl md5 ) > /etc/ha.d/authkeys
	(From the Linux High Availability User’s Guide, http://linux-ha.org)
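
As an alternative sketch, the key can be produced with sha1sum, which avoids depending on the output format of openssl. The function name make_authkeys is hypothetical. It only prints the file contents, so the result still has to be written to /etc/ha.d/authkeys with mode 600 and copied unchanged to the other WiDirect.

```shell
#!/bin/bash
# Hypothetical helper: print an authkeys file body that uses key id 1 and a
# random 40-character hexadecimal key.
make_authkeys() {
    local key
    key=$(head -c 512 /dev/urandom | sha1sum | cut -d' ' -f1)
    printf 'auth 1\n1 sha1 %s\n' "$key"
}

# Usage (as root; the umask keeps the new file private):
#   ( umask 077; make_authkeys > /etc/ha.d/authkeys )
```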

After those files have been updated the Heartbeat service can be started with these commands:

	service heartbeat start
	chkconfig heartbeat on

The next step will be to configure each of the individual services for failover. Run this command from the command line to start configuring the services:

	crm configure

In the crm configuration window run these commands to configure the services for automatic failover:

primitive awicp_ap_ping_monitor lsb:awicp_ap_ping_monitor
primitive awicp_ap_snmp_monitor lsb:awicp_ap_snmp_monitor
primitive awicp_bandwidth_manager lsb:awicp_bandwidth_manager
primitive awicp_client lsb:awicp_client
primitive awicp_client_radius_listener lsb:awicp_client_radius_listener
primitive awicp_clientwatcher lsb:awicp_clientwatcher
primitive awicp_gardencrawler lsb:awicp_gardencrawler
primitive awicp_manager lsb:awicp_manager
primitive awicp_preproxy lsb:awicp_preproxy
primitive awicp_watchdog lsb:awicp_watchdog
primitive dhcpd lsb:dhcpd
primitive drbd_mysql ocf:linbit:drbd \
        params drbd_resource="drbd0" \
        op monitor interval="15s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive fs_mysql ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/shared" fstype="ext3" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive httpd lsb:httpd
primitive ip_dhcp ocf:heartbeat:IPaddr2 \
        params ip="10.4.1.1" nic="eth1" cidr_netmask="24"
primitive ip_mysql ocf:heartbeat:IPaddr2 \
        params ip="10.8.1.10" nic="eth0" cidr_netmask="16"
primitive mysqld lsb:mysqld
group mysql fs_mysql ip_mysql ip_dhcp mysqld httpd dhcpd awicp_client awicp_preproxy awicp_ap_ping_monitor awicp_ap_snmp_monitor awicp_client_radius_listener awicp_bandwidth_manager awicp_clientwatcher awicp_gardencrawler awicp_manager awicp_watchdog
ms ms_drbd_mysql drbd_mysql \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
property stonith-enabled="false"
commit
exit
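
The LSB primitives in the listing above all follow the pattern "primitive NAME lsb:NAME". To reduce typing mistakes, those lines could be generated and then pasted into the crm configuration window. This is only a sketch; the function name lsb_primitives is hypothetical, and the service names are copied from the listing.

```shell
#!/bin/bash
# WiDirect services from the listing above.
AWICP_SERVICES="awicp_ap_ping_monitor awicp_ap_snmp_monitor awicp_bandwidth_manager
awicp_client awicp_client_radius_listener awicp_clientwatcher awicp_gardencrawler
awicp_manager awicp_preproxy awicp_watchdog"

# Hypothetical helper: print one "primitive" line per LSB-managed service.
lsb_primitives() {
    local svc
    for svc in $AWICP_SERVICES dhcpd httpd mysqld; do
        echo "primitive $svc lsb:$svc"
    done
}
```

The DRBD, filesystem, and IP address primitives take parameters and still need to be entered as shown in the listing.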

Note: The steps above do not configure a STONITH device, which is highly recommended so that one node can disable the other in the event that it partially fails. Some models, such as the WiDirect Carrier, have this feature built into the device, though there are disadvantages to using the built-in method. Please contact AllCity Wireless for more information about this configuration.

Configure Firewall for Shared IP

When failover is set up on a WiDirect, the WiClients must also be configured to point to the IP address that is shared between the two devices. Update the AuthServer section of the firewall to point to the proper IP address or hostname.

[Image: FirewallFailover.jpg]

Further Configuration

It is important that the WiClients page be configured correctly when using multiple devices in one network. The GWID field is typically the MAC address of the eth1 interface on the WiDirect or WiClient with the colons removed. When using multiple devices the Secondary GWID field should be filled in with the MAC address of eth1 of the second device. If failover is being used on the primary WiDirect, then it is important to rename the client to something other than “Local WiDirect.” If the client name is not changed then the primary GWID will be reset to the MAC address of whichever device is primary when the WiDirect starts up.

It is also recommended that you modify the system check page to show the status of the heartbeat processes. Run this command to edit the file:

emacs /root/AWICP/config-helpers/statusCheck.pl

Look for the line that shows "my $showFailoverStatus=0;" and change the 0 to a 1.
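
The same change can be made without opening an editor. The sketch below assumes the line appears exactly as quoted above, apart from optional spaces around the equals sign; the function name enable_failover_status is hypothetical. It keeps a copy of the original file with a .bak suffix.

```shell
#!/bin/bash
# Hypothetical helper: change $showFailoverStatus from 0 to 1 in the given file.
# Only the line that sets $showFailoverStatus is touched.
enable_failover_status() {
    local file="$1"
    sed -i.bak '/my \$showFailoverStatus *= *0;/s/0;/1;/' "$file"
}

# Usage:
#   enable_failover_status /root/AWICP/config-helpers/statusCheck.pl
```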

Failover Recovery

In many instances the two WiDirects will recover automatically, and no manual intervention will be necessary. In some instances, most notably if both WiDirects think they have been running independently of one another, the drives will be out of sync, which is known as a split brain condition. To recover from a split brain condition, the administrator must determine which drive has the newer data and overwrite the contents of the other drive, which holds the older data. On the WiDirect with the out-of-date data, run this command:

	drbdadm -- --discard-my-data connect drbd0

The other WiDirect should run this command:

	drbdadm connect drbd0

If the above commands fail, some additional commands may need to be run on both devices before bringing everything back up:

	service heartbeat stop
	service drbd restart

After running the commands to sync the drives again, these commands will restart the failover services:

	
	service drbd stop
	service heartbeat start
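
To confirm that DRBD actually reported a split brain, the kernel log can be searched for its "Split-Brain detected" message. This is a sketch: the function name count_split_brain is hypothetical, and /var/log/messages in the usage comment is an assumption about where kernel messages are logged on the WiDirect.

```shell
#!/bin/bash
# Hypothetical helper: count split-brain reports in log text read from stdin.
# Prints 0 when none are found.
count_split_brain() {
    grep -c 'Split-Brain detected'
}

# Usage (log path is an assumption):
#   count_split_brain < /var/log/messages
```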


View the Status

To view the real-time status of the heartbeat services, run the "crm_mon" command from the command line.

[Image: Crm mon.jpg]
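
For scripts or monitoring checks, "crm_mon -1" prints the status once and exits. As a sketch, the function below pulls the list of online nodes out of that output. It assumes the output contains a line of the form "Online: [ node1 node2 ]"; the function name online_nodes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the nodes listed as online in "crm_mon -1"
# output read from stdin. Assumed input line:
#   Online: [ f1.awi6.net f2.awi6.net ]
online_nodes() {
    sed -n 's/^Online: \[ \(.*\) \]$/\1/p'
}

# Usage:
#   crm_mon -1 | online_nodes
```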


Remove Device From Service

To take a device out of service, you can put that device in standby by running this command on either device:

crm node standby f2.awi6.net

Substitute the correct hostname for the device to take out of service. That machine will stop running the cluster services, and all activity will be moved to the other machine. To return the device to service later, run "crm node online" with the same hostname.

Change Active Device

The command below will change the active device. The command can be run on either device. Simply specify the desired hostname for the services to be moved to.

crm_resource -r fs_mysql -M -H f1.awi6.net

The above command makes that box the preferred one for the services to run on. Any time both boxes are running, the services will run on the one listed there. To clear that preference, run this command:

crm configure delete cli-prefer-fs_mysql

Configure STONITH

It is important for each WiDirect to be able to control the other's access to network resources. The WiDirect Carrier has this capability built in. The other models can use an external device to control the power of the other WiDirect. Even with the WiDirect Carrier it can be beneficial to use a secondary STONITH device in case one of the devices loses power.

Proper configuration of STONITH is tricky, and incorrect configuration can result in the services stopping on all the WiDirects. Before a device is removed from the cluster it is a good idea to put it in standby first ("crm node standby f2.awi6.net").

Run the "crm configure" command to configure the STONITH settings. Then run these commands to configure the STONITH devices:

primitive fail1-stonith stonith:external/ipmi \
	params hostname="f1.awi6.net" ipaddr="10.8.9.41" userid="ADMIN" passwd="ADMIN" interface="lan"
primitive fail2-stonith stonith:external/ipmi \
	params hostname="f2.awi6.net" ipaddr="10.8.9.37" userid="ADMIN" passwd="ADMIN" interface="lan"
location l-fail1-stonith fail1-stonith -inf: f1.awi6.net
location l-fail2-stonith fail2-stonith -inf: f2.awi6.net
property stonith-enabled="true"
commit
exit

In the above example 10.8.9.41 is the IPMI interface for f1.awi6.net, and 10.8.9.37 is the IPMI interface for f2.awi6.net. There may be warnings about duplicate settings. Those are normal:

WARNING: Resources fail1-stonith,fail2-stonith violate uniqueness for parameter "passwd": "ADMIN"
WARNING: Resources fail1-stonith,fail2-stonith violate uniqueness for parameter "interface": "lan"
WARNING: Resources fail1-stonith,fail2-stonith violate uniqueness for parameter "userid": "ADMIN"
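
Before relying on STONITH, it is worth confirming that each WiDirect can reach the other's IPMI interface with the configured credentials. The sketch below uses ipmitool, which was installed earlier, to query the power state; the wrapper name check_ipmi is hypothetical, and the address and credentials are the example values from above.

```shell
#!/bin/bash
# Hypothetical wrapper: query the chassis power state through an IPMI
# interface over the LAN. Arguments: IP address, user id, password.
# Note that the password is visible in the process list while this runs.
check_ipmi() {
    local ipaddr="$1" userid="$2" passwd="$3"
    ipmitool -I lan -H "$ipaddr" -U "$userid" -P "$passwd" chassis power status
}

# Usage from f2.awi6.net, checking the interface that controls f1.awi6.net:
#   check_ipmi 10.8.9.41 ADMIN ADMIN
```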

Software Updates

To perform a software update, first update the active WiDirect. Then make the secondary device the active one and update it as well. You can then move the services back to the original device if you wish. When complete, be sure to run this command on both devices:

rm -rf /etc/rc3.d/*awicp*

Further Reading

STONITH Deathmatch Explained

DRBD 8.3 User Guide

Linux High Availability Guide

Clusterlabs Fencing and Stonith Documentation

Super Micro IPMI Manual (Used in WiDirect Carrier)