
keith1christian at gmail
Jul 2, 2009, 8:57 AM
Post #1 of 1
(1108 views)
Permalink
|
|
How to troubleshoot: No reply to standby request. Standby request cancelled.
|
|
I'm in the testing phase of building a HA cluster using Heartbeat and DRBD. Both nodes are running the same configuration files, e.g. drbd.conf, ha.cf, haresources. The two nodes are connected with dual heartbeat connections: server_one, eth1 192.168.1.136---> server_two, eth1 192.168.1.137 server_one, eth3 192.168.2.136---> server_two, eth3 192.168.2.137 Heartbeat is able to ping the gateway from both servers: Jul 1 17:43:22 server_one heartbeat: [21192]: info: Link www.xxx.yyy.129:www.xxx.yyy.129 up. Jul 1 17:43:22 server_one heartbeat: [21192]: info: Status update for node www.xxx.yyy.129: status ping Jul 1 17:33:18 server_two heartbeat: [19332]: info: Link www.xxx.yyy.129:www.xxx.yyy.129 up. Jul 1 17:33:18 server_two heartbeat: [19332]: info: Status update for node www.xxx.yyy.129: status ping Both nodes are running the same version of the software: [root [at] server_on ~]# rpm -qa | grep -i heart heartbeat-stonith-2.1.3-3.el5.centos heartbeat-2.1.3-3.el5.centos heartbeat-pils-2.1.3-3.el5.centos [root [at] server_on ~]# rpm -qa | grep -i drbd drbd82-8.2.6-1.el5.centos kmod-drbd82-8.2.6-2 [root [at] server_on ~]# service heartbeat status service heartbeat status heartbeat OK [pid 21192 et al] is running on server_one [server_one]... [root [at] server_tw ~]# service heartbeat status heartbeat OK [pid 19332 et al] is running on server_two [server_two]... When testing failover, I issue the command "service heartbeat standby" but, failover does not occur and the message "Standby request cancelled" appears in the logs of both servers: Tell server_one to standby: [root [at] server_on ~]# service heartbeat standby service heartbeat standby auto_failback: off Attempting to enter standby mode 2009/07/02_10:54:38 Going standby [all]. Jul 2 10:54:38 server_one heartbeat: [21192]: info: server_one wants to go standby [all] Jul 2 10:54:48 server_one heartbeat: [21192]: WARN: No reply to standby request. Standby request cancelled. Tell server_two to standby: [root [at] server_tw ~]# service heartbeat standby auto_failback: off Attempting to enter standby mode 2009/07/02_10:55:16 Going standby [all]. [root [at] server_tw ~]# [ OK ] Jul 2 10:55:16 server_two heartbeat: [19332]: info: server_two wants to go standby [all] Jul 2 10:55:28 server_two heartbeat: [19332]: WARN: No reply to standby request. Standby request cancelled. Here is the log from server_one during a restart of drbd and heartbeat: [root [at] server_tw ~]# service drbd restart ; sleep 10 ; service heartbeat restart Jul 1 17:31:23 server_one kernel: drbd0: State change failed: Device is held open by someone Jul 1 17:31:23 server_one kernel: drbd0: state = { cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown r--- } Jul 1 17:31:23 server_one kernel: drbd0: wanted = { cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown r--- } Jul 1 17:31:33 server_one heartbeat: [14355]: info: Heartbeat shutdown in progress. (14355) Jul 1 17:31:33 server_one heartbeat: [19482]: info: Giving up all HA resources. Jul 1 17:31:33 server_one ResourceManager[19495]: info: Releasing resource group: server_one IPaddr::www.xxx.yyy.150/26/eth0 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/home::ext3 sendmail Jul 1 17:31:33 server_one ResourceManager[19495]: info: Running /etc/init.d/sendmail stop Jul 1 17:31:33 server_one ResourceManager[19495]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home ext3 stop Jul 1 17:31:33 server_one Filesystem[19575]: INFO: Running stop for /dev/drbd0 on /home Jul 1 17:31:33 server_one Filesystem[19575]: INFO: Trying to unmount /home Jul 1 17:31:33 server_one Filesystem[19575]: INFO: unmounted /home successfully Jul 1 17:31:33 server_one Filesystem[19564]: INFO: Success Jul 1 17:31:33 server_one ResourceManager[19495]: info: Running /etc/ha.d/resource.d/drbddisk drbd-resource-0 stop Jul 1 17:31:33 server_one kernel: drbd0: role( Primary -> Secondary ) Jul 1 17:31:33 server_one kernel: drbd0: Writing meta data super block now. Jul 1 17:31:33 server_one ResourceManager[19495]: info: Running /etc/ha.d/resource.d/IPaddr www.xxx.yyy.150/26/eth0 stop Jul 1 17:31:34 server_one IPaddr[19727]: INFO: ifconfig eth0:0 down Jul 1 17:31:34 server_one avahi-daemon[7115]: Withdrawing address record for www.xxx.yyy.150 on eth0. Jul 1 17:31:34 server_one IPaddr[19698]: INFO: Success Jul 1 17:31:34 server_one heartbeat: [19482]: info: All HA resources relinquished. Jul 1 17:31:35 server_one heartbeat: [14355]: info: killing /usr/lib/heartbeat/ipfail process group 14366 with signal 15 Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBFIFO process 14357 with signal 15 Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBWRITE process 14358 with signal 15 Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBREAD process 14359 with signal 15 Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBWRITE process 14360 with signal 15 Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBREAD process 14361 with signal 15 Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBWRITE process 14362 with signal 15 Jul 1 17:31:36 server_one heartbeat: [14355]: info: killing HBREAD process 14363 with signal 15 Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process 14359 exited. 7 remaining Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process 14363 exited. 6 remaining Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process 14361 exited. 5 remaining Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process 14358 exited. 4 remaining Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process 14362 exited. 3 remaining Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process 14357 exited. 2 remaining Jul 1 17:31:36 server_one heartbeat: [14355]: info: Core process 14360 exited. 1 remaining Jul 1 17:31:36 server_one heartbeat: [14355]: info: server_one Heartbeat shutdown complete. Jul 1 17:32:08 server_one kernel: drbd0: conn( WFConnection -> Disconnecting ) Jul 1 17:32:08 server_one kernel: drbd0: Discarding network configuration. Jul 1 17:32:08 server_one kernel: drbd0: tl_clear() Jul 1 17:32:08 server_one kernel: drbd0: Connection closed Jul 1 17:32:08 server_one kernel: drbd0: conn( Disconnecting -> StandAlone ) Jul 1 17:32:08 server_one kernel: drbd0: receiver terminated Jul 1 17:32:08 server_one kernel: drbd0: Terminating receiver thread Jul 1 17:32:08 server_one kernel: drbd0: disk( UpToDate -> Diskless ) Jul 1 17:32:08 server_one kernel: drbd0: drbd_bm_resize called with capacity == 0 Jul 1 17:32:08 server_one kernel: drbd0: worker terminated Jul 1 17:32:08 server_one kernel: drbd0: Terminating worker thread Jul 1 17:32:08 server_one kernel: drbd: module cleanup done. Jul 1 17:32:08 server_one kernel: drbd: initialised. Version: 8.2.6 (api:88/proto:86-88) Jul 1 17:32:08 server_one kernel: drbd: GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn [at] c5-i386-buil, 2008-10-03 11:42:32 Jul 1 17:32:08 server_one kernel: drbd: registered as block device major 147 Jul 1 17:32:08 server_one kernel: drbd: minor_table @ 0xf7467240 Jul 1 17:32:08 server_one kernel: drbd0: disk( Diskless -> Attaching ) Jul 1 17:32:08 server_one kernel: drbd0: Starting worker thread (from cqueue/1 [145]) Jul 1 17:32:08 server_one kernel: klogd 1.4.1, ---------- state change ---------- Jul 1 17:32:08 server_one kernel: drbd0: Found 6 transactions (34 active extents) in activity log. Jul 1 17:32:08 server_one kernel: drbd0: max_segment_size ( = BIO size ) = 32768 Jul 1 17:32:08 server_one kernel: drbd0: drbd_bm_resize called with capacity == 832761342 Jul 1 17:32:08 server_one kernel: drbd0: resync bitmap: bits=104095168 words=3252974 Jul 1 17:32:08 server_one kernel: drbd0: size = 397 GB (416380671 KB) Jul 1 17:32:08 server_one kernel: drbd0: reading of bitmap took 21 jiffies Jul 1 17:32:08 server_one kernel: drbd0: recounting of set bits took additional 14 jiffies Jul 1 17:32:08 server_one kernel: drbd0: 397 GB (104095168 bits) marked out-of-sync by on disk bit-map. Jul 1 17:32:08 server_one kernel: drbd0: disk( Attaching -> UpToDate ) Jul 1 17:32:08 server_one kernel: drbd0: Writing meta data super block now. Jul 1 17:32:08 server_one kernel: drbd0: conn( StandAlone -> Unconnected ) Jul 1 17:32:08 server_one kernel: drbd0: Starting receiver thread (from drbd0_worker [19793]) Jul 1 17:32:08 server_one kernel: drbd0: receiver (re)started Jul 1 17:32:08 server_one kernel: drbd0: conn( Unconnected -> WFConnection ) Jul 1 17:36:20 server_one kernel: device eth1 entered promiscuous mode Jul 1 17:37:01 server_one heartbeat: [20041]: info: Version 2 support: false Jul 1 17:37:01 server_one heartbeat: [20041]: WARN: Logging daemon is disabled --enabling logging daemon is recommended Jul 1 17:37:01 server_one heartbeat: [20041]: info: ************************** Jul 1 17:37:01 server_one heartbeat: [20041]: info: Configuration validated. Starting heartbeat 2.1.3 Jul 1 17:37:01 server_one heartbeat: [20042]: info: heartbeat: version 2.1.3 Jul 1 17:37:01 server_one heartbeat: [20042]: info: Heartbeat generation: 1245962690 Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth1 Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1 Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth3 Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth3 - Status: 1 Jul 1 17:37:01 server_one heartbeat: [20042]: info: glib: ping heartbeat started. Jul 1 17:37:01 server_one heartbeat: [20042]: info: G_main_add_TriggerHandler: Added signal manual handler Jul 1 17:37:01 server_one heartbeat: [20042]: info: G_main_add_TriggerHandler: Added signal manual handler Jul 1 17:37:01 server_one heartbeat: [20042]: info: G_main_add_SignalHandler: Added signal handler for signal 17 Jul 1 17:37:01 server_one heartbeat: [20042]: info: Local status now set to: 'up' Jul 1 17:37:02 server_one heartbeat: [20042]: info: Link www.xxx.yyy.129:www.xxx.yyy.129 up. Jul 1 17:37:02 server_one heartbeat: [20042]: info: Status update for node www.xxx.yyy.129: status ping Jul 1 17:37:15 server_one kernel: device eth1 left promiscuous mode Jul 1 17:37:32 server_one heartbeat: [20042]: WARN: node server_two: is dead Jul 1 17:37:32 server_one heartbeat: [20042]: info: Comm_now_up(): updating status to active Jul 1 17:37:32 server_one heartbeat: [20042]: info: Local status now set to: 'active' Jul 1 17:37:32 server_one heartbeat: [20042]: info: Starting child client "/usr/lib/heartbeat/ipfail" (498,496) Jul 1 17:37:32 server_one heartbeat: [20042]: WARN: No STONITH device configured. Jul 1 17:37:32 server_one heartbeat: [20042]: WARN: Shared disks are not protected. Jul 1 17:37:32 server_one heartbeat: [20042]: info: Resources being acquired from server_two. Jul 1 17:37:32 server_one heartbeat: [20055]: info: Starting "/usr/lib/heartbeat/ipfail" as uid 498 gid 496 (pid 20055) Jul 1 17:37:32 server_one harc[20056]: info: Running /etc/ha.d/rc.d/status status Jul 1 17:37:32 server_one mach_down[20086]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired Jul 1 17:37:32 server_one mach_down[20086]: info: mach_down takeover complete for node server_two. Jul 1 17:37:32 server_one heartbeat: [20042]: info: Initial resource acquisition complete (T_RESOURCES(us)) Jul 1 17:37:32 server_one heartbeat: [20042]: info: mach_down takeover complete. Jul 1 17:37:32 server_one IPaddr[20129]: INFO: Resource is stopped Jul 1 17:37:32 server_one heartbeat: [20057]: info: Local Resource acquisition completed. Jul 1 17:37:32 server_one harc[20193]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp Jul 1 17:37:32 server_one ip-request-resp[20193]: received ip-request-resp IPaddr::www.xxx.yyy.150/26/eth0 OK yes Jul 1 17:37:32 server_one ResourceManager[20214]: info: Acquiring resource group: server_one IPaddr::www.xxx.yyy.150/26/eth0 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/home::ext3 sendmail Jul 1 17:37:32 server_one IPaddr[20241]: INFO: Resource is stopped Jul 1 17:37:32 server_one ResourceManager[20214]: info: Running /etc/ha.d/resource.d/IPaddr www.xxx.yyy.150/26/eth0 start Jul 1 17:37:32 server_one IPaddr[20339]: INFO: Using calculated netmask for www.xxx.yyy.150: 255.255.255.192 Jul 1 17:37:32 server_one IPaddr[20339]: INFO: eval ifconfig eth0:0 www.xxx.yyy.150 netmask 255.255.255.192 broadcast www.xxx.yyy.191 Jul 1 17:37:32 server_one avahi-daemon[7115]: Registering new address record for www.xxx.yyy.150 on eth0. Jul 1 17:37:32 server_one avahi-daemon[7115]: Withdrawing address record for www.xxx.yyy.150 on eth0. Jul 1 17:37:32 server_one avahi-daemon[7115]: Registering new address record for www.xxx.yyy.150 on eth0. Jul 1 17:37:32 server_one IPaddr[20310]: INFO: Success Jul 1 17:37:32 server_one ResourceManager[20214]: info: Running /etc/ha.d/resource.d/drbddisk drbd-resource-0 start Jul 1 17:37:32 server_one kernel: drbd0: role( Secondary -> Primary ) Jul 1 17:37:32 server_one kernel: drbd0: Writing meta data super block now. Jul 1 17:37:32 server_one Filesystem[20487]: INFO: Resource is stopped Jul 1 17:37:32 server_one ResourceManager[20214]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /home ext3 start Jul 1 17:37:32 server_one Filesystem[20568]: INFO: Running start for /dev/drbd0 on /home Jul 1 17:37:32 server_one kernel: kjournald starting. Commit interval 5 seconds Jul 1 17:37:32 server_one kernel: EXT3 FS on drbd0, internal journal Jul 1 17:37:32 server_one kernel: EXT3-fs: mounted filesystem with ordered data mode. Jul 1 17:37:32 server_one Filesystem[20557]: INFO: Success Jul 1 17:37:32 server_one ResourceManager[20214]: info: Running /etc/init.d/sendmail start Jul 1 17:37:38 server_one heartbeat: [20042]: WARN: standby message [me] from server_one ignored. local resources in flux. Jul 1 17:37:42 server_one heartbeat: [20042]: info: Local Resource acquisition completed. (none) Jul 1 17:37:42 server_one heartbeat: [20042]: info: local resource transition completed. There is another cluster that was built about a year ago running the same versions of heartbeat and drbd. I've configured this system in the same way except for IP address changes, so it should work. What am I overlooking here? What else can I try to troubleshoot this and get the failover resolved? Thanks, =======Keith _______________________________________________ Linux-HA mailing list Linux-HA [at] lists http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
|