
Mailing List Archive: Linux-HA: Users
Re: pacemaker+drbd promotion delay
 



seligman at nevis

Mar 30, 2012, 8:56 AM



On 3/30/12 1:13 AM, Andrew Beekhof wrote:
> On Fri, Mar 30, 2012 at 2:57 AM, William Seligman
> <seligman [at] nevis> wrote:
>> On 3/29/12 3:19 AM, Andrew Beekhof wrote:
>>> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
>>> <seligman [at] nevis> wrote:
>>>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
>>>> files and versions below.
>>>>
>>>> Problem: If I restart both nodes at the same time, or even just start pacemaker
>>>> on both nodes at the same time, the drbd ms resource starts, but both nodes stay
>>>> in slave mode. They'll both stay in slave mode until one of the following occurs:
>>>>
>>>> - I manually type "crm resource cleanup <ms-resource-name>"
>>>>
>>>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
>>>> resources are promoted.
>>>>
>>>> The key resource definitions:
>>>>
>>>> primitive AdminDrbd ocf:linbit:drbd \
>>>>         params drbd_resource="admin" \
>>>>         op monitor interval="59s" role="Master" timeout="30s" \
>>>>         op monitor interval="60s" role="Slave" timeout="30s" \
>>>>         op stop interval="0" timeout="100" \
>>>>         op start interval="0" timeout="240" \
>>>>         meta target-role="Master"
>>>> ms AdminClone AdminDrbd \
>>>>         meta master-max="2" master-node-max="1" clone-max="2" \
>>>>         clone-node-max="1" notify="true" interleave="true"
>>>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>>>> clone FilesystemClone FilesystemGroup \
>>>>         meta interleave="true" target-role="Started"
>>>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>>>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>>>>
>>>> Note that I stuck in "target-role" options to try to solve the problem; no effect.
>>>>
>>>> When I look in /var/log/messages, I see no error messages or indications why the
>>>> promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
>>>> on both nodes. There are no error messages when I force the issue with:
>>>>
>>>> crm resource cleanup AdminClone
>>>>
>>>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>>>> resource is ready to be promoted.
>>>>
>>>> This is not just an abstract case for me. At my site, it's not uncommon for
>>>> there to be lengthy power outages that will bring down the cluster. Both systems
>>>> will come up when power is restored, and I need for cluster services to be
>>>> available shortly afterward, not 15 minutes later.
>>>>
>>>> Any ideas?
>>>
>>> Not without any logs
>>
>> Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>
>>
>> Before you click on the link (it's a big wall of text),
>
> I'm used to trawling the logs. Grep is a wonderful thing :-)
>
> At this stage it is apparent that I need to see
> /var/lib/pengine/pe-input-4.bz2 from hypatia-corosync.
> Do you have this file still?

No, so I re-ran the test. Here's the log extract from the test I did today
<http://pastebin.com/6QYH2jkf>.

Based on what you asked for from the previous extract, I think what you want
from this test is pe-input-5. Just to play it safe, I copied and bunzip2'ed all
three pe-input files mentioned in the log messages:

pe-input-4: <http://pastebin.com/Txx50BJp>
pe-input-5: <http://pastebin.com/zzppL6DF>
pe-input-6: <http://pastebin.com/1dRgURK5>

I pray to the gods of Grep that you find a clue in all of that!

>> here are what I think
>> are the landmarks:
>>
>> - The extract starts just after the node boots, at the start of syslog at time
>> 10:49:21.
>> - I've highlighted when pacemakerd starts, at 10:49:46.
>> - I've highlighted when drbd reports that the 'admin' resource is UpToDate, at
>> 10:50:10.
>> - One last highlight: When pacemaker finally promotes the drbd resource to
>> Primary on both nodes, at 11:05:11.
>>
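As a rough check on the landmarks quoted above: the gap between drbd reporting UpToDate and the eventual promotion comes out at about 15 minutes, which is consistent with the "PEngine Recheck Timer" interval from the original report. Below is a minimal Python sketch of that arithmetic. The timestamps are copied from the list; the dictionary labels and helper names are only illustrative and do not come from the logs.

```python
from datetime import datetime, timedelta

# Landmark clock times quoted in the message (all on the same day).
LANDMARKS = {
    "syslog starts": "10:49:21",
    "pacemakerd starts": "10:49:46",
    "drbd 'admin' UpToDate": "10:50:10",
    "drbd promoted to Primary": "11:05:11",
}

def parse(clock):
    """Parse an HH:MM:SS clock time; the date part is irrelevant here."""
    return datetime.strptime(clock, "%H:%M:%S")

def elapsed(start_label, end_label):
    """Return the time between two landmarks as a timedelta."""
    return parse(LANDMARKS[end_label]) - parse(LANDMARKS[start_label])

# Delay between drbd being ready and pacemaker actually promoting it.
promotion_delay = elapsed("drbd 'admin' UpToDate", "drbd promoted to Primary")
# Delay measured from pacemakerd startup instead.
since_pacemaker = elapsed("pacemakerd starts", "drbd promoted to Primary")

print("UpToDate -> promoted:  ", promotion_delay)
print("pacemakerd -> promoted:", since_pacemaker)
```

Either way the delay is a little over 15 minutes, so the promotion appears to be waiting on the periodic recheck rather than on anything drbd reports.
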
>>> Details:
>>>>
>>>> # rpm -q kernel cman pacemaker drbd
>>>> kernel-2.6.32-220.4.1.el6.x86_64
>>>> cman-3.0.12.1-23.el6.x86_64
>>>> pacemaker-1.1.6-3.el6.x86_64
>>>> drbd-8.4.1-1.el6.x86_64
>>>>
>>>> Output of crm_mon after two-node reboot or pacemaker restart:
>>>> <http://pastebin.com/jzrpCk3i>
>>>> cluster.conf: <http://pastebin.com/sJw4KBws>
>>>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>>>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman [at] nevis
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/

Subject User Time
pacemaker+drbd promotion delay seligman at nevis Mar 27, 2012, 3:12 PM
    Re: pacemaker+drbd promotion delay seligman at nevis Mar 28, 2012, 3:30 PM
    Re: pacemaker+drbd promotion delay andrew at beekhof Mar 29, 2012, 12:19 AM
    Re: pacemaker+drbd promotion delay seligman at nevis Mar 29, 2012, 8:57 AM
        Re: pacemaker+drbd promotion delay andrew at beekhof Mar 29, 2012, 10:13 PM
    Re: pacemaker+drbd promotion delay seligman at nevis Mar 30, 2012, 8:56 AM
        Re: pacemaker+drbd promotion delay andrew at beekhof Apr 10, 2012, 3:22 PM
    Re: pacemaker+drbd promotion delay lars.ellenberg at linbit Apr 12, 2012, 12:26 AM
        Re: pacemaker+drbd promotion delay andrew at beekhof Apr 12, 2012, 6:47 PM
    Re: pacemaker+drbd promotion delay andrew at beekhof Apr 12, 2012, 7:24 PM
