
Mailing List Archive: Linux-HA: Users
Re: cman+pacemaker+drbd fencing problem
 



seligman at nevis

Mar 1, 2012, 3:28 AM



On 3/1/12 4:15 AM, emmanuel segura wrote:
> can you show me your /etc/cluster/cluster.conf?
>
> because i think your problem it's a fencing-loop

Here it is:

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1" />
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30" />
  <rm disabled="1" />
</cluster>
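
For anyone comparing their own file against this one, here is a minimal Python sketch of a sanity check on the fencing section. It assumes the pcmk-redirect convention used in the file above: each node's fence device must name a declared fencedevice, and its port must equal the clusternode name. The function name check_fencing and the trimmed CLUSTER_CONF string are illustrative only; they are not part of cman or pacemaker.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above (logging, altname, fence_daemon
# and rm elements left out to keep the example short).
CLUSTER_CONF = """\
<cluster config_version="17" name="Nevis_HA">
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>
"""


def check_fencing(xml_text):
    """Return a list of human-readable problems (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Names of all fence devices declared under <fencedevices>.
    declared = {d.get("name") for d in root.findall("./fencedevices/fencedevice")}
    problems = []
    for node in root.findall("./clusternodes/clusternode"):
        name = node.get("name")
        devices = node.findall("./fence/method/device")
        if not devices:
            problems.append(f"{name}: no fence device configured")
        for dev in devices:
            if dev.get("name") not in declared:
                problems.append(f"{name}: device {dev.get('name')!r} is not declared")
            # Assumption: the redirect expects port to equal the node name.
            if dev.get("port") != name:
                problems.append(
                    f"{name}: device port {dev.get('port')!r} does not match node name"
                )
    return problems


if __name__ == "__main__":
    for line in check_fencing(CLUSTER_CONF) or ["fencing section looks consistent"]:
        print(line)
```

This only checks internal consistency of the file; it says nothing about whether a fencing loop can occur at runtime.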


> On 1 March 2012 at 01:03, William Seligman <seligman [at] nevis> wrote:
>
>> On 2/28/12 7:26 PM, Lars Ellenberg wrote:
>>> On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote:
>>>> <off-topic>
>>>> Sigh. I wish that were the reason.
>>>>
>>>> The reason why I'm doing dual-primary is that I've got a single-primary
>>>> two-node cluster in production that simply doesn't work. One node runs
>>>> resources; the other sits and twiddles its fingers; fine. But when primary goes
>>>> down, secondary has trouble starting up all the resources; when we've actually
>>>> had primary failures (UPS goes haywire, hard drive failure) the secondary often
>>>> winds up in a state in which it runs none of the significant resources.
>>>>
>>>> With the dual-primary setup I have now, both machines are running the resources
>>>> that typically cause problems in my single-primary configuration. If one box
>>>> goes down, the other doesn't have to fail over anything; it's already running
>>>> them. (I needed IPaddr2 cloning to work properly for this to work, which is why
>>>> I started that thread... and all the stupider of me for missing that crucial
>>>> page in Clusters From Scratch.)
>>>>
>>>> My only remaining problem with the configuration is restoring a fenced node to
>>>> the cluster. Hence my tests, and the reason why I started this thread.
>>>> </off-topic>
>>>
>>> Uhm, I do think that is exactly on topic.
>>>
>>> Rather fix your resources to be able to successfully take over,
>>> than add even more complexity.
>>>
>>> What resources would that be,
>>> and why are they not taking over?
>>
>> I can't tell you in detail, because the major snafu happened on a production
>> system after a power outage a few months ago. My goal was to get the thing
>> stable as quickly as possible. In the end, that turned out to be a non-HA
>> configuration: One runs corosync+pacemaker+drbd, while the other just runs drbd.
>> It works, in the sense that the users get their e-mail. If there's a power
>> outage, I have to bring things up manually.
>>
>> So my only reference is the test-bench dual-primary setup I've got now, which is
>> exhibiting the same kinds of problems even though the OS versions, software
>> versions, and layout are different. This suggests that the problem lies in the
>> way I'm setting up the configuration.
>>
>> The problems I have seem to be in the general category of "the 'good guy' gets
>> fenced when the 'bad guy' gets into trouble." Examples:
>>
>> - Assume I start out with two crashed nodes. If I just start up DRBD and
>> nothing else, the partitions sync quickly with no problems.
>>
>> - If the system starts with cman running, and I start drbd, it's likely that
>> the system that is _not_ Outdated will be fenced (rebooted). Same thing if
>> cman+pacemaker is running.
>>
>> - Cloned ocf:heartbeat:exportfs resources are giving me problems as well (which
>> is why I tried making changes to that resource script). Assume I start with one
>> node running cman+pacemaker, and the other stopped. I turn on the stopped
>> node. This will typically result in the running node being fenced, because it
>> times out when stopping the exportfs resource.
>>
>> Falling back to DRBD 8.3.12 didn't change this behavior.
>>
>> My pacemaker configuration is long, so I'll excerpt what I think are the
>> relevant pieces in the hope that it will be enough for someone to say "You fool!
>> This is covered in Pacemaker Explained page 56!" When bringing up a stopped
>> node, in order to restart AdminClone pacemaker wants to stop ExportsClone, then
>> Gfs2Clone, then ClvmdClone. As I said, it's the failure to stop ExportMail on
>> the running node that causes it to be fenced.
>>
>> primitive AdminDrbd ocf:linbit:drbd \
>>     params drbd_resource="admin" \
>>     op monitor interval="60s" role="Master" \
>>     op monitor interval="59s" role="Slave" \
>>     op stop interval="0" timeout="320" \
>>     op start interval="0" timeout="240"
>> ms AdminClone AdminDrbd \
>>     meta master-max="2" master-node-max="1" \
>>     clone-max="2" clone-node-max="1" notify="true"
>>
>> primitive Clvmd lsb:clvmd op monitor interval="30s"
>> clone ClvmdClone Clvmd
>> colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master
>> order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
>>
>> primitive Gfs2 lsb:gfs2 op monitor interval="30s"
>> clone Gfs2Clone Gfs2
>> colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone
>> order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
>>
>> primitive ExportMail ocf:heartbeat:exportfs \
>>     op start interval="0" timeout="40" \
>>     op stop interval="0" timeout="45" \
>>     params clientspec="mail" directory="/mail" fsid="30"
>> clone ExportsClone ExportMail
>> colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone
>> order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone
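
To make the stop sequence described above concrete, here is a rough Python sketch that derives it from the three order constraints in the excerpt. It relies on pacemaker order constraints being symmetrical by default, so resources stop in the reverse of their start order. The (first, then) pairs are hand-copied from the excerpt, the promote/start distinction and the colocation rules are ignored, and the names ORDER_CONSTRAINTS, start_order and stop_order are made up for the illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# (first, then) pairs copied by hand from the "order" lines above.
ORDER_CONSTRAINTS = [
    ("AdminClone", "ClvmdClone"),    # Admin_Before_Clvmd
    ("ClvmdClone", "Gfs2Clone"),     # Clvmd_Before_Gfs2
    ("Gfs2Clone", "ExportsClone"),   # Gfs2_Before_Exports
]


def start_order(constraints):
    """Order in which resources can start, honouring every (first, then) pair."""
    sorter = TopologicalSorter()
    for first, then in constraints:
        # 'then' may only start once 'first' has started.
        sorter.add(then, first)
    return list(sorter.static_order())


def stop_order(constraints):
    """Assuming symmetrical constraints, stopping is the reverse of starting."""
    return list(reversed(start_order(constraints)))


if __name__ == "__main__":
    print("start:", " -> ".join(start_order(ORDER_CONSTRAINTS)))
    print("stop: ", " -> ".join(stop_order(ORDER_CONSTRAINTS)))
```

Under these assumptions ExportsClone is the first thing that has to stop before AdminClone can be restarted, which matches the sequence described above and is why a timeout in the ExportMail stop operation is where the failure shows up.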


--
Bill Seligman | mailto://seligman [at] nevis
Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
PO Box 137 |
Irvington NY 10533 USA | Phone: (914) 591-2823
