Mailing List Archive: Linux-HA: Users

resource unmanaged/failed


aleksey.kashin at gmail

Dec 7, 2011, 2:56 AM

Post #1 of 6
resource unmanaged/failed

Hello.

I have two servers (radius1, radius2). I've set up a cluster resource
of type IPaddr2, using the following commands:

# crm configure property stonith-enabled="false"
# crm configure property no-quorum-policy="ignore"
# crm configure primitive raddb_ip ocf:heartbeat:IPaddr2 params \
    ip="10.99.2.57" cidr_netmask="32" op monitor interval="15s"
# crm configure group raddb raddb_ip
# crm configure location raddb-prefers-radius1 raddb inf: radius1
# crm configure rsc_defaults resource-stickiness=1000001

This works.

But sometimes the load on radius1 increases and the server starts
swapping, and at that moment the resource becomes "(unmanaged) FAILED".
Here is an example of the resource in that state:

# crm_mon
============
Last updated: Wed Dec 7 14:56:20 2011
Stack: openais
Current DC: radius1 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ radius2 radius1 ]

Resource Group: raddb
raddb_ip (ocf::heartbeat:IPaddr2): Started radius1 (unmanaged) FAILED

Failed actions:
raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error


Part of /var/log/syslog from radius1 is here:
http://paste.org/41963


At that moment the IP address 10.99.2.57 is still up and the server
responds to requests sent to it. However, sometimes the resource becomes
completely unavailable and I have to restart corosync, which is very bad.

I think the resource becomes unmanaged because the server is using swap
and some of the corosync processes have been swapped out. I tested this
theory: when the server is using a lot of swap, the resource becomes
"unmanaged".

I use Debian GNU/Linux 5.x with these packages from
http://people.debian.org/~madkiss/ha/:

# dpkg -l |grep cluster
ii  cluster-glue     1.0.7+hg2618-2~bpo50+1  The reusable cluster components for Linux HA
ii  corosync         1.4.2-1~bpo50+1         Standards-based cluster framework (daemon an
ii  libcluster-glue  1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries (transitional pac
ii  libcorosync4     1.4.2-1~bpo50+1         Standards-based cluster framework (libraries
ii  libcrmcluster1   1.1.5-3~bpo50+1         Pacemaker libraries - CRM
ii  liblrm2          1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- liblrm2
ii  libpils2         1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libpils2
ii  libplumb2        1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumb2
ii  libplumbgpl2     1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libplumbgpl2
ii  libstonith1      1.0.7+hg2618-2~bpo50+1  Reusable cluster libraries -- libstonith1
ii  pacemaker        1.1.5-3~bpo50+1         HA cluster resource manager



I can't add RAM to these servers. How can I keep the resource from
becoming "unmanaged/failed"?


With Best Regards.
Aleksey V. Kashin
_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


dejanmm at fastmail

Dec 8, 2011, 8:32 AM

Post #2 of 6
Re: resource unmanaged/failed

Hi,
On Wed, Dec 07, 2011 at 04:56:31PM +0600, Aleksey V. Kashin wrote:
> Hello.
>
> I have two servers (radius1, radius2). I've set up the cluster resource
> - IPaddr2. I used next commands to set up this resource:
>
> # crm configure property stonith-enabled="false"

For a 2-node cluster disabling stonith is really bad.

> # crm configure property no-quorum-policy="ignore"
> # crm configure primitive raddb_ip ocf:heartbeat:IPaddr2 params
> ip="10.99.2.57" cidr_netmask="32" op monitor interval="15s"
> # crm configure group raddb raddb_ip
> # crm configure location raddb-prefers-radius1 raddb inf: radius1
> # crm configure rsc_defaults resource-stickiness=1000001
>
> All ok.
>
> But sometimes on server radius1 the load is increasing and server is
> swapping and at that moment resource becomes "(unmanaged) FAILED". Below
> I've presented example "unmanaged" resource:
>
> # crm_mon
> ============
> Last updated: Wed Dec 7 14:56:20 2011
> Stack: openais
> Current DC: radius1 - partition with quorum
> Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> ============
>
> Online: [ radius2 radius1 ]
>
> Resource Group: raddb
> raddb_ip (ocf::heartbeat:IPaddr2): Started radius1
> (unmanaged) FAILED
>
> Failed actions:
> raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed
> Out): unknown exec error
> raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out):
> unknown exec error
>
>
> I've presented part of /var/log/syslog (radius1) here -
> http://paste.org/41963
>
>
> In that moment ip address 10.99.2.57 is alive and server responds to
> requests coming to this ip. However sometimes this resource becomes
> completely unavailable and I restart corosync. It's very bad.
>
> I think resource becomes unmanaged because server is using swap and part
> of corosync processes is in swap. I tested this suggestion and when
> server is using a lot of swap resource becomes "unmanaged".

corosync gets swapped? How interesting.

> I use debian gnu/linux 5.x and this packages -
> http://people.debian.org/~madkiss/ha/:
>
> # dpkg -l |grep cluster
> ii cluster-glue
> 1.0.7+hg2618-2~bpo50+1 The reusable cluster components for Linux HA
> ii corosync
> 1.4.2-1~bpo50+1 Standards-based cluster framework (daemon an
> ii libcluster-glue
> 1.0.7+hg2618-2~bpo50+1 Reusable cluster libraries (transitional pac
> ii libcorosync4
> 1.4.2-1~bpo50+1 Standards-based cluster framework (libraries
> ii libcrmcluster1
> 1.1.5-3~bpo50+1 Pacemaker libraries - CRM
> ii liblrm2
> 1.0.7+hg2618-2~bpo50+1 Reusable cluster libraries -- liblrm2
> ii libpils2
> 1.0.7+hg2618-2~bpo50+1 Reusable cluster libraries -- libpils2
> ii libplumb2
> 1.0.7+hg2618-2~bpo50+1 Reusable cluster libraries -- libplumb2
> ii libplumbgpl2
> 1.0.7+hg2618-2~bpo50+1 Reusable cluster libraries -- libplumbgpl2
> ii libstonith1
> 1.0.7+hg2618-2~bpo50+1 Reusable cluster libraries -- libstonith1
> ii pacemaker
> 1.1.5-3~bpo50+1 HA cluster resource manager
>
>
>
> I can't increase ram on this servers. How can I do that resource isn't
> becomes "unmanaged/failed" ?

Buy more memory. If you cannot, then I don't see any point in
using clustering.

Thanks,

Dejan


> With Best Regards.
> Aleksey V. Kashin


andrew at beekhof

Dec 8, 2011, 3:25 PM

Post #3 of 6
Re: resource unmanaged/failed

On Wed, Dec 7, 2011 at 9:56 PM, Aleksey V. Kashin
<aleksey.kashin [at] gmail> wrote:
> I can't increase ram on this servers. How can I do that resource isn't
> becomes "unmanaged/failed" ?
>

How much do they have now?
How much is in use by the radius servers?


aleksey.kashin at gmail

Dec 9, 2011, 12:46 AM

Post #4 of 6
resource unmanaged/failed

> How much do they have now?

They have 12G RAM.

> How much is in use by the radius servers?

             total       used       free     shared    buffers     cached
Mem:         12038      11606        431          0          2       6479
-/+ buffers/cache:       5124       6913
Swap:         7632       3398       4233
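
As a reading aid for the figures above: the "used" column on the Mem: line includes buffers and page cache, so the memory actually held by applications is used minus buffers minus cached. A minimal sketch of that arithmetic follows; the function name app_used_mb is made up for illustration.

```shell
# Memory held by applications, in MB: used - buffers - cached
app_used_mb() {
    echo $(( $1 - $2 - $3 ))
}

# used, buffers and cached taken from the Mem: line above
app_used_mb 11606 2 6479
```

This prints 5125, which agrees with the 5124 on the "-/+ buffers/cache" line up to rounding. In other words, roughly 5 GB is application memory and about 6.3 GB is cache, while 3398 MB sits in swap.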

And now I'm seeing "resource unmanaged/failed" again :(

Resource Group: raddb
raddb_ip (ocf::heartbeat:IPaddr2): Started radius1 (unmanaged) FAILED

Failed actions:
raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed Out): unknown exec error
raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out): unknown exec error


andrew at beekhof

Dec 11, 2011, 3:34 PM

Post #5 of 6
Re: resource unmanaged/failed

On Fri, Dec 9, 2011 at 7:46 PM, Aleksey V. Kashin
<aleksey.kashin [at] gmail> wrote:
>> How much do they have now?
>
> They have 12G RAM.

That seems respectable.

>
>> How much is in use by the radius servers?
>
>                   total       used       free     shared    buffers     cached
> Mem:         12038      11606        431          0          2       6479
> -/+ buffers/cache:       5124       6913
> Swap:         7632       3398       4233

That doesn't really answer the question, though. You really need to
find out where the memory is going.
Although 12 GB is a decent amount of RAM, /if/ a single radius server
needs 8 GB, then the machine is clearly not going to be able to handle
2 of them.
There's not really anything Pacemaker can do about it.
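
One way to find out where the memory is going is to read resident and swapped memory per process from /proc. The sketch below is only an illustration and not something posted in this thread: the function name proc_mem_report is invented, the optional directory argument exists only so the function can be tried against a copy of /proc, and the VmSwap field is, to my knowledge, only present on Linux 2.6.34 and later, so on an older kernel such as the one shipped with Debian 5 the swap column would read 0.

```shell
#!/bin/sh
# Print "swap_kB rss_kB pid name" for every process, largest swap user first.
# Fields missing from a status file (e.g. VmSwap on old kernels, VmRSS for
# kernel threads) are reported as 0.
proc_mem_report() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        awk -v pid="$(basename "$(dirname "$status")")" '
            /^Name:/   { name = $2 }
            /^VmRSS:/  { rss  = $2 }
            /^VmSwap:/ { swap = $2 }
            END        { printf "%d %d %s %s\n", swap, rss, pid, name }
        ' "$status" 2>/dev/null
    done | sort -k1,1nr -k2,2nr
}

# Show the 15 processes with the most swapped-out memory
proc_mem_report | head -n 15
```

If the Oracle or radius processes dominate the list, that would support the point above that the machine is simply short of memory for its workload.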

About the only thing you can do is increase the operation timeouts and
perhaps play with the realtime and nice values of various processes.
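
As an illustration of raising the operation timeouts, the primitive from the first post could carry an explicit timeout on each operation. The numbers below are assumptions chosen for the example, not recommendations made in this thread.

```shell
# Illustrative timeout values only; tune them to what the loaded node needs.
crm configure primitive raddb_ip ocf:heartbeat:IPaddr2 \
    params ip="10.99.2.57" cidr_netmask="32" \
    op monitor interval="15s" timeout="60s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="120s"
```

For a resource that already exists, `crm configure edit raddb_ip` opens the definition in an editor so the op lines can be changed in place. Note that the crm_mon output shows the stop operation timing out as well; with stonith disabled the cluster has no way to recover from a failed stop, which is typically why the resource is left unmanaged.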

> And now I'm seeing  again "resource unmanaged/failed" :(



>  Resource Group: raddb
>     raddb_ip   (ocf::heartbeat:IPaddr2):       Started radius1 (unmanaged) FAILED
>
> Failed actions:
>    raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed
> Out): unknown exec error
>    raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out):
> unknown exec error


aleksey.kashin at gmail

Dec 12, 2011, 2:42 AM

Post #6 of 6
Re: resource unmanaged/failed

2011/12/12, Andrew Beekhof <andrew [at] beekhof>:
> On Fri, Dec 9, 2011 at 7:46 PM, Aleksey V. Kashin
> <aleksey.kashin [at] gmail> wrote:
>>> How much do they have now?
>>
>> They have 12G RAM.
>
> That seems respectable.
>
>>
>>> How much is in use by the radius servers?
>>
>>              total       used       free     shared    buffers     cached
>> Mem:         12038      11606        431          0          2       6479
>> -/+ buffers/cache:       5124       6913
>> Swap:         7632       3398       4233
>
> That doesn't really answer the question though, you really need to
> find out where the memory is going.
> Although 12Gb is a decent amount of RAM, /If/ a single radius server
> needs 8Gb, then the machine is clearly not going to be able to handle
> 2 of them.
> There's not really anything Pacemaker can do about it.
>

This server is also running Oracle RDBMS (the database for the radius server).
It generates a big part of the load.

> About the only thing you can do is increase the operation timeouts and
> perhaps play with the realtime and nice values of various processes.
>

I tried increasing "timeout" (how long to wait before declaring that the
action has failed), but that didn't work for me. Now I'm testing
"failure-timeout" (how many seconds to wait before acting as if the
failure had not occurred).
I'll also try adjusting the process priority of corosync. Thanks for your advice.
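
A hedged sketch of the steps mentioned here, using standard Pacemaker and procps tools. The 120s value is an arbitrary assumption, and lrmd and crmd are assumed to be the relevant cluster daemons alongside corosync on this Pacemaker 1.1.5 setup.

```shell
# Clear the recorded failures so the cluster starts managing raddb_ip again
crm resource cleanup raddb_ip

# Let recorded failures expire after 120s (illustrative value)
crm_resource --resource raddb_ip --meta \
    --set-parameter failure-timeout --parameter-value 120s

# Show scheduling class, realtime priority and nice value of the cluster daemons
ps -o pid,cls,rtprio,ni,comm -C corosync,lrmd,crmd
```

Once the current values are known, chrt (realtime policy) or renice (nice value) can be used to change them.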

>> And now I'm seeing again "resource unmanaged/failed" :(
>
>
>
>> Resource Group: raddb
>> raddb_ip (ocf::heartbeat:IPaddr2): Started radius1 (unmanaged)
>> FAILED
>>
>> Failed actions:
>> raddb_ip_monitor_15000 (node=radius1, call=4, rc=-2, status=Timed
>> Out): unknown exec error
>> raddb_ip_stop_0 (node=radius1, call=5, rc=-2, status=Timed Out):
>> unknown exec error