Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux-HA: Pacemaker

Among monitor of STONITH resource, monitor of other resources is not performed

 

 

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded


inouekazu at intellilink

Jul 30, 2012, 3:15 AM

Post #1 of 4 (246 views)
Permalink
Among monitor of STONITH resource, monitor of other resources is not performed

Hi,

I am using Pacemaker-1.1.
- glue (2012 Jul 16) 2719:18489f275f75
- libqb (2012 Jul 19) 11b20e19beff7f1b6003be0b4c73da8ecf936442
- corosync (2012 Jul 19) b9eb19e623d2b69c86cf78f1fa50a004c804ac20
- pacemaker (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc

The monitor operation of other resources is not performed during the
monitor of STONITH resource.

1. service corosync start ; service pacemaker start
2. cibadmin -U -x test.xml
(STONITH resource + Dummy resource are started on the same node)
3. After libvirt (STONITH resource) started, sleep was added to status()
of libvirt.

[root [at] dev external]# diff -u libvirt.ORG libvirt
--- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
+++ libvirt 2012-07-30 13:36:19.661431208 +0900
@@ -221,6 +221,7 @@
;;

status)
+ sleep 3600
libvirt_check_config
libvirt_status
exit $?

[root [at] dev ~]# ps -ef|egrep "UID|corosync|pacemaker|stonith|fence|sleep"
UID PID PPID C STIME TTY TIME CMD
root 18567 1 0 18:47 ? 00:00:02 corosync
root 18585 1 0 18:47 ? 00:00:00 pacemakerd
496 18587 18585 0 18:47 ? 00:00:00 /usr/libexec/pacemaker/cib
root 18588 18585 0 18:47 ? 00:00:00 /usr/libexec/pacemaker/stonithd
root 18589 18585 76 18:47 ? 00:05:27 /usr/libexec/pacemaker/lrmd
496 18590 18585 0 18:47 ? 00:00:00 /usr/libexec/pacemaker/attrd
496 18591 18585 0 18:47 ? 00:00:00 /usr/libexec/pacemaker/pengine
496 18592 18585 0 18:47 ? 00:00:00 /usr/libexec/pacemaker/crmd
root 18767 18588 0 18:48 ? 00:00:00 /usr/bin/perl /usr/sbin/fence_legacy
root 18768 18767 0 18:48 ? 00:00:00 stonith -t external/libvirt -E -S
root 18778 18768 0 18:48 ? 00:00:00 /bin/sh /usr/lib64/stonith/plugins/external/libvirt status
root 18792 18778 0 18:48 ? 00:00:00 sleep 3600

4. Then monitor of Dummy resource is not performed.

The following is behavior of lrmd at that time.

# gdb /usr/libexec/pacemaker/lrmd `pgrep lrmd`
(gdb) bt
#0 0x0000003f808e83e2 in recv () from /lib64/libc.so.6
#1 0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=0x1118ee8, msg=0x111f390,
len=20480, timeout=500) at ipc_us.c:299
#2 0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0, msg_ptr=0x111f390, msg_len=
20480, ms_timeout=500) at ipcc.c:249
#3 0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580, message=0x111d860, reply=
0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
#4 0x00007f0de3c97e29 in stonith_send_command (stonith=0x111a6c0, op=
0x7f0de3c998dd "st_execute", data=0x1119170, output_data=0x0, call_options=4096,
timeout=61000) at st_client.c:1676
#5 0x00007f0de3c94bd1 in stonith_api_call (stonith=0x111a6c0, call_options=4096,
id=0x1129860 "f-2", action=0x7f0de3c998f7 "monitor", victim=0x0, timeout=61000,
output=0x0) at st_client.c:951
#6 0x00007f0de3c94d31 in stonith_api_monitor (stonith=0x111a6c0, call_options=4096,
id=0x1129860 "f-2", timeout=61000) at st_client.c:985
#7 0x00000000004044e2 in lrmd_rsc_execute_stonith (rsc=0x111f130, cmd=0x1129660) at
lrmd.c:522
#8 0x0000000000404cd6 in lrmd_rsc_execute (rsc=0x111f130) at lrmd.c:667
#9 0x0000000000404d2d in lrmd_rsc_dispatch (user_data=0x111f130) at lrmd.c:678
#10 0x00007f0de42dcd00 in crm_trigger_dispatch (source=0x111f300, callback=
0x404d06 <lrmd_rsc_dispatch>, userdata=0x111f300) at mainloop.c:105
#11 0x0000003642638f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#12 0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
#13 0x000000364263cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#14 0x0000000000402d3f in main (argc=1, argv=0x7fffc8ec5188) at main.c:285
(gdb) fin
Run till exit from #0 0x0000003f808e83e2 in recv () from /lib64/libc.so.6
0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=0x1118ee8, msg=0x111f390,
len=20480, timeout=500) at ipc_us.c:299
299 result = recv(one_way->u.us.sock, &data[processed], to_recv,
(gdb) fin
Run till exit from #0 0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=
0x1118ee8, msg=0x111f390, len=20480, timeout=500) at ipc_us.c:299
0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0, msg_ptr=0x111f390, msg_len=20480,
ms_timeout=500) at ipcc.c:249
249 res = c->funcs.recv(&c->response, msg_ptr, msg_len, ms_timeout);
Value returned is $1 = -11
(gdb) fin
Run till exit from #0 0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0, msg_ptr=
0x111f390, msg_len=20480, ms_timeout=500) at ipcc.c:249
0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580, message=0x111d860, reply=
0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
517 rc = qb_ipcc_recv(client->ipc, client->buffer, client->buf_size, 500);
Value returned is $2 = -11
(gdb) n
518 if(rc > 0 || crm_ipc_connected(client) == FALSE) {
(gdb) p rc
$3 = -11
(gdb) n
522 } while(time(NULL) < timeout);
(gdb) n
517 rc = qb_ipcc_recv(client->ipc, client->buffer, client->buf_size, 500);
(gdb) n
518 if(rc > 0 || crm_ipc_connected(client) == FALSE) {
(gdb) p rc
$4 = -11
(gdb) n
522 } while(time(NULL) < timeout);
(gdb) n
517 rc = qb_ipcc_recv(client->ipc, client->buffer, client->buf_size, 500);
(gdb) n
518 if(rc > 0 || crm_ipc_connected(client) == FALSE) {
(gdb) p rc
$5 = -11
(gdb) n
522 } while(time(NULL) < timeout);
(gdb) n
517 rc = qb_ipcc_recv(client->ipc, client->buffer, client->buf_size, 500);
(gdb) n
518 if(rc > 0 || crm_ipc_connected(client) == FALSE) {
(gdb) p rc
$6 = -11
(gdb)

It seems that lrmd has repeated the reply reception from stonithd
out of the g_main_loop. Therefore, monitor of Dummy is not performed.

[root [at] dev ~]# top -bn1
top - 18:53:31 up 5 days, 8:56, 4 users, load average: 0.98, 0.63, 0.44
Tasks: 198 total, 2 running, 196 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.9%us, 1.2%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 5089052k total, 2417444k used, 2671608k free, 266660k buffers
Swap: 1048568k total, 0k used, 1048568k free, 1724616k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18589 root 20 0 83932 3448 2548 R 98.2 0.1 4:15.89 lrmd
1 root 20 0 19348 1520 1212 S 0.0 0.0 0:00.78 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:06.93 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 15:23.59 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.10 migration/0

Best Regards,
Kazunori INOUE
Attachments: test.xml (2.77 KB)


dvossel at redhat

Jul 31, 2012, 8:11 AM

Post #2 of 4 (216 views)
Permalink
Re: Among monitor of STONITH resource, monitor of other resources is not performed [In reply to]

----- Original Message -----
> From: "Kazunori INOUE" <inouekazu [at] intellilink>
> To: "pacemaker [at] os" <pacemaker [at] oss>
> Cc: shimazakik [at] intellilink
> Sent: Monday, July 30, 2012 5:15:26 AM
> Subject: [Pacemaker] Among monitor of STONITH resource, monitor of other resources is not performed
>
> Hi,
>
> I am using Pacemaker-1.1.
> - glue (2012 Jul 16) 2719:18489f275f75
> - libqb (2012 Jul 19) 11b20e19beff7f1b6003be0b4c73da8ecf936442
> - corosync (2012 Jul 19) b9eb19e623d2b69c86cf78f1fa50a004c804ac20
> - pacemaker (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc
>
> The monitor operation of other resources is not performed during the
> monitor of STONITH resource.

Looking at the code I can see this is true. How important is it for the stonith monitors to be able to be performed in parallel with the other resource monitors? I would expect the stonith monitor interval to be quite large compared to all the other monitors, meaning that it only executes once every hour or so.

-- Vossel

> 1. service corosync start ; service pacemaker start
> 2. cibadmin -U -x test.xml
> (STONITH resource + Dummy resource are started on the same node)
> 3. After libvirt (STONITH resource) started, sleep was added to
> status()
> of libvirt.
>
> [root [at] dev external]# diff -u libvirt.ORG libvirt
> --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
> +++ libvirt 2012-07-30 13:36:19.661431208 +0900
> @@ -221,6 +221,7 @@
> ;;
>
> status)
> + sleep 3600
> libvirt_check_config
> libvirt_status
> exit $?
>
> [root [at] dev ~]# ps -ef|egrep
> "UID|corosync|pacemaker|stonith|fence|sleep"
> UID PID PPID C STIME TTY TIME CMD
> root 18567 1 0 18:47 ? 00:00:02 corosync
> root 18585 1 0 18:47 ? 00:00:00 pacemakerd
> 496 18587 18585 0 18:47 ? 00:00:00 /usr/libexec/pacemaker/cib
> root 18588 18585 0 18:47 ? 00:00:00
> /usr/libexec/pacemaker/stonithd
> root 18589 18585 76 18:47 ? 00:05:27 /usr/libexec/pacemaker/lrmd
> 496 18590 18585 0 18:47 ? 00:00:00
> /usr/libexec/pacemaker/attrd
> 496 18591 18585 0 18:47 ? 00:00:00
> /usr/libexec/pacemaker/pengine
> 496 18592 18585 0 18:47 ? 00:00:00 /usr/libexec/pacemaker/crmd
> root 18767 18588 0 18:48 ? 00:00:00 /usr/bin/perl
> /usr/sbin/fence_legacy
> root 18768 18767 0 18:48 ? 00:00:00 stonith -t external/libvirt
> -E -S
> root 18778 18768 0 18:48 ? 00:00:00 /bin/sh
> /usr/lib64/stonith/plugins/external/libvirt status
> root 18792 18778 0 18:48 ? 00:00:00 sleep 3600
>
> 4. Then monitor of Dummy resource is not performed.
>
> The following is behavior of lrmd at that time.
>
> # gdb /usr/libexec/pacemaker/lrmd `pgrep lrmd`
> (gdb) bt
> #0 0x0000003f808e83e2 in recv () from /lib64/libc.so.6
> #1 0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=0x1118ee8,
> msg=0x111f390,
> len=20480, timeout=500) at ipc_us.c:299
> #2 0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0,
> msg_ptr=0x111f390, msg_len=
> 20480, ms_timeout=500) at ipcc.c:249
> #3 0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580,
> message=0x111d860, reply=
> 0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
> #4 0x00007f0de3c97e29 in stonith_send_command (stonith=0x111a6c0,
> op=
> 0x7f0de3c998dd "st_execute", data=0x1119170, output_data=0x0,
> call_options=4096,
> timeout=61000) at st_client.c:1676
> #5 0x00007f0de3c94bd1 in stonith_api_call (stonith=0x111a6c0,
> call_options=4096,
> id=0x1129860 "f-2", action=0x7f0de3c998f7 "monitor", victim=0x0,
> timeout=61000,
> output=0x0) at st_client.c:951
> #6 0x00007f0de3c94d31 in stonith_api_monitor (stonith=0x111a6c0,
> call_options=4096,
> id=0x1129860 "f-2", timeout=61000) at st_client.c:985
> #7 0x00000000004044e2 in lrmd_rsc_execute_stonith (rsc=0x111f130,
> cmd=0x1129660) at
> lrmd.c:522
> #8 0x0000000000404cd6 in lrmd_rsc_execute (rsc=0x111f130) at
> lrmd.c:667
> #9 0x0000000000404d2d in lrmd_rsc_dispatch (user_data=0x111f130) at
> lrmd.c:678
> #10 0x00007f0de42dcd00 in crm_trigger_dispatch (source=0x111f300,
> callback=
> 0x404d06 <lrmd_rsc_dispatch>, userdata=0x111f300) at
> mainloop.c:105
> #11 0x0000003642638f0e in g_main_context_dispatch () from
> /lib64/libglib-2.0.so.0
> #12 0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
> #13 0x000000364263cd55 in g_main_loop_run () from
> /lib64/libglib-2.0.so.0
> #14 0x0000000000402d3f in main (argc=1, argv=0x7fffc8ec5188) at
> main.c:285
> (gdb) fin
> Run till exit from #0 0x0000003f808e83e2 in recv () from
> /lib64/libc.so.6
> 0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=0x1118ee8,
> msg=0x111f390,
> len=20480, timeout=500) at ipc_us.c:299
> 299 result = recv(one_way->u.us.sock, &data[processed],
> to_recv,
> (gdb) fin
> Run till exit from #0 0x00007f0de3820062 in qb_ipc_us_recv_at_most
> (one_way=
> 0x1118ee8, msg=0x111f390, len=20480, timeout=500) at ipc_us.c:299
> 0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0, msg_ptr=0x111f390,
> msg_len=20480,
> ms_timeout=500) at ipcc.c:249
> 249 res = c->funcs.recv(&c->response, msg_ptr, msg_len,
> ms_timeout);
> Value returned is $1 = -11
> (gdb) fin
> Run till exit from #0 0x00007f0de381a28e in qb_ipcc_recv
> (c=0x1118ba0, msg_ptr=
> 0x111f390, msg_len=20480, ms_timeout=500) at ipcc.c:249
> 0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580,
> message=0x111d860, reply=
> 0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
> 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
> client->buf_size, 500);
> Value returned is $2 = -11
> (gdb) n
> 518 if(rc > 0 || crm_ipc_connected(client) == FALSE)
> {
> (gdb) p rc
> $3 = -11
> (gdb) n
> 522 } while(time(NULL) < timeout);
> (gdb) n
> 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
> client->buf_size, 500);
> (gdb) n
> 518 if(rc > 0 || crm_ipc_connected(client) == FALSE)
> {
> (gdb) p rc
> $4 = -11
> (gdb) n
> 522 } while(time(NULL) < timeout);
> (gdb) n
> 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
> client->buf_size, 500);
> (gdb) n
> 518 if(rc > 0 || crm_ipc_connected(client) == FALSE)
> {
> (gdb) p rc
> $5 = -11
> (gdb) n
> 522 } while(time(NULL) < timeout);
> (gdb) n
> 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
> client->buf_size, 500);
> (gdb) n
> 518 if(rc > 0 || crm_ipc_connected(client) == FALSE)
> {
> (gdb) p rc
> $6 = -11
> (gdb)
>
> It seems that lrmd has repeated the reply reception from stonithd
> out of the g_main_loop. Therefore, monitor of Dummy is not performed.
>
> [root [at] dev ~]# top -bn1
> top - 18:53:31 up 5 days, 8:56, 4 users, load average: 0.98, 0.63,
> 0.44
> Tasks: 198 total, 2 running, 196 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.9%us, 1.2%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.0%hi,
> 0.0%si, 0.0%st
> Mem: 5089052k total, 2417444k used, 2671608k free, 266660k
> buffers
> Swap: 1048568k total, 0k used, 1048568k free, 1724616k
> cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 18589 root 20 0 83932 3448 2548 R 98.2 0.1 4:15.89 lrmd
> 1 root 20 0 19348 1520 1212 S 0.0 0.0 0:00.78 init
> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
> 3 root RT 0 0 0 0 S 0.0 0.0 0:06.93
> migration/0
> 4 root 20 0 0 0 0 S 0.0 0.0 15:23.59
> ksoftirqd/0
> 5 root RT 0 0 0 0 S 0.0 0.0 0:00.10
> migration/0
>
> Best Regards,
> Kazunori INOUE
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


dvossel at redhat

Jul 31, 2012, 2:49 PM

Post #3 of 4 (218 views)
Permalink
Re: Among monitor of STONITH resource, monitor of other resources is not performed [In reply to]

----- Original Message -----
> From: "David Vossel" <dvossel [at] redhat>
> To: "The Pacemaker cluster resource manager" <pacemaker [at] oss>
> Sent: Tuesday, July 31, 2012 10:11:44 AM
> Subject: Re: [Pacemaker] Among monitor of STONITH resource, monitor of other resources is not performed
>
> ----- Original Message -----
> > From: "Kazunori INOUE" <inouekazu [at] intellilink>
> > To: "pacemaker [at] os" <pacemaker [at] oss>
> > Cc: shimazakik [at] intellilink
> > Sent: Monday, July 30, 2012 5:15:26 AM
> > Subject: [Pacemaker] Among monitor of STONITH resource, monitor of
> > other resources is not performed
> >
> > Hi,
> >
> > I am using Pacemaker-1.1.
> > - glue (2012 Jul 16) 2719:18489f275f75
> > - libqb (2012 Jul 19) 11b20e19beff7f1b6003be0b4c73da8ecf936442
> > - corosync (2012 Jul 19) b9eb19e623d2b69c86cf78f1fa50a004c804ac20
> > - pacemaker (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc
> >
> > The monitor operation of other resources is not performed during
> > the
> > monitor of STONITH resource.
>
> Looking at the code I can see this is true. How important is it for
> the stonith monitors to be able to be performed in parallel with the
> other resource monitors? I would expect the stonith monitor
> interval to be quite large compared to all the other monitors,
> meaning that it only executes once every hour or so.
>
> -- Vossel

Yeah, this should be fixed. Can you create an issue for this on bugs.clusterlabs.org with your crm_report please.

-- Vossel


>
> > 1. service corosync start ; service pacemaker start
> > 2. cibadmin -U -x test.xml
> > (STONITH resource + Dummy resource are started on the same node)
> > 3. After libvirt (STONITH resource) started, sleep was added to
> > status()
> > of libvirt.
> >
> > [root [at] dev external]# diff -u libvirt.ORG libvirt
> > --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
> > +++ libvirt 2012-07-30 13:36:19.661431208 +0900
> > @@ -221,6 +221,7 @@
> > ;;
> >
> > status)
> > + sleep 3600
> > libvirt_check_config
> > libvirt_status
> > exit $?
> >
> > [root [at] dev ~]# ps -ef|egrep
> > "UID|corosync|pacemaker|stonith|fence|sleep"
> > UID PID PPID C STIME TTY TIME CMD
> > root 18567 1 0 18:47 ? 00:00:02 corosync
> > root 18585 1 0 18:47 ? 00:00:00 pacemakerd
> > 496 18587 18585 0 18:47 ? 00:00:00
> > /usr/libexec/pacemaker/cib
> > root 18588 18585 0 18:47 ? 00:00:00
> > /usr/libexec/pacemaker/stonithd
> > root 18589 18585 76 18:47 ? 00:05:27
> > /usr/libexec/pacemaker/lrmd
> > 496 18590 18585 0 18:47 ? 00:00:00
> > /usr/libexec/pacemaker/attrd
> > 496 18591 18585 0 18:47 ? 00:00:00
> > /usr/libexec/pacemaker/pengine
> > 496 18592 18585 0 18:47 ? 00:00:00
> > /usr/libexec/pacemaker/crmd
> > root 18767 18588 0 18:48 ? 00:00:00 /usr/bin/perl
> > /usr/sbin/fence_legacy
> > root 18768 18767 0 18:48 ? 00:00:00 stonith -t
> > external/libvirt
> > -E -S
> > root 18778 18768 0 18:48 ? 00:00:00 /bin/sh
> > /usr/lib64/stonith/plugins/external/libvirt status
> > root 18792 18778 0 18:48 ? 00:00:00 sleep 3600
> >
> > 4. Then monitor of Dummy resource is not performed.
> >
> > The following is behavior of lrmd at that time.
> >
> > # gdb /usr/libexec/pacemaker/lrmd `pgrep lrmd`
> > (gdb) bt
> > #0 0x0000003f808e83e2 in recv () from /lib64/libc.so.6
> > #1 0x00007f0de3820062 in qb_ipc_us_recv_at_most
> > (one_way=0x1118ee8,
> > msg=0x111f390,
> > len=20480, timeout=500) at ipc_us.c:299
> > #2 0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0,
> > msg_ptr=0x111f390, msg_len=
> > 20480, ms_timeout=500) at ipcc.c:249
> > #3 0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580,
> > message=0x111d860, reply=
> > 0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
> > #4 0x00007f0de3c97e29 in stonith_send_command (stonith=0x111a6c0,
> > op=
> > 0x7f0de3c998dd "st_execute", data=0x1119170, output_data=0x0,
> > call_options=4096,
> > timeout=61000) at st_client.c:1676
> > #5 0x00007f0de3c94bd1 in stonith_api_call (stonith=0x111a6c0,
> > call_options=4096,
> > id=0x1129860 "f-2", action=0x7f0de3c998f7 "monitor",
> > victim=0x0,
> > timeout=61000,
> > output=0x0) at st_client.c:951
> > #6 0x00007f0de3c94d31 in stonith_api_monitor (stonith=0x111a6c0,
> > call_options=4096,
> > id=0x1129860 "f-2", timeout=61000) at st_client.c:985
> > #7 0x00000000004044e2 in lrmd_rsc_execute_stonith (rsc=0x111f130,
> > cmd=0x1129660) at
> > lrmd.c:522
> > #8 0x0000000000404cd6 in lrmd_rsc_execute (rsc=0x111f130) at
> > lrmd.c:667
> > #9 0x0000000000404d2d in lrmd_rsc_dispatch (user_data=0x111f130)
> > at
> > lrmd.c:678
> > #10 0x00007f0de42dcd00 in crm_trigger_dispatch (source=0x111f300,
> > callback=
> > 0x404d06 <lrmd_rsc_dispatch>, userdata=0x111f300) at
> > mainloop.c:105
> > #11 0x0000003642638f0e in g_main_context_dispatch () from
> > /lib64/libglib-2.0.so.0
> > #12 0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
> > #13 0x000000364263cd55 in g_main_loop_run () from
> > /lib64/libglib-2.0.so.0
> > #14 0x0000000000402d3f in main (argc=1, argv=0x7fffc8ec5188) at
> > main.c:285
> > (gdb) fin
> > Run till exit from #0 0x0000003f808e83e2 in recv () from
> > /lib64/libc.so.6
> > 0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=0x1118ee8,
> > msg=0x111f390,
> > len=20480, timeout=500) at ipc_us.c:299
> > 299 result = recv(one_way->u.us.sock, &data[processed],
> > to_recv,
> > (gdb) fin
> > Run till exit from #0 0x00007f0de3820062 in qb_ipc_us_recv_at_most
> > (one_way=
> > 0x1118ee8, msg=0x111f390, len=20480, timeout=500) at
> > ipc_us.c:299
> > 0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0, msg_ptr=0x111f390,
> > msg_len=20480,
> > ms_timeout=500) at ipcc.c:249
> > 249 res = c->funcs.recv(&c->response, msg_ptr, msg_len,
> > ms_timeout);
> > Value returned is $1 = -11
> > (gdb) fin
> > Run till exit from #0 0x00007f0de381a28e in qb_ipcc_recv
> > (c=0x1118ba0, msg_ptr=
> > 0x111f390, msg_len=20480, ms_timeout=500) at ipcc.c:249
> > 0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580,
> > message=0x111d860, reply=
> > 0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
> > 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
> > client->buf_size, 500);
> > Value returned is $2 = -11
> > (gdb) n
> > 518 if(rc > 0 || crm_ipc_connected(client) ==
> > FALSE)
> > {
> > (gdb) p rc
> > $3 = -11
> > (gdb) n
> > 522 } while(time(NULL) < timeout);
> > (gdb) n
> > 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
> > client->buf_size, 500);
> > (gdb) n
> > 518 if(rc > 0 || crm_ipc_connected(client) ==
> > FALSE)
> > {
> > (gdb) p rc
> > $4 = -11
> > (gdb) n
> > 522 } while(time(NULL) < timeout);
> > (gdb) n
> > 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
> > client->buf_size, 500);
> > (gdb) n
> > 518 if(rc > 0 || crm_ipc_connected(client) ==
> > FALSE)
> > {
> > (gdb) p rc
> > $5 = -11
> > (gdb) n
> > 522 } while(time(NULL) < timeout);
> > (gdb) n
> > 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
> > client->buf_size, 500);
> > (gdb) n
> > 518 if(rc > 0 || crm_ipc_connected(client) ==
> > FALSE)
> > {
> > (gdb) p rc
> > $6 = -11
> > (gdb)
> >
> > It seems that lrmd has repeated the reply reception from stonithd
> > out of the g_main_loop. Therefore, monitor of Dummy is not
> > performed.
> >
> > [root [at] dev ~]# top -bn1
> > top - 18:53:31 up 5 days, 8:56, 4 users, load average: 0.98,
> > 0.63,
> > 0.44
> > Tasks: 198 total, 2 running, 196 sleeping, 0 stopped, 0
> > zombie
> > Cpu(s): 0.9%us, 1.2%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.0%hi,
> > 0.0%si, 0.0%st
> > Mem: 5089052k total, 2417444k used, 2671608k free, 266660k
> > buffers
> > Swap: 1048568k total, 0k used, 1048568k free, 1724616k
> > cached
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
> > COMMAND
> > 18589 root 20 0 83932 3448 2548 R 98.2 0.1 4:15.89 lrmd
> > 1 root 20 0 19348 1520 1212 S 0.0 0.0 0:00.78 init
> > 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00
> > kthreadd
> > 3 root RT 0 0 0 0 S 0.0 0.0 0:06.93
> > migration/0
> > 4 root 20 0 0 0 0 S 0.0 0.0 15:23.59
> > ksoftirqd/0
> > 5 root RT 0 0 0 0 S 0.0 0.0 0:00.10
> > migration/0
> >
> > Best Regards,
> > Kazunori INOUE
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker [at] oss
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


inouekazu at intellilink

Aug 2, 2012, 6:55 PM

Post #4 of 4 (209 views)
Permalink
Re: Among monitor of STONITH resource, monitor of other resources is not performed [In reply to]

Hi David,

I reported it in Bugzilla with crm_report. Please confirm it.
* http://bugs.clusterlabs.org/show_bug.cgi?id=5090

Best Regards,
Kazunori INOUE

(12.08.01 06:49), David Vossel wrote:
>
>
> ----- Original Message -----
>> From: "David Vossel" <dvossel [at] redhat>
>> To: "The Pacemaker cluster resource manager" <pacemaker [at] oss>
>> Sent: Tuesday, July 31, 2012 10:11:44 AM
>> Subject: Re: [Pacemaker] Among monitor of STONITH resource, monitor of other resources is not performed
>>
>> ----- Original Message -----
>>> From: "Kazunori INOUE" <inouekazu [at] intellilink>
>>> To: "pacemaker [at] os" <pacemaker [at] oss>
>>> Cc: shimazakik [at] intellilink
>>> Sent: Monday, July 30, 2012 5:15:26 AM
>>> Subject: [Pacemaker] Among monitor of STONITH resource, monitor of
>>> other resources is not performed
>>>
>>> Hi,
>>>
>>> I am using Pacemaker-1.1.
>>> - glue (2012 Jul 16) 2719:18489f275f75
>>> - libqb (2012 Jul 19) 11b20e19beff7f1b6003be0b4c73da8ecf936442
>>> - corosync (2012 Jul 19) b9eb19e623d2b69c86cf78f1fa50a004c804ac20
>>> - pacemaker (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc
>>>
>>> The monitor operation of other resources is not performed during
>>> the
>>> monitor of STONITH resource.
>>
>> Looking at the code I can see this is true. How important is it for
>> the stonith monitors to be able to be performed in parallel with the
>> other resource monitors? I would expect the stonith monitor
>> interval to be quite large compared to all the other monitors,
>> meaning that it only executes once every hour or so.
>>
>> -- Vossel
>
> Yeah, this should be fixed. Can you create an issue for this on bugs.clusterlabs.org with your crm_report please.
>
> -- Vossel
>
>
>>
>>> 1. service corosync start ; service pacemaker start
>>> 2. cibadmin -U -x test.xml
>>> (STONITH resource + Dummy resource are started on the same node)
>>> 3. After libvirt (STONITH resource) started, sleep was added to
>>> status()
>>> of libvirt.
>>>
>>> [root [at] dev external]# diff -u libvirt.ORG libvirt
>>> --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
>>> +++ libvirt 2012-07-30 13:36:19.661431208 +0900
>>> @@ -221,6 +221,7 @@
>>> ;;
>>>
>>> status)
>>> + sleep 3600
>>> libvirt_check_config
>>> libvirt_status
>>> exit $?
>>>
>>> [root [at] dev ~]# ps -ef|egrep
>>> "UID|corosync|pacemaker|stonith|fence|sleep"
>>> UID PID PPID C STIME TTY TIME CMD
>>> root 18567 1 0 18:47 ? 00:00:02 corosync
>>> root 18585 1 0 18:47 ? 00:00:00 pacemakerd
>>> 496 18587 18585 0 18:47 ? 00:00:00
>>> /usr/libexec/pacemaker/cib
>>> root 18588 18585 0 18:47 ? 00:00:00
>>> /usr/libexec/pacemaker/stonithd
>>> root 18589 18585 76 18:47 ? 00:05:27
>>> /usr/libexec/pacemaker/lrmd
>>> 496 18590 18585 0 18:47 ? 00:00:00
>>> /usr/libexec/pacemaker/attrd
>>> 496 18591 18585 0 18:47 ? 00:00:00
>>> /usr/libexec/pacemaker/pengine
>>> 496 18592 18585 0 18:47 ? 00:00:00
>>> /usr/libexec/pacemaker/crmd
>>> root 18767 18588 0 18:48 ? 00:00:00 /usr/bin/perl
>>> /usr/sbin/fence_legacy
>>> root 18768 18767 0 18:48 ? 00:00:00 stonith -t
>>> external/libvirt
>>> -E -S
>>> root 18778 18768 0 18:48 ? 00:00:00 /bin/sh
>>> /usr/lib64/stonith/plugins/external/libvirt status
>>> root 18792 18778 0 18:48 ? 00:00:00 sleep 3600
>>>
>>> 4. Then monitor of Dummy resource is not performed.
>>>
>>> The following is behavior of lrmd at that time.
>>>
>>> # gdb /usr/libexec/pacemaker/lrmd `pgrep lrmd`
>>> (gdb) bt
>>> #0 0x0000003f808e83e2 in recv () from /lib64/libc.so.6
>>> #1 0x00007f0de3820062 in qb_ipc_us_recv_at_most
>>> (one_way=0x1118ee8,
>>> msg=0x111f390,
>>> len=20480, timeout=500) at ipc_us.c:299
>>> #2 0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0,
>>> msg_ptr=0x111f390, msg_len=
>>> 20480, ms_timeout=500) at ipcc.c:249
>>> #3 0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580,
>>> message=0x111d860, reply=
>>> 0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
>>> #4 0x00007f0de3c97e29 in stonith_send_command (stonith=0x111a6c0,
>>> op=
>>> 0x7f0de3c998dd "st_execute", data=0x1119170, output_data=0x0,
>>> call_options=4096,
>>> timeout=61000) at st_client.c:1676
>>> #5 0x00007f0de3c94bd1 in stonith_api_call (stonith=0x111a6c0,
>>> call_options=4096,
>>> id=0x1129860 "f-2", action=0x7f0de3c998f7 "monitor",
>>> victim=0x0,
>>> timeout=61000,
>>> output=0x0) at st_client.c:951
>>> #6 0x00007f0de3c94d31 in stonith_api_monitor (stonith=0x111a6c0,
>>> call_options=4096,
>>> id=0x1129860 "f-2", timeout=61000) at st_client.c:985
>>> #7 0x00000000004044e2 in lrmd_rsc_execute_stonith (rsc=0x111f130,
>>> cmd=0x1129660) at
>>> lrmd.c:522
>>> #8 0x0000000000404cd6 in lrmd_rsc_execute (rsc=0x111f130) at
>>> lrmd.c:667
>>> #9 0x0000000000404d2d in lrmd_rsc_dispatch (user_data=0x111f130)
>>> at
>>> lrmd.c:678
>>> #10 0x00007f0de42dcd00 in crm_trigger_dispatch (source=0x111f300,
>>> callback=
>>> 0x404d06 <lrmd_rsc_dispatch>, userdata=0x111f300) at
>>> mainloop.c:105
>>> #11 0x0000003642638f0e in g_main_context_dispatch () from
>>> /lib64/libglib-2.0.so.0
>>> #12 0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
>>> #13 0x000000364263cd55 in g_main_loop_run () from
>>> /lib64/libglib-2.0.so.0
>>> #14 0x0000000000402d3f in main (argc=1, argv=0x7fffc8ec5188) at
>>> main.c:285
>>> (gdb) fin
>>> Run till exit from #0 0x0000003f808e83e2 in recv () from
>>> /lib64/libc.so.6
>>> 0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=0x1118ee8,
>>> msg=0x111f390,
>>> len=20480, timeout=500) at ipc_us.c:299
>>> 299 result = recv(one_way->u.us.sock, &data[processed],
>>> to_recv,
>>> (gdb) fin
>>> Run till exit from #0 0x00007f0de3820062 in qb_ipc_us_recv_at_most
>>> (one_way=
>>> 0x1118ee8, msg=0x111f390, len=20480, timeout=500) at
>>> ipc_us.c:299
>>> 0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0, msg_ptr=0x111f390,
>>> msg_len=20480,
>>> ms_timeout=500) at ipcc.c:249
>>> 249 res = c->funcs.recv(&c->response, msg_ptr, msg_len,
>>> ms_timeout);
>>> Value returned is $1 = -11
>>> (gdb) fin
>>> Run till exit from #0 0x00007f0de381a28e in qb_ipcc_recv
>>> (c=0x1118ba0, msg_ptr=
>>> 0x111f390, msg_len=20480, ms_timeout=500) at ipcc.c:249
>>> 0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580,
>>> message=0x111d860, reply=
>>> 0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
>>> 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
>>> client->buf_size, 500);
>>> Value returned is $2 = -11
>>> (gdb) n
>>> 518 if(rc > 0 || crm_ipc_connected(client) ==
>>> FALSE)
>>> {
>>> (gdb) p rc
>>> $3 = -11
>>> (gdb) n
>>> 522 } while(time(NULL) < timeout);
>>> (gdb) n
>>> 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
>>> client->buf_size, 500);
>>> (gdb) n
>>> 518 if(rc > 0 || crm_ipc_connected(client) ==
>>> FALSE)
>>> {
>>> (gdb) p rc
>>> $4 = -11
>>> (gdb) n
>>> 522 } while(time(NULL) < timeout);
>>> (gdb) n
>>> 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
>>> client->buf_size, 500);
>>> (gdb) n
>>> 518 if(rc > 0 || crm_ipc_connected(client) ==
>>> FALSE)
>>> {
>>> (gdb) p rc
>>> $5 = -11
>>> (gdb) n
>>> 522 } while(time(NULL) < timeout);
>>> (gdb) n
>>> 517 rc = qb_ipcc_recv(client->ipc, client->buffer,
>>> client->buf_size, 500);
>>> (gdb) n
>>> 518 if(rc > 0 || crm_ipc_connected(client) ==
>>> FALSE)
>>> {
>>> (gdb) p rc
>>> $6 = -11
>>> (gdb)
>>>
>>> It seems that lrmd has repeated the reply reception from stonithd
>>> out of the g_main_loop. Therefore, monitor of Dummy is not
>>> performed.
>>>
>>> [root [at] dev ~]# top -bn1
>>> top - 18:53:31 up 5 days, 8:56, 4 users, load average: 0.98,
>>> 0.63,
>>> 0.44
>>> Tasks: 198 total, 2 running, 196 sleeping, 0 stopped, 0
>>> zombie
>>> Cpu(s): 0.9%us, 1.2%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.0%hi,
>>> 0.0%si, 0.0%st
>>> Mem: 5089052k total, 2417444k used, 2671608k free, 266660k
>>> buffers
>>> Swap: 1048568k total, 0k used, 1048568k free, 1724616k
>>> cached
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
>>> COMMAND
>>> 18589 root 20 0 83932 3448 2548 R 98.2 0.1 4:15.89 lrmd
>>> 1 root 20 0 19348 1520 1212 S 0.0 0.0 0:00.78 init
>>> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00
>>> kthreadd
>>> 3 root RT 0 0 0 0 S 0.0 0.0 0:06.93
>>> migration/0
>>> 4 root 20 0 0 0 0 S 0.0 0.0 15:23.59
>>> ksoftirqd/0
>>> 5 root RT 0 0 0 0 S 0.0 0.0 0:00.10
>>> migration/0
>>>
>>> Best Regards,
>>> Kazunori INOUE
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker [at] oss
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started:
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker [at] oss
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker [at] oss
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

_______________________________________________
Pacemaker mailing list: Pacemaker [at] oss
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Linux-HA pacemaker RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.