Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Linux-HA: Users
Re: [patch] clock_t wrapped around causingfalse resourcestart failure
 

Index | Next | Previous | View Flat


Simon.Tavanyar at stratus

Jul 1, 2009, 7:22 AM


Views: 568
Permalink
Re: [patch] clock_t wrapped around causingfalse resourcestart failure

Aahh ... so it's measured after reboot. That makes sense.
Thanks, Lars.

Should I wait for you to log a bugzilla on this?




-----Original Message-----
From: linux-ha-bounces[at]lists.linux-ha.org
[mailto:linux-ha-bounces[at]lists.linux-ha.org] On Behalf Of Lars Ellenberg
Sent: Wednesday, July 01, 2009 6:28 AM
To: linux-ha[at]lists.linux-ha.org
Subject: Re: [Linux-HA] [patch] clock_t wrapped around causingfalse
resourcestart failure

On Tue, Jun 30, 2009 at 10:03:29AM -0400, Tavanyar, Simon wrote:
> Hi Dejan,
>
> The bug looks like a one-off occurrence. We run hundreds of hours of
> system stress tests in a week, moving resources between main and
standby
> systems, and we haven't seen this error in a couple years. (There was
a
> longclock error back in 2007 found by my colleague Simon Graham).
>
> The longclock wrap occurred within 2:45 of a reboot.
> The apparent coincidence seems to be that we were starting resources
on
> a back-up node around 165 seconds after the node had been rebooted and
> hearbeat restarted. As I expect you know, somewhere between 160 and
175
> seconds after a heartbeat start, the longclock is configured to wrap.

Nonono.
times() is coupled to uptime,
uptime is measured in jiffies,
and initial jiffies in linux kernel is
include/linux/jiffies.h:
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

so the first wrap occurs 5 minutes after boot.
nothing to do with heartbeat start, or longclock.

> The rareness of this makes me think we hit a really obscure window...

yes.
but.

it is totally broken.
some grepping in my mercurial checkouts suggest that you are using
heartbeat, which is using its "plumbing" wrappers.

[skip this paragraph unless you are heartbeat coder]
in particular, the killing is done by
TrackedProcTimeoutFunction(),
and the timeout for that is set by SetTrackedProcTimeouts(),
which calls into Gmain_timeout_add -> Gmain_timeout_add_full
(still in the heartbeat plumbing wrappers),
which creates a new event source of type GTimoutAppend
(also heartbeat plumbing) with methods defined by Gmain_timeout_funcs.

The Gmain_timeout_prepare does a simple check on
"is the next time this event shall trigger in the past?",
and if so, triggers immediately.

Now, that check is broken.
in lib/clplumbing/GSource.c, Gmain_timeout_prepare():
if (cmp_longclock(lnow, append->nexttime) >= 0)
assuming that nexttime was set correctly, and lnow is correct, too,
and further assuming your clock_t is only 32 bit,
longclock_t is defined as unsigned long long,
and that thing becomes:

if ((unsigned long long)(lnow) >= (unsigned long
long)(append-nexttime))

which exactly does _NOT_ care for wrap around :(

cmp_longclock macro and functions are broken. conceivably some other
longclock related macros/functions may be broken as well.

it should probably be more like
(caution, not even compile tested)

diff -r 731f8f7b5450 include/clplumbing/longclock.h
--- a/include/clplumbing/longclock.h Tue Jun 30 12:02:16 2009 +0200
+++ b/include/clplumbing/longclock.h Wed Jul 01 12:27:27 2009 +0200
@@ -127,9 +127,9 @@
((longclock_t)(l) - (longclock_t)(r))

# define cmp_longclock(l,r) \
- (((longclock_t)(l) < (longclock_t)(r)) \
+ ((((long long)(l) - (long long)(r)) < 0) \
? -1 \
- : (((longclock_t)(l) > (longclock_t)(r)) \
+ : ((((long long)(l) - (long long)(r)) > 0) \
? +1 : 0))
#endif

diff -r 731f8f7b5450 lib/clplumbing/longclock.c
--- a/lib/clplumbing/longclock.c Tue Jun 30 12:02:16 2009 +0200
+++ b/lib/clplumbing/longclock.c Wed Jul 01 12:27:27 2009 +0200
@@ -259,12 +259,7 @@
int
cmp_longclock(longclock_t l, longclock_t r)
{
- if (l < r) {
- return -1;
- }
- if (l > r) {
- return 1;
- }
- return 0;
+ return (((long long)(l) - (long long)(r)) < 0) ? -1
+ : (((long long)(l) - (long long)(r)) > 0) ? +1 : 0;
}
#endif /* CLOCK_T_IS_LONG_ENOUGH */



--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD(r) and LINBIT(r) are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
Linux-HA[at]lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA[at]lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Subject User Time
Re: [patch] clock_t wrapped around causingfalse resourcestart failure Simon.Tavanyar at stratus Jul 1, 2009, 7:22 AM
    Re: [patch] clock_t wrapped around causingfalse resourcestart failure lars.ellenberg at linbit Jul 1, 2009, 7:44 AM
        Re: [patch] clock_t wrapped around causingfalse resourcestart failure dejanmm at fastmail Jul 1, 2009, 7:59 AM
            Re: [patch] clock_t wrapped around causingfalse resourcestart failure lars.ellenberg at linbit Jul 1, 2009, 8:13 AM
                Re: [patch] clock_t wrapped around causingfalse resourcestart failure dejanmm at fastmail Jul 2, 2009, 2:00 AM
                    Re: [patch] clock_t wrapped around causingfalse resourcestart failure lars.ellenberg at linbit Jul 2, 2009, 5:16 AM
        Re: [patch] clock_t wrapped aroundcausingfalse resourcestart failure Simon.Tavanyar at stratus Jul 5, 2009, 3:09 PM
            Re: [patch] clock_t wrapped aroundcausingfalse resourcestart failure dejanmm at fastmail Jul 6, 2009, 1:53 AM

  Index | Next | Previous | View Flat
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.