Mailing List Archive: Linux-HA: Users

Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

 

 



Ulrich.Windl at rz

Aug 14, 2012, 8:48 AM

Post #1 of 22 (692 views)
Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438

>>> Lars Marowsky-Bree <lmb [at] suse> wrote on 14.08.2012 at 17:03 in message
<20120814150328.GD3944 [at] suse>:
> On 2012-08-14T16:59:02, Ulrich Windl <Ulrich.Windl [at] rz> wrote:
>

To Lars:

Problems arrive in bursts here. Whenever I forward them to support, support complains about the number of problems being reported, so I tried to pre-filter them.
Your message arrived, BTW. ;-)

Regards,
Ulrich

> Perhaps. Note that that version really should be discussed via support,
> not here. But I've been telling you that so often that I don't think the
> message is being received ;-)
>
>
> Regards,
> Lars





_______________________________________________
Linux-HA mailing list
Linux-HA [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


lmb at suse

Aug 14, 2012, 10:29 AM

Post #2 of 22 (660 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-08-14T17:48:47, Ulrich Windl <Ulrich.Windl [at] rz> wrote:

FWIW, if you can try to reproduce in 1.1.7, that may be interesting. I'm
still not sure on the sequence of events to cause it, so I can't try
locally.

hb_report would be the minimum.

> Your message arrived, BTW. ;-)

It's not that we don't want to help, by the way. But work we do via
support actually shows up as time spent on customers, as opposed to
"general community relations" such as mailing list discussions. And the
former just gets higher priority and looks better to our bosses ;-)


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



Ulrich.Windl at rz

Aug 15, 2012, 11:22 PM

Post #3 of 22 (654 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

>>> Lars Marowsky-Bree <lmb [at] suse> wrote on 14.08.2012 at 19:29 in message
<20120814172940.GE3944 [at] suse>:
> On 2012-08-14T17:48:47, Ulrich Windl <Ulrich.Windl [at] rz> wrote:
>
> FWIW, if you can try to reproduce in 1.1.7, that may be interesting. I'm
> still not sure on the sequence of events to cause it, so I can't try
> locally.

Hi!

What I am doing right now is evaluating SLES11 SP2 (we run SP1). So testing anything that's not part of SP2 (plus updates) is not planned at the moment.

I also think that reporting problems here early might get you mentally prepared for when the problem is eventually reported via official support.

Also, in times of Google, other people may be interested to see what others have found out.

>
> hb_report would be the minimum.

I'm still setting up the test cluster. Once it's in the state it should be, I'll provide more details (if there are still errors).

Regards,
Ulrich


>
> > Your message arrived, BTW. ;-)
>
> It's not that we don't want to help, by the way. But work we do via
> support actually shows up as time spent on customers, as opposed to
> "general community relations" such as mailing lists discussions. And the
> former just gets higher priority and looks better to our bosses ;-)
>
>
> Regards,
> Lars







external.Martin.Konold at de

Aug 16, 2012, 8:54 AM

Post #4 of 22 (656 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

Hi,

> What I do is evaluation of SLES11 SP2 (we run SP1) now. So testing anything that's not part of SP2 (plus Updates)
> is not planned right now.

> I also think when reporting problems here early might get you mentally prepared when the problem is eventually
> reported via official support.

> Maybe also in times of google, other people may be interested to see what other people found out.

From my experience with SLES11 SP2 (with all current updates), I conclude that nobody is seriously running SP2 without local bugfixes.

E.g., even the simplest examples from the official SuSE documentation don't work as expected.

A trivial example: ocf:heartbeat:exportfs as distributed by SuSE with SP2 causes unlimited growth of .rmtab files (they quickly reach gigabytes on serious NFS servers). I could work around this issue using some shell scripting.

There are other issues which are more than annoying and actually make the SLES SP2 HA Extension unusable for production systems. E.g., clvmd cannot be made less verbose from the cluster configuration. (No, daemon_options="-d0" does not help!)

Also not funny is the fact that the official SLES 11 SP2 kernels crash seriously (when a node rejoins the cluster) when using SCTP as recommended in the SLES HA documentation and offered via the wizards. It took me a while to find out what was going on.

When setting up a system with many (rather simple) resources, funny things happen due to race conditions all over the place. (This can mostly be worked around using arbitrary start-delay options.)

Oh, did I mention that situations which are actually forbidden by constraints (e.g. using a score of INFINITY) actually do happen... Depending on the environment this can lead to not so funny effects.

E.g. I defined the following constraints:

colocation c17 inf: p_lsb_ccslogserver p_fs_daten
order o34 inf: p_fs_daten p_lsb_ccslogserver:start

I can prove from the logs that ccslogserver (an application) got migrated from node A to node B while p_fs_daten (a filesystem on top of drbd) was definitely still running on node A.
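To make the claim concrete: a mandatory (inf) colocation of p_lsb_ccslogserver with p_fs_daten means the application may only be active on the node where the filesystem is active. The following is only a minimal sketch of that rule, not Pacemaker code; the function name and the node names nodeA/nodeB are made up for illustration, and the placement dictionary is assumed to have been read off the logs by hand.

```python
from typing import Dict, Optional


def violates_mandatory_colocation(
    placement: Dict[str, Optional[str]], dependent: str, with_rsc: str
) -> bool:
    """Return True if `dependent` is active on a node other than the one
    where `with_rsc` is active (hypothetical helper, not a Pacemaker API)."""
    dep_node = placement.get(dependent)
    if dep_node is None:
        # A stopped dependent resource cannot violate the colocation.
        return False
    return dep_node != placement.get(with_rsc)


# The situation described above: application on node B, filesystem on node A.
observed = {"p_lsb_ccslogserver": "nodeB", "p_fs_daten": "nodeA"}
print(violates_mandatory_colocation(observed, "p_lsb_ccslogserver", "p_fs_daten"))
```

For the placement shown, the check prints True, i.e. the observed state is one the colocation constraint should have ruled out.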

Reporting bugs is not possible without a direct support contract. (You must enter into a support contract with SuSE before you can even report a bug or provide a patch ....)

Regards

Martin Konold
(Who has maintained SuSE clusters since 2001)


Ulrich.Windl at rz

Aug 16, 2012, 11:19 PM

Post #5 of 22 (657 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

>>> "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)"
<external.Martin.Konold [at] de> wrote on 16.08.2012 at 17:54 in message
<B326DCECA83A2E44BE98082EA1A715FA19CEB84E79 [at] SI-MBX13>:

[...]
> From my experience with SLES11 SP2 (with all current updates) I conclude
> that actually nobody is seriously running SP2 without local bugfixes.

Unfortunately that's true for SP1 as well: We had to use a newer corosync (among others).

>
> E.g. Even the most simple examples from the official SuSE documentation
> don't work as expected.
>
> A trivial example is ocf:heartbeat:exportfs as distributed by SuSE with SP2
> causes unlimited growth of .rmtab files (goes fast in the gigabytes for
> serious NFS servers). I could work around this issue using some shell
> scripting.

Yes, we had that, too for SP1. Fixed in "resource-agents-3.9.2-0.4.2.1.4061.0.PTF.754067" (just for reference). Unfortunately the problem only shows up when seriously using the NFS server.

>
> There are other issues which are more than annoying and actually make the
> SLES SP2 HA Extension unusable for production systems. E.g. clvmd cannot be
> made less verbose from the cluster configuration. (No daemon_options="-d0"
> does not help!)

I haven't tried it, but it's on the agenda.

>
> Not funny is also the fact that the official SLES 11 SP2 kernels crash
> seriously (when a node rejoins the cluster) when using SCTP as recommended in
> the SLES HA documentation and offered via the wizards. It took me a while to
> find out what was going on.

No, we did not have these bugs, but we had a crashing crmd, and a two-node cluster that could not agree on who is DC for several minutes.

>
> When setting up a system with many (rather simple) resources funny things
> happen due to race conditions all over the place. (can be worked around
> mostly using arbitrary start-delay options.
>
> Oh, did I mention that situations which are actually forbidden by
> constraints (e.g. using a score of INFINITY) actually do happen... Depending
> on the environment this can lead to not so funny effects.
>
> E.g. I defined the following constraints:
>
> colocation c17 inf: p_lsb_ccslogserver p_fs_daten
> order o34 inf: p_fs_daten p_lsb_ccslogserver:start
>
> I can prove from the logs that ccslogserver (an application) got migrated
> from node A to node B while p_fs_daten (a filesystem on top of drbd) was
> definitely still running on node A

I'm absolutely no expert on that, but I think your constraints will allow p_fs_daten to be active on one node while p_lsb_ccslogserver is going down (being migrated). p_fs_daten only has to be up before p_lsb_ccslogserver is started. Probably the colocation is ignored then.

I'm also unsure whether transitive ordering and colocation works.

What also disappointed me: When adding stickiness to a primitive, a group gets more or less the sum of its primitives, but when you add a stickiness to a group, EVERY primitive gets that stickiness, and the group STILL gets the sum of all of these. This is especially bad because adding one more primitive to a group changes the total stickiness.
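The arithmetic behind that complaint can be sketched as follows. This only models the behaviour as described in the paragraph above (it is not derived from the Pacemaker source), and both function names are invented for the illustration.

```python
from typing import List


def group_stickiness_from_primitives(primitive_values: List[int]) -> int:
    """Stickiness set on individual primitives: the group gets
    (more or less) the sum of those values."""
    return sum(primitive_values)


def group_stickiness_from_group(group_value: int, n_primitives: int) -> int:
    """Stickiness set on the group: every primitive inherits the value and
    the group still gets the sum, so the total scales with group size."""
    inherited = [group_value] * n_primitives
    return sum(inherited)


# A group-level stickiness of 100 with three members, then with four.
print(group_stickiness_from_group(100, 3))
print(group_stickiness_from_group(100, 4))
```

Under this model the same group-level setting of 100 yields 300 for three primitives and 400 once a fourth is added, which is the shift in total stickiness complained about above.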

Likewise, if you use resource utilization on primitives in a group, the group begins to start on one node, then stalls when the next primitive's utilization cannot be fulfilled. That's especially bad when there are enough resources for the whole group on another node. (Here utilizations are not summed.)

Some concepts had been implemented very "ad hoc".

And one of the popular cluster books describes the XML configuration. It's like describing how to start the engine of your car: Open the hood, locate the battery and the starter motor. Then take a pair of wires, connect one end to the battery and the other end to the starter motor, watching for the right polarity. Then... (you get it)

The best tool around is the crm shell (IMHO), while the GUI has extraordinarily poor performance once your cluster has a reasonable number of resources.

There is an access control concept (ACLs) based on XPath. Unfortunately, it would require describing the data model of the CIB exactly to really implement proven access restrictions. It's a bit complicated...

>
> Reporting bugs is not possible without a direct support contract. (You must
> enter into a support contract with SuSE before you can even report a bug or
> provide a patch ....)

Yes: I found out that there is no mechanism to repair non-clustered MD-RAIDs, so I wrote a RAID monitor. Proposed that to support. Still didn't hear any feedback about it...

Regards,
Ulrich



michalko.system at a-i-p

Aug 16, 2012, 11:41 PM

Post #6 of 22 (652 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On Thursday, 16 August 2012 17:54:06, EXTERNAL Konold Martin
(erfrakon, RtP2/TEF72) wrote:
> Hi,
>
> > What I do is evaluation of SLES11 SP2 (we run SP1) now. So testing
> > anything that's not part of SP2 (plus Updates) is not planned right now.
> >
> > I also think when reporting problems here early might get you mentally
> > prepared when the problem is eventually reported via official support.
> >
> > Maybe also in times of google, other people may be interested to see what
> > other people found out.
> >
> From my experience with SLES11 SP2 (with all current updates) I conclude
> that actually nobody is seriously running SP2 without local bugfixes.
>

I am also testing SP2 - and yes, it's true: not yet ready for production ;-(


> E.g. Even the most simple examples from the official SuSE documentation
> don't work as expected.
>
> A trivial example is ocf:heartbeat:exportfs as distributed by SuSE with SP2
> causes unlimited growth of .rmtab files (goes fast in the gigabytes for
> serious NFS servers). I could work around this issue using some shell
> scripting.
>
> There are other issues which are more than annoying and actually make the
> SLES SP2 HA Extension unusable for production systems. E.g. clvmd cannot
> be made less verbose from the cluster configuration. (No
> daemon_options="-d0" does not help!)
>
> Not funny is also the fact that the official SLES 11 SP2 kernels crash
> seriously (when a node rejoins the cluster) when using SCTP as recommended
> in the SLES HA documentation and offered via the wizards. It took me a
> while to find out what was going on.
>
> When setting up a system with many (rather simple) resources funny things
> happen due to race conditions all over the place. (can be worked around
> mostly using arbitrary start-delay options.
>
> Oh, did I mention that situations which are actually forbidden by
> constraints (e.g. using a score of INFINITY) actually do happen...
> Depending on the environment this can lead to not so funny effects.
>
> E.g. I defined the following constraints:
>
> colocation c17 inf: p_lsb_ccslogserver p_fs_daten
> order o34 inf: p_fs_daten p_lsb_ccslogserver:start
>
> I can prove from the logs that ccslogserver (an application) got migrated
> from node A to node B while p_fs_daten (a filesystem on top of drbd) was
> definitely still running on node A
>
> Reporting bugs is not possible without a direct support contract. (You must
> enter into a support contract with SuSE before you can even report a bug
> or provide a patch ....)
>
> Regards
>
> Martin Konold
> (Who used to maintain SuSE Clusters since 2001)

Nikita Michalko


lmb at suse

Aug 17, 2012, 1:28 AM

Post #7 of 22 (652 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-08-16T17:54:06, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" <external.Martin.Konold [at] de> wrote:

Hi Martin,

> From my experience with SLES11 SP2 (with all current updates) I conclude that actually nobody is seriously running SP2 without local bugfixes.

That isn't quite true.

> E.g. Even the most simple examples from the official SuSE documentation don't work as expected.

Which ones?

> A trivial example is ocf:heartbeat:exportfs as distributed by SuSE with SP2 causes unlimited growth of .rmtab files (goes fast in the gigabytes for serious NFS servers). I could work around this issue using some shell scripting.

This is an annoying bug, yes. It's an upstream bug that's been fixed
now. I'll check if the maintenance update is already released.

> There are other issues which are more than annoying and actually make the SLES SP2 HA Extension unusable for production systems. E.g. clvmd cannot be made less verbose from the cluster configuration. (No daemon_options="-d0" does not help!)

It shouldn't log that much even at the regular loglevel. It's reasonably
quiet (except on fail-/switch-overs, of course). What do you find
excessive?

> Not funny is also the fact that the official SLES 11 SP2 kernels crash
> seriously (when a node rejoins the cluster) when using SCTP as
> recommended in the SLES HA documentation and offered via the wizards.
> It took me a while to find out what was going on.

We've not observed this. Have you reported a bug?

> When setting up a system with many (rather simple) resources funny things happen due to race conditions all over the place. (can be worked around mostly using arbitrary start-delay options.

I've not encountered this either. Sorry for asking this, but: did you
report a bug?

> Oh, did I mention that situations which are actually forbidden by constraints (e.g. using a score of INFINITY) actually do happen... Depending on the environment this can lead to not so funny effects.

That would be a serious bug in the policy engine (and not just limited
to SLE HA 11 SP2).

> E.g. I defined the following constraints:
>
> colocation c17 inf: p_lsb_ccslogserver p_fs_daten
> order o34 inf: p_fs_daten p_lsb_ccslogserver:start
>
> I can prove from the logs that ccslogserver (an application) got migrated from node A to node B while p_fs_daten (a filesystem on top of drbd) was definitely still running on node A

I'd be very, very interested in seeing these logs. The rules you
specified above should not allow for that, and I can't immediately
imagine other rules that still might allow for it.

> Reporting bugs is not possible without a direct support contract. (You must enter into a support contract with SuSE before you can even report a bug or provide a patch ....)

Strangely enough, enterprise distributions target paying customers. This
is not, I believe, a SUSE-specific constraint.

You can always file a bug against the upstream projects in the
respective communities; these will then possibly tell you to upgrade to
latest upstream first and reproduce. Eventually, these bugs will trickle
back into the enterprise distributions as well. That may just take a
while.

But yes, SLE HA (and RHEL clustering too) sort-of target customers who
have support contracts, either with SUSE/RHT or a strong consulting
partner (who preferably is a high-grade technology partner with the
distributor).

I admit I don't find this particular complaint convincing.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lmb at suse

Aug 17, 2012, 1:36 AM

Post #8 of 22 (646 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-08-17T08:19:45, Ulrich Windl <Ulrich.Windl [at] rz> wrote:

> Likewise if you use resource utilization on primitives in a group, the group begins to start on one node, then stalls when the next primitive's utilization cannot be fulfilled. That's bad especially when there are enough resources for the whole group on another node. (Here utilizations are not summed).

This was not the target use case for utilizations. They were targeted
for "I have a base storage stack and now don't want to place all VMs
manually"; e.g., just the top-level resource would have utilization
applied.

People are now applying it to other scenarios, but for those, the PE has
to be extended to cope first.

(A work-around is to manually sum up the utilization and set it on the
lowest resource in the group. Not optimal from a usability perspective,
but working.)
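A minimal sketch of the manual summing step in that work-around, assuming the per-primitive utilization values are already known. The resource names and the attribute names "memory" and "cpu" are purely illustrative, and the helper is not part of any cluster tool; its result would still have to be set on the chosen resource by hand.

```python
from collections import Counter
from typing import Dict


def summed_utilization(primitives: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Add up the utilization attributes of all primitives in a group
    (hypothetical helper for the manual work-around)."""
    total: Counter = Counter()
    for attrs in primitives.values():
        # Counter.update adds the values per attribute name.
        total.update(attrs)
    return dict(total)


# Hypothetical group: a filesystem and an application on top of it.
group = {
    "p_fs_example": {"memory": 512, "cpu": 1},
    "p_app_example": {"memory": 2048, "cpu": 2},
}
print(summed_utilization(group))
```

For the sample group this gives memory 2560 and cpu 3, which is the total the placement decision would need to see on a single resource.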

> Some concepts had been implemented very "ad hoc".

We like to phrase this as "sufficiently implemented to satisfy the
business need" ;-)

> And one of the popular clusterbooks describes the XML configuration.

Uhm. But that is hardly the fault of SLE HA ;-)

> The best tool around is the crm shell (IMHO), while the GUI has extraordinarily poor performance once your cluster has a reasonable number of resources.

True. The python UI is sort-of suckish for larger clusters. Which is why
we're providing the crm shell and hawk; the python UI is basically in
maintenance mode.

> There is an access control concept (ACLs) based on XPath. Unfortunately
> that would require to exactly describe the data model of the CIB to
> really implement proven access restrictions. It's a bit
> complicated...

The ACL model targets common use cases like "I want my operations staff
to see, but not modify" or "This person is allowed to see, but only
start/stop a single resource". These use cases are trivial to express in
the shell, for example.

It's not meant to provide formally validated and BSI/DoD certified
levels of security.

> Yes: I found out that there is no mechanism to repair non-clustered
> MD-RAIDs,

You mean those not managed by the Raid resource agent?

> so I wrote a RAID monitor. Proposed that to support. Still didn't hear
> any feedback about it...

Feature requests take a while. They're not usually considered bugs. But
I've actually seen this being discussed internally; ping your support
contact again for the current status.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lmb at suse

Aug 17, 2012, 1:36 AM

Post #9 of 22 (647 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-08-17T08:41:15, Nikita Michalko <michalko.system [at] a-i-p> wrote:

> I am also testing SP2 - and yes, it's true: not yet ready for production ;-(

What problems did you find?

Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



michalko.system at a-i-p

Aug 17, 2012, 2:43 AM

Post #10 of 22 (652 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

Hi Lars!

On Friday, 17 August 2012 10:36:37, Lars Marowsky-Bree wrote:
> On 2012-08-17T08:41:15, Nikita Michalko <michalko.system [at] a-i-p> wrote:
> > I am also testing SP2 - and yes, it's true: not yet ready for production
> > ;-(
>
> What problems did you find?
>

- e.g. the SLES 11 SP2 kernel crash, the same problem as described by Martin:
>> SP2 kernels crash seriously (when a node rejoins the cluster) when using
>> SCTP as recommended in the SLES HA documentation and offered via the wizards.

- and some specific problems with the ISP-RAID driver, but those have been solved in the
meantime by the reseller

Regards

Nikita

> Regards,
> Lars
>


dejanmm at fastmail

Aug 17, 2012, 3:48 AM

Post #11 of 22 (650 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On Fri, Aug 17, 2012 at 10:36:17AM +0200, Lars Marowsky-Bree wrote:
> On 2012-08-17T08:19:45, Ulrich Windl <Ulrich.Windl [at] rz> wrote:
> > There is an access control concept (ACLs) based on XPath. Unfortunately
> > that would require to exactly describe the data model of the CIB to
> > really implement proven access restrictions. It's a bit
> > complicated...
>
> The ACL model targets common use cases like "I want my operations staff
> to see, but not modify" or "This person is allowed to see, but only
> start/stop a single resource". These use cases are trivial to express in
> the shell, for example.

There are shortcuts for all common constructs. Unless one has
really complex security requirements, the set of ACL rules should
never need to make use of XPath. We spent quite some time
discussing this in order to make it as easy as possible for end
users.

Thanks,

Dejan


lmb at suse

Aug 17, 2012, 4:18 AM

Post #12 of 22 (645 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-08-17T11:43:13, Nikita Michalko <michalko.system [at] a-i-p> wrote:

> - e.g. the problem with SLES 11 SP2 kernels crash - the same as described by
> Martin:
> >> SP2 kernels crash seriously (when a node rejoins the cluster) when using
> >> SCTP as recommended in the SLES HA documentation and offered via the wizards.

Is this not fixed by the latest maintenance upgrades? I don't see an
open bug for something like this right now.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



external.Martin.Konold at de

Aug 17, 2012, 7:38 AM

Post #13 of 22 (673 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

Hi LMB,

> > - e.g. the problem with SLES 11 SP2 kernels crash - the same as
> > described by Martin:
> >> SP2 kernels crash seriously (when a node rejoins the cluster) when
> >> using SCTP as recommended in the SLES HA documentation and offered via the wizards.

> Is this not fixed by the latest maintenance upgrades?

To my knowledge the latest maintenance kernel is 3.0.34-0.7.9.

I validated that the following SuSE kernels show the crash.

vmlinux-3.0.26-0.7-default.gz
vmlinux-3.0.34-0.7-default.gz
vmlinux-3.0.31-0.9-default.gz
vmlinux-3.0.36-5-default.gz
vmlinux-3.0.36-10-default.gz

KERNEL: vmlinux-3.0.34-0.7-default.gz
DEBUGINFO: ./vmlinux-3.0.34-0.7-default.debug
DUMPFILE: vmcore
CPUS: 24
DATE: Mon Jul 2 09:36:33 2012
UPTIME: 2 days, 17:12:14
LOAD AVERAGE: 1.57, 1.42, 1.45
TASKS: 551
NODENAME: rt-lxcl9b
RELEASE: 3.0.34-0.7-default
VERSION: #1 SMP Tue Jun 19 09:56:30 UTC 2012 (fbfc70c)
MACHINE: x86_64 (2932 Mhz)
MEMORY: 48 GB
PANIC: "[234603.020857] Oops: 0000 [#1] SMP " (check log for details)
PID: 19580
COMMAND: "sh"
TASK: ffff880b6bc26140 [THREAD_INFO: ffff880bceb40000]
CPU: 7
STATE: TASK_RUNNING (PANIC)
PID: 19580 TASK: ffff880b6bc26140 CPU: 7 COMMAND: "sh"
#0 [ffff880bceb41b30] machine_kexec at ffffffff810265fe
#1 [ffff880bceb41b80] crash_kexec at ffffffff810a31fa
#2 [ffff880bceb41c50] oops_end at ffffffff81442b88
#3 [ffff880bceb41c70] __bad_area_nosemaphore at ffffffff810324e5
#4 [ffff880bceb41d30] do_page_fault at ffffffff814451cb
#5 [ffff880bceb41e30] page_fault at ffffffff81441d65
[exception RIP: sock_ioctl+40]
RIP: ffffffff81370258 RSP: ffff880bceb41ee8 RFLAGS: 00010296
RAX: 0000000000000000 RBX: 0000000000005401 RCX: 00007fff87485790
RDX: 00007fff87485790 RSI: 0000000000005401 RDI: ffff880b96a98a80
RBP: 00007fff87485790 R8: 0000000000000000 R9: 00007f9bd6e1e640
R10: 00007fff87485730 R11: ffffffff811e0a90 R12: 00007fff87485790
R13: 0000000000000000 R14: 0000000000005401 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#6 [ffff880bceb41f10] do_vfs_ioctl at ffffffff81160f5b
#7 [ffff880bceb41f40] sys_ioctl at ffffffff81161321
#8 [ffff880bceb41f80] system_call_fastpath at ffffffff81449392
RIP: 00007f9bd6725677 RSP: 00007fff874857c8 RFLAGS: 00010202
RAX: 0000000000000010 RBX: ffffffff81449392 RCX: ffffffffffffffa8
RDX: 00007fff87485790 RSI: 0000000000005401 RDI: 0000000000000000
RBP: 0000000000000006 R8: 00007fff874858f0 R9: 00007f9bd6e1e640
R10: 00007fff87485730 R11: 0000000000000202 R12: 00007f9bd6fef700
R13: ffffffffffffffa8 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b

> I don't see an open bug for something like this right now.

Are you serious?

It was you who resolved this bug as INVALID in bugzilla https://bugzilla.novell.com/show_bug.cgi?id=769292.

Best regards

Martin Konold


Ulrich.Windl at rz

Aug 17, 2012, 7:42 AM

Post #14 of 22 (643 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

>>> Lars Marowsky-Bree <lmb [at] suse> wrote on 17.08.2012 at 13:18 in message
<20120817111811.GR3944 [at] suse>:
> On 2012-08-17T11:43:13, Nikita Michalko <michalko.system [at] a-i-p> wrote:
>
> > - e.g. the problem with SLES 11 SP2 kernels crash - the same as described by
> > Martin:
> > >> SP2 kernels crash seriously (when a node rejoins the cluster) when using
> > >> SCTP as recommended in the SLES HA documentation and offered via the wizards.
>
> Is this not fixed by the latest maintenance upgrades? I don't see an
> open bug for something like this right now.

Hi Lars,

obviously not, because I have the latest updates installed. It happens frequently enough to care about it:

# zgrep sscan /var/log/messages-201208*.bz2 |wc -l
76
Here are some:
/var/log/messages-20120816.bz2:Aug 16 13:55:21 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-19194) in sscanf result (3) for 0:0:crm-resource-19194
/var/log/messages-20120817.bz2:Aug 16 15:35:54 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-7594) in sscanf result (3) for 0:0:crm-resource-7594
/var/log/messages-20120817.bz2:Aug 16 15:35:54 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-7594) in sscanf result (3) for 0:0:crm-resource-7594
/var/log/messages-20120817.bz2:Aug 16 15:36:02 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-7797) in sscanf result (3) for 0:0:crm-resource-7797
/var/log/messages-20120817.bz2:Aug 16 15:36:02 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-7797) in sscanf result (3) for 0:0:crm-resource-7797
/var/log/messages-20120817.bz2:Aug 16 15:36:15 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-8026) in sscanf result (3) for 0:0:crm-resource-8026
/var/log/messages-20120817.bz2:Aug 16 15:41:21 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-12126) in sscanf result (3) for 0:0:crm-resource-12126
/var/log/messages-20120817.bz2:Aug 16 15:41:21 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-12126) in sscanf result (3) for 0:0:crm-resource-12126
/var/log/messages-20120817.bz2:Aug 16 15:42:27 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-13769) in sscanf result (3) for 0:0:crm-resource-13769
/var/log/messages-20120817.bz2:Aug 16 15:42:27 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-13769) in sscanf result (3) for 0:0:crm-resource-13769
/var/log/messages-20120817.bz2:Aug 16 15:46:21 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-18378) in sscanf result (3) for 0:0:crm-resource-18378
/var/log/messages-20120817.bz2:Aug 16 15:46:23 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-18486) in sscanf result (3) for 0:0:crm-resource-18486
/var/log/messages-20120817.bz2:Aug 16 15:46:23 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-18486) in sscanf result (3) for 0:0:crm-resource-18486
/var/log/messages-20120817.bz2:Aug 16 15:46:45 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-18740) in sscanf result (3) for 0:0:crm-resource-18740
/var/log/messages-20120817.bz2:Aug 16 15:46:45 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-18740) in sscanf result (3) for 0:0:crm-resource-18740
/var/log/messages-20120817.bz2:Aug 16 15:50:45 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-21352) in sscanf result (3) for 0:0:crm-resource-21352
/var/log/messages-20120817.bz2:Aug 16 15:50:45 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-21352) in sscanf result (3) for 0:0:crm-resource-21352
/var/log/messages-20120817.bz2:Aug 16 15:50:47 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-21399) in sscanf result (3) for 0:0:crm-resource-21399
/var/log/messages-20120817.bz2:Aug 16 15:50:47 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-21399) in sscanf result (3) for 0:0:crm-resource-21399
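
Note that a plain line count also counts repeats, and several of the entries above appear twice. A minimal sketch of a tally that reports both the total and the number of distinct keys; the regular expression is an assumption derived only from the log lines quoted here, and the function name is illustrative, not part of Pacemaker:

```python
import re
from collections import Counter

# Pattern assumed from the warnings quoted in this thread; the last group
# captures the transition key, e.g. "0:0:crm-resource-19194".
KEY_RE = re.compile(
    r"decode_transition_key: Bad UUID \((\S+)\) in sscanf result \((\d+)\) for (\S+)"
)

def tally_bad_uuid(lines):
    """Return a Counter mapping each reported key to its number of warnings."""
    counts = Counter()
    for line in lines:
        match = KEY_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts

# Three lines copied from the excerpt above (file name prefix stripped).
sample = [
    "Aug 16 13:55:21 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-19194) in sscanf result (3) for 0:0:crm-resource-19194",
    "Aug 16 15:35:54 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-7594) in sscanf result (3) for 0:0:crm-resource-7594",
    "Aug 16 15:35:54 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-7594) in sscanf result (3) for 0:0:crm-resource-7594",
]

counts = tally_bad_uuid(sample)
print(sum(counts.values()), "warnings,", len(counts), "distinct keys")
```

Fed with the decompressed log files instead of the sample list, the same function would show whether the 76 lines come from a handful of crm_resource invocations or from many.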

Regards,
Ulrich



external.Martin.Konold at de

Aug 17, 2012, 9:14 AM

Post #15 of 22 (645 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

Hi LMB,


> > From my experience with SLES11 SP2 (with all current updates) I conclude that actually nobody is seriously running SP2 without local bugfixes.

> That isn't quite true.

No, this is true.
I provided an example which can easily be reproduced with stock SLES 11 SP2 and stock documentation.
So far, on the other hand, you have not provided any case where SLES11 SP2 runs reliably unmodified in a mission-critical environment (e.g. a HA NFS server) without local bugfixes.

> > E.g. Even the most simple examples from the official SuSE documentation don't work as expected.

> Which ones?

Are you actually reading the messages on this list before replying? I provided an example just one line below.

> > A trivial example: ocf:heartbeat:exportfs as distributed by SuSE with SP2 causes unlimited growth of .rmtab files (quickly reaching gigabytes on serious NFS servers). I could work around this issue using some shell scripting.

This is exactly the simple example of a resource not working on a fully updated SLES 11 SP2 HA Cluster.
SuSE provides an official guide on how to set up a highly available NFS cluster. When following this guide, this rather simple use case has not worked for many months.

> This is an annoying bug, yes. It's an upstream bug that's been fixed now. I'll check if the maintenance update is already released.

This bug has gone unfixed in SLES 11 SP2 for many months. The fact that you are aware of it but don't make a maintenance release for obvious bugs that are triggered in the default use cases says something.

http://www.suse.com/support/kb/doc.php?id=7008514

This is not merely an annoying bug; it renders the cluster unusable after a few days of use. This is not acceptable and is dangerous in production environments. (HA clusters tend to be used in more critical environments.)

The fact that you don't take care that fixes are available in a timely manner, even though you claim that the issue was fixed upstream, shows that SuSE is not committed to supporting mission-critical setups.

Funnily enough, running single servers today is more reliable and provides better uptime and service availability than using the SLES HA Extension.

> > There are other issues which are more than annoying and actually make
> > the SLES SP2 HA Extension unusable for production systems. E.g. clvmd
> > cannot be made less verbose from the cluster configuration. (No,
> > daemon_options="-d0" does not help!)

> It shouldn't log that much even at the regular loglevel. It's reasonably quiet (except on fail-/switch-overs, of course). What do you find excessive?

Are you yourself actually running a single instance of a SLES 11 SP2 cluster in production?

Have you ever checked the log files on any up-to-date SLES11 SP2 HA cluster yourself?

rt-lxcl9b:/var/log # cat /var/log/messages | wc -l
74838
rt-lxcl9b:/var/log # cat /var/log/messages | grep clvmd | wc -l
74028
rt-lxcl9b:/var/log # ps uax | grep clvmd
root 3227 0.0 0.0 149100 46404 ? SLsl Aug10 0:24 /usr/sbin/clvmd -d0

These are about 74000 (*) messages from clvmd in about 40h.

(No failovers, switches or anything else which are expected to cause logging)
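
To put those counts in proportion, a small worked calculation; the 40 hours are the approximate figure given above, so the hourly rate is only a rough estimate:

```python
def clvmd_share(total_lines, clvmd_lines, hours):
    """Return (fraction of log lines from clvmd, clvmd lines per hour)."""
    return clvmd_lines / total_lines, clvmd_lines / hours

# Line counts from the wc -l output above; hours is the approximate log span.
share, per_hour = clvmd_share(total_lines=74838, clvmd_lines=74028, hours=40)
print(f"{share:.1%} of all lines, about {per_hour:.0f} lines per hour")
```

In other words, nearly everything in /var/log/messages on this node is clvmd output, which makes other messages hard to spot.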

> > Not funny is also the fact that the official SLES 11 SP2 kernels crash
> > seriously (when a node rejoins the cluster) when using SCTP as
> > recommended in the SLES HA documentation and offered via the wizards.
> > It took me a while to find out what was going on.

> We've not observed this. Have you reported a bug?

Argl.... Yes I reported a bug. Yes I reported how to reproduce. Yes I provided a full description and offered a kernel dump...

> > When setting up a system with many (rather simple) resources, funny things happen due to race conditions all over the place. (This can mostly be worked around using arbitrary start-delay options.)

> I've not encountered this either. Sorry for asking this, but: did you report a bug?

Are you trying to make me angry?

> > Oh, did I mention that situations which are actually forbidden by constraints (e.g. using a score of INFINITY) actually do happen... Depending on the environment this can lead to not so funny effects.

> That would be a serious bug in the policy engine (and not just limited to SLE HA 11 SP2).

Which does not really improve the situation for your customers.

> > E.g. I defined the following constraints:
> >
> > colocation c17 inf: p_lsb_ccslogserver p_fs_daten
> > order o34 inf: p_fs_daten p_lsb_ccslogserver:start
> >
> > I can prove from the logs that ccslogserver (an application) got
> > migrated from node A to node B while p_fs_daten (a filesystem on top
> > of drbd) was definitely still running on node A

> I'd be very, very interested in seeing these logs. The rules you specified above should not allow for that, and I can't immediately imagine other rules that still might allow for it.
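
One way to make such a claim checkable is to reduce the logs to a time-ordered list of placement events and flag every point at which p_lsb_ccslogserver is active on a node other than the one running p_fs_daten. A minimal sketch under stated assumptions: the (timestamp, resource, node) event format and the sample data are hypothetical, not extracted from the actual logs, and this is not how the policy engine itself evaluates constraints:

```python
def colocation_violations(events, dependent, anchor):
    """Replay placement events and report when `dependent` runs on a
    different node than `anchor` (or while `anchor` is stopped).

    events: iterable of (timestamp, resource, node) tuples in time order;
            node is None when the resource is stopped.
    """
    placement = {}
    violations = []
    for timestamp, resource, node in events:
        placement[resource] = node
        dep_node = placement.get(dependent)
        if dep_node is not None and dep_node != placement.get(anchor):
            violations.append((timestamp, dep_node, placement.get(anchor)))
    return violations

# Hypothetical event sequence illustrating the reported symptom.
events = [
    ("t1", "p_fs_daten", "nodeA"),
    ("t2", "p_lsb_ccslogserver", "nodeA"),
    ("t3", "p_lsb_ccslogserver", "nodeB"),  # moved while the filesystem stays on nodeA
]

violations = colocation_violations(events, "p_lsb_ccslogserver", "p_fs_daten")
print(violations)
```

Each reported tuple gives the time, the node of the dependent resource, and the node of the resource it is supposed to be colocated with.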

> Strangely enough, enterprise distributions target paying customers. This is not, I believe, a SUSE-specific constraint.

It is a SuSE-specific constraint. Even with Microsoft I can report a bug without first buying an additional support contract beyond the existing license and the existing maintenance contract.

(In my case I do consulting work for a customer who wishes to evaluate whether migrating to SLES 11 SP2 is an option for mission-critical workloads.
I waited until SP2 was released before even starting the evaluation, only to find out that SP2 fails in the simple test cases with configurations copied verbatim from the SLES HA documentation.
This customer buys SLES/RH/Windows licenses and support in bulk from a large multinational. It is not feasible to additionally buy an extra support contract directly from SuSE just to be able to _report_ a bug or to provide a patch.)

> You can always file a bug against the upstream projects in the respective communities; these will then possibly tell you to upgrade to latest upstream first and reproduce.

Yes, I could do that.

Actually, I could provide a fully working and tested customized solution on top of openSUSE or Fedora, based on up-to-date upstream packages and local fixes, but I was asked to determine whether SLES11 SP2 provides a suitable HA solution for mission-critical use cases.

Currently the result of this evaluation is that SLES 11 SP2 fails out-of-the-box for even the simple example cases.

> Eventually, these bugs will trickle back into the enterprise distributions as well. That may just take a while.

Yes, and I am observing that SuSE is currently not able to provide upstream fixes in a timely manner even for simple

> But yes, SLE HA (and RHEL clustering too) sort-of target customers who have support contracts, either with SUSE/RHT or a strong consulting partner (who preferably is a high-grade technology partner with the distributor).

In this case my customer has a "high-grade technology partner" which of course has proper contracts with SuSE but my job is to

> I admit I don't find this particular complaint convincing.

I admit that I am unable to convince you that a fully up-to-date SLES11 SP2 does not work reliably even for the most simple use case with a setup copied verbatim from the SLES 11 SP2 HA documentation.

Knowing that you have been in the SuSE HA business for more than 12 years now, this makes me doubt whether SLES is still an option for mission-critical systems (**).

BTW: I was assuming that it was part of your job description to make sure that critical upstream/community fixes get integrated into the SLES 11 HA Extension in a timely manner. I guess that I am wrong.

Yours,
-- martin
(*) The log fills up with the same rather useless debug output every 30 seconds:
Aug 17 16:52:16 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 33859776 for 17082560. len 18
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 17082560 for 0. len 32
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: add_to_lvmqueue: cmd=0x7fc7f80008b0. client=0x6934a0, msg=0x7fc7fef76ffc, len=32, csid=0x7ffff6a020e4, xid=0
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: process_work_item: remote
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: process_remote_command unknown (0x2d) for clientid 0x5000000 XID 12916 on node 104a8c0
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: Syncing device names
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: LVM thread waiting for work
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 33859776 for 17082560. len 18
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 17082560 for 0. len 32
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: add_to_lvmqueue: cmd=0x7fc7f80008b0. client=0x6934a0, msg=0x7fc7fef771ac, len=32, csid=0x7ffff6a020e4, xid=0
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: process_work_item: remote
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: process_remote_command unknown (0x2d) for clientid 0x5000000 XID 12919 on node 104a8c0
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: Syncing device names
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: LVM thread waiting for work
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 33859776 for 17082560. len 18
(**) My main point is not about bugs, which are normal in complex systems. My concern is about how you/SuSE handle these bugs.


external.Martin.Konold at de

Aug 17, 2012, 9:35 AM

Post #16 of 22 (637 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

Hi,

Martin Konold


> # zgrep sscan /var/log/messages-201208*.bz2 |wc -l
> 76

May I add my statistics:

zgrep sscan messages*bz2 | wc -l
508

Yours,
-- martin


lmb at suse

Aug 20, 2012, 1:59 AM

Post #17 of 22 (598 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-08-17T16:38:01, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" <external.Martin.Konold [at] de> wrote:

> > I don't see an open bug for something like this right now.
> Are you serious?
>
> It was you who resolved this bug as INVALID in bugzilla https://bugzilla.novell.com/show_bug.cgi?id=769292.

Uhm, yes, I was serious (I only checked the open ones). And I apologize
- but I go through so many bugs every day that I don't remember the invalid
ones.

And yes: if you want support for SLE, you need a support contract.
That's the business model behind them (and RHEL, too).

You can, of course, run the community-supported versions too. That would
usually mean latest upstream releases.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lmb at suse

Aug 20, 2012, 2:00 AM

Post #18 of 22 (603 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-08-17T16:42:42, Ulrich Windl <Ulrich.Windl [at] rz> wrote:

> obviously not, because I have the latest updates installed. It happens frequently enough to care about it:
>
> # zgrep sscan /var/log/messages-201208*.bz2 |wc -l
> 76
> Here are some:
> /var/log/messages-20120816.bz2:Aug 16 13:55:21 so3 crmd: [27050]: WARN: decode_transition_key: Bad UUID (crm-resource-19194) in sscanf result (3) for 0:0:crm-resource-19194

Strange. Not seen this before. As soon as the bug report turns up we'll
fix it, but it is very likely a harmless (if annoying) thing.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lmb at suse

Aug 20, 2012, 2:31 AM

Post #19 of 22 (603 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-08-17T18:14:18, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" <external.Martin.Konold [at] de> wrote:

> So far, on the other hand, you have not provided any case where SLES11 SP2 runs reliably unmodified in a mission-critical environment (e.g. a HA NFS server) without local bugfixes.

Okay, so there's a bug in the NFS agent, point taken. I'll investigate
why it took so long to release as a real maintenance update; you're
right, that shouldn't happen. (I can already see it in the update queue
though.)

> This is exactly the simple example of a resource not working on a fully updated SLES 11 SP2 HA Cluster.

Yes, conceded, but that doesn't mean that other scenarios - HA virtual
machines, OCFS2, ... - aren't working.

We didn't observe the rmtab growing that large here, and yes, it slipped
through.

> This bug has gone unfixed in SLES 11 SP2 for many months. The fact that you are aware of it but don't make a maintenance release for obvious bugs that are triggered in the default use cases says something.
>
> http://www.suse.com/support/kb/doc.php?id=7008514

A PTF is the first step once a problem has been reported by a support
customer (and is thus immediately available to other customers reporting
the same issue); it then is aggregated with other fixes, handed to QA,
and eventually released as a generic maintenance update. The last step
seems to have taken inappropriately long here, I'll prod the machinery
and figure out why.

> The fact that you don't take care that fixes are available in a timely manner, even though you claim that the issue was fixed upstream, shows that SuSE is not committed to supporting mission-critical setups.

We prioritize issues depending on how urgently customers report them.
I'd prefer to release them much more frequently and in smaller
increments, but then our customers complain about the update frequency.
It's a question of balance (that obviously hasn't worked out well
here).

> Are you yourself actually running a single instance of a SLES 11 SP2 cluster in production?

Yes. We've got multiple clusters running in production and, of course,
on development clusters.

> rt-lxcl9b:/var/log # ps uax | grep clvmd
> root 3227 0.0 0.0 149100 46404 ? SLsl Aug10 0:24 /usr/sbin/clvmd -d0
>
> These are about 74000 (*) messages from clvmd in about 40h.

Woah. And no, I don't see this here. Sorry. I'll investigate further.
Can you provide a log excerpt please? (Never mind, I see that you did
that below.)

> (In my case I do consulting work for a customer who wishes to evaluate whether migrating to SLES 11 SP2 is an option for mission-critical workloads.
> I waited until SP2 was released before even starting the evaluation, only to find out that SP2 fails in the simple test cases with configurations copied verbatim from the SLES HA documentation.
> This customer buys SLES/RH/Windows licenses and support in bulk from a large multinational. It is not feasible to additionally buy an extra support contract directly from SuSE just to be able to _report_ a bug or to provide a patch.)

In such cases, a sales engineer would be able to help with bugs during
the evaluation phase, and make sure that for the evaluation/PoC you
already get the same priority as you'd later.

But yes, I'm afraid that our policies don't account for bugs against SLE
being reported directly, without either involving sales or having an
active customer/partner support contract.

> In this case my customer has a "high-grade technology partner" which of course has proper contracts with SuSE but my job is to

This sentence appears cut-off? In any case, if the customer *has* such a
partner, reporting such issues via those channels would be preferable.

The PE constraint issue is one I find worrying too. I'd like to see a PE
input for that; thankfully, the PE is designed to be debuggable.

> I admit that I am unable to convince you that a fully up-to-date SLES11 SP2 does not work reliably even for the most simple use case with a setup copied verbatim from the SLES 11 SP2 HA documentation.

Bugs happen, even in documented cases. We've not observed that during
our testing (we actually had an external partner validate the NFS server
use case too).

> BTW: I was assuming that it was part of your job description to make sure that critical upstream/community fixes get integrated into the SLES 11 HA Extension in a timely manner. I guess that I am wrong.

Thanks for the personal attack, it is appreciated ;-)

It is. I'll figure out why and where it got stuck; but the fact remains
that we've not had other support customers report this yet; and if they
had, they'd have been provided the PTF (as well as increasing the
business priority on the workflow in the maintenance queue).

You've worked for a Linux distributor in the past - you know how the
business model works.

> (*) The log fills up with the same rather useless debug output every 30 seconds:

For some reason, debug mode isn't being disabled in your environment.
Looking at the code, I can't immediately see why not, but I'll check it
in my environment too.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lmb at suse

Aug 29, 2012, 6:20 AM

Post #20 of 22 (550 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-08-20T11:31:07, Lars Marowsky-Bree <lmb [at] suse> wrote:

> Okay, so there's a bug in the NFS agent, point taken. I'll investigate
> why it took so long to release as a real maintenance update; you're
> right, that shouldn't happen. (I can already see it in the update queue
> though.)

For completeness, the update was released yesterday for SP1 and today
for SP2.

What took so long was that it got aggregated with other fixes in the
maintenance code stream - a usual process that reduces overhead costs
and the frequency of updates (which customers also want). I'd agree that
in this case the balance was somewhat off (and affected by vacation
times), and I'll monitor the situation so that it doesn't happen
again.

Customers with an active support contract though were provided with PTF
(problem temporary fix) packages on demand, which are fully supported as
well.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



external.Martin.Konold at de

Sep 4, 2012, 1:50 AM

Post #21 of 22 (523 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

> > > I don't see an open bug for something like this right now.
> > Are you serious?
> >
> > It was you who resolved this bug as INVALID in bugzilla https://bugzilla.novell.com/show_bug.cgi?id=769292.

> And yes: if you want support for SLE, you need a support contract.
> That's the business model behind them (and RHEL, too).

Nonsense!

I was not asking for a free lunch. Please don't try to make me look like a fool.

I was reporting a serious bug in _your_ product and instead of thanking me for the bug report you simply closed it as invalid and later claimed that the bug was never reported.

Yours,
-- martin



lmb at suse

Sep 4, 2012, 2:08 AM

Post #22 of 22 (528 views)
Re: Antw: Re: crmd: [31942]: WARN: decode_transition_key: Bad UUID (crm-resource-25438) in sscanf result (3) for 0:0:crm-resource-25438 [In reply to]

On 2012-09-04T10:50:11, "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" <external.Martin.Konold [at] de> wrote:

> I was reporting a serious bug in _your_ product and instead of
> thanking for the bugreport you simply closed it as invalid

The bug was reported without a support contract. A support contract is
usually the prerequisite for us to investigate and discuss with
a customer in more detail. Triaging, analysis, and investigation take
time, and time costs money.

Sorry to be blunt, but: figuring out if you hit an actual bug in the
code or had a configuration issue would have taken time away from
customers who have paid for our time and resources.

We pick up fixes that came in via upstream. We work with the community
on upstream versions. (At lower priority than support contracts,
obviously.) *But* our business model involves taking money (or at least
a prospective business case, an evaluation which is in the hands of
sales) from those who want support on SLE.

> and later claimed that the bug was never reported.

I already apologized for that. I checked only the SLE HA bugs, not those
reported against other products and projects, and I process so many bugs
that I can't remember every single "invalid" report we got.

Martin, you are complaining that we did not investigate a possible bug
in SLE HA that you reported against openSUSE, because there was no
support contract, or at least a pre-sales engineer involved. If you had
asked sales about a competitive situation, I'm sure they'd have worked
with you. And you'd have received not just a "is the product bug free in
my environment" result (in my experience, no software ever is, except
for TeX - the question only is if you've already found the bug or not),
but the support capabilities which, in my opinion, are the real selling
points of the Enterprise distributions.

I'm sorry that it didn't work out as you liked, and I am sorry that you
hit a bug in your environment in the first place.

And I don't think this is an appropriate discussion for this mailing
list. I apologize to the other subscribers.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

