
Mailing List Archive: Linux-HA: Dev

sfex


dejanmm at fastmail

Jun 16, 2008, 12:58 PM

Post #1 of 18 (871 views)
sfex

Hello,

Last year NTT designed and implemented sfex, a suite of
programs to improve shared disk usage (see linux-ha.org/sfex),
which unfortunately didn't attract the attention it deserves. I
reviewed the code; attached you'll find some comments and some
simple changes. One general remark: all programs (sfex_*) are
monolithic and, though they are not that big, it would be
beneficial to code readers if they were split into more
units/functions.

A couple of suggestions on making sfex useful in other contexts
were making a quorum plugin and a HBcomm plugin. Did you
investigate these options further?

Of course, if you agree, we could include sfex into the heartbeat
repository.

Cheers,

Dejan
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev[at]lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


kskmori at intellilink

Jun 17, 2008, 1:33 AM

Post #2 of 18 (833 views)
Re: sfex [In reply to]

Dejan,

Thank you for taking care of it.

Yes, NTT is very glad and agrees to include sfex into the
heartbeat repository!

Dejan Muhamedagic <dejanmm[at]fastmail.fm> writes:

> Hello,
>
> Since last year NTT designed and implemented sfex, a suite of
> programs to improve shared disk usage (see linux-ha.org/sfex)
> which unfortunately didn't attract attention it deserves. I
> reviewed the code and attached you'll find some comments and some
> simple changes. One general remark: all programs (sfex_*) are
> monolithic and, though they are not that big, it would be
> beneficial to code readers if they were split into more
> units/functions.

That sounds reasonable.
Where can I find your comments and modifications?


>
> A couple of suggestions on making sfex useful in other contexts
> were making a quorum plugin and a HBcomm plugin. Did you
> investigate further these options?


Yes, we did, but we think that
those would be totally different approaches from sfex.


- a quorum plugin

A quorum plugin is executed only on 'the cluster leader node' in CCM,
and it does not care where the resource is running,
whereas sfex should run on the same node as the resource
in question because it protects
the data which resides in the resource.

In other words, sfex controls at resource granularity,
whereas a quorum plugin controls at 'the partition' granularity.


- HBcomm plugin

I remember that somebody posted this before, called 'dskcm'.
This is also an interesting idea but the approach is very different.

That approach is:
- having yet another redundant communication path through
the shared medium.
whereas sfex's approach is:
- providing a protection method when ALL of the communication
paths have failed.

Even though they have a similar goal, the functionality is
very different.




Thanks,

--
Keisuke MORI
NTT DATA Intellilink Corporation



dejanmm at fastmail

Jun 17, 2008, 4:07 AM

Post #3 of 18 (836 views)
Re: sfex [In reply to]

Hi Keisuke-san,

On Tue, Jun 17, 2008 at 05:33:52PM +0900, Keisuke MORI wrote:
> Dejan,
>
> Thank you for taking care of it.
>
> Yes, NTT is very glad and agrees to include sfex into the
> heartbeat repository!
>
> Dejan Muhamedagic <dejanmm[at]fastmail.fm> writes:
>
> > Hello,
> >
> > Since last year NTT designed and implemented sfex, a suite of
> > programs to improve shared disk usage (see linux-ha.org/sfex)
> > which unfortunately didn't attract attention it deserves. I
> > reviewed the code and attached you'll find some comments and some
> > simple changes. One general remark: all programs (sfex_*) are
> > monolithic and, though they are not that big, it would be
> > beneficial to code readers if they were split into more
> > units/functions.
>
> That sounds reasonable.
> Where can I find your comments and modifications?

A reasonable question :) Forgot to attach the file with
comments. Sorry about that. It is in the form of a patch against
version 1.3.

> > A couple of suggestions on making sfex useful in other contexts
> > were making a quorum plugin and a HBcomm plugin. Did you
> > investigate further these options?
>
>
> Yes we did but we think that
> those would be totally different approach from sfex.
>
>
> - a quorum plugin
>
> A quorum plugin is executed only on 'the cluster leader node' in CCM,

I don't think so. CCM delivers connectivity and quorum
information on each node. However, that's probably not relevant.

> and it does not care where the resource is running on,
> whereas sfex should run on the same node which the resource
> in question is running on because it's for the protection of
> the data which resides in the resource.
>
> In other words, sfex is to control with resource granularity,
> whereas a quorum plugin is to control 'the partition' granularity.

Right. The point was however to use parts of sfex for the quorum
functionality. I'll see if I can get back to you with a more
detailed and specific proposal.

> - HBcomm plugin
>
> I remember that somebody posted this before, called 'dskcm'.

Somehow missed that one.

> This is also interesting idea but the approach is very different.
>
> This approach is:
> - having yet another redundant communication path through
> the shared medium.
> whereas sfex's approach is:
> - provide a protection method when ALL of the communication
> paths are failed.
>
> Even though they have the similar goal the functionality is
> very different.

Yes. Though again sfex would need to be twisted a bit to provide
heartbeats over shared storage. I'll take a look at dskcm.

Cheers,

Dejan
Attachments: sfex-comments.gz (2.67 KB)


beekhof at gmail

Jun 17, 2008, 6:46 AM

Post #4 of 18 (832 views)
Re: sfex [In reply to]

On Tue, Jun 17, 2008 at 10:33, Keisuke MORI <kskmori[at]intellilink.co.jp> wrote:
> Dejan,
>
> Thank you for taking care of it.
>
> Yes, NTT is very glad and agrees to include sfex into the
> heartbeat repository!

I haven't seen the code in a while - but does it require any crm libraries?
That's probably the biggest factor in where it needs to live.

Other than that, I think the rationale for why it is implemented that
way makes sense and we should include it as-is.


dejanmm at fastmail

Jun 17, 2008, 9:58 AM

Post #5 of 18 (834 views)
Re: sfex [In reply to]

Hi,

On Tue, Jun 17, 2008 at 03:46:52PM +0200, Andrew Beekhof wrote:
> On Tue, Jun 17, 2008 at 10:33, Keisuke MORI <kskmori[at]intellilink.co.jp> wrote:
> > Dejan,
> >
> > Thank you for taking care of it.
> >
> > Yes, NTT is very glad and agrees to include sfex into the
> > heartbeat repository!
>
> I haven't seen the code in a while - but does it require any crm libraries?

No.

> Thats probably the biggest factor in where it needs to live.
>
> Other than that, I think the rationale for why it is implemented that
> way makes sense and we should include it as-is.

It's going to be included as an RA. I just wanted to investigate
other possibilities.

Cheers,

Dejan



beekhof at gmail

Jun 17, 2008, 12:46 PM

Post #6 of 18 (827 views)
Re: sfex [In reply to]

On Tue, Jun 17, 2008 at 18:58, Dejan Muhamedagic <dejanmm[at]fastmail.fm> wrote:
> Hi,
>
> On Tue, Jun 17, 2008 at 03:46:52PM +0200, Andrew Beekhof wrote:
>> On Tue, Jun 17, 2008 at 10:33, Keisuke MORI <kskmori[at]intellilink.co.jp> wrote:
>> > Dejan,
>> >
>> > Thank you for taking care of it.
>> >
>> > Yes, NTT is very glad and agrees to include sfex into the
>> > heartbeat repository!
>>
>> I haven't seen the code in a while - but does it require any crm libraries?
>
> No.

ok, then it sounds like heartbeat is the right place for it

>> Thats probably the biggest factor in where it needs to live.
>>
>> Other than that, I think the rationale for why it is implemented that
>> way makes sense and we should include it as-is.
>
> It's going to be included as an RA. I just wanted to investigate
> other possibilities.

a daemon too though right?


hxinwei at gmail

Jun 18, 2008, 1:38 AM

Post #7 of 18 (815 views)
Re: sfex [In reply to]

I'm the one who opposed sfex in the previous discussion.

My point was simply that:
"""
check-and-reserve on disk is not an atomic CAS operation, and a lock
based on that may silently cause data corruption.
"""

I haven't followed the evolution of sfex though, so things might have
changed.

Just FYI.



kskmori at intellilink

Jun 18, 2008, 4:53 AM

Post #8 of 18 (812 views)
Re: sfex [In reply to]

Hi

Dejan Muhamedagic <dejanmm[at]fastmail.fm> writes:

> Hi Keisuke-san,
>
> On Tue, Jun 17, 2008 at 05:33:52PM +0900, Keisuke MORI wrote:
>> Dejan,
>>
>> Thank you for taking care of it.
>>
>> Yes, NTT is very glad and agrees to include sfex into the
>> heartbeat repository!
>>
>> Dejan Muhamedagic <dejanmm[at]fastmail.fm> writes:
>>
>> > Hello,
>> >
>> > Since last year NTT designed and implemented sfex, a suite of
>> > programs to improve shared disk usage (see linux-ha.org/sfex)
>> > which unfortunately didn't attract attention it deserves. I
>> > reviewed the code and attached you'll find some comments and some
>> > simple changes. One general remark: all programs (sfex_*) are
>> > monolithic and, though they are not that big, it would be
>> > beneficial to code readers if they were split into more
>> > units/functions.
>>
>> That sounds reasonable.
>> Where can I find your comments and modifications?
>
> A reasonable question :) Forgot to attach the file with
> comments. Sorry about that. It is in the form of a patch against
> version 1.3.


Thanks, I will look into it.


>
>> > A couple of suggestions on making sfex useful in other contexts
>> > were making a quorum plugin and a HBcomm plugin. Did you
>> > investigate further these options?
>>
>>
>> Yes we did but we think that
>> those would be totally different approach from sfex.
>>
>>
>> - a quorum plugin
>>
>> A quorum plugin is executed only on 'the cluster leader node' in CCM,
>
> I don't think so. CCM delivers connectivity and quorum
> information on each node. However, that's probably not relevant.
>
>> and it does not care where the resource is running on,
>> whereas sfex should run on the same node which the resource
>> in question is running on because it's for the protection of
>> the data which resides in the resource.
>>
>> In other words, sfex is to control with resource granularity,
>> whereas a quorum plugin is to control 'the partition' granularity.
>
> Right. The point was however to use parts of sfex for the quorum
> functionality. I'll see if I can get back to you with a more
> detailed and specific proposal.


I still don't understand you very well, sorry.
I'd appreciate it if you could explain in more detail.




>
>> - HBcomm plugin
>>
>> I remember that somebody posted this before, called 'dskcm'.
>
> Somehow missed that one.
>
>> This is also interesting idea but the approach is very different.
>>
>> This approach is:
>> - having yet another redundant communication path through
>> the shared medium.
>> whereas sfex's approach is:
>> - provide a protection method when ALL of the communication
>> paths are failed.
>>
>> Even though they have the similar goal the functionality is
>> very different.
>
> Yes. Though again sfex would need to be twisted a bit to provide
> heartbeats over shared storage. I'll take a look at dskcm.
>

It was this:

http://www.gossamer-threads.com/lists/linuxha/dev/39716#39716

Thanks,

--
Keisuke MORI
NTT DATA Intellilink Corporation



lmb at suse

Jun 19, 2008, 12:08 AM

Post #9 of 18 (795 views)
Re: sfex [In reply to]

On 2008-06-17T17:33:52, Keisuke MORI <kskmori[at]intellilink.co.jp> wrote:

> - a quorum plugin
>
> A quorum plugin is executed only on 'the cluster leader node' in CCM,
> and it does not care where the resource is running on,
> whereas sfex should run on the same node which the resource
> in question is running on because it's for the protection of
> the data which resides in the resource.
>
> In other words, sfex is to control with resource granularity,
> whereas a quorum plugin is to control 'the partition' granularity.

Yes, that's a correct view.

It's worth noting that "the partition" mode probably can be almost
achieved by mandatory ordering constraints from the "quorum resource"
(sfex) to the dependent resources.

"almost" because fencing isn't integrated into this method well.

> - HBcomm plugin
>
> I remember that somebody posted this before, called 'dskcm'.
> This is also interesting idea but the approach is very different.
>
> This approach is:
> - having yet another redundant communication path through
> the shared medium.
> whereas sfex's approach is:
> - provide a protection method when ALL of the communication
> paths are failed.
>
> Even though they have the similar goal the functionality is
> very different.

True, different functionality, but it's worth noting that a disk-based
comm plugin would work in all situations sfex does, and would provide
superior functionality.

I don't think it's worth investing any resources into HBcomm at this
point though. openAIS has a totally different model. So sfex is probably
the best choice available as of today.


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



kskmori at intellilink

Jun 19, 2008, 3:00 AM

Post #10 of 18 (795 views)
Re: sfex [In reply to]

Hi,

"Xinwei Hu" <hxinwei[at]gmail.com> writes:
> I'm the one who opposed sfex in the previous discussion.
>
> My point was simple that:
> """"
> check-and-reserve on disk is not an atomic CAS operation. and lock
> based on that may silently cause data corruption.
> """

sfex does not rely on the atomicity of "check-and-reserve".
It always _overwrites_ the control data, and the detection of
losing the ownership is timeout based.


Indeed, it can happen that two nodes try to write the control
data at the same time in a particular condition, but

1) Such a situation will not happen in the typical
split-brain scenario with sfex. It can only happen in a
particular condition such as a misoperation that tries to
launch two nodes simultaneously _without_ fixing the
split-brain condition.

2) Even if such a situation occurred, sfex resolves it as follows:
- sfex always writes its control data as "one sector" of data
(512 bytes in most cases) through direct I/O.
That would be a single write request to the disk controller.
- If two nodes tried to write the data at the same time,
the requests will be serialized in the disk controller, so
'the latter one' will win.
- sfex makes sure that the written data is "mine", and
the "loser" will return an error to prevent it from launching resources.
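
The write-then-verify step above can be illustrated with a small sketch. This is not sfex's actual code (sfex is written in C and uses direct I/O on a shared device); it is a minimal Python simulation under stated assumptions: an ordinary temporary file stands in for the shared disk, buffered I/O plus fsync stands in for direct I/O, and the record layout and the names encode_record, decode_record, write_and_verify, current_owner and demo are all hypothetical.

```python
import os
import struct
import tempfile

SECTOR_SIZE = 512  # assumed sector size ("512 bytes in most cases")
RECORD = "<B255sQ"  # hypothetical layout: name length, owner name, counter


def encode_record(owner: str, counter: int) -> bytes:
    """Pack a hypothetical control record into exactly one sector."""
    name = owner.encode("ascii")[:255]
    body = struct.pack(RECORD, len(name), name, counter)
    return body.ljust(SECTOR_SIZE, b"\0")


def decode_record(sector: bytes):
    """Return (owner, counter) from a one-sector control record."""
    length, name, counter = struct.unpack_from(RECORD, sector)
    return name[:length].decode("ascii"), counter


def current_owner(fd: int) -> str:
    """Read the control sector and report who owns it right now."""
    owner, _ = decode_record(os.pread(fd, SECTOR_SIZE, 0))
    return owner


def write_and_verify(fd: int, me: str, counter: int) -> bool:
    """Overwrite the whole control sector, then check the data is 'mine'."""
    os.pwrite(fd, encode_record(me, counter), 0)  # one single-sector write
    os.fsync(fd)  # stand-in for direct I/O in this simulation
    return current_owner(fd) == me


def demo():
    """Two nodes write one after the other; the later write wins."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        a_ok = write_and_verify(fd, "node-a", 1)
        b_ok = write_and_verify(fd, "node-b", 1)
        a_still_owner = current_owner(fd) == "node-a"
        return a_ok, b_ok, a_still_owner
```

In this sketch the writes are serialized, the later overwrite wins, and the earlier writer only learns that it lost when it re-reads the sector. How long a node waits before that re-read is a timing parameter the sketch does not model.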



Does that explain it?

Thanks,



--
Keisuke MORI
NTT DATA Intellilink Corporation



hxinwei at gmail

Jun 19, 2008, 6:26 AM

Post #11 of 18 (796 views)
Re: sfex [In reply to]

2008/6/19 Keisuke MORI <kskmori[at]intellilink.co.jp>:
> Hi,
>
> "Xinwei Hu" <hxinwei[at]gmail.com> writes:
>> I'm the one who opposed sfex in the previous discussion.
>>
>> My point was simple that:
>> """"
>> check-and-reserve on disk is not an atomic CAS operation. and lock
>> based on that may silently cause data corruption.
>> """
>
> sfex doest not rely on the atomicity of "check-and-reserve".
> It's always _overwriting_ the control data and the detection of
> losing the ownership is done by timeout based.
>
>
> Indeed it can happen that two nodes try to write the control
> data at a same time in a particular condition, but
>
> 1) Such situation will not happen on the scenario of the typical
> split-brain condition with sfex. It only can happen in a
> particular condition such as a miss-operation that trys to
> launch two nodes simultaneously _without_ fixing the
> split-brain condition.
>
> 2) Even if such situation had occured, sfex resolves it as follows;
> - sfex always writes its control data as "one sector" data
> (512 bytes in most of cases) through the direct I/O.
> That would be a single write request to the disk controller.
> - If two nodes tried to write the data at a same time,
> the request will be serialized in the disk controller, so
> 'the latter one' will win.
> - sfex makes sure that the written data is "mine" and
> the "loser" will return an error to prevent from lauching resources.
>
>
>
> Does it explain to you?

No.
Your basic assumption is that sfex can run in a deterministic
environment. Right?
I think so because sfex totally relies on predictable execution time.
But Linux (for example) is not such an environment, as the
process can be scheduled out at _any_ point for _any_ length of time.

And this is an essential problem due to the lack of a CAS operation for disks.

btw: dskcm is lockless because of the same problem.
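
The objection can be made concrete with a toy timeline. The sketch below is purely illustrative and does not model sfex's real protocol or timeouts: it uses a simulated clock, a made-up LEASE_TIMEOUT value, and hypothetical names (Lock, is_held_by, simulate). It shows a holder that checks ownership, is scheduled out for some time, and then acts on the stale check.

```python
from dataclasses import dataclass

LEASE_TIMEOUT = 10.0  # hypothetical timeout in seconds, for illustration only


@dataclass
class Lock:
    owner: str
    updated_at: float


def is_held_by(lock: Lock, node: str, now: float) -> bool:
    """A timeout-based ownership check: owner matches and lease is fresh."""
    return lock.owner == node and now - lock.updated_at < LEASE_TIMEOUT


def simulate(pause: float) -> bool:
    """Return True if node-a acts on the disk while node-b owns the lock."""
    lock = Lock(owner="node-a", updated_at=0.0)
    now = 1.0

    # Step 1: node-a checks that it still owns the lock.
    a_checked_ok = is_held_by(lock, "node-a", now)

    # Step 2: node-a is scheduled out for `pause` seconds before acting.
    now += pause

    # Step 3: node-b sees a stale lease and takes the lock over.
    if now - lock.updated_at >= LEASE_TIMEOUT:
        lock.owner, lock.updated_at = "node-b", now

    # Step 4: node-a resumes and acts on the result of its earlier check.
    a_acts = a_checked_ok
    return a_acts and lock.owner != "node-a"
```

With a pause shorter than the timeout, simulate(0.5), nothing goes wrong. With a pause longer than the timeout, simulate(30.0), both nodes believe they may touch the data. The disagreement in the thread is over how likely such a long pause is, not over whether it is possible.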



dejanmm at fastmail

Jun 19, 2008, 6:44 AM

Post #12 of 18 (798 views)
Re: sfex [In reply to]

Hi,

On Thu, Jun 19, 2008 at 09:26:13PM +0800, Xinwei Hu wrote:
> 2008/6/19 Keisuke MORI <kskmori[at]intellilink.co.jp>:
> > Hi,
> >
> > "Xinwei Hu" <hxinwei[at]gmail.com> writes:
> >> I'm the one who opposed sfex in the previous discussion.
> >>
> >> My point was simple that:
> >> """"
> >> check-and-reserve on disk is not an atomic CAS operation. and lock
> >> based on that may silently cause data corruption.
> >> """
> >
> > sfex doest not rely on the atomicity of "check-and-reserve".
> > It's always _overwriting_ the control data and the detection of
> > losing the ownership is done by timeout based.
> >
> >
> > Indeed it can happen that two nodes try to write the control
> > data at a same time in a particular condition, but
> >
> > 1) Such situation will not happen on the scenario of the typical
> > split-brain condition with sfex. It only can happen in a
> > particular condition such as a miss-operation that trys to
> > launch two nodes simultaneously _without_ fixing the
> > split-brain condition.
> >
> > 2) Even if such situation had occured, sfex resolves it as follows;
> > - sfex always writes its control data as "one sector" data
> > (512 bytes in most of cases) through the direct I/O.
> > That would be a single write request to the disk controller.
> > - If two nodes tried to write the data at a same time,
> > the request will be serialized in the disk controller, so
> > 'the latter one' will win.
> > - sfex makes sure that the written data is "mine" and
> > the "loser" will return an error to prevent from lauching resources.
> >
> >
> >
> > Does it explain to you?
>
> No.
> Your basic assumption is that sfex can run in a deterministic
> environment. Right ?
> I think so because sfex totally relies on predicable execution time.
> But Linux (for example) indeed is not such an environment, as the
> process can be scheduled out at _any_ point for _any_ time.

True. It is possible to break sfex, but the probability that that
is going to happen is extremely low and could be due only to a
very pathological timing. One way to make this probability still
lower is to implement sfex as a combination of a resource and a
daemon process which would lock itself in memory and send
asynchronous monitor failures to lrmd.

BTW, asynchronous monitor API has just been implemented and waits
for its first users :).

> And this is an essential problem due to the lack of CAS operation for disk.
>
> btw: dskcm is lockless because of the same problem.

I'm going to look into dskcm. Has it been used? Any field experience?

Thanks,

Dejan


hxinwei at gmail

Jun 19, 2008, 7:52 AM

Post #13 of 18 (791 views)
Re: sfex [In reply to]

2008/6/19 Dejan Muhamedagic <dejanmm[at]fastmail.fm>:
> Hi,
>
> On Thu, Jun 19, 2008 at 09:26:13PM +0800, Xinwei Hu wrote:
>> 2008/6/19 Keisuke MORI <kskmori[at]intellilink.co.jp>:
>> > Hi,
>> >
>> > "Xinwei Hu" <hxinwei[at]gmail.com> writes:
>> >> I'm the one who opposed sfex in the previous discussion.
>> >>
>> >> My point was simple that:
>> >> """"
>> >> check-and-reserve on disk is not an atomic CAS operation. and lock
>> >> based on that may silently cause data corruption.
>> >> """
>> >
>> > sfex doest not rely on the atomicity of "check-and-reserve".
>> > It's always _overwriting_ the control data and the detection of
>> > losing the ownership is done by timeout based.
>> >
>> >
>> > Indeed it can happen that two nodes try to write the control
>> > data at a same time in a particular condition, but
>> >
>> > 1) Such situation will not happen on the scenario of the typical
>> > split-brain condition with sfex. It only can happen in a
>> > particular condition such as a miss-operation that trys to
>> > launch two nodes simultaneously _without_ fixing the
>> > split-brain condition.
>> >
>> > 2) Even if such situation had occured, sfex resolves it as follows;
>> > - sfex always writes its control data as "one sector" data
>> > (512 bytes in most of cases) through the direct I/O.
>> > That would be a single write request to the disk controller.
>> > - If two nodes tried to write the data at a same time,
>> > the request will be serialized in the disk controller, so
>> > 'the latter one' will win.
>> > - sfex makes sure that the written data is "mine" and
>> > the "loser" will return an error to prevent from lauching resources.
>> >
>> >
>> >
>> > Does it explain to you?
>>
>> No.
>> Your basic assumption is that sfex can run in a deterministic
>> environment. Right ?
>> I think so because sfex totally relies on predicable execution time.
>> But Linux (for example) indeed is not such an environment, as the
>> process can be scheduled out at _any_ point for _any_ time.
>
> True. It is possible to break sfex, but the probability that that
> is going to happen is extremely low and could be due only to a
> very pathological timing. One way to make this probability still

From my previous experience, I always got _NO_ from customers when
there is a possibility of data corruption.
So I don't think "extremely low" is a valid excuse. ;)
btw: The same goes for SCSI-2 reservation, actually.

> lower is to implement sfex as a combination of a resource and a
> daemon process which would lock itself in memory and send
> asynchronous monitor failures to lrmd.
>
> BTW, asynchronous monitor API has just been implemented and waits
> for its first users :).
>
>> And this is an essential problem due to the lack of CAS operation for disk.
>>
>> btw: dskcm is lockless because of the same problem.
>
> I'm going to look into dskcm. Has it been used? Any field experience?

dskcm has its own problems too.
Heartbeat doesn't support the idea of "link priority" or "link
fallback", so the disk is always kept busy with the communication.
It consumes several hundred KBs of disk I/O bandwidth constantly.

And as we are switching to the openais stack, I don't think I'm going to
improve it any further.

dskcm attracted a lot of interest when people were testing/comparing
different HA solutions.
I did several POCs on this myself. But I'm not aware of anyone using it
in a _production_ environment yet.



lmb at suse

Jun 19, 2008, 9:30 AM

Post #14 of 18 (788 views)
Re: sfex [In reply to]

On 2008-06-17T21:46:42, Andrew Beekhof <beekhof[at]gmail.com> wrote:

> > It's going to be included as an RA. I just wanted to investigate
> > other possibilities.
> a daemon too though right?

Not yet; it performs its checks as part of the monitor ops.

Yes, I think this should be a daemon (started & stopped by the RA, and
causing an async failure notification as needed), but that could be
transparently changed on a later update, if we get the instance
attributes right ...



Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



lmb at suse

Jun 19, 2008, 9:42 AM

Post #15 of 18 (793 views)
Re: sfex [In reply to]

On 2008-06-19T22:52:55, Xinwei Hu <hxinwei[at]gmail.com> wrote:

> > True. It is possible to break sfex, but the probability that that
> > is going to happen is extremely low and could be due only to a
> > very pathological timing. One way to make this probability still
>
> From my previous experience, I always got _NO_ from customers when
> there are possibility to data corruption.
> So I don't think "extreamely low" is a valid excuse. ;)

But it's always just "extremely low". Even STONITH could fail (the device
could be misconfigured to reset the wrong outlet, or report success when
it in fact failed), there could be issues in the very HA stack, the
kernel could cause data corruption in the fs, the storage could fail,
etc. And that's just random failure, ignoring malicious attackers or
careless sysadmins.

We're never 100% certain.

sfex relies on timing, yes, but with such considerable safety margins
that it's "safe enough". NCS SBD basically trusts the other nodes too.

I think it would be a valuable addition, in particular if we could get
it into daemon mode. This could be the first step towards a real "quorum
resource" which a future quorum plugin framework could utilize, too.

> dskcm has its own problems too.
> Heartbeat doesn't support the idea of "link priority" or "link
> fallback", so the disk is always kept busy with the communication.
> It constantly consumes several hundred KB of disk I/O bandwidth.
>
> And as we are switching to the openais stack, I don't think I'm going to
> improve it any further.
>
> dskcm attracted a lot of interest when people were testing/comparing
> different HA solutions.
> I did several POCs on this myself. But I'm not aware of anyone using it
> in a _production_ environment yet.

It would be interesting to see whether this could be added to openAIS.


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



hxinwei at gmail

Jun 19, 2008, 9:52 PM

Post #16 of 18 (766 views)
Re: sfex [In reply to]

2008/6/20 Lars Marowsky-Bree <lmb[at]suse.de>:
> On 2008-06-19T22:52:55, Xinwei Hu <hxinwei[at]gmail.com> wrote:
>
>> > True. It is possible to break sfex, but the probability that that
>> > is going to happen is extremely low and could be due only to a
>> > very pathological timing. One way to make this probability still
>>
>> From my previous experience, I always got _NO_ from customers when
>> there is any possibility of data corruption.
>> So I don't think "extremely low" is a valid excuse. ;)
>
> But it's always just "extremely low". Even STONITH could fail (the device
> could be misconfigured to reset the wrong outlet, or report success when
> it in fact failed), there could be issues in the HA stack itself, the
> kernel could cause data corruption in the fs, the storage could fail,
> etc. And that's just random failure, ignoring malicious attackers or
> careless sysadmins.
>
> We're never 100% certain.
>
> sfex relies on timing, yes, but with such considerable safety margins

Do we already have a systematic method to analyze the "safety margin"?
If not, I won't go along with the "considerable" claim.

> that it's "safe enough". NCS SBD basically trusts the other nodes too.
>
> I think it would be a valuable addition, in particular if we could get
> it into daemon mode. This could be the first step towards a real "quorum
> resource" which a future quorum plugin framework could utilize, too.

If you are talking about turning sfex into something like today's qdisk,
then I totally agree.

>> dskcm has its own problems too.
>> Heartbeat doesn't support the idea of "link priority" or "link
>> fallback", so the disk is always kept busy with the communication.
>> It constantly consumes several hundred KB of disk I/O bandwidth.
>>
>> And as we are switching to the openais stack, I don't think I'm going to
>> improve it any further.
>>
>> dskcm attracted a lot of interest when people were testing/comparing
>> different HA solutions.
>> I did several POCs on this myself. But I'm not aware of anyone using it
>> in a _production_ environment yet.
>
> It would be interesting to see whether this could be added to openAIS.

Yeah, I'm preparing for that too. ;)



beekhof at gmail

Jun 19, 2008, 11:37 PM

Post #17 of 18 (756 views)
Re: sfex [In reply to]

On Thu, Jun 19, 2008 at 18:30, Lars Marowsky-Bree <lmb[at]suse.de> wrote:
> On 2008-06-17T21:46:42, Andrew Beekhof <beekhof[at]gmail.com> wrote:
>
>> > It's going to be included as an RA. I just wanted to investigate
>> > other possibilities.
>> a daemon too though right?
>
> Not yet; it performs its checks as part of the monitor ops.

ohhhhh

>
> Yes, I think this should be a daemon (started & stopped by the RA, and
> causing an async failure notification as needed),

agreed

> but that could be
> transparently changed on a later update, if we get the instance
> attributes right ...

true


lmb at suse

Jun 20, 2008, 1:03 PM

Post #18 of 18 (743 views)
Re: sfex [In reply to]

On 2008-06-20T12:52:09, Xinwei Hu <hxinwei[at]gmail.com> wrote:

> > sfex relies on timing, yes, but with such considerable safety margins
> Do we already have a systematic method to analyze the "safety margin"?
> If not, I won't go along with the "considerable" claim.

It depends; but I would think that 60 seconds with a monitor interval of
5s is plenty.
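
To put rough numbers on that, here is a minimal sketch of the arithmetic.
It is not taken from the sfex code. The names lock_timeout_s,
monitor_interval_s and worst_case_delay_s are hypothetical, and reading
the 60 seconds as the time that must pass before another node may take
the lock is an assumption on top of the figures above.

```python
def intervals_within_timeout(lock_timeout_s: float, monitor_interval_s: float) -> int:
    """Whole monitor intervals that fit inside the timeout window."""
    if monitor_interval_s <= 0:
        raise ValueError("monitor interval must be positive")
    return int(lock_timeout_s // monitor_interval_s)


def remaining_margin_s(lock_timeout_s: float,
                       monitor_interval_s: float,
                       worst_case_delay_s: float) -> float:
    """Slack left after one full interval plus an assumed worst-case stall.

    worst_case_delay_s is a made-up input: how long a monitor op might be
    delayed (I/O stall, scheduling) before it notices the lock is lost.
    """
    return lock_timeout_s - (monitor_interval_s + worst_case_delay_s)


if __name__ == "__main__":
    # Figures from this message: 60 s window, 5 s monitor interval.
    print(intervals_within_timeout(60, 5))
    # The 10 s stall is purely an illustrative assumption.
    print(remaining_margin_s(60, 5, 10))
```

A margin that goes negative for a plausible worst-case delay would mean
the chosen values are too tight; what counts as plausible still has to be
argued for the storage and nodes in question.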

> > It would be interesting to see whether this could be added to openAIS.
> Yeah, I'm preparing for that too. ;)

;-)

Did you see the cluster summit meeting notice? This would appear to
suggest there might be good reason for you to attend.


Regards,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

