Mailing List Archive: Linux-HA: Dev

pgsql RA improvements


kskmori at intellilink

Feb 23, 2007, 2:07 AM

Post #1 of 17
pgsql RA improvements

Hi,

We have found several problems with the pgsql RA through our testing.
It 'fails to fail over' in some scenarios.
I'm proposing a patch to fix them.

Problem description:

1) The first 'monitor' may fail even if the postmaster was
successfully launched.

This is because 'start' of the pgsql RA may return before the
postmaster is ready to answer a psql query issued by
'monitor', since it only checks for the existence of the postmaster
process. The postmaster can take a few minutes to become ready
to answer, particularly when it needs to recover the database
after a crash. Even when no recovery is necessary, we observed
that the first 'monitor' sometimes fails in some of our test cases.

2) The postmaster fails to start up when a 'postmaster.pid' file
was left over from a previous crash.

3) 'stop' does not execute the fast mode shutdown effectively,
because it executes the immediate mode shutdown at the very
next moment. The fast mode shutdown can take a few minutes
to finish flushing the database log.

This isn't a critical problem, but it may make the failover
take longer to complete (according to our database team).
It is preferable to wait for the fast mode shutdown to complete
for as long as possible.


Proposals to fix:

1) In 'start', wait until the postmaster is ready to answer by
performing the same check that 'monitor' does.
The maximum time to wait for startup to complete can be
customized with an additional parameter, 'start_wait'.

2) Add cleanup code for 'postmaster.pid' on stop and before starting.

3) In 'stop', wait until the postmaster completes the fast
mode shutdown.
The maximum time to wait for shutdown to complete can be
customized with an additional parameter, 'stop_wait'.
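
As a rough illustration of proposal 1, a bounded wait could look
something like the sketch below. This is not the attached patch:
wait_until_ready and its arguments are hypothetical names, and a marker
file stands in for the real 'monitor' check.

```sh
#!/bin/sh
# Sketch only: wait_until_ready is a hypothetical helper, not a function
# from the pgsql RA or from the attached patch.

# Poll a check command once per second until it succeeds or max_wait
# seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_until_ready() {
    max_wait=$1
    shift
    waited=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$max_wait" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Stand-in for the monitor check: it succeeds once a marker file exists.
marker="${TMPDIR:-/tmp}/pgsql_ready.$$"
( sleep 2; : > "$marker" ) &
if wait_until_ready 10 test -f "$marker"; then
    echo "ready"
else
    echo "gave up waiting"
fi
wait
rm -f "$marker"
```

In the RA itself the check command would presumably be the monitor
function, and the limit would come from 'start_wait' (or 'stop_wait'
with the check inverted).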


The attached patch is for the latest -dev.

Regards,

Keisuke MORI
NTT DATA Intellilink Corporation
Attachments: pgsql.in.patch (3.34 KB)


beekhof at gmail

Feb 23, 2007, 3:22 AM

Post #2 of 17
Re: pgsql RA improvements


I'd be more inclined to go with something like the patch below.

The function of start_wait and stop_wait is just as easily achieved by
setting the action's timeout. It's also harder to mess up (e.g. by
setting start_wait to longer than the start action's timeout).

diff -r 959f2c429fc3 resources/OCF/pgsql.in
--- a/resources/OCF/pgsql.in Fri Feb 23 10:59:12 2007 +0100
+++ b/resources/OCF/pgsql.in Fri Feb 23 12:18:53 2007 +0100
@@ -197,15 +197,12 @@ pgsql_start() {
return $OCF_ERR_GENERIC
fi

- if ! pgsql_status
- then
- sleep 5
- if ! pgsql_status
- then
- echo "ERROR: PostgreSQL is not running!"
- return $OCF_ERR_GENERIC
- fi
- fi
+
+ active=0
+ while [ $active != 0 ]; do
+ pgsql_monitor
+ active=$?
+ done

return $OCF_SUCCESS
}
@@ -227,6 +224,13 @@ pgsql_stop() {
runasowner "$PGCTL -D $PGDATA stop -m immediate > /dev/null 2>&1"
fi

+ active=$OCF_NOT_RUNNING
+ while [ $active != $OCF_NOT_RUNNING ]; do
+ pgsql_monitor
+ active=$?
+ done
+
+ rm -f $PIDFILE
return $OCF_SUCCESS
}
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev[at]lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


lmb at suse

Feb 23, 2007, 3:32 AM

Post #3 of 17
Re: pgsql RA improvements

On 2007-02-23T19:07:19, Keisuke MORI <kskmori[at]intellilink.co.jp> wrote:

Thanks a lot for your enhancements!

They all look good. Merged.




sergeyfd at gmail

Feb 23, 2007, 6:17 AM

Post #4 of 17
Re: pgsql RA improvements

On 2/23/07, Andrew Beekhof <beekhof[at]gmail.com> wrote:
> I'd be more inclined to go with something like the patch below.
>
> The function of start_wait and stop_wait is just as easily achieved by
> setting the action's timeout. It's also harder to mess up (e.g. by
> setting start_wait to longer than the start action's timeout).
>
> diff -r 959f2c429fc3 resources/OCF/pgsql.in
> --- a/resources/OCF/pgsql.in Fri Feb 23 10:59:12 2007 +0100
> +++ b/resources/OCF/pgsql.in Fri Feb 23 12:18:53 2007 +0100
> @@ -197,15 +197,12 @@ pgsql_start() {
> return $OCF_ERR_GENERIC
> fi
>
> - if ! pgsql_status
> - then
> - sleep 5
> - if ! pgsql_status
> - then
> - echo "ERROR: PostgreSQL is not running!"
> - return $OCF_ERR_GENERIC
> - fi
> - fi
> +
> + active=0
> + while [ $active != 0 ]; do
> + pgsql_monitor
> + active=$?
> + done

So if for some reason PostgreSQL fails to start we'll have an endless
loop here. Am I right?

>
> return $OCF_SUCCESS
> }
> @@ -227,6 +224,13 @@ pgsql_stop() {
> runasowner "$PGCTL -D $PGDATA stop -m immediate > /dev/null 2>&1"
> fi
>
> + active=$OCF_NOT_RUNNING
> + while [ $active != $OCF_NOT_RUNNING ]; do
> + pgsql_monitor
> + active=$?
> + done

And here.

> +
> + rm -f $PIDFILE
> return $OCF_SUCCESS
> }


sergeyfd at gmail

Feb 23, 2007, 6:23 AM

Post #5 of 17
Re: pgsql RA improvements

I like the idea of the patch, but honestly I don't like how it's
implemented. It should call the "monitor" function (as Andrew
suggested) to check whether pgsql is up or down instead of spreading
the same code all around the script. I'd like to review the idea and
prepare another patch if everybody agrees.



sergeyfd at gmail

Feb 23, 2007, 6:27 AM

Post #6 of 17
Re: pgsql RA improvements

And I don't like the idea of removing the PID file in the "start"
function. The standard approach is to remove it after stopping the
application. Otherwise it could lead to an attempt to start a second
copy of the application.



sergeyfd at gmail

Feb 23, 2007, 9:46 AM

Post #7 of 17
Re: pgsql RA improvements

Attached is the patch the way I would like it to be.

Attachments: pgsql.in.patch (2.65 KB)


sergeyfd at gmail

Feb 23, 2007, 10:31 AM

Post #8 of 17
Re: pgsql RA improvements

Sorry, I just found that my version won't work properly on Solaris.
Attached is the corrected one. Sorry for creating so many messages
:-)

Attachments: pgsql.in.patch (2.66 KB)


beekhof at gmail

Feb 23, 2007, 11:50 PM

Post #9 of 17
Re: pgsql RA improvements

On 2/23/07, Serge Dubrouski <sergeyfd[at]gmail.com> wrote:
> On 2/23/07, Andrew Beekhof <beekhof[at]gmail.com> wrote:
> > I'd be more inclined to go with something like the patch below.
> >
> > The function of start_wait and stop_wait is just as easily achieved by
> > setting the action's timeout. It's also harder to mess up (e.g. by
> > setting start_wait to longer than the start action's timeout).
> >
> > diff -r 959f2c429fc3 resources/OCF/pgsql.in
> > --- a/resources/OCF/pgsql.in Fri Feb 23 10:59:12 2007 +0100
> > +++ b/resources/OCF/pgsql.in Fri Feb 23 12:18:53 2007 +0100
> > @@ -197,15 +197,12 @@ pgsql_start() {
> > return $OCF_ERR_GENERIC
> > fi
> >
> > - if ! pgsql_status
> > - then
> > - sleep 5
> > - if ! pgsql_status
> > - then
> > - echo "ERROR: PostgreSQL is not running!"
> > - return $OCF_ERR_GENERIC
> > - fi
> > - fi
> > +
> > + active=0
> > + while [ $active != 0 ]; do
> > + pgsql_monitor
> > + active=$?
> > + done
>
> So if for some reason PostgreSQL fails to start we'll have an endless
> loop here. Am I right?

only until the action's timeout is reached and the LRM terminates the action
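
As an analogy for that behaviour, the sketch below lets an external
supervisor cut off a loop that would otherwise never end. It assumes
the GNU coreutils `timeout` command, which here merely plays the role
of the LRM enforcing the action timeout; it is not how the LRM actually
runs an RA.

```sh
#!/bin/sh
# Analogy only: `timeout` stands in for the LRM's action timeout.
# The inner loop never exits on its own.
timeout 2 sh -c 'while :; do sleep 1; done'
status=$?

# GNU timeout reports exit status 124 when it had to terminate the command.
echo "loop terminated by supervisor, exit status $status"
```

The script itself needs no upper bound in this model, but the action is
then reported as timed out rather than as a clean failure.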

>
> >
> > return $OCF_SUCCESS
> > }
> > @@ -227,6 +224,13 @@ pgsql_stop() {
> > runasowner "$PGCTL -D $PGDATA stop -m immediate > /dev/null 2>&1"
> > fi
> >
> > + active=$OCF_NOT_RUNNING
> > + while [ $active != $OCF_NOT_RUNNING ]; do
> > + pgsql_monitor
> > + active=$?
> > + done
>
> And here.
>
> > +
> > + rm -f $PIDFILE
> > return $OCF_SUCCESS
> > }


sergeyfd at gmail

Feb 24, 2007, 6:03 AM

Post #10 of 17
Re: pgsql RA improvements

On 2/24/07, Andrew Beekhof <beekhof[at]gmail.com> wrote:
> On 2/23/07, Serge Dubrouski <sergeyfd[at]gmail.com> wrote:
> > On 2/23/07, Andrew Beekhof <beekhof[at]gmail.com> wrote:
> > > I'd be more inclined to go with something like the patch below.
> > >
> > > The function of start_wait and stop_wait is just as easily achieved by
> > > setting the action's timeout. It's also harder to mess up (e.g. by
> > > setting start_wait to longer than the start action's timeout).
> > >
> > > diff -r 959f2c429fc3 resources/OCF/pgsql.in
> > > --- a/resources/OCF/pgsql.in Fri Feb 23 10:59:12 2007 +0100
> > > +++ b/resources/OCF/pgsql.in Fri Feb 23 12:18:53 2007 +0100
> > > @@ -197,15 +197,12 @@ pgsql_start() {
> > > return $OCF_ERR_GENERIC
> > > fi
> > >
> > > - if ! pgsql_status
> > > - then
> > > - sleep 5
> > > - if ! pgsql_status
> > > - then
> > > - echo "ERROR: PostgreSQL is not running!"
> > > - return $OCF_ERR_GENERIC
> > > - fi
> > > - fi
> > > +
> > > + active=0
> > > + while [ $active != 0 ]; do
> > > + pgsql_monitor
> > > + active=$?
> > > + done
> >
> > So if for some reason PostgreSQL fails to start we'll have an endless
> > loop here. Am I right?
>
> only until the action's timeout is reached and the LRM terminates the action

Actually it'll never get into that loop:

active=0
while [ $active != 0 ]; do

Do you see why? :-)
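
For illustration, the sketch below runs the loop as posted next to one
possible correction. fake_monitor is a hypothetical stand-in for
pgsql_monitor that fails twice and then succeeds; nothing here is taken
from the actual RA.

```sh
#!/bin/sh
# Sketch only: fake_monitor is a hypothetical stand-in for pgsql_monitor.
# It returns non-zero on the first two calls and 0 from the third call on.
calls=0
fake_monitor() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}

# As posted: active starts at 0, so the loop condition is already false
# on entry and the monitor is never called.
active=0
while [ "$active" != 0 ]; do
    fake_monitor
    active=$?
done
posted_calls=$calls
echo "as posted: monitor called $posted_calls times"

# One possible fix: start from a non-zero value so the body runs at
# least once and the loop ends when the monitor reports success.
calls=0
active=1
while [ "$active" != 0 ]; do
    fake_monitor
    active=$?
done
fixed_calls=$calls
echo "primed: monitor called $fixed_calls times"
```

The loop in pgsql_stop has the same shape: it starts with active set to
$OCF_NOT_RUNNING and loops while active differs from $OCF_NOT_RUNNING.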


>
> >
> > >
> > > return $OCF_SUCCESS
> > > }
> > > @@ -227,6 +224,13 @@ pgsql_stop() {
> > > runasowner "$PGCTL -D $PGDATA stop -m immediate > /dev/null 2>&1"
> > > fi
> > >
> > > + active=$OCF_NOT_RUNNING
> > > + while [ $active != $OCF_NOT_RUNNING ]; do
> > > + pgsql_monitor
> > > + active=$?
> > > + done
> >
> > And here.
> >
> > > +
> > > + rm -f $PIDFILE
> > > return $OCF_SUCCESS
> > > }


kskmori at intellilink

Feb 25, 2007, 6:18 PM

Post #11 of 17
Re: pgsql RA improvements

Serge,

Thanks for reviewing the patch.

"Serge Dubrouski" <sergeyfd[at]gmail.com> writes:

> And I don't like the idea of removing the PID file in the "start"
> function. The standard approach is to remove it after stopping the
> application. Otherwise it could lead to an attempt to start a second
> copy of the application.

This is necessary for the recovery from the power failure of the
primary node, for example. There is no chance to cleanup by stop
in such cases.

Duplicate starting is avoided by checking if the postmaster
process exists beforehand, as the original script does.


>
> On 2/23/07, Serge Dubrouski <sergeyfd[at]gmail.com> wrote:
>> I like the idea of the patch, but honestly I don't like how it's
>> implemented. It should call (as Andrew suggested) the "monitor" function to
>> check whether pgsql is up or down instead of spreading the same code all
>> around the script. I'd like to review the idea and prepare another
>> patch if everybody agrees.

Yes, using the same monitor function would be better.
I didn't do that only because it will write log messages every
second when startup takes a long time.
It is OK with me if you don't mind that.

Thanks,
--
Keisuke MORI
NTT DATA Intellilink Corporation


sergeyfd at gmail

Feb 25, 2007, 8:50 PM

Post #12 of 17 (1178 views)
Permalink
Re: pgsql RA improvements [In reply to]

On 2/25/07, Keisuke MORI <kskmori[at]intellilink.co.jp> wrote:
> Serge,
>
> Thanks for reviewing the patch.
>
> "Serge Dubrouski" <sergeyfd[at]gmail.com> writes:
>
> > And I don't like the idea of removing the PID file in the "start" function. The
> > standard approach is to remove it after stopping the application. Otherwise
> > it could lead to an attempt to start a second copy of the application.
>
> This is necessary for recovery from, for example, a power failure of
> the primary node. In such cases 'stop' never gets a chance to clean up.
>
> A duplicate start is avoided by checking beforehand whether the
> postmaster process exists, as the original script does.

Yes, but in this case you remove the legitimate PID file of the
running instance. You remove it before checking for the
postmaster. Let me think about it; I don't know which is worse in
such a case. You are probably right, and we are entitled to assume that
PostgreSQL shouldn't be started outside of cluster control.

>
>
> >
> > On 2/23/07, Serge Dubrouski <sergeyfd[at]gmail.com> wrote:
> >> I like the idea of the patch, but honestly I don't like how it's
> >> implemented. It should call (as Andrew suggested) the "monitor" function to
> >> check whether pgsql is up or down instead of spreading the same code all
> >> around the script. I'd like to review the idea and prepare another
> >> patch if everybody agrees.
>
> Yes, using the same monitor function would be better.
> I didn't do that only because it will write log messages every
> second when startup takes a long time.
> It is OK with me if you don't mind that.

I don't think this is a problem. The log files are big even without
those records.

Thanks for all these proposals.



kskmori at intellilink

Feb 25, 2007, 10:59 PM

Post #13 of 17 (1174 views)
Permalink
Re: pgsql RA improvements [In reply to]

"Serge Dubrouski" <sergeyfd[at]gmail.com> writes:
>> "Serge Dubrouski" <sergeyfd[at]gmail.com> writes:
>>
>> > And I don't like the idea of removing the PID file in the "start" function. The
>> > standard approach is to remove it after stopping the application. Otherwise
>> > it could lead to an attempt to start a second copy of the application.
>>
>> This is necessary for recovery from, for example, a power failure of
>> the primary node. In such cases 'stop' never gets a chance to clean up.
>>
>> A duplicate start is avoided by checking beforehand whether the
>> postmaster process exists, as the original script does.
>
> Yes, but in this case you remove the legitimate PID file of the
> running instance. You remove it before checking for the
> postmaster.

Well, I think the script checks for the postmaster first and removes
the file second (only when no postmaster process exists).

Here's the code snippet with my patch applied.
pgsql_status does the check, and I think that should be good enough.
----8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<----
pgsql_start() {
    if pgsql_status
    then
        ocf_log info "PostgreSQL is already running. PID=`cat $PIDFILE`"
        return $OCF_SUCCESS
    fi

    if [ -x $PGCTL ]
    then
        # Remove postmaster.pid if it exists
        rm -f $PIDFILE
----8<--------8<--------8<--------8<--------8<--------8<--------8<--------8<----


> Let me think about it; I don't know which is worse in
> such a case. You are probably right, and we are entitled to assume that
> PostgreSQL shouldn't be started outside of cluster control.

If the postmaster was already started outside of heartbeat control,
then 'start' should return OCF_SUCCESS and the postmaster should
continue to run.

Power failure is one of the most typical situations that HA software
is meant to handle, so this 'cleanup in start' is
important, I think.

Maybe it would be nice if we put a WARN log before removing it.
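
As an illustration of that idea, here is a minimal sketch of "check first, warn, then remove". It is not the RA's actual code: `log`, `pid_is_alive` and `cleanup_stale_pidfile` are hypothetical names, `log` stands in for ocf_log, and the liveness test assumes the first line of the PID file holds the process ID and that `kill -0` is an acceptable check. How pgsql_status really decides is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical stand-in for ocf_log so the sketch runs outside the cluster.
log() { echo "$1: $2"; }

# Returns 0 if the PID recorded on the first line of the file
# belongs to a process we can signal.
pid_is_alive() {
    pidfile=$1
    [ -f "$pidfile" ] || return 1
    pid=$(head -n 1 "$pidfile")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Remove the PID file only when no live process owns it, and say so.
# Returns 1 (and leaves the file alone) if the owner is still alive.
cleanup_stale_pidfile() {
    pidfile=$1
    if pid_is_alive "$pidfile"; then
        return 1
    fi
    if [ -f "$pidfile" ]; then
        log WARN "removing stale PID file $pidfile"
        rm -f "$pidfile"
    fi
    return 0
}

# Demo with a temporary file instead of a real postmaster.pid.
demo=$(mktemp)
echo $$ > "$demo"                  # our own PID: alive, file must stay
cleanup_stale_pidfile "$demo" || echo "live process, PID file kept"
sleep 0 & wait $!                  # a PID that has already exited
echo $! > "$demo"
cleanup_stale_pidfile "$demo"      # logs a warning and removes the file
rm -f "$demo"
```

Note that `kill -0` only shows that some process with that PID exists; after a reboot the number may belong to an unrelated process, so a real check would probably also want to confirm that it is a postmaster.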

Thanks,


--
Keisuke MORI
Open Source Business Division
NTT DATA Intellilink Corporation
Tel: +81-3-3534-4811 / Fax: +81-3-3534-4814


beekhof at gmail

Feb 26, 2007, 2:31 AM

Post #14 of 17 (1174 views)
Permalink
Re: pgsql RA improvements [In reply to]

i made some further improvements in:
http://hg.beekhof.net/lha/crm-dev/rev/2e9b22cfb7e1



sergeyfd at gmail

Feb 26, 2007, 9:52 AM

Post #15 of 17 (1176 views)
Permalink
Re: pgsql RA improvements [In reply to]

You broke it:

./pgsql start
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
chown: missing operand after `:'
Try `chown --help' for more information.
2007/02/26_12:50:26 ERROR: Can't start PostgreSQL.

These errors come from the changed way the variables are
initialized. Also, I still don't like that indefinite loop on start,
because it makes it harder to troubleshoot manually when
PostgreSQL doesn't start.

I don't know what the right way to fix these problems is now: fix your
version of the script or fix the previous one.

On 2/26/07, Andrew Beekhof <beekhof[at]gmail.com> wrote:
> i made some further improvements in:
> http://hg.beekhof.net/lha/crm-dev/rev/2e9b22cfb7e1


beekhof at gmail

Feb 26, 2007, 10:16 AM

Post #16 of 17 (1176 views)
Permalink
Re: pgsql RA improvements [In reply to]

On 2/26/07, Serge Dubrouski <sergeyfd[at]gmail.com> wrote:
> You broke it:
>
> ./pgsql start
> Usage: grep [OPTION]... PATTERN [FILE]...
> Try `grep --help' for more information.
> chown: missing operand after `:'
> Try `chown --help' for more information.
> 2007/02/26_12:50:26 ERROR: Can't start PostgreSQL.
>
> These errors come from the changed way the variables are
> initialized.

sorry - i've pushed up a fix

> Also, I still don't like that indefinite loop on start,
> because it makes it harder to troubleshoot manually when
> PostgreSQL doesn't start.

then add a call to ocf_log which indicates the RA is retrying or some such

the RA is definitely not the best place to set limits on how long a
resource can take to start.

at the very least it leads to confusion when the timeout is less than
an RA's internal limit. on the other hand, if the internal limit is
lower than the timeout, then you're returning before you needed to.

it is also not reliable if any part of the RA can block.
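
A rough sketch of what such a retry loop with logging might look like follows. It is only an illustration under stated assumptions: `log` and `fake_monitor` are hypothetical stand-ins for ocf_log and pgsql_monitor, and `wait_until_up` is a made-up helper name. The loop has no internal limit and relies on the operation timeout enforced from outside.

```shell
#!/bin/sh
# log and fake_monitor are hypothetical stand-ins so the sketch runs on
# its own; in the RA they would be ocf_log and pgsql_monitor.
log() { echo "$1: $2"; }

attempts=0
fake_monitor() {
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 3 ]; then
        return 0
    fi
    return 7
}

# Poll until the given check succeeds, logging every retry.
# There is deliberately no internal limit: the operation timeout
# enforced by the caller is expected to end an action that never returns.
wait_until_up() {
    check=$1
    interval=$2
    while ! "$check"; do
        log info "PostgreSQL is not answering yet, retrying in ${interval}s"
        sleep "$interval"
    done
    log info "PostgreSQL is up"
}

wait_until_up fake_monitor 1
```

Someone running the script by hand then sees a retry message on every pass instead of a silent hang, which may go some way toward the troubleshooting concern.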



sergeyfd at gmail

Feb 26, 2007, 10:24 AM

Post #17 of 17 (1168 views)
Permalink
Re: pgsql RA improvements [In reply to]

There were some more problems besides that initialization. Attached is
a patch. I tested it and it seems to work fine.

Attachments: pgsql.in.patch (1.99 KB)
