
kskmori at intellilink
Feb 23, 2007, 2:07 AM
Hi,

We have found several problems with the pgsql RA through our testing. It fails to fail over in some scenarios. I'm proposing a patch to fix them.

Problem description:

1) The first 'monitor' may fail even if the postmaster was launched successfully. This is because 'start' of pgsql may return before the postmaster is ready to answer the psql query issued by 'monitor', since 'start' only checks for the existence of the postmaster process. The postmaster can take a few minutes to become ready, particularly when it needs to recover the database after a crash. Even when no recovery is necessary, we observed this failure in some of our test cases.

2) The postmaster fails to start up when a 'postmaster.pid' file was left over from a previous crash.

3) 'stop' does not let the fast mode shutdown take effect, because it issues the immediate mode shutdown the very next moment. The fast mode shutdown can take a few minutes to complete because it flushes the database log. This isn't a critical problem, but according to our database team it may make the failover take longer to complete, since an immediate shutdown leaves the database needing crash recovery at the next startup. It is preferable to wait as long as possible for the fast mode shutdown to complete.

Proposals to fix:

1) In 'start', wait until the postmaster is ready to answer, using the same check that 'monitor' does. The maximum time to wait for startup can be customized with an additional parameter, 'start_wait'.

2) Add cleanup code for 'postmaster.pid' on stop and before starting.

3) In 'stop', wait until the postmaster completes the fast mode shutdown. The maximum time to wait for shutdown can be customized with an additional parameter, 'stop_wait'.

The attached patch is against the latest -dev.

Regards,
Keisuke MORI
NTT DATA Intellilink Corporation
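The three proposed fixes share one idea: poll a condition with an upper time bound instead of returning (or escalating) immediately, and clear a pid file only when its owner is gone. The Python below is a minimal sketch of that logic under stated assumptions, not the attached patch: the real pgsql RA is a shell script, and the names `wait_until`, `remove_stale_pid_file`, `start_postmaster` and `stop_postmaster` are hypothetical. The callables stand in for the RA's real actions (launching the postmaster, the 'monitor' psql query, the fast and immediate shutdowns), and it assumes the first line of 'postmaster.pid' holds the postmaster PID.

```python
import os
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float, interval: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() every `interval` seconds until it returns True or
    `timeout` seconds have elapsed. Returns False on timeout."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def pid_is_running(pid: int) -> bool:
    """Probe a process with signal 0, which tests existence without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def remove_stale_pid_file(pid_file: str,
                          is_running: Callable[[int], bool] = pid_is_running) -> bool:
    """Remove the pid file if the PID on its first line is not alive.
    Returns True if the file was removed (hypothetical helper)."""
    try:
        with open(pid_file) as f:
            first_line = f.readline().strip()
    except FileNotFoundError:
        return False  # nothing to clean up
    try:
        # abs(): PostgreSQL records a negative PID for a standalone backend.
        pid = abs(int(first_line))
    except ValueError:
        pid = 0  # unparsable contents: treat the file as stale
    if pid > 0 and is_running(pid):
        return False  # a live process still owns the data directory
    os.remove(pid_file)
    return True


def start_postmaster(pid_file: str, launch: Callable[[], None],
                     is_ready: Callable[[], bool], start_wait: float,
                     is_running: Callable[[int], bool] = pid_is_running,
                     **wait_kw) -> bool:
    """Fixes 1 and 2: clear a stale pid file, launch, then wait for readiness."""
    remove_stale_pid_file(pid_file, is_running)
    launch()
    # Report success only once the 'monitor'-style check passes.
    return wait_until(is_ready, start_wait, **wait_kw)


def stop_postmaster(pid_file: str, fast_shutdown: Callable[[], None],
                    immediate_shutdown: Callable[[], None],
                    is_stopped: Callable[[], bool], stop_wait: float,
                    is_running: Callable[[int], bool] = pid_is_running,
                    **wait_kw) -> str:
    """Fix 3: give the fast shutdown up to stop_wait seconds before escalating,
    then (fix 2) clear the pid file if its owner is gone."""
    fast_shutdown()
    if wait_until(is_stopped, stop_wait, **wait_kw):
        mode = "fast"
    else:
        immediate_shutdown()
        mode = "immediate"
    remove_stale_pid_file(pid_file, is_running)
    return mode
```

In an actual RA, `is_ready` would be the same psql query that 'monitor' runs, and the two shutdown callables would map to something like `pg_ctl stop -m fast` and `pg_ctl stop -m immediate`. Checking liveness by PID has a known gap: after a crash the recorded PID may have been reused by an unrelated process, in which case this sketch conservatively leaves the file in place.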