Mailing List Archive: Linux-HA: Dev

heartbeat should detect and recover from corrupt CIB



anibal at sgi

Mar 1, 2007, 11:11 PM

Post #1 of 5 (586 views)
heartbeat should detect and recover from corrupt CIB

Hello,

Russell Coker found that "under XFS failure modes a recently created
file may end up filled with zeros if there is a power outage (IPMI
fence) at an inconvenient time. Heartbeat keeps a backup copy of
/var/lib/heartbeat/crm/cib.xml but if the primary copy is filled with
zeros it doesn't use the backup!"

I created the following patch. It has been tested and found to work
under the circumstances described above.

--- lib/crm/common/xml.c~ 2007-01-12 13:57:08.000000000 +1100
+++ lib/crm/common/xml.c 2007-03-02 14:41:03.352014250 +1100
@@ -634,6 +634,11 @@

 	/* establish the file with correct permissions */
 	file_output_strm = fopen(filename, "w");
+	if(file_output_strm == NULL) {
+		crm_free(buffer);
+		cl_perror("Cannot open %s", filename);
+		return -1;
+	}
 	fclose(file_output_strm);
 	chmod(filename, cib_mode);
 
@@ -684,7 +689,11 @@
 		if(res < 0) {
 			cl_perror("Cannot write output to %s",filename);
 		}
-		fflush(file_output_strm);
+		if(fflush(file_output_strm) == EOF || fsync(fileno(file_output_strm)) < 0) {
+			cl_perror("fflush or fsync error on %s", filename);
+			fclose(file_output_strm);
+			return -1;
+		}
 	}
 	fclose(file_output_strm);
 	crm_free(buffer);
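
The patch above covers the write side. For the read side described in the quoted report, where the primary copy is filled with zeros and the backup goes unused, a check might look roughly like the sketch below. The names file_is_blank and pick_usable_copy are hypothetical and are not heartbeat functions. The sketch assumes only that a file with no non-zero bytes should be treated as corrupt.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of heartbeat: returns 1 if the file is
 * empty or contains only zero bytes, 0 if it holds at least one
 * non-zero byte, and -1 if it cannot be opened or read. */
int file_is_blank(const char *path)
{
    unsigned char chunk[4096];
    size_t n, i;
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL) {
        return -1;
    }
    while ((n = fread(chunk, 1, sizeof(chunk), fp)) > 0) {
        for (i = 0; i < n; i++) {
            if (chunk[i] != 0) {
                fclose(fp);
                return 0;
            }
        }
    }
    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return 1;
}

/* Hypothetical selection logic: prefer the primary copy and fall back
 * to the backup when the primary is unreadable or blank. Returns NULL
 * when neither copy looks usable. */
const char *pick_usable_copy(const char *primary, const char *backup)
{
    if (file_is_blank(primary) == 0) {
        return primary;
    }
    if (file_is_blank(backup) == 0) {
        return backup;
    }
    return NULL;
}
```

A non-blank file can still be corrupt in other ways, so a check like this would only complement whatever validation is done when the XML is parsed.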

Aníbal
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev [at] lists
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


beekhof at gmail

Mar 2, 2007, 12:03 AM

Post #2 of 5 (558 views)
Re: heartbeat should detect and recover from corrupt CIB

On 3/2/07, Aníbal Monsalve Salazar <anibal [at] sgi> wrote:
> Hello,
>
> Russell Coker found that "under XFS failure modes a recently created
> file may end up filled with zeros if there is a power outage (IPMI
> fence) at an inconvenient time. Heartbeat keeps a backup copy of
> /var/lib/heartbeat/crm/cib.xml but if the primary copy is filled with
> zeros it doesn't use the backup!"
>
> I created the following patch. It has been tested and found to work
> under the circumstances described above.

coincidentally i already applied something equivalent to the first
part of this patch a couple of days ago - but the second half looks
like a good addition too.

thanks!



anibal at sgi

Mar 4, 2007, 10:14 PM

Post #3 of 5 (554 views)
Re: heartbeat should detect and recover from corrupt CIB

On Fri, Mar 02, 2007 at 09:03:19AM +0100, Andrew Beekhof wrote:
>coincidentally i already applied something equivalent to the first
>part of this patch a couple of days ago - but the second half looks
>like a good addition too.

The fflush()/fsync() error path in the second hunk was missing a
crm_free(buffer). Updated patch follows.

--- lib/crm/common/xml.c~ 2007-01-12 13:57:08.000000000 +1100
+++ lib/crm/common/xml.c 2007-03-05 11:31:17.630665050 +1100
@@ -634,6 +634,11 @@

 	/* establish the file with correct permissions */
 	file_output_strm = fopen(filename, "w");
+	if(file_output_strm == NULL) {
+		crm_free(buffer);
+		cl_perror("Cannot open %s", filename);
+		return -1;
+	}
 	fclose(file_output_strm);
 	chmod(filename, cib_mode);
 
@@ -684,7 +689,12 @@
 		if(res < 0) {
 			cl_perror("Cannot write output to %s",filename);
 		}
-		fflush(file_output_strm);
+		if(fflush(file_output_strm) == EOF || fsync(fileno(file_output_strm)) < 0) {
+			crm_free(buffer);
+			cl_perror("fflush or fsync error on %s", filename);
+			fclose(file_output_strm);
+			return -1;
+		}
 	}
 	fclose(file_output_strm);
 	crm_free(buffer);
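
Outside the heartbeat tree, the sequence the patch enforces (check fopen(), write, fflush(), fsync(), then fclose()) can be sketched as a stand-alone function. This is only an illustration under assumptions: write_text_durably is a hypothetical name, perror() stands in for cl_perror(), and there is no heap buffer to release.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-alone function, not heartbeat's write routine:
 * writes text to path and returns 0 only after stdio has flushed the
 * data and fsync() has reported success. Returns -1 otherwise. */
int write_text_durably(const char *path, const char *text)
{
    size_t len;
    FILE *fp;

    assert(path != NULL && text != NULL);
    len = strlen(text);
    fp = fopen(path, "w");
    if (fp == NULL) {
        perror("fopen");
        return -1;
    }
    if (fwrite(text, 1, len, fp) != len) {
        perror("fwrite");
        fclose(fp);
        return -1;
    }
    /* fflush() hands stdio's buffer to the kernel; fsync() asks the
     * kernel to push that data to stable storage. */
    if (fflush(fp) == EOF || fsync(fileno(fp)) < 0) {
        perror("fflush or fsync");
        fclose(fp);
        return -1;
    }
    if (fclose(fp) == EOF) {
        perror("fclose");
        return -1;
    }
    return 0;
}
```

The sketch reports each error before doing any cleanup, so that errno still describes the failing call when the message is printed.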

Aníbal


anibal at sgi

Mar 6, 2007, 2:22 PM

Post #4 of 5 (552 views)
Re: heartbeat should detect and recover from corrupt CIB

On Fri, Mar 02, 2007 at 09:03:19AM +0100, Andrew Beekhof wrote:
>coincidentally i already applied something equivalent to the first
>part of this patch a couple of days ago

Did you also fix other cases of fopen() followed immediately by
fclose()/ftell()/fprintf() with no error checking?

I've seen at least two more cases.

On line 112 of crm/cib/io.c, fopen() is followed immediately by ftell()
with no error checking.

On line 332 of crm/pengine/ptest.c, fopen() is followed immediately by
fprintf() with no error checking.
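
The kind of check being asked for can be sketched as below, using the fopen() and ftell() pairing from the first case. This is not the code from crm/cib/io.c. file_size_checked is a hypothetical name, and perror() stands in for heartbeat's own logging.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns the size of the file in bytes, or -1
 * if it cannot be opened, positioned, or measured. */
long file_size_checked(const char *path)
{
    long size;
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL) {
        /* Without this check, the calls below would receive NULL. */
        perror("fopen");
        return -1;
    }
    if (fseek(fp, 0L, SEEK_END) != 0) {
        perror("fseek");
        fclose(fp);
        return -1;
    }
    size = ftell(fp);
    if (size < 0) {
        perror("ftell");
    }
    fclose(fp);
    return size;
}
```

The same early return on a NULL stream applies when the next call is fprintf() or fclose(), since passing a null stream to either is undefined behaviour.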

Aníbal


beekhof at gmail

Mar 7, 2007, 1:11 AM

Post #5 of 5 (561 views)
Re: heartbeat should detect and recover from corrupt CIB

On 3/6/07, Aníbal Monsalve Salazar <anibal [at] sgi> wrote:
> On Fri, Mar 02, 2007 at 09:03:19AM +0100, Andrew Beekhof wrote:
> >coincidentally i already applied something equivalent to the first
> >part of this patch a couple of days ago
>
> Did you also fix other cases of fopen() followed immediately by
> fclose()/ftell()/fprintf() with no error checking?

i didn't but will do so now

> I've seen at least two more cases.
>
> On line 112 of crm/cib/io.c, fopen() is followed immediately by ftell()
> with no error checking.
>
> On line 332 of crm/pengine/ptest.c, fopen() is followed immediately by
> fprintf() with no error checking.